[PR #231] [CLOSED] New simd functions (GetMask variants (i8, i16, i32, f32) and equality operator (i8) for SSE2 and NEON). #252

opened 2026-05-05 03:44:01 -06:00 by gitea-mirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ultimatepp/ultimatepp/pull/231
Author: @ismail-yilmaz
Created: 2/7/2025
Status: ❌ Closed

Base: master ← Head: new_simd_functions

📝 Commits (5)

d62bf06 Core/SSE2: GetMask function variants (i8,i16,i32, f32) are added.
eb8d6a7 Core/NEON: GetMask function variants (i8,i16,i32, f32) are added.
32b4c24 Core/SIMD: Equality operator (==) support for i8x16 type (NEON & SSE2)
6301841 autotest: SIMD test code for GetMask (i8, i16, i32, f32) variants.
00d7e59 autottest: Etalog log for SIMD3 test is added.

📊 Changes

5 files changed (+155 additions, -0 deletions)

➕ autotest/SIMD3/Etalon.log (+11 -0)
➕ autotest/SIMD3/SIMD3.cpp (+66 -0)
➕ autotest/SIMD3/SIMD3.upp (+10 -0)
📝 uppsrc/Core/SIMD_NEON.h (+53 -0)
📝 uppsrc/Core/SIMD_SSE2.h (+15 -0)

📄 Description

This PR adds mask-extraction functions that appear to be missing from the U++ SIMD layer: GetMaski8x16(), GetMaski16x8(), GetMaski32x4(), and GetMaskf32x4(). It also adds an equality operator (==) for the i8x16 type, for both SSE2 and NEON.

These functions make several common operations easy to express with SIMD instructions: counting, accumulating, and determining positions in arrays (e.g., they can be used to vectorize string/byte searches).
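To illustrate the "determining positions" use case, here is a minimal sketch of a byte search driven by a 16-lane comparison mask. It does not use the U++ API: `ByteEqMask16`, `LowestSetBit`, and `FindByte` are hypothetical names introduced only for this example. The sketch assumes that bit i of the mask corresponds to byte i of the block, which is how the SSE2 `_mm_movemask_epi8` intrinsic behaves; whether the functions in this PR are implemented that way is an assumption here.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper (not part of U++): returns a 16-bit mask in which
// bit i is set when block[i] == needle.
inline unsigned ByteEqMask16(const unsigned char* block, unsigned char needle)
{
#ifdef SKETCH_HAVE_SSE2
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(block));
    __m128i q = _mm_set1_epi8(static_cast<char>(needle));
    return static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(v, q)));
#else
    // Scalar fallback that produces the same mask layout.
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= static_cast<unsigned>(block[i] == needle) << i;
    return mask;
#endif
}

// Index of the lowest set bit; the mask must be non-zero.
inline int LowestSetBit(unsigned mask)
{
    assert(mask != 0);
    int i = 0;
    while (!(mask & 1u)) {
        mask >>= 1;
        i++;
    }
    return i;
}

// Returns the index of the first occurrence of needle in [p, p + len),
// or -1 if it does not occur.
inline std::ptrdiff_t FindByte(const unsigned char* p, std::size_t len,
                               unsigned char needle)
{
    std::size_t i = 0;
    for (; i + 16 <= len; i += 16) {           // full 16-byte blocks
        unsigned mask = ByteEqMask16(p + i, needle);
        if (mask)
            return static_cast<std::ptrdiff_t>(i) + LowestSetBit(mask);
    }
    for (; i < len; i++)                       // remaining tail bytes
        if (p[i] == needle)
            return static_cast<std::ptrdiff_t>(i);
    return -1;
}
```

The same mask also serves counting (take its population count) and accumulation (use the set bits to select lanes), so one comparison feeds all three patterns.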

A simple example is the reference/StreamGetSzPointer example. A vectorized version of it could look as follows:

int CountLinesOptimizedSIMD(Stream& s) {
	int n = 0;

	for (;;) {
		int sz;
		const byte* p = s.GetSzPtr(sz);

		if (sz) {
			const byte* e = p + sz;
			const byte* e16 = p + (sz & ~15);  // Process in 16-byte chunks
			i8x16 q = i8all('\n');
			while(p < e16) {
				int mask = GetMaski8x16((i8x16(p) == q));
				n += __builtin_popcount(mask); // Unfortunately, this seems to be CLANG/GCC specific.
				p += 16;
			}

			// Process remaining bytes (less than 16)
			while (p < e) {
				n += (*p++ == '\n');
			}
		}
		else {
			int c = s.Get();
			if (c < 0)
				return n;
			n += (c == '\n');
		}
	}

	return n;
}
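The code comment above notes that `__builtin_popcount` is specific to Clang and GCC. One possible way around that, sketched below, is a small wrapper that prefers C++20 `std::popcount`, then the compiler builtin, then a plain bit-twiddling fallback. `PopCount32` is a hypothetical name, not an existing U++ function, and nothing in the PR says it takes this route.

```cpp
#include <cstdint>

#if defined(__cplusplus) && __cplusplus >= 202002L
#include <bit>   // std::popcount (C++20)
#endif

// Hypothetical helper (not part of U++): number of set bits in x.
inline int PopCount32(std::uint32_t x)
{
#if defined(__cplusplus) && __cplusplus >= 202002L
    return std::popcount(x);
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_popcount(x);
#else
    // Portable fallback: sum bits in parallel within 2-, 4- and 8-bit fields,
    // then add the four byte sums with one multiplication.
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return static_cast<int>((x * 0x01010101u) >> 24);
#endif
}
```

In the loop above, `n += __builtin_popcount(mask);` would then become `n += PopCount32(mask);`, with the rest unchanged.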

The results with a ~2 GiB file on a Ryzen 5600 with 16 GiB RAM, built with Clang, are as follows (note that the scalar version was changed to process 16-byte chunks with 16 newline checks per iteration, to give the compiler an opportunity to vectorize):

CountLines(in) = 59150432
Simple 1.500799 s
CountLinesOptimized(in) = 59150432
Optimized 717.475 ms
CountLinesOptimizedSIMD(in) = 59150432
Optimized with SIMD 616.997 ms

Admittedly, this is not definitive: GCC can vectorize the scalar code better, and in fact it does. However, the aim of this example is to show how the usage pattern can be simplified using masks and the equality operator for the i8x16 data type.
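For context, a scalar loop shaped for auto-vectorization, as described for the comparison run, might look roughly like the sketch below. This is not the actual benchmark code: `CountNewlinesScalar16` is a hypothetical name, and it works on a plain buffer rather than a U++ Stream so that it stays self-contained.

```cpp
#include <cstddef>

// Hypothetical stand-in for the scalar comparison point: counts '\n' bytes
// in [p, p + len), processing 16 bytes per outer iteration.
inline int CountNewlinesScalar16(const unsigned char* p, std::size_t len)
{
    int n = 0;
    const unsigned char* e = p + len;
    const unsigned char* e16 = p + (len & ~static_cast<std::size_t>(15));

    while (p < e16) {
        // Fixed trip count and no branches: a shape that compilers can
        // often turn into SIMD compares on their own.
        for (int i = 0; i < 16; i++)
            n += (p[i] == '\n');
        p += 16;
    }

    while (p < e)                 // remaining bytes (fewer than 16)
        n += (*p++ == '\n');

    return n;
}
```

Whether a given compiler actually vectorizes the inner loop depends on its version and optimization flags, which is why the timings above can differ between Clang and GCC.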

There is also an autotest for the patch.

Please check.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
