Optimized Acoustics@home app

Message boards : Number crunching : Optimized Acoustics@home app

[B@P] Daniel

Joined: 31 Mar 17
Posts: 10
Credit: 4,726,894
RAC: 573
Message 313 - Posted: 13 Jan 2019, 17:02:22 UTC

I have created an optimized Acoustics@home app. There are 5 versions:

- SSE2 - it allows processing 2 matrix rows at once;
- SSE4.1 - it added the blend operation, which allows a more efficient implementation of the conditional parts of the code;
- AVX - it added longer vector registers and new floating-point instructions that use them, so 4 rows can be processed at once;
- AVX2 - it added instructions for integer operations on AVX registers. The app uses them too, so it no longer needs SSE instructions for integer work. Additionally, AVX2 CPUs do not need the workaround for slow unaligned loads, which also improves performance;
- AVX512 - it added new, longer vector registers, so 8 rows can be processed at once. It also added new mask registers, which allowed further optimization of the SSE and AVX code parts (they are always used, as the number of rows is usually not divisible by 8).
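
To illustrate what a blend does for conditional code, here is a minimal Python sketch that models vector lanes as list elements. It is not the app's code: the function names and the "running minimum with index" example are hypothetical, chosen only to show the idea of computing a mask with a vector compare and then selecting per lane instead of branching.

```python
# Sketch only: SIMD lanes are modelled with plain Python lists.
# All names here are hypothetical and not taken from the app.

def blend(mask, if_true, if_false):
    """Per-lane select: if_true[i] where mask[i] is set, otherwise if_false[i]."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]

def lanewise_min_with_index(best, best_idx, candidate, candidate_idx):
    """Update a running minimum and its index in every lane at once."""
    mask = [c < b for c, b in zip(candidate, best)]   # models a vector compare
    new_best = blend(mask, candidate, best)           # select values per lane
    new_idx = blend(mask, candidate_idx, best_idx)    # select indices per lane
    return new_best, new_idx

# Two lanes, as in the SSE2 build (2 rows at once).
print(lanewise_min_with_index([3.0, 1.0], [0, 0], [2.0, 5.0], [1, 1]))
```

In real SIMD code both blend calls would map to single instructions, so every lane follows the same instruction stream whatever the comparison result is.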

Here are the results from a Xeon E5-2683 v3 (Haswell):

Original  39,788
SSE2      21,283
SSE4.1    20,943
AVX       19,658
AVX2      19,043


And these are from a Xeon W-2102. This machine had some apps running in the background, so the times are longer:

Original-Linux    51,988
Original-Windows  46,707
SSE2              30,681
SSE4.1            28,680
AVX               24,659
AVX2              23,882
AVX512            21,569


On Linux the optimized apps are about 2 times faster than the original ones. On Windows the gain is smaller, as the original Windows apps were faster than the Linux ones.
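
The speedups can be checked directly from the tables. The short Python sketch below does the arithmetic, assuming the comma in the figures is a decimal separator (the ratios are the same if it is a thousands separator) and that the single set of optimized W-2102 times is compared against both originals.

```python
# Times are copied from the tables above. Reading "," as a decimal
# separator is an assumption; it does not change the ratios.
def to_number(text):
    return float(text.replace(",", "."))

def speedup(original, optimized):
    """How many times faster the optimized run is than the original one."""
    return to_number(original) / to_number(optimized)

results = {
    "E5-2683 v3: original vs AVX2": speedup("39,788", "19,043"),
    "W-2102: Linux original vs AVX512": speedup("51,988", "21,569"),
    "W-2102: Windows original vs AVX512": speedup("46,707", "21,569"),
}

for name, ratio in results.items():
    print(f"{name}: {ratio:.2f}x")
```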

Here are results for 32-bit apps:

Win original  28,975
Win no SSE    33,722
Win SSE2      20,518
Win SSE4.1    19,826

Linux original  77,991
Linux no SSE    25,697
Linux SSE2      23,237
Linux SSE4.1    22,998


This was tested on 64-bit CPUs with AVX support, so results from old 32-bit CPUs may differ. It looks like on Windows the original 32-bit app without SSE support, compiled with MSVC, is faster than my non-SSE build (I compile with gcc 8.2.0). On Linux the new non-SSE app seems to be a lot faster. I suspect that the code is better optimized for the superscalar capabilities of new CPUs, so it runs faster there. I also suspect that inlining of small functions reduced register spilling, which improved speed. It is hard to tell how it will run on old 32-bit CPUs. The new app will probably be faster there than the original one.

The optimized app can be downloaded from GitHub: https://github.com/sirzooro/Acoustics-at-home/releases/tag/opti1.0. There are multiple app versions, compiled with support for different instruction sets. If you are not sure what your CPU supports, use CPU-Z on Windows, or check "flags" in the /proc/cpuinfo file on Linux.
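
For Linux, the "flags" check can be scripted. The following Python sketch picks the most capable build whose flag appears in /proc/cpuinfo. The flag names are the ones the Linux kernel prints; mapping the AVX512 build to "avx512f" alone is an assumption, since the post does not say which AVX512 subsets the app needs, and the function names are made up for this example.

```python
# Sketch: choose a build from the "flags" line of /proc/cpuinfo.
# Assumption: the AVX512 build only needs the "avx512f" flag.
BUILDS = [  # most capable first
    ("AVX512", "avx512f"),
    ("AVX2", "avx2"),
    ("AVX", "avx"),
    ("SSE4.1", "sse4_1"),
    ("SSE2", "sse2"),
]

def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()

def best_build(cpuinfo_text):
    """Return the name of the best matching build, or None if none matches."""
    flags = cpu_flags(cpuinfo_text)
    for name, flag in BUILDS:
        if flag in flags:
            return name
    return None

def best_build_for_this_machine(path="/proc/cpuinfo"):
    """Convenience wrapper that reads the real file (Linux only)."""
    with open(path) as f:
        return best_build(f.read())
```

Calling best_build_for_this_machine() on a Linux box prints nothing by itself; wrap it in print() to see the suggestion. It only looks at CPU flags, so the app's own startup check for OS support still has the final word.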

In order to install this app, perform these steps:
- close BOINC (a config reload will not work);
- unpack the archive to the project directory - on Windows it is a path like "C:\Users\All Users\BOINC\projects\www.acousticsathome.ru_boinc", on Linux /var/lib/boinc/projects/www.acousticsathome.ru_boinc/. On Linux please also make sure that the cambala_boinc_app file is executable, and that all files in this dir are owned by the boinc/boinc user/group;
- start BOINC again.
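
The Linux checks from the second step can be automated. Here is a small Python sketch that reports the two problems mentioned above: a non-executable app file and files with the wrong owner. The function name and its parameters are hypothetical, and it only compares the owning user, not the group.

```python
import os
import stat

def check_install(project_dir, app_name="cambala_boinc_app", owner_uid=None):
    """Return a list of problems found; an empty list means everything looks fine."""
    problems = []
    app_path = os.path.join(project_dir, app_name)
    if not os.path.isfile(app_path):
        problems.append(f"{app_name} not found in {project_dir}")
    elif not os.stat(app_path).st_mode & stat.S_IXUSR:
        # The owner's execute bit is missing.
        problems.append(f"{app_name} is not executable")
    if owner_uid is not None:
        # Optional ownership check for every entry in the directory.
        for name in sorted(os.listdir(project_dir)):
            if os.stat(os.path.join(project_dir, name)).st_uid != owner_uid:
                problems.append(f"{name} is not owned by uid {owner_uid}")
    return problems
```

A typical call would pass the Linux project directory given above and owner_uid=pwd.getpwnam("boinc").pw_uid, assuming the BOINC user is named boinc as in the post.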

After doing this, you should see an entry for Acoustics@home in the event log like "Found app_info.xml; using anonymous platform". Additionally, you should see (Opti v1.0) in the app name displayed in BOINC Manager.

All app versions check whether the CPU and OS support the required instruction sets. If they do not, the app will print an appropriate error message and exit with code 1.

The AVX/AVX2 app versions require at least Windows 7 SP1, Windows Server 2008 R2 SP1 or Linux with kernel 2.6.30.
The AVX512 app versions require at least Windows 10, Windows Server 2016 or Linux with kernel 3.15. I am not sure about the Windows versions; you can check whether earlier versions can run it too.
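
On Linux the kernel requirement can be compared against the running kernel with a few lines of Python. This is only a sketch: the minimum versions are the ones quoted above, the function names are made up, and distribution kernels with backported features may work even when this simple version comparison says otherwise.

```python
import re

# Minimum Linux kernel versions quoted in the post above.
MIN_KERNEL = {"AVX": (2, 6, 30), "AVX2": (2, 6, 30), "AVX512": (3, 15, 0)}

def parse_kernel_release(release):
    """Turn a release string such as '4.15.0-43-generic' into (4, 15, 0)."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # Missing minor or patch parts are treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))

def kernel_ok(build, release):
    """True if the kernel release meets the minimum for the given build."""
    required = MIN_KERNEL.get(build, (0, 0, 0))  # SSE builds: no minimum listed
    return parse_kernel_release(release) >= required
```

On a live system the release string can be taken from platform.release(), for example kernel_ok("AVX512", platform.release()).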
ID: 313 · Rating: 0
Oleg Zaikin
Project administrator
Project developer
Project tester
Project scientist

Joined: 28 Mar 17
Posts: 116
Credit: 1,601,711
RAC: 0
Message 318 - Posted: 21 Jan 2019, 19:27:11 UTC - in response to Message 313.  

> I have created an optimized Acoustics@home app. There are 5 versions:


Daniel, you have done a great job! Thank you!
ID: 318 · Rating: 0
Trotador

Joined: 1 Jul 18
Posts: 2
Credit: 9,520,518
RAC: 33,476
Message 322 - Posted: 22 Jan 2019, 20:05:03 UTC

I'm using the optimized application, the AVX512 one, on a Linux dual Xeon Platinum system. Crunching times are great compared to the standard application. Very good job!

However, the oddity is that the system is pulling much less power than usual, even though AVX512 instructions are supposed to push the power consumption up. The CPUs have downclocked, which is expected with AVX, but the overall power is much lower. With Asteroids, for example, the power consumption of this system increases by around 15% compared with crunching other projects that do not use AVX.

Any idea?
ID: 322 · Rating: 0
[B@P] Daniel

Joined: 31 Mar 17
Posts: 10
Credit: 4,726,894
RAC: 573
Message 323 - Posted: 22 Jan 2019, 23:51:09 UTC - in response to Message 322.  

> I'm using the optimized application, the AVX512 one, on a Linux dual Xeon Platinum system. Crunching times are great compared to the standard application. Very good job!
>
> However, the oddity is that the system is pulling much less power than usual, even though AVX512 instructions are supposed to push the power consumption up. The CPUs have downclocked, which is expected with AVX, but the overall power is much lower. With Asteroids, for example, the power consumption of this system increases by around 15% compared with crunching other projects that do not use AVX.
>
> Any idea?

Hmm, interesting. AVX512 causes bigger downclocking than AVX/AVX2. Additionally, this app uses only one heavy instruction per loop iteration (a double-precision divide), plus 2 double-precision subtractions, a double-precision compare, a double-precision minimum and an integer addition. There are also data dependencies between most of them, so the CPU cannot execute them in parallel. I suspect that the Asteroids app may use more heavy instructions (division, multiplication) and have fewer data dependencies, so it can use the CPU more efficiently.

BTW, the Asteroids app has only an AVX version; there is no AVX512 version.
ID: 323 · Rating: 0
Trotador

Joined: 1 Jul 18
Posts: 2
Credit: 9,520,518
RAC: 33,476
Message 324 - Posted: 23 Jan 2019, 5:35:35 UTC

OK, thanks for the explanation. So the summary for the optimized application on this system is around 2.5 times more points and 25% lower power consumption :)
ID: 324 · Rating: 0
Hal Bregg

Joined: 31 Oct 18
Posts: 3
Credit: 175,182
RAC: 616
Message 325 - Posted: 23 Jan 2019, 20:41:51 UTC
Last modified: 23 Jan 2019, 20:45:03 UTC

After updating BOINC, I noticed that the WU completion time changed from about 1 hr 50 min to 5 hrs.

This is happening on an AMD Athlon II X2 220 using the SSE2 version of the optimized app.
ID: 325 · Rating: 0
[B@P] Daniel

Joined: 31 Mar 17
Posts: 10
Credit: 4,726,894
RAC: 573
Message 326 - Posted: 23 Jan 2019, 21:41:08 UTC - in response to Message 325.  

> OK, thanks for the explanation. So the summary for the optimized application on this system is around 2.5 times more points and 25% lower power consumption :)

There is one more possible explanation. When I worked with a sample WU, I found that there were from 1 to 10 rows available to process at once. Every case occurred a similar number of times, except for 10, which was at about 60% of the others (if I remember correctly). This means that the CPU usually executes code optimized for AVX, SSE and scalar values (which processes 4, 2 and 1 rows at once). However, the CPU stays downclocked for a while after executing the AVX512 part, which also affects the AVX/SSE/scalar parts.
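
A quick way to see how rarely the 8-wide path runs is to split each possible row count into chunks. The Python sketch below assumes a simple greedy split, widest chunk first; the real app may divide the rows differently (for example with masked operations), so treat the function and its output as an illustration only.

```python
def split_rows(n_rows, widths=(8, 4, 2, 1)):
    """Greedily cover n_rows with the widest chunks first (assumed strategy)."""
    chunks = []
    for width in widths:
        while n_rows >= width:
            chunks.append(width)
            n_rows -= width
    return chunks

# Row counts 1..10, as observed in the sample WU.
for n in range(1, 11):
    print(n, split_rows(n))
```

Under this assumption only the cases with 8, 9 or 10 rows touch the 8-wide AVX512 path; all the others are handled entirely by the 4-, 2- and 1-row code.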

> After updating BOINC, I noticed that the WU completion time changed from about 1 hr 50 min to 5 hrs.
>
> This is happening on an AMD Athlon II X2 220 using the SSE2 version of the optimized app.

This is normal until BOINC learns how much time a WU needs to complete. After 1 day you should see the proper time.
ID: 326 · Rating: 0
Hal Bregg

Joined: 31 Oct 18
Posts: 3
Credit: 175,182
RAC: 616
Message 327 - Posted: 24 Jan 2019, 9:49:25 UTC - in response to Message 326.  

> > After updating BOINC, I noticed that the WU completion time changed from about 1 hr 50 min to 5 hrs.
> >
> > This is happening on an AMD Athlon II X2 220 using the SSE2 version of the optimized app.
>
> This is normal until BOINC learns how much time a WU needs to complete. After 1 day you should see the proper time.

Thanks Daniel.
ID: 327 · Rating: 0
AquaBoy

Joined: 20 Feb 18
Posts: 1
Credit: 715,638
RAC: 8,844
Message 330 - Posted: 25 Jan 2019, 5:59:51 UTC

Daniel, thank you so much! The speed of computations on Linux has doubled.
ID: 330 · Rating: 0
Simplehuman

Joined: 16 Dec 18
Posts: 1
Credit: 1,320,186
RAC: 19,492
Message 331 - Posted: 25 Jan 2019, 12:28:43 UTC
Last modified: 25 Jan 2019, 12:32:17 UTC

It is awesome, Daniel! My computing time is now 19 minutes per task on Linux x64!
But I was a bit disappointed that the Intel Core i7-8700K doesn't support AVX512 :)
ID: 331 · Rating: 0
Duce H K

Joined: 30 Sep 17
Posts: 3
Credit: 508,956
RAC: 2,335
Message 338 - Posted: 2 Feb 2019, 18:08:26 UTC
Last modified: 2 Feb 2019, 18:11:14 UTC

Look into app_info.xml ... This is the first time I have used exe spoofing. But it's legal :)
Trying to launch AVX2 on Windows, on an OC'ed Xeon v3 ES.
P.S. I bet the AVX version is the most downloaded =)
ADD
02.02.2019 22:53:09 | Acoustics@home | Found app_info.xml; using anonymous platform
ID: 338 · Rating: 0
RFGuy_KCCO

Joined: 13 Apr 17
Posts: 1
Credit: 10,682,721
RAC: 145,497
Message 339 - Posted: 4 Feb 2019, 18:08:11 UTC

Any idea which app is the fastest on AMD Ryzens and Threadrippers? SSE4.1 or AVX2?
ID: 339 · Rating: 0
biodoc

Joined: 22 Apr 18
Posts: 3
Credit: 1,042,104
RAC: 3,738
Message 340 - Posted: 4 Feb 2019, 20:36:16 UTC
Last modified: 4 Feb 2019, 20:37:27 UTC

My Ryzen 2700X running 64-bit Linux:

Task completion times:
project app: 3250 seconds
SSE2 opt app: ~1380 seconds
AVX opt app: ~1080 seconds
AVX2 opt app: ~1080 seconds

I settled on the AVX app since AVX2 was maybe a bit slower, but I'd need stats to confirm it. I'm assuming the SSE4.1 app would not be any faster than the AVX version. Is that a reasonable assumption?

Thanks for providing the optimized apps Daniel. Great job!
ID: 340 · Rating: 0
Bryan

Joined: 20 Jan 17
Posts: 3
Credit: 17,646,879
RAC: 35,809
Message 341 - Posted: 5 Feb 2019, 15:30:44 UTC - in response to Message 339.  
Last modified: 5 Feb 2019, 15:31:42 UTC

> Any idea which app is the fastest on AMD Ryzens and Threadrippers? SSE4.1 or AVX2?


AVX2 by a wide margin :) At least on a 2990WX.
ID: 341 · Rating: 0
