Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dpsoft: first submission #572

Merged
merged 21 commits into from
Feb 1, 2024
Merged

dpsoft: first submission #572

merged 21 commits into from
Feb 1, 2024

Conversation

dpsoft
Copy link
Contributor

@dpsoft dpsoft commented Jan 24, 2024

Check List:

  • Tests pass (./test.sh <username> shows no differences between expected and actual outputs)
  • All formatting changes by the build are committed
  • Your launch script is named calculate_average_<username>.sh (make sure to match casing of your GH user name) and is executable
  • Output matches that of calculate_average_baseline.sh
  • For new entries, or after substantial changes: When implementing custom hash structures, please point to where you deal with hash collisions (237)
  • Execution time: 11s
  • Execution time of reference implementation: 4min 30s
  • System specs: 8 Core i7 @ 2.80GHz, 16GB

@gunnarmorling
Copy link
Owner

Please run mvn verify and append the changes made by the formatter to the PR. Also please make sure that ./test.sh dpsoftpasses.

@dpsoft
Copy link
Contributor Author

dpsoft commented Jan 25, 2024

@gunnarmorling done!

@gunnarmorling
Copy link
Owner

There's a test failure, please see CI for details.

@dpsoft
Copy link
Contributor Author

dpsoft commented Jan 25, 2024

@gunnarmorling sorry, done!

@gunnarmorling
Copy link
Owner

Getting differences for the 1B rows file now:

Validating calculate_average_dpsoft.sh -- measurements_1B.txt
54c54
< Bloemfontein;-31.8;15.6;61.6
---
> Bloemfontein;-33.2;15.6;62.7
282c282
< Ouarzazate;-26.6;18.9;65.0
---
> Ouarzazate;-28.5;18.9;69.8
306c306
< Prague;-41.9;8.4;56.6
---
> Prague;-41.9;8.4;60.5
357c357
< Tamale;-19.3;27.9;74.7
---
> Tamale;-20.9;27.9;76.8
377c377
< Tromsø;-48.5;2.9;49.7
---
> Tromsø;-48.5;2.9;54.3
410c410
< Zanzibar City;-23.0;26.0;79.8
---
> Zanzibar City;-25.2;26.0;79.8

FAILURE: ./test.sh dpsoft measurements_1B.txt failed

@gunnarmorling
Copy link
Owner

Looking good for the regular 1B file now, but it fails with the 10K key set (see create_measurements3.sh):

java.nio.BufferUnderflowException
	at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:721)
	at java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:753)
	at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.hashAndRewind(CalculateAverage_dpsoft.java:157)
	at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.run(CalculateAverage_dpsoft.java:137)
	at java.base/java.lang.Thread.run(Thread.java:1583)

@gunnarmorling
Copy link
Owner

Running out of heap space now with 10K keys:

Validating calculate_average_dpsoft.sh -- measurements_10K_1B.txt
1c1,10000
< Terminating due to java.lang.OutOfMemoryError: Java heap space

@dpsoft
Copy link
Contributor Author

dpsoft commented Jan 28, 2024

@gunnarmorling It seems like the issue lies in the size of the segments, when running the code on my machine it throws an exception related to the size exceeding Integer.MAX_VALUE, while on the test machine, it fails with an "underflow" error.

I've fixed(i think so) the issue and conducted tests using the following approach:

1) ./create_measurements3.sh 1000000000
2) ./calculate_average_baseline.sh > baseline-3.out //changing the path to the 10K key set in the source code.
3) ./calculate_average_dposft.sh > dpost-3.out //changing the path to the 10K key set in the source code.

and finally:

4)  diff baseline-3.out dpsoft-3.out -> nothing!

@gunnarmorling
Copy link
Owner

Similar error as before. Note that test runs on 32 machines, you should be able to reproduce the issue by using that many chunks rather than relying on CPU count on your machine.

Using java version 21.0.2-graal in this shell.
Validating calculate_average_dpsoft.sh -- measurements_10K_1B.txt
Exception in thread "Thread-20" Exception in thread "Thread-27" Exception in thread "Thread-30" java.nio.BufferUnderflowException
	at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:721)
	at java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:753)
	at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.hashAndRewind(CalculateAverage_dpsoft.java:164)
	at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.run(CalculateAverage_dpsoft.java:144)
	at java.base/java.lang.Thread.run(Thread.java:1583)
java.nio.BufferUnderflowException
	at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:721)
	at java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:753)
	at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.hashAndRewind(CalculateAverage_dpsoft.java:164)
	at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.run(CalculateAverage_dpsoft.java:144)
	at java.base/java.lang.Thread.run(Thread.java:1583)
java.nio.BufferUnderflowException
	at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:721)
	at java.base/java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:753)
	at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.hashAndRewind(CalculateAverage_dpsoft.java:164)
	at dev.morling.onebrc.CalculateAverage_dpsoft$MeasurementExtractor.run(CalculateAverage_dpsoft.java:144)
	at java.base/java.lang.Thread.run(Thread.java:1583)

@dpsoft
Copy link
Contributor Author

dpsoft commented Jan 29, 2024

@gunnarmorling fixed!

@gunnarmorling
Copy link
Owner

Hum, hum. So all tests pass now, it also passes the 10K keyset, but it's way too fast, finishing in 100ms (current fastest one is 2.7 sec). I feel it may somehow skip portions if the file?

@dpsoft
Copy link
Contributor Author

dpsoft commented Jan 29, 2024

@gunnarmorling I'll take a look

@dpsoft
Copy link
Contributor Author

dpsoft commented Jan 29, 2024

@gunnarmorling Indeed was truncating the file because it was looking at the segment size instead of the file size.

@gunnarmorling
Copy link
Owner

Same thing as before, it seems to skip parts, 100ms for the 10K key set case:

Benchmark 1: timeout -v 300 ./calculate_average_dpsoft.sh 2>&1
  Time (mean ± σ):      98.1 ms ±   3.1 ms    [User: 144.0 ms, System: 31.2 ms]
  Range (min … max):    95.5 ms … 103.0 ms    5 runs

Summary
  dpsoft: trimmed mean 0.09729906015333334, raw times 0.09915400582,0.09552674882000001,0.09679519882,0.09594797582,0.10304930282000001

Leaderboard

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes     |
|---|-----------------|--------------------|-----|---------------|-----------|
|   | 00:00.097 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_dpsoft.java)| 21.0.2-graal | [Diego Parra](https://github.com/dpsoft) |  |

@dpsoft
Copy link
Contributor Author

dpsoft commented Jan 31, 2024

@gunnarmorling the file is created with ./create_measurements3.sh 1000000000? and using eight cores for the evaluation?

@gunnarmorling
Copy link
Owner

Looking good now: 00:06.392. Passing for 10K keyset too.

@gunnarmorling gunnarmorling merged commit bec0cef into gunnarmorling:main Feb 1, 2024
1 check passed
@dpsoft
Copy link
Contributor Author

dpsoft commented Feb 1, 2024

@gunnarmorling Thank you so much for the initiative. I would have liked to have had more free cpu cycles to improve my solution. Perhaps next time! :)

@gunnarmorling
Copy link
Owner

gunnarmorling commented Feb 1, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants