Can we avoid allocating a significant amount of memory when executing in a multithreaded environment when precision is changed? #23484
Unanswered · tmbrye asked this question in Performance Q&A · 1 comment, 5 replies
-
Reply excerpt: "This allocation is per thread that calls …"
-
We have run the Java onnxruntime in a multi-threaded environment for quite a while and had not noticed memory issues until we started trying very large HuggingFace embedding models. We read the model in, cache it in memory, and then multiple threads on the system call the run method to score their inputs (a minimal sketch of this setup follows). This works as expected for models that are converted to ONNX without changing the model's precision.
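For reference, a minimal sketch of that setup using the onnxruntime Java API; the model path, input name, shapes, and thread count below are placeholders, not the actual application code:

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedSessionScoring {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // Load the model once; OrtSession.run is documented as thread-safe,
        // so a single cached session is shared by all scoring threads.
        try (OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
             OrtSession session = env.createSession("model.onnx", opts)) {
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (int i = 0; i < 8; i++) {
                pool.submit(() -> {
                    try {
                        // Placeholder input; a real embedding model would take
                        // tokenized text (e.g. input_ids / attention_mask).
                        float[][] data = new float[1][128];
                        try (OnnxTensor input = OnnxTensor.createTensor(env, data);
                             OrtSession.Result result = session.run(Map.of("input", input))) {
                            // consume result ...
                        }
                    } catch (OrtException e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }
}
```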
We read the model into memory (in this case a 1.17 GB model) and it allocates the space it needs, which is around 8 GB. Then the threads execute the run method and memory may increase ever so slightly, but nothing out of the ordinary. However, when we convert this model using fp16 or int8 precision, the model file is obviously smaller, but upon execution it allocates the space it needs and then allocates a good amount more every time a new thread joins the process. This very quickly overwhelms the system with CPU and memory growth.
My question is: is it expected behavior that, when a model uses fp16 precision, each thread will essentially allocate another copy of the model? Are there any settings that might help mitigate this issue? Thanks in advance.
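A sketch of the session-level knobs that typically affect this kind of per-thread arena growth. The method names (addCPU, setMemoryPatternOptimization) are from the onnxruntime Java API; the run-config key comes from onnxruntime's run options config header, and whether RunOptions.addRunConfigEntry is exposed depends on the Java release, so verify against your version:

```java
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;

public class LowMemorySessionConfig {
    // Open a session with the CPU arena disabled, so per-thread scratch
    // memory is freed after each run instead of being pooled per thread.
    public static OrtSession openSession(String modelPath) throws OrtException {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
        opts.addCPU(false); // CPU execution provider without an arena allocator
        // Memory-pattern planning preallocates from observed shapes; with
        // variable-length inputs it can pin large high-water-mark buffers.
        opts.setMemoryPatternOptimization(false);
        return env.createSession(modelPath, opts);
    }

    // Alternatively, keep the arena but ask it to shrink after each run.
    // Key from onnxruntime_run_options_config_keys.h; requires a Java
    // release that exposes RunOptions.addRunConfigEntry.
    public static OrtSession.RunOptions shrinkingRunOptions() throws OrtException {
        OrtSession.RunOptions runOpts = new OrtSession.RunOptions();
        runOpts.addRunConfigEntry("memory.enable_memory_arena_shrinkage", "cpu:0");
        return runOpts;
    }
}
```

The shrink option is applied per call via the session.run(inputs, runOptions) overload.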
Just some added information: we were on 1.13.1 and moved to 1.16.3, and the behavior is the same. We have an older glibc, so we aren't able to move to newer versions unless we compile them ourselves.