Add Developer Applications Demo using Transformers Library #10
Conversation
Removed app_config.json; instructions to create the .json are added in Readme.md. Signed-off-by: Himanshu Upreti <[email protected]>
Remove cert.pem and key.pem and update Readme.md with instructions to generate them. Signed-off-by: Himanshu Upreti <[email protected]>
The review is not yet complete. Will review the rest later.
from typing import Dict, List, Optional, Union
from threading import Thread

from QEfficient.generation.cloud_infer import QAICInferenceSession
This import should be below the transformers import.
Read more here: https://peps.python.org/pep-0008/#imports
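For reference, a minimal sketch of the grouping PEP 8 suggests: standard library first, third-party (transformers) next, local QEfficient imports last. The transformers names shown are only illustrative of what the app might import.

# Standard library imports
from threading import Thread
from typing import Dict, List, Optional, Union

# Third-party imports
from transformers import AutoTokenizer, TextIteratorStreamer

# Local application imports
from QEfficient.generation.cloud_infer import QAICInferenceSession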
self.session = None
self.tokenizer = None
self.is_first_prompt = False
self.model_name = ""
Can this be replaced with line 66? Is there a purpose to assigning an empty string here?
self.tokenizer = None
self.is_first_prompt = False
self.model_name = ""
self.qpc_path = ""
Replace with lines 61-63?
app/app.py (Outdated)
def infer_prompt(msg, chat_history, task, model):
    global last_prompt, previous_current_ctx_len, last_state_generation_ids

    qeff_generator_model.curr_cache_index = 0
Is this required? The default is 0 anyway.
qeff_generator_model.generated_ids = []

if qeff_generator_model.curr_cache_index >= qeff_generator_model.ctx_len - 1:
    qeff_generator_model.curr_cache_index = 0
Why is this required?
app/app.py (Outdated)
if qeff_generator_model.curr_cache_index >= qeff_generator_model.ctx_len - 1:
    qeff_generator_model.curr_cache_index = 0

qeff_generator_model.curr_cache_index = 0
Why is curr_cache_index being reset to 0 again here?
except Exception as err:
    raise RuntimeError(f"Unable to load tokenizer, {err}")

if streamer:
Handle the else case and raise an error.
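A minimal sketch of the suggested else branch; the error type and message are placeholders, not the app's actual handling.

if streamer:
    ...  # existing streaming generation path
else:
    # Fail loudly instead of silently skipping generation when no streamer was created
    raise RuntimeError("Streamer is not initialized; cannot stream generated tokens.")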
hf_token = None
if os.getenv("HF_TOKEN") is not None:
    hf_token = os.getenv('HF_TOKEN')
tokenizer = AutoTokenizer.from_pretrained(
Have you tested this on the models present in the tests? I have seen it fail with a config-file-not-found error from Hugging Face; the workaround is to use hf_download (check the code in QEfficient/cloud/infer.py::80-87).
You can allow only the tokenizer file patterns, since you don't need the other model files for the tokenizer.
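As a rough illustration of restricting the download to tokenizer files, a sketch using huggingface_hub.snapshot_download with an allow_patterns filter; the model card name and the patterns are assumptions, and the project's own hf_download helper may wrap this differently.

from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

model_card_name = "gpt2"  # example model card, assumed for illustration

# Fetch only tokenizer-related files, skipping the model weights
model_hf_path = snapshot_download(
    repo_id=model_card_name,
    allow_patterns=["*.json", "*.txt", "*.model", "tokenizer*"],
)
tokenizer = AutoTokenizer.from_pretrained(model_hf_path, padding_side="left")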
Yes, hf_download is used unnecessarily; even in places where we only need the tokenizer, we still download all the files. @ochougul please raise an issue.
Here we download all the files while only needing the tokenizer, on the assumption that the user already has all the files in cache_dir, in which case the files won't be downloaded again; that is how this API is supposed to be used.
If you want to change it to download only the tokenizer files when only the tokenizer is needed, you are welcome to update it. That is definitely a better way to do this.
You can raise the issue internally yourself, no need to ask anybody. Thanks.
QEfficient/cloud/infer.py (Outdated)
@@ -60,6 +60,7 @@ def main(
    device_group: List[int] = [
        0,
    ],
    execute : bool = True
This functionality exists in QEfficient/cloud/execute.py::main; it is not required to add it here.
time.sleep(0.07)

# calling infer api directly to get qpc_path
app_config[task][model_name]['qpc_path'] = infer_api(
Please use the QEfficient/cloud/execute.py::main function, or directly use the latency_kv_stats function.
As I understand it, you want to compile and then execute. You can use from QEfficient.cloud.compile import main as compile and use it the same way it is used in infer.py. Let's not update the infer.py file for this.
There is no way a customer will use an execute option inserted into the infer API via the command line, and infer.py is supposed to be a CLI API.
If you want, you can create a utils function called compile_and_execute, put it inside QEfficient/cloud/utils.py, and then change the infer API to use that same function.
It's your choice.
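A rough sketch of what such a compile_and_execute helper might look like; the entry points imported below and their keyword arguments are assumptions about the QEfficient CLI modules, used only to illustrate the suggested structure.

from QEfficient.cloud.compile import main as compile_main
from QEfficient.cloud.execute import main as execute_main


def compile_and_execute(model_name, onnx_path, qpc_dir_path, device_group, prompt):
    # Compile the exported ONNX model into a QPC (argument names are placeholders)
    compile_main(onnx_path=onnx_path, qpc_path=qpc_dir_path, device_group=device_group)
    # Run the compiled QPC on the Cloud AI 100 device
    execute_main(model_name=model_name, qpc_path=qpc_dir_path, device_group=device_group, prompt=prompt)
    return qpc_dir_path

Both the app and the infer API could then call this one helper instead of duplicating the compile-then-execute flow.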
@anujgupt-github @vbaddi I believe we need to discuss the requirements for high-level and low-level APIs and their use cases. Here the use case is not compile-and-execute. We can take https://github.com/daquexian/onnx-simplifier as an example to understand a top-level CLI API versus a top-level programmable API.
Please come up with a design if you have a better solution; happy to discuss it.
Thanks for sharing the onnx-simplifier link.
I agree that we need an API that goes end-to-end from the HF model through execution, but that doesn't mean we should destroy code scalability for it.
    ignore_patterns=["*.txt", "*.onnx", "*.ot", "*.md", "*.tflite", "*.pdf"],
)
tokenizer = AutoTokenizer.from_pretrained(model_hf_path, use_cache=True, padding_side="left")
if hf_token is not None:
This code (lines 83-90) can also be moved under the main function; pass model_hf_path as an input to infer_api, and return qpc_path as the output.
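A minimal sketch of that restructuring; the download helper and the infer_api argument list are assumptions about the app code, shown only to illustrate the suggested flow.

from transformers import AutoTokenizer


def main(model_name: str):
    # Resolve the local HF snapshot path once in the entry point (download_model_files is a hypothetical helper)
    model_hf_path = download_model_files(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_hf_path, use_cache=True, padding_side="left")
    # infer_api then only compiles the model and hands back the qpc path
    qpc_path = infer_api(model_hf_path)
    return tokenizer, qpc_path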
Why should the user download the hf_model files from the model card and provide you the model_hf_path? Please keep in mind that this is a high-level API for the user.
Not planned to be revived at the moment; will revisit if really needed.
Developer Applications on Cloud AI 100 using Transformers Library