Code vector store #6

Status: Open. Wants to merge 68 commits into base branch `code_exp`.

Commits (68):
e1bbd4d
Update meeting_log.md
sm1lla Dec 5, 2024
b19221d
log avagere scores for individual experiments
sm1lla Dec 10, 2024
9dc2c67
Update meeting_log.md
elenagensch Dec 10, 2024
bfaea4f
Make group team default entity to log wandb experiments
elenagensch Dec 10, 2024
3958d86
Merge pull request #7 from hpi-sam/code_exp
elenagensch Dec 10, 2024
b632923
experiments
joh-dah Dec 10, 2024
a13e686
add RddAttributeMatcher to linter
juliuspor Dec 10, 2024
43d0ffe
Merge branch 'main' into adjust_linter_prompt
joh-dah Dec 10, 2024
ff42568
Delete src/vector_store/code_scraping
elenagensch Dec 10, 2024
10ded63
Use model temperature from config file
sm1lla Dec 11, 2024
564a1e1
Update meeting_log.md
elenagensch Dec 12, 2024
1f891a1
Update meeting_log.md
elenagensch Dec 12, 2024
fdf45b9
small experiments
joh-dah Dec 12, 2024
6c74142
changed to iterate
joh-dah Dec 12, 2024
375d482
config
joh-dah Dec 12, 2024
56750c3
Merge branch 'main' into adjust_linter_prompt
joh-dah Dec 12, 2024
eb25c0e
Use pickle format for incorrectly saved examples
sm1lla Dec 12, 2024
68eb425
run experiments
joh-dah Dec 12, 2024
8f20742
black and experiments
joh-dah Dec 16, 2024
58409c7
Update meeting_log.md
joh-dah Dec 17, 2024
57996bd
Merge pull request #8 from hpi-sam/cleaned_examples
sm1lla Dec 17, 2024
5714342
refetch context from vector store
sm1lla Dec 17, 2024
5f09d97
Implemented automatic prompt generation
joh-dah Dec 17, 2024
ff364a2
Merge branch 'main' into adjust_linter_prompt
joh-dah Dec 17, 2024
14bdb04
add configuration for api ref rag
sm1lla Dec 17, 2024
80c0e29
run embedding models on gpu
cdfhalle Dec 18, 2024
38b0b78
Merge branch 'main' of github.com:hpi-sam/expedite-databricks-connect
cdfhalle Dec 18, 2024
92a0b7f
remove git_access_token
cdfhalle Dec 18, 2024
beb7e3f
clear messages in assistant
sm1lla Dec 19, 2024
e044f83
Make embedding model configurable for api reference rag
sm1lla Dec 19, 2024
e72a46d
Merge pull request #9 from hpi-sam/make_embedding_models_configurable
sm1lla Dec 19, 2024
34b1963
Merge branch 'main' into api_ref
sm1lla Dec 19, 2024
8a5f2ba
save api_ref documents
sm1lla Dec 19, 2024
89a2c17
Adjust retrieval prompt for nv-embed
sm1lla Jan 6, 2025
dd2448f
change system prompt and meta prompt
joh-dah Jan 6, 2025
5175b2c
Merge branch 'main' into adjust_linter_prompt
joh-dah Jan 6, 2025
bfe898f
black
joh-dah Jan 6, 2025
cd5424a
delete unused function
joh-dah Jan 6, 2025
56ef011
adjust config
joh-dah Jan 7, 2025
01ba5d0
fix non generative prompting
joh-dah Jan 7, 2025
b31d9ca
Merge pull request #10 from hpi-sam/adjust_linter_prompt
joh-dah Jan 7, 2025
6360225
Cleanup config
sm1lla Jan 7, 2025
ec47b24
cleanup main
sm1lla Jan 7, 2025
94be99b
Merge branch 'api_ref'
sm1lla Jan 7, 2025
44a0120
use model name from config
sm1lla Jan 7, 2025
1215a85
Update meeting_log.md
juliuspor Jan 9, 2025
e8a77e4
try to add metrics
joh-dah Jan 9, 2025
324dc22
Merge branch 'main' into analyze_iterations
joh-dah Jan 9, 2025
cd32488
Fix mixedRDD example
sm1lla Jan 9, 2025
94731a3
Update meeting_log.md
juliuspor Jan 9, 2025
68541b5
Update meeting_log.md
juliuspor Jan 9, 2025
a03d148
add metrics
joh-dah Jan 9, 2025
e49ff6b
Merge branch 'main' into analyze_iterations
joh-dah Jan 9, 2025
2a41cb6
Update meeting_log.md
juliuspor Jan 9, 2025
43dfd49
Update meeting_log.md
juliuspor Jan 14, 2025
2b13270
final config
joh-dah Jan 14, 2025
27e04c8
Merge branch 'main' into analyze_iterations
joh-dah Jan 14, 2025
5aa7976
Merge pull request #11 from hpi-sam/analyze_iterations
joh-dah Jan 14, 2025
dfd2d38
Move assistant to seperate file
elenagensch Jan 14, 2025
63034fe
idk
cdfhalle Jan 14, 2025
c23478a
forgot something
cdfhalle Jan 14, 2025
ee11de9
add serving script for Qwencoder
cdfhalle Jan 14, 2025
c1e0271
Merge branch 'main' of github.com:hpi-sam/expedite-databricks-connect…
cdfhalle Jan 14, 2025
36fd976
fix merging error
cdfhalle Jan 16, 2025
114c846
Merge pull request #12 from hpi-sam/add_wandb_promt_logging
elenagensch Jan 16, 2025
b41680d
Update meeting_log.md
elenagensch Jan 16, 2025
a73af39
fix logging
cdfhalle Jan 16, 2025
8b9821c
Merge pull request #13 from hpi-sam/add_wandb_promt_logging
cdfhalle Jan 16, 2025
2 changes: 2 additions & 0 deletions environment.yml
@@ -296,3 +296,5 @@ dependencies:
variables:
HF_HOME: /raid/shared/masterproject2024/huggingface/
CODE_DATA_FILE_PATH: /raid/shared/masterproject2024/rag/data/code.json
VLLM_BASE_URL: "http://localhost:8000/v1"
OPENAI_API_KEY: "token-abc123"
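These variables suggest the model is served through vLLM behind an OpenAI-compatible endpoint, with a placeholder token as the API key. A minimal sketch of how a client might pick them up; the `client_settings` helper is hypothetical, not part of the repository:

```python
import os

# Mirror the variables defined in environment.yml above (the token is the
# placeholder value from the file, not a real credential).
os.environ["VLLM_BASE_URL"] = "http://localhost:8000/v1"
os.environ["OPENAI_API_KEY"] = "token-abc123"

def client_settings() -> dict:
    """Collect the connection settings an OpenAI-compatible client
    (for example the `openai` package pointed at a local vLLM server)
    would need to talk to the served model."""
    return {
        "base_url": os.environ["VLLM_BASE_URL"],
        "api_key": os.environ["OPENAI_API_KEY"],
    }

print(client_settings()["base_url"])  # http://localhost:8000/v1
```

Any OpenAI-compatible client can then be constructed from these two values, which keeps the local vLLM server interchangeable with a hosted endpoint.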
212 changes: 211 additions & 1 deletion meeting_log.md
@@ -1,3 +1,179 @@
# Biweekly 16.01.2025
## Attendees
- [x] @juliuspor
- [ ] @sm1lla
- [x] @joh-dah
- [x] @elenagensch
- [ ] @cdfhalle

## Topics
- Presentation at Databricks
- further examples generated from other examples <- Johanna added 4
- run experiments with the CoT module from DSPy
- Conrad found a Qwen reasoning model

# Biweekly 14.01.2025

## Attendees
- [x] @juliuspor
- [ ] @sm1lla
- [x] @joh-dah
- [x] @elenagensch
- [ ] @cdfhalle

## Topics
- Presentation at Databricks
- Meeting with Felix Boelter

## Actions
- Julius: draft the Databricks presentation
- Johanna: come up with more examples
- Elena: test DSPy + chain of thought

# Meeting with Felix Boelter
- Qwen-2.5 performs on par with GPT-4.
- Tool Calling with DSPy possible
- Leverage "Program of Thought" or "Chain of Thought" methods from DSPy to address iteration issues (currently no updates happening during iteration).

- DSPy Optimizers:
- Few-shot learning makes optimizers less relevant for our use case.

- Fine-tuning:
- Focus on fine-tuning using generated examples.

- Switch from LangChain to Llama-Index

- RAG (Retrieval-Augmented Generation) idea:
  - RAG should focus on searching for relevant Spark functions.
  - Generate queries specifically for the vector store.

- Chain of Thought (CoT) process:
  - First CoT: determine which functions to use and understand the model's intentions.
  - Proposed pipeline:
    1. Search for a list of functions (verify existence).
    2. Retrieve relevant context.
    3. Use this context to generate code.

- Model performance:
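The proposed pipeline above (propose functions, verify they exist, retrieve context, generate code) could be sketched as follows. Every function here is a hypothetical stand-in: in the real system `propose_functions` and `generate_code` would be LLM calls and `retrieve_context` a vector-store lookup.

```python
# Faked API reference: the real lookup would hit the api_ref vector store.
KNOWN_FUNCTIONS = {
    "mapInPandas": "doc text for mapInPandas",
    "withColumn": "doc text for withColumn",
}

def propose_functions(code: str) -> list[str]:
    # First CoT step: an LLM would propose which Spark functions to use.
    return ["mapInPandas", "withColumn"]

def verify_existence(names: list[str]) -> list[str]:
    # Keep only functions that actually exist in the API reference.
    return [n for n in names if n in KNOWN_FUNCTIONS]

def retrieve_context(names: list[str]) -> str:
    # Vector-store retrieval, faked here with a dict lookup.
    return "\n".join(KNOWN_FUNCTIONS[n] for n in names)

def generate_code(code: str, context: str) -> str:
    # Final LLM call; here we just prepend the context to show it was used.
    return f"# context used:\n# {context!r}\n{code}"

def migrate(code: str) -> str:
    names = verify_existence(propose_functions(code))
    return generate_code(code, retrieve_context(names))

print(migrate("df.rdd.map(f)").splitlines()[-1])  # df.rdd.map(f)
```

The verification step between proposing and retrieving is what distinguishes this design from a plain embedding search: hallucinated function names are filtered out before they can pollute the context.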

# Bi-weekly with Chris 09.01.2025

## Topics
- Report, what is important?
- State of the Art
- Evaluation

# Meeting with Martin 09.01.2025

## Topics
- Martin will try to get more examples to do industrial follow-up
- Date for Tech-Talk: 19th February 12:00
- Any more requirements for the project?
  - No, for us it's mostly validation
  - Would be awesome to have a more in-depth view
  - If you are interested in publishing it, Martin would be interested in reviewing
- Should we include documentation in the final submission?
  - No, not in particular
  - I don't care about the code, but about the outcome


# Biweekly 09.01.2025

## Attendees
- [x] @juliuspor
- [x] @sm1lla
- [x] @joh-dah
- [x] @elenagensch
- [x] @cdfhalle

## Topics
- Video (best video ever)
- Check how useful iterations are
- Conrad: the embedding vector database is not that useful (possibly search by plain function names instead)

## Actions
- Julius: frontend
- Johanna: check the usefulness of iterations
- Elena, Smilla, Conrad: analyze faulty outputs

# Biweekly 12.12.2024

## Attendees
- [x] @juliuspor
- [x] @sm1lla
- [x] @joh-dah
- [x] @elenagensch
- [x] @cdfhalle

## Topics
- **High priority**: Discuss deliverables for video next Tuesday.
- Prompt Engineering: generate prompt automatically
- Julius: try out a new model (4-6/14 solved)
- Elena: Stack Overflow code: the GitHub code seems to work better
- Smilla: fix examples that are parsed incorrectly etc., fetch context together with the generated code
- Conrad: experimented with embedding models

## Actions
- Make the video
- Julius: build a web app so everything looks a bit nicer

# Biweekly 12.12.2024

## Attendees
- [ ] @juliuspor
- [x] @sm1lla
- [ ] @joh-dah
- [x] @elenagensch
- [x] @cdfhalle

## Topics
- **High priority**: Discuss deliverables for video next Tuesday.
- Embedding functions: Conrad.
- Vectorize: Compare RAG embedding models, evaluate differences (Llama 70B, OpenAI embedding).
- Retriever from RAG: Benchmark papers—most retrievers are embedding models, ranking models, or Salesforce SFR embedding.
- Experiment with adding context to generated code.
- Code context experiments are in WandB reports (some configurations slightly outperform or perform similarly to setups without RAG).
- Stack Overflow code now in JSON, no experiments yet
- Evaluate via pickle, not CSVs
- Hybrid API migration paper: started to apply to our problem, sketch out code.
- The responsibility to note what one is working on lies with each individual. Please document it in the meeting log yourself.

## Actions
- stackoverflow, code experiments: @elenagensch
- @cdfhalle
- @joh-dah
- @sm1lla
- @juliuspor

# Biweekly 10.12.2024

| Name | About | Title | Agenda | Timekeeping | Notes |
|------------------|-----------------------------|---------------|----------------|-------------|--------------|
| Meeting Template | | Meeting 01/01/0001 | | | @elenagensch |

## Agenda
- OpenAI credits: Julius will write an email.
- Linter feedback as inline comments in the code, PR to be created.
- Conrad suggests "Chain of Thought."
- Julius will try using Google for solutions.
- Discuss deliverables for Martin on Thursday (code video + final presentation).

## Attendees

- [x] @juliuspor
- [ ] @sm1lla
- [x] @joh-dah
- [x] @elenagensch
- [x] @cdfhalle

## Topics

- **OpenAI Credits**: Julius will write and send an email regarding this.
- **Linter Feedback Prompt engineering**: Feedback is added as inline comments in the code, followed by creating a PR.
- **Chain of Thought**: potential approach for better logical reasoning - @cdfhalle.
- **Google Searches**: explore solutions using Google for prompts - @juliuspor.
- **Deliverables**: Code video and final presentation to be discussed on Thursday.

# Biweekly 01.01.0001

| Name | About | Title | Agenda | Timekeeping | Notes |
@@ -30,7 +206,41 @@

*

---
# Meeting with Martin 05.12.2024

## Agenda
- Present progress

## Attendees

- [ ] @juliuspor
- [x] @sm1lla
- [x] @joh-dah
- [ ] @elenagensch
- [x] @cdfhalle


## Meeting notes


- Finding out which kinds of errors occur
  - use the Python interpreter to check whether the code is syntactically correct?
  - import all the imports and then eval the code
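The syntax-check idea above could look like the following sketch. It only checks that generated code parses; actually importing its dependencies and executing it, as also discussed, would be a separate step:

```python
import ast

def is_syntactically_valid(code: str) -> bool:
    """Check whether a generated snippet at least parses as Python.

    This catches only syntax errors; import errors and runtime errors
    would require actually executing the code."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(is_syntactically_valid("df = spark.range(10)"))  # True
print(is_syntactically_valid("df = = 10"))             # False
```

Using `ast.parse` instead of `eval`/`exec` keeps the check side-effect free, which matters when the snippet being validated comes from an LLM.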

- Martin's idea for more examples:
  - use unit tests from PySpark for RDDs etc.
- record a small demo video (3-5 minutes): a short presentation, then walk through the code and an example
  - for the beginning of January
- final presentation at the Databricks office

Internal discussion with Chris:
- get runtime errors by executing the code, just to find out if this would help
- track, for individual examples, the distribution of how often they fail
- track what makes the code complex
- how is the code split? Are there different ways of splitting it?
- try an OpenAI model -> which subscription is necessary?
- Elena's card receipt for team building

# Biweekly 05.12.2024

| Name | About | Title | Agenda | Timekeeping | Notes |
107 changes: 69 additions & 38 deletions src/config.yaml
@@ -1,6 +1,6 @@
use_rag: true
# Vector Store Settings
num_rag_docs: 1
vectorstore_type: "api_ref" # Possible values: 'docs', 'code', 'api_ref'
vectorstore_type: "code" # Possible values: 'docs', 'code', 'api_ref'
vectorstore_settings:
docs:
docs:
@@ -10,7 +10,7 @@ vectorstore_settings:


code:
vector_store_path: "/raid/shared/masterproject2024/vector_stores/code_vector_store_small"
vector_store_path: "/raid/shared/masterproject2024/vector_stores/code/"
data_path: "/raid/shared/masterproject2024/rag/data/code.json"
repo_branch_list:
- { repo: "mrpowers-io/quinn", branch: "main"}
@@ -23,56 +23,87 @@ vectorstore_settings:
type: connect

api_ref:
vector_store_path: "/raid/shared/masterproject2024/vector_stores/vector_store"

vector_store_path: "/raid/shared/masterproject2024/vector_stores/api/nv_split512"
split_documents: False
chunk_size: 512
chunk_overlap: 50
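The `chunk_size`/`chunk_overlap` settings above imply a sliding-window splitter for the API reference documents. A minimal character-based sketch of those semantics (the real splitter is presumably a library component, e.g. from Llama-Index, and may split on tokens or sentences instead):

```python
def split_with_overlap(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters, where each
    chunk repeats the last chunk_overlap characters of its predecessor."""
    if not text:
        return []
    step = chunk_size - chunk_overlap  # advance by size minus overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("x" * 1000)
print(len(chunks))  # 3
```

The overlap means a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the vector store.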

iterate: false
iteration_limit: 5
# Types of messages the linter should return. Possible values: 'error', 'warning', 'convention' (maybe more)
linter_feedback_types:
- error
# current model options:
# - neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a16
# - neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
# - meta-llama/CodeLlama-70b-Python-hf
model_name: "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16"
model_temperature: 0.2
# model length in tokens; make sure the same value is used when serving the model
max_model_length: 8192
answer_token_length: 2048

# Iteration Settings
iterate: true
iteration_limit: 3


# Linter Settings
linter_config:
enabled_linters:
enabled_linters: # List of linters to use. Possible values: 'pylint', 'mypy', 'flake8', 'spark_connect'
- pylint
- mypy
- flake8
- spark_connect
feedback_types:
- error
- warning # Return only these severities. Possible values: 'error', 'warning', 'convention' (maybe more)
feedback_types: # Return only these severities. Possible values: 'error', 'warning', 'convention' (maybe more)
- error
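The `iterate`/`iteration_limit` settings together with the linter configuration above suggest a lint-and-regenerate loop: run the linters on the generated code, stop when they report no errors, otherwise feed the errors back into the model. A sketch with stub functions; `fake_lint` and `fake_regenerate` merely stand in for the real pylint/mypy/flake8/spark_connect checks and the LLM call:

```python
from typing import Callable

def iterate_until_clean(
    code: str,
    lint: Callable[[str], list[str]],
    regenerate: Callable[[str, list[str]], str],
    iteration_limit: int = 3,  # mirrors iteration_limit in the config above
) -> str:
    """Regenerate code until the linter reports no errors or the limit is hit."""
    for _ in range(iteration_limit):
        errors = lint(code)
        if not errors:
            break
        code = regenerate(code, errors)
    return code

# Illustrative stubs, not the project's real linter or generator.
def fake_lint(code: str) -> list[str]:
    return ["error: RDD API not supported by Spark Connect"] if ".rdd" in code else []

def fake_regenerate(code: str, errors: list[str]) -> str:
    return code.replace(".rdd.map(f)", ".mapInPandas(f, schema)")

print(iterate_until_clean("df.rdd.map(f)", fake_lint, fake_regenerate))
# df.mapInPandas(f, schema)
```

Capping the loop at `iteration_limit` bounds cost when the model keeps producing code the linters reject.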


# Models Settings
model_name: "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16"
# current model options:
# - neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a16
# - neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
# - meta-llama/CodeLlama-70b-Python-hf
model_temperature: 0.2
embedding_model_name: "nvidia/NV-Embed-v2"
max_model_length: 8192 # model length in tokens; make sure the same value is used when serving the model
answer_token_length: 2048


# Prompt Settings
generate_prompt: false # If true, the prompt will be generated by the LLM. If false, initial_prompt and iterated_prompt will be used.
use_error: true
use_rag: true
initial_prompt: "
Update the provided PySpark code to be compatible with Spark Connect.
The rewritten code should have exactly the same functionality as the original code and should return exactly the same output.
This is the original code that does not work with spark connect:

"
error_prompt: "\nWhen executed, the code produces the following error: "

iterated_prompt : "
Unfortunately, the code does not seem to work. This can be due to the fact that the code is not compatible with Spark Connect or
other issues.
Please fix the issues and make sure that the code you produce is correct and compatible with spark connect.


"
linter_prompt: "\n\nIssues in the code detected by the linter are listed here:

"

context_prompt: "\n\nIn case it is helpful you can use the following context to help you with the task:

"
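The prompt fragments above (`initial_prompt`, `error_prompt`, `context_prompt`) together with the `use_error`/`use_rag` flags suggest the final prompt is assembled by concatenation. A hypothetical sketch; the exact order and the `build_prompt` helper are assumptions, not the repository's actual code:

```python
# Shortened copies of the config fields above.
initial_prompt = (
    "Update the provided PySpark code to be compatible with Spark Connect.\n"
    "This is the original code that does not work with spark connect:\n\n"
)
error_prompt = "\nWhen executed, the code produces the following error: "
context_prompt = (
    "\n\nIn case it is helpful you can use the following context to help "
    "you with the task:\n\n"
)

def build_prompt(code, error=None, context=None):
    """Assemble the user prompt from the configured fragments."""
    prompt = initial_prompt + code
    if error:  # corresponds to use_error: true
        prompt += error_prompt + error
    if context:  # corresponds to use_rag: true
        prompt += context_prompt + context
    return prompt

p = build_prompt("df.rdd.map(f)", error="AttributeError", context="mapInPandas docs")
print("AttributeError" in p)  # True
```

Keeping the fragments in the config rather than in code is what makes the `generate_prompt: false` path tunable without redeploying.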

linter_error_prompt : "
Unfortunately, the code does not seem to work with spark connect.
Please rewrite the code to work with spark connect. Make sure the code is correct python code that can be executed without errors.
The Spark Connect Linter produces the following error:


system_prompt: "Update the provided PySpark code to be compatible with Spark Connect while maintaining its original functionality and output.

# Steps

* Analyze the provided PySpark code and the linter feedback to identify compatibility issues with Spark Connect.
* Use the given context to inform your updates and ensure the rewritten code is functionally equivalent to the original.
* Address each identified issue and make necessary modifications to the code.
* Verify that the updated code maintains the same output as the original code.

# Output Format

Return the updated PySpark code snippet as a plain text string, without any additional formatting or comments.

# Notes

* Ensure that the updated code only includes changes necessary for Spark Connect compatibility, avoiding any unnecessary modifications.
* Use the provided linter feedback as a guide, but also consider any additional context or requirements that may impact the updated code.
* The output should be a self-contained code snippet that can be used in place of the original code.
"
system_prompt: "You will be provided with PySpark Code that is not compatible with Spark Connect.
You will return an updated version of the code that has exactly the same output but is compatible with Spark Connect.
Only return code blocks."
use_error: true

# Experiment Settings
number_of_examples: 14
eval_iterations: 5
log_results: false
eval_iterations: 15
log_results: true
run_name: null
