Code vector store #6

Status: Open. Wants to merge 68 commits into base branch `code_exp`.

Commits (68):
e1bbd4d
Update meeting_log.md
sm1lla Dec 5, 2024
b19221d
log avagere scores for individual experiments
sm1lla Dec 10, 2024
9dc2c67
Update meeting_log.md
elenagensch Dec 10, 2024
bfaea4f
Make group team default entity to log wandb experiments
elenagensch Dec 10, 2024
3958d86
Merge pull request #7 from hpi-sam/code_exp
elenagensch Dec 10, 2024
b632923
experiments
joh-dah Dec 10, 2024
a13e686
add RddAttributeMatcher to linter
juliuspor Dec 10, 2024
43d0ffe
Merge branch 'main' into adjust_linter_prompt
joh-dah Dec 10, 2024
ff42568
Delete src/vector_store/code_scraping
elenagensch Dec 10, 2024
10ded63
Use model temperature from config file
sm1lla Dec 11, 2024
564a1e1
Update meeting_log.md
elenagensch Dec 12, 2024
1f891a1
Update meeting_log.md
elenagensch Dec 12, 2024
fdf45b9
small experiments
joh-dah Dec 12, 2024
6c74142
changed to iterate
joh-dah Dec 12, 2024
375d482
config
joh-dah Dec 12, 2024
56750c3
Merge branch 'main' into adjust_linter_prompt
joh-dah Dec 12, 2024
eb25c0e
Use pickle format for incorrectly saved examples
sm1lla Dec 12, 2024
68eb425
run experiments
joh-dah Dec 12, 2024
8f20742
black and experiments
joh-dah Dec 16, 2024
58409c7
Update meeting_log.md
joh-dah Dec 17, 2024
57996bd
Merge pull request #8 from hpi-sam/cleaned_examples
sm1lla Dec 17, 2024
5714342
refetch context from vector store
sm1lla Dec 17, 2024
5f09d97
Implemented automatic prompt generation
joh-dah Dec 17, 2024
ff364a2
Merge branch 'main' into adjust_linter_prompt
joh-dah Dec 17, 2024
14bdb04
add configuration for api ref rag
sm1lla Dec 17, 2024
80c0e29
run embedding models on gpu
cdfhalle Dec 18, 2024
38b0b78
Merge branch 'main' of github.com:hpi-sam/expedite-databricks-connect
cdfhalle Dec 18, 2024
92a0b7f
remove git_access_token
cdfhalle Dec 18, 2024
beb7e3f
clear messages in assistant
sm1lla Dec 19, 2024
e044f83
Make embedding model configurable for api reference rag
sm1lla Dec 19, 2024
e72a46d
Merge pull request #9 from hpi-sam/make_embedding_models_configurable
sm1lla Dec 19, 2024
34b1963
Merge branch 'main' into api_ref
sm1lla Dec 19, 2024
8a5f2ba
save api_ref documents
sm1lla Dec 19, 2024
89a2c17
Adjust retrieval prompt for nv-embed
sm1lla Jan 6, 2025
dd2448f
change system prompt and meta prompt
joh-dah Jan 6, 2025
5175b2c
Merge branch 'main' into adjust_linter_prompt
joh-dah Jan 6, 2025
bfe898f
black
joh-dah Jan 6, 2025
cd5424a
delete unused function
joh-dah Jan 6, 2025
56ef011
adjust config
joh-dah Jan 7, 2025
01ba5d0
fix non generative prompting
joh-dah Jan 7, 2025
b31d9ca
Merge pull request #10 from hpi-sam/adjust_linter_prompt
joh-dah Jan 7, 2025
6360225
Cleanup config
sm1lla Jan 7, 2025
ec47b24
cleanup main
sm1lla Jan 7, 2025
94be99b
Merge branch 'api_ref'
sm1lla Jan 7, 2025
44a0120
use model name from config
sm1lla Jan 7, 2025
1215a85
Update meeting_log.md
juliuspor Jan 9, 2025
e8a77e4
try to add metrics
joh-dah Jan 9, 2025
324dc22
Merge branch 'main' into analyze_iterations
joh-dah Jan 9, 2025
cd32488
Fix mixedRDD example
sm1lla Jan 9, 2025
94731a3
Update meeting_log.md
juliuspor Jan 9, 2025
68541b5
Update meeting_log.md
juliuspor Jan 9, 2025
a03d148
add metrics
joh-dah Jan 9, 2025
e49ff6b
Merge branch 'main' into analyze_iterations
joh-dah Jan 9, 2025
2a41cb6
Update meeting_log.md
juliuspor Jan 9, 2025
43dfd49
Update meeting_log.md
juliuspor Jan 14, 2025
2b13270
final config
joh-dah Jan 14, 2025
27e04c8
Merge branch 'main' into analyze_iterations
joh-dah Jan 14, 2025
5aa7976
Merge pull request #11 from hpi-sam/analyze_iterations
joh-dah Jan 14, 2025
dfd2d38
Move assistant to seperate file
elenagensch Jan 14, 2025
63034fe
idk
cdfhalle Jan 14, 2025
c23478a
forgot something
cdfhalle Jan 14, 2025
ee11de9
add serving script for Qwencoder
cdfhalle Jan 14, 2025
c1e0271
Merge branch 'main' of github.com:hpi-sam/expedite-databricks-connect…
cdfhalle Jan 14, 2025
36fd976
fix merging error
cdfhalle Jan 16, 2025
114c846
Merge pull request #12 from hpi-sam/add_wandb_promt_logging
elenagensch Jan 16, 2025
b41680d
Update meeting_log.md
elenagensch Jan 16, 2025
a73af39
fix logging
cdfhalle Jan 16, 2025
8b9821c
Merge pull request #13 from hpi-sam/add_wandb_promt_logging
cdfhalle Jan 16, 2025
2 changes: 2 additions & 0 deletions environment.yml
@@ -296,3 +296,5 @@ dependencies:
variables:
HF_HOME: /raid/shared/masterproject2024/huggingface/
CODE_DATA_FILE_PATH: /raid/shared/masterproject2024/rag/data/code.json
VLLM_BASE_URL: "http://localhost:8000/v1"
OPENAI_API_KEY: "token-abc123"
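These variables suggest the model is served through vLLM behind an OpenAI-compatible endpoint, with a placeholder token as the API key. A minimal sketch of how a client might pick them up; the `client_settings` helper is hypothetical, not part of the repository:

```python
import os

# Mirror the variables defined in environment.yml above (the token is the
# placeholder value from the file, not a real credential).
os.environ["VLLM_BASE_URL"] = "http://localhost:8000/v1"
os.environ["OPENAI_API_KEY"] = "token-abc123"

def client_settings() -> dict:
    """Collect the connection settings an OpenAI-compatible client
    (for example the `openai` package pointed at a local vLLM server)
    would need to talk to the served model."""
    return {
        "base_url": os.environ["VLLM_BASE_URL"],
        "api_key": os.environ["OPENAI_API_KEY"],
    }

print(client_settings()["base_url"])  # http://localhost:8000/v1
```

Any OpenAI-compatible client can then be constructed from these two values, which keeps the local vLLM server interchangeable with a hosted endpoint.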
212 changes: 211 additions & 1 deletion meeting_log.md
@@ -1,3 +1,179 @@
# Biweekly 16.01.2025
## Attendees
- [x] @juliuspor
- [ ] @sm1lla
- [x] @joh-dah
- [x] @elenagensch
- [ ] @cdfhalle

## Topics
- Presentation at Databricks
- further examples generated from other examples <- Johanna added 4
- run experiments with the CoT module from DSPy
- Conrad found a Qwen reasoning model

# Biweekly 14.01.2025

## Attendees
- [x] @juliuspor
- [ ] @sm1lla
- [x] @joh-dah
- [x] @elenagensch
- [ ] @cdfhalle

## Topics
- Presentation at Databricks
- Meeting with Felix Boelter

## Actions
- Julius: draft the Databricks presentation
- Johanna: come up with more examples
- Elena: test DSPy + chain of thought

# Meeting with Felix Boelter
- Qwen-2.5 performs on par with GPT-4.
- Tool Calling with DSPy possible
- Leverage "Program of Thought" or "Chain of Thought" methods from DSPy to address iteration issues (currently no updates happening during iteration).

- DSPy Optimizers:
- Few-shot learning makes optimizers less relevant for our use case.

- Fine-tuning:
- Focus on fine-tuning using generated examples.

- Switch from LangChain to Llama-Index

- RAG (Retrieval-Augmented Generation) idea:
  - RAG should focus on searching for relevant Spark functions.
  - Generate queries specifically for the vector store.

- Chain of Thought (CoT) process:
  - First CoT: determine which functions to use and understand the model's intentions.
  - Proposed pipeline:
    1. Search for a list of functions (verify existence).
    2. Retrieve relevant context.
    3. Use this context to generate code.

- Model performance:
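The proposed pipeline above (propose functions, verify they exist, retrieve context, generate code) could be sketched as follows. Every function here is a hypothetical stand-in: in the real system `propose_functions` and `generate_code` would be LLM calls and `retrieve_context` a vector-store lookup.

```python
# Faked API reference: the real lookup would hit the api_ref vector store.
KNOWN_FUNCTIONS = {
    "mapInPandas": "doc text for mapInPandas",
    "withColumn": "doc text for withColumn",
}

def propose_functions(code: str) -> list[str]:
    # First CoT step: an LLM would propose which Spark functions to use.
    return ["mapInPandas", "withColumn"]

def verify_existence(names: list[str]) -> list[str]:
    # Keep only functions that actually exist in the API reference.
    return [n for n in names if n in KNOWN_FUNCTIONS]

def retrieve_context(names: list[str]) -> str:
    # Vector-store retrieval, faked here with a dict lookup.
    return "\n".join(KNOWN_FUNCTIONS[n] for n in names)

def generate_code(code: str, context: str) -> str:
    # Final LLM call; here we just prepend the context to show it was used.
    return f"# context used:\n# {context!r}\n{code}"

def migrate(code: str) -> str:
    names = verify_existence(propose_functions(code))
    return generate_code(code, retrieve_context(names))

print(migrate("df.rdd.map(f)").splitlines()[-1])  # df.rdd.map(f)
```

The verification step between proposing and retrieving is what distinguishes this design from a plain embedding search: hallucinated function names are filtered out before they can pollute the context.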

# Bi-weekly with Chris 09.01.2025

## Topics
- Report, what is important?
- State of the Art
- Evaluation

# Meeting with Martin 09.01.2025

## Topics
- Martin will try to get more examples to do industrial follow-up
- Date for Tech-Talk: 19th February 12:00
- Any more requirements for the project?
  - No, for us it's mostly validation
  - Would be awesome to have a more in-depth view
  - If you are interested in publishing it, Martin would be interested in reviewing
- Should we include documentation in the final submission?
  - No, not in particular
  - I don't care about the code, but about the outcome


# Biweekly 09.01.2025

## Attendees
- [x] @juliuspor
- [x] @sm1lla
- [x] @joh-dah
- [x] @elenagensch
- [x] @cdfhalle

## Topics
- Video (best video ever)
- Check how useful iterations are
- Conrad: the embedding vector database is not that useful (possibly search by plain function names instead)

## Actions
- Julius: frontend
- Johanna: check the usefulness of iterations
- Elena, Smilla, Conrad: analyze faulty outputs

# Biweekly 12.12.2024

## Attendees
- [x] @juliuspor
- [x] @sm1lla
- [x] @joh-dah
- [x] @elenagensch
- [x] @cdfhalle

## Topics
- **High priority**: Discuss deliverables for video next Tuesday.
- Prompt Engineering: generate prompt automatically
- Julius: try out a new model (4-6/14 solved)
- Elena: Stack Overflow code: the GitHub code seems to work better
- Smilla: fix examples that are parsed incorrectly etc., fetch context together with the generated code
- Conrad: experimented with embedding models

## Actions
- Make the video
- Julius: build a web app so everything looks a bit nicer

# Biweekly 12.12.2024

## Attendees
- [ ] @juliuspor
- [x] @sm1lla
- [ ] @joh-dah
- [x] @elenagensch
- [x] @cdfhalle

## Topics
- **High priority**: Discuss deliverables for video next Tuesday.
- Embedding functions: Conrad.
- Vectorize: Compare RAG embedding models, evaluate differences (Llama 70B, OpenAI embedding).
- Retriever from RAG: Benchmark papers—most retrievers are embedding models, ranking models, or Salesforce SFR embedding.
- Experiment with adding context to generated code.
- Code context experiments are in WandB reports (some configurations slightly outperform or perform similarly to setups without RAG).
- Stack Overflow code now in JSON, no experiments yet
- Evaluate via pickle, not CSVs
- Hybrid API migration paper: started to apply to our problem, sketch out code.
- The responsibility to note what one is working on lies with each individual. Please document it in the meeting log yourself.

## Actions
- stackoverflow, code experiments: @elenagensch
- @cdfhalle
- @joh-dah
- @sm1lla
- @juliuspor

# Biweekly 10.12.2024

| Name | About | Title | Agenda | Timekeeping | Notes |
|------------------|-----------------------------|---------------|----------------|-------------|--------------|
| Meeting Template | | Meeting 01/01/0001 | | | @elenagensch |

## Agenda
- OpenAI credits: Julius will write an email.
- Linter feedback as inline comments in the code, PR to be created.
- Conrad suggests "Chain of Thought."
- Julius will try using Google for solutions.
- Discuss deliverables for Martin on Thursday (code video + final presentation).

## Attendees

- [x] @juliuspor
- [ ] @sm1lla
- [x] @joh-dah
- [x] @elenagensch
- [x] @cdfhalle

## Topics

- **OpenAI Credits**: Julius will write and send an email regarding this.
- **Linter Feedback Prompt engineering**: Feedback is added as inline comments in the code, followed by creating a PR.
- **Chain of Thought**: potential approach for better logical reasoning - @cdfhalle.
- **Google Searches**: explore solutions using Google for prompts - @juliuspor.
- **Deliverables**: Code video and final presentation to be discussed on Thursday.

# Biweekly 01.01.0001

| Name | About | Title | Agenda | Timekeeping | Notes |
@@ -30,7 +206,41 @@

*

---
# Meeting with Martin 05.12.2024

## Agenda
- Present progress

## Attendees

- [ ] @juliuspor
- [x] @sm1lla
- [x] @joh-dah
- [ ] @elenagensch
- [x] @cdfhalle


## Meeting notes


- Finding out which kinds of errors occur
  - use the Python interpreter to check whether the code is syntactically correct?
  - import all the imports and then eval the code
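The syntax-check idea above could look like the following sketch. It only checks that generated code parses; actually importing its dependencies and executing it, as also discussed, would be a separate step:

```python
import ast

def is_syntactically_valid(code: str) -> bool:
    """Check whether a generated snippet at least parses as Python.

    This catches only syntax errors; import errors and runtime errors
    would require actually executing the code."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(is_syntactically_valid("df = spark.range(10)"))  # True
print(is_syntactically_valid("df = = 10"))             # False
```

Using `ast.parse` instead of `eval`/`exec` keeps the check side-effect free, which matters when the snippet being validated comes from an LLM.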

- Martin's idea for more examples:
  - use unit tests from PySpark for RDDs etc.
- record a small demo video (3-5 minutes): a short presentation, then walk through the code and an example
  - for the beginning of January
- final presentation at the Databricks office

Internal discussion with Chris:
- get runtime errors by executing the code, just to find out if this would help
- track, for individual examples, the distribution of how often they fail
- track what makes the code complex
- how is the code split? Are there different ways of splitting it?
- try an OpenAI model -> which subscription is necessary?
- Elena's card receipt for team building

# Biweekly 05.12.2024

| Name | About | Title | Agenda | Timekeeping | Notes |
107 changes: 69 additions & 38 deletions src/config.yaml
@@ -1,6 +1,6 @@
use_rag: true
# Vector Store Settings
num_rag_docs: 1
vectorstore_type: "api_ref" # Possible values: 'docs', 'code', 'api_ref'
vectorstore_type: "code" # Possible values: 'docs', 'code', 'api_ref'
vectorstore_settings:
docs:
docs:
@@ -10,7 +10,7 @@ vectorstore_settings:


code:
vector_store_path: "/raid/shared/masterproject2024/vector_stores/code_vector_store_small"
vector_store_path: "/raid/shared/masterproject2024/vector_stores/code/"
data_path: "/raid/shared/masterproject2024/rag/data/code.json"
repo_branch_list:
- { repo: "mrpowers-io/quinn", branch: "main"}
@@ -23,56 +23,87 @@ vectorstore_settings:
type: connect

api_ref:
vector_store_path: "/raid/shared/masterproject2024/vector_stores/vector_store"

vector_store_path: "/raid/shared/masterproject2024/vector_stores/api/nv_split512"
split_documents: False
chunk_size: 512
chunk_overlap: 50
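The `chunk_size`/`chunk_overlap` settings above imply a sliding-window splitter for the API reference documents. A minimal character-based sketch of those semantics (the real splitter is presumably a library component, e.g. from Llama-Index, and may split on tokens or sentences instead):

```python
def split_with_overlap(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters, where each
    chunk repeats the last chunk_overlap characters of its predecessor."""
    if not text:
        return []
    step = chunk_size - chunk_overlap  # advance by size minus overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("x" * 1000)
print(len(chunks))  # 3
```

The overlap means a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the vector store.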

iterate: false
iteration_limit: 5
# Types of messages the linter should return. Possible values: 'error', 'warning', 'convention' (maybe more)
linter_feedback_types:
- error
# current model options:
# - neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a16
# - neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
# - meta-llama/CodeLlama-70b-Python-hf
model_name: "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16"
model_temperature: 0.2
# model length in tokens; make sure the same value is used when serving the model
max_model_length: 8192
answer_token_length: 2048

# Iteration Settings
iterate: true
iteration_limit: 3


# Linter Settings
linter_config:
enabled_linters:
enabled_linters: # List of linters to use. Possible values: 'pylint', 'mypy', 'flake8', 'spark_connect'
- pylint
- mypy
- flake8
- spark_connect
feedback_types:
- error
- warning # Return only these severities. Possible values: 'error', 'warning', 'convention' (maybe more)
feedback_types: # Return only these severities. Possible values: 'error', 'warning', 'convention' (maybe more)
- error
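The `iterate`/`iteration_limit` settings together with the linter configuration above suggest a lint-and-regenerate loop: run the linters on the generated code, stop when they report no errors, otherwise feed the errors back into the model. A sketch with stub functions; `fake_lint` and `fake_regenerate` merely stand in for the real pylint/mypy/flake8/spark_connect checks and the LLM call:

```python
from typing import Callable

def iterate_until_clean(
    code: str,
    lint: Callable[[str], list[str]],
    regenerate: Callable[[str, list[str]], str],
    iteration_limit: int = 3,  # mirrors iteration_limit in the config above
) -> str:
    """Regenerate code until the linter reports no errors or the limit is hit."""
    for _ in range(iteration_limit):
        errors = lint(code)
        if not errors:
            break
        code = regenerate(code, errors)
    return code

# Illustrative stubs, not the project's real linter or generator.
def fake_lint(code: str) -> list[str]:
    return ["error: RDD API not supported by Spark Connect"] if ".rdd" in code else []

def fake_regenerate(code: str, errors: list[str]) -> str:
    return code.replace(".rdd.map(f)", ".mapInPandas(f, schema)")

print(iterate_until_clean("df.rdd.map(f)", fake_lint, fake_regenerate))
# df.mapInPandas(f, schema)
```

Capping the loop at `iteration_limit` bounds cost when the model keeps producing code the linters reject.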


# Models Settings
model_name: "neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16"
# current model options:
# - neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a16
# - neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
# - meta-llama/CodeLlama-70b-Python-hf
model_temperature: 0.2
embedding_model_name: "nvidia/NV-Embed-v2"
max_model_length: 8192 # model length in tokens; make sure the same value is used when serving the model
answer_token_length: 2048


# Prompt Settings
generate_prompt: false # If true, the prompt will be generated by the LLM. If false, initial_prompt and iterated_prompt will be used.
use_error: true
use_rag: true
initial_prompt: "
Update the provided PySpark code to be compatible with Spark Connect.
The rewritten code should have exactly the same functionality as the original code and should return exactly the same output.
This is the original code that does not work with spark connect:

"
error_prompt: "\nWhen executed, the code produces the following error: "

iterated_prompt : "
Unfortunately, the code does not seem to work. This can be due to the fact that the code is not compatible with Spark Connect or
other issues.
Please fix the issues and make sure that the code you produce is correct and compatible with spark connect.


"
linter_prompt: "\n\nIssues in the code detected by the linter are listed here:

"

context_prompt: "\n\nIn case it is helpful you can use the following context to help you with the task:

"
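The prompt fragments above (`initial_prompt`, `error_prompt`, `context_prompt`) together with the `use_error`/`use_rag` flags suggest the final prompt is assembled by concatenation. A hypothetical sketch; the exact order and the `build_prompt` helper are assumptions, not the repository's actual code:

```python
# Shortened copies of the config fields above.
initial_prompt = (
    "Update the provided PySpark code to be compatible with Spark Connect.\n"
    "This is the original code that does not work with spark connect:\n\n"
)
error_prompt = "\nWhen executed, the code produces the following error: "
context_prompt = (
    "\n\nIn case it is helpful you can use the following context to help "
    "you with the task:\n\n"
)

def build_prompt(code, error=None, context=None):
    """Assemble the user prompt from the configured fragments."""
    prompt = initial_prompt + code
    if error:  # corresponds to use_error: true
        prompt += error_prompt + error
    if context:  # corresponds to use_rag: true
        prompt += context_prompt + context
    return prompt

p = build_prompt("df.rdd.map(f)", error="AttributeError", context="mapInPandas docs")
print("AttributeError" in p)  # True
```

Keeping the fragments in the config rather than in code is what makes the `generate_prompt: false` path tunable without redeploying.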

linter_error_prompt : "
Unfortunately, the code does not seem to work with spark connect.
Please rewrite the code to work with spark connect. Make sure the code is correct python code that can be executed without errors.
The Spark Connect Linter produces the following error:


system_prompt: "Update the provided PySpark code to be compatible with Spark Connect while maintaining its original functionality and output.

# Steps

* Analyze the provided PySpark code and the linter feedback to identify compatibility issues with Spark Connect.
* Use the given context to inform your updates and ensure the rewritten code is functionally equivalent to the original.
* Address each identified issue and make necessary modifications to the code.
* Verify that the updated code maintains the same output as the original code.

# Output Format

Return the updated PySpark code snippet as a plain text string, without any additional formatting or comments.

# Notes

* Ensure that the updated code only includes changes necessary for Spark Connect compatibility, avoiding any unnecessary modifications.
* Use the provided linter feedback as a guide, but also consider any additional context or requirements that may impact the updated code.
* The output should be a self-contained code snippet that can be used in place of the original code.
"
system_prompt: "You will be provided with PySpark Code that is not compatible with Spark Connect.
You will return an updated version of the code that has exactly the same output but is compatible with Spark Connect.
Only return code blocks."
use_error: true

# Experiment Settings
number_of_examples: 14
eval_iterations: 5
log_results: false
eval_iterations: 15
log_results: true
run_name: null
