Copyright (C) 2021 - Benjamin Paassen, Jessica McBroom, The University of Sydney
Copyright (C) 2022 - Boris Shminke
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, see http://www.gnu.org/licenses/.
ast2vec is a neural network that can translate a Python syntax tree into a 256-dimensional vector, and such a vector back into a syntax tree. If you use ast2vec in academic work, please cite the paper:

- Paaßen, B., McBroom, J., Jeffries, B., Koprinska, I., and Yacef, K. (2021). Mapping Python Programs to Vectors using Recursive Neural Encodings. Journal of Educational Data Mining. In press.
Note: This network was updated in June 2021 to account for changes in the Python grammar since version 3.8. We also slightly changed the architecture, resulting in fewer parameters, better autoencoding performance, and an easier interface.
If you wish to translate a Python program to a vector, apply the following steps:

```python
import ast
import python_ast_utils
import ast2vec

# compile the Python code into a syntax tree
src = "print('Hello world!')"
tree = python_ast_utils.ast_to_tree(ast.parse(src))

# load the ast2vec model
model = ast2vec.load_model()

# translate to a vector
x = model.encode(tree)
```
The vector `x` is 256-dimensional and represents the syntax tree.
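As a quick sanity check (assuming the returned code behaves like a one-dimensional array; whether it is a NumPy array or a PyTorch tensor depends on the implementation):

```python
print(len(x))  # expected: 256
```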
If you wish to decode a vector back into a syntax tree, use the following function:

```python
tree = model.decode(x, max_size = 300)
```

The optional `max_size` argument is important to prevent endless loops during decoding.
The `ast2vec` module also contains the functions `encode_trees` and `decode_points` to encode and decode multiple objects at the same time.
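A minimal sketch of this batch interface, assuming both are module-level functions that take the model plus a list of trees (respectively a matrix of codes); the exact signatures are assumptions and may differ:

```python
# hypothetical batch usage; reuses the imports from the encoding example above
programs = ["print('Hello world!')", "x = 1 + 2"]
trees = [python_ast_utils.ast_to_tree(ast.parse(src)) for src in programs]

X = ast2vec.encode_trees(model, trees)                   # assumed: one code per tree
decoded = ast2vec.decode_points(model, X, max_size=300)  # assumed signature
```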
One can also perform the encoding described in the previous section using Docker. First, pull an existing image:

```
docker pull inpefess/ast2vec
```

Or build it from scratch:

```
git clone https://gitlab.com/inpefess/ast2vec.git
cd ./ast2vec
docker build -t inpefess/ast2vec .
```

Then run the container:

```
docker run --name ast2vec -p 8080:8080 -d inpefess/ast2vec
```

After doing that, one can use the TorchServe REST API to request embeddings:

```
curl http://127.0.0.1:8080/predictions/ast2vec \
  -H 'Content-Type: application/json' \
  -d '{"data": "print(\"Hello, world\")"}'
```
In case you have a Kubernetes cluster, you can run several containers (ast2vec, redis for caching requests, prometheus for collecting metrics, and grafana for drawing dashboards) with a single command:

```
kubectl apply -f ast2vec.yml
```

Then get the Kubernetes cluster IP:

```
kubectl cluster-info
```

and use this IP (with no port specified) instead of `127.0.0.1:8080` in the preceding `curl` example.
Latency and other metrics can be monitored through Grafana dashboards. One can access them by opening `[kubernetes-cluster-ip]/grafana` in a browser.
A more detailed tutorial using an entire mock dataset of programs is given in the `tutorial.ipynb` notebook.
ast2vec is a recursive tree grammar autoencoder as proposed by Paaßen, Koprinska, and Yacef (2021).
In more detail, the encoder part of the model follows the recursive behavior of a bottom-up tree parser of the Python programming language to generate the overall vector encoding. Consider the example syntax tree `Module(Expr(Call(Name, Str)))`, which corresponds to the program `print('Hello, world!')`. To process this syntax tree, we start with the leaf nodes `Name` and `Str` and encode them as learned 256-dimensional vectors. Note that every `Name` and every `Str` receives the same vector encoding, irrespective of content. Next, we apply a small neural network for the `Call` node, which takes the two vector encodings for `Name` and `Str` as input and returns the vector code for the entire subtree `Call(Name, Str)`. This process continues recursively until we have encoded the entire tree, as sketched below.
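The following is a minimal, self-contained sketch of this bottom-up scheme, not the actual ast2vec implementation; the `TreeNode` helper, the per-rule networks, and the way their outputs are combined are illustrative assumptions:

```python
import torch

DIM = 256  # the encoding dimensionality described above

class TreeNode:
    """A minimal syntax tree node (hypothetical helper for this sketch)."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

# leaves map to learned constant vectors, irrespective of their content
leaf_codes = {'Name': torch.randn(DIM), 'Str': torch.randn(DIM)}

# one small network per syntactic rule, mapping child codes to the parent code
rule_nets = {'Call': torch.nn.Linear(2 * DIM, DIM),
             'Expr': torch.nn.Linear(DIM, DIM),
             'Module': torch.nn.Linear(DIM, DIM)}

def encode(node):
    if not node.children:
        return leaf_codes[node.label]
    # bottom-up: encode all children first, then combine their codes
    child_codes = torch.cat([encode(child) for child in node.children])
    return torch.tanh(rule_nets[node.label](child_codes))

tree = TreeNode('Module', [TreeNode('Expr', [
    TreeNode('Call', [TreeNode('Name'), TreeNode('Str')])])])
x = encode(tree)  # a 256-dimensional code for the whole tree
```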
For decoding, we apply the inverse mechanism: given the vector code `x`, we first apply a classifier `h` which chooses the current syntactic element (e.g. `Module`). Once we know the syntactic element, we apply a separate decoding function for each child of that syntactic element, yielding the vector codes for all children. We then apply the same scheme recursively until no child is left to decode, as sketched below.
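Continuing the sketch above, a top-down decoder might look as follows. The `classify` stand-in and the per-rule child decoders are hypothetical placeholders for the trained components, and the `max_size` counter mirrors the guard against endless loops mentioned earlier:

```python
# one (hypothetical) decoding function per child slot of each element;
# identity maps stand in for the trained per-child decoders
child_decoders = {'Module': [lambda x: x], 'Expr': [lambda x: x],
                  'Call': [lambda x: x, lambda x: x], 'Name': [], 'Str': []}

# toy stand-in for the classifier h: the nearest label code wins
label_codes = {label: torch.randn(DIM) for label in child_decoders}

def classify(x):
    return min(label_codes, key=lambda label: torch.dist(x, label_codes[label]).item())

def decode(x, max_size=300):
    budget = [max_size]  # shared counter: max_size prevents endless recursion
    def rec(x):
        if budget[0] <= 0:
            return None
        budget[0] -= 1
        label = classify(x)  # choose the current syntactic element
        children = [rec(f(x)) for f in child_decoders[label]]
        return TreeNode(label, [c for c in children if c is not None])
    return rec(x)

decoded_tree = decode(x)  # decodes the code from the encoder sketch above
```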
The entire network is trained in a variational autoencoding framework (Kingma and Welling, 2013). In particular, we train the network to maximize the probability that the vector code for a training tree gets decoded back to itself, even after adding a small amount of Gaussian noise. This scheme has the advantage that we do not need any expert annotation. We only need syntax trees as training data.
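As a toy illustration of this principle on plain vectors rather than trees (a runnable sketch of the idea, not the actual loss implemented in recursive_tree_grammar_auto_encoder.py):

```python
import torch

torch.manual_seed(0)
enc = torch.nn.Linear(64, 16)  # toy encoder
dec = torch.nn.Linear(16, 64)  # toy decoder
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

data = torch.randn(256, 64)  # stand-in for a batch of training examples
for step in range(1000):
    codes = enc(data)
    noisy = codes + 0.1 * torch.randn_like(codes)  # add a small amount of Gaussian noise
    loss = torch.nn.functional.mse_loss(dec(noisy), data)  # decode back to the input
    opt.zero_grad()
    loss.backward()
    opt.step()
```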
In particular, our training data consisted of 448,992 Python programs recorded as part of the National Computer Science School (NCSS), an Australian educational outreach programme. The course was delivered by the Grok Learning platform. After compilation, we were left with 86,991 unique abstract syntax trees. We performed training for 130,000 epochs (each epoch with a mini-batch of 32 trees), which corresponds to roughly nine passes over the program data (32 * 130,000 = 4,160,000 tree presentations). All training was performed on a consumer-grade laptop with a 2017 Intel Core i7 CPU and took roughly one week of real time.
- `ast2vec.pt`: A PyTorch model file containing the ast2vec model parameters.
- `ast2vec.py`: A Python module containing convenience functions to use ast2vec.
- `LICENSE.md`: A copy of the GPLv3 license.
- `mock_dataset`: A directory containing an example dataset of Python code for the `tutorial.ipynb` notebook.
- `python_ast_utils.py`: A Python module containing convenience functions to access Python abstract syntax trees as well as the formal grammar describing the Python programming language.
- `README.md`: This file.
- `recursive_tree_grammar_auto_encoder.py`: A PyTorch implementation of recursive tree grammar autoencoders (Paaßen, Koprinska, and Yacef, 2021).
- `tree.py`: A Python implementation of a recursive tree data structure for internal use.
- `tree_grammar.py`: A Python implementation of regular tree grammars.
- `tutorial.ipynb`: A Jupyter notebook illustrating how to use ast2vec to analyze program data.
- `variable_classifier.py`: A scikit-learn compatible classifier to infer the variable references and function calls in a syntax tree based on training data. Refer to the `tutorial.ipynb` notebook for an example.
- `version_0_1_0`: A directory containing the version described in the paper. This version is only compatible with Python 3.7 and is thus no longer up to date.
All code enclosed in this repository is licensed under the GNU General Public License (Version 3). This documentation as well as the neural network parameters are licensed under Creative Commons Attribution Share-Alike (CC-BY-SA 4.0).
This library depends on NumPy for matrix operations, on scikit-learn for the support vector machine solver, and on PyTorch for the actual ast2vec model. For the respective licenses, please refer to their websites.
- Paaßen, B., McBroom, J., Jeffries, B., Koprinska, I., and Yacef, K. (2021). Mapping Python Programs to Vectors using Recursive Neural Encodings. Journal of Educational Data Mining. In press.
- Paaßen, B., Koprinska, I., and Yacef, K. (2021). Recursive Tree Grammar Autoencoders. Submitted to the IEEE Transactions on Neural Networks and Learning Systems. Under review.
- Kingma, D., and Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.