Custom query fix #1428

Merged Nov 21, 2023 (60 commits)
Commits
0d377e6
code merge, custom query changes
abhishekpradeepmishra Jun 30, 2023
88cd753
documentation update for custom query
abhishekpradeepmishra Jun 30, 2023
40fa9c1
Merge branch 'master' into custom-query
burhan94 Jul 12, 2023
aa714b5
updated table creation script
abhishekpradeepmishra Jul 25, 2023
709e04b
changed componenttype to view, and other related changes
abhishekpradeepmishra Aug 4, 2023
6c471b0
Merge branch 'awslabs:master' into custom-query
abhishekpradeepmishra Aug 17, 2023
91c5b79
Merge pull request #11 from abhishekpradeepmishra/custom-query
abhishekpradeepmishra Aug 17, 2023
fe5b02d
RDF
mhavey Jun 30, 2023
9999b1d
simplify
mhavey Jul 13, 2023
5eff9d5
Update README.md
mhavey Jul 14, 2023
98c9e38
Create PropertyGraph.md
mhavey Jul 14, 2023
8230afa
Create RDF.md
mhavey Jul 14, 2023
b1fd967
Update README.md
mhavey Jul 14, 2023
106b33a
Update PropertyGraph.md
mhavey Jul 14, 2023
f02f5ab
Update README.md
mhavey Jul 14, 2023
c232969
Update PropertyGraph.md
mhavey Jul 14, 2023
a08696f
Update PropertyGraph.md
mhavey Jul 14, 2023
e315c53
Update RDF.md
mhavey Jul 14, 2023
33d1a97
Create sample-cli-rdf.sh
mhavey Jul 14, 2023
c022067
Update sample-cli-script.sh
mhavey Aug 18, 2023
bb88e07
Update sample-cli-script.sh
mhavey Jul 14, 2023
a913898
Update RDF.md
mhavey Jul 14, 2023
a6bcc70
Update RDF.md
mhavey Jul 14, 2023
81a772b
Update RDF.md
mhavey Jul 15, 2023
36d859e
Update RDF.md
mhavey Jul 15, 2023
67cecf5
Update RDF.md
mhavey Jul 16, 2023
4fff460
Update RDF.md
mhavey Jul 16, 2023
0ef048f
Update RDF.md
mhavey Jul 16, 2023
7eae39f
Update RDF.md
mhavey Jul 16, 2023
d794f95
Update RDF.md
mhavey Jul 16, 2023
221a467
Update RDF.md
mhavey Jul 16, 2023
44d5128
Update RDF.md
mhavey Jul 16, 2023
a4598b1
Update README.md
mhavey Jul 16, 2023
9a69e8b
Update RDF.md
mhavey Jul 16, 2023
5fbc363
Update RDF.md
mhavey Jul 16, 2023
c02b319
Add files via upload
mhavey Jul 17, 2023
902ba32
Update RDF.md
mhavey Jul 17, 2023
0e7ae59
Update RDF.md
mhavey Jul 17, 2023
e0d05e3
Update sample-cli-rdf.sh
mhavey Jul 27, 2023
0de6035
Update RDF.md
mhavey Jul 27, 2023
6e760ff
Add files via upload
mhavey Jul 27, 2023
ac16116
Update RDF.md
mhavey Jul 27, 2023
945199b
Update athena-neptune.yaml
mhavey Jul 28, 2023
6ee38d7
Update NeptuneRecordHandler.java
mhavey Jul 28, 2023
2801b13
Update NeptuneConnection.java
mhavey Jul 28, 2023
d5e0278
Update RDFHandler.java
mhavey Jul 28, 2023
abe4632
Update NeptuneSparqlConnection.java
mhavey Jul 28, 2023
bccae6e
loglevel
mhavey Aug 22, 2023
77bb6b8
loglevel
mhavey Aug 22, 2023
12bc032
PR review
mhavey Oct 5, 2023
0c69757
fixed issue in record handler
mhavey Oct 9, 2023
0127769
merge
mhavey Oct 24, 2023
34a2564
Merge branch 'master' into custom-query-fix
burhan94 Nov 4, 2023
cde76a6
Merge branch 'master' into custom-query-fix
burhan94 Nov 14, 2023
beb0cb4
Update PropertyGraphHandler.java
mhavey Nov 15, 2023
d1304cd
Merge branch 'master' into custom-query-fix
mhavey Nov 16, 2023
04e52d0
Merge branch 'master' into custom-query-fix
mhavey Nov 16, 2023
47568ad
Merge branch 'master' into custom-query-fix
mhavey Nov 21, 2023
58dc948
Update README.md - remove merge comments
mhavey Nov 21, 2023
1087a9e
Update README.md - remove merge comments
mhavey Nov 21, 2023
5 changes: 3 additions & 2 deletions in athena-neptune/athena-neptune.yaml

@@ -25,9 +25,10 @@ Parameters:
     Description: 'To find the Neptune cluster resource ID in the Amazon Neptune AWS Management Console, choose the DB cluster that you want. The Resource ID is shown in the Configuration section.'
     Type: String
   NeptuneGraphType:
-    Description: 'Type of graph created in Neptune, defaults to PROPERTYGRAPH. RDF support is yet to be implemented'
+    Description: 'Type of graph created in Neptune, defaults to PROPERTYGRAPH. Allowed values: PROPERTYGRAPH, RDF'
     Type: String
     Default: 'PROPERTYGRAPH'
+    AllowedValues: ["PROPERTYGRAPH", "RDF"]
   GlueDatabaseName:
     Description: 'Name of the Neptune cluster specific Glue Database that contains schemas of graph vertices'
     Type: String
@@ -131,4 +132,4 @@ Resources:
         - VPCAccessPolicy: {}
       VpcConfig:
         SecurityGroupIds: !Ref SecurityGroupIds
-        SubnetIds: !Ref SubnetIds
\ No newline at end of file
+        SubnetIds: !Ref SubnetIds
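
With the new AllowedValues constraint, CloudFormation rejects any graph type other than the two supported values at deploy time. As a hypothetical sketch of exercising the parameter when deploying a packaged copy of this template (the stack name and template file below are placeholders, not part of this PR):

```
# Hypothetical invocation; assumes the SAM template was already packaged to packaged.yaml.
aws cloudformation deploy \
  --stack-name athena-neptune-connector \
  --template-file packaged.yaml \
  --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND \
  --parameter-overrides NeptuneGraphType=RDF
```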
141 changes: 141 additions & 0 deletions in athena-neptune/docs/aws-glue-sample-scripts/PropertyGraph.md
@@ -0,0 +1,141 @@
# Property Graph Glue Data Catalog Setup

Tables representing Property Graph nodes or edges derive their columns from the node or edge properties. For example, if we have a node labelled “country” with properties “type”, “code” and “desc”, we create a table named “country” in the Glue database with columns “type”, “code” and “desc”. Set the data types of the columns based on their data types in the property graph.

Refer to the diagram below:

![](./assets/connector-propertygraph.png)
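
For illustration, here is a minimal sketch of creating such a table with the AWS CLI. The `componenttype` value and the string column types are assumptions; see the sample script referenced below for the exact table definition the connector expects.

```
# Sketch only: a Glue table for the "country" node label.
# Assumes the Glue database "graph-database" already exists;
# the "componenttype" value is an assumption, not confirmed by this doc.
aws glue create-table \
  --database-name graph-database \
  --table-input '{
    "Name": "country",
    "Parameters": {"componenttype": "vertex"},
    "StorageDescriptor": {
      "Columns": [
        {"Name": "type", "Type": "string"},
        {"Name": "code", "Type": "string"},
        {"Name": "desc", "Type": "string"}
      ]
    }
  }'
```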

## Create AWS Glue Catalog Database and Tables

AWS Glue Catalog Database and Tables can be created either by using [Amazon Neptune Export Configuration](#create-aws-glue-database-and-tables-using-amazon-neptune-export-configuration) or [Manually](#create-aws-glue-database-and-tables-manually).

### Create AWS Glue Database and Tables using Amazon Neptune Export Configuration

You can use the sample node.js script [here](./automation/script.js) to create a Glue database named "graph-database" and the tables airport, country, continent, and route corresponding to the Air Routes Property Graph sample dataset. The node.js script uses the Amazon Neptune export configuration file; a sample export configuration for the Air Routes dataset is in the [folder](./automation).

From inside the [folder](./automation), run the following commands.

Install dependencies:

```
npm install
```

Make sure you have access to your AWS environment via the CLI, then execute the script:

```
node script.js

```
If you are using a different dataset, replace config.json with the export output from your database. Refer to [neptune-export](https://github.com/awslabs/amazon-neptune-tools/tree/master/neptune-export) for how to export the configuration from an Amazon Neptune database; you have to download the source code and build it. Once you have built the neptune-export jar file, run the command below from a machine that can reach your Amazon Neptune cluster to generate the export configuration:

```
bin/neptune-export.sh create-pg-config -e <neptuneclusterendpoint> -d <outputfolderpath>

```

### Create AWS Glue Database and Tables manually


If you want to create the database and tables manually, you can use the sample shell script [here](./manual/sample-cli-script.sh) to create a Glue database named "graph-database" and the tables airport, country, continent, and route corresponding to the Air Routes Property Graph sample dataset.

If you're planning to use your own data set instead of the Air Routes sample dataset, then you need to modify the script according to your data structure.

Make sure the script has executable permissions once you download it:

```
chmod 755 sample-cli-script.sh
```
Make sure credentials are set up for your AWS CLI to work.

Replace &lt;aws-profile> with the AWS profile that carries your credentials, and replace &lt;aws-region> with the AWS region where you are creating the AWS Glue tables, which should be the same as your Neptune cluster's region.

```
./sample-cli-script.sh <aws-profile> <aws-region>
```


If all goes well, you now have the Glue database and tables required for your Athena Neptune connector setup, and you can move on to the steps mentioned [here](../neptune-connector-setup/).
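
As a quick check, you can list the tables the script created; with the sample dataset the output should include airport, country, continent and route:

```
aws glue get-tables \
  --database-name graph-database \
  --profile <aws-profile> --region <aws-region> \
  --query 'TableList[].Name'
```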

### Sample table post setup

![](./assets/table.png)

### Query examples

##### Graph Query

```
g.V().hasLabel("airport").as("source").out("route").as("destination").select("source","destination").by(id()).limit(10)
```

##### Equivalent Athena Query
```
SELECT
a.id as "source", c.id as "destination" FROM "graph-database"."airport" as a
inner join "graph-database"."route" as b
on a.id = b.out
inner join "graph-database"."airport" as c
on c.id = b."in"
limit 10;
```

## Custom query

The Neptune connector's custom query feature allows you to specify a custom Glue table that matches the response of a Gremlin query. For example, a Gremlin query like

```
g.V().hasLabel("airport").as("source").out("route").as("destination").select("source","destination").by(id()).limit(10)

```

maps to the following Glue table:

![](./assets/customquery-exampletable.png)

Refer to the example script [here](./manual/sample-cli-script.sh) for how to create such a table.

> **NOTE**
>
> The custom query feature supports only simple-type projections (for example int, long, string, datetime) as query output.
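
To make the shape of such a table concrete, here is a hypothetical AWS CLI sketch. The table name and parameter keys are assumptions for illustration (this PR's commit history mentions a componenttype of view for custom queries, and we assume the Gremlin text goes in a query table property); check the sample script for the exact keys.

```
# Hypothetical sketch; table name and parameter keys are assumptions.
aws glue create-table \
  --database-name graph-database \
  --table-input '{
    "Name": "airport_routes_view",
    "Parameters": {
      "componenttype": "view",
      "query": "g.V().hasLabel(\"airport\").as(\"source\").out(\"route\").as(\"destination\").select(\"source\",\"destination\").by(id()).limit(10)"
    },
    "StorageDescriptor": {
      "Columns": [
        {"Name": "source", "Type": "string"},
        {"Name": "destination", "Type": "string"}
      ]
    }
  }'
```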


### Example query patterns

##### Project node properties

```
g.V().hasLabel("airport").valueMap("code","city","country").limit(10000)
```

##### Project edge properties

```
g.E().hasLabel("route").valueMap("dist").limit(10000)
```

##### n-hop query with select clause

```
g.V().hasLabel("airport").as("source").out("route").as("destination").select("source","destination").by("code").limit(10)

```

##### n-hop query with project clause
```
g.V().hasLabel("airport").as("s").out("route").as("d").project("source","destination").by(select("s").id()).by(select("d").id()).limit(10)

```

### Sample table post setup

![](./assets/customtable.png)

### Benefits

Using the custom query feature, you can project the output of a Gremlin query directly. This avoids the effort of writing a lengthy SQL query against the graph model, and it gives you more control over how the table schema is designed for analysis. You can also limit the number of records to retrieve in the Gremlin query itself.




173 changes: 173 additions & 0 deletions in athena-neptune/docs/aws-glue-sample-scripts/RDF.md
@@ -0,0 +1,173 @@
# RDF Glue Data Catalog Setup

To query RDF data using this connector, create a table in the Glue data catalog that maps to RDF data in the Neptune database. There are two styles of mapping available:

- **Class-based**: The table represents an RDFS class. Each row represents an RDF resource whose type is that class. Columns represent datatype or object properties. See the airport_rdf example below.
- **Query-based**: The table represents the resultset of a SPARQL query. Each row is one result. See the route_rdf example below.

In each case, you define columns and use table properties to map RDF to that column structure. Here is a summary of the table properties that control the RDF mapping:

|Property|Values|Description|
|--------|------|-----------|
|componenttype|rdf|Indicates this is an RDF table.|
|querymode|class, sparql|Whether the mapping is class-based or query-based.|
|sparql|SPARQL query to use to find resultset.|Only if querymode='sparql'. Omit prefixes. Define prefixes as table properties.|
|classuri|Class of resources to find|In curie form prefix:classname. Only if querymode='class'. Connector will query for resources whose RDF type is this classuri.|
|subject|Name of the column that is the subject in triples.|Only if querymode='class'. In the query for resources of the classuri, this column holds the subject URI.|
|preds_prefix|Prefix for predicates to find|Only if querymode='class'. If that prefix is P, you must define property prefix_P. For each resource, the connector finds column values as objects of predicates preds_prefix:colname|
|prefix_|Default prefix for query| URI prefix without angled brackets|
|prefix_X|Prefix known by shortform X| URI prefix without angled brackets|
|strip_uri|true, false|Return only the local name of URIs in the resultset.|

## Examples
We provide examples of both class-based and query-based tables. The examples use the Air Routes dataset.

### Step 1: Create Neptune Cluster and Seed Air Routes Data in Neptune
In your Neptune cluster, seed the Air Routes dataset as RDF using the instructions in [../neptune-cluster-setup/README.md](../neptune-cluster-setup/README.md).

### Step 2: Create Glue Tables
Create the Glue tables. We provide a shell script [manual/sample-cli-script.sh](manual/sample-cli-script.sh).

Make sure the script has executable permissions once you download it:

```
chmod 755 sample-cli-script.sh
```
Make sure credentials are set up for your AWS CLI to work.

Replace &lt;aws-profile> with the AWS profile that carries your credentials, and replace &lt;aws-region> with the AWS region where you are creating the AWS Glue tables, which should be the same as your Neptune cluster's region.

```
./sample-cli-script.sh <aws-profile> <aws-region>

```

Next we study the structure of each of the tables created.

### Step 3: Understanding Class-Based Tables
The **airport_rdf** table is a class-based table. Its rows represent individual RDF resources that have a specified RDFS class. The column names represent predicates. The column values represent objects.

The next figure shows the column structure of the table:

![](./assets/airportrdf_schema.png)

We set the table properties as follows:
- componenttype:rdf
- querymode: class
- classuri: class:Airport
- subject: id
- preds_prefix: prop
- prefix_class: http://kelvinlawrence.net/air-routes/class/
- prefix_prop: http://kelvinlawrence.net/air-routes/datatypeProperty/

The next figure shows the properties:

![](./assets/airportrdf_props.png)
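
For reference, here is a sketch of attaching these properties when creating the table via the AWS CLI, assuming the table properties map to the Glue table's Parameters field. The column list is abbreviated and all columns are assumed to be strings; the full column set matches the SELECT variables in the query below.

```
aws glue create-table \
  --database-name graph-database-rdf \
  --table-input '{
    "Name": "airport_rdf",
    "Parameters": {
      "componenttype": "rdf",
      "querymode": "class",
      "classuri": "class:Airport",
      "subject": "id",
      "preds_prefix": "prop",
      "prefix_class": "http://kelvinlawrence.net/air-routes/class/",
      "prefix_prop": "http://kelvinlawrence.net/air-routes/datatypeProperty/"
    },
    "StorageDescriptor": {
      "Columns": [
        {"Name": "id", "Type": "string"},
        {"Name": "code", "Type": "string"},
        {"Name": "city", "Type": "string"},
        {"Name": "country", "Type": "string"}
      ]
    }
  }'
```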

We set **componenttype** to **rdf** to indicate this is an RDF-based table. We set **querymode** to **class** to indicate the RDF mapping is class-based. We indicate the class using **classuri**. The value is given in CURIE form as **class:Airport**. Here **class** is a prefix. The full value is defined by the **prefix_class** property. We can see that the fully-qualified class URI is **http://kelvinlawrence.net/air-routes/class/Airport**.

One column must map to the URI of the resource itself. That is given by **subject**. In this example, the **subject** is **id**. Each other column must map to the local name of the predicate. **prefix_prop** is the prefix of the predicates.

The connector creates a SPARQL query based on these settings, runs it against the Neptune cluster, and returns the results in the tabular form specified. The query for the above example is the following:

```
PREFIX class: <http://kelvinlawrence.net/air-routes/class/> # from prefix_class
PREFIX prop: <http://kelvinlawrence.net/air-routes/datatypeProperty/> # from prefix_prop

# each variable selected must be a column name
SELECT ?id ?type ?code ?icao ?desc ?region ?runways ?longest ?elev ?country ?city ?lat ?lon
WHERE {
?id rdf:type class:Airport . # id is subject, class prefix is defined by prefix_class, Airport is defined by classuri
?id prop:type ?type . # type is a column name, prop is prefix defined by prefix_prop
?id prop:code ?code .
?id prop:icao ?icao .
?id prop:desc ?desc .
?id prop:region ?region .
?id prop:runways ?runways .
?id prop:longest ?longest .
?id prop:elev ?elev .
?id prop:country ?country .
?id prop:city ?city .
?id prop:lat ?lat .
?id prop:lon ?lon .
}
```
In the above, the **?id** variable brings back a URI rather than a literal. The connector returns it as a string containing the full URI. You can specify **strip_uri** to force the connector to return only the local part, that is, the part after the final hash or slash.

The class-based approach is suitable if your RDF model follows the convention where resources belong to a specific class and properties have the same predicate URI structure. If your data does not follow this approach, or if you simply need more flexibility, use the query-based approach discussed below.

To apply this approach to your own dataset, we recommend running a SPARQL query against your data to introspect its structure. The following query lists up to 1000 distinct predicates used by resources of a given class; these predicates can then become columns in the tabular representation of that class.

```
select distinct ?p where { ?s rdf:type <MYCLASS> . ?s ?p ?o } LIMIT 1000
```
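
One way to run this introspection query is against the cluster's SPARQL HTTP endpoint, for example with curl from a host that can reach the cluster (replace the endpoint placeholder and MYCLASS with your values):

```
curl -s -X POST https://<neptune-endpoint>:8182/sparql \
  --data-urlencode 'query=select distinct ?p where { ?s rdf:type <MYCLASS> . ?s ?p ?o } LIMIT 1000'
```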
### Step 4: Understanding Query-Based Tables
The **route_rdf** table is a query-based table. Its rows represent results from a SPARQL select query.

![](./assets/routerdf.png)

We set the table properties as follows:
- querymode: sparql
- sparql: select ?incode ?outcode ?dist where { ?resin op:route ?resout . GRAPH ?route { ?resin op:route ?resout } . ?route prop:dist ?dist . ?resin prop:code ?incode .?resout prop:code ?outcode . }
- prefix_prop: http://kelvinlawrence.net/air-routes/datatypeProperty/
- prefix_op: http://kelvinlawrence.net/air-routes/objectProperty/
- strip_uri: true

The connector runs the SPARQL query given by **sparql**, adding the prefixes given by **prefix_prop** and **prefix_op**. For clarity, we add comments to explain the query.

```
PREFIX prop: <http://kelvinlawrence.net/air-routes/datatypeProperty/>
PREFIX op: <http://kelvinlawrence.net/air-routes/objectProperty/>

select ?incode ?outcode ?dist where {
?resin op:route ?resout . # Find two airport resources with an op:route relationship
GRAPH ?route { ?resin op:route ?resout } . # The distance of the route is modeled as a named graph. Get the route
?route prop:dist ?dist . # Get distance from named graph
?resin prop:code ?incode . # Get airport code of first airport
?resout prop:code ?outcode . # Get airport code of second airport
}
```
The connector maps the results **incode**, **outcode**, **dist** from SPARQL to the column structure of the table.

Query mode is the most flexible way to map RDF to a table structure. We recommend testing the query by running it directly against the Neptune cluster. When you are happy with its results, use that query to define the Glue table.

The **route_rdf_nopfx** table is similar to **route_rdf**, except that the prefixes are included in the SPARQL query itself rather than kept in separate table properties.

![](./assets/routerdf_nopfx.png)
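
Reconstructed from the query above (not copied verbatim from the sample script), the **sparql** property of **route_rdf_nopfx** would look something like this, with the prefixes inlined:

```
PREFIX prop: <http://kelvinlawrence.net/air-routes/datatypeProperty/>
PREFIX op: <http://kelvinlawrence.net/air-routes/objectProperty/>
select ?incode ?outcode ?dist where {
  ?resin op:route ?resout .
  GRAPH ?route { ?resin op:route ?resout } .
  ?route prop:dist ?dist .
  ?resin prop:code ?incode .
  ?resout prop:code ?outcode .
}
```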

### Step 5: Deploy the Athena Connector
Deploy the Athena connector using RDF as the graph type. See [../neptune-connector-setup/README.md](../neptune-connector-setup/README.md).

In this example, use the following settings:

- ApplicationName: AthenaNeptuneConnectorRDF
- AthenaCatalogName: athena-catalog-neptune-rdf
- GlueDatabaseName: graph-database-rdf
- NeptuneGraphType: RDF

### Step 6: Query
Once the connector is deployed, you can run SQL queries against the Athena service to retrieve this RDF data.

The following query accesses the class-based table to retrieve 100 airports.

```
select * from "graph-database-rdf"."airport_rdf"
LIMIT 100
```

The following query accesses the query-based table to retrieve 100 routes. You can run this against either the route_rdf or route_rdf_nopfx table.

```
select * from "graph-database-rdf"."route_rdf"
LIMIT 100
```

The following query accesses the query-based table to retrieve routes from YOW airport. You can run this against either the route_rdf or route_rdf_nopfx table.

```
select * from "graph-database-rdf"."route_rdf"
where incode='YOW'
```

