Improved Neptune connector docs, CFN deploy for example (#1998)
mhavey authored Jun 24, 2024
1 parent 03c110b commit fcf55b3
Showing 9 changed files with 351 additions and 300 deletions.
8 changes: 8 additions & 0 deletions athena-neptune/docs/README.md
@@ -0,0 +1,8 @@
# Neptune Athena Connector Example

To get started with the Neptune Athena Connector, follow these steps:

1. Create an Amazon Neptune database cluster, if you do not already have one. Then populate the database with the sample `air routes` dataset, which is available in both Labeled Property Graph (LPG) and Resource Description Framework (RDF) formats. You may load both if you would like to test the connector against both formats. For more, see [neptune-cluster-setup/README.md](neptune-cluster-setup/README.md).
2. The connector requires you to define a table structure in AWS Glue. Follow [aws-glue-sample-scripts/README.md](aws-glue-sample-scripts/README.md) to set up the tables for the `air routes` dataset.
3. Deploy the connector following [neptune-connector-setup/README.md](neptune-connector-setup/README.md). To use both LPG and RDF, deploy two copies of the connector.

143 changes: 33 additions & 110 deletions athena-neptune/docs/aws-glue-sample-scripts/PropertyGraph.md
@@ -1,141 +1,64 @@
# Property Graph Glue Data Catalog Setup

Column types for tables representing Property Graph nodes or edges map from node or edge property tables. For example, if we have a node labelled “country” with properties “type”, “code” and “desc”, then in the Glue database we create a table named “country” with columns “type”, “code” and “desc”. Set the data types of the columns based on their data types in the property graph.
To query property graph data using this connector, create a table in the Glue data catalog that maps to property graph data in the Neptune database. There are three styles of mapping available:

Refer to the diagram below:
- *Vertex-based*: The table represents a vertex with a specified label in the graph. Each row represents a specific vertex. Its columns include the vertex ID and vertex property values. Example tables include the `airport`, `country`, and `continent` tables.
- *Edge-based*: The table represents an edge with a specified label in the graph. Each row represents a specific edge. Its columns include the edge ID, source and target vertex IDs, and edge property values. An example is the `route` table.
- *Query-based*: The table represents the resultset of a Gremlin query. Each row is one result. An example is the `customairport` table.

![](./assets/connector-propertygraph.png)
Columns are named the same as their properties. Reserved column names are:
- `id`: The vertex ID if `componenttype` is `vertex`, or the edge ID if `componenttype` is `edge`.
- `out`: If `componenttype` is `edge`, this is the vertex ID of the *from* vertex.
- `in`: If `componenttype` is `edge`, this is the vertex ID of the *to* vertex.

## Create AWS Glue Catalog Database and Tables
Advanced properties for the table are:

AWS Glue Catalog Database and Tables can be created either by using [Amazon Neptune Export Configuration](#create-aws-glue-database-and-tables-using-amazon-neptune-export-configuration) or [Manually](#create-aws-glue-database-and-tables-manually).
|Property|Values|Description|
|--------|------|-----------|
|componenttype|`vertex`, `edge`, or `view`|Mapping style of the table: vertex-based, edge-based, or query-based|
|glabel|Vertex label or edge type|If not specified, this is assumed to be the table name|
|query|Gremlin query|Used if `componenttype` is `view`|
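
As a rough illustration of how these properties fit together, the following sketch builds the `TableInput` structure accepted by the AWS Glue `CreateTable` API for a vertex table and an edge table. The helper name `graph_table_input` and the column types are assumptions made for this example and are not taken from the connector's sample scripts; set the types to match the property types in your graph.

```python
from typing import Dict, Optional


def graph_table_input(
    name: str,
    component_type: str,
    columns: Dict[str, str],
    glabel: Optional[str] = None,
) -> dict:
    """Build a Glue TableInput dict for a vertex or edge table (hypothetical helper)."""
    if component_type not in ("vertex", "edge"):
        raise ValueError("component_type must be 'vertex' or 'edge'")
    parameters = {"componenttype": component_type}
    if glabel is not None:
        # Only needed when the graph label differs from the table name.
        parameters["glabel"] = glabel
    return {
        "Name": name,
        "StorageDescriptor": {
            "Columns": [{"Name": col, "Type": typ} for col, typ in columns.items()]
        },
        "Parameters": parameters,
    }


# Column names follow the sample tables; the column types are assumed.
airport = graph_table_input(
    "airport",
    "vertex",
    {"id": "string", "type": "string", "code": "string", "icao": "string", "desc": "string"},
)
route = graph_table_input(
    "route",
    "edge",
    {"id": "string", "out": "string", "in": "string", "dist": "int"},
)

# To register a table (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   boto3.client("glue").create_table(DatabaseName="graph-database", TableInput=airport)
```

Keeping the dictionary construction separate from the API call makes it easy to inspect the table definition before anything is created in the catalog.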

### Create AWS Glue Database and Tables using Amazon Neptune Export Configuration
## Examples

You can use the sample Node.js script [here](./automation/script.js) to create a Glue database named "graph-database" with the tables airport, country, continent and route, which correspond to the Air Routes Property Graph sample dataset. The script uses the Amazon Neptune export configuration file. There is a sample export configuration for the Air Routes sample dataset in the [folder](./automation).

From inside the [folder](./automation), run these commands

Install dependencies

```
npm install
```

Make sure you have access to your AWS environment via the CLI, then execute the script:

```
node script.js
```
If you are using a different dataset, make sure to replace config.json with the export output from your database. Refer to [this](https://github.com/awslabs/amazon-neptune-tools/tree/master/neptune-export) for how to export the configuration from an Amazon Neptune database. You have to download the source code and build it. Once you have built the neptune-export jar file, run the command below from a machine that can reach your Amazon Neptune cluster to generate the export configuration:

```
bin/neptune-export.sh create-pg-config -e <neptuneclusterendpoint> -d <outputfolderpath>
```

### Create AWS Glue Database and Tables manually


If you want to create database and tables manually, you can use the sample shell script [here](./manual/sample-cli-script.sh) to create a Glue Database by the name "graph-database" and tables: airport, country, continent and route corresponding to the Air Routes Property Graph sample dataset.

If you're planning to use your own data set instead of the Air Routes sample dataset, then you need to modify the script according to your data structure.

Make sure the script has executable permissions once you download it.

```
chmod 755 sample-cli-script.sh
```
Make sure credentials are set up for your AWS CLI.

Replace `<aws-profile>` with the AWS profile name that carries your credentials, and replace `<aws-region>` with the AWS region where you are creating the AWS Glue tables, which should be the same as your Neptune cluster's AWS region.

```
./sample-cli-script.sh <aws-profile> <aws-region>
```


If all goes well, you now have the Glue database and tables that are required for your Athena Neptune Connector setup, and you can move on to the steps described [here](../neptune-connector-setup/).

### Sample table post setup
The next screenshot shows columns and advanced properties for the sample `airport` table that maps to airport vertices in Neptune. It is a vertex table, indicated by the `componenttype` of `vertex`. Its columns include `id` (the airport vertex ID) plus `type`, `code`, `icao`, and `desc` (vertex properties).

![](./assets/table.png)

### Query examples

##### Graph Query
Here is an edge table for `route`. Columns include built-in `id`, `out`, and `in`. The `dist` column maps to an edge property of the `route` edge.

```
g.V().hasLabel("airport").as("source").out("route").as("destination").select("source","destination").by(id()).limit(10)
```

##### Equivalent Athena Query
```
SELECT
a.id as "source",c.id as "destination" FROM "graph-database"."airport" as a
inner join "graph-database"."route" as b
on a.id = b.out
inner join "graph-database"."airport" as c
on c.id = b."in"
limit 10;
```
![](./assets/table_route.png)

## Custom query
Finally, here is a table that presents a custom view. Notice `componenttype` is `view`.

The Neptune connector's custom query feature lets you specify a custom Glue table that matches the response of a Gremlin query. For example, a Gremlin query like
![](./assets/table_custom.png)

The `query` property is
```
g.V().hasLabel("airport").as("source").out("route").as("destination").select("source","destination").by(id()).limit(10)
g.V().hasLabel("airport").as("source").out("route").as("destination").select("source","destination").by("code").limit(10)
```

matches to a Glue table

![](./assets/customquery-exampletable.png)

Refer to the example script [here](./manual/sample-cli-script.sh) for how to create a table.
Columns are `source` and `destination`, which are the values returned by the Gremlin query above.
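
A minimal sketch of how such a view table could be described for the AWS Glue `CreateTable` API is shown below. The helper name `view_table_input` and the `string` column types are assumptions for illustration; the table name `customairport`, the `componenttype` and `query` properties, and the Gremlin query come from the text above.

```python
from typing import Dict


def view_table_input(name: str, query: str, columns: Dict[str, str]) -> dict:
    """Build a Glue TableInput dict for a query-based (view) table (hypothetical helper)."""
    return {
        "Name": name,
        "StorageDescriptor": {
            # Column names must match the keys returned by the Gremlin query.
            "Columns": [{"Name": col, "Type": typ} for col, typ in columns.items()]
        },
        "Parameters": {"componenttype": "view", "query": query},
    }


GREMLIN_QUERY = (
    'g.V().hasLabel("airport").as("source").out("route").as("destination")'
    '.select("source","destination").by("code").limit(10)'
)

# Column types are assumed to be strings because the query projects airport codes.
customairport = view_table_input(
    "customairport",
    GREMLIN_QUERY,
    {"source": "string", "destination": "string"},
)

# To register the table (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   boto3.client("glue").create_table(DatabaseName="graph-database", TableInput=customairport)
```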

> **NOTE**
>
> The custom query feature supports projections of simple types (for example int, long, string, datetime) as query output.
Run SQL queries against the Athena service to retrieve this property graph data.


### Example query patterns

##### project node properties
The following query retrieves 100 airports.

```
g.V().hasLabel("airport").valueMap("code","city","country").limit(10000)
select * from "graph-database"."airport"
LIMIT 100
```

##### project edge properties
The following query retrieves 100 routes.

```
g.E().hasLabel("route").valueMap("dist").limit(10000)
select * from "graph-database"."route"
LIMIT 100
```
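
If you prefer to submit such queries from a script rather than the Athena console, one possible approach is sketched below. The helper `select_sample` is hypothetical and only builds the SQL text; the commented call uses the boto3 Athena client, and the catalog name and S3 output location in it are placeholders you would need to supply.

```python
def select_sample(database: str, table: str, limit: int = 100) -> str:
    """Return a SELECT statement that samples rows from a table (hypothetical helper)."""
    if limit <= 0:
        raise ValueError("limit must be positive")

    def quote(identifier: str) -> str:
        # Double-quote identifiers so names such as graph-database are accepted.
        return '"' + identifier.replace('"', '""') + '"'

    return f"select * from {quote(database)}.{quote(table)} LIMIT {int(limit)}"


sql = select_sample("graph-database", "route")
print(sql)

# To submit the query (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   boto3.client("athena").start_query_execution(
#       QueryString=sql,
#       QueryExecutionContext={"Catalog": "<your-athena-catalog-name>"},
#       ResultConfiguration={"OutputLocation": "s3://<your-bucket>/<prefix>/"},
#   )
```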

##### n hop query with select clause
The following query uses the custom view to get source-destination routes:

```
g.V().hasLabel("airport").as("source").out("route").as("destination").select("source","destination").by("code").limit(10)
```

##### n hop query with project clause
select * from "graph-database"."customairport"
LIMIT 100
```
g.V().hasLabel("airport").as("s").out("route").as("d").project("source","destination").by(select("s").id()).by(select("d").id()).limit(10)
```

### Sample table post setup

![](./assets/customtable.png)

### Benefits

Using the custom query feature, you can project the output of a Gremlin query directly. This avoids the effort of writing a lengthy SQL query on the graph model. It also gives you more control over how the table schema is designed for analysis. You can limit the number of records to retrieve in the Gremlin query itself.




42 changes: 3 additions & 39 deletions athena-neptune/docs/aws-glue-sample-scripts/RDF.md
@@ -2,8 +2,8 @@

To query RDF data using this connector, create a table in the Glue data catalog that maps to RDF data in the Neptune database. There are two styles of mapping available:

- **Class-based**: The table represents an RDFS class. Each row represents an RDF resource whose type is that class. Columns represent datatype or object properties. See the airport_rdf example below.
- **Query-based**: The table represents the resultset of a SPARQL query. Each row is one result. See the route_rdf example below.
- **Class-based**: The table represents an RDFS class. Each row represents an RDF resource whose type is that class. Columns represent datatype or object properties. See the `airport_rdf` example below.
- **Query-based**: The table represents the resultset of a SPARQL query. Each row is one result. See the `route_rdf` example below.

In each case, you define columns and use table properties to map RDF to that column structure. Here is a summary of table properties to indicate RDF mapping:

@@ -22,30 +22,6 @@ In each case, you define columns and use table properties to map RDF to that col
## Examples
We provide examples of both class-based and query-based tables. The examples use the Air Routes dataset.

### Step 1: Create Neptune Cluster and Seed Air Routes Data in Neptune
In your Neptune cluster, seed the Air Routes dataset as RDF using the instructions in [../neptune-cluster-setup/README.md](../neptune-cluster-setup/README.md).

### Step 2: Create Glue Tables
Create the Glue tables. We provide a shell script [manual/sample-cli-script.sh](manual/sample-cli-script.sh).

Make sure the script has executable permissions once you download it.

```
chmod 755 sample-cli-script.sh
```
Make sure credentials are set up for your AWS CLI.

Replace `<aws-profile>` with the AWS profile name that carries your credentials, and replace `<aws-region>` with the AWS region where you are creating the AWS Glue tables, which should be the same as your Neptune cluster's AWS region.

```
./sample-cli-script.sh <aws-profile> <aws-region>
```

Next we study the structure of each of the tables created.

### Step 3: Understanding Class-Based Tables
The **airport_rdf** table is a class-based table. Its rows represent individual RDF resources that have a specified RDFS class. The column names represent predicates. The column values represent objects.

The next figure shows the column structure of the table:
@@ -102,7 +78,6 @@ To apply this approach to your own dataset, we recommend running a SPARQL query
```
select distinct ?p where { ?s rdf:type #MYCLASS . ?s ?p ?o } LIMIT 1000
```
### Step 4: Understanding Query-Based Tables
The **route_rdf** table is a query-based table. Its rows represent results from a SPARQL select query.

![](./assets/routerdf.png)
@@ -136,18 +111,7 @@ The **route_rdf_nopfx** table is similar to **route_rdf** except the prefixes ar

![](./assets/routerdf_nopfx.png)

### Step 5: Deploy the Athena Connector
Deploy the Athena connector using RDF as the graph type. See [../neptune-connector-setup/README.md](../neptune-connector-setup/README.md).

In this example, use the following settings:

- ApplicationName: AthenaNeptuneConnectorRDF
- AthenaCatalogName: athena-catalog-neptune-rdf
- GlueDatabaseName: graph-database-rdf
- NeptuneGraphType: RDF

### Step 6: Query
Once the connector is deployed, you can run SQL queries against the Athena service to retrieve this RDF data.
Run SQL queries against the Athena service to retrieve this RDF data.

The following query accesses the class-based table to retrieve 100 airports.
