Improved Neptune connector docs, CFN deploy for example (#1998)

awslabs · Jun 24, 2024 · fcf55b3
1 parent 03c110b

Showing 9 changed files with 351 additions and 300 deletions.
diff --git a/athena-neptune/docs/README.md b/athena-neptune/docs/README.md
@@ -0,0 +1,8 @@
+# Neptune Athena Connector Example
+
+To get started with the Neptune Athena Connector, follow these steps:
+
+1. Create an Amazon Neptune database cluster, if you do not already have one. Then populate the database with the sample `air routes` dataset, which is available in both Labeled Property Graph (LPG) and Resource Description Framework (RDF) formats. You may load both if you would like to test the connector against both formats. For more, see [neptune-cluster-setup/README.md](neptune-cluster-setup/README.md).
+2. The connector requires you to define a table structure in AWS Glue. Follow [aws-glue-sample-scripts/README.md](aws-glue-sample-scripts/README.md) to set up the tables for the `air routes` dataset.
+3. Deploy the connector following [neptune-connector-setup/README.md](neptune-connector-setup/README.md). To use both LPG and RDF, deploy two copies of the connector.
+
diff --git a/athena-neptune/docs/aws-glue-sample-scripts/PropertyGraph.md b/athena-neptune/docs/aws-glue-sample-scripts/PropertyGraph.md
@@ -1,141 +1,64 @@
 # Property Graph Glue Data Catalog Setup
 
-Column types for tables representing Property Graph nodes or edges map from node or edge property tables. As an example, if we have a node labelled “country” with properties “type”, “code” and “desc”.  In the Glue database, we will create a table named “country” with columns “type”, “code” and “desc”. Setup data types of the columns based on their data types in the property graph. 
+To query property graph data using this connector, create a table in the Glue data catalog that maps to property graph data in the Neptune database. There are three styles of mapping available:
 
-Refer to the diagram below:
+- *Vertex-based*: The table represents a vertex with a specified label in the graph. Each row represents a specific vertex. Its columns include the vertex ID and vertex property values. Example tables include `airport`, `country`, and `continent`.
+- *Edge-based*: The table represents an edge with a specified label in the graph. Each row represents a specific edge. Its columns include the edge ID, source and target vertex IDs, and edge property values. An example is the `route` table.
+- *Query-based*: The table represents the resultset of a Gremlin query. Each row is one result. An example is the `customairport` table.
 
-![](./assets/connector-propertygraph.png)
+Columns are named the same as their properties. Reserved column names are:
+- `id`: The vertex ID if `componenttype` is `vertex`; the edge ID if `componenttype` is `edge`.
+- `out`: If `componenttype` is `edge`, this is the vertex ID of the *from* vertex.
+- `in`: If `componenttype` is `edge`, this is the vertex ID of the *to* vertex.
 
-## Create AWS Glue Catalog Database and Tables
+Advanced properties for the table are:
 
-AWS Glue Catalog Database and Tables can be created either by using [Amazon Neptune Export Configuration](#create-aws-glue-database-and-tables-using-amazon-neptune-export-configuration) or [Manually](#create-aws-glue-database-and-tables-manually). 
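+|Property|Values|Description|
+|--------|------|-----------|
+|componenttype|`vertex`, `edge`, or `view`|The mapping style of the table: vertex-based, edge-based, or query-based.|
+|glabel|Vertex label or edge type|The label to match in the graph. If not specified, this is assumed to be the table name.|
+|query|Gremlin query|The query whose results populate the table. Used if `componenttype` is `view`.|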
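
Editorial sketch (not part of the commit): the advanced properties above can be pictured as entries in a Glue table definition. The Python below builds dictionaries shaped like the Glue `TableInput` structure. It assumes the advanced properties are stored in the table's `Parameters` map; the helper name `glue_table_input` and all column types are hypothetical and should be checked against your own data.

```python
import json


def glue_table_input(name, columns, componenttype, glabel=None, query=None):
    """Build a dict shaped like a Glue TableInput for a property graph table.

    Assumption: the connector reads componenttype, glabel, and query from
    the table's Parameters map. Column types are illustrative only.
    """
    if componenttype not in ("vertex", "edge", "view"):
        raise ValueError("componenttype must be vertex, edge, or view")
    if componenttype == "view" and not query:
        raise ValueError("a view table needs a Gremlin query")

    parameters = {"componenttype": componenttype}
    if glabel:
        # Only needed when the graph label differs from the table name.
        parameters["glabel"] = glabel
    if query:
        parameters["query"] = query

    return {
        "Name": name,
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns]
        },
        "Parameters": parameters,
    }


# Vertex table: id plus vertex properties.
airport = glue_table_input(
    "airport",
    [("id", "string"), ("type", "string"), ("code", "string"),
     ("icao", "string"), ("desc", "string")],
    "vertex",
)

# Edge table: reserved id, out, and in columns plus edge properties.
route = glue_table_input(
    "route",
    [("id", "string"), ("out", "string"), ("in", "string"), ("dist", "int")],
    "edge",
)

# View table: columns match the keys selected by the Gremlin query.
custom = glue_table_input(
    "customairport",
    [("source", "string"), ("destination", "string")],
    "view",
    query='g.V().hasLabel("airport").as("source").out("route")'
          '.as("destination").select("source","destination")'
          '.by("code").limit(10)',
)

print(json.dumps(route, indent=2))
```

A dictionary like this could be passed as the table input when creating the table through the Glue API or CLI, once the names and types have been confirmed for your dataset.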
 
-### Create AWS Glue Database and Tables using Amazon Neptune Export Configuration
+## Examples
 
-You can use the sample node.js script [here](./automation/script.js) to create a Glue Database by the name "graph-database" and tables: airport, country, continent and route corresponding to the Air Routes Property Graph sample dataset. The node.js script uses the Amazon Neptune export configuration file. There is a sample export configuration for the Air Routes sample dataset in the [folder](./automation).
-
-From inside the [folder](./automation), run these commands
-
-Install dependencies
-
-```
-npm install
-```
-
-Make sure you have access to your AWS environment via CLI and Execute the script
-
-```
-node script.js
-
-```
-If you are using a different dataset make sure to replace the config.json with export output from your database. Refer [this](https://github.com/awslabs/amazon-neptune-tools/tree/master/neptune-export) for how to export configuration from Amazon Neptune database.  You have to download the source code and build it. Once you have built the neptune-export jar file, run the below command from machine where your Amazon Neptune cluster is accessible, to generated export configuration
-
-```
-bin/neptune-export.sh create-pg-config -e <neptuneclusterendpoint> -d <outputfolderpath>
-
-```
-
-### Create AWS Glue Database and Tables manually
-
-
-If you want to create database and tables manually, you can use the sample shell script [here](./manual/sample-cli-script.sh) to create a Glue Database by the name "graph-database" and tables: airport, country, continent and route  corresponding to the Air Routes Property Graph sample dataset. 
-
-If you're planning to use your own data set instead of the Air Routes sample dataset, then you need to modify the script according to your data structure. 
-
-Ensure to have the right executable permissions on the script once you download it.
-
-```
-chmod 755 sample-cli-script.sh
-```
-Ensure to setup credentials for your AWS CLI to work.
-
-Replace &lt;aws-profile> with the AWS profile name that carries your credentials and replace &lt;aws-region> with AWS region where you are creating the AWS Glue tables which should be the same as your Neptune Cluster's AWS region.
-
-```
-./sample-cli-script.sh  <aws-profile> <aws-region>
-```
-
-
-If all goes well you now have the Glue Database and Tables that are required for your Athena Neptune Connector setup and you can move on to those steps mentioned [here](../neptune-connector-setup/).
-
-### Sample table post setup
+The next screenshot shows columns and advanced properties for the sample `airport` table that maps to airport vertices in Neptune. It is a vertex table, indicated by the `componenttype` of `vertex`. Its columns include `id` (the airport vertex ID) plus `type`, `code`, `icao`, and `desc` (vertex properties).
 
 ![](./assets/table.png)
 
-### Query examples
-
-##### Graph Query
+Here is an edge table for `route`. Columns include built-in `id`, `out`, and `in`. The `dist` column maps to an edge property of the `route` edge.
 
-```
-g.V().hasLabel("airport").as("source").out("route").as("destination").select("source","destination").by(id()).limit(10)
-```
-
-#####  Equivalent Athena Query
-```
-SELECT 
-a.id as "source",b.id as "destination" FROM "graph-database"."airport" as a 
-inner join "graph-database"."route" as b 
-on a.id = b.out
-inner join "graph-database"."airport" as c 
-on c.id = b."in"
-limit 10;
-```
+![](./assets/table_route.png)
 
-## Custom query
+Finally, here is a table that presents a custom view. Notice that `componenttype` is `view`.
 
-Neptune connector custom query feature allows you to specify a custom Glue table, which matches response of a Gremlin Query. For example a gremlin query like 
+![](./assets/table_custom.png)
 
+The `query` property is:
 ```
-g.V().hasLabel("airport").as("source").out("route").as("destination").select("source","destination").by(id()).limit(10)
-
+g.V().hasLabel("airport").as("source").out("route").as("destination").select("source","destination").by("code").limit(10)
 ```
 
-matches to a Glue table 
-
-![](./assets/customquery-exampletable.png)
-
-Refer example scripts on how to create a table [here](./manual/sample-cli-script.sh)
+Columns are `source` and `destination`, which are the values returned by the Gremlin query above.
 
-> **NOTE**
->
-> Custom query feature allows simple type (example int,long,string,dateime) projections as query output
+Run SQL queries against the Athena service to retrieve this property graph data. 
 
-
-### Example query patterns 
-
-##### project node properties
+The following query retrieves 100 airports.
 
 ```
-g.V().hasLabel("airport").valueMap("code","city","country").limit(10000)
+select * from "graph-database"."airport"
+LIMIT 100
 ```
 
-##### project edge properties
+The following query retrieves 100 routes.
 
 ```
-g.E().hasLabel("route").valueMap("dist").limit(10000)
+select * from "graph-database"."route"
+LIMIT 100
 ```
 
-##### n hop query with select clause
+The following query uses the custom view to get source-destination routes:
 
 ```
-g.V().hasLabel("airport").as("source").out("route").as("destination").select("source","destination").by("code").limit(10)
-
-```
-
-##### n hop query with project clause
+select * from "graph-database"."customairport"
+LIMIT 100
 ```
-g.V().hasLabel("airport").as("s").out("route").as("d").project("source","destination").by(select("s").id()).by(select("d").id()).limit(10)
-
-```
-
-### Sample table post setup
-
-![](./assets/customtable.png)
-
-###  Benefits
-
-Using custom query feature you can project output of a gremlin query directly. This helps to avoid the effort to write a lengthly sql query on the graph model. It also allows more control on how the table schema should be designed for analysis purpose. You can limit the number of records to retrieve in the gremlin query itself.
-
-
-
-
diff --git a/athena-neptune/docs/aws-glue-sample-scripts/RDF.md b/athena-neptune/docs/aws-glue-sample-scripts/RDF.md
@@ -2,8 +2,8 @@
 
 To query RDF data using this connector, create a table in the Glue data catalog that maps to RDF data in the Neptune database. There are two styles of mapping available:
 
-- **Class-based**: The table represents an RDFS class. Each row represents an RDF resource whose type of that class. Columns represent datatype or object properties. See the airport_rdf example below.
-- **Query-based**: The table represents the resultset of a SPARQL query. Each row is one result. See the route_rdf example below.
+- **Class-based**: The table represents an RDFS class. Each row represents an RDF resource whose type is that class. Columns represent datatype or object properties. See the `airport_rdf` example below.
+- **Query-based**: The table represents the resultset of a SPARQL query. Each row is one result. See the `route_rdf` example below.
 
 In each case, you define columns and use table properties to map RDF to that column structure. Here is a summary of table properties to indicate RDF mapping:
 
@@ -22,30 +22,6 @@ In each case, you define columns and use table properties to map RDF to that col
 ## Examples
 We provide examples of both class-based and query-based tables. The examples use the Air Routes dataset. 
 
-### Step 1: Create Neptune Cluster and Seed Air Routes Data in Neptune
-In your Neptune cluster, seed the Air Routes dataset as RDF using the instructions in [../neptune-cluster-setup/README.md](../neptune-cluster-setup/README.md). 
-
-### Step 2: Create Glue Tables
-Create the Glue tables. We provide a shell script [manual/sample-cli-script.sh](manual/sample-cli-script.sh). 
-
-Ensure to have the right executable permissions on the script once you download it.
-
-```
-chmod 755 sample-cli-script.sh
-```
-Ensure to setup credentials for your AWS CLI to work.
-
-Replace &lt;aws-profile> with the AWS profile name that carries your credentials and replace &lt;aws-region> with AWS region where you are creating the 
-AWS Glue tables which should be the same as your Neptune Cluster's AWS region.
-
-```
-./sample-cli-script.sh  <aws-profile> <aws-region>
-
-```
-
-Next we study the structure of each of the tables created.
-
-### Step 3: Understanding Class-Based Tables
 The **airport_rdf** table is a class-based table. Its rows represent individual RDF resources that have a specified RDFS class. The column names represent predicates. The column values represent objects. 
 
 The next figure shows the column structure of the table:
@@ -102,7 +78,6 @@ To apply this approach to your own dataset, we recommend running a SPARQL query
 ```
 select distinct ?p where { ?s rdf:type #MYCLASS . ?s ?p ?o } LIMIT 1000
 ```
-### Step 4: Understanding Query-Based Tables
 The **route_rdf** table is a query-based table. Its rows represent results from a SPARQL select query.
 
 ![](./assets/routerdf.png)
@@ -136,18 +111,7 @@ The **route_rdf_nopfx** table is similar to **route_rdf** except the prefixes ar
 
 ![](./assets/routerdf_nopfx.png)
 
-### Step 5: Deploy the Athena Connector
-Deploy the Athena connector using RDF as the graph type. See [../neptune-connector-setup/README.md](../neptune-connector-setup/README.md). 
-
-In this example, use the following settings:
-
-- ApplicationName: AthenaNeptuneConnectorRDF
-- AthenaCatalogName: athena-catalog-neptune-rdf
-- GlueDatabaseName: graph-database-rdf
-- NeptuneGraphType: RDF
-
-### Step 6: Query
-Once connector is deployed, you can run SQL queries against the Athena service to retrieve this RDF data. 
+Run SQL queries against the Athena service to retrieve this RDF data. 
 
 The following query accesses the class-based table to retrieve 100 airports.