Row Batcher
The RowBatcher interface supports efficient export of all rows from a TDE view. In the typical case, the view contains the instances of an entity, and the exported entity instances provide input for downstream data consumers such as Business Intelligence tools or legacy databases.
The RowBatcher interface gets the rows in batches, formatting each batch either as a single CSV, JSON, or XML structure or as line-delimited JSON (with one JSON structure per row).
On the database server, the prerequisite is to define a TDE that matches the documents and projects the rows that populate the view.
To export different sets of rows for different purposes, construct a TDE view for each export. A single TDE can define multiple views.
Insert the TDE into the schemas database either before inserting the documents or afterward, followed by reindexing the documents.
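For instance, a minimal sketch of inserting a JSON TDE template with the Java API might look like the following (the host, port, credentials, template URI, and the main schema, employees view, and column names are hypothetical):
// a minimal sketch: write a JSON TDE template into the Schemas database
DatabaseClient schemasClient = DatabaseClientFactory.newClient(
    "localhost", 8000, "Schemas",
    new DatabaseClientFactory.DigestAuthContext("admin", "admin"));
String template =
    "{\"template\": {\"context\": \"/employee\", \"rows\": [ {" +
    "   \"schemaName\": \"main\", \"viewName\": \"employees\", \"columns\": [" +
    "     {\"name\": \"lastName\", \"scalarType\": \"string\", \"val\": \"lastName\"}," +
    "     {\"name\": \"status\",   \"scalarType\": \"string\", \"val\": \"status\"} ] } ] } }";
// TDE templates must belong to the http://marklogic.com/xdmp/tde collection
DocumentMetadataHandle metadata =
    new DocumentMetadataHandle().withCollections("http://marklogic.com/xdmp/tde");
schemasClient.newJSONDocumentManager().write("/tde/employees.json", metadata,
    new StringHandle(template).withFormat(Format.JSON));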
On the client, the application performs the following sequence of actions to export a view:
- Call the DatabaseClientFactory.newClient() factory to create a DatabaseClient.
- Call the DatabaseClient.newDataMovementManager() factory to create a DataMovementManager.
- Create a sample handle from one of the implementations of ContentHandle to specify how you want to handle the row content. For instance, to consume each batch of rows as a string, use a StringHandle; to consume each batch of rows as a Jackson JsonNode, use a JacksonHandle. See the section on Providing a Sample Handle For the Rows.
- Call the DataMovementManager.newRowBatcher() factory with the rows batch handle to create a RowBatcher, controlling concurrency and throughput with the withThreadCount() and withBatchSize() methods.
- Call RowBatcher.getRowManager() to get the RowManager, which you can use to specify the data type and row structure styles for the row batches. See the section on Setting the Data Type and Row Structure Styles.
- Call the RowManager.newPlanBuilder() factory to create a PlanBuilder and use the PlanBuilder to create a PlanBuilder.ModifyPlan to export the view. See the section on Building a Plan For Exporting the View.
- Call RowBatcher.withBatchView() to initialize the RowBatcher with the plan for exporting the view.
- Call RowBatcher.onSuccess() with a callback (typically, a lambda function) that receives each retrieved batch of rows and call RowBatcher.onFailure() with a callback that specifies how to respond to errors. See the section on Listening For Success and Failure.
- Call DataMovementManager.startJob() to start receiving rows and RowBatcher.awaitCompletion() to block the application thread until the last row has been processed by the success listener.
See an Example of code that executes these steps.
Providing a Sample Handle For the Rows
In the Java API, a handle acts as an adapter that provides a common interface for heterogeneous representations of content.
To indicate how to represent the exported rows, you pass a sample handle to the DataMovementManager.newRowBatcher() factory method. For example, to get each batch of exported rows as a string, construct the RowBatcher with a StringHandle.
The handle must implement both the ContentHandle and StructureReadHandle interfaces.
Some handles have implicit formats. For instance, JacksonHandle has an implicit format of Format.JSON. If the handle doesn't have an implicit format, you must specify the format of the sample handle before constructing the RowBatcher.
You may also specify the mime type on the handle. For instance, to export the rows as CSV, set the format to Format.TEXT and the mime type to text/csv.
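For example, assuming a DataMovementManager named moveMgr, either of the following sample handles could be passed to the newRowBatcher() factory:
// CSV rows: StringHandle has no implicit format, so set the format and mime type
RowBatcher<String> csvBatcher = moveMgr.newRowBatcher(
    new StringHandle().withFormat(Format.TEXT).withMimetype("text/csv"));
// JSON rows: JacksonHandle has an implicit format of Format.JSON
RowBatcher<JsonNode> jsonBatcher = moveMgr.newRowBatcher(new JacksonHandle());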
Setting the Data Type and Row Structure Styles
The RowBatcher.getRowManager() method gets the RowManager for the exported rows.
You can use the RowManager setter methods to control the output:
- RowManager.setDatatypeStyle() specifies whether to emit data types in the header.
- RowManager.setRowStructureStyle() specifies whether to emit rows as objects or arrays.
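For instance, a sketch of configuring the RowManager (assuming a RowBatcher named rowBatcher) to emit data types in the header and each row as a JSON object:
RowManager rowMgr = rowBatcher.getRowManager();
// emit the column data types once in the header rather than in every row
rowMgr.setDatatypeStyle(RowManager.RowSetPart.HEADER);
// emit each row as an object keyed by column name instead of as an array
rowMgr.setRowStructureStyle(RowManager.RowStructure.OBJECT);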
Building a Plan For Exporting the View
The RowManager.newPlanBuilder() factory creates a PlanBuilder to build an export plan.
For efficiency, the assignment of rows to batches is determined internally, which has important consequences for building the export plan.
The export plan has the following characteristics:
- must begin with a fromView() accessor for the rows to export.
- may filter the exported rows with a where() operation prior to any joins.
- may project columns from the exported rows with a select() operation.
- may add expression columns to the exported rows using as() in a select() operation.
- may join other views with the exported view.
- may join documents or URIs to the exported rows.
The export plan has the following limitations:
- cannot sort the export view with an orderBy() operation. The sort operation would apply only to the exported rows that happen to be assigned to the same batch.
- cannot group the export view with a groupBy() operation. The grouping operation would apply only to the exported rows that happen to be assigned to the same batch.
- cannot take a slice of exported rows with a limit() or offset() operation. The slice operation would apply only to the exported rows that happen to be assigned to the same batch.
- cannot apply a map(), prepare(), or reduce() operation.
- cannot use parameter placeholders anywhere in the plan.
Joined rows can originate in any accessor including other views (such as dimension tables), triples, or lexicons. The joined rows can be modified by any operation prior to the join with the export rows. As an example, the joined rows could be grouped on the join key prior to the join to preserve a consistent number of rows in the batch.
The export can get data from documents by joining documents in the plan. The joined documents may be the source documents for rows or documents with document URIs provided by rows.
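For instance, a sketch of an export plan that filters the view, projects columns, and adds an expression column (assuming the rowMgr from the earlier sketch; the main schema, employees view, and column names are hypothetical):
PlanBuilder p = rowMgr.newPlanBuilder();
PlanBuilder.ModifyPlan exportPlan =
    p.fromView("main", "employees")                        // accessor for the export view
     .where(p.eq(p.col("status"), p.xs.string("active")))  // filter before any joins
     .select(p.col("employeeId"), p.col("lastName"),
             p.as("yearlyPay", p.multiply(p.col("monthlyPay"), p.xs.intVal(12))));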
Pass the built plan to the RowBatcher.withBatchView() method to initialize the RowBatcher with the plan.
The batch size applies to the number of rows taken from the export view at the start of the plan. You can adjust the batch size if the plan changes the number of rows in the results. For instance, you might:
- Increase the batch size when the plan filters the export rows.
- Decrease the batch size when the plan joins on a one-to-many relationship.
Recommended: While developing, test the export plan to confirm that it produces rows with the desired shape before exporting the entire view (see the sketch after this list). After each change to the plan, perform the following actions:
- Construct a test plan to get some sample rows by appending either a ModifyPlan.limit() or ModifyPlan.offsetLimit() operation to the export plan.
- Get the sample rows by calling the RowManager.resultDoc() method with an argument of the test plan and the same handle used as the sample handle for the RowBatcher.
- Once the modifications to the export plan produce sample rows with the desired shape, export the entire view.
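A sketch of such a test, assuming the exportPlan and rowMgr from the earlier sketches and a CSV sample handle (the limit of 10 is arbitrary):
// take a small slice of the export view to inspect the shape of the output
PlanBuilder.ModifyPlan testPlan = exportPlan.limit(10);
StringHandle sampleRows = rowMgr.resultDoc(testPlan,
    new StringHandle().withFormat(Format.TEXT).withMimetype("text/csv"));
System.out.println(sampleRows.get());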
Listening For Success and Failure
The RowBatcher.onSuccess() method specifies a function (typically, a Java lambda) that's called with each batch of successfully exported rows.
The RowBatcher passes a RowBatchSuccessListener.RowBatchResponseEvent in the call to the success listener.
The event object provides the RowBatchSuccessListener.RowBatchResponseEvent.getRowsDoc() getter method to get the batch of rows in the content representation supported by the sample handle provided when constructing RowBatcher. For example, for a RowBatcher constructed with a StringHandle, the method returns the row batch as a Java String.
That is, the generic type of the ContentHandle is also the generic type of the RowBatcher and the RowBatchSuccessListener.RowBatchResponseEvent, and thus the type of the return value of the RowBatchSuccessListener.RowBatchResponseEvent.getRowsDoc() method.
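For instance, assuming the RowBatcher<JsonNode> named jsonBatcher from the earlier sketch, the success listener receives each batch as a Jackson JsonNode:
jsonBatcher.onSuccess(event -> {
    JsonNode rows = event.getRowsDoc(); // typed as JsonNode for a RowBatcher<JsonNode>
    // client processing of the batch of rows
});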
You should also use the RowBatcher.onFailure() method to specify the disposition of any failures during the job.
The RowBatcher passes a RowBatchFailureListener.RowBatchFailureEvent and the exception for the row batch in the call to the failure listener.
The failure listener can control the disposition of the failure by calling RowBatchFailureListener.RowBatchFailureEvent setter methods:
- RowBatchFailureListener.RowBatchFailureEvent.withDisposition() specifies whether to retry the batch request, skip the batch request, or stop the job.
- RowBatchFailureListener.RowBatchFailureEvent.withMaxRetries() specifies the maximum number of retries if retrying the batch request.
If a retry doesn't succeed, the RowBatcher calls the failure listener again until the maximum number of retries is reached or the failure listener specifies a disposition other than retry.
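For instance, a sketch of a failure listener that asks the RowBatcher to retry a failed batch request (assuming the rowBatcher from the earlier sketches; the retry count of 3 is arbitrary):
rowBatcher.onFailure((event, throwable) -> {
    // ask the RowBatcher to retry this batch request, at most 3 times
    event.withMaxRetries(3);
    event.withDisposition(RowBatchFailureListener.BatchFailureDisposition.RETRY);
    // the throwable could be logged here for later review
});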
If continuous updates during the export job affect the documents that populate the exported view, rows could be modified or deleted after the batch containing them has been exported.
For many uses, minor inconsistency is not problematic. After all, continuous updates also mean that the exported rows won't reflect the current state of the view after the export job finishes.
Where the export should reflect a consistent snapshot of the view, however, the export job should use a point-in-time query. Before starting the job, call the RowBatcher.withConsistentSnapshot() setter method to configure the RowBatcher to retrieve every batch of rows with its state at the time that the export job started.
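For example, assuming the rowBatcher and moveMgr from the earlier sketches:
// must be called before moveMgr.startJob(rowBatcher)
rowBatcher.withConsistentSnapshot();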
Example
The following example shows a simple job that exports the rows from a view in CSV format.
// get the database client (often done once per application)
DatabaseClient db = DatabaseClientFactory.newClient(...);
// get the data movement manager (often done once per application)
DataMovementManager moveMgr = db.newDataMovementManager();
// construct a handle for how to represent the retrieved rows
StringHandle sampleHandle =
    new StringHandle().withFormat(Format.TEXT)
                      .withMimetype("text/csv");
// construct the multi-threaded exporter
RowBatcher<String> rowBatcher =
    moveMgr.newRowBatcher(sampleHandle)
           .withBatchSize(30)
           .withThreadCount(threads);
// configure the export for consistent data types
RowManager rowMgr = rowBatcher.getRowManager();
rowMgr.setDatatypeStyle(RowManager.RowSetPart.HEADER);
// build the plan for the export view
PlanBuilder planBuilder = rowMgr.newPlanBuilder();
PlanBuilder.ModifyPlan exportPlan =
    planBuilder.fromView(...)
               .select(/* project index columns and add expression columns */);
// specify processing for exported rows and for request failures
rowBatcher.withBatchView(exportPlan)
          .onSuccess(event -> {
              try {
                  BufferedReader reader =
                      new BufferedReader(new StringReader(event.getRowsDoc()));
                  reader.readLine(); // consume the CSV header line
                  reader.lines().forEach(line -> {/*
                      client processing of exported rows
                  */});
              } catch (Throwable e) {
                  // logging for errors during client processing of exported rows
              }})
          .onFailure((event, throwable) -> {
              event.withDisposition(BatchFailureDisposition.SKIP);
              // logging for errors during retrieval of row batches
          });
// start the job and then wait for the export to complete
moveMgr.startJob(rowBatcher);
rowBatcher.awaitCompletion();