ciao-docs-parser

CIP to parse documents such as PDF or DOC for key/value properties

Introduction

The purpose of this CIP is to process an incoming binary document by parsing it and extracting a set of key/value properties before publishing the parsed document for further processing by other CIPs.

ciao-docs-parser is built on top of Apache Camel and Spring Framework, and can be run as a stand-alone Java application, or via Docker.

Each application can host multiple [routes](http://camel.apache.org/routes.html), where each route has the following basic structure:

input folder -> DocumentParser -> output queue (JMS)

  • The input folder supports any document format recognised by the configured parsers and extractors.
  • The output queue receives a JSON-encoded representation of ParsedDocument.

The details of the JMS queues and document parsers are specified at runtime through a combination of ciao-configuration properties and Spring XML files.

The following document parsers are provided:

The following properties extractors are provided:

For more advanced usage, a custom document parser can be integrated by implementing the parser Java interfaces and providing a suitable Spring XML configuration on the classpath.

Configuration

For further details of how ciao-configuration and Spring XML interact, please see ciao-core.

Spring XML

On application start-up, a series of Spring Framework XML files are used to construct the core application objects. The created objects include the main Camel context, input/output components, routes and any intermediate processors.

The configuration is split into multiple XML files, each covering a separate area of the application. These files are selectively included at runtime via CIAO properties, allowing alternative technologies and/or implementations to be chosen. Each imported XML file can support a different set of CIAO properties.

The Spring XML files are loaded from the classpath under the META-INF/spring package.

Core:

  • beans.xml - The main configuration responsible for initialising properties, importing additional resources and starting Camel.

Repositories:

An IdempotentRepository is configured to enable [multiple consumers](http://camel.apache.org/competing-consumers.html) to access the same folder concurrently.

  • repository/memory.xml - An in-memory implementation suitable for use when there is only a single consumer, or when multiple consumers are all contained within the same JVM instance.
  • repository/hazelcast.xml - A grid-based implementation backed by Hazelcast. The component is hosted entirely within the JVM process and uses a combination of multicast and point-to-point networking to maintain a cross-server data grid.
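
The idea behind the idempotent repository can be sketched in a few lines of plain Java. This is only an illustration of the "first claim wins" behaviour that lets competing consumers share a folder; the class `InMemoryClaimSketch` and its methods are hypothetical and are not the Camel or CIAO implementation.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for an in-memory idempotent repository (not the Camel class). */
final class InMemoryClaimSketch {
    // Thread-safe set of file names that have already been claimed by a consumer.
    private final Set<String> claimed = ConcurrentHashMap.newKeySet();

    /** Returns true only for the first caller to claim the given file name. */
    boolean claim(String fileName) {
        return claimed.add(fileName);
    }

    /** Releases a claim so that the file can be picked up again. */
    boolean release(String fileName) {
        return claimed.remove(fileName);
    }

    public static void main(String[] args) {
        InMemoryClaimSketch repository = new InMemoryClaimSketch();
        System.out.println(repository.claim("letter.pdf")); // first consumer processes the file
        System.out.println(repository.claim("letter.pdf")); // second consumer skips it
    }
}
```

A set like this is only visible inside one JVM, which is why the in-memory option suits a single consumer or consumers sharing a JVM, while the Hazelcast option shares the same kind of state across servers.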

Processors:

  • processors/default.xml - Creates individual parsers from the ciao-docs-parser-kings module, and initialises an auto-detect parser to try each sequentially until a match is found.
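
As a rough sketch of the auto-detect behaviour described above, the following plain Java tries a list of parsers in order and returns the first result. The `SketchParser` interface, the `startsWith` toy parser and the `documentType` key are assumptions made for illustration; they are not the parser interfaces shipped with this CIP.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class AutoDetectSketch {
    /** Hypothetical parser contract: an empty result means "not my format". */
    interface SketchParser {
        Optional<Map<String, String>> tryParse(String text);
    }

    /** Tries each parser in order and returns the first successful result. */
    static Map<String, String> parse(List<SketchParser> parsers, String text) {
        for (SketchParser parser : parsers) {
            Optional<Map<String, String>> result = parser.tryParse(text);
            if (result.isPresent()) {
                return result.get();
            }
        }
        // No parser recognised the document (compare with the errorFolder route option).
        throw new IllegalArgumentException("Unsupported document format");
    }

    /** Builds a toy parser that matches documents starting with the given marker. */
    static SketchParser startsWith(String marker, String type) {
        return text -> {
            if (text.startsWith(marker)) {
                return Optional.of(Map.of("documentType", type));
            }
            return Optional.empty();
        };
    }
}
```

Because the parsers are tried sequentially, their order matters in this sketch: a more specific parser should be listed before a more permissive one.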

Messaging:

  • messaging/activemq.xml - Configures ActiveMQ as the JMS implementation for input/output queues.
  • messaging/activemq-embedded.xml - Configures an internal embedded ActiveMQ as the JMS implementation for input/output queues. (For use during development/testing)

CIAO Properties

At runtime ciao-docs-parser uses the available CIAO properties to determine which Spring XML files to load, which Camel routes to create, and how individual routes and components should be wired.

Camel Logging:

Spring Configuration:

  • repositoryConfig - Selects which repository configuration to load: repositories/${repositoryConfig}.xml
  • processorConfig - Selects which processor configuration to load: processors/${processorConfig}.xml
  • messagingConfig - Selects which messaging configuration to load: messaging/${messagingConfig}.xml
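
To make the selection mechanism concrete, here is a minimal sketch of how a placeholder such as ${messagingConfig} turns into a resource path. Spring performs the real resolution; the `ImportPathSketch` class below is hypothetical, uses only the JDK (Java 9 or later), and exists purely to show the substitution.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; Spring performs the real placeholder resolution. */
final class ImportPathSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} in the template with the matching property value. */
    static String resolve(String template, Properties properties) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder resolved = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            String value = properties.getProperty(name);
            if (value == null) {
                throw new IllegalArgumentException("Missing CIAO property: " + name);
            }
            matcher.appendReplacement(resolved, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty("messagingConfig", "activemq-embedded");
        // Prints the resource path that would be imported for this setting.
        System.out.println(resolve("messaging/${messagingConfig}.xml", properties));
    }
}
```

With messagingConfig=activemq-embedded, the template resolves to messaging/activemq-embedded.xml, which matches one of the messaging files listed earlier.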

Routes:

  • documentParserRoutes - A comma separated list of route names to build

The list of route names serves two purposes. Firstly it determines how many routes to build, and secondly each name is used as a prefix to specify the individual properties of that route.

Route Configuration:

For 'specific' properties unique to a single route, use the prefix: documentParserRoutes.${routeName}.

For 'generic' properties covering all routes, use the prefix: documentParserRoutes.

  • inputFolder - Selects which folder to consume incoming documents from
  • inProgressFolder - Selects which folder files should be moved to while they are being processed
  • completedFolder - Selects which folder files should be moved to after processing has completed
  • errorFolder - Selects which folder files should be moved to if they cannot be processed due to an unrecoverable error (e.g. unsupported file format)
  • idempotentRepositoryId - The Spring ID of the IdempotentRepository used by the route. This enables support for the Competing Consumers Pattern.
  • inProgressRepositoryId - The Spring ID of the in-progress IdempotentRepository used by the route. This enables support for the Competing Consumers Pattern.
  • processorId - The Spring ID of the parser to use when parsing documents
  • outputQueue - Selects which queue to publish parsed documents to
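
The relationship between 'generic' and 'specific' properties can be sketched as a two-step lookup: try the route-specific key first, then fall back to the generic key. The `RoutePropertySketch` class is hypothetical and only illustrates the naming scheme; it is not the code the CIP uses to build its routes.

```java
import java.util.Properties;

final class RoutePropertySketch {
    static final String ROOT = "documentParserRoutes";

    /** Splits the comma separated documentParserRoutes value into route names. */
    static String[] routeNames(Properties properties) {
        String value = properties.getProperty(ROOT, "").trim();
        return value.isEmpty() ? new String[0] : value.split("\\s*,\\s*");
    }

    /** Looks up the route-specific value first, then falls back to the generic one. */
    static String lookup(Properties properties, String routeName, String key) {
        String specific = properties.getProperty(ROOT + "." + routeName + "." + key);
        return specific != null ? specific : properties.getProperty(ROOT + "." + key);
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty("documentParserRoutes", "discharge-notification,ed-discharge");
        properties.setProperty("documentParserRoutes.outputQueue", "parsed-documents");
        // "ed-queue" is a made-up queue name used only to show an override.
        properties.setProperty("documentParserRoutes.ed-discharge.outputQueue", "ed-queue");
        for (String route : routeNames(properties)) {
            System.out.println(route + " -> " + lookup(properties, route, "outputQueue"));
        }
    }
}
```

In this sketch the discharge-notification route inherits the generic outputQueue, while ed-discharge uses its own value.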

Folder Configuration:

The completedFolder and errorFolder route options can include [Camel Simple Language](https://camel.apache.org/simple.html) expressions. The following additional headers can be referenced:

  • CamelCorrelationId - A unique ID associated with the processing of the source document
  • ciaoSourceFileName - The file name of the source document
  • ciaoTimestamp - The time processing was started, expressed as a Unix timestamp in milliseconds (i.e. milliseconds since 1970-01-01 UTC)

The inProgressFolder option does not support Simple expressions. Instead, it should be specified as a standard file path (absolute or relative to the working directory). While a document is being processed, data relating to the processing is stored in the sub-folder inProgressFolder/{correlationId}.
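
As a small illustration of the layout described above, the per-document folder can be thought of as the configured path plus the correlation ID. The `InProgressPathSketch` class and the sample correlation ID are hypothetical; the sketch only shows the path arithmetic, not how the CIP manages the folder.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class InProgressPathSketch {
    /** Resolves the per-document sub-folder below the configured in-progress folder. */
    static Path folderFor(String inProgressFolder, String correlationId) {
        return Paths.get(inProgressFolder).resolve(correlationId).normalize();
    }

    public static void main(String[] args) {
        // "abc-123" is a made-up correlation ID.
        Path folder = folderFor("./in-progress", "abc-123");
        System.out.println(folder);
        // A relative path is interpreted against the working directory.
        System.out.println(folder.toAbsolutePath());
    }
}
```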

For more details of the in-progress folder structure, see the state-machine documentation from ciao-docs-finalizer.

Hazelcast Configuration:

The following properties are applicable for repositoryConfig=hazelcast:

  • hazelcast.group.name - Name of the Hazelcast cluster group
  • hazelcast.group.password - Password of the Hazelcast cluster group
  • hazelcast.network.port - The network port to use for the Hazelcast server - if the port is already in use it will be incremented until a free port is found
  • hazelcast.network.publicAddress - The (optional) public address of the Hazelcast node - this can be used if nodes need to communicate over NAT.
  • hazelcast.network.join.tcp_ip.members - Comma separated list of static cluster members - if empty, multicast join should be enabled
  • hazelcast.network.join.multicast.enabled - Boolean value specifying whether multicast join should be used to find cluster members - if false, static TCP-IP members should be specified
  • hazelcast.network.join.multicast.group - Multicast address to use for finding cluster members
  • hazelcast.network.join.multicast.port - Multicast port to use for finding cluster members
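
The two join options are complementary: either multicast discovery is enabled, or a static member list is supplied. The following sketch checks a set of properties for that rule, using the hazelcast.network.join.tcp_ip.members key from the example configuration. The `HazelcastJoinSketch` class is hypothetical, and it is an assumption of this sketch (not documented Hazelcast or CIAO behaviour) that multicast takes precedence when both are set.

```java
import java.util.Properties;

final class HazelcastJoinSketch {
    enum JoinMode { MULTICAST, TCP_IP }

    /** Decides which join mechanism the given properties describe. */
    static JoinMode joinMode(Properties properties) {
        boolean multicast = Boolean.parseBoolean(
                properties.getProperty("hazelcast.network.join.multicast.enabled", "false"));
        String members = properties
                .getProperty("hazelcast.network.join.tcp_ip.members", "")
                .trim();
        if (multicast) {
            return JoinMode.MULTICAST; // assumption: multicast wins if both are configured
        }
        if (!members.isEmpty()) {
            return JoinMode.TCP_IP;
        }
        throw new IllegalStateException(
                "Enable multicast join or list static TCP-IP members");
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty("hazelcast.network.join.multicast.enabled", "false");
        // The addresses below are made-up examples.
        properties.setProperty("hazelcast.network.join.tcp_ip.members", "10.0.0.1:5701,10.0.0.2:5701");
        System.out.println(joinMode(properties));
    }
}
```

A check like this can catch the misconfiguration where multicast is disabled but no static members are listed, which would leave a node unable to find the rest of the cluster.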

Example

# Camel logging
camel.log.mdc=true
camel.log.trace=false
camel.log.debugStreams=false

# Select which processor config to use (via dynamic spring imports)
processorConfig=default

# Select which idempotent repository config to use (via dynamic spring imports)
# repositoryConfig=memory
repositoryConfig=hazelcast

# Select which messaging config to use (via dynamic spring imports)
messagingConfig=activemq
# messagingConfig=activemq-embedded

# ActiveMQ settings (if messagingConfig=activemq)
activemq.brokerURL=tcp://localhost:61616
activemq.userName=smx
activemq.password=smx

# Hazelcast settings (if repositoryConfig=hazelcast)
hazelcast.group.name=ciao-docs-parser
hazelcast.group.password=ciao-docs-parser-pass
hazelcast.network.port=5701
hazelcast.network.publicAddress=
hazelcast.network.join.tcp_ip.members=
hazelcast.network.join.multicast.enabled=true
hazelcast.network.join.multicast.group=224.2.2.3
hazelcast.network.join.multicast.port=54327

# Setup route names (and how many routes to build)
documentParserRoutes=discharge-notification,ed-discharge,auto-detect

# Setup 'shared' properties across all-routes
documentParserRoutes.outputQueue=parsed-documents
documentParserRoutes.inProgressFolder=./in-progress
documentParserRoutes.idempotentRepositoryId=idempotentRepository
documentParserRoutes.inProgressRepositoryId=inProgressRepository

# Setup per-route properties (can override the shared properties)
documentParserRoutes.discharge-notification.inputFolder=./input/discharge-notifications
documentParserRoutes.discharge-notification.completedFolder=./completed/discharge-notifications/${date:now:yyyy-MM-dd}/${header.CamelCorrelationId}
documentParserRoutes.discharge-notification.errorFolder=./error/discharge-notifications/${date:now:yyyy-MM-dd}/${header.CamelCorrelationId}
documentParserRoutes.discharge-notification.processorId=dischargeNotificationProcessor

documentParserRoutes.ed-discharge.inputFolder=./input/ed-discharges
documentParserRoutes.ed-discharge.completedFolder=./completed/ed-discharges/${date:now:yyyy-MM-dd}/${header.CamelCorrelationId}
documentParserRoutes.ed-discharge.errorFolder=./error/ed-discharges/${date:now:yyyy-MM-dd}/${header.CamelCorrelationId}
documentParserRoutes.ed-discharge.processorId=edDischargeProcessor

documentParserRoutes.auto-detect.inputFolder=./input/auto-detect
documentParserRoutes.auto-detect.completedFolder=./completed/auto-detect/${date:now:yyyy-MM-dd}/${header.CamelCorrelationId}
documentParserRoutes.auto-detect.errorFolder=./error/auto-detect/${date:now:yyyy-MM-dd}/${header.CamelCorrelationId}
documentParserRoutes.auto-detect.processorId=autoDetectProcessor

Building and Running

To pull down the code, run:

git clone https://github.com/nhs-ciao/ciao-docs-parser.git

You can then compile the module via:

cd ciao-docs-parser-parent
mvn clean install -P bin-archive

This will compile a number of related modules - the main CIP module is ciao-docs-parser, and the full binary archive (with dependencies) can be found at ciao-docs-parser\target\ciao-docs-parser-{version}-bin.zip. To run the CIP, unpack this zip to a directory of your choosing and follow the instructions in the README.txt.

The CIP requires access to various file system directories and network ports (dependent on the selected configuration):

etcd:

  • Connects to: localhost:2379

ActiveMQ:

  • Connects to: localhost:61616

Hazelcast:

  • Multicast discovery: 224.2.2.3:54327 (If enabled)
  • Listens on: *:5701 (If port is already taken, the port number is incremented until a free port is found)
  • Connects to clustered nodes defined by the hazelcast.network.join.tcp_ip.members property

Filesystem:

  • If etcd is not available, CIAO properties will be loaded from: ~/.ciao/
  • The default configuration creates/uses input, completed, and error directories in the CIP working directory. These can be altered by changing the CIAO properties configuration (via etcd, or the properties file in ~/.ciao/)
