Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

draft: parser interpretation improvements #1487

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion package.json
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@
"pelias-logger": "^1.2.0",
"pelias-microservice-wrapper": "^1.7.0",
"pelias-model": "^7.0.0",
"pelias-parser": "1.53.0",
"pelias-parser": "github:pelias/parser#venue_improvements",
"pelias-query": "^11.0.0",
"pelias-sorting": "^1.2.0",
"predicates": "^2.0.0",
Expand Down
94 changes: 72 additions & 22 deletions sanitizer/_text_pelias_parser.js
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,20 @@ const parser = new AddressParser();
const _ = require('lodash');
const MAX_TEXT_LENGTH = 140;

// this constant defines a lower boundary for the solution score returned
// by the Pelias parser. Any solutions which scored lower than this value
// will simply have their entire body returned as the $subject
const MIN_ACCEPTABLE_SCORE = 0.3;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users would probably appreciate this being tuneable via a config option.


// this constant defines the minimum amount of characters that can be
// interpreted as a $prefix when assigning them to $subject.
// this is useful for cases such as 'St Francis' where the parser returns
// { locality: 'Francis' }, it's not able to classify the 'St' token.
// in this case the leading 'St' would normally be considered the $subject
// and the parsed_text would be { subject: 'St', admin: 'Francis' }.
// the $MIN_PREFIX_CHAR_LENGTH var is a simple way of working around this problem.
const MIN_PREFIX_CHAR_LENGTH = 3;

/**
this module provides fulltext parsing using the pelias/parser module.
see: https://github.com/pelias/parser
Expand Down Expand Up @@ -52,8 +66,36 @@ function _sanitize (raw, clean) {
return messages;
}

/**
The Pelias parser is responsible for interpreting an input string and then returning
one or more solutions for how the input tokens can been logically classified.

However, the responsibility of *how* to interpret those solutions and how best
to map them to variables we can use to query elasticsearch is the responsibility
of the function below.

The general idea is to split tokens along a 'cursor boundary', so on the left side of
the cursor we have the 'subject' of the query (ie. the place in question) and on the
right side of the query we have the administrative tokens.

note: The function below is *only* concerned with selecting $parsed_text.subject and
$parsed_text.admin values

see: `query/text_parser_pelias.js` for the logic which adds the other fields such
as $parsed_text.housenumber and $parsed_text.street etc. *before* this gets run.

Postcodes are a special case, they belong to neither the admin hierarchy nor to the
subject itself. They are more similar in that sense to a house number, because they
are a property of the address. They are tricky because they usually appear somewhere
in the middle of the input text. They shouldn't be included in either the subject or
the admin parts, unless the postcode is itself the subject of the query.

Here are some examples of how we might split a query into subject and admin:
"100 Foo Street Brookyln NYC" -> ["100 Foo Street", "Brookyln NYC"]
"Foo Bar 10017 Berlin Germany" -> ["Foo Bar", "Berlin Germany"]
*/
function parse (clean) {

// parse text
let start = new Date();
const t = new Tokenizer(clean.text);
Expand All @@ -73,7 +115,7 @@ function parse (clean) {
let solution = new Solution();
if (t.solution.length) { solution = t.solution[0]; }

// 1. map the output of the parser in to parsed_text
// 1. map the output of the parser into $parsed_text
let parsed_text = { subject: undefined };

solution.pair.forEach(p => {
Expand All @@ -95,38 +137,37 @@ function parse (clean) {
// ' VVVV NN SSSSSSS AAAAAA PPPPP '
let mask = solution.mask(t);

// the entire input text as seen by the parser with any postcode classification(s) removed
// the entire input text as seen by the parser with any postcode and unit
// classification(s) removed
let body = t.span.body.split('')
.map((c, i) => (mask[i] !== 'P') ? c : ' ')
.map((c, i) => !/[PU]/.test(mask[i]) ? c : ' ')
.join('');

// same as $body above but with consecutive whitespace squashed and trimmed.
const normalizedBody = t.section.map(sp => sp.body).join(' ').replace(/\s+/g, ' ').trim();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not quite the same as above since it includes postalcode classifications.
Need to give this some more consideration before merging 🤔


// scan through the input text and 'bucket' characters in to one of two buckets:
// prefix: all unparsed characters that came before any parsed fields
// postfix: all characters from the first admin field to the end of the string

// set cursor to the first classified character from selected classes
let cursor = mask.search(/[NSAP]/);

// >> solution includes venue classification
// set cursor after the venue name
if (mask.includes('V')) { cursor = mask.lastIndexOf('V') +1; }

if (cursor === -1) { cursor = body.length; }
let prefix = _.trim(body.substr(0, cursor), ' ,');

// solution includes address classification
// if solution includes address classification
// set cursor after the last classified address character
if (mask.search(/[NS]/) > -1) {
if ( mask.includes('N') && mask.includes('S') ) {
cursor = Math.max(mask.lastIndexOf('N'), mask.lastIndexOf('S')) + 1;
}
// solution includes admin classification

// else if solution includes admin classification
// set cursor to the first classified admin character
else if( mask.includes('A') ){ cursor = mask.indexOf('A'); }
// >> solution includes venue classification
// set cursor after the venue name
else if (mask.includes('V')) { cursor = mask.lastIndexOf('V') + 1; }

// else set cursor to end-of-text
else { cursor = body.length; }

if (cursor === -1) { cursor = body.length; }
let prefix = _.trim(body.substr(0, cursor), ' ,');
let postfix = _.trim(body.substr(cursor), ' ,');

// clean up spacing around commas
Expand Down Expand Up @@ -163,21 +204,30 @@ function parse (clean) {
// 4. set 'subject', this is the text which will target the 'name.*'
// fields in elasticsearch queries

// in the case where the solution score is very low we simply use the entire
// input as the $subject.
if ( solution.score < MIN_ACCEPTABLE_SCORE ) {
parsed_text = { subject: normalizedBody };
}
// an address query
if (!_.isEmpty(parsed_text.housenumber) && !_.isEmpty(parsed_text.street)) {
else if (!_.isEmpty(parsed_text.housenumber) && !_.isEmpty(parsed_text.street)) {
parsed_text.subject = `${parsed_text.housenumber} ${parsed_text.street}`;
}
// an intersection query
else if (!_.isEmpty(parsed_text.street) && !_.isEmpty(parsed_text.cross_street)) {
parsed_text.subject = `${parsed_text.street} & ${parsed_text.cross_street}`;
}
// a street query
else if (!_.isEmpty(parsed_text.street)) {
else if (!_.isEmpty(parsed_text.street) && _.isEmpty(parsed_text.venue)) {
parsed_text.subject = parsed_text.street;
}
// query with a $prefix such as a venue query
else if (!_.isEmpty(prefix)){
parsed_text.subject = prefix;
if (prefix.length >= MIN_PREFIX_CHAR_LENGTH) {
parsed_text.subject = prefix;
} else {
parsed_text = { subject: normalizedBody };
}
}
// a postcode query
else if (!_.isEmpty(parsed_text.postcode)) {
Expand Down Expand Up @@ -225,10 +275,10 @@ function parse (clean) {
}
}
}

// unknown query type
else {
parsed_text.subject = t.span.body;
parsed_text = { subject: normalizedBody };
}

return parsed_text;
Expand Down
93 changes: 83 additions & 10 deletions test/unit/sanitizer/_text_pelias_parser.js
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ module.exports.tests.text_parser = function (test, common) {
});

let cases = [];

// USA queries
cases.push(['soho, new york, NY', {
subject: 'soho',
Expand Down Expand Up @@ -154,6 +154,29 @@ module.exports.tests.text_parser = function (test, common) {
admin: 'Kentucky'
}]);

// badly parsed venue names should at least have correct $subject
cases.push(['Kells Irish Restaurant & Pub', {
subject: 'Kells Irish Restaurant & Pub'
}, true]);
cases.push(['The Village Zombie', {
subject: 'The Village Zombie'
}, true]);
cases.push(['Chili\'s Bar & Grill', {
subject: 'Chili\'s Bar & Grill'
}, true]);
cases.push(['OYO Hotels & Homes', {
subject: 'OYO Hotels & Homes'
}, true]);
cases.push(['Adamas Pharmaceuticals', {
subject: 'Adamas Pharmaceuticals'
}, true]);
cases.push(['St Francis Extended', {
subject: 'St Francis Extended'
}, true]);
cases.push(['St Francis', {
subject: 'St Francis'
}, true]);

// street (USA style)
cases.push(['M', { subject: 'M' }, true]);
cases.push(['Ma', { subject: 'Ma' }, true]);
Expand Down Expand Up @@ -282,11 +305,11 @@ module.exports.tests.text_parser = function (test, common) {
cases.push(['Kasch', { subject: 'Kasch' }, true]);
cases.push(['Kaschk', { subject: 'Kaschk' }, true]);
cases.push(['Kaschk ', { subject: 'Kaschk' }, true]);
// cases.push(['Kaschk B', { subject: 'Kaschk' }, true]); // jitter issue
cases.push(['Kaschk Be', { subject: 'Kaschk' }, true]);
// cases.push(['Kaschk Ber', { subject: 'Kaschk' }, true]); // jitter issue
// cases.push(['Kaschk Berl', { subject: 'Kaschk' }, true]); // jitter issue
// cases.push(['Kaschk Berli', { subject: 'Kaschk' }, true]); // jitter issue
cases.push(['Kaschk B', { subject: 'Kaschk B' }, true]);
cases.push(['Kaschk Be', { subject: 'Kaschk Be' }, true]);
cases.push(['Kaschk Ber', { subject: 'Kaschk Ber' }, true]);
cases.push(['Kaschk Berl', { subject: 'Kaschk Berl' }, true]);
cases.push(['Kaschk Berli', { subject: 'Kaschk Berli' }, true]);
Copy link
Member Author

@missinglink missinglink Sep 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to test that Pelias still finds the venue with partially complete city name as before:

Screenshot 2020-09-01 at 11 11 34

cases.push(['Kaschk Berlin', { subject: 'Kaschk' }, true]);

cases.push(['A', { subject: 'A' }, true]);
Expand All @@ -308,7 +331,7 @@ module.exports.tests.text_parser = function (test, common) {
// cases.push(['Air & Space Museu', { subject: 'Air & Space Museu' }, true]); // jitter issue
cases.push(['Air & Space Museum', { subject: 'Air & Space Museum' }, true]);
cases.push(['Air & Space Museum ', { subject: 'Air & Space Museum' }, true]);
cases.push(['Air & Space Museum D', { subject: 'Air & Space Museum' }, true]);
cases.push(['Air & Space Museum D', { subject: 'Air & Space Museum D' }, true]);
cases.push(['Air & Space Museum DC', { subject: 'Air & Space Museum' }, true]);

// admin areas
Expand All @@ -322,7 +345,7 @@ module.exports.tests.text_parser = function (test, common) {
cases.push(['New York', { subject: 'New York' }, true]);
cases.push(['New York N', { subject: 'New York' }, true]);
cases.push(['New York NY', { subject: 'New York' }, true]);

cases.push(['B', { subject: 'B' }, true]);
cases.push(['Be', { subject: 'Be' }, true]);
cases.push(['Ber', { subject: 'Ber' }, true]);
Expand All @@ -342,6 +365,46 @@ module.exports.tests.text_parser = function (test, common) {
cases.push(['Berlin Deutschlan', { subject: 'Berlin' }, true]);
cases.push(['Berlin Deutschland', { subject: 'Berlin' }, true]);

// venue name with ampersand
// note: this query is ambigious as it could refer to either the
// UK high street brand "Marks & Spencer" or an intersection of
// Marks St and Spencer Ave.
// note: what we're looking for here is that we are using the whole
// input as the $subject regardless and without jitter.
cases.push(['M', { subject: 'M' }, true]);
cases.push(['Ma', { subject: 'Ma' }, true]);
cases.push(['Mar', { subject: 'Mar' }, true]);
cases.push(['Mark', { subject: 'Mark' }, true]);

// note: for the following 5 keystrokes the $subject is simplified to 'Marks'
cases.push(['Marks', { subject: 'Marks' }, true]);
cases.push(['Marks ', { subject: 'Marks' }, true]);
cases.push(['Marks &', { subject: 'Marks' }, true]);
cases.push(['Marks & ', { subject: 'Marks' }, true]);
cases.push(['Marks & S', { subject: 'Marks' }, true]);

cases.push(['Marks & Sp', { subject: 'Marks & Sp' }, true]);
cases.push(['Marks & Spe', { subject: 'Marks & Spe' }, true]);
cases.push(['Marks & Spen', { subject: 'Marks & Spen' }, true]);
cases.push(['Marks & Spenc', { subject: 'Marks & Spenc' }, true]);
cases.push(['Marks & Spence', { subject: 'Marks & Spence' }, true]);
cases.push(['Marks & Spencer', { subject: 'Marks & Spencer' }, true]);

// venue is also known colloquially as "M AND S"
cases.push(['M', { subject: 'M' }, true]);
cases.push(['M ', { subject: 'M' }, true]);
cases.push(['M &', { subject: 'M &' }, true]);
cases.push(['M & ', { subject: 'M &' }, true]);
cases.push(['M & S', { subject: 'M & S' }, true]);

cases.push(['M', { subject: 'M' }, true]);
cases.push(['M ', { subject: 'M' }, true]);
cases.push(['M a', { subject: 'M a' }, true]);
cases.push(['M an', { subject: 'M an' }, true]);
cases.push(['M and', { subject: 'M and' }, true]);
cases.push(['M and ', { subject: 'M and' }, true]);
cases.push(['M and S', { subject: 'M & S' }, true]);

// postcodes
cases.push(['2000', { subject: '2000' }, true]);
cases.push(['Sydney 2000', { subject: '2000' }, true]);
Expand All @@ -356,6 +419,16 @@ module.exports.tests.text_parser = function (test, common) {
cases.push(['e8 1dn', { subject: 'e8 1dn' }, true]);
// cases.push(['london e8 1dn', { subject: 'e8 1dn' }, true]); // issue

// unit number between address and locality
cases.push(['7750 Kennedy Rd #2A Markham', {
subject: '7750 Kennedy Rd',
housenumber: '7750',
street: 'Kennedy Rd',
unit: '#2A',
locality: 'Markham',
admin: 'Markham',
}]);

cases.forEach(testcase => {
let input = testcase[0];
let expected = testcase[1];
Expand Down Expand Up @@ -421,9 +494,9 @@ module.exports.tests.text_parser = function (test, common) {
const raw = {
text: `
Sometimes we make the process more complicated than we need to.
We will never make a journey of a thousand miles by fretting about
We will never make a journey of a thousand miles by fretting about
how long it will take or how hard it will be.
We make the journey by taking each day step by step and then repeating
We make the journey by taking each day step by step and then repeating
it again and again until we reach our destination.` };
const clean = {};
const messages = sanitizer.sanitize(raw, clean);
Expand Down