Investigate EML semantic annotation indexing issues on the CNs #15

Open
amoeba opened this issue Apr 7, 2022 · 2 comments

amoeba commented Apr 7, 2022

@mbjones and @taojing2002 saw some errors reported in the CN indexing logs. @taojing2002 and I looked and couldn't find the reported errors, but I decided to go ahead and verify there weren't any issues anyway.

I started by assuming the ADC indexing was working and queried the ADC Solr index for documents with semantic annotations (n=1351). For each object, I then checked whether it (1) existed on the CN, (2) was indexed on the CN, and (3) if indexed, had annotations.

| Status | Count |
|---|---|
| Not found | 1 |
| Not indexed | 58 |
| Indexed, but missing annotations | 11 |
| Indexed, w/ annotations | 1281 |
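
For reference, the per-object check described above can be sketched roughly as follows. This is a minimal sketch, not the script that produced the counts: the function names `classify` and `tally` are hypothetical, the Solr field name `sem_annotation` is an assumption, and the two lookups (`exists_on_cn`, `get_cn_solr_doc`) are passed in as callables so the classification logic does not depend on any particular CN endpoint.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

# Assumption: name of the Solr field that holds semantic annotations.
ANNOTATION_FIELD = "sem_annotation"


def classify(
    pid: str,
    exists_on_cn: Callable[[str], bool],
    get_cn_solr_doc: Callable[[str], Optional[dict]],
) -> str:
    """Return the status bucket for a single PID.

    exists_on_cn(pid)    -> True if the CN knows about the object.
    get_cn_solr_doc(pid) -> the CN Solr document as a dict, or None
                            if the object is not in the CN index.
    """
    if not exists_on_cn(pid):
        return "Not found"
    doc = get_cn_solr_doc(pid)
    if doc is None:
        return "Not indexed"
    # A missing or empty annotation field both count as "missing".
    if not doc.get(ANNOTATION_FIELD):
        return "Indexed, but missing annotations"
    return "Indexed, w/ annotations"


def tally(
    pids: Iterable[str],
    exists_on_cn: Callable[[str], bool],
    get_cn_solr_doc: Callable[[str], Optional[dict]],
) -> Counter:
    """Count how many PIDs fall into each status bucket."""
    return Counter(classify(pid, exists_on_cn, get_cn_solr_doc) for pid in pids)
```

In practice the two callables would wrap whatever CN lookups are available (for example a system metadata request and a Solr query by identifier), with the list of PIDs coming from the ADC Solr query.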

We should manually harvest the one "Not found" object and manually reindex the 58 "Not indexed" and 11 "Indexed, but missing annotations" objects above.

Full details, with PIDs
NOT FOUND

These need to get harvested

urn:uuid:a6bbe9d0-c281-4402-bf88-4f3c52c66fda

NOT INDEXED

These need to get indexed

doi:10.18739/A2VM42X50
doi:10.18739/A2P26Q37R
doi:10.18739/A2DN3ZW19
doi:10.18739/A2513TW18
doi:10.18739/A23B5W79P
doi:10.18739/A27S7HS4C
doi:10.18739/A2MC8RG5C
doi:10.18739/A2WH2DF2X
doi:10.18739/A2J960930
doi:10.18739/A25717N6T
doi:10.18739/A2416T00Z
doi:10.18739/A21834279
doi:10.18739/A2S46H60V
doi:10.18739/A2ZP3W08H
doi:10.18739/A2P55DG9K
doi:10.18739/A2CV4BR73
doi:10.18739/A20G3GZ2F
doi:10.18739/A24746R6K
doi:10.18739/A2VX06340
doi:10.18739/A2M61BQ0R
doi:10.18739/A2CJ87K8H
doi:10.18739/A2DJ58G97
doi:10.18739/A2086356K
doi:10.18739/A2930NV40
doi:10.18739/A24B2X49X
doi:10.18739/A2SX64931
doi:10.18739/A2599Z20Q
doi:10.18739/A2M32N96B
doi:10.18739/A2RX93D3S
doi:10.18739/A2FJ29C9G
doi:10.18739/A23775V6B
doi:10.18739/A2QZ22H33
doi:10.18739/A2QV3C418
doi:10.18739/A2WP9T67J
doi:10.18739/A2DV1CN8T
doi:10.18739/A21G0HV2P
doi:10.18739/A2BV79V7V
urn:uuid:69a40625-277a-4793-aa10-f148332d2456
doi:10.18739/A2N58CK9B
doi:10.18739/A2WS8HM1Z
doi:10.18739/A2707WP0Q
doi:10.18739/A21J9776B
doi:10.18739/A2CN6Z02Z
doi:10.18739/A2NC5SC63
doi:10.18739/A2HD7NS6P
doi:10.18739/A2GM81P1M
doi:10.18739/A24B2X59C
doi:10.18739/A2VQ2S97V
doi:10.1594/PANGAEA.779181
urn:uuid:29fbd2eb-3319-46ed-b416-3638ec020571
urn:uuid:40b5819c-a8d8-4f82-a9c4-ce2ec6cec1f0
urn:uuid:4cc06919-9562-4b4b-af99-fe524f118181
urn:uuid:d59a7b20-5704-4d37-9ee1-78e7a2e78982
urn:uuid:445deff9-b8cb-4023-8d7e-52802e429358
urn:uuid:b9b256da-0a15-459c-9b5a-36195e0dbb59
urn:uuid:b59de2d0-8531-456f-ab1a-dd009df9c844
urn:uuid:1c6521de-e47e-46ca-b9c8-d3910fe1fa9c
urn:uuid:02022a31-97b5-4178-b692-6d2a77c120eb

INDEXED, BUT NO ANNOTATIONS

These need reindexing and verification after they're reindexed

doi:10.18739/A28W3827B
doi:10.18739/A2319S30Q
doi:10.18739/A2N29P67H
doi:10.18739/A2DF6K36X
doi:10.18739/A2ZG6G71C
doi:10.18739/A20000081
doi:10.18739/A2VM42Z20
urn:uuid:72088082-251e-48f7-be83-9dd7508177e1
urn:uuid:09e1cd68-2209-4f42-ab41-291db507effa
doi:10.18739/A2T14TQ57
doi:10.18739/A2445HD27
@amoeba amoeba self-assigned this Apr 7, 2022

amoeba commented Apr 8, 2022

@taojing2002 can we work together to reharvest and reindex the PIDs above?

amoeba commented May 16, 2022

We're still at:

| Status | Count |
|---|---|
| Not indexed | 58 |
| Indexed, but missing annotations | 11 |

@taojing2002 and I talked about this, and it isn't a quick fix because we don't actually have a true reindex operation on the CNs: we can't trigger the CN to perform the usual index processing we do when an object is created or updated. We have a separate tool (d1_index_build_tool.jar) that's run out of band and submits the update directly to Solr; it just happens to use most of the same code and data as the actual CN index processor.

Unfortunately, d1_index_build_tool.jar is having an issue specifically with the semantic annotations, so we'll need to track that down before we can move forward here. If we can't do that, a workaround would be to update sysmeta on the Member Node for each of the above objects, which would trigger a full sync->harvest->index cycle.

I think d1_index_build_tool is part of https://github.com/DataONEorg/cn-buildout though I'm not sure at this point.
