-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/parlamint turkey #1730
base: develop
Are you sure you want to change the base?
Feature/parlamint turkey #1730
Conversation
'party_ids': party_ids, | ||
'party_nodes': party_nodes, | ||
'org_ids': org_ids, | ||
'org_nodes': org_nodes, # unsure about this still: might be a bit of a memory leak to store the nodes in the metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By which I mean: Nodes are pretty big in BeautifulSoup, so to store them all for each person seems not ideal. I can look at this later myself too, just haven't gotten around to it yet. The problem is just that org_nodes
is currently used to extract the role in the specific corpus definition...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The word is "expensive" or perhaps "inefficient". "Memory leak" means something more specific, which is also more problematic but which is fortunately not what you mean here. A memory leak is always expensive but the opposite is not true.
(I recommend changing the wording because a memory leak is often a serious bug that must be hunted down. No need to put out a red flag here.)
surname = node.persName.surname.text.strip() | ||
forename = node.persName.forename.text.strip() | ||
name = ' '.join([forename, surname]) | ||
role = node.persName.roleName.text.strip() if node.persName.roleName else None #too simple, needs to be able to have different values and be gotten from the org_node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is currently handled in the specific corpus defition, by a way that requires all org_nodes to be saved to the person, see my other comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Some thoughts:
- how specific is the
ParlamintTurkiye
corpus definition for that country's data? Would it be an idea to make aParlamint
subclass that other corpora can inherit from? I can imagine that most field names would be the same - can you make a
tests/data
directory with some example data? That would make it easier to review, and you'll need it anyway for the unit tests - without example data I cannot sure this suggestion makes sense, but it seems you get a "role id" from the metadata about the persons, with which you might be able to search the
org
metadata dictionary?
This is the corpus definition for the ParlaMint 4.0 Turkey corpus. It includes an updated utils file for Parlamint 4.0 data, which is still a work in progress. I will comment on code that can use improvement in the utils file. For the Turkey corpus I think I am done!