Feature/expand corpus model #1226

Merged
merged 47 commits into from
Sep 11, 2023
47 commits
57db604
add field model
lukavdplas Aug 2, 2023
9eee471
rename Field -> FieldDefinition
lukavdplas Aug 2, 2023
2f5cc09
add field saving function
lukavdplas Aug 2, 2023
97d67be
move filter serialization to SearchFilter class
lukavdplas Aug 2, 2023
12de803
save fields when loading corpus
lukavdplas Aug 2, 2023
7a8a583
add command for loading corpora
lukavdplas Aug 3, 2023
a089297
split load_corpus and save_corpus modules
lukavdplas Aug 3, 2023
e81f0b5
add documentation on loadcorpora command
lukavdplas Aug 3, 2023
112ca3a
log loading corpora
lukavdplas Aug 3, 2023
de22aa1
draft expanded corpus model
lukavdplas Aug 3, 2023
2b8f856
add corpus models to admin
lukavdplas Aug 3, 2023
5a93af3
extract corpus attributes from definition
lukavdplas Aug 3, 2023
66f5ad2
fix validation
lukavdplas Aug 3, 2023
12102dc
add required properties to model test
lukavdplas Aug 4, 2023
5d88da4
use atomic transaction in try/except block
lukavdplas Aug 4, 2023
2433570
more unit tests
lukavdplas Aug 4, 2023
51826ee
change fixture at start of tests
lukavdplas Aug 4, 2023
d662dc8
add active status to corpora
lukavdplas Aug 4, 2023
051a2af
add test for saving broken corpus
lukavdplas Aug 4, 2023
132cf28
set active status based on settings
lukavdplas Aug 4, 2023
2dd4827
add test for purity
lukavdplas Aug 4, 2023
7926308
switch to separate CorpusConfiguration model
lukavdplas Aug 7, 2023
4014d6e
no description parameter in corpus constructor
lukavdplas Aug 7, 2023
a49d107
clean up model test code
lukavdplas Aug 7, 2023
bef569e
add default for visualisations
lukavdplas Aug 7, 2023
6f6b924
add db validation test to all corpus definitions
lukavdplas Aug 7, 2023
f490799
add '' value for unknown languages
lukavdplas Aug 7, 2023
39c4c48
remove duplicate field in parliament-germany-old
lukavdplas Aug 7, 2023
ec8d00c
serialise from database models
lukavdplas Aug 7, 2023
a6f7cbf
remove obsolete test
lukavdplas Aug 7, 2023
546efd5
adjust date parsing in frontend
lukavdplas Aug 7, 2023
57fb325
use ISO format for dates in date filter JSON
lukavdplas Aug 7, 2023
cd24efe
rename load_corpus -> load_corpus_definition
lukavdplas Aug 7, 2023
7018847
rely on database rather than python class
lukavdplas Aug 7, 2023
7bc82b5
add documentation
lukavdplas Aug 7, 2023
71a45fb
pretty up admin
lukavdplas Aug 8, 2023
47039d4
infer "active" status of corpus
lukavdplas Aug 8, 2023
f31fead
squash migrations
lukavdplas Aug 8, 2023
3a8df49
rename _try_saving_corpus -> _save_or_skip_corpus
lukavdplas Aug 10, 2023
ce6fda2
fix typos
lukavdplas Aug 10, 2023
f7c79d3
Merge branch 'develop' into feature/expand-corpus-model
lukavdplas Aug 17, 2023
2c67649
Merge branch 'develop' into feature/expand-corpus-model
lukavdplas Aug 17, 2023
50f097b
Merge branch 'develop' into feature/expand-corpus-model
lukavdplas Aug 21, 2023
6a17e1e
Merge branch 'develop' into feature/expand-corpus-model
lukavdplas Aug 24, 2023
a45fe0e
Merge branch 'develop' into feature/expand-corpus-model
lukavdplas Sep 7, 2023
7a6b424
fix outdated function call
lukavdplas Sep 11, 2023
1ad6a8c
Merge branch 'develop' into feature/expand-corpus-model
lukavdplas Sep 11, 2023
8 changes: 8 additions & 0 deletions backend/README.md
@@ -65,6 +65,14 @@ If you are overriding the default settings, you may pass `--pythonpath` and `--s

### Running the application (development server)

If you made any changes to your configured corpora, load them into the database before running the application.

```console
$ python manage.py loadcorpora
```

Start the development server with:

```console
$ python manage.py runserver
```
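The commit titles in this pull request ("use atomic transaction in try/except block", "rename _try_saving_corpus -> _save_or_skip_corpus", "log loading corpora") suggest that loading saves each corpus independently and skips the ones that fail. The following is a minimal sketch of that pattern in plain Python, not the project's implementation: `save_or_skip`, `load_all`, and `example_save` are hypothetical names, and the dictionary `stored` stands in for the database.

```python
import logging

logger = logging.getLogger(__name__)


def save_or_skip(name, definition, save):
    """Try to save one corpus; log the problem and skip it if saving fails."""
    try:
        save(name, definition)
    except Exception as error:
        logger.warning('Skipping corpus %s: %s', name, error)
        return False
    return True


def load_all(definitions, save):
    """Save every configured corpus and return the names that were stored."""
    return [
        name
        for name, definition in definitions.items()
        if save_or_skip(name, definition, save)
    ]


# Hypothetical stand-in for the database.
stored = {}


def example_save(name, definition):
    # Hypothetical validation standing in for the real database write.
    if not definition.get('title'):
        raise ValueError('missing title')
    stored[name] = definition


loaded = load_all(
    {'times': {'title': 'Times'}, 'broken': {}},
    example_save,
)
print(loaded)
```

In this sketch a broken definition does not stop the remaining corpora from loading; it is only reported through the logger.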
139 changes: 136 additions & 3 deletions backend/addcorpus/admin.py
@@ -1,7 +1,140 @@
from django.contrib import admin
from .models import Corpus
from django.contrib import admin, messages
from .models import Corpus, CorpusConfiguration, Field

def show_warning_message(request):
    '''
    Message to display when loading a form for a resource based on a python class
    '''

    messages.add_message(
        request,
        messages.WARNING,
        'Corpus configurations are based on python classes; any changes here will be reset on server startup'
    )

class CorpusAdmin(admin.ModelAdmin):
    readonly_fields = ['name', 'description']
    readonly_fields = ['name', 'configuration']
    fields = ['name', 'groups', 'configuration']

class InlineFieldAdmin(admin.StackedInline):
    model = Field
    fields = ['display_name', 'description']
    show_change_link = True
    extra = 0

class CorpusConfigurationAdmin(admin.ModelAdmin):
    readonly_fields = ['corpus']

    inlines = [
        InlineFieldAdmin
    ]

    fieldsets = [
        (
            None,
            {
                'fields': [
                    'corpus',
                    'title',
                    'description',
                    'description_page',
                    'image',
                ]
            }
        ), (
            'Content',
            {
                'fields': [
                    'category',
                    'languages',
                    'min_date',
                    'max_date',
                    'document_context',
                ]
            }
        ), (
            'Elasticsearch',
            {
                'fields': [
                    'es_index',
                    'es_alias',
                ]
            }
        ), (
            'Scans',
            {
                'fields': [
                    'scan_image_type',
                    'allow_image_download',
                ]
            }
        ), (
            'Word models',
            {
                'fields': ['word_models_present']
            }
        )
    ]

    def get_form(self, request, obj=None, **kwargs):
        show_warning_message(request)
        return super().get_form(request, obj, **kwargs)


class FieldAdmin(admin.ModelAdmin):
    readonly_fields = ['corpus_configuration']

    fieldsets = [
        (
            None,
            {
                'fields': [
                    'name',
                    'corpus_configuration',
                    'display_name',
                    'description',
                    'hidden',
                    'downloadable',
                ]
            }
        ),
        (
            'Indexing options',
            {
                'fields': [
                    'es_mapping',
                    'indexed',
                    'required',
                ]
            }
        ), (
            'Search interface',
            {
                'fields': [
                    'search_filter',
                    'results_overview',
                    'searchable',
                    'search_field_core',
                    'sortable',
                    'primary_sort',
                ]
            }
        ), (
            'Visualisations',
            {
                'fields': [
                    'visualizations',
                    'visualization_sort',
                ]
            }
        )
    ]

    def get_form(self, request, obj=None, **kwargs):
        show_warning_message(request)
        return super().get_form(request, obj, **kwargs)


admin.site.register(Corpus, CorpusAdmin)
admin.site.register(CorpusConfiguration, CorpusConfigurationAdmin)
admin.site.register(Field, FieldAdmin)
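The `fieldsets` lists above are plain Python data: a list of `(title, options)` pairs where `options['fields']` names the model fields shown in that section. As a rough illustration of that structure, the sketch below flattens such a list and reports any field that is listed more than once. It runs without Django; `fieldset_fields` and `duplicated_fields` are hypothetical helpers, and the layout is a shortened copy of the one in `CorpusConfigurationAdmin`.

```python
from collections import Counter


def fieldset_fields(fieldsets):
    """Flatten Django-style fieldsets into a single list of field names."""
    return [
        field
        for _title, options in fieldsets
        for field in options['fields']
    ]


def duplicated_fields(fieldsets):
    """Return the field names that are listed more than once."""
    counts = Counter(fieldset_fields(fieldsets))
    return sorted(name for name, count in counts.items() if count > 1)


# Shortened copy of the layout used for the corpus configuration form.
fieldsets = [
    (None, {'fields': ['corpus', 'title', 'description']}),
    ('Elasticsearch', {'fields': ['es_index', 'es_alias']}),
    ('Word models', {'fields': ['word_models_present']}),
]

print(fieldset_fields(fieldsets))
print(duplicated_fields(fieldsets))
```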
107 changes: 11 additions & 96 deletions backend/addcorpus/corpus.py
@@ -3,10 +3,7 @@
'''

from . import extract
from zipfile import ZipExtFile
import itertools
import inspect
import json
import bs4
import csv
import sys
@@ -19,8 +16,9 @@
from addcorpus.constants import CATEGORIES

import logging
logger = logging.getLogger('indexing')
from ianalyzer.settings import NEW_HIGHLIGHT_CORPORA

logger = logging.getLogger('indexing')


class CorpusDefinition(object):
@@ -235,72 +233,6 @@ def es_mapping(self):
}
}

def json(self):
'''
Corpora should be able to produce JSON, so that the fields they define
can be used by other codebases, while retaining the Python class as the
single source of truth.
'''
corpus_dict = self.serialize()
json_dict = json.dumps(corpus_dict)
return json_dict

def serialize(self):
"""
Convert corpus object to a JSON-friendly dict format.
"""
corpus_dict = {}

# gather attribute names
# exclude:
# - methods not implemented in Corpus class
# - hidden attributes
# - attributes listed in `exclude`
# - bound methods
exclude = ['data_directory', 'es_settings', 'word_model_path']
corpus_attribute_names = [
a for a in dir(self)
if a in dir(CorpusDefinition) and not a.startswith('_') and a not in exclude and not inspect.ismethod(self.__getattribute__(a))
]

# collect values
corpus_attributes = [(a, getattr(self, a)) for a in corpus_attribute_names ]

for ca in corpus_attributes:
if ca[0] == 'fields':
field_list = []
for field in self.fields:
field_list.append(field.serialize())
corpus_dict[ca[0]] = field_list
elif ca[0] == 'languages':
format = lambda tag: Language.make(standardize_tag(tag)).display_name() if tag else 'Unknown'
corpus_dict[ca[0]] = [
format(tag)
for tag in ca[1]
]
elif ca[0] == 'category':
corpus_dict[ca[0]] = self._format_option(ca[1], CATEGORIES)
elif type(ca[1]) == datetime:
timedict = {'year': ca[1].year,
'month': ca[1].month,
'day': ca[1].day,
'hour': ca[1].hour,
'minute': ca[1].minute}
corpus_dict[ca[0]] = timedict
else:
corpus_dict[ca[0]] = ca[1]
return corpus_dict
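The `serialize` method removed here turned each `datetime` attribute into a dictionary of year, month, day, hour and minute, whereas the filter serialisation added in `filters.py` in this pull request emits ISO 8601 strings. The sketch below compares the two representations for one value; `as_time_dict` and `as_iso_string` are hypothetical helper names and the date is made up for illustration.

```python
import json
from datetime import datetime


def as_time_dict(value):
    """Old representation: one dictionary entry per date component."""
    return {
        'year': value.year,
        'month': value.month,
        'day': value.day,
        'hour': value.hour,
        'minute': value.minute,
    }


def as_iso_string(value):
    """ISO 8601 representation, as used for filter dates."""
    return value.isoformat()


min_date = datetime(1785, 1, 1)  # made-up example value

print(json.dumps(as_time_dict(min_date)))
print(json.dumps(as_iso_string(min_date)))

# The ISO string can be parsed back into the same datetime.
restored = datetime.fromisoformat(as_iso_string(min_date))
print(restored == min_date)
```

One practical difference is that the ISO string round-trips through a standard parser, while the dictionary form needs custom code on the receiving side.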

def _format_option(self, value, options):
'''
For serialisation: format language or category based on list of options
'''
return next(
nice_string
for code, nice_string in options
if value == code
)
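The removed `_format_option` helper looks up a display string in a list of `(code, nice_string)` pairs with `next()` over a generator. A short sketch of how that lookup behaves, including the case where the value is not among the options, is given below. `OPTIONS` is a hypothetical list, not the project's `CATEGORIES`, and `format_option_or_default` is an illustrative variant that is not part of the source.

```python
def format_option(value, options):
    """Return the display string for a code, as in the removed helper."""
    return next(
        nice_string
        for code, nice_string in options
        if value == code
    )


def format_option_or_default(value, options, default='Unknown'):
    """Illustrative variant: fall back to a default for unknown codes."""
    return dict(options).get(value, default)


# Hypothetical options; the real list lives in addcorpus.constants.
OPTIONS = [('newspaper', 'Newspapers'), ('book', 'Books')]

print(format_option('book', OPTIONS))

try:
    format_option('letter', OPTIONS)
except StopIteration:
    # next() without a default raises when no pair matches.
    print('no matching option')

print(format_option_or_default('letter', OPTIONS))
```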

def sources(self, start=datetime.min, end=datetime.max):
'''
Obtain source files for the corpus, relevant to the given timespan.
@@ -753,13 +685,13 @@ def __init__(self,
name=None,
display_name=None,
display_type=None,
description=None,
description='',
indexed=True,
hidden=False,
results_overview=False,
csv_core=False,
search_field_core=False,
visualizations=None,
visualizations=[],
visualization_sort=None,
es_mapping={'type': 'text'},
search_filter=None,
@@ -772,9 +704,11 @@ def __init__(self,
**kwargs
):

mapping_type = es_mapping['type']

self.name = name
self.display_name = display_name
self.display_type = display_type
self.display_name = display_name or name
self.display_type = display_type or mapping_type
self.description = description
self.search_filter = search_filter
self.results_overview = results_overview
@@ -790,41 +724,22 @@ def __init__(self,

self.sortable = sortable if sortable != None else \
not hidden and indexed and \
es_mapping['type'] in ['integer', 'float', 'date']
mapping_type in ['integer', 'float', 'date']

self.primary_sort = primary_sort

# Fields are searchable if they are not hidden and if they are mapped as 'text'.
# Keyword fields without a filter are also searchable.
self.searchable = searchable if searchable != None else \
not hidden and indexed and \
((self.es_mapping['type'] == 'text') or
(self.es_mapping['type'] == 'keyword' and self.search_filter == None))
((mapping_type == 'text') or
(mapping_type == 'keyword' and self.search_filter == None))
# Add back reference to field in filter
self.downloadable = downloadable

if self.search_filter:
self.search_filter.field = self
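The constructor above fills in several options when the caller leaves them out: `display_name` falls back to `name`, `display_type` falls back to the Elasticsearch mapping type, and `sortable` and `searchable` are inferred from the mapping type unless given explicitly. The function below is a standalone sketch of those rules, assuming the same semantics as the diff; `infer_field_options` is a hypothetical name. It also uses `None` instead of a mutable default for `es_mapping`, which is a choice made for the sketch and not something the source does.

```python
def infer_field_options(name, es_mapping=None, display_name=None,
                        display_type=None, hidden=False, indexed=True,
                        search_filter=None, sortable=None, searchable=None):
    """Sketch of the defaults applied in the field constructor."""
    # Sketch-only choice: avoid a shared mutable default argument.
    es_mapping = es_mapping or {'type': 'text'}
    mapping_type = es_mapping['type']

    if sortable is None:
        # Visible, indexed numeric and date fields are sortable by default.
        sortable = (not hidden and indexed
                    and mapping_type in ['integer', 'float', 'date'])

    if searchable is None:
        # Text fields, and keyword fields without a filter, are searchable.
        searchable = not hidden and indexed and (
            mapping_type == 'text'
            or (mapping_type == 'keyword' and search_filter is None)
        )

    return {
        'display_name': display_name or name,
        'display_type': display_type or mapping_type,
        'sortable': sortable,
        'searchable': searchable,
    }


print(infer_field_options('date', es_mapping={'type': 'date'}))
print(infer_field_options('content'))
print(infer_field_options('id', es_mapping={'type': 'keyword'}, hidden=True))
```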

def serialize(self):
"""
Convert Field object to a JSON-friendly dict format.
"""
field_dict = {}
for key, value in self.__dict__.items():
if key == 'search_filter' and value != None:
filter_name = str(type(value)).split(
sep='.')[-1][:-2]
search_dict = {'name': filter_name}
for search_key, search_value in value.__dict__.items():
if search_key == 'search_filter' or search_key != 'field':
search_dict[search_key] = search_value
field_dict['search_filter'] = search_dict
elif key != 'extractor':
field_dict[key] = value

return field_dict


# Helper functions ############################################################

26 changes: 16 additions & 10 deletions backend/addcorpus/filters.py
@@ -10,21 +10,27 @@ class Filter(object):
A filter is the interface between the form that is presented to users and
the ElasticSearch filter that is sent to the client.
'''
# TODO Far as I can tell, this is a specific implementation of a problem
# with which WTForms deals in general. Therefore, this should be embedded
# in WTForms.



    def __init__(self, description=None):
        self.field = None # Must be filled after initialising
        self.description = description


    def serialize(self):
        name = str(type(self)).split(sep='.')[-1][:-2]
        search_dict = {'name': name}
        for key, value in self.__dict__.items():
            if key == 'search_filter' or key != 'field':
                if type(value) == datetime:
                    search_dict[key] = value.isoformat()
                else:
                    search_dict[key] = value
        return search_dict
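To show what this serialisation produces, here is a self-contained sketch with a date filter. It simplifies two details of the method above: the class name is read with `type(self).__name__` instead of string slicing, and the condition `key == 'search_filter' or key != 'field'` is written as the equivalent `key != 'field'`. It also uses `isinstance`, which accepts `datetime` subclasses as well. The `super().__init__` call in `DateFilter` is an assumption, since the diff does not show the rest of that constructor, and the example dates and description are made up.

```python
from datetime import datetime


class Filter:
    def __init__(self, description=None):
        self.field = None  # back reference, left out of the serialised form
        self.description = description

    def serialize(self):
        serialized = {'name': type(self).__name__}
        for key, value in self.__dict__.items():
            if key == 'field':
                continue
            if isinstance(value, datetime):
                # Dates are emitted as ISO 8601 strings.
                serialized[key] = value.isoformat()
            else:
                serialized[key] = value
        return serialized


class DateFilter(Filter):
    def __init__(self, lower, upper, *args, **kwargs):
        self.lower = lower
        self.upper = upper
        # Assumed: the base constructor is called after setting the bounds.
        super().__init__(*args, **kwargs)


date_filter = DateFilter(
    datetime(1800, 1, 1),
    datetime(1899, 12, 31),
    description='Publication date',
)
print(date_filter.serialize())
```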

class DateFilter(Filter):
'''
Filter for datetime values: produces two datepickers for min and max date.
'''

def __init__(self, lower, upper, *nargs, **kwargs):
self.lower = lower
self.upper = upper
@@ -35,7 +41,7 @@ class RangeFilter(Filter):
'''
Filter for numerical values: produces a slider between two values.
'''

def __init__(self, lower, upper, *nargs, **kwargs):
self.lower = lower
self.upper = upper
@@ -46,7 +52,7 @@ class MultipleChoiceFilter(Filter):
'''
Filter for keyword values: produces a set of buttons.
'''

def __init__(self, option_count=10, *nargs, **kwargs):
self.option_count = option_count
# option_count defines how many buckets are retrieved
@@ -58,7 +64,7 @@ class BooleanFilter(Filter):
'''
Filter for boolean values: produces a drop-down menu.
''' #TODO checkbox?

def __init__(self, true, false, *nargs, **kwargs):
self.true = true
self.false = false