Scraper Data Discussion #54

jpahm · 2022-02-27T09:34:20Z

jpahm
Feb 27, 2022
Maintainer

The purpose of this discussion is to settle on API data types, based on sample data collected from the Coursebook scraper so far. The exact format of this sample data is entirely subject to change based on what we decide here, so treat it only as a sample of the kind of data the scraper makes available.

Sample data from the scraper is available here.

jpahm · 2022-03-02T06:33:34Z

jpahm
Mar 2, 2022
Maintainer Author

I have collected some points about the currently available data that will need to be considered alongside schema requirements.

Prereqs and coreqs for a section are currently stored together as an array of strings under Section.Requisites. These will need to be parsed further and converted into more useful objects. As far as I'm aware, the strings provided by Coursebook do not have a consistent format, so this may be a significant challenge. Examples of these strings are as follows:

"Prerequisite: CHEM 1111 or CHEM 1115. Corequisite: CHEM 1312. Repeat Restriction."
"Corequisite: CHEM 1111. Repeat Restriction."
"Prerequisite: CHEM 2323. Corequisite: CHEM 2125. Repeat Restriction."
"Prerequisites: (CHEM 1112 or CHEM 1116) and (CHEM 1312 or CHEM 1316)."
"Prerequisites: NATS 3341 and a university grade point average of at least 2.750. Prerequisite or Corequisite: NATS 3343."

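As a first step toward parsing these strings, here is a minimal sketch that only splits a requisite string into labeled clauses and leaves the condition text unparsed. The type and function names (RequisiteKind, RequisiteClause, splitRequisites) are hypothetical, and the label patterns are assumed from the five samples above, so other Coursebook strings may not fit.

```typescript
// Hypothetical names throughout; the labels are taken from the sample strings only.
type RequisiteKind = "prerequisite" | "corequisite" | "prerequisite_or_corequisite";

interface RequisiteClause {
  kind: RequisiteKind;
  text: string; // raw condition text, still unparsed
}

// The combined label is listed first so it wins over the single labels.
const LABEL = /(Prerequisites? or Corequisites?|Prerequisites?|Corequisites?):\s*/g;

function kindOf(label: string): RequisiteKind {
  const lower = label.toLowerCase();
  if (lower.indexOf(" or ") !== -1) return "prerequisite_or_corequisite";
  return lower.indexOf("prerequisite") === 0 ? "prerequisite" : "corequisite";
}

function splitRequisites(raw: string): RequisiteClause[] {
  // Record where every label starts and where its clause text begins.
  const matches: { label: string; start: number; end: number }[] = [];
  LABEL.lastIndex = 0;
  let m: RegExpExecArray | null;
  while ((m = LABEL.exec(raw)) !== null) {
    matches.push({ label: m[1], start: m.index, end: m.index + m[0].length });
  }

  // Each clause runs from the end of its label to the start of the next label.
  const clauses: RequisiteClause[] = [];
  for (let i = 0; i < matches.length; i++) {
    const stop = i + 1 < matches.length ? matches[i + 1].start : raw.length;
    const text = raw
      .slice(matches[i].end, stop)
      .replace(/Repeat Restriction\.?/g, "") // assumed to be a flag, not a condition
      .trim()
      .replace(/\.$/, "");
    clauses.push({ kind: kindOf(matches[i].label), text });
  }
  return clauses;
}
```

Turning each clause's text into a structured requirement (course lists, GPA conditions, nested and/or groups) would still need a separate parser; this sketch only separates prerequisites from corequisites.
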
The scraper currently collects core requirement information for a section under Section.CoreInfo, a string describing any core requirements the course meets. This string will need to be parsed into some sort of object. Once again, these strings don't have very consistent formatting, so this could be challenging. Examples of these strings are as follows:

"Texas Core Area 090 - Component Area Option"
"Texas Core Areas 030+090 - Life and Physical Sciences + CAO"
"Texas Core Areas 020+090 - Mathematics + CAO"

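A rough sketch of parsing Section.CoreInfo, assuming every string follows the shape of the three samples above (a "Texas Core Area(s)" prefix, three-digit codes joined by "+", then a dash and a description). The CoreInfo interface and parseCoreInfo function are hypothetical names, and strings that do not match come back as null rather than being guessed at.

```typescript
// Hypothetical target shape for a parsed Section.CoreInfo string.
interface CoreInfo {
  areaCodes: string[]; // e.g. ["030", "090"]
  description: string; // text after the dash, kept as-is
}

// Returns null when the string does not follow the assumed pattern.
function parseCoreInfo(raw: string): CoreInfo | null {
  const match = /^Texas Core Areas?\s+(\d{3}(?:\+\d{3})*)\s+-\s+(.+)$/.exec(raw.trim());
  if (match === null) return null;
  return { areaCodes: match[1].split("+"), description: match[2] };
}
```

Keeping the codes as strings preserves leading zeros such as "020" and "030".
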
The scraper currently doesn't deal with special cases: all sections are treated the same and any available data is used, with missing data set to null. We'll have to determine whether the semester data collected is adequate for what we need, or whether parts of the scraper need to handle more special parsing cases.

On the third point (special cases), I have already identified two that the scraper needs to handle: sections that do not award an integer number of credits, and sections that have multiple discrete meeting schedules rather than a single consistent one. My current plans for these two cases are as follows:

Non-integer credit count:

Store as -1 and leave the handling to code elsewhere
Store credit counts as strings instead of ints, parse into object elsewhere (this is the best option in my opinion)
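
The string option could look roughly like this on the consuming side. This is only a sketch: the CreditCount shape and parseCreditCount function are hypothetical, and it assumes a non-integer credit count is a plain decimal. Anything else (a range, for example) is kept as raw text with a null value.

```typescript
// Hypothetical shape: the scraper keeps the raw text, consumers parse on demand.
interface CreditCount {
  raw: string;          // exactly what the scraper stored
  value: number | null; // numeric value, or null when raw is not a plain number
}

function parseCreditCount(raw: string): CreditCount {
  const trimmed = raw.trim();
  // Assumption: plain integers or decimals only; other formats are left unparsed.
  const isPlainNumber = /^\d+(\.\d+)?$/.test(trimmed);
  return { raw, value: isPlainNumber ? parseFloat(trimmed) : null };
}
```

Unlike the -1 sentinel, this keeps the original text available when the value cannot be interpreted.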

Multiple meeting schedules:

Combine Section.MeetingDays, Section.Times, and Section.Location into a single Schedule object, then make a Section.Schedules property that is an array of these objects
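
A minimal sketch of the Schedule idea. The field types are assumptions, since the thread only names the three properties, and FlatSection, SectionWithSchedules, and toSchedules are hypothetical names used for illustration.

```typescript
// Field types are assumptions; only the property names come from the thread.
interface Schedule {
  MeetingDays: string[];
  Times: string; // raw time range text, left unparsed here
  Location: string;
}

// Hypothetical view of the current flat section shape (other fields omitted).
interface FlatSection {
  MeetingDays: string[];
  Times: string;
  Location: string;
}

// Hypothetical view of the proposed shape.
interface SectionWithSchedules {
  Schedules: Schedule[];
}

// Wraps a section's single schedule in a one-element array, so sections with
// one schedule and sections with several share the same shape.
function toSchedules(section: FlatSection): SectionWithSchedules {
  return {
    Schedules: [
      {
        MeetingDays: section.MeetingDays.slice(),
        Times: section.Times,
        Location: section.Location,
      },
    ],
  };
}
```

A section with several discrete meeting schedules would simply carry more than one entry in Schedules.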

Any suggestions for any of these points are welcome.

0 replies

jmyrick02 · 2022-03-02T10:32:21Z

jmyrick02
Mar 2, 2022

It would be useful to know how data will be stored in the database given a schema. I will include a proposed Course schema and a proposed Section schema here. Note that this is the representation in MongoDB (hence, for example, the ObjectId type). I will follow up with more detailed documentation once we have all agreed on or modified my proposals. I have also included potential schemas for all component data types of the Course and Section schemas. I wrote this up just now, without much collaboration or oversight, so please critique and give feedback on my proposals.

Course = {
    "_id": ObjectId,
    "course_number": string,
    "subject_prefix": string,
    "title": string,
    "description": string,
    "school": string,
    "credit_hours": string,
    "class_level": string,
    "activity_type": string,
    "grading": string,
    "internal_course_number": string,
    "prerequisites": Collection,
    "corequisites": Collection,
    "lecture_contact_hours": string,
    "laboratory_contact_hours": string,
    "offering_frequency": string,
    "attributes": Object,
}

Section = {
    "_id": ObjectId,
    "section_number": string,
    "course_reference": ObjectId,
    "section_corequisites": Collection,
    "academic_session": AcademicSession,
    "professors": Array<ObjectId>,
    "teaching_assistants": Array<Assistant>,
    "internal_class_number": string,
    "instruction_mode": string,
    "meetings": Array<Meeting>,
    "syllabus_uri": string,
    "grade_distribution": Array<number>,
    "attributes": Object,
}

AcademicSession = {
    "_id": ObjectId,
    "name": string,
    "start_date": Date,
    "end_date": Date,
}

Assistant = {
    "_id": ObjectId,
    "first_name": string,
    "last_name": string,
    "role": string,
    "email": string,
}

Meeting = {
    "_id": ObjectId,
    "start_date": Date,
    "end_date": Date,
    "meeting_days": Array<string>,
    "start_time": Time,
    "end_time": Time,
    "modality": string,
    "location": Location,
}

Professor = {
    "_id": ObjectId,
    "first_name": string,
    "last_name": string,
    "title": string,
    "email": string,
    "phone_number": string,
    "office": Location,
    "profile_uri": string,
    "office_hours": Array<Meeting>,
}

Location = {
    "_id": ObjectId,
    "building": string,
    "room": string,
    "map_uri": string,
}

Date = {
    "day": string,
    "month": string,
    "year": string,
}

Time = { // Not sure if UTC or CST
    "_id": ObjectId,
    "hour": number,
    "minute": number,
    "second": number,
}

Collection = {
    "_id": ObjectId,
    "name": string,
    "abbreviation": string,
    "type": string,
    "requisite_type": string,
    "required": number,
    "total": number,
    "options": Array<ObjectId | Collection>,
}

27 replies

jmyrick02 Mar 4, 2022

Thinking on it overnight, I really do think a more general approach is desirable. The end-user code for evaluating AND and OR with minTrue is ultimately going to be extremely similar, so it makes sense to just combine them into one requirement type. In my opinion, this is also conceptually easier to grasp and cleaner to write code for on both the back-end and for the end-users. I will now propose a compromise solution that in my opinion takes the positives of both of our proposals.

My solution scraps the Collection type and replaces it with CollectionRequirement, which derives from the abstract Requirement. This is essentially your OR proposal, but in my opinion the name makes more sense, since OR implies that only 1 needs to be true. Since CollectionRequirement is now a very general type, it makes sense to have a separate Degree type which contains other information. We could build types for major-required and core classes, but I don't think they're necessary, as they will never be free-standing and will always be a child of some other datatype, which will indicate what they hold. Furthermore, there would be no differences between these particular types except for which type it is, so no information is actually being given. I will now write my proposed schema out explicitly.

RequirementType = "course" | "section" | "exam" | "major" | "minor" | "gpa" | "consent" | "collection" | "other"

EvaluationType = "requirement_count" | "hour_count"

CollectionRequirement extends Requirement = {
    "_id": ObjectId,
    "type": "collection",
    "evaluation": EvaluationType,
    "required": number,
    "total": number,
    "requirements": Array<Requirement>,
}

DegreeSubtype = "major" | "minor" | "concentration"

Degree = {
    "_id": ObjectId,
    "subtype": DegreeSubtype,
    "name": string,
    "abbreviation": string,
    "minimum_credit_hours": number,
    "requirements": CollectionRequirement,
}
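
To illustrate why AND and OR collapse into one type, here is a small sketch against the schema above. AND is a collection where required equals the number of children, and OR is a collection where required is 1. The helper names (collection, allOf, anyOf, course) and the CourseRequirement leaf with its course field are hypothetical, and _id is omitted for brevity.

```typescript
type RequirementType =
  | "course" | "section" | "exam" | "major" | "minor"
  | "gpa" | "consent" | "collection" | "other";
type EvaluationType = "requirement_count" | "hour_count";

interface Requirement {
  type: RequirementType;
}

interface CollectionRequirement extends Requirement {
  type: "collection";
  evaluation: EvaluationType;
  required: number;
  total: number;
  requirements: Requirement[];
}

// Hypothetical leaf type, only here so the example has something to nest.
interface CourseRequirement extends Requirement {
  type: "course";
  course: string; // hypothetical field, e.g. "CHEM 1112"
}

function collection(required: number, requirements: Requirement[]): CollectionRequirement {
  return {
    type: "collection",
    evaluation: "requirement_count",
    required,
    total: requirements.length,
    requirements,
  };
}

// AND: every child must hold. OR: at least one child must hold.
const allOf = (...reqs: Requirement[]) => collection(reqs.length, reqs);
const anyOf = (...reqs: Requirement[]) => collection(1, reqs);
const course = (code: string): CourseRequirement => ({ type: "course", course: code });

// "(CHEM 1112 or CHEM 1116) and (CHEM 1312 or CHEM 1316)" from the sample data.
const chemPrereqs = allOf(
  anyOf(course("CHEM 1112"), course("CHEM 1116")),
  anyOf(course("CHEM 1312"), course("CHEM 1316")),
);
```

An "at least 2 of 5" rule uses the same constructor with required set to 2, which neither a plain AND nor a plain OR can express.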

jpahm Mar 4, 2022
Maintainer Author

I think this is a good compromise, however I do have two small changes that I think would be beneficial.

Get rid of EvaluationType and CollectionRequirement.evaluation, instead make hour_count a type of requirement (this would fit more consistently with the rest of the model and be easier to work with, imo)
Potentially remove total, since the total number of requirements can already easily be determined from the requirements array if it's needed (it could be argued that the convenience of having the total pre-prepared is a justification for keeping this field, however I have my doubts the total would ever need to be known in advance anyways, as the front-end should be able to just do a foreach to iterate automatically)

Other than those points, which I am also willing to compromise on, I'm happy with this schema and support it as our final revision.

jmyrick02 Mar 5, 2022

I think these recommendations are really good. I agree that we don't need a total property and that the two types of evaluation should each get their own Requirement type. Given these alterations, here are the pertinent revised parts of the schema.

RequirementType = "course" | "section" | "exam" | "major" | "minor" | "gpa" | "consent" | "collection" | "hours" | "other"

CollectionRequirement extends Requirement = {
    "_id": ObjectId,
    "type": "collection",
    "required": number,
    "options": Array<Requirement>,
}

HoursRequirement extends Requirement = {
    "_id": ObjectId,
    "type": "hours",
    "required": number,
    "options": Array<CourseRequirement>,
}

DegreeSubtype = "major" | "minor" | "concentration"

Degree = {
    "_id": ObjectId,
    "subtype": DegreeSubtype,
    "name": string,
    "abbreviation": string,
    "minimum_credit_hours": number,
    "requirements": CollectionRequirement,
}
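
For reference, a hedged sketch of how end-user code might evaluate the revised types. It covers only three requirement types and models them as a discriminated union instead of an abstract base, which is a choice made for this sketch. CourseRequirement is not defined in this thread, so its course and hours fields are assumptions, _id is omitted, and the course codes and hour values are placeholders.

```typescript
// Hypothetical leaf: the thread does not define CourseRequirement's fields.
interface CourseRequirement {
  type: "course";
  course: string; // placeholder course code
  hours: number;  // credit hours counted toward an HoursRequirement
}

interface CollectionRequirement {
  type: "collection";
  required: number; // how many options must be satisfied
  options: Requirement[];
}

interface HoursRequirement {
  type: "hours";
  required: number; // how many credit hours must be earned from the options
  options: CourseRequirement[];
}

type Requirement = CourseRequirement | CollectionRequirement | HoursRequirement;

function isSatisfied(req: Requirement, completed: string[]): boolean {
  switch (req.type) {
    case "course":
      return completed.indexOf(req.course) !== -1;
    case "collection": {
      // Count satisfied options; no stored total is needed for this.
      const met = req.options.filter((o) => isSatisfied(o, completed)).length;
      return met >= req.required;
    }
    case "hours": {
      let earned = 0;
      for (const o of req.options) {
        if (isSatisfied(o, completed)) earned += o.hours;
      }
      return earned >= req.required;
    }
  }
}

// Placeholder data: one of two courses, plus at least 3 hours from a pool.
const sampleRequirement: Requirement = {
  type: "collection",
  required: 2,
  options: [
    {
      type: "collection",
      required: 1,
      options: [
        { type: "course", course: "AAA 1001", hours: 3 },
        { type: "course", course: "AAA 1002", hours: 3 },
      ],
    },
    {
      type: "hours",
      required: 3,
      options: [
        { type: "course", course: "BBB 2001", hours: 1 },
        { type: "course", course: "BBB 2002", hours: 2 },
      ],
    },
  ],
};
```

With this data, isSatisfied(sampleRequirement, completed) holds only when completed contains one of the AAA courses and both BBB courses, since the pool offers exactly 3 hours.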

jpahm Mar 5, 2022
Maintainer Author

This looks great. If everyone approves, I think we should go forward with providing the other teams a final schema document and finish documentation.

TrystonMinsquero Mar 5, 2022

Awesome. Whoever posts the final schema document with documentation in #57 first gets a cookie.

jpahm · 2022-03-03T06:26:19Z

jpahm
Mar 3, 2022
Maintainer Author

Overall I'm feeling pleased with the data collected by the scraper as of now and believe it will be able to support whatever schema plans we develop. However, I have noticed a large issue with the current Coursebook parsing, in that sometimes instructors and assistants are incorrectly listed multiple times.

Here is a real example of such an error:

{
	"Name": "Lin Jia",
	"Role": "Primary Instructor (50%)",
	"EMail": "[email protected]"
},
{
	"Name": "Stefanie Boyd",
	"Role": "Primary Instructor (50%)",
	"EMail": "[email protected]"
},
{
	"Name": "Stefanie Boyd",
	"Role": "Primary Instructor (50%)",
}

I have already identified the cause of this error and plan on fixing it shortly. I would encourage everyone to skim through the sample data linked at the top of this discussion and bring up any comparable issues they notice so they can be resolved.
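
Separately from fixing the root cause in the parser, a defensive post-processing pass could collapse such duplicates. The sketch below is not the fix described above. It assumes entries count as duplicates when Name and Role match exactly; the Person interface and dedupePeople function are hypothetical names.

```typescript
// Mirrors the fields in the sample above; EMail is optional because the
// duplicated entry in the example has none.
interface Person {
  Name: string;
  Role: string;
  EMail?: string;
}

// Collapses entries that share Name and Role, keeping the first EMail seen
// and preserving the original order of first appearance.
function dedupePeople(people: Person[]): Person[] {
  const byKey: { [key: string]: Person } = {};
  const order: string[] = [];
  for (const p of people) {
    const key = p.Name + "\u0000" + p.Role;
    const existing = byKey[key];
    if (existing === undefined) {
      byKey[key] = { ...p }; // copy so the input is not mutated
      order.push(key);
    } else if (existing.EMail === undefined && p.EMail !== undefined) {
      existing.EMail = p.EMail;
    }
  }
  return order.map((k) => byKey[k]);
}
```

Matching on Name and Role alone would wrongly merge two different people who share both, so this is best treated as a safety net, not a replacement for correct parsing.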

0 replies
