
Large Quantity Data Set Handling #607

Open
JamesHabben opened this issue Nov 13, 2023 · 8 comments
Labels
enhancement New feature or request


@JamesHabben
Collaborator

Problem

overall, there are modules that parse, or have the potential to parse, a large number of data records. writing those records into HTML tags creates a large overhead in both file storage and processing, leading to bloated reports and potential non-load issues on less powerful computers.

Data

I noticed that the health - heart rate output from Josh's public image creates around 23.5k records and a 10 MB HTML file. it is also timing out on some of my computers due to memory. the health - steps output is 15.5k records and a 4 MB HTML file.

Solution

i think we can address this with a relatively low impact change by loading data from either a separate JSON file or possibly a SQLite db file. JSON would be a lower impact to the code base since it is native to JavaScript. i am exploring the structure of having an option for a module to write its data to a JSON file and have the HTML file load it when rendered. it won't be 'true' JSON, since that typically repeats the field names in front of every value for every record. instead, an array of data rows, which are themselves just arrays of data fields, will drastically reduce the size of the data set.

true json

[
  {
    "field1": "value1",
    "field2": ""
  }
]

array of array

[
  [ "value1", "" ]
]
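For illustration, here is a minimal Python sketch of that conversion (the helper name `to_compact_json` is hypothetical, not part of iLEAPP): the field names are written once, and each record becomes a bare array of values.

```python
import json

def to_compact_json(records, fields):
    # Hypothetical helper: write field names once, then each record as a
    # plain array of values, instead of repeating the keys per record.
    payload = {
        "fields": fields,
        "rows": [[rec.get(f, "") for f in fields] for rec in records],
    }
    return json.dumps(payload, separators=(",", ":"))

records = [
    {"field1": "value1", "field2": ""},
    {"field1": "value2", "field2": "x"},
]
compact = to_compact_json(records, ["field1", "field2"])
```

The HTML report's JavaScript would then fetch this file and rebuild table rows from `fields` plus `rows`.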

Tag: @Johann-PLW , @abrignoni

@JamesHabben
Collaborator Author

here is a really rough pass. it errors on some modules, but it processes enough to do some testing:
https://github.com/JamesHabben/iLEAPP/tree/dynamicreport-dataarrays

the health - heart rate module gets 23.5k records from Josh's public image. the previous HTML file was around 10-11 MB. using this branch, that file cuts down to around 6 MB. on the larger file, my browser was timing out with all the data. with this branch, i get 12 loops of the timer circle, and then the data is all loaded. once it loads, the pages are actually quite responsive when moving around. it's also quite quick when using the sorts. it's just the initial load that takes some time.

@Johann-PLW can you try this against your larger data set to see if it helps?

@JamesHabben
Copy link
Collaborator Author

i am less concerned about the size of the HTML file itself. this is more about the browser being able to load and work with the large amount of data. we could do some other things to reduce file size, but something like gzip compression might actually add overhead in the browser, making it worse. if this branch code doesn't work to load all of your large data set, we may need to explore other approaches of breaking that data up into segments.
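If segmenting turns out to be necessary, one possible shape for it (a sketch under assumed names, not committed code) is fixed-size chunk files plus a small index the page can consult to fetch segments on demand:

```python
import json
from pathlib import Path

def write_segments(rows, out_dir, chunk_size=50_000):
    """Split rows into numbered JSON chunk files and write an index.

    The report page can then fetch chunks lazily (e.g. per page of the
    table) instead of parsing one huge blob up front.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    names = []
    for i in range(0, len(rows), chunk_size):
        name = f"data_{i // chunk_size:04d}.json"
        (out / name).write_text(json.dumps(rows[i:i + chunk_size]))
        names.append(name)
    (out / "index.json").write_text(
        json.dumps({"total": len(rows), "chunks": names}))
    return names
```

The index tells the JavaScript side how many rows exist in total, so pagination and the scroll bar can be sized before any chunk is loaded.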

@Johann-PLW
Collaborator

If this branch code doesn't work to load all of your large data set, we may need to explore other approaches of breaking that data up into segments.

That's something I've actually been thinking about, perhaps grouping a large amount of data by year, month, day, or hour.
With the work you've already done on the HTML report, we could display less info on the screen and move the redundant details into a tooltip.

I'll do some tests with my dataset using the code from your 'dynamicreport-dataarrays' branch and let you know.

@Johann-PLW
Collaborator

@JamesHabben
Unfortunately, it doesn't work with my personal dataset (encrypted backup of an iPhone 13 mini with iOS 17.0.3)

The heart rate query matches 1,028,115 records.
The previously generated HTML file was 240.9 MB; the new one (with your updated code) is 235.1 MB.
The web browsers (Safari & Chrome) are still unresponsive after 10 minutes of trying to load the data.

The steps query matches 493,272 records.
The previously generated HTML file was 66.7 MB; the new one is bigger at 81.5 MB.
Both web browsers are unresponsive.

Tests were conducted on a MacBook Pro 2019 (2.4 GHz Intel Core i9, 8 cores, 32 GB RAM) with macOS 13.5.1.
Web browsers were Safari 6.6 (18615.3.12.11.2) and Google Chrome 119.0.6045.159.

@JamesHabben
Collaborator Author

oof. not sure why heart rate didn't reduce more, and frustrated at the steps increase. i can reduce some of that by using less text in the structure, but i don't think that will make much difference in the browser loading this data set. what do you think about sampling the data on the python side? 1 mil records is a lot of data and will be hard to incorporate into a broader framework like this. i wonder if we can find a framework that can do some time based sampling, averaging, and anomaly highlighting, and pass a reduced set of data to the browser.

@JamesHabben
Collaborator Author

@Johann-PLW What's the time range and frequency of your heart rate data? If we did some summary of the data, say every 15 mins, how much would that reduce the record count? We might have to adjust based on the frequency.

We can provide typical summary numbers like minimum, maximum, and average, and if the user wants to investigate in more detail, the TSV output is available.

While typing this, though, I wanted to do some math. I think hourly summary periods really might need to be the one.

Here are my calcs:

1. Hourly Records:
   - 1 record per hour
   - 24 records per day
   - In 3 years: 24 × 365 × 3 = 26,280 records
   - In 5 years: 24 × 365 × 5 = 43,800 records
   - In 10 years: 24 × 365 × 10 = 87,600 records
2. Half-hourly Records:
   - 2 records per hour
   - 48 records per day
   - In 3 years: 48 × 365 × 3 = 52,560 records
   - In 5 years: 48 × 365 × 5 = 87,600 records
   - In 10 years: 48 × 365 × 10 = 175,200 records
3. Every 15 Minutes:
   - 4 records per hour
   - 96 records per day
   - In 3 years: 96 × 365 × 3 = 105,120 records
   - In 5 years: 96 × 365 × 5 = 175,200 records
   - In 10 years: 96 × 365 × 10 = 350,400 records
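The summary idea above could be sketched on the Python side with only the standard library (the `(timestamp, value)` input shape here is an assumption for illustration, not the actual module output):

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def summarize_hourly(samples):
    """Collapse (timestamp, value) samples into one row per hour.

    Each output row carries min/max/average plus a sample count, so an
    examiner can still spot hours worth pulling from the full TSV export.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return [
        (hour, min(vals), max(vals), round(mean(vals), 1), len(vals))
        for hour, vals in sorted(buckets.items())
    ]

samples = [
    (datetime(2023, 11, 13, 9, 5), 62),
    (datetime(2023, 11, 13, 9, 40), 88),
    (datetime(2023, 11, 13, 10, 2), 71),
]
summary = summarize_hourly(samples)
```

Swapping the `replace(...)` call for a 15- or 30-minute bucket key would give the other two granularities from the calcs above.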

@Johann-PLW
Collaborator

@JamesHabben
I have records since April 2015, and the frequency depends on my activity:

- up to 15 per minute
- 10 ~ 150 per hour
- 100 ~ 1000 per day
- 10k ~ 20k per month

I think we could also remove some columns, like Device and Manufacturer.

[screenshot: report table showing the Device and Manufacturer columns]

Since the device and/or software used to collect the data, and the timezone, are very repetitive, could we store that information once in an array and reference it from every record?
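That is essentially dictionary encoding: store each distinct (device, manufacturer, timezone) combination once in a lookup table and keep only a small integer index per record. A rough sketch with a hypothetical column layout:

```python
def dictionary_encode(rows, repeat_cols):
    """Replace repetitive column values with an index into a lookup table.

    rows: list of dicts; repeat_cols: column names that rarely change.
    Returns (encoded_rows, lookup), where each encoded row carries a
    'src' integer pointing into lookup.
    """
    lookup, seen, encoded = [], {}, []
    for row in rows:
        key = tuple(row[c] for c in repeat_cols)
        if key not in seen:
            seen[key] = len(lookup)
            lookup.append(list(key))
        slim = {c: v for c, v in row.items() if c not in repeat_cols}
        slim["src"] = seen[key]
        encoded.append(slim)
    return encoded, lookup

rows = [
    {"bpm": 62, "device": "Apple Watch", "manufacturer": "Apple", "tz": "UTC"},
    {"bpm": 64, "device": "Apple Watch", "manufacturer": "Apple", "tz": "UTC"},
    {"bpm": 70, "device": "iPhone", "manufacturer": "Apple", "tz": "UTC"},
]
encoded, lookup = dictionary_encode(rows, ["device", "manufacturer", "tz"])
```

For a million heart-rate rows sharing a handful of devices, the repeated strings collapse into a lookup table of a few entries, while the display layer can still expand the full values for every record.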

@JamesHabben JamesHabben added the enhancement New feature or request label Jan 3, 2024
@JamesHabben
Collaborator Author

i think we have this solved with the upcoming lava release. @Johann-PLW do you agree?
