
Large Quantity Data Set Handling #607

Open
JamesHabben opened this issue Nov 13, 2023 · 8 comments
Labels
enhancement New feature or request


@JamesHabben
Collaborator

Problem

overall, there are modules that parse, or have the potential to parse, a large number of data records. writing those records into HTML tags creates a large overhead in both file storage and processing, leading to bloated reports and potential non-load issues on less powerful computers.

Data

I noticed that the health - heart rate output from Josh's public image creates around 23.5k records and a 10 MB HTML file. it is also timing out on some of my computers due to memory. the health - steps output is 15.5k records and a 4 MB HTML file.

Solution

i think we can address this with a relatively low impact change by loading data from either a separate JSON file or possibly a SQLite db file. JSON would be a lower impact to the code base since it is native to JavaScript. i am exploring the structure of having an option for a module to write its data to a JSON file and have the HTML file load it when rendered. it won't be 'true' JSON, since that typically repeats the field names in front of every value for every record. instead, an array of data rows, which are themselves just arrays of data fields, will drastically reduce the size of the data set.

true json

[
  {
    "field1": "value1",
    "field2": ""
  }
]

array of array

[
  [ "value1", "" ]
]
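For illustration, here is a minimal Python sketch of that conversion (the helper name `to_compact_json` is hypothetical, not part of iLEAPP): the field names are written once, and each record becomes a bare array of values.

```python
import json

def to_compact_json(records, fields):
    # Hypothetical helper: write field names once, then each record as a
    # plain array of values, instead of repeating the keys per record.
    payload = {
        "fields": fields,
        "rows": [[rec.get(f, "") for f in fields] for rec in records],
    }
    return json.dumps(payload, separators=(",", ":"))

records = [
    {"field1": "value1", "field2": ""},
    {"field1": "value2", "field2": "x"},
]
compact = to_compact_json(records, ["field1", "field2"])
```

The HTML report's JavaScript would then fetch this file and rebuild table rows from `fields` plus `rows`.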

Tag: @Johann-PLW , @abrignoni

@JamesHabben
Collaborator Author

here is a really rough pass. it errors on some modules, but it processes enough to do some testing:
https://github.com/JamesHabben/iLEAPP/tree/dynamicreport-dataarrays

the health - heart rate module gets 23.5k records from Josh's public image. the previous HTML file was around 10-11 MB. using this branch, that file cuts down to around 6 MB. on the larger file, my browser was timing out with all the data. with this branch, i get 12 loops of the timer circle, and then the data is all loaded. once it loads, the pages are actually quite responsive when moving around. it's also quite quick when using the sorts. it's just the initial load that takes some time.

@Johann-PLW can you try this against your larger data set to see if it helps?

@JamesHabben
Copy link
Collaborator Author

i am less concerned about the size of the HTML file itself. this is more about the browser being able to load and work with the large amount of data. we could do some other things to reduce file size, but something like gzip compression might actually add overhead in the browser, making it worse. if this branch code doesn't work to load all of your large data set, we may need to explore other approaches of breaking that data up into segments.
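If segmenting turns out to be necessary, one possible shape for it (a sketch under assumed names, not committed code) is fixed-size chunk files plus a small index the page can consult to fetch segments on demand:

```python
import json
from pathlib import Path

def write_segments(rows, out_dir, chunk_size=50_000):
    """Split rows into numbered JSON chunk files and write an index.

    The report page can then fetch chunks lazily (e.g. per page of the
    table) instead of parsing one huge blob up front.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    names = []
    for i in range(0, len(rows), chunk_size):
        name = f"data_{i // chunk_size:04d}.json"
        (out / name).write_text(json.dumps(rows[i:i + chunk_size]))
        names.append(name)
    (out / "index.json").write_text(
        json.dumps({"total": len(rows), "chunks": names}))
    return names
```

The index tells the JavaScript side how many rows exist in total, so pagination and the scroll bar can be sized before any chunk is loaded.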

@Johann-PLW
Collaborator

If this branch code doesn't work to load all of your large data set, we may need to explore other approaches of breaking that data up into segments.

That's something I've actually been thinking about, perhaps grouping a large amount of data by year, month, day, or hour.
With the work you've already done on the HTML report, we could display less info on the screen and move the redundant details into a tooltip.

I'll do some tests with my dataset using the code from your 'dynamicreport-dataarrays' branch and let you know.

@Johann-PLW
Collaborator

@JamesHabben
Unfortunately, it doesn't work with my personal dataset (encrypted backup of an iPhone 13 mini with iOS 17.0.3)

The heart rate query matches 1,028,115 records.
The previously generated HTML file was 240.9 MB; the new one (with your updated code) is 235.1 MB.
The web browsers (Safari & Chrome) are still unresponsive after 10 minutes of trying to load the data.

The steps query matches 493,272 records.
The previously generated HTML file was 66.7 MB; the new one is bigger at 81.5 MB.
Both web browsers are unresponsive.

Tests were conducted on a MacBook Pro 2019 (2.4 GHz Intel Core i9, 8 cores, 32 GB RAM) with macOS 13.5.1.
Web browsers were Safari 6.6 (18615.3.12.11.2) and Google Chrome 119.0.6045.159.

@JamesHabben
Collaborator Author

oof. not sure why heart rate didn't reduce more, and frustrated at the steps increase. i can reduce some of that by using less text in the structure, but i don't think that will make much difference in the browser loading this data set. what do you think about sampling the data on the python side? 1 mil records is a lot of data and will be hard to incorporate into a broader framework like this. i wonder if we can find a framework that can do some time based sampling, averaging, and anomaly highlighting, and pass a reduced set of data to the browser.

@JamesHabben
Collaborator Author

@Johann-PLW What's the time range and frequency of your heart rate data? If we did some summary of the data, say every 15 mins, how much would that reduce the record count? We might have to adjust based on the frequency.

We can provide typical summary numbers like minimum, maximum, and average, and if the user wants to investigate in more detail, the TSV output is available.

While typing this, though, I wanted to do some math. I think hourly summary periods really might need to be the one.

Here are my calcs:

1. Hourly Records:
   - 1 record per hour
   - 24 records per day
   - In 3 years: 24 × 365 × 3 = 26,280 records
   - In 5 years: 24 × 365 × 5 = 43,800 records
   - In 10 years: 24 × 365 × 10 = 87,600 records
2. Half-hourly Records:
   - 2 records per hour
   - 48 records per day
   - In 3 years: 48 × 365 × 3 = 52,560 records
   - In 5 years: 48 × 365 × 5 = 87,600 records
   - In 10 years: 48 × 365 × 10 = 175,200 records
3. Every 15 Minutes:
   - 4 records per hour
   - 96 records per day
   - In 3 years: 96 × 365 × 3 = 105,120 records
   - In 5 years: 96 × 365 × 5 = 175,200 records
   - In 10 years: 96 × 365 × 10 = 350,400 records
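The summary idea above could be sketched on the Python side with only the standard library (the `(timestamp, value)` input shape here is an assumption for illustration, not the actual module output):

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def summarize_hourly(samples):
    """Collapse (timestamp, value) samples into one row per hour.

    Each output row carries min/max/average plus a sample count, so an
    examiner can still spot hours worth pulling from the full TSV export.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return [
        (hour, min(vals), max(vals), round(mean(vals), 1), len(vals))
        for hour, vals in sorted(buckets.items())
    ]

samples = [
    (datetime(2023, 11, 13, 9, 5), 62),
    (datetime(2023, 11, 13, 9, 40), 88),
    (datetime(2023, 11, 13, 10, 2), 71),
]
summary = summarize_hourly(samples)
```

Swapping the `replace(...)` call for a 15- or 30-minute bucket key would give the other two granularities from the calcs above.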

@Johann-PLW
Collaborator

@JamesHabben
I have records since April 2015, and the frequency depends on my activity:

- up to 15 per minute
- 10 ~ 150 per hour
- 100 ~ 1000 per day
- 10k ~ 20k per month

I think we could also remove some columns, like Device and Manufacturer.

[screenshot: report table showing the Device and Manufacturer columns]

Since the device and/or software used to collect the data, and the timezone, are very repetitive, could we store that information once in an array and reference it from every record?
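That is essentially dictionary encoding: store each distinct (device, manufacturer, timezone) combination once in a lookup table and keep only a small integer index per record. A rough sketch with a hypothetical column layout:

```python
def dictionary_encode(rows, repeat_cols):
    """Replace repetitive column values with an index into a lookup table.

    rows: list of dicts; repeat_cols: column names that rarely change.
    Returns (encoded_rows, lookup), where each encoded row carries a
    'src' integer pointing into lookup.
    """
    lookup, seen, encoded = [], {}, []
    for row in rows:
        key = tuple(row[c] for c in repeat_cols)
        if key not in seen:
            seen[key] = len(lookup)
            lookup.append(list(key))
        slim = {c: v for c, v in row.items() if c not in repeat_cols}
        slim["src"] = seen[key]
        encoded.append(slim)
    return encoded, lookup

rows = [
    {"bpm": 62, "device": "Apple Watch", "manufacturer": "Apple", "tz": "UTC"},
    {"bpm": 64, "device": "Apple Watch", "manufacturer": "Apple", "tz": "UTC"},
    {"bpm": 70, "device": "iPhone", "manufacturer": "Apple", "tz": "UTC"},
]
encoded, lookup = dictionary_encode(rows, ["device", "manufacturer", "tz"])
```

For a million heart-rate rows sharing a handful of devices, the repeated strings collapse into a lookup table of a few entries, while the display layer can still expand the full values for every record.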

@JamesHabben JamesHabben added the enhancement New feature or request label Jan 3, 2024
@JamesHabben
Collaborator Author

i think we have this solved with the upcoming lava release. @Johann-PLW do you agree?
