Discuss ADR: Media Manager Architecture #851

Open
JamesHabben opened this issue Oct 11, 2024 · 6 comments

Comments

@JamesHabben
Collaborator

JamesHabben commented Oct 11, 2024

Media Manager Implementation Discussion

We're proposing a new Media Management System to centralize and streamline how we handle media files across different modules in our project. This issue is to discuss the proposed architecture and technical specifications.

Key Points:

  1. Centralized media check-in process
  2. Avoidance of duplicate media entries
  3. Automatic reference tracking
  4. Integration with existing parser systems

Proposed Changes:

  1. New check_in_media function that modules will use to register media files
  2. New database tables: media_items and media_references
  3. Modules will store MediaReference objects as cell values
  4. Updates to automatic parsers in ilapfuncs.py and lavafuncs.py to handle MediaReference objects
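As a rough illustration of how the two proposed tables could avoid duplicate media entries while still tracking every reference, here is a minimal SQLite sketch. Only the table names `media_items` and `media_references` come from the proposal; every column name and the helper `record_media_reference` are assumptions made for this example, not the signature of `check_in_media`.

```python
import sqlite3

# Hypothetical schema: the column names below are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS media_items (
    id TEXT PRIMARY KEY,            -- unique identifier of the media file
    source_path TEXT NOT NULL       -- path of the file in the extraction
);
CREATE TABLE IF NOT EXISTS media_references (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    media_item_id TEXT NOT NULL REFERENCES media_items(id),
    module_name TEXT NOT NULL,
    artifact_name TEXT NOT NULL,
    UNIQUE (media_item_id, module_name, artifact_name)
);
"""


def create_media_tables(conn):
    """Create both tables if they do not exist yet."""
    conn.executescript(SCHEMA)


def record_media_reference(conn, item_id, source_path, module_name, artifact_name):
    """Register the media item once and add one reference per module/artifact."""
    # INSERT OR IGNORE leaves an existing row untouched, so a repeated
    # check-in of the same item does not create a duplicate entry.
    conn.execute(
        "INSERT OR IGNORE INTO media_items (id, source_path) VALUES (?, ?)",
        (item_id, source_path),
    )
    conn.execute(
        "INSERT OR IGNORE INTO media_references "
        "(media_item_id, module_name, artifact_name) VALUES (?, ?, ?)",
        (item_id, module_name, artifact_name),
    )
    conn.commit()
```

With this layout, a file used by two artifacts would yield one row in `media_items` and two rows in `media_references`.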

Documents for Review:

Questions for Discussion:

  1. Is the proposed check_in_media function signature sufficient? Should we add or remove any parameters?
  2. Are there any concerns about performance, especially for projects with a large number of media files?
  3. How should we handle cases where the same media file is referenced by multiple modules or artifacts?
  4. Are there any additional metadata fields we should consider extracting and storing?
  5. How should we approach implementing the automatic parser updates in ilapfuncs.py and lavafuncs.py?

Please review the linked documents and share your thoughts, concerns, or suggestions.

JamesHabben changed the title from "Media Manager Architecture" to "Discuss ADR: Media Manager Architecture" on Oct 21, 2024
@abrignoni
Owner

Per discussion on another platform:

  • The check_in_media function will add the metadata as stated in the documents for review into a data structure that LAVA will use in order to display the media and the metadata to the user.
  • Refactoring of module code will be minimal as long as the HTML output is still used, which we hope will not be for long.
  • The seeker functions in the LEAPP will need to be refactored considerably in order to add this new functionality in a way that is transparent to the module developer.
  • The data directory in the LEAPP report folder will be the ultimate repository of all data to include media. This will provide consistency across all reporting sources be it zip, tar, or from a directory in the file system.

@abrignoni
Owner

Some of my thoughts on the questions:

  1. Any metadata we can pull from files we should. I agree on EXIF data as soon as feasible.
  2. As discussed previously the app loading items as they are being accessed is a key feature of this reporting system.
  3. The LAVA database will have the paths for each file it needs to show in a module's table. As the user moves from module to module in LAVA, the tool will just use the reference. The metadata will be on the media/file data structure.

@snoop168
Collaborator

snoop168 commented Nov 4, 2024

Proposing a workflow to consider as it may relate to the development of this feature (assuming my understanding of everything is correct so far):

  1. Developer adds a path regex to the artifact/module for files to include
  2. When the artifact/module runs, all files responsive to this path regex will be iterated. If a file is not already in the media_items table, it will be added, given a unique identifier, and copied out to the data folder. If it's already present in the media_items table, no action is needed as it relates to the media_items table or copying of the file. An entry is added in the media_references table for this file/artifact. I'm thinking the media_items table may require two path columns: the first will be the path of the file in the extraction, and the second will be a relative path in the data folder so LAVA can locate it. For actual files these will generally be the same or easily derived from one another, but the two columns will be more important for files that are extracted as embedded data from another file (for example, blob data extracted from a database and then written to some "virtual" file).
  3. When the developer wants to include this file in the report, I would think they will refer to it by its unique identifier. There should be a search function to search within the current media manager (media_items table). The search function should probably minimally support an "ends with" search, since the most common scenario is likely that the artifact processor knows the file should be one of those that matched its grep pattern, and now we just need to uniquely locate it without necessarily knowing the entire data path (e.g., iOS apps that don't know their own data container GUID; while code can be written to learn their own GUID, it might not be necessary under most circumstances). Any search functionality should optionally take the identifier of the artifact that requested the media file, to be used as a filter so it can search only for files within a path that it requested. The search functionality can return a list of media_item identifiers that match the search. The developer can then store these identifier(s) in a column that will be tagged as a "mediaitem" special column. This special column could allow for either a list or a single media item. This column can be handled within LAVA for the most appropriate display based on the type of file. These files will still need to be displayed properly in the HTML. Perhaps the HTML generation code can be programmed to understand the special column type "mediaitem" and handle it identically to how LAVA will.
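The "ends with" search with an optional artifact filter described in step 3 could look roughly like the sketch below. The function name `find_media_ids` and the columns `id`, `source_path`, `media_item_id`, and `artifact_name` are assumptions for illustration; only the table names appear in this discussion.

```python
import sqlite3


def find_media_ids(conn, path_suffix, artifact_name=None):
    """Return identifiers of media items whose path ends with path_suffix.

    If artifact_name is given, only items that this artifact has a
    reference to are considered (hypothetical column names throughout).
    """
    if artifact_name is None:
        rows = conn.execute("SELECT id, source_path FROM media_items")
    else:
        rows = conn.execute(
            "SELECT DISTINCT i.id, i.source_path "
            "FROM media_items AS i "
            "JOIN media_references AS r ON r.media_item_id = i.id "
            "WHERE r.artifact_name = ?",
            (artifact_name,),
        )
    # The suffix match is done in Python so that characters such as % and _
    # in file names need no escaping, as they would in a SQL LIKE pattern.
    return [item_id for item_id, path in rows if path.endswith(path_suffix)]
```

The returned list can hold zero, one, or several identifiers, which matches the idea of a "mediaitem" column that accepts either a single item or a list.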

@JamesHabben
Collaborator Author

JamesHabben commented Nov 6, 2024

@snoop168
... all files responsive to this path regex will be iterated. If that file is not already in the media_items table it will be added, given a unique identifier and the file copied out to the data folder. If it's already present in media_items table no action is needed as it relates to the media_items table or copying of file.

The module will need to take an action (i.e., 'check in' a media file) even if the file is in the search pattern it provides. We do not currently export all files matching these patterns, and I don't think we should. We may want to provide a similar 'check in' function for non-media files. I am thinking of plist, sqlite, or other parsed data files. These currently only get exported if the module does it in its own code.

edit: We should not assume that all files in a pattern match are relevant to what the module or artifact is attempting to parse. Some of the patterns have to include a lot to access the true target files.

@Johann-PLW
Collaborator

Some of my thoughts after reading this discussion, though I am pretty sure new ones will come up during the implementation phase.

Regarding the check_in_media function:

  • I think the function signature is sufficient.
  • If the file exists, it returns the existing MediaReference, but I think we need to store in the database that this module/artifact also uses this particular file.
  • The data folder is a good place to store files in their original path, but at the moment FileSeekerDir does not copy matching files into it.
  • As we can have different kinds of metadata (EXIF, PDF or Office documents, ID3, etc.), this data should be stored in the database in a data structure like JSON. This is already the case for the timeline database.
  • We should also only store files really used by the module/artifact and not all files matching the regex.
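To make the JSON idea concrete, here is a minimal sketch of storing heterogeneous metadata in a single text column. The `metadata` column and the helper names `store_metadata` and `load_metadata` are assumptions for this example and are not part of the proposal.

```python
import json
import sqlite3


def store_metadata(conn, item_id, metadata):
    """Serialize a metadata dict (EXIF, ID3, PDF info, ...) into one column."""
    conn.execute(
        "UPDATE media_items SET metadata = ? WHERE id = ?",
        (json.dumps(metadata, sort_keys=True), item_id),
    )
    conn.commit()


def load_metadata(conn, item_id):
    """Return the stored metadata as a dict, or an empty dict if none exists."""
    row = conn.execute(
        "SELECT metadata FROM media_items WHERE id = ?", (item_id,)
    ).fetchone()
    if row is None or row[0] is None:
        return {}
    return json.loads(row[0])
```

Because the column is free-form JSON, a photo can carry EXIF keys and an audio file ID3 keys without any schema change.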

Concerns about performance:

  • I am pretty sure that it is more efficient to query a database to check if a file was already used/extracted for a particular module/artifact than browsing the data folder on the media storage

Implementation:

  • I think we need to rewrite the seeker functions or pay more attention to the regex used in the modules/artifacts. Currently, we copy into the data folder every file matching the regex, even if those files are not used by the module/artifact.
  • Seeker functions could be used to fill in the database and copy the files into the data folder, and ilapfuncs and lavafuncs could be used to query the database and generate the HTML/LAVA outputs.

@JamesHabben
Collaborator Author

As I think this over, one note I want to include: I think we will end up not storing these media files in their original paths. We may go with something closer to how iTunes backups are stored: a media folder, probably with subfolders inside, although I don't think we need the two-hex-character prefix. The check-in function would SHA-1 the given file and store it in the subfolder that matches the first character of its SHA-1. The file should have the appropriate extension for its type (jpg, pdf, etc.) for ease of opening. This way, if an image that exists in the photo album was also sent in an iMessage, we identify that the hashes match and record each location as a reference.
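A minimal sketch of that hash-based layout follows. The function name `store_by_hash` is hypothetical, and reusing the source file's own extension is an assumption; a real implementation might determine the extension from the detected file type instead.

```python
import hashlib
import shutil
from pathlib import Path


def store_by_hash(source_file, media_root):
    """Copy source_file to <media_root>/<first sha1 char>/<sha1><ext>.

    Returns (sha1_hex, target_path). A file whose content is already
    stored is not copied again, so identical files share one stored copy.
    """
    source_file = Path(source_file)
    sha1 = hashlib.sha1()
    with source_file.open("rb") as handle:
        # Hash in 1 MiB chunks so large videos are not read into memory at once.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            sha1.update(chunk)
    digest = sha1.hexdigest()

    target_dir = Path(media_root) / digest[0]
    target_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: keep the source extension (lowercased) for ease of opening.
    target = target_dir / f"{digest}{source_file.suffix.lower()}"
    if not target.exists():
        shutil.copy2(source_file, target)
    return digest, target
```

Checking in the same image from the photo album and from a message attachment would return the same digest and target path both times, and each caller could then record its own location as a reference.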


4 participants