Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

linux: module symbols: Fixes and code improvements #1509

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

gcmoreira
Copy link
Contributor

  • Fixed an issue causing the generation of an invalid, extra symbol.
  • Reuse ELF sym API instead of reimplemented it
  • Updated the function to return the symbol index, enabling the use of additional module tables.
  • Ensured the ELF symbol object has cached_strtab set, allowing retrieval of critical symbol information like names.
  • Added typing hints

- Fixed an issue causing the generation of an invalid, extra symbol.
- Reuse ELF sym API instead of reimplemented it
- Updated the function to return the symbol index, enabling the use of additional module tables.
- Ensured the ELF symbol object has `cached_strtab` set, allowing retrieval of critical symbol information like names.
- Added typing hints
@gcmoreira
Copy link
Contributor Author

Hey @ikelos, this is part of a larger pull request that I'm breaking into smaller, more manageable changes. While these updates might seem disconnected at the moment, they will align once the full implementation is in place.

The module::get_symbols() method interface has been updated. Since it was only used internally, I've adjusted all relevant references to match the new implementation.

Given the scope of these changes, at least a framework minor version update might be required. Let me know your thoughts!

addr = self._cached_strtab + self.st_name

layer = self._context.layers[self.vol.layer_name]
name_bytes = layer.read(addr, max_size, pad=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think a try/except should be wrapped around the layer read to prevent from memory smear ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With pad=True it shouldn't raise any exceptions

Copy link
Contributor

@Abyss-W4tcher Abyss-W4tcher Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I had concerns about this early part, if self.st_name is an absurd value due to smear ?

def read(self, offset: int, length: int, pad: bool = False) -> bytes:
"""Reads from the file at offset for length."""
if not self.is_valid(offset, length):
invalid_address = offset
if self.minimum_address < offset <= self.maximum_address:
invalid_address = self.maximum_address + 1
raise exceptions.InvalidAddressException(
self.name, invalid_address, "Offset outside of the buffer boundaries"
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When pad=True, this for because ignore_errors=pad, will return immediately:

for offset, _, mapped_offset, mapped_length, layer in self.mapping(
offset, length, ignore_errors=pad
):

Then this line will pad everything with zeros:

return recovered_data + b"\x00" * (length - len(recovered_data))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying it !

Copy link
Member

@ikelos ikelos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions around some of the API changes, some of which seem unnecessary and would require bumps, but otherwise looks ok...

"""Get symbols of the module

Yields:
A symbol object
A tuple containing the ELF symbol index and the corresponding ELF symbol object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes assumptions about the ordering of the symbols returned (and it's also technically a change in the API since it'll break anything the currently doesn't expect a tuple) so if you want to make this change it'll need a MAJOR version bump somewhere, but I can't see the use in it? What's the reason for needing to return the index?

The only existing call to this that I saw get updated throws the value of the index away?

Copy link
Contributor Author

@gcmoreira gcmoreira Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I can't see the use in it?

yes, I tried to explain this here:

Hey @ikelos, this is part of a larger pull request that I'm breaking into smaller, more manageable changes. While these updates might seem disconnected at the moment, they will align once the full implementation is in place.

Yes, symbols in the symbol table are ordered, which is important because it is not the only table in a Linux kernel module. The module also contains a types table, where the indices correspond to those in the symbol table. This means that the type for the symbol at index (n) in the symbol table is located at the same index in the types table.

We could handle this in the caller using enumerate, but that would mean get_symbols() must always preserve the original order. It cannot skip any symbol, whether it's invalid, null, empty, or whatever. If skipped, the symbols could shift in relation to the other tables, breaking the framework.

I believe this approach ensures the symbols stay correctly aligned with their indices.
Should this be considered a major version bump for the whole framework? Unfortunately, object extensions don't have versioning in place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel there's a better way of marrying the two up? The symbols in an ISF can have a type parameter, telling you which type they use? Is that not coming out of the dwarf2json? Enumerating them here seems really fragile and we'll have to remember it if we ever tinker with any of the way the symbol tables work. See here.

Recalculating it each time just opens up the chance something will go wrong later, the mapping is static data, it should be encoded in the ISF itself, surely?

Copy link
Contributor Author

@gcmoreira gcmoreira Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it's not related to those symbols. These are kernel module symbols, and this information is not part of the ISF. It is gathered directly from memory rather than from d2j. The symbol types are not included in the elf_sym structure and never will be, as this data is embedded within the module, not within the symbol itself

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikelos If we want to avoid a version major bump, I can make an intermediate function like:
get_symbols_ex() .. which yield index and symbol .. and get_symbols() calls it but ignoring the index. Then, who needs both can still call directly to get_symbols_ex().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I missed where they came from then (there's a lot of special things in the linux extensions that I'm not up to speed on). If you're happy it makes sense, then that's ok. Yes, please put it in a different function and make the old one a shell that just returns the same type and has the same signature and name as the original. Probably wise to include a deprecation warning in that call too, so anyone that is using it knows to shift themselves over...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't someone just call enumerate on the generator before first use though? If you're changing the function already I guess it makes it less brittle to ensure it's always count from the start, just trying to keep API changes to a minimum. As I say, if really want it I'm ok with it, just want to make sure it doesn't need changing again down the line... 5:)

@@ -328,22 +329,29 @@ def cached_strtab(self):
def cached_strtab(self, cached_strtab):
self._cached_strtab = cached_strtab

def get_name(self):
addr = self._cached_strtab + self.st_name
def get_name(self, max_size=KSYM_NAME_LEN) -> Optional[str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this is an additive change to the API. I'm not sure what the first versioned component above this would be though? Can we avoid this being a parameter, but make it a constant instead? I suspect the maximum doesn't change often, and it would keep the API the same but pull the value out to make it more destinctly something people can see and set when needed?

Copy link
Contributor Author

@gcmoreira gcmoreira Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can remove it from the parameters if you prefer. Since the ELF API could be used for other purposes than just kernel symbols, and for some of those use cases, a shorter maximum symbol name length could make more sense there, I assumed that a general maximum-maximum length would be fine for all cases but allowing to customize a shorter one if for some reason it's needed. I also made it a parameter with default value to minimize the impact of this change.

The Linux kernel has it's maximum length, but for general/user-space ELF files the maximum size can be different. The kernel used to have a maximum length of 128 bytes until Rust happened. In kernels 6.1, it was increased to 512 bytes in this commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, ok. I think still best to make it a constant, and then we can make a new class that overrides that class local constant down the line (and the class local constant can be set from constants if you want)...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, do you mean something like this (internal or not)?

class elf_sym(objects.StructType):
    _MAX_NAME_LENGTH = KSYM_NAME_LEN

Copy link
Member

@ikelos ikelos Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constants.KSYM_NAME_LEN, but yeah, so there's constants (user overridable in one clear place, class constant so that inheritors could override it, and then the actual use point). Does that make sense or is it just over complicating things? I don't mind so much, so long as it's pulled out somewhere sensible... 5:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants