marimo's text format is hard for humans/non-marimo tools to understand #1379

Ubehebe · 2024-05-15T15:12:01Z

Ubehebe
May 15, 2024

One of the main advantages of marimo compared to other notebook formats is that marimo notebooks are syntactically valid Python files. This means that tools that analyze Python files (linters, formatters, type-checkers, IDEs) can generally do something useful with marimo notebooks without any setup.

As I've used marimo more, I've discovered some exceptions. marimo notebooks are syntactically valid Python, but they aren't idiomatic Python. This means that some tools can't analyze marimo notebooks in a useful way.

Here's an example. When you use an import in a notebook:

import pandas as pd

df = pd.DataFrame(...)

marimo serializes that to disk as something like:

@app.cell
def __():
  import pandas as pd
  return pd

@app.cell
def __(pd):
  df = pd.DataFrame(...)

This is basically a serialized DAG: the nodes (cells) are represented by top-level functions decorated with @app.cell, and the edges (dependencies) are represented by function params/return values.
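To make the "serialized DAG" reading concrete, here is a minimal sketch that recovers the edges from the serialized text using only the standard `ast` module. The helper names `cell_interfaces` and `edges` are made up for illustration and are not part of marimo.

```python
import ast

# The serialized notebook from above, as plain text. Parsing it does not
# require marimo or pandas, because decorators are parsed but never evaluated.
SERIALIZED = '''
@app.cell
def __():
    import pandas as pd
    return pd

@app.cell
def __(pd):
    df = pd.DataFrame(...)
    return df
'''

def cell_interfaces(source):
    """Return (params, returned names) for each top-level function, in file order."""
    interfaces = []
    for node in ast.parse(source).body:
        if not isinstance(node, ast.FunctionDef):
            continue
        params = [arg.arg for arg in node.args.args]
        returned = []
        for stmt in ast.walk(node):
            if isinstance(stmt, ast.Return) and stmt.value is not None:
                # A cell may return a single name or a tuple of names.
                returned += [n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)]
        interfaces.append((params, returned))
    return interfaces

def edges(interfaces):
    """Edge (i, j, name): cell i returns `name` and cell j takes it as a parameter."""
    return [
        (i, j, name)
        for i, (_, returned) in enumerate(interfaces)
        for j, (params, _) in enumerate(interfaces)
        for name in returned
        if i != j and name in params
    ]

interfaces = cell_interfaces(SERIALIZED)
print(edges(interfaces))
```

The only edge found is from the first cell to the second, carried by the name `pd`.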

The serialization is elegant, but tools other than marimo can't understand the indirection -- for example, they can't understand that the DataFrame constructor comes from the pandas import. This means that:

1. Tools like ruff can't remove unused imports from marimo notebooks. If you delete the DataFrame instantiation above and run the file through ruff, the import pandas as pd statement remains.
2. IDEs can't navigate from the DataFrame instantiation to its definition in pandas.
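To illustrate the unused-import point, here is a rough stand-in for the kind of scope-local check a linter performs. It is not ruff's implementation, and `unreferenced_imports` is a hypothetical helper. It only shows why `return pd` makes the import look used inside its own function.

```python
import ast

# The import cell as marimo serializes it: the name is returned to other cells.
CELL_WITH_RETURN = '''
def __():
    import pandas as pd
    return pd
'''

# The same import with no reference at all inside the function.
CELL_WITHOUT_RETURN = '''
def __():
    import pandas as pd
'''

def unreferenced_imports(func_source):
    """Names imported in a function that are never read within that same function."""
    func = ast.parse(func_source).body[0]
    imported, loaded = set(), set()
    for node in ast.walk(func):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            loaded.add(node.id)
    return imported - loaded

# Empty: `return pd` counts as a use, so nothing looks removable.
print(unreferenced_imports(CELL_WITH_RETURN))
# Contains "pd": only here would a scope-local check flag the import.
print(unreferenced_imports(CELL_WITHOUT_RETURN))
```

Whether any other cell actually consumes `pd` is invisible to a check like this, because that information lives in a different function.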

I can see a few approaches we might take to improve this situation, but before proposing anything specific, I wanted to start a discussion. Maintainers, have you thought about this? How important do you think it is to improve?

My own view is that it's of medium importance. For (1), unused imports can significantly slow down notebook execution. And for (2), being able to use IDE features to edit marimo notebooks would make large codebases significantly more maintainable (refactoring, etc.).

Thanks for your time!

akshayka · 2024-05-20T18:50:48Z

akshayka
May 20, 2024
Maintainer

@Ubehebe, sorry for the late response -- just saw your post.

> I can see a few approaches we might take to improve this situation, but before proposing anything specific, I wanted to start a discussion. Maintainers, have you thought about this? How important do you think it is to improve?

I've thought about this insofar as the indirection bothers me, too. But I haven't spent time trying to design something better. In particular, number 2 resonates with me -- it would be great to make editing in an IDE/text editor easier.

Definitely open to hearing your suggestions. Thanks so much for the thoughtful message!

2 replies

MeRe9 Dec 29, 2024

Design something better: https://hamilton.dagworks.io

dmadisetti Dec 29, 2024
Collaborator

The tradeoff marimo has to make is automatically serializing from a notebook/cells to Python while keeping it readable. With a pure Python approach, there is definitely a bit more flexibility. Reading through the docs, I don't think Hamilton is inherently more readable, and it even seems to have a few design patterns similar to marimo's.

That being said, it looks like Hamilton has some good ideas for data pipelines that could inspire marimo. Do you have any experience with Hamilton drivers? What has been your impression?

Ubehebe · 2024-05-22T02:23:34Z

Ubehebe
May 22, 2024
Author

I think the main question I have is: why does marimo have to serialize the dataflow graph into the source file? Why can't it be an in-memory data structure on the server?

The first problem is that marimo needs some way to partition a Python source file into cells. The @app.cell decoration and the synthetic functions are as good a way as any. (There are other possible approaches, but the synthetic functions don't confuse non-marimo tools, so I'm not that concerned about them.)

The edges of the dataflow graph (the parameters of the synthetic functions) are what confuse non-marimo tools. Why do they need to be in the source file? Couldn't marimo run the dataflow analysis once on startup and keep the graph in-memory? This might slow down the initial time to interactive, but I doubt it would be significant when import statements regularly take 1+ second.

If we're able to make marimo notebooks more idiomatic Python so that other tools can work on them seamlessly, I think that's a good tradeoff.

3 replies

akshayka May 22, 2024
Maintainer

> Why do they need to be in the source file? Couldn't marimo run the dataflow analysis once on startup and keep the graph in-memory?

@Ubehebe, you're totally right, they don't need to be in the source file. In fact marimo doesn't even read the args and returns of the app.cell decorated functions, and instead just redoes the dataflow analysis as you suggest.
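For readers wondering what "redoing the dataflow analysis" could look like, here is a deliberately naive sketch built on the standard `ast` module. It is not marimo's actual implementation: the helpers `defs_and_refs` and `build_graph` are invented for illustration, and the sketch ignores nested scopes, deletions, and other details a real analysis would need.

```python
import ast
import builtins

def defs_and_refs(code):
    """Names a cell binds, and names it reads without binding them itself."""
    defs, loads = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                loads.add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            # An import binds the alias, or the top-level package name.
            defs.update(a.asname or a.name.split(".")[0] for a in node.names)
    # Builtins such as `len` are not references to other cells.
    refs = {name for name in loads - defs if not hasattr(builtins, name)}
    return defs, refs

def build_graph(cells):
    """Map each cell index to the indices of cells that read one of its definitions."""
    analyzed = [defs_and_refs(code) for code in cells]
    return {
        i: sorted(j for j, (_, refs) in enumerate(analyzed) if i != j and defs & refs)
        for i, (defs, _) in enumerate(analyzed)
    }

# Cell bodies only: no function wrappers, parameters, or return statements.
cells = ["import pandas as pd", "df = pd.DataFrame()", "n = len(df)"]
graph = build_graph(cells)
print(graph)
```

The graph comes entirely from the cell bodies, which is why the parameters and return values in the file are redundant for this purpose.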

The main reason references are included as cell/function args is so that the code is more legible to human eyes. For example, when designing the file format, I believed:

@app.cell
def __():
  x = 0
  return x

@app.cell
def __(x):
  y = x + 1
  return y

was more legible than

@app.cell
def __():
  x = 0

@app.cell
def __():
  y = x + 1

Because in the former, at least x is bound to something (the function argument). Plus, it makes using the cell.run() API a bit easier, since you can read off the references from the signature.

But, I might have been mistaken! I am open to not serializing the edges in the file format if it would make the Python more idiomatic. How does this improve the IDE experience? Do you have a suggestion based on removing the edges that would make the file format more idiomatic?

Ubehebe May 22, 2024
Author

> I am open to not serializing the edges in the file format if it would make the Python more idiomatic. How does this improve the IDE experience?

I realized I've been thinking only about imports. marimo currently serializes all names that appear in a Python file into the params of synthetic functions. Top-level imports are a kind of name. If marimo didn't serialize top-level imports into the function params, IDEs could navigate from the use of an import to its definition. Simple example:

# user writes
import pandas as pd

df = pd.DataFrame(...)

# current marimo serialization
import marimo

app = marimo.App()

@app.cell
def __():
  import pandas as pd
  return pd

@app.cell
def __(pd):
  df = pd.DataFrame(...) # bad: `pd` param "shadows" top-level import
  return df

# if marimo didn't serialize top-level imports
import marimo
import pandas as pd

app = marimo.App()

@app.cell
def __():
  df = pd.DataFrame(...) # good: IDE knows `pd` is pandas

This might be worth doing to get ruff unused imports working, plus limited IDE navigation. But this doesn't help tools with names that are not top-level imports (like local variables). If marimo doesn't serialize any names, then non-marimo tools can't do much with synthetic functions like this, as you point out.

@app.cell
def __():
  y = x + 2 # marimo knows where this x comes from, other tools do not

I think the fundamental problem is that marimo's choice of marker for cells (functions) introduces a lexical scope. This hides relationships that would otherwise be legible to both humans and non-marimo tools.

Did you consider using a cell marker that doesn't introduce a lexical scope?

# marimo:cell
x = 1

# marimo:cell
y = x + 2

Putting behavior in comments is grungy, but this has the advantage that the relationship between the names is immediately apparent.
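A minimal sketch of how a tool might partition such a file, assuming the `# marimo:cell` marker proposed above (it is hypothetical, not an existing marimo feature). The `split_cells` helper is likewise invented for illustration, and it ignores anything that appears before the first marker.

```python
MARKER = "# marimo:cell"  # hypothetical marker from this thread

def split_cells(source):
    """Split a flat script into cell bodies at each marker line."""
    cells, current = [], None
    for line in source.splitlines():
        if line.strip() == MARKER:
            if current is not None:
                cells.append("\n".join(current).strip())
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        cells.append("\n".join(current).strip())
    return cells

FLAT = """\
# marimo:cell
x = 1

# marimo:cell
y = x + 2
"""

cells = split_cells(FLAT)
print(cells)

# The same text is also an ordinary script: the markers are just comments,
# so `x` and `y` live in one scope that any Python tool can follow.
namespace = {}
exec(FLAT, namespace)
print(namespace["y"])
```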

akshayka May 31, 2024
Maintainer

Sorry for the delayed response @Ubehebe

> Did you consider using a cell marker that doesn't introduce a lexical scope?

Yeah, we did consider this. The reason we didn't go down this route is that we wanted the notebook files to be importable as Python modules, for reusability -- to be able to support reusing cells, functions, or classes from one notebook in another.

Writing the notebook code as a flat script would either mean that the notebook would be executed on import, or the code wouldn't be reusable because it would be nested under an if __name__ == "__main__" guard.
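The tradeoff can be sketched without marimo at all. The snippet below simulates `import` by executing source text in a fresh module object whose `__name__` is not `"__main__"`. The `simulate_import` helper and the three source strings are made up for illustration.

```python
import types

def simulate_import(source, name="notebook"):
    """Execute `source` roughly as `import` would: in a module not named "__main__"."""
    module = types.ModuleType(name)
    exec(compile(source, f"<{name}>", "exec"), module.__dict__)
    return module

# Flat script: every statement runs as soon as the module is imported.
FLAT = """
x = 1
"""

# Guarded script: nothing runs on import, but nothing is reusable either.
GUARDED = """
if __name__ == "__main__":
    x = 1
"""

# Function-wrapped cell: nothing runs on import, and the cell can be called later.
FUNCTIONS = """
def cell():
    x = 1
    return x
"""

flat = simulate_import(FLAT)
guarded = simulate_import(GUARDED)
functions = simulate_import(FUNCTIONS)

print(hasattr(flat, "x"))       # the flat body executed during import
print(hasattr(guarded, "x"))    # the guarded body was skipped entirely
print(hasattr(functions, "x"))  # the cell body has not run yet
print(functions.cell())         # but it can be reused on demand
```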

But I do agree that the flat version you suggest provides a much better editing experience. I wonder if we can somehow get the best of both worlds.

Ubehebe · 2024-05-31T13:33:12Z

Ubehebe
May 31, 2024
Author

I renamed this topic to reflect the most important issue (and to focus less on specific solutions). marimo's text format is good compared to other notebook formats, but I think it can be even better.

1 reply

chenller Jun 2, 2024

As I understand it, marimo is built on a functional programming idea: it describes the relationships between cells as relationships between functions. Functional programming is useful when writing code, but using only functional programming can lack flexibility. The main purpose of a notebook is to execute Python code in blocks, where all variables are global variables.
My point is that the basic execution logic could stay similar to a notebook's, with only some cells having 'reactive' properties. This means there would be two types of cells. Variables in notebook-like cells should not share global variable names with variables in 'reactive' cells.
Scripts can be stored as:

# current marimo serialization
import marimo

app = marimo.App()

@app.cell
def __fun1():
  import pandas as pd
  return pd

@app.cell
def __fun2(pd):
  df = pd.DataFrame(...) # bad: `pd` param "shadows" top-level import
  return df

@app.cell_notebook
def __fun_notebook():
  # notebook code 1
  # notebook code 2
  # ...
  pd = __fun1()
  df = __fun2(pd)
  # notebook code n
  # notebook code n+1
  # ...

dmadisetti · 2024-06-02T22:22:20Z

dmadisetti
Jun 2, 2024
Collaborator

marimo could also just have a flat script version. Exports are already possible, so it would just mean handling imports. The difficulty is that there are then three formats (normal Python, script Python, and markdown) to manage. Maybe this is worth it? But it also removes the non-linearity of marimo notebooks.

Alternatively, deeper code editor integration (think "VS Code" or vim plugin) could be set up to use marimo's LSP directly and overcome the issues described. marimo's current imports are actually good, because they are lazy.
I think this is the best solution.

I thought about the following alternative (see below) before I realized that maybe this is an editor issue and not a marimo issue. I'm only including it because it's already written, but I think it's a lot of engineering effort relative to the payoff (not my decision to make, just my thoughts).

marimo could parse out import statements and put in relevant stubs, e.g.:

# Cell 1
import pandas as pd

y = pd.DataFrame(...)

Gets turned into

import marimo
import pandas as pd

app = marimo.App()
__import_manager__ = app.__import_manager__

@app.cell
def __():
  __import_manager__("pandas", _as="pd")
  y = pd.DataFrame(...)
  return y

and behind the scenes, __import_manager__ asserts that pandas has been imported as pd by checking against sys.modules and injects pd into locals.

Benefits:

- Allows users to still declare import X in the notebook editor (this conversion happens on write to disk).
- Allows marimo to still extract pd as a def in the relevant cell by checking the AST against very specific function calls.
- Allows the same scoping as the current behavior.

Sanity check:

# Cell 2
if isinstance(y, list):
    import numpy as np
    y = np.array(y)
x = pd.DataFrame(np.sqrt(y))

This should fail if y is not a list, as np is only defined in the conditional.
With the proposed stub:

import marimo
import pandas as pd
import numpy as np

app = marimo.App()
__import_manager__ = app.__import_manager__

@app.cell
def cell2(y):
  if isinstance(y, list):
    __import_manager__("numpy", _as="np")
    y = np.array(y)
  x = pd.DataFrame(np.sqrt(y))
  return x

Thanks to the decorator, cell2 should have np already scrubbed from scope, but pd should be present, so reaching pd.DataFrame(np.sqrt(y)) directly will raise a NameError for np. If y is a list, __import_manager__ should do something like

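import inspect
import sys

def __import_manager__(module, _as=None):
  # Caveat: in CPython, writing to a function frame's f_locals typically does
  # not create a real local, so a working version needs another injection route.
  _locals = inspect.currentframe().f_back.f_locals
  if module not in sys.modules:
    __import__(module)
  _locals[_as or module] = sys.modules[module]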

which will put np into scope. It's a little messy, and unintuitive, but if it's documented and users are encouraged to group imports together, then I think this solves the scoping issue and makes behavior consistent with what one would expect from the notebook.
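One practical wrinkle with the stub idea: in CPython, writing to a function frame's `f_locals` generally does not make a new name visible to the running function. The sketch below swaps in a different technique, injecting into the caller's module globals. This is an assumption made for illustration, not something proposed in this thread, and `import_manager` and `cell` are hypothetical names. It uses `math` in place of numpy so that it runs without third-party packages.

```python
import importlib
import inspect

def import_manager(module, _as=None):
    """Hypothetical helper: bind `module` under `_as` in the caller's module globals."""
    caller_globals = inspect.currentframe().f_back.f_globals
    caller_globals[_as or module] = importlib.import_module(module)

def cell(y):
    if isinstance(y, list):
        import_manager("math", _as="m")
        y = sum(y)
    # `m` is never assigned in this function, so it is looked up as a global.
    return m.sqrt(y)

# Before any list input, `m` has not been injected, so the lookup fails.
try:
    cell(9)
except NameError:
    first_call = "NameError"
else:
    first_call = "ok"

second_call = cell([4, 5])  # takes the branch, injects `m`, then uses it
third_call = cell(9)        # `m` has leaked into globals, so this now succeeds

print(first_call, second_call, third_call)
```

The third call shows what this sketch leaves out: without the per-cell scrubbing that the post attributes to the decorator, an injected name persists across calls.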

1 reply

akshayka Jun 3, 2024
Maintainer

Thanks for putting together a thoughtful response.

I've thought about this and I've landed where you have:

> deeper code editor integration (think "VS Code" or vim plugin) could be set up to use marimo's LSP directly and overcome the issues described. marimo's current imports are actually good, because they are lazy.
> I think this is the best solution.

Installing VS Code/vim plugins wouldn't be too much friction, and would allow for a good editing experience without compromising on lazy imports and simplicity of the file format.
