Adding functions that transform a table longer and wider #319

tnederlof · 2020-08-23T14:51:33Z

A pretty common operation with tabular data is to make data longer and make data wide. These type of operations are often referred to as pivots, stack/unstack, spread/gather, melt/cast, etc. I wanted to link to different implementations of this idea both in Julia and outside of Julia, link to an attempt @davidanthoff has already started, and then start the design of a long-lasting implementation.

My knowledge of the inner workings of the queryverse is pretty limited and with all of the macro use, I expect it will take me a bit to get comfortable but I welcome the challenge. While some design decisions must be thought of within the context of the code, some high-level feature design can probably be discussed before laying down the implementation code.

David and I had a brief conversation about this here: https://discourse.julialang.org/t/stack-and-unstack-in-query-jl/45381/4

Some reference links

DataFrames.jl
stack - http://juliadata.github.io/DataFrames.jl/stable/lib/functions/#DataFrames.stack
unstack - http://juliadata.github.io/DataFrames.jl/stable/lib/functions/#DataFrames.unstack

reshape2 (R)
melt - https://www.rdocumentation.org/packages/reshape2/versions/1.4.4/topics/melt
cast/dcast - https://www.rdocumentation.org/packages/reshape2/versions/1.4.4/topics/cast

tidyr pre v1.0 (R)
gather - https://tidyr.tidyverse.org/reference/gather.html
spread - https://tidyr.tidyverse.org/reference/spread.html

tidyr v1.0 (R)
pivot_longer - https://tidyr.tidyverse.org/reference/pivot_longer.html
pivot_wider - https://tidyr.tidyverse.org/reference/pivot_wider.html

pandas (python)
melt/wide_to_long - https://pandas.pydata.org/docs/reference/api/pandas.melt.html & https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
pivot - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot.html

Common Features

making data "longer"

select columns to be made into long format (column names become repeated in a column often called the identifier column, with their corresponding cells being put into a column often called the measurement column)
provide a column name for the newly created identifier column
provide a column name for the newly created measurement column

making data "wider"

select an existing identifier column (these will become column names)
select an existing measurement column (these will become cell values under their corresponding identifier columns names)

The above is a big simplification as most implementations of these functions do "extra" things including...

fill missing values (when taking long data and making it wider there is no guarantee there is a measurement for every unique identifier variable) when data is made square, usually, NA/missing is the default.
remove missing values (when taking wide data with NA/missing values and making it long sometimes a user wants to remove the rows where the measurement cells are NA/missing)
type transformation (sometimes measurement data will come in all as a string and can be converted to a different type)
ability to handle multiple transformations at once (create more than one identifier column and/or create more than one measurement column). The main focus of the updated pivot_longer and pivot_wider functions beyond their ease of use/naming was being able to do multiple transformations at once.

Of this list above I imagine the decision whether or not to handle multiple transformations at once in the query.jl implementation will be non-trivial. In order to handle multiple transformations, one has to set rules on how to take column names and parse them into discreet parts (each of which becomes a new identifier column) for the wide to long case. In the long to wide case you are usually selecting an identifier variable column and multiple observation columns, the combination of which become new wide columns. Tidyr lays this out nicely so there is plenty to look at when it comes to these cases but it definitely adds complexity and additional arguments (separators between newly created columns, prefixes, pattern matching, etc.)

Latest Thinking

Hadley (creator of tidyr) has thought a lot about this subject and has written numerous posts over the years.
He recently wrote this https://www.tidyverse.org/blog/2019/09/tidyr-1-0-0/ and gave a talk about this subject https://www.youtube.com/watch?v=D48JHU4llkk. His main point is all of these confusing names like spread, gather, melt, cast, pivot, etc make it really hard for the average user to remember what they do without using documentation, and even then it can be confusing. He has settled on pivot_longer (result has less columns and more rows) and pivot_wider (result has more columns and less rows). I personally love using _longer and _wider in the name as it makes it clear which direction you are moving in, however, there are probably different prefixes that could all work.

@davidanthoff has created a gather (wide to long transformation) function (https://github.com/queryverse/QueryOperators.jl/blob/master/src/enumerable/enumerable_gather.jl) which will be a big help for me to get up to speed on the inner workings.

This was a long post so please add/remove/correct/discuss what I wrote above so we can understand what has come before and then design a well-thought set of transformation functions.

davidanthoff added the enhancement label Jan 6, 2021

davidanthoff added this to the Backlog milestone Jan 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding functions that transform a table longer and wider #319

Adding functions that transform a table longer and wider #319

tnederlof commented Aug 23, 2020 •

edited

Loading

Adding functions that transform a table longer and wider #319

Adding functions that transform a table longer and wider #319

Comments

tnederlof commented Aug 23, 2020 • edited Loading

Some reference links

Common Features

Latest Thinking

tnederlof commented Aug 23, 2020 •

edited

Loading