Support Gap Filling on Time Series Data #4809
Comments
@waynexia @jiacai2050 (CeresDB) @v0y4g3r (GrepTime) @gruuya (SeaFowl) As I believe you are building other timeseries database systems on DataFusion, I wonder if you have any thoughts about adding such a feature to DataFusion? We plan on building this feature natively in IOx but might be willing to upstream it as well if there is community interest. cc @waitingkuo @andygrove and @liukun4515 in case you have thoughts as well.
It looks good to have. I built something similar very recently: https://github.com/GreptimeTeam/greptimedb/blob/develop/src/promql/src/extension_plan/instant_manipulate.rs#L423 We plan to expose this functionality through the SQL interface in some way, which I think will end up similar to this proposal. My concern is that, to provide a good user experience and functionality, we may need a bunch of "gap-filling" functions: filling with null, filling with the last value, filling with the last value if the gap is less than 1 day and otherwise leaving it blank, etc. I'm not sure whether these "time-series functions" are also useful to other users of DataFusion (but it might be fine, as PostgreSQL also provides such utilities).
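To make the fill policies mentioned above concrete, here is a minimal Python sketch. It is not taken from GreptimeDB or DataFusion; the function name `fill_locf` and the `max_gap` parameter are made up for illustration, and the input is assumed to be already bucketed, with `None` marking an empty bucket.

```python
from datetime import datetime, timedelta
from typing import Optional

# (bucket start, aggregate value or None for an empty bucket)
Row = tuple[datetime, Optional[float]]


def fill_locf(rows: list[Row], max_gap: Optional[timedelta] = None) -> list[Row]:
    """Carry the last observed value forward into empty buckets.

    With max_gap set, a bucket is only filled when the last real
    observation is less than max_gap older than that bucket;
    otherwise the bucket stays None (the "fill with null" policy).
    """
    out: list[Row] = []
    last_ts: Optional[datetime] = None
    last_val: Optional[float] = None
    for ts, val in rows:
        if val is not None:
            last_ts, last_val = ts, val
            out.append((ts, val))
        elif last_ts is not None and (max_gap is None or ts - last_ts < max_gap):
            out.append((ts, last_val))
        else:
            out.append((ts, None))
    return out


# Hypothetical 12-hour buckets with three empty ones in the middle.
buckets = [datetime(2022, 12, 1) + timedelta(hours=12 * i) for i in range(5)]
rows = [(buckets[0], 1.0), (buckets[1], None), (buckets[2], None),
        (buckets[3], None), (buckets[4], 5.0)]

plain = fill_locf(rows)                               # fill every gap
bounded = fill_locf(rows, max_gap=timedelta(days=1))  # only gaps under 1 day
```

Even this small sketch needs a policy parameter, which is the point of the concern: each variation tends to become another function or argument in the SQL surface.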
@alamb Thanks for bringing me in. I have one concern about implementing this feature in DataFusion. Suppose the time range of one query is
Then what is the result of In Prometheus, it will be 10 when lookback-delta is I don't know how TimeScale deals with this case. IMO, rewriting the time range of a query may not be suitable for DataFusion, since it is a generic SQL engine. Any ideas about this first-value issue?
We are interested in this too. Are you aware of any approaches other than the two we have so far ( |
I don't know of any upcoming SQL standard for this, but I didn't look hard for one either. This use case is common. It is often called bucketing with "gap filling" or "interpolation" in other SQL implementations. This type of query is not easy to express in ANSI-SQL, and thus databases often offer some sort of SQL extension. Here are some example extensions I found:
All of these extensions have two main features:
|
I agree that specifying the interpolation / gap-filling policy is important. In addition to the simple locf function (https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/locf/), they have interpolate (https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/interpolate/). I wonder if that would be sufficient 🤔
I am not sure
It may well be the case that this is something that is not easy / reasonable to express in SQL (
The Another way I could imagine is to run a subquery that has the full range |
Thanks for those links, |
Here is a design document to do this work in DataFusion: Initially I wrote this design thinking it would go into IOx, but since there is some interest here, I have adapted it a bit for DataFusion. Feedback on the approach would be appreciated! |
I would like to think a little bit on how one can do this with multi-row window functions (i.e. window functions that may generate multiple rows for every frame). It seems the syntax and semantics of such an approach would be more in-line with standard SQL's treatment of windowing (which would be good from a POLA perspective). I will share my thoughts in a few days when things mature a little bit. |
If the SQL and usage are compatible with the PG SQL syntax, I think it can be added to DataFusion easily.
Thanks, @wolffcm.
I'm afraid this doesn't work. The TimeScale docs say
start/finish are applied after the data is scanned (using WHERE); if the fetched data contains no value of
I did the following tests against TimeScale:
CREATE TABLE stocks_real_time (
time TIMESTAMPTZ NOT NULL,
price DOUBLE PRECISION NULL
);
SELECT create_hypertable('stocks_real_time','time');
insert into stocks_real_time values
('2022-10-01', 10),
('2022-10-03', 30),
('2022-10-04', 40),
('2022-10-05', 50);
SELECT
time_bucket_gapfill('1 day', time, timestamp '2022-09-30', timestamp '2022-10-10') AS day,
avg(price) AS value,
locf(avg(price)),
interpolate(avg(price))
FROM stocks_real_time
WHERE time > '2022-10-02' AND time < '2022-10-05'
GROUP BY day
ORDER BY day;
It will output
A subquery seems unnecessary. If the time range in time_bucket_gapfill differs from the range in the WHERE clause, maybe we can overwrite the WHERE clause and filter the data in the GapFill plan node, something like this (adapted from the Google doc above):
|
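The first-value problem discussed here can be reproduced without a database. The sketch below is a simplified model, not TimeScale's or DataFusion's behaviour: `gapfill_locf` is a hypothetical helper, the query range (2022-10-02 up to but excluding 2022-10-05) is an assumption chosen to keep the example small, and the data is the `stocks_real_time` rows from the test above.

```python
from datetime import date, timedelta


def gapfill_locf(points, start, finish, step=timedelta(days=1)):
    """Return (bucket, value, locf) for every bucket in [start, finish).

    `points` maps a bucket to its aggregate. Any point earlier than
    `start` is used only to seed the carried-forward value.
    """
    earlier = [d for d in points if d < start]
    last = points[max(earlier)] if earlier else None
    out, bucket = [], start
    while bucket < finish:
        value = points.get(bucket)
        if value is not None:
            last = value
        out.append((bucket, value, last))
        bucket += step
    return out


# Daily values from the test table above (one row per day).
table = {
    date(2022, 10, 1): 10.0,
    date(2022, 10, 3): 30.0,
    date(2022, 10, 4): 40.0,
    date(2022, 10, 5): 50.0,
}
start, finish = date(2022, 10, 2), date(2022, 10, 5)  # assumed query range

# (a) Filter first, as a WHERE clause would, then gap-fill.
scanned = {d: v for d, v in table.items() if start <= d < finish}
filter_first = gapfill_locf(scanned, start, finish)

# (b) Give the gap-fill step the unfiltered rows and let it restrict the output.
fill_in_node = gapfill_locf(table, start, finish)
```

In (a) the 2022-10-02 bucket has nothing to carry forward because the 2022-10-01 row was filtered out before gap filling ran. In (b) the same bucket picks up 10.0, at the cost of the gap-fill step seeing rows that the user's filter excluded.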
I understand what you're suggesting, but I worry that rewriting a filter like that would have unforeseen effects that are difficult to understand. For example, if the input to
I think this problem is a really tricky one. In the TimeScale docs for
locf(
  avg(temperature),
  (SELECT temperature FROM metrics m2 WHERE m2.time < now() - INTERVAL '2 week' AND m.device_id = m2.device_id ORDER BY time DESC LIMIT 1)
)
I'm curious about what you think of that approach.
It seems a subquery is more flexible and "safe" than rewriting the WHERE clause. One more question: if the subquery returns multiple values, which value will be chosen by locf?
@alamb |
I just tried this in TimeScale. If more than one value is returned, it returns an error.
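The subquery approach can be modelled as a locf that takes an optional seed lookup. This is a sketch of the semantics described in this thread, not TimeScale's implementation: the `prev` callable stands in for the scalar subquery, and evaluating it lazily (only when a leading gap is actually hit) is an assumption of this sketch.

```python
from typing import Callable, Optional, Sequence


def locf(
    values: Sequence[Optional[float]],
    prev: Optional[Callable[[], Sequence[float]]] = None,
) -> list[Optional[float]]:
    """Carry the last non-null value forward.

    `prev` plays the role of the scalar subquery: it is consulted once,
    and only if a gap appears before any real value has been seen.
    """
    out: list[Optional[float]] = []
    last: Optional[float] = None
    seeded = False
    for v in values:
        if v is not None:
            last = v
        elif last is None and prev is not None and not seeded:
            candidates = prev()
            seeded = True
            if len(candidates) > 1:
                # Mirrors the observation above: more than one row is an error.
                raise ValueError("prev lookup returned more than one value")
            last = candidates[0] if candidates else None
        out.append(last)
    return out


no_seed = locf([None, None, 30.0, None])
with_seed = locf([None, None, 30.0, None], prev=lambda: [10.0])
```

Without a seed the leading buckets stay null; with it they are filled from the value found before the query range, and the WHERE clause is left untouched.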
I agree with @liukun4515 that unless we come up with something that deviates very little from standard SQL (and/or PG), it may be prudent to think on this and maybe leave it to other packages if we can't find a way. I am not hopeless BTW -- I think there are ways to do this in a very much standard-like way, I just haven't had the time to look into it. |
I think in terms of IOx we are happy to do it downstream in IOx via the existing DataFusion extension points as well. I think it would help @wolffcm to know which way we are leaning so we can avoid too much rework.
This issue hasn't moved at all for 1.5 years; what's the status here? I see that IOx has implemented this as a UDF, but it seems to me that standard SQL (and/or PG) is taken too seriously within DataFusion, which inherently limits its adoptability within time-series applications such as finance or IoT, both quite big and growing industries. I wouldn't suggest this if IOx had stayed open source, but since it is not anymore, couldn't it be supported at least through some kind of feature flag, something like
I don't think there is any new status to report from my perspective
Our implementation's source is currently in FWIW "soon" InfluxData plans an open source offering based on the 3.0 architecture (aka IOx) but I don't have any additional specific details to share there
I think that would be possible. To improve timeseries support in DataFusion itself, I think working on ASOF join might be a good first step, as that is perhaps more "standard": #318. Note that there are a bunch of interesting timeseries optimizations, such as #10316 and #10313, that could be added.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A common use case when working with time series data is to compute an aggregate value for windows of time, e.g., every minute, hour, week or whatever. It is possible to do this with the DATE_BIN function in DataFusion. However, DATE_BIN will not produce any value for a window that did not contain any rows.
For example, for this input data:
We might run this query:
And we would get something like:
Generating a row in the output for 2022-12-02 is difficult to do with ANSI-SQL. Here is one attempt: Fill Gaps in Time Series with this simple trick in SQL. Having to write SQL like this for what is an intuitive and common use case is frustrating.
Describe the solution you'd like
It would be good to have a concise, idiomatic way to do this. Many vendors provide a solution for this problem. They have the following in common:
One such solution would be to use functions like TimeScale's time_bucket_gapfill and locf (last observation carried forward):
https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/
The above query might be changed to this, using time_bucket_gapfill and locf:
TimeScale also provides interpolate to populate a gap with an interpolated value (e.g., would put 20 in the gap for the example).
I've written up an approach to this work here:
https://docs.google.com/document/d/1vIcs9uhlCX_AkD9bemcDx-YhBOVe_TW5sBbXtKCHIfk/edit?usp=sharing
Initially we (InfluxData) were going to implement this in IOx directly, but it seems like it could be worth upstreaming into DataFusion.
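As a rough illustration of what interpolate does across a gap, here is a small Python sketch. It assumes evenly spaced buckets and plain linear interpolation; the neighbouring values 10 and 30 are an assumption, picked so that the missing bucket comes out as 20 as described above, and the function is not TimeScale's implementation.

```python
from typing import Optional


def interpolate(values: list[Optional[float]]) -> list[Optional[float]]:
    """Linearly fill runs of None that sit between two known values.

    Buckets are assumed to be evenly spaced. Leading and trailing gaps
    have only one neighbour, so they are left as None.
    """
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            out[i] = values[left] + frac * (values[right] - values[left])
    return out


# Assumed daily aggregates: 2022-12-01 -> 10, 2022-12-02 missing, 2022-12-03 -> 30.
daily = [10.0, None, 30.0]
filled = interpolate(daily)
```

Unlike locf, this needs a value on both sides of the gap, so the first-value question raised in the comments applies to the last bucket as well.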
Describe alternatives you've considered
Postgres provides a general purpose way to generate data:
https://www.postgresql.org/docs/9.1/functions-srf.html#FUNCTIONS-SRF-SERIES
But this seems like it would be more difficult to use than something like time_bucket_gapfill.
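For comparison, the generate-a-series alternative amounts to building the full list of buckets and left-joining the aggregates onto it. The Python sketch below mimics that shape. The `generate_series` helper here is a local stand-in that only copies the inclusive start/stop/step behaviour of the PostgreSQL function, and the rows are made-up sample data.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def generate_series(start: date, stop: date, step: timedelta):
    """Yield start, start + step, ... up to and including stop."""
    current = start
    while current <= stop:
        yield current
        current += step


# Hypothetical raw rows: (timestamp, value). Nothing falls on 2022-12-02.
rows = [
    (datetime(2022, 12, 1, 9), 10.0),
    (datetime(2022, 12, 1, 15), 20.0),
    (datetime(2022, 12, 3, 12), 30.0),
]

# Step 1: the ordinary GROUP BY day / AVG aggregation.
sums, counts = defaultdict(float), defaultdict(int)
for ts, value in rows:
    sums[ts.date()] += value
    counts[ts.date()] += 1
averages = {day: sums[day] / counts[day] for day in sums}

# Step 2: the "left join" of every bucket against the aggregates.
series = generate_series(date(2022, 12, 1), date(2022, 12, 3), timedelta(days=1))
filled = [(day, averages.get(day)) for day in series]
```

This produces the missing 2022-12-02 row with a null, but the caller has to spell out the bucket range and the join by hand and still has to apply locf or interpolation afterwards, which is what makes it harder to use.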