
Add a duckdb_to_parquet using low level arrow functions #33

Open

ddotta opened this issue Apr 14, 2023 · 3 comments

ddotta commented Apr 14, 2023

Just to flag that I wrote an enhanced interface to SQL, available at https://github.com/jllipatz/SQL. It is still a work in progress...

Originally posted by @jllipatz in #27 (comment)

ddotta commented Apr 14, 2023

Originally posted by @jllipatz in #27 (comment)

Hello,

The dbSendQuery/dbFetch pair doesn't fit well with duckdb: the query is fully resolved before dbFetch is reached, which can overfill the RAM. Here is a solution that works without consuming much RAM. It also runs much faster than simply including a COPY TO parquet in the SQL query. Perhaps it could become the starting point of a new function in {parquetize}, if somebody adds the partitioning options that exist for the other functions.

```r
library(DBI)
library(duckdb)
library(arrow)

SQL2parquet <- function(query, path, chunk_size = 1e6) {
  con <- dbConnect(duckdb::duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE))

  # Ask duckdb for an Arrow record batch reader instead of a data.frame,
  # so the result is streamed chunk by chunk rather than materialized at once.
  reader <- duckdb_fetch_record_batch(
    dbSendQuery(con, query, arrow = TRUE),
    chunk_size = chunk_size
  )

  file <- FileOutputStream$create(path)
  batch <- reader$read_next_batch()
  if (!is.null(batch)) {
    s <- batch$schema
    writer <- ParquetFileWriter$create(
      s, file,
      properties = ParquetWriterProperties$create(names(s))
    )

    i <- 0
    while (!is.null(batch)) {
      i <- i + 1
      message(sprintf("%d, %d rows", i, nrow(batch)))
      writer$WriteTable(arrow_table(batch), chunk_size = chunk_size)
      batch <- NULL; gc()  # release the batch before fetching the next one
      batch <- reader$read_next_batch()
    }

    writer$Close()
  }
  file$close()
}
```
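For a quick end-to-end check of the same streaming pattern, here is a minimal, self-contained sketch. It assumes recent {duckdb} and {arrow} versions; the `range(1000)` query is just a stand-in for a real query:

```r
library(DBI)
library(duckdb)
library(arrow)

# Stream a small duckdb result to Parquet batch by batch, then read it back.
con <- dbConnect(duckdb::duckdb())
res <- dbSendQuery(con, "SELECT * FROM range(1000)", arrow = TRUE)
reader <- duckdb_fetch_record_batch(res, chunk_size = 256)

path <- tempfile(fileext = ".parquet")
sink <- FileOutputStream$create(path)
writer <- ParquetFileWriter$create(
  reader$schema, sink,
  properties = ParquetWriterProperties$create(names(reader$schema))
)

# read_next_batch() returns NULL once the result set is exhausted
while (!is.null(batch <- reader$read_next_batch())) {
  writer$WriteTable(arrow_table(batch), chunk_size = 256)
}
writer$Close()
sink$close()
dbDisconnect(con, shutdown = TRUE)

nrow(read_parquet(path))
```

With `chunk_size = 256`, the 1000-row result arrives in four batches, so peak memory stays bounded by the batch size rather than the full result.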

ddotta commented Apr 14, 2023

Originally posted by @nbc in #27 (comment)

Hi @jllipatz, thanks, I'm very interested. I think preparing a parquet file in duckdb could be a good use case, but I don't feel comfortable enough in arrow's internals to start working on this for the moment. I need to explore more.

@ddotta ddotta mentioned this issue Apr 14, 2023
ddotta commented Apr 14, 2023

I agree with @nbc. I find your idea @jllipatz really interesting and promising 🚀
But it represents a fairly high entry cost to master all these low-level arrow features 😢
