
Add a duckdb_to_parquet using low level arrow functions #33

Open

ddotta opened this issue Apr 14, 2023 · 3 comments

ddotta commented Apr 14, 2023

Just to flag that I wrote an enhanced interface to SQL, available at https://github.com/jllipatz/SQL. It is still a work in progress...

Originally posted by @jllipatz in #27 (comment)

ddotta commented Apr 14, 2023

Originally posted by @jllipatz in #27 (comment)

Hello,

The dbSendQuery/dbFetch pair doesn't fit well with duckdb: the query is fully resolved before dbFetch is reached, which can overfill the RAM. Here is a solution that works without consuming much RAM. It also runs much faster than simply including a COPY TO parquet in the SQL query. Perhaps it could become the starting point of a new function in {parquetize}, if somebody adds the partitioning options that exist for the other functions.

```r
library(DBI)
library(duckdb)
library(arrow)

SQL2parquet <- function(query, path, chunk_size = 1e6) {
  con <- dbConnect(duckdb::duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE))

  # Ask duckdb for an Arrow record batch reader instead of a data.frame,
  # so the result is streamed chunk by chunk rather than materialized at once.
  reader <- duckdb_fetch_record_batch(
    dbSendQuery(con, query, arrow = TRUE),
    chunk_size = chunk_size
  )

  file <- FileOutputStream$create(path)
  batch <- reader$read_next_batch()
  if (!is.null(batch)) {
    s <- batch$schema
    writer <- ParquetFileWriter$create(
      s, file,
      properties = ParquetWriterProperties$create(names(s))
    )

    i <- 0
    while (!is.null(batch)) {
      i <- i + 1
      message(sprintf("%d, %d rows", i, nrow(batch)))
      writer$WriteTable(arrow_table(batch), chunk_size = chunk_size)
      batch <- NULL; gc()  # release the batch before fetching the next one
      batch <- reader$read_next_batch()
    }

    writer$Close()
  }
  file$close()
}
```
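For a quick end-to-end check of the same streaming pattern, here is a minimal, self-contained sketch. It assumes recent {duckdb} and {arrow} versions; the `range(1000)` query is just a stand-in for a real query:

```r
library(DBI)
library(duckdb)
library(arrow)

# Stream a small duckdb result to Parquet batch by batch, then read it back.
con <- dbConnect(duckdb::duckdb())
res <- dbSendQuery(con, "SELECT * FROM range(1000)", arrow = TRUE)
reader <- duckdb_fetch_record_batch(res, chunk_size = 256)

path <- tempfile(fileext = ".parquet")
sink <- FileOutputStream$create(path)
writer <- ParquetFileWriter$create(
  reader$schema, sink,
  properties = ParquetWriterProperties$create(names(reader$schema))
)

# read_next_batch() returns NULL once the result set is exhausted
while (!is.null(batch <- reader$read_next_batch())) {
  writer$WriteTable(arrow_table(batch), chunk_size = 256)
}
writer$Close()
sink$close()
dbDisconnect(con, shutdown = TRUE)

nrow(read_parquet(path))
```

With `chunk_size = 256`, the 1000-row result arrives in four batches, so peak memory stays bounded by the batch size rather than the full result.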

ddotta commented Apr 14, 2023

Originally posted by @nbc in #27 (comment)

Hi @jllipatz, thanks, I'm very interested. I think preparing a parquet file in duckdb could be a good use case, but I don't feel comfortable enough in arrow's internals to start working on this for the moment. I need to explore more.

@ddotta ddotta mentioned this issue Apr 14, 2023
ddotta commented Apr 14, 2023

I agree with @nbc. I find your idea @jllipatz really interesting and promising 🚀
But it represents a fairly high entry cost to master all these low-level arrow features 😢
