Aborted (core dumped) with hafenkran/duckdb-bigquery with nodejs API #104

vwxyzjn opened this issue Jan 10, 2025 · 4 comments
vwxyzjn commented Jan 10, 2025

I have detailed the issue in hafenkran/duckdb-bigquery#58.

The DuckDB CLI and the Python API both work, but the Node.js API fails...

vwxyzjn (Author) commented Jan 10, 2025

duckdb-async also works

import { Database } from "duckdb-async";

async function main() {
    try {
        // Initialize DuckDB with config
        console.log('Connecting to DuckDB...');
        const db = await Database.create('cache.db', {
            allow_unsigned_extensions: 'true'
        });
        
        // Create table if not exists
        console.log('Ensuring local table exists...');
        await db.exec(`
            CREATE TABLE IF NOT EXISTS metrics (
                task_name VARCHAR,
                task_idx BIGINT,
                task_config JSON,
                model_config JSON,
                compute_config VARCHAR,
                metrics JSON,
                run_date VARCHAR,
                num_instances BIGINT,
                processing_time DOUBLE,
                workspace VARCHAR,
                experiment_id VARCHAR,
                eval_sha VARCHAR,
                task_hash VARCHAR,
                model_hash VARCHAR
            );
        `);
        console.log('Table created successfully!');

        // Install and load bigquery extension
        console.log('Setting up bigquery extension...');
        await db.exec('INSTALL bigquery FROM community;');
        await db.exec('LOAD bigquery;');

        // Get total count first
        console.log('Counting total rows to copy...');
        const countResult = await db.all(`
            SELECT COUNT(*) as total_count 
            FROM bigquery_scan('testestes.deletable.model_evaluations');
        `);
        
        const total_to_copy = Number(countResult[0].total_count);
        console.log(`Found ${total_to_copy.toLocaleString()} rows to copy`);

        
    } catch (error) {
        console.error('Error:', error);
        process.exit(1);
    }
}

main();

jraymakers (Contributor) commented
Thanks for the report. Unfortunately it's difficult for me to diagnose, because I don't have access to a BigQuery environment, so I can't run your script. (I unsurprisingly get an error about Google credentials.)

If there is a problem in Node Neo (as opposed to, say, the BigQuery extension), then it should be possible to create a repro without that extension. Do you still see the problem if you alter your example to avoid using that extension, perhaps by first exporting all or part of the data in a separate step?

Also, it would be helpful to know what version of @duckdb/node-api you're using, and on which platform.
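That information can be gathered with something like the following (a sketch; standard npm and node commands):

```shell
# Installed version of the binding.
npm ls @duckdb/node-api

# OS, CPU architecture, and Node version.
node -p "process.platform + ' ' + process.arch + ', node ' + process.version"
```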

carlopi (Collaborator) commented Jan 11, 2025

@vwxyzjn: what platforms are those environments running on?

Could you check the result of PRAGMA platform; for the CLI, the Python API, duckdb-async, and the node-neo API on your machine?

It's unclear what to make of the answer yet, but it could help track this down.
Some general information on the architecture / OS you're running this on would also help (already asked by @jraymakers, I now see...)

vwxyzjn (Author) commented Jan 11, 2025

D PRAGMA platform;
┌──────────────────┐
│     platform     │
│     varchar      │
├──────────────────┤
│ linux_amd64_gcc4 │
└──────────────────┘

Yeah, I feel like the only way to reproduce this is if you create a BigQuery table yourself... Here is the command to create the table on BigQuery:

CREATE OR REPLACE TABLE `ai2-allennlp.deletable.model_evaluations`
AS
SELECT 
  ROW_NUMBER() OVER() as id,
  CONCAT('project_', CAST(FLOOR(RAND() * 100) AS STRING)) as project,
  CONCAT('user_', CAST(FLOOR(RAND() * 1000) AS STRING)) as username,
  CONCAT('model_', CAST(FLOOR(RAND() * 1000) AS STRING)) as model_name,
  CONCAT('run_', CAST(FLOOR(RAND() * 1000) AS STRING)) as run_id,
  CASE CAST(FLOOR(RAND() * 5) AS INT64)
    WHEN 0 THEN 'gsm8k'
    WHEN 1 THEN 'ifeval'
    WHEN 2 THEN 'popqa'
    WHEN 3 THEN 'mmlu:cot::summarize'
    ELSE 'mmlu_abstract_algebra:mc'
  END as task_name,
  RAND() as primary_score
FROM 
  UNNEST(GENERATE_ARRAY(1, 1000000));  -- 1 million rows

Below are the package.json and lock file:
https://gist.github.com/vwxyzjn/289c63935dd24568f4db94c57973eda0
