
Request to add additional documentation on running the benchmark on AWS or Azure #24

Open
rbalamohan opened this issue May 15, 2023 · 5 comments
Labels: documentation (Improvements or additions to documentation)

Comments

@rbalamohan

Tried this on AWS and got the following exception when trying with Iceberg as the data source. This was with Spark 3.3.1.

It will be helpful to add additional documentation on running the benchmark on AWS or on Azure.

```
2023-05-15T07:43:26,031 INFO [main] telemetry.JDBCTelemetryRegistry: Creating new logging tables...
2023-05-15T07:43:26,323 INFO [main] telemetry.JDBCTelemetryRegistry: Logging tables created.
2023-05-15T07:43:26,545 INFO [main] common.LSTBenchmarkExecutor: Running experiment: spark_del_sf_10
2023-05-15T07:43:26,548 INFO [main] common.LSTBenchmarkExecutor: Experiment start time: 2023_05_15_07_43_26_546
2023-05-15T07:43:26,548 INFO [main] common.LSTBenchmarkExecutor: Starting repetition: 0
2023-05-15T07:43:26,590 INFO [main] common.LSTBenchmarkExecutor: Running setup phase...
2023-05-15T07:43:55,924 INFO [main] common.LSTBenchmarkExecutor: Phase setup finished in 29 seconds.
2023-05-15T07:43:55,925 INFO [main] common.LSTBenchmarkExecutor: Running setup_data_maintenance phase...
2023-05-15T07:45:36,793 INFO [main] common.LSTBenchmarkExecutor: Phase setup_data_maintenance finished in 100 seconds.
2023-05-15T07:45:36,794 INFO [main] common.LSTBenchmarkExecutor: Running init phase...
2023-05-15T07:45:37,414 INFO [main] common.LSTBenchmarkExecutor: Phase init finished in 0 seconds.
2023-05-15T07:45:37,414 INFO [main] common.LSTBenchmarkExecutor: Running build phase...
2023-05-15T07:45:37,450 ERROR [pool-2-thread-4] common.LSTBenchmarkExecutor: Exception executing statement: 1_create_call_center.sql_0
java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.ClassNotFoundException: Failed to find data source: iceberg. Please find packages at https://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:44)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:325)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:230)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
    at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:230)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:225)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:239)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: Failed to find data source: iceberg. Please find packages at https://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:587)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:675)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
    at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.org$apache$spark$sql$catalyst$analysis$ResolveSessionCatalog$$isV2Provider(ResolveSessionCatalog.scala:606)
    at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:154)
    at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:49)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUpWithPruning$3(AnalysisHelper.scala:138)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:179)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUpWithPruning$1(AnalysisHelper.scala:138)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUpWithPruning(AnalysisHelper.scala:134)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUpWithPruning$(AnalysisHelper.scala:130)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUpWithPruning(LogicalPlan.scala:31)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:111)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:110)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:31)
    at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.apply(ResolveSessionCatalog.scala:49)
    at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.apply(ResolveSessionCatalog.scala:43)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:215)
    at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
    at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
    at scala.collection.immutable.List.foldLeft(List.scala:91)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeBatch$1(RuleExecutor.scala:212)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$6(RuleExecutor.scala:284)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$RuleExecutionContext$.withContext(RuleExecutor.scala:327)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5(RuleExecutor.scala:284)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5$adapted(RuleExecutor.scala:274)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:274)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:188)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:227)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:223)
    at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:172)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:223)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:187)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:208)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:207)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:79)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:192)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:214)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:554)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:214)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:213)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:79)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:77)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:69)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:101)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:99)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:622)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:291)
    ... 16 more
Caused by: java.lang.ClassNotFoundException: iceberg.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:661)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:661)
    at scala.util.Failure.orElse(Try.scala:224)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:661)
    ... 74 more
```
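For context, this ClassNotFoundException means the `iceberg` data source is not on the classpath of the Spark Thrift Server that LST-Bench connects to. A minimal sketch of launching the Thrift Server with the Iceberg runtime available is shown below; the package coordinates, Iceberg version, and catalog settings are illustrative assumptions, not taken from this issue, and should be adjusted to the Spark/Scala versions actually in use:

```bash
# Hypothetical example: make the Iceberg runtime and SQL extensions available
# to the Spark Thrift Server (coordinates/version are assumptions).
$SPARK_HOME/sbin/start-thriftserver.sh \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive
```

On Amazon EMR, the same jars and settings can also be supplied through cluster configuration instead of `--packages`; check the EMR documentation for the release in use.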

@jcamachor added the documentation label on May 16, 2023
@Neuw84

Neuw84 commented Sep 26, 2023

Where did you try to run it? Amazon EMR?

If I have time, I would like to try it and then open a pull request to simplify the setup on AWS.

@jcamachor
Contributor

@Neuw84, that would be greatly appreciated. We lack documentation for environment setup (e.g., in any cloud provider), which is currently not automated in LST-Bench. Thus, having someone go over those steps and create a PR documenting them would be very valuable.

@Neuw84

Neuw84 commented Dec 20, 2023

Hi,

Just started working on it via EMR. @jcamachor, one quick question: I have already made the connection via spark-jdbc, but it is not clear to me whether the benchmark expects the data (CSV files in a folder per TPC-DS table) to already exist at the defined paths. What does it expect in those folders? (external_data_path & data_path)

`Caused by: java.io.FileNotFoundException: File s3://data-bench-lstm/sf_10/catalog_sales does not exist.`

```yaml
# Description: Experiment Configuration
---
version: 1
id: spark_del_sf_10
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
# TODO: In the future, many of these could be automatically generated by the framework.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: iceberg
  table_format_version: 2.2.0
  scale_factor: 10
  mode: cow
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: external_tpcds
  external_table_format: csv
  external_data_path: 's3://data-bench-lstm/sf_10/'
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: delta_tpcds
  table_format: iceberg
  scale_factor: 10
  data_path: 's3://data-bench-lstm/delta/sf_10/'
  options_suffix: ''
  tblproperties_suffix: ''
```

Thanks!

@jcamachor
Contributor

jcamachor commented Dec 20, 2023

Thanks @Neuw84 .

> I have already made the connection via spark-jdbc, but it is not clear to me whether the benchmark expects the data (CSV files in a folder per TPC-DS table) to already exist at the defined paths. What does it expect in those folders? (external_data_path & data_path)

Indeed, external_data_path refers to the path to the CSV files, while data_path refers to the path where the Iceberg/Delta/Hudi tables built from that CSV data will be created.
Currently the setup SQL statements in LST-Bench expect that the data is partitioned (see https://github.com/microsoft/lst-bench/blob/main/src/main/resources/scripts/tpcds/setup/spark/ddl-external-tables.sql). If you want to execute data maintenance phases, you'll need to generate and organize the data accordingly as well (see https://github.com/microsoft/lst-bench/blob/main/src/main/resources/scripts/tpcds/setup_data_maintenance/spark/ddl-external-tables-refresh.sql).
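To make the expected layout concrete: the external-table DDL appears to point each TPC-DS table at `<external_data_path>/<table_name>`, which is why the FileNotFoundException above mentions `s3://data-bench-lstm/sf_10/catalog_sales`. A rough sketch of the layout follows; the partition subdirectories shown are an assumption for illustration and should be verified against the linked DDL scripts:

```
s3://data-bench-lstm/sf_10/
├── call_center/                  # CSV files for the call_center table
├── catalog_sales/                # large fact tables are expected partitioned
│   ├── cs_sold_date_sk=2450815/  # assumed partition column; verify in ddl-external-tables.sql
│   └── ...
├── store_sales/
│   └── ss_sold_date_sk=.../
└── ...                           # one directory per TPC-DS table
```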
As mentioned earlier, I hope to document the setup more comprehensively and eventually automate it, although I haven't had the opportunity to do so yet.

@Neuw84

Neuw84 commented Jan 4, 2024

Thanks for your input @jcamachor, I will try again now that I have time.

We have an automated way of producing TPC-DS data using a marketplace connector; the only thing I need to test is whether it partitions the data the way the scripts expect, but I can fix that.

https://aws.amazon.com/marketplace/pp/prodview-xtty6azr4xgey

Will let you know. If you advance on the setup documentation, please ping me as well 👍.

P.S. In the future, my plan is to contribute to OneTable, which I know you are a strong advocate for!

Best!
