
Request to add additional documentation on running the benchmark on AWS or Azure #24

Open
rbalamohan opened this issue May 15, 2023 · 5 comments
Labels: documentation (Improvements or additions to documentation)

Comments

@rbalamohan

Tried this on AWS and got the following exception when trying with Iceberg as the data source. This was with Spark 3.3.1.

It will be helpful to add additional documentation on running the benchmark on AWS or on Azure.

```
2023-05-15T07:43:26,031 INFO [main] telemetry.JDBCTelemetryRegistry: Creating new logging tables...
2023-05-15T07:43:26,323 INFO [main] telemetry.JDBCTelemetryRegistry: Logging tables created.
2023-05-15T07:43:26,545 INFO [main] common.LSTBenchmarkExecutor: Running experiment: spark_del_sf_10
2023-05-15T07:43:26,548 INFO [main] common.LSTBenchmarkExecutor: Experiment start time: 2023_05_15_07_43_26_546
2023-05-15T07:43:26,548 INFO [main] common.LSTBenchmarkExecutor: Starting repetition: 0
2023-05-15T07:43:26,590 INFO [main] common.LSTBenchmarkExecutor: Running setup phase...
2023-05-15T07:43:55,924 INFO [main] common.LSTBenchmarkExecutor: Phase setup finished in 29 seconds.
2023-05-15T07:43:55,925 INFO [main] common.LSTBenchmarkExecutor: Running setup_data_maintenance phase...
2023-05-15T07:45:36,793 INFO [main] common.LSTBenchmarkExecutor: Phase setup_data_maintenance finished in 100 seconds.
2023-05-15T07:45:36,794 INFO [main] common.LSTBenchmarkExecutor: Running init phase...
2023-05-15T07:45:37,414 INFO [main] common.LSTBenchmarkExecutor: Phase init finished in 0 seconds.
2023-05-15T07:45:37,414 INFO [main] common.LSTBenchmarkExecutor: Running build phase...
2023-05-15T07:45:37,450 ERROR [pool-2-thread-4] common.LSTBenchmarkExecutor: Exception executing statement: 1_create_call_center.sql_0
java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.ClassNotFoundException: Failed to find data source: iceberg. Please find packages at https://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:44)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:325)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:230)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
    at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:230)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:225)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:239)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: Failed to find data source: iceberg. Please find packages at https://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:587)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:675)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
    at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.org$apache$spark$sql$catalyst$analysis$ResolveSessionCatalog$$isV2Provider(ResolveSessionCatalog.scala:606)
    at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:154)
    at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:49)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUpWithPruning$3(AnalysisHelper.scala:138)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:179)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUpWithPruning$1(AnalysisHelper.scala:138)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUpWithPruning(AnalysisHelper.scala:134)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUpWithPruning$(AnalysisHelper.scala:130)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUpWithPruning(LogicalPlan.scala:31)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:111)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:110)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:31)
    at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.apply(ResolveSessionCatalog.scala:49)
    at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.apply(ResolveSessionCatalog.scala:43)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:215)
    at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
    at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
    at scala.collection.immutable.List.foldLeft(List.scala:91)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeBatch$1(RuleExecutor.scala:212)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$6(RuleExecutor.scala:284)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$RuleExecutionContext$.withContext(RuleExecutor.scala:327)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5(RuleExecutor.scala:284)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5$adapted(RuleExecutor.scala:274)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:274)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:188)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:227)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:223)
    at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:172)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:223)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:187)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:208)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:207)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:79)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:192)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:214)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:554)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:214)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:213)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:79)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:77)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:69)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:101)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:99)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:622)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:291)
    ... 16 more
Caused by: java.lang.ClassNotFoundException: iceberg.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:661)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:661)
    at scala.util.Failure.orElse(Try.scala:224)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:661)
    ... 74 more
```
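For context, this ClassNotFoundException means the `iceberg` data source is not on the classpath of the Spark Thrift Server that LST-Bench connects to. A minimal sketch of launching the Thrift Server with the Iceberg runtime available is shown below; the package coordinates, Iceberg version, and catalog settings are illustrative assumptions, not taken from this issue, and should be adjusted to the Spark/Scala versions actually in use:

```bash
# Hypothetical example: make the Iceberg runtime and SQL extensions available
# to the Spark Thrift Server (coordinates/version are assumptions).
$SPARK_HOME/sbin/start-thriftserver.sh \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive
```

On Amazon EMR, the same jars and settings can also be supplied through cluster configuration instead of `--packages`; check the EMR documentation for the release in use.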

@jcamachor added the documentation label on May 16, 2023
@Neuw84

Neuw84 commented Sep 26, 2023

Where did you try to run it? Amazon EMR?

If I have time, I would like to try it and then open a pull request to simplify the setup on AWS.

@jcamachor
Contributor

@Neuw84, that would be greatly appreciated. We lack documentation for environment setup (e.g., in any cloud provider), which is currently not automated in LST-Bench. Thus, having someone go over those steps and create a PR documenting them would be very valuable.

@Neuw84

Neuw84 commented Dec 20, 2023

Hi,

Just started working on it via EMR. @jcamachor, one quick question: I have already made the connection via spark-jdbc, but it is not clear to me whether the benchmark expects the data (CSV files in a folder per TPC-DS table) to already exist at the defined paths. What does it expect in those folders? (external_data_path & data_path)

`Caused by: java.io.FileNotFoundException: File s3://data-bench-lstm/sf_10/catalog_sales does not exist.`

```yaml
# Description: Experiment Configuration
---
version: 1
id: spark_del_sf_10
repetitions: 1
# Metadata accepts any key-value that we want to register together with the experiment run.
# TODO: In the future, many of these could be automatically generated by the framework.
metadata:
  system: spark
  system_version: 3.3.1
  table_format: iceberg
  table_format_version: 2.2.0
  scale_factor: 10
  mode: cow
# The following parameter values will be used to replace the variables in the workload statements.
parameter_values:
  external_catalog: spark_catalog
  external_database: external_tpcds
  external_table_format: csv
  external_data_path: 's3://data-bench-lstm/sf_10/'
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: delta_tpcds
  table_format: iceberg
  scale_factor: 10
  data_path: 's3://data-bench-lstm/delta/sf_10/'
  options_suffix: ''
  tblproperties_suffix: ''
```

Thanks!

@jcamachor
Contributor

jcamachor commented Dec 20, 2023

Thanks @Neuw84 .

> I have already made the connection via spark-jdbc, but it is not clear to me whether the benchmark expects the data (CSV files in a folder per TPC-DS table) to already exist at the defined paths. What does it expect in those folders? (external_data_path & data_path)

Indeed, external_data_path refers to the path to the CSV files, while data_path refers to the path where the Iceberg/Delta/Hudi tables built from that CSV data will be created.
Currently the setup SQL statements in LST-Bench expect that the data is partitioned (see https://github.com/microsoft/lst-bench/blob/main/src/main/resources/scripts/tpcds/setup/spark/ddl-external-tables.sql). If you want to execute data maintenance phases, you'll need to generate and organize the data accordingly as well (see https://github.com/microsoft/lst-bench/blob/main/src/main/resources/scripts/tpcds/setup_data_maintenance/spark/ddl-external-tables-refresh.sql).
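To make the expected layout concrete: the external-table DDL appears to point each TPC-DS table at `<external_data_path>/<table_name>`, which is why the FileNotFoundException above mentions `s3://data-bench-lstm/sf_10/catalog_sales`. A rough sketch of the layout follows; the partition subdirectories shown are an assumption for illustration and should be verified against the linked DDL scripts:

```
s3://data-bench-lstm/sf_10/
├── call_center/                  # CSV files for the call_center table
├── catalog_sales/                # large fact tables are expected partitioned
│   ├── cs_sold_date_sk=2450815/  # assumed partition column; verify in ddl-external-tables.sql
│   └── ...
├── store_sales/
│   └── ss_sold_date_sk=.../
└── ...                           # one directory per TPC-DS table
```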
As mentioned earlier, I hope to document the setup more comprehensively and eventually automate it, although I haven't had the opportunity to do so yet.

@Neuw84

Neuw84 commented Jan 4, 2024

Thanks for your input @jcamachor, I will try again now that I have time.

We have an automated way of producing TPC-DS data using a marketplace connector; the only thing I need to test is whether it partitions the data the way the scripts expect, but I can fix that.

https://aws.amazon.com/marketplace/pp/prodview-xtty6azr4xgey

Will let you know. If you advance on the setup documentation, please ping me as well 👍.

P.S. In the future, my plan is to contribute to OneTable, which I know you are a strong advocate for!

Best!
