WIP
snazy committed Nov 27, 2023
1 parent ffd26af commit a28f126
Showing 23 changed files with 2,409 additions and 5,171 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/demos-docker-build.yaml
@@ -34,7 +34,7 @@ jobs:
strategy:
max-parallel: 4
matrix:
python-version: [3.7]
python-version: ['3.10']

steps:
- uses: actions/checkout@v3
21 changes: 20 additions & 1 deletion .github/workflows/notebooks.yaml
@@ -34,12 +34,25 @@ jobs:
strategy:
max-parallel: 4
matrix:
python-version: [3.7]
python-version: ['3.10']

steps:
- uses: actions/checkout@v3
- name: Install system dependencies
run: sudo apt-get install libsasl2-dev libsasl2-modules
- name: Set up Java
uses: actions/setup-java@v3
with:
distribution: 'temurin'
# Need Java 8 for Hive + 11 for Spark (and Nessie)
java-version: |
8
11
- name: setup JAVAx_HOME
run: |
echo "JAVA8_HOME=$JAVA_HOME_8_X64" >> ${GITHUB_ENV}
echo "JAVA11_HOME=$JAVA_HOME_11_X64" >> ${GITHUB_ENV}
echo "JAVA_HOME=$JAVA_HOME_11_X64" >> ${GITHUB_ENV}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
@@ -65,3 +78,9 @@ jobs:
- name: Test Notebooks with Tox
working-directory: notebooks/tests
run: tox
- name: Dump Hive output
working-directory: notebooks/
if: failure()
run: |
find . -name "nohup*"
cat nohup.out
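
The `Dump Hive output` step above searches `notebooks/` for `nohup*` files when the job fails. A minimal Python sketch of the same idea is shown here, assuming the logs are plain text; the helper name `collect_nohup_logs` is hypothetical and not part of this repository.

```python
from pathlib import Path


def collect_nohup_logs(root: str) -> dict[str, str]:
    """Return {relative path: contents} for every file named nohup* under root."""
    base = Path(root)
    logs = {}
    # rglob walks the tree recursively, like `find . -name "nohup*"`.
    for path in sorted(base.rglob("nohup*")):
        if path.is_file():
            # errors="replace" keeps the dump going if a log has odd bytes.
            logs[str(path.relative_to(base))] = path.read_text(errors="replace")
    return logs
```

Printing every entry of `collect_nohup_logs(".")` would dump all matching files rather than only `nohup.out`, which may be useful because the `start` script in this commit writes to `nohup-nessie.out` and `nohup-hive.out`.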
4 changes: 4 additions & 0 deletions .gitignore
@@ -25,6 +25,7 @@ notebooks/iceberg-*-runtime-*
notebooks/hadoop-*
notebooks/apache-hive-*-bin
notebooks/metastore_db
notebooks/hiveserver2.pid
notebooks/*.log
notebooks/*.out
# using sed on mac always needs a backup file
@@ -38,6 +39,9 @@ venv/
__pycache__/
.pytest_cache

# pyenv
.python-version

# Jetbrains IDEs
/.idea
*.iws
6 changes: 3 additions & 3 deletions README.md
@@ -21,7 +21,7 @@ Nessie version is set in Binder at `docker/binder/requirements_base.txt`. Curren

### Iceberg

Currently we are using Iceberg `0.13.1` and it is specified in both iceberg notebooks as well as `docker/utils/__init__.py`
Currently we are using Iceberg `1.4.2` and it is specified in both iceberg notebooks as well as `docker/utils/__init__.py`

### Spark

@@ -30,7 +30,7 @@ Only has to be updated in `docker/binder/requirements.txt`. Currently, Iceberg s

### Flink

Flink version is set in Binder at `docker/binder/requirements_flink.txt`. Currently, we are using `1.13.6`.
Flink version is set in Binder at `docker/binder/requirements_flink.txt`. Currently, we are using `1.17.1`.

### Hadoop

@@ -53,7 +53,7 @@ Of course, Binder just lets a user "simply start" a notebook via a simple "click

## Development
For development, you will need to make sure to have the following installed:
- Python 3.7+
- Python 3.10+
- pre-commit

Regarding pre-commit, you will need to make sure it is installed through `pre-commit install` in order to install the hooks locally since this repo
2 changes: 1 addition & 1 deletion binder/Dockerfile
@@ -2,7 +2,7 @@

# Tag will be automatically generated through pre-commit hook if any changes
# happened in the docker/ folder
FROM ghcr.io/projectnessie/nessie-binder-demos:649ec80b8fa7d9666178380a33b2e645a52d5985
FROM ghcr.io/projectnessie/nessie-binder-demos:85bb4614c389c32c7107a7911a30a057486b9a72

# Create the necessary folders for the demo, this will be created and owned by {NB_USER}
RUN mkdir -p notebooks && mkdir -p datasets
8 changes: 4 additions & 4 deletions binder/README.md
@@ -1,8 +1,8 @@
## Building binder locally

### Prerequisites
You need to have a python 3.7+ installed.
We recommend to use [pyenv](https://github.com/pyenv/pyenv) for managing your python environment(s).
You need to have a python 3.10+ installed.
We recommend to use [pyenv](https://github.com/pyenv/pyenv) for managing your python environment(s).

To build the binder image locally, firstly, you need to install `jupyter-repo2docker` dependency:

@@ -29,8 +29,8 @@ Run (or look into) the `build_run_local_docker.sh` script how to do this semi-au
After those steps, the binder should be running on your local machine.
Next, find the output similar to this:
```shell
[C 13:38:25.199 NotebookApp]
[C 13:38:25.199 NotebookApp]

To access the notebook, open this file in a browser:
file:///home/jovyan/.local/share/jupyter/runtime/nbserver-40-open.html
Or copy and paste this URL:
1 change: 1 addition & 0 deletions docker/binder/apt.txt
@@ -16,6 +16,7 @@

# Packages needed for mybinder.org

openjdk-8-jdk-headless
openjdk-11-jdk-headless
# SASL lib needed for thrift API to access Hive
libsasl2-dev
2 changes: 1 addition & 1 deletion docker/binder/postBuild
@@ -26,7 +26,7 @@ python -m ipykernel install --name "flink-demo" --user
python -c "import utils;utils._copy_all_hadoop_jars_to_pyflink()"
conda deactivate

python -c "import utils;utils.fetch_nessie()"
python -c "import utils;utils.fetch_nessie_jar()"

python -c "import utils;utils.fetch_spark()"

Expand Down
8 changes: 5 additions & 3 deletions docker/binder/requirements.txt
@@ -1,5 +1,7 @@
-r requirements_base.txt
findspark==2.0.1
pandas==1.3.5
pyhive[hive]==0.6.5
pyspark==3.2.1
# Need this numpy version due to compatibility reasons with numpy/pyspark
numpy==1.21.6
pandas==1.5.3
pyhive[hive_pure_sasl]==0.7.0
pyspark==3.2.4
2 changes: 1 addition & 1 deletion docker/binder/requirements_base.txt
@@ -1 +1 @@
pynessie==0.30.0
pynessie==0.65.0
4 changes: 1 addition & 3 deletions docker/binder/requirements_flink.txt
@@ -1,4 +1,2 @@
-r requirements_base.txt
apache-flink==1.13.6
# flink requires pandas<1.2.0 see https://github.com/apache/flink/blob/release-1.13.6/flink-python/setup.py#L313
pandas==1.1.5
apache-flink==1.17.1
1 change: 1 addition & 0 deletions docker/binder/runtime.txt
@@ -0,0 +1 @@
python-3.10
11 changes: 7 additions & 4 deletions docker/binder/start
@@ -15,19 +15,22 @@
# limitations under the License.
#

nohup ./nessie-quarkus-runner &

SPARK_VERSION=$(python -c "import utils;print(utils._SPARK_VERSION)")
HADOOP_VERSION=$(python -c "import utils;print(utils._HADOOP_VERSION)")
HIVE_VERSION=$(python -c "import utils;print(utils._HIVE_VERSION)")

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export JAVA11_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export JAVA8_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JAVA_HOME=$JAVA11_HOME
export PATH=$JAVA_HOME/bin:$PATH

nohup java -jar nessie-quarkus-runner.jar > nohup-nessie.out &

export SPARK_HOME=$PWD/spark-$SPARK_VERSION-bin-hadoop3.2
export HADOOP_HOME=$PWD/hadoop-$HADOOP_VERSION

#Start Hive
chmod +x $PWD/binder/start.hive
nohup $PWD/binder/start.hive $PWD $PWD/binder/resources $HIVE_VERSION
nohup $PWD/binder/start.hive $PWD $PWD/binder/resources $HIVE_VERSION > nohup-hive.out

exec "$@"
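
The `start` script now exports both JDK locations and puts Java 11 first on `PATH` before launching the Nessie jar. A rough Python equivalent of that environment setup is sketched here; `java_env` is a hypothetical helper, and the JDK paths are whatever the caller passes in (the script above assumes the Debian/Ubuntu OpenJDK locations).

```python
import os


def java_env(java8_home: str, java11_home: str, base_env: dict[str, str]) -> dict[str, str]:
    """Return a copy of base_env with Java 11 as the default JDK."""
    env = dict(base_env)
    env["JAVA8_HOME"] = java8_home
    env["JAVA11_HOME"] = java11_home
    env["JAVA_HOME"] = java11_home
    # Prepend so the Java 11 `java` binary wins over any other on PATH.
    env["PATH"] = os.path.join(java11_home, "bin") + os.pathsep + env.get("PATH", "")
    return env
```

The resulting mapping could be passed as the `env` argument of `subprocess.Popen` when starting `java -jar nessie-quarkus-runner.jar`, leaving the parent process environment untouched.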
58 changes: 54 additions & 4 deletions docker/binder/start.hive
@@ -20,6 +20,8 @@ RESOURCE_DIR=$2
HIVE_VERSION=$3
HIVE_FOLDER_NAME="apache-hive-$HIVE_VERSION-bin"
HIVE_WAREHOUSE_DIR=$HIVE_PARENT_DIR/hive_warehouse
HIVE_PID_FILE=$HIVE_PARENT_DIR/hiveserver2.pid
HIVE_DB=$HIVE_PARENT_DIR/metastore_db

if [ -z "$HIVE_PARENT_DIR" ]; then
echo "Input the parent dir as the first argument"
@@ -38,21 +40,69 @@ fi

export HIVE_HOME=$HIVE_PARENT_DIR/$HIVE_FOLDER_NAME

# Create hive warehouse folder
mkdir $HIVE_WAREHOUSE_DIR

# Copy the needed configs to Hive folder
cp $RESOURCE_DIR/hive/config/hive-site.xml ${HIVE_HOME}/conf/

# Set Hive warehouse path in the hive-site.xml
sed -i.bak "s~HIVE_WAREHOUSE_DIR~$HIVE_WAREHOUSE_DIR~g" ${HIVE_HOME}/conf/hive-site.xml

# Check for Java 8 + 11 for tox (also in /notebooks/tests/scripts/start_hive)
if [[ -z ${JAVA8_HOME} || -z ${JAVA11_HOME} || ! -d ${JAVA8_HOME} || ! -d ${JAVA11_HOME} ]] ; then
cat <<! > /dev/stderr
============================================================================================================
Define the JAVA8_HOME and JAVA11_HOME environment variables to point to Java 8 and Java 11 development kits.
============================================================================================================
Need Java 8 for Hive server to work.
Java 11 (not newer!) is required for Spark, but also Nessie.
!
exit 1
fi

# Kill an already running hiveserver
if [[ -f $HIVE_PID_FILE ]] ; then
kill "$(cat $HIVE_PID_FILE)" || true
fi

# Remove an already existing metastore_db
if [[ -d $HIVE_DB ]] ; then
echo "Removing existing $HIVE_DB"
rm -rf $HIVE_DB
fi

# (Re-)create hive warehouse folder
rm -rf $HIVE_WAREHOUSE_DIR
mkdir -p $HIVE_WAREHOUSE_DIR

# Initialize Hive's Derby database
$HIVE_HOME/bin/schematool -dbType derby -initSchema
echo "Finished initializing Derby database for Hive."

# increase the Heap memory being used by Hive-MapReduce jobs
export HADOOP_HEAPSIZE=1500

# Use Java 8 for Hive :facepalm:
OLD_PATH="$PATH"
export PATH="$JAVA8_HOME/bin:$PATH"
export JAVA_HOME=$JAVA8_HOME
cat <<!
For Hive Server:
================
Using JAVA_HOME=$JAVA_HOME
java binary: $(which java)
$(java -version 2>&1)
!

# Once we are done initializing the database, we start Hive
$HIVE_HOME/bin/hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 &
$HIVE_HOME/bin/hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.root.logger=INFO,console &
echo $! > $HIVE_PARENT_DIR/hiveserver2.pid

# Reset environment
export JAVA_HOME=$JAVA11_HOME
export PATH=$OLD_PATH
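
The guard at the top of the new `start.hive` section refuses to continue unless both `JAVA8_HOME` and `JAVA11_HOME` are set and point to existing directories. The same check can be sketched in Python as follows; `missing_java_homes` is a hypothetical name used only for illustration.

```python
import os


def missing_java_homes(
    env: dict[str, str], names: tuple[str, ...] = ("JAVA8_HOME", "JAVA11_HOME")
) -> list[str]:
    """Return the variable names that are unset, empty, or not a directory."""
    return [n for n in names if not env.get(n) or not os.path.isdir(env[n])]
```

Calling `missing_java_homes(dict(os.environ))` and exiting with a non-zero status when the list is non-empty would mirror the `exit 1` branch of the shell script.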
40 changes: 17 additions & 23 deletions docker/utils/__init__.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2020 Dremio
#
@@ -18,7 +19,6 @@
import os
import shutil
import site
import stat
import sysconfig
import tarfile
from typing import Optional
@@ -36,14 +36,16 @@
_SPARK_FILENAME = None
_SPARK_URL = None

_HADOOP_VERSION = "2.10.1"
_HADOOP_VERSION = "2.10.2"
_HADOOP_FILENAME = f"hadoop-{_HADOOP_VERSION}"
_HADOOP_URL = f"https://archive.apache.org/dist/hadoop/common/hadoop-{_HADOOP_VERSION}/{_HADOOP_FILENAME}.tar.gz"

_FLINK_MAJOR_VERSION = "1.13"
_FLINK_MAJOR_VERSION = "1.17"

_ICEBERG_VERSION = "0.13.1"
_ICEBERG_FLINK_FILENAME = f"iceberg-flink-runtime-{_FLINK_MAJOR_VERSION}-{_ICEBERG_VERSION}.jar"
_ICEBERG_VERSION = "1.4.2"
_ICEBERG_FLINK_FILENAME = (
f"iceberg-flink-runtime-{_FLINK_MAJOR_VERSION}-{_ICEBERG_VERSION}.jar"
)
_ICEBERG_FLINK_URL = f"https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-{_FLINK_MAJOR_VERSION}/{_ICEBERG_VERSION}/{_ICEBERG_FLINK_FILENAME}"
_ICEBERG_HIVE_FILENAME = f"iceberg-hive-runtime-{_ICEBERG_VERSION}.jar"
_ICEBERG_HIVE_URL = f"https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-hive-runtime/{_ICEBERG_VERSION}/{_ICEBERG_HIVE_FILENAME}"
@@ -54,8 +56,12 @@
f"https://archive.apache.org/dist/hive/hive-{_HIVE_VERSION}/{_HIVE_FILENAME}.tar.gz"
)

_NESSIE_VERSION = "0.74.0"


def _link_file_into_dir(source_file: str, target_dir: str, replace_if_exists=True) -> None:
def _link_file_into_dir(
source_file: str, target_dir: str, replace_if_exists=True
) -> None:
assert os.path.isfile(source_file)
assert os.path.isdir(target_dir)

@@ -75,7 +81,7 @@ def _link_file_into_dir(source_file: str, target_dir: str, replace_if_exists=Tru
os.link(source_file, target_file)
assert os.path.isfile(target_file), (source_file, target_file)

action = 'replaced' if replaced else 'created'
action = "replaced" if replaced else "created"
print(f"Link target was {action}: {target_file} (source: {source_file})")


@@ -112,7 +118,9 @@ def _copy_all_hadoop_jars_to_pyflink() -> None:
pyflink_lib_dir = _find_pyflink_lib_dir()
for _jar_count, jar in enumerate(_jar_files()):
_link_file_into_dir(jar, pyflink_lib_dir)
print(f"Linked {_jar_count} HADOOP jar files into the pyflink lib dir at location {pyflink_lib_dir}")
print(
f"Linked {_jar_count} HADOOP jar files into the pyflink lib dir at location {pyflink_lib_dir}"
)


def _find_pyflink_lib_dir() -> Optional[str]:
@@ -139,16 +147,6 @@ def _download_file(filename: str, url: str) -> None:
f.write(r.content)


def fetch_nessie() -> str:
"""Download nessie executable."""
runner = "nessie-quarkus-runner"

url = _get_base_nessie_url()
_download_file(runner, url)
os.chmod(runner, os.stat(runner).st_mode | stat.S_IXUSR)
return runner


def fetch_nessie_jar() -> str:
"""Download nessie Jar in order to run the tests in Mac"""
runner = "nessie-quarkus-runner.jar"
@@ -159,12 +157,8 @@ def fetch_nessie_jar() -> str:


def _get_base_nessie_url() -> str:
import pynessie

version = pynessie.__version__

return "https://github.com/projectnessie/nessie/releases/download/nessie-{}/nessie-quarkus-{}-runner".format(
version, version
_NESSIE_VERSION, _NESSIE_VERSION
)
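
With this change the Nessie server download no longer follows `pynessie.__version__` (pinned to `0.65.0` in `requirements_base.txt`); it uses the separate `_NESSIE_VERSION` constant (`0.74.0`). A small standalone sketch of how the base URL is assembled from a pinned version is given here. The names `NESSIE_RELEASE_URL` and `nessie_runner_url` are hypothetical, and only the URL template itself is taken from the diff above.

```python
# Template copied from _get_base_nessie_url(); the names are illustrative only.
NESSIE_RELEASE_URL = (
    "https://github.com/projectnessie/nessie/releases/download/"
    "nessie-{version}/nessie-quarkus-{version}-runner"
)


def nessie_runner_url(version: str) -> str:
    """Build the base download URL for a given Nessie server version."""
    return NESSIE_RELEASE_URL.format(version=version)
```

Keeping the server version in its own constant means the Python client and the server can be upgraded independently, at the cost of having two version numbers to maintain.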

