layout | title |
---|---|
default |
Using DeepDive with Postgres-XL |
This document describes how to install and configure Postgres-XL to work with DeepDive. It also describes two caveats needed in writing queries when using XL, and some FAQs about using XL.
After installing XL, DeepDive should work well with it. Apart from the following caveat below, you should not observe any difference than if you were running PostgreSQL.
First, you should add a DISTRIBUTE BY HASH
clause in all CREATE TABLE
commands. Do
not use the column id
as the distribution key. Do not use a distribution
key that is not initially assigned.
Refer to the XL manual for more information.
Second, you should always create tables as CREATE UNLOGGED TABLE ...
. The unlogged
keyword turns off write-ahead logging (WAL), and gives a significant speedup
in writes (often 10X). WAL is not needed for DeepDive apps, because the
database is only used for processing, not for persisting data.
For more details on XL-specific SQL queries in a DeepDive example application, see this example.
We now describe how to install XL and configure it to be used with DeepDive. The steps were tested to install XL on Ubuntu 15.04.
We assume that the user executing these commands has sudo rights.
Set the following parameters in the /etc/sysctl.d/50-pgxl.conf
file:
sudo tee /etc/sysctl.d/50-pgxl.conf <<EOF
kernel.sem = 1000 32000 32 1000
# up to 400GB shared memory
kernel.shmmax = 429496729600
EOF
After making these changes, run
sudo sysctl --system
Finally, adjust ssh settings in /etc/ssh/ssh_config
and /etc/ssh/sshd_config
files:
sudo tee -a /etc/ssh/ssh_config <<EOF
StrictHostKeyChecking no
EOF
sudo tee -a /etc/ssh/sshd_config <<EOF
MaxStartups 100
EOF
After that, run
sudo service ssh restart
sudo apt-get update
sudo apt-get install -y screen curl git rsync openssl locales openssh-server openssh-client \
build-essential gcc flex bison make cmake jade openjade docbook docbook-dsssl \
zlib1g-dev libreadline6-dev python-dev libssl-dev
sudo localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8
Now set $TARGET_DIR
to the directory into which you would like to install XL.
TARGET_DIR=/opt/pgxl
sudo mkdir -p $TARGET_DIR
sudo chown $USER $TARGET_DIR
Then create a file /etc/ld.so.conf.d/pgxl.conf
for the $TARGET_DIR
:
sudo tee /etc/ld.so.conf.d/pgxl.conf <<EOF
$TARGET_DIR/lib
EOF
The remaining parts of the installation do not require sudo rights. However, make sure
the user account doing the installation has write permissions to $TARGET_DIR
.
Create a build directory and download the XL sources.
BUILD_DIR=~/pgxl_install
mkdir -p $BUILD_DIR
cd $BUILD_DIR
wget -O pgxl-v9.2.tar.gz http://sourceforge.net/projects/postgres-xl/files/Releases/Version_9.2rc/postgres-xl-v9.2-src.tar.gz/download
tar -xzf pgxl-v9.2.tar.gz
Now, you can build the sources.
cd $BUILD_DIR/postgres-xl/
./configure --with-python --with-openssl --prefix $TARGET_DIR
make -j 4
make install
Additionally modules can be added as follows.
for pkg in btree_gin btree_gist earthdistance fuzzystrmatch hstore intagg intarray oid2name \
pg_buffercache pgcrypto pgxc_clean pgxc_ctl pgxc_ddl pgxc_monitor stormstats \
tablefunc tsearch2 unaccent; do
cd $BUILD_DIR/postgres-xl/contrib/$pkg
make; make install
done
Now you need to generate ssh keys for localhost
. Run:
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cd ~/.ssh
if [ ! -f id_rsa.pub ]; then
ssh-keygen -t rsa -N "" -f id_rsa
fi
cat id_rsa.pub >> authorized_keys
chmod 600 authorized_keys
Then you should be able to ssh
into localhost
without password, and you can
move on.
Now set $DATA_DIR
to the directory, where the database files will be stored.
Be sure that you have write permission to this directory.
DATA_DIR=$TARGET_DIR/data
mkdir -p $DATA_DIR
In order to run XL, we still need to create a file with configuration settings. Run
mkdir -p $TARGET_DIR/conf
Now, assuming that $USER
is set to the account under which pgxl should execute, the
following settings should provide a good starting point.
USER_DB_PORT=5432
MAX_USER_CONNECTIONS=100
DATA_NODE_SHARED_BUFFERS="2000MB"
DATA_NODE_WORK_MEM="128MB"
DATA_NODE_MAINTENANCE_MEM="128MB"
DATA_NODE_WAL_BUFFERS="16MB"
DATA_NODE_CHECKPOINT_SEGMENTS="256"
Now create a file $TARGET_DIR/conf/pgxc_ctl.conf
as follows:
cat <<EOT > $TARGET_DIR/conf/pgxc_ctl.conf
pgxcOwner=$USER
pgxcUser=\$pgxcOwner
tmpDir=/tmp
localTmpDir=\$tmpDir
configBackup=n
pgxcInstallDir=$TARGET_DIR
gtmName=gtm
gtmMasterServer=localhost
gtmMasterPort=20001
gtmMasterDir=$DATA_DIR/gtm
gtmSlave=n
gtmProxy=n
coordMasterDirs=($DATA_DIR/coord)
coordNames=(coord)
coordMasterServers=(localhost)
coordPorts=($USER_DB_PORT)
poolerPorts=(20002)
coordMaxWALSenders=(5)
coordSlave=n
datanodeMasterDirs=($DATA_DIR/data1 $DATA_DIR/data2)
datanodeNames=(data1 data2)
datanodeMasterServers=(localhost localhost)
datanodePorts=(3001 3002)
datanodePoolerPorts=(4001 4002)
datanodeMaxWALSenders=(5 5)
datanodeSpecificExtraConfig=(none none)
datanodeSpecificExtraPgHba=(none none)
datanodeSlave=n
coordExtraConfig=coordExtraConfig
cat > \$coordExtraConfig <<EOF
#================================================
# Added to all the coordinator postgresql.conf
log_destination = 'stderr'
logging_collector = on
log_directory = 'logs'
listen_addresses = '*'
log_filename = 'coordinator.log'
max_connections = $MAX_USER_CONNECTIONS
shared_buffers = $DATA_NODE_SHARED_BUFFERS
checkpoint_segments = $DATA_NODE_CHECKPOINT_SEGMENTS
work_mem = $DATA_NODE_WORK_MEM
maintenance_work_mem = $DATA_NODE_MAINTENANCE_MEM
wal_buffers = $DATA_NODE_WAL_BUFFERS
EOF
datanodeExtraConfig=datanodeExtraConfig
cat > \$datanodeExtraConfig <<EOF
#================================================
# Added to all the datanode postgresql.conf
log_destination = 'stderr'
logging_collector = on
log_directory = 'logs'
log_filename = 'datanode.log'
max_connections = $MAX_USER_CONNECTIONS
shared_buffers = $DATA_NODE_SHARED_BUFFERS
checkpoint_segments = $DATA_NODE_CHECKPOINT_SEGMENTS
work_mem = $DATA_NODE_WORK_MEM
maintenance_work_mem = $DATA_NODE_MAINTENANCE_MEM
wal_buffers = $DATA_NODE_WAL_BUFFERS
EOF
EOT
Above configuration creates a cluster with two data nodes, one coordinator, and one manager.
Configure the necessary paths into your ~/.bashrc
, so they're effective via ssh:
( echo 1i # insert following lines at the beginning of bashrc
cat <<EOF
export PATH=$TARGET_DIR/bin:\$PATH
export PGXC_CTL_HOME=$TARGET_DIR/conf
EOF
echo .; echo wq ) | ed ~/.bashrc
Run source ~/.bashrc
.
Execute the following command to launch the database server
pgxc_ctl "init all"
Run the following to complete the installation of extensions:
psql -p $USER_DB_PORT template1 -c "create extension plpythonu"
psql -p $USER_DB_PORT template1 -c "create extension intarray"
psql -p $USER_DB_PORT template1 -c "create extension hstore"
Follow these commands and you should get similar output.
$ psql postgres
psql (8.2.15)
Type “help” for help.
postgres=#
postgres=# \l
List of databases
Name | Owner | Encoding | Access privileges
-----------+-------+----------+-------------------
postgres | Xxx | UTF8 |
template0 | Xxx | UTF8 | =c/Xxx
: Xxx=CTc/Xxx
template1 | Xxx | UTF8 | =c/Xxx
: Xxx=CTc/Xxx
(3 rows)
postgres=# \q
Use pgxc_ctl "stop all"
and pgxc_ctl "start all"
to stop / start the XL server at any time.
-
How do I enable fuzzy string match / install "contrib" module in XL?
To enable *fuzzystringmatch* or another *contrib* module available for XL, see above build instructions on how to build XL with extensions.
-
Can I create a cluster with N nodes?
You certainly can. Note that for clusters larger than 16 nodes you may need to adjust certain configuration parameters, especially buffer sizes (so that you don't run out of memory) and kernel settings (so that the system can open enough SSH sessions).