Ingest a Single Table from Microsoft SQL Server Data into Hadoop

Introduction

This blog describes a best-practice approach to ingesting data from SQL Server into Hadoop. The scenario covered is as follows:

  • Single table ingestion (no joins)
  • No partitioning
  • Complete data ingestion (trash old and replace new)
  • Data stored in Parquet format

Pre-requisites

This example has been tested using the following versions:

  • Hadoop 2.5.0-cdh5.3.0
  • Hive 0.13.1-cdh5.3.0
  • Sqoop 1.4.5-cdh5.3.0
  • Oozie client build version: 4.0.0-cdh5.3.0
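
If you want to confirm the versions in your own environment before starting, the standard CLI commands can be used (assuming the CDH client tools are on the PATH); a quick sketch:

$ hadoop version | head -1
$ hive --version
$ sqoop version
$ oozie version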

Process Flow Diagram

[Process flow diagram]

Configuration

  • Create the following directory/file structure (one per data ingestion process). For a new ingestion program, adjust the directory/file names as required. Make sure to replace the <table_name> tag with your table name.

<table_name>_ingest
  + hive-<table_name>
      create-schema.hql
  + oozie-properties
      <table_name>.properties
  + oozie-<table_name>-ingest
      + lib
          kite-data-core.jar
          kite-data-mapreduce.jar
          sqljdbc4.jar
      coordinator.xml
      impala_metadata.sh
      workflow.xml
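
As a concrete illustration, the following is a minimal sketch that creates this structure locally for a hypothetical table called customer; the names are only an example and should follow your own table name:

$ mkdir -p customer_ingest/hive-customer
$ mkdir -p customer_ingest/oozie-properties
$ mkdir -p customer_ingest/oozie-customer-ingest/lib
$ touch customer_ingest/hive-customer/create-schema.hql
$ touch customer_ingest/oozie-properties/customer.properties
$ touch customer_ingest/oozie-customer-ingest/coordinator.xml
$ touch customer_ingest/oozie-customer-ingest/workflow.xml
$ touch customer_ingest/oozie-customer-ingest/impala_metadata.sh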
  • The ingestion process is invoked using an Oozie workflow. The workflow invokes all steps necessary for data ingestion, including pre-processing, ingestion using Sqoop, and post-processing.

oozie-<table_name>-ingest
This directory stores all files that are required by the Oozie workflow engine. These files should be stored in HDFS for Oozie to function properly.

oozie-properties
This directory stores the <table_name>.properties file. This file holds the Oozie variables, such as database users and name node details, used by the Oozie process at runtime.

hive-<table_name>
This directory stores a file called create-schema.hql, which contains the schema definition of the Hive tables. This file needs to be run in Hive only once.
  • Configure the files under oozie-<table_name>-ingest
1. Download the kite-data-core.jar and kite-data-mapreduce.jar files from http://mvnrepository.com/artifact/org.kitesdk
2. Download sqljdbc4.jar from https://msdn.microsoft.com/en-us/sqlserver/aa937724.aspx
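Once downloaded, the three jars go into the workflow's lib directory, for example (hypothetical customer table again, with whichever jar versions you actually downloaded):

$ cp kite-data-core-*.jar kite-data-mapreduce-*.jar sqljdbc4.jar customer_ingest/oozie-customer-ingest/lib/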
3. Configure coordinator.xml. Copy and paste the following XML.

<coordinator-app name="<table_name>-ingest-coordinator" frequency="${freq}" start="${startTime}" end="${endTime}" timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <app-path>${workflowRoot}/workflow.xml</app-path>
            <configuration>
                <property>
                    <name>partition_name</name>
                    <value>${coord:formatTime(coord:nominalTime(), 'YYYY-MM-dd')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
4. Configure workflow.xml. This workflow has three actions:

a) mv-data-to-old – deletes the old data before loading the new extract
b) sqoop-ingest-<table_name> – Sqoop action that fetches the table from SQL Server
c) invalidate-impala-metadata – refreshes the Impala metadata after each load
Copy and paste the following XML.
<workflow-app name="<table_name>-ingest" xmlns="uri:oozie:workflow:0.2">
    <start to="mv-data-to-old"/>

    <action name="mv-data-to-old">
        <fs>
            <delete path='${sqoop_directory}/<table_name>/*.parquet' />
            <delete path='${sqoop_directory}/<table_name>/.metadata' />
        </fs>
        <ok to="sqoop-ingest-<table_name>"/>
        <error to="kill"/>
    </action>

    <action name="sqoop-ingest-<table_name>">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>${db_string}</arg>
            <arg>--table</arg>
            <arg>${db_table}</arg>
            <arg>--columns</arg>
            <arg>${db_columns}</arg>
            <arg>--username</arg>
            <arg>${db_username}</arg>
            <arg>--password</arg>
            <arg>${db_password}</arg>
            <arg>--split-by</arg>
            <arg>${db_table_pk}</arg>
            <arg>--target-dir</arg>
            <arg>${sqoop_directory}/<table_name></arg>
            <arg>--as-parquetfile</arg>
            <arg>--compress</arg>
            <arg>--compression-codec</arg>
            <arg>org.apache.hadoop.io.compress.SnappyCodec</arg>
        </sqoop>
        <ok to="invalidate-impala-metadata"/>
        <error to="kill"/>
    </action>

    <action name="invalidate-impala-metadata">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>${impalaFileName}</exec>
            <file>${impalaFilePath}</file>
        </shell>
        <ok to="end"/>
        <error to="kill"/>
    </action>

    <kill name="kill">
        <message>Workflow failed with error message ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>

    <end name="end"/>
</workflow-app>

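The Sqoop action above is equivalent to the following standalone command, which can be handy for testing the connection string and credentials before wiring everything into Oozie (a sketch; substitute your own values for the placeholders):

$ sqoop import \
    --connect "jdbc:sqlserver://<sql_server_host>;databaseName=<database_name>" \
    --username <db_username> --password <db_password> \
    --table <table_name> --columns <db_columns> \
    --split-by <db_table_pk> \
    --target-dir /data/sqoop/<table_name> \
    --as-parquetfile \
    --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec
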
5. Configure impala_metadata.sh. This script refreshes the Impala metadata after each load. Copy and paste the following.

#!/bin/bash
export PYTHON_EGG_CACHE=./myeggs
impala-shell -i <impala_daemon_host> -q "invalidate metadata <hive_db_name>.<table_name>"
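
The script can be tested by hand on a node where impala-shell is available before handing it over to Oozie, for example:

$ chmod +x impala_metadata.sh
$ ./impala_metadata.sh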
  • Configure the files under oozie-properties. Create the file <table_name>.properties with the contents below. Edit the parameters as per your requirements.
# Coordinator scheduling (frequency is in minutes; 480 = every 8 hours)
freq=480
startTime=2015-04-28T14:00Z
endTime=2029-03-05T06:00Z
jobTracker=<jobtracker_host>:<port>
nameNode=hdfs://<namenode_host>:<port>
queueName=<queue_name>
rootDir=${nameNode}/user/<user_name>/oozie
workflowRoot=${rootDir}/<table_name>-ingest
oozie.use.system.libpath=true
oozie.coord.application.path=${workflowRoot}/coordinator.xml
# Sqoop settings
sqoop_directory=${nameNode}/data/sqoop
# Hive/Impala settings
hive_db_name=<hive_db_name>
impalaFileName=impala_metadata.sh
impalaFilePath=/user/oozie/<table_name>-ingest/impala_metadata.sh
# MS SQL Server settings
db_string=jdbc:sqlserver://<sql_server_host>;databaseName=<database_name>
db_username=<db_username>
db_password=<db_password>
db_table=<table_name>
db_columns=<comma_separated_column_list>
# split-by column referenced by the workflow as ${db_table_pk}
db_table_pk=<primary_key_column>
  • Configure the files under hive-<table_name>. Create a new file create-schema.hql with the contents below.
DROP TABLE IF EXISTS <table_name>;
CREATE EXTERNAL TABLE <table_name> (<column_definitions>)
STORED AS PARQUET
LOCATION 'hdfs:///data/sqoop/<table_name>';
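
If you prefer the command line over the Beeswax UI described below, the schema file can be run once against Hive directly, for instance (hypothetical database name):

$ hive --database <hive_db_name> -f create-schema.hql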

Deployment

  • Create a new directory in HDFS and copy the files:
$ hadoop fs -mkdir /user/<user_name>/oozie/<table_name>-ingest
$ hadoop fs -copyFromLocal <local_path>/<table_name>_ingest/oozie-<table_name>-ingest/lib /user/<user_name>/oozie/<table_name>-ingest
$ hadoop fs -copyFromLocal <local_path>/<table_name>_ingest/oozie-<table_name>-ingest/coordinator.xml /user/<user_name>/oozie/<table_name>-ingest
$ hadoop fs -copyFromLocal <local_path>/<table_name>_ingest/oozie-<table_name>-ingest/impala_metadata.sh /user/<user_name>/oozie/<table_name>-ingest
$ hadoop fs -copyFromLocal <local_path>/<table_name>_ingest/oozie-<table_name>-ingest/workflow.xml /user/<user_name>/oozie/<table_name>-ingest
  • Create new directories in HDFS for storing the data files:
$ hadoop fs -mkdir /user/SA.HadoopPipeline/oozie/<table_name>-ingest
$ hadoop fs -mkdir /data/sqoop/<table_name>
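A quick listing confirms the workflow files and lib directory landed where Oozie expects them, for example:

$ hadoop fs -ls /user/<user_name>/oozie/<table_name>-ingest
$ hadoop fs -ls /user/<user_name>/oozie/<table_name>-ingest/lib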
  • Now we are ready to create the table in Hive. Go to http://<hue_host>:8888/beeswax/#query.
a. Choose an existing database on the left, or create a new one.
b. Paste the contents of create-schema.hql into the query window and click Execute.
c. You should now have an external table in Hive pointing to the data in hdfs:///data/sqoop/<table_name>
  • Create the Oozie job:
$ oozie job -run -config /home/<user_name>/<table_name>_ingest/oozie-properties/<table_name>.properties
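The submit command prints a coordinator job id; its status (and that of the workflows it spawns) can also be checked from the command line rather than Hue, for example (assuming the Oozie server listens on the default port 11000, or with OOZIE_URL exported instead of the -oozie flag):

$ oozie job -oozie http://<oozie_server>:11000/oozie -info <coordinator_job_id>
$ oozie jobs -oozie http://<oozie_server>:11000/oozie -jobtype coordinator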

Validation and Error Handling

  • At this point an Oozie coordinator job should have been created. To validate, open http://<hue_host>:8888/oozie/list_oozie_coordinators. The expected output is shown below. In case of an error, review the logs for the recent runs.
[Screenshot: Oozie coordinators list]
  • To validate that the Oozie workflow is running, open http://<hue_host>:8888/oozie/list_oozie_workflows/. The expected output is shown below. In case of an error, review the logs for the recent runs.
[Screenshot: Oozie workflows list]
  • To validate the data in HDFS, execute the following command. You should see a file with a *.metadata extension and a number of files with a *.parquet extension.
$ hadoop fs -ls /data/sqoop/<table_name>/
  • Now we are ready to select data in Hive or Impala.
    For Hive, go to http://<hue_host>:8888/beeswax/#query
    For Impala, go to http://<hue_host>:8888/impala
    Choose the newly created database on the left and execute the following SQL: select * from <table_name> limit 10
    You should see the data from the newly ingested table.
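
The same check can be scripted from the shell with impala-shell, mirroring the metadata refresh script above (hypothetical host and names):

$ impala-shell -i <impala_daemon_host> -q "invalidate metadata <hive_db_name>.<table_name>"
$ impala-shell -i <impala_daemon_host> -q "select * from <hive_db_name>.<table_name> limit 10"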