In the world of Big Data, Apache Sqoop is the standard tool for importing data from a relational database into HDFS.
sqoop import --connect jdbc:mysql://localhost/mysql --username training -P --warehouse-dir /home/training/db --table user
17/02/23 10:38:19 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password:
17/02/23 10:38:24 INFO manager.SqlManager: Using default fetchSize of 1000
17/02/23 10:38:24 INFO tool.CodeGenTool: Beginning code generation
17/02/23 10:38:24 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `user` AS t LIMIT 1
17/02/23 10:38:24 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `user` AS t LIMIT 1
17/02/23 10:38:24 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/7f3a9709c50f58c2c6bb24de91922c6b/user.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/02/23 10:38:29 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/7f3a9709c50f58c2c6bb24de91922c6b/user.jar
17/02/23 10:38:29 WARN manager.MySQLManager: It looks like you are importing from mysql.
17/02/23 10:38:29 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
17/02/23 10:38:29 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
17/02/23 10:38:29 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
17/02/23 10:38:29 WARN manager.CatalogQueryManager: The table user contains a multi-column primary key. Sqoop will default to the column Host only for this job.
17/02/23 10:38:29 WARN manager.CatalogQueryManager: The table user contains a multi-column primary key. Sqoop will default to the column Host only for this job.
17/02/23 10:38:29 INFO mapreduce.ImportJobBase: Beginning import of user
17/02/23 10:38:31 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
17/02/23 10:38:33 INFO db.DBInputFormat: Using read commited transaction isolation
17/02/23 10:38:33 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`Host`), MAX(`Host`) FROM `user`
17/02/23 10:38:33 WARN db.TextSplitter: Generating splits for a textual index column.
17/02/23 10:38:33 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
17/02/23 10:38:33 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
17/02/23 10:38:33 INFO mapred.JobClient: Running job: job_201702221239_0003
17/02/23 10:38:34 INFO mapred.JobClient: map 0% reduce 0%
17/02/23 10:38:55 INFO mapred.JobClient: map 17% reduce 0%
17/02/23 10:38:56 INFO mapred.JobClient: map 33% reduce 0%
17/02/23 10:39:08 INFO mapred.JobClient: map 50% reduce 0%
17/02/23 10:39:09 INFO mapred.JobClient: map 67% reduce 0%
17/02/23 10:39:21 INFO mapred.JobClient: map 83% reduce 0%
17/02/23 10:39:22 INFO mapred.JobClient: map 100% reduce 0%
17/02/23 10:39:26 INFO mapred.JobClient: Job complete: job_201702221239_0003
17/02/23 10:39:26 INFO mapred.JobClient: Counters: 23
17/02/23 10:39:26 INFO mapred.JobClient:   File System Counters
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of bytes read=0
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of bytes written=1778658
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of read operations=0
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of large read operations=0
17/02/23 10:39:26 INFO mapred.JobClient:     FILE: Number of write operations=0
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of bytes read=791
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of bytes written=818
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of read operations=6
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of large read operations=0
17/02/23 10:39:26 INFO mapred.JobClient:     HDFS: Number of write operations=6
17/02/23 10:39:26 INFO mapred.JobClient:   Job Counters
17/02/23 10:39:26 INFO mapred.JobClient:     Launched map tasks=6
17/02/23 10:39:26 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=89702
17/02/23 10:39:26 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
17/02/23 10:39:26 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
17/02/23 10:39:26 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
17/02/23 10:39:26 INFO mapred.JobClient:   Map-Reduce Framework
17/02/23 10:39:26 INFO mapred.JobClient:     Map input records=8
17/02/23 10:39:26 INFO mapred.JobClient:     Map output records=8
17/02/23 10:39:26 INFO mapred.JobClient:     Input split bytes=791
17/02/23 10:39:26 INFO mapred.JobClient:     Spilled Records=0
17/02/23 10:39:26 INFO mapred.JobClient:     CPU time spent (ms)=5490
17/02/23 10:39:26 INFO mapred.JobClient:     Physical memory (bytes) snapshot=666267648
17/02/23 10:39:26 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4423995392
17/02/23 10:39:26 INFO mapred.JobClient:     Total committed heap usage (bytes)=191102976
17/02/23 10:39:26 INFO mapreduce.ImportJobBase: Transferred 818 bytes in 56.7255 seconds (14.4203 bytes/sec)
17/02/23 10:39:26 INFO mapreduce.ImportJobBase: Retrieved 8 records.
The example above imports the user table from the mysql database into HDFS.
First, --connect specifies the JDBC connection string of the source database. --username is the authentication username, and -P asks for the password at a prompt. --warehouse-dir sets the HDFS parent directory under which the table's destination directory is created, and --table selects the table to import.
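Note the TextSplitter warnings in the log: the user table has a multi-column primary key, so Sqoop falls back to splitting on the textual Host column, which can produce a partial import or duplicates under case-insensitive collation. A common remedy is to run a single mapper, or to name an integral split column explicitly with --split-by. A sketch, assuming the same training credentials and local MySQL instance as above:

```shell
# Run the whole import in one mapper, so no split column is needed
# (-m is short for --num-mappers):
sqoop import --connect jdbc:mysql://localhost/mysql \
  --username training -P \
  --warehouse-dir /home/training/db --table user \
  -m 1
```

With a single mapper you lose parallelism, so for large tables prefer --split-by with a numeric, evenly distributed column instead.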
The imported content can then be displayed with hdfs dfs -cat:
hdfs dfs -cat /home/training/db/user/part-m-0000*
127.0.0.1,root,,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
localhost,root,,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
localhost.localdomain,root,,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
localhost,,,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,,,,0,0,0,0
localhost.localdomain,,,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,,,,0,0,0,0
localhost,training,*27CF0BD18BDADD517165824F8C1FFF667B47D04B,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,N,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
localhost,hiveuser,*2470C0C06DEE42FD1618BB99005ADCA2EC9D1E19,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,,,,0,0,0,0
localhost,hue,*15221DE9A04689C4D312DEAC3B87DDF542AF439E,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,,,,0,0,0,0
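Each output row is a plain comma-separated record (Sqoop's default text format), whose first field is the split column Host and second field is User, so ordinary text tools work on the result. A quick sketch, using one record copied verbatim from the output above:

```shell
# One imported row (the training account) taken from the part files above
record='localhost,training,*27CF0BD18BDADD517165824F8C1FFF667B47D04B,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,N,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0'

# Split on commas; field 1 is Host, field 2 is User in mysql.user
echo "$record" | awk -F',' '{print $1, $2}'
# prints: localhost training
```

The same idea scales to the real files, e.g. hdfs dfs -cat piped into awk to list which accounts exist on which hosts.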