In this post we will use Flume to dump Apache web server logs into HDFS. We already have a web server running and Flume installed, so all that remains is to configure a target (collector) agent and a source agent.
We use the following configuration file for the target agent:
## TARGET AGENT ##
## configuration file location: /etc/flume-ng/conf
## START Agent: flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent.conf -n collector

#http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = 0.0.0.0
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2

## Channels ##
## Source writes to 2 channels, one for each sink
collector.channels = mc1 mc2

#http://flume.apache.org/FlumeUserGuide.html#memory-channel
collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 100

collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100

## Sinks ##
collector.sinks = LocalOut HadoopOut

## Write copy to Local Filesystem
#http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /var/log/flume-ng
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1

## Write to HDFS
#http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.channel = mc2
collector.sinks.HadoopOut.hdfs.path = /user/training/flume/events/%{log_type}/%y%m%d
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600
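The HDFS sink path above is built from event headers: %{log_type} is filled from a header that the source agent will set, and %y%m%d is formatted from the event's timestamp header. A rough sketch of how that expansion works (plain Python, purely illustrative; the real sink formats the date in the collector's local timezone, here we use UTC):

```python
from datetime import datetime, timezone

# Event headers as the source agent's interceptors would set them
# (illustrative values; "timestamp" is epoch milliseconds)
headers = {"log_type": "apache_access_combined",
           "timestamp": "1490635725898"}

ts = datetime.fromtimestamp(int(headers["timestamp"]) / 1000, tz=timezone.utc)
path = "/user/training/flume/events/%{log_type}/%y%m%d"
path = path.replace("%{log_type}", headers["log_type"])
path = path.replace("%y%m%d", ts.strftime("%y%m%d"))
print(path)  # → /user/training/flume/events/apache_access_combined/170327
```

This is why the timestamp interceptor on the source side matters: without a timestamp header, the %y%m%d escapes cannot be resolved.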
and the following for the source agent:
## SOURCE AGENT ##
## Local installation: /home/ec2-user/apache-flume
## configuration file location: /home/ec2-user/apache-flume/conf
## bin file location: /home/ec2-user/apache-flume/bin
## START Agent: bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent

# http://flume.apache.org/FlumeUserGuide.html#exec-source
source_agent.sources = apache_server
source_agent.sources.apache_server.type = exec
source_agent.sources.apache_server.command = tail -f /var/log/httpd/access_log
source_agent.sources.apache_server.batchSize = 1
source_agent.sources.apache_server.channels = memoryChannel
source_agent.sources.apache_server.interceptors = itime ihost itype

# http://flume.apache.org/FlumeUserGuide.html#timestamp-interceptor
source_agent.sources.apache_server.interceptors.itime.type = timestamp

# http://flume.apache.org/FlumeUserGuide.html#host-interceptor
source_agent.sources.apache_server.interceptors.ihost.type = host
source_agent.sources.apache_server.interceptors.ihost.useIP = false
source_agent.sources.apache_server.interceptors.ihost.hostHeader = host

# http://flume.apache.org/FlumeUserGuide.html#static-interceptor
source_agent.sources.apache_server.interceptors.itype.type = static
source_agent.sources.apache_server.interceptors.itype.key = log_type
source_agent.sources.apache_server.interceptors.itype.value = apache_access_combined

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
source_agent.channels = memoryChannel
source_agent.channels.memoryChannel.type = memory
source_agent.channels.memoryChannel.capacity = 100

## Send to Flume Collector on Hadoop Node
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
source_agent.sinks = avro_sink
source_agent.sinks.avro_sink.type = avro
source_agent.sinks.avro_sink.channel = memoryChannel
source_agent.sinks.avro_sink.hostname = 192.168.46.169
source_agent.sinks.avro_sink.port = 4545
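The three interceptors decorate every event before it leaves the source agent: itime adds an epoch-millisecond timestamp, ihost adds the hostname (useIP = false), and itype adds the static log_type header that the collector's HDFS sink uses in its path. A quick sketch of the resulting headers (plain Python, illustrative only; not Flume's actual implementation):

```python
import socket
import time

def apply_interceptors(event):
    """Mimic the source agent's interceptor chain on a single event."""
    event["headers"]["timestamp"] = str(int(time.time() * 1000))  # timestamp interceptor
    event["headers"]["host"] = socket.gethostname()               # host interceptor (useIP = false)
    event["headers"]["log_type"] = "apache_access_combined"       # static interceptor
    return event

# One tailed access_log line becomes one event body
event = {"headers": {},
         "body": '192.168.46.169 - - [...] "GET / HTTP/1.1" 200 401'}
apply_interceptors(event)
print(sorted(event["headers"]))  # headers travel with the event to the collector
```

The body itself is never touched; only headers are added, which is what lets the collector route and name files without parsing the log lines.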
We start the target agent:
flume-ng agent -c conf -f flume-trg-agent.conf -n collector
And then we start the source agent:
flume-ng agent -c conf -f flume-src-agent.conf -n source_agent
After some time we can see the Apache logs being written to HDFS:
hdfs dfs -cat /user/training/flume/events/apache_access_combined/170327/FlumeData.1490635725898

192.168.46.169 - - [27/Mar/2017:10:28:42 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:28:57 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:29:07 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:29:08 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:29:10 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
192.168.46.169 - - [27/Mar/2017:10:30:48 -0700] "GET / HTTP/1.1" 200 401 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:48 -0700] "GET /favicon.ico HTTP/1.1" 404 289 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:48 -0700] "GET /favicon.ico HTTP/1.1" 404 289 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:51 -0700] "GET /first.html HTTP/1.1" 200 291 "http://192.168.46.169/" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:52 -0700] "GET /index.html HTTP/1.1" 200 401 "http://192.168.46.169/first.html" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:30:53 -0700] "GET /second.html HTTP/1.1" 200 293 "http://192.168.46.169/index.html" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131029 Firefox/17.0"
192.168.46.169 - - [27/Mar/2017:10:32:01 -0700] "GET / HTTP/1.1" 200 401 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
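Because we used fileType = DataStream with writeFormat = Text, the events land in HDFS as plain Apache combined-format log lines, ready for downstream processing. A minimal parsing sketch (the regex and field names are our own, not part of Flume):

```python
import re

# Apache "combined" log format, one event per line as written by the HDFS sink
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"')

line = ('192.168.46.169 - - [27/Mar/2017:10:30:51 -0700] "GET /first.html HTTP/1.1" '
        '200 291 "http://192.168.46.169/" "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) '
        'Gecko/20131029 Firefox/17.0"')

m = COMBINED.match(line)
print(m.group("path"), m.group("status"))  # → /first.html 200
```

The same pattern could be applied to the files under /user/training/flume/events in a MapReduce, Hive, or Spark job.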