Running a MapReduce job with Python (via Hadoop Streaming) consists of the following steps.
1 Configure mapred-site.xml
First, obtain the classpath string with the command below:
hadoop classpath
Open the configuration file:
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
Paste the string obtained above in place of each <value> placeholder, then add the following properties to mapred-site.xml:
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=classpath string from above</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=classpath string from above</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=classpath string from above</value>
</property>
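The three properties differ only in their <name>, so if you script your cluster setup, a short helper can render all of them from the `hadoop classpath` output. This is purely an illustrative sketch, not part of Hadoop; `mapred_env_properties` is a made-up name:

```python
# Hypothetical helper (not part of Hadoop): render the three
# mapred-site.xml properties from the string printed by `hadoop classpath`.
PROPERTY_NAMES = [
    "yarn.app.mapreduce.am.env",
    "mapreduce.map.env",
    "mapreduce.reduce.env",
]

def mapred_env_properties(classpath):
    blocks = []
    for name in PROPERTY_NAMES:
        blocks.append(
            "<property>\n"
            "<name>%s</name>\n"
            "<value>HADOOP_MAPRED_HOME=%s</value>\n"
            "</property>" % (name, classpath)
        )
    return "\n".join(blocks)

# Example with a placeholder classpath string:
print(mapred_env_properties("/usr/hadoop/etc/hadoop"))
```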
2 Prepare the input file
word.txt
Hello world
Hi world
I love world
Hello Ming
Upload the file to HDFS:
hdfs dfs -mkdir /test-py-mapreduce
hdfs dfs -put ./word.txt /test-py-mapreduce/word.txt
3 Write the Python MapReduce scripts
map.py
import sys

# Mapper: emit "<word> 1" for every word read from stdin.
for line in sys.stdin:
    for word in line.strip().split():
        print("%s 1" % word)
reduce.py
import sys

# Reducer: sum the counts emitted by the mapper for each word.
result = {}
for line in sys.stdin:
    k, v = line.strip().split(' ')
    result[k] = result.get(k, 0) + int(v)

for k, v in result.items():
    print("%s\t%s" % (k, v))
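Before submitting to the cluster, it helps to sanity-check the logic locally. The sketch below simulates the map → shuffle → reduce flow of the two scripts in plain Python over the sample input; the explicit sort stands in for Hadoop's shuffle phase:

```python
# Local sanity check: simulate map -> shuffle -> reduce in plain Python.
from collections import Counter

lines = ["Hello world", "Hi world", "I love world", "Hello Ming"]

# map phase: emit (word, 1) pairs, like map.py
pairs = [(word, 1) for line in lines for word in line.split()]

# shuffle: Hadoop sorts the pairs by key between map and reduce
pairs.sort(key=lambda kv: kv[0])

# reduce phase: sum the counts per word, like reduce.py
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(counts["world"])   # 3
print(counts["Hello"])   # 2
```

You can exercise the real scripts the same way from a shell: `cat word.txt | python map.py | sort | python reduce.py`.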
4 Run the job
# The streaming jar ships with Hadoop at
# $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-xxx.jar
STREAM_JAR_PATH="/usr/hadoop/share/hadoop/tools/lib/hadoop-streaming-xxx.jar"
# Input directory on HDFS
INPUT_FILE_PATH_1="/test-py-mapreduce/"
# Output directory on HDFS; it must not exist when the job starts
# (remove a stale one with: hdfs dfs -rm -r /test-output)
OUTPUT_PATH="/test-output"
# Submit the job
hadoop jar $STREAM_JAR_PATH \
-input $INPUT_FILE_PATH_1 \
-output $OUTPUT_PATH \
-mapper "python map.py" \
-reducer "python reduce.py" \
-file ./map.py \
-file ./reduce.py
Fetch and inspect the results:
hdfs dfs -get /test-output/part-00000 part-00000
cat part-00000
love 1
I 1
Ming 1
Hi 1
world 3
Hello 2