Friday, February 19, 2016

Notes on some Hadoop-related commands

Running a command directly as another account on the same machine

su hdfs -c '~/hdfs/sbin/stop-dfs.sh'

Using Hadoop Streaming

bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
    -input /usr/ctfan/input01 \
    -output /usr/ctfan/output/folder \
    -mapper mapper.py \
    -reducer reducer.py \
    -file ~/ctfan/pyStreaming/mapper.py \
    -file ~/ctfan/pyStreaming/reducer.py
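
If the job is launched repeatedly with different paths, the command above can be assembled in Python. This is only a minimal sketch: the helper name `build_streaming_command` and its parameters are hypothetical, and it uses nothing beyond the options already shown above (`-input`, `-output`, `-mapper`, `-reducer`, `-file`).

```python
import os


def build_streaming_command(input_path, output_path, script_dir,
                            jar="share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar"):
    """Return the argument list for the streaming job shown above.

    Hypothetical helper: it only re-assembles the options that appear in
    the command above and does not add any new ones.
    """
    mapper = os.path.join(script_dir, "mapper.py")
    reducer = os.path.join(script_dir, "reducer.py")
    return [
        "bin/hadoop", "jar", jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-file", mapper,
        "-file", reducer,
    ]


if __name__ == "__main__":
    cmd = build_streaming_command(
        "/usr/ctfan/input01",
        "/usr/ctfan/output/folder",
        os.path.expanduser("~/ctfan/pyStreaming"),
    )
    # Print the command instead of running it. To launch the job, the list
    # could be passed to subprocess.call(cmd) from the Hadoop install directory.
    print(" ".join(cmd))
```

Building a list rather than one long string keeps each path as a separate argument, so no shell quoting is needed.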

mapper.py

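#!/usr/bin/env python
# Streaming mapper: emit one "<word><TAB>1" line for every word read from stdin.
import sys

for line in sys.stdin:
    # split() with no argument splits on any whitespace and ignores blank lines.
    for word in line.split():
        # print() with a single argument works on both Python 2 and Python 3.
        print('%s\t%s' % (word, 1))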

reducer.py

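#!/usr/bin/env python
# Streaming reducer: sum the counts of each word.
# It relies on the input being sorted by key, so equal words are adjacent.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    # Each input line is "<word><TAB><count>", as written by mapper.py.
    word, sep, count = line.strip().partition('\t')
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is missing or not a number.
        continue

    if current_word == word:
        current_count += count
    else:
        # A new word starts: flush the total of the previous one.
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Flush the last word; nothing is printed when the input was empty.
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))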
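
Because the reducer receives its input sorted by key, the manual bookkeeping of the current word can also be expressed with `itertools.groupby`. The following is a sketch of that alternative, not a replacement required by Hadoop; the function names `parse` and `reduce_counts` are made up for this example.

```python
from itertools import groupby
from operator import itemgetter


def parse(lines):
    """Yield (word, count) pairs from "<word><TAB><count>" lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.strip().partition('\t')
        if not sep:
            continue  # no tab separator on this line
        try:
            yield word, int(count)
        except ValueError:
            continue  # count is not a number


def reduce_counts(lines):
    """Sum the counts of consecutive lines that share the same word."""
    for word, group in groupby(parse(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


if __name__ == '__main__':
    # Demo on already-sorted sample lines. To use this as the streaming
    # reducer, iterate over reduce_counts(sys.stdin) instead.
    sample = ['a\t1', 'a\t1', 'b\t1']
    for word, total in reduce_counts(sample):
        print('%s\t%s' % (word, total))
```

`groupby` only merges adjacent equal keys, so this version depends on sorted input just like the loop-based reducer does.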

Testing the streaming scripts locally on a single machine

echo 'a b c a b' | ./mapper.py | sort -k1,1 | ./reducer.py
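
The shell pipeline above imitates what Hadoop does between the map and reduce steps: `sort -k1,1` stands in for the shuffle that brings equal keys together. As a rough illustration, the same three stages can be simulated in one Python process. The function names below (`map_words`, `reduce_sorted`, `run_local`) are hypothetical and exist only for this sketch.

```python
def map_words(lines):
    """Mimic mapper.py: emit one (word, 1) pair per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_sorted(pairs):
    """Mimic reducer.py: sum the counts of adjacent pairs with the same word."""
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, count
    if current_word is not None:
        yield current_word, current_count


def run_local(lines):
    """Stand-in for `mapper | sort -k1,1 | reducer` on in-memory lines."""
    shuffled = sorted(map_words(lines), key=lambda pair: pair[0])
    return list(reduce_sorted(shuffled))


if __name__ == '__main__':
    for word, total in run_local(['a b c a b']):
        print('%s\t%s' % (word, total))
```

For the input `a b c a b` this prints `a 2`, `b 2` and `c 1` (tab-separated), which is the same result the shell pipeline should produce.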

References

Official Hadoop documentation
Michael G. Noll's Blog