Setting Up Hadoop and Spark Clusters with Docker


1. Preface

Hadoop is an ecosystem for distributed management, storage, and computation. The core of the Hadoop framework is HDFS and MapReduce: HDFS (Hadoop Distributed File System) provides storage for massive amounts of data, while MapReduce provides the computation over that data.
Spark is a fast, general-purpose compute engine designed for large-scale data processing. It is an open-source parallel framework in the spirit of Hadoop MapReduce and shares MapReduce's strengths, but unlike MapReduce it can keep intermediate job results in memory instead of writing them to and reading them back from HDFS. This makes Spark better suited to iterative workloads such as data mining and machine learning.

2. Installing Docker and Docker Compose

Refer to the earlier articles for installing Docker and Docker Compose.

3. Network

Put the Hadoop cluster and the Spark cluster on the same Docker network so that Spark can reach HDFS in the Hadoop cluster and save its computation results to files on HDFS.

  # Create a Docker network named anron
  docker network create --subnet 172.20.0.0/16 anron

If the following error is reported, the address range overlaps with an existing Docker network; change 172.20.0.0/16 to a different subnet and try again:

Error response from daemon: Pool overlaps with other one on this address space
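
To see which subnets are already in use before picking a new one, you can list and inspect the existing Docker networks. These are standard Docker CLI commands; the only name taken from this article is the anron network created above.

  # List existing Docker networks
  docker network ls
  # Show the subnet of a given network (e.g. the default bridge network)
  docker network inspect -f '{{range .IPAM.Config}}{{.Subnet}}{{end}}' bridge
  # Verify the anron network after creating it
  docker network inspect anron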

4. Hadoop Cluster

4.1 Cluster composition

The Hadoop cluster consists of:

  • namenode            1 node
  • datanode            2 nodes (datanode1, datanode2)
  • resourcemanager     1 node
  • nodemanager         1 node
  • historyserver       1 node

namenode, datanode1, and datanode2 are defined in the hadoop-1.yml file.

resourcemanager, nodemanager, and historyserver are defined in the hadoop-2.yml file.

Put the three files hadoop.env, hadoop-1.yml, and hadoop-2.yml in the same directory on the host.
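
For reference, the working directory on the host should end up looking roughly like this (file names as used throughout this article):

  $ ls
  hadoop-1.yml  hadoop-2.yml  hadoop.env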

4.2 The hadoop.env file

  CORE_CONF_fs_defaultFS=hdfs://namenode:9000
  CORE_CONF_hadoop_http_staticuser_user=root
  CORE_CONF_hadoop_proxyuser_hue_hosts=*
  CORE_CONF_hadoop_proxyuser_hue_groups=*
  CORE_CONF_io_compression_codecs=org.apache.hadoop.io.compress.SnappyCodec
  HDFS_CONF_dfs_webhdfs_enabled=true
  HDFS_CONF_dfs_permissions_enabled=false
  HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false
  YARN_CONF_yarn_log___aggregation___enable=true
  YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
  YARN_CONF_yarn_resourcemanager_recovery_enabled=true
  YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
  YARN_CONF_yarn_resourcemanager_scheduler_class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
  YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___mb=8192
  YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___vcores=4
  YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
  YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
  YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
  YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
  YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
  YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031
  YARN_CONF_yarn_timeline___service_enabled=true
  YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
  YARN_CONF_yarn_timeline___service_hostname=historyserver
  YARN_CONF_mapreduce_map_output_compress=true
  YARN_CONF_mapred_map_output_compress_codec=org.apache.hadoop.io.compress.SnappyCodec
  YARN_CONF_yarn_nodemanager_resource_memory___mb=16384
  YARN_CONF_yarn_nodemanager_resource_cpu___vcores=8
  YARN_CONF_yarn_nodemanager_disk___health___checker_max___disk___utilization___per___disk___percentage=98.5
  YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
  YARN_CONF_yarn_nodemanager_aux___services=mapreduce_shuffle
  MAPRED_CONF_mapreduce_framework_name=yarn
  MAPRED_CONF_mapred_child_java_opts=-Xmx4096m
  MAPRED_CONF_mapreduce_map_memory_mb=4096
  MAPRED_CONF_mapreduce_reduce_memory_mb=8192
  MAPRED_CONF_mapreduce_map_java_opts=-Xmx3072m
  MAPRED_CONF_mapreduce_reduce_java_opts=-Xmx6144m
  MAPRED_CONF_yarn_app_mapreduce_am_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/
  MAPRED_CONF_mapreduce_map_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/
  MAPRED_CONF_mapreduce_reduce_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/
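
These variables follow the naming convention of the bde2020 Hadoop images: the prefix (CORE_CONF_, HDFS_CONF_, YARN_CONF_, MAPRED_CONF_) selects the target config file (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml), single underscores in the rest of the name become dots, and triple underscores become dashes, so CORE_CONF_fs_defaultFS ends up as the fs.defaultFS property. Assuming the images keep their usual config location (HADOOP_CONF_DIR=/etc/hadoop), you can check the rendered files once the containers are up:

  # Check that the env vars were rendered into the Hadoop XML config
  # (path assumes the bde2020 images' default HADOOP_CONF_DIR=/etc/hadoop)
  docker exec -it namenode grep -A1 "fs.defaultFS" /etc/hadoop/core-site.xml
  docker exec -it namenode grep -A1 "yarn.resourcemanager.hostname" /etc/hadoop/yarn-site.xml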

4.3 The hadoop-1.yml file

  version: '3'

  networks:
    anron:
      external: true

  volumes:
    hadoop_namenode:
    hadoop_datanode1:
    hadoop_datanode2:
    hadoop_historyserver:

  services:
    namenode:
      container_name: namenode
      image: bde2020/hadoop-namenode
      ports:
        - 9000:9000
        - 9870:9870
      volumes:
        - hadoop_namenode:/hadoop/dfs/name
      environment:
        - CLUSTER_NAME=test
      env_file:
        - ./hadoop.env
      networks:
        - anron

    datanode1:
      container_name: datanode1
      image: bde2020/hadoop-datanode
      depends_on:
        - namenode
      volumes:
        - hadoop_datanode1:/hadoop/dfs/data
      environment:
        SERVICE_PRECONDITION: "namenode:9870"
      env_file:
        - ./hadoop.env
      networks:
        - anron

    datanode2:
      container_name: datanode2
      image: bde2020/hadoop-datanode
      depends_on:
        - namenode
      volumes:
        - hadoop_datanode2:/hadoop/dfs/data
      environment:
        SERVICE_PRECONDITION: "namenode:9870"
      env_file:
        - ./hadoop.env
      networks:
        - anron

4.4 The hadoop-2.yml file

  version: '3'

  networks:
    anron:
      external: true

  volumes:
    hadoop_namenode:
    hadoop_datanode1:
    hadoop_datanode2:
    hadoop_historyserver:

  services:
    historyserver:
      container_name: historyserver
      image: bde2020/hadoop-historyserver
      ports:
        - 8188:8188
      volumes:
        - hadoop_historyserver:/hadoop/yarn/timeline
      environment:
        SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode1:9864 datanode2:9864 resourcemanager:8088"
      env_file:
        - ./hadoop.env
      networks:
        - anron

    nodemanager:
      container_name: nodemanager
      image: bde2020/hadoop-nodemanager
      environment:
        SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode1:9864 datanode2:9864 resourcemanager:8088"
      env_file:
        - ./hadoop.env
      networks:
        - anron

    resourcemanager:
      container_name: resourcemanager
      image: bde2020/hadoop-resourcemanager
      ports:
        - 8088:8088
      environment:
        SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode1:9864 datanode2:9864"
      env_file:
        - ./hadoop.env
      networks:
        - anron

4.5 Starting the Hadoop cluster

Start hadoop-1.yml first:

docker-compose -f hadoop-1.yml up

Wait until the namenode container log (or the WebUI) shows the following message, indicating that HDFS safe mode has been turned off:

 Safe mode is OFF
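
Rather than scrolling through the compose output, you can follow the namenode container's log directly. This only assumes the container name namenode from hadoop-1.yml and the standard "Safe mode is OFF" wording in the NameNode log:

  # Follow the namenode log and stop once safe mode is reported off
  docker logs -f namenode 2>&1 | grep -m1 "Safe mode is OFF"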

Then start hadoop-2.yml:

docker-compose -f hadoop-2.yml up
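
Once both compose files are up, all six containers defined above should be running. A quick way to confirm this (container names come from the container_name entries in the yml files):

  # All six Hadoop containers should be listed as Up
  docker ps --format '{{.Names}}: {{.Status}}' | grep -E 'namenode|datanode|resourcemanager|nodemanager|historyserver'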

Note: resourcemanager needs to create the /rmstate directory on HDFS when it starts, but in safe mode HDFS is read-only, so resourcemanager cannot start. The namenode leaves safe mode automatically about 30 seconds after starting, which is why the compose definition is split into two files here: start hadoop-1 first, then hadoop-2.

Of course, you can also enter or leave safe mode manually:

  # Check safe mode status
  docker exec -it namenode hdfs dfsadmin -safemode get
  # Enter safe mode
  docker exec -it namenode hdfs dfsadmin -safemode enter
  # Leave safe mode
  docker exec -it namenode hdfs dfsadmin -safemode leave

4.6 Viewing the WebUI

View the HDFS file system in the NameNode UI (port 9870, as mapped in hadoop-1.yml).

View the ResourceManager UI (port 8088, as mapped in hadoop-2.yml).
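
From the host, both UIs can be checked quickly with curl (localhost assumes you are on the Docker host; otherwise use the host's IP address):

  # HDFS NameNode WebUI
  curl -I http://localhost:9870
  # YARN ResourceManager WebUI
  curl -I http://localhost:8088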

4.7 Running the wordcount example

  # Enter the namenode container
  docker exec -it namenode bash
  # Inside the container, create a directory with 2 files
  mkdir /input
  echo "Hello World" > /input/f1.txt
  echo "Hello Docker" > /input/f2.txt
  # Create an input directory on HDFS (absolute path: /user/root/input)
  hdfs dfs -mkdir -p input
  # Copy all files from the container's /input directory into the HDFS input
  # directory; this fails if the HDFS input directory does not exist
  hdfs dfs -put /input/* input
  # Run the WordCount program inside the container; it takes 2 arguments, the
  # HDFS input directory and the HDFS output directory (copy
  # hadoop-mapreduce-examples-2.7.1-sources.jar from the host into the container first)
  hadoop jar hadoop-mapreduce-examples-2.7.1-sources.jar org.apache.hadoop.examples.WordCount input output
  # Print the result, which is saved under the HDFS output directory
  hdfs dfs -cat output/part-r-00000
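
If you would rather not copy a jar in from the host, the Hadoop distribution inside the image usually ships its own examples jar. The path below is an assumption based on the HADOOP_MAPRED_HOME value (/opt/hadoop-3.2.1/) set in hadoop.env and the standard Hadoop directory layout:

  # Run the same job with the examples jar bundled in the image
  # (path is an assumption; adjust to your Hadoop version if needed)
  hadoop jar /opt/hadoop-3.2.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount input output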

5. Spark Cluster

5.1 hadoop-3.yml

  version: '3'

  networks:
    anron:
      external: true

  services:
    spark-master:
      container_name: spark-master
      image: bde2020/spark-master
      environment:
        - INIT_DAEMON_STEP=setup_spark
        - constraint:node==master
      ports:
        - 8080:8080
        - 7077:7077
      networks:
        - anron

    spark-worker-1:
      container_name: spark-worker-1
      image: bde2020/spark-worker
      depends_on:
        - spark-master
      environment:
        - SPARK_MASTER=spark://spark-master:7077
        - constraint:node==worker1
      ports:
        - 8081:8081
      networks:
        - anron

    spark-worker-2:
      container_name: spark-worker-2
      image: bde2020/spark-worker
      depends_on:
        - spark-master
      environment:
        - SPARK_MASTER=spark://spark-master:7077
        - constraint:node==worker2
      ports:
        - 8082:8081
      networks:
        - anron

5.2 Starting the Spark cluster

Start hadoop-3.yml:

docker-compose -f hadoop-3.yml up

5.3 Viewing the WebUI
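
The Spark master UI is exposed on port 8080 and the two workers on ports 8081 and 8082, as mapped in hadoop-3.yml. From the host:

  # Spark master WebUI
  curl -I http://localhost:8080
  # Spark worker WebUIs
  curl -I http://localhost:8081
  curl -I http://localhost:8082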

5.4 Running the wordcount example

  # Enter the spark-worker container
  docker exec -it spark-worker-1 bash
  # Start spark-shell
  /spark/bin/spark-shell --master spark://spark-master:7077

  // Read files from HDFS, compute, and write the result back to HDFS
  val textFile = sc.textFile("hdfs://namenode:9000/user/root/input")
  val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
  // The argument is a directory, not a file
  wordCounts.saveAsTextFile("hdfs://namenode:9000/user/root/out1")

  // Read files from HDFS, compute, and print the result to the console
  val textFile = sc.textFile("hdfs://namenode:9000/user/root/input")
  val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
  wordCounts.collect

  // Read a local file, compute, and print the result to the console
  // 1. With spark-shell --master spark://spark-master:7077, every machine in the
  //    Spark cluster must have the /input directory, otherwise a file-not-found error is raised
  // 2. With spark-shell --master local, only the local machine needs the /input directory
  val textFile = sc.textFile("file:///input")
  val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
  wordCounts.collect

Note: make sure the /user/root/input directory and its files already exist on HDFS (they were created in section 4.7).
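
After saveAsTextFile finishes, the result can be inspected back on the Hadoop side; part-* is the standard naming Spark uses for the files it writes into the output directory:

  # List and print the Spark job's output on HDFS
  docker exec -it namenode hdfs dfs -ls /user/root/out1
  docker exec -it namenode hdfs dfs -cat "/user/root/out1/part-*"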