IDE: Scala IDE for Eclipse
Scala version: 2.10.4
Spark: 1.1.1
Contents of test.txt:
hello world
hello word
world word hello
1. Create a new Scala project
2. Code
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("word Count").setMaster("local")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile("test.txt")
    // test.txt is space-separated, so split on " "
    val mapRdd = textFile.flatMap(line => line.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
    mapRdd.collect().foreach(println)
  }
}
3. Run

Run spark-shell from Spark's bin directory; once Spark has started, execute the code above in the shell.
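Inside spark-shell a SparkContext named sc already exists, so rather than constructing a new one you can run a minimal sketch like the following (assuming test.txt sits in the directory the shell was launched from):

val textFile = sc.textFile("test.txt")
val mapRdd = textFile.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
mapRdd.collect().foreach(println)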
Output:

(hello,3)
(word,2)
(world,2)

Code analysis:
1. The line import org.apache.spark.SparkContext._ brings Spark's implicit conversions into scope. In Spark 1.x it is this import that converts an RDD[(String, Int)] into a PairRDDFunctions, which is where reduceByKey is defined; without it the compiler reports: value reduceByKey is not a member of org.apache.spark.rdd.RDD[(String, Int)]
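A minimal sketch of what the import saves you, assuming a Spark 1.1 spark-shell where sc is already defined; wrapping the pair RDD by hand achieves the same thing the implicit conversion does:

import org.apache.spark.rdd.PairRDDFunctions

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
// without import SparkContext._, wrap manually to get reduceByKey:
val withFns = new PairRDDFunctions(pairs)
withFns.reduceByKey(_ + _).collect()   // Array((a,2), (b,1))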
2. The difference between map and flatMap

Compare three snippets:

textFile.map(_.split(" ")).collect().foreach(println)
textFile.map(_.split(" ")).collect().foreach(x => println(x.mkString(",")))
textFile.flatMap(_.split(" ")).collect().foreach(println)

Their outputs, in order, are:

[Ljava.lang.String;@736caf7a
[Ljava.lang.String;@4ce7fffa
[Ljava.lang.String;@497486b3

hello,world
hello,word
world,word,hello

hello
world
hello
word
world
word
hello

Snippets 1 and 2 show that the result of map is:
Array(Array("hello","world"), Array("hello","word"), Array("world","word","hello"))
while the result of flatMap is:
Array("hello","world","hello","word","world","word","hello")
That is, flatMap is map followed by flattening the nested results; the plain-Scala sketch below makes the same point without Spark.
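The distinction is exactly the same on ordinary Scala collections, so you can try it in the REPL with no Spark at all:

val lines = List("hello world", "hello word", "world word hello")
lines.map(_.split(" ").toList)
// List(List(hello, world), List(hello, word), List(world, word, hello))
lines.flatMap(_.split(" "))
// List(hello, world, hello, word, world, word, hello)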
What kind of value is [Ljava.lang.String;@736caf7a? It is the JVM's default toString for a String array: the class name [Ljava.lang.String; followed by the object's hash code. You can reproduce it in the Scala REPL:

val a = Array("hello")
println(a)   // prints something like [Ljava.lang.String;@cc4a0dd
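To print the array's contents instead, format it with mkString (or deep, available on Scala 2.10); a quick sketch:

val a = Array("hello", "world")
println(a.mkString(","))   // hello,world
println(a.deep)            // Array(hello, world)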
3. The output at each stage of the program

flatMap --> map --> reduceByKey

Array("hello","world","hello","word","world","word","hello") -->
Array[(String, Int)](("hello",1),("world",1),("hello",1),("word",1),("world",1),("word",1),("hello",1)) -->
Array[(String, Int)](("hello",3),("world",2),("word",2))
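You can check each stage from spark-shell by collecting the intermediate RDDs; a sketch assuming the textFile defined earlier:

val words = textFile.flatMap(_.split(" "))
words.collect()    // Array(hello, world, hello, word, world, word, hello)
val pairs = words.map(x => (x, 1))
pairs.collect()    // Array((hello,1), (world,1), (hello,1), (word,1), (world,1), (word,1), (hello,1))
val counts = pairs.reduceByKey(_ + _)
counts.collect()   // Array((hello,3), (world,2), (word,2)) -- order may vary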
4. Finding the word with the highest count

flatMap --> map --> reduceByKey --> reduce

val maxNum = mapRdd.reduce((a, b) => if (a._2 > b._2) a else b)
println(maxNum)

This prints (hello,3).
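An equivalent sketch uses top with an explicit Ordering on the count; top returns the largest elements under that ordering:

// take the single pair with the largest count
mapRdd.top(1)(Ordering.by((p: (String, Int)) => p._2)).foreach(println)   // (hello,3)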