
WordCount using Spark shell

As with any introductory big data example, the goal is to demonstrate how to count words in a distributed fashion.

In the following example you are going to count the words in the README.md file that ships with your Spark distribution and save the result under the README.count directory.

You’re going to use the Spark shell for the example. Execute spark-shell.

  1. Read the text file – refer to Using Input and Output (I/O).

  2. Split each line into words and flatten the result.

  3. Map each word into a pair and count them by word (key).

  4. Save the result into text files – one per partition.
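The four steps above can be sketched in the Spark shell as follows — a minimal sketch, assuming README.md sits in the current working directory and `sc` is the SparkContext that spark-shell creates for you:

```scala
// 1. Read the text file (sc is the SparkContext provided by spark-shell)
val lines = sc.textFile("README.md")

// 2. Split each line into words (on whitespace) and flatten the result
val words = lines.flatMap(_.split("\\s+"))

// 3. Map each word to a (word, 1) pair and sum the counts by key
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

// 4. Save the result as text files -- one part file per partition
counts.saveAsTextFile("README.count")
```

The whitespace regex `\\s+` is one reasonable choice for splitting; a different tokenization (e.g. stripping punctuation) would give different counts.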

After you have executed the example, inspect the contents of the README.count directory.

The part-0000x files contain the (word, count) pairs, one per line.

Further (self-)development

Please read the questions and try to answer them yourself before looking at the reference given below.

  1. Why are there two files under the directory?

  2. How could you have only one?

  3. How to filter out words by name?

  4. How to count words?

Please refer to the chapter Partitions to find some of the answers.
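For question 2, one possible approach is to reduce the RDD to a single partition before saving. A hedged sketch (the output directory name README.count1 is hypothetical):

```scala
// Coalesce to a single partition so saveAsTextFile writes one part file
sc.textFile("README.md")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .coalesce(1)
  .saveAsTextFile("README.count1")  // hypothetical directory name
```

A filter transformation inserted before the counting (e.g. `.filter(_ != "the")`, where the word to drop is illustrative) is one way to approach question 3 in the same style.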
