
Running Spark Applications on Windows


Running Spark applications on Windows is, in general, no different from running them on other operating systems like Linux or macOS.

Note
A Spark application could be spark-shell or your own custom Spark application.

What makes the difference between the operating systems is Hadoop, which Spark uses internally for file system access.

You may run into a few minor issues on Windows due to the way Hadoop works with Windows' POSIX-incompatible NTFS filesystem.

Note
You do not have to install Apache Hadoop to work with Spark or run Spark applications.
Tip
Read the Apache Hadoop project’s Problems running Hadoop on Windows.

Among the issues is the infamous java.io.IOException when running Spark Shell (the exact stack trace depends on the Spark version, e.g. Spark 2.0.2 on Windows 10, so the line numbers may be different in your case).

Note

You need to have Administrator rights on your computer. All the following commands must be executed in a command-line window (cmd) run as Administrator, i.e. using the Run as administrator option while starting cmd.

Read the official Microsoft TechNet document, Start a Command Prompt as an Administrator.

Download winutils.exe binary from https://github.com/steveloughran/winutils repository.

Note
You should select the version of Hadoop the Spark distribution was compiled with, e.g. use the winutils.exe from hadoop-2.7.1 for Spark 2.

Save winutils.exe binary to a directory of your choice, e.g. c:\hadoop\bin.

Set HADOOP_HOME to reflect the directory with winutils.exe (without bin).

Set PATH environment variable to include %HADOOP_HOME%\bin as follows:
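Assuming winutils.exe was saved to c:\hadoop\bin as in the example above, both environment variables can be set in the current cmd session as follows:

```batch
rem HADOOP_HOME points at the directory that contains bin\winutils.exe
set HADOOP_HOME=c:\hadoop
rem Prepend %HADOOP_HOME%\bin so winutils.exe is found on PATH
set PATH=%HADOOP_HOME%\bin;%PATH%
```

Note that `set` only affects the current cmd session; to make the variables permanent, define them in Control Panel as the tip below describes.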

Tip
Define HADOOP_HOME and PATH environment variables in Control Panel so any Windows program can use them.

Create C:\tmp\hive directory.
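From cmd, the directory can be created with (assuming the default C: drive):

```batch
rem Create the Hive scratch directory used by Spark
mkdir C:\tmp\hive
```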

Note

The c:\tmp\hive directory corresponds to the default value of the hive.exec.scratchdir configuration property (in Hive 0.14.0 and later), and Spark uses a custom build of Hive 1.2.1.

You can change hive.exec.scratchdir configuration property to another directory as described in Changing hive.exec.scratchdir Configuration Property in this document.

Execute the following command in cmd that you started using the option Run as administrator.
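The command grants full permissions on the scratch directory using winutils.exe (which is on PATH after the earlier step):

```batch
rem Recursively grant rwx permissions to all users on the Hive scratch directory
winutils.exe chmod -R 777 C:\tmp\hive
```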

Check the permissions (that is one of the commands that are executed under the covers):
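The permissions can be listed with winutils.exe as well:

```batch
rem List the directory with POSIX-style permission flags
winutils.exe ls -F C:\tmp\hive
```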

Open spark-shell and observe the output (perhaps with a few WARN messages that you can simply disregard).

As a verification step, execute the following line to display the content of a DataFrame:
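A minimal verification, sketched here in Scala for spark-shell (the exact snippet may differ from the one in the original guide), creates a one-row Dataset and displays it using the spark session that spark-shell provides:

```scala
// spark is the SparkSession pre-created by spark-shell
spark.range(1).show
```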

Note

Disregard WARN messages when you start spark-shell. They are harmless.

If you see the above output, you are done. You should now be able to run Spark applications on Windows. Congrats!

Changing hive.exec.scratchdir Configuration Property

Create a hive-site.xml file with the following content:
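A minimal hive-site.xml that overrides the scratch directory could look like the following (the target path /tmp/mydir is just an example):

```xml
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/mydir</value>
    <description>Scratch space for Hive jobs</description>
  </property>
</configuration>
```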

Start a Spark application, e.g. spark-shell, with HADOOP_CONF_DIR environment variable set to the directory with hive-site.xml.
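For example, if hive-site.xml was saved to c:\conf (a hypothetical directory), spark-shell picks it up as follows:

```batch
rem Point Hadoop/Hive configuration lookup at the directory with hive-site.xml
set HADOOP_CONF_DIR=c:\conf
spark-shell
```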

Source: spark技术分享 » Running Spark Applications on Windows (reproduction requires permission).