Useful Developer Tools

Reducing Build Times

SBT: Avoiding Re-Creating the Assembly JAR

Spark’s default build strategy is to assemble a jar including all of its dependencies. This can be cumbersome when doing iterative development. When developing locally, it is possible to create an assembly jar including all of Spark’s dependencies and then re-package only Spark itself when making changes.

$ build/sbt clean package
$ ./bin/spark-shell
$ export SPARK_PREPEND_CLASSES=true
$ ./bin/spark-shell # Now it's using compiled classes
# ... do some local development ... #
$ build/sbt compile
# ... do some local development ... #
$ build/sbt compile
$ unset SPARK_PREPEND_CLASSES
$ ./bin/spark-shell
 
# You can also use ~ to let sbt do incremental builds on file changes without running a new sbt session every time
$ build/sbt ~compile

Maven: Speeding up Compilation with Zinc

Zinc is a long-running server version of SBT’s incremental compiler. When run locally as a background process, it speeds up builds of Scala-based projects like Spark. Developers who regularly recompile Spark with Maven will be the most interested in Zinc. The project site gives instructions for building and running zinc; OS X users can install it using brew install zinc.

If you use the build/mvn script, zinc will automatically be downloaded and leveraged for all builds. This process will auto-start after the first time build/mvn is called and bind to port 3030 unless the ZINC_PORT environment variable is set. The zinc process can subsequently be shut down at any time by running build/zinc-<version>/bin/zinc -shutdown and will automatically restart whenever build/mvn is called.
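
A typical session might look like the following (a sketch based on the behavior described above; the <version> placeholder matches whatever zinc directory build/mvn created):

$ # Pin zinc to a non-default port (it binds to 3030 when ZINC_PORT is unset)
$ ZINC_PORT=3035 build/mvn -DskipTests package
$ # Shut the zinc server down once you are finished
$ build/zinc-<version>/bin/zinc -shutdown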

Building submodules individually

For instance, you can build the Spark Core module using:

$ # sbt
$ build/sbt
> project core
> package

$ # or you can build the spark-core module with sbt directly using:
$ build/sbt core/package

$ # Maven
$ build/mvn package -DskipTests -pl core

Running Individual Tests

When developing locally, it’s often convenient to run a single test or a few tests, rather than running the entire test suite.

Testing with SBT

The fastest way to run individual tests is to use the sbt console. Keep an sbt console open, and use it to re-run tests as necessary. For example, to run all of the tests in a particular project, e.g., core:

$ build/sbt
> project core
> test

You can run a single test suite using the testOnly command. For example, to run the DAGSchedulerSuite:

> testOnly org.apache.spark.scheduler.DAGSchedulerSuite

The testOnly command accepts wildcards; e.g., you can also run the DAGSchedulerSuite with:

> testOnly *DAGSchedulerSuite

Or you could run all of the tests in the scheduler package:

> testOnly org.apache.spark.scheduler.*

If you’d like to run just a single test in the DAGSchedulerSuite, e.g., a test that includes “SPARK-12345” in the name, you run the following command in the sbt console:

> testOnly *DAGSchedulerSuite -- -z "SPARK-12345"

If you’d prefer, you can run all of these commands on the command line (but this will be slower than running tests using an open console). To do this, you need to surround testOnly and the following arguments in quotes:

$ build/sbt "core/testOnly *DAGSchedulerSuite -- -z SPARK-12345"

For more about how to run individual tests with sbt, see the sbt documentation.

Testing with Maven

With Maven, you can use the -DwildcardSuites flag to run individual Scala tests:

build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.scheduler.DAGSchedulerSuite test

You need -Dtest=none to avoid running the Java tests. For more information about the ScalaTest Maven Plugin, refer to the ScalaTest documentation.

To run individual Java tests, you can use the -Dtest flag:

build/mvn -DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite test
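
If your version of the Maven Surefire plugin supports it, you can narrow this further to a single test method using the ClassName#methodName syntax (the method name below is only a hypothetical placeholder):

build/mvn -DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite#someTestMethod test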

ScalaTest Issues

If the following error occurs when running ScalaTest

An internal error occurred during: "Launching XYZSuite.scala".
java.lang.NullPointerException

It is due to an incorrect Scala library in the classpath. To fix it:

  • Right click on project
  • Select Build Path | Configure Build Path
  • Add Library | Scala Library
  • Remove scala-library-2.10.4.jar - lib_managed\jars

If you see “Could not find resource path for Web UI: org/apache/spark/ui/static”, it’s due to a classpath issue (some classes were probably not compiled). To fix this, it is sufficient to run a test from the command line:

build/sbt "test-only org.apache.spark.rdd.SortingSuite"

Running Different Test Permutations on Jenkins

When running tests for a pull request on Jenkins, you can add special phrases to the title of your pull request to change testing behavior. This includes:

  • [test-maven] - signals to test the pull request using maven
  • [test-hadoop2.7] - signals to test using Spark’s Hadoop 2.7 profile
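
For example, a pull request with a (hypothetical) title such as “[SPARK-12345][CORE][test-maven] Fix something” would be built and tested with Maven instead of sbt.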

Binary compatibility

To ensure binary compatibility, Spark uses MiMa.

Ensuring binary compatibility

When working on an issue, it’s always a good idea to check that your changes do not introduce binary incompatibilities before opening a pull request.

You can do so by running the following command:

$ dev/mima

A binary incompatibility reported by MiMa might look like the following:

[error] method this(org.apache.spark.sql.Dataset)Unit in class org.apache.spark.SomeClass does not have a correspondent in current version
[error] filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")

If you open a pull request containing binary incompatibilities anyway, Jenkins will remind you by failing the test build with the following message:

Test build #xx has finished for PR yy at commit ffffff.

  This patch fails MiMa tests.
  This patch merges cleanly.
  This patch adds no public classes.

Solving a binary incompatibility

If you believe that your binary incompatibilities are justified or that MiMa reported false positives (e.g. the reported binary incompatibilities are about a non-user facing API), you can filter them out by adding an exclusion in project/MimaExcludes.scala containing what was suggested by the MiMa report and a comment containing the JIRA number of the issue you’re working on as well as its title.

For the problem described above, we might add the following:

// [SPARK-zz][CORE] Fix an issue
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.SomeClass.this")

Otherwise, you will have to resolve those incompatibilities before opening or updating your pull request. Usually, the problems reported by MiMa are self-explanatory and revolve around missing members (methods or fields) that you will have to add back in order to maintain binary compatibility.

Checking Out Pull Requests

Git provides a mechanism for fetching remote pull requests into your own local repository. This is useful when reviewing code or testing patches locally. If you haven’t yet cloned the Spark Git repository, use the following command:

$ git clone https://github.com/apache/spark.git
$ cd spark

To enable this feature you’ll need to configure the git remote repository to fetch pull request data. Do this by modifying the .git/config file inside of your Spark directory. Note that the remote may be named something other than “origin” if you chose a different name when cloning:

[remote "origin"]
  url = git@github.com:apache/spark.git
  fetch = +refs/heads/*:refs/remotes/origin/*
  fetch = +refs/pull/*/head:refs/remotes/origin/pr/*   # Add this line

Once you’ve done this, you can fetch remote pull requests:

# Fetch remote pull requests
$ git fetch origin
# Checkout a remote pull request
$ git checkout origin/pr/112
# Create a local branch from a remote pull request
$ git checkout origin/pr/112 -b new-branch
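
If you only need to look at a single pull request occasionally, you can also fetch it directly from GitHub without editing .git/config; this is just a convenient alternative to the setup above:

# Fetch pull request 112 into a new local branch using GitHub's refs/pull/<id>/head refs
$ git fetch origin pull/112/head:pr-112
$ git checkout pr-112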

Generating Dependency Graphs

$ # sbt
$ build/sbt dependency-tree
 
$ # Maven
$ build/mvn -DskipTests install
$ build/mvn dependency:tree
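
To trace where a particular dependency comes from, the Maven dependency plugin can also filter the tree; Guava here is just an example artifact:

$ # Show only the paths that pull in Guava
$ build/mvn dependency:tree -Dincludes=com.google.guava:guava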

Organizing Imports

You can use an IntelliJ Imports Organizer from Aaron Davidson to help you organize the imports in your code. It can be configured to match the import ordering from the style guide.

IDE Setup

IntelliJ

While many of the Spark developers use SBT or Maven on the command line, the most common IDE we use is IntelliJ IDEA. You can get the community edition for free (Apache committers can get free IntelliJ Ultimate Edition licenses) and install the JetBrains Scala plugin from Preferences > Plugins.

To create a Spark project for IntelliJ:

  • Download IntelliJ and install the Scala plug-in for IntelliJ.
  • Go to File -> Import Project, locate the spark source directory, and select “Maven Project”.
  • In the Import wizard, it’s fine to leave settings at their default. However it is usually useful to enable “Import Maven projects automatically”, since changes to the project structure will automatically update the IntelliJ project.
  • As documented in Building Spark, some build configurations require specific profiles to be enabled. The same profiles that are enabled with -P[profile name] above may be enabled on the Profiles screen in the Import wizard. For example, if developing for Hadoop 2.7 with YARN support, enable profiles yarn and hadoop-2.7. These selections can be changed later by accessing the “Maven Projects” tool window from the View menu, and expanding the Profiles section.
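
For reference, a command-line build with the same profiles enabled might look like this (a sketch; substitute the profiles that match your configuration):

$ build/mvn -Pyarn -Phadoop-2.7 -DskipTests package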

Other tips:

  • “Rebuild Project” can fail the first time the project is compiled, because generated source files are not created automatically. Try clicking the “Generate Sources and Update Folders For All Projects” button in the “Maven Projects” tool window to manually generate these sources.
  • Some of the modules have pluggable source directories based on Maven profiles (i.e. to support both Scala 2.11 and 2.10 or to allow cross building against different versions of Hive). In some cases IntelliJ does not correctly detect use of the maven-build-plugin to add source directories. In these cases, you may need to add source locations explicitly to compile the entire project. If so, open the “Project Settings” and select “Modules”. Based on your selected Maven profiles, you may need to add source folders to the following modules:
    • spark-hive: add v0.13.1/src/main/scala
    • spark-streaming-flume-sink: add target\scala-2.11\src_managed\main\compiled_avro
    • spark-catalyst: add target\scala-2.11\src_managed\main
  • Compilation may fail with an error like “scalac: bad option: -P:/home/jakub/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar”. If so, go to Preferences > Build, Execution, Deployment > Scala Compiler and clear the “Additional compiler options” field. It will then work, although the option will come back when the project reimports. If you try to build any of the projects using quasiquotes (e.g., sql) then you will need to make that jar a compiler plugin (just below “Additional compiler options”). Otherwise you will see errors like:
    /Users/irashid/github/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
    Error:(147, 9) value q is not a member of StringContext
     Note: implicit class Evaluate2 is not applicable here because it comes after the application point and it lacks an explicit result type
          q"""
          ^ 
    

Eclipse

Eclipse can be used to develop and test Spark. The following configuration is known to work:

The easiest way is to download the Scala IDE bundle from the Scala IDE download page. It comes pre-installed with ScalaTest. Alternatively, use the Scala IDE update site or Eclipse Marketplace.

SBT can create Eclipse .project and .classpath files. To create these files for each Spark sub project, use this command:

build/sbt eclipse

To import a specific project, e.g. spark-core, select File | Import | Existing Projects into Workspace. Do not select “Copy projects into workspace”.

If you want to develop on Scala 2.10 you need to configure a Scala installation for the exact Scala version that’s used to compile Spark. Since Scala IDE bundles the latest versions (2.10.5 and 2.11.8 at this point), you need to add one in Eclipse Preferences -> Scala -> Installations by pointing to the lib/ directory of your Scala 2.10.5 distribution. Once this is done, select all Spark projects and right-click, choose Scala -> Set Scala Installation and point to the 2.10.5 installation. This should clear all errors about invalid cross-compiled libraries. A clean build should succeed now.

ScalaTest can execute unit tests by right clicking a source file and selecting Run As | Scala Test.

If Java memory errors occur, it might be necessary to increase the settings in eclipse.ini in the Eclipse install directory. Increase the following setting as needed:

--launcher.XXMaxPermSize
256M

Nightly Builds

Packages are built regularly off of Spark’s master branch and release branches. These provide Spark developers access to the bleeding-edge of Spark master or the most recent fixes not yet incorporated into a maintenance release. These should only be used by Spark developers, as they may have bugs and have not undergone the same level of testing as releases. Spark nightly packages are available at:

Spark also publishes SNAPSHOT releases of its Maven artifacts for both master and maintenance branches on a nightly basis. To link to a SNAPSHOT you need to add the ASF snapshot repository to your build. Note that SNAPSHOT artifacts are ephemeral and may change or be removed. To use these you must add the ASF snapshot repository at https://repository.apache.org/snapshots/.

groupId: org.apache.spark
artifactId: spark-core_2.10
version: 1.5.0-SNAPSHOT

Profiling Spark Applications Using YourKit

Here are instructions on profiling Spark applications using YourKit Java Profiler.

On Spark EC2 images

  • After logging into the master node, download the YourKit Java Profiler for Linux from the YourKit downloads page. This file is pretty big (~100 MB) and the YourKit download site is somewhat slow, so you may consider mirroring this file or including it on a custom AMI.
  • Unzip this file somewhere (in /root in our case): unzip YourKit-JavaProfiler-2017.02-b66.zip
  • Copy the expanded YourKit files to each node using copy-dir: ~/spark-ec2/copy-dir /root/YourKit-JavaProfiler-2017.02
  • Configure the Spark JVMs to use the YourKit profiling agent by editing ~/spark/conf/spark-env.sh and adding the lines
    SPARK_DAEMON_JAVA_OPTS+=" -agentpath:/root/YourKit-JavaProfiler-2017.02/bin/linux-x86-64/libyjpagent.so=sampling"
    export SPARK_DAEMON_JAVA_OPTS
    SPARK_EXECUTOR_OPTS+=" -agentpath:/root/YourKit-JavaProfiler-2017.02/bin/linux-x86-64/libyjpagent.so=sampling"
    export SPARK_EXECUTOR_OPTS
    
  • Copy the updated configuration to each node: ~/spark-ec2/copy-dir ~/spark/conf/spark-env.sh
  • Restart your Spark cluster: ~/spark/bin/stop-all.sh and ~/spark/bin/start-all.sh
  • By default, the YourKit profiler agents use ports 10001-10010. To connect the YourKit desktop application to the remote profiler agents, you’ll have to open these ports in the cluster’s EC2 security groups. To do this, sign into the AWS Management Console. Go to the EC2 section and select Security Groups from the Network & Security section on the left side of the page. Find the security groups corresponding to your cluster; if you launched a cluster named test_cluster, then you will want to modify the settings for the test_cluster-slaves and test_cluster-master security groups. For each group, select it from the list, click the Inbound tab, and create a new Custom TCP Rule opening the port range 10001-10010. Finally, click Apply Rule Changes. Make sure to do this for both security groups. Note: by default, spark-ec2 re-uses security groups: if you stop this cluster and launch another cluster with the same name, your security group settings will be re-used.
  • Launch the YourKit profiler on your desktop.
  • Select “Connect to remote application…” from the welcome screen and enter the address of your Spark master or worker machine, e.g. ec2--.compute-1.amazonaws.com
  • YourKit should now be connected to the remote profiling agent. It may take a few moments for profiling information to appear.

Please see the full YourKit documentation for the full list of profiler agent startup options.

In Spark unit tests

When running Spark tests through SBT, add javaOptions in Test += "-agentpath:/path/to/yjp" to SparkBuild.scala to launch the tests with the YourKit profiler agent enabled.
The platform-specific paths to the profiler agents are listed in the YourKit documentation.
