StreamingQueryManager — Streaming Query Management


StreamingQueryManager is the management interface for streaming queries in a single SparkSession.

Table 1. StreamingQueryManager API

Method               Description
active               Returns the active streaming queries
addListener          Registers a StreamingQueryListener
awaitAnyTermination  Waits for any streaming query to terminate
get                  Gets the StreamingQuery by id
removeListener       De-registers the StreamingQueryListener
resetTerminated      Resets the internal registry of terminated streaming queries (so that awaitAnyTermination can be used again)

StreamingQueryManager is available as the streams property of a SparkSession.

StreamingQueryManager is created when SessionState is created.

Figure 1. StreamingQueryManager
Tip
Refer to the Mastering Apache Spark 2 gitbook to learn about SessionState.

StreamingQueryManager is used (internally) to create a StreamingQuery (with its StreamExecution).

Figure 2. StreamingQueryManager Creates StreamingQuery (and StreamExecution)

StreamingQueryManager takes a single SparkSession when created.

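Table 2. StreamingQueryManager’s Internal Registries and Counters (in alphabetical order)

Name                   Description
activeQueries          Registry of StreamingQueries per UUID. Used when StreamingQueryManager is requested for the active streaming queries, gets a streaming query by id, starts a streaming query, and is notified that a streaming query has terminated.
activeQueriesLock
awaitTerminationLock   Lock acquired when StreamingQueryManager waits for any streaming query termination, resets the terminated queries, and is notified that a streaming query has terminated.
lastTerminatedQuery    StreamingQuery that has recently terminated, i.e. was stopped or failed with an exception. null when no streaming query has terminated yet or after resetTerminated.
listenerBus            StreamingQueryListenerBus. Used to register and de-register a StreamingQueryListener, and to post a streaming event.
stateStoreCoordinator  StateStoreCoordinator. Used to deactivate all active runs of a streaming query that has terminated.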

Getting All Active Streaming Queries — active Method

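Getting Active Streaming Query By Id — get Method

get returns the active StreamingQuery with the given id (the key of the activeQueries internal registry).

get returns null when there is no active StreamingQuery for the id. When the id is given as a string, it has to be a valid UUID (otherwise parsing it fails with an IllegalArgumentException).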

Registering StreamingQueryListener — addListener Method

addListener requests the StreamingQueryListenerBus to add the input listener.

De-Registering StreamingQueryListener — removeListener Method

removeListener requests StreamingQueryListenerBus to remove the input listener.

Waiting for Any Streaming Query Termination — awaitAnyTermination Method

awaitAnyTermination acquires a lock on awaitTerminationLock and waits until any streaming query has finished (i.e. lastTerminatedQuery is available) or timeoutMs has expired.

awaitAnyTermination re-throws the StreamingQueryException from lastTerminatedQuery if it reported one.

resetTerminated Method

resetTerminated forgets the previously terminated query (so that awaitAnyTermination can be used again to wait for the termination of another streaming query).

Internally, resetTerminated acquires a lock on awaitTerminationLock and simply resets lastTerminatedQuery (i.e. sets it to null).
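The interplay between waiting and resetting can be illustrated with a minimal sketch in plain Python. It is a simplified model, not Spark's code: TerminationTracker and record_termination are hypothetical names, a threading.Condition stands in for awaitTerminationLock, and a (name, exception) pair stands in for lastTerminatedQuery.

```python
import threading


class TerminationTracker:
    """Simplified model of awaitTerminationLock and lastTerminatedQuery."""

    def __init__(self):
        self._lock = threading.Condition()  # stands in for awaitTerminationLock
        self._last_terminated = None        # stands in for lastTerminatedQuery

    def record_termination(self, name, exception=None):
        # Hypothetical helper: remember the terminated query and wake up waiters.
        with self._lock:
            self._last_terminated = (name, exception)
            self._lock.notify_all()

    def await_any_termination(self, timeout_s=None):
        # Block until some query has terminated or the timeout has expired.
        with self._lock:
            terminated = self._lock.wait_for(
                lambda: self._last_terminated is not None, timeout=timeout_s)
            if terminated and self._last_terminated[1] is not None:
                # Re-throw the exception the terminated query reported.
                raise self._last_terminated[1]
            return terminated

    def reset_terminated(self):
        # Forget the terminated query so that the next wait blocks again.
        with self._lock:
            self._last_terminated = None
```

In this model, a wait that follows a recorded termination returns immediately until reset_terminated is called, which mirrors why resetTerminated is needed before waiting for another query.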

Creating Streaming Query — createQuery Internal Method

createQuery creates a StreamingQueryWrapper (for a StreamExecution per the input user-defined properties).

Internally, createQuery first finds the name of the checkpoint directory of a query (aka checkpoint location) in the following order:

  1. Exactly the input userSpecifiedCheckpointLocation if defined

  2. The directory given by the spark.sql.streaming.checkpointLocation Spark property (if defined) as the parent directory, with a subdirectory named after the optional userSpecifiedName (or a randomly-generated UUID)

  3. (only when useTempCheckpointLocation is enabled) A subdirectory with the temporary prefix in the temporary directory (as specified by the java.io.tmpdir JVM property)

Note
userSpecifiedCheckpointLocation can be any path that is accepted by Hadoop’s Path.

If the directory name for the checkpoint location could not be found, createQuery reports an AnalysisException.

createQuery reports an AnalysisException when the input recoverFromCheckpointLocation flag is turned off but there is an offsets directory in the checkpoint location.
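The lookup order and the two checks above can be sketched in plain Python as follows. This is a simplified model under stated assumptions, not Spark's code: the function and parameter names are hypothetical, ValueError stands in for AnalysisException, the error messages are made up, and the local file system stands in for a Hadoop-compatible one.

```python
import os
import tempfile
import uuid


def resolve_checkpoint_location(user_location=None, conf_location=None,
                                user_name=None, use_temp=False):
    """Pick the checkpoint directory in the documented order."""
    if user_location is not None:
        # 1. The user-specified location is used exactly as given.
        return user_location
    if conf_location is not None:
        # 2. Parent directory from the property, one subdirectory per query.
        return os.path.join(conf_location, user_name or str(uuid.uuid4()))
    if use_temp:
        # 3. New subdirectory with the "temporary" prefix in the temp directory.
        return tempfile.mkdtemp(prefix="temporary")
    # Stand-in for the AnalysisException (message is an assumption).
    raise ValueError("no checkpoint location could be determined")


def check_recovery_allowed(checkpoint_location, recover_from_checkpoint):
    """Reject an existing offsets directory when recovery is turned off."""
    offsets_dir = os.path.join(checkpoint_location, "offsets")
    if not recover_from_checkpoint and os.path.exists(offsets_dir):
        # Stand-in for the AnalysisException (message is an assumption).
        raise ValueError("recovery from the checkpoint location is disabled")
```

Note that only the third case creates a directory in this sketch; the first two merely compute a path.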

createQuery makes sure that the logical plan of the structured query is analyzed (i.e. no logical errors have been found).

(only when spark.sql.adaptive.enabled Spark property is turned on) createQuery prints out a WARN message to the logs.

In the end, createQuery creates a StreamingQueryWrapper with a new MicroBatchExecution.

Note

The input recoverFromCheckpointLocation flag corresponds to the recoverFromCheckpointLocation flag that StreamingQueryManager is given to start a streaming query (which is in fact the only place where createQuery is used). The flag is enabled by default, with the following sink-specific settings:

  • memory sink has the flag enabled for Complete output mode only

  • foreach sink has the flag always enabled

  • console sink has the flag always disabled

  • all other sinks have the flag always enabled
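The per-sink rules in the list above amount to a small decision function. The following is a minimal sketch in plain Python, assuming the sink is identified by its format name and the output mode by a lowercase string (both are assumptions made for illustration; the function name is hypothetical).

```python
def recover_from_checkpoint_flag(sink, output_mode):
    """Return the recoverFromCheckpointLocation flag for a sink."""
    if sink == "memory":
        # Enabled for Complete output mode only.
        return output_mode == "complete"
    if sink == "console":
        # Always disabled.
        return False
    # foreach and all other sinks: always enabled.
    return True
```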

Note
userSpecifiedName corresponds to queryName option (that can be defined using DataStreamWriter’s queryName method) while userSpecifiedCheckpointLocation is checkpointLocation option.
Note
createQuery is used exclusively when StreamingQueryManager is requested to start a streaming query (when DataStreamWriter is requested to start an execution of a streaming query).

Starting Streaming Query Execution — startQuery Internal Method

startQuery starts a streaming query and returns a handle to it.

Note
trigger defaults to 0 milliseconds (as ProcessingTime(0)).

Internally, startQuery first creates a StreamingQueryWrapper, registers it in activeQueries internal registry (by the id), requests it for the underlying StreamExecution and starts it.

In the end, startQuery returns the StreamingQueryWrapper (as part of the fluent API so you can chain operators) or throws the exception that was reported when attempting to start the query.

startQuery throws an IllegalArgumentException when another active query is already registered under the same name. startQuery looks the name up in the activeQueries internal registry.

startQuery throws an IllegalStateException when a query with the same id is already active, i.e. an active query is started again from its checkpoint. startQuery looks the id up in the activeQueries internal registry.
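The registration steps and the two failure modes of startQuery can be modelled with a short sketch in plain Python. It is not Spark's code: QueryRegistry and FakeQuery are hypothetical names, ValueError and RuntimeError stand in for IllegalArgumentException and IllegalStateException, and both the lock around the registry and the removal of a query that failed to start are assumptions of this sketch.

```python
import threading
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeQuery:
    """Hypothetical stand-in for a StreamingQueryWrapper."""
    name: Optional[str] = None
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    started: bool = False

    def start(self):
        self.started = True


class QueryRegistry:
    def __init__(self):
        self._lock = threading.Lock()  # assumed to guard the registry
        self.active = {}               # stands in for activeQueries (id -> query)

    def start_query(self, query):
        with self._lock:
            # Another active query must not use the same name.
            if query.name is not None and any(
                    q.name == query.name for q in self.active.values()):
                raise ValueError(f"query named {query.name!r} is already active")
            # The same id being active means a restart from a live checkpoint.
            if query.id in self.active:
                raise RuntimeError(f"query with id {query.id} is already active")
            self.active[query.id] = query
        try:
            query.start()
        except Exception:
            # Assumption: a query that failed to start is not kept as active.
            with self._lock:
                self.active.pop(query.id, None)
            raise
        return query
```

Returning the query from start_query mirrors the handle that startQuery gives back to the caller.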

Note
startQuery is used exclusively when DataStreamWriter is requested to start an execution of the streaming query.

Posting StreamingQueryListener Event to StreamingQueryListenerBus — postListenerEvent Internal Method

postListenerEvent simply posts the input event to StreamingQueryListenerBus.
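How addListener, removeListener and postListenerEvent all delegate to a single bus can be shown with a minimal sketch in plain Python. ListenerBus is a hypothetical name, listeners are modelled as plain callables, and events are delivered synchronously here, which is a simplification made for illustration only.

```python
class ListenerBus:
    """Simplified, synchronous model of a listener bus."""

    def __init__(self):
        self._listeners = []

    def add_listener(self, listener):
        # Register a callable that accepts a single event argument.
        self._listeners.append(listener)

    def remove_listener(self, listener):
        # De-register the listener if it is registered; ignore it otherwise.
        if listener in self._listeners:
            self._listeners.remove(listener)

    def post(self, event):
        # Deliver the event to a snapshot of the registered listeners.
        for listener in list(self._listeners):
            listener(event)
```

Iterating over a copy of the listener list lets a listener de-register itself while an event is being delivered.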

Figure 3. StreamingQueryManager Propagates StreamingQueryListener Events
Note
postListenerEvent is used exclusively when StreamExecution posts a streaming event.

Handling Termination of Streaming Query (and Deactivating Query in StateStoreCoordinator) — notifyQueryTermination Internal Method

notifyQueryTermination removes the terminatedQuery from activeQueries internal registry (by the query id).

notifyQueryTermination records the terminatedQuery in lastTerminatedQuery internal registry (when no earlier streaming query was recorded or the terminatedQuery terminated due to an exception).

notifyQueryTermination notifies all threads that are waiting on awaitTerminationLock.

In the end, notifyQueryTermination requests StateStoreCoordinator to deactivate all active runs of the streaming query.
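The four steps of notifyQueryTermination can be sketched in plain Python as follows. This is a simplified model, not Spark's code: TerminationHandler is a hypothetical name, the deactivate callback stands in for the request to StateStoreCoordinator, and a (query id, exception) pair stands in for lastTerminatedQuery.

```python
import threading


class TerminationHandler:
    def __init__(self, deactivate):
        self.active = {}                          # stands in for activeQueries
        self.last_terminated = None               # stands in for lastTerminatedQuery
        self._await_lock = threading.Condition()  # stands in for awaitTerminationLock
        self._deactivate = deactivate             # hypothetical coordinator callback

    def notify_query_termination(self, query_id, exception=None):
        # 1. The query is no longer active.
        self.active.pop(query_id, None)
        with self._await_lock:
            # 2. Record it when nothing was recorded yet or when it failed.
            if self.last_terminated is None or exception is not None:
                self.last_terminated = (query_id, exception)
            # 3. Wake up every thread waiting for a termination.
            self._await_lock.notify_all()
        # 4. Ask the coordinator stand-in to deactivate the query's runs.
        self._deactivate(query_id)
```

The condition in step 2 means that a query that stopped cleanly never overwrites an earlier record, while a failed query always does.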

Figure 4. StreamingQueryManager’s Marking Streaming Query as Terminated
Note
notifyQueryTermination is used exclusively when StreamExecution is requested to run a streaming query and the query has finished (running streaming batches) (with or without an exception).