Window Function

Overview

Spark SQL supports three kinds of window functions: ranking functions, analytic functions, and aggregate functions.

A window specification defines which rows are included in the frame associated with a given input row. A window specification includes three parts:

Partitioning Specification: controls which rows will be in the same partition with the given row. Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame. If no partitioning specification is given, then all data must be collected to a single machine.

Ordering Specification: controls the way that rows in a partition are ordered, determining the position of the given row in its partition.

Frame Specification: states which rows will be included in the frame for the current input row, based on their relative position to the current row. For example, “the three rows preceding the current row to the current row” describes a frame including the current input row and three rows appearing before the current row.

In SQL, the PARTITION BY and ORDER BY keywords are used to specify partitioning expressions for the partitioning specification, and ordering expressions for the ordering specification, respectively. The SQL syntax is shown below.

OVER (PARTITION BY ... ORDER BY ...)

In addition to the ordering and partitioning, users need to define the start boundary of the frame, the end boundary of the frame, and the type of the frame, which are three components of a frame specification.

There are five types of boundaries, which are UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING, CURRENT ROW, PRECEDING, and FOLLOWING. UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING represent the first row of the partition and the last row of the partition, respectively. For the other three types of boundaries, they specify the offset from the position of the current input row and their specific meanings are defined based on the type of the frame. There are two types of frames, ROW frame and RANGE frame.

ROW frames are based on physical offsets from the position of the current input row, e.g., ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING which including one preceding row and 1 following row.

RANGE frames are based on logical offsets from the position of the current input row, and have similar syntax to the ROW frame, e.g., RANGE BETWEEN 2000 PRECEDING AND 1000 FOLLOWING. Suppose the value of the current row is 3000, then the ranges includes the continuous range whose value fall into the range [1000, 4000].

In summary, to define a window specification, users can use the following syntax in SQL.

OVER (PARTITION BY ... ORDER BY ... frame_type BETWEEN start AND end)

Implementation

WindowExpression is specified by windowFunction and windowSpec, where WindowSpecDefinition is further defined by partitionSpec, orderSpec and frameSpecification. In the analyzer, Window will be inserted as a logical plan through addWindow if hasWindowFunction check passed.

WindowFunction WindowFunction has the WindowFrame defined.

WindowFunctionFrame is used to maintain the WindowFrame.

References: Introducing Window Functions in Spark SQL

results matching ""

    No results matching ""