package optimizer
Type Members
- sealed abstract class BuildSide extends AnyRef
- case class Cost(card: BigInt, size: BigInt) extends Product with Serializable
This class defines the cost model for a plan.
- card
Cardinality (number of rows).
- size
Size in bytes.
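A cost pair like Cost(card, size) is typically compared by weighting relative cardinality against relative size. The sketch below is illustrative only: the betterThan name and the weight parameter follow the general pattern of such cost models, not necessarily Catalyst's exact implementation.

```scala
// Hedged sketch of comparing two (cardinality, size) costs.
// The betterThan method and cardWeight parameter are illustrative assumptions.
case class Cost(card: BigInt, size: BigInt) {
  // A plan is "better" if its weighted relative cost is below the other's.
  def betterThan(other: Cost, cardWeight: Double): Boolean = {
    if (other.card == 0 || other.size == 0) {
      false
    } else {
      val relativeRows = BigDecimal(card) / BigDecimal(other.card)
      val relativeSize = BigDecimal(size) / BigDecimal(other.size)
      relativeRows * BigDecimal(cardWeight) +
        relativeSize * BigDecimal(1 - cardWeight) < BigDecimal(1)
    }
  }
}
```

A plan with half the rows and half the size of another always wins under any weight; the guard against zero-cost plans avoids division by zero.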
- case class InlineCTE(alwaysInline: Boolean = false) extends Rule[LogicalPlan] with Product with Serializable
Inlines CTE definitions into corresponding references if either of the following conditions is satisfied: 1. The CTE definition does not contain any non-deterministic expressions, or it contains attribute references to an outer query. If this CTE definition references another CTE definition that has non-deterministic expressions, it is still OK to inline the current CTE definition. 2. The CTE definition is only referenced once throughout the main query and all the subqueries.
CTE definitions that appear in subqueries and are not inlined will be pulled up to the main query level.
- alwaysInline
if true, inline all CTEs in the query plan.
- case class JoinGraphInfo(starJoins: Set[Int], nonStarJoins: Set[Int]) extends Product with Serializable
Helper class that keeps information about the join graph as sets of item/plan ids. It currently stores the star/non-star plans. It can be extended with the set of connected/unconnected plans.
- trait JoinSelectionHelper extends AnyRef
- case class NormalizeNaNAndZero(child: Expression) extends UnaryExpression with ExpectsInputTypes with Product with Serializable
- abstract class Optimizer extends RuleExecutor[LogicalPlan] with SQLConfHelper
Abstract class that all optimizers should inherit from; contains the standard batches (extending Optimizers can override this).
- case class OrderedJoin(left: LogicalPlan, right: LogicalPlan, joinType: JoinType, condition: Option[Expression]) extends LogicalPlan with BinaryNode with Product with Serializable
This is a mimic class for a join node that has been ordered.
- abstract class PropagateEmptyRelationBase extends Rule[LogicalPlan] with CastSupport
The base class of two rules in the normal and AQE Optimizer. It simplifies query plans with empty or non-empty relations:
1. Higher-node Logical Plans
- Union with all empty children.
2. Binary-node Logical Plans
- Join with one or two empty children (including Intersect/Except).
- Left semi Join: Right side is non-empty and condition is empty. Eliminate join to its left side.
- Left anti join: Right side is non-empty and condition is empty. Eliminate join to an empty LocalRelation.
3. Unary-node Logical Plans
- Project/Filter/Sample with all empty children.
- Limit/Repartition/RepartitionByExpression/Rebalance with all empty children.
- Aggregate with all empty children and at least one grouping expression.
- Generate(Explode) with all empty children. Others like Hive UDTF may return results.
- case class ReplaceCurrentLike(catalogManager: CatalogManager) extends Rule[LogicalPlan] with Product with Serializable
Replaces the expression of CurrentDatabase with the current database name. Replaces the expression of CurrentCatalog with the current catalog name.
- case class ScalarSubqueryReference(subqueryIndex: Int, headerIndex: Int, dataType: DataType, exprId: ExprId) extends LeafExpression with Unevaluable with Product with Serializable
Temporary reference to a cached subquery.
- subqueryIndex
A subquery index in the cache.
- headerIndex
An index in the output of merged subquery.
- dataType
The dataType of origin scalar subquery.
- class SimpleTestOptimizer extends Optimizer
Value Members
- object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper
Simplifies boolean expressions: 1. Simplifies expressions whose answer can be determined without evaluating both sides. 2. Eliminates / extracts common factors. 3. Merges same expressions. 4. Removes the Not operator.
- case object BuildLeft extends BuildSide with Product with Serializable
- case object BuildRight extends BuildSide with Product with Serializable
- object CheckCartesianProducts extends Rule[LogicalPlan] with PredicateHelper
Checks if there are any cartesian products between joins of any type in the optimized plan tree. Throws an error if a cartesian product is found without an explicit cross join specified. This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true.
This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be collected before checking if it is a cartesian product. If you have SELECT * from R, S where R.r = S.s, the join between R and S is not a cartesian product and therefore should be allowed. The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule.
This rule must be run AFTER the batch "LocalRelation", since a join with empty relation should not be a cartesian product.
- object CleanUpTempCTEInfo extends Rule[LogicalPlan]
Clean up temporary info from CTERelationDef nodes. This rule should be called after all iterations of PushdownPredicatesAndPruneColumnsForCTEDef are done.
- object CollapseProject extends Rule[LogicalPlan] with AliasHelper
Combines two Project operators into one and performs alias substitution, merging the expressions into one single expression for the following cases: 1. When two Project operators are adjacent. 2. When two Project operators have a LocalLimit/Sample/Repartition operator between them and the upper project consists of the same number of columns, each of which is equal to or an alias of the corresponding lower column. The GlobalLimit(LocalLimit) pattern is also considered.
- object CollapseRepartition extends Rule[LogicalPlan]
Combines adjacent RepartitionOperation and RebalancePartitions operators
- object CollapseWindow extends Rule[LogicalPlan]
Collapses adjacent Window expressions: if the partition specs and order specs are the same and the window expressions are independent and of the same window function type, collapse into the parent.
- object ColumnPruning extends Rule[LogicalPlan]
Attempts to eliminate the reading of unneeded columns from the query plan.
Since adding Project before Filter conflicts with PushPredicatesThroughProject, this rule will remove the Project p2 in the following pattern:
p1 @ Project(_, Filter(_, p2 @ Project(_, child))) if p2.outputSet.subsetOf(p2.inputSet)
p2 is usually inserted by this rule and useless; p1 can prune the columns anyway.
- object CombineConcats extends Rule[LogicalPlan]
Combine nested Concat expressions.
- object CombineFilters extends Rule[LogicalPlan] with PredicateHelper
Combines two adjacent Filter operators into one, merging the non-redundant conditions into one conjunctive predicate.
- object CombineTypedFilters extends Rule[LogicalPlan]
Combines two adjacent TypedFilters, which operate on the same type of object in their conditions, into one, merging the filter functions into one conjunctive function.
- object CombineUnions extends Rule[LogicalPlan]
Combines all adjacent Union operators into a single Union.
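The flattening that CombineUnions performs can be sketched on a toy plan algebra. The Plan/Union/Leaf classes below are illustrative stand-ins, not Catalyst's classes:

```scala
// Hedged sketch of flattening adjacent unions into a single Union node.
sealed trait Plan
case class Leaf(name: String) extends Plan
case class Union(children: Seq[Plan]) extends Plan

def combineUnions(plan: Plan): Plan = plan match {
  case Union(children) =>
    // Recursively flatten nested Unions into one child list.
    val flattened = children.map(combineUnions).flatMap {
      case Union(grandChildren) => grandChildren
      case other                => Seq(other)
    }
    Union(flattened)
  case other => other
}
```

For example, Union(Union(a, b), c) becomes a single Union(a, b, c).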
- object ComputeCurrentTime extends Rule[LogicalPlan]
Computes the current date and time to make sure we return the same result in a single query.
- object ConstantFolding extends Rule[LogicalPlan]
Replaces Expressions that can be statically evaluated with equivalent Literal values.
- object ConstantPropagation extends Rule[LogicalPlan]
Substitutes Attributes which can be statically evaluated with their corresponding value in conjunctive Expressions e.g.
SELECT * FROM table WHERE i = 5 AND j = i + 3 ==> SELECT * FROM table WHERE i = 5 AND j = 8
Approach used: - Populate a mapping of attribute => constant value by looking at all the equals predicates - Using this mapping, replace occurrence of the attributes with the corresponding constant values in the AND node.
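The two-step approach above (collect attribute-to-constant bindings from equality predicates, then substitute and fold) can be sketched on a toy expression AST. The Expr classes and both function names are illustrative, and the sketch only handles attribute-on-the-left equalities:

```scala
// Hedged sketch of constant propagation over a toy predicate AST.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class EqualTo(l: Expr, r: Expr) extends Expr
case class And(l: Expr, r: Expr) extends Expr

// Step 1: collect attribute -> constant bindings from equality conjuncts.
def collectConstants(e: Expr): Map[String, Int] = e match {
  case And(l, r)                => collectConstants(l) ++ collectConstants(r)
  case EqualTo(Attr(a), Lit(v)) => Map(a -> v)
  case _                        => Map.empty
}

// Step 2: replace attribute occurrences with constants and fold arithmetic.
// The binding equality itself keeps its left-hand attribute, as a simplification.
def substitute(e: Expr, consts: Map[String, Int]): Expr = e match {
  case Attr(a) if consts.contains(a) => Lit(consts(a))
  case Add(l, r) => (substitute(l, consts), substitute(r, consts)) match {
    case (Lit(a), Lit(b)) => Lit(a + b) // constant-fold on the fly
    case (sl, sr)         => Add(sl, sr)
  }
  case EqualTo(l, r) => EqualTo(l, substitute(r, consts))
  case And(l, r)     => And(substitute(l, consts), substitute(r, consts))
  case other         => other
}
```

Applied to the document's example predicate i = 5 AND j = i + 3, the second conjunct becomes j = 8.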
- object ConvertToLocalRelation extends Rule[LogicalPlan]
Converts local operations (i.e. ones that don't require data exchange) on LocalRelation to another LocalRelation.
- object CostBasedJoinReorder extends Rule[LogicalPlan] with PredicateHelper
Cost-based join reorder. We may have several join reorder algorithms in the future. This class is the entry of these algorithms, and chooses which one to use.
- object DecimalAggregates extends Rule[LogicalPlan]
Speeds up aggregates on fixed-precision decimals by executing them on unscaled Long values.
This uses the same rules for increasing the precision and scale of the output as org.apache.spark.sql.catalyst.analysis.DecimalPrecision.
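The core idea of working on unscaled Long values can be sketched as follows; sumUnscaled is an illustrative name, not Catalyst's API, and the rescaling at the end stands in for the precision/scale rules the rule actually applies:

```scala
// Hedged sketch: sum fixed-precision decimals via their unscaled Long values,
// rescaling once at the end instead of doing decimal arithmetic per row.
def sumUnscaled(values: Seq[BigDecimal], scale: Int): BigDecimal = {
  val factor = BigDecimal(10).pow(scale)
  // e.g. 1.25 at scale 2 becomes the unscaled Long 125
  val unscaledSum = values.map(v => (v * factor).toBigInt.toLong).sum
  BigDecimal(unscaledSum) / factor
}
```

Summing 1.25, 2.50 and 3.25 at scale 2 sums the longs 125 + 250 + 325 = 700 and rescales to 7.00.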
- object DecorrelateInnerQuery extends PredicateHelper
Decorrelate the inner query by eliminating outer references and create domain joins. The implementation is based on the paper: Unnesting Arbitrary Queries by Thomas Neumann and Alfons Kemper. https://dl.gi.de/handle/20.500.12116/2418.
A correlated subquery can be viewed as a "dependent" nested loop join between the outer and the inner query. For each row produced by the outer query, we bind the OuterReferences in the inner query with the corresponding values in the row, and then evaluate the inner query.
Dependent Join :- Outer Query +- Inner Query
If the OuterReferences are bound to the same value, the inner query will return the same result. Based on this, we can reduce the times to evaluate the inner query by first getting all distinct values of the OuterReferences.
Normal Join :- Outer Query +- Dependent Join :- Inner Query +- Distinct Aggregate (outer_ref1, outer_ref2, ...) +- Outer Query
The distinct aggregate of the outer references is called a "domain", and the dependent join between the inner query and the domain is called a "domain join". We need to push down the domain join through the inner query until there is no outer reference in the sub-tree and the domain join will turn into a normal join.
The decorrelation function returns a new query plan with optional placeholder DomainJoins added and a list of join conditions with the outer query. DomainJoins need to be rewritten into actual inner joins between the inner query sub-tree and the outer query.
E.g. decorrelate an inner query with equality predicates:
SELECT (SELECT MIN(b) FROM t1 WHERE t2.c = t1.a) FROM t2
Aggregate [] [min(b)]                Aggregate [a] [min(b), a]
+- Filter (outer(c) = a)       =>    +- Relation [t1]
   +- Relation [t1]
Join conditions: [c = a]
E.g. decorrelate an inner query with non-equality predicates:
SELECT (SELECT MIN(b) FROM t1 WHERE t2.c > t1.a) FROM t2
Aggregate [] [min(b)]                Aggregate [c'] [min(b), c']
+- Filter (outer(c) > a)       =>    +- Filter (c' > a)
   +- Relation [t1]                     +- DomainJoin [c']
                                           +- Relation [t1]
Join conditions: [c <=> c']
- object EliminateAggregateFilter extends Rule[LogicalPlan]
Remove useless FILTER clause for aggregate expressions. This rule should be applied before RewriteDistinctAggregates.
- object EliminateDistinct extends Rule[LogicalPlan]
Remove useless DISTINCT:
1. For some aggregate expressions, e.g. MAX and MIN. 2. If the distinct semantics is guaranteed by the child.
This rule should be applied before RewriteDistinctAggregates.
- object EliminateLimits extends Rule[LogicalPlan]
This rule is applied by both the normal and the AQE Optimizer, and optimizes Limit operators by: 1. Eliminating Limit/GlobalLimit operators if the child's max row count <= limit. 2. Replacing Limit/LocalLimit/GlobalLimit operators with an empty LocalRelation if the limit value is zero (0). 3. Combining two adjacent Limit operators into one, merging the expressions into one single expression.
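The three Limit optimizations can be sketched on a toy plan with max-row tracking. The Plan/Limit/Relation classes are illustrative stand-ins, not Catalyst's:

```scala
// Hedged sketch of EliminateLimits' three cases on a toy plan.
sealed trait Plan { def maxRows: Option[Long] }
case class Relation(rows: Long) extends Plan { def maxRows = Some(rows) }
case object Empty extends Plan { def maxRows = Some(0L) }
case class Limit(limit: Long, child: Plan) extends Plan {
  def maxRows = Some(child.maxRows.map(math.min(_, limit)).getOrElse(limit))
}

def eliminateLimits(plan: Plan): Plan = plan match {
  // 1. Drop the Limit if the child can't exceed it anyway.
  case Limit(n, child) if child.maxRows.exists(_ <= n) => eliminateLimits(child)
  // 2. Limit 0 produces an empty relation.
  case Limit(0, _) => Empty
  // 3. Adjacent Limits combine into the smaller one.
  case Limit(n, Limit(m, child)) => eliminateLimits(Limit(math.min(n, m), child))
  case other => other
}
```

For example, a Limit 10 over a 5-row relation disappears, and Limit 10 over Limit 3 collapses to Limit 3.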
- object EliminateMapObjects extends Rule[LogicalPlan]
Removes MapObjects when the following conditions are satisfied
1. MapObjects(... lambdaVariable(..., false) ...), which means the types for input and output are non-nullable primitive types. 2. No custom collection class is specified for the representation of the data item.
- object EliminateOffsets extends Rule[LogicalPlan]
This rule optimizes Offset operators by: 1. Eliminating Offset operators if offset == 0. 2. Replacing Offset operators with an empty LocalRelation if the Offset's child max row count <= offset. 3. Combining two adjacent Offset operators into one, merging the expressions into one single expression.
- object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper
1. Elimination of outer joins, if the predicates can restrict the result sets so that all null-supplying rows are eliminated
- full outer -> inner if both sides have such predicates
- left outer -> inner if the right side has such predicates
- right outer -> inner if the left side has such predicates
- full outer -> left outer if only the left side has such predicates
- full outer -> right outer if only the right side has such predicates
2. Removes outer join if aggregate is from streamed side and duplicate agnostic
SELECT DISTINCT f1 FROM t1 LEFT JOIN t2 ON t1.id = t2.id ==> SELECT DISTINCT f1 FROM t1
SELECT t1.c1, max(t1.c2) FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1 GROUP BY t1.c1 ==> SELECT t1.c1, max(t1.c2) FROM t1 GROUP BY t1.c1
3. Remove outer join if:
- For a left outer join with only left-side columns being selected and the right side join keys are unique.
- For a right outer join with only right-side columns being selected and the left side join keys are unique.
SELECT t1.* FROM t1 LEFT JOIN (SELECT DISTINCT c1 as c1 FROM t) t2 ON t1.c1 = t2.c1 ==> SELECT t1.* FROM t1
This rule should be executed before pushing down the Filter.
- object EliminateResolvedHint extends Rule[LogicalPlan]
Removes ResolvedHint operators from the plan. Moves the HintInfo to the associated Join operators, otherwise removes it if no Join operator is matched.
- object EliminateSerialization extends Rule[LogicalPlan]
Removes cases where we are unnecessarily going between the object and serialized (InternalRow) representation of data item. For example back to back map operations.
- object EliminateSorts extends Rule[LogicalPlan]
Removes Sort operations if they don't affect the final output ordering. Note that changes in the final output ordering may affect the file size (SPARK-32318). This rule handles the following cases:
1) if the sort order is empty or the sort order does not have any reference
2) if the Sort operator is a local sort and the child is already sorted
3) if there is another Sort operator separated by 0...n Project, Filter, Repartition or RepartitionByExpression, RebalancePartitions (with deterministic expressions) operators
4) if the Sort operator is within Join separated by 0...n Project, Filter, Repartition or RepartitionByExpression, RebalancePartitions (with deterministic expressions) operators only and the Join condition is deterministic
5) if the Sort operator is within GroupBy separated by 0...n Project, Filter, Repartition or RepartitionByExpression, RebalancePartitions (with deterministic expressions) operators only and the aggregate function is order irrelevant
- object ExtractPythonUDFFromJoinCondition extends Rule[LogicalPlan] with PredicateHelper
PythonUDF in a join condition can't be evaluated if it refers to attributes from both join sides. See ExtractPythonUDFs for details. This rule will detect un-evaluable PythonUDFs and pull them out of the join condition.
- object FoldablePropagation extends Rule[LogicalPlan]
Replace attributes with aliases of the original foldable expressions if possible. Other optimizations will take advantage of the propagated foldable expressions. For example, this rule can optimize
SELECT 1.0 x, 'abc' y, Now() z ORDER BY x, y, 3
to
SELECT 1.0 x, 'abc' y, Now() z ORDER BY 1.0, 'abc', Now()
and other rules can further optimize it and remove the ORDER BY operator.
- object GenerateOptimization extends Rule[LogicalPlan]
Prunes unnecessary fields from a Generate if it is under a project which does not refer to any generated attributes, e.g., count-like aggregation on an exploded array.
- object GeneratorNestedColumnAliasing
This prunes unnecessary nested columns from Generate, or Project -> Generate
- object InferFiltersFromConstraints extends Rule[LogicalPlan] with PredicateHelper with ConstraintHelper
Generate a list of additional filters from an operator's existing constraint but remove those that are either already part of the operator's condition or are part of the operator's child constraints. These filters are currently inserted to the existing conditions in the Filter operators and on either side of Join operators.
Note: While this optimization is applicable to a lot of types of join, it primarily benefits Inner and LeftSemi joins.
- object InferFiltersFromGenerate extends Rule[LogicalPlan]
Infers filters from Generate, such that rows that would have been removed by this Generate can be removed earlier - before joins and in data sources.
- object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with JoinSelectionHelper
Insert a filter on one side of the join if the other side has a selective predicate. The filter could be an IN subquery (converted to a semi join), a bloom filter, or something else in the future.
- object JoinReorderDP extends PredicateHelper with Logging
Reorder the joins using a dynamic programming algorithm. This implementation is based on the paper: Access Path Selection in a Relational Database Management System. https://dl.acm.org/doi/10.1145/582095.582099
First we put all items (basic joined nodes) into level 0, then we build all two-way joins at level 1 from plans at level 0 (single items), then build all 3-way joins from plans at previous levels (two-way joins and single items), then 4-way joins ... etc, until we build all n-way joins and pick the best plan among them.
When building m-way joins, we only keep the best plan (with the lowest cost) for the same set of m items. E.g., for 3-way joins, we keep only the best plan for items {A, B, C} among plans (A J B) J C, (A J C) J B and (B J C) J A. We also prune cartesian product candidates when building a new plan if there exists no join condition involving references from both left and right. This pruning strategy significantly reduces the search space. E.g., given A J B J C J D with join conditions A.k1 = B.k1 and B.k2 = C.k2 and C.k3 = D.k3, plans maintained for each level are as follows:
level 0: p({A}), p({B}), p({C}), p({D})
level 1: p({A, B}), p({B, C}), p({C, D})
level 2: p({A, B, C}), p({B, C, D})
level 3: p({A, B, C, D})
where p({A, B, C, D}) is the final output plan.
For cost evaluation, since physical costs for operators are not available currently, we use cardinalities and sizes to compute costs.
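The level-wise enumeration with cartesian-product pruning can be sketched with a subset DP. This is a heavy simplification under stated assumptions: the cost of a join is taken as the sum of the input cardinalities (a toy stand-in for the real card/size cost), and bestJoinOrder is an illustrative name:

```scala
// Hedged sketch of level-wise DP join enumeration over item sets.
def bestJoinOrder(
    cards: Map[Set[String], BigInt],   // level-0 cardinalities per single item
    joinable: Set[(String, String)]    // item pairs that have a join condition
): Map[Set[String], BigInt] = {
  val items = cards.keySet.flatten
  val best = scala.collection.mutable.Map[Set[String], BigInt]() ++= cards
  for (level <- 2 to items.size; subset <- items.subsets(level)) {
    // Try all splits into two already-planned halves, pruning cartesian products
    // (splits with no join condition spanning both halves).
    val candidates = for {
      left <- subset.subsets.toSeq if left.nonEmpty && left != subset
      right = subset -- left
      if best.contains(left) && best.contains(right)
      if joinable.exists { case (a, b) =>
        (left(a) && right(b)) || (left(b) && right(a)) }
    } yield best(left) + best(right)   // toy cost: sum of input cardinalities
    if (candidates.nonEmpty) best(subset) = candidates.min
  }
  best.toMap
}
```

For the chain A J B J C with conditions on (A, B) and (B, C), no plan is kept for {A, C} because that pair would be a cartesian product.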
- object JoinReorderDPFilters
Implements optional filters to reduce the search space for join enumeration.
1) Star-join filters: Plan star-joins together since they are assumed to have an optimal execution based on their RI relationship.
2) Cartesian products: Defer their planning later in the graph to avoid large intermediate results (expanding joins, in general).
3) Composite inners: Don't generate "bushy tree" plans to avoid materializing intermediate results.
Filters (2) and (3) are not implemented.
- object LikeSimplification extends Rule[LogicalPlan] with PredicateHelper
Simplifies LIKE expressions that do not need full regular expressions to evaluate the condition. For example, when the expression is just checking to see if a string starts with a given pattern.
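The kinds of rewrites this enables can be sketched as a plain function that recognizes simple patterns and falls back (returns None) when a full regex is needed. simplifyLike is an illustrative name, and escape sequences and '_' wildcards are conservatively left to the regex path:

```scala
// Hedged sketch: evaluate simple LIKE patterns without building a regex.
def simplifyLike(input: String, pattern: String): Option[Boolean] = {
  if (pattern.contains("\\") || pattern.contains("_")) {
    None                                            // fall back to full regex
  } else if (pattern.endsWith("%") && !pattern.dropRight(1).contains("%")) {
    Some(input.startsWith(pattern.dropRight(1)))    // 'abc%' => startsWith
  } else if (pattern.startsWith("%") && !pattern.drop(1).contains("%")) {
    Some(input.endsWith(pattern.drop(1)))           // '%abc' => endsWith
  } else if (!pattern.contains("%")) {
    Some(input == pattern)                          // 'abc'  => equality
  } else {
    None
  }
}
```

This mirrors the rule's spirit: a prefix pattern like 'abc%' becomes a cheap startsWith check instead of a regex match.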
- object LimitPushDown extends Rule[LogicalPlan]
Pushes down LocalLimit beneath UNION ALL, OFFSET and joins.
- object LimitPushDownThroughWindow extends Rule[LogicalPlan]
Pushes down LocalLimit beneath WINDOW. This rule optimizes the following case:
SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn FROM Tab1 LIMIT 5
==> SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn FROM (SELECT * FROM Tab1 ORDER BY a LIMIT 5) t
- object MergeScalarSubqueries extends Rule[LogicalPlan]
This rule tries to merge multiple non-correlated ScalarSubquerys to compute multiple scalar values once.
The process is the following:
- While traversing through the plan, each ScalarSubquery plan is tried to be merged into the cache of already seen subquery plans. If a merge is possible then the cache is updated with the merged subquery plan; if not then the new subquery plan is added to the cache. During this first traversal each ScalarSubquery expression is replaced with a temporary ScalarSubqueryReference pointing to its cached version. The cache uses a flag to keep track of whether a cache entry is the result of merging 2 or more plans, or is a plan that was seen only once. Merged plans in the cache get a "Header" that contains the list of attributes from the scalar return value of a merged subquery.
- A second traversal checks if there are merged subqueries in the cache and builds a WithCTE node from these queries. The CTERelationDef nodes contain the merged subquery in the following form: Project(Seq(CreateNamedStruct(name1, attribute1, ...) AS mergedValue), mergedSubqueryPlan), and the definitions are flagged that they host a subquery that can return at most one row. During the second traversal, ScalarSubqueryReference expressions that point to a merged subquery are either transformed to a GetStructField(ScalarSubquery(CTERelationRef(...))) expression or restored to the original ScalarSubquery.
E.g. the following query:
SELECT (SELECT avg(a) FROM t), (SELECT sum(b) FROM t)
is optimized from:
Optimized Logical Plan
Project [scalar-subquery#242 [] AS scalarsubquery()#253, scalar-subquery#243 [] AS scalarsubquery()#254L]
:  :- Aggregate [avg(a#244) AS avg(a)#247]
:  :  +- Project [a#244]
:  :     +- Relation default.t[a#244,b#245] parquet
:  +- Aggregate [sum(a#251) AS sum(a)#250L]
:     +- Project [a#251]
:        +- Relation default.t[a#251,b#252] parquet
+- OneRowRelation
to:
Optimized Logical Plan
Project [scalar-subquery#242 [].avg(a) AS scalarsubquery()#253, scalar-subquery#243 [].sum(a) AS scalarsubquery()#254L]
:  :- Project [named_struct(avg(a), avg(a)#247, sum(a), sum(a)#250L) AS mergedValue#260]
:  :  +- Aggregate [avg(a#244) AS avg(a)#247, sum(a#244) AS sum(a)#250L]
:  :     +- Project [a#244]
:  :        +- Relation default.t[a#244,b#245] parquet
:  +- Project [named_struct(avg(a), avg(a)#247, sum(a), sum(a)#250L) AS mergedValue#260]
:     +- Aggregate [avg(a#244) AS avg(a)#247, sum(a#244) AS sum(a)#250L]
:        +- Project [a#244]
:           +- Relation default.t[a#244,b#245] parquet
+- OneRowRelation
Physical Plan
*(1) Project [Subquery scalar-subquery#242, [id=#125].avg(a) AS scalarsubquery()#253, ReusedSubquery Subquery scalar-subquery#242, [id=#125].sum(a) AS scalarsubquery()#254L]
:  :- Subquery scalar-subquery#242, [id=#125]
:  :  +- *(2) Project [named_struct(avg(a), avg(a)#247, sum(a), sum(a)#250L) AS mergedValue#260]
:  :     +- *(2) HashAggregate(keys=[], functions=[avg(a#244), sum(a#244)], output=[avg(a)#247, sum(a)#250L])
:  :        +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#120]
:  :           +- *(1) HashAggregate(keys=[], functions=[partial_avg(a#244), partial_sum(a#244)], output=[sum#262, count#263L, sum#264L])
:  :              +- *(1) ColumnarToRow
:  :                 +- FileScan parquet default.t[a#244] ...
:  +- ReusedSubquery Subquery scalar-subquery#242, [id=#125]
+- *(1) Scan OneRowRelation[]
- object NestedColumnAliasing
This aims to handle a nested column aliasing pattern inside the ColumnPruning optimizer rule.
If:
- A Project or its child references nested fields
- Not all of the fields in a nested attribute are used
Then:
- Substitute the nested field references with alias attributes
- Add grandchild Projects transforming the nested fields to aliases
Example 1: Project
Before:
+- Project [concat_ws(s#0.a, s#0.b) AS concat_ws(s.a, s.b)#1]
   +- GlobalLimit 5
      +- LocalLimit 5
         +- LocalRelation <empty>, [s#0]
After:
+- Project [concat_ws(_extract_a#2, _extract_b#3) AS concat_ws(s.a, s.b)#1]
   +- GlobalLimit 5
      +- LocalLimit 5
         +- Project [s#0.a AS _extract_a#2, s#0.b AS _extract_b#3]
            +- LocalRelation <empty>, [s#0]
Example 2: Project above Filter
Before:
+- Project [s#0.a AS s.a#1]
   +- Filter (length(s#0.b) > 2)
      +- GlobalLimit 5
         +- LocalLimit 5
            +- LocalRelation <empty>, [s#0]
After:
+- Project [_extract_a#2 AS s.a#1]
   +- Filter (length(_extract_b#3) > 2)
      +- GlobalLimit 5
         +- LocalLimit 5
            +- Project [s#0.a AS _extract_a#2, s#0.b AS _extract_b#3]
               +- LocalRelation <empty>, [s#0]
Example 3: Nested fields with referenced parents
Before:
+- Project [s#0.a AS s.a#1, s#0.a.a1 AS s.a.a1#2]
   +- GlobalLimit 5
      +- LocalLimit 5
         +- LocalRelation <empty>, [s#0]
After:
+- Project [_extract_a#3 AS s.a#1, _extract_a#3.name AS s.a.a1#2]
   +- GlobalLimit 5
      +- LocalLimit 5
         +- Project [s#0.a AS _extract_a#3]
            +- LocalRelation <empty>, [s#0]
The schema of the datasource relation will be pruned in the SchemaPruning optimizer rule.
- object NormalizeFloatingNumbers extends Rule[LogicalPlan]
We need to take care of special floating numbers (NaN and -0.0) in several places:
1. When comparing values, different NaNs should be treated as the same, and -0.0 and 0.0 should be treated as the same.
2. In aggregate grouping keys, different NaNs should belong to the same group, and -0.0 and 0.0 should belong to the same group.
3. In join keys, different NaNs should be treated as the same, and -0.0 and 0.0 should be treated as the same.
4. In window partition keys, different NaNs should belong to the same partition, and -0.0 and 0.0 should belong to the same partition.
Case 1 is fine, as we handle NaN and -0.0 well during comparison. For complex types, we recursively compare the fields/elements, so it's also fine.
Case 2, 3 and 4 are problematic, as Spark SQL turns grouping/join/window partition keys into binary UnsafeRow and compares the binary data directly. Different NaNs have different binary representations, and the same thing happens for -0.0 and 0.0.
This rule normalizes NaN and -0.0 in window partition keys, join keys and aggregate grouping keys.
Ideally we should do the normalization in the physical operators that compare the binary UnsafeRow directly. We don't need this normalization if the Spark SQL execution engine is not optimized to run on binary data. This rule is created to simplify the implementation, so that we have a single place to do normalization, which is more maintainable.
Note that this rule must be executed at the end of the optimizer, because the optimizer may create new joins (the subquery rewrite) and new join conditions (the join reorder).
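The binary-representation problem can be demonstrated directly: -0.0 and 0.0 compare equal as doubles but have different bit patterns. A per-value normalization in the spirit of this rule (normalizeDouble is an illustrative name) maps them to one canonical representation:

```scala
// Hedged sketch of per-value normalization before binary comparison.
def normalizeDouble(d: Double): Double = {
  if (d.isNaN) Double.NaN   // collapse every NaN bit pattern to the canonical one
  else if (d == 0.0d) 0.0d  // -0.0 == 0.0, so this maps -0.0 to +0.0
  else d
}
```

After normalization, byte-wise comparison of the values agrees with numeric comparison, which is what grouping/join/window partitioning on binary data needs.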
- object NullDownPropagation extends Rule[LogicalPlan]
Unwraps the input of IsNull/IsNotNull if the input is NullIntolerant, e.g. IsNull(Not(null)) == IsNull(null).
- object NullPropagation extends Rule[LogicalPlan]
Replaces Expressions that can be statically evaluated with equivalent Literal values. This rule is more specific with Null value propagation from bottom to top of the expression tree.
- object ObjectSerializerPruning extends Rule[LogicalPlan]
Prunes unnecessary object serializers from the query plan. This rule prunes both individual serializers and nested fields in serializers.
- object OptimizeCsvJsonExprs extends Rule[LogicalPlan]
Simplify redundant csv/json related expressions.
The optimization includes: 1. JsonToStructs(StructsToJson(child)) => child. 2. Prune unnecessary columns from GetStructField/GetArrayStructFields + JsonToStructs. 3. CreateNamedStruct(JsonToStructs(json).col1, JsonToStructs(json).col2, ...) => If(IsNull(json), nullStruct, KnownNotNull(JsonToStructs(prunedSchema, ..., json))) if JsonToStructs(json) is shared among all fields of CreateNamedStruct.
prunedSchema contains all accessed fields in the original CreateNamedStruct. 4. Prune unnecessary columns from GetStructField + CsvToStructs.
- object OptimizeIn extends Rule[LogicalPlan]
Optimize IN predicates: 1. Converts the predicate to false when the list is empty and the value is not nullable. 2. Removes literal repetitions. 3. Replaces (value, seq[Literal]) with optimized version (value, HashSet[Literal]) which is much faster.
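The three optimizations can be sketched as compiling an IN-list into a membership predicate. optimizeIn is an illustrative name, and Int stands in for an arbitrary non-nullable value type:

```scala
// Hedged sketch: dedup the IN-list and probe a hash set instead of
// scanning a sequence of literals.
def optimizeIn(list: Seq[Int]): Int => Boolean = {
  if (list.isEmpty) {
    _ => false            // empty list + non-nullable value => always false
  } else {
    val set = list.toSet  // removes literal repetitions; O(1) membership test
    v => set.contains(v)
  }
}
```

A list of n literals thus costs O(1) per probe instead of O(n), which is where the rule's speedup comes from.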
- object OptimizeOneRowPlan extends Rule[LogicalPlan]
This rule is applied by both the normal and the AQE Optimizer. It optimizes plans using max rows:
- if the max rows of the child of sort is less than or equal to 1, remove the sort
- if the max rows per partition of the child of local sort is less than or equal to 1, remove the local sort
- if the max rows of the child of an aggregate is less than or equal to 1 and the aggregate is grouping-only (including the rewritten distinct plan), convert the aggregate to a project
- if the max rows of the child of an aggregate is less than or equal to 1, set distinct to false in all aggregate expressions
- object OptimizeOneRowRelationSubquery extends Rule[LogicalPlan]
This rule optimizes subqueries with OneRowRelation as leaf nodes.
- object OptimizeRand extends Rule[LogicalPlan]
Rand() generates a random column with i.i.d. uniformly distributed values in [0, 1), so comparing a double literal value with 1.0 or 0.0 can eliminate Rand() in a binary comparison.
1. Converts the binary comparison to true literal when the comparison value must be true. 2. Converts the binary comparison to false literal when the comparison value must be false.
- object OptimizeRepartition extends Rule[LogicalPlan]
Replaces the RepartitionByExpression numPartitions with 1 if all partition expressions are foldable and the user has not specified a value.
- object OptimizeUpdateFields extends Rule[LogicalPlan]
Optimizes UpdateFields expression chains.
- object OptimizeWindowFunctions extends Rule[LogicalPlan]
Replaces first(col) to nth_value(col, 1) for better performance.
- object PropagateEmptyRelation extends PropagateEmptyRelationBase
This rule runs in the normal optimizer
- object PruneFilters extends Rule[LogicalPlan] with PredicateHelper
Removes filters that can be evaluated trivially.
Removes filters that can be evaluated trivially. This can be done through the following ways: 1) by eliding the filter for cases where it will always evaluate to true. 2) by substituting a dummy empty relation when the filter will always evaluate to false. 3) by eliminating the always-true conditions given the constraints on the child's output.
- object PullOutGroupingExpressions extends Rule[LogicalPlan]
This rule ensures that Aggregate nodes don't contain complex grouping expressions in the optimization phase.
This rule ensures that Aggregate nodes don't contain complex grouping expressions in the optimization phase.
Complex grouping expressions are pulled out to a Project node under Aggregate and are referenced in both grouping expressions and aggregate expressions without aggregate functions. These references ensure that optimization rules don't change the aggregate expressions to invalid ones that no longer refer to any grouping expressions and also simplify the expression transformations on the node (need to transform the expression only once).
For example, in the following query Spark shouldn't optimize the aggregate expression
Not(IsNull(c)) to IsNotNull(c) as the grouping expression is IsNull(c): SELECT not(c IS NULL) FROM t GROUP BY c IS NULL Instead, the aggregate expression references a _groupingexpression attribute: Aggregate [_groupingexpression#233], [NOT _groupingexpression#233 AS (NOT (c IS NULL))#230] +- Project [isnull(c#219) AS _groupingexpression#233] +- LocalRelation [c#219]
- object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
Pull out all (outer) correlated predicates from a given subquery.
Pull out all (outer) correlated predicates from a given subquery. This method removes the correlated predicates from subquery Filters and adds the references of these predicates to all intermediate Project and Aggregate clauses (if they are missing) in order to be able to evaluate the predicates at the top level.
TODO: Look to merge this rule with RewritePredicateSubquery.
- object PushDownLeftSemiAntiJoin extends Rule[LogicalPlan] with PredicateHelper with JoinSelectionHelper
This rule is a variant of PushPredicateThroughNonJoin which can handle pushing down Left semi and Left Anti joins below the following operators.
This rule is a variant of PushPredicateThroughNonJoin which can handle pushing down Left semi and Left Anti joins below the following operators. 1) Project 2) Window 3) Union 4) Aggregate 5) Other permissible unary operators; please see PushPredicateThroughNonJoin.canPushThrough.
- object PushDownPredicates extends Rule[LogicalPlan]
The unified version for predicate pushdown of normal operators and joins.
The unified version for predicate pushdown of normal operators and joins. This rule improves performance of predicate pushdown for cascading joins such as: Filter-Join-Join-Join. Most predicates can be pushed down in a single pass.
- object PushExtraPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper
Try pushing down disjunctive join condition into left and right child.
Try pushing down disjunctive join condition into left and right child. To avoid expanding the join condition, the join condition will be kept in the original form even when predicate pushdown happens.
- object PushFoldableIntoBranches extends Rule[LogicalPlan]
Push the foldable expression into (if / case) branches.
- object PushLeftSemiLeftAntiThroughJoin extends Rule[LogicalPlan] with PredicateHelper
This rule is a variant of PushPredicateThroughJoin which can handle pushing down Left semi and Left Anti joins below a join operator.
This rule is a variant of PushPredicateThroughJoin which can handle pushing down Left semi and Left Anti joins below a join operator. The allowable join types are: 1) Inner 2) Cross 3) LeftOuter 4) RightOuter
TODO: Currently this rule can push down the left semi or left anti joins to either left or right leg of the child join. This matches the behaviour of
PushPredicateThroughJoin when the left semi or left anti join is in expression form. We need to explore the possibility of pushing the left semi/anti joins to both legs of the join if the join condition refers to both the left and right legs of the child join.
- object PushPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper
Pushes down Filter operators where the condition can be evaluated using only the attributes of the left or right side of a join.
Pushes down Filter operators where the condition can be evaluated using only the attributes of the left or right side of a join. Other Filter conditions are moved into the condition of the Join. It also pushes down the join filter, where the condition can be evaluated using only the attributes of the left or right side of the subquery when applicable. Check https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior for more details
- object PushPredicateThroughNonJoin extends Rule[LogicalPlan] with PredicateHelper
Pushes Filter operators through many operators iff: 1) the operator is deterministic 2) the predicate is deterministic and the operator will not change any rows.
Pushes Filter operators through many operators iff: 1) the operator is deterministic 2) the predicate is deterministic and the operator will not change any rows.
This heuristic is valid assuming the expression evaluation cost is minimal.
- object PushProjectionThroughLimit extends Rule[LogicalPlan]
Pushes Project operator through Limit operator.
- object PushProjectionThroughUnion extends Rule[LogicalPlan]
Pushes Project operator to both sides of a Union operator.
Pushes Project operator to both sides of a Union operator. Operations that are safe to pushdown are listed as follows. Union: Right now, Union means UNION ALL, which does not de-duplicate rows. So, it is safe to pushdown Filters and Projections through it. Filter pushdown is handled by another rule PushDownPredicates. Once we add UNION DISTINCT, we will not be able to pushdown Projections.
- object PushdownPredicatesAndPruneColumnsForCTEDef extends Rule[LogicalPlan]
Infer predicates and column pruning for CTERelationDef from its reference points, and push the disjunctive predicates as well as the union of attributes down the CTE plan.
- object ReassignLambdaVariableID extends Rule[LogicalPlan]
Reassigns per-query unique IDs to LambdaVariables, whose original IDs are globally unique.
Reassigns per-query unique IDs to LambdaVariables, whose original IDs are globally unique. This can help Spark to hit the codegen cache more often and improve performance.
- object RemoveDispensableExpressions extends Rule[LogicalPlan]
Removes nodes that are not necessary.
- object RemoveLiteralFromGroupExpressions extends Rule[LogicalPlan]
Removes literals from group expressions in Aggregate, as they have no effect on the result and only make the grouping key bigger.
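An illustrative sketch of the idea (not Spark's implementation, which operates on Catalyst expressions): literal grouping keys never change which rows group together, so they can be dropped. Here a tuple ('lit', v) marks a literal and plain strings stand for column references.

```python
def prune_group_keys(keys):
    """Drop literal grouping keys; keep one if every key is a literal,
    so the aggregate still has a (constant) grouping expression."""
    pruned = [k for k in keys if not (isinstance(k, tuple) and k[0] == "lit")]
    return pruned if pruned else keys[:1]

assert prune_group_keys(["col_a", ("lit", 1), "col_b"]) == ["col_a", "col_b"]
assert prune_group_keys([("lit", 1), ("lit", 2)]) == [("lit", 1)]
```

The all-literal fallback is an assumption of this sketch; the real rule substitutes a placeholder literal in that case to preserve the single-group semantics.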
- object RemoveNoopOperators extends Rule[LogicalPlan]
Remove no-op operators from the query plan that do not make any modifications.
- object RemoveNoopUnion extends Rule[LogicalPlan]
Simplify the children of Union or remove no-op Union from the query plan that do not make any modifications to the query.
- object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper
Remove redundant aggregates from a query plan.
Remove redundant aggregates from a query plan. A redundant aggregate is an aggregate whose only goal is to keep distinct values, while its parent aggregate would ignore duplicate values.
- object RemoveRedundantAliases extends Rule[LogicalPlan]
Remove redundant aliases from a query plan.
Remove redundant aliases from a query plan. A redundant alias is an alias that does not change the name or metadata of a column, and does not deduplicate it.
- object RemoveRepetitionFromGroupExpressions extends Rule[LogicalPlan]
Removes repetition from group expressions in Aggregate, as they have no effect on the result and only make the grouping key bigger.
- object ReorderAssociativeOperator extends Rule[LogicalPlan]
Reorder associative integral-type operators and fold all constants into one.
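A hedged sketch of this reordering in Python (Spark does this on Catalyst expression trees): an associative chain of integer additions is flattened, all constants are summed into a single literal, and the symbolic terms are kept. Expressions are modeled as nested ('+', left, right) tuples with strings as column references.

```python
def flatten_add(e):
    """Flatten a nested chain of '+' nodes into a list of terms."""
    if isinstance(e, tuple) and e[0] == "+":
        return flatten_add(e[1]) + flatten_add(e[2])
    return [e]

def reorder_add(e):
    """Fold all integer constants in the chain into one trailing literal."""
    terms = flatten_add(e)
    const = sum(t for t in terms if isinstance(t, int))
    symbols = [t for t in terms if not isinstance(t, int)]
    return symbols + ([const] if const else [])

# (a + 1) + (b + (2 + 3))  ==>  a + b + 6
expr = ("+", ("+", "a", 1), ("+", "b", ("+", 2, 3)))
assert reorder_add(expr) == ["a", "b", 6]
```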
- object ReorderJoin extends Rule[LogicalPlan] with PredicateHelper
Reorder the joins and push all the conditions into join, so that the bottom ones have at least one condition.
Reorder the joins and push all the conditions into join, so that the bottom ones have at least one condition.
The order of joins will not be changed if all of them already have at least one condition.
If star schema detection is enabled, reorder the star join plans based on heuristics.
- object ReplaceCTERefWithRepartition extends Rule[LogicalPlan]
Replaces CTE references that have not been previously inlined with Repartition operations which will then be planned as shuffles and reused across different reference points.
Replaces CTE references that have not been previously inlined with Repartition operations which will then be planned as shuffles and reused across different reference points.
Note that this rule should be called at the very end of the optimization phase to best guarantee that CTE repartition shuffles are reused.
- object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan]
Replaces logical Deduplicate operator with an Aggregate operator.
- object ReplaceDistinctWithAggregate extends Rule[LogicalPlan]
Replaces logical Distinct operator with an Aggregate operator.
Replaces logical Distinct operator with an Aggregate operator.
SELECT DISTINCT f1, f2 FROM t ==> SELECT f1, f2 FROM t GROUP BY f1, f2
- object ReplaceExceptWithAntiJoin extends Rule[LogicalPlan]
Replaces logical Except operator with a left-anti Join operator.
Replaces logical Except operator with a left-anti Join operator.
SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2 ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
Note: 1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL. 2. This rule has to be done after de-duplicating the attributes; otherwise, the generated join conditions will be incorrect.
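A small Python sketch of the EXCEPT DISTINCT to LEFT ANTI JOIN rewrite above (illustrative only): the join condition uses null-safe equality (<=>), so NULL keys on both sides compare equal, and the result is de-duplicated as EXCEPT DISTINCT requires.

```python
def null_safe_eq(a, b):
    # Python's None == None is True, matching SQL's <=> on NULLs.
    return a == b

def except_distinct(left, right):
    """Keep distinct left rows with no <=>-matching row on the right."""
    return {row for row in left
            if not any(all(null_safe_eq(x, y) for x, y in zip(row, r))
                       for r in right)}

left = [(1, "x"), (1, "x"), (None, "y"), (2, "z")]
right = [(None, "y"), (2, "z")]
assert except_distinct(left, right) == {(1, "x")}
```

Note how (None, "y") is correctly removed: with plain SQL equality its NULL key would never match, which is exactly why the rewrite must use <=>.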
- object ReplaceExceptWithFilter extends Rule[LogicalPlan]
If one or both of the datasets in the logical Except operator are purely transformed using Filter, this rule will replace logical Except operator with a Filter operator by flipping the filter condition of the right child.
If one or both of the datasets in the logical Except operator are purely transformed using Filter, this rule will replace logical Except operator with a Filter operator by flipping the filter condition of the right child.
SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5 ==> SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND (a1 is null OR a1 <> 5)
Note: Before flipping the filter condition of the right node, we should: 1. Combine all of its Filters; 2. Update the attribute references to the left node; 3. Add a Coalesce(condition, False) (to take NULL values in the condition into account).
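The role of Coalesce(condition, False) can be shown with a small Python simulation of SQL's three-valued logic (illustrative, not Spark's code): rows where the flipped predicate evaluates to NULL must still survive the negation, matching the "a1 is null OR a1 <> 5" condition in the rewritten query above.

```python
def coalesce(x, default):
    return default if x is None else x

def three_valued_eq(a, b):
    # SQL equality: NULL if either side is NULL.
    return None if a is None or b is None else a == b

rows = [(1, 12), (5, 12), (None, 12)]  # (a1, a2) rows, all with a2 = 12
# Flipped right-side filter: NOT coalesce(a1 = 5, False)
kept = {r for r in rows if not coalesce(three_valued_eq(r[0], 5), False)}
assert kept == {(1, 12), (None, 12)}   # only the a1 = 5 row is excluded
```

Without the Coalesce, NOT NULL would stay NULL and the (None, 12) row would be wrongly dropped.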
- object ReplaceExpressions extends Rule[LogicalPlan]
Finds all the RuntimeReplaceable expressions that are unevaluable and replaces them with semantically equivalent expressions that can be evaluated.
Finds all the RuntimeReplaceable expressions that are unevaluable and replaces them with semantically equivalent expressions that can be evaluated.
This is mainly used to provide compatibility with other databases. A few examples: we use this to support "left" by replacing it with "substring"; we use this to replace Every and Any with Min and Max respectively.
- object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan]
Replaces logical Intersect operator with a left-semi Join operator.
Replaces logical Intersect operator with a left-semi Join operator.
SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
Note: 1. This rule is only applicable to INTERSECT DISTINCT. Do not use it for INTERSECT ALL. 2. This rule has to be done after de-duplicating the attributes; otherwise, the generated join conditions will be incorrect.
- object ReplaceNullWithFalseInPredicate extends Rule[LogicalPlan]
A rule that replaces Literal(null, BooleanType) with FalseLiteral, if possible, in the search condition of the WHERE/HAVING/ON(JOIN) clauses, which contain an implicit Boolean operator "(search condition) = TRUE".
A rule that replaces Literal(null, BooleanType) with FalseLiteral, if possible, in the search condition of the WHERE/HAVING/ON(JOIN) clauses, which contain an implicit Boolean operator "(search condition) = TRUE". The replacement is only valid when Literal(null, BooleanType) is semantically equivalent to FalseLiteral when evaluating the whole search condition. Please note that FALSE and NULL are not exchangeable in most cases, when the search condition contains NOT and NULL-tolerant expressions. Thus, the rule is very conservative and applicable in very limited cases.
For example, Filter(Literal(null, BooleanType)) is equal to Filter(FalseLiteral).
Another example containing branches is Filter(If(cond, FalseLiteral, Literal(null, _))); this can be optimized to Filter(If(cond, FalseLiteral, FalseLiteral)), and eventually Filter(FalseLiteral).
Moreover, this rule also transforms predicates in all If expressions as well as branch conditions in all CaseWhen expressions, even if they are not part of the search conditions.
For example, Project(If(And(cond, Literal(null)), Literal(1), Literal(2))) can be simplified into Project(Literal(2)).
- object ReplaceUpdateFieldsExpression extends Rule[LogicalPlan]
Replaces UpdateFields expression with an evaluable expression.
- object RewriteAsOfJoin extends Rule[LogicalPlan]
Replaces logical AsOfJoin operator using a combination of Join and Aggregate operator.
Replaces logical AsOfJoin operator using a combination of Join and Aggregate operator.
Input Pseudo-Query:
SELECT * FROM left ASOF JOIN right ON (condition, as_of on(left.t, right.t), tolerance)
Rewritten Query:
SELECT left.*, __right__.* FROM ( SELECT left.*, ( SELECT MIN_BY(STRUCT(right.*), left.t - right.t) AS __nearest_right__ FROM right WHERE condition AND left.t >= right.t AND right.t >= left.t - tolerance ) as __right__ FROM left ) WHERE __right__ IS NOT NULL
- object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] with AliasHelper
This rule rewrites correlated ScalarSubquery expressions into LEFT OUTER joins.
- object RewriteDistinctAggregates extends Rule[LogicalPlan]
This rule rewrites an aggregate query with distinct aggregations into an expanded double aggregation in which the regular aggregation expressions and every distinct clause is aggregated in a separate group.
This rule rewrites an aggregate query with distinct aggregations into an expanded double aggregation in which the regular aggregation expressions and every distinct clause is aggregated in a separate group. The results are then combined in a second aggregate.
First example: query without filter clauses (in scala):
val data = Seq( ("a", "ca1", "cb1", 10), ("a", "ca1", "cb2", 5), ("b", "ca1", "cb1", 13)) .toDF("key", "cat1", "cat2", "value") data.createOrReplaceTempView("data") val agg = data.groupBy($"key") .agg( count_distinct($"cat1").as("cat1_cnt"), count_distinct($"cat2").as("cat2_cnt"), sum($"value").as("total"))
This translates to the following (pseudo) logical plan:
Aggregate( key = ['key] functions = [COUNT(DISTINCT 'cat1), COUNT(DISTINCT 'cat2), sum('value)] output = ['key, 'cat1_cnt, 'cat2_cnt, 'total]) LocalTableScan [...]
This rule rewrites this logical plan to the following (pseudo) logical plan:
Aggregate( key = ['key] functions = [count('cat1) FILTER (WHERE 'gid = 1), count('cat2) FILTER (WHERE 'gid = 2), first('total) ignore nulls FILTER (WHERE 'gid = 0)] output = ['key, 'cat1_cnt, 'cat2_cnt, 'total]) Aggregate( key = ['key, 'cat1, 'cat2, 'gid] functions = [sum('value)] output = ['key, 'cat1, 'cat2, 'gid, 'total]) Expand( projections = [('key, null, null, 0, cast('value as bigint)), ('key, 'cat1, null, 1, null), ('key, null, 'cat2, 2, null)] output = ['key, 'cat1, 'cat2, 'gid, 'value]) LocalTableScan [...]
Second example: aggregate function without distinct and with filter clauses (in sql):
SELECT COUNT(DISTINCT cat1) as cat1_cnt, COUNT(DISTINCT cat2) as cat2_cnt, SUM(value) FILTER (WHERE id > 1) AS total FROM data GROUP BY key
This translates to the following (pseudo) logical plan:
Aggregate( key = ['key] functions = [COUNT(DISTINCT 'cat1), COUNT(DISTINCT 'cat2), sum('value) FILTER (WHERE 'id > 1)] output = ['key, 'cat1_cnt, 'cat2_cnt, 'total]) LocalTableScan [...]
This rule rewrites this logical plan to the following (pseudo) logical plan:
Aggregate( key = ['key] functions = [count('cat1) FILTER (WHERE 'gid = 1), count('cat2) FILTER (WHERE 'gid = 2), first('total) ignore nulls FILTER (WHERE 'gid = 0)] output = ['key, 'cat1_cnt, 'cat2_cnt, 'total]) Aggregate( key = ['key, 'cat1, 'cat2, 'gid] functions = [sum('value) FILTER (WHERE 'id > 1)] output = ['key, 'cat1, 'cat2, 'gid, 'total]) Expand( projections = [('key, null, null, 0, cast('value as bigint), 'id), ('key, 'cat1, null, 1, null, null), ('key, null, 'cat2, 2, null, null)] output = ['key, 'cat1, 'cat2, 'gid, 'value, 'id]) LocalTableScan [...]
Third example: aggregate function with distinct and filter clauses (in sql):
SELECT COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt, COUNT(DISTINCT cat2) FILTER (WHERE id > 2) as cat2_cnt, SUM(value) FILTER (WHERE id > 3) AS total FROM data GROUP BY key
This translates to the following (pseudo) logical plan:
Aggregate( key = ['key] functions = [COUNT(DISTINCT 'cat1) FILTER (WHERE 'id > 1), COUNT(DISTINCT 'cat2) FILTER (WHERE 'id > 2), sum('value) FILTER (WHERE 'id > 3)] output = ['key, 'cat1_cnt, 'cat2_cnt, 'total]) LocalTableScan [...]
This rule rewrites this logical plan to the following (pseudo) logical plan:
Aggregate( key = ['key] functions = [count('cat1) FILTER (WHERE 'gid = 1 and 'max_cond1), count('cat2) FILTER (WHERE 'gid = 2 and 'max_cond2), first('total) ignore nulls FILTER (WHERE 'gid = 0)] output = ['key, 'cat1_cnt, 'cat2_cnt, 'total]) Aggregate( key = ['key, 'cat1, 'cat2, 'gid] functions = [max('cond1), max('cond2), sum('value) FILTER (WHERE 'id > 3)] output = ['key, 'cat1, 'cat2, 'gid, 'max_cond1, 'max_cond2, 'total]) Expand( projections = [('key, null, null, 0, null, null, cast('value as bigint), 'id), ('key, 'cat1, null, 1, 'id > 1, null, null, null), ('key, null, 'cat2, 2, null, 'id > 2, null, null)] output = ['key, 'cat1, 'cat2, 'gid, 'cond1, 'cond2, 'value, 'id]) LocalTableScan [...]
The rule does the following things here: 1. Expand the data. There are three aggregation groups in this query:
i. the non-distinct group; ii. the distinct 'cat1 group; iii. the distinct 'cat2 group. An expand operator is inserted to expand the child data for each group. The expand will null out all unused columns for the given group; this must be done in order to ensure correctness later on. Groups can be identified by a group id (gid) column added by the expand operator. If a distinct group has a filter clause, the expand will calculate the filter and output its result (e.g. cond1), which will be used to calculate the global conditions (e.g. max_cond1) equivalent to the filter clauses. 2. De-duplicate the distinct paths and aggregate the non-aggregate path. The group by clause of this aggregate consists of the original group by clause, all the requested distinct columns and the group id. Both the de-duplication of the distinct columns and the aggregation of the non-distinct group take advantage of the fact that we group by the group id (gid) and that we have nulled out all non-relevant columns for the given group. If a distinct group has a filter clause, we will use max to aggregate the results (e.g. cond1) of the filter output in the previous step. These aggregates will output the global conditions (e.g. max_cond1) equivalent to the filter clauses. 3. Aggregating the distinct groups and combining this with the results of the non-distinct aggregation. In this step we use the group id and the global condition to filter the inputs for the aggregate functions. If the global condition (e.g. max_cond1) is true, it means at least one row of a distinct value satisfies the filter, so this distinct value should be included in the aggregate function. The results of the non-distinct group are 'aggregated' by using the first operator; it might be more elegant to use the native UDAF merge mechanism for this in the future.
This rule duplicates the input data two or more times (# distinct groups + an optional non-distinct group). This will put quite a bit of memory pressure on the aggregate and exchange operators involved. Keeping the number of distinct groups as low as possible should be a priority; we could improve this in the current rule by applying more advanced expression canonicalization techniques.
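The expand + double-aggregation steps above can be sketched in Python for a much simplified case (one distinct count plus a sum, one key; illustrative only, not Spark's implementation): SELECT key, COUNT(DISTINCT cat), SUM(value) GROUP BY key.

```python
from collections import defaultdict

data = [("a", "c1", 10), ("a", "c1", 5), ("a", "c2", 13)]

# 1. Expand: each row becomes one row per group.  gid 0 carries value for
#    the non-distinct SUM; gid 1 carries cat with value nulled out.
expanded = [r for (k, c, v) in data
            for r in [(k, None, 0, v), (k, c, 1, None)]]

# 2. First aggregate: group by (key, cat, gid), summing value.  This
#    de-duplicates the distinct cat values within each (key, gid=1) group.
first = defaultdict(int)
for k, c, g, v in expanded:
    first[(k, c, g)] += v or 0

# 3. Second aggregate: group by key; count the gid=1 rows (distinct cats)
#    and pick up the gid=0 partial sum as the total.
result = defaultdict(lambda: [0, 0])
for (k, c, g), s in first.items():
    if g == 1:
        result[k][0] += 1   # distinct cat count
    else:
        result[k][1] = s    # total sum
assert dict(result) == {"a": [2, 28]}
```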
- object RewriteExceptAll extends Rule[LogicalPlan]
Replaces logical Except operator using a combination of Union, Aggregate and Generate operator.
Replaces logical Except operator using a combination of Union, Aggregate and Generate operator.
Input Query :
SELECT c1 FROM ut1 EXCEPT ALL SELECT c1 FROM ut2
Rewritten Query:
SELECT c1 FROM ( SELECT replicate_rows(sum_val, c1) FROM ( SELECT c1, sum_val FROM ( SELECT c1, sum(vcol) AS sum_val FROM ( SELECT 1L as vcol, c1 FROM ut1 UNION ALL SELECT -1L as vcol, c1 FROM ut2 ) AS union_all GROUP BY union_all.c1 ) WHERE sum_val > 0 ) )
- object RewriteIntersectAll extends Rule[LogicalPlan]
Replaces logical Intersect operator using a combination of Union, Aggregate and Generate operator.
Replaces logical Intersect operator using a combination of Union, Aggregate and Generate operator.
Input Query :
SELECT c1 FROM ut1 INTERSECT ALL SELECT c1 FROM ut2
Rewritten Query:
SELECT c1 FROM ( SELECT replicate_row(min_count, c1) FROM ( SELECT c1, If (vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt) AS min_count FROM ( SELECT c1, count(vcol1) as vcol1_cnt, count(vcol2) as vcol2_cnt FROM ( SELECT true as vcol1, null as vcol2, c1 FROM ut1 UNION ALL SELECT null as vcol1, true as vcol2, c1 FROM ut2 ) AS union_all GROUP BY c1 HAVING vcol1_cnt >= 1 AND vcol2_cnt >= 1 ) ) )
- object RewriteLateralSubquery extends Rule[LogicalPlan]
This rule rewrites LateralSubquery expressions into joins.
- object RewriteNonCorrelatedExists extends Rule[LogicalPlan]
Rewrite a non-correlated EXISTS subquery to use ScalarSubquery: WHERE EXISTS (SELECT A FROM TABLE B WHERE COL1 > 10) will be rewritten to WHERE (SELECT 1 FROM (SELECT A FROM TABLE B WHERE COL1 > 10) LIMIT 1) IS NOT NULL
- object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper
This rule rewrites predicate sub-queries into left semi/anti joins.
This rule rewrites predicate sub-queries into left semi/anti joins. The following predicates are supported: a. EXISTS/NOT EXISTS will be rewritten as semi/anti join, unresolved conditions in Filter will be pulled out as the join conditions. b. IN/NOT IN will be rewritten as semi/anti join, unresolved conditions in the Filter will be pulled out as join conditions, value = selected column will also be used as join condition.
- object SimpleTestOptimizer extends SimpleTestOptimizer
An optimizer used in test code.
An optimizer used in test code.
To ensure extensibility, we leave the standard rules in the abstract optimizer, while specific rules go to the subclasses.
- object SimplifyBinaryComparison extends Rule[LogicalPlan] with PredicateHelper with ConstraintHelper
Simplifies binary comparisons with semantically-equal expressions: 1) Replace '<=>' with 'true' literal.
Simplifies binary comparisons with semantically-equal expressions: 1) Replace '<=>' with 'true' literal. 2) Replace '=', '<=', and '>=' with 'true' literal if both operands are non-nullable. 3) Replace '<' and '>' with 'false' literal if both operands are non-nullable. 4) Unwrap '=', '<=>' if one side is a boolean literal.
- object SimplifyCaseConversionExpressions extends Rule[LogicalPlan]
Removes the inner case conversion expressions that are unnecessary because the inner conversion is overwritten by the outer one.
- object SimplifyCasts extends Rule[LogicalPlan]
Removes Casts that are unnecessary because the input is already the correct type.
- object SimplifyConditionals extends Rule[LogicalPlan]
Simplifies conditional expressions (if / case).
- object SimplifyExtractValueOps extends Rule[LogicalPlan]
Simplify redundant CreateNamedStruct, CreateArray and CreateMap expressions.
- object SpecialDatetimeValues extends Rule[LogicalPlan]
Replaces casts of special datetime strings with their date/timestamp values if the input strings are foldable.
- object StarSchemaDetection extends PredicateHelper with SQLConfHelper
Encapsulates star-schema detection logic.
- object SupportedBinaryExpr
- object TransposeWindow extends Rule[LogicalPlan]
Transpose Adjacent Window Expressions.
Transpose Adjacent Window Expressions. - If the partition spec of the parent Window expression is compatible with the partition spec of the child window expression, transpose them.
- object UnwrapCastInBinaryComparison extends Rule[LogicalPlan]
Unwrap casts in binary comparison or In/InSet operations with patterns like the following:
- BinaryComparison(Cast(fromExp, toType), Literal(value, toType))
- BinaryComparison(Literal(value, toType), Cast(fromExp, toType))
- In(Cast(fromExp, toType), Seq(Literal(v1, toType), Literal(v2, toType), ...))
- InSet(Cast(fromExp, toType), Set(v1, v2, ...))
This rule optimizes expressions with the above pattern by either replacing the cast with simpler constructs, or moving the cast from the expression side to the literal side, which enables them to be optimized away later and pushed down to data sources.
Currently this only handles cases where: 1) fromType (of fromExp) and toType are of numeric types (i.e., short, int, float, decimal, etc.) or boolean type; 2) fromType can be safely coerced to toType without precision loss (e.g., short to int, int to long, but not long to int, nor int to boolean).
If the above conditions are satisfied, the rule checks to see if the literal value is within range (min, max), where min and max are the minimum and maximum value of fromType, respectively. If this is true then it means we may safely cast value to fromType and thus be able to move the cast to the literal side. That is:
cast(fromExp, toType) op value ==> fromExp op cast(value, fromType)
Note there are some exceptions to the above: if casting from value to fromType causes rounding up or down, the above conversion will no longer be valid. Instead, the rule does the following:
if casting value to fromType causes rounding up:
- cast(fromExp, toType) > value ==> fromExp >= cast(value, fromType)
- cast(fromExp, toType) >= value ==> fromExp >= cast(value, fromType)
- cast(fromExp, toType) === value ==> if(isnull(fromExp), null, false)
- cast(fromExp, toType) <=> value ==> false (if fromExp is deterministic)
- cast(fromExp, toType) <= value ==> fromExp < cast(value, fromType)
- cast(fromExp, toType) < value ==> fromExp < cast(value, fromType)
Similarly for the case when casting value to fromType causes rounding down.
If the value is not within range (min, max), the rule breaks the scenario into different cases and tries to replace each with simpler constructs.
if value > max, the cases are the following:
- cast(fromExp, toType) > value ==> if(isnull(fromExp), null, false)
- cast(fromExp, toType) >= value ==> if(isnull(fromExp), null, false)
- cast(fromExp, toType) === value ==> if(isnull(fromExp), null, false)
- cast(fromExp, toType) <=> value ==> false (if fromExp is deterministic)
- cast(fromExp, toType) <= value ==> if(isnull(fromExp), null, true)
- cast(fromExp, toType) < value ==> if(isnull(fromExp), null, true)
if value == max, the cases are the following:
- cast(fromExp, toType) > value ==> if(isnull(fromExp), null, false)
- cast(fromExp, toType) >= value ==> fromExp == max
- cast(fromExp, toType) === value ==> fromExp == max
- cast(fromExp, toType) <=> value ==> fromExp <=> max
- cast(fromExp, toType) <= value ==> if(isnull(fromExp), null, true)
- cast(fromExp, toType) < value ==> fromExp =!= max
Similarly for the cases when value == min and value < min.
Further, the above if(isnull(fromExp), null, false) is represented using the conjunction and(isnull(fromExp), null), to enable further optimization and filter pushdown to data sources. Similarly, if(isnull(fromExp), null, true) is represented with or(isnotnull(fromExp), null).
For In/InSet operations, the rule first transforms the expression to a sequence of EqualTos: Seq( EqualTo(Cast(fromExp, toType), Literal(v1, toType)), EqualTo(Cast(fromExp, toType), Literal(v2, toType)), ... ) and then uses the same rules as for BinaryComparison shown above to optimize each EqualTo.
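One out-of-range case above can be sketched in Python (illustrative only; the column name col and the int/bigint pairing are assumptions of the sketch): comparing cast(int_col AS BIGINT) with a bigint literal, where INT_MIN/INT_MAX are the bounds of the narrower fromType.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def unwrap_gt(value):
    """Rewrite  cast(col as bigint) > value  for a 32-bit int column,
    returning the replacement expression as a string."""
    if value >= INT_MAX:
        # No int value can exceed INT_MAX, so the comparison is false
        # (or null when col is null).
        return "if(isnull(col), null, false)"
    if value < INT_MIN:
        # Every int value exceeds anything below INT_MIN.
        return "if(isnull(col), null, true)"
    # value fits in int: safe to move the cast to the literal side.
    return f"col > cast({value} as int)"

assert unwrap_gt(2**40) == "if(isnull(col), null, false)"
assert unwrap_gt(-2**40) == "if(isnull(col), null, true)"
assert unwrap_gt(100) == "col > cast(100 as int)"
```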