This is the second note of the online course “Static Program Analysis” developed by Yue Li and Tian Tan. The course page with materials can be found here. This note is mainly about Data Flow Analysis. Here Method Calls (Inter-procedural CFG) and Aliases are not covered, and will be introduced later (Inter-procedural Analysis and Pointer Analysis).
1. DFA
In the last note, we talk about Rice’s theorem which shows that there is no perfect static analysis, and we can either compromise soundness or completeness. This results in two types of analysis:
- may analysis: outputs information that may be true (over-approximation)
- must analysis: outputs information that must be true (under-approximation)
DFA (Data Flow Analysis) describes how data flows on CFG (Control Flow Graphs). More specifically, it shows how application-specific data (usually abstractions) flows (over/under-approximation, or may/must analysis) through the nodes (BBs, statements. By transfer functions), edges (control flows. By control-flow handling such as flow merging) of CFG of a program. Different DFA applications may have different data abstractions, different flow approximation strategies, different transfer functions, and different control-flow handlings.
In each DFA application, the main task is to associate every program point with a data-flow value that represents an abstraction of the set of all possible program states that can be observed for that point. We use IN[s] and OUT[s] to represent the program state before and after a certain statement block s, and fs() for the transfer functions. Then the DFA becomes:
- to find a solution to a set of over/under-approximation-directed constraints on the IN[s]’s and OUT[s]’s, for all statements s.
- constraints based on semantics of statements (transfer functions fs())
- constraints based on the flows of control
There are mainly two ways of analysis:
- Forward Analysis: OUT[s] = fs(IN[s])
- Backward Analysis:IN[s] = fs(OUT[s])
2. DFA Applications
2.1 Reaching Definitions Analysis
Reaching Definitions means:
- A definition d at program point p reaches a point q if there is a path from p to q such that d is not “killed” along that path.
Reaching definitions can be used to detect possible undefined variables.
Reaching Definitions Analysis is a may analysis. The analysis is usually designed as forward analysis. Data flow values, or the definitions of all the variables in a program, can be abstractly represented by bit vectors, and the transfer function and control flow are as below:
The algorithm of Reaching Definitions Analysis (may analysis usually initialize OUT[B] as empty):
2.2 Live Variables Analysis
Live variables analysis tells whether the value of variable v at program point p could be used along some path in CFG starting at p. If so, v is live at p; otherwise, v is dead at p.
Information of live variables can be used for register allocations.
Live Variables Analysis is also a may analysis. The analysis is usually designed as backward analysis. Data flow values, or all the variables in a program, can be abstractly represented by bit vectors, and the transfer function and control flow are as below:
The algorithm of Live Variables Analysis:
2.3 Available Expressions Analysis
Available Expressions means:
- An expression x op y is available at program point p if
- all paths from the entry to p must pass through the evaluation of x op y
- after the last evaluation of x op y, there is no redefinition of x or y
This definition means at program point p, we can replace expression x op y by the result of its last evaluation. The information of available expressions can be used for detecting global common subexpressions.
This analysis is a must analysis. The analysis is usually designed as forward analysis. Data flow values, or all the expressions in a program, can be abstractly represented by bit vectors, and the transfer function and control flow are as below:
The algorithm of Available Expressions Analysis (unlike may analysis, must analysis usually initialize the OUT[B] as top/full set):
2.4 Analysis Comparison
3. DFA Foundations
Given a CFG (program) with k nodes, the iterative algorithm updates OUT[n] for every node n in each iteration. Each iteration takes an action F:V^k → V^k. X is a fixed point of function F if X = F(X). The iterative algorithm reaches a fixed point.
3.1 Discrete Mathematics
To fully understand DFA analysis, let us first refresh some concepts in discrete math.
- Partial Order:
- Upper and Lower Bounds of a Partial Order set:
For example, for a poset S, the upper and lower bounds, the least upper and lower bounds are shown as below:
Not every poset has Least Upper or Greatest Lower Bounds, but if a poset has, it will be unique.
- Lattice: A poset is a lattice if every pair of its elements has a least upper bound and a greatest lower bound.
Semilattice: A poset is a semilattice if every pair of its elements has a least upper bound (join semilattice) OR a greatest lower bound (meet semilattice).
Complete Lattice: All subsets of a lattice have a least upper bound and a greatest lower bound.
Every finite lattice (P is finite) is a complete lattice. Complete lattice is mostly focused in DFA.
Product Lattice:
A product lattice is a lattice. If a product lattice L is a product of complete lattices, then L is also complete.
3.2 Data Flow Analysis Framework via Lattice
Data flow analysis can be seen as iteratively applying transfer functions and meet/join operations on the values of a lattice. A data flow analysis framework (D, L, F) consists of:
- D: a direction of data flow: forwards or backwards
- L: a lattice including domain of the values V and a meet ⊓ or join ⊔ operator (usually meet is used for must analysis, and join is for may analysis)
- F: a family of transfer functions from V to V
The iterative algorithm (or the IN/OUT equation system) produces a solution to a data flow analysis:
- The algorithm is guaranteed to terminate or reach to a fixed point.
- The solution is the best one (greatest or least fixed point).
- The complexity of the algorithm: worst case is the product of the lattice height and the number of nodes in the CFG, which is also the number of total iterations.
The following figure provides a lattice view of may and must analysis:
3.3 MOP
Meet-Over-All-Paths, or MOP, is another solution to DFA. MOP computes the data-flow values at the end of each path and apply join/meet operator to these values to find their least upper or greatest lower bound. Since some paths may be not executable, it may be not fully precise. It is also impractical because it is unbounded and not enumerable.
Compare to our iterative algorithm:
3.4 Constant Propagation
Given a variable x at program point p, Constant Propagation determines whether x is guaranteed to hold a constant value at p. It is a must analysis, and not distributive.
The data flow analysis framework for Constant Propagation consists of:
- D: forwards direction
- L: lattice. domain of values (UNDEF - constants - NAC), meet operator.
- F: transfer function.
3.5 Worklist Algorithm
It is an optimization of our Iterative Algorithm. It avoids repeating calculation of BB in each iteration: