I have been taking the online course of Nanjing University developed by Yue Li and Tian Tan in Spring 2020. This is the first note of the course. The course page with materials can be found here. This note consists of the introduction of static analysis and IR.
1. Introduction
Static Programs Analysis falls under the application branch of programming languages. As programs has become larger and more complicated, it is more and more important and essential to use static analysis to ensure the quality of such programs.
Static Analysis analyzes a program to reason about its behaviors (e.g. type indication, call hierarchy) and check if it satisfies certain criterion (e.g. bug detection, memory leak, private information leak) BEFORE running the program.
The ideal solution would be that the static analysis can provide a result showing the exact answers to all the non-trivial properties we want to analyze. Unfortunately, based on Rice’s Theorem, such approach is not exist:
- Rice’s Theorem: Any non-trivial property of the behavior of programs in a r.e.(recursively enumerable - recognizable by a Turing-machine) language is undecidable.
Even though there is no perfect static analysis, we can still do useful static analysis which can give us either sound(compromise completeness, have false positives) or complete(compromise soundness, have false negatives) results. Most static analysis choose the former (sound but not fully precise static analysis), e.g. bug detection.
In most cases, static analysis can be summarized into two steps:
- Abstraction: convert concrete domain into abstract domain, e.g. ‘100,1,-1,0,w/0’ to ‘+,-,0,unknown’.
- Over-approximation
- Transfer functions: define how to evaluate statements on abstract values, based on the analysis problem and the semantics of the statements.
- Control flows: abstract and approximate the paths of the statements such as flow merging (as a way of over-approximation).
2. Intermediate Representation
Before diving into static analysis, it is important to introduce how the source code become an executable program, and in which step the static analysis takes part. The process of transferring source code to machine code is usually by using compilers, which consists of the following steps, where the CFG (control flow graphs) also happens during the IR and Static Analysis step:
2.1 AST vs. IR (3AC, SSA)
There are two most important and commonly used representations when analyzing programs: AST (abstract syntax tree) and IR (intermediate representation). As the name of AST implies, it is a tree structure to express a program, whereas IR uses 3AC (3-address-code, a common IR language where there is at most one operator on the right side of an instruction. Each 3AC contains at most 3 addresses.) or SSA (static single assignment, another common IR language where every variable is with distinct name.). The following shows the difference of AST and IR using 3AC:
The main difference between 3AC and SSA is that SSA gives each definition a new name and propagate the new name to subsequent uses:
In thhe case of control flow merging, SSA will assign a new operator phi to select values at the merge nodes:
From the previous figure of compilers, we can see the differences between AST (abstract syntax tree) and IR (intermediate representation):
- AST
- high-level and closed to grammar structure
- usually language dependent
- suitable for fast type checking
- lack of control flow information
- IR
- low-level and closed to machine code
- usually language independent
- compact and uniform
- contains control flow information
- usually considered as the basis for static analysis
2.2 CFG and BB
Control Flow Analysis usually refers to building CFG (control flow graphs) which serves as thhe basic structure for static analysis. The node of a CFG can be a 3AC instruction, but usually a BB (Basic Block, maximal sequences of consecutive 3AC instructions with the properties that can be entered only at the beginning AND exited only at the end).