Abstract:
This thesis aims to investigate string manipulation with security implications in different programming languages and to improve the state-of-the-art by applying the abstract interpretation theory to string analysis. Erroneous string manipulation is a challenging problem in software verification and, in fact, it is one of the major cause of program vulnerabilities that can be exploited by malicious users, leading to severe consequences for the affected systems. By string analysis we mean statically computing the set of string values that are possibly assigned to a variable. Like for other analysis issues, this is undecidable. Thus a certain degree of approximation is necessary in order to find evidence of bugs and vulnerabilities in string manipulating code. We take advantage of the Abstract Interpretation theory, i.e., a powerful mathematical theory that enables us to define and prove the soundness of approximations. The five main contributions of this thesis are: We introduce a new sophisticated abstract domain for C strings. The way the domain (called M-String) is conceived allows it to be tailored for specific verification tasks (e.g., detection of buffer overflows). We describe the concrete and the abstract semantics of basic string operations and prove their soundness formally. Furthermore, we provide an executable implementation of abstract operations. Using a tool that automatically lifts existing programs into the M-String domain along with an explicit-state model checker, we evaluate the accuracy of the proposed domain experimentally on real-case test programs.
We combine abstract domains resulting from the reduced product between string shape abstraction and string content abstraction, in order to improve the ability to detect inconsistent states leading to program errors without a major impact with respect to efficiency. In particular, the combinations involve some string abstract domains introduced in the literature with the segmentation domain that we instantiate for string analysis.
Completeness, in Abstract Interpretation, ensures that the analysis does not lose information with respect to the property of interest. We provide a systematic and constructive approach for generating the completion of string domains for dynamic languages, and we apply it to the refinement of existing string abstractions. Indeed, for dynamic languages, lack of string analysis completeness is a key security issue, as poorly managed string manipulation code may easily lead to significant security flaws. We also provide an effective procedure to measure the precision improvement obtained when lifting the analysis to complete domains.
Almost all the existing string abstract domains tracks information of single variables in a program (e.g., if a string contains a certain character), without inspecting their relationship with other values, causing the loss of relevant knowledge about their possible values. Thus, we introduce a generic framework that allows to formalize relational string abstract domains based on ordering relationship, and we instantiate such a framework to several domains built upon different well-known string orders (e.g., substring relationships). We implemented the domain based on substring ordering, and we provide an experimental evaluation about its effectiveness on some case studies.
We manipulate string values in the context of relational database watermarking. We propose a semantic-driven watermarking approach of relational textual databases, which marks multi-word textual attributes, exploiting the synonym substitution technique for text watermarking together with notions in semantic similarity analysis, and dealing with the semantic perturbations provoked by the watermark embedding. We show the effectiveness of our approach through an experimental evaluation. We also prove the resilience of our approach with respect to the random synonym substitution attack.