Abstract:
In recent years, Large Language Models (LLMs) have gained widespread popularity owing to their remarkable ability to understand, process, and generate human language. The rapid advancement of these models has driven their growing adoption across a variety of industrial settings. However, deploying them raises significant challenges, particularly regarding security and computational efficiency. LLMs are notably vulnerable to small perturbations of their input: even minor changes, such as slight modifications to the input text or the injection of seemingly random suffixes, can lead a model to change its decision or to generate incorrect, biased, or harmful content. This thesis examines the relationship between model quantization, the process of reducing the numerical precision of neural network weights, and the transferability of adversarial attacks against LLMs. Its primary goals are to assess the effectiveness of adversarial attacks and to determine whether attacks crafted on quantized models transfer successfully to their full-precision counterparts, exposing potential security risks. Through experiments spanning a range of models and attack scenarios, the research demonstrates that attacks targeting low-precision models can effectively compromise higher-precision ones. This finding highlights a critical security gap that malicious actors could exploit and underscores the need for more secure quantization strategies.