Published in Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI'04), Dec. 2004 (14% acceptance rate, 27/193) [pdf]
CP-Miner has been licensed and commercially developed by Pattern Insight.
Abstract:
Copy-pasted code is very common in large software because programmers prefer reusing code via copy-paste in order to reduce programming effort. Copy-paste is prone to introducing bugs. Recent studies show that a significant portion of operating system bugs are concentrated in copy-pasted code. Unfortunately, it is very challenging to efficiently identify copy-pasted code in large software. Existing copy-paste detection tools either do not scale to large software or cannot handle small modifications in copy-pasted code. Furthermore, few tools are available to detect copy-paste related bugs.
In this paper, we propose a tool, called CP-Miner, that uses data mining techniques to efficiently identify copy-pasted code in large software, including operating systems, and to detect copy-paste related bugs. Specifically, it takes less than 20 minutes for CP-Miner to identify 190,000 and 150,000 copy-pasted segments in Linux and FreeBSD, respectively. The copy-pasted code accounts for 20-22% of the code in Linux and FreeBSD. Similarly, CP-Miner also identifies many copy-pasted segments in the Apache Web Server and PostgreSQL, which account for 17.7-22% of the code in these applications.
Moreover, CP-Miner has detected 49 and 31 copy-paste related bugs in the latest versions of Linux and FreeBSD, respectively. We have reported some of these bugs to the open source community, and the developers have since fixed them. CP-Miner also detected 5 and 2 copy-paste related bugs in the latest versions of Apache and PostgreSQL, respectively. The bugs in Apache were fixed immediately after we reported them.
In addition, we have analyzed some interesting characteristics of copy-paste in Linux and FreeBSD, including the distribution of copy-pasted code across different lengths, granularities, modules, degrees of modification, and software versions.