
A COMPARISON OF BIOLOGICAL STRING SIMILARITY ALGORITHMS APPLICATION TO SOURCE CODE PLAGIARISM DETECTION

Permanent Link: http://ncf.sobek.ufl.edu/NCFE004886/00001

Material Information

Title: A COMPARISON OF BIOLOGICAL STRING SIMILARITY ALGORITHMS APPLICATION TO SOURCE CODE PLAGIARISM DETECTION
Physical Description: Book
Language: English
Creator: Wielga, Colin
Publisher: New College of Florida
Place of Publication: Sarasota, Fla.
Creation Date: 2013
Publication Date: 2013

Subjects

Subjects / Keywords: Computer Science
Plagiarism
String Comparison
Genre: bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Source code plagiarism is easy to perform and difficult to catch. Detection approaches vary, with little consensus. This thesis compares several string comparison techniques borrowed from Biology on a large collection of student work containing various types of plagiarism. All the algorithms succeeded in matching a plagiarized file to its original files upwards of 90% of the time. A modification is proposed for these algorithms that drastically improves their runtimes with little or no effect on accuracy. The strengths and weaknesses of each are explored, in the hope of improving future plagiarism detection techniques.
Statement of Responsibility: by Colin Wielga
Thesis: Thesis (B.A.) -- New College of Florida, 2013
Electronic Access: RESTRICTED TO NCF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE
Bibliography: Includes bibliographical references.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The New College of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Local: Faculty Sponsor: Rahal, Imad

Record Information

Source Institution: New College of Florida
Holding Location: New College of Florida
Rights Management: Applicable rights reserved.
Classification: local - S.T. 2013 W6
System ID: NCFE004886:00001




Full Text


A COMPARISON OF BIOLOGICAL STRING SIMILARITY ALGORITHMS APPLICATION TO SOURCE CODE PLAGIARISM DETECTION

BY COLIN WIELGA

A Thesis Submitted to the Division of Natural Sciences
New College of Florida
In partial fulfillment of the requirement for the degree Bachelor of Arts
Under the sponsorship of Dr. Imad Rahal

Sarasota, Florida
May 2013


Acknowledgements

To Imad Rahal for an interesting topic and a friendly manner; David Mullins and Patrick McDonald for patience and second chances; John Gillette for kind words and a thousand commas; my family and friends for your support: thank you.


Contents

Acknowledgements
Contents
ABSTRACT
1 Introduction
1.1 Measure of Success
2 Current Techniques
2.1 Introduction
2.2 Distance Based
2.3 String Based
2.3.1 MOSS
2.4 Graph Based
2.5 Summary
3 Algorithms
3.1 Introduction
3.2 Data Description
3.3 Preprocessing
3.4 Greedy String Tiling
3.4.1 Motivation
3.4.2 Greedy String Tiling
3.4.3 RKR-GST
3.4.4 Algorithm
3.4.5 Accounting for Unmatched Characters
3.4.6 Selecting scoring functions
3.5 Count Compare
3.6 Global Alignment
3.6.1 Problem
3.6.2 Algorithm
3.7 Local Alignment
3.7.1 Motivation
3.7.2 Algorithm
3.8 Optimizing
3.9 Summary
4 Results
4.1 Introduction
4.2 Accuracy
4.3 Accuracy by Category
4.4 Runtime
4.5 Improving Runtimes
4.6 Summary
5 Conclusion
5.1 Future Work
Bibliography
Appendix: Raw Results


A COMPARISON OF BIOLOGICAL STRING SIMILARITY ALGORITHMS APPLICATION TO SOURCE CODE PLAGIARISM DETECTION

Colin Wielga
New College of Florida, 2013

ABSTRACT

Source code plagiarism is easy to perform and difficult to catch. Detection approaches vary, with little consensus. This thesis compares several string comparison techniques borrowed from Biology on a large collection of student work containing various types of plagiarism. All the algorithms succeeded in matching a plagiarized file to its original files upwards of 90% of the time. A modification is proposed for these algorithms that drastically improves their runtimes with little or no effect on accuracy. The strengths and weaknesses of each are explored, in the hope of improving future plagiarism detection techniques.

Dr. Imad Rahal
Division of Natural Sciences



1 Introduction

Source code plagiarism is a serious problem in education. It is easy to perform, thanks to the wealth of information available online, and often goes undetected since assignments are generally evaluated according to their results, not on the code itself. Manual inspection of code is not viable since it is extremely time consuming. Unfortunately, it is difficult to design an automated plagiarism detection method that works in general, since it is easy to make modifications to a computer program that change its syntax without changing its logic. These near matches make the problem interesting and challenging.

Extensive research has been done on how best to detect source code plagiarism. Some of the popular approaches are discussed in more detail in the next chapter. The most popular approach is to convert each file into a string that summarizes the contents of that file, and then compute the similarity of each pair of strings. There are a variety of algorithms for computing the similarity of a pair of strings and little consensus on which is best. In this thesis we compare several algorithms originally intended for biological string comparison on a large collection of student code from which we have plagiarized several files. The four algorithms compared are: running Karp-Rabin greedy string tiling, local alignment, global alignment, and count compare. Running Karp-Rabin Greedy String Tiling (RKR-GST) is currently among the most popular algorithms; it compares two strings by examining their common substrings. Local and global alignment dynamically compare sequences in terms of insertions and deletions. Count compare is a special case of RKR-GST that runs extremely quickly. We also propose the use of count compare to prune the dataset so the other algorithms can be run quickly with little or no effect on the results.


This thesis is divided into three chapters. In the first chapter we summarize the approaches currently used to detect source code plagiarism. In the second chapter we give detailed descriptions of the algorithms to be compared. In the third chapter we examine the runtime and accuracy of the different algorithms.

1.1 Measure of Success

Before we begin it is necessary to discuss how we intend to measure success. Traditionally, plagiarism detection algorithms compare every file with every other file and print out the most similar pairs. Success can then be measured with precision (the proportion of the reported files that are plagiarized), recall (the proportion of the plagiarized files that are reported) and F1 (the harmonic mean of precision and recall). There are two problems with this approach. First, it depends on the cutoff chosen; one algorithm might perform better with one cutoff but worse with another. Second, the algorithms we are using only tell us which files are most similar to a specific file, as opposed to which pair of files is the most similar, and thus it is nontrivial to select the most similar matches overall. Much like the cutoff itself, the best way to select the top matches is algorithm dependent; one technique might perform well with one algorithm and poorly with another. Selecting such a technique is beyond the scope of this thesis.

In order to address these issues we introduce two new measures of success. In both cases we compare a plagiarized file to all the potential original files. The score we wish to give is how many files one would need to look at in order to find the original file (i.e., the file from which the plagiarized file was plagiarized) when inspecting the files from most to least similar to the plagiarized file. If the corresponding original file scored better than all the other files we give the algorithm a score of one for that file; if it scored second best the algorithm scores a two,


and so on. The difference between our two scoring systems is how they deal with ties involving the original file.

Upper Bound Score: The upper bound score of a plagiarized file is one plus the number of files that score higher than the original file. Files that tie with the original file are considered to have scored higher than the original file.

Lower Bound Score: The lower bound score of a plagiarized file is one plus the number of files that score higher than the original file. Files that tie with the original file are considered to have scored lower than the original file.

The upper bound score treats the original as if it were the last of the tied files to be examined, while the lower bound score treats it as if it were the first. In addition to avoiding the above disadvantages, these measures of success have the additional advantage that they do not require comparing every pair of files, so comparisons can be computed in a fraction of the time.
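As an illustration, the two scores for a single plagiarized file could be computed from a list of similarity scores as in the minimal sketch below; the function and variable names are ours, not taken from the thesis code.

    def bound_scores(original_score, other_scores):
        """Upper and lower bound scores for one plagiarized file.
        original_score: similarity between the plagiarized file and its true original.
        other_scores:   similarities between the plagiarized file and every other candidate.
        Higher similarity means a better match."""
        higher = sum(1 for s in other_scores if s > original_score)
        ties = sum(1 for s in other_scores if s == original_score)
        upper = 1 + higher + ties   # ties counted as beating the original
        lower = 1 + higher          # ties counted as losing to the original
        return upper, lower

    # Example: the original scores 0.9; other candidates score 0.95, 0.9 and 0.4.
    print(bound_scores(0.9, [0.95, 0.9, 0.4]))   # -> (3, 2)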


2 Current Techniques

2.1 Introduction

In this chapter we will discuss the ways in which plagiarism detection is currently approached. We will look at the major schools of thought and examine MOSS, an important plagiarism detection algorithm, in detail. We divide the approaches used to detect plagiarism into three schools: distance based, string based and graph based. Distance based techniques generate a set of vectors or tuples where each element describes some property of a file. For example, one might use tuples where the elements are: 'how many lines,' 'how many loops' and 'how many variables' a file contains. These descriptions are then compared to get an idea of how similar two files are. String based techniques work much like their natural language cousins; they flatten the code into strings and look for long series of matching 'words.' Finally, graph based techniques attempt to create an isomorphism between graphs that express the logical structure of a program.

It is difficult to say objectively how successful an algorithm is because the results are so data dependent. Results depend on the programming language used, the level and goals of the programmers, and the types of plagiarism encountered. An algorithm may perform well on one dataset and badly on others. Since there is no standard way to compare different algorithms, authors often discuss the types of plagiarism their algorithms perform well against. The backbone of this discussion is the following list (Figure 2-1) of plagiarism techniques (Whale G. R., 1990). We will use these to guide our discussion of the three major schools.

Figure 2-1: List of techniques used to conceal plagiarism

1. Changing comments or formatting


2. Changing identifiers
3. Changing the order of operands in expressions
4. Changing data types
5. Replacing expressions with equivalents
6. Adding redundant statements or variables
7. Changing the order of independent statements
8. Changing the structure of iteration statements
9. Changing the structure of selection statements
10. Replacing procedure calls by the procedure body
11. Introducing non-structured statements
12. Combining original and copied program fragments

2.2 Distance Based

Distance based methods (also called attribute counting methods) are the oldest and simplest of the three major approaches. They generally start by generating a representation of each file as a tuple, where each element of the tuple describes some property of the file, such as the number of loops or the number of lines. This is a useful way to describe a file since it allows us to think of the files as living in some dimensional space. The tuples are then compared using some distance measure to find the most likely set of plagiarisms.

The goal of a distance based measure is to select a set of attributes that provide a complete description of any file. Unfortunately, the number of available attributes is large and the data varies drastically, so coming up with the right set of attributes is difficult. One successful approach was that of Accuse, an algorithm from 1981. It experimented with 20


attributes that were popular at the time and selected seven that produced the best results (Grier, 1981). Another problem distance based methods encounter is that their attributes are often not independent. For example, a longer file will generally possess more loops; this is troubling if length and number of loops are both attributes that one is counting, since the length of a file ends up expressing itself twice. Accuse deals with this problem by calculating the correlation between its attributes and taking that into consideration when measuring the distance between files. The Faidhi-Robinson System, a distance based algorithm introduced in 1987, deals with it by selecting a set of 23 attributes it claims are independent (J. A. Faidhi, 1987).

Distance based techniques represent files in a very simple way (compared to the other two common approaches). One advantage of this simple representation is that it ignores many of the small scale changes that can be made to conceal plagiarism. It is completely unaffected by changes in the order of statements and generally does well with changes in which a statement is replaced with its logical equivalent (types 2, 3, 4, 5, 7, 8, and 9 in Figure 2-1). It also makes for very fast comparisons since each file generally only needs to be scanned once. A disadvantage of this simple representation is that distance based techniques tend to perform poorly when new code is introduced (types 6, sometimes 10, and 12 in Figure 2-1). Since they have no way of identifying the common subsets, they only care that the final attribute counts are different. Distance based techniques also tend to produce many false positives, since very different pieces of code can generate the same tuple. In general, distance based methods are outperformed by string based methods as the plagiarism becomes more complex (Kristina L. Verco, 1996).

2.3 String Based


String based methods are currently the most popular of the three major schools. They are somewhat more complex than distance based methods, since they take statement order into account. String based comparisons are carried out in two steps. First, the file is converted into a string; we call this step preprocessing. This involves removing everything that is not a reserved word (formatting, comments, variable names, etc.). What remains are reserved words, each of which is replaced with a character that represents it. We discuss this in more detail in Section 3.3. Second, the strings are examined in pairs with an algorithm that determines string similarity.

We choose to work with strings for several reasons. First, strings are more abstract than programs. This makes the problem easier to think about and allows us to leverage what we already know about strings. Second, it does not tie us to one type of data: a string can easily represent a variety of languages. Third, by reducing a file to a string, we eliminate a lot of information not relevant to the function of the code, such as variable names and formatting.

The difficult part of creating a string based algorithm is the second step. Determining the similarity of two strings is a difficult task with many possible approaches. One reason comparing strings is difficult is that it is unclear how to compare one letter to another. Ordering is another problem: even if two strings share the same set of characters, there are factorially many possible orderings with no natural concept of similarity.

The following is a short history of string based techniques that focuses on the algorithms used in the second step to compare strings (as opposed to the algorithms used to generate the strings). In 1986 Geoff Whale introduced Plague, the first significant string based algorithm (Whale G., 1986). It used Heckel's Algorithm, which analyzes strings in terms of moves, insertions and deletions; however, it does not always generate optimal solutions (Heckel, 1978).


The years 1992 to 1996 saw three versions of Yet Another Plague (YAP). The first of these used a mixture of UNIX utilities, relying heavily on the file comparison utility diff. The second version, like Plague, used Heckel's Algorithm because YAP1 was too slow. The third iteration of YAP used running Karp-Rabin greedy string tiling (RKR-GST), which looks for maximal substrings (discussed in more detail in Section 3.4). Another program called Sherlock was introduced by Mike Joy and Michael Luck in 1999. It is unclear what algorithm it used (we suspect RKR-GST); however, it is interesting because it constructed a graph out of its results which it displayed to the user, expressing not only how two file strings compare with each other, but also how each compares to other files (Mike Joy, 1999). SIM was also introduced in 1999. It used a sequence alignment technique (similar to that which we discuss in Section 3.6), which compares strings in terms of insertions and deletions but always finds an optimal solution (David Gitchell, 1999). JPlag appeared in 2000 and used RKR-GST. It boasted a web service that allowed users to easily upload collections of files for it to search for plagiarism. It also supports a variety of languages (Lutz Prechelt, 2000). Measure of Software Similarity (MOSS) is currently one of the most popular plagiarism detection systems. It was introduced by Alex Aiken in 2003. MOSS is a little difficult to classify since it converts files to strings but then divides the strings into a set of k-grams. The details of how it compares the k-grams are not published. Because of its importance we discuss MOSS in more detail in Section 2.3.1.

In general, string based techniques perform well. Depending on the preprocessing and the details of the algorithm, they can detect similarities despite the replacement of statements with equivalents (types 2, 3, 4, 5, 8, and 9 in Figure 2-1). They do decently with large scale changes like


mixing plagiarized code and original code and with large scale reordering (types 10, 11, and 12 in Figure 2-1), but struggle with line scale reordering and insertion (types 6 and 7 in Figure 2-1).

2.3.1 MOSS

It would be wrong to talk about source code plagiarism detection techniques without discussing MOSS in detail. We classify MOSS as a string based algorithm since it begins by working with strings. However, this classification is somewhat blurry since the strings are converted into sets of substrings, or k-grams. All the details of how MOSS compares files have not been published. Because of its importance we will discuss what has been published below (Saul Schleimer, 2003).

2.3.1.1 Goals

MOSS attempts to select a set of k-grams, called a fingerprint, which represents a document and can be compared to other fingerprints to measure their similarity. Its authors proposed the following three criteria to guide the search for a good fingerprint.

Figure 2-2: Schleimer's criteria for a good fingerprint

1. Whitespace insensitivity: many documents contain information not relevant to their core meaning, and any good measure should ignore this irrelevant information. The nature of 'not relevant to their core meaning' is somewhat format dependent. For text documents it includes extra spaces, indentation, line breaks and capitalization; for computer code it includes comments and variable names.

2. Noise suppression: when comparing large files, finding short matches is often meaningless. For example, when comparing text files, discovering that 'is' appears in both files hardly says much about the similarity of the documents. Similarly, finding that two Java files contain the keyword 'int' is not meaningful.


3. Position independence: for an approach focused on matching the details of a document, macro scale changes such as rearranging the order of paragraphs (or functions) should not affect the fingerprint of the document. Furthermore, changes to one part of a document should not affect the parts of the fingerprint drawn from unrelated parts.

2.3.1.2 Algorithm

MOSS starts by using string preprocessing like that used by most string based algorithms to convert each file into a string of characters. Not only does this make the code easier to work with, it removes most of the whitespace, achieving MOSS's first goal. Then, for each string, MOSS generates every possible substring of length k, or k-gram, from that string. The k-grams are then hashed. One advantage of working on the level of k-grams/hashes is that it ensures that no matches shorter than k can be observed. MOSS looks at k-grams as single objects, as opposed to arrays of characters (thus the hashing), so information about a k-gram's substrings is completely ignored. This achieves MOSS's second goal of suppressing noise, since small matches (i.e., matches of length less than k) cannot be observed when fingerprints are compared.

The final and hardest step is choosing which k-grams to use to represent the file. This is where MOSS's third goal becomes clearer. Ideally, macro scale changes do not affect the k-grams/hashes chosen from a block of code. For example, removing a block of code should not affect the set of hashes chosen except to remove the hashes from the deleted block. Likewise, changing the order of the blocks should not change the hashes chosen from each. MOSS also needs the k-grams/hashes to represent the whole document. This means that we need hashes to be distributed with relative equality throughout the document. An even distribution of hashes ensures that large sections cannot be deleted or added without it being


reflected in the fingerprint. A truly uniform distribution (say one out of every n hashes) would be ideal, since we could be certain of the size of the changes we would observe. The obvious solution would be to choose every nth gram/hash. While this ensures an equal distribution, it fails to satisfy our third requirement: adding or removing a single character would change every hash chosen after the change. A single deletion could change many of the hashes chosen, including those in distant parts of the document.

To choose hashes/grams, MOSS looks at windows of consecutive hashes and chooses the lowest hash value from each. If two hashes are tied for the lowest, it chooses any hash that has already been selected; if more than one or none have already been selected, it chooses the one furthest to the left. This ensures that the hashes chosen for the fingerprints are equally distributed (at least one chosen from every window) and that the same chunk of text will always return the same set of hashes. This achieves MOSS's third goal from Figure 2-2. Choosing the lowest hash out of every window of hashes has the additional bonus of guaranteeing that if two files share a sufficiently long substring, their fingerprints share a hash: two strings that share a sufficiently long substring will generate the same window of hashes from it and select the same one to be in their respective fingerprints. Thus, MOSS achieves all three of its goals: 1) it cleans to eliminate whitespace; 2) it works only with k-grams, making small noisy matches irrelevant; and 3) it selects hashes independently of large scale changes by selecting the lowest hash value in each window of hashes.

2.4 Graph Based

Graph based methods are the newest and most complex of the three techniques. They involve the most in-depth representation of the code, which allows them to compare the structure of a program instead of the details of its code.


As an example, consider one popular graph based plagiarism algorithm, GPlag (Chao Liu, 2006). GPlag converts each file into a program dependence graph and then measures similarity by matching subgraphs. A program dependence graph is a directed graph that expresses how control statements tie together. Vertices are statements; edges connect a statement to all the possible next lines that involve a variable in that statement. This is best demonstrated with an example. Below are a few lines of example code (Pseudo Code 2-1) along with the graph it would generate (Figure 2-3).

Pseudo Code 2-1: An example function to be converted into a graph in Figure 2-3

    Sum(int[] array)
        int sum = 0
        int count = 0
        while (count < array.size)
            sum += array[count]
            count++
        return sum

[Figure 2-3: An example graph like one GPlag might generate from Pseudo Code 2-1. Its vertices are the statements int[] array, int sum = 0, int count = 0, while (count < array.size), sum += array[count], count++ and return sum.]

Let's follow the integer sum through the program dependence graph. sum is initiated above the while loop. It next appears either in the return statement (if the while condition is never true) or in the addition-and-assignment statement inside the loop, so its initialization is connected to these two statements. From the addition and assignment, sum can next appear in


the same line in the next iteration of the loop or in the return statement, so we connect the addition and assignment to itself and to the return statement. sum is never referenced again after the return statement, so we are done. We track the possible paths of each variable instead of the possible paths of execution since changes to separate variables are independent. Thus the order in which variables are changed does not matter. As a result, GPlag is not tricked by reordering independent statements.

The main difference between graph based techniques is the different ways they represent a file as a graph. Another approach is to use a variation of the program's call graph. A call graph is a directed graph with functions for vertices and edges connecting each function to the functions it calls (Kammer, 2011).

Graph based algorithms, in theory, do well against all the types of plagiarism with the possible exceptions of inserting meaningless code (type 6 in Figure 2-1), since it could cause unexpected changes to the graph, and replacing procedure calls with procedure bodies (type 10 in Figure 2-1), since it would move large parts of the graph around. Graph based algorithms could also be tricked by slightly more complex plagiarism techniques such as wrapping functions in meaningless container functions or adding calls to functions that are never executed. Graph based algorithms' biggest weakness is their runtime. Constructing the required subgraph matchings is slow; subgraph isomorphism has been shown to be NP-complete (Michael R. Garey, 1979).

2.5 Summary

Plagiarism detection can be approached from several very different angles. We have discussed the three major approaches here. All three are generally effective but each has specific weaknesses. Distance based techniques struggle with large scale changes such as the addition of large blocks of meaningless code. String based techniques can be fooled by


reordering independent statements and inserting meaningless lines randomly. Graph based techniques struggle with structural changes such as replacing function calls with the function bodies.


3 Algorithms

3.1 Introduction

The purpose of this chapter is to discuss the details of the algorithms we intend to compare. We begin with a discussion of the data and how it was preprocessed. The first algorithm we discuss is a version of running Karp-Rabin greedy string tiling (RKR-GST). RKR-GST is an efficient technique for finding common substrings of two strings. It is used by several popular plagiarism detection techniques, most notably JPlag and YAP3. We also discuss a special case of RKR-GST that runs in linear time, which we call count compare. Next, we discuss a pair of closely related algorithms, local and global alignment, which are used to measure the similarity of sequences of DNA in terms of insertions and deletions. To the best of our knowledge, alignment algorithms have been used only in SIM, but we believe they merit further exploration.

3.2 Data Description

Our data is a collection of Visual Basic 6 projects from students in introductory programming courses at the College of Saint Benedict and Saint John's University. In all, it contains 4708 files collected over seven years (2003-2009). Visual Basic is a somewhat nonstandard programming language; it was designed for the easy development of graphical user interface (GUI) programs. It is an event-driven language, meaning that blocks of code are associated with various actions the user performs. A button press, for example, might be attached to code that causes a message to appear. This is efficient for GUI programming because people generally think about UI in terms of cause and effect; however, it means we cannot count on a meaningful call graph. Thus using most graph based plagiarism detection is impossible for this dataset.


One hundred eighty of the 4708 original files were intentionally plagiarized for detection. These plagiarisms consist of 30 files of each of the following six types:

Figure 3-1: A list of techniques for concealing plagiarism that we used when creating plagiarisms for testing

1. Changing comments or formatting
2. Changing identifiers or data types
3. Replacing expressions with equivalents, changing the structure of iteration or selection statements, replacing procedure calls with the procedure body
4. Changing the order of operands in expressions or of independent statements
5. Adding redundant statements or variables, introducing non-structured statements
6. Some combination of the above

The above list is a condensed version of the list of 12 types of plagiarism we saw in Chapter 2 (Figure 2-1), grouped by the type of changes they make to the code. The plagiarized files have been divided by type so we can determine which algorithms are effective for specific types of plagiarism. Real plagiarists, however, tend to combine multiple techniques, so we also include some files that combine several of the above types. To avoid confusion, any references to a certain type of plagiarism henceforth will be to this list, since these are the types relevant to our dataset.

3.3 Preprocessing

Before we can run any of our algorithms we need to reduce the files to strings. The standard technique for reducing a file to a string is to convert each reserved word into a character. First, all comments and formatting are removed. Next, all literals are replaced with standard placeholders; for example, the string "hello world" might be replaced with a string placeholder and the value 12 with #. Next, variable names, function names and anything user defined are also replaced with placeholders. Finally, each word is either a reserved word or a placeholder. This set is finite, so we can simply map it to a set of characters to get a string.

Our technique converts a file to a string by line instead of by word. We divided statements into 22 basic logical types, such as the declaration of a new numeric global variable or the opening of an 'if' statement. Each logical type was then associated with a two-letter code; for example, declaring a new global numeric variable is PN and the opening of an 'if' statement is (I. This is best demonstrated by the example below (Pseudo Code 3-1), where each line has been matched with its logical purpose.

Pseudo Code 3-1: Some example code with each line matched with its logical type

    int count = 0;                               local number
    for (int i = 0; i < 100; i++) {              begin loop
        if (i % 2) {                             begin if
            count = count + 1                    assignment
            print(i + " is even")                print
        }                                        end if
    }                                            end loop
    print(count + " even numbers are < 100")     print

Each line is converted to the pair of characters associated with its logical type. These characters are then combined into a string. We hope this encoding system retains the fundamental logic of a line while eliminating some of the room to make meaningless changes. The goal of this preprocessing is to eliminate plagiarisms of types 1, 2 and 3 from Figure 3-1. Table 3-1 (below) displays the different logical elements along with their two-character codes and the total number of times they occur in our data.

Table 3-1: Logical types, their corresponding codes, and the number of times they appear in our dataset

    logical element    code    count
    ASSIGNMENT         ==      112018
    PRINT              PR      29945
    START SUB          (S      22753
    END SUB            S)      22753
    ELSE               EE      15412
    LOCAL NUMBER       DN      14681
    START IF           (I      12733
    END IF             I)      12733
    MSG BOX            MS      9478
    START LOOP         (L      9396
    END LOOP           L)      9396
    LOCAL STRING       D$      5982
    INPUT BOX          IN      3041
    OPEN FILE IN       OF      2964
    FILE READ          FR      2839
    END                ED      1977
    LOCAL OTHER        D?      1236
    PUBLIC STRING      P$      984
    LOCAL BOOLEAN      DB      867
    PUBLIC OTHER       P?      151
    FILE WRITE         FW      149
    PUBLIC BOOLEAN     PB      126
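As an illustration of this encoding, here is a minimal sketch of how a file could be reduced to one of these code strings. The classify_line rules below cover only a few of the 22 logical types in Table 3-1 and are our own simplified stand-ins, not the actual classifier used for the dataset.

    # Minimal sketch of the per-line encoding; the rules are illustrative only.
    def classify_line(line):
        line = line.strip().lower()
        if line.startswith("if "):
            return "(I"        # START IF
        if line.startswith("end if"):
            return "I)"        # END IF
        if line.startswith("print"):
            return "PR"        # PRINT
        if line.startswith("dim "):
            return "DN"        # LOCAL NUMBER (assumed rule for this sketch)
        if "=" in line:
            return "=="        # ASSIGNMENT
        return ""              # other logical types are omitted from this sketch

    def encode_file(lines):
        return "".join(classify_line(l) for l in lines)

    example = ["Dim count", "If x > 2 Then", "count = count + 1", "Print count", "End If"]
    print(encode_file(example))   # -> "DN(I==PRI)"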


Notice that the statements are unevenly distributed, with assignment comprising almost 40% of the characters. This is important because it means these strings will have much more in common than strings randomly generated from an alphabet of 22 letters. An additional benefit of this encoding system is that it returns much shorter strings than the standard technique. This is helpful, since all our algorithms' runtimes depend on the length of the compared strings; shorter strings mean faster runtimes. Figure 3-2 below shows how the encoded files are distributed according to length.

[Figure 3-2: Number of files that generate strings of each length (slightly smoothed); x-axis: String Length (characters), y-axis: Number of Strings.]

The vast majority of the strings we are comparing are short, with a median length of 29 and an average length of 60.

3.4 Greedy String Tiling

3.4.1 Motivation

String tiling is one effective way to measure the similarity of a pair of strings. It measures similarity in terms of the number and length of common substrings. String tiling associates a set of substrings from one string, T, with matching substrings from the other, P. The idea is that if both


strings are modifications of a common parent string, two associated substrings must have been the same in the parent string. The association of a substring of T with a matching substring from P is called a tile; formally:

Definition: A tile is a permanent and unique (one-to-one) association of a substring from T with a matching substring from P (Wise M. J., 1993).

A tiling, then, is a set of non-overlapping tiles. For example, let's look at how one might tile T = abcdbcd and P = abcabcd. We can visualize their similarities if we write each string on one axis of a grid and shade the cells where the letter on the T axis matches the letter on the P axis.

[Figure 3-3: A visual comparison of abcdbcd and abcabcd. T = abcdbcd runs along the columns and P = abcabcd along the rows; matching letters are shaded, and the first cell of each of the four largest matches is labeled 1 through 4.]

Representing the comparison in this way is helpful since the matching substrings appear as diagonal runs of shaded cells. We have labeled the first element of the largest matches so we


can refer to them. If we tile match 2, this permanently associates the last four characters of P with the first four characters of T. These characters become unavailable for other matches. Visually, any match that shares a row or column with a tiled match cannot be tiled. Tiling match 2 makes tiling match 1 impossible, since match 1 uses the first three characters of T; it also makes tiling match 4 impossible, since match 4 uses the last three characters of P. We would then have no choice but to tile match 3, giving the following tiling:

[Figure 3-4: A visual representation of a tiling of abcdbcd and abcabcd, with matches 2 and 3 tiled.]

This is not the only available tiling; in fact, there are many. We need some way to select the best one. To assign a value to a tiling, we introduce a weighting function f which takes the length of a tile to a score. We then assume that the best tiling maximizes the sum of f applied to the length of each tile. We discuss the best choice for f in Section 3.4.6.


    score = f(l_1) + f(l_2) + ... + f(l_n)

Above is the score of a tiling, where l_i is the length of the i-th tile. The function f is chosen to encourage longer matches, since they are much less likely to happen by chance. Once we have found the optimal tiling, we use its score as the similarity of the strings tiled.

Unfortunately, selecting the best tiling turns out to be difficult since there are so many possible solutions. Instead of calculating the number of tilings, we calculate how many ways the characters from T can be matched with the characters of P. Once the characters have been matched it is trivial to choose the best tiling for that matching. Thus we make our calculation simpler and avoid checking some pointless cases. We break the problem down by letter, since characters can only be matched to characters of the same letter. For a given letter c, if the strings contain different numbers of c's, we choose the proper number of c's from the string with more to be unmatched. We then can match the first c in T with each c in P, the second c in T with each of the remaining c's in P, and so on. For each letter this gives max(n_T, n_P)! / |n_T - n_P|! possibilities, where n_T and n_P count the occurrences of that letter in T and P; multiplying over all letters gives the number of possible solutions.

We cannot prove that this problem cannot be solved in polynomial time. However, we argue that it is difficult, in order to motivate a greedy solution. Often, we approach difficult problems like this by trying to split them up into smaller, easier-to-solve pieces. However, in this case, such an approach is difficult. To determine whether a match is 'worth tiling' we must determine whether the matches it would eliminate (those matches containing parts of the tiled match) are 'worth tiling'. This in turn requires determining whether the matches they would eliminate are 'worth tiling', and so on. Thus it is rare that we can be certain of a solution without testing many of the available options.
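Before turning to the greedy approach described in the next section, here is a minimal sketch of that strategy: repeatedly tile the longest remaining common substring. The brute-force search and the names used here are our own; RKR-GST, described in Section 3.4.3, computes this kind of greedy tiling far more efficiently.

    # Naive greedy string tiling: repeatedly tile the longest common substring
    # whose characters are still unmarked in both strings. Illustration only.
    def greedy_tiles(T, P, min_match=2):
        marked_T = [False] * len(T)
        marked_P = [False] * len(P)
        tiles = []                      # each tile: (start in T, start in P, length)
        while True:
            best = (0, 0, 0)            # (length, start in T, start in P)
            for i in range(len(T)):
                for j in range(len(P)):
                    k = 0
                    while (i + k < len(T) and j + k < len(P)
                           and T[i + k] == P[j + k]
                           and not marked_T[i + k] and not marked_P[j + k]):
                        k += 1
                    if k > best[0]:
                        best = (k, i, j)
            length, i, j = best
            if length < min_match:
                break
            for k in range(length):
                marked_T[i + k] = True
                marked_P[j + k] = True
            tiles.append((i, j, length))
        return tiles

    # The running example: "abcd" is tiled first, then "bc", as in Figure 3-4.
    print(greedy_tiles("abcdbcd", "abcabcd"))   # -> [(0, 3, 4), (4, 1, 2)]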


3.4.2 Greedy String Tiling

Greedy string tiling (GST) approximates the optimal tiling by, at each opportunity, tiling the largest available substring. We wish to show that this gives solutions close to those calculated by brute force (or exact tiling). We believe that this is a reasonable approach since larger substrings produce more value per matched character. Unfortunately, since there is no workable algorithm to calculate the optimal solutions, it is difficult to know how close the approximation is. The best we can do is to compare greedy and exact tilings on small strings.

[Figure 3-5: The sum of the scores of 500 comparisons for strings of various lengths, for both the optimal tiling (exact) and the greedy tiling; x-axis: Length of Compared Strings (characters), y-axis: Total Score for 500 Runs; series: Exact, Greedy.]

Figure 3-5 above shows the total score for comparing the same 500 pairs of strings of lengths 1 to 33 using the greedy and exact techniques. The strings compared are randomly selected substrings of actual files; as such, they are certainly representative of the data set. It is reassuring how well the exact and greedy scores match; on average the difference between them is about 2.43% of the total exact score. While the difference is growing, so are the total scores. We would like to be certain that the difference is not growing faster than the


total scores as the lengths of the strings increase, since that would cast doubt on the reliability of using GST as an approximation at longer lengths. How the difference, as a percentage of the total exact score, changes with respect to length is shown below.

[Figure 3-6: The difference between the sum of the scores of 500 greedy tilings and 500 exact tilings, as a percentage of the sum of the scores of the 500 exact tilings, compared to string length; x-axis: String Length (characters), y-axis: Percent; series: Difference as percent of Exact.]

It is hard to be certain of much with such a small range of values so far from the relevant range (our average string has a length of 60, with some being much longer); however, the difference appears to be stable around 2.5%. To put this difference in perspective, the average difference between a comparison done with exact and greedy on two strings of length 25 is 0.64, which is eclipsed by the standard deviations of exact and greedy, 23.72 and 23.54 respectively. The algorithm we use to perform greedy string tiling is running Karp-Rabin greedy string tiling (RKR-GST), which is discussed in detail in the next section.

3.4.3 RKR-GST

RKR-GST was introduced in 1993 by Michael Wise (Wise M. J., 1993). It was originally introduced to compare DNA sequences but was adapted for plagiarism detection in YAP3 in 1996. RKR-GST has been successful because of its speed: it offers an experimental runtime


close to linear in the length of the strings (Wise M. J., 1996). Before we can discuss RKR-GST in detail, we must discuss running Karp-Rabin string matching, a key technique that underlies RKR-GST.

3.4.3.1 Running Karp-Rabin String Matching

Karp-Rabin string matching was created by Michael Rabin and Richard Karp in 1987 (Richard M. Karp, 1987). It finds all matching substrings of a given length shared by a pair of strings T and P extremely quickly. It boasts an average runtime of O(|T| + |P|) and a worst case runtime of O(|T| * |P|). This is remarkable since comparing every letter in T to every letter in P would take O(|T| * |P|) time.

Pseudo Code 3-2: Karp-Rabin String Matching

    hashtable
    for substrings t of T of length l
        insert hash(t), data(t) into hashtable
    for substrings p of P of length l
        dataList = get hash(p) from hashtable
        for data in dataList
            if data really matches p
                ...

The key behind Karp-Rabin string matching's remarkable speed is a hash table with substrings as keys and positions as values. First, all substrings of length l of T are converted into hashes and used to store their positions in T in the hash table. Positions that share a hash value are connected together into a linked list, since we want to recall all the positions at which a given hash value occurs. Then we iterate over all the substrings of P of length l, converting each to a hash. Two matching substrings will generate the same hash value. So, we look up the hashes from substrings of P in the table to get the positions of their matches. The hash function is not perfect and sometimes gives two different strings the same value, so after the positions are returned we must check to make sure each pair of strings actually matches.
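As a concrete illustration, here is a minimal Python sketch of this matching step. The function names, the dictionary of lists standing in for the hash table, and the base and modulus constants are our own illustrative choices rather than the thesis's implementation; the hash computation follows the rolling scheme described in the paragraphs that follow.

    # Minimal Karp-Rabin matching sketch: all (i, j) where T[i:i+l] == P[j:j+l].
    BASE = 257          # base of the polynomial hash (an assumed constant)
    MOD = 1_000_003     # a large prime modulus (an assumed constant)

    def rolling_hashes(s, l):
        """Hash every length-l substring of s, reusing the previous hash."""
        if len(s) < l:
            return []
        h = 0
        for ch in s[:l]:
            h = (h * BASE + ord(ch)) % MOD
        hashes = [h]
        high = pow(BASE, l - 1, MOD)     # weight of the character leaving the window
        for i in range(1, len(s) - l + 1):
            h = ((h - ord(s[i - 1]) * high) * BASE + ord(s[i + l - 1])) % MOD
            hashes.append(h)
        return hashes

    def karp_rabin_matches(T, P, l):
        table = {}                        # hash -> list of positions in T
        for i, h in enumerate(rolling_hashes(T, l)):
            table.setdefault(h, []).append(i)
        matches = []
        for j, h in enumerate(rolling_hashes(P, l)):
            for i in table.get(h, []):
                if T[i:i + l] == P[j:j + l]:   # verify: hashes can collide
                    matches.append((i, j))
        return matches

    print(karp_rabin_matches("abcdbcd", "abcabcd", 3))
    # -> [(0, 0), (0, 3), (1, 4), (4, 4)]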


An important factor impacting Karp-Rabin string matching's speed is the speed of the hash function. In general, the runtime of the hash function is on the order of the length of the substrings we are hashing, since that is the amount of information it contains. However, since we are hashing neighboring strings that share common substrings, we can do better. Karp-Rabin string matching takes advantage of this by using a rolling hash function. Instead of creating an entirely new hash, rolling hash functions generate the next hash by making adjustments to the last hash. For example, if the string is abcdefg and the length is 5, we hash abcde normally, then we hash bcdef by making adjustments to the hash of abcde, adding the f and removing the a. The rolling hash we use is shown below; note that all arithmetic is done modulo (i.e., the remainder of division by) some large prime:

    hash(c_1 c_2 ... c_l) = c_1 * b^(l-1) + c_2 * b^(l-2) + ... + c_l

where b is some prime (not the one we are working modulo of) and each c_i is the ASCII value of a character. You can think of this as something like the decimal representation of a number in base b, with the ASCII value of each character being the digits. To change from the hash of one string to the next, we subtract the highest order digit, multiply by b and then add the new ASCII value. Mathematically:

    hash(c_2 ... c_{l+1}) = (hash(c_1 ... c_l) - c_1 * b^(l-1)) * b + c_{l+1}

Karp-Rabin string matching is a clever demonstration of the power of hash tables. It allows us to find matches of a given length from two strings quickly. It performs best on strings with few matches and a small match length.

3.4.4 Algorithm

Now that we have discussed Karp-Rabin string matching, we can talk about RKR-GST. RKR-GST runs a slightly modified version of Karp-Rabin string matching on shorter and shorter


lengths until it has tiled all matches longer than a certain minimal length (matches below the minimal length are considered to be noise). RKR-GST starts by running Karp-Rabin string matching with a length it expects to be around that of the longest common substring. It then adjusts the length according to the matches it finds: it increases the length if it finds too large a match, and decreases it when it has found all matches of that length. The main method of RKR-GST looks something like Pseudo Code 3-3.

Pseudo Code 3-3: Body of RKR-GST

    go = true
    while go
        maxfound = scan(length)
        if maxfound >= 2 * length
            length = 2 * length
        else
            markstrings()
            if length >= 2 * minimal_match_length
                length = length / 2
            else if length > minimal_match_length
                length = minimal_match_length
            else
                go = false

Here scan (see Pseudo Code 3-4) runs a modified version of Karp-Rabin string matching that finds all the matches longer than a certain length and returns the length of the longest match it finds, and markstrings (see Pseudo Code 3-5) turns the matches found by the latest scan into tiles; in doing so it 'marks' them so they cannot be used in other matches. The above code scans the strings for matches and records the largest one found; if it is bigger than twice the search length, we double the search length and try again, otherwise we mark off the matches we found and halve the search length.

The choice to double and halve the search length attempts to balance two concerns. On the one hand, we want to search as few lengths as we can, since this means fewer loops and


faster runtimes. On the other hand, executing Karp-Rabin string matching with a length too much shorter than the matches present is slow because it finds too many matches. This system balances these two issues since it keeps the number of matches in each window relatively even: the size of the windows increases exponentially but the number of matches per length decreases exponentially, so in theory they should be somewhat even, producing reasonable runtimes. RKR-GST uses the following modified version of Karp-Rabin string matching to find its matches:

Pseudo Code 3-4: Scan, a modified version of Karp-Rabin string matching

    define scan(n)
        largest_match = n
        for entirely unmarked substrings t of length n in T
            insert hash(t), data(t) into hashtable
        for entirely unmarked substrings p of length n in P
            dataList = get hash(p) from hashtable
            for data in dataList
                k = 0
                while T[data.start + k] == P[p.start + k]
                    k++
                if k > 2 * n
                    return k
                if k >= n
                    add match to matches
                    if k > largest_match
                        largest_match = k
        return largest_match

The above looks a lot like the version of Karp-Rabin string matching we have already seen in Pseudo Code 3-2, but with three differences. First, it uses the matches of length n (the search length) as starting points to look for matches that could be longer than n. Second, it quits if it finds a match longer than twice the search length. Third, we are only interested in substrings that are entirely unmarked, so we only look at that subset of substrings.


The matches are stored in an array of queues; each queue is filled with matches of a certain length. When a match is found it is added to the end of the queue in the position associated with its length. Finally we examine the details of how matches are turned into tiles.

Pseudo Code 3-5: MarkStrings

    define markStrings()
        for queue q in matches, from largest_length to smallest_length
            for match m in queue q
                if all elements of m are unmarked
                    mark all elements of m in P and T
                    totalmarked = totalmarked + length of m

markStrings is straightforward; it goes through the matches we have found, from largest to smallest, tiling the ones it can.

In summary, RKR-GST repeatedly applies Karp-Rabin string matching at various lengths. If it discovers too long a match, it increases the search length and tries again. Otherwise it tiles what it has found and continues searching with a shorter length. When all the matches longer than a certain length have been found, it terminates.

3.4.5 Accounting for Unmatched Characters

As currently described, RKR-GST does not take unmatched characters into account. This is a concern since it means we currently consider ab to be equally similar to abc and abcdefghi. This causes longer strings to be considerably more likely to be chosen as the most likely original file, since they have more opportunities to match any substring of the plagiarized file. Longer strings also tend to contain many characters that cannot be matched. Thus, to mitigate their advantage, we introduce a second scoring function that subtracts from the score based on the total number of unmatched characters. This is reasonable since a large amount of different code should indicate that two files are different (although it is weak against plagiarists inserting large amounts of junk code).


3.4.6 Selecting scoring functions

There are too many possible scoring functions to properly explore them all. Luckily an ideal choice is not necessary, just one that gives good results. Moreover, too in-depth a search might lead to a function overfit to our data. An obvious set to try for either scoring function is the set of functions of the form l^p, where l is the length of a given match (or the total number of unmatched characters) and p is a real number greater than one. This is a very nice set of functions; we can be certain that they encourage longer matches, and it is small enough that we could simply try them all and take the best. Thus we tested scoring functions of the form:

    score = (sum over tiles i of l_i^m) - c^u

where l_i is the length of the i-th tile, c is the total number of unmatched characters, and m and u are parameters. We tried every value for m and u between 1.0 and 2.0, taking steps of 0.1. In general the best results were when m and u were small and relatively close to each other. The average upper bound scores of the best performing area are shown below:

Table 3-2 Upper bound scores for different scoring functions. u defines how much we discourage long runs of unmatched characters and m defines how much we encourage long matches.

Unmatched           Matched power m
power u      1.0    1.1    1.2    1.3    1.4    1.5    1.6    1.7    1.8    1.9    2.0
1.0          1.788  1.733  1.711  1.716  1.716  1.722  1.727  1.727  1.727  1.727  1.727
1.1          1.733  1.727  1.716  1.722  1.711  1.711  1.722  1.722  1.722  1.722  1.722
1.2          1.733  1.727  1.722  1.711  1.716  1.711  1.716  1.716  1.716  1.716  1.716
1.3          1.738  1.733  1.733  1.722  1.727  1.727  1.727  1.716  1.716  1.716  1.716
1.4          1.766  1.744  1.738  1.727  1.727  1.733  1.727  1.727  1.716  1.716  1.716
1.5          1.805  1.777  1.738  1.733  1.727  1.733  1.733  1.727  1.727  1.716  1.716
1.6          1.888  1.827  1.805  1.744  1.738  1.738  1.733  1.733  1.727  1.727  1.716
1.7          1.95   1.855  1.822  1.811  1.777  1.772  1.738  1.733  1.733  1.727  1.727
1.8          2.005  1.944  1.866  1.811  1.805  1.794  1.777  1.766  1.738  1.733  1.727
1.9          2.116  2.005  1.922  1.861  1.816  1.805  1.805  1.777  1.772  1.777  1.738
2.0          2.244  2.088  1.994  1.911  1.844  1.827  1.805  1.805  1.777  1.772  1.777

Many of the above functions give similar results, and with the standard deviations of most of the above averages being around 5 or 6, it is difficult to be certain of a choice. The average lower bound scores gave similar results, with minimums achieved at the same positions. Based on these two tables we chose values of m and u in the center of the best performing region. It is worth noting that exponential and factorial functions were also tried. We had high hopes for them since they more closely reflect the probability of a match of the given length and push greedy and exact closer together; however, the polynomial functions outperformed them.
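To make this parameterization concrete, the sketch below evaluates a scoring function of this form for a single comparison. The tile lengths and unmatched-character count are assumed to be available from a run of RKR-GST, and the class and method names are our own illustration rather than the thesis code.

    // A minimal sketch of the parameterized scoring function: the sum of each tile
    // length raised to the matched power m, minus the unmatched-character count
    // raised to the unmatched power u. Inputs are assumed to come from RKR-GST.
    public class TileScore {

        static double score(int[] tileLengths, int unmatched, double m, double u) {
            double total = 0;
            for (int length : tileLengths) {
                total += Math.pow(length, m);      // longer tiles are rewarded super-linearly when m > 1
            }
            return total - Math.pow(unmatched, u); // penalize leftover, unmatched characters
        }

        public static void main(String[] args) {
            // For example, tiles of lengths 10 and 4 with 6 unmatched characters:
            System.out.println(score(new int[] {10, 4}, 6, 1.0, 1.0));  // 14 - 6 = 8.0
            System.out.println(score(new int[] {10, 4}, 6, 1.2, 1.1));  // favours the longer tile more
        }
    }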


3.5 Count Compare

It is remarkable how well m = 1 and u = 1 performed, considering that this scoring function does not value longer matches more. It simply counts up how many characters are matched and subtracts how many characters are not matched. Its accuracy is of interest to us because it can always be run in linear time. Given T and P, the number of characters that will be matched does not depend on the matches chosen. If there are three of a given letter in P and five of it in T, we will always match the three in P with three of the five in T. The total number of characters that will be matched is:

    matched = sum over each letter c of min(count_P(c), count_T(c))

where count_P(c) is the number of occurrences of the letter c in P and count_T(c) is the number of occurrences of c in T. The number unmatched is just the length of P plus the length of T minus twice the total number matched. We will call this technique Count Compare for obvious reasons. Experimentally it is shown to run between 10 and 1000 times faster than RKR-GST, depending on the length of the strings in question. Runtimes are discussed in more detail in section 4.4. Since Count Compare is so much faster than RKR-GST and gives similar results, we can use it to get reliable estimates quickly. Given a plagiarized file and a set of possible originals, Count Compare can be used to select a subset of files that are likely to be similar to the given file. We then run RKR-GST or other slower algorithms on the likely matches for a more detailed analysis to get the final results.
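Because the matched count depends only on per-letter frequencies, Count Compare reduces to a single pass over each string. The sketch below is a minimal illustration of that idea under our own naming; it is not the thesis implementation.

    import java.util.HashMap;
    import java.util.Map;

    // A minimal sketch of Count Compare, assuming the files have already been
    // encoded as strings of logical-element characters. All names are illustrative.
    public class CountCompare {

        // Counts how many times each character occurs in s.
        private static Map<Character, Integer> frequencies(String s) {
            Map<Character, Integer> counts = new HashMap<>();
            for (char c : s.toCharArray()) {
                counts.merge(c, 1, Integer::sum);
            }
            return counts;
        }

        // Returns matched minus unmatched characters, i.e. the m = 1, u = 1 score.
        public static int score(String p, String t) {
            Map<Character, Integer> pCounts = frequencies(p);
            Map<Character, Integer> tCounts = frequencies(t);
            int matched = 0;
            for (Map.Entry<Character, Integer> e : pCounts.entrySet()) {
                matched += Math.min(e.getValue(), tCounts.getOrDefault(e.getKey(), 0));
            }
            int unmatched = p.length() + t.length() - 2 * matched;
            return matched - unmatched;
        }

        public static void main(String[] args) {
            // For example, "ab" versus "abc": two characters match, one is unmatched.
            System.out.println(score("ab", "abc"));   // prints 1
        }
    }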


3.6 Global Alignment

3.6.1 Problem

Sequence alignment is a string comparison technique borrowed from bioinformatics, where it is used to determine the similarity of sequences of DNA (and RNA). It attempts to line up as many characters as possible from one sequence of DNA with another when the two sequences are placed side by side. It does this by inserting blank characters into the sequences. DNA changes through insertions, deletions and mutations at the nucleotide level. Inserting blank characters into a sequence can counteract deletions from one sequence or insertions into the other sequence. Thus sequence alignment is a reasonable measure of how much two sequences of DNA have changed since their common ancestor.

More formally, given two strings T and P and a function s that associates a pair of characters with a score, global alignment inserts blank characters into T and P to create two new strings T' and P' which maximize the sum of the values of s applied to the i-th character of T' and the i-th character of P':

    score = sum over i of s(T'(i), P'(i))

where T'(i) is the i-th character of T' and P'(i) is the i-th character of P'. For this definition to make sense, T' and P' must be of equal length. If T and P have different lengths, blank characters must be inserted until T' and P' have equal length.

Let's look at an example. Let T be aabca, let P be bcaaa, and let s be 1 if the characters are equal (but not both blank) and 0 otherwise. Without inserting any blank characters we get a score of

    a a b c a
    b c a a a
    0 0 0 0 1  = 1

However, if we insert a few blanks, represented by dashes, we can do much better:

    a a b c a - -
    - - b c a a a
    0 0 1 1 1 0 0  = 3

Choosing a function for s can be a difficult problem, since it is often hard to say how similar two characters are. This issue is discussed further in section 3.8. For now we assume s is chosen to encourage matching characters and discourage mismatches (as we used in the example). The score s(-, -) of aligning two blank characters must be negative, since otherwise we could always improve a solution by inserting more dashes.


A brute force solution to this problem would require trying a large number of cases. If we are interested in counting the number of possible solutions, it is useful to break them down by length. A solution can have a length as short as the longer string or as long as the sum of their lengths. For any given alignment length L, one must choose which L - |T| of the L positions are to hold dashes inserted into T. Then one must choose the dashes for P; these cannot be aligned with the dashes in T, thus we must choose the L - |P| dash positions of P from the |T| positions that do not already hold a dash. Summing over the possible lengths, the number of possible solutions is

    sum for L from max(|T|, |P|) to |T| + |P| of  C(L, L - |T|) * C(|T|, L - |P|)

where C(n, k) is the binomial coefficient "n choose k". Looking at the leading term, it is pretty clear this does not grow polynomially as |T| and |P| increase: increasing |T| and |P| by one multiplies the leading term by a factor that is greater than one and that itself grows as |T| and |P| do. Thus the number of possible solutions increases exponentially as |T| and |P| increase, and something more clever than a brute force solution is required.

3.6.2 Algorithm

The clever solution required was introduced by Saul Needleman and Christian Wunsch in 1970 (Saul B. Needleman, 1970). Many of the possible alignments share common sub-alignments.


The Needleman-Wunsch algorithm uses dynamic programming to take advantage of this. Their algorithm always achieves an optimal alignment in O(|T| * |P|) time. Solutions to the problem can be visualized as the set of paths from the top left corner of a |T| by |P| grid to the bottom right.

Figure 3-7 Some visual representations of alignment: on the left an empty grid, in the middle and right two examples of alignments with the paths they correspond to.

A path is made up of three types of segments: to the right, down, and down-and-to-the-right. Each type of segment represents one of the three possible alignments.

Figure 3-8 The types of possible path segments with what they represent. On the left is a horizontal line segment, which aligns a character from the top string with a dash from the left string. In the middle is a vertical line segment, which aligns a character from the left string with a dash from the top string. On the right is a diagonal line segment, which aligns a character from the top string with a character from the left string.


A segment to the right aligns the next unaligned character of the top string T with a dash in P. A downward segment aligns a dash in T with the next unaligned character of the left string P. A diagonal segment aligns the next unaligned character of T with the next unaligned character of P.

Figure 3-9 The types of possible path segments with the match they indicate; lines from the center of a line segment to the axes can be used to tell what that line segment aligns.

One can quickly see what a segment represents by dropping lines back from its midpoint to the axes. The letters these lines point to are the letters aligned; if one of these lines points to the space in between letters, a dash is aligned from that string. Since each line segment is independent of the path leading up to it, the maximal score of a path leading to any point (x, y) is:

1. the score of the horizontal segment leading to it plus the maximal score of the paths leading to the point to its left: s(T(x), -) + maxPathScore(x - 1, y);

2. the score of the vertical segment leading to it plus the maximal score of the paths leading to the point above it: s(-, P(y)) + maxPathScore(x, y - 1); or


3. the score of the diagonal segment leading to it plus the maximal score of the paths leading to the point above and to its left: s(T(x), P(y)) + maxPathScore(x - 1, y - 1).

This suggests the following recursion relation:

Pseudo Code 3-6 Global alignment

    define maxPathScore(x, y)
        if recordedBestPathScore[x][y] == unknown
            if x = 0 and y = 0
                result = 0
            else if x = 0
                result = s(-, P(y)) + maxPathScore(x, y - 1)
            else if y = 0
                result = s(T(x), -) + maxPathScore(x - 1, y)
            else
                result = max(s(T(x), -) + maxPathScore(x - 1, y),
                             s(-, P(y)) + maxPathScore(x, y - 1),
                             s(T(x), P(y)) + maxPathScore(x - 1, y - 1))
            recordedBestPathScore[x][y] = result
            return result
        else
            return recordedBestPathScore[x][y]

where recordedBestPathScore is a matrix, with entries for x from 0 to |T| and y from 0 to |P|, in which recordedBestPathScore[x][y] holds either the best score for a path aligning the first x characters of T with the first y characters of P, or unknown. Many paths contain common sub-paths. By recording the values of the optimal scores for these sub-paths in recordedBestPathScore we can avoid repeating calculations. The first three cases are base cases; if the point is in the first row or column (or both), there is only one possible path leading to it. With this function, all we have to do to get the global alignment of T and P is to take maxPathScore(|T|, |P|). Since we are only interested in how similar two strings are, not why they are similar, we do not calculate the path that yields the best score. If we were interested, we could find it by backtracking once recordedBestPathScore has been completely filled out: starting at the bottom right corner, at each point we look at which of the three options gave us the score for the point we are currently at.
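An iterative, table-filling version of the same recursion is often easier to reason about than the memoized recursive form above. The sketch below is a minimal illustration under our own assumptions: it uses a simple scoring function (1 for a match, 0 for a mismatch, -1 for aligning a character with a blank) rather than the optimized matrix discussed in section 3.8, and all names are ours.

    // A minimal, self-contained sketch of global alignment (Needleman-Wunsch) that
    // fills the score table bottom-up. The scoring values here are illustrative
    // assumptions, not the optimized matrix described in section 3.8.
    public class GlobalAlignment {

        static final int MATCH = 1, MISMATCH = 0, GAP = -1;

        static int s(char a, char b) { return a == b ? MATCH : MISMATCH; }

        // Returns the best global alignment score of t and p.
        public static int score(String t, String p) {
            int n = t.length(), m = p.length();
            int[][] best = new int[n + 1][m + 1];
            for (int x = 1; x <= n; x++) best[x][0] = best[x - 1][0] + GAP; // t against dashes only
            for (int y = 1; y <= m; y++) best[0][y] = best[0][y - 1] + GAP; // p against dashes only
            for (int x = 1; x <= n; x++) {
                for (int y = 1; y <= m; y++) {
                    int diag = best[x - 1][y - 1] + s(t.charAt(x - 1), p.charAt(y - 1));
                    int up   = best[x][y - 1] + GAP;   // dash in t against p's y-th character
                    int left = best[x - 1][y] + GAP;   // t's x-th character against dash in p
                    best[x][y] = Math.max(diag, Math.max(up, left));
                }
            }
            return best[n][m];
        }

        public static void main(String[] args) {
            System.out.println(score("aabca", "bcaaa"));
        }
    }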


3.7 Local Alignment

3.7.1 Motivation

Global alignment attempts to align the entire string T with the entire string P. Local alignment, on the other hand, finds the maximal score of aligning a subset of T with a subset of P. Formally, given some scoring function s which takes a pair of characters to a value, we attempt to find T' and P' that maximize the following function:

    score = sum over i of s(T'(i), P'(i))

where T' is a substring of T with blank characters inserted and P' is a substring of P with blank characters inserted.

3.7.2 Algorithm

The following algorithm was introduced by Temple Smith and Michael Waterman in 1981 (Temple Smith, 1981). It builds on the Needleman-Wunsch algorithm we saw in section 3.6.2. The difference is that we are now interested in any path, not just the path from one corner to the other. This means that the best path to a point can always be the path of length 0 (score 0) that starts at that point, or any of the options explored in the section on global alignment above. The pseudocode for local alignment is basically the same as that presented for global alignment (Pseudo Code 3-6), with the addition that the best score always has the option to be zero (the path of length 0).


Pseudo Code 3-7 Local alignment

    define maxPathScore(x, y)
        if recordedBestPathScore[x][y] == unknown
            if x = 0 and y = 0
                result = 0
            else if x = 0
                result = max(s(-, P(y)) + maxPathScore(x, y - 1), 0)
            else if y = 0
                result = max(s(T(x), -) + maxPathScore(x - 1, y), 0)
            else
                result = max(s(T(x), -) + maxPathScore(x - 1, y),
                             s(-, P(y)) + maxPathScore(x, y - 1),
                             s(T(x), P(y)) + maxPathScore(x - 1, y - 1),
                             0)
            recordedBestPathScore[x][y] = result
            return result
        else
            return recordedBestPathScore[x][y]

The above returns the best score for a path that leads to a given point, so to find the best path from any point to any point we must run maxPathScore(|T|, |P|) and then iterate over recordedBestPathScore to find its maximum. As with global alignment, we are not interested in the details of the path, but if we were interested we could backtrack to find it.

3.8 Optimizing s

Selecting a weighting function s for our alignment is a difficult task. This discrete function is a 23 (the number of types of characters we have plus the dash) by 23 matrix, with each element being the score for aligning one of the possible pairs of characters. It looks something like:

Figure 3-10 An example of the values of s for matching the following characters

         a     b     c     d     e     f     g     h     -
    a    0.78  0.32  0.34  0.28  0.01  0.07  0.34  0.42  1.37
    b    0.02  0.71  0.16  0.61  0.34  0.19  0.25  0.44  0.70
    c    0.06  0.01  0.92  0.15  0.12  0.37  0.17  0.06  1.38
    d    0.09  0.42  0.54  0.91  0.70  0.20  0.48  0.45  0.92
    e    0.82  0.47  0.39  0.03  0.83  0.37  0.12  0.02  0.95
    f    0.10  0.24  0.21  0.02  0.06  1.07  0.34  0.67  0.93
    g    0.22  0.16  0.32  0.14  0.08  0.26  1.02  0.61  1.01
    h    0.19  0.32  0.18  0.01  0.20  0.18  0.01  0.89  1.43
    -    0.46  1.29  1.12  0.35  1.33  1.31  0.76  1.40  1.00

It is tempting to believe this matrix should be symmetric; however, this is not necessarily true. Since one character is from the string viewed as the original file and the other is from a string viewed as a plagiarized file, it is best to treat them differently. For example, it is common for a plagiarized file to contain chunks of code the original does not, while the converse is not necessarily true. This means one should receive a greater penalty for inserting dashes into the plagiarized file than into the original file. This manifests itself in the scoring matrix, with the score for a dash inserted into the plagiarized file being more negative than the score for a dash inserted into the original file.

Unfortunately, this is not the kind of space that can be easily searched. We search it somewhat randomly, by repeatedly making random changes to the best solution we have found. We start with our best guess at the solution and calculate its upper bound score; then we make random changes to it and calculate the altered version's upper bound score. The function that performs better becomes our new best guess and the process repeats.
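The search described above is essentially a simple hill climb over candidate scoring matrices. The sketch below illustrates that loop under our own assumptions: evaluate() stands in for computing the average upper bound score over the dataset, perturb() for making a small random change to the matrix, and all names and constants are ours rather than taken from our implementation.

    import java.util.Random;

    // A minimal sketch of the random hill-climbing search over scoring matrices.
    // The 23x23 matrix size follows the text; everything else is an illustrative
    // assumption, and evaluate() is left as a placeholder.
    public class ScoreMatrixSearch {

        static final int ALPHABET = 23;          // 22 logical elements plus the dash
        static final Random RNG = new Random();

        // Placeholder: average upper bound score of the dataset under matrix s (lower is better).
        static double evaluate(double[][] s) { return 0.0; }

        // Copy the matrix and nudge one randomly chosen entry by a small random amount.
        static double[][] perturb(double[][] s) {
            double[][] next = new double[ALPHABET][ALPHABET];
            for (int i = 0; i < ALPHABET; i++) next[i] = s[i].clone();
            next[RNG.nextInt(ALPHABET)][RNG.nextInt(ALPHABET)] += RNG.nextGaussian() * 0.05;
            return next;
        }

        // Keep whichever of the current best and the perturbed matrix scores better.
        static double[][] optimize(double[][] initialGuess, int generations) {
            double[][] best = initialGuess;
            double bestScore = evaluate(best);
            for (int g = 0; g < generations; g++) {
                double[][] candidate = perturb(best);
                double candidateScore = evaluate(candidate);
                if (candidateScore < bestScore) {   // a lower average upper bound is better
                    best = candidate;
                    bestScore = candidateScore;
                }
            }
            return best;
        }
    }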


Figure 3-11 (below) shows the average upper bound score versus the number of generations for several runs.

Figure 3-11 The average upper bound scores after the given number of generations of optimization; four runs are shown.

This technique is only effective if one has a guess at the solution and the space is relatively smooth. We have not proved it, but we believe both of these are true in this case. We expect the space is smooth since small changes in s generally only cause small changes in alignment, which in turn only cause small changes in the average upper/lower bound score. We verify this experimentally: we tend to see small changes in score if we make small changes to s. We also have to make a guess at a starting point; we have good reason to believe s should encourage matches and discourage dashes, so our initial guess does. The optimal s retains these properties, although this might be a self-fulfilling prophecy. There is also a serious danger of the supposed optimal solutions ending up in local minima. Again we cannot be certain, but it is encouraging that many runs appear to approach the same lower bound. Since we are interested in how alignment performs on more general datasets, over-optimizing is a serious concern. We ran this algorithm using subsets of the data to ensure that our selection of s did not depend too heavily on the files in question. Fortunately, the scoring functions


generated using subsets of the data generally gave good results on the whole dataset, with average upper bound scores close to those attained by optimizing on the entire dataset. Despite the difficulties of the problem and the limitations of our approach, we are able to dramatically improve the average upper bound score.

3.9 Summary

We have introduced the algorithms we intend to compare. We would now like to summarize what they do and say a few words about their strengths and weaknesses. RKR-GST compares strings by looking at their common substrings. It should perform well against chunks of code being moved around and added, since those will not disrupt the substrings. We expect RKR-GST will do less well with random insertions, since those will disrupt its substrings. Count Compare is a special case of RKR-GST; it compares strings by counting the number of matched and unmatched characters. It should have accuracy worse than that of RKR-GST but run much faster. Global alignment compares strings in terms of insertions and deletions of single characters. It should perform well against changes to single lines, but badly if large chunks of code are moved or removed. Local alignment uses a similar approach to global alignment, except it only looks at substrings of the strings it compares (as opposed to the entire strings). For the most part it should perform much like global alignment; however, it may be able to catch plagiarized code combined with original code, which all the others would struggle with. Unfortunately, it is also likely to come up with a lot of false positives since it does not consider a lot of information.


4 Results

4.1 Introduction

In this chapter we present the results of running the algorithms discussed in the previous chapter on our dataset. We begin by examining the accuracy of our plagiarism detection algorithms. Second, we discuss how well each algorithm does with respect to each category of plagiarism. Third, we compare the runtimes of each algorithm with respect to the length of the strings in question. Finally, we discuss the effects of using Count Compare to speed up the other algorithms.

4.2 Accuracy

Table 4-1 A summary of the results: the upper and lower bound scores for each algorithm

                      Upper bound score             Lower bound score
Algorithm             Average    Std. deviation     Average    Std. deviation
Count Compare         1.7888     6.28111            1.1277     0.8012
RKR-GST               1.7111     6.28111            1.0777     0.8012
Local Alignment       2.0277     9.1324             1.0055     0.0745
Global Alignment      1.6        6.066              1.0055     0.0745

Table 4-1 summarizes the results for each algorithm. In all cases the original file is on average one of the two closest to the plagiarized file. Global alignment appears to produce the highest degree of accuracy, with RKR-GST following closely behind. The difference between their scores is eclipsed by either algorithm's standard deviation; however, global alignment and


RKR-GST tend to do badly on the same files, with global alignment consistently performing better, so this is not a concern. We are unsurprised that global alignment has a better average upper bound score than local alignment, though they share a similar lower bound. Local alignment includes less information, since it only considers part of the strings it compares. Local alignment can successfully find the original file, but it often finds many other files that it considers equally similar to the original. This results in considerably more ties, and in the difference in average upper bound score.

In all cases, the average upper bound scores are not the result of a nice distribution. All the algorithms predict the correct file over 90% of the time; however, when they miss it is often by a large margin. For example, the composition of local alignment's upper bound average is shown below:

Table 4-2 The distribution of upper bound scores produced by local alignment; on the left the upper bound score achieved, on the right the number of files with that score

Upper Bound Score    Number of Files
1                    172
2                    4
3                    1
18                   1
51                   1
112                  1

The error rate is largely driven by three files. Local alignment correctly identified 172 of the 180 files, but found one original file to be the 112th most similar to its plagiarism. This explains the large (compared to the averages) standard deviations we see in Table 4-1. It also illustrates how


unstable our results are with respect to changes in the dataset: adding or removing one of these high scoring files would drastically change our results.

The files the algorithms performed badly on tend to be short. At the moment we include all files in our comparison; however, it would be reasonable to throw out all files shorter than a certain length, on the grounds that they do not contain enough information to reliably determine whether they are plagiarisms. The graph below shows how our average upper bounds would improve if one throws out all files shorter than the given length.

Figure 4-1 The average upper bound score for all four algorithms when we consider only strings longer than the given length

These algorithms perform drastically better if the shortest files are removed. In other words, we are drastically more successful in finding the best match for a string as the string gets longer. To illustrate this, removing the three shortest files improves local alignment's average


upper bound score from 1.6 to 1.033898. It is unsurprising but, nevertheless, comforting that these algorithms get better results when more information is available. We see no improvement in average lower bound scores from removing the shortest files. This is because the difficulty with short files is not that the original file does not receive the highest score, but that it is one of many files that does. The files affecting our average lower bounds are legitimately difficult to detect. They are of average length with substantial changes.

We have a good understanding of why RKR-GST does better than Count Compare, and why global alignment outperforms local alignment, but why does global alignment outperform RKR-GST? We offer two reasons. First, global alignment has the advantage of a more detailed scoring function. In particular, its scoring system takes the letters in a match into account, while RKR-GST only cares about the length of the match. This means global alignment's scoring function has been finely tuned to reflect the relationships between different letters (logical elements). In our case optimizing the scoring function is particularly easy, since we only need to balance the needs of the small set of files that tend to give us trouble. Second, RKR-GST is much less stable against small changes; a single insertion breaks a large match into two small matches and introduces a mismatch, while in global alignment this just introduces a mismatch.

4.3 Accuracy by Category

In this section, we discuss the details of how these algorithms perform against the various types of plagiarism discussed in Figure 3-1. A comparison is somewhat difficult because so few of the scores being averaged are not 1. There are only a few files in each category that contribute meaningfully to the scores. We do not have enough data to be sure that any patterns we do see are not just noise. In an attempt to reduce the noise we removed some of the extremely short files.


Table 4-3 below shows the lower bound averages by category and algorithm. We compare lower bound averages since we are interested in what we fail to find. The average upper bound tells us more about an algorithm's ability to differentiate false positives from real plagiarisms.

Table 4-3 Summary of lower bound scores by plagiarism category and algorithm

#  Type                                                                  CC      RKR-GST  LA      GA
1  Changing comments or formatting                                       1       1        1       1
2  Changing identifiers or data types                                    1       1        1       1
3  Replacing expressions with equivalents, changing the structure of
   iteration or selection statements, replacing procedure calls with
   the procedure body                                                    1.667   1.133    1       1
4  Changing the order of operands in expressions or of independent
   statements                                                            1       1        1       1
5  Adding redundant statements or variables, introducing
   non-structured statements                                             1.1     1.33     1.033   1.033
6  Mixed                                                                 1       1        1       1

Unsurprisingly, all the algorithms do well against changes in formatting and data types. These are easy to identify and are completely resolved by preprocessing. We expected mixed results from the third category. The general categories we used in preprocessing should catch the replacement of loops and conditionals with their logical equivalents. Replacing procedure calls with procedure bodies, however, is more difficult as it implies block reordering or addition. Count Compare struggles with any change involving the addition of lines. RKR-GST should do well against block reordering but could be fooled by block addition. Local


alignment should do well against block addition but operates poorly against block reordering. Global alignment should do poorly against any large-scale changes. The poor results we see for Count Compare and RKR-GST are the result of a single file with a large amount of code added. As expected, Count Compare and RKR-GST are confused; interestingly though, both alignment algorithms successfully identify the original file. It is unclear why string tiling fails here but alignment succeeds. We believe it is because we value the unmatched portions in RKR-GST with a polynomial, while they grow linearly in alignment. Thus, large unmatched portions have more effect on RKR-GST.

While changing the order of independent expressions should have some effect, it was unclear whether the effect would be significant. As far as RKR-GST is concerned, it turns a long match into several short matches, but no characters are missed. For alignment, some of the reordered characters will be missed. Count Compare does not notice the change. In this case it did not end up mattering.

Adding redundant statements proved problematic across the board, as it should. For Count Compare, extra lines translate directly into negative points. For RKR-GST it not only causes mismatches but divides large matches into several shorter matches; this is why we expect it did so badly. For local and global alignment, it means extra dashes, which can be more or less important depending on the character in question, but is certainly less important than an extra line is to Count Compare.

The mixed plagiarism was meant to be a test of how well these algorithms could deal with multiple types at the same time. All the algorithms were successful in identifying the original files.

4.4 Runtime


Decent runtimes are very important for plagiarism detection, since large sets of potentially large files need to be compared. All the algorithms were selected with speed in mind. We expect Count Compare to run the fastest, in linear time. RKR-GST should run the next fastest; its runtime depends on the strings in question, but we expect it to be close to linear in practice, although it should run slowly relative to the other algorithms on short strings (Wise M. J., 1996). The alignment algorithms should run the slowest, taking quadratic time in all cases.

Speed comparisons were run on an HP dv7-2185dx running Windows 7. The dv7 has a 2.0GHz Intel Core 2 Quad Q9000 and 6GB of DDR2 RAM. All algorithms were written in Java 7 (Standard Edition). The first set of speed tests compared runtime to the length of the files compared. For each length, each algorithm was run on the same set of 500 pairs of strings. The strings were randomly selected substrings of encoded files from our dataset. All the algorithms were run on the same set of files.
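A timing harness for this kind of test can be quite small. The sketch below shows one way such a measurement might be structured; the function-based interface and all names are our own illustration, not the thesis code.

    import java.util.List;
    import java.util.function.BiFunction;

    // A minimal sketch of the per-length timing test: run one algorithm over the
    // same 500 string pairs and report the total wall-clock time. The algorithm is
    // passed in as a function from a pair of strings to a score.
    public class SpeedTest {

        // Returns total seconds spent comparing every pair with the given algorithm.
        static double totalSeconds(BiFunction<String, String, ? extends Number> algorithm,
                                   List<String[]> pairs) {
            long start = System.nanoTime();
            for (String[] pair : pairs) {
                algorithm.apply(pair[0], pair[1]);   // the score is discarded; only the time matters
            }
            return (System.nanoTime() - start) / 1_000_000_000.0;
        }
    }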


The graph below shows the total time it takes to compare 500 strings of lengths 1 to 500 for each of the four algorithms.

Figure 4-2 Total time to compare 500 pairs of strings of the given lengths, for all four algorithms

This demonstrates the remarkable speed of Count Compare. By length 500 it runs 1000 times faster than its competitors. It is difficult to see on a graph at this scale, but even when comparing short strings the difference is substantial. Between lengths 0 and 50 it is on average 35 times faster than global alignment, its next closest competitor, and 100 times faster than RKR-GST (which performs poorly at short lengths).

In addition to demonstrating Count Compare's speed, the above suggests almost no difference between the runtimes of RKR-GST and alignment. Both string tiling and alignment appear to be running in quadratic time with respect to the length of the strings. The alignment algorithms always run in quadratic time, with local alignment being slightly slower. RKR-GST's runtime, however, depends on the number and size of the matches; worst case it runs in roughly cubic time and best case in roughly linear time. We expect it performs relatively badly here because many of our strings are made up of a few letters (recall that assignment comprises almost 40% of all characters in our dataset). This means pairs of strings tend to contain more matches than one would expect from strings drawn from a 22-character alphabet. RKR-GST performs best when the strings it is comparing contain very few matches, so our frequent matches increase our runtime. To demonstrate how RKR-GST might run on other datasets, we measured its runtime on 500 strings randomly generated from different sized alphabets. Our intention is to test how RKR-GST runs on datasets where the strings contain different numbers of matches. We achieve this by randomly generating test strings from different sized alphabets, since strings from smaller alphabets should contain more matches. Figure 4-3 below shows RKR-GST runtime on different sized alphabets.


Figure 4-3 Total time for RKR-GST to compare 500 strings of various lengths randomly generated from different sized alphabets

As expected, decreasing the size of the alphabet increases the number of matches and causes RKR-GST to run drastically slower. Our dataset appears to run similarly to a randomly generated dataset with an 8-character alphabet.

4.5 Improving Runtimes

Runs on the full dataset were on the same machine discussed above. However, since we make a large number of independent calculations, we use several threads to take advantage of the four cores available to us. The runtimes for each algorithm running on the entire dataset are displayed below. These times include about 4s spent preprocessing.


Table 4-4 Total time for each algorithm to compare each plagiarized file to all the potential original files

Algorithm            Time (s)
RKR-GST              921.890
Local Alignment      439.361
Global Alignment     438.961
Count Compare        13.925

As expected, Count Compare is much faster than its competitors, completing the task 66 times faster than RKR-GST and 30 times faster than local/global alignment. What is surprising is how much slower RKR-GST is compared to local and global alignment, given Figure 4-2. RKR-GST runs much slower than local/global alignment on short strings (lengths 0 to 100). Since most of the strings are short (see Figure 3-2), this turns out to make a large difference.

It is possible to improve these times with little to no effect on our accuracy by using Count Compare to select likely files, so we do not have to run the slower algorithms on the entire dataset. The improvement to runtimes depends on how strictly we select likely files. There are many possible ways to define a likely file. We selected the following: given a plagiarism and a set of possible originals, a 'likely file' is one of the possible originals whose Count Compare score is sufficiently close to the highest score achieved by any of the potential originals. In particular, a 'likely file' is any file whose Count Compare score is less than some number r of standard deviations from the highest score among the potential originals.
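As a concrete illustration of this filtering step, the sketch below selects the likely files for one plagiarized file using Count Compare scores and a threshold of r standard deviations. It assumes the CountCompare.score method sketched in section 3.5 and otherwise uses names of our own choosing.

    import java.util.ArrayList;
    import java.util.List;

    // A minimal sketch of selecting 'likely files' with Count Compare before running
    // a slower algorithm. Assumes CountCompare.score from the earlier sketch; the
    // rest of the names are illustrative.
    public class LikelyFileFilter {

        // Returns the originals whose Count Compare score is within r standard
        // deviations of the highest score among the candidates.
        public static List<String> likelyFiles(String plagiarized, List<String> originals, double r) {
            if (originals.isEmpty()) return new ArrayList<>();

            double[] scores = new double[originals.size()];
            double best = Double.NEGATIVE_INFINITY, sum = 0;
            for (int i = 0; i < originals.size(); i++) {
                scores[i] = CountCompare.score(plagiarized, originals.get(i));
                best = Math.max(best, scores[i]);
                sum += scores[i];
            }
            double mean = sum / scores.length, varSum = 0;
            for (double s : scores) varSum += (s - mean) * (s - mean);
            double stdDev = Math.sqrt(varSum / scores.length);

            List<String> likely = new ArrayList<>();
            for (int i = 0; i < originals.size(); i++) {
                if (scores[i] >= best - r * stdDev) {      // close enough to the best score
                    likely.add(originals.get(i));           // keep for the slower, detailed comparison
                }
            }
            return likely;
        }
    }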


Figure 4-4 below compares how strictly we use Count Compare to eliminate files (expressed in terms of r) with the total runtimes for RKR-GST and global alignment accelerated by Count Compare. Low values of r select a very small set of likely files, while large values of r select a very large set of likely files.

Figure 4-4 Total time for RKR-GST and global alignment accelerated by Count Compare to compare each plagiarized file to all possible originals, for different values of r

Figure 4-4 shows how combining Count Compare with the other algorithms substantially improves runtimes. Unsurprisingly, the gains were more substantial as we more aggressively removed unlikely matches. We were able to improve global alignment's runtime by over 33 times and RKR-GST's runtime by over 60 times.

Figure 4-5 Upper bound scores of algorithms when accelerated by Count Compare

r                                       Global Alignment    RKR-GST
Full dataset                            1.6                 1.7111
Dataset pruned using Count Compare:
0                                       1.7833              1.7444
0.1                                     1.6                 1.7111
0.2                                     1.6                 1.7111
0.3                                     1.6                 1.7111
0.4                                     1.6                 1.7111
...                                     1.6                 1.7111

In nearly all cases we saw no change in the quality of the results. We tested r from 0 to 4 at intervals of 0.1. The only case in which the accuracy was affected was when r = 0. In this


case, the slower algorithms were only used to resolve ties for first place; even then, both algorithms still performed better than Count Compare, the worst case. In no case did the lower bound score change. On this dataset an r of 0.1 is the best choice; however, this is dataset dependent.

4.6 Summary

In this chapter we have compared several algorithms in terms of both speed and accuracy. Global alignment appears to offer the best accuracy, with RKR-GST following closely behind. Count Compare runs by far the fastest, followed by nearly identical runtimes from the alignments and RKR-GST. Finally, we saw that in almost all cases it is beneficial to use Count Compare to improve the runtimes of the slower algorithms.

Considering global alignment's results, its lack of popularity in the literature is surprising. We conjecture that it does less well with a traditional measure of success because of its low standard deviation. A low standard deviation is problematic because it makes the correct results harder to pick out of a large number of comparisons. Any results we see are a result of the dataset we use. It is also likely that the structure of Visual Basic is better suited to alignment than to RKR-GST. Visual Basic in general, and our dataset in particular, has very few function calls; this makes syntax changes at the function level difficult. RKR-GST works in terms of chunks of code, so it naturally performs well when changes are made at the level of functions. We also believe that, because of the nature of Visual Basic and the level of the programs in question, there is less variety in the strings than we would otherwise expect. This could cause long matches to be less rare than RKR-GST expects them to be. These factors may have negatively affected RKR-GST's performance.


5 Conclusion

We applied several biological algorithms to a large collection of student code in order to compare their ability to detect several plagiarized files. Given a plagiarized file, all the algorithms were able to correctly select the original file between 91 and 96 percent of the time. Most of the mistakes were made on extremely short files.

While all the algorithms performed well, global alignment delivered the best results. It produced an average upper bound score of 1.6 and an average lower bound of 1.0055. The only category of plagiarism it struggled with was the addition of redundant statements or variables, which may just be a limitation of the string-based approach, since it handled the category gracefully compared to the other algorithms. Global alignment's good results are somewhat surprising considering its minimal presence in the literature. We expect this is because it might be difficult to draw a line between similar files and plagiarized files given its score distribution.

RKR-GST performed the second best. We are not surprised by its good results because of its popularity in the literature. It produced an average upper bound score of 1.711 and an average lower bound of 1.0777. It struggled with random insertions and with replacing procedure calls with procedure bodies. Random insertions are difficult for it because they break up the chunks of matching code it uses to determine similarity. It struggled with replacing procedure calls with procedure bodies because this can sometimes generate a large discrepancy between the length of the original file and the plagiarized file; our implementation struggled with this because we penalize unmatched characters.

Local alignment performed the worst of the algorithms we examined. It was able to successfully identify the original files, but they often were one of many files tied for most similar. It produced an upper bound score of 2.0277 and a lower bound of 1.0055. We expect its poor


performance is because it disregards portions of the strings it compares, leaving it with too little information to distinguish between similar files. Its troubles were particularly pronounced when comparing large files.

Finally, we introduced Count Compare, a simple string comparison method that did decently on its own, but whose true use is improving the runtimes of the other algorithms. We were able to use it to improve global alignment's runtime by thirty times and RKR-GST's runtime by sixty times with no loss in performance.

5.1 Future Work

From here there are many possible next steps. We could use what we have learned to implement a proper plagiarism detection system, continue to explore the performance of different algorithms, test different scoring systems for RKR-GST, test out different encoding systems, or look at more datasets to verify our results. We list some tempting options below.

1. We are interested in transitioning to a more standard measure of success. In particular, we would like to measure the likelihood of being plagiarized in terms of a high scoring comparison's distance from other high scoring comparisons that share a file with it. This is in contrast to more traditional techniques, which consider a high scoring comparison's distance from all comparisons which share a file with it.

2. We are interested in adjusting our preprocessing to include more information, since all the algorithms performed better when they had more information. There are different ways to approach this. One option is to divide assignment into several logical elements by the type of variable being assigned, since it makes


up such a disproportionate amount of the characters in our dataset. Hopefully this would make our data more diverse and decrease the number of ties.

3. We would like to try having RKR-GST value a tile according to the probability of that match occurring. This may be impossible, since it is computationally expensive to count and store all the possible matches; however, it might be possible using a frequent pattern tree (Pang-Ning Tan, 2005).


Bibliography

Chao Liu, C. C. (2006). GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis. Knowledge Discovery and Data Mining, pp. 872-881.

David Gitchell, N. T. (1999, March). Sim: A Utility for Detecting Similarity in Computer Programs. ACM SIGCSE Bulletin, 31(1), pp. 266-270.

Grier, S. (1981, February). A Tool that Detects Plagiarism in Pascal Programs. SIGCSE Bulletin, 13(1).

Heckel, P. (1978). A Technique for Isolating Differences Between Files. Communications of the ACM, 21(4), pp. 264-268.

J. A. Faidhi, S. K. (1987). An Empirical Approach for Detecting Program Similarity within a University Programming Environment. Computers and Education, 11(1), pp. 11-19.

Kammer, M. L. (2011, May). Plagiarism Detection in Haskell Programs Using Call Graph Matching.

Kristina L. Verco, M. J. (1996). Software for Detecting Suspected Plagiarism: Comparing Structure and Attribute Counting Systems. Proc. of 1st Australian Conference on Computer Science Education.

Lutz Prechelt, G. M. (2000, March 28). JPlag: Finding Plagiarisms Among a Set of Programs. Journal of Universal Computer Science, 8(11).

Michael R. Garey, D. S. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W. H. Freeman & Co.

Mike Joy, M. L. (1999, May). IEEE Transactions on Education, 42(2), p. 129.

Pang-Ning Tan, M. S. (2005). Introduction to Data Mining. Addison-Wesley.

Richard M. Karp, M. O. (1987, March). Efficient Randomized Pattern-Matching Algorithms. IBM Journal of Research and Development, 31(2), pp. 249-260.

Saul B. Needleman, C. D. (1970, March 28). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), pp. 443-453.

Saul Schleimer, D. S. (2003). Winnowing: Local Algorithms for Document Fingerprinting. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 76-85.

Temple Smith, M. W. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), pp. 195-197.


Whale, G. (1986). Detection of Plagiarism in Student Programs.

Whale, G. R. (1990). Identification of Program Similarity in Large Populations. The Computer Journal, 33(2), pp. 140-146.

Wise, M. J. (1993). Neweyes: A New System for Comparing Biological Sequences Using the Running Karp-Rabin Greedy String Tiling Algorithm. Third International Conference on Intelligent Systems for Molecular Biology, pp. 393-401.

Wise, M. J. (1996). YAP3: Improved Detection of Similarities in Computer Programs and Other Texts. ACM SIGCSE Technical Symposium on Computer Science Education, pp. 130-134.


60 Appendix : Raw Results We have included the results on each individual file for each of the algorithms. Figure A 1 Raw Results Id Type Length Count Compare RKR GST Local Alignment Global Alignment U pper Bound L ower Bound U pper Bound L ower Bound U pper Bound L ower Bound U pper Bound L ower Bound 1 1 69 1 1 1 1 1 1 1 1 2 1 46 1 1 1 1 1 1 1 1 3 1 30 1 1 1 1 1 1 1 1 4 1 48 1 1 1 1 1 1 1 1 5 1 202 1 1 1 1 1 1 1 1 6 1 152 1 1 1 1 1 1 1 1 7 1 45 1 1 1 1 1 1 1 1 8 1 109 1 1 1 1 1 1 1 1 9 1 43 1 1 1 1 1 1 1 1 10 1 78 1 1 1 1 1 1 1 1 11 1 144 1 1 1 1 1 1 1 1 12 1 70 1 1 1 1 1 1 1 1 13 1 74 1 1 1 1 1 1 1 1 14 1 26 2 1 2 1 1 1 1 1 15 1 121 1 1 1 1 1 1 1 1 16 1 24 3 1 3 1 1 1 1 1 17 1 41 1 1 1 1 1 1 1 1 18 1 86 1 1 1 1 1 1 1 1 19 1 20 13 1 13 1 18 1 13 1 20 1 58 1 1 1 1 1 1 1 1 21 1 64 1 1 1 1 1 1 1 1 22 1 106 1 1 1 1 1 1 1 1 23 1 67 1 1 1 1 1 1 1 1 24 1 29 1 1 1 1 1 1 1 1 25 1 121 1 1 1 1 1 1 1 1 26 1 157 1 1 1 1 1 1 1 1 27 1 36 1 1 1 1 1 1 1 1 28 1 311 1 1 1 1 1 1 1 1 29 1 43 1 1 1 1 1 1 1 1


61 30 1 93 1 1 1 1 1 1 1 1 31 2 175 1 1 1 1 1 1 1 1 32 2 158 1 1 1 1 1 1 1 1 33 2 73 1 1 1 1 1 1 1 1 34 2 50 1 1 1 1 1 1 1 1 35 2 75 1 1 1 1 1 1 1 1 36 2 125 1 1 1 1 1 1 1 1 37 2 17 11 1 11 1 51 1 11 1 38 2 12 81 1 81 1 112 1 81 1 39 2 33 2 1 2 1 1 1 1 1 40 2 28 1 1 1 1 1 1 1 1 41 2 142 1 1 1 1 1 1 1 1 42 2 44 1 1 1 1 1 1 1 1 43 2 142 1 1 1 1 1 1 1 1 44 2 28 3 1 3 1 3 1 3 1 45 2 36 1 1 1 1 1 1 1 1 46 2 28 1 1 1 1 1 1 1 1 47 2 29 1 1 1 1 1 1 1 1 48 2 93 1 1 1 1 1 1 1 1 49 2 101 1 1 1 1 1 1 1 1 50 2 151 1 1 1 1 1 1 1 1 51 2 84 1 1 1 1 1 1 1 1 52 2 33 1 1 1 1 1 1 1 1 53 2 3 9 1 1 1 1 1 1 1 1 54 2 44 1 1 1 1 1 1 1 1 55 2 66 1 1 1 1 1 1 1 1 56 2 55 1 1 1 1 1 1 1 1 57 2 103 1 1 1 1 1 1 1 1 58 2 61 1 1 1 1 1 1 1 1 59 2 21 1 1 1 1 1 1 1 1 60 2 215 1 1 1 1 1 1 1 1 61 3 91 1 1 1 1 1 1 1 1 62 3 127 1 1 1 1 1 1 1 1 63 3 56 1 1 1 1 1 1 1 1 64 3 66 23 21 5 5 1 1 1 1 65 3 428 1 1 1 1 1 1 1 1 66 3 112 1 1 1 1 1 1 1 1 67 3 41 1 1 1 1 1 1 1 1 68 3 115 1 1 1 1 1 1 1 1 69 3 149 1 1 1 1 1 1 1 1 70 3 44 1 1 1 1 1 1 1 1 71 3 138 1 1 1 1 1 1 1 1 72 3 20 1 1 1 1 1 1 1 1


62 73 3 46 1 1 1 1 1 1 1 1 74 3 86 1 1 1 1 1 1 1 1 75 3 293 1 1 1 1 1 1 1 1 76 3 35 1 1 1 1 1 1 1 1 77 3 32 1 1 1 1 1 1 1 1 78 3 24 1 1 1 1 1 1 1 1 79 3 219 1 1 1 1 1 1 1 1 80 3 1367 1 1 1 1 1 1 1 1 81 3 39 1 1 1 1 1 1 1 1 82 3 55 1 1 1 1 1 1 1 1 83 3 68 1 1 1 1 1 1 1 1 84 3 32 1 1 1 1 1 1 1 1 85 3 34 1 1 1 1 1 1 1 1 86 3 33 1 1 1 1 1 1 1 1 87 3 278 1 1 1 1 1 1 1 1 88 3 63 1 1 1 1 1 1 1 1 89 3 58 1 1 1 1 1 1 1 1 90 3 135 1 1 1 1 1 1 1 1 91 4 38 1 1 1 1 1 1 1 1 92 4 57 1 1 1 1 1 1 1 1 93 4 186 1 1 1 1 1 1 1 1 94 4 47 1 1 1 1 1 1 1 1 95 4 25 1 1 1 1 1 1 1 1 96 4 33 1 1 1 1 1 1 1 1 97 4 115 1 1 1 1 1 1 1 1 98 4 206 1 1 1 1 1 1 1 1 99 4 160 1 1 1 1 1 1 1 1 100 4 121 1 1 1 1 1 1 1 1 101 4 118 1 1 1 1 1 1 1 1 102 4 23 3 1 2 1 2 1 2 1 103 4 84 1 1 1 1 1 1 1 1 104 4 38 1 1 1 1 1 1 1 1 105 4 54 1 1 1 1 1 1 1 1 106 4 42 1 1 1 1 1 1 1 1 107 4 66 1 1 1 1 1 1 1 1 108 4 153 1 1 1 1 1 1 1 1 109 4 21 1 1 1 1 1 1 1 1 110 4 84 1 1 1 1 1 1 1 1 111 4 140 1 1 1 1 1 1 1 1 112 4 68 1 1 1 1 1 1 1 1 113 4 129 1 1 1 1 1 1 1 1 114 4 113 1 1 1 1 1 1 1 1 115 4 52 1 1 1 1 1 1 1 1


63 116 4 47 1 1 1 1 1 1 1 1 117 4 44 1 1 1 1 1 1 1 1 118 4 134 1 1 1 1 1 1 1 1 119 4 222 1 1 1 1 1 1 1 1 120 4 125 1 1 1 1 1 1 1 1 121 5 66 1 1 1 1 1 1 1 1 122 5 36 1 1 1 1 1 1 1 1 123 5 172 1 1 1 1 1 1 1 1 124 5 175 1 1 1 1 1 1 1 1 125 5 61 1 1 1 1 1 1 1 1 126 5 189 1 1 1 1 1 1 1 1 127 5 33 2 1 2 1 1 1 1 1 128 5 29 1 1 1 1 1 1 1 1 129 5 185 1 1 1 1 1 1 1 1 130 5 46 1 1 1 1 1 1 1 1 131 5 47 1 1 1 1 1 1 1 1 132 5 177 1 1 1 1 1 1 1 1 133 5 41 6 4 11 11 2 2 2 2 134 5 73 1 1 1 1 1 1 1 1 135 5 108 1 1 1 1 1 1 1 1 136 5 273 1 1 1 1 1 1 1 1 137 5 30 1 1 1 1 1 1 1 1 138 5 53 1 1 1 1 1 1 1 1 139 5 409 1 1 1 1 1 1 1 1 140 5 38 1 1 1 1 1 1 1 1 141 5 133 1 1 1 1 1 1 1 1 142 5 31 1 1 1 1 1 1 1 1 143 5 32 1 1 1 1 1 1 1 1 144 5 109 1 1 1 1 1 1 1 1 145 5 53 1 1 1 1 1 1 1 1 146 5 39 1 1 1 1 1 1 1 1 147 5 93 1 1 1 1 1 1 1 1 148 5 30 1 1 1 1 1 1 1 1 149 5 40 1 1 1 1 1 1 1 1 150 5 33 1 1 1 1 1 1 1 1 151 6 87 1 1 1 1 1 1 1 1 152 6 93 1 1 1 1 1 1 1 1 153 6 79 1 1 1 1 1 1 1 1 154 6 42 1 1 1 1 1 1 1 1 155 6 23 1 1 1 1 1 1 1 1 156 6 62 1 1 1 1 1 1 1 1 157 6 26 1 1 1 1 1 1 1 1 158 6 74 1 1 1 1 1 1 1 1


64 159 6 30 1 1 1 1 1 1 1 1 160 6 78 1 1 1 1 1 1 1 1 161 6 36 1 1 1 1 1 1 1 1 162 6 31 1 1 1 1 1 1 1 1 163 6 57 1 1 1 1 1 1 1 1 164 6 35 1 1 1 1 1 1 1 1 165 6 107 1 1 1 1 1 1 1 1 166 6 30 1 1 1 1 1 1 1 1 167 6 30 1 1 1 1 2 1 1 1 168 6 79 1 1 1 1 1 1 1 1 169 6 58 1 1 1 1 1 1 1 1 170 6 106 1 1 1 1 1 1 1 1 171 6 428 1 1 1 1 1 1 1 1 172 6 65 1 1 1 1 1 1 1 1 173 6 154 1 1 1 1 1 1 1 1 174 6 91 1 1 1 1 1 1 1 1 175 6 111 1 1 1 1 1 1 1 1 176 6 49 2 1 2 1 2 1 2 1 177 6 51 2 1 2 1 2 1 2 1 178 6 28 1 1 1 1 1 1 1 1 179 6 24 3 1 3 1 1 1 1 1 180 6 34 1 1 1 1 1 1 1 1 Average 9 3.527 1.788 1.127 1.7111 1.0777 2.0277 1.0055 1.6 1.005

