Written by B[]; Last updated: 04/06/2017
The Netizens team is a student-led hacking team at the University of Hertfordshire that works on many cyber security projects. These projects come in many forms, including hardware, software, procedures and much more. One of these interests has been the Fake News Challenge, or FNC as we have come to refer to it.
Our thinking is, "if you can detect fake news, you can detect malicious social hacking". This could be in the form of phishing emails, scam websites or malicious programs. The ability to detect human deception attempts allows for a whole class of systems that protect users. Bringing artificial intelligence to these systems allows for smarter detection.
Fake news, as defined on the FNC site (quoting the New York Times):
Fake news, defined by the New York Times as “a made-up story with an intention to deceive” [1], often for a secondary gain, is arguably one of the most serious challenges facing the news industry today. In a December Pew Research poll, 64% of US adults said that “made-up news” has caused a “great deal of confusion” about the facts of current events [2].
The goal of the competition is to explore how modern techniques in machine learning and artificial intelligence can be used to fact-check news sources. Doing so by hand is a time-consuming and often difficult task even for experts, but automating the process with stance detection allows for fast classification of news sources, provided it is accurate.
More can be found on the FNC site.
The formalism of the stance classification task is defined as follows:
Input
A headline and a body text - either from the same news article or from two different articles.
Output
Classify the stance of the body text relative to the claim made in the headline into one of four categories:
1. Agrees: The body text agrees with the headline.
2. Disagrees: The body text disagrees with the headline.
3. Discusses: The body text discusses the same topic as the headline, but does not take a position.
4. Unrelated: The body text discusses a different topic than the headline.
The scoring is done as follows:
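As we read the scoring description on the FNC site, a correct related/unrelated decision is worth 0.25 points, with a further 0.75 awarded for also getting the exact stance right on a related pair. Below is a minimal sketch of that weighting - our reading of the rules, not the organisers' official scorer, and all names are ours:

public class ScoreSketch {
    enum Stance { AGREE, DISAGREE, DISCUSS, UNRELATED }

    // 0.25 for getting related vs. unrelated right, plus a further 0.75
    // for the exact stance on a related pair.
    static double score(Stance[] gold, Stance[] predicted) {
        double total = 0.0;
        for (int i = 0; i < gold.length; i++) {
            boolean goldRelated = gold[i] != Stance.UNRELATED;
            boolean predRelated = predicted[i] != Stance.UNRELATED;
            if (goldRelated == predRelated) {
                total += 0.25;                              // related/unrelated correct
                if (goldRelated && gold[i] == predicted[i]) {
                    total += 0.75;                          // exact stance correct
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Stance[] gold = { Stance.AGREE, Stance.UNRELATED, Stance.DISCUSS };
        Stance[] pred = { Stance.AGREE, Stance.UNRELATED, Stance.DISAGREE };
        System.out.println(score(gold, pred)); // 1.0 + 0.25 + 0.25 = 1.5
    }
}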
There are several rules associated with the competition:
Our initial design thinking was to produce a framework that would let us try several different ideas, with the data input and output all handled by the program. The next line of thinking was to have the program driven from the terminal, so that we could leave multiple implementations running over a long period of time and come back to analyse the results.
This was done in Java, as it was considered an easy language to implement this in and allows for straightforward debugging. Our team members also use several different development OSes, so we wanted a language that works on most of them. In hindsight, a data-oriented language (such as R or Python) would have been more appropriate - live and learn.
For our initial design, we implemented 1-nearest neighbour. This gave us very poor results, matching nothing other than a single label.
Due to our implementation, running the program was taking 10 hours - for single nearest neighbour! We stored our database in RAM as strings and cast them to numeric values on every comparison - that was incredibly slow. After realising our mistake, we got our timing down to about a minute for analysing and matching - much better!
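The fix is simple to illustrate: parse each field to a double once when the data is loaded and keep the numeric vectors around, rather than re-parsing the stored strings inside every distance computation. A rough sketch of the idea (hypothetical names, not our actual loader):

import java.util.ArrayList;
import java.util.List;

public class LoadSketch {
    // Convert each record's fields to doubles once at load time, instead of
    // calling Double.parseDouble on stored strings in every comparison.
    static List<double[]> load(List<String[]> rawRows) {
        List<double[]> vectors = new ArrayList<>();
        for (String[] row : rawRows) {
            double[] v = new double[row.length];
            for (int i = 0; i < row.length; i++) {
                v[i] = Double.parseDouble(row[i]); // parse exactly once per field
            }
            vectors.add(v);
        }
        return vectors;
    }
}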
We later found a few more problems in our code:
A re-write soon sorted this out and we were able to validate that our code was indeed working as intended. Impressively, even for K values larger than 100, we were still able to classify our data in the order of minutes.
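For reference, the core idea is standard k-nearest neighbours: measure the distance from the query to every labelled point, keep the K closest and take a majority vote over their labels. A minimal sketch of that loop (hypothetical names, not the code from our repository):

import java.util.*;

public class KnnSketch {
    // Classify a query vector by majority vote over its K nearest labelled points.
    static String classify(double[] query, List<double[]> train, List<String> labels, int k) {
        Integer[] idx = new Integer[train.size()];
        double[] dist = new double[train.size()];
        for (int i = 0; i < train.size(); i++) {
            idx[i] = i;
            dist[i] = euclidean(query, train.get(i)); // distance to every labelled point
        }
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist[i]));

        Map<String, Integer> votes = new HashMap<>();
        for (int n = 0; n < Math.min(k, idx.length); n++) {
            votes.merge(labels.get(idx[n]), 1, Integer::sum); // tally the K nearest labels
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}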
Please refer to our snapshot repository for the source code of our competition version of the software.
TODO: Discuss the code layout.
The code can be visualised as follows (source):
As you'll see, everything has been written completely from scratch - not a single line has been borrowed from elsewhere. If anything, this makes the project impressive in its own right.
The following is a description of each class and its purpose:
Further comments exist in the source files themselves.
Running the following:
ant; java -jar fnc.jar -h
Reveals:
./fnc.jar [OPT]
OPTions
-c --clss Classified data
<FILE> CSV titles file
<FILE> CSV bodies file
-h --help Displays this help
-k --kner Set the K for k-nearest
<INT> Number of nearest
-j --jobs Set the number of jobs to run
<INT> Number of jobs
-m --mode Set the mode
knear (def) k-nearest neighbours
-r --runs Set the number of runs
<INT> Number of runs
-u --ucls Unclassified data
<FILE> CSV titles file
<FILE> CSV bodies file
-f --fix Fixes the results to it's original form.
<FILE> The original whole titles database
The options are self-explanatory, but of particular interest to us is the K value, which defines how many nearest neighbours we take into consideration. In the results section, we see the outcome of varying our K value.
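For example, a classification run with K set to 100 would look something like the following (the CSV file names here are placeholders):
java -jar fnc.jar -m knear -k 100 -c train_titles.csv train_bodies.csv -u test_titles.csv test_bodies.csv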
The following is the Marker.java interface (without comments):
public interface Marker{
public String getName();
public double analyse(DataSet ds, int y);
}
As you can see, it's very simple. Each of the markers is given a reference name, accepts the data to analyse at a given position and returns a double value associated with it. This double value can be anywhere in the full range of double, as it is normalised afterwards.
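As an illustration only (this is not one of our competition markers), a marker-style feature that measures how many distinct headline words also appear in the body could look like the sketch below; the real markers implement the Marker interface above against our DataSet class, whose accessors are omitted here:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WordOverlapSketch {
    // Fraction of distinct headline words that also appear in the body text.
    // The raw value would later be normalised across all rows.
    static double analyse(String headline, String body) {
        Set<String> headWords = tokens(headline);
        Set<String> bodyWords = tokens(body);
        if (headWords.isEmpty()) return 0.0;
        int shared = 0;
        for (String w : headWords) {
            if (bodyWords.contains(w)) shared++;
        }
        return (double) shared / headWords.size();
    }

    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }
}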
We have the following types of markers:
NOTE: We ended up turning off the dictionary match marker, as it actually made the matching process harder.
During our initial testing, we were getting results above 85% when classifying 80% of the training data, using the remaining 20% to train.
Varying the K value had a large effect, and there was clearly an advantage to looking at about 105 of a node's nearest neighbours for the purpose of classification.
Our actual result was 83.0% on the new test data, trained with the previously released training data. This is considerably lower than our original classification rate, but understandable given the amount of overlap in our previous training data.
One additional restriction added at the end of the competition was that, one hour before the closing time of 23:59 GMT on the 2nd of June, the results table would go "dark" in order to add some mystery as to who the winner would be.
Below we have the results table, pulled at 22:56 GMT:
submission_pk User Score (Rank)
404231 seanbaird 9556.500 (1)
404214 athene 9550.750 (2)
404189 jaminriedel 9521.500 (3)
404185 shangjingbo_uiuc 9345.500 (4)
404142 enoriega87 9289.500 (5)
404228 OSUfnc2017 9280.500 (6)
404205 florianmai 9276.500 (7)
404197 humeaum 9271.500 (8)
403966 pebo01 9270.250 (9)
404194 gtri 9243.000 (10)
404226 jamesthorne 9189.750 (11)
404133 Soyaholic 9146.250 (12)
403872 ezequiels 9117.500 (13)
404233 Tsang 9111.500 (14)
404129 siliang 9103.750 (15)
404062 neelvr 9090.750 (16)
404080 stefanieqiu 9083.250 (17)
404215 htanev 9081.250 (18)
404221 tagucci 9067.000 (19)
404070 saradhix 9043.500 (20)
404137 acp16sz 8995.750 (21)
404176 Gavin 8969.750 (22)
404092 JAIST 8879.750 (23)
404114 subbareddy 8855.000 (24)
404025 amblock 8797.500 (25)
404036 annaseg 8797.500 (25)
403978 Debanjan 8775.250 (26)
404183 johnnytorres83 8761.750 (27)
404086 avineshpvs 8539.500 (28)
403858 bnns 8349.750 (29)
404108 gayathrir 8322.750 (30)
404076 MBAN 8321.000 (31)
404209 martineb 8273.750 (32)
403809 contentcheck 8265.750 (33)
404204 vivek.datla 8125.750 (34)
404227 vikasy 7833.500 (35)
404175 barray 7778.250 (36)
404232 MicroMappers 7420.250 (37)
404082 borisnadion 7160.500 (38)
403913 etrrosenman 7106.750 (39)
404033 agrena 6909.750 (40)
404056 gpitsilis 6511.750 (41)
404224 mivenskaya 6494.750 (42)
403851 rldavis 5617.500 (43)
404065 isabelle.ingato 5298.500 (44)
404049 daysm 4911.000 (45)
404184 mrkhartmann4 4588.750 (46)
404106 bertslike 4506.500 (47)
All in all, it seems there were 50 entries in total - so by extension some people entered after the scoreboard finally went dark. Given our position in the table, if those late entries placed at random, they are more likely to have placed above us than below. This means we placed a respectable 36th-39th out of 50.
So now we look back, in reflection, at how things went.
Thinking about the implementation, I think we're mostly happy to have travelled the road we did, in terms of the learning experience of writing everything from scratch. Of course, given the opportunity again, I think we would possibly explore using a higher-level system for analysing the data, perhaps as several different tools rather than one large monolithic tool.
This also comes down to the classic "given more time". Both people on our team were extremely pressed for time with various happenings - both of us leaving the UK for other countries, as well as many other university-related pressures. This project really only saw perhaps a day of work a month - so it's impressive it got this far.
Speaking of being impressed, it's surprising how far you can get with just a k-nearest neighbour algorithm. It's certainly fast! You could very easily label data on the fly, possibly even in the browser per client if you were using this as a web tool.
We didn't come last! Given our limited time and our long path to a solution, we miraculously avoided last place. Classifying 83.0% of completely unknown data correctly is really not bad at all.
The competition itself was run very nicely, given the number of people competing from various countries and time zones. Some things could have been done better, but considering this was the organisers' first attempt and they were not sure how successful it would be, I think this is an impressive accomplishment by those involved. Hats off to them.
Lastly, this competition is called "FNC-1", the "1" implying it's the first in a series. I think we would definitely be interested in competing next year!