Topic Tracking and Detection: A project.

mma3

I'm currently in my third year at university doing a Computer Science degree and wanted to write an article for Hugi. So I decided to write an article on my final year project! I'm only in the design stage of it at the moment so it will be quite general and vague. If people are interested I will write exactly how I completed it when I finally finish it in the summer of 2001! Until then here is the overview.

What is it?

The basic idea of the project is to collect on-line newspaper articles from nine world-wide sources. Then a computer program is run to decide which topic each article belongs to. For example: Bush/Gore election race, flooding, a murder case, etc.. It looks a simple enough task, exactly what I thought when first faced with it. But there are some quite interesting techniques that can be used to track the topics.

Pipeline

Many of us will be used to the idea of pipelines. My program will use this general approach to track topics. Each article goes through each stage.

The stages are as follows:

Strip the tags
Remove stop words
Conflation
Comparison and Weighting

I will now take each stage in turn and explain what each part does.

Strip the tags

The first stage of the production pipeline is to strip out all of the html tags from the document to leave only the important information. Other information that doesn't refer to the article is also removed as it could potentially affect the effectiveness of the system.

Remove stop words

All stop words are removed from the article as they do not add any useful information to the ability of the tracking program. Stop words are words such as: the, it, and, but, can etc.. These words obviously add meaning to the article but they aren't useful with regards to the tracking task.

Conflation

The article will now just contain words that are relevant to the tracking of topics. Stemming is one of the final things that needs to be done before articles can be compared to one another to decide which topic they belong to. Stemming involves taking a word and breaking it down into its most basic form.

For example:

RELIGIOUSLY becomes RELIGION
RELIGIONS becomes RELIGION

This helps as a match between RELIGIOUSLY and RELIGIONS is not found even though they are both obviously related. The stemming of the word allows similarities to be found later in the pipeline.

There are other ideas that can be brought into the mix.

Incorrect spellings: DRIEVING becomes DRIVING
Multi-word ideas: WORLD-WIDE and WORLDWIDE
Alternative spellings, e.g. national differences: COLOUR and COLOR

There are many other ways in order to optimise the stemming, the above outline the basic ones.

The Hard Bit: Comparison and Weighting

All that the article should be left with now is a list of words.

For example the article could contain the six words: Bush, election, USA, constitution, Florida and vote. (This is a simplification as there are likely to be hundreds of words rather than six!) We will call this article X (the date shows the date article was "released"). If this article was compared against existing articles:

Article X - (25th October) Bush, election, USA, constitution, Florida, vote

Article A - (21st October) Sea, barrier, flood, damage, Florida, millions.
Article B - (22nd October) Bill, Gates, lose, Netscape, legal, case.
Article C - (23rd October) Gore, election, democrat, Bush, constitution, Florida.
Article D - (24th October) Minister, vote, Japan, election, time, yellow.

Then the system would score each article for the number of matches:

A - 1/6
B - 0/6
C - 4/6
D - 2/6

An similarity matrix can then be created in order to represent the information.

A B C D X
A - 0 1 0 1
B - - 0 0 0
C - - - 0 4
D - - - - 2
X - - - - -

The matrix is "triangular" as C is never compared to D as C happened before D. (I hope that made sense!)

It is obvious that C scores the highest; a threshold function can be used in order to decide which articles X is related to. So if we say that only articles that score 3 or over are accepted then X will be seen to be linked to C. As with many computing ideas there is a trade-off. This time it's between missing articles and false alarms. The higher the threshold the more articles are missed, the lower the threshold the higher the number of articles are put into a topic that aren't actually related. So it's vital that the correct threshold is found.

From the threshold an adjacency matrix can be created:

A B C D X
A - 0 0 0 0
B - - 0 0 0
C - - - 0 1
D - - - - 0
X - - - - -

A 0 signifies that the articles are not related. A 1 signifies that there is a relation between the articles.

Other Stuff

There are many ways in which the system can be optimised. Ideas for doing so include:

Use of bigrams - When two words appear next to each other frequently they will be weighted higher than when there is a match than normal words. Words at start of article are weighted higher. Longer words weighted higher as they are likely to be more important.

Words that are found often in many documents are discarded as the are too "common".

Time penalty - When a topic hasn't appeared for a while it is likely that the topic has run its course, so increase the likelihood that a new article won't be added to a topic.

The End

As this example is quite small there is only one link. In my project I have collected 1000-1250 articles from all around the world (four UK sources, one US, one Middle East, one Australian, one Japanese and one South African source). So the matrices should be much larger and more complex.

Written by mma3