Spam or Not Spam 

Unsolicited email (spam) is annoying and clutters your mailbox. You are to write a spam filter - a program that reads email messages of regular ASCII characters (32-127) and tries to determine whether or not each message is spam.

How can we determine whether or not a message is spam? Spam contains words and phrases that are not common in genuine email messages. For example, the phrase

MAKE MONEY FAST, HONEY!!

is in all-uppercase, contains the word "money" and ends with a double exclamation mark.

One way to create a spam filter is to read through many spam and non-spam messages and to come up with a set of rules that will classify any particular message as spam or not. This process can be tedious and error prone to do manually. Instead you will write a program to automate the process.

A useful step in automatic classification is to split the text up into set of trigrams. A trigram is a sequence of three adjacent characters that appear in the message. A trigram is case sensitive. The example above is composed of the trigrams:

MAK
AKE
KE
E M
MO
MON
ONE
NEY
EY
Y F
FA
FAS
AST
ST,
T,
, H
HO
HON
ONE
NEY
EY!
Y!!

If we examine a sample of spam and non-spam messages we find that some trigrams are more common in spam; whereas others are more common in non-spam. This observation leads to a classification method:

Then we say that a message is spam if

similarity(fmessage, fspam) > similarity(fmessage, fnon - spam)

Input 

The input file contains several sets (less than 10) of input. The description of each set is given below.

The first line of each set contains three integers: s(0 < s < 5) the number of sample spam messages to follow; n(0 < n < 5) the number of sample non-spam messages to follow; c(0 < c < 10) the number of messages to be classified as spam or non-spam, based on trigram the trigram frequencies of the sample messages. Each message consists of several lines of text and is terminated by a line containing `ENDMESSAGE'. This line will not appear elsewhere in the input, and is not considered part of the message. No line has more than 1000 characters.

Input is terminated by a set where s = 0, n = 0 and c = 0. This set should not be processed.

Output 

For each set of input your program should output a single line to identify the serial of the input set. The output specification for each set is given below:

For each of the c messages, your program will output two lines. On the first line, output similarity(fmessage, fspam) and similarity(fmessage, fnon - spam). On the second line print the classification of the message (`spam' or `non-spam'). Round the numbers to five decimal digits.

For detailed description look at the output for sample input.

When forming trigrams, we never include a new line character. We don't include trigrams that span multiple lines, either.

Sample Input 

2 1 1
AAAA
BBBB CCCC
ENDMESSAGE
BBBB
ENDMESSAGE
AAAABBBB
ENDMESSAGE
AAABB
ENDMESSAGE
2 1 2
AAAA
BBBB CCCC
ENDMESSAGE
BBBB
ENDMESSAGE
AAAABBBB
ENDMESSAGE
AAABB
ENDMESSAGE
AAABB
ENDMESSAGE
0 0 0

Sample Output 

Set 1:
0.22222 0.73030
non-spam
Set 2:
0.22222 0.73030
non-spam
0.22222 0.73030
non-spam



Problem Setter: Gordon V. Cormack, University of Waterloo, Canada