copyright 2003-2010 uk-neural.net
Comparing Software Performances
Evaluating machine learning software, one package against another, is not as simple a task as it might seem.
Certain products have strengths in particular areas, or have a multitude of settings which can be adjusted
to suit a task.
Software vendors in this area frequently use industry 'standard' data sets to illustrate the
prowess of their particular product. But let's face it, if they are trying to convince the
potential purchaser that theirs is the one to go with, they're unlikely to select a problem
which does not show their package in its best light. Also, the often-used 'problem' data
sets are not problems at all. Asking a machine to sort out what's going on in an XOR
function is more like an exhibition AI obstacle course; it's not really going to help
anyone understand anything. Another favourite is Iris classification (the flower) from
a few specific measurements. Again, the rules are so simple that it is not, in my
opinion, representative of a real-world problem.
Another shortcoming of the Iris and XOR examples is that they have definite,
specific solutions: if x, y AND z, then the iris species is definitely Virginica. In my area
of interest, as I suspect in many others, there are very few, if any, definite answers to
my queries; I'm looking for trends. For instance, a specific set of pre-game
circumstances in a sports contest may produce result x. But the very next day, the
exact same set of circumstances produces result y. So here we'll present a
comparison of some commercial packages that have been trained and tested on a couple of
sports-related data sets of my own compilation.
A Couple of Caveats:
• The data configurations used make no claim to be the best solution to the given task, but at least they are the same for every program.
• As far as possible, each program was used in its 'default' state. (Some allow a host of configurations and settings, which may let a user 'drill down' to a better solution.)
Software Packages Used:
- Ward Systems Predictor (neural net)
- Tiberius data mining (neural net)
- Discipulus (genetic programming)
- GeneXProTools (genetic programming)
- Salford Systems MARS (regression splines)
Sample Task 1: (football)
10 inputs per record: league games won, drawn and lost, goals scored and goals conceded, for each of the home and away teams. Each data item is divided
by the number of games played, producing per-game figures, e.g. won 9 (from 20 games) = 0.45.
One output, an integer representing the home team advantage in goals,
e.g. score = 3-1, output = +2; score = 0-1, output = -1.
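The record layout described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the preparation code actually used for the tests: the `TeamForm` container, its field names and the `build_record` helper are hypothetical, and the sample figures are made up apart from the "won 9 from 20 games" case.

```python
from dataclasses import dataclass


@dataclass
class TeamForm:
    """League record to date for one team (hypothetical container)."""
    played: int
    won: int
    drawn: int
    lost: int
    goals_for: int
    goals_against: int

    def per_game(self) -> list[float]:
        """Divide each of the five raw figures by the number of games played."""
        if self.played <= 0:
            raise ValueError("team has no games played")
        raw = (self.won, self.drawn, self.lost, self.goals_for, self.goals_against)
        return [value / self.played for value in raw]


def build_record(home: TeamForm, away: TeamForm, home_goals: int, away_goals: int):
    """Return (10 inputs, 1 output) for a single match."""
    inputs = home.per_game() + away.per_game()  # 5 home figures + 5 away figures
    output = home_goals - away_goals            # home team advantage in goals
    return inputs, output


# Made-up example teams; only "won 9 from 20" comes from the text.
home = TeamForm(played=20, won=9, drawn=6, lost=5, goals_for=30, goals_against=22)
away = TeamForm(played=20, won=7, drawn=5, lost=8, goals_for=24, goals_against=27)
inputs, output = build_record(home, away, home_goals=3, away_goals=1)
print(len(inputs), inputs[0], output)  # 10 0.45 2
```

Dividing by games played keeps records from early and late in a season on the same scale, which is presumably why per-game figures were used.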
Trained upon data sampled from two seasons (1,422 records), tested against a third season (2,121 records).
Fitness measurements compared:
• R-squared. Standard statistical measure of fit, predicted versus actual (1 = perfect match).
• Sum of actual errors *
• Sum of raw errors *
• Within half-a-goal. Count of all cases where the prediction was closer than 0.5 to the actual home team goal superiority.
* Take predictions for four cases. If, when checked against actual results, two cases were high by 5 and the other two were low
by 5, the sum of actual errors would be 5+5-5-5 = 0, whereas the sum of raw errors would be 5+5+5+5 = 20.
If another package predicted all four 2 too high, actual = 8 and raw = 8. So the actual figure gives an overview of the distribution of errors: the nearer to
zero, the better its focus. The raw figure gives the accumulated error over the whole data set (lower is better).
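For readers who want to reproduce these measures on their own predictions, here is a minimal Python sketch. The function names are my own, and the R-squared definition used (1 minus residual sum of squares over total sum of squares) is an assumption: some packages report the squared correlation between predicted and actual instead, which can give a different figure.

```python
def sum_actual_errors(predicted, actual):
    """Signed errors summed, so over- and under-predictions cancel out."""
    return sum(p - a for p, a in zip(predicted, actual))


def sum_raw_errors(predicted, actual):
    """Absolute errors summed: accumulated error over the whole data set."""
    return sum(abs(p - a) for p, a in zip(predicted, actual))


def within_half_a_goal(predicted, actual):
    """Number of cases where the prediction is closer than 0.5 to the actual."""
    return sum(1 for p, a in zip(predicted, actual) if abs(p - a) < 0.5)


def r_squared(predicted, actual):
    """Assumed definition: 1 - SS_res / SS_tot (1 = perfect match)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; R-squared is undefined")
    return 1 - ss_res / ss_tot


# The four-case illustration from the footnote, with made-up actual results.
actual = [1, 1, 1, 1]
spread = [6, 6, -4, -4]   # two high by 5, two low by 5
biased = [3, 3, 3, 3]     # all four too high by 2
print(sum_actual_errors(spread, actual), sum_raw_errors(spread, actual))  # 0 20
print(sum_actual_errors(biased, actual), sum_raw_errors(biased, actual))  # 8 8
```

The sign convention here (predicted minus actual) follows the footnote, where a prediction that is "high by 5" contributes +5.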
By far and away the two most significant statistical measures are R-squared and sum of raw errors. Software is shown in ranking order, best score
nearest the top.
Sample Task 2: (horse racing favourite spreads)
3 inputs per record:
• Is the race a handicap or not?
• Number of runners?
• Odds of favourite?
One output, an integer representing the spreads value for a favourite's performance, where: win = 25pts, coming 2nd = 10pts, finishing 3rd = 5pts, otherwise zero points.
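As a rough illustration of this record layout, the scoring rule can be written as a small lookup. The names below are hypothetical, and the encoding of the inputs (handicap flag as 1.0/0.0, odds as a single number) is an assumption; the text does not say how the odds were represented.

```python
# Spreads points awarded to the favourite by finishing position.
SPREAD_POINTS = {1: 25, 2: 10, 3: 5}


def favourite_spread_value(finishing_position: int) -> int:
    """25 for a win, 10 for 2nd, 5 for 3rd, otherwise zero points."""
    return SPREAD_POINTS.get(finishing_position, 0)


def build_race_record(is_handicap: bool, runners: int,
                      favourite_odds: float, finishing_position: int):
    """Return (3 inputs, 1 output) for a single race.

    Assumed encoding: handicap flag as 1.0/0.0, odds as one numeric value.
    """
    inputs = [1.0 if is_handicap else 0.0, float(runners), float(favourite_odds)]
    return inputs, favourite_spread_value(finishing_position)


# Made-up race: 12-runner handicap, favourite finishes second.
inputs, output = build_race_record(True, 12, 3.5, finishing_position=2)
print(inputs, output)  # [1.0, 12.0, 3.5] 10
```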
Trained upon a data sample of 1,000 records, tested against an out-of-sample set of 859 records.
Fitness measurements taken:
• R-squared. Standard statistical measure of fit, predicted versus actual (1 = perfect match).
• Sum of actual errors
• Sum of raw errors
Software is shown in ranking order, best score nearest the top.
Approximate training times (both exercises)
MARS: < 1 minute
WARD genetic: 1 hour
WARD neural: < 1 minute
GeneXProTools: 1 hour
Discipulus: 1 hour (both individual & team are trained simultaneously)
Tiberius: < 10 minutes
Software Relative Prices
WARD Predictor: US$550.00
GeneXProTools Advanced: GBP£650
Discipulus Professional 4: US$495.00 (v5 is now current; cost has shot up to US$764.50 per annum)
Tiberius: US$265.00 (3-year license)
MARS: Salford Systems quoted me for the least expensive option, which was $4,995.00 for a single-user license with a
further $1,998.00 annual renewal charge. If it makes any difference, the MARS price does include tech support,
maintenance, all upgrades to future versions and internet training for a single user. Seats at any upcoming
Salford Systems MARS training will be discounted by 55%.
Testing was performed without bias, either in the selection of tasks or otherwise. The tasks are, in my experience, quite typical, and perhaps underline why
Tiberius is not only my package of choice but the one against which I now judge all others.
The software chosen for this comparison is, in my experience, the cream of the current (2010) commercial machine learning software. A package
performing poorly in this company does not imply that the software is not up to scratch. I have tried and tested many products, though obviously not all; of those
tried, many were rejected for this exercise for the reasons stated below. All the following products I rank below the capabilities of all the packages used
in the above tests:
Attrasoft
BrainCom
Crespin
Emergent
ExcelNeural
FANN
Joone
Neurosolutions
Pythia
QNet
RapidMiner
RockEye
Tanagra
Trajan (which is also the Neural Network add-in incorporated into Statistica)
XLPert
My rejection of these was for a variety of reasons. It is not my intention to review these products individually, and of course my reasons for rejection
may not be valid cause for others to do the same.
The products in the above rejection list, in this reviewer's opinion, each suffer from at least one (and in a few cases a good few more than one) of the following negative
factors:
• Very poor at out-of-sample predictions.
• Flaky and/or bug-ridden software.
• Frequent program crashes.
• Overly complex user interface (some are possibly targeted primarily at academic users).
• Very poor user support (sometimes NO user support).
Results, Sample Task 1 (football):

R-squared
0.04525  Tiberius
0.04484  WARD neural mode
0.04102  MARS
0.03620  GeneXProTools
0.03944  Discipulus best 'team'
0.03186  WARD genetic mode
0.03180  Discipulus best program

Sum of actual errors
155.49  GeneXProTools
226.94  WARD genetic mode
227.70  Tiberius
255.15  WARD neural mode
286.02  MARS
306.17  Discipulus best program
320.41  Discipulus best 'team'

Sum of raw errors
2719.89  Tiberius
2721.66  Discipulus best 'team'
2729.83  WARD neural mode
2730.41  MARS
2767.60  GeneXProTools
2780.54  Discipulus best program
2818.13  WARD genetic mode

Within half-a-goal
560  Discipulus best 'team'
554  Tiberius
553  WARD neural mode
553  MARS
553  GeneXProTools
530  WARD genetic mode
529  Discipulus best program
Results, Sample Task 2 (horse racing favourite spreads):

R-squared
0.09133  MARS
0.09108  Tiberius
0.09018  WARD genetic mode
0.08953  GeneXProTools
0.08654  Discipulus best 'team'
0.08076  Discipulus best program
0.07090  WARD neural mode

Sum of actual errors
20.79  WARD genetic mode
-21.13  Discipulus best program
22.85  Discipulus best 'team'
109.20  Tiberius
132.71  GeneXProTools
173.85  MARS
201.80  WARD neural mode

Sum of raw errors
7173.65  Tiberius
7188.63  MARS
7203.16  WARD genetic mode
7221.83  GeneXProTools
7232.79  Discipulus best program
7239.55  Discipulus best 'team'
7315.22  WARD neural mode