copyright 2003-2010 uk-neural.net

Comparing Software Performances

Evaluating machine learning software, one package against another, is not as simple a task as it might seem. Certain products have strengths in particular areas, or have a multitude of settings that can be adjusted to suit a task.

Software vendors in this area frequently use industry 'standard' data sets to illustrate the prowess of their particular product. But let's face it: if they are trying to convince a potential purchaser that theirs is the one to go with, they are unlikely to select a problem that does not show their package in its best light.

Also, the often-used 'problem' data sets are not problems at all. Asking a machine to sort out what's going on in an XOR function is more like an exhibition AI obstacle course; it's not really going to help anyone understand anything. Another favourite is Iris classification (the flower) from various bits of specific information. Again, the rules are so simple that it is not, in my opinion, representative of a real-world problem.

Another shortcoming of the Iris and XOR examples is that they have definite, specific solutions: if x, y AND z, then the iris species is definitely Virginica. In my area of interest, as I suspect in many others, there are very few, if any, definite answers to my queries; I'm looking for trends. For instance, a specific set of pre-game circumstances in a sports contest may produce result x, yet the very next day the exact same set of circumstances produces result y.

So here we'll present a comparison of some commercial packages that have been trained and tested on a couple of sports-related data sets of my own compilation.

A Couple of Caveats:

The data configurations used make no claim to be the best solution to the given task, but at least they are the same for every program.
So far as possible, each program was used in its 'default' state.
(Some allow a host of configurations and settings, which may let a user 'drill down' to a better solution.)

Software Packages Used:

neural net - Ward Systems Predictor
neural net - Tiberius data mining
genetic programming - Discipulus
genetic programming - GeneXProTools
regression splines - Salford Systems MARS

Sample Task 1: (football)

10 Inputs per record: league games won, drawn and lost, goals scored and goals conceded, for each of the home and away teams. Each data item is divided by the number of games played, producing per-game figures, e.g. won 9 (from 20 games) = 0.45.

One output, an integer representing the home team advantage in goals, e.g. score 3-1, output = +2; score 0-1, output = -1.

Trained upon data sampled from two seasons (1,422 records), tested against a third season (2,121 records).

Fitness measurements compared:

R-squared. Standard statistical measure of fit, predicted versus actual (1 = perfect match).
Sum of actual errors *
Sum of raw errors *
Within half-a-goal. Count of all cases where the prediction was closer than 0.5 to the actual home team goal superiority.

* Take the case of predictions for four cases. When checked against actual results, if two cases were high by 5 and the other two were low by 5, the sum of actual errors would be 5+5-5-5 = 0, whereas the sum of raw errors would be 5+5+5+5 = 20. If another package predicted all four 2 too high, actual = 8 and raw = 8. So the actual figure gives an overview of the distribution of errors: the nearer to zero, the better its focus. The raw figure gives the accumulated error over the whole data set (lower is better).

By far and away the two most significant statistical measures are R-squared and Sum of Raw Errors.

Software is shown in ranking order, best score nearest the top.

Sample Task 2: (horse racing favourite spreads)

3 Inputs per record:

Is the race a handicap or not?
Number of runners?
Odds of favourite?
One output, an integer representing the spreads value for a favourite's performance, where: win = 25pts, coming 2nd = 10pts, finishing 3rd = 5pts, otherwise zero points.

Trained upon a data sample of 1,000 records, tested against an out-of-sample set of 859 records.

Fitness measurements taken:

R-squared. Standard statistical measure of fit, predicted versus actual (1 = perfect match).
Sum of actual errors
Sum of raw errors

Software is shown in ranking order, best score nearest the top.

Approximate training times (both exercises)

MARS             < 1 minute
WARD genetic     1 hour
WARD neural      < 1 minute
GeneXproTools    1 hour
Discipulus       1 hour (both individual & team are trained simultaneously)
Tiberius         < 10 minutes

Software Relative Prices

WARD Predictor               US$550.00
GeneXproTools Advanced       GBP£650
Discipulus Professional 4    US$495.00 (v5 is now current; cost has shot up to US$764.50 per annum)
Tiberius                     US$265.00 (3-year license)
MARS                         $4,995.00 for a single-user license, plus $1,998.00 annual renewal

The MARS figure is what Salford Systems quoted me for the least expensive option. If it makes any difference, the MARS price does include tech support, maintenance, all upgrades to future versions and internet training for a single user. Seats at any upcoming Salford Systems MARS training will be discounted by 55%.

Testing was performed without bias, either in selection of tasks or otherwise. The results are, in my experience, quite typical, and perhaps underline why Tiberius is not only my package of choice but the one against which I now judge all others.

The software chosen for this comparison is, in my experience, the cream of the current (2010) commercial machine learning software. A package performing poorly in this company does not imply that the software is not up to scratch.

I have tried and tested many products, though obviously not all. Of those tried, many were rejected for this exercise for the reasons stated below.
All the following products I rank below the capabilities of all the packages used in the above tests:

Attrasoft
BrainCom
Crespin
Emergent
ExcelNeural
FANN
Joone
Neurosolutions
Pythia
QNet
RapidMiner
RockEye
Tanagra
Trajan (which is also the Neural Network add-in incorporated into Statistica)
XLPert

My rejection of these was for a variety of factors. It is not my intention to review these products individually, and of course my reasons for rejection may not be valid cause for others to do the same. The products on the above rejection list, in this reviewer's opinion, suffer from at least one (and in a few cases a good few more than one) of the following negative factors:

Very poor at out-of-sample predictions.
Flaky and/or bug-ridden software.
Frequent program crashes.
Overly complex user interface (some are possibly targeted primarily at academic users).
Very poor user support (sometimes NO user support).

Sample Task 1 (football) results

R-squared
0.04525    Tiberius
0.04484    WARD neural mode
0.04102    MARS
0.03620    GeneXProTools
0.03944    Discipulus best 'team'
0.03186    WARD genetic mode
0.03180    Discipulus best program

Sum of actual errors
155.49     GeneXProTools
226.94     WARD genetic mode
227.70     Tiberius
255.15     WARD neural mode
286.02     MARS
306.17     Discipulus best program
320.41     Discipulus best 'team'

Sum of raw errors
2719.89    Tiberius
2721.66    Discipulus best 'team'
2729.83    WARD neural mode
2730.41    MARS
2767.60    GeneXProTools
2780.54    Discipulus best program
2818.13    WARD genetic mode

Within half-a-goal
560        Discipulus best 'team'
554        Tiberius
553        WARD neural mode
553        MARS
553        GeneXProTools
530        WARD genetic mode
529        Discipulus best program

Sample Task 2 (horse racing favourite spreads) results

R-squared
0.09133    MARS
0.09108    Tiberius
0.09018    WARD genetic mode
0.08953    GeneXProTools
0.08654    Discipulus best 'team'
0.08076    Discipulus best program
0.07090    WARD neural mode

Sum of actual errors
20.79      WARD genetic mode
-21.13     Discipulus best program
22.85      Discipulus best 'team'
109.20     Tiberius
132.71     GeneXProTools
173.85     MARS
201.80     WARD neural mode

Sum of raw errors
7173.65    Tiberius
7188.63    MARS
7203.16    WARD genetic mode
7221.83    GeneXProTools
7232.79    Discipulus best program
7239.55    Discipulus best 'team'
7315.22    WARD neural mode
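
For anyone wanting to reproduce the data preparation and the fitness measures described for the two sample tasks, here is a rough Python sketch. It is illustrative only and is not how any of the tested packages compute their figures. All function names are hypothetical. Errors are assumed to be predicted minus actual, so a prediction that is too high counts as positive (matching the 5+5-5-5 example). R-squared is assumed to be the squared Pearson correlation between predicted and actual values, since the definition behind the published figures is not stated.

```python
from typing import Sequence


def per_game(total: float, games_played: int) -> float:
    """Scale a season total to a per-game figure (e.g. 9 wins from 20 games)."""
    return total / games_played


def home_advantage(home_goals: int, away_goals: int) -> int:
    """Task 1 output: home team advantage in goals."""
    return home_goals - away_goals


def spread_points(finishing_position: int) -> int:
    """Task 2 output: 25 for a win, 10 for 2nd, 5 for 3rd, otherwise 0."""
    return {1: 25, 2: 10, 3: 5}.get(finishing_position, 0)


def sum_actual_errors(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Signed errors summed; high and low misses cancel out (nearer 0 is better)."""
    return sum(p - a for p, a in zip(predicted, actual))


def sum_raw_errors(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Absolute errors summed over the whole data set (lower is better)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual))


def within_half(predicted: Sequence[float], actual: Sequence[float]) -> int:
    """Count of cases where the prediction is closer than 0.5 to the actual value."""
    return sum(1 for p, a in zip(predicted, actual) if abs(p - a) < 0.5)


def r_squared(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Squared Pearson correlation (an assumed definition, see the note above)."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        return 0.0  # correlation is undefined here; treated as 0 (assumption)
    return (cov * cov) / (var_p * var_a)


if __name__ == "__main__":
    actual = [0, 0, 0, 0]
    package_a = [5, 5, -5, -5]   # two cases high by 5, two low by 5
    package_b = [2, 2, 2, 2]     # all four cases 2 too high
    print(sum_actual_errors(package_a, actual), sum_raw_errors(package_a, actual))  # 0 20
    print(sum_actual_errors(package_b, actual), sum_raw_errors(package_b, actual))  # 8 8
```

The two printed lines reproduce the four-case footnote example: the first package scores 0 on actual errors but 20 on raw errors, while the second scores 8 on both.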