PARALLEL AND DISTRIBUTED DATA MINING: THE FASTWEKA TOOL

Luciano Jose Senger, Lilian Tais Gouveia, Cristian Abreu, Marcio Augusto Souza

Abstract


Data mining is the process of extracting useful information and knowledge from a data set using statistical techniques and machine learning algorithms. Because of the size of the data and the amount of computation involved, it is very difficult for a single computer running current data mining tools to handle large data sets efficiently. In this scenario, parallel computers and distributed systems can be used to speed up the data mining process. This paper presents FastWeka, a tool for speeding up data mining tasks that uses multicore computers and a peer-to-peer system as its computing platform. By exploiting the inherent parallelism of the cross-validation phase (using the k-fold technique), FastWeka reduces the time required for data mining. To evaluate the tool, a forest cover dataset with 55 attributes and 581,012 records was used as input to the data mining algorithms. The computing times obtained with FastWeka reveal a speedup of 9 when using 10 folds and 10 processing elements, without jeopardizing classification accuracy. The experiments also show that better speedups are obtained when the number of folds is a multiple of the number of available processing elements and when each computer in the peer-to-peer system processes only one fold.
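
The fold-level parallelism described above can be illustrated with a minimal Python sketch. It is not FastWeka's implementation: the function names (fold_indices, ideal_speedup, evaluate_fold, cross_validate) are hypothetical, a majority-class baseline stands in for a real classifier, and a local thread pool stands in for the multicore and peer-to-peer workers. The sketch only shows that the k folds are independent units of work and that, assuming equal-cost folds and no communication overhead, the speedup is bounded by k / ceil(k / p) for p processing elements.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fold_indices(n_records, k):
    """Split record indices 0..n_records-1 into k contiguous, nearly equal folds."""
    base, extra = divmod(n_records, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(range(start, start + size))
        start += size
    return folds


def ideal_speedup(k, p):
    """Upper bound on speedup for k equal-cost folds on p workers.

    The busiest worker processes ceil(k / p) folds, so the parallel run
    takes ceil(k / p) fold-times instead of k fold-times.
    """
    return k / math.ceil(k / p)


def evaluate_fold(labels, test_range):
    """Stand-in for training and testing one fold (assumption: majority-class baseline).

    Trains on every record outside test_range and returns the number of
    correctly classified test records.
    """
    train_counts = Counter(
        label for i, label in enumerate(labels) if i not in test_range
    )
    majority = train_counts.most_common(1)[0][0]
    return sum(1 for i in test_range if labels[i] == majority)


def cross_validate(labels, k, workers):
    """Run k-fold cross-validation with folds spread over a pool of workers."""
    folds = fold_indices(len(labels), k)
    # Each fold is independent, so the folds can be evaluated concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        correct = list(pool.map(lambda fold: evaluate_fold(labels, fold), folds))
    return sum(correct) / len(labels)


if __name__ == "__main__":
    labels = ["a", "a", "b"] * 20
    print(cross_validate(labels, k=10, workers=10))
    for p in (10, 5, 4, 3):
        print(p, ideal_speedup(10, p))
```

Under these assumptions the bound for 10 folds on 10 processing elements is 10, which is consistent with the reported speedup of 9. With 4 processing elements the bound drops to 10/3, because two of them must process three folds while the others process two; this is one way to read the observation that a fold count that is a multiple of the number of processing elements gives better speedups. The accuracy returned does not depend on the number of workers, since the same folds are evaluated either way.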

Keywords


parallel computing; data mining; agriculture
