Data mining is the process of discovering hidden facts and relationships in large data sets. The knowledge obtained can be used to make decisions in various spheres of human activity.
The information found during data mining is non-trivial and previously unknown. The knowledge gained describes new relationships between properties, makes it possible to predict the values of some attributes based on others, and so on. It can also be applied to new data with some degree of confidence. Applying this knowledge brings practical benefit.
Data mining is available in the desktop and web applications of Foresight Analytics Platform from the following tools: Dashboards, Analytical Queries (OLAP), Reports, and Time Series Analysis.
The following tasks can be executed by means of data mining methods:
Dividing objects or observations into a specified number of groups based on the proximity of their attribute values. To execute the task, the following clustering methods are used: K-Modes Method and Kohonen Self-Organizing Maps.
Determining the degree of exceptionality of each attribute of each object against the entire data set. To execute the task, Exception Analysis is used.
Substituting missing values of one attribute depending on the values of other attributes, based on the existing classification. To execute the task, the following methods are used: Decision Tree, Logistic Regression, and Back-Propagation Network.
Finding the most significant factors and determining the degree to which each factor influences a dependent variable. To execute the task, Naive Bayes Classifier is used.
Determining sets of elements that frequently occur together, based on the analysis of multiple repeated transactions. To execute the task, Association Analysis is used.
Continuing the specified time series using the selected forecasting method and information about its frequency. To execute the task, the following forecasting methods are used: Grey Forecast, Extrapolate, and Exponential Smoothing.
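As an illustration of the forecasting task, the following is a minimal Python sketch of simple (single) exponential smoothing. It is not the platform's implementation: the function names, the alpha parameter, and the flat forecast are assumptions made for the example, and this simplest variant does not take the series frequency into account.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed series for a smoothing factor alpha in (0, 1]."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # the first smoothed value is the first observation
    for value in series[1:]:
        # New level = weighted mix of the new observation and the previous level.
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


def forecast(series, alpha, horizon):
    """Continue the series: simple smoothing repeats the last level."""
    level = exponential_smoothing(series, alpha)[-1]
    return [level] * horizon


history = [10, 12, 14]
print(exponential_smoothing(history, alpha=0.5))
print(forecast(history, alpha=0.5, horizon=2))
```

A larger alpha makes the forecast follow recent observations more closely. A smaller alpha gives more weight to the accumulated history.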
Data mining methods can be used to build a ROC curve (receiver operating characteristic), also called an error curve: a graphical plot that assesses the performance of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
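The construction of the curve can be sketched in Python as follows. This is an illustrative example only: the function names and the sample labels and scores are assumptions, and the area under the curve is computed with the trapezoidal rule as one common summary of the plot.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, one per distinct threshold, strictest first.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores; an object is predicted positive
            when its score is greater than or equal to the threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(curve)
print(area_under_curve(curve))
```

Each point corresponds to one threshold setting. The closer the curve runs to the upper-left corner (high TPR at low FPR), the better the classifier separates the two classes.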
Data mining methods that accept only categorical input data transform numeric input data into categorical data using the binning procedure. The procedure is as follows: the input data array is divided into the specified number of ranges (groups) according to the split rules. The obtained ranges are then used in data mining methods as separate categories.
Examples of categorical data are:
Names of cities.
Names of goods.
Answers in a questionnaire: "yes", "no".
Sizes of clothes: S, M, L, XL, XXL.
Education: primary, secondary, higher.
Assessment of a process result: "Good" or "Bad".
Names of carmakers: Ford, Toyota.
Product assessment: "effective", "rejected".
Phone codes of regions and so on.
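The binning procedure described above can be illustrated with a short Python sketch. The split rules are not specified here, so the example assumes the simplest one, equal-width ranges. The function name and the sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to the index of its range (0 .. n_bins - 1).

    Assumed split rule: the interval [min, max] is divided into
    n_bins ranges of equal width.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if low == high:
        return [0] * len(values)  # all values are identical: a single category
    width = (high - low) / n_bins
    # The maximum value would land in bin n_bins, so it is clamped into the last range.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


ages = [0, 5, 10, 15, 20, 30]
categories = equal_width_bins(ages, n_bins=3)
print(categories)
```

The resulting indexes can then be treated as separate categories in the same way as the categorical values listed above.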