TOPICS IN DATA SCIENCE
AND UNSTRUCTURED DATA
Submitted to :- Abdolreza Abhari
Submitted by :- Gurpreet
Student Number:- 500802475
Data mining is a process that companies use to turn raw data into useful
information. With the help of data mining, companies can look for patterns in
their data, understand their customers better, and develop more effective
strategies, which in turn can increase sales and lower prices.
It is a basic procedure in which intelligent methods are applied to extract
data patterns. It is an interdisciplinary subfield of computer science. The
overall goal of the data mining process is to extract information from a data
set and transform it into an understandable structure for further use. Besides
the raw analysis step, it involves database and data management aspects, data
pre-processing, model and inference considerations, interestingness metrics,
complexity considerations, post-processing of discovered structures,
visualization, and online updating. Data mining is the analysis step of the
“knowledge discovery in databases” process, or KDD.
In data mining, the data is stored electronically and the search is automated
by computer. The idea is not new: statisticians and engineers have long worked
on the premise that patterns in data can be sought automatically, validated,
and used for prediction. Databases keep growing, and the amount of stored data
is said to double roughly every 20 months, although such a figure is difficult
to justify in any quantitative sense. The opportunities for data mining will
certainly increase: as the world grows in complexity, so does the data it
generates, and data mining is the only hope for elucidating the hidden
patterns. Intelligently analysed data is a very valuable resource that can lead
to new insights and, in turn, to various advantages.
Data mining is about solving problems by analysing data that is already present
in databases. Consider, for instance, the problem of customer loyalty in a
highly competitive market. The key to this problem is a database of customer
choices together with customer profiles. The behaviour patterns of former
customers can be analysed to distinguish the characteristics of those who
remain loyal from those who change products. Once such characteristics are
found, they can be used to identify present customers who are likely to jump
ship. Those groups can then be targeted with special treatment. The same
technique can be used to identify customers who might be attracted to other
services. So, in today's competitive world, data is the raw material that can
fuel the growth of any business, but only if it is mined.
And how are the patterns expressed?
Useful patterns allow nontrivial predictions to be made on new data. There are
two ways to express a pattern: as a black box whose innards are
incomprehensible, or as a transparent box whose construction reveals the
structure of the pattern. Both, we assume, can make good predictions. The
difference between them is whether or not the mined patterns are represented
in terms of a structure that can be examined and used to inform future
decisions. Patterns of this kind are called structural because they capture
the decision structure in an explicit way. In other words, they help to
explain something about the data.
Describing Structural Patterns
What are structural patterns?
They are best described with the help of an illustration, such as the
following rules:
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft
Structural descriptions need not necessarily be couched as rules such as these.
Decision trees, which specify the sequences of decisions that need to be made
along with the resulting recommendation, are another popular means of
expression.
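As a rough sketch, the two rules quoted above (reduced tear production leads to
a recommendation of none; otherwise a young, non-astigmatic patient is
recommended soft lenses) can be written as an ordered sequence of checks. The
function name and the fallback value `None` for uncovered cases are assumptions
made for illustration; the text gives only these two rules.

```python
def recommend(tear_production_rate, age, astigmatic):
    """Apply the two rules in order; return None when neither rule fires."""
    if tear_production_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no":
        return "soft"
    return None  # assumption: the two quoted rules do not cover this case


print(recommend("reduced", "young", "no"))  # first rule fires
print(recommend("normal", "young", "no"))   # second rule fires
print(recommend("normal", "young", "yes"))  # no rule fires
```

Because the rules are checked in order, the second rule is only reached when
tear production is not reduced, which is what "Otherwise" expresses.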
This example is a very simplistic one. For a start, all combinations of
possible values are represented in the table. There are 24 rows, representing
three possible values of age and two values each for spectacle prescription,
astigmatism, and tear production rate (3 × 2 × 2 × 2 = 24). The rules do not
really generalize from the data; they merely summarize it. In most learning
situations, the set of examples given as input is far from complete, and part
of the job is to generalize to other, new examples. You can imagine omitting
some of the rows in the table for which the tear production rate is reduced
and still coming up with the rule
If tear production rate = reduced then recommendation = none
This rule would generalize to the missing rows and fill them in correctly.
Second, values are specified for all the features in all the examples.
Real-life datasets often contain examples in which the values of some
features, for some reason or other, are unknown, for example because
measurements were not taken or were lost. Third, the rules classify the
examples correctly, whereas often, because of errors or noise in the data,
misclassifications occur even on the data that is used to create the
classifier.
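The row count and the idea of generalizing to omitted rows can be checked with
a short sketch. The attribute value names below (other than "young", "reduced"
and "no") are assumptions chosen for illustration; the text states only how
many values each attribute has.

```python
from itertools import product

# Assumed value names; only the counts (3, 2, 2, 2) come from the text.
ATTRIBUTES = {
    "age": ["young", "pre-presbyopic", "presbyopic"],
    "spectacle_prescription": ["myope", "hypermetrope"],
    "astigmatic": ["no", "yes"],
    "tear_production_rate": ["reduced", "normal"],
}

# Every combination of attribute values is one row of the table.
rows = [dict(zip(ATTRIBUTES, values)) for values in product(*ATTRIBUTES.values())]
print(len(rows))  # 3 * 2 * 2 * 2 = 24


def rule(row):
    """The rule discussed above: reduced tear production means no lenses."""
    return "none" if row["tear_production_rate"] == "reduced" else None


# Pretend every second "reduced" row was never observed.
reduced = [r for r in rows if r["tear_production_rate"] == "reduced"]
unseen = reduced[::2]

# The rule still assigns a recommendation to each unseen row.
print({rule(r) for r in unseen})
```

The rule depends only on the tear production rate, so it gives the same answer
for the unseen rows as for the rows it was derived from.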
Techniques for learning that do not raise conceptual problems are known as
machine learning. Data mining involves learning in a practical rather than a
theoretical sense. We are interested in techniques for finding structural
patterns in data and for making predictions from it. The information or
knowledge is collected from the data, for example from records of clients who
have switched loyalties.
The output takes the form of a prediction, such as whether a customer will
switch loyalty under different circumstances, but it may also include an
actual description of a structure that can be used to classify unknown
examples. In addition, it is useful to supply an explicit representation of
the knowledge that is acquired. In essence, this reflects the two meanings of
learning considered above: the acquisition of knowledge and the ability to use
it. Many learning techniques look for structural descriptions of what is
learned. These descriptions can become fairly complex and are typically
expressed as sets of rules, such as the ones described already, or as decision
trees. Because they can be understood by people, these descriptions serve to
explain what has been learned, in other words, to explain the basis for new
predictions.
Experience tells us that in most applications of data mining, the knowledge
structures that are acquired, the structural descriptions, are at least as
important as the ability to perform well on new instances. People usually use
data mining to gain knowledge, not only predictions. Gaining knowledge from the
available data certainly sounds like a good idea.
Data mining deals with the kinds of patterns that can be mined. On the basis of
the kind of data to be mined, there are two categories of functions involved in
data mining:
Descriptive
Classification and Prediction
The descriptive function deals with the general properties of data in the
database. Here is the list of descriptive functions:
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept refers to the data to be associated with classes or concepts. For
instance, in a company, the classes of items for sale include computers and
printers, and concepts of customers include big spenders and budget spenders.
Such descriptions of a class or a concept are called class/concept
descriptions. These descriptions can be derived in the following two ways:
Data Characterization: This refers to summarizing the data of the class under
study. The class under study is called the target class.
Data Discrimination: This refers to the mapping or classification of a class
with some predefined group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional
data. Here is the list of kinds of frequent patterns:
Frequent Item Set: A set of items that frequently appear together, for example
milk and bread.
Frequent Subsequence: A sequence of patterns that occurs frequently, such as
the purchase of a camera being followed by the purchase of a memory card.
Frequent Substructure: Substructure refers to different structural forms, such
as graphs, trees, or lattices, which may be combined with item sets or
subsequences.
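A frequent item set can be found, in the simplest case, by counting how often
each pair of items appears in the same transaction. The sketch below assumes a
small, invented list of transactions and a hypothetical `min_support`
threshold expressed as an absolute count; it counts pairs only and is not a
full frequent-pattern algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for illustration.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "biscuits"},
    {"milk", "bread", "biscuits"},
    {"milk", "eggs"},
]


def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support transactions."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair is always counted under the same key.
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


print(frequent_pairs(transactions, min_support=3))
```

With these transactions, milk and bread appear together three times, so that
pair is the only one that meets the threshold.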
Mining of Associations
Associations are used in retail sales to identify items that are frequently
purchased together. Association mining is the process of uncovering
relationships among data and determining association rules.
For example, a retailer generates an association rule showing that 70% of the
time milk is sold with bread, while only 30% of the time biscuits are sold
with bread.
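The percentages in the retail example can be read as rule confidence: of the
transactions that contain bread, the fraction that also contain milk (or
biscuits). That reading is an assumption, and the transactions below are
invented so that the numbers match the example.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    both = [t for t in with_antecedent if consequent <= t]
    return len(both) / len(with_antecedent)


# Invented data: 10 baskets with bread (7 with milk, 3 with biscuits), 2 without.
transactions = (
    [{"bread", "milk"}] * 7 + [{"bread", "biscuits"}] * 3 + [{"milk"}] * 2
)

print(confidence(transactions, {"bread"}, {"milk"}))      # 0.7
print(confidence(transactions, {"bread"}, {"biscuits"}))  # 0.3
```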
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting
statistical correlations between associated attribute-value pairs or between
two item sets, in order to determine whether they have a positive, a negative,
or no effect on each other.
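One common way to check whether two item sets have a positive, negative, or no
effect on each other is lift, the ratio of how often they occur together to
how often they would occur together if they were independent. The text does
not name a specific measure, so lift is used here as an assumed example, with
invented transactions.

```python
def lift(transactions, a, b):
    """Lift of item sets a and b: P(a and b) / (P(a) * P(b)).

    Above 1 suggests a positive correlation, below 1 a negative one,
    and a value close to 1 suggests no effect.
    """
    n = len(transactions)
    p_a = sum(a <= t for t in transactions) / n
    p_b = sum(b <= t for t in transactions) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("both item sets must occur at least once")
    p_ab = sum((a | b) <= t for t in transactions) / n
    return p_ab / (p_a * p_b)


# Invented data for illustration.
transactions = (
    [{"milk", "bread"}] * 4 + [{"milk"}] + [{"bread"}] + [{"tea"}] * 4
)

print(lift(transactions, {"milk"}, {"bread"}))  # above 1: bought together
print(lift(transactions, {"milk"}, {"tea"}))    # below 1: rarely together
```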
Mining of Clusters
A cluster is a group of similar objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from
the objects in other clusters.
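As a minimal sketch of cluster analysis, the code below groups one-dimensional
values with a basic k-means loop. The text does not prescribe an algorithm;
k-means, the starting centers, and the spending figures are all assumptions
made for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Basic k-means on 1-D data; `centers` are the initial guesses."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend of six customers.
spend = [12, 15, 14, 90, 95, 88]
centers, clusters = kmeans_1d(spend, centers=[0, 100])
print(clusters)  # [[12, 15, 14], [90, 95, 88]]
```

The two resulting groups correspond to the budget spenders and big spenders
used as example customer concepts earlier.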
Classification and Prediction
Classification is the process of finding a model that describes the data
classes or concepts. The purpose is to be able to use this model to predict the
class of objects whose class label is unknown. The derived model is based on
the analysis of sets of training data, and it can be presented in the
following forms:
Classification (IF-THEN) Rules
The functions involved in these processes are as follows:
Classification: It predicts the class of objects whose class label is unknown.
Its objective is to find a derived model that describes and distinguishes data
classes or concepts. The derived model is based on the analysis of a set of
training data, i.e. data objects whose class labels are well known.
Prediction: It is used to predict missing or unavailable numerical data values
rather than class labels. Regression analysis is generally used for
prediction. Prediction can also be used to identify distribution trends based
on the available data.
Outlier Analysis: Outliers may be defined as data objects that do not comply
with the general behavior or model of the available data.
Evolution Analysis: Evolution analysis refers to the description and modelling
of regularities or trends for objects whose behavior changes over time.
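Prediction of a numerical value by regression analysis, as described in the
list above, can be sketched with ordinary least squares on a single input
variable. The training data (months as a customer against monthly spend) is
invented for illustration, and a straight-line model is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: months as a customer vs. monthly spend.
months = [1, 2, 3, 4, 5]
spend = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, spend)
predicted = slope * 6 + intercept  # predict the unavailable value for month 6
print(predicted)
```

Here the fitted line is used to fill in a numerical value that is not in the
data, which is the sense of prediction used in the text, as opposed to
assigning a class label.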
Data Mining Task Primitives
We can specify a data mining task in the form of a data mining query. This
query is input to the system. A data mining query is defined in terms of data
mining task primitives. These primitives allow us to communicate with the data
mining system in an interactive manner. Here is the list of data mining task
primitives:
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
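To make the five primitives concrete, the sketch below bundles them into a
single query object. The class name, field names, and example values are all
hypothetical and do not correspond to any particular data mining system or
query language.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container for the five task primitives listed above."""

    task_relevant_data: str  # portion of the database to be mined
    kind_of_knowledge: str   # e.g. "association" or "clustering"
    background_knowledge: dict = field(default_factory=dict)  # e.g. concept hierarchies
    interestingness: dict = field(default_factory=dict)       # measure name -> threshold
    representation: str = "rules"  # how discovered patterns are displayed


query = MiningQuery(
    task_relevant_data="sales transactions",
    kind_of_knowledge="association",
    background_knowledge={"city": "country"},
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
    representation="table",
)
print(query.kind_of_knowledge, query.interestingness["min_confidence"])
```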
Set of task relevant data to be mined
This is the portion of the database in which the user is interested. This
portion includes the following:
Data Warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions include:
Association and Correlation Analysis
Background knowledge to be used in discovery process
Background knowledge allows data to be mined at multiple levels of
abstraction. For example, concept hierarchies are one form of background
knowledge that allows data to be mined at multiple levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
These are used to evaluate the patterns that are discovered by the process of
knowledge discovery. There are different interestingness measures for
different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered
patterns are to be displayed. These representations may include the following.
Data mining is not a simple task, as the algorithms used can get very complex
and data is not always available in one place. It needs to be integrated from
various heterogeneous data sources. These factors also create some issues.
Here we discuss the major issues with respect to:
Mining Methodology and User Interaction
Issues in Performance
Issues in Diverse data types
The following diagram describes the major issues.
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues:
Mining different kinds of knowledge in databases: Different users may be
interested in different kinds of knowledge. Therefore it is necessary for data
mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction: The data
mining process needs to be interactive because this allows users to focus the
search for patterns, providing and refining data mining requests based on the
returned results.
Handling noisy or incomplete data: Data cleaning methods are required to handle
noise and incomplete objects while mining data regularities. Without data
cleaning methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation: The patterns discovered may be uninteresting because they
either represent common knowledge or lack novelty, so they need to be
evaluated for interestingness.
Performance Issues
There can be performance-related issues such as the following:
Efficiency and scalability of data mining algorithms: In order to effectively
extract information from the huge amount of data in databases, data mining
algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms: Factors such as the
huge size of databases, the wide distribution of data, and the complexity of
data mining methods motivate the development of parallel and distributed data
mining algorithms. These algorithms divide the data into partitions, which are
then processed in parallel. The results from the partitions are then merged.
Incremental algorithms update the mined results as the database changes,
without mining the data again from scratch.
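The partition, process, and merge idea, together with an incremental update,
can be sketched with item counting as the mining step. The transactions are
invented, item counting stands in for a real mining algorithm, and threads are
used only to show the structure; a real system would distribute the partitions
across processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Mining step applied to one partition: count item occurrences."""
    counts = Counter()
    for basket in partition:
        counts.update(basket)
    return counts


def mine_in_partitions(transactions, n_partitions=2):
    """Split the data, count each partition concurrently, then merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


# Invented transactions for illustration.
transactions = [["milk", "bread"], ["bread"], ["milk", "eggs"], ["bread", "eggs"]]
total = mine_in_partitions(transactions)

# Incremental step: fold in new transactions without recounting the old ones.
total.update(count_items([["milk"], ["bread", "milk"]]))
print(total["milk"], total["bread"], total["eggs"])  # 4 4 2
```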
Diverse Data Types Issues
Handling of relational and complex types of data: The database may contain
complex data objects, multimedia data objects, spatial data, temporal data,
etc. It is not possible for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information
systems: The data is available at different data sources on a LAN or WAN.
These data sources may be structured, semi-structured, or unstructured.
Therefore mining knowledge from them adds challenges to data mining.
Applications in Sales/Marketing
Data mining helps to better understand the hidden patterns inside historical
purchasing transaction data, which enables new campaigns to be launched in the
market in a cost-efficient way. The data mining applications are described
below:
Data mining is used for market basket analysis to provide information on what
product combinations were purchased together, when they were bought, and in
what sequence. This information helps businesses promote their most profitable
products and maximize profit. In addition, it encourages customers to purchase
related products that they may have missed or overlooked.
Retail companies use data mining to identify customers' buying patterns.
Data Mining Applications in Banking/Finance
Data mining techniques are used to help identify credit card fraud.
Customer loyalty is identified by data mining techniques, i.e. by analysing the
purchasing activities of customers, for example the frequency of purchases in
a period of time, the total monetary value of all purchases, and when the last
purchase was made. After analysing those dimensions, a relative measure is
generated for each customer. The higher the score, the more loyal the customer
is, relatively speaking.
Using data mining, credit card spending by customers can be identified.
Data mining also helps in identifying stock trading rules from historical
data.
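The loyalty measure described in the list above combines how often a customer
buys, how much they spend in total, and how recently they last bought. The
sketch below shows one possible way to combine the three; the weighting, the
scaling, and the customer data are all assumptions made for illustration, not
a standard formula.

```python
from datetime import date


def loyalty_score(purchases, today, period_days=365):
    """Relative loyalty score from purchase frequency, total value and recency.

    `purchases` is a list of (date, amount) pairs. The way the three
    dimensions are scaled and added is an assumption for illustration.
    """
    recent = [(d, amount) for d, amount in purchases if (today - d).days <= period_days]
    if not recent:
        return 0.0
    frequency = len(recent)                             # purchases in the period
    monetary = sum(amount for _, amount in recent)      # total value of purchases
    days_since_last = min((today - d).days for d, _ in recent)
    recency = 1.0 / (1 + days_since_last)               # larger when the last purchase is recent
    return frequency + monetary / 100.0 + recency


today = date(2024, 6, 1)
alice = [(date(2024, 1, 10), 50.0), (date(2024, 3, 5), 120.0), (date(2024, 5, 20), 80.0)]
bob = [(date(2023, 11, 2), 30.0)]

# The customer who buys more often, spends more, and bought more recently scores higher.
print(loyalty_score(alice, today) > loyalty_score(bob, today))  # True
```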
Applications in Health Care and Insurance
The growth of the insurance industry depends entirely on the ability to convert
data into knowledge or information about customers, competitors, and markets.
Data mining has been applied in the insurance industry only recently, but it
has brought tremendous competitive advantages to the companies that have
implemented it successfully. The data mining applications in the insurance
industry are listed below:
Data mining is applied in claims analysis, for example to identify which
medical procedures are claimed together.
Data mining makes it possible to forecast which customers will potentially
purchase new policies.
Data mining allows insurance companies to detect risky customer behaviour
patterns.
Data mining helps detect fraudulent behaviour.