TOPICS IN DATA SCIENCE
CP-8210 FINAL REPORT
STRUCTURED AND UNSTRUCTURED DATA

Submitted to: Abdolreza Abhari
Submitted by: Gurpreet Singh
Student Number: 500802475
Date: 21/Dec/2017

Introduction

Data mining is a process that companies use to turn raw data into useful information. With the help of data mining, companies can look into patterns and understand their customers better, which supports more effective strategies that increase sales and lower prices. It is a procedure in which intelligent methods are applied to extract data patterns. It is an interdisciplinary subfield of computer science. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
Besides the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the "knowledge discovery in databases" (KDD) process. In data mining, the data is stored electronically and the search is carried out automatically by computer. The idea is not new: statisticians and engineers have long worked on the premise that patterns in data can be found automatically, validated, and used for predictions.
Databases keep growing: the amount of stored data is said to almost double every 20 months, although such a figure is difficult to justify in a quantitative sense. As the world grows in complexity and generates more data, the opportunities for data mining will certainly increase, and data mining is the main hope for elucidating the hidden patterns. Intelligently analysed data is a very valuable resource that can lead to new insights and various advantages.

Data mining is about solving problems by analysing data that is already present in databases. Take, for instance, the problem of customer loyalty in a highly competitive market. The key to this problem is a database of customer choices together with customer profiles.
The behaviour patterns of former customers can be analysed to identify the characteristics of those who remain loyal and those who change products. Once these characteristics are known, a company can identify the present customers who are willing to jump ship. Those groups can then be targeted with special treatment. The same technique can be used to identify customers who are attracted to other services.
So, in today's competitive world, data is the raw material that can fuel the growth of any business, but only if it is mined.

And how are the patterns expressed? Useful patterns allow nontrivial predictions to be made on new data. There are two ways to express a pattern: as a black box whose innards are incomprehensible, or as a transparent box whose construction reveals the structure of the pattern.
Assume that both can make good predictions. The difference between them is whether or not the mined patterns are represented in terms of a structure that can be used to inform future decisions. Such patterns are called structural because they capture the decision structure in an explicit way. They help to explain something about the data.
Describing Structural Patterns

What are structural patterns? They are illustrated by the following example rules:

If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft

Structural descriptions need not necessarily be couched as rules such as these. Decision trees, which specify the sequences of decisions that need to be made along with the resulting recommendation, are another popular means of expression. This example is a very simplistic one. For a start, all combinations of possible values are represented in the table.
There are 24 rows, representing three possible values of age and two values each for spectacle prescription, astigmatism, and tear production rate (3 × 2 × 2 × 2 = 24). The rules do not really generalize from the data; they merely summarize it. In most learning situations, the set of examples given as input is far from complete, and part of the job is to generalize to other, new examples. You can imagine omitting some of the rows in the table for which the tear production rate is reduced and still coming up with the rule

If tear production rate = reduced then recommendation = none

This would generalize to the missing rows and fill them in correctly.
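The row count and the two rules quoted above can be checked with a short sketch. The attribute value names other than "young", "reduced", and "no" are assumptions made for illustration, and the function `recommend` is a hypothetical helper, not part of any library.

```python
from itertools import product

# Attribute values for the contact-lens example. Only "young", "reduced",
# and "no" appear in the text; the other value names are assumed.
AGES = ["young", "pre-presbyopic", "presbyopic"]
PRESCRIPTIONS = ["myope", "hypermetrope"]
ASTIGMATIC = ["no", "yes"]
TEAR_RATES = ["reduced", "normal"]


def recommend(age, prescription, astigmatic, tear_rate):
    """Apply the two rules quoted in the text, in order."""
    if tear_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no":
        return "soft"
    return None  # the two quoted rules say nothing about this case


# Every combination of attribute values: 3 * 2 * 2 * 2 rows.
rows = list(product(AGES, PRESCRIPTIONS, ASTIGMATIC, TEAR_RATES))
print(len(rows))  # 24

# Rows for which the two quoted rules give a recommendation.
covered = [row for row in rows if recommend(*row) is not None]
print(len(covered))
```

Because the first rule tests only the tear production rate, it gives the same answer for any row with a reduced rate, including rows that were left out of the input. This is the sense in which the rule generalizes to missing rows.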
Second, values are specified for all the features in all the examples. Real-life datasets invariably contain examples in which the values of some features, for some reason or other, are unknown, for example because measurements were not taken or were lost. Third, the preceding rules classify the examples correctly, whereas often, because of errors or noise in the data, misclassifications occur even on the data that is used to create the classifier.

Data Mining

The learning techniques considered here do not present conceptual problems; they are known as machine learning. Data mining involves learning in a practical, not a theoretical, sense. We are interested in techniques for finding structural patterns in data and for making predictions from them. The knowledge is gathered from examples in the data, such as clients who have switched loyalties. The output may be a prediction of whether a customer will switch loyalty under different circumstances, but it may also include an explicit description of a structure that can be used to classify unknown examples.
In addition, it is useful to supply an explicit representation of the knowledge that is acquired. In essence, this reflects two meanings of learning: the acquisition of knowledge and the ability to use it. Many learning techniques look for structural descriptions of what is learned. These descriptions can become fairly complex and are typically expressed as sets of rules, such as the ones described earlier, or as decision trees. Because people can understand them, these descriptions serve to explain what has been learned, in other words, to explain the basis for new predictions.

Past experience shows that in most applications of data mining, the explicit knowledge structures, the structural descriptions, are at least as important as the ability to perform well on new instances. People often use data mining to gain knowledge, not only predictions. Gaining knowledge from the available data sounds like a good idea. Data mining deals with the kinds of patterns that can be mined.
On the basis of the kind of data to be mined, there are two categories of functions involved in data mining:

· Descriptive
· Classification and Prediction

Descriptive Function

The descriptive function deals with the general properties of data in the database. Here is the list of descriptive functions:

· Class/Concept Description
· Mining of Frequent Patterns
· Mining of Associations
· Mining of Correlations
· Mining of Clusters

Class/Concept Description

Class/concept refers to the data to be associated with classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and the concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways:

· Data Characterization: This refers to summarizing the data of the class under study. The class under study is called the target class.
· Data Discrimination: This refers to the mapping or classification of a class with some predefined group or class.
Mining of Frequent Patterns

Frequent patterns are those patterns that occur frequently in transactional data. Here is the list of kinds of frequent patterns:

· Frequent Item Set: A set of items that frequently appear together, for example, milk and bread.
· Frequent Subsequence: A sequence of patterns that occurs frequently, such as the purchase of a camera being followed by the purchase of a memory card.
· Frequent Substructure: Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with item sets or subsequences.
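As a minimal sketch of the frequent item set idea, the following code counts how often each pair of items appears together and keeps the pairs that meet a minimum support threshold. The transactions, the threshold, and the helper name `frequent_pairs` are all invented for illustration; practical systems use more scalable algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "biscuits"},
    {"milk", "bread", "biscuits"},
    {"milk", "eggs"},
]


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (the fraction of transactions
    containing both items) is at least min_support."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


result = frequent_pairs(transactions, min_support=0.5)
print(result)  # {('bread', 'milk'): 0.6}
```

With this sample data, only milk and bread appear together in at least half of the transactions, which matches the milk-and-bread example in the list above.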
Mining of Associations

Associations are used in retail sales to identify items that are frequently purchased together. This refers to the process of uncovering relationships among data and determining association rules. For example, a retailer generates an association rule showing that 70% of the time milk is sold with bread and only 30% of the time biscuits are sold with bread.

Mining of Correlations

This is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, in order to determine whether they have a positive, negative, or no effect on each other.

Mining of Clusters

A cluster is a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.

Classification and Prediction

Classification is the process of finding a model that describes the data classes or concepts.
The purpose is to be able to use this model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of sets of training data. It can be presented in the following forms:

· Classification (IF-THEN) Rules
· Decision Trees
· Mathematical Formulae
· Neural Networks

The functions involved in these processes are as follows:

· Classification: It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are known.
· Prediction: It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on available data.
· Outlier Analysis: Outliers may be defined as the data objects that do not comply with the general behavior or model of the data available.
· Evolution Analysis: Evolution analysis refers to describing and modelling regularities or trends for objects whose behavior changes over time.

Data Mining Task Primitives

We can specify a data mining task in the form of a data mining query. This query is input to the system. A data mining query is defined in terms of data mining task primitives.

Note: These primitives allow us to communicate with the data mining system in an interactive manner. Here is the list of data mining task primitives:

· Set of task-relevant data to be mined.
· Kind of knowledge to be mined.
· Background knowledge to be used in the discovery process.
· Interestingness measures and thresholds for pattern evaluation.
· Representation for visualizing the discovered patterns.

Set of task-relevant data to be mined

This is the portion of the database in which the user is interested. This portion includes the following:

· Database attributes
· Data warehouse dimensions of interest

Kind of knowledge to be mined

This refers to the kind of functions to be performed. These functions are:

· Characterization
· Discrimination
· Association and Correlation Analysis
· Classification
· Prediction
· Clustering
· Outlier Analysis
· Evolution Analysis

Background knowledge

Background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one kind of background knowledge that allows data to be mined at multiple levels of abstraction.
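The role of a concept hierarchy can be sketched with a small example that summarizes the same records at two levels of abstraction. The city-to-country mapping, the sales records, and the helper `total_by_level` are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

# Hypothetical concept hierarchy for a location attribute: city -> country.
CITY_TO_COUNTRY = {
    "Toronto": "Canada",
    "Vancouver": "Canada",
    "Chicago": "USA",
}

# Hypothetical sales records as (city, amount) pairs.
sales = [("Toronto", 120), ("Vancouver", 80), ("Chicago", 200), ("Toronto", 50)]


def total_by_level(records, mapping=None):
    """Sum amounts per attribute value. If a mapping is given, each value
    is first rolled up to its parent concept in the hierarchy."""
    totals = Counter()
    for value, amount in records:
        key = mapping[value] if mapping else value
        totals[key] += amount
    return dict(totals)


by_city = total_by_level(sales)                      # lower level of abstraction
by_country = total_by_level(sales, CITY_TO_COUNTRY)  # higher level of abstraction
print(by_city)
print(by_country)
```

The same data yields city-level totals without the hierarchy and country-level totals with it, so a pattern that is invisible at one level may show up at the other.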
Interestingness measures and thresholds for pattern evaluation

These are used to evaluate the patterns that are discovered by the process of knowledge discovery. There are different interestingness measures for different kinds of knowledge.

Representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed. These representations may include the following:

· Rules
· Tables
· Charts
· Graphs
· Decision Trees
· Cubes

Issues

Data mining is not a simple task, as the algorithms used can get very complex and the data is not always available in one place.
It needs to be integrated from various heterogeneous data sources. These factors also create some issues. This section discusses the major issues regarding:

· Mining Methodology and User Interaction
· Performance Issues
· Diverse Data Types Issues

The following diagram describes the major issues.

Mining Methodology and User Interaction Issues

This refers to the following kinds of issues:

· Mining different kinds of knowledge in databases: Different users may be interested in different kinds of knowledge. It is therefore necessary for data mining to cover a broad range of knowledge discovery tasks.
· Interactive mining of knowledge at multiple levels of abstraction: The data mining process needs to be interactive because this allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
· Handling noisy or incomplete data: Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without data cleaning methods, the accuracy of the discovered patterns will be poor.
· Pattern evaluation: The patterns discovered may be uninteresting because they either represent common knowledge or lack novelty, so they need to be evaluated for interestingness.

Performance Issues

There can be performance-related issues such as the following:

· Efficiency and scalability of data mining algorithms: In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.
· Parallel, distributed, and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are then processed in parallel. The results from the partitions are then merged. Incremental algorithms update databases without mining the data again from scratch.

Diverse Data Types Issues

· Handling of relational and complex types of data: The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
· Mining information from heterogeneous databases and global information systems: The data is available at different data sources on a LAN or WAN.
These data sources may be structured, semi-structured, or unstructured. Mining knowledge from them therefore adds challenges to data mining.

Applications

Data Mining Applications in Sales/Marketing

Data mining helps businesses understand the hidden patterns inside historical purchasing transaction data, which enables the launch of new marketing campaigns in a cost-efficient way. The data mining applications are as follows:

· Data mining is used for market basket analysis to provide information on which product combinations were purchased together, when they were bought, and in what sequence. This information helps businesses promote their most profitable products and maximize profit. In addition, it encourages customers to purchase related products that they may have missed or overlooked.
· Retail companies use data mining to identify customers' buying behaviour patterns.
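As a rough illustration of market basket analysis, the sketch below computes the confidence of a rule such as "bread -> milk", that is, the fraction of baskets containing bread that also contain milk. The baskets and the helper names `count` and `confidence` are assumptions made for this example and do not come from a real dataset or library.

```python
# Hypothetical baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "biscuits"},
    {"bread", "milk"},
    {"bread", "biscuits"},
    {"milk"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(antecedent, consequent, baskets):
    """Confidence of the rule antecedent -> consequent: among baskets that
    contain the antecedent, the fraction that also contain the consequent."""
    base = count(antecedent, baskets)
    if base == 0:
        raise ValueError("antecedent never occurs in the baskets")
    return count(antecedent | consequent, baskets) / base


print(confidence({"bread"}, {"milk"}, baskets))      # 0.75
print(confidence({"bread"}, {"biscuits"}, baskets))  # 0.5
```

In this sample, milk accompanies bread more often than biscuits do, so a retailer reading these numbers might place or promote milk alongside bread first.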
Data Mining Applications in Banking / Finance

· Data mining techniques are used to help detect credit card fraud.
· Customer loyalty is identified with data mining techniques, i.e. by analysing the purchasing activities of customers, for example the frequency of purchases in a given period, the total monetary value of all purchases, and the date of the last purchase. After analysing those dimensions, a relative measure is generated for each customer. The higher the score, the more loyal the customer is.
· By using data mining, the credit card spending of customers can be identified.
· Data mining also helps in identifying stock trading rules from historical data.

Data Mining Applications in Health Care and Insurance

The growth of the insurance industry depends entirely on the ability to convert data into knowledge or information about customers, competitors, and its markets.
Data mining has been applied in the insurance industry only recently, but it has brought tremendous competitive advantages to the companies that have implemented it successfully. The data mining applications in the insurance industry are as follows:

· Data mining is applied in claims analysis, such as identifying which medical procedures are claimed together.
· Data mining makes it possible to forecast which customers will potentially purchase new policies.
· Data mining allows insurance companies to detect risky customers' behaviour patterns.
· Data mining helps detect fraudulent behaviour.