With the advent of the internet, a large number of electronic documents are generated every day, which makes the task of document classification extremely relevant. The importance of text mining and classification is growing as more digital data becomes available from numerous sources. Text mining is the extraction of important information from raw, unstructured data for use in businesses and other real-world applications. Document classification, by contrast, is the computer science problem of automatically assigning documents to one or more predefined categories or topics based on their contents. This problem has been addressed with many machine learning algorithms, including k-nearest neighbours (K-NN), Naive Bayes (NB) classification, artificial neural networks, support vector machines (SVM), decision tree induction and rule induction. While each of these classification algorithms has its own unique properties, each also has associated limitations.

NB is one of the oldest and simplest classifiers. It assigns a document to the category with the highest probability and is based on the assumption that the features in a document are conditionally independent given a category. Even though this assumption is generally incorrect, NB is effective as a classifier when trained on a large amount of data. However, capacity control and generalization remain an issue with this classifier and can lead to overfitting.

The K-NN algorithm looks at the documents nearest to the query document in the vector space of the training data and assigns the category to which most of those neighbours belong. It uses the training data directly for category prediction without learning a model from it. It requires distance measurement and sorting for every prediction, so it is slow on large data sets. It is also less effective on noisy data.
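As a rough illustration of the two decision rules described above, and not the implementation used in this paper, the following Python sketch classifies a tokenized document with a small multinomial Naive Bayes model and with a K-NN majority vote. The toy training set and the names `TRAIN`, `train_nb`, `predict_nb`, `cosine` and `predict_knn` are hypothetical, and the choices of add-one (Laplace) smoothing, raw word counts and cosine similarity are assumptions made for the example.

```python
import math
from collections import Counter

# Toy training set (hypothetical, for illustration only): (tokens, category)
TRAIN = [
    ("the team won the hockey game".split(), "sport"),
    ("the goalie saved the puck".split(), "sport"),
    ("the rocket reached orbit".split(), "space"),
    ("nasa launched a new satellite into orbit".split(), "space"),
]


def train_nb(docs):
    """Collect per-category document counts, word counts and the vocabulary."""
    cat_docs = Counter()
    word_counts = {}
    vocab = set()
    for tokens, cat in docs:
        cat_docs[cat] += 1
        word_counts.setdefault(cat, Counter()).update(tokens)
        vocab.update(tokens)
    return cat_docs, word_counts, vocab


def predict_nb(model, tokens):
    """Return the category with the highest log-posterior score.

    Words are treated as conditionally independent given the category,
    so the per-word log-likelihoods are simply added to the log-prior.
    """
    cat_docs, word_counts, vocab = model
    total_docs = sum(cat_docs.values())
    scores = {}
    for cat, n_docs in cat_docs.items():
        total_words = sum(word_counts[cat].values())
        score = math.log(n_docs / total_docs)
        for w in tokens:
            # Add-one smoothing (an assumption) avoids log(0) for unseen words.
            score += math.log((word_counts[cat][w] + 1) / (total_words + len(vocab)))
        scores[cat] = score
    return max(scores, key=scores.get)


def cosine(a, b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a)  # a Counter returns 0 for missing words
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def predict_knn(docs, tokens, k=3):
    """Majority vote among the k training documents most similar to the query.

    Nothing is learned in advance: every prediction compares the query with
    all training documents and sorts them, which is why K-NN is slow on
    large data sets.
    """
    query = Counter(tokens)
    ranked = sorted(docs, key=lambda d: cosine(query, Counter(d[0])), reverse=True)
    votes = Counter(cat for _, cat in ranked[:k])
    return votes.most_common(1)[0][0]


model = train_nb(TRAIN)
query = "a satellite reached orbit".split()
print(predict_nb(model, query))   # space
print(predict_knn(TRAIN, query))  # space
```

Note that `predict_nb` only needs the stored counts at prediction time, whereas `predict_knn` rescans the whole training set for each query.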
SVM is a popular classification algorithm that implements structural risk minimization by finding an optimal separating hyperplane for binary and multi-class problems, which leads to good generalization. A major drawback of this algorithm is the need to convert text data into numerical data. When techniques such as term frequency-inverse document frequency (TF-IDF) are used, the resulting dimensionality becomes a problem that leads to long training and classification times. Other drawbacks are its memory requirement and high algorithmic complexity. However, SVM is robust in handling high-dimensional data, and it is computationally more efficient and requires less memory than neural networks. Neural networks are widely used in machine learning and can also be effective for classification. Several types of neural networks are used for classification tasks, such as recurrent neural networks and multilayer perceptrons. Decision tree and rule induction algorithms are simple to understand but perform poorly when the number of distinguishing features between documents is large.

Document classification has widespread real-world applications, such as spam filtering; classification of news stories and academic papers by subject, geographical location and domain; email filtering; mail routing; news monitoring; and automated indexing of scientific articles and patient reports. Information retrieval is much more efficient on categorized data than on non-categorized data. This paper classifies articles from the 20 Newsgroups dataset into their respective categories using several of these machine learning algorithms. It further presents the results obtained by the different classification algorithms and shows the improvement in classification accuracy obtained through this work.

The following section of the paper is a literature survey spanning thirteen research papers.
This section discusses the problems identified, the proposed algorithms or methodologies, and the experiments and results reported in those papers. It then uses this discussion to shed light on the research gaps identified and on how the research presented in this paper helps fill them. The next section describes and analyses the dataset on which the research is based. It is important to understand what type of data the research deals with and to transform it into a form that can be used with all the classification algorithms. This is discussed in the section ‘Data Collection & Pre-Processing’. The sections after that describe the classification algorithms used to perform experiments on different subsets of the 20 Newsgroups dataset, and the experimental setup, including how the dataset is split and the parameters used. The final sections are ‘Evaluating Results’ and ‘Conclusion’, followed by the list of references used throughout the research.