Abstract

Scene classification is an important and elementary problem in image understanding. It deals with a large number of scenes in order to discover the common structure shared by all the scenes in a class. It is used in medical science (X-ray, ECG, endoscopy, etc.), criminal detection, gender classification, skin classification, facial image classification, and the generation of weather information from satellite images, as well as in identifying vegetation types, anthropogenic structures, mineral resources, or transient changes in any of these properties. In this paper, we first propose a feature extraction method named LHOG, or Localized HOG. We assume that an image contains some important regions that help to establish its similarity with other images of the same class. We generate local information from an image with the proposed LHOG method. Then, by combining all the local information, we generate a global descriptor using the Bag of Features (BoF) method, which is finally used to represent and classify the image accurately and efficiently. For classification, we use a Support Vector Machine (SVM), which analyzes data and recognizes patterns. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output. In our experiments, we use six different classes of images.

Keywords: LHOG; Localized HOG; BoF; Scene Classification; Corner Detection.


International Journal of Computer (IJC) (2016) Volume 20, No 1, pp 13-28

1. Introduction

A scene refers to the place where an action or event occurs. It differs from an object or a texture in a way that depends on the distance between the observer and the target point. If the distance is small, so that the target fills most of the view, the image shows an object; as the distance increases, the view widens around the fixed point and the image shows a scene. Images of a computer, monitor, human, bus, truck, etc. are objects. On the other hand, images of a football field, cricket field, bus terminal, horizon, river, mountain, forest, or a full view of a train are known as scenes.

Scene classification is a problem of great interest to researchers. Scene images show large variety: a scene may vary in scale, rotation, and illumination, along with other two-dimensional (2D) and three-dimensional (3D) variations. Existing features used for scene classification are based only on color, shape, texture, and other visual parts of the image. Most of them are single descriptors; being based on a single feature, they cannot achieve high accuracy and effectiveness. We therefore propose a new approach that first chooses some interesting parts of an image, which help to find similarity between images of the same class and to distinguish them from images of other classes. We propose a method for selecting interest points in an image, which is used to decide the locally important patches. After selecting the points, we analyze the area surrounding each point and apply a method to generate a localized feature, which we name the LHOG feature. We then convert all the local LHOG features into a global feature and thus obtain the final descriptor of an image. A Support Vector Machine (SVM) is trained on these descriptors and then classifies the descriptor of a test image. The global descriptor is obtained from the local LHOG features using the bag-of-features (BoF) technique [7]. In the same way, we obtain a different global descriptor for every image. Our method produces clearly different descriptors for different classes of the data set; sample images from our data set are shown in Figure 1.

Figure 1: Sample images: (a) CU Road, (b) Zero Point.

HOG [6] is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions. In [16, 17], a method was developed for extracting distinctive, scale- and rotation-invariant features from images, which can be used to perform matching between different views of an object or scene. In [18], a generative model from the statistical text literature is applied to a bag-of-visual-words representation of each image, and a multi-way classifier is subsequently trained on the topic distribution vector of each image. Shape- and appearance-based image classification [15] reports its accuracy rate, but our proposed approach shows better results than the others.

Applying our method to Figure 1(a), we find 305 corners and their corresponding LHOG features; combining all of these LHOG features gives the global descriptor 47, 51, 29, 11, 21, 48, 5, 16, 51, 26. For Figure 1(b), 291 corners give the final global descriptor 15, 27, 44, 4, 17, 69, 24, 26, 40, 26. The global descriptors built from the LHOG values thus differ greatly between Figure 1(a) and Figure 1(b). For this reason we achieve good accuracy, and the method gives results quickly.

2. Proposed Method

Our method consists of four stages: key point detection, feature extraction, global mapping, and classification. In the key point detection stage, we are concerned with localizing the highly informative patches. These points are detected through the sequence of edge detection, curve extraction, and cornerness measurement. Around each interest point, a rectangular patch is analyzed to obtain statistical attributes, with the aim of producing local features for that area. These local features are mapped to a multi-dimensional space in order to generate a global signature of the scene. In our approach we use the well-established Canny edge detector, followed by the curvature scale space corner detector [1] for measuring cornerness. The corner distribution in every local patch is analyzed by constructing a normalized histogram, which gives the local feature in our method. All local features of a given scene image are fed into the bag-of-features stage to generate the global signature of the scene. We use a support vector machine (SVM) as our classifier.

Localized HOG, or LHOG, is a feature descriptor used in computer vision and image processing for the purpose of object detection and scene classification. The technique counts occurrences of gradient orientations in localized portions of an image. It is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy. Figure 2 depicts our proposed approach. The following sections describe the stages of the method in sequence.

3. Interest point

Interest points, or corners, are a vital part of many image processing techniques. They are located using the following step-by-step procedure.

3.1. Edge Detection

An edge is a meaningful feature that carries significant information about an image. The edge detection process serves to simplify the analysis of images by drastically reducing the amount of data to be processed, while at the same time preserving useful structural information about object boundaries. In our method, we use the Canny edge detector. The steps of the Canny edge detection method [9, 10] are as follows:

a) Smooth the image with a Gaussian filter, based on the equation

$$g(m,n) = G_\sigma(m,n) * f(m,n) \tag{1}$$

where $G_\sigma = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{m^2+n^2}{2\sigma^2}\right)$, $f$ is the input image, $*$ denotes convolution, and $(m,n)$ is the pixel coordinate.

After applying Gaussian smoothing we obtain the image shown in Figure 3.

Figure 3: Smoothed image after Gaussian filtering.

b) Compute the gradient from the x and y partial derivatives

$$\nabla S = \left[\frac{\partial S}{\partial x} \;\; \frac{\partial S}{\partial y}\right]^T = \left[S_x \;\; S_y\right]^T \tag{2}$$

where $S$ denotes the smoothed image ($g$ in Eq. 1). These are the derivatives at pixel $(x,y)$.

The gradient magnitude and orientation are as follows:

$$|\nabla S| = \sqrt{S_x^2 + S_y^2} \tag{3}$$

$$\theta = \tan^{-1}\frac{S_y}{S_x}$$

Figure 2: Overall design of scene classification. [Block diagram with the blocks Training Images, Test Images, Interest Point, HOG, LHOG, Bag-of-Feature, Global Descriptor, SVM, and Scene Class.]
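Steps (a) and (b) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the kernel radius, the replicated border handling, and the central-difference derivative are choices made here for the example, and the function names (`gaussian_kernel`, `smooth`, `gradient`) are hypothetical. The sampled kernel is renormalized to sum to 1, so the constant factor in front of the exponential drops out.

```python
import math

def gaussian_kernel(sigma, radius):
    """Sampled 2-D Gaussian, renormalized so that the weights sum to 1."""
    k = [[math.exp(-(m * m + n * n) / (2.0 * sigma * sigma))
          for n in range(-radius, radius + 1)]
         for m in range(-radius, radius + 1)]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]

def smooth(image, kernel):
    """Convolve image with kernel; border pixels are replicated (assumption)."""
    h, w = len(image), len(image[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    # The kernel is symmetric, so no flipping is needed.
                    acc += kernel[dy + r][dx + r] * image[yy][xx]
            out[y][x] = acc
    return out

def gradient(s):
    """Central-difference S_x and S_y, then magnitude and orientation."""
    h, w = len(s), len(s[0])
    mag = [[0.0] * w for _ in range(h)]
    theta = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx = (s[y][min(x + 1, w - 1)] - s[y][max(x - 1, 0)]) / 2.0
            sy = (s[min(y + 1, h - 1)][x] - s[max(y - 1, 0)][x]) / 2.0
            mag[y][x] = math.hypot(sx, sy)
            theta[y][x] = math.atan2(sy, sx)  # radians, y axis pointing down
    return mag, theta

# Usage on a tiny synthetic image containing a vertical step edge.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
smoothed = smooth(image, gaussian_kernel(sigma=1.0, radius=2))
magnitude, orientation = gradient(smoothed)
```

On the step-edge image the magnitude peaks at the columns next to the step, which is the behavior the later thinning step relies on.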

c) Apply non-maxima suppression to the gradient magnitude to thin the edges and eliminate unimportant edge points. Pixels of the gradient magnitude that are not local maxima are suppressed:

$$G(x,y) = \begin{cases} |\nabla S|(x,y) & \text{if } |\nabla S|(x,y) > |\nabla S|(x',y') \text{ and } |\nabla S|(x,y) > |\nabla S|(x'',y'') \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where $(x',y')$ and $(x'',y'')$ are the neighbors of $(x,y)$ in $|\nabla S|$ along the direction normal to the edge.

After applying these steps to our sample image, we obtain the edge map shown in Figure 4.

Figure 4: Edge-map
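Step (c) can be illustrated with the sketch below. It assumes that the gradient direction is quantized to four directions (0, 45, 90, and 135 degrees) and that border pixels are simply set to zero; neither detail is stated in the text, and `non_max_suppress` is a hypothetical name. A full Canny detector would follow this step with hysteresis thresholding, which is not shown.

```python
import math

def non_max_suppress(mag, theta):
    """Keep mag[y][x] only where it exceeds both neighbours along the
    gradient direction (normal to the edge); set it to 0 elsewhere."""
    h, w = len(mag), len(mag[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Quantize the direction to 0, 45, 90 or 135 degrees.
            angle = math.degrees(theta[y][x]) % 180.0
            if angle < 22.5 or angle >= 157.5:
                dy, dx = 0, 1
            elif angle < 67.5:
                dy, dx = 1, 1
            elif angle < 112.5:
                dy, dx = 1, 0
            else:
                dy, dx = 1, -1
            m = mag[y][x]
            # Strict comparison against the two neighbours, as in Eq. (4).
            if m > mag[y + dy][x + dx] and m > mag[y - dy][x - dx]:
                out[y][x] = m
    return out
```

The input `theta` is expected in radians with the row index growing downward, matching a gradient computed as `atan2(S_y, S_x)` on image arrays.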

3.2. Curve Extraction and Corner Detection

Curvature is the amount by which a geometric object deviates from being flat, or straight in the case of a line, although it is defined in different ways depending on the context. The curvature K is given by

$$K(u,\sigma) = \frac{\dot{X}(u,\sigma)\,\ddot{Y}(u,\sigma) - \ddot{X}(u,\sigma)\,\dot{Y}(u,\sigma)}{\left(\dot{X}(u,\sigma)^2 + \dot{Y}(u,\sigma)^2\right)^{3/2}} \tag{5}$$

where $\dot{X}(u,\sigma) = x(u) \otimes \dot{g}(u,\sigma)$, $\ddot{X}(u,\sigma) = x(u) \otimes \ddot{g}(u,\sigma)$, $\dot{Y}(u,\sigma) = y(u) \otimes \dot{g}(u,\sigma)$, $\ddot{Y}(u,\sigma) = y(u) \otimes \ddot{g}(u,\sigma)$, $\otimes$ is the convolution operator, $g(u,\sigma)$ denotes a Gaussian of width $\sigma$, and $\dot{g}(u,\sigma)$ and $\ddot{g}(u,\sigma)$ are the first and second derivatives of $g(u,\sigma)$, respectively [1]. After curve extraction we obtain the output shown in Figure 5.

Figure 5: Extracted curve.

The list of corner candidates is now drawn from the extracted contours $A^j = \{P_1^j, P_2^j, \ldots, P_N^j\}$, where $P_i^j = (x_i^j, y_i^j)$ are the pixels on the contour and $N$ is the number of pixels on the contour.

Each contour is either closed or open: $A^j$ is closed if $|P_1^j P_N^j| < T$, where $T$ is usually 2 or 3.

The contour convolved with the Gaussian smoothing kernel $g$ is denoted by $A^j_{smooth} = A^j \otimes g$, where $g$ is a digital Gaussian function whose width is controlled by $\sigma$. The curvature at each pixel is then computed by

$$K_i^j = \frac{\Delta x_i^j \, \Delta^2 y_i^j - \Delta^2 x_i^j \, \Delta y_i^j}{\left[(\Delta x_i^j)^2 + (\Delta y_i^j)^2\right]^{3/2}} \quad \text{for } i = 1, 2, 3, \ldots, N \tag{6}$$

where $\Delta x_i^j = (x_{i+1}^j - x_{i-1}^j)/2$, $\Delta y_i^j = (y_{i+1}^j - y_{i-1}^j)/2$, $\Delta^2 x_i^j = (\Delta x_{i+1}^j - \Delta x_{i-1}^j)/2$, and $\Delta^2 y_i^j = (\Delta y_{i+1}^j - \Delta y_{i-1}^j)/2$.

All local maxima of the curvature function are included in the initial list of corner candidates. However, some of them may be rounded corners, which need to be removed. We remove them with an adaptive threshold method [2].
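The discrete curvature of Eq. (6) and the selection of its local maxima can be sketched as below. The sketch assumes the contour has already been smoothed, wraps indices around for closed contours and clamps them at the ends for open ones (a border choice not specified in the text), and uses hypothetical function names.

```python
import math

def contour_curvature(points, closed=True):
    """Curvature K_i at every pixel of one (already smoothed) contour.

    points: list of (x, y) tuples in contour order.
    """
    n = len(points)

    def at(seq, i):
        # Wrap around for closed contours; clamp at the ends for open ones.
        return seq[i % n] if closed else seq[min(max(i, 0), n - 1)]

    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    dx = [(at(xs, i + 1) - at(xs, i - 1)) / 2.0 for i in range(n)]
    dy = [(at(ys, i + 1) - at(ys, i - 1)) / 2.0 for i in range(n)]
    d2x = [(at(dx, i + 1) - at(dx, i - 1)) / 2.0 for i in range(n)]
    d2y = [(at(dy, i + 1) - at(dy, i - 1)) / 2.0 for i in range(n)]

    k = []
    for i in range(n):
        num = dx[i] * d2y[i] - d2x[i] * dy[i]
        denom = (dx[i] ** 2 + dy[i] ** 2) ** 1.5
        k.append(num / denom if denom > 0 else 0.0)
    return k

def local_maxima(k):
    """Indices where |K| is strictly larger than at both neighbours
    (treating the sequence as a closed contour)."""
    n = len(k)
    a = [abs(v) for v in k]
    return [i for i in range(n) if a[i] > a[i - 1] and a[i] > a[(i + 1) % n]]

# Usage: points on a circle of radius 10 give a curvature close to 1/10.
circle = [(10 * math.cos(2 * math.pi * i / 40),
           10 * math.sin(2 * math.pi * i / 40)) for i in range(40)]
curvature = contour_curvature(circle, closed=True)
```

A circle is a convenient sanity check because its curvature is constant and equal to the reciprocal of the radius; a straight line gives zero everywhere.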

$$T(u) = R \times \bar{K} = R \times \frac{1}{L_1 + L_2 + 1} \sum_{i=u-L_2}^{u+L_1} |K(i)| \tag{7}$$

where $u$ is the position of the corner candidate, $L_1 + L_2$ is the size of the region of support (ROS) centered at $u$, and $R$ is a coefficient. After applying curvature extraction and removing round and false corners [1], we obtain the desired interest points, as shown in Figure 6.
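The adaptive threshold of Eq. (7) can be illustrated as follows. The value `r=1.5` is only a placeholder for the coefficient R, the window is assumed to lie entirely inside the contour, and treating a candidate as rounded when its absolute curvature is below T(u) is the reading of the text adopted here; the function names are hypothetical.

```python
def adaptive_threshold(k, u, l1, l2, r=1.5):
    """T(u) = R * mean of |K(i)| for i from u - L2 to u + L1.

    Assumes 0 <= u - l2 and u + l1 < len(k) (no wrap-around handling).
    """
    window = [abs(k[i]) for i in range(u - l2, u + l1 + 1)]
    return r * sum(window) / (l1 + l2 + 1)

def is_round_corner(k, u, l1, l2, r=1.5):
    """A candidate is treated as rounded when |K(u)| is below T(u)."""
    return abs(k[u]) < adaptive_threshold(k, u, l1, l2, r)

# Usage: a sharp peak survives, a gentle bump is flagged as rounded.
sharp = [0.1, 0.1, 1.0, 0.1, 0.1]
gentle = [0.5, 0.6, 0.7, 0.6, 0.5]
keep_sharp = not is_round_corner(sharp, u=2, l1=2, l2=2)
drop_gentle = is_round_corner(gentle, u=2, l1=2, l2=2)
```

Because the threshold scales with the mean curvature in the neighborhood, a candidate must stand out from its own surroundings rather than exceed one global value.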

4. Feature Extraction

In a scene image, we can observe that most of the area is flat. Generally, a flat area does not contain enough clues to represent the image in a discriminative way, whereas a textured area represents the scene contents very well. With this in mind, we select a patch around each of the corner points that were treated as interest points in the previous section. The selected patch area is shown in Figure 7.

Figure 6: Detected corners. The small rectangle denotes the patch around a corner.

Figure 7: Patch area.

4.1 Localized Feature

In this method, we treat each interesting part of the image as a local representation of the corresponding interest point. In each scene, we identify the interest points by their corners and compute the corresponding HOG [3, 6] values around the corners. Generating HOG values after corner detection gives the Localized HOG, or LHOG, feature as follows:

LHOG feature = Interest point + Corresponding HOG of interest point (8)

In our sample image, 305 important corners are detected. For example, the corner point (255, 31) is selected, and its patch area is 25 × 50 pixels. The LHOG values of Figure 7 are shown in Table 1.
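A HOG-style descriptor of the patch around one corner can be sketched as below. The text does not state how the 81 values are laid out, so the sketch assumes 3 × 3 cells with 9 unsigned orientation bins each (3 × 3 × 9 = 81), which is only one layout that gives that length. The L2 normalization, the clamped border handling, the reading of the corner as (x, y), and the name `lhog` are likewise assumptions made for illustration.

```python
import math

def lhog(image, cx, cy, patch_w=25, patch_h=50, cells=3, bins=9):
    """HOG-style descriptor of the patch centred on the corner (cx, cy)."""
    h, w = len(image), len(image[0])
    x0, y0 = cx - patch_w // 2, cy - patch_h // 2
    hist = [0.0] * (cells * cells * bins)
    for py in range(patch_h):
        for px in range(patch_w):
            # Clamp coordinates so patches near the border stay valid.
            x = min(max(x0 + px, 0), w - 1)
            y = min(max(y0 + py, 0), h - 1)
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(angle / (180.0 / bins)), bins - 1)
            cell = (py * cells // patch_h) * cells + (px * cells // patch_w)
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist[cell * bins + b] += mag
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist
```

Calling `lhog(image, cx, cy)` once per detected corner gives one 81-value vector per corner, which is the form the global mapping stage consumes.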

4.2 Global Mapping

Global mapping represents the overall structure and distribution of the local features. To perform it we use bag-of-features (BoF). BoF approaches are characterized by the use of an orderless collection of image features. Lacking any structure or spatial information, it is perhaps surprising that this choice of image representation is powerful enough to match or exceed state-of-the-art performance in many of the applications to which it has been applied [7]. BoF generates a global feature from all of the local features. In our method, all the LHOG values are combined, and after applying a clustering method we generate a global descriptor. Bag-of-features proceeds through the steps shown in Figure 8.

Table 1: LHOG values for an interest point

| Position | LHOG value |
|----------|------------|
| 1        | 0.1822     |
| 2        | 0.6231     |
| 3        | 0.4337     |
| 4        | 0.2723     |
| ...      | ...        |
| 80       | 0.1284     |
| 81       | 0.1196     |

Each of the remaining corners likewise generates an 81 × 1 LHOG descriptor. Every such LHOG vector is treated as a local descriptor.

Figure 8: Stepwise Bag-of-Feature.

The first LHOG vector is compared with all 10 cluster centers to find the minimum distance, and it is assigned to cluster 2; proceeding in the same way, we finally obtain a global descriptor based on the 305 LHOG vectors. The global descriptor is built from the LHOG vectors using K-means clustering [5]. Finally, we apply this method to all the images, both within the same class and across the 6 different classes. The image representation by codewords [4] using LHOG frequencies is shown in Figure 9.

Figure 9: Codewords from sample image.
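The mapping from local LHOG vectors to a global descriptor can be sketched as a nearest-center histogram. The K-means routine below uses a naive initialization (the first k features) and a fixed number of iterations, both chosen here for brevity rather than taken from the paper, and all function names are hypothetical. With this scheme the entries of a descriptor sum to the number of local features; the ten entries reported for Figure 1(a) sum to 305, the number of corners reported for that image, which is consistent with this counting scheme.

```python
def nearest_cluster(feature, centres):
    """Index of the centre with the smallest squared Euclidean distance."""
    best, best_d = 0, float("inf")
    for idx, c in enumerate(centres):
        d = sum((a - b) ** 2 for a, b in zip(feature, c))
        if d < best_d:
            best, best_d = idx, d
    return best

def kmeans(features, k, iters=20):
    """Plain Lloyd iterations; initial centres are the first k features."""
    centres = [list(f) for f in features[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in features:
            groups[nearest_cluster(f, centres)].append(f)
        for j, g in enumerate(groups):
            if g:  # keep the old centre if a cluster becomes empty
                centres[j] = [sum(col) / len(g) for col in zip(*g)]
    return centres

def global_descriptor(features, centres):
    """Histogram counting how many local features fall in each cluster."""
    hist = [0] * len(centres)
    for f in features:
        hist[nearest_cluster(f, centres)] += 1
    return hist
```

In the setting described above, `features` would hold the 81-value LHOG vectors and `k` would be 10, with the cluster centers learned once on training data and then reused for every image.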

5. Classification

A classifier is a model that receives data as input and predicts the class to which a given input belongs. In supervised learning, we have to provide a training data set together with its group numbers (class labels), and then supply a test input to determine which class it most resembles. In our case we use a Support Vector Machine [8], a supervised learning method, as the classifier. First, the SVM is trained with all the training images; it takes a set of descriptors and the group number of each image. At classification time, it takes the descriptor of a test image and matches it against the trained image set. It returns the number of the group to which the test image is most similar, and it returns nothing if the image does not closely match any group.
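The train-then-classify flow can be illustrated with a small linear SVM. The text does not specify the kernel, the multi-class strategy, or the library, so everything here is an assumption made for illustration: a linear model trained by sub-gradient descent on the hinge loss, a one-vs-rest scheme for several classes, and a rejection rule (return `None` when no class scores above zero) as one possible reading of "returns nothing". The names and the hyperparameters `lr`, `lam`, and `epochs` are hypothetical.

```python
def train_linear_svm(xs, ys, lr=0.01, lam=0.01, epochs=100):
    """Binary linear SVM; xs are feature vectors, ys are labels in {-1, +1}."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Sub-gradient step: always shrink w (regularizer) ...
            w = [wi * (1.0 - lr * lam) for wi in w]
            # ... and move towards the sample when the margin is violated.
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def train_one_vs_rest(xs, labels, **kwargs):
    """One binary SVM per class: that class against all the others."""
    models = {}
    for c in sorted(set(labels)):
        ys = [1 if label == c else -1 for label in labels]
        models[c] = train_linear_svm(xs, ys, **kwargs)
    return models

def classify(models, x):
    """Class with the highest positive score, or None if no score is positive."""
    best, best_score = None, 0.0
    for c, (w, b) in models.items():
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if score > best_score:
            best, best_score = c, score
    return best
```

In the pipeline described above, `xs` would be the global descriptors of the training images and `labels` their group numbers.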

6. Experimental Result

Our data set contains 6 classes, and we apply stratified k-fold cross-validation [13]. The results are shown in Table 2, and the accuracy graph is shown in Figure 10.

Figure 10: Accuracy vs. Number of class


Table 2: Accuracy results for 2-fold cross-validation.

| Number of classes | Class names | Images per class | Total images | Correctly identified | Wrongly classified | Accuracy |
|---|---|---|---|---|---|---|
| 2 | CU Road, IT Building | 42 | 84 | 83 | 1 | 98.81% |
| 3 | CU Road, IT Building, Freedom Sculpture | 42 | 126 | 121 | 5 | 96.03% |
| 4 | CU Road, IT Building, Freedom Sculpture, Shah Jalal Hall | 42 | 168 | 156 | 12 | 92.86% |
| 5 | CU Road, IT Building, Freedom Sculpture, Shah Jalal Hall, Saheed Minar | 42 | 210 | 190 | 20 | 90.48% |
| 6 | CU Road, IT Building, Freedom Sculpture, Shah Jalal Hall, Saheed Minar, Zero Point | 42 | 252 | 216 | 36 | 85.71% |
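The evaluation protocol can be sketched as follows: a stratified split keeps the class proportions equal across folds, and the accuracy is the number of correctly identified images divided by the total. The split below is deterministic and unshuffled for simplicity (shuffling within each class first would be the usual practice), and the function names are hypothetical.

```python
def stratified_folds(labels, k=2):
    """Assign sample indices to k folds so that each fold receives (nearly)
    the same number of samples from every class."""
    folds = [[] for _ in range(k)]
    seen = {}  # how many samples of each class have been placed so far
    for idx, label in enumerate(labels):
        j = seen.get(label, 0)
        folds[j % k].append(idx)
        seen[label] = j + 1
    return folds

def accuracy(correct, total):
    """Accuracy in percent."""
    return 100.0 * correct / total

# Usage with the two-class setting: 42 images per class, 2 folds.
labels = ["CU Road"] * 42 + ["IT Building"] * 42
folds = stratified_folds(labels, k=2)
six_class_accuracy = accuracy(216, 252)
```

Each fold serves once as the test set while the other is used for training, and the counts of correct predictions are summed over the folds.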

We collected the data set ourselves; sample images are shown in Figure 11.

Figure 11: Sample images of our data set: (a) CU Road, (b) IT Building, (c) Freedom Sculpture, (d) Saheed Minar, (e) Shah Jalal Hall, (f) Zero Point.

6.1. Recall and Precision Graph

In pattern recognition, precision [12] is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Both precision and recall are therefore based on an understanding and measure of relevance. When referring to the performance of a classification model, we are interested in the model's ability to correctly predict or separate the classes. When looking at the errors made by a classification model, the confusion matrix gives the full picture. Consider a three-class problem with classes A, B, and C. A predictive model tested on independent data may result in a confusion matrix of the form shown in Table 3, which shows how the model's predictions are distributed.

Table 3: Confusion matrix with notation

| Known class (class label in data) | Predicted A | Predicted B | Predicted C |
|---|---|---|---|
| A | tp_A | e_AB | e_AC |
| B | e_BA | tp_B | e_BC |
| C | e_CA | e_CB | tp_C |

i) Precision:

Precision is a measure of the accuracy provided that a specific class has been predicted. It is defined by

$$\text{Precision} = \frac{tp}{tp + fp} \tag{9}$$

where tp and fp are the numbers of true positive and false positive predictions for the considered class. For a confusion matrix of the above form with $tp_A = 25$, $e_{BA} = 3$, and $e_{CA} = 1$, the precision for class A would be calculated as

$$\text{Precision}_A = \frac{tp_A}{tp_A + e_{BA} + e_{CA}} = \frac{25}{25 + 3 + 1} \approx 0.86 \tag{10}$$

ii) Recall:

Recall is the true positive rate. It is defined by the formula

$$\text{Recall} = \text{Sensitivity} = \frac{tp}{tp + fn} \tag{11}$$

where tp and fn are the numbers of true positive and false negative predictions for the considered class, and tp + fn is the total number of test examples of the considered class. For class A, with $tp_A = 25$, $e_{AB} = 5$, and $e_{AC} = 2$, the recall would be

$$\text{Recall}_A = \text{Sensitivity}_A = \frac{tp_A}{tp_A + e_{AB} + e_{AC}} = \frac{25}{25 + 5 + 2} \approx 0.78 \tag{12}$$

Our experimental recall and precision results for the various numbers of classes are shown in Figure 12.
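Per-class precision and recall can be read off a confusion matrix as column and row ratios, as the sketch below shows. The first row and first column of the example matrix use the values that appear in Eqs. (10) and (12); the remaining entries are made up for illustration, and `precision_recall` is a hypothetical name.

```python
def precision_recall(cm, labels):
    """cm[i][j] = number of samples of known class i predicted as class j.

    Returns {label: (precision, recall)}.
    """
    result = {}
    for i, name in enumerate(labels):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(len(labels)))  # column sum: tp + fp
        actual = sum(cm[i])                                    # row sum: tp + fn
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        result[name] = (precision, recall)
    return result

# Rows are known classes A, B, C; columns are predicted classes A, B, C.
cm = [[25, 5, 2],
      [3, 32, 4],
      [1, 0, 15]]
scores = precision_recall(cm, ["A", "B", "C"])
```

For class A this gives 25/29 for precision and 25/32 for recall, matching the worked values in the equations.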

6.2. Receiver Operating Characteristics (ROC)

The ROC curve [11, 12] is a useful technique for organizing classifiers and representing their performance. It is created by plotting the true positive rate (TPR) against the false positive rate (FPR). Consider an experiment with P positive instances and N negative instances. The four possible outcomes can be arranged in a 2×2 confusion matrix, as in Table 4.

Table 4: Confusion matrix for ROC curve

| Actual value | Predicted P' | Predicted N' | Total |
|---|---|---|---|
| P | True Positives | False Negatives | P |
| N | False Positives | True Negatives | N |
| Total | P' | N' | |

The TPR and FPR are calculated as follows:

$$\text{TPR} = \frac{TP}{P} = \frac{TP}{TP + FN} \tag{13}$$

$$\text{FPR} = \frac{FP}{N} = \frac{FP}{FP + TN} \tag{14}$$

Figure 12: Recall-precision graphs for (a) 2 classes, (b) 3 classes, (c) 4 classes, (d) 5 classes, (e) 6 classes.

Our experimental 2-fold ROC curves for the different numbers of scene classes are shown in Figure 13.
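The points of an ROC curve follow from the TPR and FPR definitions by sweeping a decision threshold over the classifier scores. The sketch below assumes a binary problem with distinct scores (ties are not handled) and at least one positive and one negative sample; how the scores were obtained in the experiments is not stated, and `roc_points` is a hypothetical name.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold one sample at a time.

    scores: classifier scores, higher means more likely positive.
    labels: 1 for positive, 0 for negative.
    """
    p = sum(labels)
    n = len(labels) - p
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), key=lambda t: -t[0]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))
    return points

# Usage with four scored samples, three positive and one negative.
curve = roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1])
```

The curve always starts at (0, 0) and ends at (1, 1); a classifier that ranks every positive above every negative rises to a TPR of 1 before the FPR leaves 0.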

7. Conclusion and Future Work

In this paper we propose a novel scene classification method. In our research we achieve good performance, as shown in the preceding graphs, with an accuracy above 85 percent. Our database contains images in a variety of formats within the same class.

Figure 13: ROC curves for (a) 2 classes, (b) 3 classes, (c) 4 classes, (d) 5 classes, (e) 6 classes.

The total time for classifying scenes, from input data to classified output, is 31.8837 s for 84 images; hence the computation time per image is 31.8837 / 84 = 0.3796 s, at an image resolution of 461 × 365 pixels. This is slower than other existing scene classification systems, but our accuracy is higher than theirs.

In the future we will concentrate on increasing accuracy and reducing processing time. We will try to reach a processing time of 0.05 s per image so that our proposed method can be used in security systems, such as criminal detection from video.