Introduction

In recent years, the amount of data that needs to be processed has grown significantly. A decade ago, the average database contained up to a few million records; with the development and spread of the Internet, it became necessary to create databases with hundreds of millions or even billions of records. As database volumes increased, a processing problem arose.
When a large database is distributed across many servers and tables, the time needed to find a required record increases sharply. In addition, since large databases are used on sites with millions of visitors, the number of simultaneous requests can reach several thousand. Each new request arrives before the previous one has been processed. As a result, the database servers quickly become overloaded, which leads to denial of service. The need to move away from relational databases to another approach to data storage becomes obvious. One such approach is NoSQL.
What is NoSQL?

NoSQL is a concept that involves the use of non-relational data models and the ability to scale horizontally (spreading a database across a very large number of domains should not affect processing speed). The term NoSQL was first used in 1998 by the Italian software developer Carlo Strozzi, and at that time it meant an open-source relational database that did not use the SQL language. The term has been used in its modern sense since 2009.
Why is NoSQL interesting?

There are two main reasons:

1. Productive application development. Developing many applications requires effort to map in-memory data structures onto the database; this is the phenomenon known as impedance mismatch. NoSQL databases offer a data model that better meets the needs of the application, which makes interaction with the database easier. This means less code to write and less debugging. The database also needs fewer changes.

2. Large amounts of data. Organizations consider it valuable to have as much information as possible available and to process it quickly, which is expensive with a relational database, if it is possible at all. The main reason is that relational databases are designed to run on a single machine, while it is more economical to handle a large database by distributing the load across clusters of many smaller and cheaper machines. Most NoSQL databases are designed to run on clusters, so they are better suited to working with large amounts of information.

Thus, NoSQL databases are a promising technology that allows the manipulation of huge amounts of data distributed among servers.

What Is a Key-Value Store?

Key-value stores are the simplest NoSQL data stores to use from an API perspective. The main idea of the key-value method is the pairing of a key and a value.
The value is stored as a blob, and the data store does not know what is inside. The client can get the value for a key, put a value for a key, or delete a key from the data store. It is the application’s responsibility to understand what was stored. Since key-value stores always use primary-key access, they generally have great performance and can be easily scaled.
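The three operations above can be sketched with a small in-memory class. This is only an illustration under stated assumptions: the class name BlobStore and its method names are hypothetical and do not belong to any real product. The point it shows is that the store handles opaque bytes, while the application decides how to interpret them.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory key-value store; values are opaque byte arrays. */
final class BlobStore {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();

    /** Stores (or overwrites) the value for the key. */
    void put(String key, byte[] value) {
        data.put(key, value.clone());
    }

    /** Returns a copy of the value for the key, or empty if the key is absent. */
    Optional<byte[]> get(String key) {
        byte[] value = data.get(key);
        return value == null ? Optional.empty() : Optional.of(value.clone());
    }

    /** Removes the key; returns true if it existed. */
    boolean delete(String key) {
        return data.remove(key) != null;
    }

    public static void main(String[] args) {
        BlobStore store = new BlobStore();
        // The application, not the store, knows that this blob is UTF-8 JSON.
        store.put("user:42", "{\"name\":\"buyer\"}".getBytes(StandardCharsets.UTF_8));
        String json = store.get("user:42")
                .map(bytes -> new String(bytes, StandardCharsets.UTF_8))
                .orElse("<missing>");
        System.out.println(json);
        store.delete("user:42");
        System.out.println(store.get("user:42").isPresent());
    }
}
```

Every operation takes exactly one key, which is what makes this access pattern easy to distribute across machines.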
Figure 1 – Typical example of a key-value store

The key-value model is one of the simplest non-trivial data models, and more complex data models are often implemented as extensions of it. The key-value model can be extended to a discretely ordered model that maintains keys in lexicographic order. This computationally powerful extension makes it possible to retrieve selective key ranges efficiently.

Comparison of characteristics between a traditional RDBMS and a key-value store

Relational databases and key-value stores differ radically and are used to solve different problems. Comparing their characteristics helps to understand the difference between them:

Relational database: The database consists of tables, tables contain columns and rows, and rows consist of column values. All rows in one table have the same structure.
Key-value store: Domains are analogous to tables, but unlike tables, the structure of the data in a domain is not defined. A domain is a box into which you can put anything you like. Records within the same domain can have different structures.

Relational database: The data model is defined in advance. It is strongly typed and contains constraints and relations to ensure data integrity.
Key-value store: Records are identified by a key, and each record has a dynamic set of attributes associated with it.

Relational database: The data model is based on the natural representation of the contained data, not on the functionality of the application.
Key-value store: In some implementations, the attributes can only be strings. In other implementations, the attributes have simple data types that reflect the types used in programming: integers, arrays of strings, and lists.

Relational database: The data model is normalized to avoid data duplication. Normalization creates relationships between tables. Relationships between tables connect data in different tables.
Key-value store: Relationships are not explicitly defined, either between domains or within the same domain.

Comparison of data access between a traditional RDBMS and a key-value store

Relational database: Data is created, updated, deleted, and queried using Structured Query Language (SQL).
Key-value store: Data is created, updated, deleted, and queried by calling API methods.

Relational database: SQL queries can extract data from a single table or from multiple tables using joins.
Key-value store: Some implementations provide a SQL-like syntax to specify filter conditions.

Relational database: SQL queries can include aggregation and complex filters.
Key-value store: Often only the basic comparison operators can be used (=, !=, <, >, <= and >=).

Relational database: A relational database usually contains built-in logic, such as triggers, stored procedures, and functions.
Key-value store: All business logic and all logic that supports data integrity is contained in the application code.

Comparison of interaction with applications between a traditional RDBMS and a key-value store

Relational database: Vendor-specific APIs or generalized ones, such as OLE DB or ODBC, are most commonly used.
Key-value store: SOAP and/or REST APIs are most commonly used to access the data.

Relational database: The data is stored in a format that reflects its natural structure, so mapping between application structures and relational database structures is needed.
Key-value store: Data can be mapped to application structures more directly; only the code that writes data into objects is needed.

The advantages of key-value storage

Such systems have two distinct advantages over relational databases:

1. They are very suitable for cloud services. The first advantage of key-value stores is that they are simpler and therefore more scalable than relational databases.
If you are building your own system and plan to deploy dozens or hundreds of servers to cope with an increasing workload on your data store, then key-value stores are the choice to make. Since such stores expand easily and dynamically, they are also useful for vendors who provide multi-user web storage platforms. Such a platform is a relatively low-cost means of storing data with great potential for scalability. Users typically pay only for what they use, but their needs may grow. The vendor can increase the size of the platform dynamically and with virtually no restrictions, based on the load.

2. A more natural integration with the code. The relational data model and the object model of the code are usually constructed in different ways, which leads to some incompatibilities. Developers solve this problem by writing code that maps the relational model to an object model. This process has no clear, quickly achievable value and can take a lot of time that could be spent on developing the application itself. Meanwhile, many key-value stores keep data in a structure that maps to objects more naturally.
This can significantly reduce development time.

The disadvantages of key-value storage (the advantages of relational databases)

1. Constraints in a relational database ensure data integrity at the lowest level. Data that does not satisfy the constraints physically cannot get into the database. Key-value stores have no such restrictions, so data integrity is entirely the responsibility of the application. However, any code has bugs. While errors in a properly designed relational database usually do not lead to data integrity issues, errors in key-value stores usually do.

2. Another advantage of relational databases is that they force you to go through the process of developing a data model. If you have a well-developed model, the database will contain a logical structure that fully reflects the structure of the stored data rather than the structure of the application. Thus, the data becomes independent of the application. This means that another application can use the same data, and the application logic can be changed without any changes to the database model. To achieve the same with a key-value store, you need to replace relational model design with class design, creating general classes based on the natural structure of the data.

3. Unlike relational databases, stores aimed at use in the “cloud” have far fewer common standards. Although conceptually they do not differ, they all have different APIs, query interfaces, and specifics. Therefore, you need to trust your vendor, because if something happens, it will not be easy to switch to another service provider. And given that almost all modern key-value stores are in beta versions, that trust is even riskier than in the case of relational databases.

Key-value store features: the Riak example

Using NoSQL data stores requires an understanding of how their features compare with those of the standard RDBMS data stores that we also use.
The main point is to understand which features NoSQL stores lack and what changes must be made to the application architecture to use a key-value data store and its features more effectively. The common features of NoSQL data stores discussed here are consistency, transactions, query features, structure of the data, and scaling.

Consistency

Consistency applies only to operations on a single key: a get, put, or delete. Optimistic writes are very expensive to implement, because the data store itself cannot determine a change in value. Distributed key-value stores (Riak, for example) implement the eventually consistent model of consistency. Since the value may already have been replicated to other nodes, Riak has two ways of resolving update conflicts: either the newest write wins and older writes lose, or both (all) values are returned, allowing the client to resolve the conflict.

In Riak, these options can be set up during bucket creation.
Buckets are just a way to namespace keys so that key collisions can be reduced. Let’s assume that all customer keys reside in the customer bucket. When creating a bucket, we can provide default consistency values, such as “a write is considered good only when the data is consistent across all the nodes where the data is stored.”

    Bucket bucket = connection
        .createBucket(bucketName)
        .withRetrier(attempts(3))            // retry the operation up to 3 times
        .allowSiblings(siblingsAllowed)      // keep conflicting values as siblings?
        .nVal(numberOfReplicasOfTheData)     // N: number of replicas
        .w(numberOfNodesToRespondToWrite)    // W: nodes that must confirm a write
        .r(numberOfNodesToRespondToRead)     // R: nodes that must answer a read
        .execute();

To guarantee that data in every node is consistent, we can increase the numberOfNodesToRespondToWrite value set by w to be the same as nVal. Of course, doing that will decrease the cluster’s write performance. We can also change the allowSiblings flag during bucket creation to improve the handling of write or read conflicts. If the flag is set to false, the store lets the last write win and does not create siblings.
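The two ways of handling conflicting writes can be illustrated with a small sketch. It is a simplification under stated assumptions: the ConflictResolver and Version names are hypothetical, and versions are ordered by a plain timestamp, which is not a description of how Riak tracks versions internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of last-write-wins versus returning siblings. */
final class ConflictResolver {
    /** One replica's version of a value, tagged with its write time. */
    static final class Version {
        final String value;
        final long timestampMillis;

        Version(String value, long timestampMillis) {
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    /**
     * Returns the values a client would see for a set of conflicting versions.
     * With siblings allowed, every version is returned and the client merges them.
     * Otherwise only the newest write survives.
     */
    static List<String> resolve(List<Version> versions, boolean allowSiblings) {
        List<String> result = new ArrayList<>();
        if (allowSiblings) {
            for (Version v : versions) {
                result.add(v.value);
            }
            return result;
        }
        Version newest = null;
        for (Version v : versions) {
            if (newest == null || v.timestampMillis > newest.timestampMillis) {
                newest = v;
            }
        }
        if (newest != null) {
            result.add(newest.value);
        }
        return result;
    }
}
```

With siblings disabled, the older write is silently discarded, so this setting suits data where losing a concurrent update is acceptable.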
Transactions

Different products have different specifications for transactions, but in general there are no guarantees on writes. Many data stores implement transactions in different ways. Riak uses the concept of a quorum, implemented by using the replication factor during the write API call.

Let’s assume we have a Riak cluster with a replication factor of 5 and we supply a numberOfNodesToRespondToWrite (W) value of 3. This means that Riak has a write tolerance of N - W = 2. So up to two nodes can be down and the write operation will still succeed, though we would have lost some data on those two nodes for reads.
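The arithmetic above can be written down as a small helper. The class and method names are hypothetical. The overlap check (R + W > N) is the usual quorum rule of thumb rather than something stated in this article, so treat it as an added assumption.

```java
/** Hypothetical helper for reasoning about N, R and W settings. */
final class QuorumMath {
    /** Number of replicas that may be down while a write still reaches W of N nodes. */
    static int writeFaultTolerance(int n, int w) {
        checkQuorum(n, w);
        return n - w;
    }

    /** True when every read quorum shares at least one node with every write quorum. */
    static boolean quorumsOverlap(int n, int r, int w) {
        checkQuorum(n, r);
        checkQuorum(n, w);
        return r + w > n;
    }

    private static void checkQuorum(int n, int quorum) {
        if (n < 1 || quorum < 1 || quorum > n) {
            throw new IllegalArgumentException("expected 1 <= quorum <= n");
        }
    }

    public static void main(String[] args) {
        // Replication factor 5 with W = 3, as in the example above.
        System.out.println(writeFaultTolerance(5, 3)); // prints 2
    }
}
```

Raising W towards N lowers the write tolerance to zero, which is the trade-off between consistency and write availability.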
Query Features

As the name implies, all key-value stores can query by the key. When a query depends on some attribute of the value, the database alone cannot answer it; the application must read the value and check the attribute itself.

There is an interesting side effect: most data stores will not return a list of all their primary keys. And even if they did, retrieving lists of keys and then querying for the values would be very expensive. Some key-value databases compensate for this by providing the ability to search inside the value, as implemented in Riak Search. That allows the user to query the data much as when using indexes.
When using key-value stores, a lot of thought must be given to the design of the key. Can the key be generated using some algorithm? Can the key be provided by the user (user ID, email, etc.)? Or derived from timestamps or other data that can be derived outside of the database?

These query characteristics make key-value stores likely candidates for storing session data (with the session ID as the key), shopping cart data, user profiles, and so on. The expiry_secs property can be used to expire keys after a certain time interval, especially for session and shopping cart objects.

When writing to a Riak bucket using the store API, the object is stored under the key provided. Similarly, we can get the value stored for the key using the fetch API.
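To make the session use case concrete, here is a minimal in-memory sketch of storing and fetching a value under a session ID with an expiry interval. The SessionStore class and its methods are hypothetical and only imitate the behaviour described above; they are not the Riak client API, and the current time is passed in explicitly to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical session store: key is the session ID, entries expire after a TTL. */
final class SessionStore {
    private static final class Entry {
        final String value;
        final long expiresAtSeconds;

        Entry(String value, long expiresAtSeconds) {
            this.value = value;
            this.expiresAtSeconds = expiresAtSeconds;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Stores the session value; it stays readable for expirySecs seconds. */
    void store(String sessionId, String value, long nowSeconds, long expirySecs) {
        entries.put(sessionId, new Entry(value, nowSeconds + expirySecs));
    }

    /** Fetches the session value, or empty if it is missing or has expired. */
    Optional<String> fetch(String sessionId, long nowSeconds) {
        Entry entry = entries.get(sessionId);
        if (entry == null) {
            return Optional.empty();
        }
        if (nowSeconds >= entry.expiresAtSeconds) {
            entries.remove(sessionId); // drop the expired entry lazily
            return Optional.empty();
        }
        return Optional.of(entry.value);
    }
}
```

For example, a value stored at time 1000 with an expiry of 60 seconds can be fetched at time 1059 but not at time 1060.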
Riak provides an HTTP-based interface, so all operations can be performed from a web browser or on the command line using curl. For example, data can be saved to Riak with a curl command that POSTs it to the session bucket under the key a7e618d9db25 (the key must be provided).

Structure of Data

Key-value databases don’t care what is stored in the value part of the key-value pair. The value can be a blob, text, JSON, XML, and so on. In Riak, we can use the Content-Type header in the POST request to specify the data type.

Scaling

Sharding is a method of partitioning data across discrete stores (shards), so that each node holds only a subset of the keys.
Most key-value stores can be scaled with sharding. The value of the key determines on which node the key is stored. Assuming we are sharding by the first character of the key, a key starting with z will be sent to a different node than a key starting with b. This kind of sharding increases performance as more nodes are added to the cluster.

Sharding has some downsides, though: if the node used to store the z keys goes down, all z-keyed data becomes unavailable, and no new data can be added with keys that start with z.

Data stores such as Riak allow control over aspects of the CAP theorem: N is the number of nodes storing the key-value replicas, R is the number of nodes from which the data must be successfully fetched before a read is considered valid, and W is the number of nodes that must be written to before a write is considered successful.

Assume we have a 5-node Riak cluster. N=3 means that all data is replicated to at least three nodes. R=2 means any two nodes must reply to a GET request for it to be considered successful.
W=2 ensures that the PUT request is written to two nodes before the write is considered successful.

These settings allow us to fine-tune node failure tolerance for read or write operations. Depending on the specific data store, these values can be changed to optimize read availability or write availability.
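The first-character sharding scheme described in this section, together with its availability downside, can be sketched as follows. The FirstCharSharding class is hypothetical, and mapping the character code modulo the node count is just one simple way to assign characters to nodes. The sketch has no replication, which is exactly why a single node failure makes some keys unreachable.

```java
import java.util.Set;

/** Hypothetical router that shards keys by their first character. */
final class FirstCharSharding {
    /** Maps a key to a node index using only its first character. */
    static int nodeFor(String key, int nodeCount) {
        if (key.isEmpty() || nodeCount < 1) {
            throw new IllegalArgumentException("key must be non-empty and nodeCount >= 1");
        }
        return Character.toLowerCase(key.charAt(0)) % nodeCount;
    }

    /** Without replicas, a key is reachable only while its single owning node is up. */
    static boolean isAvailable(String key, int nodeCount, Set<Integer> downNodes) {
        return !downNodes.contains(nodeFor(key, nodeCount));
    }
}
```

In a 5-node cluster this maps keys starting with z to node 2 and keys starting with b to node 3, so losing node 2 makes every z key unavailable while b keys keep working. Replicating each key to N nodes, as described above, is what removes this single point of failure.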
Suitable use cases

Let’s discuss some of the problems where key-value stores are a good fit.

1. Storing Session Information

Generally, every web session is unique and is assigned a unique sessionID value. Applications that store the sessionID on disk or in an RDBMS will greatly benefit from moving to a key-value store, since everything about the session can be stored by a single PUT request or retrieved using GET. This single-request operation makes it very fast, as everything about the session is stored in a single object.

2. User Profiles, Preferences

Almost every user has a unique userId, username, or some other attribute, as well as preferences such as language, color, timezone, which products the user has access to, and so on. This can all be put into an object, so getting the preferences of a user takes a single GET operation. Similarly, product profiles can be stored.

3. Shopping Cart Data

E-commerce websites have shopping carts tied to the user. As we want the shopping carts to be available all the time, across browsers, machines, and sessions, all the shopping information can be put into the value where the key is the userID.

When not to use

There are problem spaces where key-value stores are not the best solution.

1. Relationships among Data

If you need to have relationships between different sets of data, or correlate the data between different sets of keys, key-value stores are not the best solution to use, even though some key-value stores provide link-walking features.

2. Multioperation Transactions

If you’re saving multiple keys and there is a failure to save any one of them, and you want to revert or roll back the rest of the operations, key-value stores are not the best solution to use.
3. Query by Data

If you need to search the keys based on something found in the value part of the key-value pairs, then key-value stores are not going to perform well for you. There is no way to inspect the value on the database side, except with some products like Riak Search or indexing engines like Lucene or Solr.

4. Operations by Sets

Since operations are limited to one key at a time, there is no way to operate on multiple keys at the same time. If you need to operate on multiple keys, you must handle this on the client side.

Conclusion

Key-value stores are most suitable for storing large amounts of loosely structured data that is distributed among several domains. That is, such stores are suitable for sites with a very large number of visitors.
Also, such a data store should be selected if the data is object-oriented or can have dynamic attributes. Although the main disadvantage of such stores is the lack of a NoSQL standard, a standard may be adopted in the near future, which would make this type of storage more convenient to work with.