Nikolay Samokhvalov

Saved 11/28/15 7:43 AM to PostgreSQL

PostgreSQL 9.4's "jsonb" Document Store - YouTube

Video subtitles:
  • 00:00: Alright, thank you all for coming. My name is Micah Yoder; I'm a Linux engineer at Rackspace Hosting. I work on an internal tools team where we deal with logging and monitoring and some other kinds of stuff like that. Now, I've been using PostgreSQL off and on since probably the late nineties. I'm interested in all things open source databases, and I also dabble with other databases like MongoDB, and I want to play with a lot of the others too. Now, there's been a trend in open source databases: you have the
  • 00:30: NoSQL movement, of course, which I'm sure you're familiar with, and the traditional open source relational databases are starting to pick up some of the features of those systems. I guess I should go backwards a little bit: PostgreSQL has a long history of innovating and putting in more data types than most of its competitors. In traditional SQL fields you can normally just store an integer or a boolean or a string or whatever, but PostgreSQL has long supported arrays.
  • 01:00: It has long supported geometric types, network types, and other complex aggregate types. As of a few years ago it started supporting hstore, which is a basic key-value store, and as of version 9.2, I think, it added a JSON type. That can store blobs of JSON as text, and it validated the JSON, but it didn't really store it efficiently in the back end or anything like
  • 01:31: that. So now, 9.4 adds jsonb, which is kind of like the JSON type except it's binary JSON: it stores it in the back end in an efficient form that it can process efficiently on the server side.
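  (An illustration, not from the talk, of that difference, using the psycopg2 library the speaker demonstrates later: the plain json type keeps the input text verbatim, while jsonb parses it into a binary representation, so formatting and duplicate keys are normalized away. The connection string and the contrived duplicate key are assumptions.)

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # assumed connection string
    cur = conn.cursor()
    cur.execute("""SELECT '{"a": 1, "a": 2}'::json::text,
                          '{"a": 1, "a": 2}'::jsonb::text""")
    print(cur.fetchone())
    # ('{"a": 1, "a": 2}', '{"a": 2}')  -- json keeps the raw text;
    # jsonb stores parsed binary data, so the duplicate key collapses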
  • 02:02: Now, MySQL, or actually the MariaDB fork of MySQL, takes on another form of NoSQL: if you're familiar with Cassandra, the column-store database, MariaDB 10 adds something kind of like that. So you can see the traditional relational databases are picking up on the NoSQL features as much as they can. But here we're going to focus on PostgreSQL 9.4 and jsonb. Now, I'm sure you know what JSON is; there's an example of it on the slide. It's basically an arbitrary structured document, so you can put in any kind of data you want, in any kind of structure.
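  (The slide's sample document isn't captured in the subtitles; a hypothetical stand-in for the kind of arbitrary structured document meant here, written as the Python dictionary the later examples use:)

    doc = {
        "name": "Granola bar",                        # plain scalar fields
        "tags": ["food", "snack"],                    # arrays nest freely
        "nutrition": {"calories": 190, "fat_g": 7},   # and so do objects
    }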
  • 02:32: And as I said, jsonb is binary JSON, so all of this is stored efficiently server side. Now, why would you want to do this? My thinking was that you can imagine a shopping cart application where you have a large table full of all kinds of products that are for sale. Well, you're going to have some attributes that are pretty basic and pretty common to all products: you're going to have the product ID or SKU, the
  • 03:03: title of the product, the date it was added to the catalog, the price, the status of whether it's available or not, and a basic description. You probably want to store these in basic, normal SQL fields, like integers or strings or whatever, for maximum performance. Okay, now there might be other attributes that only apply to some products, that are fairly sparse;
  • 03:34: you would not put them with every product. So maybe if a product has a more detailed specification list, you could store that in a jsonb field; or a longer, more detailed description text; or if it's food, you could store nutritional information that way; or maybe customization options, or a list of photos with their attributes, or any kind of other special notes, or geographical information, or whatever. Just anything that might
  • 04:04: be relatively sparse and not something you'd associate with every single product. I think that would be more efficiently stored in a jsonb field called additional_info or something like that; a sketch of such a table follows below.
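  (A minimal sketch of the hypothetical products table just described, using psycopg2; all column names beyond additional_info are assumptions based on the attributes listed above.)

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # assumed connection string
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE products (
            id          serial PRIMARY KEY,   -- product ID / SKU
            title       text NOT NULL,
            added       date,
            price       numeric,
            available   boolean,
            description text,
            additional_info jsonb             -- sparse, per-product extras
        )""")
    conn.commit()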
  • 04:34: So here are three ways to do it. First, just using normal SQL fields: well, you're probably going to have to run more than one query to fetch all the data. First you fetch all the basic fields from the products table, and if the product has photos or specifications or nutritional information or anything like that, you're going to have to go back to the database and run another query to get that information separately. If you have several of those, that can add several round trips to the database, and that could slow down your page load time for your customers. That might be especially true if you have some joins, and of course JSON documents can alleviate the need for some kinds of joins. Now, MongoDB does this too, and of course MongoDB is a pure JSON
  • 05:09: document store, so you can store everything in a JSON document, including all the fields I mentioned back here, like the SKU and the title and everything; those would all be JSON fields in MongoDB. The advantage is that you can fetch everything with a single query to MongoDB. The disadvantage, I think, is that when you store every single field in every single row, you also have to store
  • 05:39: the name of the field in every row, and I think that's potentially a little bit wasteful. Now, in PostgreSQL with jsonb, it's also like MongoDB in that you can execute one query to fetch the whole thing: you fetch the standard SQL fields like the title and the SKU and all that, along with the additional info as jsonb, and it's only one round trip to the database, so it's extremely efficient. I'll show a little more about how to process
  • 06:11: that later, but first I want to talk about indexes a little bit. There are two possibilities for indexes on a jsonb document. First, you can have a GIN index (sorry, this is wider than the slide block): you can specify a GIN index on the whole jsonb field, and then you can run a query like this that will use the index. It will basically check (I'll
  • 06:42: talk about operators later) that a given jsonb document is contained within the database field, and that will use the index. Now, you can also create an expression index on any kind of expression in Postgres, and here we're doing that: here we're creating an expression index on a single key in
  • 07:15: a jsonb document. You'll see that to query the document using this index, you have to use an expression in the WHERE clause that's very much like the expression used to create the index. However, I might note that if you're going to need an index on something, in general it might be better to just put it in a normal SQL field as opposed to a jsonb field. There might be exceptions, but that's probably how I would do it in most cases.
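  (The slides themselves aren't in the subtitles; a minimal sketch of the two index types described, against the hypothetical products table above. The key names are assumptions.)

    from psycopg2.extras import Json  # conn and cur as in the sketch above

    # a GIN index over the whole jsonb column
    cur.execute("CREATE INDEX products_info_idx ON products USING gin (additional_info)")
    # a containment query like this can use that index
    cur.execute("SELECT id FROM products WHERE additional_info @> %s",
                [Json({"organic": True})])
    # an expression index on a single key inside the document
    cur.execute("CREATE INDEX products_brand_idx ON products ((additional_info ->> 'brand'))")
    # to hit it, the WHERE clause must use essentially the same expression
    cur.execute("SELECT id FROM products WHERE additional_info ->> 'brand' = %s", ["Acme"])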
  • 07:47: So here's an example of using it from Python. It is supported by the normal psycopg2 library, which is the most common PostgreSQL library in Python. jsonb is officially supported in version 2.5.4, which was released, I think, a couple of weeks ago; I was actually able to get this to work with 2.5.3 also. I think the main advantage is an internal efficiency advantage, in how it connects
  • 08:18: directly to the type without an additional conversion layer. But it's pretty straightforward: you just import the psycopg2 module, and also the extras module, which has the JSON type that deals with this. You have your standard connection string to the database, then you start building your data, and it's a standard Python dictionary; you can put anything you want in it, including arrays, whatever, and that all maps to the JSON fields. Then you
  • 08:49: get a cursor and use the cursor to execute the insert query. So you can see how it's put in there: psycopg2's execute function takes the JSON data type, in which you wrap your dictionary, and that automatically puts it into the INSERT statement in the proper way.
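  (The code on the slide isn't captured in the subtitles; a minimal reconstruction of the insert flow just described, with the connection string and document contents assumed.)

    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=test user=postgres")  # standard connection string
    cur = conn.cursor()
    # an ordinary Python dictionary; arrays and nested objects are fine
    data = {"specs": {"weight_g": 250}, "photos": ["front.jpg", "back.jpg"]}
    # wrapping the dict in Json() lets psycopg2 place it into the statement properly
    cur.execute("INSERT INTO products (title, additional_info) VALUES (%s, %s)",
                ("Widget", Json(data)))
    conn.commit()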
  • 09:20: Here's an example of querying it from Python. So basically you just run your SELECT, get your results, and split each row into the id and JSON fields, and you can treat the JSON field just like any Python dictionary, just like you put it in. So here's how I might propose using it in my hypothetical products table: you're going to select
  • 09:51: all your fields from the products table in one query. Then, and this is an extremely efficient check, you can check whether the nutrition key is in the additional_info field, and if so print out the nutritional information; if the specifications are in the field, go ahead and print them out; and so on. Those are all extremely fast checks, and I think this runs extremely efficiently; a sketch follows below.
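  (A sketch of that pattern, assuming psycopg2 2.5.4 or later, which returns jsonb values as Python dictionaries automatically; conn and cur as in the insert example, key names assumed.)

    # one query, one round trip: the plain columns and the jsonb document together
    cur.execute("SELECT id, title, additional_info FROM products")
    for pid, title, info in cur.fetchall():
        if info and "nutrition" in info:        # ordinary dict membership test
            print(title, info["nutrition"])
        if info and "specifications" in info:
            print(title, info["specifications"])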
  • 10:25: On performance, there are a few things to be said here. I tried inserting a million small JSON documents into both Postgres and MongoDB. Now, the top line, no full commit: with MongoDB I used the default write concern, which does not wait for an fsync to the disk; it just shoots it out there and hopes it succeeds. In Postgres I did everything in a single transaction, and this is the timing I got, so you can see Postgres appears to be a little bit faster. Now, the full commit line, frankly, I don't believe,
  • 10:55: but basically, for Postgres I put every document in its own transaction, so a million small transactions, and in MongoDB I set write concern j=true, which does wait for an fsync to the disk; it tells the application it succeeded in durably writing to the disk before continuing. I'm not quite sure why the number is so high on MongoDB; the CPU load and I/O wait were both low. I am not an expert database benchmarker, so maybe someone who is should look at that sometime. A sketch of the two Postgres modes follows below.
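  (Not the speaker's actual benchmark code, which isn't in the subtitles; a sketch of the two Postgres insert modes described, with the table name and document assumed.)

    from psycopg2.extras import Json  # conn and cur as in the insert example

    doc = {"n": 1, "tag": "small"}  # stand-in for the small test documents

    # "no full commit": all million inserts inside one transaction
    for i in range(1000000):
        cur.execute("INSERT INTO docs (j) VALUES (%s)", [Json(doc)])
    conn.commit()

    # "full commit": one transaction, and one fsync, per document
    conn.autocommit = True
    for i in range(1000000):
        cur.execute("INSERT INTO docs (j) VALUES (%s)", [Json(doc)])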
  • 11:25: [In response to an audience question:] Yeah, in Postgres, I think the documents were about 60 bytes. Yeah, I just wonder why Postgres was so much faster. But I also did a similar thing on a remote cloud
  • 11:55: server, and they were a lot closer in performance, but of course the network latency would have had something to do with that. As for query performance, I didn't do any formal timing, but I did do table scans on both of them; they both returned in under a second, and I could not perceive any real difference between the two. Of course, they were both within the page cache in RAM.
  • 12:25: Now, query operations, indexes, and operators. There are a few operators. I already mentioned the @> operator: that basically returns true if the jsonb document on the right is contained within the jsonb document on the left, so you can use that to search through a table for a specified set of keys and values. On the second row we have, let's see, and I hate to pronounce
  • 12:58: these, but basically the -> operators can check if a certain key or a certain index is within the jsonb document, and they actually return the field as jsonb; you can select by either a string key or, if the root of the jsonb is an array, a numerical index. Under that, the
  • 13:28: operator with two greater-than signs instead of just one, ->>, does the same thing but returns the result as text, so you can apply text and string functions to it. And then you can also look up a value within a certain JSON path in the document: you use a comma-separated list of strings for sub-keys, or numbers, which would be indexes into an array, so you can go arbitrarily deep that way. Examples of these operators are sketched below.
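  (A sketch of the operators just described, against the hypothetical products table; key names and values are assumptions.)

    from psycopg2.extras import Json  # conn and cur as in the insert example

    # @> : does the document on the left contain the document on the right?
    cur.execute("SELECT id FROM products WHERE additional_info @> %s",
                [Json({"organic": True})])
    # ->  : select a field by string key (or integer index), returned as jsonb
    cur.execute("SELECT additional_info -> 'nutrition' FROM products")
    # ->> : the same, but returned as text, so string functions apply
    cur.execute("SELECT upper(additional_info ->> 'origin') FROM products")
    # #>  : walk a path of keys and array indexes, arbitrarily deep
    cur.execute("SELECT additional_info #> '{photos, 0, url}' FROM products")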
  • 14:00: Now, how does it compare to MongoDB? Well, from what I can tell, it is not possible to do something like this within a JSON document in Postgres. MongoDB lets you query the documents based on, in this case, whether a number field is less than five. So MongoDB has a pretty elaborate query language where you can construct
  • 14:31: basically a JSON document that describes the query. It has pretty much everything you can do with an SQL field, within any of MongoDB's JSON fields, and it's pretty elaborate: any kind of comparison, whether something is in a set, anything like that. But yeah, like I said, as far as I can tell there's no way to do that within jsonb. Also, let's compare how you do updates in both of
  • 15:01: them. Again, MongoDB has this query language where you can update a large number of rows in a certain way: there's an update call that takes a query specifying which rows will be updated, then an update document that specifies how to change them, and then multi: true tells it to do more than one row at a time. So with this you can do as many rows as you want in one single
  • 15:36: round trip to the server. But it's not atomic, so it's possible, if the server crashes or something else goes wrong, that some of the rows will be updated and some will not. Now, Postgres, from what I can tell, also does not let you do that directly, but what you can do is start a transaction, do a SELECT ... FOR UPDATE of all those fields from all those rows, pull them down into a
  • 16:07: Python dictionary, make the required modifications there, send them back up with an UPDATE statement, and then commit the transaction. That's a little more powerful, maybe better: it is atomic, so either it will all succeed or it will all fail. Also, I would say that doing it in Postgres, working with Python dictionaries, is a lot easier than MongoDB's update language, which requires you to construct basically a little JSON document describing the update. A sketch of the Postgres pattern follows below.
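  (A sketch of that transactional read-modify-write pattern; conn and cur as in the insert example, key names assumed.)

    from psycopg2.extras import Json

    conn.autocommit = False  # psycopg2 then opens a transaction implicitly,
                             # so all of the following runs atomically
    cur.execute("SELECT id, additional_info FROM products "
                "WHERE additional_info ? 'nutrition' FOR UPDATE")
    for pid, info in cur.fetchall():
        info["reviewed"] = True   # modify the plain Python dict locally
        cur.execute("UPDATE products SET additional_info = %s WHERE id = %s",
                    (Json(info), pid))
    conn.commit()  # either every row is updated, or none are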
  • 16:39: Frankly, I think that's a little bit tedious. But on the other hand, with Postgres you have to pull each record down over the network, load it into your Python, and send the update back up, so if it's a large document, or if you have network latency or anything like that, that could be a disadvantage. On storage, I found that Postgres's jsonb was just a
  • 17:09: little bit larger than what I got in MongoDB. I had actually read a lot of reports that jsonb took an extraordinarily large amount of storage, but I was not able to reproduce that; it seems to be okay for me. And I think that's about it, so I was wondering if there were any questions. [In response to a question:] I believe it's an internal-to-Postgres
  • 17:43: detail; I don't think you can get it out of the database in any way other than through Postgres. Any other questions? I've pretty much just played with it, yeah. Oh, I think I forgot to
  • 18:14: mention that Postgres 9.4 isn't even in production yet; it's in beta 2, and I believe it should be released any day now. Alright, anything else? All right, thank you.