Published on Jul 25, 2022
Let's talk about AWS DynamoDB. My reason for starting to use DynamoDB was very simple (or you can call it naive): it is a managed cloud service. Frankly speaking, I don't enjoy maintaining IT infrastructure as much as I enjoy working on a codebase, so I personally prefer services that do not require me to fiddle with VPCs, subnets, etc. (not that I can't do it, but I'm a lazy developer trying my best to avoid it). Just like many other beginners, I made tons of mistakes using DynamoDB without realizing them... until perhaps 2 years ago. Now I'm going to talk about some of the lessons I've learned, and hopefully they can help you as well.
I'm not even joking. One of the best ways to avoid rookie mistakes is to read and understand the official documentation for DynamoDB. Besides that, you should also read the whitepaper. The purpose is not just to read it, but to understand it. Many beginners' mistakes come from a lack of basic knowledge of DynamoDB: they treat it as a simple key-value store, which leads to misuse that goes unnoticed until production.
One of my early mistakes was using DynamoDB like a SQL database. In particular, I created one table per domain entity. For example, if an e-commerce system has entities such as User, Order, Product, I'd create corresponding tables users, orders, products, just like what I usually do in SQL database systems.
If you are thinking "I'd do the same", then let me tell you now - don't.
A DynamoDB table is not like a SQL table. If you think of it as a bucket, you can put all kinds of stuff in it, because it's schemaless. This means that you can create one table to rule all the entities. Suppose the structures of the entities are like this:
type user = {
  type: 'User'
  username: string
  date_created: string
}
type order = {
  type: 'Order'
  order_items: string[]
  date_created: string
}
type product = {
  type: 'Product'
  product_code: string
  price: number
  date_created: string
}
With all of them put in one table, it may look like the table below.
pk | type | username | order_items | product_code | price | date_created |
---|---|---|---|---|---|---|
USER#42423 | User | john.doe | | | | 2018-05-23T18:14:31+10:00 |
ORDER#53778924 | Order | | ['product_001', 'product_005', 'product_142'] | | | 2017-07-23T22:29:42+10:00 |
PRODUCT#34789 | Product | | | product_045 | 36.50 | 2015-08-23T22:31:42+10:00 |
The table above shows you that different kinds of items can co-exist in the same DynamoDB table. For columns that are unrelated to an entity, the values simply don't exist. For example, the order item does not have a username.
You should also notice that the pk (primary hash key) has a different prefix per item type. This is the usual approach to distinguish different types of items. IDs across different types may clash, but they will not after being prefixed with their own types.
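The prefixing idea can be sketched in a few lines of TypeScript. This is only an illustration under assumptions: the `makePk` and `describeItem` helpers are hypothetical names, and the item types simply reuse the entity shapes above with a `pk` attribute added.

```typescript
// Item shapes from the article, each with the prefixed partition key `pk`.
type UserItem = { pk: string; type: 'User'; username: string; date_created: string };
type OrderItem = { pk: string; type: 'Order'; order_items: string[]; date_created: string };
type ProductItem = {
  pk: string;
  type: 'Product';
  product_code: string;
  price: number;
  date_created: string;
};
type TableItem = UserItem | OrderItem | ProductItem;

// Hypothetical helper: prefix a raw ID with its entity type so that IDs of
// different entities cannot collide inside the same table.
function makePk(type: TableItem['type'], id: string | number): string {
  return `${type.toUpperCase()}#${id}`;
}

// Narrow on the `type` attribute to handle each kind of item safely.
function describeItem(item: TableItem): string {
  switch (item.type) {
    case 'User':
      return `user ${item.username}`;
    case 'Order':
      return `order with ${item.order_items.length} item(s)`;
    case 'Product':
      return `product ${item.product_code} at ${item.price}`;
  }
}

const sampleUser: UserItem = {
  pk: makePk('User', 42423),
  type: 'User',
  username: 'john.doe',
  date_created: '2018-05-23T18:14:31+10:00',
};

// Same numeric ID as the user, but a different key thanks to the prefix.
const sampleProduct: ProductItem = {
  pk: makePk('Product', 42423),
  type: 'Product',
  product_code: 'product_045',
  price: 36.5,
  date_created: '2015-08-23T22:31:42+10:00',
};
```

Here `sampleUser` and `sampleProduct` share the raw ID 42423 yet end up with distinct keys, which is the whole point of the prefix.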
Each DynamoDB table must have one primary index, which contains a required hash key and an optional sort key (also named range key). Many beginners, like myself, ignore the sort key and use a UUID for the hash key, making the whole table a simple key-value store. This is not entirely useless, but you lose the chance to enjoy the query power provided by DynamoDB.
The hash key is not meant to be just a UUID. You should think of the hash key as a partitioning key. The hash key alone is not required to be unique, while the combination of hash key + sort key is. Imagine you walk into a library - how do you quickly find the book you want? If you know the category of the book, say, fiction, then you can go straight to the fiction area - this is just like using a hash key to quickly narrow down the search scope. Next, knowing the name of the book, you can find it without checking the name of every book in that area, thanks to the fact that the books in the fiction area have already been sorted. Following this example, the data we store in DynamoDB may look like the table below (only primary index columns are shown).
pk | sk |
---|---|
BOOK#fiction | BOOK#Anna Karenina |
BOOK#fiction | BOOK#To Kill a Mockingbird |
BOOK#programming | BOOK#Clean Code |
BOOK#programming | BOOK#The Pragmatic Programmer |
The example above only shows 4 rows of data, but you can imagine that a library could have a few thousand books per category. Without the sort key, you can only rely on the hash key. That means, whenever you want to find a book in the library, you can only use its category to narrow down the search scope to a certain level, and after that, you have to scan every book until you find the one you want. The time complexity is roughly O(1) + O(N) = O(N). With a sort key, you're able to perform a search on a sorted list, which can be as fast as O(logN). That is much faster, especially when you're facing a large number of books in the same category.
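To make the library analogy concrete, here is a small in-memory model in TypeScript. It is a conceptual sketch only, not how DynamoDB is implemented internally: the `library` object, `binarySearch`, and `findBook` are hypothetical names, and each hash key simply maps to a list of sort keys kept in sorted order.

```typescript
// Each hash key maps to its sort keys, stored in ascending order.
const library: Record<string, string[]> = {
  'BOOK#fiction': ['BOOK#Anna Karenina', 'BOOK#To Kill a Mockingbird'],
  'BOOK#programming': ['BOOK#Clean Code', 'BOOK#The Pragmatic Programmer'],
};

// Classic binary search over a sorted array; returns the index or -1.
function binarySearch(sorted: string[], target: string): number {
  let lo = 0;
  let hi = sorted.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const value = sorted[mid];
    if (value === undefined) return -1; // defensive: index out of range
    if (value === target) return mid;
    if (value < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return -1;
}

function findBook(pk: string, sk: string): boolean {
  // Step 1: jump straight to the category (hash key lookup, roughly O(1)).
  const partition: string[] | undefined = library[pk];
  if (partition === undefined) return false;
  // Step 2: search the sorted names within that category (O(log N)).
  return binarySearch(partition, sk) !== -1;
}

const foundCleanCode = findBook('BOOK#programming', 'BOOK#Clean Code');
```

Without the sorted list in step 2, the only option would be to walk every entry of the partition, which is the O(N) case described above.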
The difference is more dramatic in DynamoDB, where a query can only be performed on hash key + sort key - everything else will be a scan. If you want to find programming books with names ranging from M to P, which were written by Martin Fowler, you will need to ask DynamoDB to query with pk='BOOK#programming' AND sk BETWEEN 'BOOK#M' AND 'BOOK#P', and then scan its outcome to find all the books with filter author='Martin Fowler'. So, the complexity will be something like O(1) + O(logN) + O(n), where n is usually a much smaller number than N and can be ignored, so it is still roughly equal to O(1) + O(logN) = O(logN).
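As a rough sketch, the request for the Martin Fowler example could be assembled like this. `KeyConditionExpression`, `FilterExpression`, `ExpressionAttributeNames` and `ExpressionAttributeValues` are real parameters of the DynamoDB Query operation; everything else is an assumption: the table name `library`, the `buildBooksQuery` helper, the `BOOK#` prefixes taken from the table above, and plain JavaScript values as accepted by a document-style client.

```typescript
// Only the Query fields this sketch needs.
interface QueryInput {
  TableName: string;
  KeyConditionExpression: string;
  FilterExpression?: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues: Record<string, string>;
}

// Hypothetical helper that builds the request for
// "books in a category, named between `from` and `to`, by `author`".
function buildBooksQuery(
  category: string,
  from: string,
  to: string,
  author: string,
): QueryInput {
  return {
    TableName: 'library', // assumed table name
    // Key condition: resolved through the primary index (hash key + sort key).
    KeyConditionExpression: '#pk = :pk AND #sk BETWEEN :from AND :to',
    // Filter: applied to the items the key condition has already read.
    FilterExpression: '#author = :author',
    // Name placeholders sidestep any clash with DynamoDB reserved words.
    ExpressionAttributeNames: { '#pk': 'pk', '#sk': 'sk', '#author': 'author' },
    ExpressionAttributeValues: {
      ':pk': `BOOK#${category}`,
      ':from': `BOOK#${from}`,
      ':to': `BOOK#${to}`,
      ':author': author,
    },
  };
}

const fowlerQuery = buildBooksQuery('programming', 'M', 'P', 'Martin Fowler');
```

The split between the two expressions mirrors the complexity argument: the key condition narrows the search cheaply, and the filter only trims what was already read, so it does not reduce the amount of data the query consumes.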
Knowing how to use the sort key wisely is key to performing fast queries on DynamoDB.
One-to-many relationships are widely seen in any domain data model. The common mistake I've seen is that people try to solve this by using multiple tables. Let's explore a better way to handle this with a simple example: suppose a Blog has many Comments. The first question is: why don't we have a table blogs and a table comments? Because DynamoDB cannot join tables, so you would need separate requests to each table to load a blog together with its comments.
To model this relationship in one table, we can use the same hash key for both blog and comments, and give them different sort keys. For example:
pk | sk |
---|---|
BLOG#1234 | BLOG#1234 |
BLOG#1234 | COMMENT#3245 |
BLOG#1234 | COMMENT#3246 |
BLOG#1678 | BLOG#1678 |
BLOG#1678 | COMMENT#4577 |
Because Blog and Comment share the same hash key, we can query a blog with its comments just by pk=$blog_pk, for example, pk='BLOG#1234'. This will give us 3 items: the first 3 rows in the table above. During deserialization, we can use their sort keys to determine what type of items they are - a blog's sort key starts with BLOG#, while comments start with COMMENT#.
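The deserialization step could look something like the sketch below. The `RawItem` shape and the `splitBlogItems` helper are hypothetical; the input is assumed to be whatever a query on pk='BLOG#1234' returned, reduced here to its key attributes.

```typescript
// A query result item, reduced to the two key attributes.
type RawItem = { pk: string; sk: string };

interface BlogWithComments {
  blog?: RawItem;
  comments: RawItem[];
}

// Sort items into "the blog" and "its comments" using the sort key prefix.
function splitBlogItems(items: RawItem[]): BlogWithComments {
  const result: BlogWithComments = { comments: [] };
  for (const item of items) {
    const [kind] = item.sk.split('#'); // 'BLOG#1234' -> 'BLOG'
    if (kind === 'BLOG') {
      result.blog = item;
    } else if (kind === 'COMMENT') {
      result.comments.push(item);
    }
  }
  return result;
}

// The first three rows of the table above.
const queriedItems: RawItem[] = [
  { pk: 'BLOG#1234', sk: 'BLOG#1234' },
  { pk: 'BLOG#1234', sk: 'COMMENT#3245' },
  { pk: 'BLOG#1234', sk: 'COMMENT#3246' },
];

const blogPage = splitBlogItems(queriedItems);
```

After this call, `blogPage.blog` holds the blog row and `blogPage.comments` holds the two comment rows, all obtained from a single query.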
This technique can solve all sorts of one-to-many relationships. You can follow the same pattern to link other entities to Blog. If X has many P, Q, R, S, then by querying pk='X#1234', you can get all the items owned by X along with X itself. If you're only interested in the P items that belong to X, you can query pk='X#1234' AND begins_with(sk, 'P#') - I think you've got the idea now.
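A generic version of this pattern might be sketched as follows. `begins_with` is the actual function name in DynamoDB key condition expressions; the `ownedItemsCondition` helper and the entity names passed to it are hypothetical.

```typescript
interface KeyCondition {
  KeyConditionExpression: string;
  ExpressionAttributeValues: Record<string, string>;
}

// Build the key condition for "the parent plus everything it owns" or,
// when `childType` is given, only the children of that type.
function ownedItemsCondition(
  parentType: string,
  parentId: string,
  childType?: string,
): KeyCondition {
  const values: Record<string, string> = { ':pk': `${parentType}#${parentId}` };
  if (childType === undefined) {
    return { KeyConditionExpression: 'pk = :pk', ExpressionAttributeValues: values };
  }
  values[':prefix'] = `${childType}#`;
  return {
    KeyConditionExpression: 'pk = :pk AND begins_with(sk, :prefix)',
    ExpressionAttributeValues: values,
  };
}

const everythingUnderX = ownedItemsCondition('X', '1234');
const onlyPUnderX = ownedItemsCondition('X', '1234', 'P');
```

Note that the second condition matches only sort keys starting with P#, so the X item itself (whose sort key starts with X#) is not part of that result.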
If I were to write a "TL;DR" for this lesson, it'd be "do not read what you just saved". I know it sounds a bit odd, but reading this should tell you more about it.
The mistake I made was that I had 2 functions: in the first, I saved an item to a DynamoDB table, and immediately after that I passed the saved item's ID to the second function, where I read the saved item and used it to process another set of data. The problem with this approach was that I occasionally got stale data in the second function, despite the fact that the first function had just updated the data. I should have known better: by default, DynamoDB does not guarantee strongly consistent reads.
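This is not to say that DynamoDB sucks. There's a trade-off here, because consistency comes with its own price: per the AWS documentation, a strongly consistent read uses more read capacity than an eventually consistent one, may have higher latency, and may fail with a server error during a network delay or outage.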
That being said, if you accept the disadvantages above or are willing to handle the errors caused by them, you can consider using strongly consistent reads; otherwise, design your application logic to expect eventual consistency.
You should also know that strongly consistent reads are not supported on global secondary indexes; they are available only on the table's primary index and on local secondary indexes.
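One way to design for eventual consistency is to hand the item you already hold to the next step instead of reading it back. The sketch below is a toy simulation, not real DynamoDB behaviour: `LaggingStore` is a hypothetical in-memory store whose reads only see a write after `catchUp()` runs, standing in for an eventually consistent read that has not yet observed the latest write.

```typescript
type Post = { pk: string; sk: string; title: string };

// Hypothetical store: writes land in `latest`, reads come from `readable`,
// and the two only agree after `catchUp()` has been called.
class LaggingStore {
  private latest: Record<string, Post> = {};
  private readable: Record<string, Post> = {};

  put(item: Post): void {
    this.latest[`${item.pk}|${item.sk}`] = item;
  }

  // Simulates the moment when readers finally see the latest writes.
  catchUp(): void {
    for (const key of Object.keys(this.latest)) {
      const item = this.latest[key];
      if (item !== undefined) this.readable[key] = item;
    }
  }

  get(pk: string, sk: string): Post | undefined {
    return this.readable[`${pk}|${sk}`];
  }
}

// Risky: save, then immediately read the same item back by key.
function saveThenReRead(store: LaggingStore, item: Post): Post | undefined {
  store.put(item);
  return store.get(item.pk, item.sk); // may be missing or stale
}

// Safer: pass the item we just saved straight to the next step.
function saveThenPassAlong(
  store: LaggingStore,
  item: Post,
  next: (saved: Post) => string,
): string {
  store.put(item);
  return next(item);
}

const demoStore = new LaggingStore();
const draft: Post = { pk: 'BLOG#1234', sk: 'BLOG#1234', title: 'Hello' };
const reRead = saveThenReRead(demoStore, draft);
const shouted = saveThenPassAlong(demoStore, draft, (saved) => saved.title.toUpperCase());
```

In this simulation `reRead` comes back empty because the read ran before the store caught up, while `shouted` is computed from the in-memory item and never depends on read timing. When re-reading is truly unavoidable, the real GetItem and Query operations accept a `ConsistentRead` parameter, with the costs described above.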
This is not an advertisement, nor am I the author of this DynamoDB book. My manager bought this book and shared it with me. I've learned quite a lot from it, and the lessons I wrote on this page cover perhaps only 10 percent of what the book can offer. I consider this book a must-read once you understand the basics of DynamoDB.
There you have it. I hope these 6 lessons help you have a smoother journey with DynamoDB and avoid the beginner mistakes I made.
© 2022 disasterdev.net. All rights reserved