DynamoDB is sometimes considered just a simple key-value store, but nothing could be further from the truth. DynamoDB can handle complex access patterns, from highly-relational data models to time series data or even geospatial data.
In this post, we'll see how to model one-to-many relationships in DynamoDB. One-to-many relationships are at the core of nearly all applications. In DynamoDB, you have a few different options for representing one-to-many relationships.
We'll cover the basics of one-to-many relationships, then we'll review five different strategies for modeling one-to-many relationships in DynamoDB:
- Denormalization by using a complex attribute
- Denormalization by duplicating data
- Composite primary key + the Query API action
- Secondary index + the Query API action
- Composite sort keys with hierarchical data
Let's get started!
Basics of one-to-many relationships
A one-to-many relationship occurs when a particular object is the owner or source for a number of sub-objects. A few examples include:
- Workplace: A single office will have many employees working there; a single manager may have many direct reports.
- E-commerce: A single customer may make multiple orders over time; a single order may be comprised of multiple items.
- Software-as-a-Service (SaaS) accounts: An organization will purchase a SaaS subscription; multiple users will belong to one organization.
With one-to-many relationships, there's one core problem: how do I fetch information about the parent entity when retrieving one or more of the related entities?
In a relational database, there's essentially one way to do this--using a foreign key in one table to refer to a record in another table and using a SQL join at query time to combine the two tables.
There are no joins in DynamoDB. Instead, there are a number of strategies for one-to-many relationships, and the approach you take will depend on your needs.
In this post, we will cover five strategies for modeling one-to-many relationships with DynamoDB:
- Denormalization by using a complex attribute
- Denormalization by duplicating data
- Composite primary key + the Query API action
- Secondary index + the Query API action
- Composite sort keys with hierarchical data
We will cover each strategy in depth below--when you would use it, when you wouldn't use it, and an example. The end of the post includes a summary of the five strategies and when to choose each one.
Denormalization by using a complex attribute
Database normalization is a key component of relational database modeling and one of the hardest habits to break when moving to DynamoDB.
You can read the basics of normalization elsewhere, but there are a number of areas where denormalization is helpful with DynamoDB.
The first way we'll use denormalization with DynamoDB is by having an attribute that uses a complex data type, like a list or a map. This violates the first tenet of database normalization: to get into first normal form, each attribute value must be atomic. It cannot be broken down any further.
Let's see this by way of an example. Imagine we have an e-commerce site where there are Customer entities that represent people that have created an account on our site. A single Customer can have multiple mailing addresses to which they may ship items. Perhaps I have one address for my home, another address for my workplace, and a third address for my parents (a relic from the time I sent them a belated anniversary present).
In a relational database, you would model this with two tables using a foreign key to link the tables together, as follows:
Notice that each record in the `Addresses` table includes a `CustomerId`, which identifies the Customer to which this Address belongs. You can use the join operation to follow the pointer to the record and find information about the Customer.
DynamoDB works differently. Because there are no joins, we need to find a different way to assemble data from two different types of entities. In this example, we can add a `MailingAddresses` attribute on our Customer item. This attribute is a map and contains all addresses for the given customer:
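As a rough sketch of what this could look like with boto3, here's a hypothetical Customer item; the table name, key schema, and address values are all assumptions for illustration:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("CustomersTable")  # hypothetical table name

# A Customer item with a complex MailingAddresses attribute (a map of maps).
table.put_item(
    Item={
        "CustomerId": "alexdebrie",  # partition key (assumed schema)
        "Name": "Alex DeBrie",
        "MailingAddresses": {
            "Home": {
                "Street": "111 First St",
                "City": "Omaha",
                "State": "NE",
                "ZipCode": "68102",
            },
            "Work": {
                "Street": "222 Second Ave",
                "City": "Omaha",
                "State": "NE",
                "ZipCode": "68103",
            },
        },
    }
)
```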
Because `MailingAddresses` contains multiple values, it is no longer atomic and thus violates the principles of first normal form.
There are two factors to consider when deciding whether to handle a one-to-many relationship by denormalizing with a complex attribute:
**Do you have any access patterns based on the values in the complex attribute?**
All data access in DynamoDB is done via primary keys and secondary indexes. You cannot use a complex attribute like a list or a map in a primary key. Thus, you won't be able to make queries based on the values in a complex attribute.
In our example, we don't have any access patterns like "Fetch a Customer by his or her mailing address". All use of the `MailingAddresses` attribute will be in the context of a Customer, such as displaying the saved addresses on the order checkout page. Given these needs, it's fine for us to save them in a complex attribute.
**Is the amount of data in the complex attribute unbounded?**
A single DynamoDB item cannot exceed 400KB of data. If the amount of data that is contained in your complex attribute is potentially unbounded, it won't be a good fit for denormalizing and keeping together on a single item.
In this example, it's reasonable for our application to put limits on the number of mailing addresses a customer can store. A maximum of 20 addresses should satisfy almost all use cases and avoid issues with the 400KB limit.
But you could imagine other places where the one-to-many relationship might be unbounded. For example, our e-commerce application has a concept of Orders and Order Items. Because an Order could have an unbounded number of Order Items (you don't want to tell your customers there's a maximum number of items they can order!), it makes sense to split Order Items separately from Orders.
If the answer to either of the questions above is "Yes", then denormalization with a complex attribute is not a good fit to model that one-to-many relationship.
Denormalization by duplicating data
In the strategy above, we denormalized our data by using a complex attribute. This violated the principles of first normal form for relational modeling. In this strategy, we'll continue our crusade against normalization.
Here, we'll violate the principles of second normal form by duplicating data across multiple items.
In all databases, each record is uniquely identified by some sort of key. In a relational database, this might be an auto-incrementing primary key. In DynamoDB, this is the primary key.
To get to second normal form, each non-key attribute must depend on the whole key. This is a confusing way to say that data should not be duplicated across multiple records. If data is duplicated, it should be pulled out into a separate table. Each record that uses that data should refer to it via a foreign key reference.
Imagine we have an application that contains Books and Authors. Each Book has an Author, and each Author has some biographical information, such as their name and birth year. In a relational database, we would model the data as follows:
Note: In reality, a book can have multiple authors. For simplification of this example, we're assuming each book has exactly one author.
This works in a relational database as you can join those two tables at query-time to include the author's biographical information when retrieving details about the book.
But we don't have joins in DynamoDB. So how can we solve this? We can ignore the rules of second normal form and include the Author's biographical information on each Book item, as shown below.
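As a rough sketch (the table name, key schema, and attribute names are assumptions for illustration), the denormalized Book items might be written like this with boto3:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("BooksTable")  # hypothetical table name

# Each Book item carries a copy of the Author's biographical details.
books = [
    {
        "Author": "Stephen King",   # partition key (assumed schema)
        "BookTitle": "It",          # sort key (assumed schema)
        "AuthorBirthdate": "September 21, 1947",
        "AuthorHometown": "Portland, Maine",
    },
    {
        "Author": "Stephen King",
        "BookTitle": "The Shining",
        "AuthorBirthdate": "September 21, 1947",
        "AuthorHometown": "Portland, Maine",
    },
]

with table.batch_writer() as batch:
    for book in books:
        batch.put_item(Item=book)
```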
Notice that there are multiple Books that contain the biographical information for the Author Stephen King. Because this information won't change, we can store it directly on the Book item itself. Whenever we retrieve the Book, we will also get information about the parent Author item.
There are two main questions you should ask when considering this strategy:
**Is the duplicated information immutable?**
**If the data does change, how often does it change and how many items include the duplicated information?**
In our example above, we've duplicated biographical information that isn't likely to change. Because it's essentially immutable, it's OK to duplicate it without worrying about consistency issues when that data changes.
Even if the data you're duplicating does change, you still may decide to duplicate it. The big factors to consider are how often the data changes and how many items include the duplicated information.
If the data changes fairly infrequently and the denormalized items are read a lot, it may be OK to duplicate to save money on all of those subsequent reads. When the duplicated data does change, you'll need to work to ensure it's changed in all those items.
Which leads us to the second factor--how many items contain the duplicated data. If you've only duplicated the data across three items, it can be easy to find and update those items when the data changes. If that data is copied across thousands of items, it can be a real chore to discover and update each of those items, and you run a greater risk of data inconsistency.
Essentially, you're balancing the benefit of duplication (in the form of faster reads) against the costs of updating the data. The costs of updating the data include both factors above. If the costs of either of the factors above are low, then almost any benefit is worth it. If the costs are high, the opposite is true.
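If you do go ahead and duplicate mutable data, plan for the update path. A minimal sketch of what that could look like, reusing the assumed Books schema from above (and not handling pagination):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("BooksTable")  # hypothetical table name

# Find every Book item that carries a copy of this Author's details...
response = table.query(KeyConditionExpression=Key("Author").eq("Stephen King"))

# ...and rewrite the duplicated attribute on each one.
for book in response["Items"]:
    table.update_item(
        Key={"Author": book["Author"], "BookTitle": book["BookTitle"]},
        UpdateExpression="SET AuthorHometown = :hometown",
        ExpressionAttributeValues={":hometown": "Bangor, Maine"},  # illustrative value
    )
```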
Composite primary key + the Query API action
The next strategy to model one-to-many relationships--and probably the most common way--is to use a composite primary key plus the Query API to fetch an object and its related sub-objects.
A key concept in DynamoDB is the notion of item collections. Item collections are all the items in a table or secondary index that share the same partition key. When using the Query API action, you can fetch multiple items within a single item collection. This can include items of different types, which gives you join-like behavior with much better performance characteristics.
Let's use one of the examples from the beginning of this section. In a SaaS application, Organizations will sign up for accounts. Then, multiple Users will belong to an Organization and take advantage of the subscription.
Because we'll be including different types of items in the same table, we won't have meaningful attribute names for the attributes in our primary key. Rather, we'll use generic attribute names, like `PK` and `SK`, for our primary key.
We have two types of items in our table--Organizations and Users. The patterns for the `PK` and `SK` values are as follows:
| Entity | PK | SK |
| --- | --- | --- |
| Organizations | `ORG#<OrgName>` | `METADATA#<OrgName>` |
| Users | `ORG#<OrgName>` | `USER#<UserName>` |
The table below shows some example items:
In this table, we've added five items--two Organization items for Microsoft and Amazon, and three User items for Bill Gates, Satya Nadella, and Jeff Bezos.
Outlined in red is the item collection for items with the partition key of `ORG#MICROSOFT`. Notice how there are two different item types in that collection. In green is the Organization item type in that item collection, and in blue is the User item type in that item collection.
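As a rough sketch of how items like these could be written with boto3 (the table name and non-key attributes are assumptions for illustration), each item type simply supplies its own `PK` and `SK` values alongside its other attributes:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SaaSTable")  # hypothetical table name

# Organization item
table.put_item(
    Item={
        "PK": "ORG#MICROSOFT",
        "SK": "METADATA#MICROSOFT",
        "OrgName": "Microsoft",
        "PlanType": "Enterprise",  # illustrative attribute
    }
)

# User item in the same item collection (same PK, different SK)
table.put_item(
    Item={
        "PK": "ORG#MICROSOFT",
        "SK": "USER#BILLGATES",
        "UserName": "Bill Gates",
        "Role": "Member",  # illustrative attribute
    }
)
```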
This primary key design makes it easy to solve four access patterns:
1. **Retrieve an Organization.** Use the GetItem API call and the Organization's name to make a request for the item with a PK of `ORG#<OrgName>` and an SK of `METADATA#<OrgName>`.
2. **Retrieve an Organization and all Users within the Organization.** Use the Query API action with a key condition expression of `PK = ORG#<OrgName>`. This would retrieve the Organization and all Users within it, as they all have the same partition key.
3. **Retrieve only the Users within an Organization.** Use the Query API action with a key condition expression of `PK = ORG#<OrgName> AND begins_with(SK, "USER#")`. The use of the `begins_with()` function allows us to retrieve only the Users without fetching the Organization object as well.
4. **Retrieve a specific User.** If you know both the Organization name and the User's username, you can use the GetItem API call with a PK of `ORG#<OrgName>` and an SK of `USER#<UserName>` to fetch the User item.
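For example, access patterns 2 and 3 might look something like this with boto3 (the table name is an assumption):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SaaSTable")  # hypothetical table name

# Access pattern 2: the Organization and all of its Users (one item collection).
org_and_users = table.query(
    KeyConditionExpression=Key("PK").eq("ORG#MICROSOFT")
)["Items"]

# Access pattern 3: only the Users, by narrowing the sort key with begins_with().
users_only = table.query(
    KeyConditionExpression=Key("PK").eq("ORG#MICROSOFT")
    & Key("SK").begins_with("USER#")
)["Items"]
```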
While all four of these access patterns can be useful, the second access pattern--Retrieve an Organization and all Users within the Organization--is most interesting for this discussion of one-to-many relationships. Notice how we're emulating a join operation in SQL by locating the parent object (the Organization) in the same item collection as the related objects (the Users). We are pre-joining our data by arranging them together at write time.
This is a pretty common way to model one-to-many relationships and will work for a number of situations.
Secondary index + the Query API action
A similar pattern for one-to-many relationships is to use a global secondary index and the Query API to fetch many. This pattern is almost the same as the previous pattern but it uses a secondary index rather than the primary keys on the main table.
You may need to use this pattern instead of the previous pattern because the primary keys in your table are reserved for another purpose. It could be some write-specific purpose, such as to ensure uniqueness on a particular property, or it could be because you have hierarchical data with a number of levels.
For the latter situation, let's go back to our most recent example. Imagine that in your SaaS application, each User can create and save various objects. If this were Google Drive, it might be a Document. If this were Zendesk, it might be a Ticket. If it were Typeform, it might be a Form.
Let's use the Zendesk example and go with a Ticket. For our case, let's say that each Ticket is identified by an ID that is a combination of a timestamp plus a random hash suffix. Further, each Ticket belongs to a particular User in an Organization.
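Purely for illustration, one of many ways you could generate an ID like that (this is an assumption, not a prescribed format):

```python
import secrets
import time

def new_ticket_id() -> str:
    """Combine a sortable timestamp with a random suffix, e.g. '1625157600-f3a9c2d1'."""
    timestamp = int(time.time())
    suffix = secrets.token_hex(4)
    return f"{timestamp}-{suffix}"
```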
If we wanted to find all Tickets that belong to a particular User, we could try to intersperse them with the existing table format from the previous strategy, as follows:
Notice the two new Ticket items outlined in red.
The problem with this is that it really jams up my prior use cases. If I want to retrieve an Organization and all its Users, I'm also retrieving a bunch of Tickets. And since Tickets are likely to vastly exceed the number of Users, I'll be fetching a lot of useless data and making multiple pagination requests to handle our original use case.
Instead, let's try something different. We'll do three things:
1. We'll model our Ticket items to be in a separate item collection altogether in the main table. For the `PK` and `SK` values, we'll use a pattern of `TICKET#<TicketId>`, which will allow for direct lookups of the Ticket item.
2. Create a global secondary index named `GSI1` whose keys are `GSI1PK` and `GSI1SK`.
3. For both our Ticket and User items, add values for `GSI1PK` and `GSI1SK`. For both items, the `GSI1PK` attribute value will be `ORG#<OrgName>#USER#<UserName>`.

   For the User item, the `GSI1SK` value will be `USER#<UserName>`.

   For the Ticket item, the `GSI1SK` value will be `TICKET#<TicketId>`.
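As a rough sketch, a Ticket item written this way might look like the following with boto3; the table name and the non-key attributes are assumptions:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SaaSTable")  # hypothetical table name

# A Ticket item: its own item collection in the base table,
# plus GSI1 attributes that group it with its parent User.
table.put_item(
    Item={
        "PK": "TICKET#1625157600-f3a9c2d1",
        "SK": "TICKET#1625157600-f3a9c2d1",
        "GSI1PK": "ORG#MICROSOFT#USER#BILLGATES",
        "GSI1SK": "TICKET#1625157600-f3a9c2d1",
        "Subject": "Can't log in",  # illustrative attribute
        "Status": "Open",           # illustrative attribute
    }
)
```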
Now our base table looks as follows:
Notice that our Ticket items are no longer interspersed with their parent Users in the base table. Further, the User items now have additional `GSI1PK` and `GSI1SK` attributes that will be used for indexing.
If we look at our GSI1 secondary index, we see the following:
This secondary index has an item collection with both the User item and all of the user's Ticket items. This enables the same access patterns we discussed in the previous section.
One last note before moving on--notice that I've structured it so that the User item is the last item in the partition. This is because the Tickets are sorted by timestamp. It's likely that I'll want to fetch a User and the User's most recent Tickets, rather than the oldest tickets. As such, I order it so that the User is at the end of the item collection, and I can use the `ScanIndexForward=False` property to indicate that DynamoDB should start at the end of the item collection and read backwards.
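A minimal sketch of that query with boto3, using the table and index names assumed above (the `Limit` value is illustrative):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SaaSTable")  # hypothetical table name

# Fetch the User item plus the User's most recent Tickets by reading
# the GSI1 item collection from the end backwards.
response = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("ORG#MICROSOFT#USER#BILLGATES"),
    ScanIndexForward=False,
    Limit=11,  # e.g. the User item plus the 10 most recent Tickets
)
items = response["Items"]
```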
Composite sort keys with hierarchical data
In the last two strategies, we saw some data with a couple levels of hierarchy--an Organization has Users, which create Tickets. But what if you have more than two levels of hierarchy? You don't want to keep adding secondary indexes to enable arbitrary levels of fetching throughout your hierarchy.
A common example in this area is around location-based data. Let's keep with our workplace theme and imagine you're tracking all the locations of Starbucks around the world. You want to be able to filter Starbucks locations on arbitrary geographic levels--by country, by state, by city, or by zip code.
We could solve this problem by using a composite sort key. This term is a little confusing, because we're using a composite primary key on our table. The term composite sort key means that we'll be smashing a bunch of properties together in our sort key to allow for different search granularity.
Let's see how this looks in a table. Below are a few items:
In our table, the partition key is the country where the Starbucks is located. For the sort key, we include the State, City, and ZipCode, with each level separated by a `#`. With this pattern, we can search at four levels of granularity using just our primary key!
The patterns are:
1. **Find all locations in a given country.** Use a Query with a key condition expression of `PK = <Country>`, where Country is the country you want.
2. **Find all locations in a given country and state.** Use a Query with a key condition expression of `PK = <Country> AND begins_with(SK, '<State>#')`.
3. **Find all locations in a given country, state, and city.** Use a Query with a key condition expression of `PK = <Country> AND begins_with(SK, '<State>#<City>')`.
4. **Find all locations in a given country, state, city, and zip code.** Use a Query with a key condition expression of `PK = <Country> AND begins_with(SK, '<State>#<City>#<ZipCode>')`.
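For instance, the country-and-state pattern might look something like this with boto3 (the table name is an assumption):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("StarbucksLocations")  # hypothetical table name

# Find all locations in Nebraska, USA by prefix-matching the composite sort key.
response = table.query(
    KeyConditionExpression=Key("PK").eq("USA") & Key("SK").begins_with("NE#")
)
locations = response["Items"]
```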
This composite sort key pattern won't work for all scenarios, but it can be great in the right situation. It works best when:
- You have many levels of hierarchy (>2), and you have access patterns for different levels within the hierarchy.
- When searching at a particular level in the hierarchy, you want all subitems in that level rather than just the items in that level.
For example, recall our SaaS example when discussing the primary key and secondary index strategies. When searching at one level of the hierarchy--find all Users--we didn't want to dip deeper into the hierarchy to find all Tickets for each User. In that case, a composite sort key will return a lot of extraneous items.
If you want a detailed walkthrough of this example, I wrote up the full Starbucks example on DynamoDBGuide.com.
Conclusion
In this post, we discussed five different strategies you can implement when modeling data in a one-to-many relationship with DynamoDB. The strategies are summarized in the table below.
| Strategy | Notes | Relevant examples |
| --- | --- | --- |
| Denormalize + complex attribute | Good when nested objects are bounded and are not accessed directly | User mailing addresses |
| Denormalize + duplicate | Good when duplicated data is immutable or infrequently changing | Books & Authors; Movies & Roles |
| Primary key + Query API | Most common. Good for multiple access patterns on the two entity types. | Most one-to-many relationships |
| Secondary index + Query API | Similar to primary key strategy. Good when primary key is needed for something else. | Most one-to-many relationships |
| Composite sort key | Good for very hierarchical data where you need to search at multiple levels of the hierarchy | Starbucks locations |
Consider your needs when modeling one-to-many relationships and determine which strategy works best for your situation.
If you have questions or comments on this piece, feel free to leave a note below or email me directly.