January 11, 2022

The ultimate guide on DB-generated IDs in JPA Entities

According to the JPA specification, Entity is a Java class that meets the following requirements:

  1. Annotated with @Entity annotation
  2. Has no-args constructor
  3. Is not final
  4. Has an ID field (or fields)

As you can see, ID is required. Why is that?

Introduction: why do we need IDs in JPA Entities at all?

JDBC and relational databases do not require primary or unique keys for a table. While working with JDBC, we communicate with a database using its own language - native SQL queries. To obtain a dataset, developers run a select statement that returns corresponding tuples. To save or update data, we need to compose another insert or update statement. At this application-to-database level of communications, there is no direct link between objects in the application and records stored in the database. Usually, this mapping is managed manually as a part of the business logic.

JPA takes a different approach. It introduces entities - Java objects that are rigidly tied to their records in the database. Therefore, the JPA specification requires developers to define a field or a set of fields to establish the one-to-one association between an entity instance and a particular DB record. This way, developers can fetch JPA entities from the database, work with them and save them later without calling any insert or update statements. This is one of the key concepts allowing developers to focus mainly on the business logic, while most of the boilerplate operations are being handled by a JPA implementation itself, and IDs are a vital part of this process.

Note: The ID does not have to map to the column(s) defined as the table primary key. We need to map the ID to column(s) that uniquely identify each row. But for the article, we will continue to use the terms ID and primary key interchangeably.

ID types: what we have

We need to define IDs in our entities. What options do we have?

First, we can define an ID of a “simple” or a “composite” structure. The “simple” ID is represented by a single field in an entity, the composite one – by a separate class that contains a set of fields that identify an entity.

Usually, we use simple IDs for our JPA entities. Simple IDs can be generated automatically (surrogate IDs), and this is the most common way for handling ID values. Generation can happen on the database side (server-side generation) or in the application (client-side generation). Both methods have their pros and cons.

In this article, we will focus on server-side generated IDs. For the sake of simplicity, we will use Hibernate ORM as the default JPA implementation for all examples unless we explicitly mention another ORM.

Generated IDs - why should we care?

ID generation event usually happens only once – when we save a new entity to the database. So, suppose we have an application that doesn’t create many entities often (rule of thumb – let’s say, not more than 100 entities per second) and doesn’t share the database with other applications. In that case, in theory, we can use any ID generation strategy. An application for managing countries list is a good example – we do not create new countries very often. But what about electricity metering? If we have 100 meters, sending data every hour, we’ll have to save 100 measurements every hour. Basically, we can save one measurement every 36 seconds. Doesn’t look like too much. And what about thousands of meters? Tens of thousands? What if we decide to get measurements every 10 minutes? How much will it cost for a business to stop the information system to change the ID generation strategy?

In practice, applications tend to grow and change as well as businesses, and that’s why it is essential to choose a proper ID generation strategy to avoid painful migrations in the future. We are going to mention performance a lot in this article, and even our applications are not a new Facebook or Twitter, and they don’t save millions of entities per second yet, we should think about the most appropriate ID generation strategy in advance to avoid problems in the future.

How generation works by default

The most straightforward way is to define a generated ID in a JPA entity is to annotate a field with the @Id and @GeneratedValue annotations. We don’t even need to specify any parameters for the @GeneratedValue. Defaults will do the trick, and you will get an ID field generated properly.

@Table(name = "pet")
@Entity
public class Pet {
    @Id
    @GeneratedValue
    @Column(name = "id", nullable = false)
    private Long id;
}

There are two types of default values: those that should not be changed from the beginning and those that should be. Default values do not break the application, but do they work well in the case of generated IDs? Let’s have a look at the @GeneratedValue default parameter values:

public @interface GeneratedValue {

    GenerationType strategy() default AUTO;

     String generator() default "";
}

As we can see, we have the generation strategy parameter set to ‘AUTO’. That means that the JPA provider decides how to generate a unique value for the ID. The question here: is the default value good enough? Let’s start with the list of strategies we can use.
JPA standard describes three more strategies in addition to AUTO:

  • IDENTITY - uses built-in database-specific identity column type for ID generation.
  • SEQUENCE - generates a unique ID value using a sequence.
  • TABLE - uses a separate table that emulates a sequence. When an application needs an ID, the JPA provider locks the table row, updates the stored ID value, and returns it to the application. This strategy provides the worst performance compared to the previous two and should be avoided if possible. You can read more about this strategy in the documentation.

According to the developer’s manual, if we use an ID type different from UUID (such as Long, Integer, etc.) and set the strategy to AUTO, Hibernate will do the following (since version 5.0):

  • Try to use the SEQUENCE ID generation strategy
  • If sequences are not supported (i.e., we use MySQL), it will use TABLE (or IDENTITY, prior to Hibernate 5.0) strategy for ID generation

Why does Hibernate try to use SEQUENCE as the default strategy? The key metric here is performance. The TABLE strategy is the worst in terms of performance. In this article, the author did some testing using different strategies. He was able to decrease save time for 10K entities from 185 seconds to 4.3 seconds by changing the ID generation strategy from IDENTITY to SEQUENCE and enabling some Hibernate optimizations. So, both fallback strategies (IDENTITY and TABLE) will not break the application, but the performance will not be great.

The problem here is that even the default configuration for the SEQUENCE will not perform well; the performance will be close to IDENTITY. It happens because a single database sequence is used for all entities, and the sequence parameters do not allow Hibernate to apply ID pooling optimization. We will look at the default SEQUENCE behavior in detail in the next section.

Conclusion: Leaving default values for the ID generation strategy will likely cause negative effects on our application performance. For the production applications, we need to change defaults to something more appropriate.

Sequence: how to define it properly?

The SEQUENCE strategy uses a separate DB object – sequence – to fetch and assign a unique ID value before inserting the data into the database. This provides batch INSERT operation support because the JPA provider does not need to fetch generated IDs after each INSERT like for identity columns, trigger-generated IDs, etc.

Defaults: are they good enough?

To define the default definition for the SEQUENCE strategy, we need to write the code below. In fact, this is what we get for the default AUTO strategy if our database supports sequences.

@Id
@GeneratedValue(strategy = GenerationType.SEQUENCE)
@Column(name = "id", nullable = false)
private Long id;

The SQL for the sequence generated for this definition (if we enable automatic DB creation in the Hibernate settings) will look like this:

create sequence hibernate_sequence start 1 increment 1;

The JPA provider will solely use this database sequence for all INSERT statements if we set default parameters for all SEQUENCE strategies in our application. This may cause some issues.

First, we might exhaust the sequence. In most databases, the maximum sequence value is 2^63-1, so it is quite hard to reach this limit. But it is still possible for the applications that generate a lot of new data, for example, IoT systems or banner networks generating billions of events per day.

* 2^63-1 is a big number. If we save 10.000 entities per second, we’ll need about 29 million years to exhaust the sequence. It means that in most cases, we might not bother about sequence ends, but we still need to be aware that the sequence is limited.*

Second, the performance will suffer. The default sequence increment is set to 1, which disables the Hibernate’s ID pool generation optimization for sequences. The JPA provider will fetch ID from every single INSERT statement from the sequence. For example, if we try to save two entities and have a look at the Hibernate SQL log, we’ll see something like this:

    select nextval ('hibernate_sequence')
    insert into pet (name, id) values (?, ?)
    select nextval ('hibernate_sequence')
    insert into pet (name, id) values (?, ?)

So, we select two ID values by executing two SELECT statements, assign those IDs to entities and then save them. This gives us an overhead – one additional ‘SELECT’ per one INSERT. This is a negative impact on the application performance.

Conclusion: Default settings for the SEQUENCE ID generation strategy is a good approach for non-data-intensive applications. If we plan something bigger, in order to avoid the issues with performance and sequence exhaustion, we need to alter the default strategy’s settings.

Sequences: what can we change?

Let’s start from specifying a dedicated sequence for an entity ID generation.

@Id
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "pet_seq")
@Column(name = "id", nullable = false)
private Long id;

For this definition, we’ll see that the following SQL is executed:

create sequence pet_seq start 1 increment 50

Hibernate uses ID generation pool optimization for the non-default sequences. The idea is to allocate a range of values for one session and use those values as IDs. By default, the number of allocated IDs is equal to 50.

The optimization works in the following way:

  • Step 1: Hibernate executes one select to get the ID from the sequence.
  • Step 2: If the selected value is equal to the sequence initial value, the Hibernate selects the next ID from the sequence as a high value, setting the initial value as a range low value. Otherwise, it goes to step 4.
  • Step 3: Hibernates inserts data assigning IDs from low to high range.
  • Step 4: As soon as Hibernate needs the next batch, it selects the next ID value from the sequence (which is bigger than the initial value). Hibernate calculates the available range for this ID
    based on the allocationSize parameter. Low value = ID – allocationSize, high = ID. Then Hibernate goes to step 3.

We perform only two extra SELECTs for the first 50 saved entities for default settings. For the following 50 entities, we perform only one additional select. For example, if we enable Hibernate SQL log, we can see something like this:

    select nextval ('pet_seq'); //selects 1 – got initial value, need to select next value
    select nextval ('pet_seq'); //selects 51 as range high value
    insert into pet (name, id) values (?, ?);// id=1
    insert into pet (name, id) values (?, ?);//id=2
//insert other 48 entities
    select nextval ('pet_seq'); //selects 101 as range next high value, calculates 10150 = 51 as the low
    insert into pet (name, id) values (?, ?);//id=51
//etc. 

There is one drawback: if the database session is closed (i.e., the application is restarted or we reopen the entity manager), unused IDs will be lost. A good example of such a short-living application may be a serverless lambda function. If we save only one entity per session and then exit from the application, we will lose 49 IDs forever. This behavior can lead to a sequence exhaustion, so for short sessions that deal with a small number of instances, we need to set a smaller ID allocation size to avoid wasting a lot of IDs.

To adjust the ID generation parameters, i.e., to decrease an allocation size, we can use the @SequenceGenerator annotation. Sequence generators allow us to use either an existing sequence or create a new one with the desired parameters. For example, in the code below, we provide the complete sequence definition and specify the ID allocation size to 20.

@Id
@SequenceGenerator(name = "pet_seq", 
        sequenceName = "pet_sequence", 
        initialValue = 1, allocationSize = 20)
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "pet_seq")
@Column(name = "id", nullable = false)
private Long id;

Hibernate will generate the following SQL for this definition if the sequence doesn’t exist:

create sequence pet_sequence start 1 increment 20

When we define a sequence generator, we need to remember the following: if we specify the existing sequence name, and hibernate schema validation is enabled, the allocationSize parameter must match the increment parameter for the database sequence; otherwise, the application won’t start.

If we want to change the sequence validation behavior in Hibernate, we can disable schema validation or set parameter’s hibernate.id.sequence.increment_size_mismatch_strategy value to value LOG or FIX.

For the LOG parameter value, Hibernate will ignore the mismatch. It may cause PK uniqueness violation because the ID allocation range calculation won’t match the actual sequence increment value, and we can get duplicate ID values. For example, for the allocationSize equal to 20 and the sequence increment is 1, we’ll get something like this:

    select nextval ('pet_seq'); // selects 1 initial value, need to select next value
    select nextval ('pet_seq'); //selects 2 as range high value
    insert into pet (name, id) values (?, ?);// id=1
    insert into pet (name, id) values (?, ?);//id=2
//Now we’ve exceeded high value, need to select the next batch
    select nextval ('pet_seq'); //selects 3 as range high value, calculates 320 = -17 as the low
    insert into pet (name, id) values (?, ?);//id=-17
    insert into pet (name, id) values (?, ?);//id=-16
//Restarting the application
    select nextval ('pet_seq'); //selects 4 as range high value, calculates 420 = -16 as the low
    insert into pet (name, id) values (?, ?);//id=-16 getting unique constraint violation

Suppose we set the parameter to FIX. In that case, the allocationSize parameter in the JPA sequence generator will be adjusted automatically to match the DB sequence increment parameter, e.g., to 1 for the case above.

One more feature of the @SequenceGenerator definition is that we can reuse the same sequence for different entities by specifying the same sequenceName in different sequence generators.

//ID Definition for ‘Pet’ entity
@Id
@SequenceGenerator(name = "pet_seq", sequenceName = "common_sequence")
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "pet_seq")
@Column(name = "id", nullable = false)
private Long id;

//ID Definition for ‘Owner’ entity
@Id
@SequenceGenerator(name = "owner_seq", sequenceName = " common_sequence ")
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "owner_seq")
@Column(name = "id", nullable = false)
private Long id;

Conclusion: Defining sequence generators allows us to:

  1. Use ID fetching optimization for better application performance.
  2. Tune the fetch size according to an application workload to keep a balance between frequent ID fetching and wasting some IDs due to a session close.
  3. Share the same sequence between different entities.

This makes the SEQUENCE ID generation almost an ideal choice. Are there any caveats in this strategy?

Multiple clients for the DB: any problems here?

Even though the SEQUENCE generation strategy uses sequence from the database, it assigns ID value in the application code. It means that other applications that use the same database might not be aware of the sequence existence, hence the strategy for ID generation.

Using more than one client for our database may lead to cases when other DB clients assign IDs directly without using the sequence. These ID values might be the same as reserved for unsaved entities in our application. When our application starts saving entities, it might cause PK uniqueness violation, and the data will not be stored.

Conclusion: SEQUENCE generation strategy for IDs might not work well if multiple clients update the database. For this case, ID generation should be controlled by the database. ‘IDENTITY’ strategy works better here.

Identity: pros and cons

IDENTITY is the “default” strategy for ID generation for developers who work with a MySQL database. Since many RDBMSes (apart from MySQL) support identity data type for column definition, we can see this strategy in many applications. Sometimes developers choose it because “it worked in my previous project,” and nobody wants to change the habit if it works. By specifying the strategy like in the code below, we get a reliable ID generation process managed in a single place - the database.

@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
@Column(name = "id", nullable = false)
private Long id;

For every INSERT statement, a database automatically generates a unique ID value for the corresponding @Id field. This is similar to the SEQUENCE strategy behavior if we define an allocationSize equals to “1”. For both cases, we need to fetch an ID value for every INSERT statement. There is one difference, however. It is important to understand that using identity columns means that the entity row must be physically inserted before the identifier value is known. Since the database generates the ID’s value, a JPA provider should return it to the application after inserting the data.

The question is: How does a JPA provider fetch the ID after the record is inserted? If the database driver supports JDBC 3 API (most modern databases do), it is done automatically. The JPA provider implicitly invokes the Statement.getGeneratedValues() method, which returns the generated value. Under the hood, the JPA provider generates a SQL statement like this:

insert into pet (name) values (‘Buddy’) RETURNING *

Suppose we use an older version of a database driver. In that case, an additional select will be executed (usually by the JPA provider, but sometimes we need to do it manually) to fetch the generated value, similar to the code below. This is the log for the old PostgreSQL version, which emulates the IDENTITY datatype using DB sequences. For other RDBMSes, the SQL will be similar.

    insert into pet (name) values (?)
    select currval('pet_id_seq')
    insert into pet (name) values (?)
    select currval('pet_id_seq')

This behavior does not allow batch inserts execution by a JPA provider. Since the provider needs to get the generated ID after each INSERT, it splits the batch operation to single INSERT operators and fetches the generated ID value after each execution. We just cannot send a batch of INSERT statements and get a batch of generated IDs because we won’t be able to associate generated IDs to JPA objects reliably. The reason is that database does not guarantee that the order of the generated IDs will be the same as the order of INSERTS. Moreover, INSERT statements may not be executed in the same order as in the batch. So, the only reliable way to get IDs for inserted records – split the batch.

Conclusion: IDENTITY strategy is easy to use, and it guarantees reliable application-independent primary key value generation.

From the other side, this strategy provides a sub-optimal performance in regular INSERT operations, and batch INSERT operations are not supported at all. Therefore, it is recommended to use IDENTITY for cases where we save small amounts of new data or when several independent client applications change the database.

Using Spring Data JPA, Hibernate or EclipseLink and code in IntelliJ IDEA? Make sure you are ultimately productive with the JPA Buddy plugin!

It will always give you a valuable hint and even generate the desired piece of code for you: JPA entities and Spring Data repositories, Liquibase changelogs and Flyway migrations, DTOs and MapStruct mappers and even more!

Conclusion: Identity vs Sequence vs Others

So, which ID generation strategy should we choose for our JPA entities? Here are some recommendations:

  1. SEQUENCE is the preferred choice – it provides better overall performance compared to other strategies. Also, we need to consider the following:
    a. It is a good practice to define a separate sequence for each JPA entity. Avoid default sequence generator parameters.
    b. We should use @SequenceGenerator annotation to fine-tune sequence parameters.
    c. We need to define batch size according to the application workload patterns.
  2. We may prefer the IDENTITY strategy for the following cases:
    a. If the database does not support sequences.
    b. For entities that are not frequently created and saved.
    c. If our database is modified by other applications.
  3. Avoid TABLE and AUTO generation strategies if possible. They provide the worst performance.
    The IDs list is not limited to simple server-generated IDs only. In the following articles, we will talk about client-generated IDs and UUIDs in particular. Also, though not very popular, composite IDs have something to learn about, so we will talk about them too.