Data as Code: An Alternative to Storing Static and Semi-static Data in Databases

Is storing all data in a database always the best choice?

Published on Sunday, March 9, 2025

We often unquestioningly follow the mantra: "Put everything in the (relational) database." But what happens when your application grows to hundreds or even thousands of tables, many containing just a handful of static or rarely-changing entries? Suddenly, your database looks like an unorganized closet, not an efficient data store.

Why do we store even simple static datasets, such as selectors for measurement units, in the database? Why do we store rarely changing data, like lists of countries and languages, in storage optimized for reading and writing? Is there a way to simplify the architecture and improve the performance of static data access?

Why Use Data as Code? Dynamic and Static Data

The main argument for the "store everything in a (relational) database" approach is that all of an application's data should live in the database, so everything can be recreated with a single query. It is a valid argument, but the cases in which you must do this are rare. Another argument is that multiple applications (from different teams) should use the same database. I strongly disagree with this (in most cases). It reminds me of the tech legend - the Bezos API Mandate:

  • All teams will henceforth expose their data and functionality through service interfaces.
  • There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
  • ...

Direct access to the database from multiple applications may be the fastest option, but database coupling is too high a price to pay.

Another problem is that we often do not strictly split dynamic data - frequently changed and user-created - from (semi)static data - infrequently changing, non-user, application-consumed. An example of (semi)static data would be a list of countries; an example of dynamic data would be a customer's order history. In the worst-case scenario, Database-per-Tenant, you must update every one of your databases with the static data.

An alternative, still within the "all data should be in a database" domain, is a separate database for static data. The problem with this approach is that you lose the ability to join tables in a single query. Sure, there are ways to join tables across separate databases, but let's not go down that route, especially for frequently used queries.

Data as a Service for (Semi) Static Data

An interesting solution is Data as a Service. Put the (semi)static data in a separate service and make a synchronous call. In many cases, this works great. You have a specialized standalone service that is easy to update with new data. You are also free to share this service with many applications.

The main problem is data access. You must make a synchronous call to an external service to get the data. You can cache, but if there is a chance the data could be updated, you need a way to invalidate the cache. Things get even more complex if it is essential to have the same data across all servers or applications and there is a cache per service.

Also, there are problems inherent to the nature of services: accessing something external always adds latency and performance overhead, and you lose the ability to work offline.

For (semi)static data, the main benefit of Data as a Service is the independent deployment of large datasets that are updated frequently but are still static. Security can also be a benefit because you can easily swap secret data for gibberish in development, even on the same service.

I especially like that the Data as a Service approach can be centralized. We avoid local data duplication, especially in the database-per-customer approach, which makes database backups more efficient. Also, when the data is in your application, it is your problem; when you merely access it, it is "someone else's".

Data as Code

While Data as a Service can work, it introduces network latency and cache complexity. What if we could store semi-static data directly in our application?

Data as Code is a somewhat controversial approach: storing static data in the application itself. As developers, we have a 'natural tendency' for everything to follow the Single Responsibility Principle. Following that principle strictly, code should be a set of instructions, not the data for the application. However, I genuinely believe that in many enterprise applications, we can store a lot of (semi)static data in the app code or in a separate library.

Important Data Parameters

There are a few parameters we have to take into consideration when deciding to store Data as Code:

  • Change Frequency - How often does the data change? Is it rare, like a list of possible units in a combo box? Or is it some regulation that changes more frequently, like several times a year?
  • Complexity - How many fields? Are some of the fields related to other data?
  • Number of Rows - How large is the dataset? 5 rows? 10k rows?
  • Access Frequency - How often is the data accessed or queried?

Data as Library

The simplest solution for storing Data as Code is to put it with the main code. The obvious advantage of this approach is that you have everything 'under one roof'. The disadvantage is a longer build time, mainly if you use code-generating tools that automatically regenerate the entire dataset as code. You also cannot easily give update access and capabilities to people outside the primary team.

However, having data as a separate library can be tedious, especially while the method signatures are still changing. Good infrastructure as code helps because it is very easy to set up new libraries, but you still have to push the signature changes, build the library, and then reference it or upgrade the library version in the main code.

Many tools can automatically update data libraries to the newest version across the services that use them.

Data as Code in .NET

I use Data as Code in .NET in several ways:

  • Smart Enum
  • Static Class
  • Resx (+ Static Class)
  • Embedded Databases

Smart Enum

Ardalis.SmartEnum is probably my favourite library. It expands the enum concept by adding object-oriented capabilities.

using Ardalis.SmartEnum;

public class OrderStatus : SmartEnum<OrderStatus>
{
    // Define static instances (similar to enum values)
    public static readonly OrderStatus Pending = new OrderStatus("Pending", 1, true, "Your order is pending.");
    public static readonly OrderStatus Shipped = new OrderStatus("Shipped", 2, false, "Your order has been shipped.");
    public static readonly OrderStatus Delivered = new OrderStatus("Delivered", 3, false, "Your order has been delivered.");

    // Extra properties
    public bool CanCancel { get; }
    public string NotificationMessage { get; }

    // Constructor is private to prevent new instances outside predefined ones
    private OrderStatus(string name, int value, bool canCancel, string message)
        : base(name, value)
    {
        CanCancel = canCancel;
        NotificationMessage = message;
    }
}

The only thing I almost always add is forcing Guid usage for the Id. By giving a unique Guid to every SmartEnum instance, I ensure that the proper Id is used in the data storage. For example, I often work with countries and languages and frequently swap them by mistake, like Country.France.Id with Language.French.Id. With an integer Id, the data would look valid when it shouldn't be. If we take care of the uniqueness of the Guids, we can quickly check whether a Guid exists in the SmartEnum. Of course, we lose the ability to insert the data manually (you will not remember a Guid value the way you would an integer).
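
As a minimal sketch of that check, assuming a Guid-keyed Country SmartEnum like the one generated later in this article (the Guids here are illustrative), Ardalis.SmartEnum's TryFromValue validates an incoming Id before it reaches storage:

using System;
using Ardalis.SmartEnum;

public sealed class Country : SmartEnum<Country, Guid>
{
    // Illustrative Guids; real ones must be generated once and never change.
    public static readonly Country France = new("France", new Guid("0a6a6a31-7b30-4f70-9d3e-2b8f0f6f0a01"));
    public static readonly Country Serbia = new("Serbia", new Guid("0a6a6a31-7b30-4f70-9d3e-2b8f0f6f0a02"));

    private Country(string name, Guid id) : base(name, id) { }
}

With this in place, a call like Country.TryFromValue(incomingId, out var country) quickly rejects an Id that belongs to, say, a Language instead.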

The only advantage I find in classic C# enums is that they are compile-time constants. In some cases, such as attribute arguments, you have to use enums.
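
As a minimal illustration of that attribute scenario (AudienceAttribute and Region are hypothetical names), enum values are compile-time constants and can appear in attribute arguments, while SmartEnum instances cannot:

using System;

public enum Region { Europe, Asia }

[AttributeUsage(AttributeTargets.Class)]
public sealed class AudienceAttribute : Attribute
{
    public Region Region { get; }
    public AudienceAttribute(Region region) => Region = region;
}

// Valid: Region.Europe is a compile-time constant.
[Audience(Region.Europe)]
public sealed class EuOnlyFeature { }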

When to use Smart Enums for data?

Use Smart Enums when your data has possibly complex relationships with other data: one item of a Smart Enum can reference an item of another Smart Enum list (Countries have a list of Languages). Also use them when you need true enum capabilities, like referencing an item by name - Country.Serbia.

Static Class

Another good approach is to use static classes. With a static class holding static readonly lists of hardcoded values, the memory footprint is known at compile time. In .NET, the garbage collector never collects it, since it is rooted for the lifetime of the application. There are no thread-safety problems because it is immutable. Performance is optimal because accessing static fields is very fast, and there are no memory leak concerns; it is simply part of the application's static memory.
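
A minimal sketch of the pattern (the Unit record and its values are illustrative):

using System.Collections.Generic;
using System.Linq;

public sealed record Unit(string Name, string Symbol);

public static class Units
{
    // Hardcoded, immutable, allocated once, and rooted for the application's lifetime.
    public static readonly IReadOnlyList<Unit> All = new List<Unit>
    {
        new("Meter", "m"),
        new("Kilogram", "kg"),
        new("Second", "s"),
    };

    // In-memory LINQ allows searching by any combination of properties.
    public static Unit? BySymbol(string symbol) =>
        All.FirstOrDefault(u => u.Symbol == symbol);
}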

When to use Static Classes for data?

When you have a lot of items, and you need all of them in the memory all the time! Even tens of thousands of items are not a problem with modern hardware.

Resx (+ Static Class)

.resx files (Resource Files) in .NET are XML-based files primarily used to store localization strings. However, it is also possible, although rarely done, to use them to store other strings, images, and other objects.

The main difference between this solution and the others is that it does not load data into memory before first access. You can dispose of it manually by dropping the reference to the ResourceManager and forcing garbage collection. Reloading is automatic; just re-access the resource.

At compile time, .resx files are converted into binary .resources files. When a row is accessed, ResourceManager loads it into a dictionary-like structure, enabling O(1) lookups for subsequent requests.

However, .resx is a key-value structure. It's not optimized for searching by more than one parameter.

Combining a static class (or SmartEnum) with Resx is also possible. Put the main searchable, frequently accessed part in static classes, and the finer details about each 'row' in Resx files, loading them only when necessary. That way, you can search by more than one value through the static class while keeping the bulk of the data out of memory.
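
A minimal sketch of the combination, assuming a CountryData.resx compiled into the assembly (the resource base name and key scheme are illustrative):

using System.Reflection;
using System.Resources;

public static class CountryDetails
{
    // Assumes a CountryData.resx compiled into this assembly under "MyApp.CountryData".
    private static readonly ResourceManager Resources =
        new ResourceManager("MyApp.CountryData", Assembly.GetExecutingAssembly());

    // Nothing is loaded until the first call; ResourceManager caches afterwards.
    public static string? GetDescription(string countryCode) =>
        Resources.GetString($"{countryCode}_Description");
}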

When to use Resx (+ Static Class)

Whenever you have a lot of data and want the ability to load it on demand.

Sure, you can create your own solution for loading data from text files like JSON, CSV, or XML, but with Resx files, you get Microsoft's battle-proven solution.

Embedded Databases

The most heavy-duty form of Data as Code is a 'true' database deployed as a library, with a binary storage format optimized for fast reads/writes and efficient storage with minimal overhead. However, its biggest strengths are also its greatest weaknesses, because you need specialized tools to update the data.

There are many C#-compatible embedded databases. They come in many types, such as key-value stores, document databases, relational embedded databases, object databases, etc. The most popular are SQLite and LiteDB.

Generally, I don't use this approach. But I've been thinking about one problem for the last few months, and I will probably try LiteDB for some of its data.
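
For illustration, a minimal LiteDB sketch, assuming a countries.db file shipped alongside the library (the file name and the CountryRow shape are illustrative):

using System;
using LiteDB;

public class CountryRow
{
    public Guid Id { get; set; }
    public string Name { get; set; } = "";
    public string IsoCode { get; set; } = "";
}

public static class CountryStore
{
    public static CountryRow? Find(string isoCode)
    {
        // Opened read-only so the application never mutates the static data.
        using var db = new LiteDatabase("Filename=countries.db;ReadOnly=true");
        var countries = db.GetCollection<CountryRow>("countries");
        return countries.FindOne(c => c.IsoCode == isoCode);
    }
}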

Data Generation for Data in Code

Creating and updating Data as Code is especially interesting. With traditional databases, we have well-known procedures and scripts. Here, things are a little different.

Manual

For simple use cases, we type everything manually. It's quick and easy, and we have source control to track the changes.

LLM

LLMs are great tools for inserting data into code. The only problem is hallucinations. We have to be very careful and check all the data manually. Once, I generated some static data, and the LLM 'dropped' a character from a Guid. It was not a fun day.

Country	Code
Austria	AT
Belgium	BE
...


Create ardalis.smartenum list
use guid as Id

LLM Input

using Ardalis.SmartEnum;

public sealed class Country : SmartEnum<Country, Guid>
{
    public static readonly Country Austria = new("Austria", new Guid("f53aee0c-d8e2-4b07-832d-f1f25b0ecacf"));
    public static readonly Country Belgium = new("Belgium", new Guid("7b4c62b4-5ed8-4578-9aee-3aa9454c84b6"));
...

    private Country(string name, Guid id) : base(name, id) { }
}

LLM Output with ChatGPT 4.5 (research preview)

new Ardalis.SmartEnum.CountryCode(
    new Ardalis.SmartEnum.Entry(
        name: "Austria",
        id: "9f3e4d0b-75a8-1126-a84b-bdcdff1b8607", // Replace with actual GUID
        displayText: "AT"
    ),
    new Ardalis.SmartEnum.Entry(
        name: "Belgium",
        id: "9f3e4d0b-75a8-1126-a84b-bdcdff1b8608", // Replace with actual GUID
        displayText: "BE"
    ),

LLM Output with Deepseek R1 Distill 14B - Obvious mistake

Check the output! Even high-end LLMs will sometimes do only part of the job because "they think" you need an explanation of how static data works. I would also add a Guid column to the Excel file and force the LLM to use that data. "Weaker" LLMs that can run locally, like Deepseek R1 Distill with just 14B parameters, did not write good code. Be even more careful checking the output against your original static data!

Scripts and Apps

We can create custom scripts or apps to generate the data. Sometimes I use .csx files (C# scripts) and C# console applications. The advantage of this approach is that you have the 'complete power' of a full programming language; you can make many modifications, such as cleaning up the data. The disadvantage is a somewhat more complex pipeline.
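
As a minimal sketch, assuming a countries.csv with Name,Code rows (the file names and output shape are illustrative), a small top-level console program can emit a SmartEnum class like the one shown earlier:

// Program.cs - reads countries.csv and writes Country.g.cs
using System;
using System.IO;
using System.Linq;
using System.Text;

var sb = new StringBuilder();
sb.AppendLine("public sealed class Country : SmartEnum<Country, Guid>");
sb.AppendLine("{");
foreach (var line in File.ReadAllLines("countries.csv").Skip(1)) // skip the header row
{
    var name = line.Split(',')[0].Trim();
    // Guid.NewGuid() differs on every run; commit the generated file
    // so each country's Guid is assigned exactly once.
    sb.AppendLine($"    public static readonly Country {name} = new(\"{name}\", new Guid(\"{Guid.NewGuid()}\"));");
}
sb.AppendLine();
sb.AppendLine("    private Country(string name, Guid id) : base(name, id) { }");
sb.AppendLine("}");
File.WriteAllText("Country.g.cs", sb.ToString());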

Scripts and apps have excellent support for CSV and Excel. You can even use Microsoft Access if necessary. However, remember that Access is Windows-only, and for Excel, you need additional libraries, some of them commercial.

If you use the embedded database route, this is your only option (or at least the most elegant).

T4 Text Templates

Text Template Transformation Toolkit (T4) is a free, open-source, template-based text generator. It is supported in Rider and Visual Studio, and you can specify in the project file that .tt files should be (re)generated during each build.

T4 templates let us generate C# code dynamically. Here's a simple example that generates an enum from a list of colors:

<#@ template debug="false" hostspecific="true" language="C#" #>
<#@ output extension=".cs" #>
<#@ assembly name="System.Core" #>
<#@ import namespace="System" #>
<#@ import namespace="System.Collections.Generic" #>

namespace GeneratedEnums
{
    public enum ColorEnum
    {
<#
    // Define color names and their hex values
    var colors = new Dictionary<string, string>
    {
        { "Red", "#FF0000" },
        { "Green", "#00FF00" },
        { "Blue", "#0000FF" },
        { "Yellow", "#FFFF00" },
        { "Cyan", "#00FFFF" },
        { "Magenta", "#FF00FF" },
        { "Black", "#000000" },
        { "White", "#FFFFFF" },
        { "Gray", "#808080" }
    };

    foreach (var kvp in colors)
    {
#>
        <#= kvp.Key #> = 0x<#= kvp.Value.Replace("#", "") #>,
<#
    }
#>
    }
}

The example illustrates how an enum can be generated from a .tt file.

You always have the option to read rows from CSV or Excel and insert them into the code as objects (or enum values).

The disadvantage of T4 is its speed and somewhat clunky tooling. Rider, for some reason, handles it much better than Visual Studio. Large Excel files will freeze even good hardware. Also, generating data on every build will prolong build times significantly. I recommend moving such data into separate NuGet packages, even if they are used by only one application.

Also, remember that in practice, generating the same data code with a basic C# console application will be faster than with T4.

Why does Microsoft call T4 'T4 Text Templates' and not T6?

Updating Data on Dev Machine or Pipeline and Testing Data

Should we update data on the local dev machine and push the 'compiled' data directly into source control, or should we have a pipeline that 'compiles' the data, pushes the changes into source control, and then builds the NuGet package?

It depends.

If only developers update the data, compiling it locally and pushing it for further compilation is fine. If the plan is to open data updating to non-technical personnel, you must create a mechanism for updating the source files, generating the code-as-data, and finally compiling.

Data Warning!

For simple use cases, like a list of colors, the chances of mistakes are low. However, even slightly more complex data significantly increases the chance of error. Think your Excel data is error-free? Think again. Even minor formatting mistakes can break your data import process.

Write thorough unit tests. Maybe even load the data from the Excel file separately in tests and assert against some of it. At the very least, check the number of items.
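
A minimal sketch of such tests with xUnit, assuming the Guid-keyed Country SmartEnum from earlier (the expected count is illustrative):

using System.Linq;
using Xunit;

public class CountryDataTests
{
    [Fact]
    public void Country_list_has_expected_count()
    {
        // Update the expected value whenever the source data changes.
        Assert.Equal(27, Country.List.Count);
    }

    [Fact]
    public void Country_guids_are_unique()
    {
        var ids = Country.List.Select(c => c.Value).ToList();
        Assert.Equal(ids.Count, ids.Distinct().Count());
    }
}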

This problem is not unique to Data as Code; you have the same problem with SQL scripts. But anyone who knows how to write and run SQL scripts also knows how to clean up data.

Concerns

Memory

The first concern is always memory usage, but in most cases, it is negligible. We have made so many bad optimization decisions in software development; a little bit of data in memory will not break your app.

Recompilation

In most cases, Data as Code requires recompilation. With modern continuous integration tools, if recompilation and deployment are a problem, you have much bigger issues than data updates.

Versioning

With Data as Code in libraries, different services holding different data versions can cause problems. However, modern deployment tools can automatically reference the newest libraries.

Conclusions

| Approach | Type-Safe | Advanced Searching | Requires Recompilation | Usable in Attributes | Best For |
| --- | --- | --- | --- | --- | --- |
| Data as a Service | ❌ No | ✅ Yes | ❌ No | ❌ No | Sharing access across multiple applications, but introduces network latency and cache complexity. |
| Enums | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes | Small sets of simple data and usage with attributes. |
| Smart Enums | ✅ Yes | ✅ Yes (in-memory LINQ) | ✅ Yes | ❌ No | Structured data with relationships, metadata, and enum-like behavior, allowing rich metadata and filtering. |
| Static Classes | ✅ Yes | ✅ Yes (in-memory LINQ) | ✅ Yes | ❌ No | Frequently accessed large datasets that remain in memory for optimal performance, supporting multi-property searches. |
| Resx (+ Static Class) | ✅ Yes | ⚠️ Limited (key lookup only) / ✅ Yes (with static class) | ❌ No / ✅ Yes (with static class) | ❌ No | Large but infrequently accessed data; limited search capabilities on its own. |
| Embedded Databases (SQLite, LiteDB) | ❌ No | ✅ Yes (SQL queries) | ⚠️ Requires running an update script | ❌ No | Heavy-duty, structured data storage requiring efficient reads and writes. |

Data as Code is not a silver bullet. It's another valuable tool. For many applications, evaluating data usage and update patterns lets us simplify working with data significantly.

You always have the option to combine multiple approaches. For example, on one project, I wanted type safety for UI localization. The translations were in resx files; when working with translations in code, I used Smart Enums, and for 'tagging' attributes for automatic translations, I used enums.