The NHS defines “hoarding” as a disorder where someone acquires an excessive number of items and stores them in a chaotic manner, usually resulting in unmanageable amounts of clutter. The items can be of little or no monetary value.

Data hoarding works in much the same way, but with data.

The subreddit https://www.reddit.com/r/DataHoarder/ has 724k members. Most people call themselves “data hoarders” if they keep too many photos, have terabytes of emails, or hold on to old CDs, video tapes, or hard drives.

It turns out organisations also suffer from symptoms of “data hoarding”.

Accumulating low-value or no-value data at an enterprise scale creates real problems for the business. The cost of storing the data is the most obvious impact. Storage costs are falling; nevertheless, this remains one of the biggest reasons to deal with hoarding.

  • Reduced performance of the 'bloated' systems is another result of hoarding. No matter how performant your databases, warehouses, or datalakes are, uncontrolled dumping of data makes them sluggish.

  • Ever-growing volumes of data inevitably bring complexity which drives higher support and operational costs.

  • Compliance with laws like GDPR or HIPAA becomes harder the more data you store.

  • Security suffers too: larger data volumes widen the exposure and potentially raise the cost of a breach.

In organisations, hoarding is not limited to individuals; it is often done by teams or even entire departments.

Logs, metrics, audits, and backups are typical subjects of hoarding. Any team that produces data can become a hoarder:

  • Developers with their debug output

  • DevOps with endless streams of infra logs and metrics

  • Security retaining ALL audit records and storing them indefinitely

  • Marketing with customer data views, transactions, and user actions

  • Data engineers with endless DB backups, temp tables, and leftover copies of raw data

  • Analysts with views, reports, and various datasets that nobody ever looks at

Most of the above are often necessary, justifiable, and very useful instruments. However, when applied unconditionally or 'just in case,' they create more problems than solutions. Removing data is very hard, so it's much easier to keep it under control from the beginning.

Some of the obvious reasons that encourage hoarding:

  • fear of making mistakes

  • lack of data governance in place (retention)

  • no time or established practice for basic housekeeping

  • no incentives to delete

  • no data ownership

  • no accountability

What do we do to get rid of the clutter?

It’s not an easy task. Deleting data is super hard. Nobody likes the weight of responsibility for deleting data. The first step would be just to stop.

You don’t have to delete or de-duplicate anything; just stop accumulating data that is there only because that is how it has always been. Ask questions.

Governance. There should be a written document defining retention times for all types of data in the organisation. Everyone should be able to check which data we must keep and for how long. Everything else is temporary and kept only whilst it’s being used.
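To make the retention document easy to check, it can also live next to the systems as data. The following is a minimal sketch of that idea, not a recommended schedule: the data types, the durations, and the names `RETENTION` and `must_keep` are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical retention schedule: data type -> how long it must be kept.
# The categories and durations are illustrative only, not legal guidance.
RETENTION = {
    "audit_log": timedelta(days=365 * 7),
    "application_log": timedelta(days=30),
    "db_backup": timedelta(days=90),
}


def must_keep(data_type: str, created: date, today: date) -> bool:
    """Return True while the data is still inside its retention period.

    Data types that are not in the schedule carry no obligation to keep:
    they are temporary and only live whilst they are being used.
    """
    period = RETENTION.get(data_type)
    if period is None:
        return False
    return today - created <= period
```

With a table like this, "which data must we keep and for how long" becomes a lookup rather than a debate, and anything that is not listed is temporary by default.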

Data ownership. Assign cost centres to all systems and to the data they hold and process. Personal responsibility is a strong motivator. Nobody wants to see their name in the "biggest spenders" report in finance.
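A "biggest spenders" report does not need to be sophisticated to be effective. Here is a rough sketch, assuming each dataset is described by a small record with a `cost_centre` and a `size_gb` field and that storage is billed at a flat price per GB per month; the field names, the function name, and the flat pricing model are assumptions for illustration.

```python
from collections import defaultdict


def biggest_spenders(datasets, price_per_gb_month):
    """Total the monthly storage cost per cost centre, largest first.

    Datasets without a cost centre are grouped under "UNOWNED" so that
    ownerless data shows up in the report instead of disappearing.
    """
    totals = defaultdict(float)
    for dataset in datasets:
        owner = dataset.get("cost_centre") or "UNOWNED"
        totals[owner] += dataset["size_gb"] * price_per_gb_month
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Example inventory (made-up numbers).
inventory = [
    {"name": "clickstream_raw", "cost_centre": "marketing", "size_gb": 300},
    {"name": "infra_logs", "cost_centre": "platform", "size_gb": 200},
    {"name": "old_exports", "size_gb": 50},
]
report = biggest_spenders(inventory, price_per_gb_month=0.5)
```

Grouping ownerless data under its own heading is a deliberate choice here: data nobody pays for is usually the first candidate for a conversation about deletion.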

Lifecycle policies. It’s an amazing approach which works very well, and it is natively supported by many storage technologies. “I will delete it when I’m done” never works. If you know something is temporary, assign a lifespan or create an automated recycling task. Destroy and re-create.
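Where the storage technology has no native lifecycle support, an automated recycling task can be very small. The sketch below assumes a scratch directory on a local filesystem and uses file modification time as the age; the function name `purge_expired` and its parameters are hypothetical, and it defaults to a dry run so nothing is removed until you opt in.

```python
import time
from pathlib import Path


def purge_expired(directory, max_age_days, now=None, dry_run=True):
    """Find files in `directory` older than `max_age_days` and delete them.

    Age is measured from the file's last modification time. With
    dry_run=True (the default) the files are only reported, not removed.
    Returns the sorted list of expired paths.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    expired = sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )
    if not dry_run:
        for path in expired:
            path.unlink()
    return expired
```

Run on a schedule, a task like this replaces the promise of “I will delete it when I’m done” with a lifespan that is enforced whether or not anyone remembers.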

Data observability. Adopting tools like data catalogues and visualising your data's journey through the organisation helps significantly to monitor the evolution of your data and prevent unnecessary duplication and bloating.

Automation. Think about what you want to happen if you are not around.

Start deleting things by default. Delete code you don’t use, delete logs you don’t read, delete data you have processed. Ask the question: why should I not delete it?

What if I delete something important?

There's always a risk. That's why it's crucial to get things right from the beginning. But if you find yourself in a situation where you have to proactively delete data from very high-impact and high-value data stores, do it with great care!

Here are some useful techniques, especially for live systems:

  • Instead of deleting data, leave it behind, for example by not carrying it over when you migrate to a new system.

  • Use a multi-step process: first mark, then restrict access, rename, and finally delete.

  • Implement peer review.
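The multi-step process and the peer review can be combined into one small workflow. What follows is only a sketch of the idea: the stage names mirror the steps listed above, while the `Dataset` record, the `approve`, `advance`, and `restore` functions, and the rule that a requester cannot approve their own request are assumptions for illustration. It tracks state only; the real restricting, renaming, and deleting would be done by whatever system holds the data.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    ACTIVE = 0
    MARKED = 1      # flagged for deletion, still fully usable
    RESTRICTED = 2  # access removed, to see whether anyone complains
    RENAMED = 3     # renamed, so hidden dependencies break loudly
    DELETED = 4


@dataclass
class Dataset:
    name: str
    requested_by: str
    stage: Stage = Stage.ACTIVE
    approvals: set = field(default_factory=set)


def approve(dataset, reviewer):
    """Record a peer approval; the requester cannot review their own request."""
    if reviewer == dataset.requested_by:
        raise ValueError("requester cannot approve their own deletion")
    dataset.approvals.add(reviewer)


def advance(dataset):
    """Move one step closer to deletion; the final step needs an approval."""
    if dataset.stage is Stage.DELETED:
        raise ValueError("dataset is already deleted")
    next_stage = Stage(dataset.stage.value + 1)
    if next_stage is Stage.DELETED and not dataset.approvals:
        raise PermissionError("peer review required before deletion")
    dataset.stage = next_stage
    return dataset.stage


def restore(dataset):
    """Undo the process at any point before the final delete."""
    if dataset.stage is Stage.DELETED:
        raise ValueError("deleted data cannot be restored")
    dataset.stage = Stage.ACTIVE
    return dataset.stage
```

Every stage before the last one is reversible, which is the point of the exercise: each step gives a forgotten consumer of the data a chance to object while restoring it is still cheap.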

Deleting data is tough and feels like a big responsibility. The best way to avoid hoarding is to not gather unnecessary data in the first place.

Andrey Kozichev

© MetaOps 2024
