Andrey Kozichev
Data Hoarding
The NHS defines “hoarding” as a disorder where someone acquires an excessive number of items and stores them in a chaotic manner, usually resulting in unmanageable amounts of clutter. The items can be of little or no monetary value.
Data hoarding works in much the same way, but with data.
The subreddit https://www.reddit.com/r/DataHoarder/ has 724k members. Most people call themselves “data hoarders” because they keep too many photos, have terabytes of emails, or hold on to old CDs, video tapes, or hard drives.
It turns out organisations also suffer from symptoms of “data hoarding”.
Accumulating low- or no-value data at an enterprise scale creates real problems for the business. The cost of storing the data has the most obvious impact. Storage costs are falling; nevertheless, cost remains one of the biggest reasons to deal with hoarding.
Reduced performance of 'bloated' systems is another result of hoarding. No matter how performant your databases, warehouses, or data lakes are, uncontrolled dumping of data makes them sluggish.
Ever-growing volumes of data inevitably bring complexity, which drives higher support and operational costs.
Compliance with laws like GDPR or HIPAA becomes harder the more data you store.
Security is also affected: larger data volumes widen the attack surface and raise the potential cost of a breach.
In organisations, hoarding is not limited to individuals; it is often done by teams or even entire departments.
Logs, metrics, audits, and backups are typical subjects of hoarding. Any team that produces data can become a hoarder:
Developers with their debugging output
DevOps with endless streams of infra logs and metrics
Security retaining ALL audit records and storing them indefinitely
Marketing with customer data views, transactions, and user actions
Data engineers with endless DB backups, temp tables, and leftover copies of raw data
Analysts with views, reports, and various datasets that nobody ever looks at
Most of the above are often necessary, justifiable, and very useful instruments. However, when applied unconditionally or 'just in case,' they create more problems than solutions. Removing data is very hard, so it's much easier to keep it under control from the beginning.
Some of the obvious reasons that encourage hoarding:
fear of making mistakes
lack of data governance in place (retention)
no time or established practice for basic housekeeping
no incentives to delete
no data ownership
no accountability
What do we do to get rid of the clutter?
It’s not an easy task. Deleting data is super hard, and nobody likes the weight of responsibility for deleting it. The first step is simply to stop.
You don’t have to delete anything or de-duplicate; just stop accumulating data that is there only because that is how it has always been. Ask questions.
Governance. There should be a written document defining retention times for all types of data in the organisation. Everyone should be able to check which data must be kept and for how long. Everything else is temporary, kept only whilst it’s being used.
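As a minimal sketch of what that looks like next to the code (the categories and periods below are hypothetical, not taken from any particular policy), the retention schedule can be a small, reviewable table that any job can consult before deciding whether to keep data:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention schedule: data category -> how long we must keep it.
# In a real organisation this would mirror the written retention document.
RETENTION = {
    "financial_records": timedelta(days=365 * 7),
    "application_logs":  timedelta(days=30),
    "temp_exports":      timedelta(days=7),
}

def must_keep(category: str, created_at: datetime) -> bool:
    """Return True if the retention policy still requires us to keep this data."""
    period = RETENTION.get(category)
    if period is None:
        # Unknown category: treat as temporary and flag it for review.
        return False
    return datetime.now(timezone.utc) - created_at < period

# Example: a 40-day-old log entry is already past its 30-day retention window.
print(must_keep("application_logs",
                datetime.now(timezone.utc) - timedelta(days=40)))  # False
```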
Data ownership. Assign cost centres to all systems and the data they hold and process. Personal responsibility is a strong motivator. Nobody wants to see their name in the "biggest spenders" report in finance.
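A sketch of the idea, assuming a hypothetical inventory of datasets tagged with an owner and a cost centre (in practice this would come from your cloud billing export or asset register; the figures here are made up):

```python
from collections import defaultdict

# Hypothetical inventory: each dataset is tagged with an owner, a cost centre
# and its monthly storage cost.
datasets = [
    {"name": "raw_clickstream", "owner": "marketing", "cost_centre": "CC-101", "monthly_cost": 1800.0},
    {"name": "db_backups_2019", "owner": "data-eng",  "cost_centre": "CC-203", "monthly_cost": 950.0},
    {"name": "infra_logs",      "owner": "devops",    "cost_centre": "CC-310", "monthly_cost": 2400.0},
]

# "Biggest spenders" report: total monthly storage cost per owner.
spend = defaultdict(float)
for ds in datasets:
    spend[ds["owner"]] += ds["monthly_cost"]

for owner, total in sorted(spend.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{owner:12s} {total:10.2f} per month")
```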
Lifecycle policies. An amazing approach that works very well and is natively supported by many storage technologies. “I will delete it when I’m done” never works. If you know something is temporary, assign it a lifespan or create an automated recycling task. Destroy and re-create.
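For example, on object storage such as Amazon S3 an expiry rule can be attached to a prefix so that temporary data recycles itself. A minimal sketch with boto3; the bucket name and prefix are placeholders, not a prescription:

```python
import boto3

s3 = boto3.client("s3")

# Expire everything under the temp/ prefix 30 days after creation.
# Bucket name and prefix are placeholders; adjust to your own layout.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-temp-objects",
                "Filter": {"Prefix": "temp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```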
Data observability. Adopting tools like data catalogues and visualising your data’s journey through the organisation helps you monitor how your data evolves and prevents unnecessary duplication and bloat.
Automation. Think about what you want to happen if you are not around.
Start deleting things by default. Delete code you don’t use, delete logs you don’t read, delete data you have processed. Ask the question: why should I not delete it?
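A minimal “delete by default” housekeeping task might look like the sketch below: it removes files older than a given age under a nominated directory. The path and age are hypothetical; in practice you would run something like this from a scheduler such as cron.

```python
import time
from pathlib import Path

MAX_AGE_DAYS = 14                      # hypothetical: anything older is deleted
TARGET_DIR = Path("/var/tmp/exports")  # hypothetical scratch directory

cutoff = time.time() - MAX_AGE_DAYS * 86400

for path in TARGET_DIR.rglob("*"):
    # Only delete regular files that have not been touched since the cutoff.
    if path.is_file() and path.stat().st_mtime < cutoff:
        print(f"deleting {path}")
        path.unlink()
```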
What if I delete something important?
There's always a risk. That's why it's crucial to get things right from the beginning. But if you find yourself in a situation where you have to proactively delete data from very high-impact and high-value data stores, do it with great care!
Here are some useful techniques, especially for live systems:
Instead of deleting data in place, move forward without it: migrate what you still need and leave the rest behind to be retired.
Use a multi-step process: first mark, then restrict access, rename, and finally delete (see the sketch after this list).
Implement peer review.
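To make the multi-step process concrete, here is a minimal sketch of staged deletion for files; the same pattern of mark, restrict, rename, then delete, with a grace period between stages, applies equally to tables or buckets. The grace period and naming conventions are illustrative assumptions, not a prescribed implementation.

```python
import os
import time
from pathlib import Path

GRACE_PERIOD_DAYS = 30  # hypothetical: wait this long between stages

def mark(path: Path) -> None:
    """Stage 1: record the intent to delete, without touching the data."""
    sidecar = path.with_name(path.name + ".DELETE_AFTER")
    sidecar.write_text(str(time.time() + GRACE_PERIOD_DAYS * 86400))

def restrict(path: Path) -> None:
    """Stage 2: remove access so anyone still using the data notices."""
    os.chmod(path, 0o000)

def rename(path: Path) -> Path:
    """Stage 3: move the data out of its usual name; easy to roll back."""
    return path.rename(path.with_name(path.name + ".trash"))

def delete(path: Path) -> None:
    """Stage 4: actually delete, once the earlier stages raised no alarms."""
    path.unlink()
```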