Building an awareness of the data lake fallacy

Businesses often find new information technology trends and processes difficult to resist, especially when they promise lower operating costs and increased efficiencies.

It's important to be aware, however, that certain new trends and processes can actually fail to drive innovation. One of these new trends, data lakes, needs to be understood by IT leaders within businesses.

Analyst firm Gartner recently released a report detailing the 'data lake fallacy', so called because of the gaps in the concept and the precautions businesses need to take with it. The research found that several vendors are marketing data lakes as a component that can help Big Data initiatives, specifically by allowing businesses to capitalise on opportunities.

"The need for increased agility and accessibility for data analysis is the primary driver for data lakes," said Andrew White, vice president and distinguished analyst at Gartner.

"Nevertheless, while it is certainly true that data lakes can provide value to various parts of the organisation, the proposition of enterprise wide data management has yet to be realised."

What are data lakes?

A data lake is essentially a large storage repository designed to hold a vast quantity of raw data until it's needed by personnel within the organisation. The lake keeps this data in its native format, using a flat architecture. By contrast, a traditional hierarchical data warehouse stores the same information in files and folders.

Once a data element has been added to the lake, it's assigned a unique identifier and tagged with metadata. When someone in the organisation needs information, all they need to do is query the data lake using those tags and identifiers.

The data lake is subsequently able to feed back any relevant data that it finds in storage.
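The store-and-query flow described above can be sketched in a few lines of Python. This is only an illustrative toy held in memory, not how any particular product works: the DataLake class, its add, query and get methods, and the sample tags are all hypothetical names chosen for the example.

```python
import uuid


class DataLake:
    """Toy in-memory lake: flat storage keyed by unique identifier (hypothetical)."""

    def __init__(self):
        # identifier -> (raw data, metadata tags); no folders, no hierarchy
        self._objects = {}

    def add(self, raw, tags):
        """Store raw data unchanged and return its unique identifier."""
        identifier = str(uuid.uuid4())
        self._objects[identifier] = (raw, dict(tags))
        return identifier

    def query(self, **wanted):
        """Return identifiers of every object whose tags match all of `wanted`."""
        return [
            identifier
            for identifier, (_, tags) in self._objects.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]

    def get(self, identifier):
        """Return the raw data exactly as it was stored."""
        return self._objects[identifier][0]


lake = DataLake()
# Two unrelated pieces of raw data, each kept in its native format.
csv_id = lake.add(b"id,total\n1,9.50\n", {"source": "sales", "format": "csv"})
log_id = lake.add(b"GET /index.html 200\n", {"source": "web", "format": "log"})

# Querying by tag scans the whole lake and feeds back the matching identifiers.
matches = lake.query(source="sales")
```

Note that the query only works as well as the tags attached at the time the data was added, which is why the metadata matters as much as the storage itself.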

While 'data lake' can be dismissed as simply a buzzword for large companies, the term is increasingly being used to describe any large data storage medium where the requirements remain undefined until a query is actually sent out.

"In broad terms, data lakes are marketed as enterprise wide data management platforms for analysing disparate sources of data in its native format," said Nick Heudecker, research director at Gartner.

"This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it's available for analysis by everyone in the organisation."

By storing disparate data and essentially ignoring how it's used, it's hoped that independently managed data collections can be replaced with the larger 'lake'.
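Leaving requirements undefined until a query is sent is often called schema-on-read. A minimal Python sketch of the idea, under stated assumptions, might look like the following: the raw_records list, the read_rows and total_for helpers, and the customer and amount fields are all invented for illustration and do not come from the Gartner report.

```python
import csv
import io
import json

# Raw records are kept exactly as they arrived; nothing is transformed on ingest.
# (Hypothetical sample data in two different native formats.)
raw_records = [
    ("csv", "customer,amount\nacme,120.50\nglobex,80.00\n"),
    ("json", '[{"customer": "acme", "amount": 30}]'),
]


def read_rows(fmt, payload):
    """Parse one raw payload into a list of dictionaries at query time."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(payload)))
    if fmt == "json":
        return json.loads(payload)
    raise ValueError(f"unsupported format: {fmt}")


def total_for(customer):
    """Apply the (customer, amount) schema only when the question is asked."""
    total = 0.0
    for fmt, payload in raw_records:
        for row in read_rows(fmt, payload):
            if row["customer"] == customer:
                total += float(row["amount"])
    return total


acme_total = total_for("acme")
```

The ingestion step is trivially cheap here, but every query has to parse and interpret the raw data again, so the cost of transformation is deferred to the person asking the question rather than removed.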

What do businesses need to be aware of?

Gartner warns that data lakes don't solve the data issue for businesses, as they only offer a better way to store information. While technology could help with getting value out of the lake, that responsibility will always fall back on the business.

"Without at least some semblance of information governance, the lake will end up being a collection of disconnected data pools or information silos all in one place," said White.

What's more, performance is likely to remain an issue for businesses.

The various tools and data interfaces commonly used by businesses are often designed to work with optimised, purpose-built data infrastructure, and this is where they're most effective.

Businesses that use these same tools on general-purpose storage infrastructure such as a data lake are unlikely to see the same returns.

Ensuring governance with the correct training

To guarantee a strong level of IT governance within an organisation, it's often essential to undertake various levels of training. Specialised IT governance courses can ensure that governance is given the necessary consideration.