Steve Hayward
Product Marketing Manager, BIOVIA

It is no longer a surprise to hear that one of the biggest challenges in the lab today is the effective management and use of data. With digitization efforts long past their infancy, most companies are grappling with the proliferation of data generated throughout their organization—how to store, manage and leverage it effectively. Different storage solutions have come in and out of favor, driven by specific needs and enabled by the technology available. One of the most recent and buzz-worthy is the data lake, which has risen to prominence and been investigated by numerous scientific companies in the past few years.

Dipping a toe in the data lake
When laboratory instruments first started producing digital files, they were simply stored on the local computer disk. With removable storage and networking abilities, files could be collected and collated together into a basic folder-based repository. While many view this as hopelessly outdated, the reality is that some labs still rely on local file storage for simplicity and cost effectiveness, viewing more modern and automated systems as too complex and costly. Conversely, many companies are willing to invest in more advanced data storage solutions to gain specific capabilities and achieve their business goals.

The rise of laboratory informatics systems such as electronic lab notebooks (ELNs), laboratory information management systems (LIMS) and laboratory execution systems (LES) has furthered the proliferation of digital information generated by the lab. However, these disparate systems have also created multiple data silos, with data locked in incompatible formats and accessible only through proprietary vendor software. So while digitization has been achieved, true digitalization (the effective use of digital data for scientific and business purposes) often remains an elusive goal.

The natural reaction is to try to remove the artificial barriers, placing everything in a single common repository such as a scientific data management system (SDMS) to ease data access. Broadly speaking, an SDMS falls into the category of a data warehouse (or data mart), where the schema for data storage must be defined in advance—the data is processed and structured based on its anticipated use.
A data lake stands in contrast to a data warehouse in a few key ways. First and foremost, a data lake employs schema-on-read; appropriate processing is only applied when data is queried. To enable this, raw data can be stored in structured, semi-structured or unstructured forms, and is tagged with appropriate metadata and a unique identifier. The architecture is flat, rather than the hierarchical strategy of traditional file systems, and object-oriented, making it much easier to scale and manage.
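The schema-on-read contrast can be made concrete with a short sketch. This is purely illustrative—the store, function names and tags below are hypothetical, not the API of any real data lake product—but it shows the key idea: nothing is imposed on the data at write time, and each consumer applies its own interpretation at read time.

```python
import json
import uuid

lake = {}  # flat namespace: unique object ID -> (raw bytes, metadata)

def put_object(raw_bytes, **tags):
    """Store raw data as-is; no schema is imposed at write time."""
    object_id = str(uuid.uuid4())
    lake[object_id] = {"raw": raw_bytes, "meta": tags}
    return object_id

def read_as(object_id, parser):
    """Apply a schema only when the data is queried (schema-on-read)."""
    return parser(lake[object_id]["raw"])

# Ingest an instrument result without deciding how it will be used.
oid = put_object(b'{"sample": "S-104", "pH": 7.2}',
                 instrument="pH-meter", lab="QC", format="json")

# Later, a consumer chooses the interpretation it needs.
record = read_as(oid, lambda raw: json.loads(raw))
print(record["pH"])  # 7.2
```

Note how the store itself is flat: objects are addressed only by ID and metadata, never by a folder path, which is what makes this style of storage easy to scale.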

Swimming past the shallows
If you are at all familiar with the term data lake, you have also heard of Hadoop, the open-source framework that has become synonymous with data lakes. Hadoop popularized distributed storage and large-scale parallel processing. By storing large amounts of unstructured data from disparate sources, such systems can process unique and novel queries, combining and processing data as necessary on request. The schema-on-read and distributed-computing aspects of a data lake enable new ways of combining and interpreting data, often independent of the data’s original purpose.
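The map/reduce pattern behind this style of parallel processing can be shown in miniature. In the toy sketch below, each "partition" stands in for a block of files held on a different node; a real cluster would run the map phase on the nodes holding the data and only shuffle the compact intermediate results. The data and names are invented for illustration.

```python
from collections import defaultdict

# Each partition represents records stored on a separate node.
partitions = [
    [("assay", 3), ("hplc", 1)],   # node 1's records
    [("assay", 2), ("nmr", 4)],    # node 2's records
    [("hplc", 5)],                 # node 3's records
]

def map_phase(partition):
    # Runs locally on each node: emit (key, value) pairs.
    return [(key, value) for key, value in partition]

def reduce_phase(mapped):
    # Aggregate all values that share a key.
    totals = defaultdict(int)
    for key, value in mapped:
        totals[key] += value
    return dict(totals)

# Shuffle: gather intermediate pairs from every node, then aggregate.
intermediate = [pair for p in partitions for pair in map_phase(p)]
print(reduce_phase(intermediate))  # {'assay': 5, 'hplc': 6, 'nmr': 4}
```

Because the map step touches each record independently, it parallelizes across as many nodes as hold the data—this is what lets a query combine raw files that were never designed to be analyzed together.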

In this way, it is common to equate a data lake with “big data,” but they are not the same thing. Big data analytical efforts can be enabled by a data lake, but still require specific goals and processes to be in place, along with strong governance of the projects. Without strong oversight, data may not receive useful metadata tagging, and it may still remain effectively siloed from some company stakeholders.

Don’t let your data drown
Data lake installations can be categorized by the type of analytic efforts they are intended to enable. Self-service business intelligence analytics can utilize an Inflow Data Lake, which mainly aggregates disparate data to bridge data silos. Because of the emphasis on different sources of data, good governance is paramount so that data is properly identified and tagged with metadata for contextualization and can be properly accessed when needed. Additionally, without this oversight during installation, data that is sent for storage effectively “drowns” in the data lake, unable to be easily retrieved or leveraged.
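The "drowning" failure mode is easy to demonstrate: if retrieval works only through metadata queries, an object ingested without tags can never be found again. The store and field names below are hypothetical, but the governance lesson is the one described above.

```python
# A minimal inflow-lake index: objects are discoverable only via metadata.
lake = [
    {"id": "obj-1", "meta": {"project": "P-17", "instrument": "LC-MS"}},
    {"id": "obj-2", "meta": {"project": "P-17", "instrument": "NMR"}},
    {"id": "obj-3", "meta": {}},  # ingested without governance: no tags
]

def find(**criteria):
    """Return IDs of objects whose metadata matches every criterion."""
    return [obj["id"] for obj in lake
            if all(obj["meta"].get(k) == v for k, v in criteria.items())]

print(find(project="P-17"))      # ['obj-1', 'obj-2']
print(find(instrument="LC-MS"))  # ['obj-1']
# obj-3 is stored but cannot be found by any metadata query: it has "drowned".
```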

At the opposite end of the spectrum is the Outflow Data Lake, which is built for operational analytics and designed to speed data re-use to enable real-time analytics and faster decision-making. Governance is still important here, especially at the outset, when the expected types of analytics must be clearly defined; it must also remain agile, responding to changing business needs and accommodating more varied input data.

Hybrid types of data lakes exist between these two ends of the spectrum, mixing the data aggregation and operational analytical aspects to various degrees. A system such as this can be extremely powerful, removing data silos, storing large amounts of data efficiently and enabling insights for both scientists and business managers. A hybrid system will often allow for a lot of flexibility in how data is handled and the analytic queries available, but this also requires a high degree of user skill.

If you’ve sensed a theme here—governance—that’s because it becomes a key consideration in any data lake installation. Although sometimes under-appreciated in the early stages of a project, the importance of good governance grows apparent as issues of data sources, data flow, metadata tagging and expected business outcomes are addressed. Clearly defining these aspects and managing them over the lifetime of the project will prevent a data lake from devolving into a data “swamp.”

Come on in, the water’s fine
Although a data lake exists in large part to store disparate structured and unstructured data in an un-siloed manner, it is still important to consider some standardization of the data heading into the lake itself. Metadata tags can properly identify both a data type and its associated properties, and if raw data is parsed into a standardized format prior to storage, it becomes much easier to combine and interpret down the road with other files that share the same format. Minimizing data format disparity prior to storage will ease data interpretation and future analytics efforts.
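A small sketch shows what this standardization step might look like at ingest. The two instrument outputs and the common record shape below are invented for illustration; the point is that once both feeds are normalized, downstream analytics can treat them uniformly.

```python
import csv
import io
import json

def normalize_csv(raw_text):
    # Hypothetical instrument A exports CSV with Sample/Result columns.
    row = next(csv.DictReader(io.StringIO(raw_text)))
    return {"sample_id": row["Sample"], "value": float(row["Result"])}

def normalize_json(raw_text):
    # Hypothetical instrument B exports JSON with different field names.
    data = json.loads(raw_text)
    return {"sample_id": data["sampleId"], "value": float(data["reading"])}

# Both feeds arrive in the same shape, ready for combined analysis.
records = [
    normalize_csv("Sample,Result\nS-01,4.20"),
    normalize_json('{"sampleId": "S-02", "reading": 3.9}'),
]
print(records)
# [{'sample_id': 'S-01', 'value': 4.2}, {'sample_id': 'S-02', 'value': 3.9}]
```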

While a data lake’s ostensible raison d’être is the efficient storage of data, it is still possible to run into limitations. The main storage is of the original raw data; however, any modifications or processing of that data can require additional copies of the files, rapidly multiplying storage needs. Additionally, the system needs to properly track the relationship between files and record the changes made, both for compliance and to avoid orphaning files from their original data.
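One way to keep derived copies tied to their originals is a lineage record: every object notes its parent and the operation that produced it, so provenance can always be walked back to the raw data. This is a sketch under assumed names, not the mechanism of any particular SDMS or data lake product.

```python
import uuid

lineage = {}  # object_id -> {"parent": parent_id or None, "operation": str}

def register_raw():
    """Record an original raw file entering the lake."""
    oid = str(uuid.uuid4())
    lineage[oid] = {"parent": None, "operation": "ingest"}
    return oid

def register_derived(parent_id, operation):
    """Record a processed copy, linked to the object it came from."""
    oid = str(uuid.uuid4())
    lineage[oid] = {"parent": parent_id, "operation": operation}
    return oid

def provenance(object_id):
    """Walk back to the original raw object, listing each step taken."""
    chain = []
    while object_id is not None:
        entry = lineage[object_id]
        chain.append(entry["operation"])
        object_id = entry["parent"]
    return chain

raw = register_raw()
smoothed = register_derived(raw, "baseline-correction")
peaks = register_derived(smoothed, "peak-detection")
print(provenance(peaks))  # ['peak-detection', 'baseline-correction', 'ingest']
```

A record like this serves both needs the paragraph raises: it is an audit trail for compliance, and it guarantees no processed file is ever orphaned from its source.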

As with many lab informatics solutions, the current trend is toward the cloud. In some cases, the entirety of the data lake project can be cloud-hosted, but a popular option is on-premises data storage with the computing portion hosted as a cloud-based service. This arrangement leverages more cost-effective data storage and the power of purpose-built, parallel-processing systems on demand from the cloud.

Hadoop, with its wide range of modules that can be tricky to configure, has started to fall out of favor in the face of more out-of-the-box cloud-based solutions. Existing cloud vendors are well positioned to provide either the heavy lifting of analytic computing power, or a complete cloud-based data lake solution. The downside is more confusion about what actually constitutes a data lake when half or all of it is in the cloud—perhaps the new metaphor will be of the data lake “evaporating” into the cloud.

Looking toward the horizon
As organizations continue to test and implement data lake solutions, it is clear that removing barriers to data access is a driving force in today’s lab informatics space. Even where data storage silos have been eliminated, companies now find that access to data analytics remains siloed and efforts are not fully leveraged across business units. Unlocking and leveraging the analytics aspect of the business, perhaps in connection with the move toward the cloud, will allow the potential of data lakes to be fully realized.