Data is living: it is born, it lives, it retires and dies. However, in the era of Big Data, the management of the data life cycle is undergoing profound upheaval. The growth in volume, the explosion of unstructured data, the variety of classifications and storage media, the speed of collection, handling and analysis: all characteristics that change the classical rules of traditional information technology. In a Big Data environment, it has become necessary to rethink the data life cycle.
What is the management of the data life cycle?
Classically, the term data life cycle management (DLM) refers to managing the data flows of an information system throughout the data's life: from creation, through usage (extraction, fusion, analysis, migration, sharing…), to archiving and deletion.
Once data enters a database, it can be used in different ways, or stored without use until it becomes obsolete. In both cases, the data may be subject to operations and validations at any time before it reaches the end of its useful life and is archived or purged.
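The stages above can be sketched as a simple state machine. The state names and allowed transitions below are illustrative, not part of any DLM standard:

```python
from enum import Enum, auto

class DataState(Enum):
    CREATED = auto()
    ACTIVE = auto()      # in use: extraction, fusion, analysis, sharing...
    ARCHIVED = auto()
    PURGED = auto()

# Allowed life-cycle transitions (illustrative choices).
TRANSITIONS = {
    DataState.CREATED: {DataState.ACTIVE, DataState.ARCHIVED},
    DataState.ACTIVE: {DataState.ACTIVE, DataState.ARCHIVED, DataState.PURGED},
    DataState.ARCHIVED: {DataState.PURGED},
    DataState.PURGED: set(),
}

def transition(current: DataState, target: DataState) -> DataState:
    """Validate a life-cycle transition before applying it."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.name} -> {target.name}")
    return target
```

Encoding the transitions explicitly is one way the "operations and validations" mentioned above can be enforced: any attempt to, say, reactivate purged data is rejected.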
DLM is also distinguished from ILM (Information Lifecycle Management). The difference is easy to grasp: data is the atomic element, while information refers to what is built from a collection of data. ILM operates at the business level, according to how the information is used (the user's point of view).
The events in the life of data vary; they depend on the type of data and the needs of the company. Each intervention during this cycle can represent a risk for the company (loss of data, inconsistency of data, non-compliance with regulations, etc.). The various DLM and ILM tools help reduce this "data" risk. Thanks to them, the company can manage the different steps of the data's life in order to reduce costs, control the growth of data volume, optimise the handling and use of data, and comply with data governance laws such as the GDPR (RGPD).
Defining and applying storage strategies
Managing the data life cycle often comes down to choosing the storage medium according to the value of the information. This means identifying and characterising the data, then creating rules that describe how its value evolves in order to select suitable media.
For example, if data must be quickly accessible, it will be stored on a high-performance array. If its value is less critical, it will be kept on a slower array. Rules can also be established to evaluate backup frequency and restoration speed in order to determine the most suitable medium. Throughout its life cycle, data can also be moved or duplicated several times across different media. Data that has lost its value is most often destroyed.
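A tiering rule of this kind can be sketched in a few lines. The thresholds, the value scale and the tier names below are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class DataAsset:
    name: str
    days_since_last_access: int
    business_value: int  # 0 (obsolete) .. 10 (critical) -- illustrative scale

def choose_tier(asset: DataAsset) -> str:
    """Map an asset to a storage tier; thresholds are illustrative."""
    if asset.business_value == 0:
        return "purge"                      # lost its value: destroy
    if asset.days_since_last_access <= 30 and asset.business_value >= 7:
        return "high-performance array"     # must stay quickly accessible
    if asset.days_since_last_access <= 365:
        return "slower array"               # less critical value
    return "archive"

print(choose_tier(DataAsset("invoice-2023", 5, 9)))  # high-performance array
```

In a real DLM tool, such rules would be evaluated periodically so that data migrates between tiers as its value declines.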
Furthermore, the choice of a storage strategy also takes into account the need to access collections of related data. For example, in a dispute over an order, it will be necessary to recover the estimate, the order form, the invoice and the emails corresponding to the transaction. Content management tools are used, in particular, to logically link the necessary data.
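A minimal sketch of such logical linking, assuming made-up transaction IDs and hypothetical document locations:

```python
from collections import defaultdict

# Every document carries the transaction it belongs to, so a dispute over an
# order can pull back the estimate, order form, invoice and emails together.
documents_by_transaction = defaultdict(list)

def register(transaction_id: str, doc_type: str, location: str) -> None:
    """Attach a document to its transaction (locations are hypothetical)."""
    documents_by_transaction[transaction_id].append((doc_type, location))

register("order-4812", "estimate", "dms://estimates/4812.pdf")
register("order-4812", "order form", "dms://orders/4812.pdf")
register("order-4812", "invoice", "erp://invoices/inv-7731.pdf")
register("order-4812", "email", "mail://threads/abc123")

# In a dispute, everything tied to the transaction comes back in one lookup:
print(documents_by_transaction["order-4812"])
```

Real content management tools add metadata, access control and retention policies on top of this kind of index, but the principle is the same: a shared key links data scattered across systems.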
Concepts that evolve with the emergence of Big Data
Big Data refers to collections of data so large that they exceed the capacity of humans and traditional IT tools to analyse them. The volume, variety and velocity (the 3 Vs that define Big Data) thus force a reconsideration of the classical data life cycle.
With Big Data, new orders of magnitude appear in the capture, search, sharing, analysis and visualisation of data. Classical relational databases rarely allow Big Data to be managed. New models of representation, carried by Business Analytics & Optimization (BAO), then come into play to manage massively parallel databases. Cloud computing, hybrid supercomputers and distributed file systems (DFS) make it possible, in particular, to rethink the architecture of Big Data storage.
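As a rough illustration of how a massively parallel store spreads records across machines, here is a hash-partitioning sketch; the node names are invented and real systems use more elaborate schemes such as consistent hashing:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # illustrative storage nodes

def node_for(key: str) -> str:
    """Deterministically assign a record key to one node by hashing it."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# The same key always lands on the same node, so reads know where to look:
print(node_for("customer:42"))
```

Distributing data this way is what lets capture and analysis scale horizontally: each node handles only its share of the keys.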
Likewise, velocity, which refers to the frequency at which data is generated, captured, shared and updated, impacts the traditional management of the data life cycle. Growing flows of data must now be analysed in near real time (data stream mining) to meet the needs of time-sensitive processes.
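Near-real-time analysis over a growing flow can be sketched with a sliding time window. This is an illustrative toy, not a production stream processor:

```python
from collections import deque

class SlidingWindow:
    """Keep only events from the last `window_seconds` and report a rolling
    average -- a minimal stand-in for near-real-time stream analysis."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events = deque()  # (timestamp, value), oldest first

    def add(self, value: float, now: float) -> None:
        self.events.append((now, value))
        self._evict(now)

    def _evict(self, now: float) -> None:
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def average(self, now: float) -> float:
        self._evict(now)
        if not self.events:
            return 0.0
        return sum(v for _, v in self.events) / len(self.events)
```

Old events fall out of the window automatically, which is also a small-scale picture of the life-cycle point above: in fast-moving flows, data loses its analytical value within seconds or minutes.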
Big Data also deals with extremely varied data. It is no longer a question of classical relational data: this data is raw, semi-structured or even unstructured. It is complex data stemming, for example, from the web (Web Mining), in text, video and image formats. Its analysis is all the more complicated as it combines data of different natures. This variety modifies the management of the data life cycle. To illustrate, collecting Big Data raises the question of retrieving and handling unstructured data, such as the content of social networks.
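As a small illustration of extracting structure from raw social-network text, here is a sketch using regular expressions (the post is made up):

```python
import re

# Raw, unstructured text as it might arrive from a social network feed.
post = "Loving the new release! #bigdata #dlm thanks @datateam"

# Pull out the few structured elements hiding in the free text.
hashtags = re.findall(r"#(\w+)", post)
mentions = re.findall(r"@(\w+)", post)

print(hashtags, mentions)  # ['bigdata', 'dlm'] ['datateam']
```

Real pipelines go much further (language detection, entity extraction, image and video analysis), but the pattern is the same: impose just enough structure on raw content for it to enter the managed life cycle.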
In the Big Data era, managing a company's data now goes beyond managing databases and breaking down silos in a data warehouse. The IT department's mission remains the same: to convert these data flows into clear, usable information. However, to manage the life cycle of massive data, the intended use of that data must be determined in advance, at the risk of seeing systems pile up and complicate the company's information system to the extreme.