Nowadays, data compression is mandatory in most scenarios, especially when dealing with the huge volumes of data found in so-called Big Data systems. However, its importance is often overlooked when designing the data flow architecture and infrastructure. In most cases it is simply assumed that some existing solution will fit our needs. There are indeed interesting solutions, either universal or focused on specific kinds of data, but they tend to offer suboptimal compression efficiency, and their computational cost is frequently too high.
Handling a big dataset can be complicated at all processing levels, from storage to transport, leading to complex and costly mechanisms to manage and exploit those datasets properly. When the issue is raised at all, there is usually no real strategy, just an attempt to compress the bulk data with the most traditional compressors. However, with an adequate analysis of the data and a good strategy, the data volume to be stored or transmitted can be reduced significantly and efficiently.
DAPCOM analyzes the system requirements, priorities, scenarios and use cases, identifying the stages of the data flow where compression can boost overall performance and throughput, so that all data consumers benefit from the optimization.
In this process it is important to understand the data being handled. DAPCOM studies its format and statistical properties, so that all possible technologies and strategies can be explored to optimize the compression ratio and speed, offering the best compromise for the customer requirements. In many cases, the performance boost can be much bigger than that offered by an expensive hardware upgrade.
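One simple statistical property worth examining in such an analysis is the entropy of the byte distribution, which bounds how far a generic byte-oriented compressor can shrink the data. A minimal sketch (the sample payloads are purely illustrative):

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte.
    8.0 means incompressible noise; values near 0 mean highly repetitive data."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two equally frequent symbols: exactly 1 bit/byte, very compressible.
print(byte_entropy(b"AAAABBBB" * 1000))  # -> 1.0

# Natural text uses more symbols, so its entropy is higher.
print(byte_entropy(b"the quick brown fox jumps over the lazy dog " * 100))
```

Low entropy signals that a generic compressor will already do well; high entropy on data known to be structured (e.g. floating-point samples) suggests a format-aware transformation is needed before compression.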
Compression ratio is, by the very definition of a data compressor, one of the most important criteria when evaluating a compression technology. But in recent times, with the increasingly huge volumes of data to be handled, compression speed must also be taken into account. We can differentiate between two possible scenarios needing compression.
The first is offline compression, generally for storage purposes and infrequent data access. In this scenario we only care about making the dataset as small as possible to reduce storage needs, so compression ratio is the most important criterion, while CPU consumption and compression time are less relevant. This scenario also covers cases where data is compressed once and read (or transferred) many times, although decompression time then becomes relevant.
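The ratio-versus-speed trade-off can be seen directly by sweeping the compression level of a standard compressor. A minimal sketch using Python's built-in zlib on a made-up repetitive payload:

```python
import time
import zlib

# Hypothetical payload: repetitive CSV-like records standing in for a bulk dataset.
payload = b"sensor_reading,2024-01-01T00:00:00Z,23.5\n" * 50_000

for level in (1, 6, 9):  # fast, default, and maximum-ratio settings
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(compressed)
    print(f"level {level}: ratio {ratio:.1f}x, {elapsed * 1000:.2f} ms")
```

Higher levels never produce larger output here, but the extra CPU time they spend may or may not be worthwhile, which is exactly the compromise that depends on whether the scenario is offline or real-time.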
The second is real-time compression: the system compresses data as it arrives from a data source and then sends it on to another data consumer in real time. The compression stage is thus completely integrated into the data flow system. Here the compression ratio is obviously important to reduce the overall data volume, but compression time becomes very relevant, and a balance between ratio and speed can be the perfect fit for the system. A typical case is a data communications system compressing data streams in chunks. See the section "show me the numbers".
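Compressing a stream in chunks as it flows through the system can be sketched with zlib's incremental API; the chunk size and level below are illustrative choices, not a recommendation:

```python
import zlib

def compress_stream(chunks):
    """Compress an incoming stream chunk by chunk, yielding output as soon as
    it is available -- compression happens inside the data flow, not on the
    complete dataset at the end."""
    comp = zlib.compressobj(level=1)  # speed-oriented level for real-time use
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:  # the compressor may buffer small inputs internally
            yield out
    yield comp.flush()  # emit whatever remains buffered

# Simulated incoming stream of fixed-size chunks.
incoming = (b"x" * 4096 for _ in range(100))
compressed = b"".join(compress_stream(incoming))
assert zlib.decompress(compressed) == b"x" * 4096 * 100
```

Because output is produced incrementally, the downstream consumer starts receiving data before the stream ends, keeping latency bounded by the chunk size rather than by the total dataset size.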