Big Data refers to volumes of data too large to be processed with traditional databases. For moderate amounts of data, we typically use relational databases such as Oracle, MySQL, or SQL Server to store and work with the data. However, when the volume grows very large, traditional databases can no longer handle the load.
Working with Big Data reveals that testing needs to be handled differently. The challenges arise from the very attributes of the data. Known as the three Vs, these are volume, velocity, and variety, often complemented with variability and value. Each of them poses specific challenges individually, and they can create further problems in combination. Taking a quick look at each of them:
- Volume: Organizations collect large volumes of data from many different sources, such as sensors, meter readings, and business transactions.
- Velocity: Data is created at high speed and has to be handled and processed quickly. Instruments such as IoT devices, RFID tags, and smart meters generate data automatically at unprecedented speeds.
- Variety: Data comes in all formats: audio, video, numeric, text, email, satellite images, atmospheric sensor readings, and more.
Why can't traditional relational databases scale to support big data?
- Traditional relational databases such as Oracle, MySQL, and SQL Server cannot be used for big data, since most of the data is in an unstructured format
- Variety of data – Data can be in the form of images, video, text, audio, etc. This could be military records, surveillance videos, biological records, genomic data, research data, and so on. Such data cannot be stored in the row-and-column format of an RDBMS
- The volume of data in big data systems is huge. This data needs to be processed fast, which requires parallel processing. Parallel processing of RDBMS data would be extremely expensive and inefficient
- Traditional databases are not built to store and process data at this scale. Example: satellite imagery of the USA, road maps for the whole world, or all the images on Facebook
- Data creation velocity – Traditional databases cannot handle the velocity at which large volumes of data are created. Example: around 6,000 tweets are created every second, and 510,000 comments are created every minute. Traditional databases cannot store or retrieve data at this rate
The volume of big data has created strong demand for big data testing tools, techniques, and frameworks, because more data means a higher risk of errors, which can degrade the performance of applications and software. When conducting big data testing, a tester's goals are different: verifying that the data is complete, ensuring that data transformations are accurate, ensuring high data quality, and automating the regression testing scope.
Big Data Application Testing Approach:
- Data validation: Data validation, also known as pre-Hadoop testing, ensures that the right data is collected from the right sources. The data is then pushed into the Hadoop system and tallied against the source data to verify that the two match and that the data lands in the right location.
- Business validation: Business logic validation is the validation of MapReduce, which is the heart of Hadoop. During this validation, the tester verifies the business logic on a single node and then across multiple nodes. This is done to ensure that the MapReduce process works correctly, that data segregation and aggregation rules are implemented correctly, and that key-value pairs are generated correctly.
- Output validation: This is the final stage of Big Data testing, where the output data files are generated and moved to the required system or data warehouse. Here the tester checks data integrity, ensures that the data is loaded successfully into the target system, and verifies that there is no data corruption by comparing the HDFS data with the target data.
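The data validation step described above can be sketched as a simple tally between a source extract and the ingested copy. This is a minimal illustration, assuming both sides can be exported to CSV; the file paths and helper names are hypothetical, not part of any Hadoop tool:

```python
import csv
import hashlib

def row_checksum(row):
    """Stable checksum for one record, used to detect altered values."""
    return hashlib.md5("|".join(row).encode()).hexdigest()

def validate_ingestion(source_path, ingested_path):
    """Tally the ingested file against the source extract: same number
    of records, and every source record present (by checksum) in the
    ingested data, regardless of row order."""
    with open(source_path, newline="") as f:
        source_rows = list(csv.reader(f))[1:]  # skip header
    with open(ingested_path, newline="") as f:
        ingested_sums = {row_checksum(r) for r in list(csv.reader(f))[1:]}
    missing = [r for r in source_rows if row_checksum(r) not in ingested_sums]
    return len(source_rows) == len(ingested_sums) and not missing
```

In a real pipeline the "ingested" side would be an export from HDFS or Hive rather than a local file, but the tally logic is the same.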
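To illustrate the business logic being validated, here is a minimal single-process word count in the classic MapReduce shape. It is not Hadoop code; it is the kind of small local reference implementation a tester might use to generate expected key-value pairs and aggregates to compare against a node's actual output:

```python
from collections import defaultdict

def map_phase(records):
    """Map step: emit a (key, value) pair per word -- the key-value
    pair generation the tester must verify."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: group pairs by key and aggregate the values --
    the segregation and aggregation rules under test."""
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

# A single-node reference run gives the tester an expected result
# to compare against the output of the real cluster job.
expected = reduce_phase(map_phase(["big data testing", "big data"]))
```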
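For the output validation stage, one common corruption check is comparing checksums of the HDFS output with the same data exported back from the target system. A minimal sketch, where the export paths are assumptions and a real pipeline would first normalize row ordering and encoding:

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Stream a file in chunks and compute its SHA-256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def output_matches(hdfs_export_path, target_export_path):
    """Flag corruption by comparing digests of the HDFS output file
    and the same data exported back from the target system."""
    return file_digest(hdfs_export_path) == file_digest(target_export_path)
```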
Performance testing a Big Data application requires testers to take a defined approach:
- Setting up of the application cluster that needs to be tested
- Identifying and designing the corresponding workloads
- Preparing individual custom scripts
- Executing the test and analyzing the results
- Re-configuring and re-testing components that did not perform optimally
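The execute-and-analyze steps above can be reduced to a simple harness: run the workload, collect timings, and compare the result against an agreed target. A minimal sketch, where the workload callable and the SLA value are placeholders for the real cluster job and its service-level target:

```python
import statistics
import time

def run_workload(workload, runs=3):
    """Execute a workload several times and collect wall-clock timings."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()  # e.g. submit the job and wait for completion
        timings.append(time.perf_counter() - start)
    return timings

def analyze(timings, sla_seconds):
    """Compare the median run time against the agreed SLA; components
    that miss the target are the ones to re-configure and re-test."""
    median = statistics.median(timings)
    return {"median_s": median, "meets_sla": median <= sla_seconds}
```

Repeating the runs and taking the median guards against one-off noise; the re-test step then repeats this cycle after each configuration change.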
Whether functional testing, performance testing, or any other kind of testing, Big Data is a specialized area and almost a domain of its own. Considerable specialization is needed to understand the intricacies involved, the tools available, and a robust strategy that can work through the challenges specific to one's own implementation. Big data testing is here to stay, so it is worth understanding and appreciating what it involves even if one does not work on it directly, and to that end this post aims to give a high-level view into big data testing.