In this digital age, information in any discipline is most useful when it is available electronically to a broad audience. Print has not disappeared, but digital distribution now dominates it. Many industry observers have commented on the importance of digital information over the coming decade, and a wave of new devices and applications has entered this space in the last few years. The advantages of digital books are well understood: round-the-clock access to information, minimal space for storage and archival, easy retrieval through search and related features, and data storage that can be more secure than a physical location, to name a few. Going digital also brings challenges, the primary one being copyright, since digital copies are easy to duplicate and share. Content providers attempt to address this with Digital Rights Management, which controls and enforces authorized access to information. While content providers and publishing firms work through such challenges globally, few question the benefits of *going digital*.
Drawing on more than seven years of experience testing digital content, we at QA InfoTech describe here some of our core QA approaches for ensuring that digital content is ready for global consumption, and what happens behind the scenes during the digitization process.
First, source content comes in many forms: a print copy, a text file, a Word document, or a PDF, to name a few. It may be plain text or include embedded images. In some cases the content to be digitized undergoes a *conversion*; in others, the digital content is created from scratch. If the content is in print form, the first step is scanning; commercial scanners are typically used to digitize volumes of print content collected and preserved over the years. One of the core technologies in content digitization is XML, which serves as the transitional state the content is moved into before it is output in an electronic format. With XML in use, it is essential that the content adhere to defined standards, based on which it is tagged with the required headers, sections, and subsections. This schema creation is at the core of the digitization process. Many tools and automated solutions now take an input file and produce the XML file; OMNIMARK is one example. Once the XML files have been created, processing engines consume them to produce the runtime output that users see. This output can take various formats, such as EPUB, OEB, MobiPocket, FictionBook, and Microsoft Reader (LIT).
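The text-to-XML transition described above can be sketched in a few lines. This is a minimal illustration only: the element names (`book`, `chapter`, `para`) are invented for the example, whereas a real project would tag content against the publisher's own schema or DTD.

```python
# Minimal sketch of wrapping extracted plain text in transitional XML.
# Element names here are illustrative, not a real publisher schema.
import xml.etree.ElementTree as ET

def text_to_xml(title, paragraphs):
    """Build a simple transitional XML document from extracted text."""
    book = ET.Element("book")
    ET.SubElement(book, "title").text = title
    chapter = ET.SubElement(book, "chapter")
    for text in paragraphs:
        ET.SubElement(chapter, "para").text = text
    return ET.tostring(book, encoding="unicode")

xml_doc = text_to_xml("Sample Title", ["First paragraph.", "Second paragraph."])
print(xml_doc)
```

In practice this step is performed by conversion tooling rather than hand-written scripts, but the shape of the output is the same: structured, tagged content ready for schema validation and ingestion.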
From a QA standpoint, content digitization verification happens at two stages: first, when the transitional XML is created, and second, when the XML is ingested into the processing engine to create the production runtime files. At the first stage, it is critical to verify that the XML has been created per the specification, because most defects are easiest to catch here. XML tags, document structure, and adherence to the schema called out in the specification are the prime items to verify, so the QA team needs to get involved early in the conversion cycle rather than after digitization has taken place. There are also scenarios where the source content is a PDF that must be extracted in an intermediate step, using technologies such as OpenOffice, before production ingestion; in that case the extraction itself is an important stage for QA.
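A first-stage structural check of this kind can be sketched as follows. The required-element list is an assumption for illustration; a real project would validate against the publisher's schema, for example with an XSD or DTD validator.

```python
# Hypothetical first-stage check: confirm a transitional XML file is
# well-formed and contains the elements the specification calls out.
import xml.etree.ElementTree as ET

REQUIRED_CHILDREN = ("title", "chapter")  # assumed spec requirements

def verify_structure(xml_text):
    """Return a list of human-readable problems found in the XML."""
    problems = []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    if root.tag != "book":
        problems.append(f"unexpected root element: {root.tag}")
    present = {child.tag for child in root}
    for tag in REQUIRED_CHILDREN:
        if tag not in present:
            problems.append(f"missing required element: {tag}")
    return problems

good = "<book><title>T</title><chapter><para>x</para></chapter></book>"
bad = "<book><chapter/></book>"
print(verify_structure(good))  # → []
print(verify_structure(bad))   # → ['missing required element: title']
```

Checks like this are cheap to run over every converted file, which is exactly why this stage is where most defects should be caught.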
The second stage is where testing happens at a more black-box level: once the verified XML file has been ingested, the processing engine creates the runtime electronic content. Core elements to verify here include content appearance, clarity, completeness of information (no truncations), the index, the glossary, navigational links and whether they lead to the right content, print features, and so on. Most importantly, since search is one of the major advantages of digitized content, a good amount of QA effort needs to go into the validity, accuracy, and completeness of the search feature. Once such core functionality has been verified, the next check is compatibility across desktops and mobile devices such as laptops, phones, e-readers, and tablets, which matters because content access on mobile devices is increasing by the day. We recently wrote a blog post on this topic: "Content QA – is this really needed?" When existing content is being digitized, not much test effort needs to go into the content itself; the focus is on testing the digitization process.
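One way to reason about search-completeness testing is to build an index from the digitized pages and confirm that terms from the source copy are found where expected. The sketch below is a toy stand-in for a real search feature, with invented page data.

```python
# Toy sketch of a search-completeness check over digitized pages:
# map each word to the pages it appears on, then spot-check terms.
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page numbers containing it."""
    index = defaultdict(set)
    for number, text in pages.items():
        for word in text.lower().split():
            index[word.strip(".,;:")].add(number)
    return index

pages = {1: "Digital content is searchable.", 2: "Search must be complete."}
index = build_index(pages)
print(sorted(index["search"]))      # pages where "search" appears
print(sorted(index["searchable"]))  # pages where "searchable" appears
```

A real test pass would sample terms from the original source copy and compare hit locations against the digitized output, flagging any term that has gone missing in conversion.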
Besides the functional and UI testing we just discussed, other core forms of testing for digitized content include: security (especially where Digital Rights Management is enabled), performance (whether the content is hosted externally or deployed behind a firewall), accessibility (especially when digitization adds features such as read-aloud support and assistive-device compatibility for physically challenged users), usability (to ensure the flow of the content works for the target audience), and globalization (only if the content is being localized at the time of digitization).
Having looked at the scope of content digitization QA at a high level, let's take a peek at the scope for test automation. There is a lot of room for automation at the stage when the XML files are verified; simple unit tests can be written with XMLUnit to check the generated XML. Automation at this stage is very effective: it is simple to write, easy to maintain, saves the tester a great deal of time, and catches bugs early, when missing them would be most expensive. Given how cost-effective it is, this is a lucrative area to automate in content digitization testing. At the front end, while some pieces are best verified manually, areas such as checking for broken navigational links, performance, and security testing can be automated. Navigational links are cumbersome to test manually and yield good results for a minimal amount of automation effort: basic utilities can be written to check that links work and navigate to the correct locations. As in any other product testing effort, performance and security are areas that scale only through effective use of automation, and content digitization is no exception. Given the volumes handled when content is digitized, it also makes sense to build an automated regression suite if the content is expected to change across revisions; if not, the ROI of creating such a suite is low.
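XMLUnit itself is a Java library, but the link-checking utility mentioned above can be sketched in a few lines of Python for (X)HTML-based output such as EPUB chapters: collect every `id` anchor, then flag internal `href` fragments that point at a missing target. This is an illustrative sketch only; external URL checking and multi-file link resolution are omitted.

```python
# Minimal sketch of a broken-navigational-link check for a single
# (X)HTML page: every href="#fragment" must match some id="fragment".
import re

def find_broken_fragments(html):
    """Return internal '#fragment' hrefs with no matching id in the page."""
    ids = set(re.findall(r'id="([^"]+)"', html))
    hrefs = re.findall(r'href="#([^"]+)"', html)
    return [h for h in hrefs if h not in ids]

page = '<a href="#ch1">Ch 1</a><a href="#ch2">Ch 2</a><h1 id="ch1">One</h1>'
print(find_broken_fragments(page))  # → ['ch2']
```

Run over every generated chapter, a utility like this turns one of the most tedious manual checks into a fast automated pass, which is what makes link verification such an attractive early automation target.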
Using the testing process defined above, trained test engineers who have handled several digitization QA projects, the right mix of manual and automated testing, and content domain experts where necessary, we at QA InfoTech have provided content digitization QA services to several leading content and publishing houses. Content digitization is still in its nascent stages; a great deal of evolution is still to come in the process and its associated technologies, and the scope for content digitization QA is huge. It is a great area in which to build a niche, given not just the scope of the work but also the specialized skills a company or an individual can develop, and we continue to invest in our people, processes, and R&D efforts to strengthen our edge.