By Daniela Miltner, Product Management, ABBYY
PDFs, office documents, e-mails and texts are everywhere. Today the vast majority of information assets are unstructured and composed in natural language. Most organisations recognise the need to classify and organise these asset types, but why do so few actually practise it? The reality is that they largely lack the capabilities to do so.
In order to leverage unstructured content and transform dark data into actionable information, a new approach to classification is required: one that harnesses machine learning, linguistic and semantic technologies, allowing us to master the growing amount of unstructured data.
The Rise of Natural Language Processing
Categories help the human brain to organise the world. Research from Gartner predicts that the role of natural-language processing will increase in line with the growing trend of AI and machine learning. However, in IT and business we still face information overload.
To action information, or use it effectively, it is helpful to understand the context and logical class of a document. To let information drive business decisions, the right person needs access to all relevant information when needed – starting from the arrival at the organisation up to retrieval of long-term archives.
Classification not only helps businesses manage the tidal wave of data but also generates business value, which should come as a welcome bonus to those weighing up investment in technology against the bottom line. What’s more, this is true for any industry – from consumer-focused enterprises in retail or banking, to organisations that rely on search and discovery processes, such as the legal, healthcare and public sectors.
Managing the Digital Roadblocks
So why is it that CIOs desperately want to classify their information, but are not implementing classification and taking advantage of it?
Manual classification is time-consuming, inaccurate and inconsistent, with quality often deteriorating as volumes increase and time pressures heighten. Rule-based, automated classification already supports enterprises by sorting and routing semi-structured documents. Structured documents, like a loan application form or an invoice, can be recognised by intelligent input management solutions and routed to the enterprise workflow. However, rules can quickly reach their limits when it comes to unstructured content and natural language texts. Content in unstructured documents is unpredictable and not standardised, and different people often use different terms, expressions and syntax to talk about the same thing, which adds to the complexity of managing and converting these documents. As there is no metadata accessible, or only limited metadata, the technology is unable to draw meaning or context from the document, and the information therefore becomes ‘dark data’. Having ‘dark data’ within your enterprise means that you cannot draw value from it: whether it is business-critical or not, it is unsearchable.
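To make the limits of rules concrete, here is a minimal sketch in Python. The rule set, category names and sample texts are all hypothetical and chosen purely for illustration; they do not come from any particular product. The rules recognise a document that uses the expected vocabulary, but miss a free-text complaint that never uses the word "complaint".

```python
# Hypothetical keyword rules: category -> trigger phrases (illustrative only).
RULES = {
    "invoice": {"invoice", "amount due", "payment terms"},
    "complaint": {"complaint", "refund"},
}


def classify_by_rules(text):
    """Return the first category whose trigger phrase occurs in the text, else None."""
    lowered = text.lower()
    for category, keywords in RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return None


# A document that uses the vocabulary the rules expect.
structured = "Invoice 1042: amount due within 30 days."
# A complaint written in free text, with none of the trigger phrases.
free_text = "I am really unhappy, the device stopped working after a week."

print(classify_by_rules(structured))  # invoice
print(classify_by_rules(free_text))   # None
```

Every new way of phrasing the same concern would need another rule, which is why rule sets become hard to maintain for natural language.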
Accurate classification of unstructured content has remained a topic reserved for technical experts who re-adjust working parameters. The new approach to classifying unstructured content uses statistics, linguistics and semantic technologies, and combines them with tools that make it easy for employees and processing experts to set up classification models. By deploying machine learning, the most appropriate classification features can be selected automatically. It is not necessary, as with traditional rule-based systems, to specify rule sets or to manually train and tune models with huge quantities of training documents. This changes the way texts and documents can be categorised.
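As an illustration of the statistical side of this approach, the sketch below trains a small naive Bayes text classifier from a handful of labelled examples. Naive Bayes is used here only as a simple, well-known stand-in; it is an assumption for illustration and not a description of any vendor's technology. The class name, category labels and training sentences are likewise hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, labelled_docs):
        for text, label in labelled_docs:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods per token.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples; a real model would need far more data.
classifier = NaiveBayesClassifier()
classifier.train([
    ("The device stopped working and I am unhappy", "complaint"),
    ("My order arrived broken and nobody answered my calls", "complaint"),
    ("Invoice attached, amount due within thirty days", "invoice"),
    ("Please find the invoice for your recent payment", "invoice"),
])

print(classifier.classify("I am unhappy, the product arrived broken"))  # complaint
print(classifier.classify("Payment for the attached invoice is due"))   # invoice
```

No keyword rules are written by hand here: the model derives its word statistics from the examples, so new phrasing is handled by adding examples rather than rules.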
What are the applications of unstructured document classification in business scenarios? Classification is an essential step in almost any kind of content management process. This includes the following:
Content management: High-performance classification of unstructured content allows organisations to manage large repositories quickly, and enables knowledge workers to efficiently search and locate information critical to their work.
Client support: Support is a crucial element of any customer-oriented business, where satisfaction and retention are key success drivers. Large companies with worldwide operations, a wide range of products and services, and millions of customers need daily feedback about what works, what doesn’t and where they could do better. Customer support services are the primary way of receiving that feedback. Fast and accurate classification of incoming complaints and requests is a critical first step towards delivering timely solutions to customers’ issues and driving higher levels of customer satisfaction. This is how customer support helps to deliver an outstanding experience and increase customer loyalty.
Information governance: Granular text- and semantic-based classification enables organisations to keep up with security, compliance and records management requirements. This is especially important given the impending EU General Data Protection Regulation (GDPR), which will affect any organisation that processes personal data of individuals living within the EU. By setting up category-based document access rights, routing, archiving and search, organisations will support the aim of GDPR to protect all EU citizens from privacy and data breaches, adapting to the increasingly data-driven world that we now live in.
Data migration: Mergers, reorganisations or even just bringing new IT systems online require fast data migration. This comes with the added challenge of keeping the data protected and controlled, and of avoiding the pitfalls of ‘dark data’: such data can be useful for compliance, but storing and securing it typically incurs more expense (and sometimes greater risk) than it returns in value. These hurdles can be overcome by setting up flexible content-aware rules to filter content repositories during data migration projects.
E-mail management: Organising e-mails manually is painful, but missing business-critical messages from customers or suppliers is even more so. Metadata (such as ‘to’, ‘from’) is rarely good enough. Using both metadata and content, new semantic-based classification automatically separates the wheat from the chaff.
The classification of unstructured information assets is critical in supporting business objectives and driving value for the enterprise. It is this ‘intelligent classification’ of information that should absolutely be a key consideration for CIOs and other key decision-makers. Simply owning big data is secondary to having access to the critical data within documents that shortens an individual’s path from discovery to decision.