Seeing the wood for the trees: Processing data for machine learning


Machine learning promises to revolutionise business, but all that data needs to be prepared and processed in the right way…

Intelligent machines have assumed an integral role in our lives almost without us noticing. They are there when Amazon suggests a film we might like to watch, when Facebook tries to sell us a product, and when we ask Siri a question. Machine learning has woven itself into the heart of our lives and is playing an increasingly important role in the world of business.

According to McKinsey’s State of Machine Learning Report, the world’s leading companies are racing to secure patents and IP for new machine learning techniques. Giants such as Baidu and Google spent between $20bn and $30bn on machine learning, with most of this going into development.

In a world of big data, companies of all sorts need intelligent machines to make sense of the information flowing into them. Even relatively small organisations are transforming the way they collect and manage data, but none of it does any good if you can’t make sense of it. Automated systems that can process the data are crucial, particularly when it comes to financial services.

Machine learning has an important role in financial services. Institutions are using it to manage customer interactions, risk and compliance.

Processing data

Machine learning works via a series of algorithms that rely on data to evolve, learn and inform their actions, but their performance depends entirely on what data is fed into the system. Businesses need to decide what data they need and prepare it. In a typical machine or deep learning project, this preparation takes approximately 80% of the overall time. Data can go through various data cleansing programs and tools.

The process, though, is not entirely automated and can be quite arduous. It includes the basics of setting filters to remove duplicate copies, weighting and selection, attribute setting and sampling. You may have much more data than you actually need, so sampling helps you pull out selected data that will make it much easier and faster to process.
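Two of those basics, removing duplicate copies and sampling, can be sketched in a few lines of Python. The records, field names and sample size below are purely illustrative assumptions, not taken from any real pipeline:

```python
import random

# Hypothetical raw records; duplicates are common when the same event
# is logged by more than one upstream system.
records = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": 80.5},
    {"id": 1, "amount": 120.0},   # duplicate copy
    {"id": 3, "amount": 45.0},
    {"id": 4, "amount": 300.0},
]

# Filter out duplicate copies, keeping the first occurrence of each id.
seen = set()
deduplicated = []
for record in records:
    if record["id"] not in seen:
        seen.add(record["id"])
        deduplicated.append(record)

# Sample a subset so downstream processing is easier and faster.
random.seed(42)  # fixed seed so the sample is reproducible
sample = random.sample(deduplicated, k=2)

print(len(deduplicated))  # 4 unique records remain
print(len(sample))        # 2 records pulled out for processing
```

In practice a dedicated tool or library would handle this at scale, but the principle is the same: decide what counts as a duplicate, then draw a manageable, representative subset.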

You will also have to go through a process of feature engineering, in which you decide which attributes you want to monitor and which information the system needs. For example, if you are managing a system for compliance, you need to identify the data sets the machine needs to know about.
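Selecting attributes can be as simple as projecting each record down to the features the model needs. The field names below are illustrative assumptions rather than those of any real compliance system:

```python
# Hypothetical raw customer records with more attributes than the
# compliance model needs.
raw_records = [
    {"customer_id": "C001", "country": "UK", "transaction_total": 950.0,
     "marketing_segment": "blue", "flagged": False},
    {"customer_id": "C002", "country": "US", "transaction_total": 15000.0,
     "marketing_segment": "green", "flagged": True},
]

# Attributes chosen during feature engineering for the compliance model.
COMPLIANCE_FEATURES = ("country", "transaction_total", "flagged")

def select_features(record, features=COMPLIANCE_FEATURES):
    """Keep only the attributes the machine needs to know about."""
    return {name: record[name] for name in features}

feature_rows = [select_features(r) for r in raw_records]
print(feature_rows[0])
# {'country': 'UK', 'transaction_total': 950.0, 'flagged': False}
```

Dropping irrelevant attributes early keeps the data set small and prevents the model from learning from fields that have no bearing on the problem.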

You will then need to transform the data to make it easily usable by the machines. There are typically three types of transformation:

  • Scaling: You might have values in different units such as kilometres, miles, kilograms or time. Many algorithms prefer data on a uniform scale, for example between 0 for the smallest value and 1 for the largest.
  • Decomposition: Features can be complex and may be more useful to machine learning if pulled out into their component parts.
  • Aggregation: The data could be aggregated into a single feature which is more relevant for the problem you are trying to solve.
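The three transformations above can be sketched in plain Python. The values, timestamp and feature names are illustrative assumptions only:

```python
from datetime import datetime

# --- Scaling: map distances in kilometres onto a uniform 0-1 range ---
distances_km = [5.0, 120.0, 42.0, 300.0]
lo, hi = min(distances_km), max(distances_km)
scaled = [(d - lo) / (hi - lo) for d in distances_km]

# --- Decomposition: split a complex timestamp into component parts ---
ts = datetime(2024, 3, 15, 14, 30)
decomposed = {"year": ts.year, "month": ts.month,
              "day": ts.day, "hour": ts.hour}

# --- Aggregation: collapse per-transaction amounts into one feature ---
transactions = [120.0, 80.5, 45.0]
monthly_spend = sum(transactions)  # single feature for the problem

print(min(scaled), max(scaled))  # 0.0 1.0
print(decomposed["hour"])        # 14
print(monthly_spend)             # 245.5
```

Which transformation to apply depends on the problem: scaling puts features on an equal footing, decomposition exposes useful components, and aggregation trades detail for a single, more relevant signal.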

Pre-processing the data can be an arduous task and requires people with the right skills and experience. Developers, business analysts and data scientists will all be valuable in getting the data into the right shape to be used by the machine. How successful they are will determine the difference between a machine that causes nothing but problems and one that helps you analyse and act on your data and drives profit for your business.