InnovaTech Review

The Big Data!

27 November 2016, written by Hugo Coron, published in #Internet


Why data is the new coal

“Is data the new oil?” asked proponents of big data back in 2012 in Forbes magazine. By 2016, and the rise of big data’s turbo-powered cousin deep learning, we had become more certain: “Data is the new oil,” stated Fortune.

Amazon’s Neil Lawrence has a slightly different analogy: Data, he says, is coal. Not coal today, though, but coal in the early days of the 18th century, when Thomas Newcomen invented the steam engine. A Devonian ironmonger, Newcomen built his device to pump water out of the south west’s prolific tin mines.

The problem, as Lawrence told the Re-Work conference on Deep Learning in London, was that the pump was rather more useful to those who had a lot of coal than to those who didn’t: it was good, but not good enough to justify buying in coal to run it. So much so that the first of Newcomen’s steam engines wasn’t built in a tin mine, but at a coal works near Dudley.

So why is data coal? The problem is similar: there are a lot of Newcomens in the world of deep learning. Startups like London’s Magic Pony and SwiftKey are coming up with revolutionary new ways to train machines to do impressive feats of cognition, from reconstructing facial data from grainy images to learning the writing style of an individual user to better predict which word they are going to type in a sentence.
And yet, like Newcomen, their innovations are so much more useful to the people who actually have copious amounts of raw material to work from. And so Magic Pony is acquired by Twitter, SwiftKey is acquired by Microsoft – and Lawrence himself gets hired by Amazon from the University of Sheffield, where he was based until three weeks ago.

But there is a coda to the story: more than half a century later, James Watt made a nice tweak to the Newcomen steam engine, adding a separate condenser to the design. That change, Lawrence said, “made the steam engine much more efficient, and that’s what triggered the industrial revolution”.

Whether data is oil or coal, then, there’s another way the analogy holds up: a lot of work is going into trying to make sure we can do more, with less. It’s not as impressive as teaching a computer to play Go or Pac-Man better than any human alive, but “data efficiency” is a crucial step if deep learning is going to move away from simply gobbling up oodles of data and spitting out the best correlations possible.

“If you look at all the areas where deep learning is successful, they’re all areas where there’s lots of data,” points out Lawrence. That’s great if you want to categorise images of cats, but less helpful if you want to use deep learning to diagnose rare illnesses. “It’s generally considered unethical to force people to become sick in order to acquire data.”

Machines remain stupid
The problem is that for all the successes of organisations like Google’s AI research organisation DeepMind, computers are still pretty awful at actually learning. I can show you a picture of an animal you’ve never seen before in your life – maybe a Quokka? – and that one image would provide you with enough information to correctly identify a completely different Quokka in a totally separate picture. Show the first image of a Quokka to even a good, pre-trained neural network, and you’ll be lucky if it even adjusts its model at all.

The flipside, of course, is that if you show a deep learning system several million pictures of Quokkas, along with a few million pictures of every other extant mammal, you could well end up with a mammal identification system which can beat all but the top-performing experts at categorising small furry things.

“Deep learning requires very large quantities of data in order to build up a statistical picture,” says Imperial College’s Murray Shanahan. “It actually is very very slow indeed at learning, whereas a young child is very quickly going to learn the idea.”

Deep learning experts have proposed several ways to tackle the problem of data efficiency. Like much of the field, they’re best thought of through analogy with your own brain.

One such approach involves “progressive neural networks”. It aims to overcome the problem that many deep learning models have when they move into a new field: either they ignore their already-learned information and start afresh, or run the risk of “forgetting” what they already learned as it gets overwritten by the new information. Imagine if your options when learning to identify Quokkas were either to independently relearn the entire concept of heads, bodies, legs and fur, or to try and incorporate your existing knowledge but risk forgetting what a cat looks like.

Raia Hadsell is in charge of DeepMind’s efforts to implement a better system for deep learning – one which is necessary if the company is to continue toward its long-term goal of building an artificial general intelligence: a machine capable of doing the same set of tasks as you or I.
“There is no model, no neural network, in the world that can be trained to both identify objects, play Space Invaders, and listen to music,” Hadsell said at Re-Work. “What we would like to be able to do is learn a task, get to [an] expert [level] at that task, and then move on to a second task. Then a third, then a fourth, then a fifth.

“We want to do that without forgetting. And with the ability to transfer from task to task: if I learn one task, I want that to help me learn the next task.” That’s what Hadsell’s team at DeepMind has been working on. Their method allows the learning system to “freeze” what it knows about one task – say, playing Pong – and then move on to the next task, while still being able to refer back to what it learned about the first one.

“That could be an interesting low-level vision feature” – learning how to parse individual objects out of the stream of visual data, for instance – “or a high-level policy feature”, such as the knowledge that the small white dot must remain on the correct side of your paddle. It’s easy to see how the former is useful to carry over to other Atari games, while the latter might only be useful if you’re training a system to play Breakout. But if you are, it lets you skip a whole chunk of learning.
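To make the freeze-and-transfer idea a little more concrete, here is a minimal sketch in PyTorch. It only illustrates the general mechanism – one new “column” of layers per task, earlier columns frozen but still consulted through lateral connections – and the class names and tiny two-layer columns are invented for the example; this is not DeepMind’s code.

```python
import torch
import torch.nn as nn

class Column(nn.Module):
    """One per-task column: two hidden layers plus an output head."""
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, out_dim)

class ProgressiveNet(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.dims = (in_dim, hidden, out_dim)
        self.columns = nn.ModuleList()
        self.laterals = nn.ModuleList()   # adapters from frozen columns into the newest one

    def add_column(self):
        """Freeze everything learned so far, then add a fresh column for the new task."""
        for col in self.columns:
            for p in col.parameters():
                p.requires_grad = False
        in_dim, hidden, out_dim = self.dims
        self.columns.append(Column(in_dim, hidden, out_dim))
        self.laterals.append(nn.ModuleList(
            [nn.Linear(hidden, hidden) for _ in self.columns[:-1]]))

    def forward(self, x):
        # first-layer features from every column, old and new
        h1s = [torch.relu(col.fc1(x)) for col in self.columns]
        new = self.columns[-1]
        h2 = new.fc2(h1s[-1])
        # reuse (but never overwrite) features learned on earlier tasks
        for adapter, h1 in zip(self.laterals[-1], h1s[:-1]):
            h2 = h2 + adapter(h1)
        return new.head(torch.relu(h2))

net = ProgressiveNet(in_dim=128, hidden=64, out_dim=4)
net.add_column()   # column for task 1 (say, Pong); train it here
net.add_column()   # task 2: Pong's column is now frozen, but still consulted
```

The lateral adapters are the point: the second task can borrow the first task’s features without ever being able to overwrite them, which is exactly the “refer back without forgetting” behaviour Hadsell describes.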

Obviously DeepMind is still a few steps away from actually using the technique to train an artificial general intelligence, which means they’re also a few steps away from accidentally unleashing a superintelligent AI on the world that will repurpose your brain into a node in a planet-wide supercomputer. But, Hadsell said, the progressive neural network technique does have some more immediate uses in improving data efficiency.

Take robotics. “Data is a problem for robots, because they break, they need minders, and they’re expensive,” she said. One approach is to use brute-force on the problem: take, for example, the 2m miles Alphabet’s self-driving cars have travelled in their attempt to learn how to drive. At the beginning, it was only safe to use on the freeway, and even then with a driver’s hand inches from the wheel. Now, it drives cars with no steering wheel at all – though not, yet, on public roads, for legal reasons.

Another approach is to teach the robot through simulation. Feed its sensors a rough approximation of the real world, and it will still learn mostly correctly; then you can “top up” that education with actual training on the real hardware. And the best way to do that, she said, is with progressive neural networks.

Take one simple task: grabbing a floating ball using a robotic arm. “In a day, we trained this task robustly in simulation … if it had been done on a real robot it would have taken 55 days to train.” Hooked up to the real arm, just another two hours training was all it needed to get back to the same level of performance.
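The recipe itself – do almost all of the learning where data is cheap, then a short top-up where it is expensive – can be sketched in a few lines. The toy below stands in for the arm controller with a throwaway regression task, and uses plain fine-tuning of the same weights rather than the progressive networks Hadsell’s team actually used; it is an illustration of the idea, nothing more.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))
loss_fn = nn.MSELoss()

def train(model, inputs, targets, steps, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(inputs), targets).backward()
        opt.step()

# "Simulation": plentiful data, but only approximately right.
sim_x = torch.randn(10_000, 3)
sim_y = sim_x * 0.9 + 0.1            # a crude approximation of the real dynamics
train(policy, sim_x, sim_y, steps=2_000, lr=1e-3)

# "Real robot": a small amount of accurate data; a short top-up is now enough,
# because the network starts from the simulated weights rather than from scratch.
real_x = torch.randn(200, 3)
real_y = real_x                       # the true mapping
train(policy, real_x, real_y, steps=100, lr=1e-4)
```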

Teach them to think
Or there’s another approach. Imperial College’s Shanahan has been working in AI long enough to remember the first time it hit the hype cycle. Back then, the popular approach wasn’t deep learning, a method which has only become possible as processing power, storage space and, yes, data availability have all come of age. Instead, a popular approach was “symbolic” AI: focusing on building logical paradigms which could be generalised, and then fed information about the real world to teach them more. The “symbols” in symbolic AI are, Shanahan says, “a bit like sentences in English, that state facts about the world, or some domain.”

Unfortunately, that approach didn’t scale, and AI had a few years in a downturn. But Shanahan argues that there are benefits to a hybrid approach of the two. Not only would it help with the data efficiency problem, but it also helps with a related issue of transparency: “it’s very difficult to extract human-readable explanations for the decisions that they make,” he says. You can’t ask an AI why it decided that a Quokka was a Quokka; it just did.

Shanahan’s idea is to build up a symbolic-style database not by hand-coding it, but by hooking it up to another approach, called deep reinforcement learning. That’s when an AI learns through trial and error, rather than by examining vast quantities of data. It was core to how DeepMind’s AlphaGo learned to play, for instance.

In a proof of concept, Shanahan’s team built an AI to play a simple game. In essence, the system is trained, not to play the game directly, but to teach a second system the rules of the game and the state of the world so that it can think in more abstract terms about what is going on.
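As a rough illustration of that split – a perception step that produces abstract symbols, and a reinforcement learner that only ever sees those symbols – here is a toy sketch. The one-dimensional “game” and the hand-written extract_symbols() are invented for the example; in the proof of concept the symbolic description is itself learned.

```python
import random
from collections import defaultdict

def extract_symbols(agent, goal):
    """Stand-in for the perception module: raw state -> a small abstract fact."""
    return ("goal_is_" + ("right" if goal > agent else "left" if goal < agent else "here"),)

def step(agent, action):              # actions: -1 = move left, +1 = move right
    return max(0, min(9, agent + action))

Q = defaultdict(float)                # Q-values indexed by (symbols, action)
actions = [-1, +1]

for episode in range(500):
    agent, goal = random.randrange(10), random.randrange(10)
    for _ in range(20):
        s = extract_symbols(agent, goal)
        # epsilon-greedy choice over the symbolic state, not over raw positions
        a = random.choice(actions) if random.random() < 0.1 else max(actions, key=lambda b: Q[(s, b)])
        agent = step(agent, a)
        reward = 1.0 if agent == goal else -0.01
        s2 = extract_symbols(agent, goal)
        Q[(s, a)] += 0.1 * (reward + 0.9 * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        if agent == goal:
            break
```

Because the learned policy is indexed by relations like “the goal is to the right” rather than by raw positions, it carries over unchanged if the grid grows or the graphics are redrawn – which is the kind of generalisation Shanahan is after.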

Just like Hadsell’s approach, that pays off when the rules change slightly. Where a conventional deep learning system is flummoxed, Shanahan’s more abstracted system is able to think generally about the problem, see the similarities to the previous approach, and continue.

Think smart
To some extent, the data efficiency problem can be overstated. It’s true that you can learn something a heck of a lot faster than the typical deep learning system, for instance. But not only are you starting with years’ worth of previous knowledge that you’re building on – hardly a small amount of data – you also have a weakness that no good deep learning system would put up with: you forget. A lot.

That may turn out to be the cost of an efficient thinking system. Either you forget how to do stuff, or you spend ever increasing resources simply sorting between the myriad things you know trying to find the right one for each situation. But if that’s the price to pay for moving deep learning out of the research centres in the biggest internet companies, it could be worth it.

Alex Hern - 27 September 2016 - The Guardian

Admiral to price car insurance based on Facebook posts

One of the biggest insurance companies in Britain is to use social media to analyse the personalities of car owners and set the price of their insurance.

The unprecedented move highlights the start of a new era for how companies use online personal data and will start a debate about privacy.

Admiral Insurance will analyse the Facebook accounts of first-time car owners to look for personality traits that are linked to safe driving. For example, individuals who are identified as conscientious and well-organised will score well.

The insurer will examine posts and likes by the Facebook user, although not photos, looking for habits that research shows are linked to these traits. These include writing in short concrete sentences, using lists, and arranging to meet friends at a set time and place, rather than just “tonight”.

In contrast, evidence that the Facebook user might be overconfident – such as the use of exclamation marks and the frequent use of “always” or “never” rather than “maybe” – will count against them.

The initiative is called firstcarquote and was officially meant to launch this week but that was delayed at the last minute on Tuesday night. It is aimed at first-time drivers or owners – although anyone with a licence can apply. The scheme is voluntary, and will only offer discounts rather than price increases, which could be worth up to £350 a year. However, Admiral has not ruled out expanding firstcarquote.

The rapid growth of social media and personal technology has given insurance companies and employers swaths of data they can access to analyse customers or employees. As well as Admiral’s car insurance scheme, insurers are looking at how they can use the rise of smartwatches and fitness trackers to monitor people’s health. For example, Vitality is currently selling the Apple Watch to health and life insurance customers, with the final price dependent on how much exercise customers do while owning the watch.

Admiral says that firstcarquote offers a way for young drivers to identify themselves as safe rather than having to wait years while they build up a track record and a no claims bonus.
Dan Mines, who led the firstcarquote project at Admiral, denied that the scheme was an invasion of personal privacy.

“It is incredibly transparent. If you don’t want to use it in a quote then you don’t have to,” he said. “We are doing our best to build a product that allows young people to identify themselves as safe drivers.”

Mines said Admiral could eventually develop the scheme further, meaning it could include other social media sites and increase the price of insurance for some drivers.

“This is very much a test product for us. This is innovative, it is the first time anyone has done this,” he said. “It is a test, this is early days. The data will only ever provide a discount. We will work through that and learn more.

“I think the future is unknown. We don’t know if people are prepared to share their data. If we find people aren’t sharing their data, then we won’t ever get to consider that [expanding firstcarquote].”

The scheme is based around algorithms that have been developed by Admiral. The technology uses social data to make a personality assessment and then, judging it against real claims data, analyses the risk of insuring the driver.

Yossi Borenstein, the principal data scientist on firstcarquote, said its algorithm looked for correlations between social media data and actual claims data. The technology will evolve as firstcarquote attracts customers and gathers more evidence about the correlations, meaning the importance of items identified on social media could change.

Borenstein said: “Just like conscientiousness there are other traits which can be indicative of safe driving. Our algorithm for calculating what ‘safe’ looks like is constantly learning, as we match social data to actual claims data.

“Our analysis is not based on any one specific model, but rather on thousands of different combinations of likes, words and phrases and is constantly changing with new evidence that we obtain from the data. As such our calculations reflect how drivers generally behave on social media, and how predictive that is, as opposed to fixed assumptions about what a safe driver may look like.”
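A heavily simplified sketch of what such a model could look like – crude counts of exclamation marks, absolute words and short sentences, fed into a logistic regression fitted against claims history, with the predicted risk mapped onto a discount – is below. The features, the synthetic training data and the pricing rule are all invented for illustration; Admiral has not published its algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(posts):
    """Turn a handful of posts into the kind of simple signals the article mentions."""
    text = " ".join(posts).lower()
    return [
        float(text.count("!")),                             # exclamation marks (overconfidence)
        float(text.count("always") + text.count("never")),  # absolute rather than hedged language
        float(sum(len(p.split()) < 12 for p in posts)),     # short, concrete sentences
        float(sum(ch.isdigit() for ch in text)),            # concrete times and dates in plans
    ]

# Synthetic stand-in for past applicants' feature counts and whether they later claimed.
rng = np.random.default_rng(0)
X = rng.poisson(2, size=(500, 4)).astype(float)
y = (X[:, 0] + X[:, 1] > 5).astype(int)                     # 1 = made a claim (synthetic)

model = LogisticRegression().fit(X, y)

def quote_discount(posts, max_discount=350):
    """Map the predicted claim risk onto a discount; the scheme only ever offers discounts."""
    risk = model.predict_proba([features(posts)])[0, 1]
    return round(max_discount * (1 - risk))

print(quote_discount(["Meeting Sam at 7 at the gym.", "Shopping list: milk, eggs, bread."]))
```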

Borenstein insisted that Admiral would not have access to information about what its customers look at on Facebook or what their friends do. The company would only have access to the information gathered during the quote process and would have no ongoing access.

“If this is successful, it could be revolutionary,” he said. “It could be truly transformational.”

An Admiral spokesman said: “The launch of our firstcarquote trial has had to be delayed. We’ve been working closely with Facebook in Europe to get the service ready, and are now addressing a few outstanding issues. We hope that very soon we will be able to offer first-time drivers better deals on their car insurance.”

Graham Ruddick - 2 November 2016 - The Guardian

Big Data is a phrase used to mean a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big or it moves too fast or it exceeds current processing capacity.

Big Data has the potential to help companies improve operations and make faster, more intelligent decisions. This data, when captured, formatted, manipulated, stored, and analyzed, can help a company gain useful insight to increase revenues, win or retain customers, and improve operations.

 

This week we will talk about Big Data. 

Nowadays it is present everywhere. Mobile applications and websites use this technology for marketing, but also to make the consumer’s life easier. When you use Waze, the app can predict your destination in advance precisely because you have driven there every day.

When you visit YouTube, the site can suggest the videos most likely to interest you because it remembers your previous visits. The same goes for sites like Amazon or Google.

 

This is why big data is a new energy source, a new coal.

 

With the internet, data is everywhere. Sites can sell their databases to other companies precisely in order to develop big data. This is why traditional advertising media need to adapt very quickly.

Communication tools and advertisements are now very powerful thanks to big data. Big data targets a particular person, whereas older communication strategies targeted a group of people.

 

That's the big difference.

 

See you next week.

 
