
Showing posts with label billion. Show all posts

Saturday, October 15, 2016

A Billion Words: Because today's language modeling standard should be higher



Language is chock full of ambiguity, and it can turn up in surprising places. Many words are hard to tell apart without context: most Americans pronounce “ladder” and “latter” identically, for instance. Keyboard inputs on mobile devices have a similar problem, especially for IME keyboards. For example, the input patterns for “Yankees” and “takes” look very similar:
Photo credit: Kurt Partridge

But in this context -- the previous two words, “New York” -- “Yankees” is much more likely.

One key way computers use context is with language models. These are used for predictive keyboards, but also speech recognition, machine translation, spelling correction, query suggestions, and so on. Often those are specialized: word order for queries versus web pages can be very different. Either way, having an accurate language model with wide coverage drives the quality of all these applications.
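As a rough illustration of how a language model uses the previous two words, here is a minimal sketch of a trigram counter that ranks candidate words given a two-word context. The toy corpus, the function names, and the add-one smoothing are all assumptions made for this example; real keyboard and speech systems use far larger models and data.

```python
from collections import Counter, defaultdict


def train_trigram_counts(sentences):
    """Count how often each word follows each two-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            counts[(w1, w2)][w3] += 1
    return counts


def rank_candidates(counts, context, candidates):
    """Order candidate words by an add-one smoothed score given the context."""
    following = counts[tuple(w.lower() for w in context)]
    total = sum(following.values())
    scores = {
        c: (following[c.lower()] + 1) / (total + len(candidates))
        for c in candidates
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus, invented for illustration only.
corpus = [
    "the new york yankees won again",
    "she loves the new york yankees",
    "new york takes pride in its parks",
]
counts = train_trigram_counts(corpus)
ranking = rank_candidates(counts, ("New", "York"), ["Yankees", "takes"])
print(ranking)
```

With this toy corpus, “Yankees” ranks above “takes” after “New York”, because it follows that context more often in the training sentences.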

Due to interactions between components, one thing that can be tricky when evaluating the quality of such complex systems is error attribution. Good engineering practice is to evaluate the quality of each module separately, including the language model. We believe that the field could benefit from a large, standard set with benchmarks for easy comparison and experiments with new modeling techniques.
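One common way to score a language model on its own, independent of the surrounding system, is perplexity on held-out text. The post does not name a metric, so treat the sketch below as an assumption about how such an evaluation might look; the `perplexity` function and the example probabilities are hypothetical.

```python
import math


def perplexity(word_probs):
    """Perplexity: exp of the average negative log probability per word.

    word_probs holds the probability the model assigned to each word
    of the held-out text, in order. Lower perplexity is better.
    """
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Hypothetical model that assigns probability 0.25 to every test word.
uniform_probs = [0.25] * 8
print(perplexity(uniform_probs))
```

A model that is always choosing uniformly among four words has a perplexity of about 4, which is why the number is often read as an effective branching factor.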

To that end, we are releasing scripts that convert a set of public data into a language modeling benchmark of over a billion words, with standardized training and test splits, described in an arXiv paper. Along with the scripts, we’re releasing the processed training and test data in one convenient location. This will make it much easier for the research community to quickly reproduce results, and we hope it will speed up progress on these tasks.

The benchmark scripts and data are freely available, and can be found here: http://www.statmt.org/lm-benchmark/

The field needs a new and better standard benchmark. Currently, researchers report results on a data set of their choice, and those results are very hard to reproduce because there is no standard for preprocessing. We hope that this release will solve both of those problems and become the standard benchmark for language modeling experiments. As more researchers use the new benchmark, comparisons will be easier and more accurate, and progress will be faster.

For all the researchers out there, try out this benchmark, run your experiments, and let us know how it goes -- or publish, and we’ll enjoy finding your results at conferences and in journals.

Tuesday, September 27, 2016

Zuckerberg pledges $44.5 billion to charity

Facebook became more popular yesterday as its founder Mark Zuckerberg and his wife Priscilla Chan pledged 99% of their shares in Facebook to charity to mark the birth of their first child. These shares are currently valued at around $45 billion. This will leave the Zuckerberg family with a mere half billion dollars to scrape by on. Actually, I'm not being facetious: I commend Zuckerberg for his actions. Nobody needs such a vast fortune, and in the hands of well-run charities such huge sums of money can make a real difference. I hope more billionaires follow suit. When you use Facebook now, you can know that you are helping make the world a better place.

from The Universal Machine http://universal-machine.blogspot.com/


Friday, May 27, 2016

IBM spends $3 billion to push the far future of computer chips

IBM has announced that it is investing $3 billion over the next five years to develop processors with much smaller, more tightly packed electronics than today's chips, and to sustain computing progress even after today's manufacturing technology runs out of steam. The problem is that it is becoming physically impossible to miniaturise silicon chips any further (no pun intended). Read this CNET article to learn more.

from The Universal Machine http://universal-machine.blogspot.com/


Saturday, January 30, 2016

11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts



“I assume that by knowing the truth you mean knowing things as they really are.”
- Plato

When you type in a search query -- perhaps Plato -- are you interested in the string of letters you typed? Or the concept or entity represented by that string? But knowing that the string represents something real and meaningful only gets you so far in computational linguistics or information retrieval -- you have to know what the string actually refers to. The Knowledge Graph and Freebase are databases of things, not strings, and references to them let you operate in the realm of concepts and entities rather than strings and n-grams.

We’ve previously released data to help with disambiguation and recently awarded $1.2M in research grants to work on related problems. Today we’re taking another step: releasing data consisting of nearly 800 million documents automatically annotated with over 11 billion references to Freebase entities.

These Freebase Annotations of the ClueWeb Corpora (FACC) consist of ClueWeb09 FACC and ClueWeb12 FACC. In them, 11 billion phrases that refer to concepts and entities in Freebase were automatically labeled with their unique identifiers (Freebase MIDs). For example:



Since the annotation process was automatic, it likely made mistakes. We optimized for precision over recall, so the algorithm skipped a phrase if it wasn’t confident enough of the correct MID. If you prefer even higher precision, we include confidence levels, so you can filter out the lower-confidence annotations that we did include.
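Filtering by confidence could look roughly like the sketch below. The `Annotation` record, its field names, the placeholder MIDs, and the threshold are all hypothetical and chosen for illustration; consult the documentation that ships with the data for the actual file layout.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical simplified record; the real files may carry more fields."""
    mention: str       # phrase as it appears in the document
    mid: str           # Freebase identifier (placeholders below)
    confidence: float  # annotator's confidence in the chosen MID


def filter_by_confidence(annotations, threshold):
    """Keep only annotations at or above the confidence threshold."""
    return [a for a in annotations if a.confidence >= threshold]


# Invented sample records, not taken from the released data.
sample = [
    Annotation("Plato", "/m/placeholder_1", 0.97),
    Annotation("Republic", "/m/placeholder_2", 0.62),
    Annotation("Athens", "/m/placeholder_3", 0.88),
]

high_confidence = filter_by_confidence(sample, threshold=0.85)
for annotation in high_confidence:
    print(annotation.mention, annotation.mid, annotation.confidence)
```

Raising the threshold trades recall for precision: fewer annotations survive, but the ones that do are more likely to be correct.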

Based on review of a sample of documents, we believe the precision is about 80-85%, and recall, which is inherently difficult to measure in situations like this, is in the range of 70-85%. Not every ClueWeb document is included in this corpus; documents in which we found no entities were excluded from the set. A document might be excluded because there were no entities to be found, because the entities in question weren’t in Freebase, or because none of the entities were resolved at a confidence level above the threshold.
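For readers who want to run a similar spot check on their own sample, precision and recall can be computed from three counts, as in this minimal sketch. The counts used here are invented for illustration and are not the figures behind the estimates above.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from counts taken on a reviewed sample.

    precision: share of emitted annotations that are correct
    recall:    share of annotatable phrases that were correctly annotated
    """
    emitted = true_positives + false_positives
    annotatable = true_positives + false_negatives
    precision = true_positives / emitted if emitted else 0.0
    recall = true_positives / annotatable if annotatable else 0.0
    return precision, recall


# Invented counts from a hypothetical hand-reviewed sample.
precision, recall = precision_recall(
    true_positives=82, false_positives=18, false_negatives=23
)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Recall is the harder number to pin down, because counting false negatives requires a reviewer to find every phrase that should have been annotated but was not.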

The ClueWeb data is used in multiple TREC tracks. You may also be interested in our annotations of several TREC query sets, including those from the Million Query Track and Web Track.

If you would prefer a human-annotated set, you might want to look at the Wikilinks Corpus we released last year. Entities there were disambiguated by links to Wikipedia, inserted by the authors of the page, which is effectively a form of human annotation.

You can find more detail and download the data on the pages for the two sets: ClueWeb09 FACC and ClueWeb12 FACC. You can also subscribe to our data release mailing list to learn about releases as they happen.

Special thanks to Jamie Callan and Juan Caicedo Carvajal for their help throughout the annotation project.