Computer News

Saturday, October 1, 2016

How to measure translation quality in your user interfaces

Posted by Javier Bargas-Avila, User Experience Research at Google

Worldwide, there are about 200 languages that are spoken by at least 3 million people. In this global context, software developers are required to translate their user interfaces into many languages. While graphical user interfaces have evolved substantially when compared to text-based user interfaces, they still rely heavily on textual information. The perceived language quality of translated user interfaces (UIs) can have a significant impact on the overall quality and usability of a product. But how can software developers and product managers learn more about the quality of a translation when they don’t speak the language themselves?

Key information in interaction elements and content are mostly conveyed through text. This aspect can be illustrated by removing text elements from a UI, as shown in the the figure below.

Three versions of the YouTube UI: (a) the original, (b) YouTube without text elements, and (c) YouTube without graphic elements. It gets apparent how the textless version is stripped of the most useful information: it is almost impossible to choose a video to watch and navigating the site is impossible.

In "Measuring user rated language quality: Development and validation of the user interface Language Quality Survey (LQS)", recently published in the International Journal of Human-Computer Studies, we describe the development and validation of a survey that enables users to provide feedback about the language quality of the user interface.

UIs are generally developed in one source language and translated afterwards string by string. The process of translation is prone to errors and might introduce problems that are not present in the source. These problems are most often due to difficulties in the translation process. For example, the word “auto” can be translated to French as automatique (automatic) or automobile (car), which obviously has a different meaning. Translators might chose the wrong term if context is missing during the process. Another problem arises from words that behave as a verb when placed in a button or as a noun if part of a label. For example, “access” can stand for “you have access” (as a label) or “you can request access” (as a button).

Further pitfalls are gender, prepositions without context or other characteristics of the source text that might influence translation. These problems sometimes even get aggravated by the fact that translations are made by different linguists at different points in time. Such mistranslations might not only negatively affect trustworthiness and brand perception, but also the acceptance of the product and its perceived usefulness.

This work was motivated by the fact that in 2012, the YouTube internationalization team had anecdotal evidence which suggested that some language versions of YouTube might benefit from improvement efforts. While expert evaluations led to significant improvements of text quality, these evaluations were expensive and time-consuming. Therefore, it was decided to develop a survey that enables users to provide feedback about the language quality of the user interface to allow a scalable way of gathering quantitative data about language quality.

The Language Quality Survey (LQS) contains 10 questions about language quality. The first five questions form the factor “Readability”, which describes how natural and smooth to read the used text is. For instance, one question targets ease of understanding (“How easy or difficult to understand is the text used in the [product name] interface?”). Questions 6 to 9 summarize the frequency of (in)consistencies in the text, called “Linguistic Correctness”. The full survey can be found in the publication.

Case study: applying the LQS in the field

As the LQS was developed to discover problematic translations of the YouTube interface and allow focused quality improvement efforts, it was made available in over 60 languages and data were gathered for all these versions of the YouTube interface. To understand the quality of each UI version, we compared the results for the translated versions to the source language (here: US-English). We inspected first the global item, in combination with Linguistic Correctness and Readability. Second, we inspected each item separately, to understand which notion of Linguistic Correctness or Readability showed worse (or better) values. Here are some results:

The data revealed that about one third of the languages showed subpar language quality levels, when compared to the source language.
To understand the source of these problems and fix them, we analyzed the qualitative feedback users had provided (every time someone selected the lower two end scale points, pointing at a problem in the language, a text box was surfaced, asking them to provide examples or links to illustrate the issues).
The analysis of these comments provided linguists with valuable feedback of various kinds. For instance, users pointed to confusing terminology, untranslated words that were missed during translation, typographical or grammatical problems, words that were translated but are commonly used in English, or screenshots in help pages that were in English but needed to be localized. Some users also pointed to readability aspects such as sections with old fashioned or too formal tone as well as too informal translations, complex technical or legal wordings, unnatural translations or rather lengthy sections of text. In some languages users also pointed to text that was too small or criticized the readability of the font that was used.
In parallel, in-depth expert reviews (so-called “language find-its”) were organized. In these sessions, a group of experts for each language met and screened all of YouTube to discover aspects of the language that could be improved and decided on concrete actions to fix them. By using the LQS data to select target languages, it was possible to reduce the number of language find-its to about one third of the original estimation (if all languages had been screened).

LQS has since been successfully adapted and used for various Google products such as Docs, Analytics, or AdWords. We have found the LQS to be a reliable, valid and useful tool to approach language quality evaluation and improvement. The LQS can be regarded as a small piece in the puzzle of understanding and improving localization quality. Google is making this survey broadly available, so that everyone can start improving their products for everyone around the world.

High Quality Object Detection at Scale

Google Engineers Christian Szegedy, Scott Reed^*, Dumitru Erhan, and Dragomir Anguelov

Update - 26/02/2015
We recently discovered a bug in the evaluation methodology of our object detector. Consequently, the large numbers we initially reported below are not realistic, due to the fact that our separately trained context extractor was contaminated with half of the validation set images. Therefore, our initial results were overly optimistic and were not attainable by the methodology described in the paper. Re-evaluating our initial results, we have restricted ourselves to reporting only the single-model results on the other half of the dedicated validation set without retraining the models. With the updated evaluation, we are still able to report the best single-model result on the ILSVRC 2014 detection challenge data set, with 0.43 mAP when combining both Selective Search and MultiBox proposals with our post-classification model. The original draft of our paper "Scalable, High Quality Object Detection" has been updated to reflect this information. We are deeply sorry if our initial reported results caused any confusion in the community. Original post follows below.
-C. Szegedy, S. Reed, D. Erhan, and D. Anguelov

The ILSVRC detection challenge is an influential academic benchmark for measuring the quality of object detection. This summer, the GoogLeNet team reported top results in the 2014 edition of the challenge, with ~2X improvement over the previous year’s best results. However, the quality of our results came at a high computational cost: processing each image took about two minutes on a state-of-the-art workstation.

Naturally, we began to think of how we could both improve the accuracy and reduce the computation time needed. Given the already high quality of previous results like those of GoogLeNet^[6], we expected that further improvements to detection quality would be increasingly hard to achieve. In our recent paper Scalable, High Quality Object Detection^[7], we detail advances that instead have resulted in an accelerated rate of progress in object detection:

Evolution of detection quality over time. On the y axis is the mean average precision of the best published results at any given time. The blue line shows result using individual models, the red line is multi-model ensembles. Overfeat^[8] was the state-of-the-art at end of last year, followed by R-CNN^[1] published in May. The later measurement points are the results of our team.^[6,7]

As seen in the plot above, the mean average precision has been improved since August from 0.45 to 0.56: a 23% relative gain. The new approach can also match the quality of the former best solution with 140X reduced computational resources.

Most current approaches for object detection employ two phases^[1]: in the first phase, some hand-engineered algorithm proposes regions of interest in the image. In the second phase, each proposed region is run through a deep neural network, identifying which proposed patches correspond to an object (and what that object is).

For the first phase, the common wisdom^[1,2,3,4] was that it took skillfully crafted code to produce high quality region proposals. This has come with a drawback though: these methods don’t produce reliable scoring for the proposed regions. This forces the second phase to evaluate most of the proposed patches in order to achieve good results.

So we revisited our prior “MultiBox” work^[5], in which we let the computer learn to pick the proposals to see whether we could avoid relying on any of the hand-crafted methods above. Although the MultiBox method, using previous generation vision network architectures, could not compete with hand-engineered proposal approaches, there were several advantages of fully relying on machine learning only. First, the quality of proposals increases with each new improved network architecture or training methodology without additional programming effort. Second, the regions come with confidence scores which are used for trading off running time versus quality. Additionally, the implementation is simplified.

Once we used new variants of the network architecture introduced in [6], MultiBox also started to perform much better; Now, we could match the coverage of alternative methods with half as many proposal patches. Also, we changed our networks to take the context of objects into account, fueling additional quality gains for the second phase. Furthermore, we came up with a new way to train deep networks to learn more robustly even when some objects are not annotated in the training set, which improved both phases.

Besides the significant gains in mean average precision, we can now cut the number of evaluated patches dramatically at a modest loss of quality: the task that used to take 2 minutes of processing time for a single image on a workstation by the GoogLeNet ensemble (of 6 networks), is now performed under a second using a single network without using GPUs. If we constrain ourselves to a single category like “dog”, we can now process 50 images/second on the same machine by a more streamlined approach^[7] that skips the proposal generation step altogether.

As a core area of research in computer vision, object detection is used for providing strong signals for photo and video search, while high quality detection could prove useful for self-driving cars and automatically generated image captions. We look forward to the continuing research in this field.

References:

[1] Rich feature hierarchies for accurate object detection and semantic segmentation
by Ross Girshick and Jeff Donahue and Trevor Darrell and Jitendra Malik (CVPR, 2014)

[2] Prime Object Proposals with Randomized Prim’s Algorithm
by Santiago Manen, Matthieu Guillaumin and Luc Van Gool

[3] Edge boxes: Locating object proposals from edges
by Lawrence C Zitnick, and Piotr Dollàr (ECCV 2014)

[4] BING: Binarized normed gradients for objectness estimation at 300fps
by Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin and Philip Torr (CVPR 2014)

[5] Scalable Object Detection using Deep Neural Networks
by Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov

[6] Going deeper with convolutions
by Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich

[7] Scalable, high quality object detection
by Christian Szegedy, Scott Reed, Dumitru Erhan and Dragomir Anguelov

[8] OverFeat: Integrated Recognition, Localization and Detection using Convolutional Network by Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus and Yann LeCun

* A PhD student at University of Michigan -- Ann Arbor and Software Engineering Intern at Google^?

Computer News

Pages

Saturday, October 1, 2016

How to measure translation quality in your user interfaces

Sunday, March 27, 2016

High Quality Object Detection at Scale

Popular Posts

Blog Archive