Translating 1GB of text to English - google-translate

I'm looking for a language translation API/solution that would fit my use case.
My use case is the following:
The data is 1 GB of free-form, unstructured text written mostly in the world's common languages (French, Spanish, German, Russian, Korean). The language of each piece of text is known.
We can assume the text is grammatically correct and consists of complete sentences, but contains some uncommon words such as chemical compound names.
The text has to be translated to English.
The solution must be at least 10x cheaper than Google Translate which charges $20 per 1M characters.
I would be willing to trade some of Google's quality for cost-effectiveness. Google, Yahoo, Microsoft, Yandex, Online-Translator.com are all good enough, just too expensive.
I've got a 16 CPU machine at my disposal so offline translation is an option too.
Any suggestions?

For your volumes, Machine Translation prices range from $3 to $25 per 1M characters (with some outliers like ModernMT, which prices per 1,000 words instead).
If you want to trade off a little bit of quality, you may pick what we call "optimal engines": the ones that are within the top 5% by performance but have the lowest price.
You may find more details in our Machine Translation report from July 2018.
Then, you need to know which engines support your language pairs and deal with their APIs, request limits and quotas.
You may use Intento API to get a list of engines for your language pairs.
You may then use this API in async mode, and Intento takes care of all the limits. However, I am not sure it will deal with a 1 GB file in one go, but you're welcome to try.
To avoid tinkering with the API requests, I would suggest using the CLI.
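To put the budget constraint in perspective, here is a quick back-of-the-envelope calculation. It assumes roughly 1 byte per character (true for mostly-Latin text; Russian and Korean in UTF-8 take 2-3 bytes per character, so this overestimates their character counts):

```python
# Rough cost comparison for translating ~1 GB of text.
GB = 1_000_000_000
chars = GB  # assume ~1 byte per character, so ~1e9 characters

def cost(price_per_million_chars, n_chars=chars):
    """Total price in dollars at a given per-1M-characters rate."""
    return price_per_million_chars * n_chars / 1_000_000

google = cost(20)     # Google Translate: $20 per 1M characters
budget = google / 10  # the "at least 10x cheaper" target

print(f"Google Translate: ${google:,.0f}")  # $20,000
print(f"Target budget:    ${budget:,.0f}")  # $2,000
```

So the question is really whether any engine delivers acceptable quality at roughly $2 per 1M characters or less, which is why the low end of the price range above matters.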

Related

How to correctly store and process different measure units of ingredients in food service CMS?

I have a backend system for restaurants & food services. On the front end, people use a huge variety of measure units for recipes and such (kilo, cup, gallon, pint, small spoon, big spoon, pinch of salt, dash of pepper and so on). But I need to convert all that to something small and indivisible to do calculations, reports and analytics on the backend. However, to show info and reports to the user, I have to convert everything back to the 'user-friendly' measure units.
This can lead to errors in the numbers (round-off errors and such).
What is the best practice for dealing with that? Thank you!
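One common practice is to store every quantity in a single small base unit using exact arithmetic, and only round when formatting for display. A minimal sketch, assuming a gram base unit and purely illustrative conversion factors:

```python
from fractions import Fraction

# Store everything in one base unit (here: grams), using exact Fractions
# so intermediate sums never accumulate float round-off. The factors below
# are illustrative, not authoritative.
TO_GRAMS = {
    "kg":    Fraction(1000),
    "g":     Fraction(1),
    "pinch": Fraction(3, 10),  # ~0.3 g, purely illustrative
}

def to_base(amount, unit):
    """User-facing amount -> exact amount in grams (for storage/math)."""
    return Fraction(amount) * TO_GRAMS[unit]

def from_base(grams, unit, places=2):
    """Base amount -> rounded user-facing number, for display only."""
    return round(float(grams / TO_GRAMS[unit]), places)

total = to_base(2, "kg") + to_base(500, "g")  # exact: 2500 g
print(from_base(total, "kg"))  # 2.5
```

The key design choice is that rounding happens exactly once, at the display boundary; the stored and computed values stay exact, so reports and analytics all agree.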

eBay API: Get all items currently in auction

For a university project (Big Data lecture), I'd like to analyze auctions on eBay. I wasn't able to find reliable information so far on whether it's possible to get all current auctions on eBay via their API or not. I only need the auction title and the current price, and I am aware that this is a huge load of data, but I'm just curious.
I don't think it's possible, in part because of the huge amount of data, and perhaps also because I don't think eBay wants people downloading data en masse like that. Doing so might allow people to do data mining and market research from a vantage point that is too publicly revealing for them.
If you're willing to settle for a large segment of data, look into eBay's Large Merchant Services and their LMS API.
For your research project, you should be able to make sense of an even smaller subset of data by just pulling from eBay's Finding API in a few automated large chunks.
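For the Finding API route, a request is just a GET with a handful of query parameters. A sketch of building one (the endpoint and parameter names follow eBay's public Finding API docs, but `YOUR_APP_ID` is a placeholder you get from the eBay developer program):

```python
from urllib.parse import urlencode

# Builds a findItemsByKeywords request URL for eBay's Finding API.
ENDPOINT = "https://svcs.ebay.com/services/search/FindingService/v1"

def finding_url(keywords, page=1, per_page=100):
    params = {
        "OPERATION-NAME": "findItemsByKeywords",
        "SERVICE-VERSION": "1.0.0",
        "SECURITY-APPNAME": "YOUR_APP_ID",   # placeholder app ID
        "RESPONSE-DATA-FORMAT": "JSON",
        "keywords": keywords,
        "paginationInput.entriesPerPage": per_page,
        "paginationInput.pageNumber": page,
    }
    return ENDPOINT + "?" + urlencode(params)

print(finding_url("lcd tv"))
```

You would then fetch each URL (e.g. with `requests.get(url).json()`), pulling the item title and current price from the response, and loop `page` upward until the reported page count is exhausted; note that eBay caps how deep pagination can go, which is another reason "all auctions" isn't realistic.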

Parsing RTF files into R?

Couldn't find much support for this for R. I'm trying to read a number of RTF files into R to construct a data frame, but I'm struggling to find a good way to parse the RTF file and ignore the structure/formatting of the file. There are really only two lines of text I want to pull from each file -- but it's nested within the structure of the file.
I've pasted a sample RTF file below. The two strings I'd like to capture are:
"Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products"
"The technology level [...] and managerial implications." (the full paragraph)
Any thoughts on how to efficiently parse this? I think regular expressions might help me, but I'm struggling to form the right expression to get the job done.
{\rtf1\ansi\ansicpg1252\cocoartf1265
{\fonttbl\f0\fswiss\fcharset0 ArialMT;\f1\froman\fcharset0 Times-Roman;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red109\green109\blue109;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\deftab720
\itap1\trowd \taflags0 \trgaph108\trleft-108 \trbrdrt\brdrnil \trbrdrl\brdrnil \trbrdrt\brdrnil \trbrdrr\brdrnil
\clvertalt \clshdrawnil \clwWidth15680\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx8640
\itap2\trowd \taflags0 \trgaph108\trleft-108 \trbrdrt\brdrnil \trbrdrl\brdrnil \trbrdrt\brdrnil \trbrdrr\brdrnil
\clmgf \clvertalt \clshdrawnil \clwWidth14840\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx4320
\clmrg \clvertalt \clshdrawnil \clwWidth14840\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx8640
\pard\intbl\itap2\pardeftab720
\f0\b\fs26 \cf0 Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products\nestcell
\pard\intbl\itap2\nestcell \lastrow\nestrow
\pard\intbl\itap1\pardeftab720
\f1\b0\fs24 \cf0 \
\pard\intbl\itap1\pardeftab720
\f0\fs26 \cf0 The technology level of new high-tech durable products, such as digital cameras and LCD-TVs, continues to go up, while prices continue to go down. Consumers may anticipate these trends. In particular, a consumer faces several options. The first is to buy the current level of technology at the current price. The second is not to buy and stick with the currently owned (old) level of technology. Hence, the consumer postpones the purchase and later on buys the same level of technology at a lower price, or better technology at the same price. We develop a new model to describe consumers\'92 decisions with respect to buying these products. Our model is built on the theory of consumer expectations of price and the well-known utility maximizing framework. Since not every consumer responds the same, we allow for observed and unobserved consumer heterogeneity. We calibrate our model on a panel of several thousand consumers. We have information on the currently owned technology and on purchases in several categories of high-tech durables. Our model provides new insights in these product markets and managerial implications.\cell \lastrow\row
\pard\pardeftab720
\f1\fs24 \cf0 \
}
1) A simple way if you are on Windows is to read it in using WordPad or Word and then save it as a plain text document.
2) Alternately, to parse it directly in R: read in the rtf file, find the lines matching the pattern pat, producing g. Then replace any \\' escape sequences with single quotes, producing noq. Finally remove pat and any trailing junk. This works on the sample, but you might need to revise the patterns if there are embedded \\ strings other than the \\' we already handle:
Lines <- readLines("myfile.rtf")       # raw RTF lines
pat <- "^\\\\f0.*\\\\cf0 "             # lines whose text follows an \f0 ... \cf0 prefix
g <- grep(pat, Lines, value = TRUE)    # keep only the matching lines
noq <- gsub("\\\\'", "'", g)           # turn \' escapes into plain quotes
sub("\\\\.*", "", sub(pat, "", noq))   # strip the prefix and trailing RTF control words
For the indicated file this is the output:
[1] "Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products"
[2] "The technology level of new high-tech durable products, such as digital cameras and LCD-TVs, continues to go up, while prices continue to go down. Consumers may anticipate these trends. In particular, a consumer faces several options. The first is to buy the current level of technology at the current price. The second is not to buy and stick with the currently owned (old) level of technology. Hence, the consumer postpones the purchase and later on buys the same level of technology at a lower price, or better technology at the same price. We develop a new model to describe consumers'92 decisions with respect to buying these products. Our model is built on the theory of consumer expectations of price and the well-known utility maximizing framework. Since not every consumer responds the same, we allow for observed and unobserved consumer heterogeneity. We calibrate our model on a panel of several thousand consumers. We have information on the currently owned technology and on purchases in several categories of high-tech durables. Our model provides new insights in these product markets and managerial implications."

Is it good idea to use google translation API for a few words?

I'm making websites for German clients at the moment, and I would like to make things easier for my German colleague by having all texts in my CMS translated automatically. I would only need to ask the API for each translation once, because I'm saving the translations in a database.
So I found out that Google charges $20 per 1M characters, but I would only need a few words translated now and then, something like 1,000 a week max.
So I'd like to know if there are any restrictions, such as a minimum number of translations, or a minimum price for those who don't fully use the API.
You can still use Google Translate for free through YQL
I think billing starts from 1M characters only. In any case, you can't fully trust the translation: it gets confused in many cases and can corrupt the meaning of a line or sentence, so run more tests and think again before using it.
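The ask-once-then-cache approach the question describes can be sketched like this. The cache is SQLite, and the paid API is only hit on a cache miss; `call_api` is an injected stand-in for a real translation request (e.g. a POST to Google's Translate v2 endpoint), so the caching logic can be tested without a network:

```python
import sqlite3

# Cache table keyed by (source text, target language).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE IF NOT EXISTS tr "
    "(src TEXT, lang TEXT, out TEXT, PRIMARY KEY (src, lang))"
)

def translate(text, target, call_api):
    row = db.execute(
        "SELECT out FROM tr WHERE src=? AND lang=?", (text, target)
    ).fetchone()
    if row:
        return row[0]                      # cached: no API charge
    out = call_api(text, target)           # paid call, happens once per text
    db.execute("INSERT INTO tr VALUES (?,?,?)", (text, target, out))
    return out

calls = []
fake_api = lambda t, l: (calls.append(t), t.upper())[1]  # pretend translation
translate("hallo", "en", fake_api)
translate("hallo", "en", fake_api)  # second call hits the cache
print(len(calls))                   # 1: the API was only charged once
```

At ~1,000 words a week, the cached volume stays far below 1M characters, so the question of minimum billing tiers matters more than the per-character rate.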

Industry benchmarks for site search

We have been asked to increase the performance of a client's site search. Before we start, we would like to set benchmarks. I have asked the client if they are comfortable with enabling anonymous data sharing so we have access to industry benchmarks, as I don't have control over this setting: http://support.google.com/analytics/bin/answer.py?hl=en&answer=1011397 However, it sounds like things have changed in the Google Analytics camp and these reports are only available via a newsletter now? Is this true?
Also, will these reports give me industry standards to compare my clients current search performance against? Or is there another service that has these baseline standards available?
Here's an example of the data we are interested in. This is our client's current search performance:
Visits with Search: 772
Total Unique Searches: 1,093
Results Page Views/Search: 1.36
% Search Exits: 56.45%
% Search Refinements: 24.78%
Time after Search: 00:01:40
Search Depth: 0.59
I work at a large ecommerce site, and I asked our AdWords rep about this, having recently wanted access to this kind of data myself.
He said that benchmarking was removed 3/15/11, at which point they were experimenting with a monthly newsletter format to deliver the same kind of data.
From what I've seen they may have done one newsletter before (quietly) retiring it completely. I never saw the newsletter, but I think I remember reading reports of people who did receive one.
Disappointing to know they had access to all that data but pulled the plug on the program. I wonder if they killed it due to data integrity concerns--they can't guarantee correct tracking-code installations on all the sites opting in, so what is the data worth if it's of questionable quality? I dunno... just a total guess.
We used to use coremetrics here, and they had an opt-in benchmarking program. So if you know any other webmasters using Coremetrics, you could probably ask them to pull some benchmarking info.
We were able to get some benchmarking data from fireclick.com, but none of it (that I've seen anyways) covers on-site search. Mainly just top line metrics. :-/
So the search for benchmark data continues...
