Matching columns of two data sets - R

I've been assigned this problem where I need to match mobile apps and publishers between two data sets (one is Google Play, the other is iTunes).
Here is a description of the variables used in the iTunes data set (the Google Play data set variable names are similar or the same).
anon_ios_app_id: anonymized iOS app ID
anon_ios_publisher_id: anonymized iOS publisher ID
points: the “worth” of the match; 10 points is the highest worth and 0.5 the lowest
ios_name: name of the mobile app in the iTunes Store
ios_publisher_name: name of the publisher of the app in the iTunes Store
category_name: the category of the app
type: Game or Non-game
I've done some analysis to look for apps in the two data sets that share the same name and publisher. As an example, I searched for apps that had "Walmart" in their names.
GooglePlay <- read.csv("...\\GooglePlay.csv", header = TRUE)
iTunes <- read.csv("...\\iTunes.csv", header = TRUE)
grep("Walmart", iTunes$ios_name)
[1] 41203 51026 63522 64330 112441 113516 115510 117588 117788 119558 119605 120002 165514 248817
[15] 277425 290010 463244 546799 565806
grep("Walmart", GooglePlay$gp_name)
[1] 154 31984 162284 162342 162792 168722 168774 169339 325520 325601 357122 360050 436084 437144
[15] 441458 447177 503260
During my analysis, I did find that some apps had the same name and publisher in both data sets. For example:
GooglePlay$gp_name[154]
[1] Walmart Photo
GooglePlay$gp_publisher_name[154]
[1] Kodak Alaris Inc.
iTunes$ios_name[165514]
[1] Walmart Photo
iTunes$ios_publisher_name[165514]
[1] Kodak Alaris Inc.
My objectives are: 1. Provide one unified file with all the respective IDs/names of the matched apps/publishers.
2. Provide one number: the SUM(iOS points + GP points) for the matched apps.
What functions should I use to match apps and publishers from the two data sets? How do I make a unified file of those matches?
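If the app and publisher names match exactly between the two files, base R's merge() can do both jobs. A minimal sketch, assuming the column names shown above and that both data sets have a points column (both assumptions on my part; adjust to your real names):
matched <- merge(GooglePlay, iTunes,
                 by.x = c("gp_name", "gp_publisher_name"),
                 by.y = c("ios_name", "ios_publisher_name"))
# Objective 1: one unified file of the matched apps/publishers
write.csv(matched, "matched_apps.csv", row.names = FALSE)
# Objective 2: SUM(iOS points + GP points); merge() suffixes duplicated
# column names with .x (GooglePlay) and .y (iTunes)
sum(matched$points.x + matched$points.y)
Exact string matching will miss apps whose names differ slightly between the two stores; agrep() or the stringdist package would be the next step for fuzzy matches, but merge() gives you the unified file and the points sum for the exact ones.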

Related

automate input to user query in R

I apologize if this question has been asked before with terminology I don't recognize, but it doesn't appear to have been.
I am using the function comm2sci in the taxize package to search for the scientific names for a database of over 120,000 rows of common names. Here is a subset of 10:
commnames <- c("WESTERN CAPERCAILLIE", "AARDVARK", "AARDWOLF", "ABACO ISLAND BOA",
"ABBOTT'S DAY GECKO", "ABDIM'S STORK", "ABRONIA GRAMINEA", "ABYSSINIAN BLUE
WINGED GOOSE",
"ABYSSINIAN CAT", "ABYSSINIAN GROUND HORNBILL")
When searching the NCBI database with this function, it asks for user input if the common name is generic/general and not species-specific. For example, the following call will ask for clarification for "AARDVARK", expecting '1', '2', or 'return' for 'NA'.
install.packages("taxize")
library(taxize)
ncbioutput <- comm2sci(commnames, db = "ncbi")  # querying the NCBI database
Because of this, I cannot rely on this function to find the names of the 120,000 species without sitting there and pressing 'return' every few minutes. I know this question sounds taxize-specific, but I've had this situation in the past with other functions as well. My question is: is there a general way to place the comm2sci call in a conditional statement that will return a specific value when user input is prompted? Or otherwise write a function that will return some input when prompted?
All searches related to this tell me how to ask for user input but not how to override user queries. These are two of the question threads I've found, but I can't seem to apply them to my situation: Make R wait for console input?, Switch R script from non-interactive to interactive
I hope this was clear. Thank you very much for your time!
The get_* functions, used internally, all ask for user input by default when there is more than one option. But each of those functions has a sister function with a trailing underscore, e.g. get_uid_, that does not prompt for input and returns all the data. You can use that to get all the data, then process it however you like.
Made some changes to comm2sci, so update first: devtools::install_github("ropensci/taxize")
Here's an example.
library(taxize)
commnames <- c("WESTERN CAPERCAILLIE", "AARDVARK", "AARDWOLF", "ABACO ISLAND BOA",
"ABBOTT'S DAY GECKO", "ABDIM'S STORK", "ABRONIA GRAMINEA",
"ABYSSINIAN BLUE WINGED GOOSE",
"ABYSSINIAN CAT", "ABYSSINIAN GROUND HORNBILL")
Then use get_uid_ to get all the data:
ids <- get_uid_(commnames)
Process the results in ids however you like. Here, for brevity, we'll just grab the first row of each:
ids <- lapply(ids, function(z) z[1,])
Then pull the UIDs out:
ids <- as.uid(unname(vapply(ids, "[[", "", "uid")), check = FALSE)
And pass them to comm2sci:
comm2sci(ids)
$`100830`
[1] "Tetrao urogallus"
$`9818`
[1] "Orycteropus afer"
$`9680`
[1] "Proteles cristatus"
$`51745`
[1] "Chilabothrus exsul"
$`8565`
[1] "Gekko"
$`39789`
[1] "Ciconia abdimii"
$`278977`
[1] "Abronia graminea"
$`8865`
[1] "Cyanochen cyanopterus"
$`9685`
[1] "Felis catus"
$`153643`
[1] "Bucorvus abyssinicus"
Note that NCBI returns common names from get_uid/get_uid_, so you can just go ahead and pluck those out if you want.
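If you want to pluck those common names out, something along these lines should work; a sketch, assuming the data frames returned by get_uid_ carry a commonname column (check the column names of one element of the get_uid_ result on your taxize version first):
raw <- get_uid_(commnames)   # same call as above, kept un-collapsed this time
# take the first common name per query, if the column is present
common <- vapply(raw, function(z) {
  if (!is.null(z) && nrow(z) > 0 && "commonname" %in% names(z)) z$commonname[1] else NA_character_
}, character(1))
common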

Can R download text from inside this sort of web app?

I'm trying to download text, such as company profiles, from this website:
http://www.evca.eu/about-evca/members/member-search/#lsearch
In the past I've had good success with similar tasks using, for example, the XML package, but that won't work here because the data I'm trying to grab is loaded dynamically and the individual elements in the list don't have their own URLs.
Unfortunately I don't know much about web design, so I'm not really sure how to approach this. Any suggestions? It would really suck to do this manually. Thanks.
First download Fiddler Web Debugger or some similar tool. It places itself between your browser and the web server, so you can see what is going on (including dynamic/AJAX communication).
Run it, go to the website you are trying to scrape, and perform the actions you want to automate.
For example, if you open http://www.evca.eu/about-evca/members/member-search/#lsearch, enter "a" in the search box and then choose "All" (to get all results), you will see in Fiddler that the browser opens the URL http://www.evca.eu/umbraco/Surface/MemberSearchPage/HandleSearchForm?page=1&rpp=999999 and sends the body "Company=a&MemberType=&Country=&X-Requested-With=XMLHttpRequest".
You can do the same with R, parse the result, get some text, maybe some links to other stuff.
The R code below does the same as described above:
library(httr)
library(stringr)

r <- POST("http://www.evca.eu/umbraco/Surface/MemberSearchPage/HandleSearchForm?page=1&rpp=999999",
          body = "Company=a&MemberType=&Country=&X-Requested-With=XMLHttpRequest")
stop_for_status(r)
txt <- content(r, "text")

matches <- str_match_all(txt, "Full company details.*?</h2>")
# strip the "Full company details" label, whitespace and HTML tags from the matches
companies <- gsub("(Full company details)|\t|\n|\r|<[^>]+>", '', matches[[1]])
# remove leading spaces
companies <- gsub("^[ ]+", '', companies)
Result:
> length(companies)
[1] 1148
> head(companies)
[,1]
[1,] "350 Investment Partners"
[2,] "350 Investment Partners LLP"
[3,] "360° Capital Management SA"
[4,] "360° Capital Partners France - Advisory Company"
[5,] "360° Capital Partners Italia - Advisory Company"
[6,] "3i Deutschland Gesellschaft für Industriebeteiligungen mbH"

How to connect data dictionaries to the unlabeled data

I'm working with some large government data sets from the Department of Transportation that are available as tab-delimited text files accompanied by data dictionaries. For example, the auto complaints file is 670 MB of unlabeled data (when unzipped), and comes with a dictionary. Here are some excerpts:
Last updated: April 24, 2014
FIELDS:
=======
Field# Name Type/Size Description
------ --------- --------- --------------------------------------
1 CMPLID CHAR(9) NHTSA'S INTERNAL UNIQUE SEQUENCE NUMBER.
IS AN UPDATEABLE FIELD,THUS DATA FOR A
GIVEN RECORD POTENTIALLY COULD CHANGE FROM
ONE DATA OUTPUT FILE TO THE NEXT.
2 ODINO CHAR(9) NHTSA'S INTERNAL REFERENCE NUMBER.
THIS NUMBER MAY BE REPEATED FOR
MULTIPLE COMPONENTS.
ALSO, IF LDATE IS PRIOR TO DEC 15, 2002,
THIS NUMBER MAY BE REPEATED FOR MULTIPLE
PRODUCTS OWNED BY THE SAME COMPLAINANT.
Some of the fields have foreign keys listed like so:
21 CMPL_TYPE CHAR(4) SOURCE OF COMPLAINT CODE:
CAG =CONSUMER ACTION GROUP
CON =FORWARDED FROM A CONGRESSIONAL OFFICE
DP =DEFECT PETITION,RESULT OF A DEFECT PETITION
EVOQ =HOTLINE VOQ
EWR =EARLY WARNING REPORTING
INS =INSURANCE COMPANY
IVOQ =NHTSA WEB SITE
LETR =CONSUMER LETTER
MAVQ =NHTSA MOBILE APP
MIVQ =NHTSA MOBILE APP
MVOQ =OPTICAL MARKED VOQ
RC =RECALL COMPLAINT,RESULT OF A RECALL INVESTIGATION
RP =RECALL PETITION,RESULT OF A RECALL PETITION
SVOQ =PORTABLE SAFETY COMPLAINT FORM (PDF)
VOQ =NHTSA VEHICLE OWNERS QUESTIONNAIRE
There are import instructions for Microsoft Access, which I don't have and would not use if I did. But I THINK this data dictionary was meant to be machine-readable.
My question: is this data dictionary a standard format of some kind? I've tried to Google around, but it's hard to do so without the right terminology. I would like to import the data into R, though I'm flexible as long as it can be done programmatically.
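I don't recognize it as a formal standard; it looks like a hand-formatted fixed-width description. But it is regular enough that you can scrape the field names out with a regex and hand them to read.delim() as column names. A rough sketch, with placeholder file names ("CMPL_dictionary.txt", "CMPL.txt") and a pattern you would almost certainly need to tweak against the full dictionary:
dict_lines <- readLines("CMPL_dictionary.txt")
# keep only lines that open a field definition, e.g. "  1  CMPLID  CHAR(9)  ..."
field_lines <- grep("^\\s*\\d+\\s+[A-Z_]+\\s+", dict_lines, value = TRUE)
field_names <- sub("^\\s*\\d+\\s+([A-Z_]+)\\s+.*$", "\\1", field_lines)
# the data file itself is tab-delimited, so only the names are needed
complaints <- read.delim("CMPL.txt", header = FALSE, col.names = field_names,
                         quote = "", stringsAsFactors = FALSE)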

Collect tweets with their related tweeters

I am doing text mining on tweets. I have collected random tweets from different accounts about some topic, transformed the tweets into a data frame, and was able to find the most frequent tweeters among those tweets (by using the column "screenName"). Here are some of the tweets:
[1] "ISCSP_ORG: #cybercrime NetSafe publishes guide to phishing:
Auckland, Monday 04 June 2013 – Most New Zealanders will have...
http://t.co/dFLyOO0Djf"
[1] "ISCSP_ORG: #cybercrime Business Briefs: MILL CREEK — H.M. Jackson
High School DECA chapter members earned the organizatio...
http://t.co/auqL6mP7AQ"
[1] "BNDarticles: How do you protect your #smallbiz from #cybercrime?
Here are the top 3 new ways they get in & how to stop them.
http://t.co/DME9q30mcu"
[1] "TweetMoNowNa: RT #jamescollinss: #senatormbishop It's the same
problem I've been having in my fight against #cybercrime. \"Vested
Interests\" - Tell me if …"
[1] "jamescollinss: #senatormbishop It's the same problem I've been
having in my fight against #cybercrime. \"Vested Interests\" - Tell me
if you work out a way!"
Different tweeters have sent many tweets in the collected data set.
Now I want to collect/group the related tweets by their corresponding tweeter/user.
Is there any way to do this using R? Any suggestions? Your help would be very much appreciated.
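If the tweets are already in a data frame with a screenName column (as described above), base R's split() groups them per tweeter. A minimal sketch; tweets_df and the text column name are assumptions on my part (twitteR's twListToDF() typically names the tweet text column "text"):
# group the tweet text by the account that sent it
by_user <- split(tweets_df$text, tweets_df$screenName)
# e.g. all collected tweets from one account
by_user[["ISCSP_ORG"]]
# or count how many tweets each tweeter contributed
sort(table(tweets_df$screenName), decreasing = TRUE)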

How to control the echo width using Sweave

I have a problem with the width of the echoed output within Sweave: I have a list containing a large amount of text, and the echoed response from R runs off the page in the PDF. I have tried using
<<>>=
options(width=40)
@
but this has not changed anything.
An example: set up the list (not shown in the LaTeX output).
<<echo=FALSE>>=
my_list <- list(example="Site location was fixed using a Silvia Navigator handheld GPS in October 2003. Point of reference used was the station Bench Mark. If the bench mark location was remote from the site then the point of reference used was changed to the 0-1 metre gauge. Bench Mark location was then recorded as a separate entry in the Site History section [but not used as the site location].\r\nFor a Station location map and all digital photograph's of the station, river reach, and site details see H:\\hyd\\dat\\doc. For non digital photo's taken prior to October 2003 please see the relevant station file at Tumut office.")
@
And show the entry of the list.
<<>>=
my_list
@
Is there any way that I can get this to work without having to break up the list with cat statements?
You can use capture.output() to capture the printed representation of the list and then use writeLines() and strwrap() to display this output, nicely wrapped. As capture.output() returns a vector of strings containing the printed representation of the object, we can cat each of them to the screen/page but wrapped using strwrap(). The benefit of this approach is that the result looks like it was printed by R. Here's the solution:
writeLines(strwrap(capture.output(my_list)))
which produces:
$example
[1] "Site location was fixed using a Silvia Navigator
handheld GPS in October 2003. Point of reference used
was the station Bench Mark. If the bench mark location
was remote from the site then the point of reference used
was changed to the 0-1 metre gauge. Bench Mark location
was then recorded as a separate entry in the Site History
section [but not used as the site location].\r\nFor a
Station location map and all digital photograph's of the
station, river reach, and site details see
H:\\hyd\\dat\\doc. For non digital photo's taken prior
to October 2003 please see the relevant station file at
Tumut office."
From a 2010 posting to R-help by Mark Schwartz:
cat(paste(strwrap(x, width = 70), collapse = "\\\\\n"), "\n")

Resources