How to look up data and print values based on criteria in R?

I have a CSV file with 12 columns of data, and I want to pull specific values from it based on given criteria.
A snippet of the data is provided. I have this list of maps:
Maps <- c("Nuke","Vertigo","Inferno","Mirage","Train","Overpass","Dust2")
The goal is to get the CTWinProb and TWinProb values for each of the maps in the Maps list, e.g.
CTWinProbs:
Nuke = 0.5758
Dust2 = 0.4965
Inferno = 0.4885
and so on, and likewise for TWinProb.
So far I have been using the sqldf library, which is very tedious. This is what I am currently doing:
T1NukeCT <- sqldf("select CTWinProb from Team1 where MapName like '%Nuke%'")
which outputs T1NukeCT = 0.5758, and then repeating this for each map and again for TWinProb.
I am sure there is an easier way; I am quite new to R, so I am not sure of the best method here or how to do this in a less tedious manner.

You may use a WHERE IN (...) clause:
library(sqldf)

Maps <- c("Nuke","Vertigo","Inferno","Mirage","Train","Overpass","Dust2")
where_in <- paste0("('", paste(Maps, collapse = "','"), "')")
sql <- paste0("SELECT MapName, CTWinProb, TWinProb FROM Team1 WHERE MapName IN ", where_in)
T1WinProbs <- sqldf(sql)
To be clear, the SQL query generated by the above script is:
SELECT MapName, CTWinProb, TWinProb
FROM Team1
WHERE MapName IN ('Nuke','Vertigo','Inferno','Mirage','Train','Overpass','Dust2')
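If you want named lookups like the ones shown in the question, you can convert that result into named vectors with base R's setNames(); a small sketch, assuming the query above returned one row per map:
# Named lookup vectors: names are maps, values are probabilities
CTWinProbs <- setNames(T1WinProbs$CTWinProb, T1WinProbs$MapName)
TWinProbs <- setNames(T1WinProbs$TWinProb, T1WinProbs$MapName)
CTWinProbs["Nuke"]  # e.g. 0.5758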

What output/results are you looking for exactly?
If you want results in R, these are two simple functions that return the desired values.
They require the dplyr package (and readr, for read_csv) to be loaded.
library(dplyr)
library(readr)  # read_csv() comes from readr, not dplyr

YourData <- read_csv("yourfile.csv")

CTWinFunc <- function(x) {
  YourData %>% filter(MapName == x) %>% pull(CTWinProb)
}
TWinFunc <- function(x) {
  YourData %>% filter(MapName == x) %>% pull(TWinProb)
}
Now CTWinFunc("Nuke") should return the CTWinProb result for Nuke, i.e. 0.5758,
and TWinFunc("Nuke") should return the TWinProb result for Nuke, i.e. 0.4242.
If you want to return a vector with all the results together, I guess you could use the sapply() function. Something like this...
TWins <- sapply(Maps, TWinFunc)
TWins[lengths(TWins)==0] <- NA
TWins <- unlist(TWins)
And this should give you a table with the results:
cbind(Maps, TWins)
Of course, it seems like all this data is already in the original table and you could just subset that.
YourData[,c(4,11,12)]
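Alternatively, a single dplyr filter pulls all the maps at once without per-map functions; a sketch, assuming the same column names as above:
library(dplyr)

# All requested maps in one pass, keeping only the relevant columns
YourData %>%
  filter(MapName %in% Maps) %>%
  select(MapName, CTWinProb, TWinProb)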

Related

R function for looping over a string with unique values

I am working on a project where I have to download more than 10 million records on a relatively small server. So instead of just downloading the entire dataset, I have to download it in smaller sections. I am trying to create a loop that will call batches of the data based on date. I'm used to coding in Stata where you can call a local by using `x' or some variant within a string. However, I can't find a way to do this in R. Below is a small piece of the code I'm using. Basically, whenever I try to run this 'val' and 'val2' aren't updating with the dates in the defined lists so the output literally just reads as if the server is trying to search between 'val' and 'val2' instead of between '20190101' and '20190301'. Any suggestions for how to fix this are greatly appreciated!
x<-c(20190101, 20190301)
y<-c(20190301, 20190501)
foreach(val = x, val2 = y) %do% {
  data <- DBI::dbGetQuery(myconn, "SELECT * FROM .... WHERE (DATE BETWEEN 'val' AND 'val2')")
}
With a basic loop, paste the values into the query string:
x <- c(20190101, 20190301)
y <- c(20190301, 20190501)

data_all <- c()
for (i in seq_along(x)) {
  query <- paste0("SELECT * FROM .... WHERE (DATE BETWEEN '",
                  x[i], "' AND '", y[i], "')")
  data <- DBI::dbGetQuery(myconn, query)
  data_all <- rbind(data_all, data)
}
With sprintf you can construct the queries up front and use lapply + do.call to combine the results into one data frame.
x<-c(20190101, 20190301)
y<-c(20190301, 20190501)
input <- sprintf("SELECT * FROM .... WHERE (DATE BETWEEN '%s' AND '%s')", x, y)
result <- do.call(rbind, lapply(input, function(x) DBI::dbGetQuery(myconn, x)))
Using purrr::map_df is a bit shorter.
result <- purrr::map_df(input, ~DBI::dbGetQuery(myconn, .x))
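As an aside, many DBI backends also support parameterized queries, which avoids pasting values into the SQL string by hand; a sketch, assuming your driver accepts positional ? placeholders (the placeholder syntax varies by backend):
# One parameterized statement, executed once per date pair
sql <- "SELECT * FROM .... WHERE (DATE BETWEEN ? AND ?)"
result <- do.call(rbind, lapply(seq_along(x), function(i) {
  DBI::dbGetQuery(myconn, sql, params = list(x[i], y[i]))
}))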

Finding and replacing strings in cells of a matrix in R

I'm trying to process a survey, in which one of the questions asks the respondents to name a friend. Now I have a matrix like this:
I want to save these results in a relational database. I have assigned every person a unique ID, and want the answers to be saved as a list of IDs, so that the table looks like this:
My code so far:
I've tried
df$name %in% df$friends
which did not give any results. I'm now trying to use a for loop with str_detect:
friends <- df$friends
names <- df$name

for (i in 1:length(names)) {
  friends_called <- str_detect(friends, names[i])
  id_index <- grep(names[i], df$name)
  id <- df$id[id_index]
  for (j in 1:length(friends_called)) {
    if (friends_called[j] == TRUE) {
      df$friends_id[j] <- paste(df$friends_id[j], id, ",", sep = "")
    }
  }
}
df$friends <- df$friends_id
But I have some issues with it:
It's not working.
It uses two loops, which I'm used to from writing Python, but I read that I should avoid them in R.
The string matching needs to be fuzzy (if Anna wrote "Jon" instead of "John", it should still match).
Does anyone have suggestions on how to tackle this?
You can do this without a loop in the tidyverse as follows:
library(tidyverse)

df %>%
  mutate(friends = map(friends, ~ df %>%
    filter(str_detect(.x, name)) %>%
    select(id) %>%
    unlist() %>%
    paste(collapse = ',')))
gives
id name friends
1 a1d John b2e,c3f
2 b2e Anna a1d
3 c3f Denise
or with sapply (str_detect still comes from stringr, so this isn't pure base R):
df$friends <- sapply(df$friends, function(x) paste(df$id[str_detect(x, df$name)], collapse = ','))
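Neither version handles the fuzzy-matching requirement. Base R's agrepl() does approximate matching and could be swapped in for str_detect; a sketch, assuming an edit distance of 1 is acceptable:
# Approximate match: "Jon" in a friends string still matches the name "John"
fuzzy_ids <- function(x, max_dist = 1) {
  hits <- sapply(df$name, function(nm) agrepl(nm, x, max.distance = max_dist))
  paste(df$id[hits], collapse = ",")
}
df$friends_id <- sapply(df$friends, fuzzy_ids)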

How to build subset query using a loop in R?

I'm trying to subset a big table across a number of columns, so all the rows where State_2009, State_2010, State_2011 etc. do not equal the value "Unknown."
My instinct was to do something like this (coming from a JS background), where I either build the query in a loop or continually subset the data in a loop, referencing the year as a variable.
mysubset <- data
for (i in 2009:2016) {
  mysubset <- subset(mysubset, paste("State_", i, " != Unknown", sep = ""))
}
But this doesn't work, at least because paste returns a string, giving me the error 'subset' must be logical.
Is there a better way to do this?
Using dplyr with the filter_ function should get you the correct output:
library(dplyr)

mysubset <- data
for (i in 2009:2016) {
  mysubset <- mysubset %>%
    filter_(paste("State_", i, " != \"Unknown\"", sep = ""))
}
To add to Matt's answer, you could also do it like this:
cols <- paste0("State_", 2009:2016)
inds <- which(mysubset[, cols] == "Unknown", arr.ind = TRUE)[, 1]
# guard against inds being empty: x[-integer(0), ] would drop every row
if (length(inds) > 0) mysubset <- mysubset[-unique(inds), ]
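As a side note, filter_ has since been deprecated; with current dplyr the same filter can be written once across all the year columns using if_all(). A sketch, assuming every relevant column name starts with "State_":
library(dplyr)

# Keep rows where no State_* column equals "Unknown"
mysubset <- data %>%
  filter(if_all(starts_with("State_"), ~ .x != "Unknown"))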

R code to iterate through dataframe rows for google maps distance queries

I'm looking for some assistance in writing some R code to iterate through rows in a dataframe and pass the values in each row to a function and print the output either to an excel file, txt file or just in the console.
The purpose of this is to automate a bunch of distance/time queries (several hundred) to google maps using the function found at this website: http://www.nfactorialanalytics.com/r-vignette-for-the-week-finding-time-distance-between-two-places/
The function on that website is as follows:
library(XML)
library(RCurl)

distance2Points <- function(origin, destination) {
  results <- list()
  xml.url <- paste0('http://maps.googleapis.com/maps/api/distancematrix/xml?origins=', origin,
                    '&destinations=', destination, '&mode=driving&sensor=false')
  xmlfile <- xmlParse(getURL(xml.url))
  dist <- xmlValue(xmlChildren(xpathApply(xmlfile, "//distance")[[1]])$value)
  time <- xmlValue(xmlChildren(xpathApply(xmlfile, "//duration")[[1]])$value)
  distance <- as.numeric(sub(" km", "", dist))
  time <- as.numeric(time)/60    # seconds to minutes
  distance <- distance/1000      # metres to kilometres
  results[['time']] <- time
  results[['dist']] <- distance
  return(results)
}
The dataframe will contain two columns: origin postal code and destination postal code (Canada, eh?). I'm a beginner R programmer, so I know how to use read.table to load a txt file into a dataframe. I'm just not sure how to iterate through the dataframe, each time passing values to the distance2Points function and executing. I think this can be done using either a for loop or one of the apply calls?
Thanks for the help!
Edit:
To keep it simple, let's assume I want to turn these two vectors into a dataframe:
> a <- c("L5B4P2","L5B4P2")
> b <- c("M5E1E5", "A2N1T3")
> postcodetest <- data.frame(a,b)
> postcodetest
a b
1 L5B4P2 M5E1E5
2 L5B4P2 A2N1T3
How should I go about iterating over these two rows to return both distances and times from the distance2Points function?
Here's one way to do it, using lapply to produce a list with the results for each row in your data and using Reduce(rbind, [yourlist]) to concatenate that list into a data frame whose rows correspond to the ones in your original. To make this work, we also have to tweak the code in the original function to return a one-row data frame, so I've done that here.
distance2Points <- function(origin, destination) {
  require(XML)
  require(RCurl)
  xml.url <- paste0('http://maps.googleapis.com/maps/api/distancematrix/xml?origins=', origin,
                    '&destinations=', destination, '&mode=driving&sensor=false')
  xmlfile <- xmlParse(getURL(xml.url))
  dist <- xmlValue(xmlChildren(xpathApply(xmlfile, "//distance")[[1]])$value)
  time <- xmlValue(xmlChildren(xpathApply(xmlfile, "//duration")[[1]])$value)
  distance <- as.numeric(sub(" km", "", dist))
  time <- as.numeric(time)/60
  distance <- distance/1000
  # this gives you a one-row data frame instead of a list, b/c it's easy to rbind
  results <- data.frame(time = time, distance = distance)
  return(results)
}

# now apply that function rowwise to your data, using lapply, and roll the results
# into a single data frame using Reduce(rbind)
results <- Reduce(rbind, lapply(seq(nrow(postcodetest)), function(i)
  distance2Points(postcodetest$a[i], postcodetest$b[i])))
Result when applied to your sample data:
> results
time distance
1 27.06667 27.062
2 1797.80000 2369.311
If you would prefer to do this without creating a new object, you could also write separate functions for computing time and distance -- or a single function with those outputs as options -- and then use sapply or just mutate to create new columns in your original data frame. Here's how that might look using sapply:
distance2Points <- function(origin, destination, output) {
  require(XML)
  require(RCurl)
  xml.url <- paste0('http://maps.googleapis.com/maps/api/distancematrix/xml?origins=',
                    origin, '&destinations=', destination, '&mode=driving&sensor=false')
  xmlfile <- xmlParse(getURL(xml.url))
  if (output == "distance") {
    y <- xmlValue(xmlChildren(xpathApply(xmlfile, "//distance")[[1]])$value)
    y <- as.numeric(sub(" km", "", y))/1000
  } else if (output == "time") {
    y <- xmlValue(xmlChildren(xpathApply(xmlfile, "//duration")[[1]])$value)
    y <- as.numeric(y)/60
  } else {
    y <- NA
  }
  return(y)
}
postcodetest$distance <- sapply(seq(nrow(postcodetest)), function(i)
  distance2Points(postcodetest$a[i], postcodetest$b[i], "distance"))
postcodetest$time <- sapply(seq(nrow(postcodetest)), function(i)
  distance2Points(postcodetest$a[i], postcodetest$b[i], "time"))
And here's how you could do it in a dplyr pipe with mutate:
library(dplyr)

postcodetest <- postcodetest %>%
  mutate(distance = sapply(seq(nrow(postcodetest)), function(i)
           distance2Points(a[i], b[i], "distance")),
         time = sapply(seq(nrow(postcodetest)), function(i)
           distance2Points(a[i], b[i], "time")))
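A more compact alternative to indexing by row number is mapply, which walks the two columns in parallel; a sketch reusing the same function as above:
# mapply pairs up a[1] with b[1], a[2] with b[2], etc.
postcodetest$distance <- mapply(distance2Points, postcodetest$a, postcodetest$b,
                                MoreArgs = list(output = "distance"))
postcodetest$time <- mapply(distance2Points, postcodetest$a, postcodetest$b,
                            MoreArgs = list(output = "time"))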

R- How to do a loop on a list and output different dataframes

I'm attempting to create a loop in R that will use a vector of dates, run them through a loop that includes a SQL query, and then generate a separate dataframe for each output. Here is as far as I've gotten:
library(RODBC)

dvect <- as.Date("2015-04-13") + 0:2
d <- list()
for (i in list(dvect)) {
  queryData <- sqlQuery(myconn, paste("SELECT
    WQ_hour,
    sum(calls) as calls
    FROM database
    WHERE DDATE = '", i, "'
    GROUP BY 1
    ", sep = ""))
  d[i] <- rbind(d, queryData)
}
From what I can tell, the query portion of the code runs fine, since I've tested it by itself. Where I'm stumbling is the last line, where I try to save the output of each pass through the loop separately, each labelled with the date that was used in that iteration.
I'd appreciate any help. I've only been using R consistently for about 2 months now so I'm definitely open to alternative ways of doing this that are cleaner and more efficient.
Thanks.
I'd suggest making the SQL query a function, and use lapply to apply it and return your result as a list.
userSQLquery <- function(i) {
  sqlQuery(myconn, paste("SELECT
    WQ_hour,
    sum(calls) as calls
    FROM database
    WHERE DDATE = '", i, "'
    GROUP BY 1
    ", sep = ""))
}

dvect <- as.Date("2015-04-13") + 0:2
# pass the dates themselves (not their indices) to the query,
# and name the list elements by date
d <- setNames(as.list(dvect), dvect)
results <- lapply(d, userSQLquery)
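Because the list elements are named by date, each day's query output can then be pulled by its date, e.g.:
results[["2015-04-13"]]  # query output for the first date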
I have very little experience with SQL though, so this may not work. Maybe it could start you off?
This looks like a job for lapply (see the lapply documentation) instead of a for loop. (In R it's often good to avoid a for loop by using vectorized or apply-style code.)
If you want each date to return a separate data frame, and then have each data frame labelled with the original date, try:
dates <- c("Jan 1", "Oct 31", "Dec 25")
queryData <- function(date){
#dummy data
return(runif(5))
}
results <- lapply(dates, queryData)
names(results) <- dates
Either use:
d[[i]] <- queryData
if you want each data.frame (query result) as a separate element in the list output d.
Or use:
d <- rbind(d, queryData)
if you want a single data.frame with all the query outputs combined. In this case you should declare d as a data.frame (i.e. d <- data.frame()).
You can also store each data.frame (i.e. the query result) with its corresponding date in a list as:
d[[i]] <- list(date = dvect[[i]], queryResult = queryData)
I think the last one is what you are looking for.
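Putting that last option together, a minimal sketch of the corrected loop (reusing myconn, dvect, and the query from the question):
d <- list()
for (i in seq_along(dvect)) {
  queryData <- sqlQuery(myconn, paste0(
    "SELECT WQ_hour, sum(calls) as calls FROM database WHERE DDATE = '",
    dvect[i], "' GROUP BY 1"))
  # store each result together with its date, keyed by position
  d[[i]] <- list(date = dvect[[i]], queryResult = queryData)
}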
