Subset function does not evaluate arguments - r

Using the below data frame called 'data', I'm able to directly assign values to two variables, 'state' and 'measure', and identify the school with the lowest score within the subset:
Create dataframe 'data':
school<-c("NYU", "BYU", "USC", "FIT", "Oswego","UCLA","USF","Columbia")
state<-c("NY","UT","CA","NY","NY","CA", "CA","NY")
measure<-c("MSAT","MSAT","GPA","MSAT","MSAT","GPA","GPA","GPA")
score<-c(590, 490, 2.9, 759, 550, 1.2, 3.1, 3.2)
data<-data.frame(school,state, measure,score)
Subset on a 'state' and 'measure':
answer<-subset(data,subset=(state=="NY" & measure=="MSAT"))
order.answer<-order(answer$score,answer$school) #answer$school is tie-breaker
answer1<-as.matrix(answer[order.answer,])
answer1[1,1]
This is the correct answer:
[1] "Oswego"
My problem is that when I create a function to accomplish the same thing, I get an incorrect result:
lowest <- function(state, measure){
answer<-subset(data,subset=(state==state & measure==measure))
order.answer<-order(answer$score,answer$school)
answer1<-as.matrix(answer[order.answer,])
answer1[1,1]
}
lowest("NY","MSAT")
Incorrect answer:
[1] "UCLA"
The problem seems to be that the variables 'state' and 'measure' don't take on the values of the arguments "NY" and "MSAT" in the subset line of the function. I've experimented with '=' instead of '==' and also tried subset(data,subset=(state=="state" & measure=="measure")), but can't find a solution.

The problem is that your function arguments have the same names as your data columns. Renaming the arguments makes it work:
lowest <- function(State, Measure){
answer<-subset(data,subset=(state==State & measure==Measure))
order.answer<-order(answer$score,answer$school)
answer1<-as.matrix(answer[order.answer,])
answer1[1,1]
}
##
> lowest("NY","MSAT")
[1] "Oswego"
UPDATE: After mulling this over a bit more, I think I can offer a little more detail about what is going on internally. Notice that your first (manual) subset worked correctly above:
> subset(data,subset=(state=="NY" & measure=="MSAT"))
school state measure score
1 NYU NY MSAT 590
4 FIT NY MSAT 759
5 Oswego NY MSAT 550
However, notice that if we create two objects in the global environment, state <- "NY" and measure <- "MSAT", this does not work
state <- "NY"
measure <- "MSAT"
> subset(data,state==state & measure==measure)
school state measure score
1 NYU NY MSAT 590.0
2 BYU UT MSAT 490.0
3 USC CA GPA 2.9
4 FIT NY MSAT 759.0
5 Oswego NY MSAT 550.0
6 UCLA CA GPA 1.2
7 USF CA GPA 3.1
8 Columbia NY GPA 3.2
The reason (I believe) has to do with R's scoping rules and how they operate within functions. When a function is called, R creates a temporary environment for that call, and the function's arguments and local variables live there. Names in that local frame take priority: if you have a variable x in the global environment and a function defined as foo <- function(x){do something interesting to x}, then "do something interesting" acts on the object passed in via the argument x, not on the object x in the global environment. R uses lexical scoping, meaning that if a name is not found in the local frame, the interpreter searches the chain of enclosing environments (the environment where the function was defined, then that environment's parent, and so on) until it finds the object. So if instead of foo <- function(x){do something interesting to x} you had foo <- function(z){do something interesting to x}, with x still defined in the global environment, the function would not raise an error; it would walk up the enclosing environments until it found the object x to "do something interesting" to.
In the subset() example above, subset(data,state==state & measure==measure) did not produce the desired results while subset(data,state=="NY" & measure=="MSAT") did. This is because subset() evaluates its condition with the columns of data searched first, i.e. the columns state and measure take priority over the objects state and measure in the calling environment (here, the global environment). The condition state==state & measure==measure therefore compares each column with itself, evaluates to TRUE for every row of data, and the "subset" returned is the entirety of data. Now, if we do
State <- "NY"
Measure <- "MSAT"
> subset(data, state==State & measure==Measure)
school state measure score
1 NYU NY MSAT 590
4 FIT NY MSAT 759
5 Oswego NY MSAT 550
this works fine: State and Measure are not columns of data, so the interpreter keeps searching, moving on to the environment from which subset() was called, until it first encounters these objects (in this case, it finds them in the global environment). This is why changing the arguments of your function lowest to State and Measure (and making the corresponding changes in the function body) produced the desired results. You could change them to just about anything, as long as the names do not clash with the column names of data; capitalizing the first letter was simply a quick fix.
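As a sketch of the two points above (not the only way to write it), the first part below shows that a free variable is looked up where the function was defined rather than where it was called, and the second part keeps the original argument names state and measure by referring to the columns explicitly as data$state and data$measure. The names foo and bar are made up for illustration.

```r
# Part 1: a free variable is resolved in the defining environment.
x <- "defining-environment x"
foo <- function(z) x              # x is not local to foo
bar <- function() {
  x <- "bar's x"                  # local to bar; foo never sees it
  foo(1)
}
bar()                             # "defining-environment x"

# Part 2: keep the argument names, but make the column references explicit.
data <- data.frame(
  school  = c("NYU", "BYU", "USC", "FIT", "Oswego", "UCLA", "USF", "Columbia"),
  state   = c("NY", "UT", "CA", "NY", "NY", "CA", "CA", "NY"),
  measure = c("MSAT", "MSAT", "GPA", "MSAT", "MSAT", "GPA", "GPA", "GPA"),
  score   = c(590, 490, 2.9, 759, 550, 1.2, 3.1, 3.2),
  stringsAsFactors = FALSE
)

lowest <- function(state, measure) {
  # data$state is the column; the bare name state is the function argument
  keep   <- data$state == state & data$measure == measure
  answer <- data[keep, ]
  answer <- answer[order(answer$score, answer$school), ]  # school breaks ties
  answer$school[1]
}

lowest("NY", "MSAT")              # "Oswego"
```

Plain logical indexing avoids non-standard evaluation altogether, so a clash between argument names and column names cannot change the result.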

Related

Geocoding in R using googleway

I have read Batch Geocoding with googleway R
I am attempting to geocode some addresses using googleway. I want the geocodes, address, and county returned back.
Using the answer linked to above I created the following function.
geocodes<-lapply(seq_along(res),function(x) {
coordinates<-res[[x]]$results$geometry$location
df<-as.data.frame(unlist(res[[x]]$results$address_components))
address<-paste(df[1,],df[2,],sep = " ")
city<-paste0(df[3,])
county<-paste0(df[4,])
state<-paste0(df[5,])
zip<-paste0(df[7,])
coordinates<-cbind(coordinates,address,city,county,state,zip)
coordinates<-as.data.frame(coordinates)
})
Then put it back together like so...
library(data.table)
done<-rbindlist(geocodes)
The issue is getting the address and county back out from the 'res' list. The answer linked to above pulls the address from the dataframe that was sent to google and assumes the list is in the right order and there are no multiple match results back from google (in my list there seems to be a couple). Point is, taking the addresses from one file and the coordinates from another seems rather reckless and since I need the county anyway, I need a way to pull it out of google's resulting list saved in 'res'.
The issue is that some addresses have more "types" than others which means referencing by row as I did above does not work.
I also tried including rbindlist inside the function to convert the sublist into a datatable and then pull out the fields but can't quite get it to work. The issue with this approach is that actual addresses are in a vector but the 'types' field which I would use to filter or select is in a sublist.
The best way I can describe it is like this -
list <- c(long address),c(short address), types(LIST(street number, route, county, etc.))
Obviously, I'm a beginner at this. I know there's a simpler way but I am just really struggling with lists and R seems to make extensive use of them.
Edit:
I definitely recognize that I cannot rbind the whole list. I need to pull specific elements out and bind just those. A big part of the problem, in my mind, is that I do not have a great handle on indexing and manipulating lists.
Here are some addresses to try - "301 Adams St, Friendship, WI 53934, USA" has a 7x3 "address components" matrix and a corresponding "types" list of 7. Compare that to "222 S Walnut St, Appleton, WI 45911, USA", which has a 9x3 address components matrix and a "types" list of 9. The types list needs to be connected back to the address components matrix because the types list identifies what each row of the address components matrix contains.
Then there are more complexities introduced by imperfect matches. Try "211 Grand Avenue, Rothschild, WI, 54474" and you get 2 lists, one for east grand ave and one for west grand ave. Google seems to prefer the east since that's what comes out in the "formatted address." I don't really care which is used since the county will be the same for either. The "location" interestingly contains 2 sets of geocodes which, presumably, refer to the two matches. I think this complexity can be ignored since the location consisting of two coordinates is still stored as a 'double' (not a list!) so it should stack with the coordinates for the other addresses.
Edit: This should really work but I'm getting an error in the do.call(rbind,types) line of the function.
geocodes<-lapply(seq_along(res),function(x) {
coordinates<-res[[x]]$results$geometry$location
types<-res[[x]]$results$address_components[[1]]$types
types<-do.call(rbind,types)
types<-types[,1]
address<-as.data.frame(res[[x]]$results$address_components[[1]]$long_name,stringsAsFactors=FALSE)
names(address)[1]<-"V2"
address<-cbind(address,types)
address<-tidyr::spread(address,types,V2)
address<-cbind(address,coordinates)
})
R says the "types" object is not a list so it can't rbind it. I tried coercing it to a list but still get the error. I checked using the following pared-down function and found that #294 is NULL. This halts the function. I get "over query limit" as an error but I am not over the query limit.
geocodes<-lapply(seq_along(res),function(x) {
types<-res[[x]]$results$address_components[[1]]$types
print(typeof(types))
})
Here's my solution using tidyverse functions. This gets the geocode and also the formatted address in case you want it (other components of the result can be returned as well; they just need to be added to the tibble that is returned in the last line of the map function).
suppressPackageStartupMessages(require(tidyverse))
suppressPackageStartupMessages(require(googleway))
set_key("your key here")
df <- tibble(full_address = c("2379 ADDISON BLVD HIGH POINT 27262",
"1751 W LEXINGTON AVE HIGH POINT 27262", "dljknbkjs"))
df %>%
mutate(geocode_result = map(full_address, function(full_address) {
res <- google_geocode(full_address)
if(res$status == "OK") {
geo <- geocode_coordinates(res) %>% as_tibble()
formatted_address <- geocode_address(res)
geocode <- bind_cols(geo, formatted_address = formatted_address)
}
else geocode <- tibble(lat = NA, lng = NA, formatted_address = NA)
return(geocode)
})) %>%
unnest(geocode_result)
#> # A tibble: 3 x 4
#> full_address lat lng formatted_address
#> <chr> <dbl> <dbl> <chr>
#> 1 2379 ADDISON BLVD HIGH POI… 36.0 -80.0 2379 Addison Blvd, High Point, N…
#> 2 1751 W LEXINGTON AVE HIGH … 36.0 -80.1 1751 W Lexington Ave, High Point…
#> 3 dljknbkjs NA NA <NA>
Created on 2019-04-14 by the reprex package (v0.2.1)
Ok, I'll answer it myself.
Begin with a dataframe of addresses. I called mine "addresses" and the single column in the dataframe is called "Address" (note that I capitalized it), which is the name the code below looks up.
Use googleway to get the geocode data. I did this using apply to loop across the rows in the address dataframe
library(googleway)
res<-apply(addresses,1,function (x){
google_geocode(address=x[['Address']], key='insert your google api key here - its free to get')
})
Here is the function I wrote to get the nested lists into a dataframe.
geocodes<-lapply(seq_along(res),function(x) {
coordinates<-res[[x]]$results$geometry$location
types<-res[[x]]$results$address_components[[1]]$types
types<-do.call(rbind,types)
types<-types[,1]
address<-as.data.frame(res[[x]]$results$address_components[[1]]$long_name,stringsAsFactors=FALSE)
names(address)[1]<-"V2"
address<-cbind(address,types)
address<-tidyr::spread(address,types,V2)
address<-cbind(address,coordinates)
})
library(data.table)
geocodes<-rbindlist(geocodes,fill=TRUE)
lapply loops along the items in the list. Within the function I create a coordinates dataframe and put the geocodes there. I also wanted the other address components, particularly the county, so I also extract "types", which identifies what each item in the address is. I cbind the address items with the types, then use spread from the tidyr package to reshape the dataframe into wide format so that each address is a single row. I then cbind in the lat and lng from the coordinates dataframe.
The rbindlist stacks it all back together. You could use do.call(rbind, geocodes) but rbindlist is faster.
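The NULL entry mentioned in the question (a response whose status is not "OK") stops the loop above. Below is a minimal sketch of one way to guard against that and to pick components by their type rather than by row position. It runs against hand-built mock responses, not real API output: the list structure is an assumption modelled on the fields used above, all values are made up, and the helpers get_component and extract_one are hypothetical names.

```r
# Mock of one successful response (structure assumed, values made up).
mock_components <- data.frame(
  long_name  = c("301", "Adams Street", "Friendship", "Adams County", "Wisconsin"),
  short_name = c("301", "Adams St", "Friendship", "Adams County", "WI"),
  stringsAsFactors = FALSE
)
mock_components$types <- list(
  "street_number",
  "route",
  c("locality", "political"),
  c("administrative_area_level_2", "political"),
  c("administrative_area_level_1", "political")
)
mock_ok <- list(
  status  = "OK",
  results = list(
    address_components = list(mock_components),
    geometry = list(location = data.frame(lat = 44.0, lng = -89.8))
  )
)
# Mock of a failed response, e.g. a rate-limited request.
mock_fail <- list(status = "OVER_QUERY_LIMIT", results = list())

# Return the long_name of the first component tagged with `type`, or NA.
get_component <- function(components, type) {
  hit <- vapply(components$types, function(t) type %in% t, logical(1))
  if (any(hit)) components$long_name[which(hit)[1]] else NA_character_
}

# One row per response; failed responses become a row of NAs.
extract_one <- function(r) {
  if (is.null(r$status) || r$status != "OK") {
    return(data.frame(county = NA_character_, state = NA_character_,
                      lat = NA_real_, lng = NA_real_,
                      stringsAsFactors = FALSE))
  }
  comp <- r$results$address_components[[1]]   # first match only
  loc  <- r$results$geometry$location
  data.frame(
    county = get_component(comp, "administrative_area_level_2"),
    state  = get_component(comp, "administrative_area_level_1"),
    lat    = loc$lat[1],
    lng    = loc$lng[1],
    stringsAsFactors = FALSE
  )
}

geocodes <- do.call(rbind, lapply(list(mock_ok, mock_fail), extract_one))
geocodes
```

Because failed requests still yield a row, the output keeps the same length and order as the input addresses, which makes it safe to bind the two together afterwards.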

R Storing Values from For Loop to Data Frame

I am trying to write a function that has 2 arguments: column name and ranking number. The function will read a CSV file that has hospitals from every state. The function should return a data frame with the hospital name that was at the specified rank.
My solution has been to split the main CSV file by state, order each data frame by the desired column, loop through each state's data frame, grab the row (where row number = rank number), store each state's hospital name into a vector, then create a dataframe using the vector from the loop.
When I test each part of my function in the console, I am able to receive the results I need. However, when I run the function altogether, it isn't storing the hospital names as desired.
Here's what I have:
rankall <- function(outcome, num = "best") {
outcomedf <- read.csv("outcome-of-care-measures.csv")
#using this as a test
outcomedf <- outcomedf[order(outcomedf[, 11], outcomedf[, 2]), ]
#create empty vectors for hospital name and state
hospital <- c()
state <- c()
#split the read dataframe
splitdf <- split(outcomedf, outcomedf$State)
#for loop through each split df
for (i in 1:length(splitdf)) {
#store the ranked hospital name into hospital vector
hospital[i] <- as.character(splitdf[[i]][num, 2])
#store the ranked hospital state into state vector
state[i] <- as.character(splitdf[[i]][, 7])
}
#create a df with hospital and state
rankdf <- data.frame(hospital, state)
return(rankdf)
}
When I run the function altogether, I receive NA in my 'hospital' column, but when I run each part of the function individually, I am able to receive the desired hospital names. I'm a little confused as to why I am able to run each individual part of this function outside of the function and it returns the results I want, but not when I run the function as a whole. Thank you.
This question is related to Programming Assignment 3 in the Johns Hopkins University R Programming course on Coursera. As such, we can't provide a "corrected version" of the code because doing so would violate the Coursera Honor code.
When run with the default settings, your implementation of the rankall() function fails because when you use num in the extract operator there is no ["best",2] row. If you run your code as head(rankall("heart attack",1)) it produces the following output.
> head(rankall("heart attack",1))
hospital state
1 PROVIDENCE ALASKA MEDICAL CENTER AK
2 CRESTWOOD MEDICAL CENTER AL
3 ARKANSAS HEART HOSPITAL AR
4 MAYO CLINIC HOSPITAL AZ
5 GLENDALE ADVENTIST MEDICAL CENTER CA
6 ST MARYS HOSPITAL AND MEDICAL CENTER CO
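To see why the default num = "best" yields NA, here is a tiny illustration on made-up data (the hospitals data frame below is hypothetical and unrelated to the assignment file). Indexing a data frame by a row name that does not exist, or by a row number past the end, returns NA rather than raising an error.

```r
# Made-up data, only to illustrate the indexing behaviour.
hospitals <- data.frame(
  name = c("A", "B", "C"),
  rate = c(10.1, 12.5, 14.0),
  stringsAsFactors = FALSE
)

num <- "best"
by_name  <- hospitals[num, "name"]   # no row is named "best", so this is NA
by_index <- hospitals[1, "name"]     # "A"
too_far  <- hospitals[5, "name"]     # only 3 rows, so this is NA as well
```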
To completely correct your function, you'll need to make the following changes.
Add code to sort the data by the desired outcome column: heart attack, heart failure, or pneumonia
Add code to handle the condition when a state does not have enough hospitals to return a valid result (e.g. the 43rd ranked hospital for pneumonia in Puerto Rico)
Add code to handle "best" and "worst" as inputs to the num argument

Julia DataFrames: Problems with Split-Apply-Combine strategy

I have some data (from an R course assignment, but that doesn't matter) on which I want to use the split-apply-combine strategy, but I'm having some problems. The data is in a DataFrame called outcome, and each row represents a hospital. Each column holds a piece of information about that hospital, such as name, location, rates, etc.
My objective is to obtain the Hospital with the lowest "Mortality by Heart Attack Rate" of each State.
I was playing around with some strategies, and got a problem using the by function:
best_heart_rate(df) = sort(df, cols = :Mortality)[end,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
The idea was to split the hospitals DataFrame by State, sort each of the SubDataFrames by Mortality Rate, get the lowest one, and combine the lines in a new DataFrame
But when I used this strategy, I got:
ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
in f at none:1
in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202
I suppose the nrow function is not implemented for SubDataFrames, so I got an error. So I used a nastier code:
best_heart_rate(df) = (df[sortperm(df[:,:Mortality] , rev=true), :])[1,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
Seems to work. But now there is a NA problem: how can I remove the rows from the SubDataFrames that have NA on the Mortality column? Is there a better strategy to accomplish my objective?
I think this might work, if I've understood you correctly:
# Let me make up some data about hospitals in states
hospitals = DataFrame(State=sample(["CA", "MA", "PA"], 10), mortality=rand(10), hospital=split("abcdefghij", ""))
hospitals[3, :mortality] = NA
# You can use the indmax function to find the index of the maximum element (use indmin instead if you want the lowest rate)
by(hospitals[complete_cases(hospitals), :], :State, df -> df[indmax(df[:mortality]), [:mortality, :hospital]])
State mortality hospital
1 CA 0.9469632421111882 j
2 MA 0.7137144590022733 f
3 PA 0.8811901895164764 e

Selecting strings and using in logical expressions to create new variable - R

I have a categorical variable indicating location of flu clinics as well as an "other" category. Participants who select the "other" category give open-ended responses for their location. In most cases, these open-ended responses fit with one of the existing categories (for example, one category is "public health clinic", but some respondents picked "other" and cited "mall" which was a public health clinic). I could easily do this by hand but want to learn the code to select "mall" strings then use logical expressions to assign these people to "public health clinic" (e.g. create a new variable for location of flu clinics).
My categorical variable is "lrecflu2" and my character string variable is "lfother"
So far I have:
mall <- grep("MALL", Motiv82012$lfother, value = TRUE)
This gives me a vector with all the string responses containing "MALL" (all strings are in caps in the dataframe)
How do I use this vector in a logical expression to create a new variable that assigns these people to the "public health clinic" category and assigns the original value of flu clinic location variable for people that did not select "other" (and do not have values in the character string variable) to the new flu clinic location variable?
Perhaps, grep is not even the right function to be using.
As I understand it, you have a column in a data frame, where you want to reassign one character value to another. If so, you were almost there...
set.seed(1) # for generating an example
df1 <- data.frame(flu2=sample(c("MALL","other","PHC"),size=10,replace=TRUE))
df1$flu2[grep("MALL",df1$flu2)] <- "PHC"
Here grep() is giving you the required vector index; you then subset the vector based on this and change those elements.
Update 2
This should produce a data.frame similar to the one you are using:
set.seed(1)
lreflu2 <- sample(c("PHC","Med","Work","other"),size=10,replace=TRUE)
Ifother <- rep("",10) # blank character vector
s1 <- c("Frontenac Mall","Kingston Mall","notMALL")
Ifother[lreflu2=="other"] <- s1
df1 <- data.frame(lreflu2,Ifother)
### alternative:
### df1 <- data.frame(lreflu2,Ifother, stringsAsFactors = FALSE)
df1
gives:
lreflu2 Ifother
1 Med
2 Med
3 Work
4 other Frontenac Mall
5 PHC
6 other Kingston Mall
7 other notMALL
8 Work
9 Work
10 PHC
If you're looking for an exact string match you don't need grep at all:
df1$lreflu2[df1$Ifother=="MALL"] <- "PHC"
Using a regex:
df1$lreflu2[grep("Mall",df1$Ifother)] <- "PHC"
gives:
lreflu2 Ifother
1 Med
2 Med
3 Work
4 PHC Frontenac Mall
5 PHC
6 PHC Kingston Mall
7 other notMALL
8 Work
9 Work
10 PHC
Whether Ifother is a factor or a vector with mode character doesn't affect things. Before R 4.0.0, data.frame coerced string vectors to factors by default; from R 4.0.0 on, they stay character unless you pass stringsAsFactors = TRUE.
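Since the question asks for a new variable rather than overwriting the original one, here is a minimal sketch using grepl() and ifelse(). The data are made up, the column names follow the question (lrecflu2 and lfother), and the new column name location_new is a hypothetical choice.

```r
# Made-up example data; all free-text responses are in caps, as in the question.
df1 <- data.frame(
  lrecflu2 = c("PHC", "other", "Work", "other", "other"),
  lfother  = c("", "FRONTENAC MALL", "", "KINGSTON MALL", "PHARMACY"),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per row, which is convenient in logical expressions.
is_mall <- grepl("MALL", df1$lfother, fixed = TRUE)

# Reassign "other" respondents who mentioned a mall; keep the original value otherwise.
df1$location_new <- ifelse(df1$lrecflu2 == "other" & is_mall, "PHC", df1$lrecflu2)
df1
```

Unlike grep(), which returns matching positions, grepl() gives a vector as long as the column, so it combines directly with other conditions using &.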

Summary statistics of retail prices grouped by categorical data

I need some help writing a function that takes three categorical inputs and returns a vector of summary statistics based on these inputs.
The data set contains information on retail goods that can be specified by their retail segment, brand name, and type of good along with its retail price and what it actually sold for.
Now I need to write a function that will take these inputs and average, count, and calculate whatever else is needed.
I have set the function up as follows (using made up data):
dataold = data.frame(segment=c("golf","tenis","football","tenis","golf","golf"),
brand=c("x","y","z","y","x","a"),
type=c("iron","ball","helmet","shoe","driver","iron"),
retail=c(124,.60,80,75,150,108),
actual=c(112,.60,72,75,135,100))
retailsum = funtion(segment,brand,type){
datanew = dataold[which(dataold$segment='segment' &
dataold$brand='brand' &
dataold$type='type'),c("retail","actaul")]
summary = c(dim(datanew)[1],colMeans(datanew))
return(summary)
}
The code inside the function braces works on its own, but once I wrap a function around it I start getting errors or it will just return 0 counts and NaN for the means.
Any help would be greatly appreciated. I have very little experience in R, so I apologize if this is a trivial question, but I have not been able to find a solution.
There are rather a lot of errors in your code, including:
misspelling of function
using single = (assignment) rather than == (equality test)
mistype of actual
hardcoding of segment, brand and type in your function, rather than referencing the arguments.
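This is what your function could look like. Note that the comparisons refer to the columns explicitly as data$segment etc.; inside with(data, ...) a bare segment==segment would compare the column with itself and match every row:
retailsum <- function(data, segment,brand,type, FUN=colMeans){
# data$segment is the column; the bare name segment is the function argument
x = data[data$segment==segment & data$brand==brand & data$type==type,
c("retail","actual")]
match.fun(FUN)(x)
}
retailsum(dataold, "golf", "x", "iron", colMeans)
retail actual
124 112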
And here is a (possibly much more flexible) solution using the plyr package. This calculates your function for all combinations of segment, brand and type:
library(plyr)
ddply(dataold, .(segment, brand, type), colwise(mean))
segment brand type retail actual
1 football z helmet 80.0 72.0
2 golf a iron 108.0 100.0
3 golf x driver 150.0 135.0
4 golf x iron 124.0 112.0
5 tenis y ball 0.6 0.6
6 tenis y shoe 75.0 75.0
Andrie's solution is pretty complete already. (ddply is cool! Didn't know about that function...)
Just one addition, though: If you want to compute the summary values over all possible combinations, you can do this as a one-liner using R's onboard function by:
by(dataold, list(dataold$segment, dataold$brand, dataold$type),
function(x) summary(x[,c('retail', 'actual')])
)
That is not strictly what you asked for, but may still be instructive.
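For completeness, a base R sketch that returns the per-group means and counts as a single data frame, using aggregate() and merge() on the made-up data from the question. The names means, counts and summary_tbl are illustrative only.

```r
dataold <- data.frame(
  segment = c("golf", "tenis", "football", "tenis", "golf", "golf"),
  brand   = c("x", "y", "z", "y", "x", "a"),
  type    = c("iron", "ball", "helmet", "shoe", "driver", "iron"),
  retail  = c(124, .60, 80, 75, 150, 108),
  actual  = c(112, .60, 72, 75, 135, 100),
  stringsAsFactors = FALSE
)

# Mean retail and actual price for every segment/brand/type combination.
means <- aggregate(cbind(retail, actual) ~ segment + brand + type,
                   data = dataold, FUN = mean)

# Number of rows in each combination.
counts <- aggregate(retail ~ segment + brand + type,
                    data = dataold, FUN = length)
names(counts)[names(counts) == "retail"] <- "n"

# Combine means and counts into one table.
summary_tbl <- merge(means, counts, by = c("segment", "brand", "type"))
summary_tbl
```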
