R - set several factor levels as the default argument in a function

I'm trying to create a function that looks up price and car type from a data set. Both will have default arguments. For price, this is easy enough. But for the car type (which I have as a factor), I can't find a way to set all factor levels as the default.
The goal is that if you don't set anything in car_type, it will return all possible car types.
search <- function(start_price = 0, end_price = 1000, car_type = ???) {
  subset_data <- auto[price <= end_price &
                        price > start_price &
                        vehicleType == car_type]
  return(subset_data)
}
search()
So that search() returns all cars between the prices of 0 and 1000, of all possible car types. I've tried using vectors and lists, without any luck.

The usual way to approach this is to use NULL as a default and handle that in the function.
search <- function(start_price = 0, end_price = 1000, car_type = NULL) {
  if (is.null(car_type)) {
    car_type <- levels(auto$vehicleType)
  }
  subset_data <- auto[price <= end_price &
                        price > start_price &
                        vehicleType %in% car_type]
  return(subset_data)
}
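Here is a self-contained sketch of the pattern, using a mock auto data frame and base subset() (the real auto data set isn't shown in the question, so the values and types here are invented):

```r
# Mock data standing in for the real `auto` data set
auto <- data.frame(
  price = c(500, 900, 1200, 300),
  vehicleType = factor(c("coupe", "suv", "coupe", "limousine"))
)

search <- function(start_price = 0, end_price = 1000, car_type = NULL) {
  if (is.null(car_type)) {
    car_type <- levels(auto$vehicleType)  # default: every car type
  }
  subset(auto, price <= end_price & price > start_price &
           vehicleType %in% car_type)
}

search()                    # 3 rows: every type in the price range
search(car_type = "coupe")  # 1 row: only the coupe in the price range
```

Because `%in%` accepts a vector on its right-hand side, the same code path handles both a single type and "all levels" without any special casing.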

Related

IF statements inside function do not recognize conditions

I want to adjust my function so that my if and else if statements recognize the name of the dataframe used and execute the correct plotting function. These are some mock data structured the same as mine:
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
df1[colnames(df1)] <- lapply(df1[colnames(df1)], factor)
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
A<-lapply(tbls,"[", c(1,2))
B<-lapply(tbls,"[", c(3,4))
C<-lapply(tbls,"[", c(3,4))
list<-list(A,B,C)
names(list)<-c("A","B","C")
And this is my function:
plot_1<-function (section, subsample) {
data<-list[grep(section, names(list))]
data<-data[[1]]
name=as.character(names(data))
if(section=="A" && subsample=="None"){plot_likert_general_section(df1[c(1:2)],"A")}
else if (section==name && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the",name,"topics rank?"));plot_likert(data$Ldummy1.no, title = paste("How do the",name,"topics rank?"))}
}
Basically, what I want it to do is plot a certain graph by specifying the section and subsample I'm interested in. If, for example, I want to plot section C and subsample dummy1, I just write:
plot_1(section="C", subsample="dummy1")
I want to avoid writing this:
else if (section=="A" && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the A topics rank?"));plot_likert(data$Ldummy1.no, title = paste("How do the A topics rank?"))}
else if (section=="B" && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the B topics rank?"));plot_likert(data$Ldummy1.no, title = paste("How do the B topics rank?"))}
else if (section=="C" && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the c topics rank?"));plot_likert(data$Ldummy1.no, title = paste("How do the C topics rank?"))}
else if (section=="C" && subsample=="dummy2")...
.
.
}
So I tried to extract the dataframe from the list so that it matches the string of the section typed in the function (data<-list[grep(section, names(list))]) and to store its name as a character (name=as.character(names(data))), because I thought that this way the function would recognize the string "A", "B" or "C" by itself, without the need for me to specify each condition.
However, if I run it, I get this error: Warning message: In section == name && subsample == "dummy1" : 'length(x) = 4 > 1' in coercion to 'logical(1)', which, from what I understand, is due to the presence of a vector in the statement. But I have no idea how to fix this (I'm still quite new to R).
How can I fix the function so that it does what I want? Thanks in advance!
Well, I can't really test your code without the plot_likert_general_section function or the plot_likert function, but I've done a bit of simplifying and applied some best practices (passing list in as an argument, consistent spaces and assignment operators, etc.), and this is my best guess as to what you want:
plot_1 = function(list, section, subsample) { ## added `list` as an argument
  data = list[[grep(section, names(list))]] # use [[ to extract a single item
  name = as.character(names(data))
  if(subsample == "None") {
    plot_likert_general_section(df1[c(1:2)], section)
  } else {
    yesno = paste(subsample, c("yes", "no"), sep = ".")
    plot_likert(data[[yesno[1]]], title = paste("How do the", name, "topics rank?"))
    plot_likert(data[[yesno[2]]], title = paste("How do the", name, "topics rank?"))
  }
}
plot_1(list, section = "C", subsample = "dummy1")
I'm not sure whether your plot_likert functions use base or grid graphics, but either way you'll need to handle the multiple plots. With base graphics you could use par(mfrow = c(1, 2)) to display both of them; with grid, I'd suggest putting them in a list to return them both, and then maybe using gridExtra::grid.arrange() (or similar) to plot both of them.
You're right that the error is due to passing a vector where a single value is expected. Try inserting print statements before the equality test to diagnose why this happens.
Also, be careful when choosing variable names like name, which are base R functions (e.g. ?name). I'd also recommend following the tidyverse style guide here: https://style.tidyverse.org/.
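To make the warning concrete, here is a minimal sketch with illustrative values: `&&` demands length-one logicals, so comparing a scalar against a vector of names is the problem, while any() or %in% reduce the comparison to a single value:

```r
name <- c("A.1", "A.2", "B.1", "B.2")  # a vector, like names(data) above
section <- "A"

# section == name is a length-4 logical vector; feeding it to && is what
# triggers "'length(x) = 4 > 1' in coercion to 'logical(1)'"
# (a warning in older R, an error since R 4.3).
section == name                   # FALSE FALSE FALSE FALSE

# Scalar alternatives that work whatever the length of `name`:
any(startsWith(name, section))    # TRUE: some element begins with "A"
section %in% name                 # FALSE: no element is exactly "A"
```

The general rule: inside if() and &&, every condition must evaluate to a single TRUE or FALSE, so collapse vector comparisons with any(), all(), or %in% first.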

Using multiple conditional filter conditions in R

I am trying to use multiple conditional filters on a data frame, but when both filters (which would come from drop-down select boxes) are used, only one of them is actually applied and the second one is ignored. I can't see anything wrong with it, and I have tried ifelse, which gave the same results.
# seed values
item_levels <- c("alpha","beta","omega","omega","beta","alpha","omega")
prod_types <- c("production","development","test","example","test","test","test")
sample <- data.frame(item_levels, prod_types)
# simulate inputs
selected_item = "omega"
selected_type = "test"
sample %>%
  filter(if(selected_item != "all") {
    item_levels == selected_item
  } else {1==1} &
    if(selected_type != "all") {
      prod_types == selected_type
    } else {1==1}
  )
# Further calculations would be done after this filter - groupings , sum etc
This gives a result set of
item_levels prod_types
1 omega test
2 omega example
3 omega test
I know the equivalent in SQL would be
WHERE IF(#selected_item != "all", item_levels = #selected_item, 1 = 1) AND
IF(#selected_type != "all", prod_types = #selected_type, 1 = 1)
Is there something obvious I am missing here? I know I could apply several filters and pipe them together, but with multiple filters this adds up, and I would like to avoid it if possible.
One solution could be to use a helper function that does the all comparison for you:
compare_all = function(selected, to_compare) {
  if(selected != "all") {
    to_compare == selected
  } else {
    # Get a vector of only `TRUE` with the same length as `to_compare`
    # (rep_along() comes from rlang; in base R: rep(TRUE, length(to_compare)))
    rlang::rep_along(to_compare, TRUE)
  }
}
sample %>% filter(compare_all(selected_item, item_levels) & compare_all(selected_type, prod_types))
Not only does this solve your problem, but it's also much cleaner to extend to even more filters, as you only have to add more calls to compare_all.
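A quick check of the helper on the seed data; this sketch uses base rep() and base subsetting so it runs without extra packages (rlang::rep_along() behaves the same way in the dplyr version):

```r
compare_all <- function(selected, to_compare) {
  if (selected != "all") {
    to_compare == selected
  } else {
    rep(TRUE, length(to_compare))  # keep every row when "all" is selected
  }
}

item_levels <- c("alpha", "beta", "omega", "omega", "beta", "alpha", "omega")
prod_types  <- c("production", "development", "test", "example", "test", "test", "test")
sample_df   <- data.frame(item_levels, prod_types)

# Both filters now apply together
keep <- compare_all("omega", item_levels) & compare_all("test", prod_types)
sample_df[keep, ]                     # 2 rows: only omega/test
# The "all" escape hatch keeps every row
sum(compare_all("all", prod_types))   # 7
```

The key difference from the original attempt is that each if/else is evaluated to a complete logical vector before `&` combines them, so operator precedence can no longer swallow one of the conditions.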

Best practice using multiple null arguments in writing R function

I'm writing a function that subsets a dataframe based on the variables passed to it. I read in Advanced R to use the is_null function to check for null arguments. With just 2 arguments I already have an if/else if/else if/else chain. I'm afraid that if I add many more arguments, the readability of the code will greatly suffer. Is my method best practice?
add_scores <- function(data,
                       study = NULL,
                       therapeutic_area = NULL) {
  if (is_null(study) & is_null(therapeutic_area)) {
    temp <- data
  } else if (!is_null(study) & is_null(therapeutic_area)) {
    temp <- data %>%
      filter(BC == study)
  } else if (is_null(study) & !is_null(therapeutic_area)) {
    temp <- data %>%
      filter(PPDDIVISIONPRI == therapeutic_area)
  } else {
    temp <- data %>%
      filter(
        BC == study &
          PPDDIVISIONPRI == therapeutic_area)
  }
  return(
    temp %>%
      mutate(ENROLLMENTRANK = dense_rank(desc(ENROLLMENTRATE)),
             CYCLETIMERANK = dense_rank(CYCLETIME) * 2,
             TOTALRANK = dense_rank(ENROLLMENTRANK + CYCLETIMERANK)
      ) %>%
      arrange(TOTALRANK, ENROLLMENTRANK, CYCLETIMERANK)
  )
}
Edited:
For your specific issue, you can separate out the if tests:
if(!is.null(study)) data <- filter(data, BC==study)
if(!is.null(therapeutic_area)) data <- filter(data, PPDDIVISIONPRI==therapeutic_area)
Otherwise, as you point out, the number of permutations will rapidly increase!
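As a self-contained illustration of that guard-clause pattern (the column names BC and PPDDIVISIONPRI are kept from the question, but the data is mocked and base subsetting stands in for dplyr::filter so the sketch has no dependencies):

```r
add_scores_sketch <- function(data, study = NULL, therapeutic_area = NULL) {
  # Each optional argument narrows the data only when it is supplied,
  # so n arguments need n lines instead of 2^n branches.
  if (!is.null(study))            data <- data[data$BC == study, ]
  if (!is.null(therapeutic_area)) data <- data[data$PPDDIVISIONPRI == therapeutic_area, ]
  data
}

df <- data.frame(BC = c("s1", "s1", "s2"),
                 PPDDIVISIONPRI = c("onc", "cns", "onc"))

nrow(add_scores_sketch(df))                        # 3: no filters applied
nrow(add_scores_sketch(df, study = "s1"))          # 2
nrow(add_scores_sketch(df, study = "s1",
                       therapeutic_area = "onc"))  # 1
```

The ranking/arranging step from the original function can then run once on the result, outside any branching.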

How can I subindex a multiway array with a vector that preserves blanks

As a minimal example, let's consider the following multiway array (a):
a = as.table(array(c(1:8), dim=c(2,2,2)))
For this array, manual subindexing is easy, e.g.
a[1,,] (a 2 x 2 matrix corresponding to dimension one being in state 1 (A))
My question is now: how can I do the same thing with a vector that preserves blanks, e.g. something like c(1,,)?
Note that I need to define which dimensions are left blank (dynamically), based on the observed variables in an instance. My initial thought was a generic character vector b=c("","",""), where I could replace element 1 with "1" if variable 1 was observed in state 1, e.g. b[1]="1", but, first of all, I do not know how to use this vector for indexing (a["b"]), or whether there is a better way of doing this.
I need this dynamic indexing because I want to update parts of the table as I receive evidence (information == counts)
Thank you very much in advance!
Best,
Sebastian
Here's how I would do it:
while (evidence) {
  idx <- lapply(dim(a), function(dimsize) 1:dimsize)
  ## update `idx` according to `evidence`, e.g.,
  ## if you want to do `a[1,,2]`
  idx[[1]] <- 1
  idx[[3]] <- 2
  do.call(`[`, c(list(a), idx))
  ## if you want to do `a[1,,2] <- c(20, 30)`
  a <- do.call(`[<-`, c(list(a), idx, list(value = c(20, 30))))
}
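Stripped of the evidence loop, the do.call trick applied to the example array looks like this:

```r
a <- as.table(array(1:8, dim = c(2, 2, 2)))

# Start with "blank" indices: every position of every dimension
idx <- lapply(dim(a), seq_len)

# Fix dimension 1 to state 1, leaving the others blank -- like a[1,,]
idx[[1]] <- 1
res <- do.call(`[`, c(list(a), idx))
all(res == a[1, , ])    # TRUE

# Fixing a second dimension works the same way -- like a[1,,2]
idx[[3]] <- 2
res2 <- do.call(`[`, c(list(a), idx))
all(res2 == a[1, , 2])  # TRUE
```

A "blank" in manual indexing is just shorthand for "all positions of that dimension", which is exactly what the untouched seq_len entries in idx supply, so no string building or eval/parse is needed.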
Here is a dirty way of solving it:
data:
a = as.table(array(c(1:8), dim=c(2,2,2)))
Your dynamic indices should be text (how you get your condition into a string like index or index2 is a separate question):
index = "1,,"
index2 = ",2,"
function:
crazyIndexing <- function(obj, index) {
  stringExpr = paste0(obj, "[", index, "]")
  return(eval(parse(text = stringExpr)))
}
Call your function (see how it gives the same result):
a[1,,]
crazyIndexing("a",index)
a[,2,]
crazyIndexing("a",index2)
please note:
b=c("","",""); b[1]="1"
index = paste0(b, collapse = ",")
#[1] "1,,"
You can of course change your function accordingly:
crazyIndexing2 <- function(obj, obj2, index) {
  stringExpr = paste0(obj, "[", index, "]", "<-", obj, "[", index, "]", "+", obj2)
  eval(parse(text = stringExpr))
  return(get(obj))
}
a = as.table(array(c(1:8), dim=c(2,2,2)))
aa = a[,2,]
aopt = crazyIndexing2("a","aa","1,,")
Now you have all the tools.

Improve R script efficiency

I am trying to match two very big data sets (nsar & crsp). My code works quite well but needs a lot of time. My procedure works the following way:
1. Try to match via ticker (thereby checking that NAV (just a number) & date are the same)
2. Try to match via exact fund name (controlling for NAV & date)
3. Try to match by closest match: search first for same NAV & date --> take that list and consider only those companies that are the closest match for both match measures --> take the remaining entries and find the closest match (but the match distance is restricted).
Any suggestions on how I could improve the efficiency of the code:
#Go through each nsar entry and try to match with crsp
trackchanges = sapply(seq_along(nsar$fund), function(x) {
  #Define vars
  ticker = nsar$ticker[x]
  r_date = format(nsar$r_date[x], "%m%Y")
  nav1 = nsar$NAV_share[x]
  nav2 = nsar$NAV_sshare[x]
  searchbyname = 0
  if(nav1 == 0) nav1 = -99
  if(nav2 == 0) nav2 = -99
  ########## If ticker is available --> Merge via ticker and NAV
  if(is.na(ticker) == F)
  {
    #Look for same NAV, date and ticker
    found = which(crsp$nasdaq == ticker & crsp$caldt2 == r_date &
                    (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
    #If nothing found
    if(length(found) == 0)
    {
      #Mark that you should search by names
      searchbyname = 1
    } else { #ticker found
      #Record crsp_fundno and that match is found
      nsar$match[x] = 1
      nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]]
      assign("nsar", nsar, envir = .GlobalEnv)
      #Return: 1 --> Merged by ticker
      return(1)
    }
  }
  ###########
  ########### No ticker available or found --> Exact name matching
  if(is.na(ticker) == T | searchbyname == 1)
  {
    #Define vars
    name = tolower(nsar$fund[x])
    company = tolower(nsar$company[x])
    #Exact name, date and same NAV
    found = which(crsp$fund_name2 == name & crsp$caldt2 == r_date &
                    (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
    #If nothing found
    if(length(found) == 0)
    {
      #####Continue searching by closest match
      #First search for nav and date to get list of funds
      allfunds = which(crsp$caldt2 == r_date &
                         (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
      allfunds_companies = crsp$company[allfunds]
      #Check if anything found
      if(length(allfunds) == 0)
      {
        #Return: 0 --> nothing found
        return(0)
      }
      #Get best match by lev and substring measure for company
      levmatch = levenstheinMatch(company, allfunds_companies)
      submatch = substringMatch(company, allfunds_companies)
      allfunds = levmatch[levmatch %in% submatch]
      allfunds_names = crsp$fund_name2[allfunds]
      #Check if now anything found
      if(length(allfunds) == 0)
      {
        #Mark match (5=Company not found)
        nsar$match[x] = 5
        #Save globally
        assign("nsar", nsar, envir = .GlobalEnv)
        #Return: 5 --> Company not found
        return(5)
      }
      #Get best match by all measures
      levmatch = levenstheinMatch(name, allfunds_names)
      submatch = substringMatch(name, allfunds_names)
      #Only accept if identical
      allfunds = levmatch[levmatch %in% submatch]
      allfunds_names = crsp$fund_name2[allfunds]
      if(length(allfunds) > 0)
      {
        #Mark match (3=closest name matching)
        nsar$match[x] = 3
        #Add crsp_fundno to nsar data
        nsar$crsp_fundno[x] = crsp$crsp_fundno[allfunds[1]]
        #Save globally
        assign("nsar", nsar, envir = .GlobalEnv)
        #Return 3=closest name matching
        return(3)
      } else {
        #return 0 -> no match
        return(0)
      }
      #####
    } else { #If exact name, date, nav found
      #Mark match (2=exact name matching)
      nsar$match[x] = 2
      #Add crsp_fundno to nsar data
      nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]]
      #Return 2=exact name matching
      return(2)
    }
  }
}) #End sapply
Thank you very much for any help!
Laurenz
The script is too complicated to provide a complete answer, but the basic problem is in the first line
#Go through each nsar entry...
where you set out the problem in an iterative way. R works best with vectors.
Hoist the vectorizable components out of the sapply that starts your calculations. For instance, format the r_date column once, up front:
nsar$r_date_f <- format(nsar$r_date, "%m%Y")
This advice applies to lines buried deeper in your code, too, for example calculating the rounded crsp$mnav should be done just once on the entire column
crsp$mnav_r <- round(crsp$mnav, 1)
Use R idioms where appropriate: if -99 represents a missing value, then use NA
nav1 <- nsar$NAV_share
nav1[nav1 == -99] <- NA
nasr$nav1 <- nav1
Code from other packages that you might use is more likely to treat NA correctly.
Use well-established R functions for more complex queries. This is tricky, but if I'm reading your code correctly, your query about "same NAV, date, and ticker" could use merge to do the join, assuming the columns have been created by vectorized operations earlier in the code, as
nsar1 <- nsar[!is.na(nsar$ticker), , drop = FALSE]
df0 <- merge(nsar1, crsp,
             by.x = c("ticker", "r_date_f", "nav1_r"),
             by.y = c("nasdaq", "caldt2", "mnav_r"))
This does not cover the "|" condition, so additional work would be needed. The plyr, data.table, and sqldf packages (among others) were developed in part to simplify these types of operations, so might be worth investigating as you get more comfortable with vectorized calculations.
It's hard to tell, but I think these three steps address the major challenges in your code.
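To make the merge idea concrete, here is a toy illustration of the vectorized join (the column names r_date_f, nav_r and the values are invented stand-ins for the real nsar/crsp columns):

```r
# Mock versions of the two data sets, with pre-computed join columns
nsar <- data.frame(ticker = c("AAA", "BBB", "CCC"),
                   r_date_f = c("012010", "012010", "022010"),
                   nav_r = c(1.1, 2.3, 3.0))
crsp <- data.frame(nasdaq = c("AAA", "CCC"),
                   caldt2 = c("012010", "022010"),
                   mnav_r = c(1.1, 3.0),
                   crsp_fundno = c(101, 202))

# One vectorized join replaces the per-row which(...) lookups:
# only rows agreeing on ticker, date, and rounded NAV survive
matched <- merge(nsar, crsp,
                 by.x = c("ticker", "r_date_f", "nav_r"),
                 by.y = c("nasdaq", "caldt2", "mnav_r"))
matched  # two matched rows, carrying crsp_fundno across
```

Rows that fail the join (like "BBB" here) simply drop out, which is the vectorized equivalent of the length(found) == 0 branches; they can be recovered for the fuzzy-matching fallback with merge(..., all.x = TRUE) and a filter on is.na(crsp_fundno).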
