I'm writing a function that subsets a dataframe based on the variables passed to it. I read in Advanced R to use the is_null function to check for null arguments. With only 2 arguments I already have an if / else if / else if / else chain, and I'm afraid that if I add many more arguments the readability of the code will greatly suffer. Is my method best practice?
library(dplyr)
library(rlang)  # for is_null()

add_scores <- function(data,
                       study = NULL,
                       therapeutic_area = NULL) {
  if (is_null(study) & is_null(therapeutic_area)) {
    temp <- data
  } else if (!is_null(study) & is_null(therapeutic_area)) {
    temp <- data %>%
      filter(BC == study)
  } else if (is_null(study) & !is_null(therapeutic_area)) {
    temp <- data %>%
      filter(PPDDIVISIONPRI == therapeutic_area)
  } else {
    temp <- data %>%
      filter(BC == study &
               PPDDIVISIONPRI == therapeutic_area)
  }
  return(
    temp %>%
      mutate(ENROLLMENTRANK = dense_rank(desc(ENROLLMENTRATE)),
             CYCLETIMERANK = dense_rank(CYCLETIME) * 2,
             TOTALRANK = dense_rank(ENROLLMENTRANK + CYCLETIMERANK)) %>%
      arrange(TOTALRANK, ENROLLMENTRANK, CYCLETIMERANK)
  )
}
Edited:
For your specific issue, you can separate out the if tests:
if (!is.null(study)) data <- filter(data, BC == study)
if (!is.null(therapeutic_area)) data <- filter(data, PPDDIVISIONPRI == therapeutic_area)
Otherwise, as you point out, the number of permutations will rapidly increase!
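Applied to the function above, the whole thing collapses to something like this (a sketch that assumes dplyr is attached and keeps the ranking logic from the question unchanged):
add_scores <- function(data, study = NULL, therapeutic_area = NULL) {
  if (!is.null(study)) data <- filter(data, BC == study)
  if (!is.null(therapeutic_area)) data <- filter(data, PPDDIVISIONPRI == therapeutic_area)
  data %>%
    mutate(ENROLLMENTRANK = dense_rank(desc(ENROLLMENTRATE)),
           CYCLETIMERANK = dense_rank(CYCLETIME) * 2,
           TOTALRANK = dense_rank(ENROLLMENTRANK + CYCLETIMERANK)) %>%
    arrange(TOTALRANK, ENROLLMENTRANK, CYCLETIMERANK)
}
Each additional optional filter then costs one line instead of doubling the number of branches.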
Firstly, sorry if this is a stupid question ... I am learning R and don't have much experience yet.
I have the following function in R, which takes a value and returns a value.
dec2binSingle <- function(decimal) {
  print(decimal)
  binaryValue <- ""
  index <- 0
  decimal <- as.numeric(decimal)
  while (decimal != 0) {
    print(decimal)
    temp <- as.numeric(decimal) %% 2
    if (temp == 1) {
      binaryValue <- paste("1", binaryValue, sep = "", collapse = NULL)
      decimal <- decimal - 1
    } else {
      binaryValue <- paste("0", binaryValue, sep = "", collapse = NULL)
    }
    index <- index + 1
    decimal <- decimal / 2
  }
  return(binaryValue)
}
The function converts a decimal number into its binary equivalent.
When I call the function, it completes without any error, but when I try to view the data, the following error appears:
Error in View : 'names' attribute [200] must be the same length as the vector [1]
And this is how the function is being called:
test_function <- function(value1) {
  return(dec2binSingle(as.numeric(unlist(value1))))
}
data_example$tv <- with(data_example,
                        test_function(data_example[which(colnames(data_example) == "numbers")]))
Any help is appreciated... thanks
EDIT:
I called the function for a single value and it works as expected.
> dec2binSingle(23)
[1] "10111"
>
I hope this is what you wanted to achieve with your code.
# sample data
df <- data.frame(char1 = c("abc", "def", "xyz"),
                 num1 = c(1, 34, 12),
                 num2 = c(34, 20, 8))
df

# function to convert decimal into binary
bin_func <- function(x) {
  gsub("^0+", "", paste(rev(as.numeric(intToBits(x))), collapse = ""))
}

# verify which columns are numeric
num_col <- sapply(df, is.numeric)
df1 <- as.data.frame(lapply(df[, num_col], FUN = function(x) sapply(x, FUN = bin_func)))
names(df1) <- paste(names(df1), "_converted", sep = "")

# final dataframe having original as well as converted columns
df <- cbind(df, df1)
df
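As an aside, the error in the original post arises because dec2binSingle handles one number at a time, so passing a whole column to it doesn't work. Base R's Vectorize can wrap the original function so it is applied element by element (a sketch reusing the data_example and numbers column from the question):
# wrap the scalar function so it maps over every element of the column
dec2binVector <- Vectorize(dec2binSingle)
data_example$tv <- dec2binVector(data_example$numbers)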
Please don't forget to let us know if it helped :)
I want to create a function with optional parameters. I am working on a dataset with income, states, and a few flags. The function takes the data, state, and flag and filters the data. If state and flag are missing, it should return the entire data, and so on. So far I have this code, which is working fine. However, I want to know if there is a simpler way to write it.
library(jsonlite)  # assuming jsonlite (or similar) provides toJSON()

myfun <- function(data, states, flag) {
  if (missing(states)) {
    if (missing(flag)) {
      new_data <- data
    } else {
      new_data <- subset(data, data$Flag == flag)
    }
  } else if (missing(flag)) {
    new_data <- subset(data, data$State == states)
  } else {
    new_data <- subset(data, (data$State == states & data$Flag == flag))
  }
  temp_data <- toJSON(new_data)
  save(temp_data, file = "Mydata.JSON")
  return(new_data)
}
I know we can have optional parameters by giving a parameter a default like flag = "Y". But how do I pass both values as the default, i.e. flag = Y and N, or all 50 states? I am new to R and any help would be greatly appreciated.
Update: Got the solution I was looking for, thanks to Daniel Winkler:
test2 <- function(data, states = unique(data$State), flag = c("Y", "N")) {
  my_data <- subset(data, (data$State %in% states & data$Flag %in% flag))
}
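With those defaults, any argument you omit simply keeps all of its values (assuming mydata has State and Flag columns as in the question; "CA" is only an illustrative state):
all_rows <- test2(mydata)                # all states, both flags
ca_only <- test2(mydata, states = "CA")  # one state, both flags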
For passing multiple values to the function:
myfun <- function(data, states, flag) {
  if (missing(states)) {
    if (missing(flag)) {
      new_data <- data
    } else {
      new_data <- subset(data, data$Flag %in% flag)
    }
  } else if (missing(flag)) {
    new_data <- subset(data, data$State %in% states)
  } else {
    new_data <- subset(data, (data$State %in% states & data$Flag %in% flag))
  }
  temp_data <- toJSON(new_data)
  save(temp_data, file = "Mydata.JSON")
  return(new_data)
}
Then call with:
all_states <- unique(mydata$States)
myfun(data = mydata, states = all_states, flag = c('Y', 'N'))
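For comparison, the sequential-if pattern from the first answer above works here as well, using NULL defaults instead of missing() (a sketch; myfun2 is just an illustrative name):
myfun2 <- function(data, states = NULL, flag = NULL) {
  if (!is.null(states)) data <- subset(data, State %in% states)
  if (!is.null(flag)) data <- subset(data, Flag %in% flag)
  data
}
This avoids the nested branches entirely and scales to more optional filters.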
I am writing a function that filters rows from a data table called random according to an id value, but I want it to first check whether the data table actually contains any rows. I ended up writing an if statement with is.null for the condition, but it is not working: execution falls through to the else branch and then gives me an error.
The code is posted below; please help me.
new.filterID <- function(DataTable, id) {
  if (DataTable == is.null) {
    return(print("No Data Available: ", id))
  } else {
    filtered <- subset(DataTable, ID == id)
    return(aggregate(Value ~ YEAR_WW, filtered, mean))
  }
}
filteredData <- new.filterID(random, 213)
The error I get when I run this is
Error in aggregate.data.frame(mf[1L], mf[-1L], FUN = FUN, ...) :
no rows to aggregate
Below is the empty data table random:
Value YEAR_WW
I think you can use nrow if you just want to check whether the number of rows is zero:
new.filterID <- function(DataTable, id) {
  if (nrow(DataTable) == 0) {
    return(print(paste("No Data Available:", id)))
  } else {
    filtered <- subset(DataTable, ID == id)
    return(aggregate(Value ~ YEAR_WW, filtered, mean))
  }
}
filteredData <- new.filterID(random, 213)
However, if you want to check whether the data.table is truly null, you can also check whether it has any columns:
new.filterID <- function(DataTable, id) {
  if (nrow(DataTable) == 0 & length(names(DataTable)) == 0) {
    return(print(paste("No Data Available:", id)))
  } else {
    filtered <- subset(DataTable, ID == id)
    return(aggregate(Value ~ YEAR_WW, filtered, mean))
  }
}
filteredData <- new.filterID(random, 213)
data.table does not (yet) have a built-in method for checking whether a table is null.
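If you need this check in more than one place, a small helper keeps the intent obvious (is_empty_dt is a made-up name, not part of data.table):
# TRUE when the table has no rows and no columns, i.e. nothing in it
is_empty_dt <- function(DT) nrow(DT) == 0 && length(names(DT)) == 0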
Consider this dataset:
myData = data.frame(parent = c(1, 3, 5, 7, 8, 9),
                    child = c(2, 4, 6, 8, 9, 10))
I want to filter this dataset down to only records that belong to parent # 7. In T-SQL, I'd accomplish this using a recursive common table expression. Is there a way to do something like this elegantly in R?
Without a more complex example, it's hard to see exactly what you need (there is no "recursion" per se here; it's just one level deep). However, to answer your specific question, here are two ways:
myData = data.frame(parent = c(1, 3, 5, 7, 8, 9), child = c(2, 4, 6, 8, 9, 10))

# Using dplyr
library(dplyr)
myData %>% filter(parent == "7")

# Using base R
myData$child[myData$parent == 7]
Edit:
Per the comments below, I think this simple script will do it. There may be a simpler way, but the repeat loop below is the first that came to mind:
newParent = 7
answer <- c()
repeat {
  currentResult <- myData$child[myData$parent == newParent]
  if (length(currentResult) != 0) {
    newParent <- currentResult
    answer <- c(answer, newParent)
  } else {
    break
  }
}
> answer
[1] 8 9 10
Edit #2
Here's the same thing rewritten recursively given that the OP was interested in seeing how it is done:
findChild <- function(myData, parent) {
  currentChild <- myData$child[myData$parent == parent]
  if (length(currentChild) != 0) {
    return(c(currentChild, findChild(myData, currentChild)))
  } else {
    return()
  }
}
findChild(myData, 7)
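For deeper or branching hierarchies, a graph library can do the traversal for you. Here is a sketch using the igraph package (an extra dependency, not part of the answer above); note that it returns the ids as character strings:
library(igraph)
g <- graph_from_data_frame(myData)  # edge list: parent -> child
# every node reachable from "7", minus "7" itself
descendants <- setdiff(subcomponent(g, "7", mode = "out")$name, "7")
> descendants
[1] "8"  "9"  "10"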
I am trying to match two very big data sets (nsar & crsp). My code works quite well but needs a lot of time. My procedure works the following way:
1. Try to match via ticker (checking that NAV (just a number) and date are the same).
2. Try to match via exact fund name (controlling for NAV and date).
3. Try to match by closest match: search first for the same NAV and date --> take that list and consider only those companies that are the closest match under both match measures --> take the remaining entries and find the closest match (with the match distance restricted).
Any suggestions on how I could improve the efficiency of the code?
#Go through each nsar entry and try to match with crsp
trackchanges = sapply(seq_along(nsar$fund), function(x) {
  #Define vars
  ticker = nsar$ticker[x]
  r_date = format(nsar$r_date[x], "%m%Y")
  nav1 = nsar$NAV_share[x]
  nav2 = nsar$NAV_sshare[x]
  searchbyname = 0
  if (nav1 == 0) nav1 = -99
  if (nav2 == 0) nav2 = -99

  ########## If ticker is available --> Merge via ticker and NAV
  if (is.na(ticker) == F) {
    #Look for same NAV, date and ticker
    found = which(crsp$nasdaq == ticker & crsp$caldt2 == r_date &
                    (round(crsp$mnav, 1) == round(nav1, 1) | round(crsp$mnav, 1) == round(nav2, 1)))
    #If nothing found
    if (length(found) == 0) {
      #Mark that you should search by names
      searchbyname = 1
    } else { #ticker found
      #Record crsp_fundno and that match is found
      nsar$match[x] = 1
      nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]]
      assign("nsar", nsar, envir = .GlobalEnv)
      #Return: 1 --> Merged by ticker
      return(1)
    }
  }

  ########### No ticker available or found --> Exact name matching
  if (is.na(ticker) == T | searchbyname == 1) {
    #Define vars
    name = tolower(nsar$fund[x])
    company = tolower(nsar$company[x])
    #Exact name, date and same NAV
    found = which(crsp$fund_name2 == name & crsp$caldt2 == r_date &
                    (round(crsp$mnav, 1) == round(nav1, 1) | round(crsp$mnav, 1) == round(nav2, 1)))
    #If nothing found
    if (length(found) == 0) {
      #####Continue searching by closest match
      #First search for nav and date to get list of funds
      allfunds = which(crsp$caldt2 == r_date &
                         (round(crsp$mnav, 1) == round(nav1, 1) | round(crsp$mnav, 1) == round(nav2, 1)))
      allfunds_companies = crsp$company[allfunds]
      #Check if anything found
      if (length(allfunds) == 0) {
        #Return: 0 --> nothing found
        return(0)
      }
      #Get best match by lev and substring measure for company
      levmatch = levenstheinMatch(company, allfunds_companies)
      submatch = substringMatch(company, allfunds_companies)
      allfunds = levmatch[levmatch %in% submatch]
      allfunds_names = crsp$fund_name2[allfunds]
      #Check if now anything found
      if (length(allfunds) == 0) {
        #Mark match (5=Company not found)
        nsar$match[x] = 5
        #Save globally
        assign("nsar", nsar, envir = .GlobalEnv)
        #Return: 5 --> Company not found
        return(5)
      }
      #Get best match by all measures
      levmatch = levenstheinMatch(name, allfunds_names)
      submatch = substringMatch(name, allfunds_names)
      #Only accept if identical
      allfunds = levmatch[levmatch %in% submatch]
      allfunds_names = crsp$fund_name2[allfunds]
      if (length(allfunds) > 0) {
        #Mark match (3=closest name matching)
        nsar$match[x] = 3
        #Add crsp_fundno to nsar data
        nsar$crsp_fundno[x] = crsp$crsp_fundno[allfunds[1]]
        #Save globally
        assign("nsar", nsar, envir = .GlobalEnv)
        #Return 3=closest name matching
        return(3)
      } else {
        #return 0 -> no match
        return(0)
      }
    } else { #If exact name, date, nav found
      #Mark match (2=exact name matching)
      nsar$match[x] = 2
      #Add crsp_fundno to nsar data
      nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]]
      #Return 2=exact name matching
      return(2)
    }
  }
}) #End sapply
Thank you very much for any help!
Laurenz
The script is too complicated to provide a complete answer, but the basic problem is in the first line
#Go through each nsar entry...
where you set out the problem in an iterative way. R works best with vectors.
Hoist the vectorizable components out of the sapply with which you start your calculations. For instance, format the r_date column once for the entire vector:
nsar$r_date_f <- format(nsar$r_date, "%m%Y")
This advice applies to lines buried deeper in your code too; for example, the rounded crsp$mnav should be calculated just once on the entire column:
crsp$mnav_r <- round(crsp$mnav, 1)
Use R idioms where appropriate: if -99 represents a missing value, then use NA:
nav1 <- nsar$NAV_share
nav1[nav1 == -99] <- NA
nsar$nav1 <- nav1
Code from other packages that you might use is more likely to treat NA correctly.
Use well-established R functions for more complex queries. This is tricky, but if I'm reading your code correctly your query about "same NAV, date, and ticker" could use merge to do the joins, assuming the columns have been created by vectorized operations earlier in the code, as
nsar1 <- nsar[!is.na(nsar$ticker), , drop = FALSE]
df0 <- merge(nsar1, crsp,
             by.x = c("ticker", "r_date_f", "nav1_r"),
             by.y = c("nasdaq", "caldt2", "mnav_r"))
This does not cover the "|" condition, so additional work would be needed; see the sketch below. The plyr, data.table, and sqldf packages (among others) were developed in part to simplify these types of operations, so they might be worth investigating as you get more comfortable with vectorized calculations.
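One possible way to cover the "|" condition is to merge once per NAV column and stack the results, again assuming nav1_r and nav2_r are the rounded NAV columns created on nsar1 beforehand:
m1 <- merge(nsar1, crsp, by.x = c("ticker", "r_date_f", "nav1_r"),
            by.y = c("nasdaq", "caldt2", "mnav_r"))
m2 <- merge(nsar1, crsp, by.x = c("ticker", "r_date_f", "nav2_r"),
            by.y = c("nasdaq", "caldt2", "mnav_r"))
df0 <- unique(rbind(m1, m2))  # drop rows that matched on both NAV columns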
It's hard to tell, but I think these three steps address the major challenges in your code.