Improve R script efficiency - r

I am trying to match two very big data sets (nsar & crsp). My code works quite well but takes a lot of time. My procedure works as follows:
Try to match via ticker (checking that the NAV (just a number) and the date are the same)
Try to match via exact fund name (controlling for NAV & date)
Try to match by closest match: first search for the same NAV & date --> take that list and keep only the companies that are the closest match on both match measures --> take the remaining entries and find the closest match (but the match distance is restricted).
Any suggestions on how I could improve the efficiency of the code?
#Go through each nsar entry and try to match with crsp
trackchanges = sapply(seq_along(nsar$fund),function(x){
#Define vars
ticker = nsar$ticker[x]
r_date = format(nsar$r_date[x], "%m%Y")
nav1 = nsar$NAV_share[x]
nav2 = nsar$NAV_sshare[x]
searchbyname = 0
if(nav1 == 0) nav1 = -99
if(nav2 == 0) nav2 = -99
########## If ticker is available --> Merge via ticker and NAV
if(is.na(ticker) == F)
{
#Look for same NAV, date and ticker
found = which(crsp$nasdaq == ticker & crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
#If nothing found
if(length(found) == 0)
{
#Mark that you should search by names
searchbyname = 1
} else { #ticker found
#Record crsp_fundno and that match is found
nsar$match[x] = 1
nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]]
assign("nsar",nsar,envir=.GlobalEnv)
#Return: 1 --> Merged by ticker
return(1)
}
}
###########
########### No Ticker available or found --> Exact name matching
if(is.na(ticker) == T | searchbyname == 1)
{
#Define vars
name = tolower(nsar$fund[x])
company = tolower(nsar$company[x])
#Exact name, date and same NAV
found = which(crsp$fund_name2 == name & crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
#If nothing found
if(length(found) == 0)
{
#####Continue searching by closest match
#First search for nav and date to get list of funds
allfunds = which(crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
allfunds_companies = crsp$company[allfunds]
#Check if anything found
if(length(allfunds) == 0)
{
#Return: 0 --> nothing found
return(0)
}
#Get best match by lev and substring measure for company
levmatch = levenstheinMatch(company, allfunds_companies)
submatch = substringMatch(company, allfunds_companies)
allfunds = levmatch[levmatch %in% submatch]
allfunds_names = crsp$fund_name2[allfunds]
#Check if now anything found
if(length(allfunds) == 0)
{
#Mark match (5=Company not found)
nsar$match[x] = 5
#Save globally
assign("nsar",nsar,envir=.GlobalEnv)
#Return: 5 --> Company not found
return(5)
}
#Get best match by all measures
levmatch = levenstheinMatch(name, allfunds_names)
submatch = substringMatch(name, allfunds_names)
#Only accept if identical
allfunds = levmatch[levmatch %in% submatch]
allfunds_names = crsp$fund_name2[allfunds]
if(length(allfunds) > 0)
{
#Mark match (3=closest name matching)
nsar$match[x] = 3
#Add crsp_fundno to nsar data
nsar$crsp_fundno[x] = crsp$crsp_fundno[allfunds[1]]
#Save globally
assign("nsar",nsar,envir=.GlobalEnv)
#Return 3=closest name matching
return(3)
} else {
#return 0 -> no match
return(0)
}
#####
} else { #If exact name,date,nav found
#Mark match (2=exact name matching)
nsar$match[x] = 2
#Add crsp_fundno to nsar data
nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]]
#Return 2=exact name matching
return(2)
}
}
})#End sapply
Thank you very much for any help!
Laurenz

The script is too complicated to provide a complete answer, but the basic problem is in the first line
#Go through each nsar entry...
where you set out the problem in an iterative way. R works best with vectors.
Hoist the vectorizable components from the sapply that you start your calculations with. For instance, format the r_date column.
nsar$r_date_f <- format(nsar$r_date, "%m%Y")
This advice applies to lines buried deeper in your code, too. For example, the rounded crsp$mnav should be calculated just once, on the entire column
crsp$mnav_r <- round(crsp$mnav, 1)
Use R idioms where appropriate. Your code recodes a NAV of 0 to -99 so that it can never match; if that sentinel represents a missing value, then use NA
nav1 <- nsar$NAV_share
nav1[nav1 == 0] <- NA
nsar$nav1 <- nav1
Code from other packages that you might use is more likely to treat NA correctly.
Use well-established R functions for more complex queries. This is tricky, but if I'm reading your code correctly, your query about "same NAV, date, and ticker" could use merge to do the join, assuming the key columns (here r_date_f, nav1_r and mnav_r) have been created by vectorized operations earlier in the code, as
nsar1 <- nsar[!is.na(nsar$ticker), , drop=FALSE]
df0 <- merge(nsar1, crsp,
by.x = c("ticker", "r_date_f", "nav1_r"),
by.y = c("nasdaq", "caldt2", "mnav_r"))
This does not cover the "|" condition, so additional work would be needed. The plyr, data.table, and sqldf packages (among others) were developed in part to simplify these types of operations, so might be worth investigating as you get more comfortable with vectorized calculations.
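One possible way to cover the "|" condition, as a rough sketch only: run the merge once per NAV column and stack the results. The toy data frames, the id column, and the helper columns nav1_r and nav2_r are made up for illustration; only ticker, nasdaq, caldt2, mnav and crsp_fundno come from the question.

```r
# Toy stand-ins for the real tables (all values invented for illustration)
nsar <- data.frame(
  id       = 1:3,                          # hypothetical row identifier
  ticker   = c("AAA", "BBB", NA),
  r_date_f = c("012010", "012010", "022010"),
  nav1_r   = c(10.1, NA, 7.0),             # assumed: round(NAV_share, 1), 0 recoded to NA
  nav2_r   = c(NA, 20.5, 7.0),             # assumed: round(NAV_sshare, 1), 0 recoded to NA
  stringsAsFactors = FALSE
)
crsp <- data.frame(
  crsp_fundno = c(101, 202, 303),
  nasdaq      = c("AAA", "BBB", "CCC"),
  caldt2      = c("012010", "012010", "022010"),
  mnav_r      = c(10.1, 20.5, 7.0),
  stringsAsFactors = FALSE
)

with_ticker <- nsar[!is.na(nsar$ticker), , drop = FALSE]
crsp_keys   <- c("nasdaq", "caldt2", "mnav_r")

# One merge per NAV column: ticker + date + first NAV, then ticker + date + second NAV
m1 <- merge(with_ticker, crsp, by.x = c("ticker", "r_date_f", "nav1_r"), by.y = crsp_keys)
m2 <- merge(with_ticker, crsp, by.x = c("ticker", "r_date_f", "nav2_r"), by.y = crsp_keys)

# Stack both result sets and keep the first hit per nsar row
matches <- rbind(m1[, c("id", "crsp_fundno")], m2[, c("id", "crsp_fundno")])
matches <- matches[!duplicated(matches$id), ]

# Write the matched fund numbers back; unmatched rows stay NA
nsar$crsp_fundno <- matches$crsp_fundno[match(nsar$id, matches$id)]
```

Note that merge treats NA keys as equal to each other by default, so rows of crsp with a missing mnav_r would need to be dropped first in real data.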
It's hard to tell, but I think these three steps address the major challenges in your code.

Related

IF statements inside function do not recognize conditions

I want to adjust my function so that my if and else if statements recognize the name of the dataframe used and execute the correct plotting function. These are some mock data structured the same as mine:
df1<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
df1[colnames(df1)] <- lapply(df1[colnames(df1)], factor)
vals <- colnames(df1)[1:5]
dummies <- colnames(df1)[-(1:5)]
step1 <- lapply(dummies, function(x) df1[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
tbls <- unlist(step2, recursive=FALSE)
tbls<-lapply(tbls, function(x) x[(names(x) %in% names(df1[c(1:5)]))])
A<-lapply(tbls,"[", c(1,2))
B<-lapply(tbls,"[", c(3,4))
C<-lapply(tbls,"[", c(3,4))
list<-list(A,B,C)
names(list)<-c("A","B","C")
And this is my function:
plot_1<-function (section, subsample) {
data<-list[grep(section, names(list))]
data<-data[[1]]
name=as.character(names(data))
if(section=="A" && subsample=="None"){plot_likert_general_section(df1[c(1:2)],"A")}
else if (section==name && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the",name,"topics rank?"));plot_likert(data$Ldummy1.no, title = paste("How do the",name,"topics rank?"))}
}
Basically, I want it to plot a certain graph when I specify the section and subsample I'm interested in. For example, to plot section C and subsample dummy1, I just write:
plot_1(section="C", subsample="dummy1")
I want to avoid writing this:
else if (section=="A" && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the A topics rank?"));plot_likert(data$Ldummy1.no, title = paste("How do the A topics rank?"))}
else if (section=="B" && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the B topics rank?"));plot_likert(data$Ldummy1.no, title = paste("How do the B topics rank?"))}
else if (section=="C" && subsample=="dummy1"){plot_likert(data$dummy1.yes, title=paste("How do the c topics rank?"));plot_likert(data$Ldummy1.no, title = paste("How do the C topics rank?"))}
else if (section=="C" && subsample=="dummy2")...
.
.
}
So I tried to extract the dataframe used from the list so that it matches the string of the section typed in the function (data<-list[grep(section, names(list))]) and store its name as a character (name=as.character(names(data))), because I thought that in this way the function would have recognized the string "A", "B" or "C" by itself, without the need for me to specify each condition.
However, if I run it, I get this error: Warning message: In section == name && subsample == "dummy1" : 'length(x) = 4 > 1' in coercion to 'logical(1)', that, from what I understand, is due to the presence of a vector in the statement. But I have no idea how to correct for this (I'm still quite new to R).
How can I fix the function so that it does what I want? Thanks in advance!
Well, I can't really test your code without the plot_likert_general_section function or the plot_likert function, but I've done a bit of simplifying and best practices--passing list in as an argument, consistent spaces and assignment operators, etc.--and this is my best guess as to what you want:
plot_1 = function(list, section, subsample) { ## added `list` as an argument
data = list[[grep(section, names(list))]] # use [[ to extract a single item
title = paste("How do the", section, "topics rank?") # names(data) holds one name per sub-table, so build the title from `section`
if(subsample == "None"){
plot_likert_general_section(df1[c(1:2)], section)
} else {
yesno = paste(subsample, c("yes", "no"), sep = ".")
plot_likert(data[[yesno[1]]], title = title)
plot_likert(data[[yesno[2]]], title = title)
}
}
plot_1(list, section = "C", subsample = "dummy1")
I'm not sure if your plot_likert functions use base or grid graphics, but either way you'll need to handle the multiple plots. With base, probably use par(mfrow = ...) to display both of them; with grid, I'd suggest putting them in a list to return them both, and then maybe using gridExtra::grid.arrange() (or similar) to plot both of them.
You're right that the error is due to passing a vector where a single value is expected. Try inserting print statements before the equality test to diagnose why this is.
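As a small illustration of that diagnosis, here is a sketch with a stand-in list (lst and its contents are invented, shaped like the nested list in the question). It shows that names(data) holds one entry per sub-table, so section == name is a vector rather than a single TRUE/FALSE. As far as I know, from R 4.3.0 onwards && with such a vector is an error rather than a warning.

```r
# Stand-in for the nested list in the question (contents invented)
tbls <- list(dummy1.no = 1, dummy1.yes = 2, dummy2.high = 3, dummy2.low = 4)
lst  <- list(A = tbls, B = tbls, C = tbls)

section <- "C"
data <- lst[[grep(section, names(lst))]]
name <- names(data)

print(name)             # four names, not the single string "C"
print(section == name)  # a logical vector of length 4, which `&&` cannot use
```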
Also, be careful with choosing variable names like name or list that clash with base R (e.g. ?name, ?list). I'd also recommend following the tidyverse style guide: https://style.tidyverse.org/.

R - set several factor levels as default argument in a function

I'm trying to create a function that looks up price and car type from a data set. Both will have default arguments. For price, this is easy enough. But for the car type (which I have as factors), I can't find a way to set all factors as default.
The goal is that if you don't set anything in car_type, it will return all possible car types.
search <- function(start_price = 0, end_price = 1000, car_type = ???){
subset_data <- auto[price <= end_price &
price > start_price &
vehicleType == car_type]
return(subset_data)
}
search()
So that the "search()" returns all cars between the prices of 0 and 1000 and of all possible car types. I've tried using vectors and lists, without any luck.
The usual way to approach this is to use NULL as a default and handle that in the function.
search <- function(start_price = 0, end_price = 1000, car_type = NULL){
if (is.null(car_type)) {
car_type <- levels(auto$vehicleType)
}
subset_data <- auto[price <= end_price &
price > start_price &
vehicleType %in% car_type]
return(subset_data)
}
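To see the NULL default in action, here is a minimal self-contained sketch. The toy auto table is invented, and I've assumed a plain data frame passed in as an argument and a hypothetical function name search_cars (so that base R's search() is not masked); the original presumably uses a different table type.

```r
# Hypothetical toy data; the real `auto` table is not shown in the question
auto <- data.frame(
  price       = c(500, 800, 1500, 200),
  vehicleType = factor(c("suv", "coupe", "suv", "bus"))
)

search_cars <- function(data, start_price = 0, end_price = 1000, car_type = NULL) {
  if (is.null(car_type)) {
    car_type <- levels(data$vehicleType)  # default: every car type
  }
  keep <- data$price > start_price &
    data$price <= end_price &
    data$vehicleType %in% car_type
  data[keep, , drop = FALSE]
}

all_types <- search_cars(auto)                    # no car_type given: all types
only_suv  <- search_cars(auto, car_type = "suv")  # restrict to one type
```

Using %in% rather than == is what lets car_type hold several levels at once.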

find_element in dataframe in R

I am new to R. I want to define an R function, find_element, that takes a list and a value of any type as its inputs and returns the value of the element in the input list that matches the value. Thanks for your help.
find_element <- function(arr, val){
count = 0
for(i in arr){
if (i == val){
print(count)
} else
count = count + 1
print ("No Match")
}
}
e.g.
arr <- 1:10
find_element(arr, 10)
# 10
find_element(arr, 12)
# NULL
Just for educational purposes, please, try (although this is not recommended practice in R!):
find_element <- function(arr, val) {
count = 1
for (i in arr) {
if (i == val) {
return(count)
} else
count = count + 1
}
return("No Match")
}
This will yield
arr <- 1:10
find_element(arr, 10)
#[1] 10
find_element(arr, 12)
#[1] "No Match"
Please, note
In R, elements of vectors, etc are numbered starting with 1
You have to use return instead of print to indicate the return value of a function (well, I know there's a short cut - but it's for the purpose of education, here)
The final return must come after the for loop.
Built-in function
Also for educational purposes, please, note that Sotos already has shown the R way in his comment:
which(arr == 10)
#[1] 10
which(arr == 12)
#integer(0)
In R, it's almost always better to use the well-documented built-in functions or those from packages. And, yes, try to avoid for loops in R.
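Another built-in worth knowing here is match(), which returns the position of the first match and NA when there is none. A minimal sketch of find_element on top of it (keeping the "No Match" string from above is my assumption about the desired behaviour):

```r
arr <- 1:10

match(10, arr)  # position of the first match
match(12, arr)  # NA when there is no match

# Same interface as the loop version, without the loop
find_element <- function(arr, val) {
  pos <- match(val, arr)
  if (is.na(pos)) "No Match" else pos
}
```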
Learning R online
As pointed out in the (now deleted) answer of engAnt, there are several resources for learning R; https://www.rstudio.com/online-learning/#R lists a number of them.

How do you recursively return data from a parent child hierarchy in R?

Consider this dataset:
myData = data.frame(parent = c(1,3,5,7,8,9)
,child = c(2,4,6,8,9,10))
I want to filter this dataset down to only records that belong to parent # 7. In T-SQL, I'd accomplish this using a recursive common table expression. Is there a way to do something like this elegantly in R?
Without a more complex example, it's hard to see exactly what you need (here there is no "recursion" per se; it's just one level deep). However, to answer your specific question, here are two ways:
myData = data.frame(parent = c(1,3,5,7,8,9),child = c(2,4,6,8,9,10))
#Using Dplyr
library(dplyr)
myData %>% filter(parent == 7)
#Using Base R
myData$child[myData$parent == 7]
Edit:
Per the comments below, I think this simple script will do it. There may be a simpler way, but the repeat loop below is the first that came to mind:
newParent = 7
answer <- c()
repeat{
currentResult <- myData$child[myData$parent == newParent]
if(length(currentResult) != 0){
newParent <- currentResult
answer <- c(answer, newParent)
}else{
break
}
}
> answer
[1] 8 9 10
Edit #2
Here's the same thing rewritten recursively given that the OP was interested in seeing how it is done:
findChild <- function(myData, parent){
currentChild <- myData$child[myData$parent == parent]
if(length(currentChild) != 0){
return(c(currentChild, findChild(myData, currentChild)))
}else{
return()
}
}
findChild(myData, 7)
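Both versions above follow a single chain. If a parent can have several children, a variant along these lines might work; it is only a sketch, and the function name find_descendants and the branching toy data are my own inventions rather than anything from the question:

```r
# Collect all descendants level by level; handles parents with several children
find_descendants <- function(data, parent) {
  found <- c()
  frontier <- parent
  repeat {
    children <- data$child[data$parent %in% frontier]
    children <- setdiff(children, found)  # skip nodes already seen, so cycles terminate
    if (length(children) == 0) break
    found <- c(found, children)
    frontier <- children
  }
  found
}

# Invented example in which parent 7 has two children
tree <- data.frame(parent = c(1, 7, 7, 8, 9), child = c(2, 8, 9, 10, 11))
descendants <- find_descendants(tree, 7)
```

The key change is %in% instead of ==, so that the comparison still behaves when the current level contains more than one node.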

Trying to vectorize a for loop in R

UPDATE
Thanks to the help and suggestions of @CarlWitthoft my code was simplified to this:
model <- unlist(sapply(1:length(model.list),
function(i) ifelse(length(model.list[[i]][model.lookup[[i]]] == "") == 0,
NA, model.list[[i]][model.lookup[[i]]])))
ORIGINAL POST
Recently I read an article saying that vectorizing operations in R instead of using for loops is good practice. I have a piece of code where I used a big for loop, and I'm trying to turn it into a vector operation, but I cannot find the answer. Could someone help me? Is it possible, or do I need to change my approach? My code works fine with the for loop, but I want to try the other way.
model <- c(0)
price <- c(0)
size <- c(0)
reviews <- c(0)
for(i in 1:length(model.list)) {
if(length(model.list[[i]][model.lookup[[i]]] == "") == 0) {
model[i] <- NA
} else {
model[i] <- model.list[[i]][model.lookup[[i]]]
}
if(length(model.list[[i]][price.lookup[[i]]] == "") == 0) {
price[i] <- NA
} else {
price[i] <- model.list[[i]][price.lookup[[i]]]
}
if(length(model.list[[i]][reviews.lookup[[i]]] == "") == 0) {
reviews[i] <- NA
} else {
reviews[i] <- model.list[[i]][reviews.lookup[[i]]]
}
size[i] <- product.link[[i]][size.lookup[[i]]]
}
Basically, model.list is a list from which I want to extract a particular value per element. The position within each element is given by the variables model.lookup, price.lookup and reviews.lookup, which contain logical vectors with just one TRUE value that is used to pick the desired value from model.list. On every cycle of the for loop, the extracted values are stored in the variables model, price, size and reviews.
Could this be changed to a vector operation?
In general, try to avoid if when not needed. I think your desired output can be built as follows.
model <- unlist(sapply(1:length(model.list), function(i) model.list[[i]][model.lookup[[i]]]))
model[model=='']<-NA
And the same for your other variables. This assumes that every model.lookup[[i]] selects exactly one element. If it doesn't, you won't be able to write the output to a single element of model in the first place.
I would also note that you are grossly overcoding, e.g. x<-0 is better than x<-c(0), and don't bother with length evaluation on a single item.
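If some lookups might select nothing at all, unlist() drops those empty results and the remaining values shift position. A sketch that keeps one slot per list element is below; the toy inputs and the helper name pick_one are made up for illustration.

```r
# Toy inputs shaped like those described in the question (values invented)
model.list   <- list(c("a", "Model X"), c("b", "c"), c("Model Z", "d"))
model.lookup <- list(c(FALSE, TRUE), c(FALSE, FALSE), c(TRUE, FALSE))

# Return the selected value, or NA when the lookup selects nothing
pick_one <- function(values, lookup) {
  hit <- values[lookup]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# mapply walks both lists in parallel and simplifies to a character vector
model <- mapply(pick_one, model.list, model.lookup, USE.NAMES = FALSE)
```

The same helper could be reused for price and reviews by swapping in the corresponding lookup list.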
