Find matching individuals in two datasets in R - r

There are a lot of matching "X" and "Y" questions in R on this site but I think I have a new one. I have two datasets, one is shorter (500 rows) and has one entry per individual. The second is bigger (~20,000 rows) and an individual can have multiple entries. Both have columns for date of birth and gender. My goal is to find people represented in both datasets and to start by finding date of birth and gender matches. My python influenced brain came up with this horrifically slow solution:
dob_big <- c('1975-05-04','1968-02-16','1985-02-28','1980-12-12','1976-06-06','1979-06-24','1981-01-28',
'1985-01-16','1984-03-04','1979-06-26','1988-12-22','1975-10-02','1968-02-04','1972-02-01',
'1981-08-06','1989-01-21','1956-06-25','1986-01-19','1980-03-24','1965-08-16')
gender_big <- c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0)
big_df <- data_frame(date_birth = dob_big, gender = gender_big)
dob_small <- c('1985-01-16','1984-03-04','1979-06-26')
gender_small <- c(1,0,1)
small_df <- data_frame(date_birth = dob_small, gender = gender_small)
for (i in 1:length(big_df$date_birth)) {
save_row <- FALSE
for (j in 1:length(small_df$date_birth)) {
if (big_df$date_birth[i] == small_df$date_birth[j]
& big_df$gender[i] == small_df$gender[j]) {
print(paste("Match found at ",i,",",j))
save_row <- TRUE
}
}
if (save_row == TRUE) {
matches <- c(matches,i)
}
}
Is there a more functional solution that would run faster in R?

whichcould be an alternative.
paste0("Match found at ",
which(paste(big_df$date_birth, big_df$gender) %in%
paste(small_df$date_birth, small_df$gender)),
", ",
which(paste(small_df$date_birth, small_df$gender) %in%
paste(big_df$date_birth, big_df$gender)),
collapse = "; ")

If you only want to find those that are represented in both, you could do a merge
merge(big_df,small_df, by = c("date_birth","gender"))

Related

Quickly lookup to another data frame using two column values in R

I have a data frame (call it 'ModelOutput') with three columns (Trial, DurationRet, DiscountRate) and another (call it 'drdata') with three columns (Scenario, variable, value).
I want to quickly filter drdata$Scenario == ModelOutput$Trial & drdata$variable == ModelOutput$DurationRet to return drdata$value into the ModelOutput$DiscountRate column. Is there a way to do this efficiently?
Here are my two attempts, the first of which fails and the second of which is entirely too slow.
ModelOutput$Trial <- drdata[drdata$Scenario == ModelOutput$Trial & drdata$variable == ModelOutput$DurationRet,"value"]
foreach(row = 1:nrow(ModelOutput)) %do%{
ModelOutput[row, "DiscountRate"] <- drdata[drdata$Scenario == ModelOutput[row, "Trial"] & drdata$variable == as.factor(ModelOutput[row,"DurationRet"]+1),"value"]
}
It took me a minute, but I realized joins could do the job I was looking for.
Here is my final code:
ModelOutput <- ModelOutput %>% full_join(drdata, by = c(Trial = "Scenario", DurationRet = "variable"))

"Argument is of length zero" error with grepl loop

I am having issues with a loop involving grepl. I am trying to print the index that contains the string "Taxable Revenue by Area", but I keep getting the error Argument is of length zero. I have tried it different ways but keep getting an error. When I check the length of the grepl statement, it is 1, not zero. I'm really stuck! nevlists is a list of dataframes. Each data frame is named by number, 1-48 and the length of nevlists therefore is 48. When I run the grepl statement on its own with the page I want: grepl("Taxable Revenue by Area", nevlists$'48'[3,] this evaluates as TRUE which is what I'm looking for. I just can't adapt this to the loop for whatever reason.
library(readr)
library(stringr)
library(magrittr)
library(dplyr)
library(tidyr)
library(pdftools)
nvsr65_05 <- pdf_text("https://gaming.nv.gov/modules/showdocument.aspx?documentid=13542")
getstats<- function(nvsr65_05){
listofdfs <- list() #Create a list in which you intend to save your df's.
for (i in 1:length(nvsr65_05)) {
table_data2 <- nvsr65_05[[i]] %>%
str_split(pattern = "\n")
table_data2 <- data.frame(matrix(unlist(table_data2)))
listofdfs[[i]] <- table_data2
}
return(listofdfs)
}
nevlists <- getstats(nvsr65_05)
names(nevlists) <-c(1:48)
for (i in 1:length(nevlists)) {
if(grepl("Taxable Revenue by Area", nevlists$'i'[3,]) == TRUE){
print(i)}}
#Try2
for (i in 1:length(nevlists)) {
if(as.numeric(grepl("Taxable Revenue by Area", nevlists$'i'[3,])) > 0){
print(i)}}
I'm not 100% sure, but I think this is due to the way you're indexing- try:
for (i in names(nevlists)) {
# Get the index as a character instead of numeric
# in case your names are something other than pure numbers as
# in this example
ix = paste(i)
if (grepl("Taxable Revenue by Area", nevlists[[ix]][3, ]) == TRUE) {
print(ix)
}
}
This does only print "48" for me- is that what you expect?
If you don't actually care about the name of the list item, you can ignore naming the list items at all and just do:
for (i in 1:length(nevlists)) {
if (grepl("Taxable Revenue by Area", nevlists[[i]][3, ]) == TRUE) {
print(i)
}
}
to output a numeric index value, which may be more useful depending on what you're after.

compare string in R

I'm currently working on a programming project in R (for school) and I'm using a data set made of a large quantity of LastFm users (an application that collects data when you're using a media player).
I want to work on an eventual link between 2 variables present in the dataset which are the "nickname" and the "real name". To do so, I would like to compute a variable that represents the rate of similarity between the characters.
As an example take one individual (regardless of the other variables):
name = 'chris meller'
nickname = 'mellertime'
So far, tried to sort the strings in order to to check for identical characters one by one but I'm stuck here. What i found is just a way to to check if "name" is present inside "nickname" with different kind of functions.
>paste(sort(unlist(strsplit(name, ""))), collapse = "")
[1] "eeeillmmrt"
>paste(sort(unlist(strsplit(nickname, ""))), collapse = "")
[1] " ceehillmrrs"
What I would like to know is if there is a way to count the number of identical letters between 2 character strings, regardless of the order?
I would like to end with something like this:
function(a,b)
[1] 0.63
# a,b are 2 character strings
where the result is the ratio of the number of identical character between the two strings divided by the number of characters in the real name.
Try this:
SimilarityRatio <- function(wholeName, nickname, matchCase) {
n1 <- sort(strsplit(paste(strsplit(wholeName, " ")[[1]], collapse = ""), "")[[1]])
n2 <- sort(strsplit(paste(strsplit(nickname, " ")[[1]], collapse = ""), "")[[1]])
if (!matchCase) {
n1 <- tolower(n1)
n2 <- tolower(n2)
}
MyLen <- tempLen <- length(n1)
j <- 1L
numMatch <- 0L
while (j <= tempLen) {
test1 <- n1[j] %in% n2
if (test1) {
myRemove <- min(which(n2 %in% n1[j]))
n1 <- n1[-j]
n2 <- n2[-myRemove]
numMatch <- numMatch + 1L
tempLen <- tempLen - 1L
} else {
j <- j+1L
}
}
numMatch/MyLen
}
Below are some test cases:
> SimilarityRatio("chris meller", "mellertime", FALSE)
[1] 0.6363636
> SimilarityRatio("SuperMan3000", "The3Musketeers", FALSE)
[1] 0.5
> SimilarityRatio("SuperMan3000", "The3Musketeers", TRUE)
[1] 0.4166667
> SimilarityRatio("should a garbage collection be performed immediately", "same expression can vary considerably depending on whether", FALSE)
[1] 0.7608696

Improve R script efficency

I am trying to match two very big data (nsar & crsp) sets. My code works quite well but needs a lot of time. My procedure works the following way:
Try match via ticker (thereby controlling that NAV (just a number) & date
is the same)
Try match via exact fund name (controlling for NAV & date)
Try match by closest match: search first for same NAV & date --> take list and consider only those companies that are the closest match for both match measures --> take remaining entries and find closest match (but match distance is restricted).
Any suggestions how I could improve the efficiency of the code:
#Go through each nsar entry and try to match with crsp
trackchanges = sapply(seq_along(nsar$fund),function(x){
#Define vars
ticker = nsar$ticker[x]
r_date = format(nsar$r_date[x], "%m%Y")
nav1 = nsar$NAV_share[x]
nav2 = nsar$NAV_sshare[x]
searchbyname = 0
if(nav1 == 0) nav1 = -99
if(nav2 == 0) nav2 = -99
########## If ticker is available --> Merge via ticker and NAV
if(is.na(ticker) == F)
{
#Look for same NAV, date and ticker
found = which(crsp$nasdaq == ticker & crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
#If nothing found
if(length(found) == 0)
{
#Mark that you should search by names
searchbyname = 1
} else { #ticker found
#Record crsp_fundno and that match is found
nsar$match[x] = 1
nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]]
assign("nsar",nsar,envir=.GlobalEnv)
#Return: 1 --> Merged by ticker
return(1)
}
}
###########
########### No Ticker available or found --> Exact name matching
if(is.na(ticker) == T | searchbyname == 1)
{
#Define vars
name = tolower(nsar$fund[x])
company = tolower(nsar$company[x])
#Exact name, date and same NAV
found = which(crsp$fund_name2 == name & crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
#If nothing found
if(length(found) == 0)
{
#####Continue searching by closest match
#First search for nav and date to get list of funds
allfunds = which(crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
allfunds_companies = crsp$company[allfunds]
#Check if anything found
if(length(allfunds) == 0)
{
#Return: 0 --> nothing found
return(0)
}
#Get best match by lev and substring measure for company
levmatch = levenstheinMatch(company, allfunds_companies)
submatch = substringMatch(company, allfunds_companies)
allfunds = levmatch[levmatch %in% submatch]
allfunds_names = crsp$fund_name2[allfunds]
#Check if now anything found
if(length(allfunds) == 0)
{
#Mark match (5=Company not found)
nsar$match[x] = 5
#Save globally
assign("nsar",nsar,envir=.GlobalEnv)
#Return: 5 --> Company not found
return(5)
}
#Get best match by all measures
levmatch = levenstheinMatch(name, allfunds_names)
submatch = substringMatch(name, allfunds_names)
#Only accept if identical
allfunds = levmatch[levmatch %in% submatch]
allfunds_names = crsp$fund_name2[allfunds]
if(length(allfunds) > 0)
{
#Mark match (3=closest name matching)
nsar$match[x] = 3
#Add crsp_fundno to nsar data
nsar$crsp_fundno[x] = crsp$crsp_fundno[allfunds[1]]
#Save globally
assign("nsar",nsar,envir=.GlobalEnv)
#Return 3=closest name matching
return(3)
} else {
#return 0 -> no match
return(0)
}
#####
} else { #If exact name,date,nav found
#Mark match (2=exact name matching)
nsar$match[x] = 2
#Add crsp_fundno to nsar data
nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]]
#Return 2=exact name matching
return(2)
}
}
})#End sapply
Thank you very much for any help!
Laurenz
The script is too complicated to provide a complete answer, but the basic problem is in the first line
#Go through each nsar entry...
where you set out the problem in an iterative way. R works best with vectors.
Hoist the vectorizable components from the sapply that you start your calculations with. For instance, format the r_date column.
nsar$r_date_f <- format(nsar$r_date, "%m%Y")
This advice applies to lines buried deeper in your code, too, for example calculating the rounded crsp$mnav should be done just once on the entire column
crsp$mnav_r <- round(crsp$mnav, 1)
Use R idioms where appropriate, if "-99" represents a missing value, then use NA
nav1 <- nsar$NAV_share
nav1[nav1 == -99] <- NA
nasr$nav1 <- nav1
Code from other packages that you might use is more likely to treat NA correctly.
Use well-established R functions for more complex queries. This is tricky, but if I'm reading your code correctly your query about "same NAV, date, and ticker" could use merge to do the joins, assuming the columns have been created by vectorized operations earlier in the code, as
nasr1 <- nasr[!is.na(nasr$ticker), , drop=FALSE]
df0 <- merge(nasr1, crsp,
by.x = c("ticker", rdate_r", "nav1_r"),
by.y = c("nasdaq", "caldt2", "mnav_r"))
This does not cover the "|" condition, so additional work would be needed. The plyr, data.table, and sqldf packages (among others) were developed in part to simplify these types of operations, so might be worth investigating as you get more comfortable with vectorized calculations.
It's hard to tell, but I think these three steps address the major challenges in your code.

R: create vector from nested for loop

I have a "hit list" of genes in a matrix. Each row is a hit, and the format is "chromosome(character) start(a number) stop(a number)." I would like to see which of these hits overlap with genes in the fly genome, which is a matrix with the format "chromosome start stop gene"
I have the following function that works (prints a list of genes from column 4 of dmelGenome):
geneListBuild <- function(dmelGenome='', hitList='', binSize='', saveGeneList='')
{
genomeColumns <- c('chr', 'start', 'stop', 'gene')
genome <- read.table(dmelGenome, header=FALSE, col.names = genomeColumns)
chr <- genome[,1]
startAdjust <- genome[,2] - binSize
stopAdjust <- genome[,3] + binSize
gene <- genome[,4]
genome <- data.frame(chr, startAdjust, stopAdjust, gene)
hits <- read.table(hitList, header=TRUE)
chrHits <- hits[hits$chr == "chr3R",]
chrGenome <- genome[genome$chr == "chr3R",]
genes <- c()
for(i in 1:length(chrHits[,1]))
{
for(j in 1:length(chrGenome[,1]))
{
if( chrHits[i,2] >= chrGenome[j,2] && chrHits[i,3] <= chrGenome[j,3] )
{
print(chrGenome[j,4])
}
}
}
genes <- unique(genes[is.finite(genes)])
print(genes)
fileConn<-file(saveGeneList)
write(genes, fileConn)
close(fileConn)
}
however, when I substitute print() with:
genes[j] <- chrGenome[j,4]
R returns a vector that has some values that are present in chrGenome[,1]. I don't know how it chooses these values, because they aren't in rows that seem to fulfill the if statement. I think it's an indexing issue?
Also I'm sure that there is a more efficient way of doing this. I'm new to R, so my code isn't very efficient.
This is similar to the "writing the results from a nested loop into another vector in R," but I couldn't fix it with the information in that thread.
Thanks.
I believe the inner loop could be replaced with:
gene.in <- ifelse( chrHits[i,2] >= chrGenome[,2] & chrHits[i,3] <= chrGenome[,3],
TRUE, FALSE)
Then you can use that logical vector to select what you want. Doing
which(gene.in)
might also be of use to you.

Resources