I am currently trying to find some potential duplicates in a large data set (500,000+ lines) using fuzzy matching. There are three main parts to this code:
A function that I have written that identifies the most likely potential duplicate in a data set (by returning a score - it selects the highest score).
A function that identifies the position of the record that is the most likely to be a duplicate.
A for loop that runs both of the functions above for every record and returns values in the dupScore column and the positionBestMatch column.
An example of a resulting dataset is below:
Name       DOB         dupScore  positionBestMatch
Ben        6/3/1994    15        3
Abe        5/5/2005    11        5
Benjamin   6/3/1994    15        1
Gabby      01/01/1900  10        6
Abraham    5/5/2005    11        2
Gabriella  01/01/1900  10        4
The for loop to calculate these scores looks a bit like this (scorefunc and positionfunc are self-written functions):
for (i in 1:nrow(df)) {
  df$dupScore[i] <- scorefunc(i)
  df$positionBestMatch[i] <- positionfunc(i)
}
Obviously, on a data set with so many rows, this loop is time-consuming and computationally intensive, as it visits every row. How can I edit my for loop so that:
When a dupScore is calculated for row [i], the score is inserted not only in row [i] but also in the row given by positionBestMatch?
The loop only runs for rows whose dupScore and positionBestMatch values are still empty.
I hope this makes sense!
Try using a while loop
all_inds <- seq_len(nrow(df))
while (length(all_inds) > 1) {
  i <- all_inds[1]
  df$dupScore[i] <- scorefunc(i)
  df$positionBestMatch[i] <- positionfunc(i)
  # give the best-matching row the same score so it never needs its own pass
  df$dupScore[df$positionBestMatch[i]] <- df$dupScore[i]
  # drop both the current row and its best match from the indices left to process
  all_inds <- setdiff(all_inds, c(i, df$positionBestMatch[i]))
}
But this will still leave some empty values in df$positionBestMatch: rows that were filled in as someone else's best match are never visited themselves.
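If you instead want the loop to process only rows that are still empty (for example when part of the data set has already been scored), a minimal variation of the same idea is below. This is just a sketch: it assumes unscored rows hold NA and that scorefunc() and positionfunc() take a row index, as in the loop above.

todo <- which(is.na(df$dupScore))
while (length(todo) > 0) {
  i <- todo[1]
  df$dupScore[i] <- scorefunc(i)
  df$positionBestMatch[i] <- positionfunc(i)
  # reuse the score for the best-matching row
  df$dupScore[df$positionBestMatch[i]] <- df$dupScore[i]
  # remove both rows from the remaining work list
  todo <- setdiff(todo, c(i, df$positionBestMatch[i]))
}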
Related
In my current project, I have around 8.2 million rows. I want to scan all rows and apply a certain function whenever the value of a specific column is not zero.
counter = 1
for (i in 1:nrow(data)) {
  if (data[i, 8] != 0) {
    totalclicks = sum(data$Clicks[counter:(i-1)])
    test$Clicks[i] = totalclicks
    counter = i
  }
}
In the above code, I search the specific column over 8.2 million rows, and whenever the value is not zero I calculate a sum over the preceding values. The problem is that the for and if loops are too slow: it takes 1 hour for 50K rows. I have heard that the apply family is an alternative for this, but the following code also takes too long:
sapply(1:nrow(data), function(x)
  if (data[x, 8] != 0) {
    totalclicks = sum(data$Clicks[counter:(x-1)])
    test$Clicks[x] = totalclicks
    counter = x
  })
[Updated]
Kindly consider the following as sample dataset:
clicks  revenue  new_column (sum of previous clicks)
1       0
2       0
3       5        3
1       0
4       0
2       7        8
I want the above kind of solution, in which I go through all the rows and, whenever a non-zero revenue value is encountered, add up all the previous values of clicks.
Am I missing something? Please correct me.
The aggregate() function can be used for splitting your long dataframe into chunks and performing an operation on each chunk. Here cumsum(data$Revenue) acts as a group ID that changes at every non-zero revenue row, so you can apply it to your example as:
data <- data.frame(Clicks = c(1, 2, 3, 1, 4, 2),
                   Revenue = c(0, 0, 5, 0, 0, 7),
                   new_column = NA)
sub_totals <- aggregate(data$Clicks, list(cumsum(data$Revenue)), sum)
data$new_column[data$Revenue != 0] <- head(sub_totals$x, -1)
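If I have run this correctly on the sample data above, the result should look like:

data
#   Clicks Revenue new_column
# 1      1       0         NA
# 2      2       0         NA
# 3      3       5          3
# 4      1       0         NA
# 5      4       0         NA
# 6      2       7          8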
Let's say I have a function called remove_fun which reduces the number of rows of a dataframe based on some conditions (the function is too verbose to include in this question). It takes as its input a dataframe with 2 columns. For example, an input df called block_2_df could look like this:
block_2_df
Treatment seq
1 29
1 23
3 60
1 6
2 41
1 5
2 44
For this example, let's say the function remove_fun removes 1 row at a time based on the highest value of seq in block_2_df$seq. Applying remove_fun once would result in a new dataframe that looks like this:
remove_fun(block_2_df)
Treatment seq
1 29
1 23
1 6
2 41
1 5
2 44
I.e., the row containing seq==60 in block_2_df was removed via remove_fun.
I can create a while loop which repeats this operation on block_2_df via remove_fun based on the number of rows remaining in block_2_df as:
while (dim(block_2_df)[1] > 1) {
  block_2_df <- remove_fun(block_2_df)
  print(remove_fun(block_2_df))
}
This while loop reduces block_2_df until it has 1 row left (the lowest value of block_2_df$seq), and prints out the 'updated' versions of block_2_df until it is reduced to one row.
However, I'd like to save each 'updated' version of block_2_df (i.e. block_2_df with 7, then 6, then 5, ..., then 1 row) produced by the while loop. How can I accomplish this? I know that for for loops this could be done by creating an empty list and storing each 'updated' block_2_df in the ith element of that list, but I'm not sure how to do something similar in a while loop. It would be great to have a list of dfs as output from this while loop.
Just create and maintain an index counter yourself. It's a bit more work than a for() loop, which maintains the index for you, but it's not difficult:
saved <- list()
i <- 1
while (dim(block_2_df)[1] > 1) {
  block_2_df <- remove_fun(block_2_df)
  saved[[i]] <- block_2_df
  i <- i + 1
  print(block_2_df)
}
Also, you were calling remove_fun twice in your loop, which was probably not what you wanted to do. I've corrected that; if I'm wrong, please say so.
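If you also want the original (7-row) block_2_df as the first element of the list, one small tweak (just a sketch) is to seed the list before the loop and start the counter at 2; the loop itself stays the same:

saved <- list(block_2_df)  # element 1 is the untouched starting data frame
i <- 2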
I am trying to build a database in R from multiple csvs. There are NAs spread throughout each csv, and I want to build a master list that summarizes all of the csvs in a single database. Here is some quick code that illustrates my problem (most csvs actually have 1000s of entries, and I would like to automate this process):
d1=data.frame(common=letters[1:5],species=paste(LETTERS[1:5],letters[1:5],sep='.'))
d1$species[1]=NA
d1$common[2]=NA
d2=data.frame(common=letters[1:5],id=1:5)
d2$id[3]=NA
d3=data.frame(species=paste(LETTERS[1:5],letters[1:5],sep='.'),id=1:5)
I have been going around in circles (writing loops), trying to use merge and reshape(melt/cast) without much luck, in an effort to succinctly summarize the information available. This seems very basic but I can't figure out a good way to do it. Thanks in advance.
To be clear, I am aiming for a final database like this:
common species id
1 a A.a 1
2 b B.b 2
3 c C.c 3
4 d D.d 4
5 e E.e 5
I recently had a similar situation. The code below goes through all the variables and pulls back as much information as possible into the dataset. Once all of the data is there, running it one last time on the first variable gives the result.
# combine all into one dataframe
require(gtools)
d <- smartbind(d1, d2, d3)

# function to get the first non-NA result
getfirstnonna <- function(x){
  ret <- head(x[which(!is.na(x))], 1)
  ret <- ifelse(is.null(ret), NA, ret)
  return(ret)
}

# function to get max info based on one variable
runiteration <- function(dataset, variable){
  require(plyr)
  e <- ddply(.data = dataset, .variables = variable,
             .fun = function(x){ apply(X = x, MARGIN = 2, FUN = getfirstnonna) })
  # return the above without the NA "factor"
  return(e[which(!is.na(e[, variable])), ])
}

# run through all variables
for(i in 1:length(names(d))){
  d <- rbind(d, runiteration(d, names(d)[i]))
}

# repeat first variable since all possible info should be available in dataset
d <- runiteration(d, names(d)[1])
If id, species, etc. differ between the separate datasets, then this will return whichever non-NA value appears first, so changing the row order in d or the variable order could affect the result. Changing the getfirstnonna function would alter this behaviour (tail would pick the last value; you could even collect all possibilities). You could also order the dataset from the most complete records to the least.
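As a quick sanity check of the getfirstnonna() helper on its own:

getfirstnonna(c(NA, "B.b", NA))  # "B.b" - the first non-NA element
getfirstnonna(c(NA, NA))         # NA   - nothing non-NA to return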
Super short version: I'm trying to use a user-defined function to populate a new column in a dataframe with the command:
TestDF$ELN<-EmployeeLocationNumber(TestDF$Location)
However, when I run the command, it seems to just apply EmployeeLocationNumber to the first row's value of Location rather than using each row's value to determine the new column's value for that row individually.
Please note: I'm trying to understand R, not just perform this particular task. I was actually able to get the output I was looking for using the apply() family, but that's irrelevant. My understanding is that the above line should work on a row-by-row basis, but it doesn't.
Here are the specifics for testing:
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))
This testDF keeps track of where each of 3 employees was over the course of the year among several locations.
(You can think of "Location" as unique to each Employee... it is essentially a unique ID for that row.)
The function EmployeeLocationNumber takes a location and outputs a number indicating the order in which that employee visited it. For example, EmployeeLocationNumber(8) = 2 because it was the second location visited by the employee who visited it.
EmployeeLocationNumber <- function(Site){
  CurrentEmployee <- subset(TestDF, Location == Site, select = Employee, drop = TRUE)[[1]]
  LocationDate <- subset(TestDF, Location == Site, select = Month, drop = TRUE)[[1]]
  LocationNumber <- length(subset(TestDF, Employee == CurrentEmployee & Month <= LocationDate, select = Month)[[1]])
  return(LocationNumber)
}
I realize I probably could have packed all of that into a single subset command, but I didn't know how referencing worked when you used subset commands inside other subset commands.
So, keeping in mind that I'm really trying to understand how to work in R, I have a few questions:
Why won't TestDF$ELN<-EmployeeLocationNumber(TestDF$Location) work row-by-row like other assignment statements do?
Is there an easier way to reference a particular value in a dataframe based on the value of another one? Perhaps one that does not return a dataframe/list that then must be flattened and extracted from?
I'm sure the function I'm using is laughably un-R-like... what should I have done to essentially emulate an INNER JOIN-type query?
Using logical indexing, the condensed one-liner replacement for your function is:
EmployeeLocationNumber <- function(Site){
with(TestDF[do.call(order, TestDF), ], which(Location[Employee==Employee[which(Location==Site)]] == Site))
}
Of course this isn't the most readable way, but it demonstrates the principles of logical indexing and which() in R. Then, like others have said, just wrap it up with a vectorized *ply function to apply this across your dataset.
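For example, one way to do that wrapping (sapply simply calls the function once per Location value):

TestDF$ELN <- sapply(TestDF$Location, EmployeeLocationNumber)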
A) TestDF$Location is a vector. Your function is not set up to return a vector, so giving it a vector will probably fail.
B) In what sense is Location:8 the "second location visited"?
C) If you want within-group ordering then you need to pass your dataframe, split up by employee, to a function that calculates a result.
D) Conditional access of a data.frame typically involves logical indexing and/or the use of which().
If you just want the sequence of visits by employee try this:
(Using Month as the argument, since that is what determines the sequence of locations; seq just numbers the rows within each employee.)
with(TestDF, ave(Month, Employee, FUN=seq))
[1] 1 2 3 4 1 2 1 2 3
To store it as a column:
TestDF$LocOrder <- with(TestDF, ave(Month, Employee, FUN=seq))
If you wanted the second location for EE:3 it would be:
subset(TestDF, LocOrder == 2 & Employee == 3, select = Location)
#   Location
# 8        2
The vectorized nature of R (what you are calling row-by-row) works not by repeatedly calling the function with each successive value of the arguments, but by passing the entire vector at once and operating on all of it at one time. But EmployeeLocationNumber only returns a single value, so that value gets repeated for the entire data set.
Also, your example for EmployeeLocationNumber does not match your description.
> EmployeeLocationNumber(8)
[1] 3
Now, one way to vectorize a function in the manner you are thinking (repeated calls for each value) is to pass it through Vectorize()
TestDF$ELN<-Vectorize(EmployeeLocationNumber)(TestDF$Location)
which gives
> TestDF
Employee Month Location ELN
1 1 1 1 1
2 1 5 5 2
3 1 6 6 3
4 1 11 7 4
5 2 4 10 1
6 2 10 3 2
7 3 1 4 1
8 3 5 2 2
9 3 10 8 3
As to your other questions, I would just write it as
TestDF$ELN<-ave(TestDF$Month, TestDF$Employee, FUN=rank)
The logic is: take the months, look at them in groups by employee separately, and give the rank order of the months within each group (i.e. where each falls in order).
Your EmployeeLocationNumber function takes a vector in and returns a single value.
The assignment to create a new data.frame column therefore just gets a single value:
EmployeeLocationNumber(TestDF$Location) # returns 1
TestDF$ELN<-1 # Creates a new column with the single value 1 everywhere
Assignment doesn't do any magic like that. It takes a value and puts it somewhere; in this case, the value 1. If the value were a vector of the same length as the number of rows, it would work as you wanted.
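You can check this directly:

length(EmployeeLocationNumber(TestDF$Location))  # 1 - a single value, not one per row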
I'll get back to you on that :)
Ditto.
Update: I finally worked out some code to do it, but by then @DWin had posted a much better solution :(
TestDF$ELN <- unlist(lapply(split(TestDF, TestDF$Employee), function(x) rank(x$Month)))
...I guess the ave function does pretty much what the code above does. But for the record:
First I split the data.frame into sub-frames, one per employee. Then I rank the months (just in case your months are not in order). You could use order too, but rank can handle ties better. Finally I combine all the results into a vector and put it into the new column ELN.
Update again: Regarding question 2, "What is the best way to reference a value in a dataframe?":
This depends a bit on the specific problem, but if you have a value, say Employee=3 and want to find all rows in the data.frame that matches that, then simply:
TestDF$Employee == 3 # Returns logical vector with TRUE for all rows with Employee == 3
which(TestDF$Employee == 3) # Returns a vector of indices instead
TestDF[which(TestDF$Employee == 3), ] # Subsets the data.frame on Employee == 3
So earlier I answered my own question on thinking in vectors in R. But now I have another problem which I can't 'vectorize.' I know vectors are faster and loops slower, but I can't figure out how to do this in a vector method:
I have a data frame (which for sentimental reasons I like to call my.data) on which I want to do a full marginal analysis. I need to remove certain elements one at a time and 'value' the data frame, then iterate again, removing only the next element, and so on. The idea is to do a full marginal analysis on a subset of my data. Anyhow, I can't conceive of how to do this in a vector-efficient way.
I've shortened the looping part of the code down and it looks something like this:
for (j in my.data$item[my.data$fixed==0]) {   # <-- selects the items I want to loop through
  my.data.it <- my.data[my.data$item != j, ]  # <-- this kicks item j out of the list
  sum.data <- aggregate(my.data.it, by=list(year), FUN=sum, na.rm=TRUE)  # <-- do an aggregation
  do(a.little.dance) && make(a.little.love) -> get.down(tonight)  # <-- a little song and dance
  delta <- (get.love)  # <-- get some love
  delta.list <- append(delta.list, delta, after=length(delta.list))  # <-- put my love in a vector
}
So obviously I hacked out a bunch of stuff in the middle, just to make it less clumsy. The goal would be to remove the j loop using something more vector efficient. Any ideas?
Here's what seems like another very R-type way to generate the sums. Generate a vector that is as long as your input vector, containing nothing but the repeated sum of the n elements. Then subtract your original vector from the sums vector. The result is a vector (isums) where each entry is the sum of your original vector less the ith element.
> (my.data$item[my.data$fixed==0])
[1] 1 1 3 5 7
> sums <- rep(sum(my.data$item[my.data$fixed==0]),length(my.data$item[my.data$fixed==0]))
> sums
[1] 17 17 17 17 17
> isums <- sums - (my.data$item[my.data$fixed==0])
> isums
[1] 16 16 14 12 10
Strangely enough, learning to vectorize in R is what helped me get used to basic functional programming. A basic technique would be to define your operations inside the loop as a function:
data <- ...
items <- ...
leave_one_out <- function(i) {
  data1 <- data[items != i, ]  # drop the rows for item i
  delta <- ...                 # some operation on data1
  return(delta)
}
for (j in items) {
  delta.list <- cbind(delta.list, leave_one_out(j))
}
To vectorize, all you do is replace the for loop with the sapply mapping function:
delta.list <- sapply(items, leave_one_out)
This is no answer, but I wonder if any insight lies in this direction:
> tapply((my.data$item[my.data$fixed==0])[-1], my.data$year[my.data$fixed==0][-1], sum)
tapply produces a table of statistics (sums, in this case; the third argument) grouped by the parameter given as the second argument. For example
2001 2003 2005 2007
   1    3    5    7
The [-1] notation drops observation (row) one from the selected rows. So you could loop and use [-i] on each iteration:
for (i in 1:length(my.data$item)) {
  tapply((my.data$item[my.data$fixed==0])[-i], my.data$year[my.data$fixed==0][-i], sum)
}
keeping in mind that if you have any years with only 1 observation, then the tables returned by the successive tapply calls won't have the same number of columns. (I.e., if you drop the only observation for 2001, then 2003, 2005, and 2007 would be the only columns returned.)
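As written, the loop discards each table. A small sketch of how you might keep them (assuming the same my.data columns used above):

kept  <- my.data$fixed == 0
items <- my.data$item[kept]
years <- my.data$year[kept]
tables <- vector("list", length(items))
for (i in seq_along(items)) {
  # sum of item by year with observation i left out
  tables[[i]] <- tapply(items[-i], years[-i], sum)
}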