Remove duplicate rows for multiple dataframes - r

I have over 100 dataframes (df1, df2, df3, ...), each containing the same variables. I want to loop through all of them and remove duplicates by id. For df1, I can do:
df1 <- df1[!duplicated(df1$id), ]
How can I do this in an efficient way?

If you're dealing with 100 similarly-structured data.frames, I suggest instead of naming them uniquely, you put them in a list.
Assuming they are all named df and a number, then you can easily assign them to a list with something like:
df_varnames <- ls()[ grep("^df[0-9]+$", ls()) ]
or, as @MatteoCastagna suggested in a comment:
df_varnames <- ls(pattern = "^df[0-9]+$")
(which is both faster and cleaner). Then:
dflist <- sapply(df_varnames, get, simplify = FALSE)
And from here, your question is simply:
dflist2 <- lapply(dflist, function(z) z[!duplicated(z$id),])
If you must deal with them as individual data.frames (again, discouraged, almost always slows down processing while not adding any functionality), you can try a hack like this (using df_varnames from above):
for (dfname in df_varnames) {
  df <- get(dfname)
  assign(dfname, df[!duplicated(df$id), ])
}
I cringe when I consider using this, but I admit I may not understand your workflow.
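For anyone who wants to try the list approach end to end, here is a minimal sketch. The three small data frames are made up for illustration, and mget() is swapped in for the sapply(..., get, simplify = FALSE) call above; it should return the same kind of named list in one step.

```r
# Hypothetical example data: three small data frames sharing an `id` column.
df1 <- data.frame(id = c(1, 1, 2), value = c("a", "b", "c"))
df2 <- data.frame(id = c(3, 4, 4), value = c("d", "e", "f"))
df3 <- data.frame(id = c(5, 5, 5), value = c("g", "h", "i"))

# Collect every object named df<number> into a named list.
dflist <- mget(ls(pattern = "^df[0-9]+$"))

# Keep the first row seen for each id in every data frame.
dedup <- lapply(dflist, function(z) z[!duplicated(z$id), , drop = FALSE])

# Row counts per data frame after deduplication.
sapply(dedup, nrow)
```

Because the list is named, dedup$df1 or dedup[["df2"]] still gives access to an individual result without creating 100 separate variables.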

Related

Apply an `as.character()` function to a list of dataframes

So essentially I have a list of dataframes that I want to apply as.character() to.
To obtain the list of dataframes I have a list of files that I read in using a map() function and a read function that I created. I can't use map_df() because there are columns that are being read in as different data types. All of the files have the same structure, and I know that I could hard code the data types in the read function if I wanted, but I want to avoid that if I can.
At this point I throw the list of dataframes in a for loop and apply another map() function to apply the as.character() function. This final list of dataframes is then compressed using bind_rows().
All in all, this seems like an extremely convoluted process, see code below.
audits <- list.files()
my_reader <- function(x) {
  my_file <- read_xlsx(x)
}
audits <- map(audits, my_reader)
for (i in 1:length(audits)) {
  audits[[i]] <- map_df(audits[[i]], as.character)
}
audits <- bind_rows(audits)
Does anybody have any ideas on how I can improve this? Ideally to the point where I can do everything in a single vectorised map() function?
For reproducibility you can use two iris datasets with one of the columns datatypes changed.
iris2 <- iris
iris2[[1]] <- as.character(iris2[[1]])
my_list <- list(iris, iris2)
as.character works on a vector, whereas a data.frame is a list of vectors, so it has to be applied column by column. An option is to use across if we want only a single use of map:
library(dplyr)
library(purrr)
map_dfr(my_list, ~ .x %>%
  mutate(across(everything(), as.character)))
I wanted to show a base R solution in case it helps anyone else. You can use rapply to recursively go through the list and apply a function. You can specify the classes to match and whether to replace elements in place or unlist/list the returned object:
iris2 <- iris
iris2[[1]] <- as.character(iris2[[1]])
my_list <- list(iris, iris2)
mylist2 <- rapply(my_list, f = as.character, classes = "ANY", how = "replace")
bigdf <- do.call(rbind, mylist2)
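Another base R variant, sketched here as an alternative rather than as what either answer above used, is the d[] <- lapply(d, ...) idiom. It converts every column while keeping each element a data.frame. The helper name to_chr is made up for this example.

```r
# Reproduce the mismatch: same columns, but one column differs in type.
iris2 <- iris
iris2[[1]] <- as.character(iris2[[1]])
my_list <- list(iris, iris2)

# Hypothetical helper: d[] <- ... replaces the columns in place,
# so the result is still a data.frame rather than a plain list.
to_chr <- function(d) {
  d[] <- lapply(d, as.character)
  d
}

# Convert each data frame, then stack them.
bigdf <- do.call(rbind, lapply(my_list, to_chr))
str(bigdf)
```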

Create a variable in Multiple Dataframes in R

I want to create a ranked variable that will appear in multiple data frames.
I'm having trouble getting the ranked variable into the data frames.
Simple code. Can't make it happen.
dfList <- list(df1,df2,df3)
for (df in dfList){
  rAchievement <- rank(df["Achievement"])
  df[[rAchievement]]<-rAchievement
}
The result I want is for df1, df2 and df3 to each gain a new variable called rAchievement.
I'm struggling!! And my apologies. I know there are similar questions out there. I have reviewed them all. None seem to work and accepted answers are rare.
Any help would be MUCH appreciated. Thank you!
We can use lapply with transform in a single line
dfList <- lapply(dfList, transform, rAchievement = rank(Achievement))
If we need to update the objects 'df1', 'df2', 'df3', set the names of the 'dfList' with the object names and use list2env (not recommended though)
names(dfList) <- paste0("df", 1:3)
list2env(dfList, .GlobalEnv)
Or, using a for loop, we loop over the sequence of the list, extract each list element, and assign a new column based on the rank of 'Achievement':
for(i in seq_along(dfList)) {
  dfList[[i]][['rAchievement']] <- rank(dfList[[i]]$Achievement)
}
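To see why looping over the list elements directly does not stick, here is a small sketch with made-up data (the two data frames and their values are assumptions for illustration). The loop variable is a copy, so changes to it are lost, while assigning back by index persists.

```r
# Hypothetical data: two data frames with an Achievement column.
df1 <- data.frame(Achievement = c(10, 30, 20))
df2 <- data.frame(Achievement = c(5, 1, 9))
dfList <- list(df1 = df1, df2 = df2)

# Modifying the loop variable changes a copy, not the list element.
for (df in dfList) {
  df$rAchievement <- rank(df$Achievement)
}
before <- "rAchievement" %in% names(dfList$df1)
before  # FALSE: the list is unchanged

# Assigning back into the list by position does persist.
for (i in seq_along(dfList)) {
  dfList[[i]]$rAchievement <- rank(dfList[[i]]$Achievement)
}
dfList$df1$rAchievement
```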

lapply set of functions across multiple dataframes

I have a set of functions I need to apply to several dataframes. I want to use the lapply function instead of for() loops.
# sample data frame
id lastpage attribute_2
 1       20         232
 2        8         232
 3        6         129
 4       20        1271
 5       20         129
 6       20          74
The functions work when I apply them to one dataframe at a time. They remove duplicates (based on attribute_2), dropping the rows with the lowest values for variable 'lastpage':
df <- df[order(df$attribute_2, -df$lastpage),]
df <- df[!duplicated(df$attribute_2),]
When I try to (l)apply this function to several dataframes, nothing seems to have changed when I call the dataframe. Intuitively I think I am messing up something when calling df, but I am not sure what:
df.list <- list(df0, df1, df2, df3)
myFunc <- function(df) {
  df <- df[order(df$attribute_2, -df$lastpage),]
  df <- df[!duplicated(df$attribute_2),]
  return(df)
}
df.list <- lapply(df.list, FUN = myFunc)
Your help is much appreciated!
I have looked at all similar previous questions on lapply functions, specifically this one: Applying a set of operations across several data frames in r
I am probably making a very obvious mistake, but I just can't find it.
EDIT: thanks everyone for the help
For anyone wondering what code I exactly use now:
df.list <- list(df0, df1, df2, df3)
myFunc <- function(x) {
  x <- x[order(x$attribute_2, -x$lastpage),]
  x <- x[!duplicated(x$attribute_2),]
}
df.list2 <- lapply(df.list, myFunc)
df2_c<-df.list2[[3]]
Your code probably works as expected but you’re assigning its result to df.list, not to the original data.frames. The list contains copies of these, so they would never get modified. This is intentional, and the desired behaviour in R.
In fact, just keep working with your list of data.frames.
This example does what you intend to do:
set.seed(314)
df <- data.frame(x = sample(1:10, size = 50, replace = TRUE),
                 y = sample(1:10, size = 50, replace = TRUE))
df.list <- list(df,df,df,df)
lapply(df.list,nrow)
testfunction <- function(data){
  data[!duplicated(data$x),]
}
lapply(df.list, testfunction)
I think there is something wrong with your function. I noticed that you reference column email which is not in your dataframe.
It is also advisable to rename the variables that are used inside the function, so you don't reference global variables.
And as Konrad said in the other answer, your original dataframes stayed the same, so call them for example as follows:
df.list2 <- lapply(df.list, testfunction)
df.list2[[1]]
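Putting the pieces together with the sample data from the question, a minimal sketch could look like the following. The function name keep_highest is made up here; the ordering and deduplication steps are the ones from the question.

```r
# Sample data frame from the question.
df <- data.frame(
  id = 1:6,
  lastpage = c(20, 8, 6, 20, 20, 20),
  attribute_2 = c(232, 232, 129, 1271, 129, 74)
)

# Hypothetical helper: sort so the highest lastpage comes first within each
# attribute_2 value, then keep only the first row per attribute_2.
keep_highest <- function(x) {
  x <- x[order(x$attribute_2, -x$lastpage), ]
  x[!duplicated(x$attribute_2), ]
}

df.list <- list(df, df)
df.list2 <- lapply(df.list, keep_highest)

# The original df is untouched; the cleaned copies live in df.list2.
df.list2[[1]]
```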

How to apply a single argument function to all columns within all dataframes in a list

Say I have a list dflist which contains dataframes df1 and df2.
df1 <- data.frame(VAR1 = letters[1:10], VAR2 = seq(1:10))
df2 <- data.frame(VAR3 = letters[11:20], VAR4 = seq(11:20))
dflist <- list(df1 = df1, df2 = df2)
In general, I want to apply a single argument function to each of the variables in each dataframe in the list. To make the question more concrete, say I'm interested in setting the variable names to lowercase. Using a dataframe paradigm, I'd just do this:
colnames(df1) <- tolower(colnames(df1))
colnames(df2) <- tolower(colnames(df2))
However, this becomes prohibitive when I have dozens of variables in each of the 20 or 30 dataframes I'm working on, hence the shift to using lists.
I'm aware that this question stems from my fundamental misunderstanding of the *apply family of functions, but I've been unable to locate examples of functions applied to deeper than the first sublevel of a list. Thanks for any input.
As @akrun suggested, the answer is simply:
lapply(dflist, function(x) {colnames(x) <- tolower(colnames(x)); x })
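For the more general case in the question (a one-argument function applied to every column of every data frame), a nested lapply is one way to go one level deeper. This is only a sketch: rev is an arbitrary stand-in for whatever single-argument function is needed, and setNames is used as a shorter spelling of the renaming step.

```r
df1 <- data.frame(VAR1 = letters[1:10], VAR2 = 1:10)
df2 <- data.frame(VAR3 = letters[11:20], VAR4 = 11:20)
dflist <- list(df1 = df1, df2 = df2)

# Level 1: a function applied to each data frame (here, its column names).
dflist <- lapply(dflist, function(x) setNames(x, tolower(names(x))))

# Level 2: a function applied to each column of each data frame.
# x[] <- ... keeps every element a data.frame.
dflist_rev <- lapply(dflist, function(x) {
  x[] <- lapply(x, rev)
  x
})

names(dflist_rev$df1)
dflist_rev$df1$var2
```

Note that the result has to be assigned (here to dflist and dflist_rev); lapply never modifies the original list in place.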

R: merging matrices (not data.frames)

merge is a very nice function: it merges matrices and data.frames, and returns a data.frame.
Having rather big character matrices, is there another good way to merge without data.frame conversion?
Comment 1:
A small function to merge a named vector with a matrix or data.frame. Elements of the vector can link to multiple entries in the matrix:
expand <- function(v,m,by.m,v.name='v',...) {
  df <- do.call(rbind,lapply(names(v),function(x) {
    pos <- which(m[,by.m] %in% v[x])
    cbind(x,m[pos,],...)
  }))
  colnames(df)[1] <- v.name
  df
}
Example:
v <- rep(letters,each=3)[seq_along(letters)]
names(v) <- letters
m <- data.frame(a=unique(v),b=seq_along(unique(v)),stringsAsFactors=FALSE)
expand(v,m,'a')
You can use a combination of match and cbind to do the equivalent of merge without conversion to data frame, a simple example:
st1 <- state.x77[ sample(1:50), ]
st2 <- as.matrix( USArrests )[ sample(1:50), ]
tmp1 <- match(rownames(st1), rownames(st2) )
st3 <- cbind( st1, st2[tmp1,] )
head(st3)
Keeping track of which columns you want, and merging with many-to-one relationships or missing rows in one group, require a bit more thought but are still possible.
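As a rough sketch of the missing-rows case, match returns NA for keys that have no partner, and indexing with NA produces NA rows, which gives something like merge(..., all.x = TRUE). The two small character matrices here are made up for illustration.

```r
# Hypothetical character matrices keyed by row name; b lacks key "k2".
a <- matrix(c("x1", "x2", "x3"), ncol = 1,
            dimnames = list(c("k1", "k2", "k3"), "colA"))
b <- matrix(c("y3", "y1"), ncol = 1,
            dimnames = list(c("k3", "k1"), "colB"))

# Position of each key of a in b; NA where b has no matching row.
idx <- match(rownames(a), rownames(b))

# Left join: unmatched keys get NA in the columns coming from b.
left <- cbind(a, b[idx, , drop = FALSE])

# Inner join: drop the keys that had no match.
inner <- left[!is.na(idx), , drop = FALSE]

left
inner
```

Both results stay character matrices throughout; nothing is converted to a data.frame.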
No, not without either (a) overwriting the merge function or (b) creating a new merge.matrix() S3 function (this would be the right approach to the problem).
You can see in the merge help:
Value
A data frame.
Also, the merge.default function:
> merge.default
function (x, y, ...)
merge(as.data.frame(x), as.data.frame(y), ...)
There is now a merge.Matrix function in the Matrix.utils package. This works on combinations of matrices as well as capital M Matrices, data.frames, etc.
The match solution is nice, but as someone pointed out does not work on m:n relationships. It also does not implement the other features of merge, including all.x, all.y, etc.
