I am trying to remove rows that have duplicate entries, as defined by two columns, from multiple dataframes located in a single list.
Simple data:
aa <- data.frame(a=rnorm(100),b=rnorm(100),x=rnorm(100),y=rnorm(100),Z=rep(1:4, each=25))
split.aa<-split(aa, aa$Z)
For each df in the list 'split.aa' I am trying to remove rows with duplicated x,y pairs.
I could do this one df at a time with:
split.aa[[z]][!duplicated(split.aa[[z]][,c('x','y')]),]
where z is the name of each df within 'split.aa'.
How would I write this into lapply so that the action is performed on each element?
I am having a hard time wrapping my head around how to refer to the specific list elements within the lapply function.
lapply(split.aa, function(x) x[!duplicated(x[c("x", "y")]), ])
will do the trick.
Just define an anonymous function inside lapply:
lapply(split.aa, function(x) x[!duplicated(x[c("x", "y")]), ])
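As a minimal sketch of the answer above: the rnorm sample from the question will practically never contain duplicated x,y pairs, so the small hand-made data frame below (an assumption for illustration only) is used to make the effect visible. Inside the anonymous function, d is simply whichever list element lapply is currently processing.

```r
# Hypothetical toy data with deliberate duplicate (x, y) pairs in each group
aa <- data.frame(
  x = c(1, 1, 2, 3, 3, 3),
  y = c(5, 5, 6, 7, 7, 8),
  Z = rep(1:2, each = 3)
)
split.aa <- split(aa, aa$Z)

# d is the data frame for one group; keep only the first row of each (x, y) pair
dedup <- lapply(split.aa, function(d) d[!duplicated(d[c("x", "y")]), ])

# Number of rows left in each group
sapply(dedup, nrow)
```

The result is still a list with the same names as split.aa, so the elements can be accessed as dedup[["1"]], dedup[["2"]], and so on.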
Given a dataframe df and a function f which is applied to df:
df[] <- lapply(df, f)
What is the magic R is performing to replace columns in df with collection of vectors in the list from lapply? I see that the result from lapply is a list of vectors having the same names as the dataframe df. I assume some magic mapping is being done to map the vectors to df[], which is the collection of columns in df (methinks). Just works? Trying to better understand so that I remember what to use the next time.
A data.frame is essentially a list of vectors that all have the same length. You can check this with is.list(a_data_frame), which returns TRUE.
[] can mean different things depending on the object it is applied to. It can even be redefined, since it is in fact a function.
On a data.frame, [] lets you extract or assign columns:
df[1] gets the first column (as a one-column data.frame)
df[1] <- 2 replaces the first column with 2 (recycled to the same length as the other columns)
df[] returns the whole data.frame
df[] <- list(c1,c2,c3) replaces the contents of the data.frame while keeping its structure
There are also many other ways to access or set data in a data.frame (by column name, by subset of rows or columns, ...).
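A short sketch of the difference between assigning to df and assigning to df[]. The data frame and the function f here are made up for illustration; any function that returns a vector of the same length would behave the same way.

```r
df <- data.frame(a = 1:3, b = c(10, 20, 30))
f <- function(v) v * 2      # hypothetical stand-in for f

# lapply on its own returns a plain named list, one element per column
as_list <- lapply(df, f)
class(as_list)              # "list"

# Assigning into df[] fills the existing columns from the list elements,
# so the object keeps its data.frame class, names and row names
df[] <- lapply(df, f)
class(df)                   # "data.frame"
```

Writing df <- lapply(df, f) instead would overwrite df with the plain list, which is why the empty brackets matter.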
I am trying to remove the first 9 rows of multiple dataframes that have the same structures but different names (keeping similar name structure). In my example, there are 4 dataframes with respectively the names
Mydataframe_A, Mydataframe_B, Mydataframe_C, Mydataframe_D.
Currently it is working with the following code:
`Mydataframe_A`<- `Mydataframe_A`[-c(1:9),]
`Mydataframe_B`<- `Mydataframe_B`[-c(1:9),]
`Mydataframe_C`<- `Mydataframe_C`[-c(1:9),]
`Mydataframe_D`<- `Mydataframe_D`[-c(1:9),]
But I would like to write this in a single line, without having to spell out the name of each dataframe every time.
I think this could work by using a pattern name and lists because for example this is what I am doing to rbind different dataframes:
All_mydataframes <- rbindlist(mget(ls(pattern = "^Mydataframe_")))
Any idea how to do this?
Thanks a ton!
Since mget turns this into a list, you can use apply family functions:
rbindlist(lapply(mget(ls(pattern = "^Mydataframe_")), function(x) x[-c(1:9), ]))
This takes the list from mget, removes the first 9 rows of each element, then row-binds the list into a single data.table. The only drawback is that you can no longer tell which data.frame each row originally came from.
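If the goal is to trim each data frame in place, as in the question, a base R sketch along these lines may work. The example data are invented, and list2env is used here as one possible way to write the trimmed copies back under their original names. The last step shows one way to keep track of the source when stacking; data.table's rbindlist also has an idcol argument for that purpose.

```r
# Hypothetical example data: four data frames with 12 rows each
Mydataframe_A <- data.frame(id = 1:12, value = 101:112)
Mydataframe_B <- data.frame(id = 1:12, value = 201:212)
Mydataframe_C <- data.frame(id = 1:12, value = 301:312)
Mydataframe_D <- data.frame(id = 1:12, value = 401:412)

# Collect the data frames by name pattern and drop the first 9 rows of each
frames  <- mget(ls(pattern = "^Mydataframe_"))
trimmed <- lapply(frames, function(x) x[-(1:9), ])

# Write the trimmed copies back over the original objects
list2env(trimmed, envir = environment())

# Optionally stack them, adding a column that records the source data frame
combined <- do.call(
  rbind,
  Map(function(d, nm) cbind(source = nm, d), trimmed, names(trimmed))
)
```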
I have a dataframe with cases that repeat on the rows. Some rows have more complete data than others. I would like to group cases and then assign the first non-missing value to all NA cells in that column for that group. This seems like a simple enough task but I'm stuck. I have working syntax but when I try to use apply to apply the code to all columns in the dataframe I get a list back instead of a dataframe. Using do.call(rbind) or rbindlist or unlist doesn't quite fix things either.
Here's the syntax.
df$groupid<-group_indices (df,id1,id2) #creates group id on the basis of a combination of two variables
df%<>%group_by(id1,id2) #actually groups the dataframe according to these variables
df<-summarise(df, xvar1=xvar1[which(!is.na(xvar1))[1]]) #this code works great to assign the first non missing value to all missing values but it only works on 1 column at a time (X1).
I have many columns so I try using apply to make this a manageable task..
df<-apply(df, MARGIN=2, FUN=function(x) {summarise(df, x=x[which(!is.na(x))[1]])
}
)
This gets me a list for each variable, I wanted a dataframe (which I would then de-duplicate). I tried rbindlist and do.call(rbind) and these result in a long dataframe with only 3 columns - the two group_by variables and 'x'.
I know the problem is simply how I'm using apply, probably the indexing with 'which', but I'm stumped.
What about using lapply with do.call and cbind, like the following:
df <- do.call(cbind, lapply(df, function(x) {summarise(df, x=x[which(!is.na(x))[1]])}))
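As an alternative to the dplyr-based attempts above, here is a base R sketch using aggregate, which applies one function to every value column within each group and returns a data frame directly. The column names follow the question; the data and the helper first_non_na are assumptions made for illustration.

```r
# Hypothetical data: two groups, with values scattered across repeated rows
df <- data.frame(
  id1   = c(1, 1, 2, 2),
  id2   = c("a", "a", "b", "b"),
  xvar1 = c(NA, 5, 7, NA),
  xvar2 = c(3, NA, NA, NA)
)

# First non-missing value of a vector (NA if every value is missing)
first_non_na <- function(x) x[!is.na(x)][1]

# Apply it to every non-grouping column, one result row per (id1, id2) group
value_cols <- setdiff(names(df), c("id1", "id2"))
res <- aggregate(df[value_cols], by = df[c("id1", "id2")], FUN = first_non_na)
res
```

Because the result already has one row per group, no separate de-duplication step is needed afterwards.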
I am trying to combine the corresponding columns of three different dataframes, so that for each column position I get an object with the same number of rows as the original dataframes and three columns, one from each source. Each of the original dataframes has 10 columns and 14 rows.
I tried it with a for-loop, but the result is not usable for me.
t <- NULL
for(i in 1 : length(net)) {
a <- cbind(imp.qua.00.09[i], exp.qua.00.09[i], net[i])
t <- list(t, a)
}
t
But in the end I would like to get 10 separate dataframes with three columns each.
So I want to loop through this:
a <- cbind(imp.qua.00.09[i], exp.qua.00.09[i], net[i])
for every column of each original dataframe. But if I use t <- list(t, a) it constructs a crazy list. Thanks.
The code you're using to append elements to t is wrong; you should do it this way:
t <- list()
for(i in 1:length(net)) {
a <- cbind(imp.qua.00.09[i], exp.qua.00.09[i], net[i])
t[[length(t)+1]] <- a
}
t
Your code is wrong since at each step, you transform t into a list where the first element is the previous t (that is a list, except for the first iteration), and the second element is the subset. So basically in the end you're getting a sort of recursive list composed by two elements where the second one is the data.frame subset and the first is again a list of two elements with the same structure, for ten levels.
Anyway, your code is equivalent to this one-liner (that is probably more efficient since it does not perform any list concatenation):
t <- lapply(seq_along(net),
            function(i) cbind(imp.qua.00.09[i], exp.qua.00.09[i], net[i]))
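Another way to express the same idea is Map, which walks over the columns of the three data frames in parallel. The sketch below uses small made-up data frames (2 columns, 3 rows) in place of the real 10-column ones, and the column labels imp, exp and net in the result are an assumption.

```r
# Hypothetical stand-ins for the three original data frames
imp.qua.00.09 <- data.frame(c1 = 1:3,   c2 = 4:6)
exp.qua.00.09 <- data.frame(c1 = 11:13, c2 = 14:16)
net           <- data.frame(c1 = 21:23, c2 = 24:26)

# Map passes the i-th column of each data frame to the function together,
# producing one three-column data frame per original column
combined <- Map(
  function(i, e, n) data.frame(imp = i, exp = e, net = n),
  imp.qua.00.09, exp.qua.00.09, net
)

names(combined)   # list elements are named after the columns of the first data frame
combined$c2
```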
This should work:
do.call(cbind,list(imp.qua.00.09, exp.qua.00.09, net))
I am trying to find out which element of each of my sub-lists is the minimum for that particular sub-list. The current chunk of data I am trying to apply the functionality to is a record of 41 entries. They get grouped by another function that produces indices for each of the sub-lists or sub-groups. Elements 1:8 are in the first sub-group, and the following sub-groups are 9:17, 18:23, 24:33, 34:41. Please note I called the data I am working with "b1", and the index created to group b1's elements into sub-groups is "indx". I am able to find the minimum value in each sub-group using sapply like this:
sapply(indx, function(i) min(b1[i]))
But, I am stuck at finding which "b1" element is each of these numbers sapply provided above. I know I probably need the function which() and mapply(), but have not been able to put it together.
Reproducible data:
b1 <- sample(1:20,41,T)
starts <- c(1,9,18,24,34)
stops <- c(8,17,23,33,41)
indx <- mapply(seq, from=starts, to=stops)
You basically figured it out yourself.
Try
sapply(indx, function(i) which.min(b1[i]))
Edit
I'm no longer sure that is actually what you want. The answer above returns the index of the minimum element within each subgroup.
If you instead want the position of each minimum in b1 itself, you could do the following (one of probably several possible ways):
indices <- 1:length(b1)
sapply(indx, function(i) indices[i][which.min(b1[i])])
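To make the difference between the two results concrete, here is a small sketch with fixed, made-up values instead of the random sample. Since each element of indx already holds positions in b1, indexing i directly gives the same result as going through a separate indices vector.

```r
# Hypothetical fixed data: two sub-groups covering positions 1:3 and 4:7
b1   <- c(5, 3, 9, 4, 8, 2, 7)
indx <- list(1:3, 4:7)

# Position of the minimum within each sub-group
local_pos <- sapply(indx, function(i) which.min(b1[i]))
local_pos    # 2 3

# Position of the same minimum within b1 as a whole
global_pos <- sapply(indx, function(i) i[which.min(b1[i])])
global_pos   # 2 6

# The values at those positions are the sub-group minima
b1[global_pos]
```

Note that which.min returns only the first position when a sub-group contains ties.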