Split dataframe by row number in R

This is probably really simple, but I can't find a solution:
df <- data.frame(replicate(10,sample(0:1,10,rep=TRUE)))
v <- c(3, 7)
Is there an elegant way to split this dataframe into three elements (of a list) at the row numbers specified in v?

Assuming that rows 1 and 2 go in the first split, rows 3 to 6 in the second, and rows 7 to nrow(df) in the last:
split(df, cumsum(1:nrow(df) %in% v))
But if rows 1:3 are in the first split, 4:7 in the second, and 8 to nrow(df) in the third:
split(df, cumsum(c(TRUE,(1:nrow(df) %in% v)[-nrow(df)])) )
Or, as @James mentioned in the comments:
split(df, cumsum(1:nrow(df) %in% (v+1)))

Another way:
split(df, findInterval(1:nrow(df), v))
For the alternative interpretation, you can use:
split(df, cut(1:nrow(df), unique(c(1, v, nrow(df))), include.lowest=TRUE))
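For reference, these are the grouping vectors the two interpretations produce on the 10-row example (assuming v <- c(3, 7) as above):
cumsum(1:nrow(df) %in% v)        # rows 1-2 | 3-6 | 7-10
#[1] 0 0 1 1 1 1 2 2 2 2
findInterval(1:nrow(df), v)      # same grouping
#[1] 0 0 1 1 1 1 2 2 2 2
cumsum(1:nrow(df) %in% (v + 1))  # rows 1-3 | 4-7 | 8-10
#[1] 0 0 0 1 1 1 1 2 2 2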

Related

R - substr over multiple columns in Dataframe

Let's say I have a dataframe that looks like this:
Column1 Column2 Column3
a_2019  b_2020  c_2021
d_2019  e_2020  f_2021
a_2019  b_2020  c_2021
d_2019  e_2020  f_2021
And I would like to take out "_2019", "_2020", and "_2021". I could use
df$Column1 <- substr(df$Column1, 1, nchar(df$Column1)-5)
for every column, but I have multiple dataframes with quite a few columns. substr needs a character vector to work on, so using df[,3:10] doesn't work, and lapply didn't work for me either.
Any suggestion on how to achieve this in an elegant way? Thank you
We can try using lapply along with sub for a base R option:
df[cols] <- lapply(df[cols], function(x) sub("_(?:2019|2020|2021)$", "", x))
Here cols should be a vector containing the column names on which you seek to make the replacement.
More generally, to target an underscore followed by any digits, we can use:
df[cols] <- lapply(df[cols], function(x) sub("_\\d+$", "", x)) # or _\\d{4} for a year
Using dplyr
library(dplyr)
df <- df %>%
  mutate(across(3:10, ~ substr(.x, 1, nchar(.x) - 5)))
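Note that substr(.x, 1, nchar(.x) - 5) assumes every suffix is exactly five characters long; if the suffix length varies, a pattern-based sketch may be safer:
df <- df %>%
  mutate(across(3:10, ~ sub("_\\d+$", "", .x)))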

R - How to subset all dataframes stored in a list according to a vector of conditions

This is my first time asking a question here, so please let me know if I need to change the way I am doing this. I have been looking for a while and I haven't been able to find what I need.
I have a list of 3 dataframes. They have the same structure (variables) but not the same number of observations. I would like to get several subsets for each dataframe in my list, according to several conditions stored in a vector.
So if I have 5 conditions, I need to get, for each of the 3 dataframes in my list, 5 subsets of these dataframes, so 15 total.
For instance:
df1 <-data.frame(replicate(3,sample(0:10,10,rep=TRUE)))
df2 <-data.frame(replicate(3,sample(0:10,7,rep=TRUE)))
df3 <-data.frame(replicate(3,sample(0:10,8,rep=TRUE)))
my_list <- list(df1, df2, df3)
conditions <- c(2, 5, 7, 4, 6)
I know how to subset for one of the conditions using lapply
list_subset <- lapply(my_list, function(x) x[which(x$X1 == conditions[1]), ])
But I would like to do that for all the values in the vector conditions.
I hope it makes sense.
Just lapply again, this time over the conditions:
# One way: loop over the conditions on the outside
list.of.list_subsets <- lapply(conditions, function(y) {
  lapply(my_list, function(x) x[which(x$X1 == y), ])
})
# The other way around: loop over the data frames on the outside
list.of.list_subsets2 <- lapply(my_list, function(x) {
  lapply(conditions, function(y) x[which(x$X1 == y), ])
})
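Both results are nested lists. With the first form the outer index is the condition and the inner index is the data frame, for example:
list.of.list_subsets[[1]][[2]]  # rows of the 2nd data frame where X1 == conditions[1], i.e. 2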
An option would be to filter with %in% and then split based on the 'X1' column
lapply(my_list, function(x) {x1 <- subset(x, X1 %in% conditions); split(x1, x1$X1)})
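The result can then be indexed by condition value (a sketch, assuming the value 5 actually occurs in the first data frame):
res <- lapply(my_list, function(x) {x1 <- subset(x, X1 %in% conditions); split(x1, x1$X1)})
res[[1]][["5"]]  # rows of the first data frame where X1 == 5
Unlike the nested-lapply approach, this only creates an element for the condition values that are actually present in each data frame.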

Removing rows from a data frame

I have this data.frame:
set.seed(1)
df <- data.frame(id1=LETTERS[sample(26,100,replace = T)],id2=LETTERS[sample(26,100,replace = T)],stringsAsFactors = F)
and this vector:
vec <- LETTERS[sample(26,10,replace = F)]
I want to remove from df any row in which either df$id1 or df$id2 is not in vec.
Is there any faster way of finding the row indices which meet this condition than this:
rm.idx <- which(!apply(df,1,function(x) all(x %in% vec)))
I used dplyr with this script:
df1 <- df %>% filter(!(df$id1 %in% vec)|!(df$id2 %in% vec))
Looping over the columns might be faster than looping over the rows. So, use lapply to loop over the columns, create a list of logical vectors with %in%, use Reduce with | to check whether any of them is TRUE for each row, and use that to subset 'df':
df[Reduce(`|`, lapply(df, `%in%`, vec)),]
If both columns need to be in vec, replace | with &:
df[Reduce(`&`, lapply(df, `%in%`, vec)),]
Actually
rm.idx <- unique(which(!(df$id1 %in% vec) | !(df$id2 %in% vec)))
is also fast.
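Since the question is about speed, here is a rough timing sketch (assuming the microbenchmark package is installed; exact results will vary with the data size):
library(microbenchmark)
microbenchmark(
  apply  = which(!apply(df, 1, function(x) all(x %in% vec))),
  reduce = which(!Reduce(`&`, lapply(df, `%in%`, vec))),
  vector = which(!(df$id1 %in% vec) | !(df$id2 %in% vec))
)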

Finding similar elements between two data frames

I asked a question before which was complicated and I did not get any help, so I tried to simplify the question and the input/output.
I have tried many ways but none worked. For example, here are some of them:
# 1
for (i in ncol(mydata)) {
  corsA = grep(colnames(mydata)[i], colnames(mysecond))
  mydata[, corsA] %in% mysecond[, i]
}
# here if I get TRUE it means they match
## 2
are.cols.identical <- function(col1, col2) identical(mydata[,col1], mysecond[,col2])
res <- outer(colnames(mydata), colnames(mysecond),FUN = Vectorize(are.cols.identical))
cut <- apply(res, 1, function(x)match(TRUE, x))
### 3
(mydata$Rad) %in% (mysecond$Ro5_P1_A5)
#### 4
which(mydata %in% mysecond)
#### 5
match(mydata$sus., mysecond$R5_P1_A5)
or
which(mydata$sus. %in% mysecond$RP1_A5)
matches <- sapply(mydata,function(x) sapply(mysecond,identical,x))
and a few others, but none led me to an answer.
Here is another solution using regex:
rows<-mapply(grep,mysecond,mydata)
The step above will return a list with the matched rows in each column:
rows
If you would like to see how many rows were matched, you can do this:
lapply(rows,length)
Now we can go ahead and get the rows of interest in mydata. Since rows is a list we need to unlist() it, and because there may be duplicate rows that we don't want to appear twice in the output, we also use unique():
rows<-unique(unlist(rows))
mydata[rows,]
#View(mydata[rows,])
require(plyr)
dat <- strsplit(as.character(mydata$subunits..UniProt.IDs.), ',')
dat <- data.frame(mydata[,1],rbind.fill(lapply(dat,function(y){as.data.frame(t(y),stringsAsFactors=FALSE)})))
mydata[unlist(apply(dat,2, function(x) which(x %in% mysecond[,2]))),]

lapply: extract specific element

I have a list of subsets obtained through:
lapply(1:5, function(x) combn(5,x))
I would like to extract a specific vector from this list. For example, the 16th element of this list, which is (1,2,3). Any hints? Thanks.
The command produces all the non-empty subsets of (1,2,3,4,5) as a list of 5 matrices, one per subset size, so 2^5 - 1 = 31 subsets in total. The 16th is (1,2,3). I want to know how to extract it by its position (the 16th).
We could try splitting (split) each matrix into a list of column vectors, concatenating (c) the output to flatten the list, and subsetting with the numeric index.
lst2 <- do.call(`c`,lapply(lst, function(x) split(x, col(x))))
lst2[[16]]
#[1] 1 2 3
Or, instead of splitting the matrix output, we could use the FUN argument of combn to create a list directly and then concatenate with c using do.call:
lst <- do.call(`c`,lapply(1:5, function(x) combn(5, x, FUN=list)))
lst[[16]]
#[1] 1 2 3
Or, instead of do.call(c, ..), we can use unlist (contributed by @Marat Talipov):
lst <- unlist(lapply(1:5, function(x)
  combn(5, x, FUN=list)), recursive=FALSE)
data
lst <- lapply(1:5, function(x) combn(5,x))
I would rather produce the right data structure in the first place instead of looping over it again :)
lst = Reduce('c', lapply(1:5, function(x) as.list(data.frame(combn(5,x)))))
> lst[[16]]
[1] 1 2 3
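Either way, a quick sanity check: the subsets come out ordered by size (the lapply runs over sizes 1 to 5), so positions 1-5 are the singletons, 6-15 the pairs, and position 16 is the first triple, (1,2,3).
length(lst)  # choose(5,1) + ... + choose(5,5) = 31 non-empty subsets
#[1] 31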
