Excluding a vector from a list of data.frame - r

I have a list of 50 data.frame objects, and each data.frame contains 20 rows. At each iteration I need to exclude one row (a vector) from each of the data.frame objects.
The single iteration may look something like this:
to_exclude <- 0 # 0 will be replaced by the induction variable
training_temp <- lapply(training_data, function(x) {
  # Exclude the vector numbered to_exclude
})
Regards

df <- data.frame(x=1:10,y=1:10)
thelist <- list(df,df,df,df)
lapply(thelist, function(x) x[-c(1) ,] )
This will always remove the first row. Is this what you want, or do you want to remove rows based on a value?
This will always exclude the first column:
lapply(thelist, function(x) x[, -c(1) ] )
# because there are only two columns in this example you would probably
# want to add drop = FALSE, e.g.
# lapply(thelist, function(x) x[, -c(1), drop = FALSE] )
So, using your loop value:
remove_this_one <- 10
lapply(thelist, function(x) x[ -c(remove_this_one) ,] )
# remove row 10
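Putting that together with the induction variable from the question, a full leave-one-out pass might look like this (a sketch; training_data is assumed to be the asker's list of 20-row data.frames):
for (to_exclude in 1:20) {
  training_temp <- lapply(training_data, function(x) x[-to_exclude, ])
  # fit / evaluate with training_temp for this iteration
}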

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
  rows_above_th <- df[(df$id > 8), ] # select the rows from each df above a threshold
  a <- rows_above_th$id # obtain the ids of the rows above the threshold
  ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I guess the errors may be in the append part of the sapply, and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply, you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, as a list:
ids_to_remove <- lapply(mylist, function(df) {
  rows_above_th <- df[(df$id > 8), ] # select the rows from each df above a threshold
  rows_above_th$id # obtain the ids of the rows above the threshold
})
Then you can use that list with your data list and mapply to iterate over the two lists together:
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
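As a quick sanity check on the example data from the question (not part of the original answer): ids 9 and 10 exceed the threshold in both data.frames, so each filtered data.frame should keep 8 rows.
cleaned <- Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
sapply(cleaned, nrow)
# [1] 8 8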

How to get all columns with the same column name in R at once?

Let's say I have the following data frame:
> test <- cbind(test=c(1, 2, 3), test=c(1, 2, 3))
> test
     test test
[1,]    1    1
[2,]    2    2
[3,]    3    3
Now from such data frame I want to fetch all the columns named "test" to a new data frame:
> new_df <- test[, "test"]
However, this last attempt only fetches the first column called "test" from the test data frame:
> new_df
[1] 1 2 3
How can I get all of the columns called "test" in this example and put them into a new data frame in a single command? In my real data I have many columns with repeated colnames and I don't know the index of the columns, so I can't get them by number.
It is not advisable to have the same column names, for practical reasons. But we can do a comparison (==) to get a logical vector and use that to extract the columns:
i1 <- colnames(test) == "test"
new_df <- test[, i1, drop = FALSE]
Note that data.frame doesn't allow duplicate column names and would make them unique by appending .1, .2, etc. at the end with make.unique. A matrix (the OP's dataset) does allow duplicate column names or row names (not recommended, though).
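A small illustration of that point (hypothetical data, not from the original answer): data.frame() deduplicates the names, so "test" would no longer match both columns.
df_dup <- data.frame(test = c(1, 2, 3), test = c(1, 2, 3))
names(df_dup)
# [1] "test"   "test.1"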
Also, if there are multiple column names that are repeated and you want to select them as separate datasets, use split:
lst1 <- lapply(split(seq_len(ncol(test)), colnames(test)), function(i)
test[, i, drop = FALSE])
Or loop through the unique column names with lapply and do a == comparison for each:
lst2 <- lapply(unique(colnames(test)), function(nm)
test[, colnames(test) == nm, drop = FALSE])

R: remove empty columns in a list within a list

How do I remove empty columns from a list within a list in R, when the columns are either "" or NA?
SAMPLE DATA:
x <- list( a = cars , b = ability.cov , d = mtcars )
x[[3]][2]<-""
So the second column in the third list element is now all ""; I wish to remove it from x.
EDIT: The problem is that I do not know which columns in which data.frame (within the list) are empty. I need some algorithm.
I've tried the following which does not work for me:
x<-x[,colSums(x!= "") != 0 ]
To remove all columns with only value "" from the dataframes in the list you can do:
lapply(x, function(xi) xi[!sapply(xi, function(xii) all(xii==""))])
Explanation:
If you have a vector xii, you can test it against ""; this gives a logical vector with the same length as xii.
all(...) is clear: the result is TRUE if all elements are TRUE.
sapply(xi, ...) calculates this for each column of xi. It gives TRUE or FALSE for each column of xi.
xi[!sapply(...)] inverts the logical vector from sapply() and uses it as an index for xi. If an element of the index is FALSE, that column is left out of the result.
lapply(x, ...) runs over your original list.
Do not forget to store the result in an object: xnew <- lapply(...)
If you want to remove columns containing only NA or "" values:
lapply(x, function(xi) xi[!sapply(xi, function(xii) all(xii=="" | is.na(xii)))])
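As a quick check on the sample list x from the question (not part of the original answer), either version drops the "" column in the third element and leaves the other elements untouched:
xnew <- lapply(x, function(xi) xi[!sapply(xi, function(xii) all(xii == ""))])
ncol(xnew$d) # 10, one fewer than mtcars' 11 columns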

How to convert single column data into two-column matrix using conditional/for loop in R

I have a single column data frame - example data:
1 >PROKKA_00002 Alpha-ketoglutarate permease
2 MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT
3 QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG
4 >PROKKA_00003 lipoprotein
5 MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG
Each sequence of letters is associated with the ">" line above it. I need a two-column data frame with lines starting in ">" in the first column, and the respective lines of letters concatenated as one sequence in the second column. This is what I've tried so far:
y <- matrix(0, 5836, 2) # empty matrix with 5836 rows and two columns
z <- 0
for(i in 1:nrow(df)){
  if((grepl(pattern = "^>", x = df)) == TRUE){ # tried to set the conditional "if a line starts with '>', execute code"
    z <- z + 1
    y[z,1] <- paste(df[i])
  } else{
    y[z,2] <- paste(df[i], collapse = "")
  }
}
I would eventually convert the matrix y back to a data.frame using as.data.frame, but my loop keeps getting Error: unexpected '}' in "}". I'm also not sure if my conditional is right. Can anyone help? It would be greatly appreciated!
Although I would stick with packages for this kind of task, here is a base R solution.
Initialize data:
mydf <- data.frame(x=c(">PROKKA_00002 Alpha-ketoglutarate","MTESSITERGAPEL", "MTESSITERGAPEL",">PROKKA_00003 lipoprotein", "MTESSITERGAPEL" ,"MRTIIVIASLLLT"), stringsAsFactors = F)
Process:
ind <- grep(">", mydf$x)
temp <- data.frame(ind = ind, from = ind + 1, to = c((ind - 1)[-1], nrow(mydf)))
seqs <- rep(NA, length(ind))
for(i in 1:length(ind)) {
  seqs[i] <- paste(mydf$x[temp$from[i]:temp$to[i]], collapse = "")
}
fastatable <- data.frame(name = gsub(">", "", mydf[ind, 1]), sequence = seqs)
> fastatable
name sequence
1 PROKKA_00002 Alpha-ketoglutarate MTESSITERGAPELMTESSITERGAPEL
2 PROKKA_00003 lipoprotein MTESSITERGAPELMRTIIVIASLLLT
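For reference, the package route the answer alludes to might look like this (a sketch, assuming the sequences live in a FASTA file on disk; Biostrings is a Bioconductor package and "sequences.faa" is a hypothetical file name):
library(Biostrings)
aa <- readAAStringSet("sequences.faa") # headers become names(aa)
fastatable <- data.frame(name = names(aa), sequence = as.character(aa))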
Try creating an index of the rows containing the target symbol (the headers), then split the data on that index. cumsum(ind1) first creates a group id for every row by coercing the logical vector to numeric, and subsetting with [!ind1, ] afterwards eliminates the header rows themselves.
ind1 <- grepl(">", mydf$x)
#split data on the index created
newdf <- data.frame(mydf$x[ind1][cumsum(ind1)], mydf$x)[!ind1,]
#Add names
names(newdf) <- c("Name", "Value")
newdf
# Name Value
# 2 >PROKKA_00002 Alpha-ketoglutarate
# 3 >PROKKA_00002 MTESSITERGAPEL
# 5 >PROKKA_00003 lipoprotein
# 6 >PROKKA_00003 MRTIIVIASLLLT
Data
mydf <- data.frame(x=c(">PROKKA_00002","Alpha-ketoglutarate","MTESSITERGAPEL", ">PROKKA_00003", "lipoprotein" ,"MRTIIVIASLLLT"))
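To see why the indexing trick works, here is a small illustration using the Data above (the commented values are what each call returns):
ind1 <- grepl(">", mydf$x)
ind1         # TRUE FALSE FALSE TRUE FALSE FALSE
cumsum(ind1) # 1 1 1 2 2 2 -- every row is numbered by the header it follows
mydf$x[ind1][cumsum(ind1)] # repeats the matching header alongside every row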
You can use plyr to accomplish this if you are able to assign a section number to your rows appropriately:
library(plyr)
df <- data.frame(v1=c(">PROKKA_00002 Alpha-ketoglutarate permease",
"MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
"QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
">PROKKA_00003 lipoprotein",
"MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG"))
df$hasMark <- ifelse(grepl(">",df$v1,fixed=TRUE),1, 0)
df$section <- cumsum(df$hasMark)
t <- ddply(df, "section", function(x){
data.frame(v2=head(x,1),v3=paste(x$v1[2:nrow(x)], collapse=''))
})
t <- subset(t, select=-c(section,v2.hasMark,v2.section)) #drop the extra columns
If you then view t, I believe this is what you were looking for in your original post.

remove data.frames in a list filled only with NA values

I have a list of data.frames, and some of them are filled with NA; I would like to remove the data.frames containing only NA from my list.
I am using these two commands:
list.df <- lapply(list.df, na.omit)
list.df <- list.df[sapply(list.df, function(x) dim(x)[1] >0 )]
Is there a way to do the same but in one line?
This keeps all data.frames which have at least one NA-free row:
list.df[sapply(list.df, function(x) any(rowSums(is.na(x)) == 0))]
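An equivalent one-liner with base Filter follows the same idea, keeping only the data.frames that have at least one complete (NA-free) row (a sketch, not from the original answer):
Filter(function(x) nrow(na.omit(x)) > 0, list.df)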
