R: remove empty columns in a list within a list - r

How do I remove empty columns from a list within a list in R, when the columns are either "" or NA?
SAMPLE DATA:
x <- list( a = cars , b = ability.cov , d = mtcars )
x[[3]][2]<-""
The second column of the third list element is now all "", and I want to remove it from x.
EDIT: The problem is that I do not know in advance which columns in which list elements are empty, so I need a general approach.
I've tried the following which does not work for me:
x<-x[,colSums(x!= "") != 0 ]

To remove all columns with only value "" from the dataframes in the list you can do:
lapply(x, function(xi) xi[!sapply(xi, function(xii) all(xii==""))])
Explanation:
xii == "" compares each element of a column xii with "" and returns a logical vector of the same length as xii.
all(...) returns TRUE only if every element is TRUE.
sapply(xi, ...) runs this test on each column of xi and returns one TRUE or FALSE per column.
xi[!sapply(...)] negates that logical vector and uses it as a column index for xi; columns whose index is FALSE are dropped from the result.
lapply(x, ...) applies the whole thing to each element of your original list.
Do not forget to store the result in an object: xnew <- lapply(...)
If you want to remove columns with only NA and "" as values:
lapply(x, function(xi) xi[!sapply(xi, function(xii) all(xii=="" | is.na(xii)))])
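As a quick check, here is a minimal sketch that runs the NA-and-"" variant on the sample data from the question. The helper names is_empty_col and drop_empty are made up for this illustration, and the extra all-NA column in cars is an assumption added so that both cases are exercised. vapply is used instead of sapply only to guarantee a logical result.

```r
# Sample data from the question, plus one all-NA column (added for illustration)
x <- list(a = cars, b = ability.cov, d = mtcars)
x[[3]][2] <- ""   # column "cyl" of mtcars becomes all ""
x[[1]][1] <- NA   # column "speed" of cars becomes all NA

# Hypothetical helper: TRUE if every value in the column is NA or ""
# is.na() comes first so that TRUE | NA evaluates to TRUE for NA entries
is_empty_col <- function(col) all(is.na(col) | col == "")

# Hypothetical helper: keep only the columns (or list elements) that are not empty
drop_empty <- function(xi) xi[!vapply(xi, is_empty_col, logical(1))]

xnew <- lapply(x, drop_empty)
names(xnew$a)   # "speed" is gone
names(xnew$d)   # "cyl" is gone
```

Note that ability.cov is a plain list rather than a data frame; the same indexing works there and simply keeps all three of its elements.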

Related

R - Speeding up combination between for loop and paste/paste0

I am handling a data frame 'df' that has millions of rows and four columns (i.e., Chromosome, Position, Allele1, Allele2). I want to concatenate the characters in these columns into one separate vector 'cc'. This is my first try:
myfunc = function(CHR) {
  chr = subset(df, df$Chromosome == CHR)
  cc = data.frame(No=seq.int(nrow(chr)), pos_al1_al2=NA)
  for (i in 1:nrow(chr)) {
    cc$pos_al1_al2[i] = paste(CHR, chr$Position[i], ".", chr$Allele1[i], chr$Allele2[i])
  }
  cc = cc[, -1] # remove the column 'No' once the loop has finished
}
# Run my code
myfunc(7)
where CHR is the number of the chromosome of interest that I pass to the function (e.g., 1, 2, 3, ..., or 22). Of course, CHR must be in the range 1 to 22, as in the column Chromosome of 'df'.
My idea is this: I first create an empty data frame called cc with the same number of rows as the subset of 'df' for that chromosome.
Then I fill a new column of cc, called pos_al1_al2, where each row holds the concatenated characters shown in the function.
The computation is very slow. I guess this comes from the for loop, but I have no idea how to optimize my function.
Any help is appreciated! Thanks in advance.
Is there any reason why you can't use paste() in vectorized mode?
myfunc <- function(CHR) {
  chr <- subset(df, df$Chromosome == CHR)
  cc <- data.frame(No = seq.int(nrow(chr)), pos_al1_al2 = NA)
  # paste() is vectorized: one call fills the whole column, no loop needed
  cc$pos_al1_al2 <- paste(CHR, chr$Position, ".", chr$Allele1, chr$Allele2)
  cc[, -1] # drop the column 'No' and return the result visibly
}
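To see what the vectorized call produces, here is a small sketch on toy data. The data frame below is invented for illustration and only mimics the four columns described in the question; make_ids is a hypothetical name, and passing df as an argument is a design choice of this sketch, not something from the original code.

```r
# Toy stand-in for the real 'df' (values are made up)
df <- data.frame(
  Chromosome = c(7, 7, 8),
  Position   = c(101, 202, 303),
  Allele1    = c("A", "C", "G"),
  Allele2    = c("T", "G", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one string per row of the chosen chromosome
make_ids <- function(df, CHR) {
  chr <- df[df$Chromosome == CHR, ]
  # paste() recycles the scalar CHR and "." across all rows at once
  paste(CHR, chr$Position, ".", chr$Allele1, chr$Allele2)
}

ids <- make_ids(df, 7)
ids
# "7 101 . A T" "7 202 . C G"
```

paste() separates its arguments with a space by default; if the pieces should be joined without spaces, sep = "" (or paste0) does that.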

Modifying a function to add an extra search for an specific value

With the following function, I search a list (my_list) and return the names of the elements that have any column whose name contains ".Positivedata":
# Function to give me the names of elements in a list with any column name containing ".Positivedata" (works OK)
names(my_list)[sapply(my_list, function(x) any(grep(".Positivedata", names(x))))]
Now I would like to modify the function to return which of the names have a value of "4" on the column ".Positivedata"
Where do I put the "== 4" in the function?
names(my_list)[sapply(my_list, function(x) any(grep(".Positivedata", names(x))) )]
We loop through the list, subset the column with grep, check whether that is equal to 4 and use which to get the position
lapply(my_list, function(x) which(x[, grep("\\.Positivedata", names(x))] == 4))
If there are multiple columns, then it would be better to get the row/column index
lapply(my_list, function(x)
  which(x[, grep("\\.Positivedata", names(x))] == 4, arr.ind = TRUE))
If we want the names of the .Positivedata columns that contain the value '4'
sapply(my_list, function(x) {
  nm1 <- grep("\\.Positivedata", names(x), value = TRUE)
  nm1[sapply(x[nm1], function(x) any(x == 4))]
})
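If the goal is literally the names of the list elements, as in the original one-liner, the "== 4" test can go inside the function passed to the apply call. The sketch below is one possible way to do that; my_list here is invented toy data, and has_four is a name made up for this example.

```r
# Toy list: only s1 has a ".Positivedata" column containing a 4
my_list <- list(
  s1 = data.frame(id = 1:3, a.Positivedata = c(1, 4, 2)),
  s2 = data.frame(id = 1:3, b.Positivedata = c(1, 2, 3)),
  s3 = data.frame(id = 1:3, other = c(4, 4, 4))
)

# One TRUE/FALSE per list element: does any ".Positivedata" column hold a 4?
has_four <- vapply(my_list, function(x) {
  cols <- grep("\\.Positivedata", names(x), value = TRUE)
  # unlist() of zero selected columns is NULL, and any() of nothing is FALSE
  any(unlist(x[cols]) == 4, na.rm = TRUE)
}, logical(1))

names(my_list)[has_four]
# "s1"
```

Elements without a matching column, such as s3 here, fall through to FALSE instead of raising an error.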

Replace all values of NULL in a list of List to something else/ list of list with varying lengths

I have a list of lists that I need to convert into the corresponding dataframe.
The position of the rows is important so that I can link them to another item later on. I tried one approach where I convert it directly into a dataframe, but that fails because some rows are NULL.
first = c(1,2,3)
second = c(1,2)
problemrows = NULL
mylist = list(first,second,problemrows)
#mylist - should turn into a dataframe with each row being same order as the list of lists. NULL value would be a null row
library(plyr) # doesn't work because of NULLs above
t(plyr::rbind.fill.matrix(lapply(mylist, t)))
## Help needed: ideally the row that doesn't work would just be a null row, or the new dataframe would drop these NULL entries as rows and renumber the remaining rows to make up for what was skipped.
# Other approach: change all the NULL entries in the list of lists to something like -999, which would be another fix.
first = c(1,2,3)
second = c(1,2)
problemrows = NULL
mylist = list(first,second,problemrows)
mylist <- sapply(mylist, function(x) ifelse(x == "NULL", NA, x))
library(plyr) # doesn't work because of NULLs above
t(plyr::rbind.fill.matrix(lapply(mylist, t)))
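One possible base R approach, offered as a sketch rather than as an answer from the original thread: pad every element (including the NULL one) with NA up to the longest length, then bind the padded vectors as rows. This keeps the row order of the list, so the NULL entry becomes an all-NA row. The names width, padded and res are made up for this example.

```r
first <- c(1, 2, 3)
second <- c(1, 2)
problemrows <- NULL
mylist <- list(first, second, problemrows)

# Length of the longest element; a NULL element has length 0
width <- max(lengths(mylist))

# Pad each element with NA on the right; c(NULL, NA, NA, NA) is just three NAs
padded <- lapply(mylist, function(v) c(v, rep(NA_real_, width - length(v))))

# One row per list element, in the original order
res <- as.data.frame(do.call(rbind, padded))
res
```

If the NULL rows should be dropped instead, filtering first with mylist[lengths(mylist) > 0] would do that, at the cost of losing the original positions.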

R: Convert several columns from [1,2] to Boolean [TRUE,FALSE]

I have a data frame (imported with read.csv) in which many, but not all, columns hold boolean data encoded as 1 = false, 2 = true.
I would like to convert all of them to booleans. I know I can do
data$someCol <- data$someCol == 2
My questions:
Is this the best way?
Is there another way in which I can specify BOTH "1" for FALSE and "2" for TRUE, with NA for the rest?
Can I somehow "mass-process" columns like this, selecting via grep?
Thanks!
You may convert the elements that are not 1 or 2 to NA and then use the logical condition df1==2 to get a logical matrix with TRUE for 2, FALSE for 1, and NA for the rest
is.na(df1) <- !(df1==1|df1==2)
df1==2
For a large dataset, it may be better to use lapply to loop through the columns
df1[] <- lapply(df1, function(x) {
  x[!x %in% c(1, 2)] <- NA
  x == 2
})
Update
If we want to convert only the subset of columns whose names start with 'XX', grep is an option to select those columns; we then loop over that subset with lapply and replace those columns with the output of lapply.
indx <- grep('^XX', colnames(df2))
df2[indx] <- lapply(df2[indx], function(x) {
  x[!x %in% c(1, 2)] <- NA
  x == 2
})
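For the second question (naming both codes explicitly), a lookup via match() is another way to write the same conversion. This is only a sketch: to_logical is a hypothetical helper name and toy is made-up data.

```r
# Hypothetical helper: 1 -> FALSE, 2 -> TRUE, anything else (including NA) -> NA
# match() returns the position of x in c(1, 2), or NA when there is no match
to_logical <- function(x) c(FALSE, TRUE)[match(x, c(1, 2))]

# Toy data: only the XX columns are encoded as 1/2
toy <- data.frame(id = 1:4, XX1 = c(1, 2, 3, NA), XX2 = c(2, 2, 1, 5))

# Select the columns by name pattern and convert them in one step
indx <- grep("^XX", colnames(toy))
toy[indx] <- lapply(toy[indx], to_logical)
toy
```

Here both codes are spelled out in one place, so changing the encoding means editing only the two vectors inside to_logical.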
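Another option would be using mutate_each from dplyr (note that mutate_each and funs are deprecated in recent dplyr releases in favour of mutate with across)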
library(dplyr)
mutate_each(df2, funs((NA^!. %in% 1:2)*.==2), matches('^XX'))
We select the columns whose names start with XX (matches('^XX')) and create the logical condition within funs. The . stands for the values of a column.
. %in% 1:2
gives a logical output. If the element is 1 or 2, we get TRUE and if not FALSE.
(NA^!. %in% 1:2)
We negate (!) the TRUE/FALSE output so that TRUE becomes FALSE and FALSE becomes TRUE, then raise NA to that power: NA^TRUE is NA and NA^FALSE is 1. Values that are not 1 or 2 thus become NA, and all other values become 1.
*.==2
Then we multiply (*) by the original values, so that NA stays NA and 1 is replaced by the value in that position, e.g. 1*2=2. Comparing with ==2 turns this into a logical output: 2 gives TRUE, 1 gives FALSE, and NA stays NA.
Using mutate_each will not change the original object unless we assign to the original object name
df2 <- mutate_each(df2, funs((NA^!. %in% 1:2)*.==2), matches('^XX'))
Another option without the need to assign it back would be the %<>% operator from magrittr
library(magrittr)
df2 %<>%
mutate_each(funs((NA^!. %in% 1:2)*.==2), matches('^XX'))
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(1:5, 20*5, replace=TRUE), ncol=5))
df2 <- df1
colnames(df2)[c(2,4)] <- paste0('XX', 1:2)

Excluding a vector from a list of data.frame

I have a list of 50 data.frame objects and each data.frame object contains 20 rows. I need to exclude one row (a vector) from each data.frame object at each iteration.
The single iteration may look something like this:
to_exclude <- 0 # 0 will be replaced by the induction variable
training_temp <- lapply(training_data, function(x) {
  # Exclude the vector numbered to_exclude
})
Regards
df <- data.frame(x=1:10,y=1:10)
thelist <- list(df,df,df,df)
lapply(thelist, function(x) x[-c(1) ,] )
This will always remove the first row. Is this what you want, or do you want to remove rows based on a value?
This will always exclude the first column:
lapply(thelist, function(x) x[, -c(1) ] )
# because there are only two columns in this example you would probably
# want to add drop = FALSE, e.g.
# lapply(thelist, function(x) x[, -c(1), drop=FALSE ] )
So, using your loop value:
remove_this_one <- 10
lapply(thelist, function(x) x[ -c(remove_this_one) ,] )
# remove row 10
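Putting this into the loop from the question might look like the sketch below. It assumes the induction variable runs over row numbers starting at 1; exclude_row and folds are names invented for this example, and training_data is rebuilt here from toy data.

```r
df <- data.frame(x = 1:10, y = 1:10)
training_data <- list(df, df, df, df)

# Hypothetical helper: drop row i from every data frame in the list
exclude_row <- function(data_list, i) {
  lapply(data_list, function(d) d[-i, , drop = FALSE])
}

# One iteration per row: folds[[i]] is the list with row i removed everywhere
folds <- lapply(seq_len(nrow(df)), function(i) exclude_row(training_data, i))

nrow(folds[[3]][[1]])   # 9 rows remain after dropping row 3
```

One caveat: R indexes from 1, and a negative index of 0 selects nothing, so df[-0, ] returns zero rows rather than the full data frame. Starting the induction variable at 0, as in the question's placeholder, would therefore give an empty result on the first pass.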
