R: Convert several columns from [1,2] to Boolean [TRUE,FALSE] - r

I have a data frame (imported with read.csv) in which many, but not all, columns hold boolean data encoded as 1 = FALSE, 2 = TRUE.
I would like to convert all of them to booleans. I know I can do
data$someCol <- data$someCol == 2
My questions:
Is this the best way?
Is there another way in which I can specify BOTH "1" for FALSE and "2" for TRUE, with NA for the rest?
Can I somehow "mass-process" columns like this, selecting via grep?
Thanks!

You may convert the elements that are not 1 or 2 to NA and then use the logical condition df1==2 to get a logical matrix in which 2 becomes TRUE, 1 becomes FALSE, and everything else is NA.
is.na(df1) <- !(df1==1|df1==2)
df1==2
For a large dataset, it may be better to use lapply to loop through the columns
df1[] <- lapply(df1, function(x) {
  x[!x %in% c(1, 2)] <- NA  # anything other than 1 or 2 becomes NA
  x == 2                    # 2 -> TRUE, 1 -> FALSE, NA stays NA
})
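To address the question about naming both codes explicitly, a lookup with match is one possibility. The following is only a sketch: recode_12 and the demo data frame are hypothetical names used for illustration, not part of the answer above.

```r
# Hypothetical helper: map each code to a logical by position lookup.
# match() returns 1 for false_code, 2 for true_code, and NA for anything else,
# so indexing c(FALSE, TRUE) with it yields FALSE, TRUE, or NA.
recode_12 <- function(x, false_code = 1, true_code = 2) {
  c(FALSE, TRUE)[match(x, c(false_code, true_code))]
}

codes <- c(1, 2, 3, NA, 2)
recoded <- recode_12(codes)

# Apply it only to columns whose names start with "XX" (selected via grep).
demo <- data.frame(id = 1:3, XX1 = c(1, 2, 5), XX2 = c(2, 2, 1))
cols <- grep("^XX", names(demo))
demo[cols] <- lapply(demo[cols], recode_12)
print(demo)
```

Because both codes are spelled out, any value outside the pair (including NA) ends up as NA without a separate cleaning step.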
Update
If we want to convert only the subset of columns whose names start with 'XX', grep is an option: use it to get the column indices, loop over that subset with lapply, and replace those columns with the output of lapply.
indx <- grep('^XX', colnames(df2))
df2[indx] <- lapply(df2[indx], function(x) {
  x[!x %in% c(1, 2)] <- NA  # anything other than 1 or 2 becomes NA
  x == 2                    # 2 -> TRUE, 1 -> FALSE, NA stays NA
})
Another option would be using mutate_each from dplyr (note that mutate_each and funs have since been deprecated in favor of mutate with across)
library(dplyr)
mutate_each(df2, funs((NA^!. %in% 1:2)*.==2), matches('^XX'))
We select the columns whose names start with XX (matches('^XX')) and create the logical condition within funs. The . stands for each column in turn.
. %in% 1:2
gives a logical vector: TRUE where the element is 1 or 2, FALSE otherwise.
(NA^!. %in% 1:2)
We negate (!) that vector so that TRUE becomes FALSE and FALSE becomes TRUE, then raise NA to it (NA^!...). Since NA^TRUE is NA and NA^FALSE is 1, values that are not 1 or 2 become NA and all other values become 1.
*.==2
Then we multiply (*) that result by the original column, so that NA stays NA and each 1 takes the value in that position, e.g. 1*2=2. Comparing with ==2 turns this into a logical vector: 2 gives TRUE, 1 gives FALSE, and NA stays NA.
Using mutate_each will not change the original object unless we assign the result back to the original object name
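The individual steps of that expression can be traced on a small vector. This is an illustrative sketch with made-up values; the intermediate names (in_set, mask, result) are not part of the answer.

```r
x <- c(1, 2, 3, 5, 2)

in_set <- x %in% 1:2    # TRUE  TRUE FALSE FALSE  TRUE
mask   <- NA^!in_set    # NA^FALSE is 1, NA^TRUE is NA -> 1 1 NA NA 1
scaled <- mask * x      # keeps 1 and 2, everything else is NA -> 1 2 NA NA 2
result <- scaled == 2   # FALSE TRUE NA NA TRUE

print(result)
```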
df2 <- mutate_each(df2, funs((NA^!. %in% 1:2)*.==2), matches('^XX'))
Another option that avoids the explicit reassignment is the %<>% operator from magrittr
library(magrittr)
df2 %<>%
mutate_each(funs((NA^!. %in% 1:2)*.==2), matches('^XX'))
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(1:5, 20*5, replace=TRUE), ncol=5))
df2 <- df1
colnames(df2)[c(2,4)] <- paste0('XX', 1:2)

Related

Get an output for column name and its column number at the same time

I am trying to get the names and numbers of the columns that have "NA" in them. I have tried using this code
names(df)[sapply(df, anyNA)]
but it only gives me the column names and not the numbers.
Any idea how to get an output for both?
We may convert the logical vector to indices with which, and subset names with the logical vector to get the column names
i1 <- sapply(df, anyNA)
which(i1)
names(df)[i1]
We may not even need names(df)[i1], because which already returns a named vector of indices, i.e.
which(sapply(df, anyNA))
is a single line of code that gives both the column names and their indices
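As a small illustration of that named result, here is a sketch on a made-up data frame (toy and its columns are hypothetical):

```r
toy <- data.frame(a = c(1, NA), b = c("x", "y"), c = c(NA, 3))

# which() keeps the names of the logical vector, so names and
# positions of the NA-containing columns come back together.
idx <- which(sapply(toy, anyNA))
print(idx)

# If a two-column table is preferred, split the names from the positions.
na_cols <- data.frame(column = names(idx), position = unname(idx))
print(na_cols)
```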
Or with dplyr
library(dplyr)
df %>%
summarise(across(where(anyNA), ~ match(cur_column(), names(df))))
Using which with colSums of NA.
which(colSums(is.na(iris_na)) > 0)
# Sepal.Length Petal.Length
#            1            3
Data:
iris_na <- iris
iris_na[c(1, 3)] <- lapply(iris_na[c(1, 3)], \(x) replace(x, sample(length(x), length(x)/10), NA_real_))

How to index a dataframe that contains lists/vectors as values

Indexing a dataframe works for single values but not for values that are list elements or vectors.
I have two lists of the genes that I need to match up. In each list, the genes are named as different gene aliases. I need to query a large list of genes in order to filter out any genes that are not shared between the two datasets. To do this, I created a dataframe that contains all genes from both lists. Each value in the dataframe is either a single string or a vector of multiple strings (aliases). A separate column assigns each group of aliases a unique number, which I am using to match the two lists. For each gene I need to check if it is present in the dataframe. But I cannot index the vector values. See below:
df <- data.frame("col1"=I(list(c("MALAT1","FTK2","CAS9"),
"MS4A6A",
c("LACT1","FLEE6","LOC98"))),
"col2"=I(list(c("CASS4","MS4A2","NME"),
"PLD3",
"ADAM4")))
"MALAT1" %in% df$col1
[1] FALSE
"MS4A6A" %in% df$col1
[1] TRUE
As it is a list, we can unlist
"MALAT1" %in% unlist(df$col1)
#[1] TRUE
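The reason the second one returns TRUE is that the second list element has length 1, while the element containing "MALAT1" does not. (%in% is based on match, which converts a list to a character vector, so a multi-element entry is compared as a single deparsed string.)
To test this, we can change the single-element list entry to "MALAT1"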
df$col1[2] <- "MALAT1"
"MALAT1" %in% df$col1
#[1] TRUE
Generally, when we have a list and want to test each element
lapply(df$col1, `%in%`, x = "LACT1")
#[[1]]
#[1] FALSE
#[[2]]
#[1] FALSE
#[[3]]
#[1] TRUE
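The behaviour described above can be checked directly. The following sketch rebuilds the alias list on its own (the name aliases is hypothetical) and also shows a vapply variant that returns one logical per row instead of a list:

```r
aliases <- list(c("MALAT1", "FTK2", "CAS9"),
                "MS4A6A",
                c("LACT1", "FLEE6", "LOC98"))

# match()/%in% convert a list to character, so a multi-element entry
# becomes one deparsed string and can never equal a single gene name.
as_strings <- as.character(aliases)
print(as_strings)

direct <- "MALAT1" %in% aliases   # compares against the deparsed strings

# Test inside each list element and get a plain logical vector back.
hits <- vapply(aliases, function(a) "LACT1" %in% a, logical(1))
print(hits)
```

The resulting logical vector can be used directly to subset the rows of the data frame.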
Here is another workaround, which plays a trick on df by flattening the lists in columns via rapply + toString
df[] <- rapply(df, toString, how = "unlist")
such that
> df
col1 col2
1 MALAT1, FTK2, CAS9 CASS4, MS4A2, NME
2 MS4A6A PLD3
3 LACT1, FLEE6, LOC98 ADAM4
and then you can use grepl to check whether the target gene can be found in the column, e.g.,
> grepl("LACT1", df$col1, fixed = TRUE)
[1] FALSE FALSE TRUE
> grepl("NME", df$col2, fixed = TRUE)
[1] TRUE FALSE FALSE
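One caveat with this approach is that grepl with fixed = TRUE matches substrings, so a query that is only part of an alias also counts as a hit. The sketch below uses a made-up flattened column (flat) and query to show the difference from an exact match on the split aliases:

```r
flat <- c("MALAT1, FTK2, CAS9", "MS4A6A", "LACT1, FLEE6, LOC98")
query <- "MS4A6"   # hypothetical query: a prefix of MS4A6A, not an alias itself

# Substring match: reports a hit for row 2.
substring_hit <- grepl(query, flat, fixed = TRUE)

# Exact match: split each string back into aliases and compare whole names.
exact_hit <- vapply(strsplit(flat, ", ", fixed = TRUE),
                    function(a) query %in% a, logical(1))

print(substring_hit)
print(exact_hit)
```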
You were almost there. Just wrap unlist() around the list you have your da

How to get all columns with the same column name in R at once?

Let's say I have the following data frame:
> test <- cbind(test=c(1, 2, 3), test=c(1, 2, 3))
> test
     test test
[1,]    1    1
[2,]    2    2
[3,]    3    3
Now from such data frame I want to fetch all the columns named "test" to a new data frame:
> new_df <- test[, "test"]
However this last attempt to do so only fetches the first column called "test" in test data frame:
> new_df
[1] 1 2 3
How can I get all of the columns called "test" in this example and put them into a new data frame in a single command? In my real data I have many columns with repeated colnames and I don't know the index of the columns so I can't get them by number.
Having identical column names is not advisable for practical reasons. But we can do a comparison (==) to get a logical vector and use that to extract the columns
i1 <- colnames(test) == "test"
new_df <- test[, i1, drop = FALSE]
Note that data.frame() by default doesn't allow duplicate column names and makes them unique by appending .1, .2, etc. at the end (as make.unique does). A matrix (the OP's dataset) does allow duplicate column names or row names (not recommended though).
Also, if several different column names are repeated and we want to select each group as a separate dataset, use split
lst1 <- lapply(split(seq_len(ncol(test)), colnames(test)), function(i)
test[, i, drop = FALSE])
Or loop through the unique column names with lapply and do the == comparison for each
lst2 <- lapply(unique(colnames(test)), function(nm)
test[, colnames(test) == nm, drop = FALSE])
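A quick check of both approaches on a made-up matrix with one repeated and one unique name (m and its values are hypothetical):

```r
m <- cbind(test = 1:3, other = 4:6, test = 7:9)

# Logical comparison keeps every column named "test";
# drop = FALSE keeps the result a matrix.
same <- m[, colnames(m) == "test", drop = FALSE]
print(same)

# One sub-matrix per distinct column name.
by_name <- lapply(split(seq_len(ncol(m)), colnames(m)),
                  function(i) m[, i, drop = FALSE])
print(names(by_name))
```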

Turning a data.frame into a list of smaller data.frames in R

Suppose I have a data.frame like THIS (or see my code below). As you can see, after every some number of continuous rows, there is a row with all NAs.
I was wondering how I could split THIS data.frame based on every row of NA?
For example, in my code below, I want my original data.frame to be split into 3 smaller data.frames as there are 2 rows of NAs in the original data.frame.
Here is what I tried with no success:
## The original data.frame:
DF <- read.csv("https://raw.githubusercontent.com/izeh/i/master/m.csv", header = T)
## the index number of rows with "NA"s; Here rows 7 and 14:
b <- as.numeric(rownames(DF[!complete.cases(DF), ]))
## split DF by rows that have "NA"s; that is rows 7 and 14:
split(DF, b)
If we also need the NA rows, create a group with cumsum on the 'study.name' column which is blank (or NA)
library(dplyr)
DF %>%
group_split(grp = cumsum(lag(study.name == "", default = FALSE)), keep = FALSE)
Or with base R
split(DF, cumsum(c(FALSE, head(DF$study.name == "", -1))))
Or with NA
i1 <- rowSums(is.na(DF))== ncol(DF)
split(DF, cumsum(c(FALSE, head(i1, -1))))
Or based on 'b'
DF1 <- DF[setdiff(seq_len(nrow(DF)), b), ]
split(DF1, as.character(DF1$study.name))
You can find occurrence of b in sequence of rows in DF and use cumsum to create groups.
split(DF, cumsum(seq_len(nrow(DF)) %in% b))
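Since the CSV above may not always be reachable, here is a self-contained sketch on a made-up data frame (toy) with the same layout. Unlike the variants above, this one drops the all-NA separator rows from the pieces, which is an assumption about what is wanted:

```r
toy <- data.frame(study.name = c("A", "A", NA, "B", NA, "C", "C"),
                  x = c(1, 2, NA, 3, NA, 4, 5))

# Rows in which every column is NA act as separators.
all_na <- rowSums(is.na(toy)) == ncol(toy)

# The running count of separators seen so far gives a group id per row.
grp <- cumsum(all_na)

# Drop the separator rows, then split the rest by group.
pieces <- split(toy[!all_na, ], grp[!all_na])
print(pieces)
```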

Rename the column names after cbind-ing the data

merger <- cbind(as.character(Date),weather1$High,weather1$Low,weather1$Avg..High,weather1$Avg.Low,sale$Scanned.Movement[a])
After cbind-ing the data, the new DF automatically has the column names V1, V2, ...
I want to rename the first column with
colnames(merger)[,1] <- "Date"
but it failed. And when I use merger$V1, I get
Error in merger$V1 : $ operator is invalid for atomic vectors
You can also name columns directly in the cbind call, e.g.
cbind(date=c(0,1), high=c(2,3))
Output:
     date high
[1,]    0    2
[2,]    1    3
Try:
colnames(merger)[1] <- "Date"
Example
Here is a simple example:
a <- 1:10
b <- cbind(a, a, a)
colnames(b)
# change the first one
colnames(b)[1] <- "abc"
# change all colnames
colnames(b) <- c("aa", "bb", "cc")
You gave the following example in your question:
colnames(merger)[,1]<-"Date"
the problem is the comma: colnames() returns a vector, not a matrix, so the solution is:
colnames(merger)[1]<-"Date"
If you pass only vectors to cbind() it creates a matrix, not a dataframe. Read ?data.frame.
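A short sketch of that difference, using made-up vectors (dates, high, low) in place of the question's data:

```r
dates <- c("2020-01-01", "2020-01-02")
high  <- c(30, 32)
low   <- c(20, 21)

# Only vectors go in, so cbind() returns a character matrix, not a data frame.
merged <- cbind(dates, high, low)
is_mat <- is.matrix(merged)

# Renaming works on the names vector (no comma).
colnames(merged)[1] <- "Date"

# Converting makes `$` available, but every column is still text here.
merged_df <- as.data.frame(merged)

# Building the data frame directly keeps the numeric columns numeric.
typed <- data.frame(Date = dates, high = high, low = low)
str(typed)
```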
A way of producing a data.frame and being able to do this in one line is to coerce all matrices/data frames passed to cbind into a data.frame while setting the column names attribute using setNames:
a = matrix(rnorm(10), ncol = 2)
b = matrix(runif(10), ncol = 2)
cbind(setNames(data.frame(a), c('n1', 'n2')),
setNames(data.frame(b), c('u1', 'u2')))
which produces:
n1 n2 u1 u2
1 -0.2731750 0.5030773 0.01538194 0.3775269
2 0.5177542 0.6550924 0.04871646 0.4683186
3 -1.1419802 1.0896945 0.57212043 0.9317578
4 0.6965895 1.6973815 0.36124709 0.2882133
5 0.9062591 1.0625280 0.28034347 0.7517128
Unfortunately, there is no setColNames function, analogous to setNames, that sets the column names and returns the object. However, there is nothing to stop you from adapting the code of setNames to produce one:
setColNames <- function (object = nm, nm) {
colnames(object) <- nm
object
}
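A usage sketch for such a helper follows. The definition is repeated (without the default argument) so that the snippet runs on its own, and the input matrices are made up:

```r
# Same idea as above: set the column names and return the object.
setColNames <- function(object, nm) {
  colnames(object) <- nm
  object
}

# Name the columns inline while combining two small matrices.
named <- cbind(setColNames(matrix(1:4, ncol = 2), c("n1", "n2")),
               setColNames(matrix(5:8, ncol = 2), c("u1", "u2")))
print(named)
```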
See this answer: the magrittr package contains functions for this (for example, set_colnames).
If you offer cbind a set of arguments all of which are vectors, you will get not a dataframe but a matrix, in this case an all-character matrix. The two have different features. You can get a dataframe if some of your arguments remain dataframes. Try:
merger <- cbind(Date =as.character(Date),
weather1[ , c("High", "Low", "Avg..High", "Avg.Low")] ,
ScnMov =sale$Scanned.Movement[a] )
It's easy: just put the name you want to use, in quotes, before the vector you are adding
a_matrix <- cbind(b_matrix,'Name-Change'= c_vector)
