R Programming: Logical dataframe to actual dataframe

I need to manipulate the records of a dataframe in R based on a logical dataframe.
I want to match the logical dataframe against the original one, keep only the values that are TRUE in the logical dataframe, use null for the FALSE values, and preserve the dataframe structure. Please suggest how to do this.
For example:
Original dataframe
ID Name Title
1 John Mr
2 Mike Mr
3 Susan Dr
Logical Dataframe
ID Name Title
False False False
False True False
False False True
Expected Dataframe
ID Name Title
2 Mike <null>
3 <null> Dr

Here's a shot:
orig <- read.table(text="ID Name Title
1 John Mr
2 Mike Mr
3 Susan Dr", header = TRUE, stringsAsFactors = FALSE)
lgl <- read.table(text="ID Name Title
False False False
False True False
False False True", header = TRUE, stringsAsFactors = FALSE)
newdf <- mapply(function(d,l) { d[!l] <- NA; d; }, orig, lgl)
newdf
# ID Name Title
# [1,] NA NA NA
# [2,] NA "Mike" NA
# [3,] NA NA "Dr"
newdf[ rowSums(!is.na(newdf)) > 0, ]
# ID Name Title
# [1,] NA "Mike" NA
# [2,] NA NA "Dr"
Your expected output is inconsistent in that you have FALSE in your $ID column, but you keep them in your output. You can fix that by changing those to TRUE and changing the filter to rowSums(!is.na(newdf)) > 1.
Explanation:
mapply runs a function (named or anonymous) on one or more lists, like a "zipper" function. That is:
mapply(func, 1:3, 4:6, 7:9, SIMPLIFY=FALSE)
is equivalent to
list(func(1,4,7), func(2,5,8), func(3,6,9))
!is.na(newdf) creates a logical matrix with the same dimensions and names as newdf (mapply simplified its result to a matrix here, not a data.frame).
Since sum(<logical_vector>) returns a single integer counting the TRUE elements, rowSums(...) returns a vector with one element per row, where each element is the number of TRUEs in that row.
... > 0 returns a logical vector that keeps only the rows with at least one non-NA element.
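As a quick check of the two ideas above, the sketch below confirms the "zipper" equivalence for mapply and shows how rowSums counts non-NA cells. Here func is a hypothetical stand-in for any function, and m is a hand-built copy of the matrix printed earlier, not output taken from the original session.

```r
# mapply "zips" its arguments: the i-th call receives the i-th element of each.
func <- function(a, b, c) a + b + c   # hypothetical example function
zipped <- mapply(func, 1:3, 4:6, 7:9, SIMPLIFY = FALSE)
manual <- list(func(1L, 4L, 7L), func(2L, 5L, 8L), func(3L, 6L, 9L))
identical(zipped, manual)

# Hand-built copy of the masked matrix (filled column by column).
m <- matrix(c(NA, NA, NA,  NA, "Mike", NA,  NA, NA, "Dr"), nrow = 3)

# rowSums over a logical matrix counts the TRUEs, i.e. the non-NA cells, per row.
counts <- rowSums(!is.na(m))
keep <- counts > 0   # rows with at least one non-NA value
m[keep, ]
```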
You said you wanted to always preserve $ID. In that case, you probably want to do (before processing):
lgl$ID <- TRUE
and change the condition to ... > 1 to mean "at least two non-NA elements, one of which we know is ID".
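Putting the pieces together, here is a minimal sketch of the ID-preserving variant. It swaps mapply for matrix-indexed assignment on the data.frame, which is an alternative technique rather than the code above, so that the result stays a data.frame with its column types intact. The data is typed in by hand from the question, and the names masked and result are illustrative.

```r
orig <- data.frame(ID = 1:3,
                   Name = c("John", "Mike", "Susan"),
                   Title = c("Mr", "Mr", "Dr"),
                   stringsAsFactors = FALSE)
lgl <- data.frame(ID = c(FALSE, FALSE, FALSE),
                  Name = c(FALSE, TRUE, FALSE),
                  Title = c(FALSE, FALSE, TRUE))

lgl$ID <- TRUE                      # always keep the ID column

masked <- orig
masked[!as.matrix(lgl)] <- NA       # blank out every cell flagged FALSE

# Keep rows with at least two non-NA cells: the ID plus one real value.
result <- masked[rowSums(!is.na(masked)) > 1, ]
result
```

Because ID is always non-NA after the first step, the > 1 threshold is what drops the row for John.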

Related

Getting grepl to return TRUE only if there is a match with the full string

I have example data as follows:
library(data.table)
dat <- fread("q1 q2 ...1 ..2 q3..1 ..1
NA response other else response other
1 4 NA NA 1 NA")
I wanted to filter out all columns that are automatically named when reading in an Excel file with missing column names, which have names like ..x. I thought that the following piece of code would work:
grepl("\\.+", names(dat))
[1] FALSE FALSE TRUE TRUE TRUE TRUE
But it also filters out columns which have a similar structure as column q3..1.
Although I do not know why the ..x part is added to such a column (because it was not empty), I would like to adapt the grepl code, so that the outcome is TRUE, unless the structure is ONLY ..x.
How should I do this?
Desired output:
grepl("\\.+", names(dat))
[1] FALSE FALSE TRUE TRUE FALSE TRUE
Use the anchor ^ to require that the dots be at the start of the string:
grepl("^\\.+", names(dat))
#[1] FALSE FALSE TRUE TRUE FALSE TRUE
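If a name should count only when it consists of nothing but dots followed by digits, the pattern can be anchored at both ends. The sketch below compares the three patterns on the column names from the question. The fully anchored pattern ^\\.+[0-9]+$ is an assumption about what "ONLY ..x" means, not something stated in the answer.

```r
nms <- c("q1", "q2", "...1", "..2", "q3..1", "..1")

unanchored <- grepl("\\.+", nms)          # any dot anywhere in the name
start_only <- grepl("^\\.+", nms)         # name begins with one or more dots
full_match <- grepl("^\\.+[0-9]+$", nms)  # name is only dots followed by digits

# Names that would be kept after dropping the auto-generated ones.
kept <- nms[!full_match]
kept
```

For these six names the start-anchored and fully anchored patterns agree; they would differ only for a name such as "..x1" that starts with dots but has other characters after them.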
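Alternatively, with dplyr, select only the columns whose names contain no dots at all (note that this also drops q3..1):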
library(dplyr)
dat %>%
select(matches('^[^.]+$'))
q1 q2
<int> <char>
1: NA response
2: 1 4

R: Test for overlap of name values in dataframe

I have a dataframe filled with names.
For a given row in the dataframe, I'd like to compare that row to every row above it in the df and determine if the number of matching names is less than or equal to 4 for every row.
Toy Example where row 3 is the row of interest
"Jim","Dwight","Michael","Andy","Stanley","Creed"
"Jim","Dwight","Angela","Pam","Ryan","Jan"
"Jim","Dwight","Angela","Pam","Creed","Ryan" <--- row of interest
So first we'd compare row 3 to row 1 and see that the name overlap is 3, which meets the <= 4 criteria.
Then we'd compare row 3 to row 2 and see that the name overlap is 5 which fails the <= 4 criteria, ultimately returning a failed condition for being <=4 for every row above it.
Right now I am doing this operation using a for loop but the speed is much too slow for the dataframe size I am working with.
Example data
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)
df
# V1 V2 V3 V4 V5 V6
# 1 Jim Dwight Michael Andy Stanley Creed
# 2 Jim Dwight Angela Pam Ryan Jan
# 3 Jim Dwight Angela Pam Creed Ryan
Operation and output (sapply over columns with %in% and take rowSums)
out_lgl <- rowSums(sapply(df, '%in%', unlist(df[3,]))) <= 4
out_lgl
# [1] TRUE FALSE FALSE
which(out_lgl)
# [1] 1
Explanation:
For each column, each element is compared to the third row (the vector unlist(df[3,])). The output is a matrix of logical values with the same dimensions as df, TRUE if there is a match.
sapply(df, '%in%', unlist(df[3,]))
# V1 V2 V3 V4 V5 V6
# [1,] TRUE TRUE FALSE FALSE FALSE TRUE
# [2,] TRUE TRUE TRUE TRUE TRUE FALSE
# [3,] TRUE TRUE TRUE TRUE TRUE TRUE
Then we can sum the TRUEs to see the number of matches for each row
rowSums(sapply(df, '%in%', unlist(df[3,])))
# [1] 3 5 6
Edit:
I have added the stringsAsFactors = FALSE option to the creation of df above. However, as far as I can tell the output of %in% is the same whether comparing factors with different levels or characters, so I don't believe this could change the results in any way. See example below
x <- c('b', 'c', 'z')
y <- c('a', 'b', 'g')
all.equal(x %in% y, factor(x) %in% factor(y))
# [1] TRUE
A similar solution to IceCreamToucan's, but for any row.
For the data.frame:
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)
For any row number i:
f <- function(i) {
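if(i == 1) return(TRUE)
r <- vapply(df[1:(i-1),], '%in%', unlist(df[i,]), FUN.VALUE = logical(i-1))
# for i == 2 vapply returns a plain vector, so rebuild the (i-1)-row matrix explicitly
out_lgl <- rowSums(matrix(r, nrow = i - 1)) <= 4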
return(all(out_lgl))
}
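For comparison, the same check can be written with intersect, which makes the "overlap with every row above" logic explicit. This is a sketch for readability rather than speed, and overlap_ok and max_overlap are hypothetical names. Note that intersect counts distinct shared names, so it would differ from the %in% count if a name were repeated within a row.

```r
df <- as.data.frame(rbind(
  c("Jim", "Dwight", "Michael", "Andy", "Stanley", "Creed"),
  c("Jim", "Dwight", "Angela", "Pam", "Ryan", "Jan"),
  c("Jim", "Dwight", "Angela", "Pam", "Creed", "Ryan")
), stringsAsFactors = FALSE)

# TRUE if row i shares at most max_overlap names with every row above it.
overlap_ok <- function(df, i, max_overlap = 4) {
  if (i == 1) return(TRUE)            # nothing above the first row
  target <- unlist(df[i, ])
  counts <- vapply(seq_len(i - 1), function(j) {
    length(intersect(unlist(df[j, ]), target))
  }, integer(1))
  all(counts <= max_overlap)
}

sapply(seq_len(nrow(df)), function(i) overlap_ok(df, i))
```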

Matching rownames that are equal to colnames (of a symmetric or asymmetric matrix)

I'm doing a statistical analysis on distance matrices in R and want to compare distances within individuals and between groups. I have a matrix where some of the colnames are equal to some of the rownames. I want to extract the values where this criterion is met (the problem is getting it to work on an asymmetric matrix). It would be great if the code could also save a matrix of logical values marking where the criterion is met.
An example of a smaller matrix is shown below:
1 2 3 4
1 0.4966143 0.8359290 0.7319204 0.7579902
3 0.7002979 0.8621343 0.5152356 0.7875813
4 0.7406555 0.8371479 0.7103873 0.5530200
I want it to end up like this
1 2 3 4
1 TRUE FALSE FALSE FALSE
3 FALSE FALSE TRUE FALSE
4 FALSE FALSE FALSE TRUE
Would be happy if I could do it without any loops, just vectorized code
We can use outer
out <- outer(row.names(m1), colnames(m1), `==`)
dimnames(out) <- dimnames(m1)
out
# 1 2 3 4
#1 TRUE FALSE FALSE FALSE
#3 FALSE FALSE TRUE FALSE
#4 FALSE FALSE FALSE TRUE
Or replicate the rownames and column names to make the lengths equal and then do a ==
`dim<-`(row.names(m1)[row(m1)] == colnames(m1)[col(m1)], dim(m1))
NOTE: as @NelsonGon suggested, when data is read (read.table/read.csv etc.) as a data.frame, the column names get the prefix X because names starting with a digit are not syntactically valid. To avoid that, either pass check.names = FALSE to read.table/read.csv or post-process the column names:
outer(row.names(df), sub("^X","",names(df)),"==")
assuming 'df' is the name of the data.frame object
data
m1 <- structure(list(`1` = c(0.4966143, 0.7002979, 0.7406555),
`2` = c(0.835929, 0.8621343, 0.8371479),
`3` = c(0.7319204, 0.5152356, 0.7103873),
`4` = c(0.7579902, 0.7875813, 0.55302)),
class = "data.frame",
row.names = c("1", "3", "4"))
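Once the logical mask exists, it can be used directly to pull out the matching values, which is what the question ultimately asks for. The sketch below assumes the data is held as a plain numeric matrix m (typed in from the example) rather than the data.frame m1; within_vals and between_vals are illustrative names.

```r
m <- matrix(c(0.4966143, 0.7002979, 0.7406555,
              0.8359290, 0.8621343, 0.8371479,
              0.7319204, 0.5152356, 0.7103873,
              0.7579902, 0.7875813, 0.5530200),
            nrow = 3,
            dimnames = list(c("1", "3", "4"), c("1", "2", "3", "4")))

# TRUE wherever the row name equals the column name.
mask <- outer(rownames(m), colnames(m), `==`)
dimnames(mask) <- dimnames(m)

within_vals  <- m[mask]    # cells where row name and column name match
between_vals <- m[!mask]   # all remaining cells
within_vals
```

Logical indexing returns the values in column-major order, so within_vals lists the matches for columns 1, 3 and 4 in that order.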

Writing a boolean matrix to a string

I have a lower triangular matrix containing TRUE/FALSE values. The matrix is created from a pairwise.t.test and a comparison to the acceptable p-value (p<0.05 => TRUE).
I am trying to output the TRUE values of the matrix as a string in a specific format without using a mess of if conditions. My thought was to use matrix products/sums, but there may be no elegant solution. If you think it is impossible, I would like to know that as well so I don't bang my head against the wall forever.
The formatting:
If a pair of values (ex:1,2) are TRUE, we output it as "1≠2".
If a value is TRUE with multiple values (ex: 1 with 2,3), we output it as "1≠2,3".
If a value is TRUE with everyone (ex: 1 with 2,3,4), we use the word "all" => output is "1≠all"
If 2 pairs (ex:1,2 and 3,4) are TRUE, we separate them with a space. output is "1≠2 3≠4"
If everything is TRUE, we output "all≠"
As of now, I am doing it manually so I don't really have any code to show. I am open to any ideas :)
Examples:
1 2 3
2 TRUE NA NA
3 TRUE TRUE NA
4 TRUE TRUE FALSE
The string for this matrix would be "1,2≠all" because 1 and 2 are true with everyone.
1 2 3
2 FALSE NA NA
3 TRUE TRUE NA
4 TRUE TRUE FALSE
The string for this matrix would be "1,2≠3,4" because 1 is true with 3,4 and 2 is true with 3,4.
Test matrices:
mTest = matrix(c(T,T,T,NA,F,T,NA,NA,F),nrow=3,ncol=3) # "1≠all 2≠4"
row.names(mTest) <- c(2,3,4) ; colnames(mTest) <- c(1,2,3)
mTest[] = c(T,F,T,NA,F,T,NA,NA,F) # "1≠2 1,2≠4"
mTest[] = c(T,T,T,NA,T,F,NA,NA,T) # "1,3≠all"
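One possible starting point, offered as a partial sketch only: collect the TRUE cells with which(), then group the partners of each left-hand value. It covers the "pair" and "multiple values" rules, but not the "all" shorthand or the merging of values that share the same partners; the variable names are illustrative.

```r
neq <- "\u2260"   # the "not equal" sign used in the question

mTest <- matrix(c(TRUE, TRUE, TRUE, NA, FALSE, TRUE, NA, NA, FALSE),
                nrow = 3,
                dimnames = list(c("2", "3", "4"), c("1", "2", "3")))

# which() ignores both FALSE and NA, so only the TRUE cells remain.
idx <- which(mTest, arr.ind = TRUE)
left  <- colnames(mTest)[idx[, "col"]]
right <- rownames(mTest)[idx[, "row"]]

# Partners of each left-hand value, e.g. "1" -> c("2", "3", "4").
groups <- split(right, left)
parts <- paste0(names(groups), neq,
                vapply(groups, paste, character(1), collapse = ","))
out <- paste(parts, collapse = " ")
out
```

For this matrix the result is "1≠2,3,4 2≠4". Replacing a full partner list with "all" and collapsing identical right-hand sides would be further post-processing steps on groups.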

R: Choosing specific number of combinations from all possible combinations

Let's say we have the following dataset
set.seed(144)
dat <- matrix(rnorm(100), ncol=5)
The following expression creates all possible combinations of columns and removes the first row (the empty combination)
(cols <- do.call(expand.grid, rep(list(c(F, T)), ncol(dat)))[-1,])
# Var1 Var2 Var3 Var4 Var5
# 2 TRUE FALSE FALSE FALSE FALSE
# 3 FALSE TRUE FALSE FALSE FALSE
# 4 TRUE TRUE FALSE FALSE FALSE
# ...
# 31 FALSE TRUE TRUE TRUE TRUE
# 32 TRUE TRUE TRUE TRUE TRUE
My question is: how can I compute only the single, binary and triple combinations?
Selecting the rows with no more than 3 TRUE values works for this example: cols[rowSums(cols)<4L, ]
However, for larger inputs it gives the following error, which comes from expand.grid itself:
Error in rep.int(seq_len(nx), rep.int(rep.fac, nx)) :
invalid 'times' value
In addition: Warning message:
In rep.fac * nx : NAs produced by integer overflow
Any suggestions that would allow me to compute single, binary and triple combinations only?
You could try either
cols[rowSums(cols) < 4L, ]
Or
cols[Reduce(`+`, cols) < 4L, ]
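A small check, using the 5-column case from the question, that the two filters select the same rows. Both still build the full expand.grid table first, so neither avoids the overflow for many columns; the names a and b are illustrative.

```r
cols <- do.call(expand.grid, rep(list(c(FALSE, TRUE)), 5))[-1, ]

a <- cols[rowSums(cols) < 4L, ]      # row-wise count of TRUEs via a matrix
b <- cols[Reduce(`+`, cols) < 4L, ]  # same count by adding the columns one by one

identical(a, b)
nrow(a)   # choose(5,1) + choose(5,2) + choose(5,3)
```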
You can use this solution:
col.i <- do.call(c,lapply(1:3,combn,x=5,simplify=F))
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2
#
# <...skipped...>
#
# [[24]]
# [1] 2 4 5
#
# [[25]]
# [1] 3 4 5
Here, col.i is a list every element of which contains column indices.
How it works: combn generates all combinations of the numbers from 1 to 5 (requested by x=5) taken m at a time (simplify=FALSE ensures that the result is a list). lapply loops m over 1 to 3 and returns a list of lists. do.call(c,...) flattens that list of lists into a plain list.
You can use col.i to get certain columns from dat using e.g. dat[,col.i[[1]],drop=F] (1 is an index of the column combination, so you could use any number from 1 to 25; drop=F makes sure that when you pick just one column from dat, the result is not simplified to a vector, which might cause unexpected program behavior). Another option is to use lapply, e.g.
lapply(col.i, function(cols) dat[, cols, drop = FALSE])
which will return a list of matrices (dat is a matrix), each containing a certain subset of the columns of dat.
In case you want to get column indices as a boolean matrix, you can use:
col.b <- t(sapply(col.i,function(z) 1:5 %in% z))
# [,1] [,2] [,3] [,4] [,5]
# [1,] TRUE FALSE FALSE FALSE FALSE
# [2,] FALSE TRUE FALSE FALSE FALSE
# [3,] FALSE FALSE TRUE FALSE FALSE
# ...
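The two steps above can be wrapped into one helper that works for any number of columns n and any maximum combination size m, without ever calling expand.grid. This is a sketch in base R; combos_upto is a hypothetical name.

```r
# Logical matrix with one row per combination of 1..m columns out of n.
combos_upto <- function(n, m) {
  idx <- do.call(c, lapply(seq_len(m), function(k) {
    combn(n, k, simplify = FALSE)       # all k-element index sets
  }))
  # vapply yields an n x (number of combinations) matrix, so transpose it.
  t(vapply(idx, function(z) seq_len(n) %in% z, logical(n)))
}

col.b <- combos_upto(5, 3)
dim(col.b)
all(rowSums(col.b) <= 3)
```

The number of rows equals sum(choose(n, 1:m)), which grows far more slowly than the 2^n rows that expand.grid would have to allocate.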
[UPDATE]
A more efficient implementation:
library("gRbase")
coli <- function(x=5,m=3) {
col.i <- do.call(c,lapply(1:m,combnPrim,x=x,simplify=F))
z <- lapply(seq_along(col.i), function(i) x*(i-1)+col.i[[i]])
v.b <- rep(F,x*length(col.i))
v.b[unlist(z)] <- TRUE
matrix(v.b,ncol=x,byrow = TRUE)
}
coli(70,5) # takes about 30 sec on my desktop
