R data.frame manipulation: convert values to NA after a specific column

I have a large data.frame and I need to do a conversion row by row. My goal is to convert all values in a row to NA after the first occurrence of a specific character in that row.
As an example, here is a small sample from my real data set:
sample_df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
result_df <- data.frame( a = c("V","I","V","V"), b = c("I",NA,"V","V"), c = c(NA,NA,"I","V"), d = c(NA,NA,NA,"V"))
In sample_df, I want to turn all values in each row to NA after the first "I", giving result_df above.
I tried base R, dplyr, and purrr but could not come up with an algorithm.
Thanks for your help.

Try this:
Find "I" values
I_true<-sample_df=="I"
I_true
a b c d
[1,] FALSE TRUE FALSE FALSE
[2,] TRUE FALSE FALSE FALSE
[3,] FALSE FALSE TRUE TRUE
[4,] FALSE FALSE FALSE FALSE
Count, for each row, how many "I"s have been seen up to each column
out<-t(apply(t(I_true),2,cumsum))
out
a b c d
[1,] 0 1 1 1
[2,] 1 1 1 1
[3,] 0 0 1 2
[4,] 0 0 0 0
Replace needed values
output<-out
output[out>=1]<-NA
output[output==0]<-"V"
output[I_true]<-"I"
output[out>=2]<-NA
Your output
output
a b c d
[1,] "V" "I" NA NA
[2,] "I" NA NA NA
[3,] "V" "V" "I" "I"
[4,] "V" "V" "V" "V"
Example 2:
sample_df <- data.frame( a = c("V","I","I","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
sample_df
a b c d
1 V I V V
2 I V V V
3 I V I I
4 V V V V
output
a b c d
[1,] "V" "I" NA NA
[2,] "I" NA NA NA
[3,] "I" NA NA NA
[4,] "V" "V" "V" "V"

Here is a brute-force approach; it is probably the easiest to come up with but the least preferred:
df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"), stringsAsFactors=FALSE)
rowlength <- ncol(df)
for (i in 1:nrow(df)) {
  if (any(as.character(df[i, ]) == 'I')) {
    first <- which(as.character(df[i, ]) == 'I')[1] + 1
    if (first <= rowlength) {  # guard against an "I" in the last column
      df[i, first:rowlength] <- NA
    }
  }
}
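A quick sanity check on the sample data, printing df after the loop (it matches the result_df from the question):
df
#   a    b    c    d
# 1 V    I <NA> <NA>
# 2 I <NA> <NA> <NA>
# 3 V    V    I <NA>
# 4 V    V    V    V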

Here's a possible answer using ddply from the plyr package
ddply(sample_df, .(a, b, c, d), function(x) {
  idx <- which(x == 'I')[1] + 1   # position after the first 'I'
  if (!is.na(idx)) {              # check if an 'I' was found
    if (idx <= ncol(x)) {         # prevent out-of-bounds indexing
      x[, idx:ncol(x)] <- NA
    }
  }
  x
})

The plyr approach:
plyr::adply(sample_df, 1L, function(x) {
  if (all(x != "I"))
    return(x)
  x[1L:min(which(x == "I"))]
})
You have to use an if because which(x == "I") comes back empty for rows without at least one "I", so there would be nothing to subset up to.
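A small illustration of the empty-match case the if guards against:
which(c("V", "V", "V", "V") == "I")
# integer(0)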

My Solution:
Following @Julien Navarre's recommendation, I first created a toNA() function:
toNA <- function(x) {
  temp <- grep("INVALID", unlist(x))  # can be generalized to any string
  lt <- length(x)
  loc <- min(temp, 100) + 1           # 100 is an arbitrary number bigger than the actual column count
  #print(lt)                          # debug purposes
  if (loc < lt + 1) {
    x[loc:lt] <- NA
  }
  x
}
First, I tried the plyr::adply() and purrrlyr::by_row() functions to apply my toNA() function to my data.frame, which has over 3 million rows.
Both are very slow (for 1000 rows they take 9 and 6 seconds, respectively). They are also slow with a simple function(x) x, so I am not sure where the overhead comes from.
So I tried base::apply() (here result is my data set):
as.tibble(t(apply(result, 1, toNA ) ))
It only takes 0.2 seconds for 1000 rows.
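For reference, a rough way to reproduce such timings is to blow the small sample up to about 1000 rows and wrap the call in system.time() (just a sketch using the objects above; note that toNA() greps for "INVALID", so for sample_df you would grep for "I" instead):
big <- sample_df[rep(seq_len(nrow(sample_df)), 250), ]  # ~1000 rows
system.time(out <- t(apply(big, 1, toNA)))              # row-wise apply, as above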
I am not sure about the programming style, but for now this solution works for me.
Thanks for all your recommendations.

A pure base solution: we build a logical matrix flagging the "I" values, then a double cumsum by row tells us where the NAs must be placed:
result_df <- sample_df
is.na(result_df) <- t(apply(sample_df == "I",1,function(x) cumsum(cumsum(x)))) >1
result_df
# a b c d
# 1 V I <NA> <NA>
# 2 I <NA> <NA> <NA>
# 3 V V I <NA>
# 4 V V V V
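To see why the double cumsum works, take row 3 of the sample ("V","V","I","I"); the intermediate values are shown as comments:
x <- c(FALSE, FALSE, TRUE, TRUE)   # sample_df[3, ] == "I"
cumsum(x)                          # 0 0 1 2  running count of "I"s in the row
cumsum(cumsum(x))                  # 0 0 1 3  only exceeds 1 strictly after the first "I"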

Related

Display identical columns in R dataframe

Suppose I have the following data frame:
df <- data.frame(A=c(1,2,3),B=c("a","b","c"),C=c(2,1,3),D=c(1,2,3),E=c("a","b","c"),F=c(1,2,3))
> df
A B C D E F
1 1 a 2 1 a 1
2 2 b 1 2 b 2
3 3 c 3 3 c 3
I want to filter out the columns that are identical. I know that I can do it with
DuplCols <- df[duplicated(as.list(df))]
UniqueCols <- df[ ! duplicated(as.list(df))]
In the real world my data frame has more than 500 columns, and I know neither how many identical columns of the same kind there are nor their names. However, each column name is unique (as in df). My desired result is (optimally) a data frame where each row stores the column names of one group of identical columns. The number of columns in the DesiredResult data frame is the maximal number of identical columns of one kind in the original data frame; where a group has fewer members, NA should be stored:
> DesiredResult
X1 X2 X3
1 A D F
2 B E NA
3 C NA NA
(With "identical column of the same kind" I mean the following: in df the columns A, D, F are identical columns of the same kind and B, E are identical columns of the same kind.)
You can use unique() on the columns and then test with %in% which columns match each unique one, extracting the column names.
tt <- lapply(unique(as.list(df)), function(x) {colnames(df)[as.list(df) %in% list(x)]})
tt
#[[1]]
#[1] "A" "D" "F"
#
#[[2]]
#[1] "B" "E"
#
#[[3]]
#[1] "C"
t(sapply(tt, "length<-", max(lengths(tt)))) # pad with NA to equal length and bind into a matrix
# [,1] [,2] [,3]
#[1,] "A" "D" "F"
#[2,] "B" "E" NA
#[3,] "C" NA NA

How to adapt string replacing function to replace specific numbers in data frame with NA?

I wrote a function that perfectly replaces custom values of a matrix with NA.
NAfun <- function (x, z) {
x[x %in% z] <- NA
x
}
M <- matrix(1:12, 3, 4)
M[1, 2] <- -77
M[2, 1] <- -99
> M
[,1] [,2] [,3] [,4]
[1,] 1 -77 7 10
[2,] -99 5 8 11
[3,] 3 6 9 12
z <- c(-77, -99)
> NAfun(M, z)
[,1] [,2] [,3] [,4]
[1,] 1 NA 7 10
[2,] NA 5 8 11
[3,] 3 6 9 12
But this won't work with data frames.
D <- as.data.frame(matrix(LETTERS[1:12], 3, 4))
> D
V1 V2 V3 V4
1 A D G J
2 B E H K
3 C F I L
z <- c("B", "D")
> NAfun(D, z)
V1 V2 V3 V4
1 A D G J
2 B E H K
3 C F I L
D[] <- lapply(D, function(x) as.character(x)) # same with character vectors
> NAfun(D, z)
V1 V2 V3 V4
1 A D G J
2 B E H K
3 C F I L
If I convert the data frame to a matrix it works, though.
> NAfun(as.matrix(D), z)
V1 V2 V3 V4
[1,] "A" NA "G" "J"
[2,] NA "E" "H" "K"
[3,] "C" "F" "I" "L"
But I can't in my case.
I don't understand why this doesn't work as it is. How can I adapt the function so that it works with a data frame, or preferably with both types? Thanks.
You can probably make this more elegant but here's a solution using purrr that works in both cases.
NAfun <- function(x, z) {
  f1 <- function(x, z) {
    x[x %in% z] <- NA
    x
  }
  purrr::modify(x, ~ f1(., z))
}
As @Lyngbakr correctly mentioned, the behavior is actually consistent between D and M: NAfun worked on D once D had already been converted to a matrix by D <- sapply(D, as.character).
Now, the question is why the behavior differs between a matrix and a data.frame. The actual reason is the %in% operator.
On a matrix, the %in% operator compares each individual value against the vector z:
D %in% z
#[1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
whereas on a data.frame, %in% compares whole columns (in the examples below, M is used as a data.frame). Hence,
M %in% c(-99,-77)
#[1] FALSE FALSE FALSE FALSE
But
M %in% M[1:2]
#[1] TRUE TRUE FALSE FALSE
M %in% list(c(1,-99,3))
#[1] TRUE FALSE FALSE FALSE
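Roughly speaking, match() (which powers %in%) coerces a list, and hence a data.frame, with as.character(), so every column gets deparsed into a single string before comparison; a quick way to see this:
as.character(data.frame(a = 1:2, b = 3:4))
# [1] "1:2" "3:4"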
The modification needed in NAfun so that it handles both a data.frame and a matrix:
NAfun <- function(x, z) {
  x <- as.matrix(x)
  x[x %in% z] <- NA
  x
}
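Note that this version always returns a character matrix, even for data.frame input; if you need a data.frame back, wrap the call (a small usage sketch with D and z from the question):
as.data.frame(NAfun(D, z))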

"lapply" in R does not work for each element

test.data <- data.frame(a=seq(10),b=rep(seq(5),times=2),c=rep(seq(5),each=2))
test.data <- data.frame(lapply(test.data, as.character), stringsAsFactors = F)
test.ref <- data.frame(original=seq(10),name=letters[1:10])
test.ref <- data.frame(lapply(test.ref, as.character), stringsAsFactors = F)
test.match <- function(x) {
  result <- test.ref$name[which(test.ref$original == x)]
  return(result)
}
> data.frame(lapply(test.data, test.match))
a b c
1 a a a
2 b b a
3 c c a
4 d d a
5 e e a
6 f a a
7 g b a
8 h c a
9 i d a
10 j e a
> lapply(test.data, test.match)
$a
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
$b
[1] "a" "b" "c" "d" "e"
$c
[1] "a"
Hi all,
I am learning to use the apply family in R, but I am stuck on a rather simple exercise. Above is my code. I am trying to use the test.match function to replace all the elements in test.data according to the reference rule in test.ref. However, the last column is not correct when I turn the final result into a data frame, and it is even worse when I keep the result as a list.
Many thanks for your help,
Kevin
As mentioned in the comments, you probably want match:
do.test.match.df <- function(df, ref_df = test.ref){
res <- df
res[] <- lapply(df, function(x) ref_df$name[ match(x, ref_df$original) ])
return(res)
}
do.test.match.df(test.data)
which gives
a b c
1 a a a
2 b b a
3 c c b
4 d d b
5 e e c
6 f a c
7 g b d
8 h c d
9 i d e
10 j e e
This is the idiomatic way. lapply will always return a vanilla list. A data.frame is a special kind of list (a list of column vectors). With res[] <- lapply(df, myfun), we're assigning to columns of res.
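A minimal illustration of the difference between assigning into res[] and plain assignment:
d <- data.frame(x = 1:2, y = 3:4)
d[] <- lapply(d, as.character)   # replaces the columns, keeps the data.frame
class(d)                         # "data.frame"
d2 <- lapply(d, as.character)    # plain lapply result
class(d2)                        # "list"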
Since all your columns are the same class, I'd suggest using a matrix instead of a data.frame.
test.mat <- as.matrix(test.data)
do.test.match <- function(mat, ref_df = test.ref) {
  res <- matrix(NA_character_, nrow(mat), ncol(mat))   # pre-allocate a character matrix
  res[] <- ref_df$name[match(c(mat), ref_df$original)]
  return(res)
}
do.test.match(test.mat)

cbind coerces a data frame to matrix

I'm having trouble when using cbind. Prior to using cbind, the object is a data.frame of two character vectors.
After I add a column using cbind, the data.frame changes class to matrix. I've tried as.vector, declaring h as an empty character vector, etc., but couldn't fix it. Thank you for any suggestions and help.
output <- data.frame(h = character(), st = character()) ## empty dataframe
st <- state.abb
h <- (rep("a", 50))
output <- cbind(output$h, h) ## output changes to matrix class here
output <- cbind(output, st) ## adding a second column
I guess you may not need cbind().
output <- data.frame(state = state.abb, h = rep("a", 50))
head(output)
state h
1 AL a
2 AK a
3 AZ a
4 AR a
5 CA a
6 CO a
@Ken I'm not sure what you actually want to obtain, but it may be easier if the variables are kept in a list. Below is an example.
state <- state.abb
h <- rep("a", 50)
lst <- list(state = state, h = h)
mat <- as.matrix(do.call(cbind, lst))
head(mat)
state h
[1,] "AL" "a"
[2,] "AK" "a"
[3,] "AZ" "a"
[4,] "AR" "a"
[5,] "CA" "a"
[6,] "CO" "a"
df <- as.data.frame(do.call(cbind, lst))
head(df)
state h
1 AL a
2 AK a
3 AZ a
4 AR a
5 CA a
6 CO a
As a complementary note, you could use single-bracket notation to make it work with something close to your original code:
data
output <- data.frame(h = letters[1:5],st = letters[6:10])
h2 <- (rep("a", 5))
This won't work
cbind(output$h, h2)
# h2
# [1,] "1" "a"
# [2,] "2" "a"
# [3,] "3" "a"
# [4,] "4" "a"
# [5,] "5" "a"
class(cbind(output$h, h2)) # matrix
It's a matrix, and the factor has been coerced to its underlying integer codes.
This will work
cbind(output["h"], h2)
# h h2
# 1 a a
# 2 b a
# 3 c a
# 4 d a
# 5 e a
class(cbind(output["h"], h2)) # data.frame
Note that with double brackets (output[["h"]]) you'll have the same inadequate result as when using the dollar notation.
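A quick check of the three forms (using the output and h2 objects defined above):
class(cbind(output$h, h2))       # "matrix" ("matrix" "array" in R >= 4.0)
class(cbind(output[["h"]], h2))  # the same coercion as with $
class(cbind(output["h"], h2))    # "data.frame"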

How to create a factor from a binary indicator matrix?

Say I have the following matrix mat, which is a binary indicator matrix for the levels A, B, and C for a set of 5 observations:
mat <- matrix(c(1,0,0,
1,0,0,
0,1,0,
0,1,0,
0,0,1), ncol = 3, byrow = TRUE)
colnames(mat) <- LETTERS[1:3]
> mat
A B C
[1,] 1 0 0
[2,] 1 0 0
[3,] 0 1 0
[4,] 0 1 0
[5,] 0 0 1
I want to convert that into a single factor such that the output is equivalent to fac defined as:
> fac <- factor(rep(LETTERS[1:3], times = c(2,2,1)))
> fac
[1] A A B B C
Levels: A B C
Extra points if you get the labels from the colnames of mat, but a set of numeric codes (e.g. c(1,1,2,2,3)) would also be acceptable as desired output.
An elegant solution with matrix multiplication (and the shortest so far):
as.factor(colnames(mat)[mat %*% 1:ncol(mat)])
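The intermediate product shows what the trick does: each row of the indicator matrix picks out its own column index, which is then used to index colnames(mat).
mat %*% 1:ncol(mat)
#      [,1]
# [1,]    1
# [2,]    1
# [3,]    2
# [4,]    2
# [5,]    3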
This solution makes use of the arr.ind=TRUE argument of which, returning the matching positions as array locations. These are then used to index the colnames:
> factor(colnames(mat)[which(mat==1, arr.ind=TRUE)[, 2]])
[1] A A B B C
Levels: A B C
Decomposing into steps:
> which(mat==1, arr.ind=TRUE)
row col
[1,] 1 1
[2,] 2 1
[3,] 3 2
[4,] 4 2
[5,] 5 3
Use the values of the second column, i.e. which(...)[, 2], to index colnames:
> colnames(mat)[c(1, 1, 2, 2, 3)]
[1] "A" "A" "B" "B" "C"
And then convert to a factor
One way is to replicate each column name by the number of rows and index that directly with the (logical) matrix, then wrap the result in factor() to restore the levels:
factor(rep(colnames(mat), each = nrow(mat))[as.logical(mat)])
[1] A A B B C
Levels: A B C
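Decomposed: rep() lays out the column names in column-major order, which matches how as.logical(mat) flattens the matrix, so the logical index keeps exactly one name per row.
rep(colnames(mat), each = nrow(mat))
# [1] "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "C" "C" "C" "C" "C"
as.logical(mat)
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE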
If this is from model.matrix, the colnames have fac prepended, and so this should work the same but removing the extra text:
factor(gsub("^fac", "", rep(colnames(mat), each = nrow(mat))[as.logical(mat)]))
You could use something like this:
lvls<-apply(mat, 1, function(currow){match(1, currow)})
fac<-factor(lvls, 1:3, labels=colnames(mat))
Here is another one
factor(rep(colnames(mat), colSums(mat)))
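Here colSums(mat) gives the number of observations per level, so rep() expands the column names accordingly (note this matches the row order only because the rows are already grouped by level, as in the example):
colSums(mat)
# A B C
# 2 2 1
rep(colnames(mat), colSums(mat))
# [1] "A" "A" "B" "B" "C"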
