Replace values in a submatrix of a dataframe in R

I have a dataframe, say x, and I would like to replace the 0 values with NA, for columns say c("A", "B", "C", "D"), on rows 1:10. Is there an efficient/compact way of doing it?

Try:
If you want to replace the 0 values with NA across the whole dataset:
set.seed(41)
d1 <- as.data.frame(matrix(sample(0:5, 4*10, replace=TRUE), dimnames=list(NULL, LETTERS[1:4]), ncol=4))
d1[!d1] <- NA  # !d1 is TRUE exactly where d1 is 0, so this indexes the zero cells
d1
If your dataset has more columns and you want to replace zeros only in a subset of them:
set.seed(41)
d2 <- as.data.frame(matrix(sample(0:5, 8*10, replace=TRUE), dimnames=list(NULL, LETTERS[1:8]), ncol=8))
d2[, LETTERS[1:4]][!d2[, LETTERS[1:4]]] <- NA  # zeros in columns A:D only
d2
# A B C D E F G H
#1 1 4 NA 3 1 5 2 1
#2 5 4 4 3 5 4 5 0
#3 3 4 1 4 5 1 0 4
#4 NA 1 4 5 3 5 1 1
#5 5 NA 5 4 0 0 4 5
#6 5 4 3 4 2 0 4 5
#7 5 5 2 3 2 1 3 4
#8 3 NA 1 1 5 0 2 0
#9 4 5 2 5 3 0 0 1
#10 4 NA 2 5 4 1 1 0
If it is only for a subset of rows (1:5) and columns (A:D):
d2[1:5, LETTERS[1:4]][!d2[1:5, LETTERS[1:4]]] <- NA
d2
# A B C D E F G H
#1 1 4 NA 3 1 5 2 1
#2 5 4 4 3 5 4 5 0
#3 3 4 1 4 5 1 0 4
#4 NA 1 4 5 3 5 1 1
#5 5 NA 5 4 0 0 4 5
#6 5 4 3 4 2 0 4 5
#7 5 5 2 3 2 1 3 4
#8 3 0 1 1 5 0 2 0
#9 4 5 2 5 3 0 0 1
#10 4 0 2 5 4 1 1 0
Compare the two results above: in the second case only rows 1:5 were touched, so the zeros in rows 6:10 of columns A:D (e.g. B in rows 8 and 10) are left intact.
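Mapped back onto the question's own names, the pattern is (a minimal sketch, assuming the data frame is called x and has at least 10 rows):
cols <- c("A", "B", "C", "D")
x[1:10, cols][x[1:10, cols] == 0] <- NA  # zeros in rows 1:10 of A:D become NA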

Replace a cell with NA according to value in another cell in R

I have a dataset from which I made a reproducible example:
set.seed(1)
Data <- data.frame(
A = sample(0:5),
B = sample(0:5),
C = sample(0:5),
D = sample(0:5),
corr_A.B = sample(0:5),
corr_A.C = sample(0:5),
corr_A.D = sample(0:5))
> Data
A B C D corr_A.B corr_A.C corr_A.D
1 1 5 4 2 1 2 4
2 5 3 1 3 5 5 0
3 2 2 3 4 0 1 2
4 3 0 5 0 4 0 1
5 0 4 2 1 2 3 3
6 4 1 0 5 3 4 5
For each of the columns B, C and D, whenever a cell equals 0, I would like to replace the corresponding corr_A column on the same row with NA. For instance, since Data$B[4] equals 0, I would like Data$corr_A.B[4] to be replaced by NA.
I look to obtain the following result:
> Data
A B C D corr_A.B corr_A.C corr_A.D
1 1 5 4 2 1 2 4
2 5 3 1 3 5 5 0
3 2 2 3 4 0 1 2
4 3 0 5 0 NA 0 NA
5 0 4 2 1 2 3 3
6 4 1 0 5 3 NA 5
I have tried different ways, using for loops, but I am struggling a lot. Also, the dataset I am working on has many other columns that do not need to be checked for that condition, so I would like to be able to specifically designate the columns in which I am looking for 0 values.
If someone would be kind enough to give it a try? Many thanks!
A one-liner using the replacement function is.na<-:
is.na(Data[5:7]) <- Data[2:4] == 0
Data
# A B C D corr_A.B corr_A.C corr_A.D
#1 1 5 4 2 1 2 4
#2 5 3 1 3 5 5 0
#3 2 2 3 4 0 1 2
#4 3 0 5 0 NA 0 NA
#5 0 4 2 1 2 3 3
#6 4 1 0 5 3 NA 5
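The same one-liner can be written by name rather than by position, which is safer if the column order ever changes (a hedged variant that assumes the corr_A. prefix naming used above):
check_cols <- c("B", "C", "D")
corr_cols <- paste0("corr_A.", check_cols)  # "corr_A.B" "corr_A.C" "corr_A.D"
is.na(Data[corr_cols]) <- Data[check_cols] == 0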
For a base R solution, we can just use ifelse here:
Data$corr_A.B <- ifelse(Data$B == 0, NA, Data$corr_A.B)
Data$corr_A.C <- ifelse(Data$C == 0, NA, Data$corr_A.C)
Data$corr_A.D <- ifelse(Data$D == 0, NA, Data$corr_A.D)
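With many such column pairs, the repeated ifelse lines can be collapsed into a loop (a sketch under the same corr_A. naming assumption):
for (col in c("B", "C", "D")) {
  corr <- paste0("corr_A.", col)
  Data[[corr]] <- ifelse(Data[[col]] == 0, NA, Data[[corr]])
}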
Or, using dplyr's case_when:
library(dplyr)
df <- data.frame(A = c(1,5,2,3,0,4),
                 B = c(5,3,2,0,4,1),
                 C = c(4,1,3,5,2,0),
                 D = c(2,3,4,0,1,5),
                 corr_A.B = c(1,5,0,4,2,3),
                 corr_A.C = c(2,5,1,0,3,4),
                 corr_A.D = c(4,0,2,1,3,5))
df %>% mutate(corr_A.B = case_when(B == 0 ~ NA_real_,
                                   TRUE ~ corr_A.B),
              corr_A.C = case_when(C == 0 ~ NA_real_,
                                   TRUE ~ corr_A.C),
              corr_A.D = case_when(D == 0 ~ NA_real_,
                                   TRUE ~ corr_A.D))
A B C D corr_A.B corr_A.C corr_A.D
1 1 5 4 2 1 2 4
2 5 3 1 3 5 5 0
3 2 2 3 4 0 1 2
4 3 0 5 0 NA 0 NA
5 0 4 2 1 2 3 3
6 4 1 0 5 3 NA 5
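On dplyr >= 1.0 the three case_when calls can be collapsed with across(); this is a hedged sketch that assumes the corr_A. prefix naming and uses cur_column() to look up the matching data column:
df %>%
  mutate(across(starts_with("corr_A."),
                ~ replace(.x, .data[[sub("corr_A\\.", "", cur_column())]] == 0, NA)))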
A base, one-liner, vectorized, but convoluted solution:
Data[t(t(which(Data[,2:4]==0,arr.ind=TRUE))+c(0,4))]<-NA
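The same assignment, spelled out step by step (an equivalent but more explicit form):
idx <- which(Data[2:4] == 0, arr.ind = TRUE)  # (row, col) positions of zeros in B:D
idx[, "col"] <- idx[, "col"] + 4              # shift columns 1:3 (B:D) to 5:7 (corr_A.*)
Data[idx] <- NA                               # matrix indexing assigns NA at those cells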
Using apply(), you could do the following (note that this appends NA-masked copies of B, C and D as new columns rather than modifying the corr_A columns in place):
cbind(Data, apply(Data[c("B", "C", "D")], 2, function(x) {
  ifelse(x == 0, NA, x)
}))

Count non-zero values of a column in R

Suppose I have a data frame like this one:
DF
Id X Y Z
1 1 5 0
1 2 0 0
1 3 0 5
1 4 9 0
1 5 2 3
1 6 5 0
2 1 5 0
2 2 4 0
2 3 0 6
2 4 9 6
2 5 2 0
2 6 5 2
3 1 5 6
3 2 4 0
3 3 6 5
3 4 9 0
3 5 2 0
3 6 5 0
I want to count the number of non-zero entries of variable Z within each Id and record that value in a new column Count, so that the new data frame looks like:
DF1
Id X Y Z Count
1 1 5 0 2
1 2 0 0 2
1 3 0 5 2
1 4 9 0 2
1 5 2 3 2
1 6 5 0 2
2 1 5 0 3
2 2 4 0 3
2 3 0 6 3
2 4 9 6 3
2 5 2 0 3
2 6 5 2 3
3 1 5 6 2
3 2 4 0 2
3 3 6 5 2
3 4 9 0 2
3 5 2 0 2
3 6 5 0 2
We can use base R's ave, counting the number of non-zero values of Z within each Id:
df$Count <- ave(df$Z, df$Id, FUN = function(x) sum(x!=0))
df$Count
#[1] 2 2 2 2 2 2 3 3 3 3 3 3 2 2 2 2 2 2
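An equivalent grouped mutate in dplyr, for comparison (assuming dplyr is installed):
library(dplyr)
df <- df %>%
  group_by(Id) %>%
  mutate(Count = sum(Z != 0)) %>%  # count of non-zero Z within each Id
  ungroup()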
You can try this, it gives you exactly what you want:
library(data.table)
dt <- data.table(df)
dt[, Count := sum(Z != 0), by = Id]
dt
# Id X Y Z Count
# 1: 1 1 5 0 2
# 2: 1 2 0 0 2
# 3: 1 3 0 5 2
# 4: 1 4 9 0 2
# 5: 1 5 2 3 2
# 6: 1 6 5 0 2
# 7: 2 1 5 0 3
# 8: 2 2 4 0 3
# 9: 2 3 0 6 3
# 10: 2 4 9 6 3
# 11: 2 5 2 0 3
# 12: 2 6 5 2 3
# 13: 3 1 5 6 2
# 14: 3 2 4 0 2
# 15: 3 3 6 5 2
# 16: 3 4 9 0 2
# 17: 3 5 2 0 2
# 18: 3 6 5 0 2
This will also work (with the caveat that it assumes every Id has at least one non-zero Z; aggregate drops all-zero groups, which would misalign the rep):
df$Count <- rep(aggregate(Z~Id, df[df$Z != 0,], length)$Z, table(df$Id))
Id X Y Z Count
1 1 1 5 0 2
2 1 2 0 0 2
3 1 3 0 5 2
4 1 4 9 0 2
5 1 5 2 3 2
6 1 6 5 0 2
7 2 1 5 0 3
8 2 2 4 0 3
9 2 3 0 6 3
10 2 4 9 6 3
11 2 5 2 0 3
12 2 6 5 2 3
13 3 1 5 6 2
14 3 2 4 0 2
15 3 3 6 5 2
16 3 4 9 0 2
17 3 5 2 0 2
18 3 6 5 0 2
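If you like the aggregate idea but want it to stay correct when some Id has only zeros, a tapply + match sketch avoids the dropped-group problem noted above:
cnt <- tapply(df$Z != 0, df$Id, sum)       # named count per Id (0 for all-zero groups)
df$Count <- cnt[match(df$Id, names(cnt))]  # map the counts back onto the rows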

R Partial Reshape Data from Long to Wide

I would like to reshape a dataset from long to wide. Specifically, the new wide dataset should have one row per unique ID from the long dataset, and a number of columns that grows with the number of unique values of another variable.
Let's say this is the original dataset:
ID a b C d e f g
1 1 1 1 1 2 3 4
1 1 1 2 5 6 7 8
2 2 2 1 1 2 3 4
2 2 2 3 9 0 1 2
2 2 2 2 5 6 7 8
3 3 3 3 9 0 1 2
3 3 3 2 5 6 7 8
3 3 3 1 1 2 3 4
In the new dataset, the number of rows equals the number of unique IDs, the number of columns is 3 plus 4 times the number of unique values of C, and the values of variables d to g are placed after sorting C in ascending order. It should look something like this:
ID a b d1 e1 f1 g1 d2 e2 f2 g2 d3 e3 f3 g3
1 1 1 1 2 3 4 5 6 7 8 NA NA NA NA
2 2 2 1 2 3 4 5 6 7 8 9 0 1 2
3 3 3 1 2 3 4 5 6 7 8 9 0 1 2
You can use dcast from data.table:
data.table::setDT(df)
data.table::dcast(df, ID + a + b ~ C, sep = "", value.var = c("d", "e", "f", "g"), fill=NA)
ID a b d1 d2 d3 e1 e2 e3 f1 f2 f3 g1 g2 g3
1: 1 1 1 1 5 NA 2 6 NA 3 7 NA 4 8 NA
2: 2 2 2 1 5 9 2 6 0 3 7 1 4 8 2
3: 3 3 3 1 5 9 2 6 0 3 7 1 4 8 2
A base reshape version: just use C as your time variable and away you go.
reshape(dat, idvar=c("ID","a","b"), direction="wide", timevar="C", sep="")
# ID a b d1 e1 f1 g1 d2 e2 f2 g2 d3 e3 f3 g3
#1 1 1 1 1 2 3 4 5 6 7 8 NA NA NA NA
#3 2 2 2 1 2 3 4 5 6 7 8 9 0 1 2
#6 3 3 3 1 2 3 4 5 6 7 8 9 0 1 2
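For completeness, the same reshape can be done with tidyr (a hedged sketch assuming tidyr >= 1.0, where pivot_wider replaced spread); missing C values for an ID are filled with NA, matching the desired output:
library(tidyr)
pivot_wider(df, id_cols = c(ID, a, b), names_from = C,
            values_from = c(d, e, f, g), names_sep = "")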

Create a block column based on id and the value of another column in R

Given the first two columns below (id and time_diff), I want to generate the 'block' column.
test
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
The data is already sorted by id and time. time_diff is the difference between the row's time and the previous time within the same id. I want to create a block id: an auto-incrementing value that increases whenever a new id, or a time_diff greater than 10 within the same id, is encountered.
How can I achieve this in R?
Importing your data as a data frame with something like:
df = read.table(text='
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5')
You can do a one-liner like this to get occurrences satisfying your two conditions:
> new_col = as.vector(cumsum(
na.exclude(
c(F,diff(as.numeric(as.factor(df$id)))) | # change of id OR
df$time_diff > 10 # time_diff greater than 10
)
))
> new_col
[1] 0 0 0 0 0 1 2 2 2 2 3 3 4 4 4
And finally append this new column to your dataframe with cbind:
> cbind(df, block = c(0,new_col))
id time_diff block block
1 a NA 1 0
2 a 1 1 0
3 a 1 1 0
4 a 1 1 0
5 a 3 1 0
6 a 3 1 0
7 b NA 2 1
8 b 11 3 2
9 b 1 3 2
10 b 1 3 2
11 b 1 3 2
12 b 12 4 3
13 b 1 4 3
14 c NA 5 4
15 c 4 5 4
16 c 7 5 4
You will notice an offset between your wanted block variable and mine; correcting it is easy and can be done at several different steps.
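For instance, one minimal fix is to shift everything up by one:
cbind(df, block = c(0, new_col) + 1)  # now matches the wanted 1-based block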
Another variation of @Jealie's method (here id[-1] != id[-nrow(test)] flags rows where the id changes from the previous row) would be:
with(test, cumsum(c(TRUE,id[-1]!=id[-nrow(test)])|time_diff>10))
#[1] 1 1 1 1 1 1 2 3 3 3 3 4 4 5 5 5
After learning from Jealie and akrun, I came up with this idea (here !duplicated(id) is TRUE at the first row of each id):
library(dplyr)
mydf %>%
  mutate(group = cumsum(time_diff > 10 | !duplicated(id)))
# id time_diff block group
#1 a NA 1 1
#2 a 1 1 1
#3 a 1 1 1
#4 a 1 1 1
#5 a 3 1 1
#6 a 3 1 1
#7 b NA 2 2
#8 b 11 3 3
#9 b 1 3 3
#10 b 1 3 3
#11 b 1 3 3
#12 b 12 4 4
#13 b 1 4 4
#14 c NA 5 5
#15 c 4 5 5
#16 c 7 5 5
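A data.table spelling of the same cumsum rule, for completeness (a hedged sketch; the is.na() guard keeps the NA diffs from poisoning the comparison):
library(data.table)
setDT(test)[, block2 := cumsum(!duplicated(id) | (!is.na(time_diff) & time_diff > 10))]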
Here is an approach using dplyr:
require(dplyr)
set.seed(999)
test <- data.frame(
id = rep(letters[1:4], each = 3),
time_diff = sample(4:15)
)
test %>%
mutate(
b = as.integer(id) - lag(as.integer(id)),
more10 = time_diff > 10,
increment = pmax(b, more10, na.rm = TRUE),
increment = ifelse(row_number() == 1, 1, increment),
block = cumsum(increment)
) %>%
select(id, time_diff, block)
Try:
> df
id time_diff
1 a NA
2 a 1
3 a 1
4 a 1
5 a 3
6 a 3
7 b NA
8 b 11
9 b 1
10 b 1
11 b 1
12 b 12
13 b 1
14 c NA
15 c 4
16 c 7
block <- 1
for (i in 2:nrow(df))
  block[i] <- ifelse(df$time_diff[i] > 10 || df$id[i] != df$id[i-1],
                     block[i-1] + 1,  # a gap > 10 or a new id starts a new block
                     block[i-1])
df$block <- block
df
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5

Conditionally delete columns in R

I know how to delete columns in R, but I am not sure how to delete them based on the following set of conditions.
Suppose a data frame such as:
DF <- data.frame(L = c(2,4,5,1,NA,4,5,6,4,3), J= c(3,4,5,6,NA,3,6,4,3,6), K= c(0,1,1,0,NA,1,1,1,1,1),D = c(1,1,1,1,NA,1,1,1,1,1))
DF
L J K D
1 2 3 0 1
2 4 4 1 1
3 5 5 1 1
4 1 6 0 1
5 NA NA NA NA
6 4 3 1 1
7 5 6 1 1
8 6 4 1 1
9 4 3 1 1
10 3 6 1 1
The data frame has to be set up in this fashion. Column K corresponds to column L, and column D corresponds to column J. Because column D's values are all equal to one, I would like to delete column D and the corresponding column J, yielding a data frame that looks like:
DF
L K
1 2 0
2 4 1
3 5 1
4 1 0
5 NA NA
6 4 1
7 5 1
8 6 1
9 4 1
10 3 1
I know there has got to be a simple command to do this, I just can't think of one. And if it makes any difference, the NAs must be retained.
Additional helpful information: my real data frame has a total of 20 columns, so there are 10 columns like L and J, and another 10 like K and D. I need a function that can recognize the correspondence between these two groups and delete columns accordingly if necessary.
Thank you in advance!
Okay, assuming the column-number-based correspondence, here is an example:
> n <- 10
>
> # sample data
> d <- data.frame(lapply(1:n, function(x)sample(n)), lapply(1:n, function(x)sample(2, n, T, c(0.1, 0.9))-1))
> names(d) <- c(LETTERS[1:n], letters[1:n])
> head(d)
A B C D E F G H I J a b c d e f g h i j
1 5 5 2 7 4 3 4 3 5 8 0 1 1 1 1 1 1 1 1 1
2 9 8 4 6 7 8 8 2 10 5 1 1 1 1 1 1 1 1 1 1
3 6 6 10 3 5 6 2 1 8 6 1 1 1 1 1 1 1 1 1 1
4 1 7 5 5 1 10 10 4 2 4 1 1 1 1 1 1 1 1 1 1
5 10 9 6 2 9 5 6 9 9 9 1 1 0 1 1 1 1 1 1 1
6 2 1 1 4 6 1 5 8 4 10 1 1 1 1 1 1 1 1 1 1
>
> # find the column that should be left.
> idx <- which(colMeans(d[(n+1):(2*n)], na.rm = TRUE) != 1)
>
> # filter the data
> d[, c(idx, idx+n)]
A B C D F a b c d f
1 5 5 2 7 3 0 1 1 1 1
2 9 8 4 6 8 1 1 1 1 1
3 6 6 10 3 6 1 1 1 1 1
4 1 7 5 5 10 1 1 1 1 1
5 10 9 6 2 5 1 1 0 1 1
6 2 1 1 4 1 1 1 1 1 1
7 8 4 7 10 2 1 1 1 1 0
8 7 3 9 9 4 1 0 1 0 1
9 3 10 3 1 9 1 1 0 1 1
10 4 2 8 8 7 1 0 1 1 1
I basically agree with koshke (whose SO work is excellent), but would suggest that the test to use is colSums(d[(n+1):(2*n)], na.rm=TRUE) == NROW(d), since a paired 0 and 2 or -1 and 3 could throw off the colMeans test.
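For reference, the same column-number trick applied to the OP's 4-column DF (here n = 2 pairs, with K and D the indicator columns for L and J):
n <- 2
idx <- which(colMeans(DF[(n + 1):(2 * n)], na.rm = TRUE) != 1)  # keeps K, drops D
DF[, c(idx, idx + n)]  # leaves columns L and K, with the NA row retained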
