Easiest way to replace non-NA values by column index - r

I have a data frame like the following, with some NAs:
mydf=data.frame(ID=LETTERS[1:10], aaa=runif(10), bbb=runif(10), ccc=runif(10), ddd=runif(10))
mydf[c(1,4,5,7:10),2]=NA
mydf[c(1,2,4:8),3]=NA
mydf[c(3,4,6:10),4]=NA
mydf[c(1,3,4,6,9,10),5]=NA
> mydf
ID aaa bbb ccc ddd
1 A NA NA 0.08844614 NA
2 B 0.4912790 NA 0.88925139 0.1233173
3 C 0.1325188 0.1389260 NA NA
4 D NA NA NA NA
5 E NA NA 0.60750723 0.6357998
6 F 0.8218579 NA NA NA
7 G NA NA NA 0.5988206
8 H NA NA NA 0.4008338
9 I NA 0.8784563 NA NA
10 J NA 0.2959320 NA NA
What I want to accomplish here is the following:
1- Replace each non-NA value with its column index minus 1 (so aaa becomes 1, bbb becomes 2, and so on), so that the output looks like this:
> mydf
ID aaa bbb ccc ddd
1 A NA NA 3 NA
2 B 1 NA 3 4
3 C 1 2 NA NA
4 D NA NA NA NA
5 E NA NA 3 4
6 F 1 NA NA NA
7 G NA NA NA 4
8 H NA NA NA 4
9 I NA 2 NA NA
10 J NA 2 NA NA
2- Then I would like to add an extra column that shows the following:
0 for all NAs in a row
0 for a row with more than 1 non-NA value
the actual value when it is the only non-NA value in a row
The final result should look like this:
> mydf
ID aaa bbb ccc ddd final
1 A NA NA 3 NA 3
2 B 1 NA 3 4 0
3 C 1 2 NA NA 0
4 D NA NA NA NA 0
5 E NA NA 3 4 0
6 F 1 NA NA NA 1
7 G NA NA NA 4 4
8 H NA NA NA 4 4
9 I NA 2 NA NA 2
10 J NA 2 NA NA 2
I could probably do all this with an ugly for loop, then aggregate for the final column, and substitute by 0 where appropriate...
But I was wondering if there would be a clean way to do this with some apply calls in just a few lines...
Thanks!

You could do:
mydf[-1] <- sapply(1:4, \(x) x * mydf[x+1]/mydf[x+1])
mydf$final <- apply(mydf[-1], 1, function(x) {
if(all(is.na(x)) || sum(!is.na(x)) > 1) 0 else na.omit(x)
})
Result:
mydf
#> ID aaa bbb ccc ddd final
#> 1 A NA NA 3 NA 3
#> 2 B 1 NA 3 4 0
#> 3 C 1 2 NA NA 0
#> 4 D NA NA NA NA 0
#> 5 E NA NA 3 4 0
#> 6 F 1 NA NA NA 1
#> 7 G NA NA NA 4 4
#> 8 H NA NA NA 4 4
#> 9 I NA 2 NA NA 2
#> 10 J NA 2 NA NA 2
Created on 2022-12-16 with reprex v2.0.2

Here is an idea,
mydf1 <- cbind.data.frame(ID = mydf$ID, mapply(function(x, y) replace(x, !is.na(x), y),
mydf,
seq(ncol(mydf)) - 1)[,-1])
mydf1$final <- apply(mydf1[-1], 1, \(i)
ifelse(sum(is.na(i)) == (ncol(mydf) - 1) | sum(!is.na(i)) > 1, 0, i[!is.na(i)]))
mydf1
ID aaa bbb ccc ddd final
1 A <NA> <NA> 3 <NA> 3
2 B 1 <NA> 3 4 0
3 C 1 2 <NA> <NA> 0
4 D <NA> <NA> <NA> <NA> 0
5 E <NA> <NA> 3 4 0
6 F 1 <NA> <NA> <NA> 1
7 G <NA> <NA> <NA> 4 4
8 H <NA> <NA> <NA> 4 4
9 I <NA> 2 <NA> <NA> 2
10 J <NA> 2 <NA> <NA> 2

A third option could be
tmp <- mydf[,-1]
tmp[!is.na(tmp)] <- 1
(mydf[,-1] <- tmp * as.list(1:4))
# aaa bbb ccc ddd
#1 NA NA 3 NA
#2 1 NA 3 4
#3 1 2 NA NA
#4 NA NA NA NA
#5 NA NA 3 4
#6 1 NA NA NA
#7 NA NA NA 4
#8 NA NA NA 4
#9 NA 2 NA NA
#10 NA 2 NA NA
The final column can be generated like this
idx <- rowSums(tmp, na.rm = TRUE) == 1
mydf$final <- idx * max.col(replace(tmp, is.na(tmp), -Inf))
Result
mydf
# ID aaa bbb ccc ddd final
#1 A NA NA 3 NA 3
#2 B 1 NA 3 4 0
#3 C 1 2 NA NA 0
#4 D NA NA NA NA 0
#5 E NA NA 3 4 0
#6 F 1 NA NA NA 1
#7 G NA NA NA 4 4
#8 H NA NA NA 4 4
#9 I NA 2 NA NA 2
#10 J NA 2 NA NA 2
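For comparison, here is a minimal base R sketch of the same two steps built on col(), which returns each cell's column position. It is only an illustration, not one of the answers above: the small fixed data frame is an assumption standing in for the runif() data, and the names vals, idx and n_filled are made up for this example.

```r
# Deterministic toy data (assumed; the question uses runif(), which changes per run)
mydf <- data.frame(ID  = LETTERS[1:4],
                   aaa = c(NA, 0.5, 0.1, NA),
                   bbb = c(NA, NA, 0.2, NA),
                   ccc = c(0.3, 0.9, NA, NA))

vals <- as.matrix(mydf[-1])     # numeric matrix of the value columns
idx  <- col(vals)               # same shape, each cell holds its column position
idx[is.na(vals)] <- NA          # keep NA where the original value was NA
mydf[-1] <- as.data.frame(idx)  # existing column names are preserved

# Step 2: keep the index only when exactly one cell in the row is filled
n_filled   <- rowSums(!is.na(idx))
mydf$final <- ifelse(n_filled == 1, rowSums(idx, na.rm = TRUE), 0)
mydf
```

When a row has exactly one non-NA cell, its row sum is that cell's value, which is why rowSums() can stand in for a lookup here.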

Related

Turn value to the left of NA to NA value for entire dataframe

I have the following dataframe
df <- data.frame(a = c(1,2,3,4),
b = c(NA,1,NA,1),
c = c(1,4,5,2),
d = c(1,NA,NA,1))
a b c d
1 1 NA 1 1
2 2 1 4 NA
3 3 NA 5 NA
4 4 1 2 1
I have columns b and d with either NA or 1.
I have columns a and c with my values.
I want every value immediately to the left of an NA in b or d to become NA.
So I want the following df_1, but I can't figure out how to get there:
a b c d
1 NA NA 1 1
2 2 1 NA NA
3 NA NA NA NA
4 4 1 2 1
You can try:
df[c(TRUE, FALSE)][is.na(df[c(FALSE, TRUE)])] <- NA
df
a b c d
1 NA NA 1 1
2 2 1 NA NA
3 NA NA NA NA
4 4 1 2 1
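The one-liner relies on logical index recycling, which is easier to see when it is unpacked. The following is a sketch of the same idea in separate steps; the names value_cols, flag_cols, mask and vals are introduced here purely for illustration.

```r
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(NA, 1, NA, 1),
                 c = c(1, 4, 5, 2),
                 d = c(1, NA, NA, 1))

value_cols <- c(TRUE, FALSE)    # recycled over the 4 columns: selects a and c
flag_cols  <- c(FALSE, TRUE)    # recycled over the 4 columns: selects b and d

mask <- is.na(df[flag_cols])    # logical matrix, TRUE where b or d is NA
vals <- df[value_cols]          # the columns sitting to the left of the flags
vals[mask] <- NA                # blank out the matching positions
df[value_cols] <- vals
df
```

This only works because value and flag columns strictly alternate; with a different layout the two column sets would have to be named explicitly.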
You can use this function:
myFun <- function(df){
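for(i in seq_len(nrow(df))) {  # loop over rows, not columns
if(is.na(df$b[i]))
df$a[i] <- NA  # a real NA; the string "NA" would turn the column into character
if(is.na(df$d[i]))
df$c[i] <- NA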
}
df
}
Output:
myFun(df)
a b c d
1 NA NA 1 1
2 2 1 NA NA
3 NA NA NA NA
4 4 1 2 1

Match character and order

If I have data such as
idx<-c("1_1_2015_0_00_00","1_1_2015_0_10_00","1_1_2015_0_30_00","1_1_2015_0_40_00","1_1_2015_0_60_00","1_1_2015_0_80_00")
rr<-c(2,3,4,1,5,6)
no<-seq(1,6)
dat<-data.frame(no,idx,rr)
Then I want to align it with a standard index:
id<-c("1_1_2015_0_00_00","1_1_2015_0_10_00","1_1_2015_0_20_00","1_1_2015_0_30_00","1_1_2015_0_40_00","1_1_2015_0_50_00","1_1_2015_0_60_00","1_1_2015_0_70_00","1_1_2015_0_80_00")
so that the entries of the standard index that are missing from my data show up as NA rows, like this:
no idx rr
1 1 1_1_2015_0_00_00 2
2 2 1_1_2015_0_10_00 3
3 NA NA NA
4 3 1_1_2015_0_30_00 4
5 4 1_1_2015_0_40_00 1
6 NA NA NA
7 5 1_1_2015_0_60_00 5
8 NA NA NA
9 6 1_1_2015_0_80_00 6
How can I get this?
You can use match
dat[match(id, dat$idx), ]
# no idx rr
#1 1 1_1_2015_0_00_00 2
#2 2 1_1_2015_0_10_00 3
#NA NA <NA> NA
#3 3 1_1_2015_0_30_00 4
#4 4 1_1_2015_0_40_00 1
#NA.1 NA <NA> NA
#5 5 1_1_2015_0_60_00 5
#NA.2 NA <NA> NA
#6 6 1_1_2015_0_80_00 6
match(id, dat$idx) returns
#[1] 1 2 NA 3 4 NA 5 NA 6
and we use this vector to select rows of dat.
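As a smaller illustration of the same mechanism, the sketch below uses shortened, made-up keys ("a" to "d") instead of the timestamps. Attaching the standard index as an extra id column is an optional addition, assumed here so that the NA rows still show which key they belong to.

```r
dat <- data.frame(no = 1:3, idx = c("a", "b", "d"), rr = c(2, 3, 4))
id  <- c("a", "b", "c", "d")   # standard index; "c" is missing from dat

pos <- match(id, dat$idx)      # position of each id in dat$idx, NA if absent
out <- dat[pos, ]              # indexing a row with NA yields an all-NA row
rownames(out) <- NULL          # drop the NA, NA.1, ... row names
out$id <- id                   # keep the full standard index alongside
out
```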

Column-wise subset of data frame in R

I need some help with subset/filter of data.frame. Below is the code for my random dataset.
A <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
B <- c(3,3,3,3,4,4,4,4,1,1,1,1,2,2,2,2)
C <- c(1,1,1,1,3,3,3,3,2,2,2,2,4,4,4,4)
Fakey <- data.frame(A, B, C)
Filter_Fakey <- subset(Fakey, (Fakey>1 & Fakey<4))
That last line of code results in the following:
> Filter_Fakey
A B C
5 2 4 3
6 2 4 3
7 2 4 3
8 2 4 3
9 3 1 2
10 3 1 2
11 3 1 2
12 3 1 2
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
NA.9 NA NA NA
NA.10 NA NA NA
NA.11 NA NA NA
NA.12 NA NA NA
NA.13 NA NA NA
NA.14 NA NA NA
NA.15 NA NA NA
But what I really want is this:
> Filter_Fakey
A B C
5 2 3 3
6 2 3 3
7 2 3 3
8 2 3 3
9 3 2 2
10 3 2 2
11 3 2 2
12 3 2 2
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
NA.9 NA NA NA
NA.10 NA NA NA
NA.11 NA NA NA
NA.12 NA NA NA
NA.13 NA NA NA
NA.14 NA NA NA
NA.15 NA NA NA
I've tried subset(), subset(with a negation condition), filter{dplyr}, and the different bracket notations ('[' and '[['). Thanks for helping me out.
Use lapply to loop through the columns of the data frame and set the values that fall outside the condition to NA, if that is what you are after. Then use order(is.na(...)) to move the NA values to the end of each column:
do.call(cbind, lapply(Fakey, function(col) {
col[col <= 1 | col >= 4] <- NA; col[order(is.na(col))]
}))
A B C
1 2 3 3
2 2 3 3
3 2 3 3
4 2 3 3
5 3 2 2
6 3 2 2
7 3 2 2
8 3 2 2
9 NA NA NA
10 NA NA NA
11 NA NA NA
12 NA NA NA
13 NA NA NA
14 NA NA NA
15 NA NA NA
16 NA NA NA
Another option is using length<- to pad NA at the end after subsetting each of the columns using the logical condition.
data.frame(lapply(Fakey, function(x) `length<-`(x[x > 1 & x <4], nrow(Fakey))))
# A B C
#1 2 3 3
#2 2 3 3
#3 2 3 3
#4 2 3 3
#5 3 2 2
#6 3 2 2
#7 3 2 2
#8 3 2 2
#9 NA NA NA
#10 NA NA NA
#11 NA NA NA
#12 NA NA NA
#13 NA NA NA
#14 NA NA NA
#15 NA NA NA
#16 NA NA NA
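The padding trick is easier to follow on a single vector. This is a minimal sketch with made-up values; the variable name kept is an assumption for the example.

```r
x <- c(1, 2, 3, 4)

kept <- x[x > 1 & x < 4]    # values that satisfy the condition: 2 3
length(kept) <- length(x)   # growing a vector pads it with NA at the end
kept
# [1]  2  3 NA NA
```

Calling `length<-`(v, n) as a function, as in the answer above, does the same thing without needing an intermediate variable.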

Multiple aggregation in R with 4 parameters

I have four vectors (columns)
x y z t
1 1 1 10
1 1 1 15
2 4 1 14
2 3 1 15
2 2 1 17
2 1 2 19
1 4 2 18
1 4 2 NA
2 2 2 45
3 3 2 NA
3 1 3 59
4 3 3 23
1 4 3 45
4 4 4 74
2 1 4 86
How can I calculate mean and median of vector t, for each value of vector y (from 1 to 4) where x=1, z=1, using aggregate function in R?
It was discussed how to do it with 3 parameters (Multiple Aggregation in R), but it's a little unclear how to do it with 4 parameters.
Thank you.
You could try something like this in data.table
library(data.table)
data <- data.table(yourdataframe)
bar <- data[,.N,by=y]
foo <- data[x==1 & z==1,list(mean.t=mean(t,na.rm=T),median.t=median(t,na.rm=T)),by=y]
merge(bar[,list(y)],foo,by="y",all.x=T)
y mean.t median.t
1: 1 12.5 12.5
2: 2 NA NA
3: 3 NA NA
4: 4 NA NA
You probably could do the same in aggregate, but I am not sure you can do it in one easy step.
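For what it is worth, here is one possible sketch of the aggregate route in base R. It is not a single step: it filters first, aggregates, flattens the matrix column that aggregate returns, and then merges back onto all values of y. The data frame d is typed in from the table in the question; the names sub, res and out are assumptions for this example.

```r
d <- data.frame(
  x = c(1, 1, 2, 2, 2, 2, 1, 1, 2, 3, 3, 4, 1, 4, 2),
  y = c(1, 1, 4, 3, 2, 1, 4, 4, 2, 3, 1, 3, 4, 4, 1),
  z = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  t = c(10, 15, 14, 15, 17, 19, 18, NA, 45, NA, 59, 23, 45, 74, 86)
)

sub <- subset(d, x == 1 & z == 1)            # apply the x and z conditions first
res <- aggregate(t ~ y, data = sub,          # the formula method drops NA rows
                 FUN = function(v) c(mean = mean(v), median = median(v)))
res <- do.call(data.frame, res)              # flatten to columns t.mean, t.median

# Left join so that y values with no matching rows appear as NA
out <- merge(data.frame(y = sort(unique(d$y))), res, by = "y", all.x = TRUE)
out
```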
An answer to an additional request in the comments...
bar <- data.table(expand.grid(y=unique(data$y),z=unique(data[z %in% c(1,2,3,4),z])))
foo <- data[x==1 & z %in% c(1,2,3,4),list(
mean.t=mean(t,na.rm=T),
median.t=median(t,na.rm=T),
Q25.t=quantile(t,0.25,na.rm=T),
Q75.t=quantile(t,0.75,na.rm=T)
),by=list(y,z)]
merge(bar[,list(y,z)],foo,by=c("y","z"),all.x=T)
y z mean.t median.t Q25.t Q75.t
1: 1 1 12.5 12.5 11.25 13.75
2: 1 2 NA NA NA NA
3: 1 3 NA NA NA NA
4: 1 4 NA NA NA NA
5: 2 1 NA NA NA NA
6: 2 2 NA NA NA NA
7: 2 3 NA NA NA NA
8: 2 4 NA NA NA NA
9: 3 1 NA NA NA NA
10: 3 2 NA NA NA NA
11: 3 3 NA NA NA NA
12: 3 4 NA NA NA NA
13: 4 1 NA NA NA NA
14: 4 2 18.0 18.0 18.00 18.00
15: 4 3 45.0 45.0 45.00 45.00
16: 4 4 NA NA NA NA

How to implement conditional search to upper direction from each row using dplyr?

This is a sample data frame as below:
df <- data.frame(
A=c(1,2,3,4,5,6,7),
B=c(1,NA,3,2,NA,4,3),
C=c(NA,1,NA,NA,1,NA,NA),
D=c(NA,2,NA,NA,4,NA,NA))
> df
A B C D
1 1 1 NA NA
2 2 NA 1 2
3 3 3 NA NA
4 4 2 NA NA
5 5 NA 1 4
6 6 4 NA NA
7 7 3 NA NA
I want to implement the following manipulation using dplyr pipes in R.
Add a new column E that takes its value from D under the following conditions:
From each row where !is.na(C), search upward for the nearest earlier row where !is.na(C).
If such a row exists, fill E with the value of D stored in that earlier row; otherwise E stays NA.
This is the desired output:
> df2
A B C D E
1 1 1 NA NA NA
2 2 NA 1 2 NA
3 3 3 NA NA NA
4 4 2 NA NA NA
5 5 NA 1 4 2
6 6 4 NA NA NA
7 7 3 NA NA NA
I would prefer to implement the upward search with dplyr pipes.
I know about the lag function in base, but it does not work for this problem. I also tried the slice function in dplyr, but it does not search upward from each row either, and I could not get the filtering right.
I hope you can suggest other solutions.
We can copy the contents of D in E and use tidyr::fill to replace NA's with recent non-NA values and use lag to get previous value in E.
library(dplyr)
df %>%
mutate(E = D) %>%
tidyr::fill(E) %>%
mutate(E = replace(lag(E), is.na(D), NA))
# A B C D E
#1 1 1 NA NA NA
#2 2 NA 1 2 NA
#3 3 3 NA NA NA
#4 4 2 NA NA NA
#5 5 NA 1 4 2
#6 6 4 NA NA NA
#7 7 3 NA NA NA
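To make the fill-then-lag logic visible without any packages, the sketch below reproduces the same three steps in base R on the D column alone. The cummax() fill-down idiom and the names idx, filled and lagged are assumptions introduced for illustration; the answer itself uses tidyr::fill and dplyr::lag.

```r
D <- c(NA, 2, NA, NA, 4, NA, NA)

# Step 1: fill down, i.e. carry the last non-NA value forward
idx    <- cummax(seq_along(D) * (!is.na(D)))  # 0 2 2 2 5 5 5
filled <- c(NA, D)[idx + 1]                   # NA 2 2 2 4 4 4

# Step 2: shift by one row so each row sees the value known before it
lagged <- c(NA, head(filled, -1))             # NA NA 2 2 2 4 4

# Step 3: keep the result only on rows where D itself is present
E <- replace(lagged, is.na(D), NA)
E
# [1] NA NA NA NA  2 NA NA
```

Note that this keys on D rather than C, as the answer does; that is equivalent here only because C and D are NA in exactly the same rows.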
This uses bind_rows to combine the rows where C is NA with the rows where C is not NA, applying lag to D only within the latter:
bind_rows(df%>%
filter(is.na(C))%>%
mutate(E = NA)
,
df%>%
filter(!is.na(C))%>%
mutate(E = lag(D))
)%>%
arrange(A)
A B C D E
1 1 1 NA NA NA
2 2 NA 1 2 NA
3 3 3 NA NA NA
4 4 2 NA NA NA
5 5 NA 1 4 2
6 6 4 NA NA NA
7 7 3 NA NA NA
In data.table this is very simple:
library(data.table)
dt <- as.data.table(df)
dt[!is.na(C), E:=shift(D)][]
A B C D E
1: 1 1 NA NA NA
2: 2 NA 1 2 NA
3: 3 3 NA NA NA
4: 4 2 NA NA NA
5: 5 NA 1 4 2
6: 6 4 NA NA NA
7: 7 3 NA NA NA
Base isn't too bad either:
# base
df2 <- df
df2$E <- NA
ind <- !is.na(df2$C)
df2[ind, 'E'] <- df2[ind, 'D'][c(NA,seq_len(sum(ind)-1))]
df2
