If I have data such as
idx<-c("1_1_2015_0_00_00","1_1_2015_0_10_00","1_1_2015_0_30_00","1_1_2015_0_40_00","1_1_2015_0_60_00","1_1_2015_0_80_00")
rr<-c(2,3,4,1,5,6)
no<-seq(1,6)
dat<-data.frame(no,idx,rr)
Then I want to pair it with a standard index
id<-c("1_1_2015_0_00_00","1_1_2015_0_10_00","1_1_2015_0_20_00","1_1_2015_0_30_00","1_1_2015_0_40_00","1_1_2015_0_50_00","1_1_2015_0_60_00","1_1_2015_0_70_00","1_1_2015_0_80_00")
so that the rows line up with the standard index and the missing entries show up as NA, such as
no idx rr
1 1 1_1_2015_0_00_00 2
2 2 1_1_2015_0_10_00 3
3 NA NA NA
4 3 1_1_2015_0_30_00 4
5 4 1_1_2015_0_40_00 1
6 NA NA NA
7 5 1_1_2015_0_60_00 5
8 NA NA NA
9 6 1_1_2015_0_80_00 6
How to get it?
You can use match
dat[match(id, dat$idx), ]
# no idx rr
#1 1 1_1_2015_0_00_00 2
#2 2 1_1_2015_0_10_00 3
#NA NA <NA> NA
#3 3 1_1_2015_0_30_00 4
#4 4 1_1_2015_0_40_00 1
#NA.1 NA <NA> NA
#5 5 1_1_2015_0_60_00 5
#NA.2 NA <NA> NA
#6 6 1_1_2015_0_80_00 6
match(id, dat$idx) returns
#[1] 1 2 NA 3 4 NA 5 NA 6
and we use this vector to select rows of dat.
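To make the mechanics a bit more concrete, here is a minimal sketch on toy data (the names std and small and their values are made up for illustration, they are not the question's data). match() gives, for each element of its first argument, the position of the first match in the second argument, or NA when there is none, and indexing rows with NA produces an all-NA row. merge() with all.x = TRUE is shown as one possible alternative that keeps the standard index in the result; note that merge() sorts by the key by default.

```r
# Toy data, hypothetical values
std   <- c("a", "b", "c", "d")                     # the "standard" index
small <- data.frame(idx = c("a", "c"), rr = c(10, 30),
                    stringsAsFactors = FALSE)

pos <- match(std, small$idx)   # position in small$idx, NA where std has no partner
out <- small[pos, ]            # an NA position yields a row of NAs
rownames(out) <- NULL          # drop the NA, NA.1, ... row names

# Possible alternative: left join on the standard index
alt <- merge(data.frame(idx = std, stringsAsFactors = FALSE),
             small, by = "idx", all.x = TRUE)
```

With merge() the idx column stays filled in for the missing rows, which may or may not be what you want.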
I have a data frame like the following, with some NAs:
mydf=data.frame(ID=LETTERS[1:10], aaa=runif(10), bbb=runif(10), ccc=runif(10), ddd=runif(10))
mydf[c(1,4,5,7:10),2]=NA
mydf[c(1,2,4:8),3]=NA
mydf[c(3,4,6:10),4]=NA
mydf[c(1,3,4,6,9,10),5]=NA
> mydf
ID aaa bbb ccc ddd
1 A NA NA 0.08844614 NA
2 B 0.4912790 NA 0.88925139 0.1233173
3 C 0.1325188 0.1389260 NA NA
4 D NA NA NA NA
5 E NA NA 0.60750723 0.6357998
6 F 0.8218579 NA NA NA
7 G NA NA NA 0.5988206
8 H NA NA NA 0.4008338
9 I NA 0.8784563 NA NA
10 J NA 0.2959320 NA NA
What I want to accomplish here is the following:
1- replace non-NA values by column index -1, so that the output looks like this:
> mydf
ID aaa bbb ccc ddd
1 A NA NA 3 NA
2 B 1 NA 3 4
3 C 1 2 NA NA
4 D NA NA NA NA
5 E NA NA 3 4
6 F 1 NA NA NA
7 G NA NA NA 4
8 H NA NA NA 4
9 I NA 2 NA NA
10 J NA 2 NA NA
2- Then I would like to add an extra column that shows the following:
0 for all NAs in a row
0 for a row with more than 1 non-NA value
the actual value when it is the only non-NA value in a row
The final result should look like this:
> mydf
ID aaa bbb ccc ddd final
1 A NA NA 3 NA 3
2 B 1 NA 3 4 0
3 C 1 2 NA NA 0
4 D NA NA NA NA 0
5 E NA NA 3 4 0
6 F 1 NA NA NA 1
7 G NA NA NA 4 4
8 H NA NA NA 4 4
9 I NA 2 NA NA 2
10 J NA 2 NA NA 2
I could probably do all this with an ugly for loop, then aggregate for the final column, and substitute by 0 where appropriate...
But I was wondering if there would be a clean way to do this with some apply calls in just a few lines...
Thanks!
You could do:
mydf[-1] <- sapply(1:4, \(x) x * mydf[x+1]/mydf[x+1])
mydf$final <- apply(mydf[-1], 1, function(x) {
if(all(is.na(x)) | sum(!is.na(x)) > 1) 0 else na.omit(x)
})
Result:
mydf
#> ID aaa bbb ccc ddd final
#> 1 A NA NA 3 NA 3
#> 2 B 1 NA 3 4 0
#> 3 C 1 2 NA NA 0
#> 4 D NA NA NA NA 0
#> 5 E NA NA 3 4 0
#> 6 F 1 NA NA NA 1
#> 7 G NA NA NA 4 4
#> 8 H NA NA NA 4 4
#> 9 I NA 2 NA NA 2
#> 10 J NA 2 NA NA 2
Created on 2022-12-16 with reprex v2.0.2
Here is an idea,
mydf1 <- cbind.data.frame(ID = mydf$ID, mapply(function(x, y) replace(x, !is.na(x), y),
mydf,
seq(ncol(mydf)) - 1)[,-1])
mydf1$final <- apply(mydf1[-1], 1, \(i)
ifelse(sum(is.na(i)) == (ncol(mydf) - 1) | sum(!is.na(i)) > 1, 0, i[!is.na(i)]))
mydf1
ID aaa bbb ccc ddd final
1 A <NA> <NA> 3 <NA> 3
2 B 1 <NA> 3 4 0
3 C 1 2 <NA> <NA> 0
4 D <NA> <NA> <NA> <NA> 0
5 E <NA> <NA> 3 4 0
6 F 1 <NA> <NA> <NA> 1
7 G <NA> <NA> <NA> 4 4
8 H <NA> <NA> <NA> 4 4
9 I <NA> 2 <NA> <NA> 2
10 J <NA> 2 <NA> <NA> 2
A third option could be
tmp <- mydf[,-1]
tmp[!is.na(tmp)] <- 1
(mydf[,-1] <- tmp * as.list(1:4))
# aaa bbb ccc ddd
#1 NA NA 3 NA
#2 1 NA 3 4
#3 1 2 NA NA
#4 NA NA NA NA
#5 NA NA 3 4
#6 1 NA NA NA
#7 NA NA NA 4
#8 NA NA NA 4
#9 NA 2 NA NA
#10 NA 2 NA NA
The final column can be generated like this
idx <- rowSums(tmp, na.rm = TRUE) == 1
mydf$final <- idx * max.col(replace(tmp, is.na(tmp), -Inf))
Result
mydf
# ID aaa bbb ccc ddd final
#1 A NA NA 3 NA 3
#2 B 1 NA 3 4 0
#3 C 1 2 NA NA 0
#4 D NA NA NA NA 0
#5 E NA NA 3 4 0
#6 F 1 NA NA NA 1
#7 G NA NA NA 4 4
#8 H NA NA NA 4 4
#9 I NA 2 NA NA 2
#10 J NA 2 NA NA 2
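In case the rowSums/max.col step looks opaque, here is a small sketch of the same idea on a made-up indicator matrix (m, one, where and final are names chosen for illustration). It uses ties.method = "first" only to make the result deterministic; rows that are all NA or have several values get multiplied by FALSE anyway.

```r
# Indicator matrix: 1 where a value is present, NA otherwise (toy data)
m <- matrix(c(NA,  1, NA,
               1,  1, NA,
              NA, NA, NA,
              NA, NA,  1), ncol = 3, byrow = TRUE)

one   <- rowSums(m, na.rm = TRUE) == 1        # exactly one non-NA in the row?
where <- max.col(replace(m, is.na(m), -Inf),  # column holding that value
                 ties.method = "first")
final <- one * where                          # FALSE * column -> 0
final
```

Here final comes out as 2 0 0 3: the column number for rows with a single value, 0 otherwise.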
I have a large dataframe, 300+ columns (time series) with about 2600 observations. The columns are filled with a lot of NA's and then a short time series, and then typically NA's again. I would like to find the first non-NA value in each column and replace it with NA.
This is what I'm hoping to achieve, only with a much bigger dataframe:
Before:
x1 x2 x3 x4
1 NA NA NA NA
2 NA NA NA NA
3 1 1 NA NA
4 2 2 1 1
5 3 3 2 2
6 4 4 3 3
7 5 5 4 4
8 6 6 5 5
9 7 7 6 6
10 8 8 7 7
11 9 9 NA NA
12 10 10 NA NA
13 NA NA NA NA
14 NA NA NA NA
After:
x1 x2 x3 x4
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 2 2 NA NA
5 3 3 2 2
6 4 4 3 3
7 5 5 4 4
8 6 6 5 5
9 7 7 6 6
10 8 8 7 7
11 9 9 NA NA
12 10 10 NA NA
13 NA NA NA NA
14 NA NA NA NA
I've searched around and found a way to do this for a single column, but my efforts to apply it to the whole dataframe have proven difficult.
I have created an example dataframe to reproduce my original dataframe:
#Dataframe with NA
x1=x2=c(NA,NA,1:10,NA,NA)
x3=x4=c(NA,NA,NA,1:7,NA,NA,NA,NA)
df=data.frame(x1,x2,x3,x4)
I have used this to replace the first value with NA in one column (provided by @Joshua Ulrich), however I would like to apply it to all columns without manually editing the code for 300+ columns:
NonNAindex <- which(!is.na(df[,1]))
firstNonNA <- min(NonNAindex)
is.na(df[,1]) <- seq(firstNonNA, length.out=1)
I have tried to set the above as a function and run it for all columns with apply/lapply, as well as a for loop, but haven't really figured out how to apply the changes to my dataframe. I'm sure there is something I've completely overlooked as I'm just taking my first small steps in R.
All suggestions would be highly appreciated!
We can use base R (df1 here is the example data frame, called df in the question)
df1[] <- lapply(df1, function(x) replace(x, which(!is.na(x))[1], NA))
df1
# x1 x2 x3 x4
#1 NA NA NA NA
#2 NA NA NA NA
#3 NA NA NA NA
#4 2 2 NA NA
#5 3 3 2 2
#6 4 4 3 3
#7 5 5 4 4
#8 6 6 5 5
#9 7 7 6 6
#10 8 8 7 7
#11 9 9 NA NA
#12 10 10 NA NA
#13 NA NA NA NA
#14 NA NA NA NA
Or as @thelatemail suggested
df1[] <- lapply(df1, function(x) replace(x, Position(Negate(is.na), x), NA))
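If you prefer a named function, the same idea could be wrapped up roughly like this. It is only a sketch: blank_first_obs is a hypothetical helper name, and the extra allNA column is added just to show that a column without any observation is left untouched.

```r
# Hypothetical helper: set the first non-NA element of a vector to NA
blank_first_obs <- function(x) {
  first <- which(!is.na(x))[1]      # NA if the column is entirely NA
  if (!is.na(first)) x[first] <- NA
  x
}

# Example data from the question, plus an all-NA column (an added assumption)
x1 <- x2 <- c(NA, NA, 1:10, NA, NA)
x3 <- x4 <- c(NA, NA, NA, 1:7, NA, NA, NA, NA)
df <- data.frame(x1, x2, x3, x4, allNA = NA_integer_)

out <- df
out[] <- lapply(df, blank_first_obs)  # out[] keeps the data.frame structure
```

Assigning into out[] rather than out is what keeps the result a data frame instead of a plain list.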
Since you would like to do this for all columns, you could use the mutate_all function from dplyr. See http://dplyr.tidyverse.org/ for more information and examples. Note that in recent dplyr versions funs() is deprecated and mutate_all() is superseded by across(), so the call below may emit a deprecation warning.
library(dplyr)
mutate_all(df, funs(if_else(row_number() == min(which(!is.na(.))), NA_integer_, .)))
#> x1 x2 x3 x4
#> 1 NA NA NA NA
#> 2 NA NA NA NA
#> 3 NA NA NA NA
#> 4 2 2 NA NA
#> 5 3 3 2 2
#> 6 4 4 3 3
#> 7 5 5 4 4
#> 8 6 6 5 5
#> 9 7 7 6 6
#> 10 8 8 7 7
#> 11 9 9 NA NA
#> 12 10 10 NA NA
#> 13 NA NA NA NA
#> 14 NA NA NA NA
For example,
dataX = data.frame(a=c(1:5),b=c(2:6),c=c(3:7),d=c(4:8),e=c(5:9),f=c(6:10))
How do I insert a blank column after every 2 columns?
Here is a similar method that uses a trick with matrices and integer selection of columns. cbind first appends a single NA column to the original data.frame. The column indices are then arranged in a two-row matrix (one pair of columns per matrix column), rbind adds the index of the NA column as a third row, and flattening that matrix gives the selection order: two original columns, then the NA column, repeated.
cbind(dataX, NewCol=NA)[c(rbind(matrix(seq_along(dataX), 2), ncol(dataX)+1))]
a b NewCol c d NewCol.1 e f NewCol.2
1 1 2 NA 3 4 NA 5 6 NA
2 2 3 NA 4 5 NA 6 7 NA
3 3 4 NA 5 6 NA 7 8 NA
4 4 5 NA 6 7 NA 8 9 NA
5 5 6 NA 7 8 NA 9 10 NA
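Unpacked step by step, the one-liner above amounts to something like the following (pairs, with_na and ord are just illustrative names). It assumes the number of columns is a multiple of 2, otherwise matrix() will complain about the data length.

```r
dataX <- data.frame(a = 1:5, b = 2:6, c = 3:7, d = 4:8, e = 5:9, f = 6:10)

pairs   <- matrix(seq_along(dataX), nrow = 2)  # columns (1,2), (3,4), (5,6)
with_na <- rbind(pairs, ncol(dataX) + 1)       # put index 7 under every pair
ord     <- c(with_na)                          # 1 2 7 3 4 7 5 6 7

res <- cbind(dataX, NewCol = NA)[ord]          # repeated column 7 gets unique names
names(res)
```

Selecting column 7 three times is what produces NewCol, NewCol.1 and NewCol.2.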
We can use split to split the dataset at unique positions into a list of data.frames, loop through the list, cbind each element with NA and cbind the elements together
res <- do.call(cbind, setNames(lapply(split.default(dataX, (seq_len(ncol(dataX))-1)%/%2),
function(x) cbind(x, NewCol = NA)), NULL))
res
# a b NewCol c d NewCol e f NewCol
#1 1 2 NA 3 4 NA 5 6 NA
#2 2 3 NA 4 5 NA 6 7 NA
#3 3 4 NA 5 6 NA 7 8 NA
#4 4 5 NA 6 7 NA 8 9 NA
#5 5 6 NA 7 8 NA 9 10 NA
names(res) <- make.unique(names(res))
Let us construct an empty data frame with the same number of rows as dataX
empty_df <- data.frame(x1=rep(NA,nrow(dataX)),x2=rep(NA,nrow(dataX)),x3=rep(NA,nrow(dataX)))
dataX<-cbind(dataX,empty_df)
dataX<-dataX[c("a","b","x1","c","d","x2","e","f","x3")]
resulting in:
a b x1 c d x2 e f x3
1 1 2 NA 3 4 NA 5 6 NA
2 2 3 NA 4 5 NA 6 7 NA
3 3 4 NA 5 6 NA 7 8 NA
4 4 5 NA 6 7 NA 8 9 NA
5 5 6 NA 7 8 NA 9 10 NA
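If you do not want to type the column names by hand, the reordering could be generalised along these lines. This is a rough sketch: insert_blank_every and its arguments n and prefix are hypothetical, it assumes the data frame has at least one column, and it does not check whether the generated names clash with existing ones.

```r
# Hypothetical helper: add an NA column after every n columns of d
insert_blank_every <- function(d, n = 2, prefix = "blank") {
  k       <- ncol(d)
  groups  <- (seq_len(k) - 1) %/% n          # 0 0 1 1 2 2 ... for n = 2
  n_blank <- max(groups) + 1
  blanks  <- as.data.frame(matrix(NA, nrow(d), n_blank))
  names(blanks) <- paste0(prefix, seq_len(n_blank))
  # sort by group; within a group the original columns (flag 0) come first
  ord <- order(c(groups, seq_len(n_blank) - 1),
               c(rep(0, k), rep(1, n_blank)))
  cbind(d, blanks)[ord]
}

dataX <- data.frame(a = 1:5, b = 2:6, c = 3:7, d = 4:8, e = 5:9, f = 6:10)
res2 <- insert_blank_every(dataX)          # blank after every 2 columns
res3 <- insert_blank_every(dataX, n = 3)   # blank after every 3 columns
```

order() leaves ties in their original order, so the original columns keep their relative positions.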
I have a table that looks kind of like this:
# item 1 2 3 4 5 6 7 8
#1 1 2 4 6 NA NA NA NA NA
#2 2 1 4 5 6 NA NA NA NA
#3 3 NA NA NA NA NA NA NA NA
#4 4 1 2 6 NA NA NA NA NA
#5 5 2 3 4 6 7 8 NA NA
and I have a list
list1<-11:13
I want to replace the NAs with the elements in the list by row and result should be like this:
# item 1 2 3 4 5 6 7 8
#1 1 2 4 6 11 12 13 NA NA
#2 2 1 4 5 6 11 12 13 NA
#3 3 11 12 13 NA NA NA NA NA
#4 4 1 2 6 11 12 13 NA NA
#5 5 2 3 4 6 7 8 11 12
I tried
for(i in 1:5){
res<-which(is.na(Mydata[i,]))
Mydata[i,res]<-c(list1, rep(NA, 8))
}
It seems to work with the table in the example but gives many warning messages. And when I run it with a really large table it sometimes gives the wrong result. Can anyone tell me what is wrong with my code? Or is there any better way to do this?
We loop through the rows of 'Mydata' using apply with MARGIN=1, create the numeric index for elements that are NA ('i1'), check the minimum length of the NA elements and the list1 ('l1') and replace the elements based on the minimum number of elements.
t(apply(Mydata, 1, function(x) {
i1 <- which(is.na(x))
l1 <- min(length(i1), length(list1))
replace(x, i1[seq_len(l1)], list1[seq_len(l1)])}))  # seq_len() is safe when a row has no NA (l1 == 0)
# item X1 X2 X3 X4 X5 X6 X7 X8
#1 1 2 4 6 11 12 13 NA NA
#2 2 1 4 5 6 11 12 13 NA
#3 3 11 12 13 NA NA NA NA NA
#4 4 1 2 6 11 12 13 NA NA
#5 5 2 3 4 6 7 8 11 12
Or as @RichardSciven mentioned, we can use na.omit with apply by looping over the rows
t(apply(Mydata, 1, function(x) {
w <- na.omit(which(is.na(x))[1:3])  # positions of up to the first three NAs
x[w] <- list1[seq_along(w)]
x }))
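As for the loop in the question: the replacement c(list1, rep(NA, 8)) always has 11 elements, while the number of NAs in a row varies, and that mismatch is the likely source of the warnings. A minimal sketch of a repaired loop, assuming the table is a data frame with the column names shown in the output above (item, X1 to X8):

```r
# Rebuild the example table (column names are an assumption)
Mydata <- as.data.frame(rbind(
  c(1,  2,  4,  6, NA, NA, NA, NA, NA),
  c(2,  1,  4,  5,  6, NA, NA, NA, NA),
  c(3, NA, NA, NA, NA, NA, NA, NA, NA),
  c(4,  1,  2,  6, NA, NA, NA, NA, NA),
  c(5,  2,  3,  4,  6,  7,  8, NA, NA)))
names(Mydata) <- c("item", paste0("X", 1:8))
list1 <- 11:13

for (i in seq_len(nrow(Mydata))) {
  res <- which(is.na(Mydata[i, ]))          # NA columns in row i
  k   <- min(length(res), length(list1))    # how many values actually fit
  if (k > 0) Mydata[i, res[seq_len(k)]] <- list1[seq_len(k)]
}
```

The apply versions above will usually be faster on a large table, but the loop shows where the lengths have to agree.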
You could do it all in one go using matrix indexing:
sel <- pmin(outer( 0:2, max.col(is.na(dat), "first"), `+`), ncol(dat))
dat[unique(cbind(c(col(sel)),c(sel)))] <- 11:13
# item 1 2 3 4 5 6 7 8
#[1,] 1 2 4 6 11 12 13 NA NA
#[2,] 2 1 4 5 6 11 12 13 NA
#[3,] 3 11 12 13 NA NA NA NA NA
#[4,] 4 1 2 6 11 12 13 NA NA
#[5,] 5 2 3 4 6 7 8 11 12
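For readers who have not seen matrix indexing before, here is a small sketch of the two building blocks on toy data (m, ij, x and first_na are illustrative names). A two-column matrix used as an index is read as (row, column) pairs, and max.col() on an NA indicator with ties.method = "first" gives the first NA column in each row. The indicator is converted to numeric with + 0 here just to be explicit.

```r
# 1) Indexing with a two-column matrix of (row, col) pairs
m  <- matrix(0, nrow = 2, ncol = 3)
ij <- cbind(c(1, 2, 2), c(3, 1, 2))
m[ij] <- c(11, 12, 13)        # m[1,3] = 11, m[2,1] = 12, m[2,2] = 13

# 2) First NA column per row
x <- rbind(c(1, NA, NA),
           c(5,  6, NA))
first_na <- max.col(is.na(x) + 0, ties.method = "first")
x[cbind(1:2, first_na)] <- 11  # fill only the first NA of each row
```

Note that for a row without any NA, max.col() would still return column 1, so the one-liner assumes every row has at least one NA.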
I have four vectors (columns)
x y z t
1 1 1 10
1 1 1 15
2 4 1 14
2 3 1 15
2 2 1 17
2 1 2 19
1 4 2 18
1 4 2 NA
2 2 2 45
3 3 2 NA
3 1 3 59
4 3 3 23
1 4 3 45
4 4 4 74
2 1 4 86
How can I calculate mean and median of vector t, for each value of vector y (from 1 to 4) where x=1, z=1, using aggregate function in R?
It was discussed how to do it with 3 parameters (Multiple Aggregation in R) but it's a little unclear how to do it with 4 parameters.
Thank you.
You could try something like this in data.table
library(data.table)
data <- data.table(yourdataframe)
bar <- data[,.N,by=y]  # one row per distinct y, so that no y value is lost
foo <- data[x==1 & z==1,list(mean.t=mean(t,na.rm=TRUE),median.t=median(t,na.rm=TRUE)),by=y]
merge(bar[,list(y)],foo,by="y",all.x=TRUE)
y mean.t median.t
1: 1 12.5 12.5
2: 2 NA NA
3: 3 NA NA
4: 4 NA NA
You probably could do the same in aggregate, but I am not sure you can do it in one easy step.
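For what it is worth, a base R version with aggregate might look roughly like this. It is a sketch, not a tested drop-in: the data frame d is retyped from the question, and stats and out are illustrative names. The formula interface of aggregate drops rows with NA in t by default, and the merge step only serves to keep the y values that have no matching rows.

```r
d <- data.frame(
  x = c(1, 1, 2, 2, 2, 2, 1, 1, 2, 3, 3, 4, 1, 4, 2),
  y = c(1, 1, 4, 3, 2, 1, 4, 4, 2, 3, 1, 3, 4, 4, 1),
  z = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  t = c(10, 15, 14, 15, 17, 19, 18, NA, 45, NA, 59, 23, 45, 74, 86))

# mean and median of t per y, restricted to x == 1 and z == 1
stats <- aggregate(t ~ y, data = subset(d, x == 1 & z == 1),
                   FUN = function(v) c(mean = mean(v), median = median(v)))
stats <- do.call(data.frame, stats)  # flatten the matrix column: t.mean, t.median

# keep every y value, with NA where nothing matched
out <- merge(data.frame(y = sort(unique(d$y))), stats, by = "y", all.x = TRUE)
```

To group by more variables, the formula could be extended, e.g. t ~ y + z.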
An answer to an additional request in the comments...
bar <- data.table(expand.grid(y=unique(data$y),z=unique(data[z %in% c(1,2,3,4),z])))
foo <- data[x==1 & z %in% c(1,2,3,4),list(
mean.t=mean(t,na.rm=TRUE),
median.t=median(t,na.rm=TRUE),
Q25.t=quantile(t,0.25,na.rm=TRUE),
Q75.t=quantile(t,0.75,na.rm=TRUE)
),by=list(y,z)]
merge(bar[,list(y,z)],foo,by=c("y","z"),all.x=T)
y z mean.t median.t Q25.t Q75.t
1: 1 1 12.5 12.5 11.25 13.75
2: 1 2 NA NA NA NA
3: 1 3 NA NA NA NA
4: 1 4 NA NA NA NA
5: 2 1 NA NA NA NA
6: 2 2 NA NA NA NA
7: 2 3 NA NA NA NA
8: 2 4 NA NA NA NA
9: 3 1 NA NA NA NA
10: 3 2 NA NA NA NA
11: 3 3 NA NA NA NA
12: 3 4 NA NA NA NA
13: 4 1 NA NA NA NA
14: 4 2 18.0 18.0 18.00 18.00
15: 4 3 45.0 45.0 45.00 45.00
16: 4 4 NA NA NA NA