R Convert NA's only after the first non-zero value - r

I have a large data set which consists of a column of IDs followed by a monthly time series for each ID. There are frequent missing values in this set, but what I would like to do is replace all NAs after the first non-zero value with a zero while leaving all the NAs before the first non-zero value as NAs.
e.g.
[NA NA NA 1 2 3 NA 4 5 NA] would be changed to [NA NA NA 1 2 3 0 4 5 0]
Any help or advice you guys could offer would be much appreciated!

Easy to do using match() and numeric indices:
use match() to find the first occurrence of a non-NA value
use which() to convert the logical vector from is.na() to a numeric index
use that information to find the correct positions in x
Hence:
x <- c(NA,NA,NA,1,2,3,NA,NA,4,5,NA)
isna <- is.na(x)
nonna <- match(FALSE,isna)
id <- which(isna)
x[id[id>nonna]] <- 0
gives:
> x
[1] NA NA NA 1 2 3 0 0 4 5 0
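The answer above anchors on the first non-NA value, while the question asks about the first non-zero value. A minimal sketch that anchors on the first value that is both non-NA and non-zero is shown here; the function name fill_na_after_first is hypothetical, and leaving a vector with no such value untouched is an assumption about the desired behaviour.

```r
# Hypothetical helper (not from the original answer):
# replace NAs that come after the first non-NA, non-zero value with 0.
fill_na_after_first <- function(x) {
  # position of the first value that is neither NA nor 0 (NA if none exists)
  first <- which(!is.na(x) & x != 0)[1]
  if (is.na(first)) return(x)  # assumption: nothing to anchor on, so leave x as is
  x[seq_along(x) > first & is.na(x)] <- 0
  x
}

fill_na_after_first(c(NA, NA, NA, 1, 2, 3, NA, 4, 5, NA))
# [1] NA NA NA  1  2  3  0  4  5  0
```

Wrapping the logic in a function also makes it easy to apply to each series in turn.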

Here's another method. Convert all NAs to zeros first, then convert the leading zeros back to NA.
> x <- c(NA,NA,NA,1,2,3,NA,NA,4,5,NA)
> x[which(is.na(x))] <- 0
### index from 1 to the element just before the first element > 0
### (seq_len avoids relying on 1:n-1 being parsed as (1:n)-1)
> x[seq_len(min(which(x > 0)) - 1)] <- NA
> x
[1] NA NA NA 1 2 3 0 0 4 5 0
Alternatively, starting again from the original x:
### from the first element > 0 to the end of the vector
> endOfVec <- min(which(x>0)):length(x)
> x[endOfVec][is.na(x[endOfVec])] <- 0
[1] NA NA NA 1 2 3 0 0 4 5 0
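Since the question describes an ID column followed by one time series per row, here is a rough sketch of applying the same idea row by row. The data frame, its column names, and the helper fill_row are made up for illustration; the helper uses cumsum() to flag every position at or after the first non-zero value.

```r
# Hypothetical data: one ID column followed by monthly values
df <- data.frame(id = c("a", "b"),
                 m1 = c(NA, 2),
                 m2 = c(1, NA),
                 m3 = c(NA, NA))

# Hypothetical helper: cumsum() becomes > 0 from the first non-NA, non-zero value on
fill_row <- function(x) {
  seen <- cumsum(!is.na(x) & x != 0) > 0
  x[is.na(x) & seen] <- 0
  x
}

# apply() returns one column per row, so transpose before writing back
df[-1] <- t(apply(df[-1], 1, fill_row))
df
#   id m1 m2 m3
# 1  a NA  1  0
# 2  b  2  0  0
```

For a very large data set a matrix-based or data.table approach may be faster, but the row-wise version keeps the logic easy to follow.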

Related

Replace values within a range in a data frame in R

I have ranked the rows of a data frame based on the values in each column, with ranks from 1 to 10.
I have code that replaces values with NA or 1. But I can't figure out how to replace a range of numbers, e.g. 3-6, with 1 and then replace the rest (1-2 and 7-10) with NA.
lag.rank <- as.matrix(lag.rank)
lag.rank[lag.rank > n] <- NA
lag.rank[lag.rank <= n] <- 1
At the moment it only replaces numbers above or under n. Any suggestions? I figure it should be fairly simple?
Is this what you are trying to accomplish?
> x <- sample(1:10,20, TRUE)
> x
[1] 1 2 8 2 6 4 9 1 4 8 6 1 2 5 8 6 9 4 7 6
> x <- ifelse(x %in% c(3:6), 1, NA)
> x
[1] NA NA NA NA 1 1 NA NA 1 NA 1 NA NA 1 NA 1 NA 1 NA 1
If your data are numeric rather than integer, you can use between() from the dplyr package:
library(dplyr)
x <- ifelse(between(x, 3, 6), 1, NA)
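The same idea can be sketched in base R directly on a matrix such as lag.rank from the question. The values below are made up for illustration. The one point to watch is that the range test is computed once, before any assignment, because after writing 1 into the matrix a second comparison would no longer see the original ranks.

```r
# Hypothetical ranks, standing in for the lag.rank matrix from the question
lag.rank <- matrix(c(1, 4, 7, 3, 6, 10), nrow = 2)

# Compute the mask first, then assign using the stored mask
in_range <- lag.rank >= 3 & lag.rank <= 6
lag.rank[in_range]  <- 1
lag.rank[!in_range] <- NA
lag.rank
#      [,1] [,2] [,3]
# [1,]   NA   NA    1
# [2,]    1    1   NA
```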

summing across rows, leaving NAs in R

In R, I would like to sum across rows but keep NA's as NA if the whole row is NA. My data contains 0's and I want to count them as such. E.g.:
colA colB colC Total
1 NA 2 3
NA NA NA NA
0 NA NA 0
3 0 NA 3
I used the code below and got 0's for the all-NA rows. If I change na.rm to F, I get NA for every row that contains any NA. I would like to get NA only in the all-NA rows.
Total <- as.data.frame(rowSums(df[,1:3], na.rm = T))
Thanks!
You could simply change the results in a second pass:
dat <- data.frame(colA=c(1,NA,0,3), colB=c(NA,NA,NA,0), colC=c(2,NA,NA,NA))
dat
colA colB colC
1 1 NA 2
2 NA NA NA
3 0 NA NA
4 3 0 NA
res <- rowSums(dat,na.rm=T)
res
[1] 3 0 0 3
res[rowSums(is.na(dat))==3] <- NA
res
[1] 3 NA 0 3
And if you want to save it back in your data:
df$total <- res
You can do this in one line using a manipulation of NA.
rowSums(df, na.rm=TRUE) * NA^(rowSums(is.na(df)) == length(df))
[1] 3 NA 0 3
Here, the first rowSums gets the sums while removing NAs. This is then multiplied by NA^(rowSums(is.na(df)) == length(df)). Because NA^0 is 1 and NA^1 is NA, this factor is NA when the exponent is TRUE (every element of the row is NA) and 1 when it is FALSE (at least one element of the row is non-NA). Note that length(df) is the number of columns of the data frame.
Use this to get the total and then cbind it with your data frame.
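To make the NA^ trick less mysterious, here is a small sketch that builds the factor step by step, using the example data from the earlier answer. The intermediate names all_na and mask are introduced only for illustration.

```r
dat <- data.frame(colA = c(1, NA, 0, 3),
                  colB = c(NA, NA, NA, 0),
                  colC = c(2, NA, NA, NA))

NA^0   # 1
NA^1   # NA

# TRUE where every column of the row is NA
all_na <- rowSums(is.na(dat)) == ncol(dat)

# TRUE is treated as 1 and FALSE as 0, so the mask is NA for all-NA rows and 1 otherwise
mask <- NA^all_na

total <- rowSums(dat, na.rm = TRUE) * mask
unname(total)
# [1]  3 NA  0  3
```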
apply(df, 1, function(x) {
  if (sum(is.na(x)) == length(x)) {
    return(NA)
  } else {
    sum(x, na.rm = TRUE)
  }
})
In two steps like the above answer (but shorter):
sums <- rowSums(df, na.rm=TRUE)
allna <- apply(df,1, function(x)all(is.na(x)))
sums[allna] <- NA
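Using dplyr (in one step):
library(dplyr)
t1 <- data.frame(A = c(1, NA, 0, 3),
B = c(NA, 5, NA, 0),
C = c(2, NA, NA, NA))
# note: sum(..., na.rm = TRUE) returns 0, not NA, for a row that is entirely NA
t1 <- t1 %>% rowwise() %>% mutate(Total = sum(A, B, C, na.rm = TRUE))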

How can I find out the names of columns that satisfy a condition in a data frame

I wish to know (by name) which columns in my data frame satisfy a particular condition. For example, if I was looking for the names of any columns that contained more than 3 NA, how could I proceed?
>frame
m n o p
1 0 NA NA NA
2 0 2 2 2
3 0 NA NA NA
4 0 NA NA 1
5 0 NA NA NA
6 0 1 2 3
> for (i in frame){
na <- is.na(i)
as.numeric(na)
total<-sum(na)
if(total>3){
print (i) }}
[1] NA 2 NA NA NA 1
[1] NA 2 NA NA NA 2
So this actually succeeds in evaluating which columns satisfy the condition, however, it does not display the column name. Perhaps subsetting the columns which interest me would be another way to do it, but I'm not sure how to solve it that way either. Plus I'd prefer to know if there's a way to just get the names directly.
I'll appreciate any input.
We can use colSums on a logical matrix (is.na(frame)), check whether it is greater than 3 to get a logical vector and then subset the names of 'frame' based on that.
names(frame)[colSums(is.na(frame))>3]
#[1] "n" "o"
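As a worked sketch of the colSums approach, the example data frame from the question can be rebuilt and the intermediate NA counts inspected before subsetting the names. The name na_counts is introduced only for illustration.

```r
# The example data from the question (m is recycled to six rows)
frame <- data.frame(m = 0,
                    n = c(NA, 2, NA, NA, NA, 1),
                    o = c(NA, 2, NA, NA, NA, 2),
                    p = c(NA, 2, NA, 1, NA, 3))

# is.na(frame) is a logical matrix; colSums counts the TRUEs per column
na_counts <- colSums(is.na(frame))
na_counts
# m n o p
# 0 4 4 3

# Keep only the names whose count exceeds the threshold
names(na_counts)[na_counts > 3]
# [1] "n" "o"
```

Column p has exactly three NAs, so it is excluded by the strict "more than 3" condition.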
If we are using dplyr, one way is
library(dplyr)
frame %>%
summarise(across(everything(), ~ sum(is.na(.x)) > 3)) %>% # summarise_each()/funs() are deprecated in current dplyr
unlist() %>%
names(.)[.]
#[1] "n" "o"

r - replacing groups of elements in vector

I am trying to replace all the groups of elements in a vector that sum up to zero with NAs.
The size of each group is 3. For instance:
a = c(0,0,0,0,2,3,1,0,2,0,0,0,0,1,2,0,0,0)
should be finally:
c(NA,NA,NA,0,2,3,1,0,2,NA,NA,NA,0,1,2,NA,NA,NA)
Until now, I have managed to find the groups having the sum equal to zero via:
b = which(tapply(a,rep(1:(length(a)/3),each=3),sum) == 0)
which yields c(1,4,6)
I then calculate the starting indexes of the groups in the vector via: b <- b*3-2.
Probably there is a more elegant way, but this is what I've stitched together so far.
Now I am stuck at "expanding" the vector of start indexes, to generate a sequence of the elements to be replaced. For instance, if vector b now contains c(1,10,16), I will need a sequence c(1,2,3,10,11,12,16,17,18) which are the indexes of the elements to replace by NAs.
If you have any idea of a solution without a for loop or even a more simple/elegant solution for the whole problem, I would appreciate it. Thank you.
Marius
You can use something like this:
a[as.logical(ave(a, 0:(length(a)-1) %/% 3,
FUN = function(x) sum(x) == 0))] <- NA
a
# [1] NA NA NA 0 2 3 1 0 2 NA NA NA 0 1 2 NA NA NA
The 0:(length(a)-1) %/% 3 creates groups of your desired length (in this case, 3) and ave is used to check whether those groups add to 0 or not.
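For readers following the approach from the question, the sketch below prints the grouping index and then shows one way to expand the group start positions into the full index sequence, which is the step the question got stuck on. The names grp, zero_groups, starts and idx are introduced only for illustration.

```r
a <- c(0, 0, 0, 0, 2, 3, 1, 0, 2, 0, 0, 0, 0, 1, 2, 0, 0, 0)

# Integer division turns positions 0..17 into group labels 0..5
grp <- 0:(length(a) - 1) %/% 3
grp
# [1] 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5

# Groups whose sum is zero, then their start positions
zero_groups <- unname(which(tapply(a, grp, sum) == 0))  # 1 4 6
starts <- zero_groups * 3 - 2                           # 1 10 16

# Repeat each start three times and add the recycled offsets 0, 1, 2
idx <- rep(starts, each = 3) + 0:2
idx
# [1]  1  2  3 10 11 12 16 17 18

a[idx] <- NA
```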
To assign the values to groups, turn your vector into a three-row matrix, so that each column holds one group. You can then calculate the column-wise sums and compare them with 0. The rest is simple.
a <- c(0,0,0,0,2,3,1,0,2,0,0,0,0,1,2,0,0,0)
a <- as.integer(a)
is.na(a) <- rep(colSums(matrix(a, 3L)) == 0L, each = 3L)
a
#[1] NA NA NA 0 2 3 1 0 2 NA NA NA 0 1 2 NA NA NA
Note that I make the comparison with integers to indicate that if your vector is not an integer, you need to consider this FAQ.
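Assuming the concern in the note above is floating-point rounding, a sum of non-integer values that is mathematically zero may not compare equal to 0 exactly. A hedged sketch of a tolerance-based test follows; the tolerance sqrt(.Machine$double.eps) is one common choice, not something prescribed by the original answer.

```r
# Values that sum to zero on paper but not necessarily in floating point
x <- c(0.1, 0.2, -0.3)

tol <- sqrt(.Machine$double.eps)  # assumption: a conventional tolerance
abs(sum(x)) < tol                 # TRUE

# The same test applied group-wise, with one group per matrix column
a <- c(0.1, 0.2, -0.3, 1, 2, 3)
is_zero_group <- abs(colSums(matrix(a, 3L))) < tol
is.na(a) <- rep(is_zero_group, each = 3L)
```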
Or using gl, ave and all
n <- length(a)
a[ave(!a, gl(n, 3, n), FUN=all)] <- NA
a
#[1] NA NA NA 0 2 3 1 0 2 NA NA NA 0 1 2 NA NA NA

R - Comparing values in a column and creating a new column with the results of this comparison. Is there a better way than looping?

I'm a beginner at R. Although I have read a lot in manuals and here on this board, I have to ask my first question. It's a little bit like the question here, but not quite the same, and I don't understand the explanation there. I have a data frame with hundreds of thousands of rows and 30 columns. But for my question I created a simpler data frame that you can use:
a <- sample(c(1,3,5,9), 20, replace = TRUE)
b <- sample(c(1,NA), 20, replace = TRUE)
df <- data.frame(a,b)
Now I want to compare the values of the last column (here column b): for each row, I want to check whether its value is the same as the value in the next row. If it is the same, I want to write a 0 in a new column in the same row; otherwise the new column should contain a 1.
Here you can see my code, that's not working, because the rows of the new column only contain 0:
m<-c()
for (i in seq(along=df[,1])){
ifelse(df$b[i] == df$b[i+1],m <- 0, m <- 1)
df$mov <- m
}
The result I want looks like the example below. What's the mistake? And is there a better way than creating loops? Looping could be very slow for my big data set.
a b mov
1 9 NA 0
2 1 NA 1
3 1 1 1
4 5 NA 0
5 1 NA 0
6 3 NA 0
7 3 NA 1
8 5 1 0
9 1 1 0
10 3 1 0
11 1 1 0
12 9 1 0
13 1 1 1
14 5 NA 0
15 9 NA 0
16 9 NA 0
17 9 NA 0
18 5 NA 0
19 3 NA 0
20 1 NA 0
Thank you for your help!
There are a couple things to consider in your example.
First, to avoid a loop, you can create a copy of the vector that is shifted by one position. (There are about 20 ways to do this.) Then when you test vector B vs C it will do element-by-element comparison of each position vs its neighbor.
Second, equality comparisons don't work with NA: they always return NA. So NA == NA is not TRUE; it is NA. Again, there are about 20 ways to get around this, but here I have just replaced all the NAs in the temporary vector with a placeholder that will work for the tests of equality.
Finally, you have to decide what you want to do with the last value (which doesn't have a neighbor). Here I have put 1, which is your assignment for "doesn't match its neighbor".
So, depending on the range of values possible in b, you could do
c = df$b
z = length(c)
c[is.na(c)] = 'x' # replace NA with value that will allow equality test
df$mov = c(1 * !(c[1:(z-1)] == c[2:z]), 1) # add 1 to the end for the last value
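A variant of the shifted-vector idea that avoids the placeholder, and so keeps b numeric, is sketched below. The vector b is a small made-up example, and the names cur, nxt and same are introduced only for illustration. Two neighbours count as equal when both are NA, or when both are non-NA and equal.

```r
# Hypothetical column values
b <- c(NA, NA, 1, NA, 1, 1)

cur <- b[-length(b)]  # every element except the last
nxt <- b[-1]          # every element except the first

# Never NA: the is.na() checks short-circuit the NA from cur == nxt
same <- (is.na(cur) & is.na(nxt)) |
        (!is.na(cur) & !is.na(nxt) & cur == nxt)

# 0 where a value matches its successor, 1 otherwise; the last row gets 1
mov <- c(as.integer(!same), 1L)
mov
# [1] 0 1 1 1 0 1
```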
You could do something like this to mark the ones which match
df$bnext <- c(tail(df$b,-1),NA)
df$bnextsame <- ifelse(df$bnext == df$b | (is.na(df$b) & is.na(df$bnext)),0,1)
The result still contains NAs wherever exactly one of the two compared values is NA, because a comparison with NA returns NA rather than TRUE/FALSE. Those rows do not match their neighbour, so you could add a df[is.na(df$bnextsame),"bnextsame"] <- 1 to fix that.
You can use a "rolling equality test" with zoo's rollapply. Also, identical is preferable to == here, because it treats two NAs as equal.
#identical(NA, NA)
#[1] TRUE
#NA == NA
#[1] NA
library(zoo)
df$mov <- c(rollapply(df$b, width = 2,
FUN = function(x) as.numeric(!identical(x[1], x[2]))), "no_comparison")
#`!` because you want `0` as `TRUE` ;
#I added a "no_comparison" to last value as it is not compared with any one
df
# a b mov
#1 5 1 0
#2 1 1 0
#3 9 1 1
#4 5 NA 1
#5 9 1 1
#.....
#19 1 NA 0
#20 1 NA no_comparison
