Replacing groups of elements in a vector in R

I am trying to replace all the groups of elements in a vector that sum up to zero with NAs.
The size of each group is 3. For instance:
a = c(0,0,0,0,2,3,1,0,2,0,0,0,0,1,2,0,0,0)
should be finally:
c(NA,NA,NA,0,2,3,1,0,2,NA,NA,NA,0,1,2,NA,NA,NA)
So far, I have managed to find the groups whose sum equals zero via:
b = which(tapply(a,rep(1:(length(a)/3),each=3),sum) == 0)
which yields c(1,4,6)
I then calculate the starting indexes of the groups in the vector via: b <- b*3-2.
Probably there is a more elegant way, but this is what I've stitched together so far.
Now I am stuck at "expanding" the vector of start indexes, to generate a sequence of the elements to be replaced. For instance, if vector b now contains c(1,10,16), I will need a sequence c(1,2,3,10,11,12,16,17,18) which are the indexes of the elements to replace by NAs.
If you have any idea for a solution without a for loop, or a simpler or more elegant solution to the whole problem, I would appreciate it. Thank you.
Marius

You can use something like this:
a[as.logical(ave(a, 0:(length(a)-1) %/% 3,
                 FUN = function(x) sum(x) == 0))] <- NA
a
# [1] NA NA NA 0 2 3 1 0 2 NA NA NA 0 1 2 NA NA NA
The 0:(length(a)-1) %/% 3 creates groups of your desired length (in this case, 3) and ave is used to check whether those groups add to 0 or not.
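For the "expanding" step the question got stuck on, here is a minimal sketch that keeps the asker's tapply approach. The names grp, starts and idx are illustrative only; the trick is that rep() repeats each start index three times and the recycled 0:2 offsets turn them into consecutive positions.

```r
a <- c(0,0,0,0,2,3,1,0,2,0,0,0,0,1,2,0,0,0)

# Group label for every element: 0 0 0 1 1 1 ... 5 5 5
grp <- (seq_along(a) - 1) %/% 3

# Start index of every group whose sum is zero: 1, 10, 16
starts <- which(tapply(a, grp, sum) == 0) * 3 - 2

# Expand each start into three consecutive indices (0:2 is recycled)
idx <- rep(starts, each = 3) + 0:2

a[idx] <- NA
a
```

This assumes length(a) is a multiple of 3, as in the question.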

To assign the values to groups, turn your vector into a three-row matrix, so that each column holds one group. You can then calculate the column-wise sums and compare them with 0. The rest is simple.
a <- c(0,0,0,0,2,3,1,0,2,0,0,0,0,1,2,0,0,0)
a <- as.integer(a)
is.na(a) <- rep(colSums(matrix(a, 3L)) == 0L, each = 3L)
a
#[1] NA NA NA 0 2 3 1 0 2 NA NA NA 0 1 2 NA NA NA
Note that I make the comparison with integers to indicate that if your vector is not of integer type, an exact comparison with 0 can fail because of floating-point rounding; in that case you need to consider this FAQ.
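If the values are doubles rather than integers, one possible way to adapt the matrix approach is to compare against a small tolerance instead of testing for exact equality. This is only a sketch: the vector x and the tolerance tol are assumptions chosen for illustration, and a suitable tolerance depends on the scale of your data.

```r
# Hypothetical non-integer input: the first group sums to (almost) zero
x <- c(0.1, 0.2, -0.3, 1, 2, 3)

# Assumed tolerance; adjust to the magnitude of your data
tol <- sqrt(.Machine$double.eps)

# One column per group of 3, then test each column sum against the tolerance
zero_grp <- abs(colSums(matrix(x, nrow = 3L))) < tol

is.na(x) <- rep(zero_grp, each = 3L)
x
```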

Or using gl, ave and all
n <- length(a)
a[ave(!a, gl(n, 3, n), FUN=all)] <- NA
a
#[1] NA NA NA 0 2 3 1 0 2 NA NA NA 0 1 2 NA NA NA
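One thing to be aware of: testing that every element of a group is zero (as with !a and all) is not the same as testing that the group sums to zero once negative values are possible. A small sketch with a made-up vector v shows the difference:

```r
# Hypothetical vector: first group sums to zero but is not all zeros
v <- c(-1, 1, 0, 0, 0, 0)
grp <- gl(2, 3)   # factor 1 1 1 2 2 2

# "Sum is zero" marks both groups
sum_zero <- ave(v, grp, FUN = function(g) sum(g) == 0) == 1

# "All elements are zero" marks only the second group
all_zero <- as.logical(ave(v == 0, grp, FUN = all))

sum_zero
all_zero
```

For non-negative data such as the example in the question, the two tests agree.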

Related

summing across rows, leaving NAs in R

In R, I would like to sum across rows but keep NA's as NA if the whole row is NA. My data contains 0's and I want to count them as such. E.g.:
colA colB colC Total
1 NA 2 3
NA NA NA NA
0 NA NA 0
3 0 NA 3
I used the code below and got 0's for the all-NA rows. If I change na.rm to F, I get NA for every row that contains any NA. I would like to get NA only in the all-NA rows.
Total <- as.data.frame(rowSums(df[,1:3], na.rm = T))
Thanks!
You could simply change the results in a second pass:
dat <- data.frame(colA=c(1,NA,0,3), colB=c(NA,NA,NA,0), colC=c(2,NA,NA,NA))
dat
colA colB colC
1 1 NA 2
2 NA NA NA
3 0 NA NA
4 3 0 NA
res <- rowSums(dat,na.rm=T)
res
[1] 3 0 0 3
res[rowSums(is.na(dat))==3] <- NA
res
[1] 3 NA 0 3
And if you want to save it back in your data:
dat$total <- res
You can do this in one line using a manipulation of NA.
rowSums(df, na.rm=TRUE) * NA^(rowSums(is.na(df)) == length(df))
[1] 3 NA 0 3
Here, the first rowSums gets the sums while removing NAs. This is then multiplied by NA^(rowSums(is.na(df)) == length(df)), which returns NA in all cases except when the exponent is 0 (or FALSE), in which case it returns 1. Here, the exponent is FALSE when at least one element of the row is non-NA.
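To see the pieces of that one-liner separately, here is a short sketch using the dat example data from the earlier answer. The intermediate names all_na, mask and total are illustrative only.

```r
dat <- data.frame(colA = c(1, NA, 0, 3),
                  colB = c(NA, NA, NA, 0),
                  colC = c(2, NA, NA, NA))

# TRUE for rows in which every column is NA
all_na <- rowSums(is.na(dat)) == ncol(dat)

# NA^FALSE is 1, NA^TRUE is NA, so this is a multiplicative mask
mask <- NA^all_na

# Row sums with NAs removed, blanked out where the whole row was NA
total <- rowSums(dat, na.rm = TRUE) * mask
total
```

Using ncol(dat) instead of length(dat) gives the same number for a data frame and makes the intent a little more explicit.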
Use this to get the total and then cbind it to your data frame:
apply(df, 1, function(x) {
  if (sum(is.na(x)) == length(x)) {
    return(NA)
  } else {
    sum(x, na.rm = TRUE)
  }
})
In two steps like the above answer (but shorter):
sums <- rowSums(df, na.rm=TRUE)
allna <- apply(df,1, function(x)all(is.na(x)))
sums[allna] <- NA
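Using dplyr (in one step):
library(dplyr)
t1 <- data.frame(A = c(1, NA, 0, 3),
                 B = c(NA, 5, NA, 0),
                 C = c(2, NA, NA, NA))
# Note: with na.rm = TRUE, a row that is entirely NA would sum to 0, not NA
t1 <- t1 %>% rowwise() %>% mutate(Total = sum(A, B, C, na.rm = TRUE))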

How can I find out the names of columns that satisfy a condition in a data frame

I wish to know (by name) which columns in my data frame satisfy a particular condition. For example, if I was looking for the names of any columns that contained more than 3 NA, how could I proceed?
>frame
m n o p
1 0 NA NA NA
2 0 2 2 2
3 0 NA NA NA
4 0 NA NA 1
5 0 NA NA NA
6 0 1 2 3
> for (i in frame){
na <- is.na(i)
as.numeric(na)
total<-sum(na)
if(total>3){
print (i) }}
[1] NA 2 NA NA NA 1
[2] NA 2 NA NA NA 2
So this actually succeeds in evaluating which columns satisfy the condition, however, it does not display the column name. Perhaps subsetting the columns which interest me would be another way to do it, but I'm not sure how to solve it that way either. Plus I'd prefer to know if there's a way to just get the names directly.
I'll appreciate any input.
We can use colSums on a logical matrix (is.na(frame)), check whether it is greater than 3 to get a logical vector and then subset the names of 'frame' based on that.
names(frame)[colSums(is.na(frame))>3]
#[1] "n" "o"
If we are using dplyr, one way is
library(dplyr)
frame %>%
summarise_each(funs(sum(is.na(.))>3)) %>%
unlist() %>%
names(.)[.]
#[1] "n" "o"
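If you would rather not depend on dplyr, the same check can be sketched per column in base R with vapply. The data frame below is rebuilt from the table in the question, and the name hits is illustrative only.

```r
# Data frame reconstructed from the question's printed table
frame <- data.frame(m = rep(0, 6),
                    n = c(NA, 2, NA, NA, NA, 1),
                    o = c(NA, 2, NA, NA, NA, 2),
                    p = c(NA, 2, NA, 1, NA, 3))

# One TRUE/FALSE per column: does it contain more than 3 NAs?
many_na <- vapply(frame, function(col) sum(is.na(col)) > 3, logical(1))

# Names of the columns that satisfy the condition
hits <- names(which(many_na))
hits
```

The anonymous function can be swapped for any other per-column condition, as long as it returns a single TRUE or FALSE.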

Conditional calculations across rows in R

First, I'm brand new to R and am making the switch from SAS. I have a dataset that is 1000 rows by 24 columns, where the columns are different treatments. I want to count the number of times an observation meets a criterion across the rows of my dataset, shown below.
Gene A B C D
1 AARS_3 NA NA 4.168365 NA
2 AASDHPPT_21936 NA NA NA -3.221287
3 AATF_26432 NA NA NA NA
4 ABCC2_22 4.501518 3.17992 NA NA
5 ABCC2_26620 NA NA NA NA
I was trying to create column vectors that counted
1) Number of NAs
2) Number of columns <0
3) Number of columns >0
I would then use cbind to add these to my large dataset
I solved the first one with:
NA.Count <- (apply(b01,MARGIN=1,FUN=function(x) length(x[is.na(x)])))
I tried to modify this to keep only the non-NA values (!is.na) and then count the number of times the value was less than zero with this:
lt0 <- (apply(b01,MARGIN=1,FUN=function(x) ifelse(x[!is.na(x)],count(x[x<0]))))
which didn't work at all.
I tried a dozen ways to get dplyr mutate to work with this and did not succeed.
What I want are the last two columns below; and if you had a cleaner version of the NA.Count I did, that would also be greatly appreciated.
Gene A B C D NA.Count lt0 gt0
1 AARS_3 NA NA 4.168365 NA 3 0 1
2 AASDHPPT_21936 NA NA NA -3.221287 3 1 0
3 AATF_26432 NA NA NA NA 4 0 0
4 ABCC2_22 4.501518 3.17992 NA NA 2 0 2
5 ABCC2_26620 NA NA NA NA 4 0 0
Here is one way to do it taking advantage of the fact that TRUE equals 1 in R.
# test data frame
lil_df <- data.frame(Gene = c("AAR3", "ABCDE"),
                     A = c(NA, 3),
                     B = c(2, NA),
                     C = c(-1, -2),
                     D = c(NA, NA))
# is.na
NA.count <- rowSums(is.na(lil_df[,-1]))
# less than zero
lt0 <- rowSums(lil_df[,-1]<0, na.rm = TRUE)
# more than zero
mt0 <- rowSums(lil_df[,-1]>0, na.rm = TRUE)
# cbind to data frame
larger_df <- cbind(lil_df, NA.count, lt0, mt0 )
larger_df
Gene A B C D NA.count lt0 mt0
1 AAR3 NA 2 -1 NA 2 1 1
2 ABCDE 3 NA -2 NA 2 1 1
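Applied to the rows shown in the question, the same idea could look like the sketch below. Here b01 is rebuilt from the five printed rows and only the four columns A to D (the real data has 24 treatment columns), and the column names follow the desired output in the question.

```r
# Reconstruction of the five rows printed in the question
b01 <- data.frame(
  Gene = c("AARS_3", "AASDHPPT_21936", "AATF_26432", "ABCC2_22", "ABCC2_26620"),
  A = c(NA, NA, NA, 4.501518, NA),
  B = c(NA, NA, NA, 3.17992, NA),
  C = c(4.168365, NA, NA, NA, NA),
  D = c(NA, -3.221287, NA, NA, NA)
)

# Work on the numeric columns only (drop Gene before adding new columns)
vals <- b01[, -1]

b01$NA.Count <- rowSums(is.na(vals))            # number of NAs per row
b01$lt0 <- rowSums(vals < 0, na.rm = TRUE)      # values below zero per row
b01$gt0 <- rowSums(vals > 0, na.rm = TRUE)      # values above zero per row
b01
```

Taking vals before adding the count columns keeps the new columns from being counted themselves.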

Fill in-between entries in an ID vector

Looking for a quick-and-easy solution to a problem which I have only been able to solve inelegantly, by looping. I have an ID vector which looks something like this:
id<-c(NA,NA,1,1,1,NA,1,NA,2,2,2,NA,3,NA,3,3,3)
The NA's that fall in-between a sequence of a single number (id[6], id[14]) need to be replaced by that number. However, the NA's that don't meet this condition (those between sequences of two different numbers) need to be left alone (i.e., id[1],id[2],id[8],id[12]). The target vector is therefore:
id.target<-c(NA,NA,1,1,1,1,1,NA,2,2,2,NA,3,3,3,3,3)
This is not difficult to do by looping through each value, but I am looking to do this to many very long vectors, and was hoping for a neater solution. Thanks for any suggestions.
This seems to work. The idea is to use zoo::na.locf to fill all the NAs first and then put NA back wherever the gap lies between two different numbers:
id.target <- zoo::na.locf(id, na.rm = FALSE)
id.target[(c(diff(id.target), 1L) > 0L) & is.na(id)] <- NA
id.target
## [1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3
Here is a base R option
d1 <- do.call(rbind, lapply(split(seq_along(id), id), function(x) {
  i1 <- min(x):max(x)
  data.frame(val = unique(id[x]), i1)}))
id[seq_along(id) %in% d1$i1 ] <- d1$val
id
#[1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3
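Another way to think about it, sketched here without zoo: fill forward, fill backward, and only keep a filled value where both directions agree. The helper ffill is a hypothetical name written for this example, not a function from any package.

```r
id <- c(NA,NA,1,1,1,NA,1,NA,2,2,2,NA,3,NA,3,3,3)

# Hypothetical helper: carry the last non-NA value forward
ffill <- function(v) {
  # index of the most recent non-NA element, 0 if there is none yet
  last <- cummax(seq_along(v) * (!is.na(v)))
  c(NA, v)[last + 1L]
}

fwd <- ffill(id)             # last value seen to the left
bwd <- rev(ffill(rev(id)))   # next value seen to the right

# Fill an NA only when both neighbours carry the same id
fill <- which(is.na(id) & fwd == bwd)
out <- id
out[fill] <- fwd[fill]
out
```

Because which() drops NA comparisons, leading and trailing NAs are left untouched.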

R Convert NA's only after the first non-zero value

I have a large data set which consists of a column of IDs followed by a monthly time series for each ID. There are frequent missing values in this set, but what I would like to do is replace all NAs after the first non-zero value with a zero while leaving all the NAs before the first non-zero value as NAs.
eg.
[NA NA NA 1 2 3 NA 4 5 NA] would be changed to [NA NA NA 1 2 3 0 4 5 0]
Any help or advice you guys could offer would be much appreciated!
This is easy to do using match() and numeric indices:
1) use match() to find the first occurrence of a non-NA value
2) use which() to convert the logical vector from is.na() to a numeric index
3) use that information to find the correct positions in x
Hence:
x <- c(NA,NA,NA,1,2,3,NA,NA,4,5,NA)
isna <- is.na(x)
nonna <- match(FALSE,isna)
id <- which(isna)
x[id[id>nonna]] <- 0
gives:
> x
[1] NA NA NA 1 2 3 0 0 4 5 0
Here's another method. Convert all NAs to zeros first, then convert the leading zeros back to NA.
> x <- c(NA,NA,NA,1,2,3,NA,NA,4,5,NA)
> x[which(is.na(x))] <- 0
### indices from 1 up to the element just before the first element >0
> x[seq_len(min(which(x>0)) - 1)] <- NA
> x
[1] NA NA NA 1 2 3 0 0 4 5 0
Alternatively, starting again from the original x:
### from the first element >0 to the end of the vector
> endOfVec <- min(which(x>0)):length(x)
> x[endOfVec][is.na(x[endOfVec])] <- 0
> x
[1] NA NA NA 1 2 3 0 0 4 5 0
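A compact variant of the same idea, sketched with cumsum: a running count of non-NA values is zero only in the leading stretch of NAs, so any NA where the count is already positive comes after the first observed value. The name seen is illustrative only.

```r
x <- c(NA,NA,NA,1,2,3,NA,NA,4,5,NA)

# TRUE from the first non-NA value onwards
seen <- cumsum(!is.na(x)) > 0

# Replace only the NAs that come after the first observed value
x[is.na(x) & seen] <- 0
x
```

Unlike the x>0 test, this does not depend on the sign of the first observed value, so it also works when the series starts with a zero or a negative number.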
