Reducing Vectors to common non-NA values in R - r

I'm pretty new to R so the answer might be obvious, but so far I have only found answers to similar problems that don't match, or which I can't translate to mine.
The Requirement:
I have two vectors of the same length which contain numeric values as well as NA-values which might look like:
[1] 12 8 11 9 NA NA NA
[1] NA 7 NA 10 NA 11 9
What I need now is two vectors that only contain those values that are not NA in both original vectors, so in this case the result should look like this:
[1] 8 9
[1] 7 10
I was thinking about simply going through the vectors in a loop, but the dataset is quite large so I would appreciate a faster solution to that... I hope someone can help me on that...

You are looking for complete.cases, but you should put your vectors in a data.frame first.
dat <- data.frame(x=c(12 ,8, 11, 9, NA, NA, NA),
y=c(NA ,7, NA, 10, NA, 11, 9))
dat[complete.cases(dat),]
x y
2 8 7
4 9 10
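If a data.frame is not wanted, complete.cases also accepts several vectors directly, so the mask can be computed once and applied to each vector. A minimal sketch using the values from the question (the names ok, x_common and y_common are only illustrative):

```r
x <- c(12, 8, 11, 9, NA, NA, NA)
y <- c(NA, 7, NA, 10, NA, 11, 9)

# TRUE only at positions where neither vector is NA
ok <- complete.cases(x, y)

# Apply the same mask to both vectors so they stay aligned
x_common <- x[ok]
y_common <- y[ok]

print(x_common)
print(y_common)
```

Both results keep the original order, so element i of one vector still pairs with element i of the other.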

Try this:
#dummy vector
a <- c(12,8,11,9,NA,NA,NA)
b <- c(NA,7,NA,10,NA,11,9)
#result
a[!is.na(a) & !is.na(b)]
b[!is.na(a) & !is.na(b)]

Something plus NA in R is generally NA. So, using that piece of information, you can simply do:
cbind(a, b)[!is.na(a + b), ]
# a b
# [1,] 8 7
# [2,] 9 10
More generally, you could write a function like the following to easily accept any number of vectors:
myFun <- function(...) {
myList <- list(...)
Names <- sapply(substitute(list(...)), deparse)[-1]
out <- do.call(cbind, myList)[!is.na(Reduce("+", myList)), ]
colnames(out) <- Names
out
}
With that function, the usage would be:
myFun(a, b)
# a b
# [1,] 8 7
# [2,] 9 10
In my timings, this is by far the fastest option here, but that's only important if you are able to detect differences down to the microseconds or if your vector lengths are in the millions, so I won't bother posting benchmarks.
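One caveat with the sum trick, sketched here with made-up values: arithmetic can produce NaN from inputs that are not NA (for example Inf + -Inf), and is.na() is also TRUE for NaN, so such positions would be dropped as well. complete.cases() checks each vector separately and does not have this side effect.

```r
p <- c(Inf, 1, NA)
q <- c(-Inf, 2, 3)

# Inf + (-Inf) is NaN, which is.na() also reports as missing
by_sum <- !is.na(p + q)

# complete.cases() only looks at missingness within each vector
by_cases <- complete.cases(p, q)

print(by_sum)
print(by_cases)
```

For ordinary finite data the two masks agree, so this only matters if infinite values can occur.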

Related

rbind.fill but in base R

I am trying to find an efficient (fast to run and simple to code) way to do what the rbind.fill function does, but in base R. From my searching, there seem to be plenty of library functions such as smartbind, bind_rows, and rbind on data.table, though, as stated before, I need a solution in base R. I found this:
df3 <- rbind(df1, df2[, names(df1)])
It comes from an answer to another question, but it drops the extra columns, whereas I want them kept and filled with NA instead.
EDIT It would also be nice if this method worked on an empty data.frame and a populated one too, essentially just returning the populated one. (This is for the sake of simplicity; if it's not possible, it's not hard to just replace the variable with the new data.frame when it's empty.)
EDIT2 I would also like it to bind by column names for the columns which are labeled the same. Additionally, the first data frame can be both bigger and smaller than the second one, and both may have columns the other does not have.
EDIT3 As suggested by a comment, here is an example input and output I would like (I just made up the numbers they don't really matter).
#inputs
a <- data.frame(aaa=c(1, 1, 2), bbb=c(2, 3, 3), ccc=c(1, 3, 4))
b <- data.frame(aaa=c(8, 5, 4), bbb=c(1, 1, 4), ddd=c(9, 9, 9), eee=c(1, 2, 4))
#desired output
aaa bbb ccc ddd eee
1 2 1 NA NA
1 3 3 NA NA
2 3 4 NA NA
8 1 NA 9 1
5 1 NA 9 2
4 4 NA 9 4
While I've been using R for a few weeks, it's still relatively new to me so I haven't gotten the mechanics down enough yet to actually make a solution, though I've been thinking about using intersect somehow with names(a) and names(b) and trying to bind only those columns first, and then adding the other ones in somehow, but I'm not really sure where to go from here / how to actually implement it in an 'R' way...
I don't know how efficient it may be, but one simple way to code this would be to add the missing columns to each data frame and then rbind together.
rbindx <- function(..., dfs=list(...)) {
ns <- unique(unlist(sapply(dfs, names)))
do.call(rbind, lapply(dfs, function(x) {
for(n in ns[! ns %in% names(x)]) {x[[n]] <- NA}; x }))
}
a <- data.frame(aaa=c(1, 1, 2), bbb=c(2, 3, 3), ccc=c(1, 3, 4))
b <- data.frame(aaa=c(8, 5, 4), bbb=c(1, 1, 4), ddd=c(9, 9, 9), eee=c(1, 2, 4))
rbindx(a, b)
# aaa bbb ccc ddd eee
# 1 1 2 1 NA NA
# 2 1 3 3 NA NA
# 3 2 3 4 NA NA
# 4 8 1 NA 9 1
# 5 5 1 NA 9 2
# 6 4 4 NA 9 4
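The question also asks for an empty data.frame to be handled. A possible variant of the same idea, as a sketch only: it skips inputs with zero rows before padding, which is an assumption about the desired behavior. The name rbind_fill_base is hypothetical.

```r
rbind_fill_base <- function(...) {
  dfs <- list(...)
  # Assumption: data frames without rows contribute nothing and are skipped
  dfs <- Filter(function(d) nrow(d) > 0, dfs)
  if (length(dfs) == 0) return(data.frame())

  # Union of all column names, in order of first appearance
  all_names <- unique(unlist(lapply(dfs, names)))

  padded <- lapply(dfs, function(d) {
    # Add every column this data frame lacks, filled with NA
    for (n in setdiff(all_names, names(d))) d[[n]] <- NA
    # Put the columns in the same order everywhere
    d[all_names]
  })
  do.call(rbind, padded)
}

a <- data.frame(aaa = c(1, 1, 2), bbb = c(2, 3, 3), ccc = c(1, 3, 4))
b <- data.frame(aaa = c(8, 5, 4), bbb = c(1, 1, 4), ddd = c(9, 9, 9), eee = c(1, 2, 4))

combined <- rbind_fill_base(a, b)
only_a <- rbind_fill_base(data.frame(), a)
print(combined)
```

With an empty first argument the populated data frame comes back with its rows and columns unchanged.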
Just use rbind.fill. If you can't install the plyr package, pull out the parts you need.
rbind.fill seems to have very few internal dependencies: plyr::compact is a one-liner, plyr:::output_template depends on plyr:::allocate_column, but at a glance that looks like it is all base code. So copy those 4 functions (attribute the source and make sure the license is compatible with your use - the current version on CRAN uses the MIT license which is quite permissive, you just need to keep it MIT-licensed), and then you have the real implementation of rbind.fill.
Why take this approach? Because, as Aaron points out, you know it works. It has been in use and debugged for years.

Identify duplicate values and remove them

I have a vector:
vec <- c(2,3,5,5,5,5,6,1,9,4,4,4)
I want to check if a particular value is repeated consecutively and if yes, keep the first two values and assign NA to the rest of the values.
For example, in the above vector, 5 is repeated 4 times, therefore I will keep the first two 5's and make the second two 5's NA.
Similarly, 4 is repeated three times, so I will keep the first two 4's and remove the third one.
In the end my vector should look like:
2,3,5,5,NA,NA,6,1,9,4,4,NA
I did this:
bad.values <- vec - binhf::shift(vec, 1, dir="right")
bad.repeat <- bad.values == 0
vec[bad.repeat] <- NA
[1] 2 3 5 NA NA NA 6 1 9 4 NA NA
I can only get it to keep the first 5 and the first 4 (rather than the first two 5's and the first two 4's).
Any solutions?
Another option with just base R functions:
rl <- rle(vec)
i <- unlist(lapply(rl$lengths, function(l) if (l > 2) c(FALSE,FALSE,rep(TRUE, l - 2)) else rep(FALSE, l)))
vec * NA^i
which gives:
[1] 2 3 5 5 NA NA 6 1 9 4 4 NA
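The same run-length idea can be written a little more compactly with sequence(), which numbers the positions inside each run. This is a sketch; the helper name cap_runs and its keep argument are made up for illustration.

```r
cap_runs <- function(v, keep = 2) {
  # Position of each element within its run of consecutive equal values
  pos_in_run <- sequence(rle(v)$lengths)
  # Blank out everything beyond the first `keep` elements of a run
  v[pos_in_run > keep] <- NA
  v
}

vec <- c(2, 3, 5, 5, 5, 5, 6, 1, 9, 4, 4, 4)
out <- cap_runs(vec)
print(out)

# Equal values that are not adjacent belong to different runs
out2 <- cap_runs(c(5, 3, 5, 5, 5))
print(out2)
```

Because the counting restarts at every run, a value that reappears later in the vector is treated as a new run rather than as a repeat.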
I figured it out. I just had to change the argument to 2 in binhf::shift
vec <- c(2,3,5,5,5,5,6,1,9,4,4,4)
bad.values <- vec - binhf::shift(vec, 2, dir="right")
bad.repeat <- bad.values == 0
vec[bad.repeat] <- NA
[1] 2 3 5 5 NA NA 6 1 9 4 4 NA
I think this might work, if I got your problem right:
vec <- c(2,3,5,5,5,5,6,1,9,4,4,4)
diffs1<-vec-binhf::shift(vec,1,dir="right")
diffs2<-vec-binhf::shift(vec,2,dir="right")
get_zeros<-abs(diffs1)+abs(diffs2)
vec[which(get_zeros==0)]<-NA
I hope this helps!
This question may refer to a problem you encountered in a dataframe, not a vector. In any case, here's a tidyverse solution to both.
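library(dplyr)  # provides tibble(), %>%, group_by(), mutate(), pull()
# Note: group_by(x) groups all equal values, not only consecutive runs.
tibble(x = vec) %>%
group_by(x) %>%
mutate(mycol = ifelse(row_number()>2, NA, x) ) %>%
pull(mycol)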

Removing NAs from column while calculating the length of it

So I've got a column in a data frame with 237 different pulses, and from those I have to take the pulses that are over 100 or less than 45 and see how many of them there are. I know that I can get the length of that with
length(survey$Pulse[survey$Pulse > 100 | survey$Pulse < 45])
However, there are NA values in the column and I have no idea how to exclude them from the length.
If you need more info I'll try to provide it, but the only thing I don't know how to do is remove the NA values from the column.
I know I could use na.rm=TRUE, but I have no idea how to fit it into that line.
One option is to use na.omit, which returns the object with the NA values removed.
For example:
# With na.omit
length(na.omit(c(1:10, NA)))
# [1] 10
# Without na.omit
length(c(1:10, NA))
# [1] 11
In your case use:
length(na.omit(survey$Pulse[survey$Pulse > 100 | survey$Pulse < 45]))
Another way is to wrap which around the logical condition. When there are NA values present, the logical condition is not enough. I'll give an example with fake data.
x <- c(1:3, NA, 4, NA, 5:7, NA, 8:10)
x[x < 4 | x > 7]
#[1] 1 2 3 NA NA NA 8 9 10
x[which(x < 4 | x > 7)]
#[1] 1 2 3 8 9 10
And the length is obviously different.
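Since the question mentions na.rm=TRUE: when only the count is needed, the logical condition can be summed directly, because TRUE counts as 1 and na.rm = TRUE skips the NA comparisons. A small sketch on the same fake data (the name n_match is illustrative):

```r
x <- c(1:3, NA, 4, NA, 5:7, NA, 8:10)

# Each TRUE counts as 1; comparisons involving NA are dropped by na.rm
n_match <- sum(x < 4 | x > 7, na.rm = TRUE)
print(n_match)

# Same count via which(), for comparison
n_which <- length(which(x < 4 | x > 7))
print(n_which)
```

Applied to the question, the same pattern would be the condition on survey$Pulse wrapped in sum(..., na.rm = TRUE).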

Apply function to dataframe with changing argument

I have 2 objects:
A data frame with 3 variables:
v1 <- 1:10
v2 <- 11:20
v3 <- 21:30
df <- data.frame(v1,v2,v3)
A numeric vector with 3 elements:
nv <- c(6,11,28)
I would like to compare the first variable to the first number, the second variable to the second number and so on.
which(df$v1 > nv[1])
which(df$v2 > nv[2])
which(df$v3 > nv[3])
Of course in reality my data frame has a lot more variables so manually typing each variable is not an option.
I encounter these kinds of problems quite frequently. What kind of documentation would I need to read to be fluent in these matters?
One option is to compare against an object of the same size. We can repeat each element of 'nv' once per row of 'df' (rep(nv, each=nrow(df))) and compare the result with df, or index 'nv' with the col function, which produces the same values as rep.
which(df > nv[col(df)], arr.ind=TRUE)
If you need a logical matrix that holds the comparison of each column with the corresponding element of 'nv':
sweep(df, 2, nv, FUN='>')
You could also use mapply:
mapply(FUN=function(x, y)which(x > y), x=df, y=nv)
#$v1
#[1] 7 8 9 10
#
#$v2
#[1] 2 3 4 5 6 7 8 9 10
#
#$v3
#[1] 9 10
I think these sorts of situations are tricky because normal looping solutions (e.g. the apply function) only loop through one object, but you need to loop both through df and nv simultaneously. One approach is to loop through the indices and to use them to grab the appropriate information from both df and nv. A convenient way to loop through indices is the sapply function:
sapply(seq_along(nv), function(x) which(df[,x] > nv[x]))
# [[1]]
# [1] 7 8 9 10
#
# [[2]]
# [1] 2 3 4 5 6 7 8 9 10
#
# [[3]]
# [1] 9 10
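Another way to walk through both objects in parallel is Map, which pairs the columns of the data frame with the elements of the vector and always returns a list. A minimal sketch with the data from the question (hits and n_hits are illustrative names):

```r
df <- data.frame(v1 = 1:10, v2 = 11:20, v3 = 21:30)
nv <- c(6, 11, 28)

# A data frame is a list of columns, so Map pairs column i with nv[i]
hits <- Map(function(column, threshold) which(column > threshold), df, nv)
print(hits)

# Number of matching rows per column
n_hits <- lengths(hits)
print(n_hits)
```

The result keeps the column names, which makes it easier to see which threshold each set of row indices belongs to.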

Extract only first line in a data frame from several subgroups that satisfy a conditional

I have a data frame similar to the dummy example here:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
In the original data frame, there are many more groups, each with 10 values. For each group (a, b or c) I would like to extract the first line where Value is not NA, but only the first such line. Because a group can contain several non-NA values that differ from each other, I can't simply subset.
I was imagining something like this using plyr and a conditional, but I honestly have no idea what the conditional should take:
ddply(df, .(Group), function(sub_data) {
for (i in seq_along(sub_data$Value)) {
if (sub_data$Value[i] != 'NA') {
# take the value, but only for the first non-NA entry
# return(the first line that satisfies the condition)
}
}
})
Maybe this is easy with other strategies that I don't know of
Any suggestion is very much appreciated!
I know this has been answered but for this you should be looking at the data.table package. It provides a very expressive and terse syntax for doing what you ask:
library(data.table)
df<-data.table(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
df[ Value != "NA", .SD[1], by=Group ]
Group Value
1: a 10
2: b 4
3: c 2
Do yourself a favor and learn data.table.
Some other notes:
You can easily convert data.frames to data.tables
I think that you don't want "NA" but simply NA in your example, in that case the syntax is:
df[ ! is.na(Value), .SD[1], by=Group ]
Since you suggested plyr in the first place:
ddply(subset(df, !is.na(Value)), .(Group), head, 1L)
That assumes you have NAs and not 'NA's. If the latter (not recommended), then:
ddply(subset(df, Value != 'NA'), .(Group), head, 1L)
Note how concise this is. I would agree with using plyr.
If you're willing to use actual NA's vs strings, then the following should give you what you're looking for:
df <- data.frame(Group=rep(letters[1:3], each=3),
Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
print(df)
## Group Value
## 1 a <NA>
## 2 a <NA>
## 3 a 10
## 4 b <NA>
## 5 b 4
## 6 b 8
## 7 c <NA>
## 8 c <NA>
## 9 c 2
df.1 <- by(df, df$Group, function(x) {
head(x[complete.cases(x),], 1)
})
print(df.1)
## df$Group: a
## Group Value
## 3 a 10
## ------------------------------------------------------------------------
## df$Group: b
## Group Value
## 5 b 4
## ------------------------------------------------------------------------
## df$Group: c
## Group Value
## 9 c 2
First you should take care of NA's:
options(stringsAsFactors=FALSE)
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
And then maybe something like this would do the trick:
for(i in unique(df$Group)) {
for(j in df$Value[df$Group==i]) {
if(!is.na(j)) {
print(paste(i,j))
break
}
}
}
Assuming that Value is actually numeric, not character.
df <- data.frame(Group=rep(letters[1:3],each=3),
Value=c(NA, NA, 10, NA, 4, 8, NA, NA, 2))
do.call(rbind, lapply(split(df, df$Group), function(x){
x[ is.na(x[,2]) == FALSE, ][1,]
}))
## Group Value
## a a 10
## b b 4
## c c 2
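A base R alternative without any split or loop, given as a sketch under the same assumption that Value holds real NA values: drop the NA rows first, then keep the first remaining row of each group with duplicated(). The names non_missing and first_rows are illustrative.

```r
df <- data.frame(Group = rep(letters[1:3], each = 3),
                 Value = c(NA, NA, 10, NA, 4, 8, NA, NA, 2))

# Keep only rows with a real value
non_missing <- df[!is.na(df$Value), ]

# duplicated() is FALSE for the first occurrence of each group
first_rows <- non_missing[!duplicated(non_missing$Group), ]
print(first_rows)
```

The original row names are kept, so it stays visible which rows of the input were picked.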
I don't see any solutions using aggregate(...), which would be the simplest:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
aggregate(Value~Group,df[df$Value!="NA",],head,1)
# Group Value
# 1 a 10
# 2 b 4
# 3 c 2
If your df contains actual NA, and not "NA" as in your example, then use this:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
aggregate(Value~Group,df[!is.na(df$Value),],head,1)
Group Value
1 a 10
2 b 4
3 c 2
Your life would be easier if you marked missing values with NA and not as a character string 'NA'; the former is really missing to R and it has tools to work with such missingness. The latter ('NA') is really not missing except for the meaning that this string has to you alone; R cannot divine that information directly. Assuming you correct this, then the solution below is one way to go about doing this.
Similar in spirit to @hrbrmstr's by(), but to my eyes aggregate() gives nicer output:
> foo <- function(x) head(x[complete.cases(x)], 1)
> aggregate(Value ~ Group, data = df, foo)
Group Value
1 a 10
2 b 4
3 c 2
> aggregate(df$Value, list(Group = df$Group), foo)
Group x
1 a 10
2 b 4
3 c 2
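Several answers above assume real NA values rather than the string 'NA'. One possible way to convert the example data, as a sketch that assumes every other entry in Value is a valid number:

```r
df <- data.frame(Group = rep(letters[1:3], each = 3),
                 Value = c('NA', 'NA', '10', 'NA', '4', '8', 'NA', 'NA', '2'),
                 stringsAsFactors = FALSE)

# Replace the literal string "NA" with a real missing value
df$Value[df$Value == "NA"] <- NA

# Assumption: the remaining strings are all numeric
df$Value <- as.numeric(df$Value)
print(df)
```

After this step is.na(df$Value) and complete.cases(df) behave as the answers expect.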

Resources