Remove groups which do not have non-consecutive NA values in R

I have the following Data frame
group <- c(2,2,2,2,4,4,4,4,5,5,5,5)
D <- c(NA,2,NA,NA,NA,2,3,NA,NA,NA,1,1)
df <- data.frame(group, D)
df
group D
1 2 NA
2 2 2
3 2 NA
4 2 NA
5 4 NA
6 4 2
7 4 3
8 4 NA
9 5 NA
10 5 NA
11 5 1
12 5 1
I would like to keep only the groups that contain non-consecutive NA values at least once. In this case group 5 would be removed because its only NA values are consecutive. Groups 2 and 4 remain because they do contain non-consecutive NA values (NA values separated by one or more rows with a non-NA value).
Therefore the resulting data frame would look like this:
df2
group D
1 2 NA
2 2 2
3 2 NA
4 2 NA
5 4 NA
6 4 2
7 4 3
8 4 NA
any ideas :)?

How about using difference between the index of NA-values per group?
library(dplyr)
df %>% group_by(group) %>% filter(any(diff(which(is.na(D))) > 1))
## A tibble: 8 x 2
## Groups: group [2]
# group D
# <dbl> <dbl>
#1 2. NA
#2 2. 2.
#3 2. NA
#4 2. NA
#5 4. NA
#6 4. 2.
#7 4. 3.
#8 4. NA
I'm not sure this would catch all potential edge cases but it seems to work for the given example.
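For anyone unsure why the diff/which idea works, here is a minimal base R sketch of the same test, without dplyr. The helper name has_gap_between_nas is made up for illustration. which(is.na(x)) gives the positions of the NAs, and a difference greater than 1 between neighbouring positions means at least one non-NA value sits between two NAs. A group with zero or one NA yields an empty diff, so any() returns FALSE and the group is dropped.

```r
group <- c(2, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5)
D <- c(NA, 2, NA, NA, NA, 2, 3, NA, NA, NA, 1, 1)
df <- data.frame(group, D)

# Hypothetical helper: TRUE when two NAs are separated by a non-NA value
has_gap_between_nas <- function(x) any(diff(which(is.na(x))) > 1)

# One logical per group, named by the group value
keep_group <- tapply(df$D, df$group, has_gap_between_nas)

# Keep only the rows whose group passed the test
df2 <- df[df$group %in% names(keep_group)[keep_group], ]
df2
```

The edge cases worth checking on real data are groups with no NA at all and groups with a single NA; with this test both are removed.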

Related

Merging rows by a group [duplicate]

This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 5 years ago.
I have a data set
d1 <- data.frame(GROUP=c("A","A","A","G","G","F","F","E","T"),
                 FIRST=c(10,2,3,6,NA,NA,NA,1,NA),
                 SECOND=c(3,NA,NA,1,NA,4,2,1,NA),
                 THIRD=c(5,7,NA,NA,NA,1,NA,1,1))
d1
GROUP FIRST SECOND THIRD
1 A 10 3 5
2 A 2 NA 7
3 A 3 NA NA
4 G 6 1 NA
5 G NA NA NA
6 F NA 4 1
7 F NA 2 NA
8 E 1 1 1
9 T NA NA 1
I want to combine the data using the GROUP-column in two ways:
Mean of columns inside a group
GROUP FIRST SECOND THIRD
1 A 5 3 6
2 G 6 1 NA
3 F NA 3 1
4 E 1 1 1
5 T NA NA 1
Column-wise max value inside a group
GROUP FIRST SECOND THIRD
1 A 10 3 7
2 G 6 1 NA
3 F NA 4 1
4 E 1 1 1
5 T NA NA 1
Is there a quick way to do this or should I create a new function?
We can use aggregate from base R
aggregate(.~GROUP, d1, mean, na.rm = TRUE, na.action=NULL)
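Or using dplyr (summarise_each() and funs() are deprecated in current dplyr; across(), available from dplyr 1.0.0, is the replacement)
library(dplyr)
d1 %>%
  group_by(GROUP) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
Or, for the column-wise maximum
d1 %>%
  group_by(GROUP) %>%
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)))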
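One thing to watch: with na.rm = TRUE, mean() returns NaN and max() returns -Inf (with a warning) for a group where every value is missing, whereas the desired output in the question shows NA. A minimal base R sketch that returns NA in that case is below. The helpers mean_or_na and max_or_na are made-up names, and na.action = na.pass is used here so that aggregate keeps the rows containing NA.

```r
d1 <- data.frame(GROUP  = c("A", "A", "A", "G", "G", "F", "F", "E", "T"),
                 FIRST  = c(10, 2, 3, 6, NA, NA, NA, 1, NA),
                 SECOND = c(3, NA, NA, 1, NA, 4, 2, 1, NA),
                 THIRD  = c(5, 7, NA, NA, NA, 1, NA, 1, 1))

# Hypothetical helpers: return NA instead of NaN / -Inf for all-NA input
mean_or_na <- function(x) if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
max_or_na  <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

# na.pass keeps rows with NA so the helpers see every value in the group
means <- aggregate(. ~ GROUP, d1, mean_or_na, na.action = na.pass)
maxes <- aggregate(. ~ GROUP, d1, max_or_na, na.action = na.pass)
means
maxes
```

Note that aggregate sorts the result by GROUP, so the rows come out in alphabetical order rather than in order of first appearance.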

Why didn't mutate fill all rows? Was using mutate and ifelse to look up imputed values from another dataframe

Here is the deal. I was trying to use mutate from the plyr package to look up an appropriate value from another dataframe whenever the v variable in the original dataframe was NA. The looked-up value is supposed to go into a new variable, imputed. I also defined a custom function for the lookup.
Here is the code:
if(!require(plyr)){
install.packages("plyr")
library(plyr)
}
df = data.frame(d=c(1,1,1,2,2,2,3,3,3),
g=rep(c(1,2,3),3),
v=c(5,NA,NA,5,NA,NA,5,NA,NA))
imputed = data.frame(g=c(1,2,3),
v=c(5,10,15))
getImputed = function(p){
imputed[imputed$g==p,"v"]
}
df = mutate(df,imputed=ifelse(is.na(v),getImputed(g),v))
df
And this is the resulting dataframe:
d g v imputed
1 1 1 5 5
2 1 2 NA 10
3 1 3 NA 15
4 2 1 5 5
5 2 2 NA NA
6 2 3 NA NA
7 3 1 5 5
8 3 2 NA NA
9 3 3 NA NA
As one can see, only the first 3 rows were successfully filled in by mutate. It is likely that the ifelse function is the issue, but I can't see why : (
What is weird is that, if the imputed dataframe has 4 rows, like this:
imputed = data.frame(g=c(1,2,3,4),
v=c(5,10,15,20))
then the df dataframe was filled up properly:
d g v imputed
1 1 1 5 5
2 1 2 NA 10
3 1 3 NA 15
4 2 1 5 5
5 2 2 NA 10
6 2 3 NA 15
7 3 1 5 5
8 3 2 NA 10
9 3 3 NA 15
but R gave me a warning saying:
Warning message:
In imputed$g == p :
longer object length is not a multiple of shorter object length
Am I overlooking something?
The problem is your getImputed function. The mutate function does not iterate over rows. It passes whole columns as vectors to functions, so each function is basically called once. Your getImputed function works if you pass a single value, but not so well with a vector:
getImputed(1)
# [1] 5
getImputed(c(1,2))
# [1] 5 10
# Warning message:
# In imputed$g == p :
# longer object length is not a multiple of shorter object length
A better way to write the function would be
getImputed2 <- function(p){
imputed$v[match(p, imputed$g)]
}
This will properly handle a vector of values
mutate(df,imputed=ifelse(is.na(v),getImputed2(g),v))
# d g v imputed
# 1 1 1 5 5
# 2 1 2 NA 10
# 3 1 3 NA 15
# 4 2 1 5 5
# 5 2 2 NA 10
# 6 2 3 NA 15
# 7 3 1 5 5
# 8 3 2 NA 10
# 9 3 3 NA 15
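To see where the NAs in the original result come from, here is a small base R sketch using the same data, without plyr. The variable names idx, bad and good are only for illustration. The comparison imputed$g == g produces a logical vector of length 9, and indexing a 3-row table with it returns NA for every TRUE beyond the third row. match() instead returns, for each element of g, the row of imputed that holds it.

```r
df <- data.frame(d = rep(1:3, each = 3),
                 g = rep(c(1, 2, 3), 3),
                 v = c(5, NA, NA, 5, NA, NA, 5, NA, NA))
imputed <- data.frame(g = c(1, 2, 3),
                      v = c(5, 10, 15))

# Element-wise comparison: imputed$g is recycled against the 9 values of g
idx <- imputed$g == df$g     # nine TRUEs here, because g repeats 1, 2, 3
bad <- imputed[idx, "v"]     # positions 4 to 9 point past the last row

# Position-based lookup: one row index of `imputed` per element of g
good <- imputed$v[match(df$g, imputed$g)]

df$imputed <- ifelse(is.na(df$v), good, df$v)
df
```

The 4-row version of imputed only appeared to work because the recycled comparison happened to return three values, which ifelse then recycled across all nine rows.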
You might also consider joining and replacing
mutate(join(df, setNames(imputed, c("g","v2")), by=c(g="g")),
v=ifelse(is.na(v), v2, v), v2=NULL)

Fill rows depending on another row values

The problem is as follows:
I have a data set with 3 columns: ID / SCORE / ACTION. I need to identify the first score that is not NA and assign its value (and the action too) to the NAs before it. In this case observations #1 and #2 would have the same score and action as observation #3. Likewise, observations #4, #5 and #6 should take the values of observation #7.
ID SCORE ACTION
1 NA NA
2 NA NA
3 BB+ T
4 NA NA
5 NA NA
6 NA NA
7 AAA S
8 NA NA
Any ideas? Thanks
You can look into na.locf from the "zoo" package. In this case, you would want to use the fromLast argument:
library(zoo)
na.locf(mydf, fromLast=TRUE)
# ID SCORE ACTION
# 1 1 BB+ T
# 2 2 BB+ T
# 3 3 BB+ T
# 4 4 AAA S
# 5 5 AAA S
# 6 6 AAA S
# 7 7 AAA S
# 8 8 <NA> <NA>
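If adding zoo is not an option, the same backward fill can be sketched in base R. This is only an illustration under some assumptions: the data frame mydf is rebuilt here from the table in the question with character columns, and fill_from_below is a made-up helper name. It reverses the vector, carries the last seen non-NA value forward, and reverses back.

```r
mydf <- data.frame(ID = 1:8,
                   SCORE  = c(NA, NA, "BB+", NA, NA, NA, "AAA", NA),
                   ACTION = c(NA, NA, "T", NA, NA, NA, "S", NA),
                   stringsAsFactors = FALSE)

# Hypothetical helper: replace each NA with the next non-NA value below it
fill_from_below <- function(x) {
  r <- rev(x)
  # Index of the most recent non-NA value seen so far (0 if none yet)
  last_seen <- cummax(ifelse(is.na(r), 0L, seq_along(r)))
  last_seen[last_seen == 0L] <- NA   # nothing to carry, so stay NA
  rev(r[last_seen])
}

cols <- c("SCORE", "ACTION")
mydf[cols] <- lapply(mydf[cols], fill_from_below)
mydf
```

As with na.locf, the last row stays NA because there is no later observation to copy from.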

Retrieving subset of a data frame by finding entries with NA in specific columns

Suppose we had a data frame with NA values like so,
>data
A B C D
1 3 NA 4
2 1 3 4
NA 3 3 5
4 2 NA NA
2 NA 4 3
1 1 1 2
I wish to know a general method for retrieving the subset of data with NA values in C or A. So the output should be,
A B C D
1 3 NA 4
NA 3 3 5
4 2 NA NA
I tried using the subset command like so, subset(data, A==NA | C==NA), but it didn't work. Any ideas?
A very handy function for this sort of thing is complete.cases. It checks each row for NA values and returns FALSE for a row that contains any; if the row has no NAs, it returns TRUE.
So, you need to subset just the two columns of your data and then use complete.cases(.) and negate it and subset those rows back from your original data, as follows:
# assuming your data is in 'df'
df[!complete.cases(df[, c("A", "C")]), ]
# A B C D
# 1 1 3 NA 4
# 3 NA 3 3 5
# 4 4 2 NA NA
Here is one possibility:
# Read your data
data <- read.table(text="
A B C D
1 3 NA 4
2 1 3 4
NA 3 3 5
4 2 NA NA
2 NA 4 3
1 1 1 2",header=T,sep="")
# Now subset your data
subset(data, is.na(C) | is.na(A))
A B C D
1 1 3 NA 4
3 NA 3 3 5
4 4 2 NA NA
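As for why the original attempt with A==NA returned nothing: any comparison with NA evaluates to NA rather than TRUE or FALSE, and subset drops rows whose condition is NA. The short sketch below shows this on the same data and checks that the two approaches above select the same rows. The names by_is_na, by_cc and by_equals are only for illustration.

```r
data <- read.table(text = "A B C D
1 3 NA 4
2 1 3 4
NA 3 3 5
4 2 NA NA
2 NA 4 3
1 1 1 2", header = TRUE)

NA == NA   # NA, not TRUE: equality cannot detect missing values

# Both of these pick rows 1, 3 and 4
by_is_na <- data[is.na(data$A) | is.na(data$C), ]
by_cc    <- data[!complete.cases(data[, c("A", "C")]), ]

# The original attempt: the condition is NA for every row, so nothing is kept
by_equals <- subset(data, A == NA | C == NA)
nrow(by_equals)
```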

Selecting values in a dataframe based on a priority list

I am new to R so am still getting my head around the way it works. My problem is as follows, I have a data frame and a prioritised list of columns (pl), I need:
To find the maximum value from the columns in pl for each row and create a new column with this value (df$max)
Using the priority list, subtract this maximum value from the priority value, ignoring NAs and returning the absolute difference
Probably better with an example:
My priority list is
pl <- c("E","D","A","B")
and the data frame is:
A B C D E F G
1 15 5 20 9 NA 6 1
2 3 2 NA 5 1 3 2
3 NA NA 3 NA NA NA NA
4 0 1 0 7 8 NA 6
5 1 2 3 NA NA 1 6
So for the first line the maximum is from column A (15) and the priority value is from column D (9) since E is a NA. The answer I want should look like this.
A B C D E F G MAX MAX-PR
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA NA NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
How about this?
df$MAX <- apply(df[,pl], 1, max, na.rm = T)
df$MAX_PR <- df$MAX - apply(df[,pl], 1, function(x) x[!is.na(x)][1])
df$MAX[is.infinite(df$MAX)] <- NA
> df
# A B C D E F G MAX MAX_PR
# 1 15 5 20 9 NA 6 1 15 6
# 2 3 2 NA 5 1 3 2 5 4
# 3 NA NA 3 NA NA NA NA NA NA
# 4 0 1 0 7 8 NA 6 8 0
# 5 1 2 3 NA NA 1 6 2 1
Example:
df <- data.frame(A=c(1,NA,2,5,3,1),B=c(3,5,NA,6,NA,10),C=c(NA,3,4,5,1,4))
pl <- c("B","A","C")
#now we find the maximum per row, ignoring NAs
max.per.row <- apply(df,1,max,na.rm=T)
#and the first element according to the priority list, ignoring NAs
#(there may be a more efficient way to do this)
first.per.row <- apply(df[,pl],1, function(x) as.vector(na.omit(x))[1])
#and finally compute the difference
max.less.first.per.row <- max.per.row - first.per.row
Note that this code will break for any row that is all NA. There is no check against that.
Here is a simple version. First, I take only the pl columns; then, for each row, I remove the NAs and compute the max and its difference from the first remaining value.
df <- dat[, pl]
cbind(dat, t(apply(df, 1, function(x) {
  x <- na.omit(x)
  c(max(x), max(x) - x[1])
})))
A B C D E F G 1 2
1 15 5 20 9 NA 6 1 15 6
2 3 2 NA 5 1 3 2 5 4
3 NA NA 3 NA NA NA NA -Inf NA
4 0 1 0 7 8 NA 6 8 0
5 1 2 3 NA NA 1 6 2 1
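For a self-contained variant that also returns NA for MAX on an all-NA row (instead of -Inf plus a warning), something like the following sketch may help. The data frame is rebuilt from the table in the question, and row_stats is a made-up helper name. unname() is there so the result columns get clean names.

```r
dat <- data.frame(A = c(15, 3, NA, 0, 1),
                  B = c(5, 2, NA, 1, 2),
                  C = c(20, NA, 3, 0, 3),
                  D = c(9, 5, NA, 7, NA),
                  E = c(NA, 1, NA, 8, NA),
                  F = c(6, 3, NA, NA, 1),
                  G = c(1, 2, NA, 6, 6))
pl <- c("E", "D", "A", "B")

# Hypothetical helper: x holds one row's values in priority order
row_stats <- function(x) {
  x <- unname(x[!is.na(x)])
  if (length(x) == 0) return(c(MAX = NA_real_, MAX_PR = NA_real_))
  # x[1] is the highest-priority non-NA value
  c(MAX = max(x), MAX_PR = abs(max(x) - x[1]))
}

out <- cbind(dat, t(apply(dat[, pl], 1, row_stats)))
out
```

Selecting dat[, pl] puts the columns in priority order, which is what makes "first non-NA value" equal to "priority value".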
