I have a data set containing product prototype test data. Not all tests were run on all lots, and not all tests were executed with the same sample sizes. To illustrate, consider this case:
> test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
var1 = rep(c(1:3, NA), 3),
var2 = 1:12,
var3 = c(rep(NA, 4), 1:8))
> test
name var1 var2 var3
1 A 1 1 NA
2 A 2 2 NA
3 A 3 3 NA
4 A NA 4 NA
5 B 1 5 1
6 B 2 6 2
7 B 3 7 3
8 B NA 8 4
9 C 1 9 5
10 C 2 10 6
11 C 3 11 7
12 C NA 12 8
In the past, I've only had to deal with cases of mismatched repetition counts, which has been easy to handle with aggregate(cbind(var1, var2) ~ name, test, FUN = mean, na.action = na.omit) (or with the default setting). That gives me averages for each lot over three values for var1 and over four values for var2.
Unfortunately, this will leave me with a dataset completely missing lot A in this case:
aggregate(cbind(var1, var2, var3) ~ name, test, FUN = mean, na.action = na.omit)
name var1 var2 var3
1 B 2 6 2
2 C 2 10 6
If I use na.pass, however, I also don't get what I want:
aggregate(cbind(var1, var2, var3) ~ name, test, FUN = mean, na.action = na.pass)
name var1 var2 var3
1 A NA 2.5 NA
2 B NA 6.5 2.5
3 C NA 10.5 6.5
Now I lose the good data I had in var1 since it contained instances of NA.
What I'd like is:
NA as the output of mean() if all unique combinations of varN ~ name are NAs
Output of mean() if there are one or more actual values for varN ~ name
I'm guessing this is pretty simple, but I just don't know how. Do I need to use ddply for something like this? If so... the reason I tend to avoid it is that I end up writing really long equivalents to aggregate() like so:
ddply(test, .(name), summarise,
var1 = mean(var1, na.rm = T),
var2 = mean(var2, na.rm = T),
var3 = mean(var3, na.rm = T))
Yeah... so the result of that apparently does what I want. I'll leave the question anyway in case there's 1) a way to do this with aggregate() or 2) shorter syntax for ddply.
Pass both na.action = na.pass and na.rm = TRUE to aggregate. The former tells aggregate not to delete rows that contain NAs; the latter tells mean to ignore them.
aggregate(cbind(var1, var2, var3) ~ name, test, mean,
na.action=na.pass, na.rm=TRUE)
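One nuance worth noting: with na.rm = TRUE, mean() returns NaN rather than NA for a group whose values are all missing (var3 for lot A here). If you specifically want NA in that case, as the question describes, a small wrapper around mean does it. This is just a sketch, and mean_or_na is a made-up helper name:
# Return NA when every value in the group is missing, otherwise the usual mean
mean_or_na <- function(x) if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
aggregate(cbind(var1, var2, var3) ~ name, test, mean_or_na, na.action = na.pass)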
This question already has answers here:
How to omit NA values while pasting numerous column values together?
(2 answers)
suppress NAs in paste()
(13 answers)
Closed 1 year ago.
I am trying to concatenate two columns in R using:
df_new$conc_variable <- paste(df$var1, df$var2)
My dataset looks as follows:
id var1 var2
1 10 NA
2 NA 8
3 11 NA
4 NA 1
I am trying to get it such that there is a third column:
id var1 var2 conc_var
1 10 NA 10
2 NA 8 8
3 11 NA 11
4 NA 1 1
but instead I get:
id var1 var2 conc_var
1 10 NA 10NA
2 NA 8 8NA
3 11 NA 11NA
4 NA 1 1NA
Is there a way to exclude NAs in the paste process? I tried including na.rm = FALSE, but that just added FALSE at the end after the NA in the conc_var column. Here is the dataset:
id <- c(1,2,3,4)
var1 <- c(10, NA, 11, NA)
var2 <- c(NA, 8, NA, 1)
df <- data.frame(id, var1, var2)
One out of many options is to use ifelse as in:
df <- data.frame(var1 = c(10, NA, 11, NA),
var2 = c(NA, 8, NA, 1))
df$new <- ifelse(is.na(df$var1), yes = df$var2, no = df$var1)
print(df)
Depending on the circumstances, rowSums might be suitable as well, as in:
df$new2 <- rowSums(df[, c("var1", "var2")], na.rm = TRUE)
print(df)
You can use tidyr::unite -
df <- tidyr::unite(df, conc_var, var1, var2, na.rm = TRUE, remove = FALSE)
df
# id conc_var var1 var2
#1 1 10 10 NA
#2 2 8 NA 8
#3 3 11 11 NA
#4 4 1 NA 1
As in the example, if each row has at most one non-missing value, you can also use pmax or coalesce.
pmax(df$var1, df$var2, na.rm = TRUE)
dplyr::coalesce(df$var1, df$var2)
You could use glue from the glue package instead.
glue::glue(10, NA, .na = '')
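Applied to the example data frame, that idea might look like the following sketch, using glue_data so the template is evaluated against the columns of df; the .na = '' argument makes NAs print as nothing:
# Combine var1 and var2 row by row, dropping the NA text
glue::glue_data(df, "{var1}{var2}", .na = "")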
This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
aggregate methods treat missing values (NA) differently
(2 answers)
Closed 2 years ago.
There is a data frame x with 5753 observations of 4 variables.
The column names are: date, Depth, var1, and var2. I converted date and Depth to factor before performing aggregate().
I wanted to calculate the average and standard deviation of the 2 variables, grouped by date and Depth.
When applying aggregate(x[,3:4], by = list(x$date, x$Depth), FUN = function(x) c(avg = mean(x, na.rm = TRUE), SD= sd)), I got the average of var1 and the average of var2 grouped by date and Depth, but I did not get the SD.
When applying aggregate(. ~ date+Depth, data = x, FUN = function(x) c(avg = mean(x, na.rm = TRUE), SD= sd)), I got an error message: "Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate".
After counting the NAs in the two columns, I found that there are 5622 NAs in var1 and 5049 NAs in var2. I do not want to remove the NAs before applying aggregate() yet.
My questions are:
Why did I not get the SD with the first syntax?
Why does the second syntax not work? I learned it from Stack Overflow, and it worked with the following data frame:
x3 <- read.table(text = " id1 id2 val1 val2
1 a x 1 9
2 a x 2 4
3 a y 3 NA
4 a y 4 NA
5 b x 1 NA
6 b y 4 NA
7 b x 3 9
8 b y 2 8", header = TRUE)
We can use dplyr: pass the grouping columns to group_by and summarise the remaining columns with across inside summarise.
library(dplyr) #1.0.0
x3 %>%
group_by(id1, id2) %>%
summarise(across(starts_with('val'),
list(mean = ~ mean(., na.rm = TRUE) , sd = ~sd(., na.rm = TRUE))))
# A tibble: 4 x 6
# Groups: id1 [2]
# id1 id2 val1_mean val1_sd val2_mean val2_sd
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 a x 1.5 0.707 6.5 3.54
#2 a y 3.5 0.707 NaN NA
#3 b x 2 1.41 9 NA
#4 b y 3 1.41 8 NA
If the version of dplyr is < 1.0.0, we can use summarise_at
x3 %>%
group_by(id1, id2) %>%
summarise_at(vars(-group_cols()), list(mean = ~ mean(., na.rm = TRUE),
sd = ~ sd(., na.rm = TRUE)))
With aggregate, the error occurs because of the NA elements: the formula method uses na.action = na.omit by default, which removes a row if it contains any NA. Specifying na.action = na.pass (or NULL) resolves that issue. But applying multiple functions combined with c results in matrix columns; in order to get normal data.frame columns, we can wrap the call in data.frame via do.call.
do.call(data.frame, aggregate(. ~ id1 + id2, data = x3, FUN = function(x)
c(avg = mean(x, na.rm = TRUE), SD= sd(x, na.rm = TRUE)), na.action = NULL))
This question already has answers here:
Replacing character values with NA in a data frame
(7 answers)
Closed 3 years ago.
I have a fairly large data frame in which multiple "-" values represent missing data. The data frame was assembled from multiple Excel files, for which I could not use na.strings = (or a similar argument), so I had to import them with the "-" representation.
How can I replace all "-" in the data frame with NA / missing values? The data frame consists of 200 columns of characters, factors, and integers.
So far I have tried:
sum(df %in c("-"))
returns: [1] 0
df[df=="-"] <-NA #does not do anything
library(plyr)
df <- revalue(df, c("-",NA))
returns: Error in revalue(tmp, c("-", NA)) :
x is not a factor or a character vector.
library(anchors)
df <- replace.value(df,colnames(df),"-",as.character(NA))
Error in charToDate(x) :
character string is not in a standard unambiguous format
Since the data frame mixes characters, factors, and integers, I can see why the last two approaches do not work correctly. Any help would be appreciated.
Since you're already using tidyverse functions, you can easily use na_if from dplyr within your pipes.
For example, I have a dataset where 999 is used to fill in a non-answer:
df <- tibble(
alpha = c("a", "b", "c", "d", "e"),
val1 = c(1, 999, 3, 8, 999),
val2 = c(2, 8, 999, 1, 2))
If I wanted to change val1 so 999 is NA, I could do:
df %>%
mutate(val1 = na_if(val1, 999))
In your case, it sounds like you want to replace a value across multiple variables, so using across for multiple columns would be more appropriate:
df %>%
mutate(across(c(val1, val2), na_if, 999)) # or val1:val2
This replaces all instances of 999 in both val1 and val2 with NA; the data now looks like this:
# A tibble: 5 x 3
alpha val1 val2
<chr> <dbl> <dbl>
1 a 1. 2.
2 b NA 8.
3 c 3. NA
4 d 8. 1.
5 e NA 2.
I believe the simplest solution is the base R replacement function is.na<-. It's meant to solve precisely this issue.
First, make up some data. Then set the required values to NA.
set.seed(247) # make the results reproducible
df <- data.frame(X = 1:10, Y = sample(c("-", letters[1:2]), 10, TRUE))
is.na(df) <- df == "-"
df
# X Y
#1 1 a
#2 2 b
#3 3 b
#4 4 a
#5 5 <NA>
#6 6 b
#7 7 a
#8 8 <NA>
#9 9 b
#10 10 a
Here's a solution that will do it:
> library(dplyr)
> library(stringr)
> test <- tibble(x = c('100', '20.56', '0.003', '-', ' -'), y = 5:1)
> makeNA <- function(x) str_replace(x,'-',NA_character_)
> mutate_all(test, funs(makeNA))
# A tibble: 5 x 2
x y
<chr> <chr>
1 100 5
2 20.56 4
3 0.003 3
4 NA 2
5 NA 1
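In current dplyr, funs() has been superseded; a rough equivalent using across() (still with stringr loaded for str_replace) would be something like:
# Replace any "-" with NA in every column; as.character keeps the behaviour
# consistent for non-character columns such as y
test %>%
mutate(across(everything(), ~ str_replace(as.character(.x), '-', NA_character_)))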
I have a dataframe with multiple variables, each of which has values of TRUE, FALSE, or NA. I'm trying to summarize the data, but can't get anything to work quite the way I want.
names <- c("n1","n2","n3","n4","n5","n6")
groupname <- c("g1","g2","g3","g4","g4","g4")
var1 <- c(TRUE,TRUE,NA,FALSE,TRUE,NA)
var2 <- c(FALSE,TRUE,NA,FALSE,TRUE,NA)
var3 <- c(FALSE,TRUE,NA,FALSE,TRUE,NA)
df <- data.frame(names,groupname,var1,var2,var3)
I'm trying to summarize the data for individual groups:
G4 TRUE FALSE NA
var1 3 1 2
var2 2 2 2
var3 2 2 2
I can do table(groupname,var1) to do them individually, but I'm trying to get it all in a single table. Any suggestions?
Using dplyr (plus tidyr for gather):
library(dplyr)
library(tidyr)
df %>% gather("key", "value", var1:var3) %>%
group_by(key) %>%
summarise(true = sum(value==TRUE, na.rm=T),
false = sum(!value, na.rm=T),
missing = sum(is.na(value)))
# key true false missing
#1 var1 3 1 2
#2 var2 2 2 2
#3 var3 2 2 2
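If you also need the counts split out by groupname (for example just g4, as in the desired output), the same approach works by adding groupname to the grouping; a sketch along the same lines:
df %>% gather("key", "value", var1:var3) %>%
group_by(groupname, key) %>%
summarise(true = sum(value == TRUE, na.rm = TRUE),
false = sum(!value, na.rm = TRUE),
missing = sum(is.na(value)))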
In base R, you could use table to get the counts, lapply to run through the variables, and do.call to put the results together. A minor subsetting with [ orders the columns as desired.
do.call(rbind, lapply(df[3:5], table, useNA="ifany"))[, c(2,1,3)]
TRUE FALSE <NA>
var1 3 1 2
var2 2 2 2
var3 2 2 2
This will work if each variable has all levels (TRUE, FALSE, NA). If one of the levels is missing, you can tell table to fill it with a 0 count by feeding it a factor variable.
Here is an example.
# expand data set
df$var4 <- c(TRUE, NA)
do.call(rbind, lapply(df[3:6],
function(i) table(factor(i, levels=c(TRUE, FALSE, NA)),
useNA="ifany")))[, c(2,1,3)]
FALSE TRUE <NA>
var1 1 3 2
var2 2 2 2
var3 2 2 2
var4 0 3 3
Considering the following data frame:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df
var1 var2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 1
I'd like to remove all rows whose values are flipped across the two columns. In this case, it would be row 1 and row 5 as the values 1 and 5 in row 1 are flipped to 5 and 1 in row 5. These two rows should be removed.
I hope it is clear what I am asking for :-)
Kind regards!
Perhaps something like this could work too:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df[!do.call(paste, df) %in% do.call(paste, rev(df)), ]
var1 var2
2 2 6
3 3 7
4 4 8
I'd have to test it on a few more test cases though, but the general idea is to use rev to reverse the order of the columns in "df" and paste them together and compare that with the pasted columns from "df".
Here's a simple but not especially elegant way: make a reversed data frame with a flag, and then merge it on to df:
# Make a reversed dataset
fd <- data.frame(var1 = df$var2, var2 = df$var1, flag = TRUE)
# Merge it onto your original df, then drop the matched rows and the flag var
df.sub <- subset(merge(x = df, y = fd, by = c("var1", "var2"), all.x = TRUE),
subset = is.na(flag),
select = c("var1", "var2"))
Using a bit of maths: two rows are the same up to a permutation exactly when their sum and the absolute value of their difference match (rows 1 and 5 both give sum 6 and absolute difference 4):
df[with(df, !duplicated(data.frame(var1 + var2, abs(var1 - var2)), fromLast = TRUE)),]
# var1 var2
#1 1 5
#2 2 6
#3 3 7
#4 4 8
Edit: I should have read the question more carefully. To remove both duplicates, follow Ananda's suggestion:
df.ind = with(df, data.frame(var1 + var2, abs(var1 - var2)))
df[!duplicated(df.ind) & !duplicated(df.ind, fromLast = TRUE),]
# var1 var2
#2 2 6
#3 3 7
#4 4 8
If creating a copy doesn't cause memory issues then this works as well -
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df2 <- data.frame(var12 = 1:5, var22 = c(5,6,7,8,1))
df3 <- merge(df,df2, by.x = 'var2', by.y = 'var12', all.x = TRUE)
df3 <- subset(
df3,
is.na(var22),
select = c('var1','var2')
)
Output:
> df3
var1 var2
3 2 6
4 3 7
5 4 8
I tried merging df with df but that gives a warning about the column var2 being duplicated. Anybody know what to do?
If you can assume there are no duplicates in the data frame, here's a one-line answer, though still not especially concise:
df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df) + 1:nrow(df)],]
## var1 var2
## 2 2 6
## 3 3 7
## 4 4 8
rbindlist (from the data.table package) is necessary here because rbind(df, df[,2:1]) will match by column name rather than by index, so the other option is something like rbind(df, setnames(df[,2:1], names(df))), where setnames also comes from data.table. If you want to keep duplicates from the original, this gets even more unpleasant:
> df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df<-rbind(df,c(2,6))
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)],]
var1 var2
2 2 6
3 3 7
4 4 8
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)] | duplicated(df),]
var1 var2
2 2 6
3 3 7
4 4 8
6 2 6