Combining/joining rows within the same dataframe based on grouping R [duplicate] - r

This question already has answers here:
combine rows in data frame containing NA to make complete row
(7 answers)
Closed 3 years ago.
I am executing a map_df function that results in a dataframe similar to the df below.
name <- c('foo', 'foo', 'foo', 'bar', 'bar', 'bar')
year <- c(19, 19, 19, 18, 18, 18)
A <- c(1, NA, NA, 2, NA, NA)
B <- c(NA, 3, NA, NA, 4, NA)
C <- c(NA, NA, 2, NA, NA, 5)
df <- data.frame(name, year, A, B, C)
name year A B C
1 foo 19 1 NA NA
2 foo 19 NA 3 NA
3 foo 19 NA NA 2
4 bar 18 2 NA NA
5 bar 18 NA 4 NA
6 bar 18 NA NA 5
Based on my unique groups within the df, in this case name + year, I want to merge the data into the same row. Desired result:
name year A B C
1 foo 19 1 3 2
2 bar 18 2 4 5
I can definitely accomplish this with a mix of filtering and joins, but with my actual dataframe that would be a lot of code and inefficient. I'm looking for a more elegant way to "squish" this dataframe.

library(dplyr)
df %>%
  group_by(name, year) %>%
  summarise_all(mean, na.rm = TRUE)
This is a dplyr answer. It works if your data really looks like the data you posted.
Output:
name year A B C
<fct> <dbl> <dbl> <dbl> <dbl>
1 bar 18 2 4 5
2 foo 19 1 3 2
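If a group could ever contain more than one non-NA value in a column, summarise_all(mean, ...) would silently average them. A minimal alternative sketch (my addition, not part of the original answer) that keeps the first non-missing value per group instead, giving the same result for the posted data:
library(dplyr)

# first(na.omit(.x)) picks the first non-NA value in the group;
# when a group has no non-NA value at all, first() falls back to NA.
df %>%
  group_by(name, year) %>%
  summarise(across(A:C, ~ first(na.omit(.x))), .groups = "drop")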

Related

compare sets of columns in R dataframe and keep one value from each set of two columns

Basically, I have a large dataset with many different variables. The data comes in pairs of columns (2019 and 2020); for some variables data is available in neither year, for some only in 2019, and for some only in 2020. I would like the 2020 data to 'override' the 2019 data wherever a 2020 value is available; otherwise the 2019 value should be kept. If no data is available for either year, then the value should stay missing. I now do this with a little helper function, but this should be more scalable, so that I can do it for 200+ column pairs. What am I missing in mutate(across(...))?
# Create data
library(dplyr)  # tibble() is re-exported by dplyr and used below

mydf <- tibble(ID = 1:5,
               var1_2019 = c(9, NA, 3, 2, NA),
               var1_2020 = c(NA, NA, 3, 2, 4),
               var2_2019 = c("A", "B", NA, "D", "C"),
               var2_2020 = c(NA, "B", NA, "R", NA),
               var3_2019 = c(T, F, NA, NA, NA),
               var3_2020 = c(NA, NA, NA, NA, F))
# create little helper function. this is good because
# it could be made more complex in the future,
# for example for numeric variables keeping the larger of the two
which_to_keep_f <- function(x, y) {
  if (is.na(x) && is.na(y)) {
    output <- NA
  } else if (is.na(x) && !is.na(y)) {
    output <- y
  } else if (!is.na(x) && is.na(y)) {
    output <- x
  } else {
    output <- y
  }
  output
}
# vectorize it
which_to_keep_f_vec <- Vectorize(which_to_keep_f)

# use function inside mutate
mydf %>%
  mutate(var1 = which_to_keep_f_vec(var1_2019, var1_2020)) %>%
  mutate(var2 = which_to_keep_f_vec(var2_2019, var2_2020)) %>%
  mutate(var3 = which_to_keep_f_vec(var3_2019, var3_2020)) %>%
  select(-contains("_20"))
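As an aside, the simple keep-2020-when-available rule can also be written with dplyr::coalesce(), which picks the first non-NA of its arguments. This is only a sketch of one way to scale it, not the poster's code; the .names pattern assumes every pair follows the _2019/_2020 suffix convention:
library(dplyr)
library(stringr)

# For each *_2020 column, take its value when present, otherwise the
# matching *_2019 value; write the result to a column without the suffix.
mydf %>%
  mutate(across(ends_with("_2020"),
                ~ coalesce(.x, get(str_replace(cur_column(), "_2020$", "_2019"))),
                .names = "{str_remove(.col, '_2020')}")) %>%
  select(-ends_with("_2019"), -ends_with("_2020"))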
Solution
Thanks to TarJae and micahkimel I got to 99% of the solution. This is the complete solution (including dropping the variables that are no longer needed and renaming the variables to their desired format):
library(tidyr)  # unnest()

mydf %>%
  mutate(across(ends_with("_2019"),
                ~ which_to_keep_f_vec(.,
                    get(stringr::str_replace(cur_column(), "_2019$", "_2020"))))) %>%
  unnest(cols = c()) %>%
  select(-contains("_2020")) %>%
  rename_all(~ stringr::str_replace(., stringr::regex("_2019$", ignore_case = TRUE), ""))
Update: thanks to micahkimel, the list() wrapper was removed so the data is not duplicated (see the code below).
Is this what you are looking for? Here we apply your function to each pair of columns:
library(dplyr)
library(stringr)
library(tidyr)  # unnest()

mydf %>%
  mutate(across(ends_with("_2019"),
                ~ which_to_keep_f_vec(.,
                    get(str_replace(cur_column(), "_2019$", "_2020"))))) %>%
  unnest(cols = c())
ID var1_2019 var1_2020 var2_2019 var2_2020 var3_2019 var3_2020
<int> <dbl> <dbl> <chr> <chr> <lgl> <lgl>
1 1 9 NA A NA TRUE NA
2 2 NA NA B B FALSE NA
3 3 3 3 NA NA NA NA
4 4 2 2 R R NA NA
5 5 4 4 C NA FALSE FALSE
Here's an approach that results in just one variable for each pair of variables in your input table. First, use pivot_longer() to collapse the pairs into single variables, and add year as a column (with twice as many observations).
library(tidyr)

mydf_long = mydf %>%
  pivot_longer(cols = matches("_20"), names_to = c(".value", "year"),
               names_sep = "_")
ID year var1 var2 var3
<int> <chr> <dbl> <chr> <lgl>
1 1 2019 9 A TRUE
2 1 2020 NA NA NA
3 2 2019 NA B FALSE
4 2 2020 NA B NA
5 3 2019 3 NA NA
6 3 2020 3 NA NA
7 4 2019 2 D NA
8 4 2020 2 R NA
9 5 2019 NA C NA
10 5 2020 4 NA FALSE
Next, use fill() to populate later NA values with earlier non-missing values. Then we can just filter to the most recent year (2020). For each variable, that year will have its own value if it had one before; otherwise, it will carry over the value from the previous year.
mydf_long %>%
  group_by(ID) %>%
  fill(var1, var2, var3) %>%
  filter(year == 2020)
ID year var1 var2 var3
<int> <chr> <dbl> <chr> <lgl>
1 1 2020 9 A TRUE
2 2 2020 NA B FALSE
3 3 2020 3 NA NA
4 4 2020 2 R NA
5 5 2020 4 C FALSE
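With 200+ column pairs you would not want to list every column in fill(); a tidyselect helper covers them all, and the now-constant year column can be dropped at the end. A small extension of the above, assuming the value columns all share the var prefix:
mydf_long %>%
  group_by(ID) %>%
  fill(starts_with("var")) %>%  # fill every value column at once
  filter(year == 2020) %>%
  ungroup() %>%
  select(-year)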

Count the sequence of numbers while skipping missing values

I have a series of dates and I want to index each date record in sequence, while skipping missing values.
Essentially, I want to see the following result, where a holds my dates and b is my index of the date record. You can see that row 5 is my 4th record, and row 7 is my 5th record.
tibble(a = c(12, 24, 32, NA, 55, NA, 73), b = c(1, 2, 3, NA, 4, NA, 5))
a b
<dbl> <dbl>
1 12 1
2 24 2
3 32 3
4 NA NA
5 55 4
6 NA NA
7 73 5
It seems that group_by() %>% mutate(sq = sequence(n())) doesn't work in this case, because I don't know how to filter out the missing values while counting. I need to keep those missing values because my data is pretty large.
Is a separate operation of filtering the data, getting the sequence, and using left_join my best option?
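For comparison, the filter-then-join route described above would look something like this sketch (the row helper column is my addition, used as a join key because the data has no natural one):
library(dplyr)

dat <- tibble(a = c(12, 24, 32, NA, 55, NA, 73))

# Index the non-NA rows separately, then join the index back by position.
indexed <- dat %>%
  mutate(row = row_number()) %>%
  filter(!is.na(a)) %>%
  mutate(b = row_number()) %>%
  select(row, b)

dat %>%
  mutate(row = row_number()) %>%
  left_join(indexed, by = "row") %>%
  select(-row)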
library(dplyr)
dat <- tibble(a = c(12, 24, 32, NA, 55, NA, 73))
dat %>%
  mutate(sq = ifelse(is.na(a), NA, cumsum(!is.na(a))))
#> # A tibble: 7 x 2
#> a sq
#> <dbl> <int>
#> 1 12 1
#> 2 24 2
#> 3 32 3
#> 4 NA NA
#> 5 55 4
#> 6 NA NA
#> 7 73 5
Cumulatively sum an indicator of non-NA values, then add 0 * a: the addition turns any component that was originally NA back into NA, while adding 0 to (and so not changing) the rest.
a <- c(12, 24, 32, NA, 55, NA, 73)
cumsum(!is.na(a)) + 0 * a
## [1] 1 2 3 NA 4 NA 5
Maybe you can try replace + seq_along like below:
within(
  dat,
  b <- replace(a, !is.na(a), seq_along(na.omit(a)))
)
We could specify i as a logical vector of non-NA values, and create 'b' by assigning the sequence of rows:
library(data.table)
setDT(dat)[!is.na(a), b := seq_len(.N)]
Output:
dat
# a b
#1: 12 1
#2: 24 2
#3: 32 3
#4: NA NA
#5: 55 4
#6: NA NA
#7: 73 5

fill in NA by outcome of formula between previous and following non-NA values in R

I have the following dataframe:
day <- c(1,2,3,4,5,6,7,8,9, 10, 11)
totalItems <- c(700, NA, 32013, NA, NA, NA, 39599, NA, NA, NA, 107542)
df <- data.frame(day, totalItems)
I need to create another variable/column where NAs are replaced by the outcome of a formula: (following available non-NA - previous available non-NA) / (row number of following non-NA - row number of previous non-NA), in order to get this final dataframe:
day <- c(1,2,3,4,5,6,7,8,9, 10, 11)
totalItems <- c(700, NA, 32013, NA, NA, NA, 39599, NA, NA, NA, 107542)
estimatedDaily <- c(700, 15656, 15656, 1897, 1897, 1897, 1897, 16986, 16986, 16986, 16986)
df.new <- data.frame(day, totalItems, estimatedDaily)
I tried to juggle with tidyr::replace_na() but I couldn't figure out how to define the formula so that it identifies the previous and the following available non-NA values.
Many thanks in advance for helping.
You can create groups in your data based on the presence of NA values.
library(dplyr)
df1 <- df %>% mutate(group = cumsum(lag(!is.na(totalItems), default = TRUE)))
df1
# day totalItems group
#1 1 700 1
#2 2 NA 2
#3 3 32013 2
#4 4 NA 3
#5 5 NA 3
#6 6 NA 3
#7 7 39599 3
#8 8 NA 4
#9 9 NA 4
#10 10 NA 4
#11 11 107542 4
Keep only the last row of each group in df1 (the row which has a value in it), apply the formula to each group, and join the result back to df1 to get the same number of rows.
df1 %>%
  group_by(group) %>%
  slice(n()) %>%
  ungroup %>%
  transmute(group, estimatedDaily = (totalItems - lag(totalItems, default = 0)) /
                                    (day - lag(day, default = 0))) %>%
  left_join(df1, by = 'group') %>%
  select(-group)
# estimatedDaily day totalItems
# <dbl> <dbl> <dbl>
# 1 700 1 700
# 2 15656. 2 NA
# 3 15656. 3 32013
# 4 1896. 4 NA
# 5 1896. 5 NA
# 6 1896. 6 NA
# 7 1896. 7 39599
# 8 16986. 8 NA
# 9 16986. 9 NA
#10 16986. 10 NA
#11 16986. 11 107542
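For reference, the same rates can be computed in base R: the formula is just the slope between consecutive non-NA observations, repeated across each gap. A sketch, assuming (as the desired output does) that the first observation forms its own segment:
# Positions of the non-NA observations
obs <- which(!is.na(df$totalItems))

# Rate per segment: the first value over its own day, then successive slopes
rate <- c(df$totalItems[obs[1]] / df$day[obs[1]],
          diff(df$totalItems[obs]) / diff(df$day[obs]))

# Map every day to the segment it falls in and pick that segment's rate
df$estimatedDaily <- rate[findInterval(df$day, df$day[obs], left.open = TRUE) + 1]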

Unequal rows in list from unstack() - how to create a dataframe

I am trying to do a robust ANOVA analysis in R. This requires that my two variables are in a very specific format. Basically, the requirement is to unstack two columns in my current dataframe and form an outcome-frequency dataframe based on the predictor (categorical variable). This would usually happen automatically using the unstack() function, i.e.
newDataFrame <- unstack(oldDataFrame, scores ~ columns)
However, the list returned has unequal rows for each category. Here is an example:
$A
[1] 2 4 2 3 3
$B
[1] 3 3
$C
[1] 5
$D
[1] 4 4 3
A, B, C and D are my categories, and the numbers are the outcome. The outcome has to be 1, 2, 3, 4, 5 or 6.
What I am working towards is the category as the 'header' and the outcome as a reference column, with the frequencies as the other columns, such that the dataframe looks like this:
A B C D
1 NA NA NA NA
2 2 NA NA NA
3 2 2 NA 1
4 1 NA NA 2
5 NA NA 1 NA
6 NA NA NA NA
What I have tried:
On another SO post, I found this:
library(stringi)
res <- as.data.frame(t(stri_list2matrix(myUnstackedList)))
colnames(res) <- unique(unlist(sapply(myUnstackedList, names)))
Outcome:
res
1 2 4 2 3 3
2 3 3 <NA> <NA> <NA>
3 5 <NA> <NA> <NA> <NA>
4 4 4 3 <NA> <NA>
Note that the categories A, B, C, D have been changed to 1, 2, 3, 4
Also tried this (another SO post):
df <- as.data.frame(plyr::ldply(myUnstackedList, rbind))
Outcome:
df
outcome group score
2 A 2
3 A 2
4 A 1
3 B 2
etc
Any tips?
This gets you most of the way to your answer:
library(dplyr)  # bind_rows()
library(tidyr)  # spread()

test <- list(A = c(2, 4, 2, 3, 3),
             B = c(3, 3),
             C = c(5),
             D = c(4, 4, 3))
test <- lapply(1:length(test), function(i){
  x <- data.frame(names(test)[i], test[i],
                  stringsAsFactors = FALSE)
  names(x) <- c("ID", "Value")
  x})
test <- bind_rows(test) %>% table %>% as.data.frame
test <- spread(test, key = ID, value = Freq)
replace(test, test == 0, NA)
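The result above only has rows for outcomes that actually occur. To also get empty rows for outcomes 1 and 6, as in the desired table, the values can be tabulated as a factor with explicit levels 1:6 first; a sketch along the same lines (myUnstackedList stands in for the list from the question):
library(tidyr)

myUnstackedList <- list(A = c(2, 4, 2, 3, 3), B = c(3, 3), C = 5, D = c(4, 4, 3))

# stack() flattens the list into value/group columns; tabulating the values
# as a factor with levels 1:6 keeps never-observed outcomes as zero rows.
long <- stack(myUnstackedList)
tab <- as.data.frame(table(outcome = factor(long$values, levels = 1:6),
                           group = long$ind))
res <- spread(tab, key = group, value = Freq)
replace(res, res == 0, NA)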
I'm not sure what the issue was with your previous dplyr attempt; however, I offer:
library(tidyr)
library(dplyr)

df <- tibble(
  outcome = c(1:5, 1:2, 1, 1:3),
  group = c(rep("A", 5), rep("B", 2), "C", rep("D", 3)),
  score = c(2, 4, 2, 3, 3, 3, 3, 5, 4, 4, 3)
)
df %>%
  group_by(outcome) %>%
  spread(group, score) %>%
  ungroup() %>%
  select(-outcome)
# # A tibble: 5 x 4
# A B C D
# * <dbl> <dbl> <dbl> <dbl>
# 1 2 3 5 4
# 2 4 3 NA 4
# 3 2 NA NA 3
# 4 3 NA NA NA
# 5 3 NA NA NA

Linear interpolation in data.table [duplicate]

I would like to perform a linear interpolation on a variable of a data frame that takes into account: 1) the time difference between the two points, 2) the moment when the data was taken, and 3) the individual on which the variable was measured.
For example in the next dataframe:
df <- data.frame(time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3),
                 Individuals = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                 Value = c(1, 2, 3, NA, 5, NA, 7, 5, NA, 7))
df
I would like to obtain:
result <- data.frame(time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3),
                     Individuals = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                     Value = c(1, 2, 3, 4, 5, 6, 7, 5, 5.5, 6))
result
I cannot use the function na.approx from the zoo package on its own, because the observations are not all consecutive: some observations belong to one individual and the others belong to other individuals. If the second individual had its first observation as NA and I used na.approx alone, I would be using information from individual 1 to interpolate the NA of individual 2 (e.g. the next data frame would have such an error):
df_2 <- data.frame(time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3),
                   Individuals = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                   Value = c(1, 2, 3, NA, 5, NA, 7, NA, 5, 7))
df_2
I have tried using the packages zoo and dplyr:
library(dplyr)
library(zoo)
proof <- df %>%
  group_by(Individuals) %>%
  na.approx(df$Value)
But I cannot perform group_by in a zoo object.
Do you know how to interpolate NA values in one variable by groups?
Thanks in advance,
Use data.frame, rather than cbind, to create your data: cbind returns a matrix, but you need a data frame for dplyr. Then use na.approx inside mutate, with group_by(Individuals) so the interpolation never crosses individuals.
df <- data.frame(time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3),
                 Individuals = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                 Value = c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10))

library(dplyr)
library(zoo)

df %>%
  group_by(Individuals) %>%
  mutate(ValueInterp = na.approx(Value, na.rm = FALSE))
time Individuals Value ValueInterp
1 1 1 NA NA
2 2 1 2 2
3 3 1 3 3
4 4 1 NA 4
5 5 1 5 5
6 6 1 NA 6
7 7 1 7 7
8 1 2 8 8
9 2 2 NA 9
10 3 2 10 10
Update: To interpolate multiple columns, we can use mutate_at. Here's an example with two value columns. We use mutate_at to run na.approx on all columns that include "Value" in the column name. list(interp=na.approx) tells mutate_at to generate new column names by running na.approx and adding interp as a suffix to generate the new column names:
df <- data.frame(time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3),
                 Individuals = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                 Value1 = c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10),
                 Value2 = c(NA, 2, 3, NA, 5, NA, 7, 8, NA, 10) * 2)

df %>%
  group_by(Individuals) %>%
  mutate_at(vars(matches("Value")), list(interp = na.approx), na.rm = FALSE)
time Individuals Value1 Value2 Value1_interp Value2_interp
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 NA NA NA NA
2 2 1 2 4 2 4
3 3 1 3 6 3 6
4 4 1 NA NA 4 8
5 5 1 5 10 5 10
6 6 1 NA NA 6 12
7 7 1 7 14 7 14
8 1 2 8 16 8 16
9 2 2 NA NA 9 18
10 3 2 10 20 10 20
If you don't want to preserve the original, uninterpolated columns, you can do:
df %>%
  group_by(Individuals) %>%
  mutate_at(vars(matches("Value")), na.approx, na.rm = FALSE)
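In current dplyr (1.0+), mutate_at() is superseded by across(); an equivalent sketch of the update above:
df %>%
  group_by(Individuals) %>%
  mutate(across(matches("Value"), ~ na.approx(.x, na.rm = FALSE),
                .names = "{.col}_interp"))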
We can use data.table:
library(data.table)
library(zoo)

# Note: group by Individuals and keep na.rm = FALSE; with a leading NA,
# na.rm = TRUE would make na.approx return a shorter vector than the
# group, and the := assignment would fail.
setDT(df)[, ValueInterp := na.approx(Value, na.rm = FALSE), by = Individuals]
