Get most recent observation for variables asked at different time points


Someone asked this already in a simpler version here, but I cannot quite get it to work for my case.
I have observational data on a number of individuals across multiple years for a set of questions, but not everyone is asked every question every year. I want to generate a new dataframe that has the most recent answer for each individual.
The data looks like this:
df <- data.frame(individual = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"), time = c(1:4), questionA = c("Yes", NA, "No", NA, "No", NA, "No", "Yes", "No", NA, NA, "No"), questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 5, NA))
The resulting dataframe for this example should look like this:
most_recent <- data.frame(individual = c("A", "B", "C"), questionA = c("No", "Yes", "No"), questionB = c(5, 4, 5))
Ideally I am looking for a dplyr solution. Thank you!

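We can use dplyr's across() for this. Note that last() picks the last non-missing value in row order, so the data must already be sorted by time within each individual (as it is here):
library(dplyr)
df %>%
  group_by(individual) %>%
  summarize(across(starts_with("question"), ~ last(na.omit(.))))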
# # A tibble: 3 x 3
# individual questionA questionB
# <chr> <chr> <dbl>
# 1 A No 5
# 2 B Yes 4
# 3 C No 5
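If the rows are not guaranteed to be in time order, one possible variation is to sort explicitly before summarising. The sketch below rebuilds the question's data, shuffles it to simulate unordered input (the shuffle and the name `shuffled` are only for illustration), and uses plain subsetting instead of na.omit(); it is an assumption about how one might harden the answer, not part of the original.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  individual = rep(c("A", "B", "C"), each = 4),
  time = rep(1:4, times = 3),
  questionA = c("Yes", NA, "No", NA, "No", NA, "No", "Yes", "No", NA, NA, "No"),
  questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 5, NA),
  stringsAsFactors = FALSE
)

# Hypothetical unordered input: shuffle the rows
set.seed(1)
shuffled <- df[sample(nrow(df)), ]

most_recent <- shuffled %>%
  arrange(individual, time) %>%          # restore time order within each individual
  group_by(individual) %>%
  summarise(
    across(starts_with("question"), ~ last(.x[!is.na(.x)])),  # last non-missing value
    .groups = "drop"
  )

print(most_recent)
```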

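My take in base R: it filters df down to each individual's most recent time point. Note that this keeps the whole latest row, so it only gives the desired result when that row has no NAs (the example data below is modified so that it does not):
df <- data.frame(individual = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"),
                 time = c(1:4),
                 questionA = c("Yes", NA, NA, "No", "No", NA, NA, "Yes", "No", NA, NA, "No"),
                 questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 3, 5), stringsAsFactors = FALSE)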
#new column to use with %in%
df$match <- paste(df$individual, df$time)
#find the most recent sample for each individual
id <- unique(df$individual)
most_recent <- sapply(id, function(id){
  time <- max(df$time[df$individual == id])
  return(paste(id, time))
})
#filter df by most recent
final <- df[df$match %in% most_recent,]
final
individual time questionA questionB match
4 A 4 No 5 A 4
8 B 4 Yes 4 B 4
12 C 4 No 5 C 4
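With the question's original data, the latest row can itself contain NAs, so filtering on the maximum time is not enough. A minimal base R sketch that takes the last non-missing value per column could look like the following; the helper name `last_non_na` is hypothetical and not from the answer above.

```r
# Data from the question (note the NAs at the latest time points)
df <- data.frame(
  individual = rep(c("A", "B", "C"), each = 4),
  time = rep(1:4, times = 3),
  questionA = c("Yes", NA, "No", NA, "No", NA, "No", "Yes", "No", NA, NA, "No"),
  questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 5, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last non-missing element, or NA if all are missing
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA else x[length(x)]
}

# Make sure rows are in time order within each individual
df <- df[order(df$individual, df$time), ]
question_cols <- grep("^question", names(df), value = TRUE)

# One single-row data frame per individual, then bind them together
pieces <- lapply(split(df, df$individual), function(d) {
  data.frame(individual = d$individual[1],
             lapply(d[question_cols], last_non_na),
             stringsAsFactors = FALSE)
})
most_recent <- do.call(rbind, pieces)
rownames(most_recent) <- NULL

print(most_recent)
```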

We could fill each 'question' NA with the preceding non-NA value, after ordering by 'individual' and 'time' and grouping by 'individual', and then take the last row of each group with slice_tail:
library(dplyr)
library(tidyr)
df %>%
  arrange(individual, time) %>%
  select(-time) %>%
  group_by(individual) %>%
  fill(starts_with('question')) %>%
  slice_tail(n = 1) %>%
  ungroup()
Output:
# A tibble: 3 x 3
# individual questionA questionB
# <chr> <chr> <dbl>
#1 A No 5
#2 B Yes 4
#3 C No 5

Related

How to replace factor NA's with the level of the cell above

I'm trying to replace NA values in a factor column with the value of the cell above. It would be great to have this as a tidyverse approach, but it doesn't matter too much if it's not.
I have data that looks like:
data <- tibble(site = as.factor(c("A", "A", NA, "B","B", NA,"C", NA, "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
And I need it to look like:
output <- tibble(site = as.factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C")),
                 value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
I've tried a few different approaches using lag and replace_na although they have basically amounted to trying the same thing which is:
mutate(site = as.character(site),
site = ifelse(is.na(site), "zero", site),
site = ifelse(site == "zero", lag(site), site),
site = as.factor(site))
Thanks!

Try fill() from tidyr:
library(tidyverse)
#Code
data <- data %>% fill(site)
Output:
# A tibble: 9 x 2
site value
<fct> <dbl>
1 A 1
2 A 2
3 A NA
4 B 1
5 B 2
6 B NA
7 C 1
8 C NA
9 C 2
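As a small check of what fill() does here, the sketch below rebuilds the question's data and confirms that only the named column is filled and that it stays a factor. The `.direction = "down"` argument is the default and is written out only to make the direction explicit; the name `filled` is just for illustration.

```r
library(tidyr)
library(tibble)

data <- tibble(
  site = factor(c("A", "A", NA, "B", "B", NA, "C", NA, "C")),
  value = c(1, 2, NA, 1, 2, NA, 1, NA, 2)
)

# Carry the last non-missing site downwards; 'value' is not touched
filled <- fill(data, site, .direction = "down")

print(filled)
```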

An option with na.locf0 from zoo:
library(zoo)
data$site <- na.locf0(data$site)
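If adding a package dependency is not desirable, the same "last observation carried forward" idea can be sketched in base R. The function name `locf` below is hypothetical; it works by computing, for each position, the index of the most recent non-missing element.

```r
# Hypothetical helper: carry the last non-missing value forward
locf <- function(x) {
  # index of each non-NA element, 0 where the element is NA
  idx <- seq_along(x) * (!is.na(x))
  # running maximum gives the index of the most recent non-NA element
  idx <- cummax(idx)
  # positions before the first non-NA element have nothing to carry
  idx[idx == 0] <- NA
  x[idx]
}

site <- factor(c("A", "A", NA, "B", "B", NA, "C", NA, "C"))
site_filled <- locf(site)
print(site_filled)
```

Because the result is produced by indexing the original vector, a factor stays a factor with the same levels.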

How to aggregate the data by filtering it out?

I have a data set something like this:
df_A <- tribble(
~product_name, ~position, ~cat_id, ~pr,
"A", 1, 1, "X",
"A", 4, 2, "X",
"A", 3, 3, "X",
"B", 4, 5, NA,
"B", 6, 6, NA,
"C", 3, 1, "Y",
"C", 5, 2, "Y",
"D", 6, 2, "Z",
"D", 4, 8, "Z",
"D", 3, 9, "Z",
)
Now I want to look up the values 1 and 2 in cat_id and, for each product_name, return the matching position. If a product has no 1 or 2 in cat_id, the corresponding position should be NA. Please see my desired data set to get a better understanding:
desired <- tribble(
~product_name, ~position_1, ~position_2, ~pr,
"A", 1, 4, "X",
"B", NA, NA, NA,
"C", 3, 5, "Y",
"D", NA, 6, "Z",
)
How can I get it?

We can filter the rows on 'cat_id'; because some 'product_name' values then drop out entirely, use complete() to restore them, and pivot_wider() to reshape into 'wide' format. The final select() drops the spurious column named NA that the restored rows create.
library(dplyr)
library(tidyr)
library(stringr)
df_A %>%
  filter(cat_id %in% 1:2) %>%
  mutate(cat_id = str_c('position_', cat_id)) %>%
  complete(product_name = unique(df_A$product_name)) %>%
  pivot_wider(names_from = cat_id, values_from = position) %>%
  select(-`NA`)
# A tibble: 4 x 4
# product_name pr position_1 position_2
# <chr> <chr> <dbl> <dbl>
#1 A X 1 4
#2 B <NA> NA NA
#3 C Y 3 5
#4 D Z NA 6
Or using reshape/subset from base R
reshape(merge(data.frame(product_name = unique(df_A$product_name)),
subset(df_A, cat_id %in% 1:2), all.x = TRUE),
idvar = c('product_name', 'pr'), direction = 'wide', timevar = 'cat_id')[-5]
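Another possible route, sketched here as an alternative rather than as part of the answer above, is to skip the reshape and look up each position directly with match(), which returns NA when the value is absent. The data is rebuilt as a plain data.frame so the block is self-contained; the name `result` is only for illustration.

```r
library(dplyr)

df_A <- data.frame(
  product_name = c("A", "A", "A", "B", "B", "C", "C", "D", "D", "D"),
  position     = c(1, 4, 3, 4, 6, 3, 5, 6, 4, 3),
  cat_id       = c(1, 2, 3, 5, 6, 1, 2, 2, 8, 9),
  pr           = c("X", "X", "X", NA, NA, "Y", "Y", "Z", "Z", "Z"),
  stringsAsFactors = FALSE
)

result <- df_A %>%
  group_by(product_name, pr) %>%
  summarise(
    # match() gives the row of the first cat_id equal to 1 (or 2), NA if none
    position_1 = position[match(1, cat_id)],
    position_2 = position[match(2, cat_id)],
    .groups = "drop"
  )

print(result)
```

This assumes each cat_id appears at most once per product; with repeats, match() would silently pick the first occurrence.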

How to remove duplicates based on two columns with a condition?

I'd like to remove some duplicates but not all of them. I'll explain after showing the data I'm working with.
Here is a sample of my dataframe:
df <- data.frame("S" = c("A", "B", "C", "D", "E", "F"),
"D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/04/2019", "01/05/2019", "01/06/2019"),
"N" = c("001", "002", "003", "004", "005", "006"),
"R" = c("ABC1", "ABC1", "ABC2", "ABC2", "ABC2", "ABC2"),
"RF" = c("ABC1F", "ABC1F", "ABC2F", "ABC2F", "ABC2F", "ABC2F"),
"Des" = c("A", "A", "B", "B", "B", "B"),
"Q" = c(1, 2, 3, 4, 5, 6),
"U" = c(rep("A", 6)),
"P" = c(2, 3, 4, 4, 7, 7),
stringsAsFactors = FALSE)
And now some code I'm applying to this dataframe:
library(dplyr)
df$P <- round(as.double(df$P), digits = 2)
df <- df[order(df$R, df$P),]
df <- df %>%
  group_by(R) %>%
  mutate(price = P - min(P)) %>%
  ungroup()
df$Ecart <- df$price * as.double(df$Q)
df <- df %>%
  group_by(R) %>%
  mutate(EcartTotal = cumsum(Ecart)) %>%
  ungroup()
The result I'm expecting :
result <- data.frame("S" = c("A", "B", "C", "E", "F"),
"D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/05/2019", "01/06/2019"),
"N" = c("001", "002", "003", "005", "006"),
"R" = c("ABC1", "ABC1", "ABC2", "ABC2", "ABC2"),
"RF" = c("ABC1F", "ABC1F", "ABC2F", "ABC2F", "ABC2F"),
"Des" = c("A", "A", "B", "B", "B"),
"Q" = c(1, 2, 3, 5, 6),
"U" = c(rep("A", 5)),
"P" = c(2, 3, 4, 7, 7),
"price" = c(0, 1, 0, 3, 3),
"Ecart" = c(0, 2, 0, 15, 18),
"EcartTotal" = c(NA, 2, NA, NA, 33),
stringsAsFactors = FALSE)
To obtain this, I'd like to remove the duplicates in column R only if their price is equal to 0.
I'd also like to replace the value of EcartTotal with NA wherever it is not equal to the maximum value for its R.

We can filter based on the condition and then replace the non-maximum values of 'EcartTotal' with NA after grouping by 'R':
library(dplyr)
df %>%
  filter(!(duplicated(R) & price == 0)) %>%
  group_by(R) %>%
  mutate(EcartTotal = replace(EcartTotal, EcartTotal != max(EcartTotal), NA))
# A tibble: 5 x 12
# Groups: R [2]
# S D N R RF Des Q U P price Ecart EcartTotal
# <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 01/01/2019 001 ABC1 ABC1F A 1 A 2 0 0 NA
#2 B 01/02/2019 002 ABC1 ABC1F A 2 A 3 1 2 2
#3 C 01/03/2019 003 ABC2 ABC2F B 3 A 4 0 0 NA
#4 E 01/05/2019 005 ABC2 ABC2F B 5 A 7 3 15 NA
#5 F 01/06/2019 006 ABC2 ABC2F B 6 A 7 3 18 33
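Or apply the filter after the group_by step. Here NA^FALSE is 1 and NA^TRUE is NA, so the multiplication blanks every value that is not the group maximum:
df %>%
  group_by(R) %>%
  filter(!(row_number() > 1 & price == 0)) %>%
  mutate(EcartTotal = EcartTotal * NA^(EcartTotal != max(EcartTotal)))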
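The NA^(...) expression is compact but easy to misread. The short sketch below, using a made-up vector `x` purely for illustration, shows why it works (in R, anything raised to the power 0 is 1, including NA) and compares it with a more explicit ifelse() form.

```r
# Hypothetical cumulative totals within one group
x <- c(0, 15, 33)

# NA^FALSE is NA^0, which is 1; NA^TRUE is NA^1, which is NA
print(NA^c(TRUE, FALSE))

# Multiplying by that mask keeps only the maximum
keep_max <- x * NA^(x != max(x))

# A more explicit equivalent
explicit <- ifelse(x == max(x), x, NA)

print(keep_max)
print(explicit)
```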

Remove NAs after pivot_wider to match up rows

I spread a column using pivot_wider so I could compare two groups (var1 vs var2) using an xy plot. But I can't compare them because there is a corresponding NA in the column.
Here is an example dataframe:
df <- data.frame(group = c("a", "a", "b", "b", "c", "c"), var1 = c(3, NA, 1, NA, 2, NA),
var2 = c(NA, 2, NA, 4, NA, 8))
I would like it to look like:
df2 <- data.frame(group = c("a", "b", "c"), var1 = c(3, 1, 2),
var2 = c( 2, 4, 8))

You can use summarize, but this treats the symptom, not the cause. You probably have a column in id_cols that is one-to-one with your variable in values_from, which forces pivot_wider to keep one row per value.
library(dplyr)
df %>%
  group_by(group) %>%
  summarize_all(sum, na.rm = TRUE)
# A tibble: 3 x 3
group var1 var2
<fct> <dbl> <dbl>
1 a 3 2
2 b 1 4
3 c 2 8
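To illustrate the cause described above, here is a sketch with a hypothetical long-format table that carries an extra per-row identifier (`row_id` is an assumption, not taken from the question). Leaving it among the id columns produces the staggered NAs; restricting id_cols to the grouping column avoids them.

```r
library(tidyr)

# Hypothetical long data: row_id is unique per row
long <- data.frame(
  row_id = 1:6,
  group  = c("a", "a", "b", "b", "c", "c"),
  name   = rep(c("var1", "var2"), times = 3),
  value  = c(3, 2, 1, 4, 2, 8),
  stringsAsFactors = FALSE
)

# row_id is implicitly an id column, so every value gets its own row
staircase <- pivot_wider(long, names_from = name, values_from = value)

# Identify rows by 'group' only, so var1 and var2 line up
fixed <- pivot_wider(long, id_cols = group, names_from = name, values_from = value)

print(staircase)
print(fixed)
```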

This solution is a bit more robust, with a slightly more general data.frame to begin with:
df <- data.frame(col_1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
col_2 = c(1, 3, NA, NA, NA, NA, 4, NA, NA),
col_3 = c(NA, NA, 2, 5, NA, NA, NA, 5, NA),
col_4 = c(NA, NA, NA, NA, 5, 6, NA, NA, 7))
df %>% dplyr::group_by(col_1) %>%
dplyr::summarise_all(purrr::discard, is.na)

Here is a way to do it, assuming each group has only two rows and each column has exactly one non-NA value per group:
library(dplyr)
df %>% group_by(group) %>%
  summarise(var1 = max(var1, na.rm = TRUE),
            var2 = max(var2, na.rm = TRUE))
With na.rm=TRUE the NAs are ignored, so max() is taken over the single remaining value (the one that is not NA).

How can I fill columns based on values in another column? [duplicate]

This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Closed 3 years ago.
I have a large dataframe containing a cross table of keys from other tables. Instead of having multiple instances of key1 coupled with different values for key2 I would like there to be one row for each key1 with several columns instead.
I tried doing this with a for loop but it couldn't get it to work.
Here's an example. I have a data frame with the structure df1 and I would like it to have the structure of df2.
df1 <- data.frame(c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "d"),c(1, 2, 3, 2, 3, 1, 2, 3, 4, 5, 9))
names(df1) <- c("key1", "key2")
df2 <- data.frame(c("a", "b", "c", "d"), c(1, 2, 1, 9), c(2, 3, 2, NA), c(3, NA, 3, NA), c(NA, NA, 4, NA), c(NA, NA, 5, NA))
names(df2) <- c("key1", "key2_1", "key2_2", "key2_3", "key2_4", "key2_5")
I suspect this is possible using an approach utilizing apply but I haven't found a way yet. Any help is appreciated!

library(dplyr)
library(tidyr)
df1 %>%
group_by(key1) %>%
mutate(var = paste0("key2_", seq(n()))) %>%
spread(var, key2)
# # A tibble: 4 x 6
# # Groups: key1 [4]
# key1 key2_1 key2_2 key2_3 key2_4 key2_5
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 2 3 NA NA
# 2 b 2 3 NA NA NA
# 3 c 1 2 3 4 5
# 4 d 9 NA NA NA NA
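spread() still works but has been superseded in tidyr by pivot_wider(). A sketch of the same idea with the newer function, assuming a tidyr version that provides pivot_wider(), might look like this; the name `wide` is only for illustration.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  key1 = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "d"),
  key2 = c(1, 2, 3, 2, 3, 1, 2, 3, 4, 5, 9),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  group_by(key1) %>%
  mutate(var = paste0("key2_", row_number())) %>%  # running counter within each key1
  ungroup() %>%
  pivot_wider(names_from = var, values_from = key2)

print(wide)
```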
