Someone asked this already in a simpler version here, but I cannot quite get it to work for my case.
I have observational data on a number of individuals across multiple years for a set of questions, but not everyone is asked every question every year. I want to generate a new dataframe that has the most recent answer for each individual.
The data looks like this:
df <- data.frame(individual = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"), time = c(1:4), questionA = c("Yes", NA, "No", NA, "No", NA, "No", "Yes", "No", NA, NA, "No"), questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 5, NA))
The resulting dataframe for this example should look like this:
most_recent <- data.frame(individual = c("A", "B", "C"), questionA = c("No", "Yes", "No"), questionB = c(5, 4, 5))
Ideally I am looking for a dplyr solution. Thank you!
We can use dplyr's across() for this:
library(dplyr)
df %>%
  group_by(individual) %>%
  summarize(across(starts_with("question"), ~ last(na.omit(.))))
# # A tibble: 3 x 3
# individual questionA questionB
# <chr> <chr> <dbl>
# 1 A No 5
# 2 B Yes 4
# 3 C No 5
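Note that last() depends on row order, so the approach above only returns the most recent answer if the rows are already sorted by time within each individual. A minimal sketch, assuming the df from the question, that sorts explicitly first (the reversed copy `shuffled` is hypothetical and only there to simulate unsorted input):

```r
library(dplyr)

df <- data.frame(
  individual = rep(c("A", "B", "C"), each = 4),
  time = rep(1:4, times = 3),
  questionA = c("Yes", NA, "No", NA, "No", NA, "No", "Yes", "No", NA, NA, "No"),
  questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 5, NA),
  stringsAsFactors = FALSE
)

# Reverse the rows to simulate unsorted input (illustration only)
shuffled <- df[nrow(df):1, ]

most_recent <- shuffled %>%
  arrange(individual, time) %>%          # make "last" mean "latest time"
  group_by(individual) %>%
  summarize(across(starts_with("question"), ~ last(.x[!is.na(.x)])),
            .groups = "drop")
```

Dropping the NAs with `.x[!is.na(.x)]` is used here instead of na.omit() only to return a plain vector; the idea is the same.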
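My take in base R: filter df down to the row with the most recent time for each individual. Note that this returns the last row as-is, so it only gives the most recent answer when that row contains no NA (the example data below is modified accordingly).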
df <- data.frame(individual = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"),
                 time = c(1:4),
                 questionA = c("Yes", NA, NA, "No", "No", NA, NA, "Yes", "No", NA, NA, "No"),
                 questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 3, 5),
                 stringsAsFactors = FALSE)
#new column to use with %in%
df$match <- paste(df$individual, df$time)
#find the most recent sample for each individual
id <- unique(df$individual)
most_recent <- sapply(id, function(id){
time <- max(df$time[df$individual == id])
return(paste(id,time))
})
#filter df by most recent
final <- df[df$match %in% most_recent,]
final
individual time questionA questionB match
4 A 4 No 5 A 4
8 B 4 Yes 4 B 4
12 C 4 No 5 C 4
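With the data from the question, the row at the latest time can itself contain NA, so the last non-NA value has to be picked per column. A minimal base R sketch under that assumption (the helper name `last_non_na` is made up for illustration):

```r
df <- data.frame(
  individual = rep(c("A", "B", "C"), each = 4),
  time = rep(1:4, times = 3),
  questionA = c("Yes", NA, "No", NA, "No", NA, "No", "Yes", "No", NA, NA, "No"),
  questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 5, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last non-missing value of a vector, NA if none
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA else x[length(x)]
}

# Sort so that "last" means "latest time" within each individual
df <- df[order(df$individual, df$time), ]

# Apply the helper to each question column, one group per individual
most_recent <- aggregate(df[c("questionA", "questionB")],
                         by = list(individual = df$individual),
                         FUN = last_non_na)
```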
We could use slice_tail after filling the 'question' NA with the adjacent non-NA, grouped and ordered by 'individual', 'time' columns
library(dplyr)
library(tidyr)
df %>%
  arrange(individual, time) %>%
  select(-time) %>%
  group_by(individual) %>%
  fill(starts_with('question')) %>%
  slice_tail(n = 1) %>%
  ungroup()
Output:
# A tibble: 3 x 3
# individual questionA questionB
# <chr> <chr> <dbl>
#1 A No 5
#2 B Yes 4
#3 C No 5
I'm trying to replace NA values in a factor column with the value of the cell above. It would be great to have a tidyverse approach, but it doesn't matter too much if it's not.
I have data that looks like:
data <- tibble(site = as.factor(c("A", "A", NA, "B","B", NA,"C", NA, "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
And I need it to look like:
output <- tibble(site = as.factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C")),
                 value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
I've tried a few different approaches using lag and replace_na, although they have all basically amounted to the same thing:
data %>%
  mutate(site = as.character(site),
         site = ifelse(is.na(site), "zero", site),
         site = ifelse(site == "zero", lag(site), site),
         site = as.factor(site))
Thanks!
Try fill() from tidyr:
library(tidyverse)
#Code
data <- data %>% fill(site)
Output:
# A tibble: 9 x 2
site value
<fct> <dbl>
1 A 1
2 A 2
3 A NA
4 B 1
5 B 2
6 B NA
7 C 1
8 C NA
9 C 2
An option with na.locf0 from zoo:
library(zoo)
data$site <- na.locf0(data$site)
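Both options carry the last observed value forward. As a rough illustration of the idea only (not how tidyr or zoo implement it), here is a base R sketch with a made-up helper `locf`:

```r
# Hypothetical helper: carry the last non-NA value forward
locf <- function(x) {
  # Position of the most recent non-NA value at each element (0 if none yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  # Leading NAs have nothing to carry forward, so they stay NA
  idx[idx == 0] <- NA
  x[idx]
}

site <- factor(c("A", "A", NA, "B", "B", NA, "C", NA, "C"))
filled <- locf(site)
```

Because the helper only indexes into the original vector, the factor levels are kept.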
I have a data set something like this:
df_A <- tribble(
~product_name, ~position, ~cat_id, ~pr,
"A", 1, 1, "X",
"A", 4, 2, "X",
"A", 3, 3, "X",
"B", 4, 5, NA,
"B", 6, 6, NA,
"C", 3, 1, "Y",
"C", 5, 2, "Y",
"D", 6, 2, "Z",
"D", 4, 8, "Z",
"D", 3, 9, "Z",
)
Now, I want to look up the values 1 and 2 in cat_id and, for each product_name, return the matching position. If a product has no 1 or no 2 in cat_id, the corresponding position column should be NA. Please see my desired data set to get a better understanding:
desired <- tribble(
~product_name, ~position_1, ~position_2, ~pr,
"A", 1, 4, "X",
"B", NA, NA, NA,
"C", 3, 5, "Y",
"D", NA, 6, "Z",
)
How can I get it?
We can filter the rows based on the 'cat_id', then if some of the 'product_name' are missing, use complete to expand the dataset and use pivot_wider to reshape into 'wide' format
library(dplyr)
library(tidyr)
library(stringr)
df_A %>%
  filter(cat_id %in% 1:2) %>%
  mutate(cat_id = str_c('position_', cat_id)) %>%
  complete(product_name = unique(df_A$product_name)) %>%
  pivot_wider(names_from = cat_id, values_from = position) %>%
  select(-`NA`)
# A tibble: 4 x 4
# product_name pr position_1 position_2
# <chr> <chr> <dbl> <dbl>
#1 A X 1 4
#2 B <NA> NA NA
#3 C Y 3 5
#4 D Z NA 6
Or using reshape/subset from base R
reshape(merge(data.frame(product_name = unique(df_A$product_name)),
subset(df_A, cat_id %in% 1:2), all.x = TRUE),
idvar = c('product_name', 'pr'), direction = 'wide', timevar = 'cat_id')[-5]
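When only a couple of cat_id values are of interest, another option is to look them up directly with match() inside summarise(). This is only a sketch, assuming each cat_id appears at most once per product and that pr is constant within a product, as in the example data (rebuilt here as a plain data.frame):

```r
library(dplyr)

df_A <- data.frame(
  product_name = c("A", "A", "A", "B", "B", "C", "C", "D", "D", "D"),
  position     = c(1, 4, 3, 4, 6, 3, 5, 6, 4, 3),
  cat_id       = c(1, 2, 3, 5, 6, 1, 2, 2, 8, 9),
  pr           = c("X", "X", "X", NA, NA, "Y", "Y", "Z", "Z", "Z"),
  stringsAsFactors = FALSE
)

res <- df_A %>%
  group_by(product_name, pr) %>%
  summarise(
    # match() returns NA when the value is absent, so the lookup yields NA
    position_1 = position[match(1, cat_id)],
    position_2 = position[match(2, cat_id)],
    .groups = "drop"
  )
```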
I'd like to remove some duplicates but not all of them. I'll explain after showing the data I'm working with.
Here is a sample of my dataframe:
df <- data.frame("S" = c("A", "B", "C", "D", "E", "F"),
"D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/04/2019", "01/05/2019", "01/06/2019"),
"N" = c("001", "002", "003", "004", "005", "006"),
"R" = c("ABC1", "ABC1", "ABC2", "ABC2", "ABC2", "ABC2"),
"RF" = c("ABC1F", "ABC1F", "ABC2F", "ABC2F", "ABC2F", "ABC2F"),
"Des" = c("A", "A", "B", "B", "B", "B"),
"Q" = c(1, 2, 3, 4, 5, 6),
"U" = c(rep("A", 6)),
"P" = c(2, 3, 4, 4, 7, 7),
stringsAsFactors = FALSE)
And now some code I'm applying to this dataframe:
library(dplyr)
df$P <- round(as.double(df$P), digits = 2)
df <- df[order(df$R, df$P),]
df <- df %>%
group_by(R) %>%
mutate(price = P - min(P)) %>%
ungroup()
df$Ecart <- df$price * as.double(df$Q)
df <- df %>%
group_by(R) %>%
mutate(EcartTotal = cumsum(Ecart)) %>%
ungroup()
The result I'm expecting :
result <- data.frame("S" = c("A", "B", "C", "E", "F"),
"D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/05/2019", "01/06/2019"),
"N" = c("001", "002", "003", "005", "006"),
"R" = c("ABC1", "ABC1", "ABC2", "ABC2", "ABC2"),
"RF" = c("ABC1F", "ABC1F", "ABC2F", "ABC2F", "ABC2F"),
"Des" = c("A", "A", "B", "B", "B"),
"Q" = c(1, 2, 3, 5, 6),
"U" = c(rep("A", 5)),
"P" = c(2, 3, 4, 7, 7),
"price" = c(0, 1, 0, 3, 3),
"Ecart" = c(0, 2, 0, 15, 18),
"EcartTotal" = c(NA, 2, NA, NA, 33),
stringsAsFactors = FALSE)
To obtain this, I'd like to remove the duplicates in column R only if their price is equal to 0.
I'd also like to replace the values of EcartTotal with NA when they are not equal to the max value for each R.
We can filter based on the condition and then replace the value of 'EcartTotal' to NA after grouping by 'R'
library(dplyr)
df %>%
  filter(!(duplicated(R) & price == 0)) %>%
  group_by(R) %>%
  mutate(EcartTotal = replace(EcartTotal, EcartTotal != max(EcartTotal), NA))
# A tibble: 5 x 12
# Groups: R [2]
# S D N R RF Des Q U P price Ecart EcartTotal
# <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 01/01/2019 001 ABC1 ABC1F A 1 A 2 0 0 NA
#2 B 01/02/2019 002 ABC1 ABC1F A 2 A 3 1 2 2
#3 C 01/03/2019 003 ABC2 ABC2F B 3 A 4 0 0 NA
#4 E 01/05/2019 005 ABC2 ABC2F B 5 A 7 3 15 NA
#5 F 01/06/2019 006 ABC2 ABC2F B 6 A 7 3 18 33
Or apply the filter after the group_by step:
df %>%
  group_by(R) %>%
  filter(!(row_number() > 1 & price == 0)) %>%
  mutate(EcartTotal = EcartTotal * NA^(EcartTotal != max(EcartTotal)))
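The NA^(...) expression relies on R's rule that anything raised to the power 0 is 1, while NA^1 stays NA. A small sketch on a made-up vector `x` showing what the multiplication does, next to a more explicit replace() version:

```r
x <- c(0, 0, 15, 33)

# TRUE where the value should be blanked out
mask <- x != max(x)

# NA^TRUE is NA^1 (NA); NA^FALSE is NA^0 (1)
multiplier <- NA^mask

# Multiplying keeps the max and turns everything else into NA
res_trick <- x * multiplier

# Same result, written more explicitly
res_replace <- replace(x, mask, NA)
```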
I spread a column using pivot_wider so I could compare two groups (var1 vs var2) using an xy plot. But I can't compare them because there is a corresponding NA in the column.
Here is an example dataframe:
df <- data.frame(group = c("a", "a", "b", "b", "c", "c"), var1 = c(3, NA, 1, NA, 2, NA),
var2 = c(NA, 2, NA, 4, NA, 8))
I would like it to look like:
df2 <- data.frame(group = c("a", "b", "c"), var1 = c(3, 1, 2),
var2 = c( 2, 4, 8))
You can use summarize. But this treats the symptom not the cause. You may have a column in id_cols which is one-to-one with your variable in values_from.
library(dplyr)
df %>%
  group_by(group) %>%
  summarize_all(sum, na.rm = TRUE)
# A tibble: 3 x 3
group var1 var2
<fct> <dbl> <dbl>
1 a 3 2
2 b 1 4
3 c 2 8
This solution is a bit more robust, with a slightly more general data.frame to begin with:
df <- data.frame(col_1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
col_2 = c(1, 3, NA, NA, NA, NA, 4, NA, NA),
col_3 = c(NA, NA, 2, 5, NA, NA, NA, 5, NA),
col_4 = c(NA, NA, NA, NA, 5, 6, NA, NA, 7))
df %>% dplyr::group_by(col_1) %>%
dplyr::summarise_all(purrr::discard, is.na)
Here is a way to do it, assuming you only have two rows per group and one row with NA:
library(dplyr)
df %>% group_by(group) %>%
summarise(var1=max(var1,na.rm=TRUE),
var2=max(var2,na.rm=TRUE))
With na.rm=TRUE the NAs are ignored, so max is computed on the single remaining value (the one that is not NA).
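Both sum and max give the right answer here only because each group has exactly one non-NA value per column: sum would add up several values, and max would silently pick the largest. A hedged alternative that makes the intent explicit is to take the first non-NA value per column (the helper name `first_non_na` is made up for this sketch):

```r
library(dplyr)

df <- data.frame(group = c("a", "a", "b", "b", "c", "c"),
                 var1 = c(3, NA, 1, NA, 2, NA),
                 var2 = c(NA, 2, NA, 4, NA, 8))

# Hypothetical helper: first non-missing value, NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

df2 <- df %>%
  group_by(group) %>%
  summarise(across(c(var1, var2), first_non_na), .groups = "drop")
```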
I have a large dataframe containing a cross table of keys from other tables. Instead of having multiple instances of key1 coupled with different values for key2 I would like there to be one row for each key1 with several columns instead.
I tried doing this with a for loop but I couldn't get it to work.
Here's an example. I have a data frame with the structure df1 and I would like it to have the structure of df2.
df1 <- data.frame(c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "d"),c(1, 2, 3, 2, 3, 1, 2, 3, 4, 5, 9))
names(df1) <- c("key1", "key2")
df2 <- data.frame(c("a", "b", "c", "d"), c(1, 2, 1, 9), c(2, 3, 2, NA), c(3, NA, 3, NA), c(NA, NA, 4, NA), c(NA, NA, 5, NA))
names(df2) <- c("key1", "key2_1", "key2_2", "key2_3", "key2_4", "key2_5")
I suspect this is possible using an approach utilizing apply but I haven't found a way yet. Any help is appreciated!
library(dplyr)
library(tidyr)
df1 %>%
  group_by(key1) %>%
  mutate(var = paste0("key2_", seq(n()))) %>%
  spread(var, key2)
# # A tibble: 4 x 6
# # Groups: key1 [4]
# key1 key2_1 key2_2 key2_3 key2_4 key2_5
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 2 3 NA NA
# 2 b 2 3 NA NA NA
# 3 c 1 2 3 4 5
# 4 d 9 NA NA NA NA
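spread() has been superseded by pivot_wider() in tidyr, so the same reshaping can be sketched as follows (assuming tidyr 1.0.0 or later; the helper column name `var` is taken from the answer above):

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(key1 = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "d"),
                  key2 = c(1, 2, 3, 2, 3, 1, 2, 3, 4, 5, 9),
                  stringsAsFactors = FALSE)

wide <- df1 %>%
  group_by(key1) %>%
  # Number the rows within each key1 to build the new column names
  mutate(var = paste0("key2_", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = var, values_from = key2)
```

Combinations that do not exist, such as a fourth key2 for "a", are filled with NA.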