Remove NAs after pivot_wider to match up rows - r

I spread a column using pivot_wider so I could compare two groups (var1 vs var2) in an xy plot. But I can't compare them, because each value sits on its own row with an NA in the other column.
Here is an example dataframe:
df <- data.frame(group = c("a", "a", "b", "b", "c", "c"), var1 = c(3, NA, 1, NA, 2, NA),
var2 = c(NA, 2, NA, 4, NA, 8))
I would like it to look like:
df2 <- data.frame(group = c("a", "b", "c"), var1 = c(3, 1, 2),
var2 = c( 2, 4, 8))

You can use summarize(), but this treats the symptom, not the cause. You may have a column in id_cols that is one-to-one with your variable in values_from, which keeps pivot_wider from collapsing the rows.
library(dplyr)
df %>%
group_by(group) %>%
summarize_all(sum, na.rm = TRUE)
# A tibble: 3 x 3
group var1 var2
<fct> <dbl> <dbl>
1 a 3 2
2 b 1 4
3 c 2 8
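
To address the cause rather than the symptom, one option is to tell pivot_wider which columns identify a row. This is only a minimal sketch: the long data frame and its row_id column are made up for illustration (the question does not show the data before pivoting), and it assumes the stray NAs come from such a per-row column.

```r
library(tidyr)

# Hypothetical long-format data; row_id is an assumed column that is unique per row
long <- data.frame(
  row_id = 1:6,
  group  = c("a", "a", "b", "b", "c", "c"),
  name   = rep(c("var1", "var2"), times = 3),
  value  = c(3, 2, 1, 4, 2, 8)
)

# By default every non-pivoted column (row_id and group) identifies a row,
# so nothing collapses and the values end up staggered with NAs
staggered <- pivot_wider(long, names_from = name, values_from = value)

# Restricting id_cols to group gives one row per group
aligned <- pivot_wider(long, id_cols = group,
                       names_from = name, values_from = value)
```

Here staggered keeps six rows, while aligned has three rows and no NAs.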

This solution is a bit more robust, with a slightly more general data.frame to begin with:
df <- data.frame(col_1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
col_2 = c(1, 3, NA, NA, NA, NA, 4, NA, NA),
col_3 = c(NA, NA, 2, 5, NA, NA, NA, 5, NA),
col_4 = c(NA, NA, NA, NA, 5, 6, NA, NA, 7))
df %>% dplyr::group_by(col_1) %>%
dplyr::summarise_all(purrr::discard, is.na)

Here is a way to do it, assuming each group has only two rows and each variable is NA in one of them:
library(dplyr)
df %>% group_by(group) %>%
summarise(var1=max(var1,na.rm=TRUE),
var2=max(var2,na.rm=TRUE))
With na.rm = TRUE, max() ignores the NAs, so it returns the one non-NA value in each group.
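
Both sum and max above only work because each group holds a single non-NA value per column. When a group has no non-NA value at all, sum(na.rm = TRUE) returns 0 and max(na.rm = TRUE) returns -Inf with a warning. A possible variant, offered as a sketch rather than as part of any answer above, picks the first non-missing value directly and yields NA when there is none:

```r
library(dplyr)

df <- data.frame(group = c("a", "a", "b", "b", "c", "c"),
                 var1  = c(3, NA, 1, NA, 2, NA),
                 var2  = c(NA, 2, NA, 4, NA, 8))

collapsed <- df %>%
  group_by(group) %>%
  # keep the first non-NA value of each column; indexing past the end gives NA
  summarise(across(c(var1, var2), ~ .x[!is.na(.x)][1]))
```

Because it does no arithmetic, the same pattern also applies to character columns.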

Related

How to pivot longer with the grouping notation at the center of a dataframe?

I have a dataframe with the following column headers:
df <- data.frame(
ABC1_1_1DEF = c(1, 2, 3),
ABC1_2_1DEF = c(NA, 1, 2),
ABC1_3_1DEF = c(1, 1, NA),
ABC1_1_2DEF = c(3, NA, NA),
ABC1_2_2DEF = c(2, NA, NA),
ABC1_3_2DEF = c(NA, 1, 1)
)
I want to pivot the dataframe longer such that the middle number of each column is the group that contains the new columns:
df2 <- data.frame(
ABC1_1 = c(1, 2, 3, 3, NA, NA),
ABC1_2 = c(3, NA, NA, 2, NA, NA),
ABC1_3 = c(2, NA, NA, NA, 1, 1)
)
What's the best way to achieve this using R, ideally with dplyr?
To combine all the ABC1_1, ABC1_2 and ABC1_3 columns you can use -
tidyr::pivot_longer(df, cols = everything(),
names_to = '.value',
names_pattern = '([A-Z]+\\d+_\\d+)')
# ABC1_1 ABC1_2 ABC1_3
# <dbl> <dbl> <dbl>
#1 1 NA 1
#2 3 2 NA
#3 2 1 1
#4 NA NA 1
#5 3 2 NA
#6 NA NA 1
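
To see which part of each column name the pattern keeps, the regular expression can be tried on the names alone. This base R sketch only approximates what names_pattern captures; it is not how tidyr implements it.

```r
cols <- c("ABC1_1_1DEF", "ABC1_2_1DEF", "ABC1_3_1DEF",
          "ABC1_1_2DEF", "ABC1_2_2DEF", "ABC1_3_2DEF")

# Extract the leading letters, digits, underscore, digits from each name
captured <- regmatches(cols, regexpr("[A-Z]+\\d+_\\d+", cols, perl = TRUE))

# Distinct captured names; with '.value' these become the output columns
unique(captured)
```

Columns that share a captured name (for example ABC1_1_1DEF and ABC1_1_2DEF) are stacked into the same output column.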

Get most recent observation for variables asked at different time points

Someone asked this already in a simpler version here, but I cannot quite get it to work for my case.
I have observational data on a number of individuals across multiple years for a set of questions, but not everyone is asked every question every year. I want to generate a new dataframe that has the most recent answer for each individual.
The data looks like this:
df <- data.frame(individual = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"), time = c(1:4), questionA = c("Yes", NA, "No", NA, "No", NA, "No", "Yes", "No", NA, NA, "No"), questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 5, NA))
The resulting dataframe for this example should look like this:
most_recent <- data.frame(individual = c("A", "B", "C"), questionA = c("No", "Yes", "No"), questionB = c(5, 4, 5))
Ideally I am looking for a dplyr solution. Thank you!
We can use dplyr's across() for this:
df %>%
group_by(individual) %>%
summarize(across(starts_with("question"), ~ last(na.omit(.))))
# # A tibble: 3 x 3
# individual questionA questionB
# <chr> <chr> <dbl>
# 1 A No 5
# 2 B Yes 4
# 3 C No 5
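My take in base R: it filters df down to the row with the most recent time for each individual. It returns that row as-is, so it only gives the most recent answers when the last row has no NAs, as in the modified sample data below.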
df <- data.frame(individual = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"),
time = c(1:4),
questionA = c("Yes", NA, NA, "No", "No", NA, NA, "Yes", "No", NA, NA, "No"),
questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 3, 5), stringsAsFactors = FALSE)
#new column to use with %in%
df$match <- paste(df$individual, df$time)
#find the most recent sample for each individual
id <- unique(df$individual)
most_recent <- sapply(id, function(id){
time <- max(df$time[df$individual == id])
return(paste(id,time))
})
#filter df by most recent
final <- df[df$match %in% most_recent,]
final
individual time questionA questionB match
4 A 4 No 5 A 4
8 B 4 Yes 4 B 4
12 C 4 No 5 C 4
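
With the question's original data, where the latest row can itself contain an NA, a base R approach needs the last non-missing value per column rather than the last row. A minimal sketch using aggregate(); the helper name last_non_na is made up for illustration:

```r
df <- data.frame(
  individual = rep(c("A", "B", "C"), each = 4),
  time       = rep(1:4, times = 3),
  questionA  = c("Yes", NA, "No", NA, "No", NA, "No", "Yes", "No", NA, NA, "No"),
  questionB  = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 5, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last non-NA element of a vector, or NA if there is none
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x)) x[length(x)] else NA
}

# Sort by time within individual so that "last" means "most recent"
df <- df[order(df$individual, df$time), ]

most_recent <- aggregate(df[c("questionA", "questionB")],
                         by = list(individual = df$individual),
                         FUN = last_non_na)
```

For this data the result has one row per individual, with questionA equal to No, Yes, No and questionB equal to 5, 4, 5.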
We could also order the rows by 'individual' and 'time', group by 'individual', fill each 'question' NA with the previous non-NA value, and then take the last row of each group with slice_tail:
library(dplyr)
library(tidyr)
df %>%
arrange(individual, time) %>%
select(-time) %>%
group_by(individual) %>%
fill(starts_with('question')) %>%
slice_tail(n = 1) %>%
ungroup()
Output:
# A tibble: 3 x 3
# individual questionA questionB
# <chr> <chr> <dbl>
#1 A No 5
#2 B Yes 4
#3 C No 5

How to replace factor NA's with the level of the cell above

I'm trying to replace NA values in a factor column with the value of the cell above. It would be great to have a tidyverse approach, but it doesn't matter too much if it's not.
I have data that looks like:
data <- tibble(site = as.factor(c("A", "A", NA, "B","B", NA,"C", NA, "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
And I need it to look like:
output <- tibble(site = as.factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C")),
value = c(1, 2, NA, 1, 2, NA, 1, NA, 2))
I've tried a few different approaches using lag and replace_na, although they have basically all amounted to the same thing:
mutate(site = as.character(site),
site = ifelse(is.na(site), "zero", site),
site = ifelse(site == "zero", lag(site), site),
site = as.factor(site))
Thanks!
Try fill() from tidyr:
library(tidyverse)
#Code
data <- data %>% fill(site)
Output:
# A tibble: 9 x 2
site value
<fct> <dbl>
1 A 1
2 A 2
3 A NA
4 B 1
5 B 2
6 B NA
7 C 1
8 C NA
9 C 2
An option with na.locf0 from zoo:
library(zoo)
# overwrite the site column with its NAs filled from the value above
data$site <- na.locf0(data$site)
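
For reference, the idea behind both answers ("last observation carried forward") can be sketched in base R without extra packages. The function name locf below is made up for illustration and is not the implementation used by tidyr or zoo:

```r
# Hypothetical helper: carry the last non-NA value forward
locf <- function(x) {
  # position of the most recent non-NA element seen so far (0 if none yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  # leading NAs have nothing to carry forward, so they stay NA
  idx[idx == 0] <- NA
  x[idx]
}

site <- factor(c("A", "A", NA, "B", "B", NA, "C", NA, "C"))
filled <- locf(site)
```

Because the result is built by indexing, filled is still a factor with the original levels.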

Create date range based on sparse variable by group in R

I have sparse data with a score taken at periodic intervals and a measurement taken at more regular intervals for multiple subjects, along with the corresponding dates. I would like to generate date ranges based on the score dates for each subject ID, i.e. starting at the score date and ending at the next score date (or starting/ending at the first/last subject observation if the score doesn't fall on those dates).
I would then like to average the measurement variable within these date ranges. The averaging step should be straightforward but I am stuck on generating the date ranges.
Below is a sample of the data and an example of how I would envision the resulting data
sample data:
structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B",
"B", "B", "C", "C", "C", "D", "D", "D", "D", "D", "D", "D", "D",
"D", "D", "D", "D", "D", "D", "D"), date = c("1/21/2020", "1/27/2020",
"2/1/2020", "2/3/2020", "2/5/2020", "2/6/2020", "2/8/2020", "2/9/2020",
"2/11/2020", "2/12/2020", "2/13/2020", "2/15/2020", "2/18/2020",
"2/20/2020", "2/21/2020", "2/22/2020", "2/25/2020", "2/1/2020",
"2/5/2020", "2/7/2020", "2/8/2020", "2/11/2020", "2/12/2020",
"1/30/2020", "2/10/2020", "2/11/2020", "2/6/2020", "2/7/2020",
"2/8/2020", "2/9/2020", "2/11/2020", "2/13/2020", "2/14/2020",
"2/16/2020", "2/17/2020", "2/20/2020", "2/23/2020", "2/26/2020",
"3/1/2020", "3/3/2020", "3/5/2020"), score = c(0.5, 2, NA, NA,
3, NA, NA, NA, NA, NA, 2.5, NA, NA, 1.5, NA, NA, NA, 3, NA, NA,
2.5, NA, 1, 0.5, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 14,
NA, NA, 11.5, NA, 9.5, NA), measure = c(0.394160734, 0.722462998,
0.82984815, 0.738432745, 0.321792398, 0.167492308, 0.218020898,
0.929210786, 0.686818585, 0.939678073, 0.708172942, 0.299863884,
0.48216267, 0.290307369, 0.801947902, 0.579418467, 0.78101844,
0.219494852, 0.875129822, 0.517971003, 0.475625007, 0.723003744,
0.257473477, 0.629818537, 0.817369151, 0.628573413, 0.364660834,
0.5971024, 0.002274261, 0.318937617, 0.983917106, 0.685933928,
0.487922831, 0.151769304, 0.392413694, 0.012429414, 0.149627658,
0.011724992, 0.536998203, 0.798399999, 0.763353822)), class = "data.frame", row.names = c(NA,
-41L))
answer data:
structure(list(ID = c("A", "A", "A"), startDate = c("1/21/2020",
"1/27/2020", "2/5/2020"), endDate = c("1/27/2020", "2/5/2020",
"2/13/2020"), score = c(0.5, 2, 3), measure = c(0.394160734,
0.763581298, 0.543835508)), class = "data.frame", row.names = c(NA,
-3L))
Here's a way with dplyr :
library(dplyr)
df %>%
group_by(ID, grp = cumsum(!is.na(score))) %>%
summarise(start_date = first(date),
score = first(score),
measure = mean(measure)) %>%
mutate(end_date = lead(start_date, default = last(start_date))) %>%
select(-grp)
# ID start_date score measure end_date
# <chr> <chr> <dbl> <dbl> <chr>
# 1 A 1/21/2020 0.5 0.394 1/27/2020
# 2 A 1/27/2020 2 0.764 2/5/2020
# 3 A 2/5/2020 3 0.544 2/13/2020
# 4 A 2/13/2020 2.5 0.497 2/20/2020
# 5 A 2/20/2020 1.5 0.613 2/20/2020
# 6 B 2/1/2020 3 0.538 2/8/2020
# 7 B 2/8/2020 2.5 0.599 2/12/2020
# 8 B 2/12/2020 1 0.257 2/12/2020
# 9 C 1/30/2020 0.5 0.692 1/30/2020
#10 D 2/6/2020 NA 0.449 2/17/2020
#11 D 2/17/2020 14 0.185 2/26/2020
#12 D 2/26/2020 11.5 0.274 3/3/2020
#13 D 3/3/2020 9.5 0.781 3/3/2020
Using data.table
library(data.table)
setDT(df)[, .(start_date = first(date),
score = first(score),
measure = mean(measure)),
by = .(ID, grp = cumsum(!is.na(score)))
][, end_date := shift(start_date, type = 'lead', fill = last(start_date)), by = ID  # lead within each ID, as in the grouped dplyr version
][, grp := NULL][]
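
Both answers rely on cumsum(!is.na(score)) to number the ranges. A small sketch of that step on a made-up score vector (not taken from the sample data), together with the date parsing that would be needed before doing date arithmetic on the character dates:

```r
# Scores for one hypothetical subject; the first observation has no score yet
score <- c(NA, 0.5, NA, 2, NA, NA, 3)

# Each non-NA score starts a new group; NAs inherit the running count
grp <- cumsum(!is.na(score))
grp
# 0 1 1 2 2 2 3

# Parsing the month/day/year strings makes the dates sortable and subtractable
dates <- as.Date(c("1/21/2020", "2/5/2020"), format = "%m/%d/%Y")
gap_days <- as.numeric(diff(dates))
```

Rows that come before a subject's first score fall into group 0, which is why the first range for ID D in the dplyr output above has an NA score.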

How to aggregate the data by filtering it out?

I have a data set something like this:
df_A <- tribble(
~product_name, ~position, ~cat_id, ~pr,
"A", 1, 1, "X",
"A", 4, 2, "X",
"A", 3, 3, "X",
"B", 4, 5, NA,
"B", 6, 6, NA,
"C", 3, 1, "Y",
"C", 5, 2, "Y",
"D", 6, 2, "Z",
"D", 4, 8, "Z",
"D", 3, 9, "Z",
)
Now, I want to look up 1 and 2 in cat_id and find the matching position for each product_name. If a product has neither 1 nor 2 in cat_id, it should still appear, with position_1, position_2 and pr set to NA. Please see my desired data set to get a better understanding:
desired <- tribble(
~product_name, ~position_1, ~position_2, ~pr,
"A", 1, 4, "X",
"B", NA, NA, NA,
"C", 3, 5, "Y",
"D", NA, 6, "Z",
)
How can I get it?
We can filter the rows on 'cat_id', use complete to add back any 'product_name' that was dropped by the filter, and then reshape to 'wide' format with pivot_wider. The completed rows have an NA 'cat_id', which produces an extra column named NA that is removed at the end.
library(dplyr)
library(tidyr)
library(stringr)
df_A %>%
filter(cat_id %in% 1:2) %>%
mutate(cat_id = str_c('position_', cat_id)) %>%
complete(product_name = unique(df_A$product_name)) %>%
pivot_wider(names_from = cat_id, values_from = position) %>%
select(-`NA`)
# A tibble: 4 x 4
# product_name pr position_1 position_2
# <chr> <chr> <dbl> <dbl>
#1 A X 1 4
#2 B <NA> NA NA
#3 C Y 3 5
#4 D Z NA 6
Or using reshape/subset from base R
reshape(merge(data.frame(product_name = unique(df_A$product_name)),
subset(df_A, cat_id %in% 1:2), all.x = TRUE),
idvar = c('product_name', 'pr'), direction = 'wide', timevar = 'cat_id')[-5]
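
A different technique from the two answers above, shown only as a sketch: look up each wanted cat_id with match() inside summarise(). It assumes each cat_id occurs at most once per product (match() returns the first hit), and it rebuilds df_A as a plain data frame so the example is self-contained.

```r
library(dplyr)

df_A <- data.frame(
  product_name = c("A", "A", "A", "B", "B", "C", "C", "D", "D", "D"),
  position     = c(1, 4, 3, 4, 6, 3, 5, 6, 4, 3),
  cat_id       = c(1, 2, 3, 5, 6, 1, 2, 2, 8, 9),
  pr           = c("X", "X", "X", NA, NA, "Y", "Y", "Z", "Z", "Z")
)

res <- df_A %>%
  group_by(product_name, pr) %>%
  # match() returns NA when the cat_id is absent, and position[NA] is NA
  summarise(position_1 = position[match(1, cat_id)],
            position_2 = position[match(2, cat_id)],
            .groups = "drop")
```

Products without a matching cat_id stay in the result because no rows are filtered out before grouping.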
