R: Cut dataframe but ensure all steps

Let's say I have this data:
test_data <- dplyr::tibble(
  ID = c(1, 1, 1, 1, 1, 1, 1),
  values = c(40, 41, 38, 36, 35, 36, 30),
  times = c(as.POSIXct("2020-01-01 00:00:00"),
            as.POSIXct("2020-01-01 15:00:00"),
            as.POSIXct("2020-01-01 18:00:00"),
            as.POSIXct("2020-01-02 14:00:00"),
            as.POSIXct("2020-01-03 20:00:00"),
            as.POSIXct("2020-01-05 10:00:00"),
            as.POSIXct("2020-01-05 14:00:00")))
I now want to extract the last value of each day, beginning with the first timestep. For that I do:
library(dplyr)  # provides %>%, group_by(), filter() and select()
test_data %>%
  dplyr::mutate(diff = as.double.difftime(times - min(times), units = "days")) %>%
  dplyr::mutate(day = cut(diff, breaks = 0:6, include.lowest = TRUE, right = TRUE, ordered_result = TRUE)) %>%
  group_by(ID, day) %>%
  filter(row_number() == n()) %>%
  select(ID, day, values) %>%
  tidyr::pivot_wider(names_from = day, values_from = values)
which gives:
ID `[0,1]` `(1,2]` `(2,3]` `(4,5]`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 38 36 35 30
However, as you can see, a step is missing because we have no data from day 3 to 4. Is there a way to ensure that all intervals are included in the result and that NA is placed where data is missing?
My only idea would be to add a "dummy user" to the dataframe that has data for all intervals, so that every interval is guaranteed to appear.
So what I want is:
ID `[0,1]` `(1,2]` `(2,3]` `(3,4]` `(4,5]`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 38 36 35 NA 30

You can fill in the missing rows by looking for dates that are absent from your dataset, like this:
seq_dates <- tibble(times = seq(min(unique(as.Date(test_data$times))), max(unique(as.Date(test_data$times))), by="days"))
missing_dates <- seq_dates %>% filter(!times %in% unique(as.Date(test_data$times)))
missing_dates$times <- as.POSIXct(missing_dates$times)
missing_dates$ID <- 1
missing_dates$values <- NA
missing_dates <- missing_dates %>% select(ID, values, times)
test_data <- test_data %>% bind_rows(missing_dates) %>% arrange(times)
And then execute your code:
test_data %>%
dplyr::mutate(diff = as.double.difftime(times - min(times), units = "days")) %>%
dplyr::mutate(day = cut(diff, breaks = 0:6, include.lowest = TRUE, right = TRUE, ordered_result = TRUE)) %>%
group_by(ID, day) %>%
filter(row_number()==n()) %>%
select(ID, day, values) %>%
tidyr::pivot_wider(names_from = day, values_from = values)
And get the desired result:
ID `[0,1]` `(1,2]` `(2,3]` `(3,4]` `(4,5]`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 38 36 35 NA 30
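An alternative that avoids adding dummy rows is sketched below. It relies on the fact that cut() returns a factor that keeps all of its levels, including empty ones, so tidyr::complete() can add the missing interval with NA. This is only a sketch under some assumptions that are not in the original post: the timestamps are parsed in UTC, the breaks are 0:5 (so that no trailing empty (5,6] column is created), and slice_tail() stands in for the filter(row_number() == n()) step.

```r
library(dplyr)
library(tidyr)

# Same data as in the question; tz = "UTC" is an assumption made here so that
# the result does not depend on the local time zone.
test_data <- tibble(
  ID = c(1, 1, 1, 1, 1, 1, 1),
  values = c(40, 41, 38, 36, 35, 36, 30),
  times = as.POSIXct(c("2020-01-01 00:00:00", "2020-01-01 15:00:00",
                       "2020-01-01 18:00:00", "2020-01-02 14:00:00",
                       "2020-01-03 20:00:00", "2020-01-05 10:00:00",
                       "2020-01-05 14:00:00"), tz = "UTC")
)

result <- test_data %>%
  arrange(ID, times) %>%
  mutate(diff = as.double(difftime(times, min(times), units = "days")),
         day = cut(diff, breaks = 0:5, include.lowest = TRUE, right = TRUE)) %>%
  group_by(ID, day) %>%
  slice_tail(n = 1) %>%        # last observation in each interval
  ungroup() %>%
  select(ID, day, values) %>%
  complete(ID, day) %>%        # adds unused factor levels, filled with NA
  pivot_wider(names_from = day, values_from = values)

result
```

Because complete() expands a factor column to all of its levels, the empty (3,4] interval shows up as a column containing NA without touching the input data.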

Related

Using Lubridate producing NA values

I have this data frame (df):
date Name Name_id x1 x2 x3 x4 x5 x6
01/01/2000 00:00 A U_12 1 1 1 1 1 1
01/01/2000 01:00 A U_12
01/01/2000 02:00
01/01/2000 03:00
....
I am trying to calculate monthly aggregates (mean, etc.) for some columns using lubridate.
What I did so far:
df$date <- dmy_hm(Sites_tot$date)
df$month <- floor_date(df$date,"month")
monthly_avgerage <- df %>%
group_by(Name, Name_id, month) %>%
summarize_at(vars(x1:x4), .funs = c("mean", "min", "max"), na.rm = TRUE)
The values seem okay, although some of the months are turned into NAs.
We can rewrite the summarise_at call with across:
library(dplyr)
df %>%
group_by(Name, Name_id, month) %>%
summarise(across(x1:x4, list(mean = ~ mean(.x, na.rm = TRUE),
min = ~ min(.x, na.rm = TRUE),
max = ~ max(.x, na.rm = TRUE))))
A reproducible example
iris %>%
group_by(Species) %>%
summarise(across(everything(), list(mean = ~ mean(.x, na.rm = TRUE),
min = ~ min(.x, na.rm = TRUE),
max = ~ max(.x, na.rm = TRUE))))
If I am not wrong, the challenge is to get the date column into datetime format.
Somehow date = dmy_hm(date) does not work, so here the seconds are appended and dmy_hms is used instead:
library(dplyr)
library(lubridate)
df %>%
mutate(date = dmy_hms(paste0(date, ":00")),
month = month(date)) %>%
group_by(Name, Name_id, month) %>%
summarise(across(x1:x4, list(mean = ~ mean(.x, na.rm = TRUE),
min = ~ min(.x, na.rm = TRUE),
max = ~ max(.x, na.rm = TRUE))), .groups = "drop")
Name Name_id month x1_mean x1_min x1_max x2_mean x2_min x2_max x3_mean x3_min x3_max
<chr> <chr> <dbl> <dbl> <int> <int> <dbl> <int> <int> <dbl> <int> <int>
1 A U_12 1 1.5 1 2 1.5 1 2 1.5 1 2
2 B U_13 1 3.5 3 4 3.5 3 4 3.5 3 4
# … with 3 more variables: x4_mean <dbl>, x4_min <int>, x4_max <int>
# ℹ Use `colnames()` to see all variable names
Fake data:
df <- structure(list(date = c("01/01/2000 00:00", "01/01/2000 01:00",
"01/01/2000 02:00", "01/01/2000 03:00"), Name = c("A", "A", "B",
"B"), Name_id = c("U_12", "U_12", "U_13", "U_13"), x1 = 1:4,
x2 = 1:4, x3 = 1:4, x4 = 1:4, x5 = 1:4, x6 = 1:4), class = "data.frame", row.names = c(NA,
-4L))
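One way to check where NA months could come from is sketched below. It assumes, which the question does not confirm, that some date strings fail to parse; the small data frame and the "not a date" entry are made up for illustration. It also uses floor_date(), as in the question, so that the same month in different years stays separate, whereas month() returns only the month number.

```r
library(dplyr)
library(lubridate)

# Hypothetical data: the last date string is deliberately unparseable.
df_demo <- data.frame(
  date = c("01/01/2000 00:00", "15/01/2000 01:00", "01/02/2000 02:00", "not a date"),
  x1 = c(1, 2, 3, 4)
)

parsed <- df_demo %>%
  mutate(date = suppressWarnings(dmy_hm(date)),  # unparseable strings become NA
         month = floor_date(date, "month"))      # first instant of each month

# How many rows could not be parsed?
n_failed <- sum(is.na(parsed$date))

# Aggregate only the rows with a valid month.
monthly <- parsed %>%
  filter(!is.na(month)) %>%
  group_by(month) %>%
  summarise(x1_mean = mean(x1, na.rm = TRUE))

n_failed
monthly
```

If n_failed is greater than zero on the real data, inspecting the offending strings is probably the quickest way to see why some months end up as NA.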

How to combine across, summarize, and n() in R to get number of non-NA values by column?

I have a list of questions, and I want to know how many rows have non-NA values using summarize. I want to use summarize because I'm already using that to calculate the average, which works in the below code. Why does the below code not work and how can I fix it?
library(dplyr)
test <- tibble(student = c("j", "c", "s"),
q1 = c(1, 2, 3),
q2 = c(NA_real_, NA_real_, 4),
q3 = c(43, NA_real_, 232))
test %>%
dplyr::summarise(n = across(starts_with("q"), ~n(.x)),
avg = across(contains("q"), ~ round(mean(.x, na.rm = T), 2)))
expected_outcome <- tibble(n_q1 = 3,
n_q2 = 1,
n_q3 = 2,
avg_q1 = 2,
avg_q2 = 4,
avg_q3 = 138)
library(dplyr)
test %>%
summarize(across(starts_with("q"), list(n = ~sum(!is.na(.)),
avg = ~mean(., na.rm = T)),
.names = "{.fn}_{.col}"))
From the ?across documentation, you can pass a list to the .fns argument:
A list of functions/lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x))
This will apply every function in that list to the columns you have specified. You can then use the .names argument of across to set the column names how you desire.
Output
n_q1 avg_q1 n_q2 avg_q2 n_q3 avg_q3
<int> <dbl> <int> <dbl> <int> <dbl>
1 3 2 1 4 2 138.
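As for why the original attempt fails: n() takes no arguments and returns the number of rows in the current group, so n(.x) is an error, and even a bare n() would count rows rather than non-missing values. The sketch below contrasts the two using the test data from the question; the column name n_rows is chosen here for illustration only.

```r
library(dplyr)

test <- tibble(student = c("j", "c", "s"),
               q1 = c(1, 2, 3),
               q2 = c(NA_real_, NA_real_, 4),
               q3 = c(43, NA_real_, 232))

counts <- test %>%
  summarise(n_rows = n(),                           # row count, ignores NA entirely
            across(starts_with("q"),
                   ~ sum(!is.na(.x)),               # non-NA count per column
                   .names = "n_{.col}"))

counts
```

Here n_rows is 3 for every column, while the per-column counts differ wherever values are missing.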
Update: Oops, I missed the whole question, sorry. Here is an alternative just for fun; the preferred answer is already given by @LMc:
library(dplyr)
test %>%
summarise(across(starts_with("q"), list(avg = ~mean(., na.rm = T)),
.names = "{.fn}_{.col}")) %>%
bind_cols(test %>% purrr::map_df(~sum(!is.na(.))))
avg_q1 avg_q2 avg_q3 student q1 q2 q3
<dbl> <dbl> <dbl> <int> <int> <int> <int>
1 2 4 138. 3 3 1 2
First, incomplete answer:
To count the non-NA values in every column of the dataset, we could do this:
library(dplyr)
test %>%
purrr::map_df(~sum(!is.na(.)))
student q1 q2 q3
<int> <int> <int> <int>
1 3 3 1 2

Why does dplyr's coalesce(.) and fill(.) not work and still leave missing values?

I have a simple test dataset that has many repeating rows for participants. I want one row per participant that doesn't have NAs, unless the participant has NAs for the entire column. I tried grouping by participant name and then using coalesce(.) and fill(.), but it still leaves missing values. Here's my test dataset:
library(dplyr)
library(tibble)
test_dataset <- tibble(name = rep(c("Justin", "Corey", "Sibley"), 4),
var1 = c(rep(c(NA), 10), 2, 3),
var2 = c(rep(c(NA), 9), 2, 4, 6),
var3 = c(10, 15, 7, rep(c(NA), 9)),
outcome = c(3, 9, 23, rep(c(NA), 9)),
tenure = rep(c(10, 15, 20), 4))
And here's what I get when I use coalesce(.) or fill(., direction = "downup"), which both produce the same result.
library(dplyr)
library(tibble)
test_dataset_coalesced <- test_dataset %>%
group_by(name) %>%
coalesce(.) %>%
slice_head(n=1) %>%
ungroup()
test_dataset_filled <- test_dataset %>%
group_by(name) %>%
fill(., .direction="downup") %>%
slice_head(n=1) %>%
ungroup()
And here's what I want--note, there is one NA because that participant only has NA for that column:
library(tibble)
correct <- tibble(name = c("Justin", "Corey", "Sibley"),
var1 = c(NA, 2, 3),
var2 = c(2, 4, 6),
var3 = c(10, 15, 7),
outcome = c(3, 9, 23),
tenure = c(10, 15, 20))
You can group_by the name column, then fill the NA (you need to fill every column using everything()) with the non-NA values within the group, then only keep the distinct rows.
library(tidyverse)
test_dataset %>%
group_by(name) %>%
fill(everything(), .direction = "downup") %>%
distinct()
# A tibble: 3 × 6
# Groups: name [3]
name var1 var2 var3 outcome tenure
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Justin NA 2 10 3 10
2 Corey 2 4 15 9 15
3 Sibley 3 6 7 23 20
Try this
cleaned<- test_dataset |>
dplyr::group_by(name) |>
tidyr::fill(everything(),.direction = "downup") |>
unique()
# To filter out the ones with all NAs
cleaned[rowSums(is.na(cleaned[, -1])) < ncol(cleaned[, -1]), ]
name var1 var2 var3 outcome tenure
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Justin NA 2 10 3 10
2 Corey 2 4 15 9 15
3 Sibley 3 6 7 23 20
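On the "why" part of the question: coalesce() combines several vectors element by element, taking the first non-missing value across its arguments, so it does not collapse the rows of a single data frame. A summarise-based sketch that takes the first non-NA value per column within each participant is shown below; it is an alternative to the fill() approach, not something from the original answers. Note that summarise() returns the groups sorted by name.

```r
library(dplyr)

test_dataset <- tibble(name = rep(c("Justin", "Corey", "Sibley"), 4),
                       var1 = c(rep(NA, 10), 2, 3),
                       var2 = c(rep(NA, 9), 2, 4, 6),
                       var3 = c(10, 15, 7, rep(NA, 9)),
                       outcome = c(3, 9, 23, rep(NA, 9)),
                       tenure = rep(c(10, 15, 20), 4))

one_row <- test_dataset %>%
  group_by(name) %>%
  # Keep the first non-NA value of each column; indexing an empty vector
  # with [1] yields NA, so all-NA columns stay NA.
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))

one_row
```

This yields one row per participant directly, so no distinct() or slice_head() step is needed afterwards.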

Converting rows into columns based on the values in the rows, in R

This is the table I am working with. I want to convert reservation into separate columns.
I want it to be transformed into something like this. I have been trying to do this using reshape2 and dplyr's separate but I didn't find a solution.
You may try
library(tidyverse)
df %>%
rowwise %>%
mutate(reservation_main = str_split(reservation,'_' ,simplify = T)[1],
reservation_no = paste0('_',str_split(reservation,'_' ,simplify = T)[2])) %>%
select(id, response_id, reservation_main, reservation_no) %>%
pivot_wider(names_from = reservation_no, values_from = response_id)
id reservation_main `_1` `_2` `_3`
<dbl> <chr> <dbl> <dbl> <dbl>
1 31100 A 1 1 0
2 31100 B 1 1 0
3 31100 C 1 0 0
Since reservation is in a consistent format, we can split it on the _ into two columns. Then we convert response to 0 and 1, drop response_id, and pivot to the wide format. I'm assuming that you don't want the _ before the numbers in the column names.
library(tidyverse)
df %>%
separate(reservation, c("reservation", "number"), sep = "_") %>%
mutate(response = ifelse(response == "yes", 1, 0)) %>%
select(-response_id) %>%
pivot_wider(names_from = "number", values_from = "response")
Output
# A tibble: 2 × 5
id reservation `1` `2` `3`
<dbl> <chr> <dbl> <dbl> <dbl>
1 31100 A 1 1 0
2 31100 B 1 1 0
If you do want to keep the _ in front of the numbers for the columns, then we could adjust the regex in separate.
df %>%
separate(reservation, c("reservation", "number"), sep = "(?=\\_)") %>%
mutate(response = ifelse(response == "yes", 1, 0)) %>%
select(-response_id) %>%
pivot_wider(names_from = "number", values_from = "response")
# A tibble: 2 × 5
id reservation `_1` `_2` `_3`
<dbl> <chr> <dbl> <dbl> <dbl>
1 31100 A 1 1 0
2 31100 B 1 1 0
Data
df <- structure(
list(
id = c(31100, 31100, 31100, 31100, 31100, 31100),
reservation = c("A_1", "A_2", "A_3", "B_1", "B_2", "B_3"),
response = c("yes", "yes", "no", "yes", "yes", "no"),
response_id = c(1,
1, 0, 1, 1, 0)
),
class = "data.frame",
row.names = c(NA,-6L)
)

Why is the NA/NaN error delivered here, and what can I do about it?

My data look like this:
library(tidyverse)
df <- tibble(
Type = c(rep("A", 2), rep("B", 2), rep("A", 2), rep("B", 2)),
Source = c(rep("X", 4), rep("Y", 4)),
ID = c(1001:1008),
January = c(11, 22, 10, 30, NA, NA, NA, NA),
February = c(10, 42, 15, 27, NA, NA, NA, NA)
)
(In reality there are many more columns over multiple years, and some of them are non-NA in the Y rows. But this will do for my question.)
I want to make the manipulation...
newDF <- df %>%
group_by(Type, Source) %>%
summarize(theTotal = sum(January:February, na.rm = TRUE))
...but I get the error Error in January:February : NA/NaN argument. I know why I am getting this error: January and February are NA in some rows. I would get this error even if February had numbers in those rows, as long as January was still NA.
My questions are: 1) Why isn't na.rm = TRUE enough to prevent this from happening? 2) What, if anything, can I do to my code to make sure I get 0 for those combinations of A/B and Y?
In this case, we can use summarise_at and then create a single column with the sum. After grouping by the columns of interest, summarise_at gives the sum of each of the columns 'January' to 'February' as a single row per group; we then ungroup and add those columns together with rowSums.
library(dplyr)
df %>%
group_by(Type, Source) %>%
summarise_at(vars(January:February), sum, na.rm = TRUE) %>%
ungroup %>%
transmute(Type, Source,
theTotal = rowSums(select(.,January:February), na.rm = TRUE))
# A tibble: 4 x 3
# Type Source theTotal
# <chr> <chr> <dbl>
#1 A X 85
#2 A Y 0
#3 B X 82
#4 B Y 0
Or another option is
library(purrr)
df %>%
group_split(Type, Source) %>%
map_dfr(~ .x %>%
summarise(Type = first(Type), Source = first(Source),
theTotal = select(., January:February) %>% unlist %>% sum(., na.rm = TRUE)))
# A tibble: 4 x 3
# Type Source theTotal
# <chr> <chr> <dbl>
#1 A X 85
#2 A Y 0
#3 B X 82
#4 B Y 0
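Regarding the first question: inside sum(), January:February is not a column selection but base R's sequence operator applied to the column values, and it is evaluated before sum() ever sees na.rm, so NA endpoints raise the error. A sketch of a more direct fix is below; it assumes dplyr 1.1.0 or later for pick(), which turns the column range into a data frame that sum() can add up.

```r
library(dplyr)

df <- tibble(
  Type = c(rep("A", 2), rep("B", 2), rep("A", 2), rep("B", 2)),
  Source = c(rep("X", 4), rep("Y", 4)),
  ID = 1001:1008,
  January = c(11, 22, 10, 30, NA, NA, NA, NA),
  February = c(10, 42, 15, 27, NA, NA, NA, NA)
)

# The sequence operator itself fails on NA endpoints; na.rm is never consulted.
seq_error <- tryCatch(NA_real_:NA_real_, error = function(e) conditionMessage(e))

newDF <- df %>%
  group_by(Type, Source) %>%
  # pick() treats January:February as a tidyselect range of columns.
  summarise(theTotal = sum(pick(January:February), na.rm = TRUE),
            .groups = "drop")

newDF
```

With na.rm = TRUE, groups whose values are all NA sum to 0, which matches the totals shown in the answers above.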

Resources