Failed to use map2 with mutate with purrr and dplyr - r

I am reading a list of files from my computer and applying several transformations to them with purrr and dplyr. Everything works, but I also have a vector with the ID of each data frame created, and I want to add a column holding that ID to each data frame.
Loading libraries
library(readr)
library(lubridate)
library(dplyr)
library(purrr)
Listing the files to be read and modified
ArchivosTemp <- list.files(pattern = "Tem.csv")
For reproducibility, let's say the list of data frames produced by the first step of the pipeline (reading the files) is called Temperaturas:
Temperaturas <- list(structure(list(`Date/Time` = c("01-07-2016 14:55", "01-07-2016 15:55",
"01-07-2016 16:55", "01-07-2016 17:55", "01-07-2016 18:55", "01-07-2016 19:55"
), Unit = c("C", "C", "C", "C", "C", "C"), Value = c(28L, 24L,
25L, 25L, 25L, 25L), a = c(68L, 682L, 182L, 182L, 182L, 182L)), .Names = c("Date/Time",
"Unit", "Value", "a"), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame")), structure(list(`Date/Time` = c("12-06-2016 19:44",
"12-06-2016 20:44", "12-06-2016 21:44", "12-06-2016 22:44", "12-06-2016 23:44",
"13-06-2016 0:44"), Unit = c("C", "C", "C", "C", "C", "C"), Value = c(31L,
29L, 27L, 26L, 26L, 24L), a = c(129L, 131L, 632L, 633L, 133L,
633L)), .Names = c("Date/Time", "Unit", "Value", "a"), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame")), structure(list(
`Date/Time` = c("07-06-16 7:54:01", "07-06-16 8:54:01", "07-06-16 9:54:01",
"07-06-16 10:54:01", "07-06-16 11:54:01", "07-06-16 12:54:01"
), Unit = c("C", "C", "C", "C", "C", "C"), Value = c(23L,
19L, 25L, 27L, 30L, 34L), a = c("119", "116", "119", "119",
"118", "113")), .Names = c("Date/Time", "Unit", "Value",
"a"), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)))
and a vector with the ID of each element of the list
IDs <- c("H1F102", "H1F105", "H1F106")
The purrr code that is working so far
a <- ArchivosTemp %>% map(read_csv) %>% map(~rename(.x, Temperatura = Value, Date.Time = `Date/Time`)) %>% map(~mutate(.x, Date.Time = dmy_hms(Date.Time))) %>% map(~select(.x, Date.Time, Temperatura))
Since you can't read the CSVs from my computer, let's replace ArchivosTemp %>% map(read_csv) with the list that I made above
a <- Temperaturas %>% map(~rename(.x, Temperatura = Value, Date.Time = `Date/Time`)) %>% map(~mutate(.x, Date.Time = dmy_hms(Date.Time))) %>% map(~select(.x, Date.Time, Temperatura))
Then I want each of the 3 data frames to have a column called ID with its corresponding element in the IDs vector. I tried this:
a <- Temperaturas %>% map(~rename(.x, Temperatura = Value, Date.Time = `Date/Time`)) %>% map(~mutate(.x, Date.Time = dmy_hms(Date.Time))) %>% map(~select(.x, Date.Time, Temperatura)) %>% map2(y = IDs,~mutate(.x, ID = y.))
but it does not work. Any ideas what I am doing wrong?
Expected outcome
As an example, this is the result I expect, using only the first data frame:
a <- Temperaturas %>% map(~rename(.x, Temperatura = Value, Date.Time = `Date/Time`)) %>% map(~mutate(.x, Date.Time = dmy_hms(Date.Time))) %>% map(~select(.x, Date.Time, Temperatura))
mutate(a[[1]], ID = IDs[1])
which turns into
# A tibble: 6 x 3
Date.Time Temperatura ID
<dttm> <int> <chr>
1 2020-07-01 16:14:55 28 H1F102
2 2020-07-01 16:15:55 24 H1F102
3 2020-07-01 16:16:55 25 H1F102
4 2020-07-01 16:17:55 25 H1F102
5 2020-07-01 16:18:55 25 H1F102
6 2020-07-01 16:19:55 25 H1F102

You have a minor parameter problem with map2: its arguments are named .x and .y, so changing y to .y works for me:
map2(.y = IDs, ~ mutate(.x, ID = .y))
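To see the fix in isolation, here is a minimal sketch on a toy list. The frames list and its values are made up for illustration; only the IDs vector comes from the question. Inside a purrr formula the two inputs of map2 are exposed as .x and .y, so y or y. does not refer to anything:

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for the cleaned list of data frames
frames <- list(
  tibble(Temperatura = c(28L, 24L)),
  tibble(Temperatura = c(31L, 29L)),
  tibble(Temperatura = 23L)
)
IDs <- c("H1F102", "H1F105", "H1F106")

# .x is the current data frame, .y is the matching element of IDs
tagged <- map2(frames, IDs, ~ mutate(.x, ID = .y))

tagged[[2]]
```

Each element of tagged keeps its own rows and gains an ID column filled with the ID at the same position in IDs.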
Also, if you eventually need to bind all elements of the list into a single data frame, you can name the list with set_names and the IDs vector, then pass the .id argument to map_df. map_df applies the function to each data frame, combines the results with bind_rows, and stores the list names in a new column named by .id:
Temperaturas %>%
set_names(IDs) %>%
map_df(~ transmute(.x, Date.Time=dmy_hms(`Date/Time`), Temperatura=Value), .id="ID")
# A tibble: 18 x 3
# ID Date.Time Temperatura
# <chr> <dttm> <int>
# 1 H1F102 2020-07-01 16:14:55 28
# 2 H1F102 2020-07-01 16:15:55 24
# 3 H1F102 2020-07-01 16:16:55 25
# 4 H1F102 2020-07-01 16:17:55 25
# 5 H1F102 2020-07-01 16:18:55 25
# 6 H1F102 2020-07-01 16:19:55 25
# 7 H1F105 2020-06-12 16:19:44 31
# 8 H1F105 2020-06-12 16:20:44 29
# 9 H1F105 2020-06-12 16:21:44 27
#10 H1F105 2020-06-12 16:22:44 26
#11 H1F105 2020-06-12 16:23:44 26
#12 H1F105 2020-06-13 16:00:44 24
#13 H1F106 2016-06-07 07:54:01 23
#14 H1F106 2016-06-07 08:54:01 19
#15 H1F106 2016-06-07 09:54:01 25
#16 H1F106 2016-06-07 10:54:01 27
#17 H1F106 2016-06-07 11:54:01 30
#18 H1F106 2016-06-07 12:54:01 34
Note that you can use transmute as a shorthand for rename %>% mutate %>% select.
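As a rough check of that shorthand, the sketch below runs both versions on a small tibble and compares them. The tibble is a hypothetical two-row excerpt shaped like the third data frame in the question:

```r
library(dplyr)
library(lubridate)

# Hypothetical two-row excerpt with the same columns as the question's data
df <- tibble(
  `Date/Time` = c("07-06-16 7:54:01", "07-06-16 8:54:01"),
  Unit = c("C", "C"),
  Value = c(23L, 19L)
)

# Three separate verbs, as in the question
long_way <- df %>%
  rename(Temperatura = Value, Date.Time = `Date/Time`) %>%
  mutate(Date.Time = dmy_hms(Date.Time)) %>%
  select(Date.Time, Temperatura)

# transmute creates the new columns and drops everything else in one step
short_way <- df %>%
  transmute(Date.Time = dmy_hms(`Date/Time`), Temperatura = Value)

isTRUE(all.equal(long_way, short_way))
```

Both pipelines should produce the same two columns in the same order, because transmute keeps only the columns it is asked to create.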

Related

Calculate or Filter the wrong date entries of two Date columns in R

I am trying to figure out how to filter the wrong entries, or calculate the difference between two Date columns of the same data frame, in R. The scenario: I have a Patient table with two columns, Patient_admit and Patient_discharge. How can I find the rows where the date entered for Patient_discharge is before Patient_admit? In the data frame example below, the entries of patients 2 and 6 are incorrect.
Output of running:
dput(head(patient))
structure(list(id = c(1003L, 1005L, 1006L, 1007L, 1010L, 1010L
), date_admit = structure(c(115L, 18L, 138L,
91L, 34L, 278L), .Label = c("01/01/2020", "01/02/2020", "01/03/2020",............,
date_discharge = structure(c(143L, 130L, 181L, 156L, 198L,
86L), .Label = c("01/01/2020", "01/01/2021", "01/02/2020",
............., class = "factor")), row.names = c(NA, 6L), class = "data.frame")
The list of dates is very long, so I replaced it with "..........." for readability. Thanks
Another possible solution, based on lubridate::dmy:
library(dplyr)
library(lubridate)
df %>%
filter(dmy(Patient_admit) <= dmy(Patient_discharge))
#> Patient_ID Patient_admit Patient_discharge
#> 1 1 20/10/2020 21/10/2020
#> 2 3 21/10/2021 22/10/2021
#> 3 4 25/11/2022 25/11/2022
#> 4 5 25/11/2022 26/11/2022
First convert your dates to the right format using strptime. Then calculate the difference in days using difftime and keep only the rows where that difference is not negative. You can use the following code:
library(dplyr)
df %>%
mutate(Patient_admit = strptime(Patient_admit, "%d/%m/%Y"),
Patient_discharge = strptime(Patient_discharge, "%d/%m/%Y")) %>%
mutate(diff_days = difftime(Patient_discharge, Patient_admit, units = c("days"))) %>%
filter(diff_days >= 0) %>%
select(-diff_days)
Output:
Patient_ID Patient_admit Patient_discharge
1 1 2020-10-20 2020-10-21
2 3 2021-10-21 2021-10-22
3 4 2022-11-25 2022-11-25
4 5 2022-11-25 2022-11-26
Data
df <- data.frame(Patient_ID = c(1,2,3,4,5,6),
Patient_admit = c("20/10/2020", "22/10/2021", "21/10/2021", "25/11/2022", "25/11/2022", "05/10/2020"),
Patient_discharge = c("21/10/2020", "20/10/2021", "22/10/2021", "25/11/2022", "26/11/2022", "20/09/2020"))
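Both answers keep the valid rows. Since the question asks how to find the wrong entries, here is a base R sketch that flags them instead. It assumes the dates are character strings in day/month/year format, as in the example data above; if the columns are factors, as in the question's dput, wrap them in as.character() first. The stay_days column name is made up for illustration:

```r
df <- data.frame(
  Patient_ID = 1:6,
  Patient_admit = c("20/10/2020", "22/10/2021", "21/10/2021",
                    "25/11/2022", "25/11/2022", "05/10/2020"),
  Patient_discharge = c("21/10/2020", "20/10/2021", "22/10/2021",
                        "25/11/2022", "26/11/2022", "20/09/2020")
)

# Parse the day/month/year strings into Date objects
admit <- as.Date(df$Patient_admit, format = "%d/%m/%Y")
discharge <- as.Date(df$Patient_discharge, format = "%d/%m/%Y")

# Length of stay in days; negative means discharge precedes admission
df$stay_days <- as.numeric(discharge - admit)

# Rows with an impossible (negative) stay
wrong <- df[df$stay_days < 0, ]
wrong
```

With this data the flagged rows are patients 2 and 6, which matches the entries the question calls incorrect.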

Remove characters including and after third hyphen

I have a dataframe df and want to remove everything including and after the third '-' in the column 'case_id':
df
case_id unit
TCGA-3A-01-03-9441 27
TCGA-9C-01-04-9641 15
TCGA-1E-01-05-9471 6
This is the desired output:
df
case_id unit
TCGA-3A-01 27
TCGA-9C-01 15
TCGA-1E-01 6
We could use str_replace
library(stringr)
library(dplyr)
df1 %>%
mutate(case_id = str_replace(case_id, "^(([^-]+-){2}[^-]+)-.*", "\\1"))
-output
case_id unit
1 TCGA-3A-01 27
2 TCGA-9C-01 15
3 TCGA-1E-01 6
data
df1 <- structure(list(case_id = c("TCGA-3A-01-03-9441", "TCGA-9C-01-04-9641",
"TCGA-1E-01-05-9471"), unit = c(27L, 15L, 6L)),
class = "data.frame", row.names = c(NA,
-3L))
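The pattern can be read piece by piece: ([^-]+-){2} matches the first two hyphen-terminated chunks, [^-]+ matches the third chunk, and the outer parentheses capture all three as group 1; the remaining -.* matches the third hyphen and everything after it, and the whole match is replaced by group 1. As a sketch, the same pattern also works with base R's sub, without stringr:

```r
case_id <- c("TCGA-3A-01-03-9441", "TCGA-9C-01-04-9641", "TCGA-1E-01-05-9471")

# Keep the first three hyphen-separated chunks, drop the rest
trimmed <- sub("^(([^-]+-){2}[^-]+)-.*", "\\1", case_id)
trimmed

# A value with fewer than three hyphens does not match and is left unchanged
sub("^(([^-]+-){2}[^-]+)-.*", "\\1", "TCGA-3A-01")
```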

Transform to wide format from long in R

I have a data frame in R which looks like below
Model Month Demand Inventory
A Jan 10 20
B Feb 30 40
A Feb 40 60
I want the data frame to look like this:
Jan Feb
A_Demand 10 40
A_Inventory 20 60
A_coverage
B_Demand 30
B_Inventory 40
B_coverage
A_coverage and B_coverage will be calculated in Excel using a formula. The problem I need help with is pivoting the data frame from the long format (the original format) to the wide format shown above.
I tried to implement the solution from the linked duplicate but I am still having difficulty:
HD_dcast <- reshape(data,idvar = c("Model","Inventory","Demand"),
timevar = "Month", direction = "wide")
Here is a dput of my data:
data <- structure(list(Model = c("A", "B", "A"), Month = c("Jan", "Feb",
"Feb"), Demand = c(10L, 30L, 40L), Inventory = c(20L, 40L, 60L
)), class = "data.frame", row.names = c(NA, -3L))
Thanks
Here's an approach with dplyr and tidyr, two popular R packages for data manipulation:
library(dplyr)
library(tidyr)
data %>%
mutate(coverage = NA_real_) %>%
pivot_longer(-c(Model,Month), names_to = "Variable") %>%
pivot_wider(id_cols = c(Model, Variable), names_from = Month ) %>%
unite(Variable, c(Model,Variable), sep = "_")
## A tibble: 6 x 3
# Variable Jan Feb
# <chr> <dbl> <dbl>
#1 A_Demand 10 40
#2 A_Inventory 20 60
#3 A_coverage NA NA
#4 B_Demand NA 30
#5 B_Inventory NA 40
#6 B_coverage NA NA
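The reshape call in the question treats Demand and Inventory as identifiers, although they are the values that need to be spread across months. For comparison, here is a base R sketch of the same two-step idea as the tidyr answer: stack the measures into one column, then cross-tabulate by label and month. This is a different technique from the answer (tapply rather than pivot_wider), it returns a matrix rather than a tibble, and it leaves out the empty coverage rows. The names measures, long and wide are made up for illustration:

```r
data <- data.frame(
  Model = c("A", "B", "A"),
  Month = c("Jan", "Feb", "Feb"),
  Demand = c(10L, 30L, 40L),
  Inventory = c(20L, 40L, 60L)
)

measures <- c("Demand", "Inventory")

# Step 1: stack the measure columns into one long column with a Model_measure label
long <- data.frame(
  Variable = paste(rep(data$Model, times = length(measures)),
                   rep(measures, each = nrow(data)), sep = "_"),
  Month = factor(rep(data$Month, times = length(measures)),
                 levels = c("Jan", "Feb")),
  value = unlist(data[measures], use.names = FALSE)
)

# Step 2: one row per label, one column per month; missing combinations become NA
wide <- tapply(long$value, list(long$Variable, long$Month), sum)
wide
```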

How to sum a variable by group with NA?

I have a large data set like this :
ID Number
153 31
28
31
30
104 31
30
254 31
266 31
and I want to compute the sum by ID, including the rows where the ID is missing (NA). That is, I want to get this:
ID Number
153 120
104 61
254 31
266 31
I tried aggregate, but I don't get the expected result. Some help would be appreciated.
One option is to convert the blanks to NA, fill each NA in 'ID' with the nearest non-NA value above it, then group by 'ID' and get the sum of 'Number':
library(tidyverse)
df1 %>%
mutate(ID = na_if(ID, "")) %>%
fill(ID) %>%
group_by(ID) %>%
summarise(Number = sum(Number))
# A tibble: 4 x 2
# ID Number
# <chr> <int>
#1 104 61
#2 153 120
#3 254 31
#4 266 31
Or without using fill, create a grouping variable with a logical expression and cumsum, and then do the sum
df1 %>%
group_by(grp = cumsum(ID != "")) %>%
summarise(ID = first(ID), Number = sum(Number)) %>%
select(-grp)
data
df1 <- structure(list(ID = c("153", "", "", "", "104", "", "254", "266"
), Number = c(31L, 28L, 31L, 30L, 31L, 30L, 31L, 31L)), row.names = c(NA,
-8L), class = "data.frame")
Or do it straightforwardly :) by
cbind(df1[df1$ID != "", "ID", drop = FALSE],
Number = rev(diff(c(0, rev((rev(cumsum(rev(df1$Number)))[df1$ID != ""]))))))
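The one-liner above is compact but hard to read. A more explicit base R sketch of the same grouping idea (an alternative formulation, not the answerer's code) starts a new group at every non-blank ID and sums Number within each group. The names grp and result are made up for illustration:

```r
df1 <- data.frame(
  ID = c("153", "", "", "", "104", "", "254", "266"),
  Number = c(31L, 28L, 31L, 30L, 31L, 30L, 31L, 31L),
  stringsAsFactors = FALSE
)

# The counter increases by one at every row that carries an ID,
# so blank rows inherit the group of the last ID above them
grp <- cumsum(df1$ID != "")

result <- data.frame(
  ID = df1$ID[df1$ID != ""],
  Number = as.vector(tapply(df1$Number, grp, sum))
)
result
```

Unlike the group_by(ID) version, this keeps the IDs in their original order of appearance (153, 104, 254, 266).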

Sorting dataframe by column of letters and numbers

I have been attempting to sort my dataframe by the first column - or day - with multiple different methods listed below to no avail. I suspect it could be because it is attempting to order by the first number but I am unsure how I would change that to get it to order the rows properly. The dataset is as follows:
df1
[day][sample1][sample2]
[1,]day0 22 11
[2,]day11 23 15
[3,]day15 25 14
[4,]day2 21 13
[5,]day8 20 17
...
I want to order the rows by day. I have tried the following:
df[sort(as.character(df$day)),]
df[order(as.character(df$day)),]
mixedorder(as.character(df$day)) (gtools package)
mixedorder merely returned a vector of index numbers.
Current Code:
df_0$day = metadata_df[,3]
df_0 <- df_0[,c(8,1:7)]
df1 <- aggregate(df_0[,2:ncol(df_0)], df_0[1], mean)
df1 <- df1[mixedorder(as.character(df1$day)),]
df1$day <- factor(df1$day, levels = unique(df1$day))
rownames(df1) <- 1:nrow(df1)
##Plotting expression levels
Plot1 <- ggplot() +
geom_line(data=df1, aes(x=day, y=sample1, group=1, color="blue"))+
geom_line(data=df2, aes(x=day, y=sample1, group=2, color="red"))
Note that I have done the same transformations with df2 as I have with df1. Both df1 and df2 are the same, except with slightly different values in them.
mixedorder returns the ordering index, which can then be used to reorder the rows:
df1 <- df[mixedorder(as.character(df$day)),]
df1
# day sample1 sample2
#1 day0 22 11
#4 day2 21 13
#5 day8 20 17
#2 day11 23 15
#3 day15 25 14
It is not clear how the OP is plotting. One option is to set the factor levels in the sorted order and reshape to long format before plotting:
library(tidyverse)
df1 %>%
mutate(day = factor(day, levels = unique(day))) %>%
gather(key, val, -day) %>%
ggplot(., aes(x = day, y = val, color = key)) +
geom_point()
data
df <- structure(list(day = structure(1:5, .Label = c("day0", "day11",
"day15", "day2", "day8"), class = "factor"), sample1 = c(22L,
23L, 25L, 21L, 20L), sample2 = c(11L, 15L, 14L, 13L, 17L)), .Names = c("day",
"sample1", "sample2"), class = "data.frame", row.names = c(NA,
-5L))
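If gtools is not available, a base R sketch of the same natural ordering is to strip the prefix and sort on the numeric part. This assumes every value has the form "day" followed by an integer, as in the example data; the names day_num and df_sorted are made up for illustration:

```r
df <- data.frame(
  day = factor(c("day0", "day11", "day15", "day2", "day8")),
  sample1 = c(22L, 23L, 25L, 21L, 20L),
  sample2 = c(11L, 15L, 14L, 13L, 17L)
)

# Extract the number after the "day" prefix and order the rows by it
day_num <- as.numeric(sub("^day", "", as.character(df$day)))
df_sorted <- df[order(day_num), ]

# Fix the factor levels in sorted order so plots keep this order on the axis
df_sorted$day <- factor(df_sorted$day, levels = unique(as.character(df_sorted$day)))
df_sorted
```

Plain order() on the character values sorts "day11" before "day2" because it compares character by character, which is why the attempts in the question did not give the expected order.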
