I have a data frame (lg) with all the information logged from a racing yacht on a given day, and I wish to create a variable that tells me which race the yacht was in. The race start and finish times are in a separate data frame (RaceInfo). I can filter by race time, but the number of races per day varies, so it may need a loop.
Some Data
lg <- structure(list(Date = structure(c(18897, 18897, 18897, 18897,
18897, 18897, 18897, 18897, 18897, 18897), class = "Date"), Time = structure(c(1632725883,
1632725884, 1632725885, 1632725886, 1632725887, 1632725888, 1632725889,
1632725890, 1632725891, 1632725892), tzone = "", class = c("POSIXct",
"POSIXt")), Lat = c(43.2760531, 43.276059, 43.276065, 43.2760708,
43.2760766, 43.2760858, 43.276095, 43.2761, 43.276105, 43.2761095
), Lon = c(6.619109, 6.619136, 6.619163, 6.6191932, 6.6192235,
6.6192488, 6.619274, 6.6192988, 6.6193235, 6.6193532), Awa = c(-7.1,
-7.12, -7.15, -6.57, -6, -6.2, -6.4, -5.28, -4.15, 0.25), X = 1:10), row.names = c(NA,
-10L), class = "data.frame")
This is the yacht's onboard data.
More Data
RaceInfo <- structure(list(date = structure(c(18897, 18896), class = "Date"),
RaceStartTime = structure(c(1632738480, 1632751560), tzone = "", class = c("POSIXct",
"POSIXt")), RaceNum = c("1", "2"), RaceFinishTime = structure(c(1632751520,
1632753000), tzone = "", class = c("POSIXct", "POSIXt"))), row.names = c("event.2",
"1"), class = "data.frame")
The RaceInfo df gives the start and finish time of each race. As mentioned before, there could be many races, and I need to assign a new variable in the lg df, lg$RaceNum, based on the times given in the RaceInfo df.
My closest attempt is this, but loops are a weak point in my game.
for (i in RaceInfo$RaceNum){
lg <- lg %>%
mutate(Racenum = case_when(
lg$Time >= (subset(RaceInfo$RaceStartTime, RaceInfo$RaceNum == i)) &
lg$Time <= (subset(RaceInfo$RaceFinishTime, RaceInfo$RaceNum == i)) ~ i))
}
But this only returns the last number in the loop
mutate and case_when are meant for assigning calculated columns within a data frame, not for subsetting the data frame itself.
Instead, consider dplyr::filter (similar to base::subset), together with dplyr::between, and collect the result of each iteration in a list of data frames. Then bind the results together at the end. To subset by unique values, see by.
# Build one filtered data frame per race, then stack them
df_list <- lapply(RaceInfo$RaceNum, function(i)
  dplyr::filter(
    lg,
    dplyr::between(
      Time,
      RaceInfo$RaceStartTime[RaceInfo$RaceNum == i],
      RaceInfo$RaceFinishTime[RaceInfo$RaceNum == i]
    )
  )
)
final_df <- dplyr::bind_rows(df_list)
Alternatively, if your data is manageable with a small set of distinct RaceInfo rows, consider a cross join followed by a filter:
library(dplyr)  # for %>%
# Pair every lg row with every race, then keep rows whose Time falls inside that race
final_df <- dplyr::full_join(lg, RaceInfo, by = character()) %>%
  dplyr::filter(Time >= RaceStartTime, Time <= RaceFinishTime)
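If the goal is to keep every row of lg and only label it with a race number, a plain loop over the rows of RaceInfo may be enough. The original loop returned only the last race because each iteration overwrote the whole column. The sketch below uses small made-up data frames (the times and values are assumptions for illustration, not the real log) and fills in only the matching rows on each pass:

```r
# Toy stand-ins for lg and RaceInfo (hypothetical values, same column names as above)
to_time <- function(x) as.POSIXct(x, origin = "1970-01-01", tz = "UTC")
lg <- data.frame(Time = to_time(c(100, 200, 300, 400, 500)))
RaceInfo <- data.frame(
  RaceNum        = c("1", "2"),
  RaceStartTime  = to_time(c(150, 350)),
  RaceFinishTime = to_time(c(250, 450)),
  stringsAsFactors = FALSE
)

# Start with NA everywhere, then label only the rows that fall inside each race
lg$RaceNum <- NA_character_
for (r in seq_len(nrow(RaceInfo))) {
  in_race <- lg$Time >= RaceInfo$RaceStartTime[r] &
    lg$Time <= RaceInfo$RaceFinishTime[r]
  lg$RaceNum[in_race] <- RaceInfo$RaceNum[r]
}
print(lg)
```

Rows outside every race keep NA, so nothing is dropped from lg. This assumes the races do not overlap; if they did, the later race would win.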
I have data on patients' operations/procedures (example as shown in the picture below), where one row describes one procedure for a patient. There are 2 levels of information:
the first being the operation details, i.e. op_start_dt, priority_operation and asa_status
the second being the procedure details, i.e. proc_desc and proc_table
An operation can have more than one procedure. In the example below, patient A has 2 operations (defined by distinct op_start_dt). In his first operation, he had 1 procedure (defined by distinct proc_desc) and in his second, he had 2 procedures.
I would like to convert the data into a wide format, where a patient has only one row, with his information arranged operation by operation and, within each operation, procedure by procedure, as shown below. So, proc_descxy refers to the proc_desc of the xth operation and yth procedure.
Data:
df <- structure(list(patient = c("A", "A", "A"), department = c("GYNAECOLOGY /OBSTETRICS DEPT",
"GYNAECOLOGY /OBSTETRICS DEPT", "GYNAECOLOGY /OBSTETRICS DEPT"
), op_start_dt = structure(c(1424853000, 1424870700, 1424870700
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), priority_operation = c("Elective",
"Elective", "Elective"), asa_status = c(2, 3, 3), proc_desc = c("UTERUS, MALIGNANT CONDITION, EXTENDED HYSTERECTOMY WITH/WITHOUT LYMPHADENECTOMY",
"KIDNEY AND URETER, VARIOUS LESIONS, NEPHROURETERECTOMY, LAPAROSCOPIC",
"HEART, VARIOUS LESIONS, HEART TRANSPLANTATION"), proc_table = c("99",
"6A", "7C")), row.names = c(NA, 3L), class = "data.frame")
Desired output:
df <- structure(list(patient = "A", department = "GYNAECOLOGY /OBSTETRICS DEPT",
no_op = 2, op_start_dt1 = structure(1424853000, class = c("POSIXct",
"POSIXt"), tzone = "UTC"), no_proc1 = 1, priority_operation1 = "Elective",
asa_status1 = 2, proc_desc11 = "UTERUS, MALIGNANT CONDITION, EXTENDED HYSTERECTOMY WITH/WITHOUT LYMPHADENECTOMY",
proc_table11 = "99", op_start_dt2 = structure(1424870700, class = c("POSIXct",
"POSIXt"), tzone = "UTC"), no_of_proc2 = 2, priority_operation2 = "Elective",
asa_status2 = 3, proc_desc21 = "KIDNEY AND URETER, VARIOUS LESIONS, NEPHROURETERECTOMY, LAPAROSCOPIC",
proc_table21 = "6A", proc_desc22 = "HEART, VARIOUS LESIONS, HEART TRANSPLANTATION",
proc_table22 = "7C"), row.names = 1L, class = "data.frame")
My attempt:
I tried to work this out, but it gets confusing along the way, with pivot_longer then pivot_wider again.
df %>%
# Operation-level Information
group_by(patient) %>%
mutate(op_nth = dense_rank(op_start_dt),
no_op = n_distinct(op_start_dt)) %>%
# Procedure-level Information
group_by(patient, op_start_dt) %>%
mutate(proc_nth = row_number(),
no_proc = n_distinct(proc_desc)) %>%
ungroup() %>%
# Make pivoting easier
mutate_all(as.character) %>%
# Pivot Procedure-level Information
pivot_longer(-c(patient, department, no_op, op_nth, proc_nth)) %>%
# Remove the indices for "Procedure" for Operation_level Information
mutate(proc_nth = case_when(!(name %in% c("op_start_dt", "no_proc", "priority_operation", "asa_status")) ~ proc_nth)) %>%
# Create the column names
unite(name, c(name, op_nth, proc_nth), sep = "", na.rm = TRUE) %>%
distinct() %>%
pivot_wider(names_from = name, values_from = value)
Create a unique ID column for each patient and then use pivot_wider.
library(dplyr)
df %>%
group_by(patient) %>%
mutate(row = row_number()) %>%
tidyr::pivot_wider(names_from = row, values_from = op_start_dt:proc_table)
I have a dataset in which there are dates describing a time period of interest, as well as events ("Tests" in my toy example) that can fall inside or outside the period of interest. The events also have a time and some dichotomous characteristics.
My collaborator has asked me to transform the data from this format:
structure(list(ID = c(1, 1, 2, 3), StartDate = structure(c(315878400,
315878400, 357696000, 323481600), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), EndDate = structure(c(316137600, 316310400,
357955200, 323654400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
TestDateTime = structure(c(316135500, 315797700, 357923700,
323422560), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
TestName = c("Test1", "Test2", "Test1", "Test3"), Characteristic = c("Fast",
"Slow", "Fast", "Slow")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
(image: current state)
to this format:
(image: desired state)
I am unsure how to accomplish this transformation or set of transformations using R, but I believe it is possible.
Try the following:
library(dplyr)
data %>%
  select(-c(StartDate, EndDate)) %>%          # Remove extra columns
  tidyr::spread(TestName, TestDateTime) %>%   # Spread df to wide form (column names assumed from the sample data)
  select(-Characteristic, everything()) %>%   # Move Characteristic to the end of the df
  group_by(ID) %>%                            # Group by ID and
  group_split()                               # split it
Take into account that the date columns of the final df do not exactly match the "desired" state.
Hope this can help you.
I have a dataframe with dates. Here are the first 3 rows with dput:
df.cv <- structure(list(ds = structure(c(1448064000, 1448150400, 1448236800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), y = c(10.4885204292416,
10.456538985014, 10.4264986311659), yhat = c(10.4851491194439,
10.282089547027, 10.4354960430083), yhat_lower = c(10.4169914076864,
10.2162549984153, 10.368531352493), yhat_upper = c(10.5506038959764,
10.3556867861042, 10.5093092789713), cutoff = structure(c(1447977600,
1447977600, 1447977600), class = c("POSIXct", "POSIXt"), tzone = "UTC")),.Names = c("ds",
"y", "yhat", "yhat_lower", "yhat_upper", "cutoff"), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
I'm trying to plot the data with ggplot + geom_line so that matching day/month combinations from different years share one plot. So, for example, I want the y-value of 2016-01-01 to appear at the same x-value as 2017-01-01. I found a way to do this, but it seems to be a very complex workaround:
library(tidyverse)
library(lubridate)
p <- df.cv %>%
mutate(jaar = as.factor(year(ds))) %>%
mutate(x = as_date(as.POSIXct(
ifelse(jaar==2016, ds + years(1), ds),
origin = "1970-01-01")))
ggplot(p %>% filter(jaar!=2015), aes(x=x, group=jaar, color=jaar)) +
geom_line(aes(y=y))
It works, but as you can see I first have to extract the year, then use an ifelse to add one year to only the 2016 dates, convert with POSIXct because ifelse strips the class, convert back into POSIXct while supplying an origin, and finally remove the timestamp with as_date.
Isn't there a simpler, more elegant way to do this?
Use year<- to replace the year with any fixed leap year:
p <- df.cv %>%
mutate(jaar = as.factor(year(ds)),
x = `year<-`(as_date(ds), 2000))
ggplot(p, aes(x = x, y = y, color = jaar)) +
geom_line()
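The same idea can be sketched in base R without lubridate, by rebuilding each date with a fixed year. The dates below are made up for illustration, and the names same_year and original_year are only placeholders. A leap year such as 2000 is used so that 29 February stays a valid date:

```r
# Sketch: map every date onto the same leap year (2000) using base R only
ds <- as.Date(c("2015-11-21", "2016-02-29", "2017-01-01"))

# Keep month and day, replace the year with the literal "2000"
same_year <- as.Date(format(ds, "2000-%m-%d"))

# Keep the original year as a factor, e.g. for the colour aesthetic
original_year <- factor(format(ds, "%Y"))

print(data.frame(ds, same_year, original_year))
```

Plotting same_year on the x-axis and colouring by original_year would then overlay the years, in the same way as the x and jaar columns above.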
I have 2 datasets, one of which contains measurements of temperature at 30 min intervals
ordered.temp<-structure(list(time = structure(c(1385244720, 1385246520, 1385248320,
1385250120, 1385251920, 1385253720, 1385255520, 1385257320, 1385259120,
1385260920), class = c("POSIXct", "POSIXt"), tzone = ""), temp = c(30.419,
29.34, 28.965, 28.866, 28.891, 28.866, 28.692, 28.419, 28.122,
27.85), hoboID = c(2392890L, 2392890L, 2392890L, 2392890L, 2392890L,
2392890L, 2392890L, 2392890L, 2392890L, 2392890L)), .Names = c("time",
"temp", "hoboID"), row.names = c(NA, 10L), class = "data.frame")
, the other I created to be able to assign each measurement into 3-hour bins
df<-structure(list(start = structure(c(1385182800, 1385193600, 1385204400,
1385215200, 1385226000, 1385236800, 1385247600, 1385258400, 1385269200,
1385280000), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1385193600,
1385204400, 1385215200, 1385226000, 1385236800, 1385247600, 1385258400,
1385269200, 1385280000, 1385290800), class = c("POSIXct", "POSIXt"
), tzone = ""), b = 1:10), .Names = c("start", "end", "b"), row.names = c(NA,
10L), class = "data.frame")
For simplicity, I created a subset of the data, but in reality the temp dataframe is 460k rows long and growing bigger every year. I wrote a for loop to compare each line in temp with lines in bin and assign it the corresponding b value from the bin dataframe.
m <- length(ordered.temp$time)
b <- numeric(m)
n <- length(df$start)
for (i in 1:m){
for (j in 1:n){
if (df$start[j] < ordered.temp$time[i] & ordered.temp$time[i] <= df$end[j]){
b[i] <- df$b[j]
print(i/dim(ordered.temp)[1]*100)
}
}
}
Running this loop with 460k rows takes a very long time (I ran the loop for 1 minute and calculated that it would take ±277 hours to complete). Therefore, it is imperative to speed this loop up, or to find an alternative method if that is not possible. However, I have no idea how to achieve the desired result. Any help would be greatly appreciated. Thanks.
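One possible vectorised alternative to the nested loop is base R's findInterval, sketched below on small numeric toy data (the values are assumptions for illustration). It assumes the bins are sorted and contiguous, so that each end equals the next start, and it reproduces the loop's start < time <= end rule through left.open = TRUE:

```r
# Toy bins (start, end] and toy measurement times (hypothetical values)
df <- data.frame(start = c(0, 10, 20), end = c(10, 20, 30), b = 1:3)
ordered.temp <- data.frame(time = c(5, 10, 10.5, 30, 31, 0))

# Bin edges: every start plus the final end (valid only for contiguous bins)
breaks <- c(df$start, df$end[nrow(df)])

# Index of the bin each time falls into; left.open = TRUE gives (start, end]
idx <- findInterval(as.numeric(ordered.temp$time), as.numeric(breaks),
                    left.open = TRUE)

# Times outside all bins get NA instead of a bin number
idx[idx < 1 | idx > nrow(df)] <- NA
ordered.temp$b <- df$b[idx]
print(ordered.temp)
```

Because this makes a single pass over the data instead of comparing every row with every bin, it should scale far better than the double loop. With POSIXct columns the as.numeric() calls convert the times to seconds before the lookup.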
I have a seemingly small challenge, but I can't get to an answer. Here is my minimum working example.
fr_nuke <- structure(list(Date = structure(c(1420070400, 1420074000, 1420077600,
1420081200, 1420084800, 1420088400), class = c("POSIXct", "POSIXt"), tzone = ""),
`61` = c(57945, 57652, 57583, 57551, 57465, 57683),
`3244` = c(72666.64, 73508.78, 69749.17, 67080.13, 66357.65, 66524.13),
`778` = c(2.1133, 2.1133, 2.1133, 2.1133, 2.1133, 2.1133),
fcasted_nuke_temp = c(54064.6099092888, 54064.6099092888, 54064.6099092888,
54064.6099092888, 54064.6099092888, 54064.6099092888),
fcasted_nuke_cons = c(55921.043096775, 56319.5688170977, 54540.4094334057,
53277.340242333, 52935.4411965463, 53014.2244890147)),
.Names = c("Date", "61", "3244", "778", "fcasted_nuke_temp", "fcasted_nuke_cons"),
row.names = c(NA, 6L), class = "data.frame")
series1 <- as.xts(fr_nuke$'61', fr_nuke$Date)
series2 <- as.xts(fr_nuke$fcasted_nuke_temp, fr_nuke$Date)
series3 <- as.xts(fr_nuke$fcasted_nuke_cons, fr_nuke$Date)
grp_input <- cbind(series1,series2,series3)
dygraph(grp_input)
The resulting plot does not show the label of the individual series. Specifying the series with
dygraph(grp_input) %>% dySeries("V1", label = "Label1")
Results in:
Error in dySeries(., "V1", label = "Label1") : One or more of the
specified series were not found. Valid series names are: ..1, ..2, ..3
However, it works if I plot only one series (e.g. series1).
dygraph(series1) %>% dySeries("V1", label = "Label1")
Either set the colnames for the grp_input object, or use merge to construct the column names from the object names.
# setting colnames
require(dygraphs)
require(xts)
grp_input <- cbind(series1, series2, series3)
colnames(grp_input) <- c("V1", "V2", "V3")
dygraph(grp_input) %>% dySeries("V1", label = "Label1")
# using merge
require(dygraphs)
require(xts)
grp_input <- merge(series1, series2, series3)
dygraph(grp_input) %>% dySeries("series1", label = "Label1")