How to group a contiguous variable into a range in R

I have an example dataset:
Road Start End Cat
1 0 50 a
1 50 60 b
1 60 90 b
1 70 75 a
2 0 20 a
2 20 25 a
2 25 40 b
I am trying to produce the following output:
Road Start End Cat
1 0 50 a
1 50 90 b
1 70 75 a
2 0 25 a
2 25 40 b
My code doesn't work:
df %>% group_by(Road, cat)
%>% summarise(
min(Start),
max(End)
)
How can I achieve the results I wanted?

We can use rleid() from data.table to build a run-length ID (a counter that increments each time Cat changes), group by it together with Road, and then summarise each run
library(dplyr)
library(data.table)
df %>%
group_by(Road, grp = rleid(Cat)) %>%
summarise(Cat = first(Cat), Start = min(Start), End = max(End)) %>%
select(-grp)
# A tibble: 5 x 4
# Groups: Road [2]
# Road Cat Start End
# <int> <chr> <int> <int>
#1 1 a 0 50
#2 1 b 50 90
#3 1 a 70 75
#4 2 a 0 25
#5 2 b 25 40
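To see what the run-length ID does, here is a minimal base R sketch. The helper name run_id is hypothetical (it is not part of data.table) and only mimics rleid() for a single non-empty vector without NA values.

```r
# Hypothetical stand-in for data.table::rleid() on one vector:
# the counter goes up by one every time the value differs from the previous one.
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

cat_col <- c("a", "b", "b", "a", "a", "a", "b")  # the Cat column from the question
ids <- run_id(cat_col)
print(ids)
# [1] 1 2 2 3 3 3 4
```

Rows 4 to 6 share run 3 because the last "a" of Road 1 is directly followed by the "a" rows of Road 2, which is why Road also has to be part of the grouping.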
Or using data.table methods
library(data.table)
setDT(df)[, .(Start = min(Start), End = max(End)), .(Road, Cat, grp = rleid(Cat))]
data
df <- structure(list(Road = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), Start = c(0L,
50L, 60L, 70L, 0L, 20L, 25L), End = c(50L, 60L, 90L, 75L, 20L,
25L, 40L), Cat = c("a", "b", "b", "a", "a", "a", "b")),
class = "data.frame", row.names = c(NA,
-7L))

Related

one hot encoding only factor variables in R recipes

I have a dataframe df like so
height age dept
69 18 A
44 8 B
72 19 B
58 34 C
I want to one-hot encode only the factor variables (only dept is a factor). How can I do this?
At the moment I am selecting every column with everything(), as in the code below,
and I get this warning:
Warning message:
The following variables are not factor vectors and will be ignored: height, age
ohe <- df %>%
recipes::recipe(~ .) %>%
recipes::step_dummy(tidyselect::everything()) %>%
recipes::prep() %>%
recipes::bake(df)
Use where(is.factor) instead of everything(), so that step_dummy() only receives the factor columns
library(dplyr)
df %>%
recipes::recipe(~ .) %>%
# where() is exported from tidyselect 1.2.0 onwards; older versions need tidyselect:::where
recipes::step_dummy(tidyselect::where(is.factor)) %>%
recipes::prep() %>%
recipes::bake(df)
Output:
# A tibble: 4 × 4
height age dept_B dept_C
<int> <int> <dbl> <dbl>
1 69 18 0 0
2 44 8 1 0
3 72 19 1 0
4 58 34 0 1
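As a rough illustration of what a where(is.factor) selection picks up, the same predicate can be applied column by column in base R. This is only a sketch of the selection logic, not of how recipes or tidyselect implement it; the names is_fct and factor_cols are made up for the example.

```r
# Same data as in the question; dept is the only factor column.
df <- data.frame(
  height = c(69L, 44L, 72L, 58L),
  age    = c(18L, 8L, 19L, 34L),
  dept   = factor(c("A", "B", "B", "C"))
)

# Apply the predicate to every column and keep the names where it is TRUE.
is_fct <- vapply(df, is.factor, logical(1))
factor_cols <- names(df)[is_fct]
print(factor_cols)
# [1] "dept"
```

Because height and age never reach step_dummy() with this kind of selection, the "not factor vectors" warning no longer appears.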
data
df <- structure(list(height = c(69L, 44L, 72L, 58L), age = c(18L, 8L,
19L, 34L), dept = structure(c(1L, 2L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor")), row.names = c(NA, -4L), class = "data.frame")

R: creating continuous timelines

I have some data showing the duration individuals spent in states a, b, and c. The durations should not overlap. In my data, c sometimes overlaps with a and b. I am trying to correct the data so there are no overlaps. a and b take priority, so I am only changing the start or end dates of c.
For example, my data looks like this (for two individuals):
structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L), period = c("a", "b", "c", "a", "b", "c", "a", "a", "b",
"c", "a", "b"), start = c(3L, 11L, 8L, 18L, 20L, 26L, 31L, 1L,
11L, 13L, 15L, 16L), end = c(10L, 12L, 20L, 19L, 25L, 32L, 35L,
10L, 14L, 17L, 15L, 20L)), class = "data.frame", row.names = c(NA,
-12L))
And the desired output would be:
In this example, the individual with id == 2 has no time in period c, because periods a and b completely overlap.
How would you do this?
My method is a bit crude, but it'll work
library(tidyverse)
df %>% mutate(rowid = row_number()) %>%
pivot_longer(cols = c(start, end), names_to = "event", values_to = "val") %>%
mutate(event = factor(event, levels = c("start", "end"), ordered = TRUE)) %>%
arrange(rowid, id, period, event) %>%
mutate(val = ifelse(period == "c" & event == "start" & val < lag(val), lag(val)+1, val),
val = ifelse(period == "c" & event == "end" & val > lead(val), lead(val)-1, val)) %>%
pivot_wider(id_cols = c(rowid, id, period), names_from = event, values_from = val) %>%
arrange(rowid) %>% mutate(start = ifelse(start > end, NA, start),
end = ifelse(end < start, NA, end)) %>%
select(-rowid)
# A tibble: 12 x 4
id period start end
<int> <chr> <dbl> <dbl>
1 1 a 3 10
2 1 b 11 12
3 1 c 13 17
4 1 a 18 19
5 1 b 20 25
6 1 c 26 30
7 1 a 31 35
8 2 a 1 10
9 2 b 11 14
10 2 c NA NA
11 2 a 15 15
12 2 b 16 20
Adding - instead of NA will turn the column type from dbl to chr.
EDIT: on second thought, here is a shorter version
df %>% group_by(id) %>%
mutate(start = ifelse(period == "c" & start < lag(end), lag(end)+1, start),
end = ifelse(period == "c" & end > lead(start), lead(start)-1, end),
start = ifelse(start > end, NA, start),
end = ifelse(start > end, NA, end))
# A tibble: 12 x 4
# Groups: id [2]
id period start end
<int> <chr> <dbl> <dbl>
1 1 a 3 10
2 1 b 11 12
3 1 c 13 17
4 1 a 18 19
5 1 b 20 25
6 1 c 26 30
7 1 a 31 35
8 2 a 1 10
9 2 b 11 14
10 2 c NA NA
11 2 a 15 15
12 2 b 16 20
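One way to check a corrected timeline is to test whether any period still starts on or before the end of the previous one within the same id. The sketch below is in base R; has_overlap is a hypothetical helper, and it assumes that start and end are inclusive (so a shared day counts as an overlap) and that rows with NA are periods that were removed.

```r
# Corrected timeline, copied from the output above.
fixed <- data.frame(
  id     = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  period = c("a", "b", "c", "a", "b", "c", "a", "a", "b", "c", "a", "b"),
  start  = c(3, 11, 13, 18, 20, 26, 31, 1, 11, NA, 15, 16),
  end    = c(10, 12, 17, 19, 25, 30, 35, 10, 14, NA, 15, 20)
)

# Hypothetical helper: TRUE if, within any id, a period starts on or
# before the end of the period that precedes it (ordered by start).
has_overlap <- function(d) {
  d <- d[!is.na(d$start) & !is.na(d$end), ]
  any(vapply(split(d, d$id), function(g) {
    g <- g[order(g$start), ]
    any(g$start[-1] <= g$end[-nrow(g)])
  }, logical(1)))
}

# Put the original, uncorrected values of the first c period back in.
raw <- fixed
raw$start[3] <- 8
raw$end[3]   <- 20

print(has_overlap(fixed))  # FALSE
print(has_overlap(raw))    # TRUE
```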

Calculation groups with specific columns in r

The pattern my data is like this
df1<-read.table(text="Car1 Car2 Car3 Time1 Time2 Time3
22 33 90 20 90 20
11 45 88 10 80 30
22 33 40 40 10 10
11 45 40 10 10 40
11 45 88 10 12 60
22 45 90 60 20 100",header=TRUE)
I want to calculate the mean and SD of the times for each car value. Car1 corresponds to Time1, Car2 to Time2, Car3 to Time3, and so on.
I want to get the following table :
Car1 Mean SD
11 10 0
22 40 20
Car2
33 xx xx
45 xx xx
Car3
40 xx xx
88 xx xx
90 xx xx
I have tried:
df1 %>% group_by(Car1,Car2,Car3) %>%
summarise(mean=mean(Time,SD=sd(Time))
Unfortunately, it does not work. Any help?
You can also use the package data.table:
library(data.table)
melt(setDT(df1),
measure = patterns("Car", "Time"),
value.name = c("Car", "Time"),
variable.name = "group"
)[, .(Mean = mean(Time), Sd = sd(Time)), .(group, Car)]
# group Car Mean Sd
# 1: 1 22 40.0 20.00000
# 2: 1 11 10.0 0.00000
# 3: 2 33 50.0 56.56854
# 4: 2 45 30.5 33.28163
# 5: 3 90 60.0 56.56854
# 6: 3 88 45.0 21.21320
# 7: 3 40 25.0 21.21320
Here is one option with pivot_longer: reshape from 'wide' to 'long' format, group by the 'group' index and 'Car', and then summarise 'Time' to get its mean and sd
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = everything(), names_to = c(".value", "group"),
names_sep="(?<=[a-z])(?=\\d+)") %>%
group_by(group, Car) %>%
summarise(Mean = mean(Time), SD = sd(Time))
# A tibble: 7 x 4
# Groups: group [3]
# group Car Mean SD
# <chr> <int> <dbl> <dbl>
#1 1 11 10 0
#2 1 22 40 20
#3 2 33 50 56.6
#4 2 45 30.5 33.3
#5 3 40 25 21.2
#6 3 88 45 21.2
#7 3 90 60 56.6
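The key step in both answers is pairing each CarN column with its TimeN column in a long table. As a sketch of that idea without extra packages, the base R version below stacks the columns by hand; the names idx, long, means, sds and out are illustrative only, and it assumes the suffixes are exactly 1 to 3 as in the question.

```r
df1 <- data.frame(
  Car1  = c(22, 11, 22, 11, 11, 22), Car2  = c(33, 45, 33, 45, 45, 45),
  Car3  = c(90, 88, 40, 40, 88, 90),
  Time1 = c(20, 10, 40, 10, 10, 60), Time2 = c(90, 80, 10, 10, 12, 20),
  Time3 = c(20, 30, 10, 40, 60, 100)
)

idx <- 1:3  # assumed column suffixes
# Stack Car1..Car3 and Time1..Time3 column by column, so row k of Car
# still lines up with row k of Time, and record which pair it came from.
long <- data.frame(
  group = rep(idx, each = nrow(df1)),
  Car   = unlist(df1[paste0("Car", idx)], use.names = FALSE),
  Time  = unlist(df1[paste0("Time", idx)], use.names = FALSE)
)

# Summarise Time per (group, Car) and combine mean and sd in one table.
means <- aggregate(Time ~ group + Car, data = long, FUN = mean)
sds   <- aggregate(Time ~ group + Car, data = long, FUN = sd)
out   <- merge(means, sds, by = c("group", "Car"), suffixes = c("_mean", "_sd"))
print(out)
```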
If the Car and Time columns can be separated by position, as in the data shown, you can also loop over the column pairs.
cars <- df1[1:3]
Time <- df1[4:6]
res <- list() # not named `ls`, which would mask base::ls()
for (i in seq_along(cars)) {
# aggregate Time i by Car i; each result cell holds c(mean, sd)
res[[i]] <- aggregate(Time[i], by = cars[i], FUN = function(x) c(mean(x), sd(x)))
}
res
Data for the results is:
df1 <- structure(list(Car1 = c(22L, 11L, 22L, 11L, 11L, 22L), Car2 = c(33L,
45L, 33L, 45L, 45L, 45L), Car3 = c(90L, 88L, 40L, 40L, 88L, 90L
), Time1 = c(20L, 10L, 40L, 10L, 10L, 60L), Time2 = c(90L, 80L,
10L, 10L, 12L, 20L), Time3 = c(20L, 30L, 10L, 40L, 60L, 100L)), class = "data.frame", row.names = c(NA,
-6L))
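Another option is to split the columns by their numeric suffix and summarise each Car/Time pair separately (this needs dplyr for the pipe and sym()):
library(dplyr)
lapply(split.default(df1, gsub("\\D+", "", names(df1))), function(x){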
d = gsub("\\D+", "", names(x)[1])
x %>%
group_by(!!sym(paste0("Car", d))) %>%
summarise(mean = mean(!!sym(paste0("Time", d))),
sd = sd(!!sym(paste0("Time", d)))) %>%
ungroup()
})

Summarise a group value into single row

I have a large dataset with longitudinal readings from single individuals.
I want to summarise information over time into a binary variable, i.e. if diff in the input table below is > 5 for any row of an individual, I want to reduce that individual's observations to a single row with a new column saying TRUE (as for A).
#Input
individual val1 val2 diff
A 32 36 -4
A 36 28 8
A 28 26 2
A 26 26 0
B 65 64 1
B 58 59 -1
B 57 54 3
B 54 51 3
#Output
individual newval
A TRUE
B FALSE
Using dplyr you can:
library(dplyr)
df %>%
group_by(individual) %>% # first group data
summarize(newval = any(diff > 5)) # then evaluate test for each group
#> # A tibble: 2 x 2
#> individual newval
#> <fct> <lgl>
#> 1 A TRUE
#> 2 B FALSE
data
df <- read.table(text = "individual val1 val2 diff
A 32 36 -4
A 36 28 8
A 28 26 2
A 26 26 0
B 65 64 1
B 58 59 -1
B 57 54 3
B 54 51 3
", header = TRUE)
There are multiple ways to do this:
In base R we can use aggregate
aggregate(diff~individual, df,function(x) any(x>5))
# individual diff
#1 A TRUE
#2 B FALSE
Or tapply
tapply(df$diff > 5, df$individual, any)
We can also use data.table
library(data.table)
setDT(df)[, .(newval = any(diff > 5)), by = individual]
An option in base R with rowsum
rowsum(+(df1$diff > 5), df1$individual) != 0
or with by
by(df1$diff > 5, df1$individual, any)
data
df1 <- structure(list(individual = c("A", "A", "A", "A", "B", "B", "B",
"B"), val1 = c(32L, 36L, 28L, 26L, 65L, 58L, 57L, 54L), val2 = c(36L,
28L, 26L, 26L, 64L, 59L, 54L, 51L), diff = c(-4L, 8L, 2L, 0L,
1L, -1L, 3L, 3L)), class = "data.frame", row.names = c(NA, -8L
))
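All of the approaches above rely on any(), so missing readings matter. The short sketch below shows how any() treats NA; the vector diff_with_gap is invented for the example and is not part of the question's data.

```r
diff_with_gap <- c(-4, NA, 2, 0)  # hypothetical readings with one missing value

r1 <- any(diff_with_gap > 5)                # NA: the missing reading might have been > 5
r2 <- any(diff_with_gap > 5, na.rm = TRUE)  # FALSE: missing readings are ignored
r3 <- any(c(8, NA) > 5)                     # TRUE: one known TRUE is enough

print(c(r1, r2, r3))
# [1]    NA FALSE  TRUE
```

If the real data can contain missing diff values, passing na.rm = TRUE keeps newval strictly TRUE or FALSE.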

new data set with combining some rows

This is not a question about reshaping between long and wide formats, so please don't mark it as a duplicate.
Suppose I have :
HouseholdID. PersonID. time. dur. age
1 1 3 4 19
1 2 3 4 29
1 3 5 5 30
1 1 5 5 18
2 1 21 30 18
2 2 21 30 30
In each household, some people have the same time and dur. I want to combine only the rows that have the same HouseholdID, time and dur.
OUTPUT:
HouseholdID. PersonID. time. dur. age. HouseholdID. PersonID. time. dur. age
1 1 3 4 19 1 2 3 4 29
1 3 5 5 30 1 1 5 5 18
2 1 21 30 18 2 2 21 30 30
An option would be dcast from data.table which can take multiple value.var columns
library(data.table)
dcast(setDT(df1), HouseholdID. ~ rowid(HouseholdID.),
value.var = c("PersonID.", "time.", "dur.", "age"), sep="")
# HouseholdID. PersonID.1 PersonID.2 time.1 time.2 dur.1 dur.2 age1 age2
#1: 1 1 2 3 3 4 4 19 29
#2: 2 1 2 21 21 30 30 18 30
Or an option with pivot_wider from tidyr
library(tidyr) # pivot_wider() is available from tidyr 1.0.0 onwards
library(dplyr)
df1 %>%
group_by(HouseholdID.) %>%
mutate(rn = row_number()) %>%
pivot_wider(id_cols= HouseholdID., names_from = rn,
values_from = c(PersonID., time., dur., age), names_sep="")
# A tibble: 2 x 9
# HouseholdID. PersonID.1 PersonID.2 time.1 time.2 dur.1 dur.2 age1 age2
# <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 1 2 3 3 4 4 19 29
#2 2 1 2 21 21 30 30 18 30
Update
With the new dataset, extend the id columns by including the 'time.' and 'dur.'
dcast(setDT(df2), HouseholdID. + time. + dur. ~ rowid(HouseholdID., time., dur.),
value.var = c("PersonID.", "age"), sep="")
If we need duplicate columns for 'time.' and 'dur.' (not clear why it is needed though)
dcast(setDT(df2), HouseholdID. + time. + dur. ~ rowid(HouseholdID., time., dur.),
value.var = c("PersonID.", "time.", "dur.", "age"), sep="")[,
c('time.', 'dur.') := NULL][]
# HouseholdID. PersonID.1 PersonID.2 time..11 time..12 dur..11 dur..12 age1 age2
#1: 1 1 2 3 3 4 4 19 29
#2: 1 3 1 5 5 5 5 30 18
#3: 2 1 2 21 21 30 30 18 30
Or with tidyverse
df2 %>%
group_by(HouseholdID., time., dur.) %>%
mutate(rn = row_number()) %>%
pivot_wider(id_cols= c(HouseholdID., time., dur.), names_from = rn,
values_from = c(PersonID., age), names_sep = "")
# A tibble: 3 x 7
# HouseholdID. time. dur. PersonID.1 PersonID.2 age1 age2
# <int> <int> <int> <int> <int> <int> <int>
#1 1 3 4 1 2 19 29
#2 1 5 5 3 1 30 18
#3 2 21 30 1 2 18 30
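Both versions depend on a within-group row counter (rowid() in data.table, row_number() after group_by() in dplyr) to decide which wide column each person lands in. A base R sketch of the same counter, using ave() and the hypothetical variable name rn, could look like this:

```r
df2 <- data.frame(
  HouseholdID. = c(1, 1, 1, 1, 2, 2),
  PersonID.    = c(1, 2, 3, 1, 1, 2),
  time.        = c(3, 3, 5, 5, 21, 21),
  dur.         = c(4, 4, 5, 5, 30, 30),
  age          = c(19, 29, 30, 18, 18, 30)
)

# Number the rows 1, 2, ... inside each (HouseholdID., time., dur.) combination.
rn <- ave(seq_len(nrow(df2)), df2$HouseholdID., df2$time., df2$dur.,
          FUN = seq_along)
print(rn)
# [1] 1 2 1 2 1 2
```

Every combination here has exactly two people, so the wide result gets the suffixes 1 and 2; a combination with three people would add a third set of columns.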
NOTE: duplicate column names are not recommended as it can lead to confusion in identification of columns.
data
df1 <- structure(list(HouseholdID. = c(1L, 1L, 2L, 2L), PersonID. = c(1L,
2L, 1L, 2L), time. = c(3L, 3L, 21L, 21L), dur. = c(4L, 4L, 30L,
30L), age = c(19L, 29L, 18L, 30L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(HouseholdID. = c(1L, 1L, 1L, 1L, 2L, 2L), PersonID. = c(1L,
2L, 3L, 1L, 1L, 2L), time. = c(3L, 3L, 5L, 5L, 21L, 21L), dur. = c(4L,
4L, 5L, 5L, 30L, 30L), age = c(19L, 29L, 30L, 18L, 18L, 30L)),
class = "data.frame", row.names = c(NA,
-6L))
