Calculation groups with specific columns in R

My data follows this pattern:
df1<-read.table(text="Car1 Car2 Car3 Time1 Time2 Time3
22 33 90 20 90 20
11 45 88 10 80 30
22 33 40 40 10 10
11 45 40 10 10 40
11 45 88 10 12 60
22 45 90 60 20 100",header=TRUE)
I want to calculate the mean and SD based on Car and Time. The point is that Car1 corresponds to Time1, Car2 corresponds to Time2, Car3 corresponds to Time3, and so on.
I want to get the following table:
Car1 Mean SD
11 10 0
22 40 20
Car2
33 xx xx
45 xx xx
Car3
40 xx xx
88 xx xx
90 xx xx
I have tried:
df1 %>% group_by(Car1, Car2, Car3) %>%
  summarise(mean = mean(Time), SD = sd(Time))
Unfortunately, it does not work. Any help?

You can also use the package data.table:
library(data.table)
melt(setDT(df1),
measure = patterns("Car", "Time"),
value.name = c("Car", "Time"),
variable.name = "group"
)[, .(Mean = mean(Time), Sd = sd(Time)), .(group, Car)]
# group Car Mean Sd
# 1: 1 22 40.0 20.00000
# 2: 1 11 10.0 0.00000
# 3: 2 33 50.0 56.56854
# 4: 2 45 30.5 33.28163
# 5: 3 90 60.0 56.56854
# 6: 3 88 45.0 21.21320
# 7: 3 40 25.0 21.21320
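To see what the intermediate melt step produces, here is a small sketch using the question's data: each original row contributes one row per Car/Time pair, so the long table has 6 * 3 = 18 rows, with a 'group' factor recording which pair each row came from.

```r
library(data.table)

df1 <- data.table(
  Car1 = c(22, 11, 22, 11, 11, 22), Car2 = c(33, 45, 33, 45, 45, 45),
  Car3 = c(90, 88, 40, 40, 88, 90), Time1 = c(20, 10, 40, 10, 10, 60),
  Time2 = c(90, 80, 10, 10, 12, 20), Time3 = c(20, 30, 10, 40, 60, 100)
)

# stack the Car*/Time* pairs on top of each other; 'group' is 1, 2 or 3
long <- melt(df1, measure = patterns("Car", "Time"),
             value.name = c("Car", "Time"), variable.name = "group")
nrow(long)          # 18
levels(long$group)  # "1" "2" "3"
```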

Here is one option with pivot_longer: we reshape from 'wide' to 'long' format, group by the 'group' index and 'Car', and get the mean and sd of 'Time' with summarise.
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = everything(), names_to = c(".value", "group"),
names_sep="(?<=[a-z])(?=\\d+)") %>%
group_by(group, Car) %>%
summarise(Mean = mean(Time), SD = sd(Time))
# A tibble: 7 x 4
# Groups: group [3]
# group Car Mean SD
# <chr> <int> <dbl> <dbl>
#1 1 11 10 0
#2 1 22 40 20
#3 2 33 50 56.6
#4 2 45 30.5 33.3
#5 3 40 25 21.2
#6 3 88 45 21.2
#7 3 90 60 56.6
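The names_sep regex may look cryptic; it is a zero-width split at the boundary between a lowercase letter and a digit. A quick base-R check of how it splits the column names (strsplit needs perl = TRUE for the lookarounds):

```r
# split "Car1" into "Car" + "1", "Time3" into "Time" + "3"
strsplit(c("Car1", "Time3"), "(?<=[a-z])(?=\\d+)", perl = TRUE)
```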

Assuming you can easily separate your data into Car and Time columns (given the structure you provided), you can do this with a loop:
cars <- df1[1:3]
Time <- df1[4:6]
ls <- list()
for(i in 1:ncol(cars)) {
ls[[i]] <- aggregate(Time[i], by = cars[i], FUN = function(x) c(mean(x), sd(x)))
}
ls
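One caveat with aggregate and a multi-value FUN: the result is stored as a matrix column, not as two ordinary columns. A minimal sketch of flattening it with do.call (the small data frame d here is made up for illustration):

```r
d <- data.frame(Car = c(11, 22, 11, 22), Time = c(10, 40, 10, 60))
a <- aggregate(Time ~ Car, d, function(x) c(mean = mean(x), sd = sd(x)))

# 'Time' is a 2-column matrix inside 'a'; expand it into regular columns
flat <- do.call(data.frame, a)
names(flat)  # "Car" "Time.mean" "Time.sd"
```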
Data for the results is:
df1 <- structure(list(Car1 = c(22L, 11L, 22L, 11L, 11L, 22L), Car2 = c(33L,
45L, 33L, 45L, 45L, 45L), Car3 = c(90L, 88L, 40L, 40L, 88L, 90L
), Time1 = c(20L, 10L, 40L, 10L, 10L, 60L), Time2 = c(90L, 80L,
10L, 10L, 12L, 20L), Time3 = c(20L, 30L, 10L, 40L, 60L, 100L)), class = "data.frame", row.names = c(NA,
-6L))

lapply(split.default(df1, gsub("\\D+", "", names(df1))), function(x){
d = gsub("\\D+", "", names(x)[1])
x %>%
group_by(!!sym(paste0("Car", d))) %>%
summarise(mean = mean(!!sym(paste0("Time", d))),
sd = sd(!!sym(paste0("Time", d)))) %>%
ungroup()
})
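The first step, split.default, groups columns (not rows) by the digits extracted from their names; a small sketch of just that step on a made-up wide frame:

```r
w <- data.frame(Car1 = 1:2, Time1 = 3:4, Car2 = 5:6, Time2 = 7:8)
# "Car1" -> "1", "Time1" -> "1", ... so columns pair up by their number
parts <- split.default(w, gsub("\\D+", "", names(w)))
names(parts)         # "1" "2"
names(parts[["1"]])  # "Car1" "Time1"
```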

Related

merge ID of three data base to obtain multiples row

I have 3 data frames, and across these three the sample IDs vary. I want to merge them to obtain one data frame with multiple rows for the same ID (but not the same values).
This is what I have:
df1
ID tstart
1  12
2  4

df2
ID tstart
2  40
3  15

df3
ID tstart
2  80
3  80

This is what I want:
ID tstart
1  12
2  4
2  40
3  15
3  80
Now I want to create a new variable; I currently have this:
ID tstart tstop result1 result2
1  12  20   5  NA
2  4   40  10  NA
2  40  80  NA  52
3  15  80  68  NA
3  80  100 NA  56
and I want a new variable so that the df looks like this:
ID tstart tstop result
1  12  20   5
2  4   40  10
2  40  80  52
3  15  80  68
3  80  100 56
We can use bind_rows to bind the datasets and then get the distinct rows
library(dplyr)
bind_rows(df1, df2, df3) %>%
distinct()
For the second case, if the input already has the 'tstart' and 'tstop' columns, we could coalesce the 'result' columns:
dfnew %>%
mutate(result = coalesce(result1, result2), .keep = 'unused')
-output
ID tstart tstop result
1 1 12 20 5
2 2 4 40 10
3 2 40 80 52
4 3 15 80 68
5 3 80 100 56
data
dfnew <- structure(list(ID = c(1L, 2L, 2L, 3L, 3L), tstart = c(12L, 4L,
40L, 15L, 80L), tstop = c(20L, 40L, 80L, 80L, 100L), result1 = c(5L,
10L, NA, 68L, NA), result2 = c(NA, NA, 52L, NA, 56L)),
class = "data.frame", row.names = c(NA,
-5L))
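If dplyr is not available, the same coalescing can be sketched in base R with ifelse, using the result1/result2 columns from dfnew above:

```r
dfnew <- data.frame(result1 = c(5, 10, NA, 68, NA),
                    result2 = c(NA, NA, 52, NA, 56))
# take result1 where present, otherwise fall back to result2
dfnew$result <- ifelse(is.na(dfnew$result1), dfnew$result2, dfnew$result1)
dfnew$result  # 5 10 52 68 56
```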

How to group a contiguous variable into a range in R

I have an example dataset:
Road Start End Cat
1 0 50 a
1 50 60 b
1 60 90 b
1 70 75 a
2 0 20 a
2 20 25 a
2 25 40 b
Trying to output following:
Road Start End Cat
1 0 50 a
1 50 90 b
1 70 75 a
2 0 25 a
2 25 40 b
My code doesn't work:
df %>% group_by(Road, cat)
%>% summarise(
min(Start),
max(End)
)
How can I achieve the results I wanted?
We can use rleid from data.table to get a run-length id for grouping, and then do the summarise:
library(dplyr)
library(data.table)
df %>%
group_by(Road, grp = rleid(Cat)) %>%
summarise(Cat = first(Cat), Start = min(Start), End = max(End)) %>%
select(-grp)
# A tibble: 5 x 4
# Groups: Road [2]
# Road Cat Start End
# <int> <chr> <int> <int>
#1 1 a 0 50
#2 1 b 50 90
#3 1 a 70 75
#4 2 a 0 25
#5 2 b 25 40
Or using data.table methods
library(data.table)
setDT(df)[, .(Start = min(Start), End = max(End)), .(Road, Cat, grp = rleid(Cat))]
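To make the rleid step concrete: it assigns a new id each time the value changes, so consecutive runs of the same Cat fall into one group while a later repeat of that Cat starts a new group.

```r
library(data.table)
# each run of identical consecutive values gets its own id
rleid(c("a", "b", "b", "a", "a"))  # 1 2 2 3 3
```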
data
df <- structure(list(Road = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), Start = c(0L,
50L, 60L, 70L, 0L, 20L, 25L), End = c(50L, 60L, 90L, 75L, 20L,
25L, 40L), Cat = c("a", "b", "b", "a", "a", "a", "b")),
class = "data.frame", row.names = c(NA,
-7L))

Summarise a group value into single row

I have a large dataset with longitudinal readings from single individuals.
I want to summarise information over time into a binary variable, i.e. if diff in the input table below is > 5 for any observation of A, I want to reduce the observations for A to a single row with a new column saying TRUE.
#Input
individual val1 val2 diff
A 32 36 -4
A 36 28 8
A 28 26 2
A 26 26 0
B 65 64 1
B 58 59 -1
B 57 54 3
B 54 51 3
#Output
individual newval
A TRUE
B FALSE
Using dplyr you can:
library(dplyr)
df %>%
group_by(individual) %>% # first group data
summarize(newval = any(diff > 5)) # then evaluate test for each group
#> # A tibble: 2 x 2
#> individual newval
#> <fct> <lgl>
#> 1 A TRUE
#> 2 B FALSE
data
df <- read.table(text = "individual val1 val2 diff
A 32 36 -4
A 36 28 8
A 28 26 2
A 26 26 0
B 65 64 1
B 58 59 -1
B 57 54 3
B 54 51 3
", header = TRUE)
There are multiple ways to do this:
In base R we can use aggregate
aggregate(diff ~ individual, df, function(x) any(x > 5))
# individual diff
#1 A TRUE
#2 B FALSE
Or tapply
tapply(df$diff > 5, df$individual, any)
We can also use data.table
library(data.table)
setDT(df)[, .(newval = any(diff > 5)), by = individual]
An option in base R with rowsum
rowsum(+(df1$diff > 5), df1$individual) != 0
or with by
by(df1$diff > 5, df1$individual, any)
data
df1 <- structure(list(individual = c("A", "A", "A", "A", "B", "B", "B",
"B"), val1 = c(32L, 36L, 28L, 26L, 65L, 58L, 57L, 54L), val2 = c(36L,
28L, 26L, 26L, 64L, 59L, 54L, 51L), diff = c(-4L, 8L, 2L, 0L,
1L, -1L, 3L, 3L)), class = "data.frame", row.names = c(NA, -8L
))

Unable to summarize the minimum and maximum while using a for loop

Below is random data.
drop drop1 drop2 ch
15 14 40 1
20 15 45 1
35 16 90 1
40 17 70 0
25 18 80 0
30 18 90 0
11 20 100 0
13 36 11 0
16 70 220 0
19 40 440 1
25 45 1 1
35 30 70 1
40 40 230 1
17 11 170 1
30 2 160 1
I am using the code below for variable profiling of a continuous variable in R.
library(dplyr)
dt %>% mutate(dec=ntile(drop, n=2)) %>%
count(ch, dec) %>%
filter(ch == 1) -> datcbld
datcbld$N <- unclass(dt %>%
mutate(dec=ntile(drop, n=2)) %>%
count(dec) %>%
unname())[[2]]
datcbld$ch_perc <- datcbld$n / datcbld$N
datcbld$GreaterThan <- unclass(dt %>% mutate(dec=ntile(drop, n=2)) %>%
group_by(dec) %>%
summarise(min(drop)))[[2]]
datcbld$LessThan <- unclass(dt %>%
mutate(dec=ntile(drop, n=2)) %>%
group_by(dec) %>%
summarise(max(drop)))[[2]]
datcbld$Varname <- rep("dt", nrow(datcbld))
And below is output of the code.
ch dec n N ch_perc GreaterThan LessThan Varname
1 1 4 8 0.5 11 25 drop
1 2 5 7 0.714285714 25 40 drop
This code works perfectly fine when I use it for a single variable.
When I try to run it for each column using a for loop, it is unable to summarise the min and max for each decile.
Below is my code using a for loop:
finaldata <- data.frame()
for(i in 1:(ncol(dt) - 1)){
dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n = 2)) %>%
count(ch,dec) %>%
filter(ch == 1) -> dat
dat$N <- unclass(dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n=2)) %>%
count(dec) %>%
unname())[[2]]
dat$ch_perc <- dat$n / dat$N
dat$GreaterThan <- unclass(dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n=2)) %>%
group_by(dec) %>%
summarise(min(dt[, colnames(dt[i])])))[[2]]
dat$LessThan <- unclass(dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n=2)) %>%
group_by(dec) %>%
summarise(max(dt[, colnames(dt[i])])))[[2]]
dat$Varname <- rep(colnames(dt[i]), nrow(dat))
finaldata <- rbind(finaldata, dat)
}
But I'm unable to get the same result.
We could do this with map by looping over the names; this can be done without breaking the chain (%>%):
library(tidyverse)
names(dt)[1:3] %>%
map_df(~
dt %>%
select(.x, ch) %>%
mutate(dec = ntile(!! rlang::sym(.x), n = 2)) %>%
group_by(dec) %>%
mutate(N = n(),
GreaterThan = max(!!rlang::sym(.x)),
LessThan = min(!!rlang::sym(.x))) %>%
select(-1) %>%
count(!!! rlang::syms(names(.))) %>%
filter(ch == 1)%>%
mutate(ch_perc = n/N,
Varname = .x))
# A tibble: 6 x 8
# Groups: dec [2]
# dec ch N GreaterThan LessThan n ch_perc Varname
# <int> <int> <int> <dbl> <dbl> <int> <dbl> <chr>
#1 1 1 8 25 11 4 0.5 drop
#2 2 1 7 40 25 5 0.714 drop
#3 1 1 8 18 2 5 0.625 drop1
#4 2 1 7 70 20 4 0.571 drop1
#5 1 1 8 90 1 5 0.625 drop2
#6 2 1 7 440 90 4 0.571 drop2
The issue in the OP's for loop is the use of dt[, colnames(dt[i])] within summarise: it applies min or max to the full column instead of applying the function within each group created by group_by.
We can convert the column names to symbols as shown above (sym) and evaluate them, or use summarise_at:
finaldata <- data.frame()
for(i in 1:(ncol(dt) - 1)){
dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n = 2)) %>%
count(ch,dec) %>%
filter(ch == 1) -> dat
dat$N <- unclass(dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n=2)) %>%
count(dec) %>%
unname())[[2]]
dat$ch_perc <- dat$n / dat$N
dat$GreaterThan <- unclass(dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n=2)) %>%
group_by(dec) %>%
summarise(max(!! rlang::sym(names(dt)[i]))))[[2]]
dat$LessThan <- unclass(dt %>%
mutate(dec=ntile(dt[, colnames(dt[i])], n=2)) %>%
group_by(dec) %>%
summarise(min(!! rlang::sym(names(dt)[i]))))[[2]]
dat$Varname <- rep(colnames(dt[i]), nrow(dat))
finaldata <- rbind(finaldata, dat)
}
finaldata
# A tibble: 6 x 8
# ch dec n N ch_perc GreaterThan LessThan Varname
# <int> <int> <int> <int> <dbl> <dbl> <dbl> <chr>
#1 1 1 4 8 0.5 25 11 drop
#2 1 2 5 7 0.714 40 25 drop
#3 1 1 5 8 0.625 18 2 drop1
#4 1 2 4 7 0.571 70 20 drop1
#5 1 1 5 8 0.625 90 1 drop2
#6 1 2 4 7 0.571 440 90 drop2
data
dt <- structure(list(drop = c(15L, 20L, 35L, 40L, 25L, 30L, 11L, 13L,
16L, 19L, 25L, 35L, 40L, 17L, 30L), drop1 = c(14L, 15L, 16L,
17L, 18L, 18L, 20L, 36L, 70L, 40L, 45L, 30L, 40L, 11L, 2L), drop2 = c(40L,
45L, 90L, 70L, 80L, 90L, 100L, 11L, 220L, 440L, 1L, 70L, 230L,
170L, 160L), ch = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L,
1L, 1L, 1L, 1L)), .Names = c("drop", "drop1", "drop2", "ch"),
class = "data.frame", row.names = c(NA,
-15L))
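The sym/!! pattern used in the fix can be illustrated on a toy data frame: a column name held as a string is converted to a symbol and unquoted inside summarise, so the function is applied per group rather than to the whole column.

```r
library(dplyr)
library(rlang)

d <- data.frame(g = c(1, 1, 2, 2), x = c(1, 2, 3, 4))
col <- "x"
# !!sym(col) injects the column named by the string 'col' into summarise
d %>% group_by(g) %>% summarise(mx = max(!!sym(col)))
# g = 1 -> mx = 2; g = 2 -> mx = 4
```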

How to calculate a specific subset in a dataframe in R and save the calculation in another list

I have two lists:
list 1:
id name age
1 jake 21
2 ashly 19
45 lana 18
51 james 23
5675 eric 25
list 2 (tv watch):
id hours
1 1.1
1 3
1 2.5
45 5.6
45 3
51 2
51 1
51 2
This is just an example; the real lists are very big: list 1 has 5000 IDs, and lists 2/3/4 each have more than 1 million rows (IDs are not unique).
For every list from 2 up, I need to calculate the average/sum/count for every ID and add the value to list 1.
Note that the calculation must be saved in another list with a different number of rows.
example:
list 1:
id name age tv_average
1 jake 21 2.2
2 ashly 19 n/a
45 lana 18 4.3
51 james 23 1.6667
5675 eric 25 n/a
This is what I tried:
for (i in 1:nrow(list2)) {
  p <- subset(list2, list2$id == i)
  list2$tv_average[i == list2$id] <- sum(p$hours) / nrow(p)
}
The problem: out of 22999 rows it only works on 21713 rows.
Try this
#Sample Data
data1 = structure(list(id = c(1L, 2L, 45L, 51L, 5675L), name = structure(c(3L,
1L, 5L, 4L, 2L), .Label = c("ashly", "eric", "jake", "james",
"lana"), class = "factor"), age = c(21L, 19L, 18L, 23L, 25L)
), .Names = c("id",
"name", "age"), row.names = c(NA, -5L), class = "data.frame")
data2 = structure(list(id = c(1L, 1L, 1L, 3L, 45L, 45L, 51L, 51L, 51L,
53L), hours = c(1.1, 3, 2.5, 10, 5.6, 3, 2, 1, 2, 6)), .Names = c("id",
"hours"), class = "data.frame", row.names = c(NA, -10L))
# Use aggregate to calculate Average, Sum, and Count and Merge
merge(x = data1,
y = aggregate(hours~id, data2, function(x)
c(mean = mean(x),
sum = sum(x),
count = length(x))),
by = "id",
all.x = TRUE)
# id name age hours.mean hours.sum hours.count
#1 1 jake 21 2.200000 6.600000 3.000000
#2 2 ashly 19 NA NA NA
#3 45 lana 18 4.300000 8.600000 2.000000
#4 51 james 23 1.666667 5.000000 3.000000
#5 5675 eric 25 NA NA NA
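As an alternative sketch with dplyr (not part of the original answer): summarise list 2 per id first, then join onto list 1, which leaves NA for ids with no viewing data. The column names tv_average/tv_sum/tv_count are made up here.

```r
library(dplyr)

data1 <- data.frame(id = c(1, 2, 45), name = c("jake", "ashly", "lana"))
data2 <- data.frame(id = c(1, 1, 1, 45, 45), hours = c(1.1, 3, 2.5, 5.6, 3))

# per-id summaries of the long table, joined back onto the id list
data1 %>%
  left_join(data2 %>%
              group_by(id) %>%
              summarise(tv_average = mean(hours),
                        tv_sum     = sum(hours),
                        tv_count   = n()),
            by = "id")
# id 1 -> tv_average 2.2; id 2 -> NA; id 45 -> tv_average 4.3
```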
