I'm struggling to find an easy and fast way to create a new data frame by multiplying together every pair of columns within each "group" of columns.
Example data (today() comes from the lubridate package):
a1 <- rnorm(n = 10)
b1 <- rnorm(n = 10)
c1 <- rnorm(n = 10)
a2 <- rnorm(n = 10)
b2 <- rnorm(n = 10)
c2 <- rnorm(n = 10)
For example, this is my initial data frame:
Original <- data.frame(
date = seq(today()-9, today(), by = 1),
a1 = a1,
b1 = b1,
c1 = c1,
a2 = a2,
b2 = b2,
c2 = c2)
and this data frame is what I would like to achieve (i.e., columns with all possible pairwise products of the columns that end with a 1, and likewise for the columns that end with a 2):
Objective <- data.frame(
date = seq(today()-9, today(), by = 1),
b1a1 = b1*a1,
c1a1 = c1*a1,
c1b1 = c1*b1,
b2c2 = b2*c2,
b2a2 = b2*a2,
c2a2 = c2*a2)
I tried with loops, but that was not a very elegant or efficient solution; or at least mine wasn't. A solution using the tidyverse would be very welcome.
Thanks in advance
I.T
Here is a base R option:
cbind(Original[1], do.call(cbind,
unname(lapply(split.default(Original[-1],
gsub('\\D', '', names(Original[-1]))), function(x) {
do.call(cbind, combn(names(x), 2, function(y) {
setNames(data.frame(do.call(`*`, Original[y])),
paste0(y, collapse = ''))
}, simplify = FALSE))
}))))
# date a1b1 a1c1 b1c1 a2b2 a2c2 b2c2
#1 2021-05-28 -0.06708 1.393018 -0.1213 0.1795 -1.0878 -0.0947
#2 2021-05-29 0.33234 0.045563 0.0201 0.0607 0.0247 0.9219
#3 2021-05-30 0.05043 0.160582 0.0341 0.1748 -0.3893 -0.1184
#4 2021-05-31 0.93642 0.980333 0.8156 0.0746 -1.1128 -0.1571
#5 2021-06-01 -1.21365 -0.256619 0.3268 -1.0106 -0.3542 2.1991
#6 2021-06-02 -0.09550 1.311417 -0.0754 -0.8243 -0.5532 1.1986
#7 2021-06-03 0.32514 0.373324 2.3262 -1.1904 -3.0764 0.7171
#8 2021-06-04 -0.41219 1.034527 -0.8338 -1.8588 -1.0202 2.6916
#9 2021-06-05 0.12488 -0.155639 -0.2294 0.2380 0.4288 0.3711
#10 2021-06-06 -0.00665 0.000139 -0.0105 -2.0117 -0.6363 1.0802
Explanation of the answer -
split.default is used to split the data in groups.
split.default(Original[-1], gsub('\\D', '', names(Original[-1])))
#$`1`
# a1 b1 c1
#1 -0.87773 0.0764 -1.5871
#2 0.86812 0.3828 0.0525
#3 0.48761 0.1034 0.3293
#4 -1.06095 -0.8826 -0.9240
#5 0.97625 -1.2432 -0.2629
#6 -1.28910 0.0741 -1.0173
#7 -0.22843 -1.4234 -1.6343
#8 -0.71512 0.5764 -1.4467
#9 0.29108 0.4290 -0.5347
#10 -0.00937 0.7098 -0.0149
#$`2`
# a2 b2 c2
#1 -1.4360 -0.125 0.758
#2 -0.0403 -1.507 -0.612
#3 -0.7580 -0.231 0.514
#4 0.7270 0.103 -1.531
#5 -0.4035 2.505 0.878
#6 0.6168 -1.336 -0.897
#7 2.2599 -0.527 -1.361
#8 -0.8394 2.215 1.215
#9 -0.5244 -0.454 -0.818
#10 1.0886 -1.848 -0.585
where gsub is used to remove all non-numeric characters from the column names; the result is used to create the groups.
gsub('\\D', '', names(Original[-1]))
#[1] "1" "1" "1" "2" "2" "2"
For every group, lapply creates every combination of column names, taking 2 columns at a time (combn(names(x), 2, ...)).
Each combination is multiplied (do.call(`*`, Original[y])) to create a one-column data frame, and setNames names that column after the combination (paste0(y, collapse = '')).
All the combinations from the previous step are combined into one data frame (do.call(cbind, combn(...))).
All such groups are then combined into one data frame (do.call(cbind, lapply(...))).
The first column with dates is kept as it is in the final output (cbind(Original[1], ...)).
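The steps above can also be sketched with a small named helper, which some readers may find easier to follow than the nested do.call version. This is only an illustration: pair_products is a hypothetical helper name, and the data are regenerated with Sys.Date() so that no extra package is needed.

```r
set.seed(1)
Original <- data.frame(
  date = seq(Sys.Date() - 9, Sys.Date(), by = 1),
  a1 = rnorm(10), b1 = rnorm(10), c1 = rnorm(10),
  a2 = rnorm(10), b2 = rnorm(10), c2 = rnorm(10)
)

# Hypothetical helper: all pairwise column products within one group.
pair_products <- function(x) {
  pairs <- combn(names(x), 2, simplify = FALSE)            # e.g. c("a1", "b1")
  out <- lapply(pairs, function(p) x[[p[1]]] * x[[p[2]]])  # one product per pair
  names(out) <- vapply(pairs, paste0, character(1), collapse = "")
  as.data.frame(out)
}

# Split the non-date columns by their numeric suffix, then process each group.
groups <- split.default(Original[-1], gsub("\\D", "", names(Original[-1])))
result <- cbind(Original[1], do.call(cbind, unname(lapply(groups, pair_products))))
names(result)
```

Because the helper only looks at the columns it is given, groups with different numbers of columns should work the same way.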
Very good question. Here is a tidyverse approach. It also handles groups with an uneven number of columns. Explanation:
The data is divided into a list with each sub-group as a separate item. For this division:
first, the data is pivoted long using pivot_longer;
then a dummy group (sub-group identification) column is created using gsub (you may use str_replace too);
the list is created using dplyr::group_split;
the data in every item is reshaped back to its original form using tidyr::pivot_wider inside purrr::map.
Thereafter, for each individual item of the list:
first, the column products are computed using combn and Reduce (you may also use purrr::reduce here);
second, the names of the new columns are generated using the same combn and Reduce;
these names turn the matrix from above into a named data frame.
Lastly, using purrr::reduce in conjunction with dplyr::left_join, the list is converted back to the intended shape.
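To see the core trick in isolation, here is a tiny base R sketch of combn with FUN = Reduce on a toy data frame. The data frame x and the names prods and nms are made up for illustration and are not part of the answer's pipeline.

```r
# Toy data: three columns, so there are three pairs (a,b), (a,c), (b,c).
x <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# For each pair of columns, Reduce multiplies them element-wise;
# combn collects the results as the columns of a matrix.
prods <- combn(x, 2, FUN = Reduce, f = `*`)

# The same idea applied to the column names gives the new names.
nms <- combn(names(x), 2, FUN = Reduce, f = paste0)

out <- setNames(as.data.frame(prods), nms)
out
```

The resulting columns are in the same order as the names because both calls enumerate the pairs identically.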
set.seed(123)
a1 <- rnorm(n = 10)
b1 <- rnorm(n = 10)
c1 <- rnorm(n = 10)
a2 <- rnorm(n = 10)
b2 <- rnorm(n = 10)
c2 <- rnorm(n = 10)
Original <- data.frame(
date = seq(Sys.Date()-9, Sys.Date(), by = 1),
a1 = a1,
b1 = b1,
c1 = c1,
a2 = a2,
b2 = b2,
c2 = c2)
library(tidyverse)
Original %>% pivot_longer(!date) %>%
mutate(grp = gsub('^\\D*(\\d+)$', '\\1', name)) %>%
group_split(grp, .keep = FALSE) %>%
map(~ .x %>% pivot_wider(names_from = name, values_from = value)) %>%
map(~ combn(.x[-1], 2, FUN = Reduce, f = `*`) %>% as.data.frame() %>%
setNames(combn(names(.x[-1]), 2, FUN = Reduce, f = paste0)) %>% cbind(.x[1], .)) %>%
reduce(~left_join(.x, .y, by = 'date'))
date a1b1 a1c1 b1c1 a2b2 a2c2 b2c2
1 2021-05-28 -0.68606804 0.59848918 -1.30710356 -0.29626767 0.108031283 -0.175982140
2 2021-05-29 -0.08282104 0.05017292 -0.07843039 0.06135046 0.008423333 0.005935364
3 2021-05-30 0.62468579 -1.59924166 -0.41119329 -1.13268875 -0.038374446 0.054248120
4 2021-05-31 0.00780406 -0.05139295 -0.08067566 1.90463287 1.201815497 2.968438088
5 2021-06-01 -0.07186344 -0.08080991 0.34742254 0.99243873 -0.185489171 -0.272722771
6 2021-06-02 3.06467216 -2.89278864 -3.01397443 -0.77341778 1.044302702 -1.703161152
7 2021-06-03 0.22946735 0.38614963 0.41709268 -0.22316502 -0.857881519 0.623969018
8 2021-06-04 2.48789113 -0.19402639 -0.30162620 0.02889143 -0.036194437 -0.272813136
9 2021-06-05 -0.48172830 0.78173260 -0.79823906 -0.23864021 -0.037894774 0.096601990
10 2021-06-06 0.21070515 -0.55877763 -0.59279292 0.03171951 -0.082159505 -0.018002847
Check it for this extended dataset
set.seed(123)
a1 <- rnorm(n = 10)
b1 <- rnorm(n = 10)
c1 <- rnorm(n = 10)
a2 <- rnorm(n = 10)
b2 <- rnorm(n = 10)
c2 <- rnorm(n = 10)
d2 <- rnorm(n = 10)
Original <- data.frame(
date = seq(Sys.Date()-9, Sys.Date(), by = 1),
a1 = a1,
b1 = b1,
c1 = c1,
a2 = a2,
b2 = b2,
c2 = c2,
d2 = d2)
library(tidyverse)
Original %>% pivot_longer(!date) %>%
mutate(grp = gsub('^\\D*(\\d+)$', '\\1', name)) %>%
group_split(grp, .keep = FALSE) %>%
map(~ .x %>% pivot_wider(names_from = name, values_from = value)) %>%
map(~ combn(.x[-1], 2, FUN = Reduce, f = `*`) %>% as.data.frame() %>%
setNames(combn(names(.x[-1]), 2, FUN = Reduce, f = paste0)) %>% cbind(.x[1], .)) %>%
reduce(~left_join(.x, .y, by = 'date'))
date a1b1 a1c1 b1c1 a2b2 a2c2 a2d2 b2c2 b2d2 c2d2
1 2021-05-28 -0.68606804 0.59848918 -1.30710356 -0.29626767 0.108031283 0.161902656 -0.175982140 -0.26373820 0.09616971
2 2021-05-29 -0.08282104 0.05017292 -0.07843039 0.06135046 0.008423333 0.148221326 0.005935364 0.10444173 0.01433970
3 2021-05-30 0.62468579 -1.59924166 -0.41119329 -1.13268875 -0.038374446 -0.298262480 0.054248120 0.42163941 0.01428475
4 2021-05-31 0.00780406 -0.05139295 -0.08067566 1.90463287 1.201815497 -0.894445153 2.968438088 -2.20924515 -1.39402460
5 2021-06-01 -0.07186344 -0.08080991 0.34742254 0.99243873 -0.185489171 -0.880563395 -0.272722771 -1.29468307 0.24197936
6 2021-06-02 3.06467216 -2.89278864 -3.01397443 -0.77341778 1.044302702 0.209022041 -1.703161152 -0.34089562 0.46029226
7 2021-06-03 0.22946735 0.38614963 0.41709268 -0.22316502 -0.857881519 0.248271309 0.623969018 -0.18057692 -0.69416615
8 2021-06-04 2.48789113 -0.19402639 -0.30162620 0.02889143 -0.036194437 -0.003281582 -0.272813136 -0.02473471 0.03098700
9 2021-06-05 -0.48172830 0.78173260 -0.79823906 -0.23864021 -0.037894774 -0.282179411 0.096601990 0.71933645 0.11422674
10 2021-06-06 0.21070515 -0.55877763 -0.59279292 0.03171951 -0.082159505 -0.779997773 -0.018002847 -0.17091365 0.44269850
Created on 2021-06-06 by the reprex package (v2.0.0)
You can also use the following solution. It is not as concise as the other answers, but it takes a different approach that might have some points worthy of consideration. In much of the first chunk of code I tried to emulate the combn function with tidyverse equivalents. The first chunk, which leads to the df2 data set, creates all the combinations whose products you would like to calculate, and the second chunk just evaluates them in the context of the Original data set. Anyway, thank you for this fantastic question that pushed me to the limits.
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(rlang)
cols <- c("(\\w1)", "(\\w2)")
cols %>%
map_dfc(~ names(Original)[str_detect(names(Original), .x)] %>%
as_tibble() %>%
mutate(value2 = rev(value)) %>%
expand(value, value2) %>%
filter(value != value2) %>%
rowwise() %>%
mutate(comb = paste0(sort(c(value, value2)), collapse = "*")) %>%
select(comb) %>%
distinct(comb)) %>%
rename_with(~ str_remove(., "\\.\\.\\."), everything()) %>%
pivot_longer(everything(), names_to = c(".value", "id"),
names_pattern = "(\\w+)(\\d)") -> df2
df2 %>%
select(comb) %>%
rowwise() %>%
mutate(data = map(comb, ~ eval_tidy(parse_expr(.x), data = Original))) %>%
unnest(cols = c(data)) %>%
group_by(comb) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = comb, values_from = data) %>%
relocate(ends_with("1")) %>%
bind_cols(Original$date) %>%
rename_with(~ str_remove(., "\\*"), everything()) %>%
rename(Date = ...8) %>%
relocate(Date) %>%
select(-id)
# A tibble: 10 x 7
Date a1b1 a1c1 b1c1 a2b2 a2c2 b2c2
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021-05-28 -0.129 0.0912 -0.0838 -1.55 -1.52 2.11
2 2021-05-29 -0.477 -1.58 0.352 -3.55 -0.144 0.101
3 2021-05-30 0.195 0.708 0.105 0.910 -0.356 -0.177
4 2021-05-31 -0.194 0.0219 -0.0111 -1.35 0.261 -0.200
5 2021-06-01 0.0140 0.107 0.000601 -0.0279 -0.126 0.104
6 2021-06-02 0.242 0.141 0.174 -0.0174 0.695 -0.0570
7 2021-06-03 -0.439 -0.360 0.589 0.804 -2.76 -1.79
8 2021-06-04 -1.02 -0.0349 0.0137 2.07 0.357 0.495
9 2021-06-05 -0.00670 0.550 -0.00161 -0.000907 0.00503 -0.925
10 2021-06-06 -0.287 -0.505 0.718 -0.0290 -0.00351 0.0256
I have to make a series of operations over a subset of columns. I have a set of columns which measures the same thing to different parties A, B, and C:
id var1_A var1_B var1_C var2_A var2_B var2_C var3_A var3_B var3_C
So, in the example, var1_A var1_B var1_C refer to the same measurement for different parties. And var1_A, var2_A, var3_A refer to different variables for the same party A.
I would like to accomplish 2 things:
I need to create multiple data frames, each referring to one specific party, and merge each of them with another data frame by id. I wrote the code for each data frame individually, as in the example below. The example is simple, but in practice I have multiple datasets like df, each containing information for multiple parties, and I end up with 50 lines of repetitive code. Is there a way to simplify this?
df_A <- df %>% select(id, var1_A, var2_A, var3_A)
df_A <- merge(df_A, df_merge, by="id")
df_B <- df %>% select(id, var1_B, var2_B, var3_B)
df_B <- merge(df_B, df_merge, by="id")
df_C <- df %>% select(id, var1_C, var2_C, var3_C)
df_C <- merge(df_C, df_merge, by="id")
The second thing I would like to accomplish is to change the variable name for df. I would like to change the variable name for all the columns that measure the same thing, but maintaining the party which it refers to. For example, say var1 refers to height, var2 refers to weight, and var3 refers to gender:
id var1_A var1_B var1_C var2_A var2_B var2_C var3_A var3_B var3_C
I would like to get something like:
id height_A height_B height_C weight_A weight_B weight_C gender_A gender_B gender_C
Is there a way to accomplish this with few lines of code? Or do I have to rename each of them individually (using rename command, for example)?
A tidy way:
require(tidyverse)
#CREATE DATA
df <- data.frame(id = 1:10,
var1_A = runif(10),
var1_B = runif(10),
var1_C = runif(10),
var2_A = runif(10),
var2_B = runif(10),
var2_C = runif(10),
var3_A = runif(10),
var3_B = runif(10),
var3_C = runif(10))
df_merge<-data.frame(id = 1:10,
value=11:20)
#grabs current names
nam<-colnames(df)
#Create map of new names
new_names = c('var1'='height','var2'='weight','var3'='gender')
#replace the strings with new strings in map
nam <- str_replace_all(nam, new_names)
#reassign column names to dataframe
colnames(df)<-nam
# loop through the party letters; for each one, build a variable name
# by pasting "df" and the letter, select the columns ending with that
# letter, merge with df_merge, and assign the resulting subset
# to that variable name
for (letter in c('A', "B", "C")){
assign(paste("df", letter, sep = '_'),
df%>%select(id, ends_with(letter))%>%
merge(df_merge, by='id'))
}
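An alternative to assign() is to keep the per-party data frames in one named list, which tends to be easier to loop over later. The sketch below uses base R only and a smaller made-up df with two variables and two parties; by_party is a hypothetical name.

```r
set.seed(42)
df <- data.frame(id = 1:5,
                 var1_A = runif(5), var1_B = runif(5),
                 var2_A = runif(5), var2_B = runif(5))
df_merge <- data.frame(id = 1:5, value = 11:15)

# Rename by prefix: var1 -> height, var2 -> weight (the party suffix is kept).
new_names <- c(var1 = "height", var2 = "weight")
for (old in names(new_names)) {
  names(df) <- sub(paste0("^", old, "_"), paste0(new_names[[old]], "_"), names(df))
}

# One merged data frame per party, stored in a named list.
parties <- c("A", "B")
by_party <- lapply(setNames(parties, parties), function(p) {
  keep <- c("id", grep(paste0("_", p, "$"), names(df), value = TRUE))
  merge(df[keep], df_merge, by = "id")
})

names(by_party$A)
```

Individual results are then available as by_party$A, by_party$B, and so on, without creating variables in the global environment.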
This is similar to @thelatemail's comment (answer) above, but with a couple of extra subsequent steps, i.e. rename the columns, pivot the data to 'long' format, split the df into groups ("df_A", "df_B", "df_C"), pivot the data back to wide, and save the dfs to your global environment:
library(tidyverse)
library(purrr)
df <- data.frame(id = 1:10,
var1_A = runif(10),
var1_B = runif(10),
var1_C = runif(10),
var2_A = runif(10),
var2_B = runif(10),
var2_C = runif(10),
var3_A = runif(10),
var3_B = runif(10),
var3_C = runif(10))
list_of_dfs <- df %>%
rename_with(.cols = starts_with("var1"), ~gsub("var1", "height", .x)) %>%
rename_with(.cols = starts_with("var2"), ~gsub("var2", "weight", .x)) %>%
rename_with(.cols = starts_with("var3"), ~gsub("var3", "gender", .x)) %>%
pivot_longer(-id) %>%
mutate(group = case_when(
str_detect(name, "_A") ~ "df_A",
str_detect(name, "_B") ~ "df_B",
str_detect(name, "_C") ~ "df_C"
)) %>%
split(., .$group)
df_list <- map(list_of_dfs,
\(x) pivot_wider(x, names_from = name,
values_from = value) %>%
select(-group))
list2env(df_list, envir = .GlobalEnv)
#> <environment: R_GlobalEnv>
ls()
#> [1] "df" "df_A" "df_B" "df_C" "df_list"
#> [6] "list_of_dfs"
df_A
#> # A tibble: 10 × 4
#> id height_A weight_A gender_A
#> <int> <dbl> <dbl> <dbl>
#> 1 1 0.417 0.693 0.320
#> 2 2 0.387 0.879 0.00590
#> 3 3 0.882 0.805 0.861
#> 4 4 0.611 0.246 0.336
#> 5 5 0.795 0.185 0.680
#> 6 6 0.274 0.00675 0.568
#> 7 7 0.722 0.950 0.757
#> 8 8 0.776 0.757 0.0457
#> 9 9 0.613 0.352 0.853
#> 10 10 0.0603 0.438 0.421
Created on 2022-10-05 by the reprex package (v2.0.1)
You can then merge/join the dfs as required. Hope this helps.
I am trying to calculate the mean of some data along a non-regular date sequence. For example, I have minute-level data for specific periods of time during the day and I am interested in calculating 5 minute averages. However, I am not sure how the width parameter in rollapply works when it is specified as a list.
library(tidyverse)
library(zoo)
length = 16
set.seed(10)
dxf <- data.frame(
date = seq(Sys.time(), by = "59 sec", length.out = length),
value = runif(length)
)
# Create a "discontinuity"
dxf$date[8:length] <- dxf$date[8:length] + 3600*24
# Add some noise
dxf$date <- dxf$date + runif(length, 0, 1)
diff(dxf$date)
dxf %>%
arrange(date) %>%
mutate(
diff = c(as.numeric(diff(date)), NA),
mean = rollapply(value, width = 5, mean, partial = TRUE, align = "left")
)
# This is what I need. Therefore, I need a variable width but adjusting to the discontinuity in the rows.
mean1 <- mean(dxf$value[1:5])
mean2 <- mean(dxf$value[2:6])
mean3 <- mean(dxf$value[3:7])
mean4 <- NA # Only have 4 values mean(dxf$value[4:7])
mean5 <- NA # Only have 3 values mean(dxf$value[5:7])
mean6 <- NA # Only have 2 values mean(dxf$value[6:7])
mean7 <- NA # Only have 1 values mean(dxf$value[7:7])
mean8 <- mean(dxf$value[8:12])
etc.
I think this is a tricky problem. Here is one approach:
1. Generate a 1 min sequence from the first to the last datetime.
2. Interpolate so we have a value at each 1 min. This includes interpolating across the discontinuity.
3. Calculate the running 5 min mean based on the 1 min interpolated values.
4. Remove the values where the gap in the original datetime values is too large.
Also, take care with time zones: it is best to set them to some deliberately chosen value, or to UTC, which the lubridate functions use by default.
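Before the full code, steps 1 to 3 can be illustrated on a toy series in base R. This is only a sketch: the times are plain seconds rather than datetimes, the signal is made linear so the interpolation is exact, and embed() with rowMeans() stands in for a left-aligned rolling mean. Step 4 (dropping windows that span a gap) is not shown here.

```r
# Toy observations: times in seconds, one reading every 59 s.
t_obs <- c(0, 59, 118, 177, 236, 295, 354)
v_obs <- t_obs / 59 + 1   # linear signal, so linear interpolation is exact

# Steps 1 and 2: regular 60 s grid, then linear interpolation onto it.
grid   <- seq(t_obs[1], t_obs[length(t_obs)], by = 60)
interp <- approx(x = t_obs, y = v_obs, xout = grid)

# Step 3: left-aligned 5-point rolling mean; each row of embed() is one window.
roll5 <- rowMeans(embed(interp$y, 5))
roll5
```

With six grid points there are only two complete 5-point windows, so roll5 has length 2.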
library(tidyverse)
library(RcppRoll)
library(lubridate)
dxf <- tibble(
date = seq(from = ymd_hms('2019-08-14 09:06:05'), by = "59 sec", length.out = 30),
value = runif(30)
)
dxf$date[15:30] <- dxf$date[15:30] + 3600*24 # discontinuity
dxf$date <- dxf$date + round(runif(30)) # noise
dxf <- dxf %>%
mutate(date = ymd_hms(date),
date_num = as.numeric(date),
diff = date_num - lag(date_num))
discontinuity <- which(dxf$diff > 70)
n = nrow(dxf)
date_seq <- seq(from = dxf$date_num[1], to = dxf$date_num[n], by = 60) # create a 1 min sequence
value_interp = approx(x = dxf$date_num, y = dxf$value, xout = date_seq) # interpolate values onto the 1 min sequence
df <- tibble(
date = as_datetime(date_seq),
mean_value = RcppRoll::roll_mean(value_interp$y, n = 5, fill = NA, align = 'left'))
df %>%
filter(date < dxf$date[discontinuity - 1] | date > dxf$date[discontinuity])
We could extract the date, group by it, and then use rollmean:
library(dplyr)
dxf %>%
mutate(d1 = as.Date(date)) %>%
group_by(d1) %>%
mutate(mean = zoo::rollmean(value, 5, align = "left", fill = NA)) %>%
ungroup %>%
select(-d1)
# date value mean
# <dttm> <dbl> <dbl>
# 1 2019-08-14 12:49:09 0.507 0.404
# 2 2019-08-14 12:50:08 0.307 0.347
# 3 2019-08-14 12:51:07 0.427 0.341
# 4 2019-08-14 12:52:07 0.693 NA
# 5 2019-08-14 12:53:06 0.0851 NA
# 6 2019-08-14 12:54:05 0.225 NA
# 7 2019-08-14 12:55:04 0.275 NA
# 8 2019-08-15 12:56:02 0.272 0.507
# 9 2019-08-15 12:57:01 0.616 0.476
#10 2019-08-15 12:58:01 0.430 0.472
#11 2019-08-15 12:59:00 0.652 0.457
#12 2019-08-15 12:59:58 0.568 0.413
#13 2019-08-15 13:00:58 0.114 NA
#14 2019-08-15 13:01:56 0.596 NA
#15 2019-08-15 13:02:56 0.358 NA
#16 2019-08-15 13:03:54 0.429 NA
data
set.seed(10)
dxf <- data.frame(
date = seq(Sys.time(), by = "59 sec", length.out = length),
value = runif(length)
)
dxf$date[8:length] <- dxf$date[8:length] + 3600*24
dxf$date <- dxf$date + runif(length, 0, 1)
Here w[i] is the number of elements of date that are less than or equal to date[i] + 300, minus (i - 1), where 300 refers to 300 seconds.
date <- dxf$date
w <- findInterval(date + 300, date) - seq_along(date) + 1
rollapply(dxf$value, w, mean, align = "left") * ifelse(w < 5, NA, 1)
# same
sapply(seq_along(w), function(i) mean(dxf$value[seq(i, length = w[i])])) *
ifelse(w < 5, NA, 1)
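A small worked example may make the width computation easier to follow. The times below are made-up seconds (not datetimes) with a one-day gap after the sixth reading, and value is just 1 to 8 so the means are easy to check by hand.

```r
# Six readings one minute apart, then a gap of about one day.
date  <- c(0, 60, 120, 180, 240, 300, 90000, 90060)
value <- 1:8

# For each i: how many readings fall in [date[i], date[i] + 300].
w <- findInterval(date + 300, date) - seq_along(date) + 1
w

# Mean over each variable-width window, masked where fewer than 5 readings exist.
res <- sapply(seq_along(w), function(i) mean(value[seq(i, length.out = w[i])]))
res[w < 5] <- NA
res
```

Here the first window holds six readings and the second holds five; every later window is cut short by the gap or by the end of the series, so it is set to NA.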
In R, how do I deal with a messy data frame where rows and columns are mixed up as variables?
days <- c(as.Date("2011-07-01") + 0:9)
set.seed(10)
d <- data.frame(days,replicate(9,round(runif(10,0,10),3)))
names(d) <- c("Date", "x.astreet.1", "x.astreet.2", "x.astreet.3",
"x.Bstreet.1", "x.Bstreet.2", "x.Bstreet.3",
"x.Cstreet.1", "x.Cstreet.2", "x.Cstreet.3")
streetnames <- c(NA,rep(c("Astr.","Bstr.","Cstr."),3))
molecule <- c(NA, rep(c("SO","CO","O3"),3))
d <- rbind(streetnames, molecule,d)
See the data frame as a tbl in this screenshot.
In this case it should ideally have only 5 columns (Date, SO, NO, O3, Station).
Here's my approach. The advantage of doing it this way is that it's completely programmatic. It's fine to have a solution where you manually rename the variables if the dataset is complete, but this approach can scale to the dataset if you're still adding new stations and gases.
# OP changed the 'streetnames' vector, below is the correct one they've provided.
days <- c(as.Date("2011-07-01") + 0:9)
set.seed(10)
d <- data.frame(days,replicate(9,round(runif(10,0,10),3)))
names(d) <- c("Date", "x.astreet.1", "x.astreet.2", "x.astreet.3",
"x.Bstreet.1", "x.Bstreet.2", "x.Bstreet.3",
"x.Cstreet.1", "x.Cstreet.2", "x.Cstreet.3")
streetnames <- c(NA,rep(c("Astr."),3),rep(c("Bstr."),3),rep(c("Cstr."),3))
molecule <- c(NA, rep(c("SO","CO","O3"),3))
d <- rbind(streetnames, molecule, d)
# ---------------
library(tidyr)
library(dplyr)
library(janitor)
# Replace column names with the combined first two rows. This is tricky to do inside
# a dplyr pipeline so I do it outside.
names(d) <- paste(d[1,], d[2,])
d2 <-
d %>%
slice(3:n()) %>% # Remove first 2 rows
clean_names() %>% # Janitor standardises column names
rename(date = na_na) %>%
gather(measure, value, -date) %>% # Collapse wide to long
separate(measure, # Break this column into several columns
into = c("station", "gas")) %>%
mutate_at("value", as.numeric) %>%
# You can stop there to have a long table. To get a wide table:
spread(gas, value) %>%
identity()
head(d2)
#> date station co o3 so
#> 1 2011-07-01 astr 6.517 8.647 5.075
#> 2 2011-07-01 bstr 2.755 3.543 5.356
#> 3 2011-07-01 cstr 0.756 8.614 0.319
#> 4 2011-07-02 astr 5.677 6.154 3.068
#> 5 2011-07-02 bstr 2.289 9.364 0.931
#> 6 2011-07-02 cstr 5.344 4.644 1.145
str(d2)
#> 'data.frame': 30 obs. of 5 variables:
#> $ date : Date, format: "2011-07-01" "2011-07-01" "2011-07-01" ...
#> $ station: chr "astr" "bstr" "cstr" "astr" ...
#> $ co : num 6.517 2.755 0.756 5.677 2.289 ...
#> $ o3 : num 8.65 3.54 8.61 6.15 9.36 ...
#> $ so : num 5.075 5.356 0.319 3.068 0.931 ...
Note: I always throw an identity() at the end of pipelines for debugging purposes. It lets you comment out entire lines of the pipe without having to worry about trailing %>% raising errors.
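A minimal sketch of that identity() trick, written with the native |> pipe (which assumes R 4.1 or later; the answer itself uses %>%). The vector and the commented-out step are made up for illustration.

```r
res <- c(3, 1, 2) |>
  sort() |>
  # rev() |>    # this step can be commented out without breaking the pipe
  identity()    # returns its input unchanged, so it is a safe final step

res
```

Without the final identity(), commenting out the last real step would leave a dangling pipe operator at the end of the previous line.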
A base R approach could be the following.
res <- lapply(seq(2, ncol(d), by = 3), function(i){
Date <- d[-(1:2), "Date"]
SO <- d[-(1:2), i]
CO <- d[-(1:2), i + 1]
O3 <- d[-(1:2), i + 2]
data.frame(Date, SO, CO, O3)
})
res <- do.call(rbind, res)
res$Date <- as.Date(res$Date)
row.names(res) <- NULL
head(res)
# Date SO CO O3
#1 2011-07-01 5.075 6.517 8.647
#2 2011-07-02 3.068 5.677 6.154
#3 2011-07-03 4.269 1.135 7.751
#4 2011-07-04 6.931 5.959 3.556
#5 2011-07-05 0.851 3.58 4.058
#6 2011-07-06 2.254 4.288 7.066
Starting from the beginning of your code sample with your rbind calls omitted:
days <- c(as.Date("2011-07-01") + 0:9)
set.seed(10)
d <- data.frame(days,replicate(9,round(runif(10,0,10),3)))
names(d) <- c("Date", "x.astreet.1", "x.astreet.2", "x.astreet.3",
"x.Bstreet.1", "x.Bstreet.2", "x.Bstreet.3",
"x.Cstreet.1", "x.Cstreet.2", "x.Cstreet.3")
library(magrittr) # for %<>%
library(tidyr)
library(dplyr)
d %<>% gather(col_name, value, -Date) %>%
separate(col_name, c("x", "street_name", "molecule_number"), sep = "\\.", convert = TRUE) %>%
select(-x) %>%
spread(molecule_number, value) %>%
rename(SO = `1`, NO = `2`, O3 = `3`)
I think this is what you're trying to get to. There is likely a more elegant solution, but this will work.
I assumed that the suffixes 1, 2, 3 correspond to SO, CO, and O3.
This solution does not use the streetnames or molecule vectors that you created, so you can leave off the rbind() call that you made.
library(dplyr)
library(tidyr)
e <- d %>% gather(key = "station", value = "val", x.astreet.1:x.Cstreet.3)
SO <- e %>% filter(grepl("1", station))
CO <- e %>% filter(grepl("2", station))
O3 <- e %>% filter(grepl("3", station))
f <- data.frame(SO, CO %>% select(val), O3 %>% select(val))
g <- f %>% mutate(Station = case_when(station == "x.astreet.1" ~ "Astr",
station == "x.Bstreet.1" ~ "Bstr",
station == "x.Cstreet.1" ~ "Cstr"),
SO = val,
CO = val.1,
O3 = val.2) %>%
select(Date, SO, CO, O3, Station)
I left in the DF renaming so you could see the result after each step.
Given a dataframe df like below
text <- "
parameter,car,qtr,val
a,a3,FY18Q1,23
b,a3,FY18Q1,10000
a,a3,FY18Q2,14
b,a3,FY18Q2,12000
a,cla,FY18Q1,15
b,cla,FY18Q1,12000
c,cla,FY18Q1,5.5
a,cla,FY18Q2,26
b,cla,FY18Q2,10000
c,cla,FY18Q2,6.2
"
df <- read.table(textConnection(text), sep = ",", header = TRUE)
I want to add a row with parameter b_diff for each car, qtr combination, with val being the difference in parameter b between two consecutive quarters. The ascending qtr order is FY18Q1, FY18Q2. For the first qtr, FY18Q1, the val for b_diff should be NA, as there is no previous qtr.
The expected output is as below.
parameter car qtr val
a a3 FY18Q1 23
b a3 FY18Q1 10000
b_diff a3 FY18Q1 NA
a a3 FY18Q2 14
b a3 FY18Q2 12000
b_diff a3 FY18Q2 2000
a cla FY18Q1 15
b cla FY18Q1 12000
c cla FY18Q1 5.5
b_diff cla FY18Q1 NA
a cla FY18Q2 26
b cla FY18Q2 10000
c cla FY18Q2 6.2
b_diff cla FY18Q2 -2000
How do I go about doing this with dplyr?
A solution using dplyr and purrr. We can create a group ID using group_indices, split the data frame based on it, summarize the data, and then combine the pieces. df5 is the final output.
library(dplyr)
library(purrr)
df2 <- df %>% mutate(GroupID = group_indices(., car, qtr))
df3 <- df2 %>%
filter(parameter %in% "b") %>%
group_by(car) %>%
mutate(val = val - lag(val), parameter = "b_diff") %>%
ungroup() %>%
split(f = .$GroupID)
df4 <- df2 %>% split(f = .$GroupID)
df5 <- map2_dfr(df4, df3, bind_rows) %>% select(-GroupID)
df5
# parameter car qtr val
# 1 a a3 FY18Q1 23.0
# 2 b a3 FY18Q1 10000.0
# 3 b_diff a3 FY18Q1 NA
# 4 a a3 FY18Q2 14.0
# 5 b a3 FY18Q2 12000.0
# 6 b_diff a3 FY18Q2 2000.0
# 7 a cla FY18Q1 15.0
# 8 b cla FY18Q1 12000.0
# 9 c cla FY18Q1 5.5
# 10 b_diff cla FY18Q1 NA
# 11 a cla FY18Q2 26.0
# 12 b cla FY18Q2 10000.0
# 13 c cla FY18Q2 6.2
# 14 b_diff cla FY18Q2 -2000.0
DATA
Notice that it is better to have stringsAsFactors = FALSE.
text <- "
parameter,car,qtr,val
a,a3,FY18Q1,23
b,a3,FY18Q1,10000
a,a3,FY18Q2,14
b,a3,FY18Q2,12000
a,cla,FY18Q1,15
b,cla,FY18Q1,12000
c,cla,FY18Q1,5.5
a,cla,FY18Q2,26
b,cla,FY18Q2,10000
c,cla,FY18Q2,6.2
"
df <- read.table(textConnection(text), sep = ",", header = TRUE, stringsAsFactors = FALSE)
Here is one algorithm:
Reshape the data to "wide" format, so that qtr and car form a unique row index, with the parameter column "spread" into columns
Within each car value, take the 1-period diff of the new parameter_b column
Reshape the data back to "long" format
Equivalent code, using reshape2 and dplyr:
library(reshape2)
library(dplyr)
# optional. you could just use `c(NA, diff(x))` below, but this is more general
padded_diff <- function(x, lag = 1L) {
  c(rep.int(NA, lag), diff(x, lag = lag))
}
df %>%
  dcast(car + qtr ~ parameter, value.var = "val") %>%
  group_by(car) %>% # take the diff within each car, not across cars
  mutate(b_diff = padded_diff(b)) %>%
  ungroup() %>%
  as.data.frame() %>%
  melt(id.vars = c("car", "qtr"), variable.name = "parameter") %>%
  arrange(car, qtr, parameter)
Here is another algorithm:
Group the data frame by car
Within each group, temporarily filter so that only rows with parameter == "b" are present
Take the 1-period diff of the val column
Remove the filter and ungroup
Equivalent code, using only dplyr, using a temporary table to simulate a "removable" filter:
make_b_diff_within_group <- function(df) {
tmp <- df %>%
filter(parameter == "b") %>%
transmute(
qtr = qtr,
val = padded_diff(val),
parameter = "b_diff")
bind_rows(df, tmp)
}
df %>%
group_by(car) %>%
do(make_b_diff_within_group(.)) %>%
ungroup() %>%
arrange(car, qtr, parameter)
This second algorithm could be implemented using several other "split-apply-combine" paradigms, including the tapply or by functions in base R, the ddply function in the plyr package (an ancestor of dplyr by the same author), and the split method from dplyr, as shown in this answer.
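As one illustration of the base R route, the second algorithm can be sketched with split() and lapply(). This is a hedged sketch rather than the answer's own code: add_b_diff is a hypothetical helper name, and the data are typed in directly instead of being read from text.

```r
df <- data.frame(
  parameter = c("a", "b", "a", "b", "a", "b", "c", "a", "b", "c"),
  car = c("a3", "a3", "a3", "a3", "cla", "cla", "cla", "cla", "cla", "cla"),
  qtr = c("FY18Q1", "FY18Q1", "FY18Q2", "FY18Q2",
          "FY18Q1", "FY18Q1", "FY18Q1", "FY18Q2", "FY18Q2", "FY18Q2"),
  val = c(23, 10000, 14, 12000, 15, 12000, 5.5, 26, 10000, 6.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one car, append the b_diff rows.
add_b_diff <- function(g) {
  b <- g[g$parameter == "b", ]
  b <- b[order(b$qtr), ]          # make sure quarters are in ascending order
  b$val <- c(NA, diff(b$val))     # NA for the first quarter
  b$parameter <- "b_diff"
  rbind(g, b)
}

out <- do.call(rbind, lapply(split(df, df$car), add_b_diff))
out <- out[order(out$car, out$qtr), ]  # ties keep their order, so b_diff stays last
rownames(out) <- NULL
out
```

Sorting by car and qtr only (not by parameter) keeps each b_diff row after the original rows of its quarter, which matches the layout asked for in the question.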
Given a situation such as the following
library(dplyr)
myData <- tbl_df(data.frame( var1 = rnorm(100),
var2 = letters[1:3] %>%
sample(100, replace = TRUE) %>%
factor(),
var3 = LETTERS[1:3] %>%
sample(100, replace = TRUE) %>%
factor(),
var4 = month.abb[1:3] %>%
sample(100, replace = TRUE) %>%
factor()))
I would like to group myData to eventually find summary data grouping by all possible combinations of var2, var3, and var4.
I can create a list with all possible combinations of variables as character values with
groupNames <- names(myData)[2:4]
myGroups <- Map(combn,
list(groupNames),
seq_along(groupNames),
simplify = FALSE) %>%
unlist(recursive = FALSE)
My plan was to make separate data sets for each variable combination with a for() loop, something like
### This Does Not Work
for (i in 1:length(myGroups)){
assign( myGroups[i]%>%
unlist() %>%
paste0(collapse = "")%>%
paste0("Data"),
myData %>%
group_by_(lapply(myGroups[[i]], as.symbol)) %>%
summarise( n = length(var1),
avgVar2 = var2 %>%
mean()))
}
Admittedly I am not very good with lists, and looking up this issue was a bit challenging since dplyr updates have altered how grouping works a bit.
If there is a better way to do this than separate data sets I would love to know.
I've gotten a loop similar to above working when I am only grouping by a single variable.
Any and all help is greatly appreciated! Thank you!
This seems convoluted, and there's probably a way to simplify or fancy it up with a do, but it works. Using your myData and myGroups,
results = lapply(myGroups, FUN = function(x) {
do.call(what = group_by_, args = c(list(myData), x)) %>%
summarise( n = length(var1),
avgVar1 = mean(var1))
}
)
> results[[1]]
Source: local data frame [3 x 3]
var2 n avgVar1
1 a 31 0.38929738
2 b 31 -0.07451717
3 c 38 -0.22522129
> results[[4]]
Source: local data frame [9 x 4]
Groups: var2
var2 var3 n avgVar1
1 a A 11 -0.1159160
2 a B 11 0.5663312
3 a C 9 0.7904056
4 b A 7 0.0856384
5 b B 13 0.1309756
6 b C 11 -0.4192895
7 c A 15 -0.2783099
8 c B 10 -0.1110877
9 c C 13 -0.2517602
> results[[7]]
# I won't paste them here, but it has all 27 rows, grouped by var2, var3 and var4.
I changed your summarise call to average var1 since var2 isn't numeric.
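Since group_by_ was later deprecated in dplyr, here is a package-free sketch of the same idea using base R's aggregate(). It is only an illustration under simplified assumptions: the data have two grouping variables instead of three, and summarise_by is a hypothetical helper name.

```r
set.seed(1)
myData <- data.frame(var1 = rnorm(100),
                     var2 = sample(letters[1:3], 100, replace = TRUE),
                     var3 = sample(LETTERS[1:3], 100, replace = TRUE))

# Every non-empty combination of the grouping variable names.
groupNames <- c("var2", "var3")
myGroups <- unlist(
  lapply(seq_along(groupNames), function(m) combn(groupNames, m, simplify = FALSE)),
  recursive = FALSE
)

# Hypothetical helper: count and mean of var1 for one set of grouping columns.
summarise_by <- function(g) {
  n   <- aggregate(myData["var1"], by = myData[g], FUN = length)
  avg <- aggregate(myData["var1"], by = myData[g], FUN = mean)
  names(n)[names(n) == "var1"] <- "n"
  names(avg)[names(avg) == "var1"] <- "avgVar1"
  merge(n, avg, by = g)
}

results <- lapply(myGroups, summarise_by)
results[[1]]
```

As in the answer above, results holds one summary table per combination, in the order given by myGroups.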
I have created a function based on the answer of @Gregor and the comments that followed:
library(magrittr)
myData <- tbl_df(data.frame( var1 = rnorm(100),
var2 = letters[1:3] %>%
sample(100, replace = TRUE) %>%
factor(),
var3 = LETTERS[1:3] %>%
sample(100, replace = TRUE) %>%
factor(),
var4 = month.abb[1:3] %>%
sample(100, replace = TRUE) %>%
factor()))
Function combSummarise
combSummarise <- function(data, variables=..., summarise=...){
# Get all different combinations of selected variables (credit to @Michael)
myGroups <- lapply(seq_along(variables), function(x) {
combn(c(variables), x, simplify = FALSE)}) %>%
unlist(recursive = FALSE)
# Group by selected variables (credit to @konvas)
df <- eval(parse(text=paste("lapply(myGroups, function(x){
dplyr::group_by_(data, .dots=x) %>%
dplyr::summarize_( \"", paste(summarise, collapse="\",\""),"\")})"))) %>%
do.call(plyr::rbind.fill,.)
groupNames <- c(myGroups[[length(myGroups)]])
newNames <- names(df)[!(names(df) %in% groupNames)]
df <- cbind(df[, groupNames], df[, newNames])
names(df) <- c(groupNames, newNames)
df
}
Call of combSummarise
combSummarise (myData, var=c("var2", "var3", "var4"),
summarise=c("length(var1)", "mean(var1)", "max(var1)"))
or
combSummarise (myData, var=c("var2", "var4"),
summarise=c("length(var1)", "mean(var1)", "max(var1)"))
or
combSummarise (myData, var=c("var2", "var4"),
summarise=c("length(var1)"))
etc
Inspired by the answers by Gregor and dimitris_ps, I wrote a dplyr style function that runs summarise for all combinations of group variables.
summarise_combo <- function(data, ...) {
groupVars <- group_vars(data) %>% map(as.name)
groupCombos <- map( 0:length(groupVars), ~combn(groupVars, ., simplify=FALSE) ) %>%
unlist(recursive = FALSE)
results <- groupCombos %>%
map(function(x) {data %>% group_by(!!! x) %>% summarise(...)} ) %>%
bind_rows()
results %>% select(!!! groupVars, everything())
}
Example
library(tidyverse)
mtcars %>% group_by(cyl, vs) %>% summarise_combo(cyl_n = n(), mean(mpg))
Using unite to create a new column is the simplest way
library(tidyverse)
df = tibble(
a = c(1,1,2,2,1,1,2,2),
b = c(3,4,3,4,3,4,3,4),
val = c(1,2,3,4,5,6,7,8)
)
print(df)#output1
df_2 = unite(df, 'combined_header', a, b, sep='_', remove=FALSE) #remove=F doesn't remove existing columns
print(df_2)#output2
df_2 %>% group_by(combined_header) %>%
summarize(avg_val=mean(val)) %>% print()#output3
# avg for 1_3 = mean(c(1, 5)) = 3; avg for 1_4 = mean(c(2, 6)) = 4
RESULTS
Output:
output1
a b val
<dbl> <dbl> <dbl>
1 1 3 1
2 1 4 2
3 2 3 3
4 2 4 4
5 1 3 5
6 1 4 6
7 2 3 7
8 2 4 8
output2
combined_header a b val
<chr> <dbl> <dbl> <dbl>
1 1_3 1 3 1
2 1_4 1 4 2
3 2_3 2 3 3
4 2_4 2 4 4
5 1_3 1 3 5
6 1_4 1 4 6
7 2_3 2 3 7
8 2_4 2 4 8
output3
combined_header avg_val
<chr> <dbl>
1 1_3 3
2 1_4 4
3 2_3 5
4 2_4 6
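For comparison, the same combined key and group means can be sketched in base R with paste() and aggregate(). This is an alternative illustration, not part of the unite answer; the data are the same toy values.

```r
df <- data.frame(a = c(1, 1, 2, 2, 1, 1, 2, 2),
                 b = c(3, 4, 3, 4, 3, 4, 3, 4),
                 val = 1:8)

# Build the combined key, keeping the original columns.
df$combined_header <- paste(df$a, df$b, sep = "_")

# Mean of val per combined key.
avg <- aggregate(val ~ combined_header, data = df, FUN = mean)
avg
```

The group means come out as 3, 4, 5, and 6, matching output3 above.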