I have a dataframe:
dat<- data.frame(date = c("2015-01-01","2015-01-01","2015-01-01", "2015-01-01","2015-02-02","2015-02-02","2015-02-02","2015-02-02","2015-02-02"), val= c(10,20,30,50,300,100,200,200,400), type= c("A","A","B","C","A","A","B","C","C") )
dat
date val type
1 2015-01-01 10 A
2 2015-01-01 20 A
3 2015-01-01 30 B
4 2015-01-01 50 C
5 2015-02-02 300 A
6 2015-02-02 100 A
7 2015-02-02 200 B
8 2015-02-02 200 C
9 2015-02-02 400 C
and I would like to have one row for each day with averages by type so the output would be:
Date A B C
2015-01-01 15 30 50
2015-02-02 200 200 300
Additionally, how would I get the counts, so that the results are:
Date A B C
2015-01-01 2 1 1
2015-02-02 2 1 2
library(reshape2)
dcast(data = dat, formula = date ~ type, fun.aggregate = mean, value.var = "val")
# date A B C
# 1 2015-01-01 15 30 50
# 2 2015-02-02 200 200 300
With dcast, the LHS of the formula defines rows, the RHS defines columns, the value.var is the name of the column that becomes values, and the fun.aggregate is how those values are computed. The default fun.aggregate is length, i.e., the number of values. You asked for the average, so we use mean. You could also do min, max, sd, IQR, or any function that takes a vector and returns a single value.
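To make the roles of the formula and fun.aggregate concrete, here is a minimal base R sketch that mimics the same "rows ~ columns, one aggregated value per cell" idea. The helper name cast_wide is hypothetical and is not part of reshape2; it only illustrates that any function mapping a vector to a single value can be plugged in.

```r
dat <- data.frame(
  date = c(rep("2015-01-01", 4), rep("2015-02-02", 5)),
  val  = c(10, 20, 30, 50, 300, 100, 200, 200, 400),
  type = c("A", "A", "B", "C", "A", "A", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a reshape2 function): `row` defines the rows,
# `col` defines the columns, and `fun` collapses the values in each cell.
cast_wide <- function(data, row, col, value, fun) {
  tapply(data[[value]], list(data[[row]], data[[col]]), fun)
}

means  <- cast_wide(dat, "date", "type", "val", mean)    # averages per cell
counts <- cast_wide(dat, "date", "type", "val", length)  # number of values per cell
```

Swapping mean for length here corresponds to swapping fun.aggregate in the dcast call.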
You may also use table for the updated question
table(dat[c(1,3)])
# type
#date A B C
#2015-01-01 2 1 1
#2015-02-02 2 1 2
For the first question, I think @Gregor's solution is the best (so far). A possible option with dplyr/tidyr would be
library(dplyr)
library(tidyr)
dat %>%
group_by(date,type) %>%
summarise(val=mean(val)) %>%
spread(type, val)
Or a base R option would be (50 characters, versus 44 for the dcast(...) call, so not so bad :-))
with(dat, tapply(val, list(date, type), FUN=mean))
# A B C
#2015-01-01 15 30 50
#2015-02-02 200 200 300
Personally I would go with Gregor's solution using reshape2. But for the sake of completeness I'll include a base R solution.
agg <- with(dat, aggregate(val, by = list(date = date, type = type), FUN = mean))
out <- reshape(agg, timevar = "type", idvar = "date", direction = "wide")
out
# date x.A x.B x.C
# 1 2015-01-01 15 30 50
# 2 2015-02-02 200 200 300
If you want to get rid of the x. on the column names, you can remove it with gsub.
colnames(out) <- gsub("^x\\.", "", colnames(out))
To get the counts of rows, replace FUN = mean with FUN = length in the call to aggregate.
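As a sketch of that substitution, the same aggregate/reshape steps with FUN = length look like this. The data frame is rebuilt here so the snippet runs on its own.

```r
dat <- data.frame(
  date = c(rep("2015-01-01", 4), rep("2015-02-02", 5)),
  val  = c(10, 20, 30, 50, 300, 100, 200, 200, 400),
  type = c("A", "A", "B", "C", "A", "A", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Count rows per date/type instead of averaging them
agg <- with(dat, aggregate(val, by = list(date = date, type = type), FUN = length))

# One row per date, one column per type
out <- reshape(agg, timevar = "type", idvar = "date", direction = "wide")

# Drop the "x." prefix that reshape() adds to the value columns
colnames(out) <- gsub("^x\\.", "", colnames(out))
out
```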
Using data.table v1.9.5 (current devel), we can do:
require(data.table) ## v1.9.5+
dcast(setDT(dat), date ~ type, fun = list(mean, length), value.var="val")
# date A_mean_val B_mean_val C_mean_val A_length_val B_length_val C_length_val
# 1: 2015-01-01 15 30 50 2 1 1
# 2: 2015-02-02 200 200 300 2 1 2
Installation instructions here.
I'll add the pivot_wider solution, which is meant to replace earlier tidyverse options.
Using pivot_wider with the values_fn option, we can do the following:
library(tidyr) # At least 1.0.0
dat %>% pivot_wider(names_from = type, values_from = val, values_fn = list(val = mean))
#> # A tibble: 2 x 4
#> date A B C
#> <fct> <dbl> <dbl> <dbl>
#> 1 2015-01-01 15 30 50
#> 2 2015-02-02 200 200 300
and
dat %>% pivot_wider(names_from = type, values_from = val, values_fn = list(val = length))
#> # A tibble: 2 x 4
#> date A B C
#> <fct> <int> <int> <int>
#> 1 2015-01-01 2 1 1
#> 2 2015-02-02 2 1 2
Of course, if we want to get fancy, we can do both at once:
library(purrr)
library(rlang)
map(quos(mean, length),
~pivot_wider(dat, names_from = type, values_from = val, values_fn = list(val = eval_tidy(.))))
#> [[1]]
#> # A tibble: 2 x 4
#> date A B C
#> <fct> <dbl> <dbl> <dbl>
#> 1 2015-01-01 15 30 50
#> 2 2015-02-02 200 200 300
#>
#> [[2]]
#> # A tibble: 2 x 4
#> date A B C
#> <fct> <int> <int> <int>
#> 1 2015-01-01 2 1 1
#> 2 2015-02-02 2 1 2
Created on 2019-12-04 by the reprex package (v0.3.0)
Note that if you're concerned about speed, it may be worth updating to the dev version of tidyr.
DATA = data.frame("GROUP" = sort(rep(1:4, 200)),
"TYPE" = rep(1:2, 400),
"TIME" = rep(100:101, 400),
"SCORE" = sample(1:100, 800, replace = TRUE))
Cheers all,
I have 'DATA' and wish to estimate the correlation values of SCORE at each TIME and TYPE combination, between and within GROUP, in this way:
I am assuming you want to compute the correlation between groups 1-2, 1-3, 1-4 and so on for each combination of TIME and TYPE. Here's an approach:
# create the dataset
set.seed(123)
df <- data.frame("group" = sort(rep(1:4, 200)),
"type" = rep(1:2, 400),
"time" = rep(100:101, 400),
"score" = sample(1:100, 800, replace = TRUE))
library(dplyr) # provides as_tibble, rename, mutate, group_by, etc. used below
library(tidyr)
library(purrr)
library(data.table)
# another dataset to filter combinations
# (G1G2 is same G2G1, so remove G2G1)
df2 <- combn(4, 2) %>% t %>%
as_tibble() %>%
rename(group1 = V1, group2 = V2) %>%
mutate(value = TRUE)
df %>%
# add identifiers per group
group_by(time, type, group) %>%
mutate(id = row_number()) %>%
ungroup() %>%
# nest data to get separate tibble for each
# combination of time and type
nest(data = -c(time, type)) %>%
# convert each data.frame to data.table
mutate(dt = map(data, function(dt){
setDT(dt)
setkey(dt, id)
dt
})) %>%
# correlation between groups in R
# refer answer below for more details
# https://stackoverflow.com/a/26357667/15221658
# cartesian join of dts
mutate(dtj = map(dt, ~.[., allow.cartesian = TRUE])) %>%
# compute between group correlation
mutate(cors = map(dtj, ~.[, list(cors = cor(score, i.score)), by = list(group, i.group)])) %>%
# unnest correlation object
unnest(cors) %>%
# formatting for display
select(type, time, group1 = group, group2 = i.group, correlation = cors) %>%
filter(group1 != group2) %>%
arrange(time, group1, group2) %>%
# now use df2 since currently we have G1G2, and G2G1
# which are both equal so remove G2G1
left_join(df2, by = c("group1", "group2")) %>%
filter(value) %>%
select(-value)
# A tibble: 12 x 5
type time group1 group2 correlation
<int> <int> <int> <int> <dbl>
1 1 100 1 2 0.121
2 1 100 1 3 0.0543
3 1 100 1 4 -0.0694
4 1 100 2 3 -0.104
5 1 100 2 4 -0.0479
6 1 100 3 4 -0.0365
7 2 101 1 2 -0.181
8 2 101 1 3 -0.0673
9 2 101 1 4 0.00765
10 2 101 2 3 0.0904
11 2 101 2 4 -0.0126
12 2 101 3 4 -0.154
Here is an alternative approach which creates all unique combinations of TIME, TYPE, and duplicated GROUPs through a cross join and then computes the correlation of SCORE for the corresponding subsets of DATA:
library(data.table) # development version 1.14.3 required
setDT(DATA, key = c("GROUP", "TYPE", "TIME"))[
, CJ(time = TIME, type = TYPE, groupA = GROUP, groupB = GROUP, unique = TRUE)][
groupA < groupB][
, corType := paste0("G", groupA, "G", groupB)][][
, corValue := cor(DATA[.(groupA, type, time), SCORE],
DATA[.(groupB, type, time), SCORE]),
by = .I][]
time type groupA groupB corType corValue
1: 100 1 1 2 G1G2 0.11523940
2: 100 1 1 3 G1G3 -0.05124326
3: 100 1 1 4 G1G4 -0.16943203
4: 100 1 2 3 G2G3 0.05475435
5: 100 1 2 4 G2G4 -0.10769738
6: 100 1 3 4 G3G4 0.01464146
7: 100 2 1 2 G1G2 NA
8: 100 2 1 3 G1G3 NA
9: 100 2 1 4 G1G4 NA
10: 100 2 2 3 G2G3 NA
11: 100 2 2 4 G2G4 NA
12: 100 2 3 4 G3G4 NA
13: 101 1 1 2 G1G2 NA
14: 101 1 1 3 G1G3 NA
15: 101 1 1 4 G1G4 NA
16: 101 1 2 3 G2G3 NA
17: 101 1 2 4 G2G4 NA
18: 101 1 3 4 G3G4 NA
19: 101 2 1 2 G1G2 -0.04997479
20: 101 2 1 3 G1G3 -0.02262932
21: 101 2 1 4 G1G4 -0.00331578
22: 101 2 2 3 G2G3 -0.01243952
23: 101 2 2 4 G2G4 0.16683223
24: 101 2 3 4 G3G4 -0.10556083
time type groupA groupB corType corValue
Explanation
DATA is coerced to class data.table while setting a key on columns GROUP, TYPE, and TIME. Keying is required for fast subsetting later.
The cross join CJ() creates all unique combinations of columns TIME, TYPE, GROUP, and GROUP (twice). The columns of the cross join have been renamed to avoid name clashes later on.
[groupA < groupB] ensures that equivalent combinations of groupA and groupB only appear once, e.g., G2G1 is dropped in favour of G1G2. So, this is a kind of data.table version of t(combn(unique(DATA$GROUP), 2)).
A new column corType is appended by reference.
Finally, the groupwise correlations are computed by stepping rowwise through the cross join table (using by = .I) and subsetting DATA by groupA, type, time and groupB, type, time, resp., using fast subsetting through keys. Please, see the vignette Keys and fast binary search based subset for more details.
Note that by = .I is a new feature of data.table development version 1.14.3.
Combinations of time, type, and group which do not exist in DATA will appear in the result set but are marked by NA in column corValue.
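For readers without the development version of data.table, the same steps can be sketched in base R. This is only an illustration under two assumptions: scores of two groups are paired by their row order within each TIME/TYPE subset, and only TIME/TYPE combinations that actually occur in DATA are evaluated (so no NA rows appear).

```r
set.seed(42)
DATA <- data.frame(GROUP = rep(1:4, each = 200),
                   TYPE  = rep(1:2, 400),
                   TIME  = rep(100:101, 400),
                   SCORE = sample(1:100, 800, replace = TRUE))

# All unordered group pairs: G1G2, G1G3, ..., G3G4
pairs <- t(combn(sort(unique(DATA$GROUP)), 2))

# Only the TIME/TYPE combinations that exist in DATA
combos <- unique(DATA[c("TIME", "TYPE")])

res <- do.call(rbind, lapply(seq_len(nrow(combos)), function(i) {
  sub <- DATA[DATA$TIME == combos$TIME[i] & DATA$TYPE == combos$TYPE[i], ]
  data.frame(
    time     = combos$TIME[i],
    type     = combos$TYPE[i],
    groupA   = pairs[, 1],
    groupB   = pairs[, 2],
    # correlate the scores of the two groups, paired by row order
    corValue = apply(pairs, 1, function(p)
      cor(sub$SCORE[sub$GROUP == p[1]], sub$SCORE[sub$GROUP == p[2]]))
  )
}))
res
```

This trades the keyed subsetting of data.table for plain logical indexing, which is slower on large data but needs no extra packages.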
Data
set.seed(42) # required for reproducible data
DATA = data.frame("GROUP" = sort(rep(1:4, 200)),
"TYPE" = rep(1:2, 400),
"TIME" = rep(100:101, 400),
"SCORE" = sample(1:100, 800, replace = TRUE))
I want to create new rows based on the value of pre-existent rows in my dataset. There are two catches: first, some cell values need to remain constant while others have to increase by +1. Second, I need to cycle through every row the same amount of times.
I think it will be easier to understand with data
Here is where I am starting from:
mydata <- data.frame(id=c(10012000,10012002,10022000,10022002),
col1=c(100,201,44,11),
col2=c("A","C","B","A"))
Here is what I want:
mydata2 <- data.frame(id=c(10012000,10012001,10012002,10012003,10022000,10022001,10022002,10022003),
col1=c(100,100,201,201,44,44,11,11),
col2=c("A","A","C","C","B","B","A","A"))
Note how I add +1 in the id column cell for each new row but col1 and col2 remain constant.
Thank you
library(tidyverse)
mydata |>
mutate(id = map(id, \(x) c(x, x+1))) |>
unnest(id)
#> # A tibble: 8 × 3
#> id col1 col2
#> <dbl> <dbl> <chr>
#> 1 10012000 100 A
#> 2 10012001 100 A
#> 3 10012002 201 C
#> 4 10012003 201 C
#> 5 10022000 44 B
#> 6 10022001 44 B
#> 7 10022002 11 A
#> 8 10022003 11 A
Created on 2022-04-14 by the reprex package (v2.0.1)
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
mydata %>%
group_by(id) %>%
uncount(2) %>%
mutate(id = first(id) + row_number() - 1) %>%
ungroup()
This returns
# A tibble: 8 x 3
id col1 col2
<dbl> <dbl> <chr>
1 10012000 100 A
2 10012001 100 A
3 10012002 201 C
4 10012003 201 C
5 10022000 44 B
6 10022001 44 B
7 10022002 11 A
8 10022003 11 A
library(data.table)
setDT(mydata)
final <- setorder(rbind(copy(mydata), mydata[, id := id + 1]), id)
# id col1 col2
# 1: 10012000 100 A
# 2: 10012001 100 A
# 3: 10012002 201 C
# 4: 10012003 201 C
# 5: 10022000 44 B
# 6: 10022001 44 B
# 7: 10022002 11 A
# 8: 10022003 11 A
I think this should do it:
library(dplyr)
df1 <- arrange(rbind(mutate(mydata, id = id + 1), mydata), id, col2)
Gives:
id col1 col2
1 10012000 100 A
2 10012001 100 A
3 10012002 201 C
4 10012003 201 C
5 10022000 44 B
6 10022001 44 B
7 10022002 11 A
8 10022003 11 A
In base R, for nostalgic reasons:
mydata2 <- as.data.frame(lapply(mydata, function(col) rep(col, each = 2)))
mydata2$id <- mydata2$id + 0:1
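If every row has to be repeated more than twice, the same base R idea generalises. The sketch below wraps it in a function; the name expand_rows and its arguments are made up for illustration.

```r
mydata <- data.frame(id   = c(10012000, 10012002, 10022000, 10022002),
                     col1 = c(100, 201, 44, 11),
                     col2 = c("A", "C", "B", "A"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: repeat each row n times and add 0, 1, ..., n - 1
# to the id column within each block of copies.
expand_rows <- function(df, n, id_col = "id") {
  out <- df[rep(seq_len(nrow(df)), each = n), , drop = FALSE]
  out[[id_col]] <- out[[id_col]] + rep(seq_len(n) - 1, times = nrow(df))
  rownames(out) <- NULL
  out
}

expanded <- expand_rows(mydata, 2)
expanded
```

With n = 2 this should reproduce the desired mydata2 from the question; other columns are copied unchanged.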
Let's say I have a dataframe of Name and Value. Is there any way to extract BOTH the minimum and maximum values within each Name in a single function?
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
# A tibble: 9 x 2
Name Value
<chr> <int>
1 A 27
2 A 37
3 A 57
4 B 89
5 B 20
6 B 86
7 C 97
8 C 62
9 C 58
The output should contain TWO columns only (Name and Value).
Thanks in advance!
You can use range to get max and min value and use it in summarise to get different rows for each Name.
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Value = range(Value), .groups = "drop")
# Name Value
# <chr> <int>
#1 A 27
#2 A 57
#3 B 20
#4 B 89
#5 C 58
#6 C 97
If you have a large dataset, using data.table might be faster.
library(data.table)
setDT(df)[, .(Value = range(Value)), Name]
You can use dplyr::group_by() and dplyr::summarise() like this:
library(dplyr)
set.seed(1)
df <- tibble(Name = rep(LETTERS[1:3], each = 3), Value = sample(1:100, 9))
df %>%
group_by(Name) %>%
summarise(
maximum = max(Value),
minimum = min(Value)
)
This outputs:
# A tibble: 3 × 3
Name maximum minimum
<chr> <int> <int>
1 A 68 1
2 B 87 34
3 C 82 14
What's a little odd is that my original df object looks a little different than yours, in spite of the seed:
# A tibble: 9 × 2
Name Value
<chr> <int>
1 A 68
2 A 39
3 A 1
4 B 34
5 B 87
6 B 43
7 C 14
8 C 82
9 C 59
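A likely explanation for the mismatch is the R version rather than the code: R 3.6.0 changed the default algorithm used by sample(), so the same seed gives different draws before and after that release. The sketch below compares the two samplers; it assumes R 3.6.0 or later, where the sample.kind argument exists.

```r
# Default sampler in R >= 3.6.0 ("Rejection")
set.seed(1)
new_draw <- sample(1:100, 9)

# Pre-3.6.0 sampler ("Rounding"); R warns that it is non-uniform,
# hence the suppressWarnings()
suppressWarnings(set.seed(1, sample.kind = "Rounding"))
old_draw <- sample(1:100, 9)

# Restore the default sampler for the rest of the session
RNGkind(sample.kind = "Rejection")

new_draw
old_draw
```

If the two vectors match the two versions of df shown above, the difference is down to the sampler and both answers are computing the same thing.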
I'm currently using rbind() together with slice_min() and slice_max(), but I think it may not be the best way or the most efficient way when the dataframe contains millions of rows.
library(tidyverse)
rbind(df %>% group_by(Name) %>% slice_max(Value),
df %>% group_by(Name) %>% slice_min(Value)) %>%
arrange(Name)
# A tibble: 6 x 2
# Groups: Name [3]
Name Value
<chr> <int>
1 A 57
2 A 27
3 B 89
4 B 20
5 C 97
6 C 58
In base R, the output format can be created with tapply/stack: do a grouped tapply to get the output as a named list of ranges, stack it into a two-column data.frame, and change the column names if needed.
setNames(stack(with(df, tapply(Value, Name, FUN = range)))[2:1], names(df))
Name Value
1 A 27
2 A 57
3 B 20
4 B 89
5 C 58
6 C 97
Using aggregate.
aggregate(Value ~ Name, df, range)
# Name Value.1 Value.2
# 1 A 1 68
# 2 B 34 87
# 3 C 14 82
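Note that aggregate stores the two range values in a single matrix column, so the result is wide. A minimal sketch of turning it into the two-column Name/Value layout the question asks for is below; the values are the ones from the question's df, typed in by hand so that the snippet does not depend on the random seed.

```r
df <- data.frame(Name  = rep(LETTERS[1:3], each = 3),
                 Value = c(27, 37, 57, 89, 20, 86, 97, 62, 58),
                 stringsAsFactors = FALSE)

agg <- aggregate(Value ~ Name, df, range)  # agg$Value is a 3 x 2 matrix (min, max)

# Interleave min and max per Name: transpose, then read column by column
long <- data.frame(Name  = rep(agg$Name, each = 2),
                   Value = as.vector(t(agg$Value)))
long
```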
I want to perform multiple joins to the original dataframe, from the same source with different IDs each time. Specifically, I only need to do two joins, but when I perform the second join, the columns being joined already exist in the input df. Rather than adding these columns with new names using the .x/.y suffixes, I want to sum the values into the existing columns. See the code below for the desired output.
# Input data:
values <- tibble(
id = LETTERS[1:10],
variable1 = 1:10,
variable2 = (1:10)*10
)
df <- tibble(
twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J")
)
> values
# A tibble: 10 x 3
id variable1 variable2
<chr> <int> <dbl>
1 A 1 10
2 B 2 20
3 C 3 30
4 D 4 40
5 E 5 50
6 F 6 60
7 G 7 70
8 H 8 80
9 I 9 90
10 J 10 100
> df
# A tibble: 5 x 1
twin_id
<chr>
1 A/F
2 B/G
3 C/H
4 D/I
5 E/J
So this is the two joins:
joined_df <- df %>%
tidyr::separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
left_join(values, by = c("left_id" = "id")) %>%
left_join(values, by = c("right_id" = "id"))
> joined_df
# A tibble: 5 x 7
twin_id left_id right_id variable1.x variable2.x variable1.y variable2.y
<chr> <chr> <chr> <int> <dbl> <int> <dbl>
1 A/F A F 1 10 6 60
2 B/G B G 2 20 7 70
3 C/H C H 3 30 8 80
4 D/I D I 4 40 9 90
5 E/J E J 5 50 10 100
And this is the output I want, using the only way I can see to get it:
output_df_wanted <- joined_df %>%
mutate(
variable1 = variable1.x + variable1.y,
variable2 = variable2.x + variable2.y) %>%
select(twin_id, left_id, right_id, variable1, variable2)
> output_df_wanted
# A tibble: 5 x 5
twin_id left_id right_id variable1 variable2
<chr> <chr> <chr> <int> <dbl>
1 A/F A F 7 70
2 B/G B G 9 90
3 C/H C H 11 110
4 D/I D I 13 130
5 E/J E J 15 150
I can see how to get what I want using a mutate statement, but I will have a much larger number of variables in the actual dataset. I am wondering if this is the best way to do this.
You can try reshaping your data and using dplyr::summarise_at:
library(tidyr)
library(dplyr)
df %>%
separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
pivot_longer(-twin_id) %>%
left_join(values, by = c("value" = "id")) %>%
group_by(twin_id) %>%
summarise_at(vars(starts_with("variable")), sum) %>%
separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE)
## A tibble: 5 x 5
# twin_id left_id right_id variable1 variable2
# <chr> <chr> <chr> <int> <dbl>
#1 A/F A F 7 70
#2 B/G B G 9 90
#3 C/H C H 11 110
#4 D/I D I 13 130
#5 E/J E J 15 150
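When the number of variable columns is large, another option is to skip the joins and add the matched rows directly. The following base R sketch assumes every id in twin_id exists in values and that all non-id columns are numeric; var_cols is simply "everything except id".

```r
values <- data.frame(id = LETTERS[1:10],
                     variable1 = 1:10,
                     variable2 = (1:10) * 10,
                     stringsAsFactors = FALSE)
df <- data.frame(twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J"),
                 stringsAsFactors = FALSE)

# Split "A/F" into a two-column matrix of left and right ids
ids <- do.call(rbind, strsplit(df$twin_id, "/", fixed = TRUE))

# Every column except the key is summed, however many there are
var_cols <- setdiff(names(values), "id")

# Look up the left and right rows and add them element-wise
summed <- values[match(ids[, 1], values$id), var_cols] +
          values[match(ids[, 2], values$id), var_cols]

out <- cbind(df, left_id = ids[, 1], right_id = ids[, 2], summed)
rownames(out) <- NULL
out
```

Because the columns to sum are derived from names(values), adding more variables to values needs no change to the code.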
You can use my package safejoin if it's acceptable to you to use a github package.
The idea is that you have conflicting columns. dplyr and base R deal with conflicts by renaming the columns, while safejoin is more flexible: you can supply the function to apply in case of conflicts. Here you want to add them, so we'll use conflict = `+`; for the same effect you could have used conflict = ~ .x + .y or conflict = ~ ..1 + ..2.
# remotes::install_github("moodymudskipper/safejoin")
library(tidyverse)
library(safejoin)
values <- tibble(
id = LETTERS[1:10],
variable1 = 1:10,
variable2 = (1:10)*10
)
df <- tibble(
twin_id = c("A/F", "B/G", "C/H", "D/I", "E/J")
)
joined_df <- df %>%
tidyr::separate(col = twin_id, into = c("left_id", "right_id"), sep = "/", remove = FALSE) %>%
left_join(values, by = c("left_id" = "id")) %>%
safe_left_join(values, by = c("right_id" = "id"), conflict = `+`)
joined_df
#> # A tibble: 5 x 5
#> twin_id left_id right_id variable1 variable2
#> <chr> <chr> <chr> <int> <dbl>
#> 1 A/F A F 7 70
#> 2 B/G B G 9 90
#> 3 C/H C H 11 110
#> 4 D/I D I 13 130
#> 5 E/J E J 15 150
Created on 2020-04-29 by the reprex package (v0.3.0)
I have a table sf with 2 fields: Customer id and Buy_date. Buy_date is unique for each customer, but there can be more than 3 different values of Buy_date for each customer. I want to calculate the difference between consecutive Buy_date values for each Customer, and its mean value. How can I do this?
Example
Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13
I want the results for each customer in the format
Customer mean
Here's a dplyr solution.
Your data:
df <- data.frame(Customer = c(1,1,1,1,2,2,2), Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10", "2018/01/02", "2018/02/10", "2018/04/13"))
Grouping, mean Buy_date calculation and summarising:
library(dplyr)
df %>% group_by(Customer) %>% mutate(mean = mean(as.POSIXct(Buy_date))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <dttm>
1 1 2018-03-31 06:30:00
2 2 2018-02-17 15:40:00
Or, as @r2evans points out in his comment, for the consecutive days between Buy_dates:
df %>% group_by(Customer) %>% mutate(mean = mean(diff(as.POSIXct(Buy_date)))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <time>
1 1 23.3194444444444
2 2 50.4791666666667
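The fractional values above (23.319 rather than 23.333) are probably an effect of a daylight-saving change inside the date range, since POSIXct differences are measured in elapsed time. If whole calendar days are what is wanted, a sketch using Date instead of POSIXct, in base R only, could look like this:

```r
sf <- data.frame(Customer = c(1, 1, 1, 1, 2, 2, 2),
                 Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10",
                              "2018/01/02", "2018/02/10", "2018/04/13"),
                 stringsAsFactors = FALSE)

# Dates have no time-of-day component, so differences are whole days
sf$Buy_date <- as.Date(sf$Buy_date, format = "%Y/%m/%d")

# Mean gap in days between consecutive purchases, per customer
gaps <- tapply(sf$Buy_date, sf$Customer,
               function(d) mean(as.numeric(diff(sort(d)))))

res <- data.frame(Customer = as.numeric(names(gaps)),
                  mean     = as.vector(gaps))
res
```

Sorting inside the function means the input does not have to be ordered by date beforehand.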
I am not exactly sure of the desired output, but this is what I think you want.
library(dplyr)
library(zoo)
dat <- read.table(text =
"Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13", header = T, stringsAsFactors = F)
dat$Buy_date <- as.Date(dat$Buy_date)
dat %>% group_by(Customer) %>% mutate(diff_between = as.vector(diff(zoo(Buy_date), na.pad=TRUE)),
mean_days = mean(diff_between, na.rm = TRUE))
This produces:
Customer Buy_date diff_between mean_days
<int> <date> <dbl> <dbl>
1 1 2018-03-01 NA 23.3
2 1 2018-03-19 18 23.3
3 1 2018-04-03 15 23.3
4 1 2018-05-10 37 23.3
5 2 2018-01-02 NA 50.5
6 2 2018-02-10 39 50.5
7 2 2018-04-13 62 50.5
EDITED BASED ON USER COMMENTS:
Because you said that you have factors and not characters just convert them by doing the following:
dat$Buy_date <- as.Date(as.character(dat$Buy_date))
dat$Customer <- as.character(dat$Customer)