I need to merge two data frames.
The first data frame contains dates in YYYY-mm-dd format and event lengths:
datetime length
2003-06-03 1
2003-06-07 1
2003-06-13 1
2003-06-17 3
2003-06-28 5
2003-07-10 1
2003-07-23 1
...
The second data frame contains dates in the same format and discharge data:
datetime q
2003-05-29 36.2
2003-05-30 34.6
2003-05-31 33.1
2003-06-01 30.7
2003-06-02 30.0
2003-06-03 153.0
2003-06-04 69.0
...
The second data frame is much larger.
I want to merge/join only the following rows of the second data frame to the first:
all rows that have the same date as the first frame (I know this can be done with left_join(df1, df2, by = c("datetime")))
two rows before that row
n-1 rows after that row, where n = "length" value of row in first data frame.
I would like to identify the rows belonging to the same event as well.
Ideally I would have the following output (note the event from 2003-06-17):
EventDatesNancy length q event#
2003-06-03 1 153.0 1
2003-06-07 1 120.0 2
2003-06-13 1 45.3 3
2003-06-15 na 110.0 4
2003-06-16 na 53.1 4
2003-06-17 3 78.0 4
2003-06-18 na 167.0 4
2003-06-19 na 145.0 4
...
I hope this makes clear what I am trying to do.
This might be one approach using tidyverse and fuzzyjoin.
First, number the events in your first data.frame. Then add two columns giving each event's date window: start_date is 2 days before the event date, and end_date is length - 1 days after it.
Then you can use fuzzy_inner_join to select the matching rows from the second data.frame. A row matches when its datetime falls on or after the start date and on or before the end date of an event in the first data.frame.
library(tidyverse)
library(fuzzyjoin)
# Number the events and compute each event's window (datetime must be of class Date)
df1$event <- seq_len(nrow(df1))
df1$start_date <- df1$datetime - 2
df1$end_date <- df1$datetime + df1$length - 1
# Keep the df2 rows where start_date <= datetime <= end_date
fuzzy_inner_join(
  df1,
  df2,
  by = c("start_date" = "datetime", "end_date" = "datetime"),
  match_fun = list(`<=`, `>=`)
) %>%
  select(datetime.y, length, q, event)
I tried this out with some made up data:
R> df1
datetime length
1 2003-06-03 1
2 2003-06-12 1
3 2003-06-21 1
4 2003-06-30 3
5 2003-07-09 5
6 2003-07-18 1
7 2003-07-27 1
8 2003-08-05 2
9 2003-08-14 1
10 2003-08-23 1
11 2003-09-01 3
R> df2
datetime q
1 2003-06-03 44
2 2003-06-04 52
3 2003-06-05 34
4 2003-06-06 20
5 2003-06-07 57
6 2003-06-08 67
7 2003-06-09 63
8 2003-06-10 51
9 2003-06-11 56
10 2003-06-12 37
11 2003-06-13 16
12 2003-06-14 54
13 2003-06-15 46
14 2003-06-16 6
15 2003-06-17 32
16 2003-06-18 91
17 2003-06-19 61
18 2003-06-20 42
19 2003-06-21 28
20 2003-06-22 98
21 2003-06-23 77
22 2003-06-24 81
23 2003-06-25 13
24 2003-06-26 15
25 2003-06-27 73
26 2003-06-28 38
27 2003-06-29 27
28 2003-06-30 49
29 2003-07-01 10
30 2003-07-02 89
31 2003-07-03 9
32 2003-07-04 80
33 2003-07-05 68
34 2003-07-06 26
35 2003-07-07 31
36 2003-07-08 29
37 2003-07-09 84
38 2003-07-10 60
39 2003-07-11 19
40 2003-07-12 97
41 2003-07-13 35
42 2003-07-14 47
43 2003-07-15 70
This will give the following output:
datetime.y length q event
1 2003-06-03 1 44 1
2 2003-06-10 1 51 2
3 2003-06-11 1 56 2
4 2003-06-12 1 37 2
5 2003-06-19 1 61 3
6 2003-06-20 1 42 3
7 2003-06-21 1 28 3
8 2003-06-28 3 38 4
9 2003-06-29 3 27 4
10 2003-06-30 3 49 4
11 2003-07-01 3 10 4
12 2003-07-02 3 89 4
13 2003-07-07 5 31 5
14 2003-07-08 5 29 5
15 2003-07-09 5 84 5
16 2003-07-10 5 60 5
17 2003-07-11 5 19 5
18 2003-07-12 5 97 5
19 2003-07-13 5 35 5
If the output desired is different than above, please let me know what should be different so that I can correct it.
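If length should be NA on every row except the event date itself, as in the desired output in the question, a base R loop over the events is one possible sketch. The toy data below (including the q values) is made up for illustration, and the window rule (two days before, length - 1 days after) is taken from the question. It assumes df2 has one row per day, so that "rows" and "days" coincide.

```r
# Toy data, made up for illustration
df1 <- data.frame(
  datetime = as.Date(c("2003-06-03", "2003-06-17")),
  length   = c(1, 3)
)
df2 <- data.frame(
  datetime = seq(as.Date("2003-06-01"), as.Date("2003-06-20"), by = "day"),
  q        = 1:20
)

# For each event, pull the df2 rows inside its window and tag them
windows <- lapply(seq_len(nrow(df1)), function(i) {
  start <- df1$datetime[i] - 2
  end   <- df1$datetime[i] + df1$length[i] - 1
  rows  <- df2[df2$datetime >= start & df2$datetime <= end, ]
  # length is kept only on the event date itself, NA elsewhere
  rows$length <- ifelse(rows$datetime == df1$datetime[i], df1$length[i], NA)
  rows$event  <- i
  rows
})
result <- do.call(rbind, windows)
result
```

Overlapping windows would make a df2 row appear once per event it belongs to, which may or may not be what you want.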
Data
df1 <- structure(list(datetime = structure(c(12206, 12215, 12224, 12233,
12242, 12251, 12260, 12269, 12278, 12287, 12296), class = "Date"),
length = c(1, 1, 1, 3, 5, 1, 1, 2, 1, 1, 3), event = 1:11,
start_date = structure(c(12204, 12213, 12222, 12231, 12240,
12249, 12258, 12267, 12276, 12285, 12294), class = "Date"),
end_date = structure(c(12206, 12215, 12224, 12235, 12246,
12251, 12260, 12270, 12278, 12287, 12298), class = "Date")), row.names = c(NA,
-11L), class = "data.frame")
df2 <- structure(list(datetime = structure(c(12206, 12207, 12208, 12209,
12210, 12211, 12212, 12213, 12214, 12215, 12216, 12217, 12218,
12219, 12220, 12221, 12222, 12223, 12224, 12225, 12226, 12227,
12228, 12229, 12230, 12231, 12232, 12233, 12234, 12235, 12236,
12237, 12238, 12239, 12240, 12241, 12242, 12243, 12244, 12245,
12246, 12247, 12248), class = "Date"), q = c(44L, 52L, 34L, 20L,
57L, 67L, 63L, 51L, 56L, 37L, 16L, 54L, 46L, 6L, 32L, 91L, 61L,
42L, 28L, 98L, 77L, 81L, 13L, 15L, 73L, 38L, 27L, 49L, 10L, 89L,
9L, 80L, 68L, 26L, 31L, 29L, 84L, 60L, 19L, 97L, 35L, 47L, 70L
)), class = "data.frame", row.names = c(NA, -43L))
I would like to calculate the cumulative sum across years for each flower in R. Below is an example of how the table looks (Table 1) and what I want the outcome to be (Table 2). I know how to do the calculation in a long table format, but I am not sure how to do it in a wide table format. Note: I am using the dplyr package.
(Table 1)
flower 1902 1950 2010 2012 2021
lily     23    0    0    8    5
rose     50   60    5   16    0
daisy    30    7   10    2    0
I need to calculate the running (cumulative) sum for each flower across the years. The end result should give me:
(Table 2)
flower 1902 1950 2010 2012 2021
lily     23   23   23   31   36
rose     50  110  115  131  131
daisy    30   37   47   49   49
One option involving dplyr and purrr might be:
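library(dplyr)
library(purrr)
# Take the cumulative sum across all columns except the first, row by row
dat %>%
  mutate(pmap_dfr(across(-1), ~ cumsum(c(...))))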
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
Using rowCumsums from matrixStats
library(matrixStats)
df1[-1] <- rowCumsums(as.matrix(df1[-1]))
-output
df1
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
Here is one way of getting your expected result:
Your data frame :
dat <- structure(list(flower = c("lily", "rose", "daisy"), X1902 = c(23L,
50L, 30L), X1950 = c(0L, 60L, 7L), X2010 = c(0L, 5L, 10L), X2012 = c(8L,
16L, 2L), X2021 = c(5L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
Apply a function that calculates the cumulative sum to each row of columns 2 to 6:
dat[1:nrow(dat), 2:6] <- t(apply(dat[1:nrow(dat), 2:6], 1, function(x) cumsum(c(x))))
# The result
dat
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
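The t() call is needed because apply() over rows returns each row's result as a column. A minimal sketch on a made-up matrix (the values are for illustration only) shows the shapes involved:

```r
# Made-up 2 x 3 matrix
m <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 2, byrow = TRUE)

raw   <- apply(m, 1, cumsum)  # 3 x 2: one column per original row
fixed <- t(raw)               # 2 x 3: back to the original orientation

dim(raw)
dim(fixed)
fixed
```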
@benson23 has kindly suggested the following simpler code to get the same result:
dat[, 2:6] <- t(apply(dat[,2:6], 1, cumsum))
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
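You can use apply with cumsum, plus a little bit of re-formatting. Binding the one-column data frame df[1] (rather than the vector df[, 1]) keeps the cumulative sums numeric instead of coercing them to character.
cbind(df[1], t(apply(df[, -1], 1, cumsum)))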
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
Data
df <- structure(list(flower = c("lily", "rose", "daisy"), X1902 = c(23L,
50L, 30L), X1950 = c(0L, 60L, 7L), X2010 = c(0L, 5L, 10L), X2012 = c(8L,
16L, 2L), X2021 = c(5L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
Here is an alternative using pivoting:
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(-flower) %>%
group_by(flower) %>%
mutate(value = cumsum(value)) %>%
pivot_wider() %>%
ungroup()
flower X1902 X1950 X2010 X2012 X2021
<chr> <int> <int> <int> <int> <int>
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
I have a data set that contains some missing values, which can be filled in by merging with another data set. My example:
This is the updated data set I am working with.
DF1
Name Paper Book Mug soap computer tablet coffee coupons
1 2 3 4 5 6 7 8 9
2 21 22 23 23 23 7 23 9
3 56 57 58 59 60 7 62 9
4 80.33333 81.33333 82.33333 83 83.66667 7 85 9
5 107.3333 108.3333 109.3333 110 110.6667 7 112 9
6 134.3333 135.3333 136.3333 137 137.6667 7 139 9
7 161.3333 162.3333 163.3333 164 164.6667
8 188.3333 189.3333 190.3333 191 191.6667 7 193 9
9 215.3333 216.3333 217.3333 218 218.6667 7 220 9
10 242.3333 243.3333 244.3333 245 245.6667 7 247 9
11 269.3333 270.3333 271.3333 272 272.6667 7 274 9
12 296.3333 297.3333 298.3333 299 299.6667
13 323.3333 324.3333 325.3333 326 326.6667 7 328 9
14 350.3333 351.3333 352.3333 353 353.6667 7 355 9
15 377.3333 378.3333 379.3333 380 380.6667
16 404.3333 405.3333 406.3333 407 407.6667 7 409 9
17 431.3333 432.3333 433.3333 434 434.6667 7 436 9
18 458.3333 459.3333 460.3333 461 461.6667 7 463 9
19 485.3333 486.3333 487.3333 488 488.6667
DF2
Name Paper Book Mug soap computer tablet coffee coupons
7 161.3333 162.3333 163.3333 164 164.6667 6 6 6
12 296.3333 297.3333 298.3333 299 299.6667 88 96 25
15 377.3333 378.3333 379.3333 380 380.6667 88 62 25
19 485.3333 486.3333 487.3333 488 488.6667 88 88 78
I want to get:
Name Paper Book Mug soap computer tablet coffee coupons
1 2 3 4 5 6 7 8 9
2 21 22 23 23 23 7 23 9
3 56 57 58 59 60 7 62 9
4 80.33333 81.33333 82.33333 83 83.66667 7 85 9
5 107.3333 108.3333 109.3333 110 110.6667 7 112 9
6 134.3333 135.3333 136.3333 137 137.6667 7 139 9
7 161.3333 162.3333 163.3333 164 164.6667 6 6 6
8 188.3333 189.3333 190.3333 191 191.6667 7 193 9
9 215.3333 216.3333 217.3333 218 218.6667 7 220 9
10 242.3333 243.3333 244.3333 245 245.6667 7 247 9
11 269.3333 270.3333 271.3333 272 272.6667 7 274 9
12 296.3333 297.3333 298.3333 299 299.6667 88 96 25
13 323.3333 324.3333 325.3333 326 326.6667 7 328 9
14 350.3333 351.3333 352.3333 353 353.6667 7 355 9
15 377.3333 378.3333 379.3333 380 380.6667 88 62 25
16 404.3333 405.3333 406.3333 407 407.6667 7 409 9
17 431.3333 432.3333 433.3333 434 434.6667 7 436 9
18 458.3333 459.3333 460.3333 461 461.6667 7 463 9
19 485.3333 486.3333 487.3333 488 488.6667 88 88 78
I have tried the following code:
DF1[,c(4:6)][is.na(DF1[,c(4:6)]<-DF2[,c(2:4)][match(DF1[,1],DF2[,1])]
[which(is.na(DF1[,c(4:6)]))]
One of the dplyr solutions works if I omit the columns that are already complete. I am not sure whether this is due to my version of dplyr, which I updated last week.
Any help is greatly appreciated! Thanks!
We can do a left join and then coalesce the columns
library(dplyr)
DF1 %>%
left_join(DF2, by = c('NameVar')) %>%
transmute(NameVar, Var1, Var2,
Var3 = coalesce(Var3.x, Var3.y),
Var4 = coalesce(Var4.x, Var4.y),
Var5 = coalesce(Var5.x, Var5.y))
-output
# NameVar Var1 Var2 Var3 Var4 Var5
#1 Sub1 30 45 40 34 65
#2 Sub2 25 30 30 45 45
#3 Sub3 74 34 25 30 49
#4 Sub4 30 45 40 34 65
#5 Sub5 25 30 69 56 72
#6 Sub6 74 34 74 34 60
Or using data.table
library(data.table)
nm1 <- setdiff(intersect(names(DF1), names(DF2)), 'NameVar')
setDT(DF1)[DF2, (nm1) := Map(fcoalesce, mget(nm1),
mget(paste0("i.", nm1))), on = .(NameVar)]
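A base R variant of the same idea, in case neither package is available, might look like the sketch below. It fills only the NA cells of the columns the two data frames share, using match() on the key. The toy data is a cut-down version of the data in this answer, and the names idx, shared, and missing are introduced here for illustration.

```r
# Cut-down toy data
DF1 <- data.frame(NameVar = c("Sub1", "Sub2", "Sub3"),
                  Var3 = c(40L, NA, NA),
                  Var4 = c(34L, NA, NA))
DF2 <- data.frame(NameVar = c("Sub2", "Sub3"),
                  Var3 = c(30L, 25L),
                  Var4 = c(45L, 30L))

# Row of DF2 that corresponds to each row of DF1 (NA if there is none)
idx <- match(DF1$NameVar, DF2$NameVar)
# Columns present in both data frames, excluding the key
shared <- setdiff(intersect(names(DF1), names(DF2)), "NameVar")

for (col in shared) {
  missing <- is.na(DF1[[col]])
  # Fill only the missing cells; unmatched rows stay NA
  DF1[[col]][missing] <- DF2[[col]][idx[missing]]
}
DF1
```

Columns that are already complete are left untouched, since the assignment only targets cells where is.na() is TRUE.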
data
DF1 <- structure(list(NameVar = c("Sub1", "Sub2", "Sub3", "Sub4", "Sub5",
"Sub6"), Var1 = c(30L, 25L, 74L, 30L, 25L, 74L), Var2 = c(45L,
30L, 34L, 45L, 30L, 34L), Var3 = c(40L, NA, NA, 40L, 69L, NA),
Var4 = c(34L, NA, NA, 34L, 56L, NA), Var5 = c(65L, NA, NA,
65L, 72L, NA)), class = "data.frame", row.names = c(NA, -6L
))
DF2 <- structure(list(NameVar = c("Sub2", "Sub3", "Sub6"), Var3 = c(30L,
25L, 74L), Var4 = c(45L, 30L, 34L), Var5 = c(45L, 49L, 60L)),
class = "data.frame", row.names = c(NA,
-3L))
I am trying to merge my sales data and patients data in R (and some other attributes) which are rolled-up at the country level for the same time-frame. After merging, I want to consolidate it to a long format instead of wide format and keep it unique at the Country-Month level.
This is what my input data looks like:
1) Sales Data
Coutry_ID Country_Name 1/28/2018 2/28/2018 3/28/2018 4/28/2018 5/28/2018
A0001 USA 44 72 85 25 72
A0002 Germany 98 70 69 48 41
A0003 Russia 82 42 32 29 43
A0004 UK 79 83 51 48 47
A0005 France 45 75 10 13 23
A0006 India 92 85 28 13 18
2) Patients Data
Coutry_ID Country_Name 1/28/2018 2/28/2018 3/28/2018 4/28/2018 5/28/2018
A0001 USA 7 13 22 23 13
A0002 Germany 9 10 17 25 25
A0003 Russia 24 19 6 8 5
A0004 UK 6 8 20 1 11
A0005 France 4 9 8 10 25
A0006 India 18 21 2 13 17
And this is what I intend the output to look like:
Coutry_ID Country_Name Month Sales Patients
A0001 USA 1/28/2018 44 7
A0001 USA 2/28/2018 72 13
A0001 USA 3/28/2018 85 22
A0001 USA 4/28/2018 25 23
A0001 USA 5/28/2018 72 13
A0002 Germany 1/28/2018 98 9
A0002 Germany 2/28/2018 70 10
A0002 Germany 3/28/2018 69 17
A0002 Germany 4/28/2018 48 25
A0002 Germany 5/28/2018 41 25
A0003 Russia 1/28/2018 82 24
A0003 Russia 2/28/2018 42 19
A0003 Russia 3/28/2018 32 6
A0003 Russia 4/28/2018 29 8
A0003 Russia 5/28/2018 43 5
A0004 UK 1/28/2018 79 6
A0004 UK 2/28/2018 83 8
A0004 UK 3/28/2018 51 20
A0004 UK 4/28/2018 48 1
A0004 UK 5/28/2018 47 11
A0005 France 1/28/2018 45 4
A0005 France 2/28/2018 75 9
A0005 France 3/28/2018 10 8
A0005 France 4/28/2018 13 10
A0005 France 5/28/2018 23 25
A0006 India 1/28/2018 92 18
A0006 India 2/28/2018 85 21
A0006 India 3/28/2018 28 2
A0006 India 4/28/2018 13 13
A0006 India 5/28/2018 18 17
I need a little guidance on these two things:
1 - How do I convert the data from wide to long?
2 - For merging the data, I am thinking about using dplyr's left_join on all these data sets with my master list of countries (ID and name). Should I convert the data sets from wide to long format before merging, or after?
You can convert both data frames to long format and then join them:
library(dplyr)
library(tidyr)
inner_join(
sales %>% pivot_longer(cols = -c(Coutry_ID, Country_Name), values_to = 'Sales'),
patients %>% pivot_longer(cols = -c(Coutry_ID, Country_Name),
values_to = 'Patients'),
by = c("Coutry_ID", "Country_Name", "name"))
# A tibble: 30 x 5
# Coutry_ID Country_Name name Sales Patients
# <fct> <fct> <chr> <int> <int>
# 1 A0001 USA 1/28/2018 44 7
# 2 A0001 USA 2/28/2018 72 13
# 3 A0001 USA 3/28/2018 85 22
# 4 A0001 USA 4/28/2018 25 23
# 5 A0001 USA 5/28/2018 72 13
# 6 A0002 Germany 1/28/2018 98 9
# 7 A0002 Germany 2/28/2018 70 10
# 8 A0002 Germany 3/28/2018 69 17
# 9 A0002 Germany 4/28/2018 48 25
#10 A0002 Germany 5/28/2018 41 25
# … with 20 more rows
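To get a column called Month rather than the default name, and to store it as a real date, the names_to argument of pivot_longer and as.Date can be combined. The sketch below uses a two-country, two-month subset of the data. The helper to_long is a hypothetical name introduced here, not part of either package.

```r
library(dplyr)
library(tidyr)

# Two-country, two-month subset of the data in this answer
sales <- data.frame(Coutry_ID = c("A0001", "A0002"),
                    Country_Name = c("USA", "Germany"),
                    `1/28/2018` = c(44L, 98L),
                    `2/28/2018` = c(72L, 70L),
                    check.names = FALSE)
patients <- data.frame(Coutry_ID = c("A0001", "A0002"),
                       Country_Name = c("USA", "Germany"),
                       `1/28/2018` = c(7L, 9L),
                       `2/28/2018` = c(13L, 10L),
                       check.names = FALSE)

# Hypothetical helper: wide -> long with a Date-typed Month column
to_long <- function(df, value_name) {
  df %>%
    pivot_longer(cols = -c(Coutry_ID, Country_Name),
                 names_to = "Month", values_to = value_name) %>%
    mutate(Month = as.Date(Month, format = "%m/%d/%Y"))
}

result <- inner_join(to_long(sales, "Sales"),
                     to_long(patients, "Patients"),
                     by = c("Coutry_ID", "Country_Name", "Month"))
result
```

Converting to long format before joining keeps the join key simple: one row per country and month in each table.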
data
sales <- structure(list(Coutry_ID = structure(1:6, .Label = c("A0001",
"A0002", "A0003", "A0004", "A0005", "A0006"), class = "factor"),
Country_Name = structure(c(6L, 2L, 4L, 5L, 1L, 3L), .Label = c("France",
"Germany", "India", "Russia", "UK", "USA"), class = "factor"),
`1/28/2018` = c(44L, 98L, 82L, 79L, 45L, 92L), `2/28/2018` = c(72L,
70L, 42L, 83L, 75L, 85L), `3/28/2018` = c(85L, 69L, 32L,
51L, 10L, 28L), `4/28/2018` = c(25L, 48L, 29L, 48L, 13L,
13L), `5/28/2018` = c(72L, 41L, 43L, 47L, 23L, 18L)), class =
"data.frame", row.names = c(NA, -6L))
patients <- structure(list(Coutry_ID = structure(1:6, .Label = c("A0001",
"A0002", "A0003", "A0004", "A0005", "A0006"), class = "factor"),
Country_Name = structure(c(6L, 2L, 4L, 5L, 1L, 3L), .Label = c("France",
"Germany", "India", "Russia", "UK", "USA"), class = "factor"),
`1/28/2018` = c(7L, 9L, 24L, 6L, 4L, 18L), `2/28/2018` = c(13L,
10L, 19L, 8L, 9L, 21L), `3/28/2018` = c(22L, 17L, 6L, 20L,
8L, 2L), `4/28/2018` = c(23L, 25L, 8L, 1L, 10L, 13L), `5/28/2018` = c(13L,
25L, 5L, 11L, 25L, 17L)), class = "data.frame", row.names = c(NA, -6L))
Base R (not as elegant as above):
# Create a named list of dataframes:
df_list <- list(patients = patients, sales = sales)
# Create a vector in each with the name of the dataframe:
df_list <- mapply(cbind, df_list, "desc" = as.character(names(df_list)),
SIMPLIFY = FALSE)
# Define a function to reshape the data:
reshape_ps <- function(x){
tmp <- setNames(reshape(x,
direction = "long",
varying = which(names(x) %in% names(x[,sapply(x, is.numeric)])),
idvar = c(!(names(x) %in% names(x[,sapply(x, is.numeric)]))),
v.names = "month",
times = as.Date(names(x[,sapply(x, is.numeric)]), "%m/%d/%Y"),
new.row.names = 1:(nrow(x)*length(which(names(x) %in% names(x[,sapply(x, is.numeric)]))))),
c(names(x[!(names(x) %in% names(x[,sapply(x, is.numeric)]))]), "month", as.character(unique(x$desc))))
# Drop the dataframe name vector:
clean <- tmp[,names(tmp) != "desc"]
# Specify the return object:
return(clean)
}
# Merge the result of the function applied on both dataframes:
Reduce(function(y, z){merge(y, z, by = intersect(colnames(y), colnames(z)), all = TRUE)},
Map(function(x){reshape_ps(x)}, df_list))
I have a data frame with 10 variables, all of them numeric. One of the variables is age, and I want to group the observations by age: for example, ages 17 to 18 in one group and 19 to 22 in another, with each row assigned to its group. The result should be a data frame for further manipulation.
Model of the dataframe:
A B AGE
25 50 17
30 42 22
50 60 19
65 105 17
355 400 21
68 47 20
115 98 18
25 75 19
And I want a result like:
17-18
A B AGE
25 50 17
65 105 17
115 98 18
19-22
A B AGE
30 42 22
50 60 19
355 400 21
68 47 20
25 75 19
I grouped the data set by the AGE variable using the split function; now my concern is how to manipulate the grouped data. For example, the result looked like:
$1
A B AGE
25 50 17
65 105 17
115 98 18
$2
A B AGE
30 42 22
50 60 19
355 400 21
68 47 20
25 75 19
My question is: how can I access each group for further manipulation?
For example, what if I want to run a t-test on each group separately?
The split function works with data frames. Use either cut with 'breaks' or findInterval with an appropriate set of cutpoints (named 'vec' if you are using named parameters) as the grouping criterion, which is the second argument to split. By default, cut uses intervals closed on the right, while findInterval uses intervals closed on the left. The cutpoints below put ages 17 to 19 in the first group and 20 to 22 in the second; adjust them for other groupings.
> split(dat, findInterval(dat$AGE, c(17, 19.5, 22.5)))
$`1`
A B AGE
1 25 50 17
3 50 60 19
4 65 105 17
7 115 98 18
8 25 75 19
$`2`
A B AGE
2 30 42 22
5 355 400 21
6 68 47 20
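The difference between the two defaults shows up exactly at the cutpoints. A small sketch on a made-up vector of ages illustrates it, and also gives cutpoints that produce the 17-18 / 19-22 grouping from the question:

```r
ages <- c(17, 18, 19, 22)

# cut: intervals (16,18] and (18,22], so 18 lands in the first group
by_cut <- cut(ages, breaks = c(16, 18, 22), labels = FALSE)

# findInterval with the same cutpoints: [16,18), [18,22), [22,Inf),
# so 18 lands in the second group and 22 in a third
by_fi_same <- findInterval(ages, c(16, 18, 22))

# findInterval with cutpoints shifted to match the 17-18 / 19-22 grouping
by_fi <- findInterval(ages, c(17, 19, 23))

by_cut
by_fi_same
by_fi
```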
Here is the approach with cut
lst <- split(df1, cut(df1$AGE, breaks=c(16, 18, 22), labels=FALSE))
lst
# $`1`
# A B AGE
#1 25 50 17
#4 65 105 17
#7 115 98 18
#$`2`
# A B AGE
#2 30 42 22
#3 50 60 19
#5 355 400 21
#6 68 47 20
#8 25 75 19
Update
If you need the sum and mean of the columns for each list element:
lapply(lst, function(x) rbind(colSums(x[-3]),colMeans(x[-3])))
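The same pattern covers the t-test mentioned in the question. Each group is an ordinary data frame that can be pulled out by name, and lapply can run a test on every group. The sketch below assumes the test of interest compares columns A and B within each group, which is a guess, since the question does not say which variables to test.

```r
df1 <- data.frame(A = c(25, 30, 50, 65, 355, 68, 115, 25),
                  B = c(50, 42, 60, 105, 400, 47, 98, 75),
                  AGE = c(17, 22, 19, 17, 21, 20, 18, 19))
lst <- split(df1, cut(df1$AGE, breaks = c(16, 18, 22), labels = FALSE))

# Access a single group by its name in the list
first_group <- lst[["1"]]

# Welch two-sample t-test of A against B within each group (assumed comparison)
tests <- lapply(lst, function(g) t.test(g$A, g$B))

# Collect the p-values into a named vector, one per group
p_values <- sapply(tests, function(tt) tt$p.value)
p_values
```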
But, if the objective is to find the summary statistics based on the group, it can be done using any of the aggregating functions
library(dplyr)
df1 %>%
group_by(grp=cut(AGE, breaks=c(16, 18, 22), labels=FALSE)) %>%
summarise_each(funs(sum=sum(., na.rm=TRUE),
mean=mean(., na.rm=TRUE)), A:B)
# grp A_sum B_sum A_mean B_mean
#1 1 205 253 68.33333 84.33333
#2 2 528 624 105.60000 124.80000
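In recent dplyr releases, summarise_each() and funs() are deprecated in favour of across(). A roughly equivalent sketch is shown below. Note that across() orders the result columns by variable first (A_sum, A_mean, B_sum, B_mean), which differs from the output above.

```r
library(dplyr)

df1 <- data.frame(A = c(25L, 30L, 50L, 65L, 355L, 68L, 115L, 25L),
                  B = c(50L, 42L, 60L, 105L, 400L, 47L, 98L, 75L),
                  AGE = c(17L, 22L, 19L, 17L, 21L, 20L, 18L, 19L))

res <- df1 %>%
  group_by(grp = cut(AGE, breaks = c(16, 18, 22), labels = FALSE)) %>%
  summarise(across(A:B, list(sum  = ~ sum(.x, na.rm = TRUE),
                             mean = ~ mean(.x, na.rm = TRUE))))
res
```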
Or using aggregate from base R
do.call(data.frame,
aggregate(cbind(A,B)~cbind(grp=cut(AGE, breaks=c(16, 18, 22),
labels=FALSE)), df1, function(x) c(sum=sum(x), mean=mean(x))))
data
df1 <- structure(list(A = c(25L, 30L, 50L, 65L, 355L, 68L, 115L, 25L
), B = c(50L, 42L, 60L, 105L, 400L, 47L, 98L, 75L), AGE = c(17L,
22L, 19L, 17L, 21L, 20L, 18L, 19L)), .Names = c("A", "B", "AGE"
), class = "data.frame", row.names = c(NA, -8L))