I have two tables: the first has stress scores recorded at various time points, and the second has the date of treatment. I want to get the stress scores before and after treatment for each participant who has received the treatment. I also want columns that show how long before or after treatment each stress score was recorded. I do not know where to begin or what my code should look like.
score.dt = data.table(
participant.index = c(1, 1, 1, 3, 4, 4, 13, 21, 21, 25, 37, 40, 41, 41, 41, 43, 43, 43, 44),
repeat.instance = c(2, 3, 6, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 3, 1, 2, 3, 1),
date.recorded = c(
'2017-07-13',
'2017-06-26',
'2018-09-17',
'2016-04-14',
'2014-03-24',
'2016-05-30',
'2018-06-20',
'2014-08-03',
'2015-07-06',
'2014-12-17',
'2014-09-05',
'2013-06-10',
'2015-10-04',
'2016-11-04',
'2016-04-18',
'2014-02-13',
'2013-05-24',
'2014-09-10',
'2014-11-25'
),
subscale = c(
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress",
"stress"
),
score = c(18, 10, 18, 36, 16, 30, 28, 10, 12, 40, 16, 12, 10, 14, 6, 32, 42, 26, 18)
)
date.treatment.dt = data.table (
participant.index = c(1, 4, 5, 6, 8, 10, 11, 12, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26),
date.treatment = c(
'2018 - 06 - 27',
'2001 - 07 - 16',
'2009 - 12 - 09',
'2009 - 05 - 20',
'2009 - 07 - 22',
'2008-07 - 02',
'2009 - 11 - 25',
'2009 - 09 - 16',
'1991 - 07 - 30',
'2016 - 05 - 25',
'2012 - 07 - 25',
'2007 - 03 - 19',
'2012 - 01 - 25',
'2011 - 09 - 21',
'2000 - 03 - 06',
'2001 - 09 - 25',
'1999 - 12 - 20',
'1997 -07 - 28',
'2002 - 03 - 12',
'2008 - 01 - 23'
))
The desired output columns are something like this:
score.date.dt = c("candidate.index.x", "repeat.instance", "subscale", "score", "date.treatment", "date.recorded", "score.before.treatment", "score.after.treatment", "months.before.treatment", "months.after.treatment")
Here the column months.before.treatment indicates how many months before treatment the stress score was measured, and months.after.treatment indicates how many months after treatment the stress score was measured.
In your example set, only four individuals with stress scores have any rows in the treatment table (participants 1, 4, 21, and 25). Only one of these, participant 1, has both a pre-treatment and a post-treatment stress measure.
Here is one way to produce the information you need:
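library(dplyr)    # inner_join, group_by, summarize, mutate
library(tidyr)    # pivot_longer, pivot_wider
library(stringr)  # str_extract
# Assumes both date columns are already Date class (see the note below)
inner_join(score.dt,date.treatment.dt, by="participant.index") %>%
group_by(participant.index, date.treatment) %>%
# pre_treatment: earliest record on or before treatment; post_treatment: latest record on or after
summarize(pre_treatment = min(date.recorded[date.recorded<=date.treatment]),
post_treatment = max(date.recorded[date.recorded>=date.treatment])) %>%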
pivot_longer(cols = -(participant.index:date.treatment), names_to = "period", values_to = "date.recorded") %>%
left_join(score.dt, by=c("participant.index", "date.recorded" )) %>%
mutate(period=str_extract(period,".*(?=_)"),
months = abs(as.numeric(date.treatment-date.recorded))/(365.25/12)) %>%
pivot_wider(id_cols = participant.index:date.treatment, names_from = period, values_from=c(date.recorded, subscale, months,score))
Output:
participant.index date.treatment date.recorded_pre date.recorded_post subscale_pre subscale_post months_pre months_post score_pre score_post
<dbl> <date> <date> <date> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 2018-06-27 2017-06-26 2018-09-17 stress stress 12.0 2.69 10 18
2 4 2001-07-16 NA 2016-05-30 NA stress Inf 178. NA 30
3 21 2000-03-06 NA 2015-07-06 NA stress Inf 184. NA 12
4 25 2002-03-12 NA 2014-12-17 NA stress Inf 153. NA 40
Note: you will have to fix the date inputs in the two source tables first, like this:
# first, strip the spaces from your date.treatment column and convert it to Date
date.treatment.dt[, date.treatment := as.Date(str_replace_all(date.treatment," ",""), "%Y-%m-%d")]
# second, similarly convert the date column in your stress score table to Date
score.dt[,date.recorded := as.Date(date.recorded,"%Y-%m-%d")]
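As a quick check on the months calculation used in the pipeline above, here is a minimal base R sketch for participant 1 only. The variable names are illustrative and not part of the original tables; the dates are the treatment date and the two records that the pipeline selects for that participant.

```r
# Treatment date for participant 1, as entered in the source table (with stray spaces)
raw.treatment <- "2018 - 06 - 27"
treatment <- as.Date(gsub(" ", "", raw.treatment), format = "%Y-%m-%d")

# The pre-treatment and post-treatment records selected for participant 1
recorded <- as.Date(c("2017-06-26", "2018-09-17"))

# Elapsed days, converted to months using an average month of 365.25 / 12 days
days <- abs(as.numeric(treatment - recorded))
months <- days / (365.25 / 12)
round(months, 2)
```

The two values round to 12.02 and 2.69, which agrees with the months_pre and months_post columns shown for participant 1.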
It seems like there are a few parts to what you're asking. First, you need to merge the two tables. Here I use dplyr::inner_join(), which automatically detects that participant.index is the only column in common and merges on it, discarding records found in only one of the tables. Second, we convert both date columns to a date format so that the elapsed months can be calculated.
library(tidyverse)
library(data.table)
library(lubridate)
score.dt <- structure(list(participant.index = c(1, 1, 1, 3, 4, 4, 13, 21, 21, 25, 37, 40, 41, 41, 41, 43, 43, 43, 44), repeat.instance = c(2, 3, 6, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 3, 1, 2, 3, 1), date.recorded = c("2017-07-13", "2017-06-26", "2018-09-17", "2016-04-14", "2014-03-24", "2016-05-30", "2018-06-20", "2014-08-03", "2015-07-06", "2014-12-17", "2014-09-05", "2013-06-10", "2015-10-04", "2016-11-04", "2016-04-18", "2014-02-13", "2013-05-24", "2014-09-10", "2014-11-25"), subscale = c("stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress", "stress"), score = c(18, 10, 18, 36, 16, 30, 28, 10, 12, 40, 16, 12, 10, 14, 6, 32, 42, 26, 18)), row.names = c(NA, -19L), class = c("data.table", "data.frame"))
date.treatment.dt <- structure(list(participant.index = c(1, 4, 5, 6, 8, 10, 11, 12, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26), date.treatment = c("2018 - 06 - 27", "2001 - 07 - 16", "2009 - 12 - 09", "2009 - 05 - 20", "2009 - 07 - 22", "2008-07 - 02", "2009 - 11 - 25", "2009 - 09 - 16", "1991 - 07 - 30", "2016 - 05 - 25", "2012 - 07 - 25", "2007 - 03 - 19", "2012 - 01 - 25", "2011 - 09 - 21", "2000 - 03 - 06", "2001 - 09 - 25", "1999 - 12 - 20", "1997 -07 - 28", "2002 - 03 - 12", "2008 - 01 - 23")), row.names = c(NA, -20L), class = c("data.table", "data.frame"))
inner_join(date.treatment.dt, score.dt) %>%
mutate(across(contains("date"), as_date)) %>%
mutate(months.after = interval(date.treatment, date.recorded) %/% months(1)) %>%
mutate(months.before = 0 - months.after)
#> Joining, by = "participant.index"
#> participant.index date.treatment repeat.instance date.recorded subscale
#> 1: 1 2018-06-27 2 2017-07-13 stress
#> 2: 1 2018-06-27 3 2017-06-26 stress
#> 3: 1 2018-06-27 6 2018-09-17 stress
#> 4: 4 2001-07-16 1 2014-03-24 stress
#> 5: 4 2001-07-16 2 2016-05-30 stress
#> 6: 21 2000-03-06 1 2014-08-03 stress
#> 7: 21 2000-03-06 2 2015-07-06 stress
#> 8: 25 2002-03-12 1 2014-12-17 stress
#> score months.after months.before
#> 1: 18 -11 11
#> 2: 10 -12 12
#> 3: 18 2 -2
#> 4: 16 152 -152
#> 5: 30 178 -178
#> 6: 10 172 -172
#> 7: 12 184 -184
#> 8: 40 153 -153
Created on 2022-04-05 by the reprex package (v2.0.1)
How do I generate dummy variables, one per year, that are zero before the value in Year (shown as NA in the desired output below) and take the value 1 from that year onwards through 2019?
Original data:
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8), Year = c(2017,
2015, 2018, 2018, 2018, 2018, 2018, 2018)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -8L))
What I need:
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8), Year = c(2017,
2015, 2018, 2018, 2018, 2018, 2018, 2018), `2015` = c(NA, 1,
NA, NA, NA, NA, NA, NA), `2016` = c(NA, 1, NA, NA, NA, NA, NA,
NA), `2017` = c(1, 1, NA, NA, NA, NA, NA, NA), `2018` = c(1,
1, 1, 1, 1, 1, 1, 1), `2019` = c(1, 1, 1, 1, 1, 1, 1, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -8L))
Split on id, extend the year range for each id to i:2019, then reshape from long to wide (the original data is called df2 here):
res <- reshape(stack(sapply(split(df2$Year, df2$id), function(i) i:2019)),
timevar = "values", v.names = "values", idvar = "ind",
direction = "wide")
# fix the column names order
res <- res[ sort(colnames(res)) ]
res
# ind values.2015 values.2016 values.2017 values.2018 values.2019
# 1 1 NA NA 2017 2018 2019
# 4 2 2015 2016 2017 2018 2019
# 9 3 NA NA NA 2018 2019
# 11 4 NA NA NA 2018 2019
# 13 5 NA NA NA 2018 2019
# 15 6 NA NA NA 2018 2019
# 17 7 NA NA NA 2018 2019
# 19 8 NA NA NA 2018 2019
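The output above holds the year itself in each non-missing cell. If the cells should hold 1 instead, as in the desired output, one possible base R variant is sketched below. It assumes the year range is fixed at 2015:2019 and that the data frame is called df; both are assumptions for illustration.

```r
# Example data from the question (the name df is an assumption)
df <- data.frame(id = 1:8,
                 Year = c(2017, 2015, 2018, 2018, 2018, 2018, 2018, 2018))

years <- 2015:2019

# One row per id, one column per year: 1 from the start year onwards, NA before it
dummies <- outer(df$Year, years, function(start, yr) ifelse(yr >= start, 1, NA))
colnames(dummies) <- years

# cbind() on a data frame keeps the year column names as they are
res <- cbind(df, dummies)
res
```

Replacing NA with 0 in the ifelse() call would give a true 0/1 dummy, if that is what is wanted.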
I'm trying to merge two datasets I have.
df1:
day  month  year  lon   lat   month-year
3    5      2009  5.7   53.9  May 2009
8    9      2004  6.9   52.6  Sep 2004
15   9      2004  3.8   50.4  Sep 2004
5    5      2009  2.7   51.2  May 2009
28   7      2005  14.8  62.4  Jul 2005
18   9      2004  5.1   52.5  Sep 2004
df2:
nao-value  sign      month-year
-2.1       Negative  Sep 2004
1.3        Positive  Jul 2005
-1.1       Negative  May 2009
I want to merge these to add the NAO value for each month and year to the occurrence data, meaning I want the NAO value for each specific month repeated for all registrations of that month in the occurrence data.
The problem is that I cannot get the NAO values to line up with the occurrence data. Either they are just repeated without being aligned to the correct date (with the dates given as month-year.x and month-year.y), or they come back as NA.
I have tried a few different approaches:
df3 <- merge(df1, df2, by="month-year")
df3 <- merge(cbind(df1, X=rownames(df1)), cbind(df2, variable=rownames(df2)))
df3 <- merge(df1,df2, by ="month-year", all.x = TRUE,all.y=TRUE, sort = FALSE)
df3 <- merge(df1, df2, by=intersect(df1$month-year(df1), df2$month-year(df2)))
But none of those gives the result I desire.
Edit to include dput:
dput(head(df1, 10)) :
structure(list(Day = c(29, 2, 14, 31, 16, 7, 25, 12, 21, 22),
Month = c(7, 7, 7, 8, 8, 7, 8, 6, 6, 9), Year = c(2010, 2015,
2010, 2018, 2016, 2018, 2019, 2004, 2015, 2019), Lon = c(-6.155014,
-5.820868, -5.509842, -5.495277, -5.469389, -5.469389, -5.469389,
-5.466995, -5.461942, -5.457127), Lat = c(59.09478, 59.125228,
57.959196, 57.96022, 57.986825, 57.986825, 57.986825, 57.874527,
57.95972, 58.07697), Date = c("Jul 2010", "Jul 2015", "Jul 2010",
"Aug 2018", "Aug 2016", "Jul 2018", "Aug 2019", "Jun 2004",
"Jun 2015", "Sep 2019")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
dput(head(df2, 10)) :
structure(list(NAO = c(1.04, 1.41, 1.46, 2, -1.53, -0.02, 0.53,
0.97, 1.06, 0.23), Sign = c("Positive", "Positive", "Positive",
"Positive", "Negative", "Negative", "Positive", "Positive", "Positive",
"Positive"), Date = c("jan 1990", "feb 1990", "mar 1990", "apr 1990",
"mai 1990", "jun 1990", "jul 1990", "aug 1990", "sep 1990", "okt 1990"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
The merge function is case sensitive, and the Date values in the two data frames you are merging use different cases. Make the case the same in both data frames and then perform the merge. Try:
result <- merge(transform(df1, Date = tolower(Date)), df2, by = 'Date')
Using tidyverse:
library(dplyr)
df1 %>%
mutate(Date = tolower(Date)) %>%
inner_join(df2, by = 'Date')
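One thing to watch for: the Date values in the df2 dput include "mai 1990" and "okt 1990", which look like Norwegian abbreviations and would not match a lower-cased "may" or "oct" from df1. A minimal sketch of one way around this, assuming that reading is correct, is to convert both columns to a common "YYYY-MM" key before merging. The helper month_key and the small tables occ and nao are hypothetical and only for illustration.

```r
# Hypothetical helper: turn "Jul 2010" or "mai 1990" into a "YYYY-MM" key.
# The lookup covers English and (assumed) Norwegian three-letter abbreviations.
month_key <- function(x) {
  lookup <- c(jan = 1L, feb = 2L, mar = 3L, apr = 4L, may = 5L, mai = 5L,
              jun = 6L, jul = 7L, aug = 8L, sep = 9L, oct = 10L, okt = 10L,
              nov = 11L, dec = 12L, des = 12L)
  parts <- strsplit(tolower(trimws(x)), "\\s+")
  mon <- vapply(parts, function(p) p[1], character(1))
  yr  <- vapply(parts, function(p) p[2], character(1))
  sprintf("%s-%02d", yr, unname(lookup[mon]))
}

# Small illustrative tables (not the real data)
occ <- data.frame(Lon = c(-6.2, -5.8, -5.5),
                  Date = c("Jul 1990", "May 1990", "Oct 1990"))
nao <- data.frame(NAO = c(-1.53, 0.53, 0.23),
                  Date = c("mai 1990", "jul 1990", "okt 1990"))

occ$key <- month_key(occ$Date)
nao$key <- month_key(nao$Date)

# Left join: every occurrence row is kept and picks up that month's NAO value
merged <- merge(occ, nao[, c("key", "NAO")], by = "key", all.x = TRUE)
merged
```

Using all.x = TRUE keeps occurrence rows whose month has no NAO value, so unmatched months show up as NA rather than silently disappearing.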
I have successfully created a stacked bar plot but I cannot add labels indicating the percentages. That is all that I am missing.
I basically do not know how to use geom_label/geom_text correctly. I have tried many solutions, but nothing has worked for me.
I have tried the geom_text function but it keeps telling me I am doing it wrong.
year Month2 Month Day HE Supply MUnit MPrice MBlock Fuel
2017 1 Jan 01 8 9408 SD2 15.38 126 COAL
2017 1 Jan 01 9 9388 SD3 15.46 218 COAL
2017 1 Jan 01 10 9393 SD3 15.46 218 COAL
2017 1 Jan 01 11 9628 SD4 15.47 203 COAL
2017 1 Jan 01 12 9943 EGC1 21.40 72 GAS
2017 1 Jan 01 13 10106 BR5 21.41 245 COAL
2017 1 Jan 01 14 10114 BR5 21.41 245 COAL
2017 1 Jan 01 15 9971 EGC1 20.75 75 GAS
2017 1 Jan 01 16 10302 BR5 21.41 245 COAL
2017 1 Jan 01 17 10655 TC01 22.77 11 GAS
2017 1 Jan 01 18 10811 CAL1 24.88 25 GAS
2017 1 Jan 01 19 10821 CAL1 24.88 25 GAS
2017 1 Jan 01 20 10765 BIG 26.00 30 HYDRO
2017 1 Jan 02 8 10428 CAL1 22.04 30 GAS
2017 1 Jan 02 9 10723 CAL1 29.97 59 GAS
2017 1 Jan 02 10 10933 BRA 44.50 30 HYDRO
2017 1 Jan 02 11 11107 ANC1 46.46 63 GAS
2017 1 Jan 02 12 11098 ANC1 46.46 38 GAS
2017 1 Jan 02 13 10839 JOF1 26.59 45 GAS
2017 1 Jan 02 14 10814 JOF1 26.09 15 GAS
2017 1 Jan 02 15 10797 BIG 26.00 30 HYDRO
sp <- ggplot(data = MU17) +
geom_bar(mapping = aes(x = factor(Month,levels=month.abb),
fill = factor(Fuel, levels=c("COAL", "GAS","HYDRO","BIOMASS"))),
position = "Fill") +
scale_y_continuous(labels = scales::percent)
sp + scale_fill_manual(breaks=c("COAL", "GAS","HYDRO","BIOMASS"),
values=c("black","yellow","blue","green")) +
labs(x = "2017" , y="Marginal Fuel Between HE8 & HE20") +
labs(fill="Fuel Type")
I am hoping to get the exact same plot that I get, just with labels indicating percentages.
I personally prefer using geom_col over geom_bar and processing the data myself rather than letting ggplot2 do it. This way you have more control over what's going on.
Since you have not provided all of your data, I just use the snippet you provided, duplicated for February and with a few Fuel values changed so that all four fuel types appear.
library(tibble)
MU17 <- tribble(~year, ~Month2, ~Month, ~Day, ~HE, ~Supply, ~MUnit, ~MPrice, ~MBlock, ~Fuel,
2017, 1, "Jan", 01, 8, 9408, "SD2", 15.38, 126, "COAL",
2017, 1, "Jan", 01, 9, 9388, "SD3", 15.46, 218, "COAL",
2017, 1, "Jan", 01, 10, 9393, "SD3", 15.46, 218, "COAL",
2017, 1, "Jan", 01, 11, 9628, "SD4", 15.47, 203, "COAL",
2017, 1, "Jan", 01, 12, 9943, "EGC1", 21.40, 72, "GAS",
2017, 1, "Jan", 01, 13, 10106, "BR5", 21.41, 245, "COAL",
2017, 1, "Jan", 01, 14, 10114, "BR5", 21.41, 245, "COAL",
2017, 1, "Jan", 01, 15, 9971, "EGC1", 20.75, 75, "GAS",
2017, 1, "Jan", 01, 16, 10302, "BR5", 21.41, 245, "COAL",
2017, 1, "Jan", 01, 17, 10655, "TC01", 22.77, 11, "GAS",
2017, 1, "Jan", 01, 18, 10811, "CAL1", 24.88, 25, "GAS",
2017, 1, "Jan", 01, 19, 10821, "CAL1", 24.88, 25, "GAS",
2017, 1, "Jan", 01, 20, 10765, "BIG", 26.00, 30, "HYDRO",
2017, 1, "Jan", 02, 8, 10428, "CAL1", 22.04, 30, "GAS",
2017, 1, "Jan", 02, 9, 10723, "CAL1", 29.97, 59, "GAS",
2017, 1, "Jan", 02, 10, 10933, "BRA", 44.50, 30, "HYDRO",
2017, 1, "Jan", 02, 11, 11107, "ANC1", 46.46, 63, "GAS",
2017, 1, "Jan", 02, 12, 11098, "ANC1", 46.46, 38, "GAS",
2017, 1, "Jan", 02, 13, 10839, "JOF1", 26.59, 45, "GAS",
2017, 1, "Jan", 02, 14, 10814, "JOF1", 26.09, 15, "HYDRO",
2017, 1, "Jan", 02, 15, 10797, "BIG", 26.00, 30, "BIOMASS",
2017, 2, "Feb", 01, 8, 9408, "SD2", 15.38, 126, "COAL",
2017, 2, "Feb", 01, 9, 9388, "SD3", 15.46, 218, "COAL",
2017, 2, "Feb", 01, 10, 9393, "SD3", 15.46, 218, "COAL",
2017, 2, "Feb", 01, 11, 9628, "SD4", 15.47, 203, "COAL",
2017, 2, "Feb", 01, 12, 9943, "EGC1", 21.40, 72, "GAS",
2017, 2, "Feb", 01, 13, 10106, "BR5", 21.41, 245, "COAL",
2017, 2, "Feb", 01, 14, 10114, "BR5", 21.41, 245, "COAL",
2017, 2, "Feb", 01, 15, 9971, "EGC1", 20.75, 75, "GAS",
2017, 2, "Feb", 01, 16, 10302, "BR5", 21.41, 245, "COAL",
2017, 2, "Feb", 01, 17, 10655, "TC01", 22.77, 11, "GAS",
2017, 2, "Feb", 01, 18, 10811, "CAL1", 24.88, 25, "GAS",
2017, 2, "Feb", 01, 19, 10821, "CAL1", 24.88, 25, "GAS",
2017, 2, "Feb", 01, 20, 10765, "BIG", 26.00, 30, "HYDRO",
2017, 2, "Feb", 02, 8, 10428, "CAL1", 22.04, 30, "GAS",
2017, 2, "Feb", 02, 9, 10723, "CAL1", 29.97, 59, "GAS",
2017, 2, "Feb", 02, 10, 10933, "BRA", 44.50, 30, "HYDRO",
2017, 2, "Feb", 02, 11, 11107, "ANC1", 46.46, 63, "GAS",
2017, 2, "Feb", 02, 12, 11098, "ANC1", 46.46, 38, "GAS",
2017, 2, "Feb", 02, 13, 10839, "JOF1", 26.59, 45, "GAS",
2017, 2, "Feb", 02, 14, 10814, "JOF1", 26.09, 15, "HYDRO",
2017, 2, "Feb", 02, 15, 10797, "BIG", 26.00, 30, "BIOMASS"
)
When doing the processing I calculate:
the number of occurrences/observations (n)
their relative frequency per month (p)
a percent label of p (p2)
the y-position in the bar chart of each label (pos)
I pipe this data into ggplot. It is important that I use geom_col with position = "fill". Since I provide a position value pos for geom_text, it is necessary to use position = "identity" there. Further, you need some kind of ifelse statement to switch the colour of geom_text to white (#FFFFFF) on the darker fill colours used for HYDRO and COAL.
Good luck using this approach on your original data.
library(ggplot2)
library(dplyr)
MU17 %>%
mutate(Fuel = factor(Fuel),
Month = factor(Month,levels = month.abb)) %>%
group_by(Month, Month2, Fuel) %>%
summarise(n = n()) %>%
group_by(Month) %>%
mutate(p = n / sum(n),
p2 = paste(formatC(p*100, digits = 2, format = "fg"),"%",sep = ""),
pos = cumsum(p) - (0.5 * p),
# white text on the dark COAL and HYDRO fills, black text elsewhere
label_col = ifelse(Fuel %in% c("COAL", "HYDRO"), "#FFFFFF", "#000000")) %>%
ggplot(aes(x = Month, y = p, fill = factor(Fuel, levels = rev(levels(Fuel))))) +
geom_col(width = 0.5, position = "fill") +
scale_y_continuous(limits = c(0, 1), breaks = c(-.5,-.25,0,.25,.5,.75,1), expand = c(0, 0),
labels = scales::percent) +
scale_fill_manual(breaks = c("COAL", "GAS","HYDRO","BIOMASS"),
values = c("black","yellow","blue","green")) +
# map the precomputed colour column; the original data$Fuel lookup fails because no object named data exists in the pipe
geom_text(aes(label = p2, y = pos, colour = label_col),
position = "identity",
vjust = 0.5) +
scale_colour_identity() +
labs(x = "2017" , y = "Marginal Fuel Between HE8 & HE20") +
labs(fill = "Fuel Type")
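To make the label-position step concrete, here is a small base R sketch of the same arithmetic for January only. The counts are tallied by hand from the example tribble above, and paste0() with round() is used for the label instead of formatC() purely for brevity, so the labels keep one decimal place.

```r
# Counts per fuel for January in the example data, in factor-level order
n <- c(BIOMASS = 1, COAL = 7, GAS = 10, HYDRO = 3)

p   <- n / sum(n)               # share of each fuel within the month
pos <- cumsum(p) - 0.5 * p      # midpoint of each stacked segment on the 0-1 axis
p2  <- paste0(round(100 * p, 1), "%")

data.frame(fuel = names(n), p = round(p, 3), pos = round(pos, 3), label = p2)
```

Each pos value sits halfway up its own segment: the running total up to and including that fuel, minus half of that fuel's share. This only lines up with the bars if the stacking order matches the order used for cumsum(), which is why the plot code reverses the factor levels in the fill aesthetic.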
I have a piece of code:
sql_iv <- "select year, month, day,
count(HR)
from y2
group by year, month, day
order by year, month, day"
y3=sqldf(sql_iv)
This calculates how many times a measurement was taken in a single day (the amount varies from day to day):
Year Month Day count(HR)
1 2018 4 7 88
2 2018 4 8 327
3 2018 4 9 318
4 2018 4 10 274
5 2018 4 11 345
6 2018 4 12 275
.
.
.
189 2018 10 12 167
Now I need to take these calculated values and join them with my data, which has every measurement in a separate row (i.e. all the measurements made on April 7th would have the value 88 in the last column). Could anyone help me out with this?
Data structure for first 10 measurements (out of 48650):
structure(list(Date = structure(c(1523119800, 1523119920, 1523119980,
1523120280, 1523120340, 1523120400, 1523120460, 1523120520, 1523120580,
1523120640), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
HR = c("97.0", "98.0", "95.0", "93.0", "94.0", "94.0", "92.0",
"96.0", "89.0", "90.0"), Year = c(2018, 2018, 2018, 2018,
2018, 2018, 2018, 2018, 2018, 2018), Month = c(4, 4, 4, 4,
4, 4, 4, 4, 4, 4), Day = c(7, 7, 7, 7, 7, 7, 7, 7, 7, 7),
Hour = c(16, 16, 16, 16, 16, 17, 17, 17, 17, 17), Minute = c(50,
52, 53, 58, 59, 0, 1, 2, 3, 4)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Are you looking for this?
library(dplyr)
# y2 is the measurement data and y3 is the sqldf() result from the question
y2 %>%
as_tibble() %>%
left_join(y3 %>% as_tibble(), by = c("Year", "Month", "Day"))
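If the per-day count is only needed as an extra column, a join may not be necessary at all. The sketch below uses base R's ave() on a small stand-in for the measurement data; the five rows and the column name n_per_day are made up for illustration. Note that it counts rows per day, whereas count(HR) in SQL skips rows where HR is NULL, so the two can differ if HR has missing values.

```r
# Illustrative stand-in for the measurement data (the real y2 has 48650 rows)
y2 <- data.frame(
  Year  = c(2018, 2018, 2018, 2018, 2018),
  Month = c(4, 4, 4, 4, 4),
  Day   = c(7, 7, 7, 8, 8),
  HR    = c("97.0", "98.0", "95.0", "93.0", "94.0")
)

# ave() applies FUN within each Year/Month/Day group and returns one value per row
y2$n_per_day <- ave(seq_len(nrow(y2)), y2$Year, y2$Month, y2$Day, FUN = length)
y2
```

Every row for a given day ends up with the same count, which is the layout asked for in the question.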