ggplot and barchart with multiple variables in a certain order - r

I've been working on this problem for days but am not able to solve it.
I have a matrix
gesamt
     year  <10 10-60 60-100 100-150 >150
2001 2001  376    57      7       0    0
2002 2015  322    60     10       2    0
2003 2016  324    59      5       2    0
which I convert to a data frame: df <- data.frame(gesamt)
and melt it:
df.molten <- melt(df, id.vars='year', value.name='mean')
df.molten
year variable mean
1 2001 <10 376
2 2015 <10 322
3 2016 <10 324
4 2001 10-60 57
5 2015 10-60 60
6 2016 10-60 59
7 2001 60-100 7
8 2015 60-100 10
9 2016 60-100 5
10 2001 100-150 0
11 2015 100-150 2
12 2016 100-150 2
13 2001 >150 0
14 2015 >150 0
15 2016 >150 0
I would like to get a plot like this: [desired plot image omitted - a grouped bar chart with one bar per category for each year].
But I am not able to plot it (I didn't have enough time to study ggplot2), whatever I tried.
Any hints?

This is an easy task; you should be able to do it with:
library(tidyverse)
df.molten %>%
  ggplot(aes(year, mean, fill = variable)) +
  geom_col(position = position_dodge())
Note that the plotting needs the melted data (df.molten), since the variable and mean columns only exist there. Since your dataset has a large gap in years, you can convert the year column to character with df.molten$year <- as.character(df.molten$year) and you won't have that large gap in your plot.
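For completeness, a minimal end-to-end sketch (assuming gesamt is the matrix printed above and that melt comes from reshape2); the explicit factor() call keeps the size classes in the <10 through >150 order rather than alphabetical:
library(reshape2)
library(ggplot2)
# check.names = FALSE keeps column names such as "<10" intact
df <- data.frame(gesamt, check.names = FALSE)
df.molten <- melt(df, id.vars = "year", value.name = "mean")
# fix the legend/bar order of the size classes
df.molten$variable <- factor(df.molten$variable,
                             levels = c("<10", "10-60", "60-100", "100-150", ">150"))
# factor(year) treats year as discrete, so the gap between 2001 and 2015 disappears
ggplot(df.molten, aes(factor(year), mean, fill = variable)) +
  geom_col(position = position_dodge()) +
  labs(x = "year", y = "mean")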

Related

R - manipulating time series data

I have a time-series dataset with yearly values for 30 years for >200,000 study units that all start off as the same value of 'healthy==1' and can transition to 3 classes - 'exposed==2', 'infected==3' and 'recover==4'; some units also remain as 'healthy' throughout the time series. The dataset is in long format.
I would like to manipulate the dataset so that it keeps all 30 years for each unit but is collapsed to only 'healthy==1' and 'infected==3'; i.e. I would classify 'exposed==2' as 'healthy==1', and the first time a 'healthy' unit becomes 'infected==3' it remains infected for the rest of the time series, even though it might 'recover==4' or change state again (get infected and recover again).
Healthy units that never transition to another class will remain classified as healthy throughout the time series.
I am somewhat stumped on how to code this in R; any ideas would be greatly appreciated.
Example dataset for two units: one remains healthy throughout the time series and the other has multiple transitions.
UID annual_change_val year
1 control1 1 1990
4 control1 1 1991
5 control1 1 1992
7 control1 1 1993
9 control1 1 1994
12 control1 1 1995
13 control1 1 1996
16 control1 1 1997
18 control1 1 1998
20 control1 1 1999
22 control1 1 2000
24 control1 1 2001
26 control1 1 2002
28 control1 1 2003
30 control1 1 2004
31 control1 1 2005
33 control1 1 2006
35 control1 1 2007
38 control1 1 2008
40 control1 1 2009
42 control1 1 2010
44 control1 1 2011
46 control1 1 2012
48 control1 1 2013
50 control1 1 2014
52 control1 1 2015
53 control1 1 2016
55 control1 1 2017
57 control1 1 2018
59 control1 1 2019
61 control1 1 2020
2 control64167 1 1990
3 control64167 1 1991
6 control64167 1 1992
8 control64167 2 1993
10 control64167 2 1994
11 control64167 2 1995
14 control64167 2 1996
15 control64167 2 1997
17 control64167 3 1998
19 control64167 3 1999
21 control64167 4 2000
23 control64167 4 2001
25 control64167 4 2002
27 control64167 4 2003
29 control64167 3 2004
32 control64167 4 2005
34 control64167 4 2006
36 control64167 4 2007
37 control64167 4 2008
39 control64167 4 2009
41 control64167 4 2010
43 control64167 4 2011
45 control64167 4 2012
47 control64167 4 2013
49 control64167 4 2014
51 control64167 4 2015
54 control64167 4 2016
56 control64167 4 2017
58 control64167 4 2018
60 control64167 4 2019
62 control64167 4 2020
If for some reason you only want to use base R,
df$annual_change_val[df$annual_change_val == 2] <- 1
df$annual_change_val[df$annual_change_val == 4] <- 3
The first line means: take the annual_change_val column from ($) dataframe df, subset it ([) so that you're only left with values equal to 2, and re-assign (<-) to those a value of 1 instead. Similarly for the second line.
Update, based on comment/clarification.
Here, I replace the values as before, and then create a temporary variable called first_inf which holds the earliest year in which the UID was "infected" (status == 3). I then set the status to 3 for every year at or after that year (within UID); units that are never infected are left untouched (first_inf is infinite for them).
d %>%
  mutate(status = if_else(annual_change_val %in% c(1, 2), 1, 3)) %>%
  group_by(UID) %>%
  mutate(first_inf = suppressWarnings(min(year[status == 3])),
         status = if_else(is.finite(first_inf) & year >= first_inf, 3, status)) %>%
  select(!first_inf)
You can simply change the values from 2 to 1, and from 4 to 3, as Andrea mentioned in the comments. If d is your data, then
library(dplyr)
d %>% mutate(status = if_else(annual_change_val %in% c(1,2),1,3))
library(data.table)
setDT(d)[, status:=fifelse(annual_change_val %in% c(1,2),1,3)]
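If the rows are ordered by year within each UID, the same "once infected, always infected" rule can also be written with a cumulative check; a rough sketch, assuming d holds the data shown above:
library(dplyr)
d %>%
  arrange(UID, year) %>%
  group_by(UID) %>%
  # cumany() is TRUE from the first year with annual_change_val == 3 onwards
  mutate(status = if_else(cumany(annual_change_val == 3), 3, 1)) %>%
  ungroup()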

How to lump-sum the number of days across several years of data?

I have data similar to this. I would like to lump-sum the days (I'm not sure whether "lump sum" is the right word) and create a new column "date" that holds a cumulative day count over the 3 years of data, in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote the code below, but the result was wrong and it is also too long; it doesn't count February correctly since February only has 28 days. Are there any shorter ways?
cday <- function(data, syear = 2011, smonth = 1, sday = 1) {
  year <- data[1]
  month <- data[2]
  day <- data[3]
  cmonth <- c(0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  date <- (year - syear) * 365 + sum(cmonth[1:month]) + day
  for (yr in c(syear:year)) {
    if (yr == year) {
      if (yr %% 4 == 0 && month > 2) { date <- date + 1 }
    } else {
      if (yr %% 4 == 0) { date <- date + 1 }
    }
  }
  return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect a result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated; look for tools that do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
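For example, a rough lubridate equivalent of the columns above (assuming the same df) could be:
library(lubridate)
# build a Date column from the year/month/day parts
df$date <- make_date(df$year, df$month, df$day)
# day of the year (1-366)
df$julian_day <- yday(df$date)
# running day count since the end of 2010
df$days_since_2010 <- as.integer(df$date - as.Date("2010-12-31"))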

Aggregate by specific year in R

Apologies if this question has already been dealt with on SO, but I cannot seem to find a quick solution as yet.
I am trying to aggregate a dataset by a specific year. My data frame consists of hourly climate data over a period of 10 years.
head(df)
# day month year hour rain temp pressure wind
#1 1 1 2005 0 0 7.6 1016 15
#2 1 1 2005 1 0 8.0 1015 14
#3 1 1 2005 2 0 7.7 1014 15
#4 1 1 2005 3 0 7.8 1013 17
#5 1 1 2005 4 0 7.3 1012 17
#6 1 1 2005 5 0 7.6 1010 17
To calculate daily means from the above dataset, I use this aggregate call:
g <- aggregate(cbind(temp, pressure, wind) ~ day + month + year, df, mean)
options(digits=2)
head(g)
# day month year temp pressure wind
#1 1 1 2005 6.6 1005 25
#2 2 1 2005 6.5 1018 25
#3 3 1 2005 9.7 1019 22
#4 4 1 2005 7.5 1010 25
#5 5 1 2005 7.3 1008 25
#6 6 1 2005 9.6 1009 26
Unfortunately, I get a huge dataset spanning the whole 10 years (2005 to 2014). I am wondering if anybody could help me tweak the above aggregate code so that I can summarise daily means for a specific year rather than for all of them in one sweep.
You can use the subset argument in aggregate, e.g. to restrict the aggregation to a single year:
aggregate(cbind(temp, pressure, wind) ~ day + month + year, df,
          subset = year == 2005, mean)
dplyr also does it nicely.
library(dplyr)
df %>%
filter(year==2005) %>%
group_by(day, month, year) %>%
summarise_each(funs(mean), temp, pressure, wind)
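Note that summarise_each()/funs() have since been deprecated in dplyr; with current versions the same summary would roughly be written with across():
df %>%
  filter(year == 2005) %>%
  group_by(day, month, year) %>%
  summarise(across(c(temp, pressure, wind), mean), .groups = "drop")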

dcast keeping four value variables and two factors

I have a data.frame in R in long format, and I want to cast it into wide format.
It has monthly data from several clients, and I want the final data.frame to have the mean per client of he, vo, ep and fe.
store and pr should be fixed for each client.
I think dcast from package reshape2 should do the job, but I can't make it work.
month store client he vo ep fe pr
jan 1 54010 12 392 1 7 Basic
jan 2 54011 12 376 2 2 Premium
jan 1 54012 11 385 2 6 Basic
feb 1 54010 10 394 3 7 Basic
feb 2 54011 10 385 1 1 Premium
feb 1 54012 11 395 1 1 Basic
mar 1 54010 11 416 2 2 Basic
mar 2 54011 11 417 3 4 Premium
mar 1 54012 11 390 0 2 Basic
apr 1 54010 11 389 2 NA Basic
apr 2 54011 7 398 6 3 Premium
apr 1 54012 11 368 1 3 Basic
If you need the annual mean of those columns by client (it wasn't clear from the question), dplyr can do it:
library(dplyr)
dat <- read.table(text="month store client he vo ep fe pr
jan 1 54010 12 392 1 7 Basic
jan 2 54011 12 376 2 2 Premium
jan 1 54012 11 385 2 6 Basic
feb 1 54010 10 394 3 7 Basic
feb 2 54011 10 385 1 1 Premium
feb 1 54012 11 395 1 1 Basic
mar 1 54010 11 416 2 2 Basic
mar 2 54011 11 417 3 4 Premium
mar 1 54012 11 390 0 2 Basic
apr 1 54010 11 389 2 NA Basic
apr 2 54011 7 398 6 3 Premium
apr 1 54012 11 368 1 3 Basic", stringsAsFactors = FALSE, header = TRUE)
mt <- function(x, ...) { mean(x, na.rm=TRUE) }
dat %>%
group_by(client) %>%
summarise_each(funs(mt), -store, -pr, -month)
## Source: local data frame [3 x 5]
##
## client he vo ep fe
## 1 54010 11 397.75 2 5.333333
## 2 54011 10 394.00 3 2.500000
## 3 54012 11 384.50 1 3.000000
Here's a data.table solution using the dat data from @hrbrmstr's answer:
library(data.table)
## coerce to data table
DT <- as.data.table(dat)
## run mean() on columns 4 through 7, grouped by 'client'
DT[, lapply(.SD, mean, na.rm = TRUE), .SDcols = 4:7, by = client]
# client he vo ep fe
# 1: 54010 11 397.75 2 5.333333
# 2: 54011 10 394.00 3 2.500000
# 3: 54012 11 384.50 1 3.000000
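Since the question explicitly mentions dcast from reshape2, the same per-client means can also be obtained by melting the value columns and then casting with an aggregation function; a sketch, assuming the dat object from above:
library(reshape2)
# long format: one row per client/month/measurement
molten <- melt(dat, id.vars = c("month", "store", "client", "pr"),
               measure.vars = c("he", "vo", "ep", "fe"))
# cast back to wide, averaging over months within each client
dcast(molten, client ~ variable, fun.aggregate = mean, na.rm = TRUE)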

In R, sum over all rows above a given row, restarting at each new ID?

The following is what I have:
ID Year Score
1 1999 10
1 2000 11
1 2001 14
1 2002 22
2 2000 19
2 2001 17
2 2002 22
3 1998 10
3 1999 12
The following is what I would like to do:
ID Year Score Total
1 1999 10 10
1 2000 11 21
1 2001 14 35
1 2002 22 57
2 2000 19 19
2 2001 17 36
2 2002 22 58
3 1998 10 10
3 1999 12 22
The number of years and the specific years vary for each ID.
I have a feeling this needs some advanced option in ddply, but I have not been able to find the answer. I've also tried for/while loops, but since these are dreadfully slow in R and my dataset is large, that isn't working all that well.
Thanks in advance!
You can use the cumsum function and apply it to all subgroups with ave.
transform(dat, Total = ave(Score, ID, FUN = cumsum))
ID Year Score Total
1 1 1999 10 10
2 1 2000 11 21
3 1 2001 14 35
4 1 2002 22 57
5 2 2000 19 19
6 2 2001 17 36
7 2 2002 22 58
8 3 1998 10 10
9 3 1999 12 22
If your data is large, then ddply will be slow.
data.table is the way to go.
library(data.table)
DT <- data.table(dat)
# create your desired column in `DT`
DT[, agg.Score := cumsum(Score), by = ID]
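For completeness, a dplyr equivalent (assuming dat as above) is just as short:
library(dplyr)
dat %>%
  group_by(ID) %>%
  # cumulative total restarts within each ID
  mutate(Total = cumsum(Score)) %>%
  ungroup()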
