box plot for multiple observations - r

I have multiple observations of rainfall for the same station over around 14 years. The data frame looks something like this:
df (from date -01/01/2000)
v1 v2 v3 v4 v5 v6 ........ v20
1 1 2 4 8 9..............
1.4 4 3.8..................
1.5 3 1.6....................
1.6 8 .....................
.
.
.
.
till date 31/12/2013, i.e. a total of 5114 observations
where v1, v2, ..., v20 are the rainfall simulations for the same point. I want to draw a box plot that shows the range of quantiles and the median for each month when all the simulations are taken together.
I can draw a box plot for single monthly values using:
df$month<-factor(month.name,levels=month.name)
library(reshape2)
df.long<-melt(df,id.vars="month")
ggplot(df.long,aes(month,value))+geom_boxplot()
But in this problem the data is daily and there are multiple observations, so I don't know where to start.
Sample data:
df = data.frame(matrix(rnorm(20), nrow=5114,ncol=100))
In case you want to work with a zoo object, here is the date index:
date<-seq(as.POSIXct("2000-01-01 00:00:00","GMT"),as.POSIXct("2013-12-31 00:00:00","GMT"), by="1440 min")
If you want, you can also convert the data frame to a zoo object:
x <- zoo(df, order.by=seq(as.POSIXct("2000-01-01 00:00:00","GMT"), as.POSIXct("2013-12-31 00:00:00","GMT"), by="1440 min"))

I am not familiar with zoo, so I converted your sample to a data frame. Your idea of using melt() is the right way to go. After that, you need to aggregate the rain amount by month; it is worth looking up aggregate() and other options. Here, I used dplyr and tidyr to arrange the sample data. I hope this will let you move forward.
### zoo to data frame by # Joshua Ulrich
### http://stackoverflow.com/questions/14064097/r-convert-between-zoo-object-and-data-frame-results-inconsistent-for-different
zoo.to.data.frame <- function(x, index.name="Date") {
stopifnot(is.zoo(x))
xn <- if(is.null(dim(x))) deparse(substitute(x)) else colnames(x)
setNames(data.frame(index(x), x, row.names=NULL), c(index.name,xn))
}
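### load the packages used below (zoo for is.zoo()/index(), reshape2 for melt())
library(zoo)
library(reshape2)
library(dplyr)
library(tidyr)
library(ggplot2)
### to data frame: pass the zoo object x built in the question
### (df is a plain data frame and would fail the is.zoo() check)
foo <- zoo.to.data.frame(x)
str(foo)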
### wide to long data frame, aggregate rain amount by Date
ana <- foo %>%
melt(., id.vars = "Date") %>%
group_by(Date) %>%
summarize(rain = sum(value))
### Aggregate rain amount by year and month
bob <- ana %>%
separate(Date, c("year", "month", "date")) %>%
group_by(year, month) %>%
summarize(rain = sum(rain))
### Drawing a ggplot figure
ggplot(data = bob, aes(x = month, y = rain)) +
geom_boxplot()
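If the goal is a single box per calendar month that pools every simulation and every year, one option is to skip the sums and stack the raw daily values directly. The sketch below uses base R only; the matrix `sim`, its 20-column layout and the exponential draws are assumptions made up for illustration, not the asker's data.

```r
# Hypothetical stand-in for the rainfall simulations: 20 columns of daily values
set.seed(42)
dates <- seq(as.Date("2000-01-01"), as.Date("2013-12-31"), by = "day")
sim <- matrix(rexp(length(dates) * 20), nrow = length(dates), ncol = 20)

# Month of each day as an ordered factor (locale-independent via month.name)
month_of_day <- factor(month.name[as.integer(format(dates, "%m"))],
                       levels = month.name)

# Stack all columns: as.vector() reads the matrix column by column,
# so the month labels are repeated once per simulation
long <- data.frame(month = rep(month_of_day, times = ncol(sim)),
                   rain  = as.vector(sim))

# One box per calendar month, pooling all simulations and all years
boxplot(rain ~ month, data = long, las = 2, xlab = "", ylab = "rain")
```

Whether to pool raw daily values or monthly totals depends on what the boxes should represent; the answer above plots monthly totals summed over all columns.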

I just found an easier way to do it; however, your answer really helped, jazzuro.
install.packages("reshape2")
library(zoo)   # needed for zoo(), as.yearmon and aggregate() on zoo objects
library(dplyr)
library(reshape2)
require(ggplot2)
df = data.frame(matrix(rnorm(20), nrow=5114,ncol=100))
x <- zoo(df, order.by=seq(as.POSIXct("2000-01-01 00:00:00","GMT"),
as.POSIXct("2013-12-31 00:00:00","GMT"), by="1440 min"))
v<-aggregate(x, as.yearmon, mean)
months<- rep(1:12,14)
lol<-data.frame(v,months)
df.m <- melt(lol, id.var = "months")
View(df.m)
p <- ggplot(df.m, aes(factor(months), value))
p + geom_boxplot(aes(fill = months))

Related

Plotting dummy variables with ggplot2

I actually need help building on this question:
ggplot2 graphic order by grouped variable instead of in alphabetical order.
I need to produce a similar graph and I have a problem with the black points. I have data where the column names are dates and the rows are filled with 0 or 1, and I need to plot a point wherever the value is 1. To reproduce, here is a small sample (in my dataset, there are over 300 columns):
df <- data.frame(id=c(1,2,3),
"26April1970"=c(0,0,1),
"14August1970"=c(0,1,0))
I need to plot the dates on the x axis, match the id to the canton and show the points where the value is 1.
Could anyone help?
Try this:
plot_data = df %>%
## put data in long format
pivot_longer(-id, names_to = "colname") %>%
## keep only 1s
filter(value == 1) %>%
## convert dates to Date class
mutate(date = as.Date(colname, format = "%d%B%Y"))
plot_data
# # A tibble: 2 x 4
#      id colname      value date
#   <dbl> <chr>        <dbl> <date>
# 1     2 14August1970     1 1970-08-14
# 2     3 26April1970      1 1970-04-26
## plot
ggplot(plot_data, aes(x = date, y = factor(id))) +
geom_point()
Using this data:
df <- data.frame(id=c(1,2,3),
"26April1970"=c(0,0,1),
"14August1970"=c(0,1,0), check.names = FALSE)
Maybe you are looking for this:
library(ggplot2)
library(dplyr)
library(tidyr)
#Data
df <- data.frame(id=c(1,2,3),
"26April1970"=c(0,0,1),
"14August1970"=c(0,1,0))
#Code
df %>% pivot_longer(-id) %>%
ggplot(aes(x=name,y=factor(value)))+
geom_point(aes(color=factor(value)))+
scale_color_manual(values=c('transparent','black'))+
theme(legend.position = 'none')+xlab('Date')+ylab('value')
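One thing to watch in the snippet above: its `data.frame()` call omits `check.names = FALSE`, so the date columns are renamed by `make.names()` and the axis labels gain an `X` prefix. A small base R sketch of the difference (the object names `mangled` and `kept` are made up here):

```r
# Default behaviour: names that start with a digit are altered by make.names()
mangled <- data.frame(id = 1:3, "26April1970" = c(0, 0, 1))

# With check.names = FALSE the column name is kept exactly as typed
kept <- data.frame(id = 1:3, "26April1970" = c(0, 0, 1), check.names = FALSE)

names(mangled)
names(kept)

# The untouched name can be parsed as a date; note that %B (full month name)
# depends on the session locale, so this assumes English month names
as.Date(names(kept)[2], format = "%d%B%Y")
```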

how to display grouped values in r?

I have data in form: date, key, value, n,
where:
date is the first date and time when a variable key got a specific value.
key is the variable name
value is a value
n is the number of subsequent occurrences of the same value
For example, if a has a value of 20 from 8am to 11am on 2017-01-01, and there are four recordings during that time span, its n value for 2017-01-01 8am would be 4. The reason the data is highly aggregated like this is that there are billions of rows of data.
This is a small example:
r1 <- c("2017-01-01 08:00:00","a",20,5)
r2 <- c("2017-01-01 08:00:00","b",10,20)
r3 <- c("2017-01-01 14:00:00","a",35,4)
dat <- rbind(r1,r2,r3)
colnames(dat) <- c("Date","Key","Value","n")
My goal is to show the value distributions over time, using different plots including lines (for time series).
As the amount of data is huge, I'm looking for an effective way of ungrouping this kind of data (i.e. replicating the value n-times) or displaying the data as it is.
Here is how I would ungroup the data, using a dplyr chain. As you can see, it is quite similar to Roman's comment.
r1 <- c("2017-01-01 08:00:00","a",20,5)
r2 <- c("2017-01-01 08:00:00","b",10,20)
r3 <- c("2017-01-01 14:00:00","a",35,4)
dat <- as.data.frame(rbind(r1,r2,r3),stringsAsFactors = F)
colnames(dat) <- c("Date","Key","Value","n")
library(dplyr)
dat %>% mutate(n = as.numeric(n)) %>%
do(.[rep(1:nrow(.), .$n),])
You could do this:
dat <- as.data.frame(dat)
dat$Date <- as.character(dat$Date)
dat$n <- as.numeric(dat$n)
dat$Value <- as.numeric(dat$Value)
ggplot(dat) +
geom_point(aes(x = Date, y = Value, color = Key, stroke = n)) +
expand_limits(y = 0)
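When expanding the rows is too expensive, some summaries can be computed from the counts directly by treating `n` as a weight. A minimal base R sketch on the example rows, assuming the columns have already been converted to proper types:

```r
dat <- data.frame(
  Date  = as.POSIXct(c("2017-01-01 08:00:00", "2017-01-01 08:00:00",
                       "2017-01-01 14:00:00"), tz = "UTC"),
  Key   = c("a", "b", "a"),
  Value = c(20, 10, 35),
  n     = c(5, 20, 4)
)

# Weighted mean of Value per Key, using n as the number of repeats;
# this equals the plain mean of the fully expanded data
weighted_means <- sapply(split(dat, dat$Key),
                         function(d) weighted.mean(d$Value, d$n))

# Same result obtained by actually expanding the rows
# (only feasible for small data)
expanded <- dat[rep(seq_len(nrow(dat)), dat$n), ]
expanded_means <- tapply(expanded$Value, expanded$Key, mean)
```

The weighted route never materialises the repeated rows, which is the point when the raw data runs to billions of records.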

Use aggregate to obtain the fourth highest value per year in R

I have a data frame of dates and values. I am trying to get the fourth highest value per year using dplyr and order, or multiple aggregate statements. I want the date on which the fourth highest value occurred, as well as the value, in a data frame for all years.
Here is my script:
timeozone <- import(i, date="DATES", date.format = "%Y-%m-%d %H", header=TRUE, na.strings="NA")
colnames(timeozone) <- c("column","date", "O3")
timeozone %>%
mutate(month = format(date, "%m"), day = format(date, "%d"), year = format(date, "%Y")) %>%
group_by(month, day, year) %>%
summarise(fourth = O3[order(O3, decreasing = TRUE)[4] ])
I am not sure what is wrong with what I've got above. Any help would be appreciated.
Data:
Dates Values
11/12/2000 14
11/13/2000 16
11/14/2000 17
11/15/2000 21
11/13/2001 31
11/14/2001 21
11/15/2001 62
11/16/2001 14
Another option with base (and using the iris data again) would be to split the variable by the group, then order it from largest to smallest and extract the fourth element. For example
data(iris)
petals <- split(iris$Petal.Length, iris$Species)
sapply(petals, function(x) x[order(x, decreasing = TRUE)][4])
or, actually, even more succinctly with tapply
tapply(iris$Petal.Length, iris$Species, function(x) x[order(x, decreasing = TRUE)][4])
Edit
Using the sample data above, you could extract the full row (or just the date, if you wanted), as follows.
date <- c("11/12/00", "11/13/00", "11/14/00", "11/15/00", "11/13/01",
"11/14/01", "11/15/01", "11/16/01")
value <- c(14, 16, 17, 21, 31, 21, 62, 14)
date_splt <- strsplit(date, "/")
year <- sapply(date_splt, "[", 3)
d <- data.frame(date, value, year)
d_splt <- split(d, d$year)
# sort each year from largest to smallest so that row 4 is the fourth highest
lapply(d_splt, function(x) x[order(x$value, decreasing = TRUE), ][4, ])
Since you didn't provide reproducible data, here is an example using iris. You will need to group by your years instead of by Species but the same ideas apply.
You can do it relatively directly with dplyr if you are not wedded to aggregate:
iris %>%
group_by(Species) %>%
summarise(fourth = Petal.Length[order(Petal.Length, decreasing = TRUE)[4] ])
gives:
     Species fourth
1     setosa    1.7
2 versicolor    4.9
3  virginica    6.6
You can confirm that the values are correct using:
by(iris$Petal.Length, iris$Species, sort)
Using nth, following the suggestion of @tchakravarty:
iris %>%
group_by(Species) %>%
summarise(fourth = nth(sort(Petal.Length), -4L))
Gives the same value as above.
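Applied to the sample data from the question, the same idea can return the date together with the value, one row per year, in a single data frame. This is a base R sketch; the column names `date`, `value` and `year` are chosen here for illustration.

```r
d <- data.frame(
  date  = as.Date(c("11/12/2000", "11/13/2000", "11/14/2000", "11/15/2000",
                    "11/13/2001", "11/14/2001", "11/15/2001", "11/16/2001"),
                  format = "%m/%d/%Y"),
  value = c(14, 16, 17, 21, 31, 21, 62, 14)
)
d$year <- format(d$date, "%Y")

# For each year, order by value (largest first) and keep the fourth row,
# then bind the per-year rows back into one data frame
fourth <- do.call(rbind, lapply(split(d, d$year), function(x) {
  x[order(x$value, decreasing = TRUE)[4], ]
}))
fourth
```

A year with fewer than four observations yields a row of NA values rather than an error, so it may be worth checking group sizes first.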

Substituting dates with number of days in time series

I have following data on student scores on several pretests before their true exam.
a<-(c("2013-02-25","2013-03-13","2013-04-24","2013-05-12","2013-07-12","2013-08-11","actual_exam_date"))
b<-c(300,230,400,NA,NA,NA,"2013-04-30")
c<-c(NA,260,410,420,NA,NA,"2013-05-30")
d<-c(300,230,400,NA,370,390,"2013-08-30")
df<-as.data.frame(rbind(b,c,d))
colnames(df)<-a
rownames(df)<-(c("student 1","student 2","student 3"))
The actual datasheet is much larger. Since the dates vary so much, and the timing between the pretests and to the exam are relatively similar, I would rather convert the true dates into the number of days before the exam, so that they are the new column names, not dates. I understand that this will merge some of the columns which is OK. How would I be able to do that?
This is another good use case for reshape2, because you want to go to long form for plotting. For example:
# you are going to need the student id as a field
df$student_id <- row.names(df)
library('reshape2')
df2 <- melt(df, id.vars = c('student_id','actual_exam_date'),
variable.name = 'pretest_date',
value.name = 'pretest_score')
# drop empty observations
df2 <- df2[!is.na(df2$pretest_score),]
# these need to be dates (as.character() guards against factor columns)
df2$actual_exam_date <- as.Date(as.character(df2$actual_exam_date))
df2$pretest_date <- as.Date(as.character(df2$pretest_date))
# date difference
df2$days_before_exam <- as.integer(df2$actual_exam_date - df2$pretest_date)
# scores need to be numeric (convert via character so factor codes are not used)
df2$pretest_score <- as.numeric(as.character(df2$pretest_score))
# now you can make some plots
library('ggplot2')
ggplot(df2, aes(x = days_before_exam, y = pretest_score, col=student_id) ) +
geom_line(lwd=1) + scale_x_reverse() +
geom_vline(xintercept = 0, linetype = 'dashed', lwd = 1) +
ggtitle('Pretest Performance') + xlab('Days Before Exam') + ylab('Pretest Score')
Here is one way to approach this one. I am sure there are many others. I commented the code to explain what is going on at each step:
# Load the libraries you need (ggplot2 is used for the plot at the end)
library(tidyr)
library(dplyr)
library(ggplot2)
# Construct data frame you provided
a <- (c("2013-02-25","2013-03-13","2013-04-24","2013-05-12","2013-07-12","2013-08-11","actual_exam_date"))
b <- c(300,230,400,NA,NA,NA,"2013-04-30")
c <- c(NA,260,410,420,NA,NA,"2013-05-30")
d <- c(300,230,400,NA,370,390,"2013-08-30")
df <- as.data.frame(rbind(b,c,d))
colnames(df) <- a
# Add student IDs as a column instead of row names and move them to first position
df$StudentID <- row.names(df)
row.names(df) <- NULL
df <- select(df, StudentID, everything())
# Gather date columns as 'categories' with score as the new column value
newdf <- df %>% gather(Date, Score, -actual_exam_date, -StudentID) %>% arrange(StudentID)
# Convert dates coded as factor variables into actual dates so we can do days to exam computation
newdf$actual_exam_date <- as.Date(as.character(newdf$actual_exam_date))
newdf$Date <- as.Date(as.character(newdf$Date))
# Create a new column of days before exam per student ID (group) and filter
# out dates with missing scores for each student
newdf <- newdf %>% group_by(StudentID) %>% mutate(daysBeforeExam = as.integer(difftime(actual_exam_date, Date, units = 'days'))) %>% filter(!is.na(Score))
# Plot the trends using ggplot
ggplot(newdf, aes(x = daysBeforeExam, y = Score, col = StudentID, group = StudentID)) + geom_line(size = 1) + geom_point(size = 2)
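Both answers stop at the long format, which is what plotting needs. If a wide table with days-before-exam as column names is still wanted, as the question asks, the long data can be cross-tabulated afterwards. A base R sketch, with the long data typed in by hand from the question's values and hypothetical column names:

```r
long <- data.frame(
  student = rep(c("student 1", "student 2", "student 3"), times = c(3, 3, 5)),
  exam    = as.Date(rep(c("2013-04-30", "2013-05-30", "2013-08-30"),
                        times = c(3, 3, 5))),
  pretest = as.Date(c("2013-02-25", "2013-03-13", "2013-04-24",
                      "2013-03-13", "2013-04-24", "2013-05-12",
                      "2013-02-25", "2013-03-13", "2013-04-24",
                      "2013-07-12", "2013-08-11")),
  score   = c(300, 230, 400, 260, 410, 420, 300, 230, 400, 370, 390)
)
long$days_before_exam <- as.integer(long$exam - long$pretest)

# Rows = students, columns = days before exam; cells with no pretest are NA.
# mean() merges scores if a student has two pretests at the same day offset.
wide <- tapply(long$score, list(long$student, long$days_before_exam), mean)
wide
```

Using `mean` as the merge rule is an assumption; `max` or `length` could be substituted depending on how collisions should be handled.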

Change date order at axis

I am always struggling with this, so I think it is finally time to ask for some help...
I tried to make a reproducible example, but for some reason I cannot get my x$monthday in the %m-%d format :(.
x<-data.frame(seq(as.POSIXct('2012-10-01'), as.POSIXct('2015-03-01'), by= "day"))
names(x)<- "date"
x$month<- months(x$date)
x$monthday<- as.POSIXct(x$date, format= "%m-%d")
x1<- x[x$month== 'October' |x$month== 'November' | x$month== 'December' |x$month== 'January'|x$month== 'February', ]
y<- 1: nrow(x1)
x2<-cbind(x1, y)
x3<- aggregate(list(y=x2$y), list(monthday=x2$monthday), mean)
plot(x3$monthday, x3$y)
The date has the format of %m/%d and is of a time series from October-March.
R orders the axis beautifully from January to December, which causes a big gap in between, because my data range from October-March.
How can I make my x axis order in the form from October-March?
Thank you very much in advance.
library(dplyr)
library(ggplot2)
library(lubridate)
# Fake data
dat <- data.frame(date=seq(as.POSIXct('2012-10-01'), as.POSIXct('2015-03-01'), by="day"))
set.seed(23)
dat$temperature = cumsum(rnorm(nrow(dat)))
# Subset to October - February (the months selected in the question's code)
dat <- dat[months(dat$date) %in% month.name[c(1:2,10:12)], ]
# Calculate mean daily temperature
dat = dat %>% group_by(Month=month(date), Day=day(date)) %>%
summarise(dailyMeanTemp = mean(temperature)) %>%
mutate(newDate = as.Date(ifelse(Month %in% 10:12,
paste0("2014-", Month, "-", Day),
paste0("2015-", Month, "-", Day))))
The mutate function above creates a fake year, only so that we can keep the dates in "date" format and get them ordered from October to March. There's probably a better way to do it (maybe a function in the zoo or xts packages), but this seems to work.
ggplot(dat, aes(newDate, dailyMeanTemp)) +
geom_line() + geom_point() +
labs(y="Mean Temperature", x="Month")
Or, in base graphics:
plot(dat$newDate, dat$dailyMeanTemp)
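Another way to get an October-first ordering without inventing a year is to shift the month number so that October becomes position 0, or to use a factor whose levels start at October. A small base R sketch (the variable names and the series `y` are illustrative only):

```r
dates <- seq(as.Date("2012-10-01"), as.Date("2013-02-28"), by = "day")
m <- as.integer(format(dates, "%m"))

# Position within an October-to-September "season":
# Oct -> 0, Nov -> 1, Dec -> 2, Jan -> 3, ..., Sep -> 11
season_pos <- (m - 10) %% 12

# Factor with October as the first level, for a discrete axis
season_month <- factor(month.abb[m], levels = month.abb[c(10:12, 1:9)])

# Mean of a made-up series per month, reported in October-first order
# (months with no data come out as NA)
y <- seq_along(dates)
tapply(y, season_month, mean)
```

The factor approach suits monthly summaries; for daily points on a continuous axis, the fake-year trick in the answer above keeps real date spacing.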
