I want to plot temperature over time for 2 sites. I have temperature readings every 10 minutes, all day, from February to April, and I need daily cycles of hourly average temperature to plot.
I calculated the mean temperature for each hour of each day and tried to create the plot with geom_point and geom_line in different ways.
data <- read.xlsx("temperatura.xlsx", 1)
data <- data %>% mutate(month = as.factor(month), day = as.factor(day), h = as.factor(h), min = as.factor(min))
head(data)
month day h min t.site1 t.site2
2 1 0 0 15.485 16.773
2 1 0 10 15.509 16.773
2 1 0 20 15.557 16.773
2 1 0 30 15.557 16.773
2 1 0 40 15.605 16.773
2 1 0 50 15.605 16.773
str(data)
'data.frame': 12816 obs. of 6 variables:
$ month : Factor w/ 3 levels "2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ day : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ h : Factor w/ 24 levels "0","1","2","3",..: 1 1 1 1 1 1 2 2 2 2 ...
$ min : Factor w/ 6 levels "0","10","20",..: 1 2 3 4 5 6 1 2 3 4 ...
$ t.site1: num 15.5 15.5 15.6 15.6 15.6 ...
$ t.site2: num 16.8 16.8 16.8 16.8 16.8 ...
hour <- group_by(data, month, day, h)
mean.h.site1 <- summarize(hour, mean.h.site1 = mean(t.site1))
t1 <- ggplot(data = mean.h.site1, aes(x = h, y = mean.h.site1)) +
  geom_line()
t2 <- ggplot(data = mean.h.site1, aes(x = h, y = mean.h.site1, group = month)) +
  geom_line() +
  geom_point()
t3 <- ggplot(data = mean.h.site1, aes(x = day, y = mean.h.site1, group = 1)) +
  geom_point()
I expected the output to show the variability of temperature over time for each site, but the actual output shows temperature variability within each day.
It's interesting that your data is showing month, day and hour as factor. Is it possible that there are some character values somewhere in that column when you read the data? It's very unusual to see numbers stored as factor in that fashion.
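A quick way to check for stray non-numeric values is to look at the factor levels directly. The sketch below uses a made-up factor (the "n/a" entry is an assumption for illustration, not something seen in the asker's data) and also shows why as.numeric() on a factor must go through as.character() first:

```r
# Hypothetical hour column with one stray text value
h <- factor(c("0", "1", "10", "n/a"))

# Pitfall: as.numeric() on a factor returns the internal level codes
codes <- as.numeric(h)                                    # 1 2 3 4

# Going through as.character() recovers the values; non-numbers become NA
values <- suppressWarnings(as.numeric(as.character(h)))   # 0 1 10 NA

# Levels that cannot be read as numbers
bad <- levels(h)[is.na(suppressWarnings(as.numeric(levels(h))))]
print(bad)
```

If bad comes back empty for every time column, the factors were most likely created by an explicit as.factor() call rather than by the import.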
I'll do 4 things:
Convert factors to numbers
Convert numbers to dates
Convert a wide table to a long one, and finally
plot the temps against a real date
# Load packages and data
library(data.table) # for overall fast data processing
library(lubridate) # for dates wrangling
library(ggplot2) # plotting
dt <- fread("month day h min t.site1 t.site2
2 1 0 0 15.485 16.773
2 1 0 10 15.509 16.773
2 1 0 20 15.557 16.773
2 1 0 30 15.557 16.773
2 1 0 40 15.605 16.773
2 1 0 50 15.605 16.773")
# Convert factors to numbers (I actually didn't run this because I just created the data.table, but it seems you'll need to do it):
dt[, names(dt)[1:4] := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = 1:4]
# Create proper dates. We'll consider all dates occurring in 2019.
dt[, date := ymd_hm(paste0("2019/", month, "/", day, " ", h, ":", min))]
# convert wide data to long one
dt2 <- melt(dt[, .(date, t.site1, t.site2)], id.vars = "date")
# plot the data
ggplot(dt2, aes(x = date, y = value, color = variable))+geom_point()+geom_path()
You could paste the time columns together and convert them as.POSIXct.
As @PavoDive already pointed out, we need numeric time columns. Check the code that produced the data, or convert to numeric with d[1:4] <- Map(function(x) as.numeric(as.character(x)), d[1:4]).
Now paste the values of each row together with apply, convert the result with as.POSIXct, and cbind it to the remaining columns. The sprintf call first zero-pads every value to two digits, so that the pasted string always matches the format.
d2 <- cbind(time=as.POSIXct(apply(sapply(d[1:4], sprintf, fmt="%02d"), 1, paste, collapse=""),
                            format="%m%d%H%M"),
            d[5:6])
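To see why the zero-padding matters, here is a small sketch for a single row. The values (month 2, day 1, hour 0, minute 10) are taken from the sample data; the tz = "UTC" argument is an assumption added only to make the result independent of the local time zone:

```r
# One row of time parts: month, day, hour, minute
parts <- c(month = 2, day = 1, h = 0, min = 10)

# Without padding the field boundaries are lost
unpadded <- paste(parts, collapse = "")                   # "21010"

# With two-digit padding every field has a fixed width
padded <- paste(sprintf("%02d", parts), collapse = "")    # "02010010"

# The padded string lines up with %m %d %H %M (year defaults to the current one)
stamp <- as.POSIXct(padded, format = "%m%d%H%M", tz = "UTC")
print(format(stamp, "%m-%d %H:%M"))
```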
Plots nicely, here in base R:
with(d2, plot(time, t.site1, ylim=c(15, 17), xaxt="n",
              xlab="time", ylab="value", type="b", col="red",
              main="Time series"))
with(d2, lines(time, t.site2, type="b", col="green"))
mtext(strftime(d2$time, "%H:%M"), 1, 1, at=d2$time)  # strftime gives the desired formatting
legend("bottomright", names(d2)[2:3], col=c("red", "green"), lty=rep(1, 2))
Data
d <- structure(list(month = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2", class = "factor"),
day = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor"),
h = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "0", class = "factor"),
min = structure(1:6, .Label = c("0", "10", "20", "30", "40",
"50"), class = "factor"), t.site1 = c(15.485, 15.509, 15.557,
15.557, 15.605, 15.605), t.site2 = c(16.773, 16.773, 16.773,
16.773, 16.773, 16.773)), row.names = c(NA, -6L), class = "data.frame")
I'm assuming that you needed the actual output showing temperature variability by hour for each day in the same plot?
EDITED:
I have updated the code to generate a day's worth of data and to produce the chart.
library(tidyverse)
library(lubridate)
# tibble() replaces the deprecated data_frame()
df <- tibble(month = rep(2, 144),
             day = rep(1, 144),
             h = rep(0:23, each = 6),       # hours run 0-23, six readings per hour
             min = rep((0:5) * 10, 24),
             t.site1 = rnorm(n = 144, mean = 15.501, sd = 0.552),
             t.site2 = rnorm(n = 144, mean = 16.501, sd = 0.532))
df %>%
  group_by(month, day, h) %>%
  summarise(mean_t_site1 = mean(t.site1), mean_t_site2 = mean(t.site2)) %>%
  mutate(date = ymd_h(paste0("2019-", month, "-", day, " ", h))) %>%
  ungroup() %>%
  select(mean_t_site1:date) %>%
  gather(key = "site", value = "mean_temperature", -date) %>%
  ggplot(aes(x = date, y = mean_temperature, colour = site)) +
  geom_line()
Could you verify if this is the output you need?
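The question asks for a daily cycle of hourly averages, which suggests averaging over all days for each hour rather than keeping one value per day and hour. A minimal base R sketch of that reading, on simulated data (the sine-shaped temperatures and the three-day span are assumptions for illustration only, and h has to be numeric rather than a factor):

```r
set.seed(1)

# Simulated readings: every 10 minutes, 24 hours, 3 days
readings <- expand.grid(min = seq(0, 50, by = 10), h = 0:23, day = 1:3)
n <- nrow(readings)
readings$t.site1 <- 15 + 2 * sin(2 * pi * readings$h / 24) + rnorm(n, sd = 0.2)
readings$t.site2 <- 16 + 2 * sin(2 * pi * readings$h / 24) + rnorm(n, sd = 0.2)

# Average over all days and minutes for each hour of the day
cycle <- aggregate(cbind(t.site1, t.site2) ~ h, data = readings, FUN = mean)

# One line per site across the 24 hours
matplot(cycle$h, cycle[, c("t.site1", "t.site2")], type = "l", lty = 1,
        col = c("red", "green"), xlab = "hour of day", ylab = "mean temperature")
legend("topright", c("t.site1", "t.site2"), col = c("red", "green"), lty = 1)
```

Adding month to the right-hand side of the formula (~ h + month) would give one cycle per month instead of a single overall cycle.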
I am trying to create a stacked bar chart showing % frequency of occurrences by group
library(dplyr)
library(ggplot2)
brfss_2013 %>%
  group_by(incomeLev, mentalHealth) %>%
  summarise(count_mentalHealth=n()) %>%
  group_by(incomeLev) %>%
  mutate(count_inc=sum(count_mentalHealth)) %>%
  mutate(percent=count_mentalHealth / count_inc * 100) %>%
  ungroup() %>%
  ggplot(aes(x=forcats::fct_explicit_na(incomeLev),
             y=count_mentalHealth,
             group=mentalHealth)) +
  geom_bar(aes(fill=mentalHealth),
           stat="identity") +
  geom_text(aes(label=sprintf("%0.1f%%", percent)),
            position=position_stack(vjust=0.5))
However, this is the traceback I receive:
1. dplyr::group_by(., incomeLev, mentalHealth)
8. plyr::summarise(., count_mentalHealth = n())
9. [ base::eval(...) ] with 1 more call
11. dplyr::n()
12. dplyr:::from_context("..group_size")
13. `%||%`(...)
In addition: Warning message:
Factor `incomeLev` contains implicit NA, consider using `forcats::fct_explicit_na`
Here is a sample of my data
brfss_2013 <- structure(list(incomeLev = structure(c(2L, 3L, 3L, 2L, 2L, 3L,
NA, 2L, 3L, 1L, 3L, NA), .Label = c("$25,000-$35,000", "$50,000-$75,000",
"Over $75,000"), class = "factor"), mentalHealth = structure(c(3L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("Excellent",
"Ok", "Very Bad"), class = "factor")), row.names = c(NA, -12L
), class = "data.frame")
Update:
Output of str(brfss_2013):
'data.frame': 491775 obs. of 9 variables:
$ mentalHealth: Factor w/ 5 levels "Excellent","Good",..: 5 1 1 1 1 1 3 1 1 1 ...
$ pa1min_ : int 947 110 316 35 429 120 280 30 240 260 ...
$ bmiLev : Factor w/ 6 levels "Underweight",..: 5 1 3 2 5 5 2 3 4 3 ...
$ X_drnkmo4 : int 2 0 80 16 20 0 1 2 4 0 ...
$ X_frutsum : num 413 20 46 49 7 157 150 67 100 58 ...
$ X_vegesum : num 53 148 191 136 243 143 216 360 172 114 ...
$ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
$ X_state : Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...
$ incomeLev : Factor w/ 4 levels "$25,000-$35,000",..: 2 4 4 2 2 4 NA 2 4 1 ...
First of all, your code works as written once you convert both columns to character. So you could just do
brfss_2013[c("incomeLev", "mentalHealth")] <-
lapply(brfss_2013[c("incomeLev", "mentalHealth")], as.character)
and then just run your code as you figured it out.
But, let's do it with factors (don't run the lapply(.) line in this case!).
You want a "missing" category, which you can obtain by adding a new level "missing" for the NAs.
levels(brfss_2013$incomeLev) <- c(levels(brfss_2013$incomeLev), "missing")
brfss_2013$incomeLev[is.na(brfss_2013$incomeLev)] <- "missing"
Then, your aggregation (in a base R way).
b1 <- with(brfss_2013, aggregate(list(count_mentalHealth=incomeLev),
                                 by=list(mentalHealth=mentalHealth, incomeLev=incomeLev),
                                 length))
b2 <- aggregate(mentalHealth ~ ., brfss_2013, length)
names(b2)[2] <- "count_inc"
brfss_2013.agg <- merge(b1, b2)
rm(b1, b2) # just to clean up
Add the "percent" column.
# multiply by 100 so the "%0.1f%%" label below shows percentages, not fractions
brfss_2013.agg$percent <- with(brfss_2013.agg, count_mentalHealth / count_inc * 100)
Plot.
library(ggplot2)
ggplot(brfss_2013.agg, aes(x=incomeLev, y=count_mentalHealth, group=mentalHealth)) +
  geom_bar(aes(fill=mentalHealth), stat="identity") +
  geom_text(aes(label=sprintf("%0.1f%%", percent)),
            position=position_stack(vjust=0.5))
So your code actually works fine for me. It looks like it might be an issue with package versions because it seems odd that you're using the plyr summarise function.
However, here's a slightly more concise way to create that graph (and hopefully this is helpful for whatever you want to add to this plot)
brfss_2013 %>%
  # Add count of income levels first (note this only adds a variable)
  add_count(incomeLev) %>%
  rename(count_inc = n) %>%
  # Count observations per group (this transforms data)
  count(incomeLev, mentalHealth, count_inc) %>%
  rename(count_mentalHealth = n) %>%
  mutate(percent = count_mentalHealth / count_inc) %>%
  ggplot(aes(x = incomeLev,
             y = count_mentalHealth,
             # Technically you don't need this group here but groups can be handy
             group = mentalHealth)) +
  geom_bar(aes(fill = mentalHealth),
           stat = "identity") +
  # Using the scales package does the percent formatting for you
  geom_text(aes(label = scales::percent(percent)), vjust = 1) +
  theme_minimal()
I have a table with the following headers and example data
Lat Long Date Value.
30.497478 -87.880258 01/01/2016 10
30.497478 -87.880258 01/02/2016 15
30.497478 -87.880258 01/05/2016 20
33.284928 -85.803608 01/02/2016 10
33.284928 -85.803608 01/03/2016 15
33.284928 -85.803608 01/05/2016 20
I would like to average the Value column on a monthly basis for each location.
So the example output would be
Lat Long Month Avg Value
30.497478 -87.880258 January 15
A solution using dplyr and lubridate.
library(dplyr)
library(lubridate)
dt2 <- dt %>%
  mutate(Date = mdy(Date), Month = month(Date)) %>%
  group_by(Lat, Long, Month) %>%
  summarise(`Avg Value` = mean(Value))
dt2
# A tibble: 2 x 4
# Groups: Lat, Long [?]
Lat Long Month `Avg Value`
<dbl> <dbl> <dbl> <dbl>
1 30.49748 -87.88026 1 15
2 33.28493 -85.80361 1 15
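The requested output shows the month as a name ("January") rather than a number. One locale-independent way to get that, sketched in base R with the built-in month.name constant (this swaps in base R for the lubridate call above; the three dates and values are the first location from the question):

```r
# Dates and values for the first location in the question
dates <- as.Date(c("01/01/2016", "01/02/2016", "01/05/2016"), format = "%m/%d/%Y")
values <- c(10, 15, 20)

# Map the month number to its English name via the built-in month.name vector
month_name <- month.name[as.integer(format(dates, "%m"))]

# Average per month name
avg <- tapply(values, month_name, mean)
print(avg)
```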
You can try the following, but it first modifies the data frame adding an extra column, Month, using package zoo.
library(zoo)
dat$Month <- as.yearmon(as.Date(dat$Date, "%m/%d/%Y"))
aggregate(Value. ~ Lat + Long + Month, dat, mean)
# Lat Long Month Value.
#1 30.49748 -87.88026 jan 2016 15
#2 33.28493 -85.80361 jan 2016 15
If you don't want to change the original data, make a copy dat2 <- dat and change the copy.
DATA
dat <-
structure(list(Lat = c(30.497478, 30.497478, 30.497478, 33.284928,
33.284928, 33.284928), Long = c(-87.880258, -87.880258, -87.880258,
-85.803608, -85.803608, -85.803608), Date = structure(c(1L, 2L,
4L, 2L, 3L, 4L), .Label = c("01/01/2016", "01/02/2016", "01/03/2016",
"01/05/2016"), class = "factor"), Value. = c(10L, 15L, 20L, 10L,
15L, 20L)), .Names = c("Lat", "Long", "Date", "Value."), class = "data.frame", row.names = c(NA,
-6L))
EDIT.
If you want to compute several statistics, you can define a function that computes them and returns a named vector and call it in aggregate, like the following.
stat <- function(x){
c(Mean = mean(x), Median = median(x), SD = sd(x))
}
agg <- aggregate(Value. ~ Lat + Long + Month, dat, stat)
agg <- cbind(agg[1:3], as.data.frame(agg[[4]]))
agg
# Lat Long Month Mean Median SD
#1 30.49748 -87.88026 jan 2016 15 15 5
#2 33.28493 -85.80361 jan 2016 15 15 5
I've been trying to get my head around this, but I can't figure out how to do it.
This is an example:
ID Hosp. date Discharge date
1 2006-02-02 2006-02-04
1 2006-02-04 2006-02-18
1 2006-02-22 2006-03-24
1 2008-08-09 2008-09-14
2 2004-01-03 2004-01-08
2 2004-01-13 2004-01-15
2 2004-06-08 2004-06-28
What I want is a way to combine rows by ID if the discharge date is the same as the Hosp. date (or within ±7 days) in the next row. So it would look like this:
ID Hosp. date Discharge date
1 2006-02-02 2006-03-24
1 2008-08-09 2008-09-14
2 2004-01-03 2004-01-15
2 2004-06-08 2004-06-28
Using the data.table-package:
# load the package
library(data.table)
# convert to a 'data.table'
setDT(d)
# make sure you have the correct order
setorder(d, ID, Hosp.date)
# summarise
d[, grp := cumsum(Hosp.date > (shift(Discharge.date, fill = Discharge.date[1]) + 7))
  , by = ID
  ][, .(Hosp.date = min(Hosp.date), Discharge.date = max(Discharge.date))
    , by = .(ID, grp)]
you get:
ID grp Hosp.date Discharge.date
1: 1 0 2006-02-02 2006-03-24
2: 1 1 2008-08-09 2008-09-14
3: 2 0 2004-01-03 2004-01-15
4: 2 1 2004-06-08 2004-06-28
The same logic with dplyr:
library(dplyr)
d %>%
  arrange(ID, Hosp.date) %>%
  group_by(ID) %>%
  mutate(grp = cumsum(Hosp.date > (lag(Discharge.date, default = Discharge.date[1]) + 7))) %>%
  # dplyr >= 1.0.0 spells this argument .add; older versions use add = TRUE
  group_by(grp, .add = TRUE) %>%
  summarise(Hosp.date = min(Hosp.date), Discharge.date = max(Discharge.date))
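Both versions rely on the same trick: flag a row as the start of a new stay when its admission is more than 7 days after the previous discharge, then take the cumulative sum of those flags as a group id. A base R walk-through for ID 1 only (the variable names hosp, dis, prev and new_stay are illustrative, not part of the answer above):

```r
# Admissions and discharges for ID 1, already sorted by admission date
hosp <- as.Date(c("2006-02-02", "2006-02-04", "2006-02-22", "2008-08-09"))
dis  <- as.Date(c("2006-02-04", "2006-02-18", "2006-03-24", "2008-09-14"))

# Previous row's discharge; the first row is filled with its own discharge
prev <- c(dis[1], head(dis, -1))

# TRUE where the gap to the previous discharge exceeds 7 days
new_stay <- hosp > prev + 7

# Running count of new stays is the group id
grp <- cumsum(new_stay)

# Collapse each group to its first admission and last discharge
merged <- data.frame(
  Hosp.date = as.Date(tapply(hosp, grp, min), origin = "1970-01-01"),
  Discharge.date = as.Date(tapply(dis, grp, max), origin = "1970-01-01")
)
print(merged)
```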
Used data:
d <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
Hosp.date = structure(c(13181, 13183, 13201, 14100, 12420, 12430, 12577), class = "Date"),
Discharge.date = structure(c(13183, 13197, 13231, 14136, 12425, 12432, 12597), class = "Date")),
.Names = c("ID", "Hosp.date", "Discharge.date"), class = "data.frame", row.names = c(NA, -7L))
I'd like to apply analytical weights to some time series data, but am not sure how to do this in R. I'm transcribing some Stata code and the code uses collapse and [aweight='weightVar'].
Stata Code
collapse temp [aweight=`weightVar'], by(year);
How can I apply analytical weights to the data, using croparea below as the weighting variable for temp, for each id in each year?
Sample data
df <- structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2), year = c(1900,
1900, 1900, 1900, 1901, 1901, 1901, 1901), month = c(1L, 2L,
3L, 4L, 1L, 2L, 3L, 4L), temp = c(51.8928991815029, 52.8768994596968,
70.0998976356871, 62.2724802472936, 51.8928991815029, 52.8768994596968,
70.0998976356871, 62.2724802472936), croparea = c(50, 50, 50,
50, 30, 30, 30, 30)), .Names = c("id", "year", "month", "temp",
"croparea"), row.names = c(NA, -8L), class = "data.frame")
id year month temp croparea
1 1 1900 1 51.89290 50
2 1 1900 2 52.87690 50
3 1 1900 3 70.09990 50
4 1 1900 4 62.27248 50
5 2 1901 1 51.89290 30
6 2 1901 2 52.87690 30
7 2 1901 3 70.09990 30
8 2 1901 4 62.27248 30
Thanks for including sample data! That makes things much easier.
Stata collapse is similar to the R functions aggregate or ddply. It looks like you want a weighted (by croparea) mean of temp grouped by id.
For weighted means in R see this SO question; I'll take the top solution and apply it to your data:
library(plyr)
ddply(df, .(id), function(x) data.frame(wtempmean=weighted.mean(x$temp, x$croparea)))
id wtempmean
1 1 59.28554
2 2 59.28554
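For reference, weighted.mean(x, w) is just sum(x * w) / sum(w). The Stata call collapses by(year) rather than by id, so here is a base R sketch grouped by year, under the assumption that the aweight mean in collapse corresponds to this croparea-weighted mean (the names by_year, wmean and manual are illustrative):

```r
# Sample data from the question
df <- data.frame(
  id = rep(1:2, each = 4),
  year = rep(c(1900, 1901), each = 4),
  temp = rep(c(51.8928991815029, 52.8768994596968,
               70.0998976356871, 62.2724802472936), 2),
  croparea = rep(c(50, 30), each = 4)
)

# Split by year and take the croparea-weighted mean of temp in each piece
by_year <- split(df, df$year)
wmean <- sapply(by_year, function(g) weighted.mean(g$temp, g$croparea))

# The same quantity written out by hand
manual <- sapply(by_year, function(g) sum(g$temp * g$croparea) / sum(g$croparea))
print(wmean)
```

Because croparea is constant within each year in this sample, the weighted mean equals the plain mean here; with varying weights the two would differ.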
There is a column in my dataset that contains time in the format 00:20:10. I have two questions. First, when I import it into R using read.xlsx2(), this column is converted to factor type. How can I convert it to time type?
Second, I want to calculate each person's total time in number of minutes.
ID Time
1 00:10:00
1 00:21:30
2 00:30:10
2 00:04:10
The output I want is:
ID Total.time
1 31.5
2 34.3
I haven't dealt with time data before, and I hope someone can recommend some packages as well.
You could use times() from the chron package to convert the Time column to "times" class. Then aggregate() to sum the times, grouped by the ID column. This first block will give us actual times in the result.
library(chron)
df$Time <- times(df$Time)
aggregate(list(Total.Time = df$Time), df[1], sum)
# ID Total.Time
# 1 1 00:31:30
# 2 2 00:34:20
For decimal output, we can employ minutes() and seconds(), also from chron.
aggregate(list(Total.Time = df$Time), df[1], function(x) {
  minutes(s <- sum(x)) + (seconds(s) / 60)
})
# ID Total.Time
# 1 1 31.50000
# 2 2 34.33333
Furthermore, we can also use data.table for improved efficiency.
library(data.table)
setDT(df)[, .(Total.Time = minutes(s <- sum(Time)) + (seconds(s) / 60)), by = ID]
# ID Total.Time
# 1: 1 31.50000
# 2: 2 34.33333
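If adding a package is not an option, the same totals can be sketched in base R by splitting each "HH:MM:SS" string and weighting the parts. This is an alternative to the chron approach, not a reproduction of it, and to_minutes is a hypothetical helper name:

```r
# Convert "HH:MM:SS" strings (or factors) to decimal minutes
to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  # hours * 60 + minutes + seconds / 60
  vapply(parts, function(p) sum(as.numeric(p) * c(60, 1, 1 / 60)), numeric(1))
}

# Sample data from the question, with Time as a factor as read.xlsx2() returns it
ids <- c(1L, 1L, 2L, 2L)
time_col <- factor(c("00:10:00", "00:21:30", "00:30:10", "00:04:10"))

# Total minutes per ID
total <- tapply(to_minutes(time_col), ids, sum)
print(total)
```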
Data:
df <- structure(list(ID = c(1L, 1L, 2L, 2L), Time = structure(c(2L,
3L, 4L, 1L), .Label = c("00:04:10", "00:10:00", "00:21:30", "00:30:10"
), class = "factor")), .Names = c("ID", "Time"), class = "data.frame", row.names = c(NA,
-4L))