Time-series data visualization - r

I have a fairly large data frame in R stored in long form. It contains body temperature data collected from 40 individuals at 10-second intervals over 16 days. Each individual was exposed to two conditions (Cond1 and Cond2). It essentially looks like this:
ID Cond1 Cond2 Day ToD Temp
1 A B 1 18.0 37.1
1 A B 1 18.3 37.2
1 A B 2 18.6 37.5
2 B A 1 18.0 37.0
2 B A 1 18.3 36.9
2 B A 2 18.6 36.9
3 A A 1 18.0 36.8
3 A A 1 18.3 36.7
3 A A 2 18.6 36.7
...
I want to create four separate line plots, one for each combination of conditions (AB, BA, AA, BB), each showing mean Temp over time (days 1-16).
P.S. ToD stands for time of day. I'm not sure whether I need it to create the plot.
So far I have tried to define the dataset as a time series:
ts <- ts(data=dataset$Temp, start=1, end=16, frequency=8640)
plot(ts)
This returns a plot of Temp, but I can't figure out how to define condition values for breaking up the data.
Edit:
Essentially I want a plot that looks like this 1, but one for each group separately, and using mean Temp values. This plot is just for one individual in one condition, and I want one that shows the mean for all individuals in the same condition.

You can use summarise and group_by to group the data by condition and then plot it. Is this what you're looking for?
library(dplyr)
library(ggplot2)
## I created a dataframe df that looks like this:
ID Cond1 Cond2 Day ToD Temp
1 1 A B 1 18.0 37.1
2 1 A B 1 18.3 37.2
3 1 A B 2 18.6 37.5
4 2 B A 1 18.0 37.0
5 2 B A 1 18.3 36.9
6 2 B A 2 18.6 36.9
7 3 A A 1 18.0 36.8
8 3 A A 1 18.3 36.7
9 3 A A 2 18.6 36.7
df$Cond <- paste0(df$Cond1, df$Cond2)
d <- summarise(group_by(df, Cond, Day), t = mean(Temp))
ggplot(d, aes(Day, t, color = Cond)) + geom_line()
This draws a single plot with one line per condition combination.
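If you want one panel per condition combination rather than one coloured line each, facet_wrap() is one option. This is only a minimal sketch built on the same toy rows as above; the axis label and the added points are my own choices, and with the real data you would get four panels (AB, BA, AA, BB) instead of three.

```r
library(dplyr)
library(ggplot2)

# Toy data with the same columns as the question (only the rows shown there)
df <- data.frame(
  ID    = rep(1:3, each = 3),
  Cond1 = rep(c("A", "B", "A"), each = 3),
  Cond2 = rep(c("B", "A", "A"), each = 3),
  Day   = rep(c(1, 1, 2), times = 3),
  ToD   = rep(c(18.0, 18.3, 18.6), times = 3),
  Temp  = c(37.1, 37.2, 37.5, 37.0, 36.9, 36.9, 36.8, 36.7, 36.7)
)

# One label per combination of the two conditions
df$Cond <- paste0(df$Cond1, df$Cond2)

# Mean temperature per condition combination and day
d <- df %>%
  group_by(Cond, Day) %>%
  summarise(t = mean(Temp), .groups = "drop")

# One panel per condition combination
p <- ggplot(d, aes(Day, t)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ Cond) +
  labs(y = "Mean temperature")
print(p)
```

Grouping by Day alone averages over the time of day; if you need finer resolution you could group by ToD as well.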

Related

filter by observation that cumulate X% of values

Within each group, after sorting value in decreasing order, I would like to keep the observations whose cumulative share of the group total is at most X%, in my case less than or equal to 80 percent.
So from this dataframe below:
library(dplyr)
Group <- c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","C")
value <- c(2,3,6,3,1,1,3,3,5,4,3,5,3,4,2)
data1 <- data.frame(Group, value)
data1 <- data1 %>%
  arrange(Group, desc(value)) %>%
  group_by(Group) %>%
  mutate(pct = round(100 * value / sum(value), 1)) %>%
  mutate(cumPct = cumsum(pct))
I would like to get the filtered dataframe below, according to the conditions I described above:
Group value pct cumPct
1 A 6 40.0 40.0
2 A 3 20.0 60.0
3 A 3 20.0 80.0
4 B 5 41.7 41.7
5 B 3 25.0 66.7
6 C 5 23.8 23.8
7 C 4 19.0 42.8
8 C 4 19.0 61.8
9 C 3 14.3 76.1
You can arrange the data in descending order of value, calculate pct and cum_pct within each Group, and keep the rows where cum_pct is less than or equal to 80.
library(dplyr)
data1 %>%
arrange(Group, desc(value)) %>%
group_by(Group) %>%
mutate(pct = value/sum(value) * 100,
cum_pct = cumsum(pct)) %>%
filter(cum_pct <= 80)
# Group value pct cum_pct
# <chr> <dbl> <dbl> <dbl>
#1 A 6 40 40
#2 A 3 20 60
#3 A 3 20 80
#4 B 5 41.7 41.7
#5 B 3 25 66.7
#6 C 5 23.8 23.8
#7 C 4 19.0 42.9
#8 C 4 19.0 61.9
#9 C 3 14.3 76.2
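Note that the question rounds pct before taking the cumulative sum, which is why its cumPct column shows 42.8 where the output above shows 42.9. If you would rather avoid percentages altogether, you can compare the cumulative share of the raw values against the cutoff. Here is a base R sketch of that idea; the names cum_share and kept are just ones I picked for illustration.

```r
Group <- c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","C")
value <- c(2,3,6,3,1,1,3,3,5,4,3,5,3,4,2)
data1 <- data.frame(Group, value)

# Sort by group, then by decreasing value within each group
data1 <- data1[order(data1$Group, -data1$value), ]

# Cumulative share of the group total, computed from the raw values
cum_share <- ave(data1$value, data1$Group,
                 FUN = function(v) cumsum(v) / sum(v))

# Keep the rows up to and including an 80% cumulative share
kept <- data1[cum_share <= 0.8, ]
print(kept)
```

This keeps the same nine rows as the dplyr version for the example data.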

Aggregating data with NA values based on site

I am using the EPA NLA dataset to find the average temperature in the epilimnion for some lake data. The data set looks like this:
SITE DEPTH METALIMNION TEMP.FIELD
1 0.0 NA 25.6
1 0.5 NA 25.1
1 0.8 T 24.9
1 1.0 NA 24.1
1 2.0 B 23.0
2 0.0 NA 29.0
2 0.5 T 28.0
"T" indicates the end of the epilimnion, and I want to average all temperature values at and above the "T" for each site. I have no idea where to even begin. (The "B" is irrelevant for this issue).
Thanks!
With base R you can do it like this.
I use ave twice. The first call determines where column METALIMNION has a "T" within each SITE and builds a grouping vector g.
The second call averages TEMP.FIELD by SITE and that vector g.
g <- with(NLA, ave(as.character(METALIMNION), SITE,
FUN = function(x) {
x[is.na(x)] <- ""
rev(cumsum(rev(x) == "T"))
}))
NLA$AVG <- ave(NLA$TEMP.FIELD, NLA$SITE, g)
NLA
# SITE DEPTH METALIMNION TEMP.FIELD AVG
#1 1 0.0 <NA> 25.6 25.20
#2 1 0.5 <NA> 25.1 25.20
#3 1 0.8 T 24.9 25.20
#4 1 1.0 <NA> 24.1 23.55
#5 1 2.0 B 23.0 23.55
#6 2 0.0 <NA> 29.0 28.50
#7 2 0.5 T 28.0 28.50
Assuming that there is only one 'T' for each value of site, using dplyr package:
library(dplyr)
data.frame(SITE=c(1,1,1,1,1,2,2),TEMP=c(25.6,25.1,24.9,24.1,23.0,29.0,28.0)) %>%
group_by(SITE) %>%
summarise(meanTemp=mean(TEMP))
Result:
# A tibble: 2 x 2
SITE meanTemp
<dbl> <dbl>
1 1 24.5
2 2 28.5
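Note that the summary above averages every depth at each site (24.5 for site 1 is the mean of all five readings). If only the rows down to and including the "T" should count, one way is to drop everything after the first "T" per site before averaging. This is a sketch only, assuming rows are ordered by increasing depth within each site; the helper names is_T, t_before, epi and res are made up for illustration.

```r
NLA <- data.frame(
  SITE        = c(1, 1, 1, 1, 1, 2, 2),
  DEPTH       = c(0.0, 0.5, 0.8, 1.0, 2.0, 0.0, 0.5),
  METALIMNION = c(NA, NA, "T", NA, "B", NA, "T"),
  TEMP.FIELD  = c(25.6, 25.1, 24.9, 24.1, 23.0, 29.0, 28.0)
)

# TRUE where the row carries the "T" marker (NA counts as FALSE)
is_T <- !is.na(NLA$METALIMNION) & NLA$METALIMNION == "T"

# Number of "T" markers seen strictly before each row, within its site
t_before <- ave(as.integer(is_T), NLA$SITE,
                FUN = function(x) cumsum(x) - x)

# Rows at or above the first "T"; a site without any "T" keeps all its rows
epi <- NLA[t_before == 0, ]

res <- aggregate(TEMP.FIELD ~ SITE, data = epi, FUN = mean)
print(res)
```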

boxplot doesn't show all the parameter in R

I wrote this code to run an ANOVA on a simple data frame, and I want to draw a boxplot from it:
DF <- read.table('chromium.txt',header=TRUE)
Chromium.aov <- aov(Concentration ~ Lab,data=DF)
print(summary(Chromium.aov))
with(DF,boxplot(Concentration,Lab))
Here is the text file:
Lab Concentration
1 26.1
1 21.5
1 22.0
1 22.6
1 24.9
1 22.6
1 23.8
1 23.2
2 18.3
2 19.7
2 18.0
2 17.4
2 22.6
2 11.6
2 11.0
2 15.7
3 19.1
3 13.9
3 15.7
3 18.6
3 19.1
3 16.8
3 25.5
3 19.7
4 30.7
However, R only shows 2 box plots, labelled 1 and 2, and nothing for labs 3 and 4. How can I fix this?
boxplot(DF$Concentration ~ DF$Lab)
The syntax you used makes one box with all the values of 'Concentration' and another with the values of 'Lab'.
When you do with(DF,boxplot(Concentration,Lab)), you are providing two sets of values to be plotted: Concentration and Lab. You want to split Concentration by the unique values of Lab and then create the boxplot.
boxplot(split(DF$Concentration, DF$Lab))
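One related thing to check: read.table will read Lab as an integer, so aov(Concentration ~ Lab) treats it as a numeric predictor with a single degree of freedom rather than as a grouping variable. I am assuming you want a one-way ANOVA across labs, in which case converting Lab to a factor first is probably what you want. A sketch using only the rows shown in the question (typed in by hand here instead of read from chromium.txt):

```r
# Rows copied from the question's text file
DF <- data.frame(
  Lab = rep(1:4, times = c(8, 8, 8, 1)),
  Concentration = c(26.1, 21.5, 22.0, 22.6, 24.9, 22.6, 23.8, 23.2,
                    18.3, 19.7, 18.0, 17.4, 22.6, 11.6, 11.0, 15.7,
                    19.1, 13.9, 15.7, 18.6, 19.1, 16.8, 25.5, 19.7,
                    30.7)
)

# Treat Lab as a grouping variable, not a number
DF$Lab <- factor(DF$Lab)

Chromium.aov <- aov(Concentration ~ Lab, data = DF)
print(summary(Chromium.aov))

# Formula interface: one box per level of Lab
bp <- boxplot(Concentration ~ Lab, data = DF,
              xlab = "Lab", ylab = "Concentration")
```

With the formula interface and data = DF you also avoid repeating DF$ for each column.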

R: add a column for the day of value

I have the following code to select days (24h) when both the maximum and minimum temperatures are high (at or above their respective 90th percentiles). The code calculates the length of each individual event and the highest mean temperature recorded during each event.
setDT(df)
df[, hotday := +(df$MAX>=(quantile(df$MAX,.90, na.rm = T, type = 6)) & df$MIN>=(quantile(df$MIN,.90, na.rm = T, type = 6)))
] [, hw.length := with(rle(hotday), rep(lengths,lengths))
] [hotday==0, hw.length:=0][!!hotday, Highest_Mean := max(MEAN) , rleid(hw.length)][]
The result of the code looks like this:
> head(df)
YEAR MONTH DAY Date MEAN MAX MIN D hotday hw.length Highest_Mean
1: 1991 5 14 5/14/1991 32.2 41.0 23.6 17.4 1 3 34.9
2: 1991 5 15 5/15/1991 34.9 43.3 26.0 17.3 1 3 34.9
3: 1991 5 16 5/16/1991 31.4 39.2 23.6 15.6 1 3 34.9
4: 1994 5 27 5/27/1994 30.7 41.0 23.0 18.0 1 2 30.7
5: 1994 5 28 5/28/1994 30.6 39.4 23.4 16.0 1 2 30.7
The first event lasted 3 days and the highest mean was 34.9, but the code does not tell on which day that was recorded (was it on the first, second or third day of the event?).
How can I add a column that gives that information along with the event length (non-duplicated values, one per event)? Something like this:
YEAR MONTH DAY Date MEAN MAX MIN D hotday hw.length Highest_Mean mean.day.max.length
1: 1991 5 14 5/14/1991 32.2 41.0 23.6 17.4 1 3 34.9
2: 1991 5 15 5/15/1991 34.9 43.3 26.0 17.3 1 3 34.9 2-3
3: 1991 5 16 5/16/1991 31.4 39.2 23.6 15.6 1 3 34.9
You would be better off adding a unique identifying code for each heat wave event and then indexing on that, but this solution will work (as long as two events do not have the exact same length and max mean temperature):
hottest_day = which(df$MEAN == df$Highest_Mean)
df$mean.day.max.length=""
for(i in hottest_day){
subset=df[(which(df$hw.length==df$hw.length[i] & df$Highest_Mean==df$Highest_Mean[i])),]
df$mean.day.max.length[i]=paste0(which(subset$MEAN==df$Highest_Mean[i]),"-",df$hw.length[i])
}
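For what it's worth, the unique-event-ID approach mentioned above could look something like the sketch below. It is base R only and uses a cut-down toy table: the hot-day rows come from the question, while the two non-hot rows (5/13/1991 and 5/17/1991) and the column name event are invented for illustration.

```r
df <- data.frame(
  Date   = c("5/13/1991", "5/14/1991", "5/15/1991", "5/16/1991",
             "5/17/1991", "5/27/1994", "5/28/1994"),
  MEAN   = c(28.0, 32.2, 34.9, 31.4, 27.5, 30.7, 30.6),
  hotday = c(0, 1, 1, 1, 0, 1, 1)
)

# Give every run of consecutive equal hotday values its own ID
runs <- rle(df$hotday)
df$event <- rep(seq_along(runs$lengths), runs$lengths)

df$mean.day.max.length <- ""
for (ev in unique(df$event[df$hotday == 1])) {
  rows <- which(df$event == ev)        # rows belonging to this heat wave
  peak <- which.max(df$MEAN[rows])     # position of the hottest day in the event
  df$mean.day.max.length[rows[peak]] <- paste0(peak, "-", length(rows))
}
print(df)
```

Because each event has its own ID, two events with the same length and the same highest mean no longer get mixed up. which.max() picks the first day if the maximum is tied within an event.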

Reshaping a data frame with more than one measure variable

I'm using a data frame similar to this one:
df<-data.frame(student=c(rep(1,5),rep(2,5)), month=c(1:5,1:5),
quiz1p1=seq(20,20.9,0.1),quiz1p2=seq(30,30.9,0.1),
quiz2p1=seq(80,80.9,0.1),quiz2p2=seq(90,90.9,0.1))
print(df)
student month quiz1p1 quiz1p2 quiz2p1 quiz2p2
1 1 1 20.0 30.0 80.0 90.0
2 1 2 20.1 30.1 80.1 90.1
3 1 3 20.2 30.2 80.2 90.2
4 1 4 20.3 30.3 80.3 90.3
5 1 5 20.4 30.4 80.4 90.4
6 2 1 20.5 30.5 80.5 90.5
7 2 2 20.6 30.6 80.6 90.6
8 2 3 20.7 30.7 80.7 90.7
9 2 4 20.8 30.8 80.8 90.8
10 2 5 20.9 30.9 80.9 90.9
It describes the grades received by students over five months, in two quizzes divided into two parts each.
I need to get the two quizzes into separate rows, so that each student in each month has two rows (one for each quiz) and two columns (one for each part of the quiz).
When I melt the table:
dfL <- melt.data.frame(df, c("student", "month"))
I get the two parts of the quiz in separate rows too.
dcast(dfL,student+month~variable)
of course gets me right back where I started, and I can't find a way to cast the table back into the required form.
Is there a way to make the melt command function something like:
melt.data.frame(df, measure.var1=c("quiz1p1","quiz2p1"),
measure.var2=c("quiz1p2","quiz2p2"))
Here's how you could do this with reshape(), from base R:
df2 <- reshape(df, direction="long",
idvar = 1:2, varying = list(c(3,5), c(4,6)),
v.names = c("p1", "p2"), times = c("quiz1", "quiz2"))
## Checking the output
rbind(head(df2, 3), tail(df2, 3))
# student month time p1 p2
# 1.1.quiz1 1 1 quiz1 20.0 30.0
# 1.2.quiz1 1 2 quiz1 20.1 30.1
# 1.3.quiz1 1 3 quiz1 20.2 30.2
# 2.3.quiz2 2 3 quiz2 80.7 90.7
# 2.4.quiz2 2 4 quiz2 80.8 90.8
# 2.5.quiz2 2 5 quiz2 80.9 90.9
You can also use column names (instead of column numbers) for idvar and varying. It's more verbose, but seems like better practice to me:
## The same operation as above, using just column *names*
df2 <- reshape(df, direction="long", idvar=c("student", "month"),
varying = list(c("quiz1p1", "quiz2p1"),
c("quiz1p2", "quiz2p2")),
v.names = c("p1", "p2"), times = c("quiz1", "quiz2"))
I think this does what you want:
#Break variable into two columns, one for the quiz and one for the part of the quiz
dfL <- transform(dfL, quiz = substr(variable, 1,5),
part = substr(variable, 6,7))
#Adjust your dcast call:
dcast(dfL, student + month + quiz ~ part)
#-----
student month quiz p1 p2
1 1 1 quiz1 20.0 30.0
2 1 1 quiz2 80.0 90.0
3 1 2 quiz1 20.1 30.1
...
18 2 4 quiz2 80.8 90.8
19 2 5 quiz1 20.9 30.9
20 2 5 quiz2 80.9 90.9
There was a very similar question asked about half a year ago, in which I wrote the following function:
melt.wide = function(data, id.vars, new.names) {
require(reshape2)
require(stringr)
data.melt = melt(data, id.vars=id.vars)
new.vars = data.frame(do.call(
rbind, str_extract_all(data.melt$variable, "[0-9]+")))
names(new.vars) = new.names
cbind(data.melt, new.vars)
}
You can use the function to "melt" your data as follows:
dfL <-melt.wide(df, id.vars=1:2, new.names=c("Quiz", "Part"))
head(dfL)
# student month variable value Quiz Part
# 1 1 1 quiz1p1 20.0 1 1
# 2 1 2 quiz1p1 20.1 1 1
# 3 1 3 quiz1p1 20.2 1 1
# 4 1 4 quiz1p1 20.3 1 1
# 5 1 5 quiz1p1 20.4 1 1
# 6 2 1 quiz1p1 20.5 1 1
tail(dfL)
# student month variable value Quiz Part
# 35 1 5 quiz2p2 90.4 2 2
# 36 2 1 quiz2p2 90.5 2 2
# 37 2 2 quiz2p2 90.6 2 2
# 38 2 3 quiz2p2 90.7 2 2
# 39 2 4 quiz2p2 90.8 2 2
# 40 2 5 quiz2p2 90.9 2 2
Once the data are in this form, you can much more easily use dcast() to get whatever form you desire. For example
head(dcast(dfL, student + month + Quiz ~ Part))
# student month Quiz 1 2
# 1 1 1 1 20.0 30.0
# 2 1 1 2 80.0 90.0
# 3 1 2 1 20.1 30.1
# 4 1 2 2 80.1 90.1
# 5 1 3 1 20.2 30.2
# 6 1 3 2 80.2 90.2
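If the tidyr package is an option (it is not used anywhere above, so treat this as a side suggestion), pivot_longer() can split the column names and keep the parts as separate columns in one step. A minimal sketch, assuming every measure column follows the quiz<digit>p<digit> naming pattern; the name long is arbitrary.

```r
library(tidyr)

df <- data.frame(student = c(rep(1, 5), rep(2, 5)), month = c(1:5, 1:5),
                 quiz1p1 = seq(20, 20.9, 0.1), quiz1p2 = seq(30, 30.9, 0.1),
                 quiz2p1 = seq(80, 80.9, 0.1), quiz2p2 = seq(90, 90.9, 0.1))

# The first capture group becomes the "quiz" column; the second one (".value")
# supplies the names of the output columns, here p1 and p2
long <- pivot_longer(df,
                     cols = starts_with("quiz"),
                     names_to = c("quiz", ".value"),
                     names_pattern = "(quiz\\d)(p\\d)")
print(head(long))
```

Each student and month then has two rows, one per quiz, with p1 and p2 as columns.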
