Aggregating data with NA values based on site - r

I am using the EPA NLA dataset to find the average temperature in the epilimnion for some lake data. The data set looks like this:
SITE DEPTH METALIMNION TEMP.FIELD
1 0.0 NA 25.6
1 0.5 NA 25.1
1 0.8 T 24.9
1 1.0 NA 24.1
1 2.0 B 23.0
2 0.0 NA 29.0
2 0.5 T 28.0
"T" indicates the end of the epilimnion, and I want to average all temperature values at and above the "T" row for each site. I have no idea where to even begin. (The "B" is irrelevant for this issue.)
Thanks!

With base R you can do it like this.
I use ave twice. The first call determines, within each SITE, which rows sit at or above the "T" in column METALIMNION. The result is the vector g.
The second call averages TEMP.FIELD by SITE and that vector g.
g <- with(NLA, ave(as.character(METALIMNION), SITE,
                   FUN = function(x) {
                     x[is.na(x)] <- ""
                     rev(cumsum(rev(x) == "T"))
                   }))
NLA$AVG <- ave(NLA$TEMP.FIELD, NLA$SITE, g)
NLA
# SITE DEPTH METALIMNION TEMP.FIELD AVG
#1 1 0.0 <NA> 25.6 25.20
#2 1 0.5 <NA> 25.1 25.20
#3 1 0.8 T 24.9 25.20
#4 1 1.0 <NA> 24.1 23.55
#5 1 2.0 B 23.0 23.55
#6 2 0.0 <NA> 29.0 28.50
#7 2 0.5 T 28.0 28.50
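To see why the reversed cumulative sum marks the epilimnion rows, here is a small base R sketch on the sample data. The data frame is rebuilt by hand from the question, the helper name at_or_above_T is made up for illustration, and the sketch assumes rows are ordered by depth within each site and that every site has at least one "T".

```r
NLA <- data.frame(
  SITE        = c(1, 1, 1, 1, 1, 2, 2),
  DEPTH       = c(0.0, 0.5, 0.8, 1.0, 2.0, 0.0, 0.5),
  METALIMNION = c(NA, NA, "T", NA, "B", NA, "T"),
  TEMP.FIELD  = c(25.6, 25.1, 24.9, 24.1, 23.0, 29.0, 28.0)
)

# Hypothetical helper: counting "T" rows from the bottom up gives a positive
# value for every row at or above the last "T" and 0 for the rows below it.
at_or_above_T <- function(is_T) rev(cumsum(rev(is_T)))

is_T  <- as.numeric(NLA$METALIMNION %in% "T")   # %in% treats NA as "not T"
above <- ave(is_T, NLA$SITE, FUN = at_or_above_T) > 0

# One mean per site, using only the rows at or above "T"
epi_mean <- aggregate(TEMP.FIELD ~ SITE, data = NLA[above, ], FUN = mean)
print(epi_mean)
```

Unlike the ave version above, this returns one row per site instead of adding a column to every row.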

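Assuming that there is only one "T" for each value of SITE, using the dplyr package:
library(dplyr)
data.frame(SITE = c(1, 1, 1, 1, 1, 2, 2),
           METALIMNION = c(NA, NA, "T", NA, "B", NA, "T"),
           TEMP = c(25.6, 25.1, 24.9, 24.1, 23.0, 29.0, 28.0)) %>%
  group_by(SITE) %>%
  # keep only the rows up to and including the "T" row of each site
  filter(row_number() <= match("T", METALIMNION)) %>%
  summarise(meanTemp = mean(TEMP))
Result:
# A tibble: 2 x 2
SITE meanTemp
<dbl> <dbl>
1 1 25.2
2 2 28.5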

Related

filter by observation that cumulate X% of values

Within each group, after sorting the values in decreasing order, I would like to keep only the observations whose cumulative share of the group total is at most X%. In my case X is 80.
So from this data frame:
library(dplyr)
Group <- c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","C")
value <- c(2,3,6,3,1,1,3,3,5,4,3,5,3,4,2)
data1 <- data.frame(Group, value)
data1 <- data1 %>%
  arrange(Group, desc(value)) %>%
  group_by(Group) %>%
  mutate(pct = round(100 * value / sum(value), 1)) %>%
  mutate(cumPct = cumsum(pct))
I would like to get the filtered data frame below, according to the conditions I described above:
Group value pct cumPct
1 A 6 40.0 40.0
2 A 3 20.0 60.0
3 A 3 20.0 80.0
4 B 5 41.7 41.7
5 B 3 25.0 66.7
6 C 5 23.8 23.8
7 C 4 19.0 42.8
8 C 4 19.0 61.8
9 C 3 14.3 76.1
You can arrange the data in descending order of value, calculate pct and cum_pct for each Group, and select the rows where cum_pct is less than or equal to 80.
library(dplyr)
data1 %>%
  arrange(Group, desc(value)) %>%
  group_by(Group) %>%
  mutate(pct = value/sum(value) * 100,
         cum_pct = cumsum(pct)) %>%
  filter(cum_pct <= 80)
# Group value pct cum_pct
# <chr> <dbl> <dbl> <dbl>
#1 A 6 40 40
#2 A 3 20 60
#3 A 3 20 80
#4 B 5 41.7 41.7
#5 B 3 25 66.7
#6 C 5 23.8 23.8
#7 C 4 19.0 42.9
#8 C 4 19.0 61.9
#9 C 3 14.3 76.2
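The same steps can also be sketched in base R with the threshold as a parameter, which matches the "X%" in the question. The function name top_share and its threshold argument are made up for illustration, and the data are rebuilt from the question.

```r
Group <- c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","C")
value <- c(2, 3, 6, 3, 1, 1, 3, 3, 5, 4, 3, 5, 3, 4, 2)
data1 <- data.frame(Group, value)

# Hypothetical helper: per group, keep the largest values whose cumulative
# share of the group total stays at or below `threshold` percent.
top_share <- function(d, threshold = 80) {
  d <- d[order(d$Group, -d$value), ]                          # sort descending within group
  d$pct <- 100 * d$value / ave(d$value, d$Group, FUN = sum)   # share of group total
  d$cum_pct <- ave(d$pct, d$Group, FUN = cumsum)              # running share
  d[d$cum_pct <= threshold, ]
}

res <- top_share(data1, threshold = 80)
print(res)
```

Because the comparison is done on unrounded percentages, a value sitting exactly on the threshold can be sensitive to floating-point error. Comparing against threshold plus a small tolerance is one way to guard against that.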

How to calculate group of sequent nonzero rows in R using 1 row above and 1 row after that group?

I want to create another data frame (df) that lists only events, where an event is a run of consecutive rows with XX greater than zero plus the one row before and the one row after that run. For the data below there should be 4 events in df(XX, YY). Column XX should be the sum of XX over the rows of the event, and column YY should be the max minus the min of YY over the same rows.
XX YY
1 3.0 23.6
2 0.0 23.2
3 0.0 23.7
4 0.0 25.2
5 1.3 24.5
6 4.8 24.2
7 0.2 23.1
8 0.0 23.3
9 0.0 23.9
10 0.0 24.3
11 1.8 24.6
12 3.2 23.7
13 0.0 23.2
14 0.0 23.6
15 0.0 24.1
16 0.2 24.5
17 4.8 24.1
18 3.7 22.1
19 0.0 23.4
20 0.0 23.8
From my table, I would like to get the following results.
Event 1. XX[1] = sum(row1,row2) ; YY[1] = [Max(row1,row2)- Min(row1,row2)]
XX[1]=3, YY[1]=0.4
Event 2. XX[2] = sum(row4,row5,row6,row7,row8) ; YY[2] = [Max(row4,row5,row6,row7,row8)- Min(row4,row5,row6,row7,row8)]
XX[2]=6.3, YY[2]=2.1
Event 3. XX[3] = sum(row10,row11,row12,row13) ; YY[3] = [Max(row10,row11,row12,row13)- Min(row10,row11,row12,row13)]
XX[3]=5, YY[3]=1.4
Event 4. XX[4] = sum(row15,row16,row17,row18,row19) ; YY[4] = [Max(row15,row16,row17,row18,row19)- Min(row15,row16,row17,row18,row19)]
XX[4]=8.7, YY[4]=2.4
XX YY
1 3 0.4
2 6.3 2.1
3 5 1.4
4 8.7 2.4
Method 1 in base R
Split the original data.frame into a list.
lst <- split(df, c(rep(1, 2), 2, rep(3, 5), 4, rep(5, 4), 6, rep(7, 5), 8))
lst <- lst[sapply(lst, function(x) nrow(x) > 1)]
names(lst) <- NULL
Note that this is exactly the same as your original data, with the only difference that relevant rows are grouped into separate data.frames, and irrelevant rows (row3, row9, row14, row20) have been removed.
Next define a custom function
# Define a custom function that returns
# sum(column XX) and max(column YY) - min(column YY)
calc_summary_stats <- function(df) {
  c(sum(df$XX), max(df$YY) - min(df$YY))
}
Apply the function to your list elements using sapply to get your expected outcome.
# Apply the function to the list of dataframes
m <- t(sapply(lst, calc_summary_stats))
colnames(m) <- c("XX", "YY")
m
# XX YY
#[1,] 3.0 0.4
#[2,] 6.3 2.1
#[3,] 5.0 1.4
#[4,] 8.7 2.4
Method 2 using tidyverse
Using dplyr, we can first add an idx column by which we group the data; then filter the groups with >1 row, calculate the two summary statistics for every group, and output the ungrouped data with the idx column removed.
library(tidyverse)
df %>%
  mutate(idx = c(rep(1, 2), 2, rep(3, 5), 4, rep(5, 4), 6, rep(7, 5), 8)) %>%
  group_by(idx) %>%
  filter(n() > 1) %>%
  summarise(XX = sum(XX), YY = max(YY) - min(YY)) %>%
  ungroup() %>%
  select(-idx)
## A tibble: 4 x 2
# XX YY
# <dbl> <dbl>
#1 3.00 0.400
#2 6.30 2.10
#3 5.00 1.40
#4 8.70 2.40
Sample data
df <- read.table(text =
"XX YY
1 3.0 23.6
2 0.0 23.2
3 0.0 23.7
4 0.0 25.2
5 1.3 24.5
6 4.8 24.2
7 0.2 23.1
8 0.0 23.3
9 0.0 23.9
10 0.0 24.3
11 1.8 24.6
12 3.2 23.7
13 0.0 23.2
14 0.0 23.6
15 0.0 24.1
16 0.2 24.5
17 4.8 24.1
18 3.7 22.1
19 0.0 23.4
20 0.0 23.8", header = T)
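Both methods above hard-code the grouping vector for this particular table. As a rough sketch of how the groups could be derived from the data instead, the base R code below keeps every nonzero row plus its immediate neighbours and numbers the resulting runs. The variable names are made up for illustration, and the sketch assumes events are separated by at least three zero rows. With fewer, the padding rows of neighbouring events touch and the events are merged into one.

```r
df <- data.frame(
  XX = c(3.0, 0, 0, 0, 1.3, 4.8, 0.2, 0, 0, 0,
         1.8, 3.2, 0, 0, 0, 0.2, 4.8, 3.7, 0, 0),
  YY = c(23.6, 23.2, 23.7, 25.2, 24.5, 24.2, 23.1, 23.3, 23.9, 24.3,
         24.6, 23.7, 23.2, 23.6, 24.1, 24.5, 24.1, 22.1, 23.4, 23.8)
)

nz      <- df$XX > 0
prev_nz <- c(FALSE, head(nz, -1))   # was the previous row nonzero?
next_nz <- c(tail(nz, -1), FALSE)   # is the next row nonzero?
keep    <- nz | prev_nz | next_nz   # nonzero rows plus one row on each side

# A new event starts wherever a kept row follows a dropped row (or the top).
starts <- keep & !c(FALSE, head(keep, -1))
event  <- cumsum(starts)[keep]

ev  <- df[keep, ]
res <- data.frame(
  XX = as.vector(tapply(ev$XX, event, sum)),
  YY = as.vector(tapply(ev$YY, event, function(y) max(y) - min(y)))
)
print(res)
```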

Spline interpolation R with conditions

I have a very large data set, structured as the sample below.
I have been trying to use the na.spline function in order to
1) identify the "fips" category with missing Yield.
2) if fewer than 3 Yield values are NA per fips (here fips 1 and 3), the spline function should kick in and fill in the NA.
3) If 3 or more Yields are NA for a "fips" the code should remove the entire "fips" subset, in this case fips 2 should be removed.
My code so far:
finX <- dataset
finxx <- transform(subset(finX, ave(na.spline(finX$Yield), fips, FUN=sum)<2))
#or
finxx <- transform(subset(finX, ave(is.na(finX$Yield), fips, FUN=sum)<2))
Year fips Max Min Rain Yield
1980 1 24.7 0.0 71 37
1981 1 22.8 0.0 62 40
1982 1 22.6 0.0 47 37
1983 1 24.2 0.0 51 39
1984 1 23.8 0.0 61 47
1985 1 25.1 0.0 67 43
1980 2 24.8 0.0 72 34
1981 2 23.2 0.4 54 **NA**
1982 2 25.3 0.1 83 55
1983 2 23.0 0.0 68 **NA**
1984 2 22.4 0.7 70 **NA**
1985 2 24.6 0.0 47 31
1980 3 25.5 0.0 51 31
1981 3 25.5 0.0 51 31
1982 3 25.5 0.0 51 31
1983 3 25.5 0.0 51 **NA**
1984 3 25.5 0.0 51 31
...
Currently the code above either does not fill in all the NAs in the final product, or produces no result at all.
Any guidance would be very useful, thank you.
Yield needs to be converted from character to numeric or NA. Then use by to divide finX into separate data frames by fips value. For each data frame with less than 3 NA's, do the spline interpolation. Those with 3 or greater are returned as NULL. Combine the list of returned data frames into single data frame. Code would look like:
library(zoo)
# convert finX$Yield values from character to either numeric or NA
finX$Yield <- sapply(finX$Yield, function(x) if(x =="**NA**") NA_real_ else as.numeric(x))
# use spline interpolation on fips sets with less than 3 NA's
finxx <- by(finX, finX$fips, function(x) if(sum(is.na(x$Yield)) < 3) transform(x, Yield=na.spline(object=Yield, x=Year)) )
# combine results into a single data frame
finxx <- do.call(rbind, finxx)
Alternatively after the conversion to numeric values, you could use ave on the Yield column where spline interpolation returns values on fips sets with less than 3 NA's and all NA's on any other sets. All rows with any NA's in the final result would then be deleted. Code is as follows:
finxx2 <- transform(finX, Yield=ave(Yield, fips, FUN=function(x) if(sum(is.na(x)) < 3) na.spline(object=x) else NA))
finxx2 <- na.omit(finxx2)
Both versions give the same result for the sample data, but the first version using by lets you work with the full data frame for each fips set rather than with Yield alone. Here that allows Year to be passed as the x values for the spline, so a fips set with a missing Year would still be interpolated correctly. The ave version treats the observations as evenly spaced and would give an incorrect answer in that case, so the by version seems more robust.
There's also the dplyr version which is very much like the by version above and gives the same answer as the base R versions. If you're OK with working with dplyr, this is probably the most straightforward and robust approach.
library(dplyr)
finxx3 <- finX %>%
  group_by(fips) %>%
  filter(sum(is.na(Yield)) < 3) %>%
  mutate(Yield = na.spline(object = Yield, x = Year))
The first version returns
Year fips Max Min Rain Yield
1.1 1980 1 24.7 0 71 37
1.2 1981 1 22.8 0 62 40
1.3 1982 1 22.6 0 47 37
1.4 1983 1 24.2 0 51 39
1.5 1984 1 23.8 0 61 47
1.6 1985 1 25.1 0 67 43
3.13 1980 3 25.5 0 51 31
3.14 1981 3 25.5 0 51 31
3.15 1982 3 25.5 0 51 31
3.16 1983 3 25.5 0 51 31
3.17 1984 3 25.5 0 51 31
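For readers who want to try the drop-or-interpolate logic without installing zoo, here is a rough base R sketch that swaps na.spline for stats::spline. This is a substitution, not the code from the answer. The helper name fill_group and its max_na argument are made up for illustration, and only the Year, fips and Yield columns of the sample are rebuilt.

```r
finX <- data.frame(
  Year  = c(1980:1985, 1980:1985, 1980:1984),
  fips  = rep(1:3, times = c(6, 6, 5)),
  Yield = c(37, 40, 37, 39, 47, 43,
            34, NA, 55, NA, NA, 31,
            31, 31, 31, NA, 31)
)

# Hypothetical helper: drop a fips set with `max_na` or more missing yields,
# otherwise fill the gaps with a cubic spline through the known points.
fill_group <- function(d, max_na = 3) {
  miss <- is.na(d$Yield)
  if (sum(miss) >= max_na) return(NULL)      # too many gaps: drop the set
  if (any(miss)) {
    d$Yield[miss] <- spline(d$Year[!miss], d$Yield[!miss],
                            xout = d$Year[miss])$y
  }
  d
}

# rbind silently skips the NULL returned for dropped sets
res <- do.call(rbind, lapply(split(finX, finX$fips), fill_group))
print(res)
```

As in the by version, Year is used as the x variable, so unevenly spaced years are handled by the spline itself.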

Time-series data visualization

I have a pretty large data frame in R stored in long form. It contains body temperature data collected from 40 different individuals, with 10 sec intervals, over 16 days. Individuals have been exposed to conditions (cond1 and cond2). It essentially looks like this:
ID Cond1 Cond2 Day ToD Temp
1 A B 1 18.0 37.1
1 A B 1 18.3 37.2
1 A B 2 18.6 37.5
2 B A 1 18.0 37.0
2 B A 1 18.3 36.9
2 B A 2 18.6 36.9
3 A A 1 18.0 36.8
3 A A 1 18.3 36.7
3 A A 2 18.6 36.7
...
I want to create four separate line plots, one for each combination of conditions (AB, BA, AA, BB), each showing mean temp over time (day 1-16).
p.s. ToD stands for time of day. Not sure if I need to provide it in order to create the plot.
So far I have tried to define the dataset as time series by doing
ts <- ts(data=dataset$Temp, start=1, end=16, frequency=8640)
plot(ts)
This returns a plot of Temp, but I can't figure out how to define condition values for breaking up the data.
Edit:
Essentially I want a plot like the linked example, but one for each group separately, and using mean Temp values. The linked plot is just for one individual in one condition, and I want one that shows the mean for all individuals in the same condition.
You can use summarise and group_by to group the data by condition and then plot it. Is this what you're looking for?
library(dplyr)
library(ggplot2)
## I created a dataframe df that looks like this:
ID Cond1 Cond2 Day ToD Temp
1 1 A B 1 18.0 37.1
2 1 A B 1 18.3 37.2
3 1 A B 2 18.6 37.5
4 2 B A 1 18.0 37.0
5 2 B A 1 18.3 36.9
6 2 B A 2 18.6 36.9
7 3 A A 1 18.0 36.8
8 3 A A 1 18.3 36.7
9 3 A A 2 18.6 36.7
df$Cond <- paste0(df$Cond1, df$Cond2)
d <- summarise(group_by(df, Cond, Day), t = mean(Temp))
ggplot(d, aes(Day, t, color = Cond)) + geom_line()
which results in a single plot of mean Temp by Day, with one colored line per condition combination.
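Since the question asks for a separate plot per condition combination, one possible variation is to facet the plot by Cond instead of coloring the lines. The sketch below rebuilds the nine sample rows by hand, uses base aggregate for the means so the summary runs without dplyr, and only draws the plot if ggplot2 is installed. The faceting is a suggested variation, not part of the original answer.

```r
df <- data.frame(
  ID    = rep(1:3, each = 3),
  Cond1 = rep(c("A", "B", "A"), each = 3),
  Cond2 = rep(c("B", "A", "A"), each = 3),
  Day   = rep(c(1, 1, 2), times = 3),
  ToD   = rep(c(18.0, 18.3, 18.6), times = 3),
  Temp  = c(37.1, 37.2, 37.5, 37.0, 36.9, 36.9, 36.8, 36.7, 36.7)
)
df$Cond <- paste0(df$Cond1, df$Cond2)

# Mean temperature per condition combination and day
d <- aggregate(Temp ~ Cond + Day, data = df, FUN = mean)
print(d)

# One panel per condition combination (skipped if ggplot2 is not installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(Day, Temp)) +
    geom_line() +
    geom_point() +
    facet_wrap(~ Cond)
  print(p)
}
```

With the full data set, every combination that occurs in Cond gets its own panel automatically.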

Partially transpose a dataframe in R

Given the following set of data:
transect <- c("B","N","C","D","H","J","E","L","I","I")
sampler <- c(rep("J",5),rep("W",5))
species <- c("ROB","HAW","HAW","ROB","PIG","HAW","PIG","PIG","HAW","HAW")
weight <- c(2.80,52.00,56.00,2.80,16.00,55.00,16.20,18.30,52.50,57.00)
wingspan <- c(13.9, 52.0, 57.0, 13.7, 11.0,52.5, 10.7, 11.1, 52.3, 55.1)
week <- c(1,2,3,4,5,6,7,8,9,9)
# Warning to R newbs: Really bad idea to use this code
ex <- as.data.frame(cbind(transect,sampler,species,weight,wingspan,week))
What I'm trying to achieve is to transpose the species and its associated information on weight and wingspan. For a better idea of the expected result please see below. My data set is about half a million lines long with approximately 200 different species so it will be a very large dataframe.
transect sampler week ROBweight HAWweight PIGweight ROBwingspan HAWwingspan PIGwingspan
1 B J 1 2.8 0.0 0.0 13.9 0.0 0.0
2 N J 2 0.0 52.0 0.0 0.0 52.0 0.0
3 C J 3 0.0 56.0 0.0 0.0 57.0 0.0
4 D J 4 2.8 0.0 0.0 13.7 0.0 0.0
5 H J 5 0.0 0.0 16.0 0.0 0.0 11.0
6 J W 6 0.0 55.0 0.0 0.0 52.5 0.0
7 E W 7 0.0 0.0 16.2 0.0 0.0 10.7
8 L W 8 0.0 0.0 18.3 0.0 0.0 11.1
9 I W 9 0.0 52.5 0.0 0.0 52.3 0.0
10 I W 9 0.0 57.0 0.0 0.0 55.1 0.0
The main problem is that you don't currently have unique "id" variables, which will create problems for the usual suspects of reshape and dcast.
Here's a solution. I've used getanID from my "splitstackshape" package, but it's pretty easy to create your own unique ID variable using many different methods.
library(splitstackshape)
library(reshape2)
idvars <- c("transect", "sampler", "week")
ex <- getanID(ex, id.vars=idvars)
From here, you have two options:
reshape from base R:
reshape(ex, direction = "wide",
idvar=c("transect", "sampler", "week", ".id"),
timevar="species")
melt and dcast from "reshape2"
First, melt your data into a "long" form.
exL <- melt(ex, id.vars=c(idvars, ".id", "species"))
Then, cast your data into a wide form.
dcast(exL, transect + sampler + week + .id ~ species + variable)
# transect sampler week .id HAW_weight HAW_wingspan PIG_weight PIG_wingspan ROB_weight ROB_wingspan
# 1 B J 1 1 NA NA NA NA 2.8 13.9
# 2 C J 3 1 56.0 57.0 NA NA NA NA
# 3 D J 4 1 NA NA NA NA 2.8 13.7
# 4 E W 7 1 NA NA 16.2 10.7 NA NA
# 5 H J 5 1 NA NA 16.0 11.0 NA NA
# 6 I W 9 1 52.5 52.3 NA NA NA NA
# 7 I W 9 2 57.0 55.1 NA NA NA NA
# 8 J W 6 1 55.0 52.5 NA NA NA NA
# 9 L W 8 1 NA NA 18.3 11.1 NA NA
# 10 N J 2 1 52.0 52.0 NA NA NA NA
A better option: "data.table"
Alternatively (and perhaps preferably), you can use the "data.table" package (at least version 1.8.11) as follows:
library(data.table)
library(reshape2) ## Also required here
packageVersion("data.table")
# [1] '1.8.11'
DT <- data.table(ex)
DT[, .id := sequence(.N), by = c("transect", "sampler", "week")]
DTL <- melt(DT, measure.vars=c("weight", "wingspan"))
dcast.data.table(DTL, transect + sampler + week + .id ~ species + variable)
# transect sampler week .id HAW_weight HAW_wingspan PIG_weight PIG_wingspan ROB_weight ROB_wingspan
# 1: B J 1 1 NA NA NA NA 2.8 13.9
# 2: C J 3 1 56.0 57.0 NA NA NA NA
# 3: D J 4 1 NA NA NA NA 2.8 13.7
# 4: E W 7 1 NA NA 16.2 10.7 NA NA
# 5: H J 5 1 NA NA 16.0 11.0 NA NA
# 6: I W 9 1 52.5 52.3 NA NA NA NA
# 7: I W 9 2 57.0 55.1 NA NA NA NA
# 8: J W 6 1 55.0 52.5 NA NA NA NA
# 9: L W 8 1 NA NA 18.3 11.1 NA NA
# 10: N J 2 1 52.0 52.0 NA NA NA NA
Add fill = 0 to either of the dcast versions to replace NA values with 0.
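For readers without splitstackshape, the unique-ID step can also be sketched in base R with ave and then fed to reshape. The data frame below is rebuilt with data.frame() directly so that weight, wingspan and week stay numeric. This is an illustration of one of the "many different methods" mentioned above, not the getanID implementation.

```r
ex <- data.frame(
  transect = c("B","N","C","D","H","J","E","L","I","I"),
  sampler  = c(rep("J", 5), rep("W", 5)),
  species  = c("ROB","HAW","HAW","ROB","PIG","HAW","PIG","PIG","HAW","HAW"),
  weight   = c(2.80, 52.00, 56.00, 2.80, 16.00, 55.00, 16.20, 18.30, 52.50, 57.00),
  wingspan = c(13.9, 52.0, 57.0, 13.7, 11.0, 52.5, 10.7, 11.1, 52.3, 55.1),
  week     = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 9)
)

# Number the rows within each transect/sampler/week combination (1, 2, ...)
ex$.id <- ave(seq_len(nrow(ex)), ex$transect, ex$sampler, ex$week,
              FUN = seq_along)

# Spread species into columns; names take the form weight.ROB, wingspan.HAW, ...
wide <- reshape(ex, direction = "wide",
                idvar = c("transect", "sampler", "week", ".id"),
                timevar = "species")
print(wide)
```

Cells for species not observed in a row come out as NA here and can be replaced afterwards, for example with wide[is.na(wide)] <- 0.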

Resources