Introduction
Summary:
Trying to average data by season (when necessary) when certain conditions are met.
Hello everyone.
I am currently working with numerous large data sets (>200 sets with >5000 rows each) of long-term time series data collection for multiple variables across different locations. So far, I've extracted data into separate CSV files per site and per station.
For the most part, the data reported per parameter is one instance per season.
Season here is defined ecologically as DJF, MAM, JJA, SON for months corresponding to Winter, Spring, Summer, and Fall respectively.
However, there are some cases where there were multiple readings during a seasonal event. Here, the parameter values and dates have to be averaged; this is before further analysis can take place on these data sets.
To complicate things even further, some of the data is marked with a Greater Than or Less Than (GTLT) symbol. In these cases, values and dates are not averaged unless the recorded values are the same.
Data Example
Summary:
Code and Tables show requested changes in data-set
So, for a data-driven example...
Here are a few rows from a data set.
Data.Example<-structure(list(
Station.ID = c(13402, 13402, 13402, 13402, 13402, 13402),
End.Date = structure(c(2L, 3L, 4L, 2L, 3L, 1L), .Label = c("10/13/2016", "7/13/2016", "8/13/2016", "8/15/2016"), class = "factor"),
Parameter.Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
GTLT = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "<"), class = "factor"),
Value = c(10, 10, 20, 30, 15, 10)),
.Names = c("Station.ID", "End.Date", "Parameter.Name","GTLT", "Value"), row.names = c(NA, -6L), class = "data.frame")
This is ideally what I would like as output
Data.Example.New<-structure(list(
Station.ID.new = c(13402, 13402, 13402, 13402),
End.Date.new = structure(c(2L, 3L, 2L, 1L), .Label = c("10/13/2016", "7/28/2016", "8/15/2016"), class = "factor"),
Parameter.Name.new = structure(c(2L, 2L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
GTLT.new = structure(c(2L, 2L, 1L, 1L), .Label = c("", "<"), class = "factor"),
Value.new = c(10, 20, 22.5, 10)),
.Names = c("Station.ID.new", "End.Date.new", "Parameter.Name.new", "GTLT.new", "Value.new"), row.names = c(NA, -4L), class = "data.frame")
Here, the following things are occurring:
For Enterococci measured in July and Aug 13, there is a GTLT symbol, but Value for both == 10. So average dates. New row is 7/28/2016 and Value 10.
While Enterococci on Aug 15 is within the same season as the other values, its Value (20) differs from theirs, and a row with a GTLT symbol is only averaged with rows from the same season of the same year that share the same Value. Since it is the only row where Value == 20, that row does not change and is repeated in the final data frame.
Alkalinity in July and August are same season, so average dates (7/28/16) and Value (22.5) in new row.
Alkalinity in October is different season, so keep row.
All other data (such as Station.ID and Parameter.Name) should just be copied since they shouldn't differ here.
If for some reason you have a GTLT and non-GTLT for same parameter:
End.Date   GTLT  Value  Parameter
7/13/2015  <     10     Alk
7/13/2016  <     10     Alk
8/13/2016        10     Alk
8/15/2016        20     Alk
Then final result would be
End.Date   GTLT  Value  Parameter
7/13/2015  <     10     Alk
7/13/2016  <     10     Alk
8/14/2016        15     Alk
Approach
Summary:
Define seasons and then aggregate using package like dplyr?
Create loop function to read row by row (after sort by Parameter.Name then Date?)
As one might expect, this is where I'm stuck.
I know seasons can be defined in R from prior Stack Q's:
New vector of seasons based on dates
And I know that average/aggregation packages such as dplyr (and possibly zoo?) can do chaining commands.
My issue is putting this thought process into code that can be repeated for each data set.
I'm not sure if that's the best approach (define seasons and then set conditions for averaging data), or if some sort of loop function would work here by going through row by row of the data set post-sort by Parameter.Name then End.Date.
I quickly sketched my thoughts on what some sort of loop function would have to include:
Rough idea of flow diagram
Note, you can't just average starting row [i] and [i+1] because [i+2], etc. might need averaged as well. Hence finding row [i+n] that breaks loop before last step, averaging all prior rows [i+n-1], and moving on to next new row [i+n].
Further, as clarification, readings are only averaged within the same season of the same annual cycle. So 7/13/2016 == 8/13/2016 for same season. 12/12/2015 == 01/01/2016 for same season. But 4/13/2016 != 4/13/2015 with regard to averaging.
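The grouping rule described here can be sketched as a small base R helper. This is only an illustration under stated assumptions: the function name season_key and the "DJF-2016" label format are made up for this sketch and are not part of the original data sets. December is pushed into the following year so that it lands in the same winter as the next January and February.

```r
# Hypothetical helper: label each date with its season and annual cycle.
season_key <- function(d) {
  m <- as.integer(format(d, "%m"))   # month number, 1 to 12
  y <- as.integer(format(d, "%Y"))   # calendar year
  season <- c("DJF", "DJF", "MAM", "MAM", "MAM", "JJA",
              "JJA", "JJA", "SON", "SON", "SON", "DJF")[m]
  # December belongs to the winter of the following year
  y <- ifelse(m == 12, y + 1L, y)
  paste(season, y, sep = "-")
}

d <- as.Date(c("7/13/2016", "8/13/2016", "12/12/2015", "1/1/2016",
               "4/13/2016", "4/13/2015"), format = "%m/%d/%Y")
season_key(d)
```

Rows that share a key (together with the same station and parameter) would be candidates for averaging; rows with different keys would be kept apart.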
Conclusion and Summary
In short, I need help designing code to average individual parameter time-series values by annual season with specific exceptions for multiple large data sets.
I'm not sure of the best approach in designing code to do this, whether it's a large loop function or a combination of code and specialized chaining-enabled packages.
Thank you for your time in advance.
Cheers,
soccernamlak
Using dplyr and lubridate I was able to come up with a solution. My output matches your example output, except I did not keep the exact dates, which I felt were misleading in the final result.
Data.Example<-structure(list(
Station.ID = c(13402, 13402, 13402, 13402, 13402, 13402),
End.Date = structure(c(2L, 3L, 4L, 2L, 3L, 1L), .Label = c("10/13/2016", "7/13/2016", "8/13/2016", "8/15/2016"), class = "factor"),
Parameter.Name = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Alkalinity", "Enterococci"), class = "factor"),
GTLT = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("", "<"), class = "factor"),
Value = c(10, 10, 20, 30, 15, 10)),
.Names = c("Station.ID", "End.Date", "Parameter.Name","GTLT", "Value"), row.names = c(NA, -6L), class = "data.frame")
# Create season key
seasons <- data.frame(month = 1:12, season = c(rep("DJF",2), rep("MAM", 3), rep("JJA", 3), rep("SON",3), "DJF"))
# Isolate Month and Year, create Season column
Data.Example$Month <- lubridate::month(as.Date((Data.Example$End.Date), "%m/%d/%Y"))
Data.Example$Year <- lubridate::year(as.Date((Data.Example$End.Date), "%m/%d/%Y"))
Data.Example$Season <- seasons$season[Data.Example$Month]
# Update 'year' where month = December so that it is grouped with Jan and Feb of following year
Data.Example$Year[Data.Example$Month == 12] <- Data.Example$Year[Data.Example$Month == 12]+1
# Find out which station/year/season/parameters have at least one record with a GTLT
library(dplyr)
GTLT.Test <- Data.Example %>%
group_by(Station.ID, Year, Season, Parameter.Name) %>%
summarize(has_GTLT = max(nchar(as.character(GTLT))))
# First only calculate averages for groups without any GTLT
Data.Example.New1 <- Data.Example %>%
anti_join(GTLT.Test[GTLT.Test$has_GTLT == 1,],
by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
group_by(Station.ID, Year, Season, Parameter.Name, GTLT) %>%
summarize(Value.new = mean(Value))
# Now do the same for groups with GTLT, only combining when values and GTLT symbols match.
Data.Example.New2 <- Data.Example %>%
anti_join(GTLT.Test[GTLT.Test$has_GTLT == 0,],
by = c("Station.ID", "Year", "Season", "Parameter.Name")) %>%
group_by(Station.ID, Year, Season, Parameter.Name, GTLT, Value) %>%
summarize(Value.new = mean(Value)) %>%
select(-Value)
# Combine both
Data.Example.New <- rbind(Data.Example.New1, Data.Example.New2)
EDIT: I just noticed you linked to another SO question for converting dates to seasons. Mine simply converts by month, not date, and does not use actual seasons. I did this because in your example, Dec. 12 matches with Jan. 1. December 12 is technically fall, so I assumed you weren't using actual seasons, but were instead using four three-month groupings.
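If the averaged date from the question's expected output is wanted after all, one possible sketch is to take the mean of the parsed dates, since base R can average Date values directly. The variable names here are made up for illustration; only the two dates come from the question.

```r
# Two readings from the same season and year (dates taken from the question)
dates <- as.Date(c("7/13/2016", "8/13/2016"), format = "%m/%d/%Y")

# mean() on a Date vector returns a Date (possibly with a fractional day)
avg_date <- mean(dates)

# Formatting drops the fractional part of the day
format(avg_date, "%m/%d/%Y")   # "07/28/2016"
```

In a grouped summary the same idea would presumably be an extra column, for example the mean of the parsed End.Date next to the mean of Value.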
For hundreds of matters, my data frame has daily text entries by dozens of timekeepers. Not every timekeeper enters time each day for each matter. Text entries can be any length. Each entry for a matter is for work done on a different day (but for my purposes, figuring out readability measures for the text, dates don't matter). What I would like to do is to combine for each matter all of its text entries.
Here is a toy data set and what it looks like:
> dput(df)
structure(list(Matter = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 4L, 4L), .Label = c("MatterA", "MatterB", "MatterC", "MatterD"
), class = "factor"), Timekeeper = structure(c(1L, 2L, 3L, 4L,
2L, 3L, 1L, 1L, 3L, 4L), .Label = c("Alpha", "Baker", "Charlie",
"Delta"), class = "factor"), Text = structure(c(5L, 8L, 1L, 3L,
7L, 6L, 9L, 2L, 10L, 4L), .Label = c("all", "all we have", "good men to come to",
"in these times that try men's souls", "Now is", "of", "the aid",
"the time for", "their country since", "to fear is fear itself"
), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
Dplyr groups the time records by matter, but I am stumped as to how to combine the text entries for each matter so that the result is along these lines -- all text gathered for a matter:
1 MatterA Now is the time for all good men to come to
5 MatterB the aid of their country since
8 MatterC all we have
9 MatterD to fear is fear itself in these times that try men's souls
dplyr::mutate() does not work with various concatenation functions:
textCombined <- df %>% group_by(Matter) %>% mutate(ComboText = str_c(Text))
textCombined2 <- df %>% group_by(Matter) %>% mutate(ComboText = paste(Text))
textCombined3 <- df %>% group_by(Matter) %>% mutate(ComboText = c(Text)) # creates numbers
Maybe a loop will do the job, as in "while the matter stays the same, combine the text" but I don't know how to write that. Or maybe dplyr has a conditional mutate, as in "mutate(while the matter stays the same, combine the text)."
Thank you for your help.
Hi, you can use group_by() and summarise() with paste() and its collapse argument:
> df %>% group_by(Matter) %>% summarise(line= paste(Text, collapse = " "))
# A tibble: 4 x 2
# Matter line
# <fct> <chr>
#1 MatterA Now is the time for all good men to come to
#2 MatterB the aid of their country since
#3 MatterC all we have
#4 MatterD to fear is fear itself in these times that try men's souls
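The reason the mutate() attempts in the question did not combine anything is that paste() without collapse works element by element. A minimal base R sketch of the difference, using the MatterA text from the toy data (the variable name x is just for illustration):

```r
x <- c("Now is", "the time for", "all", "good men to come to")

# Without collapse: still one string per input element
paste(x)
length(paste(x))          # 4

# With collapse: the elements are joined into a single string
paste(x, collapse = " ")  # "Now is the time for all good men to come to"
```

summarise() then keeps that single string as the one row for each group.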
Not sure if someone has answered this - I have searched, but so far nothing has worked for me. I have a very large dataset that I am trying to narrow. I need to combine three factors in my "PROG" variable ("Grad.2","Grad.3","Grad.H") so that they become a single variable ("Grad") where the dependent variable ("NUMBER") of each comparable set of values is summed.
ie.
YEAR = "92/93" AGE = "20-24" PROG = "Grad.2" NUMBER = "50"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.3" NUMBER = "25"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.H" NUMBER = "2"
turns into
YEAR = "92/93" AGE = "20-24" PROG = "Grad" NUMBER = "77"
I want to then drop all other factors for PROG so that I can compare the enrollment rates for Grad without worrying about the other factors (which I deal with separately). So my active independent variables are YEAR and AGE, while the dependent variable is NUMBER.
I hope this shows my data adequately:
structure(list
(YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("92/93", "93/94", "94/95", "95/96", "96/97",
"97/98", "98/99", "99/00", "00/01", "01/02", "02/03", "03/04",
"04/05", "05/06", "06/07", "07/08", "08/09", "09/10", "10/11",
"11/12", "12/13", "13/14", "14/15", "15/16"), class = "factor"),
AGE = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L), .Label = c("1-19",
"20-24", "25-30", "31-34", "35-39", "40+", "NR", "T.Age"), class = c("ordered",
"factor")),
PROG = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
19L, 19L, 19L), .Label = c("T.Prog", "Basic", "Career", "Grad.H",
"Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual", "Health.Res",
"NoProg.Grad", "NoProg.Other", "NoProg.Und.Grad", "NoProg.NoCred",
"Other", "Post.Und.Grad", "Post.Career", "Pre-U", "Career.Qual",
"Und.Grad", "Und.Grad.Qual"), class = "factor"),
NUMBER = c(104997L,
347235L, 112644L, 38838L, 35949L, 50598L, 5484L, 104991L,
333807L, 76692L)), row.names = c(7936L, 7948L, 7960L, 7972L,
7984L, 7996L, 8008L, 10459L, 10471L, 10483L), class = "data.frame")
In terms of why I am using factors, I don't know how else I should enter the data. Factors made sense, and they were how R interpreted the raw data when I uploaded it.
I am working on the suggestions below. Not had success yet, but I am still learning how to get R to do what I want, and frequently mess up. Will respond to each of you as soon as I have a reasonable answer to give. (And once I stop banging my poor head on my desk... sigh)
If I understand your question correctly, this should do it.
I am assuming your data frame is named df:
library(tidyverse)
df %>%
mutate(PROG = ifelse(PROG %in% c("Grad2", "Grad3","Grad.H"),
"Grad",
NA)) %>% ##combines the 3 Grad variables into one
filter(!is.na(PROG)) %>% ##drops the other variables
group_by(YEAR, AGE) %>%
summarise(NUMBER = sum(NUMBER))
Slightly different approach: only take factors you want, drop the factor variable (because you want to treat them as a group) and sum up all NUMBER values while grouping by all other variables. df is your data.
aggregate(NUMBER ~ .,
data = subset(df, PROG %in% c("Grad2", "Grad3", "Grad.H"), select = -PROG),
FUN = sum)
There are multiple ways to do this, but I agree with FScott that you are likely looking for the levels() function to rename the factor levels. Here is how I would do the second step of summing.
library(magrittr)
library(dplyr)
#do the renaming of the PROG variables here
#sum by PROG
df <- df %>%
group_by(PROG) %>% # you could add more variable names here to group by i.e. group_by(PROG, AGE, YEAR)
mutate(group.sum= sum(NUMBER))
This chunk makes a new column in df named group.sum, holding the sum of NUMBER within each group defined by group_by().
If you want to condense the data frame further, so that the individual values in NUMBER are replaced with group.sum, there are again many ways to do it, but here is a simple one.
#condense df down
df$NUMBER <- df$group.sum
df <- df[,-ncol(df)]
df <- unique(df)
A side note: I wouldn't recommend doing the above chunk, because you lose information in your data, and your data is tidier with just the extra column group.sum.
I think the levels() function is what you are looking for. From the manual:
## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z
I named your data temp and ran this code. It works for me.
# one new label per existing level of PROG, in level order:
# levels 4 to 8 (Grad.H, Grad2, Grad3, Grad2.Qual, Grad3.Qual) become "Grad"
new.labels<-c(rep("Other",3),rep("Grad",5),rep("Other",12))
temp$PROG2<-temp$PROG
levels(temp$PROG2)<-new.labels
levels(temp$PROG2)
temp
I am trying to calculate Month over Month % Revenue change on data rows using R. For example my current data is:
Booking.Date Revenue Month
4/1/2018 3160 April
4/1/2018 12656 April
4/1/2018 5157 April
5/8/2018 12152 May
5/8/2018 2824 May
5/8/2018 4600 May
6/30/2018 6936 June
6/30/2018 17298 June
6/30/2018 9625 June
I want to make a dynamic function in R which calculates the Revenue MoM, i.e. ((Revenue_month2 - Revenue_month1) / Revenue_month1) * 100, for any new month.
The output should be similar to:
Month Revenue_MoM
April 3%
May -8%
June 50%
and so on.
I have a data.table solution. Only the ordering needs to be fixed, which is done by converting the month into a proper date column. It should give you an idea. Please keep in mind that for the first month there's no way to calculate a growth rate. I used the logarithmic growth rate, which is in my opinion the best way, but you can easily switch that to any other growth rate calculation.
library(data.table)
dt <- structure(list(Booking.Date = structure(c(1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L), .Label = c("4/1/2018", "5/8/2018", "6/30/2018"
), class = "factor")
, Revenue = c(3160L, 12656L, 5157L, 12152L, 2824L, 4600L, 6936L, 17298L, 9625L)
, Month = structure(c(1L, 1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L), .Label = c("April", "June", "May"), class = "factor"))
, row.names = c(NA, -9L)
, class = c("data.table","data.frame"))
# Change the month column into date one
# Setting the locale, so that the months can be converted
Sys.setlocale("LC_TIME", "en_US.UTF-8")
dt[, `:=`(Month.Date = as.Date(paste0("2018-",Month,"-01"), tryFormats = "%Y-%B-%d"))]
# Calculations, based on the logarithmic growth rate
dt[,.(Sum.Revenue = sum(Revenue)), by = list(Month.Date)][, .(Month.Date
, Sum.Revenue
, Change.Revenue = log(Sum.Revenue) - log(shift(Sum.Revenue, n =1L, type = "lag"))
)]
# Calculations, based on the normal growth rate calculation
dt[,.(Sum.Revenue = sum(Revenue)), by = list(Month.Date)][, .(Month.Date
, Sum.Revenue
, Change.Revenue = (Sum.Revenue - shift(Sum.Revenue, n =1L, type = "lag"))/shift(Sum.Revenue, n =1L, type = "lag")
)]
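To make the two growth rates concrete, here is a small base R sketch on the monthly totals of the sample data (April 20973, May 19576, June 33859, obtained by summing the Revenue column per month). The variable names are made up for illustration, and this is not a replacement for the data.table code above.

```r
# Monthly revenue totals from the sample data, in calendar order
monthly <- c(April = 20973, May = 19576, June = 33859)

prev <- head(monthly, -1)   # April, May
curr <- tail(monthly, -1)   # May, June

# "Normal" month-over-month change in percent
simple_pct <- 100 * (curr - prev) / prev

# Logarithmic growth rate in percent
log_pct <- 100 * (log(curr) - log(prev))

round(simple_pct, 1)   # May -6.7, June 73.0
round(log_pct, 1)
```

The first month has no predecessor, so it gets no value, which matches the NA that shift() produces in the data.table version.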
I have a data set containing data sorted in rows like this:
*VarName1* - *VarValue1*
*VarName2* - *VarValue2*
*Etc.*
I want it to be that the VarNames become individual columns. I have achieved this by using the following code:
DFP1 <- as.data.frame(t(DFP)) #DFP contains the data
Now, this is a very big data set. It contains multiple years (millions of rows) of data. The above code creates a dataframe which has > 1E6 columns. I need to split these columns by each entry. I saw that in the first piece of data, a new entry starts at every 86th column. So, I tried this:
tmp <- data.frame(
X = DFP$noFloat,
ind = rep(1:86, nrow(DFP)/86)
)
y <- rbind(DFP$nmlVar[1:86], unstack(tmp, X~ind))
This works for a few rows. The problem is that the number of variables increased over the years, so I cannot simply assume that the number of variables per entry is the same. This results in variable values being mismatched with their names. I am looking for a way to match variables and values based on their variable names.
I am new to advanced data-analysis, so please let me know if you need anything more.
EDIT: I created some sample data showing what DFP looks like, to hopefully make my question easier to understand:
DFP <- data.frame(
nmlVar = c("Batch", "Mass", "Length", "Product","Batch", "Mass",
"Length", "Product", "Batch", "Mass", "Length", "Width", "Product"),
noFloat = c(254578, 20, 24, 24547, 254579, 23, 24, 24547, 254580, 20,
24, 19, 24547)
)
Important to note here is the appearance of the new variable Width in the third recurrence. The introduction of new variables like this is typical for my dataset. The key indicator here is Batch: the data should be split each time the variable Batch appears.
dput output of sample data:
structure(list(nmlVar = structure(c(1L, 3L, 2L, 4L, 1L, 3L, 2L,
4L, 1L, 3L, 2L, 5L, 4L), .Label = c("Batch", "Length", "Mass",
"Product", "Width"), class = "factor"), noFloat = c(254578, 20,
24, 24547, 254579, 23, 24, 24547, 254580, 20, 24, 19, 24547)), .Names = c("nmlVar",
"noFloat"), row.names = c(NA, -13L), class = "data.frame")
Is this what you are after?:
library(dplyr)
library(tidyr)
DFP %>%
mutate(sample = cumsum(nmlVar == 'Batch')) %>%
spread(nmlVar, noFloat)
Gives:
sample Batch Length Mass Product Width
1 1 254578 24 20 24547 NA
2 2 254579 24 23 24547 NA
3 3 254580 24 20 24547 19
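The key step is the cumsum() call: comparing nmlVar with 'Batch' gives TRUE at the start of every entry, and the running sum of those TRUEs numbers the entries. A minimal base R sketch on a shortened version of the sample names (the vector below is cut down for illustration):

```r
nmlVar <- c("Batch", "Mass", "Length", "Product",
            "Batch", "Mass", "Length", "Width", "Product")

# TRUE where a new entry starts; the running sum gives an entry id per row
sample_id <- cumsum(nmlVar == "Batch")
sample_id   # 1 1 1 1 2 2 2 2 2

# Each entry keeps its own set of variable names, whatever its length
split(nmlVar, sample_id)
```

Because the id does not depend on a fixed entry length, an entry that gains a variable such as Width still ends up in one group.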
I have a large dataframe and a vector of terms of interest that I want to pull out. For a previous project I was using:
a=data[data$rn %in% y, "Gene"]
to pull the information out into a new vector. Now I have another job I'd like to do.
I have a large dataframe of 15 columns and >100000 rows. I want to search columns 3 and 9 for the contents of the vector and output the matching rows as a new dataframe.
To make this extra annoying, the hit could be in V3 and not in V9, and vice versa.
Working example
I have stripped the dataframe down to 3 columns and a few rows.
data <- structure(list(Gene = structure(c(1L, 5L, 3L, 2L, 4L), .Label = c("ibp","leuA", "pLeuDn_02", "repA", "repA1"), class = "factor"), LocusTag = structure(c(1L,2L, 5L, 3L, 4L), .Label = c("pBPS1_01", "pBPS1_02", "pleuBTgp4","pleuBTgp5", "pLeuDn_02"), class = "factor"), hit = structure(c(2L,4L, 3L, 1L, 5L), .Label = c("2-isopropylmalate synthase", "Ibp protein","ORF1", "repA1 protein", "replication-associated protein"), class = "factor")), .Names = c("Gene","LocusTag", "hit"), row.names = c(NA, 5L), class = "data.frame")
y <- c("ibp", "orf1")
First of all, R is case sensitive, so your example will not pick up the third row, which I guess you want extracted. So you would have to change your y to
y <- c("ibp", "ORF1")
From your example I tried to work out what you want to achieve. I am not sure this is really what you want, but R has the operator | for "or", so you could try something like:
new.data<-data[data$Gene %in% y|data$hit %in% y,]
If you only want to extract certain columns of your data set, you can specify them after the comma, e.g.:
new.data<-data[data$Gene %in% y|data$hit %in% y, c("LocusTag","Gene")]
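If editing y by hand is not practical, one alternative is to lower-case both sides before matching. This is only a sketch: the small data frame genes below is a cut-down stand-in for the real data, and tolower() is a swapped-in technique rather than part of the answer above. Note that %in% still matches whole values only, so "Ibp protein" does not match "ibp".

```r
# Cut-down stand-in for the real data (three rows of the example)
genes <- data.frame(Gene = c("ibp", "repA1", "pLeuDn_02"),
                    hit  = c("Ibp protein", "repA1 protein", "ORF1"),
                    stringsAsFactors = FALSE)
y <- c("ibp", "orf1")

# Compare lower-cased values so that "ORF1" matches "orf1"
keep <- tolower(genes$Gene) %in% tolower(y) |
        tolower(genes$hit)  %in% tolower(y)

new.data <- genes[keep, ]
new.data
```

Row 1 is kept because of its Gene value and row 3 because of its hit value, which is the "either column" behaviour asked for.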