Subsetting based on observations in a month - R

I'm trying to subset some data and am stuck on the last part of cleaning.
What I need to do is calculate, for each individual (indivID), the percentage of days with observations in each month (June, July, and August), and then keep only the observations from individual-months with more than 75% coverage.
I was able to write a nested for loop, but it took roughly 6 hours to process today. I would like to take advantage of parallel computing by using ddply, or another function, but am very lost.
Here's the data (Note this is a very small subset that only includes individuals from 1:5):
https://www.dropbox.com/s/fmk8900622klsgt/data.csv?dl=0
And here's the for loop :
epa.d <- read.csv("/.../data.csv")
# Function returning the number of days in each month
days <- function(month) {
  if (month == 6) return(30)
  if (month == 7) return(31)
  if (month == 8) return(31)
}
# Subset data for 75% coverage in June, July, and August
for (i in unique(epa.d$indivID)) {
  for (j in unique(epa.d$year)) {
    for (k in unique(epa.d$month)) {
      monthsum  <- sum(epa.d$indivID == i & epa.d$year == j & epa.d$month == k)
      monthperc <- (monthsum / days(k)) * 100
      if (monthperc < 75) {
        epa.d <- epa.d[!(epa.d$indivID == i & epa.d$year == j), ]
      }
    }
  }
}

If I understand you correctly, you want to keep daily observations for each combination of indivID-month-year in which at least 75% of days have ozone measurements. Here's a way to do it that should be pretty fast:
library(dplyr)
# For each indivID, calculate percent of days in each month with
# ozone observations, and keep those with pctCoverage >= 0.75
epa.d_75 = epa.d %>%
  group_by(indivID, year, month) %>%
  summarise(count = n()) %>%
  mutate(pctCoverage = ifelse(month == 6, count/30, count/31)) %>%
  filter(pctCoverage >= 0.75)
We now have a data frame epa.d_75 that has one row for each indivID-month-year with at least 75% coverage. Next, we'll merge the daily data into this data frame, resulting in one row for each daily observation for each unique indivID-month-year.
# Merge in daily data for each combination of indivID-month-year that meets
# the 75% coverage criterion
epa.d_75 = merge(epa.d_75, epa.d, by = c("indivID", "month", "year"),
                 all.x = TRUE)
Update: To answer the questions in the comments:
Can you explain what the %>% is doing, and if possible a breakdown of how you logically thought about this.
The %>% is a "chaining" operator that allows you to chain functions one after the other without having to store the result of the previous function before running the next one. Take a look at the dplyr Vignette to learn more about how to use it. Here's how the logic works in this case:
group_by splits the data set by the grouping variables, then runs the next functions separately on each group. In this case, summarise counts the number of rows in the data frame for each unique combination of indivID, month, and year, then mutate adds a column with the fractional coverage for that indivID for that month and year. filter then gets rid of any combination of indivID, month, and year with less than 75% coverage. You can stop the chain at any point to see what it's doing. For example, run the following code to see what epa.d_75 looks like before the filtering operation:
epa.d_75 = epa.d %>%
  group_by(indivID, year, month) %>%
  summarise(count = n()) %>%
  mutate(pctCoverage = ifelse(month == 6, count/30, count/31))
why the hell this is so much faster than running for loops? I don't know the answer in detail, but dplyr does most of its work in compiled C++ code under the hood, which is much faster than looping in interpreted R. Another factor is that the loop version repeatedly subsets and reassigns the whole data frame, which forces a lot of copying. Hopefully someone else can give a more precise and detailed answer.
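If you want to quantify the difference on your own data, the simplest check is to wrap each approach in system.time(); a minimal sketch, assuming epa.d has already been read in as above:
library(dplyr)
# Time the dplyr version; wrap the original nested loop in system.time() the same way to compare
system.time({
  epa.d_75 <- epa.d %>%
    group_by(indivID, year, month) %>%
    summarise(count = n()) %>%
    mutate(pctCoverage = ifelse(month == 6, count/30, count/31)) %>%
    filter(pctCoverage >= 0.75)
})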

Another option would be to use data.table (similar to @eipi10's dplyr method), which would be very fast.
library(data.table)
epa.d_75 <- setDT(epa.d)[, list(pctCoverage = ifelse(month == 6, .N/30, .N/31)),
                         by = list(indivID, year, month)][pctCoverage >= 0.75]
epa.d_75New <- merge(epa.d_75, epa.d, by = c("indivID", "month", "year"),
                     all.x = TRUE)
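One thing worth noting with this approach: setDT() converts epa.d to a data.table by reference, so the original object itself is modified. If you would rather leave epa.d untouched, work on a copy instead, e.g.:
epa.d_dt <- as.data.table(epa.d)  # makes a copy, leaving the original data.frame as-is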
data
epa.d <- read.csv('data.csv', row.names=1)

Related

Remove rows based on data in another dataframe?

I currently have a dataset with more than 186k observations (rows). Each row belongs to a company identified by the BVDID column, and every company should contain data for all years from 2013 to 2017.
missingdata <- series %>% filter(LIABILITIES == 0) %>% select(BVDID)
However, using the code above I found 87k rows with only zero values in the missingdata object.
How do I delete the rows of the series object whose BVDID (company code) appears in the missingdata data frame? Also, is there a way to tidy up how the years appear under str(series) and sort them in ascending order within each company code?
Best regards
There are many ways; here is one.
Use the tidyverse anti_join() function, which works like the set operation A - B: it returns the rows of the first data frame that have no match in the second, so all companies listed in missingdata are removed.
series %>% anti_join(missingdata, by =c("BVDID" = "BVDID"))
Or directly, without building missingdata first: LIABILITIES == 0 returns logical values, sum() counts how many of them are TRUE within each company, and keeping only the groups where that count is zero drops every company that has any zero-liability rows.
series %>% group_by(BVDID) %>% filter(sum(LIABILITIES == 0) == 0) %>% ungroup()
series %>%
  # filter out the BVDIDs from missingdata
  filter(!BVDID %in% pull(missingdata)) %>%
  # order the df
  arrange(BVDID, year)

Averages of conditional subsets of a dataframe in R to store in new dataframe

I'm learning to work with R and could use some help figuring out how to create a dataframe from averages of subsets of my initial dataframe, based on a condition.
I have a df of ~18000 rows, 9 columns, one being a distance. I want to use conditions on the distance to average the values of the 9 columns. The first subset would correspond to a distance range of 0:2.5, the second one a range of 2.5:5, and so on, every 2.5 meters.
I can make a first subset this way :
df1 <- subset(df_ini, df_ini$Distance..m.>0 & df_ini$Distance..m.<2.5)
The new dataframe now has 18 rows.
I then need to average the values of each column, store them in a new df and continue to do so for every subset, appending the averages to the same df.
I can't get the right loop(s) to do that, I would really appreciate any ideas/tips.
Thanks!
I am pretty new to R, but maybe this will give some inspiration. I suggest that you take a look at the dplyr package.
For mtcars:
library(dplyr)
df1 <- mtcars %>% filter(mpg >= 10 & mpg <= 20) %>% summarise_all(funs(mean))
df2 <- mtcars %>% filter(mpg >=20 & mpg <= 30) %>% summarise_all(funs(mean))
combined <- rbind(df1, df2)
In your dataset you can filter on ranges of distances. Ideally, you would use a loop to automatically create a range of groups (0:2.5, 2.5:5.0, etc..), like you stated in your post. I don't know how to do that though.
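One way to get those automatic 2.5 m bins without writing a loop is to cut the distance column into intervals and group by the result; a minimal sketch, assuming the distance column is named Distance..m. as in the question's subset() call:
library(dplyr)
# Bin distance into 2.5 m intervals, then average every remaining column within each bin
breaks <- seq(0, max(df_ini$Distance..m.) + 2.5, by = 2.5)
df_means <- df_ini %>%
  mutate(dist_bin = cut(Distance..m., breaks = breaks, include.lowest = TRUE)) %>%
  group_by(dist_bin) %>%
  summarise_all(mean)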

Summing across rows conditional on groups with dplyr using select, group_by, and mutate

Problem: I'm making an aggregate market-share variable in a car market with 286 distinct models sold and a total of 501 cars sold. The group share is based on only one car characteristic, cat = "compact", "midsize", "large", together with the year yr = 77, 78, 79, 80, 81, and the share s, a small double variable; that gives a total of 15 groups in the market.
Closest answer I've found: by mishabalyasin on community.rstudio: "Calculating rowwise totals and proportions using tidyeval?" link to post on community.rstudio.
Applying the select-split-combine principle is the closest I've come to getting the correct answer; it produces the 15 groups (15 rows x 3 columns: cat, yr, group_share):
df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(s))
# In my actual data, this is what fills in the group share to get what I want, but it isn't the desired pipeline-based answer
blp$group_share = 0  # initializing the group_share, the 50th col
for (i in 1:501) {
  for (j in 1:15) {
    if ((blp[i, 31] == df[j, 1]) && (blp[i, 3] == df[j, 2])) {  # if (sameCat & sameYr) { blpGS = dfGS }
      blp[i, 50] = df[j, 3]
    }
  }
}
This works, but I know it can be done in one fell swoop... Hopefully the idea is clear from what I've described above. A simple fix might be a loop with conditions on cat and yr, and that would help, but I'm really trying to get better at data wrangling with dplyr, so any insight toward the pipeline-based answer would be wonderful.
Example for the site: the example below doesn't work with the code I provided, because the share column ends up as a factor, but it shows the "look" of my data.
#45 obs, 3 cats, 5 yrs
cat=c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr=c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s=c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
blp=as.data.frame(cbind(unlist(lapply(cat,as.character,stringsAsFactors=FALSE)),as.numeric(yr),unlist(as.numeric(s))))
names(blp)<-c("cat","yr","s")
head(blp)
#note: one example of a group share would be summing the share from
(group_share.blp.large.81.s=(blp[cat== "large" &yr==81,]))
# Works thanks to akrun: applying the code I provided, this gives the 15 groups
df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(as.numeric(as.character(s))))
# Manually filling doesn't work, but this is what I'd want if I didn't want pipelining
blp$group_share = 0
for (i in 1:45) {
  for (j in 1:15) {
    if ((blp[i, 1] == df[j, 1]) && (as.numeric(blp[i, 2]) == as.numeric(df[j, 2]))) {  # if (sameCat & sameYr) { blpGS = dfGS }
      blp[i, 4] = df[j, 3]
    }
  }
}
If I understood your problem correctly, this should help!
The only difference here is that instead of summarise, which returns just the grouping columns and the summarised one, you can use mutate to keep the original columns and add the aggregated one alongside them.
# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr <- c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s <- c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
# Calculation
library(dplyr)

blp <-
  data.frame(cat, yr, s, stringsAsFactors = FALSE) %>%  # To create dataframe
  group_by(cat, yr) %>%                                 # Grouping by category and year
  mutate(group_share = sum(s, na.rm = TRUE)) %>%        # Calculating sum of share per category/year
  ungroup()
Expected output

Vectorising function on subset of dataframe based on other columns

I have a dataframe from a psychology experiment with the time since the beginning of the experiment for each subject, and what I want is to compute from that the time since the beginning of each trial for each subject. To do so I'm basically just subtracting the minimum time value for each trial/subject from all the values for that same trial/subject.
I'm currently doing it with two for loops, I was just wondering if there's a way to vectorise it. What I have at the minute:
for (s in 1:max(df$Subject)) {
  subject <- df[df$Subject == s, ]
  for (t in 1:max(subject$TrialId)) {
    trial <- subject[subject$TrialId == t, ]
    start_offset <- min(trial$timestamp)
    df$timestamp[df$Subject == s & df$TrialId == t] <-
      df$timestamp[df$Subject == s & df$TrialId == t] - start_offset
  }
}
And what I would like is something like
df$timestamp <- df$timestamp - min_per_trial_per_subject(df$timestamp)
With dplyr
library(dplyr)
df %>%
  group_by(Subject, TrialId) %>%
  mutate(modified_timestamp = timestamp - min(timestamp))
Should work. If it doesn't, please share a reproducible example so we can test.
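If you'd rather avoid the dplyr dependency, the same per-group minimum can be computed in base R with ave(); a quick sketch, assuming the same column names as in the question:
# Base R alternative: subtract each Subject/TrialId group's minimum timestamp
df$timestamp_rel <- df$timestamp - ave(df$timestamp, df$Subject, df$TrialId, FUN = min)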

R aggregating irregular time series data by groups (with meta data)

Hi, I have a data frame (~4 million rows) with time series data for different sites and events.
Here is a rough idea of my data (obviously on a much smaller scale). I have several similar time series, so I've kept the example general because I want to be able to apply the solution in different cases:
Data1 <- data.frame(DateTimes =as.POSIXct("1988-04-30 13:20:00")+c(1:10,12:15,20:30,5:13,16:20,22:35)*300,
Site = c(rep("SiteA",25),rep("SiteB",28)),
Quality = rep(25,53),
Value = round(runif(53,0,5),2),
Othermetadata = c(rep("E1",10),rep("E2",15),rep("E1",10),rep("E2",18)))
What I'm looking for is a simple way to group and aggregate this data to different time steps while keeping metadata that doesn't vary within the group.
I have tried using the zoo library and zoo::aggregate, i.e.:
library(zoo)
library(dplyr)  # for select()
zooData <- read.zoo(select(Data1, DateTimes, Value))
zooagg <- aggregate(zooData, time(zooData) - as.numeric(time(zooData)) %% 3600, FUN = sum, reg = T)
However, when I do this I lose all my metadata and merge the different sites' data together.
I wondered about using plyr or dplyr to split up the data and then applying the aggregate, but I'm still going to lose my other columns.
Is there a better way to do this? I had a brief look at the docs for the xts library but couldn't see an intuitive solution there either.
Note: since I want this to work for a few different things, both the starting time step and the final time step might change, with the possibility of an irregular time step or a roughly regular step with missing points. The FUN applied may also vary (mostly sum or mean), as may the fields I want to split by.
Edit: I found the solution after Hercules Apergis pushed me in the right direction.
newData <- Data1 %>% group_by(timeagg, Site) %>% summarise(Total = sum(Value))
finaldata <- inner_join(Data1, newData) %>% select(-DateTimes, -Value) %>% distinct()
The original DateTimes column wasn't a grouping variable - it was the time series itself - so I added a grouping variable, timeagg, holding my aggregated time (here: the hour each observation falls in) and summarised on that. The problem was that if I joined on this new column alone I missed any points that fell within the hour but not exactly on the hour, hence the inner_join %>% select %>% distinct method.
Now hopefully it works with my real data, not just the example data!
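For reference, timeagg isn't defined in the snippet above; one way to build it (an illustrative sketch using base R's trunc(), with the column name taken from the edit) is to floor each timestamp to the start of its hour:
library(dplyr)
# Add the aggregation key: the start of the hour each observation falls in
Data1 <- Data1 %>%
  mutate(timeagg = as.POSIXct(trunc(DateTimes, units = "hours")))
With timeagg in place, the group_by / summarise / inner_join chain above runs as written.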
Given the function that you have on aggregation:
aggregate(zooData, time(zooData) - as.numeric(time(zooData))%%3600, FUN = sum, reg = T)
You want to sum the values by group of times AND NOT lose other columns. You can simply do this with the dplyr package:
library(dplyr)
newdata <- Data1 %>% group_by(DateTimes) %>% summarise(Total = sum(Value))
finaldata <- inner_join(Data1, newdata, by = "DateTimes")
newdata is a data frame in which the Values are summed for each group of DateTimes. inner_join then merges the parts of the two datasets that share the same DateTimes value, keeping the other columns. Since I am not entirely sure what your desired output is, this should be a good starting point.
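Note that grouping by DateTimes alone won't coarsen the time step or separate the sites; a sketch that combines this idea with an hourly key and carries the per-group metadata along (assuming, as the question states, that the metadata doesn't vary within a group) might look like:
library(dplyr)
hourly <- Data1 %>%
  mutate(timeagg = as.POSIXct(trunc(DateTimes, units = "hours"))) %>%
  group_by(Site, timeagg) %>%
  summarise(Total = sum(Value),
            Quality = first(Quality),
            Othermetadata = first(Othermetadata)) %>%
  ungroup()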
