I have a data frame with three columns: DATE, HOUR, HRC
(So there are 24 rows for each DATE)
The HRC column is sometimes a number and sometimes NA.
I am trying to take a subset of DATEs and then find the HOURs that have non-NA values across all of those days.
Example: if the DATEs are Aug16, Aug18, and Aug19, and the HRC column has non-NA values on Aug16 at HOURS 8, 9, 10, 11, 12; on Aug18 at HOURS 7, 8, 9, 10, 11; and on Aug19 at HOURS 9, 10, 11, 12, 13, then I would like the outcome to be the list of HOURS 9, 10, 11, since those are the HOURS that are non-NA on all DATES.
Adjusting sum(is.na(x$HRC)) to sum(!is.na(x$HRC)) in Gary's solution did the trick. Thanks everyone!
You didn't provide an example, so it is hard to be sure exactly what you are asking. It is generally helpful to provide a reproducible example, even though I admit it is a little challenging to create one with date types.
set.seed(1234)
## generate a sequence of 25 days, hour by hour
x <- Sys.time() + seq(1, by = 60 * 60, length.out = 24 * 25)
hh <- as.POSIXlt(x)$hour
## generate the data.frame
dat <- data.frame(DATE = as.POSIXct(format(x, "%Y-%m-%d")),
                  HOUR = hh,
                  HRC  = seq_along(x))
## introduce random NAs
id <- sample(nrow(dat), 10, replace = FALSE)
dat$HRC[id] <- NA
Here begins my solution; it is similar to Gary's in that I also use the plyr package, but with a different function.
library(plyr)
## choose 2 dates to subset
min.d <- as.POSIXct('2013-03-01')
max.d <- as.POSIXct('2013-03-15')
dat.s <- subset(dat, DATE >= min.d & DATE <= max.d)
res <- ddply(dat.s, .(HOUR),      ## grouping by hour
             function(x) {
               any(is.na(x$HRC))  ## TRUE if at least one HRC is NA for this hour
             })
The result (the hours that have at least one missing HRC in the chosen date range):
res[res$V1,]
HOUR V1
6 5 TRUE
12 11 TRUE
14 13 TRUE
17 16 TRUE
19 18 TRUE
22 21 TRUE
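Since the question asks for the hours that are non-NA on every selected date, the complement of V1 gives those directly:
## hours where HRC is never NA within the chosen date range
res$HOUR[!res$V1]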
You might try something like this:
library(plyr)
# assuming your dates are in some date format
d_0 <- as.Date('02-01-2010',format='%m-%d-%Y')
d_1 <- as.Date('02-10-2010',format='%m-%d-%Y')
# assuming your data are in data frame 'dat', get some subset of dates
some_dates <- subset(dat, DATE > d_0 & DATE < d_1)
# count the non-NA values for each hour
hr_count <- ddply(some_dates, .(HOUR), function(x) sum(!is.na(x$HRC)))
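Following the comment above (swapping is.na for !is.na), a minimal sketch of the remaining step is to keep only the hours whose non-NA count equals the number of dates in the subset; this assumes every date contributes exactly one row per hour, as described in the question:
n_dates <- length(unique(some_dates$DATE))
hr_count$HOUR[hr_count$V1 == n_dates]   # hours that are non-NA on every date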
Say I have a data frame of numeric values and a second data frame of weights for those values, built like this:
Monday <- c(1, 1, 10)
Tuesday <- c(1, 2, 3)
df <- data.frame(Monday, Tuesday)
Monday <- c(10, 10, 1)
Tuesday <- c(1, 1, 1)
df_weights <- data.frame(Monday, Tuesday)
How can I summarize each column of the first data frame using weighted mean with the corresponding column in the second data frame as a source of the values for the weights?
In addition, I would like both the mean and the weighted mean in a single data frame; how could I use summarize_all with two functions like that?
Is it something like this?
library(dplyr)
library(Hmisc)
bind_cols(df, rename_all(df_weights, function(x) paste0(x, ".wt"))) %>%
  summarise(Monday = wtd.mean(Monday, w = Monday.wt),
            Tuesday = wtd.mean(Tuesday, w = Tuesday.wt))
## Monday Tuesday
##1 1.428571 2
Or possibly something more general without dplyr:
Map(function(x) wtd.mean(df[[x]],w=df_weights[[x]]),colnames(df))
## $Monday
## [1] 1.428571
##
## $Tuesday
## [1] 2
Getting the mean and the weighted mean together is a little trickier, but purrr can help generalize the previous answer. I don't know if the structure of the result matches your needs:
purrr::map_dfr(colnames(df),
               function(x) list(column = x,
                                mean = mean(df[[x]]),
                                wmean = wtd.mean(df[[x]], w = df_weights[[x]])))
## # A tibble: 2 x 3
## column mean wmean
## <chr> <dbl> <dbl>
##1 Monday 4 1.43
##2 Tuesday 2 2
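For what it's worth, base R's weighted.mean gives the same numbers without Hmisc; a quick check over the columns of both data frames:
mapply(weighted.mean, df, df_weights)
##   Monday  Tuesday
## 1.428571 2.000000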
The data I am working with is the number of people in a group. The columns in the dataset I'm concerned with are the date (column 1) and the number of people in a group (column 3, with a separate row for each group on a given day). I am looking for an output spreadsheet with a column for the date, a column for the number of groups with exactly one person on that day, and a column for the total number of people who are in groups larger than one on that day.
For example if this was my dataset:
Date People
10/18 1
10/18 3
10/18 1
10/18 8
10/20 1
10/20 4
10/20 2
My desired output would be:
Date p=1 p>1
10/18 2 11
10/20 1 6
My data frame is "DF" and a csv with the different dates is "times". I tried to use a for loop but the output was just zeros.
Here is what I tried:
ntimes = length(times$UniTimes)
for(i in 1:ntimes)
{
s<- sum(DF[which (DF[,3] > 1 & DF[,1]==i),3])
t<- sum(DF[which (DF[,3] < 2 & DF[,1]== i),3])
}
ndf<-data.frame(times,s,t)
write.csv(ndf,'groups_c.csv')
Thank you for your time and help!
You can use aggregate:
aggregate(People ~ Date, DF, function(x) c("p=1" = sum(x[x == 1]),
                                           "p>1" = sum(x[x > 1])))
# Date People.p=1 People.p>1
#1 10/18 2 11
#2 10/20 1 6
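Note that aggregate returns People here as a two-column matrix stored inside the data frame. If you want ordinary columns (for example before writing to CSV), a common follow-up, sketched here, is to flatten it with do.call:
res <- aggregate(People ~ Date, DF, function(x) c("p=1" = sum(x[x == 1]),
                                                  "p>1" = sum(x[x > 1])))
res <- do.call(data.frame, res)  # the "p=1"/"p>1" names are run through make.names()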
This should work, but without data to reproduce it's difficult to say:
library(dplyr)
DF %>%
  group_by(Date) %>%
  summarise(peq1 = sum(People == 1),
            pgeq1 = sum(People[People > 1]))
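On the example data this should print roughly the following (the exact column types depend on how the data were read in):
## # A tibble: 2 x 3
##   Date   peq1 pgeq1
##   <chr> <int> <int>
## 1 10/18     2    11
## 2 10/20     1     6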
An option with data.table
library(data.table)
setDT(DF)[, .(peq1 = sum(People == 1), pgeq1 = sum(People[People >1])), .(Date)]
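Whichever of these summaries you use, the result can then be written out the way your original loop intended (res here stands for whatever you named the summarised table):
write.csv(res, "groups_c.csv", row.names = FALSE)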
I'm wondering if there is an easy way to average over the previous 30 seconds of data in R when there may be more than one data point per second.
For instance, for the sample weight taken at 32 seconds, I want the mean of the concentrations recorded in the past 30 seconds, so the mean of 9, 10, 7, ..., 14, 20, 18, 2. For the sample weight taken at 31 seconds, I want the mean of the concentrations recorded in the past 30 seconds, so the mean of 5, 9, 10, 7, ..., 14, 20, 18. It's technically not a rolling average over the 30 previous measurements because there can be more than one measurement per second.
I'd like to do this in R.
1) sqldf Using the DF below and a 3-second window, join the last three seconds of data to each row of DF and then take the mean over them:
DF <- data.frame(time = c(1, 2, 2, 3, 4, 5, 6, 7, 8, 10), data = 1:10)
library(sqldf)
sqldf("select a.*, avg(b.data) mean
from DF a join DF b on b.time between a.time - 3 and a.time
group by a.rowid")
giving:
time data mean
1 1 1 1.0
2 2 2 2.0
3 2 3 2.0
4 3 4 2.5
5 4 5 3.0
6 5 6 4.0
7 6 7 5.5
8 7 8 6.5
9 8 9 7.5
10 10 10 9.0
The first mean value is mean(1) which is 1, the second and third mean values are mean(1:3) which is 2, the fourth mean value is mean(1:4) which is 2.5, the fifth is mean(1:5) which is 3, the sixth is mean(2:6) which is 4, the seventh is mean(4:7) which is 5.5, and so on.
2) This 2nd solution uses no packages. For each row of DF it finds the rows within 3 seconds back and takes the mean of their data:
Mean3 <- function(i) with(DF, mean(data[time <= time[i] & time >= time[i] - 3]))
cbind(DF, mean = sapply(1:nrow(DF), Mean3))
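Adapting the same idea to the 30-second window in the question, here is a sketch; it assumes a data frame dat with columns time and concentration (the actual names are not shown in the post), and you may want >= rather than > on the lower bound depending on how you define "the past 30 seconds":
Mean30 <- function(i) with(dat, mean(concentration[time <= time[i] & time > time[i] - 30]))
dat$mean30 <- sapply(1:nrow(dat), Mean30)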
The rollapply function should do the trick.
library(zoo)
rollapply(weight.vector, 30, mean)
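Note that this averages the previous 30 observations, not the previous 30 seconds. If there can be more than one measurement per second, one zoo-based sketch (assuming a data frame df with columns time and concentration, sorted by time) is to give rollapplyr a different width for each row:
library(zoo)
# per-row window width: how many observations fall in (time - 30, time]
w <- seq_along(df$time) - findInterval(df$time - 30, df$time)
df$mean30 <- rollapplyr(df$concentration, w, mean)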
You can do (assuming your data is stored in a dataframe called df):
now <- 32
step <- 30
subsetData <- subset(df, time >= (now-step) & time < now)
average <- mean(subsetData$concentration)
And if you want to calculate the mean at more time points, you can put this in a loop, adjusting now each time; a sketch follows.
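Here is that loop, using the same names as above and a vector of hypothetical sample times:
timepoints <- c(31, 32, 60)  # hypothetical sample times
step <- 30
means <- sapply(timepoints, function(now)
  mean(df$concentration[df$time >= (now - step) & df$time < now]))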
My first idea would be to summarise the data so that the value column contains a list of all values for each time point.
library(dplyr)
test.data <- data.frame(t = 1:50 + rbinom(50, 30, 0.3), y = rnorm(50)) %>% arrange(t)
prep <- test.data %>% group_by(t) %>% summarise(vals = list(y))
wrk <- left_join(data.frame(t = 1:max(test.data$t)), prep, by = 't')
Unfortunately zoo's rollapply would not work on such a data.frame.
For testing I was thinking of using a window of only 5 lines.
I tried commands along the lines of rollapply(wrk, 5, function(z) mean(unlist(z))).
But maybe someone else can fill in the missing bit of information.
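One way to fill in that missing bit without rollapply is a plain index-based loop over wrk; a sketch for the 5-line test window (the result column name mean5 is made up here):
win <- 5
wrk$mean5 <- sapply(seq_len(nrow(wrk)), function(i) {
  vals <- unlist(wrk$vals[max(1, i - win + 1):i])  # NULL entries from the join drop out
  if (length(vals)) mean(vals) else NA
})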
This is sufficiently different that it warrants another answer.
This should do what you're asking with no extra libraries needed.
It just loops through each row, filters based on that row's time, and computes the mean.
Don't fear a simple loop :)
count = 200 # dataset rows
windowTimespan = 30 # timespan of window
# first lets make some data
df = data.frame(
  # 200 random numbers from 0-99
  time = sort(floor(runif(count) * 100)),
  concentration = runif(count),
  weight = runif(count)
)
# add placeholder column(s)
df$rollingMeanWeight = NA
df$rollingMeanConcentration = NA
# for each row
for (r in 1:nrow(df)) {
  # get the time in this row
  thisTime = df$time[r]
  # find all the rows within the acceptable timespan
  # note: figure out if you want < vs <=
  thisSubset = df[
    df$time < thisTime &
      df$time >= thisTime - windowTimespan
  , ]
  # get the mean of the subset
  df$rollingMeanWeight[r] = mean(thisSubset$weight)
  df$rollingMeanConcentration[r] = mean(thisSubset$concentration)
}
Say I have a series of dates, and I want to break them into groups (let's call the groups "epochs"). My first idea of how to do this would be to create a variable that indicates which epoch a date belongs in. The following code shows what I want.
library(dplyr)
library(mosaic)
library(magrittr)
# Generate 1,000,000 random dates
set.seed(919)
df <- data.frame(dates = runif(1e6, -100, 100) + as.Date("2015-12-18"))
# Set two arbitrary dates as cutoffs
e1 <- as.Date("2015-10-01")
e2 <- as.Date("2015-12-20")
# Add a variable that indicates what the lowest cutoff date was
system.time(df %<>% mutate(epoch = derivedFactor(epoch.1 = dates < e1,
                                                 epoch.2 = dates < e2,
                                                 .method = "first",
                                                 .default = "epoch.3")))
# user system elapsed
# 341.86 0.16 344.70
But this is very slow -- about 5 minutes on my laptop. I imagine there is a faster way to do this. For example, my naive guess would be that you could sort the data by date, find the last row where dates < e1, and then mark all the preceding rows as a 1, etc. But maybe someone on here knows a better or more elegant way to do this?
I think you're overthinking this. I did it in base R, but presumably you could do this in dplyr too. Just order the data, and then set the factor levels you want in decreasing order.
Conceptually, you just set everything to the most recent epoch, 3. Then, you go through and find all the rows that are less than the epoch 2 cutoff, and then change those to 2. Then, repeat the same process with 1.
# Generate 1,000,000 random dates
set.seed(919)
test.data <- data.frame(row_id = 1:1000000,dates = runif(1e6, -100, 100) + as.Date("2015-12-18"))
# Set two arbitrary dates as cutoffs
e1 <- as.Date("2015-10-01")
e2 <- as.Date("2015-12-20")
test.data <- test.data[order(test.data$dates),]
test.data$epoch <- 3
test.data[test.data$dates < e2,"epoch"] <- 2
test.data[test.data$dates < e1,"epoch"] <- 1
table(test.data$epoch)
As Ben Bolker pointed out, you can use findInterval to do this:
df %<>% mutate(epoch = findInterval(df$dates, c(e1, e2)))
head(df, 10)
## dates epoch
## 1 2016-03-15 2
## 2 2016-01-02 2
## 3 2016-01-30 2
## 4 2015-10-03 1
## 5 2015-09-17 0
## 6 2016-02-11 2
## 7 2015-12-05 1
## 8 2015-12-15 1
## 9 2016-03-11 2
## 10 2015-10-21 1
On my machine, this takes much less than 0.1 second.
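If you want the same epoch.1/epoch.2/epoch.3 labels that derivedFactor produced, the 0/1/2 codes from findInterval map straight onto a factor (a sketch):
df$epoch <- factor(findInterval(df$dates, c(e1, e2)),
                   levels = 0:2,
                   labels = c("epoch.1", "epoch.2", "epoch.3"))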
I want to split my data into 3 parts with the ratio of 6:2:2. Is there a R command that can do that? Thanks.
I used createDataPartition in the caret package, that can split data into two parts. But how to do it with 3 splits? Is that possible? Or I need two steps to do that?
You can randomly split with (roughly) this ratio using sample:
set.seed(144)
spl <- split(iris, sample(c(1, 1, 1, 2, 3), nrow(iris), replace=T))
This splits your initial data frame into a list. Now you can check that you've gotten the split ratio you were looking for by using lapply with nrow on each element of your list:
unlist(lapply(spl, nrow))
# 1 2 3
# 98 26 26
If you wanted to randomly shuffle but to get exactly your ratio for each group, you could shuffle the indices and then select the correct number of each type of index from the shuffled list. For iris, we would want 90 for group 1, 30 for group 2, and 30 for group 3:
set.seed(144)
nums <- c(90, 30, 30)
assignments <- rep(NA, nrow(iris))
assignments[sample(nrow(iris))] <- rep(c(1, 2, 3), nums)
spl2 <- split(iris, assignments)
unlist(lapply(spl2, nrow))
# 1 2 3
# 90 30 30
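To answer the caret part of the question: createDataPartition produces only one split at a time, so a 6:2:2 split does take two steps; a sketch using Species as the outcome so that class proportions are preserved in each piece:
library(caret)
set.seed(144)
idx_train  <- createDataPartition(iris$Species, p = 0.6, list = FALSE)
train      <- iris[idx_train, ]
rest       <- iris[-idx_train, ]
idx_val    <- createDataPartition(rest$Species, p = 0.5, list = FALSE)
validation <- rest[idx_val, ]
test       <- rest[-idx_val, ]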