I want to calculate the mean for each "Day" but for a portion of the day (Time=12-14). This code works for me but I have to enter each day as a new line of code, which will amount to hundreds of lines.
This seems like it should be simple to do. I've done this easily when the grouping variables are the same, but I don't know how to do it when I don't want to include all values for the day.
Is there a better way to do this?
sapply(sap[sap$Day==165 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)
sapply(sap[sap$Day==166 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)
Here's what the data looks like:
Day Time StomCond_Trunc
165 12 33.57189926
165 12.1 50.29437636
165 12.2 35.59876214
165 12.3 24.39879768
Try this:
aggregate(StomCond_Trunc~Day,data=subset(sap,Time>=12 & Time<=14),mean)
If you have a large dataset, you may also want to look into the data.table package. Converting a data.frame to a data.table is quite easy.
Example:
Large(ish) dataset
df <- data.frame(Day=1:1000000,Time=sample(1:14,1000000,replace=T),StomCond_Trunc=rnorm(1000000)*20)
Using aggregate on the data.frame
system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
user system elapsed
16.255 0.377 24.263
Converting it to a data.table
library(data.table)
dt <- data.table(df,key="Time")
system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
user system elapsed
9.534 0.178 15.270
Update from Matthew. This timing has improved dramatically since this was originally answered, due to a new optimization feature in data.table 1.8.2.
Retesting the difference between the two approaches, using data.table 1.8.2 in R 2.15.1:
df <- data.frame(Day=1:1000000,
                 Time=sample(1:14,1000000,replace=T),
                 StomCond_Trunc=rnorm(1000000)*20)
system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
# user system elapsed
# 10.19 0.27 10.47
dt <- data.table(df,key="Time")
system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
# user system elapsed
# 0.31 0.00 0.31
Using your original method, but with less typing:
sapply(sap[sap$Day==165 & sap$Time %in% seq(12, 14, 0.1), ],mean)
However, this is only slightly better than your original method. It's not as flexible as the other answers, since it depends on the 0.1 increments in your time values; the other methods don't care about the increment size, which makes them more versatile. I'd recommend @Maiasaura's answer with data.table.
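For instance, a range comparison (borrowed from the aggregate answer above) removes the dependence on the 0.1 increment while keeping your sapply approach; a minimal sketch using the same sap columns:
sapply(sap[sap$Day == 165 & sap$Time >= 12 & sap$Time <= 14, ], mean)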
I'm trying to create a tibble that will allow me to see the proportion for each answer in one of my variables.
Currently, my code looks like this:
drinkchoice <- tibble(prop.table(table(surveyq$drink_choice)))
When running the code, it returns the proportion of each answer in the variable but does not list the answers that go with them. For example, it returns a table looking like:
0.007
0.04
0.29
0.13
0.09
but when I remove tibble() from the original line of code, it returns
pepsi 0.007
fanta 0.04
sprite 0.29
brisk 0.13
coke 0.09
I was wondering if there is any way to code it so that, even when using the tibble() function, the result keeps each answer paired with its proportion.
Edit: I added the line tibble::rownames_to_column("Drink"), and I was wondering whether it is possible to fill the answers in under the new column, as that would solve my problem.
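One possible sketch (not an answer from this thread, and assuming the column is surveyq$drink_choice as in the question): converting the proportion table to a data frame first keeps the names as their own column, which the tibble then preserves.
library(tibble)
# prop.table(table(...)) returns a named table; as.data.frame() turns
# the names into a Var1 column and the values into Freq, and as_tibble()
# keeps both as regular columns
drinkchoice <- as_tibble(as.data.frame(prop.table(table(surveyq$drink_choice))))
names(drinkchoice) <- c("Drink", "Proportion")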
I am planning to construct a panel dataset and, as a first step, I am trying to create a vector in which each of 99,573 unique ids is repeated 25 times (one for each of the 25 years I will assign later). What I have so far is:
unique_id <- c(1:99573)
panel <- c()
for(i in 1:99573){
  x <- rep(unique_id[i], 25)
  panel <- append(panel, x)
}
The problem is that the code above takes too much time: RStudio keeps processing and never gives me any output. Is there any other way to speed up the process? Please share any ideas with me.
We don't need a loop here
panel <- rep(unique_id, each = 25)
Benchmarks
system.time(panel <- rep(unique_id, each = 25))
# user system elapsed
# 0.046 0.002 0.047
length(panel)
#[1] 2489325
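If the next step is to attach a year to each repetition, a minimal sketch could combine two rep() calls (the 25-year range here is hypothetical; substitute your actual years):
years <- 1991:2015                                  # hypothetical 25-year range
panel_df <- data.frame(id   = rep(unique_id, each = 25),
                       year = rep(years, times = length(unique_id)))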
I have a data.table with about 3 million rows and 40 columns. I would like to sort this table by descending order within groups like the following sql mock code:
sort by ascending Year, ascending MemberID, descending Month
Is there an equivalent way in data.table to do this? So far I have to break it down into 2 steps:
setkey(X, Year, MemberID)
This is very fast and takes only a few seconds.
X <- X[,.SD[order(-Month)],by=list(Year, MemberID)]
This step takes so much longer (5 minutes).
Update:
Someone commented suggesting X <- X[sort(Year, MemberID, -Month)] and later deleted it. This approach seems to be much faster:
user system elapsed
5.560 11.242 66.236
My approach: setkey() then order(-Month)
user system elapsed
816.144 9.648 848.798
My question now is: if I want to summarize by Year, MemberID and Month after sort(Year, MemberID, Month), does data.table recognize the sort order?
Update 2: in response to Matthew Dowle:
After setkey with Year, MemberID and Month, I still have multiple records per group. What I would like is to summarize for each of the groups. What I meant was: if I use X[order(Year, MemberID, Month)], does the summation utilize the binary search functionality of data.table:
monthly.X <- X[, lapply(.SD[], sum), by = list(Year, MemberID, Month)]
Update 3: Matthew D proposed several approaches. The run time for the first approach is faster than the order() approach:
user system elapsed
7.910 7.750 53.916
Matthew: what surprised me was that converting the sign of Month takes most of the time. Without it, setkey is blazing fast.
Update June 5 2014:
The current development version of data.table, v1.9.3, has two new functions, setorder and setorderv, which do exactly what you require. These functions reorder the data.table by reference, with the option to choose either ascending or descending order for each column to order by. Check out ?setorder for more info.
In addition, DT[order(.)] is also by default optimised to use data.table's internal fast order instead of base:::order. Unlike setorder, this makes an entire copy of the data and is therefore less memory efficient, but it is still orders of magnitude faster than base's order.
Benchmarks:
Here's an illustration of the speed differences between setorder, data.table's internal fast order, and base:::order:
require(data.table) ## 1.9.3
set.seed(1L)
DT <- data.table(Year = sample(1950:2000, 3e6, TRUE),
                 memberID = sample(paste0("V", 1:1e4), 3e6, TRUE),
                 month = sample(12, 3e6, TRUE))
## using base:::order
system.time(ans1 <- DT[base:::order(Year, memberID, -month)])
# user system elapsed
# 76.909 0.262 81.266
## optimised to use data.table's fast order
system.time(ans2 <- DT[order(Year, memberID, -month)])
# user system elapsed
# 0.985 0.030 1.027
## reorders by reference
system.time(setorder(DT, Year, memberID, -month))
# user system elapsed
# 0.585 0.013 0.600
## or alternatively
## setorderv(DT, c("Year", "memberID", "month"), c(1,1,-1))
## are they equal?
identical(ans2, DT) # [1] TRUE
identical(ans1, ans2) # [1] TRUE
On this data, the benchmarks indicate that data.table's internal order is about 79x faster than base:::order, and setorder is about 135x faster than base:::order.
data.table always sorts/orders in the C locale. If you need to order in another locale, only then do you need to resort to DT[base:::order(.)].
All these new optimisations and functions together constitute FR #2405. bit64::integer64 support has also been added.
NOTE: Please refer to the history/revisions for earlier answer and updates.
The comment was mine, so I'll post the answer. I removed it because I couldn't test whether it was equivalent to what you already had. Glad to hear it's faster.
X <- X[order(Year, MemberID, -Month)]
Summarizing shouldn't depend on the order of your rows.
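For the grouped summation in the question, a minimal sketch (using the column names from the question) that works regardless of row order:
library(data.table)
# sum every column in .SD within each Year/MemberID/Month group;
# no prior sort or setkey is required for the result to be correct
monthly.X <- X[, lapply(.SD, sum), by = list(Year, MemberID, Month)]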
ISSUE ---------
I have thousands of time series files (.csv) that contain intermittent data spanning between 20 and 50 years (see df). Each file contains the date_time and a metric (temperature). The data is hourly, and where no measurement exists there is an NA.
>df
date_time temp
01/05/1943 11:00 5.2
01/05/1943 12:00 5.2
01/05/1943 13:00 5.8
01/05/1943 14:00 NA
01/05/1943 15:00 NA
01/05/1943 16:00 5.8
01/05/1943 17:00 5.8
01/05/1943 18:00 6.3
I need to check these files to see if they have sufficient data density, i.e. that the ratio of NA's to data values is not too high. To do this I have 3 criteria that must be checked for each file:
Ensure that no more than 10% of the hours in a day are NA's
Ensure that no more than 10% of the days in a month are NA's
Ensure that there are 3 continuous years of data with valid days and months.
Each criterion must be fulfilled sequentially, and if a file does not meet the requirements then I must add it to a data frame (or any list) of the files that do not meet the criteria.
QUESTION--------
I wanted to ask the community how to go about this. I have considered the value of nested if loops, along with using sqldf, plyr, aggregate or even dplyr. But I do not know the simplest way to achieve this. Any example code or suggestions would be very much appreciated.
I think this will work for you. These functions check every hour for NA's in the following day, month, or 3-year period. They're not tested because I didn't care to make up data to test them. Each function should spit out the number of NA's in the respective time period, so for checkdays a value greater than 2.4 means a problem under your 10% rule; for months the threshold is 72, and for 3-year periods you're hoping for values less than 2628. Again, please check these functions. By the way, the functions assume your NA data is in column 2. Cheers.
checkdays <- function(data){
  countNA <- NULL
  for(i in 1:(length(data[,2]) - 23)){
    nadata <- data[i:(i + 23), 2]
    countNA[i] <- length(nadata[is.na(nadata)])
  }
  return(countNA)
}
checkmonth <- function(data){
  countNA <- NULL
  for(i in 1:(length(data[,2]) - 719)){
    nadata <- data[i:(i + 719), 2]
    countNA[i] <- length(nadata[is.na(nadata)])
  }
  return(countNA)
}
check3years <- function(data){
  countNA <- NULL
  for(i in 1:(length(data[,2]) - 26279)){
    nadata <- data[i:(i + 26279), 2]
    countNA[i] <- length(nadata[is.na(nadata)])
  }
  return(countNA)
}
So I ended up testing these, and they work for me. Here are system times for a dataset one year long, so I don't think you'll have problems.
> system.time(checkdays(RM_W1))
user system elapsed
0.38 0.00 0.37
> system.time(checkmonth(RM_W1))
user system elapsed
0.62 0.00 0.62
Optimization:
I took the time to run these functions with the data you posted above, and it wasn't good. For loops are dangerous because they work well for small data sets but can slow down dramatically as datasets get larger, at least if they're not constructed properly. I cannot report system times for the functions above with your data (they never finished), but I waited about 30 minutes. After reading this awesome post, Speed up the loop operation in R, I rewrote the functions to be much faster. By minimising the amount of work that happens inside the loop and pre-allocating memory, you can really speed things up. You need to call the function like checkdays(df[,2]), but it's faster this way.
checkdays <- function(data){
  countNA <- numeric(length(data) - 23)
  for(i in 1:(length(data) - 23)){
    nadata <- data[i:(i + 23)]
    countNA[i] <- length(nadata[is.na(nadata)])
  }
  return(countNA)
}
> system.time(checkdays(df[,2]))
user system elapsed
4.41 0.00 4.41
I believe this should be sufficient for your needs. As for leap years, you should be able to modify the optimized function as I mentioned in the comments; just make sure you pass the leap-year data as a second dataset rather than as a second column.
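If the loop is still too slow on very long files, a fully vectorised variant is possible. This is a minimal sketch (not part of the answer above) that counts NA's in every rolling window of a given width using cumulative sums, so one hypothetical function covers the day (24), month (720) and 3-year (26280) cases:
check_window <- function(x, width = 24){
  # cumulative count of NA's up to each position
  cs <- cumsum(is.na(x))
  # NA count in each window of `width` consecutive hours
  cs[width:length(x)] - c(0, cs[seq_len(length(x) - width)])
}
# e.g. check_window(df[,2], 24) should mirror checkdays(df[,2])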