I am working with a huge data table in R containing monthly measurements of temperature for multiple locations, taken by different sources.
The dataset looks like this:
library(data.table)
# Generate random data:
loc <- 1:10
dates <- seq(as.Date("2000-01-01"), as.Date("2004-12-31"), by="month")
mods <- c("A","B", "C", "D", "E")
temp <- runif(length(loc)*length(dates)*length(mods), min=0, max=30)
df <- data.table(expand.grid(Location=loc, Date=dates, Model=mods, Temperature=temp))
So basically, for location 1, I have measurements from January 2000 to December 2004 taken by model A. Then I have measurements made by model B, and so on for models C, D, and E. And then the same again for locations 2 through 10.
What I need to do is, instead of having five different temperature measurements (from the models), to take the mean temperature for all the models.
As a result, I would have, for each location and each date, not five but ONLY ONE temperature measurement (that would be a multi-model mean).
I tried this:
df2 <- df[, Mean:=mean(Temperature), by=list(Model, Location, Date)]
which didn't work as I expected. I would at least expect the resulting data table to have one fifth of the rows of the original table, since I am summarizing five measurements into a single one.
What am I doing wrong?
I don't think you generated your test data correctly. The function expand.grid() takes the Cartesian product of all its arguments. I'm not sure why you included the Temperature=temp argument in the expand.grid() call; that duplicates each temperature value for every single key combination, resulting in a data.table with 9 million rows (this is (10*60*5)^2). I think you intended one temperature value per key, which should result in 10*60*5 = 3000 rows:
df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp);
df;
## Location Date Model Temperature
## 1: 1 2000-01-01 A 2.469751
## 2: 2 2000-01-01 A 16.103135
## 3: 3 2000-01-01 A 7.147051
## 4: 4 2000-01-01 A 10.301937
## 5: 5 2000-01-01 A 16.760238
## ---
## 2996: 6 2004-12-01 E 26.293968
## 2997: 7 2004-12-01 E 8.446528
## 2998: 8 2004-12-01 E 29.003001
## 2999: 9 2004-12-01 E 12.076765
## 3000: 10 2004-12-01 E 28.410980
If this is correct, you can generate the means across models with this:
df[,.(Mean=mean(Temperature)),.(Location,Date)];
## Location Date Mean
## 1: 1 2000-01-01 9.498497
## 2: 2 2000-01-01 11.744622
## 3: 3 2000-01-01 15.691228
## 4: 4 2000-01-01 11.457154
## 5: 5 2000-01-01 8.897931
## ---
## 596: 6 2004-12-01 17.587000
## 597: 7 2004-12-01 19.555963
## 598: 8 2004-12-01 15.710465
## 599: 9 2004-12-01 15.322790
## 600: 10 2004-12-01 20.240392
Note that the := operator does not actually aggregate. It only adds, modifies, or deletes columns in the original data.table. It is possible to add a new column (or overwrite an old column) with duplications of an aggregated calculation (e.g. see http://www.r-bloggers.com/two-of-my-favorite-data-table-features/), but that's not what you want.
In general, when you aggregate a table of data, you are necessarily producing a new table that is reduced to one row per aggregation key. The := operator does not do this.
Instead, we need to run an ordinary query on the data.table, grouping by the required aggregation key (which is automatically included in the output data.table) and supplying a j argument that is evaluated once for each group. The result is a reduced version of the original table, with each j result attached to its aggregation key. Since our j argument returns a scalar for each group, the result has one row per Location/Date key.
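To make the difference concrete, here is a minimal sketch on the corrected 3000-row df, contrasting := (which keeps every row) with a grouped j aggregation (which collapses to one row per group):
# := adds a Mean column; the group mean is repeated within each group
df[, Mean := mean(Temperature), by = .(Location, Date)]
nrow(df)    # still 3000
# a grouped j aggregation returns a new, collapsed table
res <- df[, .(Mean = mean(Temperature)), by = .(Location, Date)]
nrow(res)   # 600 (one row per Location/Date combination)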
If we are using data.table, CJ() can also be used to build the grid directly:
CJ(Location = loc, Date = dates, Model = mods)[, Temperature := temp][
  , .(Mean = mean(Temperature)), by = .(Location, Date)]
Related
I'm dealing with tournament results in R where ties can happen. Say two players tie for 3rd place: they would share (3rd_prize + 4th_prize), and each earn (3rd_prize + 4th_prize)/2. If 10 players tie for third place, they would split the sum of the 3rd through 12th prizes, and each get that sum divided by 10.
Given this structure, and given a data.table listing all players, their absolute results, and how many people they drew with, how could we generate a column with everyone's winnings? I don't know how to format sample data in the post, so I'm attaching a link to a google sheet with sample data and a desired result if that's okay!
https://docs.google.com/spreadsheets/d/1fLUZ172Sl_yXVQE3VI0Xo4wSr_SRvaL43MCZIMYen2w/edit?usp=sharing
Here are 2 options:
(1)
prizes[results[, rn:=.I], on=.(Position=rn)][,
.(Person, Winnings=sum(Prize) / .N), .(Position=i.Position)]
Explanation:
Create a column of row indices for results using results[, rn := .I].
Then left join results with the prizes table on that row index: prizes[results[, rn := .I], on = .(Position = rn)].
Then, using the result from step 2, group by Position from results and compute each Person's winnings as the average prize, i.e. [, .(Person, Winnings = sum(Prize) / .N), .(Position = i.Position)].
The assumption is that results is already sorted by Position.
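If that ordering isn't guaranteed, a minimal sketch to enforce it before the join (setorder() is data.table's in-place sort):
setorder(results, Position)   # sort results by Position, by reference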
(2)
Assuming that each row in results receives the prize in the same row of prizes, you can calculate the average prizes by extracting them with row indexing:
results[, Winnings := sum(prizes$Prize[.I], na.rm=TRUE) / .N, Position]
output:
Position Person Winnings
1: 1 A 100.0
2: 2 B 50.0
3: 3 C 17.5
4: 3 D 17.5
5: 4 E 5.0
6: 5 F 4.0
7: 6 G 3.0
8: 7 H 1.0
9: 7 I 1.0
10: 7 J 1.0
data:
library(data.table)
results <- data.table(Person=LETTERS[1:10],
Position=c(1,2,3,3,4,5,6,7,7,7),
tied=c(1,1,2,2,1,1,1,3,3,3))
prizes <- data.table(Position=1:10,
Prize=c(100,50,25,10,5,4,3,2,1,0))
My question is very similar to this one:
How to extract the first n rows per group?
dt
date age name val
1: 2000-01-01 3 Andrew 93.73546
2: 2000-01-01 4 Ben 101.83643
3: 2000-01-01 5 Charlie 91.64371
4: 2000-01-02 6 Adam 115.95281
5: 2000-01-02 7 Bob 103.29508
6: 2000-01-02 8 Campbell 91.79532
We have a dt and I've added an extra column named val. First, we want to extract the first n rows within each group.
The solutions from the link provided are:
dt[, .SD[1:2], by=date] # where 1:2 is the index needed
dt[dt[, .I[1:2], by = date]$V1] # for speed
My question is how do I apply a function to the first n rows within each group if that function depends on the subsetted information. I am trying to apply something like this:
# uses other columns for its result / depends on the subsetted rows,
# but kept simple here for replication
do_something <- function(dt){
  res <- ifelse(cumsum(dt$val) > 200, 1, 0)
  return(res)
}
# first 2 rows of dt by group=date
x <- dt[, .SD[1:2], by=date]
# apply do_something to first 2 rows of dt by group=date
x[, list('age'=age,'name'=name,'val'=val, 'funcVal'= do_something(.SD[1:2])),by=date]
date age name val funcVal
1: 2000-01-01 3 Andrew 93.73546 0
2: 2000-01-01 4 Ben 101.83643 1
3: 2000-01-02 6 Adam 115.95281 0
4: 2000-01-02 7 Bob 103.29508 1
Am I going about this wrong? Is there a more efficient way to do this? I cannot seem to figure out how to apply the "for speed" solution to this. Is there a way to do this without saving the subset-ed results first and applying the function to the first 2 rows by date right away?
Any help is appreciated and below is the code to produce the data above:
set.seed(1) # reproduces the dt values printed above
date <- c("2000-01-01","2000-01-01","2000-01-01",
          "2000-01-02","2000-01-02","2000-01-02")
age <- c(3,4,5,6,7,8)
name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell")
val <- rnorm(6, 100, 10)
dt <- data.table(date, age, name, val)
In case there's more than one grouping column, it might be more efficient to collapse to one:
m = dt[, .(g = .GRP, r = .I[1:2]), by = date]
dt[m$r, v := ff(.SD), by=m$g, .SDcols="val"]
This is just an extension to #eddi's approach (of keeping row numbers .I, seen in #akrun's answer) to also keep group counter .GRP.
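For reference, with the six-row dt above, the helper table m from the first line holds the first two row numbers of each date group alongside the group counter:
m
#          date g r
# 1: 2000-01-01 1 1
# 2: 2000-01-01 1 2
# 3: 2000-01-02 2 4
# 4: 2000-01-02 2 5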
Re OP's comment that they're more concerned about the function, well, borrowing from #akrun, there's ...
ff = function(x) as.integer(cumsum(x[[1]]) > 200)
Assuming all values are nonnegative, you could probably handle this in C more efficiently, since the cumulative sum can stop as soon as the threshold is reached. For the special case of two rows, that will hardly matter, though.
My impression is that this is a dummy function so there's no point going there. Many efficiency improvements that I usually think of are contingent on the function and data.
We can use as.integer on the cumsum to coerce the logical to binary. Extract the row index, specify it as i, group by 'date', and apply the function to the 'val' column:
f1 <- function(x) as.integer(cumsum(x) > 200)
i1 <- dt[, .I[1:2], by = date]$V1
dt[i1, newcol := f1(val), date]
I have a dataframe. For simplicity, I am leaving out many columns and rows:
Distance Type
1 162 A
2 27182 A
3 212 C
4 89 B
5 11 C
I need to find 6 consecutive rows in the dataframe, such that the average distance is 1000, and such that the only types considered are A or B. Just for clarification, one may think to filter out all Type C rows, and then proceed, but then the rows that were not originally consecutive will become consecutive upon filtering, and that's no good.
For example, if I filtered out rows 3 and 5 above, I would be left with 3 rows. And if I had provided more rows, that might produce a faulty result.
Maybe a solution with the data.table library?
For reproducibility, here is a data sample, based on what you wrote.
library(data.table)
# data orig (with row numbers...)
DO<-"Distance Type
1 162 A
2 27182 A
3 212 C
4 89 B
5 11 C
6 1234 A"
# data : sep by comma
DS<-gsub('[[:blank:]]+',';',DO)
# data.frame
DF<-read.table(textConnection(DS),header=T,sep=';',stringsAsFactors = F)
#data.table
DT<-as.data.table(DF)
Then, make a function that increments a counter each time a new run of identical values starts:
# function to set sequential group numbers
mkGroupRep <- function(x){
  cnt <- 1L
  grp <- 1L
  lx <- length(x)
  ne <- x[-lx] != x[-1L]   # TRUE where the next element differs from the current one
  for(i in seq_along(ne)){
    if(ne[i]) cnt <- cnt + 1L
    grp[i + 1] <- cnt
  }
  grp
}
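As an aside: assuming a reasonably recent data.table version, rleid() produces exactly this kind of run-length group id, so the helper above could be replaced with a one-liner:
DT[, grp := rleid(Type)]   # same run-based group numbering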
And use it with data.table's 'multiple assignment by reference':
# update dat : set group number based on sequential type
DT[,grp:=mkGroupRep(Type)]
# calc sum of distance and number of item in group, by group
DT[,`:=`(
distMean=mean(Distance),
grpLength=.N
),by=grp]
# filter what you want :
DT[(Type != 'C' & distMean > 100 & grpLength == 2) | grpLength == 3]
Output :
Distance Type grp distMean grpLength
1: 162 A 1 13672 2
2: 27182 A 1 13672 2
If I specify n columns as a key of a data.table, I'm aware that I can join to fewer columns than are defined in that key as long as I join to the head of key(DT). For example, for n=2 :
X = data.table(A=rep(1:5, each=2), B=rep(1:2, each=5), key=c('A','B'))
X
A B
1: 1 1
2: 1 1
3: 2 1
4: 2 1
5: 3 1
6: 3 2
7: 4 2
8: 4 2
9: 5 2
10: 5 2
X[J(3)]
A B
1: 3 1
2: 3 2
There I only joined to the first column of the 2-column key of DT. I know I can join to both columns of the key like this :
X[J(3,1)]
A B
1: 3 1
But how do I subset using only the second column of the key (e.g. B==2), while still using binary search rather than a vector scan? I'm aware that's a duplicate of:
Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan
so I'd like to generalise this question to n. My data set has about a million rows, and the solution provided in the duplicate question linked above doesn't seem to be optimal.
Here is a simple function that will extract the correct unique values and return a data table to use as a key.
X <- data.table(A=rep(1:5, each=4), B=rep(1:4, each=5),
C = letters[1:20], key=c('A','B','C'))
make.key <- function(ddd, what){
  # the names of the key columns
  zzz <- key(ddd)
  # the key columns for which all unique values should be kept
  whichUnique <- setdiff(zzz, names(what))
  ## unique values of those key columns; .. means "look up one level"
  ud <- lapply(ddd[, ..whichUnique], unique)
  ## append the `what` columns and take a Cross Join (CJ) of the
  ## key columns, in key order
  do.call(CJ, c(ud, what)[zzz])
}
X[make.key(X, what = list(C = c('a','b'))),nomatch=0]
## A B C
## 1: 1 1 a
## 2: 1 1 b
I'm not sure this will be any quicker than a couple of vector scans on a large data.table though.
Adding secondary keys is on the feature request list :
FR#1007 Build in secondary keys
In the meantime we are stuck with either vector scan, or the approach used in the answer to the n=2 case linked in the question (which #mnel generalises nicely in his answer).
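For readers on newer data.table versions: secondary indices and the on= argument were added later and now allow a binary-search subset on a non-key column directly (a sketch, assuming a recent data.table):
setindex(X, B)       # optional: build a secondary index on B
X[.(2), on = "B"]    # binary-search subset where B == 2, no re-keying needed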
I'm new to R and struggling to group multiple levels of a factor prior to calculating means. This problem is complicated by the fact that I am doing this over hundreds of files that have variable levels of factors that need to be grouped. I see from previous posts how to address this grouping issue for single levels using levels(), but my data is too variable for that method.
Basically, I'd like to calculate both individual and then an overall mean for multiple levels of a factor. For example, I would like to calculate the mean for each species for each of the following factors present in the column Status: Crypt1, Crypt2, Crypt3, Native, Intro, and then also the overall mean for Crypt species (includes Crypt1, Crypt2, and Crypt3, but not Native or Intro). However, a species either has multiple levels of Crypt (variable, and up to Crypt8), or has Native and Intro, and means for all species at each of these levels are ultimately averaged into the same summary sheet.
For example:
Species Status Value
A Crypt1 5
A Crypt1 6
A Crypt2 4
A Crypt2 8
A Crypt3 10
A Crypt3 50
B Native 2
B Native 9
B Intro 9
B Intro 10
I was thinking that I could use the first letter of each factor to group the Crypt factors together, but I am struggling to target the first letter because they are factors, not strings, and I am not sure how to convert between them. I'm ultimately calculating the means using aggregate(), and I can get individual means for each factor, but not for the grouped factors.
Any ideas would be much appreciated, thanks!
For the individual means:
# assuming your data is in data.frame = df
require(plyr)
df.1 <- ddply(df, .(Species, Status), summarise, ind.m.Value = mean(Value))
> df.1
# Species Status ind.m.Value
# 1 A Crypt1 5.5
# 2 A Crypt2 6.0
# 3 A Crypt3 30.0
# 4 B Intro 9.5
# 5 B Native 5.5
For the overall mean, the idea is to remove the numbers present at the end of every entry in Status using sub/gsub.
df.1$Status2 <- gsub("[0-9]+$", "", df.1$Status)
df.2 <- ddply(df.1, .(Species, Status2), summarise, oall.m.Value = mean(ind.m.Value))
> df.2
# Species Status2 oall.m.Value
# 1 A Crypt 13.83333
# 2 B Intro 9.50000
# 3 B Native 5.50000
Is this what you're expecting?
Here's an alternative. Conceptually, it is the same as Arun's answer, but it sticks to functions in base R, and in a way, keeps your workspace and original data somewhat tidy.
I'm assuming we're starting with a data.frame named "temp" and that we want to create two new data.frames, "T1" and "T2" for individual and grouped means.
# Verify that you don't have T1 and T2 in your workspace
ls(pattern = "T[1|2]")
# character(0)
# Use `with` to generate T1 (individual means)
# and to generate T2 (group means)
with(temp, {
  T1 <<- aggregate(Value ~ Species + Status, temp, mean)
  temp$Status <- gsub("\\d+$", "", Status)
  T2 <<- aggregate(Value ~ Species + Status, temp, mean)
})
# Now they're there!
ls(pattern = "T[1|2]")
# [1] "T1" "T2"
Notice that we used <<- to assign the results from within with to the global environment. Not everyone likes using that, but I think it is OK in this particular case. Here is what "T1" and "T2" look like.
T1
# Species Status Value
# 1 A Crypt1 5.5
# 2 A Crypt2 6.0
# 3 A Crypt3 30.0
# 4 B Intro 9.5
# 5 B Native 5.5
T2
# Species Status Value
# 1 A Crypt 13.83333
# 2 B Intro 9.50000
# 3 B Native 5.50000
Looking back at the with command, it might have seemed like we had changed the value of the "Status" column. However, that was only within the environment created by using with. Your original data.frame is the same as it was when you started.
temp
# Species Status Value
# 1 A Crypt1 5
# 2 A Crypt1 6
# 3 A Crypt2 4
# 4 A Crypt2 8
# 5 A Crypt3 10
# 6 A Crypt3 50
# 7 B Native 2
# 8 B Native 9
# 9 B Intro 9
# 10 B Intro 10
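As a footnote on the <<- design choice discussed above, here is a sketch of the same two tables built with ordinary assignments, using transform() on a throwaway copy instead of modifying anything inside with():
T1 <- aggregate(Value ~ Species + Status, temp, mean)
T2 <- aggregate(Value ~ Species + Status,
                transform(temp, Status = gsub("\\d+$", "", Status)), mean)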