I have a large dataset and a lookup table. I need to return for each row in the dataset the smallest value present for rows in the lookup where conditions are met.
Given the size of my dataset I'm reluctant to hack an iffy solution together by cross-joining as this would create many millions of records. I'm hoping someone can suggest a solution that (ideally) leverages base r or data.table since these are already in use in an efficient manner.
Example
A<-seq(1e4,9e4,1e4)
B<-seq(0,1e4,1e3)
dt1<-data.table(expand.grid(A,B),ID=1:nrow(expand.grid(A,B)))
setnames(dt1, c("Var1","Var2"),c("A","B"))
lookup<-data.table(minA=c(1e4,1e4,2e4,2e4,5e4),
maxA=c(2e4,3e4,7e4,6e4,9e4),
minB=rep(2e3,5),
Val=seq(.1,.5,.1))
# Sample Desired Value
A B ID Val
99: 90000 10000 99 0.5
In SQL, I would then write something along the lines of
SELECT ID, A, B, min(Val) as Val
FROM dt1
LEFT JOIN lookup on dt1.A>=lookup.minA
and dt1.A<=lookup.maxA
and dt1.B>=lookup.minB
GROUP BY ID, A, B
Which would join all matching records from lookup to dt1 and return the smallest Val.
Update
My solution so far looks like:
CJ.table<-function(X,Y) setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
dt1.lookup<- CJ.table(dt1,lookup)[A>=minA & A<=maxA & B>=minB,
list(Val=Val[which.min( Val)]),
by=list(ID,A,B)]
dt1.lookup<-rbind.fill(dt1.lookup, dt1[!ID %in% dt1.lookup$ID])
This retrieves all records and allows the return of additional columns from the lookup table if I need them. It also has the benefit of enforcing the pick of the minimum Val.
A solution I found without cross joining first needs to prepare the data by getting rid of rows where A and B are out of range entirely:
Prep = dt1[A >= min(lookup$minA) & A <= max(lookup$maxA) & B >= min(lookup$minB)]
Then you make a data table of where each of the conditions are met that correspond to the lowest possible Val:
Indices = Prep[,list(min(which(A >= lookup$minA)),
min(which(A <= lookup$maxA)),
min(which(B >= lookup$minB)), A, B),by=ID]
Then you must get Val at the lowest point where all three conditions are satisfied:
Indices[,list(Val=lookup$Val[max(V1,V2,V3)], A, B),by=ID]
See if this gets you what you're looking for:
ID Val A B
1: 19 0.1 10000 2000
2: 20 0.1 20000 2000
3: 21 0.2 30000 2000
4: 22 0.3 40000 2000
5: 23 0.3 50000 2000
6: 24 0.3 60000 2000
7: 25 0.3 70000 2000
8: 26 0.5 80000 2000
9: 27 0.5 90000 2000
10: 28 0.1 10000 3000
My first thought was trying to make an index like Senor O did. However, the min(Val) made the index table tougher for me to think through. The way I thought to do it was to loop through the lookup table.
dt1[,Val:=as.numeric(NA)]
for (row in 1:NROW(lookup)) {
dt1[A>=lookup[order(Val)][row,minA]&A<=lookup[order(Val)][row,maxA]&B>=lookup[order(Val)][row,minB]&is.na(Val),Val:=lookup[order(Val)][row,Val]]
}
I think this should work because it first sets the new column with NA values.
Then it puts the lookup table in order by Val so you're going to get the lowest value of it.
At each loop it will only potentially changes values in dt1 if they were still NA in Val and since we're looping through lookup in order of smallest Val to biggest it will ensure you get the min(Val) that you wanted.
replace the rbind.fill line with
rbindlist(list(dt1.lookup,dt1[!ID %in% dt1.lookup[,ID]][,list(ID, A, B, Val=as.numeric(NA))]))
it will eliminate reliance on the reshape package and I think it'll be faster.
Related
I have a dataframe like this:
mydf <- data.frame(A = c(40,9,55,1,2), B = c(12,1345,112,45,789))
mydf
A B
1 40 12
2 9 1345
3 55 112
4 1 45
5 2 789
I want to retain only 95% of the observations and throw out 5% of the data that have extreme values. First, I calculate how many observations they are:
th <- length(mydf$A) * 0.95
And then I want to remove all the rows above the th (or retain the rows below the th, as you wish). I need to sort mydf in an ascending order, to remove only those extreme values. I tried several approaches:
mydf[order(mydf["A"], mydf["B"]),]
mydf[order(mydf$A,mydf$B),]
mydf[with(mydf, order(A,B)), ]
plyr::arrange(mydf,A,B)
but nothing works, so mydf is not sorted in ascending order by the two columns at the same time. I looked here Sort (order) data frame rows by multiple columns but the most common solutions do not work and I don't get why.
However, if I consider only one column at a time (e.g., A), those ordering methods work, but then I don't get how to throw out the extreme values, because this:
mydf <- mydf[(order(mydf$A) < th),]
removes the second row that has a value of 9, while my intent is to subset mydf retaining only the values below threshold (intended in this case as number of observations, not value).
I can imagine it is something very simple and basic that I am missing... And probably there are nicer tidyverse approaches.
I think you want rank here, but it doesn't work on multiple columns. To work around that, note that rank(.) is equivalent to order(order(.)):
rank(mydf$A)
# [1] 4 3 5 1 2
order(order(mydf$A))
# [1] 4 3 5 1 2
With that, we can order on both (all) columns, then order again, then compare the resulting ranks with your th value.
mydf[order(do.call(order, mydf)) < th,]
# A B
# 1 40 12
# 2 9 1345
# 4 1 45
# 5 2 789
This approach benefits from preserving the natural sort of the rows.
If you would prefer to stick with a single call to order, then you can reorder them and use head:
head(mydf[order(mydf$A, mydf$B),], th)
# A B
# 4 1 45
# 5 2 789
# 2 9 1345
# 1 40 12
though this does not preserve the original order of rows (which may or may not be important to you).
Possible approach
An alternative to your approach would be to use a dplyr ranking function such as cume_dist() or percent_rank(). These can accept a dataframe as input and return ranks / percentiles based on all columns.
set.seed(13)
dat_all <- data.frame(
A = sample(1:60, 100, replace = TRUE),
B = sample(1:1500, 100, replace = TRUE)
)
nrow(dat_all)
# 100
dat_95 <- dat_all[cume_dist(dat_all) <= .95, ]
nrow(dat_95)
# 95
General cautions about quantiles
More generally, keep in mind that defining quantiles is slippery, as there are multiple possible approaches. You'll want to think about what makes the most sense given your goal. As an example, from the dplyr docs:
cume_dist(x) counts the total number of values less than or equal to x_i, and divides it by the number of observations.
percent_rank(x) counts the total number of values less than x_i, and divides it by the number of observations minus 1.
Some implications of this are that the lowest value is always 1 / nrow() for cume_dist() but 0 for percent_rank(), while the highest value is always 1 for both methods. This means different cases might be excluded depending on the method. It also means the code I provided will always remove the highest-ranking row, which may or may not match your expectations. (e.g., in a vector with just 5 elements, is the highest value "above the 95th percentile"? It depends on how you define it.)
I have a situation where I have to create a moving sum for the past 6 months. My data looks like
A B 20-Jan-18 20
A B 20-Mar-18 45
A B 10-Apr-18 15
A B 21-May-18 30
A B 30-Jul-18 10
A B 15-Aug-18 25
And the expected result is
A B 20-Jan-18 20 20 Sum of row1
A B 20-Mar-18 45 65 Sum of row1+2
A B 10-Apr-18 15 80 Sum of row1+2+3
A B 21-May-18 30 110 Sum of row1+2+3+4
A B 30-Jul-18 10 100 Sum of row2+3+4+5 (as row1 is > 6 months in the past)
A B 15-Aug-18 25 125 Sum of row2+3+4+5+6
I tried to use the solution proposed in an earlier thread by inserting dummy records for dates where there is no record and then using ROWS BETWEEN 181 PRECEDING AND CURRENT ROW
But there may be situations where there are multiple records on the same day which means that choosing the last 181 rows will lead to the earliest record getting dropped.
I have checked a lot of cases on this forum and others but can't find a solution for this moving average where the window size is not constant. Please help.
Teradata doesn't implement RANGE in Windowed Aggregates, but you can use old-style SQL to get the same result. If the number of rows per group is not too high it's very efficient, but needs an intermediate table (unless the GROUP BY columns are the PI of the souce table). The self-join on the PI columns results in an AMP-local direct join plus aggregated locally, without matching PIs it will be a less efficient join plus aggregated globally
create volatile table vt as
( select a,b,datecol,sumcol
from mytable
) with data
primary index(a,b);
select t1.a, t1.b, t1.datecol
,sum(t2.sumcol)
from vt as t1
join vt as t2
on t1.a=t2.a
and t1.b=t2.b
and t2.datecol between t1.datecol -181 and t1.datecol
group by 1,2,3
Of course this will not work as expected if there are multiple rows per day (this will increase the number of rows for the sum due to the n*m join). You need some unique column combination, this defect_id might be useful.
Otherwise you need to switch to a Scalar Subquery which takes care about non-uniqueness, but is usually less efficient:
create volatile table vt as
( select a,b,defect_id,datecol,sumcol
from mytable
) with data
primary index(a,b);
select t1.*
,(select sum(t2.sumcol)
from vt as t2
where t1.a=t2.a
and t1.b=t2.b
and t2.datecol between t1.datecol -181 and t1.datecol
)
from vt as t1
To use your existing approach you must aggregate those multiple rows per day first
I have the following data with two observations per subject:
SUBJECT <- c(8,8,10,10,11,11,15,15)
POSITION <- c("H","L","H","L","H","L","H","L")
TIME <- c(90,90,30,30,30,30,90,90)
RESPONSE <- c(5.6,5.2,0,0,4.8,4.9,1.2,.9)
DATA <- data.frame(SUBJECT,POSITION,TIME,RESPONSE)
I want the rows of DATA which have SUBJECT numbers that are in a vector, V:
V <- c(8,10,10)
How can I obtain both observations from DATA whose SUBJECT number is in V and have those observations repeated the same number of times as the corresponding SUBJECT number appears in V?
Desired result:
SUBJECT <- c(8,8,10,10,10,10)
POSITION <- c("H","L","H","L","H","L")
TIME <- c(90,90,30,30,30,30)
RESPONSE <- c(5.6,5.2,0,0,0,0)
OUT <- data.frame(SUBJECT,POSITION,TIME,RESPONSE)
I thought some variation of the %in% operator would do the trick but it does not account for repeated subject numbers in V. Even though a subject number is listed twice in V, I only get one copy of the corresponding rows in DATA.
I could also create a loop and append matching observations but this piece is inside a bootstrap sampler and this option would dramatically increase computation time.
merge is your friend:
merge(list(SUBJECT=V), DATA)
# SUBJECT POSITION TIME RESPONSE
#1 8 H 90 5.6
#2 8 L 90 5.2
#3 10 H 30 0.0
#4 10 L 30 0.0
#5 10 H 30 0.0
#6 10 L 30 0.0
As #Frank implies, this logic can be translated to data.table or dplyr or sql of anything else that will handle a left-join.
i have a smallish (2k) data set that contains questionnaire answers filled out by students there were sampled twice a year. not all the students that were present for the first wave were there for the second wave and vice versa. for each student, a unique id was created that consisted of the school code, the class code, the student number and the wave as a decimal point. for example 100612.1 is a student from school 10, grade 6, 12 on the names list and this was the first wave. the idea behind the decimal point was a way to identify the same student again in the data set (the only value which differs less than abs(1) from a given id is the same student on the other wave).at least that was the idea.
i was thinking of a script that would do the following:
- find the rows who's unique id is less than abs(1) from one another
- for those rows, generate a new row (in a new table) that consists of the student id and the delta of the measured variables( i.e value in the wave 2 - value in wave 1).
i a new to R but i have a tiny bit of background in other OOP. i thought about creating a for loop that runs from 1 to length(df) and just looks for it's "brother". my gut feeling tells me that this not the way things are done in R. any ideas?
all i need is a quick way of sifting through the data looking for the second wave row. i think the rest should be straight forward from there.
thank you for helping
PS. since this is my first post here i apologize beforehand for any wrongdoings in this post... :)
The question alludes to data.table, so here is a way to adapt #jed's answer using that package.
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
Example data as before, now instead of data.frame and tapply you can do this:
library(data.table)
surveyDT <- data.table(ids, answers)
surveyDT[, `:=` (child = substr(ids, 1, 6), wave = substr(ids, 8, 8))] # split ID's
# note multiple assign-by-reference := syntax above
setkey(surveyDT, child, wave) # order data
# calculate delta on keyed data, grouping by child
surveyDT[, delta := diff(answers), by = child]
unique(surveyDT[, delta, by = child]) # list results
child delta
1: 100612 -1
2: 100613 1
3: 110714 NA
4: 201802 NA
To remove rows with NA values for delta:
unique(surveyDT[, .SD[(!is.na(delta))], by = child])
child ids answers wave delta
1: 100612 100612.1 5 1 -1
2: 100613 100613.1 3 1 1
Use .SDcols to output only specific columns (in addition to the by columns), for example,
unique(surveyDT[, .SD[(!is.na(delta))], by = child, .SDcols = 'delta'])
child delta
1: 100612 -1
2: 100613 1
It took me some time to get acquainted with data.table syntax, but now I find it more intuitive, and it's fast for big data.
There are two ways that come to mind. The easiest is to use the function floor(), which returns the integer For example:
floor(100612.1)
#[1] 100612
floor(9.9)
#[1] 9
Alternatively, you could write a fairly simple regex expression to get rid of the decimal place too. Then you can use unique() to find the rows that are or are not duplicated entries.
Lets make some fake data so we can see our problem easily:
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
survey <- data.frame(ids,answers)
Now lets split our ids into two different columns:
survey$child_id <- substr(survey$ids,1,6)
survey$wave_id <- substr(survey$ids,8,8)
Then we'll order by child and wave, and compute differences based on child:
survey[order(survey$child_id, survey$wave_id),]
survey$delta <- unlist(tapply(survey$answers, survey$child_id, function(x) c(NA,diff(x))))
Output:
ids answers child_id wave_id delta
1 100612.1 5 100612 1 NA
2 100612.2 4 100612 2 -1
3 100613.1 3 100613 1 NA
4 100613.2 4 100613 2 1
5 110714.1 1 110714 1 NA
6 201802.2 0 201802 2 NA
I would like to apply some function on each row of a dataframe in R.
The function can return a single-row dataframe or nothing (I guess 'return ()' return nothing?).
I would like to apply this function on each of the rows of a given dataframe, and get the resulting dataframe (which is possibly shorter, i.e. has less rows, than the original one).
For example, if the original dataframe is something like:
id size name
1 100 dave
2 200 sarah
3 50 ben
And the function I'm using gets a row n the dataframe (i.e. a single-row dataframe), returns it as-is if the name rhymes with "brave", otherwise returns null, then the result should be:
id size name
1 100 dave
This example actually refers to filtering a dataframe, and I would love to get both an answer specific to this kind of task but also to a more general case when even the result of the helper function (the one that operates on a single row) may be an arbitrary data frame with a single row. Please note than even in the case of filtering, I would like to use some sophisticated logic (not something simple like $size>100, but a more complex condition that is checked by a function, let's say boo(single_row_df).
P.s.
What I have done so far in these cases is to use apply(df, MARGIN=1) then do.call(rbind ...) but I think it give me some trouble when my dataframe only has a single row (I get Error in do.call(rbind, filterd) : second argument must be a list)
UPDATE
Following Stephen reply I did the following:
ranges.filter <- function(ranges,boo) {
subset(x=ranges,subset=!any(boo[start:end]))
}
I then call ranges.filter with some ranges dataframe that looks like this:
start end
100 200
250 400
698 1520
1988 2147
...
and some boolean vector
(TRUE,FALSE,TRUE,TRUE,TRUE,...)
I want to filter out any ranges that contain a TRUE value from the boolean vector. For example, the first range 100 .. 200 will be left in the data frame iff the boolean vector is FALSE in positions 100 .. 200.
This seems to do the work, but I get a warning saying numerical expression has 53 elements: only the first used.
For the more general case of processing a dataframe, get the plyr package from CRAN and look at the ddply function, for example.
install.packages(plyr)
library(plyr)
help(ddply)
Does what you want without masses of fiddling.
For example...
> d
x y z xx
1 1 0.68434946 0.643786918 8
2 2 0.64429292 0.231382912 5
3 3 0.15106083 0.307459540 3
4 4 0.65725669 0.553340712 5
5 5 0.02981373 0.736611949 4
6 6 0.83895251 0.845043443 4
7 7 0.22788855 0.606439470 4
8 8 0.88663285 0.048965094 9
9 9 0.44768780 0.009275935 9
10 10 0.23954606 0.356021488 4
We want to compute the mean and sd of x within groups defined by "xx":
> ddply(d,"xx",function(r){data.frame(mean=mean(r$x),sd=sd(r$x))})
xx mean sd
1 3 3.0 NA
2 4 7.0 2.1602469
3 5 3.0 1.4142136
4 8 1.0 NA
5 9 8.5 0.7071068
And it gracefully handles all the nasty edge cases that sometimes catch you out.
You may have to use lapply instead of apply to force the result to be a list.
> rhymesWithBrave <- function(x) substring(x,nchar(x)-2) =="ave"
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) dfr[i,] else NULL,
+ dfr))
id size name
1 1 100 dave
But in this case, subset would be more appropriate:
> subset(dfr,rhymesWithBrave(name))
id size name
1 1 100 dave
If you want to perform additional transformations before returning the result, you can go back to the lapply approach above:
> add100tosize <- function(x) within(x,size <- size+100)
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) add100tosize(dfr[i,])
+ else NULL,dfr))
id size name
1 1 200 dave
Or, in this simple case, apply the function to the output of subset.
> add100tosize(subset(dfr,rhymesWithBrave(name)))
id size name
1 1 200 dave
UPDATE:
To select rows that do not fall between start and end, you might construct a different function (note: when summing result of boolean/logical vectors, TRUE values are converted to 1s and FALSE values are converted to 0s)
test <- function(x)
rowSums(mapply(function(start,end,x) x >= start & x <= end,
start=c(100,250,698,1988),
end=c(200,400,1520,2147))) == 0
subset(dfr,test(size))
It sounds like you want to use subset:
subset(orig.df,grepl("ave",name))
The second argument evaluates to a logical expression that determines which rows are kept. You can make this expression use values from as many columns as you want, eg grepl("ave",name) & size>50