R equivalent of update query with inner join and conditions - r

I infrequently use Access to update one table with another table using an inner join and some selection conditions and am trying to find a method to do this sort of operation in R.
# Example data to be updated
ID <- c('A','A','A','B','B','B','C','C','C')
Fr <- c(0,1.5,3,0,1.5,4.5,0,3,6)
To <- c(1.5,3,6,1.5,4.5,9,3,6,9)
dfA <- data.frame(ID,Fr,To)
dfA$Vl <- NA
I wish to update dfA$Vl using the Vl field in a second data frame, as below:
# Example data to do the updating
ID <- c('A','A','B','B','B','C','C','C')
Fr <- c(0,3,0,1,3,0,4,7)
To <- c(3,6,1,3,9,4,7,9)
Vl <- c(1,2,3,4,5,6,7,8)
dfB <- data.frame(ID,Fr,To,Vl)
The following is the Access SQL syntax I would use for this type of update
UPDATE DfA INNER JOIN DfB ON DfA.ID = DfB.ID SET DfA.Vl = [DfB].[Vl]
WHERE (((DfA.Fr)<=[DfB].[To]) AND ((DfA.To)>[DfB].[Fr]));
This reports that 14 rows are being updated (even though there are only 9 in dfA), as some rows meet the selection conditions more than once and the updates are applied sequentially. I'm not concerned about this inconsistency, as the result is sufficient for the intended purpose. However, it would be more precise to match each row of DfA to the DfB row whose (Fr, To) range overlaps it the most (bonus points for that solution).
The result I end up with from Access is as follows
# Result
ID <- c('A','A','A','B','B','B','C','C','C')
Fr <- c(0,1.5,3,0,1.5,4.5,0,3,6)
To <- c(1.5,3,6,1.5,4.5,9,3,6,9)
Vl <- c(1,1,2,4,5,5,6,7,8)
dfC <- data.frame(ID,Fr,To,Vl)
So the question is: what is the best R way to approach this operation, or alternatively (or additionally), how can the Access SQL be reproduced with the R SQL packages? Also (for extra credit), how can I make sure the value with the majority To-Fr overlap is the one applied, not necessarily the one from the last update operation?

A possible approach using data.table:
library(data.table)
setDT(dfA); setDT(dfB); setDT(dfC);
dfA[, rn:=.I]
#non equi join like your ACCESS sql
dfB[dfA, on=.(ID, To>=Fr, Fr<To), .(rn, i.ID, i.Fr, i.To, x.Vl, x.Fr, x.To)][,
#calculate overlapping range
rng := pmin(x.To, i.To) - pmax(x.Fr, i.Fr)][,
#find the rows with max overlapping range and in case of dupes, choose the first row
first(.SD[rng==max(rng), .(ID=i.ID, Fr=i.Fr, To=i.To, Vl=x.Vl)]), by=.(rn)]
output:
rn ID Fr To Vl
1: 1 A 0.0 1.5 1
2: 2 A 1.5 3.0 1
3: 3 A 3.0 6.0 2
4: 4 B 0.0 1.5 3 #diff from dfC as Vl=3 has a bigger overlap
5: 5 B 1.5 4.5 4 #diff from dfC. both overlaps by 1.5 so either 4/5 works
6: 6 B 4.5 9.0 5
7: 7 C 0.0 3.0 6
8: 8 C 3.0 6.0 7
9: 9 C 6.0 9.0 8
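For readers without data.table, the same idea can be sketched in base R. This is only an illustrative sketch, not the answer's method: the helper columns rn, rnB and overlap are names introduced here, and ties are broken by taking the first matching row of dfB, which is an assumption mirroring the first() call above.

```r
# Example data from the question
dfA <- data.frame(ID = rep(c("A", "B", "C"), each = 3),
                  Fr = c(0, 1.5, 3, 0, 1.5, 4.5, 0, 3, 6),
                  To = c(1.5, 3, 6, 1.5, 4.5, 9, 3, 6, 9))
dfB <- data.frame(ID = c("A", "A", "B", "B", "B", "C", "C", "C"),
                  Fr = c(0, 3, 0, 1, 3, 0, 4, 7),
                  To = c(3, 6, 1, 3, 9, 4, 7, 9),
                  Vl = 1:8)

# Row indices (hypothetical helper columns) to track rows through the join
dfA$rn  <- seq_len(nrow(dfA))
dfB$rnB <- seq_len(nrow(dfB))

# Inner join on ID, then apply the same conditions as the Access WHERE clause
j <- merge(dfA, dfB, by = "ID", suffixes = c("", ".B"))
j <- j[j$Fr <= j$To.B & j$To > j$Fr.B, ]
n_matches <- nrow(j)  # number of candidate updates

# Length of the overlap between the dfA range and the dfB range
j$overlap <- pmin(j$To, j$To.B) - pmax(j$Fr, j$Fr.B)

# Keep, per dfA row, the match with the largest overlap (first dfB row on ties)
j <- j[order(j$rn, -j$overlap, j$rnB), ]
best <- j[!duplicated(j$rn), ]

dfA$Vl <- best$Vl[match(dfA$rn, best$rn)]
```

Here n_matches counts the joined rows that satisfy the conditions, which corresponds to the number of updates Access reports.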

Related

Allocating prizes in tournament with ties using R and data.table

I'm dealing with tournament results in R where ties can happen. Say two players tie for 3rd place. They would share (3rd_prize + 4th_prize), and each earn (3rd_prize + 4th_prize)/2. If 10 players tie for third place, they would split the sum of 3rd through 12th place, and each get that sum over 10.
Given this structure, and given a data.table listing all players, their absolute results, and how many people they drew with, how could we generate a column with everyone's winnings? I don't know how to format sample data in the post, so I'm attaching a link to a google sheet with sample data and a desired result if that's okay!
https://docs.google.com/spreadsheets/d/1fLUZ172Sl_yXVQE3VI0Xo4wSr_SRvaL43MCZIMYen2w/edit?usp=sharing
Here are 2 options:
(1)
prizes[results[, rn:=.I], on=.(Position=rn)][,
.(Person, Winnings=sum(Prize) / .N), .(Position=i.Position)]
Explanation:
1. Create a sequence of row indices for results using results[, rn:=.I].
2. Left join results and the prizes table on that row index using prizes[results[, rn:=.I], on=.(Position=rn)].
3. Using the result from step 2, group by Position in results and calculate the average prize for each Person, i.e. [, .(Person, Winnings=sum(Prize) / .N), .(Position=i.Position)].
The assumption is that results is already sorted by Position.
(2)
Assuming that each row in results receives the prize in the same row of prizes, you can calculate the average prizes after extracting them by index:
results[, Winnings := sum(prizes$Prize[.I], na.rm=TRUE) / .N, Position]
output:
Position Person Winnings
1: 1 A 100.0
2: 2 B 50.0
3: 3 C 17.5
4: 3 D 17.5
5: 4 E 5.0
6: 5 F 4.0
7: 6 G 3.0
8: 7 H 1.0
9: 7 I 1.0
10: 7 J 1.0
data:
library(data.table)
results <- data.table(Person=LETTERS[1:10],
Position=c(1,2,3,3,4,5,6,7,7,7),
tied=c(1,1,2,2,1,1,1,3,3,3))
prizes <- data.table(Position=1:10,
Prize=c(100,50,25,10,5,4,3,2,1,0))
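The same row-alignment idea can also be sketched in base R with ave(), which averages within groups by default. This is a minimal sketch under the same assumptions as option (2): results is sorted by Position and prizes has at least as many rows as results.

```r
results <- data.frame(Person = LETTERS[1:10],
                      Position = c(1, 2, 3, 3, 4, 5, 6, 7, 7, 7))
prizes <- data.frame(Position = 1:10,
                     Prize = c(100, 50, 25, 10, 5, 4, 3, 2, 1, 0))

# Make sure tied players sit in consecutive rows
results <- results[order(results$Position), ]

# The prize in row k goes to the player in row k; ave() then replaces each
# prize with the mean prize of that player's Position group
results$Winnings <- ave(prizes$Prize[seq_len(nrow(results))], results$Position)
```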

Solve a linear equation on every row in datatable

I did some linear regression and I want to forecast the moment of exceeding a certain value.
This means I have three columns:
a= slope
b = intercept
c = target value
On every row I want to calculate
solve(a,(c-b))
How do I do this in an efficient way, without using a loop (it is an extensive dataset)?
So you basically want to solve the equation
c = a*x + b
for x for each row? That has the pretty simple solution of
x = (c-b)/a
which is a vectorized operation in R. No loop necessary
dd <- data.frame(
a = 1:5,
b = -2:2,
c = 10:14
)
transform(dd, solution=(c-b)/a)
# a b c solution
# 1 1 -2 10 12.0
# 2 2 -1 11 6.0
# 3 3 0 12 4.0
# 4 4 1 13 3.0
# 5 5 2 14 2.4
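One case the plain division does not handle gracefully is a zero slope, where (c-b)/a evaluates to Inf, -Inf or NaN. The following is a small sketch of one way to guard against that; the example values and the choice to return NA are assumptions made for illustration.

```r
dd0 <- data.frame(a = c(2, 0, -1),
                  b = c(1, 1, 4),
                  c = c(5, 5, 1))

# Still fully vectorised: rows with a zero slope get NA instead of Inf/NaN
dd0$solution <- ifelse(dd0$a == 0, NA_real_, (dd0$c - dd0$b) / dd0$a)
```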
In addition to the aforementioned responses, you could also use the mutate function from the tidyverse, like so:
library(magrittr)
library(tidyverse)
dataframe %<>% mutate(prediction = (c - b) / a)
In this example we assume the columns 'a', 'b', and 'c' are in a table called 'dataframe'. The %<>% operator from the magrittr library pipes dataframe into the function that follows and assigns the result back to dataframe. Note that solve(a, c - b) cannot be called on whole columns, because solve() expects a square matrix as its first argument, so the row-wise solution is written as (c - b) / a.
Here is a simple way using the Vectorize function:
solve_vec <- Vectorize(solve)
solve_vec(d$a, d$c - d$b)
# [1] 12.0 6.0 4.0 3.0 2.4

Unroll R data.frame list column retaining the other values in the row [duplicate]

This question already has answers here:
Unlisting columns by groups
(3 answers)
Closed 7 years ago.
I need to efficiently "unroll" a list column in an R data.frame. For example, if I have a data.frame defined as:
dbt <- data.frame(values=c(1,1,1,1,2,3,4),
parm1=c("A","B","C","A","B","C","B"),
parm2=c("d","d","a","b","c","a","a"))
Then, assume an analysis that generates one column as a list, similar to the following output:
agg <- aggregate(values ~ parm1 + parm2, data=dbt,
FUN=function(x) {return(list(x))})
The aggregated data.frame looks like (where class(agg$values) == "list"):
parm1 parm2 values
1 B a 4
2 C a 1, 3
3 A b 1
4 B c 2
5 A d 1
6 B d 1
I'd like to unroll the "values" column efficiently, adding one row per element of the list and repeating the parm1 and parm2 values, over all the rows of the data.frame.
At the top level I wrote a function that does the unroll in a for loop called in an apply. It's really inefficient, (the aggregated data.frame takes about an hour to create and nearly 24 hours to unroll, the fully unrolled data has ~500k records). The top level I'm using is:
unrolled.data <- do.call(rbind, apply(agg, 1, FUN=unroll.data))
The function just calls unlist() on the value column object then builds a data.frame object in a for loop as the returned object.
The environment is somewhat restricted: the tidyr, data.table and splitstackshape libraries are unavailable to me. The solution must use only functions found in base::, and only those available in v3.1.1 and before. Thus the answers in this (not really a duplicate) question do not apply.
Any suggestions on something faster?
Thanks!
With base R, you could try
with(agg, {
data.frame(
lapply(agg[,1:2], rep, times=lengths(values)),
values=unlist(values)
)
})
# parm1 parm2 values
# 1.2 B a 4
# 1.31 C a 1
# 1.32 C a 3
# 2.1 A b 1
# 3.2 B c 2
# 4.1 A d 1
# 4.2 B d 1
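Since the question is limited to R 3.1.1 and lengths() was only added in R 3.2.0, a variant that avoids it may be needed. The sketch below replaces lengths() with vapply(); the list column is built by hand here (an assumption made to keep the example self-contained) rather than through aggregate().

```r
# Hand-built stand-in for the aggregated data with a list column
agg <- data.frame(parm1 = c("B", "C", "A", "B", "A", "B"),
                  parm2 = c("a", "a", "b", "c", "d", "d"),
                  stringsAsFactors = FALSE)
agg$values <- list(4, c(1, 3), 1, 2, 1, 1)

# Number of elements in each list entry, without lengths()
n <- vapply(agg$values, length, integer(1))

# Repeat each parm1/parm2 value once per list element, then flatten the list
unrolled <- data.frame(
  lapply(agg[c("parm1", "parm2")], rep, times = n),
  values = unlist(agg$values)
)
```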
Timings for an alternative (thanks @thelatemail)
library(dplyr)
agg %>%
sample_n(1e7, replace=T) -> bigger
system.time(
with(bigger, { data.frame(lapply(bigger[,1:2], rep, times=lengths(values)), values=unlist(values)) })
)
# user system elapsed
# 3.78 0.14 3.93
system.time(
with(bigger, { data.frame(bigger[rep(rownames(bigger), lengths(values)), 1:2], values=unlist(values)) })
)
# user system elapsed
# 11.30 0.34 11.64

In R how to make two columns an ID and get a frequency histogram for each ID

Example Dataset:
A 2 1.5
A 2 1.5
B 3 2.0
B 3 2.5
B 3 2.6
C 4 3.2
C 4 3.5
So here I would like to create 3 frequency histograms based on the first two columns, i.e. one each for A2, B3, and C4. I am new to R, so any help would be greatly appreciated. Should I flatten out the data so it's like this:
A 2 1.5 1.5
B 3 2.0 2.5 2.6 etc...
Thank you
Here's an alternative solution based on the by function, which is just a wrapper for the tapply that Jilber suggested. You might find the 'ex' variable useful:
set.seed(1)
dat <- data.frame(First = LETTERS[1:3], Second = 1:2, Num = rnorm(60))
# Extract third column per each unique combination of columns 'First' and 'Second'
ex <- by(dat, INDICES =
# Create names like A.1, A.2, ...
apply(dat[,c("First","Second")], MARGIN=1, FUN=function(z) paste(z, collapse=".")),
# Extract third column per each unique combination
FUN=function(x) x[,3])
# Draw histograms
par(mfrow=c(3,2))
for(i in 1:length(ex)){
hist(ex[[i]], main=names(ex)[i], xlim=extendrange(unlist(ex)))
}
Assuming your dataset is called x and the columns are a, b and c respectively, I think this command should do the trick:
library(lattice)
histogram(~c|a+b,x)
Note that this requires the lattice package to be installed.
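If no extra package is wanted, the grouping can also be sketched in base R with split() and interaction(). The column names a, b, c and the data frame name x follow the assumption made in the answer above; the data are the example rows from the question.

```r
x <- data.frame(a = c("A", "A", "B", "B", "B", "C", "C"),
                b = c(2, 2, 3, 3, 3, 4, 4),
                c = c(1.5, 1.5, 2.0, 2.5, 2.6, 3.2, 3.5))

# One numeric vector per combination of the first two columns;
# drop = TRUE discards combinations that never occur (e.g. A.3)
groups <- split(x$c, interaction(x$a, x$b, drop = TRUE))

# One histogram per group, drawn side by side
op <- par(mfrow = c(1, length(groups)))
for (g in names(groups)) {
  hist(groups[[g]], main = g, xlab = "c")
}
par(op)
```

There is no need to flatten the data into a wide layout first; split() keeps the groups as a named list.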

Multiple plyr functions and operations in one statement?

I have a dataset as follows:
i,o,c
A,4,USA
B,3,CAN
A,5,USA
C,4,MEX
C,1,USA
A,3,CAN
I want to reform this dataset into a form as follows:
i,u,o,c
A,3,4,2
B,1,3,1
C,2,2.5,2
Here, u is the number of times each value of i occurs in the dataset, o = (sum of o / u), and c is the number of unique countries.
I can get u with the following statement and by using plyr:
count(df1,vars="i")
I can also get some of the other variables by using the insights learned from my previous question. By laboriously saving to multiple data frames and then joining them together I can achieve my intended results, but I wonder if there is a one-line optimization or simply a better way of doing this than my current long-winded approach.
Thanks !
I don't understand how this is different from your earlier question. The approach is the same:
library(plyr)
ddply(mydf, .(i), summarise,
u = length(i),
o = mean(o),
c = length(unique(c)))
# i u o c
# 1 A 3 4.0 2
# 2 B 1 3.0 1
# 3 C 2 2.5 2
If you prefer a data.table solution:
> library(data.table)
> DT <- data.table(mydf)
> DT[, list(u = .N, o = mean(o), c = length(unique(c))), by = "i"]
i u o c
1: A 3 4.0 2
2: B 1 3.0 1
3: C 2 2.5 2
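For comparison, the same summary can be sketched with base R only, using table() and tapply(). The data frame name df1 is taken from the question; the result name res is introduced here for illustration.

```r
df1 <- data.frame(i = c("A", "B", "A", "C", "C", "A"),
                  o = c(4, 3, 5, 4, 1, 3),
                  c = c("USA", "CAN", "USA", "MEX", "USA", "CAN"),
                  stringsAsFactors = FALSE)

# table() and tapply() both order their results by the sorted group labels,
# so the columns line up row by row
res <- data.frame(
  i = sort(unique(df1$i)),
  u = as.vector(table(df1$i)),                      # rows per group
  o = as.vector(tapply(df1$o, df1$i, mean)),        # mean of o per group
  c = as.vector(tapply(df1$c, df1$i,
                       function(v) length(unique(v))))  # distinct countries
)
```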
