Merge two data frames in R by the closest preceding time, not by exceeding it - r

I have two data frames:
df_bestquotes
df_transactions
df_transactions:
day time vol price buy ask bid
1 43688,08 100 195,8 1 195,8 195,74
1 56357,34 20 192,87 1 192,87 192,86
1 57576,14 14 192,48 -1 192,48 192,46
2 50468,29 3 193,83 1 193,86 193,77
2 56107,54 11 194,17 -1 194,2 194,16
7 42549,66 100 188,81 -1 188,85 188,78
7 42724,38 200 188,62 -1 188,66 188,61
7 48924,66 5 189,59 -1 189,62 189,59
8 48950,14 52 187,66 -1 187,7 187,66
9 36242,86 89 186,61 1 186,62 186,56
9 53910,46 1 189,81 -1 189,87 189,81
10 47041,94 15 187,87 -1 187,88 187,86
13 34380,73 87 187,29 -1 187,42 187,27
13 36037,18 100 188,94 1 188,95 188,94
14 46644,64 100 189,29 -1 189,34 189,29
14 57571,12 52 190,03 1 190,03 190
15 36418,71 45 192,07 1 192,07 192,04
15 37223,77 100 191,09 -1 191,07 191,06
17 37245,59 100 186,45 -1 186,47 186,45
23 34200,39 50 189,29 -1 189,29 189,27
24 40294,73 60 193,52 -1 193,54 193,5
29 52813,68 5 202,99 -1 203,01 202,99
29 55279,13 93 203,97 -1 203,98 203,9
30 51356,91 68 204,41 -1 204,45 204,4
30 53530,24 40 204,14 -1 204,18 204,14
df_bestquotes:
day time best_ask best_bid
1 51384,613 31,78 31,75
1 56593,74 31,6 31,55
3 40568,217 31,36 31,32
7 39169,237 31,34 31,28
8 44715,713 31,2 31,17
8 53730,707 31,24 31,19
8 55851,75 31,17 31,14
10 49376,267 31,06 30,99
16 48610,483 30,75 30,66
16 57360,917 30,66 30,64
17 53130,717 30,39 30,32
20 46353,133 30,72 30,63
23 46429,67 29,7 29,64
24 37627,727 29,81 29,63
24 46354,647 29,92 29,77
24 53863,69 30,04 29,93
24 53889,923 30,03 29,95
24 59047,223 29,99 29,2
28 39086,407 30,87 30,83
28 41828,703 30,87 30,8
28 50489,367 30,99 30,87
29 54264,467 30,97 30,85
30 34365,95 31,21 30,99
30 39844,357 31,06 31
30 57550,523 31,18 31,15
For each record of df_transactions, using the day and time, I need to find the best_ask and best_bid that were in effect just before that moment, and add this information to df_transactions.
df_joined: df_transactions + df_bestquotes
day time vol price buy ask bid best_ask best_bid
1 43688,08 100 195,8 1 195,8 195,74
1 56357,34 20 192,87 1 192,87 192,86
1 57576,14 14 192,48 -1 192,48 192,46
2 50468,29 3 193,83 1 193,86 193,77
2 56107,54 11 194,17 -1 194,2 194,16
7 42549,66 100 188,81 -1 188,85 188,78
7 42724,38 200 188,62 -1 188,66 188,61
7 48924,66 5 189,59 -1 189,62 189,59
8 48950,14 52 187,66 -1 187,7 187,66
9 36242,86 89 186,61 1 186,62 186,56
9 53910,46 1 189,81 -1 189,87 189,81
10 47041,94 15 187,87 -1 187,88 187,86
13 34380,73 87 187,29 -1 187,42 187,27
13 36037,18 100 188,94 1 188,95 188,94
14 46644,64 100 189,29 -1 189,34 189,29
14 57571,12 52 190,03 1 190,03 190
15 36418,71 45 192,07 1 192,07 192,04
15 37223,77 100 191,09 -1 191,07 191,06
17 37245,59 100 186,45 -1 186,47 186,45
23 34200,39 50 189,29 -1 189,29 189,27
24 40294,73 60 193,52 -1 193,54 193,5
29 52813,68 5 202,99 -1 203,01 202,99
29 55279,13 93 203,97 -1 203,98 203,9
30 51356,91 68 204,41 -1 204,45 204,4
30 53530,24 40 204,14 -1 204,18 204,14
I have tried the following code, but it doesn't work:
library(data.table)
df_joined = df_bestquotes[df_transactions, on="time", roll = "nearest"]
Here are the real files with a lot more records; the ones above are a sample of only 25 records.
df_transactions_original
df_bestquotes_original
And my code in R:
matching.R
Any suggestions on how to achieve this? Thanks a lot, guys.

Your attempt uses data.table syntax, but you don't load data.table anywhere. Have you run library(data.table) first?
I think it should rather be:
df_joined = df_bestquotes[df_transactions, on = .(day, time), roll = TRUE]
But I cannot test without the objects. Does it work? Note that roll = "nearest" doesn't give you the previous best quotes but the nearest ones in either direction.
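To see the difference on a toy example (a minimal sketch):
library(data.table)
quotes <- data.table(time = c(1, 5, 9), q = c("a", "b", "c"))
trade <- data.table(time = 8)
quotes[trade, on = "time", roll = "nearest"] # q = "c": 9 is the nearest time to 8
quotes[trade, on = "time", roll = TRUE]      # q = "b": last quote at or before 8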
EDIT: Thanks for the objects. I checked, and this works for me:
library(data.table)
dfb <- fread("df_bestquotes.csv", dec=",")
dft <- fread("df_transactions.csv", dec = ",")
dfb[, c("day2", "time2") := .(day,time)] # duplicated to keep track of the best quotes days
joinedDf <- dfb [dft, on=.(day, time), roll = +Inf]
It puts NA when there are no best quotes for that day. If you want to roll across days, I suggest you create a single continuous measure of time. I don't know exactly what time represents; assuming its unit is seconds:
dfb[, uniqueTime := day + time/(60*60*24)]
dft[, uniqueTime := day + time/(60*60*24)]
joinedDf <- dfb[dft, on = .(uniqueTime), roll = +Inf]
This works even if time is not in seconds; only the ordering matters in this case.
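As a quick sanity check on the sample data (a sketch; i.day and i.time are the names data.table gives the transaction's own columns in the joined result):
# The day-1 trade at time 56357.34 lies between the day-1 quotes at
# 51384.613 and 56593.74, so the roll should attach best_ask = 31.78
# and best_bid = 31.75 from the earlier quote.
joinedDf[i.day == 1 & abs(i.time - 56357.34) < 1e-9, .(i.day, i.time, best_ask, best_bid)]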

Good morning @samuelallain, yes, I had already run library(data.table); I've edited the main post accordingly.
I have tried your solution and RStudio returns the following error:
library(data.table)
df_joined = df_bestquotes[df_transactions, on=.("day", "time"), roll = TRUE]
Error in [.data.frame(df_bestquotes, df_transactions, on = .(day, time), :
unused arguments (on = .("day", "time"), roll = TRUE)
Thank you.
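For what it's worth, that "unused arguments" error comes from [.data.frame: it means df_bestquotes is still a plain data.frame, so the on and roll arguments of data.table joins are not available. A minimal sketch of the fix (assuming the objects were read with read.csv; note the unquoted column names in on = .()):
library(data.table)
setDT(df_bestquotes)   # convert in place to data.table
setDT(df_transactions)
df_joined <- df_bestquotes[df_transactions, on = .(day, time), roll = TRUE]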

Related

Find maximum between two rows, column wise

I'm a newbie in R, trying to figure out how to find the maximum of each pair of consecutive rows, column-wise.
Example data:
t min most max
---------------
1 10 20 40
2 5 10 30
3 14 28 60
4 40 75 150
Result I'm looking for:
t min most max
---------------
1 10 20 40
2 14 28 60
3 40 75 150
I have tried using rowwise(), but it's not working. I can get the row-wise maximum using:
df$new <- pmax(df$min, df$most, df$max)
df
which gives me the maximum value across each entire row.
t min most max new
-------------------
1 10 20 40 40
2 5 10 30 30
3 14 28 60 60
4 40 75 150 150
Thanks in advance.
You can do this with pmax applied to the vector against its shifted self. Putting it in a nice little helper function:
adj_max = function(x) {
  pmax(x[-1], x[-length(x)])
}
as.data.frame(lapply(your_data, adj_max))
# or with dplyr
your_data %>%
  summarize(across(everything(), adj_max))
Reproducible demo:
x = c(10, 5, 14, 40)
adj_max(x)
# [1] 10 14 40
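Applied to the posted example data (a quick sketch rebuilding the data frame from the question), this reproduces the desired output:
df <- data.frame(t = 1:4,
                 min = c(10, 5, 14, 40),
                 most = c(20, 10, 28, 75),
                 max = c(40, 30, 60, 150))
res <- as.data.frame(lapply(df[-1], adj_max)) # drop t before comparing consecutive rows
res <- cbind(t = seq_len(nrow(res)), res)
res
#   t min most max
# 1 1  10   20  40
# 2 2  14   28  60
# 3 3  40   75 150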

Why does the frequency reduce if I use the ifelse function in R? Is there a way to create categories from the combination of 2 variables/columns?

When I do
table(df$strategy.x)
0 1 2 3
70 514 223 209
table(df$strategy.y)
0 1 2 3
729 24 7 4
I want to create a variable combining both of these. I tried this:
df <- df %>%
  mutate(nstrategy1 = ifelse(strategy.x == 1 | strategy.y == 1, 1, 0))
table(df$nstrategy1)
0 1
399 519
I am supposed to get 514 + 24 = 538 but I got 519 instead
df <- df %>% mutate(nstrategy2 = ifelse(strategy.x == 2 | strategy.y == 2, 1, 0))
table(df$nstrategy2)
0 1
578 228
Similarly, I am supposed to get 223 + 7 = 230, but I got 228 instead
Is there a good way to merge both strategy.x and strategy.y and end up with a table like the following with 4 categories?
0 1 2 3
799 538 230 213
table(mtcars$am) # 13 1's
table(mtcars$vs) # 14 1's
mtcars$ones = ifelse(mtcars$am == 1 | mtcars$vs == 1, 1, 0)
table(mtcars$ones) # 20 1's < 13 + 14 = 27
Why is it showing only 20 1's instead of 27? It's because there are 7 + 6 + 7 = 20 cars with a 1 in at least one of am and vs. There are 13 with am==1 (6+7), and 14 with vs==1 (7+7). Seven cars sit in the bottom-right cell because they have 1's in both dimensions, and those are the ones you are expecting/seeking to count twice.
table(mtcars$am, mtcars$vs)
# 0 1
# 0 12 7
# 1 6 7
The simplest way to get the sum of the two results would be by adding the two table objects:
table(mtcars$am) + table(mtcars$vs)
# 0 1
# 37 27
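Applied to the question's data (assuming strategy.x and strategy.y share the same levels 0-3, so the two tables line up), the same idea gives the combined counts asked for:
table(df$strategy.x) + table(df$strategy.y)
#   0   1   2   3
# 799 538 230 213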

writing out .dat file in r

I have a dataset that looks like this:
ids <- c(111,12,134,14,155,16,17,18,19,20)
scores.1 <- c(0,1,0,1,1,2,0,1,1,1)
scores.2 <- c(0,0,0,1,1,1,1,1,1,0)
data <- data.frame(ids, scores.1, scores.1)
> data
ids scores.1 scores.1.1
1 111 0 0
2 12 1 1
3 134 0 0
4 14 1 1
5 155 1 1
6 16 2 2
7 17 0 0
8 18 1 1
9 19 1 1
10 20 1 1
ids stands for student ids, scores.1 is the response/score for the first question, and scores.2 is the response/score for the second question. Student ids vary in their number of digits, but scores always have 1 digit. I am trying to write this out as a .dat file by generating some objects and using them in the write.fwf function from the gdata library.
item.count <- dim(data)[2] - 1 # counts the number of questions in the dataset
write.fwf(data, file = "data.dat", width = c(5,rep(1, item.count)),
colnames = FALSE, sep = "")
I would like to separate the student ids and the question responses with some spaces, so I specified width = c(5, rep(1, item.count)) in the write.fwf() call to reserve 5 characters for the student ids. However, the output file looks like this, with the padding on the left side of the student ids
11100
1211
13400
1411
15511
1622
1700
1811
1911
2011
rather than on the right side of the ids, like this:
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
Any recommendations?
Thanks!
We can use unite to combine the 'scores' columns into a single one and then write it out:
library(dplyr)
library(tidyr)
data %>%
  unite(scores, starts_with('scores'), sep = '')
With @akrun's help, this gives what I wanted:
library(dplyr)
library(tidyr)
library(gdata)
data2 <- data %>%
  unite(scores, starts_with('scores'), sep = '')
write.fwf(data2, file = "data.dat",
          width = c(5, item.count),
          colnames = FALSE, sep = " ")
In the .dat file, the dataset now looks like this:
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
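For what it's worth, a base-R sketch without gdata does the same job: "%-5s" left-justifies the id in a five-character field, and the score columns are pasted together digit by digit.
score_cols <- data[-1] # everything except ids
lines <- sprintf("%-5s %s", data$ids, do.call(paste0, score_cols))
writeLines(lines, "data.dat")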

How to sum a function over a specific range in R?

Here are three columns:
indx vehID LocalY
1 2 35.381
2 2 39.381
3 2 43.381
4 2 47.38
5 2 51.381
6 2 55.381
7 2 59.381
8 2 63.379
9 2 67.383
10 2 71.398
11 2 75.401
12 2 79.349
13 2 83.233
14 2 87.043
15 2 90.829
16 2 94.683
17 2 98.611
18 2 102.56
19 2 106.385
20 2 110.079
21 2 113.628
22 2 117.118
23 2 120.6
24 2 124.096
25 2 127.597
26 2 131.099
27 2 134.595
28 2 138.081
29 2 141.578
30 2 145.131
31 2 148.784
32 2 152.559
33 2 156.449
34 2 160.379
35 2 164.277
36 2 168.15
37 2 172.044
38 2 176
39 2 179.959
40 2 183.862
41 2 187.716
42 2 191.561
43 2 195.455
44 2 199.414
45 2 203.417
46 2 207.43
47 2 211.431
48 2 215.428
49 2 219.427
50 2 223.462
51 2 227.422
52 2 231.231
53 2 235.001
54 2 238.909
55 2 242.958
56 2 247.137
57 2 251.247
58 2 255.292
59 2 259.31
60 2 263.372
61 2 267.54
62 2 271.842
63 2 276.256
64 2 280.724
65 2 285.172
I want to create a new column called 'Smoothed Y' by applying an exponential kernel smoother over a window of +/- D points:
x_alpha(t_i) = (1/Z_i) * sum_{k = i-D}^{i+D} exp(-|i - k| / delta) * x_alpha(t_k), with Z_i = sum_{k = i-D}^{i+D} exp(-|i - k| / delta)
where D = 15, delta = 5, i = indx, x_alpha(t_k) = LocalY, and x_alpha(t_i) is the smoothed value.
I have tried the following code to first calculate Z (the kernel below means the exp function):
t <- 0.5
dt <- 0.1
delta <- t/dt
d <- 3*delta
indx <- a$indx
for (i in indx) {
  initial <- i - d
  end <- i + d
  k <- c(initial:end)
  for (n in k) {
    kernel <- exp(-abs(i - n)/delta)
    z <- sum(kernel)
  }
}
a$z <- z
print(a)
NOTE: 'a' is the imported data frame containing the three columns above.
Although the computed kernel values are fine, it doesn't sum them up in the variable z. How can I do the summation over the range i-d to i+d for every indx value i?
You can use the convolve function. One thing you need to decide is what to do for indices closer to either end of the array than the width of the convolution kernel. One option is to simply use the partial kernel, rescaled so the weights still sum to 1.
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D)/delta)
  r <- convolve(x, z, type = "open") / convolve(rep(1, length(x)), z, type = "open")
  r <- head(tail(r, -D), -D)
  r
}
With your array as y, the result is this:
> yy<-smooth(y,15,5)
> yy
[1] 50.70804 52.10837 54.04788 56.33651 58.87682 61.61121 64.50214
[8] 67.52265 70.65186 73.87197 77.16683 80.52193 83.92574 87.36969
[15] 90.84850 94.35809 98.15750 101.93317 105.67833 109.38989 113.06889
[22] 116.72139 120.35510 123.97707 127.59293 131.20786 134.82720 138.45720
[29] 142.10507 145.77820 149.48224 153.21934 156.98794 160.78322 164.60057
[36] 168.43699 172.29076 176.15989 180.04104 183.93127 187.83046 191.74004
[43] 195.66223 199.59781 203.54565 207.50342 211.46888 215.44064 219.41764
[50] 223.39908 227.05822 230.66813 234.22890 237.74176 241.20236 244.60039
[57] 247.91917 251.14346 254.25876 257.24891 260.09121 262.74910 265.16057
[64] 267.21598 268.70276
Of course, the problem with this is that the kernel ends up non-centered at the edges. This is a well-known issue, and there are ways to deal with it, but they complicate things. Plotting the data will show you the effects of this non-centering:
plot(y)
lines(yy)
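For completeness, a minimal fix of the loop from the question, computing the normalisation z: the original overwrote z with the last kernel value instead of accumulating, and stored only a single number rather than one z per row. This sketch also clips the window at the edges and assumes indx runs from 1 to nrow(a):
t <- 0.5
dt <- 0.1
delta <- t/dt # 5
d <- 3*delta  # 15
z <- numeric(nrow(a))
for (i in a$indx) {
  k <- (i - d):(i + d)          # window i-d .. i+d
  k <- k[k >= 1 & k <= nrow(a)] # clip at the array edges
  z[i] <- sum(exp(-abs(i - k)/delta))
}
a$z <- z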

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:
[[1]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
13 68 13 117 34 3.275941e-37
23 78 23 117 2 4.503111e-32
....
[[2]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
....
While I don't show the full tables here, I know they are all the same size. What I want to do is make one table where I add up the p-values. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row across the different list elements, so I can't just do a simple sum.
The brute-force way to do this is to: 1) make a blank table, 2) copy in the appropriate $cluster_size, $start, $end and $number columns from the first table, and 3) pull the matching p-values from all the tables using which() statements. Is there a cleverer way of doing this? Or is this pretty much it?
Edit: I was asked for a dput file of the data. It's located here:
http://alrig.com/code/
In the sample case, the order of the rows happens to match. That will not always be the case.
Seems like you can do this in two steps:
1) Convert your list to a data.frame.
2) Use any of the split-apply-combine approaches to summarize.
Assuming your data was named X, here's what you could do:
library(plyr)
#need to convert to data.frame since all of your list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
cluster_size start end number sump
1 1 12 12 100 5.550142e-184
2 1 13 13 31 3.117856e-37
3 1 22 22 1 9.000000e+00
...
29 105 23 117 2 6.271469e-16
30 106 22 146 13 7.266746e-25
31 107 23 146 12 1.382328e-25
Lots of other aggregation techniques are covered here. I'd look at the data.table package if your data is large.
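As a sketch of that data.table route (assuming the list is named X as above):
library(data.table)
XDT <- rbindlist(lapply(X, as.data.table)) # stack the matrices into one table
XDT[, .(sump = sum(p_value)), by = .(cluster_size, start, end, number)]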
