Merge two data frames in R by the closest preceding time, not the following one - r
I have two data frames:
df_bestquotes
df_transactions
df_transactions:
day time vol price buy ask bid
1 43688,08 100 195,8 1 195,8 195,74
1 56357,34 20 192,87 1 192,87 192,86
1 57576,14 14 192,48 -1 192,48 192,46
2 50468,29 3 193,83 1 193,86 193,77
2 56107,54 11 194,17 -1 194,2 194,16
7 42549,66 100 188,81 -1 188,85 188,78
7 42724,38 200 188,62 -1 188,66 188,61
7 48924,66 5 189,59 -1 189,62 189,59
8 48950,14 52 187,66 -1 187,7 187,66
9 36242,86 89 186,61 1 186,62 186,56
9 53910,46 1 189,81 -1 189,87 189,81
10 47041,94 15 187,87 -1 187,88 187,86
13 34380,73 87 187,29 -1 187,42 187,27
13 36037,18 100 188,94 1 188,95 188,94
14 46644,64 100 189,29 -1 189,34 189,29
14 57571,12 52 190,03 1 190,03 190
15 36418,71 45 192,07 1 192,07 192,04
15 37223,77 100 191,09 -1 191,07 191,06
17 37245,59 100 186,45 -1 186,47 186,45
23 34200,39 50 189,29 -1 189,29 189,27
24 40294,73 60 193,52 -1 193,54 193,5
29 52813,68 5 202,99 -1 203,01 202,99
29 55279,13 93 203,97 -1 203,98 203,9
30 51356,91 68 204,41 -1 204,45 204,4
30 53530,24 40 204,14 -1 204,18 204,14
df_bestquotes:
day time best_ask best_bid
1 51384,613 31,78 31,75
1 56593,74 31,6 31,55
3 40568,217 31,36 31,32
7 39169,237 31,34 31,28
8 44715,713 31,2 31,17
8 53730,707 31,24 31,19
8 55851,75 31,17 31,14
10 49376,267 31,06 30,99
16 48610,483 30,75 30,66
16 57360,917 30,66 30,64
17 53130,717 30,39 30,32
20 46353,133 30,72 30,63
23 46429,67 29,7 29,64
24 37627,727 29,81 29,63
24 46354,647 29,92 29,77
24 53863,69 30,04 29,93
24 53889,923 30,03 29,95
24 59047,223 29,99 29,2
28 39086,407 30,87 30,83
28 41828,703 30,87 30,8
28 50489,367 30,99 30,87
29 54264,467 30,97 30,85
30 34365,95 31,21 30,99
30 39844,357 31,06 31
30 57550,523 31,18 31,15
For each record of df_transactions, using its day and time, I need to find the best_ask and best_bid quoted just before that moment, and add this information to df_transactions.
df_joined: df_transactions + df_bestquotes
day time vol price buy ask bid best_ask best_bid
1 43688,08 100 195,8 1 195,8 195,74
1 56357,34 20 192,87 1 192,87 192,86
1 57576,14 14 192,48 -1 192,48 192,46
2 50468,29 3 193,83 1 193,86 193,77
2 56107,54 11 194,17 -1 194,2 194,16
7 42549,66 100 188,81 -1 188,85 188,78
7 42724,38 200 188,62 -1 188,66 188,61
7 48924,66 5 189,59 -1 189,62 189,59
8 48950,14 52 187,66 -1 187,7 187,66
9 36242,86 89 186,61 1 186,62 186,56
9 53910,46 1 189,81 -1 189,87 189,81
10 47041,94 15 187,87 -1 187,88 187,86
13 34380,73 87 187,29 -1 187,42 187,27
13 36037,18 100 188,94 1 188,95 188,94
14 46644,64 100 189,29 -1 189,34 189,29
14 57571,12 52 190,03 1 190,03 190
15 36418,71 45 192,07 1 192,07 192,04
15 37223,77 100 191,09 -1 191,07 191,06
17 37245,59 100 186,45 -1 186,47 186,45
23 34200,39 50 189,29 -1 189,29 189,27
24 40294,73 60 193,52 -1 193,54 193,5
29 52813,68 5 202,99 -1 203,01 202,99
29 55279,13 93 203,97 -1 203,98 203,9
30 51356,91 68 204,41 -1 204,45 204,4
30 53530,24 40 204,14 -1 204,18 204,14
I have tried the following code, but it doesn't work:
library(data.table)
df_joined = df_bestquotes[df_transactions, on="time", roll = "nearest"]
Here are the real files, which have many more records; the ones above are a sample of only 25 records.
df_transactions_original
df_bestquotes_original
And my code in R:
matching.R
Any suggestions on how to achieve this? Thanks a lot.
Your attempt uses data.table, but you don't mention loading it. Have you run library(data.table) first?
I think it should rather be:
df_joined = df_bestquotes[df_transactions, on = .(day, time), roll = TRUE]
But I cannot test without the objects. Does it work? Note that roll = "nearest" gives you the nearest best quotes, not the previous ones.
EDIT: Thanks for the objects. I checked, and this works for me:
library(data.table)
dfb <- fread("df_bestquotes.csv", dec=",")
dft <- fread("df_transactions.csv", dec = ",")
dfb[, c("day2", "time2") := .(day, time)]  # duplicate the columns to keep track of the best-quote day/time
joinedDf <- dfb[dft, on = .(day, time), roll = +Inf]
It puts NA when there are no best quotes for the day. If you want to roll across days, I suggest you create a single unique measure of time. I don't know exactly what unit time is in. Assuming time is in seconds:
dfb[, uniqueTime := day + time/(60*60*24)]
dft[, uniqueTime := day + time/(60*60*24)]
joinedDf <- dfb[dft, on = .(uniqueTime), roll = +Inf]
This works even if time is not in seconds; only the ordering matters in this case.
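To make the roll = +Inf behaviour concrete, here is a tiny self-contained sketch with invented numbers (not the question's data): each transaction picks up the most recent quote at or before its own time, and gets NA when no earlier quote exists.

```r
library(data.table)

# Invented miniature versions of the two tables
dfb <- data.table(day = c(1, 1), time = c(100, 200), best_ask = c(31.80, 31.60))
dft <- data.table(day = c(1, 1, 1), time = c(50, 150, 250),
                  price = c(195.8, 192.9, 192.5))

# roll = +Inf carries the last quote forward: each transaction row is matched
# to the quote with the largest time <= its own time (NA if none exists)
joined <- dfb[dft, on = .(day, time), roll = +Inf]
joined$best_ask
# NA for time 50 (no earlier quote), 31.80 for time 150, 31.60 for time 250
```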
Good morning @samuelallain, yes, I had already run library(data.table).
I've edited it in the main post.
I tried your solution and RStudio returns the following error:
library(data.table)
df_joined = df_bestquotes[df_transactions, on=.("day", "time"), roll = TRUE]
Error in `[.data.frame`(df_bestquotes, df_transactions, on = .(day, time), :
  unused arguments (on = .("day", "time"), roll = TRUE)
Thank you.
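The `[.data.frame` prefix in that error message usually means the objects are still plain data.frames, so the data.table arguments on= and roll= are never understood. A minimal sketch of the usual fix, with invented stand-in data rather than the question's files:

```r
library(data.table)

# Invented stand-ins for the question's data.frames
df_bestquotes <- data.frame(day = c(1, 1), time = c(100, 200),
                            best_ask = c(31.80, 31.60), best_bid = c(31.75, 31.55))
df_transactions <- data.frame(day = 1, time = 150, price = 195.8)

# setDT() converts in place, so the data.table join syntax is dispatched
setDT(df_bestquotes)
setDT(df_transactions)

df_joined <- df_bestquotes[df_transactions, on = .(day, time), roll = TRUE]
```

Note also that the column names inside on = .( ) are written unquoted; quoted names can be passed as a character vector instead, e.g. on = c("day", "time").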
Related
Find maximum between two rows, column wise
I'm a newbie in R, trying to figure out how to find, column-wise, the maximum of each pair of consecutive rows. Example data:

t min most max
1 10 20 40
2 5 10 30
3 14 28 60
4 40 75 150

Result I'm looking for:

t min most max
1 10 20 40
2 14 28 60
3 40 75 150

I have tried using rowwise(), but it's not working. I can get the row-wise maximum using:

df$new <- pmax(df$min, df$most, df$max)

which gives me the maximum value for the entire row:

t min most max new
1 10 20 40 40
2 5 10 30 30
3 14 28 60 60
4 40 75 150 150

Thanks in advance.
You can do this with pmax applied to the vector against its shifted self. Putting it in a nice little helper function:

adj_max <- function(x) {
  pmax(x[-1], x[-length(x)])
}

as.data.frame(lapply(your_data, adj_max))

# or with dplyr
your_data %>% summarize(across(everything(), adj_max))

Reproducible demo:

x <- c(10, 5, 14, 40)
adj_max(x)
# [1] 10 14 40
Why does the frequency reduce if I use the ifelse function in R? Is there a way to create categories from the combination of 2 variables/columns?
When I do table(df$strategy.x) I get:

  0   1   2   3
 70 514 223 209

and table(df$strategy.y) gives:

  0   1   2   3
729  24   7   4

I want to create a variable with both of these combined. I tried this:

df <- df %>% mutate(nstrategy1 = ifelse(strategy.x == 1 | strategy.y == 1, 1, 0))
table(df$nstrategy1)

  0   1
399 519

I was expecting 514 + 24 = 538, but I got 519 instead.

df <- df %>% mutate(nstrategy2 = ifelse(strategy.x == 2 | strategy.y == 2, 1, 0))
table(df$nstrategy2)

  0   1
578 228

Similarly, I was expecting 223 + 7 = 230, but I got 228 instead. Is there a good way to merge strategy.x and strategy.y and end up with a table like the following, with 4 categories?

  0   1   2   3
799 538 230 213
table(mtcars$am)  # 13 1's
table(mtcars$vs)  # 14 1's
mtcars$ones <- ifelse(mtcars$am == 1 | mtcars$vs == 1, 1, 0)
table(mtcars$ones)  # 20 1's < 13 + 14 = 27

Why is it showing only 20 1's instead of 27? It's because there are 7 + 6 + 7 = 20 cars with either one or two 1's in am and vs. There are 13 with am == 1 (6 + 7), and 14 with vs == 1 (7 + 7). Seven cars are in the bottom-right cell because they have 1's in both dimensions, which you are expecting/seeking to count twice:

table(mtcars$am, mtcars$vs)
#    0  1
# 0 12  7
# 1  6  7

The simplest way to get the sum of the two results would be by adding the two table objects:

table(mtcars$am) + table(mtcars$vs)
#  0  1
# 37 27
Writing out a .dat file in R
I have a dataset that looks like this:

ids <- c(111, 12, 134, 14, 155, 16, 17, 18, 19, 20)
scores.1 <- c(0, 1, 0, 1, 1, 2, 0, 1, 1, 1)
scores.2 <- c(0, 0, 0, 1, 1, 1, 1, 1, 1, 0)
data <- data.frame(ids, scores.1, scores.1)

> data
   ids scores.1 scores.1.1
1  111        0          0
2   12        1          1
3  134        0          0
4   14        1          1
5  155        1          1
6   16        2          2
7   17        0          0
8   18        1          1
9   19        1          1
10  20        1          1

ids stands for student ids, scores.1 is the response/score for the first question, and scores.2 is the response/score for the second question. Student ids vary in the number of digits, but scores always have 1 digit. I am trying to write this out as a .dat file using the write.fwf function from the gdata library:

item.count <- dim(data)[2] - 1  # counts the number of questions in the dataset
write.fwf(data, file = "data.dat", width = c(5, rep(1, item.count)),
          colnames = FALSE, sep = "")

I would like to separate the student ids and the question responses with some spaces, so I specified 5 characters for the ids via width = c(5, rep(1, item.count)) in the write.fwf() function. However, the output file has the spaces on the left side of the student ids:

11100
 1211
13400
 1411
15511
 1622
 1700
 1811
 1911
 2011

rather than on the right side of the ids:

111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11

Any recommendations? Thanks!
We can use unite to combine the 'score' columns into a single one and then use write.csv:

library(dplyr)
library(tidyr)
data %>% unite(scores, starts_with('scores'), sep = '')
With @akrun's help, this gives what I wanted:

library(dplyr)
library(tidyr)
data <- data %>% unite(scores, starts_with('scores'), sep = '')
write.fwf(data, file = "data.dat", width = c(5, item.count),
          colnames = FALSE, sep = " ")

In the .dat file, the dataset looks like this:

111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
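If gdata is not available, the same left-justified layout can be sketched with base R alone; the 5-character id field and the pre-combined score strings below are assumptions matching the question's example, not output from the actual files:

```r
# Invented stand-ins for the question's columns
ids <- c(111, 12, 134)
scores <- c("00", "11", "00")

# %-5s left-justifies each id within a 5-character field
lines <- sprintf("%-5s %s", ids, scores)
writeLines(lines, "data.dat")
```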
How to sum a function over a specific range in R?
Here are three columns:

indx vehID LocalY
1 2 35.381
2 2 39.381
3 2 43.381
4 2 47.38
5 2 51.381
6 2 55.381
7 2 59.381
8 2 63.379
9 2 67.383
10 2 71.398
11 2 75.401
12 2 79.349
13 2 83.233
14 2 87.043
15 2 90.829
16 2 94.683
17 2 98.611
18 2 102.56
19 2 106.385
20 2 110.079
21 2 113.628
22 2 117.118
23 2 120.6
24 2 124.096
25 2 127.597
26 2 131.099
27 2 134.595
28 2 138.081
29 2 141.578
30 2 145.131
31 2 148.784
32 2 152.559
33 2 156.449
34 2 160.379
35 2 164.277
36 2 168.15
37 2 172.044
38 2 176
39 2 179.959
40 2 183.862
41 2 187.716
42 2 191.561
43 2 195.455
44 2 199.414
45 2 203.417
46 2 207.43
47 2 211.431
48 2 215.428
49 2 219.427
50 2 223.462
51 2 227.422
52 2 231.231
53 2 235.001
54 2 238.909
55 2 242.958
56 2 247.137
57 2 251.247
58 2 255.292
59 2 259.31
60 2 263.372
61 2 267.54
62 2 271.842
63 2 276.256
64 2 280.724
65 2 285.172

I want to create a new column called 'Smoothed Y' by applying the following formula, where D = 15, delta (the triangular symbol) = 5, i = indx, x_alpha(tk) = LocalY, and x_alpha(ti) is the smoothed value.

I have tried the following code for first calculating Z (kernel below means the exp function):

t <- 0.5
dt <- 0.1
delta <- t/dt
d <- 3*delta
indx <- a$indx
for (i in indx) {
  initial <- i - d
  end <- i + d
  k <- c(initial:end)
  for (n in k) {
    kernel <- exp(-abs(i - n)/delta)
    z <- sum(kernel)
  }
}
a$z <- z
print(a)

NOTE: 'a' is the imported data frame containing the three columns above. Although the values of the computed function are fine, it doesn't sum up the values in the variable z. How can I do the summation over the range i-d to i+d for every indx value i?
You can use the convolve function. One thing you need to decide is what to do for indices closer to either end of the array than the width of the convolution kernel. One option is to simply use the partial kernel, rescaled so the weights still sum to 1:

smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D)/delta)
  r <- convolve(x, z, type = "open") / convolve(rep(1, length(x)), z, type = "open")
  r <- head(tail(r, -D), -D)
  r
}

With your array as y, the result is this:

> yy <- smooth(y, 15, 5)
> yy
 [1]  50.70804  52.10837  54.04788  56.33651  58.87682  61.61121  64.50214
 [8]  67.52265  70.65186  73.87197  77.16683  80.52193  83.92574  87.36969
[15]  90.84850  94.35809  98.15750 101.93317 105.67833 109.38989 113.06889
[22] 116.72139 120.35510 123.97707 127.59293 131.20786 134.82720 138.45720
[29] 142.10507 145.77820 149.48224 153.21934 156.98794 160.78322 164.60057
[36] 168.43699 172.29076 176.15989 180.04104 183.93127 187.83046 191.74004
[43] 195.66223 199.59781 203.54565 207.50342 211.46888 215.44064 219.41764
[50] 223.39908 227.05822 230.66813 234.22890 237.74176 241.20236 244.60039
[57] 247.91917 251.14346 254.25876 257.24891 260.09121 262.74910 265.16057
[64] 267.21598 268.70276

Of course, the problem with this is that the kernel ends up non-centered at the edges. This is a well-known problem, and there are ways to deal with it, but it complicates things. Plotting the data will show you the effects of this non-centering:

plot(y)
lines(yy)
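A quick sanity check of the rescaling step in this answer: smoothing a constant series should return it unchanged, because the partial-kernel weights are renormalized to sum to 1 at the edges. A small sketch (the constant 7 and the length 50 are arbitrary choices):

```r
# Same helper as in the answer above
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D)/delta)
  r <- convolve(x, z, type = "open") / convolve(rep(1, length(x)), z, type = "open")
  head(tail(r, -D), -D)
}

# A constant input comes back (numerically) unchanged, and at the same length
x <- rep(7, 50)
max(abs(smooth(x, 15, 5) - 7))  # ~ 0 up to floating-point error
```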
R efficiently add up tables in different order
At some point in my code, I get a list of tables that looks much like this:

[[1]]
   cluster_size start end number       p_value
13            2    12  13    131 4.209645e-233
12            1    12  12    100 6.166824e-185
22           11    12  22    132 6.916323e-143
23           12    12  23    133 1.176194e-139
13            1    13  13     31  3.464284e-38
13           68    13 117     34  3.275941e-37
23           78    23 117      2  4.503111e-32
....

[[2]]
   cluster_size start end number       p_value
13            2    12  13    131 4.209645e-233
12            1    12  12    100 6.166824e-185
22           11    12  22    132 6.916323e-143
23           12    12  23    133 1.176194e-139
13            1    13  13     31  3.464284e-38
....

While I don't show the full tables here, I know they are all the same size. What I want to do is make one table where I add up the p-values. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row in different list elements, so I can't just do a simple sum. The brute-force way to do this is to: 1) make a blank table, 2) copy in the appropriate $cluster_size, $start, $end, $number columns from the first table, and 3) pull the matching p-values from all the tables using which() statements. Is there a more clever way of doing this? Or is this pretty much it?

Edit: I was asked for a dput file of the data. It's located here: http://alrig.com/code/ In the sample case, the order of the rows happens to match. That will not always be the case.
Seems like you can do this in two steps:

1. Convert your list to a data.frame.
2. Use any of the split-apply-combine approaches to summarize.

Assuming your data was named X, here's what you could do:

library(plyr)
# need to convert to data.frame since all of your list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
#    cluster_size start end number          sump
# 1             1    12  12    100 5.550142e-184
# 2             1    13  13     31  3.117856e-37
# 3             1    22  22      1  9.000000e+00
# ...
# 29          105    23 117      2  6.271469e-16
# 30          106    22 146     13  7.266746e-25
# 31          107    23 146     12  1.382328e-25

Lots of other aggregation techniques are covered here. I'd look at the data.table package if your data is large.
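The same aggregation can also be sketched in base R without plyr, using aggregate(); the two small matrices below are invented stand-ins for the question's list, deliberately given different row orders:

```r
# Invented stand-ins for the question's list of matrices
X <- list(
  cbind(cluster_size = c(2, 1), start = c(12, 12), end = c(13, 12),
        number = c(131, 100), p_value = c(1e-10, 2e-10)),
  cbind(cluster_size = c(1, 2), start = c(12, 12), end = c(12, 13),
        number = c(100, 131), p_value = c(3e-10, 4e-10))
)

XDF <- as.data.frame(do.call(rbind, X))
# sum p_value within each unique (cluster_size, start, end, number) key,
# regardless of row order in the individual tables
res <- aggregate(p_value ~ cluster_size + start + end + number, data = XDF, FUN = sum)
```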