Combining "previous row" of same table and JOIN from different table in SQLite - sqlite

I have the following table
CREATE TABLE "shots" (
"player" INTEGER,
"tournament" TEXT,
"year" INTEGER,
"course" INTEGER,
"round" INTEGER,
"hole" INTEGER,
"shot" INTEGER,
"text" TEXT,
"distance" REAL,
"x" TEXT,
"y" TEXT,
"z" TEXT
);
With a sample of the data:
28237 470 2015 717 1 1 1 Shot 1 302 yds to left fairway, 257 yds to hole 10874 11451.596 10623.774 78.251
28237 470 2015 717 1 1 2 Shot 2 234 yds to right fairway, 71 ft to hole 8437 12150.454 10700.381 86.035
28237 470 2015 717 1 1 3 Shot 3 70 ft to green, 4 ft to hole 838 12215.728 10725.134 88.408
28237 470 2015 717 1 1 4 Shot 4 in the hole 46 12215.1 10729.1 88.371
28237 470 2015 717 1 2 1 Shot 1 199 yds to green, 29 ft to hole 7162 12776.03 10398.086 91.017
28237 470 2015 717 1 2 2 Shot 2 putt 26 ft 7 in., 2 ft 4 in. to hole 319 12749.444 10398.854 90.998
28237 470 2015 717 1 2 3 Shot 3 in the hole 28 12747.3 10397.6 91.027
28237 470 2015 717 1 3 1 Shot 1 296 yds to left intermediate, 204 yds to hole 10651 12596.857 9448.27 94.296
28237 470 2015 717 1 3 2 Shot 2 208 yds to green, 15 ft to hole 7478 12571.0 8825.648 94.673
28237 470 2015 717 1 3 3 Shot 3 putt 17 ft 6 in., 2 ft 5 in. to hole 210 12561.831 8840.539 94.362
For each shot, I want to get the previous location (x, y, z). I wrote the query below:
SELECT cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot, cur.x, cur.y, cur.z, prev.x, prev.y, prev.z
FROM shots cur
INNER JOIN shots prev
ON (cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot) =
(prev.player, prev.tournament, prev.year, prev.course, prev.round, prev.hole, prev.shot - 1)
This query basically takes forever. How can I rewrite it to make it faster?
In addition, I need to make an adjustment for the first shot on a hole (shot = 1). This shot is made from tee_x, tee_y and tee_z. These values are available in the table holes:
CREATE TABLE "holes" (
"tournament" TEXT,
"year" INTEGER,
"course" INTEGER,
"round" INTEGER,
"hole" INTEGER,
"tee_x" TEXT,
"tee_y" TEXT,
"tee_z" TEXT
);
With data:
470 2015 717 1 1 11450 10625 78.25
470 2015 717 1 2 12750 10400 91
470 2015 717 1 3 2565 8840.5 95
Thanks

First, you need a composite index to speed up the operation:
CREATE INDEX idx_shots ON shots (player, tournament, year, course, round, hole, shot);
With that index, your query should run faster:
SELECT cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot, cur.x, cur.y, cur.z,
prev.x AS prev_x, prev.y AS prev_y, prev.z AS prev_z
FROM shots cur LEFT JOIN shots prev
ON (cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot) =
(prev.player, prev.tournament, prev.year, prev.course, prev.round, prev.hole, prev.shot + 1);
The changes I made:
the join should be a LEFT join so that all rows are included, not only the ones that have a previous row
-1 should be +1 because the previous row's shot is 1 less than the current row's shot
added aliases for the previous row's x, y and z
But if your version of SQLite is 3.25.0+, it would be better to use the window function LAG() instead of a self join:
SELECT *,
LAG(x) OVER w AS prev_x,
LAG(y) OVER w AS prev_y,
LAG(z) OVER w AS prev_z
FROM shots
WINDOW w AS (PARTITION BY player, tournament, year, course, round, hole ORDER BY shot);
See the demo (I include the query plan for both queries where you can see the use of the composite index).
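The question also asks for the tee position to be used as the "previous" location of the first shot on a hole. One possible way to do that, sketched here with Python's built-in sqlite3 module so it can be run as-is, is to LEFT JOIN holes and wrap each LAG() in COALESCE(). The schema is trimmed to the columns needed and only three sample rows are loaded; both simplifications are assumptions made for the illustration, and window functions still require SQLite 3.25.0+.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed copies of the two tables ("text" and "distance" omitted for brevity).
conn.executescript("""
CREATE TABLE shots (player INTEGER, tournament TEXT, year INTEGER, course INTEGER,
                    round INTEGER, hole INTEGER, shot INTEGER, x TEXT, y TEXT, z TEXT);
CREATE TABLE holes (tournament TEXT, year INTEGER, course INTEGER, round INTEGER,
                    hole INTEGER, tee_x TEXT, tee_y TEXT, tee_z TEXT);
INSERT INTO shots VALUES
  (28237, '470', 2015, 717, 1, 1, 1, '11451.596', '10623.774', '78.251'),
  (28237, '470', 2015, 717, 1, 1, 2, '12150.454', '10700.381', '86.035'),
  (28237, '470', 2015, 717, 1, 2, 1, '12776.03', '10398.086', '91.017');
INSERT INTO holes VALUES
  ('470', 2015, 717, 1, 1, '11450', '10625', '78.25'),
  ('470', 2015, 717, 1, 2, '12750', '10400', '91');
""")

# LAG() is NULL for the first shot of each hole, so COALESCE falls back to the tee.
query = """
SELECT s.hole, s.shot,
       COALESCE(LAG(s.x) OVER w, h.tee_x) AS prev_x,
       COALESCE(LAG(s.y) OVER w, h.tee_y) AS prev_y,
       COALESCE(LAG(s.z) OVER w, h.tee_z) AS prev_z
FROM shots s
LEFT JOIN holes h
  ON (h.tournament, h.year, h.course, h.round, h.hole) =
     (s.tournament, s.year, s.course, s.round, s.hole)
WINDOW w AS (PARTITION BY s.player, s.tournament, s.year, s.course, s.round, s.hole
             ORDER BY s.shot)
ORDER BY s.hole, s.shot
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that LAG() returns the previous row that exists in the partition, so if a shot number is missing from the data, the row before the gap is used.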


Calculate mean by decile in Svydesign object

So, I'm working with the ENIGH database, which stands for "National Survey of Household Income and Expenses" in Spanish. It is an exercise conducted by the Mexican government and, like most surveys of its kind, it works with weights.
What I'm trying to do is calculate the mean, maximum and minimum household income by decile. In other words: what is the income of each 10%, grouping households based on their income?
To be honest, I haven't gotten very far, but this is what I have so far:
I need my svydesign object
Convert that into a table using svytable
Arrange using desc() on my income variable
ENIGH_design <- svydesign(id = ~upm, strata = ~est_dis, weights = ~factor_hog, data = ENIGH)
ENIGH_table <- svytable(~ing_cor, ENIGH_design)
Here is where it gets tricky: supposing I have 100 rows, I can't just take the first 10 of them because, once weights are taken into account, they might represent 9% or 20% (I'm just throwing out numbers) of the actual population.
I could use cut() on my income variable, but I would be ignoring the weights and the results would only be representative of the sample, not the total population.
I think that the best approach would be to use a combination of:
mutate() to create a new variable
if() in conjunction with mutate() to define which decile each row falls into
group_by() and mean() to calculate what I'm aiming for
This way I will have an extra variable which I could use to calculate whatever I want with whatever other variable I wish. But again, I haven't defined my groups, so it's pretty much useless.
Thank you for reading. Thank you for your help.
Database available: https://www.inegi.org.mx/programas/enigh/nc/2016/default.html#Datos_abiertos
Here is a glimpse of how my DB looks:
folioviv foliohog ubica_geo est_dis upm factor ing_cor
100587003 1 10010000 2 610 180 22,723
100587004 1 10010000 2 610 180 17,920
100587005 1 10010000 2 610 180 27,506
100587006 1 10010000 2 610 180 56,236
100605201 1 10010000 2 620 178 41,587
100605202 1 10010000 2 620 178 135,437
100605203 1 10010000 2 620 178 62,386
100605205 1 10010000 2 620 178 103,502
100605206 1 10010000 2 620 178 27,323
100606301 1 10010000 3 630 223 68,042
100606302 1 10010000 3 630 223 98,537
100606305 1 10010000 3 630 223 53,237
100606306 1 10010000 3 630 223 132,861
100609801 1 10010000 3 640 232 190,033
100609802 1 10010000 3 640 232 28,654
100609805 1 10010000 3 640 232 74,408
100631401 1 10010000 1 650 171 80,761
100711503 1 10010000 1 770 184 38,640
100711504 1 10010000 1 770 184 81,672
There are many more columns but they aren't necessary for this exercise.
Make a table (dataframe or data.table or tibble) that looks like this:
> dt
folioviv factor ing_tri
1 247 30000
2 200 15000
3 150 50000
incomes <- rep(dt$ing_tri, times = dt$factor)
deciles <- quantile(incomes, probs = seq(0.1, 1, by = 0.1), names = TRUE)
If I were you, I would try names = FALSE to make the result easier to manipulate. Otherwise it will be a named numeric vector, which is a bit annoying.
Oh, and in case you want to compute the mean, just do mean(incomes).
PS: The column folioviv is not actually necessary, but you may want to put it there just in case.
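Expanding every household with rep() can get large when the weights are big. As a rough alternative, the same deciles can be read off the cumulative weights without expanding anything. The sketch below is written in Python rather than R, and the helper names weighted_quantile and weighted_mean are made up for this illustration. It uses the "smallest value whose cumulative weight share reaches q" definition, which is an assumption: R's quantile() interpolates by default, so results can differ slightly from the rep() approach.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (0 < q <= 1)."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return pairs[-1][0]


def weighted_mean(values, weights):
    """Mean of the values, each counted as many times as its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Same toy table as above: ing_tri is the income, factor is the weight.
ing_tri = [30000, 15000, 50000]
factor = [247, 200, 150]

deciles = [weighted_quantile(ing_tri, factor, k / 10) for k in range(1, 11)]
mean_income = weighted_mean(ing_tri, factor)
print(deciles)
print(mean_income)
```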

DT::datatable in R, flexdashboard

Household Size 0 1 2 3 4 5+
Bedrooms Bedrooms Bedrooms Bedrooms Bedrooms Bedrooms
1 253 4486 2033 930 105 8
2 10 666 3703 947 85 7
3 4 68 1972 1621 52 5
4 1 12 680 1835 164 11
5+ 0 6 147 1230 721 122
I have the above dataframe where 'Bedrooms' is a label on the columns.
I'm trying to change this into a data table I can then use within rmarkdown to add into a flexdashboard. When I use the below code:
DT::datatable(df, rownames = FALSE, extensions = 'FixedColumns', escape=TRUE,options= list(bPaginate = FALSE))
I get the output:
Household Size 0 1 2 3 4 5+
1 253 4486 2033 930 105 8
2 10 666 3703 947 85 7
3 4 68 1972 1621 52 5
4 1 12 680 1835 164 11
5+ 0 6 147 1230 721 122
I have a few problems with this:
The labels that say 'Bedrooms' don't show, so there's no way of knowing what the numbers in the columns actually mean. I'd like to include the labels, or have a row on top of the column names that says "Number of Bedrooms" and spans all of the columns.
The columns Household Size and 5+ are wider than the rest of the columns. I want these to either be the same width, or Household Size to be slightly wider than the rest.
I think it's worth noting that the row 5+ and the column 5+ are both a new row/column that count any value above 5.
Also, this is just an extra but I'd like to colour the bottom left cells red and the top right cells green, is this possible?
I've figured out how to keep 'Bedrooms' in the column titles. It's possible to set the column names within DT::datatable using the code below;
DT::datatable(HS_BED_ALL, rownames = FALSE,
  colnames = c('Household Size', '0 Bedrooms', '1 Bedroom', '2 Bedrooms', '3 Bedrooms', '4 Bedrooms', '5+ Bedrooms'),
  extensions = 'FixedColumns', escape = TRUE,
  options = list(bPaginate = FALSE, dom = 't', buttons = c('excel'))) %>%
  formatStyle(1:7, fontSize = '14px')
Which gives the desired output.

adding and subtracting values in multiple data frames of different lengths - flow analysis

Thank you jakub and Hack-R!
Yes, these are my actual data. The data I am starting from are the following:
[A] #first, longer dataset
CODE_t2 VALUE_t2
111 3641
112 1691
121 1271
122 185
123 522
124 0
131 0
132 0
133 0
141 626
142 170
211 0
212 0
213 0
221 0
222 0
223 0
231 95
241 0
242 0
243 0
244 0
311 129
312 1214
313 0
321 0
322 0
323 565
324 0
331 0
332 0
333 0
334 0
335 0
411 0
412 0
421 0
422 0
423 0
511 6
512 0
521 0
522 0
523 87
In the above table, we can see the 44 land use CODES (which I inappropriately named "class" in my first entry) for a certain city. Some values are just 0, meaning that there are no land uses of that type in that city.
Starting from this table, which displays all the land use types for t2 and their corresponding values ("VALUE_t2") I have to reconstruct the previous amount of land uses ("VALUE_t1") per each type.
To do so, I have to add and subtract the value per each land use (if not 0) by using the "change land use table" from t2 to t1, which is the following:
[B] #second, shorter dataset
CODE_t2 CODE_t1 VALUE_CHANGE1
121 112 2
121 133 12
121 323 0
121 511 3
121 523 2
123 523 4
133 123 3
133 523 4
141 231 12
141 511 37
So, in order to get VALUE_t1 from VALUE_t2, I have, for instance, to subtract 2 + 12 + 0 + 3 + 2 hectares (the first 5 values of the second, shorter table) from the value of land use type/code 121 in the first, longer table (1271 ha), and add 2 hectares to land type 112, 12 hectares to land type 133, 3 hectares to land type 511 and 2 hectares to land type 523. I have to do that for all the land use types other than 0, and later also from t1 to t0.
What I need is a sort of loop that both adds and subtracts, for each land use type/code, the values needed to go from VALUE_t2 to VALUE_t1, and from VALUE_t1 to VALUE_t0.
Once I have estimated VALUE_t1 and VALUE_t0, I will put the values in a simple table showing the relative variation (the values here are not real):
CODE VALUE_t0 VALUE_t2 % VAR t2-t0
code1 50 100 ((100-50)/50)*100
code2 70 80 ((80-70)/70)*100
code3 45 34 ((34-45)/45)*100
What I could do so far is:
land_code <- names(A)[-1]
land_code
A$VALUE_t1 <- for(code in land_code{
cbind(A[1], A[land_code] - B[match(A$CODE_t2, B$CODE_t2), land_code])
}
If I use the loop I get an error, while if I take it away:
A$VALUE_t1 <- cbind(A[1], A[land_code] - B[match(A$CODE_t2, B$CODE_t2), land_code])
it works but I don't really get what I want to get... so far I was working on how to get a new column which would contain the new "add & subtract" values, but haven't succeeded yet. So I worked on how to get a new column which would at least match the land use types first, to then include the "add and subtract" formula.
Another problem is that, by using "match", I get a shorter A$VALUE_t1 table (13 rows instead of 44), while I would like to keep all the land use types in dataset A, because I will have then to match it with the table including VALUES_t0 (which I haven't shown here).
Sorry that I cannot do better than this at the moment... and I hope to have explained better what I have to do. I am extremely grateful for any help you can provide to me.
thanks a lot
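
The add-and-subtract rule described above can be sketched as a single pass over the change table. This is only an illustration of that arithmetic, written in Python rather than R: the function name roll_back is hypothetical, only the codes that appear in table [B] plus code 111 are copied from table [A], and it assumes every row of [B] means "VALUE_CHANGE1 hectares moved from CODE_t1 to CODE_t2". Because it starts from a full copy of the later values, every code is kept, including those with no changes.

```python
def roll_back(values_later, changes):
    """Undo each recorded change: take it off the later code, give it back to the earlier one."""
    values_earlier = dict(values_later)  # keeps every code, changed or not
    for code_later, code_earlier, amount in changes:
        values_earlier[code_later] = values_earlier.get(code_later, 0) - amount
        values_earlier[code_earlier] = values_earlier.get(code_earlier, 0) + amount
    return values_earlier


# Subset of table [A]: CODE_t2 -> VALUE_t2
value_t2 = {111: 3641, 112: 1691, 121: 1271, 123: 522, 133: 0,
            141: 626, 231: 95, 323: 565, 511: 6, 523: 87}

# Table [B]: (CODE_t2, CODE_t1, VALUE_CHANGE1)
changes = [(121, 112, 2), (121, 133, 12), (121, 323, 0), (121, 511, 3),
           (121, 523, 2), (123, 523, 4), (133, 123, 3), (133, 523, 4),
           (141, 231, 12), (141, 511, 37)]

value_t1 = roll_back(value_t2, changes)
for code in sorted(value_t1):
    print(code, value_t2[code], value_t1[code])
```

Calling roll_back again on value_t1 with the t1-to-t0 change table would give VALUE_t0 in the same way.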

Combining Two Rows with Different Levels according to Some Conditions into One in R

This is a part of my data: (The actual data contains about 10,000 observations with about 500 levels of SalesItem)
s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
s2<-c(155,153,154,150,176,165,159,143,179,150)
S<-data.frame(SalesItem=factor(s1), Sales=s2)
> str(S)
'data.frame': 10 obs. of 2 variables:
$ SalesItem: Factor w/ 10 levels "1008","1009",..: 1 2 3 4 5 6 7 8 9 10
$ Sales : num 155 153 154 150 176 165 159 143 179 150
What I want to do is: if diff(SalesItem) = 1, combine the levels of SalesItem into one. For example, the difference between SalesItem 1008 and 1009 equals one, so I want to rename SalesItem 1009 to 1008. Then I can later compute the sum of Sales for this SalesItem as one. My actual data has about 10,000 rows, so it is quite hard for me to do this one by one.
Is there a simple way to do that?
Clearly, the fact that you have converted the first column to a factor indicates that you might need those factors somewhere. So I would suggest that, instead of changing any of the columns, you add a third column to your data frame that keeps track of the SalesItem each value belongs to. Here are the steps:
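> s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
> s2<-c(155,153,154,150,176,165,159,143,179,150)
> s1 = as.integer(s1)
> # mark the first item of each run of consecutive codes (s1 must be sorted ascending)
> starts = c(TRUE, diff(s1) != 1)
> # repeat each run's first code across the whole run, so 1016..1019 all become 1016
> s3 = s1[starts][cumsum(starts)]
> S <- data.frame(SalesItem=s1, Sales=s2, ItemId=s3)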
then you can just count on the basis of the ItemId column.
This is not a terribly efficient solution, but since your data only contains 10000 records, it is not going to be a big problem.
Set up provided example data, but convert the SalesItem field to an integer so that the diff() operation makes sense.
> s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
> s2<-c(155,153,154,150,176,165,159,143,179,150)
> s1 = as.integer(s1)
> S<-data.frame(SalesItem=s1, Sales=s2)
Reorder data frame so that the SalesItem field is in ascending order (not necessary for current data set, but required for solution) then find the differences.
> S = S[order(S$SalesItem),]
> d = c(0, diff(S$SalesItem))
Duplicate the SalesItem data and then filter based on the values of the differences.
> labels = S$SalesItem
> # d[1] is 0, so the loop never reads labels[0]
> for (n in 1:nrow(S)) {if (d[n] == 1) labels[n] = labels[n-1]}
> S$labels = labels
The (temporary) labels field now has the required new values for the SalesItem field. Once you are happy that this is doing the right thing, you can modify last line in above code to simply over-write the existing SalesItem field.
> S
SalesItem Sales labels
1 1008 155 1008
2 1009 153 1008
3 1012 154 1012
4 1013 150 1012
5 1016 176 1016
6 1017 165 1016
7 1018 159 1016
8 1019 143 1016
9 1054 179 1054
10 1055 150 1054
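
With the labels in place, the sum of Sales per merged item that the question asks for is a plain grouped sum. The sketch below shows the whole idea in one pass, in Python rather than R and using the same ten rows: it starts a new label whenever the gap to the previous item is not exactly 1, and accumulates Sales under the current label. The variable names are chosen for this illustration only.

```python
sales_items = [1008, 1009, 1012, 1013, 1016, 1017, 1018, 1019, 1054, 1055]
sales = [155, 153, 154, 150, 176, 165, 159, 143, 179, 150]

totals = {}      # first item of each consecutive run -> summed Sales
label = None     # label of the run currently being filled
previous = None  # item seen on the previous iteration

# Sort by item so that consecutive codes are adjacent, as in the R solution.
for item, amount in sorted(zip(sales_items, sales)):
    if previous is None or item - previous != 1:
        label = item  # gap found: this item starts a new run
    totals[label] = totals.get(label, 0) + amount
    previous = item

print(totals)
```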

creating vector from 'if' function using apply in R

I'm trying to create a new vector in R using an 'if' function to pull out only certain values for the new array. Basically, I want to segregate data by day of week for each of several cities. How do I use the apply function to get only, say, Tuesdays in a new array for each city? Thanks
It sounds as though you don't want if or apply at all. The solution is simpler:
Suppose that your data frame is data. Then subset(data, Weekday == 3) should work.
You don't want to use the R if. Instead, use the subsetting function [:
dat <- read.table(text=" Date Weekday Holiday Atlanta Chicago Houston Tulsa
1 1/1/2008 3 1 313 313 361 123
2 1/2/2008 4 0 735 979 986 310
3 1/3/2008 5 0 690 904 950 286
4 1/4/2008 6 0 610 734 822 281
5 1/5/2008 7 0 482 633 622 211
6 1/6/2008 1 0 349 421 402 109", header=TRUE)
dat[ dat$Weekday==3, ]
