For my thesis, I am trying to use several variables from two types of surveys (the British Election Studies (BES) and the British Social Attitudes Survey (BSA)) and combine them into one dataset.
Currently, I have two datasets. The one with BES data looks like this (simplified):
| year | class | education | gender | age |
| ---- | ----- | --------- | ------ | --- |
| 1992 | working | A-levels | female | 32 |
| 1992 | middle | GCSE | male | 49 |
| 1997 | lower | Undergrad | female | 24 |
| 1997 | middle | GCSE | male | 29 |
The BSA data looks like this (again, simplified):
| year | class | education | gender | age |
| ---- | ----- | --------- | ------ | --- |
| 1992 | middle | A-levels | male | 22 |
| 1993 | working | GCSE | female | 45 |
| 1994 | upper | Postgrad | female | 38 |
| 1994 | middle | GCSE | male | 59 |
Basically, what I am trying to do is combine the two into one data frame that looks like this:
| year | class | education | gender | age |
| ---- | ----- | --------- | ------ | --- |
| 1992 | working | A-levels | female | 32 |
| 1992 | middle | GCSE | male | 49 |
| 1992 | middle | A-levels | male | 22 |
| 1993 | working | GCSE | female | 45 |
| 1994 | upper | Postgrad | female | 38 |
| 1994 | middle | GCSE | male | 59 |
| 1997 | lower | Undergrad | female | 24 |
| 1997 | middle | GCSE | male | 29 |
I have googled a lot about joins and merging, but I can't get it to work correctly. From what I understand, I believe I should join "by" the year variable, but is that correct? And how can I keep the computation from using a lot of memory (the actual datasets are about 30k rows for the BES and 130k rows for the BSA)? Is there a solution using either dplyr or data.table in R?
Any help is much appreciated!!!
This is not a "merge" (or join) operation, it's just row-concatenation. In R, that's done with rbind (which works for matrix and data.frame using different methods). (For perspective, there's also cbind, which concatenates by columns. Not applicable here.)
base R
rbind(BES, BSA)
# year class education gender age
# 1 1992 working A-levels female 32
# 2 1992 middle GCSE male 49
# 3 1997 lower Undergrad female 24
# 4 1997 middle GCSE male 29
# 5 1992 middle A-levels male 22
# 6 1993 working GCSE female 45
# 7 1994 upper Postgrad female 38
# 8 1994 middle GCSE male 59
other dialects
dplyr::bind_rows(BES, BSA)
data.table::rbindlist(list(BES, BSA))
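If the two surveys don't share exactly the same columns (common when pooling different survey waves), both dialects can still stack them and fill the missing cells with NA. A minimal sketch, assuming the data frames are named BES and BSA as above; the optional "survey" id column is only an illustration, not something the question asked for:

# dplyr matches columns by name and fills any that are missing with NA
combined <- dplyr::bind_rows(BES = BES, BSA = BSA, .id = "survey")

# data.table needs to be told to match by name and pad missing columns with NA
combined_dt <- data.table::rbindlist(
  list(BES = BES, BSA = BSA),
  use.names = TRUE, fill = TRUE, idcol = "survey"
)

On memory: each approach allocates one new combined table, and at roughly 160k rows that is small; data.table::rbindlist is generally the most frugal of the three if the data were much larger.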
I have a table in a MariaDB 10.3.27 database that looks like this:
+----+------------+---------------+-----------------+
| id | channel_id | timestamp | value |
+----+------------+---------------+-----------------+
| 1 | 2 | 1623669600000 | 2882.4449252449 |
| 2 | 1 | 1623669600000 | 295.46914369742 |
| 3 | 2 | 1623669630000 | 2874.46365243 |
| 4 | 1 | 1623669630000 | 295.68124546516 |
| 5 | 2 | 1623669660000 | 2874.9638893452 |
| 6 | 1 | 1623669660000 | 295.69561247521 |
| 7 | 2 | 1623669690000 | 2878.7120274678 |
and I want to have a result like this:
+------+-------+-------+
| hour | valhh | valwp |
+------+-------+-------+
| 0 | 419 | 115 |
| 1 | 419 | 115 |
| 2 | 419 | 115 |
| 3 | 419 | 115 |
| 4 | 419 | 115 |
| 5 | 419 | 115 |
| 6 | 419 | 115 |
| 7 | 419 | 115 |
| 8 | 419 | 115 |
| 9 | 419 | 115 |
| 10 | 419 | 115 |
| 11 | 419 | 115 |
| 12 | 419 | 115 |
| 13 | 419 | 115 |
| 14 | 419 | 115 |
| 15 | 419 | 115 |
| 16 | 419 | 115 |
| 17 | 419 | 115 |
| 18 | 419 | 115 |
| 19 | 419 | 115 |
| 20 | 419 | 115 |
| 21 | 419 | 115 |
| 22 | 419 | 115 |
| 23 | 419 | 115 |
+------+-------+-------+
but with valhh (valwp) being the average of the values for each hour of the day across all days where channel_id is 1 (2), not the overall average. So far, I've tried:
select h.hour, hh.valhh, wp.valwp from
(select hour(from_unixtime(timestamp/1000)) as hour from data) h,
(select hour(from_unixtime(timestamp/1000)) as hour, cast(avg(value) as integer) as valhh from data where channel_id = 1) hh,
(select hour(from_unixtime(timestamp/1000)) as hour, cast(avg(value) as integer) as valwp from data where channel_id = 2) wp group by h.hour;
which gives the result above (average of all values).
I can get what I want by querying the channels separately, i.e.:
select hour(from_unixtime(timestamp/1000)) as hour, cast(avg(value) as integer) as value from data where channel_id = 1 group by hour;
gives
+------+-------+
| hour | value |
+------+-------+
| 0 | 326 |
| 1 | 145 |
| 2 | 411 |
| 3 | 142 |
| 4 | 143 |
| 5 | 171 |
| 6 | 160 |
| 7 | 487 |
| 8 | 408 |
| 9 | 186 |
| 10 | 214 |
| 11 | 199 |
| 12 | 942 |
| 13 | 521 |
| 14 | 196 |
| 15 | 247 |
| 16 | 364 |
| 17 | 252 |
| 18 | 392 |
| 19 | 916 |
| 20 | 1024 |
| 21 | 1524 |
| 22 | 561 |
| 23 | 249 |
+------+-------+
but I want to have both channels in one result set as separate columns.
How would I do that?
Thanks!
After a steep learning curve I think I figured it out:
select
hh.hour, hh.valuehh, wp.valuewp
from
(select
hour(from_unixtime(timestamp/1000)) as hour,
cast(avg(value) as integer) as valuehh
from data
where channel_id=1
group by hour) hh
inner join
(select
hour(from_unixtime(timestamp/1000)) as hour,
cast(avg(value) as integer) as valuewp
from data
where channel_id=2
group by hour) wp
on hh.hour = wp.hour;
gives
+------+---------+---------+
| hour | valuehh | valuewp |
+------+---------+---------+
| 0 | 300 | 38 |
| 1 | 162 | 275 |
| 2 | 338 | 668 |
| 3 | 166 | 38 |
| 4 | 152 | 38 |
| 5 | 176 | 37 |
| 6 | 174 | 38 |
| 7 | 488 | 36 |
| 8 | 553 | 37 |
| 9 | 198 | 36 |
| 10 | 214 | 38 |
| 11 | 199 | 612 |
| 12 | 942 | 40 |
| 13 | 521 | 99 |
| 14 | 187 | 38 |
| 15 | 209 | 38 |
| 16 | 287 | 39 |
| 17 | 667 | 37 |
| 18 | 615 | 39 |
| 19 | 854 | 199 |
| 20 | 1074 | 44 |
| 21 | 1470 | 178 |
| 22 | 665 | 37 |
| 23 | 235 | 38 |
+------+---------+---------+
I need to run the Mann-Kendall test (package trend in R, https://cran.r-project.org/web/packages/trend/index.html) on time series of varying length. Currently the analysis runs from a start year that I specify manually, but that may not be the actual start year: many of my sites have different start years, and some have different end years as well. I condensed my data into the table below. This is water quality data, so it has missing values and varying start/end dates.
I also have NAs both in the middle of the time series and at the beginning. I would like to smooth over NAs that fall in the middle of a series; if the NAs are at the beginning, I would like the series to start at the first actual value.
+---------+------------+------+--------------+-------------+-------------+---------------+--------------+
| SITE_ID | PROGRAM_ID | YEAR | ANC_UEQ_L | NO3_UEQ_L | SO4_UEQ_L | SBC_ALL_UEQ_L | SBC_NA_UEQ_L |
+---------+------------+------+--------------+-------------+-------------+---------------+--------------+
| 1234 | Alpha | 1992 | 36.12 | 0.8786 | 91.90628571 | 185.5595714 | 156.2281429 |
| 1234 | Alpha | 1993 | 22.30416667 | 2.671258333 | 86.85733333 | 180.5109167 | 154.1934167 |
| 1234 | Alpha | 1994 | 25.25166667 | 3.296475 | 92.00533333 | 184.3589167 | 157.3889167 |
| 1234 | Alpha | 1995 | 23.39166667 | 1.753436364 | 97.58981818 | 184.5251818 | 160.2047273 |
| 5678 | Beta | 1983 | 4.133333333 | 20 | 134.4333333 | 182.1 | 157.4 |
| 5678 | Beta | 1984 | 2.6 | 21.85 | 137.78 | 170.67 | 150.64 |
| 5678 | Beta | 1985 | 3.58 | 20.85555556 | 133.7444444 | 168.82 | 150.09 |
| 5678 | Beta | 1986 | -5.428571429 | 40.27142857 | 124.9 | 152.4 | 136.2142857 |
| 5678 | Beta | 1987 | NA | 13.75 | 122.75 | 137.4 | 126.3 |
| 5678 | Beta | 1988 | 4.666666667 | 26.13333333 | 123.7666667 | 174.9166667 | 155.4166667 |
| 5678 | Beta | 1989 | 6.58 | 31.91 | 124.63 | 167.39 | 148.68 |
| 5678 | Beta | 1990 | 2.354545455 | 39.49090909 | 121.6363636 | 161.6454545 | 144.5545455 |
| 5678 | Beta | 1991 | 5.973846154 | 30.54307692 | 119.8138462 | 165.4661185 | 147.0807338 |
| 5678 | Beta | 1992 | 4.174359 | 16.99051285 | 124.1753846 | 148.5505115 | 131.8894862 |
| 5678 | Beta | 1993 | 6.05 | 19.76125 | 117.3525 | 148.3025 | 131.3275 |
| 5678 | Beta | 1994 | -2.51666 | 17.47167 | 117.93266 | 129.64167 | 114.64501 |
| 5678 | Beta | 1995 | 8.00936875 | 22.66188125 | 112.3575 | 166.1220813 | 148.7095813 |
| 9101 | Victor | 1980 | NA | NA | 94.075 | NA | NA |
| 9101 | Victor | 1981 | NA | NA | 124.7 | NA | NA |
| 9101 | Victor | 1982 | 33.26666667 | NA | 73.53333333 | 142.75 | 117.15 |
| 9101 | Victor | 1983 | 26.02 | NA | 94.9 | 147.96 | 120.44 |
| 9101 | Victor | 1984 | 20.96 | NA | 82.98 | 137.4 | 110.46 |
| 9101 | Victor | 1985 | 29.325 | 0.157843137 | 84.975 | 144.45 | 118.45 |
| 9101 | Victor | 1986 | 28.6 | 0.88504902 | 81.675 | 139.7 | 114.45 |
| 9101 | Victor | 1987 | 25.925 | 1.065441176 | 74.15 | 131.875 | 108.7 |
| 9101 | Victor | 1988 | 29.4 | 1.048529412 | 80.625 | 148.15 | 122.5 |
| 9101 | Victor | 1989 | 27.7 | 0.907598039 | 81.025 | 143.1 | 119.275 |
| 9101 | Victor | 1990 | 27.4 | 0.642647059 | 77.65 | 126.825 | 104.775 |
| 9101 | Victor | 1991 | 24.95 | 1.228921569 | 74.1 | 138.55 | 115.7 |
| 9101 | Victor | 1992 | 29.425 | 0.591911765 | 73.85 | 130.675 | 106.65 |
| 9101 | Victor | 1993 | 22.53333333 | 0.308169935 | 64.93333333 | 117.3666667 | 96.2 |
| 9101 | Victor | 1994 | 29.93333333 | 0.428431373 | 67.23333333 | 124.0666667 | 101.2333333 |
| 9101 | Victor | 1995 | 39.33333333 | 0.57875817 | 65.36666667 | 128.8333333 | 105.0666667 |
| 1121 | Charlie | 1987 | 12.39 | 0.65 | 99.48 | 136.37 | 107.75 |
| 1121 | Charlie | 1988 | 10.87333333 | 0.69 | 104.6133333 | 131.9 | 105.2 |
| 1121 | Charlie | 1989 | 5.57 | 1.09 | 105.46 | 136.125 | 109.5225 |
| 1121 | Charlie | 1990 | 13.4725 | 0.8975 | 99.905 | 134.45 | 108.9875 |
| 1121 | Charlie | 1991 | 11.3 | 0.805 | 100.605 | 134.3775 | 108.9725 |
| 1121 | Charlie | 1992 | 9.0025 | 7.145 | 99.915 | 136.8625 | 111.945 |
| 1121 | Charlie | 1993 | 7.7925 | 6.6 | 95.865 | 133.0975 | 107.4625 |
| 1121 | Charlie | 1994 | 7.59 | 3.7625 | 97.3575 | 129.635 | 104.465 |
| 1121 | Charlie | 1995 | 7.7925 | 1.21 | 100.93 | 133.9875 | 109.5025 |
| 3812 | Charlie | 1988 | 18.84390244 | 17.21142857 | 228.8684211 | 282.6540541 | 260.5648649 |
| 3812 | Charlie | 1989 | 11.7248 | 21.21363636 | 216.5973451 | 261.3711712 | 237.4929204 |
| 3812 | Charlie | 1990 | 2.368571429 | 35.23448276 | 216.7827586 | 286.0034483 | 264.3137931 |
| 3812 | Charlie | 1991 | 33.695 | 40.733 | 231.92 | 350.91075 | 328.443 |
| 3812 | Charlie | 1992 | 18.49111111 | 26.14818889 | 219.1488 | 301.3785889 | 281.8809222 |
| 3812 | Charlie | 1993 | 17.28181818 | 27.65394545 | 210.6605091 | 290.064 | 271.9205455 |
+---------+------------+------+--------------+-------------+-------------+---------------+--------------+
Here is the code that currently runs the time series analysis on my actual data if I set the start year late enough to skip the NAs in the earlier data. It works great for sites that have values for that whole span, but gives odd results once differing start/end years come into play.
Mann_Kendall_Values_Trimmed <- filter(LTM_Data_StackOverflow_9_22_2020, YEAR >1984) %>% #I manually trimmed the data here to prevent some errors
group_by(SITE_ID) %>%
  filter(n() > 2) %>% #keep sites with more than 2 rows of data
gather(parameter, value, SO4_UEQ_L, ANC_UEQ_L, NO3_UEQ_L, SBC_ALL_UEQ_L, SBC_NA_UEQ_L ) %>%
#, DOC_MG_L)
group_by(parameter, SITE_ID, PROGRAM_ID) %>% nest() %>%
mutate(ts_out = map(data, ~ts(.x$value, start=c(1985, 1), end=c(1995, 1), frequency=1))) %>%
#this is where I would like to specify the first year in the actual time series with data. End year would also be tied to the last year of data.
mutate(mk_res = map(ts_out, ~mk.test(.x, alternative = c("two.sided", "greater", "less"),continuity = TRUE)),
sens = map(ts_out, ~sens.slope(.x, conf.level = 0.95))) %>%
#run the Mann-Kendall Test
mutate(mk_stat = map_dbl(mk_res, ~.x$statistic),
p_val = map_dbl(mk_res, ~.x$p.value)
, sens_slope = map_dbl(sens, ~.x$estimates)
) %>%
#Pull the parameters we need
select(SITE_ID, PROGRAM_ID, parameter, sens_slope, p_val, mk_stat) %>%
mutate(output = case_when(
sens_slope == 0 ~ "NC",
sens_slope > 0 & p_val < 0.05 ~ "INS",
sens_slope > 0 & p_val > 0.05 ~ "INNS",
sens_slope < 0 & p_val < 0.05 ~ "DES",
sens_slope < 0 & p_val > 0.05 ~ "DENS"))
How do I handle the NAs in the middle of the data?
How do I get each time series to automatically start and end on the years with actual data? For reference, each SITE_ID has the following date range (not including NAs):
+-----------+-----------+-------------------+-----------+-----------+
| 1234 | 5678 | 9101 | 1121 | 3812 |
+-----------+-----------+-------------------+-----------+-----------+
| 1992-1995 | 1983-1995 | 1982 OR 1985-1995 | 1987-1995 | 1988-1993 |
+-----------+-----------+-------------------+-----------+-----------+
To make the data more consistent, I decided to organize the data as individual time-series (grouping by parameter, year, site_id, program) in Oracle before importing into R.
+---------+------------+------+--------------+-----------+
| SITE_ID | PROGRAM_ID | YEAR | Value | Parameter |
+---------+------------+------+--------------+-----------+
| 1234 | Alpha | 1992 | 36.12 | ANC |
| 1234 | Alpha | 1993 | 22.30416667 | ANC |
| 1234 | Alpha | 1994 | 25.25166667 | ANC |
| 1234 | Alpha | 1995 | 23.39166667 | ANC |
| 5678 | Beta | 1990 | 2.354545455 | ANC |
| 5678 | Beta | 1991 | 5.973846154 | ANC |
| 5678 | Beta | 1992 | 4.174359 | ANC |
| 5678 | Beta | 1993 | 6.05 | ANC |
| 5678 | Beta | 1994 | -2.51666 | ANC |
| 5678 | Beta | 1995 | 8.00936875 | ANC |
| 9101 | Victor | 1990 | 27.4 | ANC |
| 9101 | Victor | 1991 | 24.95 | ANC |
| 9101 | Victor | 1992 | 29.425 | ANC |
| 9101 | Victor | 1993 | 22.53333333 | ANC |
| 9101 | Victor | 1994 | 29.93333333 | ANC |
| 9101 | Victor | 1995 | 39.33333333 | ANC |
| 1121 | Charlie | 1990 | 13.4725 | ANC |
| 1121 | Charlie | 1991 | 11.3 | ANC |
| 1121 | Charlie | 1992 | 9.0025 | ANC |
| 1121 | Charlie | 1993 | 7.7925 | ANC |
| 1121 | Charlie | 1994 | 7.59 | ANC |
| 1121 | Charlie | 1995 | 7.7925 | ANC |
| 3812 | Charlie | 1990 | 2.368571429 | ANC |
| 3812 | Charlie | 1991 | 33.695 | ANC |
| 3812 | Charlie | 1992 | 18.49111111 | ANC |
| 3812 | Charlie | 1993 | 17.28181818 | ANC |
| 1234 | Alpha | 1992 | 0.8786 | NO3 |
| 1234 | Alpha | 1993 | 2.671258333 | NO3 |
| 1234 | Alpha | 1994 | 3.296475 | NO3 |
| 1234 | Alpha | 1995 | 1.753436364 | NO3 |
| 5678 | Beta | 1990 | 39.49090909 | NO3 |
| 5678 | Beta | 1991 | 30.54307692 | NO3 |
| 5678 | Beta | 1992 | 16.99051285 | NO3 |
| 5678 | Beta | 1993 | 19.76125 | NO3 |
| 5678 | Beta | 1994 | 17.47167 | NO3 |
| 5678 | Beta | 1995 | 22.66188125 | NO3 |
| 9101 | Victor | 1990 | 0.642647059 | NO3 |
| 9101 | Victor | 1991 | 1.228921569 | NO3 |
| 9101 | Victor | 1992 | 0.591911765 | NO3 |
| 9101 | Victor | 1993 | 0.308169935 | NO3 |
| 9101 | Victor | 1994 | 0.428431373 | NO3 |
| 9101 | Victor | 1995 | 0.57875817 | NO3 |
| 1121 | Charlie | 1990 | 0.8975 | NO3 |
| 1121 | Charlie | 1991 | 0.805 | NO3 |
| 1121 | Charlie | 1992 | 7.145 | NO3 |
| 1121 | Charlie | 1993 | 6.6 | NO3 |
| 1121 | Charlie | 1994 | 3.7625 | NO3 |
| 1121 | Charlie | 1995 | 1.21 | NO3 |
| 3812 | Charlie | 1990 | 35.23448276 | NO3 |
| 3812 | Charlie | 1991 | 40.733 | NO3 |
| 3812 | Charlie | 1992 | 26.14818889 | NO3 |
| 3812 | Charlie | 1993 | 27.65394545 | NO3 |
+---------+------------+------+--------------+-----------+
Once in R, I was able to change the beginning of the code to the following; the remaining code stayed the same.
Mann_Kendall_Values_Trimmed <- filter(LTM_Data_StackOverflow_9_22_2020, YEAR >1989, PARAMETER != 'doc') %>%
#filter data to start in 1990 as this removes nulls from pre-1990 sampling
group_by(SITE_ID) %>%
filter(n() > 10) %>% #filter sites with more than 10 years of data
#gather(SITE_ID, PARAMETER, VALUE) #I believe this is now redundant %>%
group_by(PARAMETER, SITE_ID, PROGRAM_ID) %>% nest() %>%
mutate(ts_out = map(data, ~ts(.x$VALUE, start=c(min(.x$YEAR), 1), c(max(.x$YEAR), 1), frequency=1)))
This achieved the result I needed for all time series long enough (greater than 2 values, I believe) to run the Mann-Kendall test. The parameter that had those issues will be dealt with in separate R code.
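For the other part of the question, NAs in the middle of a series, one option (not part of the workflow above) is to trim leading/trailing NAs and linearly interpolate the interior gaps before building the ts object, since mk.test() will not, as far as I know, accept missing values. A rough sketch, assuming the nested data holds YEAR and VALUE columns as in the revised code, and using zoo::na.approx (an assumption; linear interpolation may or may not be appropriate for water quality data):

library(dplyr)
library(zoo)   # for na.approx()

# hypothetical helper, not from the original post
make_ts <- function(d) {
  d <- arrange(d, YEAR)
  obs <- range(which(!is.na(d$VALUE)))   # first and last observed positions
  d <- d[obs[1]:obs[2], ]                # drop leading/trailing NAs
  ts(zoo::na.approx(d$VALUE),            # interpolate interior NAs
     start = c(min(d$YEAR), 1), frequency = 1)
}

# which could replace the ts() call above:
#   mutate(ts_out = map(data, make_ts))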
I have a sample table which looks somewhat like this:
| Date | Vendor_Id | Requisitioner | Amount |
|------------|:---------:|--------------:|--------|
| 1/17/2019 | 98 | John | 2405 |
| 4/30/2019 | 1320 | Dave | 1420 |
| 11/29/2018 | 3887 | Michele | 596 |
| 11/29/2018 | 3887 | Michele | 960 |
| 11/29/2018 | 3887 | Michele | 1158 |
| 9/21/2018 | 4919 | James | 857 |
| 10/25/2018 | 4919 | Paul | 1162 |
| 10/26/2018 | 4919 | Echo | 726 |
| 10/26/2018 | 4919 | Echo | 726 |
| 10/29/2018 | 4919 | Andrew | 532 |
| 10/29/2018 | 4919 | Andrew | 532 |
| 11/12/2018 | 4919 | Carlos | 954 |
| 5/21/2018 | 2111 | June | 3580 |
| 5/23/2018 | 7420 | Justin | 224 |
| 5/24/2018 | 1187 | Sylvia | 3442 |
| 5/25/2018 | 1187 | Sylvia | 4167 |
| 5/30/2018 | 3456 | Ama | 4580 |
Within each requisitioner and vendor ID, I need to find the difference between consecutive dates, so that the result looks something like this:
| Date | Vendor_Id | Requisitioner | Amount | Date_Diff |
|------------|:---------:|--------------:|--------|-----------|
| 1/17/2019 | 98 | John | 2405 | NA |
| 4/30/2019 | 1320 | Dave | 1420 | 103 |
| 11/29/2018 | 3887 | Michele | 596 | NA |
| 11/29/2018 | 3887 | Michele | 960 | 0 |
| 11/29/2018 | 3887 | Michele | 1158 | 0 |
| 9/21/2018 | 4919 | James | 857 | NA |
| 10/25/2018 | 4919 | Paul | 1162 | NA |
| 10/26/2018 | 4919 | Paul | 726 | 1 |
| 10/26/2018 | 4919 | Paul | 726 | 0 |
| 10/29/2018 | 4919 | Paul | 532 | 3 |
| 10/29/2018 | 4919 | Paul | 532 | 0 |
| 11/12/2018 | 4917 | Carlos | 954 | NA |
| 5/21/2018 | 2111 | Justin | 3580 | NA |
| 5/23/2018 | 7420 | Justin | 224 | 2 |
| 5/24/2018 | 1187 | Sylvia | 3442 | NA |
| 5/25/2018 | 1187 | Sylvia | 4167 | 1 |
| 5/30/2018 | 3456 | Ama | 4580 | NA |
Now, I need to subset the rows where, within each requisitioner and vendor ID, the dates are <=3 days apart and the sum of the amounts is >5000. The final output should be something like this:
| Date | Vendor_Id | Requisitioner | Amount | Date_Diff |
|-----------|:---------:|--------------:|--------|-----------|
| 5/24/2018 | 1187 | Sylvia | 3442 | NA |
| 5/25/2018 | 1187 | Sylvia | 4167 | 1 |
Initially, when I tried working with date difference, I used the following code:
df=df %>% mutate(diffdate= difftime(Date,lag(Date,1)))
However, the differences don't make sense: they are huge numbers such as 86400 and other seemingly random values. I first tried the above code when the 'Date' field was of type POSIXct; after I changed it to the 'Date' type, the differences were still the same huge numbers.
Also, is it possible to group the date differences based on requisitioners and vendor id's as mentioned in the 2nd table above?
EDIT:
I'm coming across a new challenge now. In the problem set, I need to filter out the values whose date differences are less than 3 days. Let us assume that the table with date difference appears something like this:
| MasterCalendarDate | Vendor_Id | Requisitioner | Amount | diffdate |
|--------------------|:---------:|--------------:|--------|----------|
| 1/17/2019 | 98 | John | 2405 | #N/A |
| 4/30/2019 | 1320 | Dave | 1420 | 103 |
| 11/29/2018 | 3887 | Michele | 596 | #N/A |
| 11/29/2018 | 3887 | Michele | 960 | 0 |
| 11/29/2018 | 3887 | Michele | 1158 | 0 |
| 9/21/2018 | 4919 | Paul | 857 | #N/A |
| 10/25/2018 | 4919 | Paul | 1162 | 34 |
| 10/26/2018 | 4919 | Paul | 726 | 1 |
| 10/26/2018 | 4919 | Paul | 726 | 0 |
Looking at the requisitioner 'Paul', the date difference between 9/21/2018 and 10/25/2018 is 34 days, and between 10/25/2018 and 10/26/2018 it is 1 day. However, when I filter the data for date differences <=3 days, I lose the 10/25/2018 row because of its 34-day difference. I have multiple such occurrences. How can I fix it?
I think you need to convert your date variable using as.Date(); then you can compute the lagged time difference using difftime().
library(dplyr)   # needed for the pipe, arrange, group_by, mutate, filter

# create toy data frame
df <- data.frame(date=as.Date(paste(sample(2018:2019,100,T),
sample(1:12,100,T),
sample(1:28,100,T),sep = '-')),
req=sample(letters[1:10],100,T),
amount=sample(100:10000,100,T))
# compute lagged time difference in days -- diff output is numeric
df %>% arrange(req,date) %>% group_by(req) %>%
mutate(diff=as.numeric(difftime(date,lag(date),units='days')))
# as above plus filtering based on time difference and amount
df %>% arrange(req,date) %>% group_by(req) %>%
mutate(diff=as.numeric(difftime(date,lag(date),units='days'))) %>%
filter(diff<10 | is.na(diff), amount>5000)
# A tibble: 8 x 4
# Groups: req [7]
date req amount diff
<date> <fct> <int> <dbl>
1 2018-05-13 a 9062 NA
2 2019-05-07 b 9946 2
3 2018-02-03 e 5697 NA
4 2018-03-12 g 7093 NA
5 2019-05-16 g 5631 3
6 2018-03-06 h 7114 6
7 2018-08-12 i 5151 6
8 2018-04-03 j 7738 8
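The EDIT in the question (keeping every row that belongs to a cluster of purchases at most 3 days apart whose total exceeds 5000, rather than filtering on each row's own lag) isn't covered by the code above. One possible way, sketched under the assumption that the real columns are named Date, Vendor_Id, Requisitioner and Amount as in the question's tables and that Date is already a Date: start a new run whenever the gap exceeds 3 days, then filter whole runs.

library(dplyr)

df %>%
  arrange(Vendor_Id, Requisitioner, Date) %>%
  group_by(Vendor_Id, Requisitioner) %>%
  mutate(diffdate = as.numeric(difftime(Date, lag(Date), units = 'days')),
         # a gap of more than 3 days (or the first row of a group) opens a new run
         run_id = cumsum(is.na(diffdate) | diffdate > 3)) %>%
  group_by(Vendor_Id, Requisitioner, run_id) %>%
  filter(n() > 1, sum(Amount) > 5000) %>%   # keep runs of 2+ rows totalling > 5000
  ungroup()

On the sample data this keeps only the two Sylvia rows for vendor 1187 (3442 + 4167 = 7609), matching the desired output.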
I am trying to find the symbol with the smallest difference, but I don't know what to do after computing the difference in order to compare them.
I have this set:
+------+------+-------------+-------------+--------------------+------+--------+
| clid | cust | Min | Max | Difference | Qty | symbol |
+------+------+-------------+-------------+--------------------+------+--------+
| 102 | C6 | 11.8 | 12.72 | 0.9199999999999999 | 1500 | GE |
| 110 | C3 | 44 | 48.099998 | 4.099997999999999 | 2000 | INTC |
| 115 | C4 | 1755.25 | 1889.650024 | 134.40002400000003 | 2000 | AMZN |
| 121 | C9 | 28.25 | 30.27 | 2.0199999999999996 | 1500 | BAC |
| 130 | C7 | 8.48753 | 9.096588 | 0.609058000000001 | 5000 | F |
| 175 | C3 | 6.41 | 7.71 | 1.2999999999999998 | 1500 | SBS |
| 204 | C5 | 6.41 | 7.56 | 1.1499999999999995 | 5000 | SBS |
| 208 | C2 | 1782.170044 | 2004.359985 | 222.1899410000001 | 5000 | AMZN |
| 224 | C10 | 153.350006 | 162.429993 | 9.079986999999988 | 1500 | FB |
| 269 | C6 | 355.980011 | 392.299988 | 36.319976999999994 | 2000 | BA |
+------+------+-------------+-------------+--------------------+------+--------+
So far I have this query:
select d.clid,
d.cust,
MIN(f.fillPx) as Min,
MAX(f.fillPx) as Max,
MAX(f.fillPx)-MIN(f.fillPx) as Difference,
d.Qty,
d.symbol
from orders d
inner join mp f on d.clid=f.clid
group by f.clid
having SUM(f.fillQty) < d.Qty
order by d.clid;
What am I missing so that I can compare the min and max and get the symbol with the smallest difference?
mp table:
+------+------+--------+------+------+---------+-------------+--------+
| clid | cust | symbol | side | oQty | fillQty | fillPx | execid |
+------+------+--------+------+------+---------+-------------+--------+
| 123 | C2 | SBS | SELL | 5000 | 273 | 7.37 | 1 |
| 157 | C9 | C | SELL | 1500 | 167 | 69.709999 | 2 |
| 254 | C9 | GE | SELL | 5000 | 440 | 13.28 | 3 |
| 208 | C2 | AMZN | SELL | 5000 | 714 | 1864.420044 | 4 |
| 102 | C6 | GE | SELL | 1500 | 136 | 12.32 | 5 |
| 160 | C7 | INTC | SELL | 1500 | 267 | 44.5 | 6 |
| 145 | C10 | GE | SELL | 5000 | 330 | 13.28 | 7 |
| 208 | C2 | AMZN | SELL | 5000 | 1190 | 1788.609985 | 8 |
| 161 | C1 | C | SELL | 1500 | 135 | 72.620003 | 9 |
| 181 | C5 | FCX | BUY | 1500 | 84 | 12.721739 | 10 |
orders table:
+------+------+--------+------+------+
| cust | side | symbol | qty | clid |
+------+------+--------+------+------+
| C1 | SELL | C | 1500 | 161 |
| C9 | SELL | INTC | 2000 | 231 |
| C10 | SELL | BMY | 1500 | 215 |
| C1 | BUY | SBS | 2000 | 243 |
| C4 | BUY | AMZN | 2000 | 226 |
| C10 | BUY | C | 1500 | 211 |
If you want one symbol, you can use order by and limit:
select d.clid,
d.cust,
MIN(f.fillPx) as Min,
MAX(f.fillPx) as Max,
MAX(f.fillPx)-MIN(f.fillPx) as Difference,
d.Qty,
d.symbol
from orders d join
mp f
on d.clid = f.clid
group by d.clid, d.cust, d.Qty, d.symbol
having SUM(f.fillQty) < d.Qty
order by difference
limit 1;
Notice that I added the rest of the unaggregated columns to the group by.
Say I have a raw dataset (already in a data frame, which I can easily convert with as.xts.data.table); the data frame looks like the following:
| Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature |
| ---- | ---- | ----- | ------- | ------------------- | ------------------- | ---------------------- |
| 2018-02-03 | New York City | NY | US | 18 | 22 | 19 |
| 2018-02-03 | London | LDN | UK | 10 | 25 | 15 |
| 2018-02-03 | Singapore | SG | SG | 28 | 32 | 29 |
| 2018-02-02 | New York City | NY | US | 12 | 30 | 18 |
| 2018-02-02 | London | LDN | UK | 12 | 15 | 14 |
| 2018-02-02 | Singapore | SG | SG | 27 | 31 | 30 |
and so on (many more cities and many more days).
And I would like to make this to show both the current day temperature and the day over day change from the previous day, together with the other info on the city (state, country). i.e., the new data frame should be something like (from the example above):
| Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature | ChangeInDailyMin | ChangeInDailyMax | ChangeInDailyMedian |
| ---- | ---- | ----- | ------- | ------------------- | ------------------- | ---------------------- | ---------------- | ---------------- | ------------------- |
| 2018-02-03 | New York City | NY | US | 18 | 22 | 19 | 6 | -8 | 1 |
| 2018-02-03 | London | LDN | UK | 10 | 25 | 15 | -2 | 10 | 1 |
| 2018-02-03 | Singapore | SG | SG | 28 | 32 | 29 | 1 | 1 | -1 |
| 2018-02-03 | New York City | NY | US | ... |
and so on; i.e., add 3 more columns showing the day-over-day change.
Note that the data frame may not have data for every day; the change is defined as the temperature on day t minus the temperature on the most recent earlier date for which I have data.
I tried to use the shift() function, but R complained about the := operator.
Is there any way in R I could get this to work?
Thanks!
You can use dplyr::mutate_at and the lubridate package to transform the data into the desired format. The data needs to be arranged by date, and the difference between the current and the previous record can then be taken with dplyr::lag.
library(dplyr)
library(lubridate)
df %>% mutate_if(is.character, funs(trimws)) %>% #Trim any blank spaces
mutate(Date = ymd(Date)) %>% #Convert to Date/Time
group_by(City, State, Country) %>%
arrange(City, State, Country, Date) %>% #Order data date
mutate_at(vars(starts_with("Daily")), funs(Change = . - lag(.))) %>%
filter(!is.na(DailyMinTemperature_Change))
Result:
# # A tibble: 3 x 10
# # Groups: City, State, Country [3]
# Date City State Country DailyMinTemperature DailyMaxTemperature DailyMedianTemperature DailyMinTemperature_Change DailyMaxT~ DailyMed~
# <date> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <int>
# 1 2018-02-03 London LDN UK 10.0 25.0 15 -2.00 10.0 1
# 2 2018-02-03 New York City NY US 18.0 22.0 19 6.00 - 8.00 1
# 3 2018-02-03 Singapore SG SG 28.0 32.0 29 1.00 1.00 -1
#
Data:
df <- read.table(text =
"Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature
2018-02-03 | New York City | NY | US | 18 | 22 | 19
2018-02-03 | London | LDN |UK | 10 | 25 | 15
2018-02-03 | Singapore | SG | SG | 28 | 32 | 29
2018-02-02 | New York City | NY | US | 12 | 30 | 18
2018-02-02 | London | LDN | UK | 12 | 15 | 14
2018-02-02 | Singapore | SG | SG | 27 | 31 | 30",
header = TRUE, stringsAsFactors = FALSE, sep = "|")
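Since the question mentions shift() and the := operator, here is a rough data.table sketch of the same idea. It assumes df is the toy data frame above (the := error in the question most likely came from applying := to a plain data frame rather than a data.table), and the ChangeIn... column names are only illustrative:

library(data.table)

dt <- as.data.table(df)                          # := only works on a data.table
dt[, Date := as.IDate(trimws(Date))]             # parse the date column
cols <- grep("^Daily", names(dt), value = TRUE)  # the three temperature columns
newcols <- paste0("ChangeIn", cols)
setorder(dt, City, State, Country, Date)
dt[, (newcols) := lapply(.SD, function(x) x - shift(x)),  # shift() = previous row within group
   by = .(City, State, Country), .SDcols = cols]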