I have a data frame that looks like this:
user_id date price
2375 2012/12/12 00:00:00.000 47.900000
2375 2013/01/16 00:00:00.000 47.900000
2375 2013/01/16 00:00:00.000 47.900000
2375 2013/05/08 00:00:00.000 47.900000
2375 2013/06/01 00:00:00.000 47.900000
2375 2013/10/02 00:00:00.000 26.500000
2375 2014/01/22 00:00:00.000 47.900000
2375 2014/03/21 00:00:00.000 47.900000
2375 2014/05/24 00:00:00.000 47.900000
2375 2015/04/11 00:00:00.000 47.900000
7419 2012/12/12 00:00:00.000 7.174977
7419 2013/01/02 00:00:00.000 27.500000
7419 2013/01/18 00:00:00.000 22.901482
7419 2013/02/08 00:00:00.000 27.500000
7419 2013/03/06 00:00:00.000 8.200000
7419 2013/04/03 00:00:00.000 22.901482
7419 2013/04/03 00:00:00.000 8.200000
7419 2013/04/03 00:00:00.000 6.900000
7419 2013/04/17 00:00:00.000 7.500000
7419 2013/04/17 00:00:00.000 7.500000
7419 2013/05/23 00:00:00.000 7.500000
7419 2013/06/07 00:00:00.000 27.500000
7419 2013/06/07 00:00:00.000 7.500000
7419 2013/06/07 00:00:00.000 7.500000
7419 2013/06/07 00:00:00.000 5.829188
7419 2013/07/10 00:00:00.000 27.500000
7419 2013/08/21 00:00:00.000 7.500000
7419 2013/08/21 00:00:00.000 27.500000
7419 2013/09/06 00:00:00.000 27.500000
7419 2013/12/27 00:00:00.000 7.500000
7419 2014/01/10 00:00:00.000 27.500000
7419 2014/02/16 00:00:00.000 27.500000
7419 2014/05/14 00:00:00.000 41.900000
7419 2014/07/03 00:00:00.000 26.500000
7419 2014/09/26 00:00:00.000 26.500000
7419 2014/09/26 00:00:00.000 7.500000
7419 2014/10/22 00:00:00.000 27.500000
7419 2014/11/15 00:00:00.000 6.900000
7419 2014/11/27 00:00:00.000 26.500000
7419 2014/12/12 00:00:00.000 40.900000
7419 2015/01/14 00:00:00.000 27.200000
7419 2015/02/24 00:00:00.000 26.500000
7419 2015/03/17 00:00:00.000 40.900000
7419 2015/05/02 00:00:00.000 27.200000
7419 2015/05/02 00:00:00.000 26.500000
7419 2015/05/15 00:00:00.000 7.900000
7419 2015/05/20 00:00:00.000 27.500000
7419 2015/06/20 00:00:00.000 7.500000
7419 2015/06/26 00:00:00.000 7.500000
7419 2015/06/30 00:00:00.000 41.900000
7419 2015/07/16 00:00:00.000 78.500000
11860 2012/12/12 00:00:00.000 7.174977
11860 2012/12/12 00:00:00.000 21.500000
11860 2013/03/02 00:00:00.000 22.901482
11860 2013/03/02 00:00:00.000 8.200000
11860 2013/05/25 00:00:00.000 29.500000
11860 2013/05/25 00:00:00.000 7.500000
In reality, I have more than 40,000 user_ids. I want to calculate, for each user, the sum of price over the previous 4 weeks (not counting the present week). The date period is fixed, from 12/12/2012 to 22/09/2015. To avoid looping over each user, I thought of something like
library(dplyr)
library(zoo)

df <- df %>%
  group_by(user_id) %>%
  mutate(price.lag1 = lag(price, n = 1)) %>%
  mutate(amount4weeks = rollsum(x = price, 4, align = "right", fill = NA))
However, it gives me an error, and rollsum only treats the rows present in the data as "dates"; it knows nothing about calendar weeks.
How can I give rollsum specific dates and/or how can I do what I want in a one-liner? My result should look like:
df$price4weeks = c(NA, 0.000000, 0.000000, 0.000000, 47.900000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, NA, 7.174977, 27.500000, 22.901482, 27.500000, 8.200000, 8.200000, 8.200000, 6.900000, 6.900000, 0.000000, 7.500000, 7.500000, 7.500000, 7.500000, 0.000000, 0.000000, 0.000000, 27.500000, 0.000000, 7.500000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 7.500000, 27.500000, 6.900000, 33.400000, 0.000000, 0.000000, 26.500000, 0.000000, 0.000000, 26.500000, 34.400000, 27.500000, 7.500000, 15.000000, 56.900000, NA, NA, 0.000000, 0.000000, 0.000000, 0.000000)
Let me know if I am missing something in my explanation.
Thank you!
rollsum calculates the sum over a rolling window of k data points. To use dplyr with weeks, you could add a week_number column to your data and then calculate the rolling sum using sapply over week_number. The code could look like:
library(dplyr)

df <- mutate(df, week_number = cut.POSIXt(date, breaks = "week", labels = FALSE))

df_new <- df %>%
  group_by(user_id) %>%
  do(mutate(., total_4wk = sapply(week_number, function(n)
    sum(.$price[between(.$week_number, n - 4, n - 1)], na.rm = TRUE))))
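Note that cut.POSIXt needs an actual date-time column. If date arrives as character in the "YYYY/MM/DD HH:MM:SS.sss" form shown in the sample, convert it first; a minimal sketch (the format string is an assumption based on the sample above):

# Assuming `date` is character like "2012/12/12 00:00:00.000"; %OS handles fractional seconds.
df$date <- as.POSIXct(df$date, format = "%Y/%m/%d %H:%M:%OS", tz = "UTC")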
I want to add a new column to my dataframe based on a time interval.
For the time 10:00-15:00 I want to add "day" in the new column; for 22:00-03:00 I want to add "night". Additionally, I want to exclude all rows which aren't in one of the intervals.
I've used as.POSIXct already.
This is what I want:
hour D_N
10:31 day
01:10 night
12:03 day
14:51 day
We can use the lubridate and dplyr packages. Convert the column to a Period class and extract the hour from it, then assign "day" and "night" according to the hour of the day and remove rows which do not lie in one of those intervals.
library(lubridate)
library(dplyr)
df %>%
  mutate(hour = hour(hm(hour1)),
         D_N = case_when(hour %in% 10:14          ~ "day",    # 10:00-14:59
                         hour %in% c(22, 23, 0:3) ~ "night",  # 22:00-03:59, including midnight
                         TRUE ~ NA_character_)) %>%
  filter(!is.na(D_N))
# hour1 hour D_N
#1 10:31 10 day
#2 01:10 1 night
#3 12:03 12 day
#4 14:51 14 day
data
df <- structure(list(hour1 = structure(c(2L, 1L, 3L, 5L, 4L), .Label = c("01:10",
"10:31", "12:03", "14:51", "16:03"), class = "factor")),
class = "data.frame", row.names = c(NA, -5L))
So this is my data frame.
I tried your code, but it doesn't exclude the times I don't want to have: 15:00-22:00 and 03:00-10:00.
date time date_time
1 2017-05-25 10:16 2017-05-25 10:16:00
2 2017-05-27 13:16 2017-05-27 13:16:00
3 2017-05-28 05:31 2017-05-28 05:31:00
4 2017-05-28 08:01 2017-05-28 08:01:00
5 2017-05-29 14:31 2017-05-29 14:31:00
6 2017-05-30 09:01 2017-05-30 09:01:00
7 2017-05-31 03:31 2017-05-31 03:31:00
8 2017-05-31 07:16 2017-05-31 07:16:00
9 2017-06-03 06:01 2017-06-03 06:01:00
10 2017-06-03 10:16 2017-06-03 10:16:00
11 2017-06-03 14:01 2017-06-03 14:01:00
12 2017-06-04 05:31 2017-06-04 05:31:00
13 2017-06-04 12:16 2017-06-04 12:16:00
14 2017-06-04 15:16 2017-06-04 15:16:00
15 2017-06-05 03:31 2017-06-05 03:31:00
So what I want is:
date time date_time D_N
1 2017-05-25 10:16 2017-05-25 10:16:00 day
2 2017-05-27 13:16 2017-05-27 13:16:00 day
3 2017-05-28 05:31 2017-05-28 05:31:00 #should be excluded
4 2017-05-28 08:01 2017-05-28 08:01:00 #should be excluded
5 2017-05-29 14:31 2017-05-29 14:31:00 day
6 2017-05-30 09:01 2017-05-30 09:01:00 #should be excluded
7 2017-05-31 03:31 2017-05-31 03:31:00 night
8 2017-05-31 07:16 2017-05-31 07:16:00 #should be excluded
9 2017-06-03 06:01 2017-06-03 06:01:00 #should be excluded
10 2017-06-03 10:16 2017-06-03 10:16:00 day
11 2017-06-03 14:01 2017-06-03 14:01:00 day
12 2017-06-04 05:31 2017-06-04 05:31:00 #should be excluded
13 2017-06-04 12:16 2017-06-04 12:16:00 day
14 2017-06-04 15:16 2017-06-04 15:16:00 #should be excluded
15 2017-06-05 03:31 2017-06-05 03:31:00 night
As a result, I want to get this:
date time date_time D_N
1 2017-05-25 10:16 2017-05-25 10:16:00 day
2 2017-05-27 13:16 2017-05-27 13:16:00 day
5 2017-05-29 14:31 2017-05-29 14:31:00 day
7 2017-05-31 03:31 2017-05-31 03:31:00 night
10 2017-06-03 10:16 2017-06-03 10:16:00 day
11 2017-06-03 14:01 2017-06-03 14:01:00 day
13 2017-06-04 12:16 2017-06-04 12:16:00 day
15 2017-06-05 03:31 2017-06-05 03:31:00 night
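Since this follow-up data has a date_time column rather than an hour string, here is a sketch of how to adapt the same idea, assuming (from the rows marked above) that the intended windows are 10:00-14:59 for day and 22:00-03:59 for night:

library(dplyr)
library(lubridate)

df %>%
  mutate(hour = hour(date_time),
         D_N  = case_when(hour %in% 10:14          ~ "day",
                          hour %in% c(22, 23, 0:3) ~ "night",
                          TRUE ~ NA_character_)) %>%
  filter(!is.na(D_N))

On the 15 rows above, this keeps rows 1, 2, 5, 7, 10, 11, 13, and 15 and drops the rest.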
Subset dataframe w/ sequence of observations
I am experimenting with a large dataset. I would like to subset this data frame in intervals of Monday through Friday. However, I see that some weeks have missing days (one week is missing Thursday, for instance).
If a week's Monday-to-Friday sequence is incomplete, I would like to exclude that whole week from my sample.
Would this be possible?
week.nr <- data$week.nr[1:20]
week.day <- data$week.day[1:20]
date <- data$specific.date[1:20]
price <- data$price[1:20]
data.frame(date, week.nr, week.day, price)
date week.nr week.day price
1 2019-01-28 05 Monday 62.6
2 2019-01-25 04 Friday 63.8
3 2019-01-24 04 Thursday 64.2
4 2019-01-23 04 Wednesday 64.0
5 2019-01-22 04 Tuesday 64.0
6 2019-01-21 04 Monday 63.4
7 2019-01-18 03 Friday 62.6
8 2019-01-17 03 Thursday 62.6
9 2019-01-16 03 Wednesday 64.0
10 2019-01-15 03 Tuesday 64.4
11 2019-01-14 03 Monday 65.2
12 2019-01-11 02 Friday 66.4
13 2019-01-10 02 Thursday 66.2
14 2019-01-09 02 Wednesday 68.2
15 2019-01-08 02 Tuesday 68.8
16 2019-01-07 02 Monday 67.8
17 2019-01-04 01 Friday 67.4
18 2019-01-03 01 Thursday 68.0
19 2019-01-02 01 Wednesday 69.6
20 2018-12-28 52 Friday 71.0
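One possible approach is to count the distinct weekdays per week and keep only the complete ones. A sketch, assuming week.nr together with the calendar year uniquely identifies a week and that the data contains weekdays only:

library(dplyr)

df <- data.frame(date, week.nr, week.day, price)

complete_weeks <- df %>%
  mutate(year = format(as.Date(date), "%Y")) %>%   # keep week 52 of 2018 apart from 2019's weeks
  group_by(year, week.nr) %>%
  filter(n_distinct(week.day) == 5) %>%            # a complete week has all five weekdays
  ungroup()

On the 20 rows shown, this drops weeks 05, 01, and 52, each of which is missing days in this extract.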
I use RODBC to get data from SQL Server:
sql <- paste0("
with cte as (
Select *,datePart(WEEKDAY,Dt) as WeekDay,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY SaleCount) Over (partition by ItemRelation,
DocumentNum, DocumentYear) as PERCENTILE,
avg(SaleCount) over (Partition by ItemRelation,
DocumentNum, DocumentYear,datePart(WEEKDAY,Dt), IsPromo) as AVG_WeekDay
From [Action].[dbo].[promo_data_copy])
Update a
Set SaleCount = cte.AVG_WeekDay
From CTE
join [Action].[dbo].[promo_data_copy] a
on a.Dt = cte.dt
and a.ItemRelation=cte.ItemRelation
and a.DocumentNum = cte.DocumentNum
and a.DocumentYear = cte.DocumentYear
and a.ispromo = cte.ispromo
Where CTE.PERCENTILE < CTE.SaleCount
and datePart(WEEKDAY,CTE.Dt) < 5
and CTE.ispromo = 0 ;")
df <- sqlQuery(dbHandle, sql)
View(df)
and df is an empty dataset:
No data available in table
Can anybody help me understand why the data wasn't returned?
Edit
Dt ItemRelation SaleCount DocumentNum DocumentYear IsPromo
2017-10-12 00:00:00.000 13322 7 36 2017 0
2017-10-12 00:00:00.000 13322 35 4 2017 0
2017-10-12 00:00:00.000 158121 340 41 2017 0
2017-10-12 00:00:00.000 158122 260 41 2017 0
2017-10-13 00:00:00.000 13322 3 36 2017 0
2017-10-13 00:00:00.000 13322 31 4 2017 0
2017-10-13 00:00:00.000 158121 420 41 2017 0
2017-10-13 00:00:00.000 158122 380 41 2017 0
2017-10-14 00:00:00.000 11592 45 33 2017 0
2017-10-14 00:00:00.000 13189 135 33 2017 0
2017-10-14 00:00:00.000 13191 852 33 2017 0
2017-10-14 00:00:00.000 13322 1 36 2017 0
2017-10-14 00:00:00.000 13322 34 4 2017 0
2017-10-14 00:00:00.000 158121 360 41 2017 0
2017-10-14 00:00:00.000 158122 140 41 2017 0
Here are the top 15 observations of the table, so I expect my query to return this data frame.
I'm not sure about the percentile stuff; I'll leave it to you to get that part straightened out. Anyway, here is how I use R to query a database.
library(RODBC)
dbconnection <- odbcDriverConnect("Driver=ODBC Driver 11 for SQL Server;Server=Server_Name; Database=DB_Name;Uid=; Pwd=; trusted_connection=yes")
initdata <- sqlQuery(dbconnection, "select * from MyTable;")
odbcClose(dbconnection)
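As an aside, this may well be the actual culprit: the statement posted in the question is an UPDATE, and sqlQuery only returns rows for statements that produce a result set, so an UPDATE comes back empty by design. A sketch of running the update and then fetching the rows separately, reusing the dbHandle and table name from the question:

# Run the UPDATE first; no result set comes back for an UPDATE.
sqlQuery(dbHandle, sql)
# Then fetch the rows with a separate SELECT.
df <- sqlQuery(dbHandle, "select top 15 * from [Action].[dbo].[promo_data_copy];")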
Here are a couple of very useful resources for cross-reference:
http://stackoverflow.com/questions/15420999/rodbc-odbcdriverconnect-connection-error
https://andersspur.wordpress.com/2013/11/26/connect-r-to-sql-server-2012-and-14/
I have a data frame, ordered descending on the Calved column, as below.
Served Calved ProfileID
1 2015-07-29 2017-05-07 1346
2 2015-07-29 2017-05-06 2645
3 2016-06-12 2017-05-05 3687
4 2016-05-19 2017-05-05 3687
5 2015-05-21 2017-05-05 3687
6 2013-05-08 2017-05-05 3687
7 2015-08-08 2016-05-04 4235
8 2015-06-14 2016-05-04 4235
9 2015-05-31 2016-05-04 4235
10 2013-08-13 2014-05-02 5425
11 2013-07-23 2014-05-02 5425
12 2012-03-01 2014-05-02 5425
13 2017-07-11 2013-04-22 5425
14 2012-11-01 2013-04-22 5425
15 2015-12-23 2013-04-22 5425
16 2014-05-10 2013-04-22 5425
I would like to remove the duplicates from the Calved column, keeping one observation per ProfileID for each date in the Calved column, like so:
Served Calved ProfileID
1 2015-07-29 2017-05-07 1346
2 2015-07-29 2017-05-06 2645
3 2016-06-12 2017-05-05 3687
7 2015-08-08 2016-05-04 4235
10 2013-08-13 2014-05-02 5425
13 2017-07-11 2013-04-22 5425
I achieved this using
on_served_profileID <- master_arranged[!duplicated(master_arranged[c("Calved", "ProfileID")]), ]
I would like to add an AND condition so that the row selected is one where the Served date is earlier than the Calved date, not just the first occurrence of each date.
For line 13 of the output I would rather have line 14, because there the Served date is earlier than the Calved date, like so, rather than the first observation for each date in the Calved column:
Served Calved ProfileID
1 2015-07-29 2017-05-07 1346
2 2015-07-29 2017-05-06 2645
3 2016-06-12 2017-05-05 3687
7 2015-08-08 2016-05-04 4235
10 2013-08-13 2014-05-02 5425
14 2012-11-01 2013-04-22 5425
I have tried this, and variations of it:
on_served_profileID <- master_arranged[!duplicated(master_arranged[c("Calved", "ProfileID")]) & master_arranged$Served < master_arranged$Calved, ]
This is to try to select the observation where Served is earlier than Calved, hence the & condition $Served < $Calved.
library(dplyr)

df$Served <- as.Date(df$Served)
df$Calved <- as.Date(df$Calved)

df %>%
  group_by(Calved, ProfileID) %>%
  summarise(Served = Served[first(which(Served < Calved))]) %>%
  arrange(desc(Calved))
Output is:
Calved ProfileID Served
1 2017-05-07 1346 2015-07-29
2 2017-05-06 2645 2015-07-29
3 2017-05-05 3687 2016-06-12
4 2016-05-04 4235 2015-08-08
5 2014-05-02 5425 2013-08-13
6 2013-04-22 5425 2012-11-01
Sample data:
df <- structure(list(Served = c("2015-07-29", "2015-07-29", "2016-06-12",
"2016-05-19", "2015-05-21", "2013-05-08", "2015-08-08", "2015-06-14",
"2015-05-31", "2013-08-13", "2013-07-23", "2012-03-01", "2017-07-11",
"2012-11-01", "2015-12-23", "2014-05-10"), Calved = c("2017-05-07",
"2017-05-06", "2017-05-05", "2017-05-05", "2017-05-05", "2017-05-05",
"2016-05-04", "2016-05-04", "2016-05-04", "2014-05-02", "2014-05-02",
"2014-05-02", "2013-04-22", "2013-04-22", "2013-04-22", "2013-04-22"
), ProfileID = c(1346L, 2645L, 3687L, 3687L, 3687L, 3687L, 4235L,
4235L, 4235L, 5425L, 5425L, 5425L, 5425L, 5425L, 5425L, 5425L
)), .Names = c("Served", "Calved", "ProfileID"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16"))
I'm struggling with a query in SQL Server. I have the following columns:
Member Number; Dependant Number; Provider Number; Service Date
I concatenate the above to create a unique ID, see extract below:
MbrNo DepNo PracticeNo ServiceDt UniqueProviderConsults
100001077 1 243264 2014-07-02 00:00:00.000 243264100001077141820
100001077 1 243264 2014-07-02 00:00:00.000 243264100001077141820
100001077 1 243264 2014-07-02 00:00:00.000 243264100001077141820
100001077 1 243264 2014-07-02 00:00:00.000 243264100001077141820
100001077 1 243264 2014-07-02 00:00:00.000 243264100001077141820
100001077 1 243264 2014-07-02 00:00:00.000 243264100001077141820
100001077 1 243264 2014-07-02 00:00:00.000 243264100001077141820
100001077 0 243264 2014-07-02 00:00:00.000 243264100001077041820
100001077 0 243264 2014-07-02 00:00:00.000 243264100001077041820
100001077 0 243264 2014-07-02 00:00:00.000 243264100001077041820
100001077 0 243264 2014-07-02 00:00:00.000 243264100001077041820
100001077 0 243264 2014-07-02 00:00:00.000 243264100001077041820
100001077 0 243264 2014-07-02 00:00:00.000 243264100001077041820
100001077 0 243264 2014-07-07 00:00:00.000 243264100001077041825
100000838 1 243264 2014-07-09 00:00:00.000 243264100000838141827
100000838 5 243264 2014-07-14 00:00:00.000 243264100000838541832
100000838 3 243264 2014-07-17 00:00:00.000 243264100000838341835
100000838 0 243264 2014-07-17 00:00:00.000 243264100000838041835
100000838 5 243264 2014-07-18 00:00:00.000 243264100000838541836
100001077 0 243264 2014-07-14 00:00:00.000 243264100001077041832
100001077 0 243264 2014-07-14 00:00:00.000 243264100001077041832
100001077 0 243264 2014-07-14 00:00:00.000 243264100001077041832
100001077 0 243264 2014-07-14 00:00:00.000 243264100001077041832
100001077 0 243264 2014-07-14 00:00:00.000 243264100001077041832
100001077 0 243264 2014-07-14 00:00:00.000 243264100001077041832
100001480 1 243264 2014-07-17 00:00:00.000 243264100001480141835
My Unique ID is a numeric(30) data type. I then want to count how many times the Unique ID appears. Using count_big, I do not get any overflow warnings, but it still does not give me the right count. I think it is because a precision of 30 is too high, so the value gets cut off when counting. Is there another alternative? Unfortunately, the components above are the minimum needed to define the Unique ID. I have tried taking the log of my Unique ID, but the count is also incorrect.
Can someone please help? :)
Code:
ALTER TABLE [Claims Edited] ADD [UniqueProviderConsults] NUMERIC(30)
GO
UPDATE [Claims Edited] SET [UniqueProviderConsults] = CONCAT(CONVERT(DECIMAL(38,0),ProviderNo),CONVERT(DECIMAL(38,0),MbrNo),CONVERT(VARCHAR(MAX),DepNo),CONVERT(DECIMAL(38,0),ServiceDt))
GO
Select PracticeNo,
count_big(DISTINCT case when [ServiceMth]='2014-06-30' THEN [UniqueProviderConsults] else 0 end) as [Jun-14 Consults],
count_big(DISTINCT case when [ServiceMth]='2014-07-31' THEN [UniqueProviderConsults] else 0 end) as [Jul-14 Consults],
count_big(DISTINCT case when [ServiceMth]='2014-08-31' THEN [UniqueProviderConsults] else 0 end) as [Aug-14 Consults],
count_big(DISTINCT case when [ServiceMth]='2014-09-30' THEN [UniqueProviderConsults] else 0 end) as [Sep-14 Consults],
count_big(DISTINCT case when [ServiceMth]='2014-10-31' THEN [UniqueProviderConsults] else 0 end) as [Oct-14 Consults],
count_big(DISTINCT case when [ServiceMth]='2014-11-30' THEN [UniqueProviderConsults] else 0 end) as [Nov-14 Consults],
count_big(DISTINCT case when [ServiceMth]='2014-12-31' THEN [UniqueProviderConsults] else 0 end) as [Dec-14 Consults]
Into [EM Consultation Count temp]
from [EM Claims Edited]
Group by PracticeNo
Using the data from the extract above, for Provider Number 243264:
For the month of June 2014, there are no lines, yet my code counts 1.
For the month of July 2014, there are 10 unique IDs, yet my code counts 11.
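For what it's worth, the CASE expressions alone would explain both numbers: COUNT_BIG(DISTINCT ...) ignores NULLs, but ELSE 0 turns every row outside the month into the value 0, and that 0 is counted as one extra distinct value. That gives 1 for June (just the 0) and 11 for July (10 real IDs plus the 0). A sketch of the fix, dropping the ELSE branch so non-matching rows become NULL, shown for one month (the other columns follow the same pattern):

Select PracticeNo,
count_big(DISTINCT case when [ServiceMth]='2014-07-31' THEN [UniqueProviderConsults] end) as [Jul-14 Consults]
Into [EM Consultation Count temp]
from [EM Claims Edited]
Group by PracticeNo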