Below is the sample data. The goal is first to create a column that contains the total employment for that quarter. Second is to create a new column that shows the relative share for each area. Finally, the last item (and the one that is vexing me) is to calculate whether the total for rows with suppress = 0 represents over 50% of the quarter's total. I can do this easily in Excel, but I am trying to do it in R so that it is something I can replicate year after year.
The desired result is shown further below.
area <- c("001","005","007","009","011","013","015","017","019","021","023","027","033","001","005","007","009","011","013","015","017","019","021","023","027","033")
year <- c("2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021","2021")
qtr <- c("01","01","01","01","01","01","01","01","01","01","01","01","01","02","02","02","02","02","02","02","02","02","02","02","02","02")
employment <- c(2,4,6,8,11,10,12,14,16,18,20,22,30,3,5,8,9,12,9,24,44,33,298,21,26,45)
suppress <- c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0)
testitem <- data.frame(year,qtr, area, employment, suppress)
For the first quarter of 2021, the total is 173. The rows with suppress = 1 account for only 24 of 173, so the rows with suppress = 0 make up more than half, hence the TRUE in the 50percent column. If the suppress = 1 rows summed to 173/2 or more, then it would say FALSE. For the second quarter, suppress = 1 accounts for 310 of the total of 537, which is over 50% of the total, so the column says FALSE.
For the total column, I am showing the computation or ingredients. Ideally, it would show a value such as .0115 in place of 2/173.
year qtr area employment suppress total 50percent
2021 01 001 2 0 =2/173 TRUE
2021 01 005 4 0 =4/173 TRUE
.....
2021 02 001 3 0 =3/537 FALSE
2021 02 005 5 0 =5/537 FALSE
For example:
library(dplyr)
testitem %>%
group_by(year, qtr) %>%
mutate(
total = employment / sum(employment),
over_half = sum(employment[suppress == 0]) > (0.5 * sum(employment))
)
Gives:
# A tibble: 26 × 7
# Groups: year, qtr [2]
year qtr area employment suppress total over_half
<chr> <chr> <chr> <dbl> <dbl> <dbl> <lgl>
1 2021 01 001 2 0 0.0116 TRUE
2 2021 01 005 4 0 0.0231 TRUE
3 2021 01 007 6 0 0.0347 TRUE
4 2021 01 009 8 1 0.0462 TRUE
5 2021 01 011 11 0 0.0636 TRUE
6 2021 01 013 10 0 0.0578 TRUE
7 2021 01 015 12 0 0.0694 TRUE
8 2021 01 017 14 0 0.0809 TRUE
9 2021 01 019 16 1 0.0925 TRUE
10 2021 01 021 18 0 0.104 TRUE
# … with 16 more rows
# ℹ Use `print(n = ...)` to see more rows
I think you'll want to use group_by() and mutate() here.
library(dplyr)
testitem |>
## grouping by year and quarter
## sums will be calculated over areas
group_by(year, qtr) |>
## this could be more terse, but gets the job done.
mutate(total_sum = sum(employment),
## This uses the total_sum column that was just created
total_prop = employment/total_sum,
## leveraging the 0,1 coding of suppress
suppress_sum = sum(suppress * employment),
suppress_prop = suppress_sum/total_sum,
fifty = (1-suppress_prop) > 0.5)
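If you would rather avoid the dplyr dependency, the same three columns can be sketched in base R with ave(), which returns a group-level summary repeated on every row of the group. This is only a sketch under the assumption that the data looks like the sample in the question; the column names total_sum, share and over_half are my own choices, not anything required.

```r
# Sample data from the question
area <- rep(c("001","005","007","009","011","013","015","017","019","021","023","027","033"), 2)
year <- rep("2021", 26)
qtr <- rep(c("01", "02"), each = 13)
employment <- c(2,4,6,8,11,10,12,14,16,18,20,22,30,3,5,8,9,12,9,24,44,33,298,21,26,45)
suppress <- c(0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0)
testitem <- data.frame(year, qtr, area, employment, suppress)

# Total employment per year/quarter, repeated on every row of that quarter
testitem$total_sum <- ave(testitem$employment, testitem$year, testitem$qtr, FUN = sum)

# Each area's share of its quarter's total
testitem$share <- testitem$employment / testitem$total_sum

# Employment in unsuppressed rows only, summed per quarter
unsuppressed <- ave(testitem$employment * (testitem$suppress == 0),
                    testitem$year, testitem$qtr, FUN = sum)

# TRUE when the unsuppressed rows make up more than half of the quarter's total
testitem$over_half <- unsuppressed > 0.5 * testitem$total_sum
```

With the sample data, the first quarter has a total of 173 and over_half is TRUE, while the second quarter has a total of 537 and over_half is FALSE, which agrees with the description in the question.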
I've got a dataset of many individuals ("ID") with body weight measurements ("BW") at random time points ("Time") spanning more than 15 years.
Example:
ID=c("1","1","1","1","1","1","2","2","2","2","3","3","3")
Time=c("2015/1/1","2015/3/1","2016/1/1","2016/3/1","2017/1/1","2018/5/1","2012/1/1","2017/5/1","2019/4/1","2020/4/1","2019/10/1","2020/1/1","2020/4/1")
BW=rnorm(13,mean=75)
df<-data.frame(ID,Time,BW)
ID Time BW
1 1 2015/1/1 75.01736
2 1 2015/3/1 75.44717
3 1 2016/1/1 73.09934
4 1 2016/3/1 74.79920
5 1 2017/1/1 74.70097
6 1 2018/5/1 74.23496
7 2 2012/1/1 73.57179
8 2 2017/5/1 74.50970
9 2 2019/4/1 74.43412
10 2 2020/4/1 75.02952
11 3 2019/10/1 76.41390
12 3 2020/1/1 75.79827
13 3 2020/4/1 74.46035
What I'm trying to find are IDs having a measurement with another measurement 12 +/- 3 months before it and another one 12 +/- 3 months after it, i.e. one body weight at 0 yr +/- 3 months, one at 1 yr and one at 2 yr +/- 3 months. In this case, only rows 3 to 5 fulfill the criteria.
For all individuals that fulfill these criteria, I would then like to choose the measurement that has the most data points within this +/- 15 month range. The desired output might look like:
ID Time BW Fulfill Counts
1 1 2015/1/1 75.01736 0 4
2 1 2015/3/1 75.44717 0 4
3 1 2016/1/1 73.09934 1 5
4 1 2016/3/1 74.79920 1 5
5 1 2017/1/1 74.70097 1 3
6 1 2018/5/1 74.23496 0 2
7 2 2012/1/1 73.57179 0 1
8 2 2017/5/1 74.50970 0 1
9 2 2019/4/1 74.43412 0 2
10 2 2020/4/1 75.02952 0 2
11 3 2019/10/1 76.41390 0 3
12 3 2020/1/1 75.79827 0 3
13 3 2020/4/1 74.46035 0 3
I've tried my best to search for similar answers on the internet, but I couldn't come up with anything remotely close to what I want to do. I only got as far as the grouping part with
group_by(ID)%>%
mutate(Fulfill==if time-...)
and then got stuck on the "calculating the difference with every other row" part. I'm imagining something like a loop over each row within a group (ID) that calculates the difference in time, followed by a logical statement to determine whether the criteria are met. I've used R for a while, but only for descriptive statistics, so I'm sorry if this is actually quite simple. Thanks.
Here is a tidyverse approach (not completely optimised, you could probably even simplify it to only one function call with map_dfr or so).
I've chosen to use purrr::map_ functions. This allows me to apply the function to every entry of the column/vector separately (this results from the first time Time is passed to map_) and at the same time also pass the complete Time column (the second argument), to calculate the filter operations to see if you have entries in the +-15 months.
ID=c("1","1","1","1","1","1","2","2","2","2","3","3","3")
Time=c("2015/1/1","2015/3/1","2016/1/1","2016/3/1","2017/1/1","2018/5/1","2012/1/1","2017/5/1","2019/4/1","2020/4/1","2019/10/1","2020/1/1","2020/4/1")
BW=rnorm(13,mean=75)
df<-data.frame(ID,Time,BW)
library(dplyr)
library(purrr)
library(lubridate)
check_entries <- function(curr_entry, entries) {
# establish bounds in which there must be entries
lower_bound_1 <- curr_entry %m-% months(15)
lower_bound_2 <- curr_entry %m-% months(9)
upper_bound_1 <- curr_entry %m+% months(9)
upper_bound_2 <- curr_entry %m+% months(15)
# filter the entries that match the time period constraints
entries <- data.frame(entries = entries)
filtered_lower <- entries %>%
filter(entries >= lower_bound_1 & entries <= lower_bound_2)
filtered_upper <- entries %>%
filter(entries >= upper_bound_1 & entries <= upper_bound_2)
# check if there is a matching earlier and later entry
if (nrow(filtered_lower) > 0 && nrow(filtered_upper) > 0) {
TRUE
} else {
FALSE
}
}
calculate_number_entries <- function(curr_entry, entries) {
# establish bounds in which there must be entries
lower_bound <- curr_entry %m-% months(15)
upper_bound <- curr_entry %m+% months(15)
# filter the matching entries and calculate the number of observations
entries <- data.frame(entries = entries)
entries %>%
filter(entries >= lower_bound & entries <= upper_bound) %>%
nrow()
}
df %>%
group_by(ID) %>%
mutate(Time = as.Date(Time, format = "%Y/%m/%d"),
Fulfill = map_lgl(Time, check_entries, Time),
Fulfill_ID = sum(Fulfill) > 0,
Counts = map_int(Time, calculate_number_entries, Time))
#> # A tibble: 13 x 6
#> # Groups: ID [3]
#> ID Time BW Fulfill Fulfill_ID Counts
#> <chr> <date> <dbl> <lgl> <lgl> <int>
#> 1 1 2015-01-01 75.4 FALSE TRUE 4
#> 2 1 2015-03-01 74.0 FALSE TRUE 4
#> 3 1 2016-01-01 74.2 TRUE TRUE 5
#> 4 1 2016-03-01 74.9 TRUE TRUE 5
#> 5 1 2017-01-01 75.6 FALSE TRUE 3
#> 6 1 2018-05-01 73.8 FALSE TRUE 1
#> 7 2 2012-01-01 75.6 FALSE FALSE 1
#> 8 2 2017-05-01 75.0 FALSE FALSE 1
#> 9 2 2019-04-01 74.3 FALSE FALSE 2
#> 10 2 2020-04-01 74.9 FALSE FALSE 2
#> 11 3 2019-10-01 75.5 FALSE FALSE 3
#> 12 3 2020-01-01 75.3 FALSE FALSE 3
#> 13 3 2020-04-01 76.0 FALSE FALSE 3
Created on 2020-12-06 by the reprex package (v0.3.0)
Note that I get a different result for the 5th entry. You may want to check whether the month addition/subtraction behaves the way you need; see the lubridate documentation for more info.
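For comparison, the Fulfill check can also be sketched in base R without purrr or lubridate. The helper names shift_months() and has_both_neighbours() are made up for this illustration. The month shift uses seq() with a "months" step, which I assume is acceptable here because every date in the example falls on the first of a month; for end-of-month dates, seq() rolls over into the next month instead of clamping the way %m+% does.

```r
# Shift a single Date by n whole months (n may be negative) using base R's seq()
shift_months <- function(d, n) {
  seq(d, by = paste(n, "months"), length.out = 2)[2]
}

# TRUE when `dates` holds one entry 9 to 15 months before `d`
# and another entry 9 to 15 months after `d`
has_both_neighbours <- function(d, dates) {
  before <- dates >= shift_months(d, -15) & dates <= shift_months(d, -9)
  after  <- dates >= shift_months(d, 9)   & dates <= shift_months(d, 15)
  any(before) && any(after)
}

ID <- c("1","1","1","1","1","1","2","2","2","2","3","3","3")
Time <- as.Date(c("2015/1/1","2015/3/1","2016/1/1","2016/3/1","2017/1/1","2018/5/1",
                  "2012/1/1","2017/5/1","2019/4/1","2020/4/1",
                  "2019/10/1","2020/1/1","2020/4/1"), format = "%Y/%m/%d")
df <- data.frame(ID, Time)

# ave() hands each ID's row indices to the function, so every date is
# compared only against the dates of the same individual
df$Fulfill <- as.logical(ave(seq_len(nrow(df)), df$ID, FUN = function(idx) {
  dates <- df$Time[idx]
  vapply(seq_along(dates),
         function(i) has_both_neighbours(dates[i], dates),
         logical(1))
}))
```

On the example dates this marks only rows 3 and 4 as TRUE, the same as the tidyverse output above: row 5 fails because the next measurement (2018-05-01) is 16 months later.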
I have a group of accounts with balances over 4 months. I want to sum the balances of the accounts that first appeared in each particular month. This is what I have so far.
One account originated (new) each month.
Accounts <- c('A','B','C','A','B','C','A','B','C')
Dates <- as.Date(c('2016-01-31', '2016-01-31','2016-01-31','2016-02-28','2016-02-28','2016-02-28','2016-03-31','2016-03-31','2016-03-31'))
Balances <- c(100,NA,NA,90,50,NA,80,40,120)
Origination <- data.frame(Dates,Accounts,Balances)
library(reshape2)
Origination <- dcast(Origination,Dates ~ Accounts, value.var = "Balances")
Origination$Originated <- apply(Origination[2:4],1,function(x) ifelse(sum(is.na(x))==nrow(Origination),NA,tail(na.omit(x),1)))
Origination <- melt(Origination, id = c("Dates"))
Origination <-dcast(Origination, variable ~ Dates, value.var = "value")
variable 2016-01-31 2016-02-28 2016-03-31
1 A 100 90 80
2 B NA 50 40
3 C NA NA 120
4 Originated 100 50 120
This creates an origination table with a row called Originated. In the first month we only have the 100; in the second month we have A amortized to 90 but also a new account at 50; and in the last month we have both A and B amortized, with the new account C at 120. The Originated row captures it exactly as I want.
But if I introduce another account that also originates in month 2, it picks up just that amount (10) and not the sum of the two accounts being originated, i.e. 50 (B) plus 10 (C).
Accounts <- c('A','B','C','D','A','B','C','D','A','B','C','D')
Dates <- as.Date(c('2016-01-31', '2016-01-31','2016-01-31','2016-01-31','2016-02-28','2016-02-28','2016-02-28','2016-02-28','2016-03-31','2016-03-31','2016-03-31','2016-03-31'))
Balances <- c(100,NA,NA,NA,90,50,10,NA,80,40,5,120)
Origination <- data.frame(Dates,Accounts,Balances)
library(reshape2)
Origination <- dcast(Origination,Dates ~ Accounts, value.var = "Balances")
Origination$Originated <- apply(Origination[2:4],1,function(x) ifelse(sum(is.na(x))==nrow(Origination),NA,tail(na.omit(x),1)))
Origination <- melt(Origination, id = c("Dates"))
Origination <-dcast(Origination, variable ~ Dates, value.var = "value")
variable 2016-01-31 2016-02-28 2016-03-31
1 A 100 90 80
2 B NA 50 40
3 C NA 10 5
4 D NA NA 120
5 Originated 100 10 5
So the ask is, how do I sum the newly added accounts from A through D across dates. Perhaps I am over thinking it. The result I would like is this:
variable 2016-01-31 2016-02-28 2016-03-31
1 A 100 90 80
2 B NA 50 40
3 C NA 10 5
4 D NA NA 120
5 Originated 100 60 120
Help is much appreciated.
Aksel
I have finally found a way to get the output I want. Here is the answer for those who are interested.
sel <- rbind(FALSE, !is.na(head(Origination[-1], -1)))
#sel
# A B C D
#[1,] FALSE FALSE FALSE FALSE
#[2,] TRUE FALSE FALSE FALSE
#[3,] TRUE TRUE TRUE FALSE
rowSums(replace(Origination[-1], sel, 0), na.rm=TRUE)
#[1] 100 60 120
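The same numbers can also be obtained straight from the long-format data, before any dcast(). The sketch below is one possible alternative, not the method used above: it assumes an account "originates" on the first date where its balance is not NA, and the names seen, first_date, new_rows and originated are only for illustration.

```r
# Long-format data from the question (four accounts, three month-ends)
Accounts <- rep(c("A", "B", "C", "D"), times = 3)
Dates <- as.Date(rep(c("2016-01-31", "2016-02-28", "2016-03-31"), each = 4))
Balances <- c(100, NA, NA, NA, 90, 50, 10, NA, 80, 40, 5, 120)
long <- data.frame(Dates, Accounts, Balances)

# Keep only rows where a balance exists
seen <- long[!is.na(long$Balances), ]

# First date with a balance for each account, repeated on that account's rows
first_date <- ave(as.numeric(seen$Dates), seen$Accounts, FUN = min)

# Rows where an account appears for the first time
new_rows <- seen[as.numeric(seen$Dates) == first_date, ]

# Sum of newly originated balances per date
originated <- tapply(new_rows$Balances, format(new_rows$Dates), sum)
originated
```

One caveat with this sketch: a month in which no account originates does not appear in the result at all, so you would need to add it back as 0 if you want one value per month.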
I have a dataset of a hypothetical exam.
id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1%>%
group_by(id) %>%
arrange(id, test_date) %>%
filter(n() >= 2) %>% #To only get info on students who have taken the exam more than once and then merge it back in with the original data set using a join function
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried to mutate in order to find cases where dates are either positive or negative by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this leads to mixing up students with different id's. I tried to split but mutate won't work on a list of dataframes so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(id <- c(1,1,3,4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
melt(id.vars = "id", measure.vars = c("test_date", "result_date"), value.name = "event_date")
# any two tests in a row is a flag - use dplyr::lag to compare with the previous event
data1mc <- data1m %>%
arrange(id, event_date) %>%
group_by(id) %>%
mutate (multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
left_join (data1mc %>% select(id, event_date, multi_test),
by=c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If students re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm unsure if students have the ability to re-test multiple times?
student <- data1 %>%
group_by(id) %>%
summarise(retest=(test_date[length(test_date)] < result_date[1]) == TRUE)
Some re-test values were NA. These were individuals that only took the test once. I set these to FALSE here, but you can retain the NA, as they do contain information.
student$retest[is.na(student$retest)] <- FALSE
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did this by taking advantage of the structure of your data (sorted by id) and the lag function that can refer to the previous records while dealing with a current record.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(id = c(1,1,3,4,5,6,7,7,8,9,9), test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15"), result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20"))
data1$re_test <- unlist(lapply(split(data1,data1$id), function(x)
ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving NAs but if you really want all others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.
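Note that the answers above put the flag on the second attempt, whereas the desired output in the question (re_test of 1, 0, 0, 0 for the first four rows) puts it on the attempt whose result had not yet been released. A base R sketch of that variant follows; it assumes each row should be compared with the same student's next test date, and the helper vector next_test is my own name.

```r
data1 <- data.frame(
  id = c(1, 1, 3, 4, 5, 6, 7, 7, 8, 9, 9),
  test_date = as.Date(c("2012-06-27", "2012-07-10", "2013-07-04", "2012-03-24",
                        "2012-07-22", "2013-09-16", "2012-06-21", "2013-10-18",
                        "2013-04-21", "2012-02-16", "2012-03-15")),
  result_date = as.Date(c("2012-07-29", "2012-09-02", "2013-08-01", "2012-04-25",
                          "2012-09-01", "2013-10-20", "2012-07-01", "2013-10-31",
                          "2013-05-17", "2012-03-17", "2012-04-20"))
)

# Sort so that each student's attempts are in chronological order
data1 <- data1[order(data1$id, data1$test_date), ]

# Next test date within the same id (NA for a student's last attempt),
# kept as the numeric day count that underlies a Date
next_test <- ave(as.numeric(data1$test_date), data1$id,
                 FUN = function(x) c(x[-1], NA))

# 1 when the student sat the next exam before this exam's result was released
data1$re_test <- as.integer(!is.na(next_test) &
                              next_test < as.numeric(data1$result_date))
```

With the sample data this flags the first attempts of students 1 and 9, and leaves student 7 at 0 because the second attempt came after the first result.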
Have a look at the simplified table below. For each product, I want a vector containing the quantities sold within each delivery term. A delivery term is defined as 4 days. Looking at product A, it starts at 03/12/15, and within the first delivery term (until 07/12/15) a quantity of 4 was sold. The second delivery term starts at 08/12/15 and ends at 12/12/15, and 1 unit was sold in this period. The following delivery term starts at 13/12/15 and ends at 17/12/15. During this period nothing was sold, so the vector must have a value of 0 for it. In the last period, finally, 2 products were sold. So basically the problem is that information about the periods where no products were sold is missing.
Any ideas on how the vector I want can be created using R? I've been thinking of for or while loops, but these do not seem to give the requested results. Note that the code must be applicable to a real dataset containing over 1000 product categories, so it has to be automated in some way.
I would be very grateful if somebody could point me in the right direction.
Product Quantity Date
A 1 03/12/15
A 2 04/12/15
A 1 05/12/15
A 1 08/12/15
A 1 17/12/16
A 1 18/12/16
B 1 19/12/15
B 2 10/05/15
B 2 11/05/15
C 1 01/06/15
C 1 02/06/15
C 1 12/06/15
Assume that dt is the dataset you provided. You'll get a better understanding of the process if you run it step by step (and maybe with an even simpler dataset).
library(lubridate)
library(dplyr)
# create date time columns
dt$Date = dmy(dt$Date)
dt %>%
group_by(Product) %>%
do(data.frame(days = seq(min(.$Date), max(.$Date), by="1 day"))) %>% # create all combinations between product and days
mutate(dist = as.numeric(difftime(days,min(days), units="days"))) %>% # create distance of each day with min date
ungroup() %>%
left_join(dt, by=c("Product"="Product","days"="Date")) %>% # join info to get quantities for each day
mutate(Quantity = ifelse(is.na(Quantity), 0, Quantity), # replace NAs with 0s
id = floor(dist/5 + 1)) %>% # create the 4 period id
group_by(Product, id) %>%
summarise(Sum = sum(Quantity),
min_date = min(days),
max_date = max(days)) %>%
ungroup
# Product id Sum min_date max_date
# 1 A 1 4 2015-12-03 2015-12-07
# 2 A 2 1 2015-12-08 2015-12-12
# 3 A 3 0 2015-12-13 2015-12-17
# 4 A 4 0 2015-12-18 2015-12-22
# 5 A 5 0 2015-12-23 2015-12-27
# 6 A 6 0 2015-12-28 2016-01-01
# 7 A 7 0 2016-01-02 2016-01-06
# 8 A 8 0 2016-01-07 2016-01-11
# 9 A 9 0 2016-01-12 2016-01-16
# 10 A 10 0 2016-01-17 2016-01-21
# .. ... .. ... ... ...
First row of the output tells you that for product A in the first 4 days period (id = 1) you had 4 quantities in total and the period is from 3/12 to 7/12.
I would suggest {dplyr}'s summarise(), mutate() and group_by() functions. group_by() groups your data by the desired variables (in your case, product and delivery term), mutate() allows operations on grouped columns, and summarise() applies a summarising function over these groups (in your case sum(Quantity)).
So this is how it will look:
convert date into proper format:
library(dplyr)
df=tbl_df(df)
df$Date=as.Date(df$Date,format="%d/%m/%y")
calculating delivery terms
df=group_by(df,Product) %>% arrange(Date)
df=mutate(df,term=1+unclass((Date-min(Date)))%/%4)
group by product and terms and calculate sum of quantity:
df=group_by(df,Product,term)
summarise(df,sum=sum(Quantity))
Here's a base R way:
df$groups <- ave(as.numeric(df$Date), df$Product, FUN=function(x) {
intrvl <- findInterval(x, seq(min(x), max(x),4))
as.numeric(factor(intrvl))
})
df
# Product Quantity Date groups
# 1 A 1 2015-12-03 1
# 2 A 2 2015-12-04 1
# 3 A 1 2015-12-05 1
# 4 A 1 2015-12-08 2
# 5 A 1 2016-12-17 3
# 6 A 1 2016-12-18 3
# 7 B 1 2015-12-19 2
# 8 B 2 2015-05-10 1
# 9 B 2 2015-05-11 1
# 10 C 1 2015-06-01 1
# 11 C 1 2015-06-02 1
# 12 C 1 2015-06-12 2
The dates should be converted to one of the date classes. I chose as.Date. When it converts to numeric, the output will be the number of days from a specified date. From there, we are able to group by 4 day increments.
Data
df$Date <- as.Date(df$Date, format="%d/%m/%y")
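Grouping rows into intervals does not by itself give you the zeros for periods without sales, which is what the question asks for. A minimal base R sketch that keeps the empty periods is below. The helper period_sums() and the sales data are hypothetical (shortened so that product A matches the 4, 1, 0, 2 pattern described in the question), and period_days = 5 is an assumption chosen to match the inclusive ranges in the question such as 03/12 to 07/12; set it to 4 if a term really spans four calendar days.

```r
# Sum quantities per fixed-length period, counted from the first sale date.
# Periods without any sales are kept and reported as 0.
period_sums <- function(dates, qty, period_days = 5) {
  offset <- as.numeric(dates - min(dates))   # days since the first sale
  period <- offset %/% period_days + 1       # 1-based period index
  # explicit levels keep the periods in which nothing was sold
  period <- factor(period, levels = seq_len(max(period)))
  sums <- tapply(qty, period, sum)           # NA for empty periods
  sums[is.na(sums)] <- 0
  as.vector(sums)
}

# Hypothetical sample data, not the table from the question
sales <- data.frame(
  Product  = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  Quantity = c(1, 2, 1, 1, 1, 1, 2, 2, 1),
  Date     = as.Date(c("03/12/15", "04/12/15", "05/12/15", "08/12/15",
                       "18/12/15", "19/12/15",
                       "10/05/15", "11/05/15", "19/05/15"),
                     format = "%d/%m/%y")
)

# One vector of period sums per product
result <- lapply(split(sales, sales$Product),
                 function(p) period_sums(p$Date, p$Quantity))
result
```

Because split() and lapply() work per product, the same call should scale to a dataset with many product categories without writing explicit loops.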