Context
I'd like to build two dummy survey dataframes for a project. One dataframe has responses to a Relationship survey, and the other to a Pulse survey.
Here is what each looks like -
Relationship Dataframe
#Relationship Data
rel_data <- data.frame(
  TYPE = rep('Relationship', 446),
  SURVEY_ID = rep('SURVEY 2018 Z662700', 446),
  SITE_ID = rep('Z662700', 446),
  START_DATE = rep(as.Date('2018-07-01'), 446),
  END_DATE = rep(as.Date('2018-07-04'), 446)
)
Pulse Dataframe
#Pulse Data
pulse_data <- data.frame(
  TYPE = rep('Pulse', 525),
  SURVEY_ID = rep('SURVEY 2018 W554800', 525),
  SITE_ID = rep('W554800', 525),
  START_DATE = rep(as.Date('2018-04-01'), 525),
  END_DATE = rep(as.Date('2018-04-04'), 525)
)
My Objective
I'd like to add columns to each of these two dataframes, based on conditions from a reference table.
The reference table consists of the questions to be added to each of the two survey dataframes, along with further details on each question asked. This is what it looks like
Reference Table
#Reference Table - Question Bank
qbank <- data.frame(
  QUEST_ID = c('QR1','QR2','QR3','QR4','QR5','QP1','QP2','QP3','QP4','QP5','QP6'),
  QUEST_TYPE = c(rep('Relationship', 5), rep('Pulse', 6)),
  SCALE = c('Preference','Satisfaction','Satisfaction','Satisfaction','Preference',
            'NPS','Satisfaction','Satisfaction','Satisfaction','Preference','Open-Ended'),
  FOLLOWUP = c('No','No','No','No','No','No','Yes','No','Yes','No','No')
)
The Steps
For each survey dataframe (Relationship & Pulse), I'd like to do the following -
1) Lookup their respective question codes in the reference table, and add only those questions to the dataframe. For example, the Relationship dataframe would have only question codes pertaining to TYPE = 'Relationship' from the reference table. And the same for the Pulse dataframe.
2) The responses to each question would be conditionally added to each dataframe. Here are the conditions -
If SCALE = 'Preference' in the Reference table, then responses would be either 150, 100, 50, 0 or -50. These numbers would be generated in random order.
If SCALE = 'NPS' in the Reference table, then responses would range from 0 to 10. Numbers would be generated such that the Net Promoter Score (NPS) equals 50%. Reminder: NPS = Percentage of 9s & 10s minus Percentage of 0s to 6s.
If SCALE = 'Satisfaction' in the Reference table, then responses would range from 1 (Extremely Dissatisfied) to 5 (Extremely Satisfied). Numbers would be generated such that the percentage of 1s & 2s equal 90%.
If SCALE = 'Open-Ended' in the Reference table, then ensure the column is empty (i.e. contains no responses).
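The rules above can be sketched with a small helper. This is only one possible approach: `make_responses` and `add_questions` are hypothetical helpers (not part of any package), and the NPS/Satisfaction mixes below hit the stated targets (NPS = 50%, 90% of ratings being 1s & 2s) only in expectation; for exact percentages you would build the counts explicitly and shuffle them with `sample()`.

```r
# Sketch: generate n responses for a given SCALE value
make_responses <- function(scale, n) {
  switch(scale,
    'Preference' = sample(c(150, 100, 50, 0, -50), n, replace = TRUE),
    # 65% promoters (9-10), 15% detractors (0-6), 20% passives (7-8)
    # => expected NPS = 65% - 15% = 50%
    'NPS' = sample(c(9, 10, 7, 8, 0:6), n, replace = TRUE,
                   prob = c(.325, .325, .10, .10, rep(.15 / 7, 7))),
    # 90% of responses are 1s and 2s; the remaining 10% spread over 3-5
    'Satisfaction' = sample(1:5, n, replace = TRUE,
                            prob = c(.45, .45, .10/3, .10/3, .10/3)),
    'Open-Ended' = rep(NA, n)
  )
}

# Sketch of steps 1 & 2: look up the matching questions in the
# reference table and attach each one as a generated column
add_questions <- function(survey_df, qbank, type) {
  qs <- qbank[qbank$QUEST_TYPE == type, ]
  for (i in seq_len(nrow(qs))) {
    survey_df[[as.character(qs$QUEST_ID[i])]] <-
      make_responses(as.character(qs$SCALE[i]), nrow(survey_df))
  }
  survey_df
}
```

Called as `add_questions(rel_data, qbank, 'Relationship')`, this would add only the QR columns to the Relationship dataframe, and likewise for Pulse.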
My Attempt
Using this previously asked question for the conditional response creation and this one to add columns from the reference table, I attempted to solve the problem. But I haven't got what I was looking for yet.
Any inputs on this would be greatly appreciated.
Desired Output
My desired output tables would look like this -
Relationship Dataframe Output
TYPE SURVEY_ID SITE_ID START_DATE END_DATE QR1 QR2 QR3 QR4 QR5
1 Relationship SURVEY 2018 Z662700 Z662700 2018-07-01 2018-07-04 150 5 1 2 2
2 Relationship SURVEY 2018 Z662700 Z662700 2018-07-01 2018-07-04 100 1 2 2 2
3 Relationship SURVEY 2018 Z662700 Z662700 2018-07-01 2018-07-04 100 4 5 2 2
4 Relationship SURVEY 2018 Z662700 Z662700 2018-07-01 2018-07-04 150 1 1 2 2
and so on
And the Pulse Dataframe Output
TYPE SURVEY_ID SITE_ID START_DATE END_DATE QP1 QP2 QP3 QP4 QP5 QP6
1 Pulse SURVEY 2018 W554800 W554800 2018-04-01 2018-04-04 7 1 3 3 100
2 Pulse SURVEY 2018 W554800 W554800 2018-04-01 2018-04-04 8 5 3 1 100
3 Pulse SURVEY 2018 W554800 W554800 2018-04-01 2018-04-04 3 1 4 3 100
4 Pulse SURVEY 2018 W554800 W554800 2018-04-01 2018-04-04 1 2 4 3 100
and so on
Will something like
rel_data %>%
  left_join(qbank, by = c("TYPE" = "QUEST_TYPE")) %>%
  select(-FOLLOWUP) %>%
  unique() %>%
  mutate(val = case_when(
    SCALE == "Preference" ~ "A",
    SCALE == "Satisfaction" ~ "B",
    SCALE == "NPS" ~ "C",
    TRUE ~ NA_character_)) %>%
  select(-SCALE) %>%
  spread(key = QUEST_ID, value = val)
work for you?
you can modify the case_when conditions to fit your need.
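One caveat with a chain like that: because every row of rel_data is identical, unique() collapses the 446 respondent rows to one before spreading. A sketch that keeps one row per respondent by tagging rows first (assuming dplyr and tidyr; the toy frames below are cut-down versions of the ones in the question, and resp is a hypothetical helper column):

```r
library(dplyr)
library(tidyr)

# Toy versions of the frames: 4 respondents, 2 Relationship questions
rel_data <- data.frame(TYPE = rep('Relationship', 4),
                       SITE_ID = rep('Z662700', 4))
qbank <- data.frame(QUEST_ID = c('QR1', 'QR2', 'QP1'),
                    QUEST_TYPE = c('Relationship', 'Relationship', 'Pulse'),
                    SCALE = c('Preference', 'Satisfaction', 'NPS'))

rel_wide <- rel_data %>%
  mutate(resp = row_number()) %>%      # tag each respondent before joining
  left_join(qbank, by = c('TYPE' = 'QUEST_TYPE')) %>%
  mutate(val = NA_real_) %>%           # placeholder for generated responses
  select(-SCALE) %>%
  spread(key = QUEST_ID, value = val) %>%
  select(-resp)
```

This keeps all 4 rows and adds only the QR1/QR2 columns; the val placeholder is where the conditional response generation would plug in.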
Related
I have a dataset I'm working with in R that has an ID number (ID), year they submitted data (Year), some other data (which isn't relevant to my question, but just consider them as "columns"), and a date of registration on our systems (DateR).
This DateR is autogenerated from the dataset I am using, and is supposed to represent the "earliest" date the ID number appears on our systems.
However, due to some kind of problem with how the data is being pulled that I can't get fixed, the date is being recorded as a new date that updates every year, instead of simply the earliest date.
Thus, my goal would be to create a script that reworks the data and does the following two checks:
Firstly, it checks the rows and identifies which ones have a matching ID number
Secondly, it applies the "earliest" date among all rows with that matching ID number to the date column.
So below is the example of a Dataset like what I am using
#  ID1    YearSubmitted  Data  DateR
1  12345  2017           100   22-03-2017
2  12345  2018           100   22-03-2018
3  12345  2019           100   22-03-2019
4  22221  2018           100   22-03-2018
5  22221  2019           100   22-03-2019
This is what I would like it to look like (note the changed DateR values for rows 2 and 3):
#  ID1    YearSubmitted  Data  DateR
1  12345  2017           100   22-03-2017
2  12345  2018           100   22-03-2017
3  12345  2019           100   22-03-2017
4  22221  2018           100   22-03-2018
5  22221  2019           100   22-03-2018
Most of the reference questions I have searched either replace data with values from another column, like If data present, replace with data from another column based on row ID, or use a replacement value pulled from another dataframe, like Replace a value in a dataframe by using other matching IDs of another dataframe in R.
I would prefer to achieve this in dplyr if possible.
Preferably I'd like to start this with
data %>%
group_by(ID1, YearSubmitted) %>%
mutate(across(c(DateR),
And I understand I could use the match function .. but I just draw a blank from this point on.
Thus, I would appreciate advice on how to:
Conditionally change the date if it's matching ID1 values, and secondly, to change all dates to the earliest value in the date column (DateR).
Thanks for your time.
Try this:
quux %>%
mutate(DateR = as.Date(DateR, format = "%d-%m-%Y")) %>%
group_by(ID1) %>%
mutate(DateR = min(DateR)) %>%
ungroup()
# # A tibble: 5 × 5
# `#` ID1 YearSubmitted Data DateR
# <int> <int> <int> <int> <date>
# 1 1 12345 2017 100 2017-03-22
# 2 2 12345 2018 100 2017-03-22
# 3 3 12345 2019 100 2017-03-22
# 4 4 22221 2018 100 2018-03-22
# 5 5 22221 2019 100 2018-03-22
This involves converting DateR to a "real" Date-class object, where numeric comparisons (such as min) are unambiguous and correct.
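A quick illustration of why the conversion matters: as character strings in dd-mm-YYYY form, dates compare lexicographically rather than chronologically.

```r
# As text, "05-01-2019" sorts before "22-03-2018" (it starts with "0"),
# so min() picks the wrong "earliest" date:
min(c("22-03-2018", "05-01-2019"))
# [1] "05-01-2019"

# As Date objects, comparison follows the calendar:
min(as.Date(c("22-03-2018", "05-01-2019"), format = "%d-%m-%Y"))
# [1] "2018-03-22"
```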
Data
quux <- structure(list("#" = 1:5, ID1 = c(12345L, 12345L, 12345L, 22221L, 22221L), YearSubmitted = c(2017L, 2018L, 2019L, 2018L, 2019L), Data = c(100L, 100L, 100L, 100L, 100L), DateR = c("22-03-2017", "22-03-2018", "22-03-2019", "22-03-2018", "22-03-2019")), class = "data.frame", row.names = c(NA, -5L))
Here is a similar approach using dplyr's first() function after using arrange() to sort the years:
df %>%
group_by(ID1) %>%
arrange(YearSubmitted,.by_group = TRUE) %>%
mutate(DateR = first(DateR))
ID1 YearSubmitted Data DateR
<int> <int> <int> <chr>
1 12345 2017 100 22-03-2017
2 12345 2018 100 22-03-2017
3 12345 2019 100 22-03-2017
4 22221 2018 100 22-03-2018
5 22221 2019 100 22-03-2018
Trying to find the cumsum across different types of contracts. Each has a unique stop (i.e. delivery) date with several months of expected delivery leading up to that date. Needing to calculate the cumsum of all expected deliveries before the actual delivery date.
For some reason the cumsum/rollsum function is not working. I have tried both the data.table and dplyr versions, but both have failed.
Here is a simplified data for the problem I am working on.
df <- data.frame(report_year = c(rep(2017,10), rep(2018,10)),
report_month = c(seq(1,5,1), seq(2,6,1), seq(3,7,1), seq(2,6,1)),
delivery_year = c(rep(2017,10), rep(2018,10)),
delivery_month = c(rep(5,5),rep(6,5), rep(7,5), rep(6,5)),
sum = c(rep(seq(100,500,100), 4)),
cumsum = c(rep(c(100,300,600,1000,1500),4)))
The first 5 columns are what I currently have.
I am trying to get the last column (i.e. cumsum).
I am probably doing something wrong. Any help is appreciated.
The question did not specifically define which grouping columns to use, so this may have to be modified slightly depending on what you want, but this does it without any packages:
df$cumsum <- NULL # remove the result from df shown in question
transform(df, cumsum = ave(sum, delivery_year, delivery_month, FUN = cumsum))
Note that although the above works, you may run into problems using sum and cumsum as column names due to confusion with the functions of the same name, so you might want to use Sum and Cumsum, say. For example, if you don't NULL out the cumsum column as we did above, then FUN = cumsum will try to use the cumsum column, which is not a function.
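For instance, with capitalized names the same ave() call raises no such ambiguity (a small self-contained sketch; the toy frame here is a cut-down version of the df in the question):

```r
# Sketch: safer column names -- Sum instead of sum, Cumsum instead of cumsum
df2 <- data.frame(delivery_year = rep(2017, 6),
                  delivery_month = rep(c(5, 6), each = 3),
                  Sum = rep(c(100, 200, 300), 2))

# ave() applies cumsum() within each (delivery_year, delivery_month) group
df2$Cumsum <- ave(df2$Sum, df2$delivery_year, df2$delivery_month, FUN = cumsum)
df2$Cumsum
# [1] 100 300 600 100 300 600
```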
Use arrange and mutate
# Import library
library(dplyr)
# Calculating cumsum
df %>%
  group_by(delivery_year, delivery_month) %>%
  arrange(sum) %>%
  mutate(cs = cumsum(sum))
Output
report_year report_month delivery_year delivery_month sum cumsum cs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017 1 2017 5 100 100 100
2 2017 2 2017 6 100 100 100
3 2018 3 2018 7 100 100 100
4 2018 2 2018 6 100 100 100
5 2017 2 2017 5 200 300 300
6 2017 3 2017 6 200 300 300
7 2018 4 2018 7 200 300 300
I am working on climate data analysis. After loading file in R, my interest is to subset data based upon hours in a day.
For time analysis we can use $hour with the variable in which the time vector is stored, if our interest is to deal with hours.
I want to subset my data for each hour in a day for 365 days and then take an average of the data at a particular hour throughout the year. Say I am interested in taking values of irradiation/wind speed etc. at 12:00 PM for a year and then taking the mean of these values to get the desired result.
I know how to subset a data frame based upon conditions. If, for example, my data is in a matrix called data and contains 2 columns, say time and wind speed, and I'm interested in subsetting rows of data in which irradiation isn't zero, we can do this using the following code:
my_data <- subset(data, data[,1]>0)
but now, in order to deal with hour values in the time column, which is a variable stored in data, how can I subset values?
My data look like this:
I hope I made sense in this question.
Thanks in advance!
Here is a possible solution. You can create an hourly grouping with format(df$time,'%H'), so we obtain only the hour for each period; we can then simply group by this new column and calculate the mean for each group.
df = data.frame(time=seq(Sys.time(),Sys.time()+2*60*60*24,by='hour'),val=sample(seq(5),49,replace=T))
library(dplyr)
df %>%
  mutate(hour = format(time, '%H')) %>%
  group_by(hour) %>%
  summarize(mean_val = mean(val))
To subset the non-zero values first, you can do either:
df = subset(df,val!=0)
or start the dplyr chain with:
df %>% filter(val != 0)
Hope this helps!
df looks as follows:
time val
1 2018-01-31 12:43:33 4
2 2018-01-31 13:43:33 2
3 2018-01-31 14:43:33 2
4 2018-01-31 15:43:33 3
5 2018-01-31 16:43:33 3
6 2018-01-31 17:43:33 1
7 2018-01-31 18:43:33 2
8 2018-01-31 19:43:33 4
... ... ... ...
And the output:
# A tibble: 24 x 2
hour mean_val
<chr> <dbl>
1 00 3.50
2 01 3.50
3 02 4.00
4 03 2.50
5 04 3.00
6 05 2.00
.... ....
This assumes your time column is already of class POSIXct, otherwise you'd first have to convert it using for example as.POSIXct(x,format='%Y-%m-%d %H:%M:%S')
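Putting that together for the original ask (values at one fixed hour, e.g. 12:00 PM), here is a sketch assuming the time column starts out as character; the toy data and column names are illustrative:

```r
# Sketch: convert a character time column, then keep only the 12:00 hour
df <- data.frame(
  time = c("2018-01-31 11:43:33", "2018-01-31 12:43:33",
           "2018-02-01 12:05:00", "2018-02-01 13:00:00"),
  val  = c(1, 4, 2, 3)
)
df$time <- as.POSIXct(df$time, format = "%Y-%m-%d %H:%M:%S")

# format(time, "%H") extracts the hour as a two-character string
noon <- subset(df, format(time, "%H") == "12")
mean(noon$val)   # the yearly mean at 12:00 would be computed the same way
# [1] 3
```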
Below is the dataframe I get by running a query. Please note that df1 is a dynamic dataframe and might return either an empty df or a partial df that does not contain all the quarters seen below:
df1
FISC_QTR_VAL Revenue
1 2014-Q1 0.00
2 2014-Q2 299111.86
3 2014-Q3 174071.98
4 2014-Q4 257655.30
5 2015-Q1 0.00
6 2015-Q2 317118.63
7 2015-Q3 145461.88
8 2015-Q4 162972.41
9 2016-Q1 96896.04
10 2016-Q2 135058.78
11 2016-Q3 111773.77
12 2016-Q4 138479.28
13 2017-Q1 169276.04
I would want to check the values of all the rows in the Revenue column and see if any value is 0 or negative, excluding the 2014-Q1 row.
Also, df1 is dynamic and will contain only 12 quarters of data, i.e. when I reach the next quarter (2017-Q2), the Revenue associated with 2014-Q2 becomes 0 and it will look like this:
df1
FISC_QTR_VAL Revenue
1 2014-Q1 0.00
2 2014-Q2 0.00
3 2014-Q3 174071.98
4 2014-Q4 257655.30
5 2015-Q1 0.00
6 2015-Q2 317118.63
7 2015-Q3 145461.88
8 2015-Q4 162972.41
9 2016-Q1 96896.04
10 2016-Q2 135058.78
11 2016-Q3 111773.77
12 2016-Q4 138479.28
13 2017-Q1 169276.04
14 2017-Q2 146253.64
In the above case, I would need to check all rows for the Revenue column by excluding 2014-Q1 and 2014-Q2
And this goes on as the quarters progress.
I need your help to generate code that dynamically does all of the above steps: excluding the stale row(s) and checking only the rows that matter for a particular quarter.
Currently, I am using the below code:
#Taking the first df1 into consideration which has 2017-Q1 as the last quarter
startQtr <- "2014-Q2" #This value is dynamically achieved and will change as we move ahead. Next quarter, the value changes to 2014-Q3 and so on
if (length(df1[["FISC_QTR_VAL"]][nrow(df1) - 11] == startQtr) == 1) {
  if (nrow(df1[df1$Revenue < 0, ]) == 0 & nrow(df1[df1$Revenue == 0, ]) == 0) {
    df1 <- df1 %>% slice((nrow(df1) - 11):nrow(df1))
  }
}
The first IF condition checks whether there is data in df1.
If the df is empty, df1[["FISC_QTR_VAL"]][nrow(df1)-11] == startQtr would return numeric(0), whose length is 0, and hence the condition fails.
If not, it goes to the next IF condition and checks for negative and 0 values in the Revenue column. But it does so for all the rows. I want 2014-Q1 excluded in this case, and going forward to future quarters, I would want the condition to stay dynamic as explained above.
Also, I do not want to slice the dataset before the if condition, as the code would throw an error if the initial dataframe df1 returns only 1 or 2 rows and we try to slice those further.
Thanks
Here's a solution using a few functions from the dplyr and tidyr packages.
Here's a toy data set to work with:
d <- data.frame(
FISC_QTR_VAL = c("2015-Q1", "2014-Q2", "2014-Q1", "2015-Q2"),
Revenue = c(100, 200, 0, 0)
)
d
#> FISC_QTR_VAL Revenue
#> 1 2015-Q1 100
#> 2 2014-Q2 200
#> 3 2014-Q1 0
#> 4 2015-Q2 0
Notice that FISC_QTR_VAL is intentionally out of order (as a precaution).
Next, set variables for the current year and quarter (you'll see why they're separate in a moment):
current_year <- 2014
current_quarter <- 2
Then run the following:
d %>%
separate(FISC_QTR_VAL, c("year", "quarter"), sep = "-Q") %>%
arrange(year, quarter) %>%
slice(which(year == current_year & quarter == current_quarter):n()) %>%
filter(Revenue <= 0)
#> year quarter Revenue
#> 1 2015 2 0
First, we separate() the FISC_QTR_VAL into separate year and quarter variables for (a) a tidy data set and (b) a way to order the data in case it's out of order (as in the toy used here). We then arrange() the data so that it's ordered by year and quarter. Then, we slice() away any quarters prior to the current one, and then filter() to return all rows where Revenue <= 0.
To alternatively get, for example, a count of the number of rows that are returned, you can pipe on something like nrow().
Is the subset function an option for you?
exclude.qr <- c("2014-Q1", "2014-Q2")
df <- data.frame(
FISC_QTR_VAL = c("2014-Q1", "2014-Q2", "2014-Q3", "2014-Q4"),
Revenue = c(0.00, 299111.86, 174071.98, 257655.30))
subset(
  df,
  !(FISC_QTR_VAL %in% exclude.qr) & Revenue > 0)
You can easily create exclude.qr dynamically, e.g. via paste() and years <- 2010:END.
I hope this is helpful!
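For example, the exclusion vector could be built like this (a sketch; which quarters to exclude depends on the current period):

```r
# Sketch: build the quarters to exclude dynamically with paste0()
years <- 2014
exclude.qr <- paste0(years, "-Q", 1:2)
exclude.qr
# [1] "2014-Q1" "2014-Q2"
```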
I have a dataset like the following, the last column is desired output.
DX_CD AID date2 <count.occurences.1000.days>
1 272.4 1649 2007-02-10 0 or N/A
2 V58.67 1649 2007-02-10 0 (excluding the same day), or 1
3 787.91 1649 2010-04-14 0
4 788.63 1649 2011-03-10 1
5 493.90 4193 2007-09-13 0 or N/A #new AID
6 787.20 6954 2010-02-25 0 or N/A #new AID
.....
I want to compute the column (count.occurences.1000.days) that counts the number of previous occurrences within X days (e.g. X=1000) by AID.
The first value is 0 or N/A because there is no previous record before record #1 for AID=1649. The second value is 0 because this event occurs on the same day as record #1. Third value is 0 because there are records older than 2010-04-14, but they are beyond 1000days. Fourth value is 1 because the record #3 happened within 1000 days. Same logic goes for AID=4193 and AID=6954
Can someone provide an idea, preferably vectorized?
If I understood the question correctly, this should do.
First, a sample of the data
df <- data.frame(
  date2 = seq(as.Date("2008-12-30"), as.Date("2015-01-03"), by = "days"),
  AID = sample(c(1649, 4193, 6954, 3466), 2196, replace = TRUE),
  count = rep.int(1, 2196))
Now we group by the 1000 days from max to min
df$date.bin <- Hmisc::cut2(df$date2,
cuts=sort(seq(max(df$date2), length=10,by="-1000 days")))
Now we use cumsum on the grouped variables
res <- df %>%
  dplyr::arrange(date.bin, AID) %>%
  group_by(date.bin, AID) %>%
  mutate(cumsum = cumsum(count))
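Note that fixed 1000-day bins are only an approximation of a rolling window. An alternative sketch that counts, for each row, the earlier same-AID rows falling strictly within the last X days (quadratic per group, but it matches the question's "excluding the same day" rule; the toy data mirrors the question's example):

```r
library(dplyr)

# Toy data mirroring the question's example rows
df <- data.frame(
  AID   = c(1649, 1649, 1649, 1649, 4193),
  date2 = as.Date(c("2007-02-10", "2007-02-10", "2010-04-14",
                    "2011-03-10", "2007-09-13"))
)

X <- 1000
res <- df %>%
  group_by(AID) %>%
  arrange(date2, .by_group = TRUE) %>%
  # for each row i, count strictly earlier dates within the last X days
  mutate(count.prev = sapply(seq_along(date2), function(i)
    sum(date2 < date2[i] & date2 >= date2[i] - X))) %>%
  ungroup()

res$count.prev
# [1] 0 0 0 1 0
```

The strict `<` excludes same-day records, so the two 2007-02-10 rows both get 0, the 2010-04-14 row gets 0 (its predecessors are beyond 1000 days), and the 2011-03-10 row gets 1, as in the desired output.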