I have time series data that I'm predicting on, so I am creating lag variables to use in my statistical analysis. I'd like a quick way to create multiple variables given specific inputs so that I can easily cross-validate and compare models.
The following is example code that adds 2 lags for 2 different variables (4 total) given a certain category (A, B, C):
# Load dplyr
library(dplyr)
# create day, category, and 2 value vectors
days = 1:9
cats = rep(c('A','B','C'),3)
set.seed(19)
values1 = round(rnorm(9, 16, 4))
values2 = round(rnorm(9, 16, 16))
# create data frame
data = data.frame(days, cats, values1, values2)
# mutate new lag variables
LagVal = data %>% arrange(days) %>% group_by(cats) %>%
  mutate(LagVal1.1 = lag(values1, 1)) %>%
  mutate(LagVal1.2 = lag(values1, 2)) %>%
  mutate(LagVal2.1 = lag(values2, 1)) %>%
  mutate(LagVal2.2 = lag(values2, 2))
LagVal
days cats values1 values2 LagVal1.1 LagVal1.2 LagVal2.1 LagVal2.2
<int> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A 16 -10 NA NA NA NA
2 2 B 14 24 NA NA NA NA
3 3 C 16 -6 NA NA NA NA
4 4 A 12 25 16 NA -10 NA
5 5 B 20 14 14 NA 24 NA
6 6 C 18 -5 16 NA -6 NA
7 7 A 21 2 12 16 25 -10
8 8 B 19 5 20 14 14 24
9 9 C 18 -3 18 16 -5 -6
My problem comes in at the # mutate new lag variables step, since I have about a dozen predictor variables that I would potentially want to lag up to 10 times (~13k row dataset), and I don't have the heart to create 120 new variables.
Here is my attempt at writing a function which mutates new variables given the inputs for data (dataset to mutate), variables (the variables you wish to lag), and lags (the number of lags per variable):
MultiMutate = function(data, variables, lags){
  # select the data to be working with
  FuncData = data
  # Loop through desired variables to mutate
  for (i in variables){
    # Loop through number of desired lags
    for (u in 1:lags){
      FuncData = FuncData %>% arrange(days) %>% group_by(cats) %>%
        # Mutate new variable for desired number of lags. Give the new variable a name with the lag number appended
        mutate(paste(i, u) = lag(i, u))
    }
  }
  FuncData
}
To be honest, I'm just sort of lost on how to get this to work. The ordering of my for-loops and overall logic makes sense, but the way the function should turn character inputs into variable names, and the overall syntax, seem way off. Is there a simple way to fix up this function to get my desired result?
In particular, I'm looking for:
A function like MultiMutate(data = data, variables = c(values1, values2), lags = 2) that would create the exact result of LagVal from above.
Dynamically naming the variables based on the variable and its lag, i.e. values1.1, values1.2, values2.1, values2.2, etc.
Thank you in advance and let me know if you need additional information. If there's a simpler way to get what I'm looking for, then I am all ears.
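As a point of reference before the answers: the attempted function only needs dplyr's tidy-evaluation operators to work. Here is a minimal sketch, assuming variables is passed as a character vector of column names:
# Minimal sketch of a working MultiMutate(); assumes `variables` is a
# character vector such as c("values1", "values2")
library(dplyr)
MultiMutate <- function(data, variables, lags) {
  FuncData <- data %>% arrange(days) %>% group_by(cats)
  for (i in variables) {
    for (u in 1:lags) {
      FuncData <- FuncData %>%
        # build the new column name as a string, then create it with !! and :=;
        # .data[[i]] looks up the source column by its string name
        mutate(!!paste0(i, ".", u) := lag(.data[[i]], u))
    }
  }
  ungroup(FuncData)
}
MultiMutate(data, variables = c("values1", "values2"), lags = 2)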
You'll have to reach deeper into the tidyverse toolbox to add them all at once. If you nest data for each value of cats, you can iterate over the nested data frames, iterating the lags over the values* columns in each.
library(tidyverse)
set.seed(47)
df <- data_frame(days = 1:9,
cats = rep(c('A','B','C'),3),
values1 = round(rnorm(9, 16, 4)),
values2 = round(rnorm(9, 16, 16)))
df %>%
  nest(-cats) %>%
  mutate(lags = map(data, function(dat) {
    imap_dfc(dat[-1], ~set_names(map(1:2, lag, x = .x),
                                 paste0(.y, '_lag', 1:2)))
  })) %>%
  unnest() %>%
  arrange(days)
#> # A tibble: 9 x 8
#> cats days values1 values2 values1_lag1 values1_lag2 values2_lag1
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 24. -7. NA NA NA
#> 2 B 2 19. 1. NA NA NA
#> 3 C 3 17. 17. NA NA NA
#> 4 A 4 15. 24. 24. NA -7.
#> 5 B 5 16. -13. 19. NA 1.
#> 6 C 6 12. 17. 17. NA 17.
#> 7 A 7 12. 27. 15. 24. 24.
#> 8 B 8 16. 15. 16. 19. -13.
#> 9 C 9 15. 36. 12. 17. 17.
#> # ... with 1 more variable: values2_lag2 <dbl>
data.table::shift makes this simpler, as it's vectorized. Naming takes more work than the actual lagging:
library(data.table)
setDT(df)
df[, sapply(1:2, function(x){paste0('values', x, '_lag', 1:2)}) := shift(.SD, 1:2),
   by = cats, .SDcols = values1:values2][]
#> days cats values1 values2 values1_lag1 values1_lag2 values2_lag1
#> 1: 1 A 24 -7 NA NA NA
#> 2: 2 B 19 1 NA NA NA
#> 3: 3 C 17 17 NA NA NA
#> 4: 4 A 15 24 24 NA -7
#> 5: 5 B 16 -13 19 NA 1
#> 6: 6 C 12 17 17 NA 17
#> 7: 7 A 12 27 15 24 24
#> 8: 8 B 16 15 16 19 -13
#> 9: 9 C 15 36 12 17 17
#> values2_lag2
#> 1: NA
#> 2: NA
#> 3: NA
#> 4: NA
#> 5: NA
#> 6: NA
#> 7: -7
#> 8: 1
#> 9: 17
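The call above hard-codes two variables and two lags, but the same shift() call generalizes directly. A sketch, where cols and n_lags are names chosen here:
# Lag every column in `cols` by 1..n_lags within each cats group
# (assumes setDT(df) has already been run, as above)
cols <- c("values1", "values2")
n_lags <- 2
df[, paste0(rep(cols, each = n_lags), "_lag", 1:n_lags) := shift(.SD, 1:n_lags),
   by = cats, .SDcols = cols][]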
In these cases, I rely on the magic of dplyr and tidyr:
library(dplyr)
library(tidyr)
set.seed(47)
# create data
s_data = data_frame(
days = 1:9,
cats = rep(c('A', 'B', 'C'), 3),
values1 = round(rnorm(9, 16, 4)),
values2 = round(rnorm(9, 16, 16))
)
max_lag = 2 # define max number of lags
# create lags
s_data %>%
  gather(key, value, -days, -cats) %>% # gather all variables that will be lagged
  mutate(n_lag = list(0:max_lag)) %>% # add list-column with lag numbers
  unnest() %>% # unnest the list column
  arrange(cats, key, n_lag, days) %>% # order the data.frame
  group_by(cats, key, n_lag) %>% # group by relevant variables
  # create lag. when grouped by the vars above, n_lag is a constant vector, so take its 1st value
  mutate(lag_val = lag(value, n_lag[1])) %>%
  ungroup() %>%
  # create some fancy labels
  mutate(var_name = ifelse(n_lag == 0, key, paste0("Lag", key, ".", n_lag))) %>%
  select(-c(key, value, n_lag)) %>% # drop unnecessary columns
  spread(var_name, lag_val) %>% # spread your newly created variables
  select(days, cats, starts_with("val"), starts_with("Lag")) # reorder
## # A tibble: 9 x 8
## days cats values1 values2 Lagvalues1.1 Lagvalues1.2 Lagvalues2.1 Lagvalues2.2
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 A 24. -7. NA NA NA NA
## 2 2 B 19. 1. NA NA NA NA
## 3 3 C 17. 17. NA NA NA NA
## 4 4 A 15. 24. 24. NA -7. NA
## 5 5 B 16. -13. 19. NA 1. NA
## 6 6 C 12. 17. 17. NA 17. NA
## 7 7 A 12. 27. 15. 24. 24. -7.
## 8 8 B 16. 15. 16. 19. -13. 1.
## 9 9 C 15. 36. 12. 17. 17. 17.
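In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(); the same pipeline could be sketched as:
# Sketch of the pipeline above using the newer pivot verbs
s_data %>%
  pivot_longer(-c(days, cats), names_to = "key", values_to = "value") %>%
  mutate(n_lag = list(0:max_lag)) %>%
  unnest(n_lag) %>%
  arrange(cats, key, n_lag, days) %>%
  group_by(cats, key, n_lag) %>%
  mutate(lag_val = lag(value, n_lag[1])) %>%
  ungroup() %>%
  mutate(var_name = ifelse(n_lag == 0, key, paste0("Lag", key, ".", n_lag))) %>%
  select(-c(key, value, n_lag)) %>%
  pivot_wider(names_from = var_name, values_from = lag_val) %>%
  select(days, cats, starts_with("val"), starts_with("Lag"))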
Below is my attempt at a minimal reproducible example. Briefly explained, I am using rollApply from the rowr package to calculate a function over a rolling window, and using data from two columns simultaneously. If possible, I would like to skip n steps between each time the function is calculated on a new window. I will try to make it clear what I mean in the example below.
Here is the example data:
library(dplyr) # for tibble(), %>% and mutate() used below

df1 <- tibble(
  x = c(1:9),
  y = c(1:9),
  Date = as.Date(c("2015-08-08", "2015-08-15", "2015-08-22",
                   "2015-08-29", "2015-09-05", "2015-09-12", "2015-09-19",
                   "2015-09-26", "2015-10-03"))
)
Here are the example functions:
calc_ex <- function(y){
  sum(y[,1] + y[,2])
}
roll_calc_ex <- function(y){
  vec <- c(rep(NA, 2), rowr::rollApply(y, calc_ex, window = 3, minimum = 3))
  y <- y %>%
    mutate(estimate = vec)
  return(y)
}
Applying the function roll_calc_ex() to df1, I get the following output:
> roll_calc_ex(df1)
# A tibble: 9 x 4
x y Date estimate
<int> <int> <date> <int>
1 1 1 2015-08-08 NA
2 2 2 2015-08-15 NA
3 3 3 2015-08-22 12
4 4 4 2015-08-29 18
5 5 5 2015-09-05 24
6 6 6 2015-09-12 30
7 7 7 2015-09-19 36
8 8 8 2015-09-26 42
9 9 9 2015-10-03 48
Ideally, I would like to have a rolling window that skips n steps, say n = 2, to produce the following output:
# A tibble: 9 x 4
x y Date estimate
<int> <int> <date> <int>
1 1 1 2015-08-08 NA
2 2 2 2015-08-15 NA
3 3 3 2015-08-22 12
4 4 4 2015-08-29 NA
5 5 5 2015-09-05 NA
6 6 6 2015-09-12 30
7 7 7 2015-09-19 NA
8 8 8 2015-09-26 NA
9 9 9 2015-10-03 48
Alternatively, instead of returning NA for every skipped row, the number from the previous calculation could be filled in (something I am planning to do later anyway using fill() from tidyverse).
If this is possible to solve using, for example, rollapply() from the zoo package, that would also be interesting to hear. I am only using rowr::rollApply() because I need to apply the function to two columns simultaneously. I know it is possible to use runner() from the "runner" package, but in my more complicated problem I need to run parallel computations. I am using the furrr package for parallelization, and my code works well with rollApply(), but not with runner(). The problem I have with runner() is explained here: Problem with parallelization using furrr [and runner::runner()] in R.
Thanks to anyone that took the time to read this post. Any help will be much appreciated.
1) The rowr package was removed from CRAN, but we can use rollapplyr from zoo (like rollapply, but the r on the end means it defaults to right alignment). It has a by.column= argument to specify whether processing is performed column by column (TRUE) or all columns are passed at once (FALSE), and a by= argument which causes skipping. Note that by = 2 evaluates every second window, giving values in rows 3, 5, 7 and 9 below; to reproduce the exact skip pattern shown in the question (rows 3, 6 and 9), use by = 3.
library(dplyr)
library(zoo)
mutate(df1, roll =
  rollapplyr(cbind(x, y), 3, calc_ex, fill = NA, by.column = FALSE, by = 2)
)
giving:
x y Date roll
1 1 1 2015-08-08 NA
2 2 2 2015-08-15 NA
3 3 3 2015-08-22 12
4 4 4 2015-08-29 NA
5 5 5 2015-09-05 24
6 6 6 2015-09-12 NA
7 7 7 2015-09-19 36
8 8 8 2015-09-26 NA
9 9 9 2015-10-03 48
2) Using complex arithmetic would also work:
f <- function(v) calc_ex(cbind(Re(v), Im(v)))
mutate(df1, roll = rollapplyr(x + y * 1i, 3, f, fill = NA, by = 2))
3) And if we look inside calc_ex, it could be written as follows (although this does not generalize):
mutate(df1, roll = rollapplyr(x + y, 3, sum, fill = NA, by = 2))
4) We could also consider using zoo objects rather than data frames:
z <- read.zoo(df1, index = "Date")
merge(z, roll = rollapplyr(z, 3, calc_ex, by.column = FALSE, by = 2))
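If a plain data frame is needed afterwards, the merged zoo object can be converted back with fortify.zoo(); a sketch (z2 is a name chosen here):
z2 <- merge(z, roll = rollapplyr(z, 3, calc_ex, by.column = FALSE, by = 2))
fortify.zoo(z2)  # back to a data frame with the Date index as a column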
If we were to use the slider package:
library(tidyverse)
library(slider)
df1 <- tibble(
x = c(1:9),
y = c(1:9),
Date = as.Date(c("2015-08-08", "2015-08-15", "2015-08-22",
"2015-08-29","2015-09-05", "2015-09-12", "2015-09-19",
"2015-09-26", "2015-10-03")))
df1 |>
  mutate(rolling_sum = slide2_dbl(.x = x, .y = y, .f = sum,
                                  .step = 3, .before = 2, .complete = TRUE))
#> # A tibble: 9 x 4
#> x y Date rolling_sum
#> <int> <int> <date> <dbl>
#> 1 1 1 2015-08-08 NA
#> 2 2 2 2015-08-15 NA
#> 3 3 3 2015-08-22 12
#> 4 4 4 2015-08-29 NA
#> 5 5 5 2015-09-05 NA
#> 6 6 6 2015-09-12 30
#> 7 7 7 2015-09-19 NA
#> 8 8 8 2015-09-26 NA
#> 9 9 9 2015-10-03 48
Created on 2021-10-21 by the reprex package (v2.0.1)
Suppose I have the following data and data frame:
sample_data <- c(1:14)
sample_data2 <- c(NA,NA,NA, "break", NA, NA, "break", NA,NA,NA,NA,NA,NA,"break")
sample_df <- as.data.frame(sample_data)
sample_df$sample_data2 <- sample_data2
When I print this data frame, the results are as follows:
sample_data sample_data2
1 1 <NA>
2 2 <NA>
3 3 <NA>
4 4 break
5 5 <NA>
6 6 <NA>
7 7 break
8 8 <NA>
9 9 <NA>
10 10 <NA>
11 11 <NA>
12 12 <NA>
13 13 <NA>
14 14 break
How would I program it so that at every "break", it outputs the max of sample_data from the preceding rows up to and including that row? For instance, I would want the code to output the set (4, 7, 14). Additionally, I would want the max to be computed only within each interval, i.e. between one "break" and the next.
I apologize in advance if I used any incorrect nomenclature.
I construct the groups by looking for the word "break" and then shifting the result one row up. Then some dplyr commands get the max of every group.
library(dplyr)
sample_df_new <- sample_df %>%
  mutate(group = c(1, cumsum(grepl("break", sample_data2)) + 1)[1:length(sample_data2)]) %>%
  group_by(group) %>%
  summarise(group_max = max(sample_data))
> sample_df_new
# A tibble: 3 x 2
group group_max
<dbl> <dbl>
1 1 4
2 2 7
3 3 14
I have an answer using data.table:
library(data.table)
sample_df <- setDT(sample_df)
sample_df[, group := (rleid(sample_data2) - 0.5) %/% 2]
sample_df[, .(maxvalues = max(sample_data)), by = group]
group maxvalues
1: 0 4
2: 1 7
3: 2 14
The tricky part is (rleid(sample_data2) - 0.5) %/% 2: rleid() creates an index that increases at each change of value:
sample_data sample_data2 rleid
1: 1 NA 1
2: 2 NA 1
3: 3 NA 1
4: 4 break 2
5: 5 NA 3
6: 6 NA 3
7: 7 break 4
8: 8 NA 5
9: 9 NA 5
10: 10 NA 5
11: 11 NA 5
12: 12 NA 5
13: 13 NA 5
14: 14 break 6
If you take the integer part of (that index - 0.5) / 2, which is what the %/% 2 does, you get a constant index for the rows you want, which you can use for the grouping operation:
sample_data sample_data2 group
1: 1 NA 0
2: 2 NA 0
3: 3 NA 0
4: 4 break 0
5: 5 NA 1
6: 6 NA 1
7: 7 break 1
8: 8 NA 2
9: 9 NA 2
10: 10 NA 2
11: 11 NA 2
12: 12 NA 2
13: 13 NA 2
14: 14 break 2
Then it is just a matter of taking the maximum for each group. You can easily translate this into dplyr if that is easier for you; a sketch follows below.
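For instance, a direct dplyr translation of the same trick might look like this (a sketch, borrowing rleid() from data.table):
library(dplyr)
sample_df %>%
  mutate(group = (data.table::rleid(sample_data2) - 0.5) %/% 2) %>%
  group_by(group) %>%
  summarise(maxvalues = max(sample_data))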
Here are 2 ways with base R. The trick is to define a grouping variable, grp.
grp <- !is.na(sample_df$sample_data2) & sample_df$sample_data2 == "break"
grp <- rev(cumsum(rev(grp)))
grp <- -1*grp + max(grp)
tapply(sample_df$sample_data, grp, max, na.rm = TRUE)
aggregate(sample_data ~ grp, sample_df, max, na.rm = TRUE)
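For reference, the constructed grouping variable and the tapply() result, computed from the data above:
grp
#  [1] 0 0 0 0 1 1 1 2 2 2 2 2 2 2
tapply(sample_df$sample_data, grp, max, na.rm = TRUE)
#  0  1  2
#  4  7 14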
Data.
This is simplified data creation code.
sample_data <- 1:14
sample_data2 <- c(NA,NA,NA, "break", NA, NA, "break", NA,NA,NA,NA,NA,NA,"break")
sample_df <- data.frame(sample_data, sample_data2)
Looks like there are lots of different ways of doing this. This is how I went about it:
rows <- which(sample_data2 == "break") # Get the row indices where "break" appears
findmax <- function(maxrow) {
  max(sample_data[1:maxrow])
} # Create a function that returns the max "up to" a given row
sapply(rows, findmax) # apply it for each of the break rows
### [1] 4 7 14
Note that this works "up to" the given row, i.e. from row 1 onward. To get the maximum value between two breaks, one of the other solutions would probably be easier, but you could also do it by slicing from just after the (j-1)th break row to the jth break row in the rows object, as sketched below.
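A sketch of that variant, assuming each interval starts right after the previous "break" (the first one at row 1):
starts <- c(1, head(rows, -1) + 1)  # first row of each interval
mapply(function(from, to) max(sample_data[from:to]), starts, rows)
### [1] 4 7 14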
Depending on whether you want to assess the maximum "sample_data" value between all "sample_data2" == "break" rows including (e.g. row 1 to row 4) or excluding (e.g. row 1 to row 3) the given "sample_data2" == "break" row, you can do something like this with the tidyverse:
Excluding the break rows:
sample_df %>%
  group_by(sample_data2) %>%
  mutate(temp = ifelse(is.na(sample_data2), NA_character_, paste0(gl(length(sample_data2), 1)))) %>%
  ungroup() %>%
  fill(temp, .direction = "up") %>%
  filter(is.na(sample_data2)) %>%
  group_by(temp) %>%
  summarise(res = max(sample_data))
temp res
<chr> <dbl>
1 1 3.
2 2 6.
3 3 13.
Including the break rows:
sample_df %>%
  group_by(sample_data2) %>%
  mutate(temp = ifelse(is.na(sample_data2), NA_character_, paste0(gl(length(sample_data2), 1)))) %>%
  ungroup() %>%
  fill(temp, .direction = "up") %>%
  group_by(temp) %>%
  summarise(res = max(sample_data))
temp res
<chr> <dbl>
1 1 4.
2 2 7.
3 3 14.
Both pipelines create an ID variable called "temp" using gl() for the "sample_data2" == "break" rows and then fill the NA rows upwards with that ID. The first pipeline filters out the "sample_data2" == "break" rows before taking the maximum "sample_data" value per group, while the second takes the maximum per group with the "sample_data2" == "break" rows included.
I have a table that has dates as a number and a value with each date. Now I'd like to add another column, weekSum, which contains the sum of value over the last week. However some dates are missing (so I can't always use the current and last 6 rows). My table looks like this:
df <- data.frame('date' = c(20160309, 20160310, 20160311, 20160312, 20160313, 20160314,
                            20160315, 20160316, 20160318, 20160319, 20160321),
                 'value' = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11))
date value
20160309 1
20160310 2
20160311 3
20160312 4
20160313 5
20160314 6
20160315 7
20160316 8
20160318 9 #17th skipped
20160319 10
20160321 11 #20th skipped
I'd like to get the following as output:
date value weekSum
20160309 1 NA
20160310 2 NA
20160311 3 NA
20160312 4 NA
20160313 5 NA
20160314 6 NA
20160315 7 28 # 1+2+3+4+5+6+7
20160316 8 35 # 2+3+4+5+6+7+8
20160318 9 39 # 4+5+6+7+8+9
20160319 10 45 # 5+6+7+8+9+10
20160321 11 45 # 7+8+9+10+11
How can this be done?
1) Convert the data frame to zoo and define a weekSum function which subsets its input to the last week and sums that. Then use rollapplyr with coredata = FALSE so that it passes a zoo object with times, not just the core data, to the weekSum function.
library(zoo)
z <- read.zoo(df, format = "%Y%m%d")
weekSum <- function(z) sum(z[time(z) > tail(time(z), 1) - 7])
transform(df, weekSum = rollapplyr(z, 7, weekSum, fill = NA, coredata = FALSE))
giving:
date value weekSum
2016-03-09 20160309 1 NA
2016-03-10 20160310 2 NA
2016-03-11 20160311 3 NA
2016-03-12 20160312 4 NA
2016-03-13 20160313 5 NA
2016-03-14 20160314 6 NA
2016-03-15 20160315 7 28
2016-03-16 20160316 8 35
2016-03-18 20160318 9 39
2016-03-19 20160319 10 45
2016-03-21 20160321 11 45
2) An alternative is to fill in the value at the missing dates with zero and then just use rollsumr with a width of 7. z is from (1).
z0 <- merge(z, zoo(, seq(start(z), end(z), "day")), fill = 0)
transform(df, weekSum = rollsumr(z0, 7, fill = NA)[z0 != 0])
With base R it can be done like this:
df$date <- as.Date(as.character(df$date), format = "%Y%m%d") # convert the numeric dates first
res <- merge(df, data.frame(date = seq(min(df$date), max(df$date), by = "day")), all.y = TRUE)
res$weekSum <- NA
for(i in seq_along(res$date)[-seq_len(6)]){
  res$weekSum[i] <- sum(res$value[(i - 6):i], na.rm = TRUE)
}
res <- res[!is.na(res$value), ]
res
#          date value weekSum
#1  2016-03-09     1      NA
#2  2016-03-10     2      NA
#3  2016-03-11     3      NA
#4  2016-03-12     4      NA
#5  2016-03-13     5      NA
#6  2016-03-14     6      NA
#7  2016-03-15     7      28
#8  2016-03-16     8      35
#10 2016-03-18     9      39
#11 2016-03-19    10      45
#13 2016-03-21    11      45
Here is an approach using tidyverse tools. This method uses tidyr::complete to construct the full date sequence, making it easy to take the current row and the previous 6 as suggested. Be careful here if there are NA values in value to begin with, as currently those rows will be filtered out at the end. Tweaks are possible to avoid this case if necessary.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
df <- data.frame('date' = c(20160309, 20160310, 20160311, 20160312, 20160313, 20160314,
                            20160315, 20160316, 20160318, 20160319, 20160321),
                 'value' = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11))
df %>%
  mutate(date = ymd(date)) %>%
  complete(date = seq.Date(min(date), max(date), by = 1)) %>%
  arrange(date) %>%
  mutate(
    newval = replace_na(value, 0),
    weekSum = newval + lag(newval) + lag(newval, 2) + lag(newval, 3) +
      lag(newval, 4) + lag(newval, 5) + lag(newval, 6)
  ) %>%
  select(-newval) %>%
  filter(!is.na(value))
#> # A tibble: 11 x 3
#> date value weekSum
#> <date> <dbl> <dbl>
#> 1 2016-03-09 1. NA
#> 2 2016-03-10 2. NA
#> 3 2016-03-11 3. NA
#> 4 2016-03-12 4. NA
#> 5 2016-03-13 5. NA
#> 6 2016-03-14 6. NA
#> 7 2016-03-15 7. 28.
#> 8 2016-03-16 8. 35.
#> 9 2016-03-18 9. 39.
#> 10 2016-03-19 10. 45.
#> 11 2016-03-21 11. 45.
Created on 2018-05-07 by the reprex package (v0.2.0).
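For a more recent take on this rolling-week sum, an index-aware sliding window from the slider package condenses the whole computation into one call. A sketch (slider postdates the answers above):
library(dplyr)
library(lubridate)
library(slider)
df %>%
  mutate(date = ymd(date),
         weekSum = slide_index_dbl(value, date, sum,
                                   .before = 6, .complete = TRUE))
# .before = 6 makes each window cover the current date and the 6 days before
# it, and .complete = TRUE returns NA until a full week of dates is available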
I am trying to manually calculate the variance (and mean) from categorical rating count data.
Item <- c("A", "B", "C", "D")
cat1 <- c(4,12,17,NA)
cat2 <- c(NA,10,20,15)
cat3 <- c(17,5,12,6)
cat4 <- c(10,12,17,NA)
cat5 <- c(3,21,NA,16)
cat6 <- c(2,14,12,20)
cat7 <- c(7,NA,18,23)
Data <- data.frame(Item=Item, Never=cat1,Rarely=cat2,Occasionally=cat3, Sometimes=cat4,Frequently=cat5,Usually=cat6,Always=cat7,stringsAsFactors=FALSE)
Data
Item Never Rarely Occasionally Sometimes Frequently Usually Always
1 A 4 NA 17 10 3 2 7
2 B 12 10 5 12 21 14 NA
3 C 17 20 12 17 NA 12 18
4 D NA 15 6 NA 16 20 23
Each categorical rating has an equivalent numeric value (1:7). I have calculated the average numerical rating for each Item as follows:
Rating_wt <- 1:7 # Vector of weights for each frequency rating
Rating.wt.mat <- rep(Rating_wt,each=dim(Data[,2:8])[1])
Data$Avg_rating <- rowSums(Data[,2:8]*Rating.wt.mat,na.rm=TRUE)/rowSums(Data[,2:8],na.rm=TRUE)
Data
Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg_rating
1 A 4 NA 17 10 3 2 7 3.976744
2 B 12 10 5 12 21 14 NA 3.837838
3 C 17 20 12 17 NA 12 18 3.739583
4 D NA 15 6 NA 16 20 23 5.112500
I would like to also calculate the variance for each Average and store that as a new variable in Data.
I believe I need to subtract the Average for each item from each numeric rating, multiply that difference by the count in each respective cell, sum those products across rows, and then divide by the total count in each row.
But, I can't figure out how to set up the element-wise calculations to accomplish that.
Conceptually, I think it should be something like this:
Data$Rating_var <- rowSums((Numeric_Rating - Avg_rating)*Value, na.rm=TRUE)/rowSums(Data[,2:8], na.rm=TRUE)
Where Numeric_Rating corresponds to Rating_wt:
Never = 1
Rarely = 2
Occasionally = 3
Sometimes = 4
Frequently = 5
Usually = 6
Always = 7
and Value is the corresponding cell for each Numeric_Rating by Item intersection.
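For reference, the conceptual formula can be written element-wise in base R roughly like this; a sketch, using squared deviations (the usual variance definition) rather than the raw differences above:
counts <- as.matrix(Data[, 2:8])
# squared deviation of each rating (1..7) from each Item's average;
# subtracting the length-4 vector recycles Avg_rating down the rows
dev2 <- (matrix(Rating_wt, nrow(Data), 7, byrow = TRUE) - Data$Avg_rating)^2
Data$Rating_var <- rowSums(dev2 * counts, na.rm = TRUE) / rowSums(counts, na.rm = TRUE)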
I'd suggest you try to reshape your dataset before you apply your calculations, as it will be easier.
library(dplyr)
library(tidyr)
Item <- c("A", "B", "C", "D")
cat1 <- c(4,12,17,NA)
cat2 <- c(NA,10,20,15)
cat3 <- c(17,5,12,6)
cat4 <- c(10,12,17,NA)
cat5 <- c(3,21,NA,16)
cat6 <- c(2,14,12,20)
cat7 <- c(7,NA,18,23)
Data <- data.frame(Item=Item, Never=cat1,Rarely=cat2,Occasionally=cat3, Sometimes=cat4,Frequently=cat5,Usually=cat6,Always=cat7,stringsAsFactors=FALSE)
Data %>%
  gather(category, value, -Item) %>% # reshape dataset to long form
  mutate(Rating = recode(category, "Never" = 1, "Rarely" = 2, "Occasionally" = 3,
                         "Sometimes" = 4, "Frequently" = 5,
                         "Usually" = 6, "Always" = 7)) %>% # assign numeric ratings
  group_by(Item) %>% # for each item
  mutate(Avg = sum(Rating*value, na.rm = TRUE) / sum(value, na.rm = TRUE), # calculate Avg
         variance = sum(abs(Rating - Avg)*value, na.rm = TRUE) / sum(value, na.rm = TRUE)) %>% # weighted mean absolute deviation from Avg
  ungroup() %>% # forget the grouping
  select(-Rating) %>% # no need for the rating any more
  spread(category, value) %>% # reshape back to original form
  select_(.dots = c(names(Data), "Avg", "variance")) # get columns in the desired order
# # A tibble: 4 x 10
# Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg variance
# * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 4 NA 17 10 3 2 7 3.976744 1.326122
# 2 B 12 10 5 12 21 14 NA 3.837838 1.530314
# 3 C 17 20 12 17 NA 12 18 3.739583 1.879991
# 4 D NA 15 6 NA 16 20 23 5.112500 1.529062
Try to run the piped process step by step to see how it works, especially if you're not familiar with the dplyr and tidyr syntax. Note that abs(Rating - Avg) yields the weighted mean absolute deviation; if you want the textbook variance, replace it with (Rating - Avg)^2.
I'm a newcomer to dplyr and have the following question. My data.frame has one column serving as a grouping variable. Some rows don't belong to a group, the grouping column being NA.
I need to add some columns to the data.frame using the dplyr function mutate. I'd prefer that dplyr ignores all rows where the grouping column equals to NA. I'll illustrate with an example:
library(dplyr)
set.seed(2)
# Setting up some dummy data
df <- data.frame(
  Group = factor(c(rep("A", 3), rep(NA, 3), rep("B", 5), rep(NA, 2))),
  Value = abs(as.integer(rnorm(13)*10))
)
# Using mutate to calculate differences between values within the rows of a group
df <- df %>%
  group_by(Group) %>%
  mutate(Diff = Value - lead(Value))
df
# Source: local data frame [13 x 3]
# Groups: Group [3]
#
# Group Value Diff
# (fctr) (int) (int)
# 1 A 8 7
# 2 A 1 -14
# 3 A 15 NA
# 4 NA 11 11
# 5 NA 0 -1
# 6 NA 1 -8
# 7 B 7 5
# 8 B 2 -17
# 9 B 19 18
# 10 B 1 -3
# 11 B 4 NA
# 12 NA 9 6
# 13 NA 3 NA
Calculating the differences between rows without a group makes no sense and is corrupting the data. I need to remove these rows and have done so like this:
df$Diff[is.na(df$Group)] <- NA
Is there a way to include the above command into the dplyr-chain using %>% ? Somewhere in the lines of:
df <- df %>%
  group_by(Group) %>%
  mutate(Diff = Value - lead(Value)) %>%
  filter(!is.na(Group))
But where the rows without a group are not removed altogether? Or even better, is there a way to make dplyr ignore rows without a group?
The desired outcome would be:
# Source: local data frame [13 x 3]
# Groups: Group [3]
#
# Group Value Diff
# (fctr) (int) (int)
# 1 A 8 7
# 2 A 1 -14
# 3 A 15 NA
# 4 NA 11 NA
# 5 NA 0 NA
# 6 NA 1 NA
# 7 B 7 5
# 8 B 2 -17
# 9 B 19 18
# 10 B 1 -3
# 11 B 4 NA
# 12 NA 9 NA
# 13 NA 3 NA
Simply use an ifelse condition for the variable that you are trying to create:
library(dplyr)
set.seed(2)
df = data.frame(
  Group = factor(c(rep("A", 3), rep(NA, 3), rep("B", 5), rep(NA, 2))),
  Value = abs(as.integer(rnorm(13)*10))
) %>%
  group_by(Group) %>%
  mutate(Diff = ifelse(is.na(Group), NA_integer_, Value - lead(Value)))
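An equivalent alternative is to compute the difference for every row and then blank out the ungrouped rows afterwards, mirroring the manual fix from the question (a sketch):
df %>%
  group_by(Group) %>%
  mutate(Diff = Value - lead(Value)) %>%
  ungroup() %>%
  mutate(Diff = replace(Diff, is.na(Group), NA))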