Using grepl to match character in a string of characters with delimiters - r

There are a number of solutions for using grepl(), but none which solves my problem (that I have come across so far). I have two data frames. The first labelled x containing a set of combinations associated with a letter:
structure(list(variable = c("A", "B", "C", "D"), combinations = c("16, 17, 18",
"17,18", "16,18", "12,3")), class = "data.frame", row.names = c(NA,
-4L))
> x
variable combinations
1 A 16, 17, 18
2 B 17,18
3 C 16,18
4 D 12,3
The second data frame is the results. It is a set of observations showing the letters that a species interacted with. Below is just one set of observations:
structure(list(variable = c("A, C", NA, NA), species = c("16",
"17", "18"), active = c("16", NA, NA)), class = "data.frame", row.names = c(NA,
-3L))
> y
variable species active
1 A, C 16 16
2 <NA> 17 <NA>
3 <NA> 18 <NA>
This was the original structure of y:
> y
variable species.active species.present
1 A, C 16 17,18
The structure was changed to add more columns associated to each species (so each species had a row), thus the structure serves a specific purpose.
What I want is a binary column (T/F or 0/1) showing whether or not each species is in the combinations associated with the variable.
This is what I have managed so far:
library(zoo)
library(dplyr)
#carry locf so that each species are assigned the same variables
y <- y %>%
mutate(variable = zoo::na.locf(variable))
#separate each row to separate combinations
library(tidyr)
y <- separate_rows(y, variable)
#match x$variable by y$variable to add associated combinations in a new column in y
y$combinations <- ifelse(y$variable %in% x$variable, x$combinations)
#return true or false if each species is in the combination
y$type <- grepl(y$species, y$combinations);y
> y
variable species active combinations type
<chr> <chr> <chr> <chr> <lgl>
1 A 16 16 16, 17, 18 TRUE
2 C 16 16 17,18 FALSE
3 A 17 NA 16,18 TRUE
4 C 17 NA 12,3 FALSE
5 A 18 NA 16, 17, 18 TRUE
6 C 18 NA 17,18 FALSE
As you can see, the combinations are wrong and grepl() returns incorrect T/F values (see row 3, where it says TRUE even though '17' is not in the combination).
If anyone could help, that would be greatly appreciated.

Try this, choosing one of type1 or type2 (same result), whichever you prefer.
library(dplyr)
left_join(y, x, by = "variable") %>%
mutate(
type1 = mapply(`%in%`, species, strsplit(combinations, "\\D+")),
type2 = mapply(grepl, paste0("\\b", species, "\\b"), combinations)
)
# # A tibble: 6 x 6
# variable species active combinations type1 type2
# <chr> <chr> <chr> <chr> <lgl> <lgl>
# 1 A 16 16 16, 17, 18 TRUE TRUE
# 2 C 16 16 16,18 TRUE TRUE
# 3 A 17 <NA> 16, 17, 18 TRUE TRUE
# 4 C 17 <NA> 16,18 FALSE FALSE
# 5 A 18 <NA> 16, 17, 18 TRUE TRUE
# 6 C 18 <NA> 16,18 TRUE TRUE
Or starting with the original y:
y
# variable species active
# 1 A, C 16 16
# 2 <NA> 17 <NA>
# 3 <NA> 18 <NA>
y %>%
mutate(variable = zoo::na.locf(variable)) %>%
tidyr::separate_rows(variable) %>%
left_join(., x, by = "variable") %>%
mutate(type1 = mapply(`%in%`, species, strsplit(combinations, "\\D+")), type2 = mapply(grepl, paste0("\\b", species, "\\b"), combinations))
# # A tibble: 6 x 6
# variable species active combinations type1 type2
# <chr> <chr> <chr> <chr> <lgl> <lgl>
# 1 A 16 16 16, 17, 18 TRUE TRUE
# 2 C 16 16 16,18 TRUE TRUE
# 3 A 17 <NA> 16, 17, 18 TRUE TRUE
# 4 C 17 <NA> 16,18 FALSE FALSE
# 5 A 18 <NA> 16, 17, 18 TRUE TRUE
# 6 C 18 <NA> 16,18 TRUE TRUE
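As a small illustration of why the type2 pattern wraps each species in \\b, here is a sketch with made-up values (the species "1" and the vectors below are assumptions for the example, not part of the original data). Without word boundaries a short number can match inside a longer one.

```r
# Without boundaries, the pattern "1" matches inside "12".
loose  <- grepl("1", "12,3")
strict <- grepl("\\b1\\b", "12,3")

# grepl() is vectorised over x but not over pattern,
# so mapply() is used to pair each pattern with its own string.
species <- c("16", "17", "1")               # hypothetical example values
combos  <- c("16, 17, 18", "16,18", "12,3")
hits <- mapply(grepl, paste0("\\b", species, "\\b"), combos, USE.NAMES = FALSE)
print(hits)
```

The strsplit()/%in% variant (type1) avoids regular expressions altogether, which may be preferable if the values could ever contain regex metacharacters.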
FYI, some things wrong with your question:
When asking questions that involve warnings or errors, include them in the question. In this case, grepl's pattern argument must be length 1, and it appears you are ignoring the warning that says so:
grepl(y$species, y$combinations)
# Warning in grepl(y$species, y$combinations) :
# argument 'pattern' has length > 1 and only the first element will be used
ifelse in your code seems to work, but you are using it incorrectly: it requires a no= argument as well, so there needs to be something as its third argument. It does not error here because everything resolves to be true (which is another problem) so it never attempts to evaluate no=.
ifelse(c(T,T), 1:2)
# [1] 1 2
ifelse(c(T,F), 1:2)
# Error in ifelse(c(T, F), 1:2) : argument "no" is missing, with no default
ifelse(c(T,F), 1:2, 11:12)
# [1] 1 12
What you're attempting to do is merge/join x and y, so the tools you want are among base::merge and dplyr::*_join (for starters, others exist). To better understand what's going on in a join, I suggest you see How to join (merge) data frames (inner, outer, left, right), https://stackoverflow.com/a/6188334/3358272.
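For comparison, a minimal base R sketch of the same join-then-test idea using merge(). The y below is assumed to be already expanded to one row per variable/species pair, and the final ordering step is only there to make the result easy to read.

```r
x <- data.frame(
  variable     = c("A", "B", "C", "D"),
  combinations = c("16, 17, 18", "17,18", "16,18", "12,3"),
  stringsAsFactors = FALSE
)
# Assumed: y after na.locf() and separate_rows(), one row per pair.
y <- data.frame(
  variable = c("A", "C", "A", "C", "A", "C"),
  species  = c("16", "16", "17", "17", "18", "18"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of y and attach the matching combinations.
joined <- merge(y, x, by = "variable", all.x = TRUE)

# Split each combination on non-digits and test membership row by row.
joined$type <- mapply(
  function(sp, combo) sp %in% strsplit(combo, "\\D+")[[1]],
  joined$species, joined$combinations,
  USE.NAMES = FALSE
)

# merge() does not promise to keep the row order of y, so order explicitly.
joined <- joined[order(joined$species, joined$variable), ]
print(joined)
```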

Related

R output multiple row values that belong to minimum of group and compare the values in case_when

Dataset
Let's say I have the following dataframe:
df <- tibble(ID = c("A", "A", "A", "B", "C", "C", "D", "D", "D", "D", "E", "E", "E"),
Encounter = c(10, 11, 12, 3, 5, 50, 8, 8, 15, 20, 2, 8, 10),
Item = c("apple", "toy", "bowl", "apple", "mango", "mango", "toy", "brush", "toy", "brush", "brush", "key", "key"))
# A tibble: 13 x 3
ID Encounter Item
<chr> <dbl> <chr>
1 A 10 apple
2 A 11 toy
3 A 12 bowl
4 B 3 apple
5 C 5 mango
6 C 50 mango
7 D 8 toy
8 D 8 brush
9 D 15 toy
10 D 20 brush
11 E 2 brush
12 E 8 key
13 E 10 key
Criteria
I wish to find out if the Item in the first Encounter appears in the subsequent Encounter.
For example, in A, the Item in the first Encounter is apple, which does not appear in subsequent Encounter therefore the output should be FALSE.
For example, in C, mango does appear in subsequent Encounter, therefore the output should be TRUE
For example, in D, both toy and brush are in the first Encounter, and they both appear in a subsequent Encounter, therefore the output should be TRUE
The Item in the first Encounter should always be FALSE.
Desired output
Here is my desired output for your better understanding:
# A tibble: 13 x 4
ID Encounter Item Output
<chr> <dbl> <chr> <lgl>
1 A 10 apple FALSE
2 A 11 toy FALSE
3 A 12 bowl FALSE
4 B 3 apple FALSE
5 C 5 mango FALSE
6 C 50 mango TRUE
7 D 8 toy FALSE
8 D 8 brush FALSE
9 D 15 toy TRUE
10 D 20 brush TRUE
11 E 2 brush FALSE
12 E 8 key FALSE
13 E 10 key FALSE
My attempt
I have used dplyr::case_when()
to set the row of min Encounter to FALSE (successful)
to set Item that is NOT in the first Encounter (successful)
to set Item that IS in the first Encounter (FAILED if there are multiple Items in first Encounter)
df %>% group_by(ID) %>%
arrange(ID, Encounter) %>%
mutate(Output = case_when(Encounter == min(Encounter) ~ F,
Item %in% first(Item) ~ T,
!(Item %in% first(Item)) ~ F))
# A tibble: 13 x 4
# Groups: ID [5]
ID Encounter Item Output
<chr> <dbl> <chr> <lgl>
1 A 10 apple FALSE
2 A 11 toy FALSE
3 A 12 bowl FALSE
4 B 3 apple FALSE
5 C 5 mango FALSE
6 C 50 mango TRUE
7 D 8 toy FALSE
8 D 8 brush FALSE
9 D 15 toy TRUE
10 D 20 brush FALSE
11 E 2 brush FALSE
12 E 8 key FALSE
13 E 10 key FALSE
Ultimate question
Is there any function that acts like dplyr::first(), but able to return multiple values that can be used in the case_when() function or ifelse()?
For example in D, I don't know how to output both toy and brush so that it can be compared using %in%.
Sorry for the long question, hope someone can help!
Also, feels like my case_when() expression is not written in a smart way, please feel free to leave a comment if you have suggestions! Thanks in advance!
We may use duplicated(). The values in 'Encounter' are already sorted; if they are not, do an arrange(ID, Encounter) before the group_by()
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Output = first(Item) %in% Item[-1] & duplicated(Item)) %>%
ungroup
Output:
# A tibble: 13 × 4
ID Encounter Item Output
<chr> <dbl> <chr> <lgl>
1 A 10 apple FALSE
2 A 11 toy FALSE
3 A 12 bowl FALSE
4 B 3 apple FALSE
5 C 5 mango FALSE
6 C 50 mango TRUE
7 D 8 toy FALSE
8 D 8 brush FALSE
9 D 15 toy TRUE
10 D 20 brush TRUE
11 E 2 brush FALSE
12 E 8 key FALSE
13 E 10 key FALSE
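On the "ultimate question" of getting several first values at once: one possible sketch is to subset Item by the rows whose Encounter equals the group minimum, which returns every Item of the first Encounter rather than a single value. This is an alternative formulation, not the answer's method, and it assumes "first Encounter" means the smallest Encounter value within each ID.

```r
library(dplyr)

df <- tibble(
  ID = c("A", "A", "A", "B", "C", "C", "D", "D", "D", "D", "E", "E", "E"),
  Encounter = c(10, 11, 12, 3, 5, 50, 8, 8, 15, 20, 2, 8, 10),
  Item = c("apple", "toy", "bowl", "apple", "mango", "mango", "toy",
           "brush", "toy", "brush", "brush", "key", "key")
)

res <- df %>%
  group_by(ID) %>%
  mutate(
    # TRUE only for later encounters whose Item was seen in the first one
    Output = Encounter > min(Encounter) &
      Item %in% Item[Encounter == min(Encounter)]
  ) %>%
  ungroup()

print(res)
```

Because the comparison uses min(Encounter) directly, this version does not depend on the rows being sorted beforehand.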

Filling in non-existing rows in R + dplyr [duplicate]

This question already has answers here:
Proper idiom for adding zero count rows in tidyr/dplyr
Apologies if this is a duplicate question, I saw some questions which were similar to mine, but none exactly addressing my problem.
My data look basically like this:
FiscalWeek <- as.factor(c(45, 46, 48, 48, 48))
Group <- c("A", "A", "A", "B", "C")
Amount <- c(1, 1, 1, 5, 6)
df <- tibble(FiscalWeek, Group, Amount)
df
# A tibble: 5 x 3
FiscalWeek Group Amount
<fct> <chr> <dbl>
1 45 A 1
2 46 A 1
3 48 A 1
4 48 B 5
5 48 C 6
Note that FiscalWeek is a factor. So, when I take a weekly average by Group, I get this:
library(dplyr)
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount))
averages
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 1
2 B 5
3 C 6
But, this is actually a four-week period. Nothing at all happened in Week 47, and groups B and C didn't show data in weeks 45 and 46, but I still want averages that reflect the existence of those weeks. So I need to fill out my original data with zeroes such that this is my desired result:
DesiredGroup <- c("A", "B", "C")
DesiredAvgs <- c(0.75, 1.25, 1.5)
Desired <- tibble(DesiredGroup, DesiredAvgs)
Desired
# A tibble: 3 x 2
DesiredGroup DesiredAvgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
What is the best way to do this using dplyr?
Up front: missing data to me is very different from 0. I'm assuming that you "know" with certainty that missing data should bring all other values down.
The name FiscalWeek suggests integer-like data, but your use of factor suggests ordinal or categorical data. Because of that, you need to define authoritatively what the complete set of factor levels can be. And because your current factor does not contain all possible levels, I'll infer them (you need to adjust your all_groups_weeks accordingly):
all_groups_weeks <- tidyr::expand_grid(FiscalWeek = as.factor(45:48), Group = c("A", "B", "C"))
all_groups_weeks
# # A tibble: 12 x 2
# FiscalWeek Group
# <fct> <chr>
# 1 45 A
# 2 45 B
# 3 45 C
# 4 46 A
# 5 46 B
# 6 46 C
# 7 47 A
# 8 47 B
# 9 47 C
# 10 48 A
# 11 48 B
# 12 48 C
From here, join in the full data in order to "complete" it. Using tidyr::complete won't work because you don't have all possible values in the data (47 missing).
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
mutate(Amount = coalesce(Amount, 0))
# # A tibble: 12 x 3
# FiscalWeek Group Amount
# <fct> <chr> <dbl>
# 1 45 A 1
# 2 46 A 1
# 3 48 A 1
# 4 48 B 5
# 5 48 C 6
# 6 45 B 0
# 7 45 C 0
# 8 46 B 0
# 9 46 C 0
# 10 47 A 0
# 11 47 B 0
# 12 47 C 0
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
mutate(Amount = coalesce(Amount, 0)) %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount, na.rm = TRUE))
# # A tibble: 3 x 2
# Group Avgs
# <chr> <dbl>
# 1 A 0.75
# 2 B 1.25
# 3 C 1.5
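If FiscalWeek is first converted from a factor to an integer, tidyr::complete() can be given the full week range explicitly, which sidesteps the missing level 47. The following is a sketch under that assumption (the 45:48 range is taken from the question's statement that this is a four-week period).

```r
library(dplyr)
library(tidyr)

df <- tibble(
  FiscalWeek = as.factor(c(45, 46, 48, 48, 48)),
  Group = c("A", "A", "A", "B", "C"),
  Amount = c(1, 1, 1, 5, 6)
)

filled <- df %>%
  # factor -> character -> integer, so the labels (not level codes) are kept
  mutate(FiscalWeek = as.integer(as.character(FiscalWeek))) %>%
  # every week in the assumed range crossed with every observed Group
  complete(FiscalWeek = 45:48, Group, fill = list(Amount = 0))

averages <- filled %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount))

print(averages)
```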
You can try this. I hope this helps.
library(dplyr)
#Define range
df %>% mutate(FiscalWeek=as.numeric(as.character(FiscalWeek))) -> df
range <- length(seq(min(df$FiscalWeek),max(df$FiscalWeek),by=1))
#Aggregation
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = sum(Amount)/range)
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
You can do it without filling if you know number of weeks:
df %>%
group_by(Group) %>%
summarise(Avgs = sum(Amount) / length(45:48))

Pivot/gather multiple "crossed" values belonging to a common key

I have some strangely stored time series data. Two kinds of values, event and foo, can be observed together for different phenomena a and b. The observations are in time t and belong to different categories (those are basically different recordings).
Everything is stored as follows, in a kind of mixed wide format:
> tibble(category = c("x", "x", "y", "y"), t = c(1:2, 1:2),
event_a = c(T, T, F, F), event_b = c(T, F, T, F),
foo_a = c(1, 2, 3, 4), foo_b = c(10, 20, 30, 40))
# A tibble: 4 x 6
category t event_a event_b foo_a foo_b
<chr> <int> <lgl> <lgl> <dbl> <dbl>
1 x 1 TRUE TRUE 1 10
2 x 2 TRUE FALSE 2 20
3 y 1 FALSE TRUE 3 30
4 y 2 FALSE FALSE 4 40
Now I want to convert it to long format, with the phenomena being used to index the kind of event with a value, and the foo value being matched to them via a/b:
# A tibble: 8 x 5
category t event value foo
<chr> <dbl> <chr> <lgl> <dbl>
1 x 1 a TRUE 1
2 x 1 b TRUE 10
3 x 2 a TRUE 2
4 x 2 b FALSE 20
5 y 1 a FALSE 3
6 y 1 b TRUE 30
7 y 2 a FALSE 4
8 y 2 b FALSE 40
I'm looking for some sort of tidyr (or at least tidyverse) solution using gather/pivot_longer and friends, but couldn't come up with anything useful, since there are multiple value columns in the result. I was thinking about a join with the foo columns split off, but didn't really succeed, and I'm not really enough into SQL to know what goes wrong there...
This is a complicated way of solving the problem but it works.
The idea is to solve the multiple-columns issue in two steps: a pivot_longer for each of event_* and foo_*, followed by bind_cols on the results. Finally, remove the prefix 'event_' from the new column event. Here df1 is assumed to hold the tibble from the question.
library(tidyverse)
df1 %>%
dplyr::select(-starts_with('foo')) %>%
pivot_longer(
cols = starts_with('event'),
names_to = 'event',
values_to = 'value'
) %>%
bind_cols(
df1 %>%
dplyr::select(-starts_with('event')) %>%
pivot_longer(
cols = starts_with('foo'),
values_to = 'foo'
) %>%
dplyr::select(-category, -t, -name)
) %>%
mutate(event = sub('event_', '', event))
## A tibble: 8 x 5
# category t event value foo
# <chr> <int> <chr> <lgl> <dbl>
#1 x 1 a TRUE 1
#2 x 1 b TRUE 10
#3 x 2 a TRUE 2
#4 x 2 b FALSE 20
#5 y 1 a FALSE 3
#6 y 1 b TRUE 30
#7 y 2 a FALSE 4
#8 y 2 b FALSE 40
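A shorter route that may be worth trying is a single pivot_longer() call with the ".value" sentinel in names_to, which splits each column name at the underscore and keeps the first part as an output column. This is a sketch, not the answer's method; the names wide and phenomenon are placeholders chosen for the example, and the two rename() steps are needed only to match the column names requested in the question.

```r
library(dplyr)
library(tidyr)

wide <- tibble(
  category = c("x", "x", "y", "y"), t = c(1:2, 1:2),
  event_a = c(TRUE, TRUE, FALSE, FALSE), event_b = c(TRUE, FALSE, TRUE, FALSE),
  foo_a = c(1, 2, 3, 4), foo_b = c(10, 20, 30, 40)
)

long <- wide %>%
  pivot_longer(
    cols = -c(category, t),
    names_to = c(".value", "phenomenon"),  # "event_a" -> column event, key "a"
    names_sep = "_"
  ) %>%
  rename(value = event) %>%      # the logical column becomes `value`
  rename(event = phenomenon)     # the a/b key becomes `event`

print(long)
```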

Dynamic Columns in Dplyr using NSE on the RHS

I am attempting to reference existing columns in dplyr through a loop. Effectively, I would like to evaluate the operations from one table (evaluation in below example) to be performed to another table (dt in below example). I do not want to hardcode the column names on the RHS within mutate(). I would like to control the evaluations being performed from the evaluation table below. So I am trying to make the process dynamic.
Here is a sample dataframe:
dt = data.frame(
A = c(1:20),
B = c(11:30),
C = c(21:40),
AA = rep(1, 20),
BB = rep(2, 20)
)
Here is a table of sample operations to be performed:
evaluation = data.frame(
New_Var = c("AA", "BB"),
Operation = c("(A*2) > B", "(B*2) <= C"),
Result = c("True", "False")
) %>% mutate_all(as.character)
What I am trying to do is the following:
for (i in 1:nrow(evaluation)) {
var = evaluation$New_Var[i]
dt = dt %>%
rowwise() %>%
mutate(!!var := ifelse(eval(parse(text = evaluation$Operation[i])),
evaluation$Result[i],
!!var))
}
my desired result would be something like this except for the "AA" in the AA column would be the original numeric values of the AA column of 1, 1, 1, 1, 1.
UPDATED:
I believe my syntax in the "False" part of the ifelse statement is incorrect. What is the correct syntax to specify "!!var" in the false portion of the ifelse statement?
I know there are other ways to do it using base R, but I would rather do it through dplyr as it is cleaner code to look at. I am leveraging "rowwise()" to do it element by element.
Modified data to (a) enforce type consistency for columns AA and BB and (b) ensure that at least one row satisfies the second condition.
dt = tibble(
A = c(1:20),
B = c(10:29), ## Note the change
C = c(21:40),
AA = rep("a", 20), ## Note initialization with strings
BB = rep("b", 20) ## Ditto
)
To make your loop work, you need to convert your code strings into actual expressions. You can use rlang::sym() for variable names and rlang::parse_expr() for everything else.
for( i in 1:nrow(evaluation) )
{
var <- rlang::sym(evaluation$New_Var[i])
op <- rlang::parse_expr(evaluation$Operation[i])
dt = dt %>% rowwise() %>%
mutate(!!var := ifelse(!!op, evaluation$Result[i],!!var))
}
# # A tibble: 20 x 5
# A B C AA BB
# <int> <int> <int> <chr> <chr>
# 1 1 10 21 a False
# 2 2 11 22 a False
# 3 3 12 23 a b
# 4 4 13 24 a b
# 5 5 14 25 a b
# 6 6 15 26 a b
# 7 7 16 27 a b
# 8 8 17 28 a b
# 9 9 18 29 a b
# 10 10 19 30 True b
# 11 11 20 31 True b
# 12 12 21 32 True b
# 13 13 22 33 True b
# 14 14 23 34 True b
# 15 15 24 35 True b
# 16 16 25 36 True b
# 17 17 26 37 True b
# 18 18 27 38 True b
# 19 19 28 39 True b
# 20 20 29 40 True b
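To make the distinction between the two helpers concrete, here is a small sketch on a toy tibble (the data and the column names flag and copy are invented for the example). rlang::sym() builds a single symbol from a string, while rlang::parse_expr() parses a string into a full expression; both can then be unquoted with !! inside mutate().

```r
library(dplyr)

toy <- tibble(A = 1:3, B = c(1L, 5L, 4L))   # made-up data

col <- rlang::sym("A")                  # a symbol naming one column
op  <- rlang::parse_expr("(A*2) > B")   # a call that refers to columns

res <- toy %>%
  mutate(
    flag = !!op,    # evaluated as (A*2) > B on whole columns
    copy = !!col    # evaluated as the column A
  )

print(res)
```

Since the comparison and ifelse() are both vectorised, a sketch like this does not need rowwise() to work element by element.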
Assuming that Felipe's answer was the functionality you desired, here's a more "tidyverse"/pipe-oriented/functional approach.
Data
library(rlang)
library(dplyr)
library(purrr)
operations <- tibble(
old_var = exprs(A, B),
new_var = exprs(AA, BB),
test = exprs(2*A > B, 2*B <= C),
result = exprs("True", "False")
)
original <- tibble(
A = sample.int(30, 10),
B = sample.int(30, 10),
C = sample.int(30, 10)
)
original
# A tibble: 10 x 3
A B C
<int> <int> <int>
1 4 20 5
2 30 29 11
3 1 27 14
4 2 21 4
5 17 19 24
6 14 25 9
7 5 22 22
8 6 13 7
9 25 4 21
10 12 11 12
Functions
# Here's your reusable functions
generic_mutate <- function(dat, new_var, test, result, old_var) {
dat %>% mutate(!!new_var := ifelse(!!test, !!result, !!old_var))
}
generic_ops <- function(dat, ops) {
pmap(ops, generic_mutate, dat = dat) %>%
reduce(full_join)
}
generic_mutate takes a single original dataframe, a single new_var, etc. It performs the test, adds the new column with the appropriate name and values.
generic_ops is the "vectorized" version. It takes the original dataframe as the first argument, and a dataframe of operations as the second. It then parallel maps over each column of new variable names, tests, etc, and calls generic_mutate on each one. That results in a list of dataframes, each with one added column. The reduce then combines them back all together with a sequential full_join.
Results
original %>%
generic_ops(operations)
Joining, by = c("A", "B", "C")
# A tibble: 10 x 5
A B C AA BB
<int> <int> <int> <chr> <chr>
1 4 20 5 4 20
2 30 29 11 True 29
3 1 27 14 1 27
4 2 21 4 2 21
5 17 19 24 True 19
6 14 25 9 True 25
7 5 22 22 5 22
8 6 13 7 6 13
9 25 4 21 True False
10 12 11 12 True 11
The magic here is using exprs(...) so you can store NSE names and operations in a tibble without forcing their evaluation. I think this is a lot cleaner than storing names and operations in strings with quotation marks.
How's this:
evaluation = data.frame(
Old_Var = c('A', 'B'),
New_Var = c("AA", "BB"),
Operation = c("(A*2) > B", "(B*2) <= C"),
Result = c("True", "False")
) %>% mutate_all(as.character)
for (i in 1:nrow(evaluation)) {
old <- sym(evaluation$Old_Var[i])
new <- sym(evaluation$New_Var[i])
op <- sym(evaluation$Operation[i])
res <- sym(evaluation$Result[i])
dt <- dt %>%
mutate(!!new := ifelse(!!op, !!res, !!old))
}
EDIT: My last answer doesn't work because sym() turns the whole string into a single variable name (e.g. a variable literally named (A*2) > B) instead of an expression to evaluate. I got this to work using a mix of tidyselect and base R. You can of course follow @Brian's advice and use this solution with pmap. I honestly don't know how well this will perform though, as I think it will evaluate the ifelse once per row, and am not sure it's a vectorized operation...
dt <- tibble(
A = c(1:20),
B = c(11:30),
C = c(21:40),
AA = rep(1, 20),
BB = rep(2, 20)
)
evaluation = tibble(
Old_Var = c('A', 'B'),
New_Var = c("AA", "BB"),
Operation = c('(A*2) > B', '(B*2) <= C'),
Result = c("True", "False")
)
for (i in 1:nrow(evaluation)) {
old <- evaluation$Old_Var[i]
new <- evaluation$New_Var[i]
op <- evaluation$Operation[i]
res <- evaluation$Result[i]
dt <- dt %>%
mutate(!!sym(new) := eval(parse(text = sprintf('ifelse(%s, "%s", %s)', op, res, old))))
}
One way is to rework the conditions first, then pass them to mutate :
conds <- parse(text=evaluation$Operation) %>%
as.list() %>%
setNames(evaluation$New_Var) %>%
imap(~expr(ifelse(!!.,"True", !!sym(.y))))
conds
#> $AA
#> ifelse((A * 2) > B, "True", AA)
#>
#> $BB
#> ifelse((B * 2) <= C, "True", BB)
dt %>% mutate(!!!conds)
#> A B C AA BB
#> 1 1 11 21 1 2
#> 2 2 12 22 1 2
#> 3 3 13 23 1 2
#> 4 4 14 24 1 2
#> 5 5 15 25 1 2
#> 6 6 16 26 1 2
#> 7 7 17 27 1 2
#> 8 8 18 28 1 2
#> 9 9 19 29 1 2
#> 10 10 20 30 1 2
#> 11 11 21 31 True 2
#> 12 12 22 32 True 2
#> 13 13 23 33 True 2
#> 14 14 24 34 True 2
#> 15 15 25 35 True 2
#> 16 16 26 36 True 2
#> 17 17 27 37 True 2
#> 18 18 28 38 True 2
#> 19 19 29 39 True 2
#> 20 20 30 40 True 2

Manually calculate variance from count data for categorical ratings

I am trying to manually calculate the variance (and mean) from categorical rating count data.
Item <- c("A", "B", "C", "D")
cat1 <- c(4,12,17,NA)
cat2 <- c(NA,10,20,15)
cat3 <- c(17,5,12,6)
cat4 <- c(10,12,17,NA)
cat5 <- c(3,21,NA,16)
cat6 <- c(2,14,12,20)
cat7 <- c(7,NA,18,23)
Data <- data.frame(Item=Item, Never=cat1,Rarely=cat2,Occasionally=cat3, Sometimes=cat4,Frequently=cat5,Usually=cat6,Always=cat7,stringsAsFactors=FALSE)
Data
Item Never Rarely Occasionally Sometimes Frequently Usually Always
1 A 4 NA 17 10 3 2 7
2 B 12 10 5 12 21 14 NA
3 C 17 20 12 17 NA 12 18
4 D NA 15 6 NA 16 20 23
Each categorical rating has an equivalent numeric value (1:7). I have calculated the average numerical rating for each Item as follows:
Rating_wt <- 1:7 # Vector of weights for each frequency rating
Rating.wt.mat <- rep(Rating_wt,each=dim(Data[,2:8])[1])
Data$Avg_rating <- rowSums(Data[,2:8]*Rating.wt.mat,na.rm=TRUE)/rowSums(Data[,2:8],na.rm=TRUE)
Data
Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg_rating
1 A 4 NA 17 10 3 2 7 3.976744
2 B 12 10 5 12 21 14 NA 3.837838
3 C 17 20 12 17 NA 12 18 3.739583
4 D NA 15 6 NA 16 20 23 5.112500
I would like to also calculate the variance for each Average and store that as a new variable in Data.
I believe I need to subtract the Average for each item from each numeric rating and multiply that value by the count in each respective cell, then sum those results across rows, then divide by the total counts in each row.
But, I can't figure out how to set up the element-wise calculations to accomplish that.
Conceptually, I think it should be something like this:
Data$Rating_var <- rowSums((Numeric_Rating - Avg_rating)*Value,na.rm=TRUE)/rowSums(Data[,2:8],na.rm=TRUE)
Where Numeric_Rating corresponds to Rating_wt:
Never = 1
Rarely = 2
Occasionally = 3
Sometimes = 4
Frequently = 5
Usually = 6
Always = 7
and Value is the corresponding cell for each Numeric_Rating by Item intersection.
I'd suggest you try to reshape your dataset before you apply your calculations, as it will be easier.
library(dplyr)
library(tidyr)
Item <- c("A", "B", "C", "D")
cat1 <- c(4,12,17,NA)
cat2 <- c(NA,10,20,15)
cat3 <- c(17,5,12,6)
cat4 <- c(10,12,17,NA)
cat5 <- c(3,21,NA,16)
cat6 <- c(2,14,12,20)
cat7 <- c(7,NA,18,23)
Data <- data.frame(Item=Item, Never=cat1,Rarely=cat2,Occasionally=cat3, Sometimes=cat4,Frequently=cat5,Usually=cat6,Always=cat7,stringsAsFactors=FALSE)
Data %>%
gather(category, value, -Item) %>% # reshape dataset
mutate(Rating = recode(category, "Never"=1,"Rarely" = 2,"Occasionally" = 3,
"Sometimes" = 4,"Frequently" = 5,
"Usually" = 6,"Always" = 7)) %>% # assign rating
group_by(Item) %>% # for each item
mutate(Avg = sum(Rating*value, na.rm=T) / sum(value, na.rm=T), # calculate Avg
variance = sum((Rating - Avg)^2*value, na.rm=T) / sum(value, na.rm=T)) %>% # calculate variance as the count-weighted mean squared deviation from Avg
ungroup() %>% # forget the grouping
select(-Rating) %>% # no need the rating any more
spread(category, value) %>% # reshape back to original form
select(all_of(c(names(Data), "Avg", "variance"))) # get columns in the desired order
# # A tibble: 4 x 10
# Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg variance
# * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 4 NA 17 10 3 2 7 3.976744 2.952948
# 2 B 12 10 5 12 21 14 NA 3.837838 3.081812
# 3 C 17 20 12 17 NA 12 18 3.739583 4.671766
# 4 D NA 15 6 NA 16 20 23 5.112500 3.374844
Try to run the piped process step by step to see how it works, especially if you're not familiar with the dplyr and tidyr syntax.
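For readers who prefer to stay in base R, here is a sketch of the count-weighted mean and (population) variance computed with matrix operations. Each squared deviation from the item's mean is weighted by its count and the sum is divided by the total count. Treating a missing count as zero is an assumption made for this example, and the object names are placeholders.

```r
# Rows are items, columns are the ratings 1 (Never) to 7 (Always).
counts <- rbind(
  A = c(4, NA, 17, 10, 3, 2, 7),
  B = c(12, 10, 5, 12, 21, 14, NA),
  C = c(17, 20, 12, 17, NA, 12, 18),
  D = c(NA, 15, 6, NA, 16, 20, 23)
)
ratings <- 1:7

counts[is.na(counts)] <- 0            # assumption: a missing count adds nothing
n   <- rowSums(counts)                # total responses per item
avg <- as.vector(counts %*% ratings) / n

# dev2[i, j] = (ratings[j] - avg[i])^2
dev2 <- outer(avg, ratings, function(m, r) (r - m)^2)
wvar <- rowSums(counts * dev2) / n

print(cbind(avg = avg, variance = wvar))
```

Dividing by n gives the population variance; if a sample variance is wanted instead, the denominator would be n - 1.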

Resources