Calculating percentages by group in R

My data set looks like this:
The data come from student course evaluations. The columns include categorical data for courses and numerical scores for various rubric criteria. I am trying to use R to calculate, by course, the percentage of values equal to or greater than 3 for all the columns. I can't figure out a straightforward way that is faster than doing it manually.
Thank you
Fernando

The tidyverse packages are well suited for this kind of task.
library(tidyverse)
First let's create some dummy data.
df <- tibble(`1..Course` = rep(LETTERS[1:3], each = 5),
             col1 = sample(c(NA, 1:5), 15, replace = TRUE),
             col2 = sample(c(NA, 1:5), 15, replace = TRUE),
             col3 = sample(c(NA, 1:5), 15, replace = TRUE))
Now, for each column we want to check which values are > 3 (use >= 3 instead if a score of exactly 3 should count, as the question asks):
df$col1 > 3
[1] FALSE NA TRUE NA FALSE NA NA FALSE NA FALSE FALSE TRUE FALSE NA FALSE
So we get a logical vector, which is automatically converted to numbers (0 and 1) if we take its sum. Computing a proportion is therefore just taking the mean!
But there are missing values, so we explicitly ignore them:
mean(df$col1 > 3, na.rm = TRUE)
[1] 0.2222222
So we know how to do it for a whole column, now we can use the functions from the tidyverse to do it by course:
df %>%
  group_by(`1..Course`) %>%
  summarize(prop_col1 = mean(col1 > 3, na.rm = TRUE),
            prop_col2 = mean(col2 > 3, na.rm = TRUE),
            prop_col3 = mean(col3 > 3, na.rm = TRUE))
# A tibble: 3 x 4
# `1..Course` prop_col1 prop_col2 prop_col3
# <chr> <dbl> <dbl> <dbl>
#1 A 0.333 0.2 0.5
#2 B 0 0.75 0.2
#3 C 0.25 0 0.25
And it's done.
You may want to do this for every criterion without typing each one. To do so, treat the criterion name as a variable and convert your data.frame to long format. Then the same code applies.
df %>%
  pivot_longer(-`1..Course`, names_to = "criterium") %>%
  group_by(`1..Course`, criterium) %>%
  summarize(prop_value = mean(value > 3, na.rm = TRUE))
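If you would rather keep the wide layout (one column per criterion), a possible alternative to reshaping is dplyr's across(). This is only a sketch: the data below are made-up stand-ins for the dummy data above, and the prop_ prefix is an arbitrary naming choice.

```r
library(dplyr)

# Hypothetical data with the same shape as the dummy data in the answer
set.seed(1)
scores <- data.frame(
  `1..Course` = rep(LETTERS[1:3], each = 5),
  col1 = sample(c(NA, 1:5), 15, replace = TRUE),
  col2 = sample(c(NA, 1:5), 15, replace = TRUE),
  col3 = sample(c(NA, 1:5), 15, replace = TRUE),
  check.names = FALSE
)

# Apply the same proportion to every non-grouping column at once.
# Swap `> 3` for `>= 3` if a score of exactly 3 should count.
props <- scores %>%
  group_by(`1..Course`) %>%
  summarize(across(everything(),
                   ~ mean(.x > 3, na.rm = TRUE),
                   .names = "prop_{.col}"))

props
```

Multiply the result by 100 if you need percentages rather than proportions.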

Related

R creating a function using plyr revalue with multiple inputs

I am new to R and just learning the ropes so thanks in advance for any assistance you can provide.
I have a dataset that I am cleaning as a class project.
I have several sets of categorical data that I want to turn into specific numeric values.
I am repeating the same code format for different columns that I think would make a good function.
I would like to turn this:
# plyr using revalue
df$Area <- revalue(x = df$Area,
replace = c("rural" = 1,
"suburban" = 2,
"urban" = 3))
df$Area <- as.numeric(df$Area)
into this:
reval_3 <- function(data, columnX,
value1, num_val1,
value2, num_val2,
value3, num_val3) {
# plyr using revalue
data$columnX <- revalue(x = data$columnX,
replace = c(value1 = num_val1,
value2 = num_val2,
value3 = num_val3))
# set as numeric
data$columnX <- as.numeric(data$columnX)
# return dataset
return(data)
}
I get the following error:
The following `from` values were not present in `x`: value1, value2, value3
Error: Assigned data `as.numeric(data$columnX)` must be compatible with existing data.
x Existing data has 10000 rows.
x Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: Unknown or uninitialised column: `columnX`.
I've also tried it with a single argument value1, where value1 <- c("rural" = 1, "suburban" = 2, "urban" = 3).
I know I can simply convert the data with:
df$Area <- as.numeric(as.factor(df$Area))
but I want specific values for each choice rather than letting R choose them.
Any assistance appreciated.

As already mentioned by @MartinGal in his comment, plyr is retired and the package authors themselves recommend using dplyr instead. See https://github.com/hadley/plyr.
Hence, one option to achieve your desired result is to use dplyr::recode. Additionally, if you want to write your own function, I would suggest passing the values to recode and their replacements as vectors instead of passing each value and replacement as a separate argument:
library(dplyr)
set.seed(42)
df <- data.frame(
Area = sample(c("rural", "suburban", "urban"), 10, replace = TRUE)
)
recode_table <- c("rural" = 1, "suburban" = 2, "urban" = 3)
recode(df$Area, !!!recode_table)
#> [1] 1 1 1 1 2 2 2 1 3 3
reval_3 <- function(data, x, values, replacements) {
recode_table <- setNames(replacements, values)
data[[x]] <- recode(data[[x]], !!!recode_table)
data
}
df <- reval_3(df, "Area", c("rural", "suburban", "urban"), 1:3)
df
#> Area
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 2
#> 6 2
#> 7 2
#> 8 1
#> 9 3
#> 10 3
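For context, the original function appears to fail for two reasons that match the error output: data$columnX looks for a column literally named columnX, and c(value1 = num_val1) creates an element literally named value1. A minimal base R sketch that avoids both by using [[ ]] and a named lookup vector is shown below; the function name reval_lookup and the sample data are hypothetical.

```r
# Hypothetical helper: recode one column through a named lookup vector
reval_lookup <- function(data, column, lookup) {
  # [[ ]] accepts the column name as a string, unlike $
  data[[column]] <- unname(lookup[as.character(data[[column]])])
  data
}

areas <- data.frame(Area = c("rural", "urban", "suburban", "rural"),
                    stringsAsFactors = FALSE)

areas <- reval_lookup(areas, "Area", c(rural = 1, suburban = 2, urban = 3))
areas$Area
#> [1] 1 3 2 1
```

Values that are missing from the lookup vector become NA, so it may be worth checking for unexpected NAs afterwards.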

You can use case_when with across.
If the columns that you want to change are called col1, col2 you can do -
library(dplyr)
df <- df %>%
mutate(across(c(col1, col2), ~case_when(. == 'rural' ~ 1,
. == 'suburban' ~ 2,
. == 'urban' ~ 3)))
Based on your actual column names you can also pass starts_with, ends_with, range of columns A:Z in across.

Count unique rows irrespective of column order

EDIT
My question was badly asked. Therefore I re-edited it in order to make it hopefully more useful for others. It has an answer already.
sample data.frame:
set.seed(10)
df <- data.frame(a = sample(1:3, 30, rep=T), b = sample(1:3, 30, rep = T), c = sample(1:3, 30, rep = T))
My question:
I have several columns (in my example a,b,c). Now, slightly similar, but different to this question asked by R-user, I would like to count the possible 'value sets' of in this case three columns (but in general: n columns), irrespective of their order.
count(df,a,b,c) from dplyr does not help:
require (dplyr)
count(df,a,b,c)
# A tibble: 17 x 4
a b c n
<int> <int> <int> <int>
1 1 1 1 1
2 1 1 2 2
...
7 2 1 1 4
...
In this example, row 2 and 7 contain the same set of values (1,1,2), and that's not what I want, because I do not care about the order of the values within the set, so '1,1,2' and '2,1,1' should be considered the same. How to count those value sets?
EDIT 2
The neat trick in @Mouad_S's answer is to first sort the values within each row with apply(), then transpose the result with t() (apply() returns each sorted row as a column), and then use count on the resulting columns.

require(dplyr)
set.seed(10)
df <- data.frame(a = sample(1:3, 30, rep=T),
b = sample(1:3, 30, rep = T),
c = sample(1:3, 30, rep = T))
## the old answer
require(dplyr)
count(data.frame(t(apply(df, 1, function(x) sort(x)))), X1, X2, X3)
## the new answer
t(apply(df,1, function(x) sort(x))) %>% # sorting the values of each row
as.data.frame() %>% # turning the resulting matrix into a data frame
distinct() %>% # taking the unique values
nrow() # counting them
[1] 9
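If you need the count of each value set rather than only the number of distinct sets, one base R option is to collapse every sorted row into a single key and tabulate the keys. This is a sketch using the sample data from the question; the names key and set_counts are arbitrary.

```r
set.seed(10)
df <- data.frame(a = sample(1:3, 30, replace = TRUE),
                 b = sample(1:3, 30, replace = TRUE),
                 c = sample(1:3, 30, replace = TRUE))

# Sort the values within each row and paste them into one key per row,
# so that "1,1,2" and "2,1,1" end up with the same key
key <- apply(df, 1, function(x) paste(sort(x), collapse = ","))

# Count how often each order-independent value set occurs
set_counts <- table(key)
set_counts
```

This works for any number of columns, because the key is built from the whole row.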

Select column name based on data frame content R

I want to build a matrix or data frame by choosing the names of the columns where the element in the data frame does not contain an NA. For example, suppose I have:
zz <- data.frame(a = c(1, NA, 3, 5),
b = c(NA, 5, 4, NA),
c = c(5, 6, NA, 8))
which gives:
a b c
1 1 NA 5
2 NA 5 6
3 3 4 NA
4 5 NA 8
I want to recognize each NA and build a new matrix or df that looks like:
a c
b c
a b
a c
There will be the same number of NAs in each row of the input matrix/df. I can't seem to get the right code to do this. Suggestions appreciated!

library(dplyr)
library(tidyr)
zz %>%
mutate(k = row_number()) %>%
gather(column, value, a, b, c) %>%
filter(!is.na(value)) %>%
group_by(k) %>%
summarise(temp_var = paste(column, collapse = " ")) %>%
separate(temp_var, into = c("var1", "var2"))
# A tibble: 4 × 3
k var1 var2
* <int> <chr> <chr>
1 1 a c
2 2 b c
3 3 a b
4 4 a c

Here's a possible vectorized base R approach
indx <- which(!is.na(zz), arr.ind = TRUE)
matrix(names(zz)[indx[order(indx[, "row"]), "col"]], ncol = 2, byrow = TRUE)
# [,1] [,2]
#[1,] "a" "c"
#[2,] "b" "c"
#[3,] "a" "b"
#[4,] "a" "c"
This finds non-NA indices, sorts by rows order and then subsets the names of your zz data set according to the sorted index. You can wrap it into as.data.frame if you prefer it over a matrix.

EDIT: this version transposes the data frame once before processing, so it no longer needs to transpose twice inside the loop as the first version did.
cols <- names(zz)
for (column in cols) {
zz[[column]] <- ifelse(is.na(zz[[column]]), NA, column)
}
t_zz <- t(zz)
cols <- vector("list", length = ncol(t_zz))
for (i in 1:ncol(t_zz)) {
cols[[i]] <- na.omit(t_zz[, i])
}
new_dt <- as.data.frame(t(do.call("cbind", cols)))
The tricky part here is that your goal actually changes the structure of the data frame, so "removing the NAs in each row" means building a new data frame row by row, since the values in each row can come from different columns of the original data frame.
In the first version, zz[1, ] is a one-row data frame; t converts it into a vector so that na.omit can be used, and the result is then transposed back into a row.
I used two for loops, but for loops are not necessarily bad in R. The first one is vectorized within each column. The second one needs to be done row by row anyway.
EDIT: growing objects is very bad for performance in R. I knew I could use rbindlist from data.table, which takes a list of data frames, but the OP doesn't want new packages. My first attempt just used rbind, which cannot take a list as input. Later I found that an alternative is do.call. It's still slower than rbindlist, though.
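The row-by-row idea can also be written without explicit loops. The sketch below uses apply() on the question's zz and relies on the question's guarantee that every row has the same number of NAs (otherwise apply() would return a list instead of a matrix); the name non_na_names is arbitrary.

```r
zz <- data.frame(a = c(1, NA, 3, 5),
                 b = c(NA, 5, 4, NA),
                 c = c(5, 6, NA, 8))

# For each row, keep the names of the columns that are not NA.
# apply() returns one column per row, so transpose the result.
non_na_names <- t(apply(zz, 1, function(r) names(zz)[!is.na(r)]))
non_na_names
```

Wrap the result in as.data.frame() if a data frame is preferred over a matrix.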

Gather duplicate column sets into single columns

The problem of gathering multiple sets of columns was already addressed here: Gather multiple sets of columns, but in my case, the columns are not unique.
I have the following data:
input <- data.frame(
id = 1:2,
question = c("a", "b"),
points = 0,
max_points = c(3, 5),
question = c("c", "d"),
points = c(0, 20),
max_points = c(5, 20),
check.names = F,
stringsAsFactors = F
)
input
#> id question points max_points question points max_points
#> 1 1 a 0 3 c 0 5
#> 2 2 b 0 5 d 20 20
The first column is an id, then I have many repeated columns (the original dataset has 133 columns):
identifier for question
points given
maximum points
I would like to end up with this structure:
expected <- data.frame(
id = c(1, 2, 1, 2),
question = letters[1:4],
points = c(0, 0, 0, 20),
max_points = c(3, 5, 5, 20),
stringsAsFactors = F
)
expected
#> id question points max_points
#> 1 1 a 0 3
#> 2 2 b 0 5
#> 3 1 c 0 5
#> 4 2 d 20 20
I have tried several things:
tidyr::gather(input, key, val, -id)
reshape2::melt(input, id.vars = "id")
Both do not deliver the desired output. Furthermore, with more columns than shown here, gather doesn't work any more, because there are too many duplicate columns.
As a workaround I tried this:
# add numbers to make col headers "unique"
names(input) <- c("id", paste0(1:(length(names(input)) - 1), names(input)[-1]))
# gather, remove number, spread
input %>%
gather(key, val, -id) %>%
mutate(key = stringr::str_replace_all(key, "[:digit:]", "")) %>%
spread(key, val)
which gives an error: Duplicate identifiers for rows (3, 9), (4, 10), (1, 7), (2, 8)
This problem was already discussed here: Unexpected behavior with tidyr, but I don't know why/how I should add another identifier. Most likely this is not the main problem, because I probably should approach the whole thing differently.
How could I solve my problem, preferably with tidyr or base? I don't know how to use data.table, but in case there is a simple solution, I will settle for that too.

Try this:
do.call(rbind,
lapply(seq(2, ncol(input), 3), function(i){
input[, c(1, i:(i + 2))]
})
)
# id question points max_points
# 1 1 a 0 3
# 2 2 b 0 5
# 3 1 c 0 5
# 4 2 d 20 20

The idiomatic way to do this in data.table is pretty simple:
library(data.table)
setDT(input)
res = melt(
input,
id = "id",
meas = patterns("question", "^points$", "max_points"),
value.name = c("question", "points", "max_points")
)
id variable question points max_points
1: 1 1 a 0 3
2: 2 1 b 0 5
3: 1 2 c 0 5
4: 2 2 d 20 20
You get the extra column called "variable", but you can get rid of it with res[, variable := NULL] afterwards if desired.

Another way to accomplish the same goal without using lapply:
We start by grabbing all the columns for question, max_points, and points then we melt each one individually and cbind them all together.
library(reshape2)
questions <- input[,c(1,c(1:length(names(input)))[names(input)=="question"])]
points <- input[,c(1,c(1:length(names(input)))[names(input)=="points"])]
max_points <- input[,c(1,c(1:length(names(input)))[names(input)=="max_points"])]
questions_m <- melt(questions,id.vars=c("id"),value.name = "questions")[,c(1,3)]
points_m <- melt(points,id.vars=c("id"),value.name = "points")[,3,drop=FALSE]
max_points_m <- melt(max_points,id.vars=c("id"),value.name = "max_points")[,3, drop=FALSE]
res <- cbind(questions_m,points_m, max_points_m)
res
id questions points max_points
1 1 a 0 3
2 2 b 0 5
3 1 c 0 5
4 2 d 20 20

You might need to clarify how you want the ID column to be handled, but perhaps something like this?
runme <- function(word , dat){
grep( paste0("^" , word , "$") , names(dat))
}
l <- mapply( runme , unique(names(input)) , list(input) )
l2 <- as.data.frame(l)
output <- data.frame()
for (i in 1:nrow(l2)) output <- rbind( output , input[, as.numeric(l2[i,]) ])
I'm not sure how robust it is when columns are repeated different numbers of times, but it works for your test data and should work if your columns are all repeated the same number of times.
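Since the question asks for a tidyr solution, here is one possible sketch. It assumes that every set consists of exactly three columns in the same order, and it first makes the names unique by appending a set number. The "__" separator and the column name set are arbitrary choices made for this example.

```r
library(tidyr)

input <- data.frame(
  id = 1:2,
  question = c("a", "b"), points = 0, max_points = c(3, 5),
  question = c("c", "d"), points = c(0, 20), max_points = c(5, 20),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Append the set number to each repeated name, e.g. "points__1", "points__2"
n_sets <- (ncol(input) - 1) / 3
names(input)[-1] <- paste(names(input)[-1],
                          rep(seq_len(n_sets), each = 3),
                          sep = "__")

# ".value" tells pivot_longer to use the first part of each name as the
# output column and the second part as the set identifier
long <- pivot_longer(input, -id,
                     names_to = c(".value", "set"),
                     names_sep = "__")
long
```

The rows come out ordered by id rather than by set; drop the set column afterwards if it is not needed.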

Boxplot of pre-aggregated/grouped data in R

In R I want to create a boxplot over count data instead of raw data. So my table schema looks like
Value | Count
1 | 2
2 | 1
...
Instead of
Value
1
1
2
...
Where in the second case I could simply do boxplot(x)

I'm sure there's a way to do what you want with the already summarized data, but if not, you can abuse the fact that rep takes vectors:
> dat <- data.frame(Value = 1:5, Count = sample.int(5))
> dat
Value Count
1 1 1
2 2 3
3 3 4
4 4 2
5 5 5
> rep(dat$Value, dat$Count)
[1] 1 2 2 2 3 3 3 3 4 4 5 5 5 5 5
Simply wrap boxplot around that and you should get what you want. I'm sure there's a more efficient / better way to do that, but this should work for you.
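Put together, the expansion step and the plot might look like the sketch below. The counts are written out by hand here (copied from the table above) instead of drawn with sample.int, so that the example is reproducible.

```r
# Pre-aggregated data: each Value occurs Count times
dat <- data.frame(Value = 1:5, Count = c(1, 3, 4, 2, 5))

# rep() is vectorised over `times`, so this rebuilds the raw observations
expanded <- rep(dat$Value, dat$Count)

# Draw the boxplot from the reconstructed raw data
boxplot(expanded)
```

Note that this builds a vector whose length is the sum of the counts, which may be impractical for very large counts.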

I solved a similar issue recently by using the 'apply' function on each column of counts with the 'rep' function:
> datablock <- apply(countblock[-1], 2, function(x){rep(countblock$value, x)})
> boxplot(datablock)
...The above assumes that your values are in the first column and subsequent columns contain count data.

A combination of rep and data.frame can be used as an approach if another variable is needed for classification
E.g.
with(data.frame(v1 = rep(data$v1, data$count),
                v2 = rep(data$v2, data$count)),
     boxplot(v1 ~ v2)
)

Toy data:
(besides Value and Count, I add a categorical variable Group)
set.seed(12345)
df <- data.frame(Value = sample(1:100, 100, replace = T),
Count = sample(1:10, 100, replace = T),
Group = sample(c("A", "B", "C"), 100, replace = T),
stringsAsFactors = F)
Use purrr::pmap and purrr::reduce to manipulate the data frame:
library(purrr)
data <- pmap(df, function(Value, Count, Group){
data.frame(x = rep(Value, Count),
y = rep(Group, Count))
}) %>% reduce(rbind)
boxplot(x ~ y, data = data)
