R ave with multiple arguments / rank by group with weighting - r

I am using ave for ranking values within groups in a dataset in R. In the example 'data' is a data.frame with the cols raw, group and others, for example
data <- data.frame(raw = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), weight = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2))
The ranking works fine with
data$rank <- ave(data$raw, data$group, FUN = function(x) {rank(x)})
I would like to generalize this approach by applying weights. The weights are available as another col in the data.frame. The weighted ranking is a self defined function that needs both the raw scores and the weights vector. It is available via the cNORM package, code: https://github.com/WLenhard/cNORM/blob/master/R/utilities.R
Is it possible to use ave with multiple input variables, e.g.
data$rank <- ave(x = data$raw, data$group, y = data$weight, FUN = function(x, y) {weighted.rank(x, weights = y)})
so that x and y are both subset according to the grouping variable? I guess packages like dplyr have functions for that. Is there a way to do that with base R as well, without changing the order of the rows in the original data frame?
Many thanks!
Edit: The solution from Ronak Shah perfectly solves the problem. Thanks!

You can use by for a base R option.
library(cNORM)
data$rank <- unlist(by(data, data$group, function(x) weighted.rank(x$raw, x$weight)))
In dplyr you could do:
library(dplyr)
data %>% group_by(group) %>% mutate(rank = weighted.rank(raw, weight))
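One caveat on the by version: unlist(by(...)) returns the values group by group, so it lines up with the rows only when the data frame is already sorted by group. A minimal sketch of an order-preserving base R variant is to let ave split the row indices instead of the values. Note that toy_weighted_rank below is a hypothetical stand-in (not the cNORM function), defined only so the snippet runs without extra packages; the assumption is that you would swap in weighted.rank with the raw scores and weights as in the question.

```r
# Hypothetical stand-in for a weighted ranking function (NOT cNORM's weighted.rank):
# for each value, sum the weights of all values in the group that are <= it.
toy_weighted_rank <- function(x, w) {
  sapply(x, function(v) sum(w[x <= v]))
}

# Groups are deliberately interleaved to show that row order is preserved.
data <- data.frame(raw    = c(3, 10, 1, 7, 2, 9),
                   group  = c(1, 2, 1, 2, 1, 2),
                   weight = c(1, 2, 2, 1, 1, 2))

# ave() splits the row indices by group; FUN uses them to subset both columns.
data$rank <- ave(seq_len(nrow(data)), data$group, FUN = function(i) {
  toy_weighted_rank(data$raw[i], data$weight[i])
})
```

Because ave writes each group's result back into the positions it came from, the rows never need to be reordered.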

Related

R: summarise verb with varying length

[this question is now thoroughly rewritten. I hope this would clarify things]
I have a dataset describing several tests with multiple-answer questions. Each line contains the raw answers of one participant, and the score that participant was awarded for each question. Each test has a different answer key:
df <- data.frame(id =c(1, 2, 3, 4, 5, 6, 7, 8, 9,10), # participant's id
test =c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), # id of question set
ans01=c(1, 2, 3, 4, 1, 1, 2, 3, 3, 4), # raw answers
ans02=c(2, 2,NA, 3, 4, 4, 3, 1, 1, 2),
bin01=c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0), # item scores
bin02=c(1, 1, 0, 0, 0, 1, 0, 0, 0, 0))
My problem is that the answer key is missing, and I need to recreate it from the dataset.
Currently, my solution is simple and creates a separate answer key for each test:
library(dplyr)
key_data <- df %>%
group_by(test) %>%
summarise(key01 = mean(ans01[bin01 == 1], na.rm=TRUE),
key02 = mean(ans02[bin02 == 1], na.rm=TRUE))
However, while this is fine for short tests, it does not scale to longer tests containing dozens of questions.
Also, I want to be able to do so for future sets of tests, so flexibility is needed for the number of items.
Therefore, the question is whether there is a way to do so without writing a line for each item key.
Maybe loop through all variables, or passing string vectors as variable names?
[I answered my own question with a not-very-elegant solution. I'm sure this can be achieved in a much better way]
Hi, you could use summarise_at and define the scope (the columns to summarise) in the first argument.
See this example using the mtcars dataset:
library(dplyr)
selection_a<-c("mpg","cyl","vs")
selection_b<-c("mpg","cyl","vs","qsec","carb")
# Use first selection (A)
mtcars %>%
summarise_at(selection_a , ~ mean(.x, na.rm = TRUE))
# Use second Selection (B)
mtcars %>%
summarise_at(selection_b , ~ mean(.x, na.rm = TRUE))
# combine selections (A+C)
selection_c<-c("gear","carb")
mtcars %>%
summarise_at( c(selection_a,selection_c), ~ mean(.x, na.rm = TRUE))
As a side note, I'm answering my own question with a not-very-elegant solution that I'm currently using. I'm sure there are shorter and much better solutions.
The required key table can be built with a loop, as follows:
MaxItems <- 2
pad0 <- function(x, n = 2) {
n0_pad <- n - nchar(x)
return(paste0(strrep("0",n0_pad), x))
}
library(dplyr)
## create the structure for the key table
keys <- df %>%
group_by(test) %>%
summarise(a01 = mean(ans01[bin01 == 1], na.rm=TRUE))
## add items to key table
for (i in 2:MaxItems) {
keysTemp <- df %>%
rename("tmpAns" = paste0("ans",pad0(i)),
"tmpBin" = paste0("bin",pad0(i))) %>%
group_by(test) %>%
summarise(tmpKey = mean(tmpAns[tmpBin == 1], na.rm=TRUE))
colnames(keysTemp)[2] <- paste0("a",pad0(i))
keys <- keys %>%
left_join(keysTemp, by = c("test"))
}
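For comparison, here is a minimal base R sketch of the same idea that finds the items from the column names instead of hard-coding MaxItems. It assumes every ansNN column has a matching binNN column, as in the example data; the names items, key_one_test and keys are just illustrative.

```r
df <- data.frame(id   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),  # participant's id
                 test = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),   # id of question set
                 ans01 = c(1, 2, 3, 4, 1, 1, 2, 3, 3, 4),  # raw answers
                 ans02 = c(2, 2, NA, 3, 4, 4, 3, 1, 1, 2),
                 bin01 = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0),  # item scores
                 bin02 = c(1, 1, 0, 0, 0, 1, 0, 0, 0, 0))

# Item suffixes ("01", "02", ...) taken from the answer columns.
items <- sub("^ans", "", grep("^ans", names(df), value = TRUE))

# For one test: mean of the answers that were scored as correct, per item.
key_one_test <- function(d) {
  vapply(items, function(it) {
    mean(d[[paste0("ans", it)]][d[[paste0("bin", it)]] == 1], na.rm = TRUE)
  }, numeric(1))
}

# One row per test, one column per item.
keys <- do.call(rbind, lapply(split(df, df$test), key_one_test))
colnames(keys) <- paste0("key", items)
keys <- data.frame(test = as.numeric(rownames(keys)), keys, row.names = NULL)
```

Adding more items only adds more ansNN/binNN columns to df; the code itself stays the same.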

how to pass options to a function using dplyr mutate_at

I want to center, but not standardize, a set of variables in a data frame. I tried doing that with mutate_at, but the scale function uses scale = TRUE as default, and I can't figure out how to set it to scale = FALSE. This centers the desired variables, but also standardizes them:
centdata <- mydat %>%
mutate_at(.vars = c(1, 2, 3, 4, 5, 6, 7, 8, 14),
.funs = list("scaled" = scale))
You can use purrr style formula or an anonymous function here.
library(dplyr)
cols <- c(1, 2, 3, 4, 5, 6, 7, 8, 14)
centdata <- mydat %>%
mutate_at(.vars = cols,
.funs = list("scaled" = ~scale(., scale = FALSE)))
Since mutate_at has been superseded, you can use across.
centdata <- mydat %>%
mutate(across(all_of(cols), list("scaled" = ~scale(., scale = FALSE))))
In base R -
mydat[paste0(names(mydat)[cols], '_scaled')] <- lapply(mydat[cols], scale, scale = FALSE)
scale also works on a data frame directly.
mydat[paste0(names(mydat)[cols], '_scaled')] <- scale(mydat[cols], scale = FALSE)
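To check that scale = FALSE really only centers, here is a small self-contained sketch. The toy mydat and its two columns are made up for illustration, and as.numeric is used so the new columns are plain vectors rather than one-column matrices.

```r
# Made-up example data; the real mydat would have more columns.
mydat <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 60))
cols <- c(1, 2)

# Center each selected column without dividing by its standard deviation.
centered <- lapply(mydat[cols], function(v) as.numeric(scale(v, scale = FALSE)))
mydat[paste0(names(mydat)[cols], "_scaled")] <- centered

# Centered columns have mean 0 but keep the original spread.
colMeans(mydat[c("a_scaled", "b_scaled")])
sd(mydat$b_scaled) == sd(mydat$b)
```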

Remove first columns of every tibble in list in R [duplicate]

I have a list with several tibbles comprising several columns; and I want to remove the first 2 columns in each tibble (not using for loop).
library(tibble)
#Example data with tibble and list
w <- c(1, 2, 3, 4, 5)
x <- c(1, 2, 3, 4, 5)
y <- c(1, 2, 3, 4, 5)
z <- c(1, 2, 3, 4, 5)
tibble <- tibble(w, x, y, z)
list <- list(tibble, tibble, tibble, tibble)
#Remove the first 2 columns in the tibble
tibble1 <- subset(tibble, select=-c(1:2))
tibble1
#Tried this to remove the first two columns of each tibble in list
list1 <- sapply(list, FUN = function(x) subset(x, select=-c(1:2)) )
list1
You could replace sapply with lapply to get expected output as list
lapply(list, FUN = function(x) subset(x, select=-c(1:2)))
With sapply you need to add simplify = FALSE
sapply(list, FUN = function(x) subset(x, select=-c(1:2)), simplify = FALSE)
Another alternative is to subset with [ directly, which is shorter.
lapply(list, `[`, -c(1, 2))
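As a quick check of the lapply approach, here is a minimal sketch with plain data frames (used instead of tibbles only so it runs without extra packages; the names df, df_list and trimmed are illustrative). lapply always returns a list, whereas sapply tries to simplify its result by default, which is why it did not give a list of tibbles back.

```r
df <- data.frame(w = 1:5, x = 1:5, y = 1:5, z = 1:5)
df_list <- list(df, df, df, df)

# Drop the first two columns of every element; the result is still a list.
trimmed <- lapply(df_list, function(d) d[-(1:2)])

# Each element now only has the columns y and z.
names(trimmed[[1]])
```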
You can try:
library(purrr)
list %>% map(subset,select=-c(1:2))
Another base R method using Map:
Map(function(x){x[,-c(1:2)]}, df_list)
Data:
require("tidyverse")
w <- c(1, 2, 3, 4, 5)
x <- c(1, 2, 3, 4, 5)
y <- c(1, 2, 3, 4, 5)
z <- c(1, 2, 3, 4, 5)
tibble <- tibble(w, x, y, z)
df_list <- list(tibble, tibble, tibble, tibble)

How to loop through vector of column names, mutate each column with assignment back to column and function referencing loop index

If you can forgive my interest in loops, I'd like to know how to loop through a vector of variable names (must be strings in my use case) and mutate the original columns. In this toy example, I want to calculate the mean of the column i plus z.
df_have <- data.frame(x=c(1, 1, 2, 3, 3),
y=c(2, 2, 3, 4, 4),
z=c(0, 1, 2, 3, 4))
for (i in c("x", "y")) {
df_test <-
df_have %>%
mutate(!!i := mean(i)+z)
}
df_want <- data.frame(x=c(2, 3, 4, 5, 6), # mean 2 + z
y=c(3, 4, 5, 6, 7), # mean 3 + z
z=c(0, 1, 2, 3, 4))
Well, if you want to do a loop, then
df_test <- df_have
for (i in c("x", "y")) {
df_test <-
df_test %>%
mutate(!!i := mean((!!as.name(i)))+z)
}
Note that you need to turn those strings into symbols in order to use them in the expression for mutate. An easier trick in this case would be
df_have %>% mutate_at(c("x","y"), ~ mean(.) + z)
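If the loop over string names is the main requirement, a base R sketch avoids the quoting issue entirely, because [[ accepts a string directly. This is one possible way to write it, not the only one:

```r
df_have <- data.frame(x = c(1, 1, 2, 3, 3),
                      y = c(2, 2, 3, 4, 4),
                      z = c(0, 1, 2, 3, 4))

df_test <- df_have
for (i in c("x", "y")) {
  # Mean of the original column plus z, written back to the same column.
  df_test[[i]] <- mean(df_have[[i]]) + df_have$z
}
```

Reading from df_have and writing to df_test keeps each iteration independent of the previous one.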

Loop through each variable and collect output R

I have a data frame that looks like this. The names and number of columns will NOT be consistent (sometimes 'C' will not be present, other times 'D', 'E', 'F' may be present, etc.)
# name and number of columns varies...so need flexible process
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9 , 67, 6)
ABC <- data.frame(A, B, C)
I want to loop through each variable and collect various information. This is a simple example, but what I am doing will be more complicated. I say that so that somebody doesn't just recommend some sort of summary() type solution.
maximum_value <- max(A)
mean_value <- mean(A)
# lots of other calculations for A
ID = 'A'
tempA <- data.frame(ID, maximum_value, mean_value)
maximum_value <- max(B)
mean_value <- mean(B)
# lots of other calculations for B
ID = 'B'
tempB <- data.frame(ID, maximum_value, mean_value)
maximum_value <- max(C)
mean_value <- mean(C)
# lots of other calculations for C
ID = 'C'
tempC <- data.frame(ID, maximum_value, mean_value)
output <- rbind(tempA, tempB, tempC)
Here is my attempt at creating a loop to go through the variables one by one and aggregate output. I can't figure out how to get [i] to point at an individual column of the data frame ABC.
# initialize data frame
data__ <- data.frame(ID__ = as.character(),
max__ = as.numeric(),
mean__ = as.numeric())
# loop through A, then B, then C
for(i in A:C) {
ID__ <- '[i]'
max__ <- maximum[i]
mean__ <- mean[i]
data__temp <- (ID__, max__, mean__)
data__ <- rbind(data__, data__temp)
}
If I were doing this in SAS, I would use a select into within proc sql to create a list of the variable names, then write an array, and then I could loop through them that way, but there's something I'm missing here.
How would I tell R to do this process for each variable in the data frame?
If you use the tidyverse dplyr and tidyr packages, you can do
library(dplyr)
library(tidyr)
ABC %>% gather(ID, value) %>% group_by(ID) %>% summarize_all(list(mean = mean, max = max))
or
ABC %>% gather(ID, value) %>% group_by(ID) %>%
summarize(maximum_value = max(value), mean_value=mean(value))
If you'd rather call base functions on each column, and there are a lot of "weird" functions, you can use purrr's map2_df function
library(purrr)
map2_df(ABC, names(ABC), function(a, n) {
data.frame(ID=n, max_val=max(a), mean_val=mean(a))
})
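The loop from the question can also be sketched in base R only: iterate over names(ABC), build one row per column, and bind the rows at the end. The helper name summarise_column is made up for this example; the "lots of other calculations" would go inside it.

```r
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9, 67, 6)
ABC <- data.frame(A, B, C)

# Build a one-row data frame for the column called `name`.
summarise_column <- function(name) {
  values <- ABC[[name]]
  data.frame(ID = name, maximum_value = max(values), mean_value = mean(values))
}

# Works for any number of columns, whatever they are called.
output <- do.call(rbind, lapply(names(ABC), summarise_column))
```

Collecting the rows in a list and binding once avoids growing a data frame inside the loop.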

Resources