I am using ave to rank values within groups in a dataset in R. In the example, 'data' is a data.frame with the columns raw, group and others, for example:
data <- data.frame(raw = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), weight = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2))
The ranking works fine with
data$rank <- ave(data$raw, data$group, FUN = function(x) {rank(x)})
I would like to generalize this approach by applying weights. The weights are available as another column in the data.frame. The weighted ranking is a user-defined function that needs both the raw scores and the weights vector. It is available via the cNORM package, code: https://github.com/WLenhard/cNORM/blob/master/R/utilities.R
Is it possible to use ave with multiple input variables, e.g.
data$rank <- ave(x = data$raw, data$group, y = data$weight, FUN = function(x, y) {weighted.rank(x, weights = y)})
so that x and y are both subset according to the grouping variable? I guess packages like dplyr have functions for that. Is there a way to do that with base R as well, without changing the order of the rows in the original data frame?
Many thanks!
Edit: The solution from Ronak Shah perfectly solves the problem. Thanks!
You can use by as a base R option.
library(cNORM)
data$rank <- unlist(by(data, data$group, function(x) weighted.rank(x$raw, x$weight)))
In dplyr you could do:
library(dplyr)
data %>% group_by(group) %>% mutate(rank = weighted.rank(raw, weight))
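One caveat with the by approach: unlist(by(...)) returns the ranks in group order, so they line up with the rows only when the data frame is already sorted by group. A minimal base R sketch that keeps the original row order is to let ave split the row indices instead of the values. Here toy_weighted_rank is a hypothetical stand-in written for this example (it is not cNORM's weighted.rank), and the unsorted example data are made up:

```r
# Example data, deliberately NOT sorted by group
data <- data.frame(raw    = c(3, 1, 2, 6, 5, 4),
                   group  = c(1, 2, 1, 2, 1, 2),
                   weight = c(1, 2, 1, 2, 1, 2))

# Hypothetical stand-in for a weighted ranking function:
# the rank of a value is the summed weight of all values <= it
toy_weighted_rank <- function(x, weights) {
  vapply(seq_along(x), function(i) sum(weights[x <= x[i]]), numeric(1))
}

# ave() splits the row indices by group; inside FUN the indices are used
# to pull the matching subsets of BOTH raw and weight
data$rank <- ave(seq_len(nrow(data)), data$group,
                 FUN = function(i) toy_weighted_rank(data$raw[i], data$weight[i]))
```

With cNORM loaded, the same pattern should work after swapping toy_weighted_rank for weighted.rank.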
[this question is now thoroughly rewritten. I hope this would clarify things]
I have a dataset describing several tests with multiple-answer questions. Each line contains the raw answers of one participant, and the score that participant was awarded for each question. Each test has a different answer key:
df <- data.frame(id =c(1, 2, 3, 4, 5, 6, 7, 8, 9,10), # participant's id
test =c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), # id of question set
ans01=c(1, 2, 3, 4, 1, 1, 2, 3, 3, 4), # raw answers
ans02=c(2, 2,NA, 3, 4, 4, 3, 1, 1, 2),
bin01=c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0), # item scores
bin02=c(1, 1, 0, 0, 0, 1, 0, 0, 0, 0))
My problem is that the answer key is missing, and I need to recreate it from the dataset.
Currently, my solution is simple and creates a separate answer key for each test:
library(dplyr)
key_data <- df %>%
group_by(test) %>%
summarise(key01 = mean(ans01[bin01 == 1], na.rm=TRUE),
key02 = mean(ans02[bin02 == 1], na.rm=TRUE))
However, while this is fine for short tests, it does not scale to longer tests containing dozens of questions.
Also, I want to be able to do so for future sets of tests, so flexibility is needed for the number of items.
Therefore, the question is whether there is a way to do so without writing a line for each item key.
Maybe loop through all variables, or passing string vectors as variable names?
[I answered my own question with a not-very-elegant solution. I'm sure this can be achieved in a much better way]
Hi, you could use summarise_at and define the scope in the first argument.
See this example using the mtcars dataset:
library(dplyr)
selection_a<-c("mpg","cyl","vs")
selection_b<-c("mpg","cyl","vs","qsec","carb")
# Use first selection (A)
mtcars %>%
summarise_at(selection_a , ~ mean(.x, na.rm = TRUE))
# Use second Selection (B)
mtcars %>%
summarise_at(selection_b , ~ mean(.x, na.rm = TRUE))
# combine selections (A+C)
selection_c<-c("gear","carb")
mtcars %>%
summarise_at( c(selection_a,selection_c), ~ mean(.x, na.rm = TRUE))
As a side note, I'm answering my own question with a not-very-elegant solution that I'm currently using. I'm sure there are shorter and much better solutions.
The required key table can be built with a loop, as follows:
MaxItems <- 2
pad0 <- function(x, n = 2) {
n0_pad <- n - nchar(x)
return(paste0(strrep("0",n0_pad), x))
}
library(dplyr)
## create the structure for the key table
keys <- df %>%
group_by(test) %>%
summarise(a01 = mean(ans01[bin01 == 1], na.rm=TRUE))
## add items to key table
for (i in 2:MaxItems) {
keysTemp <- df %>%
rename("tmpAns" = paste0("ans",pad0(i)),
"tmpBin" = paste0("bin",pad0(i))) %>%
group_by(test) %>%
summarise(tmpKey = mean(tmpAns[tmpBin == 1], na.rm=TRUE))
colnames(keysTemp)[2] <- paste0("a",pad0(i))
keys <- keys %>%
left_join(keysTemp, by = c("test"))
}
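For comparison, here is a minimal base R sketch of the same idea without the rename and join steps. It assumes the columns follow the ansNN / binNN naming used in the question; n_items, items and key_for_test are names made up for this example:

```r
df <- data.frame(id    = 1:10,                              # participant's id
                 test  = rep(1:2, each = 5),                # id of question set
                 ans01 = c(1, 2, 3, 4, 1, 1, 2, 3, 3, 4),   # raw answers
                 ans02 = c(2, 2, NA, 3, 4, 4, 3, 1, 1, 2),
                 bin01 = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0),   # item scores
                 bin02 = c(1, 1, 0, 0, 0, 1, 0, 0, 0, 0))

n_items <- 2                                  # assumed number of items
items   <- sprintf("%02d", seq_len(n_items))  # "01", "02", ...

# Build one row of keys for the subset of rows belonging to a single test
key_for_test <- function(d) {
  key <- vapply(items, function(it) {
    answers <- d[[paste0("ans", it)]]
    correct <- d[[paste0("bin", it)]] == 1
    mean(answers[correct], na.rm = TRUE)
  }, numeric(1))
  names(key) <- paste0("key", items)
  data.frame(test = d$test[1], t(key))
}

keys <- do.call(rbind, lapply(split(df, df$test), key_for_test))
rownames(keys) <- NULL
```

Changing n_items is then the only edit needed for a longer test, as long as the naming pattern holds.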
I want to center, but not standardize, a set of variables in a data frame. I tried the code for doing that using mutate_at, but the scale function uses scale = TRUE as default, and I can't figure out how to set it to scale = FALSE. This scales the desired variables, but standardizes in addition to centering:
centdata <- mydat %>%
mutate_at(.vars = c(1, 2, 3, 4, 5, 6, 7, 8, 14),
.funs = list("scaled" = scale))
You can use purrr style formula or an anonymous function here.
library(dplyr)
cols <- c(1, 2, 3, 4, 5, 6, 7, 8, 14)
centdata <- mydat %>%
mutate_at(.vars = cols,
.funs = list("scaled" = ~scale(., scale = FALSE)))
Since mutate_at has been superseded, you can use across.
centdata <- mydat %>%
mutate(across(all_of(cols), list("scaled" = ~scale(., scale = FALSE))))
In base R -
mydat[paste0(names(mydat)[cols], '_scaled')] <- lapply(mydat[cols], scale, scale = FALSE)
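scale also works on a data frame directly. Remember to pass scale = FALSE here as well, otherwise the columns are standardized.
mydat[paste0(names(mydat)[cols], '_scaled')] <- scale(mydat[cols], scale = FALSE)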
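As a quick check of what centering without standardizing means, here is a small base R sketch that subtracts the column mean by hand. It uses the built-in mtcars data in place of mydat, and the column positions and the "_centered" suffix are arbitrary choices for this example:

```r
cols <- c(1, 3, 4)   # example column positions (mpg, disp, hp)

centered <- mtcars
# Subtract each column's mean; the spread of the values is left untouched
centered[paste0(names(mtcars)[cols], "_centered")] <-
  lapply(mtcars[cols], function(v) v - mean(v, na.rm = TRUE))

# Centered columns have mean (numerically) zero but the original standard deviation
colMeans(centered[paste0(names(mtcars)[cols], "_centered")])
```

Unlike scale, this returns plain numeric columns rather than one-column matrices with attributes.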
I have a list with several tibbles comprising several columns; and I want to remove the first 2 columns in each tibble (not using for loop).
#Example data with tibble and list
library(tibble)
w <- c(1, 2, 3, 4, 5)
x <- c(1, 2, 3, 4, 5)
y <- c(1, 2, 3, 4, 5)
z <- c(1, 2, 3, 4, 5)
tibble <- tibble(w, x, y, z)
list <- list(tibble, tibble, tibble, tibble)
#Remove the first 2 columns in the tibble
tibble1 <- subset(tibble, select=-c(1:2))
tibble1
#Tried this to remove the first two columns of each tibble in list
list1 <- sapply(list, FUN = function(x) subset(x, select=-c(1:2)) )
list1
You could replace sapply with lapply to get expected output as list
lapply(list, FUN = function(x) subset(x, select=-c(1:2)))
With sapply you need to add simplify = FALSE
sapply(list, FUN = function(x) subset(x, select=-c(1:2)), simplify = FALSE)
Another alternative is to subset with [, which is shorter and more concise.
lapply(list, `[`, -c(1, 2))
You can try:
library(purrr)
list %>% map(subset,select=-c(1:2))
Another base R method using Map:
Map(function(x){x[,-c(1:2)]}, df_list)
Data:
require("tidyverse")
w <- c(1, 2, 3, 4, 5)
x <- c(1, 2, 3, 4, 5)
y <- c(1, 2, 3, 4, 5)
z <- c(1, 2, 3, 4, 5)
tibble <- tibble(w, x, y, z)
df_list <- list(tibble, tibble, tibble, tibble)
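One thing to keep in mind if the list holds plain data frames rather than tibbles: matrix-style indexing x[, -c(1:2)] drops to a bare vector when only one column is left, while list-style indexing x[-c(1:2)] always returns a data frame. A small sketch with made-up three-column data frames (df3 and df_list3 are names invented for this example):

```r
df3 <- data.frame(w = 1:3, x = 4:6, y = 7:9)
df_list3 <- list(df3, df3)

# Matrix-style indexing: a single remaining column collapses to a vector
matrix_style <- lapply(df_list3, function(d) d[, -(1:2)])

# List-style indexing: the result stays a data frame
list_style <- lapply(df_list3, function(d) d[-(1:2)])
```

Tibbles do not drop dimensions this way, so both forms behave the same for the tibble example above.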
If you can forgive my interest in loops, I'd like to know how to loop through a vector of variable names (must be strings in my use case) and mutate the original columns. In this toy example, I want to calculate the mean of the column i plus z.
df_have <- data.frame(x=c(1, 1, 2, 3, 3),
y=c(2, 2, 3, 4, 4),
z=c(0, 1, 2, 3, 4))
for (i in c("x", "y")) {
df_test <-
df_have %>%
mutate(!!i := mean(i)+z)
}
df_want <- data.frame(x=c(2, 3, 4, 5, 6), # mean 2 + z
y=c(3, 4, 5, 6, 7), # mean 3 + z
z=c(0, 1, 2, 3, 4))
Well, if you want to do a loop, then
df_test <- df_have
for (i in c("x", "y")) {
df_test <-
df_test %>%
mutate(!!i := mean((!!as.name(i)))+z)
}
Note that you need to turn those strings into symbols in order to use them in the expression for mutate. An easier trick in this case would be
df_have %>% mutate_at(c("x","y"), ~ mean(.) + z)
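If the column names must stay plain strings, a base R loop with [[ ]] avoids the quoting machinery altogether. This is only a sketch of the same calculation, using the toy data from the question:

```r
df_have <- data.frame(x = c(1, 1, 2, 3, 3),
                      y = c(2, 2, 3, 4, 4),
                      z = c(0, 1, 2, 3, 4))

df_test <- df_have
for (i in c("x", "y")) {
  # [[i]] looks the column up by its string name
  df_test[[i]] <- mean(df_have[[i]]) + df_have$z
}
```

Reading from df_have and writing to df_test keeps each column's mean based on the original values.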
I have a data frame that looks like this. The names and number of columns will NOT be consistent (sometimes 'C' will not be present, other times 'D', 'E', 'F' may be present, etc.)
# name and number of columns varies...so need flexible process
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9 , 67, 6)
ABC <- data.frame(A, B, C)
I want to loop through each variable and collect various information. This is a simple example, but what I am doing will be more complicated. I say that so that somebody doesn't just recommend some sort of summary() type solution.
maximum_value <- max(A)
mean_value <- mean(A)
# lots of other calculations for A
ID = 'A'
tempA <- data.frame(ID, maximum_value, mean_value)
maximum_value <- max(B)
mean_value <- mean(B)
# lots of other calculations for B
ID = 'B'
tempB <- data.frame(ID, maximum_value, mean_value)
maximum_value <- max(C)
mean_value <- mean(C)
# lots of other calculations for C
ID = 'C'
tempC <- data.frame(ID, maximum_value, mean_value)
output <- rbind(tempA, tempB, tempC)
Here is my attempt at creating a loop to go through the variables one by one and aggregate output. I can't figure out how to get [i] to point at an individual column of the data frame ABC.
# initialize data frame
data__ <- data.frame(ID__ = as.character(),
max__ = as.numeric(),
mean__ = as.numeric())
# loop through A, then B, then C
for(i in A:C) {
ID__ <- '[i]'
max__ <- maximum[i]
mean__ <- mean[i]
data__temp <- (ID__, max__, mean__)
data__ <- rbind(data__, data__temp)
}
If I were doing this in SAS, I would use a select into within proc sql to create a list of the variable names, then write an array, and then I could loop through them that way, but there's something I'm missing here.
How would I tell R to do this process for each variable in the data frame?
If you use the tidyverse dplyr and tidyr packages, you can do
library(dplyr)
library(tidyr)
ABC %>% gather(ID, value) %>% group_by(ID) %>% summarize_all(list(mean = mean, max = max))
or
ABC %>% gather(ID, value) %>% group_by(ID) %>%
summarize(maximum_value = max(value), mean_value=mean(value))
If you'd rather use base functions and there are a lot of "weird" functions, you can use purrr's map2_df function
library(purrr)
map2_df(ABC, names(ABC), function(a, n) {
data.frame(ID=n, max_val=max(a), mean_val=mean(a))
})
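The same per-column loop can also be sketched in base R alone: iterate over names(ABC), build a one-row data frame per column, and bind the rows. summarise_column is a name made up for this example, and any number of further calculations could be added inside it:

```r
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9, 67, 6)
ABC <- data.frame(A, B, C)

# One row of results for the column called `id`
summarise_column <- function(id) {
  values <- ABC[[id]]
  data.frame(ID = id,
             maximum_value = max(values),
             mean_value = mean(values))
}

# Works for whatever columns happen to be present in ABC
output <- do.call(rbind, lapply(names(ABC), summarise_column))
```

Because the loop runs over names(ABC), nothing needs to change when columns are added or missing.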