Using a custom summary function for factors within multiple columns

I conducted a survey with a large number of items, each of which has distinct categorical response options stored as factors. I need to summarize these columns efficiently, preferably with functionality like that provided by forcats::fct_count(). I also need to know how many non-NA responses were provided for each variable, since different items were shown to different respondents. I wrote a function to make a tidy little summary data frame, but am struggling to efficiently run this function along each column and then combine the results into a single object (à la plyr::ddply()).
I've tried sapply(), and gather()-ing the data to long format and then running ddply(), but the distinct levels for each variable keep getting in the way. See below for a reproducible example of the data set and my summarizing function. I could run the function for each variable (as shown below), but I know there's gotta be a more efficient way to do this that doesn't involve creating a ton of individual summary data-frame objects. Thanks for any help you can provide.
data <- data.frame(
  ID = 1:50,
  X = as.factor(sample(c("yes", "no", NA), 50, replace = TRUE)),
  Y = as.factor(sample(c("a", "b", "c", NA), 50, replace = TRUE)),
  Z = as.factor(sample(c("d", "e", "f", "g", "h", NA), 50, replace = TRUE))
)
library(tidyverse)
library(forcats)
factorsummaries.f <- function(x) {
  x <- na.omit(x)                                # drop NA responses
  counts <- fct_count(fct_drop(x), sort = TRUE)  # count the remaining levels
  counts$f <- as.character(counts$f)
  # append the total number of non-NA responses as a "sum" row
  total <- data.frame(f = "sum", n = as.numeric(sum(counts$n)))
  bind_rows(counts, total)
}
factorsummaries.f(data$X)
factorsummaries.f(data$Y)

Perhaps you are looking for purrr::map_dfr:
# apply the summary function to every column except ID;
# .id records each source column's name in "colname"
map_dfr(data[, 2:ncol(data)], factorsummaries.f, .id = "colname")
#output
   colname f         n
   <chr>   <chr> <dbl>
 1 X       no       18
 2 X       yes      17
 3 X       sum      35
 4 Y       a        14
 5 Y       c        13
 6 Y       b        12
 7 Y       sum      39
 8 Z       g        10
 9 Z       d         9
10 Z       h         8
11 Z       f         6
12 Z       e         5
13 Z       sum      38
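If you prefer base iteration, or are on purrr 1.0+ where map_dfr() is superseded, a roughly equivalent sketch uses lapply() plus dplyr::bind_rows(), whose .id argument likewise turns the list names into a column:
# drop the ID column, summarize every remaining column, then stack the
# results with each source column's name recorded in "colname"
results <- lapply(data[, -1], factorsummaries.f)
bind_rows(results, .id = "colname")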

Related

Pass a vector of distribution functions to calculate a mean per case in R

I have several probability distribution functions defined using the pdqr package. Let's say they are A, B and C:
library(pdqr)
A <- as_d(function(x) dnorm(x, mean = 3, sd = 1))
B <- as_d(function(x) dnorm(x, mean = 6, sd = 1))
C <- as_d(function(x) dnorm(x, mean = 2, sd = 2))
I have a large data.frame with a character column distr naming the appropriate PDF per case, let's say:
df <- data.frame(distr = c("A", "C", "A", "B", "B", "A", "C"))
I would like to generate the mean of each PDF per case. Individually this works like this for PDF A:
> pdqr::summ_mean(A)
[1] 3
Now I would like to generate the mean for each case based on the PDF named in distr. This means passing the PDF into pdqr::summ_mean(). I have tried the following, with the resulting errors:
> df$distr_mean <- summ_mean(df$distr)
Error: `f` is not pdqr-function. It should be function.
>
> df$distr_mean <- summ_mean(invoke_map(df$distr))
Error in A() : argument "x" is missing, with no default
>
> df$distr_mean <- df %>%
+ pull(distr) %>%
+ summ_mean()
Error: `f` is not pdqr-function. It should be function.
So either it doesn't understand that a pdqr-function is being passed, or it needs an x-value, which doesn't make sense, since I want the mean over the entire distribution, not at a single x (passing a range like c(1:10) also doesn't work). Furthermore, I understand that any apply or do.call function passes only a single function, while I want to pass several different functions, given in a vector.
How to proceed?
One way to do this is to use the distr column as an argument to mget, which will return all the appropriate functions in a list. Just feed that list to summ_mean using sapply:
sapply(mget(df$distr), pdqr::summ_mean)
#> A C A B B A C
#> 3 2 3 6 6 3 2
Though inside mutate you'll need to tell mget in which environment the functions are to be found:
df %>%
mutate(distr_mean = sapply(mget(distr, envir = .GlobalEnv), pdqr::summ_mean))
#> distr distr_mean
#> 1 A 3
#> 2 C 2
#> 3 A 3
#> 4 B 6
#> 5 B 6
#> 6 A 3
#> 7 C 2
This may be easier to manage if you store your functions in a named list, rather than in the top level environment. From there, it's relatively easy to use sapply or lapply to calculate the mean for each function and then extract the results into df:
df <- data.frame(distr = c("A", "C", "A", "B", "B", "A", "C"))
pdfs <- list(
  A = as_d(function(x) dnorm(x, mean = 3, sd = 1)),
  B = as_d(function(x) dnorm(x, mean = 6, sd = 1)),
  C = as_d(function(x) dnorm(x, mean = 2, sd = 2))
)
means <- sapply(pdfs, summ_mean)
df$distr_mean <- means[df$distr]
#>   distr distr_mean
#> 1     A          3
#> 2     C          2
#> 3     A          3
#> 4     B          6
#> 5     B          6
#> 6     A          3
#> 7     C          2
Or in one line (sapply() returns a plain numeric vector, whereas lapply() would create a list column):
df$distr_mean <- sapply(df$distr, \(x) pdqr::summ_mean(pdfs[[x]]))
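Equivalently (a small variant using the pdfs list above), you can subset the list by name and let sapply() do the iteration:
# if distr is a factor, as.character() keeps the indexing by name
df$distr_mean <- sapply(pdfs[as.character(df$distr)], pdqr::summ_mean)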

Generate individual data distributions using mean and standard deviation data from a data frame in R

I have a data.frame in R containing several categorical variables, each with its own mean and standard deviation. For each categorical variable I want to draw values from the normal distribution defined by those two numbers, and collect the draws in an individual data.frame per variable.
Here's some dummy data
dummy_data <- data.frame(VARIABLE = LETTERS[1:10],
                         MEAN = runif(10, 5, 10),
                         SD = runif(10, 1, 3))
dummy_data
VARIABLE MEAN SD
1 A 6.278751 1.937093
2 B 6.384247 2.487678
3 C 9.017496 2.003202
4 D 5.125994 1.829517
5 E 9.525213 1.914513
6 F 9.004893 2.734934
7 G 9.780757 2.511341
8 H 5.372160 1.510281
9 I 6.240331 2.796826
10 J 8.478280 2.325139
What I'd like to do from here, is to generate individual data.frames for each row, with each data.frame containing a normal distribution based on the MEAN and SD columns.
So, for example, I'd have a separate data.frame that contained....
A <- subset(dummy_data, VARIABLE == 'A')
A <- data.frame(rnorm(20, A$MEAN, A$SD))
A
rnorm.20..A.MEAN..A.SD.
1 5.131331
2 9.388104
3 8.909453
4 5.813257
5 5.353137
6 7.598521
7 2.693924
8 5.425703
9 8.939687
10 9.148066
11 4.528936
12 7.576479
13 8.207456
14 6.838258
15 6.972061
16 7.824283
17 6.283434
18 4.503815
19 2.133388
20 7.472886
The real data I'm working with is much larger than ten rows, and so I don't want to subset the whole thing to generate the individual data.frames if I can help it.
Thanks in advance
What about a solution using dplyr?
library(dplyr)
# A dataframe containing all the information
# (in dplyr >= 1.1, reframe() is preferred over summarise() for
# multi-row results like this)
Huge_df <- dummy_data %>%
  group_by(VARIABLE) %>%
  summarise(SIMULATED = rnorm(20, MEAN, SD))
# You can then split the dataframe if needed; split by the column's
# values, not by the string "VARIABLE":
Splitted <- split.data.frame(Huge_df, Huge_df$VARIABLE)
If you then need to save every individual dataframe, or do something else with them, you can always work through the elements of the Splitted list (e.g. with lapply() or list2env()).
Using data.table:
library(data.table)
result <- setDT(dummy_data)[, .(sample=rnorm(20, mean=MEAN, sd=SD)), by=.(VARIABLE)]
list.of.df <- split(result, result$VARIABLE)
You can put everything into a list, then return all the elements in the list to the global environment (if desired, or keep in the list):
set.seed(123)
dummy_data <- data.frame(VARIABLE = LETTERS[seq( from = 1, to = 10 )],
MEAN = runif(10, 5, 10), SD = runif(10, 1, 3))
# put all the values into a list
list_dist <- vector(mode = "list", length = nrow(dummy_data))
for(i in 1:nrow(dummy_data)){
list_dist[[i]] <- data.frame(values = rnorm(20, dummy_data[i,2], dummy_data[i,3]))
}
# name the list elements
names(list_dist) <- dummy_data$VARIABLE
# or more detailed names, for instance,
# names(list_dist) <- paste0(dummy_data$VARIABLE, "_Distribution")
#return all list values to the global environment
list2env(list_dist, globalenv())
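The same loop can also be written without explicit indexing by using Map() over the two columns; a compact sketch with the same dummy_data:
# one data.frame of 20 draws per row, named by VARIABLE
list_dist <- Map(function(m, s) data.frame(values = rnorm(20, m, s)),
                 dummy_data$MEAN, dummy_data$SD)
names(list_dist) <- dummy_data$VARIABLE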

permute dataframe but must have unique rows

Say I have a dataframe like this:
d <- data.frame(time = c(1, 3, 5, 6, 11, 15, 15, 18, 18, 20),
                side = c("L", "R", "R", "L", "L", "L", "L", "R", "R", "R"),
                id = c(1, 2, 1, 2, 4, 3, 4, 2, 1, 1),
                stringsAsFactors = FALSE)
d
time side id
1 1 L 1
2 3 R 2
3 5 R 1
4 6 L 2
5 11 L 4
6 15 L 3
7 15 L 4
8 18 R 2
9 18 R 1
10 20 R 1
I wish to permute the id variable and keep the other two constant. However, importantly, in my final permutations I do not want to have the same id on the same side at the same time. For instance, there are two times/sides where this might occur. In the original data at time 15 and 18 there are two unique ids at the same side (left for time 15 and right for time 18). If I permute using sample there is a chance that the same id shows up at the same time/side combination.
For example,
set.seed(11)
data.frame(time=d$time, side=d$side, id=sample(d$id))
time side id
1 1 L 1
2 3 R 1
3 5 R 4
4 6 L 1
5 11 L 4
6 15 L 2
7 15 L 3
8 18 R 2
9 18 R 2
10 20 R 1
Here, id=2 appears on two rows at time 18 on side "R". This is not allowed in the permutation I need.
One solution would be to brute force this - e.g. say I needed 100 permutations, I could generate 500 and discard those that fail the criteria. However, in my real data I have hundreds of rows, and just using sample almost always leads to a failure. I wonder if there is a better algorithm for doing this? Perhaps a birth-death algorithm?
Setup:
library(tidyverse)
d <- data.frame(time = c(1, 3, 5, 6, 11, 15, 15, 18, 18, 20),
                side = c("L", "R", "R", "L", "L", "L", "L", "R", "R", "R"),
                id = c(1, 2, 1, 2, 4, 3, 4, 2, 1, 1),
                stringsAsFactors = FALSE)
d <- rownames_to_column(d)
I want the rownames to put it back in order at the end.
You need a function that takes a vector (like your id vector) and returns a sample of size n with the constraint that the values have to be different, as in the following (which assumes the sampling you want can actually take place, i.e. you haven't run out of items to sample). For convenience this also returns the "leftovers" that weren't sampled:
samp_uniq_n <- function(vec, n) {
  x <- vec
  out <- rep(NA, n)
  for (i in 1:n) {
    # Here would be a good place to make sure sampling is even possible.
    # Note: table() sorts its names, so pairing its proportions with
    # unique(x) can misalign the weights; tabulate(match(...)) keeps the
    # counts in the same order as the candidate values.
    ux <- unique(x)
    probs <- tabulate(match(x, ux))
    out[i] <- ux[sample.int(length(ux), 1, prob = probs)]
    x <- x[x != out[i]]                     # value is used up within this group
    vec <- vec[-min(which(vec == out[i]))]  # remove one copy from the leftovers
  }
  return(list(out = out, vec = vec))
}
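For example (an illustrative run; the exact draws vary with the seed):
set.seed(1)
samp <- samp_uniq_n(d$id, 3)
samp$out  # three distinct ids
samp$vec  # the remaining pool, with one copy of each sampled id removed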
Now, we need to split the data into a list of rows that have the same time and side and start the sampling with the largest such:
id <- d$id
d_split <- d %>% select(-id) %>% split(., list(d$time, d$side), drop = TRUE)
d_split_desc <- d_split[order(-sapply(d_split, nrow))]
Then we can do the sampling itself:
for(i in seq_along(d_split_desc)) {
samp <- samp_uniq_n(id, nrow(d_split_desc[[i]]))
this_id <- samp$out
d_split_desc[[i]]$id <- this_id
id <- samp$vec
}
Finally, some cleanup:
d_permute <- do.call(rbind, d_split_desc) %>%
arrange(as.numeric(rowname)) %>%
select(-rowname)
Putting all this in a big function is an annoyance I'll leave to anyone who is interested.
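As a quick sanity check (a sketch), you can verify that no (time, side) combination in d_permute repeats an id:
all(tapply(d_permute$id,
           interaction(d_permute$time, d_permute$side, drop = TRUE),
           function(x) !anyDuplicated(x)))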

selection of observations by combining criteria in R

This topic has probably been brought up before, and I guess there is quite a simple solution; however, I couldn't work it out so far.
Let's say I have a data.frame (called "data") which contains 10 individuals (id) on which I collected observations at 3 time points (T):
data <- data.frame(id = rep(1:10, 3),
                   T = gl(3, 10),
                   X = sample(1:30),
                   Y = sample(c("yes", "no"), 30, replace = TRUE),
                   Z = sample(1:40, 30),
                   Z2 = rnorm(30, mean = 5, sd = 0.5))
> head(data)
id T X Y Z Z2
1 1 1 10 yes 15 5.993605
2 2 1 18 no 22 6.096566
3 3 1 5 no 24 5.101393
4 4 1 15 yes 18 4.944108
5 5 1 23 no 34 4.634176
6 6 1 13 no 27 5.576015
I would like to create a subset of this data.frame (a new data.frame called data2) by selecting only individuals that have "yes" (variable Y) at each of the three time points (variable T), that is, Y = "yes" for T = 1 and T = 2 and T = 3.
I know that combining conditions can be achieved with the "&" sign, and this can be used to relate the conditions for the 3 time points. However, my problem is writing each condition for each time point: how do I tell R that I want subjects for which Y = "yes" at T = "1", for example?
Thank you very much in advance to all.
Have a great day,
Denis
You can do:
# keep.ids is a logical vector named by id: TRUE where all of that id's Y values are "yes"
keep.ids <- tapply(data$Y, data$id, FUN = function(x) all(x == "yes"))
subset(data, keep.ids[factor(id)])
Or use the plyr package:
library(plyr)
ddply(data, "id", function(x) if(all(x$Y == "yes")) x else NULL)

How to ddply() without sorting?

I use the following code to summarize my data, grouped by Compound, Replicate and Mass.
summaryDataFrame <- ddply(reviewDataFrame, .(Compound, Replicate, Mass),
.fun = calculate_T60_Over_T0_Ratio)
An unfortunate side effect is that the resulting data frame is sorted by those fields. I would like to do this and keep Compound, Replicate and Mass in the same order as in the original data frame. Any ideas? I tried adding a "Sorting" column of sequential integers to the original data, but of course I can't include that in the .variables since I don't want to 'group by' that, and so it is not returned in the summaryDataFrame.
Thanks for the help.
This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:
#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
  col <- ".sortColumn"
  data[, col] <- 1:nrow(data)
  out <- fn(data, ...)
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
  out <- out[order(out[, col]), ]
  out[, col] <- NULL
  out
}
#Some sample data
d <- structure(list(g = c(2L, 2L, 1L, 1L, 2L, 2L), v = c(-1.90127112738315,
-1.20862680183042, -1.13913266070505, 0.14899803094742, -0.69427656843677,
0.872558638137971)), .Names = c("g", "v"), row.names = c(NA,
-6L), class = "data.frame")
#This one resorts
ddply(d, .(g), mutate, v=scale(v)) #does not preserve order of d
#This one does not
keeping.order(d, ddply, .(g), mutate, v=scale(v)) #preserves order of d
Please do read the thread for Hadley's notes about why this functionality may not be general enough to roll into ddply, particularly as it probably applies in your case as you are likely returning fewer rows with each piece.
Edited to include a strategy for more general cases
If ddply is outputting something that is sorted in an order you do not like you basically have two options: specify the desired ordering on the splitting variables beforehand using ordered factors, or manually sort the output after the fact.
For instance, consider the following data:
d <- data.frame(x1 = rep(letters[1:3], each = 5),
                x2 = rep(letters[4:6], 5),
                x3 = 1:15,
                stringsAsFactors = FALSE)
using strings, for now. ddply will sort the output, which in this case will entail the default lexical ordering:
> ddply(d,.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 a d 5
2 a e 7
3 a f 3
4 b d 17
5 b e 8
6 b f 15
7 c d 13
8 c e 25
9 c f 27
> ddply(d[sample(1:15,15),],.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 a d 5
2 a e 7
3 a f 3
4 b d 17
5 b e 8
6 b f 15
7 c d 13
8 c e 25
9 c f 27
If the resulting data frame isn't ending up in the "right" order, it's probably because you really want some of those variables to be ordered factors. Suppose that we really wanted x1 and x2 ordered like so:
d$x1 <- factor(d$x1, levels = c('b','a','c'),ordered = TRUE)
d$x2 <- factor(d$x2, levels = c('d','f','e'), ordered = TRUE)
Now when we use ddply, the resulting sort will be as we intend:
> ddply(d,.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 b d 17
2 b f 15
3 b e 8
4 a d 5
5 a f 3
6 a e 7
7 c d 13
8 c f 27
9 c e 25
The moral of the story here is that if ddply is outputting something in an order you didn't intend, it's a good sign that you should be using ordered factors for the variables you're splitting on.
I eventually ended up adding an 'indexing' column to the original data frame. It consisted of two columns pasted together with sep = "_". Then I made another data frame consisting of only the unique members of the 'indexing' column and a counter 1:length(df). I did my ddply() on the data, which returned a sorted data frame. Then, to get things back in the original order, I merge()-d the results data frame with the index data frame (making sure the columns are named the same thing makes this easier). Finally, I sorted by the counter and removed the extraneous columns.
Not an elegant solution, but one that works.
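In code, that strategy looks roughly like this (a sketch; the key below pastes all three grouping columns, and calculate_T60_Over_T0_Ratio is the function from the question):
# build a key and record each key's first-appearance order
reviewDataFrame$key <- with(reviewDataFrame, paste(Compound, Replicate, Mass, sep = "_"))
idx <- data.frame(key = unique(reviewDataFrame$key))
idx$origOrder <- seq_len(nrow(idx))
# summarize (ddply sorts), then merge the original order back and restore it
summaryDataFrame <- ddply(reviewDataFrame, .(Compound, Replicate, Mass),
                          .fun = calculate_T60_Over_T0_Ratio)
summaryDataFrame$key <- with(summaryDataFrame, paste(Compound, Replicate, Mass, sep = "_"))
summaryDataFrame <- merge(summaryDataFrame, idx, by = "key")
summaryDataFrame <- summaryDataFrame[order(summaryDataFrame$origOrder), ]
summaryDataFrame$key <- summaryDataFrame$origOrder <- NULL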
Thanks for the assist. It got me thinking in the right direction.
