I would like to select a subset of elements from a larger set that satisfies certain conditions. There are about 20 elements, each with multiple attributes. I would like to select the five elements that show the least discrepancy from a fixed criterion on one attribute and offer the highest average value on another attribute.
Lastly, I would like to apply the function over multiple sets of 20 elements.
Thus far, I have been able to identify the subsets "by hand," but I'd like to be able to return the index of the values in addition to returning the values themselves.
Objectives:
I would like to find the set of five values for X1 that are the least discrepant from a fixed value (55), and provide the largest value for the average of X2.
I would like to do this for multiple sets.
##### generating example data
##### this has five groups, each with two variables x1 and x2
set.seed(271828)
grp <- gl(5,20)
x1 <- round(rnorm(100,45, 12), digits=0)
x2 <- round(rbeta(100,2,4), digits = 2)
id <- seq(1,100,1)
##### this is how the data would arrive for me to analyze
dat <- as.data.frame(cbind(id,grp,x1,x2))
The data would arrive in this format, with id as a unique identifier for each element.
##### pulling out the first group for demonstration
dat.grp.1 <- dat[ which(grp == 1), ]
crit <- 55
x <- t(combn(dat.grp.1$x1, 5))
y <- t(combn(dat.grp.1$x2, 5))
mean.x <- rowMeans(x)
mean.y <- rowMeans(y)
k <- (mean.x - crit)^2
out <- cbind(x, mean.x, k, y, mean.y)
##### finding the sets with the least amount of discrepancy
pick <- out[ which(k == min(k)), ]
pick
##### finding the sets with low discrepancy and high values of y (means of X2) by "hand"
sorted <- out[order(k), ]
head(sorted, n=20)
With respect to the values in pick, I can see that the values of X1 are:
> pick
mean.x k mean.y
[1,] 55 47 48 48 52 50 25 0.62 0.08 0.31 0.18 0.54 0.346
[2,] 55 48 48 47 52 50 25 0.62 0.31 0.18 0.48 0.54 0.426
I would like to return the id values for these elements, so that I know to pick elements 3, 8, 10, 11, and 18 (choosing set 2, since the discrepancy k is the same but the mean of y is higher).
> dat.grp.1
id grp x1 x2
1 1 1 45 0.12
2 2 1 27 0.34
3 3 1 55 0.62
4 4 1 39 0.32
5 5 1 41 0.18
6 6 1 29 0.47
7 7 1 47 0.08
8 8 1 48 0.31
9 9 1 35 0.48
10 10 1 48 0.18
11 11 1 47 0.48
12 12 1 31 0.29
13 13 1 39 0.15
14 14 1 36 0.54
15 15 1 36 0.20
16 16 1 38 0.40
17 17 1 30 0.31
18 18 1 52 0.54
19 19 1 44 0.37
20 20 1 31 0.20
Doing this "by hand" works for now, but it would be good to make this as "hands-off" as possible.
Any help is greatly appreciated.
You are almost there. You can change your definition of sorted to
sorted <- out[order(k, -mean.y), ]
And then sorted[1,] (or if you prefer sorted[1,,drop=FALSE]) is your selected set.
If you want the indexes rather than/in addition to the points, then you can include that earlier. Replace:
x <- t(combn(dat.grp.1$x1, 5))
y <- t(combn(dat.grp.1$x2, 5))
with
idx <- t(combn(1:nrow(dat.grp.1), 5))
x <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x1"]}))
y <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x2"]}))
and include idx in out later.
Putting it all together:
##### pulling out the first group for demonstration
dat.grp.1 <- dat[ which(grp == 1), ]
crit <- 55
idx <- t(combn(1:nrow(dat.grp.1), 5))
x <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x1"]}))
y <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x2"]}))
mean.x <- rowMeans(x)
mean.y <- rowMeans(y)
k <- (mean.x - crit)^2
out <- cbind(idx, x, mean.x, k, y, mean.y)
##### finding the sets with the least amount of discrepancy and among
##### those the largest second mean
pick <- out[order(k, -mean.y)[1],,drop=FALSE]
pick
which gives
mean.x k mean.y
[1,] 3 8 10 11 18 55 48 48 47 52 50 25 0.62 0.31 0.18 0.48 0.54 0.426
EDIT: A description of applying over idx was requested; I want more room than a comment allows, so I'm adding it to my answer. I will also address looping over subsets.
idx is a 15504 x 5 matrix, each row of which is a set of five indexes into the data frame. apply lets you go through it row by row (rows are margin 1) and do something with each row. That something is to take the index values, use them to select the corresponding rows of dat.grp.1, and pull out the x1 values; I could have written dat.grp.1[i,"x1"] as dat.grp.1$x1[i]. In the result of apply, each row of idx becomes a column and the values pulled from dat.grp.1 become the rows, so the whole thing needs to be transposed.
You can break the apply call apart to see how each step works, if you like. Turn the anonymous function into a named one:
f <- function(i) {dat.grp.1[i,"x1"]}
and pass it one row of idx at a time.
> f(idx[1,])
[1] 45 27 55 39 41
> f(idx[2,])
[1] 45 27 55 39 29
> f(idx[3,])
[1] 45 27 55 39 47
> f(idx[4,])
[1] 45 27 55 39 48
These are what get bundled into x
> head(x,4)
[,1] [,2] [,3] [,4] [,5]
[1,] 45 27 55 39 41
[2,] 45 27 55 39 29
[3,] 45 27 55 39 47
[4,] 45 27 55 39 48
As for looping over subsets, the plyr library is very handy for this. The way you have set it up (assigning the subset of interest to a variable and working with that) makes the transformation easy. Everything you do to create the answer for one subset goes into a function that takes that subset as a parameter.
find.best.set <- function(dat.grp.1) {
crit <- 55
idx <- t(combn(1:nrow(dat.grp.1), 5))
x <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x1"]}))
y <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x2"]}))
mean.x <- rowMeans(x)
mean.y <- rowMeans(y)
k <- (mean.x - crit)^2
out <- cbind(idx, x, mean.x, k, y, mean.y)
out[order(k, -mean.y)[1],,drop=FALSE]
}
This is basically what you had before, but getting rid of some unnecessary assignments.
Now wrap this in a plyr call.
library("plyr")
ddply(dat, .(grp), find.best.set)
which gives
grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1 1 3 8 10 11 18 55 48 48 47 52 50 25 0.62 0.31 0.18 0.48 0.54 0.426
2 2 8 10 12 15 16 53 35 55 76 56 55 0 0.71 0.20 0.43 0.50 0.70 0.508
3 3 4 10 15 17 20 47 48 73 55 52 55 0 0.67 0.54 0.28 0.42 0.31 0.444
4 4 2 11 13 17 19 47 46 70 62 50 55 0 0.35 0.47 0.18 0.13 0.47 0.320
5 5 3 6 10 17 19 72 40 58 66 39 55 0 0.33 0.42 0.32 0.32 0.51 0.380
I don't know that that is the best format for your results, but it mirrors the example you gave.
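If a different format would help, a hedged variant (an untested sketch; the helper name find.best.rows is mine, not from the original post) could return the five selected rows themselves, ids included, rather than the wide matrix:
find.best.rows <- function(d, crit = 55, n = 5) {
  idx <- t(combn(seq_len(nrow(d)), n))
  mean.x <- apply(idx, 1, function(i) mean(d$x1[i]))
  mean.y <- apply(idx, 1, function(i) mean(d$x2[i]))
  k <- (mean.x - crit)^2
  best <- order(k, -mean.y)[1]   # least discrepancy first, then largest mean of x2
  d[idx[best, ], ]               # the five chosen rows of the original data
}
##### one data frame of selected rows per group
lapply(split(dat, dat$grp), find.best.rows)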
Related
I am trying to apply a function to a data.frame, but only to specific rows.
My aim is to add a fifth column that holds the result of a function which varies according to group and func. Say I would like to perform a t-test when func is 'a' and to calculate a mean difference when func is 'b'. In other words, the first three rows of the fifth column should contain the result of a t.test (t.test(n1[1:3], n2[1:3])$p.value) comparing n1 and n2 within group 1. How is this possible?
n1<-c(58,94,58,94,65,87,65,91,20,16)
n2<-c(37,34,88,23,86,37,80,34,24,67)
group<-c(1,1,1,2,2,2,2,3,3,3)
func<-c('a','a','a','b','b','b','b','a','a','a')
data<-data.frame(n1,n2,group,func)
data
n1 n2 group func
1 58 37 1 a
2 94 34 1 a
3 58 88 1 a
4 94 23 2 b
5 65 86 2 b
6 87 37 2 b
7 65 80 2 b
8 91 34 3 a
9 20 24 3 a
10 16 67 3 a
EDIT:
Manually I can do it like this, but how could I do it if I had 1000+ rows and 100+ groups?
pvalue1<-t.test(c(58,94,58),c(37,34,88))$p.value
pvalue2<-chisq.test(c(94,65,87,65),c(23,86,37,80))$p.value
pvalue3<-t.test(c(91,20,16),c(34,24,67))$p.value
pvalue<-c(rep(pvalue1,3),rep(pvalue2,4),rep(pvalue3,3))
cbind(data,pvalue)
n1 n2 group func pvalue
1 58 37 1 a 0.4737073
2 94 34 1 a 0.4737073
3 58 88 1 a 0.4737073
4 94 23 2 b 0.2381033
5 65 86 2 b 0.2381033
6 87 37 2 b 0.2381033
7 65 80 2 b 0.2381033
8 91 34 3 a 0.9822272
9 20 24 3 a 0.9822272
10 16 67 3 a 0.9822272
You can do the calculations with dplyr like this:
library(dplyr)
my_df %>%
group_by(group) %>%
mutate(p_value = ifelse(func == 'a', t.test(n1, n2)$p.value, chisq.test(n1, n2)$p.value))
# A tibble: 10 x 5
# Groups: group [3]
# n1 n2 group func p_value
# <dbl> <dbl> <dbl> <fct> <dbl>
# 1 58. 37. 1. a 0.474
# 2 94. 34. 1. a 0.474
# 3 58. 88. 1. a 0.474
# 4 94. 23. 2. b 0.238
# 5 65. 86. 2. b 0.238
# 6 87. 37. 2. b 0.238
# 7 65. 80. 2. b 0.238
# 8 91. 34. 3. a 0.982
# 9 20. 24. 3. a 0.982
# 10 16. 67. 3. a 0.982
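One design note on the ifelse() approach: both tests are evaluated for every group, and because func is constant within each group only the matching result is kept. If you want to avoid computing the unused test, a hedged variant (a sketch using dplyr::first(); my_df is the placeholder name used above) would be:
my_df %>%
  group_by(group) %>%
  mutate(p_value = if (first(func) == 'a') t.test(n1, n2)$p.value else chisq.test(n1, n2)$p.value)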
I've seen some cool stuff along these lines in Hadley's R4DS book. Check this out for an example and some discussion around my approach below.
The following goes some way to achieving what you'd like:
library(dplyr)
library(purrr)
library(tidyr)
test_function <- function(func, data) {
if (func == "a") {t.test(data$n1, data$n2)$p.value}
else if (func == "b") {chisq.test(data$n1, data$n2)$p.value}
}
df %>%
group_by(group, func) %>%
nest() %>%
mutate(p_value = map2_dbl(func, data, function(x, y) test_function(x, y))) %>%
unnest()
Consider base R's underused by(), which can split data frames by one or more factors and pass each subset into a defined or anonymous function, returning a list of the function's output.
Data (assuming functions are strings)
n1 <- c(58,94,58,94,65,87,65,91,20,16)
n2 <- c(37,34,88,23,86,37,80,34,24,67)
group <- c(1,1,1,2,2,2,2,3,3,3)
func <- c('t.test','t.test','t.test','chisq.test','chisq.test',
          'chisq.test','chisq.test','t.test','t.test','t.test')
data <- data.frame(n1,n2,group,func)
by() processing (using get() to retrieve the actual function):
data_list <- by(data, data$group, function(sub){
func <- print(as.character(sub$func[[1]]))
f <- get(func)
sub$pvalue <- f(sub$n1, sub$n2)$p.value
return(sub)
})
final_df <- do.call(rbind, data_list)
final_df
# n1 n2 group func pvalue
# 1.1 58 37 1 t.test 0.4737073
# 1.2 94 34 1 t.test 0.4737073
# 1.3 58 88 1 t.test 0.4737073
# 2.4 94 23 2 chisq.test 0.2381033
# 2.5 65 86 2 chisq.test 0.2381033
# 2.6 87 37 2 chisq.test 0.2381033
# 2.7 65 80 2 chisq.test 0.2381033
# 3.8 91 34 3 t.test 0.9822272
# 3.9 20 24 3 t.test 0.9822272
# 3.10 16 67 3 t.test 0.9822272
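If the func column instead holds codes like 'a' and 'b', as in the original question's data, the same by() pattern still works with a small lookup; a hedged sketch (the switch() mapping from code to test is my assumption):
data_list <- by(data, data$group, function(sub){
  f <- switch(as.character(sub$func[[1]]), a = t.test, b = chisq.test)
  sub$pvalue <- f(sub$n1, sub$n2)$p.value
  return(sub)
})
do.call(rbind, data_list)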
Assume I have a random data table and I want to loop over its subsets.
e.g.
DT <- data.table(date = rep(c(1979,1980,1981,1982),3),
Id = rep(c(1,2,3),each = 4),
x1 = c(10, 40, 80,12,13,19,9,5,22,13,49,110),
x2 = sample(100,12,replace=T),
x3 = sample(100,12,replace=T))
I also have the following function:
test <- function(x){x[,3:5]/100}
Assume I loop over Id, apply the function test to the subsets of the data.table, and save everything in a list:
resultinglist <- vector("list",3)
for (i in 1:3){resultinglist[[i]] <- test(DT[Id == i])}
This, so far, is straightforward. Now my question: with very large data sets this can take a while, so can this code be optimized in any way, perhaps so that no copies of the data.table subsets are made?
In particular, I wonder what happens when I pass DT[Id == i] to the function test. Is this the right approach? Alternatively, I could loop and filter at every iteration, then apply some code to the filtered data.table.
Thanks for any hints.
I would go with split(test(DT), DT$Id).
> system.time(resultinglist1<- split(test(DT), DT$Id))
user system elapsed
0.002 0.000 0.002
> resultinglist <- vector("list",3)
> system.time(for (i in 1:3){resultinglist[[i]] <- test(DT[Id == i])})
user system elapsed
0.015 0.000 0.016
Even with that few data points it takes 1/8th of the time (on my machine).
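If you would rather stay entirely within data.table, a grouped .SD computation is another option worth benchmarking; a sketch (note it returns a single table rather than a list, which you could split afterwards if you need the list structure):
library(data.table)
# divide columns 3 to 5 by 100 within each Id, without an explicit loop
resultDT <- DT[, lapply(.SD, function(col) col / 100), by = Id, .SDcols = 3:5]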
There is a split.data.table method: see ?split.data.table so try:
> split(DT, by=c("Id"), flatten=FALSE)
$`1`
date Id x1 x2 x3
1: 1979 1 10 26 74
2: 1980 1 40 17 5
3: 1981 1 80 43 51
4: 1982 1 12 35 96
$`2`
date Id x1 x2 x3
1: 1979 2 13 8 65
2: 1980 2 19 66 69
3: 1981 2 9 69 27
4: 1982 2 5 4 80
$`3`
date Id x1 x2 x3
1: 1979 3 22 100 29
2: 1980 3 13 28 83
3: 1981 3 49 53 55
4: 1982 3 110 89 7
If you wanted to extract the 3rd to 5th columns it might be:
lapply( split(DT, by=c("Id"), flatten=FALSE), subset, select=3:5)
$`1`
x1 x2 x3
1: 10 26 74
2: 40 17 5
3: 80 43 51
4: 12 35 96
$`2`
x1 x2 x3
1: 13 8 65
2: 19 66 69
3: 9 69 27
4: 5 4 80
$`3`
x1 x2 x3
1: 22 100 29
2: 13 28 83
3: 49 53 55
4: 110 89 7
See also ?subset.data.table
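Equivalently, you could apply the question's test function itself over the split (a sketch along the same lines):
resultinglist2 <- lapply(split(DT, by = "Id"), test)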
My initial goal was to set ylim for data plotted by barplot. When I started to dig deeper I found several things that I do not understand. Let me explain what I have tried:
I have 1D vector:
> str(vectorName)
num [1:999] 1 1 1 1 1 1 1 1 1 1 ...
> dim(vectorName)
NULL
> length(vectorName)
[1] 999
If I want to count the occurrences of each element of this vector I do:
> vectorNameTable = table(vectorName)
> vectorNameTable
vectorName
0 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 0.375 0.4
563 72 35 22 14 21 14 10 5 3 7 3 6 5 3 1 3
0.425 0.45 0.475 0.5 0.525 0.55 0.575 0.6 0.625 0.65 0.675 0.7 0.725 0.75 0.775 0.8 0.825
1 3 3 5 7 11 3 4 3 11 5 9 5 7 8 5 3
0.85 0.875 0.9 0.925 0.975 1
3 4 2 1 1 108
This is how I display those data in a more elegant way (in RStudio):
> View(vectorNameTable)
Which gives me output like this:
vectorName Freq
1 0 563
2 0.025 72
3 0.05 35
4 0.075 22
5 0.1 14
6 0.125 21
7 0.15 14
8 0.175 10
9 0.2 5
10 0.225 3
11 0.25 7
12 0.275 3
13 0.3 6
14 0.325 5
15 0.35 3
16 0.375 1
17 0.4 3
18 0.425 1
19 0.45 3
20 0.475 3
21 0.5 5
22 0.525 7
23 0.55 11
24 0.575 3
25 0.6 4
26 0.625 3
27 0.65 11
28 0.675 5
29 0.7 9
30 0.725 5
31 0.75 7
32 0.775 8
33 0.8 5
34 0.825 3
35 0.85 3
36 0.875 4
37 0.9 2
38 0.925 1
39 0.975 1
40 1 108
If I want to plot this data I do:
> barplot(vectorNameTable)
Which gives me this plot:
As you can see, the count for 0 exceeds the top of the y-axis. So what I want is to set the y-axis range using:
barplot(table(vectorNameTable), ylim=c(0,MAX_VALUE_IN_FREQ_COLUMN))
The problem is that I cannot find the largest value in the Freq column. To be more precise, I cannot even access the Freq column. I've tried:
> vectorNameTable[,1]
Error in vectorNameTable[, 1] : incorrect number of dimensions
and several other attempts, but it seems the only thing I am able to obtain is a whole row:
> vectorNameTable[1]
0
563
> vectorNameTable[2]
0.025
72
Or just the Freq value in a given row:
> vectorNameTable[[1]]
[1] 563
> vectorNameTable[[2]]
[1] 72
One workaround that works is converting the data to a matrix:
vectorNameDF = data.frame(vectorNameTable)
val = vectorNameDF[[1]]
frq = vectorNameDF[[2]]
val = as.numeric(levels(val))
vectorNameMTX = matrix(c(val, frq), nrow=length(val))
Then I can do something like this:
barplot(vectorNameTable, ylim=c(0,max(vectorNameMTX[,2])+50))
Which will return:
But as you can see, this is overkill. Another mysterious thing I've found is that plotting the graph this way (the same as barplot(vectorNameMTX, beside=FALSE)):
> barplot(vectorNameMTX)
Will return this:
This command > barplot(vectorNameMTX, beside=TRUE) will return this:
Why is this happening? What is this "line" on the left, and where is the x-axis? If I do View(vectorNameMTX), it returns a table very similar to View(vectorNameTable). The documentation for barplot says (relevant parts only):
Bar Plots
Description
Creates a bar plot with vertical or horizontal bars.
Usage
barplot(height, ...)
height
either a vector or matrix of values describing the bars which make up the plot. If height is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If height is a matrix and beside is FALSE then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked sub-bars making up the bar. If height is a matrix and beside is TRUE, then the values in each column are juxtaposed rather than stacked.
I'm passing a matrix, but it does not work as expected:
> class(vectorNameMTX)
[1] "matrix"
On the other hand, this one is not mentioned as a supported type but it works:
> class(vectorNameTable)
[1] "table"
Why can't I access the columns of vectorNameTable? Why does passing the table object work while passing a matrix does not? What am I missing here, and what is the best way to achieve my goal?
Thank you
A table of a 1D vector is itself one-dimensional, so there are no columns. You can do something like
> a <- rbinom(1000, 25, 0.5)
> tb <- table(a)
> tb
a
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
8 20 31 71 96 155 141 146 136 94 46 33 15 7 1
> dim(tb)
[1] 15 # 1 dimension of 15
> tb[which.max(tb)]
11
155
So you can feed this maximum value to barplot's ylim.
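Applied to the original example, that might look like this (a sketch along the same lines, using the vectorNameTable from the question):
# size the y-axis from the table's own maximum count
barplot(vectorNameTable, ylim = c(0, max(vectorNameTable)))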
I have a data set generated as follows:
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))
The data looks like this:
I would like to generate a stratified sample of myData with a given sample size, e.g. 50. The resulting sample should follow the proportional allocation of the original data set in terms of group. For instance, if myData has 20 records belonging to group 4, then the resulting sample should have 50*20/200 = 5 records belonging to group 4. How can I do that in R?
You can use my stratified function, specifying a value < 1 as your proportion, like this:
## Sample data. Seed for reproducibility
set.seed(1)
N <- 50
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))
## Taking the sample
out <- stratified(myData, "group", .3)
out
# a b group
# 17 17 -0.02 2
# 8 8 0.74 3
# 25 25 0.62 3
# 49 49 -0.11 3
# 4 4 1.60 3
# 26 26 -0.06 4
# 27 27 -0.16 4
# 7 7 0.49 4
# 12 12 0.39 4
# 40 40 0.76 4
# 32 32 -0.10 4
# 9 9 0.58 5
# 42 42 -0.25 5
# 43 43 0.70 5
# 37 37 -0.39 5
# 11 11 1.51 6
Compare the group counts in the resulting sample with what we would have expected.
round(table(myData$group) * .3)
#
# 2 3 4 5 6
# 1 4 6 4 1
table(out$group)
#
# 2 3 4 5 6
# 1 4 6 4 1
You can also easily take a fixed number of samples per group, like this:
stratified(myData, "group", 2)
# a b group
# 34 34 -0.05 2
# 17 17 -0.02 2
# 49 49 -0.11 3
# 22 22 0.78 3
# 12 12 0.39 4
# 7 7 0.49 4
# 18 18 0.94 5
# 33 33 0.39 5
# 45 45 -0.69 6
# 11 11 1.51 6
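If you would prefer not to depend on an external function, here is a minimal base R sketch of proportional allocation using split() and sample(); the rounding behaviour may differ slightly from stratified():
set.seed(1)
prop <- 0.3
sampled <- lapply(split(myData, myData$group), function(d) {
  d[sample(nrow(d), round(nrow(d) * prop)), ]
})
out2 <- do.call(rbind, sampled)
table(out2$group)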
Say I have the following data frame:
df <- data.frame(store = LETTERS[1:8],
sales = c( 9, 128, 54, 66, 23, 132, 89, 70),
successRate = c(.80, .25, .54, .92, .85, .35, .54, .46))
I want to rank the stores according to successRate, with ties going to the store with more sales, so first I do this (just to make visualization easier):
df <- df[order(-df$successRate, -df$sales), ]
In order to actually create a ranking variable, I do the following:
df$rank <- ave(df$successRate, FUN = function(x) rank(-x, ties.method='first'))
So df looks like this:
store sales successRate rank
4 D 66 0.92 1
5 E 23 0.85 2
1 A 9 0.80 3
7 G 89 0.54 4
3 C 54 0.54 5
8 H 70 0.46 6
6 F 132 0.35 7
2 B 128 0.25 8
The problem is I don't want small stores to be part of the ranking. Specifically, I want stores with less than 50 sales not to be ranked. So this is how I define df$rank instead:
df$rank <- ifelse(df$sales < 50, NA,
ave(df$successRate, FUN = function(x) rank(-x, ties.method='first')))
The problem is that even though this correctly removes stores E and A, it doesn't reassign the rankings they were occupying. df looks like this now:
store sales successRate rank
4 D 66 0.92 1
5 E 23 0.85 NA
1 A 9 0.80 NA
7 G 89 0.54 4
3 C 54 0.54 5
8 H 70 0.46 6
6 F 132 0.35 7
2 B 128 0.25 8
I've experimented with conditions inside and outside ave(), but I can't get R to do what I want! How can I get it to rank the stores like this?
store sales successRate rank
4 D 66 0.92 1
5 E 23 0.85 NA
1 A 9 0.80 NA
7 G 89 0.54 2
3 C 54 0.54 3
8 H 70 0.46 4
6 F 132 0.35 5
2 B 128 0.25 6
Super easy to do with data.table:
library(data.table)
dt = data.table(df)
# do the ordering you like (note, could also use setkey to do this faster)
dt = dt[order(-successRate, -sales)]
dt[sales >= 50, rank := .I]
dt
# store sales successRate rank
#1: D 66 0.92 1
#2: E 23 0.85 NA
#3: A 9 0.80 NA
#4: G 89 0.54 2
#5: C 54 0.54 3
#6: H 70 0.46 4
#7: F 132 0.35 5
#8: B 128 0.25 6
If you must do it with a data.frame, then after ordering it as you prefer, run:
df$rank <- NA
df$rank[df$sales >= 50] <- seq_len(sum(df$sales >= 50))
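As a quick check, assuming df is still ordered by -successRate and -sales as above, this should reproduce the ranking asked for:
df
#   store sales successRate rank
# 4     D    66        0.92    1
# 5     E    23        0.85   NA
# 1     A     9        0.80   NA
# 7     G    89        0.54    2
# 3     C    54        0.54    3
# 8     H    70        0.46    4
# 6     F   132        0.35    5
# 2     B   128        0.25    6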