I have a bunch of Boolean variables. I want to summarize them and show the percentage of positive values. The big thing in this question is that the variables are logical organized in two dimensions.
The result I want should look like this (kind of):
a b
v1_1 30% 60%
v1_2 60% 50%
Here is a minimal working (self running) example.
#!/usr/bin/env Rscript
set.seed(0)
df <- data.frame(v1_1_a = sample(c(T,F), 10, replace=TRUE),
v1_1_b = sample(c(T,F), 10, replace=TRUE),
v1_2_a = sample(c(T,F), 10, replace=TRUE),
v1_2_b = sample(c(T,F), 10, replace=TRUE))
my_percent <- function (col) { return (100 / length(col) * sum(col)) }
p <- apply(df, 2, my_percent)
print(p)
This is the output:
v1_1_a v1_1_b v1_2_a v1_2_b
30 60 60 50
Just for information: The real data has 80 Boolean variables logical organized in a 10 x 8 matrix.
If you don't mind having to add in the row and column names, you could use colMeans together with the matrix construction function to build a matrix with the desired structure.
myMat <- matrix(colMeans(df), 2, byrow = TRUE)
MyMat
[,1] [,2]
[1,] 0.3 0.6
[2,] 0.6 0.5
If desired, you could add the names using dimnames. In this instance,
dimnames(myMat) <- list(paste0("V1", 1:2), letters[1:2])
will do the trick.
You could break up the metric names into separate columns.
With dplyr and tidyr:
p <- data.frame(p)
p$metric <- row.names(p)
p %>% mutate(metric_1 = ifelse(grepl('v1_1_', metric), "v1_1", "v1_2"),
metric_2 = ifelse(grepl('a', metric), 'a', 'b')) %>%
select(-metric) %>%
spread(key = metric_2, value = p)
Giving...
metric_1 a b
1 v1_1 30 60
2 v1_2 60 50
You could set the row names to get exactly what you want:
row.names(p) <- p$metric_1
p %<>% select(-metric_1)
Resulting in...
a b
v1_1 30 60
v1_2 60 50
Related
I am running a Montecarlo simulation of a multinomial logit. Therefore I have a function that generates the data and estimates the model. Additionally, I want to generate different datasets over a grid of values. In particular, changing both the number of individuals (n.indiv) and the number of answers by each individual (n.choices).
So far, I have managed to solve it, but at some point, I incurred into a nested for-loop structure over a grid search of the possible values for the number of individuals (n.indiv_list) and the number of answers by each individual(n.choices_list). Finally, I am quite worried about the efficiency of the usage of my last bit of code with the double for-loop structure running on the combinations of the possible values. Probably there is a vectorized way to do it that I am missing (or maybe not?).
Finally, and this is mostly a matter of style, I managed to arrive a multiples objects that contain the models from the combinations of the grid search with informative names, but also would be great if I could collapse all of them in a list but with the current structure, I am not sure how to do it. Thank you in advance!
1) Function that generates data and estimates the model.
library(dplyr)
library(VGAM)
library(mlogit)
#function that generates the data and estimates the model.
mlogit_sim_data <- function(...){
# generating number of (n.alter) X (n.choices)
df <- data.frame(id= rep(seq(1,n.choices ),n.alter ))
# id per individual
df <- df %>%
group_by(id) %>%
mutate(altern = sequence(n()))%>%
arrange(id)
#Repeated scheme for each individual + id_ind
df <- cbind(df[rep(1:nrow(df), n.indiv), ], id_ind = rep(1:n.indiv, each = nrow(df)))
## creating attributes
df<- df %>%
mutate(
x1=rlnorm(n.indiv*n.alter),
x2=rlnorm(n.indiv*n.alter),
)%>%
group_by(altern) %>%
mutate(
id_choice = sequence(n()))%>%
group_by(id_ind) %>%
mutate(
z1 = rpois(1,lambda = 25),
z2 = rlnorm(1,meanlog = 5, sdlog = 0.5),
z3 = ifelse(runif(1, min = 0 , max = 1) > 0.5 , 1 , 0)
)
# Observed utility
df$V1 <- with(df, b1 * x1 + b2 * x2 )
#### Generate Response Variable ####
fn_choice_generator <- function(V){
U <- V + rgumbel(length(V), 0, 1)
1L * (U == max(U))
}
# Using fn_choice_generator to generate 'choice' columns
df <- df %>%
group_by(id_choice) %>%
mutate(across(starts_with("V"),
fn_choice_generator, .names = "choice_{.col}")) %>% # generating choice(s)
select(-starts_with("V")) %>% ##drop V variables.
select(-c(id,id_ind))
tryCatch(
{
model_result <- mlogit(choice_V1 ~ 0 + x1 + x2 |1 ,
data = df,
idx = c("id_choice", "altern"))
return(model_result)
},
error = function(e){
return(NA)
}
)
}
2) Grid search over possible combinations of the data
#List with the values that varies in the simulation
#number of individuals
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
# Values that remains constant across simulations
#set number of alternatives
n.alter <- 3
## Real parameters
b1 <- 1
b2 <- 2
#Number of reps
nreps <- 10
#Set seed
set.seed(777)
#iteration over different values in the simulation
for(i in n.indiv_list) {
for(j in n.choices_list) {
n.indiv <- i
n.choices <- j
assign(paste0("m_ind_", i, "_choices_", j), lapply(X = 1:nreps, FUN = mlogit_sim_data))
}
}
You can vectorize using the map2 function of the purrr package:
library(tidyverse)
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
l1 <- length(n.indiv_list)
l2 <- length(n.choices_list)
v1 <- rep(n.indiv_list, each = l2)
v2 <- rep(n.choices_list, l1) #v1, v2 generate all pairs
> v1
[1] 1 1 1 1 1 15 15 15 15 15 100 100 100 100 100 500 500 500 500 500
> v2
[1] 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10
result <- map2(v1, v2, function(v1, v2) assign(paste0("m_ind_", v1, "_choices_", v2), lapply(X = 1:nreps, FUN = mlogit_sim_data)))
result will be a list of your function outputs.
I want to create random mock data looks like this.
__ID__|__Amount__
1 20
1 14
1 9
1 3
2 11
2 5
2 2
Starting from the random number but the second number with the same ID should be lesser than the first one, and the third number has to be lesser than the second one. Maximum number to start should be 20.
you can just create the data first and then sort it as you need, using tidyverse :
set.seed(0)
df <- data.frame(id = rep(1:3,10), amt = sample(1:20, 30, replace = TRUE))
df %>%
group_by(id) %>%
arrange(id, desc(amt))
This is a tricky one if you want the Amount column to be truly random values you can use a recursive call that will use sample recursively:
## Recursively sampling from a uniform distribution
recursive.sample <- function(start, end, length, results = NA, counter =0) {
## To enter the recursion, counter must be smaller than the length out
## and the last result must be smaller than the starting point (except the firs time)
if(counter < length && ifelse(counter != 0, results[counter] > start, TRUE)){
## Increment the counter
counter <- counter + 1
## Sample between start and the last result or the start and the end of the vector
results[counter] <- ifelse(counter != 1, sample(start:results[counter-1], 1), sample(start:end, 1))
## Recursive call
return(recursive.sample(start = start, end = end, length = length, results = results, counter = counter))
} else {
## Exit the recursion
return(results)
}
}
## Example
set.seed(0)
recursive.sample(start = 1, end = 20, length = 3, results = NA, counter = 0)
#[1] 18 5 2
Alternatively (and way easier) you can use sort(sample()):
set.seed(0)
sort(sample(1:20, 3), decreasing = TRUE)
#[1] 18 7 6
Note that the results differ due to the lower probability of sampling higher values in the recursive function.
You can then easily create your table with your chosen function as follow:
set.seed(123)
## The ID column
ID <- c(rep(1, 4), rep(2,3))
## The Amount column
Amount <- c(recursive.sample(1, 20, 4, NA, 0), recursive.sample(1, 11, 3, NA, 0))
## The table
cbind(ID, Amount)
# ID Amount
#[1,] 1 18
#[2,] 1 5
#[3,] 1 2
#[4,] 1 2
#[5,] 2 10
#[6,] 2 3
#[7,] 2 3
Or, again, with the simple sort(sample()) function for a higher probability of picking larger numbers.
Two methods, one using dplyr and one using only base R functions. These are slightly different to the two previous solutions.
I used sorted ID column, but this is not necessary.
Method 1
rm(list = ls())
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
df %>% group_by(ID) %>%
mutate(Amount = sort(sample(1 : 20, n(), replace = T), decreasing = TRUE))
Method 2
rm(list = ls())
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
df$Amount <- NA
uniq_ID <- unique(df$ID)
index_lst <- lapply(uniq_ID, function(x) which(df$ID == x))
res <- lapply(index_lst, function(x) sort(sample(1 : 20, length(x)),
decreasing = TRUE))
df$Amount[unlist(index_lst)] <- unlist(res)
Method 2.5
This is more convoluted than the 2nd method.
rm(list = ls())
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
df$Amount <- NA
tab <- as.data.frame(table(df$ID))
lapply(1 : nrow(tab), function(x) df$Amount[which(df$ID == tab$Var1[x])] <<-
sort(sample(1 : 20, tab$Freq[x]), decreasing = TRUE))
I wanted to multiply to each list element in say l1 with b1's col1 and store it in a separate column. Basically this is what i wanted to do :
res = 0
for item in a
for col_item in b
res = res + item * col_item
E.g.
l1 = list(c('17-Nov-14', 10), c('17-Apr-15', 20))
b1 = data.frame(col1 = c(10, 20), res=c(0))
result = data.frame(col1= c(10, 20), res = c(2*10+4*10+3*10, 2*20+4*20+3*20))
I have a working code but can be improved.
test <- function(param, df) {
df$res <- as.integer(param[2]) * df$col1
df
}
t <- lapply(l1, test, b)
result <- cbind(t[[1]]$col1, t[[1]]$res + t[[2]]$res + t[[3]]$res)
We can simplify the computation with a little algebra. If we factor out the element of b1$col1, then we can precompute the sum of the list and perform a vectorized multiplication against it:
b1$res <- sum(unlist(l1))*b1$col1;
b1;
## col1 res
## 1 10 90
## 2 20 180
For your new problem definition, we need to extract the required element out of each list component vector:
b1$res <- sum(as.integer(sapply(l1,`[`,2L)))*b1$col1;
b1;
## col1 res
## 1 10 300
## 2 20 600
If you are looking for a method to reduce your list after lapply, you can use the Reduce function:
Reduce(function(df1, df2) data.frame(col = df1[1], res = df1[2] + df2[2]), myList)
# col1 res
# 1 10 90
# 2 20 180
Suppose myList <- lapply(...).
I've got a set of objects, let's say with the IDs 'A' to 'J'. And I've got two data frames which look the following way (as you can see, the second data frame is symmetric):
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9,13,15,11,28))
df2 <- as.data.frame(matrix(data = c(NA,42,83,74,84,42,NA,26,69,9,83,26,NA,67,95,74,69,67,NA,6,84,9,95,6,NA), ncol = 5, nrow = 5, dimnames = list(df1$ID, df1$ID)))
For example, take the objects 'B' and 'E'. I want to know: Is 13+28 (from df1) less than 9 (from df2)? I'd like to know this for all pairs of objects. The output should be
(a) a logical data frame structured like df2 and
(b) the number of "TRUE" values.
Most of the time I will only need result (b), but sometimes I would also need (a). So if (b) can be calculated without (a) and if this would be significantly faster, then I'd like to have both algorithms in order to select the suitable one dependent on which output I need to answer a particular question.
I'm comparing around 2000 objects, so the algorithm should be reasonably fast. So far I've been only able to implement this with two nested for-loops which is awfully slow. I bet there is a much nicer way to do this, maybe exploiting vectorisation.
This is what it currently looks like:
df3 <- as.data.frame(matrix(data = NA, ncol = nrow(df1), nrow = nrow(df1),
dimnames = list(df1$ID, df1$ID)))
for (i in 2:nrow(df3)){
for (j in 1:(i-1)){
sum.val <- df1[df1$ID == rownames(df3)[i], "Var"] + df1[df1$ID == names(df3)[j], "Var"]
df3[i,j] <- sum.val <= df2[i,j]
}
}
#
Is this what you want?
df3 <- outer(df1$Var, df1$Var, "+")
df3
df4 <- df3 < df2
df4
sum(df4, na.rm = TRUE)
Here's one way to do it...
# Get row and column indices
ind <- t( combn( df1$ID , 2 ) )
# Get totals
tot <- with( df1 , Var[ match( ind[,1] , ID ) ] + Var[ match( ind[,2] , ID ) ] )
# Make df2 a matrix
m <- as.matrix( df2 )
# Total number of values is simply
sum( m[ ind ] > tot )
#[1] 7
# Find which values in upper triangle part of the matrix exceed those from df1 (1 = TRUE)
m[upper.tri(m)] <- m[ ind ] > tot
# A B C D E
#A NA 1 1 1 0
#B 42 NA 1 0 1
#C 83 26 NA 1 1
#D 74 69 67 NA 0
#E 84 9 95 6 NA
This will do what you want.
# Generate the data
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9,13,15,11,28))
df2 <- as.data.frame(matrix(data = c(NA,42,83,74,84,42,NA,26,
69,9,83,26,NA,67,95,74,69,
67,NA,6,84,9,95,6,NA),
ncol = 5, nrow = 5,
dimnames = list(df1$ID, df1$ID)))
# Define a pairwise comparison index matrix using 'combn'
idx <- combn(nrow(df1), 2)
# Create a results matrix
res <- matrix(NA, ncol = ncol(df2), nrow = nrow(df2))
# Loop through 'idx' for each possible comparison (without repeats)
for(i in 1:ncol(idx)){
logiTest <- (df1$Var[idx[1,i]] + df1$Var[idx[2,i]]) < df2[idx[1,i], idx[2,i]]
res[idx[1,i], idx[2, i]] <- logiTest
res[idx[2,i], idx[1, i]] <- logiTest
}
# Count the number of 'true' comparisons
nTrues <- sum(res, na.rm = TRUE)/2
The code simply uses a pairwise comparison index (idx) to define which elements in both df1 and df2 are to be used in each iteration of the 'for loop'. It then uses this same index to define where in the 'res' matrix the answer to the logical test is to be written.
N.B. This code will break down if the order of elements in df1 and df2 are not the same. In such cases, it would be appropriate to use the actual letters to define which values to compare.
I have this kind of data:
x <- matrix(c(2,2,3,3,3,4,4,20,33,2,3,45,6,9,45,454,7,4,6,7,5), nrow = 7, ncol = 3)
In the real dataset, I have a huge matrix with a lot of columns.
I want to extract unique rows with respect to the first column(Id) and minimum of the third column. For instance, for this matrix I would expect
y <- matrix(c(2,3,4,20,3,9,45,4,5), nrow = 3, ncol = 3)
I tried a lot of things but I couldn't figure out.
Any help is appreciated.
Thanks in advance,
Zeray
Here's a version that is more complicated, but somewhat faster that Chase's ddply solution - some 200x faster :-)
uniqueMin <- function(m, idCol = 1L, minCol = ncol(m)) {
t(vapply(split(1:nrow(m), m[,idCol]), function(i, x, minCol) x[i, , drop=FALSE][which.min(x[i,minCol]),], m[1,], x=m, minCol=minCol))
}
And the following test code:
nRows <- 10000
nCols <- 100
ids <- nRows/5
m <- cbind(sample(ids, nRows, T), matrix(runif(nRows*nCols), nRows))
system.time( a<-uniqueMin(m, minCol=3L) ) # 0.07
system.time(ddply(as.data.frame(m), "V1", function(x) x[which.min(x$V3) ,])) # 15.72
You can use package plyr. Convert to a data.frame so you can group on the first column, then use which.min to extract the min row by group:
library(plyr)
ddply(as.data.frame(x), "V1", function(x) x[which.min(x$V3) ,])
V1 V2 V3
1 2 20 45
2 3 3 4
3 4 9 5