Select top n columns (based on an aggregation) - r

I have a data set with hundreds of columns, and I want to keep the top 20 columns with the highest average (the aggregation could also be sum or SD).
How can I do this efficiently?
One way I can think of is to create a vector of the column averages, sort it in descending order, keep the top n values, and then use it to subset my data set.
I am looking for a more elegant way, something that can also be part of a dplyr %>% pipe flow.
Code for creating a dummy dataset is below; I would also appreciate suggestions for more elegant ways to create a dummy dataset.
# initialize data set
set.seed(101)
df <- as.data.frame(matrix(round(runif(25, 2, 5), 0), nrow = 5, ncol = 5))

# add more columns in batches of five
for (i in 1:5) {
  set.seed(101)
  df_stage <- as.data.frame(matrix(
    round(runif(25, 5 * i, 10 * i), 0), nrow = 5, ncol = 5
  ))
  colnames(df_stage) <- paste0("v", (10 * i):(10 * i + 4))  # paste0 avoids spaces in the names
  df <- cbind(df, df_stage)
}
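On the dummy-data question: a more compact way to build a similar frame is to draw one matrix and name it in a single step (a sketch; the values differ from the loop above because runif() is called only once, and df2 is an illustrative name):
set.seed(101)
df2 <- as.data.frame(matrix(round(runif(150, 2, 50)), nrow = 5))
names(df2) <- paste0("v", seq_len(ncol(df2)))  # v1 ... v30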

Another tidyverse approach with a bit of reshaping:
library(tidyverse)
n <- 3
df %>%
  summarise_all(mean) %>%
  gather() %>%
  top_n(n, value) %>%
  pull(key) %>%
  df[.]
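On a current tidyverse, gather() and top_n() are superseded; an equivalent pipe (a sketch, assuming tidyr >= 1.0 and dplyr >= 1.0) would be:
df %>%
  summarise(across(everything(), mean)) %>%
  pivot_longer(everything()) %>%
  slice_max(value, n = n) %>%
  pull(name) %>%
  df[.]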

We can do this with:
library(dplyr)
n <- 3
df %>%
  summarise_all(mean) %>%
  unlist %>%
  order(., decreasing = TRUE) %>%
  head(n) %>%
  df[.]
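The same idea also fits in one line of base R (a sketch; it assumes every column is numeric so that colMeans() applies):
df[order(colMeans(df), decreasing = TRUE)[seq_len(n)]]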

Related

sample multiple different sample sizes using crossing and sample_n to create single df

I am attempting to sample a dataframe using sample_n. I know that sample_n usually takes a single size= argument at a time; however, I would like to sample sizes from 2 to the maximum number of rows in the df. Unfortunately, the code I have compiled below does not do the job. The needed output would be a dataframe with an id= column, or a list divided by the id column from crossing().
df <- data.frame(Date = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))

data_sampled_by_stratum <- df %>%
  group_by(Date) %>%
  crossing(id = seq(500)) %>% # repeat dataframes
  group_by(id) %>%
  sample_n(size = c(2:15)) %>%
  group_by(CLUSTER_ID, Date) %>%
  filter(n() > 2)
If you had a column with different sites, you could do this:
data_sampled_by_stratum <- data_grouped_by_stratum %>%
  group_by(siteid, Date) %>%
  crossing(id = seq(500)) %>% # repeat dataframes
  sample_n(rbinom(1, sum(siteid == i), (1 - s)^2))  # i and s are assumed to be defined elsewhere
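A direct way to draw one sample per size, from 2 through nrow(df), and bind them with an id column is a purrr loop over the sizes (a sketch; the names data_sampled and sz are illustrative):
library(dplyr)
library(purrr)

data_sampled <- map_dfr(2:nrow(df), function(sz) {
  df %>%
    sample_n(size = sz) %>%
    mutate(id = sz)  # id records the sample size that produced each row
})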

R as.data.frame.matrix turns first column into row names

I want to turn a table into a data frame. There should be three columns: the zip code, outcome "0", and outcome "1". But as.data.frame.matrix turns the zip code into row names and makes it unusable.
I tried to add a fourth column with imaginary IDs (1:100) so that R would use those as row names, but R tells me that "all arguments must be the same length" - which they are!
id <- 1:5000
zip <- sample(100:200, 5000, replace = TRUE)
outcome <- rbinom(5000, 1, 0.23)
df <- data.frame(id, outcome, zip)
abs <- table(df$zip, df$outcome)
abs <- as.data.frame.matrix(abs)
Does someone have a nice and slick idea? Thanks in advance!
Edit:
When:
abs <- as.matrix(as.data.frame(abs))
I get something close to what I want, but the outcomes end up together in one column. How do I separate them so the result looks like the table again?
You can get to your desired result more easily with dplyr and tidyr:
library(dplyr)
library(tidyr)
id <- 1:5000
zip <- sample(100:200, 5000, replace = TRUE)
outcome <- rbinom(5000, 1, 0.23)
df <- data.frame(id, outcome, zip)
df <- df %>%
  group_by(zip, outcome) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  spread(outcome, freq)
You are supplying only 100 values to a data.frame that has 101 rows:
> nrow(abs)
[1] 101
so this would work
abs$new_col <- 1:101
I think you want this:
abs2 <- as.data.frame(abs) %>% select(2,3,1)
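If you want to keep the as.data.frame.matrix route, another option (a sketch, assuming the tibble package is available) is to promote the row names back into a regular column afterwards:
library(tibble)
abs <- as.data.frame.matrix(table(df$zip, df$outcome)) %>%
  rownames_to_column(var = "zip")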

Slice 2nd half of data frame in R

I can easily slice the 1st half (or any other percentage) of a data frame using:
library(dplyr)
df <- data.frame(x = 1:10)
df %>%
  slice(seq(0.5 * n()))
However, how can I slice the 2nd half of my data frame?
With negative indices:
library(dplyr)
df <- data.frame(x = 1:10)
df %>%
  slice(-seq(0.5 * n()))
slice() can do two things: keep rows if you give it positive row numbers, or drop rows if you give it negative row numbers. You can use either of these to grab the second half of your dataframe:
# Keeping later rows
df %>% slice(seq(n()/2, n()))
# Dropping earlier rows
df %>% slice(-seq(1, n()/2))
You'll want to be careful if you have an odd number of rows, since n()/2 won't be an integer in those cases. Using seq(0.5 * n()) as in your example could run into this problem too. To be safe, you can be explicit about how to handle the middle cases with floor() and ceiling():
df <- data.frame(x = 1:11)
# Include row 5
df %>% slice(seq(floor(n()/2), n()))
# Exclude row 5
df %>% slice(seq(ceiling(n()/2), n()))
You can also just slightly modify your seq argument:
df <- data.frame(x = 1:10)
df %>%
  slice(seq(n() * 0.5, n()))
Update per @Kerry Jackson's suggestion:
df %>%
  slice(seq(floor(n() * 0.5) + 1, n()))
If there is an odd number of rows, you'll need to choose how to deal with the middle row.
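On dplyr >= 1.0.0 there is also a dedicated verb for this (a sketch; as I understand it, prop is rounded down, so the middle row of an odd-length frame is dropped):
df %>% slice_tail(prop = 0.5)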

Dataframe is too big for supercomputer

I am trying to create a matrix of donors and recipients, populated with the sum of donations for each pair, keeping any NAs.
It works well for small datasets (see toy example below), but when I switch to national datasets (3M entries) several problems emerge: besides being painstakingly slow, the creation of the fill df consumes all the memory of the (super)computer and I get the error "Error: cannot allocate vector of size 1529.0 Gb".
How should I tackle the problem?
Thanks a lot!
library(dplyr)
library(tidyr)
library(bigmemory)

candidate_id <- c("cand_1", "cand_1", "cand_1", "cand_2", "cand_3")
donor_id <- c("don_1", "don_1", "don_2", "don_2", "don_3")
donation <- c(1, 2, 3.5, 4, 10)
df <- data.frame(candidate_id, donor_id, donation)
colnames(df) <- c("candidate_id", "donor_id", "donation")

fill <- df %>%
  group_by(df$candidate_id, df$donor_id) %>%
  summarise(tot_donation = sum(as.numeric(donation))) %>%
  complete(df$candidate_id, df$donor_id)
fill <- unique(fill[, 1:3])
colnames(fill) <- c("candidate_id", "donor_id", "tot_donation")

nrow <- length(unique(df$candidate_id))
ncol <- length(unique(df$donor_id))
row_names <- unique(fill$candidate_id)
col_names <- unique(fill$donor_id)

x <- big.matrix(nrow, ncol, init = NA, dimnames = list(row_names, col_names))
for (i in 1:nrow) {
  for (j in 1:ncol) {
    x[i, j] <- fill[which(fill$candidate_id == row_names[i] &
                            fill$donor_id == col_names[j]), 3]
  }
}
I see you're using unique because your output has duplicated values. Based on this question, you should try the following in order to avoid duplication:
fill <- df %>%
  group_by(candidate_id, donor_id) %>%
  summarise(tot_donation = sum(donation)) %>%
  ungroup %>%
  complete(candidate_id, donor_id)
Can you then try to create your desired output? I think unique can be very resource-heavy, so try to avoid calling it.
The tidyr version of what Benjamin suggested should be:
spread(fill, donor_id, tot_donation)
EDIT: By the way, since you tagged the question with sparse-matrix, you could indeed use sparsity to your advantage:
library(Matrix)
library(dplyr)
df <- data.frame(
  candidate_id = c("cand_1", "cand_1", "cand_1", "cand_2", "cand_3"),
  donor_id = c("don_1", "don_1", "don_2", "don_2", "don_3"),
  donation = c(1, 2, 3.5, 4, 10),
  stringsAsFactors = TRUE  # make the id columns factors; on R >= 4.0 this is no longer the default
)

summ <- df %>%
  group_by(candidate_id, donor_id) %>%
  summarise(tot_donation = sum(donation)) %>%
  ungroup

num_candidates <- nlevels(df$candidate_id)
num_donors <- nlevels(df$donor_id)

smat <- Matrix(0, num_candidates, num_donors, sparse = TRUE, dimnames = list(
  levels(df$candidate_id),
  levels(df$donor_id)
))

indices <- summ %>%
  select(candidate_id, donor_id) %>%
  mutate_all(unclass) %>%  # the integer factor codes become row/column indices
  as.matrix

smat[indices] <- summ$tot_donation
smat
3 x 3 sparse Matrix of class "dgCMatrix"
       don_1 don_2 don_3
cand_1     3   3.5     .
cand_2     .   4.0     .
cand_3     .     .    10
You might try
library(reshape2)
dcast(fill, candidate_id ~ donor_id,
      value.var = "tot_donation",
      fun.aggregate = sum)
I don't know if it will avoid the memory issue, but it will likely be much faster than a double for loop.
I have to run to a meeting, but part of me wonders if there is a way to do this with outer.
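Along similar lines, base R's xtabs() builds the cross-table directly from a formula, and it even has a sparse option (a sketch, reusing the de-duplicated summ table from the answer above; sparse = TRUE needs the Matrix package):
library(Matrix)
xtabs(tot_donation ~ candidate_id + donor_id, data = summ, sparse = TRUE)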

Can't split dataframe into equal buckets preserving order without introducing Xn. prefix

I am trying to split an ordered data frame into 10 equal buckets. The following works but it introduces an X1., X2., X3. ... prefix to each bucket, which prevents me from iterating over the buckets to sum them.
num_dfs <- 10
buckets <- split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs)))
This produces a list of 10 data frames, the last of which looks like:
$`10`
       predicted_duration actual_duration
177188         23.7402944               6
466561         23.7402663              12
479556         23.7401721               5
147585         23.7401666              48
Here's the crude code I am using to try to sum the groups.
for (i in 1:10) {
  p <- sum(as.data.frame(df[i], row.names = NULL)$X1.actual_duration)  # X1., X2., ...
  print(paste(i, "=", p))
}
How do I remove the Xn. grouping prefix or programmatically reference it using the index i?
Here's a similar reproducible example:
df <- data.frame(actual_duration = sample(100))
num_dfs <- 10
df_grouped <- as.data.frame(split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs))))

for (i in 1:10) {
  p <- sum(df[i]$actual_duration)  # does not work because the postfix .1, .2, ... was added by R
  print(paste(p))
}
I'm not entirely clear on what your issue is, but if you are just trying to get the sum by group, couldn't you use:
library(tidyverse)
df <- data.frame(actual_duration=sample(100))
df %>%
  arrange(actual_duration) %>%
  mutate(group = rep(1:10, each = 10)) %>%
  group_by(group) %>%
  summarise(sums = sum(actual_duration))
Alternatively, if you want to keep the list format:
df %>%
  arrange(actual_duration) %>%
  mutate(group = factor(rep(1:10, each = 10))) %>%
  split(., .$group) %>%
  map(., function(x) sum(x$actual_duration))
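If you'd rather keep the original split() buckets, you can also avoid the Xn. prefix entirely by not wrapping the list in as.data.frame() and summing each element directly (a sketch, reusing df and num_dfs from the reproducible example):
buckets <- split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs)))
sapply(buckets, function(b) sum(b$actual_duration))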
