Slice 2nd half of data frame in R - r

I can easily slice the 1st half (or any other percentage) of a data frame using:
library(dplyr)
df <- data.frame(x = 1:10)
df %>%
slice(seq(0.5 * n()))
However, how can I slice the 2nd half of my data frame?

With negative indices
library(dplyr)
df <- data.frame(x = 1:10)
df %>%
slice(-seq(0.5 * n()))

slice() can do two things: keep rows if you give it positive row numbers, or drop rows if you give it negative row numbers. You can use either of these to grab the second half of your dataframe:
# Keeping later rows (start one past the midpoint so the halves don't overlap)
df %>% slice(seq(n()/2 + 1, n()))
# Dropping earlier rows
df %>% slice(-seq(1, n()/2))
You'll want to be careful if you have an odd number of rows, since n()/2 won't be an integer in those cases. Using seq(0.5 * n()) as in your example could run into this problem too. To be safe, you can be explicit about how to handle the middle cases with floor() and ceiling():
df <- data.frame(x = 1:11)
# Include row 5
df %>% slice(seq(floor(n()/2), n()))
# Exclude row 5
df %>% slice(seq(ceiling(n()/2), n()))
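If you'd rather not depend on how slice() treats fractional indices, here is a minimal base R sketch of the same idea. The helper name second_half and its middle_in_second argument are made up for illustration; they are not part of dplyr.

```r
# Hypothetical helper: return the second half of a data frame.
# middle_in_second controls where the middle row of an odd-length
# data frame ends up (an assumption of this sketch).
second_half <- function(data, middle_in_second = TRUE) {
  n <- nrow(data)
  # Number of rows that belong to the first half
  n_first <- if (middle_in_second) floor(n / 2) else ceiling(n / 2)
  # Logical indexing avoids the x[-integer(0)] pitfall when n_first is 0
  data[seq_len(n) > n_first, , drop = FALSE]
}

df_even <- data.frame(x = 1:10)
df_odd <- data.frame(x = 1:11)

second_half(df_even)$x        # rows 6 to 10
second_half(df_odd)$x         # rows 6 to 11 (middle row kept)
second_half(df_odd, FALSE)$x  # rows 7 to 11 (middle row dropped)
```

Passing drop = FALSE keeps the result a data frame even when there is only one column.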

You can also just slightly modify your seq argument:
df <- data.frame(x = 1:10)
df %>%
slice(seq(n() * 0.5, n()))
Update per @Kerry Jackson's suggestion:
df %>%
slice(seq(floor(n() * 0.5) + 1, n()))
If the data frame has an odd number of rows, you'll need to decide how to handle the middle row.

Related

Gather a tibble with matrix columns

My tibble looks like this:
df = tibble(x = 1:3, col1 = matrix(rnorm(6), ncol = 2),
col2 = matrix(rnorm(6), ncol = 2))
It has three columns, two of which contain a matrix with 2 columns each (in my real data there are many more columns; this example just illustrates the problem). I transform this data to long format using gather:
gather(df, key, val, -x)
but this does not give me the desired result. It stacks only the first column of col1 and col2 and discards the rest. What I want is for val to contain the row vectors of col1 and col2, i.e. val should be a matrix-valued column (containing 1x2 matrices). However, the tidyverse does not seem to be able to deal with matrix-valued columns appropriately. Is there a way to achieve my desired result? (Ideally using routines from the tidyverse.)
Some of the columns are matrices. They need to be converted to proper data.frame columns first, and then reshaping works:
library(dplyr)
library(tidyr)
do.call(data.frame, df) %>%
pivot_longer(cols = -x)
Or use gather
do.call(data.frame, df) %>%
gather(key, val, -x)
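To see what do.call(data.frame, df) does to matrix columns, here is a small base R sketch. It uses a plain data.frame with fixed values instead of the tibble with rnorm() from the question; that substitution is an assumption made only to keep the result predictable.

```r
# Build a data frame with two matrix-valued columns (fixed values,
# not the random ones from the question, so the result is predictable)
df <- data.frame(x = 1:3)
df$col1 <- matrix(1:6, ncol = 2)
df$col2 <- matrix(7:12, ncol = 2)

# Each matrix column is passed to data.frame() as its own argument,
# which splits it into one ordinary column per matrix column
flat <- do.call(data.frame, df)

names(flat)
str(flat)
```

After this step every column is an ordinary vector, which is why pivot_longer() and gather() can then stack all of them.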
Another option is to convert each matrix to a vector with c() and then use unnest:
df %>%
mutate_at(-1, ~ list(c(.))) %>%
unnest(c(col1, col2))
Or, if the 'col1' and 'col2' values should end up in a single column:
df %>%
mutate_at(-1, ~ list(c(.))) %>%
pivot_longer(cols = -x) %>%
unnest(c(value))

R as.data.frame.matrix turns first column into row names

I want to turn a table into a data frame with three columns: (1) the zip code, (2) the count for outcome "0" and (3) the count for outcome "1". But as.data.frame.matrix turns the zip codes into row names, which makes them unusable as a column.
I tried to add a fourth column with dummy IDs (1:100) so that R would use those as row names instead, but R tells me that "all arguments must be the same length", which I believe they are.
id <- 1:5000
zip <- sample(100:200, 5000, replace = TRUE)
outcome <- rbinom(5000, 1, 0.23)
df <- data.frame(id, outcome, zip)
abs <- table(df$zip, df$outcome)
abs <- as.data.frame.matrix(abs)
Does anyone have a nice, slick idea? Thanks in advance!
Edit:
When I run:
abs <- as.matrix(as.data.frame(abs))
I get something close to what I want, but the outcomes are together in one column. How can I separate them so that the result looks like the table again?
You can get to your desired result more easily with dplyr and tidyr:
library(dplyr)
library(tidyr)
id <- 1:5000
zip <- sample(100:200, 5000, replace = TRUE)
outcome <- rbinom(5000, 1, 0.23)
df <- data.frame(id, outcome, zip)
df <- df %>% group_by(zip, outcome) %>%
summarise(freq = n()) %>%
ungroup() %>%
spread(outcome, freq)
You are supplying only 100 values to a data.frame that has 101 rows:
> nrow(abs)
[1] 101
so this would work:
abs$new_col <- 1:101
I think you want this:
abs2 <- as.data.frame(abs) %>% select(2,3,1)
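For completeness, here is a minimal base R sketch that keeps the wide layout of the table and moves the zip codes out of the row names into a regular column. The object names tab and wide are chosen for this illustration, and set.seed(1) is only there to make the run repeatable.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
zip <- sample(100:200, 5000, replace = TRUE)
outcome <- rbinom(5000, 1, 0.23)

# Cross-tabulate, then convert to a wide data frame
# (zip codes end up as row names at this point)
tab <- table(zip, outcome)
wide <- as.data.frame.matrix(tab)

# Copy the row names into a real column and reset the row names;
# check.names = FALSE keeps the column names "0" and "1" unchanged
wide <- data.frame(zip = as.integer(rownames(wide)), wide,
                   row.names = NULL, check.names = FALSE)

head(wide)
```

Because the count columns are named "0" and "1", they have to be accessed with wide[["0"]] or backticks rather than wide$0.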

Select top n columns (based on an aggregation)

I have a data set with hundreds of columns, and I want to keep the top 20 columns with the highest average (or another aggregation such as sum or SD).
How can I do this efficiently?
One way I can think of is to create a vector of the averages of all columns, sort it in descending order, keep the top n values, and then use it to subset my data set.
I am looking for a more elegant way, ideally something that can also be part of a dplyr pipe (%>%) flow.
The code below creates a dummy data set; I would also appreciate suggestions for more elegant ways to create one.
#initialize data set
set.seed(101)
df <- as.data.frame(matrix(round(runif(25,2,5),0), nrow = 5, ncol = 5))
# add more columns
for (i in 1:5){
set.seed (101)
df_stage <-
as.data.frame(matrix(
round(runif(25,5*i , 10*i), 0), nrow = 5, ncol = 5
))
colnames(df_stage) <- paste("v",(10*i):(10*i+4))
df <- cbind(df, df_stage)
}
Another tidyverse approach with a bit of reshaping:
library(tidyverse)
n = 3
df %>%
summarise_all(mean) %>%
gather() %>%
top_n(n, value) %>%
pull(key) %>%
df[.]
We can do this with
library(dplyr)
n <- 3
df %>%
summarise_all(mean) %>%
unlist %>%
order(., decreasing = TRUE) %>%
head(n) %>%
df[.]
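The idea described in the question (compute one score per column, sort, keep the top n) can also be wrapped in a small base R function. This is only a sketch: top_n_cols is a hypothetical helper name, and the random test data is not the data set from the question.

```r
# Hypothetical helper: keep the n columns with the highest score,
# where the score is any function returning one number per column
top_n_cols <- function(data, n, fun = mean) {
  scores <- vapply(data, fun, numeric(1))   # one named score per column
  ranked <- names(sort(scores, decreasing = TRUE))
  keep <- ranked[seq_len(min(n, length(ranked)))]
  data[, keep, drop = FALSE]
}

set.seed(101)
df <- as.data.frame(matrix(runif(50), nrow = 5, ncol = 10))

top3 <- top_n_cols(df, 3)            # top 3 by mean
top3_sd <- top_n_cols(df, 3, sd)     # top 3 by standard deviation
names(top3)
```

Since the data frame is the first argument, the helper also fits into a pipe, e.g. df %>% top_n_cols(3).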

R dedupe records that are not exactly duplicates

I have a list of records that I need to dedupe. Many rows look like combinations of the same set of values, but the regular deduplication functions do not work because the two columns are not exact duplicates. Below is a reproducible example.
df <- data.frame( A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"))
Below is the desired output that I'm looking for.
df_Final <- data.frame( A = c("2","2","2","331","391","481"),
B = c("43","501","502","491","496","490"))
I guess the idea is that you want to find when the elements in column A first appear in column B
idx = match(df$A, df$B)
and keep the row if the element in A isn't in B (is.na(idx)) or the element in A occurs before its first occurrence in B (seq_along(idx) < idx)
df[is.na(idx) | seq_along(idx) < idx,]
Maybe a more-or-less literal tidyverse approach to this would be to create and then drop a temporary column
library(tidyverse)
df %>% mutate(idx = match(A, B)) %>%
filter(is.na(idx) | seq_along(idx) < idx) %>%
select(-idx)
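As a quick check, the match() logic can be run against the example data and compared with the desired output from the question. The variable names keep and result are introduced here just for the check.

```r
df <- data.frame(
  A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
  B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"),
  stringsAsFactors = FALSE
)
df_Final <- data.frame(
  A = c("2","2","2","331","391","481"),
  B = c("43","501","502","491","496","490"),
  stringsAsFactors = FALSE
)

# Position of the first occurrence of each A value in column B (NA if absent)
idx <- match(df$A, df$B)

# Keep a row if its A value never appears in B, or appears in B only later
keep <- is.na(idx) | seq_along(idx) < idx
result <- df[keep, ]
rownames(result) <- NULL

# Compare column by column with the desired output
all(result$A == df_Final$A) && all(result$B == df_Final$B)
```

On this data the rows kept are 1, 2, 3, 7, 8 and 9.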
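You can remove all rows that would be duplicates under some reordering of the two columns (note that this treats each row as an unordered pair, so on the example data it also keeps pairs such as 43/501 that the desired output above drops):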
require(dplyr)
df %>%
apply(1, sort) %>% t %>%
data.frame %>%
group_by_all %>%
slice(1)

Can't split dataframe into equal buckets preserving order without introducing Xn. prefix

I am trying to split an ordered data frame into 10 equal buckets. The following works but it introduces an X1., X2., X3. ... prefix to each bucket, which prevents me from iterating over the buckets to sum them.
num_dfs <- 10
buckets<-split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs)))
Produces a df[10] that looks like:
$`10`
predicted_duration actual_duration
177188 23.7402944 6
466561 23.7402663 12
479556 23.7401721 5
147585 23.7401666 48
Here's the crude code I am using to try to sum the groups.
for (i in c(1,2,3,4,5,6,7,8,9,10)){
p<-sum(as.data.frame(df[i],row.names=NULL)$X1.actual_duration) # X1., X2.,
print(paste(i,"=",p))
}
How do I remove the Xn. grouping prefix or programmatically reference it using the index i?
Here's a similar reproducible example:
df<-data.frame(actual_duration=sample(100))
num_dfs <- 10
df_grouped<-as.data.frame(split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs))))
for (i in c(1,2,3,4,5,6,7,8,9,10)){
p<-sum(df[i]$actual_duration) # does not work because postfix .1, .2.. was added by R
print(paste(p))
}
I'm not entirely clear on what your issue is, but if you are just trying to get the sum by group, couldn't you use:
library(tidyverse)
df <- data.frame(actual_duration=sample(100))
df %>%
arrange(actual_duration) %>%
mutate(group = rep(1:10, each = 10)) %>%
group_by(group) %>%
summarise(sums = sum(actual_duration))
Alternatively, if you want to keep the list format:
df %>%
arrange(actual_duration) %>%
mutate(group = factor(rep(1:10, each = 10))) %>%
split(., .$group) %>%
map(., function(x) sum(x$actual_duration))
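On the prefix itself: the X1., X2., ... names in the reproducible example come from wrapping the list returned by split() in as.data.frame(), which binds the buckets side by side and prefixes each column with the name of its bucket. Here is a base R sketch that keeps the list, preserves the original row order, and sums each bucket by index. The names bucket_id and bucket_sums are made up for this sketch, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
df <- data.frame(actual_duration = sample(100))
num_dfs <- 10

# Consecutive bucket labels in the existing row order;
# length.out guards against nrow(df) not being a multiple of num_dfs
bucket_id <- rep(seq_len(num_dfs),
                 each = ceiling(nrow(df) / num_dfs),
                 length.out = nrow(df))

# split() returns a named list of data frames; column names stay unchanged
buckets <- split(df, bucket_id)

# One sum per bucket, returned as a named numeric vector
bucket_sums <- vapply(buckets, function(b) sum(b$actual_duration), numeric(1))

for (i in seq_along(buckets)) {
  print(paste(i, "=", bucket_sums[[i]]))
}
```

Inside the loop, buckets[[i]] is the i-th bucket as a regular data frame, so buckets[[i]]$actual_duration works without any prefix.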

Resources