How can I cast a data frame with two related columns?

I have a table like this
data.table(ID = c(1,2,3,4,5,6),
R = c("s","s","n","n","s","s"),
S = c("a","a","a","b","b","b"))
and I'm trying to get this result
  a    b
s 1, 2 5, 6
n 3    4
Is there any option in data.table that can do this?

Here's an alternative that uses plain old data.table syntax:
DT[,lapply(split(ID,S),list),by=R]
# or...
DT[,lapply(split(ID,S),toString),by=R]
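Both calls assume the question's table has been saved as DT. A minimal, self-contained sketch of the string version with that assignment added (the name res is only for illustration):

```r
library(data.table)

# The question's table, saved under the name the answer uses
DT <- data.table(ID = c(1, 2, 3, 4, 5, 6),
                 R  = c("s", "s", "n", "n", "s", "s"),
                 S  = c("a", "a", "a", "b", "b", "b"))

# Within each R group, split ID by S and collapse each piece into one string
res <- DT[, lapply(split(ID, S), toString), by = R]
print(res)
```

The column names come from split(), so this sketch assumes every R group contains the same S values, as is the case here.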

You can use dcast from reshape2 with the appropriate aggregating function:
library(functional)
library(reshape2)
dcast(df, R~S, value.var='ID', fun.aggregate=Curry(paste0, collapse=','))
# R a b
#1 n 3 4
#2 s 1,2 5,6
Or even shorter, as @akrun pointed out:
dcast(df, R~S, value.var='ID', toString)
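Since the question asks about data.table specifically: data.table provides its own dcast method, so a similar call should work without reshape2. A minimal sketch, assuming the question's table is stored as DT (wide is an illustrative name):

```r
library(data.table)

DT <- data.table(ID = c(1, 2, 3, 4, 5, 6),
                 R  = c("s", "s", "n", "n", "s", "s"),
                 S  = c("a", "a", "a", "b", "b", "b"))

# One row per R, one column per S; the IDs in each cell are collapsed with toString
wide <- dcast(DT, R ~ S, value.var = "ID", fun.aggregate = toString)
print(wide)
```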

You could try:
library(dplyr)
library(tidyr)
df %>%
group_by(R, S) %>%
summarise(i = toString(ID)) %>%
spread(S, i)
Which gives:
#Source: local data table [2 x 3]
#Groups:
#
# R a b
#1 n 3 4
#2 s 1, 2 5, 6
Note: This will store the result in a string. If you want a more convenient format to access the elements, you could store in a list:
df2 <- df %>%
group_by(R, S) %>%
summarise(i = list(ID)) %>%
spread(S, i)
Which gives:
#Source: local data table [2 x 3]
#Groups:
#
# R a b
#1 n <dbl[1]> <dbl[1]>
#2 s <dbl[2]> <dbl[2]>
You can then access the elements by doing:
> df2$a[[2]][2]
#[1] 2
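In current tidyr releases spread() is superseded by pivot_wider(). A sketch of the string version rewritten that way, assuming a dplyr version that supports the .groups argument (df and wide are illustrative names):

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                 R  = c("s", "s", "n", "n", "s", "s"),
                 S  = c("a", "a", "a", "b", "b", "b"))

wide <- df %>%
  group_by(R, S) %>%
  summarise(i = toString(ID), .groups = "drop") %>%  # one string per (R, S) pair
  pivot_wider(names_from = S, values_from = i)       # the S values become columns
print(wide)
```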

Related

Remove rows that have a duplicate in R

I have
a <- c(rep("A", 3), rep("B", 3), rep("C",2), rep("D", 1))
b <- c(1,1,2,4,1,1,2,2,5)
df <-data.frame(a,b)
Based on df$a, I would like to return only the rows whose value of df$a occurs exactly once, i.e. has no duplicate. In this example that would be 1 D 5
I have tried duplicated(), !duplicated() and unique(), but none of them outputs what I need.
Cleanest way with dplyr:
library(dplyr)
df %>% group_by(a) %>%
filter(n() == 1)
Output:
# A tibble: 1 x 2
# Groups: a [1]
a b
<chr> <dbl>
1 D 5
One option
df[!(df$a %in% df$a[duplicated(df$a)]),]
a b
9 D 5
Using data.table
library(data.table)
setDT(df)
df[, tmp:= .N, by = a][tmp == 1, -"tmp"]
a b
1: D 5
With base R,
x <- table(df[,1])
df[rep(x<2,x),]
gives,
# a b
# 9 D 5
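The rep(x<2, x) approach lines up only when the rows of df are already grouped in the order that table() reports them. A sketch that does not depend on row order, using ave() to attach each group's size to its rows (shuffled, group_size and singles are illustrative names):

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 2), rep("D", 1))
b <- c(1, 1, 2, 4, 1, 1, 2, 2, 5)
df <- data.frame(a, b)

# Reverse the rows so the groups are no longer in table() order
shuffled <- df[rev(seq_len(nrow(df))), ]

# For every row, the number of rows that share its value of a
group_size <- ave(seq_along(shuffled$a), shuffled$a, FUN = length)

# Keep only the rows whose value of a occurs once
singles <- shuffled[group_size == 1, ]
print(singles)
```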

Looping through list of dataframes with a group_by and piping function?

I have a list of dataframes which I am trying to apply a script to which works for a single data frame.
Part of the script uses both piping and group_by:
df2 <- df1 %>%
group_by (col1) %>%
summarise(newcol = sum(col2))
I've tried various loops and variations with lapply but haven't found a way to make it work with a list of data frames, where it would be something along the lines of:
mylist2 <- mylist1 %>%
group_by (col1) %>%
summarise(newcol = sum(col2))
But obviously changed around to work with loops or lapply. I'm probably missing something simple here but would appreciate some help. Thanks
PS - I looked at providing the data from the lists but wasn't able to provide reproducible samples.
Here is a tidyverse way.
# generate some data
mylist1 <- replicate(2, data.frame(col1 = rep(letters[1:2], 2),
col2 = 1:4),
simplify = FALSE)
library(purrr)
library(dplyr)
mylist1 %>%
map(., ~ group_by(., col1) %>%
summarise(new_col = sum(col2)))
#[[1]]
# A tibble: 2 x 2
# col1 new_col
# <fct> <int>
#1 a 4
#2 b 6
#[[2]]
# A tibble: 2 x 2
# col1 new_col
# <fct> <int>
#1 a 4
#2 b 6
In base R you might try lapply and tapply
lapply(mylist1, function(x)
tapply(X = x[["col2"]], INDEX = x[["col1"]], FUN = 'sum'))
#[[1]]
#a b
#4 6
#[[2]]
#a b
#4 6
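Since the question mentions lapply, the original dplyr pipeline can also be kept as it is by wrapping it in a function and applying that to each list element. A minimal sketch on the same generated data (the helper name summarise_one is hypothetical):

```r
library(dplyr)

mylist1 <- replicate(2, data.frame(col1 = rep(letters[1:2], 2),
                                   col2 = 1:4),
                     simplify = FALSE)

# The pipeline that works for a single data frame, wrapped in a function
summarise_one <- function(d) {
  d %>%
    group_by(col1) %>%
    summarise(newcol = sum(col2))
}

# Apply it to every data frame in the list
mylist2 <- lapply(mylist1, summarise_one)
print(mylist2)
```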

Get description of groups from within a grouped data frame

I need to write a function that will take in a grouped data frame (from dplyr) and make a plot for each group, with the title describing what group it is for. The kicker is I don't know what the grouping variable is, or even how many there will be.
I've hacked together something using groups to get the grouping variables and then accessing the value with .[1,g], where g is a character version of the grouping variable names, as below.
Although I'm new to dplyr, this feels like the wrong way to go about this, that is, it's not really a dplyr native way of doing it. It works in the little testing I've done but I'm worried it will fail in some odd circumstance I haven't foreseen. How would you all do it? Is there a more dplyr-ish way of doing it?
On the odd chance that what I've done is actually a good idea, I've posted it as an answer for you all to vote on as appropriate.
library(data.table)
setDT(d) # or create directly as data.table
par(mfrow = c(2, 3))
# .BY is a named list holding the current group's values
d[, plot(y, main = paste(names(.BY), .BY, sep = "=", collapse = ", ")), by = .(A, B)]
This is what I've hacked together; as described in the question, it uses groups to get the grouping variables and then accessing the value with .[1,g], where g is a character version of the grouping variable names, as below.
Instead of making a plot, it just makes a data frame with the title as a variable.
library(dplyr)
d <- as.tbl(data.frame(expand.grid(A=1:3,B=1:2,y=1:2)))
d1 <- d %>% group_by(A)
g <- unlist(lapply(groups(d1), paste))
d1 %>% do(data.frame(title=paste(paste(g, "=", .[1,g]), collapse=", "), stringsAsFactors=FALSE))
## Source: local data frame [3 x 2]
## Groups: A [3]
##
## A title
## <int> <chr>
## 1 1 A = 1
## 2 2 A = 2
## 3 3 A = 3
d1 <- d %>% group_by(A, B)
g <- unlist(lapply(groups(d1), paste))
d1 %>% do(data.frame(title=paste(paste(g, "=", .[1,g]), collapse=", "), stringsAsFactors=FALSE))
## Source: local data frame [6 x 3]
## Groups: A, B [6]
##
## A B title
## <int> <int> <chr>
## 1 1 1 A = 1, B = 1
## 2 1 2 A = 1, B = 2
## 3 2 1 A = 2, B = 1
## 4 2 2 A = 2, B = 2
## 5 3 1 A = 3, B = 1
## 6 3 2 A = 3, B = 2
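More recent dplyr releases expose the grouping structure directly: group_keys() returns one row per group containing the grouping columns. A sketch that builds the same titles from it, assuming a dplyr version that provides group_keys() (the helper group_titles is hypothetical):

```r
library(dplyr)

# Hypothetical helper: one "var = value, ..." label per group of a grouped data frame
group_titles <- function(grouped) {
  keys <- group_keys(grouped)  # one row per group, one column per grouping variable
  # Build "name = value" for each grouping column, then join the columns row by row
  parts <- Map(function(name, value) paste(name, "=", value), names(keys), keys)
  do.call(paste, c(unname(parts), sep = ", "))
}

d <- expand.grid(A = 1:3, B = 1:2, y = 1:2)
print(group_titles(group_by(d, A)))
print(group_titles(group_by(d, A, B)))
```

The function never names the grouping variables itself, so it should work for any number of them.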

Binding a list variable into a new data frame

I am using dplyr version 0.4.1, and am trying to wrap my head around list variables.
I am having trouble creating a new data frame (or a tbl_df or data_frame or whatever) from a table containing a list variable.
For example, if I have a tbl_df like so:
x <- c(1,2,3)
y <- c(3,2,1)
d <- data_frame(X = list(x, y))
d
## Source: local data frame [2 x 1]
##
## X
## 1 <dbl[3]>
## 2 <dbl[3]>
Assuming all the values of the list variable X have the same length or dimensions, is there an operation I can run to create a table that looks like rbind(x, y) from the list variable inside the table?
I am hoping to get something that will look like:
data_frame(V1 = c(1, 3), V2 = c(2, 2), V3 = c(3, 1))
## Source: local data frame [2 x 3]
##
## V1 V2 V3
## 1 1 2 3
## 2 3 2 1
The closest I got to my desired result was a stacked column:
d %>% tidyr::unnest(X)
I thought that maybe using rowwise to group by row might allow me to do an operation for each row, but I am seeing the same results as above.
d %>% rowwise %>% tidyr::unnest(X) # %>% some extra commands here??
You can do a little work on d first, then use bind_rows()
library(dplyr)
d$X %>%
lapply(function(x) data.frame(matrix(x, 1))) %>%
bind_rows
# Source: local data frame [2 x 3]
#
# X1 X2 X3
# 1 1 2 3
# 2 3 2 1
Another way is to use tbl_dt after rbindlist(), which can also be fed into dplyr functions
library(data.table)
tbl_dt(rbindlist(lapply(d$X, as.list)))
# Source: local data table [2 x 3]
#
# V1 V2 V3
# 1 1 2 3
# 2 3 2 1
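Because the question describes the target as looking like rbind(x, y), a base R route is to call rbind on the list elements directly. A minimal sketch, using tibble() in place of the older data_frame() (m and res are illustrative names):

```r
library(tibble)

x <- c(1, 2, 3)
y <- c(3, 2, 1)
d <- tibble(X = list(x, y))

# Stack the list elements as rows of a matrix, then convert it to a data frame
m <- do.call(rbind, d$X)
res <- as.data.frame(m)
print(res)
```

Like the answers above, this assumes every element of X has the same length.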

dplyr filter: Get rows with minimum of variable, but only the first if multiple minima

I want to make a grouped filter using dplyr, in a way that within each group only that row is returned which has the minimum value of variable x.
My problem is: As expected, in the case of multiple minima all rows with the minimum value are returned. But in my case, I only want the first row if multiple minima are present.
Here's an example:
df <- data.frame(
A=c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
x=c(1, 1, 2, 2, 3, 4, 5, 5, 5),
y=rnorm(9)
)
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, x == min(x))
As expected, all minima are returned:
Source: local data frame [6 x 3]
Groups: A
A x y
1 A 1 -1.04584335
2 A 1 0.97949399
3 B 2 0.79600971
4 C 5 -0.08655151
5 C 5 0.16649962
6 C 5 -0.05948012
With ddply, I would have approached the task this way:
library(plyr)
ddply(df, .(A), function(z) {
z[z$x == min(z$x), ][1, ]
})
... which works:
A x y
1 A 1 -1.04584335
2 B 2 0.79600971
3 C 5 -0.08655151
Q: Is there a way to approach this in dplyr? (For speed reasons)
Update
With dplyr >= 0.3 you can use the slice function in combination with which.min, which would be my favorite approach for this task:
df %>% group_by(A) %>% slice(which.min(x))
#Source: local data frame [3 x 3]
#Groups: A
#
# A x y
#1 A 1 0.2979772
#2 B 2 -1.1265265
#3 C 5 -1.1952004
Original answer
For the sample data, it is also possible to use two filter calls one after the other:
group_by(df, A) %>%
filter(x == min(x)) %>%
filter(1:n() == 1)
Just for completeness: Here's the final dplyr solution, derived from the comments of @hadley and @Arun:
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, rank(x, ties.method="first")==1)
For what it's worth, here's a data.table solution, to those who may be interested:
# approach with setting keys
dt <- as.data.table(df)
setkey(dt, A,x)
dt[J(unique(A)), mult="first"]
# without using keys
dt <- as.data.table(df)
dt[dt[, .I[which.min(x)], by=A]$V1]
This can be accomplished by using row_number combined with group_by. row_number handles ties by assigning a rank not only by the value but also by the relative order within the vector. To get the first row of each group with the minimum value of x:
df.g <- group_by(df, A)
filter(df.g, row_number(x) == 1)
For more information see the dplyr vignette on window functions.
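The tie handling is easy to see on a small vector. A sketch comparing row_number() with min_rank(), which gives tied values the same rank (the vector x is made up for illustration):

```r
library(dplyr)

x <- c(5, 1, 1, 2)

# row_number() breaks ties by position, so exactly one element gets rank 1
rn <- row_number(x)
# min_rank() gives tied values the same rank, so both 1s get rank 1
mr <- min_rank(x)

print(which(rn == 1))
print(which(mr == 1))
```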
dplyr offers the slice_min function, which does the job with the argument with_ties = FALSE:
library(dplyr)
df %>%
group_by(A) %>%
slice_min(x, with_ties = FALSE)
Output :
# A tibble: 3 x 3
# Groups: A [3]
A x y
<fct> <dbl> <dbl>
1 A 1 0.273
2 B 2 -0.462
3 C 5 1.08
Another way to do it:
set.seed(1)
x <- data.frame(a = rep(1:2, each = 10), b = rnorm(20))
x <- dplyr::arrange(x, a, b)
dplyr::filter(x, !duplicated(a))
Result:
a b
1 1 -0.8356286
2 2 -2.2146999
Could also be easily adapted for getting the row in each group with maximum value.
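A sketch of that adaptation, assuming the same simulated data: sorting b in descending order within a puts each group's maximum first (x_desc and top are illustrative names).

```r
library(dplyr)

set.seed(1)
x <- data.frame(a = rep(1:2, each = 10), b = rnorm(20))

# Sort b in descending order within a, so the first row of each group is its maximum
x_desc <- arrange(x, a, desc(b))

# Keep the first row seen for each value of a
top <- filter(x_desc, !duplicated(a))
print(top)
```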
If you want to filter on the minimum of x and then on the minimum of y, an intuitive way to do it is with filtering functions:
> df
A x y
1 A 1 1.856368296
2 A 1 -0.298284187
3 A 2 0.800047796
4 B 2 0.107289719
5 B 3 0.641819999
6 B 4 0.650542284
7 C 5 0.422465687
8 C 5 0.009819306
9 C 5 -0.482082635
df %>% group_by(A) %>%
filter(x == min(x), y == min(y))
# A tibble: 3 x 3
# Groups: A [3]
A x y
<chr> <dbl> <dbl>
1 A 1 -0.298
2 B 2 0.107
3 C 5 -0.482
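This keeps only rows where x and y are both at their group minimum, so a group returns no rows when no single row holds both minima.
To take the smallest y among the rows with the smallest x, chain two filters, which is also more readable: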
df %>% group_by(A) %>%
filter(x == min(x)) %>%
filter(y == min(y))
# A tibble: 3 x 3
# Groups: A [3]
A x y
<chr> <dbl> <dbl>
1 A 1 -0.298
2 B 2 0.107
3 C 5 -0.482
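The two forms agree on the data above but are not equivalent in general. A small sketch with made-up values where they differ (df2, both and chained are illustrative names):

```r
library(dplyr)

# In this group the smallest y (0) does not sit on a row with the smallest x (1)
df2 <- data.frame(A = c("A", "A", "A"),
                  x = c(1, 1, 2),
                  y = c(5, 4, 0))

# Both conditions at once: no row has x == 1 and y == 0, so nothing is returned
both <- df2 %>% group_by(A) %>%
  filter(x == min(x), y == min(y))

# Chained: keep the rows with minimal x, then the smallest y among those rows
chained <- df2 %>% group_by(A) %>%
  filter(x == min(x)) %>%
  filter(y == min(y))

print(nrow(both))
print(chained)
```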
I like sqldf for its simplicity:
library(sqldf)
sqldf("select A,min(X),y from 'df.g' group by A")
Output:
A min(X) y
1 A 1 -1.4836989
2 B 2 0.3755771
3 C 5 0.9284441
For the sake of completeness, here's the base R answer:
df[with(df, ave(x, A, FUN = \(x) rank(x, ties.method = "first")) == 1), ]
# A x y
#1 A 1 0.1076158
#4 B 2 -1.3909084
#7 C 5 0.3511618
