Get description of groups from within a grouped data frame - r

I need to write a function that will take in a grouped data frame (from dplyr) and make a plot for each group, with the title describing what group it is for. The kicker is I don't know what the grouping variable is, or even how many there will be.
I've hacked together something using groups to get the grouping variables and then accessing the value with .[1,g], where g is a character version of the grouping variable names, as below.
Although I'm new to dplyr, this feels like the wrong way to go about this, that is, it's not really a dplyr native way of doing it. It works in the little testing I've done but I'm worried it will fail in some odd circumstance I haven't foreseen. How would you all do it? Is there a more dplyr-ish way of doing it?
On the odd chance that what I've done is actually a good idea, I've posted it as an answer for you all to vote on as appropriate.

library(data.table)
setDT(d) # or create directly as data.table
par(mfrow = c(2, 3))
d[, plot(y, main = paste(names(.BY), .BY, sep = "=", collapse = ", ")), by = .(A, B)]

This is what I've hacked together; as described in the question, it uses groups to get the grouping variables and then accesses the value with .[1,g], where g is a character vector of the grouping variable names, as below.
Instead of making a plot, it just makes a data frame with the title as a variable.
library(dplyr)
d <- as.tbl(data.frame(expand.grid(A=1:3,B=1:2,y=1:2)))
d1 <- d %>% group_by(A)
g <- unlist(lapply(groups(d1), paste))
d1 %>% do(data.frame(title=paste(paste(g, "=", .[1,g]), collapse=", "), stringsAsFactors=FALSE))
## Source: local data frame [3 x 2]
## Groups: A [3]
##
## A title
## <int> <chr>
## 1 1 A = 1
## 2 2 A = 2
## 3 3 A = 3
d1 <- d %>% group_by(A, B)
g <- unlist(lapply(groups(d1), paste))
d1 %>% do(data.frame(title=paste(paste(g, "=", .[1,g]), collapse=", "), stringsAsFactors=FALSE))
## Source: local data frame [6 x 3]
## Groups: A, B [6]
##
## A B title
## <int> <int> <chr>
## 1 1 1 A = 1, B = 1
## 2 1 2 A = 1, B = 2
## 3 2 1 A = 2, B = 1
## 4 2 2 A = 2, B = 2
## 5 3 1 A = 3, B = 1
## 6 3 2 A = 3, B = 2
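A possibly more dplyr-native route, assuming a dplyr version that provides group_keys() (I believe 0.8.0 or later, which is newer than the version that printed the output above): group_keys() returns one row per group, so the titles can be built from it without indexing into the first row of each group. This is only a sketch; the names keys, parts and titles are made up for illustration.

```r
library(dplyr)

d <- expand.grid(A = 1:3, B = 1:2, y = 1:2)
d1 <- group_by(d, A, B)

# One row per group, one column per grouping variable
keys <- group_keys(d1)

# Build "name = value" for every grouping column, then join the columns with ", "
parts <- Map(function(name, value) paste(name, "=", value), names(keys), keys)
titles <- do.call(paste, c(unname(parts), sep = ", "))

print(titles)
```

As far as I know, group_split(d1) returns the groups in the same order as group_keys(d1), so titles[i] should pair with the i-th piece when plotting.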

Related

Paste letter index on string R

I want to paste a number and some letters together to index them. The columns of my dataframe are as follows;
When CNTR is NA, I want it to be the booking number + an index, so for booking 202653 for example, I want it to be 202653A and 202653B. I already achieved pasting the booking numbers into the CNTR column when it's empty with;
dfUNIT$CNTR <- ifelse(is.na(dfUNIT$CNTR), dfUNIT$BOOKING, dfUNIT$CNTR)
which gives me the following table;
But as I said, I need unique CNTR values. My dataframe contains thousands of rows and changes frequently. Is there a way to 'index' them the way I want (A, B, C, etc.)? Thank you in advance.
I'll make up some data,
dat <- data.frame(B=c(202658,202654,202653,202653),C=c("TCLU","KOCU",NA,NA))
dplyr
library(dplyr)
dat %>%
group_by(B) %>%
mutate(C = if_else(is.na(C), paste0(B, LETTERS[row_number()]), C))
# # A tibble: 4 x 2
# # Groups: B [3]
# B C
# <dbl> <chr>
# 1 202658 TCLU
# 2 202654 KOCU
# 3 202653 202653A
# 4 202653 202653B
A fundamental risk in this is if you ever have more than 26 rows for a booking: LETTERS has only 26 elements, so the 27th row gets an NA suffix (e.g., 202653NA). An alternative is to append a number instead (e.g., paste0(B, "_", row_number())) or add some other safeguards.
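To make the 26-row limit concrete, here is a small base R sketch. The booking number is the one from the question; the variable names suffix, bad and safe are only illustrative.

```r
# LETTERS has exactly 26 elements, so any index above 26 yields NA
suffix <- LETTERS[c(1, 26, 27)]

# paste0() turns the NA into the literal text "NA"
bad <- paste0(202653, suffix)

# A numeric suffix has no such upper limit
safe <- paste0(202653, "_", c(1, 26, 27))

print(bad)
print(safe)
```

The third element of bad ends in "NA" instead of a letter, and every row past the 26th in the same booking would get that same value, so the result is no longer unique.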
base R alternatives
do.call(rbind, by(dat, dat[,"B",drop=FALSE],
FUN = function(z) transform(z,
C = ifelse(is.na(C), paste0(B, LETTERS[seq_along(z$C)]), C)
)
))
or
append <- ave(dat$C, dat$B, FUN = function(z) ifelse(is.na(z), LETTERS[seq_along(z)], ""))
append
# [1] "" "" "A" "B"
dat$C <- paste0(ifelse(is.na(dat$C), dat$B, dat$C), append)
dat
# B C
# 1 202658 TCLU
# 2 202654 KOCU
# 3 202653 202653A
# 4 202653 202653B
If you don't insist on using letters as the index, here's a rough and ready dplyr solution based on rleid from the data.table package:
library(dplyr)
library(data.table)
df %>%
group_by(grp = rleid(B)) %>%
mutate(CNTR_new = if_else(is.na(CNTR), paste0(B, "_", grp), CNTR))
# # A tibble: 7 x 4
# # Groups: grp [5]
# B CNTR grp CNTR_new
# <dbl> <chr> <int> <chr>
# 1 12 TCU 1 TCU
# 2 13 NA 2 13_2
# 3 13 NA 2 13_2
# 4 15 NA 3 15_3
# 5 1 PVDU 4 PVDU
# 6 1 NA 4 1_4
# 7 5 NA 5 5_5
Data:
df <- data.frame(
B = c(12,13,13,15,1,1,5),
CNTR = c("TCU", NA, NA, NA, "PVDU", NA, NA)
)

Removing groups from dataframe if variable has repeated values

I would like to ask if there is a way of removing a group from a dataframe using dplyr (or any other way, for that matter) in the following way. Let's say I have a dataframe in the following form, grouped by Variable 1:
Variable 1 Variable 2
1 a
1 b
2 a
2 a
2 b
3 a
3 c
3 a
... ...
I would like to remove only groups that have two consecutive identical values in Variable 2. That is, in the table above it would remove group 2 because its values are a,a,b, but not group 3, where they are a,c,a. So I would get the table below:
Variable 1 Variable 2
1 a
1 b
3 a
3 c
3 a
... ...
To test for consecutive identical values, you can compare a value to the previous value in that column. In dplyr, this is possible with lag. (You could do the same thing with comparing to the next value, using lead. Result comes out the same.)
Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.
library(tidyverse)
df %>%
group_by(variable1) %>%
mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = T)) %>%
filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups: variable1 [2]
#> variable1 variable2 dupesInGroup
#> <int> <chr> <int>
#> 1 1 a 0
#> 2 1 b 0
#> 3 3 a 0
#> 4 3 c 0
#> 5 3 a 0
Created on 2018-05-10 by the reprex package (v0.2.0).
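The reason for na.rm in the comparison against lag is that the first element of each group has no predecessor. A minimal sketch on a single vector (the names cmp, without_na_rm and with_na_rm are illustrative only):

```r
library(dplyr)

b <- c("a", "c", "a")

# The first element has nothing before it, so its comparison is NA
cmp <- b != dplyr::lag(b)

without_na_rm <- all(cmp)             # NA: one comparison is unknown
with_na_rm <- all(cmp, na.rm = TRUE)  # TRUE: no consecutive repeats found

print(cmp)
print(with_na_rm)
```

Without na.rm, the leading NA would propagate into the per-group result for any group that has no repeats.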
prepare data frame:
df <- data.frame("Variable 1" = c(1, 1, 2, 2, 2, 3, 3, 3), "Variable 2" = unlist(strsplit("abaabaca", "")))
write functions to test if consecutive repetitions are there or not:
any.consecutive.p <- function(v) {
for (i in seq_len(max(length(v) - 1, 0))) { # loops zero times for vectors shorter than 2
if (v[i] == v[i + 1]) {
return(TRUE)
}
}
return(FALSE)
}
any.consecutive.in.col.p <- function(df, col) {
any.consecutive.p(df[, col])
}
any.consecutive.p() returns TRUE as soon as it finds the first consecutive repetition in a vector (v), and FALSE if there is none.
any.consecutive.in.col.p() looks for consecutive repetitions in a column of a data frame.
split data frame by values of Variable.1
df.l <- split(df, df$Variable.1)
df.l
$`1`
Variable.1 Variable.2
1 1 a
2 1 b
$`2`
Variable.1 Variable.2
3 2 a
4 2 a
5 2 b
$`3`
Variable.1 Variable.2
6 3 a
7 3 c
8 3 a
Finally, go over this list of data frames and test whether each one contains consecutive duplicates in the Variable.2 column.
If found, don't collect it.
Bind the collected data frames by rows.
Reduce(rbind, lapply(df.l, function(df) if(!any.consecutive.in.col.p(df, "Variable.2")) {df}))
Variable.1 Variable.2
1 1 a
2 1 b
6 3 a
7 3 c
8 3 a
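The same split, test and bind idea can also be sketched without an explicit loop, using rle() to detect runs and tapply() to apply the test per group. This is an alternative sketch, not the method above; has_run, bad_groups and result are hypothetical names.

```r
df <- data.frame(Variable.1 = c(1, 1, 2, 2, 2, 3, 3, 3),
                 Variable.2 = strsplit("abaabaca", "")[[1]])

# TRUE if any value is immediately repeated (a run of length > 1)
has_run <- function(v) any(rle(as.character(v))$lengths > 1)

# One logical per group, named by the group value
bad_groups <- tapply(as.character(df$Variable.2), df$Variable.1, has_run)

# Look up each row's group and keep rows whose group has no run
keep <- !as.vector(bad_groups[as.character(df$Variable.1)])
result <- df[keep, ]

print(result)
```

Because rle() works on a vector of any length, this version does not need a special case for groups with a single row.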
Say you want to remove all groups of df, grouped by a, where the column b has consecutive repeated values. You can do that as below.
set.seed(0)
df <- data.frame(a = rep(1:3, rep(3, 3)), b = sample(1:5, 9, T))
# dplyr
library(dplyr)
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
#data.table
library(data.table)
setDT(df)
df[, if(all(b != shift(b), na.rm = T)) .SD, by = a]
A benchmark shows data.table is faster (median of about 54 ms versus 179 ms for dplyr in this run):
#Results
# Unit: milliseconds
# expr min lq mean median uq max neval
# use_dplyr() 141.46819 165.03761 201.0975 179.48334 205.82301 539.5643 100
# use_DT() 36.27936 50.23011 64.9218 53.87114 66.73943 345.2863 100
# Method
set.seed(0)
df <- data.table(a = rep(1:2000, rep(1e3, 2000)), b = sample(1:1e3, 2e6, T))
use_dplyr <- function(x){
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
}
use_DT <- function(x){
df[, if (all(b != shift(b), na.rm = T)) .SD, a]
}
library(microbenchmark)
microbenchmark(use_dplyr(), use_DT())

Converting columns in dataframe within list?

What is the best way to convert a specific column in each list object to a specific format?
For instance, I have a list with four objects (each of which is a data frame) and I want to change column 3 in each data.frame from double to integer?
I'm guessing something along the lines of lapply, but I didn't know what specific syntax to use. I was trying:
lapply(df,function(x){as.numeric(var1(x))})
but it wasn't working.
Thanks!
Yes, lapply works well here:
lapply(listofdfs, function(df) { # loop through each data.frame in list
df[ , 3] <- as.integer(df[ , 3]) # make the 3rd column of type integer
df # return the new data.frame
})
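One caveat, offered as a hedged note: df[ , 3] returns a plain vector for a base data.frame, but for a tibble it returns a one-column tibble, which as.integer() may not accept. A sketch using [[ ]] instead, which extracts the column as a vector in both cases (dfs, converted and types are illustrative names, and the data is made up):

```r
dfs <- list(
  data.frame(a = 1:2, b = c("x", "y"), v = c(1, 2)),
  data.frame(a = 3:4, b = c("z", "w"), v = c(3, 4))
)

converted <- lapply(dfs, function(df) {
  df[[3]] <- as.integer(df[[3]])  # [[ ]] always extracts the column itself
  df                              # return the modified data frame
})

# Check the storage type of column 3 in every element
types <- vapply(converted, function(df) typeof(df[[3]]), character(1))
print(types)
```

Note that lapply returns a new list; the original list is unchanged unless the result is assigned back to it.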
This is just an alternative to C. Braun's answer.
You can also use the map() function from the purrr package.
Input:
library(tidyverse)
df <- tibble(a = c(1, 2, 3), b =c(4, 5, 6), d = c(7, 8, 9))
myList <- list(df, df, df)
myList
Method:
map(myList, ~ mutate_at(.x, vars(3), as.integer))
Output:
[[1]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
[[2]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
[[3]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
You can use this:
dlist2 <- lapply(dlist,function(x){
y <- x
y[,coltochange] <- as.numeric(x[,coltochange])
return(y)
} )
Simple example:
data <- data.frame(cbind(c("1","2","3","4",NA),c(1:5)),stringsAsFactors = F)
typeof(data[,1]) #character
dlist <- list(data,data,data)
coltochange <- 1
dlist2 <- lapply(dlist,function(x){
y <- x
y[,coltochange] <- as.numeric(x[,coltochange])
return(y)
} )
typeof(dlist[[1]][,1]) #character
typeof(dlist2[[1]][,1]) #double

How can I cast a data frame with two related columns? [duplicate]

I have a table like this
data.table(ID = c(1,2,3,4,5,6),
R = c("s","s","n","n","s","s"),
S = c("a","a","a","b","b","b"))
and I'm trying to get this result
a b
s 1, 2 5, 6
n 3 4
Is there any option in data.table that can do this?
Here's an alternative that uses plain old data.table syntax:
DT[,lapply(split(ID,S),list),by=R]
# or...
DT[,lapply(split(ID,S),toString),by=R]
You can use dcast from reshape2 with the appropriate aggregating function:
library(functional)
library(reshape2)
dcast(df, R~S, value.var='ID', fun.aggregate=Curry(paste0, collapse=','))
# R a b
#1 n 3 4
#2 s 1,2 5,6
Or even shorter, as @akrun pointed out:
dcast(df, R~S, value.var='ID', toString)
You could try:
library(dplyr)
library(tidyr)
df %>%
group_by(R, S) %>%
summarise(i = toString(ID)) %>%
spread(S, i)
Which gives:
#Source: local data table [2 x 3]
#Groups:
#
# R a b
#1 n 3 4
#2 s 1, 2 5, 6
Note: This will store the result in a string. If you want a more convenient format to access the elements, you could store in a list:
df2 <- df %>%
group_by(R, S) %>%
summarise(i = list(ID)) %>%
spread(S, i)
Which gives:
#Source: local data table [2 x 3]
#Groups:
#
# R a b
#1 n <dbl[1]> <dbl[1]>
#2 s <dbl[2]> <dbl[2]>
You can then access the elements by doing:
df2$a[[2]][2]
#[1] 2
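For comparison, the same cross-tabulation can be sketched in base R with tapply(), which needs no extra packages. The result is a character matrix rather than a data frame; the name wide is illustrative.

```r
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                 R = c("s", "s", "n", "n", "s", "s"),
                 S = c("a", "a", "a", "b", "b", "b"))

# Rows are levels of R, columns are levels of S, cells are the collapsed IDs
wide <- tapply(df$ID, list(df$R, df$S), toString)

print(wide)
```

Cells are addressed by name, for example wide["s", "a"].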

Binding a list variable into a new data frame

I am using dplyr version 0.4.1, and am trying to wrap my head around list variables.
I am having trouble creating a new data frame (or a tbl_df or data_frame or whatever) from a table containing a list variable.
For example, if I have a tbl_df like so:
x <- c(1,2,3)
y <- c(3,2,1)
d <- data_frame(X = list(x, y))
d
## Source: local data frame [2 x 1]
##
## X
## 1 <dbl[3]>
## 2 <dbl[3]>
Assuming all the values of the list variable X are the same length or dimensions, is there an operation that I can run to create a table that looks like rbind(x, y) from the list variable inside the table?
I am hoping to get something that will look like:
data_frame(V1 = c(1, 3), V2 = c(2, 2), V3 = c(3, 1))
## Source: local data frame [2 x 3]
##
## V1 V2 V3
## 1 1 2 3
## 2 3 2 1
The closest I got to my desired result was a stacked column:
d %>% tidyr::unnest(X)
I thought that maybe using rowwise to group by row might allow me to do an operation for each row, but I am seeing the same results as above.
d %>% rowwise %>% tidyr::unnest(X) # %>% some extra commands here??
You can do a little work on d first, then use bind_rows()
library(dplyr)
d$X %>%
lapply(function(x) data.frame(matrix(x, 1))) %>%
bind_rows
# Source: local data frame [2 x 3]
#
# X1 X2 X3
# 1 1 2 3
# 2 3 2 1
Another way is to use tbl_dt after rbindlist(), which can also be fed into dplyr functions
library(data.table)
tbl_dt(rbindlist(lapply(d$X, as.list)))
# Source: local data table [2 x 3]
#
# V1 V2 V3
# 1 1 2 3
# 2 3 2 1
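Since the question asks for something that looks like rbind(x, y), a base R sketch of the same idea is to row-bind the list elements into a matrix and convert that. Here X stands in for the list column d$X, and m and out are illustrative names.

```r
x <- c(1, 2, 3)
y <- c(3, 2, 1)
X <- list(x, y)  # stands in for the list column d$X

# Row-bind every element of the list into one matrix
m <- do.call(rbind, X)

# An unnamed matrix gets default column names V1, V2, V3
out <- as.data.frame(m)

print(out)
```

This assumes, as the question does, that every element of the list has the same length.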
