Binding a list variable into a new data frame - r

I am using dplyr version 0.4.1, and am trying to wrap my head around list variables.
I am having trouble creating a new data frame (or a tbl_df or data_frame or whatever) from a table containing a list variable.
For example, if I have a tbl_df like so:
x <- c(1,2,3)
y <- c(3,2,1)
d <- data_frame(X = list(x, y))
d
## Source: local data frame [2 x 1]
##
## X
## 1 <dbl[3]>
## 2 <dbl[3]>
Assuming all the values of the list variable X are the same length or dimensions, is there an operation that I can run to create a table that looks like rbind(x, y) from the list variable inside the table?
I am hoping to get something that will look like:
data_frame(V1 = c(1, 3), V2 = c(2, 2), V3 = c(3, 1))
## Source: local data frame [2 x 3]
##
## V1 V2 V3
## 1 1 2 3
## 2 3 2 1
The closest I got to my desired result was a stacked column:
d %>% tidyr::unnest(X)
I thought that maybe using rowwise to group by row might allow me to do an operation for each row, but I am seeing the same results as above.
d %>% rowwise %>% tidyr::unnest(X) # %>% some extra commands here??

You can do a little work on d first, then use bind_rows()
library(dplyr)
d$X %>%
lapply(function(x) data.frame(matrix(x, 1))) %>%
bind_rows
# Source: local data frame [2 x 3]
#
# X1 X2 X3
# 1 1 2 3
# 2 3 2 1
Another way is to use tbl_dt after rbindlist(), which can also be fed into dplyr functions
library(data.table)
tbl_dt(rbindlist(lapply(d$X, as.list)))
# Source: local data table [2 x 3]
#
# V1 V2 V3
# 1 1 2 3
# 2 3 2 1
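For comparison, here is a minimal base R sketch of the same idea, under the question's assumption that every element of the list column has the same length. `X` is a plain list standing in for `d$X` (an assumption made to keep the example free of package dependencies), and the V1, V2, V3 names are simply what as.data.frame() assigns to a matrix without column names.

```r
x <- c(1, 2, 3)
y <- c(3, 2, 1)
X <- list(x, y)            # stand-in for d$X (assumption)

m <- do.call(rbind, X)     # stack the list elements as rows of a matrix
wide <- as.data.frame(m)   # unnamed matrix columns become V1, V2, V3
wide
```

If the elements had different lengths, rbind() would recycle the shorter ones with a warning, so this only makes sense when the lengths really do match.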


Converting columns in dataframe within list?

What is the best way to convert a specific column in each list object to a specific format?
For instance, I have a list with four objects (each of which is a data frame) and I want to change column 3 in each data.frame from double to integer?
I'm guessing something along the lines of lapply, but I didn't know what specific syntax to use. I was trying:
lapply(df,function(x){as.numeric(var1(x))})
but it wasn't working.
Thanks!
Yes, lapply works well here:
lapply(listofdfs, function(df) { # loop through each data.frame in list
df[ , 3] <- as.integer(df[ , 3]) # make the 3rd column of type integer
df # return the new data.frame
})
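A small self-contained sketch of the lapply approach, using made-up data (the three-column data frames below are hypothetical). It indexes with `[[3]]` rather than `[ , 3]`; that is a deliberate swap, because on a tibble `[ , 3]` returns a one-column tibble instead of a vector, whereas `[[3]]` extracts the column itself for both data.frames and tibbles.

```r
# Hypothetical input: a list of data frames whose third column is double
one_df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), d = c(7, 8, 9))
listofdfs <- list(one_df, one_df, one_df, one_df)

converted <- lapply(listofdfs, function(df) {
  df[[3]] <- as.integer(df[[3]])   # replace column 3 with its integer version
  df                               # return the modified data frame
})

# Check the storage type of column 3 before and after
before <- vapply(listofdfs, function(df) typeof(df[[3]]), character(1))
after  <- vapply(converted, function(df) typeof(df[[3]]), character(1))
```

Note that lapply returns a new list; the original `listofdfs` is left unchanged unless the result is assigned back to it.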
This is just an alternative to C. Braun's answer.
You can also use the map() function from the purrr package.
Input:
library(tidyverse)
df <- tibble(a = c(1, 2, 3), b =c(4, 5, 6), d = c(7, 8, 9))
myList <- list(df, df, df)
myList
Method:
map(myList, ~(.x %>% mutate_at(vars(3), funs(as.integer(.)))))
Output:
[[1]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
[[2]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
[[3]]
# A tibble: 3 x 3
a b d
<dbl> <dbl> <int>
1 1. 4. 7
2 2. 5. 8
3 3. 6. 9
You can use this:
dlist2 <- lapply(dlist,function(x){
y <- x
y[,coltochange] <- as.numeric(x[,coltochange])
return(y)
} )
Simple example:
data <- data.frame(cbind(c("1","2","3","4",NA),c(1:5)),stringsAsFactors = F)
typeof(data[,1]) #character
dlist <- list(data,data,data)
coltochange <- 1
dlist2 <- lapply(dlist,function(x){
y <- x
y[,coltochange] <- as.numeric(x[,coltochange])
return(y)
} )
typeof(dlist[[1]][,1]) #character
typeof(dlist2[[1]][,1]) #double

dplyr mutate - How do I pass one row as a function argument?

I'm trying to create a new column in my tibble which collects and formats all words found in all other columns. I would like to do this using dplyr, if possible.
Original DataFrame:
df <- read.table(text = " columnA columnB
1 A Z
2 B Y
3 C X
4 D W
5 E V
6 F U " )
As a simplified example, I am hoping to do something like:
df %>%
rowwise() %>%
mutate(newColumn = myFunc(.))
And have the output look like this:
columnA columnB newColumn
1 A Z AZ
2 B Y BY
3 C X CX
4 D W DW
5 E V EV
6 F U FU
When I try this in my code, the output looks like:
columnA columnB newColumn
1 A Z ABCDEF
2 B Y ABCDEF
3 C X ABCDEF
4 D W ABCDEF
5 E V ABCDEF
6 F U ABCDEF
myFunc should take one row as an argument but when I try using rowwise() I seem to be passing the entire tibble into the function (I can see this from adding a print function into myFunc).
How can I pass just one row and do this iteratively so that it applies the function to every row? Can this be done with dplyr?
Edit:
myFunc in the example is simplified for the sake of my question. The actual function looks like this:
get_chr_vector <- function(row) {
row <- row[,2:ncol(row)] # I need to skip the first column
words <- str_c(row, collapse = ' ')
words <- str_to_upper(words)
words <- unlist(str_split(words, ' '))
words <- words[words != '']
words <- words[!nchar(words) <= 2]
words <- removeWords(words, stopwords_list) # from the tm library
words <- paste(words, sep = ' ', collapse = ' ')
}
Take a look at ?dplyr::do and ?purrr::map, which allow you to apply arbitrary functions to arbitrary columns and to chain the results through multiple unary operators. For example,
df1 <- df %>% rowwise %>% do( X = as_data_frame(.) ) %>% ungroup
# # A tibble: 6 x 1
# X
# * <list>
# 1 <tibble [1 x 2]>
# 2 <tibble [1 x 2]>
# ...
Notice that column X now contains 1x2 data.frames (or tibbles) comprised of rows from your original data.frame. You can now pass each one to your custom myFunc using map.
myFunc <- function(Y) {paste0( Y$columnA, Y$columnB )}
df1 %>% mutate( Result = map(X, myFunc) )
# # A tibble: 6 x 2
# X Result
# <list> <list>
# 1 <tibble [1 x 2]> <chr [1]>
# 2 <tibble [1 x 2]> <chr [1]>
# ...
The Result column now contains the output of myFunc applied to each row in your original data.frame, as desired. You can retrieve the values by chaining a tidyr::unnest operation.
df1 %>% mutate( Result = map(X, myFunc) ) %>% unnest
# # A tibble: 6 x 3
# Result columnA columnB
# <chr> <fctr> <fctr>
# 1 AZ A Z
# 2 BY B Y
# 3 CX C X
# ...
If desired, unnest can be limited to specific columns, e.g., unnest(Result).
EDIT: Because your original data.frame contains only two columns, you can actually skip the do step and use purrr::map2 instead. The syntax is very similar to map:
myFunc <- function( a, b ) {paste0(a,b)}
df %>% mutate( Result = map2( columnA, columnB, myFunc ) )
Note that myFunc is now defined as a binary function.
This should work
df <- read.table(text = " columnA columnB
1 A Z
2 B Y
3 C X
4 D W
5 E V
6 F U " )
df %>%
mutate(mutate_Func = paste0(columnA,columnB))
columnA columnB mutate_Func
1 A Z AZ
2 B Y BY
3 C X CX
4 D W DW
5 E V EV
6 F U FU
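If the function really does need to receive one row at a time, a base R sketch along the following lines may help. It loops over row indices and hands each one-row data frame to the function; `my_func` is a hypothetical stand-in for the real row-processing function, and the three-row data frame is a shortened version of the example data.

```r
df <- data.frame(columnA = c("A", "B", "C"),
                 columnB = c("Z", "Y", "X"),
                 stringsAsFactors = FALSE)

# Hypothetical row-level function: receives a one-row data frame
my_func <- function(row) paste0(row$columnA, row$columnB)

# Apply it to each row in turn; vapply checks that every call
# returns exactly one character value
df$newColumn <- vapply(seq_len(nrow(df)),
                       function(i) my_func(df[i, ]),
                       character(1))
df
```

This is slower than a vectorised paste0() on whole columns, so it is mainly worth it when the function cannot easily be written in vectorised form.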

Get description of groups from within a grouped data frame

I need to write a function that will take in a grouped data frame (from dplyr) and make a plot for each group, with the title describing what group it is for. The kicker is I don't know what the grouping variable is, or even how many there will be.
I've hacked together something using groups to get the grouping variables and then accessing the value with .[1,g], where g is a character version of the grouping variable names, as below.
Although I'm new to dplyr, this feels like the wrong way to go about this, that is, it's not really a dplyr native way of doing it. It works in the little testing I've done but I'm worried it will fail in some odd circumstance I haven't foreseen. How would you all do it? Is there a more dplyr-ish way of doing it?
On the odd chance that what I've done is actually a good idea, I've posted it as answer for you all to vote on as appropriate.
library(data.table)
setDT(d) # or create directly as data.table
par(mfrow = c(2, 3))
d[, plot(y, main = paste(names(.BY), .BY, sep = "=", collapse = ", ")), by = .(A, B)]
This is what I've hacked together; as described in the question, it uses groups to get the grouping variables and then accesses each value with .[1,g], where g is a character vector of the grouping variable names, as below.
Instead of making a plot, it just makes a data frame with the title as a variable.
library(dplyr)
d <- as.tbl(data.frame(expand.grid(A=1:3,B=1:2,y=1:2)))
d1 <- d %>% group_by(A)
g <- unlist(lapply(groups(d1), paste))
d1 %>% do(data.frame(title=paste(paste(g, "=", .[1,g]), collapse=", "), stringsAsFactors=FALSE))
## Source: local data frame [3 x 2]
## Groups: A [3]
##
## A title
## <int> <chr>
## 1 1 A = 1
## 2 2 A = 2
## 3 3 A = 3
d1 <- d %>% group_by(A, B)
g <- unlist(lapply(groups(d1), paste))
d1 %>% do(data.frame(title=paste(paste(g, "=", .[1,g]), collapse=", "), stringsAsFactors=FALSE))
## Source: local data frame [6 x 3]
## Groups: A, B [6]
##
## A B title
## <int> <int> <chr>
## 1 1 1 A = 1, B = 1
## 2 1 2 A = 1, B = 2
## 3 2 1 A = 2, B = 1
## 4 2 2 A = 2, B = 2
## 5 3 1 A = 3, B = 1
## 6 3 2 A = 3, B = 2
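The same titles can be sketched without do(), by building them from the distinct combinations of the grouping columns. This is only an illustration in base R: the grouping variable names are hard-coded in `group_names` here, whereas with a grouped tibble they would have to be looked up first (for example with something like dplyr::group_vars(), which is an assumption about the installed dplyr version).

```r
d <- expand.grid(A = 1:3, B = 1:2, y = 1:2)
group_names <- c("A", "B")       # assumed known here; looked up in practice

keys <- unique(d[group_names])   # one row per group
titles <- unname(apply(keys, 1, function(k) {
  paste(group_names, "=", k, collapse = ", ")
}))
titles
```

Each element of `titles` could then be passed as the `main` argument of the plot for the matching group.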

How can I cast a data frame with two related columns? [duplicate]

I have a table like this
data.table(ID = c(1,2,3,4,5,6),
R = c("s","s","n","n","s","s"),
S = c("a","a","a","b","b","b"))
and I'm trying to get this result
a b
s 1, 2 5, 6
n 3 4
Is there any option in data.table that can do this?
Here's an alternative that uses plain old data.table syntax:
DT[,lapply(split(ID,S),list),by=R]
# or...
DT[,lapply(split(ID,S),toString),by=R]
You can use dcast from reshape2 with the appropriate aggregating function:
library(functional)
library(reshape2)
dcast(df, R~S, value.var='ID', fun.aggregate=Curry(paste0, collapse=','))
# R a b
#1 n 3 4
#2 s 1,2 5,6
Or, even shorter, as @akrun pointed out:
dcast(df, R~S, value.var='ID', toString)
You could try:
library(dplyr)
library(tidyr)
df %>%
group_by(R, S) %>%
summarise(i = toString(ID)) %>%
spread(S, i)
Which gives:
#Source: local data table [2 x 3]
#Groups:
#
# R a b
#1 n 3 4
#2 s 1, 2 5, 6
Note: This will store the result in a string. If you want a more convenient format to access the elements, you could store in a list:
df2 <- df %>%
group_by(R, S) %>%
summarise(i = list(ID)) %>%
spread(S, i)
Which gives:
#Source: local data table [2 x 3]
#Groups:
#
# R a b
#1 n <dbl[1]> <dbl[1]>
#2 s <dbl[2]> <dbl[2]>
You can then access the elements by doing:
df2$a[[2]][2]
#[1] 2
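For a dependency-free comparison, the string version of this cast can also be sketched with base R's tapply, which cross-tabulates ID by R and S and collapses each cell with toString. The data frame below just re-creates the question's table as a plain data.frame.

```r
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                 R  = c("s", "s", "n", "n", "s", "s"),
                 S  = c("a", "a", "a", "b", "b", "b"),
                 stringsAsFactors = FALSE)

# Rows are the levels of R, columns the levels of S;
# each cell holds the IDs of that combination as one string
wide <- tapply(df$ID, list(df$R, df$S), toString)
wide
```

The result is a character matrix rather than a data frame, so it suits display better than further processing.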

Bind data frames on longer identifiers R

I've got two data frames in which the unique identifiers common to both frames differ in the number of observations. I would like to create a dataframe from both in which the observations from each frame are taken if they have more observations for a common identifier. For example:
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1,1,2,3,3,3))
f2 <- data.frame(x = c("a","b", "b", "c", "c"), y = c(4,5,5,6,6))
I would like this to generate a merge based on the longer x such that it produces:
x y
a 1
a 1
b 5
b 5
c 3
c 3
c 3
Any and all thoughts would be great.
Here's a solution using split
dd<-rbind(cbind(f1, s="f1"), cbind(f2, s="f2"))
keep<-unsplit(lapply(split(dd$s, dd$x), FUN=function(x) {
y<-table(x)
x == names(y[which.max(y)])
}), dd$x)
dd <- dd[keep,]
Normally I'd prefer to use the ave function here, but because I'm changing data types from a factor to a logical, it wasn't as appropriate, so I basically copied the idea that ave uses and used split.
dplyr solution
library(dplyr)
First we combine the data:
with rbind() and introduce a new variable called ref to know where each observation came from:
both <- rbind( f1, f2 )
both$ref <- rep( c( "f1", "f2" ) , c( nrow(f1), nrow(f2) ) )
then count the observations:
make another new variable that contains how many observations for each ref and x combination:
both_with_counts <- both %>%
group_by( ref ,x ) %>%
mutate( counts = n() )
then filter for the largest count:
both_with_counts %>% group_by( x ) %>% filter( counts==max(counts) )
note: you could also select only the x and y cols with select(x,y)...
this gives:
## Source: local data frame [7 x 4]
## Groups: x
##
## x y ref counts
## 1 a 1 f1 2
## 2 a 1 f1 2
## 3 c 3 f1 3
## 4 c 3 f1 3
## 5 c 3 f1 3
## 6 b 5 f2 2
## 7 b 5 f2 2
Altogether now...
what_I_want <-
rbind(cbind(f1,ref = "f1"),cbind(f2,ref = "f2")) %>%
group_by(ref,x) %>%
mutate(counts = n()) %>%
group_by( x ) %>%
filter( counts==max(counts) ) %>%
select( x, y )
and thus:
> what_I_want
# Source: local data frame [7 x 2]
# Groups: x
#
# x y
# 1 a 1
# 2 a 1
# 3 c 3
# 4 c 3
# 5 c 3
# 6 b 5
# 7 b 5
Not an elegant answer, but it still gives the desired result. Hope this helps.
f1table <- data.frame(table(f1$x))
colnames(f1table) <- c("x","freq")
f1new <- merge(f1,f1table)
f2table <- data.frame(table(f2$x))
colnames(f2table) <- c("x","freq")
f2new <- merge(f2,f2table)
table <- rbind(f1table, f2table)
table <- table[with(table, order(x,-freq)), ]
table <- table[!duplicated(table$x), ]
data <-rbind(f1new, f2new)
merge(data, table, by=c("x","freq"))[,c(1,3)]
x y
1 a 1
2 a 1
3 b 5
4 b 5
5 c 3
6 c 3
7 c 3
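Another way to think about the problem, sketched in base R: count how often each identifier occurs in each frame, decide per identifier which frame wins, then take that identifier's rows from the winning frame. The tie rule (f1 wins when the counts are equal) is an assumption; the question does not say what should happen in that case.

```r
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1, 1, 2, 3, 3, 3),
                 stringsAsFactors = FALSE)
f2 <- data.frame(x = c("a", "b", "b", "c", "c"), y = c(4, 5, 5, 6, 6),
                 stringsAsFactors = FALSE)

ids <- union(f1$x, f2$x)
n1 <- as.vector(table(factor(f1$x, levels = ids)))  # counts per id in f1
n2 <- as.vector(table(factor(f2$x, levels = ids)))  # counts per id in f2

use_f1 <- ids[n1 >= n2]   # ids taken from f1 (ties go to f1: assumption)
result <- rbind(f1[f1$x %in% use_f1, ], f2[!(f2$x %in% use_f1), ])
result <- result[order(result$x), ]   # sort by identifier for readability
rownames(result) <- NULL
result
```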
