I'm trying to create a new column in my tibble which collects and formats all words found in all other columns. I would like to do this using dplyr, if possible.
Original DataFrame:
df <- read.table(text = " columnA columnB
1 A Z
2 B Y
3 C X
4 D W
5 E V
6 F U " )
As a simplified example, I am hoping to do something like:
df %>%
rowwise() %>%
mutate(newColumn = myFunc(.))
And have the output look like this:
columnA columnB newColumn
1 A Z AZ
2 B Y BY
3 C X CX
4 D W DW
5 E V EV
6 F U FU
When I try this in my code, the output looks like:
columnA columnB newColumn
1 A Z ABCDEF
2 B Y ABCDEF
3 C X ABCDEF
4 D W ABCDEF
5 E V ABCDEF
6 F U ABCDEF
myFunc should take one row as an argument but when I try using rowwise() I seem to be passing the entire tibble into the function (I can see this from adding a print function into myFunc).
How can I pass just one row and do this iteratively so that it applies the function to every row? Can this be done with dplyr?
Edit:
myFunc in the example is simplified for the sake of my question. The actual function looks like this:
get_chr_vector <- function(row) {
row <- row[,2:ncol(row)] # I need to skip the first column
words <- str_c(row, collapse = ' ')
words <- str_to_upper(words)
words <- unlist(str_split(words, ' '))
words <- words[words != '']
words <- words[!nchar(words) <= 2]
words <- removeWords(words, stopwords_list) # from the tm library
words <- paste(words, sep = ' ', collapse = ' ')
}
Take a look at ?dplyr::do and ?purrr::map, which allow you to apply arbitrary functions to arbitrary columns and to chain the results through multiple unary operators. For example,
df1 <- df %>% rowwise %>% do( X = as_data_frame(.) ) %>% ungroup
# # A tibble: 6 x 1
# X
# * <list>
# 1 <tibble [1 x 2]>
# 2 <tibble [1 x 2]>
# ...
Notice that column X now contains 1x2 data.frames (or tibbles) comprised of rows from your original data.frame. You can now pass each one to your custom myFunc using map.
myFunc <- function(Y) {paste0( Y$columnA, Y$columnB )}
df1 %>% mutate( Result = map(X, myFunc) )
# # A tibble: 6 x 2
# X Result
# <list> <list>
# 1 <tibble [1 x 2]> <chr [1]>
# 2 <tibble [1 x 2]> <chr [1]>
# ...
The Result column now contains the output of myFunc applied to each row in your original data.frame, as desired. You can retrieve the values by appending a tidyr::unnest operation to the pipeline.
df1 %>% mutate( Result = map(X, myFunc) ) %>% unnest
# # A tibble: 6 x 3
# Result columnA columnB
# <chr> <fctr> <fctr>
# 1 AZ A Z
# 2 BY B Y
# 3 CX C X
# ...
If desired, unnest can be limited to specific columns, e.g., unnest(Result).
EDIT: Because your original data.frame contains only two columns, you can actually skip the do step and use purrr::map2 instead. The syntax is very similar to map:
myFunc <- function( a, b ) {paste0(a,b)}
df %>% mutate( Result = map2( columnA, columnB, myFunc ) )
Note that myFunc is now defined as a binary function.
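With more than two columns, the same pattern might be extended with purrr::pmap_chr(), which walks several vectors in parallel and returns a character vector rather than a list column. The following is only a minimal sketch under that assumption: paste_row is a hypothetical helper introduced for illustration, and the data frame is rebuilt with character columns so the example is self-contained.

```r
library(dplyr)
library(purrr)

df <- data.frame(
  columnA = c("A", "B", "C"),
  columnB = c("Z", "Y", "X"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: receives the values of one row as separate arguments
paste_row <- function(...) paste0(c(...), collapse = "")

# pmap_chr() calls paste_row once per row and returns a character vector
result <- df %>%
  mutate(newColumn = pmap_chr(list(columnA, columnB), paste_row))
```

Because pmap_chr() returns a plain character vector, no unnest step should be needed afterwards.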
This should work
df <- read.table(text = " columnA columnB
1 A Z
2 B Y
3 C X
4 D W
5 E V
6 F U " )
df %>%
mutate(mutate_Func = paste0(columnA,columnB))
columnA columnB mutate_Func
1 A Z AZ
2 B Y BY
3 C X CX
4 D W DW
5 E V EV
6 F U FU
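paste0() is vectorised, which is why no row-wise step is needed here. For a function that is not vectorised, one possible sketch, assuming dplyr 1.0.0 or later, combines rowwise() with c_across(), which collects the values of the selected columns for the current row only. The data frame is rebuilt with character columns to keep the example self-contained.

```r
library(dplyr)

df <- data.frame(
  columnA = c("A", "B", "C"),
  columnB = c("Z", "Y", "X"),
  stringsAsFactors = FALSE
)

result <- df %>%
  rowwise() %>%
  # c_across() yields the current row's values as one vector
  mutate(newColumn = paste0(c_across(c(columnA, columnB)), collapse = "")) %>%
  ungroup()
```

In the original attempt, the `.` passed to myFunc is the pipe placeholder for the whole data frame, which would explain why every row received the same value.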
Edit: Simply use rbind from base!
I have a list of tibbles with the same column names and orders, but possibly incompatible column types. I would like to vertically concatenate the tables into one, à la tibble::add_row(), automatically converting types to the greatest common denominator where necessary (in the same way that, e.g., c(1, 2, "a") returns c("1", "2", "a")). I don’t know the types of columns in advance.
For example,
> X = tibble(a = 1:3, b = c("a", "b", "c"))
# A tibble: 3 × 2
a b
<int> <chr>
1 1 a
2 2 b
3 3 c
> Y = tibble(a = "Any", b = 1)
# A tibble: 1 × 2
a b
<chr> <dbl>
1 Any 1
Desired output:
# A tibble: 4 × 2
a b
<chr> <chr>
1 1 a
2 2 b
3 3 c
4 Any 1
Is there a way to do this generically? I’m trying to write code for a package that is agnostic about data frames and tibbles (i.e., it doesn’t convert into one or the other).
Ideally, type promotion should reflect the behaviour of c(...) (NULL < raw < logical < integer < double < complex < character < list < expression) — except for factors, where I’d like to preserve the factor label (whatever its type), not the underlying index.
I think rbind(X, Y) already achieves what you want. Here is another idea: assuming that X and Y have the same column names and order, you could use the map2() family from purrr (here map2_dfc()) to apply c() over the corresponding columns of X and Y.
purrr::map2_dfc(X, Y, c)
# # A tibble: 4 × 2
# a b
# <chr> <chr>
# 1 1 a
# 2 2 b
# 3 3 c
# 4 Any 1
If X and Y do not have the same column names and order, you could intersect their names and proceed in the same way:
cols <- intersect(names(X), names(Y))
purrr::map2_dfc(X[cols], Y[cols], c)
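The question also asks that factors contribute their labels rather than their integer codes. A base R sketch of the same column-wise idea that handles this is given below; combine_cols and unfactor are hypothetical names introduced for illustration, and the approach simply converts factors to character before calling c().

```r
# Hypothetical helper: combine two data frames column by column with c(),
# converting factors to their labels first
combine_cols <- function(X, Y) {
  cols <- intersect(names(X), names(Y))
  unfactor <- function(v) if (is.factor(v)) as.character(v) else v
  out <- Map(function(x, y) c(unfactor(x), unfactor(y)), X[cols], Y[cols])
  as.data.frame(out, stringsAsFactors = FALSE)
}

X <- data.frame(a = 1:3, b = factor(c("u", "v", "w")))
Y <- data.frame(a = "Any", b = 1, stringsAsFactors = FALSE)

res <- combine_cols(X, Y)
```

This returns a plain data.frame; wrapping the result in tibble::as_tibble() would be one way to get a tibble back if the inputs were tibbles.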
Utilising the overly liberal behaviour of base R by doing do.call(rbind, list(X, Y)) would get you some of the way there, but comes with some downsides, such as that the order in which you combine things matters (consider the output of as.character(TRUE) vs as.character(as.integer(TRUE))).
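The order dependence can be illustrated with plain vectors. This sketch uses Reduce(c, ...) as a stand-in for combining pieces one at a time; it illustrates stepwise coercion and is not meant to reproduce the internals of rbind().

```r
# A logical meets a character first, so TRUE becomes "TRUE"
lgl_then_chr <- Reduce(c, list(TRUE, "x", 1L))

# A logical meets an integer first, so TRUE becomes 1L and later "1"
lgl_then_int <- Reduce(c, list(TRUE, 1L, "x"))
```

The same three values end up with different string representations depending only on the order in which they were combined.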
A better approach would probably be to look at all of your data frames to work out what final column types you need to cast to, and cast your columns to these types separately before combining the data frames. Here's a function that will do this:
library(tidyverse)
coerce_bind_rows <- function(...) {
casts <- list(
raw = NULL,
logical = as.logical,
integer = as.integer,
numeric = as.numeric,
double = as.double,
character = as.character,
list = as.list
)
dfs <- list(...)
dfs_fmt_objs <- map(dfs, mutate, across(where(is.object), format))
targets <-
dfs_fmt_objs |>
map(partial(map_chr, ... = , typeof)) |>
pmap(c) |>
map(factor, levels = names(casts), ordered = TRUE) |>
map(compose(as.character, max))
dfs_casted <-
dfs_fmt_objs |>
map(function(.data, .types = targets) {
for (.col in names(.types)) {
.fn <- casts[[.types[[.col]]]]
.data[[.col]] <- .fn(.data[[.col]])
}
.data
})
bind_rows(dfs_casted)
}
[Edited to format classed objects to handle factors as specified in update to the question]
Testing on your examples above:
X <- tibble(a = 1:3, b = c("a", "b", "c"))
Y <- tibble(a = "Any", b = 1)
coerce_bind_rows(X, Y)
#> # A tibble: 4 x 2
#> a b
#> <chr> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 Any 1
Testing on some data frames with a broader range of types:
W <- tibble(a = FALSE, b = raw(1L))
Z <- tibble(a = list(4), b = "d")
coerce_bind_rows(W, X, Y, Z)
#> # A tibble: 6 x 2
#> a b
#> <list> <chr>
#> 1 <lgl [1]> 00
#> 2 <int [1]> a
#> 3 <int [1]> b
#> 4 <int [1]> c
#> 5 <chr [1]> 1
#> 6 <dbl [1]> d
By the way, data frame columns have to be vectors (which include atomic vectors or lists), so you can't have a data frame with columns that are NULLs or expressions. But this approach should also work for everything from raw to list type vectors.
I would like to reorder some columns to come after a particular other column using dplyr::relocate. Here is a MWE:
a <- letters[1:3]
b <- letters[4:6]
c <- letters[7:9]
d <- letters[10:12]
mytib <- tibble::tibble(a,b,c,d)
# A tibble: 3 x 4
# a b c d
# <chr> <chr> <chr> <chr>
# 1 a d g j
# 2 b e h k
# 3 c f i l
mytib %>%
relocate(c, .after = a)
This example works but is there a way that I could, with one relocate command, move c after a and, say, d after b?
I tried the following without success:
mytib %>%
relocate(c(c, d), .after(c(a, b)))
Edit 1: I explicitly ask about relocate because functions like select do not work for large datasets where all I know is after which column (name) I want to insert a column.
Edit 2: This is my expected output:
# A tibble: 3 x 4
# a c b d
# <chr> <chr> <chr> <chr>
# 1 a g d j
# 2 b h e k
# 3 c i f l
As dplyr::relocate itself apparently doesn't allow relocating in pairs, you can "hack" this behavior by preparing a list of column pairs like the ones you describe ("c after a" & "d after b") and reduce over that list, passing your df in as an .init value and in each reduce-step relocating one pair.
Like this:
library(dplyr)
library(purrr)
df_relocated <- reduce(
.x = list(c('c','a'), c('d','b')),
.f = ~ relocate(.x, .y[1], .after = .y[2]),
.init = mytib
)
This produces a tibble just as you expect it:
> df_relocated
# A tibble: 3 x 4
a c b d
<chr> <chr> <chr> <chr>
1 a g d j
2 b h e k
3 c i f l
In case you want to work with two lists, where element 1 of list 2 should be relocated after element 1 of list 1 and so forth, this would be a solution:
reduce2(
.x = c("a", "b"),
.y = c("c", "d"),
.f = ~ relocate(..1, ..3, .after = ..2),
.init = mytib
)
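If the reduce() call feels opaque, the same sequence of single relocations could be written as an explicit loop. The sketch below assumes the pairs are given as a named character vector of the form c(column_to_move = "column_to_follow"); relocate_pairs is a hypothetical helper, not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: apply one relocate() per pair, in order
relocate_pairs <- function(data, pairs) {
  for (col in names(pairs)) {
    data <- relocate(data, all_of(col), .after = all_of(pairs[[col]]))
  }
  data
}

mytib <- tibble::tibble(
  a = letters[1:3],
  b = letters[4:6],
  c = letters[7:9],
  d = letters[10:12]
)

# Move c after a, then d after b
out <- relocate_pairs(mytib, c(c = "a", d = "b"))
```

Since each step sees the result of the previous one, the order of the pairs can matter when the moves interact.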
I am wondering how to manipulate a list containing data.frames stored in a tibble.
Specifically, I would like to extract two columns from a data.frame that are stored in a tibble list column.
I would like to go from this tibble c
random_data<-list(a=letters[1:10],b=LETTERS[1:10])
x<-as.data.frame(random_data, stringsAsFactors=FALSE)
y<-list()
y[[1]]<-x[1,,drop=FALSE]
y[[3]]<-x[2,,drop=FALSE]
c<-tibble(z=c(1,2,3),my_data=y)
to this tibble d
d<-tibble(z=c(1,2,3),a=c('a',NA,'b'),b=c('A',NA,'B'))
thanks
Iain
c2 is the final output.
library(tidyverse)
c2 <- c %>%
filter(!map_lgl(my_data, is.null)) %>%
unnest(my_data) %>%
right_join(c, by = "z") %>%
select(-my_data)
You could create a function f to change out the NULL values, then apply it to the my_data column and finish with unnest.
library(dplyr); library(tidyr)
unnest(mutate(c, my_data = lapply(my_data, f)))
# # A tibble: 3 x 3
# z a b
# <dbl> <chr> <chr>
# 1 1 a A
# 2 2 <NA> <NA>
# 3 3 b B
Where f is a helper function to change out the NULL values, and is defined as
f <- function(x) {
if(is.null(x)) data.frame(a = NA, b = NA) else x
}
I think this does the trick with d the requested tibble:
library(dplyr)
new.y <- lapply(y, function(x) if(is.null(x)) data.frame(a = NA, b = NA) else x)
d <- cbind(z = c(1, 2, 3), bind_rows(new.y)) %>% tbl_df()
# # A tibble: 3 x 3
# z a b
# <dbl> <fctr> <fctr>
# 1 1 a A
# 2 2 NA NA
# 3 3 b B
Do you know your column names ahead of time?
extract_column <- function( d, column_name ) {
if( is.null(d) ) {
NA_character_
} else {
as.character(d[[column_name]])
}
}
cc %>%
dplyr::mutate(
a = purrr::map_chr(.$my_data, extract_column, column_name="a"),
b = purrr::map_chr(.$my_data, extract_column, column_name="b")
) %>%
dplyr::select(-my_data)
(I renamed your c tibble to cc so it can't collide with c().)
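With tidyr 1.0.0 or later, the NULL entries may not need special handling at all, because unnest() has a keep_empty argument that replaces empty elements with a row of missing values. The following sketch rests on that assumption and uses cc as the name of the input tibble.

```r
library(tibble)
library(tidyr)

x <- data.frame(a = letters[1:10], b = LETTERS[1:10], stringsAsFactors = FALSE)
y <- list()
y[[1]] <- x[1, , drop = FALSE]
y[[3]] <- x[2, , drop = FALSE]   # y[[2]] is left as NULL
cc <- tibble(z = c(1, 2, 3), my_data = y)

# keep_empty = TRUE keeps the row whose list element is NULL, filled with NA
d <- unnest(cc, my_data, keep_empty = TRUE)
```

This approach does not require knowing the column names of the nested data frames in advance.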
I am using dplyr version 0.4.1, and am trying to wrap my head around list variables.
I am having trouble creating a new data frame (or a tbl_df or data_frame or whatever) from a table containing a list variable.
For example, if I have a tbl_df like so:
x <- c(1,2,3)
y <- c(3,2,1)
d <- data_frame(X = list(x, y))
d
## Source: local data frame [2 x 1]
##
## X
## 1 <dbl[3]>
## 2 <dbl[3]>
Assuming all the values of the list variable X are the same length or dimensions, is there an operation that I can run to create a table that looks like rbind(x, y) from the list variable inside the table?
I am hoping to get something that will look like:
data_frame(V1 = c(1, 3), V2 = c(2, 2), V3 = c(3, 1))
## Source: local data frame [2 x 3]
##
## V1 V2 V3
## 1 1 2 3
## 2 3 2 1
The closest I got to my desired result was a stacked column:
d %>% tidyr::unnest(X)
I thought that maybe using rowwise to group by row might allow me to do an operation for each row, but I am seeing the same results as above.
d %>% rowwise %>% tidyr::unnest(X) # %>% some extra commands here??
You can do a little work on d first, then use bind_rows()
library(dplyr)
d$X %>%
lapply(function(x) data.frame(matrix(x, 1))) %>%
bind_rows
# Source: local data frame [2 x 3]
#
# X1 X2 X3
# 1 1 2 3
# 2 3 2 1
Another way is to use tbl_dt after rbindlist(), which can also be fed into dplyr functions
library(data.table)
tbl_dt(rbindlist(lapply(d$X, as.list)))
# Source: local data table [2 x 3]
#
# V1 V2 V3
# 1 1 2 3
# 2 3 2 1
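Since the target is described as something that looks like rbind(x, y), a base R sketch that does exactly that on the list column may also be worth considering. It assumes every element of X has the same length, as stated in the question.

```r
library(tibble)

x <- c(1, 2, 3)
y <- c(3, 2, 1)
d <- tibble(X = list(x, y))

# Row-bind the list elements into a matrix, then convert to a data frame;
# as.data.frame() supplies the default column names V1, V2, V3
wide <- as.data.frame(do.call(rbind, d$X))
```

The result is a plain data.frame, which can be passed to tibble::as_tibble() if a tibble is preferred.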
I've got two data frames in which the unique identifiers common to both frames differ in the number of observations. I would like to create a dataframe from both in which the observations from each frame are taken if they have more observations for a common identifier. For example:
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1,1,2,3,3,3))
f2 <- data.frame(x = c("a","b", "b", "c", "c"), y = c(4,5,5,6,6))
I would like this to generate a merge based on the longer x such that it produces:
x y
a 1
a 1
b 5
b 5
c 3
c 3
c 3
Any and all thoughts would be great.
Here's a solution using split
dd<-rbind(cbind(f1, s="f1"), cbind(f2, s="f2"))
keep<-unsplit(lapply(split(dd$s, dd$x), FUN=function(x) {
y<-table(x)
x == names(y[which.max(y)])
}), dd$x)
dd <- dd[keep,]
Normally I'd prefer to use the ave function here, but because I'm changing data types from a factor to a logical, it wasn't as appropriate, so I basically copied the idea that ave uses and used split.
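A variant that does use ave() seems possible if the intermediate values stay numeric: count the rows per identifier and source first, then compare each count with the maximum for its identifier. This is a sketch of that alternative rather than the approach above, and the column names n and keep are introduced only for illustration.

```r
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1, 1, 2, 3, 3, 3))
f2 <- data.frame(x = c("a", "b", "b", "c", "c"), y = c(4, 5, 5, 6, 6))

dd <- rbind(cbind(f1, s = "f1"), cbind(f2, s = "f2"))

# Number of rows for each combination of identifier and source
dd$n <- ave(seq_len(nrow(dd)), dd$x, dd$s, FUN = length)

# Keep rows whose source has the largest count for that identifier
dd$keep <- dd$n == ave(dd$n, dd$x, FUN = max)
res <- dd[dd$keep, c("x", "y")]
```

If both sources have the same count for an identifier, this version keeps the rows from both.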
dplyr solution
library(dplyr)
First we combine the data:
with rbind() and introduce a new variable called ref to know where each observation came from:
both <- rbind( f1, f2 )
both$ref <- rep( c( "f1", "f2" ) , c( nrow(f1), nrow(f2) ) )
then count the observations:
make another new variable that contains how many observations for each ref and x combination:
both_with_counts <- both %>%
group_by( ref ,x ) %>%
mutate( counts = n() )
then filter for the largest count:
both_with_counts %>% group_by( x ) %>% filter( counts==max(counts) )
note: you could also select only the x and y cols with select(x,y)...
this gives:
## Source: local data frame [7 x 4]
## Groups: x
##
## x y ref counts
## 1 a 1 f1 2
## 2 a 1 f1 2
## 3 c 3 f1 3
## 4 c 3 f1 3
## 5 c 3 f1 3
## 6 b 5 f2 2
## 7 b 5 f2 2
Altogether now...
what_I_want <-
rbind(cbind(f1,ref = "f1"),cbind(f2,ref = "f2")) %>%
group_by(ref,x) %>%
mutate(counts = n()) %>%
group_by( x ) %>%
filter( counts==max(counts) ) %>%
select( x, y )
and thus:
> what_I_want
# Source: local data frame [7 x 2]
# Groups: x
#
# x y
# 1 a 1
# 2 a 1
# 3 c 3
# 4 c 3
# 5 c 3
# 6 b 5
# 7 b 5
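With a more recent dplyr, the same pipeline might be written more compactly using bind_rows() with its .id argument and add_count(). This is a sketch under that assumption; the final arrange() is only there to match the row order requested in the question.

```r
library(dplyr)

f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1, 1, 2, 3, 3, 3))
f2 <- data.frame(x = c("a", "b", "b", "c", "c"), y = c(4, 5, 5, 6, 6))

longest <- bind_rows(f1 = f1, f2 = f2, .id = "ref") %>%  # ref records the source
  add_count(ref, x, name = "counts") %>%                 # rows per source and x
  group_by(x) %>%
  filter(counts == max(counts)) %>%                      # keep the longer source
  ungroup() %>%
  arrange(x) %>%
  select(x, y)
```

As with the version above, ties between the two sources are not broken, so both sets of rows would be kept.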
Not an elegant answer, but it still gives the desired result. Hope this helps.
f1table <- data.frame(table(f1$x))
colnames(f1table) <- c("x","freq")
f1new <- merge(f1,f1table)
f2table <- data.frame(table(f2$x))
colnames(f2table) <- c("x","freq")
f2new <- merge(f2,f2table)
table <- rbind(f1table, f2table)
table <- table[with(table, order(x,-freq)), ]
table <- table[!duplicated(table$x), ]
data <-rbind(f1new, f2new)
merge(data, table, by=c("x","freq"))[,c(1,3)]
x y
1 a 1
2 a 1
3 b 5
4 b 5
5 c 3
6 c 3
7 c 3