How can I make combn work in dplyr::mutate? - r

I'm trying to make combn() work in dplyr::mutate, but I'm failing and can't quite figure out why.
This works:
c("a", "b", "c") %>% combn(2, FUN = paste, collapse = ";", simplify = TRUE)
[1] "a;b" "a;c" "b;c"
But how can I make this work?
tribble(
~col,
c("a", "b", "c"),
c("a", "d", "f")
) %>%
mutate(col = combn(str_split(names, ";"), 2, FUN = paste, collapse = ";"))
I want each row in the matrix to be a character vector in this form:
[1] "a;b" "a;c" "b;c"
The example above would be the first row.
Edit: I guess it's fine if combn() isn't used.

We could use map to loop over the list and paste
library(tidyverse)
out <- tribble(
~col,
c("a", "b", "c"),
c("a", "d", "f")
) %>%
mutate(col = map(col, ~ combn(.x, 2, FUN = paste, collapse=";")))

Try:
tribble(
~col,
c("a", "b", "c"),
c("a", "d", "f")
) %>%
rowwise() %>%
mutate(new = toString(combn(col, 2, FUN = paste, collapse = ";")))

Related

Append text to a field based on another field's value

I want to append a text based on another field's value. For example:-
This is the current df:
field_x <- c("A", "A", "C", "B", "B", "C")
field_y <- c("Axl", "Slash", "Duff", "Steven", "Izzy", "Dizzy")
df <- cbind(field_x, field_y)
I need to change the field_y based on field_x values so that it looks like this:
field_x <- c("A", "A", "C", "B", "B", "C")
field_y <- c("Axl (Apple)", "Slash (Apple)", "Duff (Cat)", "Steven (Ball)", "Izzy (Ball)", "Dizzy (Cat)")
So, basically if field_x has "A" then "(Apple)" is to be appended to field_y and so forth. Thanks in advance!
First note that your df is actually a matrix: when you cbind vectors, you get a matrix. So first thing to do is convert to dataframe.
Then it depends on whether or not you are using dplyr.
field_x <- c("A", "A", "C", "B", "B", "C")
field_y <- c("Axl", "Slash", "Duff", "Steven", "Izzy", "Dizzy")
df <- cbind(field_x, field_y)
df <- as.data.frame(df)
Without dplyr:
df <- within(df, {
s <- ifelse(field_x == "A", "Apple", ifelse(field_x == "B", "Ball", "Cat"))
field_y <- paste0(field_y, "(", s, ")")
rm(s)
})
With dplyr:
library(dplyr)
library(stringr)
library(magrittr)
df %<>%
mutate(
s = recode(field_x, "A" = "Apple", "B" = "Ball", "C" = "Cat"),
field_y = str_glue("{field_y}({s})")) %>%
select(-s)
Another way, with case_when instead of recode:
df %<>%
mutate(
s = case_when(
field_x == "A" ~ "Apple",
field_x == "B" ~ "Ball",
field_x == "C" ~ "Cat"
),
field_y = str_glue("{field_y}({s})")) %>%
select(-s)
Note that I create an auxiliary variable s: it's not really necessary, but it makes the code more readable.
Here is another approach:
We could create a look-up table to address the concerns of #Tim Biegeleisen in the comment section:
look_up <- data.frame(x = c("A", "B" ,"C"),
y = c("Apple", "Ball", "Cat"))
library(dplyr)
df %>%
as.data.frame() %>%
rowwise() %>%
mutate(field_y = paste0(field_y, ' (', look_up$y[look_up$x==field_x], ')'))
field_x field_y
<chr> <chr>
1 A Axl (Apple)
2 A Slash (Apple)
3 C Duff (Cat)
4 B Steven (Ball)
5 B Izzy (Ball)
6 C Dizzy (Cat)

Exchange data.table columns with most prevalent value of columns

I have data
test = data.table(
a = c(1,1,3,4,5,6),
b = c("a", "be", "a", "c", "d", "c"),
c = rep(1, 6)
)
I wish to take the unique values of column a, store it in another data.table, and afterwards fill in the remaining columns with the most prevalent values of those remaining columns, such that my resulting data.table would be:
test2 = data.table(a = c(1,3,4,5,6), b = "a", c = 1)
Column be has equal amounts of "a" and "c", but it doesn't matter which is chosen in those cases.
Attempt so far:
test2 = unique(test, by = "a")
test2[, c("b", "c") := lapply(.SD, FUN = function(x){test2[, .N, by = x][order(-N)][1,1]}), .SDcols = c("b", "c")]
EDIT: I would preferrably like a generic solution that is compatible with a function where I specify the column to be "uniqued", and the rest of the columns are with the single most prevalent value. Hence my use of lapply and .SD =)
EDIT2: as #MichaelChirico points out, how do we keep the class. With the following data.table some of the solutions does not work, although solution of #chinsoon12 does work:
test = data.table(a = c(1,1,3,4,5,6),
b = c("a", "be", "a", "c", "d", "c"),
c = rep(1, 6),
d = as.Date("2019-01-01"))
Another option:
dtmode <- function(x) x[which.max(rowid(x))]
test[, .(A=unique(A), B=dtmode(B), C=dtmode(C))]
data:
test = data.table(
A = c(1,1,3,4,5,6),
B = c("a", "be", "a", "c", "d", "c"),
C = rep(1, 6)
)
Not a clean way to do this but it works.
test = data.frame(a = c(1,1,3,4,5,6), b = c("a", "be", "a", "c", "d", "c"), c = rep(1, 6))
a = unique(test$a)
b = tail(names(sort(table(test$b))), 1)
c = tail(names(sort(table(test$c))), 1)
test2 = cbind(a,b,c)
Output is like this:
> test2
a b c
[1,] "1" "c" "1"
[2,] "3" "c" "1"
[3,] "4" "c" "1"
[4,] "5" "c" "1"
[5,] "6" "c" "1"
>
#EmreKiratli is very close to what I would do:
test[ , c(
list(a = unique(a)),
lapply(.SD, function(x) as(tail(names(sort(table(x))), 1L), class(x)))
), .SDcols = !'a']
The as(., class(x)) part is because names in R are always character, so we have to convert back to the original class of x.
You might like this better in magrittr form since it's many nested functions:
library(magrittr)
test[ , c(
list(a = unique(a)),
lapply(.SD, function(x) {
table(x) %>% sort %>% names %>% tail(1L) %>% as(class(x))
})
), .SDcols = !'a']
I was able to make an OK solution, but if somebody can do it more elegantly, for example not going through the step of storting a list in refLevel below, please let me know! I'm very interested in learning data.table properly!
#solution:
test = data.table(a = c(1,1,3,4,5,6), b = c("a", "be", "a", "c", "d", "c"), c = rep(1, 6))
test2 = unique(test, by="a")
funPrev = function(x){unlist(as.data.table(x)[, .N, by=x][order(-N)][1,1], use.names = F)}
refLevel = lapply(test[, c("b", "c")], funPrev)
test2[, c("b", "c") := refLevel]
...and using a function (if anybody see any un-necessary step, please let me know):
genData = function(dt, var_unique, vars_prev){
data = copy(dt)
data = unique(data, by = var_unique)
funPrev = function(x){unlist(as.data.table(x)[, .N, by=x][order(-N)][1,1], use.names = F)}
refLevel = lapply(dt[, .SD, .SDcols = vars_prev], funPrev)
data[, (vars_prev) := refLevel]
return(data)
}
test2 = genData(test, "a", c("b", "c"))
Here's another variant which one might find less sophisticated, yet more readable. It's essentially chinsoon12's rowid approach generalized for any number of columns. Also the classes are kept.
test = data.table(a = c(1,1,3,4,5,6),
b = c("a", "be", "a", "c", "d", "c"),
c = rep(1, 6),
d = as.Date("2019-01-01"))
test2 = unique(test, by = "a")
for (col in setdiff(names(test2), "a")) test2[[col]] = test2[[col]][which.max(rowid(test2[[col]]))]

rlang: refer var in the tidyr unite function

Here is my data:
df <- tibble::tribble(
~A, ~B, ~C,
"a", "b", 2L,
"a", "b", 4L,
"c", "d", 3L,
"c", "d", 5L
)
var <- "AB"
I want to get this output:
df1 <- df %>%
unite("AB", c("A", "B")) %>%
group_by(AB) %>%
nest()
However, I want to refer var, maybe using rlang. I do not want to manually input "AB". I tried the following, but not getting the desired output.
df1 <- df %>%
unite(var, c("A", "B")) %>%
group_by(!!var) %>%
nest()
This solved the problem:
df1 <- df %>%
unite(!!var, c("A", "B")) %>%
group_by(!!sym(var)) %>%
nest()

three min values for each column and more

The data I have contain pair-wise distance between different locations (x,y,z) and (a,b,c,d,e,f,g,h,i,j). See below:
set.seed(123)
x <- rnorm(10, 15,1)
y <- rnorm(10, 7,0.1)
z <- rnorm(10, 3,0.01)
distdat <- data.frame(x,y,z)
rownames(distdat) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
I need to create another data that include: 1) the column name, row name of the min, and the minimum three values for each column. So in total, the new data will contain
three column and nine rows. Here is the first rows:
col_name <- c("x", "x", "x")
row_name <- c("h", "g", "a")
min_val <- c(14.21208, 14.88804, 14.98797)
newdat <- data.frame(col_name, row_name, min_val)
Similarly, we need to repeat this for column y and z.
How about this:
set.seed(123)
x <- rnorm(10, 15,1)
y <- rnorm(10, 7,0.1)
z <- rnorm(10, 3,0.01)
distdat <- data.frame(x,y,z)
rownames(distdat) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
# find indices of smallest values
idx <- sapply(distdat, order)[1:3, ]
# put everything in a data.frame
data.frame(col_name = rep(colnames(distdat), each = 3),
row_name = row.names(distdat)[c(idx)],
min_val = distdat[cbind(c(idx), rep(1:3, each = 3))]
)
Also, with the given seed I could not replicate your example, let me know if I missed something.
Its not real pretty, but this could work:
set.seed(123)
x <- rnorm(10, 15,1)
y <- rnorm(10, 7,0.1)
z <- rnorm(10, 3,0.01)
distdat <- data.frame(x,y,z)
rownames(distdat) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
distdat$row_name <- rownames(distdat)
select(distdat, x, row_name) %>%
arrange(x) %>%
head(3) %>%
mutate(col_name='x') %>%
rename(min_val = x) -> newdat_x
select(distdat, y, row_name) %>%
arrange(y) %>%
head(3) %>%
mutate(col_name='y') %>%
rename(min_val = y) -> newdat_y
select(distdat, z, row_name) %>%
arrange(z) %>%
head(3) %>%
mutate(col_name='z') %>%
rename(min_val = z) -> newdat_z
newdat <- bind_rows(newdat_x, newdat_y, newdat_z)
certainly, we could (should) create a function to create those newdat_ dfs, and then run the function for each variable x,y,z.
You can use the dplyr and tidyr packages to do this. They make transformations much more readable.
newdat <- distdat %>%
mutate(row = rownames(.)) %>%
gather(col, dist, -row) %>%
group_by(col) %>%
arrange(col, dist) %>%
top_n(-3, dist)

I am trying to change all the column names of the data whose class is tbl_spark

Here is the code:
library(sparklyr)
sc <- spark_connect(master = "local", config = list())
iris_tbl <- copy_to(sc, iris, overwrite = T)
newColList <- c("a", "b" , "c" , "d" , " e")
colnames(iris_tbl) <- newColList
Error:
Error in colnames<- ( tmp, value = c("a", "b", "c", "d", "
e")) : 'dimnames' applied to non-array
names(iris_tbl) <- newColList works but I think a better answer would utilize %>% and dplyr::rename
I've been searching around for this all day. Right now my best solution is to create a custom function that goes direct to the Spark API:
sdf_write_colnames <- function(in_tbl, new_names) {
sdf_name <- as.character(in_tbl$ops$x)
in_tbl %>%
spark_dataframe() %>%
invoke("toDF", as.list(new_names)) %>%
sdf_register(name = sdf_name)
}
iris_tbl <- sdf_write_colnames(iris_tbl, c("a", "b", "c", "d", "e"))
head(iris_tbl)
With a bit of effort it could be made to work more like colnames() <-

Resources