Replacement of column values based on a named vector

Consider the following named vector vec and tibble df:
vec <- c("1" = "a", "2" = "b", "3" = "c")
df <- tibble(col = rep(1:3, c(4, 2, 5)))
df
# # A tibble: 11 x 1
# col
# <int>
# 1 1
# 2 1
# 3 1
# 4 1
# 5 2
# 6 2
# 7 3
# 8 3
# 9 3
# 10 3
# 11 3
I would like to replace the values in the col column with the corresponding named values in vec.
I'm looking for a tidyverse approach that doesn't involve converting vec to a tibble.
I tried the following, without success:
df %>%
mutate(col = map(
vec,
~ str_replace(col, names(.x), .x)
))
Expected output:
# A tibble: 11 x 1
col
<chr>
1 a
2 a
3 a
4 a
5 b
6 b
7 c
8 c
9 c
10 c
11 c

You could use col to index into vec:
df$col1 <- vec[as.character(df$col)]
Or inside mutate():
library(dplyr)
df %>% mutate(col1 = vec[as.character(col)])
# col col1
# <int> <chr>
# 1 1 a
# 2 1 a
# 3 1 a
# 4 1 a
# 5 2 b
# 6 2 b
# 7 3 c
# 8 3 c
# 9 3 c
#10 3 c
#11 3 c

We can also use data.table
library(data.table)
setDT(df)[, col1 := vec[as.character(col)]]
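Another tidyverse-only option (a sketch, not from the answers above) is to splice the named vector into dplyr::recode(); recode() matches character values against the names of vec, so col is converted to character first:
library(dplyr)
df %>% mutate(col = recode(as.character(col), !!!vec))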

Related

Create new column based on previous column by group; if missing, use NA

I am trying to select a value by group from one column and pass it as a value in another column, extending it to the whole group. This is similar to the question asked here. But some groups do not have this value; in that case, I need to fill the column with NAs. How can I do this?
Dummy example:
dd1 <- data.frame(type = c(1,1,1),
grp = c('a', 'b', 'd'),
val = c(1,2,3))
dd2 <- data.frame(type = c(2,2),
grp = c('a', 'b'),
val = c(8,2))
dd3 <- data.frame(type = c(3,3),
grp = c('b', 'd'),
val = c(7,4))
dd <- rbind(dd1, dd2, dd3)
Create new column:
dd %>%
group_by(type) %>%
mutate(#val_a = ifelse(grp == 'a', val , NA),
val_a2 = val[grp == 'a'])
Expected outcome:
type grp val val_a # pass into `val_a` the value of the group 'a'
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA # value for 'a' is missing from group 3
You were close with your first approach; use any() to check whether the condition holds for any observation in the group:
dd %>%
group_by(type) %>%
mutate(val_a = ifelse(any(grp == "a"), val[grp == "a"] , NA))
type grp val val_a
<dbl> <chr> <dbl> <dbl>
1 1 a 1 1
2 1 b 2 1
3 1 d 3 1
4 2 a 8 8
5 2 b 2 8
6 3 b 7 NA
7 3 d 4 NA
Try this:
dd %>%
group_by(type) %>%
mutate(val_a2 = val[which(c(grp == 'a'))[1]])
# # A tibble: 7 x 4
# # Groups: type [3]
# type grp val val_a2
# <dbl> <chr> <dbl> <dbl>
# 1 1 a 1 1
# 2 1 b 2 1
# 3 1 d 3 1
# 4 2 a 8 8
# 5 2 b 2 8
# 6 3 b 7 NA
# 7 3 d 4 NA
This also guards against the possibility of more than one match, which could otherwise produce bad results (with or without a warning).
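As a further sketch (not part of the original answers), match() expresses the same idea compactly: it returns NA when a group has no 'a' row, and indexing val with NA then yields NA for the whole group:
library(dplyr)
dd %>%
group_by(type) %>%
mutate(val_a3 = val[match("a", grp)]) %>%
ungroup()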

Replace multiple characters with multiple values in multiple columns? R [duplicate]

This question already has answers here:
Dictionary style replace multiple items
Another thread solved a similar problem very nicely, but what I would like to do is get rid of some redundancy in my similar problem.
Using their example:
df <- data.frame(name = rep(letters[1:3], each = 3), foo=rep(1:9),var1 = letters[1:3], var2 = rep(3:5, each = 3))
creates:
df
name foo var1 var2
1 a 1 a 3
2 a 2 a 3
3 a 3 a 3
4 b 4 b 4
5 b 5 b 4
6 b 6 b 4
7 c 7 c 5
8 c 8 c 5
9 c 9 c 5
But what do I need to do to replace multiple characters with unique values?
a=1
b=2
c=3
I tried:
df[,c(4,6)] <- lapply(df[,c(4,6)], function(x) replace(x,x %in% "a", 1),
replace(x,x %in% "b", 2),
replace(x,x %in% "c", 3))
and
z<- c("a","b","c")
y<- c(1,2,3)
df[,c(1,3)] <- lapply(df[,c(1,3)], function(x) replace(x,x %in% z, y))
But neither seems to work.
Thanks.
You can use dplyr::recode
df <- data.frame(name = rep(letters[1:3], each = 3), foo=rep(1:9),var1 = letters[1:3], var2 = rep(3:5, each = 3))
library(dplyr, warn.conflicts = FALSE)
df %>%
mutate(across(c(name, var1), ~ recode(., a = 1, b = 2, c = 3)))
#> name foo var1 var2
#> 1 1 1 1 3
#> 2 1 2 2 3
#> 3 1 3 3 3
#> 4 2 4 1 4
#> 5 2 5 2 4
#> 6 2 6 3 4
#> 7 3 7 1 5
#> 8 3 8 2 5
#> 9 3 9 3 5
Created on 2021-10-19 by the reprex package (v2.0.1)
across() applies the function defined by ~ recode(., a = 1, b = 2, c = 3) to both name and var1.
Using ~ and . is another way to define a function inside across(). It is equivalent to function(x) recode(x, a = 1, b = 2, c = 3), and you could pass that form to across() instead and get the same result. ?across calls this a "purrr-style lambda function", because the purrr package was the first to use formulas to define functions this way.
If you want to see the actual function created by the formula, you can look at rlang::as_function(~ recode(., a = 1, b = 2, c = 3)); it is a little more complex than the one above because it also supports ..1, ..2 and ..3, which are not used here.
Now that R supports the shorthand function syntax shown below, the purrr-style formula is arguably no longer necessary; writing it that way is mostly habit.
df <- data.frame(name = rep(letters[1:3], each = 3), foo=rep(1:9),var1 = letters[1:3], var2 = rep(3:5, each = 3))
library(dplyr, warn.conflicts = FALSE)
df %>%
mutate(across(c(name, var1), \(x) recode(x, a = 1, b = 2, c = 3)))
#> name foo var1 var2
#> 1 1 1 1 3
#> 2 1 2 2 3
#> 3 1 3 3 3
#> 4 2 4 1 4
#> 5 2 5 2 4
#> 6 2 6 3 4
#> 7 3 7 1 5
#> 8 3 8 2 5
#> 9 3 9 3 5
Created on 2021-10-19 by the reprex package (v2.0.1)
A simple for loop would do the trick:
for (i in 1:length(z)) {
df[df==z[i]] <- y[i]
}
df
name foo var1 var2
1 1 1 1 3
2 1 2 2 3
3 1 3 3 3
4 2 4 1 4
5 2 5 2 4
6 2 6 3 4
7 3 7 1 5
8 3 8 2 5
9 3 9 3 5
You could use a lookup vector combined with apply:
z <- c("a","b","c")
y <- c(1,2,3)
lookup <- setNames(y, z)
df[,c(1,3)] <- apply(df[,c(1,3)], 2, function(x) lookup[x])
df
This returns
name foo var1 var2
1 1 1 1 3
2 1 2 2 3
3 1 3 3 3
4 2 4 1 4
5 2 5 2 4
6 2 6 3 4
7 3 7 1 5
8 3 8 2 5
9 3 9 3 5
If you are open to a tidyverse approach:
library(tidyverse)
df_new <- df %>%
mutate(across(c(var1, name), ~case_when(. == 'a' ~ 1,
. == 'b' ~ 2,
. == 'c' ~ 3)))
df_new
name foo var1 var2
1 1 1 1 3
2 1 2 2 3
3 1 3 3 3
4 2 4 1 4
5 2 5 2 4
6 2 6 3 4
7 3 7 1 5
8 3 8 2 5
9 3 9 3 5
Note: this code works only if you map all values of the column. E.g. if there were a 'd' in your var1 column that you don't turn into a number, it would be changed to NA.
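If unmapped values should be kept rather than turned into NA, one option (a sketch, not from the original answer) is to add a TRUE fallback; note that case_when() needs a single output type, so everything stays character here:
df %>%
mutate(across(c(var1, name), ~case_when(. == 'a' ~ '1',
. == 'b' ~ '2',
. == 'c' ~ '3',
TRUE ~ as.character(.))))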
# Import data: df => data.frame
df <- data.frame(name = rep(letters[1:3], each = 3), foo=rep(1:9),var1 = letters[1:3], var2 = rep(3:5, each = 3))
# Function performing a mapping replacement:
# replaceMultipleValues => function()
replaceMultipleValues <- function(df, mapFrom, mapTo){
  # Extract the values in the data.frame:
  # dfVals => named character vector
  dfVals <- unlist(df)
  # Get all values in the mapping & data
  # and assign a name to them: tmp1 => named character vector
  tmp1 <- c(
    setNames(mapTo, mapFrom),
    setNames(dfVals, dfVals)
  )
  # Extract the unique values:
  # valueMap => named character vector
  valueMap <- tmp1[!(duplicated(names(tmp1)))]
  # Recode the values, coerce vectors to appropriate
  # types: res => data.frame
  res <- type.convert(
    data.frame(
      matrix(
        valueMap[dfVals],
        nrow = nrow(df),
        ncol = ncol(df),
        dimnames = dimnames(df)
      )
    )
  )
  # Explicitly define the returned object: data.frame => env
  return(res)
}
# Recode values in data.frame:
# res => data.frame
res <- replaceMultipleValues(
  df,
  c("a", "b", "c"),
  c("1", "2", "3")
)
# Print data.frame to console:
# data.frame => stdout(console)
res

Add together 2 dataframes in R without losing columns

I have 2 dataframes in R (df1, df2). df1 looks like this:
A C D
1 1 1
2 2 2
and df2 like this:
A B C
1 1 1
2 2 2
How can I merge these 2 dataframes to produce the following output?
A B C D
2 1 2 1
4 2 4 2
The output columns are sorted, and values in columns present in both data frames are added together. Both data frames have the same number of rows. Thank you in advance.
Code to create DF:
df1 <- data.frame("A" = 1:2, "C" = 1:2, "D" = 1:2)
df2 <- data.frame("A" = 1:2, "B" = 1:2, "C" = 1:2)
nm1 = names(df1)
nm2 = names(df2)
nm = intersect(nm1, nm2)
if (length(nm) == 0){ # if no column names in common
cbind(df1, df2)
} else { # if column names in common
cbind(df1[!nm1 %in% nm2], # columns only in df1
df1[nm] + df2[nm], # add columns common to both
df2[!nm2 %in% nm1]) # columns only in df2
}
# D A C B
#1 1 2 2 1
#2 2 4 4 2
You can try:
library(tidyverse)
list(df2, df1) %>%
map(rownames_to_column) %>%
bind_rows %>%
group_by(rowname) %>%
summarise_all(sum, na.rm = TRUE)
# A tibble: 2 x 5
rowname A B C D
<chr> <int> <int> <int> <int>
1 1 2 1 2 1
2 2 4 2 4 2
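summarise_all() still works but is superseded in current dplyr; an equivalent sketch with across() (same packages as above):
list(df2, df1) %>%
map(rownames_to_column) %>%
bind_rows %>%
group_by(rowname) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))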
By using left_join() from dplyr you won't lose any columns:
library(tidyverse)
dat1 <- tibble(a = 1:10,
b = 1:10,
c = 1:10)
dat2 <- tibble(c = 1:10,
d = 1:10,
e = 1:10)
left_join(dat1, dat2, by = "c")
#> # A tibble: 10 x 5
#> a b c d e
#> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 1
#> 2 2 2 2 2 2
#> 3 3 3 3 3 3
#> 4 4 4 4 4 4
#> 5 5 5 5 5 5
#> 6 6 6 6 6 6
#> 7 7 7 7 7 7
#> 8 8 8 8 8 8
#> 9 9 9 9 9 9
#> 10 10 10 10 10 10
Created on 2019-01-16 by the reprex package (v0.2.1)
allnames <- sort(unique(c(names(df1), names(df2))))
df3 <- data.frame(matrix(0, nrow = nrow(df1), ncol = length(allnames)))
names(df3) <- allnames
df3[,allnames %in% names(df1)] <- df3[,allnames %in% names(df1)] + df1
df3[,allnames %in% names(df2)] <- df3[,allnames %in% names(df2)] + df2
df3
A B C D
1 2 1 2 1
2 4 2 4 2
Here is a fun base R method with Reduce.
Reduce(cbind,
list(Reduce("+", list(df1[intersect(names(df1), names(df2))],
df2[intersect(names(df1), names(df2))])), # sum results
df1[setdiff(names(df1), names(df2))], # in df1, not df2
df2[setdiff(names(df2), names(df1))])) # in df2, not df1
This returns
A C D B
1 2 2 1 1
2 4 4 2 2
This assumes that both df1 and df2 have columns that are not present in the other. If this is not true, you'd have to adjust the list.
Note also that you could replace Reduce with do.call in both places and you'd get the same result.

bind_rows to each group of tibble

Consider the following two tibbles:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
So a and b have the same columns and b has an additional column called id.
I want to do the following: group b by id and then add tibble a on top of each group.
So the output should look like this:
# A tibble: 10 x 3
id time value
<chr> <int> <int>
1 a -1 100
2 a 0 200
3 a 1 1
4 a 2 2
5 a 3 3
6 b -1 100
7 b 0 200
8 b 1 4
9 b 2 5
10 b 3 6
Of course there are multiple workarounds to achieve this (like loops for example). But in my case I have a large number of IDs and a very large number of columns.
I would be thankful if anyone could point me towards the direction of a solution within the tidyverse.
Thank you
We can expand the data frame a with id from b and then bind_rows them together.
library(tidyverse)
a2 <- expand(a, id = b$id, nesting(time, value))
b2 <- bind_rows(a2, b) %>% arrange(id, time)
b2
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
split from base R will divide a data frame into a list of subsets based on an index.
b %>%
split(b[["id"]]) %>%
lapply(bind_rows, a) %>%
lapply(select, -"id") %>%
bind_rows(.id = "id")
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a 1 1
# 2 a 2 2
# 3 a 3 3
# 4 a -1 100
# 5 a 0 200
# 6 b 1 4
# 7 b 2 5
# 8 b 3 6
# 9 b -1 100
# 10 b 0 200
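Because each group's original rows are bound before a here, the row order differs from the expected output; if that matters, a final arrange() (a small addition, not in the original answer) restores it:
b %>%
split(b[["id"]]) %>%
lapply(bind_rows, a) %>%
lapply(select, -"id") %>%
bind_rows(.id = "id") %>%
arrange(id, time)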
An idea (via base R) is to split your data frame by id, prepend a (with that group's id added) to each piece, and rbind everything back together, i.e.
df = do.call(rbind, lapply(split(b, b$id), function(i)rbind(data.frame(id = i$id[1], a), i)))
which gives
id time value
a.1 a -1 100
a.2 a 0 200
a.3 a 1 1
a.4 a 2 2
a.5 a 3 3
b.1 b -1 100
b.2 b 0 200
b.3 b 1 4
b.4 b 2 5
b.5 b 3 6
NOTE: You can remove the rownames by simply calling rownames(df) <- NULL
We can nest and add the relevant rows to each nested item:
library(tidyverse)
b %>%
nest(-id) %>%
mutate(data= map(data,~bind_rows(a,.x))) %>%
unnest
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
Maybe not the most efficient way, but easy to follow:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
a.a <- a %>% add_column(id = rep("a", nrow(a)))
a.b <- a %>% add_column(id = rep("b", nrow(a)))
joint <- bind_rows(b,a.a,a.b)
(joint <- arrange(joint,id))
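Another tidyverse option (a sketch, assuming dplyr >= 0.8.1 for group_modify()) prepends a inside each group directly; group_modify() applies the function to each group's data (without the id column) and reattaches id afterwards:
library(tidyverse)
b %>%
group_by(id) %>%
group_modify(~ bind_rows(a, .x)) %>%
ungroup()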

R - How to replicate rows in a spark dataframe using sparklyr

Is there a way to replicate the rows of a Spark dataframe using sparklyr/dplyr functions?
sc <- spark_connect(master = "spark://####:7077")
df_tbl <- copy_to(sc, data.frame(row1 = 1:3, row2 = LETTERS[1:3]), "df")
This is the desired output, saved into a new spark tbl:
> df2_tbl
row1 row2
<int> <chr>
1 1 A
2 1 A
3 1 A
4 2 B
5 2 B
6 2 B
7 3 C
8 3 C
9 3 C
With sparklyr you can use array and explode as suggested by @Oli:
df_tbl %>%
mutate(arr = explode(array(1, 1, 1))) %>%
select(-arr)
# # Source: lazy query [?? x 2]
# # Database: spark_connection
# row1 row2
# <int> <chr>
# 1 1 A
# 2 1 A
# 3 1 A
# 4 2 B
# 5 2 B
# 6 2 B
# 7 3 C
# 8 3 C
# 9 3 C
and generalized:
library(rlang)
df_tbl %>%
mutate(arr = !!rlang::parse_quo(
paste("explode(array(", paste(rep(1, 3), collapse = ","), "))")
)) %>% select(-arr)
# # Source: lazy query [?? x 2]
# # Database: spark_connection
# row1 row2
# <int> <chr>
# 1 1 A
# 2 1 A
# 3 1 A
# 4 2 B
# 5 2 B
# 6 2 B
# 7 3 C
# 8 3 C
# 9 3 C
where you can easily adjust the number of rows.
The idea that comes to mind first is to use the explode function (it is exactly what it is meant for in Spark). Yet arrays do not seem to be supported in SparkR (to the best of my knowledge).
> structField("a", "array")
Error in checkType(type) : Unsupported type for SparkDataframe: array
I can however propose two other methods:
A straightforward but not very elegant one:
head(rbind(df, df, df), n=30)
# row1 row2
# 1 1 A
# 2 2 B
# 3 3 C
# 4 1 A
# 5 2 B
# 6 3 C
# 7 1 A
# 8 2 B
# 9 3 C
Or with a for loop for more genericity:
df2 = df
for(i in 1:2) df2=rbind(df, df2)
Note that this would also work with union.
The second, more elegant method (because it involves only one Spark operation) is based on a cross join (Cartesian product) with a dataframe of size 3 (or any other number):
j <- as.DataFrame(data.frame(s=1:3))
head(drop(crossJoin(df, j), "s"), n=100)
# row1 row2
# 1 1 A
# 2 1 A
# 3 1 A
# 4 2 B
# 5 2 B
# 6 2 B
# 7 3 C
# 8 3 C
# 9 3 C
I am not aware of a cluster-side version of R's rep function. We can, however, use a join to emulate it cluster-side.
df_tbl <- copy_to(sc, data.frame(row1 = 1:3, row2 = LETTERS[1:3]), "df")
replyr <- function(data, n, sc){
joiner_frame <- copy_to(sc, data.frame(joiner_index = rep(1,n)), "tmp_joining_frame", overwrite = TRUE)
data %>%
mutate(joiner_index = 1) %>%
left_join(joiner_frame) %>%
select(-joiner_index)
}
df_tbl2 <- replyr(df_tbl, 3, sc)
# row1 row2
# <int> <chr>
# 1 1 A
# 2 1 A
# 3 1 A
# 4 2 B
# 5 2 B
# 6 2 B
# 7 3 C
# 8 3 C
# 9 3 C
It gets the job done, but it is a bit dirty since the tmp_joining_frame will persist. I'm not sure how well this will work given lazy evaluation on multiple calls to the function.
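For comparison only, on a plain local data frame (not a Spark tbl) the same replication is a one-liner; this sketch does not run cluster-side:
df_local <- data.frame(row1 = 1:3, row2 = LETTERS[1:3])
df_local[rep(seq_len(nrow(df_local)), each = 3), ] # repeat each row three times
# or, with tidyr:
tidyr::uncount(df_local, 3)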
