Replace value from other dataframe - r

I have a data frame (x) with a factor variable whose values are separated by commas. I have another data frame (y) with a description for each of those values. I want to replace the values in data frame (x) with the descriptions from data frame (y). Any help would be highly appreciated.
For example, the two data frames look like this:
data frame (x)
s.no x
1 2,5,45
2 35,5
3 45
data frame (y)
s.no x description
1 2 a
2 5 b
3 45 c
4 35 d
I need the output as below
s.no x
1 a,b,c
2 d,b
3 c

With splitstackshape:
library(splitstackshape)
cSplit(x, 'x', ',', 'long')[setDT(y), on='x'][,.(x=paste(description, collapse=',')), s.no]
# s.no x
#1: 1 a,b,c
#2: 2 b,d
#3: 3 c

A solution using dplyr and tidyr:
library(dplyr)
library(tidyr)
x %>%
separate(x, paste0('x',1:3),',',convert=TRUE) %>%
gather(var, x, -1, na.rm=TRUE) %>%
left_join(., y, by='x') %>%
group_by(s.no = s.no.x) %>%
summarise(x = paste(description,collapse = ','))
the result:
s.no x
(int) (chr)
1 1 a,b,c
2 2 d,b
3 3 c

We can split the 'x' column in 'x' dataset by ',', loop over the list, match the value with the 'x' column in 'y' to get the numeric index, get the corresponding 'description' value from 'y' and paste it together.
x$x <- sapply(strsplit(x$x, ","), function(z)
toString(y$description[match(as.numeric(z), y$x)]))
x
# s.no x
#1 1 a, b, c
#2 2 d, b
#3 3 c
NOTE: If the 'x' column in 'x' is of class factor, use strsplit(as.character(x$x), ",")
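As a self-contained illustration of the match-based lookup, here is a minimal sketch that rebuilds the two example data frames with 'x' stored as a factor (as the question describes) and joins the descriptions with paste so that no space follows the commas. The helper name `lookup` is hypothetical and only used for this example.

```r
# Example data from the question; stringsAsFactors = TRUE makes x$x a factor
x <- data.frame(s.no = 1:3, x = c("2,5,45", "35,5", "45"),
                stringsAsFactors = TRUE)
y <- data.frame(s.no = 1:4, x = c(2, 5, 45, 35),
                description = c("a", "b", "c", "d"),
                stringsAsFactors = FALSE)

# Hypothetical helper: map one vector of codes to a comma-joined description
lookup <- function(codes) {
  paste(y$description[match(as.numeric(codes), y$x)], collapse = ",")
}

# as.character() is needed because strsplit() does not accept a factor
x$x <- vapply(strsplit(as.character(x$x), ","), lookup, character(1))
x
```

Using vapply instead of sapply is a design choice here: it guarantees a character vector with one element per row, even for an empty input.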

Related

R: Repeating row of dataframe with respect to multiple count columns

I have a R DataFrame that has a structure similar to the following:
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0), f1 = c('a', 'b'), f2=c('c', 'd') )
So visually the DataFrame would look like
> df
var1 var2 var3 f1 f2
1 1 0 3 a c
2 1 2 0 b d
What I want to do is the following:
(1) Treat the first C=3 columns as counts for three different classes. (C is the number of classes, given as an input variable.) Add a new column called "class".
(2) For each row, duplicate the last two entries of the row according to the count of each class (separately); and append the class number to the new "class" column.
For example, the output for the above dataset would be
> df_updated
f1 f2 class
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
where row (a c) is duplicated 4 times, 1 time with respect to class 1, and 3 times with respect to class 3; row (b d) is duplicated 3 times, 1 time with respect to class 1 and 2 times with respect to class 2.
I tried looking at previous posts on duplicating rows based on counts (e.g. this link), and I could not figure out how to adapt the solutions there to multiple count columns (and also appending another class column).
Also, my actual dataset has many more rows and classes (say 1000 rows and 20 classes), so ideally I want a solution that is as efficient as possible.
I wonder if anyone can help me on this. Thanks in advance.
Here is a tidyverse option. We can use uncount from tidyr to duplicate the rows according to the count in value (i.e., from the var columns) after pivoting to long format.
library(tidyverse)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
uncount(value) %>%
mutate(class = str_extract(class, "\\d+"))
Output
f1 f2 class
<chr> <chr> <chr>
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
Another slight variation is to use expandRows from splitstackshape in conjunction with tidyverse.
library(splitstackshape)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
expandRows("value") %>%
mutate(class = str_extract(class, "\\d+"))
base R
Row order (and row names) notwithstanding:
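# keep the count in `n` and take the class from the column name; in the sample data the two only coincide by chance
tmp <- subset(reshape2::melt(df, id.vars = c("f1","f2"), variable.name = "class", value.name = "n"), n > 0)
tmp$class <- as.numeric(sub("var", "", tmp$class))
tmp[rep(seq_len(nrow(tmp)), times = tmp$n), c("f1", "f2", "class")]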
# f1 f2 class
# 1 a c 1
# 2 b d 1
# 4 b d 2
# 4.1 b d 2
# 5 a c 3
# 5.1 a c 3
# 5.2 a c 3
dplyr
library(dplyr)
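library(tidyr) # pivot_longer
df %>%
  pivot_longer(-c(f1, f2), values_to = "n") %>%
  dplyr::filter(n > 0) %>%
  slice(rep(row_number(), times = n)) %>%
  # derive the class from the column name; the count `n` equals it only by coincidence in the sample
  transmute(f1, f2, class = as.numeric(sub("var", "", name)))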
# # A tibble: 7 x 3
# f1 f2 class
# <chr> <chr> <dbl>
# 1 a c 1
# 2 a c 3
# 3 a c 3
# 4 a c 3
# 5 b d 1
# 6 b d 2
# 7 b d 2
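For comparison, here is a minimal base R sketch that takes the number of classes C as an input, as the question asks, and derives the class from the column position rather than from the count value. The intermediate names (`counts`, `row_idx`, `class_idx`) are chosen for illustration, and it assumes the first C columns are the count columns.

```r
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0),
                 f1 = c("a", "b"), f2 = c("c", "d"))
C <- 3  # number of classes (assumed to be the first C columns)

counts <- as.matrix(df[, seq_len(C)])
# Transposing first makes as.vector() walk row by row, class by class
n <- as.vector(t(counts))
row_idx   <- rep(seq_len(nrow(df)), each = C)   # source row of each count
class_idx <- rep(seq_len(C), times = nrow(df))  # class of each count

# Repeat each (row, class) pair as many times as its count says
df_updated <- data.frame(df[rep(row_idx, times = n), -seq_len(C)],
                         class = rep(class_idx, times = n))
rownames(df_updated) <- NULL
df_updated
```

Because everything is done with rep on integer indices, the cost grows with the number of output rows, which should be fine for something like 1000 rows and 20 classes.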

Simultaneous Count and Sort in R

I am trying to obtain counts of a categorical variable in 2 separate columns, with each column reflecting the presence or absence of an indicator variable. This is for a very large data frame. Here is an example data frame to further illustrate what I'm trying to do.
X <- (1:10)
Y <- c('a','b','a','c','b','b','a','a','c','c')
Z <- c(0,1,1,1,0,1,0,1,1,1)
test_df <- data.frame(X,Y,Z)
I would like to make a new data frame grouped by 'a', 'b', and 'c' with 2 columns to the right: one with the count of that letter for Z==1 and one with the count of that letter for Z==0.
The dplyr way:
library(dplyr)
library(tidyr)
#Code
res <- test_df %>% group_by(Y,Z) %>% summarise(N=n()) %>%
pivot_wider(names_from = Z,values_from=N,
values_fill = 0)
Output:
# A tibble: 3 x 3
# Groups: Y [3]
Y `0` `1`
<chr> <int> <int>
1 a 2 2
2 b 1 2
3 c 0 3
We can use values_fn in pivot_wider to do this in a single step
library(dplyr)
library(tidyr)
test_df %>%
pivot_wider(names_from = Z, values_from = X,
values_fn = length, values_fill = 0)
# A tibble: 3 x 3
# Y `0` `1`
# <chr> <int> <int>
#1 a 2 2
#2 b 1 2
#3 c 0 3
A base R option using aggregate + reshape
replace(
u <- reshape(
aggregate(X ~ ., test_df, length),
idvar = "Y",
timevar = "Z",
direction = "wide"
),
is.na(u),
0
)
giving
Y X.0 X.1
1 a 2 2
2 b 1 2
5 c 0 3
One way with data.table:
library(data.table)
setDT(test_df)
test_df[ , z1 := sum(Z==1), by=Y]
test_df[ , z0 := sum(Z==0), by=Y]
In base R you can use table :
table(test_df$Y, test_df$Z)
# 0 1
# a 2 2
# b 1 2
# c 0 3
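Since the question asks for a new data frame rather than a table object, the table result can be converted. The following is a minimal sketch; the column names `z0` and `z1` are chosen for illustration.

```r
test_df <- data.frame(
  X = 1:10,
  Y = c("a", "b", "a", "c", "b", "b", "a", "a", "c", "c"),
  Z = c(0, 1, 1, 1, 0, 1, 0, 1, 1, 1)
)

# Rows are the levels of Y, columns are the values of Z ("0" and "1")
tab <- table(test_df$Y, test_df$Z)

# as.vector() drops the names so the data frame gets plain row numbers
counts <- data.frame(
  Y  = rownames(tab),
  z0 = as.vector(tab[, "0"]),
  z1 = as.vector(tab[, "1"])
)
counts
```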

Create a new column in dataframe containing variables of another column, based on dataframe subsets [duplicate]

I have a dataframe (df) and am trying to add column z that contains a list of the qualitative elements from column y, but only the elements that are present when grouping the rows by column x.
df <- data.frame('x'=c("a","a","b","b"), 'y'=c("a","c","c","b"))
x y
1 a a
2 a c
3 b c
4 b b
#Desired outcome;
df <- data.frame(df, 'z'=c("a,c", "a,c", "c,b", "c,b"))
x y z
1 a a a,c
2 a c a,c
3 b c c,b
4 b b c,b
I know there are a bunch of questions here on how to add/create new columns in a dataframe, but I couldn't find any involving subsetting. I was thinking of using the dplyr package and filter() or mutate(), or aggregating the elements with aggregate(), but have had no success. My attempts:
library(dplyr)
z <- for (i in row.names(df)) {
filter(df, x == unique(i))
df[ ,3] <- levels(df$y)
}
z <- aggregate(x = df, by = as.list(df$x), FUN = levels)
Much thanks.
We can paste after grouping by 'x'
library(dplyr)
df %>%
group_by(x) %>%
mutate(z = toString(y))
# A tibble: 4 x 3
# Groups: x [2]
# x y z
# <fct> <fct> <chr>
#1 a a a, c
#2 a c a, c
#3 b c c, b
#4 b b c, b
aggregate returns a summarised output and if we need to create a column with base R, use ave
df$z <- with(df, ave(as.character(y), x, FUN = toString))
If we don't need the space after the comma (toString(x) is equivalent to paste(x, collapse=", ")), we can paste with our own separator:
df$z <- with(df, ave(as.character(y), x, FUN = function(x) paste(x, collapse=",")))

Using dplyr mutate_at when a function takes multiple arguments which are different columns

I have a data.frame with a large number of columns whose names follow a pattern. Such as:
df <- data.frame(
x_1 = c(1, NA, 3),
x_2 = c(1, 2, 4),
y_1 = c(NA, 2, 1),
y_2 = c(5, 6, 7)
)
I would like to apply mutate_at to perform the same operation on each pair of columns. As in:
df %>%
mutate(
x = ifelse(is.na(x_1), x_2, x_1),
y = ifelse(is.na(y_1), y_2, y_1)
)
Is there a way I can do that with mutate_at/mutate_each?
This:
df %>%
mutate_each(vars(x_1, y_1), funs(ifelse(is.na(.), vars(x_2, y_2), .)))
and various variations I've tried all fail.
The question is similar to Using functions of multiple columns in a dplyr mutate_at call, but different in that the second argument to the function call is not a single column, but a different column for each column in vars.
Thanks in advance.
I don't know if you can get it that way, but here's a different perspective on the problem. If you find yourself with really wide data (e.g., tons of columns with similar names) and you want to do something with them, it might help to tidy the data (long in stata terms) with tidyr::gather (see docs here http://tidyr.tidyverse.org/).
> df %>% gather()
key value
1 x_1 1
2 x_1 NA
3 x_1 3
4 x_2 1
5 x_2 2
6 x_2 4
7 y_1 NA
8 y_1 2
9 y_1 1
10 y_2 5
11 y_2 6
12 y_2 7
After converting the data to this format, it's easier to combine and rearrange values using group_by instead of trying to mutate_at things. E.g., you can get the leading letter of each key with df %>% gather() %>% mutate(var = substr(key,1,1)) and manipulate the xs and ys differently using group_by(var).
Old question, but I agree with Jesse that you need to tidy your data a bit. gather would be the way to go, but it lacks the ability of stats::reshape to specify groups of columns to gather. So here's a solution with reshape:
df %>%
reshape(varying = list(c("x_1", "y_1"), c("x_2", "y_2")),
times = c("x", "y"),
direction = "long") %>%
mutate(x = ifelse(is.na(x_1), x_2, x_1)) %>%
reshape(idvar = "id",
timevar = "time",
direction = "wide") %>%
rename_all(funs(gsub("[a-zA-Z]+(_*)([0-9]*)\\.([a-zA-Z]+)", "\\3\\1\\2", .)))
# id x_1 x_2 x y_1 y_2 y
# 1 1 1 1 1 NA 5 5
# 2 2 NA 2 2 2 6 2
# 3 3 3 4 3 1 7 1
In order to do that with any number of column pairs, you could do something like:
df2 <- setNames(cbind(df, df), c(t(outer(letters[23:26], 1:2, paste, sep = "_"))))
v <- split(names(df2), purrr::map_chr(names(df2), ~ gsub(".*_(.*)", "\\1", .)))
n <- unique(purrr::map_chr(names(df2), ~ gsub("_[0-9]+", "", .) ))
df2 %>%
reshape(varying = v,
times = n,
direction = "long") %>%
mutate(x = ifelse(is.na(!!sym(v[[1]][1])), !!sym(v[[2]][1]), !!sym(v[[1]][1]))) %>%
reshape(idvar = "id",
timevar = "time",
direction = "wide") %>%
rename_all(funs(gsub("[a-zA-Z]+(_*)([0-9]*)\\.([a-zA-Z]+)", "\\3\\1\\2", .)))
# id w_1 w_2 w x_1 x_2 x y_1 y_2 y z_1 z_2 z
# 1 1 1 1 1 NA 5 5 1 1 1 NA 5 5
# 2 2 NA 2 2 2 6 2 NA 2 2 2 6 2
# 3 3 3 4 3 1 7 1 3 4 3 1 7 1
This assumes that the columns to be compared are next to each other, that all columns with possible NA values are suffixed by _1, and that the columns holding the replacement values are suffixed by _2.
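Under the same naming assumption (_1 holds the values that may be NA, _2 holds the replacements), the pairing can also be sketched with a plain base R loop over the column prefixes. This is an alternative illustration, not the reshape approach above; the names `prefixes`, `first` and `second` are hypothetical.

```r
df <- data.frame(
  x_1 = c(1, NA, 3), x_2 = c(1, 2, 4),
  y_1 = c(NA, 2, 1), y_2 = c(5, 6, 7)
)

# Unique prefixes before the "_<number>" suffix: "x" and "y"
prefixes <- unique(sub("_[0-9]+$", "", names(df)))

for (p in prefixes) {
  first  <- df[[paste0(p, "_1")]]  # may contain NA
  second <- df[[paste0(p, "_2")]]  # replacement values
  df[[p]] <- ifelse(is.na(first), second, first)
}
df
```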
When I asked this question, the answer was "you can't!" That's no longer the answer, since tidyr now supports pivot_wider and pivot_longer.
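A minimal sketch of how that might look with pivot_longer and pivot_wider, assuming the column names follow the prefix_number pattern from the question. The `id` column and the intermediate names `var` and `value` are introduced only for this example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  x_1 = c(1, NA, 3), x_2 = c(1, 2, 4),
  y_1 = c(NA, 2, 1), y_2 = c(5, 6, 7)
)

res <- df %>%
  mutate(id = row_number()) %>%
  # "x_1" becomes var = "x" with its value in column `1`; "x_2" lands in column `2`
  pivot_longer(-id, names_to = c("var", ".value"), names_sep = "_") %>%
  # take `1` where it is present, otherwise fall back to `2`
  mutate(value = coalesce(`1`, `2`)) %>%
  select(id, var, value) %>%
  pivot_wider(names_from = var, values_from = value)
res
```

The result has one combined column per prefix (x and y here); it can be bound back to the original columns with bind_cols if those are still needed.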

Select rows based on non-directed combinations of columns

I am trying to select the maximum value in a dataframe's third column based on the combinations of the values in the first two columns.
My problem is similar to this one but I can't find a way to implement what I need.
EDIT: Sample data changed to make the column names more obvious.
Here is some sample data:
library(tidyr)
set.seed(1234)
df <- data.frame(group1 = letters[1:4], group2 = letters[1:4])
df <- df %>% expand(group1, group2)
df <- subset(df, subset = group1!=group2)
df$score <- runif(n = 12,min = 0,max = 1)
df
# A tibble: 12 × 3
group1 group2 score
<fctr> <fctr> <dbl>
1 a b 0.113703411
2 a c 0.622299405
3 a d 0.609274733
4 b a 0.623379442
5 b c 0.860915384
6 b d 0.640310605
7 c a 0.009495756
8 c b 0.232550506
9 c d 0.666083758
10 d a 0.514251141
11 d b 0.693591292
12 d c 0.544974836
In this example rows 1 and 4 are 'duplicates'. I would like to select row 4 as the value in the score column is larger than in row 1. Ultimately I would like a dataframe to be returned with the group1 and group2 columns and the maximum value in the score column. So in this example, I expect there to be 6 rows returned.
How can I do this in R?
I'd prefer dealing with this problem in two steps:
library(dplyr)
# Create function for computing group IDs from data frame of groups (per column)
get_group_id <- function(groups) {
apply(groups, 1, function(row) {
paste0(sort(row), collapse = "_")
})
}
group_id <- get_group_id(select(df, -score))
# Perform the computation
df %>%
mutate(groupId = group_id) %>%
group_by(groupId) %>%
slice(which.max(score)) %>%
ungroup() %>%
select(-groupId)
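For the two-column case, the order-independent group ID can also be sketched in base R with pmin and pmax. This is an alternative illustration rather than the approach above; the sample data is rebuilt with expand.grid, so the row order and scores will not match the tibble shown in the question, and the names `pair` and `best` are hypothetical.

```r
set.seed(1234)
df <- expand.grid(group1 = letters[1:4], group2 = letters[1:4],
                  stringsAsFactors = FALSE)
df <- df[df$group1 != df$group2, ]
df$score <- runif(nrow(df))

# Order-independent key: the alphabetically smaller label always comes first
pair <- paste(pmin(df$group1, df$group2), pmax(df$group1, df$group2), sep = "_")

# Keep the rows whose score equals the maximum within their pair
best <- df[df$score == ave(df$score, pair, FUN = max), ]
best
```

If two rows of a pair could tie on score, this would keep both; which.max, as used in the answer above, keeps only the first.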

Resources