assign groups depending on pairwise combinations within string - r

I have a big df with the following structure
df <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7), name = c("aa", "ab", "ac", "aa", "aab", "aac", "aabc")), .Names = c("id", "name"), row.names = c(NA, -7L), class = "data.frame")
df
id name
1 1 aa
2 2 ab
3 3 ac
4 4 aa
5 5 aab
6 6 aac
7 7 aabc
I would like to create a new column group depending two character strings in column name (here aa, ab, ac) to achieve something like
df
id name group
1 1 aa 1
2 2 ab 2
3 3 ac 3
4 4 aa 1
5 5 aab 1
5 5 aab 2
6 6 aac 1
6 6 aac 3
7 7 aabc 1
7 7 aabc 2
7 7 aabc 3
While assigning groups for the two character strings is straight forward I struggle to find an efficient way to include the pairwise combinations of longer strings. I thought about splitting each string with nchar>2 into all the possible pairwise combinations and assign them to the respective groups but wonder if there is a better way.
Further notes
only pairwise combinations found in df (not all possible combinations)
order of the two character string does not matter (e.g. ab=ba)
only unique recombinations of longer strings (e.g. aaab is just aa and ab) d
Similar question without the recombination problem Assigning groups using grepl with multiple inputs

How about the following
# Your data
df <- structure(
list(
id = c(1, 2, 3, 4, 5, 6, 7),
name = c("aa", "ab", "ac", "aa", "aab", "aac", "aabc")),
.Names = c("id", "name"), row.names = c(NA, -7L), class = "data.frame")
# Create all possible 2char combinations from unique chars in string
group <- lapply(strsplit(df$name, ""), function(x)
unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))));
# Melt and add original data
require(reshape2);
df2 <- melt(group);
df2 <- cbind.data.frame(
df2,
df[match(df2$L1, df$id), ]);
df2$group <- as.numeric(as.factor(df2$value));
df2;
# value L1 id name group
#1 aa 1 1 aa 1
#2 ab 2 2 ab 2
#3 ac 3 3 ac 3
#4 aa 4 4 aa 1
#5 aa 5 5 aab 1
#5.1 ab 5 5 aab 2
#6 aa 6 6 aac 1
#6.1 ac 6 6 aac 3
#7 aa 7 7 aabc 1
#7.1 ab 7 7 aabc 2
#7.2 ac 7 7 aabc 3
#7.3 bc 7 7 aabc 4
Explanation: strsplit splits the strings from df$name into char vectors. combn creates all 2-char combinations based on those char vectors. paste0 and unique keeps the concatenated unique 2-char combinations.
Note that this almost reproduces your example. That's because in my case, aabc also gives rise to group 4 = bc.
Update 1
You can filter entries based on a list of 2-char comparisons
# Filter entries
filter <- c("aa", "ab", "ac");
df2 <- df2[df2$value %in% filter, ]
# Clean up df2 to be consistent with OPs request
df2 <- df2[, -(1:2)];
df2;
# id name group
#1 1 aa 1
#2 2 ab 2
#3 3 ac 3
#4 4 aa 1
#5 5 aab 1
#5.1 5 aab 2
#6 6 aac 1
#6.1 6 aac 3
#7 7 aabc 1
#7.1 7 aabc 2
#7.2 7 aabc 3
Update 2
You can also create a filter dynamically, by selecting those value entries that are represented as 2-char strings in the original dataframe (in this case aa, ab and ac).
filter <- unique(unlist(group[sapply(group, function(x) length(x) == 1)]));

Related

grouping values and bringing values into character string

This is an example we can work with:
df <- tibble(a = c("aaa", "aaa", "aaa", "bbb", "bbb", "bbb"),
b = c(1,2,3,1,2,3), c = c(5,10,15,100,95,90))
df
# A tibble: 6 x 3
a b c
<chr> <dbl> <dbl>
1 aaa 1 5
2 aaa 2 10
3 aaa 3 15
4 bbb 1 100
5 bbb 2 95
6 bbb 3 90
I want to group the values in column a and combine column b and c to a single string. The final result should look exactly like this:
# A tibble: 2 x 2
a result
<chr> <chr>
1 aaa {"1":5,"2":10,"3":15}
2 bbb {"1":100,"2":95,"3":90}
and reads like this: aaa has 5 times 1, 10 times 2 and 15 times 3 etc.
I dont even know where to start honestly. Maybe with some dplyr::group_by and paste? I am glad for any hint.
Using the entire artillery paste, toString, sprintf.
b <- by(df[-1], df$a, function(x)
sprintf("{%s}", toString(Reduce(paste0, c(x, "'", "':")[c(3, 1, 4, 2)]))))
data.frame(a=unique(df$a), result=do.call(rbind, as.list(b)), row.names=NULL)
# a result
# 1 aaa {'1':5, '2':10, '3':15}
# 2 bbb {'1':100, '2':95, '3':90}
Note: You may also use '"' quotes, will get escaped in display though.

add recursive number with condition in dataframe R

i have problem with add some records table with particular condition.
for example, i have this kind of table
id word count
1 1 aa 2
2 2 bb 3
then, i want to change and add some number in id column with similar data for other column like this
id word count
1 100 aa 2
2 101 aa 2
3 102 aa 2
4 103 aa 2
5 200 bb 3
6 201 bb 3
7 202 bb 3
8 203 bb 3
the id column need to add with 2 digits in behind and then add recursive number after without changing other column data. Supposed that i have thousand records, i wonder how to make this happen.
It is not entirely clear from the description. Based on the expected output, an option is to create a list column by looping over the 'id', get the sequence after multiplying by '4' and then unnest the list column
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
mutate(id = map(id*100, seq, length.out = 4)) %>%
unnest(c(id))
# A tibble: 8 x 3
# id word count
# <dbl> <chr> <int>
#1 100 aa 2
#2 101 aa 2
#3 102 aa 2
#4 103 aa 2
#5 200 bb 3
#6 201 bb 3
#7 202 bb 3
#8 203 bb 3
Or another option is to replicate the rows (uncount), grouped by 'word', modify the 'id'
df1 %>%
uncount(4) %>%
group_by(word) %>%
mutate(id = seq(100 * first(id), length.out = n()))
data
df1 <- structure(list(id = 1:2, word = c("aa", "bb"), count = 2:3),
class = "data.frame", row.names = c("1",
"2"))
Try the following base R function.
It loops (lapply) over column 'id' creating a vector like the one in the question and then putting the other columns in order in a data.frame, then combines (rbind) all these df's into the return value.
fun <- function(x, n = 3){
cols <- grep('id', names(x), invert = TRUE)
out <- lapply(x[['id']], function(i){
y <- sprintf(paste0(i, "%02d"), c(0L, seq.int(n)))
y <- data.frame(id = y)
for(j in cols) y[[j]] <- x[i, j]
y
})
out <- do.call(rbind, out)
row.names(out) <- NULL
out
}
fun(df1)
# id V2 V3
#1 100 aa 2
#2 101 aa 2
#3 102 aa 2
#4 103 aa 2
#5 200 bb 3
#6 201 bb 3
#7 202 bb 3
#8 203 bb 3
Data
df1 <- read.table(text = "
id word count
1 1 aa 2
2 2 bb 3
", header = TRUE)

looping and condition on string in r

I would like to create a new column based condition below:
if the `str` column only contains `A` then insert `A`
if the `str` column only contains `B` then insert `B`
if the `str` column only contains `A` and `B` then insert `AB`
df<-read.table(text="
ID str
1 A
1 A
1 AA
1 ABB
2 BA
2 BB", header=T)
ID str simplify_str
1 A A
1 A A
1 AA A
1 ABB AB
2 BA AB
2 BB B
As far as tidyverse options are concerned, you could use dplyr::case_when with stringr::str_detect
library(dplyr)
library(stringr)
df %>%
mutate(simplify_str = case_when(
str_detect(str, "^A+$") ~ "A",
str_detect(str, "^B+$") ~ "B",
TRUE ~ "AB"))
# ID str simplify_str
#1 1 A A
#2 1 A A
#3 1 AA A
#4 1 ABB AB
#5 2 BA AB
#6 2 BB B
Using your data.frame:
As <- grep("A",df$str)
Bs <- grep("B",df$str)
df$simplify_str <- ""
df$simplify_str[As] <- paste0(df$simplify_str[As],"A")
df$simplify_str[Bs] <- paste0(df$simplify_str[Bs],"B")
df
ID str simplify_str
1 1 A A
2 1 A A
3 1 AA A
4 1 ABB AB
5 2 BA AB
6 2 BB B
A general solution in base R where it splits the string and pastes together the unique characters in a sorted way.
df$simplify_str <- sapply(strsplit(as.character(df$str), ""),
function(x) paste(unique(sort(x)), collapse = ""))
df
# ID str simplify_str
#1 1 A A
#2 1 A A
#3 1 AA A
#4 1 ABB AB
#5 2 BA AB
#6 2 BB B

use value from a col to choose value from another col, put into new df in R

I have a df like this
name <- c("Fred","Mark","Jen","Simon","Ed")
a_or_b <- c("a","a","b","a","b")
abc_ah_one <- c(3,5,2,4,7)
abc_bh_one <- c(5,4,1,9,8)
abc_ah_two <- c(2,1,3,7,6)
abc_bh_two <- c(3,6,8,8,5)
abc_ah_three <- c(5,4,7,6,2)
abc_bh_three <- c(9,7,2,1,4)
def_ah_one <- c(1,3,9,2,7)
def_bh_one <- c(2,8,4,6,1)
def_ah_two <- c(4,7,3,2,5)
def_bh_two <- c(5,2,9,8,3)
def_ah_three <- c(8,5,3,5,2)
def_bh_three <- c(2,7,4,3,0)
df <- data.frame(name,a_or_b,abc_ah_one,abc_bh_one,abc_ah_two,abc_bh_two,
abc_ah_three,abc_bh_three,def_ah_one,def_bh_one,
def_ah_two,def_bh_two,def_ah_three,def_bh_three)
I want to use the value in column "a_or_b" to choose the values in each of the corresponding "ah/bh" columns for each "abc" (one, two, and three), and put it into a new data frame. For example, Fred would have the values 3, 2 and 5 in his row in the new df. Those values represent the values of each of his "ah" categories for the abc columns. Jen, who has "b" in her a_or_b column, would have all of her "bh" values from her abc columns for her row in the new df. Here is what my desired output would look like:
combo_one <- c(3,5,1,4,8)
combo_two <- c(2,1,8,7,5)
combo_three <- c(5,4,2,6,4)
df2 <- data.frame(name,a_or_b,combo_one,combo_two,combo_three)
I've attempted this using sapply. The following gives me a matrix of the correct column correct indexes of df[grep("abc",colnames(df),fixed=TRUE)] for each row:
sapply(paste0(df$a_or_b,"h"),grep,colnames(df[grep("abc",colnames(df),fixed=TRUE)]))
First we gather your data into a tidy long format, then break out the columns into something useful. After that the filtering is simple, and if necessary we can convert back to an difficult wide format:
library(dplyr)
library(tidyr)
gather(df, key = "var", value = "val", -name, -a_or_b) %>%
separate(var, into = c("combo", "h", "ind"), sep = "_") %>%
mutate(h = substr(h, 1, 1)) %>%
filter(a_or_b == h, combo == "abc") %>%
arrange(name) -> result_long
result_long
# name a_or_b combo h ind val
# 1 Ed b abc b one 8
# 2 Ed b abc b two 5
# 3 Ed b abc b three 4
# 4 Fred a abc a one 3
# 5 Fred a abc a two 2
# 6 Fred a abc a three 5
# 7 Jen b abc b one 1
# 8 Jen b abc b two 8
# 9 Jen b abc b three 2
# 10 Mark a abc a one 5
# 11 Mark a abc a two 1
# 12 Mark a abc a three 4
# 13 Simon a abc a one 4
# 14 Simon a abc a two 7
# 15 Simon a abc a three 6
spread(result_long, key = ind, value = val) %>%
select(name, a_or_b, one, two, three)
# name a_or_b one two three
# 1 Ed b 8 5 4
# 2 Fred a 3 2 5
# 3 Jen b 1 8 2
# 4 Mark a 5 1 4
# 5 Simon a 4 7 6
Base R approach would be using lapply, we loop through each row of the dataframe, create a string to find similar columns using paste0 based on a_or_b column and then rbind all the values together for each row.
new_df <- do.call("rbind", lapply(seq(nrow(df)), function(x)
setNames(df[x, grepl(paste0("abc_",df[x,"a_or_b"], "h"), colnames(df))],
c("combo_one", "combo_two", "combo_three"))))
new_df
# combo_one combo_two combo_three
#1 3 2 5
#2 5 1 4
#3 1 8 2
#4 4 7 6
#5 8 5 4
We can cbind the required columns then :
cbind(df[c(1, 2)], new_df)
# name a_or_b combo_one combo_two combo_three
#1 Fred a 3 2 5
#2 Mark a 5 1 4
#3 Jen b 1 8 2
#4 Simon a 4 7 6
#5 Ed b 8 5 4
It's possible to do this with a combination of map and mutate:
require(tidyverse)
df %>%
select(name, a_or_b, starts_with("abc")) %>%
rename_if(is.numeric, funs(sub("abc_", "", .))) %>%
mutate(combo_one = map_chr(a_or_b, ~ paste0(.x,"h_one")),
combo_one = !!combo_one,
combo_two = map_chr(a_or_b, ~ paste0(.x,"h_two")),
combo_two = !!combo_two,
combo_three = map_chr(a_or_b, ~ paste0(.x,"h_three")),
combo_three = !!combo_three) %>%
select(name, a_or_b, starts_with("combo"))
Output:
name a_or_b combo_one combo_two combo_three
1 Fred a 3 2 5
2 Mark a 5 1 4
3 Jen b 1 8 2
4 Simon a 4 7 6
5 Ed b 8 5 4

R: Conditionally replacing values based on column pre-fixes and suffixes

I have two data frames. Data frame A has many observations/rows, an ID for each observation, and many additional columns. For a subset of observations X, the values for a set of columns are missing/NA. Data frame B contains a subset of the observations in X (which can be matched across data frames using the ID) and variables with identical names as in data frame A, but containing values to replace the missing values in the set of columns with missing/NA.
My code below (using a join operation) merely adds columns rather than replacing missing values. For each of the additional variables (let's name them W) in B, the resulting table produces W.x and W.y.
library(dplyr)
foo <- data.frame(id = seq(1:6), x = c(NA, NA, NA, 1, 3, 8), z = seq_along(10:15))
bar <- data.frame(id = seq(1:2), x = c(10, 9))
dplyr::left_join(x = foo, y = bar, by = "id")
I am trying to replace the missing values in A using the values in B based on the ID, but do so in an efficient manner since I have many columns and many rows. My goal is this:
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
One thought was to use ifelse() after joining, but typing out ifelse() functions for all of the variables is not feasible. Is there a way to do this simply without the database join or is there a way to apply a function across all columns ending in .x to replace the values in .x with the value in .y if the value in .x is missing?
Another attempt which should essentially only be one assignment operation. Using #alistaire's data again:
vars <- c("x","y")
foo[vars] <- Map(pmax, foo[vars], bar[match(foo$id, bar$id), vars], na.rm=TRUE)
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
EDIT
Updating the answer taking #alistaire 's example dataframe.
We can extend the same answer given below using mapply so that it can handle multiple columns for both foo and bar.
Finding out common columns between two dataframes and sorting them so they are in the same order.
vars <- sort(intersect(names(foo), names(bar))[-1])
foo[vars] <- mapply(function(x, y) {
ind = is.na(x)
replace(x, ind, y[match(foo$id[ind], bar$id)])
}, foo[vars], bar[vars])
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
Original Answer
I think this does what you are looking for :
foo[-1] <- sapply(foo[-1], function(x) {
ind = is.na(x)
replace(x, ind, bar$x[match(foo$id[ind], bar$id)])
})
foo
# id x z
#1 1 10 1
#2 2 9 2
#3 3 NA 3
#4 4 1 4
#5 5 3 5
#6 6 8 6
For every column (except id) we find the missing value in foo and replace it with corresponding values from bar.
If you don't mind verbose baseR approaches, then you can easily accomplish this using merge() and a careful subsetting of your data frame.
df <- merge(foo, bar, by="id", all.x=TRUE)
names(df) <- c("id", "x", "z", "y")
df$x[is.na(df$x)] <- df$y[is.na(df$x)]
df <- df[c("id", "x", "z")]
> df
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
You can iterate dplyr::coalesce over the intersect of non-grouping columns. It's not elegant, but it should scale reasonably well:
library(tidyverse)
foo <- data.frame(id = seq(1:6),
x = c(NA, NA, NA, 1, 3, 8),
y = 1:6, # add extra shared variable
z = seq_along(10:15))
bar <- data.frame(id = seq(1:2),
y = c(1L, NA),
x = c(10, 9))
# names of non-grouping variables in both
vars <- intersect(names(foo), names(bar))[-1]
foobar <- left_join(foo, bar, by = 'id')
foobar <- vars %>%
map(paste0, c('.x', '.y')) %>% # make list of columns to coalesce
map(~foobar[.x]) %>% # for each set, subset foobar to a two-column data.frame
invoke_map(.f = coalesce) %>% # ...and coalesce it into a vector
set_names(vars) %>% # add names to list elements
bind_cols(foobar) %>% # bind into data.frame and cbind to foobar
select(union(names(foo), names(bar))) # drop duplicated columns
foobar
#> # A tibble: 6 x 4
#> id x y z
#> <int> <dbl> <int> <int>
#> 1 1 10 1 1
#> 2 2 9 2 2
#> 3 3 NA 3 3
#> 4 4 1 4 4
#> 5 5 3 5 5
#> 6 6 8 6 6

Resources