How can I make a data.frame using chr vector [duplicate] - r

This question already has answers here:
aggregating unique values in columns to single dataframe "cell" [duplicate]
(2 answers)
Closed 2 years ago.
I have the data.frame below.
> Chr Chr
> A E
> A F
> A E
> B G
> B G
> C H
> C I
> D E
I want to convert the dataset into the form shown below.
I want to collapse all the chr values that belong to the same key into a single row.
chr chr
A E,F
B G
C H,I
D E
They are all characters, and I tried several things to get this result.
First, I used the unique function on the 1st column, FILTER <- unique(chr[,15]), and tried to subset the data with
FILTER, building the result with rbind or the bind_rows function.
Second, I tested whether my idea works:
FILTER <- unique(Top[,15])
NN <- data.frame()
for(i in 1 :nrow(FILTER)){
result = unique(Top10Data[TGT == FILTER[i]]$`NM`)
print(result)
}
Up to this stage, it seems to work well.
The problem is that when I use either function, the resulting data frame has only one column and ignores all the other values (the 2nd variable from the data.frame above).
The functions work for a single value such as chr[1,1], but I have groups with several values, chr[1,n], which cannot be combined.
Here is my code for reference.
FILTER <- unique(Top[,15])
NN <- data.frame()
for(i in 1 :nrow(FILTER)){
CGONM <- rbind(NN,unique(Top10Data[TGT == FILTER[i]]$`NM`))
}

Base R solutions:
# Solution 1:
df_str_agg1 <- aggregate(var2~var1, df, FUN = function(x){
paste0(unique(x), collapse = ",")})
# Solution 2:
df_str_agg2 <- data.frame(do.call("rbind",lapply(split(df, df$var1), function(x){
data.frame(var1 = unique(x$var1),
var2 = paste0(unique(x$var2), collapse = ","))
}
)
),
row.names = NULL
)
Tidyverse solution:
library(tidyverse)
df_str_agg3 <-
df %>%
group_by(var1) %>%
summarise(var2 = str_c(unique(var2), collapse = ",")) %>%
ungroup()
Data:
df <- data.frame(var1 = c("A", "A", "A", "B", "B", "C", "C", "D"),
var2 = c("E", "F", "E", "G", "G", "H", "I", "E"), stringsAsFactors = FALSE)
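For comparison, the loop in the question never stores its result: NN stays empty, so every pass binds one row to an empty data frame and CGONM keeps only the last group. A minimal sketch of a loop that does accumulate the rows is shown here, assuming the example df with columns var1 and var2 from the Data block (stand-ins for the asker's TGT and NM columns, which are not shown in full).

```r
# Example data from the answer above (var1/var2 stand in for TGT/NM).
df <- data.frame(var1 = c("A", "A", "A", "B", "B", "C", "C", "D"),
                 var2 = c("E", "F", "E", "G", "G", "H", "I", "E"),
                 stringsAsFactors = FALSE)

keys <- unique(df$var1)
rows <- vector("list", length(keys))   # one slot per group

for (i in seq_along(keys)) {
  # unique var2 values belonging to the current key
  vals <- unique(df$var2[df$var1 == keys[i]])
  rows[[i]] <- data.frame(var1 = keys[i],
                          var2 = paste(vals, collapse = ","),
                          stringsAsFactors = FALSE)
}

# bind all per-group rows once, after the loop
result <- do.call(rbind, rows)
result
```

Collecting the pieces in a list and binding once avoids growing a data frame inside the loop; the aggregate() and group_by() solutions above do the same job in one call.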

Related

Compare two vectors within a data frame with %in% with R

I have the following data
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
Col1 Col2
a a,b,c
b aa,c,d
aa c,d,e
d d,f,g
I want to select the rows that contain a character from this vector c("a", "e", "g"), specifying the column:
library(dplyr)
T1 %>% filter(Col1 %in% c("a", "e", "g"))
It returned
1 a a,b,c
That is correct, but I want to compare two vectors. For example:
With unlist and strsplit, I transform the value of each row to a character vector and try to compare it with the reference vector to select the rows that contain any of the values:
unlist(strsplit(T1$Col2[1],","))
[1] "a" "b" "c"
T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))
It gives me an error:
Error in filter():
! Problem while computing ..1 = unlist(strsplit(Col2, ",")) %in% c("a", "e", "g").
✖ Input ..1 must be of size 4 or 1, not size 12.
Run rlang::last_error() to see where the error occurred.
I can do it like this:
T1[grep(c("a|e|g"), T1$Col2),]
1 a a,b,c
2 b aa,c,d
3 aa c,d,e
4 d d,f,g
But that is wrong: row 2 (b, aa,c,d) should not be returned, because it contains aa, not a.
To search for the "a" alone, you would have to do:
T1[grep(c("\\<a\\>"), T1$Col2),]
I think I will end up making a mistake with this form; I would feel safer comparing vector with vector:
T1 %>% filter(unlist(strsplit(Col2,",")) %in% c("a", "e", "g"))
Edited answer
You can use \\b, the regular-expression word boundary, on both sides of the pattern. The | acts as an OR between the alternatives. You can use the following code:
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
library(dplyr)
library(stringr)
T1 %>%
filter(grepl("\\b(a|e|g)\\b", Col2))
#> Col1 Col2
#> 1 a a,b,c
#> 2 aa c,d,e
#> 3 d d,f,g
Created on 2022-07-16 by the reprex package (v2.0.1)
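Note: inside an ordinary R string the backslash itself must be escaped, so the word boundary is written \\b. From R 4.0.0 a raw string such as r"(\b(a|e|g)\b)" can be used instead.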
old answer
All rows are returned because the pattern matches any occurrence of one of the strings in Col2: row 2 matches because "aa" contains "a", row 3 matches on "e", and row 4 matches on "g". You could also use str_detect like this:
library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>%
filter(any(str_detect(Col2, paste0(vector, collapse="|"))))
#> Col1 Col2
#> 1 a a,b,c
#> 2 b aa,c,d
#> 3 aa c,d,e
#> 4 d d,f,g
Created on 2022-07-16 by the reprex package (v2.0.1)
If you want to check whether one of the strings exists in either column, you can use the following code:
library(dplyr)
library(stringr)
T1 <- data.frame( "Col1" = c("a", "b", "aa", "d"), "Col2" = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g") )
vector <- c("a", "e", "g")
T1 %>%
filter(Reduce(`|`, across(all_of(colnames(T1)), ~str_detect(paste0(vector, collapse="|"), .x))))
#> Col1 Col2
#> 1 a a,b,c
Created on 2022-07-16 by the reprex package (v2.0.1)
Another way you could achieve this (using your original approach with strsplit) is to do it rowwise() and 'sum' the logical test.
T1 %>%
rowwise() %>%
filter(sum(unlist(strsplit(Col2,",")) %in% c("a","e","g")) >= 1)
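The vector-with-vector comparison the question asks for can also be sketched in base R. The filter() error arises because unlist() flattens all rows into one vector of 12 tokens; keeping the strsplit() result as a list and reducing each element to a single TRUE/FALSE gives one value per row. The names wanted, tokens and keep are illustrative only.

```r
T1 <- data.frame(Col1 = c("a", "b", "aa", "d"),
                 Col2 = c("a,b,c", "aa,c,d", "c,d,e", "d,f,g"),
                 stringsAsFactors = FALSE)
wanted <- c("a", "e", "g")

# One character vector of tokens per row (a list of length nrow(T1)).
tokens <- strsplit(T1$Col2, ",", fixed = TRUE)

# One logical per row: does any token match exactly?
keep <- vapply(tokens, function(x) any(x %in% wanted), logical(1))

T1[keep, ]
```

Because %in% compares whole tokens, "aa" is not treated as a match for "a", so no word-boundary regex is needed.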

How do I Identify by row id the values in a data frame column not in another data frame column?

How do I identify by row id the values in data frame d2 column c3 that are not in data frame d1 column c1? My which function returns all records when subsetting as shown. I need to keep this subsetting structure rather than the value$field form, which does work:
c1 <- c("A", "B", "C", "D", "E")
c2 <- c("a", "b", "c", "d", "e")
c3 <- c("A", "z", "C", "z", "E", "F")
c4 <- c("a", "x", "x", "d", "e", "f")
d1 <- data.frame(c1, c2, stringsAsFactors = F)
d2 <- data.frame(c3, c4, stringsAsFactors = F)
x <- unique(d1["c1"])
y <- d2[,"c3"]
id <- which(!(y %in% x) ) # incorrect, all row ids returned
I am trying to find the id's of rows in y where the specified column does not include values of x
I believe setdiff would work here. I see z and F are what you want, right? They are not in d1[,"c1"] but are in d2[,"c3"]
includes <- setdiff(d2[,"c3"], d1[,"c1"])
d2_new <- d2[d2[,"c3"] %in% includes,]
d2_new$id <- rownames(d2_new)
d2_new
# or
ids <- rownames(d2[d2[,"c3"] %in% includes,])
output
d2_new
# c3 c4 id
#2 z x 2
#4 z d 4
#6 F f 6
ids
#[1] "2" "4" "6"
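The which() approach from the question also works once x is a plain character vector. d1["c1"] returns a one-column data frame, so %in% is not comparing against the individual values; d1[, "c1"] (or d1[["c1"]]) returns the column itself. A minimal sketch using the question's data:

```r
d1 <- data.frame(c1 = c("A", "B", "C", "D", "E"),
                 c2 = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(c3 = c("A", "z", "C", "z", "E", "F"),
                 c4 = c("a", "x", "x", "d", "e", "f"),
                 stringsAsFactors = FALSE)

x <- unique(d1[, "c1"])    # character vector, not a data frame
y <- d2[, "c3"]

id <- which(!(y %in% x))   # row positions in d2 whose c3 is absent from d1$c1
id
```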
I had the same problem, and this code worked for me. However, the bracket indexing did not work for me; with a slight change it worked perfectly.
includes <- setdiff(d2$c3, d1$c1)
d2_new <- d2[d2$c3 %in% includes,]
d2_new$id <- rownames(d2_new)
d2_new
Thank you @jpsmith

Exclude common rows in tibbles [duplicate]

This question already has an answer here:
Using anti_join() from the dplyr on two tables from two different databases
(1 answer)
Closed 2 years ago.
I'm looking for a way to join two tibbles so that only the rows unique to the first tibble remain, or the rows unique to either tibble: simply those that do not have a matching key.
Let's see an example:
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
With the common dplyr joins I am not able to get this:
A
1 d
2 e
Is there some way within dplyr, or the tidyverse in general, to do this?
Use the setdiff() function from the dplyr library:
library(dplyr)
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
C <- setdiff(A,B)
Just to add:
setdiff(A,B) returns the rows present in A but not in B.
dplyr::anti_join will keep only the rows that are unique to the tibble/data.frame of the first argument.
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
dplyr::anti_join(A, B, by = "A")
# A
# <chr>
# 1 d
# 2 e
A base R possibility (well except the tibble):
A[!A$A %in% B$A,]
returns
# A tibble: 2 x 1
A
<chr>
1 d
2 e
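The question also mentions rows that are unique in either table. A rough base R sketch of that two-sided case follows; it uses plain data frames, and B is given an extra hypothetical value "f" (not in the original example) so that the second direction has something to return.

```r
A <- data.frame(A = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
# "f" is a hypothetical extra key added only for illustration.
B <- data.frame(A = c("a", "b", "c", "f"), stringsAsFactors = FALSE)

only_A <- A[!A$A %in% B$A, , drop = FALSE]   # keys in A without a match in B
only_B <- B[!B$A %in% A$A, , drop = FALSE]   # keys in B without a match in A

unmatched <- rbind(only_A, only_B)
row.names(unmatched) <- NULL                 # tidy up row names after rbind
unmatched
```

With dplyr, the same idea would be two anti_join() calls, one in each direction, whose results are then stacked.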

Count string followed by separate string occurrences in r

I am trying to count the number of occurrences of a string followed by another string in r. I cannot seem to get the regex worked out to count this correctly.
As an example:
v <- c("F", "F", "C", "F", "F", "C", "F", "F")
b <- str_count(v, "F(?=C)")
I would like b to tell me how many times string F was followed by string C in vector v (which should equal 2).
I have successfully implemented str_count() to count single strings, but I cannot figure out how to count string followed by a different string.
Also, I found that in regex (?=...) should indicate "followed by"; however, this does not seem to be sufficient.
You don't have one string. You have individual strings. Here you can test if F is followed by C by shifting, using [ for subsetting.
sum(v[-length(v)] == "F" & v[-1] == "C")
#sum(v == "F" & c(v[-1] == "C", FALSE)) #Alternative
#[1] 2
To use stringr::str_count you can paste v to one string.
stringr::str_count(paste(v, collapse = ""), "F(?=C)")
#[1] 2
And for rows of a data.frame:
set.seed(42)
v <- as.data.frame(matrix(sample(c("F", "C"), 25, TRUE), 5))
stringr::str_count(apply(v, 1, paste, collapse = ""), "F(?=C)")
#[1] 1 1 2 1 1
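If stringr is not available, the same lookahead count can be sketched in base R with gregexpr() and regmatches(); perl = TRUE is needed for (?=...). The block redefines v as the original character vector and also shows why counting on the separate elements finds nothing. The names s, n_fc and per_element are illustrative.

```r
v <- c("F", "F", "C", "F", "F", "C", "F", "F")

# Collapse to a single string so that "F followed by C" can be seen by the regex.
s <- paste(v, collapse = "")
n_fc <- lengths(regmatches(s, gregexpr("F(?=C)", s, perl = TRUE)))
n_fc

# Applied element by element, no single-letter string contains "FC",
# so every count is zero.
per_element <- lengths(regmatches(v, gregexpr("F(?=C)", v, perl = TRUE)))
per_element
```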
You can use lag() from dplyr:
library(dplyr)
sum(v == "C" & lag(v) == "F", na.rm = TRUE)
(The na.rm = TRUE is because the first value of lag(v) is NA).
Your comment notes that you're also interested in applying this across each row of a data frame. This can be done by pivoting the data to be longer, then applying a grouped mutate, then pivoting the data to be wider again. On an example dataset:
library(tidyr) # for pivot_longer() and pivot_wider()
example <- tibble(id = 1:3,
s1 = c("F", "F", "F"),
s2 = c("C", "F", "C"),
s3 = c("C", "C", "F"),
s4 = c("F", "C", "C"))
example %>%
pivot_longer(s1:s4) %>%
group_by(id) %>%
mutate(fc_count = sum(value == "C" & lag(value) == "F", na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = value)
Result:
# A tibble: 3 x 6
id fc_count s1 s2 s3 s4
<int> <int> <chr> <chr> <chr> <chr>
1 1 1 F C C F
2 2 1 F F C C
3 3 2 F C F C
Note that this assumed the data had something like an id column that uniquely identifies each original row. If it doesn't, you can add one with mutate(id = row_number()) first.

Replace strings in variable using lookup vector

I have a dataframe df with a character variable and the fromvec and tovec.
df <- tibble(var = c("A", "B", "C", "a", "E", "D", "b"))
fromvec <- c("A", "B", "C")
tovec <- c("X", "Y", "Z")
Use strings in fromvec, check them in df and then replace them with the corresponding strings in tovec so that "A" in df gets replaced with "X", "B" with "Y" and so on to get the desired_df.
desired_df <- tibble(var = c("X", "Y", "Z", "X", "E", "D", "Y"))
I tried the following, but did not get the desired result:
from_vec <- paste(fromvec, collapse="|")
to_vec <- paste(tovec, collapse="|")
undesired_df <- df %>%
mutate(var = str_replace(str_to_upper(var), from_vec, to_vec))
i.e. this
tibble(var = c("X|Y|Z", "X|Y|Z", "X|Y|Z", "X|Y|Z", "E", "D", "X|Y|Z"))
How can I get the desired_df?
You could use chartr :
df$var <- chartr(paste(fromvec,collapse=""),
paste(tovec,collapse=""),
toupper(df$var))
# # A tibble: 7 x 1
# var
# <chr>
# 1 X
# 2 Y
# 3 Z
# 4 X
# 5 E
# 6 D
# 7 Y
Or we can use recode
library(dplyr)
df$var <- recode(toupper(df$var), !!!setNames(tovec,fromvec))
If you really want to use str_replace you could do:
library(purrr)
library(stringr)
df$var <- reduce2(fromvec, tovec, str_replace, .init=toupper(df$var))
The correct way to do this with stringr is with str_replace_all:
mutate(df, var = str_replace_all(str_to_upper(var), setNames(tovec, fromvec)))
(thanks, @Moody_Mudskipper!)
We can use base R
with(df, ifelse(toupper(var) %in% fromvec,
setNames(tovec, fromvec)[toupper(var)], var))
#[1] "X" "Y" "Z" "X" "E" "D" "Y"
which can be also written in two lines by creating a logical condition
i1 <- toupper(df$var) %in% fromvec
df$var[i1] <- setNames(tovec, fromvec)[toupper(df$var)[i1]]
Or using data.table
library(data.table)
setDT(df)[toupper(var) %in% fromvec, var := setNames(tovec, fromvec)[toupper(var)]]
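The lookup idea behind these answers can also be written with match(), which returns the position of each value in fromvec, or NA when there is no replacement. This is a minimal sketch on a plain data frame, assuming the case-insensitive behaviour implied by desired_df; idx is an illustrative name.

```r
df <- data.frame(var = c("A", "B", "C", "a", "E", "D", "b"),
                 stringsAsFactors = FALSE)
fromvec <- c("A", "B", "C")
tovec <- c("X", "Y", "Z")

# Position of each upper-cased value in fromvec; NA means "leave unchanged".
idx <- match(toupper(df$var), fromvec)

df$var <- ifelse(is.na(idx), df$var, tovec[idx])
df$var
```

Dropping toupper() makes the replacement case sensitive, in which case "a" and "b" stay as they are.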
It's not clear whether the result should be case insensitive.
In my opinion, replacement (update) operations that involve an indeterminate number of changes are best accomplished using JOINs. In this case, it also cements a good practice of tracking your changes in a separate dataframe.
Unfortunately, the tidyverse has no "update dataframe" function....a glaring omission. That means tidyverse-ers must use a work-around, coalesce.
#JOIN Operation
tibble(fromvec, tovec) %>% #< dataframe of changes
right_join(df, by = c("fromvec" = "var")) %>% #< join operation
transmute(var = coalesce(tovec, fromvec)) #< coalesce work-around
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 a
5 E
6 D
7 b
If a case insensitive operation is preferred, consider inserting str_to_upper in the pipeline:
tibble(fromvec, tovec) %>%
right_join(df %>% mutate(var = (str_to_upper(var))), #<modify case
by = c("fromvec" = "var")) %>%
transmute(var = coalesce(tovec, fromvec))
# A tibble: 7 x 1
var
<chr>
1 X
2 Y
3 Z
4 X
5 E
6 D
7 Y
