Replacing multiple numbers with strings in a dataframe without regex in R

I have columns in a dataframe where I want to replace integers with their corresponding string values. A cell often contains several integers (separated by spaces, commas, /, - etc.). For example, my dataframe column is:
> df = data.frame(c1=c(1,2,3,23,c('11,21'),c('13-23')))
> df
c1
1 1
2 2
3 3
4 23
5 11,21
6 13-23
I have used both str_replace_all() and str_replace() methods but did not get the desired results.
> df[,1] %>% str_replace_all(c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g"))
[1] "a" "b" "c" "bc" "aa,ba" "ac-bc"
> df[,1] %>% str_replace(c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g"))
Error in fix_replacement(replacement) : argument "replacement" is missing, with no default
The desired result would be:
[1] "a" "b" "c" "g" "d,f" "e-g"
Because there are multiple values to replace, my first choice was str_replace_all(), since it accepts a named vector mapping the original column values to the desired replacements, but it does not give the desired result, apparently because of how the patterns are matched. Am I doing it wrong, or is there a better alternative to solve my problem?

Simply place the longest (multi-character) patterns at the beginning, so that they are replaced before the single digits they contain:
library(stringr)
str_replace_all(df[,1],
c("11"="d","13"="e","21"="f","23"="g","1"="a","2"="b","3"="c"))
#[1] "a" "b" "c" "g" "d,f" "e-g"
and for more complex cases, sort the patterns by decreasing length:
x <- c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g")
x <- x[order(nchar(names(x)), decreasing = TRUE)]
str_replace_all(df[,1], x)
#[1] "a" "b" "c" "g" "d,f" "e-g"

Using the ordering method in @GKi's answer, here's a base R version using Reduce/gsub instead of stringr::str_replace_all.
Starting vector
x <- as.character(df$c1)
Ordering as in @GKi's answer
repl_dict <- c("11"="d","13"="e","21"="f","23"="g","1"="a","2"="b","3"="c")
repl_dict <- repl_dict[order(nchar(names(repl_dict)), decreasing = TRUE)]
Replacement
Reduce(
function(x, n) gsub(n, repl_dict[n], x, fixed = TRUE),
names(repl_dict),
init = x)
# [1] "a" "b" "c" "g" "d,f" "e-g"
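A further option, sketched here under the assumption that every value to replace is a complete run of digits, is to look up each whole number instead of replacing substrings. This removes the dependence on ordering, at the cost of using a regular expression to locate the numbers. The helper name replace_tokens is hypothetical; gregexpr and regmatches are base R.

```r
# Hypothetical helper: replace each complete run of digits via a lookup vector.
replace_tokens <- function(x, dict) {
  x <- as.character(x)
  m <- gregexpr("[0-9]+", x)               # locate every complete number
  regmatches(x, m) <- lapply(regmatches(x, m), function(tok) {
    out <- unname(dict[tok])               # look each number up by name
    out[is.na(out)] <- tok[is.na(out)]     # keep numbers that have no entry
    out
  })
  x
}

repl_dict <- c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g")
replace_tokens(c("1", "2", "3", "23", "11,21", "13-23"), repl_dict)
# [1] "a" "b" "c" "g" "d,f" "e-g"
```

Because "23" is matched as one token, it can never be split into "2" and "3", whatever the order of repl_dict.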

Related

What's the R function used to find unique and distinct value in a column? [duplicate]

I have multiple observations of one species with different observers / groups of observers and want to create a list of all unique observers. My data look like this:
data <- read.table(text="species observer
1 A,B
1 A,B
1 B,E
1 B,E
1 D,E,A,C,C
1 F" , header = TRUE, stringsAsFactors = FALSE)
My output should return a list of all unique observers - so:
A,B,C,D,E,F
I tried to split the data in the observer column using the following command, but that only returns the unique combinations of observers.
all_observers <- unique(strsplit(as.character(data$observer), ","))
all_observers
[[1]]
[1] "A" "B"
[[2]]
[1] "B" "E"
[[3]]
[1] "D" "E" "A" "C" "C"
[[4]]
[1] "F"
You're almost there, you just need to unlist before you do the unique:
all_observers <- unique(unlist(strsplit(as.character(data$observer), ",")))
We can use separate_rows on 'observer', keep the distinct rows, group by 'species', and paste the 'observer' values together:
library(tidyverse)
data %>%
separate_rows(observer) %>%
distinct %>%
group_by(species) %>%
summarise(observer = toString(observer))
You could also use scan()
unique(scan(text=data$observer, what="", sep=","))
# Read 14 items
# [1] "A" "B" "E" "D" "C" "F"
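The question asks for a single comma-separated string. A minimal sketch, assuming that surrounding whitespace should be ignored and that alphabetical order is wanted, combines the unlist approach with sort and paste:

```r
data <- read.table(text = "species observer
1 A,B
1 A,B
1 B,E
1 B,E
1 D,E,A,C,C
1 F", header = TRUE, stringsAsFactors = FALSE)

# Split on literal commas, flatten, trim, de-duplicate, and sort
observers <- unlist(strsplit(data$observer, ",", fixed = TRUE))
observers <- sort(unique(trimws(observers)))

# Collapse back into one string
all_observers <- paste(observers, collapse = ",")
all_observers
# [1] "A,B,C,D,E,F"
```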

substitute the elements of a vector with values from dataframe

I need to substitute the elements of a vector which match the elements of a particular column in data frame in R.
Reproducible example:
a<-c("A","B","C","D")
b<-data.frame(col1=c("B","C","E"),col2=c("T","Y","N"))
I need to get the following vector:
new<-c("A","T","Y","D")
What I tried is:
new <- a
new <- b$col2[match(a, b$col1)]
which does the substitution, but converts the unmatched elements into NAs.
Any help is appreciated
You can make a data.table from a and then update only the rows for which there is a match when joining with b.
library(data.table)
setDT(b)
data.table(a)[b, on = .(a = col1), a := i.col2][]
# a
# 1: A
# 2: T
# 3: Y
# 4: D
In base R you could use your current approach but replace the NAs with elements of a using ifelse
temp <- as.character(b$col2[match(a, b$col1)])
ifelse(is.na(temp), a, temp)
# [1] "A" "T" "Y" "D"
You can use replace in base R:
a<-c("A","B","C","D")
b<-data.frame(col1=c("B","C","E"),col2=c("T","Y","N"), stringsAsFactors = F)
replace(a, which(a %in% b$col1), b$col2[b$col1 %in% a])
#[1] "A" "T" "Y" "D"
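Note that the replace call relies on the matching values appearing in the same order in a and in b$col1. A sketch that avoids this assumption keeps the match index and overwrites only the positions that were found. The function name substitute_matches is hypothetical, and b is deliberately reordered here to show that the order does not matter:

```r
# Hypothetical helper: replace elements of `a` found in `from` by the
# corresponding element of `to`; unmatched elements are left untouched.
substitute_matches <- function(a, from, to) {
  idx <- match(a, from)          # position of each element of a in from (NA if absent)
  hit <- !is.na(idx)
  a[hit] <- as.character(to[idx[hit]])
  a
}

a <- c("A", "B", "C", "D")
b <- data.frame(col1 = c("C", "B", "E"), col2 = c("Y", "T", "N"),
                stringsAsFactors = FALSE)
substitute_matches(a, b$col1, b$col2)
# [1] "A" "T" "Y" "D"
```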

R: Non-greedy version of setdiff?

Here's setdiff normal behaviour:
x <- rep(letters[1:4], 2)
x
# [1] "a" "b" "c" "d" "a" "b" "c" "d"
y <- letters[1:2]
y
# [1] "a" "b"
setdiff(x, y)
# [1] "c" "d"
… but what if I want y to be taken out only once, and therefore get the following result?
# "c" "d" "a" "b" "c" "d"
I'm guessing that there is an easy solution using either setdiff or %in%, but I just cannot see it.
match returns a vector of the positions of (first) matches of its first argument in its second. It's used as an index constructor:
x[ -match(y,x) ]
#[1] "c" "d" "a" "b" "c" "d"
If there are duplicates in 'y' and you want removal in proportion to their numbers therein, then the first thing that came to my mind is a for-loop:
y <- c("a","b","a")
x2 <- x
for( i in seq_along(y) ){ x2 <- x2[-match(y[i],x2)] }
> x2
[1] "c" "d" "b" "c" "d"
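The loop above assumes that every element of y occurs in x; match returns NA for a value that is not present, which breaks the negative indexing. A guarded sketch (remove_once is a hypothetical name) simply skips such values:

```r
# Hypothetical helper: remove one occurrence of each element of y from x,
# ignoring elements of y that are not (or no longer) present in x.
remove_once <- function(x, y) {
  for (v in y) {
    i <- match(v, x)             # first remaining position of v, or NA
    if (!is.na(i)) x <- x[-i]
  }
  x
}

x <- rep(letters[1:4], 2)
remove_once(x, c("a", "b", "a", "z"))
# [1] "c" "d" "b" "c" "d"
```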
This would be one possible result of using the tabling approach suggested below. Uses some "set" functions, but this is not really a set problem. Seems somewhat more "vectorised":
c( table(x [x %in% intersect(x,y)]) - table(y[y %in% intersect(x,y)]) ,
table( x[!x %in% intersect(x,y)]) )
a b c d
0 1 2 2
The vecsets package has a vsetdiff function for this.
x <- rep(letters[1:4], 2)
y <- letters[1:2]
vecsets::vsetdiff(x, y)
#[1] "c" "d" "a" "b" "c" "d"
Here is another looping method. I think 42's method is cleaner, but it provides another option.
# construct a table of counts in y for all possible values in x and y
myCounts <- table(factor(y, levels=sort(union(x, y))))
# extract these elements from x
x[-unlist(lapply(names(myCounts),
function(i) which(i == x)[seq_len(myCounts[i])]))]
The "non-greedy" aspect comes from [seq_len(myCounts[i])], which only takes as many identical elements as are present in y.

Extract distinct characters that differ between two strings

I have two strings, a <- "AERRRTX"; b <- "TRRA" .
I want to extract the characters in a not used in b, i.e. "ERX"
I tried the answer in "Extract characters that differ between two strings", which uses setdiff. It returns "EX", because b does have "R" and setdiff will eliminate all three "R"s in a. My aim is to treat each character as distinct, so only two of the three R's in a should be eliminated.
Any suggestions on what I can use instead of setdiff, or some other approach to achieve my output?
A different approach using pmatch,
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!1:length(a1) %in% pmatch(b1, a1)]
#[1] "E" "R" "X"
Another example,
a <- "Ronak";b<-"Shah"
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!1:length(a1) %in% pmatch(b1, a1)]
# [1] "R" "o" "n" "k"
You can use the function vsetdiff from vecsets package
install.packages("vecsets")
library(vecsets)
a <- "AERRRTX"
b <- "TRRA"
Reduce(vsetdiff, strsplit(c(a, b), split = ""))
## [1] "E" "R" "X"
We can use Reduce() to successively eliminate from a each character found in b:
a <- 'AERRRTX'; b <- 'TRRA';
paste(collapse='',Reduce(function(as,bc) as[-match(bc,as,nomatch=length(as)+1L)],strsplit(b,'')[[1L]],strsplit(a,'')[[1L]]));
## [1] "ERX"
This will preserve the order of the surviving characters in a.
Another approach is to mark each character with its occurrence index in a, do the same for b, and then we can use setdiff():
a <- 'AERRRTX'; b <- 'TRRA';
pasteOccurrence <- function(x) ave(x,x,FUN=function(x) paste0(x,seq_along(x)));
paste(collapse='',substr(setdiff(pasteOccurrence(strsplit(a,'')[[1L]]),pasteOccurrence(strsplit(b,'')[[1L]])),1L,1L));
## [1] "ERX"
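To see why this works, it helps to look at the intermediate labels. The sketch below reuses pasteOccurrence from the answer and only shows what each step produces; the names a_lab and b_lab are introduced here for illustration:

```r
# Label each character with its running occurrence count within the string
pasteOccurrence <- function(x) ave(x, x, FUN = function(g) paste0(g, seq_along(g)))

a_lab <- pasteOccurrence(strsplit("AERRRTX", "")[[1L]])
b_lab <- pasteOccurrence(strsplit("TRRA", "")[[1L]])

a_lab
# [1] "A1" "E1" "R1" "R2" "R3" "T1" "X1"
b_lab
# [1] "T1" "R1" "R2" "A1"

# The labels are unique within each string, so setdiff drops exactly
# the occurrences that b accounts for
setdiff(a_lab, b_lab)
# [1] "E1" "R3" "X1"
```

The final substr(..., 1L, 1L) in the answer then strips the counter again, which is sufficient here because every element is a single character.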
An alternative using the data.table package:
library(data.table)
x = data.table(table(strsplit(a, '')[[1]]))
y = data.table(table(strsplit(b, '')[[1]]))
dt = y[x, on='V1'][,N:=ifelse(is.na(N),0,N)][N!=i.N,res:=i.N-N][res>0]
rep(dt$V1, dt$res)
#[1] "E" "R" "X"
