Extract distinct characters that differ between two strings - r

I have two strings, a <- "AERRRTX"; b <- "TRRA" .
I want to extract the characters in a not used in b, i.e. "ERX"
I tried the answer in "Extract characters that differ between two strings", which uses setdiff. It returns "EX", because b contains "R" and setdiff eliminates all three "R"s in a. My aim is to treat each character as distinct, so only two of the three R's in a should be eliminated.
Any suggestions on what I can use instead of setdiff, or some other approach to achieve my output?

A different approach using pmatch,
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!1:length(a1) %in% pmatch(b1, a1)]
#[1] "E" "R" "X"
Another example,
a <- "Ronak";b<-"Shah"
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!1:length(a1) %in% pmatch(b1, a1)]
# [1] "R" "o" "n" "k"
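The pmatch idea above can be wrapped in a small helper. This is only a sketch: the name multiset_diff is made up for illustration, and it assumes the strings are compared one character at a time, so pmatch's partial matching never comes into play.

```r
# Hypothetical helper: characters of `a` left over after each character
# of `b` has cancelled at most one occurrence in `a`.
multiset_diff <- function(a, b) {
  a1 <- strsplit(a, "")[[1]]
  b1 <- strsplit(b, "")[[1]]
  # pmatch() uses each position of a1 at most once (duplicates.ok = FALSE),
  # and returns NA for characters of b1 with no unused counterpart.
  used <- pmatch(b1, a1)
  paste(a1[!(seq_along(a1) %in% used)], collapse = "")
}

res1 <- multiset_diff("AERRRTX", "TRRA")  # "ERX"
res2 <- multiset_diff("Ronak", "Shah")    # "Ronk"
```

Using seq_along(a1) instead of 1:length(a1) avoids the 1:0 surprise when a is an empty string.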

You can use the function vsetdiff from vecsets package
install.packages("vecsets")
library(vecsets)
a <- "AERRRTX"
b <- "TRRA"
Reduce(vsetdiff, strsplit(c(a, b), split = ""))
## [1] "E" "R" "X"

We can use Reduce() to successively eliminate from a each character found in b:
a <- 'AERRRTX'; b <- 'TRRA';
paste(collapse='',Reduce(function(as,bc) as[-match(bc,as,nomatch=length(as)+1L)],strsplit(b,'')[[1L]],strsplit(a,'')[[1L]]));
## [1] "ERX"
This will preserve the order of the surviving characters in a.
Another approach is to mark each character with its occurrence index in a, do the same for b, and then we can use setdiff():
a <- 'AERRRTX'; b <- 'TRRA';
pasteOccurrence <- function(x) ave(x,x,FUN=function(x) paste0(x,seq_along(x)));
paste(collapse='',substr(setdiff(pasteOccurrence(strsplit(a,'')[[1L]]),pasteOccurrence(strsplit(b,'')[[1L]])),1L,1L));
## [1] "ERX"
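To make the occurrence-index idea easier to follow, here is the same computation broken into steps. The name tag_occurrence is just an illustrative rename of pasteOccurrence; nothing else is changed.

```r
# Tag every character with its running occurrence count, e.g. R1, R2, R3.
tag_occurrence <- function(x) ave(x, x, FUN = function(v) paste0(v, seq_along(v)))

a1 <- strsplit("AERRRTX", "")[[1]]
b1 <- strsplit("TRRA", "")[[1]]

ta <- tag_occurrence(a1)  # "A1" "E1" "R1" "R2" "R3" "T1" "X1"
tb <- tag_occurrence(b1)  # "T1" "R1" "R2" "A1"

# After tagging, every element is unique, so plain setdiff() is safe.
left <- setdiff(ta, tb)   # "E1" "R3" "X1"
out <- paste(substr(left, 1L, 1L), collapse = "")
```

Because "R3" has no partner in b, exactly one R survives.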

An alternative using the data.table package:
library(data.table)
x = data.table(table(strsplit(a, '')[[1]]))
y = data.table(table(strsplit(b, '')[[1]]))
dt = y[x, on='V1'][,N:=ifelse(is.na(N),0,N)][N!=i.N,res:=i.N-N][res>0]
rep(dt$V1, dt$res)
#[1] "E" "R" "X"
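The same counting idea can be sketched in base R without data.table. The function name count_diff is hypothetical. Note that, like the data.table version, it groups surviving characters by letter rather than keeping their original positions in a.

```r
# Hypothetical helper: subtract per-character counts of b from those of a.
count_diff <- function(a, b) {
  a1 <- strsplit(a, "")[[1]]
  b1 <- strsplit(b, "")[[1]]
  lev <- unique(a1)                       # characters in order of first appearance
  na <- table(factor(a1, levels = lev))
  nb <- table(factor(b1, levels = lev))   # characters only in b become NA and are not counted
  surplus <- pmax(as.integer(na) - as.integer(nb), 0L)
  rep(lev, surplus)
}

res <- count_diff("AERRRTX", "TRRA")  # "E" "R" "X"
```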

Related

Replacing multiple numbers with string in a dataframe without regex in R

I have columns in a dataframe where I want to replace integers with their corresponding string values. The integers are often repeating in cells (separated by spaces, commas, /, or - etc.). For example my dataframe column is:
> df = data.frame(c1=c(1,2,3,23,c('11,21'),c('13-23')))
> df
c1
1 1
2 2
3 3
4 23
5 11,21
6 13-23
I have used both str_replace_all() and str_replace() methods but did not get the desired results.
> df[,1] %>% str_replace_all(c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g"))
[1] "a" "b" "c" "bc" "aa,ba" "ac-bc"
> df[,1] %>% str_replace(c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g"))
Error in fix_replacement(replacement) : argument "replacement" is missing, with no default
The desired result would be:
[1] "a" "b" "c" "g" "d,f" "e-g"
As there are multiple values to replace that's why my first choice was str_replace_all() as it allows to have a vector with the original column values and desired replacement values but the method fails due to regex. Am I doing it wrong or is there any better alternative to solve my problem?
Simply place the longest multi-character patterns first, so they are matched before their single-digit prefixes:
library(stringr)
str_replace_all(df[,1],
c("11"="d","13"="e","21"="f","23"="g","1"="a","2"="b","3"="c"))
#[1] "a" "b" "c" "g" "d,f" "e-g"
and for more complex cases, sort the patterns by length:
x <- c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g")
x <- x[order(nchar(names(x)), decreasing = TRUE)]
str_replace_all(df[,1], x)
#[1] "a" "b" "c" "g" "d,f" "e-g"
Using the ordering method in @GKi's answer, here's a base R version using Reduce/gsub instead of stringr::str_replace_all
Starting vector
x <- as.character(df$c1)
Ordering as in @GKi's answer
repl_dict <- c("11"="d","13"="e","21"="f","23"="g","1"="a","2"="b","3"="c")
repl_dict <- repl_dict[order(nchar(names(repl_dict)), decreasing = TRUE)]
Replacement
Reduce(
function(x, n) gsub(n, repl_dict[n], x, fixed = TRUE),
names(repl_dict),
init = x)
# [1] "a" "b" "c" "g" "d,f" "e-g"
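If each number in a cell is a maximal run of digits, which is an assumption about the data and not something stated in the question, another option is to replace whole tokens so that the order of the dictionary no longer matters. The sketch below uses base R's gregexpr and regmatches; replace_tokens is a made-up name.

```r
# Hypothetical helper: look up every run of digits in `dict` as a whole token.
replace_tokens <- function(x, dict) {
  m <- gregexpr("[0-9]+", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tok) {
    hit <- unname(dict[tok])
    miss <- is.na(hit)
    hit[miss] <- tok[miss]   # leave numbers without a dictionary entry untouched
    hit
  })
  x
}

dict <- c("1"="a", "2"="b", "3"="c", "11"="d", "13"="e", "21"="f", "23"="g")
x <- c("1", "2", "3", "23", "11,21", "13-23")
res <- replace_tokens(x, dict)  # "a" "b" "c" "g" "d,f" "e-g"
```

Unlike the ordered str_replace_all or gsub approaches, a number such as "123" that has no entry is left as it is instead of being partly rewritten.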

How to remove empty string with 0 count in R dataframe? [duplicate]

I have a data frame containing a factor. When I create a subset of this dataframe using subset or another indexing function, a new data frame is created. However, the factor variable retains all of its original levels, even when/if they do not exist in the new dataframe.
This causes problems when doing faceted plotting or using functions that rely on factor levels.
What is the most succinct way to remove levels from a factor in the new dataframe?
Here's an example:
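df <- data.frame(letters=letters[1:5],
numbers=1:5,
stringsAsFactors=TRUE) # needed in R >= 4.0.0, where strings are no longer converted to factors by default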
levels(df$letters)
## [1] "a" "b" "c" "d" "e"
subdf <- subset(df, numbers <= 3)
## letters numbers
## 1 a 1
## 2 b 2
## 3 c 3
# all levels are still there!
levels(subdf$letters)
## [1] "a" "b" "c" "d" "e"
Since R version 2.12, there's a droplevels() function.
levels(droplevels(subdf$letters))
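droplevels() also has a data frame method, so the whole subset can be cleaned in one call. A minimal sketch, assuming the column is created as a factor explicitly (stringsAsFactors = TRUE), which recent R versions no longer do by default:

```r
df <- data.frame(letters = letters[1:5], numbers = 1:5, stringsAsFactors = TRUE)

subdf <- subset(df, numbers <= 3)
before <- levels(subdf$letters)   # "a" "b" "c" "d" "e"

subdf <- droplevels(subdf)        # drops unused levels in every factor column
after <- levels(subdf$letters)    # "a" "b" "c"
```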
All you should have to do is to apply factor() to your variable again after subsetting:
> subdf$letters
[1] a b c
Levels: a b c d e
subdf$letters <- factor(subdf$letters)
> subdf$letters
[1] a b c
Levels: a b c
EDIT
From the factor page example:
factor(ff) # drops the levels that do not occur
For dropping levels from all factor columns in a dataframe, you can use:
subdf <- subset(df, numbers <= 3)
subdf[] <- lapply(subdf, function(x) if(is.factor(x)) factor(x) else x)
If you don't want this behaviour, don't use factors, use character vectors instead. I think this makes more sense than patching things up afterwards. Try the following before loading your data with read.table or read.csv (since R 4.0.0, stringsAsFactors = FALSE is already the default, so this is only needed on older versions):
options(stringsAsFactors = FALSE)
The disadvantage is that you're restricted to alphabetical ordering. (reorder is your friend for plots)
It is a known issue, and one possible remedy is provided by drop.levels() in the gdata package where your example becomes
> drop.levels(subdf)
letters numbers
1 a 1
2 b 2
3 c 3
> levels(drop.levels(subdf)$letters)
[1] "a" "b" "c"
There is also the dropUnusedLevels function in the Hmisc package. However, it only works by altering the subset operator [ and is not applicable here.
As a corollary, a direct approach on a per-column basis is a simple as.factor(as.character(data)):
> levels(subdf$letters)
[1] "a" "b" "c" "d" "e"
> subdf$letters <- as.factor(as.character(subdf$letters))
> levels(subdf$letters)
[1] "a" "b" "c"
Another way of doing the same but with dplyr
library(dplyr)
subdf <- df %>% filter(numbers <= 3) %>% droplevels()
str(subdf)
Edit:
This also works without the parentheses (thanks to agenis):
subdf <- df %>% filter(numbers <= 3) %>% droplevels
levels(subdf$letters)
For the sake of completeness, now there is also fct_drop in the forcats package http://forcats.tidyverse.org/reference/fct_drop.html.
It differs from droplevels in the way it deals with NA:
f <- factor(c("a", "b", NA), exclude = NULL)
droplevels(f)
# [1] a b <NA>
# Levels: a b <NA>
forcats::fct_drop(f)
# [1] a b <NA>
# Levels: a b
Here's another way, which I believe is equivalent to the factor(..) approach:
> df <- data.frame(let=letters[1:5], num=1:5, stringsAsFactors=TRUE)
> subdf <- df[df$num <= 3, ]
> subdf$let <- subdf$let[ , drop=TRUE]
> levels(subdf$let)
[1] "a" "b" "c"
This is obnoxious. This is how I usually do it, to avoid loading other packages:
levels(subdf$letters)<-c("a","b","c",NA,NA)
which gets you:
> subdf$letters
[1] a b c
Levels: a b c
Note that the new levels will replace whatever occupies their index in the old levels(subdf$letters), so something like:
levels(subdf$letters)<-c(NA,"a","c",NA,"b")
won't work.
This is obviously not ideal when you have lots of levels, but for a few, it's quick and easy.
Looking at the code of the droplevels methods in the R source, you can see that it wraps the factor function. That means you can basically recreate the column with the factor function.
Below is the data.table way to drop levels from all the factor columns.
library(data.table)
dt = data.table(letters=factor(letters[1:5]), numbers=seq(1:5))
levels(dt$letters)
#[1] "a" "b" "c" "d" "e"
subdt = dt[numbers <= 3]
levels(subdt$letters)
#[1] "a" "b" "c" "d" "e"
upd.cols = sapply(subdt, is.factor)
subdt[, names(subdt)[upd.cols] := lapply(.SD, factor), .SDcols = upd.cols]
levels(subdt$letters)
#[1] "a" "b" "c"
Here is a way of doing that:
varFactor <- factor(letters[1:15])
varFactor <- varFactor[1:5]
varFactor <- varFactor[drop=T]
I wrote utility functions to do this. Now that I know about gdata's drop.levels, it looks pretty similar. Here they are (from here):
present_levels <- function(x) intersect(levels(x), x)
trim_levels <- function(...) UseMethod("trim_levels")
trim_levels.factor <- function(x) factor(x, levels=present_levels(x))
trim_levels.data.frame <- function(x) {
for (n in names(x))
if (is.factor(x[,n]))
x[,n] = trim_levels(x[,n])
x
}
Very interesting thread. I especially liked the idea of just applying factor to the subselection again. I had a similar problem before, and I just converted to character and then back to factor.
df <- data.frame(letters=letters[1:5],numbers=1:5,stringsAsFactors=TRUE)
levels(df$letters)
## [1] "a" "b" "c" "d" "e"
subdf <- df[df$numbers <= 3, ] # the trailing comma selects rows, not columns
subdf$letters<-factor(as.character(subdf$letters))
Thank you for posting this question. However, none of the above solutions worked for me. I made a workaround for this problem and am sharing it in case someone else stumbles upon it:
For all factor columns that contain levels with zero occurrences, you can first convert those columns to character type and then convert them back to factors.
For the above-posted question, just add the following lines of code:
# Convert into character
subdf$letters = as.character(subdf$letters)
# Convert back into factor
subdf$letters = as.factor(subdf$letters)
# Verify the levels in the subset
levels(subdf$letters)
Unfortunately factor() doesn't seem to work when using rxDataStep of RevoScaleR. I do it in two steps:
1) Convert to character and store in temporary external data frame (.xdf).
2) Convert back to factor and store in definitive external data frame. This eliminates any unused factor levels, without loading all the data into memory.
# Step 1) Converts to character, in temporary xdf file:
rxDataStep(inData = "input.xdf", outFile = "temp.xdf", transforms = list(VAR_X = as.character(VAR_X)), overwrite = T)
# Step 2) Converts back to factor:
rxDataStep(inData = "temp.xdf", outFile = "output.xdf", transforms = list(VAR_X = as.factor(VAR_X)), overwrite = T)
I have tried most of the examples here, if not all, but none seem to work in my case.
After struggling for quite some time, I tried using as.character() on the factor column to change it to a column of strings, which seems to be working just fine.
Not sure about the performance, though.
A genuine droplevels function that is much faster than droplevels and does not perform any kind of unnecessary matching or tabulation of values is collapse::fdroplevels. Example:
library(collapse)
library(microbenchmark)
# wlddev data supplied in collapse, iso3c is a factor
data <- fsubset(wlddev, iso3c %!in% "USA")
microbenchmark(fdroplevels(data), droplevels(data), unit = "relative")
## Unit: relative
## expr min lq mean median uq max neval cld
## fdroplevels(data) 1.0 1.00000 1.00000 1.00000 1.00000 1.00000 100 a
## droplevels(data) 30.2 29.15873 24.54175 24.86147 22.11553 14.23274 100 b

Matching across datasets and columns

I have a vector with words, e.g., like this:
w <- LETTERS[1:5]
and a dataframe with tokens of these words but also tokens of other words in different columns, e.g., like this:
set.seed(21)
df <- data.frame(
w1 = c(sample(LETTERS, 10)),
w2 = c(sample(LETTERS, 10)),
w3 = c(sample(LETTERS, 10)),
w4 = c(sample(LETTERS, 10))
)
df
   w1 w2 w3 w4
1   U  R  A  Y
2   G  X  P  M
3   Q  B  S  R
4   E  O  V  T
5   V  D  G  W
6   T  A  Q  E
7   C  K  L  U
8   D  F  O  Z
9   R  I  M  G
10  O  T  T  I
# convert factor to character:
df[] <- lapply(df[], as.character)
I'd like to extract from df all the tokens of those words that are contained in the vector w. I can do it like this, but that doesn't look nice and is highly repetitive and error-prone if the dataframe is larger:
extract <- c(df$w1[df$w1 %in% w],
df$w2[df$w2 %in% w],
df$w3[df$w3 %in% w],
df$w4[df$w4 %in% w])
I tried this, using paste0 to avoid addressing each column separately but that doesn't work:
extract <- df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w]
extract
data frame with 0 columns and 10 rows
What's wrong with this code? Or which other code would work?
To answer your question, "What's wrong with this code?": The code df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w] is the equivalent of df[df %in% w] because df[paste0("w", 1:4)], which you use twice, simply returns the entirety of df. That means df %in% w will return FALSE FALSE FALSE FALSE because none of the variables in df are in w (w contains strings but not vectors of strings), and df[c(F, F, F, F)] returns an empty data frame.
If you're dealing with a single data type (strings), and the output can be a character vector, then use a matrix instead of a data frame, which is faster and is, in this case, a little easier to subset:
mat <- as.matrix(df)
mat[mat %in% w]
#[1] "B" "D" "E" "E" "A" "B" "E" "B"
This produces the same output as your attempt above with extract <- ….
If you want to keep some semblance of the original data frame structure then you can try the following, which outputs a list (necessary as the returned vectors for each variable might have different lengths):
lapply(df, function(x) x[x %in% w])
#### OUTPUT ####
$w1
[1] "B" "D" "E"
$w2
[1] "E" "A"
$w3
[1] "B"
$w4
[1] "E" "B"
Just call unlist on the returned list if you want a vector.
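A small self-contained illustration of the point above, using a made-up two-column data frame instead of the sampled one: %in% applied to a data frame compares whole columns, whereas flattening the values first compares individual strings.

```r
w <- LETTERS[1:5]
# Toy data (hypothetical), small enough to check by eye.
toy <- data.frame(w1 = c("A", "X", "C"),
                  w2 = c("Y", "B", "Z"),
                  stringsAsFactors = FALSE)

col_test <- toy %in% w               # FALSE FALSE: one result per column

vals <- unlist(toy, use.names = FALSE)  # flatten column by column
extract <- vals[vals %in% w]            # "A" "C" "B"
```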

substitute the elements of a vector with values from dataframe

I need to substitute the elements of a vector which match the elements of a particular column in data frame in R.
Reproducible example:
a<-c("A","B","C","D")
b<-data.frame(col1=c("B","C","E"),col2=c("T","Y","N"))
I need to get the following vector:
new<-c("A","T","Y","D")
What I tried is:
new <- a
new <- b$col2[match(a, b$col1)]
which does the substitution, but converts the unmatched elements into NAs.
Any help is appreciated
You can make a data.table from a and then update only the rows for which there is a match when joining with b.
library(data.table)
setDT(b)
data.table(a)[b, on = .(a = col1), a := i.col2][]
# a
# 1: A
# 2: T
# 3: Y
# 4: D
In base R you could use your current approach but replace the NAs with elements of a using ifelse
temp <- as.character(b$col2[match(a, b$col1)])
ifelse(is.na(temp), a, temp)
# [1] "A" "T" "Y" "D"
You can use replace in base R:
a<-c("A","B","C","D")
b<-data.frame(col1=c("B","C","E"),col2=c("T","Y","N"), stringsAsFactors = F)
replace(a, which(a %in% b$col1), b$col2[b$col1 %in% a])
#[1] "A" "T" "Y" "D"
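One caveat with the replace() call: it pairs positions with replacements in the order they occur in b, so it only gives the right result when the matching rows of b appear in the same order as in a. A sketch that looks each element up with match() instead, shown here with the rows of b deliberately reordered:

```r
a <- c("A", "B", "C", "D")
# Same lookup table as in the question, but with the first two rows swapped.
b <- data.frame(col1 = c("C", "B", "E"),
                col2 = c("Y", "T", "N"),
                stringsAsFactors = FALSE)

idx <- match(a, b$col1)        # row of b for each element of a, NA if absent
found <- !is.na(idx)

new <- a
new[found] <- b$col2[idx[found]]   # "A" "T" "Y" "D"

# For comparison, the order-dependent version now swaps the two values.
naive <- replace(a, which(a %in% b$col1), b$col2[b$col1 %in% a])  # "A" "Y" "T" "D"
```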

Get matrix column name by matching vector

I started with a matrix
     Xray Stay Leave
[1,] "H"  "H"  "H"
[2,] "A"  "L"  "O"
And I have the following vector:
[1] "H" "L"
I want to get the output
"Stay".
I tried this:
which(vec %in% matrix )
but that gives me the following output:
[1] 1 2
Seems to be just telling me the rows that it finds the H and L in. I need the column name of the one that is an exact match.
Another approach:
vec <- c("H", "L")
colnames(mat)[colMeans(mat == vec) == 1]
# [1] "Stay"
where mat is the name of your matrix.
Assuming m is your matrix, you could do
> vec <- c("H", "A")
> colnames(m)[apply(m, 2, identical, vec)]
NOTE: identical used here because the original post says "I need the column name of the one that is an exact match"
This should return a logical vector:
logv <- apply(mat, 2, function(x) identical(vec,x))
Then this will select the correct column name:
dimnames(mat)[[2]][logv]
[1] "Stay"
Test case:
mat <- matrix( c( "H","H", "H", "A","L","O") ,2, byrow=TRUE,
dimnames=list(NULL, c('Xray', 'Stay', 'Leave') ) )
You could try:
If mat and vec are the matrix and vector
colnames(mat)[table(mat %in% vec, (seq_along(mat)-1)%/%nrow(mat) +1)[2,] >1]
#[1] "Stay"
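For reference, a compact runnable version of the exact-match idea, combining the test matrix above with an element-wise comparison per column. It assumes vec has one entry per row of the matrix.

```r
mat <- matrix(c("H", "H", "H",
                "A", "L", "O"),
              nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("Xray", "Stay", "Leave")))
vec <- c("H", "L")

# TRUE only for columns whose entries all equal vec, position by position.
hits <- apply(mat, 2, function(col) all(col == vec))
res <- colnames(mat)[hits]   # "Stay"
```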

Resources