Substitute the elements of a vector with values from a data frame in R

I need to substitute the elements of a vector that match the elements of a particular column in a data frame in R.
Reproducible example:
a<-c("A","B","C","D")
b<-data.frame(col1=c("B","C","E"),col2=c("T","Y","N"))
I need to get the following vector:
new<-c("A","T","Y","D")
What I tried is:
new <- a
new <- b$col2[match(a, b$col1)]
which does the substitution, but converts the unmatched elements into NAs.
Any help is appreciated
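For reference, the NAs come from the unmatched positions in the index lookup; with the example data the match step looks like this:
match(a, b$col1)
# [1] NA  1  2 NA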

You can make a data.table from a and then update only the rows for which there is a match when joining with b.
library(data.table)
setDT(b)
data.table(a)[b, on = .(a = col1), a := i.col2][]
# a
# 1: A
# 2: T
# 3: Y
# 4: D
In base R you could keep your current approach but replace the NAs with the corresponding elements of a using ifelse:
temp <- as.character(b$col2[match(a, b$col1)])
ifelse(is.na(temp), a, temp)
# [1] "A" "T" "Y" "D"

You can use replace in base R:
a<-c("A","B","C","D")
b<-data.frame(col1=c("B","C","E"),col2=c("T","Y","N"), stringsAsFactors = F)
replace(a, which(a %in% b$col1), b$col2[b$col1 %in% a])
#[1] "A" "T" "Y" "D"

Related

Replacing multiple numbers with strings in a dataframe without regex in R

I have columns in a dataframe where I want to replace integers with their corresponding string values. The integers often repeat within a cell, separated by spaces, commas, /, -, etc. For example, my dataframe column is:
> df = data.frame(c1=c(1,2,3,23,c('11,21'),c('13-23')))
> df
     c1
1     1
2     2
3     3
4    23
5 11,21
6 13-23
I have used both the str_replace_all() and str_replace() functions but did not get the desired results.
> df[,1] %>% str_replace_all(c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g"))
[1] "a" "b" "c" "bc" "aa,ba" "ac-bc"
> df[,1] %>% str_replace(c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g"))
Error in fix_replacement(replacement) : argument "replacement" is missing, with no default
The desired result would be:
[1] "a" "b" "c" "g" "d,f" "e-g"
Since there are multiple values to replace, my first choice was str_replace_all(), as it accepts a vector of the original column values and the desired replacement values, but the method fails due to the regex matching. Am I doing it wrong, or is there a better alternative to solve my problem?
Simply place the longest, multi-character patterns at the beginning, like:
library(stringr)
str_replace_all(df[,1],
                c("11"="d","13"="e","21"="f","23"="g","1"="a","2"="b","3"="c"))
#[1] "a" "b" "c" "g" "d,f" "e-g"
and for more complex cases:
x <- c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g")
x <- x[order(nchar(names(x)), decreasing = TRUE)]
str_replace_all(df[,1], x)
#[1] "a" "b" "c" "g" "d,f" "e-g"
Using the ordering method in #GKi's answer, here's a base R version using Reduce/gsub instead of stringr::str_replace_all
Starting vector
x <- as.character(df$c1)
Ordering as in #GKi's answer
repl_dict <- c("11"="d","13"="e","21"="f","23"="g","1"="a","2"="b","3"="c")
repl_dict <- repl_dict[order(nchar(names(repl_dict)), decreasing = TRUE)]
Replacement
Reduce(
  function(x, n) gsub(n, repl_dict[n], x, fixed = TRUE),
  names(repl_dict),
  init = x)
# [1] "a" "b" "c" "g" "d,f" "e-g"

Matching across datasets and columns

I have a vector with words, e.g., like this:
w <- LETTERS[1:5]
and a dataframe with tokens of these words but also tokens of other words in different columns, e.g., like this:
set.seed(21)
df <- data.frame(
  w1 = c(sample(LETTERS, 10)),
  w2 = c(sample(LETTERS, 10)),
  w3 = c(sample(LETTERS, 10)),
  w4 = c(sample(LETTERS, 10))
)
df
   w1 w2 w3 w4
1   U  R  A  Y
2   G  X  P  M
3   Q  B  S  R
4   E  O  V  T
5   V  D  G  W
6   T  A  Q  E
7   C  K  L  U
8   D  F  O  Z
9   R  I  M  G
10  O  T  T  I
# convert factor to character:
df[] <- lapply(df[], as.character)
I'd like to extract from df all the tokens of those words that are contained in the vector w. I can do it like this, but that doesn't look nice and is highly repetitive and error-prone if the data frame is larger:
extract <- c(df$w1[df$w1 %in% w],
             df$w2[df$w2 %in% w],
             df$w3[df$w3 %in% w],
             df$w4[df$w4 %in% w])
I tried this, using paste0 to avoid addressing each column separately but that doesn't work:
extract <- df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w]
extract
data frame with 0 columns and 10 rows
What's wrong with this code? Or which other code would work?
To answer your question, "What's wrong with this code?": The code df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w] is the equivalent of df[df %in% w] because df[paste0("w", 1:4)], which you use twice, simply returns the entirety of df. That means df %in% w will return FALSE FALSE FALSE FALSE because none of the variables in df are in w (w contains strings but not vectors of strings), and df[c(F, F, F, F)] returns an empty data frame.
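To see this concretely, here is a quick sketch of the two intermediate results described above (using the df and w from the question; output shown as comments):
df[paste0("w", 1:4)]   # just df again: all four columns, in their original order
df %in% w              # one logical per column, not per element
# [1] FALSE FALSE FALSE FALSE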
If you're dealing with a single data type (strings), and the output can be a character vector, then use a matrix instead of a data frame, which is faster and is, in this case, a little easier to subset:
mat <- as.matrix(df)
mat[mat %in% w]
#[1] "B" "D" "E" "E" "A" "B" "E" "B"
This produces the same output as your attempt above with extract <- ….
If you want to keep some semblance of the original data frame structure then you can try the following, which outputs a list (necessary as the returned vectors for each variable might have different lengths):
lapply(df, function(x) x[x %in% w])
#### OUTPUT ####
$w1
[1] "B" "D" "E"
$w2
[1] "E" "A"
$w3
[1] "B"
$w4
[1] "E" "B"
Just call unlist or unclass on the returned list if you want a vector.
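For example, flattening the list back into a single vector (a minimal sketch; values as in the output above):
unlist(lapply(df, function(x) x[x %in% w]), use.names = FALSE)
# [1] "B" "D" "E" "E" "A" "B" "E" "B"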

Subset of a list of vectors with grep?

I have a list of vectors and I want to create a new list containing any value containing the letter 'a', but keeping the internal structure.
l <- list(g1 = c('a', 'b', 'ca'),
          g2 = c('a', 'b'))
lapply(l, function(x) grep('a',x) )
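With the example list this returns the match positions rather than the values:
# $g1
# [1] 1 3
#
# $g2
# [1] 1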
lapply only provides the index numbers, but what I want it to return are the values.
The end result should be a list where vector g1 contains 'a' and 'ca' while g2 contains just 'a'.
thanks!
Add value = TRUE.
lapply(l, function(x) grep('a', x, value = TRUE))
# $g1
# [1] "a" "ca"
#
# $g2
# [1] "a"
Alternatively, you can do:
lapply(l, function(x) x[grepl("a", x)])
$g1
[1] "a" "ca"
$g2
[1] "a"
If you want to try the tidyverse, here are a couple of approaches.
library(tidyverse)
map(l, ~grep('a', .x, value=T))
map(l, ~str_subset(.x, 'a')) # str_subset from the stringr package is a wrapper for the grep call shown above.

Extract distinct characters that differ between two strings

I have two strings, a <- "AERRRTX"; b <- "TRRA".
I want to extract the characters in a not used in b, i.e. "ERX"
I tried the answer in Extract characters that differ between two strings, which uses setdiff. It returns "EX", because b does have "R" and setdiff will eliminate all three "R"s in a. My aim is to treat each character occurrence as distinct, so only two of the three R's in a should be eliminated.
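For reference, this is the setdiff behaviour described above (output for the strings from the question shown as a comment):
setdiff(strsplit(a, "")[[1]], strsplit(b, "")[[1]])
# [1] "E" "X"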
Any suggestions on what I can use instead of setdiff, or some other approach to achieve my output?
A different approach using pmatch,
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!1:length(a1) %in% pmatch(b1, a1)]
#[1] "E" "R" "X"
Another example,
a <- "Ronak";b<-"Shah"
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!1:length(a1) %in% pmatch(b1, a1)]
# [1] "R" "o" "n" "k"
You can use the function vsetdiff from the vecsets package:
install.packages("vecsets")
library(vecsets)
a <- "AERRRTX"
b <- "TRRA"
Reduce(vsetdiff, strsplit(c(a, b), split = ""))
## [1] "E" "R" "X"
We can use Reduce() to successively eliminate from a each character found in b:
a <- 'AERRRTX'; b <- 'TRRA';
paste(collapse = '',
      Reduce(function(as, bc) as[-match(bc, as, nomatch = length(as) + 1L)],
             strsplit(b, '')[[1L]],
             strsplit(a, '')[[1L]]));
## [1] "ERX"
This will preserve the order of the surviving characters in a.
Another approach is to mark each character with its occurrence index in a, do the same for b, and then we can use setdiff():
a <- 'AERRRTX'; b <- 'TRRA';
pasteOccurrence <- function(x) ave(x,x,FUN=function(x) paste0(x,seq_along(x)));
paste(collapse = '',
      substr(setdiff(pasteOccurrence(strsplit(a, '')[[1L]]),
                     pasteOccurrence(strsplit(b, '')[[1L]])),
             1L, 1L));
## [1] "ERX"
An alternative using the data.table package:
library(data.table)
x = data.table(table(strsplit(a, '')[[1]]))
y = data.table(table(strsplit(b, '')[[1]]))
dt = y[x, on='V1'][,N:=ifelse(is.na(N),0,N)][N!=i.N,res:=i.N-N][res>0]
rep(dt$V1, dt$res)
#[1] "E" "R" "X"

Split string in each column for several columns

I have this table (data1) with four columns
SNP      rs6576700  rs17054099  rs7730126
sample1  G-G        T-T         G-G
I need to separate columns 2-4 into two columns each, so the new output has 7 columns, like this:
SNP      rs6576700  rs6576700  rs17054099  rs17054099  rs7730126  rs7730126
sample1  G          G          T           T           C          C
With the following function I could split all the columns at once, but the output is not what I need:
split <- function(x){
  x <- as.character(x)
  strsplit(as.character(x), split="-")
}
data2=apply(data1[,-1], 2, split)
data2
$rs17054099
$rs17054099[[1]]
[1] "T" "T"
$rs7730126
$rs7730126[[1]]
[1] "G" "G"
$rs6576700
$rs6576700[[1]]
[1] "C" "C"
On Stack Overflow I found a method to convert the output of strsplit to a data frame, but the rs numbers are in rows, not in columns (I got a similar output with other methods in the thread strsplit by row and distribute results by column in data.frame).
> n <- max(sapply(data2, length))
> l <- lapply(data2, function(X) c(X, rep(NA, n - length(X))))
> data.frame(t(do.call(cbind, l)))
t.do.call.cbind..l..
rs17054099 T, T
rs7730126 G, G
rs2061700 C, C
If I do not use the transpose function (the t() in ...(t(do.call...), the output is a list that I cannot write to a file.
I would like to have the solution in R to make it part of a pipeline.
I forgot to say that I need to apply this to a million columns.
This is straightforward using the splitstackshape::cSplit function. Just specify the column indices in the splitCols parameter and the separator in the sep parameter, and you're done. It will even number your new column names so you will be able to distinguish between them. I've specified type.convert = FALSE so T values won't become TRUE. The default direction is wide, so you don't need to specify it.
library(splitstackshape)
cSplit(data1, 2:4, sep = "-", type.convert = FALSE)
# SNP rs6576700_1 rs6576700_2 rs17054099_1 rs17054099_2 rs7730126_1 rs7730126_2
# 1: sample1 G G T T G G
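Since the question mentions scaling this up to very many columns, the column selection can be generalized rather than hard-coded (a sketch on the same data1):
cSplit(data1, 2:ncol(data1), sep = "-", type.convert = FALSE)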
Here's a solution as per the provided link, using the tstrsplit function from the development version of data.table on GitHub. Here, we define the index by subsetting the column names first, and then number the new columns using paste. This is a slightly more cumbersome approach, but its advantage is that it updates your original data in place instead of creating a copy of the whole data set.
library(data.table) ## V1.9.5+
indx <- names(data1)[2:4]
setDT(data1)[, paste0(rep(indx, each = 2), 1:2) := sapply(.SD, tstrsplit, "-"), .SDcols = indx]
data1
# SNP rs6576700 rs17054099 rs7730126 rs65767001 rs65767002 rs170540991 rs170540992 rs77301261 rs77301262
# 1: sample1 G-G T-T G-G G G T T G G
Here you want to use apply over the rows instead of columns:
df <- rbind(c("SNP", "rs6576700", "rs17054099", "rs7730126"),
c("sample1", "G-G", "T-T", "G-G"),
c("sample2", "C-C", "T-T", "G-C"))
t(apply(df[-1,], 1, function(col) unlist(strsplit(col, "-"))))
#     [,1]      [,2] [,3] [,4] [,5] [,6] [,7]
#[1,] "sample1" "G"  "G"  "T"  "T"  "G"  "G"
#[2,] "sample2" "C"  "C"  "T"  "T"  "G"  "C"
