Replace multiple words with a single word - r

I want to replace multiple letters/words with a single letter/word, multiple times in a dataframe. As an example,
Some data:
df = data.frame(
a = 1:8,
b = c("colour1 o", "colour2 O", "colour3 out", "colour4 Out",
"soundi i", "soundr I", "sounde in", "soundw In"))
df
a b
1 1 colour1 o
2 2 colour2 O
3 3 colour3 out
4 4 colour4 Out
5 5 soundi i
6 6 soundr I
7 7 sounde in
8 8 soundw In
Here is what I want to replace with:
df_repl <- list(
O = c("o", "out", "Out"),
In = c("i", "in", "I"))
So in df$b o, out and Out should become O and i, in and I become In, but only if they are separated from any other words by a space, so o in colour is not capitalised.
This gets me halfway there, but I think I need another nested for-loop to move through df_repl...
for (word in df_repl[[1]]){
patt <- paste0('\\b', word, '\\b')
repl <- paste(names(df_repl[1]))
df$b <- gsub(patt, repl, df$b)
}
df
a b
1 1 colour1 O
2 2 colour2 O
3 3 colour3 O
4 4 colour4 O
5 5 soundi i
6 6 soundr I
7 7 sounde in
8 8 soundw In
Above, o, out and Out become O, but i, in and I are not altered. Here is the desired output:
a b
1 1 colour1 O
2 2 colour2 O
3 3 colour3 O
4 4 colour4 O
5 5 soundi In
6 6 soundr In
7 7 sounde In
8 8 soundw In
In the real data there are many more than two replacement words/letters, so I can't just rerun the for-loop again. I'm not tied to a for-loop solution, but I would prefer base R. Any suggestions much appreciated.
EDIT
Trying to clarify my question:
Whenever one of o, out or Out occur in df$b I want to replace it with O
Whenever one of i, in or I occur in df$b I want to replace it with In
I can achieve the desired output like this:
for (word in df_repl[[1]]){
patt <- paste0('\\b', word, '\\b')
repl <- paste(names(df_repl[1]))
df$b <- gsub(patt, repl, df$b)
}
for (word in df_repl[[2]]){
patt <- paste0('\\b', word, '\\b')
repl <- paste(names(df_repl[2]))
df$b <- gsub(patt, repl, df$b)
}
But in my real dataset df_repl is length 50 rather than two, so I don't want to copy/paste/edit/rerun the for-loop 50 times.

You can skip the inner loop over the words in df_repl by pasting them together with | (or) between the words. Using gsub rather than sub also replaces every match in a string, not just the first:
for(i in names(df_repl)) {
df$b <- gsub(paste(paste0("\\b", df_repl[[i]], "\\b"), collapse = "|"), i, df$b)
}
df
# a b
#1 1 colour1 O
#2 2 colour2 O
#3 3 colour3 O
#4 4 colour4 O
#5 5 soundi In
#6 6 soundr In
#7 7 sounde In
#8 8 soundw In
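As a minimal sketch of the same idea, the patterns can be built once up front, one per replacement, by wrapping the alternatives in a single group between word boundaries. The name `patterns` is introduced here for illustration only; `df` and `df_repl` are recreated from the question so the block runs on its own.

```r
# Data and replacement list as given in the question
df <- data.frame(
  a = 1:8,
  b = c("colour1 o", "colour2 O", "colour3 out", "colour4 Out",
        "soundi i", "soundr I", "sounde in", "soundw In"))
df_repl <- list(
  O = c("o", "out", "Out"),
  In = c("i", "in", "I"))

# One pattern per replacement, e.g. "\\b(o|out|Out)\\b"
# ('patterns' is a hypothetical helper name, not from the original answer)
patterns <- vapply(
  df_repl,
  function(w) paste0("\\b(", paste(w, collapse = "|"), ")\\b"),
  character(1))

# Apply each pattern in turn; gsub replaces every whole-word match
for (nm in names(patterns)) {
  df$b <- gsub(patterns[[nm]], nm, df$b)
}
df
```

Because the replacements are applied one after another, this sketch assumes that no replacement value (here O or In) also appears as a word to be replaced in a later entry of df_repl.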

You may try using three separate calls to sub:
df$b <- sub("\\bo\\b", "i", df$b)
df$b <- sub("\\bout\\b", "in", df$b)
df$b <- sub("\\bOut\\b", "I", df$b)
df
a b
1 1 colour1 i
2 2 colour2 O
3 3 colour3 in
4 4 colour4 I
5 5 soundi i
6 6 soundr I
7 7 sounde in
8 8 soundw In
To automate this, you could try using sapply with an index:
terms_in <- c("o", "out", "Out")
pat <- paste0("\\b", terms_in, "\\b")
replace <- c("i", "in", "I")
sapply(seq_along(pat), function(x) {
df$b <<- sub(pat[x], replace[x], df$b)
})

This is another solution:
library(stringr)
in1 <- str_split(df$b, " ", simplify = TRUE)[,1]
in2 <- str_split(df$b, " ", simplify = TRUE)[,2]
in2[in2 %in% c("o", "out", "Out")] <- "O"
in2[in2 %in% c("i", "in", "I")] <- "In"
df$b <- paste(in1, in2, sep=" ")
df
If you have a long list of words in your data, you could also move c(word list) outside:
in1<- str_split(df$b, " ", simplify = TRUE)[,1]
in2<- str_split(df$b, " ", simplify = TRUE)[,2]
o <- c("o", "out", "Out")
i <- c("i", "in", "I")
in2[in2 %in% o] <- "O"
in2[in2 %in% i] <- "In"
df$b <- paste(in1, in2, sep=" ")
df
> df
a b
1 1 colour1 O
2 2 colour2 O
3 3 colour3 O
4 4 colour4 O
5 5 soundi In
6 6 soundr In
7 7 sounde In
8 8 soundw In
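The token-matching idea above can also be sketched in base R with a lookup vector built from df_repl, which avoids writing one line per replacement. This is only an illustration: `lookup` and `replace_tokens` are hypothetical names, and the sketch assumes words are separated by single spaces.

```r
# Data and replacement list as given in the question
df <- data.frame(
  a = 1:8,
  b = c("colour1 o", "colour2 O", "colour3 out", "colour4 Out",
        "soundi i", "soundr I", "sounde in", "soundw In"))
df_repl <- list(
  O = c("o", "out", "Out"),
  In = c("i", "in", "I"))

# Named vector: names are the words to find, values are their replacements
lookup <- setNames(rep(names(df_repl), lengths(df_repl)), unlist(df_repl))

# Replace every space-separated token that appears in the lookup
# (hypothetical helper, assumes single-space separators)
replace_tokens <- function(s, lookup) {
  tokens <- strsplit(s, " ", fixed = TRUE)[[1]]
  hit <- tokens %in% names(lookup)
  tokens[hit] <- lookup[tokens[hit]]
  paste(tokens, collapse = " ")
}

df$b <- vapply(as.character(df$b), replace_tokens, character(1),
               lookup = lookup, USE.NAMES = FALSE)
df
```

Unlike the two-column split, this version checks every token in the string, so it does not depend on the word to replace being in second position.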

Related

How to move values between columns with condition

I want to move the value from column II to column I only when the row in column I shows NA.
This is the data.frame :
id <- c("A","B","C","D", "E","F","G")
I <- c("NA","NA","NA","10","20","NA","30")
II <- c("3","4","5","6","7", "8", "8")
df <- data.frame(id, I, II)
The expected result would be like this :
id <- c("A","B","C","D", "E","F","G")
I <- c("NA","NA","NA","10","20","NA","30")
II <- c("3","4","5","6","7", "8", "8")
III <- c("3","4","5","10","20", "8", "30")
df <- data.frame(id, I, II,III)
Thanks in advance!
You can use ifelse :
transform(df, III = ifelse(I == 'NA', II, I))
# id I II III
#1 A NA 3 3
#2 B NA 4 4
#3 C NA 5 5
#4 D 10 6 10
#5 E 20 7 20
#6 F NA 8 8
#7 G 30 8 30
We can use a simple coalesce after converting the quoted "NA" to actual unquoted NA in a single line
library(dplyr)
df1 <- df %>%
mutate(III = coalesce(na_if(I, "NA"), II))
df1
# id I II III
#1 A <NA> 3 3
#2 B <NA> 4 4
#3 C <NA> 5 5
#4 D 10 6 10
#5 E 20 7 20
#6 F <NA> 8 8
#7 G 30 8 30
Or using base R, change the "NA" to NA, then create a logical vector based on the presence of NA elements in 'I' to change the values of 'III' (after assigning the values of 'II'):
df$I[df$I == "NA"] <- NA
df$III <- df$II
df$III[!is.na(df$I)] <- df$I[!is.na(df$I)]
Or with ifelse
df$III <- with(df, ifelse(I == "NA", II, I))
A simple for loop in Base R will get this done
III = 0
for (i in 1:length(id)){
if (I[i] == "NA"){
III[i] = II[i]} else {
III[i] = I[i]}
}
df = data.frame(id, I, II, III)
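The answers above all work around the fact that the question stores NA as the string "NA". As a sketch under a different assumption, namely that the columns are numeric and the missing values are real NA, the same result needs only is.na:

```r
# Assumption: numeric columns with real missing values, not the string "NA"
df <- data.frame(
  id = c("A", "B", "C", "D", "E", "F", "G"),
  I  = c(NA, NA, NA, 10, 20, NA, 30),
  II = c(3, 4, 5, 6, 7, 8, 8))

# Take II where I is missing, otherwise keep I
df$III <- ifelse(is.na(df$I), df$II, df$I)
df
```

With real NA values, a comparison such as I == "NA" would itself return NA rather than TRUE, which is why is.na is used here.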

Add an incremental letter to filled NAs from na.locf()

I have a data.frame that looks like this:
df <- structure(list(
a = c("atg", "tga", "agt", "acc", "cgt", "gca",
"gtc", "ggg", "ccc"),
b = c("1", "2", NA, "3", NA, NA, "4", "5",
"6")
),
row.names = c(NA, -9L),
class = "data.frame")
I have replaced the NAs with the nearest non-NA using na.locf from the zoo package, but I need to add an incremental letter to the replaced NA values, so that the end product looks like this:
> df
a b
1 atg 1
2 tga 2
3 agt 2a
4 acc 3
5 cgt 3a
6 gca 3b
7 gtc 4
8 ggg 5
9 ccc 6
I wrote a small if function that fills the NAs appropriately, but it adds letters to all values and recycles the numbers to match the length of letters. I can see that this result comes from the any call within the function. I am now thinking I probably need a for loop to increment through each cell; however, a for loop with a variant of the if statement doesn't do anything. Any suggestions are welcome.
> testif <- function(x) {
+ if (any(is.na(x))) {
+ paste(na.locf(x), letters, sep = "")
+ }
+ }
for (x in df$b) {
+ if (any(is.na(x))) {
+ paste(test$b, na.locf(x), letters, sep = "")
+ }
+ }
Define seq_let which gives a sequence of letters the length of its argument if its argument is all NA and "" otherwise. Then group the NAs and non-NA runs using ave and rleid and apply seq_let to each group prepending na.locf0(b) to it.
library(data.table)
library(zoo)
seq_let <- function(x) if (all(is.na(x))) letters[seq_along(x)] else ""
transform(df, b = paste0(na.locf0(b), ave(b, rleid(is.na(b)), FUN = seq_let)))
giving:
a b
1 atg 1
2 tga 2
3 agt 2a
4 acc 3
5 cgt 3a
6 gca 3b
7 gtc 4
8 ggg 5
9 ccc 6
You can do this with zoo and base R:
x=zoo::na.locf(df$b)
s=as.numeric(ave(x,x,FUN=function(x) seq_along(x)))-1
x[s!=0]=paste0(x[s!=0],letters[s])
df$b=x
df
a b
1 atg 1
2 tga 2
3 agt 2a
4 acc 3
5 cgt 3a
6 gca 3b
7 gtc 4
8 ggg 5
9 ccc 6
Borrowing code from Create counter within consecutive runs of certain values:
i <- is.na(df$b)
g <- cumsum(i)
df$b <- paste0(na.locf(df$b), c("", letters)[g - cummax((!i) * g) + 1])
# a b
# 1 atg 1
# 2 tga 2
# 3 agt 2a
# 4 acc 3
# 5 cgt 3a
# 6 gca 3b
# 7 gtc 4
# 8 ggg 5
# 9 ccc 6
More compact using data.table, picking the main idea from: Count consecutive TRUE values within each block separately
library(data.table)
setDT(df)[ , b := paste0(na.locf(b), c("", letters)[rowid(rleid(b)) * is.na(b) + 1])]
# a b
# 1: atg 1
# 2: tga 2
# 3: agt 2a
# 4: acc 3
# 5: cgt 3a
# 6: gca 3b
# 7: gtc 4
# 8: ggg 5
# 9: ccc 6
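For comparison, here is a sketch that uses base R only, replacing na.locf with a cumsum-based fill. It assumes the first element of b is not NA and that no run of NAs is longer than 26; the variable names are illustrative.

```r
b <- c("1", "2", NA, "3", NA, NA, "4", "5", "6")

is_na <- is.na(b)
# Index of the most recent non-NA value for every position
grp <- cumsum(!is_na)
# Last observation carried forward (assumes b[1] is not NA)
filled <- b[!is_na][grp]
# Position within each group, minus 1: 0 for original values, 1, 2, ... for fills
k <- ave(seq_along(b), grp, FUN = seq_along) - 1
# Index 1 of c("", letters) is the empty suffix for original values
result <- paste0(filled, c("", letters)[k + 1])
result
```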

remove cases following certain other cases

I have a dataframe, say
df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
I want to remove only those rows in which one or more ts sit directly between a d and a c; in all other cases I want to keep the rows. So for this example, I would like to remove the ts on rows 8, 18 and 19, but keep the others. I have thousands of cases, so doing this manually would be a true horror. Any help is very much appreciated.
One option would be to use rle to get runs of the same string and then you can use an sapply to check forward/backward and return all the positions you want to drop:
rle_vals <- rle(as.character(df$x))
drop <- unlist(sapply(2:(length(rle_vals$values) - 1), # loop over interior runs only, so i+1 always exists
  function(i, vals, lengths) {
    # Check if the run is "t", the previous run is "d" and the next run is "c"
    if (vals[i] == "t" && vals[i-1] == "d" && vals[i+1] == "c") {
      (sum(lengths[1:(i-1)]) + 1):sum(lengths[1:i]) # Get row #s
    }
  }, vals = rle_vals$values, lengths = rle_vals$lengths))
drop
#[1] 8 18 19
df[-drop,]
# x y
#1 a 2
#2 a 4
#3 b 5
#4 b 2
#5 b 6
#6 c 2
#7 d 4
#9 c 2
#10 b 6
#11 t 2
#12 c 4
#13 t 5
#14 a 2
#15 a 6
#16 b 2
#17 d 4
#20 c 6
This also works, by collapsing to a string, identifying groups of t's between d and c (or c and d - not sure whether you wanted this option as well), then working out where they are and removing the rows as appropriate.
df = data.frame(x=c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y=c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6),stringsAsFactors = FALSE)
dfs <- paste0(df$x,collapse="") #collapse to a string
dfs2 <- do.call(rbind,lapply(list(gregexpr("dt+c",dfs),gregexpr("ct+d",dfs)),
function(L) data.frame(x=L[[1]],y=attr(L[[1]],"match.length"))))
dfs2 <- dfs2[dfs2$x>0,] #remove any -1 values (if string not found)
drop <- unlist(mapply(function(a,b) (a+1):(a+b-2),dfs2$x,dfs2$y))
df2 <- df[-drop,]
Here is another solution with base R:
df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
#
s <- paste0(df$x, collapse="")
L <- c(NA, NA)
while (TRUE) {
r <- regexec("dt+c", s)[[1]]
if (r[1]==-1) break
L <- rbind(L, c(pos=r[1]+1, length=attr(r, "match.length")-2))
s <- sub("d(t+)c", "x\\1x", s)
}
L <- L[-1,]
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]
# > drop
# 8 18 19
# > df[-drop, ]
# x y
# 1 a 2
# 2 a 4
# 3 b 5
# 4 b 2
# 5 b 6
# 6 c 2
# 7 d 4
# 9 c 2
# 10 b 6
# 11 t 2
# 12 c 4
# 13 t 5
# 14 a 2
# 15 a 6
# 16 b 2
# 17 d 4
# 20 c 6
With gregexpr() it is shorter:
s <- paste0(df$x, collapse="")
g <- gregexpr("dt+c", s)[[1]]
L <- data.frame(pos=g+1, length=attr(g, "match.length")-2)
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]
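The rle idea from the first answer can also be sketched without an explicit loop by comparing each run with its neighbours and expanding the result back to a row-level logical vector. The names `prev`, `nxt`, `bad_run` and `keep` are illustrative.

```r
df <- data.frame(
  x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
  y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))

r <- rle(as.character(df$x))
n <- length(r$values)
# Value of the previous and next run; NA at the two ends
prev <- c(NA, r$values[-n])
nxt  <- c(r$values[-1], NA)
# A run of "t" is dropped only when it sits between a "d" run and a "c" run;
# %in% treats the NA ends as FALSE
bad_run <- r$values == "t" & prev %in% "d" & nxt %in% "c"
# Expand the per-run flag back to one flag per row
keep <- !rep(bad_run, r$lengths)
df[keep, ]
```

Subsetting with a logical vector also behaves sensibly when nothing matches: with the df[-drop, ] form, an empty drop selects zero rows instead of keeping all of them.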

R: split data frame rows by space, remove common elements, put unequal length columns in new df

Suppose I have a df with two rows of strings that I need to split by space and unlist, then find the anti-intersection and reuse it in a list. I can do it by brute force, working with each row individually. The problem is that there can be more than 2 rows. My working solution so far is below, but there must be a simpler way that avoids accessing each row. Thanks!!
df = structure(list(A = structure(1:2, .Label = c("R1", "R2"), class = "factor"),
B = c("a b c d e f g o l",
"b h i j k l m n o p q"
)), .Names = c("A", "B"), row.names = c(NA, -2L), class = "data.frame")
dat1 = unlist(strsplit(df[1,2]," "))
dat2 = unlist(strsplit(df[2,2]," "))
f <- function (...)
{
aux <- list(...)
ind <- rep(1:length(aux), sapply(aux, length))
x <- unlist(aux)
boo <- !(duplicated(x) | duplicated(x, fromLast = T))
split(x[boo], ind[boo])
}
excl = (f(dat1, dat2))
L <- list(excl[[1]],excl[[2]])
cfun <- function(L) {
pad.na <- function(x,len) {
c(x,rep("",len-length(x)))
}
maxlen <- max(sapply(L,length))
print(maxlen)
do.call(data.frame,lapply(L,pad.na,len=maxlen))
}
a = cfun(L)
What I had:
A B
1 Food a b c d e f g
2 HABA b h i j k l m n o p q
What I got:
c..a....c....d....e....f....g.......... c..h....i....j....k....m....n....p....q..
1 a h
2 c i
3 d j
4 e k
5 f m
6 g n
7 p
8 q
Edit: The goal is to eliminate common elements from all columns. I.e. if "4" is present in row 1 and seen anywhere else - remove. New test set:
df1 = structure(list(A = structure(1:3, .Label = c("R1", "R2", "R3"
), class = "factor"), B = c("1 4 78 5 4 6 7 0", "2 3 76 8 2 1 8 0",
"4 7 1 2")), .Names = c("A", "B"), row.names = c(NA, -3L), class = "data.frame")
Current output from suggested code:
a b c
1 4 2 4
2 78 3 7
3 5 76 2
4 4 8 NA
5 6 2 NA
6 7 8 NA
7 0 0 NA
2, 4, and 7 should not be there as they are seen in more than 1 column. Bottom line - output should consist of unique numbers/elements only in any columns. Thanks!!
Here's one way using base R that avoids a lot of your current code
## split column B on the space character
s <- strsplit(df$B, " ")
## find the intersection of all s
r <- Reduce(intersect, s)
## iterate over s, removing the intersection characters in r
l <- lapply(s, function(x) x[!x %in% r])
## reset the length of each vector in l to the length of the longest vector
## then create the new data frame
setNames(as.data.frame(lapply(l, "length<-", max(lengths(l)))), letters[seq_along(l)])
# a b
# 1 a h
# 2 c i
# 3 d j
# 4 e k
# 5 f m
# 6 g n
# 7 <NA> p
# 8 <NA> q
I presume that's what you are shooting for?
Note that lengths() is a new function in the base package of R version 3.2.0 that is a faster, more efficient replacement for sapply(x, length) on a list.
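Reduce(intersect, s) removes only the elements found in every row. For the requirement in the question's edit, where anything seen in more than one row should go, one possible sketch counts in how many rows each element occurs. The names `counts` and `shared` are illustrative, and deduplicating within a row is an assumption based on the asker's request for unique elements only.

```r
df1 <- data.frame(
  A = c("R1", "R2", "R3"),
  B = c("1 4 78 5 4 6 7 0", "2 3 76 8 2 1 8 0", "4 7 1 2"),
  stringsAsFactors = FALSE)

s <- strsplit(df1$B, " ", fixed = TRUE)
# Count the number of rows each element appears in (once per row)
counts <- table(unlist(lapply(s, unique)))
shared <- names(counts)[counts > 1]
# Drop shared elements; unique() also removes repeats within a row (assumption)
l <- lapply(s, function(x) unique(x[!x %in% shared]))
# Pad to equal length with NA and build the data frame as in the answer above
out <- setNames(as.data.frame(lapply(l, "length<-", max(lengths(l)))),
                letters[seq_along(l)])
out
```

In this test set every element of the third row also occurs in another row, so its column ends up entirely NA.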

How can I build an inverted index from a data frame in R?

Say I have a data frame in R : data.frame(x=1:4, y=c("a b c", "b", "a c", "c"))
x y
1 1 a b c
2 2 b
3 3 a c
4 4 c
Now I want to build a new data frame, an inverted index which is quite common in IR or recommendation systems, from it:
y x
a 1 3
b 1 2
c 1 3 4
How can I do this in an efficient way?
conv <- function(x) {
l <- function(z) {
paste(x$x[grep(z, x$y)], collapse=' ')
}
lv <- Vectorize(l)
alphabet <- unique(unlist(strsplit(as.character(x$y), ' '))) # hard-coding this might be preferred for some uses.
y <- lv(alphabet)
data.frame(y=names(y), x=y)
}
x <- data.frame(x=1:4, y=c("a b c", "b", "a c", "c"))
> conv(x)
## y x
## a a 1 3
## b b 1 2
## c c 1 3 4
An attempt, after converting y to characters:
test <- data.frame(x=1:4,y=c("a b c","b","a c","c"),stringsAsFactors=FALSE)
result <- strsplit(test$y," ")
result2 <- sapply(unique(unlist(result)),function(y) sapply(result,function(x) y %in% x))
result3 <- apply(result2,2,function(x) test$x[which(x)])
final <- data.frame(x=names(result3),y=sapply(result3,paste,collapse=" "))
> final
x y
a a 1 3
b b 1 2
c c 1 3 4
Quick and dirty:
original.df <- data.frame(x=1:4, y=c("a b c", "b", "a c", "c"))
original.df$y <- as.character(original.df$y)
y.split <- strsplit(original.df$y, " ")
y.unlisted <- unique(unlist(y.split))
new.df <-
sapply(y.unlisted, function(element)
paste(which(sapply(y.split, function(y.row) element %in% y.row)), collapse=" " ))
as.data.frame(new.df)
> new.df
a 1 3
b 1 2
c 1 3 4
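Another way to sketch the inverted index is to reshape to one row per (x, token) pair and then use split, which avoids testing every token against every row. The names `tokens`, `long`, `idx` and `inv` are illustrative.

```r
df <- data.frame(x = 1:4, y = c("a b c", "b", "a c", "c"),
                 stringsAsFactors = FALSE)

# One character vector of tokens per row
tokens <- strsplit(df$y, " ", fixed = TRUE)
# Long format: repeat each x once per token in its row
long <- data.frame(x = rep(df$x, lengths(tokens)),
                   y = unlist(tokens),
                   stringsAsFactors = FALSE)
# Group the x values by token
idx <- split(long$x, long$y)
# Collapse each group into a space-separated string
inv <- data.frame(y = names(idx),
                  x = unname(vapply(idx, paste, character(1), collapse = " ")),
                  stringsAsFactors = FALSE)
inv
```

Keeping the intermediate list idx may be more useful than the pasted strings if the row numbers are needed for further lookups.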
