R remove repetitive strings [duplicate]

I want to remove repetitive strings in R.
I simplified my situation and tried two things.
First, I tried a negative index:
x=c("a","a","b","c","d")
x[-(x=="a")]
I expected this to remove all "a"s, but the result is
[1] "a" "b" "c" "d"
Secondly, I tried assigning NULL:
x[x=="a"]=NULL
But there was an error:
Error in x[x == "a"] = NULL : replacement has length zero
How can I remove repetitive strings? In this situation, that means removing all "a"s and printing
[1] "b" "c" "d"

If the intention is to remove 'a' because it is repeated, use table to get the frequency of each element, then subset the vector with %in% and negate (!) the result:
x[!x %in% names(which(table(x) > 1))]
#[1] "b" "c" "d"
Or using duplicated
x[!(duplicated(x)|duplicated(x, fromLast = TRUE))]
Or if it is based on the adjacent elements that are repeated, use rle
with(rle(x), values[lengths ==1])
#[1] "b" "c" "d"
NOTE: All of the above remove the repeated elements programmatically, with no manual checks.
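The three approaches only agree when the repeats sit next to each other. A small sketch on a made-up vector (y is not from the question, just an assumption for illustration) shows where rle differs:

```r
# Hypothetical input where "a" repeats but the copies are not adjacent
y <- c("a", "b", "a", "c")

# Frequency-based: drops every value that occurs more than once anywhere
by_table <- y[!y %in% names(which(table(y) > 1))]

# duplicated()-based: also drops every value that occurs more than once
by_dup <- y[!(duplicated(y) | duplicated(y, fromLast = TRUE))]

# rle()-based: only looks at adjacent runs, so both "a"s survive here
by_rle <- with(rle(y), values[lengths == 1])

print(by_table)
print(by_dup)
print(by_rle)
```

So table or duplicated is probably the safer choice unless you specifically care about consecutive repeats.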
From the OP's comments, it is clear that they want to remove only specific elements that are known to be duplicates. In that case,
x[! x %in% c("a")]
Here we use %in% because == compares element by element (recycling the shorter side), so it is only reliable when comparing against a single value.
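For what it's worth, here is a short sketch of why the negative-index attempt in the question misbehaves, next to the logical versions. The second value "b" in the last call is only there to illustrate several values at once:

```r
x <- c("a", "a", "b", "c", "d")

# -(x == "a") coerces TRUE/FALSE to -1/0, so the index is c(-1, -1, 0, 0, 0)
# and only the first element is dropped
neg_index <- x[-(x == "a")]

# Logical subsetting keeps the elements that are not "a"
no_a <- x[x != "a"]

# %in% extends to a set of values ("b" added purely as an illustration)
no_ab <- x[!x %in% c("a", "b")]

print(neg_index)
print(no_a)
print(no_ab)
```

Assigning NULL fails for a similar reason: an atomic vector cannot hold NULL, so elements have to be dropped by subsetting rather than by replacement.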

Related

Two Column R Dataframe to Named List [duplicate]

I am trying to convert a two-column dataframe to a named list. There are several solutions on StackOverflow where every value in the first column becomes the 'name', but I am looking to collapse the values in column 2 into common values in column 1.
For example, the list should look like the following:
# Create a Named list of keywords associated with each file.
fileKeywords <- list(fooBar.R = c("A","B","C"),
driver.R = c("A","F","G"))
Where I can retrieve all keywords for "fooBar.R" using:
# Get the keywords for a named file
fileKeywords[["fooBar.R"]]
My data frame looks like:
df <- read.table(header = TRUE, text = "
file keyWord
'fooBar.R' 'A'
'fooBar.R' 'B'
'fooBar.R' 'C'
'driver.R' 'A'
'driver.R' 'F'
'driver.R' 'G'
")
I'm sure there is a simple solution that I am missing.
You could use unstack:
as.list(unstack(rev(df)))
$driver.R
[1] "A" "F" "G"
$fooBar.R
[1] "A" "B" "C"
This is equivalent to as.list(unstack(df, keyWord~file))
To go from the named list to a two-column data frame, we can use stack in base R:
stack(fileKeywords)[2:1]
If it is the opposite direction (data frame to named list), then we can do
with(df, tapply(keyWord, file, FUN = I))
Output:
#$driver.R
#[1] "A" "F" "G"
#$fooBar.R
#[1] "A" "B" "C"
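Another base R option, sketched here as an alternative rather than something taken from the answers above, is split, which groups one vector by the values of another and returns a plain named list. The data frame is rebuilt with data.frame so the snippet runs on its own:

```r
# Same data as in the question, rebuilt so the example is self-contained
df <- data.frame(
  file = c("fooBar.R", "fooBar.R", "fooBar.R", "driver.R", "driver.R", "driver.R"),
  keyWord = c("A", "B", "C", "A", "F", "G"),
  stringsAsFactors = FALSE
)

# split() groups keyWord by file and returns a named list
fileKeywords <- split(df$keyWord, df$file)

# Get the keywords for a named file
print(fileKeywords[["fooBar.R"]])
```

Unlike unstack, this returns a list even when the groups have different lengths.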

Return number of specific element of vector based on its name [duplicate]

I need to return the position of an element in a vector based on the element's name. Let's say I have a vector of letters:
myLetters=letters[1:26]
> myLetters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
and what I intend to do is create or find a function that returns the position of the element when called, for example:
myFunction(myLetters["b"])
[1] 2
myFunction(myLetters["z"])
[1] 26
In summary, I need a way to refer to Excel columns by writing the letters of a column (A, B, C, later maybe even AA or further) and get the number back.
If you want to refer to Excel column names, you could create a reference vector with all Excel column names of up to three letters:
eg1 <- expand.grid(LETTERS, LETTERS)
eg2 <- expand.grid(LETTERS, LETTERS, LETTERS)
excelcols <- c(LETTERS, paste0(eg1[[2]], eg1[[1]]), paste0(eg2[[3]], eg2[[2]], eg2[[1]]))
After which you can use which:
> which(excelcols == 'A')
[1] 1
> which(excelcols == 'AB')
[1] 28
> which(excelcols == 'ABC')
[1] 731
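If you would rather not build the lookup vector, the same numbers can be computed directly by treating the column name as a base-26 number whose digits are A = 1 to Z = 26. This is only a sketch, and excel_col_to_num is a made-up name, not a function from any package:

```r
# Hypothetical helper: convert Excel-style column names to column numbers
excel_col_to_num <- function(col) {
  vapply(toupper(col), function(one) {
    # position of each letter in the alphabet, e.g. "ABC" -> 1, 2, 3
    digits <- match(strsplit(one, "")[[1]], LETTERS)
    # weight the letters by descending powers of 26
    sum(digits * 26^(rev(seq_along(digits)) - 1))
  }, numeric(1), USE.NAMES = FALSE)
}

print(excel_col_to_num(c("A", "Z", "AB", "ABC")))

# For single lower-case letters, match() against letters is enough
print(match("b", letters))
```

The results agree with the which() lookups above (1, 28 and 731 for A, AB and ABC).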
If you need to find the number of times a specific letter occurs, then the following should work:
myLetters = c("a","a", "b")
myFunction = function(myLetters, findLetter){
length(which(myLetters==findLetter))
}
Let's find how many times "a" occurs in myLetters:
myFunction(myLetters, "a")
# [1] 2

Regex Matching Negative values

I'm trying to create some simple and easy to write content-clusters with multiple regexes.
Imagine a list of strings: c("a","b","ac")
The groups I need to define are "All: a's" and "All: b's". So the values "a" and "ac" are "A" and "b" is "B".
myDF$contentGroup <- sub(".*a.*", "A", myDF$stringList)
However, this results in a "contentGroup" column in my dataframe that contains the original value of "stringList" wherever no match occurred. So if I run the same line of code for "B", it overwrites the "A"s.
myDF$contentGroup <- sub(".*b.*", "B", myDF$stringList)
I just can't figure out how to do simple clustering in a single line of code, keeping it as simple as possible.
You can use grep to match 'a' and 'b', and replace as follows:
x <- c("a", "b", "ac")
x[grep('a', x, fixed = TRUE)] <- 'A'
x[grep('b', x, fixed = TRUE)] <- 'B'
x
#[1] "A" "B" "A"
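If a single expression that writes to a separate column is what you are after, one possible sketch uses grepl with nested ifelse. Strings matching neither pattern become NA here, which is an assumption about what you want; the first matching pattern wins when a string contains both letters:

```r
x <- c("a", "b", "ac")

# grepl() returns TRUE/FALSE per element, so the original strings are never
# copied into the result; unmatched strings become NA (an assumption)
contentGroup <- ifelse(grepl("a", x, fixed = TRUE), "A",
                ifelse(grepl("b", x, fixed = TRUE), "B", NA_character_))

print(contentGroup)
```

With your data frame this would be assigned as myDF$contentGroup, using myDF$stringList in place of x.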

R: trim consecutive trailing and leading special characters from set of strings

I have a list of character vectors, all equal lengths. Example data:
> a = list('**aaa', 'bb*bb', 'cccc*')
> a = sapply(a, strsplit, '')
> a
[[1]]
[1] "*" "*" "a" "a" "a"
[[2]]
[1] "b" "b" "*" "b" "b"
[[3]]
[1] "c" "c" "c" "c" "*"
I would like to identify the indices of all leading and trailing consecutive occurrences of the character *. Then I would like to remove these indices from all three vectors in the list. By trailing and leading consecutive characters I mean e.g. either only a single occurrence as in the third one (cccc*) or multiple consecutive ones as in the first one (**aaa).
After the removal, all three character vectors should still have the same length.
So the first two and the last character should be removed from all three vectors.
[[1]]
[1] "a" "a"
[[2]]
[1] "*" "b"
[[3]]
[1] "c" "c"
Note that the second vector of the desired result will still have a leading *, which, however became the first character after the operation, so it should be in.
I tried using which to identify the indices (sapply(a, function(x)which(x=='*'))) but this would still require some code to detect the trailing ones.
Any ideas for a simple solution?
I would replace the leading and trailing stars with NA:
aa <- lapply(setNames(a,seq_along(a)), function(x) {
star = x=="*"
toNA = cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
replace(x, toNA, NA)
})
Store in a data.frame:
DF <- do.call(data.frame, c(aa, list(stringsAsFactors=FALSE)) )
Omit all rows with NA:
res <- na.omit(DF)
#   X1 X2 X3
# 3  a  *  c
# 4  a  b  c
If you hate data.frames and want your list back: lapply(res,I) or c(unclass(res)), which gives
$X1
[1] "a" "a"
$X2
[1] "*" "b"
$X3
[1] "c" "c"
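The same idea can also be sketched without the data.frame detour: build a logical "edge star" mask per vector, OR the masks together, and drop those positions from every vector. The helper name is_edge_star is made up for this example:

```r
a <- strsplit(c("**aaa", "bb*bb", "cccc*"), "")

# Hypothetical helper: TRUE at positions that belong to a leading or
# trailing run of "*" within one vector
is_edge_star <- function(x) {
  star <- x == "*"
  cumsum(!star) == 0 | rev(cumsum(rev(!star))) == 0
}

# A position is dropped if it is an edge star in any of the vectors
drop <- Reduce(`|`, lapply(a, is_edge_star))

res <- lapply(a, function(x) x[!drop])
print(res)
```

This relies on all vectors having the same length, as stated in the question.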
First off, as Richard Scriven pointed out in his comment on your question, your output is not the same as what you asked for. You ask for removal of leading and trailing characters, but your ideal output is just the 3rd and 4th elements of the character vectors.
This would be easily achievable by something like
a <- list('**aaa', 'bb*bb', 'cccc*')
alist = sapply(a, strsplit, '')
lapply(alist, function(x) x[3:4])
Now for an answer as you asked it:
IMHO, sapply() isn't necessary here.
You need a function of the grep family to operate directly on your characters, which all share a help page in R opened by ?grep.
I would propose gsub() and a bit of Regular Expressions for your problem:
a <- list('**aaa', 'bb*bb', 'cccc*')
b <- gsub(pattern = "^(\\*)*", x = a, replacement = "")
trimmed <- gsub(pattern = "(\\*)*$", x = b, replacement = "")
> trimmed
[1] "aaa" "bb*bb" "cccc"
This should also be doable with a single regex, but I didn't get that to work.
If you are familiar with the magrittr package and its excellent pipe operator, you can do this more elegantly:
library(magrittr)
gsub(pattern = "^(\\*)*", x = a, replacement = "") %>%
gsub(pattern = "(\\*)*$", x = ., replacement = "")
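The two gsub() calls can probably be folded into one pattern by using alternation (|) between the "start of string" and "end of string" cases, with no backreference involved. A sketch, using a character vector instead of a list to avoid implicit coercion:

```r
a <- c("**aaa", "bb*bb", "cccc*")

# "^\\*+" matches a run of stars at the start, "\\*+$" a run at the end;
# gsub() removes whichever of the two it finds, leaving inner stars alone
trimmed <- gsub(pattern = "^\\*+|\\*+$", replacement = "", x = a)

print(trimmed)
```

Note that, like the two-step version, this trims each string on its own and does not keep the strings the same length.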

Selecting and matching multiple vectors in a list in R

I have a list of vectors like this:
> list
[[1]]
[1] "a" "m" "l" "s" "t" "o"
[[2]]
[1] "m" "y" "o" "t" "e"
[[3]]
[1] "n" "a" "s"
[[4]]
[1] "b" "u" "z" "u" "l" "a"
[[5]]
[1] "c" "m" "u" "s" "r" "i" "x" "t"
1- First, I want to select the vector in the list with the highest number of elements (in this case the 5th vector, with 8 elements). This is easy.
2- Second, I want to select all vectors in the list with length equal to or immediately below the previous one, and intersect them with the previous vector.
Another possibility I have is selecting by the first character. In this case that would be equivalent to selecting the vectors starting with "a" or "b", the first and fourth in the list. What I do not know is how to select multiple vectors in a list knowing their first element.
3- Finally, I want to keep just the intersection with the minimum number of matches.
In this case that is the fourth vector in the list, starting with "b". Then the process starts again for the rest of the vectors, but already considering the 4th and 5th vectors when "intersecting". In this case that would mean picking up the second element and intersecting it with a "unique() combination" of the 4th and 5th.
I hope I have explained myself! Is there a way to do this in R without 3-4 "for" and "if" loops? In other words, is there a clever way to do it using lapply or similar?
This should do it?
# one character vector per word (named vecs to avoid masking base::list)
vecs <- strsplit(c("amlsto", "myote", "nas", "buzula", "cmusrixt"), "")
# find the longest vector
lens <- lengths(vecs)
which.max(lens)
# which are the same length as, or 1 shorter than, the previous vector
prev <- c(-1, head(lens, -1))
inds <- which(lens == prev | lens == prev - 1)
# get the intersections
inters <- mapply(intersect, vecs[inds], vecs[inds - 1], SIMPLIFY = FALSE)
# get items whose first element is in the target set
target <- c("a", "b")
isTarget <- sapply(vecs, "[[", 1) %in% target
# minimum number of overlaps
which.min(lengths(inters))
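For the "select by first element" part specifically, a logical vector like isTarget can be used directly to subset the list. A minimal sketch, rebuilding the data so it runs on its own (vecs and selected are names chosen for this example):

```r
vecs <- strsplit(c("amlsto", "myote", "nas", "buzula", "cmusrixt"), "")
target <- c("a", "b")

# first element of every vector, as a plain character vector
first_chars <- vapply(vecs, `[`, character(1), 1)

# keep only the vectors whose first element is in the target set
selected <- vecs[first_chars %in% target]

print(lengths(selected))
```

Here that picks the first and fourth vectors, as described in the question.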

Resources