Reorder a string in R using splitstring - r

I can't figure out what I'm doing wrong. I'm trying to reorder a string, and the easiest way that I could think of doing so was by removing elements and then putting them back in using paste. But I can't figure out how to remove elements. Here's a string:
x <- "the.cow.goes.moo"
When I split it with
x <- strsplit(x, '[.]')
I get the list "the" "cow" "goes" "moo".
But when I try to remove the second element using either
x <- x[-2]
or
x <- x[x != "cow"]
I get the exact same list back. However, when I declare x as
x <- list("the", "cow", "goes", "moo")
then
x <- x[-2]
works!
What's different? What am I doing wrong? Also, is there an easier way to reorder a string?
EDIT: I just realized that what I need is "moo.goes.the.cow", but I need to repeat this same change for a number of other strings. So I need to reorder the elements, and can't actually delete them. How can I do that?

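strsplit returns a list. Because x holds a single string, the result is a list of length one whose only element is the character vector of individual pieces. x[-2] therefore tries to drop a second list element that does not exist, so nothing changes. Lists can be painful to subset in this fashion, but it's good to get your head around it early.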
In your example, it would be:
x[[1]][-2]
For your update, you can reorder like so:
x[[1]][c(2,1,3,4)] # or whatever order you want.
x[[1]][sample(length(x[[1]]))] # or even in random order
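For the edit in the question (turning "the.cow.goes.moo" into "moo.goes.the.cow" for many strings), the pieces can be reordered and pasted back together. This is only a minimal sketch: reorder_words is a hypothetical helper name, and it assumes every input string has the same number of dot-separated pieces.

```r
# Hypothetical helper: split each string on ".", reorder the pieces, rejoin them.
# Assumes every string has at least max(ord) pieces.
reorder_words <- function(strings, ord) {
  pieces <- strsplit(strings, ".", fixed = TRUE)  # list: one character vector per string
  vapply(pieces, function(p) paste(p[ord], collapse = "."), character(1))
}

reordered <- reorder_words(c("the.cow.goes.moo", "a.dog.says.woof"), c(4, 3, 1, 2))
print(reordered)
# [1] "moo.goes.the.cow" "woof.says.a.dog"
```

Using fixed = TRUE treats the dot literally, which avoids the '[.]' character class in the pattern.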

Alternatively, flatten the list into a character vector with unlist before subsetting:
x<-unlist(x)
x <- x[-2]

Related

Stringr str_which first compare 1st row with whole column then to next row

I am trying to match DNA sequences in a column. For each sequence, I want to find the rows that contain a longer version of it; the same column also contains the sequence itself.
I am trying to use str_which, which I know works: if I put the search pattern in manually, it finds the rows that contain the sequence.
As a preview of the data I have:
SNID type seqs2
9584818 seqs TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA
9584818 reversed TTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGA
9562505 seqs GTCTTCAGCATCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAACTTTGTGAAT
9562505 reversed ATTCACAAAGTTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGATGCTGAAGAC
Using a simple search of row one as x
x <- "TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA"
str_which(df$seqs2, x)
I get the answer I expect:
> str_which(df$seqs2, x)
[1] 1 3
But when I pass the whole column as the pattern, each row only finds itself, not the other rows that also contain it.
> str_which(df$seqs2, df$seqs2)
[1] 1 2 3 4
Since my data set is quite large, I do not want to do this manually; I would rather use the column as input than define x by hand each time.
Does anybody have an idea how to solve this? I have tried most stringr commands by now, but I may have used them incorrectly or skipped some important ones.
Thanks in advance
You may need lapply :
lapply(df$seqs2, function(x) stringr::str_which(df$seqs2, x))
You can also use grep to keep this in base R :
lapply(df$seqs2, function(x) grep(x, df$seqs2))
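As far as I understand, str_which(df$seqs2, df$seqs2) pairs string i with pattern i, which is why every row only finds itself; looping over the patterns compares each one against the whole column instead. Below is a small base R sketch of that idea on made-up toy sequences (seqs, hits, and others are illustrative names, not from the question). It uses fixed = TRUE on the assumption that the sequences should be matched literally.

```r
# Toy stand-in for df$seqs2: entries 3 and 4 are longer versions of 1 and 2.
seqs <- c("ACGT", "TTGA", "GGACGTCC", "CCTTGAAA")

# For each sequence, the indices of all rows that contain it (including itself).
hits <- lapply(seqs, function(s) grep(s, seqs, fixed = TRUE))

# Drop the trivial self-match to keep only the *other* rows.
others <- lapply(seq_along(seqs), function(i) setdiff(hits[[i]], i))

print(hits[[1]])    # [1] 1 3
print(others[[1]])  # [1] 3
```

On a large column this is quadratic in the number of rows, so it may be slow; that trade-off is inherent in comparing every sequence with every other one.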

concatenate each element of character vector with all elements of a second vector r

I am trying to concatenate two character vectors in a way that the following output is produced
aggmodes<-c("17x8","17x7x8","17x28x8")
listdata<-c("Motion.Age","res.Context.Only")
The output should be like this
"Motion.Age,17x8"
"Motion.Age,17x7x8"
"Motion.Age,17x28x8"
"res.Context.Only,17x8"
"res.Context.Only,17x7x8"
"res.Context.Only,17x28x8"
I have written following code:
c<-as.vector(sapply(1:length(listdata), function(i){
sapply(1:length(aggmodes),function(j){paste(aggmodes,listdata)})
}))
but it gives me a 10-dimensional vector. I am sorry if this is a duplicate, but I could not find a correct answer to my problem.
c(sapply(listdata,paste,aggmodes,sep=","))
# [1] "Motion.Age,17x8" "Motion.Age,17x7x8" "Motion.Age,17x28x8"
# [4] "res.Context.Only,17x8" "res.Context.Only,17x7x8" "res.Context.Only,17x28x8"
We paste each element of listdata to all of aggmodes with sapply, and then unwrap it all.
Your code is suboptimal because it doesn't leverage the fact that paste is vectorized. However, it can work with a slight modification: index both vectors, put listdata first, and set the separator to a comma:
as.vector(sapply(1:length(listdata), function(i){
  sapply(1:length(aggmodes), function(j){paste(listdata[i], aggmodes[j], sep = ",")})
}))
as.character(outer(listdata, aggmodes, paste, sep = ","))
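outer takes three main arguments: X, Y, and FUN. It applies FUN to every combination of an element of X with an element of Y, in this case pasting them together. Since outer returns a matrix, wrap it in as.character to get a vector. Note that the matrix is flattened column by column, so the results come out interleaved (all the 17x8 entries first); apply t() to the matrix before as.character if you need them grouped by listdata as in the question.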
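To make the ordering concrete, here is a short sketch comparing two ways of getting the output in the order the question asks for. The variable names by_rep, by_outer, and interleaved are only illustrative.

```r
aggmodes <- c("17x8", "17x7x8", "17x28x8")
listdata <- c("Motion.Age", "res.Context.Only")

# Option 1: repeat each listdata entry once per aggmode; paste recycles aggmodes.
by_rep <- paste(rep(listdata, each = length(aggmodes)), aggmodes, sep = ",")

# Option 2: outer() builds a 2 x 3 matrix; transposing before flattening
# groups the results by listdata instead of by aggmodes.
by_outer <- as.character(t(outer(listdata, aggmodes, paste, sep = ",")))

# Without t(), the column-by-column flattening interleaves the two listdata entries.
interleaved <- as.character(outer(listdata, aggmodes, paste, sep = ","))

print(by_rep)
print(identical(by_rep, by_outer))  # [1] TRUE
print(interleaved[1:2])
# [1] "Motion.Age,17x8"       "res.Context.Only,17x8"
```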

Best approach to remove all parts of data column that don't match any part of a list?

I've got a dataframe with a column of song titles, label info and other messy string data. I also have an isolated vector of specific song titles. I'd like to strip everything from that column that isn't one of the matched song titles. I'm using something like this, but it throws errors.
song.list <- c("Song.1","Song.2", "Song.3")
# Mydata$Songs is my data column containing all sorts of things, including the songs I'm after
levels(Mydata$Songs)[(Mydata$Songs) %in% song.list] <- "" #I'd like the opposite of this
levels(Mydata$Songs)![(Mydata$Songs) %in% song.list] <- ""#My use of '!' doesn't work
I know that using the above indexing without the ! will work to replace my song list with blank space, but I'm trying to replace everything else with a blank space. I've got about 29 songs in my list and about 1000 rows of messy string data in a single column. I've also tried gsub and grep to no avail.
I haven't been able to come up with a vectorized solution, but if I understood you correctly this loop over the factor levels should do the job:
library(stringr)
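for (level in levels(df$A)) {
  # str_extract is vectorised over the pattern, so this tries every song title
  match <- na.omit(str_extract(level, song.list))
  if (length(match) > 0) {
    # keep the first hit in case several titles match the same level
    levels(df$A)[levels(df$A) == level] <- match[1]
  }
}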
Original answer which didn't do what the OP intended
I'm not sure I fully understand what you're trying to do, but I think this is what you're after. This doesn't remove the rows, though!
levels(Mydata$Songs)[!levels(Mydata$Songs) %in% song.list] <- ""
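A possible base R alternative is to loop over the (short) song list instead of the factor levels and work on plain character data. This is a sketch under assumptions: messy is a made-up stand-in for as.character(Mydata$Songs), titles are matched literally with fixed = TRUE, and the first matching title wins when several match.

```r
song.list <- c("Song.1", "Song.2", "Song.3")

# Made-up stand-in for as.character(Mydata$Songs).
messy <- c("Label X - Song.1 (remix)", "B-side, no match", "Song.3 live 1999")

# Start with blanks, then fill in the title wherever it occurs in the messy string.
cleaned <- rep("", length(messy))
for (song in song.list) {
  found <- grepl(song, messy, fixed = TRUE)  # literal match, "." is not a wildcard
  cleaned[found & cleaned == ""] <- song     # keep the first title that matched
}

print(cleaned)
# [1] "Song.1" ""       "Song.3"
```

Because the match is a plain substring test, a title such as "Song.1" would also be found inside "Song.10"; if that matters, the titles would need word-boundary handling.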

Convert a list of strings to call already set values?

Is it possible to convert a list of strings so that it will return the value it's named after?
For example, I have this list of strings that I made with paste:
mylist <- c("nhdata$Credit", "nhdata$Honey", "nhdata$Plants")
mylist
The list I'm working with is a lot bigger (about 35). So is it possible to print these strings in a way that it will actually call the value they are named after?
I'd appreciate any help; this is my first question on Stack Overflow.
You can use the get function:
temp <- 1:10
get("temp")
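get looks up an object by its name, so it cannot evaluate a string like "nhdata$Credit", which is an expression rather than a name. In your example, you may do better to store only the column names and index the data frame with them: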
mylist <- c("Credit", "Honey", "Plants")
nhdata[, mylist[1]]
or similarly,
nhdata[[mylist[1]]]
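If the strings have already been built with the "nhdata$" prefix, the prefix can be stripped to recover plain column names. The sketch below uses a made-up nhdata with invented values purely for illustration; cols and values are hypothetical names.

```r
# Made-up data frame standing in for the real nhdata.
nhdata <- data.frame(Credit = c(1, 2), Honey = c(3, 4), Plants = c(5, 6))
mylist <- c("nhdata$Credit", "nhdata$Honey", "nhdata$Plants")

# Strip the literal "nhdata$" prefix; fixed = TRUE stops "$" acting as a regex anchor.
cols <- sub("nhdata$", "", mylist, fixed = TRUE)

# Pull each column out by name; the result is a named list of vectors.
values <- lapply(cols, function(nm) nhdata[[nm]])
names(values) <- cols

print(cols)          # [1] "Credit" "Honey"  "Plants"
print(values$Honey)  # [1] 3 4
```

nhdata[cols] would return the same columns as a data frame rather than a list.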

Processing files in a particular order in R

I have several datafiles, which I need to process in a particular order. The pattern of the names of the files is, e.g. "Ad_10170_75_79.txt".
Currently they are sorted according to the first numbers (which differ in length), see below:
f <- as.matrix (list.files())
f
[1] "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_1049_25_79.txt" "Ad_10531_77_79.txt"
But I need them to be sorted by the middle number, like this:
> f
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
As I just need the middle number of the filename, I thought the easiest way would be to get rid of the rest of the name and rename all the files. For this I tried using strsplit.
f2 <- strsplit (f,"_79.txt")
But I'm sure there is a way to sort the files directly, without renaming all files. I tried using sort and to describe the name with regex but without success. This has been a problem for many days, and I spent several hours searching and trying, to solve this presumably easy task. Any help is very much appreciated.
old example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_25_79.txt", "Ad_10531_77_79.txt")
Thank you for your answers. I think I have to modify my example, because the solution should work for all possible middle numbers, regardless of how many digits they have.
new example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_9_79.txt", "Ad_10531_77_79.txt")
Here's a regex approach.
f[order(as.numeric(gsub('Ad_\\d+_(\\d+)_\\d+\\.txt', '\\1', f)))]
# [1] "Ad_1049_9_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
Try this:
f[order(as.numeric(unlist(lapply(strsplit(f, "_"), "[[", 3))))]
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
First we split by _, then select the third element of every list element, find the order and subset f based on that order.
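The same steps can be spelled out one at a time, here on the new example dataset with the single-digit middle number. This is just an illustrative sketch; parts, middle, and sorted_files are made-up names.

```r
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
       "Ad_1049_9_79.txt", "Ad_10531_77_79.txt")

# Split each name on "_": pieces are "Ad", first number, middle number, "79.txt".
parts <- strsplit(f, "_", fixed = TRUE)

# Take the third piece of every name and convert it so sorting is numeric, not textual.
middle <- as.numeric(vapply(parts, `[`, character(1), 3))

# order() gives the permutation that sorts the middle numbers; apply it to f.
sorted_files <- f[order(middle)]

print(sorted_files)
# [1] "Ad_1049_9_79.txt"   "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
```

Converting with as.numeric matters here: sorted as text, "9" would come after "77".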
I would create a small dataframe containing filenames and their respective extracted indices:
f<- c("Ad_10170_75_79.txt","Ad_10345_76_79.txt","Ad_1049_25_79.txt","Ad_10531_77_79.txt")
f2 <- strsplit (f,"_79.txt")
mydb <- as.data.frame(cbind(f,substr(f2,start=nchar(f2)-1,nchar(f2))))
names(mydb) <- c("filename","index")
library(plyr)
arrange(mydb,index)
Take the first column of this as your filename vector.
ADDENDUM:
If a numeric index is required, convert the column to numeric before sorting (going through as.character in case it was created as a factor):
mydb$index <- as.numeric(as.character(mydb$index))
