Least occurring element in a vector in R

If I have a vector
vec = c('a','a','a','b','b','c','c','c','c','c')
Is there a simple way to find the least occurring element in vec? Thanks!
Edit: is there a simple way to do it with characters?

This should work, even if more than one of the elements is tied as the least frequent item:
vec = c(1,1,1,2,2,3,3,3,3,3)
f <- table(vec)
as.numeric(names(f[f == min(f)]))
# [1] 2
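For the character vector from the question, the same idea should work if you simply drop the as.numeric() call, since the names of a table are already character strings. A minimal sketch (the names least, tied and least_tied are only illustrative):

```r
# Character version: the names of a table are already strings
vec <- c('a','a','a','b','b','c','c','c','c','c')
f <- table(vec)
least <- names(f[f == min(f)])
least
# [1] "b"

# Ties are all returned: 'x' and 'y' both occur twice
tied <- c('x','x','y','y','z','z','z')
g <- table(tied)
least_tied <- names(g[g == min(g)])
least_tied
# [1] "x" "y"
```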

table(vec)[which.min(table(vec))]
(This is in all likelihood a duplicate, although I have searched. I found what seems to be a similar question on the max side: "Create a variable capturing the most frequent occurence by group". Maybe it sounds familiar because I posted an answer to that one?)


difference between <- and = in R with an example [duplicate]

This question already has answers here:
What are the differences between "=" and "<-" assignment operators?
I was wondering if there is a technical difference between the assignment operators "=" and "<-" in R. So, does it make any difference if I use:
Example 1: a = 1 or a <- 1
Example 2: a = c(1:20) or a <- c(1:20)
Thanks for your help
Sven
Yes there is. This is what the help page of '=' says:
The operators <- and = assign into the environment in which they are evaluated. The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions.
By "can be used", the help file means used to assign an object. Inside a function call you can't assign an object with =, because there = binds a value to an argument.
Basically, if you use <- then you assign a variable that you will be able to use in your current environment. For example, consider:
matrix(1,nrow=2)
This just makes a 2 row matrix. Now consider:
matrix(1,nrow<-2)
This also gives you a two-row matrix, but now we also have an object called nrow which evaluates to 2! What happened is that in the second call we didn't set the argument nrow to 2; we assigned 2 to an object called nrow and sent that value to the second argument of matrix, which happens to be nrow.
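To see the side effect without cluttering your workspace, here is a small sketch that runs both calls inside a throwaway function and checks whether a local variable called nrow exists before and after. The function name demo_assign and the fields of its result are made up for this illustration:

```r
demo_assign <- function() {
  here <- environment()                 # this function's own environment
  m1 <- matrix(1, nrow = 2)             # = names the argument, creates no variable
  before <- exists("nrow", envir = here, inherits = FALSE)
  m2 <- matrix(1, nrow <- 2)            # <- creates a local variable nrow,
                                        # and 2 is passed as the second argument
  after <- exists("nrow", envir = here, inherits = FALSE)
  list(before = before, after = after,
       value = nrow, same = identical(m1, m2))
}
res <- demo_assign()
res$before  # FALSE: no local nrow after the first call
res$after   # TRUE: the second call created one
res$same    # TRUE: both matrices are identical
```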
Edit:
As for the edited question: both examples are the same. The use of = or <- can cause a lot of discussion as to which one is best. Many style guides advocate <-, and I agree with that, but do keep spaces around <- assignments or they can become quite hard to interpret. If you don't use spaces (you should, except on Twitter), I prefer =. And never use ->!
But really it doesn't matter what you use as long as you are consistent in your choice. Using = on one line and <- on the next results in very ugly code.

stringr str_which: compare the 1st row with the whole column, then move on to the next row

I am trying to match DNA sequences in a column. For each sequence I want to find the rows that contain a longer version of it; the column also contains the sequence itself.
I am trying to use str_which, which I know works, since if I put in the search pattern manually it finds the rows that include the sequence.
As a preview of the data I have:
SNID type seqs2
9584818 seqs TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA
9584818 reversed TTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGA
9562505 seqs GTCTTCAGCATCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAACTTTGTGAAT
9562505 reversed ATTCACAAAGTTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGATGCTGAAGAC
Using a simple search of row one as x
x <- "TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA"
str_which(df$seqs2, x)
I get the answer I expect:
> str_which(df$seqs2, x)
[1] 1 3
But when I use the whole column as the search pattern, each row only finds itself, and not the other rows that also contain it.
> str_which(df$seqs2, df$seqs2)
[1] 1 2 3 4
Since my data set is quite large, I do not want to do this manually; I would rather use the column as input instead of defining "x" first.
Does anybody have an idea how to solve this? I have tried most stringr commands by now, but I might have used them wrongly or skipped some important ones.
Thanks in advance
You may need lapply:
lapply(df$seqs2, function(x) stringr::str_which(df$seqs2, x))
You can also use grep to keep this in base R:
lapply(df$seqs2, function(x) grep(x, df$seqs2))
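The reason str_which(df$seqs2, df$seqs2) returns every row is that stringr is vectorised over both string and pattern, so row 1 is compared with pattern 1, row 2 with pattern 2, and so on. Here is a rough base R sketch of the lapply idea that also drops the self-match. The short toy sequences stand in for the real column, and fixed = TRUE is my own addition (it assumes you want a literal substring match rather than a regex):

```r
# Toy data standing in for df$seqs2 (hypothetical sequences)
seqs <- c("CCAAGG", "CCTTGG", "ATCCAAGGTA", "TACCTTGGAT")

# For each sequence, find the rows containing it, excluding the row itself
hits <- lapply(seq_along(seqs), function(i) {
  idx <- grep(seqs[i], seqs, fixed = TRUE)  # literal substring search
  setdiff(idx, i)                           # remove the self-match
})
hits[[1]]
# [1] 3
hits[[2]]
# [1] 4
```

Rows 3 and 4 are not contained in any other row, so their entries are empty integer vectors.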

Unexpected outcome, not replacing, in R out of a gsub function

As the output of a certain operation, I have the following dataframe with 729 observations.
> head(con)
Connections
1 r_con[C3-C3,Intercept]
2 r_con[C3-C4,Intercept]
3 r_con[C3-CP1,Intercept]
4 r_con[C3-CP2,Intercept]
5 r_con[C3-CP5,Intercept]
6 r_con[C3-CP6,Intercept]
As can be seen, the pattern to be removed is everything except the electrode pair; in the first observation, for instance, that would leave C3-C3. Below is my attempt, which I expected to return the dataframe with everything else removed. If I'm not wrong (which I probably am), the regex syntax is OK, and from my understanding fixed=TRUE is also necessary. However, I do not understand the R output: where I would expect the pattern to be replaced by nothing (""), it returns this, which doesn't make sense to me.
> gsub("r_con\\[\\,Intercept\\]\\","",con,fixed=TRUE)
[1] "3:731"
I believe this will probably be a silly question for an expert programmer, which I am far from being, and any insight would be much appreciated.
[UPDATE WITH SOLUTION]
Thanks to Tim and Ben I realised I was using the wrong regex syntax and the wrong source. This did it for me:
con2 <- sub("^r_con\\[([^,]+),Intercept\\]", "\\1", con$Connections)
I think your problem is that you're accessing "con" in your sub call. Also, as the user above me pointed out, you probably don't want to use sub.
I'm assuming that your data is consistent, i.e., the strings in con$Connections all follow more or less the same pattern. Then this works, using the following example:
con <- data.frame(Connections = c("r_con[C3-C3,Intercept]", "r_con[C3-CP1,Intercept]"))
library(stringr)
f <- function(x){
  # keep the part before the comma, e.g. "r_con[C3-C3"
  part <- str_split(x, ",")[[1]][1]
  # drop the six-character prefix "r_con["
  str_sub(part, 7, -1)
}
f(con$Connections[1])
sapply(con$Connections, f)
The sub function doesn't work this way. One viable approach would be to capture the quantity you want, then use this capture group as the replacement:
x <- "r_con[C3-C3,Intercept]"
term <- sub("^r_con\\[([^,]+),Intercept\\]", "\\1", x)
term
[1] "C3-C3"
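Since sub is vectorised, the same call can be applied to the whole column at once, which is essentially what the asker's update does. As far as I can tell, the original attempt failed for two reasons: fixed = TRUE makes the pattern a literal string (so the backslashes are not escapes), and the whole data frame con was passed instead of the column con$Connections. A small sketch, where the column name Pair and the trailing $ anchor are my own additions:

```r
con <- data.frame(
  Connections = c("r_con[C3-C3,Intercept]", "r_con[C3-CP1,Intercept]"),
  stringsAsFactors = FALSE
)

# Capture everything between "[" and the first comma, keep only that group
con$Pair <- sub("^r_con\\[([^,]+),Intercept\\]$", "\\1", con$Connections)
con$Pair
# [1] "C3-C3"  "C3-CP1"
```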

How to grep for all-but-one matching columns in R

I am trying to subset a large data frame to my columns of interest. I do so using the grep function, but this selects one column too many ("has_socio"), which I would like to remove.
The following code does exactly what I want, but I find it unpleasant to look at. I want to do it in one line. Aside from just calling the first subset inside the second subset, can it be optimized?
library(foreign)  # provides read.dta
DF <- read.dta("./big.dta")
DF0 <- na.omit(subset(DF, select=c(other_named_vars, grep("has_",names(DF)))))
DF0 <- na.omit(subset(DF0, select=-c(has_socio)))
I know similar questions have been asked (e.g. Subsetting a dataframe in R by multiple conditions) but I do not find one that addresses this issue specifically. I recognize I could just write the grep RE more carefully, but I feel the above code more clearly expresses my intent.
Thanks.
Replace your grep with:
vec <- c("blah", "has_bacon", "has_ham", "has_socio")
grep("^has_(?!socio$)", vec, value=TRUE, perl=TRUE)
# [1] "has_bacon" "has_ham"
(?!...) is a negative lookahead: it checks that its contents do not immediately follow the piece matched before it (here, has_).
setdiff(grep("has_", vec, value = TRUE), "has_socio")
## [1] "has_bacon" "has_ham"
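Either approach can be plugged straight into the column selection, so the subsetting happens in one step. A minimal sketch on a made-up data frame (the columns id, has_bacon, has_ham and the values are invented; id plays the role of other_named_vars from the question):

```r
# Hypothetical stand-in for the data read from big.dta
DF <- data.frame(
  id        = 1:3,
  has_bacon = c(1, 0, 1),
  has_ham   = c(0, NA, 1),
  has_socio = c(1, 1, 1)
)

# Named columns plus every has_ column except has_socio
keep <- c("id", grep("^has_(?!socio$)", names(DF), value = TRUE, perl = TRUE))
DF0 <- na.omit(DF[, keep])
names(DF0)
# [1] "id"        "has_bacon" "has_ham"
```

The row with the NA in has_ham is dropped by na.omit, so DF0 has two rows here.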

Processing files in a particular order in R

I have several datafiles, which I need to process in a particular order. The pattern of the names of the files is, e.g. "Ad_10170_75_79.txt".
Currently they are sorted according to the first numbers (which differ in length), see below:
f <- as.matrix (list.files())
f
[1] "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_1049_25_79.txt" "Ad_10531_77_79.txt"
But I need them to be sorted by the middle number, like this:
> f
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
As I just need the middle number of the filename, I thought the easiest way would be to get rid of the rest of the name and rename all files. For this I tried using strsplit.
f2 <- strsplit (f,"_79.txt")
But I'm sure there is a way to sort the files directly, without renaming all files. I tried using sort and to describe the name with regex but without success. This has been a problem for many days, and I spent several hours searching and trying, to solve this presumably easy task. Any help is very much appreciated.
old example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_25_79.txt", "Ad_10531_77_79.txt")
Thank you for your answers. I think I have to modify my example, because the solution should work for all possible middle numbers, regardless of how many digits they have.
new example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_9_79.txt", "Ad_10531_77_79.txt")
Here's a regex approach.
f[order(as.numeric(gsub('Ad_\\d+_(\\d+)_\\d+\\.txt', '\\1', f)))]
# [1] "Ad_1049_9_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
Try this:
f[order(as.numeric(unlist(lapply(strsplit(f, "_"), "[[", 3))))]
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
First we split by _, then select the third element of every list element, find the order and subset f based on that order.
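The split-and-order steps also handle the new example dataset, where the middle number has a single digit, because the comparison is numeric. A sketch using vapply instead of unlist(lapply(...)); the names middle and f_sorted are just for illustration:

```r
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
       "Ad_1049_9_79.txt", "Ad_10531_77_79.txt")

# Split on "_", take the third piece of each name, convert to numeric
middle <- as.numeric(vapply(strsplit(f, "_", fixed = TRUE), `[[`, character(1), 3))

# Reorder the file names by that number
f_sorted <- f[order(middle)]
f_sorted
# [1] "Ad_1049_9_79.txt"   "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
```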
I would create a small dataframe containing filenames and their respective extracted indices:
f<- c("Ad_10170_75_79.txt","Ad_10345_76_79.txt","Ad_1049_25_79.txt","Ad_10531_77_79.txt")
f2 <- strsplit (f,"_79.txt")
mydb <- as.data.frame(cbind(f,substr(f2,start=nchar(f2)-1,nchar(f2))))
names(mydb) <- c("filename","index")
library(plyr)
arrange(mydb,index)
Take the first column of this as your filename vector.
ADDENDUM:
If a numeric index is required, convert the column to numeric before sorting (going through as.character in case the column is a factor):
mydb$index <- as.numeric(as.character(mydb$index))
