subset function in R [duplicate] - r

Suppose I have the following vector:
x <- c(8, 6, 9, 9, 7, 3, 2, 5, 5, 1, 6, 8, 5, 2, 9, 3, 5, 10, 8, 2)
How can I find which elements are either 8 or 9?

This is one way to do it. First, get the indices at which x is either 8 or 9. Then verify that x is indeed 8 or 9 at those indices.
> inds <- which(x %in% c(8,9))
> inds
[1] 1 3 4 12 15 19
> x[inds]
[1] 8 9 9 8 9 8

You could try the | operator for short conditions
which(x == 8 | x == 9)

In this specific case you could also use grep:
# option 1
grep('[89]',x)
# option 2
grep('8|9',x)
which both give:
[1] 1 3 4 12 15 19
When you also want to detect numbers with more than one digit, the second option is preferred, because a character class such as [89] only ever matches a single character:
> grep('10|8',x)
[1] 1 12 18 19
However, I put the emphasis on "this specific case" at the start of my answer for a reason. As @DavidArenburg mentioned, grep can lead to unintended results because it matches substrings. For example, grep('1|8',x) will detect both 1 and 10:
> grep('1|8',x)
[1] 1 10 12 18 19
To avoid that side effect, you have to wrap the numbers to be detected in word boundaries:
> grep('\\b1\\b|8',x)
[1] 1 10 12 19
Now the value 10 (at position 18) is no longer matched.
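A minimal sketch of another way to rule out partial matches, assuming the targets are plain whole numbers: anchor the whole alternation with ^ and $ so that each element has to match in full. The names targets, pattern, idx_grep and idx_in are illustrative and not part of the answer above.

```r
x <- c(8, 6, 9, 9, 7, 3, 2, 5, 5, 1, 6, 8, 5, 2, 9, 3, 5, 10, 8, 2)
targets <- c(1, 8)

# Build "^(1|8)$": every element must equal one of the targets in full
pattern <- paste0("^(", paste(targets, collapse = "|"), ")$")

idx_grep <- grep(pattern, x)       # regex route; x is coerced to character
idx_in   <- which(x %in% targets)  # value-matching route, no regex involved

identical(idx_grep, idx_in)
```

Both routes should agree here; for purely numeric data, %in% is usually the simpler choice because no pattern has to be built at all.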

Here is a generalized solution to find the locations of all target values (only works for vectors and 1-dimensional arrays).
locate <- function(x, targets) {
  results <- lapply(targets, function(target) which(x == target))
  names(results) <- targets
  results
}
This function returns a list because each target may have any number of matches, including zero. The list is sorted (and named) in the original order of the targets.
Here is an example in use:
sequence <- c(1:10, 1:10)
locate(sequence, c(2,9))
$`2`
[1] 2 12
$`9`
[1] 9 19
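To illustrate the two properties mentioned above (target order is preserved, and a target with no matches still gets an entry), here is a small sketch that reuses the locate function. The target 11 is an assumption chosen because it does not occur in the sequence.

```r
locate <- function(x, targets) {
  results <- lapply(targets, function(target) which(x == target))
  names(results) <- targets
  results
}

# Targets given out of order, plus one value (11) that never occurs
res <- locate(c(1:10, 1:10), c(9, 2, 11))

names(res)    # names follow the order of the targets
lengths(res)  # number of matches per target; zero for the absent target
res[["11"]]   # an empty integer vector rather than an error
```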

Alternatively, if you do not need to use the indices but just the elements you can do
> x <- sample(1:10,20,replace=TRUE)
> x
[1] 6 4 7 2 9 3 3 5 4 7 2 1 4 9 1 6 10 4 3 10
> x[8<=x & x<=9]
[1] 9 9

If you want to find the answer using loops, then the following script will do the job:
> req_nos <- c(8, 9)
> pos <- list()
> for (i in seq_along(req_nos)) {
+   pos[[i]] <- which(x == req_nos[i])
+ }
The output will look like this:
> pos
[[1]]
[1] 1 12 19
[[2]]
[1] 3 4 15
Here, pos[[1]] contains the positions of 8 and pos[[2]] contains the positions of 9. If you use the %in% method and change the order of the targets, i.e. c(9,8) instead of c(8,9), the output is the same in both cases, so you cannot tell which target matched at which position. This method avoids that problem.
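The same grouping can be sketched without an explicit loop by using match(), which reports which target each element of x matched. This is only one possible approach; the names hit and pos_by_target are illustrative. Note that, unlike the loop, targets with no matches are dropped from the result.

```r
x <- c(8, 6, 9, 9, 7, 3, 2, 5, 5, 1, 6, 8, 5, 2, 9, 3, 5, 10, 8, 2)
req_nos <- c(9, 8)

# For each element of x: index of the matching target, or NA if none matches
hit <- match(x, req_nos)
matched <- !is.na(hit)

# Group the matching positions by the target they matched
pos_by_target <- split(which(matched), hit[matched])

# Relabel the groups with the target values instead of their indices
names(pos_by_target) <- req_nos[as.integer(names(pos_by_target))]

pos_by_target
```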

grepl may be a useful function. Note that grepl is available in R 2.9.0 and later. What's handy about grepl is that it returns a logical vector of the same length as x. (The output below comes from a different x than the one in the question.)
grepl(8, x)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
grepl(9, x)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[13] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
To arrive at your answer, you could do the following
grepl(8,x) | grepl(9,x)

Related

How to use '%in%' operator in R?

I have been using the %in% operator ever since I learned about it.
However, I still don't fully understand how it works. I thought I did, but I am always unsure about the order of the operands.
Here you have an example:
This is my dataframe:
df <- data.frame("col1"=c(1,2,3,4,30,21,320,123,4351,1234,3,0,43), "col2"=rep("something",13))
This is how it looks:
> df
col1 col2
1 1 something
2 2 something
3 3 something
4 4 something
5 30 something
6 21 something
7 320 something
8 123 something
9 4351 something
10 1234 something
11 3 something
12 0 something
13 43 something
Let's say I have a numerical vector:
myvector <- c(30,43,12,333334,14,4351,0,5,55,66)
I want to check whether all (or some) of the numbers from my vector are in the previous dataframe. To do that, I always use %in%.
I thought of two approaches:
#common in both: 30, 4351, 0, 43
# are the numbers from df$col1 in my vector?
trial1 <- subset(df, df$col1 %in% myvector)
# are the numbers of the vector in df$col1?
trial2 <- subset(df, myvector %in% df$col1)
Both approaches make sense to me and they should give the same result. However, only the result from trial1 is okay.
> trial1
col1 col2
5 30 something
9 4351 something
12 0 something
13 43 something
What I don't understand is why the second way gives me some of the common numbers and some which are not in the vector:
> trial2
col1 col2
1 1 something
2 2 something
6 21 something
7 320 something
11 3 something
12 0 something
Could someone explain to me how the %in% operator works and why the second way gives me the wrong result?
Thanks very much in advance
Regards
An answer has already been given, but for a bit more detail, simply look at the result of %in%:
df$col1 %in% myvector
# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
This one is correct: the result has one element per row of df, so subsetting keeps the rows where it is TRUE, i.e. rows 5, 9, 12 and 13.
versus
myvector %in% df$col1
# [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
This one goes wrong: the result has only 10 elements, while df has 13 rows. The subset keeps rows 1, 2, 6 and 7, and then the logical vector is recycled for rows 11, 12 and 13 as TRUE, TRUE, FALSE, so rows 11 and 12 end up in your subset as well.
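As a rule of thumb, a %in% b always has the length of a, so the left-hand side should be the thing you are subsetting. A short sketch with the data from the question (the names keep_rows and found are illustrative):

```r
df <- data.frame(col1 = c(1, 2, 3, 4, 30, 21, 320, 123, 4351, 1234, 3, 0, 43),
                 col2 = rep("something", 13))
myvector <- c(30, 43, 12, 333334, 14, 4351, 0, 5, 55, 66)

# One logical value per row of df: safe to use for subsetting df
keep_rows <- df$col1 %in% myvector
length(keep_rows) == nrow(df)

# One logical value per element of myvector: use it to subset myvector
found <- myvector %in% df$col1
length(found) == length(myvector)

df[keep_rows, ]   # rows of df whose col1 occurs in myvector
myvector[found]   # elements of myvector that occur in df$col1
```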

Identifying which values are duplicates in R [duplicate]

I would like to identify which observations are duplicates based on the values within one variable, however I would like all of the observations which generate the duplicates to be identified rather than just the second time they appear. For example:
x <- c(1,2,3,4,5,7,5,7)
duplicated(x)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
Rather than identify the last two elements, I would like the last four elements to be identified as well as which element is matched (e.g. element 5 and 7, 6 and 8). Thanks.
You can use duplicated twice:
duplicated(x) | duplicated(x, fromLast = TRUE)
# [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
You could try a table
x <- c(1,2,3,4,5,7,5,7)
tab <- table(x) > 1
x[x %in% names(which(tab))]
# [1] 5 7 5 7
Another method, inspired by @rawr's comment, is
x %in% x[duplicated(x)]
# [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
x[ x %in% x[duplicated(x)] ]
# [1] 5 7 5 7
which(x %in% x[duplicated(x)])
# [1] 5 6 7 8
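The question also asks which elements match each other (5 with 7, 6 with 8). None of the answers above show that directly, so here is a hedged sketch of one way to get it: group the positions by value with split() and keep only the groups with more than one member. The names groups and dup_groups are illustrative.

```r
x <- c(1, 2, 3, 4, 5, 7, 5, 7)

# Positions of x grouped by value; list names are the values as strings
groups <- split(seq_along(x), x)

# Keep only values that occur more than once
dup_groups <- groups[lengths(groups) > 1]

dup_groups
```

Each list element is named after a duplicated value and holds the positions where that value occurs.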

Removing outliers of different lengths from different columns of a dataframe using R

I have a large dataframe. I want to remove the outliers from each column of my dataframe inferred from boxplots. Here is a reproducible example-
Make a dummy dataframe with 3 columns + few outliers
sample<-data.frame(a=c(444,2,3,4,-555), b=c(2,3,4,5,68), c=c(-100,8,9,10,11))
> sample
a b c
1 444 2 -100
2 2 3 8
3 3 4 9
4 4 5 10
5 -555 68 11
Define the outliers for each column
out<-lapply(1:length(sample), function(i) sort(boxplot.stats(sample[[i]])$out))
> out
[[1]]
[1] -555 444
[[2]]
[1] 68
[[3]]
[1] -100
Subset data by omitting the outliers
sample<-lapply(1:length(sample), function(i)
subset(sample[[i]], sample[[i]]!=out[[i]]))
Surprisingly it only works partially with warnings?!?
Warning message:
In sample[[i]] != out[[i]] :
longer object length is not a multiple of shorter object length
Data after subset looks like
> sample
[[1]]
[1] 444 2 3 4
[[2]]
[1] 2 3 4 5
[[3]]
[1] 8 9 10 11
For column 1, it removed only -555 and kept 444?? It worked nicely for columns 2 and 3. The warning message clearly states why it is happening. By removing one outlier from each group, it might be keeping similar lengths ...
My second approach is to make all outliers 'NA'
sample<-lapply(1:length(sample), function(i)
sample[[i]][sample[[i]]==out[[i]]]<-NA)
Doesn't work!! How can I solve this problem?
Try this:
> lapply(1:length(sample), function(i)
subset(sample[[i]], !sample[[i]]%in%out[[i]]) )
[[1]]
[1] 2 3 4
[[2]]
[1] 2 3 4 5
[[3]]
[1] 8 9 10 11
Note that sample[[i]] != out[[i]] doesn't work because != compares the two vectors element by element, recycling the shorter one, rather than testing membership. What you actually want to know is which elements of sample[[i]] are not in out[[i]], so you should use !sample[[i]] %in% out[[i]].
To further clarify, you can try this toy example:
> c(444,2,3,4,-555) == c(-555, 444)
[1] FALSE FALSE FALSE FALSE TRUE
Warning message:
In c(444, 2, 3, 4, -555) == c(-555, 444) :
longer object length is not a multiple of shorter object length
> c(444,2,3,4,-555) %in% c(-555, 444)
[1] TRUE FALSE FALSE FALSE TRUE
In the == example you get a TRUE at the end because of recycling. Internally, it is actually comparing these two vectors c(444,2,3,4,-555) == c(-555, 444, -555, 444, -555), and the last element is the same.
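The second approach from the question (replacing outliers with NA) can be sketched along the same lines. As far as I can tell, the original attempt fails for two reasons: it uses == instead of %in%, and the anonymous function returns the value of the assignment (NA) rather than the modified column. The names sample_df and cleaned are illustrative; sample_df is used to avoid masking the base function sample().

```r
sample_df <- data.frame(a = c(444, 2, 3, 4, -555),
                        b = c(2, 3, 4, 5, 68),
                        c = c(-100, 8, 9, 10, 11))

# For each column: mark boxplot outliers as NA, then return the whole column
cleaned <- as.data.frame(lapply(sample_df, function(col) {
  col[col %in% boxplot.stats(col)$out] <- NA
  col
}))

cleaned
```

Because every column keeps its original length, the result can stay a data frame even though each column has a different number of outliers.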

Resources