I have a vector of the type
c(3,3,...,9,9,...,2,2,...,3,3,...,7,7,...)
I want to remove the repeated numbers in a sequence, without breaking the order. This is, I would like to obtain something like
c(3,9,2,3,7,...)
How can I do this in R?
We can also use the observation that a duplicate in sequence has a difference of 0 with their neighbour. Therefore, using base-R, we can do:
v[c(1,diff(v))!=0]
We can try with rleid and duplicated. We create the run-length-ids with rleid (from data.table) so that only adjacent elements that are equal forms one group, get the logical index of not duplicated values and subset the vector.
library(data.table)
v1[!duplicated(rleid(v1))]
#[1] 3 9 2 3 7
Or as the OP mentioned, we can use rle from base R and extract the values.
rle(v1)$values
#[1] 3 9 2 3 7
data
v1 <- c(3,3,9,9,2,2,3,3,7,7)
Just for the fun of it, here is an Rcpp version of solving the problem:
library(Rcpp)
cppFunction('NumericVector remove_multiples(NumericVector& vec) {
NumericVector c_vec(clone(vec));
NumericVector::iterator it = std::unique(c_vec.begin(),c_vec.end());
c_vec.erase(it,c_vec.end());
return(c_vec);
}'
)
x <- c(1,1,1,2,2,2,1,1,3,4,4,1,1)
> remove_multiples(x)
[1] 1 2 1 3 4 1
Related
Say I have a data.frame:
df <- data.frame(A=c(10,20,30),B=c(11,22,33), C=c(111,222,333))
A B C
1 10 11 111
2 20 22 222
3 30 33 333
If I select two (or more) columns I get a data.frame:
x <- df[,1:2]
A B
1 10 11
2 20 22
3 30 33
This is what I want. However, if I select only one column I get a numeric vector:
x <- df[,1]
[1] 1 2 3
I have tried to use as.data.frame(), which does not change the results for two or more columns. it does return a data.frame in the case of one column, but does not retain the column name:
x <- as.data.frame(df[,1])
df[, 1]
1 1
2 2
3 3
I don't understand why it behaves like this. In my mind it should not make a difference if I extract one or two or ten columns. IT should either always return a vector (or matrix) or always return a data.frame (with the correct names). what am I missing? thanks!
Note: This is not a duplicate of the question about matrices, as matrix and data.frame are fundamentally different data types in R, and can work differently with dplyr. There are several answers that work with data.frame but not matrix.
Use drop=FALSE
> x <- df[,1, drop=FALSE]
> x
A
1 10
2 20
3 30
From the documentation (see ?"[") you can find:
If drop=TRUE the result is coerced to the lowest possible dimension.
Omit the ,:
x <- df[1]
A
1 10
2 20
3 30
From the help page of ?"[":
Indexing by [ is similar to atomic vectors and selects a list of the specified element(s).
A data frame is a list. The columns are its elements.
You can also use subset:
subset(df, select = 1) # by index
subset(df, select = A) # by name
As mentioned in the comments you can also use dplyr::select, but you do not need to quote the variable name:
library(dplyr)
# by name
df %>%
select(A)
# by index
df %>%
select(1)
Say I have a data.frame:
df <- data.frame(A=c(10,20,30),B=c(11,22,33), C=c(111,222,333))
A B C
1 10 11 111
2 20 22 222
3 30 33 333
If I select two (or more) columns I get a data.frame:
x <- df[,1:2]
A B
1 10 11
2 20 22
3 30 33
This is what I want. However, if I select only one column I get a numeric vector:
x <- df[,1]
[1] 1 2 3
I have tried to use as.data.frame(), which does not change the results for two or more columns. it does return a data.frame in the case of one column, but does not retain the column name:
x <- as.data.frame(df[,1])
df[, 1]
1 1
2 2
3 3
I don't understand why it behaves like this. In my mind it should not make a difference if I extract one or two or ten columns. IT should either always return a vector (or matrix) or always return a data.frame (with the correct names). what am I missing? thanks!
Note: This is not a duplicate of the question about matrices, as matrix and data.frame are fundamentally different data types in R, and can work differently with dplyr. There are several answers that work with data.frame but not matrix.
Use drop=FALSE
> x <- df[,1, drop=FALSE]
> x
A
1 10
2 20
3 30
From the documentation (see ?"[") you can find:
If drop=TRUE the result is coerced to the lowest possible dimension.
Omit the ,:
x <- df[1]
A
1 10
2 20
3 30
From the help page of ?"[":
Indexing by [ is similar to atomic vectors and selects a list of the specified element(s).
A data frame is a list. The columns are its elements.
You can also use subset:
subset(df, select = 1) # by index
subset(df, select = A) # by name
As mentioned in the comments you can also use dplyr::select, but you do not need to quote the variable name:
library(dplyr)
# by name
df %>%
select(A)
# by index
df %>%
select(1)
I'm looking for a way to exclude a number of answers from a length function.
This is a follow on question from Getting R Frequency counts for all possible answers In sql the syntax could be
select * from someTable
where variableName not in ( 0, null )
Given
Id <- c(1,2,3,4,5)
ClassA <- c(1,NA,3,1,1)
ClassB <- c(2,1,1,3,3)
R <- c(5,5,7,NA,9)
S <- c(3,7,NA,9,5)
df <- data.frame(Id,ClassA,ClassB,R,S)
ZeroTenNAScale <- c(0:10,NA);
R.freq = setNames(nm=c('R','freq'),data.frame(table(factor(df$R,levels=ZeroTenNAScale,exclude=NULL))));
S.freq = setNames(nm=c('S','freq'),data.frame(table(factor(df$S,levels=ZeroTenNAScale,exclude=NULL))));
length(S.freq$freq[S.freq$freq!=0])
# 5
How would I change
length(S.freq$freq[S.freq$freq!=0])
to get an answer of 4 by excluding 0 and NA?
We can use colSums,
colSums(!is.na(S.freq)[S.freq$freq!=0,])[[1]]
#[1] 4
You can use sum to calculate the sum of integers. if NA's are found in your column you could be using na.rm(), however because the NA is located in a different column you first need to remove the row containing NA.
Our solution is as follows, we remove the rows containing NA by subsetting S.freq[!is.na(S.freq$S),], but we also need the second column freq:
sum(S.freq[!is.na(S.freq$S), "freq"])
# 4
You can try na.omit (to remove NAs) and subset ( to get rid off all lines in freq equal to 0):
subset(na.omit(S.freq), freq != 0)
S freq
4 3 1
6 5 1
8 7 1
10 9 1
From here, that's straightforward:
length(subset(na.omit(S.freq), freq != 0)$freq)
[1] 4
Does it solve your problem?
Just add !is.na(S.freq$S) as a second filter:
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
If you want to extend it with other conditions, you could make an index vector first for readability:
idx <- S.freq$freq!=0 & !is.na(S.freq$S)
length(S.freq$freq[idx])
You're looking for values with frequency > 0, that means you're looking for unique values. You get this information directly from vector S:
length(unique(df$S))
and leaving NA aside you get answer 4 by:
length(unique(df$S[!is.na(df$S)]))
Regarding your question on how to exclude a number of items based on their value:
In R this is easily done with logical vectors as you used it in you code already:
length(S.freq$freq[S.freq$freq!=0])
you can combine different conditions to one logical vector and use it for subsetting e.g.
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$freq)])
I have a vector v <- c(6,8,5,5,8) of which I can obtain the unique values using
> u <- unique(v)
> u
[1] 6 8 5
Now I need an index i = [2,3,1,1,3] that returns the original vector v when indexed into u.
> u[i]
[1] 6,8,5,5,8
I know such an index can be generated automatically in Matlab, the ci index, but does not seem to be part of the standard repertoire in R. Is anyone aware of a function that can do this?
The background is that I have several vectors with anonymized IDs that are long character strings:
ids
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
"PTefkd43fmkl28en==3rnl4"
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
To reduce the file size and simplify the code, I want to transform them into integers of the sort
ids
1
2
1
1
2
and found that the index of the unique vector does just this. Since there are many rows, I am hesitant to write a function that loops over each element of the unique vector and wonder whether there is a more efficient way — or a completely different way to transform the character strings into matching integers.
Try with match
df1$ids <- with(df1, match(ids, unique(ids)) )
df1$ids
#[1] 1 2 1 1 2
Or we can convert to factor and coerce to numeric
with(df1,as.integer(factor(ids, levels=unique(ids))))
#[1] 1 2 1 1 2
Using u and v. Based on the output of 'u' in the OP's post, it must have been sorted
u <- sort(unique(v))
match(v, u)
#[1] 2 3 1 1 3
Or using findInterval. Make sure that 'u' is sorted.
findInterval(v,u)
#[1] 2 3 1 1 3
I need to compare the values stored in two variables.The variable sizes are different. For example
x = c(1,2,3,4,5,6,7,8,9,10)
and
y = c(2,6,11,12,13)
I need an answer that 2 and 6 are present in both variables. I need this to be done in R.Anyone help please.
The intersect function avoids the need for #mdsumner's simple indexing:
> x = c(1,2,3,4,5,6,7,8,9,10)
> y = c(2,6,11,12,13)
> intersect(x,y)
[1] 2 6
Whole bunch of set operators to be found here: help(intersect)
Posted after the added requirement that some sort of tolerance be allowed: You could sequentially check one set of values against all the others in the second set or you could do it all at once with outer(). Once you have the outer result as a logical matrix there remains the task of referring back to the values, but expand.grid seems capable of handling that:
expand.grid(x,y)[outer(x,y, FUN=function(x,y) abs(x-y) < 0.01), ]
# Var1 Var2
#2 2 2
#16 6 6
After posting It occurred to me that you values were sorted. Turns out that this extraction from expand.grid() survives passing unsorted vectors.
x[x %in% y]
[1] 2 6
Or, more explicitly:
x[match(x, y, nomatch = 0) > 0]
[1] 2 6
Note that you actually chain together the results of the match with simple indexing into the input values.
See ?match.