How to find vectors with duplicate values in a row? - r

I have a lot of vectors, which look something like this:
a <- c(0,0,0,1,1)
b <- c(1,0,0,0,1)
c <- c(0,0,1,1,1)
Each of these vectors contains a value that is repeated three times in succession.
I need to identify these repetitions somehow. The key condition is that the repeated values come one after the other.
duplicated() from base R will not help here.
I need to identify such vectors so that I can then remove them.
A suitable vector for my work:
d <- c(1,0,1,0,0)
An unsuitable vector:
e <- c(1,1,1,0,0)

You might want to take a look at the rle function from base R or the rleid function from data.table.
rle(c(0,0,0,1,1))
Run Length Encoding
lengths: int [1:2] 3 2
values : num [1:2] 0 1
library(data.table)
rleid(c(0,0,0,1,1))
[1] 1 1 1 2 2
Both will look at runs of the same number. The rle function returns a list of lengths and values, and the rleid function returns a vector counting up each time the number in the series changes.
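Building on rle, one way to flag the unwanted vectors is to check whether any run reaches length three. This is only a minimal sketch: the helper name has_run and its threshold argument n are made up for illustration and are not part of base R or data.table.

```r
# Hypothetical helper: TRUE if x contains a run of at least n equal adjacent values
has_run <- function(x, n = 3) {
  any(rle(x)$lengths >= n)
}

vecs <- list(
  a = c(0, 0, 0, 1, 1),
  b = c(1, 0, 0, 0, 1),
  d = c(1, 0, 1, 0, 0),
  e = c(1, 1, 1, 0, 0)
)

flagged <- sapply(vecs, has_run)   # named logical vector, one entry per vector
kept <- vecs[!flagged]             # only the vectors without such a run
names(kept)
# [1] "d"
```

Keeping the vectors in a list, as assumed here, makes it possible to filter them all in one step instead of testing each variable by hand.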

Related

Apply a function that requires seq() in R

I am trying to run a summation on each row of a data frame. Let's say I want to take the sum of 100n^2, from n=1 to n=4.
> df <- data.frame(n = seq(1:4),a = rep(100))
> df
  n   a
1 1 100
2 2 100
3 3 100
4 4 100
Simpler example:
Let's make fun our example summation function. I can pull 100 out because I can just multiply it in later.
fun <- function(x) {
i <- seq(1,x,1)
sum(i^2) }
I then want to apply this function to each row of the data frame, where df$n provides the upper bound of the summation.
The desired outcome would be as follows, in df$b:
> df
  n   a  b
1 1 100  1
2 2 100  5
3 3 100 14
4 4 100 30
To achieve these results I've tried the apply function
apply(df$n,1,fun)
and also with df converted into a matrix
mat <- as.matrix(df)
apply(mat[1,],1,fun)
Both return an error:
Error in seq.default(1, x, 1) : 'to' must be of length 1
I understand this error, in that I understand why seq requires a 'to' value of length 1. I don't know how to go forward.
I have also tried the same while reading the dataframe as a matrix.
Maybe less simple example:
In my case I only need to multiply the results above, df$b, by 100 (or df$a) to get my final answer for each row. In other cases, though, the second value might be part of the summand itself, for example a^i. How would I call on both variables, a and n?
Underlying question:
My underlying goal is to apply a summation to each row of a dataframe (or a matrix). The above questions stem from my attempt to do so using seq(), as I saw advised in an answer on this site. I will gladly accept an answer that obviates the above questions with a different way to run a summation.
seq does not accept vectors for its from and to arguments, so we can loop over the values of df$n instead:
df$b <- sapply(df$n, fun)
df$b
#[1] 1 5 14 30
Or we can use Vectorize:
Vectorize(fun)(df$n)
#[1] 1 5 14 30
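For the case where both n and a feed into the summation, mapply can pass one value from each column per row. The following is a sketch under assumptions: the function name fun2, the column name total, and the summand a * i^2 are chosen only for illustration.

```r
df <- data.frame(n = 1:4, a = 100)

# Hypothetical two-argument summation: a * i^2 summed for i = 1..n
fun2 <- function(n, a) {
  i <- seq_len(n)
  sum(a * i^2)
}

# mapply walks along df$n and df$a in parallel, one pair per row
df$total <- mapply(fun2, df$n, df$a)
df$total
# [1]  100  500 1400 3000
```

Changing the body of fun2 to sum(a^i) would cover the a^i case mentioned in the question without touching the mapply call.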

Remove repeated numbers in a sequence

I have a vector of the type
c(3,3,...,9,9,...,2,2,...,3,3,...,7,7,...)
I want to remove the repeated numbers in a sequence, without breaking the order. That is, I would like to obtain something like
c(3,9,2,3,7,...)
How can I do this in R?
We can also use the observation that a value repeated in sequence has a difference of 0 with its neighbour. Therefore, using base R, we can do:
v[c(1,diff(v))!=0]
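As a worked example of the diff idea, here is a small sketch on made-up data. The string variant at the end is an addition that does not appear in the answers; it is included because diff only works on numbers.

```r
v <- c(3, 3, 9, 9, 2, 2, 3, 3, 7, 7)

# Keep the first element, then every element that differs from its predecessor
keep <- c(TRUE, diff(v) != 0)
v[keep]
# [1] 3 9 2 3 7

# diff() needs numbers; comparing neighbours directly also works for strings
s <- c("x", "x", "y", "x", "x")
s_dedup <- s[c(TRUE, s[-1] != s[-length(s)])]
s_dedup
# [1] "x" "y" "x"
```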
We can try with rleid and duplicated. We create the run-length ids with rleid (from data.table) so that only adjacent elements that are equal form one group, get the logical index of the non-duplicated ids, and subset the vector.
library(data.table)
v1[!duplicated(rleid(v1))]
#[1] 3 9 2 3 7
Or as the OP mentioned, we can use rle from base R and extract the values.
rle(v1)$values
#[1] 3 9 2 3 7
data
v1 <- c(3,3,9,9,2,2,3,3,7,7)
Just for the fun of it, here is an Rcpp version of solving the problem:
library(Rcpp)
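cppFunction('NumericVector remove_multiples(NumericVector& vec) {
  // Copy the input so the original R vector is left unchanged
  NumericVector c_vec(clone(vec));
  // std::unique collapses each run of equal adjacent values to one element
  NumericVector::iterator it = std::unique(c_vec.begin(), c_vec.end());
  // Drop the leftover tail after the last unique element
  c_vec.erase(it, c_vec.end());
  return c_vec;
}')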
x <- c(1,1,1,2,2,2,1,1,3,4,4,1,1)
> remove_multiples(x)
[1] 1 2 1 3 4 1

R - How to compare values across more than two columns

I'm trying to write code to compare the values of several columns, and I don't know ahead of time how many columns I will have. The data will look like this:
X Val1 Val2 Val3 Val4
A    1    1    1    2
B   NA    2    2    2
C    3    3    3    3
The code should return a Fail for rows A and B, and a Pass for row C, but needs to be able to handle a changing number of columns. I can't figure out how to do this without nesting a couple of for loops, but there has to be some way to use apply or sapply to iterate through columns 2:length(df).
EDIT: I want to see if the values (which will be numbers) are equal
Assuming that the first column is excluded from the comparison and that all the other columns are not, you can try:
which(rowSums(df[,2]==df[,3:ncol(df)])==(ncol(df)-2))
You can use apply with a custom function length(unique(x)) to count the number of unique values per row across columns 2:ncol(df). You can then wrap the whole thing in ifelse to return a logical vector.
ifelse(apply(df[ , 2:ncol(df)], MARGIN=1, function(x) length(unique(x))) == 1, TRUE, FALSE)
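To get the Pass/Fail labels the question asks for, the same idea can be sketched as follows. The data frame is rebuilt from the example table, and the column name Result is an assumption made for illustration.

```r
df <- data.frame(
  X    = c("A", "B", "C"),
  Val1 = c(1, NA, 3),
  Val2 = c(1, 2, 3),
  Val3 = c(1, 2, 3),
  Val4 = c(2, 2, 3)
)

# Drop the label column; drop = FALSE keeps a data frame even with one value column
vals <- df[, -1, drop = FALSE]

# A row passes only if it holds exactly one distinct value (NA counts as distinct)
all_equal <- apply(vals, 1, function(x) length(unique(x)) == 1)

df$Result <- ifelse(all_equal, "Pass", "Fail")
as.character(df$Result)
# [1] "Fail" "Fail" "Pass"
```

Because unique treats NA as its own value, row B fails here without any special handling of missing data.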

r - Force which() to return only first match

Part of a function I'm working on uses the following code to take a data frame and reorder its columns on the basis of the largest (absolute) value in each column.
ord <- order(abs(apply(dfm,2,function(x) x[which(abs(x) == max(abs(x)), arr.ind = TRUE)])))
For the most part, this works fine, but with the dataset I'm working on, I occasionally get data that looks like this:
a <- rnorm(10,5,7); b <- rnorm(10,0,1); c <- rep(1,10)
dfm <- data.frame(A = a, B = b, C = c)
> dfm
             A          B C
1    0.6438373 -1.0487023 1
2   10.6882204  0.7665011 1
3  -16.9203506 -2.5047946 1
4   11.7160291 -0.1932127 1
5   13.0839793  0.2714989 1
6   11.4904625  0.5926858 1
7   -5.9559206  0.1195593 1
8    4.6305924 -0.2002087 1
9   -2.2235623 -0.2292297 1
10   8.4390810  1.1989515 1
When that happens, the above code returns a "non-numeric argument to mathematical function" error at the abs() step. (And if I get rid of the abs() step because I know, due to transformation, my data will be all positive, order() returns: "unimplemented type 'list' in 'orderVector1'".) This is because which() returns all the 1's in column C, which in turn makes apply() spit out a list, rather than a nice tidy vector.
My question is this: How can I make which() JUST return one value for column C in this case? Alternately, is there a better way to write this code to do what I want it to (reorder the columns of a matrix based on the largest value in each column, whether or not that largest value is duplicated) that won't have this problem?
If you want to select just the first element of the result, you can subset it with [1]:
ord <- order(abs(apply(dfm,2,function(x) x[which(abs(x) == max(abs(x)), arr.ind = TRUE)][1])))
To order the columns by their maximum element (in absolute value), you can do
dfm[order(apply(abs(dfm),2,max))]
Your code, with @CarlosCinelli's correction, should work fine, though.
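Another option, not taken from the answers above, is which.max from base R, which always returns the index of the first maximum only. The sketch below uses small made-up data in place of the random columns so that the result is reproducible.

```r
# Made-up stand-in for the random data; column C is constant, as in the question
dfm <- data.frame(
  A = c(-16, 11, 13),
  B = c(-2.5, 0.7, 1.1),
  C = c(1, 1, 1)
)

# which.max() yields a single index even when the maximum is tied,
# so sapply() returns a plain numeric vector rather than a list
col_max <- sapply(dfm, function(x) x[which.max(abs(x))])

# Order the columns by the absolute value of their largest element
ord <- order(abs(col_max))
names(dfm[ord])
# [1] "C" "B" "A"
```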

R: Index to unique vector that returns original

I have a vector v <- c(6,8,5,5,8) of which I can obtain the unique values using
> u <- unique(v)
> u
[1] 6 8 5
Now I need an index i = [2,3,1,1,3] that returns the original vector v when indexed into u.
> u[i]
[1] 6 8 5 5 8
I know such an index can be generated automatically in Matlab, the ci index, but does not seem to be part of the standard repertoire in R. Is anyone aware of a function that can do this?
The background is that I have several vectors with anonymized IDs that are long character strings:
ids
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
"PTefkd43fmkl28en==3rnl4"
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
To reduce the file size and simplify the code, I want to transform them into integers of the sort
ids
1
2
1
1
2
and found that the index of the unique vector does just this. Since there are many rows, I am hesitant to write a function that loops over each element of the unique vector and wonder whether there is a more efficient way — or a completely different way to transform the character strings into matching integers.
Try with match
df1$ids <- with(df1, match(ids, unique(ids)) )
df1$ids
#[1] 1 2 1 1 2
Or we can convert to factor and coerce to integer
with(df1,as.integer(factor(ids, levels=unique(ids))))
#[1] 1 2 1 1 2
Using u and v. Judging by the index i in the OP's post, 'u' must have been sorted
u <- sort(unique(v))
match(v, u)
#[1] 2 3 1 1 3
Or using findInterval. Make sure that 'u' is sorted.
findInterval(v,u)
#[1] 2 3 1 1 3
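To check that such an index really reproduces the original vector, a round trip can be sketched as below. It uses the unsorted unique(v) as printed in the question and the two id strings from the example; the variable names codes and lookup are illustrative.

```r
v <- c(6, 8, 5, 5, 8)
u <- unique(v)          # 6 8 5, in order of first appearance
i <- match(v, u)
i
# [1] 1 2 3 3 2
identical(u[i], v)
# [1] TRUE

# The same round trip with the long character ids
ids <- c("PTefkd43fmkl28en==3rnl4", "cmdREW3rFDS32fDSdd;32FF",
         "PTefkd43fmkl28en==3rnl4", "PTefkd43fmkl28en==3rnl4",
         "cmdREW3rFDS32fDSdd;32FF")
lookup <- unique(ids)         # keep this table to translate codes back to ids
codes <- match(ids, lookup)   # compact integer codes
codes
# [1] 1 2 1 1 2
identical(lookup[codes], ids)
# [1] TRUE
```

With an unsorted u the index differs from the one in the question, but indexing u with it still returns v, which is the property that matters for recoding.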
