R function like which() to match multiple inputs to multiple values

I have a vector of multiple values that I want to match to multiple values without the use of a loop. Is there a function that can do this?
x <- c(2,5,4)
y <- 2:10
which(x==y) #won't work
Expected output is 1,4,3
In my real use case, you can assume there is exactly one correct match and that every element of x matches something in y. I need this to be as fast as possible, which is why I'm trying to avoid a loop. As a side note, this part is already inside a foreach loop.

You want match
match(x,y)
# 1 4 3

The which() version would be which(x %in% y), but I don't think that fits your purpose, since its output here is 1,2,3.
If you apply which(y %in% x) instead, the output is 1,3,4.
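A quick side-by-side on the example data makes the difference clear (re-running the vectors from the question):
x <- c(2, 5, 4)
y <- 2:10
match(x, y)      # 1 4 3 (position in y of each element of x, in x's order)
which(y %in% x)  # 1 3 4 (positions in y whose values occur in x, in y's order)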

Related

How to get all elements at indices stored in a vector?

I have a matrix where every row consists of zeros and a single one, say y <- rbind(c(1,0,0), c(0,1,0), c(0,1,0)), and I have a vector holding an index for each row, say x <- c(1,2,3). Now I would like to count the number of times that y[i,x[i]] == 1 holds. I know I could do this like
count <- 0
for (i in 1:3)
  count <- count + y[i, x[i]]
but I was interested if there would be a smarter way. Something like count <- sum(y[,x]). Of course this does not work, because y[,x] gives a matrix.
Therefore my question is there a way to get a vector with the elements at the positions given by another vector by using apply or any other smart trick, i.e. without for-loops?
I've already been looking for this, but I don't really know how to call this and therefore I didn't find anything useful. Apologies if this question already hangs around somewhere...
We can use row/column indexing to extract the elements corresponding to 'x' and 'y' indices and then get the sum
sum(y[cbind(1:nrow(y), x)])
#[1] 2
If the values can be something other than 1,
sum(y[cbind(1:nrow(y), x)]==1)
Or, for this particular case (x happens to be 1:nrow(y), so the indexed elements lie on the diagonal),
sum(diag(y)==1)
#[1] 2
Or
sum(y*diag(y))
EDIT: Changed the row/column index from cbind(x,1:ncol(y)) to cbind(1:nrow(y), x) as per the comments.
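To spell out the indexing trick: cbind() builds a two-column matrix of (row, column) pairs, and subsetting a matrix with such a matrix returns one element per pair. A minimal sketch with the question's data:
y <- rbind(c(1,0,0), c(0,1,0), c(0,1,0))
x <- c(1, 2, 3)
idx <- cbind(1:nrow(y), x)  # two-column index matrix: (1,1), (2,2), (3,3)
y[idx]                      # 1 1 0, one element per (row, column) pair
sum(y[idx])                 # 2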

How to apply operation and sum over columns in R?

I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (which you are seeing many of). That said, you can still use rowSums if you find it useful, like so. Note that I use sapply, which also returns a matrix here.
# random custom function
myfun <- function(x){
  return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
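For a quick check on the example data (my own run, not from the original answer):
x <- data.frame(sample=1:3, a=4:6, b=7:9)
myfun <- function(x) x^2 + 3
rowSums(x[, c(2, 3)]^2)               # 65 89 117
rowSums(sapply(x[, c(2, 3)], myfun))  # 71 95 123 (myfun adds 3 in each of the two columns, so +6 per row)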
I would suggest converting the data set into 'long' format, grouping it by sample, and then calculating the result. Here is a solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
# sample V1
#1: 1 65
#2: 2 89
#3: 3 117
You can easily replace value^2 by any function you want.
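For instance, a sketch with a made-up replacement function (myfun here is just a stand-in):
library(data.table)
x <- data.frame(sample=1:3, a=4:6, b=7:9)
myfun <- function(v) v^2 + 3  # stand-in for any vectorised function
melt(setDT(x), id.vars = 'sample')[, sum(myfun(value)), by = sample]
#    sample  V1
# 1:      1  71
# 2:      2  95
# 3:      3 123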
You can use the apply function and pick the columns you need with c(i1, i2, ..., etc.):
apply(x[, c(2, 3)]^2, 1, sum)
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do :
# if somefunction can be vectorized:
x$results <- apply(x[, col_indices], 1, function(x) sum(somefunction(x)))
# if not:
x$results <- apply(x[, col_indices], 1, function(x) sum(sapply(x, somefunction)))
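A minimal sketch with the placeholders filled in (col_indices and somefunction are hypothetical values for illustration):
x <- data.frame(sample=1:3, a=4:6, b=7:9)
col_indices <- c("a", "b")       # placeholder column names
somefunction <- function(v) v^2  # placeholder function (vectorised here)
x$results <- apply(x[, col_indices], 1, function(x) sum(somefunction(x)))
x$results                        # 65 89 117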
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors: each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define the function as the square, as you have above, but of course it can be any function of any complexity (so long as it takes a vector as input and returns a vector of the same length; if it doesn't, it won't fit back into the original data.frame!).
The steps below are extra pedantic to show each little bit, but obviously they can be compressed into one or two steps. Note that I only retain the row sums of the squared columns, since you might want to save memory if you are working with lots and lots of data.
1. Create the data; define the function.
2. Grab the columns you want as a separate (temporary) data.frame.
3. Apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to keep it separately, make it a temporary data.frame (this is not strictly necessary).
5. Calculate the row sums of the temporary data.frame and append them as a new column in x.
6. Remove the temporary data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2  # step 1
x[2:3]  # step 2
temp <- data.frame(lapply(x[2:3], square))  # steps 3 and 4
x$squareRowSums <- rowSums(temp)  # step 5
rm(temp)  # step 6
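As mentioned, the steps can be compressed; using the same x and square from step 1, a one-line version might look like:
x$squareRowSums <- rowSums(data.frame(lapply(x[2:3], square)))  # 65 89 117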
Here is another apply solution:
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))

FOR Loop in R for a random vector

I am trying to write a loop using for where my index i should take values from the set c(2,4,6,8,10,12). I am then using i to subset values from another vector.
I defined a vector X, where X <- c(2,4,6,8,10,12),
and then used for(i in X[1]:tail(X, n=1)).
This results in i taking all values from 2 to 12!
Whereas I want it to take only the values in X, i.e. 2, 4, 6, 8, 10, 12.
I hope someone can give me a hint on how to do this.
Thank you in advance.
You are looping through a sequence. If you want to loop only through the values of a vector, use this:
X <- c(2,4,6,8,10,12)
for (i in X) {
  # Your code
}
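And since the question mentions using i to subset another vector, a small sketch (other is a hypothetical second vector):
X <- c(2, 4, 6, 8, 10, 12)
other <- letters[1:12]  # hypothetical vector being subset
for (i in X) {
  print(other[i])       # i takes only the values 2, 4, 6, 8, 10, 12
}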

Efficient method to subset/drop rows with NA values in R

Background
Before running a stepwise model selection, I need to remove missing values for any of my model terms. Since my model has quite a few terms, there are quite a few vectors that I need to check for NA values (and I need to drop any rows that have NA values in any of those vectors). However, there are also vectors that contain NA values that I do not want to use as criteria for dropping rows.
Question
How do I drop rows from a dataframe which contain NA values in any of a given list of vectors? I'm currently using the clunky method of a long series of !is.na's:
my.df[!is.na(my.df$termA) & !is.na(my.df$termB) & !is.na(my.df$termD), ]
but I'm sure that there is a more elegant method.
Let dat be a data frame and cols a vector of column names or column numbers of interest. Then you can use
dat[!rowSums(is.na(dat[cols])), ]
to exclude all rows with at least one NA.
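A small worked sketch (toy data, not from the question) of what the expression does: is.na() gives a logical matrix over the chosen columns, rowSums() counts the NAs per row, and ! keeps exactly the rows with zero NAs:
dat <- data.frame(termA = c(1, NA, 3), termB = c(4, 5, NA), other = c(NA, 1, 2))
cols <- c("termA", "termB")        # NAs in 'other' are ignored
dat[!rowSums(is.na(dat[cols])), ]  # keeps only row 1 (the NA in 'other' does not matter)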
Edit: I completely glossed over subset, the built-in function that is made for subsetting things:
my.df <- subset(my.df,
                !(is.na(termA) |
                  is.na(termB) |
                  is.na(termC)))
I tend to use with() for things like this. Don't use attach; you're bound to cut yourself.
my.df <- my.df[with(my.df, {
  !(is.na(termA) |
    is.na(termB) |
    is.na(termC))
}), ]
But if you often do this, you might also want a helper function, is_any()
is_any <- function(x){
  !is.na(x)
}
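Used inside with(), the helper reads like this (a sketch with made-up column values, using is_any() as defined above):
my.df <- data.frame(termA = c(1, NA, 3), termB = c(4, 5, NA), termC = c(7, 8, 9))
my.df <- my.df[with(my.df, is_any(termA) & is_any(termB) & is_any(termC)), ]
# keeps only the first row, the only one with no NA in termA, termB, or termC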
If you end up doing a lot of this sort of thing, SQL is often a nicer way to interact with subsets of data. dplyr may also prove useful.
This is one way:
# create some random data
df <- data.frame(y=rnorm(100),x1=rnorm(100), x2=rnorm(100),x3=rnorm(100))
# introduce random NA's
df[round(runif(10,1,100)),]$x1 <- NA
df[round(runif(10,1,100)),]$x2 <- NA
df[round(runif(10,1,100)),]$x3 <- NA
# this does the actual work...
# assumes data is in columns 2:4, but can be anywhere
for (i in 2:4) {df <- df[!is.na(df[,i]),]}
And here's another, using sapply(...) and Reduce(...):
xx <- data.frame(!sapply(df[2:4],is.na))
yy <- Reduce("&",xx)
zz <- df[yy,]
The first statement "applies" the function is.na(...) to columns 2:4 of df, and inverts the result (we want !NA). The second statement applies the logical & operator to the columns of xx in succession. The third statement extracts only rows with yy=T. Clearly this can be combined into one horrifically complicated statement.
zz <-df[Reduce("&",data.frame(!sapply(df[2:4],is.na))),]
Using sapply(...) and Reduce(...) can be faster if you have very many columns.
Finally, most modeling functions have parameters that can be set to deal with NA's directly (without resorting to all this). See, for example the na.action parameter in lm(...).
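For example, lm() takes an na.action argument; with na.omit, rows containing NA in any variable used by the formula are dropped at fit time (a sketch on the df generated above):
fit <- lm(y ~ x1 + x2 + x3, data = df, na.action = na.omit)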

How to remove only one instance of a duplicate value in a vector in R?

Let's consider a vector of numeric values "x". Some values may be duplicates. I need to remove the max value one by one until x is empty.
The problem: if I use
x <- x[x != max(x)]
it removes all values equal to the maximum, but I want to remove only one of them. So for now, I do:
max.x <- x[x == max(x)]
max.x <- max.x[1:length(max.x) - 1]
x <- c(x[x != max(x)], max.x)
But this is far from computationally efficient, and I'm not good enough at R to find the right way to do it. Does someone have a better trick?
Thanks
Just for fun,
x <- x[-which.max(x)]
Rinse, lather, repeat.
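Spelled out, "rinse, lather, repeat" could look like this (my sketch of the idea, processing the maximum one element at a time until x is empty):
x <- c(3, 7, 7, 1, 7)
while (length(x) > 0) {
  # ... use max(x) here ...
  x <- x[-which.max(x)]  # drops exactly one occurrence of the current maximum
}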
You're not entirely clear about the scope of your problem, so I'll just give the first suggestion that comes to mind. Use the sort function to get the values in decreasing order:
sorted <- sort(x, decreasing=TRUE, index.return=TRUE)
You can now iteratively remove the highest item from x. Re-using the sort function over and over on your subset data is inefficient; better to keep a permanent copy of x and do the removals from that, if possible.
Consider this approach
# random set of data with duplicates
x <- floor(runif(50)*15)
# sort with index.return returns a sorted x in sorted$x and the
# indices of the sorted values from the original x in sorted$ix
sorted <- sort(x,decreasing=TRUE,index.return=TRUE)
for (i in 1:length(x))
{
  # remove the i largest values from x
  newX <- x[-sorted$ix[1:i]]
  print(sort(newX, decreasing=TRUE))
}
The way I understand your question,
?unique
might give you what you want.
Rgds,
Rainer

Resources