Create equation function acting across rows in R

I have a dataframe similar to the one this creates:
dummy=data.frame(c(1,2,3,4),c("a","b","c","d"));colnames(dummy)=c("Num","Let")
dummy$X1=rnorm(4,35,6)
dummy$X2=rnorm(4,35,6)
dummy$X3=rnorm(4,35,6)
dummy$X4=rnorm(4,35,6)
dummy$X5=rnorm(4,35,6)
dummy$X6=rnorm(4,35,6)
dummy$X7=rnorm(4,35,6)
dummy$X8=rnorm(4,35,6)
dummy$X9=rnorm(4,35,6)
dummy$X10=rnorm(4,35,6)
dummy$Xmax=apply(dummy[3:12],1,max)
The real data set is roughly 260 × 13,000 cells.
What I aim to do is apply the equation below to each row, across a set of columns selected by index, data[x:x] (in the example, the columns in dummy[3:12]):
TSP = Sum( (1-(Xi/Xmax)) /(n-1))
where Xi is each individual value within the row and among the columns of interest (i indexes the columns, i.e. there is an X1, an X2, an X3... value for each row), Xmax is the largest of those values in the row (as stored in the dummy$Xmax column), and n is the number of columns selected (in the example, n = 10). In the actual data set I will be selecting 26 columns.
I would like to create a tidy little function which performs this calculation, deposits each row's value into a column called dummy$TSP, and does so for all 13,000 rows.
One crude solution is the following, but as I said, I would like to get this into some kind of tidy function where I can select the columns and the rest is (nearly) automatic.
dummy$TSP <- (((1-(dummy$X1/dummy$Xmax))/(10-1))
+((1-(dummy$X2/dummy$Xmax))/(10-1))
...
+((1-(dummy$X10/dummy$Xmax))/(10-1)))
I would also really appreciate answers which explain the process well, so that I am more likely to learn from them. Thanks in advance!

If you know which columns you want to apply the function over, you can, as you suspect, use apply to run the function over the rows of just those columns, like so:
# Columns you want to use for this function (X1..X10; the Xmax column is not included)
cols <- 3:12
# Use apply to loop over rows
dummy$TSP <- apply(dummy[, cols], 1, FUN = function(x){
  sum((1 - (x / max(x))) / (length(x) - 1))
})
R is vectorised, so when apply passes a row to the function (the row arrives as the argument x, a vector of 10 numbers), any arithmetic we perform is applied to each element of that vector.
So in the first instance x/max(x) returns a vector of 10 numbers: each element of the row divided by the maximum value in that row. Each result of 1 - x/max(x) is then divided by the number of columns minus 1, and these are collated into one value using sum, which is returned from the function.
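As a quick illustration of that vectorised arithmetic, here is a hypothetical three-value row (not from the question's data), stepping through the same operations:
## a toy row, just to show the element-wise steps
x <- c(30, 40, 20)
x / max(x)                                    # 0.75 1.00 0.50
1 - (x / max(x))                              # 0.25 0.00 0.50
(1 - (x / max(x))) / (length(x) - 1)          # 0.125 0.000 0.250
sum((1 - (x / max(x))) / (length(x) - 1))     # 0.375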

A more vectorised solution is to apply the inner expression to all elements at once and then compute the per-row sums with the efficient rowSums function, like this:
vars.to.use <- paste0("X", 1:10)
dummy$TSP <- rowSums((1-(dummy[vars.to.use]/dummy$Xmax))/(length(vars.to.use) - 1))
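To get the "tidy little function" the question asks for, that rowSums approach can be wrapped so that only the data frame and the columns of interest need to be supplied. The sketch below is one way to do it (the name tsp_rows and its arguments are just illustrative choices; the row maximum is recomputed inside the function so no pre-made Xmax column is required):
tsp_rows <- function(data, cols) {
  # subset the columns of interest once
  sub <- data[cols]
  # per-row maximum across those columns
  xmax <- do.call(pmax, sub)
  # sum( (1 - Xi/Xmax) / (n - 1) ) for each row
  rowSums((1 - sub / xmax) / (length(cols) - 1))
}
# usage on the example data
dummy$TSP <- tsp_rows(dummy, paste0("X", 1:10))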

Related

Count the number of rows in each column of a dataframe that satisfy a specific condition

New to R, by the way, so I am sorry if this seems like a stupid question.
So basically I have a dataframe with 100 rows and 3 different columns of data. I also have a vector with 3 thresholds, one for each column. I was wondering how you would filter out the values in each column that are greater than the corresponding threshold.
Edit: Sorry for the incomplete question.
So essentially what I would like to create is a function (that takes a dataframe and a vector of thresholds as parameters) that applies each threshold to its respective column of the dataframe (so there is one threshold for every column). The number of elements in each column that respect their threshold should then be put into a vector. So for example:
Column 1: values = 1, 2, 3. Threshold = (only values lower than 3)
Column 2: values = 4, 5, 6. Threshold = (only values lower than 6)
Output: a vector (2, 2), since there are two elements in each column that are under their respective thresholds.
Thank you everyone for the help!!
Your example data:
df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
One option is to use sapply(), which applies a function over a list or vector. In this case, I create a vector of column indices for df with 1:ncol(df). Inside the function, you can count the number of values less than the given threshold by summing the TRUE cases:
col_num <- 1:ncol(df)
sapply(col_num, function(x) {sum(df[, x] < threshold[x])})
Or, in a single line:
sapply(1:ncol(df), function(x) {sum(df[, x] < threshold[x])})
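An equivalent base-R way to express the same count (my own variation, not part of the original answer) is mapply, which walks over the columns of df and the elements of threshold in parallel:
mapply(function(col, th) sum(col < th), df, threshold)
# a b
# 2 2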

Compare multiple columns in 2 different dataframes in R

I am trying to compare multiple columns in two different dataframes in R. This has been addressed previously on the forum (Compare group of two columns and return index matches R), but this is a different scenario: I am trying to check whether a column in dataframe 1 falls within the range defined by two columns in dataframe 2. Functions like match, merge, join and intersect won't work here. I have been trying to use purrr::pluck but didn't get far. The dataframes are of different sizes.
Below is an example:
temp1.df <- mtcars
temp2.df <- data.frame(
  Cyl = sample(4:8, 100, replace = TRUE),
  Start = sample(1:22, 100, replace = TRUE),
  End = sample(1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)
My attempt:
temp1.df <- temp1.df %>% mutate (new_mpg = case_when (
temp1.df$cyl %in% temp2.df$Cyl & temp2.df$Start <= temp1.df$mpg & temp2.df$End >= temp1.df$mpg ~ 1
))
Error:
Error in mutate_impl(.data, dots) :
Column `new_mpg` must be length 32 (the number of rows) or one, not 100
Expected Result:
Compare temp1.df$cyl and temp2.df$Cyl. If they match, then -->
check if temp1.df$mpg is between temp2.df$Start and temp2.df$End -->
if it is, then create a new variable new_mpg with a value of 1.
It's hard to show the exact expected output here.
I realize I could loop over each row of temp1.df, but the original temp2.df has over 250,000 rows. An efficient solution would be much appreciated.
Thanks
temp1.df$new_mpg <- apply(temp1.df, 1, function(x) {
  # keep only the temp2.df rows with the same Cyl as this row (column 2 of temp1.df is cyl)
  temp <- temp2.df[temp2.df$Cyl == x[2], ]
  # 1 if this row's mpg (column 1) falls inside any Start/End range, else 0
  ifelse(any(apply(temp, 1, function(y) {
    dplyr::between(as.numeric(x[1]), as.numeric(y[2]), as.numeric(y[3]))
  })), 1, 0)
})
Note that this makes some assumptions about the organization of your actual data. In particular, I can't use the column names within apply, so I'm using indexes, which may very well change; you might want to rearrange your data between receiving it and calling apply, or reorganize it within the call, e.g. apply(temp1.df[,c("mpg","cyl")]....
At any rate, this breaks your data set into lines, and each line is compared to the subset of the second dataset with the same Cyl count. Within that subset, it checks whether the mpg for this line falls between (from dplyr) Start and End, and returns 1 if yes (or 0 if no). All these ones and zeros are then returned as a (named) vector, which can be placed into temp1.df$new_mpg.
I'm guessing there's a way to do this with rowwise, but I could never get it to work properly...
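If the row-by-row apply proves too slow on the full 250,000-row temp2.df, one base-R speed-up (a sketch of my own, not from the original answer) is to split temp2.df by Cyl once up front, so each row of temp1.df only scans the ranges for its own cylinder value:
# split the Start/End ranges by Cyl once, instead of subsetting temp2.df for every row
ranges_by_cyl <- split(temp2.df[, c("Start", "End")], temp2.df$Cyl)

temp1.df$new_mpg <- mapply(
  function(cyl, mpg) {
    r <- ranges_by_cyl[[cyl]]
    if (is.null(r)) return(0L)                      # no ranges for this Cyl value
    as.integer(any(r$Start <= mpg & mpg <= r$End))  # 1 if mpg falls in any range
  },
  temp1.df$cyl, temp1.df$mpg,
  USE.NAMES = FALSE
)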

Learning Functions [Error: undefined columns selected]

R newbie here.
I'm learning functions, and I have a problem running this:
newfunction = function(x) {
  limit = ncol(x)
  for(i in 1:limit){
    if(anyNA(x[,i] == T)) {
      x[,i] = NULL
    }
  }
}
newfunction(WBD_SA)
I get the error: Error in `[.data.frame`(x, , i) : undefined columns selected
I'm trying to remove all columns that have any NA values from my data set WBD_SA.
I know na.omit() removes rows with NA values, but I'm not sure if there is something similar for columns.
Any suggestions regarding packages/functions that can make this happen are also appreciated.
Cheers!
You are getting this error because you are iterating from 1 to limit, where limit is the number of columns at the start of the function, while dropping columns from the data.frame as you go. This means that if you drop even one column, ncol(x) will be less than limit before the for loop ends, so x[,i] eventually asks for a column that no longer exists. Here are 3 alternatives that work:
iterate backward:
for(i in limit:1)
  if(anyNA(x[,i]))
    x[,i] = NULL
With the above loop, the i'th column will always be in the same position as it was when the for loop started.
iterate forward using a while loop, only advancing i when no column was dropped (otherwise the column that shifts into position i would be skipped):
i = 1
while(i <= ncol(x)){
  if(anyNA(x[,i])){
    x[,i] = NULL
  } else {
    i = i + 1
  }
}
use the fact that data.frames are subclasses of lists, and use sapply to create a logical index that is TRUE for columns that contain a missing value and FALSE otherwise, like so:
columnHasMissingValue <- sapply(x, function(y) any(is.na(y)))
x <- x[, !columnHasMissingValue]
As long as you're learning about data.frames, it's also useful to know that you can use negative indices to drop columns, like so:
x <- x[, -which(columnHasMissingValue)]
Note that the above solution is similar to the apply approach in user1362215's answer, which takes advantage of the fact that data.frames have two dimensions*, so you can apply a function over the second margin (columns) like so:
good_cols = apply(x,# the object over which to apply the function
2,# apply the function over the second margin (columns)
function(x) # the function to apply
!any(is.na(x))
)
x = x[,good_cols]
* Two dimensions means that the [ operator defined for the data.frame class takes 2 arguments that are interpreted as row and column indexes.
When you are iterating over the columns, using x[,i] = NULL removes the column, reducing the number of columns by 1. Unless i is the last column, this will produce errors for future values of i. You should instead do something like this
good_cols = apply(x,2,function(x) {!any(is.na(x))})
x = x[,good_cols]
apply(x,margin,function) applies function over the margin dimension (rows for the value of 1, columns for the value of 2; 3 or higher is possible with arrays) of x, which is more efficient than looping (and doesn't cause errors from changing x partway).
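For completeness, the same column dropping can be written as a base-R one-liner without an explicit apply, by counting the NAs in each column (just another equivalent idiom, not part of the original answers):
# keep only the columns that contain no missing values
x <- x[, colSums(is.na(x)) == 0, drop = FALSE]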

How to apply operation and sum over columns in R?

I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use one of the apply variations (of which you are seeing many in the other answers). Even then, you can still combine it with rowSums, like so. Note that I use sapply, which also returns a matrix.
# random custom function
myfun <- function(x){
  return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest converting the data set to the 'long' format, grouping it by sample, and then calculating the result. Here is a solution using data.table:
library(data.table)
melt(setDT(x), id.vars = 'sample')[, sum(value^2), by = sample]
# sample V1
#1: 1 65
#2: 2 89
#3: 3 117
You can easily replace value^2 by any function you want.
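For readers who prefer the tidyverse, a rough equivalent of this long-format approach (my own sketch using tidyr and dplyr, not part of the original answer) would be:
library(dplyr)
library(tidyr)
x <- data.frame(sample = 1:3, a = 4:6, b = 7:9)
x %>%
  pivot_longer(c(a, b), names_to = "variable", values_to = "value") %>%
  group_by(sample) %>%
  summarise(result = sum(value^2))
# sample result: 65, 89, 117, the same values as above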
You can use the apply function, selecting the columns you need with c(i1, i2, .., etc):
apply((x[, c(2, 3)])^2, 1, sum)
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do:
# if somefunction can be vectorized :
x$results<-apply(x[,col_indices],1,function(x) sum(somefunction(x)))
# if not :
x$results<-apply(x[,col_indices],1,function(x) sum(sapply(x,somefunction)))
I want to come at this one from a "no extensions" R point of view.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors: each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define the function as the square, as you have above, but of course this can be any function of any complexity (so long as it takes a vector as input and returns a vector of the same length; if it doesn't, it won't fit into the original data.frame!).
The steps below are extra pedantic to show each little bit, but obviously they can be compressed into one or two steps. Note that I only retain the sum of the squares for each row, given that you might want to save memory if you are working with lots and lots of data.
1. Create the data; define the function.
2. Grab the columns you want as a separate (temporary) data.frame.
3. Apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to retain it separately, make it a temporary data.frame (this is not strictly necessary).
5. Calculate the sums of the rows of the temporary data.frame and append the result as a new column in x.
6. Remove the temp data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
Here is another apply solution:
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))
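As a quick sanity check that these approaches agree on the example data (4^2 + 7^2 = 65, 5^2 + 8^2 = 89, 6^2 + 9^2 = 117), using the x and cols defined just above:
x$result                  # 65 89 117 from the apply version
rowSums(x[, cols]^2)      # 65 89 117 from the vectorised version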

Return value from column indicated in same row

I'm stuck with a simple loop that takes more than an hour to run, and need help to speed it up.
Basically, I have a matrix with 31 columns and 400 000 rows. The first 30 columns have values, and the 31st column has a column-number. I need to, per row, retrieve the value in the column indicated by the 31st column.
Example row: [26,354,72,5987..,461,3] (this means that the value in column 3 is sought after (72))
The too slow loop looks like this:
a <- rep(0, nrow(data))  # to pre-allocate memory
for (i in 1:nrow(data)) {
  a[i] <- data[i, data[i, 31]]
}
I would think this would work:
a <- data[,data[,31]]
... but it results in "Error: cannot allocate vector of size 2.8 Mb".
I fear that this is a really simple question, so I've spent hours trying to understand apply, lapply, reshape, and more, but somehow I can't get a grip on the vectorization concept in R.
The matrix actually has even more columns that also go into the a-parameter, which is why I don't want to rebuild the matrix, or split it.
Your support is highly appreciated!
Chris
t(data[,1:30])[30*(0:399999)+data[,31]]
This works because you can index a matrix both in array format and in vector format (here, a long vector of 30 * 400000 values), counting down the columns first. To count across the rows instead, you use the transpose.
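A small reproducible check of that linear-indexing idea, on a toy matrix instead of the 400,000-row data (my own illustration, not from the original answer):
# 3 rows, 4 data columns, plus a 5th column holding the column number to pick
m <- cbind(matrix(1:12, nrow = 3), c(2, 4, 1))
# transpose the data columns, then index the resulting long vector
t(m[, 1:4])[4 * (0:2) + m[, 5]]
# [1]  4 11  3   (row 1 column 2, row 2 column 4, row 3 column 1)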
Single-index notation for the matrix may use less memory. This would involve doing something like:
i <- nrow(data)*(data[,31]-1) + 1:nrow(data)
a <- data[i]
Below is an example of single-index notation for matrices in R. In this example, the index of the per-row maximum is appended as the last column of a random matrix. This last column is then used to select the per-row maxima via single-index notation.
## create a random (10 x 5) matrix
M <- matrix(rpois(50,50),10,5)
## use the last column to index the maximum value of the first 5
## columns
MM <- cbind(M,apply(M,1,which.max))
## column ID row ID
i <- nrow(MM)*(MM[,ncol(MM)]-1) + 1:nrow(MM)
all(MM[i] == apply(M,1,max))
Using an index matrix is an alternative that will probably use more memory but is slightly clearer:
ii <- cbind(1:nrow(MM),MM[,ncol(MM)])
all(MM[ii] == apply(M,1,max))
Try to change the code to work a column at a time:
M <- matrix(rpois(30*400000,50),400000,30)
MM <- cbind(M,apply(M,1,which.max))
a <- rep(0,nrow(MM))
for (i in 1:(ncol(MM)-1)) {
  a[MM[, ncol(MM)] == i] <- MM[MM[, ncol(MM)] == i, i]
}
This fills the elements of a with the values from column i for the rows whose last column equals i. It took longer to build the matrix than to calculate the vector a.
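For reference, the index-matrix form from the answer above can also be applied directly to the asker's object (a sketch, assuming data is the 400,000 x 31 matrix described in the question):
# one (row, column) pair per row of data; no transpose or copy of the data needed
a <- data[cbind(seq_len(nrow(data)), data[, 31])]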
