I have a data set x that consists of 4 columns. When I apply range(x) I get a single answer for the whole data set. How can I get the range for each row across the 4 columns without using loops?
This is a typical case for functions of the *apply-family, which are technically loops with a special syntax. In your case, you can use
apply(X = x, MARGIN = 1, FUN = range)
This tells R to apply the function range() over all rows, as expressed by MARGIN = 1 (MARGIN = 2 would apply it over all columns instead).
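For example (a small sketch with made-up data, not your x): range() returns two values per row (the minimum and the maximum), so apply() gives a 2 × nrow(x) matrix with one column per original row; t() turns that back into one row per original row.
x <- matrix(rnorm(20), nrow = 5, ncol = 4)
t(apply(x, 1, range))   # a 5 x 2 matrix: the min and max of each row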
I am trying to compare multiple columns in two different dataframes in R. This has been addressed previously on the forum (Compare group of two columns and return index matches R), but this is a different scenario: I am trying to check whether a column in dataframe 1 falls between the range given by 2 columns in dataframe 2. Functions like match, merge, join, and intersect won't work here. I have been trying to use purrr::pluck but didn't get far. The dataframes are of different sizes.
Below is an example:
temp1.df <- mtcars
temp2.df <- data.frame(
  Cyl = sample(4:8, 100, replace = TRUE),
  Start = sample(1:22, 100, replace = TRUE),
  End = sample(1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)
My attempt:
temp1.df <- temp1.df %>% mutate (new_mpg = case_when (
temp1.df$cyl %in% temp2.df$Cyl & temp2.df$Start <= temp1.df$mpg & temp2.df$End >= temp1.df$mpg ~ 1
))
Error:
Error in mutate_impl(.data, dots) :
Column `new_mpg` must be length 32 (the number of rows) or one, not 100
Expected Result:
Compare temp1.df$cyl and temp2.df$Cyl. If they match, then -->
Check if temp1.df$mpg is between temp2.df$Start and temp2.df$End -->
If it is, create a new variable new_mpg with a value of 1.
It's hard to show the exact expected output here.
I realize I could loop over each row of temp1.df, but the original temp2.df has over 250,000 rows. An efficient solution would be much appreciated.
Thanks
temp1.df$new_mpg <- apply(temp1.df, 1, function(x) {
  # subset temp2.df to the rows with the same cylinder count as this row
  temp <- temp2.df[temp2.df$Cyl == x[2], ]
  # 1 if this row's mpg falls inside any Start/End interval of that subset, else 0
  ifelse(any(apply(temp, 1, function(y) {
    dplyr::between(as.numeric(x[1]), as.numeric(y[2]), as.numeric(y[3]))
  })), 1, 0)
})
Note that this makes some assumptions about the organization of your actual data. In particular, I can't refer to column names inside apply(), so I'm using positional indexes, which may very well change; you might want to rearrange your data between receiving it and calling apply(), or restrict and reorder the columns passed in, e.g. by apply(temp1.df[,c("mpg","cyl")]....
At any rate, this breaks your data set into lines, and each line is compared to a subset of the second dataset with the same Cyl count. Within this subset, it checks whether this line's mpg falls between() (from dplyr) Start and End for any row, and returns 1 if yes (or 0 if no). All these ones and zeros are then returned as a (named) vector, which can be placed into temp1.df$new_mpg.
I'm guessing there's a way to do this with rowwise, but I could never get it to work properly...
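For reference, one possible rowwise sketch (using the example temp1.df/temp2.df above; untested at the 250,000-row scale, and note that converting mtcars to a tibble drops its row names):
library(dplyr)
temp1.df <- temp1.df %>%
  rowwise() %>%
  mutate(new_mpg = as.integer(any(
    temp2.df$Cyl == cyl & temp2.df$Start <= mpg & temp2.df$End >= mpg
  ))) %>%
  ungroup()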
I have a data set that contains multiple attributes with integer values from 1 to 5, and I would like to rescale these attributes so that their values range from -1 to 1. The code I currently have is
newdata$Rats = rescale(newdata$Rats, to = c(-1,1), from=c(1,5))
where newdata is my dataset and Rats is one of my attributes. If I only had a few attributes to change, that would be fine, but I have about 30 or so. Is there a way to do this with a for loop, with R's select function, or possibly another way?
Use lapply():
newdata[, c(1:30)] <- lapply(newdata[, c(1:30)],
function(x) rescale(x, to = c(-1, 1), from = c(1, 5)))
In place of c(1:30), supply either a vector of the positions of your variables within your dataframe, or a vector of the names of your variables as strings.
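For example, using column names instead of positions (the names below are made up; this also assumes rescale() comes from the scales package, as in your snippet):
library(scales)
vars <- c("Rats", "Dogs", "Cats")   # replace with your ~30 attribute names
newdata[vars] <- lapply(newdata[vars],
                        function(x) rescale(x, to = c(-1, 1), from = c(1, 5)))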
R newbie here.
I'm learning functions, and I have a problem running this:
newfunction = function(x) {
  limit = ncol(x)
  for(i in 1:limit){
    if(anyNA(x[,i] == T)) {
      x[,i] = NULL
    }
  }
}
newfunction(WBD_SA)
I get the error: Error in `[.data.frame`(x, , i) : undefined columns selected
I'm trying to remove all columns that have any NA values from my data set WBD_SA.
I know na.omit() removes rows with NA values, but I'm not sure if there is something similar for columns.
Any suggestions regarding packages/functions that can make this happen are also appreciated.
Cheers!
You are getting this error because you are iterating from 1 to limit, where limit is the number of columns at the start of the function, and you're dropping columns from the data.frame as you iterate through the for loop. This means that if you drop even 1 column, ncol(x) will be less than limit by the time the for loop ends. I'll give you 3 alternatives that work:
iterate backward:
for(i in limit:1)
  if(anyNA(x[,i] == TRUE))
    x[,i] = NULL
with the above loop, the i'th column will always be in the same position as it was when the for loop started.
iterate forward using a while loop, advancing i only when no column is dropped (otherwise the column that slides into position i would be skipped):
i = 1
while(i <= ncol(x)){
  if(anyNA(x[,i] == TRUE)){
    x[,i] = NULL
  } else {
    i = i + 1
  }
}
use the fact that data.frames are subclasses of lists, and use lapply to build an index that is TRUE for columns that contain a missing value and FALSE otherwise; unlist() turns that list into a logical vector you can subset with, like so:
columnHasMissingValue <- unlist(lapply(x, function(y) any(is.na(y))))
x <- x[, !columnHasMissingValue]
as long as you're learning about data.frames, it's useful to know that you can use negative indices to drop columns, like so:
x <- x[,-which(columnHasMissingValue)]
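A tiny demonstration of this index-based approach (made-up data, since WBD_SA isn't shown here):
d <- data.frame(a = 1:3, b = c(1, NA, 3), c = c("u", "v", NA))
columnHasMissingValue <- unlist(lapply(d, function(y) any(is.na(y))))
d[, !columnHasMissingValue, drop = FALSE]        # keeps only column a
d[, -which(columnHasMissingValue), drop = FALSE] # same result via negative indices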
Note that the above solution is similar to the apply-based approach in user1362215's answer, which takes advantage of the fact that data.frames have two dimensions*, so you can apply a function over the second margin (columns) like so:
good_cols = apply(x,               # the object over which to apply the function
                  2,               # apply the function over the second margin (columns)
                  function(x)      # the function to apply
                    !any(is.na(x))
                  )
x = x[, good_cols]
* 2 dimensions means that the [ operator defined for the data.frame class takes 2 arguments that are interpreted as row and column indexes.
When you are iterating over the columns, using x[,i] = NULL removes the column, reducing the number of columns by 1. Unless i is the last column, this will produce errors for future values of i. You should instead do something like this
good_cols = apply(x,2,function(x) {!any(is.na(x))})
x = x[,good_cols]
apply(x,margin,function) applies function over the margin dimension (rows for the value of 1, columns for the value of 2; 3 or higher is possible with arrays) of x, which is more efficient than looping (and doesn't cause errors from changing x partway).
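For completeness, the same test can be written without an explicit loop or helper function, because is.na() on a whole data.frame returns a logical matrix; a minimal sketch of that idea:
# colSums(is.na(x)) counts the NAs in each column; keep only the all-zero columns
x <- x[, colSums(is.na(x)) == 0, drop = FALSE]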
I have a time series with multiple columns, some have NAs in them, for example:
library(xts)

date.a <- seq(as.Date('2014-01-01'), as.Date('2014-02-01'), by = 2)
date.b <- seq(as.Date('2014-01-01'), as.Date('2014-02-15'), by = 3)
df.a <- data.frame(time = date.a, A = sin((1:16)*pi/8))
df.b <- data.frame(time = date.b, B = cos((1:16)*pi/8))
my.ts <- merge(xts(df.a$A, df.a$time), xts(df.b$B, df.b$time))
I'd like to apply a function to each of the rows, in particular:
prices2percreturns <- function(x){100*diff(x)/x}
I think that sapply should do the trick, but
sapply(my.ts, prices2percreturns)
gives Error in array(r, dim = d, dimnames = if (!(is.null(n1 <- names(x[[1L]])) & :
length of 'dimnames' [1] not equal to array extent. I suspect that this is due to the NAs when merging, but maybe I'm just doing something wrong. Do I need to remove the NAs or is there something wrong with the length of the vector returned by the function?
Per the comments, you don't actually want to apply the function to each row. Instead you want to leverage the vectorized nature of R, i.e. you can simply do this:
100*diff(my.ts)/my.ts
If you do want to apply a function to each row of a matrix (which is what an xts object is), you can use apply with MARGIN=1. i.e. apply(my.ts, 1, myFUN).
sapply(my.ts, myFUN) would work like apply(my.ts, 2, myFUN) in this case -- applying a function to each column.
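A quick illustration of the vectorized version (continuing with my.ts from the question):
rets <- 100 * diff(my.ts) / my.ts
class(rets)   # still "xts" "zoo": the date index is preserved
head(rets)    # the first row is NA because diff() pads its result with a leading NA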
Your diff(x) will be 1 element shorter than your x. Also, as written, your returns are divided by the ending price; you want returns based on the starting price, not the end price. Here I change the function to reflect that and apply the function per column.
prices2percreturns <- function(x){100*diff(x)/x[-length(x)]}
prcRets = apply(my.ts, 2, prices2percreturns)
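Note that apply() returns a plain matrix, so the xts index is dropped. If you want the result back as an xts object, a sketch (assuming the my.ts and prices2percreturns defined above; each return is dated by the later of its two observations):
prcRets <- xts(prcRets, order.by = index(my.ts)[-1])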
I have a dataframe similar to the one this creates:
dummy=data.frame(c(1,2,3,4),c("a","b","c","d"));colnames(dummy)=c("Num","Let")
dummy$X1=rnorm(4,35,6)
dummy$X2=rnorm(4,35,6)
dummy$X3=rnorm(4,35,6)
dummy$X4=rnorm(4,35,6)
dummy$X5=rnorm(4,35,6)
dummy$X6=rnorm(4,35,6)
dummy$X7=rnorm(4,35,6)
dummy$X8=rnorm(4,35,6)
dummy$X9=rnorm(4,35,6)
dummy$X10=rnorm(4,35,6)
dummy$Xmax=apply(dummy[3:12],1,max)
only the real thing is roughly 260 × 13,000 cells
what I aim to do is apply the equation below to each row over a set of columns defined by data[x:x] (in the example, the columns in dummy[3:12])
TSP = Sum( (1-(Xi/Xmax)) /(n-1))
where Xi is each individual value within the row and among the columns of interest (i signifying each column, i.e. there is an X1, an X2, an X3... value for each row), Xmax is the largest of all those values in the row (as stored in the dummy$Xmax column), and n is the number of columns selected (in the case of the example: n = 10). In the actual data set I will be selecting 26 columns.
I would like to create a tidy little function which performs this calculation and deposits each row's value into a column called dummy$TSP, and does so for all 13,000 rows.
One crude solution is the following, but like I said, I would like to get this into some kind of tidy function, where I can select the columns and the rest is (nearly) automatic.
dummy$TSP<- ((((1-(dummy$X1/dummy$Xmax))/(10-1))
+(((1-(dummy$X2/dummy$Xmax))/(10-1))
...
+(((1-(dummy$X10/dummy$Xmax))/(10-1)))
I would also really appreciate answers which explain the process well, so that I'm more likely to learn from them. Thanks in advance!
If you know the columns you want to apply the function over, you can, as you suspect, use apply() to run the function over the rows of just those columns, like so:
# Columns you want to use for this function
cols <- c(3:12)
# Use apply to loop over rows
dummy$TSP <- apply(dummy[, cols], 1,
                   FUN = function(x){ sum((1 - (x / max(x))) / (length(x) - 1)) })
R is vectorised, so when apply() passes a row to the function (the row arrives as the argument x, which will be a vector of 10 numbers) and we perform some operation on it, R applies that operation to each element of the vector.
So in the first instance x/max(x) returns a vector of 10 numbers: each element of that row divided by the maximum value across those columns for that row. We also divide each result of 1 - x/max(x) by the number of columns minus 1. We then collate these into one value using sum(), which is returned from the function.
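If it helps to see exactly what that anonymous function receives, you can pull out a single row yourself (using the dummy data and cols defined above):
x <- unlist(dummy[1, cols])                  # row 1 of the selected columns, as a numeric vector
sum((1 - x / max(x)) / (length(x) - 1))      # the TSP value for row 1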
A more vectorized solution is to apply the inner expression to all elements at once and then compute the per-row sums with the efficient rowSums function, like this:
vars.to.use <- paste0("X", 1:10)
dummy$TSP <- rowSums((1-(dummy[vars.to.use]/dummy$Xmax))/(length(vars.to.use) - 1))
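As a quick sanity check that the looped and vectorized versions agree (using the dummy data from the question):
loop_version <- apply(dummy[, 3:12], 1,
                      function(x) sum((1 - x / max(x)) / (length(x) - 1)))
vec_version  <- rowSums((1 - dummy[vars.to.use] / dummy$Xmax) / (length(vars.to.use) - 1))
all.equal(unname(loop_version), unname(vec_version))   # should be TRUE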