Learning Functions [Error: undefined columns selected] - r

R newbie here.
I'm learning functions, and I have a problem running this:
newfunction = function(x) {
limit = ncol(x)
for(i in 1:limit){
if(anyNA(x[,i] == T)) {
x[,i] = NULL
}
}
}
newfunction(WBD_SA)
I get the error: Error in `[.data.frame`(x, , i) : undefined columns selected
I'm trying to remove all columns that have any NA values from my data set WBD_SA.
I know na.omit() removes rows with NA values, but I'm not sure if there is something similar for columns.
Any suggestions regarding packages/functions that can make this happen are also appreciated.
Cheers!

You are getting this error because you iterate from 1 to limit, where limit is the number of columns at the start of the function, while dropping columns from the data.frame inside the for loop. If you drop even one column, ncol(x) becomes smaller than limit, so a later x[,i] asks for a column that no longer exists. Here are 3 alternatives that work:
iterate backward:
for(i in limit:1)
if(anyNA(x[,i] == TRUE))
x[,i] = NULL
With the above loop, the i'th column is always in the same position as it was when the for loop started, because only columns to its right have been removed.
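iterate forward using a while loop, advancing i only when no column was dropped (after a drop, the next column slides into position i):
i = 1
while(i <= ncol(x)){
if(anyNA(x[,i])) {
x[,i] = NULL
} else {
i = i + 1
}
}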
use the fact that data.frames are subclasses of lists, and use sapply to create a logical index that is TRUE for columns that contain a missing value and FALSE otherwise, like so (lapply would return a list, which cannot be negated with !):
columnHasMissingValue <- sapply(x,function(y)any(is.na(y)))
x <- x[,!columnHasMissingValue]
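As long as you're learning about data.frames, it's useful to know that you can use negative indices to drop columns like so:
x <- x[,-which(columnHasMissingValue)]
Be careful with this form: if no column has a missing value, which() returns an empty vector and the result has no columns at all.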
Note that the above solution is similar to the apply solution in user1362215's solution, which takes advantage of the fact that data.frames have two dimensions* so you can apply a function over the second margin (columns) like so:
good_cols = apply(x,# the object over which to apply the function
2,# apply the function over the second margin (columns)
function(x) # the function to apply
!any(is.na(x))
)
x = x[,good_cols]
* 2 dimensions means that the [ operator defined for the data.frame class takes 2 arguments that are interpreted as row and column indexes.
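As a rough sketch of how the index-based approach could be wrapped up as a function: the name drop_na_cols and the toy data frame below are made up for illustration and are not part of the original question. Note that the function returns the reduced data frame, so the result has to be assigned. R functions work on a copy of their argument, which means the original newfunction would leave WBD_SA unchanged even with a corrected loop.

```r
# Hypothetical helper: returns a copy of x without the columns that contain NA.
drop_na_cols <- function(x) {
  # One TRUE/FALSE per column; vapply guarantees a logical vector.
  has_na <- vapply(x, anyNA, logical(1))
  # drop = FALSE keeps a data.frame even if only one column survives.
  x[, !has_na, drop = FALSE]
}

# Toy data (assumption, stands in for WBD_SA).
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"), c = c(NA, NA, NA))
cleaned <- drop_na_cols(toy)
names(cleaned)
```

With your own data this would be used as WBD_SA <- drop_na_cols(WBD_SA).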

When you are iterating over the columns, x[,i] = NULL removes the column, reducing the number of columns by 1. Every later column then shifts one position to the left, so the loop skips a column and eventually asks for an index beyond the last remaining column. You should instead do something like this:
good_cols = apply(x,2,function(x) {!any(is.na(x))})
x = x[,good_cols]
apply(x,margin,function) applies function over the margin dimension of x (rows for the value of 1, columns for the value of 2; 3 or higher is possible with arrays). This is more concise than an explicit loop and avoids the errors caused by changing x partway through.
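To make the margin argument concrete, here is a small sketch on a made-up 2 x 2 matrix (the object names are illustrative only). Margin 1 hands each row to the function, margin 2 hands it each column.

```r
# Toy matrix filled column by column: column 1 is (1, NA), column 2 is (3, 4).
m <- matrix(c(1, NA, 3, 4), nrow = 2)

row_has_na <- apply(m, 1, anyNA)  # one result per row
col_has_na <- apply(m, 2, anyNA)  # one result per column

# Keep only the columns without NA; drop = FALSE preserves the matrix shape.
m_clean <- m[, !col_has_na, drop = FALSE]
```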

Related

Add a column to empty data.frame

I want to initialise a column in a data.frame like so:
df$newCol = 1
where df is a data.frame that I have defined earlier and already done some processing on. As long as nrow(df)>0, this isn't a problem, but sometimes my data.frame has row length 0 and I get:
> df$newCol = 1
Error in `[[<-`(`*tmp*`, name, value = 1) :
1 elements in value to replace 0 elements
I can work around this by changing my original line to
df$newCol = rep(1,nrow(df))
but this seems a bit clumsy and is computationally prohibitive if the number of rows in df is large. Is there a built in or standard solution to this problem? Or should I use some custom function like so
addCol = function(df,name,value) {
if(nrow(df)==0){
df[,name] = rep(value,0)
}else{
df[,name] = value
}
df
}
If I understand correctly,
df = mtcars[0, ]
df$newCol = numeric(nrow(df))
should be it?
This assumes that by "row length" you mean nrow(df), in which case you need to assign a vector of length 0. In that case, numeric(nrow(df)) gives you exactly the same result as rep(0, nrow(df)).
It also assumes that you just need a new column, and not specifically a column of ones. If you do need ones, simply add 1 (numeric(nrow(df)) + 1), which is a vectorized operation and therefore fast.
Other than that, I'm not sure you can have an "empty" column: the vector must have the same number of elements as the other columns in the data frame. But numeric is fast, so it should not hurt.
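A minimal sketch of how this could be folded into one helper that behaves the same for empty and non-empty data frames. The name add_const_col is made up here; the only idea it relies on is that rep(value, nrow(df)) yields a zero-length vector when the data frame has no rows.

```r
# Hypothetical helper: add a constant column, whatever the number of rows.
add_const_col <- function(df, name, value) {
  # rep() produces a vector of exactly nrow(df) elements, including length 0.
  df[[name]] <- rep(value, nrow(df))
  df
}

empty <- add_const_col(mtcars[0, ], "newCol", 1)  # zero-row case
full  <- add_const_col(mtcars, "newCol", 1)       # ordinary case
```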

Applying a function to multiple rows of a time series in R

I have a time series with multiple columns, some have NAs in them, for example:
date.a<-seq(as.Date('2014-01-01'),as.Date('2014-02-01'),by = 2)
date.b<-seq(as.Date('2014-01-01'),as.Date('2014-02-15'),by = 3)
df.a <- data.frame(time=date.a, A=sin((1:16)*pi/8))
df.b <- data.frame(time=date.b, B=cos((1:16)*pi/8))
my.ts <- merge(xts(df.a$A,df.a$time),xts(df.b$B,df.b$time))
I'd like to apply a function to each of the rows, in particular:
prices2percreturns <- function(x){100*diff(x)/x}
I think that sapply should do the trick, but
sapply(my.ts, prices2percreturns)
gives Error in array(r, dim = d, dimnames = if (!(is.null(n1 <- names(x[[1L]])) & :
length of 'dimnames' [1] not equal to array extent. I suspect that this is due to the NAs when merging, but maybe I'm just doing something wrong. Do I need to remove the NAs or is there something wrong with the length of the vector returned by the function?
Per the comments, you don't actually want to apply the function to each row. Instead you want to leverage the vectorized nature of R. i.e. you can simply do this
100*diff(my.ts)/my.ts
If you do want to apply a function to each row of a matrix (which is what an xts object is), you can use apply with MARGIN=1. i.e. apply(my.ts, 1, myFUN).
sapply(my.ts, myFUN) would work like apply(my.ts, 2, myFUN) in this case -- applying a function to each column.
Your diff(x) will be 1 shorter than your x. Also, your returns will be based on the ending price of each period. You want returns based on the starting price, not the ending price. Here I change the function to reflect that and apply it per column.
prices2percreturns <- function(x){100*diff(x)/x[-length(x)]}
prcRets = apply(my.ts, 2, prices2percreturns)
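To see what the corrected function computes, here is a sketch on a plain numeric matrix with made-up prices (not the xts object from the question, and without NAs). head(x, -1) is used as an equivalent way of writing x[-length(x)].

```r
# Percentage return relative to the starting price of each period.
prices2percreturns <- function(x) 100 * diff(x) / head(x, -1)

# Toy prices (assumption): two series, three observations each.
prices <- cbind(A = c(100, 110, 99), B = c(50, 55, 66))

# One column of returns per price series, one row fewer than the input.
rets <- apply(prices, 2, prices2percreturns)
rets
```

Series A goes up 10% and then down 10%; series B goes up 10% and then 20%. With the merged series from the question, any NA in a column carries through to the returns that involve it.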

R: Test condition on column of dataframe elements within list; return smaller list

My goal is take a list of dataframes, see if a specific column of the data frames has a max value of 0, and if so, remove that data frame from my list.
Right now I am looping over names of the list. Given that this is R, there must be a better way. I feel I need some function applied through lapply() to get this right. I've also considered ddply() but I think that maybe overkill. Here is what I have so far:
# Make df of First element
myColumn <- rep ("ElementA",times=10)
values <- seq(1,10)
a <- data.frame(myColumn,values)
# Make df of second element
myColumn <- rep ("ElementB",times=10)
values <- rep(0,10)
b <- data.frame(myColumn,values)
# Bind the dataframes together
df <- rbind(a,b)
#Now split the dataframes based on element name
myList <- split(df,df$myColumn)
# Now loop through element lists and check for max of 0 in values
for (name in names(myList)) { # Loop through List
if (max(myList[[name]]$values) == 0) { # Check Max for 0
myList <- myList[[-names]] # If 0, remove element from list
} # Close If
} # Close Loop
Error in -names : invalid argument to unary operator
I've tested my code outside the loop, and it all seems to work.
Any help is greatly appreciated. Thanks!
You can use this:
myList <- myList[sapply(myList, function(d) max(d$values) != 0)]
instead of the for() loop. Data frames with zero rows will pass this filter, with a warning, because max() of an empty vector is -Inf.
To ensure empty dataframes are removed, use this:
myList <- myList[sapply(myList, function(d) if(nrow(d)==0) FALSE else max(d$values)!=0)]
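An alternative sketch using base R's Filter(), which keeps the list elements for which a predicate returns TRUE. The predicate name keep and the extra empty element ElementC are made up for illustration.

```r
# Toy list of data frames (ElementC is an added zero-row case).
myList <- list(
  ElementA = data.frame(values = 1:10),
  ElementB = data.frame(values = rep(0, 10)),
  ElementC = data.frame(values = numeric(0))
)

# && short-circuits, so max() is never called on an empty column.
keep <- function(d) nrow(d) > 0 && max(d$values) != 0

filtered <- Filter(keep, myList)
names(filtered)
```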

create equation function acting across rows in R

I have a dataframe similar to the one this creates:
dummy=data.frame(c(1,2,3,4),c("a","b","c","d"));colnames(dummy)=c("Num","Let")
dummy$X1=rnorm(4,35,6)
dummy$X2=rnorm(4,35,6)
dummy$X3=rnorm(4,35,6)
dummy$X4=rnorm(4,35,6)
dummy$X5=rnorm(4,35,6)
dummy$X6=rnorm(4,35,6)
dummy$X7=rnorm(4,35,6)
dummy$X8=rnorm(4,35,6)
dummy$X9=rnorm(4,35,6)
dummy$X10=rnorm(4,35,6)
dummy$Xmax=apply(dummy[3:12],1,max)
only the real thing is 260*13000 cells roughly
what I aim to do is implement the equation below to each row in a set of columns defined by data[x:x] (in the example those within columns dummy[3:12])
TSP = Sum( (1-(Xi/Xmax)) /(n-1))
where Xi is each individual value within the row & among the columns of interest (i signifying each column, ie there is an X1, an X2, an X3... value for each row), Xmax is the largest of all those values in the row (as defined in the dummy$Xmax column), and n is the number of columns selected (in the case of the example: n=10). In the actual data set I will be selecting 26 columns.
I would like to create a tidy little function which performs this calculation and deposits each row's value in to a column called dummy$TSP and does so for all 13000 rows.
One crude solution is the following, but like I said I would like to get this in to some kind of tidy function, where I can select the columns and the rest is (nearly) automatic.
dummy$TSP<- ((((1-(dummy$X1/dummy$Xmax))/(10-1))
+(((1-(dummy$X2/dummy$Xmax))/(10-1))
...
+(((1-(dummy$X10/dummy$Xmax))/(10-1)))
I would also really appreciate answers which explain the process well so I will be more likely to be able to learn, thanks in advance!
If you know the columns you want to apply the function over, you can, as you suspect, use apply to apply the function over the rows of those columns, like so:
# Columns you want to use for this function (X1 to X10; column 13 is Xmax and must be left out)
cols <- 3:12
# Use apply to loop over rows
dummy$TSP <- apply( dummy[,cols] , 1 , FUN = function(x){ sum( ( 1 - ( x / max(x) ) ) / (length(x) - 1) ) } )
R is vectorised, so when apply passes a row to the function (as the argument x, a vector of 10 numbers), each arithmetic operation is performed on every element of that vector.
So x/max(x) returns a vector of 10 numbers: each value in the row divided by the maximum value in that row. Each result of 1 - x/max(x) is then divided by the number of columns minus 1. Finally, sum collapses these into the single value that the function returns.
A more vectorized solution would be to perform the inner function over all elements and then perform the sum operation for each row with the efficient rowSums function like this:
vars.to.use <- paste0("X", 1:10)
dummy$TSP <- rowSums((1-(dummy[vars.to.use]/dummy$Xmax))/(length(vars.to.use) - 1))
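One way the "tidy little function" could look, as a sketch only: the name tsp and its arguments are made up, and it computes the row maximum itself instead of relying on a precomputed Xmax column. It assumes the selected columns are numeric and contain no NA.

```r
# Hypothetical helper: TSP for each row over the selected columns.
tsp <- function(data, cols) {
  m <- as.matrix(data[, cols, drop = FALSE])
  row_max <- apply(m, 1, max)
  # m / row_max divides every element of row i by row_max[i],
  # because the vector is recycled down the columns.
  rowSums(1 - m / row_max) / (ncol(m) - 1)
}

# Small check on toy data (assumption): rows (2, 4, 4) and (1, 2, 4).
toy <- data.frame(a = c(2, 1), b = c(4, 2), c = c(4, 4))
toy$TSP <- tsp(toy, c("a", "b", "c"))
toy$TSP
```

For the example in the question the call would be dummy$TSP <- tsp(dummy, 3:12), or with column names, tsp(dummy, paste0("X", 1:10)).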

How to do a simple operation that normally requires double for loops in [R]?

I'm quite new to R and I'm coming from a C++ background. I have a data frame with multiple rows and columns. My question is how I can do this in a different manner, because it takes forever to run. I have over 60 thousand rows and around 15 columns. Is there a better way to do this? Help is greatly appreciated!
counter <-0
for(j in 7:length(SeaStateData[3,]))
{
for( i in 1:length(SeaStateData[,3]))
{
if(!is.na(SeaStateData[i,j]) & !is.na(SeaStateData[i+1,j]))
if(SeaStateData[i,j] == SeaStateData[i+1,j])
{
counter <- counter + 1
}
}
}
I'd try this:
nr <- nrow(SeaStateData)
nc <- ncol(SeaStateData)
counter <- sum(SeaStateData[1:(nr - 1), 7:nc] ==
SeaStateData[2:nr, 7:nc],
na.rm = TRUE)
The subsets represent two submatrices, with a relative offset of one row. The == operator will yield a logical vector (in this case a matrix, which is just a vector with added dimension information) containing TRUE if two items match, FALSE if they differ, and NA if one of them is NA. The sum over a logical vector counts all TRUE values. The na.rm argument tells it to drop NA values; otherwise the sum would be NA as well. sum(…, na.rm = TRUE) is roughly the same as sum(na.omit(…)).
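The offset comparison can be traced on a tiny made-up data frame (the names d and matches are illustrative, and all columns are compared here rather than only columns 7 onward).

```r
# Toy data (assumption): column a has one adjacent match and one NA,
# column b has three adjacent matches.
d <- data.frame(a = c(1, 1, 2, NA), b = c(3, 3, 3, 3))
nr <- nrow(d)

# Rows 1..(nr-1) compared element-wise with rows 2..nr.
matches <- d[1:(nr - 1), ] == d[2:nr, ]

# TRUE counts as 1; NA comparisons are dropped by na.rm = TRUE.
counter <- sum(matches, na.rm = TRUE)
counter
```

In column a only the first pair (1, 1) matches and the pair (2, NA) is ignored, and in column b all three pairs match, so counter is 4.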

Resources