I'm working with some data that has hundreds of covariates, so I decided to write some functions to make pre-processing much faster and cleaner (like scaling certain numeric variables). An important part of all of these functions is type-checking the columns before I apply a particular function to them.
Here is my function for scaling continuous columns:
# rm (vector): names of columns not to be scaled
scale.continuous <- function(df, rm=NULL) {
  cols <- setdiff(colnames(df), rm)
  for(col in cols) {
    if(is.numeric(df[,col])){
      df[,col] <- as.numeric(scale(df[,col]))
    }
  }
  df
}
This works perfectly fine if I load the data frame using read.csv(), but the data I have is huge so the speed boost of using read_csv() from readr/tidyverse is significant. Unfortunately, if I load my data using read_csv() all of my functions break.
I narrowed down the issue to the type-checking, specifically when type-checking a column I am accessing by a string of its column name. Here's some code to demonstrate what I mean:
# When using read.csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] TRUE
# When using read_csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] FALSE
I realized the issue here was that indexing the data frame with a string the way I do above returns a tibble instead of a plain column like other methods of indexing do. What I don't understand is why this behavior exists, why is.numeric() (or any type-check) does not work with a tibble, and in general why there is this difference in the way the default and tidyverse data frames are constructed. Also, it would be nice to know if there is a parameter I can change in read_csv() that will make this type of indexing behave the same as with a default data frame.
I should mention, I realize there are probably better ways of writing this code (for example, just using df$"col" to index fixes the issue), but I still don't understand what the root of the issue was with my first approach. I am now working with much larger data sets that require much more involved pre-processing than what I have been used to in the past so I want to have as complete an understanding of the data structures I am using as possible.
Tibbles have slightly different default behaviour from regular data frames when you use the [ extraction operator, which can be a bit of a gotcha. Specifically, df[,"col"] on a tibble returns a one-column tibble, whereas on a regular data frame it returns a vector. So you need to use:
df[["col"]]
Or explicitly state that you want to coerce to the lowest dimension and do:
df[, "col", drop = TRUE]
From the documentation:
df[, j] returns a tibble; it does not automatically extract the column
inside. df[, j, drop = FALSE] is the default.
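As a minimal sketch of the difference, using base R only: the one-column data frame you get from drop = FALSE stands in here for what a tibble returns by default, so no extra package is needed. The names scale_continuous and the toy df are made up for illustration, not taken from the question.

```r
# Hypothetical rewrite of the question's function: [[ ]] always extracts the
# column itself, for both regular data frames and tibbles.
scale_continuous <- function(df, rm = NULL) {
  cols <- setdiff(colnames(df), rm)
  for (col in cols) {
    if (is.numeric(df[[col]])) {
      df[[col]] <- as.numeric(scale(df[[col]]))
    }
  }
  df
}

# Toy data (assumption, for illustration only)
df <- data.frame(id = c("a", "b", "c"), x = c(1, 2, 3), y = c(10, 20, 30))

# A one-column data frame is a list, so the type check fails on it
one_col <- df[, "x", drop = FALSE]
check_one_col <- is.numeric(one_col)

# The extracted column is a plain numeric vector
check_vector <- is.numeric(df[["x"]])

# Scale every numeric column except y
scaled <- scale_continuous(df, rm = "y")
```

Because [[ ]] does not depend on the drop default, the same function should behave identically whether the data came from read.csv() or read_csv().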
I have a problem with selecting a variable that should contain a certain range of values. I want to split my variable into 3 categories, namely small, medium and big. Some context: I have a variable named obj_hid_woonopp, which is a size in m2 ranging from 16 to 375, and my dataset is called datalogitvar.
I'm sorry I have no reproducible code, but since I think it's a rather simple question I hope it can be answered nonetheless. The code that I'm using is as follows:
datalogitvar$size_small<- as.numeric(obj_hid_WOONOPP>="15" & obj_hid_WOONOPP<="75" )
datalogitvar$size_medium<- as.numeric(obj_hid_WOONOPP>="76" & obj_hid_WOONOPP<="100" )
datalogitvar$size_large<- as.numeric(obj_hid_WOONOPP>="101")
When I run this, I do get a result, just not the result I'm hoping for. For example, the small category also contains very high numbers. It seems that (since I define "75") it also takes values of "175", since that contains "75". I've been thinking about it and I feel it reads my data as text and not as numbers. However, I do say as.numeric, so I'm a bit confused. Can someone explain to me how to make sure I create these 3 variables with the proper ranges? I feel I'm close, but the result is useless so far.
Thank you so much for helping.
For a question like this you can replicate your problem with a publicly available dataset like mtcars.
And regarding your code
1) You need to prefix the variable with the dataset name (DATASET$obj_hid_WOONOPP) on the right-hand side of your code.
2) Why are you using quotes around your numeric values? These quotes prevent the numbers from being treated as numbers. They are instead treated as string values.
I think you want to use something like the code I've written below.
mtcars$mpg_small <- as.numeric(mtcars$mpg >= 15 & mtcars$mpg <= 20)
mtcars$mpg_medium <- as.numeric(mtcars$mpg > 20 & mtcars$mpg <= 25)
mtcars$mpg_large <- as.numeric(mtcars$mpg > 25)
Just to illustrate your problem:
a <- "75"
b <- "175"
a > b
# TRUE ("75" > "175")
a < b
# FALSE ("75" < "175")
Strings are compared lexicographically, not by numeric value: "7" sorts after "1", so "75" comes out greater than "175".
Two ideas come to mind, though an example of code would be helpful.
First, look into the documentation for cut(), which can be used to convert a numeric vector into a factor based on cut-points that you set.
Second, as @MrFlick points out, your code could be rewritten so that as.numeric() is run first on the character vector containing the strings you want to convert to numeric values, and only THEN are Boolean comparisons such as > or & performed.
To build on @Joe's answer:
mtcars$mpg_small <- (as.numeric(mtcars$mpg) >= 15 &
(as.numeric(mtcars$mpg) <= 20))
Also be careful, if your vector of strings obj_hid_WOONOPP contains some values that cannot be coerced into numerics, they will become NA.
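To make the cut() suggestion concrete, here is a small sketch. The vector size_chr is made-up data standing in for obj_hid_woonopp stored as text, and the break points simply mirror the ranges in the question.

```r
# Hypothetical sizes stored as text (assumption, for illustration only)
size_chr <- c("16", "75", "76", "100", "101", "175", "375")

# Comparing strings goes wrong: "175" sorts before "75"
string_compare <- "175" <= "75"

# Convert to numbers first, then bin with cut().
# With the default right = TRUE the intervals are (15,75], (75,100], (100,Inf]
size <- as.numeric(size_chr)
size_class <- cut(size,
                  breaks = c(15, 75, 100, Inf),
                  labels = c("small", "medium", "large"))

# Count how many observations fall into each class
class_counts <- table(size_class)
```

If you still want three separate 0/1 columns, something like as.numeric(size_class == "small") would give them from the factor.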
I am trying to do 10-fold cross-validation in R. Each iteration of the for loop generates a new row with several named columns, and I want the results of each iteration to go under the matching columns so that at the end I can compute the average of each column. The results generated in one iteration may belong to different columns than those of the previous iteration, so the column names also have to be checked. Is it possible to do this somehow? Or would it be better to just compute the column averages on the spot?
for(i in seq(from=1, to=8200, by=820)){
  fold <- df_vector[i:i+819,]
  y_fold_vector <- df_vector[!(rownames(df_vector) %in% rownames(folding)),]
  alpha_coefficient <- solve(K_training, y_fold_vector)
  test_points <- df_matrix[rownames(df_matrix) %in% rownames(K_training), colnames(df_matrix) %in% rownames(folding)]
  predictions <- rbind(predictions, crossprod(alpha_coefficient,test_points))
}
You are running into the precedence of binary (dyadic) operators in R. The line should be:
fold <- df_vector[ i:(i+819), ]
Consider:
> i=1
> i:i+189
[1] 190
Lack of a simple example (or any comments on what your code is supposed to be doing) prevents any testing of the rest of the code, but you can find the precedence of operators at ?Syntax. Unary "-" is higher than ":", but binary "+" is lower.
(It's also unclear what the folding vector is supposed to be. You only defined a fold value and it wasn't a vector since you addressed it as you would a dataframe.)
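A quick sketch of what the precedence rule means for the fold indices. The start rows come from the seq() call in the question; the names starts and folds are made up for illustration.

```r
i <- 1

# Parsed as (i:i) + 819, so this is a single number, not a range
wrong <- i:i + 819

# Parentheses give the intended 820 consecutive row indices
right <- i:(i + 819)

# Start row of each fold, as in the question's for loop
starts <- seq(from = 1, to = 8200, by = 820)

# One index vector per fold, each covering 820 rows
folds <- lapply(starts, function(s) s:(s + 819))
fold_sizes <- lengths(folds)
```

With the parentheses in place the ten index vectors cover rows 1 to 8200 without overlap.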
I've got this dataset
install.packages("combinat")
install.packages("quantmod")
library(quantmod)
library(combinat)
library(utils)
getSymbols("AAPL",from="2012-01-01")
data<-AAPL
p1<-4
dO<-data[,1]
dC<-data[,4]
emaO<-EMA(dO,n=p1)
emaC<-EMA(dC,n=p1)
Pos_emaO_dO_UP<-emaO>dO
Pos_emaO_dO_D<-emaO<dO
Pos_emaC_dC_UP<-emaC>dC
Pos_emaC_dC_D<-emaC<dC
Pos_emaC_dO_D<-emaC<dO
Pos_emaC_dO_UP<-emaC>dO
Pos_emaO_dC_UP<-emaO>dC
Pos_emaO_dC_D<-emaO<dC
Profit_L_1<-((lag(dC,-1)-lag(dO,-1))/(lag(dO,-1)))*100
Profit_L_2<-(((lag(dC,-2)-lag(dO,-1))/(lag(dO,-1)))*100)/2
Profit_L_3<-(((lag(dC,-3)-lag(dO,-1))/(lag(dO,-1)))*100)/3
Profit_L_4<-(((lag(dC,-4)-lag(dO,-1))/(lag(dO,-1)))*100)/4
Profit_L_5<-(((lag(dC,-5)-lag(dO,-1))/(lag(dO,-1)))*100)/5
Profit_L_6<-(((lag(dC,-6)-lag(dO,-1))/(lag(dO,-1)))*100)/6
Profit_L_7<-(((lag(dC,-7)-lag(dO,-1))/(lag(dO,-1)))*100)/7
Profit_L_8<-(((lag(dC,-8)-lag(dO,-1))/(lag(dO,-1)))*100)/8
Profit_L_9<-(((lag(dC,-9)-lag(dO,-1))/(lag(dO,-1)))*100)/9
Profit_L_10<-(((lag(dC,-10)-lag(dO,-1))/(lag(dO,-1)))*100)/10
which are put into this data frame:
frame<-data.frame(Pos_emaO_dO_UP,Pos_emaO_dO_D,Pos_emaC_dC_UP,Pos_emaC_dC_D,Pos_emaC_dO_D,Pos_emaC_dO_UP,Pos_emaO_dC_UP,Pos_emaO_dC_D,Profit_L_1,Profit_L_2,Profit_L_3,Profit_L_4,Profit_L_5,Profit_L_6,Profit_L_7,Profit_L_8,Profit_L_9,Profit_L_10)
colnames(frame)<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D","Profit_L_1","Profit_L_2","Profit_L_3","Profit_L_4","Profit_L_5","Profit_L_6","Profit_L_7","Profit_L_8","Profit_L_9","Profit_L_10")
Here is a vector of variable names for later use:
vector<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D")
I made all possible combinations of 4 variables from the vector (the dependent variables are not included):
comb<-as.data.frame(combn(vector,4))
comb
and removed the "nonsense" combinations (those that contain both possible values of the same variable):
rc<-comb[!sapply(comb, function(x) any(duplicated(sub('_D|_UP', '', x))))]
rc
Then I prepare the first combination for later subsetting:
var<-paste(rc[,1],collapse=" & ")
var
and subset the frame (with all DVs)
kr<-eval(parse(text=paste0('subset(frame,' , var,')' )))
kr
Now I have the data frame subsetted by the first combination of 4 variables.
Then I used the evaluation function on it
evaluation<-function(x){
s_1<-nrow(x[x$Profit_L_1>0,])/nrow(x)
s_2<-nrow(x[x$Profit_L_2>0,])/nrow(x)
s_3<-nrow(x[x$Profit_L_3>0,])/nrow(x)
s_4<-nrow(x[x$Profit_L_4>0,])/nrow(x)
s_5<-nrow(x[x$Profit_L_5>0,])/nrow(x)
s_6<-nrow(x[x$Profit_L_6>0,])/nrow(x)
s_7<-nrow(x[x$Profit_L_7>0,])/nrow(x)
s_8<-nrow(x[x$Profit_L_8>0,])/nrow(x)
s_9<-nrow(x[x$Profit_L_9>0,])/nrow(x)
s_10<-nrow(x[x$Profit_L_10>0,])/nrow(x)
n_1<-nrow(x[x$Profit_L_1>0,])/nrow(frame)
n_2<-nrow(x[x$Profit_L_2>0,])/nrow(frame)
n_3<-nrow(x[x$Profit_L_3>0,])/nrow(frame)
n_4<-nrow(x[x$Profit_L_4>0,])/nrow(frame)
n_5<-nrow(x[x$Profit_L_5>0,])/nrow(frame)
n_6<-nrow(x[x$Profit_L_6>0,])/nrow(frame)
n_7<-nrow(x[x$Profit_L_7>0,])/nrow(frame)
n_8<-nrow(x[x$Profit_L_8>0,])/nrow(frame)
n_9<-nrow(x[x$Profit_L_9>0,])/nrow(frame)
n_10<-nrow(x[x$Profit_L_10>0,])/nrow(frame)
pr_1<-sum(kr[,"Profit_L_1"])/nrow(kr[,kr=="Profit_L_1"])
pr_2<-sum(kr[,"Profit_L_2"])/nrow(kr[,kr=="Profit_L_2"])
pr_3<-sum(kr[,"Profit_L_3"])/nrow(kr[,kr=="Profit_L_3"])
pr_4<-sum(kr[,"Profit_L_4"])/nrow(kr[,kr=="Profit_L_4"])
pr_5<-sum(kr[,"Profit_L_5"])/nrow(kr[,kr=="Profit_L_5"])
pr_6<-sum(kr[,"Profit_L_6"])/nrow(kr[,kr=="Profit_L_6"])
pr_7<-sum(kr[,"Profit_L_7"])/nrow(kr[,kr=="Profit_L_7"])
pr_8<-sum(kr[,"Profit_L_8"])/nrow(kr[,kr=="Profit_L_8"])
pr_9<-sum(kr[,"Profit_L_9"])/nrow(kr[,kr=="Profit_L_9"])
pr_10<-sum(kr[,"Profit_L_10"])/nrow(kr[,kr=="Profit_L_10"])
mat<-matrix(c(s_1,n_1,pr_1,s_2,n_2,pr_2,s_3,n_3,pr_3,s_4,n_4,pr_4,s_5,n_5,pr_5,s_6,n_6,pr_6,s_7,n_7,pr_7,s_8,n_8,pr_8,s_9,n_9,pr_9,s_10,n_10,pr_10),ncol=3,nrow=10,dimnames=list(c(1:10),c("s","n","pr")))
df<-as.data.frame(mat)
return(df)
}
result<-evaluation(kr)
result
I need help with several things.
1, In the evaluation function, the matrix is built the wrong way: s_1, n_1, pr_1 all end up going down the first column, but I need the values to be filled in by row.
2, I need some loop/lapply construct to go through all possible combinations (not only the first one, as in var<-paste(rc[,1],collapse=" & ") above) and produce readable output in which the evaluation function is applied to every combination. I need to be able to see which combination of variables each evaluation belongs to, so that I can compare the results across combinations.
3, This is not the main point, BUT in general I want to evaluate all possible combinations (that is, for 2:n variables, and every combination within each size) and then pick the best combination according to a specific DV (Profit_L_1, Profit_L_2 and so on). I am still weak at looping, so if possible, please keep in mind what I plan to do with this later.
Thanks. Feel free to update, repair or improve the question. If something could be done more easily or efficiently, go ahead; I am open to any sensible advice.
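One possible direction for points 1 and 2, sketched on toy data rather than on the quantmod series. The small frame, the three signal names and the single Profit_L_1 column are all made up for illustration; the s, n and pr statistics follow the definitions in the evaluation function above. This is only a sketch under those assumptions, not a drop-in replacement.

```r
# Toy stand-in for the question's frame (assumption, for illustration only)
frame <- data.frame(
  A_UP = c(TRUE, TRUE, FALSE, TRUE),
  B_D  = c(TRUE, FALSE, TRUE, TRUE),
  C_UP = c(FALSE, TRUE, TRUE, TRUE),
  Profit_L_1 = c(1.5, -0.5, 2.0, 0.5)
)

# Point 1: byrow = TRUE fills the matrix row by row instead of column by column
mat <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 3, byrow = TRUE,
              dimnames = list(1:2, c("s", "n", "pr")))

# Point 2: build every pair of signals and name each by its members
combos <- combn(c("A_UP", "B_D", "C_UP"), 2, simplify = FALSE)
names(combos) <- vapply(combos, paste, character(1), collapse = " & ")

# Evaluate each combination without eval(parse())
results <- lapply(combos, function(vars) {
  keep <- Reduce(`&`, frame[vars])       # rows where every signal is TRUE
  sub <- frame[keep, , drop = FALSE]
  c(s  = mean(sub$Profit_L_1 > 0),               # share of winners in the subset
    n  = sum(sub$Profit_L_1 > 0) / nrow(frame),  # winners relative to all rows
    pr = mean(sub$Profit_L_1))                   # average profit in the subset
})

# One row per combination, labelled by the combination's name
summary_table <- do.call(rbind, results)
```

Since the row names carry the combination, the rows of summary_table can be compared directly, and the same lapply could be repeated over other combination sizes.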
New to R and having a problem with a very simple task! I have read a few columns of .csv data into R. The variables take values in the natural numbers plus zero, and some values are missing. After trying to use the non-parametric package, I have two problems. First, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error "number of regression data and response data do not match". Why do I get this, given that I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If x looks like this:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() turns your vector into an ordered factor (see ?ordered and ?factor). A factor built from a plain vector is still atomic, so the "must be atomic" error suggests that x is not a plain vector in the first place.
EDIT2: So the problem was the way of subsetting the data. Note the difference between the various ways of subsetting: data$x and data[,i] (where i is the column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they take a data = (your data) argument, in which case they work with column names.
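A small sketch of those subsetting differences on a regular data frame. The toy data below is made up, and the complete.cases() step is just one way (an assumption, not something from the question) to keep x and y the same length once missing values are dropped.

```r
# Toy data with missing values in different rows (assumption, for illustration)
data <- data.frame(x = c(3, 4, NA, 2), y = c(1, 0, 2, NA))

# These three give a plain (atomic) vector
v1 <- data$x
v2 <- data[, 1]
v3 <- data[["x"]]

# These two give a one-column data frame, which is a list
d1 <- data["x"]
d2 <- data[1]

# ordered() works on the vector form
ord <- ordered(v1)

# Drop rows where either variable is missing so x and y stay aligned
keep <- complete.cases(data)
x_clean <- data$x[keep]
y_clean <- data$y[keep]
```

Passing v1-style vectors rather than d1-style data frames is what avoids the "must be atomic" error.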