I am working in R on a multiclass classification problem and want to use e1071. How is scaling done for multiclass classification? On this page, they say that
“A logical vector indicating the variables to be scaled. If scale is of length 1, the value is recycled as many times as needed. Per default, data are scaled internally (both x and y variables) to zero mean and unit variance. The center and scale values are returned and used for later predictions.”
I am wondering how y is scaled. When we have m classes we have m columns for y, which have different means and variances. So after scaling y, the same class gets a different number in each column, which does not make sense to me.
Could you please let me know what is going on during scaling? I am curious to know.
Also, I am wondering what this means:
"If scale is of length 1, the value is recycled as many times as needed."
Let's have a look at the information for the argument scale:
A logical vector indicating the variables to be scaled. If scale is of length 1, the value is recycled as many times as needed. Per default, data are scaled internally (both x and y variables) to zero mean and unit variance.
The value expected here is a logical vector (so a vector of TRUE and FALSE). If this vector has as many values as you have columns in your matrix, then the columns are scaled or not according to your vector (e.g. with svm(..., scale = c(TRUE, FALSE, TRUE), ...) the first and third columns are scaled while the second one is not). If you give a single value, such as scale = TRUE, it is recycled, i.e. repeated for every column, so all columns are treated the same way.
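To make the recycling concrete, here is a minimal base-R sketch. The helper scale_selected is hypothetical and not part of e1071; it only mimics the idea of a logical vector that is recycled to the number of columns and then decides which columns get scaled.

```r
# Hypothetical helper (not an e1071 function): scale only the flagged columns
scale_selected <- function(x, flags = TRUE) {
  x <- as.matrix(x)
  # a single TRUE/FALSE is recycled to one flag per column
  flags <- rep(flags, length.out = ncol(x))
  if (any(flags)) {
    x[, flags] <- scale(x[, flags, drop = FALSE])
  }
  x
}

m <- cbind(a = c(1, 2, 3), b = c(10, 20, 30), c = c(5, 7, 9))
scale_selected(m, c(TRUE, FALSE, TRUE))  # column b is left untouched
scale_selected(m, TRUE)                  # TRUE is recycled to all three columns
```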
What happens during scaling is explained in the third sentence quoted above: "data are scaled [...] to zero mean and unit variance". To do this:
you subtract the mean of a column from each value of that column (this is called centering), and
then you divide each value of this column by the column's standard deviation (this is the actual scaling).
You can reproduce the scaling with the following example:
# create a data.frame with four variables
# as you can see the difference between each term of aa and bb is one
# and the difference between each term of cc is 21.63 while dd is random
(df <- data.frame(aa = 11:15,
bb = 1:5,
cc = 1:5*21.63,
dd = rnorm(5,12,4.2)))
# then we subtract the mean of each column from this column and
# put everything back together into a data.frame
(df1 <- as.data.frame(sapply(df, function(x) {x-mean(x)})))
# you can observe that now the mean value of each column is 0 and
# that aa==bb because the difference between each term was the same
# now we divide each column by its standard deviation
(df1 <- as.data.frame(sapply(df1, function(x) {x/sd(x)})))
# as you can see, the first three columns are now equal because the
# only difference between them was that cc == 21.63*bb
# the data frame df1 is now identical to what you would obtain by
# using the default scaling function `scale`
(df2 <- scale(df))
Scaling is necessary when your columns represent data on different scales. For example, if you wanted to distinguish individuals that are obese from lean ones you could collect their weight, height and waist-to-hip ratio. Weight would probably have values ranging from 50 to 95 kg, while height would be around 175 cm (± 20 cm) and waist-to-hip could range from 0.60 to 0.95. All these measurements are on different scales so that it is difficult to compare them. Scaling the variables solves this problem. Moreover, if one variable reaches high numerical values while the other ones do not, this variable will likely be given more importance during multivariate algorithms. Therefore scaling is advisable in most cases for such methods.
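Scaling does affect the mean and the variance of each variable, but as the same transformation is applied to every row (whatever class it belongs to), this is not a problem. As for y: as far as I can tell from the e1071 source, y is only scaled when it is numeric (regression); a factor response used for classification is left as it is.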
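The quoted documentation also says that the center and scale values are kept and used for later predictions. A minimal sketch of that idea in base R, with made-up data (the variable names and numbers are assumptions for illustration only): the new observation is scaled with the training means and standard deviations, not with its own.

```r
set.seed(1)
# made-up training data on two different scales
train <- data.frame(weight = rnorm(20, 70, 10), height = rnorm(20, 175, 8))
new_obs <- data.frame(weight = 82, height = 168)

train_scaled <- scale(train)
centers <- attr(train_scaled, "scaled:center")  # column means of the training data
sds <- attr(train_scaled, "scaled:scale")       # column standard deviations

# apply the training centering and scaling to the new observation
new_scaled <- scale(new_obs, center = centers, scale = sds)
new_scaled
```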
Working in R, I need to create a vector of length n with the values randomly drawn from a Poisson distribution with lambda=1, but with a lower bound of 2 and upper bound of 6 (i.e. all numbers will be either 2,3,4,5, or 6).
I am unsure how to do this. I tried creating a for loop that would replace any values outside that range with values inside the range:
set.seed(123)
n<-25 #example length
example<-rpois(n,1)
test<-example #redundant - only duplicating to compare with original *example* values
for (i in 1:length(n)){
if (test[i]<2||test[i]>6){
test[i]<-rpois(1,1)
}
}
But this didn't seem to work (still getting 0's and 1, etc, in test). Any ideas would be greatly appreciated!
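Your loop only runs once, because n is a single number and length(n) is therefore 1; you would need 1:n or seq_along(test). Even then, the replacement rpois(1,1) can again fall outside the range. Here is one way to generate n Poisson-distributed numbers and replace every value outside the range with a random number inside it.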
n<-25 #example length
example<-rpois(n,1)
inds <- example < 2 | example > 6
example[inds] <- sample(2:6, sum(inds), replace = TRUE)
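Note that the replacement above draws uniformly from 2:6, so the result is no longer Poisson-shaped. If what you want is a Poisson distribution truncated to 2..6, one possible sketch (an alternative technique, not what the code above does) is to sample directly from 2:6 with weights proportional to the Poisson probabilities:

```r
set.seed(123)
n <- 25
support <- 2:6

# Poisson(1) probabilities restricted to 2..6;
# sample() rescales the weights so they sum to one
truncated <- sample(support, n, replace = TRUE,
                    prob = dpois(support, lambda = 1))
table(truncated)
```

With lambda = 1 most of the draws will be 2, since larger values are much less likely.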
I am new to statistics, so I apologize if this question is trivial.
I have a variable that is normally distributed with a range between -15 and +15 like the following one:
df <- data.frame("weight" = runif(1000, min=-15, max=15), stringsAsFactors = FALSE)
The median and mean value of this variable is 0.
I need to transform this variable to use it as a weight in my regression. For substantive reasons, it does not make any sense to have negative values in my variable (it is itself the result of previous transformations).
Negative values of my variable should simply reduce the effects of my main explanatory variable (hence should be bounded between 0 and 1) while positive values should have a multiplicative effect on my explanatory variable (greater than 1). While values close to 0 of my weight should have no effect on my explanatory variable (close to 1).
Hence I would like to centre my variable so that the minimum value of my weight is 0 and the median value becomes 1. I do not want to put constraints on the maximum value, though this will necessarily change the mean (it will become greater than 1). I am not concerned about this provided that the median remains 1.
So far I have considered standardizing the variable between 0 and 2:
library(BBmisc)
df$normalizedweight <- normalize(df$weight, method = "range",
range = c(0, 2))
however, this operation puts an unnecessary constraint to my normalized variable as the effect of my weight can be greater than a factor of two, while
To clarify, in the real data, negative values of the weight perfectly mirror positive values of the weight. Ideally, once I have standardized the data, multiplying the same number by the maximum and by the minimum value of the weight should increase and decrease it by the same proportion.
For example, take a response value of 5: if the maximum value of my weight is 10, the minimum value should be 0.1, so that 5*10 and 5*0.1 are a proportional increase and decrease of my original value by a factor of 10.
I thank you in advance for all the help you are able to provide
Best
One option is to use the exponential transformation. All your negative values will be between 0 and 1, and all your positive values will be over 1. And your median will be close to 1.
Moreover, as exp() will create very large values (exp(15) = 3 269 017), you can first divide your values by their maximum.
sample <- runif(10000, min=-15, max=15)
sample_transform = exp(sample / max(sample))
median(sample_transform)
# [1] 0.9930663
hist(sample_transform)
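If you want the largest weight to correspond to a specific factor (10 in your example) and the mirrored negative weight to its reciprocal (0.1), the same idea works with another base. This is only a sketch: the helper to_multiplier and the value max_factor = 10 are assumptions for illustration, not something given in the question's data.

```r
# Hypothetical helper: map a symmetric weight to a multiplicative factor.
# w_max is the largest absolute weight; it maps to max_factor,
# -w_max maps to 1 / max_factor, and 0 maps to 1.
to_multiplier <- function(w, max_factor = 10, w_max = max(abs(w))) {
  max_factor^(w / w_max)
}

set.seed(42)
w <- runif(1000, min = -15, max = 15)
w_mult <- to_multiplier(w, max_factor = 10, w_max = 15)

range(w_mult)   # stays within [0.1, 10]
median(w_mult)  # close to 1 because the median of w is close to 0

# mirrored weights give reciprocal factors
to_multiplier(5, 10, 15) * to_multiplier(-5, 10, 15)
```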
I have used SMOTE in R to create new data and this worked fine. When I did further research into how exactly SMOTE works, I couldn't find out how SMOTE handles categorical data.
In the paper, an example is shown (page 10) with just numeric values. But I still do not know how SMOTE creates new data from categorical example data.
This is the link to the paper:
https://arxiv.org/pdf/1106.1813.pdf
That indeed is an important thing to be aware of. In terms of the paper that you are referring to, Sections 6.1 and 6.2 describe possible procedures for the cases of nominal-continuous and just nominal variables. However, DMwR does not use something like that.
If you look at the source code of SMOTE, you can see that the main work is done by DMwR:::smote.exs. I'll now briefly explain the procedure.
The summary is that the order of factor levels matters and that currently there seems to be a bug regarding factor variables which makes things work oppositely. That is, if we want to find an observation close to one with a factor level "A", then anything other than "A" is treated as "close" and those with level "A" are treated as "distant". Hence, the more factor variables there are, the fewer levels they have, and the fewer continuous variables there are, the more drastic the effect of this bug should be.
So, unless I'm wrong, the function should not be used with factors.
As an example, let's consider the case of perc.over = 600 with one continuous and one factor variable. We then arrive at smote.exs with the sub-data frame corresponding to the undersampled class (say, 50 rows) and proceed as follows.
Matrix T contains all but the class variables. Columns corresponding to the continuous variables remain unchanged, while factors or characters are coerced into integers. It means that the order of factor levels is essential.
Next we generate 50 * 6 = 300 new observations. We do so by creating 6 new observations (n = 1, ..., 6) for each of the 50 present ones (i = 1, ..., 50).
We scale the data by xd <- scale(T, T[i, ], ranges) so that xd shows deviations from the i-th observation. E.g., for i = 1 we may have
# [,1] [,2]
# [1,] 0.00000000 0.00
# [2,] -0.13333333 0.25
# [3,] -0.26666667 0.25
meaning that the continuous variable for rows 2 and 3 is smaller than for i = 1, but that the factor levels of rows 2 and 3 are "higher".
Then by running for (a in nomatr) xd[, a] <- xd[, a] == 0 we ignore most of the information in the second column related to factor level deviations: we set the deviation to 1 for those cases that have the same factor level as the i-th observation, and to 0 otherwise. (I believe it should be the opposite, meaning that it's a bug; I'm going to report it.)
Then we set dd <- drop(xd^2 %*% rep(1, ncol(xd))), which can be seen as a vector of squared distances of each observation from the i-th one, and kNNs <- order(dd)[2:(k + 1)] gives the indices of the k nearest neighbours. It is purposefully 2:(k + 1), as the first element should be i (its distance should be zero). However, in this case the first element is not always i because of the == 0 recoding in the previous step, which confirms the bug.
Now we create the n-th new observation similar to the i-th one. First we pick one of the nearest neighbours, neig <- sample(1:k, 1). Then difs <- T[kNNs[neig], ] - T[i, ] is the component-wise difference between this neighbour and the i-th observation, e.g.,
difs
# [1] -0.1 -3.0
meaning that the neighbour has lower values in terms of both variables.
The new case is constructed by running T[i, ] + runif(1) * difs, which is indeed a convex combination of the i-th observation and the neighbour. This line is for the continuous variable(s) only. For the factors we have c(T[kNNs[neig], a], T[i, a])[1 + round(runif(1), 0)], which means that the new observation will have the same factor level as the i-th observation with 50% chance, and the same as the chosen neighbour with the other 50% chance. So, this is a kind of discrete interpolation.
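The effect of the == 0 recoding on the distances can be reproduced on a toy example. This is a sketch that follows the steps as described above, not the DMwR source itself: the data are made up, the matrix is called Tmat rather than T (T is also shorthand for TRUE in R), and ranges is assumed to be the per-column maximum minus minimum.

```r
# Toy data: column x is continuous, column f is an integer-coded factor
Tmat <- cbind(x = c(1.0, 1.1, 3.0, 5.0), f = c(1, 1, 2, 1))
ranges <- apply(Tmat, 2, max) - apply(Tmat, 2, min)  # assumed definition
i <- 1

# deviations of every row from the i-th row, divided by the column ranges
xd_raw <- scale(Tmat, center = Tmat[i, ], scale = ranges)

# recoding as described above: 1 if same level as row i, 0 otherwise
xd <- xd_raw
xd[, "f"] <- xd[, "f"] == 0
dd <- drop(xd^2 %*% rep(1, ncol(xd)))
order(dd)        # row 3 (different level) comes first, not row i

# the opposite recoding: 0 if same level as row i, 1 otherwise
xd_fixed <- xd_raw
xd_fixed[, "f"] <- xd_fixed[, "f"] != 0
dd_fixed <- drop(xd_fixed^2 %*% rep(1, ncol(xd_fixed)))
order(dd_fixed)  # row i comes first, followed by the closest same-level row
```

With the first recoding, the row with a different factor level is treated as the nearest one, which is the behaviour described above as a bug.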
Once again I have been set a programming task, most of which I have done, so here is a quick run-through. I had to take n samples from a multivariate normal distribution with dimension p (call it X) and put them into a matrix (Matx). For each row, the first two values were summed along with a value randomly drawn from the standard normal distribution (call this vector Y). Then we had to order Y numerically and split it up into H groups. I then had to find the mean of each row in the matrix, and I now have to order them according to the Y group they are associated with. I've struggled a fair bit and have now hit a brick wall. I understand this is quite confusing; if anyone could help it'd be greatly appreciated!
Task: Return the p x H matrix which has in the first column the mean of the observations in the first group and in the Hth column the mean of the observations in the Hth group.
Code:
library('MASS')
x<-mvrnorm(36,0,1)
Matx<-matrix(c(x), ncol=6, byrow=TRUE)
v<-rnorm(6)
y1<-sum(x[1:2],v[1])
y2<-sum(x[7:8],v[2])
y3<-sum(x[13:14],v[3])
y4<-sum(x[19:20],v[4])
y5<-sum(x[25:26],v[5])
y6<-sum(x[31:32],v[6])
y<-c(y1,y2,y3,y4,y5,y6)
out<-order(y)
h1<-c(out[1:2])
h2<-c(out[3:4])
h3<-c(out[5:6])
x1<-c(x[1:6])
x2<-c(x[7:12])
x3<-c(x[13:18])
x4<-c(x[19:24])
x5<-c(x[25:30])
x6<-c(x[31:36])
mx1<-mean(x1)
mx2<-mean(x2)
mx3<-mean(x3)
mx4<-mean(x4)
mx5<-mean(x5)
mx6<-mean(x6)
d<-c(mx1,mx2,mx3,mx4,mx5,mx6)[order(out)]
d
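One possible reading of the task can be sketched as follows. Several things here are assumptions rather than part of the original code: the values n = 36, p = 6 and H = 3, the use of independent rnorm draws (equivalent to a multivariate normal with identity covariance) in place of mvrnorm, groups of equal size, and the interpretation of "mean of the observations in a group" as the p-dimensional vector of column means.

```r
set.seed(1)
n <- 36; p <- 6; H <- 3   # assumed sizes

# n draws from a p-dimensional standard normal, one observation per row
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Y: sum of the first two coordinates plus standard normal noise
y <- X[, 1] + X[, 2] + rnorm(n)

# rank the observations by y and cut the ranks into H equally sized groups
group <- ceiling(rank(y, ties.method = "first") * H / n)

# p x H matrix: column h is the mean vector of the observations in group h
M <- sapply(seq_len(H), function(h) colMeans(X[group == h, , drop = FALSE]))
dim(M)
```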
I am trying to generate a plot from a dataset of 2 columns - the first column contains distances and the second contains correlations of something measured at those distances.
Now there are multiple entries with the same distance but different correlation values. I want to take the average of these entries and generate a plot of distance versus correlation. So, this is what I did (the dataset is called correlationtable):
bins <- sort(unique(correlationtable[,1]))
corr <- tapply(correlationtable[,2],correlationtable[,1],mean)
plot(bins,corr,type = 'l')
However, this gives me an error saying that the lengths of bins and corr don't match.
I cannot figure out what I am doing wrong.
I tried it with some random data and it worked for me every time. To track down the error, you would need to supply a concrete example that did not work for you.
However, to answer the question, here is an alternative way to do the same thing:
corr <- tapply(correlationtable[,2],correlationtable[,1],mean)
bins <- as.numeric(names(corr))
plot(bins,corr,type = 'l')
This uses the fact that tapply returns a names attribute holding the grouping values, which is then converted to numeric and used as the distance. By construction it has the same length as corr.
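A small made-up example (the numbers are invented for illustration) shows what the two objects look like:

```r
# toy data: repeated distances with different correlations
correlationtable <- data.frame(
  distance    = c(1, 1, 2, 2, 2, 3.5),
  correlation = c(0.9, 0.7, 0.5, 0.6, 0.4, 0.2)
)

# mean correlation per distance; the names of corr are the distances
corr <- tapply(correlationtable[, 2], correlationtable[, 1], mean)
bins <- as.numeric(names(corr))

bins  # one entry per distinct distance
corr  # one mean per distinct distance, ready for plot(bins, corr, type = "l")
```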