Can I set a benchmark with several data sets in R?

This is just a premature thought, but say I have X datasets, and they have Y characteristics, e.g. artificial vs. real (non-artificial) data.
Is it possible to compute an index ranging from 0 to 1 based on the X datasets' Y characteristics? I have not yet conceptualised what exactly 0 or 1 would mean.
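A purely illustrative sketch (all characteristic names below are made up) of one way such an index could be computed, assuming the Y characteristics can be expressed numerically: rescale each characteristic to [0, 1] and average them into a single index:
# one row per data set, one column per (made-up) characteristic
chars <- data.frame(
  n_obs     = c(100, 2500, 640),
  n_vars    = c(5, 12, 8),
  prop_miss = c(0.02, 0.10, 0.00)
)
# min-max rescale each characteristic to [0, 1], then average into one index
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
index <- rowMeans(sapply(chars, rescale01))
index  # one value between 0 and 1 per data set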

Related

Divide a set of curves into groups using functional data analysis

I have a dataset containing about 500 curves. In my dataset every row is a curve (obtained from experimental measurements) and the columns are the measurement intervals (I don't think it's important, but the intervals are frequencies, not times).
Here you can find the data:
https://drive.google.com/file/d/1q1F1any8RlCIrn-CcQEzLWyrsyTBCCNv/view?usp=sharing
curves      t1        t2
1        -57.48    -57.56
2        -56.22    -56.28
3        -57.06    -57.12
I want to divide this dataset into 2 - 4 homogeneous groups of curves.
I've seen that there are some packages in R (fda and funHDDC) that allow you to find clusters, but I don't know how to create the list with which to start the analysis, and I also don't understand why the initial dataset doesn't fit. How can I transform the data I have into a list suitable for processing with the above packages?
What results should I expect?
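A minimal sketch of one possible workflow with fda and funHDDC (the file name, basis size and column layout are assumptions about the data, not taken from it):
library(fda)
library(funHDDC)
# assuming the CSV has one row per curve and one column per frequency point
mat <- as.matrix(read.csv("curves.csv", row.names = 1))
argvals <- seq_len(ncol(mat))
# represent each curve in a B-spline basis; fda expects curves in columns
basis <- create.bspline.basis(rangeval = range(argvals), nbasis = 25)
fdobj <- smooth.basis(argvals, t(mat), basis)$fd
# try 2 to 4 clusters; funHDDC selects the best model by BIC
res <- funHDDC(fdobj, K = 2:4)
table(res$class)  # cluster sizes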

SVM applied to images using the Julia language

I would like to write a program involving a support vector machine (SVM) in Julia, using Python's scikit-learn.
So I need a training set (X_train) and the corresponding labels (y_train).
As an example, I have a set X of type Array, 500 x 3 (this is my full data).
From it, I take a subset A of X of type Array, 110 x 3, and a subset B of X of type Array, 120 x 3.
The union of sets A and B will be my X_train training set.
My y_train set is obtained as follows:
Every vector (row) of A is associated with the number 0, and every vector of B is associated with the number 1.
So I have the following training input/output set:
T = {(a1, 0), (a2, 0), ..., (a110,0), (b1, 1), (b2, 1), ..., (b120, 1)}.
These sets A and B are disjoint and were obtained from an RGB image that I reshaped into the 500 x 3 array X.
After building the set T, I should use Python's scikit-learn library from Julia, with the svc.fit() method and probably svc.predict(), to create a prediction model for the remaining pixels that were not used for training.
Could someone help me make this code?
I am not able to associate every element from A to 0 and every element from B to 1.
To concatenate I am using vcat(A, B) and defining X_train = vcat(A, B).
I've tried many things and I'm stuck.
In summary:
I have a set X of vectors with 3 coordinates, and I would like to associate a numerical value between 0 and 1 with each vector of X. So I took a part of X as the training set X_train. This X_train set is the union of two disjoint subsets A and B, where for each vector of A I know that y is 0 and for each vector of B I know that y is 1.
I would now like to train on X_train and obtain a function y that gives me a value between 0 and 1 for the whole set X, via SVM. I believe there are many other supervised approaches, but with this one I can get going.
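A minimal sketch of just the label/feature construction, in Julia since that is the language of this question (the array sizes follow the question; the scikit-learn call is a rough, commented-out assumption):
# A is 110 x 3 and B is 120 x 3, disjoint subsets of the 500 x 3 array X
X_train = vcat(A, B)                    # 230 x 3 feature matrix
y_train = vcat(zeros(Int, size(A, 1)),  # label 0 for every row of A
               ones(Int, size(B, 1)))   # label 1 for every row of B
# with ScikitLearn.jl, roughly:
# using ScikitLearn
# @sk_import svm: SVC
# model = fit!(SVC(), X_train, y_train)
# y_pred = predict(model, X)  # predictions for all 500 rows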

How is scaling done in multi-class classification SVM?

I am working with R to solve a multi-class classification problem. I want to use e1071. How is scaling done for multi-class classification? On this page, they say that
“A logical vector indicating the variables to be scaled. If scale is of length 1, the value is recycled as many times as needed. Per default, data are scaled internally (both x and y variables) to zero mean and unit variance. The center and scale values are returned and used for later predictions.”
I am wondering how y is scaled. When we have m classes we have m columns for y, which have different means and variances. So after scaling y, we get different numbers in each column for the same class! And that doesn't make sense to me.
Could you please let me know what is going on in scaling? I am so curious to know that.
Also I am wondering what this means:
"If scale is of length 1, the value is recycled as many times as needed."
Let's have a look at the documentation for the argument scale:
A logical vector indicating the variables to be scaled. If scale is of length 1, the value is recycled as many times as needed. Per default, data are scaled internally (both x and y variables) to zero mean and unit variance.
The value expected here is a logical vector (so a vector of TRUE and FALSE). If this vector has as many values as you have columns in your matrix, then the columns are scaled or not according to your vector (e.g. if you have svm(..., scale = c(TRUE, FALSE, TRUE), ...) the first and third columns are scaled while the second one is not).
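As a minimal sketch using the built-in iris data (not from the question), scale = TRUE is simply recycled to one TRUE per column, while a full-length vector lets you choose columns individually:
library(e1071)
x <- iris[, 1:3]   # three numeric predictors
y <- iris$Species
# scale = TRUE (length 1) is recycled to c(TRUE, TRUE, TRUE)
m1 <- svm(x, y, scale = TRUE)
# scale only the first and third columns, leave the second unscaled
m2 <- svm(x, y, scale = c(TRUE, FALSE, TRUE))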
What happens during scaling is explained in the third sentence quoted above: "data are scaled [...] to zero mean and unit variance". To do this:
you subtract from each value of a column the mean of that column (this is called centering), and
then you divide each value of the column by the column's standard deviation (this is the actual scaling).
You can reproduce the scaling with the following example:
# create a data.frame with four variables
# as you can see the difference between each term of aa and bb is one
# and the difference between each term of cc is 21.63 while dd is random
(df <- data.frame(aa = 11:15,
bb = 1:5,
cc = 1:5*21.63,
dd = rnorm(5,12,4.2)))
# then we subtract the mean of each column from that column and
# put everything back together into a data.frame
(df1 <- as.data.frame(sapply(df, function(x) {x-mean(x)})))
# you can observe that now the mean value of each column is 0 and
# that aa==bb because the difference between each term was the same
# now we divide each column by its standard deviation
(df1 <- as.data.frame(sapply(df1, function(x) {x/sd(x)})))
# as you can see, the first three columns are now equal because the
# only difference between them was that cc == 21.63*bb
# the data frame df1 is now identical to what you would obtain by
# using the default scaling function `scale`
(df2 <- scale(df))
Scaling is necessary when your columns represent data on different scales. For example, if you wanted to distinguish individuals that are obese from lean ones you could collect their weight, height and waist-to-hip ratio. Weight would probably have values ranging from 50 to 95 kg, while height would be around 175 cm (± 20 cm) and waist-to-hip could range from 0.60 to 0.95. All these measurements are on different scales so that it is difficult to compare them. Scaling the variables solves this problem. Moreover, if one variable reaches high numerical values while the other ones do not, this variable will likely be given more importance during multivariate algorithms. Therefore scaling is advisable in most cases for such methods.
Scaling does affect the mean and the variance of each variable but as it is applied equally to each row (potentially belonging to different classes) this is not a problem.

How does SMOTE create new data from categorical data?

I have used SMOTE in R to create new data and this worked fine. When I was doing further research on how exactly SMOTE works, I couldn't find an answer to how SMOTE handles categorical data.
In the paper, an example is shown (page 10) with just numeric values. But I still do not know how SMOTE creates new data from categorical example data.
This is the link to the paper:
https://arxiv.org/pdf/1106.1813.pdf
That indeed is an important thing to be aware of. In terms of the paper that you are referring to, Sections 6.1 and 6.2 describe possible procedures for the cases of nominal-continuous and purely nominal variables. However, DMwR does not use anything like that.
If you look at the source code of SMOTE, you can see that the main work is done by DMwR:::smote.exs. I'll now briefly explain the procedure.
The summary is that the order of factor levels matters and that currently there seems to be a bug regarding factor variables which makes things work oppositely. That is, if we want to find an observation close to one with a factor level "A", then anything other than "A" is treated as "close" and those with level "A" are treated as "distant". Hence, the more factor variables there are, the fewer levels they have, and the fewer continuous variables there are, the more drastic the effect of this bug should be.
So, unless I'm wrong, the function should not be used with factors.
As an example, let's consider the case of perc.over = 600 with one continuous and one factor variable. We then arrive at smote.exs with the sub-data frame corresponding to the minority class (say, 50 rows) and proceed as follows.
Matrix T contains all but the class variables. Columns corresponding to the continuous variables remain unchanged, while factors or characters are coerced into integers. This means that the order of factor levels is essential.
Next we generate 50 * 6 = 300 new observations. We do so by creating 6 new observations (n = 1, ..., 6) for each of the 50 present ones (i = 1, ..., 50).
We scale the data by xd <- scale(T, T[i, ], ranges) so that xd shows deviations from the i-th observation. E.g., for i = 1 we may have
# [,1] [,2]
# [1,] 0.00000000 0.00
# [2,] -0.13333333 0.25
# [3,] -0.26666667 0.25
meaning that the continuous variable for i = 2, 3 is smaller than for i = 1, but that the factor levels of i = 2, 3 are "higher".
Then by running for (a in nomatr) xd[, a] <- xd[, a] == 0 we discard most of the information in the second column related to factor-level deviations: we set the deviation to 1 for those cases that have the same factor level as the i-th observation, and to 0 otherwise. (I believe it should be the opposite, meaning that it's a bug; I'm going to report it.)
Then we set dd <- drop(xd^2 %*% rep(1, ncol(xd))), which can be seen as a vector of squared distances of each observation from the i-th one, and kNNs <- order(dd)[2:(k + 1)] gives the indices of the k nearest neighbours. It purposefully is 2:(k + 1), as the first element should be i itself (its distance should be zero). However, due to the previous step the first element is not always i in this case, which confirms the bug.
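To see this concretely, here is a small self-contained sketch (with made-up data, not code from DMwR) that reproduces the deviation and distance computation described above; note that the observation "closest" to row 1 is not row 1 itself:
# made-up data: column 1 is continuous, column 2 is a factor coerced to integer codes
T <- cbind(c(1.0, 0.8, 0.6), as.integer(factor(c("A", "B", "B"))))
ranges <- apply(T, 2, max) - apply(T, 2, min)
i <- 1
nomatr <- 2                                      # index of the (coerced) factor column
xd <- scale(T, center = T[i, ], scale = ranges)  # deviations from the i-th row
xd[, nomatr] <- xd[, nomatr] == 0                # 1 = same level as row i, 0 = different
dd <- drop(xd^2 %*% rep(1, ncol(xd)))            # squared "distances" from row i
dd         # row 1 is not the closest observation to itself
order(dd)  # 2 1 3, so order(dd)[1] is not i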
Now we create the n-th new observation, similar to the i-th one. First we pick one of the nearest neighbours, neig <- sample(1:k, 1). Then difs <- T[kNNs[neig], ] - T[i, ] is the component-wise difference between this neighbour and the i-th observation, e.g.,
difs
# [1] -0.1 -3.0
Meaning that the neighbour has lower values in terms of both variables.
The new case is constructed by running T[i, ] + runif(1) * difs, which is indeed a convex combination of the i-th observation and the neighbour. This line is for the continuous variable(s) only. For the factors we have c(T[kNNs[neig], a], T[i, a])[1 + round(runif(1), 0)], which means that the new observation will have the same factor level as the i-th observation with 50% probability, and the same as the chosen neighbour with the other 50%. So this is a kind of discrete interpolation.

Consumer surplus in R

I'm trying to do some econometric analysis using R and can't figure out how to do the analysis I'm looking for. Specifically, I want to calculate consumer surplus.
I am trying to predict the number of trips (dependent variable) based on variables like water quality, scenery, parking, etc. I've run a regression of my dependent variable on my independent variables using:
lm()
and also got my predicted values using:
y_hat <- as.matrix(mydata[c("y")])
Now I want to calculate the consumer surplus for each individual (~260 total) from my predicted (y_hat) values.
Welcome to R. I studied economics in college and wish R had been taught. You will find that the programming language is very useful in your work.
Note that R is able to accomplish vectorized operations that may speed up your analysis. Consider:
(mydata <- data.frame(x = letters[1:3], y = 1:3))
#   x y
# 1 a 1
# 2 b 2
# 3 c 3
Let's say your predicted 'y' is 1.25.
y_hat <- 1.25
You can subtract the entire column of the dataset from that number, and it will go row by row for you without you having to write complicated for-loops.
y_hat - mydata[c("y")]
#       y
# 1  0.25
# 2 -0.75
# 3 -1.75
Without more information about your particular issue, that is all the help that I can offer. In the future, add a reproducible example that illustrates your data and the specific issue that you are stuck on.
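Since you already fit the model with lm(), here is a hedged sketch of how the per-individual fitted values are usually obtained and then compared in a vectorized way (all variable names below are placeholders for yours):
# placeholder names; replace with the variables in your data
fit <- lm(trips ~ water_quality + scenery + parking, data = mydata)
y_hat <- predict(fit)            # one fitted value per individual (~260)
resids <- mydata$trips - y_hat   # vectorized: no for-loop needed
head(resids)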
