How to use the BoxCoxTrans function in R?

I want to use the BoxCoxTrans function from the caret package in R to correct for skewness, but I can't get the result as a data frame. This is my R code:
df<-read.csv("dataSetNA1.csv",header=TRUE)
dd1 <- apply(df[2:61], 2, BoxCoxTrans) # all variables except the independent variable in the first column are numeric
dd1
$LT1Y_MXOD_AMT
Box-Cox Transformation

96249 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0       0   19594       0 1600000

Lambda could not be estimated; no transformation is applied

$MOBL_PRIN
Box-Cox Transformation

96249 data points used to estimate Lambda

Input data summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0  100000  191229  320000 1100000

Lambda could not be estimated; no transformation is applied
str(dd1)
I don't know how to get the result as a data frame. If I use the as.data.frame function, I get this error message:
dd2<-as.data.frame(dd1)
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
  cannot coerce class ""BoxCoxTrans"" to a data.frame
Please help me.

Here is one way to accomplish what you are after (I assume you are transforming the features):
library(caret)
data(cars)
#create a list with the BoxCox objects
g <- apply(cars, 2, BoxCoxTrans)
#use map2 from purrr to apply the models to new data
z <- purrr::map2(g, cars, function(x, y) predict(x, y))
#here the transformation is performed on the same data
#from which I estimated the BoxCox lambdas
B_trans = as.data.frame(do.call(cbind, z)) #to convert to data frame
head(data.frame(B_trans, cars), 20)
#output
speed dist speed.1 dist.1
1 4 0.8284271 4 2
2 4 4.3245553 4 10
3 7 2.0000000 7 4
4 7 7.3808315 7 22
5 8 6.0000000 8 16
6 9 4.3245553 9 10
7 10 6.4852814 10 18
8 10 8.1980390 10 26
9 10 9.6619038 10 34
10 11 6.2462113 11 17
11 11 8.5830052 11 28
12 12 5.4833148 12 14
13 12 6.9442719 12 20
14 12 7.7979590 12 24
15 12 8.5830052 12 28
16 13 8.1980390 13 26
17 13 9.6619038 13 34
18 13 9.6619038 13 34
19 13 11.5646600 13 46
20 14 8.1980390 14 26
The first two columns are the transformed data and the last two are the original data.
Another way is to incorporate the transformation of the features during training:
train(..., preProcess = "BoxCox", ...)
More on the matter: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/train
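For example, a minimal sketch of that option, using the built-in cars data and an illustrative lm model rather than the asker's data:
library(caret)
data(cars)
# train() estimates a Box-Cox lambda for each predictor and applies
# the transformation automatically before fitting the model
fit <- train(dist ~ speed, data = cars,
             method = "lm",
             preProcess = "BoxCox")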

In order to perform a Box-Cox transformation your data have to be positive, i.e. all values must be greater than 0. The reason is that the logarithm of 0 is -Inf. If your data contain zeros you can simply add 1 to each observation; a constant shift won't change the shape/skewness of your distribution.
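A minimal sketch of that shift, reusing the df and LT1Y_MXOD_AMT names from the question (assumed, since the full data isn't shown):
library(caret)
shifted <- df$LT1Y_MXOD_AMT + 1   # all values now >= 1
bc <- BoxCoxTrans(shifted)        # lambda can now be estimated
transformed <- predict(bc, shifted)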

A Box-Cox transformation is a transformation of your response variable. You can use the boxcox function from the MASS package to find out what transformation is needed. boxcox returns a lambda value; you raise your response, say y, to the power lambda, which gives a new response variable y*. Then just replace the y column in your old data frame by y*.
Note that if the resulting lambda is 0, you should apply a logarithmic transformation, ln(y).
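A minimal sketch of that workflow, using the built-in cars data as an illustrative stand-in for the asker's data frame:
library(MASS)
# profile the Box-Cox log-likelihood over a grid of lambdas
bc <- boxcox(dist ~ speed, data = cars, lambda = seq(-2, 2, 0.1))
lambda <- bc$x[which.max(bc$y)]   # lambda with the highest log-likelihood
# apply the transformation (use log when lambda is 0)
cars$dist_star <- if (abs(lambda) < 1e-8) log(cars$dist) else cars$dist^lambda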

Related

Coefplot for a chi square distribution

I was told to do a coefplot in R to visualise my data better. Therefore I first did a chi-squared test, and after I put my data into a table it looked like this:
                    1  2  3  5 6
5_min_blank        11 21 18 19 8
Boldstyle           6  7 14 10 2
Boldstyle_pause     9 22 19  8 0
Breaststroke        7 16 10  5 4
Breaststroke_pause  9 13 10  8 3
Diving             14 20 10 10 4
1-6 are categories and "Boldstyle" etc. are different sounds.
I then ran the test:
fit.swim <- chisq.test(X2, simulate.p.value = TRUE, B = 10000)
and got this result:
Pearson's Chi-squared test with simulated p-value (based on 10000 replicates)
data: X2
X-squared = 87.794, df = NA, p-value = 0.09479
Now I would like to do a coefplot with my data but I only get this error:
coefplot(fit.swim)
Error: $ operator is invalid for atomic vectors
Any ideas how to draw a nice plot?
Thank you very much for the help!
All the best
Marie
I think the reason you are getting that error is that coefplot requires a fitted model as input, in the form of an lm, glm or rxLinMod object.
In your case you have carried out a goodness-of-fit test that essentially compares the observed sample distribution with the expected probability distribution; there isn't a fitted model to plot coefficients from.
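If a model-based plot is still wanted, one hedged option (not a substitute for the chi-squared test) is to reshape the counts to long format and fit a Poisson glm, which coefplot can display; the column names below are illustrative, and only two sounds are shown:
library(coefplot)
# long-format counts for two of the sounds across three categories
long <- data.frame(
  count    = c(11, 21, 18, 6, 7, 14),
  sound    = rep(c("5_min_blank", "Boldstyle"), each = 3),
  category = factor(rep(1:3, times = 2))
)
fit <- glm(count ~ sound + category, family = poisson, data = long)
coefplot(fit)   # plots the fitted coefficients with confidence intervals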

Join two lists and sum their values in R

I have two lists, x and y:
> x
     carlo      monte simulation      model    quantum
        31         31          9          6          6
> y
      model      system temperature     quantum  simulation     problem
         15          15          15          13          13          12
What function should I use to obtain:
simulation      model    quantum
        22         21         19
I tried to merge them as in the example, but it gives me an error:
merge(x, y, by = intersect(names(x), names(y)))
produces:
Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns
There's no argument in that function specifying what to do with the values, so what would be the best function to use? intersect(names(x), names(y)) gives the names of the resulting list, but how do I sum the values together?
You can use Map in base R to return a list.
Map("+", x[intersect(names(x),names(y))], y[intersect(names(x),names(y))])
$simulation
[1] 22
$model
[1] 21
$quantum
[1] 19
or mapply to return a named vector which may be more useful.
mapply("+", x[intersect(names(x),names(y))], y[intersect(names(x),names(y))])
simulation model quantum
22 21 19
Using [intersect(names(x), names(y))] will not only subset the contents of x and y to the elements with intersecting names, but will also put the elements in matching order for the operation.
data
x <- list(carlo = 1, monte = 2, simulation = 9, model = 6, quantum = 6)
y <- list(model = 15, system = 8, temperature = 10, quantum = 13, simulation = 13, problem = "no")
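A compact variant of the same idea, computing the common names once (equivalent to the mapply call above):
common <- intersect(names(x), names(y))
mapply("+", x[common], y[common])
# simulation      model    quantum
#         22         21         19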
Simple name matching does the trick:
# subset those from x which have names in y also
x1 = x[names(x)[names(x) %in% names(y)]]
# x1
# simulation model quantum
# 9 6 6
# similarly, do it for y; note the order of names might differ from that in x
y1 = y[names(y)%in%names(x1)]
# y1
# model quantum simulation
# 15 13 13
# now order the names in both and then add.
x1[order(names(x1))]+y1[order(names(y1))]
# model quantum simulation
# 21 19 22
Base function merge() should do this with no issue so long as your fields make sense, but you need to include merge(..., all=TRUE), as in:
y <- data.frame(rbind(c(15,15,15,13,13,12)))
names(y) <- c("model","system","temperature","quantum","simulation","problem")
x <- data.frame(rbind(c(31,31,9,6,6)))
names(x) <- c("carlo","monte","simulation","model","quantum")
merge(x, y, by = c("simulation","model","quantum"), all = TRUE)
results in:
simulation model quantum carlo monte system temperature problem
1 9 6 6 31 31 NA NA NA
2 13 15 13 NA NA 15 15 12
Here you actually have data frames of length 1, not lists.

Normalize/scale data set

I have the following data set:
dat<-as.data.frame(rbind(10,8,2,7,10,10,1,10,14,9,2,6,10,8,10,8,10,10,7,11,10))
colnames(dat)<-"Score"
print(dat)
Score
10
8
2
7
10
10
1
10
14
9
2
6
10
8
10
8
10
10
7
11
10
These are the test scores which students obtained. A student could get a maximum of 15 or a minimum of 0 in this test (by the way, nobody got the max or the min); the lowest score actually obtained was 1 and the highest was 14.
Now, I want to normalize/scale this data to the range 0 to 20. How can I achieve this in Excel or in R?
My final goal is to normalize the scores in this test to the above scale and to compare them with another set of data for which the max and min are 5 and 0 respectively. How can I compare these two differently scaled data sets correctly against each other?
What I tried: I went through a lot of material on the internet and came up with the feature-scaling formula from Wikipedia,
x' = (x - min(x)) / (max(x) - min(x)).
Is this method reliable?
In your case I would use the feature-scaling formula you posted in your question. (x - min(x)) / (max(x) - min(x)) will essentially convert your test marks to the range 0-1.
Since the edges of the scale are indeed 0 and 15, not 1 and 14, your min(x) = 0 and your max(x) = 15. Once you have your marks between 0 and 1, you just multiply by 20.
i.e.
tests <- read.table(header=T, file='clipboard')
tests2 <- (tests - 0) / (15 - 0) #or equally tests / 15
And multiply by 20 to get marks between 0-20:
> tests2 * 20
Score
1 13.333333
2 10.666667
3 2.666667
4 9.333333
5 13.333333
6 13.333333
7 1.333333
8 13.333333
9 18.666667
10 12.000000
11 2.666667
12 8.000000
13 13.333333
14 10.666667
15 13.333333
16 10.666667
17 13.333333
18 13.333333
19 9.333333
20 14.666667
21 13.333333
The results are intuitive and the function is reliable. For example the person who scored 14/15 should get the highest mark (and very close to 20) which is the case here (after the transformation they scored 18.6666).
In Excel, if you want the normalized data to have a min of 0 and a max of 20, then we need to solve
y = A * x + b
for two points.
Put the max of the raw data in C1:
=MAX(A:A)
Put the min of the raw data in C2:
=MIN(A:A)
Put the desired max in D1 and the desired min in D2. Put the formula for the A-coefficient in C3:
=($D$1-$D$2)/($C$1-$C$2)
and the formula for the B-coefficient in C4:
=$D$1-$C$3*$C$1
Finally put the scaling formula in B1:
=A1*$C$3+$C$4
and copy it down the column.
Naturally, if you want the scaling to be independent of the raw max or min, you would use 15 in C1 and 0 in C2.
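The same two-point solve translated into R, as a sketch assuming the fixed 0-15 scale and the dat data frame from the question:
old_min <- 0; old_max <- 15    # possible range of the raw scores
new_min <- 0; new_max <- 20    # desired range
A <- (new_max - new_min) / (old_max - old_min)
b <- new_max - A * old_max     # here b works out to 0
dat$Scaled <- A * dat$Score + b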
You can scale between 0 to 20 with this command in R:
newvalue <- 20/(max(score)-min(score))*(score-min(score))
The math way is fairly straightforward if the floor of both scales is 0:
new_value = new_ceiling * old_value / old_ceiling
The next formula accounts for different floors on each scale:
new_value = new_floor + (new_ceiling - new_floor) * ((old_value - old_floor) / (old_ceiling - old_floor))
which is actually the formula you posted from Wikipedia. ;)
Hope this helps!
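As a quick sanity check of that formula with the question's scales (floors of 0, ceilings of 15 and 20), the top score of 14 maps to:
new_floor <- 0; new_ceiling <- 20
old_floor <- 0; old_ceiling <- 15
old_value <- 14
new_floor + (new_ceiling - new_floor) *
  (old_value - old_floor) / (old_ceiling - old_floor)
#> [1] 18.66667
which agrees with the 18.6666 mentioned above.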
That is very simple. Since both grade scales are linear, a simple ratio will do the work: each grade in your set just needs to be multiplied by 20/15.
Here's a little R function which can help if you need to repeat the operation, and it gives you some flexibility in what you rescale to. One must also be careful with NA values, because min() and max() do not drop them by default and would then return NA; therefore I provided an option to handle NA values (dropped by default).
# rescales data to 0-1 and optionally multiplies by a new max
rescale <- function(x, new_max = 1, na.rm = TRUE) {
  as.vector(new_max * scale(x,
                            center = min(x, na.rm = na.rm),
                            scale = max(x, na.rm = na.rm) - min(x, na.rm = na.rm)))
}
# old scores
scores <- c(10,8,2,7,10,10,1,10,14,9,2,6,10,8,10,8,10,10,7,11,10)
# new scores
data.frame(old = scores,
           new = rescale(scores, new_max = 20))
#> old new
#> 1 10 13.846154
#> 2 8 10.769231
#> 3 2 1.538462
#> 4 7 9.230769
#> 5 10 13.846154
#> 6 10 13.846154
#> 7 1 0.000000
#> 8 10 13.846154
#> 9 14 20.000000
#> 10 9 12.307692
#> 11 2 1.538462
#> 12 6 7.692308
#> 13 10 13.846154
#> 14 8 10.769231
#> 15 10 13.846154
#> 16 8 10.769231
#> 17 10 13.846154
#> 18 10 13.846154
#> 19 7 9.230769
#> 20 11 15.384615
#> 21 10 13.846154
Created on 2022-03-10 by the reprex package (v2.0.1)

Draw nearest value from sorted data frame into unsorted data frame

I have two data frames in R. The first data frame is a cumulative frequency distribution (cumFreqDist) with associated periods. The first rows of the data frame look like this:
Time cumfreq
0 0.0000000
4 0.9009009
6 1.8018018
8 7.5075075
12 23.4234234
16 39.6396396
18 53.4534535
20 58.2582583
24 75.3753754
100 100.0000000
The second data frame contains 10000 draws from a uniform distribution, generated with:
testData <- runif(10000) * 100
For each row in testData, I want to locate the corresponding cumfreq in cumFreqDist and add the corresponding Time value into a new column in testData. Because testData is a test data frame standing in for a real data frame, I do not wish to sort testData.
Because I am dealing with cumulative frequencies, if the testData value is 23.30... the Time value that should be returned is 8. That is, I need to locate the nearest cumfreq value that does not exceed the testData value, and return only that one value.
The data.table package has been mentioned for other similar questions, but my limited understanding is that it requires a key to be identified in both data frames (after conversion to data tables), and I can't assume that the testData values meet the requirements for being assigned as a key. It also appears that assigning a key will sort the data, which will cause me issues when I set a seed later in further work I am doing.
findInterval() is perfect for this:
set.seed(1);
cumFreqDist <- data.frame(Time=c(0,4,6,8,12,16,18,20,24,100), cumfreq=c(0.0000000,0.9009009,1.8018018,7.5075075,23.4234234,39.6396396,53.4534535,58.2582583,75.3753754,100.0000000) );
testData <- data.frame(x=runif(10000)*100);
testData$Time <- cumFreqDist$Time[findInterval(testData$x,cumFreqDist$cumfreq)];
head(testData,20);
## x Time
## 1 26.550866 12
## 2 37.212390 12
## 3 57.285336 18
## 4 90.820779 24
## 5 20.168193 8
## 6 89.838968 24
## 7 94.467527 24
## 8 66.079779 20
## 9 62.911404 20
## 10 6.178627 6
## 11 20.597457 8
## 12 17.655675 8
## 13 68.702285 20
## 14 38.410372 12
## 15 76.984142 24
## 16 49.769924 16
## 17 71.761851 20
## 18 99.190609 24
## 19 38.003518 12
## 20 77.744522 24

approx() without duplicates?

I am using approx() to interpolate values.
x <- 1:20
y <- c(3,8,2,6,8,2,4,7,9,9,1,3,1,9,6,2,8,7,6,2)
df <- cbind.data.frame(x,y)
> df
x y
1 1 3
2 2 8
3 3 2
4 4 6
5 5 8
6 6 2
7 7 4
8 8 7
9 9 9
10 10 9
11 11 1
12 12 3
13 13 1
14 14 9
15 15 6
16 16 2
17 17 8
18 18 7
19 19 6
20 20 2
interpolated <- approx(x=df$x, y=df$y, method="linear", n=5)
gets me this:
interpolated
$x
[1] 1.00 5.75 10.50 15.25 20.00
$y
[1] 3.0 3.5 5.0 5.0 2.0
Now, the first and last values are duplicates of my real data. Is there any way to prevent this, or is it something I don't understand properly about approx()?
You may want to specify xout to avoid this. For instance, if you want to always exclude the first and the last points, here's how you can do that:
specify_xout <- function(x, n) {
  seq(from = min(x), to = max(x), length.out = n + 2)[-c(1, n + 2)]
}
plot(df$x, df$y)
points(approx(df$x, df$y, xout=specify_xout(df$x, 5)), pch = "*", col = "red")
It does not prevent an existing point somewhere in the middle from being reproduced exactly (which is precisely what can happen in the resulting plot).
approx will fit through all your original datapoints if you give it a chance (change n=5 to xout=df$x to see this). Interpolation is the process of generating values for y given unobserved values of x, but should agree if the values of x have been previously observed.
The method="linear" setup is going to 'draw' linear segments joining up your original coordinates exactly (and so will give the y values you input to it for integer x). You only observe 'new' y values because your n=5 means that for points other than the beginning and end the x is not an integer (and therefore not one of your input values), and so gets interpolated.
If you want the observed values not to be reproduced exactly, then maybe add some noise via rnorm, as sketched below.
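A minimal sketch of that suggestion (the seed and sd = 0.1 are arbitrary illustrative choices):
set.seed(42)                      # illustrative seed
interp <- approx(df$x, df$y, method = "linear", n = 5)
# jitter the interpolated y values so originals are not reproduced exactly
interp$y <- interp$y + rnorm(length(interp$y), mean = 0, sd = 0.1)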
