KNN error: 'train' and 'class' have different lengths - r

I want to run KNN on the following two data frames. Both have already been scaled. age and lr_scale are the features, and euRefVoteAfter is the outcome variable.
head(training)
# A tibble: 6 x 3
age lr_scale euRefVoteAfter
<dbl> <dbl> <dbl+lbl>
1 -1.20 -0.808 0 [Rejoin the EU]
2 1.25 -1.29 1 [Stay out of the EU]
3 0.636 0.886 0 [Rejoin the EU]
4 0.0245 -0.324 1 [Stay out of the EU]
5 -1.26 0.402 0 [Rejoin the EU]
6 -0.770 0.402 0 [Rejoin the EU]
> head(testing)
# A tibble: 6 x 3
age lr_scale euRefVoteAfter
<dbl> <dbl> <dbl+lbl>
1 -1.20 -0.808 0 [Rejoin the EU]
2 1.25 -1.29 1 [Stay out of the EU]
3 0.636 0.886 0 [Rejoin the EU]
4 0.0245 -0.324 1 [Stay out of the EU]
5 -1.26 0.402 0 [Rejoin the EU]
6 -0.770 0.402 0 [Rejoin the EU]
I ran the following code:
y_pred <- knn(train = training[, -3],
              test = testing[, -3],
              cl = training[, 3],
              k = 3,
              prob = FALSE)
I got the message 'train' and 'class' have different lengths.
I found a suggested fix for this error and tried again as follows:
v1 = training[, 3]
y_pred <- knn(train = training[, -3],
              test = testing[, -3],
              cl = v1,
              k = 3,
              prob = FALSE)
But the same error message occurred.
I'm sure the variables all have the same length:
> length(training$euRefVoteAfter)
[1] 26026
> length(training$age)
[1] 26026
> length(training$lr_scale)
[1] 26026
I'd really appreciate any help with this problem.

Related

How to save forloop() with different names in R

I have a function and a for-loop. I would like to run the same for-loop 3 times, for(i in 1:3){}, and save the output of each run in a list under different names such as df.1, df.2, and df.3. Many thanks in advance.
library(tibble)
df <- tibble(a = rnorm(10), b = rnorm(10))
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
for (i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])
}
df
Expected Answer
DF.1
A tibble: 10 x 2
a b
<dbl> <dbl>
1 1 0.624
2 0 0.421
3 0.551 1
4 0.320 0.466
5 0.266 0.247
6 0.0261 0.103
7 0.127 0.519
8 0.588 0.0623
9 0.489 0
10 0.556 0.540
DF.2
A tibble: 10 x 2
a b
<dbl> <dbl>
1 1 0.624
2 0 0.421
3 0.551 1
4 0.320 0.466
5 0.266 0.247
6 0.0261 0.103
7 0.127 0.519
8 0.588 0.0623
9 0.489 0
10 0.556 0.540
DF.3
A tibble: 10 x 2
a b
<dbl> <dbl>
1 1 0.624
2 0 0.421
3 0.551 1
4 0.320 0.466
5 0.266 0.247
6 0.0261 0.103
7 0.127 0.519
8 0.588 0.0623
9 0.489 0
10 0.556 0.540
Put the for-loop code in a function and repeat it n times using replicate:
apply_fun <- function(df) {
  for (i in seq_along(df)) {
    df[[i]] <- rescale01(df[[i]])
  }
  df
}
result <- replicate(3, apply_fun(df), simplify = FALSE)
result is now a list of data frames.
If you want them as separate data frames, name the list and use list2env:
names(result) <- paste0('df.', seq_along(result))
list2env(result, .GlobalEnv)
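As a variation on the answer above, here is a minimal base-R sketch that keeps the copies in a named list and reads them back by name instead of exporting them to the global environment. The helper name rescale_all and the use of a plain data.frame (rather than a tibble) are assumptions made for illustration.

```r
# Rescale a numeric vector to the [0, 1] range (same function as in the question)
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

set.seed(1)
df <- data.frame(a = rnorm(10), b = rnorm(10))

# Hypothetical helper: apply rescale01 to every column, keeping the data frame shape
rescale_all <- function(d) {
  d[] <- lapply(d, rescale01)
  d
}

# Run the same operation three times and name the results df.1, df.2, df.3
result <- replicate(3, rescale_all(df), simplify = FALSE)
names(result) <- paste0("df.", seq_along(result))

# Access one copy by name without touching the global environment
result[["df.2"]]
```

Keeping the list makes it easy to loop over the results later, for example with lapply(result, summary).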

Error in emmfcn(...) : Variable 'CO2' is not in the dataset in r

I want to compare the slopes of different regressions: the change in CO2 over time (day) for 8 different nests.
> structure(as1)
# A tibble: 16 x 4
day nest N2O CO2
<dbl> <dbl> <dbl> <dbl>
1 1 1 0.00549 0.206
2 1 2 0.129 0.0343
3 1 3 0.157 0.113
4 1 4 0.0760 0.106
5 2 1 -0.0487 0.214
6 2 2 -0.0561 0.358
7 2 3 -0.0522 0.767
8 2 4 -0.0193 0.188
9 3 1 -0.0757 0.255
10 3 2 -0.237 0.753
11 3 3 -0.117 0.745
12 3 4 0.0345 0.502
13 4 1 0.135 0.325
14 4 2 0.264 0.767
15 4 3 0.0116 0.926
16 4 4 0.0342 0.358
I'm following the instructions in the answer with a score of 16 at https://stats.stackexchange.com/questions/33013/what-test-can-i-use-to-compare-slopes-from-two-or-more-regression-models.
Instead of the lsmeans library that it suggests, I used emmeans, because a message in R encourages switching to emmeans. However, I've also tried it with lsmeans and I get the same problem. When I run this:
library(emmeans)
m.interaction <- lm(CO2 ~ day*nest, data = as1)
anova(m.interaction)
# Obtain slopes
m.interaction$coefficients
m.lst <- lstrends(m.interaction, "day", var="CO2", data = as1)
Everything works fine until lstrends, where I get this error:
##Error in emmfcn(...) : Variable 'CO2' is not in the dataset
Does anyone know what might be happening?
Thanks in advance!
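One likely cause, offered as a sketch rather than a confirmed diagnosis: in emtrends (which lstrends wraps in emmeans), var names the predictor whose slope is wanted, and the argument after the model names the grouping factor. CO2 is the response here, so it is not among the predictors that the function looks up. The sketch below assumes that nest is meant to be a grouping factor and rebuilds the day, nest and CO2 columns shown above as a plain data frame.

```r
library(emmeans)

# CO2 data as printed in the question (4 days x 4 nests)
as1 <- data.frame(
  day  = rep(1:4, each = 4),
  nest = rep(1:4, times = 4),
  CO2  = c(0.206, 0.0343, 0.113, 0.106,
           0.214, 0.358,  0.767, 0.188,
           0.255, 0.753,  0.745, 0.502,
           0.325, 0.767,  0.926, 0.358)
)

# Assumption: nest is a grouping factor, not a numeric predictor
as1$nest <- factor(as1$nest)

m.interaction <- lm(CO2 ~ day * nest, data = as1)

# Slope of CO2 against day, estimated separately for each nest
m.lst <- emtrends(m.interaction, "nest", var = "day")
m.lst

# Pairwise comparisons of the per-nest slopes
pairs(m.lst)
```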

How to mutate the slopes of lines

I have a question about how to add the slopes of fitted lines, computed by category, as a new column in a data frame.
d1 <-read.csv(file.choose(), header = T)
d2 <- d1 %>%
group_by(ID)%>%
mutate(Slope=sapply(split(df,df$ID), function(v) lm(x~y,v)$coefficients["y"]))
ID x y
1 3.429865279 2.431363764
1 3.595066124 2.681241237
1 3.735263469 2.352182518
1 3.316473584 2.51851394
1 3.285984642 2.380211242
1 3.860793029 2.62324929
1 3.397714117 2.819543936
1 3.452997088 2.176091259
1 3.718933278 2.556302501
1 3.518566578 2.537819095
1 3.689033452 2.40654018
1 3.349160923 2.113943352
1 3.658888644 2.556302501
1 3.251151343 2.342422681
1 3.911194909 2.439332694
1 3.432584505 2.079181246
1 4.031267043 2.681241237
1 3.168733129 1.544068044
1 4.032239897 3.084576278
1 3.663361648 2.255272505
1 3.582302046 2.62324929
1 3.606585565 2.079181246
1 3.541791347 2.176091259
4 3.844012861 2.892094603
4 3.608318477 2.767155866
4 3.588990218 2.883661435
4 3.607957917 2.653212514
4 3.306753044 2.079181246
4 4.002604841 2.880813592
4 3.195299837 2.079181246
4 3.512203238 2.643452676
4 3.66878494 2.431363764
4 3.598910385 2.511883361
4 3.721810134 2.819543936
4 3.352964661 2.113943352
4 4.008109343 3.084576278
4 3.584693332 2.556302501
4 4.019461819 3.084576278
4 3.359474563 2.079181246
4 3.950256012 2.829303773
I got an error message like 'replacement has 2 rows, data has 119'. I am sure that the error comes from mutate().
Best,
Once you call group_by, any function that follows operates on the columns within each group of the grouped data.frame; in your case, it will use only the x and y values within each ID.
If you want only one coefficient per group, use summarize:
df %>% group_by(ID) %>% summarize(coef=lm(x~y)$coefficients["y"])
# A tibble: 2 x 2
ID coef
<int> <dbl>
1 1 0.437
2 4 0.660
If you want the coefficient repeated on every row, that is, a vector as long as the data frame, use mutate:
df %>% group_by(ID) %>% mutate(coef=lm(x~y)$coefficients["y"])
# A tibble: 40 x 4
# Groups: ID [2]
ID x y coef
<int> <dbl> <dbl> <dbl>
1 1 3.43 2.43 0.437
2 1 3.60 2.68 0.437
3 1 3.74 2.35 0.437
4 1 3.32 2.52 0.437
5 1 3.29 2.38 0.437
6 1 3.86 2.62 0.437
7 1 3.40 2.82 0.437
8 1 3.45 2.18 0.437
9 1 3.72 2.56 0.437
10 1 3.52 2.54 0.437
# … with 30 more rows
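The difference between the two calls can be checked on a small made-up data set where the slopes are known in advance. This is only a sketch: the toy values below are hypothetical and chosen so that x is an exact linear function of y within each ID.

```r
library(dplyr)

# Hypothetical toy data: slope of x on y is 2 for ID 1 and 0.5 for ID 4
df <- data.frame(
  ID = rep(c(1, 4), each = 4),
  y  = rep(1:4, times = 2)
)
df$x <- ifelse(df$ID == 1, 2 * df$y + 1, 0.5 * df$y)

# summarize: one row per group, holding that group's slope
slopes <- df %>%
  group_by(ID) %>%
  summarize(coef = coef(lm(x ~ y))[["y"]])

# mutate: original rows kept, each carrying its group's slope
with_slopes <- df %>%
  group_by(ID) %>%
  mutate(coef = coef(lm(x ~ y))[["y"]]) %>%
  ungroup()

slopes
with_slopes
```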

Why am I getting "'train' and 'class' have different lengths"

Why am I getting this error:
'train' and 'class' have different lengths
even though both have the same length?
y_pred = knn(train = training_set[, 1:2],
             test = Test_set[, -3],
             cl = training_set[, 3],
             k = 5)
Their dimensions are given below:
> dim(training_set[,-3])
[1] 300 2
> dim(training_set[,3])
[1] 300 1
> head(training_set)
# A tibble: 6 x 3
Age EstimatedSalary Purchased
<dbl> <dbl> <fct>
1 -1.77 -1.47 0
2 -1.10 -0.788 0
3 -1.00 -0.360 0
4 -1.00 0.382 0
5 -0.523 2.27 1
6 -0.236 -0.160 0
> Test_set
# A tibble: 100 x 3
Age EstimatedSalary Purchased
<dbl> <dbl> <fct>
1 -0.304 -1.51 0
2 -1.06 -0.325 0
3 -1.82 0.286 0
4 -1.25 -1.10 0
5 -1.15 -0.485 0
6 0.641 -1.32 1
7 0.735 -1.26 1
8 0.924 -1.22 1
9 0.829 -0.582 1
10 -0.871 -0.774 0
It's because knn expects cl to be a vector, and you are giving it a data table with one column. The check knn performs is whether nrow(train) == length(cl). If cl is a data table, length(cl) counts columns, which is not the answer you are expecting. Compare:
> length(data.frame(a=c(1,2,3)))
[1] 1
> length(c(1,2,3))
[1] 3
If you use cl=training_set$Purchased, which extracts the vector from the table, that should fix it.
This is a specific gotcha if you are moving from data.frame to data.table (or to a tibble, as in the output above), because the default drop behaviour is different:
> dt <- data.table(a=1:3, b=4:6)
> dt[,2]
b
1: 4
2: 5
3: 6
> df <- data.frame(a=1:3, b=4:6)
> df[,2]
[1] 4 5 6
> df[,2, drop=FALSE]
b
1 4
2 5
3 6
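The same behaviour can be reproduced with a tibble, which is what the question's printed output shows. The sketch below uses a small made-up training_set (the values are hypothetical) and assumes the tibble and class packages are installed.

```r
library(tibble)
library(class)  # provides knn()

# Hypothetical toy data standing in for training_set
training_set <- tibble(
  Age = c(-1.77, -1.10, -1.00, -0.52),
  EstimatedSalary = c(-1.47, -0.79, -0.36, 2.27),
  Purchased = factor(c(0, 0, 0, 1))
)

cl_tbl <- training_set[, 3]   # still a one-column tibble
cl_vec <- training_set[[3]]   # a plain factor; same as training_set$Purchased

length(cl_tbl)  # counts columns, not rows
length(cl_vec)  # counts observations

# The length check passes once cl is a vector
pred <- knn(train = training_set[, 1:2],
            test  = training_set[, 1:2],
            cl    = cl_vec,
            k     = 1)
pred
```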

Calculate confidence intervals (binomial) within data frame

I want to get the confidence intervals for proportions within my tibble. Is there a way of doing this?
library(tidyverse)
library(Hmisc)
library(broom)
df <- tibble(id = c(1, 2, 3, 4, 5, 6),
count = c(4, 1, 22, 4545, 33, 23),
n = c(22, 65, 34, 6323, 35, 45))
Which looks like this:
# A tibble: 6 x 3
id count n
<dbl> <dbl> <dbl>
1 1 4 22
2 2 1 65
3 3 22 34
4 4 4545 6323
5 5 33 35
6 6 23 45
I'm using binconf from Hmisc and tidy from broom, but the solution could come from any package.
The intervals for the first row:
tidy(binconf(4, 22))
# A tibble: 1 x 4
.rownames PointEst Lower Upper
<chr> <dbl> <dbl> <dbl>
1 "" 0.182 0.0731 0.385
I have tried using map in purrr but get errors:
map(df, tidy(binconf(count, n)))
Error in x[i] : object of type 'closure' is not subsettable
I could just calculate them using dplyr, but then I get values below zero (e.g. row 2) or above one (e.g. row 5), which I don't want:
df %>%
  mutate(prop = count / n) %>%
  mutate(se = sqrt(prop * (1 - prop) / n)) %>%
  mutate(lower = prop - (se * 1.96)) %>%
  mutate(upper = prop + (se * 1.96))
# A tibble: 6 x 7
id count n prop se lower upper
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 22 0.182 0.0822 0.0206 0.343
2 2 1 65 0.0154 0.0153 -0.0145 0.0453
3 3 22 34 0.647 0.0820 0.486 0.808
4 4 4545 6323 0.719 0.00565 0.708 0.730
5 5 33 35 0.943 0.0392 0.866 1.02
6 6 23 45 0.511 0.0745 0.365 0.657
Is there a good way of doing this? I did have a look at the confint_tidy() function, but could not get that to work. Any ideas?
It may not be tidy but
> as_tibble(cbind(df, binconf(df$count, df$n)))
# A tibble: 6 x 6
id count n PointEst Lower Upper
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 22 0.182 0.0731 0.385
2 2 1 65 0.0154 0.000789 0.0821
3 3 22 34 0.647 0.479 0.785
4 4 4545 6323 0.719 0.708 0.730
5 5 33 35 0.943 0.814 0.984
6 6 23 45 0.511 0.370 0.650
seems to work
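For reference, the bounds stay inside [0, 1] because binconf defaults to the Wilson score interval rather than the normal approximation used in the dplyr attempt. Below is a minimal base-R sketch of the plain Wilson formula; the function name wilson_ci is made up for illustration. It reproduces the binconf output above for rows such as 1 and 5, but binconf appears to apply an extra adjustment in edge cases such as count = 1, so the lower bound for row 2 will differ.

```r
# Hypothetical helper: plain Wilson score interval, vectorized over x and n
wilson_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- x / n
  denom  <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  data.frame(prop = p, lower = centre - half, upper = centre + half)
}

df <- data.frame(id = 1:6,
                 count = c(4, 1, 22, 4545, 33, 23),
                 n = c(22, 65, 34, 6323, 35, 45))

# Bind the interval columns onto the original data
ci <- cbind(df, wilson_ci(df$count, df$n))
ci
```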
