linear disriminant function error - arguments must be same length - r

My example dataset:
year <- c("2002","2002","2002","2004","2005","2005","2005","2006", "2006")
FA1 <- c(0.7975030, 1.5032768, 0.8805000, 1.0505961, 1.1379715, 1.1334510, 1.1359434, 0.9614926, 1.2631387)
FA2 <- c(0.7930153, 1.2862355, 0.5633592, 1.0396431, 0.9446277, 1.1944455, 1.086171, 0.767955, 1.2385361)
FA3 <- c(-0.7825210, 0.56415672, -0.9294417, 0.21485071, -0.447953,0.037978, 0.038363, -0.495383, 0.509704)
FA4 <- c(0.38829957,0.34638035,-0.06783007, 0.505020, 0.3158221,0.55505411, 0.42822783, 0.36399347, 0.51352115)
df <- data.frame(year,FA1,FA2,FA3,FA4)
I then select the data I want to use and run a DFA
library(magrittr)
library(DiscriMiner)
yeardf <- df[df$year %in% c(2002, 2005, 2006),]
yeardfd <- linDA(yeardf[,2:4],yeardf$year, validation = "crossval")
But now i get an error telling me the arguments are different lengths.
"Error in table(original = y[test], predicted = pred_class) :
all arguments must have the same length"
I looked at
length(yeardf$year)
dim(yeardf)
And it looks like they are the same.
I also checked for spelling mistakes as that seems to cause this error sometimes.
following up on answer.
The suggested answer works on my example data (which does give me the same error), but I can't quite make it work on my real code.
I first apply the transformation to selected columns in my data.frame. And then I combine the transformed columns with the variables I want to use as groups in my DFA
library(robCompositions)
tFA19 <- cenLR(fadata.PIZ[names(FA19)])[1]
tFA19 <- cbind(fadata.PIZ[1:16],tFA19)
So I think creating my data.frame this way must be leading to my error. I tried to insert stringsAsFactors into my cbind statement, but no luck.

You need ,stringsAsFactors = FALSE in data.frame:
year <- c("2002","2002","2002","2004","2005","2005","2005","2006", "2006")
FA1 <- c(0.7975030, 1.5032768, 0.8805000, 1.0505961, 1.1379715, 1.1334510, 1.1359434, 0.9614926, 1.2631387)
FA2 <- c(0.7930153, 1.2862355, 0.5633592, 1.0396431, 0.9446277, 1.1944455, 1.086171, 0.767955, 1.2385361)
FA3 <- c(-0.7825210, 0.56415672, -0.9294417, 0.21485071, -0.447953,0.037978, 0.038363, -0.495383, 0.509704)
FA4 <- c(0.38829957,0.34638035,-0.06783007, 0.505020, 0.3158221,0.55505411, 0.42822783, 0.36399347, 0.51352115)
df <- data.frame(year,FA1,FA2,FA3,FA4,stringsAsFactors = FALSE)
library(magrittr)
library(DiscriMiner)
yeardf <- df[df$year %in% c(2002, 2005, 2006),]
yeardfd <- linDA(yeardf[,2:4],yeardf$year, validation = "crossval")
yeardfd
Linear Discriminant Analysis
-------------------------------------------
$functions discrimination functions
$confusion confusion matrix
$scores discriminant scores
$classification assigned class
$error_rate error rate
-------------------------------------------
$functions
2002 2005 2006
constant -345 -371 -305
FA1 228 231 213
...

Related

Creating a function to loop columns through an equation in R

Solution (thanks #Peter_Evan!) in case anyone coming across this question has a similar issue
(Original question is below)
## get all slopes (lm coefficients) first
# list of subfields of interest to loop through
sf <- c("left_presubiculum", "right_presubiculum",
"left_subiculum", "right_subiculum", "left_CA1", "right_CA1",
"left_CA3", "right_CA3", "left_CA4", "right_CA4", "left_GC-ML-DG",
"right_GC-ML-DG")
# dependent variables are sf, independent variable common to all models in the inner lm() call is ICV
# applies the lm(subfield ~ ICV, dataset = DF) to all subfields of interest (sf) specified previously
lm.results <- lapply(sf, function(dv) {
temp.lm <- lm(get(dv) ~ ICV, data = DF)
coef(temp.lm)
})
# returns a list, where each element is a vector of coefficients
# do.call(rbind, ) will paste them together
lm.coef <- data.frame(sf = sf,
do.call(rbind, lm.results))
# tidy up name of intercept variable
names(lm.coef)[2] <- "intercept"
lm.coef
## set up all components for the equation
# matrix to store output
out <- matrix(ncol = length(sf), nrow = NROW(DF))
# name the rows after each subject
row.names(out) <- DF$Subject
# name the columns after each subfield
colnames(out) <- sf
# nested for loop that goes by subject (j) and subfield (i)
for(j in DF$Subject){
for (i in sf) {
slope <- lm.coef[lm.coef$sf == i, "ICV"]
out[j,i] <- as.numeric( DF[DF$Subject == j, i] - (slope * (DF[DF$Subject == j, "ICV"] - mean(DF$ICV))) )
}
}
# check output
out
===============
Original Question:
I have a dataframe (DF) with 13 columns (12 different brain subfields, and one column containing total intracranial volume(ICV)) and 50 rows (each a different participant). I'm trying to automate an equation being looped over every column for each participant.
The data:
structure(list(Subject = c("sub01", "sub02", "sub03", "sub04",
"sub05", "sub06", "sub07", "sub08", "sub09", "sub10", "sub11",
"sub12", "sub13", "sub14", "sub15", "sub16", "sub17", "sub18",
"sub19", "sub20"), ICV = c(1.50813, 1.3964237, 1.6703585, 1.4641886,
1.6351018, 1.5524641, 1.4445532, 1.6384505, 1.6152434, 1.5278011,
1.4788126, 1.4373356, 1.4109637, 1.3634952, 1.3853583, 1.4855268,
1.6082085, 1.5644998, 1.5617522, 1.4304141), left_subiculum = c(411.225013,
456.168033, 492.968477, 466.030173, 533.95505, 476.465524, 448.278213,
476.45566, 422.617374, 498.995121, 450.773906, 461.989663, 549.805272,
452.619547, 457.545623, 451.988333, 475.885847, 490.127968, 470.686415,
494.06548), left_CA1 = c(666.893596, 700.982955, 646.21927, 580.864234,
721.170599, 737.413139, 737.683665, 597.392434, 594.343911, 712.781376,
733.157168, 699.820162, 701.640861, 690.942843, 606.259484, 731.198846,
567.70879, 648.887718, 726.219904, 712.367433), left_presubiculum = c(325.779458,
391.252815, 352.765098, 342.67797, 390.885737, 312.857458, 326.916867,
350.657957, 325.152464, 320.718835, 273.406949, 305.623938, 371.079722,
315.058313, 311.376271, 319.56678, 348.343569, 349.102678, 322.39908,
306.966008), `left_GC-ML-DG` = c(327.037756, 305.63224, 328.945065,
238.920358, 319.494513, 305.153183, 311.347404, 259.259723, 295.369164,
312.022281, 324.200989, 314.636501, 306.550385, 311.399107, 295.108592,
356.197094, 251.098248, 294.76349, 317.308576, 301.800253), left_CA3 = c(275.17038,
220.862237, 232.542718, 170.088695, 234.707172, 210.803287, 246.861975,
171.90896, 220.83478, 236.600832, 246.842024, 239.677362, 186.599097,
224.362411, 229.9142, 293.684776, 172.179779, 202.18936, 232.5666,
221.896625), left_CA4 = c(277.614028, 264.575987, 286.605092,
206.378619, 281.781858, 258.517989, 269.354864, 226.269982, 256.384436,
271.393257, 277.928824, 265.051581, 262.307377, 266.924683, 263.038686,
306.133918, 226.364556, 262.42823, 264.862956, 255.673948), right_subiculum = c(468.762375,
445.35738, 446.536018, 456.73484, 521.041823, 482.768261, 487.2911,
456.39996, 445.392976, 476.146498, 451.775611, 432.740085, 518.170065,
487.642399, 405.564237, 487.188989, 467.854363, 479.268714, 473.212833,
472.325916), right_CA1 = c(712.973011, 717.815214, 663.637105,
649.614586, 711.844375, 779.212704, 862.784416, 648.925038, 648.180611,
760.761704, 805.943016, 717.486756, 801.853608, 722.213109, 621.676321,
791.672796, 605.35667, 637.981476, 719.805053, 722.348921), right_presubiculum = c(327.285242,
364.937865, 288.322641, 348.30058, 341.309111, 279.429847, 333.096795,
342.184296, 364.245998, 350.707173, 280.389853, 276.423658, 339.439377,
321.534798, 302.164685, 328.365751, 341.660085, 305.366589, 320.04127,
303.83284), `right_GC-ML-DG` = c(362.391907, 316.853532, 342.93274,
282.550769, 339.792696, 357.867386, 342.512721, 277.797528, 309.585721,
343.770416, 333.524912, 302.505077, 309.063135, 291.29361, 302.510461,
378.682679, 255.061044, 302.545288, 313.93902, 297.167161), right_CA3 = c(307.007404,
243.839349, 269.063801, 211.336979, 249.283479, 276.092623, 268.183349,
202.947849, 214.642782, 247.844657, 291.206598, 235.864996, 222.285729,
201.427853, 237.654913, 321.338801, 199.035108, 243.204203, 236.305659,
213.386702), right_CA4 = c(312.164065, 272.905586, 297.99392,
240.765062, 289.98697, 306.459566, 284.533068, 245.965817, 264.750571,
296.149675, 290.66935, 264.821461, 264.920869, 246.267976, 266.07378,
314.205819, 229.738951, 274.152503, 256.414608, 249.162404)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
The equation:
adjustedBrain(participant1) = rawBrain(participant1) - slope*[ICV(participant1) - (mean of all ICV measures included in the calculation of the slope)]
The code (which is not working and I was hoping for some pointers):
adjusted_Brain <- function(DF, subject) {
subfields <- colnames(select(DF, "left_presubiculum", "right_presubiculum",
"left_subiculum", "right_subiculum", "left_CA1", "right_CA1",
"left_CA3", "right_CA3", "left_CA4", "right_CA4", "left_GC-ML-DG",
"right_GC-ML-DG"))
out <- matrix(ncol = length(subfields), nrow = NROW(DF))
for (i in seq_along(subfields)) {
DF[i] = DF[DF$Subject == "subject", "i"] -
slope * (DF[DF$Subject == "subject", "ICV"] -
mean(DF$ICV))
}
}
Getting this error:
Error: Can't subset columns that don't exist.
x Column `i` doesn't exist.
A few notes:
The slopes for each subject for each subfield will be different (and will come from a regression) -> is there a way to specify that in the function so the slope (coefficient from the appropriate regression equation) gets called in?
I have my nrow set to the number of participants right now in the output because I'd like to have this run through EVERY subject across EVERY subfield and spit out a matrix with all the adjusted brain volumes... But that seems very complicated and so for now I will just settle for running each participant separately.
Any help is greatly appreciated!
As others have noted in the comments, there are quite a few syntax issues that prevent your code from running, as well as a few unstated requirements. That aside, I think there is enough to recommend a few improvements that you can hopefully build on. Here are the top line changes:
You likely don't need this to be a function, but rather a nested for loop (if you want to do this with base R). As written, the code isn't flexible enough to merit a function. If you intend to apply this many times across different datasets, a function might make sense. However, it will require a much larger rewrite.
Assuming you are fitting a simple regression via lm, then you can pull out the coefficient of interest via the $ operator and indexing (see below). Some thought will need to go into how to handle different models in the loop. Here, we assume you only need one coefficient from one model.
There are a few areas where the syntax is incorrect and a review of sub setting in base R would be helpful. Others have pointed out in the comments were some of these are.
Here is one approach were we loop through each subject (j) through each feature or subfield (i) and store them in a matrix (out). This is just an approach and will almost certainly need tweaking on your end!
#NOTE: the dataset your provided is saved as x in this example.
#fit a linear model - here we assume there is only one coef. of interest, but you may need to alter
# depending on how the slope changes in each calculation
reg <- lm(ICV ~ right_CA3, x)
# view the coeff.
reg$coefficients
# pull out the slope by getting the coeff. of interest (via index) from the reg object
slope <- reg$coefficients[[1]]
# list of features/subfeilds to loop through
sf <- c("left_presubiculum", "right_presubiculum",
"left_subiculum", "right_subiculum", "left_CA1", "right_CA1",
"left_CA3", "right_CA3", "left_CA4", "right_CA4", "left_GC-ML-DG",
"right_GC-ML-DG")
# matrix to store output
out <- matrix(ncol = length(sf), nrow = NROW(x))
#name the rows after each subject
row.names(out) <- x$Subject
#name the columns after each sub feild
colnames(out) <- sf
# nested for loop that goes by subject (j) and features/subfeilds (i)
for(j in x$Subject){
for (i in sf) {
out[j,i] <- as.numeric( x[x$Subject == j, i] - (slope * (x[x$Subject == j, "ICV"] - mean(x$ICV))) )
}
}
# check output
out

KNN: "no missing values are allow" -> I do not have missing values

I am in a group project for a class and one of the people in my group ran the normalization, as well as creating the test/train sets so that we all have the same sets to work from (we're all utilizing different algorithms). I am assigned with running the KNN algorithm.
We had multiple columns with NA's so those columns were omitted (<-NULL). When attempting to run the KNN I keep getting the error of
Error in knn(train = trainsetne, test = testsetne, cl = ne_train_target, :
no missing values are allowed
I ran which(is.na(dataset$col)) and found:
which(is.na(testsetne$median_days_on_market))
# [1] 8038 8097 8098 8100 8293 8304
When I look through the dataset those cells do not have missing data.
I am wondering if I may get some help with how to either find and fix the "No missing values" or to find a work around (if any).
I am sorry if I am missing something simple. Any help is appreciated.
I have listed the code that we have below:
ne$pending_ratio_yy <- ne$total_listing_count_yy <- ne$average_listing_price_yy <- ne$median_square_feet_yy <- ne$median_listing_price_per_square_feet_yy <- ne$pending_listing_count_yy <- ne$price_reduced_count_yy <- ne$median_days_on_market_yy <- ne$new_listing_count_yy <- ne$price_increased_count_yy <- ne$active_listing_count_yy <- ne$median_listing_price_yy <- ne$flag <- NULL
ne$pending_ratio_mm <- ne$total_listing_count_mm <- ne$average_listing_price_mm <- ne$median_square_feet_mm <- ne$median_listing_price_per_square_feet_mm <- ne$pending_listing_count_mm <- ne$price_reduced_count_mm <- ne$price_increased_count_mm <- ne$new_listing_count_mm <- ne$median_days_on_market_mm <- ne$active_listing_count_mm <- ne$median_listing_price_mm <- NULL
ne$factor_month_date <- as.factor(ne$month_date_yyyymm)
ne$factor_median_days_on_market <- as.factor(ne$median_days_on_market)
train20ne= sample(1:20893, 4179)
trainsetne=ne[train20ne,1:10]
testsetne=ne[-train20ne,1:10]
#This is where I start to come in
ne_train_target <- ne[train20ne, 3]
ne_test_target <- ne[-train20ne, 3]
predict_1 <- knn(train = trainsetne, test = testsetne, cl=ne_train_target, k=145)
# Error in knn(train = trainsetne, test = testsetne, cl = ne_train_target, :
# no missing values are allowed

R: Package topicmodels: LDA: Error: invalid argument

I have a question regarding LDA in topicmodels in R.
I created a matrix with documents as rows, terms as columns, and the number of terms in a document as respective values from a data frame. While I wanted to start LDA, I got an Error Message stating "Error in !all.equal(x$v, as.integer(x$v)) : invalid argument type" . The data contains 1675 documents of 368 terms. What can I do to make the code work?
library("tm")
library("topicmodels")
data_matrix <- data %>%
group_by(documents, terms) %>%
tally %>%
spread(terms, n, fill=0)
doctermmatrix <- as.DocumentTermMatrix(data_matrix, weightTf("data_matrix"))
lda_head <- topicmodels::LDA(doctermmatrix, 10, method="Gibbs")
Help is much appreciated!
edit
# Toy Data
documentstoy <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
meta1toy <- c(3,4,1,12,1,2,3,5,1,4,2,1,1,1,1,1)
meta2toy <- c(10,0,10,1,1,0,1,1,3,3,0,0,18,1,10,10)
termstoy <- c("cus","cus","bill","bill","tube","tube","coa","coa","un","arc","arc","yib","yib","yib","dar","dar")
toydata <- data.frame(documentstoy,meta1toy,meta2toy,termstoy)
So I looked inside the code and apparently the lda() function only accepts integers as the input so you have to convert your categorical variables as below:
library('tm')
library('topicmodels')
documentstoy <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
meta1toy <- c(3,4,1,12,1,2,3,5,1,4,2,1,1,1,1,1)
meta2toy <- c(10,0,10,1,1,0,1,1,3,3,0,0,18,1,10,10)
toydata <- data.frame(documentstoy,meta1toy,meta2toy)
termstoy <- c("cus","cus","bill","bill","tube","tube","coa","coa","un","arc","arc","yib","yib","yib","dar","dar")
toy_unique = unique(termstoy)
for (i in 1:length(toy_unique)){
A = as.integer(termstoy == toy_unique[i])
toydata[toy_unique[i]] = A
}
lda_head <- topicmodels::LDA(toydata, 10, method="Gibbs")

How to estimate static yield curve with 'termstrc' package in R?

I am trying to estimate the static yield curve for Brazil using termstrc package in R. I am using the function estim_nss.couponbonds and putting 0% coupon-rates and $0 cash-flows, except for the last one which is $1000 (the face-value at maturity) -- as far as I know this is the function to do this, because the estim_nss.zeroyields only calculates the dynamic curve. The problem is that I receive the following error message:
"Error in (pos_cf[i] + 1):pos_cf[i + 1] : NA/NaN argument In addition: Warning message: In max(n_of_cf) : no non-missing arguments to max; returning -Inf "
I've tried to trace the problem using trace(estim_nss.couponbons, edit=T) but I cannot find where pos_cf[i]+1 is calculated. Based on the name I figured it could come from the postpro_bondfunction and used trace(postpro_bond, edit=T), but I couldn't find the calculation again. I believe "cf" comes from cashflow, so there could be some problem in the calculation of the cashflows somehow. I used create_cashflows_matrix to test this theory, but it works well, so I am not sure the problem is in the cashflows.
The code is:
#Creating the 'couponbond' class
ISIN <- as.character(c('ltn_2017','ltn_2018', 'ltn_2019', 'ltn_2021','ltn_2023')) #Bond's identification
MATURITYDATE <- as.Date(c(42736, 43101, 43466, 44197, 44927), origin='1899-12-30') #Dates are in system's format
ISSUEDATE <- as.Date(c(41288,41666,42395, 42073, 42395), origin='1899-12-30') #Dates are in system's format
COUPONRATE <- rep(0,5) #Coupon rates are 0 because these are zero-coupon bonds
PRICE <- c(969.32, 867.77, 782.48, 628.43, 501.95) #Prices seen 'TODAY'
ACCRUED <- rep(0.1,5) #There is no accrued interest in the brazilian bond's market
#Creating the cashflows sublist
CFISIN <- as.character(c('ltn_2017','ltn_2018', 'ltn_2019', 'ltn_2021', 'ltn_2023')) #Bond's identification
CF <- c(1000,1000,1000,1000,1000)# The face-values
DATE <- as.Date(c(42736, 43101, 43466, 44197, 44927), origin='1899-12-30') #Dates are in system's format
CASHFLOWS <- list(CFISIN,CF,DATE)
names(CASHFLOWS) <- c("ISIN","CF","DATE")
TODAY <- as.Date(42646, origin='1899-12-30')
brasil <- list(ISIN,MATURITYDATE,ISSUEDATE,
COUPONRATE,PRICE,ACCRUED,CASHFLOWS,TODAY)
names(brasil) <- c("ISIN","MATURITYDATE","ISSUEDATE","COUPONRATE",
"PRICE","ACCRUED","CASHFLOWS","TODAY")
mybonds <- list(brasil)
class(mybonds) <- "couponbonds"
#Estimating the zero-yield curve
ns_res <-estim_nss.couponbonds(mybonds, 'brasil' ,method = "ns")
#Testing the hypothesis that the error comes from the cashflow matrix
cf_p <- create_cashflows_matrix(mybonds[[1]], include_price = T)
m_p <- create_maturities_matrix(mybonds[[1]], include_price = T)
b <- bond_yields(cf_p,m_p)
Note that I am aware of this question which reports the same problem. However, it is for the dynamic curve. Besides that, there is no useful answer.
Your code has two problems. (1) doesn't name the 1st list (this is the direct reason of the error. But if modifiy it, another error happens). (2) In the cashflows sublist, at least one level of ISIN needs more than 1 data.
# ...
CFISIN <- as.character(c('ltn_2017','ltn_2018', 'ltn_2019',
'ltn_2021', 'ltn_2023', 'ltn_2023')) # added a 6th element
CF <- c(1000,1000,1000,1000,1000, 1000) # added a 6th
DATE <- as.Date(c(42736,43101,43466,44197,44927, 44928), origin='1899-12-30') # added a 6th
CASHFLOWS <- list(CFISIN,CF,DATE)
names(CASHFLOWS) <- c("ISIN","CF","DATE")
TODAY <- as.Date(42646, origin='1899-12-30')
brasil <- list(ISIN,MATURITYDATE,ISSUEDATE,
COUPONRATE,PRICE,ACCRUED,CASHFLOWS,TODAY)
names(brasil) <- c("ISIN","MATURITYDATE","ISSUEDATE","COUPONRATE",
"PRICE","ACCRUED","CASHFLOWS","TODAY")
mybonds <- list(brasil = brasil) # named the list
class(mybonds) <- "couponbonds"
ns_res <-estim_nss.couponbonds(mybonds, 'brasil', method = "ns")
Note: the error came from these lines
bonddata <- bonddata[group] # prepro_bond()'s 1st line (the direct reason).
# cf <- lapply(bonddata, create_cashflows_matrix) # the additional error
create_cashflows_matrix(mybonds[[1]], include_price = F) # don't run

For each possible permutation of factor levels, apply function and also name list of results

Improve the following code by rewriting to be more compact (a one-liner with alply or similar?) Also if it can be made more performant (if possible).
I have a dataframe with several categorical variables, each with various number of levels. (Examples: T1_V4: B,C,E,G,H,N,S,W and T1_V7: A,B,C,D )
For any specific one of those categorical vars, I want to do the following:
Construct all possible level-permutations e.g. using DescTools::Permn()
Then for each level.perm in those level.perms...
Construct a list of function results where we apply some function to level.perm (in my particular case, recode the factor levels using level.perms, then take as.numeric, then compute correlation wrt some numeric response variable)
Finally, name that list with the corresponding string-concatenated values of level.perm (e.g. 'DBCA')
Example at bottom for permutations of A,B,C,D
Reproducible example at bottom:
The following code does this, can you improve on it? (I tried alply)
require(DescTools)
level.perms <- Permn(levels(MyFactorVariable))
tmp <- with(df,
apply( level.perms, 1,
function(var.levels) {
cor(MyResponseVariable,
as.numeric(factor(MyFactorVariable, levels=var.levels)))
})
)
names(tmp) <- apply(level.perms, 1, paste, collapse='')
Example (for CategVar1 with levels A,B,C,D):
ABCD BACD BCAD ACBD CABD CBAD BCDA ACDB
0.031423 0.031237 0.002338 0.002116 -0.026496 -0.026386 -0.008743 -0.009104
CADB CBDA ABDC BADC CDAB CDBA ADBC BDAC
-0.037228 -0.037364 0.048423 0.048075 -0.048075 -0.048423 0.037364 0.037228
BDCA ADCB DABC DBAC DBCA DACB DCAB DCBA
0.009104 0.008743 0.026386 0.026496 -0.002116 -0.002338 -0.031237 -0.031423
Reproducible example using randomly-generated dataframe:
set.seed(120)
df = data.frame(ResponseVar = exp(runif(1000, 0,4)),
CategVar1 = factor(sample(c('A','B','C','D'), 1000, replace=T)),
CategVar2 = factor(sample(c('B','C','E','G','H','N'), 1000, replace=T)) )
cor(as.numeric(df$CategVar1), df$MyResponseVar)
# 0.03142
cor(as.numeric(df$CategVar2), df$MyResponseVar)
# 0.02112
#then if you run the above code you get the above table of correlation values

Resources