Here is my problem. I need to implement a multi-target decision tree algorithm. Multi-target learning is an extension of multi-label learning where the labels are not binary but can be continuous, categorical, and so on. For example, a label vector for a multi-label classification problem could look like {1,0,1,0,0,0,1}, while a multi-target one could look like {2,35,3,-2,24}.
My problem is this: if I have a label that takes 3 discrete values, how do I represent it in a vector?
Let's say I have a label called job that takes 3 values: mechanic, teacher and athlete. How can I code this label in order to use it in a vector?
At each node in the decision tree, in order to find my split, I need to compute the mean vector of all the label vectors in that node (I am using the variance equation to find my split). If the labels were binary this would be easy, because adding 0s and 1s poses no problem. If I code these 3 jobs as 0, 1 and 2, there is a problem: adding a label vector that has the label athlete counts more than adding a vector that has the job mechanic, and the mean vector is inaccurate.
Let's take this example. I have these 3 labels:
job: {mechanic,teacher,athlete}
married:{yes,no}
age: continuous value
It is easy to say that the married label can be coded as {0,1} and the age label as a continuous number. But how can I code the job label? Coding it as {0,1,2} causes the following problem. Imagine 2 label vectors in a node: {0,0,45}, which corresponds to mechanic, married and 45 years old, and {2,1,48}, which corresponds to athlete, not married and 48 years old. The mean vector is {1,0.5,46.5}. With this vector I can predict that the age of an instance that falls into this node is 46.5, I can say that the instance is not married (with a rule that says greater than or equal to 0.5 means 1), and I can say that its job is teacher. The teacher job is totally wrong, while the others are OK. You see now the problem with coding categorical labels. Any help or advice? Thanks :D
How about taking all the discrete values of a feature and turning each of them into its own binary feature whenever a feature has more than 2 values? For example:
job: {mechanic, teacher, athlete}
married:{yes, no}
age: continuous value
will result in a 5-dimensional vector:
(mechanic 0/1, teacher 0/1, athlete 0/1, married 0/1, age 0-inf)
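In R, for instance, model.matrix can build those 0/1 columns; here is a quick sketch with a small made-up data.frame (the names and values are just for illustration):

df <- data.frame(job = factor(c("mechanic", "athlete"),
                              levels = c("mechanic", "teacher", "athlete")),
                 married = c("yes", "no"),
                 age = c(45, 48))
# one binary column per job level, married as 0/1, age unchanged
X <- cbind(model.matrix(~ job - 1, df),
           married = as.numeric(df$married == "no"),  # yes = 0, no = 1
           age = df$age)
colMeans(X)  # mean vector: each job column averages to the fraction with that job

The mean vector then stays interpretable: the predicted job is simply the job column with the largest mean, so the "teacher" artifact from the question disappears.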
Hey guys, I am trying to calculate the p-value of individual variables to see if they have an impact when the other variable is set to 0. Here is my code:
quiet_result = aov(overbearing ~ as.factor(Intention)*as.factor(quiet_only), data=df)
summary(quiet_result)
loud_result = aov(overbearing ~ as.factor(Intention)*as.factor(loud_only), data = df)
summary(loud_result)
For context, the Intention variable only has the values -1 and 1: -1 is intentional and 1 is unintentional. quiet_only and loud_only are new columns created from the data set: quiet_only only has the values 0 and 2 (it is the original sound column + 1), and loud_only only has the values -2 and 0 (it is the original sound column - 1). These are therefore ordered variables and are not supposed to be assessed by their actual numerical value like a class variable. However, my code keeps reading them as class variables, even though I changed all the variables to factors to make them ordered. When I run the ANOVAs, they all return the same result: the ANOVA is only reading the change between the Intention and quiet_only/loud_only columns, which would obviously be the same, because nothing actually changes if you subtract or add 1 to a column. I'm trying to find the p-value of the Intention variable with loud_only and with quiet_only, and this p-value should change depending on which one I use. How can I make these into truly ordered variables?
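For reference, this is the kind of conversion I mean (just a sketch, using my column names from above; as far as I understand, as.factor() alone gives an unordered factor):

# explicit ordered factors instead of plain as.factor()
df$Intention  <- factor(df$Intention,  levels = c(-1, 1), ordered = TRUE)
df$quiet_only <- factor(df$quiet_only, levels = c(0, 2),  ordered = TRUE)
df$loud_only  <- factor(df$loud_only,  levels = c(-2, 0), ordered = TRUE)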
Sorry if this doesn't make any sense lol. This is research work for a graduate professor, so it uses concepts that I don't fully understand (I'm an undergrad), and I don't think I explained it very well. Anyway, if any of you have any ideas, that would be great.
I'm working on tuna tag-recapture data. I want to balance my sampling between two groups of individuals: the ones that were tagged in the reference area (treated group) and the ones that were tagged outside this area (control group). To do this, I used the MatchIt package.
I have 3 covariates: length (by 5 cm bins), month of tagging (January to December) and structure on which the tuna was tagged.
So there is the model: treatment ~ length + month + structure
This last variable is categorical with 5 levels, coded A to E. Level A is almost only represented in the treated group (6000 individuals with structure = A, vs. only 300 individuals with structure = A in the control group).
I first used the nearest-neighbour method, but the improvement in balance was not satisfying. So I ran the Exact and Coarsened Exact Matching (CEM) methods.
I thought that exact methods should match pairs with the same values for each covariate. But in the output matched data, there are still more than 3000 individuals with structure = A in the treated group.
Do you guys have an explanation? I read a lot, but I didn't find answers.
Thanks
Exact and coarsened exact matching do not perform 1:1 matching. They find all members in the control group that exactly match each member in the treated group. Subclasses are formed based on each combination of the predictor values, and any subclass that has both treated and control units is retained, and others dropped. There is no pairing that takes place. Your results indicate that you have many control units that have identical (or near-identical in the case of CEM) values of the covariates as some treated units.
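For illustration, a minimal sketch of the exact-matching call (assuming your data.frame is called df and the treatment indicator is named treatment):

library(MatchIt)
# exact matching: subclasses are formed from identical covariate combinations
m <- matchit(treatment ~ length + month + structure,
             data = df, method = "exact")
summary(m)                # balance diagnostics per covariate
matched <- match.data(m)  # retained units, with subclass and weights columns

Note that the matched data keeps whole subclasses with weights, not 1:1 pairs, which is why the treated count for structure = A can stay large.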
I am using the read.dta13 function from the readstata13 package to load data. There are a bunch of categorical variables with Stata value labels in the data set. The data set looks like this in Stata:
cohort year age gender income health migration
1101 2010 35 F 13034 healthy yes
1102 2010 54 M 34134 unhealthy no
For gender, health and migration, the original values are numeric; for example, gender = 1 for male. In Stata, for ease of understanding, I added value labels to the categorical variables using label define, so the data shows as above, but the original numeric values are kept. Now let's go to R. If I simply type
mydata <- read.dta13("mydata_stata13.dta")
I get a lot of warnings like these
Factor codes of type double or float detected - no labels assigned.
Set option nonint.factors to TRUE to assign labels anyway.
All the value labels I added in Stata are dropped, which is what I need in R. The problem is that R gives these warnings even for some variables that should be taken as numeric, for example income. I don't want to set nonint.factors = TRUE, since I need the numeric values of the categorical variables for the calculations.
It's not actually an error, but I would like to know whether it is safe to just ignore the warnings.
As the warning states, there are doubles or floats with value labels assigned. I assume you created a categorical variable without telling Stata to store it as a byte. readstata13 gives you a warning because it cannot tell whether floats/doubles with value labels are categorical or continuous variables.
Let's say gender is the wrongly stored variable. I assume the person who coded the variables in Stata created it as:
gen gender = *expr*
instead of
gen byte gender = *expr*
This can be solved either by always prefixing categorical variables with gen byte, or by running compress (see Stata's manual) before saving/exporting the whole dataset. You can detect which variables are wrongly coded by running describe and checking for value label assignments on non-byte variables. This will in turn store your data more efficiently.
In addition, I assume that at some point the same person accidentally added a value label to a "true" float variable, like income. Check the labelbook command to correct such problems.
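If, after those checks, the labels on the float variables really are spurious and you only need the numeric values, the warnings themselves are harmless; a sketch of silencing them in base R (by default nonint.factors = FALSE, so the labels are dropped and the values stay numeric):

library(readstata13)
# warnings about dropped labels on non-integer variables are suppressed
mydata <- suppressWarnings(read.dta13("mydata_stata13.dta"))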
Hi, I am having trouble understanding the workings of the k-nearest neighbor algorithm, specifically when trying to implement it in code. I am implementing this in R, but I just want to understand the process; I'm not so worried about the code itself. I will post what I have, my data, and my questions:
Training Data (just a portion of it):
Feature1 | Feature2 | Class
2 | 2 | A
1 | 4 | A
3 | 10 | B
12 | 100 | B
5 | 5 | A
So far in my code:
kNN <- function(trainingData, sampleToBeClassified) {
  # file input
  train <- read.table(trainingData, sep = ",", header = TRUE)
  # get the classes (just the class column)
  labels <- as.matrix(train[, ncol(train)])
  # get the features as a matrix (every column but the class column)
  features <- as.matrix(train[, 1:(ncol(train) - 1)])
}
And for this I am calculating the "distance" using this formula:
distance <- function(x1, x2) {
  return(sqrt(sum((x1 - x2)^2)))
}
So is the process for the rest of the algorithm as follows?
1. Loop through every data point (in this case, every row of the 2 feature columns) and calculate its distance to the sampleToBeClassified?
2. In the starting case where I want 1-nearest-neighbor classification, would I just store the instance that has the least distance to my sampleToBeClassified?
3. Whatever the closest instance is, find out what class it is, and that class becomes the class of the sampleToBeClassified?
My main question is: what role do the features play? My instinct is that the two features together are what define a data item as belonging to a certain class, so what exactly should I be calculating the distance between?
Am I on the right track at all?
Thanks
It looks as though you're on the right track. The three steps in your process are correct for the 1-nearest-neighbor case. For kNN in general, you just need to keep a list of the k nearest neighbors and then determine which class is most prevalent in that list.
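Putting your three steps together, a minimal sketch of the whole function (hedged: ties, input checks and the case where k exceeds the number of rows are ignored):

kNN <- function(trainingData, sampleToBeClassified, k = 1) {
  train <- read.table(trainingData, sep = ",", header = TRUE)
  labels <- train[, ncol(train)]
  features <- as.matrix(train[, 1:(ncol(train) - 1)])
  # distance from the sample to every training instance (whole feature vectors)
  dists <- apply(features, 1, function(row) sqrt(sum((row - sampleToBeClassified)^2)))
  # classes of the k nearest instances
  nearest <- labels[order(dists)[1:k]]
  # majority vote: the most prevalent class among the k neighbors
  names(which.max(table(nearest)))
}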
As for features, these are just attributes that define each instance and (hopefully) give an indication of which class it belongs to. For instance, if we're trying to classify animals, we could use height and mass as features. So if we have an instance of the class elephant, its height might be 3.27 m and its mass 5142 kg. An instance of the class dog might have a height of 0.59 m and a mass of 10.4 kg. In classification, if we get something that's 0.8 m tall and has a mass of 18.5 kg, we know it's more likely to be a dog than an elephant.
Since we're only using 2 features here, we can easily plot them on a graph, with one feature on the X-axis and the other on the Y-axis (it doesn't really matter which), and the different classes denoted by different colors or symbols. If you plot the sample of your training data above, it's easy to see the separation between classes A and B.
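For example (a quick sketch, assuming the train data.frame from your code above):

# one feature per axis, classes distinguished by color
plot(train$Feature1, train$Feature2,
     col = as.integer(factor(train$Class)), pch = 19,
     xlab = "Feature1", ylab = "Feature2")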
I want to run a nested logistic regression in R, but the examples I found online didn't help much. I read over an example from this website (Step by step procedure on how to run nested logistic regression in R) which is similar to my problem, but it seems it was never resolved (the questioner reported errors and I didn't see further answers).
So I have 9 predictors (continuous scores) and 1 categorical dependent variable (DV). The DV is called "effect", and it can be divided into 2 general categories: "negative (0)" and "positive (1)". I know how to run a simple binary logit regression (using the general grouping, i.e., negative (0) vs. positive (1)), but this is not enough. "Positive" can be further divided into two types: "physical (1)" and "mental (2)". So I want to run a nested model that includes these 3 categories (negative (0), physical (1) and mental (2)) and reflects the fact that "physical" and "mental" are nested within "positive". Maybe R can compare these two models (general vs. detailed) together? So I created two new columns: one called "effect general", which contains "negative (0)" and "positive (1)", and another called "effect detailed", which contains 3 values: negative (0), physical (1) and mental (2). I ran a simple binary logit regression using only "effect general", but I don't know how to run a nested logit model for "effect detailed".
From the example I found and other materials, the R package mlogit seems right, but I'm stuck on how to make it work for my data. I don't quite understand the examples in R-help, and this part of the example from the website I mentioned earlier (...shape='long', alt.var='town.list', nests=list(town.list)...) confuses me: I can see that my data shape should be 'wide', but I have no idea what alt.var and nests are...
I also looked at page 19 of the mlogit manual for examples of nested logit model calls, but I still cannot decide which options I need. (http://cran.r-project.org/web/packages/mlogit/mlogit.pdf)
Could someone provide me with detailed steps and notes on how to do it? I'm sure this example (if well discussed and resolved) is also going to help me and others a lot!
Thanks for your help!!!
I can help you with understanding the mlogit structure. When using the mlogit.data() command, specify choice = yourchoicevariable (and id.var = respondentid if you have a panel dataset, i.e. multiple responses from the same individual), along with the shape='wide' argument. The new data.frame created will be in long format, with one line for each choice situation: negative, physical, mental. So you will have 3 rows where you had one in the wide format. Whatever your MN choice variable is, it will now be a column of logical values, with TRUE for the row that the respondent chose. The row names will now be in the format observation#.level(choice variable). So if in the first row of your dataset the person had a response of negative, you would see:
row.name | choice
1.negative | TRUE
1.physical | FALSE
1.mental | FALSE
Also note that the actual factor level for each choice is stored in an index called alt of the mlogit data.frame, which you can see with index(your.data.frame), and the observation number (i.e. the row number from your wide-format data.frame) is stored in chid. That is in essence what the row name is telling you: chid.alt. Also note you DO NOT have to specify alt.var if your data is in wide format, only in long format. The mlogit.data function does that for you, as just described: essentially, it takes unique(choice) when you specify your choice variable and creates the alt.var for you, so it is redundant when your data is in wide format.
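Putting the reshaping together, a minimal sketch (assuming your wide data.frame is called df and the 3-level response column is named effect_detailed):

library(mlogit)
# wide -> long: one row per alternative (negative/physical/mental) per observation
long <- mlogit.data(df, choice = "effect_detailed", shape = "wide")
head(index(long))  # the chid and alt indices described above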
You then specify the nests by adding to the mlogit() command a named list of the nests like this, assuming your factor levels are just '0','1','2':
mlogit(..., nests = list(negative = c('0'), positive = c('1','2')))
or, if the factor levels were 'negative', 'physical' and 'mental', it would be like this:
mlogit(..., nests = list(negative = c('negative'), positive = c('physical','mental')))
Also note that a nest of one element still MUST be specified with a c() argument, per the package documentation. The resulting model will then have the IV (inclusive value) estimate between nests if you specify the un.nest.el=TRUE argument, or nest-specific estimates if un.nest.el=FALSE.
You may find Kenneth Train's Examples useful