Make use of available data and neglect missing data for building classifier - r

I am using the randomForest package in R to build a binary classifier. There are about 30,000 rows, with roughly 14,000 in the positive class and 16,000 in the negative class. I have 15 variables that are known to be important for classification.
I also have about 5 additional variables with missing information. These variables take the values 1 or 0: 1 means something is present, while 0 means it is not known whether it is present or absent. These variables are widely known to be the most important ones for classification when they equal 1 (they increase the reliability of the classification and make it much more likely that the sample lies in the positive class), but they are useless when they equal 0. Only 5% of the rows have the value 1, so each variable is useful for only 5% of the cases. The 5 variables are independent of each other, so I expect them to be highly useful for 15-25% of my data.
Is there a way to make use of the available data while ignoring the missing/unknown values within a single column? Your ideas and suggestions would be appreciated. The implementation does not have to be specific to random forests or the R platform; if this is possible using other machine learning techniques or on other platforms, those are also most welcome.
Thank you for your time.
Regards

I can see at least the following approaches. Personally, I prefer the third option.
1) Discard the extra columns
You can choose to discard those 5 extra columns. Obviously this is not optimal, but it is good to know the performance of this option, to compare with the following.
2) Use the data as it is
In this case, those 5 extra columns are left as they are. The definite presence (1) or unknown presence/absence (0) in each of those 5 columns is used as information. This is the same as saying "if I'm not sure whether something is present or absent, I'll treat it as absent". I know this is obvious, but if you haven't tried this, you should, to compare it to option 1.
3) Use separate classifiers
If around 95% of each of those 5 columns has zeroes, and they are roughly independent of each other, that's 0.95^5 = 77.38% of data (roughly 23200 rows) which has zeroes in ALL of those columns. You can train a classifier on those 23200 rows, removing the 5 columns which are all zeroes (since those columns are equal for all points, they have zero predictive utility anyway). You can then train a separate classifier for the remaining points, which will have at least one of those columns set to 1. For these points, you leave the data as it is.
Then, for your test point, if all those columns are zeroes you use the first classifier, otherwise you use the second.
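A minimal sketch of this third option with randomForest (the data frame dat, the factor response class, and the indicator names e1 ... e5 are placeholders, not names from the question):
library(randomForest)

extra_cols <- c("e1", "e2", "e3", "e4", "e5")   # the 5 presence/unknown indicators
all_zero   <- rowSums(dat[, extra_cols]) == 0    # rows with no extra information

# Classifier 1: rows where every extra column is 0, with those columns dropped
rf_zero <- randomForest(class ~ .,
                        data = dat[all_zero, setdiff(names(dat), extra_cols)])

# Classifier 2: rows with at least one 1, keeping all columns
rf_rest <- randomForest(class ~ ., data = dat[!all_zero, ])

# Route each test row to the classifier that matches its pattern of zeroes
predict_split <- function(newdata) {
  zero <- rowSums(newdata[, extra_cols]) == 0
  out  <- factor(rep(NA, nrow(newdata)), levels = levels(dat$class))
  if (any(zero))  out[zero]  <- predict(rf_zero, newdata[zero, , drop = FALSE])
  if (any(!zero)) out[!zero] <- predict(rf_rest, newdata[!zero, , drop = FALSE])
  out
}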
Other tips
If the 15 "normal" variables are not binary, make sure you use a classifier which can handle variables with different normalizations. If you're not sure, normalize the 15 "normal" variables to lie in the interval [0,1] -- you probably won't lose anything by doing this.

I'd like to add a further suggestion to Herr Kapput's: if you use a probabilistic approach, you can treat "missing" as a value that you observe with a certain probability, either globally or within each class (I'm not sure which makes more sense). If the value is missing, it occurs with probability p(missing); if it is present, its probability is p(not missing) * p(val | not missing). This lets you gracefully handle the case where the values have an arbitrary range when they are present.
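A rough sketch of that idea for a single column, in the spirit of a naive Bayes factor (the function and argument names are assumptions, not something from the thread; p_missing and p_val_given_present would be estimated per class):
# Per-class contribution of one feature (x is the column, 0 = unknown)
feature_lik <- function(x, p_missing, p_val_given_present) {
  ifelse(x == 0,
         p_missing,                               # unknown: just p(missing)
         (1 - p_missing) * p_val_given_present)   # observed: p(not missing) * p(val | not missing)
}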


R - select cases so that the mean of a variable is some given number

I previously worked on a project where we examined some sociological data. I did the descriptive statistics and after several months, I was asked to make some graphs from the stats.
I made the graphs, but something seemed odd, and when I compared the graphs to the numbers in the report, I noticed that they were different. Investigating further, I found that my cleaning code (which removes participants with duplicate IDs) now yields more rows, i.e. more participants with unique IDs than before: I now have 730 participants, whereas previously there were 702. I don't know whether this was due to updates of some packages, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know whether there is a method that lets the user filter cases so that the mean of some variable equals a given number. Ideally it would look something like this, though of course I know it won't work in this form:
iris %>%
filter_if(mean(.$Petal.Length) == 1.3)
I know this attempt is incorrect, but I don't know any other way to approach it, so I am looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean
new_sum <- 730 * new_mean
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
Now, using the functions described in my answer to this question, you can find the optimal solution to partitioning your variable to match these two proportions. The rows which end up in the "extra" partition are likely to be the new ones. Even if they aren't the new ones, you will be left with a sample that has a mean that differs from your original by less than one part in a million.
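The helper functions from that linked answer aren't reproduced here, but as a crude stand-in, a randomised search for 28 rows whose sum comes closest to extra_sum would look something like this (dat and the column x are placeholder names):
set.seed(42)
best_idx  <- NULL
best_diff <- Inf
for (i in seq_len(1e5)) {
  idx <- sample(nrow(dat), 28)               # candidate set of 28 "extra" rows
  d   <- abs(sum(dat$x[idx]) - extra_sum)    # how far their sum is from the target
  if (d < best_diff) {
    best_diff <- d
    best_idx  <- idx
  }
}
dat[best_idx, ]   # best candidate for the 28 new participants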

How do I code opposite 'levels' or 'factors' in R

I have browsed through many different pages on the net but cannot seem to find an answer to this question:
How do I code up equally-opposing levels for use in multiple regression in R?
The only way I can currently see of achieving this is to set the data type of these columns to 'numeric'; however, I know this is not an entirely accurate way of doing it, as they should ideally be treated as factors.
Essentially, what I am looking for is a way to create 'dummy' variables that are factors rather than numeric.
By way of example, suppose we have sporting teams opposing each other, where each row represents a different game: Team A's presence in the game is indicated by a dummy value of 1 and Team B's presence by -1 (and all other teams' values are set to 0 because they are not playing in that game).
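For concreteness, a toy version of what such a design might look like (team names and outcomes invented here):
# Each row is a game: the first team gets +1, its opponent -1, everyone else 0
games <- data.frame(
  TeamA  = c( 1,  0, -1),
  TeamB  = c(-1,  1,  0),
  TeamC  = c( 0, -1,  1),
  margin = c( 3, -1,  2)   # e.g. the outcome being modelled
)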
I have attempted to change the data type of these columns to factors; however, this results in three levels (1, 0 and -1), and the beta assigned to 1 is never exactly opposite to the one assigned to -1 (and in theory the coefficient assigned to 0 should be zero).
Does anybody know how to achieve this correctly, both mathematically and programmatically? Your help would be much appreciated!

calculating DFT coefficient

I have a time series containing 256 integer values. It looks like this: [plot of the time series]
I have calculated an STFT (Short-Time Fourier Transform) with this code in R:
s <- stft(datalist, win = min(80, floor(length(datalist)/10)),
          inc = min(24, floor(length(datalist)/30)),
          coef = 256, wtype = "hanning.window")
As a result, I get a matrix with 29 rows and 256 columns. If I plot one row of this matrix (e.g. the 10th row), I see a diagram like this: [plot of one STFT row]
But I expected the coefficient diagram to look like the first diagram (only in another dimension).
Should I use another package in R to do this job, or is my understanding wrong?
I guess you are using the stft function from the GENEAread package. In your case, the call is basically
s<-stft(datalist, win=25, inc=8, coef=256, wtype="hanning.window")
So the way I read it, you are taking 25 samples but computing 256 coefficients from them. The documentation states that the maximal (reasonable) value for coef is win/2, presumably due to the Nyquist-Shannon sampling theorem. So all but the first 12 or so coefficients will be mostly bogus, and those first few coefficients are off the scale of your plot, so we can't say anything about them either.
I don't know where your expectation came from, and I don't share it. But I also believe there are some more fundamental problems with how you expect this to work.
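If you just want every returned coefficient to be meaningful, one possible adjustment to the call above (keeping the same window and step, which are not necessarily the right choices) would be to cap coef at roughly win/2:
# Same window and step, but only win/2 = 12 coefficients requested
s <- stft(datalist, win = 25, inc = 8, coef = 12, wtype = "hanning.window")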

A Neverending cforest

How can I decouple the time cforest/ctree takes to construct a tree from the number of columns in the data?
I thought the option mtry could be used to do just that, i.e. the help says
number of input variables randomly sampled as candidates at each node for random forest like algorithms.
But while that does randomize the output trees, it doesn't decouple the CPU time from the number of columns, e.g.
p <- proc.time()
ctree(gs.Fit ~ .,
      data = Aspekte.Fit[, 1:60],
      controls = ctree_control(mincriterion = 0,
                               maxdepth = 2,
                               mtry = 1))
proc.time() - p
takes twice as long as the same call with Aspekte.Fit[, 1:30] (by the way, all variables are boolean). Why? Where does it scale with the number of columns?
As I see it the algorithm should:
At each node randomly select two columns.
Use them to split the response. (no scaling because of mincriterion=0)
Proceed to the next node (for a total of 3 due to maxdepth=2)
without being influenced by the column total.
Thanks for pointing out the error of my ways.

What makes these two R data frames not identical?

I have two small data frames, this_tx and last_tx. They are, in every way that I can tell, completely identical. this_tx == last_tx results in a frame of identical dimensions, all TRUE. this_tx %in% last_tx, two TRUEs. Inspected visually, clearly identical. But when I call
identical(this_tx, last_tx)
I get a FALSE. Hilariously, even
identical(str(this_tx), str(last_tx))
will return a TRUE. If I set this_tx <- last_tx, I'll get a TRUE.
What is going on? I don't have the deepest understanding of R's internal mechanics, but I can't find a single difference between the two data frames. If it's relevant, the two variables in the frames are both factors - same levels, same numeric coding for the levels, both just subsets of the same original data frame. Converting them to character vectors doesn't help.
Background (because I wouldn't mind help on this, either): I have records of drug treatments given to patients. Each treatment record essentially specifies a person and a date. A second table has a record for each drug and dose given during a particular treatment (usually, a few drugs are given each treatment). I'm trying to identify contiguous periods during which the person was taking the same combinations of drugs at the same doses.
The best plan I've come up with is to check the treatments chronologically. If the combination of drugs and doses for treatment[i] is identical to the combination at treatment[i-1], then treatment[i] is a part of the same phase as treatment[i-1]. Of course, if I can't compare drug/dose combinations, that's right out.
Generally, in this situation it's useful to try all.equal, which will give you some information about why the two objects are not equivalent.
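A minimal illustration with toy data (not the asker's frames) of one common cause when both frames are subsets of the same original data frame: the row names differ, which == ignores but identical() does not:
df <- data.frame(drug = factor(c("a", "b", "a", "b")),
                 dose = c(10, 20, 10, 20))
this_tx <- df[1:2, ]   # keeps row names 1 and 2
last_tx <- df[3:4, ]   # keeps row names 3 and 4

this_tx == last_tx            # all TRUE: element-wise values match
identical(this_tx, last_tx)   # FALSE: the row.names attributes differ
all.equal(this_tx, last_tx)   # describes the attribute mismatch instead of just FALSE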
Well, the jaded cry of "moar specifics plz!" may win in this case:
Check the output of dput() and post if possible. str() just summarizes the contents of an object whilst dput() dumps out all the gory details in a form that may be copied and pasted into another R interpreter to regenerate the object.
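For instance, on a tiny stand-in frame (not the asker's data):
d <- data.frame(id = 1:2)
str(d)    # compact summary; attributes such as row.names are not spelled out
dput(d)   # full structure(...) representation, including class and row.names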
