How do I code opposite 'levels' or 'factors' in R?

I have browsed through many different pages on the net but cannot seem to find an answer to this question:
How do I code up equally-opposing levels for use in multiple regression in R?
The only way I can currently see to achieve this is to set the data type of these columns to 'numeric'; however, I know this is not entirely accurate, as they should ideally be treated as factors.
Essentially what I am looking for is creating 'dummy' variables that are factors rather than numeric.
By way of example, suppose we have sporting teams opposing each other, where each row records a different game. Team A's presence in the game is indicated by a '1' dummy and Team B's by '-1' (all other teams' values are obviously set to 0 because they are not playing in this game).
I have attempted to change the data type of these columns to factors; however, this results in three levels (1, 0 and -1), and the beta assigned to 1 is never equal and opposite to the one assigned to -1 (while in theory the value assigned to 0 should be zero).
Does anybody know how to achieve this correctly, mathematically and programmatically? Your help would be much appreciated!
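One way to get the equal-and-opposite constraint is to keep the team columns numeric and build the +1/-1/0 design matrix yourself: the coding itself enforces the symmetry, which a three-level factor cannot express. A minimal sketch, where the team names and score margins are invented for illustration:

```r
# Each row is one game; each team gets a numeric column holding +1
# (the "A" side), -1 (the "B" side) or 0 (not playing). These columns
# are kept numeric on purpose -- the +1/-1 coding *is* the constraint
# that a team's effect enters with opposite sign on each side.
games <- data.frame(
  teamA  = c("Arsenal", "Chelsea", "Spurs"),
  teamB  = c("Chelsea", "Spurs", "Arsenal"),
  margin = c(2, -1, 0)              # hypothetical score margins
)

teams <- sort(unique(c(games$teamA, games$teamB)))
X <- matrix(0, nrow = nrow(games), ncol = length(teams),
            dimnames = list(NULL, teams))
X[cbind(seq_len(nrow(games)), match(games$teamA, teams))] <-  1
X[cbind(seq_len(nrow(games)), match(games$teamB, teams))] <- -1

# Fit without an intercept so each column gets its own "strength" beta;
# a team appearing as -1 automatically contributes the opposite sign.
fit <- lm(games$margin ~ X - 1)
coef(fit)
```

This is essentially a Bradley-Terry-style design matrix; a single fitted coefficient per team then enters positively when the team is on one side and negatively on the other.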

Related

How do I get the proportion of variability between 0 and 1

I am working on SVD, and the question is this:
With s being svd(data), the proportion of variability in the data explained by the xth column is equal to s$d[x]^2 divided by the sum of all s$d values.
It asks for the 'proportion' for different column values and requires the answer to be between 0 and 1. I have tried nearly everything, including an endless Google search; I'd be grateful for any kind of help.
(I'm aware it's probably an easy thing to do, but I can't seem to get to it)
Thanks!
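For what it's worth, here is a short sketch on simulated data. Note that the singular values must be squared in the denominator as well; the squaring is what keeps each proportion between 0 and 1 and makes them sum to exactly 1:

```r
set.seed(1)
data <- matrix(rnorm(100 * 10), nrow = 100)
s <- svd(data)

# Proportion of variability explained by each column: squared singular
# value over the sum of ALL squared singular values.
prop <- s$d^2 / sum(s$d^2)

prop[1]                      # proportion for the 1st column
sum(prop)                    # always exactly 1
all(prop >= 0 & prop <= 1)   # always TRUE
```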

When or why convert a numeric variable to factor?

I am quite new to R and sort of learning by myself. I have a data set with 43 variables, and I want to forecast one of them. Some are numeric variables and some are factor variables.
The question is that I don't know when one should convert factors to numerics and vice versa. I found on the internet that you should not keep variables as numeric if they always take integer values within a narrow range (for example, if the values are always between 1 and 7).
One of my variables is "NSM", which represents the number of seconds since midnight for each day. The values are integers and discrete (61200 61800 62400 63600 64200 65400 66000 66600 68400 69000 69600 70800 72000 72600 73200 etc.; you can observe that there is a 600-second step).
They go from 0 to 85800.
So I would like the opinion of someone more experienced than me (I have none). Should I keep NSM numeric, or convert it to a factor and then group the factor values by levels (otherwise I would have 144 levels, which would be too many and not relevant)?
Thank you,
I generally only convert a variable to a factor if one or more of the following are true:
the values of the variable represent some form of grouping, i.e. the variable is categorical in nature.
there are substantial memory savings to be had - this is usually the case where character variables have been used to identify group levels.
the variable is numeric in nature, but is highly non-linear, and there's no better way of entering it into a model than to convert it to a factor with one or two meaningful cut-points chosen.
However, manipulating factor variables can be more fiddly than characters or integers, so I tend to save the factoring until the very end, unless memory pressures force my hand.
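The third case above can be sketched with cut(). Applied to the asker's NSM-style seconds-since-midnight values, a few meaningful cut-points replace 144 raw levels; the break points and labels here are made up for illustration:

```r
nsm <- c(61200, 61800, 62400, 70800, 72000, 85800, 600, 30000)

# Bucket seconds-since-midnight into four parts of the day.
daypart <- cut(nsm,
               breaks = c(0, 6 * 3600, 12 * 3600, 18 * 3600, 24 * 3600),
               labels = c("night", "morning", "afternoon", "evening"),
               include.lowest = TRUE)

table(daypart)   # counts per part of day
```

The result is a factor with a handful of interpretable levels, which a model can use directly where the raw seconds would have entered highly non-linearly.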
I have also been self-learning, just like you, and as per my understanding of this topic, it is better to use factors when we have a limited set of nominal/categorical values, especially in the case of character vectors like gender ("Male", "Female"). This saves us from the comparison errors characters invite, like case sensitivity or spelling mistakes.
Also, internally, factors and integers work in the same way, and if there is a limited set of categorical integer values, then it is suggested to use factors in order to attach more meaning to the data through levels. In your case, my opinion is to use integers rather than factors, as there are too many levels for them to carry any meaningful information.
Finally, you should be the best judge of whether to use factors in your code, as you know where exactly you are going to use them again in your program; some algorithms explicitly demand factors rather than character vectors.

Trying to see how much of a level of a variable is in another level of another variable

I am trying to figure out how to see how much of a certain variable is in another variable. What I mean is that I have two variables in R, both with two levels. I want to see how much of level 2 of Variable 1 is in level 2 of Variable 2, but I cannot figure out which command or lines of code I need in order to get my desired outcome. They are both categorical.
I am new to R, and I have not been able to find any answers yet. I am hoping that you guys will be able to help.
To get more in depth: I am trying to compare smoking to premature birth. I have two variables, habit and premature. Habit has two levels, non-smoker and smoker. Premature has two levels, premature birth and full term. I want to see how many smokers are in the premature births, or vice versa.
I am expecting a number, and it needs to be how much of level 2 of variable 1 is in level 2 of variable 2.
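A two-way contingency table built with table() answers this directly. A sketch with made-up data mirroring the question's variable names:

```r
# Hypothetical observations; in practice these would be columns of
# your data frame, e.g. births$habit and births$premature.
habit     <- factor(c("smoker", "nonsmoker", "smoker", "smoker", "nonsmoker"))
premature <- factor(c("premie", "full term", "premie", "full term", "full term"))

tab <- table(habit, premature)
tab                          # counts for every combination of levels
tab["smoker", "premie"]      # how many smokers had premature births
prop.table(tab, margin = 1)  # proportions within each habit level
```

Indexing the table by level names gives the single number asked for, and prop.table() converts the counts into the "how much of level 2 is in level 2" proportion.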

Make use of available data and neglect missing data for building classifier

I am using the randomForest package in R to build a binary classifier. There are about 30,000 rows, with 14,000 in the positive class and 16,000 in the negative class. I have 15 variables that are known to be important for classification.
I have some additional variables (about 5) with missing information. These variables take the values 1 or 0: 1 means presence of something, but 0 means it is not known whether the thing is present or absent. It is widely known that these variables would be the most important for classification when they are 1 (they increase the reliability of classification, and it is more likely that the sample lies in the positive class), but they are useless when they are 0. And only 5% of the rows have the value 1, so each variable is useful for only 5% of the cases. The 5 variables are independent of each other, so I expect that they will be highly useful for 15-25% of the data I have.
Is there a way to make use of available data but neglect the missing/unknown data present in a single column? Your ideas and suggestions would be appreciated. The implementation does not have to be specific to random forest and R-platform. If this is possible using other machine learning techniques or in other platforms, they are also most welcome.
Thank you for your time.
Regards
I can see at least the following approaches. Personally, I prefer the third option.
1) Discard the extra columns
You can choose to discard those 5 extra columns. Obviously this is not optimal, but it is good to know the performance of this option, to compare with the following.
2) Use the data as it is
In this case, those 5 extra columns are left as they are. The definite presence (1) or unknown presence/absence (0) in each of those 5 columns is used as information. This is the same as saying "if I'm not sure whether something is present or absent, I'll treat it as absent". I know this is obvious, but if you haven't tried this, you should, to compare it to option 1.
3) Use separate classifiers
If around 95% of each of those 5 columns has zeroes, and they are roughly independent of each other, that's 0.95^5 = 77.38% of data (roughly 23200 rows) which has zeroes in ALL of those columns. You can train a classifier on those 23200 rows, removing the 5 columns which are all zeroes (since those columns are equal for all points, they have zero predictive utility anyway). You can then train a separate classifier for the remaining points, which will have at least one of those columns set to 1. For these points, you leave the data as it is.
Then, for your test point, if all those columns are zeroes you use the first classifier, otherwise you use the second.
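A rough sketch of this routing scheme with randomForest. All data here is simulated, the column names (extra1, extra2, x1, x2, class) are placeholders, and only two "extra" columns are used for brevity:

```r
library(randomForest)  # assumed installed

set.seed(42)
n <- 1000
train <- data.frame(
  x1     = rnorm(n), x2 = rnorm(n),      # the "normal" predictors
  extra1 = rbinom(n, 1, 0.05),           # 1 = known present, 0 = unknown
  extra2 = rbinom(n, 1, 0.05),
  class  = factor(rbinom(n, 1, 0.5))
)
extra_cols <- c("extra1", "extra2")
all_zero <- rowSums(train[extra_cols]) == 0

# Classifier 1: rows with no extra information; the constant extra
# columns are dropped from the formula.
fit_zero  <- randomForest(class ~ x1 + x2, data = train[all_zero, ])

# Classifier 2: rows with at least one extra column set to 1;
# the extra columns are kept as predictors.
fit_extra <- randomForest(class ~ ., data = train[!all_zero, ])

# At prediction time, route each row to the matching classifier.
predict_routed <- function(newdata) {
  zero <- rowSums(newdata[extra_cols]) == 0
  out <- factor(rep(NA, nrow(newdata)), levels = levels(train$class))
  if (any(zero))  out[zero]  <- predict(fit_zero,  newdata[zero, ])
  if (any(!zero)) out[!zero] <- predict(fit_extra, newdata[!zero, ])
  out
}
```

With 5 real columns at 5% each, the "all zero" classifier would see roughly 77% of the data, as computed above, and the second classifier the remaining ~23%.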
Other tips
If the 15 "normal" variables are not binary, make sure you use a classifier which can handle variables with different normalizations. If you're not sure, normalize the 15 "normal" variables to lie in the interval [0,1] -- you probably won't lose anything by doing this.
I'd like to add a further suggestion to Herr Kapput's: if you use a probabilistic approach, you can treat "missing" as a value which you have a certain probability of observing, either globally or within each class (not sure which makes more sense). If it's missing, it has probability of occurring p(missing), and if it's present it has probability p(not missing) * p(val | not missing). This allows you to gracefully handle the case where the values have arbitrary range when they are present.

What makes these two R data frames not identical?

I have two small data frames, this_tx and last_tx. They are, in every way that I can tell, completely identical. this_tx == last_tx results in a frame of identical dimensions, all TRUE. this_tx %in% last_tx, two TRUEs. Inspected visually, clearly identical. But when I call
identical(this_tx, last_tx)
I get a FALSE. Hilariously, even
identical(str(this_tx), str(last_tx))
will return a TRUE. If I set this_tx <- last_tx, I'll get a TRUE.
What is going on? I don't have the deepest understanding of R's internal mechanics, but I can't find a single difference between the two data frames. If it's relevant, the two variables in the frames are both factors - same levels, same numeric coding for the levels, both just subsets of the same original data frame. Converting them to character vectors doesn't help.
Background (because I wouldn't mind help on this, either): I have records of drug treatments given to patients. Each treatment record essentially specifies a person and a date. A second table has a record for each drug and dose given during a particular treatment (usually, a few drugs are given each treatment). I'm trying to identify contiguous periods during which the person was taking the same combinations of drugs at the same doses.
The best plan I've come up with is to check the treatments chronologically. If the combination of drugs and doses for treatment[i] is identical to the combination at treatment[i-1], then treatment[i] is a part of the same phase as treatment[i-1]. Of course, if I can't compare drug/dose combinations, that's right out.
Generally, in this situation it's useful to try all.equal, which will give you some information about why two objects are not equivalent.
Well, the jaded cry of "moar specifics plz!" may win in this case:
Check the output of dput() and post if possible. str() just summarizes the contents of an object whilst dput() dumps out all the gory details in a form that may be copied and pasted into another R interpreter to regenerate the object.
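A tiny demonstration of the kind of hidden difference identical() catches while the printed contents match; here the two frames differ only in how the row.names attribute is stored:

```r
a <- data.frame(x = factor(c("p", "q")))
b <- a
rownames(b) <- c("1", "2")   # same visible data, but row.names are now
                             # stored as character rather than integer

identical(a, b)              # FALSE: internal attributes differ
all.equal(a, b)              # reports *which* attribute differs
dput(a)                      # dumps every attribute verbatim...
dput(b)                      # ...so the mismatch is visible side by side
```

Differing factor level order is another common culprit in exactly this situation, and dput() exposes it the same way.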