My data has 1,785,000 records with 271 features. I'm trying to reduce the number of features used to build the model.
Q1. While exploring the data I found that some features are almost entirely missing: for example, only 25 records have a value for one feature and all the other records are missing. I thought such a feature is not informative enough and that it is better to eliminate it. Am I right? And if so, at what level can I do that: if 90%, 80%, etc. of a feature's values are missing, when can I decide to get rid of it? (Take into consideration that the dependent variable is N/Y and only 1.157% of the whole data belongs to Y.)
Q2. For each individual in the dataset there are 64 trait_type features listed, where each one can take one of the values [1, 3 or 5]. My question is: if some trait_type takes only the value [5] or missing data for all records, does it have any value, or can we again eliminate that feature?
Q3. If the choice is to delete these features, how do I delete a column from a data.frame in R?
Thank you
Update:
I'm trying to use caret package to do the variable selection.
I applied this:
ctrl <- rfeControl(functions = lmFuncs, method = "cv", verbose = FALSE, returnResamp = "final")
lmprofile<- rfe(x,y, sizes = subsets, rfeControl = ctrl)
where x is the data.frame that holds the 270 independent variables (the predictors) and y is the factor holding the dependent variable, which has the values Y/N. I got this error:
Error in { :
task 1 failed - "contrasts can be applied only to factors with 2 or more levels"
In addition: There were 11 warnings (use warnings() to see them)
any help please?
Just because much of the data in one column is missing doesn't mean that column will not be predictive. It is much the same as having many rows with the same value in that column.
Of course there is a cutoff: if that column can only help you distinguish between a few cases (out of many), then it can be removed, and removing it should affect overall model strength only a little.
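As a rough sketch of that cutoff idea in base R (the toy columns and the 0.95 threshold below are assumptions picked for illustration, not a recommended value): compute the missing fraction per column, flag columns that hold fewer than two distinct observed values, and drop both kinds by name.

```r
set.seed(1)
n <- 1000

# Toy data standing in for the real data.frame (all names are hypothetical)
dat <- data.frame(
  mostly_missing = c(rnorm(25), rep(NA, n - 25)),  # only 25 observed values
  constant       = c(rep(5, 600), rep(NA, 400)),   # only ever 5 or missing
  useful         = rnorm(n)
)
dat$y <- factor(ifelse(dat$useful + rnorm(n) > 2.5, "Y", "N"))

# Fraction of missing values in each column
miss_frac <- colMeans(is.na(dat))

# Number of distinct observed (non-missing) values in each column
n_distinct <- sapply(dat, function(col) length(unique(col[!is.na(col)])))

# Assumed rule: drop columns that are >95% missing or carry a single value
drop_cols <- names(dat)[miss_frac > 0.95 | n_distinct < 2]

# Deleting columns from a data.frame: keep everything not in drop_cols
dat_reduced <- dat[, !(names(dat) %in% drop_cols), drop = FALSE]
names(dat_reduced)
```

A column with a single observed value is also the kind of input that can trigger "contrasts can be applied only to factors with 2 or more levels" once it is treated as a factor, so checking n_distinct before modelling may be worthwhile.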
To help you decide whether to keep the column, you could build a univariate model with it, where the dataset includes just that column and the dependent variable, and look at the strength of that model. If it's not much better than random, then it's probably safe to drop the column.
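A minimal sketch of such a univariate check on simulated data (the variable names, the 5% event rate, and the choice to model a "value is missing" indicator instead of the raw values are all assumptions made for illustration):

```r
set.seed(1)
n <- 2000
x <- c(rnorm(200), rep(NA, n - 200))  # feature that is 90% missing
y <- rbinom(n, 1, 0.05)               # rare binary outcome, simulated independently of x

# glm() drops rows with NA by default, so encode "is the value missing?" instead
is_missing <- as.integer(is.na(x))

null_fit <- glm(y ~ 1, family = binomial)           # intercept only
uni_fit  <- glm(y ~ is_missing, family = binomial)  # the single feature

# Likelihood-ratio test: does the feature improve on the intercept-only model?
lrt <- anova(null_fit, uni_fit, test = "Chisq")
print(lrt)
```

Because y is simulated without any link to x here, the test would not be expected to show a real improvement; on real data a clearly significant result would be one argument for keeping the column.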
Hi y'all I'm fairly new to R and I'm supposed to calculate F statistic for this table
The code I have inputted is as follows:
# F-test
res.ftest <- var.test(TotalLength ~ SwimSpeed , data = my_data)
res.ftest
I know I have more than two levels from the other posts I have read online, but I am not sure what to change to get the outcome I want.
FIRST AND FOREMOST...If you invoke
?var.test()
you will note that the S3 version you called assumes lhs is numeric and rhs is a 2-level factor.
As for the rest: I don't know the exact wording of your work/school assignment, but it probably shouldn't be "calculate an F-test". It should be "analyze these data appropriately". While there are a number of routes you could take, this is normally seen as a regression problem, NOT a problem of comparing the variances of two groups, which is what var.test() is designed to do. (Reading the documentation at, for example, https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/var.test should make this clear, and reading it is something you should always do when invoking R procedures.)
Using a subset of your data (please do this yourself for stack helpers next time rather than make someone here do it for you)...
df <- data.frame(
  ID = 1:4,
  TL = c(27.1, 29.0, 33.0, 29.3),
  SS = c(86.6, 62.4, 63.8, 62.3)
)
cor.test(df$TL,df$SS) # reports t statistic
# or
summary(lm(df$TL ~ df$SS)) # reports F statistic
Note that F is simply t^2 here in the 2 variable case.
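A quick check of that relationship, re-using the same four rows (a sketch; the object names t_stat and f_stat are made up here):

```r
df <- data.frame(
  TL = c(27.1, 29.0, 33.0, 29.3),
  SS = c(86.6, 62.4, 63.8, 62.3)
)

# t statistic from the correlation test
t_stat <- unname(cor.test(df$TL, df$SS)$statistic)

# F statistic from the simple linear regression
f_stat <- unname(summary(lm(TL ~ SS, data = df))$fstatistic["value"])

# With a single predictor the two agree up to floating-point error
all.equal(t_stat^2, f_stat)
```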
Lastly, I should add it is remotely, vaguely possible the assignment is to check if the variances of the 2 distributions are equal even though I can see no reason why anyone would want to know considering they are 2 different measures on two different underlying scales measuring 2 different things. However,
var.test(df$TL, df$SS)
will return a "result" should you take the assignment to mean compare the observed variances.
I have mirrored some code to perform an analysis, and everything is working correctly (I believe). However, I am trying to understand a few lines of code related to splitting the data up into 40% testing and 60% training sets.
To my current understanding, the code randomly assigns each row to group 1 or 2. Subsequently, all the rows assigned to 1 are pulled into the training set, and the 2's into the testing set.
Later, I realized that sampling with replacement is not what I wanted for my data analysis, although in this case I am unsure of what is actually being replaced. Currently, I do not believe it is the actual data itself being replaced, but rather the "1" and "2" placeholders. I am looking to understand exactly how these lines of code work. Based on my results, the code seems to be accomplishing what I want, but I need to confirm whether or not the data itself is being replaced.
To test the lines in question, I created a dataframe with 10 unique values (1 through 10).
If the data values themselves were being sampled with replacement, I would expect to see some duplicates in "training1" or "testing2". I ran these lines of code 10 times with 10 different set.seed numbers and the data values were never duplicated. To me, this suggests the data itself is not being replaced.
If I set replace= FALSE I get this error:
Error in sample.int(x, size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
set.seed(8)
test <-sample(2, nrow(df), replace = TRUE, prob = c(.6,.4))
training1 <- df[test==1,]
testing2 <- df[test==2,]
I'd like to split my data 60/40 into training and testing sets, but I am not sure that this is actually happening. I think the prob argument is not doing what I thought it should. I've noticed that it does not split the data into exactly 60 percent and 40 percent. In the n=10 example, it can result in 7 training and 3 testing rows, or 6 training and 4 testing. With my actual larger dataset (n of roughly 2000+), it averages out to be pretty close to 60/40 (i.e., 60.3/39.7).
The way you are sampling is bound to produce a random split size: each row is assigned to a group independently, so the proportions only approach 60/40 when the number of observations is large (this is the law of large numbers). To make the split size deterministic, decide on the number of observations for the training data and use it to sample from nrow(df):
set.seed(8)
# for a 60/40 train/test split
train_indx <- sample(x = 1:nrow(df),
                     size = 0.6*nrow(df),
                     replace = FALSE)
train_df <- df[train_indx,]
test_df <- df[-train_indx,]
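To see the effect on a small case, here is the same idea on a 10-row toy data frame (a sketch; the column name value and the use of floor() to force a whole-number size are additions made here, not part of the code above):

```r
set.seed(8)
df <- data.frame(value = 1:10)

# Fix the training size up front; floor() guards against fractional sizes
n_train <- floor(0.6 * nrow(df))

# Draw row indices without replacement, so no row can be picked twice
train_indx <- sample(seq_len(nrow(df)), size = n_train, replace = FALSE)

train_df <- df[train_indx, , drop = FALSE]
test_df  <- df[-train_indx, , drop = FALSE]

nrow(train_df)                                    # 6
nrow(test_df)                                     # 4
length(intersect(train_df$value, test_df$value))  # 0, the sets do not overlap
```

The split is exactly 6/4 for any seed, because only which rows land in each set is random, not how many.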
I recommend splitting the data based on Mankind_008's answer. Since I ran quite a bit of analysis based on the original code, I spent a few hours looking into what exactly it does.
The original code:
test <-sample(2, nrow(df), replace = TRUE, prob = c(.6,.4))
Answer From ( https://www.datacamp.com/community/tutorials/machine-learning-in-r ):
"Note that the replace argument is set to TRUE: this means that you assign a 1 or a 2 to a certain row and then reset the vector of 2 to its original state. This means that, for the next rows in your data set, you can either assign a 1 or a 2, each time again. The probability of choosing a 1 or a 2 should not be proportional to the weights amongst the remaining items, so you specify probability weights. Note also that, even though you don’t see it in the DataCamp Light chunk, the seed has still been set to 1234."
One of my main concerns was that the data values themselves were being replaced. Rather, it seems that replace = TRUE only allows the 1 and 2 placeholders to be drawn again for every row, according to the given probabilities.
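This can be checked directly on the 10-row test case (a sketch; labels is a name introduced here for what the original code calls test):

```r
set.seed(8)
df <- data.frame(value = 1:10)

# sample(2, ...) draws from the labels 1 and 2, one draw per row of df.
# replace = TRUE is needed because 10 draws are taken from only 2 labels.
labels <- sample(2, nrow(df), replace = TRUE, prob = c(0.6, 0.4))

training1 <- df[labels == 1, , drop = FALSE]
testing2  <- df[labels == 2, , drop = FALSE]

# Every row gets exactly one label, so no data value can appear twice
anyDuplicated(c(training1$value, testing2$value))  # 0
nrow(training1) + nrow(testing2) == nrow(df)       # TRUE

# The group sizes themselves are random and vary with the seed
table(labels)
```

This also explains the error with replace = FALSE: without replacement, at most 2 labels could be drawn, which is fewer than the number of rows.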
I am working with a large federal dataset with thousands of observations and thousands of variables. Replicate weights are provided. I am using the "survey" package in R to apply these weights:
els.weighted = svrepdesign(data = els, repweights = ~els$F3F1PNLWT, combined.weights = TRUE)
I am interested in some categorical descriptive characteristics of a subset of the population, such as family living arrangements. I want to get these sorted out into a contingency table that shows frequency. I would like to sort people based on four variables (none of which are binary, but all of which are numeric) This is what I would like to get:
The blank boxes are where the cross-tabulation/frequency counts would show. (I only put in 3 columns beneath F1FCOMP for brevity's sake, but it has 9 outcomes, indexed 1-9.)
My current code: svyby(~F1FCOMP, ~F1RTRCC +BYS33C +F1A10 +byurban, els.weighted, svytotal)
This code does sort the data, but it sorts every single combination, by default. I want them pared down to represent only specific subpopulations of each variable. I tried:
svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C==1 +F1A10==2 | F1A10==3 +byurban==3, els.weighted, svytotal)
But got stopped:
Error: unexpected '==' in "svyby(~F1FCOMP, ~F1RTRCC==2 |F1RTRCC==3 +BYS33C=="
Additionally, my current version of the code tells me how many cases occur for each combination. This is a picture of what my current output looks like. There are hundreds more rows, one for each combination, when I keep scrolling down.
You can see in that picture that I only get one number for F1FCOMP per row: the number of cases that fit the specified combination, i.e. a specific subpopulation. I want to know more about that subpopulation. That is, F1FCOMP has nine different outcomes (indexed 1-9), and I want to see how many cases of each subpopulation fall into each of the 9 outcomes of F1FCOMP.
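For what it's worth, the "unexpected '=='" message looks like a plain R parsing issue and not something specific to svyby(): `+` binds tighter than `==`, so `F1RTRCC==3 +BYS33C==1` is read as a chain of two comparisons, which R's grammar does not allow. The sketch below reproduces that in base R and then shows the subset-first idea on a toy, unweighted data frame. The toy data, and the reading of the conditions as "all must hold" (AND), are assumptions; only some of the variables are included.

```r
# 1. A shortened version of the 'by' formula fails to parse at all
bad <- "~ F1RTRCC == 2 | F1RTRCC == 3 + BYS33C == 1"
parse_ok <- !inherits(try(parse(text = bad), silent = TRUE), "try-error")
parse_ok  # FALSE

# 2. Toy stand-in for the real data (values are random, names follow the post)
set.seed(1)
toy <- data.frame(
  F1RTRCC = sample(1:4, 200, replace = TRUE),
  BYS33C  = sample(0:1, 200, replace = TRUE),
  F1FCOMP = sample(1:9, 200, replace = TRUE)
)

# Restrict to the subpopulation first, using %in% and & instead of '+'
sub <- subset(toy, F1RTRCC %in% c(2, 3) & BYS33C == 1)

# Then count cases in each of the 9 F1FCOMP outcomes (unweighted here)
counts <- table(factor(sub$F1FCOMP, levels = 1:9))
counts
```

With the real replicate-weight design, the analogous step would presumably be to call subset() on the design object and then ask for totals of factor(F1FCOMP); that part is not shown here because it depends on the actual data.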
I want to run a nested logistic regression in R, but the examples I found online didn't help much. I read over an example from this website (Step by step procedure on how to run nested logistic regression in R) which is similar to my problem, but I found that it seems not resolved in the end (The questioner reported errors and I didn't see more answers).
So I have 9 predictors (continuous scores), and 1 categorical dependent variable (DV). The DV is called "effect", and it can be divided into 2 general categories: "negative (0)" and "positive (1)". I know how to run a simple binary logit regression (using the general grouping way, i.e., negative (0) and positive (1)), but this is not enough. "positive" can be further grouped into two types: "physical (1)" and "mental (2)". So I want to run a nested model which includes these 3 categories (negative (0), physical (1), and mental (2)), and reflects the nature that "physical" and "mental" are nested in "positive". Maybe R can compare these two models (general vs. detailed) together? So I created two new columns, one is called "effect general", in which the individual scores are "negative (0)" and "positive (1)"; the other is called "effect detailed", which contains 3 values - negative (0), physical (1), and mental (2). I ran a simple binary logit regression only using "effect general", but I don't know how to run a nested logit model for "effect detailed".
From the example I searched and other materials, the R package "mlogit" seems right, but I'm stuck with how to make it work for my data. I don't quite understand the examples in R-help, and this part in the example from this website I mentioned earlier (...shape='long', alt.var='town.list', nests=list(town.list)...) makes me very confused: I can see that my data shape should be 'wide', but I have no idea what "alt.var" and "nests" are...
I also looked at page 19 of the mlogit manual for examples of nested logit model calls. But I still cannot decide what I need in terms of options. (http://cran.r-project.org/web/packages/mlogit/mlogit.pdf)
Could someone provide me with detailed steps and notes on how to do it? I'm sure this example (if well discussed and resolved) is also going to help me and others a lot!
Thanks for your help!!!
I can help you with understanding the mlogit structure. When using the mlogit.data() command, specify choice = yourchoicevariable (and id.var = respondentid if you have a panel dataset, i.e. you have multiple responses from the same individual), along with the shape='wide' argument. The new data.frame created will be in long format, with a line for each alternative in each choice situation: negative, physical, mental. So you will have 3 rows where you had only one in the wide data format. Whatever your multinomial (MN) choice variable is, it will now be a column of logical values, with TRUE for the row that the respondent chose. The row names will now be in the format observation#.level(choice variable). So in your case, if the person in the first row of your dataset had a response of negative, you would see:
row.name | choice
1.negative | TRUE
1.physical | FALSE
1.mental | FALSE
Also note that the actual factor level for each choice is stored in an index called alt of the mlogit.data.frame, which you can see with index(your.data.frame), and the observation number (i.e. the row number from your wide-format data.frame) is stored in chid. This is in essence what the row name is telling you, i.e. chid.alt. Also note you DO NOT have to specify alt.var if your data is in wide format, only in long format. The mlogit.data function does that for you, as I have just described. Essentially, it takes unique(choice) when you specify your choice variable and creates the alt.var for you, so it is redundant if your data is in wide format.
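To make the wide-to-long layout concrete, here is a base-R imitation of it. It does not call mlogit at all; it only mimics the structure described above, and the two-row wide data frame is made up for the illustration.

```r
# Hypothetical wide data: one row per respondent
wide <- data.frame(
  id     = 1:2,
  effect = factor(c("negative", "mental"),
                  levels = c("negative", "physical", "mental"))
)

alts <- levels(wide$effect)

# One row per (observation, alternative) pair; alt varies fastest
long <- expand.grid(alt = alts, chid = wide$id, stringsAsFactors = FALSE)

# choice is TRUE only on the row of the alternative actually chosen
chosen <- as.character(wide$effect[match(long$chid, wide$id)])
long$choice <- long$alt == chosen

# Row names follow the chid.alt pattern
rownames(long) <- paste(long$chid, long$alt, sep = ".")
long[, c("chid", "alt", "choice")]
```

Each respondent contributes three rows, exactly one of which has choice equal to TRUE.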
You then specify the nests by adding to the mlogit() command a named list of the nests like this, assuming your factor levels are just '0','1','2':
mlogit(..., nests = list(negative = c('0'), positive = c('1','2')))
or, if the factor levels were 'negative','physical','mental', it would look like this:
mlogit(..., nests = list(negative = c('negative'), positive = c('physical','mental')))
Also note that a nest with a single alternative MUST still be specified with c(), per the package documentation. The resulting model will then have a single iv estimate shared across nests if you specify un.nest.el = TRUE, or nest-specific estimates if un.nest.el = FALSE.
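One detail worth checking when writing the nests argument: the outer container has to be a list(). Wrapping the nests in c() would flatten them into one character vector and lose the grouping, as this small base-R comparison shows (the object names flat and nested are made up here):

```r
# c() flattens: the result is a single named character vector
flat <- c(negative = c("0"), positive = c("1", "2"))
names(flat)    # "negative"  "positive1" "positive2"
is.list(flat)  # FALSE

# list() keeps one element per nest, each holding its alternatives
nested <- list(negative = c("0"), positive = c("1", "2"))
names(nested)    # "negative" "positive"
nested$positive  # "1" "2"
```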
You may find Kenneth Train's Examples useful
I would like to use rfcv to cull the unimportant variables from a data set before creating a final random forest with more trees (please correct and inform me if that's not the way to use this function). For example,
> data(fgl, package="MASS")
> tst <- rfcv(trainx = fgl[,-10], trainy = fgl[,10], scale = "log", step=0.7)
> tst$error.cv
9 6 4 3 2 1
0.2289720 0.2149533 0.2523364 0.2570093 0.3411215 0.5093458
In this case, if I understand the result correctly, it seems that we can remove three variables without negative side effects. However,
> attributes(tst)
$names
[1] "n.var" "error.cv" "predicted"
None of these slots tells me what those first three variables that can be harmlessly removed from the dataset actually were.
I think the purpose of rfcv is to establish how your accuracy is related to the number of variables you use. This might not seem useful when you have 10 variables, but when you have thousands of variables it is quite handy to understand how much those variables "add" to the predictive power.
As you probably found out, this code
rf<-randomForest(type ~ .,data=fgl)
importance(rf)
gives you the relative importance of each of the variables.
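Putting the two together, one possible way to pick the variables is sketched below. It assumes the randomForest and MASS packages are installed; the seed and the choice of keeping 6 variables (taken from the error.cv output in the question) are assumptions. As far as I understand, rfcv ranks the variables separately inside each cross-validation fold, which would explain why it does not report a single list of names.

```r
library(randomForest)
data(fgl, package = "MASS")

set.seed(42)  # arbitrary seed, only for reproducibility of the sketch
rf <- randomForest(type ~ ., data = fgl)

# One importance score per predictor (mean decrease in Gini for classification)
imp <- importance(rf)[, "MeanDecreaseGini"]

# Rank predictors from most to least important and keep the top 6
ranked <- names(sort(imp, decreasing = TRUE))
keep <- ranked[1:6]
keep

# Refit on the reduced set of predictors
rf_small <- randomForest(x = fgl[, keep], y = fgl$type)
```

The ranking can shift a little between runs, so it may be worth repeating this with a few seeds before dropping anything for good.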