Preparing data for classification algorithm - r

I have to prepare and classify a dataset composed by 100 000 + lines and 105 variables and I'm looking for advices.(I'm using R)
basically,
the set is full of dummy variables and missing values(44% of the full dataset).
and Idk what to do with the NAs, I'm split up between two ideas :
I]
1- eliminate every column that has more than 70% of mising values
2- Replace the missing values with mean or median in the remaining columns
II]
eliminate all the missing values
what do you think ?
is there something more I can do to prepare the data ? (except dealing with NAs)

The topic of imputation of missing values has a long history in the social sciences, going back at least as far as when I was a graduate student during the 1980s and had to explain to a professor of Political Science at Michigan State University why she couldn't replicate a factor analysis she had previously conducted because SPSS eliminated the mean substitution of missing values option from the factor analysis procedure.
There is a wide variety of research (and opinion) on how to handle missing data in statistical analyses. For example, in Chapter 25 of Data Analysis Using Regression and Multilevel / Hierarchical Models, Gelman and Hill describe multiple approaches for imputing one variable as well as multiple variables.
In order to select an imputation strategy for a particular data set, one must assess why the missing data are missing. Gelman & Hill review four major categories of "missingness mechanisms," including:
Missingness completely at random (probability of missingness is equal across all units / subjects)
Missingness at random (e.g. differing response rates across races)
Missingness that depends on unobserved predictors
Missingness that depends on the missing value itself (e.g. people earning more than $100,000 refuse to respond to income question)
Therefore, without analyzing the original poster's specific data set against the missingness mechanisms, specific guidance on which imputation technique to use is inappropriate. Additional research on missing data imputation may be found at Strategies for Handling Missing Values.

Related

Can I use quickpred in Mice to impute a subset of variables from a larger set of variables in a nested longitudinal (and long) dataframe?

I've tried to create a test data.frame to demonstrate my question but my r capacity isn't quite strong enough to even do that. I am not in a position to share my true database. I hope my question can stand on its own.
I am working with a nested longitudinal dataset that is saved as a long file (1000 subjects nested in 8 sites, 4 potential time points/subject, 68 potential predictor variables). I want to impute missing values on 4 static predictors (e.g., maternal education, family income) prior to conducting lme on the longitudinal outcomes in order to have a consistent number of cases for all models.
I am working with the package mice in r. From all that I have read, it is recommended that I use all the variables in my models and any other variables that may predict the missing values in my imputation. Given the number of variables in my models, I need something like quickpred to simplify this. But I'm getting an error that I do not understand.
I tried the following initial code for my database N2NPL, indicating c(14, 16, 18, 19) as the variables that I want to predict.
iniN2NPL <- mice(N2NPL[,c(14,16,18,19)], pred= quickpred(N2NPL,
minpuc = 0.25, exclude = c('ID','TypeConvNon','TypeCtPr','TypeName','CHR_converter')),
maxit = 0)
"Error in check.predictorMatrix(setup) :
The predictorMatrix has 73 rows and 73 columns. Both should be 4'
I know that mice::quickpred needs to be a square matrix, but is there anyway of not imputing all of the variables? Is it sufficient to include site as a predictor given the nesting of subjects within sites?
Thank you for any help directing me to the proper code or instructions on this. The examples I see all seem much simpler than mine, and thus little help with the issues I'm having.

In R, what is the difference between ICCbare and ICCbareF in the ICC package?

I am not sure if this is a right place to ask a question like this, but Im not sure where to ask this.
I am currently doing some research on data and have been asked to find the intraclass correlation of the observations within patients. In the data, some patients have 2 observations, some only have 1 and I have an ID variable to assign each observation to the corresponding patient.
I have come across the ICC package in R, which calculates the intraclass correlation coefficient, but there are 2 commands available: ICCbare and ICCbareF.
I do not understand what is the difference between them as they do give completely different ICC values on the same variables. For example, on the same variable, x:
ICCbare(ID,x) gave me a value of -0.01035216
ICCbareF(ID,x) gave me a value of 0.475403
The second one using ICCbareF is almost the same as the estimated correlation I get when using random effects models.
So I am just confused and would like to understand the algorithm behind them so I could explain them in my research. I know one is to be used when the data is balanced and there are no NA values.
In the description it says that it is either calculated by hand or using ANOVA - what are they?
By: https://www.rdocumentation.org/packages/ICC/versions/2.3.0/topics/ICCbare
ICCbare can be used on balanced or unbalanced datasets with NAs. ICCbareF is similar, however ICCbareF should not be used with unbalanced datasets.

how to find differentially methylated regions (for example with probe lasso in Champ) based on regression continuous variable ~ beta (with CpGassoc)

I performed 450K Illumina methylation chips on human samples, and want to search for the association between a continuous variable and beta, adjusted for other covariates. For this, I used the CpGassoc package in R. I would also like to search for differentially methylated regions based on the significant CpG sites. However, the probe lasso function in the Champ package and also other packages for 450K DMR analyses always assume 2 groups for which DMRs need to be find. I do not have 2 groups, but this continuous variable. Is there a way to load my output from CpGassoc in the probe lasso function from Champ? Or into another bump hunter package? I'm a MD, not a bio-informatician, thus comb-p, etc. would not be possible for me.
Thank you very much for your help.
Kind regards,
Line
I have not worked with methylation data before, so take what I say with a grain of salt. Also, don't use acronyms without describing them I'm guessing most people on this site don't know what a DMR is.
you could use lasso from the glmnet package to run a lasso on your data. So if your continuous variable was age you could do something like. If meth.dt is your methylations data.table with your columns as the amount of methylation for a given site, and your rows as subjects. I'm not sure if methylation data is considered to be poisson, I know RNA-seq data is. I also can't get too specific but the following code should work after adjusting to your number of columns
#load libraries
library(data.table)
library(glmnet)
#read in data
meth.dt <- fread("/data")
#lasso
AgeLasso <- glmnet(as.matrix(meth.dt[,1:70999,with=F]),meth.dt$Age, family="poisson")
cv.AgeLasso <- cv.glmnet(as.matrix(meth.dt[,1:70999,with=F]), meth.dt$Age, family="poisson")
coefTranscripts <- coef(cv.AgeLasso, s= "lambda.1se")[,1][coef(cv.AgeLasso, s= "lambda.1se")[,1] != 0]
This will give you the methylation sites that are the best predictors of your continuous variable using a parsimonious model. For additional info about glmnet see http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
Also might want to ask the people over at cross validated. They may have some better answers. http://stats.stackexchange.com
What is your continuous variable just out of curiosity?
Let me know how you ended up solving it if you don't use this method.

cluster ordinal data

I want to do clustering of my data (kmeans or hclust) in R language (coding). My data is ordinal, which means that the data is Likert scale to measure the causes of cost escalation (I have 41 causes "variables") that scaled from 1 to 5, which 1 is no effect to 5 major effect (I have about 160 observations "who rank the causes")... any help of how to cluster the 41 cause based on the observations ... do I have to convert the scale to percentage or z score before clustering or any thing that help ...... I really need your help!! here is the data to play with https://docs.google.com/spreadsheet/ccc?key=0AlrR2eXjV8nXdGtLdlYzVk01cE96Rzg2NzRpbEZjUFE&usp=sharing
I want to cluster the variables (the columns) in terms of similarity of occurrence in observations... I follow the code in statmethods.net/advstats/cluster.html; but I couldn't cluster the variables (the columns) in terms of similarity of occurrence in observations and also I follow the work at mattpeeples.net/kmeans.html#help; but I don't know why he convert the data to percentage and then to Z-score standardize.
It isn't clear to me if you want to cluster the rows (the observations) in terms of similarity in the variables, or cluster the variables (the columns) in terms of similarity of occurrence in observations?
Anyway, see package cluster. This is a recommended package that comes with all R installations.
Read ?daisy for details of what is done with ordinal data. This metric can be used in functions such as agnes (for hierarchical clustering) or pam (for partitioning about medoids, a more robust version of k-means).
By default, these will cluster the rows/observations. Simply transpose the data object using t() if you want to cluster the columns (variables). Although that may well mess up the data depending on how you have stored them.
Converting the data to percentage is called normalization of data so all the variables are in the range of 0 - 1.
If data is not normalized you run the risk of bias towards dimensions with large values

How to structure stratified data for Poisson regression

I'm trying to use R to conduct Poisson regression on some data that I have. The current structure of the data is as follows:
Data is stratified based on three occupations. There are four levels of income in the data. Within each stratum, for each level of income there is
the number of workplace accidents that have occurred, and
the total man months observed.
Here's an example of the setup. The number in parentheses is the total man months observed and the number not in parentheses is the number of workplace accidents.
My question is how do I set up this data and perform a Poisson regression on the effect of income level on the occurrence of workplace accidents? Ideally I would like to adjust for occupation and find out the effect of only income, but as a starting point, I'm not sure how to set it up as a Poisson regression problem at all. I thought about doing something like dividing the number of injuries by the months of observation, but then that gives non-integer values so I assume that's not the right thing to do.
To reiterate, predictor: income level; response variable: workplace accidents.
BTW, it would be very easy to separate the parentheses numbers and put them into their own column, if that would make sense to do.
I'd really appreciate any suggestions on how to set this up. I am sure other statisticians are working with similarly structured data and might like to gain some insight as well. Thanks so much!
#thelatemail might be correct in think this to be better suited for stats.stackexchange.com but here is some R code. That data is in wide format and you need to re-structure it to long format. (And you will not want to include the totals columns. After converting the first four columns to a long format where you had 'occupation' and 'level' as factor-class variables, and accident 'counts' and exposure 'months' as numeric columns, you could use this call to glm.
fit <- glm( counts ~ level + occup + offset(log(months)), data=dfrm, family="poisson")
The offset needs to be log()-ed to agree with the logged counts created by the default link function for the poisson-family.
(You cannot really expect us to redo that data entry task, now can you?)

Resources