I'm working with a dataset comparing the abundance of certain species with environmental variables across various sampling sites.
For some of the sites, environmental variables could not be measured in the field. As a result, these values are written as "NA" in my dataset.
However, the species-abundance variables contain some genuine zeros, simply because one or more species were not observed at that particular site.
I'm using the mice package to impute these NA values. I also want to assess the proportion of missing values with the functions md.pattern (from mice) and aggr (from the VIM package). The issue is that when I use these functions, R treats not only the NA values as missing data but also the zero values. How can I make R detect only NA as missing data, and not the values that are zero?
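For reference, this is how I am running these checks (a sketch; dat is a placeholder for my data frame, and I am assuming all of its columns are numeric):

library(mice)
library(VIM)

# true NAs and genuine zeros, counted separately per column
colSums(is.na(dat))                # what I expect to be flagged as missing
colSums(dat == 0, na.rm = TRUE)    # zero observations, not missing

md.pattern(dat)   # missing-data pattern (from mice)
aggr(dat)         # proportions of missing values (from VIM)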
I am using the mice package to impute data, and I have read about post-processing to restrict imputed values. One of the variables I am imputing is a categorical variable with 10 different levels (a, b, c, d, e, f, g, h, i, j). The missing values can take any level except a and d, so I need the post-processing to set anyone imputed as a or d to NA after the imputation. Right now cases are being imputed from all the available levels, which is wrong.
I have also tried creating a separate binary (0/1) indicator variable to make this work, but it was still imputed the wrong way.
Any ideas on how to do this post-processing in mice in R?
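For reference, the direction I have been trying is the post argument of mice, something like the sketch below (cat_var is a placeholder for my variable; here a disallowed draw is resampled from the allowed levels rather than set to NA, to avoid feeding NAs back into the algorithm mid-run):

library(mice)

# dry run just to obtain the default post-processing vector
ini  <- mice(dat, maxit = 0)
post <- ini$post

# after each iteration, resample any imputed 'a' or 'd' from the
# levels that a missing case is actually allowed to take
post["cat_var"] <- "{
  bad <- imp[[j]][, i] %in% c('a', 'd')
  imp[[j]][, i][bad] <- sample(c('b','c','e','f','g','h','i','j'),
                               sum(bad), replace = TRUE)
}"

imp <- mice(dat, post = post, seed = 1)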
I am trying to perform imputation on a dataset which has 69 columns and over 50000 rows. My dataset has different types of variables:
columns that contain only binary variables (0/1)
categorical columns
columns that contain continuous numerical data
Now, I want to perform imputation and I know that my columns have a high level of multicollinearity.
Do I have to split my dataset into three different subsets (one for each of the column types above), or should I perform imputation on the whole dataset?
The problem is that the mice package has different methods for each of these types. And if I run it three separate times, do I have to take the whole dataset into consideration, or only that specific part?
You can pass your whole dataset to mice at once.
(You can actually specify which method to use for each variable separately.)
I am citing from the mice reference:
Parameter 'method'
Can be either a single string, or a vector of strings with length length(blocks), specifying the imputation method to be used for each column in data. If specified as a single string, the same method will be used for all blocks. The default imputation method (when no argument is specified) depends on the measurement level of the target column, as regulated by the defaultMethod argument. Columns that need not be imputed have the empty method "". See details.
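For example, something along these lines (a sketch; the column names flag, group, and income are placeholders):

library(mice)

# build the default method vector, then override per column
meth <- make.method(dat)
meth["flag"]   <- "logreg"   # binary, coded as a two-level factor
meth["group"]  <- "polyreg"  # unordered factor with >2 levels
meth["income"] <- "pmm"      # continuous: predictive mean matching

imp <- mice(dat, method = meth, m = 5, seed = 123)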
I am new to R and have the following problem: I am working on a dataset that has not only numerical values but also non-numerical ones (gender, state). I wanted to start by looking through the data and finding some correlations. Obviously this only works for the numerical values, and no correlations are found for the non-numerical columns; I tried it with ggcorr and it simply omits them.
My question is: how do you treat such datasets? How do you find correlations when you have many non-numerical, categorical variables? And what is the workflow for building a linear model for such a dataset? The model should predict whether a person earns more or less than 50k a year.
Thanks for any help!
Edit: This is the dataset I am talking about. I was thinking about converting the categories into numerical values and then correlating via cor.test(), but I am not sure I would get a valid correlation number that way. So basically my question is: how do I check the correlation between non-numerical and numerical data?
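For example, this is the kind of check I had in mind (a sketch, assuming the usual adult-census column names, which may differ in my file):

adult$income <- factor(adult$income)   # "<=50K" / ">50K"

# categorical vs categorical: chi-squared test on a contingency table
chisq.test(table(adult$sex, adult$income))

# numeric vs binary categorical: compare group means
t.test(hours.per.week ~ income, data = adult)

# for the >50k prediction itself, glm() handles factors directly,
# so no manual recoding to numbers is needed
fit <- glm(income ~ age + sex + hours.per.week,
           data = adult, family = binomial)
summary(fit)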
I have a panel dataset with population data. I am working mostly with two vectors: population and households. The households vector (there are 3 countries) has a substantial number of missing values; the population vector is complete. I use a model with population as the independent variable to obtain the missing household values. What function should I use to extract these values? I do not need to make any forecasts, just to impute the missing data.
Thank you.
EDIT:
This is a screenshot of my dataset:
https://imagizer.imageshack.us/v2/1366x440q90/661/RAH3uh.jpg
As you can see, many values where datatype = "original" are missing, and I need to impute them somehow. I have created several panel data models (pooled, within, between) and, without further consideration, tried to extract the missing data with each of them; however, I do not know how to do this.
EDIT 2: What I need is not how to determine which model to use, but how to get the missing values out of the model (thereby making the dataset more balanced).
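For example, is something like the following the right way to extract them? (a sketch; households, population, and country are placeholders for my columns):

# fit on the complete cases, then predict into the gaps
fit <- lm(households ~ population + factor(country), data = panel)

miss <- is.na(panel$households)
panel$households[miss] <- predict(fit, newdata = panel[miss, ])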
I would like to run a time series regression in which each column of a data frame is a dependent variable, regressing each column on a common set of independent variables. I know you can just use
lm(as.matrix(dependents) ~ x1 + x2)
because if the dependent variable is a matrix, lm will simply fit a separate regression for each column.
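For instance, with a complete (NA-free) response matrix, that single-call behaviour looks like this sketch:

# with a full response matrix, lm() fits one regression per column
Y <- matrix(rnorm(30), ncol = 3)   # 10 observations, 3 dependent variables
x <- rnorm(10)                     # one common regressor
fit <- lm(Y ~ x)
coef(fit)                          # one column of coefficients per response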
However, my dependent variables are information about stocks through time, and sometimes information is not available for every stock at every time point, so I have some NA values. The problem is that if I use lm, it omits the NA values, i.e. it removes the whole row when running the regression. This is fine if I only want to run a regression on one dependent variable, but I have a large set (1,000+) of dependent variables to regress. Because my dataset spans 15+ years, there are missing values at every single time point, so when I run my lm regression I get an error: the lm function has removed every single row.

The only way I can think of to solve this is a for loop that runs a separate regression for each stock, which I think will take a very long time to compute. For example, the following is an example of my data:
             135081(P)   135084(P)   135090(P)
1994-12-30          NA          NA          NA
1995-01-02          NA          NA          NA
1995-01-03  0.06864935          NA          NA
1995-01-04          NA          NA -0.05474644
1995-01-05          NA          NA  0.20894900
1995-01-06          NA -0.45672832 -0.02378632
So if I run a time series regression on this, I get an error because the lm function skips every single row.
So my question is: is there another way to run a time series regression across a data frame of different dependent variables, where the regression skips the NAs only for the particular dependent variable at hand, instead of dropping the row for every other dependent variable as well?
I don't think na.omit is right, because it destroys the time series structure of my dataset, and na.action = NULL doesn't work because I have NAs in my dataset.
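For reference, the per-stock loop I have in mind would look something like this (a sketch; Y and X are placeholders for my data frame of stock returns and my data frame of regressors, aligned by date):

# fit one regression per stock; na.exclude drops NA rows for that
# stock only, and pads residuals/fitted values back to full length
fits <- lapply(Y, function(y) lm(y ~ ., data = X, na.action = na.exclude))

# time series alignment with the dates is preserved
res <- sapply(fits, residuals)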
Thank you a lot for your help.
You might want to employ a multiple imputation method, using something like the Amelia II package on CRAN, in order to properly account for the increased uncertainty in your estimates due to missingness, and also to help minimize the biases that result from case-wise deletion. See for example:
Honaker, J. and King, G. (2010). What to do about missing values in time-series cross-section data. American Journal of Political Science, 54(2):561–581.
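A minimal sketch of such a call (the column names date and stock are placeholders for your time index and cross-section identifier):

library(Amelia)

# m = 5 completed datasets; ts/cs tell Amelia about the panel structure
a.out <- amelia(panel, m = 5, ts = "date", cs = "stock")
summary(a.out)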