I have a panel dataset with population data. I am working mostly with two vectors: population and households. The household vector (there are 3 countries) has a substantial number of missing values, while the population vector is complete. I use a model with population as the independent variable to estimate the missing household values. What function should I use to extract these values? I do not need to make any forecasts, just to impute the missing data.
Thank you.
EDIT:
This is a screenshot of my dataset:
https://imagizer.imageshack.us/v2/1366x440q90/661/RAH3uh.jpg
As you can see, many values where datatype = "original" are missing, and I need to impute them somehow. I have created several panel data models (pooled, within, between) and, without further consideration, tried to extract the missing data with each of them; however, I do not know how to do this.
EDIT 2: What I need is not advice on which model to use, but how to obtain the missing values from the model (thereby making the dataset more balanced).
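To illustrate, this is the kind of thing I am after, sketched with placeholder names (panel, households, population, country) and a plain lm() standing in for whichever panel model is chosen:
fit <- lm(households ~ population + factor(country), data = panel)  # fit on the complete rows
miss <- is.na(panel$households)
panel$households[miss] <- predict(fit, newdata = panel[miss, ])     # fill in the gaps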
I am new to R and have the following problem: I am working on a dataset that has not only numerical values but also non-numerical values (gender, state). I wanted to start by looking through the data and finding some correlations. Obviously this works only for numerical values, and no correlations are produced for the non-numerical ones. I tried it with ggcorr and it omits the non-numerical columns.
My question is: how do you treat such datasets? How do you find correlations when you have many non-numerical, categorical values? Also, what is the workflow for creating a linear model for such a dataset? The model should predict whether a person earns more or less than 50k a year.
Thanks for any help!
Edit: This is the dataset I am talking about. I was thinking about converting the categories into numerical values and then correlating through cor.test(), but I am not sure whether I would get a valid correlation number this way. So basically my question is: how do I check the correlation between non-numerical and numerical data?
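To make the idea concrete, here is a sketch of the dummy-coding I have in mind (df, gender, hours, and state are placeholder names):
df$gender_num <- as.numeric(df$gender == "Male")  # dummy-code the binary factor
cor.test(df$gender_num, df$hours)                 # point-biserial correlation
summary(aov(hours ~ state, data = df))            # many-level factor: compare group means instead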
I currently have a list of 27 correlation matrices with 7 variables each, for social science research.
Some correlations are "NA" due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In a particular instance, I would like to keep one of the variables conditionally, if it contains at least some value other than "NA". Since there are 7 variables, I am keeping any variable whose column does not consist of 6 "NA"s plus its correlation with itself, 1 (this is the tricky part, because 1 is a value, but it is meaningless to me in a correlation matrix).
I would appreciate it if anyone could enlighten me regarding the code.
I am rather new to R, and the only idea I have is to use an if statement to set the condition. I have been trying for hours, but to no avail, as this is my first real coding experience.
Thanks a lot.
Since you didn't provide sample data, I am first going to convert your matrix into a data frame, and then I am going to assume that you want to check whether your data frame df has a variable var with at least one value other than NA or 1.
df <- as.data.frame(as.table(matrix))  # converts your matrix into a long-format data frame
table(df$var, useNA = "ifany")         # distribution of values in var, including NAs
From here you can make your judgement call on whether to keep the variable or not.
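If you want to automate that judgement call for a whole correlation matrix, a sketch along these lines may work (m stands in for one of your 27 matrices; the rule keeps a variable if its column has any value left after dropping the meaningless diagonal 1):
keep <- sapply(seq_len(ncol(m)), function(j) any(!is.na(m[-j, j])))  # any off-diagonal value that is not NA?
m_kept <- m[keep, keep, drop = FALSE]                                # subset to the variables worth keeping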
I have two data sets, one of which shows seasonality while the other shows a trend.
I have removed the seasonality from the first data set, but I am not able to remove the trend from the other.
Also, if I remove the trend from the other data set and then try to make a data frame of both altered data sets, the number of rows will differ (because I removed the seasonality from the first data set using a lag, there is a difference of 52 values between the two data sets).
How do I go about it?
For de-trending a time series you have several options, but the most commonly used is the Hodrick-Prescott (HP) filter from the mFilter package:
a <- hpfilter(x, freq = 270400, type = "lambda", drift = FALSE)
The freq value is the smoothing parameter (lambda), set here for the weekly nature of the data, and drift=FALSE sets no intercept. The function calculates the cyclical and trend components and returns them separately.
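The components can then be pulled out of the returned object:
trend <- a$trend       # the smooth trend component
detrended <- a$cycle   # the series with the trend removed
plot(a)                # plots the series, trend, and cycle together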
If the time indices of both your series are the same (i.e. weekly), you could use the following, where x and y are your data frames:
final <- merge(x, y, by = "date", all = FALSE)  # "date" stands in for your shared time-index column
You can always set all.x=TRUE (or all.y=TRUE) to see which rows of x (or y) have no matching rows in y (or x); see ?merge for the details.
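For example, to spot the rows of x that found no partner in y (value_y is a placeholder for one of y's columns after the merge):
chk <- merge(x, y, by = "date", all.x = TRUE)  # keep every row of x
chk[is.na(chk$value_y), ]                      # rows of x with no matching week in y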
Hope this helps.
I am new to time-series analysis and have a data set with a daily time step at 5 factor levels. My goal is to use the acf function in R to determine whether there is significant autocorrelation across the response variable of interest so that I can justify whether or not a time-series model is necessary.
I have sorted the dataset by Day, and am using the following code:
acf(DE_vec, lag.max=7)
The dataset has not been converted to a time-series object…it is a vector sorted by Day.
My first question is whether the data frame should be converted to a time-series object, or whether it is also correct to simply sort the vector by Day.
Second, if I have a variable repeated over the 5 levels for each Day, should I construct 5 different ACF plots, one per level, or would it be OK to pool over the levels (stations), as was done with the code above?
Thanks in advance,
Yes, acf() will work on a data.frame class, and yes, you should compute the ACF for each of the 5 levels separately. If you pass the entire df to acf(), it will return the ACF for each of the levels.
If you are curious about the relationship across levels, then you will need ccf() or a mutual-information metric like those in the entropy or infotheo packages.
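A sketch of the per-level approach (df, DE_vec, and station are placeholder names):
par(mfrow = c(2, 3))                     # one panel per level
for (s in unique(df$station)) {
  v <- df$DE_vec[df$station == s]
  acf(v, lag.max = 7, main = paste("ACF, station", s))
}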
How can I write R code to duplicate cluster analyses done in SAS that used method=WARD and the TRIM=10 option, which automatically deletes 10% of the cases as outliers? (This dataset has 45 variables, each with some outlier responses.)
When I searched for R cluster analysis using Ward's method, the trim option was described as something that shortens names rather than something that removes outliers.
If I don't trim the datasets before the cluster analysis, one big cluster emerges with lots of single-case "clusters" representing outlying individuals. With the outlying 10% of cases automatically removed, 3 or 4 meaningful clusters emerge. There are too many variables and cases for me to remove the outliers on a case-by-case basis.
Thanks!
You haven't provided any information on how you want to identify outliers. Assuming the simplest case of removing the top and bottom 5% of cases of every variable (i.e. on a variable-by-variable basis), you could do this with the quantile() function.
Illustrating with the built-in faithful dataset, you could do something like:
duration <- faithful$eruptions  # eruption durations
duration[duration >= quantile(duration, 0.05) & duration <= quantile(duration, 0.95)]  # keep the middle 90%
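To extend this to your 45 variables and then cluster, something like the sketch below may do; note that this per-variable quantile rule is only an approximation of SAS's TRIM= (which, as I understand it, trims low-density points), and df stands in for your all-numeric data frame:
inside <- apply(df, 2, function(v)
  v >= quantile(v, 0.05, na.rm = TRUE) & v <= quantile(v, 0.95, na.rm = TRUE))
trimmed <- df[rowSums(inside) == ncol(df), ]     # keep rows inside the band on every variable
hc <- hclust(dist(trimmed), method = "ward.D2")  # Ward's method on Euclidean distances
plot(hc)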