I have a dataset with 1658099 observations.
These are the variables and the number of missing observations in each column.
As part of preprocessing, how do I impute the longitude and latitude values in it? I don't think the mean of the locations makes sense, and I don't want to discard those rows either. Please help me with this. Thanks.
Related
I am currently working on a dataset of different firms. I have each firm's longitude and latitude, and I want to find each firm's city using R.
For example, I found that Shanghai's longitude and latitude ranges are 120.852326~122.118227 and 30.691701~31.874634, respectively.
First, I want to create a column named "city" and check whether each firm's longitude and latitude fall within Shanghai's ranges. If yes, R should write "Shanghai" in the "city" column; if not, the value should remain NA.
In my data frame, the longitude and latitude variables are named "longitude" and "latitude".
I am not sure how to write the code, and I would really appreciate your help!
I am really struggling at the beginning. Your help is highly appreciated!
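One possible way to do the range check described above, as a minimal sketch. The `firms` data frame and its three rows are made up for illustration; only the column names "longitude" and "latitude" and the Shanghai ranges come from the question. Note that a longitude/latitude range is a rectangle, so it only approximates the real city boundary.

```r
# Hypothetical example data; the real data frame would already hold these columns.
firms <- data.frame(
  firm      = c("A", "B", "C"),
  longitude = c(121.47, 116.40, 121.00),
  latitude  = c(31.23, 39.90, 30.50)
)

# Bounding box for Shanghai, using the ranges quoted in the question.
in_shanghai <- firms$longitude >= 120.852326 & firms$longitude <= 122.118227 &
  firms$latitude >= 30.691701 & firms$latitude <= 31.874634

# Start with NA everywhere, then label the rows that fall inside the box.
# which() skips rows whose coordinates are NA instead of failing on them.
firms$city <- NA_character_
firms$city[which(in_shanghai)] <- "Shanghai"
```

For more cities, the same comparison could be repeated with each city's ranges, filling only the rows that are still NA.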
Running ezDesign() on my dataset shows that there are conditions with missing observations. In the plot below, conditions that have two observations are in blue, those with only one are in red, and the grey one only has one.
I want to impute the missing observations so that all cells have two observations each.
I tried to use the MICE library, but it only imputes NAs that are already in the dataset. It does not impute observations whose rows are absent altogether. In other words, I am looking for a function that automatically fills the missing conditions with imputed values.
Does anybody know how to do that?
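One way to approach this, sketched under assumptions: turn the absent observations into explicit rows with NA, so that an NA-based imputation tool has something to fill. The data frame `dat` and the column names `subject`, `condition`, `value` and `rep` below are hypothetical, and the target of two observations per cell is taken from the question.

```r
# Hypothetical long-format data: one row per observation, unbalanced design.
dat <- data.frame(
  subject   = c(1, 1, 1, 1, 2, 2, 2),
  condition = c("a", "a", "b", "b", "a", "a", "b"),
  value     = c(5.1, 4.9, 6.2, 6.0, 5.5, 5.3, 6.4)
)

# Number the observations within each subject-by-condition cell (1, 2, ...).
dat$rep <- ave(seq_len(nrow(dat)), dat$subject, dat$condition, FUN = seq_along)

# Full design: every subject x condition cell with two replicates.
full <- expand.grid(
  subject   = unique(dat$subject),
  condition = unique(dat$condition),
  rep       = 1:2,
  stringsAsFactors = FALSE
)

# Left join onto the full design: absent observations become rows with value = NA.
balanced <- merge(full, dat, by = c("subject", "condition", "rep"), all.x = TRUE)
```

After this step every cell has two rows, and the rows that were absent carry NA in `value`, which is the form that NA-based imputation expects.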
I have searched Stack Overflow and Google for this but have not yet found a fitting answer.
I have a data frame column with ages of individuals.
Out of around 10000 observations, 150 are NAs.
I do not want to impute those with the mean age of the whole column. Instead, I want to assign random ages based on the distribution of the ages in my data set, i.e. in this column.
How do I do that? I tried fiddling around with the MICE package but didn't make much progress.
Do you have a solution for me?
Thank you,
corkinabottle
You could simply sample 150 values from your observations:
samplevals <- sample(obs[!is.na(obs)], 150)  # draw only from the observed (non-NA) ages
You could also stratify your observations across quantiles to increase the chances of sampling your tail values by sampling within each quantile range.
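To put the simple (unstratified) sampling idea into the column itself, something like the following could work. This is only a sketch: the `age` vector is made up, and drawing with replacement is my assumption so that it also works when there are more NAs than observed values.

```r
set.seed(1)  # only to make the illustration reproducible

# Hypothetical age column with some missing entries.
age <- c(23, 35, 41, NA, 29, 52, NA, 38, 47, 31)

missing  <- is.na(age)
observed <- age[!missing]

# Fill each NA with a random draw from the observed ages. Indexing by position
# avoids the sample() pitfall where a length-one numeric vector x is read as 1:x.
draws <- observed[sample.int(length(observed), sum(missing), replace = TRUE)]

age_imputed <- age
age_imputed[missing] <- draws
```

Because the draws come from the empirical distribution, common ages are picked more often than rare ones, which is what keeps the imputed column close to the original distribution.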
I am new to R and have the following problem: I am working on a dataset that has not only numerical values but also non-numerical values (gender, state). I wanted to start by looking through the data to find some correlations. This obviously works only for numerical values, and I did not find any correlations among the numerical columns. I tried ggcorr, and it omits the non-numerical columns.
My question is: how do you treat such datasets? How do you find correlations if you have many non-numerical (categorical) variables? Also, what is the workflow for creating a linear model for such a dataset? The model should predict whether a person earns more or less than 50k a year.
Thanks for any help!
Edit: This is the dataset I am talking about. I was thinking about converting the categories into numerical values and then testing the correlation with cor.test(), but I am not sure whether I would get a valid correlation this way. So basically my question is: how do I check the correlation between non-numerical and numerical data?
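A rough sketch of one common workflow, under assumptions. The data below are simulated, and the column names `age`, `gender`, `state` and `over_50k` are hypothetical. Instead of coding categories as arbitrary numbers for cor.test(), the sketch compares a numeric variable across groups with aov(), tests two categorical variables with chisq.test(), and uses logistic regression via glm() rather than a plain linear model, since the outcome is binary.

```r
set.seed(42)  # only to make the illustration reproducible

# Hypothetical data standing in for the real data set.
n <- 200
people <- data.frame(
  age    = round(runif(n, 20, 65)),
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  state  = factor(sample(c("CA", "NY", "TX"), n, replace = TRUE))
)
# Simulated binary outcome: 1 when income is above 50k, 0 otherwise.
people$over_50k <- rbinom(n, 1, plogis(-4 + 0.08 * people$age))

# Numeric vs categorical: compare the numeric variable across the groups.
age_by_state <- summary(aov(age ~ state, data = people))

# Categorical vs categorical: chi-squared test of independence.
gender_state <- chisq.test(table(people$gender, people$state))

# Binary outcome: logistic regression; factors are dummy-coded automatically.
fit  <- glm(over_50k ~ age + gender + state, data = people, family = binomial)
prob <- predict(fit, type = "response")  # fitted probability of earning over 50k
```

Keeping the categorical columns as factors lets R build the dummy variables itself, so no manual conversion to numbers is needed before modelling.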
I have a panel dataset with population data. I am working mostly with two vectors: population and households. The households vector (there are 3 countries) has a substantial number of missing values, while the population vector is complete. I use a model with population as the independent variable to estimate the missing values of households. What function should I use to extract these values? I do not need to make any forecasts, just to impute the missing data.
Thank you.
EDIT:
This is a screenshot of my dataset:
https://imagizer.imageshack.us/v2/1366x440q90/661/RAH3uh.jpg
As you can see, many values of the datatype = "original" data are missing, and I need to impute them somehow. I have created several panel data models (pooled, within, between) and, without further consideration, tried to extract the missing data with each of them; however, I do not know how to do this.
EDIT 2: What I need is not how to determine which model to use, but how to get the missing values from the model (thus making the dataset more balanced).
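A minimal sketch of the general idea: fit the model on the rows where households is observed, then use predict() on the rows where it is missing. The numbers below are made up, and a plain lm() with country dummies is used here as a stand-in for the panel models mentioned in the question, so treat it as an illustration rather than the exact setup.

```r
# Hypothetical panel: three countries, households partly missing.
panel <- data.frame(
  country    = rep(c("A", "B", "C"), each = 5),
  year       = rep(2001:2005, times = 3),
  population = c(100, 102, 104, 106, 108,
                 200, 203, 206, 209, 212,
                 50, 51, 52, 53, 54),
  households = c(40, NA, 42, NA, 44,
                 80, 81, NA, 84, NA,
                 NA, 20, 21, NA, 22)
)

# Fit on complete rows only; lm() drops rows with NA in the response by default.
fit <- lm(households ~ population + country, data = panel)

# Predict for the rows where households is missing and fill only those rows.
miss <- is.na(panel$households)
panel$households_filled <- panel$households
panel$households_filled[miss] <- predict(fit, newdata = panel[miss, ])
```

The observed values stay untouched in `households_filled`; only the gaps are replaced by model predictions, so the original column remains available for comparison.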