I need to write a term paper using R (which I have never used before), and I have run into a problem that I cannot solve.
I have a dataset covering all countries for every year since 1950, so one row (observation) is one country in one year, the next row is the same country one year later, and so on. Now I need to construct a new variable that holds the average value of a given variable over the previous three years.
Specifically it is about the democracy level of a country. So I have the variable of the democracy level of a country for a year T, and I need a new variable, which indicates the "democracy growth" of the previous three years T(-3,0).
How can I construct this new variable?
As I said, I have never used R before. I think I need to use mutate(), then refer to the values for year-3, year-2 and year-1, add them up and divide by 3. But how do I refer to the previous three years? With case_when() or something else?
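One possible way to do this is a grouped lag with dplyr. This is only a sketch: the data frame `panel` and the column names `country`, `year` and `democracy` are made up for illustration, and it assumes every country has one row per consecutive year with no gaps.

```r
library(dplyr)

# Hypothetical toy panel: two countries, five consecutive years each.
panel <- data.frame(
  country   = rep(c("A", "B"), each = 5),
  year      = rep(2000:2004, times = 2),
  democracy = c(1, 2, 3, 4, 5, 10, 10, 10, 10, 10)
)

panel <- panel %>%
  arrange(country, year) %>%   # lag() works on row order, so sort first
  group_by(country) %>%        # never borrow values from another country
  mutate(
    # mean of the values at T-1, T-2 and T-3; NA for the first three years
    democracy_prev3 = (lag(democracy, 1) +
                       lag(democracy, 2) +
                       lag(democracy, 3)) / 3
  ) %>%
  ungroup()
```

The first three years of each country come out as NA because there are not yet three earlier observations. If "growth" is meant as a change rather than an average, something like `democracy - lag(democracy, 3)` inside the same mutate() would be the analogous expression.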
Related
I'm trying to replace missing values in R with the value that follows. I have annual income data by country, and where, for example, the 2001 income for country A is missing, I want it to pull the next value. (This is for time series analysis with multiple countries and separate columns for different variables; income is just one of them.)
I wrote the code below to replace the missing values with the mean. Statistically, though, I think it makes more sense to replace each missing value with the value right below it (the next year's value), since the numbers differ a lot between countries and the mean would be taken over all years for all countries.
Social_data_R <- within(Social_data_R, incomeNAavg[is.na(income)] <- mean(income, na.rm = TRUE))
I tried replacing the mean part of the code above with income[i+1], but R didn't recognize 'i' (I imported the data from Excel, so I didn't create the data frame manually).
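A possible sketch of "take the next available value within the same country", using tidyr::fill() with `.direction = "up"`. The column names `country` and `year` and the toy values are assumptions; only `income` comes from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for Social_data_R.
Social_data_R <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(2000:2002, times = 2),
  income  = c(100, NA, 300, NA, 50, NA)
)

filled <- Social_data_R %>%
  arrange(country, year) %>%
  mutate(income_filled = income) %>%           # keep the original column untouched
  group_by(country) %>%                        # fill only within a country
  fill(income_filled, .direction = "up") %>%   # NA takes the next non-missing value
  ungroup()
```

A missing value at the end of a country's series stays NA, because there is no later year to pull from.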
I have a question regarding the filtering of a loan dataset for my upcoming thesis.
My dataset consists of loan data reported quarterly over 5 years. The columns of interest are 'Loan Identifier' and 'Cut-Off-Date'. In every subsequent quarter (cut-off date), I only want to keep the loans (by Loan Identifier) that already existed at the first reporting date (first quarter).
For example, if the first cut-off date contains the loans with identifiers c("1001","1002","1003") and the second cut-off date, one quarter later, contains the identifiers c("1002","1003","1004"), R should keep only the identifiers that existed in the first quarter, i.e. c("1002","1003"), so that loans added during the analysis period are ignored completely.
Is there also the possibility to do that all in one file? Or should I extract the data of each cut-off-date in a new table?
Thanks and best regards!
I am thinking about storing the identifiers of all first-quarter loans in a vector. After that, I would split the loan dataset by cut-off date and merge the vector with each of the new tables via left_join, so that every loan that does not match the vector is dropped.
As I have multiple loan pools with 15 pool cut-off dates each, this seems very impractical to me. Maybe there is a smarter and more effective solution.
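This can probably be done in one table without splitting, by filtering on membership in the first quarter's identifiers within each pool. A minimal sketch, where the column names `pool`, `loan_id` and `cut_off_date` and the dates are hypothetical stand-ins for the real columns:

```r
library(dplyr)

# Hypothetical toy data: one pool, two cut-off dates.
loans <- data.frame(
  pool         = "P1",
  loan_id      = c("1001", "1002", "1003", "1002", "1003", "1004"),
  cut_off_date = as.Date(c("2020-03-31", "2020-03-31", "2020-03-31",
                           "2020-06-30", "2020-06-30", "2020-06-30"))
)

loans_initial <- loans %>%
  group_by(pool) %>%
  # keep a row only if its loan already appears at the pool's first cut-off date
  filter(loan_id %in% loan_id[cut_off_date == min(cut_off_date)]) %>%
  ungroup()
```

With the toy data, loan "1004" is dropped because it first appears in the second quarter, while "1001" simply has no row in the second quarter.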
I have a question regarding conditionally lagging variables. The data structure is as follows: paired variables of S&P 1500 CEO characteristics, identified by a company key and financial year. One company can have multiple rows for the same financial year (multiple CEOs in that year). I would like to look up the value of a third variable (called AT), taking the last value of the previous financial year within the same key (same company).
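One way this could be approached is to reduce the data to one row per company and year (the last AT), shift that year forward by one, and join it back. In this sketch the names `gvkey`, `fyear` and `AT_prev` are assumptions, and "last" simply means the last row within a company-year in the current row order, so the data would need to be sorted appropriately first.

```r
library(dplyr)

# Hypothetical toy data: company 1 has two CEOs in 2010.
ceo <- data.frame(
  gvkey = c(1, 1, 1, 1, 2, 2),
  fyear = c(2010, 2010, 2011, 2012, 2010, 2011),
  AT    = c(100, 110, 120, 130, 50, 60)
)

# Last AT per company-year, relabelled as belonging to the following year.
prev_year <- ceo %>%
  group_by(gvkey, fyear) %>%
  summarise(AT_prev = last(AT), .groups = "drop") %>%
  mutate(fyear = fyear + 1)

# Every row of year T receives the last AT of year T-1 of the same company.
ceo_lagged <- ceo %>%
  left_join(prev_year, by = c("gvkey", "fyear"))
```

Rows whose company has no data for the previous financial year get NA in `AT_prev`.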
I'm trying to create a plot showing the number of items (e.g. pop_songs) released per year from a data frame I have (e.g. Music_Charts).
I have a year released column in my dataframe and can use that as the x-variable, but I don't know what I would use for the y-variable to show the boxplot since I have the Top 500 Ranked songs on the dataframe.
Well, based on your very general question, if you have a data frame column with the years for each song, you can easily get the count for that column using table.
table(dataframe$year_released)
That should give you the number of entries for every year, and then you can plot them (I'm guessing that's what you need).
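For example, the counts from table() can be passed straight to barplot(). The toy data frame below is made up; only the `Music_Charts` and `year_released` names echo the question.

```r
# Hypothetical toy data: one row per song.
Music_Charts <- data.frame(
  year_released = c(1999, 2001, 2001, 2003, 2003, 2003)
)

# Number of songs per release year.
songs_per_year <- table(Music_Charts$year_released)

# Years on the x-axis, counts on the y-axis.
barplot(songs_per_year,
        xlab = "Year released",
        ylab = "Number of songs")
```

Here the count itself serves as the y-variable, which is why a bar plot of counts is used in this sketch rather than a box plot.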
I have a data frame with 17497 observations of 1681 variables that I am working with using R. Some variables are nominal, some are ordinal, some numeric, etc.
I am concentrating on the one that stands for net recalled salary from the previous month (dataframe$q31, where q31 simply denotes question 31 from a questionnaire). The variable is numeric.
It so happens that there are many missing values, denoted as NAs. People in managerial and professional positions tend to be less willing to reveal their income. At the same time, they are likely to earn more. Hence, my further analysis might be distorted.
I would like to create another column with the net recalled salary where the NAs are replaced not with an average, but with the value a given person would most likely have reported, preferably taking into account all other characteristics in the data frame.
If that is not possible, then at least the following:
profession (q22isc27, ordinal)
years of experience (q24c, numeric)
age (q9age, numeric)
sex (q8, 1- men, 2 - women)
year when surveyed (pgssyear, numeric)
years of education (problematic: the q131ed variable is available for all years, but it was filled in by the surveyor and is highly approximate; it also needs to be recoded into numeric, as it somehow gets displayed as nominal in R. Since 1999, q131edr is also available; it was filled in by the respondents themselves and is ordinal (in SPSS it is displayed as "scale"))
marital status (q21, ordinal)
ownership status of the company where employed (q46e, ordinal)
hours worked per week (q21, numeric)
weight variable (weight, numeric: it depicts "representativeness" of a person in respect to the whole population) (!)
if possible also region where the respondent lives, but till 1999 there were 49 districts in Poland and afterwards 16, hence there are two variables: voiev49 and voiev16 that are coded as NAs for the invalid years.
I think it might be related to propensity score matching or to these packages that I found online: http://cran.r-project.org/web/packages/optmatch/optmatch.pdf
Is there any magical way to do it in R?
It seems I will be able to handle it using the Amelia package:
http://cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf#subsection.4.4
http://cran.r-project.org/web/packages/Amelia/Amelia.pdf
And true, there is a lot of material on it on Cross Validated, e.g.
https://stats.stackexchange.com/questions/95832/missing-values-nas-in-the-test-data-when-using-predict-lm-in-r
#nograpes, thank you for all the hints!
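For anyone who wants to see the basic idea before turning to Amelia, here is a very simple sketch of regression-based imputation with lm() and predict(), in the spirit of the Cross Validated thread linked above. This is not what Amelia does (Amelia performs multiple imputation); a single regression fill like this understates uncertainty. The toy values are made up, only two of the predictors are used, and the `q31_imputed` column name is my own.

```r
# Hypothetical toy data using three of the variable names from the question.
survey <- data.frame(
  q9age = c(25, 30, 35, 40, 45, 50, 55, 60),                # age
  q24c  = c(1, 8, 5, 20, 12, 30, 25, 40),                   # years of experience
  q31   = c(2000, 2300, NA, 3100, 3400, NA, 4200, 4500)     # net salary
)

# Fit on the complete rows only (lm() drops rows with NA by default).
fit <- lm(q31 ~ q9age + q24c, data = survey)

# Fill each missing salary with the model's prediction for that person.
missing <- is.na(survey$q31)
survey$q31_imputed <- survey$q31
survey$q31_imputed[missing] <- predict(fit, newdata = survey[missing, ])
```

The survey weights could in principle be passed through the `weights` argument of lm(), and the remaining predictors added to the formula, but whether that is appropriate depends on the survey design.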