R treats the subset as factor variables instead of numeric variables

In a complex data frame I have a column with a net recalled salary (including NAs, which I want to exclude) plus a column with the year the study was conducted, ranging from 1992 to 2010, more or less like this:
q32    pgssyear
2000   1992
1000   1992
NA     1992
3000   1994
etc.
If I try to draw a boxplot like:
boxplot(dataset$q32~pgssyear,data=dataset, main="Recalled Net Salary per Month (PLN)",
xlab="Year", ylab="Net Salary")
it seems to work; however, the NAs might distort the calculations, so I wanted to get rid of them:
boxplot(na.omit(dataset$q32)~pgssyear,data=dataset, main="Recalled Net Salary per Month (PLN)",
xlab="Year", ylab="Net Salary")
Then I get a warning message that the lengths of pgssyear and q32 do not match, most likely because I removed the NAs from q32, so I tried to shorten pgssyear so that it does not include the rows that correspond to NAs in the q32 column:
pgssyearprim <- subset(dataset$pgssyear, dataset$q32!= NA )
however, pgssyearprim then gets treated as a factor variable:
pgssyearprim
factor(0)
Levels: 1992 1993 1994 1995 1997 1999 2002 2005 2008 2010
and I get the same warning message if I introduce it into the boxplot formula...

Of course they wouldn't match: na.omit(dataset$q32) ~ pgssyear removes data from the left-hand side of the formula only, so the two sides end up with different lengths. (The subset() attempt fails for a different reason: any comparison with NA, including dataset$q32 != NA, yields NA, so no rows are selected and you get the empty factor(0).) Instead, use !is.na(dataset$q32) as a subset argument.
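A minimal sketch of that suggestion: the formula interface of boxplot() accepts a subset argument, so the NA rows are dropped from both sides of the formula at once and the lengths stay matched.
# Drop the NA rows from both sides of the formula at once
boxplot(q32 ~ pgssyear, data = dataset,
        subset = !is.na(q32),
        main = "Recalled Net Salary per Month (PLN)",
        xlab = "Year", ylab = "Net Salary")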

Portfolio sorts with incomplete data

I have panel data of stock returns where, after a certain year, the coverage universe of stocks doubled. It looks a bit like this:
Year Stock 1 Stock 2 Stock 3 Stock 4
2000 5.1% 0.04% NA NA
2001 3.6% 9.02% NA NA
2002 5.0% 12.09% NA NA
2003 -2.1% -9.05% 1.1% 4.7%
2004 7.1% 1.03% 4.2% -1.1%
.....
Of course, I am trying to maximize my observations both in the time series and in the cross-section as much as possible. However, I am not sure which of these three ways to sort would be the most "academically honest":
1. Sort the years until 2001 using only stocks 1 and 2, and incorporate the remaining stocks in the calculations once they become available in 2003.
2. Only include those stocks in the calculations that have been available since 2000, i.e. stocks 1 and 2, and ignore the remaining stocks altogether, since we do not have their full return profile.
3. Start the sort in 2003, to have a larger cross-section.
The reason why our coverage universe expands in 2003 is simply because the data provider I am using changed their methodology in that year and decided to track more stocks. Stocks 3 and 4 do exist before 2003, but I cannot use their past return data since I need to follow my data provider (for the second variable I am sorting on).
Thanks all!
I am using the portsort() package in R, but it does not seem to work well with NAs.
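For what it is worth, options 2 and 3 can both be implemented by trimming the panel before it ever reaches portsort(); a minimal sketch, assuming a hypothetical data frame named returns shaped like the example above (a Year column plus one column per stock):
# Hypothetical 'returns' data frame mimicking the panel above

# Option 2: keep only stocks with a complete return history (no NAs at all)
complete_cols <- colSums(is.na(returns[, -1])) == 0
full_history <- returns[, c(TRUE, complete_cols)]

# Option 3: start the sample where the wider coverage begins (2003 here)
wide_cross_section <- returns[returns$Year >= 2003, ]
Which of the three choices is most defensible is a methodology question rather than a coding one, but either subset above at least avoids feeding NAs into the sort.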

Adding missing panel dates by group as rows using data.table

I'm having difficulty using data.table operations to correctly manipulate my data. The goal is to create, by group, a number of rows for each group based on the values of two date columns. I'm changing my data here in order to protect it, but it gets the idea across.
head(my_data_table, 6)
team_name play_name first_detected last_detected PlayID
1: Baltimore Play Action 2016 2017 41955-58
2: Washington Four Verticals 2018 2020 54525-52
3: Dallas O1 Trap 2019 2019 44795-17
4: Dallas Play Action 2020 2020 41955-58
5: Dallas Power Zone 2020 2020 54782-29
6: Dallas Bubble Screen 2018 2018 52923-70
The goal is to turn it into this:
team_name play_name year PlayID
1: Baltimore Play Action 2016 41955-58
2: Baltimore Play Action 2017 41955-58
3: Washington Four Verticals 2018 54525-52
4: Washington Four Verticals 2019 54525-52
5: Washington Four Verticals 2020 54525-52
6: Dallas O1 Trap 2019 44795-17
...
n: Dallas Bubble Screen 2018 52923-70
The code I attempted for this purpose is the following:
my_data_table[,.(PlayID, year = seq(first_detected,last_detected,by=1)), by = .(team_name, play_name)]
When I run this code, I get:
Error in seq.default(first_detected_ever, last_detected_ever, by = 1) :
'from' must be of length 1
Two other attempts also failed
my_data_table[,.(PlayID, year = seq(min(first_detected),max(last_detected),by=1)), by = .(team_name, play_name)]
my_data_table[,.(PlayID, year = list(seq(min(first_detected),max(last_detected),by=1))), by = .(team_name, play_name)]
which both result in something that looks like
by year PlayID
1: Baltimore Washington Dallas Play Action 2011, 2012, 2013, 2014, 2015, 2016 ... 41955-58
...
In as.data.table.list(jval, .named = NULL) :
Item 3 has 2 rows but longest item has 38530489; recycled with remainder.
I haven't found any clear answers on why this is happening. It seems that, when first_detected and last_detected are passed to seq(), they are somehow interpreted as the entire columns' values, despite the by = .(team_name, play_name) grouping, which I have verified always yields one distinct row per group. With that grouping, there should be only one value of first_detected and one of last_detected per group. I've done something similar before, but without a by = .(x, y, z, ...) grouping, applying the operation to each row instead. Could anyone help me understand why I am unable to get the desired output with this data.table method?
Despite struggling with this for hours, I managed to solve my own question only a short while later.
The code
my_data_table[,.(PlayID, year = first_detected:last_detected), by = .(team_name, play_name)]
produces the desired result, creating, by group, one row for each year in the inclusive range, so long as first_detected and last_detected are integers.
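For reference, a self-contained sketch of the working approach, rebuilding a small version of the example table from the question:
library(data.table)

my_data_table <- data.table(
  team_name = c("Baltimore", "Washington", "Dallas"),
  play_name = c("Play Action", "Four Verticals", "O1 Trap"),
  first_detected = c(2016L, 2018L, 2019L),
  last_detected = c(2017L, 2020L, 2019L),
  PlayID = c("41955-58", "54525-52", "44795-17")
)

# Within each group the first/last pair has length 1, so ':' expands it
# into one row per year, and PlayID is recycled to match
my_data_table[, .(PlayID, year = first_detected:last_detected),
              by = .(team_name, play_name)]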

Indexing with mutate

I have an unbalanced panel by country as the following:
cname year disability_PC family_PC ... allFunctions_PC
Denmark 1992 953.42 1143.25 ... 9672.43
Denmark 1995 1167.33 1361.62 ... 11002.45
Denmark 2000 1341 1470.54 ... 11200
Finland 1991 1095 955 ... 7164
Finland 1996 1067 1040 ... 7600
And so on for more years and countries. What I would like to do is compute the mobile (chain) index, each value relative to the previous observation, for each type of social expenditure (disability_PC, family_PC, ..., allFunctions_PC).
Therefore, I tried the following:
pdata %>%
group_by(cname) %>%
mutate_at(vars(disability_absPC, family_absPC, Health_absPC, oldage_absPC, unemp_absPC, housing_absPC, allFunctions_absPC),
funs(chg = ((./lag(.))*100)))
The code seems to work, as R prints the first 10 columns and correctly says "with 56 more rows, and 13 more variables". However, these are not added to the data frame. I mean, typing
view(pdata)
the variables do not exist, as if the mutate command had not created them.
What am I doing wrong?
Thank you for the support.
We can make this simpler with some of the select helpers; also, funs() is deprecated, and list() can be used in its place:
library(dplyr)
pdata <- pdata %>%
group_by(cname) %>%
mutate_at(vars(ends_with('absPC')), list(chg = ~ ((./lag(.))*100))
Regarding the issue of the variables not being created: in the OP's code, the output is not assigned to any object, nor does it update the original object (with <-). Once that is done, the columns will be created.
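As a side note, mutate_at() itself is now superseded in dplyr; the same computation can be written with across(), as in this sketch:
library(dplyr)

pdata <- pdata %>%
  group_by(cname) %>%
  mutate(across(ends_with('absPC'),
                ~ (.x / lag(.x)) * 100,
                .names = "{.col}_chg"))
The .names = "{.col}_chg" argument reproduces the var_chg naming that the mutate_at() version generates.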

Differencing with respect to specific value of a column

I have a variable called Depression which has 40 observations and runs quarterly from 2004 to 2013 (e.g. 2004 Q1, 2004 Q2, etc.). I would like to make a new column that differences with respect to the 27th row/observation, which corresponds to 2010 Q3, and set that value to 0. Any help is greatly appreciated!
If I understand your question correctly, this would do it:
# generate sample data
dat <- data.frame(id=paste0("Obs.",1:40),depression=as.integer(runif(40,0,20)))
# Create new var that calculates difference with 27th observation on depression score
dat$diff <- dat$depression - dat$depression[27]
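By construction the 27th entry differences against itself, so dat$diff[27] comes out as exactly 0, which satisfies the requirement that 2010 Q3 be the zero reference; every other entry is its depression score minus the 2010 Q3 score.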

Simple function over twisting data - exercise

So I have a given function that I aim to run over a dataset, but it does not seem to work, producing an error message that I cannot quite figure out. I will describe the dataset first, because its layout matters, then present my function. The name of the dataset is cohort_1x1:
Year Age Male Female
1940 18 0.001234 0.003028
1940 19 0.005278 0.002098
.... .. ........ ........ #up to age 100
1940 100 0.004678 0.006548
1941 18 0.004567 0.002753 # start new year, with age 18
1941 19 0.005633 0.001983
.... .. ........ ........
1979 100 0.003456 0.00256
The dataset contains death rates for males and females, for every age group 18-100, from 1940 to 1979. Further on, the function that I have written is this:
gompakmean=function(theta,y_0,h){
  # Returns the expected remaining number of years at age "y_0" under
  # the Gompertz-Makeham model with parameter "theta" (vector).
  # "h" is the time increment for the numerical integration.
  l_0=y_0/h                                  # starting age on the grid
  l_e=110/h                                  # integrate up to age 110
  ll=(l_0+1):l_e
  hh=(theta[2]/theta[3])*(exp(theta[3]*h)-1)
  p=exp(-theta[1]*h-hh*exp(theta[3]*ll*h))   # one-step survival probabilities
  P=cumprod(p)                               # cumulative survival
  (0.5+sum(P))*h                             # expected remaining years
}
This function returns the expected number of remaining years, for every year/age group, and it is to be computed separately for males and females.
But if I try it with input like the one below, I get the following warning:
Input:
s=-c(8,9,2)
theta=exp(s)
y_0 = cohort_1x1$Male
h = 1
gompakmean(theta,y_0,h)
This leads to the following output and warning message:
[1] 46.13826
Warning message:
In (l_0 + 1):l_e :
numerical expression has 3320 elements: only the first used
So I get an output for the first year (age?), which is: [1] 46.13826. But then the function seems to stop, hence the warning message. Is the reason something to do with my dataset? That maybe after running over 1940, it must have year 1941? But would that not only give me output for age 18 in every year? My aim is to calculate the expected number of remaining years for every age group in every year, i.e. for every cohort in all the years.
Appreciate all answers!:)
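The warning itself points at the cause: y_0 enters the function as a single starting age (l_0 = y_0/h), so when a whole column of 3320 values is passed, only its first element is used. A minimal sketch of one way to get one value per age, reusing the example theta from the question (in practice theta would be fitted separately for each cohort and sex):
# Apply gompakmean() to one age at a time instead of passing a column
s <- -c(8, 9, 2)
theta <- exp(s)
ages <- 18:100
expected_years <- sapply(ages, function(a) gompakmean(theta, a, h = 1))
names(expected_years) <- ages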
