I have some data on individuals where I know their age (18-98) and which country they are from (1 = Germany, 2 = France, etc.).
I have another variable for which I want to see the effect of, e.g., 18-year-old people from Germany.
With dummy(data$age, sep='_') and dummy(data$sg2, sep='_') (sg2 = country) I was able to create dummy variables for these two variables.
But when regressing, the output now shows the effect of every age and every country separately.
How can I combine the dummies, e.g. 18-year-olds from Germany, so that I see their joint effect on the other variable?
#Dummy age
x1 <- dummy(dat$age, sep='_')
#Dummy country
x2 <- dummy(dat$sg2, sep = '_')
fm <- lm(myvariable~x1+x2, data=dat)
summary(fm)
Estimate Std. Error t value Pr(>|t|)
x1age_18 1.547691 0.567995 2.725 0.006437 **
x1age_19 1.632648 0.567939 2.875 0.004047 **
x2sg2_1 0.083239 0.030118 2.764 0.005717 **
x2sg2_2 0.056555 0.030655 1.845 0.065063 .
This is what I get, but how can I get x1age_18 & x2sg2_1 in one?
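One way to get this (a sketch, not tested on your data, and assuming dat$age and dat$sg2 are the original, non-dummied columns): let lm() build the combined dummies itself from an interaction of two factors, so that each age/country combination gets its own coefficient:
# Hypothetical sketch: interaction() crosses the two variables as factors,
# giving one dummy per age-by-country cell, e.g. 18-year-olds from Germany
fm2 <- lm(myvariable ~ interaction(age, sg2), data = dat)
summary(fm2)
The coefficient labelled 18.1 (age 18, country 1) would then be the combined effect you are after.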
I'm looking to generate a dataset in R for a repeated measures model and I'm not sure where to start.
The outcome of interest is continuous between 0-100. This is for a two arm trial (say groups "a" and "b"), with 309 participants in each arm. Each participant is assessed at baseline, then fortnightly for one year (27 total assessments). There will be loss to followup and withdrawals over the year (~30% after one year), and participants may miss individual assessments at random.
For now, I am assuming the standard deviation is the same at each timepoint and for both arms (SD = 11). The mean will change over time. I'm working on the assumption that each participant's score is correlated with their baseline measurement.
How can I generate this dataset? I'm intending to compare repeated measures regression methods.
I think the following fulfils your requirements. It works by taking the cumulative sum of samples from a normal distribution across the 27 visits and converting these onto a logistic scale between 0 and 100 (so that the maximum / minimum scores are never breached). It uses replicate to do this for 309 participants. It then simulates 30% dropouts by choosing random participants and a random week, after which their measurements are all NA. It also adds in some random missing weeks for the rest of the participants. The result is pivoted into long format to allow for easier analysis.
library(tidyverse)
set.seed(1)
# Generate correlated scores for 309 people over 27 visits
df <- setNames(cbind(data.frame(ID = 1:309, t(replicate(309, {
  x <- cumsum(rnorm(27, 0.05, 0.1))
  round(100 * exp(x) / (1 + exp(x)))
})))), c('ID', paste0('Visit_', 1:27)))
# Model dropouts at 30% rate
dropout <- sample(c(TRUE, FALSE), 309, TRUE, prob = c(0.7, 0.3))
# For each dropout, pick a random visit; their score there becomes NA
df[cbind(which(!dropout), sample(2:28, sum(!dropout), TRUE))] <- NA
# Propagate forward: cumsum() stays NA after the first NA, so all later visits are also NA
df <- as.data.frame(t(apply(df, 1, function(x) ifelse(is.na(cumsum(x)), NA, x))))
# Add random missing visits
df[cbind(sample(309, 100, TRUE), sample(2:28, 100, TRUE))] <- NA
df <- pivot_longer(df, -ID, names_to = 'Week', values_to = 'Score') %>%
  mutate(Week = 2 * (as.numeric(gsub('\\D+', '', Week)) - 1))
Our data frame now looks like this:
head(df)
#> # A tibble: 6 x 3
#> ID Week Score
#> <dbl> <dbl> <dbl>
#> 1 1 0 50
#> 2 1 2 51
#> 3 1 4 51
#> 4 1 6 56
#> 5 1 8 58
#> 6 1 10 57
And we can see the scores drift upward over time (since we set a small positive mean on our rnorm when creating the scores):
lm(Score ~ Week, data = df)
#>
#> Call:
#> lm(formula = Score ~ Week, data = df)
#>
#> Coefficients:
#> (Intercept) Week
#> 52.2392 0.5102
We can plot and see the overall shape of the scores and their spread:
ggplot(df, aes(Week, Score, group = ID)) + geom_line(alpha = 0.1)
My goal is to plot a map with each point representing the year of the highest measured value. For that I need the year as one value and the station name as row name.
I get to the point where I get the year of the maximum value for each station, but I don't know how to get the station name as row name.
My example is the following:
set.seed(123)
df1<-data.frame(replicate(6,sample(0:200,2500,rep=TRUE)))
date_df1<-seq(as.Date("1995-01-01"), by = "day", length.out = 2500)
test_sto<-cbind(date_df1, df1)
test_sto$date_df1<-as.Date(test_sto$date_df1)
test_sto <- test_sto %>%
  dplyr::mutate(year  = lubridate::year(date_df1),
                month = lubridate::month(date_df1),
                day   = lubridate::day(date_df1))
This is my data frame; I then applied the following steps:
To get all values above the threshold for each year and station:
test_year<-aggregate.data.frame(x=test_sto[2:7] > 120, by = list(test_sto$year), FUN = sum, na.rm=TRUE )
This works as it should; the next step is the following:
m <- ncol(test_year)
Value <- rep(NA,m)
for (j in 2:m) {
  idx <- which.max(test_year[, j])
  Value[j] <- test_year[, 1][idx]
}
test_test<-Value[2:m]
At the end of this, I get the following table:
     x
1 1996
2 1996
3 1998
4 1996
5 1999
6 1999
But instead of the 1, 2, 3, 4, 5, ... I need my column names there (X1, X2, X3, etc.):
      x
X1 1996
X2 1996
X3 1998
X4 1996
X5 1999
X6 1999
But this is the point where I'm struggling.
I tried the following:
test_year$max<-apply(test_year[2:7], 1, FUN = max)
apply(test_year[2:7], 2, FUN = max)
test_year2<-subset(test_year, ncol(2:7) == max(ncol(2:7)))
But I'm just getting an error message saying:
In max(ncol(2:7)) : no non-missing arguments to max; returning -Inf
Maybe someone knows a workaround! Thanks in advance!
'test_test' is just a vector: a one-dimensional object whose size is characterized by its length, so it has no row.names attribute. But we can give it a names attribute:
names(test_test) <- colnames(test_year)[-1]
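Using the values shown in the question, the named vector should then print with the station labels (a sketch of the expected output):
test_test
#>   X1   X2   X3   X4   X5   X6
#> 1996 1996 1998 1996 1999 1999
And if true row names are needed (e.g. for joining to station coordinates for the map), wrapping the vector in a one-column data frame carries the names over as row.names:
data.frame(x = test_test)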
I have a data set depicting the BMI change for 199 countries over the span of 29 years (1980 to 2008). The data was originally in a wide format, and I (thought I) tidied it up:
BMI <- read.csv("Indicator_BMI female ASM.csv", header = TRUE)
BMI2 <- gather(BMI, "year", "BMI", X1980:X2008)
BMI2$year <- as.numeric(BMI2$year)
BMI <- BMI2
until it ended up in this format:
Country Year BMI
Afghanistan X1980 20.443
Afghanistan X1981 20.477
Afghanistan X1982 20.523
...
Albania X1980 25.174
Albania X1981 25.191
.......
Algeria X1980 20.241
.....
etc.; you get the idea.
What I want to do is, for each country, get the slope of the relationship between year and BMI.
First I tried very simply to extract the slope for one country at a time:
thefit <- lm(BMI ~ year, subset(BMI, Country == "Albania"))
gradientAlbania <- round(coef(thefit)[2],4)
but the output from the lm alone is quite unexpected (an intercept plus a separate coefficient for each year, i.e. year is treated as a factor):
Coefficients:
(Intercept) yearX1981 yearX1982 yearX1983 yearX1984 yearX1985 yearX1986 yearX1987 yearX1988 yearX1989 yearX1990
25.17427 0.01661 0.02605 0.04479 0.04932 0.03830 0.03512 0.01491 -0.00462 -0.02057 -0.03550
yearX1991 yearX1992 yearX1993 yearX1994 yearX1995 yearX1996 yearX1997 yearX1998 yearX1999 yearX2000 yearX2001
-0.12757 -0.20812 -0.23869 -0.23865 -0.23721 -0.20784 -0.20673 -0.17207 -0.11173 -0.04630 0.02905
yearX2002 yearX2003 yearX2004 yearX2005 yearX2006 yearX2007 yearX2008
0.09655 0.15771 0.22377 0.29098 0.35901 0.41967 0.48299
I guess it has to do with the format of the data frame (like the X in front of the year that I couldn't get rid of).
However, I successfully created a BMI ~ year scatter plot for this data, by country, so I assumed that in principle the format should be OK?
[figure: BMI ~ year scatter plot, by country]
Thanks for the help in advance.
In case it matters, later I want to include a slope function as part of the dplyr group_by, but this will come in a separate post (I decided to break down the question in different posts for clarity).
As @aosmith mentioned, you need to get rid of those "X"s.
Does this work on your data and accomplish what you want?
library(tidyverse)  # for gather(), group_by(), nest(), map()

BMI <- read.csv("Indicator_BMI female ASM.csv", header = TRUE)
BMI2 <- gather(BMI, "year", "BMI", X1980:X2008)
BMI2$year <- as.numeric(gsub('X', "", BMI2$year))

# Adapted from "R for Data Science"
country_model <- function(df){
  lm(BMI ~ year, data = df)
}

BMI2 %>%
  group_by(Country) %>%
  nest() %>%
  mutate(model = map(data, country_model)) %>%
  mutate(slope = map_dbl(model, ~ coef(.x)[[2]]))
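If the nested-model machinery isn't needed, a shorter route (a sketch, assuming the same column names as above) computes each country's slope directly inside summarize():
BMI2 %>%
  group_by(Country) %>%
  summarize(slope = coef(lm(BMI ~ year))[[2]])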
I'm trying to use the 'relsurv' package in R to compare the survival of a cohort to national life tables. The code below shows my problem using the example from relsurv but changing the life-table data. I've just used two years and two ages in the life-table data below, the actual data is much larger but gives the same error. The error is 'invalid ratetable argument' but I've formatted it as per the example life-tables 'slopop' and 'survexp.us'.
library(survival)
library(relsurv)
data(rdata) # example data from relsurv
raw = read.table(header=T, stringsAsFactors = F, sep=' ', text='
Year Age sex qx
1980 30 1 0.00189
1980 31 1 0.00188
1981 30 1 0.00191
1981 31 1 0.00191
1980 30 2 0.00077
1980 31 2 0.00078
1981 30 2 0.00076
1981 31 2 0.00074
')
ages = c(30,40) # in years
years = c(1980, 1990)
rtab = array(data=NA, dim=c(length(ages), 2, length(years))) # set up blank array: ages, sexes, years
for (y in unique(raw$Year)){
  for (s in 1:2){
    # yearly probability of death, transformed to a hazard (see ratetables help)
    rtab[ , s, y - min(years) + 1] = -1 * log(1 - subset(raw, Year == y & sex == s)$qx) / 365.24
  }
}
attributes(rtab)$dimnames[[1]] = as.character(ages)
attributes(rtab)$dimnames[[2]] = c('male','female')
attributes(rtab)$dimnames[[3]] = as.character(years)
attributes(rtab)$dimid <- c("age", "sex", 'year')
attributes(rtab)$dim <- c(length(ages), 2, length(years))
attributes(rtab)$factor = c(0,0,1)
attributes(rtab)$type = c(2,1,4)
attributes(rtab)$cutpoints[[1]] = ages*365.24 # must be in days
attributes(rtab)$cutpoints[[2]] = NULL
attributes(rtab)$cutpoints[[3]] = as.date(paste("1Jan", years, sep='')) # must be date
attributes(rtab)$class = "ratetable"
# example from relsurv
rsmul(Surv(time,cens) ~ sex+as.factor(agegr)+
ratetable(age=age*365.24, sex=sex, year=year),
data=rdata, ratetable=rtab, int=1)
Try using the transrate function from the relsurv package to reformat the data. That should give you a compatible dataset.
Regards,
Josh
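For what it's worth, a minimal sketch of what a transrate() call on the question's data could look like (hypothetical; check ?transrate for the exact convention it expects, in particular whether the age rows must start at age 0):
# Hypothetical reshaping of 'raw': one matrix per sex, ages as rows,
# calendar years as columns, yearly death probabilities (qx) as entries
men   <- matrix(raw$qx[raw$sex == 1], nrow = 2,
                dimnames = list(c("30", "31"), c("1980", "1981")))
women <- matrix(raw$qx[raw$sex == 2], nrow = 2,
                dimnames = list(c("30", "31"), c("1980", "1981")))
rtab2 <- transrate(men, women, yearlim = c(1980, 1981), int.length = 1)
is.ratetable(rtab2)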
Three things to add:
You should set attributes(rtab)$factor = c(0,1,0), since sex (the second dimension) is a factor (i.e., doesn't change over time).
A good way to check whether something is a valid rate table is to use the is.ratetable() function. is.ratetable(rtab, verbose = TRUE) will even return a message stating what was wrong.
Check the result of is.ratetable without verbose first, because the verbose diagnostics can report a problem even for a valid rate table.
The rest of this answer is about that false report.
If the type attribute isn't given, is.ratetable will calculate it using the factor attribute; you can see this by just printing the function. However, it seems to do so incorrectly. It uses type <- 1 * (fac == 1) + 2 * (fac == 0) + 4 * (fac > 0), where fac is attributes(rtab)$factor.
But the next section, which checks the type attribute if it's provided, says the only valid values are 1, 2, 3, and 4. It's impossible to get 1 from the code above.
For example, let's examine the slopop ratetable provided with the relsurv package.
library(relsurv)
data(slopop)
is.ratetable(slopop)
# [1] TRUE
is.ratetable(slopop, verbose = TRUE)
# [1] "wrong length for cutpoints 3"
I think this is where your rate table is being hung up.
I have some basic calculations I want to apply to the residuals of a plm model, but I am stuck on how to automate the steps for a lot of data.
Let's assume the input is a data.frame (df) with the following data:
Id Year Population Y X1 X2 X3
country A 2009 977612 212451.009 19482.7995 0.346657979 0.001023221
country A 2010 985332 221431.632 18989.3 0.345142551 0.001015205
country A 2011 998211 219939.296 18277.79286 0.344020453 0.001002106
country A 2012 1010001 218487.503 17916.2765 0.342434314 0.000990409
country B 2009 150291 177665.268 18444.04522 0.330864789 0.001940218
country B 2010 150841 183819.407 18042 0.327563461 0.001933143
country B 2011 152210 183761.566 17817.3515 0.32539255 0.001915756
country B 2012 153105 182825.112 17626.62261 0.321315437 0.001904557
country c 2009 83129 132328.034 17113.64268 0.359525557 0.005862866
country c 2010 83752 137413.878 16872.5 0.357854141 0.005819254
country c 2011 84493 136002.537 16576.17856 0.356479235 0.005768219
country c 2012 84958 133064.911 16443.3057 0.355246122 0.005736648
A model was applied and the residuals are stored:
library(plm)
fixed <- plm(Y ~ X1 + X2 + X3,
             data = df, drop.unused.levels = TRUE, index = c("Id", "Year"), model = "within")
residuals <- resid(fixed)
In my next step, I want to calculate "weighted averages" of my residuals:
res_t = Σ_i (n_it / n_t) · res_it
with n_it standing for the population in country i at time t and n_t being the total population at t.
My approach so far is:
First I compute the total population nt for every year:
year_range <- seq(from = 2009, to = 2012, by = 1)
tot_pop <- NULL
for (n in year_range) {
  tot_pop[n] <- with(df, sum(Population[Year == n]))
}
Before taking the sum of the "weighted" residuals, my next step would be to automate the calculation of my "new" residuals:
res1 <- df$Population[1]/tot_pop[2009] * residuals[1]
res2 <- df$Population[2]/tot_pop[2010] * residuals[2]
res3 <- df$Population[3]/tot_pop[2011] * residuals[3]
...
res12 <- df$Population[12]/tot_pop[2011] * residuals[12]
Edit: Applying the solution of JTT to my problem, the last step would then be:
year_range1 <- rep(year_range, 3)
df_res <- data.frame(year = year_range1, res=as.vector(res))
aggr_res <- aggregate(df_res$res, list(df_res$year), sum)
colnames(aggr_res) <- c("Year", "Aggregated residual")
Is that correct?
I have tried the lapply function and a double "for-loop" without success. I don't know how to do this. Your help would be appreciated. If my question is unclear, please comment and I will try to improve it.
First, instead of a for-loop, you might want to calculate the total population using the aggregate function, e.g.:
a <- aggregate(df$Population, list(df$Year), sum)
Notice the column names of a (Group.1 and x).
Then you could match the results in a to the data in df using the match() function. It gives the matching row numbers, which can be used to pick the right yearly total for each row of df before multiplying by the residuals. For example:
res <- df$Population / a$x[match(df$Year, a$Group.1)] * residuals
Now you should have a vector of "new" residuals in object res.
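From there, the aggregation in the question's edit gives the yearly weighted averages (same objects as above):
# sum the population-weighted residuals within each year
aggr_res <- aggregate(res, list(df$Year), sum)
colnames(aggr_res) <- c("Year", "Aggregated residual")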