Converting year of birth variable into age variable in R

I'm working on a dataset from 2009. I have a variable with the birth years of the respondents, which I would like to recode into an age variable. I guess the easiest way to do so is to subtract from 2009 the dates of birth of all observations of the column of the dataframe, but I don't know how, because I am still new to R (I have been working with Stata so far). What is the easiest way to do the recoding?
Here is the head of my data frame. The column with the year of birth is e2:
d6c d7a d7b d7c d7d e1 e2 e3 e3a
1 neither no answer Q not asked no answer no answer 1 1961 in [country] -1
2 agree agree Q not asked agree strongly agree 1 1945 in [country] -1
3 agree agree Q not asked neither agree 1 1945 in [country] -1
4 strongly disagree strongly agree Q not asked disagree disagree 0 1961 in [country] -1
e4 e4a e4b e6a e6b e7 e7a e8
1 large town or city -1 48 -1 Q not asked other, not in labour force -1 -1
2 small or middle-sized town -1 63 -1 Q not asked employed full-time [32 hrs or more per week] -1 -1
3 rural area of village -1 64 -1 Q not asked retired -1 -1
4 rural area of village -1 48 -1 Q not asked employed part-time [15-32 hrs] -1 -1
e9 e10 e11 e12a
1 never None married or living as married Q not asked
2 never None married or living as married Q not asked
3 2 or 3 times a month Protestant, no denomination given married or living as married Q not asked
4 a number of times a year Roman Catholic married or living as married Q not asked
e12b e12c e13 e14 e15a e15b
1 Q not asked Q not asked -1 -1 -1 -1
2 Q not asked Q not asked -1 -1 -1 -1
3 Q not asked Q not asked -1 -1 -1 -1
4 Q not asked Q not asked -1 -1 -1 -1

Sample data:
df <- data.frame(
e2 = c(1967, 1977, 1988, 1955, 2000, 1969)
)
To get age, simply subtract df$e2 from 2009:
df$age <- 2009 - df$e2
df
e2 age
1 1967 42
2 1977 32
3 1988 21
4 1955 54
5 2000 9
6 1969 40
If e2 is of type character, convert to type numeric:
df$age <- 2009 - as.numeric(df$e2)
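If e2 was read in as a factor rather than character, as.numeric() alone returns the internal level codes instead of the years, so it is safer to go through as.character() first. The sketch below also treats -1 as a missing-value code; that is an assumption based on the -1 entries elsewhere in the question's data, and raw_e2 is a made-up example vector.

```r
# Hypothetical example: birth years stored as a factor, with -1 as a missing code
raw_e2 <- factor(c("1961", "1945", "-1", "1961"))

# Convert via character first; as.numeric() on a factor returns level codes
birth_year <- as.numeric(as.character(raw_e2))

# Assumption: -1 means "no answer", so recode it to NA before subtracting
birth_year[birth_year == -1] <- NA

age <- 2009 - birth_year
age
# 48 64 NA 48
```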

Related

Is there a way to compute a variable with hierarchy in its categories?

I want to compute a variable with a hierarchical order in its values. Here is a slice of a fake dataset for this purpose.
study_id covid_result days_from_death time0_death indexyear
999100 N -7 0 2022
999101 C 9 0 2022
999101 N -3 0 2021
999102 N -87 0 2020
999103 N -89 0 2022
999103 N 1 0 2021
999103 P 0 0 2020
999104 C -98 0 2020
999104 N -64 0 2020
999105 P 4 0 2021
999106 P 0 0 2021
999107 N -84 0 2022
999108 N -95 0 2020
999108 P -45 0 2020
999109 N -2 0 2022
My objective is to create a variable covid_status_death (COVID-19 status at death) with three categories: positive, negative, other. Each person can have more than one covid result, hence more than one row.
(1) A person will be labelled as covid-positive if they had a positive covid result (covid_result = P) at any time from 30 days before to 7 days after death. (2) A person will be labelled as covid-negative if they had a negative covid result within the same time window and no positive test in that window. (3) Everyone else will be categorized as other.
What is the best way to approach this problem? I have tried the case_when() approach, but I cannot figure out a way to introduce hierarchy, as described above. Please see the attached code below:
coviddata %>% mutate(covid_status_death = case_when(
covid_result == "P" & between(days_from_death,-30,7)~"Positive",
covid_result == "N" & between(days_from_death,-30,7)~"Negative",
TRUE ~"other"))
I am new to R programming and any help will be much appreciated.
Tony
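One possible way to get the hierarchy is to work per person rather than per row: first flag whether a person has any positive (or any negative) result inside the window, then let positive outrank negative. The sketch below uses base R's ave() for the per-person step and reuses the fake data and the labels from the question; it is one reading of the rules above, not the only one.

```r
# Fake data copied from the question
coviddata <- data.frame(
  study_id = c(999100, 999101, 999101, 999102, 999103, 999103, 999103,
               999104, 999104, 999105, 999106, 999107, 999108, 999108, 999109),
  covid_result = c("N", "C", "N", "N", "N", "N", "P", "C",
                   "N", "P", "P", "N", "N", "P", "N"),
  days_from_death = c(-7, 9, -3, -87, -89, 1, 0, -98,
                      -64, 4, 0, -84, -95, -45, -2)
)

# Row-level flag: is this test inside the -30 to +7 day window?
in_window <- coviddata$days_from_death >= -30 & coviddata$days_from_death <= 7

# Person-level flags, repeated on every row of that person
any_pos <- ave(coviddata$covid_result == "P" & in_window,
               coviddata$study_id, FUN = any)
any_neg <- ave(coviddata$covid_result == "N" & in_window,
               coviddata$study_id, FUN = any)

# Positive outranks negative; everyone else is "other"
coviddata$covid_status_death <-
  ifelse(any_pos, "Positive", ifelse(any_neg, "Negative", "other"))
```

With this data, person 999103 comes out as Positive even though two of their three rows are negative, because the positive test on day 0 takes priority.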

Multiple columns in one random effect GLMER

I'm trying to find the variance in an infectivity trait of animals in different herds. Each herd contains a fixed number of offspring from 5 different sires.
Example of data:
Herd  S   C  DeltaT  I   sire1  I1  sire2  I2  sire3  I3  sire4  I4  sire5  I5
1     20  0  14      1   13     0   26     0   46     0   71     0   91     1
1     1   0  14      5   13     1   26     0   46     2   71     1   91     1
18    4   0  14      13  2      5   52     4   84     2   87     2   98     0
19    11  3  14      27  2      6   13     7   18     3   46     5   85     6
Herd is the herd name. S is the number of susceptible animals in the herd, C is the number of cases in the time interval, and DeltaT is the length of the time interval. sire# is the ID of a sire in the herd, and I# is the number of infected offspring of the corresponding sire#. This means that the sire ID "13" in the sire1 column of the first two rows refers to the same sire as the "13" in the sire2 column of the last row. Including these 5 sires in a single random effect in glmer from lme4 is giving me trouble.
I tried:
glmer(data = GLMM_Data,
cbind(C, S-C) ~ (1 | Herd) + (1| (I1 | sire1) + (I2 | sire2) + (I3 | sire3) + (I4 | sire4) + (I5 | sire5)),
offset = log(GLMM_Data$I/nherds * GLMM_Data$DeltaT),
family = binomial(link="cloglog"))
This gave errors. So any help on combining these 10 columns in a single random factor would be more than welcome. Thanks in advance.
p.s. I know my offset, family and the left side of the formula are working since the analysis of susceptibility is working
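I'm not aware of glmer formula syntax that spreads one random effect over several ID columns, so a possible first step is to collect the ten columns into a single matrix with one column per distinct sire, holding that sire's infected-offspring count in each row. The sketch below only builds that matrix from the example rows; the name Z is an assumption, and how to feed such a matrix into a model is left open here.

```r
# Example rows from the question
GLMM_Data <- data.frame(
  Herd = c(1, 1, 18, 19), S = c(20, 1, 4, 11), C = c(0, 0, 0, 3),
  DeltaT = 14, I = c(1, 5, 13, 27),
  sire1 = c(13, 13, 2, 2),   I1 = c(0, 1, 5, 6),
  sire2 = c(26, 26, 52, 13), I2 = c(0, 0, 4, 7),
  sire3 = c(46, 46, 84, 18), I3 = c(0, 2, 2, 3),
  sire4 = c(71, 71, 87, 46), I4 = c(0, 1, 2, 5),
  sire5 = c(91, 91, 98, 85), I5 = c(1, 1, 0, 6)
)

sire_cols  <- paste0("sire", 1:5)
count_cols <- paste0("I", 1:5)

# All distinct sire IDs, regardless of which sire# column they appear in
sire_ids <- sort(unique(unlist(GLMM_Data[sire_cols])))

# One row per observation, one column per distinct sire (Z is a made-up name)
Z <- matrix(0, nrow = nrow(GLMM_Data), ncol = length(sire_ids),
            dimnames = list(NULL, as.character(sire_ids)))

# Add each I# count to the cell of the sire named in the matching sire# column
for (k in seq_along(sire_cols)) {
  idx <- cbind(seq_len(nrow(GLMM_Data)),
               match(GLMM_Data[[sire_cols[k]]], sire_ids))
  Z[idx] <- Z[idx] + GLMM_Data[[count_cols[k]]]
}
```

In this layout sire 13 has a single column, whether it was listed under sire1 (first two rows) or sire2 (last row), and each row of Z sums to that row's I.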

Mutation of non-conformable arrays

library(boot)
install.packages("AMORE")
library(AMORE)
l.data=nrow(melanoma)
set.seed(5)
idxTrain<-sample(1:l.data,100)
idxTest<-setdiff(1:l.data,idxTrain)
set.seed(3)
net<-newff(n.neurons=c(6,6,3),
learning.rate.global=0.02,
momentum.global=0.5,
hidden.layer="sigmoid",
output.layer="purelin",
method="ADAPTgdwm",
error.criterium="LMS")
result<-train(net,
melanoma[idxTrain,-2],
melanoma$status,
error.criterium="LMS",
report=TRUE,
show.step=10,
n.shows=800)
The problem is that the call to train() fails with the error "target - non-conformable arrays".
I know that the problem is with melanoma$status, but I have no idea how to alter the data accordingly. Any ideas? Here are a few sample rows of the data (in case you don't have the boot package).
melanoma:
time status sex age year thickness ulcer
1 10 3 1 76 1972 6.76 1
2 30 3 1 56 1968 0.65 0
3 35 2 1 41 1977 1.34 0
4 99 3 0 71 1968 2.90 0
5 185 1 1 52 1965 12.08 1
Your target variable should first take only the training indices. Moreover, the target should have a number of columns equal to the number of classes - with one-hot encoding. Something like this:
net<-newff(n.neurons=c(6,6,3),
learning.rate.global=0.02,
momentum.global=0.5,
hidden.layer="sigmoid",
output.layer="purelin",
method="ADAPTgdwm",
error.criterium="LMS")
Target = matrix(data=0, nrow=length(idxTrain), ncol=3)
status_mat=matrix(nrow=length(idxTrain), ncol=2)
status_mat[,1] = c(1:length(idxTrain))
status_mat[,2] = melanoma$status[idxTrain]
Target[(status_mat[,2]-1)*length(idxTrain)+status_mat[,1]]=1
result<-train(net,
melanoma[idxTrain,-2],
Target,
error.criterium="LMS",
report=TRUE,
show.step=10,
n.shows=800)
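The one-hot step on its own can also be written with a two-column index matrix instead of the linear-index arithmetic above. This is a small sketch using the status values of the five sample rows shown in the question; status_train is a made-up name standing in for melanoma$status[idxTrain].

```r
# Status values of the five sample rows from the question (classes 1, 2, 3)
status_train <- c(3, 3, 2, 3, 1)

# Start with all zeros: one row per observation, one column per class
Target <- matrix(0, nrow = length(status_train), ncol = 3)

# Each row of the index matrix is a (row, column) pair to set to 1
Target[cbind(seq_along(status_train), status_train)] <- 1
Target
```

Every row of Target then contains exactly one 1, in the column that matches the status of that observation.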

providing date in x-axis in R

Can anyone please tell me how to give the start and end as dates in a time series in R? I know how to give a sequence, say:
ts <- ts(temp, start=1, end=10)
But I want to show starting at Jan 01 and ending in Jan 10 instead of just 1 to 10. Thanks in advance.
The basic time-series functionality in ts is probably not going to be enough for you. There are a lot of available tools for working with time-series in R, but the ts class is geared towards representing
"regularly spaced time series (using numeric time stamps). Hence, it is
particularly well-suited for annual, monthly, quarterly data, etc"
If you describe your data correctly, the print command will format it nicely. If you wanted your data divided into months, you could do something like this (note the frequency of 12):
> print(ts(round(rnorm(44)), start = c(2012,3), frequency = 12), calendar = TRUE)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2012 2 1 -2 -1 1 1 0 -1 1 -1
2013 1 0 2 1 -1 1 -1 -1 0 2 1 2
2014 2 0 -1 0 1 1 0 0 2 -1 0 0
2015 2 0 0 -1 0 0 0 1 2 0
Since you want daily intervals, you're going to want to set frequency to 365:
> print(ts(letters, start = c(2013, 1), frequency = 365), calendar = TRUE)
p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15 p16 p17 p18 p19 p20 p21
2013 a b c d e f g h i j k l m n o p q r s t u
p22 p23 p24 p25 p26
2013 v w x y z
This is going to look rather awkward, but it solves your problem in that it won't just give you a number for each day. However, as stated in the docs, the ts class only supports "numeric time stamps", so this is probably the best you're going to get with the built-in ts features.
If you want more advanced features I would have a look at some of the tools in this documentation.
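If all you need is Jan 01 to Jan 10 on the axis, one simple alternative to ts is to keep a vector of Date values next to your data. This is a minimal sketch in base R; the temp values and the year 2013 are made up for illustration.

```r
# Hypothetical stand-in for the `temp` vector from the question
temp <- c(21.5, 22.1, 19.8, 20.4, 23.0, 22.7, 21.9, 20.2, 19.5, 21.1)

# One Date per observation, starting on Jan 01 (the year is an assumption)
dates <- seq(as.Date("2013-01-01"), by = "day", length.out = length(temp))

# Keep dates and values side by side instead of relying on ts time stamps
daily <- data.frame(date = dates, temp = temp)
range(daily$date)
```

Calling plot(daily$date, daily$temp, type = "l") then draws the series with dates on the x-axis.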

Ifelse statements for a dataframe in R

I am hoping that someone can help me figure out how to write an if-else statement to work on my dataset. I have data on tree growth rates by year. I need to calculate whether growth rates decreased by >50% from one year to the next. I am having trouble applying an ifelse statement to calculate my final field. I am relatively new to R, so my code is probably not very efficient, but here is an example of what I have so far:
For an example dataset,
test<-data.frame(year=c("1990","1991","1992","1993"),value=c(50,25,20,5))
year value
1 1990 50
2 1991 25
3 1992 20
4 1993 5
I then calculate the difference between the current year and previous year's growth ("value"):
test[-1,"diff"]<-test[-1,"value"]-test[-nrow(test),"value"]
year value diff
1 1990 50 NA
2 1991 25 -25
3 1992 20 -5
4 1993 5 -15
and then calculate what 50% of each year's growth would be:
test$chg<-test$value * 0.5
year value diff chg
1 1990 50 NA 25.0
2 1991 25 -25 12.5
3 1992 20 -5 10.0
4 1993 5 -15 2.5
I am then trying to use an ifelse statement to calculate a field "abrupt" that would be "1" when the decline from one year to the next is greater than 50%. This is the code I am trying to use, but I'm not sure how to properly reference the "chg" field from the previous year, because I am getting an error (copied below):
test$abrupt<-ifelse(test$diff<0 && abs(test$diff)>=test[-nrow(test),"chg"],1,0)
Warning message:
In abs(test$diff) >= test[-nrow(test), "chg"] :
longer object length is not a multiple of shorter object length
> test
year value diff chg abrupt
1 1990 50 NA 25.0 NA
2 1991 25 -25 12.5 NA
3 1992 20 -5 10.0 NA
4 1993 5 -15 2.5 NA
A test of a similar ifelse statement worked when I just assigned a few numbers, but I'm not sure how to get this to work in the context of a data frame. Here is an example of it working on just a few values:
prevyear<-50
curryear<-25
chg<-prevyear*0.5
> chg
[1] 25
> diff<-curryear-prevyear
> diff
[1] -25
> abrupt<-ifelse(diff<0 && abs(diff)>= chg,1,0)
> abrupt
[1] 1
If anyone could help me figure out how to apply a similar ifelse statement to my dataframe I would greatly appreciate it! Thank you for any help you can provide.
thank you,
Katie
It's throwing a warning because the two vectors compared abs(test$diff) >= test[-nrow(test),"chg"] have different lengths. Also, for logical and, you are using && (which gives only one TRUE or FALSE) when you should be using & (which is vectorized: it operates elementwise over two vectors and returns a vector of the same length). Try this:
test$abrupt<-ifelse(test$diff<0 & abs(test$diff)>=test$chg,1,0)
I would change where you're putting chg so that it lines up with the diff you want to compare it to:
test$chg[2:nrow(test)] <- test$value[1:(nrow(test)-1)] * 0.5
Then, correct your logical operator like Blue Magister said:
test$abrupt<-ifelse(test$diff<0 & abs(test$diff)>=test$chg,1,0)
and you have your results:
year value diff chg abrupt
1 1990 50 NA NA NA
2 1991 25 -25 25.0 1
3 1992 20 -5 12.5 0
4 1993 5 -15 10.0 1
Also, you may find the function diff helpful: rather than doing this:
test[-1,"value"]-test[-nrow(test),"value"]
you can just do
diff(test$value)
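Putting the pieces together, here is a compact sketch of the whole calculation with diff(). Note that diff() returns one value fewer than its input, so the first year is padded with NA; prev is a made-up helper name for the previous year's value.

```r
test <- data.frame(year = c("1990", "1991", "1992", "1993"),
                   value = c(50, 25, 20, 5))

# Previous year's value, shifted down one row (NA for the first year)
prev <- c(NA, head(test$value, -1))

# diff() returns n - 1 differences, so pad the first year with NA
test$diff <- c(NA, diff(test$value))

# Abrupt decline: a drop of at least half of the previous year's value
test$abrupt <- ifelse(test$diff < 0 & abs(test$diff) >= 0.5 * prev, 1, 0)
test$abrupt
# NA 1 0 1
```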
