I want to compute a variable with a hierarchical order in its values. Here is a slice of a fake dataset for this purpose.
study_id covid_result days_from_death time0_death indexyear
999100 N -7 0 2022
999101 C 9 0 2022
999101 N -3 0 2021
999102 N -87 0 2020
999103 N -89 0 2022
999103 N 1 0 2021
999103 P 0 0 2020
999104 C -98 0 2020
999104 N -64 0 2020
999105 P 4 0 2021
999106 P 0 0 2021
999107 N -84 0 2022
999108 N -95 0 2020
999108 P -45 0 2020
999109 N -2 0 2022
My objective is to create a variable covid_status_death (covid-19 status at death) with three categories: positive, negative, other. Each person could have more than one covid result, hence >1 row.
(1) A person will be labelled as covid-positive if they had a positive covid result (covid_result = P) at any time between 30 days before and 7 days after death. (2) A person will be labelled as covid-negative if they had negative covid results in the absence of a positive test within the same time window. (3) The rest will be categorized as other.
What is the best way to approach this problem? I have tried the case_when() approach, but I cannot figure out a way to introduce hierarchy, as described above. Please see the attached code below:
coviddata %>%
  mutate(covid_status_death = case_when(
    covid_result == "P" & between(days_from_death, -30, 7) ~ "Positive",
    covid_result == "N" & between(days_from_death, -30, 7) ~ "Negative",
    TRUE ~ "other"
  ))
I am new to R programming and any help will be much appreciated.
Tony
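One way to introduce the hierarchy (a sketch, assuming the column names from the sample above; not tested on the full dataset) is to compute person-level flags with group_by() before calling case_when(), so a positive anywhere in the window outranks the row-level results:

```r
library(dplyr)

# minimal slice of the sample data above
coviddata <- data.frame(
  study_id        = c(999100, 999101, 999101, 999103, 999103, 999103),
  covid_result    = c("N", "C", "N", "N", "N", "P"),
  days_from_death = c(-7, 9, -3, -89, 1, 0)
)

result <- coviddata %>%
  group_by(study_id) %>%
  mutate(
    # person-level flags: TRUE on every row of a person if any of their rows qualifies
    any_pos = any(covid_result == "P" & between(days_from_death, -30, 7)),
    any_neg = any(covid_result == "N" & between(days_from_death, -30, 7)),
    covid_status_death = case_when(
      any_pos ~ "Positive",  # a positive in the window beats everything else
      any_neg ~ "Negative",  # negative only when no positive in the window
      TRUE    ~ "other"
    )
  ) %>%
  ungroup()
```

Because case_when() evaluates its conditions in order, putting any_pos first gives the hierarchy described above without needing an explicit !any_pos guard in the second branch.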
I'm working on a dataset from 2009. I have a variable with the birth years of the respondents, which I would like to recode into an age variable. I guess the easiest way is to subtract the birth years from 2009 for every observation in that column of the data frame, but I don't know how, because I am still new to R (I have been working with Stata so far). What is the easiest way to do this recoding?
Here is the head of my data frame. The column with the year of birth is e2:
d6c d7a d7b d7c d7d e1 e2 e3 e3a
1 neither no answer Q not asked no answer no answer 1 1961 in [country] -1
2 agree agree Q not asked agree strongly agree 1 1945 in [country] -1
3 agree agree Q not asked neither agree 1 1945 in [country] -1
4 strongly disagree strongly agree Q not asked disagree disagree 0 1961 in [country] -1
e4 e4a e4b e6a e6b e7 e7a e8
1 large town or city -1 48 -1 Q not asked other, not in labour force -1 -1
2 small or middle-sized town -1 63 -1 Q not asked employed full-time [32 hrs or more per week] -1 -1
3 rural area of village -1 64 -1 Q not asked retired -1 -1
4 rural area of village -1 48 -1 Q not asked employed part-time [15-32 hrs] -1 -1
e9 e10 e11 e12a
1 never None married or living as married Q not asked
2 never None married or living as married Q not asked
3 2 or 3 times a month Protestant, no denomination given married or living as married Q not asked
4 a number of times a year Roman Catholic married or living as married Q not asked
e12b e12c e13 e14 e15a e15b
1 Q not asked Q not asked -1 -1 -1 -1
2 Q not asked Q not asked -1 -1 -1 -1
3 Q not asked Q not asked -1 -1 -1 -1
4 Q not asked Q not asked -1 -1 -1 -1
Sample data:
df <- data.frame(
e2 = c(1967, 1977, 1988, 1955, 2000, 1969)
)
To get age, simply subtract df$e2 from 2009:
df$age <- 2009 - df$e2
df
e2 age
1 1967 42
2 1977 32
3 1988 21
4 1955 54
5 2000 9
6 1969 40
If e2 is of type character, convert to type numeric:
df$age <- 2009 - as.numeric(df$e2)
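One related gotcha, in case e2 was read in as a factor (the old stringsAsFactors default): as.numeric() on a factor returns the internal level codes, not the years, so convert through as.character() first:

```r
df <- data.frame(e2 = factor(c("1967", "1977", "1988")))

# as.numeric(df$e2) alone would return the level codes 1, 2, 3
df$age <- 2009 - as.numeric(as.character(df$e2))
df$age
# 42 32 21
```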
I have a df that looks like this:
Year Subscribers Forecast AbsError
1 2006 23188171 0 0
2 2007 28745769 0 0
3 2008 34880964 0 0
4 2009 46373266 0 0
I have a loop that fills in the Forecast column and then it should subtract the subscriber value from the forecast value and put that number into the AbsError column, like so:
Year Subscribers Forecast AbsError
1 2006 23188171 9680000 13508171
2 2007 28745769 27960000 785769
3 2008 34880964 46240000 11359036
My Loop looks like this:
for (i in 1:nrow(new.phone)) {
  new.phone$Forecast[i] <- (1.828e+07 * new.phone$Year[i]) + (-3.666e+10)
  new.phone$AbsError <- abs(new.phone$Subscribers[i] - new.phone$Forecast[i])
}
Although this loop is giving the correct forecasted values, it's giving incorrect AbsError values, and I cannot figure out why. All the AbsError values are the same, but that is wrong. Any ideas why this is?
Thanks for the help!
You don't need a for loop to do that. This does what you need:
new.phone$Forecast <- ((1.828e+07) * new.phone$Year) + (-3.666e+10)
new.phone$AbsError <- abs(new.phone$Subscribers - new.phone$Forecast)
You were just missing the index in the second line of the loop. It should be new.phone$AbsError[i] <- [...], not new.phone$AbsError <- [...]. Without the [i], each iteration overwrites the entire column with that row's value, so you end up with the last row's error repeated everywhere.
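For completeness, the corrected loop (with the data frame rebuilt from the question's values) would be:

```r
new.phone <- data.frame(
  Year        = 2006:2009,
  Subscribers = c(23188171, 28745769, 34880964, 46373266)
)

for (i in 1:nrow(new.phone)) {
  new.phone$Forecast[i] <- 1.828e+07 * new.phone$Year[i] - 3.666e+10
  # index AbsError with [i] too, so each row keeps its own value
  new.phone$AbsError[i] <- abs(new.phone$Subscribers[i] - new.phone$Forecast[i])
}

new.phone$AbsError
# 13508171   785769 11359036 18146734
```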
Anyway, you could skip the loop if you want:
new.phone$Forecast <- (1.828e+07) * new.phone$Year + (-3.666e+10)
new.phone$AbsError <- abs(new.phone$Subscribers - new.phone$Forecast)
new.phone
Year Subscribers Forecast AbsError
1 2006 23188171 9680000 13508171
2 2007 28745769 27960000 785769
3 2008 34880964 46240000 11359036
4 2009 46373266 64520000 18146734
Try this in dplyr:
require(dplyr)
k <- read.table(text = "Year Subscribers Forecast AbsError
1 2006 23188171 0 0
2 2007 28745769 0 0
3 2008 34880964 0 0
4 2009 46373266 0 0")
k %>%
  mutate(Forecast = (1.828e+07 * Year) + (-3.666e+10)) %>%
  mutate(AbsError = abs(Subscribers - Forecast))
Results:
Year Subscribers Forecast AbsError
1 2006 23188171 9680000 13508171
2 2007 28745769 27960000 785769
3 2008 34880964 46240000 11359036
4 2009 46373266 64520000 18146734
I'm doing some cluster analysis on the MLTobs data from the LifeTables package and have come across a tricky problem with the Year variable in the mlt.mx.info dataframe. Year contains the period over which the life table was taken, in intervals. Here's a table of the data:
1751-1754 1755-1759 1760-1764 1765-1769 1770-1774 1775-1779 1780-1784 1785-1789 1790-1794
1 1 1 1 1 1 1 1 1
1795-1799 1800-1804 1805-1809 1810-1814 1815-1819 1816-1819 1820-1824 1825-1829 1830-1834
1 1 1 1 1 2 3 3 3
1835-1839 1838-1839 1840-1844 1841-1844 1845-1849 1846-1849 1850-1854 1855-1859 1860-1864
4 1 5 3 8 1 10 11 11
1865-1869 1870-1874 1872-1874 1875-1879 1876-1879 1878-1879 1880-1884 1885-1889 1890-1894
11 11 1 12 2 1 15 15 15
1895-1899 1900-1904 1905-1909 1908-1909 1910-1914 1915-1919 1920-1924 1921-1924 1922-1924
15 15 15 1 16 16 16 2 1
1925-1929 1930-1934 1933-1934 1935-1939 1937-1939 1940-1944 1945-1949 1947-1949 1948-1949
19 19 1 20 1 22 22 3 1
1950-1954 1955-1959 1956-1959 1958-1959 1960-1964 1965-1969 1970-1974 1975-1979 1980-1984
30 30 2 1 40 40 41 41 41
1983-1984 1985-1989 1990-1994 1991-1994 1992-1994 1995-1999 2000-2003 2000-2004 2005-2006
1 42 42 1 1 44 3 41 22
2005-2007
14
As you can see, some of the intervals sit within other intervals. Thankfully none of them overlap. I want to simplify the intervals so intervals such as 1992-1994 and 1991-1994 all go into 1990-1994.
An idea might be to take the modulo of each interval's endpoints and sort them into their new intervals that way, but I'm unsure how to do this with the interval data type. If anyone has any ideas I'd really appreciate the help. Ultimately I want to create a histogram or barplot to illustrate this nicely.
If I understand your problem, you'll want something like this:
bottom <- seq(1750, 2010, 5)
library(dplyr)
new_df <- mlt.mx.info %>%
  arrange(Year) %>%
  mutate(year2 = as.numeric(substr(Year, 6, 9))) %>%
  mutate(new_year = paste0(bottom[findInterval(year2, bottom)], "-",
                           bottom[findInterval(year2, bottom) + 1] - 1))
View(new_df)
So what this does is create 5-year bins and output a new column (new_year) labelled by the bottom of the bin each interval's end year falls into. So everything ending within 1750-1754 will correspond to a new value of 1750-1754 (in string form; the original is an integer type, and I'm not sure how to fix that). Does this do what you want? Double check the results, but it looks right to me.
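Since mlt.mx.info needs the LifeTables package installed, the binning step can be checked on its own with a few made-up interval strings (the variable names here are illustrative):

```r
bottom <- seq(1750, 2010, 5)

year_labels <- c("1990-1994", "1991-1994", "1992-1994", "2000-2003")
end_year <- as.numeric(substr(year_labels, 6, 9))  # end year of each interval

idx <- findInterval(end_year, bottom)              # which 5-year bin it falls in
new_year <- paste0(bottom[idx], "-", bottom[idx + 1] - 1)
new_year
# "1990-1994" "1990-1994" "1990-1994" "2000-2004"
```

All three of the nested 199x-1994 intervals collapse into the same 1990-1994 bin, which is the behaviour the question asks for.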
I have this code in R :
corr = function(x, y) {
  sx = sign(x)
  sy = sign(y)
  cond_a = sx == sy && sx > 0 && sy > 0
  cond_b = sx < sy && sx < 0 && sy > 0
  cond_c = sx > sy && sx > 0 && sy < 0
  cond_d = sx == sy && sx < 0 && sy < 0
  cond_e = sx == 0 || sy == 0
  if (cond_a) return('a')
  else if (cond_b) return('b')
  else if (cond_c) return('c')
  else if (cond_d) return('d')
  else if (cond_e) return('e')
}
Its role is to be used in conjunction with the mapply function in R in order to count all the possible sign patterns present in a time series. In this case the pattern has a length of 2 and all the possible tuples are: (+,+), (+,-), (-,+), (-,-).
I use the corr function this way :
> with(dt['AAPL'], table(mapply(corr, Return[-1], Return[-length(Return)])) /length(Return)*100)
a b c d e
24.6129416 25.4466058 25.4863041 24.0174672 0.3969829
> dt["AAPL",list(date, Return)]
symbol date Return
1: AAPL 2014-08-29 -0.3499903
2: AAPL 2014-08-28 0.6496702
3: AAPL 2014-08-27 1.0987923
4: AAPL 2014-08-26 -0.5235654
5: AAPL 2014-08-25 -0.2456037
I would like to generalize the corr function to n arguments. This means that for every n I would have to write down all the conditions corresponding to all the possible n-tuples. Currently the best thing I can think of is to make a Python script that writes the code string using loops, but there must be a way to do this properly. Do you have an idea about how I could generalize the tedious condition writing? Maybe I could try to use expand.grid, but how would I do the matching then?
I think you're better off using rollapply(...) in the zoo package for this. Since you seem to be using quantmod anyway (which loads xts and zoo), here is a solution that does not use all those nested if(...) statements.
library(quantmod)
AAPL <- getSymbols("AAPL",auto.assign=FALSE)
AAPL <- AAPL["2007-08::2009-03"] # AAPL during the crash...
Returns <- dailyReturn(AAPL)
get.patterns <- function(ret, n) {
  f <- function(x) {  # identifies which row of `patterns` matches sign(x)
    which(apply(patterns, 1, function(row) all(row == sign(x))))
  }
  returns  <- na.omit(ret)
  patterns <- expand.grid(rep(list(c(-1, 1)), n))
  labels   <- apply(patterns, 1, function(row) paste0("(", paste(row, collapse = ","), ")"))
  result   <- rollapply(returns, width = n, f, align = "left")
  data.frame(100 * table(labels[result]) / (length(returns) - (n - 1)))
}
get.patterns(Returns,n=2)
# Var1 Freq
# 1 (-1,-1) 22.67303
# 2 (-1,1) 26.49165
# 3 (1,-1) 26.73031
# 4 (1,1) 23.15036
get.patterns(Returns,n=3)
# Var1 Freq
# 1 (-1,-1,-1) 9.090909
# 2 (-1,-1,1) 13.397129
# 3 (-1,1,-1) 14.593301
# 4 (-1,1,1) 11.722488
# 5 (1,-1,-1) 13.636364
# 6 (1,-1,1) 13.157895
# 7 (1,1,-1) 12.200957
# 8 (1,1,1) 10.765550
The basic idea is to create a patterns matrix with 2^n rows and n columns, where each row represents one of the possible patterns (e.g. (1,1), (-1,1), etc.). Then pass the daily returns to this function n-wise using rollapply(...) and identify which row in patterns matches sign(x) exactly. Then use this vector of row numbers as an index into labels, which contains a character representation of the patterns, and finally use table(...) as you did.
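For reference, this is the patterns matrix and the labels it produces for n = 2 (expand.grid varies the first column fastest):

```r
n <- 2
patterns <- expand.grid(rep(list(c(-1, 1)), n))
labels <- apply(patterns, 1, function(row) paste0("(", paste(row, collapse = ","), ")"))

patterns
#   Var1 Var2
# 1   -1   -1
# 2    1   -1
# 3   -1    1
# 4    1    1

labels  # the four tuples as strings: "(-1,-1)", "(1,-1)", "(-1,1)", "(1,1)"
```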
This is general for an n-day pattern, but it ignores situations where any return is exactly zero, so the Freq column does not add up to exactly 100. As you can see, this doesn't happen very often.
It's interesting that even during the crash it was (very slightly) more likely to have two up days in succession, than two down days. If you look at plot(Cl(AAPL)) during this period, you can see that it was a pretty wild ride.
This is a little different approach, but it may give you what you're looking for and allows you to use any size of n-tuple. The basic approach is to find the signs of the adjacent changes for each sequential set of n returns, and convert the n-length sign changes into n-tuples of 1's and 0's, where 0 = negative return and 1 = positive return. Then calculate the decimal value of each n-tuple taken as a binary number; these numbers will clearly be different for each distinct n-tuple. Using a zoo time series for these calculations provides several useful functions, including get.hist.quote() to retrieve stock prices, diff() to calculate returns, and rollapply() to use in calculating the n-tuples and their sums. The code below does these calculations, converts the sum of the sign changes back to n-tuples of binary digits, and collects the results in a data frame.
library(zoo)
library(tseries)
n <- 3 # set size of n-tuple
#
# get stock prices and compute % returns
#
dtz <- get.hist.quote("AAPL","2014-01-01","2014-10-01", quote="Close")
dtz <- merge(dtz, (diff(dtz, arithmetic=FALSE ) - 1)*100)
names(dtz) <- c("prices","returns")
#
# calculate the sum of the sign changes
#
dtz <- merge(dtz, rollapply(data = (sign(dtz$returns) + 1)/2, width = n,
                            FUN = function(x, y) sum(x*y), y = 2^(0:(n-1)),
                            align = "right"))
dtz <- fortify.zoo(dtz)
names(dtz) <- c("date","prices","returns", "sum_sgn_chg")
#
# convert the sum of the sign changes back to an n-tuple of binary digits
#
for (i in 1:nrow(dtz))
  dtz$sign_chg[i] <- paste(((as.numeric(dtz$sum_sgn_chg[i]) %/% 2^(0:(n-1))) %% 2), collapse = "")
#
# report first part of result
#
head(dtz, 10)
#
# report count of changes by month and type
#
table(format(dtz$date,"%Y %m"), dtz$sign_chg)
An example of possible output is a table showing the count of changes by type for each month.
000 001 010 011 100 101 110 111 NANANA
2014 01 1 3 3 2 3 2 2 2 3
2014 02 1 2 4 2 2 3 2 3 0
2014 03 2 3 0 4 4 1 4 3 0
2014 04 2 3 2 3 3 2 3 3 0
2014 05 2 2 1 3 1 2 3 7 0
2014 06 3 4 3 2 4 1 1 3 0
2014 07 2 1 2 4 2 5 5 1 0
2014 08 2 2 1 3 1 2 2 8 0
2014 09 0 4 2 3 4 2 4 2 0
2014 10 0 0 1 0 0 0 0 0 0
so this would show that in month 1, January of 2014, there was one set of three days with 000, indicating 3 down returns; 3 sets with the 001 pattern, indicating two down returns followed by one positive return; and so forth. Most months seem to have a fairly random distribution, but May and August show 7 and 8 sets of 3 days of positive returns, reflecting the fact that these were strong months for AAPL.
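The encode/decode steps from this answer can also be checked in isolation on a single hand-picked sign tuple (the first day maps to the low bit, so the decoded string reads first day leftmost):

```r
n <- 3
signs <- c(-1, 1, 1)                 # day 1 down, days 2-3 up
bits  <- (signs + 1) / 2             # map -1/+1 to 0/1: gives 0 1 1
value <- sum(bits * 2^(0:(n - 1)))   # day 1 is the low bit: 0 + 2 + 4 = 6

# decode the sum back into the digit string, first day leftmost
decoded <- paste((value %/% 2^(0:(n - 1))) %% 2, collapse = "")
decoded
# "011"
```

So a tuple like "011" in the table corresponds to one down day followed by two up days, matching the interpretation given above.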