Age calculation for observation data in R [duplicate] - r

This question already has answers here:
Return date range by group
(3 answers)
Closed 3 years ago.
I have a very simple but large observation data set, hypothetically structured as below:
> df = data.frame(ID = c("oak", "birch", rep("oak",2), "pine", "birch", "oak", rep("pine",2), "birch", "oak"),
+ yearobs = c(rep(1998,3), rep(1999,2), rep(2000,3),rep(2001,2), 2002))
> df
ID yearobs
1 oak 1998
2 birch 1998
3 oak 1998
4 oak 1999
5 pine 1999
6 birch 2000
7 oak 2000
8 pine 2000
9 pine 2001
10 birch 2001
11 oak 2002
What I want to do is calculate the age as the difference between the years ( max(yearobs)-min(yearobs) ) for each unique ID (tree species in this example). I have tried the lubridate and dplyr packages, but the number of observations per ID varies in my data, and I want to create an age column as fast as possible without storing the minimum and maximum values separately (and without for loops, since my data is huge).
Desired output:
ID age
1 oak 4
2 birch 3
3 pine 2
Any suggestion would be appreciated.

In base R you can do:
aggregate(yearobs ~ ID, data = df, FUN = function(x) max(x) - min(x))
# ID yearobs
# 1 birch 3
# 2 oak 4
# 3 pine 2

One option is to group by 'ID' and take the difference between the max and the min of the 'yearobs' column, using dplyr:
library(dplyr)
df %>%
group_by(ID) %>%
summarise(age = max(yearobs) - min(yearobs))
Also, if we need to do this fast, then data.table would be another option
library(data.table)
setDT(df)[, .(age = max(yearobs) - min(yearobs)), by = ID]
Or using base R
by(df['yearobs'], df$ID, FUN = function(x) max(x)- min(x))
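As a quick cross-check of the approaches above, here is a minimal base R sketch that rebuilds the example data and computes the span per species with tapply. The names ages and result are only illustrative, and diff(range(x)) is used as an equivalent of max(x) - min(x).

```r
# Example data from the question
df <- data.frame(
  ID = c("oak", "birch", rep("oak", 2), "pine", "birch", "oak",
         rep("pine", 2), "birch", "oak"),
  yearobs = c(rep(1998, 3), rep(1999, 2), rep(2000, 3), rep(2001, 2), 2002)
)

# diff(range(x)) is max(x) - min(x); tapply applies it once per ID
ages <- tapply(df$yearobs, df$ID, function(x) diff(range(x)))

# Convert the named result into a plain two-column data frame
result <- data.frame(ID = names(ages), age = as.vector(ages))
result
```

Unlike the by() call, this returns an ordinary data frame that can be merged back onto the original data if needed.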

Related

Cumsum function step wise in R

I am facing one problem: I calculated the monthly interest for a mortgage, but I need to sum the results to get yearly values (always 12 months at a time).
H <- 2000000 # mortgage
i.m <- 0.03/12 # rate per month
year <- 15 # years
a <- (H*i.m*(1+i.m)^(12*year))/
((1+i.m)^(12*year)-1)
a # monthly payment
interest <- a*(1-(1/(1+i.m)^(0:(year*12))))
interest
cumsum(a*(1-(1/(1+i.m)^(0:(year*12))))) # first 12 values together and then next 12 values + first values and ... (I want to have for every year a value)
You may do this with tapply in base R.
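monthly <- a*(1-(1/(1+i.m)^(0:(year*12)))) # monthly values, not yet cumulated
yearly <- tapply(monthly, ceiling(seq_along(monthly)/12), sum) # total per block of 12 months
cumsum(yearly) # running total at the end of each block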
I think you can use the following solution:
monthly <- cumsum(a*(1-(1/(1+i.m)^(0:(year*12)))))
sapply(split(monthly, ceiling(seq_along(monthly) / 12)), function(x) x[length(x)])
1 2 3 4 5 6 7 8
2254.446 9334.668 21098.218 37406.855 58126.414 83126.695 112281.337 145467.712
9 10 11 12 13 14 15 16
182566.812 223463.138 268044.605 316202.434 367831.057 422828.023 481093.905 486093.905
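To make the difference between "total per year" and "running total at year end" explicit, here is a small sketch using the question's own numbers. The names grp, per_year, running and year_end are introduced here for illustration only. It keeps the question's sequence 0:(year*12), which has 181 elements, so a 16th group holding a single value appears.

```r
H <- 2000000        # mortgage
i.m <- 0.03 / 12    # rate per month
year <- 15          # years
a <- (H * i.m * (1 + i.m)^(12 * year)) / ((1 + i.m)^(12 * year) - 1)

# Monthly values exactly as defined in the question
interest <- a * (1 - (1 / (1 + i.m)^(0:(year * 12))))

# Group index: 1 for the first 12 values, 2 for the next 12, and so on
grp <- ceiling(seq_along(interest) / 12)

# Total within each block of 12 months
per_year <- tapply(interest, grp, sum)

# Running total at the end of each block, computed two ways
running  <- cumsum(as.vector(per_year))
year_end <- as.vector(tapply(cumsum(interest), grp, function(x) x[length(x)]))

all.equal(running, year_end)
```

Both routes give the same year-end figures, so the choice mostly depends on whether the per-year totals are also needed.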

How to add rows to dataframe R with rbind

I know this is a classic question and there are also similar ones in the archive, but I feel like the answers did not really apply to this case. Basically I want to take one dataframe (covid cases in Berlin per district), calculate the sum of the columns and create a new dataframe with a column representing the name of the district and another one representing the total number. So I wrote
covid_bln <- read.csv('https://www.berlin.de/lageso/gesundheit/infektionsepidemiologie-infektionsschutz/corona/tabelle-bezirke-gesamtuebersicht/index.php/index/all.csv?q=', sep=';')
c_tot<-data.frame('district'=c(), 'number'=c())
for (n in colnames(covid_bln[3:14])){
x<-data.frame('district'=c(n), 'number'=c(sum(covid_bln$n)))
c_tot<-rbind(c_tot, x)
next
}
print(c_tot)
This works properly for the names, but it returns the number of cases of the 8th district for every district. Any suggestion, even one involving other functions, would be great. Thank you.
Here's a base R solution:
number <- colSums(covid_bln[3:14])
district <- names(covid_bln[3:14])
c_tot <- cbind.data.frame(district, number)
# If you don't want rownames:
rownames(c_tot) <- NULL
This gives us:
district number
1 mitte 16030
2 friedrichshain_kreuzberg 10679
3 pankow 10849
4 charlottenburg_wilmersdorf 10664
5 spandau 9450
6 steglitz_zehlendorf 9218
7 tempelhof_schoeneberg 12624
8 neukoelln 14922
9 treptow_koepenick 6760
10 marzahn_hellersdorf 6960
11 lichtenberg 7601
12 reinickendorf 9752
I want to provide a solution using the tidyverse.
The final result is ordered alphabetically by district.
library(dplyr)
library(tidyr)
c_tot <- covid_bln %>%
select( mitte:reinickendorf) %>%
gather(district, number, mitte:reinickendorf) %>%
group_by(district) %>%
summarise(number = sum(number))
The result is
# A tibble: 12 x 2
district number
* <chr> <int>
1 charlottenburg_wilmersdorf 10736
2 friedrichshain_kreuzberg 10698
3 lichtenberg 7644
4 marzahn_hellersdorf 7000
5 mitte 16064
6 neukoelln 14982
7 pankow 10885
8 reinickendorf 9784
9 spandau 9486
10 steglitz_zehlendorf 9236
11 tempelhof_schoeneberg 12656
12 treptow_koepenick 6788
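A likely explanation for the loop in the question returning the 8th district's total every time: covid_bln$n does not use the loop variable n. The $ operator looks for a column literally called "n", and on a plain data frame it partially matches the only column name starting with "n" (neukoelln). The sketch below shows this on a made-up toy data frame (the values are invented, not the Berlin data) and uses [[n]] to look up the column named by the variable.

```r
# Toy data, invented for illustration only
toy <- data.frame(mitte = c(1, 2), pankow = c(3, 4), neukoelln = c(5, 6))

n <- "mitte"
toy$n      # partial match on the literal name "n": the neukoelln column
toy[[n]]   # uses the value of n: the mitte column

# Loop-free version of the question's idea, using [[ ]] for the lookup
c_tot <- data.frame(
  district = names(toy),
  number = unname(vapply(names(toy), function(n) sum(toy[[n]]), numeric(1)))
)
c_tot
```

This is essentially what colSums() does in the base R answer above, which is shorter and avoids growing a data frame with rbind inside a loop.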

dplyr, filter if both values are above a number [duplicate]

This question already has answers here:
dplyr filter with condition on multiple columns
(6 answers)
Closed 2 years ago.
I have a data set like this:
df = data.frame(Business = c('HR','HR','Finance','Finance','Legal','Legal','Research'), Country = c('Iceland','Iceland','Norway','Norway','US','US','France'), Gender=c('Female','Male','Female','Male','Female','Male','Male'), Value =c(10,5,20,40,10,20,50))
I need to keep only the rows of groups where both the male value and the female value are >= 10. For example, Iceland HR should be removed (its male value is 5), as well as Research France (it has no female value).
I've tried df %>% group_by(Business,Country) %>% filter((Value>=10)), but this only drops the individual rows with a value below 10. Any ideas?
Maybe this can help:
library(dplyr)
library(reshape2)
# Base reshape() to wide format: one row per Business/Country, one Value column per Gender
df2 <- reshape(df,idvar = c('Business','Country'),timevar = 'Gender',direction = 'wide')
df2 %>% mutate(Index=ifelse(Value.Female>=10 & Value.Male>=10,1,0)) %>%
filter(Index==1) -> df3
df4 <- reshape2::melt(df3[,-5],id.vars=c('Business','Country'))
Business Country variable value
1 Finance Norway Value.Female 20
2 Legal US Value.Female 10
3 Finance Norway Value.Male 40
4 Legal US Value.Male 20
You could just use two ave steps, one with length, one with min.
df <- df[with(df, ave(Value, Country, FUN=length)) == 2, ]
df[with(df, ave(Value, Country, FUN=min)) >= 10, ]
# Business Country Gender Value
# 3 Finance Norway Female 20
# 4 Finance Norway Male 40
# 5 Legal US Female 10
# 6 Legal US Male 20
Notice that this also works if we shuffle the rows of the data frame.
set.seed(42)
df2 <- df[sample(1:nrow(df)), ]
df2 <- df2[with(df2, ave(Value, Country, FUN=length)) == 2, ]
df2[with(df2, ave(Value, Country, FUN=min)) >= 10, ]
# Business Country Gender Value
# 5 Legal US Female 10
# 6 Legal US Male 20
# 3 Finance Norway Female 20
# 4 Finance Norway Male 40
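Since the question asks for dplyr specifically, here is a minimal sketch of a grouped filter that stays in long format. It assumes a group should be kept only when it contains both a Female and a Male row and every Value in it is at least 10; the name kept is illustrative.

```r
library(dplyr)

df <- data.frame(
  Business = c('HR', 'HR', 'Finance', 'Finance', 'Legal', 'Legal', 'Research'),
  Country  = c('Iceland', 'Iceland', 'Norway', 'Norway', 'US', 'US', 'France'),
  Gender   = c('Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male'),
  Value    = c(10, 5, 20, 40, 10, 20, 50)
)

kept <- df %>%
  group_by(Business, Country) %>%
  # Both conditions are evaluated once per group and recycled to its rows
  filter(all(c("Female", "Male") %in% Gender), all(Value >= 10)) %>%
  ungroup()

kept
```

Grouping by both Business and Country follows the question's own attempt; the ave() answer above groups by Country only, which gives the same result on this sample.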

How to compute the daily average from hourly values?

I have a text file consisting of 6 columns, as shown below. The measurements are taken every 30 minutes over several years (2001-2013). I want to compute the daily average: for example, for 2001, take all values that correspond to the first day (1) and compute their average, then do the same for every day in that year and for every year in the file.
To read the file:
LR=read.table("C:\\Users\\dat.txt", sep ='', header =TRUE)
header:
head(LR)
Year day hour mint valu1 valu2
1 2001 1 5 30 0 0
2 2001 1 6 0 1 0
3 2001 1 6 30 2 0
4 2001 1 7 0 0 7
5 2001 1 7 30 5 8
6 2001 1 8 0 0 0
Try:
library(plyr)
ddply(LR, .(Year, day), summarize, val = mean(valu1))
And another less elegant option:
LR$n <- paste(LR$Year, LR$day, sep="-")
tapply(LR$valu1, LR$n, FUN=mean)
If you want to select a certain range of years use subset:
dat <- ddply(LR, .(Year, day), summarize, val = mean(valu1))
subset(dat, Year > 2003 & Year < 2005)
You can try aggregate:
res <- aggregate(LR, by = list(paste0(LR$Year, LR$day)), FUN = mean)
## You can remove the extra columns if you want
res[, -c(1,4,5)]
Or as Michael Lawrence suggests, using the formula interface:
aggregate(cbind(valu1, valu2) ~ Year + day, LR, mean)
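For readers without the original file, here is a self-contained sketch of the formula interface on a small made-up data frame with the same column names as LR. The values are invented for illustration; only the column layout follows the question.

```r
# Invented half-hourly readings: two days in 2001 and one day in 2002
LR <- data.frame(
  Year  = c(2001, 2001, 2001, 2001, 2002, 2002),
  day   = c(1, 1, 2, 2, 1, 1),
  hour  = c(5, 6, 5, 6, 5, 6),
  mint  = c(30, 0, 30, 0, 30, 0),
  valu1 = c(0, 2, 4, 6, 1, 3),
  valu2 = c(0, 0, 7, 9, 2, 2)
)

# One row per Year/day combination, with the mean of each value column
daily <- aggregate(cbind(valu1, valu2) ~ Year + day, data = LR, FUN = mean)

# Sort chronologically for readability
daily <- daily[order(daily$Year, daily$day), ]
daily
```

Keeping Year and day as separate grouping columns avoids having to paste them into a single key and split them again afterwards.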

How to calculate time-weighted average and create lags

I have searched the forum but found nothing that answers, or hints at, what I want to do.
I have yearly measurements of exposure data, from which I wish to calculate an individual-level annual average based on each individual's entry into the study. For each row, the one-year exposure assignment should include data from the preceding 12 months, starting from the last month before joining the study.
As an example, the first person in the sample data joined the study on Feb 7, 2002. His exposure will include a contribution from January 2002 (annual average is 18) and from February to December 2001 (annual average is 19). The time-weighted average for this person would be (1/12*18) + (11/12*19). The two-year average exposure for the same person would extend back from January 2002 to February 2000.
Similarly, the last person, who joined the study in December 2004, will have a contribution of 11 months in 2004 and one month in 2003, so his annual average exposure will be (11/12*5), derived from 2004, plus (1/12*6), which comes from the annual average of 2003.
How can I calculate the 1-, 2- and 5-year average exposure going back from the date of entry into the study? How can I use lags in the manner that I have described?
The sample data can be accessed from this link:
https://drive.google.com/file/d/0B_4NdfcEvU7La1ZCd2EtbEdaeGs/view?usp=sharing
This is not an elegant answer, but I would like to leave what I tried. I first rearranged the data frame so that I could identify the key year for each subject. To do that, I created id. variable comes from the column names (e.g., pol_2000) in your original data set; entryYear and entryMonth both come from entry. check was created to identify which year is the base year for each participant. Next, I extracted six rows for each participant using getMyRows from the SOfun package. Then I used lapply and did the math as you described in your question. For the two- and five-year averages, I divided the total by the number of years (2 or 5). I was not sure what the final output should look like, so I decided to use the base-year row for each subject and added three columns to it.
library(stringi)
library(SOfun)
devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
library(reshape2) # provides melt(), used below
### Big thanks to BondedDust for this function
### http://stackoverflow.com/questions/6987478/convert-a-month-abbreviation-to-a-numeric-month-in-r
mo2Num <- function(x) match(tolower(x), tolower(month.abb))
### Arrange the data frame.
ana <- foo %>%
mutate(id = 1:n()) %>%
melt(id.vars = c("id","entry")) %>%
arrange(id) %>%
mutate(variable = as.numeric(gsub("^.*_", "", variable)),
entryYear = as.numeric(stri_extract_last(entry, regex = "\\d+")),
entryMonth = mo2Num(substr(entry, 3,5)) - 1,
check = ifelse(variable == entryYear, "Y", "N"))
### Find a base year for each subject and get some parts of data for each participant.
indx <- which(ana$check == "Y")
bob <- getMyRows(ana, pattern = indx, -5:0)
### Get one-year average
cathy <- lapply(bob, function(x){
x$one <- ((x[6,6] / 12) * x[6,4]) + (((12-x[5,6])/12) * x[5,4])
x
})
one <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Get two-year average
cathy <- lapply(bob, function(x){
x$two <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + (((12-x[4,6])/12) * x[4,4])) / 2
x
})
two <- unnest(lapply(cathy, `[`, i = 6, j =8))
### Get five-year average
cathy <- lapply(bob, function(x){
x$five <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + x[4,4] + x[3,4] + x[2,4] + (((12-x[2,6])/12) * x[1,4])) / 5
x
})
five <- unnest(lapply(cathy, `[`, i =6 , j =8))
### Combine the results with the key observations
final <- cbind(ana[which(ana$check == "Y"),], one, two, five)
colnames(final) <- c(names(ana), "one", "two", "five")
# id entry variable value entryYear entryMonth check one two five
#6 1 07feb2002 2002 18 2002 1 Y 18.916667 18.500000 18.766667
#14 2 06jun2002 2002 16 2002 5 Y 16.583333 16.791667 17.150000
#23 3 16apr2003 2003 14 2003 3 Y 15.500000 15.750000 16.050000
#31 4 26may2003 2003 16 2003 4 Y 16.666667 17.166667 17.400000
#39 5 11jun2003 2003 13 2003 5 Y 13.583333 14.083333 14.233333
#48 6 20feb2004 2004 3 2004 1 Y 3.000000 3.458333 3.783333
#56 7 25jul2004 2004 2 2004 6 Y 2.000000 2.250000 2.700000
#64 8 19aug2004 2004 4 2004 7 Y 4.000000 4.208333 4.683333
#72 9 19dec2004 2004 5 2004 11 Y 5.083333 5.458333 4.800000
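The one-year weighting can be checked by hand against the first and last rows of the output. The helper below is hypothetical (it is not part of the answer's code) and simply restates the formula from the question: the months falling in the entry year are weighted by that year's average, and the remaining months by the previous year's average.

```r
# Hypothetical helper for illustration, not part of the answer above.
# months_current: number of months that fall in the entry year
#                 (the entryMonth column in the output above)
# avg_current, avg_previous: annual averages of the entry year and the year before
one_year_twa <- function(months_current, avg_current, avg_previous) {
  (months_current / 12) * avg_current +
    ((12 - months_current) / 12) * avg_previous
}

one_year_twa(1, 18, 19)   # first person (entry 07feb2002): 1/12*18 + 11/12*19
one_year_twa(11, 5, 6)    # last person (entry 19dec2004): 11/12*5 + 1/12*6
```

These two calls reproduce the values 18.916667 and 5.083333 shown in the one column for ids 1 and 9.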
