Different age calculation for different rows - r

I'm an absolute R beginner here working on a Master's project.
I have a data.frame that contains information on trotting horses (their wins, earnings, time records and such). The data is organised so that every row contains information for a specific year the horse competed, plus a first row for each horse labelled "Total", which summarises every variable over its total competing life. It looks like this:
I created a new variable with their age using the age_calc function in the eeptools package:
travdata$Age<-age_calc(as.Date(travdata$Birth.date), enddate=as.Date("2016-12-31"),
units="years")
This works without problems. What I'm trying to figure out is whether I can calculate the age of the horses for each specific year I have info on them; that is, the "Total" row would have their age as of 2016-12-31, the 2015 row would have their age at that time, and so on. I've been trying to include if statements in age_calc but it won't work, and I'm really at a loss as to how best to do this.
Any literature or help you could point me to would be much, much appreciated.
MWE
travdata <- data.frame(
"Id.Number"=c(rep("1938-98",3),rep("1803-97",7),rep("1221-03",4)),
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Sex"=c(rep("Mare",3),rep("Gelding",7),rep("Gelding",4)),
"Birth.year"=c(rep(1998,3),rep(1997,7),rep(2003,4)),
"Birth.date"=c(rep("1998-07-01",3),rep("1997-07-14",7),rep("2003-05-07",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"starts"=c(20,11,9,44,21,6,7,5,3,2,1,1,4,2),
"X1st.placements"=c(0,0,0,3,3,0,0,0,0,0,0,0,0,0),
"X2nd.placements"=c(2,2,0,1,0,1,0,0,0,0,0,0,0,0),
"X3rd.placements"=c(2,2,0,1,1,0,0,0,0,0,0,0,0,0),
"Earnings.euro"=c(1525,1425,100,2078,1498,580,0,0,0,0,0,0,10,10)
)

The trick is to filter out the "Total" rows and specify a format for the as.Date() function:
library(eeptools)
travdata <- data.frame(
"Id.Number"=c(rep("1938-98",3),rep("1803-97",7),rep("1221-03",4)),
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Sex"=c(rep("Mare",3),rep("Gelding",7),rep("Gelding",4)),
"Birth.year"=c(rep(1998,3),rep(1997,7),rep(2003,4)),
"Birth.date"=c(rep("1998-07-01",3),rep("1997-07-14",7),rep("2003-05-07",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"starts"=c(20,11,9,44,21,6,7,5,3,2,1,1,4,2),
"X1st.placements"=c(0,0,0,3,3,0,0,0,0,0,0,0,0,0),
"X2nd.placements"=c(2,2,0,1,0,1,0,0,0,0,0,0,0,0),
"X3rd.placements"=c(2,2,0,1,1,0,0,0,0,0,0,0,0,0),
"Earnings.euro"=c(1525,1425,100,2078,1498,580,0,0,0,0,0,0,10,10)
)
travdata$Age<-age_calc(as.Date(travdata$Birth.date),
enddate=as.Date("2016-12-31"), units="years")
competitions <- travdata[travdata$Competition.year!="Total",]
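# Note: as.Date() with format="%Y" fills in today's month and day for each competition year
competitions$Competition.age<-age_calc(
as.Date(competitions$Birth.date),
enddate=as.Date(competitions$Competition.year, format="%Y"),
units="years", precise=FALSE)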
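For comparison, here is a minimal base R sketch of the same idea that does not rely on eeptools. It assumes the age should be taken at 31 December of each competition year, matching the 2016-12-31 cut-off the question uses for the "Total" rows; that cut-off for the yearly rows is an assumption, not something the question states. The small `horses` data frame is a hypothetical miniature of `travdata`.

```r
# Hypothetical miniature of travdata (same column names as in the question)
horses <- data.frame(
  Birth.date = c("1998-07-01", "1998-07-01", "1997-07-14"),
  Competition.year = c("Total", "2005", "2003"),
  stringsAsFactors = FALSE
)

# Assumed reference date: 2016-12-31 for "Total" rows,
# 31 December of the competition year for all other rows
end_year <- ifelse(horses$Competition.year == "Total",
                   "2016", horses$Competition.year)
end_date <- as.Date(paste0(end_year, "-12-31"))
birth <- as.Date(horses$Birth.date)

# Completed years: difference in calendar years, minus one
# if the birthday has not yet been reached in the end year
horses$Age <- as.integer(format(end_date, "%Y")) -
  as.integer(format(birth, "%Y")) -
  (format(end_date, "%m%d") < format(birth, "%m%d"))

print(horses)
```

Because the "Total" rows are handled by `ifelse()`, no rows need to be filtered out first, so the result can stay in the original data frame.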

Related

Average after 2 group_by's in R

I am new to R and can't find the right syntax for a specific average I need. I have a large Fitbit dataset of heart rate per second for 30 people, for a month each. I want an average heart rate per day per person to make the data easier to manage and join with other Fitbit data.
First few lines of Data
The columns I have are Id (person Id#), Time (Date-Time), and Value (Heartrate). I already separated Time into two columns, one for date and one for time only. My idea is to group the information by person, then by date and get one average number per person per day. But, my code is not doing that.
hr_avg <- hr_per_second %>% group_by(Id) %>% group_by(Date) %>% summarize(mean(Value))
As a result I get an average by date only. I can't do this manually because the dataset is so big, Excel can't open it. And I can't upload it to BigQuery either, the database I learned to use during my data analysis course. Thanks.
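In dplyr, a second `group_by()` call replaces the earlier grouping by default, which is why only `Date` survives; passing both columns in one call, `group_by(Id, Date)`, keeps both. The same per-person, per-day average can be sketched in base R as below. The data values are made up for illustration; only the column names `Id`, `Date` and `Value` come from the question.

```r
# Hypothetical stand-in for hr_per_second (made-up values)
hr_per_second <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  Date = c("2016-04-12", "2016-04-12", "2016-04-13",
           "2016-04-12", "2016-04-12"),
  Value = c(60, 80, 90, 100, 110),
  stringsAsFactors = FALSE
)

# One mean per combination of Id and Date
hr_avg <- aggregate(Value ~ Id + Date, data = hr_per_second, FUN = mean)
print(hr_avg)
```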

How to create a ''for loop'' to download 5 consecutive months of data?

For an assignment we are supposed to use a for-loop to obtain a dataframe of 5 consecutive months.
The data regards crimes and their accompanying type of crime, location, month, street name etc.
How do we go about this issue?
We use the package 'ukpolice' and use this code to obtain data for a specific month and location of choice;
ukpolice syntax is as follows:
data <- as.data.frame(ukc_crime_location(lat = , lng = , date = ""))
Thank you in advance!
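One possible shape for such a loop is sketched below. To keep it runnable without network access, `fetch_month()` is a hypothetical stand-in for the real download; in practice its body would be the `as.data.frame(ukc_crime_location(lat = , lng = , date = ))` call from the question with the coordinates filled in. The "YYYY-MM" date format and the start month are assumptions.

```r
# Hypothetical stand-in for the real download. In practice this would wrap
# as.data.frame(ukc_crime_location(lat = ..., lng = ..., date = date)).
fetch_month <- function(date) {
  data.frame(month = date,
             category = c("burglary", "robbery"),
             stringsAsFactors = FALSE)
}

# Five consecutive months as "YYYY-MM" strings (assumed date format)
months <- format(seq(as.Date("2020-01-01"), by = "month", length.out = 5),
                 "%Y-%m")

# Collect one data frame per month, then stack them
monthly <- vector("list", length(months))
for (i in seq_along(months)) {
  monthly[[i]] <- fetch_month(months[i])
}
crimes <- do.call(rbind, monthly)
print(crimes)
```

Storing each month in a pre-allocated list and binding once at the end avoids growing a data frame inside the loop.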

Column operators regarding only specific columns (specific dates and code i.e.) in R

I am trying to calculate the average relative humidity of the city of Seoul for the dates 2020-01-01 to 2020-01-31.
I have this data:
I've tried this already but don't really know what's missing.
Seoul_weather_dt <- Corona_relevant_weather_dt[, avg_relative_humidity_seoul := mean(avg_relative_humidity[code =="2020-01-01":"2020-01-01"]), by = c("province", "date", "avg_temp", "avg_relative_humidity"]
Can someone help me?
Something like this?
# select only Seoul and the relevant dates (both end dates included)
Seoul_weather_dt <- Corona_relevant_weather_dt[province == "Seoul" & date >= as.Date("2020-01-01") & date <= as.Date("2020-01-31")]
# calculate average humidity for each unique date
aggregate(Seoul_weather_dt$avg_relative_humidity, by = list(Seoul_weather_dt$date), FUN = mean)
The line of code you provide is pretty long. I would suggest splitting it into multiple lines with fewer functions per line to keep an overview (this also makes errors easier to track down). Also:
Is date of class "Date"? You can check that using str(Seoul_weather_dt).
code == "2020-01-01":"2020-01-01" only selects one day.
Using by = c("province", "date", "avg_temp", "avg_relative_humidity") is strange. You would then calculate a mean value for each observation of avg_relative_humidity as well, which is not what you want.
Why create average values for each province when you are only interested in Seoul?
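If a single average over the whole of January is wanted, rather than one value per date, the points above can be sketched in base R as follows. The `weather` data frame and its values are hypothetical; only the column names `province`, `date` and `avg_relative_humidity` come from the question, and the dates are assumed to arrive as "YYYY-MM-DD" strings.

```r
# Hypothetical stand-in for Corona_relevant_weather_dt (made-up values)
weather <- data.frame(
  province = c("Seoul", "Seoul", "Seoul", "Busan"),
  date = c("2020-01-01", "2020-01-31", "2020-02-01", "2020-01-15"),
  avg_relative_humidity = c(40, 60, 90, 70),
  stringsAsFactors = FALSE
)

# Make sure the date column is of class "Date" before comparing
weather$date <- as.Date(weather$date)

# Keep Seoul rows between 2020-01-01 and 2020-01-31, both ends included
in_january <- weather$date >= as.Date("2020-01-01") &
  weather$date <= as.Date("2020-01-31")
seoul_january <- weather[weather$province == "Seoul" & in_january, ]

# One average for the whole period
avg_relative_humidity_seoul <- mean(seoul_january$avg_relative_humidity)
print(avg_relative_humidity_seoul)
```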

How do I stop the number of observations coming up when trying to tabulate a variable?

I'm very new to R and have run into a problem while working on the code for a stats project. I have attached the .csv file below for reference, but essentially I would like to plot the years 2018, 2019 and 2020 against the sum of international arrivals ("Int_Pax_In" in the file) for the first 6 months of each year, for "All Australian Airports". So I will have 3 bars in my plot, one each for 2018, 2019 and 2020, with the y-axis labelled "All Australian Arrivals". The problem is that I wanted to start with a simple line of code to tabulate the "Year" variable, before even trying to achieve the final result, and simply putting in:
info=read.csv("mon_pax_web.csv")
table(info$Year)
doesn't give me any information. It simply gives me the number of observations for each year instead of anything else. Below is a screenshot of what I get:
Screenshot 1
info=read.csv("mon_pax_web.csv")
str(info)
table(info$Year)
I also tried changing my variables apart from "Year" into as.character and Month into factor but that had no effect as shown below:
Screenshot 2
info=read.csv("mon_pax_web.csv")
info$AIRPORT=as.character(info$AIRPORT)
info$Month=as.factor(info$Month)
info$Dom_Pax_In=as.character(info$Dom_Pax_In)
info$Dom_Pax_Out=as.character(info$Dom_Pax_Out)
info$Dom_Pax_Total=as.character(info$Dom_Pax_Total)
info$Int_Pax_Out=as.character(info$Int_Pax_Out)
info$Int_Pax_Total=as.character(info$Int_Pax_Total)
info$Pax_In=as.character(info$Pax_In)
info$Pax_Out=as.character(info$Pax_Out)
info$Pax_Total=as.character(info$Pax_Total)
info$Int_Pax_In=as.character(info$Int_Pax_In)
str(info)
table(info$Year)
I'm only allowed to use Base R for this project so would appreciate it a lot if people could help me out and if you do, provide coding using Base R so I could follow along. Just require some pointers so I could get started.
CSV File for reference
Thank you.
The column info$Year is just a vector of years, so when you do table(info$Year) it only shows the number of entries for each year, because that's what you have asked for. If I gave you the following years: 2011, 2011, 2012 and 2013, and asked you to tabulate the years without giving you any other information, all you could do is count the number of instances of each year. Presumably, this is not what you meant.
I'm guessing what you're trying to do is to get the sum of Int_Pax_In per year. First you should filter so that you only include the years of interest, the months of interest, and the rows that represent all Australian airports. You can do this using subset:
df <- subset(info, Year > 2017 & Month < 7 & AIRPORT == "All Australian Airports")
Now we can use tapply to find the sum for each year:
plot_table <- tapply(df$Int_Pax_In, df$Year, sum)
Finally, we use barplot to create the bar graph you wanted:
barplot(plot_table, main = "Arrivals at all Australian airports January - June")
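The difference between counting rows and summing a column can be seen on a small made-up example. The `info` data frame below is a hypothetical stand-in for the CSV file, assuming Year, Month and Int_Pax_In are numeric; only the column names and the "All Australian Airports" label come from the question.

```r
# Hypothetical stand-in for mon_pax_web.csv (made-up values)
info <- data.frame(
  AIRPORT = rep(c("All Australian Airports", "SYDNEY"), each = 6),
  Year = rep(c(2018, 2018, 2019, 2019, 2020, 2020), times = 2),
  Month = rep(c(1, 7), times = 6),
  Int_Pax_In = c(100, 50, 120, 60, 80, 10, 40, 20, 50, 25, 30, 5),
  stringsAsFactors = FALSE
)

# table() only counts how many rows carry each year
year_counts <- table(info$Year)
print(year_counts)

# Filter first, then sum Int_Pax_In within each year
df <- subset(info, Year > 2017 & Month < 7 &
               AIRPORT == "All Australian Airports")
plot_table <- tapply(df$Int_Pax_In, df$Year, sum)
print(plot_table)
```

If Int_Pax_In has been converted to character, as in the second attempt in the question, `sum` will fail, so the column would need to be numeric before this step.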

R - how to create lagged variables by id, day, assessment nr and specifying the interval

Maybe someone here can help me out!
What I need to do in R is:
create lags for multiple variables considering id, day and day_nr (as I have multiple assessments for each participant on each day, and no lags should be created overnight, meaning the first assessment in the morning should not get a lag from the last observation of the previous day)
I tried several options, for example this, but didn't manage to group by more than id:
library(data.table)
data[, lag.value:=c(NA, value[-.N]), by=id]
Furthermore, I have now included the specific time of day of each assessment, and lags should only be created for observations with an interval of <3 hours between them, as the number of assessments per day is irregular. Any idea how I could do this in R?
Thanks a lot!!
Tine
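A minimal base R sketch of both requirements follows, using made-up data. The column names `day` and `time` and all values are hypothetical; only `id`, `value` and `lag.value` appear in the question. In data.table, the equivalent grouping would be `by = .(id, day)` instead of `by = id`.

```r
# Hypothetical data: two participants with irregular assessment times
d <- data.frame(
  id = c(1, 1, 1, 1, 2, 2),
  day = as.Date(c("2021-03-01", "2021-03-01", "2021-03-01",
                  "2021-03-02", "2021-03-01", "2021-03-01")),
  time = as.POSIXct(c("2021-03-01 09:00", "2021-03-01 10:30",
                      "2021-03-01 15:00", "2021-03-02 09:00",
                      "2021-03-01 08:00", "2021-03-01 10:00"), tz = "UTC"),
  value = c(5, 6, 7, 8, 1, 2)
)

# Sort so that "previous row" means "previous assessment"
d <- d[order(d$id, d$time), ]

# Shift a vector down by one position, padding with NA
lag1 <- function(x) c(NA, x[-length(x)])

# Lag within each id/day combination, so nothing carries over overnight
grp <- interaction(d$id, d$day, drop = TRUE)
d$lag.value <- ave(d$value, grp, FUN = lag1)

# Hours since the previous assessment within the same id/day
prev_time <- ave(as.numeric(d$time), grp, FUN = lag1)
gap_hours <- (as.numeric(d$time) - prev_time) / 3600

# Drop lags where the gap is 3 hours or more
d$lag.value[!is.na(gap_hours) & gap_hours >= 3] <- NA
print(d)
```

For several variables, the `ave()` line can be repeated per column, reusing the same `grp` and `gap_hours`.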
