How to set up a time series for this project in R?

I am a cross country runner on a high school team, and I am using my limited knowledge of R and linear algebra to create a ranking index for xc teams.
I get my data from milesplit.com, but I am unsure whether I am formatting it properly. So far I have created a matrix for each race, with odd columns holding runner scores and even columns holding times, so that each team has a team_score and a team_time column. I want to analyze the growth of teams as a time series, but I have two questions about this:
(1) Can I combine all of these "race matrices" into a time series? Can I assign all the data in a race matrix a certain date and then make one big time series including all 25 race matrices I made?
(2) Am I closing myself off to insights by not including name and grade for each runner (I only record time and score)? If so, how can I write a matrix that contains all this information?
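For question (1), here is a minimal sketch of one way to do it, assuming the 25 matrices live in a list and you have a matching vector of race dates (race_matrices and race_dates are hypothetical names, as are the "TeamA_score"-style column names). For irregular race dates, a long data frame with an explicit date column is usually easier to work with in R than a formal ts object:

library(dplyr)
library(tidyr)

# Hypothetical objects (not from the original post):
#   race_matrices - a list holding the 25 race matrices, with columns named
#                   like "TeamA_score" and "TeamA_time"
#   race_dates    - a Date vector giving the date of each race, in the same order

races_long <- lapply(seq_along(race_matrices), function(i) {
  as.data.frame(race_matrices[[i]]) |>
    mutate(date = race_dates[i], runner = row_number()) |>
    pivot_longer(
      cols      = -c(date, runner),
      names_to  = c("team", ".value"),
      names_sep = "_"       # "TeamA_score" -> team = "TeamA", value column "score"
    )
}) |>
  bind_rows() |>
  arrange(date, team, runner)

# races_long has one row per runner position per team per race
# (date, runner, team, score, time) -- a single long table you can filter,
# aggregate to team level, and track over time instead of 25 separate matrices

For (2), note that a data frame (unlike a matrix) can mix character columns such as runner name with numeric time and score, so recording name and grade would not force you into an awkward structure.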

Related

Diff-in-Diff estimation in R where all IDs are treated but at different points in time

I am trying to run a diff-in-diff on a dataset at the person-day level, where all individuals in the dataset are treated, albeit at different points in time. There are 5 treatment dates, so, for instance, person X receives the treatment on day 1, person Y receives the treatment on day 10, person Z on day 5, and so forth. What's important here is that every person is treated eventually. Here's a stylized visual representation of the data (where LHS is the dependent variable):
Now, what I am trying to do is run a diff-in-diff where I compare person Z, who was treated on day 5, with person Y, who was not yet treated on day 5 (so, in this setup, person Y would serve as the control group). This criterion would have to be extended to all the individuals in the sample so as to run the diff-in-diff simultaneously for all people.
I am not sure how to code this up in R. I am pretty familiar with the feols function, as I have used it several times in the past to run conventional diff-in-diffs such as the one illustrated here: https://lost-stats.github.io/Model_Estimation/Research_Design/event_study.html. However, in this particular case, I am not sure what I should interact Days_To_Treatment with, since if I interact it with Treatment, every observation prior to Days_To_Treatment = 0 will be dropped.
I am honestly pretty clueless as to how to approach this at the moment. Any help, advice, or tip would be greatly appreciated.
Thanks!
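One possible sketch with fixest, using the Sun & Abraham interaction sunab(), which is built for staggered adoption and uses not-yet-treated people as controls instead of dropping pre-treatment rows. The column names (person, day, treat_day, LHS) and the data frame df are assumptions about the data, not taken from the post:

library(fixest)

# Hypothetical person-day panel df with:
#   person    - individual id
#   day       - calendar day
#   treat_day - the day that person was treated (one of the 5 treatment dates)
#   LHS       - the dependent variable

# sunab(cohort, period) builds cohort-by-relative-time dummies internally, so
# pre-treatment observations are kept and later-treated people act as controls
# for earlier-treated ones
est <- feols(
  LHS ~ sunab(treat_day, day) | person + day,
  data    = df,
  cluster = ~ person
)

summary(est)   # event-study coefficients by period relative to treatment
iplot(est)     # plot the dynamic effects around treatment

Because every person is eventually treated, the comparison in the latest periods leans on the last-treated cohort, so it is worth checking how sunab's reference cohorts are defined in your version of fixest.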

How to normalize data with semesters vs. trimesters when both are mixed in a dataset in R

I am new to R but am finding it a powerful tool for working with education data across a state. I have grades for about 11,000 students over the span of two years. Most students have 12 rows in my dataset, as most schools work on a semester system. Many schools, however, work on a trimester or quarter system, meaning there are more or fewer rows and, therefore, more or fewer grades. The grades are relatively close throughout each semester/trimester/whatever, and I have already converted the letter grades into numeric values. A column titled 'TERM' identifies which system the school is under (SEM1/2, TRI1/2/3, QTR1/2/3/4). I am wondering if anyone has an idea as to how best to organize this data by TERM so that I have something normalized.
df <- data.frame(student = c('stu1','stu1','stu2','stu2','stu2'), TERM = c('sem1','sem2','tri1','tri2','tri3'),
                 letter = c('a','c','a','b','a'), grade = c(4,2,4,3,4))
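A minimal sketch of one way to normalize, assuming the real table has a student id, the TERM column, and a numeric grade (the student and grade column names are placeholders): collapse each student's terms to a single average, so students on semester, trimester, and quarter systems end up with one comparable value per student (add a year or school column to the grouping if you have one).

library(dplyr)

normalized <- df |>
  group_by(student) |>
  summarise(
    n_terms    = n(),           # 2 for semesters, 3 for trimesters, 4 for quarters
    mean_grade = mean(grade),   # simple average across SEM/TRI/QTR rows
    .groups    = "drop"
  )

# one row per student; the averaged grade is on the same scale regardless of
# how many terms that student's school reports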

Complex dataframe selecting and sorting by quintile

I have a complex dataframe (orig_df). Of the 25 columns, 5 are descriptions and characteristics that I wish to use as grouping criteria. The remainder are time series. There are tens of thousands of rows.
In my initial analysis and numerical summaries I noted significant issues with outlier observations within some of the specific grouping criteria. I used group_by() and looked at the quintile results within those groups. I would like to eliminate the low and high (individual observation) outliers relative to the group-based quintiles to improve the decision tree and clustering analytics. I also want to keep the outliers to analyze separately for root cause.
How do I manipulate the dataframe so that the individual observations are compared to the group-based quintile results and the split is saved (orig_df becomes ideal_df and outlier_df)?
After identifying the outliers using the link Nikos Tavoularis shared above, you can use ifelse() to create a new variable identifying which records are outliers and which are not. This way you keep all the data, but you can use this new variable to filter them out whenever you want.
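A minimal sketch of that idea with dplyr, assuming the grouping columns are called group1 through group5 and the observation being screened is in a column called value (all hypothetical names); the 20th and 80th percentiles mark the bottom and top within-group quintiles:

library(dplyr)

flagged <- orig_df |>
  group_by(group1, group2, group3, group4, group5) |>
  mutate(
    q20     = quantile(value, 0.20, na.rm = TRUE),
    q80     = quantile(value, 0.80, na.rm = TRUE),
    outlier = ifelse(value < q20 | value > q80, TRUE, FALSE)   # flag, don't drop
  ) |>
  ungroup()

ideal_df   <- filter(flagged, !outlier)   # feeds the decision tree / clustering
outlier_df <- filter(flagged,  outlier)   # kept separately for root-cause analysis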

How to plot k-medoids results of time series in R?

I'm working on a project connected with identifying the dynamics of sales. Here is what a piece of my database looks like: http://imagizer.imageshack.us/a/img854/1958/zlco.jpg . There are three columns:
Product - the product group
Week - time since the product's launch (in weeks), first 26 weeks
Sales_gain - how the product's sales change by week
In the database there are 3,302 observations = 127 time series.
My aim is to cluster the time series into groups that will show me different sales dynamics. I used the k-medoids algorithm (after transforming the data with FFT/DWT), and I do not know how to present each cluster (= a group of time series) on a separate plot.
Can somebody tell me how should I do that?
Here is the clustering code (pam() comes from the cluster package):
library(cluster)
clustersalesGain <- pam(t(salesGain), 8)          # k-medoids with 8 clusters on the transposed series
nazwy <- as.character(nazwy)                      # product names
cbind(nazwy, clustersalesGain$clustering)         # cluster assignment per product
I would like to present the output on different plots.
k-medoids returns actual data points as cluster centers.
Just visualize them the same way you visualize your data!
(And if you haven't been visualizing your data, you had better work on that now.)
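A minimal sketch of one way to do that with ggplot2, assuming the long table from the question (Product, Week, Sales_gain) is in a data frame called sales and that the columns of salesGain are named by product (otherwise build cluster_map from the nazwy vector instead):

library(ggplot2)

# cluster assignment per product, taken from the pam() result
cluster_map <- data.frame(
  Product = names(clustersalesGain$clustering),
  cluster = factor(clustersalesGain$clustering)
)

sales_clustered <- merge(sales, cluster_map, by = "Product")

# one panel per cluster, one faint line per product's 26-week series
ggplot(sales_clustered, aes(x = Week, y = Sales_gain, group = Product)) +
  geom_line(alpha = 0.4) +
  facet_wrap(~ cluster) +
  labs(title = "Sales dynamics by k-medoids cluster")

Since pam() returns actual series as medoids (clustersalesGain$medoids), you could also overlay each cluster's medoid as a bold line in its panel.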

Independent binary variable (frequency) and continuous response variable - lmm

I've spent a lot of time searching for a solution, but without success. For that reason I decided to post my problem here, hoping somebody can help me.
I want to find out which variables are influencing the travel distance of two animals (same species).
The response variable is distance moved (in meters). In total I have 66 tracking sessions for both animals.
The independent variables are: temperature, rainfall, offspring (yes = 1, no = 0), observation period (in minutes) and activity.
I observed the animals (one day, one animal) every 15 minutes and noted the state of activity (active = 1 or inactive = 0). For that reason my data table consists of around 1800 points and the same number of activity records.
Then I created a table with following columns:
Animal, Tracking-Session, rainfall, offspring, observation period, active, inactive, distance
The two columns active and inactive contain the number of active (respectively inactive) records per tracking session.
For example, in tracking session 1 animal A was active 30 times and inactive 11 times, and it moved 6000 meters during that tracking session.
I thought I could do my analysis with this table, using cbind() to make one column for activity out of the two columns "inactive" and "active". But this does not work; I get:
Error in lme4::lFormula(formula = distance~ (1 | animal) + activity + offspring + ...
rank of X = 12 < ncol(X) = 13
I want to include the second animal as a random factor to get an output valid for the whole "population" (which only consists of two animals in this case).
How can I fit a linear mixed model to this data? Or rather, the first question is: how does my data table have to look to do such an analysis?
I started by running a linear mixed model on my original data table of 1800 rows, but the outcome was not convincing, and I don't know if this table was built up correctly for the task. I have only 60 tracking sessions and therefore only 60 resulting travel distances, but 1800 records of activity (one every 15 minutes, active or inactive). I don't know how to handle this situation; the only possibility I found was to copy the travel distance (which is the result of all points watched per day) and assign it to each single point of that tracking session.
The same goes for rainfall and temperature: because these conditions were only measured once a day, I had to copy the value for each single point taken on the same day.
Is this correct, or can R handle such tables better (like the one in the picture)? Or is it better to create a table with one row for each day (as I described above)?
If the second table (the one with one row per tracking session) is the better choice, how does it have to be transformed so that R can use it?
Hopefully you can follow my explanations (I tried to explain everything in as much detail as possible) and someone can help me!
Thanks in advance!
Iris
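A minimal sketch of the session-level model, assuming the per-session table is called sessions and the observation-period column is called obs_period (hypothetical names; the other columns are the ones listed in the question). Collapsing the two activity counts to a proportion removes the collinearity behind the "rank of X" error, because inactive is just the session total minus active:

library(lme4)

# proportion of 15-minute checks in which the animal was active during that session
sessions$prop_active <- sessions$active / (sessions$active + sessions$inactive)

m1 <- lmer(
  distance ~ prop_active + offspring + rainfall + temperature + obs_period +
    (1 | animal),
  data = sessions
)

summary(m1)

Be aware that a random intercept for animal with only two levels is barely identified; a fixed effect for animal is a common alternative, at the cost of the estimates applying only to these two individuals rather than a wider population.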
