I've spent a lot of time searching for a solution, but without success. For that reason I decided to post my problem here, hoping somebody can help me.
I want to find out which variables are influencing the travel distance of two animals (same species).
The response variable is distance moved (in meters). In total I have 66 tracking sessions for both animals.
The independent variables are: temperature, rainfall, offspring (yes = 1, no = 0), observation period (in minutes) and activity.
I observed the animals (one day, one animal) every 15 minutes and noted the state of activity (active = 1 or inactive = 0). For that reason my data table consists of around 1800 rows, with the same number of activity records.
Then I created a table with following columns:
Animal, Tracking-Session, rainfall, offspring, observation period, active, inactive, distance
The two columns active and inactive contain the sum of active (inactive) records per tracking session.
For example, in tracking session 1 animal A was active 30 times and inactive 11 times, and moved 6000 meters during that tracking session.
I thought I could do my analysis with this table by using cbind() to combine the two columns "active" and "inactive" into one activity column. But this does not work; I get:
Error in lme4::lFormula(formula = distance~ (1 | animal) + activity + offspring + ...
rank of X = 12 < ncol(X) = 13
I want to include animal identity as a random factor to get an output valid for the whole "population" (which consists of only two animals in this case).
How can I fit a linear mixed model to these data? Or rather, the first question is: what does my data table have to look like for such an analysis?
I started by running a linear mixed model on my original data table of 1800 rows, but the outcome was not convincing, and I don't know whether that table was built up correctly for the task. I have only 60 tracking sessions and therefore only 60 resulting travel distances, but 1800 records of activity (one every 15 minutes, active or inactive). The only way I found to handle this was to copy the travel distance (which is the result of all points watched per day) and assign it to every single point of that tracking session.
The same goes for rainfall and temperature: because these conditions were only measured once a day, I had to copy the value to every single point taken on the same day.
Is this correct, or can R handle such tables (like the one in the picture) better? Or is it better to create a table with one row for each day (as I described above)?
If the second table (the one with one row per tracking session) is the better choice, how does it have to be transformed so that R can use it?
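In case it helps, this is roughly the session-level model I have in mind (only a sketch; session_data, prop_active and observation_period are placeholder names for my table and columns):
library(lme4)
# one row per tracking session; prop_active = share of active records in that session
session_data$prop_active <- session_data$active / (session_data$active + session_data$inactive)
fit <- lmer(distance ~ prop_active + offspring + temperature + rainfall +
              observation_period + (1 | animal),
            data = session_data)
summary(fit)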
Hopefully you can follow my explanations (I tried to be as detailed as possible) and someone can help me!
Thanks in advance!
Iris
I am a cross country runner on a high school team, and I am using my limited knowledge of R and linear algebra to create a ranking index for xc teams.
I get my data from milesplit.com, but I am unsure if I am formatting this data properly. So far I have created a matrix for each race, with odd columns containing runner scores and even columns containing times, where each team has a team_score and a team_time column. I want to analyze the growth of teams in a time series, but I have two questions about this:
(1): Can I combine all of these "race matrices" into a time series? Can I assign all the data in a race matrix a certain date, and then make one big time series including all 25 race matrices I made?
(2): Am I closing myself off to insights by not including name and grade for each runner (as I only record time and score)? If so, how can I write a matrix that contains all this information?
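To make question (1) concrete, this is the kind of reshaping I have in mind (only a sketch; race_list and race_dates are placeholder names for the 25 matrices I built and their dates, and it assumes every race matrix has the same columns):
# race_list: list of race matrices; race_dates: one date per race
long_races <- do.call(rbind, lapply(seq_along(race_list), function(i) {
  df <- as.data.frame(race_list[[i]])
  df$date <- race_dates[i]     # tag every row of this race with its date
  df
}))
# order by date so the stacked table can be treated as one long time series
long_races <- long_races[order(long_races$date), ]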
I am trying to run a diff-in-diff on a dataset at the person-day level, where all individuals in the dataset are treated, albeit at different points in time. There are 5 treatment dates, so, for instance, person X receives the treatment on day 1, person Y receives the treatment on day 10, person Z on day 5, and so forth. What's important here is that every person is treated eventually. Here's a stylized visual representation of the data (where LHS is the dependent variable):
Now, what I am trying to do is run a diff-in-diff where I compare person Z, who was treated on day 5, with person Y, who was not yet treated on day 5 (so, in this setup, person Y would serve as the control group). This criterion would have to be extended to all the individuals in the sample so as to run the diff-in-diff simultaneously for all people.
I am not sure how to code this up in R. I am pretty familiar with the feols function in R, as I have used it several times in the past to run conventional diff-in-diffs such as the one illustrated here: https://lost-stats.github.io/Model_Estimation/Research_Design/event_study.html. However, in this particular case, I am not sure what I should be interacting Days_To_Treatment with, since if I interact it with Treatment, every observation prior to Days_To_Treatment = 0 will be dropped.
I am honestly pretty clueless as to how to approach this at the moment. Any help, advice, or tip would be greatly appreciated.
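For reference, the closest I have come is something along these lines (only a sketch with made-up object and column names: paneldata, lhs, person_id, day, treat_day; sunab() is fixest's Sun-Abraham estimator for staggered adoption, which as far as I understand uses not-yet-treated/last-treated cohorts as controls, but I am not sure it is what I need):
library(fixest)
# treat_day = the day on which each person was treated (their cohort)
est <- feols(lhs ~ sunab(treat_day, day) | person_id + day,
             data = paneldata, cluster = ~ person_id)
summary(est)   # dynamic (event-study) coefficients relative to the treatment day
iplot(est)     # plot them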
Thanks!
I am writing my thesis and I am struggling with some data preparation.
I have a dataset with prices, distance, and many other variables for several US airline routes. I need to identify the threat of entry on each route for a specific carrier (Southwest), and to do that I need to create, for each row of the dataset, a dummy that takes the value 1 if Southwest is flying from the row's takeoff airport at that point in time.
The way I thought of approaching this was to have an algorithm that checks the year and the takeoff airport_ID of each row (both are variables in the dataset) and then, based on those values, filters the whole dataset by year <= the row's year, origin_airport = the row's origin_airport, and carrier = Southwest. If the filter produces any output, it means that Southwest was by that time already flying from that airport. Hence, if the filtering produces an output, the dummy should take the value 1, otherwise 0. This should be automated for each row in the dataset.
Any idea how to put this into RStudio code? Or is there an easier way to address this issue?
This is the link to the dataset on dropbox:
https://www.dropbox.com/s/n09rp2vcyqfx02r/DB1B_completeDB1B_complete.csv?dl=0
The short answer is to use a self join.
Looking at your data set, I don't see IATA airport codes, but rather 6-digit origin and destination IDs (which do not seem to conform to anything in DB1A/DB1B??). Also, it's not clear (to me) what exactly the granularity of your data is, so I am making some assumptions.
library(data.table)
setwd('<directory with your csv file>')
data <- fread('DB1B_completeDB1B_complete.csv')
wn <- data[carrier=='WN']                             # Southwest (WN) records only
data[, flag:=0]                                       # default: no WN presence
data[wn, flag:=1, on=.(ap_id, year, quarter, date)]   # flag rows matching a WN record on route/time
So, this just extracts the WN records and then joins them back to the original table on ap_id (defines route??), year, quarter, and date. This assumes the granularity is at the carrier/route/year/quarter/date level (i.e. one row per combination).
Before you do that, though, you need to do some serious data cleaning. For instance, while it looks like ORIGIN_AIRPORT_CD and DEST_AIRPORT_CD are parsed out of ap_id, there are about 1200 records where these are NA.
##
# missingness
#
data[, .(col = names(data), na.count=sapply(.SD, \(x) sum(is.na(x))))]
Also, my assumption that there is one row per carrier/route/year/quarter/date does not always seem to hold. This is an especially serious problem with the WN rows.
##
# duplicates??
#
data[, .N, keyby=.(carrier, ap_id, year, quarter, date)][order(-N)]
wn[, .N, keyby=.(carrier, ap_id, year, quarter, date)][order(-N)]
Finally, in attempting to quantify the impact of WN entry to a market, you probably should at least consider grouping nearby airports. For instance JFK/LGA/EWR are frequently considered "NYC", and SFO/OAK/SJC are frequently considered "Bay Area" (these are just examples). This means, for instance, that if WN started flying from LGA to a destination of interest it might also influence OA prices from JFK and EWR to that same destination.
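If you go that route, the grouping itself is just another lookup-table join; a sketch, assuming you can first map your numeric airport ids to IATA codes (the lookup table, origin_iata and origin_metro are hypothetical):
# hypothetical lookup table from airport code to metro area
metro_lookup <- data.table(
  iata  = c('JFK', 'LGA', 'EWR', 'SFO', 'OAK', 'SJC'),
  metro = c('NYC', 'NYC', 'NYC', 'Bay Area', 'Bay Area', 'Bay Area')
)
# attach the metro area of the origin airport to each row
data[metro_lookup, origin_metro := i.metro, on=.(origin_iata = iata)]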
I have a panel data set consisting of bonds with daily prices observed over a period of time. Thus each bond is repeated downwards with the corresponding daily price observations and dates (ref picture below). Half of the bonds are green (identified by a dummy variable) and each green bond is matched with a non-green bond, each pair is identified with a pair-id. So a green bond and its matched non-green bond have the same pair-id, and are observed over the same time span (say 100 days each), but the individual bond-id is unique.
I want to measure the fixed effect within each pair of bonds to figure out if there is a significant difference in yield to maturity (variable used = ask.yield) between the green bond and its matching non-green bond. Thus, I believe that when declaring the panel data in R, the individual index should be pair.id and the time index should be date. I use the following regression:
fixed <- plm(ask.yield ~ liquidity + green, data = paneldata, index = c("pair.id", "dates"), model = "within")
Desired output (do not mind the numbers):
I get an error message saying:
Error in pdim.default(index[1], index[2]) :
duplicate couples (id-time)
I understand the error message – each pair.id in the panel data is recorded over the same dates twice (one time for the green bond, and one for the matching non-green bond).
Does anyone know how to get around this problem and still be able to measure the fixed effect within each pair of bonds?
From the error, there are duplicates in the pair id; that is, the combinations of pair.id and dates are not unique. Can you check whether the values of date are unique for each pair.id?
If they are, you might need to convert the date to a string; depending on the data type, the date might be converted to some value that introduces duplicates.
Hope this helps. Since I don't have the data, I have no way to reproduce the problem.
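A quick check along these lines (base R, using the pair.id and dates columns from your plm() call) will show whether any id-time couple occurs more than once:
# rows whose (pair.id, dates) combination has already appeared earlier in the data
dups <- paneldata[duplicated(paneldata[, c("pair.id", "dates")]), ]
nrow(dups)   # > 0 means duplicate id-time couples, which is what plm complains about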
I'm working on a project on identifying the dynamics of sales. Here is what a piece of my database looks like: http://imagizer.imageshack.us/a/img854/1958/zlco.jpg . There are three columns:
Product - the product group
Week - time since the product launch (in weeks), first 26 weeks
Sales_gain - how the product's sales change by week
The database contains 3302 observations = 127 time series.
My aim is to cluster the time series into groups that show different sales dynamics. I used the k-medoids algorithm (after transforming the data with FFT/DWT), and I do not know how to present each cluster (the grouped time series) on separate plots.
Can somebody tell me how I should do that?
Here is the clustering code:
library(cluster)                              # for pam()
clustersalesGain <- pam(t(salesGain), 8)      # k-medoids with k = 8; one series per row of t(salesGain)
nazwy <- as.character(nazwy)                  # series (product) names
cbind(nazwy, clustersalesGain$clustering)     # cluster assignment for each series
I would like to present the output on different plots.
k-medoids returns actual data points as cluster centers.
Just visualize them the same way you visualize your data!
(And if you haven't been visualizing your data, you had better work on that now.)
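For example, something along these lines (a sketch only, assuming salesGain has one column per series, consistent with your pam(t(salesGain), 8) call):
library(cluster)
cl <- clustersalesGain$clustering
par(mfrow = c(2, 4))                        # one panel per cluster (k = 8)
for (k in 1:8) {
  members <- which(cl == k)
  matplot(salesGain[, members, drop = FALSE], type = "l", lty = 1, col = "grey",
          xlab = "Week", ylab = "Sales gain", main = paste("Cluster", k))
  lines(clustersalesGain$medoids[k, ], col = "red", lwd = 2)   # the medoid series
}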