Flowstrates and R: extracting and reshaping data in required format

I'm trying to transform a large dataset into the format required for analysis with the flowstrates package.
What I currently have is a large file (600k trips) with origin and destination points.
The format looks roughly like this:
tripID Month start_pt end_pt
1 June 1 3
2 June 1 3
3 July 1 5
4 July 1 7
5 July 1 7
What I need to be able to generate is a file that has trip counts by unit time (let's say months) in a format like this:
start_pt end_pt June July August ... December
1 3 2 0 5 9
1 5 0 1 4 4
1 7 0 2 0 0
It's easy enough to pre-segment the data by month and then generate counts for each origin-destination pair, but putting it all back together causes all sorts of problems, since the pre-segmented chunks have very different sizes. So it seems that I need to do this for the entire dataset at once.
Are there any packages for doing this type of processing? Would it be easier to do this in something like SQL or SQLite?
Thanks in advance for any help.

You can use the reshape2 package to do this fairly easily.
If your data is dat,
library("reshape2")
dcast(dat, start_pt+end_pt~Month, value.var="tripID", fun.aggregate=length)
This gives a single entry for each start_pt/end_pt/Month combination, the value of which is how many cases had that combination (the length of tripID for that set).
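As a minimal sketch, here is the same call run on the five sample rows from the question. Converting Month to a factor is an assumption added here (it is not part of the original answer); without it the month columns would likely come out in alphabetical rather than calendar order.

```r
library(reshape2)

# Sample rows copied from the question's trip table
dat <- data.frame(
  tripID   = 1:5,
  Month    = c("June", "June", "July", "July", "July"),
  start_pt = c(1, 1, 1, 1, 1),
  end_pt   = c(3, 3, 5, 7, 7)
)

# Assumption: store Month as a factor so the columns follow calendar order
dat$Month <- factor(dat$Month, levels = c("June", "July"))

# One row per start_pt/end_pt pair, one column per month, cells = trip counts
wide <- dcast(dat, start_pt + end_pt ~ Month,
              value.var = "tripID", fun.aggregate = length)
print(wide)
```

Pairs with no trips in a given month get a count of 0, which matches the layout asked for in the question.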

Related

How do I merge data sets with some of the same columns without matching the elements but rather adding them to the vector?

I have been attempting to merge cross sectional data sets which were acquired at different years from different people.
For data collection, most of the same questions were asked each year, but some questions were added or deleted. Hence, there are some variables that match across datasets and others that do not match but are still important.
Something to keep in mind is that there is a different number of respondents per year. Hence, matching variables do not have the same number of elements in each dataset.
For context, I am attempting to merge three data sets, but I will illustrate my examples below with 2 of the 3 for simplicity's sake.
I have tried the merge() function with all = TRUE, but the data set it created split the vector I wanted stacked into 3 separate vectors, e.g.
internet.x internet.y internet.z
3 3 7
6 4 5
I have also tried the rbind() function from the plyr package but this mode of merging deletes the columns that do not have matching elements.
So for example, data:year2017 and data:year2018 both have a variable titled YEAR, e.g.
data:year2017 data:year2018
YEAR YEAR
2017 2018
2017 2018
2017 2018
2017 2018
2017 2018
2017 2018
2017 2018
2017 2018
The YEAR variable gets deleted in the merging product because the same variable has different values or elements within different datasets.
So... what I want to keep in the finalized product is a merged result of
data:MERGED
YEAR
2017
2017
2017
2017
2017
2017
2017
2017
2018
2018
2018
2018
2018
2018
2018
2018
Another example is the following variable = var1 which is named the same across data sets
data:year2016 data:year2017 data:year2018
var1 var1 var1
3 5 2
2 3 1
4 7 7
5 8 3
6 3 4
The resulting product ideally would be
data:MERGEDFINAL
var1
3
2
4
5
6
5
3
7
8
3
2
1
7
3
4
What I want to happen is that all variables that are the same across data sets get stacked. Variables that are not the same should still be stacked, but padded with NAs for the respondents who took the survey in a year when no data was collected for that variable.
If you all could put your brain power and experience together and help me out with this one that would be great :):):)

The bind_rows() function from the dplyr package is what you need. It stacks data frames by matching column names and fills columns that are missing from a data frame with NA. To combine three datasets into one, use it like this:
library(dplyr)
dfAll<-bind_rows(dfA, dfB, dfC)
Edit: Update, directly call all three datasets. Removed intermediate step as first posted.
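A minimal sketch of the behaviour, using two small made-up data frames. The column names internet and newq and all values are hypothetical stand-ins for the survey data; only YEAR and var1 come from the question.

```r
library(dplyr)

# Hypothetical survey extracts: YEAR and var1 are shared,
# internet exists only in 2017 and newq only in 2018
year2017 <- data.frame(YEAR = 2017, var1 = c(5, 3, 7), internet = c(3, 6, 4))
year2018 <- data.frame(YEAR = 2018, var1 = c(2, 1), newq = c("a", "b"),
                       stringsAsFactors = FALSE)

# Shared columns are stacked; columns missing from one year are filled with NA
merged <- bind_rows(year2017, year2018)
print(merged)
```

The result has five rows: the 2017 respondents get NA in newq and the 2018 respondents get NA in internet.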

How to sum a variable by group but do not aggregate the data frame in R? [duplicate]

Although I have found a lot of ways to calculate the sum of a variable by group, all the approaches end up creating a new data set that collapses the duplicate cases.
To be more precise, if I have a data frame:
id year
1 2010
1 2015
1 2017
2 2011
2 2017
3 2015
and I want to count the number of times I have the same ID by the different years, there are a lot of ways (using aggregate, tapply, dplyr, sqldf etc) which use a "group by" kind of functionality that in the end will give something like:
id count
1 3
2 2
3 1
I haven't managed to find a way to calculate the same thing but keep my original data frame, in order to obtain:
id year count
1 2010 3
1 2015 3
1 2017 3
2 2011 2
2 2017 2
3 2015 1
and therefore not collapse my duplicate cases.
Has somebody already figured this out?
Thank you in advance
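One possible approach, sketched here with base R's ave(), which applies a function within each group and returns a vector as long as the input, so no rows are dropped. This is only one option and not necessarily what the linked duplicates recommend.

```r
# Data frame copied from the question
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3),
  year = c(2010, 2015, 2017, 2011, 2017, 2015)
)

# ave() computes length() per id and repeats the result on every row of that id
df$count <- ave(df$year, df$id, FUN = length)
print(df)
```

The data frame keeps all six rows and gains a count column with 3, 3, 3, 2, 2, 1.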

finding month given week number of the year in R

I have a week_of_year column (0-52) in my data. I want to find out which month (or, in some cases, which 2 months) each week belongs to. In the end I need a data frame like
week_of_year, jan, feb,.....dec
5 1 1 0 0 0
If the fifth week belongs to both jan and feb, those columns should be filled with 1 and the others with 0. I then want to merge this data frame with my own data frame.
Help me with this please.
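A rough sketch of one way to build such a table in base R. Two things are assumed here that the question does not state: the year (2021 is used) and the week numbering (the "%U" convention, where weeks start on Sunday and the days before the first Sunday form week 0). The data frame my_df is a hypothetical stand-in for the asker's data.

```r
# Assumptions (not given in the question): year 2021, "%U" week numbering
year <- 2021
days <- seq(as.Date(paste0(year, "-01-01")),
            as.Date(paste0(year, "-12-31")), by = "day")

# Week number and month for every day of the year
week  <- as.integer(format(days, "%U"))
month <- factor(month.abb[as.integer(format(days, "%m"))], levels = month.abb)

# A week "belongs" to a month if at least one of its days falls in that month
tab   <- table(week, month)
flags <- matrix(as.integer(tab > 0), nrow = nrow(tab),
                dimnames = list(NULL, colnames(tab)))
week_months <- data.frame(week_of_year = as.integer(rownames(tab)), flags)

# Hypothetical data frame to merge the indicator columns onto
my_df  <- data.frame(week_of_year = c(5, 10), value = c(100, 200))
merged <- merge(my_df, week_months, by = "week_of_year")
print(merged)
```

Under these assumptions week 5 runs from 31 January to 6 February, so it gets a 1 in both Jan and Feb. A different year or week convention will shift the boundaries.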

R: How to plot multiple series when the series is included as a variable?

I want to plot five different time series as multiple lines on one graph. The problem is that my data frame is arranged like so:
Series Time Price ...
1 Dec 2003 5
2 Dec 2003 10
3 Dec 2003 2
1 Jan 2004 10
2 Jan 2004 10
3 Jan 2004 5
This is a simplified version, and there are many other variables for each observation. I'd like to be able to plot time vs price and use the first variable as the indicator for which series.
The time period is 77 months long, so I'm not sure if there's an easy way to reshape the data to look like:
Series Dec.2003.Price Jan.2004.Price ...
1 5 10
2 10 10
3 2 5
or a way to graph these like I said without reshaping.

You can try xyplot() from the lattice package:
library(lattice)
xyplot(Price ~ Time, groups=Series, data=df, type="l")
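A self-contained sketch of that call on the sample rows from the question. It assumes Time is stored as text such as "Dec 2003" and converts it to a Date first, so that the x-axis is ordered chronologically; the conversion step and the Date column are additions, not part of the original answer.

```r
library(lattice)

# Sample rows copied from the question's table
df <- data.frame(
  Series = c(1, 2, 3, 1, 2, 3),
  Time   = c("Dec 2003", "Dec 2003", "Dec 2003",
             "Jan 2004", "Jan 2004", "Jan 2004"),
  Price  = c(5, 10, 2, 10, 10, 5),
  stringsAsFactors = FALSE
)

# Assumption: Time is text like "Dec 2003"; build a Date on the 1st of the month.
# month.abb is used instead of "%b" so parsing does not depend on the locale.
mon <- match(sub(" .*", "", df$Time), month.abb)
yr  <- sub(".* ", "", df$Time)
df$Date <- as.Date(sprintf("%s-%02d-01", yr, mon))

# One line per Series, no reshaping to wide format needed
p <- xyplot(Price ~ Date, groups = Series, data = df, type = "l",
            auto.key = list(lines = TRUE, points = FALSE))
print(p)
```

Because groups= handles the series indicator, the long format can be plotted as it is, however many months the data covers.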

How to properly use time-series objects in R?

I have some daily data which looks like this:
> head(daily_leads)
date gross.leads day_of_week month
1 2007-01-01 6427 2 1
2 2007-01-02 7111 3 1
3 2007-01-03 7367 4 1
4 2007-01-04 7431 5 1
5 2007-01-05 7257 6 1
6 2007-01-06 7231 7 1
There are some clear intra-week patterns, and the same day of the week usually corresponds to similar levels (one Friday's count is similar to other Fridays'). There are also trends over the course of the year.
What is the correct way to set up a time-series object in R in order to make forecasts in my case? When I try this:
> leads <- ts(daily_leads$gross.leads, frequency = 1/7)
> q <- decompose(leads)
Error in decompose(leads) : time series has no or less than 2 periods
I'm not sure if I am setting it up correctly but I want the time-series object to reflect the seasonality within the week and year. Any help would be much appreciated. I have 2543 observations of the data.
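
You need frequency=7. In time series, the frequency is the number of observations per seasonal period, so daily data with a weekly cycle has 7 observations per period. A frequency of 1/7 is not greater than 1, so decompose() sees no seasonal period and stops with that error.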
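A minimal sketch with simulated data standing in for daily_leads$gross.leads. The values are made up; only the number of observations, 2543, comes from the question.

```r
set.seed(1)

# Simulated daily counts with a weekly pattern (made-up numbers)
n <- 2543
day_of_week <- (seq_len(n) - 1) %% 7
gross_leads <- 7000 + 400 * sin(2 * pi * day_of_week / 7) + rnorm(n, sd = 100)

# frequency = 7: seven daily observations per weekly seasonal period
leads <- ts(gross_leads, frequency = 7)
parts <- decompose(leads)

# One estimated seasonal effect per day of the week
print(round(parts$figure))
```

A ts object carries a single frequency, so this captures only the weekly pattern. Modelling the yearly pattern at the same time would need a different approach.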
