Here we have panel data of "pessoas contadas" (counted people) in the city of Rio de Janeiro, by "dia" (day) and "bairro" (neighborhood). I want to calculate the average number of counted people, by "bairro", over the last n days as of the day 24/03/2020. The result must be repeated along the whole panel, by "bairro". How can I do it?
You can download the dataset here: https://www.dropbox.com/s/yz5fkgs0vuyoi8s/teste.xlsx?dl=0
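A minimal sketch of one way to do this with dplyr (the data frame name painel and the column names dia, bairro and pessoas_contadas are assumptions based on the question; dia is assumed to be a Date):

library(dplyr)

n <- 7                              # example window length in days
ref_date <- as.Date("2020-03-24")

painel <- painel %>%
  group_by(bairro) %>%
  mutate(media_n_dias = mean(pessoas_contadas[dia > ref_date - n & dia <= ref_date],
                             na.rm = TRUE)) %>%   # length-one mean per bairro
  ungroup()

Because mutate() broadcasts the length-one mean across each group, the same value is repeated on every row of a given bairro, as the question asks.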
Related
I am trying to calculate, for each station, the maximum number of aircraft on the ground simultaneously over the year, knowing that I have more than 300 stations and that the data is at day-and-hour granularity for one year.
So I thought of this solution: find the maximum per day and per station, then take the maximum per station.
My data are in this format: station, aircraft, time, type (arrival at the station or departure from the station), and value, which is 1 for an arrival and -1 for a departure. I created this column to make the count easier: the idea is to apply cumsum once the data are sorted by time for each station.
I need to create a function that groups the data by day and by station and computes the cumulative sum, but some planes stay overnight at the station, so I need to delete them (the yellow lines in the screenshot).
The trick to detect these planes: the Aircraft column lets us track each plane, and a plane generally appears twice a day, once when it arrives and once when it leaves. To detect the overnight planes I have to look at two variables, Aircraft and Type: if the type is departure and the aircraft on that line appears only once that day (meaning there is no arrival for this flight), then I should not count it.
I was thinking of creating a function that groups by station and time and then applies cumsum, skipping the lines that meet the condition I explained above (type is departure and the aircraft appears only once that day).
Any help?
I solved this by creating a new table tab with the table() function to count the frequency of each aircraft by day, then joining this table to my original data frame and deleting the rows with Freq == 1 and type == departure.
library(dplyr)

# frequency of each aircraft per day
tab <- as.data.frame(table(df$day, df$AIRCRAFT, useNA = "always"))
df_new <- df %>% left_join(tab, by = c("day" = "Var1", "AIRCRAFT" = "Var2"))
df_final <- df_new %>%
  filter(!(TYPE == "Departure" & Freq == 1)) %>%   # drop departures with no matching arrival that day
  arrange(TIMESTAMP) %>%
  group_by(STATION, day) %>%                       # group by station and day, as described above
  mutate(ground = cumsum(value))                   # assumes the +1/-1 column is named "value"
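From there, the per-station maximum described at the top could be read off with something like this (a sketch, assuming the STATION and ground columns from the code above):

df_final %>%
  group_by(STATION) %>%
  summarize(max_on_ground = max(ground))   # peak simultaneous aircraft per station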
I am new to R and can't find the right syntax for a specific average I need. I have a large Fitbit dataset of heart rate per second for 30 people, for a month each. I want an average heart rate per day per person, to make the data easier to manage and to join with other Fitbit data.
First few lines of Data
The columns I have are Id (person ID), Time (date-time), and Value (heart rate). I have already separated Time into two columns, one for the date and one for the time only. My idea is to group the data by person, then by date, and get one average number per person per day. But my code is not doing that.
hr_avg <- hr_per_second %>% group_by(Id) %>% group_by(Date) %>% summarize(mean(Value))
As a result I get an average by date only. I can't do this manually because the dataset is so big that Excel can't open it, and I can't upload it to BigQuery either, which is the database I learned to use during my data analysis course. Thanks.
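For reference, the usual explanation is that the second group_by() replaces the first rather than adding to it, so both keys need to go into a single call. A minimal sketch, using the column names from the question:

library(dplyr)

hr_avg <- hr_per_second %>%
  group_by(Id, Date) %>%                             # one group per person per day
  summarize(avg_hr = mean(Value), .groups = "drop")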
I have a relatively large xts object. It contains the daily adjusted closing prices from 2012-2021 for each company in the STOXX 600 Europe. I want to calculate the yearly volatility of the stocks, for each year and for each company.
Here is a link showing what my dataset looks like:
https://imgur.com/a/oS7ROCL
So I started by calculating the log differences:
XTS.LOGDIFFS <- diff(log(XTS.ADJCLOSE))
The next step would be to calculate the volatility of the stocks for a specific period of time, for example from 2012-05-01 till 2012-12-31, by taking the standard deviation and multiplying it by the square root of 252 (252 being the average number of trading days per year).
So my idea is this: First I want to extract the data from my dataset for a specific period of time, in this case from 2012-05-01 till 2012-12-31.
I tried this: XT1 <- xts(XTS.LOGDIFFS[1:174]).
As an alternative, I have thought about this:
start_date <- as.Date("2012-05-01")
end_date <- as.Date("2012-12-31")
The next step would be to calculate the volatility for the extracted xts object.
So I tried this: vol <- sd(XT1, na.rm = TRUE) * sqrt(252).
But this only gives me the volatility of all the stocks combined, not of each single one.
So I think I need a function to get the stock volatility for every company over the extracted period of time in the xts object, but I have no clue what this should look like.
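A minimal sketch of one way to do this, assuming XTS.LOGDIFFS holds one column per company: xts supports date-range subsetting with a "start/end" string, and sd() can then be applied column-wise.

library(xts)

# extract the window 2012-05-01 to 2012-12-31 by date-range indexing
logdiff_sub <- XTS.LOGDIFFS["2012-05-01/2012-12-31"]

# column-wise standard deviation, annualized with sqrt(252)
vols <- apply(logdiff_sub, 2, sd, na.rm = TRUE) * sqrt(252)

vols is then a named vector with one annualized volatility per company; repeating this with one date range per calendar year would give the yearly figures.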
I have a series of random data points with order date, expiry date and product category.
I am trying to compute two age variables. The first would compute a rolling average age using just the order dates (hence no expiry date), relative to each order date.
In other words, I don't need the age determined from the current date, but relative to each data point in the data. As such, the average would keep increasing as more products are purchased and the first ones get older. The second age variable would cap each order's age at its expiry date, which means the rolling average would not necessarily keep increasing.
I am trying to do this using R or Tableau.
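On the R side, a minimal sketch of both variables (the data frame name orders and the columns order_date and expiry_date are assumptions; both columns are assumed to be Dates):

library(dplyr)

orders <- orders %>% arrange(order_date)

# variable 1: average age of all orders placed so far, as of each order date
orders$avg_age <- sapply(seq_len(nrow(orders)), function(i) {
  mean(as.numeric(orders$order_date[i] - orders$order_date[1:i]))
})

# variable 2: the same, but each order stops aging at its expiry date
orders$avg_age_capped <- sapply(seq_len(nrow(orders)), function(i) {
  ends <- pmin(orders$order_date[i], orders$expiry_date[1:i])
  mean(as.numeric(ends - orders$order_date[1:i]))
})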
I am relatively new to R and I am facing an issue in extracting summarized data from one dataset and storing it in another dataset. I am presently trying to analyze a flight dataset along with weather information, and I have the following query:
I have one dataset with the flight data. It comprises columns such as flight number, date, origin city name, origin state name, departure time, departure time block (each block one hour long, spanning the full 24 hours), arrival time, arrival time block, flight distance, total flight time, and a binary response variable indicating whether the flight was delayed.
I have another dataset with weather data. It includes columns such as date, time of recording, hour block (again one hour long, spanning the full 24 hours), state, city, and the corresponding weather variables such as temperature, humidity levels, wind speed, etc. For a given city, state, date and hour block there can be several entries in the weather dataset, reflecting varying temperatures at different times within the block.
I am trying to extract the temperature and humidity levels from the weather dataset and place this information in the flight dataset. For this purpose, I have created four new columns in the flight dataset: origin temperature, origin humidity, destination temperature and destination humidity.
My objective is to store in the flight dataset the average temperature and humidity for the origin city during the departure hour block, as well as for the destination city during the arrival hour block. I am using averages because there can be multiple entries in the weather dataset for the same hour block, city and state combination.
I have tried to extract the origin temperature from the weather dataset using the following loop:
for (i in 1:nrow(flight)) {
  flight[[i, 13]] <- mean(weather$temperature[weather$date == flight[[i, 1]] &
                                              weather$time_blk == flight[[i, 5]] &
                                              weather$state == flight[[i, 3]] &
                                              weather$city == flight[[i, 2]]])
}
However, the weather dataset contains around 2 million rows and the flight dataset comprises around 20,000 observations. As a result, the above calculation takes a long time to execute, and even then the origin temperature column in the flight dataset is not filled in properly.
I am also attaching the images of sample flight and weather datasets for reference.
Can you please suggest another approach for obtaining the above information, considering the size of the datasets and the processing time required?
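One standard alternative, sketched below with dplyr: pre-aggregate the weather data once per (date, hour block, state, city) and then attach the averages with a join instead of a row-by-row loop. The flight column names dep_time_blk, origin_state and origin_city are assumptions; the weather column names follow the loop above.

library(dplyr)

# one row per date/hour block/state/city, with the averages precomputed
weather_avg <- weather %>%
  group_by(date, time_blk, state, city) %>%
  summarize(origin_temp = mean(temperature, na.rm = TRUE),
            origin_humidity = mean(humidity, na.rm = TRUE),
            .groups = "drop")

# attach the origin-side averages; the destination side works the same way,
# using the arrival time block and the destination city/state columns
flight <- flight %>%
  left_join(weather_avg,
            by = c("date",
                   "dep_time_blk" = "time_blk",
                   "origin_state" = "state",
                   "origin_city" = "city"))

A two-table join like this scales far better than the loop, since the 2-million-row weather table is summarized only once.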