finding month given week number of the year in R - r

I have week_of_the_year column (0-52) in my data. I want to find out which month or (some cases 2 months) that it belongs to. I finally need a data frame like
week_of_year, jan, feb,.....dec
5 1 1 0 0 0
if fifth week belongs to both jan and feb those columns to be filled with 1 and others with 0s. I finally want to merge the above dataframe with my own dataframe.
Help me with this please.

Related

How to calculate a ratio for each month in a year and attach 12 ratio columns (will be 12 columns) to the origianl dataset in R?

My dataset has 3000 observations (rows) and 24 variables (columns). First 12 (column 1 to 12) variables are price value from January to December in a year. The remaining 12 variables (column 13 to 24) are rent value from January to December in the same year. I would like to caculate the price/rent for each month and add the 12 new ratios columns to the end of my origianl dataset. How can I do this in R?
Thank you in advance!
You could do :
df[paste0('ratio', 1:12)] <- df[1:12]/df[13:24]
This will create 12 new columns called ratio1, ratio2 ... ratio12. where ratio1 is column1/column13 ratio2 is column2/column14 and so on.

How to sum a variable by group but do not aggregate the data frame in R? [duplicate]

This question already has answers here:
Count number of rows per group and add result to original data frame
(11 answers)
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 4 years ago.
although I have found a lot of ways to calculate the sum of a variable by group, all the approaches end up creating a new data set which aggregates the double cases.
To be more precise, if I have a data frame:
id year
1 2010
1 2015
1 2017
2 2011
2 2017
3 2015
and I want to count the number of times I have the same ID by the different years, there are a lot of ways (using aggregate, tapply, dplyr, sqldf etc) which use a "group by" kind of functionality that in the end will give something like:
id count
1 3
2 2
3 1
I haven't managed to find a way to calculate the same thing but keep my original data frame, in order to obtain:
id year count
1 2010 3
1 2015 3
1 2017 3
2 2011 2
2 2017 2
3 2015 1
and therefore do not aggregate my double cases.
Has somebody already figured out?
Thank you in advance

How to calculate aggregate statistics on a dataframe in R by applying conditions on time values?

I am working on climate data analysis. After loading file in R, my interest is to subset data based upon hours in a day.
for time analysis we can use $hour with the variable in which time vector has been stored if our interest is to deal with hours.
I want to subset my data for each hour in a day for 365 days and then take an average of the data at a particular hour throughout the year. Say I am interested to take values of irradiation/wind speed etc at 12:OO PM for a year and then take mean of these values to get the desired result.
I know how to subset a data frame based upon conditions. If for example my data is in a matrix called data and contains 2 rows say time and wind speed and I'm interested to subset rows of data in which irradiationb isn't zero. We can do this using the following code
my_data <- subset(data, data[,1]>0)
but now in order to deal with hours values in time column which is a variable stored in data, how can I subset values?
My data look like this:
I hope I made sense in this question.
Thanks in advance!
Here is a possible solution. You can create a hourly grouping with format(df$time,'%H'), so we obtain only the hour for each period, we can then simply group by this new column and calculate the mean for each group.
df = data.frame(time=seq(Sys.time(),Sys.time()+2*60*60*24,by='hour'),val=sample(seq(5),49,replace=T))
library(dplyr)
df %>% mutate(hour=format(df$time,'%H')) %>%
group_by(hour) %>%
summarize(mean_val = mean(val))
To subset the non-zero values first, you can do either:
df = subset(df,val!=0)
or start the dplyr chain with:
df %>% filter(df$val!=0)
Hope this helps!
df looks as follows:
time val
1 2018-01-31 12:43:33 4
2 2018-01-31 13:43:33 2
3 2018-01-31 14:43:33 2
4 2018-01-31 15:43:33 3
5 2018-01-31 16:43:33 3
6 2018-01-31 17:43:33 1
7 2018-01-31 18:43:33 2
8 2018-01-31 19:43:33 4
... ... ... ...
And the output:
# A tibble: 24 x 2
hour mean_val
<chr> <dbl>
1 00 3.50
2 01 3.50
3 02 4.00
4 03 2.50
5 04 3.00
6 05 2.00
.... ....
This assumes your time column is already of class POSIXct, otherwise you'd first have to convert it using for example as.POSIXct(x,format='%Y-%m-%d %H:%M:%S')

Count the number of previous occurrences using a time window, not a fixed window size

I have a dataset like the following, the last column is desired output.
DX_CD AID date2 <count.occurences.1000.days>
1 272.4 1649 2007-02-10 0 or N/A
2 V58.67 1649 2007-02-10 0<- (excluding the same day). OR 1
3 787.91 1649 2010-04-14 0
4 788.63 1649 2011-03-10 1
5 493.90 4193 2007-09-13 0 or N/A #new AID
6 787.20 6954 2010-02-25 0 or N/A #new AID
.....
I want to compute the column (count.occurences.1000.days) that counts the number of previous occurrences within X days (e.g. X=1000) by AID.
The first value is 0 or N/A because there is no previous record before record #1 for AID=1649. The second value is 0 because this event occurs on the same day as record #1. Third value is 0 because there are records older than 2010-04-14, but they are beyond 1000days. Fourth value is 1 because the record #3 happened within 1000 days. Same logic goes for AID=4193 and AID=6954
Can someone provide an idea, preferably vectorized?
If I understood correctly the question, this should do
First, a sample of the data
df <- data.frame(date2=days <-
seq(as.Date("2008-12-30"), as.Date("2015-01-03"), by="days"),
AID=sample(c(1649, 4193, 6954, 3466), 2196, replace=T),
count=(rep.int(1,2196)))
Now we group by the 1000 days from max to min
df$date.bin <- Hmisc::cut2(df$date2,
cuts=sort(seq(max(df$date2), length=10,by="-1000 days")))
Now we use cumsum on the grouped variables
res <-df %>% dplyr::arrange(date.bin, AID) %>% group_by(date.bin, AID) %>%
mutate(cumsum=cumsum(count))

Flowstrates and R: extracting and reshaping data in required format

I'm working on trying to transform a large dataset into the required formats for analyzing within the flowstrates package.
What I currently have is a large file (600k trips) with origin and destination points.
Format is sort of like this:
tripID Month start_pt end_pt
1 June 1 3
2 June 1 3
3 July 1 5
4 July 1 7
5 July 1 7
What I need to be able to generate is a file that has trip counts by unit time (let's say months) in a format like this:
start_pt end_pt June July August ... December
1 3 2 0 5 9
1 5 0 1 4 4
1 7 0 2 0 0
It's easy enough to pre-segment the data by month and then generate counts for each origin-destination pair, but then putting it all back together causes all sorts of problems since each of the pre-segmented chunks of data have very different sizes. So it seems that I'd need to do this for the entire dataset at once.
Are there any packages for doing this type of processing? Would it be easier to do this in something like SQL or SQLite?
Thanks in advance for any help.
You can use the reshape2 package to do this fairly easily.
If your data is dat,
library("reshape2")
dcast(dat, start_pt+end_pt~Month, value.var="tripID", fun.aggregate=length)
This gives a single entry for each start_pt/end_pt/Month combination, the value of which is how many cases had that combination (the length of tripID for that set).

Resources