Compute factor data between two data frames in R

I have not found a solution for this. I think it should be very simple, but I can't work it out right now.
I have two data frames, monthly traffic volume averages, and yearly traffic volume averages. I need to divide yearly averages by monthly averages.
ano mes dias Au_TPDM Bu_TPDM CU_TPDM CAI_TPDM CAII_TPDM TOTAL
1 2012 Ene 31 4288.323 620.5161 236.7419 4635.097 139.0645 6112.258
7 2012 Feb 29 3268.862 593.0000 246.3103 5191.069 147.9655 6267.286
13 2012 Mar 31 3667.903 624.7097 289.0323 5341.774 154.7419 6740.226
19 2012 Abr 30 4668.767 647.2333 281.2667 4930.433 158.3000 7236.300
25 2012 May 31 3198.581 598.9677 256.1290 5384.742 202.2581 6612.581
31 2012 Jun 30 3609.067 605.8667 280.3333 5309.500 178.7000 6795.000
anosDB TPDA_Au TPDA_Bu TPDA_CU TPDA_CAI TPDA_CAII TPDA_TOTAL
1 2012 4271.096 617.4809 255.1967 5119.454 163.5055 10426.73
2 2013 4685.079 638.5616 259.8877 5287.822 154.0110 11025.36
3 2014 4969.277 656.3918 266.8986 5407.800 177.0932 11477.46
4 2015 5184.953 541.8822 400.2137 4941.422 271.6877 11340.16
5 2016 5220.872 408.6967 541.0519 5584.492 182.4399 11937.55
6 2017 5298.852 408.7562 556.5644 6033.652 266.1644 12563.99
So the first row of the TPDA table should be divided by each of the first 12 rows of the TPDM table (the months of 2012), and the result should be a new data frame containing the monthly factors.
Something like:
ano mes dias FA_Au
2012 Ene 31 4271.096/4288.323
2012 Feb 29 4271.096/3268.862
(Don't need to show the computation, just the result)
I am sure that selecting the data by year would do that but haven't found the right way to do it.

Merge by year and find columns to divide by position
As already mentioned by zx8754 this can be done by merging on year and dividing the corresponding columns in base R:
merged <- merge(TPDM, TPDA, by.x = "ano", by.y = "anosDB")
FA <- cbind(merged[, 1:3], merged[, 10:15]/merged[, 4:9])
# rename columns
names(FA) <- sub("TPDA_", "FA_", names(FA))
FA
ano mes dias FA_Au FA_Bu FA_CU FA_CAI FA_CAII FA_TOTAL
1 2012 Ene 31 0.9959828 0.9951086 1.0779532 1.1044977 1.1757530 1.705872
2 2012 Feb 29 1.3066003 1.0412831 1.0360781 0.9862042 1.1050245 1.663675
3 2012 Mar 31 1.1644517 0.9884285 0.8829349 0.9583809 1.0566337 1.546941
4 2012 Abr 30 0.9148231 0.9540314 0.9073122 1.0383376 1.0328838 1.440892
5 2012 May 31 1.3353096 1.0309085 0.9963600 0.9507334 0.8084003 1.576802
6 2012 Jun 30 1.1834349 1.0191696 0.9103332 0.9642064 0.9149720 1.534471
Caveat:
This approach works as long as the positions, i.e., column numbers, of the corresponding columns are known. With the given datasets, the columns are ordered in the same way. Therefore, only an offset has to be considered to match corresponding columns.
Merge by year and find columns to divide by name
If, for some reason, the positions are not known in advance we can find corresponding columns by matching the column names.
For this, both datasets are reshaped from wide to long format. In long format, the column names (now in a column called variable) are treated as data. We can then join monthly and annual values on year and column name, divide the annual values by the corresponding monthly values, and finally reshape back to wide format:
library(data.table)
# reshape and prepare monthly data
longM <- melt(setDT(TPDM), id.vars = 1:3)
longM[, variable := stringr::str_replace(variable, "_TPDM", "")]
longM[, mes := forcats::fct_inorder(mes)]
# reshape and prepare annual data
longA <- melt(setDT(TPDA), id.vars = 1)
longA[, variable := stringr::str_replace(variable, "TPDA_", "")]
setnames(longA, "anosDB", "ano")
# join
long_FA <- longA[longM, on = .(ano, variable),
.(ano, mes, dias, variable, FA = value/i.value)]
# reshape back to wide format
dcast(long_FA, ano + mes +dias ~ paste0("FA_", variable), value.var = "FA")
ano mes dias FA_Au FA_Bu FA_CAI FA_CAII FA_CU FA_TOTAL
1: 2012 Ene 31 0.9959828 0.9951086 1.1044977 1.1757530 1.0779532 1.705872
2: 2012 Feb 29 1.3066003 1.0412831 0.9862042 1.1050245 1.0360781 1.663675
3: 2012 Mar 31 1.1644517 0.9884285 0.9583809 1.0566337 0.8829349 1.546941
4: 2012 Abr 30 0.9148231 0.9540314 1.0383376 1.0328838 0.9073122 1.440892
5: 2012 May 31 1.3353096 1.0309085 0.9507334 0.8084003 0.9963600 1.576802
6: 2012 Jun 30 1.1834349 1.0191696 0.9642064 0.9149720 0.9103332 1.534471
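The same name-matching idea can also be sketched in base R without reshaping. This is only an illustration under stated assumptions: the two small data frames below are cut-down stand-ins for TPDM and TPDA, and the helper variable classes is introduced here for the example. Note that it relies on the _TPDM and TPDA_ naming pattern, so a column such as TOTAL (which has no _TPDM suffix in the monthly table) would need to be renamed first.

```r
# Cut-down stand-ins for TPDM and TPDA (two months, two vehicle classes).
# The annual columns are deliberately in a different order than the monthly ones.
TPDM <- data.frame(ano = c(2012, 2012), mes = c("Ene", "Feb"), dias = c(31, 29),
                   Au_TPDM = c(4288.323, 3268.862), Bu_TPDM = c(620.5161, 593.0000))
TPDA <- data.frame(anosDB = c(2012, 2013),
                   TPDA_Bu = c(617.4809, 638.5616), TPDA_Au = c(4271.096, 4685.079))

merged <- merge(TPDM, TPDA, by.x = "ano", by.y = "anosDB")

# Vehicle classes present in both tables, derived from the column names
classes <- intersect(sub("_TPDM$", "", grep("_TPDM$", names(TPDM), value = TRUE)),
                     sub("^TPDA_", "", grep("^TPDA_", names(TPDA), value = TRUE)))

# Divide each annual column by the monthly column of the same class
FA <- merged[, c("ano", "mes", "dias")]
for (cl in classes) {
  FA[[paste0("FA_", cl)]] <- merged[[paste0("TPDA_", cl)]] / merged[[paste0(cl, "_TPDM")]]
}
FA
```

Because the columns are paired by name inside the loop, the order of the columns in either table does not matter.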
Data
TPDM <- read.table(text = "
i ano mes dias Au_TPDM Bu_TPDM CU_TPDM CAI_TPDM CAII_TPDM TOTAL
1 2012 Ene 31 4288.323 620.5161 236.7419 4635.097 139.0645 6112.258
7 2012 Feb 29 3268.862 593.0000 246.3103 5191.069 147.9655 6267.286
13 2012 Mar 31 3667.903 624.7097 289.0323 5341.774 154.7419 6740.226
19 2012 Abr 30 4668.767 647.2333 281.2667 4930.433 158.3000 7236.300
25 2012 May 31 3198.581 598.9677 256.1290 5384.742 202.2581 6612.581
31 2012 Jun 30 3609.067 605.8667 280.3333 5309.500 178.7000 6795.000
", header = TRUE)[, -1L]
TPDA <- read.table(text = "
i anosDB TPDA_Au TPDA_Bu TPDA_CU TPDA_CAI TPDA_CAII TPDA_TOTAL
1 2012 4271.096 617.4809 255.1967 5119.454 163.5055 10426.73
2 2013 4685.079 638.5616 259.8877 5287.822 154.0110 11025.36
3 2014 4969.277 656.3918 266.8986 5407.800 177.0932 11477.46
4 2015 5184.953 541.8822 400.2137 4941.422 271.6877 11340.16
5 2016 5220.872 408.6967 541.0519 5584.492 182.4399 11937.55
6 2017 5298.852 408.7562 556.5644 6033.652 266.1644 12563.99
", header = TRUE)[, -1L]

Related

How to rearrange daily stream discharge data into monthly format and rank the discharge values for each month using R

I have a data set of daily stream discharge values from a gauging station for approximately 50 years. The data is arranged into three columns, namely, "date", "month", "discharge".(Sample data shown here)
Date <- as.Date(c('1938-10-01','1954-10-27', '1967-06-16','1943-01-01','1945-01-14','1945-03-14','1954-05-04','1960-04-23','1960-05-09','1962-01-18','1968-12-19','1972-01-15','1977-08-15','1981-04-11','1986-06-20','1989-01-20','1992-03-29'))
Months <- c('Oct','Oct','Jun','Jan','Jan','Mar','May','Apr','May','Jan','Dec','Jan','Aug','Apr','Jun','Jan','Mar')
Dis <- c('1000','1200','400','255','450','215','360','120','145','1204','752','635','1456','154','154','1204','450')
Sampledata <- data.frame("Date"=Date,"Months"=Months,"Disch"=Dis)
print(Sampledata)
Date Months Disch
1 1938-10-01 Oct 1000
2 1954-10-27 Oct 1200
3 1967-06-16 Jun 400
4 1943-01-01 Jan 255
5 1945-01-14 Jan 450
6 1945-03-14 Mar 215
7 1954-05-04 May 360
8 1960-04-23 Apr 120
9 1960-05-09 May 145
10 1962-01-18 Jan 1204
11 1968-12-19 Dec 752
12 1972-01-15 Jan 635
13 1977-08-15 Aug 1456
14 1981-04-11 Apr 154
15 1986-06-20 Jun 154
16 1989-01-20 Jan 1204
17 1992-03-29 Mar 450
I want to calculate ranks for each month separately for all the years. For example: Calculate rank in ascending order for the month of January for 50 years. With the same rank value assigned to a duplicate discharge value. Desired output shown here:
> Date Month Disch Rank
1 1943-01-01 Jan 255 1
2 1945-01-14 Jan 450 2
3 1962-01-18 Jan 1204 4
4 1972-01-15 Jan 635 3
5 1989-01-20 Jan 1204 4
> Date Month Disch Rank
1 1945-03-14 Mar 215 1
2 1992-03-29 Mar 450 2
3 2001-03-19 Mar 450 2
Without using any packages first convert columns 2 and 3 to numeric and then use ave and rank with the indicated ties method. Finally order the result.
Note that the output shown in the question does not correspond to the input, e.g. there are three Mar rows in the output but only two such rows in the input so this will correspond to the input but will not be identical to the output shown.
Sampledata2 <- transform(Sampledata,
Disch = as.numeric(as.character(Disch)),
Months = as.numeric(format(Date, "%m")))
Rank <- function(x) rank(x, ties.method = "min")
Sampledata3 <- transform(Sampledata2,
Rank = ave(Disch, Months, FUN = Rank))
o <- with(Sampledata3, order(Months, Date))
Sampledata3[o, ]
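An option would be to group by month and use one of the dplyr ranking functions (dense_rank, min_rank or row_number, depending on how ties should be handled) to rank the discharge column. With the sample data from the question, Disch is stored as text, so it is converted to numeric first:
library(dplyr)
Sampledata %>%
mutate(Disch = as.numeric(as.character(Disch))) %>%
group_by(Months) %>%
mutate(Rank = dense_rank(Disch))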
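To see how the choice of tie handling changes the result, here is a small base R sketch. The vector disch is made up for illustration: it uses the January values from the question plus one extra value of 1500, added so that the gap left by "min" ranking becomes visible.

```r
# January discharges from the question plus one made-up larger value (1500)
disch <- c(255, 450, 1204, 635, 1204, 1500)

# "min" ranking: ties share the lowest rank, and the next rank is skipped
r_min <- rank(disch, ties.method = "min")

# dense ranking in base R: ties share a rank and no rank is skipped
r_dense <- match(disch, sort(unique(disch)))

# default "average" ranking: ties get the mean of the ranks they span
r_avg <- rank(disch, ties.method = "average")

rbind(r_min, r_dense, r_avg)
```

For the question's own January data the "min" and dense variants agree, because the tied value is also the largest one.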

How to find out how many trading days in each month in R?

I have a data frame like this, covering a time span of 10 years. It is Chinese market data, and because China observes lunar holidays, the holidays fall on different dates of the Western calendar each year.
When it is a holiday, the stock market does not open, so it is a non-trading day. Weekends are non-trading days too.
I want to find out which month of which year has the least number of trading days, and most importantly, what number is that.
There are not repeated days.
date change open high low close volume
1 1995-01-03 -1.233 637.72 647.71 630.53 639.88 234518
2 1995-01-04 2.177 641.90 655.51 638.86 653.81 422220
3 1995-01-05 -1.058 656.20 657.45 645.81 646.89 430123
4 1995-01-06 -0.948 642.75 643.89 636.33 640.76 487482
5 1995-01-09 -2.308 637.52 637.55 625.04 625.97 509851
6 1995-01-10 -2.503 616.16 617.60 607.06 610.30 606925
If there are not repeated days, you can count days per month and year by:
library(data.table)
library(lubridate)
dt <- as.data.table(dt)
dt_days <- dt[, .(count_day=.N), by=.(year(date), month(date))]
Then you only need to do this to get the min:
dt_days[count_day==min(count_day)]
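For comparison, the same count can be sketched in base R. The date vector below is made up for illustration and stands in for the date column of the question's data frame; it assumes, as above, one row per trading day with no repeats.

```r
# Made-up trading dates standing in for the question's date column
dates <- as.Date(c("1995-01-03", "1995-01-04", "1995-01-05",
                   "1995-02-06", "1995-02-07",
                   "1995-03-01"))

# Number of rows (trading days) per year-month
counts <- table(format(dates, "%Y-%m"))

# Month(s) with the fewest trading days, together with the count
least <- counts[counts == min(counts)]
least
```

Using == rather than which.min keeps every month that ties for the minimum.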
The chron and bizdays packages deal with business days, but neither actually contains a usable calendar of holidays, which limits their usefulness.
We will use chron below assuming you have defined the .Holidays vector of dates that are holidays. (If you run the code below without doing that only weekdays will be regarded as business days as the default .Holidays vector supplied by chron has very few dates in it.) DF has 120 rows (one row for each year/month) and the last line subsets that to just the month in each year having least business days.
library(chron)
library(zoo)
st <- as.yearmon("2001-01")
en <- as.yearmon("2010-12")
ym <- seq(st, en, 1/12) # sequence of year/months of interest
# no of business days in each yearmonth
busdays <- sapply(ym, function(x) {
s <- seq(as.Date(x), as.Date(x, frac = 1), "day")
sum(!is.weekend(s) & !is.holiday(s))
})
# data frame with one row per year/month
yr <- as.integer(ym)
DF <- data.frame(year = yr, month = cycle(ym), yearmon = ym, busdays)
# data frame with one row per year
wx.min <- ave(busdays, yr, FUN = function(x) which.min(x) == seq_along(x))
DF[wx.min == 1, ]
giving:
year month yearmon busdays
2 2001 2 Feb 2001 20
14 2002 2 Feb 2002 20
26 2003 2 Feb 2003 20
38 2004 2 Feb 2004 20
50 2005 2 Feb 2005 20
62 2006 2 Feb 2006 20
74 2007 2 Feb 2007 20
95 2008 11 Nov 2008 20
98 2009 2 Feb 2009 20
110 2010 2 Feb 2010 20
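To illustrate what supplying a holiday vector changes, here is a base R sketch for a single month. The holidays vector is hypothetical and only for illustration; a real analysis would need the actual exchange holiday calendar.

```r
# All calendar days of February 2002
month_start <- as.Date("2002-02-01")
month_end   <- seq(month_start, by = "month", length.out = 2)[2] - 1
days <- seq(month_start, month_end, by = "day")

# Hypothetical holiday dates (assumption, not an official calendar)
holidays <- as.Date(c("2002-02-12", "2002-02-13", "2002-02-14"))

# POSIXlt weekday numbers: 0 = Sunday, 6 = Saturday
wday <- as.POSIXlt(days)$wday
is_weekend <- wday %in% c(0, 6)

weekdays_only <- sum(!is_weekend)                        # weekends removed
busdays       <- sum(!is_weekend & !(days %in% holidays)) # holidays removed too
c(weekdays_only = weekdays_only, busdays = busdays)
```

Without the holiday vector the month has 20 weekdays, which matches the table above; with the three hypothetical holidays it drops to 17.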

How to query NOAA for historical daily temperature averages using rnoaa?

I'm trying to find the historical average temperature between a range of dates using NOAA data and comparing to the long term average temperatures.
I'm using the rnoaa package and have hit a bit of a snag. For long term averages, I have been successful using the following syntax:
library('rnoaa')
start_date = "2010-01-15"
end_date = "2010-11-14"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_DLY', stationid=paste0('GHCND:',station_id),
datatypeid='dly-tavg-normal',
startdate = start_date, enddate = end_date,limit=365)
This lets me parse weather_data$data for the long term average temperatures for that given station between January 15th and November 14th.
However, I can't seem to find the right dataset or datatype for historical average temperatures. I'd like to get the same data as the code above except with the actual daily average temperatures for those days. Any idea how to query this? I've been at it for a few hours and have had no luck.
Something I tried was the following:
weather_data <- ncdc(datasetid='GHCND', stationid=paste0('GHCND:',station_id),
startdate = start_date, enddate = end_date,limit=365)
uniq_d_types = unique(weather_data$data$datatype)
View(uniq_d_types)
This let me see the unique data types in the GHCND dataset but none of the data types seemed to be daily average temperatures. Any thoughts?
In order to obtain average daily actual temperatures from the NOAA data using the rnoaa package, one must use the hourly data and aggregate it by day. Hourly NOAA data is in the NORMAL_HLY data set, and the required data type is HLY-TEMP-NORMAL.
library('rnoaa')
library(lubridate)
options(noaakey = "obtain key from NOAA website")
start_date = "2010-01-15"
end_date = "2010-01-31"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_HLY', stationid=paste0('GHCND:',station_id),
datatypeid = "HLY-TEMP-NORMAL",
startdate = start_date, enddate = end_date,limit=500)
data <- weather_data$data
data$year <- year(data$date)
data$month <- month(data$date)
data$day <- day(data$date)
# summarize to average daily temps
aggregate(value ~ year + month + day,mean,data = data)
...and the output:
> aggregate(value ~ year + month + day,mean,data = data)
year month day value
1 2010 1 15 323.5417
2 2010 1 16 322.8750
3 2010 1 17 323.4167
4 2010 1 18 323.7500
5 2010 1 19 323.2083
6 2010 1 20 321.0833
7 2010 1 21 318.4167
8 2010 1 22 317.6667
9 2010 1 23 319.0000
10 2010 1 24 321.0833
11 2010 1 25 323.5417
12 2010 1 26 326.0833
13 2010 1 27 328.4167
14 2010 1 28 330.9583
15 2010 1 29 333.2917
16 2010 1 30 335.7917
17 2010 1 31 308.0000
>
Note that temperatures are stored in tenths of degrees in this data set, so for the period between January 15th and 31st 2010, the average daily temperatures at the Dulles International Airport weather station were between 30.8 and 33.6 degrees.
Also note that to calculate the average by station and run across multiple weather stations, simply add station to the formula passed to aggregate().
> # summarize to average daily temps by station
> aggregate(value ~ station + year + month + day,mean,data = data)
station year month day value
1 GHCND:USW00093738 2010 1 15 323.5417
2 GHCND:USW00093738 2010 1 16 322.8750
3 GHCND:USW00093738 2010 1 17 323.4167
4 GHCND:USW00093738 2010 1 18 323.7500
5 GHCND:USW00093738 2010 1 19 323.2083
6 GHCND:USW00093738 2010 1 20 321.0833
7 GHCND:USW00093738 2010 1 21 318.4167
8 GHCND:USW00093738 2010 1 22 317.6667
9 GHCND:USW00093738 2010 1 23 319.0000
10 GHCND:USW00093738 2010 1 24 321.0833
11 GHCND:USW00093738 2010 1 25 323.5417
12 GHCND:USW00093738 2010 1 26 326.0833
13 GHCND:USW00093738 2010 1 27 328.4167
14 GHCND:USW00093738 2010 1 28 330.9583
15 GHCND:USW00093738 2010 1 29 333.2917
16 GHCND:USW00093738 2010 1 30 335.7917
17 GHCND:USW00093738 2010 1 31 308.0000
>
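The conversion from tenths of degrees mentioned above can be folded into the aggregation. The following is a sketch on made-up data: the toy data frame only mimics the date and value columns of weather_data$data, and it assumes the date column holds ISO-style timestamp strings such as 2010-01-15T00:00:00.

```r
# Toy stand-in for weather_data$data (made-up values, in tenths of degrees)
data <- data.frame(
  date  = c("2010-01-15T00:00:00", "2010-01-15T01:00:00", "2010-01-16T00:00:00"),
  value = c(320, 330, 310)
)

# Keep only the calendar day part of the timestamp
data$day <- as.Date(substr(data$date, 1, 10))

# Average per day, then convert tenths of degrees to degrees
daily <- aggregate(value ~ day, data = data, FUN = mean)
daily$degrees <- daily$value / 10
daily
```

Grouping on a single Date column also keeps the rows in calendar order, which avoids the separate year, month and day columns.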
The answer is to grab historical (meaning actual, on the day specified-- not long term average) weather data from the NOAA's ISD database. USAF and WBAN values can be found by looking through the isd-history.csv file found here:
ftp://ftp.ncdc.noaa.gov/pub/data/noaa
Here's an example query.
out <- isd(usaf='724030', wban = '93738', year=2018)
This will grab a year's worth of roughly hourly weather data from ISD. You can then parse and process this data however you see fit (e.g. to compute daily average temperatures, as I did).

R Studio: look up a value in table(both direction V&H), then use as a variable in loop

I am dealing with a dataset ("IndexTable") that has 3 million+ observations. The first 6 observations are shown below:
Identity gender type amount Year Month
1 65 F W 31.88 1987 Jan
2 23 M P 29.21 1985 Mar
3 45 F W 44.70 1987 Jan
4 47 F W 72.64 1987 Jan
5 56 M P 28.92 1986 Jul
6 09 F W 34.32 1990 Jan
and the index table ("index") from which the value will be searched (part of the table):
year average Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 1950 32.84210 33.19118 33.10321 33.01572 32.89977 32.81334 32.98665 32.98665 33.10321 32.89977 32.55677 32.41595 32.24857
2 1951 30.09866 31.94615 31.64936 31.43694 30.94371 30.19568 30.09866 29.64623 29.50617 29.29854 29.09382 28.98131 28.78098
3 1952 27.56470 28.28139 28.25313 28.11271 27.67259 27.67259 27.21981 27.24604 27.40444 27.45766 27.21981 27.24604 27.06353
4 1953 26.73099 27.08945 27.01183 26.83243 26.58025 26.68055 26.53038 26.53038 26.70575 26.75628 26.75628 26.68055 26.78162
5 1954 26.25941 26.73099 26.78162 26.53038 26.43120 26.50552 26.35730 25.92244 26.08984 26.13807 26.01783 25.89871 25.75718
6 1955 25.11668 25.66369 25.66369 25.66369 25.52472 25.57087 25.04994 24.96151 25.13901 24.98356 24.72149 24.33854 24.33854
For each observation in "IndexTable", I would like to find the value in "index" that matches the Year and Month, then multiply its amount by that value to get the adjusted amount.
Thanks in advance
Using the dplyr and tidyr packages:
library(dplyr)
library(tidyr)
index_long <- index %>%
gather(Month, multiplier, Jan:Dec) %>%
select(-average)
left_join(IndexTable, index_long, by = c("Year" = "year", "Month" = "Month")) %>%
mutate(adjusted_amount = amount*multiplier)
First, I gather the month columns into a single Month column, with the values in a column called multiplier.
I drop the average column because it doesn't need to be joined to the other table. Then a left join attaches the multiplier to each row of IndexTable that has a matching year and month combination; rows without a match get NA.
Finally, I use the multiplier to create the new column adjusted_amount.
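As an alternative without a join, the two-way lookup (row by year, column by month) can be sketched in base R with matrix indexing. The two small data frames below are made-up stand-ins for index and IndexTable, and the helper names month_cols, lookup, row_i and col_i are introduced only for this example.

```r
# Made-up stand-ins for the question's tables (two years, two months)
index <- data.frame(year = c(1950, 1951), average = c(32.84, 30.10),
                    Jan = c(33.19, 31.95), Feb = c(33.10, 31.65))
IndexTable <- data.frame(amount = c(10, 20, 30),
                         Year   = c(1951, 1950, 1951),
                         Month  = c("Jan", "Feb", "Feb"))

# Month columns as a numeric matrix: rows are years, columns are months
month_cols <- setdiff(names(index), c("year", "average"))
lookup <- as.matrix(index[, month_cols])

# Row and column position of every observation in the lookup matrix
row_i <- match(IndexTable$Year, index$year)
col_i <- match(as.character(IndexTable$Month), month_cols)

# A two-column index matrix picks one cell per observation
IndexTable$multiplier <- lookup[cbind(row_i, col_i)]
IndexTable$adjusted_amount <- IndexTable$amount * IndexTable$multiplier
IndexTable
```

This is vectorised, so no loop over the observations is needed; a year or month that is missing from the index table yields NA for that row.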

Testing whether n% of data values exist in a variable grouped by posix date

I have a data frame that has hourly observational climate data over multiple years. I have included a dummy data frame below that will hopefully illustrate my question.
dateTime <- seq(as.POSIXct("2012-01-01"),
as.POSIXct("2012-12-31"),
by=(60*60))
WS <- sample(0:20,8761,rep=TRUE)
WD <- sample(0:390,8761,rep=TRUE)
Temp <- sample(0:40,8761,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I need to group by year (or in this example, by month) to find if df$WS has 75% or more of valid data for that month. My filtering criteria is NA as 0 is still a valid observation. I have real NAs as it is observational climate data.
I have tried dplyr piping with the %>% operator to filter by a new column "Month", and I have also reviewed several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put something in a longer script that works in a looping function that will go through all my stations and all the years in each station to produce a wind rose if this criteria is met for that year / station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this. This one appears quite instructive.
First create a new variable which will denote month (and account for year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
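The percentages above still have to be compared with the 75% criterion from the question. A minimal sketch of that last step, on a tiny made-up data frame (four hourly readings in each of two months) rather than the dummy data above:

```r
# Tiny made-up example: four readings per month, NA marks a missing value
df <- data.frame(
  dateTime = as.POSIXct(c("2012-01-01 00:00", "2012-01-01 01:00",
                          "2012-01-01 02:00", "2012-01-01 03:00",
                          "2012-02-01 00:00", "2012-02-01 01:00",
                          "2012-02-01 02:00", "2012-02-01 03:00"), tz = "UTC"),
  WS = c(3, 0, NA, 7, NA, NA, 5, 2)
)
df$monthyear <- format(df$dateTime, "%Y-%m")

# Share of non-missing WS values per month (0 counts as a valid reading)
valid_share <- tapply(!is.na(df$WS), df$monthyear, mean)

# Months that meet the 75% criterion
keep <- names(valid_share)[valid_share >= 0.75]
keep
```

The vector keep can then be used to subset the data before plotting, for example with df[df$monthyear %in% keep, ].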
Here is a method using dplyr. It will work even if you have missing data.
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>% mutate(Month=format(dateTime,"%Y-%m")) %>%
group_by(Month) %>%
summarise(No.Obs=sum(!is.na(WS)),
Max.Obs=24*days_in_month(as.Date(paste0(first(Month),"-01")))) %>%
mutate(Obs.Rate=No.Obs/Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237
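The Max.Obs column is what makes this approach robust to missing rows: the observed count is compared with the number of hours the month should contain, not with the number of rows that happen to be present. A base R sketch of that calculation for February 2012, using the No.Obs value from the table above:

```r
# Expected number of hourly observations in February 2012
month_start <- as.Date("2012-02-01")
next_month  <- seq(month_start, by = "month", length.out = 2)[2]
expected    <- 24 * as.integer(next_month - month_start)

# Observed count for 2012-02, taken from the No.Obs column above
observed <- 545

obs_rate <- observed / expected
meets_criterion <- obs_rate >= 0.75
c(expected = expected, obs_rate = obs_rate, meets_criterion = meets_criterion)
```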
