R: precipitation data seasonal (DJFM) sums for each station

I have a data.frame "n_com" with columns for the year ("Jahr", 1951-2010), the month ("Monat": 1, 2, 3, 12), and 81 further value columns for the monthly precipitation sums of 81 weather stations:
  Jahr Monat 12_NS_Monat 13_NS_Monat 14_NS_Monat 15_NS_Monat 16_NS_Monat
1 1951     1         397        2045        1447        2666         236
2 1951     2         528        1043         464        1397         202
3 1951     3         819         480         953        1634         665
4 1951    12         363         252         881         610         350
5 1952     1         391         530         557        1321         339
6 1952     2         683         684         920        1125         805
Now I need the seasonal sums per year for the months December, January, February and March (DJFM) for each station. Each seasonal sum should take its December value from the previous year, while the other months come from the current year (e.g. the seasonal sum for 1956 includes the December data of 1955 and the January-March data of 1956).
Finally, I want a data.frame with the columns "year", "station 1", "station 2" and so on.
It seems that the function "dm2seasonal" from the package "hydroTSM" is the right one to create seasonal sums. My problem is that "hydroTSM" needs the data.frame in a special (long) format, but my data.frame is in wide format. Can anyone help me reformat my data for "hydroTSM", or suggest another way to create the seasonal sums?
greetz from Germany

More a hack than a solution, but you could probably just add 1 to the 'year' column for all rows with month == 12:
n_com$yeartemp <- n_com$year
n_com$yeartemp[n_com$month == 12] <- n_com$year[n_com$month == 12] + 1
To change column names, see `names()`.
Then, to change to long format, you can use `melt` from the reshape2 package, using yeartemp as the id variable.
Hth.
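Putting the hack together in base R (a minimal sketch with a single hypothetical station column `st12`; with your real data you would pass all 81 station columns to `aggregate()`):

```r
# Toy version of n_com with one hypothetical station column
n_com <- data.frame(
  year  = c(1951, 1951, 1951, 1951, 1952, 1952, 1952, 1952),
  month = c(1, 2, 3, 12, 1, 2, 3, 12),
  st12  = c(397, 528, 819, 363, 391, 683, 819, 100)
)
# Count December towards the following year's season
n_com$yeartemp <- n_com$year
n_com$yeartemp[n_com$month == 12] <- n_com$year[n_com$month == 12] + 1
# DJFM sum per (shifted) year; drop = FALSE keeps the data.frame
# shape, so the same call works unchanged for many station columns
seasonal <- aggregate(n_com[, "st12", drop = FALSE],
                      by = list(year = n_com$yeartemp),
                      FUN = sum)
```

Note that the first season (which has no preceding December) and the last one (December only) are incomplete, so you may want to drop them.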

Related

Alter variable to lag by year

I have a data set in which I need to test a variable for autocorrelation.
To do this, I want to first lag it by one period.
However, as the data is on US elections, it is only available in two-year intervals, i.e. 1968, 1970, 1972, etc.
As far as I know, I'll need to somehow alter the year variable so that it runs annually in some way, so that I can lag the variable of interest by one period/year.
I assume that dplyr is helpful in some way, but I am not sure how.
Yes, dplyr has a helpful lag function that works well in these cases. Since you didn't provide sample data or the specific test that you want to perform, here is a simple example showing an approach you might take:
> df <- data.frame(year = seq(1968, 1978, 2), votes = sample(1000, 6))
> df
year votes
1 1968 565
2 1970 703
3 1972 761
4 1974 108
5 1976 107
6 1978 449
> dplyr::mutate(df, vote_diff = votes - dplyr::lag(votes))
year votes vote_diff
1 1968 565 NA
2 1970 703 138
3 1972 761 58
4 1974 108 -653
5 1976 107 -1
6 1978 449 342
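If you'd rather avoid the dplyr dependency, the same one-period lag can be built in base R by shifting the vector by hand (using the hypothetical vote counts from the transcript above):

```r
df <- data.frame(year = seq(1968, 1978, 2),
                 votes = c(565, 703, 761, 108, 107, 449))
# shift votes down one row; the first period has no predecessor
df$votes_lag <- c(NA, head(df$votes, -1))
df$vote_diff <- df$votes - df$votes_lag
```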

How to find correlation in a data set

I wish to find the correlation between trip duration and age in the data set below. I am applying the function cor(age, df$tripduration), but it gives me NA as output. Could you please let me know how to make the correlation work? I computed "age" with:
age <- 2017 - as.numeric(df$birth.year)
and the trip duration (in seconds) is df$tripduration.
Below is the data; in the gender column, 1 means male and 2 means female.
tripduration birth year gender
439 1980 1
186 1984 1
442 1969 1
170 1986 1
189 1990 1
494 1984 1
152 1972 1
537 1994 1
509 1994 1
157 1985 2
1080 1976 2
239 1976 2
344 1992 2
I think you are trying to subtract a data frame from a number, which would not work. This worked for me:
birth <- df$birth.year
year <- 2017
age <- year - birth
cor(df$tripduration, age)
[1] 0.08366848
# To check the coefficient against the raw birth year (opposite sign, same magnitude)
cor(df$tripduration, df$birth.year)
[1] -0.08366848
By the way, please format the question with easily reproducible data that people can copy and paste straight into R. That actually helps you get an answer.
Based on the OP's comment, here is a new suggestion: try deleting the rows with NAs before computing the correlation.
df <- df[complete.cases(df), ]
age <- 2017 - as.numeric(df$birth.year)
cor(age, df$tripduration)
[1] 0.1726607
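Alternatively, cor() can skip incomplete pairs itself through its `use` argument, without deleting rows beforehand; a small made-up pair of vectors with NAs illustrates this:

```r
tripduration <- c(439, 186, 442, NA, 189)
birth.year   <- c(1980, 1984, NA, 1986, 1990)
age <- 2017 - birth.year
# use = "complete.obs" drops every pair where either value is NA
r <- cor(age, tripduration, use = "complete.obs")
```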

Testing whether n% of data values exist in a variable grouped by posix date

I have a data frame that has hourly observational climate data over multiple years, I have included a dummy data frame below that will hopefully illustrate my QU.
dateTime <- seq(as.POSIXct("2012-01-01"),
as.POSIXct("2012-12-31"),
by=(60*60))
WS <- sample(0:20,8761,rep=TRUE)
WD <- sample(0:390,8761,rep=TRUE)
Temp <- sample(0:40,8761,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[df$WS > 15] <- NA
I need to group by year (or, in this example, by month) to find whether df$WS has 75% or more valid data for that month. I filter on NA because 0 is still a valid observation; the NAs are real, as this is observational climate data.
I have tried dplyr piping with %>% to filter by a new "Month" column, and I have reviewed several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put something in a longer script that works in a looping function that will go through all my stations and all the years in each station to produce a wind rose if this criteria is met for that year / station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this; this one is quite instructive.
First create a new variable denoting the month (and the year, if you have more than one year of data). Split on this variable and count the NAs, then divide by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
Here is a method using dplyr. It will work even if you have missing data.
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>% mutate(Month=format(dateTime,"%Y-%m")) %>%
group_by(Month) %>%
summarise(No.Obs=sum(!is.na(WS)),
Max.Obs=24*days_in_month(as.Date(paste0(first(Month),"-01")))) %>%
mutate(Obs.Rate=No.Obs/Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237
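Whichever of these approaches you use, applying the 75% criterion is then a single extra step, e.g. in base R (shown with a small made-up summary table standing in for df2 above):

```r
df2 <- data.frame(Month    = c("2012-01", "2012-02", "2012-03"),
                  Obs.Rate = c(0.80, 0.70, 0.76),
                  stringsAsFactors = FALSE)
# months meeting the 75%-valid criterion
valid_months <- subset(df2, Obs.Rate >= 0.75)$Month
```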

How can I use if-else statements (or a better way) to assign absolute values to days in a year (using R)?

I am working with daily temperature data that I have already run through R to pull out the first and last days of each year that are above a calculated threshold unique to each city dataset.
Data is brought into R in a .csv file with columns "YR", "number_of_days", "start_date", and "end_date". I only care about the "start_date" and "end_date" columns for this calculation.
For example, if I am looking at heat extremes, the first day of the year to have a temperature above 33 degrees C is May 1st and the last day of the year to have a temperature above 33 degrees C is October 20th. I do not care what the temperatures of the days in between are, just the start and end dates.
I want to convert the "May 1st" to an absolute number to be compared to other years. Below is sample data from BakersfieldTMAXextremes data.frame:
YR number_of_days start_date end_date
1900 27 5/22/00 10/18/00
1901 42 6/29/01 10/22/01
1902 76 6/7/02 9/23/02
1903 97 5/6/03 10/18/03
1904 98 4/8/04 9/15/04
1905 115 5/11/05 10/10/05
1906 90 4/20/06 10/27/06
1907 97 5/27/07 10/10/07
1908 107 4/11/08 9/16/08
1909 106 5/2/09 9/23/09
1910 89 4/18/10 10/15/10
1911 54 5/5/11 9/4/11
1912 51 5/31/12 10/18/12
1913 100 4/25/13 10/18/13
1914 78 4/19/14 10/14/14
1915 84 5/27/15 10/8/15
1916 73 5/5/16 9/28/16
1917 99 6/2/17 10/8/17
1918 81 6/2/18 10/13/18
1919 85 5/28/19 9/26/19
1920 61 5/17/20 9/30/20
1921 85 6/5/21 11/3/21
1922 91 5/14/22 9/25/22
1923 67 5/9/23 9/17/23
1924 91 5/8/24 9/29/24
1925 70 5/3/25 9/24/25
1926 84 4/25/26 9/9/26
1927 77 4/25/27 10/20/27
1928 88 5/5/28 10/9/28
1929 91 5/22/29 10/23/29
1930 86 5/23/30 10/7/30
1931 91 4/20/31 9/26/31
1932 82 5/11/32 10/5/32
1933 93 5/27/33 10/7/33
1934 101 4/20/34 10/12/34
1935 93 5/21/35 10/11/35
1936 85 5/10/36 9/26/36
For example, I would like to see the first start date as 141 (because it is the 141st day out of the 365 days in a year). At this point I couldn't care less about leap years, so we'll pretend they don't exist. I want the output in a table with the "YR", "start_date", and "end_date" (except with absolute values). For the first one, I would want "1900", "141" and "291" as the output.
I've tried to do this with an if-else statement, but it seems cumbersome to do for 365 days of the year (also I am fairly new to R and only have experience doing this in MATLAB). Any help is greatly appreciated!
Based on this answer, you can convert the dates to day-of-year values with lubridate. Since your dates are in month/day/year format, parse them with mdy() first:
library(lubridate)
df$start_date <- yday(mdy(df$start_date))
df$end_date <- yday(mdy(df$end_date))
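If you prefer to stay in base R, the "%j" format code gives the day of year directly. Here is a self-contained sketch using the first two rows of the table above; note that two-digit years parse into the 2000s here, so leap years such as 2000 shift the count by one day, which the question says is acceptable to ignore:

```r
df <- data.frame(YR = c(1900, 1901),
                 start_date = c("5/22/00", "6/29/01"),
                 end_date   = c("10/18/00", "10/22/01"),
                 stringsAsFactors = FALSE)
# parse m/d/y, then format as day of year ("%j")
to_doy <- function(x) {
  as.integer(format(as.Date(x, format = "%m/%d/%y"), "%j"))
}
df$start_date <- to_doy(df$start_date)
df$end_date   <- to_doy(df$end_date)
```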

Coding for the onset of an event in panel data in R

I was wondering if you could help me devise an effortless way to code this country-year event data that I'm using.
In the example below, each row corresponds with an ongoing event (that I will eventually fold into a broader panel data set, which is why it looks bare now). So, for example, country 29 had the onset of an event in 1920, which continued (and ended) in 1921. Country 23 had the onset of the event in 1921, which lasted until 1923. Country 35 had the onset of an event that occurred in 1921 and only in 1921, et cetera.
country year
29 1920
29 1921
23 1921
23 1922
23 1923
35 1921
64 1926
135 1928
135 1929
135 1930
135 1931
135 1932
135 1933
135 1934
120 1930
70 1932
What I want to do is create "onset" and "ongoing" variables. The "ongoing" variable in this sample data frame would be easy. Basically: Data$ongoing <- 1
I'm more interested in creating the "onset" variable. It would be coded as 1 if it marks the onset of the event for the given country. Basically, I want to create a variable that looks like this, given this example data.
country year onset
29 1920 1
29 1921 0
23 1921 1
23 1922 0
23 1923 0
35 1921 1
64 1926 1
135 1928 1
135 1929 0
135 1930 0
135 1931 0
135 1932 0
135 1933 0
135 1934 0
120 1930 1
70 1932 1
If you can think of effortless ways to do this in R (that minimizes the chances of human error when working with it in a spreadsheet program like Excel), I'd appreciate it. I did see this related question, but this person's data set doesn't look like mine and it may require a different approach.
Thanks. Reproducible code for this example data is below.
country <- c(29, 29, 23, 23, 23, 35, 64, 135, 135, 135, 135, 135, 135, 135, 120, 70)
year <- c(1920, 1921, 1921, 1922, 1923, 1921, 1926, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1930, 1932)
Data <- data.frame(country = country, year = year)
summary(Data)
Data
This should work, even with multiple onsets per country:
Data$onset <- with(Data, ave(year, country, FUN = function(x)
as.integer(c(TRUE, tail(x, -1L) != head(x, -1L) + 1L))))
You could also do this with data.table, which can be very fast on a larger dataset. Note that, unlike the solution above, it assumes a single spell per country and marks only the first year:
library(data.table)
setDT(Data)[, onset := as.integer(year == min(year)), country]
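The consecutive-year check in the ave() answer above can also be written with diff(), which some find easier to read (shown on a shortened version of the sample data):

```r
country <- c(29, 29, 23, 23, 23, 35, 64, 135, 135, 120)
year    <- c(1920, 1921, 1921, 1922, 1923, 1921, 1926, 1928, 1929, 1930)
Data <- data.frame(country, year)
# a row is an onset if it is the first row for its country,
# or if its year does not directly follow the previous one
Data$onset <- with(Data, ave(year, country, FUN = function(x)
  as.integer(c(TRUE, diff(x) != 1))))
```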
