Alter variable to lag by year - r

I have a data set I need to test for autocorrelation in a variable.
To do this, I want to first lag it by one period, to test that autocorrelation.
However, as the data is on US elections, it is only available in two-year intervals, i.e. 1968, 1970, 1972, etc.
As far as I know, I'll need to somehow alter the year variable so that it runs annually, so that I can lag the variable of interest by one period/year.
I assume that dplyr is helpful here, but I am not sure how.

Yes, dplyr has a helpful lag() function that works well in these cases. Note that lag() shifts by row position, so each election is already one period; there is no need to alter the year variable. Since you didn't provide sample data or the specific test that you want to perform, here is a simple example showing an approach you might take:
> df <- data.frame(year = seq(1968, 1978, 2), votes = sample(1000, 6))
> df
year votes
1 1968 565
2 1970 703
3 1972 761
4 1974 108
5 1976 107
6 1978 449
> dplyr::mutate(df, vote_diff = votes - dplyr::lag(votes))
year votes vote_diff
1 1968 565 NA
2 1970 703 138
3 1972 761 58
4 1974 108 -653
5 1976 107 -1
6 1978 449 342
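From there, the lagged column feeds directly into whatever test you choose. Here is a minimal sketch of a lag-1 check, assuming a simple Pearson correlation is enough for your purposes (acf() gives a closely related estimate straight from the series):
# Correlate each value with the previous election's value; the first row's
# lag is NA, so tell cor() to drop incomplete pairs.
df_lagged <- dplyr::mutate(df, votes_lag = dplyr::lag(votes))
cor(df_lagged$votes, df_lagged$votes_lag, use = "complete.obs")
# Base-R alternative computed directly from the series:
acf(df$votes, lag.max = 1, plot = FALSE)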

Related

How do I calculate days since value exceeded in R?

I'm working with daily discharge data over 30 years. Discharge is measured in cfs, and my dataset looks like this:
date ddmm year cfs
1/04/1986 1-Apr 1986 2560
2/04/1986 2-Apr 1986 3100
3/04/1986 3-Apr 1986 2780
4/04/1986 4-Apr 1986 2640
...
17/01/1987 17-Jan 1987 1130
18/01/1987 18-Jan 1987 1190
19/01/1987 19-Jan 1987 1100
20/01/1987 20-Jan 1987 864
21/01/1987 21-Jan 1987 895
22/01/1987 22-Jan 1987 962
23/01/1987 23-Jan 1987 998
24/01/1987 24-Jan 1987 1140
I'm trying to calculate, for each date, the number of days since the discharge last exceeded 1000 cfs, and put it in a new column ("DaysGreater1000") that will be used in a subsequent analysis.
In this example, DaysGreater1000 would be 0 for all of the dates in April 1986. DaysGreater1000 would be 1 on 20 Jan, 2 on 21 Jan, 3 on 22 Jan, etc.
Do I first need to create a column (event) of binary data for when the threshold is exceeded? I have been reading several old questions, and it looks like I need to use ifelse, but I can't figure out how to make the new column and then take the next step of calculating the number of preceding days.
Here are the questions that I have been examining:
Calculate days since last event in R
Calculate elapsed time since last event
... And this is the code that looks promising, but I can't quite put it all together!
df %>%
  mutate(event = as.logical(event),
         last_event = if_else(event, true = date, false = NA_integer_)) %>%
  fill(last_event) %>%
  mutate(event_age = date - last_event)
summary(df)
I'm sorry if I'm not being very eloquent! I'm feeling a bit rusty as I haven't used R in a while.
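Adapting the pipeline from those linked questions, a minimal sketch might look like the following (assuming df has the date and cfs columns shown; note that rows before the first exceedance come out as NA rather than 0):
library(dplyr)
library(tidyr)
df %>%
  mutate(date = as.Date(date, format = "%d/%m/%Y"),
         # record the date whenever discharge exceeds the threshold
         last_over = if_else(cfs > 1000, date, as.Date(NA))) %>%
  fill(last_over) %>%   # carry the most recent exceedance date forward
  mutate(DaysGreater1000 = as.integer(date - last_over))
No separate binary column is needed: the cfs > 1000 comparison plays that role inline, and fill() replaces the ifelse step you were stuck on.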

How to find correlation in a data set

I wish to find the correlation between trip duration and age in the data set below. I am applying the function cor(age, df$tripduration). However, it is giving me the output NA. Could you please let me know how to work out the correlation? I found the "age" with the following syntax:
age <- (2017-as.numeric(df$birth.year))
and tripduration(seconds) as df$tripduration.
Below is the data. The number 1 in gender means male and 2 means female.
tripduration birth year gender
439 1980 1
186 1984 1
442 1969 1
170 1986 1
189 1990 1
494 1984 1
152 1972 1
537 1994 1
509 1994 1
157 1985 2
1080 1976 2
239 1976 2
344 1992 2
I think you are trying to subtract a data frame from a number, which would not work. This worked for me:
birth <- df$birth.year
year <- 2017
age <- year - birth
cor(df$tripduration, age)
[1] 0.08366848
# Sanity check: correlating with birth.year directly flips the sign,
# since age = 2017 - birth.year
cor(df$tripduration, df$birth.year)
[1] -0.08366848
By the way, please format the question with easily reproducible data that people can copy and paste straight into R. This actually helps you get an answer.
Based on the OP's comment, here is a new suggestion. Try deleting the rows with NA before performing a correlation test.
df <- df[complete.cases(df), ]
age <- (2017-as.numeric(df$birth.year))
cor(age, df$tripduration)
[1] 0.1726607
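As an alternative to dropping rows beforehand, the use argument of cor() handles missing values itself; this is standard base R, just another option:
# Only complete (non-NA) pairs enter the calculation
age <- 2017 - as.numeric(df$birth.year)
cor(age, df$tripduration, use = "complete.obs")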

Testing whether n% of data values exist in a variable grouped by POSIX date

I have a data frame with hourly observational climate data over multiple years. I have included a dummy data frame below that will hopefully illustrate my question.
dateTime <- seq(as.POSIXct("2012-01-01"),
                as.POSIXct("2012-12-31"),
                by = (60*60))
WS <- sample(0:20,8761,rep=TRUE)
WD <- sample(0:390,8761,rep=TRUE)
Temp <- sample(0:40,8761,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I need to group by year (or in this example, by month) to find whether df$WS has 75% or more valid data for that month. My filtering criterion is NA, as 0 is still a valid observation; the data contains real NAs since it is observational climate data.
I have tried dplyr piping with %>% to filter by a new column "Month", as well as reviewing several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put this into a longer script with a loop that goes through all my stations and all the years at each station, producing a wind rose whenever this criterion is met for that year/station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this. This one appears quite instructive.
First create a new variable which will denote month (and account for year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
Here is a method using dplyr. It works even if entire rows are missing, because the maximum possible number of observations is computed from the calendar rather than from row counts.
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>% mutate(Month=format(dateTime,"%Y-%m")) %>%
  group_by(Month) %>%
  summarise(No.Obs=sum(!is.na(WS)),
            Max.Obs=24*days_in_month(as.Date(paste0(first(Month),"-01")))) %>%
  mutate(Obs.Rate=No.Obs/Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237
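Either result feeds the 75% screen described in the question. A small sketch on top of df2 (using the column names computed above):
# Flag months meeting the criterion, or keep only those months
df2 %>% mutate(meets_75 = Obs.Rate >= 0.75)
df2 %>% filter(Obs.Rate >= 0.75)
Inside a station loop you could then test the meets_75 flag for a given year/station before drawing the wind rose.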

R: precipitation data seasonal (DJFM) sums for each station

I have a data.frame "n_com", which includes columns for "year" (1951-2010) and "month" (1, 2, 3, 12), plus 81 further value columns holding the monthly precipitation sums of 81 weather stations.
Jahr Monat 12_NS_Monat 13_NS_Monat 14_NS_Monat 15_NS_Monat 16_NS_Monat
1 1951 1 397 2045 1447 2666 236
2 1951 2 528 1043 464 1397 202
3 1951 3 819 480 953 1634 665
4 1951 12 363 252 881 610 350
5 1952 1 391 530 557 1321 339
6 1952 2 683 684 920 1125 805
Now, I need the seasonal sums for each year for the months December, January, February and March (DJFM) for each station. The seasonal sums should use the December of the previous year, while the data for the other months should come from the current year.
(e.g. the seasonal sum for 1956 includes December data from 1955, while the other months are from 1956)
Finally, I want a data.frame with the following columns: "year", "station 1", "station 2" and so on.
It seems that the function "dm2seasonal" of the package "hydroTSM" is the right one for creating seasonal sums. My problem is that "hydroTSM" needs the data.frame in long format, but mine is in wide format. Can anyone help me reformat my data for "hydroTSM", or does anyone have another solution for creating seasonal sums?
Greetings from Germany
More a hack than a solution, but you could probably just add 1 to the 'year' column for all rows with month = 12:
n_com$yeartemp = n_com$year
n_com$yeartemp[n_com$month == 12] = n_com$year[n_com$month == 12] + 1
To change column names, see `names`.
Then, to change to long format, you can use melt from the reshape2 package, using yeartemp as the id variable, as sketched below.
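Putting the two steps together, a minimal sketch (assuming the year and month columns are named year and month as described; melt treats every remaining column as a station, and dcast brings the result back to the wide year-by-station layout you asked for):
library(reshape2)
# shift December into the following season-year
n_com$yeartemp <- n_com$year
n_com$yeartemp[n_com$month == 12] <- n_com$year[n_com$month == 12] + 1
# wide -> long: one row per station-month
n_long <- melt(n_com, id.vars = c("year", "month", "yeartemp"),
               variable.name = "station", value.name = "precip")
# DJFM sum per season-year and station, then back to wide
# (edge years will have incomplete seasons, e.g. 1951 lacks December 1950)
seasonal <- aggregate(precip ~ yeartemp + station, data = n_long, FUN = sum)
dcast(seasonal, yeartemp ~ station, value.var = "precip")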
Hth.

Remove rows conditionally in a dataframe [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I have got a dataframe and I would like to remove duplicate rows by keeping, for each Code/Year pair, the one with the maximum Weight.
Here an example simplified of my dataframe:
Code Weight Year
1 27009 289 1975
2 27009 300 1975
3 27009 376 1977
4 30010 259 1975
5 30010 501 1979
6 30010 398 1979
[....]
My output should be:
Code Weight Year
1 27009 300 1975
2 27009 376 1977
3 30010 259 1975
4 30010 501 1979
[....]
Between Code and Weight I have got 5 more columns with different values and between Weight and Year one more column with still different values.
Should I use an if statement?
You could use the dplyr package:
df <- read.table(text = "Code Weight Year
27009 289 1975
27009 300 1975
27009 376 1977
30010 259 1975
30010 501 1979
30010 398 1979", header = TRUE)
library(dplyr)
df$x <- rnorm(6)
df %>%
  group_by(Year, Code) %>%
  slice(which.max(Weight))
# Code Weight Year x
# (int) (int) (int) (dbl)
# 1 27009 300 1975 1.3696332
# 2 30010 259 1975 1.1095553
# 3 27009 376 1977 -1.0672932
# 4 30010 501 1979 0.1152063
As a second solution, you could use the data.table package.
setDT(df)
df[order(-Weight) ,head(.SD,1), keyby = .(Year, Code)]
The results are the same.
Simply run aggregate in base R using Code and Year as the grouping. This will take max values of all other numeric columns:
finaldf <- aggregate(. ~ Code + Year, df, FUN = max)
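One caveat: aggregate takes column-wise maxima within each group, so a returned row can combine values from different original rows. If you need whole rows kept intact, a base-R alternative in the same spirit is to sort and deduplicate:
# sort by descending Weight, then keep the first row seen per Code/Year
df_sorted <- df[order(-df$Weight), ]
finaldf <- df_sorted[!duplicated(df_sorted[c("Code", "Year")]), ]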
