I'm working with daily discharge data over 30 years. Discharge is measured in cfs, and my dataset looks like this:
date ddmm year cfs
1/04/1986 1-Apr 1986 2560
2/04/1986 2-Apr 1986 3100
3/04/1986 3-Apr 1986 2780
4/04/1986 4-Apr 1986 2640
...
17/01/1987 17-Jan 1987 1130
18/01/1987 18-Jan 1987 1190
19/01/1987 19-Jan 1987 1100
20/01/1987 20-Jan 1987 864
21/01/1987 21-Jan 1987 895
22/01/1987 22-Jan 1987 962
23/01/1987 23-Jan 1987 998
24/01/1987 24-Jan 1987 1140
For each date, I'm trying to calculate the number of days since the discharge last exceeded 1000 cfs and put it in a new column ("DaysGreater1000") that will be used in a subsequent analysis.
In this example, DaysGreater1000 would be 0 for all of the dates in April 1986. DaysGreater1000 would be 1 on 20 Jan, 2 on 21 Jan, 3 on 22 Jan, etc.
Do I first need to create a column (event) of binary data for when the threshold is exceeded? I have been reading several old questions and it looks like I need to use ifelse but I can't figure out how to make a new column of data and then how to make the next step to calculate the number of preceding days.
Here are the questions that I have been examining:
Calculate days since last event in R
Calculate elapsed time since last event
... And this is the code that looks promising, but I can't quite put it all together!
df %>%
  mutate(event = as.logical(event),
         last_event = if_else(event, true = date, false = NA_integer_)) %>%
  fill(last_event) %>%
  mutate(event_age = date - last_event)
summary(df)
I'm sorry if I'm not being very eloquent! I'm feeling a bit rusty as I haven't used R in a while.
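A minimal sketch of one way this could be done in base R, assuming the dates are already parsed as Date and the rows are sorted by date. The toy data frame reuses the January 1987 rows above; the helper names event, last_idx and last_date are only illustrative. cummax() carries forward the row index of the most recent day above the threshold, and the date difference gives the days since that event.

```r
# Toy data taken from the January 1987 rows in the question
df <- data.frame(
  date = as.Date(c("1987-01-17", "1987-01-18", "1987-01-19", "1987-01-20",
                   "1987-01-21", "1987-01-22", "1987-01-23", "1987-01-24")),
  cfs  = c(1130, 1190, 1100, 864, 895, 962, 998, 1140)
)

# TRUE on days when discharge exceeds the threshold
df$event <- df$cfs > 1000

# Row index of the most recent exceedance up to and including each row
# (0 means no exceedance has been seen yet)
last_idx <- cummax(ifelse(df$event, seq_len(nrow(df)), 0L))

# Date of that exceedance; NA for rows before the first event
last_date <- df$date[ifelse(last_idx == 0, NA, last_idx)]

# Days elapsed since discharge last exceeded 1000 cfs
df$DaysGreater1000 <- as.numeric(df$date - last_date)
```

On these rows the new column is 0 through 19 Jan, then 1, 2, 3, 4 from 20 to 23 Jan, and 0 again on 24 Jan. Using date differences rather than row counts means gaps in the record are still counted in calendar days.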
My data frame looks like this:
Year sales
1976 January 250
1976 February 350
1976 March 230
1976 April 255
.
.
This goes up to December 2003.
I want to add a new column "Month" with a number from 1 to 12 for every year and repeating thereafter.
So that it would look like this:
Year Month sales
1976 January 1 250
1976 February 2 350
1976 March 3 230
1976 April 4 255
.
.
1976 December 12 320
1977 January 1 233
1977 February 2 333
.
.
Can you help me with the code, if possible without using any packages?
Thank you
Probably a safer way than Konrad's answer:
library(tidyr)
library(dplyr)
mydat %>%
  # Split the year from the month into a separate variable
  separate(Year, c("Year", "month"), sep = " ") %>%
  # Add the month number based on the name of the month
  mutate(Month_num = match(month, month.name))
This will return the correct month number even if your rows are not properly ordered.
If the first row of the table is always January, and if no months are missing, you can do
table$Month = rep(1 : 12, length.out = nrow(table))
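Since the question asked for a solution without packages, the name-based lookup can also be sketched in base R. This assumes the Year column holds strings such as "1976 January" (year, one space, full English month name); mydat here is a small made-up stand-in for the real data.

```r
# Hypothetical stand-in for the asker's data frame
mydat <- data.frame(
  Year  = c("1976 January", "1976 February", "1976 March", "1977 January"),
  sales = c(250, 350, 230, 233),
  stringsAsFactors = FALSE
)

# Drop the leading year and the space after it, leaving the month name
month_name <- sub("^[^ ]+ +", "", mydat$Year)

# Position of each name in the built-in month.name vector (1 = January)
mydat$Month <- match(month_name, month.name)
```

Like the match() call in the dplyr version, this does not depend on row order or on every month being present.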
I want to find the correlation between trip duration and age in the data set below. I am calling cor(age, df$tripduration), but it returns NA. How can I compute the correlation? I calculated "age" with the following syntax:
age <- (2017-as.numeric(df$birth.year))
and tripduration (in seconds) is df$tripduration.
Below is the data. In the gender column, 1 means male and 2 means female.
tripduration birth year gender
439 1980 1
186 1984 1
442 1969 1
170 1986 1
189 1990 1
494 1984 1
152 1972 1
537 1994 1
509 1994 1
157 1985 2
1080 1976 2
239 1976 2
344 1992 2
I think you are trying to subtract a data frame from a number, which would not work. This worked for me:
birth <- df$birth.year
year <- 2017
age <- year - birth
cor(df$tripduration, age)
>[1] 0.08366848
# Sanity check: the correlation with birth year has the same magnitude and the opposite sign
cor(df$tripduration, df$birth.year)
>[1] -0.08366848
By the way, please format the question with easily reproducible data that people can copy and paste into R. This actually helps you get an answer.
Based on the OP's comment, here is a new suggestion. Try deleting the rows with NA before performing a correlation test.
df <- df[complete.cases(df), ]
age <- (2017-as.numeric(df$birth.year))
cor(age, df$tripduration)
>[1] 0.1726607
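As an alternative to filtering the data frame first, cor() has a use argument that can drop incomplete pairs itself. A small sketch with made-up data containing one missing birth year (the values are illustrative, not the OP's full data set):

```r
# Hypothetical data with one missing birth year, mimicking the NA problem
df <- data.frame(
  tripduration = c(439, 186, 442, 170, 189),
  birth.year   = c(1980, 1984, NA, 1986, 1990)
)
age <- 2017 - as.numeric(df$birth.year)

# Default behaviour: any NA in the inputs makes the result NA
r_default <- cor(age, df$tripduration)

# use = "complete.obs" drops incomplete pairs before computing the correlation
r_complete <- cor(age, df$tripduration, use = "complete.obs")
```

This leaves df untouched, which may be preferable if the rows with a missing birth year are still needed for other summaries.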
I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
  spread(Year, Orders) %>%
  group_by(CountryName) %>%
  summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>%
  mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
library(dplyr)
library(tidyr)
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
                 Year = sample(2014:2015, 500, replace=TRUE),
                 Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
  spread(Year, sum_orders) %>%
  mutate(Pct = (`2014` - `2015`)/`2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
                 Year = sample(2010:2015, 500, replace=TRUE),
                 Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
  group_by(Country) %>%
  arrange(Country, Year) %>%
  mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
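The same "sum first, then spread" idea can also be sketched in base R with tapply(), which sums within each country/year cell and returns the wide table in one step. The data below are made up for illustration, and the column names CountryName, Year and Orders are assumed to match the question.

```r
# Made-up long-format data with repeated country/year combinations
dat <- data.frame(
  CountryName = c("UK", "UK", "US", "US", "US", "Norway", "Norway"),
  Year        = c(2014, 2015, 2014, 2014, 2015, 2014, 2015),
  Orders      = c(10, 5, 4, 6, 8, 3, 9)
)

# Sum Orders per country and year; the result is a matrix with
# one row per country and one column per year
totals <- tapply(dat$Orders, list(dat$CountryName, dat$Year), sum, na.rm = TRUE)

# Same formula as above: (2014 - 2015) / 2014 * 100
pct <- (totals[, "2014"] - totals[, "2015"]) / totals[, "2014"] * 100
```

Because the sums are computed per cell, duplicate rows in the raw data are added together rather than causing an error.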
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1: You're likely getting the error "duplicate identifiers for rows" because of spread. spread wants to turn the N unique values into N columns, but it needs to know which unique row each value belongs in. If you have duplicate value combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread cannot tell which row it should place the data in. A quick fix is to add a row identifier before spreading: data %>% mutate(row=row_number()) %>% spread...
Error 2: You're likely getting the error "sum not meaningful for factors" because of summarise_all. summarise_all operates on all columns, but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`), `2015_Sum` = sum(`2015`)). The backticks are required because these column names start with a digit.
I am attempting to multiply a column of numbers representing daily precipitation amounts by the corresponding monthly precipitation amount of the same year. From the example below, this means multiplying every PPT value in January 1890 by the monthly PPT value for January 1890, i.e. multiplying 31 numbers from D.SIM by the same number from M.SIM, and then doing the same for all the remaining months and years in the record. Is there an easy way?
Many thanks.
Dataset: D.SIM
Day Month Year PPT
1 1 1890 2.4
2 1 1890 0.0
3 1 1890 3.6
Dataset: M.SIM
Year Jan Feb Mar ...
1890 78.5 69.6 62.1 ...
My attempt: a loop to repeat the monthly values so that they align with the daily values
for (i in df){
JAN <- data.frame(rep(df$Jan, each=31))
}
and then repeated for the other 11 months.
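One possible way to avoid the per-month loop, sketched in base R: look up the monthly value for each daily row by its year and month, then multiply. This assumes the month columns of M.SIM follow Year in calendar order (Jan, Feb, ...), so that a month number can be used as a column index. The data frames are toy versions of the ones above (the February row is made up), and the columns MonthlyPPT and Product are illustrative names.

```r
# Toy versions of the two datasets from the question
D.SIM <- data.frame(
  Day   = c(1, 2, 3, 1),
  Month = c(1, 1, 1, 2),
  Year  = c(1890, 1890, 1890, 1890),
  PPT   = c(2.4, 0.0, 3.6, 1.0)
)
M.SIM <- data.frame(Year = 1890, Jan = 78.5, Feb = 69.6, Mar = 62.1)

# Monthly totals as a plain matrix: one row per year, one column per month
monthly <- as.matrix(M.SIM[, -1])

# For each daily row, find the matching year row in M.SIM
row_idx <- match(D.SIM$Year, M.SIM$Year)

# Two-column matrix indexing picks the (year row, month column) cell per day
D.SIM$MonthlyPPT <- monthly[cbind(row_idx, D.SIM$Month)]
D.SIM$Product    <- D.SIM$PPT * D.SIM$MonthlyPPT
```

Because the lookup is keyed on Year and Month, months of different lengths and leap years need no special handling.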
I need to create datasets of weather data to use for modeling over the next 50 years. I am planning to do this by using historical weather data (daily, 1980-2012), but mixing up the years in a random order and then relabeling them with 2014-2054. However, I cannot be completely random, because it is important to maintain leap years. I want to have as many datasets as possible so I can get an average response of the model to different weather patterns.
Here is an example of what the historical data looks like (except there is data for every day). How could I reassemble it so the years are in a different order, but make sure years with 366 days (1980, 1984, 1988) end up in future leap years (2016, 2020, 2024, 2028, 2052)? And then do that at least 50 more times?
year day radn maxt
1980 1 5.827989 -1.59375
1980 2 5.655813 -1.828125
1980 3 6.159346 -0.96875
1981 4 6.065136 -1.84375
1981 5 5.961181 -2.34375
1981 6 5.758733 -2.0625
1981 7 6.458055 -2.90625
1982 8 6.73056 -2.890625
1982 9 6.89472 -1.796875
1983 10 6.687879 -2.140625
1984 11 6.585833 -1.609375
1984 12 6.466392 -0.71875
1984 13 7.100092 -0.515625
1985 14 7.176402 -1.734375
1985 15 7.236122 -2.5
1985 16 7.455515 -2.375
1986 17 7.395174 -1.390625
1986 18 7.341537 -2.21875
1987 19 7.678102 -2.828125
1987 20 7.539239 -2.875
1987 21 7.231031 -2.390625
1988 22 7.397067 -0.21875
1988 23 7.947912 -0.5
1989 24 8.355059 -1.03125
1990 25 8.145792 -1.5
1990 26 8.591616 -2.078125
Here is a function that scrambles the years of a passed data frame df, returning a new data frame:
scramble.years = function(df) {
  # Build convenience vectors of years
  early.leap = seq(1980, 2012, 4)
  late.leap = seq(2016, 2052, 4)
  early.nonleap = seq(1980, 2012)[!seq(1980, 2012) %in% early.leap]
  late.nonleap = seq(2014, 2054)[!seq(2014, 2054) %in% late.leap]
  # Build map from late years to early years
  map = data.frame(from=c(sample(early.leap, length(late.leap), replace=TRUE),
                          sample(early.nonleap, length(late.nonleap), replace=TRUE)),
                   to=c(late.leap, late.nonleap))
  # Build a new data frame with the correct years/days for later period
  return.list = lapply(2014:2054, function(x) {
    get.df = subset(df, year == map$from[map$to == x])
    get.df$year = x
    return(get.df)
  })
  return(do.call(rbind, return.list))
}
You can call scramble.years any number of times to get new scrambled data frames.
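To illustrate the "at least 50 more times" part, here is a rough, self-contained sketch that generates 50 year mappings with replicate(). make.year.map is a hypothetical helper, not part of the function above; it reproduces only the mapping step (which historical year feeds which future year), and the seed is arbitrary.

```r
# Hypothetical helper: one random mapping from historical to future years,
# keeping leap years matched to leap years (same idea as scramble.years)
make.year.map <- function() {
  hist.years   <- 1980:2012
  future.years <- 2014:2054
  is.leap <- function(y) (y %% 4 == 0 & y %% 100 != 0) | (y %% 400 == 0)
  to   <- future.years
  from <- integer(length(to))
  # Sample with replacement because there are more future than historical years
  from[is.leap(to)]  <- sample(hist.years[is.leap(hist.years)],
                               sum(is.leap(to)), replace = TRUE)
  from[!is.leap(to)] <- sample(hist.years[!is.leap(hist.years)],
                               sum(!is.leap(to)), replace = TRUE)
  data.frame(from = from, to = to)
}

set.seed(1)  # arbitrary seed so the draws can be reproduced
maps <- replicate(50, make.year.map(), simplify = FALSE)
```

The same pattern, replicate(50, scramble.years(df), simplify = FALSE), would give a list of 50 scrambled data frames, assuming df has a year column as in the question.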