I have daily rainfall data in Excel (which I can save as a CSV or txt file) that I would like to manipulate and load into R. I'm very new to R.
The data has the following columns:
Year; Month; Rain on day 1 of Month, Rain on Day 2, ... , Rain on day 31;
This means that I have a large array/table. Some data is missing because it wasn't recorded, and some because February 31st, June 31st, etc do not exist.
I would like to analyse things like monthly totals, and their distributions.
What is the best way to read in the data so that it can be easily manipulated, and so that I can distinguish between missing data and nonexistent dates (e.g. 31st Feb)?
Thanks a lot in advance
Several things for you to have a look at: e.g. readxl::read_excel() for reading Excel files, or Hmisc::monthDays(dates) for determining the number of days in each month of a dates vector.
Anyway, here's one idea as a starter:
# create sample data
set.seed(1)
mat <- matrix(rbinom(5*31, 31, .5), nrow=5)
mat[sample(1:length(mat), 10)] <- NA
df <- data.frame(year=2016, month=1:5, mat)
# reshape data from wide to long format
library(reshape2)
dflong <- melt(df, id.vars = 1:2, variable.name = "day")
# add date column (will be NA if conversion is not possible, i.e. if the date does not exist)
dflong$date <- as.Date(with(dflong, paste(year, month, day, sep="-")), format = "%Y-%m-X%e")
# Select only existing dates
dflong <- subset(dflong[order(dflong$month), ], !is.na(date))
# Aggregate: means per month and year (missing values removed)
aggregate(value~year+month, dflong, mean, na.rm=TRUE)
# year month value
# 1 2016 1 15.93548
# 2 2016 2 15.26923
# 3 2016 3 15.10345
# 4 2016 4 15.74074
# 5 2016 5 16.16667
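Since the question asks for monthly totals rather than means, a minimal base R sketch along the same lines could look like this. It assumes the day columns are named X1 to X31 as in the sample data above; the names day_cols, monthly, total and n_missing are made up for illustration. Nonexistent dates are dropped, while days that exist but have no recorded value are counted separately.

```r
# Sample data in the same wide layout: year, month, then one column per day (X1..X31)
set.seed(1)
mat <- matrix(rbinom(5 * 31, 31, 0.5), nrow = 5)
mat[sample(length(mat), 10)] <- NA
df <- data.frame(year = 2016, month = 1:5, mat)

# Wide to long with base R: one row per (year, month, day)
day_cols <- grep("^X", names(df), value = TRUE)
dflong <- data.frame(
  year  = rep(df$year,  times = length(day_cols)),
  month = rep(df$month, times = length(day_cols)),
  day   = rep(seq_along(day_cols), each = nrow(df)),
  value = unlist(df[day_cols], use.names = FALSE)
)

# Nonexistent dates (e.g. 31 February) parse to NA and are dropped here
dflong$date <- as.Date(sprintf("%d-%02d-%02d", dflong$year, dflong$month, dflong$day),
                       format = "%Y-%m-%d")
dflong <- dflong[!is.na(dflong$date), ]

# Monthly totals over the recorded days (na.pass keeps the NA rows so that
# they can be counted in the second step)
monthly <- aggregate(value ~ year + month, dflong, sum,
                     na.rm = TRUE, na.action = na.pass)
names(monthly)[3] <- "total"

# Number of real calendar days per month without a recorded value
monthly$n_missing <- aggregate(value ~ year + month, dflong,
                               function(v) sum(is.na(v)),
                               na.action = na.pass)$value
monthly
```

Keeping n_missing next to total makes it easy to judge how far a monthly total can be trusted.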
Related
I'd like to generate a list of random dates between a defined interval using R such that there is only one date for each month present in the interval.
I've tried using a variation of the code from another solution, but I can't seem to limit it to one date per month. I get multiple dates for a given month.
Here's my attempt
df = data.frame(Date=c(sample(seq(as.Date('2020/01/01'), as.Date('2020/09/01'), by="day"), 9)))
But I seem to get more than one date for a given month. Any inputs would be highly appreciated.
First I create a table containing all the possible dates that you want to sample from. In a column of this table I store the month number of each date, using the month() function from the lubridate package.
library(lubridate)
dates <- data.frame(
days = seq(as.Date('2020/01/01'), as.Date('2020/09/01'), by="day")
)
dates$month <- month(dates$days)
Then the idea is to loop over the months with lapply(). In each iteration, I select from the dates table only the dates of that month, and I pass these dates to the sample() function.
results <- lapply(1:9, function(x){
sample_dates <- dates$days[dates$month == x]
return(sample(sample_dates, size = 1))
})
df <- data.frame(
dates = as.Date(unlist(results), origin = "1970-01-01")
)
This results in:
dates
1 2020-01-19
2 2020-02-06
3 2020-03-26
4 2020-04-13
5 2020-05-16
6 2020-06-29
7 2020-07-06
8 2020-08-21
9 2020-09-01
In other words, the idea of this approach is to give sample() only the dates of one specific month in each iteration, so it picks exactly one date for that month.
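The same idea can also be sketched in base R without an explicit month number, by grouping the days on a "year-month" label. This is only an illustrative variant (the names ym and picked are made up here); grouping on year and month together means it would also work for an interval that spans several years.

```r
set.seed(123)

# All candidate days in the interval
days <- seq(as.Date("2020-01-01"), as.Date("2020-09-01"), by = "day")

# Label each day with its year-month, e.g. "2020-03"
ym <- format(days, "%Y-%m")

# For each year-month group, pick the position of one randomly chosen day.
# sample.int() on the group size is safe even when a group has one day only.
picked <- tapply(seq_along(days), ym,
                 function(idx) idx[sample.int(length(idx), 1)])

df <- data.frame(dates = days[as.vector(picked)])
df
```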
How about this:
First you create a function that returns a random day from month 'month'
Then you lapply for all months you need, 1 to 9
x <- function(month){
  first <- as.Date(paste0('2020/',month,'/01'))
  # end one day before the next month starts, so the draw stays within 'month'
  last <- as.Date(paste0('2020/',month+1,'/01')) - 1
  sample(seq(first, last, by="day"), 1)
}
df <- data.frame(
dates = as.Date(unlist(lapply(1:9,x)), origin = "1970-01-01")
)
If you also want the months to appear in random order (not January, February, March...), you only need to add a sample:
df <- data.frame(
dates = as.Date(unlist(sample(lapply(1:9,x))), origin = "1970-01-01")
)
This question is about how to replace missing days and months in a data frame using R. Considering the data frame below, 99 denotes missing day or month and NA represents dates that are completely unknown.
df<-data.frame("id"=c(1,2,3,4,5),
"date" = c("99/10/2014","99/99/2011","23/02/2016","NA",
"99/04/2009"))
I am trying to replace the missing days and months based on the following criteria:
For dates with missing day but known month and year, the replacement date would be a random selection from the middle of the interval (first day to the last day of that month). Example, for id 1, the replacement date would be sampled from the middle of 01/10/2014 to 31/10/2014. For id 5, this would be the middle of 01/04/2009 to 30/04/2009. Of note is the varying number of days for different months, e.g. 31 days for October and 30 days for April.
As in the case of id 2, where both day and month are missing, the replacement date is a random selection from the middle of the interval (first day to last day of the year), e.g 01/01/2011 to 31/12/2011.
Please note: complete dates (e.g. the case of id 3) and NAs are not to be replaced.
I have tried by making use of the seq function together with the as.POSIXct and as.Date functions to obtain the sequence of dates from which the replacement dates are to be sampled. The difficulty I am experiencing is how to automate the R code to obtain the date intervals (it varies across distinct id) and how to make a random draw from the middle of the intervals.
The expected output would have the date of id 1, 2 and 5 replaced but those of id 3 and 4 remain unchanged. Any help on this is greatly appreciated.
This isn't the prettiest, but it seems to work and adapts to differing month and year lengths:
set.seed(999)
df$dateorig <- df$date
seld <- grepl("^99/", df$date)
selm <- grepl("^../99", df$date)
md <- seld & (!selm)
mm <- seld & selm
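# replace only the day/month placeholders ("99/"), never digits inside the year
df$date <- as.Date(gsub("99/","01/",as.character(df$date)), format="%d/%m/%Y")
# number of days in each affected month
monrng <- sapply(df$date[md], function(x) seq(x, length.out=2, by="month")[2]) - as.numeric(df$date[md])
# offsets 0..(n-1), so the draw stays inside the month
df$date[md] <- df$date[md] + sapply(monrng, sample, 1) - 1
# number of days in each affected year
yrrng <- sapply(df$date[mm], function(x) seq(x, length.out=2, by="12 months")[2]) - as.numeric(df$date[mm])
df$date[mm] <- df$date[mm] + sapply(yrrng, sample, 1) - 1
#df
# (the dates drawn depend on the seed and on the R version's sampling algorithm)
# id date dateorig
#1 1 2014-10-13 99/10/2014
#2 2 2011-02-04 99/99/2011
#3 3 2016-02-23 23/02/2016
#4 4 <NA> NA
#5 5 2009-04-18 99/04/2009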
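To make the interval logic easier to see in isolation, here is a minimal sketch of drawing one random day within a given month. The function name random_day_in_month is hypothetical and not part of the answer above; it reads "random selection" as a uniform draw over all days of the month, which is an assumption about what the question intends.

```r
# Hypothetical helper: draw one random date within the given year and month
random_day_in_month <- function(year, month) {
  first <- as.Date(sprintf("%d-%02d-01", year, month))
  # first day of the following month minus first day of this month
  # gives the number of days in this month
  n_days <- as.integer(seq(first, by = "month", length.out = 2)[2] - first)
  # offsets 0..(n_days - 1) keep the result inside the month
  first + sample.int(n_days, 1) - 1
}

set.seed(1)
d <- random_day_in_month(2014, 10)
format(d, "%Y-%m")   # always "2014-10"
```

The same pattern works for a whole year by stepping with by = "year" instead of by = "month".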
I have a dataset in a CSV file with month and year in one column and the closing rate of "AXA" in another column. For example, I need to find the average closing rate for the year 2017 and write it to a new CSV file along with the company name ("AXA") and the year. I need to do this for every year available, from 2009 to 2017.
From what I understood of the question, I created a reproducible example with only 2 years. I'm using the lubridate package to deal with date formats and dplyr for the grouping.
dates <- c('1-1-2009', '1-2-2009', '1-3-2009', '1-4-2009', '1-5-2009', '1-6-2009',
'1-7-2009', '1-8-2009', '1-9-2009', '1-10-2009', '1-11-2009', '1-12-2009',
'1-1-2010', '1-2-2010', '1-3-2010', '1-4-2010', '1-5-2010', '1-6-2010', '1-7-2010',
'1-8-2010', '1-9-2010', '1-10-2010', '1-11-2010', '1-12-2010')
set.seed(42)
prices <- rnorm(24, mean = 300, sd = 15)
library(lubridate)
library(dplyr)  # provides %>%, mutate(), group_by() and summarise()
Axa <- data.frame(dates = dmy(dates), prices = prices)
Axa <- Axa %>% mutate(obs_year = year(dates)) %>% group_by(obs_year) %>%
summarise(prices = mean(prices))
Which gives the following result:
# A tibble: 2 x 2
obs_year prices
<dbl> <dbl>
1 2009 311.
2 2010 292.
This should work for all the different years you have in your file.
For reading and writing the files, I assume you know how to use read.csv and write.csv.
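For completeness, the yearly averaging and the writing step could be sketched in base R as follows. The small raw data frame stands in for the result of read.csv() on the real file; the column names dates and prices, the output file name and the use of tempdir() are all assumptions for illustration.

```r
# Stand-in for something like read.csv("<your file>.csv"); column names are assumed
raw <- data.frame(dates  = c("1-1-2009", "1-2-2009", "1-1-2010", "1-2-2010"),
                  prices = c(300, 310, 290, 280))

# Extract the year from day-month-year strings
raw$year <- as.integer(format(as.Date(raw$dates, format = "%d-%m-%Y"), "%Y"))

# Average closing rate per year
yearly <- aggregate(prices ~ year, data = raw, FUN = mean)

# Add the company name and put the columns in the requested order
yearly$company <- "AXA"
yearly <- yearly[, c("company", "year", "prices")]

# Write the result; a temporary directory is used here only for the example
out_file <- file.path(tempdir(), "axa_yearly_average.csv")
write.csv(yearly, out_file, row.names = FALSE)
```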
Say, if I have a data frame as follows:
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00"),
to = as.POSIXct("2010-10-10 22:00"), by = 3600)
Dat <- data.frame(DateTime = Date1,
x1 = rnorm(length(Date1)))
where the spacing between each measurement is 1 hour. How would it be possible to pad this data frame with NAs for the rest of the year, so that the final solution has a length of 8760, i.e. hourly measurements for the entire year? I would like the DateTime column to extend from 2010-01-01 00:00 to 2010-12-31 23:00, for example, but the x1 column to be NA for the days that have been added to the original data frame (if that makes sense). I would like to come up with a solution that works for any number of years, i.e. if the data extends from May 2009 to September 2012, then the final solution should contain this data set with the missing times, from January 2009 to December 2012, padded with NAs. How can I go about solving this issue?
Create a new data frame that contains all hours and then merge both data frames.
df2<-data.frame(DateTime=seq(from = as.POSIXct("2010-01-01 00:00"),
to = as.POSIXct("2010-12-31 23:00"), by = "hour"))
merge(df2,Dat,all=TRUE)
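For data spanning several years, the same idea can be generalised by deriving the first and last year from the data itself. A minimal sketch, assuming all timestamps are in UTC (an assumption made here to avoid daylight-saving gaps in the hourly sequence; the names yrs, full and padded are illustrative):

```r
# Sample data as in the question, but with an explicit time zone
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00", tz = "UTC"),
             to   = as.POSIXct("2010-10-10 22:00", tz = "UTC"), by = "hour")
Dat <- data.frame(DateTime = Date1, x1 = rnorm(length(Date1)))

# First and last calendar year covered by the data
yrs <- as.integer(format(range(Dat$DateTime), "%Y", tz = "UTC"))

# Every hour from 1 January of the first year to 31 December of the last year
full <- data.frame(DateTime = seq(
  from = as.POSIXct(sprintf("%d-01-01 00:00", yrs[1]), tz = "UTC"),
  to   = as.POSIXct(sprintf("%d-12-31 23:00", yrs[2]), tz = "UTC"),
  by   = "hour"))

# Left join: keep every hour, x1 is NA wherever there was no measurement
padded <- merge(full, Dat, by = "DateTime", all.x = TRUE)
nrow(padded)
```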
I have a data frame with 275 different stations and 43 years of seasonal data (October to the following April; no need for May to September data) and 6 variables. Here is a small example of the data frame with only one variable, called value:
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
What I need is to fill the gaps between days with daily dates (e.g. 1:8), with the value of each row being the original value averaged over those days. It would look like:
data1 <- data.frame(station=rep(1,40), year=rep(1969,40), month=c(rep(10,31),rep(11,9)),day=c(1:31,1:9),value=rep(c(1/7,2/8,3/8,4/8,5/8,6/8),c(7,8,8,8,8,1)))
I wrote some poor code and searched around the site, but unfortunately it didn't work out. Any help or better ideas would be appreciated.
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
for (i in 1:length(station.date)){
days <- as.numeric(station.date[i+1]-station.date[i]) #not working
data <- within(data, days <- c(days,1))
}
rows <- rep(1:nrow(data), times=data[ ,data$days])
rows <- ifelse(rows > 10, 0, rows) #get rid of month May to Sept
data <- data[rows, ]
data <- within(data, value1 <- value/days)
data <- within(data, dd <- ?) #don't know to change the repeated days to real days
I wrote some code that does the same thing as your example, but you will probably have to modify it in order to handle the whole data set. I wasn't sure what to do with the last observation, so in the end I made a special case for it. If it should be divided by a different number, you just need to replace the 8 inside values <- c(values, tail(data$value, 1) / 8) with that number. Moreover, if you have all 275 stations in one data.frame, I think the best idea would be to split it by station, transform each part separately and then rbind the results.
data <- data.frame(station=rep(1,6), year=rep(1969,6), month=c(10,10,10,10,11,11),day=c(1,8,16,24,1,9),value=c(1:6))
station.date <- as.Date(with(data, paste(year, month, day, sep="-")))
d <- as.numeric(diff(station.date))
range <- sum(d) + 1
# create dates
dates <- seq(station.date[1], by = "day", length = range)
# create values
values <- unlist(sapply(1:length(d), function(i){
rep(data$value[i] / d[i] , d[i])
}))
# adding last observation
values <- c(values, tail(data$value, 1) / 8)
# create new data frame
data2 <- data.frame(station = rep(1, range),
year = as.numeric(format(dates, "%Y")),
month = as.numeric(format(dates, "%m")),
day = as.numeric(format(dates, "%d")),
value = values)
It could probably be optimised in some way, however I hope it helps too. Note how I extract year, month and day from dates.
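The split, transform and recombine step for several stations could be sketched like this. The helper expand_station and its last_divisor argument are hypothetical names; the sketch assumes the rows of each station are already sorted by date and lie within one season, and it treats the last observation as a single day divided by 8, as in the code above.

```r
# Hypothetical helper: expand one station's observations to daily rows
expand_station <- function(st, last_divisor = 8) {
  obs_date <- as.Date(with(st, paste(year, month, day, sep = "-")))
  d    <- as.numeric(diff(obs_date))   # days covered by each observation
  reps <- c(d, 1)                      # the last observation fills a single day
  dates <- seq(obs_date[1], by = "day", length.out = sum(reps))
  data.frame(station = st$station[1],
             year    = as.numeric(format(dates, "%Y")),
             month   = as.numeric(format(dates, "%m")),
             day     = as.numeric(format(dates, "%d")),
             value   = rep(st$value / c(d, last_divisor), times = reps))
}

# Two stations with the same observation days as in the question
data <- data.frame(station = rep(1:2, each = 6), year = 1969,
                   month = rep(c(10, 10, 10, 10, 11, 11), 2),
                   day   = rep(c(1, 8, 16, 24, 1, 9), 2),
                   value = c(1:6, 11:16))

# Split by station, expand each part, then stack the results by row
daily <- do.call(rbind, lapply(split(data, data$station), expand_station))
rownames(daily) <- NULL
```

Data covering several seasons would need an additional split by season, otherwise the May to September gap would be filled as well.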