Weird date-time format on x-axis plot - r

I'm currently plotting several one-day datasets in R. The format of the dates in the datasets is yyyymmddhh. When I plot this, the formatting fails gloriously: on the x-axis I now get 2016060125, 2016060150, etc., and a very weirdly shaped plot. What do I have to do to create a plot with a more "normal" date notation (e.g. June 1, 12:00 or just 12:00)?
Edit: the dates of these datasets are integers
The dataset looks like this:
> event_1
date P ETpot Q T fXS GRM_SintJorisweg
1 2016060112 0.0 0.151 0.00652 19.6 0.00477 0.39250
2 2016060113 0.0 0.134 0.00673 20.8 0.00492 0.38175
3 2016060114 0.0 0.199 0.00709 22.6 0.00492 0.36375
4 2016060115 0.0 0.201 0.00765 21.2 0.00492 0.36850
5 2016060116 19.4 0.005 0.00786 19.5 0.00492 0.36900
6 2016060117 2.8 0.005 0.00824 18.1 0.00492 0.36625
7 2016060118 2.6 0.017 0.00984 18.0 0.00508 0.35975
8 2016060119 9.7 0.000 0.01333 16.7 0.00555 0.34750
9 2016060120 7.0 0.000 0.01564 16.8 0.00524 0.33550
10 2016060121 4.1 0.000 0.01859 17.1 0.00524 0.32000
11 2016060122 9.5 0.000 0.02239 17.2 0.00539 0.30250
12 2016060123 2.6 0.000 0.03330 17.5 0.00555 0.27050
13 2016060200 11.6 0.000 0.03997 17.4 0.00555 0.23800
14 2016060201 0.9 0.000 0.04928 17.3 0.00555 0.21725
15 2016060202 0.0 0.000 0.05822 17.2 0.00555 0.20350
16 2016060203 2.3 0.002 0.06547 16.4 0.00555 0.18575
17 2016060204 0.0 0.016 0.07047 16.5 0.00555 0.16950
18 2016060205 0.0 0.027 0.07506 16.7 0.00555 0.16475
19 2016060206 0.0 0.070 0.07762 18.0 0.00555 0.16525
20 2016060207 0.0 0.285 0.08006 19.5 0.00555 0.14500
21 2016060208 0.0 0.224 0.08109 20.3 0.00555 0.15875
22 2016060209 0.0 0.362 0.07850 21.3 0.00555 0.17825
23 2016060210 0.0 0.433 0.07441 22.0 0.00524 0.19175
24 2016060211 0.0 0.417 0.07380 23.9 0.00492 0.19050
I want to plot the date on the x-axis and the Q on the y-axis

Create a minimal verifiable example with your data:
date_int <- c(2016060112,2016060113,2016060114,2016060115,2016060116,2016060117,2016060118,2016060119,2016060120,2016060121,2016060122,2016060123,2016060200,2016060201,2016060202,2016060203,2016060204,2016060205,2016060206,2016060207,2016060208,2016060209,2016060210,2016060211)
Q <- c(0.00652,0.00673,0.00709,0.00765,0.00786,0.00824,0.00984,0.01333,0.01564,0.01859,0.02239,0.0333,0.03997,0.04928,0.05822,0.06547,0.07047,0.07506,0.07762,0.08006,0.08109,0.0785,0.07441,0.0738)
df <- data.frame( date_int, Q)
So, now we have a dataframe 'df'
With the dataframe 'df' you can convert your date_int column to a date format with hours and update the dataframe:
date_time <- strptime(df$date_int, format = '%Y%m%d%H', tz= "UTC")
df$date_int <- date_time
Finally,
plot(df)
You will see a nice plot with readable date-time labels on the x-axis.
P.S.: Please note that you need to use the conversion specifications described in "Dates and Times in R" (e.g. "%Y%m%d%H" in this case).
Ref.: https://www.stat.berkeley.edu/~s133/dates.html
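If you also want labels like the "June 1, 12:00" style mentioned in the question, here is a small sketch using the same conversion specifications (it assumes the date_time object created above and an English locale for the month name):
# "%B" is the full month name, "%d" the day of month, "%H:%M" hour and minute
format(date_time[1], "%B %d, %H:%M")
## e.g. "June 01, 12:00"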

Here is a lubridate answer:
library(lubridate)
event_1$date <- ymd_h(event_1$date)
or base R:
event_1$date <- as.POSIXct(as.character(event_1$date), format = "%Y%m%d%H")
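After either conversion, a minimal plotting sketch (assuming the event_1 data frame from the question, with the date column already converted as above):
plot(Q ~ date, data = event_1, type = "l", xlab = "Date", ylab = "Q")
# base R now picks readable date-time labels for the POSIXct x-axis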

What is happening is that the dates are being interpreted as plain numbers. As indicated, you need to convert them. To get the formatting right, you need to do a little more:
set.seed(123)
library(lubridate)
## date
x <- ymd_h(2016060112)
y <- ymd_h(2016060223)
dfx <- data.frame(
date = as.numeric(format(seq(x, y, 3600), "%Y%m%d%H")),
yvar = rnorm(36))
dfx$date_x <- ymd_h(dfx$date)
# plot 1
plot(dfx$date, dfx$yvar)
Now using date_x which is POSIXct:
#plot 2
# converted to POSIXct
class(dfx$date_x)
## [1] "POSIXct" "POSIXt"
plot(dfx$date_x, dfx$yvar)
You will need to fix your date axis to get the format you desire:
#plot 3
# using axis.POSIXct to help things
with(dfx, plot(date_x, yvar, xaxt="n"))
r <- round(range(dfx$date_x), "hours")
axis.POSIXct(1, at = seq(r[1], r[2], by = "hour"), format = "%b-%d %H:%M")

Storing output from R multiple loops into a list

I'm trying to carry out the following action on the columns of a dataframe (df1):
term1+term2+term3*req_no
req_no is a range of numbers: 20:24
df1:
ID term1 term2 term3
X299 1.2 2.3 0.12
X300 1.4 0.6 2.4
X301 0.3 1.6 1.2
X302 0.9 0.6 0.4
X303 0.3 1.8 0.3
X304 1.3 0.3 2.1
I need help to get the required output shown below; my attempt follows it.
Required output:
ID 20 21 22 23 24
X299 5.9 6.02 6.14 6.26 6.38
X300 50 52.4 54.8 57.2 59.6
X301 25.9 27.1 28.3 29.5 30.7
X302 9.5 9.9 10.3 10.7 11.1
X303 8.1 8.4 8.7 9 9.3
X304 43.6 45.7 47.8 49.9 52
Here's my attempt:
results <- list()
req_no <- 20:25
for(i in 1:nrow(df1){
for(j in rq_no){
res <- term1+term2+term3*j
results[j] <- res
}
results[[i]]
}
results2 <- do.call("rbind",result)
Help will be appreciated.
Here are a couple different approaches, though neither as succinct as Parfait's. Sample data:
df <- data.frame(ID=c("X299", "X300"),
term1=c(1.2, 1.4),
term2=c(2.3, 0.6),
term3=c(0.12, 2.4))
req_no <- 20:25
Loop approach
Your initial approach is headed in the right direction, but in the future, it would help to specify exactly what your error or problem is. For an iterated and perhaps easier-to-read approach, here's one answer:
results <- matrix(data=NA, nrow=nrow(df), ncol=length(req_no)) # Empty matrix to store our results
colnames(results) <- req_no # Optional; name columns based off of req_no values
for(i in 1:nrow(df)) {
# Do the calculation we want; returns a vector length 6
res <- df[i,]$term1 + df[i,]$term2 + (df[i,]$term3 * req_no)
# Save results for row i of df into row i of results matrix
results[i,] <- res
}
# Now bind the columns (named 20 through 25) to the respective rows of df
output <- cbind(df, results)
output
From your initial attempt, note:
We only do one loop, since it is easy to multiply by a vector in R
There are a few ways to subset data from a data frame in R. In this case, df[i,] gets everything in the i-th row, while $termX gets value in the column named termX
Using a results matrix instead of a list makes it very easy to copy the temporary computations (for each row) into rows of the matrix
Rather than rbind() (row bind), we want cbind() (column bind) to bind those results to new columns of the original rows.
Output:
ID term1 term2 term3 20 21 22 23 24 25
1 X299 1.2 2.3 0.12 5.9 6.02 6.14 6.26 6.38 6.5
2 X300 1.4 0.6 2.40 50.0 52.40 54.80 57.20 59.60 62.0
Dplyr/purrr functions
This could also be solved using tidy functions. In essence it's a pretty similar approach to Parfait's answer, but I've made the steps a bit more verbose to see what's going on.
library(dplyr)  # pipes and mutate()
library(purrr)  # map() and pmap()
library(tidyr)  # unnest_wider()

# Use purrr's map functions to do the computation we want
nested_df <- df %>%
# Make new column holding term3 * req_no (stores a vector in each new cell)
mutate(term3r = map(term3, ~ .x * req_no)) %>%
# Make new column which sums the three columns of interest (stores a vector in each new cell)
mutate(sum = pmap(list(term1, term2, term3r), ~ ..1 + ..2 + ..3))
# "Unnest" those vectors which store our sums, and keep only those and ID
output <- nested_df %>%
# Creates six new columns (named ...1 to ...6) with the elements of each sum
unnest_wider(sum) %>%
# Keeps only the output data and IDs
select(ID, ...1:...6)
output
Output:
# A tibble: 2 x 7
ID ...1 ...2 ...3 ...4 ...5 ...6
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 X299 5.9 6.02 6.14 6.26 6.38 6.5
2 X300 50 52.4 54.8 57.2 59.6 62
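As an optional, hedged follow-up (not part of the output above): the placeholder names ...1 to ...6 could be renamed after the req_no values so the columns match the required output:
# Rename the unnested columns after the multipliers they correspond to
names(output)[-1] <- req_no
output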
Consider directly assigning new columns with sapply using your formula:
df[paste0(req_no)] <- sapply(req_no, function(r) with(df, term1 + term2 + term3 * r))
df
# ID term1 term2 term3 20 21 22 23 24
# 1 X299 1.2 2.3 0.12 5.9 6.02 6.14 6.26 6.38
# 2 X300 1.4 0.6 2.40 50.0 52.40 54.80 57.20 59.60
# 3 X301 0.3 1.6 1.20 25.9 27.10 28.30 29.50 30.70
# 4 X302 0.9 0.6 0.40 9.5 9.90 10.30 10.70 11.10
# 5 X303 0.3 1.8 0.30 8.1 8.40 8.70 9.00 9.30
# 6 X304 1.3 0.3 2.10 43.6 45.70 47.80 49.90 52.00

Time-series average of cross-sectional correlations

I have a panel dataset looking like this:
head(panel_data)
date symbol close rv rv_plus rv_minus rskew rkurt Mkt.RF SMB HML
1 1999-11-19 a 25.4 19.3 6.76 12.6 -0.791 4.36 -0.11 0.35 -0.5
2 1999-11-22 a 26.8 10.1 6.44 3.69 0.675 5.38 0.02 0.22 -0.92
3 1999-11-23 a 25.2 8.97 2.56 6.41 -1.04 4.00 -1.29 0.08 0.3
4 1999-11-24 a 25.6 5.81 2.86 2.96 -0.505 5.45 0.87 0.08 -0.89
5 1999-11-26 a 25.6 2.78 1.53 1.25 0.617 5.60 0.23 0.92 -0.2
6 1999-11-29 a 26.1 5.07 2.76 2.30 -0.236 7.27 -0.6 0.570 -0.14
where the variable symbol depicts different stocks. I want to calculate the time-series average of the cross-sectional correlation between the variables rskew and rkurt. This means I need to compute the correlation between rskew and rkurt over all different stocks at each point in time and then calculate the time-series average afterwards.
I tried to do it with the rollapply function from the zoo package, but since the number of different stocks is not the same for all dates, I cannot simply define width as an integer. Here is what I tried for a sample width of 20:
panel_data <- panel_data %>%
group_by(date) %>%
mutate(cor_skew_kurt = rollapply(data = panel_data[7:8],
width=20,
FUN=cor,
align="right",
na.rm=TRUE,
fill=NA)) %>%
ungroup
Is there a way to do this without having to define a fixed width for each date group?
Or should I maybe use a different approach to do this?
[Edited] Can you try running the code below? I have recreated an example emulating your issue. If I understood your problem correctly, this code should at least put you on the path to the right solution, as it solves the issue of unequal time-window lengths.
###################
#Recreating an example dataset with unequal dates across stocks
set.seed(1)
date6 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24','1999-11-26','1999-11-29')
date5 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24','1999-11-26')
date4 <- c('1999-11-19','1999-11-22','1999-11-23','1999-11-24')
cor_skew_kurt <- c(rep(NaN,21))
symbol <- c(rep('a',6),rep('b',5),rep('c',4),rep('d',6))
rskew <- rnorm(21,mean=1, sd =1)
rkurt <- rnorm(21, mean=5, sd = 1)
panel_data <- cbind.data.frame(date = c(date6,date5,date4,date6), symbol = symbol, rskew = rskew, rkurt = rkurt, cor_skew_kurt = cor_skew_kurt )
panel_data$date <- as.Date(panel_data$date, '%Y-%m-%d')
# Computing the cor_skew_kurt and filling the table <- ANSWER TO YOUR QUESTION
for (date in unique(panel_data$date)) {
  panel_data[panel_data$date == date, "cor_skew_kurt"] <-
    as.double(cor(panel_data[panel_data$date == date, "rskew"],
                  panel_data[panel_data$date == date, "rkurt"]))
}
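To finish with the time-series average the question asks for, a hedged sketch on top of the example above: take the (within-date identical) correlation per date and average it over time:
# One correlation value per date, then the mean across dates
daily_cor <- aggregate(cor_skew_kurt ~ date, data = panel_data, FUN = mean)
mean(daily_cor$cor_skew_kurt, na.rm = TRUE)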

na.omit seems to be removing negative values in my data frame?

I am trying to figure out the temperature ranges for the different locations (CB, HK, etc.) in my data frame, which looks like this:
head(join)
OTU_num location date otus Depth DO Temperature pH Secchi.Depth
1 Otu0001 CB 03JUN09 21 0.0 7.60 21.0 3.68 NA
2 Otu0001 CB 03JUN09 21 0.5 8.27 16.4 3.68 NA
3 Otu0001 CB 03JUN09 21 1.0 7.65 14.9 3.68 NA
4 Otu0001 CB 03JUN09 21 1.5 5.26 12.2 3.25 NA
5 Otu0001 CB 03JUN09 21 2.0 4.01 10.1 3.25 NA
I am calculating the range using:
ranges <- join %>%
group_by(location) %>%
na.omit %>%
mutate(min=min(Temperature), max=max(Temperature), subtract=min-max) %>%
arrange(subtract)
Some of the temperature values are NA, so I used na.omit; however, it appears to be taking out the negative values as well, so the ranges I get are wrong.
location min max subtract
MA 0.1 27.3 -27.2
I double checked using the range function for one of the locations (there are a lot and I did not want to use range for each location)
MA <- subset(join, location=="MA")
range(MA$Temperature, na.rm = TRUE)
[1] -2.2 27.6
Why are the values different? Any help is appreciated!!!
na.omit() removes every row that contains an NA in any column (in your data Secchi.Depth is NA throughout), so rows with perfectly valid negative temperatures are being dropped as well. I think you should use join %>% filter(!is.na(Temperature)) instead, so only the rows whose Temperature is NA will be removed.
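A minimal sketch of the corrected pipeline, assuming the same columns as in the question:
library(dplyr)
ranges <- join %>%
  filter(!is.na(Temperature)) %>%   # drop only rows with a missing Temperature
  group_by(location) %>%
  mutate(min = min(Temperature), max = max(Temperature), subtract = min - max) %>%
  arrange(subtract)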

R generate bins from a data frame respecting blanks

I need to generate bins from a data.frame based on the values of one column. I have tried the function "cut".
For example: I want to create bins of air temperature values in the column "AirTDay" in a data frame:
AirTDay (oC)
8.16
10.88
5.28
19.82
23.62
13.14
28.84
32.21
17.44
31.21
I need the bin intervals to include all values in a range of 2 degrees centigrade from that initial value (i.e. 8-9.99, 10-11.99, 12-13.99...), to be labelled with the average value of the range (i.e. 9.5, 10.5, 12.5...), and to respect blank cells, returning "NA" in the bins column.
The output should look as:
Air_T (oC) TBins
8.16 8.5
10.88 10.5
5.28 NA
NA
19.82 20.5
23.62 24.5
13.14 14.5
NA
NA
28.84 28.5
32.21 32.5
17.44 18.5
31.21 32.5
I've gotten as far as:
setwd('C:/Users/xxx')
temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE)
TAir <- temp_data$AirTDay
Tmin <- round(min(TAir, na.rm = FALSE), digits = 0) # i.e. start at minimum value
Tmax <- round(max(TAir, na.rm = FALSE), digits = 0)
int <- 2 # bin ranges 2 degrees
mean_int <- int/2
int_range <- seq(Tmin, Tmax + int, int) # generate bin sequence
bin_label <- seq(Tmin + mean_int, Tmax + mean_int, int) # generate labels
temp_data$TBins <- cut(TAir, breaks = int_range, ordered_result = FALSE, labels = bin_label)
The output table looks correct, but for some reason it shows an extra sequential column, shifts the column names, and collapses all values, eliminating the blank cells. Something like this:
Air_T (oC) TBins
1 8.16 8.5
2 10.88 10.5
3 5.28 NA
4 19.82 20.5
5 23.62 24.5
6 13.14 14.5
7 28.84 28.5
8 32.21 32.5
9 17.44 18.5
10 31.21 32.5
Any ideas on where I am failing and how to solve it?
v <- ceiling(max(dat$V1, na.rm = TRUE))
breaks <- seq(8, v, 2)
labels <- seq(8.5, length.out = length(breaks) - 1, by = 2)
transform(dat, Tbins = cut(V1, breaks, labels))
V1 Tbins
1 8.16 8.5
2 10.88 10.5
3 5.28 <NA>
4 NA <NA>
5 19.82 18.5
6 23.62 22.5
7 13.14 12.5
8 NA <NA>
9 NA <NA>
10 28.84 28.5
11 32.21 <NA>
12 17.44 16.5
13 31.21 30.5
This result follows the logic given: we have
paste(seq(8, v - 2, 2), seq(9.99, v, by = 2), sep = "-")
[1] "8-9.99" "10-11.99" "12-13.99" "14-15.99" "16-17.99" "18-19.99" "20-21.99"
[8] "22-23.99" "24-25.99" "26-27.99" "28-29.99" "30-31.99"
From this we can tell that 19.82 lies between 18 and 20 and is therefore given the value 18.5, just as 10.88 lies between 10 and 11.99 and is therefore assigned the value 10.5.
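As a hedged tweak of the answer above (keeping its placeholder names dat and V1 and its labelling convention of lower bound + 0.5): extending the top break by one bin lets the maximum value (32.21 here) fall inside a bin instead of becoming NA, as the desired output in the question suggests:
breaks <- seq(8, ceiling(max(dat$V1, na.rm = TRUE)) + 2, by = 2)  # 8, 10, ..., 34
labels <- head(breaks, -1) + 0.5                                  # 8.5, 10.5, ..., 32.5
transform(dat, Tbins = cut(V1, breaks, labels))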

How to reshape a matrix

I have a matrix (d) that looks like:
d <-
as.matrix(read.table(text = "
month Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13
X10 10 7.04 8.07 9.4 8.17 9.39 8.13 9.43 9.06 8.59 9.37 9.79 8.47 8.86
X11 11 12.10 11.50 12.6 13.70 11.90 11.50 13.10 17.20 19.00 14.60 13.70 13.20 16.10
X12 12 24.00 22.00 22.2 20.50 21.60 22.50 23.10 23.30 30.50 34.10 36.10 37.40 28.90
X1 1 18.30 16.30 16.2 14.80 16.60 15.40 15.20 14.80 16.70 14.90 15.00 13.80 15.90
X2 2 16.70 14.40 15.3 14.10 15.50 16.70 15.20 16.10 18.00 26.30 28.00 31.10 34.20",
header=TRUE))
going from Q1 to Q31 (it's the days in each month). What I would like to get is:
month day Q
10 1 7.04
10 2 8.07
and so on for the 31 days and the 12 months.
I have tried using the following code:
reshape(d, direction="long", varying = list(colnames(d)[2:32]), v.names="Q", idvar="month", timevar="day")
but I get the error :
Error in d[, timevar] <- times[1L] : subscript out of bounds
Can anyone tell me what is wrong with the code? I don't really understand the help file on "reshape", it's a bit confusing... Thanks!
Almost there - you're just missing as.data.frame(d) to make your matrix into a data frame. Also you don't need the list in varying - just a vector, so
reshape(as.data.frame(d), varying=colnames(d)[2:32], v.names="Q",
direction="long", idvar="month", timevar="day")
The help file is confusing as heck, not least because (as I've learned) the necessary information almost always actually is in there --- somewhere.
As a prime example, midway through the help file, there is this bit:
The function will
attempt to guess the ‘v.names’ and ‘times’ from these names [i.e. the ones in the 'varying' argument]. The
default is variable names like ‘x.1’, ‘x.2’, where ‘sep = "."’
specifies to split at the dot and drop it from the name. To have
alphabetic followed by numeric times use ‘sep = ""’.
That last sentence is the one you need here: "Q1", "Q2", etc. are indeed "alphabetic followed by numeric", so you need to set sep = "" argument if reshape() is to know how to split apart those column names.
Try this:
res <- reshape(as.data.frame(d), idvar="month", timevar="day",
varying = -1, direction = "long", sep = "")
head(res[with(res, order(month,day)),])
# month day Q
# 1.1 1 1 18.3
# 1.2 1 2 16.3
# 1.3 1 3 16.2
# 1.4 1 4 14.8
# 1.5 1 5 16.6
# 1.6 1 6 15.4
The help file on reshape is not a bit confusing. It's a LOT confusing. Assuming your matrix has 12 rows (one for each month) and 31 day columns (I'm guessing you have NA values for months with fewer than 31 days), you could easily construct this by hand.
d <- data.frame(month = rep(d[,1], 31), day = rep(1:31, each = 12), Q = as.vector(d[,2:32]))
Now, back to your reshape... I'm guessing it's not parsing the names of your columns correctly. It might work better with Q.1, Q.2, etc. BTW, my reshaping above really depends on what you presented actually being a matrix and not a data.frame.
