I want to combine 6-hour timesteps that immediately follow one another in order to see the maximum Total_IVT during a single storm event. For example, 2019-5-15 has several observations at Hours 12 and 18, and the next day has an observation at Hour 0. How can I combine nearby timesteps?
Original data is here: https://ucla.box.com/ARcatalog. A shortened sample is below.
> dput(tail(df))
structure(list(Year = c(2019L, 2019L, 2019L, 2019L, 2019L, 2019L
), Month = c(3L, 5L, 5L, 5L, 5L, 5L), Day = c(27L, 15L, 15L,
15L, 16L, 21L), Hour = c(12L, 0L, 12L, 18L, 0L, 6L), Total_IVT = c(111.5, 206, 503.3, 287, 261.2, 294.8), Date = c("2019-03-27", "2019-05-15",
"2019-05-15", "2019-05-15", "2019-05-16", "2019-05-21")), row.names = 1719:1724, class = "data.frame")
I tried this code, and I got the daily maximum, but what I want is to include previous or following days if the storm spans across days.
df1 <- df %>% # subset of storms by max IVT
  mutate(Date = as.Date(Date)) %>%
  group_by(Date) %>%
  filter(Total_IVT == max(Total_IVT))
Here is an example of what I get from the full dataset when I plot the daily max IVT. What I want is a plot with fewer points, because some of the storms overlap days.
ggplot(df1) + geom_point(aes(Date, Total_IVT))
I am new to R, so I apologize if this does not make sense. I appreciate your help in advance.
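One way to sketch this, assuming dplyr is available: build a proper datetime column, then start a new storm event whenever the gap to the previous observation exceeds 6 hours, and keep each event's maximum. The 6-hour threshold is an assumption based on the sampling interval described above.

```r
library(dplyr)

# Sample from the question
df <- data.frame(
  Year  = rep(2019L, 6),
  Month = c(3L, 5L, 5L, 5L, 5L, 5L),
  Day   = c(27L, 15L, 15L, 15L, 16L, 21L),
  Hour  = c(12L, 0L, 12L, 18L, 0L, 6L),
  Total_IVT = c(111.5, 206, 503.3, 287, 261.2, 294.8)
)

df1 <- df %>%
  mutate(datetime = ISOdatetime(Year, Month, Day, Hour, 0, 0, tz = "UTC")) %>%
  arrange(datetime) %>%
  # a new storm event starts whenever the gap to the previous
  # observation exceeds 6 hours (assumed threshold)
  mutate(event = cumsum(c(TRUE, diff(as.numeric(datetime)) > 6 * 3600))) %>%
  group_by(event) %>%
  filter(Total_IVT == max(Total_IVT)) %>%
  ungroup()
```

With this sample, the 12:00 and 18:00 observations on 2019-05-15 and the 00:00 observation on 2019-05-16 collapse into one event, keeping only the 503.3 maximum.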
I'm trying to create a new variable confirmed_delta_perc in a chain of piped commands, but the variable active_delta is reported as not found. I have confirmed it is in the data frame, but it is not being read, and the new variable is not added.
COVID %>%
  select(county, confirmed, confirmed_delta) %>%
  mutate(confirmed_delta_perc = active_delta / active * 100) %>%
  filter(confirmed_delta_perc == 32)
Error:
Error in `mutate()`:
! Problem while computing `confirmed_delta_perc =
active_delta/active`.
Caused by error:
! object 'active_delta' not found
This is the full list of directions to include in the pipe:
Using piping, create a link of commands that selects the county, confirmed, and confirmed_delta variables. Create a new variable called confirmed_delta_perc using the mutate() function. The values in this column should be the percentage of active delta cases of all active cases. Filter for all observation(s) that have a confirmed_delta_perc value of 32. Print out all observation(s).
I've tried modifying the mutate() by reassigning the data frame so it "redoes" it and adds the new variable, but that doesn't work either.
There aren't any observations that actually equal 32, but it should still add the variable, and it is not.
Does anyone have any ideas?
dput(head(COVID))
structure(list(county = c("Washington", "Fountain", "Jay", "Wabash",
"Fayette", "Washington"), confirmed = c(620L, 737L, 930L, 1530L,
1336L, 675L), confirmed_delta = c(18L, 12L, 11L, 49L, 19L, 29L
), deaths = c(5L, 8L, 14L, 25L, 33L, 6L), deaths_delta = c(0L,
1L, 0L, 1L, 0L, 1L), recovered = c(0L, 0L, 0L, 0L, 0L, 0L), recovered_delta = c(0L,
0L, 0L, 0L, 0L, 0L), active = c(615L, 729L, 918L, 1512L, 1305L,
669L), active_delta = c(18L, 11L, 11L, 49L, 19L, 28L), active_delta_perc = c(0.0292682926829268,
0.0150891632373114, 0.0119825708061002, 0.0324074074074074, 0.0145593869731801,
0.0418535127055306)), row.names = c(NA, 6L), class = "data.frame")
First, note the error itself happens because your select() keeps only county, confirmed, and confirmed_delta, so active_delta is gone by the time mutate() runs; the code below skips the select() so the column is still available. Second, for most numbers of cases it is impossible for any portion of them to be exactly 32%. For instance, we would report 29 of 90 cases as "32%", but that is really 32.2222..., which is not strictly equal to 32. So you will need to specify what range around 32 counts as a match. Here, I say anything within 0.5 of 32 on either side, from 31.5 to 32.5, is close enough.
COVID <- COVID %>%
  mutate(confirmed_delta_perc = active_delta / active * 100) %>%
  filter(abs(confirmed_delta_perc - 32) <= 0.5)
Try this:
COVID <- COVID %>%
  mutate(confirmed_delta_perc = active_delta / active * 100) %>%
  filter(round(confirmed_delta_perc, 0) == 32)
Filtering with the abs() function, as suggested by @JonSpring in the comments, is better though.
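A quick base-R illustration of why exact equality fails for a percentage like this, using the 29-of-90 example:

```r
x <- 100 * 29 / 90   # reported as "32%", but...
x == 32              # FALSE: x is actually 32.2222...
round(x) == 32       # TRUE: matches after rounding
abs(x - 32) <= 0.5   # TRUE: tolerance-based match
```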
I am trying to create a bar graph of mortality frequencies each year from 2000-2013. I have two events which occurred in 2003 and 2007, so I would like these years' bars to be highlighted in a different color. I am currently using ggcharts with the following code:
spec <- highlight_spec(
  what = c("2003", "2007"),
  highlight_color = "darkorange3"
)
bar_chart(
  YearMortality,
  Year,
  freq,
  sort = FALSE,
  horizontal = FALSE,
  highlight = spec
)
This gives me the chart below.
Frequency of mortality each year from 2000-2013
The years are numerical values, which I think is why they aren't each appearing as x-axis values. I am not sure how to fix the code to display the x-axis labels, or how I should change the numerical values to characters. Any suggestions?
Edit: data frame information
Here is the information recommended by Peter:
structure(list(Year = 2000:2013, freq = c(9L, 10L, 10L, 7L, 3L,
9L, 6L, 4L, 2L, 6L, 5L, 4L, 2L, 6L)), row.names = c(NA, 14L), class = "data.frame")
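One likely fix, sketched below: convert Year to a factor so each year becomes a discrete axis label that the "2003"/"2007" strings in highlight_spec() can match. The chart call is guarded in case ggcharts is not installed, and the highlight_spec()/bar_chart() arguments are taken as-is from the question.

```r
# Data frame from the question
YearMortality <- data.frame(
  Year = 2000:2013,
  freq = c(9L, 10L, 10L, 7L, 3L, 9L, 6L, 4L, 2L, 6L, 5L, 4L, 2L, 6L)
)

# Treat Year as a discrete label rather than a number, so every year
# gets its own x-axis tick and "2003"/"2007" can be matched by name
YearMortality$Year <- factor(YearMortality$Year)

if (requireNamespace("ggcharts", quietly = TRUE)) {
  spec <- ggcharts::highlight_spec(
    what = c("2003", "2007"),
    highlight_color = "darkorange3"
  )
  ggcharts::bar_chart(
    YearMortality, Year, freq,
    sort = FALSE, horizontal = FALSE, highlight = spec
  )
}
```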
I have a df as follow:
Variable Value
G1_temp_0 37.9
G1_temp_5 37.95333333
G1_temp_10 37.98333333
G1_temp_15 38.18666667
G1_temp_20 38.30526316
G1_temp_25 38.33529412
G1_mean_Q1 38.03666667
G1_mean_Q2 38.08666667
G1_mean_Q3 38.01
G1_mean_Q4 38.2
G2_temp_0 37.9
G2_temp_5 37.95333333
G2_temp_10 37.98333333
G2_temp_15 38.18666667
G2_temp_20 38.30526316
G2_temp_25 38.33529412
G2_mean_Q1 38.53666667
G2_mean_Q2 38.68666667
G2_mean_Q3 38.61
G2_mean_Q4 38.71
I'd like to make a line plot with two lines which reflect the values "G1_mean_Q1 - G1_mean_Q4" and "G2_mean_Q1 - G2_mean_Q4".
In the end it should more or less look like this; the x-axis should represent the different variables:
The main problem I have is how to get a basic line plot with this df.
I've tried something like this,
ggplot(df, aes(x = c(1:4), y = Value)) + geom_line()
but I always get errors. It would be great if someone could help me. Thanks!
Please post your data with dput(data) next time. It makes it easier to read your data into R.
You need to tell ggplot which are the groups. You can do this with aes(group = Sample). For this purpose, you need to restructure your dataframe a bit and separate the Variable into different columns.
library(tidyverse)
dat <- structure(list(Variable = structure(c(5L, 10L, 6L, 7L, 8L, 9L,
1L, 2L, 3L, 4L, 15L, 20L, 16L, 17L, 18L, 19L, 11L, 12L, 13L,
14L), .Label = c("G1_mean_Q1", "G1_mean_Q2", "G1_mean_Q3", "G1_mean_Q4",
"G1_temp_0", "G1_temp_10", "G1_temp_15", "G1_temp_20", "G1_temp_25",
"G1_temp_5", "G2_mean_Q1", "G2_mean_Q2", "G2_mean_Q3", "G2_mean_Q4",
"G2_temp_0", "G2_temp_10", "G2_temp_15", "G2_temp_20", "G2_temp_25",
"G2_temp_5"), class = "factor"), Value = c(37.9, 37.95333333,
37.98333333, 38.18666667, 38.30526316, 38.33529412, 38.03666667,
38.08666667, 38.01, 38.2, 37.9, 37.95333333, 37.98333333, 38.18666667,
38.30526316, 38.33529412, 38.53666667, 38.68666667, 38.61, 38.71
)), class = "data.frame", row.names = c(NA, -20L))
dat <- dat %>%
filter(str_detect(Variable, "mean")) %>%
separate(Variable, into = c("Sample", "mean", "time"), sep = "_")
g <- ggplot(data=dat, aes(x=time, y=Value, group=Sample)) +
geom_line(aes(colour=Sample))
g
Created on 2020-07-20 by the reprex package (v0.3.0)
I have a time series like this:
created_time,reaction_counts
2016-01-18T08:05:44+0000,65
2016-01-18T08:05:44+0000,65
2016-01-18T08:05:44+0000,65
2016-02-23T01:42:48+0000,468
2016-02-23T03:51:37+0000,125
2016-02-23T09:49:01+0000,433
2016-02-23T10:09:32+0000,72
2016-02-26T07:45:10+0000,137
2016-02-26T11:48:09+0000,120
2016-02-27T03:27:39+0000,70
2016-02-28T09:28:16+0000,145
2016-03-02T00:17:14+0000,122
2016-03-02T05:34:41+0000,108
2016-03-02T09:04:45+0000,296
And I want to aggregate it by month (and also by year) and plot a histogram.
How do I do it?
Thanks!
You can use the following code for converting hourly data to monthly or yearly data.
library(lubridate)
library(dplyr)
try <- structure(list(created_time = structure(c(1L, 1L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), .Label = c("2016-01-18T08:05:44+0000",
"2016-02-23T01:42:48+0000", "2016-02-23T03:51:37+0000", "2016-02-23T09:49:01+0000",
"2016-02-23T10:09:32+0000", "2016-02-26T07:45:10+0000", "2016-02-26T11:48:09+0000",
"2016-02-27T03:27:39+0000", "2016-02-28T09:28:16+0000", "2016-03-02T00:17:14+0000",
"2016-03-02T05:34:41+0000", "2016-03-02T09:04:45+0000"), class = "factor"),
reaction_counts = c(65L, 65L, 65L, 468L, 125L, 433L, 72L,
137L, 120L, 70L, 145L, 122L, 108L, 296L)), class = "data.frame", row.names = c(NA,
-14L))
df <- mutate_at(try, "created_time", ymd_hms)
Monthly conversion
monthly <- df %>%
  mutate(month = format(created_time, "%m"),
         year = format(created_time, "%Y")) %>%
  group_by(month, year) %>%
  summarise(total = sum(reaction_counts))
For histogram plotting of monthly data
hist(monthly$total)
Yearly conversion
yearly <- df %>%
  mutate(year = format(created_time, "%Y")) %>%
  group_by(year) %>%
  summarise(total = sum(reaction_counts))
For histogram plotting of yearly data
hist(yearly$total)
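As an aside, if the goal is one bar per calendar month (rather than a histogram of the monthly totals), a base-R sketch works too; the few rows below are a stand-in for the full series:

```r
# A few rows standing in for the full series
created <- as.POSIXct(c("2016-01-18 08:05:44", "2016-02-23 01:42:48",
                        "2016-02-26 07:45:10", "2016-03-02 00:17:14"),
                      tz = "UTC")
counts <- c(65, 468, 137, 122)

# Sum the counts within each year-month bucket
ym <- format(created, "%Y-%m")
totals <- tapply(counts, ym, sum)

barplot(totals, las = 2, ylab = "reaction_counts")
```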
I have a data frame with 18 columns and about 12000 rows. I want to find the outliers for the first 17 columns and compare the results with column 18. Column 18 is a factor and contains data which can be used as an indicator of outliers.
My data frame is ufo and I remove column 18 as follows:
ufo2 <- ufo[,1:17]
and then convert 3 non-numeric columns to numeric values:
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
and then use the following command for outlier detection:
outlier.scores <- lofactor(ufo2, k=5)
But all of the elements of outlier.scores are NA!
Do I have any mistake in this code?
Is there another way to find outlier for such a data frame?
All of my code:
setwd(datadirectory)
library(doMC)
registerDoMC(cores=8)
library(DMwR)
# load data
load("data_9802-f2.RData")
ufo2 <- ufo[,2:17]
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
outlier.scores <- lofactor(ufo2, k=5)
The output of the dput(head(ufo2)) is:
structure(list(Origin = c(2L, 2L, 2L, 2L, 2L, 2L), IO = c(2L,
2L, 2L, 2L, 2L, 2L), Lot = c(1003L, 1003L, 1003L, 1012L, 1012L,
1013L), DocNumber = c(10069L, 10069L, 10087L, 10355L, 10355L,
10382L), OperatorID = c(5698L, 5698L, 2015L, 246L, 246L, 4135L
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), LineNo = c(1L, 2L, 1L,
1L, 2L, 1L), Country = c(1L, 1L, 1L, 1L, 11L, 1L), ProduceCode = c(63456227L,
63455714L, 33687427L, 32686627L, 32686627L, 791614L), Weight = c(900,
850, 483, 110000, 5900, 1000), InvoiceValue = c(637, 775, 2896,
48812, 1459, 77), InvoiceValueWeight = c(707L, 912L, 5995L, 444L,
247L, 77L), AvgWeightMonth = c(1194.53, 1175.53, 7607.17, 311.667,
311.667, 363.526), SDWeightMonth = c(864.931, 780.247, 3442.93,
93.5818, 93.5818, 326.238), Score = c(0.56366535234262, 0.33775439984787,
0.46825476121676, 1.414092583904, 0.69101737288291, 0.87827342721894
), TransactionNo = c(47L, 47L, 6L, 3L, 3L, 57L)), .Names = c("Origin",
"IO", "Lot", "DocNumber", "OperatorID", "Month", "LineNo", "Country",
"ProduceCode", "Weight", "InvoiceValue", "InvoiceValueWeight",
"AvgWeightMonth", "SDWeightMonth", "Score", "TransactionNo"), row.names = c(NA,
6L), class = "data.frame")
First of all, you need to spend a lot more time preprocessing your data.
Your axes have completely different meaning and scale. Without care, the outlier detection results will be meaningless, because they are based on a meaningless distance.
For example, ProduceCode: are you sure this should be part of your similarity?
Also note that I found the lofactor implementation of the R DMwR package to be really slow. Plus, it seems to be hard-wired to Euclidean distance!
Instead, I recommend using ELKI for outlier detection. First, it comes with a much wider choice of algorithms; second, it is much faster than the R implementation; and third, it is very modular and flexible. For your use case, you may need to implement a custom distance function instead of using Euclidean distance.
Here's the link to the ELKI tutorial on implementing a custom distance function.
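The scaling advice above can be sketched in base R. The two columns below are a hypothetical stand-in for ufo2's numeric columns (names taken from the question's dput output); standardising them keeps any one axis from dominating the distance computation:

```r
# Hypothetical stand-in for two of ufo2's numeric columns
ufo2 <- data.frame(
  Weight       = c(900, 850, 483, 110000, 5900, 1000),
  InvoiceValue = c(637, 775, 2896, 48812, 1459, 77)
)

# Standardise each column to mean 0, sd 1 so no single axis
# dominates the Euclidean distances used by LOF
ufo2_scaled <- as.data.frame(scale(ufo2))

# outlier.scores <- lofactor(ufo2_scaled, k = 5)  # then run LOF as before
```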