New R student here. I am trying to sort data by month. Here is a sample of the data I need to use, followed by the code, then my results. Any tips for how to accomplish this? I'm super stuck...
This is the latest code I have been trying:
library(readr)
weather <- read_csv("R/weather.csv", col_types = cols(High = col_number(),
Low = col_number(), Precip = col_number(),
Snow = col_number(), Snowd = col_integer()))
View(weather)
library(ggplot2)
library(ggridges)
library(dplyr)
library(lubridate)
class(weather) #what class is dataset = dataframe
head(weather) #structure of the dataset
weather.month <- weather %>% # Group data by month
mutate(weather, 'month') %>%
group_by(month = lubridate::floor_date(weather$Day, 'month')) %>%
summarise(weather.month$High)
These are the errors I get:
Any help getting through this would be greatly appreciated!!!
The code can be fixed by first converting the Day column to Date class (with mdy or dmy, as it is not clear whether the format is month-day-year or day-month-year), then grouping on floor_date by 'month', and finally applying the summary function to the High column:
library(dplyr)
library(lubridate)
weather %>%
  group_by(month = lubridate::floor_date(mdy(Day), 'month')) %>%
  summarise(High = sum(High, na.rm = TRUE))
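For illustration, here is a minimal sketch of the same pattern on made-up data. The `toy` data frame and its month-day-year strings are assumptions (the real weather.csv is not shown), and it uses `mean` instead of `sum` only because an average monthly high may be easier to read; either function fits in `summarise`.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data; assumes Day is stored as month/day/year text
toy <- data.frame(
  Day  = c("1/5/2020", "1/20/2020", "2/3/2020", "2/14/2020"),
  High = c(30, 40, 45, NA)
)

monthly <- toy %>%
  mutate(Day = mdy(Day)) %>%                      # character -> Date
  group_by(month = floor_date(Day, "month")) %>%  # first day of each month
  summarise(High = mean(High, na.rm = TRUE), .groups = "drop")

monthly
```

If the dates turn out to be day-month-year, swapping `mdy` for `dmy` should be the only change needed.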
I am inquiring to learn how I can convert a character to hms so that I can accurately plot the mean of a series of album lengths that are in hh:mm:ss format. Here's my code. It doesn't work.
#Create labels for the mean
meanHours <- music %>%
select(AlbumHours) %>%
summarise(hrsMean = albumHRSmean, na.rm=TRUE) %>%
strptime(hrsMean, "%H:%M:%S") %>%
mutate(label_hrsMean = paste0("Mean of Hours: ",(hrsMean)))
We could use the hms() function from the lubridate package:
library(lubridate)
library(dplyr)
meanHours <- music %>%
select(AlbumHours) %>%
summarise(hrsMean = albumHRSmean, na.rm=TRUE) %>%
mutate(hrsMean = hms(hrsMean),
label_hrsMean = paste0("Mean of Hours: ",(hrsMean)))
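As a rough sketch of one way to get an actual mean out of hh:mm:ss strings: convert each value to seconds, average, then format the result back. The `music` data frame below is a made-up stand-in, and the `meanSecs` column name is my own; only `AlbumHours` comes from the question.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the `music` data frame from the question
music <- data.frame(AlbumHours = c("00:40:00", "01:00:00", "01:20:00"))

meanHours <- music %>%
  # hms() parses the text, period_to_seconds() makes it averageable
  summarise(meanSecs = mean(period_to_seconds(hms(AlbumHours)), na.rm = TRUE)) %>%
  mutate(meanSecs = as.integer(round(meanSecs)),
         # format whole seconds back to hh:mm:ss for the label
         hrsMean = sprintf("%02d:%02d:%02d",
                           meanSecs %/% 3600L,
                           (meanSecs %% 3600L) %/% 60L,
                           meanSecs %% 60L),
         label_hrsMean = paste0("Mean of Hours: ", hrsMean))

meanHours$label_hrsMean
```

Keeping the mean as a plain number of seconds also makes it easy to use as a position on a plot axis.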
I have the following data frame:
DF<-data.frame(stringsAsFactors = TRUE,
Sample = c(rep("s1",4),rep("s2",4)),
date = c("21/07/2020","24/07/2020","25/07/2020","27/07/2020",
"03/08/2020","06/08/2020","09/08/2020","10/08/2020"))
First I want to obtain the number of days between consecutive dates by the factor "Sample". so the output would be like this:
DF_2<-data.frame(stringsAsFactors = TRUE,
Sample = c(rep("s1",4),rep("s2",4)),
date = c("21/07/2020","24/07/2020","25/07/2020","27/07/2020",
"03/08/2020","06/08/2020","09/08/2020","10/08/2020"),
days = c(NA,3,1,2,NA,3,3,1))
Where variable "days" is my outcome variable.
Afterwards I want to add all those "days" by factor. But that is easy, will do it like this:
df_3<-aggregate(days~Sample,DF_2,sum)
I would much appreciate it if someone helps me to get right first step, to get DF_2.
We can convert the 'date' column to Date class with dmy() and then use diff() within each Sample to get the number of days between consecutive dates:
library(dplyr)
library(lubridate)
DF1 <- DF %>%
  mutate(date = dmy(date)) %>%
  group_by(Sample) %>%
  mutate(days = c(NA, diff(date))) %>%
  ungroup()
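If only the per-Sample total from the second step of the question is needed, a sketch like the following could fold both steps into one `summarise`. The `totals` name is my own; the data is the DF from the question.

```r
library(dplyr)
library(lubridate)

DF <- data.frame(
  Sample = c(rep("s1", 4), rep("s2", 4)),
  date = c("21/07/2020", "24/07/2020", "25/07/2020", "27/07/2020",
           "03/08/2020", "06/08/2020", "09/08/2020", "10/08/2020")
)

totals <- DF %>%
  mutate(date = dmy(date)) %>%
  group_by(Sample) %>%
  # diff() on Dates gives day gaps; as.numeric() drops the difftime class
  summarise(days = sum(as.numeric(diff(date))), .groups = "drop")

totals
```

Since the gaps are consecutive, each total is also just the last date minus the first date within a Sample.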
I'm working through this example. However, when I inspect the tk_ts output, I don't think it is picking up the start/end dates I'm entering, and I'm unsure what the proper input is if I want the series to start at 12-31-2019 and end at 7-17-2020:
daily_cases2 <- as_tibble(countrydatescases) %>%
mutate(Date = as_date(date)) %>%
group_by(country, Date) %>%
summarise(total_cases = sum(total_cases))
daily_cases2$total_cases <- as.double(daily_cases2$total_cases)
# Nest
daily_cases2_nest <- daily_cases2 %>%
group_by(country) %>%
tidyr::nest()
# TS
daily_cases2_ts <- daily_cases2_nest %>%
mutate(data.ts = purrr::map(.x = data,
.f = tk_ts,
select = -Date,
start = 2019-12-31,
freq = 1))
Here is what I get when I examine it closely:
When I go through the example steps with these parameters the issue is also then seen in the subsequent graph:
I've tried varying the frequency and start parameters and it's just not making sense. Any suggestions?
You've given a start date, but the frequency you passed doesn't describe daily data. Given that you want the series to start at the end of 2019 and end in the middle of July 2020, I'm guessing you want a daily time series. In that case, the code should be:
daily_cases2_ts <- daily_cases2_nest %>%
mutate(data.ts = purrr::map(.x = data,
.f = tk_ts,
select = -Date,
start = c(2019, 365), # day 365 of year 2019
freq = 365)) # daily series
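To see what these start values mean, here is a small sketch using base R's `ts()` rather than `tk_ts`. The 200-value series is invented for illustration; it happens to be the number of days from 2019-12-31 through 2020-07-17 inclusive.

```r
# Unquoted, 2019-12-31 is plain arithmetic, not a date
start_as_typed <- 2019-12-31   # 2019 minus 12 minus 31

# Hypothetical daily series of 200 values, purely for illustration
x <- ts(seq_len(200), start = c(2019, 365), frequency = 365)

start_as_typed
start(x)   # year and day-of-year of the first observation
end(x)     # year and day-of-year of the last observation
```

One caveat: with a fixed frequency of 365 the day index is only approximate across leap years such as 2020, so treat it as a position within the year rather than an exact calendar date.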
Lately, for everything I manage on my own, there still seems to be a point I need to clarify on this board, so thanks in advance for your help.
The issue is as follows:
library(data.table)
set.seed(1)
Data <- data.frame(
  Month = c(1,1,2,2,3,3,4,3,4,5),
  Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
  Code_ID_Sell = c("98C","99C","98C","99C","98C","96V","25A","25A","25A","25A"),
  Contract_Size = c(100,20,120,300,120,30,25,60,80,90) # one value per row (10 rows)
)
setDT(Data) # := needs a data.table, not a plain data.frame
Data[, totalcont := sum(Contract_Size), by = c("Code_ID_Buy", "Month")]
View(Data)
I would like to calculate the difference in total contract size from one period to the next. Does anybody know whether this is possible? I have been working on it for weeks without finding a solution...
Kind regards,
This should solve it
library(tidyverse)
library(data.table)
set.seed(1)
Data <- data.frame(
Month = c(1,1,2,2,3,3,4,3,4,5),
Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
Code_ID_Sell = c("98C","99C","98C","99C","98C","96V","25A","25A","25A","25A"),
Contract_Size = c(100,20,120,300,120,30,25,60,80,90)
)
Data %>%
group_by(Month) %>%
summarise(total_contract = sum(Contract_Size)) %>%
mutate(dif_contract = total_contract -lag(total_contract))
If you need to compute the difference within another variable (here, the buyer), add it to the grouping:
Data %>%
group_by(Code_ID_Buy,Month) %>%
summarise(total_contract = sum(Contract_Size)) %>%
mutate(lag = total_contract -lag(total_contract))
If you need to keep the original row structure, a left_join back onto the data may be what you want:
result <- Data %>%
  group_by(Code_ID_Buy, Month) %>%
  summarise(total_contract = sum(Contract_Size)) %>%
  mutate(lag = total_contract - lag(total_contract))
Data %>%
  left_join(result, by = c("Code_ID_Buy", "Month"))
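Since the question started out in data.table, the same idea can probably be sketched there too, using `shift()` in place of `lag()`. The `totals` and `dif_contract` names are illustrative; the values are the first ten from the question.

```r
library(data.table)

Data <- data.table(
  Month = c(1, 1, 2, 2, 3, 3, 4, 3, 4, 5),
  Code_ID_Buy = c("100D", "100D", "100D", "100D", "100D", "100D", "100D",
                  "102D", "102D", "102D"),
  Contract_Size = c(100, 20, 120, 300, 120, 30, 25, 60, 80, 90)
)

# Total per buyer and month; keyby also sorts by buyer, then month
totals <- Data[, .(total_contract = sum(Contract_Size)),
               keyby = .(Code_ID_Buy, Month)]

# Change versus the previous month, restarting for each buyer
totals[, dif_contract := total_contract - shift(total_contract),
       by = Code_ID_Buy]

totals
```

The first month of each buyer comes out as NA because there is no earlier period to compare against.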
I am working with JHU data on coronavirus infections, and I'm trying to compute new cases (and deaths) by group. Here's the code:
base <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-"
world.confirmed <- read.csv(paste0(base,"Confirmed.csv"), sep=',', head=T)
world.confirmed <- gather( world.confirmed, Date, Cases, X1.22.20:X3.21.20)
world.deaths <- read.csv(paste0(base,"Deaths.csv"), sep=',', head=T)
world.deaths <- gather(world.deaths, Date, Deaths, X1.22.20:X3.21.20)
world.data <- merge(world.confirmed, world.deaths,
by=c("Province.State","Country.Region","Lat", "Long", "Date"))
world.data$Date <- as.Date(world.data$Date, "X%m.%d.%y")
world.data <- world.data %>%
group_by(Province.State,Country.Region,Date) %>%
arrange(Province.State, Country.Region, as.Date(Date))
Following solutions to this question in SO I have tried to compute differences by group using something like this:
world.data <- world.data %>%
group_by(Lat,Long) %>%
mutate(New.Cases = Cases - lag(Cases))
That does not work, however, and neither does any other grouping I have tried. Here are the results at the boundary between the first two countries:
I have also tried inserting an arrange step, and even tried zeroing the first element of each group. Same problem. Any idea?
Update: I'm using R 3.4.4 and dplyr 0.8.5.
This might help:
library(dplyr)
world.data %>%
mutate(Date = as.Date(Date, "X%m.%d.%y")) %>%
arrange(Country.Region, Lat, Long, Date) %>%
group_by(Country.Region, Lat, Long) %>%
mutate(New_Cases = Cases - lag(Cases),
New_deaths = Deaths - lag(Deaths))
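We arrange the data by Date within each location, then compute New_Cases as each day's Cases minus the previous day's Cases within each Country.Region/Lat/Long group, and do the same for deaths. The first row of each group is NA because it has no previous day.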
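Here is a small sketch on invented data to show the expected shape of the result. The `toy` data frame is an assumption, and the explicit `dplyr::` prefixes are only a precaution in case another attached package (plyr, for example) masks `mutate` or `lag`, which can make the grouping appear to be ignored.

```r
library(dplyr)

# Hypothetical cumulative counts for two made-up countries
toy <- data.frame(
  Country.Region = rep(c("A", "B"), each = 3),
  Date = rep(as.Date(c("2020-01-22", "2020-01-23", "2020-01-24")), times = 2),
  Cases = c(1, 3, 6, 10, 10, 15)
)

new_cases <- toy %>%
  dplyr::arrange(Country.Region, Date) %>%
  dplyr::group_by(Country.Region) %>%
  # lag() restarts inside each group, so the first day of each country is NA
  dplyr::mutate(New_Cases = Cases - dplyr::lag(Cases)) %>%
  dplyr::ungroup()

new_cases
```

If the NA at the start of each group is unwanted, `dplyr::lag(Cases, default = 0)` would instead count the first day's cumulative total as new cases.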