I have cleaned and ordered my data by date, and it looks like this:
df1 <- data.frame(matrix(vector(),ncol=4, nrow = 3))
colnames(df1) <- c("Date","A","B","C")
df1[1,] <- c("2000-01-30","0","1","0")
df1[2,] <- c("2000-01-31","2","0","3")
df1[3,] <- c("2000-02-29","1","2","1")
df1[4,] <- c("2000-03-31","2","1","3")
df1
Date A B C
1 2000-01-30 0 1 0
2 2000-01-31 2 0 3
3 2000-02-29 1 2 1
4 2000-03-31 2 1 3
However, I want to drop the day and order the data by month and year so the data will look like:
Date A B C
1 2000-01 2 1 3
3 2000-02 1 2 1
4 2000-03 2 1 3
I tried as.yearmon from zoo, df2 <- as.yearmon(df1$Date, "%b-%y"), but it returns NA. Thank you in advance for your generous help!
Here's a way to get the sum of the values for each column within each combination of Year-Month:
library(zoo)
library(dplyr)
# Convert non-date columns to numeric
df1[,-1] = lapply(df1[,-1], as.numeric)
df1 %>% mutate(Date = as.yearmon(Date)) %>%
group_by(Date) %>%
summarise_each(funs(sum))
Or, even shorter:
df1 %>%
group_by(Date=as.yearmon(Date)) %>%
summarise_each(funs(sum))
Date A B C
1 Jan 2000 2 1 3
2 Feb 2000 1 2 1
3 Mar 2000 2 1 3
A couple of additional enhancements:
Add the number of rows for each group:
df1 %>% group_by(Date=as.yearmon(Date)) %>%
summarise_each(funs(sum)) %>%
bind_cols(df1 %>% count(d=as.yearmon(Date)) %>% select(-d))
Multiple summary functions:
df1 %>% group_by(Date=as.yearmon(Date)) %>%
summarise_each(funs(sum(.), mean(.))) %>%
bind_cols(df1 %>% count(d=as.yearmon(Date)) %>% select(-d))
Date A_sum B_sum C_sum A_mean B_mean C_mean n
1 Jan 2000 2 1 3 1 0.5 1.5 2
2 Feb 2000 1 2 1 1 2.0 1.0 1
3 Mar 2000 2 1 3 2 1.0 3.0 1
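In more recent dplyr releases, summarise_each() and funs() are deprecated in favour of across(). Below is a minimal sketch of the same summary in that style, assuming dplyr 1.0.0 or later. It rebuilds df1 with numeric columns so the block runs on its own, and the name monthly is only for illustration. Note that the summary columns come out grouped per variable (A_sum, A_mean, ...) rather than per function.

```r
library(dplyr)
library(zoo)

# Same data as above, built with numeric columns from the start
df1 <- data.frame(
  Date = c("2000-01-30", "2000-01-31", "2000-02-29", "2000-03-31"),
  A = c(0, 2, 1, 2),
  B = c(1, 0, 2, 1),
  C = c(0, 3, 1, 3)
)

monthly <- df1 %>%
  group_by(Date = as.yearmon(Date)) %>%
  summarise(
    # everything() skips the grouping column, so only A, B and C are summarised
    across(everything(), list(sum = sum, mean = mean)),
    # number of rows that fell into each month
    n = n()
  )
monthly
```

Putting n = n() after across() keeps the count column out of the everything() selection.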
Your Date column is a character vector, but it needs to be a Date vector. Convert it first:
df1$Date <- as.Date(df1$Date)
df1$Date <- as.yearmon(df1$Date)
Date A B C
1 Jan 2000 0 1 0
2 Jan 2000 2 0 3
3 Feb 2000 1 2 1
4 Mar 2000 2 1 3
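The yearmon values above print as "Jan 2000", while the question asks for labels such as "2000-01". One possible base R sketch, assuming plain character labels are acceptable and that the per-month sum is the aggregate wanted, is shown here. The name monthly is illustrative and df1 is rebuilt with numeric columns so the block is self-contained.

```r
df1 <- data.frame(
  Date = c("2000-01-30", "2000-01-31", "2000-02-29", "2000-03-31"),
  A = c(0, 2, 1, 2),
  B = c(1, 0, 2, 1),
  C = c(0, 3, 1, 3)
)

# Drop the day by formatting each date as a "YYYY-MM" string
df1$Date <- format(as.Date(df1$Date), "%Y-%m")

# Sum A, B and C within each year-month
monthly <- aggregate(cbind(A, B, C) ~ Date, data = df1, FUN = sum)
monthly
```

Because the labels are zero-padded, sorting them as text keeps them in chronological order.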
Related
Suppose I have the following dataset:
id1 <- c(1,1,1,1,2,2,2,2,1,1,1,1)
dates <- c("a","a","a","a","b","b","b","b","c","c","c","c")
x <- c(NA,0,NA,NA,NA,NA,0,NA,NA,NA,NA,0)
df <- data.frame(id1,dates,x)
My objective is to have a new column that counts the position of each observation relative to the 0 within every combination of id1 and dates. This would yield the following outcome:
desired_result <- c(-1,0,1,2,-2,-1,0,1,-3,-2,-1,0)
Any help is appreciated.
library(dplyr)
df %>%
group_by(id1, dates) %>%
mutate(x = row_number() - which(x == 0))
id1 dates x
1 1 a -1
2 1 a 0
3 1 a 1
4 1 a 2
5 2 b -2
6 2 b -1
7 2 b 0
8 2 b 1
9 1 c -3
10 1 c -2
11 1 c -1
12 1 c 0
With dplyr 1.1.0 or later, the grouping can be given inline through the .by argument:
df %>%
mutate(x = row_number() - which(x == 0), .by = c(id1, dates))
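Both versions rely on which(x == 0) returning exactly one index per group, so they assume every id1/dates combination contains exactly one 0. Under that same assumption, a base R sketch with ave() could look like this. The helper offset_from_zero and the column name pos are made up for illustration.

```r
id1 <- c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1)
dates <- c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c")
x <- c(NA, 0, NA, NA, NA, NA, 0, NA, NA, NA, NA, 0)
df <- data.frame(id1, dates, x)

# i holds the row indices of one group; subtract the within-group
# position of the 0 from each within-group row number
offset_from_zero <- function(i) seq_along(i) - which(df$x[i] == 0)

df$pos <- ave(seq_len(nrow(df)), df$id1, df$dates, FUN = offset_from_zero)
df$pos
```

which() ignores the NA comparisons, so only the row holding the 0 is picked up.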
I am trying to expand on the answer to a solved problem, "Take Sum of a Variable if Combination of Values in Two Other Columns are Unique",
but because I am new to Stack Overflow I can't comment directly on that post, so here is my problem:
I have a dataset like the following but with about 100 columns of binary data as shown in "ani1" and "bni2" columns.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I am attempting to sum all the columns by location and season: I want a total for column 3 and every column after it, for each unique combination of location and season.
The problem is not all the columns have a 1 value for every combination of location and season and they all have different names.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for(i in 3:length(df)){
testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
df2 <- aggregate(i~., testdf, FUN=sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!
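You can use dplyr::summarise and across after group_by. Inside a grouped summarise, everything() selects every non-grouping column, so the differing column names do not matter.
library(dplyr)
df %>%
group_by(Locations, seasons) %>%
summarise(across(everything(), ~sum(.x, na.rm = TRUE))) %>%
ungroup()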
Another option is to reshape the data to long format using functions from the tidyr package. This avoids the issue of having to select columns 3 onwards.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(Locations, seasons)) %>%
group_by(Locations, seasons, name) %>%
summarise(Sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
Locations seasons ani1 bni2
<chr> <chr> <dbl> <dbl>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
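As for the error in the loop: in aggregate(i ~ ., testdf, FUN = sum), i is not a column of testdf, so R picks up the loop counter (a single number) and its length does not match the 11 rows, hence "variable lengths differ". A base R sketch that avoids the loop altogether, using the formula shorthand where "." stands for every column not named on the right-hand side, might look like this (the name totals is illustrative):

```r
Locations <- c("A", "A", "A", "A", "B", "B", "C", "C", "D", "D", "D")
seasons <- c("2", "2", "3", "4", "2", "3", "1", "2", "2", "4", "4")
ani1 <- c(1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0)
bni2 <- c(0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1)
df <- data.frame(Locations, seasons, ani1, bni2)

# Sum every remaining column within each Locations/seasons combination
totals <- aggregate(. ~ Locations + seasons, data = df, FUN = sum)

# aggregate() returns rows ordered by seasons first, so re-sort for display
totals <- totals[order(totals$Locations, totals$seasons), ]
rownames(totals) <- NULL
totals
```

One caveat: the formula interface drops rows containing NA by default, which differs from the na.rm = TRUE behaviour of the dplyr versions when the data has missing values.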
I'm a complete beginner to R and I just need to do some quick cleaning of my data. But I ran into a problem I can't wrap my head around.
So I have a Postgres db with time series. The columns are ID, DATE and VALUE (temperature). Each ID is a separate measuring station, so I have one time series per ID (around 2000 unique IDs, 4m rows). The dates span 1915 to 2016; some series overlap and some do not. If the measurement for a week is missing, I want to fill that week with an NA value (which I interpolate afterwards).
The problem I run into is that complete(Date.seq) creates NA values for all weeks between 1915 and 2016, and I understand why that happens. How can I make it fill values only between the actual start and end date of each specific time series? I want the min and max to depend on the start date and end date of each ID, and then fill the missing dates between that start and end date.
library("RPostgreSQL")
library("tidyverse")
library("lubridate")
con <- dbConnect(PostgreSQL(), user = "postgres",
dbname="", password = "", host = "localhost", port= "5432")
out <- dbGetQuery(con, "SELECT * FROM *******.Weekly_series")
out %>%
group_by(ID)%>%
mutate(DATE = as.Date(DATE)) %>%
complete(DATE = seq(ymd("1915-04-14"), ymd("2016-03-30"), by= "week"))
Ignore errors in the connect line.
Thanks in advance.
Edit1
Sample data
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Expected output
ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-22 NA
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-08 NA
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-08 NA
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1
Using the data you provided, this works. I am not sure why it works when your full code does not; possibly the data structure in your code is not what complete expects, in which case something like out <- tibble::as_tibble(out) might help. My other guess is that complete is being picked up from a different package than the one you need. Calling tidyr::complete explicitly works on the sample.
library(lubridate)
library(dplyr)
library(tidyr)
a <- "ID DATE VALUE
1 2015-10-01 1
1 2015-10-08 1
1 2015-10-15 1
1 2015-10-29 1
2 1956-01-01 1
2 1956-01-15 1
2 1956-01-22 1
3 1982-01-01 1
3 1982-01-15 1
3 1982-01-22 1
3 1982-01-29 1"
df <- read.table(text = a, header = TRUE)
big_df1 <- df %>%
filter(ID == 1)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df2 <- df %>%
filter(ID == 2)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df3 <- df %>%
filter(ID == 3)%>%
mutate(DATE = as.Date(DATE)) %>%
tidyr::complete(DATE = seq(ymd(min(DATE)), ymd(max(DATE)), by= "week"))
big_df <- rbind(big_df1, big_df2, big_df3)
big_df
DATE ID VALUE
<date> <int> <int>
1 2015-10-01 1 1
2 2015-10-08 1 1
3 2015-10-15 1 1
4 2015-10-22 NA NA
5 2015-10-29 1 1
6 1956-01-01 2 1
7 1956-01-08 NA NA
8 1956-01-15 2 1
9 1956-01-22 2 1
10 1982-01-01 3 1
11 1982-01-08 NA NA
12 1982-01-15 3 1
13 1982-01-22 3 1
14 1982-01-29 3 1
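The per-ID filter and rbind above do not scale to around 2000 stations, and the filled rows end up with NA in ID. A possible sketch that handles all IDs in one pass, assuming a tidyr version in which complete() works within the groups set by group_by(), is shown below. The name filled is illustrative and the sample data is rebuilt so the block is self-contained.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  DATE = as.Date(c(
    "2015-10-01", "2015-10-08", "2015-10-15", "2015-10-29",
    "1956-01-01", "1956-01-15", "1956-01-22",
    "1982-01-01", "1982-01-15", "1982-01-22", "1982-01-29"
  )),
  VALUE = 1
)

filled <- df %>%
  group_by(ID) %>%
  # min() and max() are evaluated per ID, so each series is only
  # completed between its own first and last date
  complete(DATE = seq(min(DATE), max(DATE), by = "week")) %>%
  ungroup()
filled
```

Because ID is the grouping column, it is carried over to the inserted weeks and only VALUE is NA there.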
I have following panel data:
firmid date return
1 1 1
1 2 1
1 3 1
2 2 2
2 3 2
3 1 2
3 3 2
I want to transform this from long to wide format, but only for date 1, so that it looks like this:
firmid return in date=1
1 1
3 2
I appreciate any advice!
df <- read.table(header = T, text = "firmid date return
1 1 1
1 2 1
1 3 1
2 2 2
2 3 2
3 1 2
3 3 2")
Base R solution:
df <- df[df$date == 1, ]
df$date <- NULL
df
firmid return
1 1 1
6 3 2
data.table solution:
library(data.table)
setDT(df)
df <- df[date == 1, ]
df[, date := NULL]
firmid return
1: 1 1
2: 3 2
You can use dplyr to achieve it too:
library(dplyr)
df2 <- df %>%
filter(date == 1) %>%
select(-date)
# firmid return
#1 1 1
#2 3 2
A different dplyr solution that allows you to have multiple values of return within firmid:
df %>%
filter(date == 1) %>%
group_by(firmid, return) %>%
summarise()
I have a large dataset that looks like this:
df1 <- data.frame(matrix(vector(),ncol=4, nrow = 3))
colnames(df1) <- c("Date","A","A","C")
df1[1,] <- c("2000-01-30","0","1","0")
df1[2,] <- c("2000-01-31","2","0","3")
df1[3,] <- c("2000-02-29","1","2","1")
df1[4,] <- c("2000-03-31","2","1","3")
df1
Date A A C
1 2000-01-30 0 1 0
2 2000-01-31 2 0 3
3 2000-02-29 1 2 1
4 2000-03-31 2 1 3
I'm trying to get:
Date A A C
1 2000-01 2 1 3
2 2000-02 1 2 1
3 2000-03 2 1 3
Here is what I did:
library(zoo)
library(dplyr)
df1[,-1] = lapply(df1[,-1], as.numeric)
df1 %>% mutate(Date = as.yearmon(Date)) %>%
group_by(Date) %>%
summarise_each(funs(sum))
This works great when there are no duplicate column names. However, given the size of the data, some of the columns may share a name, which causes the error "found duplicated column name: A". I do not want to combine the columns; I want the result shown above. Please advise.
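One possible workaround, sketched here under the assumption that the duplicated names only need to be unique while the summary runs: rename the columns temporarily with make.unique(), aggregate, then restore the original names. This version uses base R aggregate() rather than dplyr, and the names orig_names and monthly are illustrative.

```r
# check.names = FALSE keeps the duplicated column name "A"
df1 <- data.frame(
  Date = c("2000-01-30", "2000-01-31", "2000-02-29", "2000-03-31"),
  A = c(0, 2, 1, 2),
  A = c(1, 0, 2, 1),
  C = c(0, 3, 1, 3),
  check.names = FALSE
)

orig_names <- names(df1)
names(df1) <- make.unique(orig_names)  # "Date", "A", "A.1", "C"

# Year-month key, then sum every other column within each month
df1$Date <- format(as.Date(df1$Date), "%Y-%m")
monthly <- aggregate(. ~ Date, data = df1, FUN = sum)

# Column order is unchanged, so the original names can be put back
names(monthly) <- orig_names
monthly
```

Restoring duplicate names makes later selection by name ambiguous, so it may be simpler to keep the unique names until the very end.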