Having aggregated data - wanna have data for each element [duplicate] - r

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
Hi,
My aim is to draw a histogram.
For that I need unaggregated data, but unfortunately I only have it in aggregated form.
My data:
tribble(~date,~groupsize,
"2020-09-01",3,
"2020-09-02",2,
"2020-09-03",1,
"2020-09-04",2)
I want to have:
tribble(~date,~n,
"2020-09-01",1,
"2020-09-01",1,
"2020-09-01",1,
"2020-09-02",1,
"2020-09-02",1,
"2020-09-03",1,
"2020-09-04",1,
"2020-09-04",1)
I think this is really simple, but I am at a loss. Sorry for that!
What can I do? I really like dplyr solutions :-)
Thank you!

Repeat each date according to its groupsize (here dat is the data frame from the question):
res <- data.frame(date=rep(dat$date, dat$groupsize), n=1)
res
# date n
# 1 2020-09-01 1
# 2 2020-09-01 1
# 3 2020-09-01 1
# 4 2020-09-02 1
# 5 2020-09-02 1
# 6 2020-09-03 1
# 7 2020-09-04 1
# 8 2020-09-04 1
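If the data frame has more columns than just the date, the same idea can be applied to row indices instead of a single column. Below is a minimal base R sketch under that assumption; the name dat is taken from the answer above, and the data are rebuilt as a plain data frame. If the tidyverse is available, tidyr::uncount() is a commonly used alternative for the same task.

```r
# The question's data, rebuilt as a plain data frame (the name `dat` is assumed).
dat <- data.frame(
  date = c("2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"),
  groupsize = c(3, 2, 1, 2)
)

# Repeat each row index `groupsize` times, then subset the rows by those indices.
idx <- rep(seq_len(nrow(dat)), times = dat$groupsize)
res <- dat[idx, , drop = FALSE]

# Replace the count column by a constant n = 1 and reset the row names.
res$groupsize <- NULL
res$n <- 1
rownames(res) <- NULL

nrow(res)        # one row per original element
table(res$date)  # per-date counts equal the original groupsize
```

Because every column is carried along by the row subset, this form may be easier to reuse than rep() on a single column.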

Related

How to use Sys.Date() To Extract Current Year? [duplicate]

This question already has answers here:
How can I get the extract the previous year (2020) using Sys.Date()?
(2 answers)
Closed 1 year ago.
I have manually split my dataset (discrete_8) into two separate datasets (data and data2). data contains the rows from the current year (2021), whereas data2 contains the rows from previous years. The split is hard-coded to 2021, and I want to automate it so that when 2022 comes I do not have to edit the script to change 2021 to 2022. Should I use Sys.Date() to get the current year? How would I incorporate Sys.Date() to partition the dataset?
Here is my code so far, where I partition the dataset:
data <- discrete_8 %>% filter(PS_DATE >= as.POSIXct("2021-01-01"))#current year
data2 <- discrete_8 %>% filter(PS_DATE < as.POSIXct("2021-01-01"))#past years
Here is what discrete_8 looks like:
X PS_DATE PS_NAME Control.Parameters.Cell.Return.Flow.Rate Control.Parameters.Harvest.Flow.Rate Control.Parameters.Microsparger.Total.Gas.Flow.Rate
1 0 2014-02-06 123 NA NA 1
2 1 2014-02-07 124 NA NA 1
3 2 2014-02-08 125 NA NA 1
4 3 2014-02-09 126 1.5 NA 1
5 4 2014-02-10 127 1.5 NA 1
6 5 2014-02-11 128 1.5 NA 1
There is a somewhat tedious bug still present in released R: trunc(Sys.Date(), "year") does not give you Jan 01 of the current year, although it does in R-devel.
But you can build yourself a helper such as this:
> firstDay <- function() { d <- Sys.Date(); d - as.POSIXlt(d)$yday }
> firstDay()
[1] "2021-01-01"
You can use that value in the comparison. (Also, in the code you posted, as.Date() is simpler than as.POSIXct(), since you ignore hours/minutes/seconds here.)
Another option is the lubridate::floor_date() function:
lubridate::floor_date(Sys.Date(), unit = "years")
[1] "2021-01-01"
I use substr(Sys.Date(),1,4) to get the current year. In your code you can replace as.POSIXct("2021-01-01") with
as.POSIXct(paste0(substr(Sys.Date(),1,4),"-01-01"))
This gives January 1st of the current year in your datetime format.
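Putting the answers together, the partition itself could look roughly like the sketch below. It assumes PS_DATE is (or can be converted to) class Date; the small discrete_8 stand-in, the helper name year_start, and the fixed "today" value are hypothetical and only there to make the example reproducible.

```r
# Hypothetical stand-in for discrete_8, reduced to the date column.
discrete_8 <- data.frame(
  PS_DATE = as.Date(c("2014-02-06", "2020-12-31", "2021-01-01", "2021-03-15"))
)

# First day of the year containing `today`.
# format() fills in the year and keeps the literal "-01-01".
year_start <- function(today = Sys.Date()) {
  as.Date(format(today, "%Y-01-01"))
}

# A fixed date is passed here for reproducibility; call year_start() with no
# argument in a real script so the cutoff follows the calendar.
cutoff <- year_start(as.Date("2021-06-30"))

data  <- discrete_8[discrete_8$PS_DATE >= cutoff, , drop = FALSE]  # current year
data2 <- discrete_8[discrete_8$PS_DATE <  cutoff, , drop = FALSE]  # past years
```

The same cutoff can be used inside dplyr::filter() exactly as in the question's original code.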

Get first n rows for each date in a dataframe [duplicate]

This question already has answers here:
Selecting top N rows for each group based on value in column
(4 answers)
Closed 3 years ago.
I am currently trying to subset the first n-observations for each date in my dataset. Let's say n=2 for example purposes. This is what the data set looks like:
Date Measure
2019-02-01 5
2019-02-01 4
2019-02-01 3
2019-02-01 6
… …
2019-02-02 5
2019-02-02 5
2019-02-02 2
… …
I would like to see this output:
Date Measure
2019-02-01 5
2019-02-01 4
2019-02-02 5
2019-02-02 5
… …
Unfortunately, this is not something I am able to do with definitions. I am dealing with over 10 million rows of data, so the solution needs to be dynamic to make the selection of n for each unique date.
An option is to group by 'Date' and slice the sequence of 'n' rows:
library(dplyr)
n <- 2
df1 %>%
  group_by(Date) %>%
  slice(seq_len(n))
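For comparison, here is a base R sketch of the same selection that numbers the rows within each date and keeps those numbered at most n. The data frame df1 below is a small stand-in built from the visible rows of the question (the elided rows are left out), and the variable names pos and first_n are illustrative only.

```r
# Stand-in data built from the rows shown in the question.
df1 <- data.frame(
  Date = c("2019-02-01", "2019-02-01", "2019-02-01", "2019-02-01",
           "2019-02-02", "2019-02-02", "2019-02-02"),
  Measure = c(5, 4, 3, 6, 5, 5, 2)
)

n <- 2

# Position of each row within its Date group (1, 2, 3, ... per date).
pos <- ave(seq_along(df1$Date), df1$Date, FUN = seq_along)

# Keep only the first n rows of every date, in their original order.
first_n <- df1[pos <= n, ]
first_n
```

Whether this or the grouped slice() is faster on 10 million rows would need to be measured on the real data.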

how to extract the value from multiple columns in a specific order [duplicate]

This question already has answers here:
Get Value of last non-empty column for each row [duplicate]
(3 answers)
Closed 4 years ago.
I have this dataset that contains variables from three previous years.
data <- read.table(text="
a 2015 2016 2017
1 100 100 100
2 1000 5 NA
3 10000 NA NA", header=TRUE)
I would like to create a new column in my data which contains the value from the most recent year. The order is 2017 ->2016 ->2015.
output <- read.table(text="
a 2015 2016 2017 recent
1 100 100 100 100
2 1000 5 NA 5
3 10000 NA NA 10000", header=TRUE)
I know that I can use "if" command to achieve it, but I am wondering if there is a quick and simple way to do it.
Thanks!
Here's a simple base R solution. It assumes that the year columns are sorted from left to right, oldest first.
data$recent <- apply(data, 1, function(x) tail(na.omit(x), 1))
a X2015 X2016 X2017 recent
1 1 100 100 100 100
2 2 1000 5 NA 5
3 3 10000 NA NA 10000
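A vectorised variant of the same idea avoids the row-wise apply() by locating the last non-NA column with max.col(). This is a sketch that assumes, as above, that the year columns are ordered oldest to newest and that the first column (a) is an identifier to be skipped; the names year_cols and last_idx are illustrative.

```r
data <- read.table(text = "
a 2015 2016 2017
1 100 100 100
2 1000 5 NA
3 10000 NA NA", header = TRUE)

# Year columns only, as a matrix (read.table renames them X2015, X2016, X2017).
year_cols <- as.matrix(data[, -1])

# For each row, the index of the right-most column that is not NA.
last_idx <- max.col(!is.na(year_cols), ties.method = "last")

# Pick one cell per row using (row, column) index pairs.
data$recent <- year_cols[cbind(seq_len(nrow(year_cols)), last_idx)]
data
```

A row in which every year is NA ends up with NA in recent, since the selected cell is itself NA.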

How to create a step-by-step cumulation of data? [duplicate]

This question already has answers here:
Calculating cumulative sum for each row
(6 answers)
Closed 7 years ago.
My question is probably very basic, but I couldn't find an easy solution. We have a data.frame without the (overall) column. The overall column should hold the cumulative number of pies (in my case) eaten up to a given time period. What is the easiest way to create it in R for an arbitrary number of rows? Thanks!
Year Pies eaten Pies eaten(overall)
1 1960 3 3
2 1961 2 5
3 1962 5 10
4 1963 1 11
5 1964 7 18
6 1965 4 22
We can use cumsum:
df1$Pies_eaten_Overall <- cumsum(df1$Pies_eaten)
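As a quick check, the call can be run on the six rows shown in the question. The column names Pies_eaten and Pies_eaten_Overall follow the answer above; the question's own table uses names with spaces, so the data frame here is an assumed reconstruction.

```r
# Reconstruction of the question's table with syntactic column names.
df1 <- data.frame(
  Year = 1960:1965,
  Pies_eaten = c(3, 2, 5, 1, 7, 4)
)

# Running total: each entry is the sum of all values up to and including that row.
df1$Pies_eaten_Overall <- cumsum(df1$Pies_eaten)
df1$Pies_eaten_Overall  # 3 5 10 11 18 22, matching the question's table
```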

R - How to sum a column based on date range? [duplicate]

This question already has an answer here:
R // Sum by based on date range
(1 answer)
Closed 7 years ago.
Suppose I have df1 like this:
Date Var1
01/01/2015 1
01/02/2015 4
....
07/24/2015 1
07/25/2015 6
07/26/2015 23
07/27/2015 15
Q1: Sum of Var1 over the 3 days before 7/27/2015 (not including 7/27).
Q2: Sum of Var1 over the 3 days before 7/25/2015 (this is not the last row). Basically, I want to choose any day as the reference day and then calculate the rolling sum.
As suggested in one of the comments in the link referenced by @SeñorO, with a little bit of work you can use zoo::rollsum:
library(zoo)
set.seed(42)
df <- data.frame(d=seq.POSIXt(as.POSIXct('2015-01-01'), as.POSIXct('2015-02-14'), by='days'),
                 x=sample(20, size=45, replace=TRUE))
k <- 3
df$sum3 <- c(0, cumsum(df$x[1:(k-1)]),
             head(zoo::rollsum(df$x, k=k), n=-1))
df
## d x sum3
## 1 2015-01-01 16 0
## 2 2015-01-02 12 16
## 3 2015-01-03 15 28
## 4 2015-01-04 15 43
## 5 2015-01-05 17 42
## 6 2015-01-06 10 47
## 7 2015-01-07 11 42
The 0, cumsum(...) part pre-populates the first k rows, which have fewer than k preceding days: rollsum(x, k) returns a vector of length length(x)-k+1, and the result is shifted down by one more row. The head(..., n=-1) discards the last element, because you said that the nth entry should sum the previous 3 rows and not its own row.
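When only a few reference days are needed rather than a full rolling column, the two sums from the question can also be computed directly by filtering on the date. The sketch below is base R; the helper name sum_prev_days is hypothetical, and df1 contains only the last four rows shown in the question, since the rows in between are elided there.

```r
# Only the rows visible at the end of the question's table.
df1 <- data.frame(
  Date = as.Date(c("07/24/2015", "07/25/2015", "07/26/2015", "07/27/2015"),
                 format = "%m/%d/%Y"),
  Var1 = c(1, 6, 23, 15)
)

# Sum of Var1 over the k calendar days before `ref`, excluding `ref` itself.
sum_prev_days <- function(df, ref, k = 3) {
  ref <- as.Date(ref)
  in_window <- df$Date >= ref - k & df$Date < ref
  sum(df$Var1[in_window])
}

sum_prev_days(df1, "2015-07-27")  # 1 + 6 + 23 = 30
```

Unlike the rolling-sum column, this version works on calendar days, so missing dates in the data simply contribute nothing to the sum.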
