Extracting the frequencies in percentage - r

Here is a data frame with a numeric volume column:
df1 <- data.frame(class = c("a","b","a","b"), date = c(2009,2009,2010,2010), volume=c(1,1,2,0))
How can I convert the volume column to a percentage of the total within each year (date), across the different classes? The desired result is:
data.frame(class = c("a","b","a","b"), date = c(2009,2009,2010,2010), volumepercentage=c("50.00%","50.00%","100.00%","0.00%"))

Here is a base R approach:
df1.spl <- split(df1, df1$date)
df1.lst <- lapply(df1.spl, function(x) data.frame(x, pct=prop.table(x$volume)*100))
df2 <- do.call(rbind, df1.lst)
df2
# class date volume pct
# 2009.1 a 2009 1 50
# 2009.2 b 2009 1 50
# 2010.3 a 2010 2 100
# 2010.4 b 2010 0 0
Note the change in the row names. The command rownames(df2) <- NULL will reset them to the default 1, 2, 3, ...
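The same grouped proportion can also be computed without splitting and re-binding, which leaves the row names untouched. This is a minimal sketch, assuming the df1 from the question; the column names pct and volumepercentage are only illustrative:

```r
df1 <- data.frame(class = c("a", "b", "a", "b"),
                  date = c(2009, 2009, 2010, 2010),
                  volume = c(1, 1, 2, 0))

# ave() applies prop.table() within each date group and returns the
# results in the original row order
df1$pct <- ave(df1$volume, df1$date, FUN = prop.table) * 100

# format as text with two decimals and a percent sign
df1$volumepercentage <- sprintf("%.2f%%", df1$pct)
df1
```

A year whose volumes sum to zero would give NaN here, so that case may need separate handling.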

Related

Count the number of rows within a certain time range based on each row in R (tidyverse)

I want to count the number of rows within a certain time range based on each row after grouping by id. For instance, let us say a 1-month window around each datetime entry in the column "cleaned_date".
head(data$cleaned_date)
[1] "2004-10-11 CDT" "2008-09-10 CDT" "2011-10-25 CDT" "2011-12-31 CST"
The dates are in POSIXct format.
For the first entry, I need to count the number of rows within the time from 2004-09-11 to 2004-11-11, for the second entry, count the number of rows within the time from 2008-08-10 to 2008-10-10, so on and so forth.
I used roughly the following code
data %>% group_by(id) %>% filter(cleaned_date %within% interval(cleaned_date - 24 * 60 * 60 * 30, cleaned_date + 24 * 60 * 60 * 30)) %>% mutate(counts = n())
But it does not seem to work and I got counts as an empty column. Any help would be appreciated, thanks!
A reproducible example can be the following:
The input is
cleaned_date id
1 2008-09-11 A
2 2008-09-10 B
3 2008-09-30 B
4 2011-10-25 A
5 2011-11-14 A
And I want the output to be
cleaned_date id counts
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
4 2011-10-25 A 2
5 2011-11-14 A 2
For the first entry, I want to count the rows in the timeframe 2008-08-11 to 2008-10-11, the second entry seems to satisfy but we need to group by "id", so it does not count. For the second entry I want to count the rows in the timeframe 2008-08-10 to 2008-10-10, rows 2 and 3 satisfy, so the counts is 2. For the third entry I want to count the rows in the timeframe 2008-08-30 to 2008-10-30, rows 2 and 3 satisfy again, so on and so forth.
Note that the actual dataset I would like to operate on has millions of rows, so it might be more efficient to use tidyverse rather than base R.
Perhaps not the most elegant solution.
# input data. Dates as character vector
input = data.frame(
cleaned_date = c("2008-09-11", "2008-09-10", "2008-09-30", "2011-10-25", "2011-11-14"),
id = c("A", "B", "B", "A", "A")
)
# function to create a date window n months around specified date
window <- function(x, n = 1){
x <- rep(as.POSIXlt(x),2)
x[1]$mon <- x[1]$mon - n
x[2]$mon <- x[2]$mon + n
return(format(seq(from = x[1], to = x[2], by = "day"), format="%Y-%m-%d"))
}
# find counts for each row
input$counts <- unlist(lapply(1:nrow(input), function(x){
length(which((input$cleaned_date %in% window(input$cleaned_date[x])) & input$id == input$id[x]))
}))
input
cleaned_date id counts
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
4 2011-10-25 A 2
5 2011-11-14 A 2
Edit for large datasets:
# dummy dataset with 1,000,000 rows
years <- c(2000:2020)
months <- c(1:12)
days <- c(1:20)
n <- 1000000
dates <- paste(sample(years, size = n, replace = T), sample(months, size = n, replace = T), sample(days, size = n, replace = T), sep = "-")
groups <- sample(c("A","B","C"), size = n, replace = T)
input <- data.frame(
cleaned_date = dates,
id = groups
)
input$cleaned_date <- format(as.POSIXlt(input$cleaned_date), format="%Y-%m-%d")
# sort data by date; the loop below appends counts in date order, so this is needed for the counts to line up with the rows
input <- input[order(input$cleaned_date),]
counts <- NULL
#pb <- progress::progress_bar$new(total = length(unique(input$cleaned_date)))
t1 <- Sys.time()
# split up vectorization for each unique date.
for(date in unique(input$cleaned_date)){
#pb$tick()
w <- window(date)
tmp <- input[which(input$cleaned_date %in% w),]
tmp_counts <- unlist(lapply(which(tmp$cleaned_date == date), function(x){
length(which(tmp$id == tmp$id[x]))
}))
counts <- c(counts, tmp_counts)
}
# add counts to dataset
input$counts <- counts
# optional, re-order data to original format
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)
Time difference of 3.247204 mins
If you want to go faster, you can run the loop in parallel
library(foreach)
library(doParallel)
cores=detectCores()
cl <- makeCluster(cores[1]-1)
registerDoParallel(cl)
dates = unique(input$cleaned_date)
t1 <- Sys.time()
counts <- foreach(i=1:length(dates), .combine= "c") %dopar% {
w <- window(dates[i])
tmp <- input[which(input$cleaned_date %in% w),]
tmp_counts <- unlist(lapply(which(tmp$cleaned_date == dates[i]), function(x){
length(which(tmp$id == tmp$id[x]))
}))
tmp_counts
}
stopCluster(cl)
input$counts <- counts
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)
Time difference of 37.37211 secs
Note, I'm running this on a MacBook Pro with a 2.3 GHz Quad-Core Intel Core i7 and 16 GB of RAM.
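For millions of rows, another option is to avoid building a vector of date strings for every window. The sketch below is not the code timed above. It assumes a fixed window of 30 days either side (as in the question's own attempt) rather than calendar months, and count_in_window and half_width are hypothetical names. It sorts each id's dates once and uses findInterval() to count how many fall inside each row's window:

```r
input <- data.frame(
  cleaned_date = as.Date(c("2008-09-11", "2008-09-10", "2008-09-30",
                           "2011-10-25", "2011-11-14")),
  id = c("A", "B", "B", "A", "A")
)

# d: numeric dates (days since 1970-01-01) belonging to one id
count_in_window <- function(d, half_width = 30) {
  s <- sort(d)
  # number of dates <= upper edge of the window
  upper <- findInterval(d + half_width, s)
  # number of dates strictly below the lower edge of the window
  lower <- findInterval(d - half_width, s, left.open = TRUE)
  upper - lower
}

# ave() runs the function once per id and keeps the original row order
input$counts <- ave(as.numeric(input$cleaned_date), input$id,
                    FUN = count_in_window)
input
```

On the small example this reproduces the counts 1, 2, 2, 2, 2 from the question. Results can differ from a calendar-month window for dates that are 30 or 31 days apart.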
It is still hard to determine exactly what you're trying to accomplish, but this will at least get you counts for a specified date range:
df %>%
group_by(id) %>%
filter(cleaned_date >= "2008-08-11" & cleaned_date <= "2008-10-11") %>%
mutate(counts = n())
Will give us:
cleaned_date id counts
<date> <chr> <int>
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2

Create R dataframe with column A * column B [duplicate]

This question already has answers here:
Cartesian product data frame
(7 answers)
Cartesian product with dplyr
(7 answers)
Closed 4 years ago.
How can I create a data frame containing every combination of column A and column B?
For example, a column 'year' (2018 - 2025) and, for each year, a column 'week' from 1:52.
Basically, I want a nicer way to get this result:
a =data.table( c(2018) , c(1:52))
x <- c("year", "week")
colnames(a) <- x
b =data.table(c(2019) , c(1:52))
x <- c("year", "week")
colnames(b) <- x
c =data.table(c(2020) , c(1:52))
x <- c("year", "week")
colnames(c) <- x
d = rbind(a, b, c)
EDIT: Thanks!!
d <- expand.grid(year = c(2018:2020), week = c(1:52))
Use crossing from the tidyr package, something like:
library(tidyr)
library(data.table)
crossing(
data.table(year=2018:2020),
data.table(week=1:52))
For more details, see https://stackoverflow.com/a/49630818/1358308
With base R:
data.frame(year = rep(2018:2020, each = 52), week = rep(1:52, times = 3))
Since you seem to use data.table, here is one more option.
library(data.table)
CJ('year' = 2018:2020, 'week' = 1:52)
# year week
# 1: 2018 1
# 2: 2018 2
# 3: 2018 3
# 4: 2018 4
# 5: 2018 5
# ---
#152: 2020 48
#153: 2020 49
#154: 2020 50
#155: 2020 51
#156: 2020 52
Basically,
year = rep(c(2018:2025),each = 52)
week = rep(c(1:52), length(c(2018:2025)))
d = as.data.frame(cbind(year, week))
You just need to call data.frame:
data.frame(year=rep(2018:2020,each=52),week=rep(c(1:52),3))
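When building the product by hand with rep(), one vector has to repeat each element in a block (each =) while the other cycles. Cycling both only yields every pair when the two lengths share no common factor, which holds for 3 years and 52 weeks but not for the 8 years 2018:2025. A small sketch of the full range from the question, using expand.grid() and checking it against the rep() form; the names years, weeks and d are illustrative:

```r
years <- 2018:2025
weeks <- 1:52

# expand.grid() varies its first argument fastest, so listing week first
# gives rows ordered by year, then week; the columns are then reordered
d <- expand.grid(week = weeks, year = years)[, c("year", "week")]

# the equivalent rep() construction: block-repeat years, cycle weeks
by_hand <- data.frame(year = rep(years, each = length(weeks)),
                      week = rep(weeks, times = length(years)))

nrow(d)
all(d$year == by_hand$year & d$week == by_hand$week)
```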

Combining two time series with different ranges, when column headings are the dates

I am stuck trying to combine two time series datasets that have different ranges and both are stored with item# in column1 and date as column headings. For example:
df1
#ITEM 1/1/16 1/2/16 1/3/16 ... 3/24/17
#1 350 365 370 ... 400
#2 100 95 101 ... 95
#3 5 8 9 ... 15
The other dataset covers a smaller range, it's in the same format, and both are daily frequency.
How can I append the rows of df2 to df1 despite the different ranges, while making sure the dates are aligned when merged? I am happy with NA in the new dataframe where df2 didn't have values for dates in df1.
Should I create these as xts objects so that once they are merged I can easily pull data for item1 on X date? Or is there an easy way to do that with this format as well?
Thanks in advance for your help.
One option is to use data.table::rbindlist(list(df1, df2), fill = TRUE), which takes a list of tables
and fills missing columns with NAs.
Example:
library(data.table)
dt1 <- data.table(item=c(1,2,3),"d1/1/16" = c(350,100,5) ,"d1/2/16" = c(360,120,7))
dt2 <- data.table(item=c(3,4,5),"d1/2/16" = c(50,50,2) ,"d1/3/16" = c(460,150,9))
l = list(dt1,dt2)
data.table::rbindlist(l, use.names= TRUE, fill=TRUE, idcol=TRUE )
Normally in R, time series are represented in columns, not rows. Assuming we have DF1 and DF2 as shown reproducibly in the Note at the end, here are some alternatives.
1) zoo. We can create a zoo series from each by transposing, then merge them:
library(zoo)
fmt <- "%m/%d/%y"
z1 <- setNames(zoo(t(DF1[-1]), as.Date(names(DF1[-1]), fmt)), DF1[[1]])
z2 <- setNames(zoo(t(DF2[-1]), as.Date(names(DF2[-1]), fmt)), DF2[[1]])
z <- merge(z1, z2)
It is probably best to leave this as the zoo series z, but if you want to transform it to a data frame, use: fortify.zoo(z)
2) base. Alternatively, without zoo, using fmt from above:
d1 <- data.frame(as.Date(names(DF1[-1]), fmt), t(DF1[-1]))
names(d1) <- c("Index", DF1[[1]])
d2 <- data.frame(as.Date(names(DF2[-1]), fmt), t(DF2[-1]))
names(d2) <- c("Index", DF2[[1]])
merge(d1, d2, by = "Index", all = TRUE)
Note: The input in reproducible form is assumed to be:
Lines <- "ITEM 1/1/16 1/2/16 1/3/16 3/24/17
1 350 365 370 400
2 100 95 101 95
3 5 8 9 15"
DF <- read.table(text = Lines, header = TRUE, check.names = FALSE)
DF1 <- DF[1:2, 1:3]
DF2 <- DF[3, -3]
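On the part of the question about pulling the value for one item on a given date: a third layout, not taken from either answer above, is one row per item and date. The sketch below assumes the DF1 and DF2 from the Note; to_long and the column names item, date and value are hypothetical:

```r
Lines <- "ITEM 1/1/16 1/2/16 1/3/16 3/24/17
1 350 365 370 400
2 100 95 101 95
3 5 8 9 15"
DF <- read.table(text = Lines, header = TRUE, check.names = FALSE)
DF1 <- DF[1:2, 1:3]
DF2 <- DF[3, -3]

# stack the date columns of a wide table into (item, date, value) rows
to_long <- function(df, fmt = "%m/%d/%y") {
  dates <- as.Date(names(df)[-1], fmt)
  data.frame(
    item  = rep(df[[1]], times = length(dates)),
    date  = rep(dates, each = nrow(df)),
    value = unlist(df[-1], use.names = FALSE)
  )
}

# different date ranges no longer matter: rows are simply appended
long <- rbind(to_long(DF1), to_long(DF2))

# value for item 1 on 2016-01-02
long$value[long$item == 1 & long$date == as.Date("2016-01-02")]
```

Dates that a table does not cover simply have no row in this layout, rather than an NA cell.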

Using lapply to output values between date ranges within different factor levels

I have 2 dataframes, one representing daily sales figures of different stores (df1) and one representing when each store has been audited (df2). I need to create a new dataframe displaying sales information from each site taken 1 week before each audit (i.e. the information in df2). Some example data, firstly for the daily sales figures from different stores across a certain period:
Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Sales <- as.data.frame(matrix(sample(0:50, 30*10, replace=TRUE), ncol=3))
df1 <- cbind(Dates,Sales)
colnames(df1) <- c("Dates","Site.A","Site.B","Site.C")
And for the dates of each audit across different stores:
Store<- c("Store.A","Store.A","Store.B","Store.C","Store.C")
Audit_Dates <- as.data.frame(as.POSIXct(c("2016/1/4","2016/3/1","2016/2/1","2016/2/1","2016/3/1")))
df2 <- as.data.frame(cbind(Store,Audit_Dates ))
colnames(df2) <- c("Store","Audit_Dates")
Of note is that there will be an uneven number of dates within each output (i.e. there may not be a full week's worth of information prior to some store audits). I previously asked a question addressing a similar problem, "Creating a dataframe from an lapply function with different numbers of rows". Below is an answer from it that would work if I were considering information from only 1 store:
library(lubridate)
##Data input
Store.A_Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Store.A_Sales <- as.data.frame(matrix(sample(0:50, 10*10, replace=TRUE), ncol=1))
Store.A_df1 <- cbind(Store.A_Dates,Store.A_Sales)
colnames(Store.A_df1) <- c("Store.A_Dates","Store.A_Sales")
Store.A_df2 <- as.Date(c("2016/1/3","2016/3/1"))
##Output
Store.A_output<- lapply(Store.A_df2, function(x) {Store.A_df1[difftime(Store.A_df1[,1], x - days(7)) >= 0 & difftime(Store.A_df1[,1], x) <= 0, ]})
n1 <- max(sapply(Store.A_output, nrow))
output <- data.frame(lapply(Store.A_output, function(x) x[seq_len(n1),]))
But I don't know how I would get this for multiple sites.
Try this:
# Renamed vars for my convenience...
colnames(df1) <- c("t","Store.A","Store.B","Store.C")
colnames(df2) <- c("Store","t")
library(tidyr)
library(dplyr)
# Gather df1 so that df1 and df2 have the same format:
df1 = gather(df1, Store, Sales, -t)
head(df1)
t Store Sales
1 2015-12-30 Store.A 16
2 2015-12-31 Store.A 24
3 2016-01-01 Store.A 8
4 2016-01-02 Store.A 42
5 2016-01-03 Store.A 7
6 2016-01-04 Store.A 46
# This lapply call does not iterate over actual values, just indexes, which allows
# you to subset the data comfortably:
r <- lapply(1:nrow(df2), function(i) {
audit.t = df2[i, "t"] #time of audit
audit.s = df1[, "Store"] == df2[i, "Store"] #store audited
df = df1[audit.s, ] #data from audited store
df[, "audited"] = audit.t #add extra column with audit date
week_before = difftime(df[, "t"], audit.t - (7*24*3600)) >= 0
week_audit = difftime(df[, "t"], audit.t) <= 0
df[week_before & week_audit, ]
})
Does this give you the proper subsets?
Also, to summarise your results:
r = do.call("rbind", r) %>%
group_by(audited, Store) %>%
summarise(sales = sum(Sales))
r
audited Store sales
<time> <chr> <int>
1 2016-01-04 Store.A 97
2 2016-02-01 Store.B 156
3 2016-02-01 Store.C 226
4 2016-03-01 Store.A 115
5 2016-03-01 Store.C 187
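The same week-before-audit subset can also be sketched as a join followed by a filter, without an explicit loop. This is an illustration under assumptions, not the code above: it builds the sales data directly in long format with simulated values, stores audit dates as Date rather than POSIXct, and the names sales, audits, joined and week_before are made up for the example:

```r
set.seed(1)
dates <- seq(as.Date("2015-12-30"), as.Date("2016-04-07"), by = "day")

# long format: one row per store and day (simulated sales)
sales <- data.frame(
  Store = rep(c("Store.A", "Store.B", "Store.C"), each = length(dates)),
  t     = rep(dates, times = 3),
  Sales = sample(0:50, 3 * length(dates), replace = TRUE)
)
audits <- data.frame(
  Store   = c("Store.A", "Store.A", "Store.B", "Store.C", "Store.C"),
  audited = as.Date(c("2016-01-04", "2016-03-01", "2016-02-01",
                      "2016-02-01", "2016-03-01"))
)

# pair every sales row with every audit of the same store
joined <- merge(sales, audits, by = "Store")

# keep rows from the audit day back to 7 days before it (inclusive)
lag_days <- as.numeric(joined$audited - joined$t)
week_before <- joined[lag_days >= 0 & lag_days <= 7, ]

# total sales per store and audit
totals <- aggregate(Sales ~ Store + audited, data = week_before, FUN = sum)
totals
```

The first Store.A audit has fewer rows than the others because the data starts on 2015-12-30, which is the uneven case the question mentions.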

Imputing observations to make cross section a time series in R

I'm hoping to take a dataset with cross section salary data for employees and create a large uninterrupted time series, imputing values along the way. Suppose I have:
name <- c("carl","carl","bob","rick","rick","rick","rick")
sex <- c(rep("M",7))
salary <- c(18000, 14000, 34000, 11000, 23000, 23000, 25000)
date <- as.Date(c("2007-04-30","2007-07-30","2009-12-09","2006-01-01",
"2008-01-01","2009-12-09", "2010-01-01"))
salaries <- data.frame(name,sex,salary,date)
salaries
name sex salary date
carl M 18000 2007-04-30
carl M 14000 2007-07-30
bob M 34000 2009-12-09
rick M 11000 2006-01-01
rick M 23000 2008-01-01
rick M 23000 2009-12-09
rick M 25000 2010-01-01
As we can see, poor carl got his salary cut by 4k in July. Prior to that, he was earning 18k. This was the case for 3 months before he got the cut, but my data doesn't reflect this. I'd like to make a nice picture showing this trend, but first I need to change the data to look like this (where * denotes imputed values):
head(salaries)
name sex salary date change
carl M 18000 2007-04-30 0
carl M 18000 2007-05-30* 0
carl M 18000 2007-06-30* 0
carl M 14000 2007-07-30 1
bob M 34000 2009-12-09 0
rick M 11000 2006-01-01 0
rick M 11000 2006-02-01* 0
... .. ....... ...... ....
rick M 11000 2007-12-01* 0
rick M 23000 2008-01-01 1
rick M 23000 2008-02-01* 1
.... ...... ...... ........
rick M 23000 2009-12-09 1
rick M 25000 2010-01-01 2
So I'd like to impute in-between values and also mark when a change occurs. A guy like bob, who never had a salary change, just stays at 0. But rick, who's had multiple salary changes, gets marked each time so we know when the change occurred and which number it is. I'm only interested in the month as the unit of analysis, but it would be useful to know how to impute daily as well.
If you have a single time series,
you can use na.locf to replace missing values with the last available value
or approx if you only want to interpolate between values.
To create those individual time series, you can convert the data between your "tall" (normalized) format and a "wide" format with dcast and melt.
To count the number of changes, you can use ddply and cumsum.
library(reshape2)
library(plyr)
library(zoo)
# Convert to wide format
d <- dcast( salaries, date ~ name, value.var = "salary" )
# Add all the dates you want
dates <- seq.Date( from = min(d$date), max(d$date), by="month" )
d <- merge( d, data.frame(date=dates), all=TRUE )
# Fill in the missing values
# If you want the last non-missing value:
#d <- as.data.frame(lapply(d, na.locf, na.rm=FALSE))
# If you only want to interpolate between values:
d <- as.data.frame(lapply(d,
function(x) approx( seq_along(x), x, seq_along(x), method="constant" )$y
))
# Convert back to the tall format
d <- melt(d, id.vars="date", value.name="salary", variable.name="name", na.rm=TRUE)
# Add the number of changes
d <- ddply(
d, "name", transform,
change = cumsum(c(0, diff(salary) != 0))
)
Elaborating on @Vincent's advice:
name <- c("carl","carl","bob","rick","rick","rick","rick")
sex <- c(rep("M",7))
salary <- c(18000, 14000, 34000, 11000, 23000, 23000, 25000)
office <- c('melbourne','sydney','adelaide','perth','perth','melbourne','melbourne')
date <- as.Date(c("2007-04-30","2007-07-30","2009-12-09","2006-01-01",
"2008-01-01","2009-12-09", "2010-01-01"))
salaries <- data.frame(name,sex,salary,date, office)
salaries
library(reshape2)
library(plyr)
library(zoo)
Dealing with numeric vector using approx
# Convert to wide format
d <- dcast( salaries, date ~ name, value.var = "salary" )
# Add all the dates you want
dates <- seq.Date( from = min(d$date), max(d$date), by="month" )
d <- merge( d, data.frame(date=dates), all=TRUE )
# Fill in the missing values
# If you want the last non-missing value:
#d <- as.data.frame(lapply(d, na.locf, na.rm=FALSE, fromLast = T))
#If you only want to interpolate between values:
d <- as.data.frame(lapply(d,
function(x) approx( seq_along(x), x, seq_along(x), method="constant" )$y
))
# Convert back to the tall format
d <- melt(d, id.vars="date", value.name="salary", variable.name="name", na.rm=TRUE)
# Add the number of changes
d <- ddply(
d, "name", transform,
change = cumsum(c(0, diff(salary) != 0))
)
Convert character vector with na.locf
# Convert to wide format
a <- dcast( salaries, date ~ name, value.var = "office" )
# Add all the dates you want
dates <- seq.Date( from = min(a$date), max(a$date), by="month" )
a <- merge( a, data.frame(date=dates), all=TRUE )
# Fill in the missing values using na.locf
a <- as.data.frame(lapply(a, na.locf, na.rm=FALSE, fromLast = T))
# Convert back to the tall format
a <- melt(a, id.vars="date", value.name="office", variable.name="name", na.rm=TRUE)
Merge results
d$date <- as.Date(d$date)
out = merge(a,d, by = c('name','date'))
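For comparison, the carry-forward and the change counter can also be sketched per person in base R, without reshaping to wide format. This is a different route from the answers above, under assumptions: it steps by "month" from each person's first observation, keeps the observed dates as well, and fill_person is a hypothetical helper name:

```r
salaries <- data.frame(
  name   = c("carl", "carl", "bob", "rick", "rick", "rick", "rick"),
  salary = c(18000, 14000, 34000, 11000, 23000, 23000, 25000),
  date   = as.Date(c("2007-04-30", "2007-07-30", "2009-12-09", "2006-01-01",
                     "2008-01-01", "2009-12-09", "2010-01-01"))
)

fill_person <- function(p) {
  p <- p[order(p$date), ]
  # monthly steps from first to last observation, plus the observed dates
  grid <- sort(unique(c(seq(min(p$date), max(p$date), by = "month"), p$date)))
  # index of the latest observation at or before each grid date
  idx <- findInterval(as.numeric(grid), as.numeric(p$date))
  # running count of salary changes at each observation
  change <- cumsum(c(0, diff(p$salary) != 0))
  data.frame(name = p$name[1], date = grid,
             salary = p$salary[idx], change = change[idx])
}

filled <- do.call(rbind, lapply(split(salaries, salaries$name), fill_person))
filled[filled$name == "carl", ]
```

Replacing by = "month" with by = "day" gives a daily series. Month steps from dates late in the month (the 29th to the 31st) can roll over into the following month, so that case may need extra care.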
