How do I use hts() to aggregate time series data? - r

I am new to R and have some very basic questions.
Company Customer Product Q1 Q2 Q3 Q4
xyz Customer1 ProductA 500 600 400 800
xyz Customer1 ProductB 100 255 520 642
xyz Customer1 ProductC 846 566 320 54
xyz Customer1 ProductD 510 53 100 210
xyz Customer2 ProductX 500 50 466 260
xyz Customer2 ProductY 100 120 150 620
xyz Customer2 ProductZ 500 460 240 543
The above is an example of my data set. I need to create a hierarchical time series using hts() with 3 levels. The bottom level (level 0) should contain the products (column Product), which are aggregated to an upper level (level 1) based on customers (column Customer), which in turn are aggregated to the top level based on company.
My questions are:
How do I write the hts() code for this data set?
My data set is a data frame; should I convert it to a matrix before using it?
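A minimal sketch of one way to set this up, not a definitive answer. It assumes the hts package is installed (the hts() call is skipped otherwise), and the start date 2018 Q1 is an assumption because the question gives no year. As far as I know hts() expects time in rows and the bottom-level series in columns, so the data frame is transposed into a ts matrix first. Note that hts itself numbers the top of the hierarchy as level 0.

```r
# Bottom-level data from the question: one row per product, one column per quarter.
df <- data.frame(
  Company  = "xyz",
  Customer = c(rep("Customer1", 4), rep("Customer2", 3)),
  Product  = c("ProductA", "ProductB", "ProductC", "ProductD",
               "ProductX", "ProductY", "ProductZ"),
  Q1 = c(500, 100, 846, 510, 500, 100, 500),
  Q2 = c(600, 255, 566, 53, 50, 120, 460),
  Q3 = c(400, 520, 320, 100, 466, 150, 240),
  Q4 = c(800, 642, 54, 210, 260, 620, 543),
  stringsAsFactors = FALSE
)

# Transpose so that quarters are rows and products are columns.
y <- t(as.matrix(df[, c("Q1", "Q2", "Q3", "Q4")]))
colnames(y) <- df$Product
# Assumption: quarterly data starting in 2018 Q1.
y <- ts(y, start = c(2018, 1), frequency = 4)

# nodes describes the tree: 2 customers under the company,
# then 4 and 3 products under each customer (in column order).
nodes <- list(
  length(unique(df$Customer)),
  as.vector(table(factor(df$Customer, levels = unique(df$Customer))))
)

if (requireNamespace("hts", quietly = TRUE)) {
  h <- hts::hts(y, nodes = nodes)
  print(hts::aggts(h, levels = 1))  # customer totals
  print(hts::aggts(h, levels = 0))  # company total
}
```

The columns of y must be ordered so that products of the same customer sit next to each other, because nodes only records group sizes.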

Related

Reshaping data in R with multiple variable levels - "aggregate function missing" warning

I'm trying to use dcast in reshape2 to transform a data frame from long to wide format. The data is hospital visit dates and a list of diagnoses. (Dx.num lists the sequence of diagnoses in a single visit. If the same patient returns, this variable starts over and the primary diagnosis for the new visit starts at 1.) I would like there to be one row per individual (id). The data structure is:
id visit.date visit.id bill.num dx.code FY Dx.num
1 1/2/12 203 1234 409 2012 1
1 3/4/12 506 4567 512 2013 1
2 5/6/18 222 3452 488 2018 1
2 5/6/18 222 3452 122 2018 2
3 2/9/14 567 6798 923 2014 1
I'm imagining I would end up with columns like this:
id, date_visit1, date_visit2, visit.id_visit1, visit.id_visit2, bill.num_visit1, bill.num_visit2, dx.code_visit1_dx1, dx.code_visit1_dx2, dx.code_visit2_dx1, FY_visit1_dx1, FY_visit1_dx2, FY_visit2_dx1
Originally, I tried creating a visit_dx column like this one:
**visit.dx**
v1dx1 (visit 1, dx 1)
v2dx1 (visit 2, dx 1)
v1dx1 (...)
v1dx2
v1dx1
And used the following code, omitting "Dx.num" from the DF, as it's accounted for in "visit.dx":
library(data.table)
wide <- dcast(
  setDT(long),
  id + visit.date + visit.id + bill.num ~ visit.dx,
  value.var = c("dx.code", "FY")
)
When I run this, I get the warning "Aggregate function missing, defaulting to 'length'" and a new data frame full of 0's and 1's. There are no duplicate rows in the data frame, however. I'm beginning to think I should go about this completely differently.
Any help would be much appreciated.
The data.table package extends dcast to accept multiple value.var columns and provides rowid, so...
library(data.table)
dcast(setDT(DF), id ~ rowid(id), value.var=setdiff(names(DF), "id"))
id visit.date_1 visit.date_2 visit.id_1 visit.id_2 bill.num_1 bill.num_2 dx.code_1 dx.code_2 FY_1 FY_2 Dx.num_1 Dx.num_2
1: 1 1/2/12 3/4/12 203 506 1234 4567 409 512 2012 2013 1 1
2: 2 5/6/18 5/6/18 222 222 3452 3452 488 122 2018 2018 1 2
3: 3 2/9/14 <NA> 567 NA 6798 NA 923 NA 2014 NA 1 NA
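For readers without data.table, the same idea (number the rows within each id, then spread every other column by that number) can be sketched in base R with ave() and reshape(). This is an alternative illustration, not the answerer's method; the helper column name row is made up for the example.

```r
# Sample data from the question.
DF <- read.table(text = "id visit.date visit.id bill.num dx.code FY Dx.num
1 1/2/12 203 1234 409 2012 1
1 3/4/12 506 4567 512 2013 1
2 5/6/18 222 3452 488 2018 1
2 5/6/18 222 3452 122 2018 2
3 2/9/14 567 6798 923 2014 1", header = TRUE, stringsAsFactors = FALSE)

# Number the rows within each id (1, 2, ...), similar in spirit to rowid(id).
DF$row <- ave(DF$id, DF$id, FUN = seq_along)

# Spread all remaining columns by that row number: one output row per id.
wide <- reshape(DF, idvar = "id", timevar = "row",
                direction = "wide", sep = "_")
print(wide)
```

Because id plus the row number identifies each input row uniquely, no aggregation is needed, which is what the "Aggregate function missing" warning was complaining about.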

How to remove duplicate values in specific column without removing related row

I want to remove duplicate values in a specific column without deleting the rows that contain them, as in the example below:
Input
-----
Date Market Quantity
4/2/2018 Indonesia 1000
4/2/2018 Australia 500
4/2/2018 India 300
4/2/2018 USA 500
4/2/2018 Germany 200
5/2/2018 India 400
5/2/2018 Japan 400
5/2/2018 Russia 457
6/2/2018 Austria 260
6/2/2018 Swiss 700
6/2/2018 USA 1200
6/2/2018 Indonesia 400
output
------
Date Market Quantity
4/2/2018 Indonesia 1000
Australia 500
India 300
USA 500
Germany 200
5/2/2018 India 400
Japan 400
Russia 457
6/2/2018 Austria 260
Swiss 700
USA 1200
Indonesia 400
And if possible, how can I plot a graph (bar/column) for the same output (something like the one given)?
Sample Graph
I would add this to comments but I don't have rights yet...
I don't think you actually want to change the data, but as a few mentioned in the comments there are easy ways to do that.
If you're just trying to show the multi-dimensional data in plotly and you're just not familiar with the library syntax try the code below...
df <- data.frame(
  Date = c('2018/04/02','2018/04/02','2018/04/02','2018/04/02','2018/04/02','2018/05/02','2018/05/02','2018/05/02','2018/06/02','2018/06/02','2018/06/02','2018/06/02'),
  Market = c('Indonesia','Australia','India','USA','Germany','India','Japan','Russia','Austria','Swiss','USA','Indonesia'),
  Quantity = c(1000,500,300,500,200,400,400,457,260,700,1200,400),
  stringsAsFactors = FALSE
)
# One panel per date, one bar per market; ggplotly makes the plot interactive.
plotly::ggplotly(
  ggplot2::ggplot(df, ggplot2::aes(x = Market, y = Quantity)) +
    ggplot2::geom_col(ggplot2::aes(fill = Market)) +
    ggplot2::facet_grid(~Date, scales = 'free_x') +
    ggthemes::theme_tufte()
)
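If the blanked-out Date column really is wanted for display, one possible base R sketch is to overwrite repeated values with an empty string on a copy of the data. It assumes the rows are already grouped by date, as in the question; the name display is just for illustration.

```r
df <- data.frame(
  Date = c(rep("4/2/2018", 5), rep("5/2/2018", 3), rep("6/2/2018", 4)),
  Market = c("Indonesia", "Australia", "India", "USA", "Germany",
             "India", "Japan", "Russia",
             "Austria", "Swiss", "USA", "Indonesia"),
  Quantity = c(1000, 500, 300, 500, 200, 400, 400, 457, 260, 700, 1200, 400),
  stringsAsFactors = FALSE
)

# Work on a copy so the original dates stay available for plotting.
display <- df
# duplicated() is TRUE for every repeat after the first occurrence of a date.
display$Date[duplicated(display$Date)] <- ""
print(display, row.names = FALSE)
```

Keeping the untouched df around matters because the blanked column can no longer be used for grouping or faceting.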

Distributed computing using r

I am trying to build a system where I have the monthly sales data of all employees in my department for the past 1 year.
sales<-read.table(text="MONTH Emp1 Emp2 Emp3
1 1000 1500 1100
2 1200 1400 1600
3 1500 1400 1600
4 1300 1500 1400
5 1500 1200 1200", header=TRUE)
and so on till month 10
Through an algorithm, I have forecasted their future values and found the maximum percentage increase they can achieve.
threshold<-read.table(text="Employee 'Max Increment Possible'
Emp1 200
Emp2 220
Emp3 300",header=TRUE)
Now I am setting a target of 400 increase in my department but want to distribute it among my employees in the best possible way. Wondering if there is an existing package that can do this in R.
The output should be:
Employee Incremental Value
Emp1 120
Emp2 140
Emp3 140
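The question does not say what "best" means, so the following is only a sketch of one simple rule: split the target in proportion to each employee's maximum possible increment. That rule is an assumption, and it does not reproduce the numbers in the expected output, which follow some criterion the question leaves unstated.

```r
# Maximum increment per employee, from the threshold table in the question.
cap <- c(Emp1 = 200, Emp2 = 220, Emp3 = 300)
target <- 400

# The target is only reachable if it does not exceed the combined caps.
stopifnot(target <= sum(cap))

# Proportional split: each employee gets the same fraction of their cap,
# so no one exceeds their maximum and the shares add up to the target.
alloc <- target * cap / sum(cap)
print(round(alloc, 1))
```

If a different objective is intended (for example minimising cost per employee), the problem becomes a small linear programme and would need that objective spelled out first.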

R - read multiple tables from text files of different format

I have several text files converted from images using OCR. Some of the text files contain multiple tables. The files differ in the number of columns, the separator, and the line on which the data starts. Below are two sample files:
file1.txt: contains two tables in single text file
Receipt
Date: 12/05/2015 Page: 1
Status: Active
Location: Florida, USA
Prod ID Category ID Product Name Received Date Quantity Price
1 201 ABC 02/01/2015 5 200
2 02/01/2015 1 100
3 204 XYZ 05/02/2015 10 2000
Total 16 2300
Date: 01/02/2016 Page: 2
Status: Complete
Location: Florida, USA
Prod ID Category ID Product Name Received Date Quantity Price
1 202 ABC 02/01/2015 5 200
2 203 MNO 02/01/2015 1 100
3 204 XYZ 05/02/2015 10 2000
Total 16 2300
file2.txt: contains one table but in different format than above
Receipt Date: 12/05/2015 Page: 1 Location: California, USA Status: Complete
Prod ID Product Received Sent Quantity Price
Name Date Date
1 ABC 02/01/2015 03/01/2015 5 200
2 PQR 02/01/2015 03/01/2015 1 100
3 XYZ 05/02/2015 03/02/2015 10 2000
I am looking to read the files and create a data frame for each file/table. Is there any way to apply machine learning/NLP to convert these text files into data frames in R?
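Before reaching for machine learning, a rule-based first pass may already recover most of file1.txt. The sketch below is only that: it assumes every table starts after a line beginning with "Prod ID" and ends before a line beginning with "Total", and it simply drops rows where OCR lost a cell. The column names and the helper parse_block are made up for illustration.

```r
# Contents of file1.txt from the question, inlined to keep the sketch self-contained.
lines <- c(
  "Receipt",
  "Date: 12/05/2015 Page: 1",
  "Status: Active",
  "Location: Florida, USA",
  "Prod ID Category ID Product Name Received Date Quantity Price",
  "1 201 ABC 02/01/2015 5 200",
  "2 02/01/2015 1 100",
  "3 204 XYZ 05/02/2015 10 2000",
  "Total 16 2300",
  "Date: 01/02/2016 Page: 2",
  "Status: Complete",
  "Location: Florida, USA",
  "Prod ID Category ID Product Name Received Date Quantity Price",
  "1 202 ABC 02/01/2015 5 200",
  "2 203 MNO 02/01/2015 1 100",
  "3 204 XYZ 05/02/2015 10 2000",
  "Total 16 2300"
)

starts <- grep("^Prod ID", lines)  # header line of each table
ends   <- grep("^Total", lines)    # total line closing each table

# Hypothetical helper: turn the lines between a header and its total into a data frame.
parse_block <- function(s, e) {
  fields <- strsplit(lines[(s + 1):(e - 1)], "\\s+")
  complete <- lengths(fields) == 6  # rows with a missing cell are skipped here
  out <- as.data.frame(do.call(rbind, fields[complete]),
                       stringsAsFactors = FALSE)
  names(out) <- c("prod_id", "category_id", "product_name",
                  "received_date", "quantity", "price")
  out
}

tables <- Map(parse_block, starts, ends)
print(tables)
```

The second row of the first table has only four fields, so it is dropped; deciding which cells went missing is the part that would need extra rules or manual review. file2.txt has a different layout and would need its own header pattern.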

conditional sorting of columns based on contents of one column in R

I have an R data.frame of daily stock volumes of 5 stocks for many days:
date stock1 stock2 stock3 stock4 stock5
1 350 600 1900 3000 250
2 800 800 1200 4200 400
3 500 600 1500 3500 550
4 600 900 1800 3200 1000
...
...
What I am looking for is a way to get a sorted list of stocks at the end of each day, ranked by volume in descending order. I am thinking I can run a for loop over nrow(df) and, at each iteration, sort the row contents by volume and save the sorted column headers (stock names) as the expected list for that day. How can I do this? Is it possible?
I am a novice in R and programming. I hope my question is clear. Grateful for any help. Thanks!
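A for loop would work, but the same thing can be sketched with apply() over the rows. This uses only the four sample days shown above; the names ranked and rank1 to rank5 are made up for the example. Ties keep their original column order.

```r
df <- data.frame(
  date   = 1:4,
  stock1 = c(350, 800, 500, 600),
  stock2 = c(600, 800, 600, 900),
  stock3 = c(1900, 1200, 1500, 1800),
  stock4 = c(3000, 4200, 3500, 3200),
  stock5 = c(250, 400, 550, 1000)
)

# For each day (row), order the stock names by descending volume.
# apply() returns one column per day, so transpose to get one row per day.
ranked <- t(apply(df[-1], 1, function(v) names(v)[order(-v)]))
rownames(ranked) <- df$date
colnames(ranked) <- paste0("rank", seq_len(ncol(ranked)))
print(ranked)
```

Each row of ranked is the sorted list for that day, so ranked["1", ] gives the ranking for day 1.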
