Have a data frame with dates of transaction and units of the product, e.g.:
date_tr units
1/1/2022 +50
1/2/2022 +30
1/3/2022 -10
1/4/2022 -45
1/5/2022 +25
where + denotes bought and - denotes sold.
I need to find the ageing of the sold products in R. There are 1000s of products and 1000s of entries. Ideally need to convert to a data frame with corresponding bought dates eg:
1/1/2022 10 1/3/2022 10
1/1/2022 40 1/4/2022 40
1/2/2022 5 1/4/2022 5
My understanding of flifo package is it will help in finding the products in the inventory with fifo or lifo. But I need to find the ageing of the sold products, in R. Highly appreciate any help!
Related
I've got a dataset that has monthly metrics for different stores. Each store has three monthly (Total sales, customers and transaction count), my task is over a year I need to find the store that most closely matches a specific test store (Ex: Store 77).
Therefore over the year both the test store and most similar store need to have similar performance. My question is how do I go about finding the most similar store? I've currently used euclidean distance but would like to know if there's a better way to go about it.
Thanks in advance
STORE
month
Metric 1
22
Jan-18
10
23
Jan-18
20
Is correlation a better way to measure similarity in this case compared to distance? I'm fairly new to data so if there's any resources where I can learn more about this stuff it would be much appreciated!!
In general, deciding similarity of items is domain-specific, i.e. it depends on the problem you try to solve. Therefore, there is not one-size-fits-all solution. Nevertheless, there is some a basic procedure someone can follow trying to solve this kind of problems.
Case 1 - only distance matters:
If you want to find the most similar items (stores in our case) using a distance measure, it's a good tactic to firstly scale your features in some way.
Example (min-max normalization):
Store
Month
Total sales
Total sales (normalized)
1
Jan-18
50
0.64
2
Jan-18
40
0.45
3
Jan-18
70
0
4
Jan-18
15
1
After you apply normalization on all attributes, you can calculate euclidean distance or any other metric that you think it fits your data.
Some resources:
Similarity measures
Feature scaling
Case 2 - Trend matters:
Now, say that you want to find the similarity over the whole year. If the definition of similarity for your problem is just the instance of the stores at the end of the year, then distance will do the job.
But if you want to find similar trends of increase/decrease of the attributes of two stores, then distance measures conceal this information. You would have to use correlation metrics or any other more sophisticated technique than just a distance.
Simple example:
To keep it simple, let's say we are interested in 3-months analysis and that we use only sales attribute (unscaled):
Store
Month
Total sales
1
Jan-18
20
1
Feb-18
20
1
Mar-18
20
2
Jan-18
5
2
Feb-18
15
2
Mar-18
40
3
Jan-18
10
3
Feb-18
30
3
Mar-18
78
At the end of March, in terms of distance Store 1 and Store 2 are identical, both having 60 total sales.
But, as far as the increase ratio per month is concerned, Store 2 and Store 3 is our match. In February they both had 2 times more sales and in March 1.67 and 1.6 times more sales respectively.
Bottom line: It really depends on what you want to quantify.
Well-known correlation metrics:
Pearson correlation coefficient
Spearman correlation coefficient
I have a set of 72 public companies of the chilean stock exchange. The selection of the data wasn't random, as the choice of public companies was adressed first by those ones that have a total 24h market volume of at least 400 USD (yes, the stock exchange is small), for which I got the sample of 72 public companies.
The ESG ratings, are ratings that agencies give in order to measure the sustainability of a company. I used 4 different rating agencies.
The data is Missing at Random (MAR), as as the data is randomly missing for companies that aren’t part of the main chilean market index ("ipsa" you can ignore this). The data looks like this (on the table): (very few (like 5%) of the cases of companies that are in the main market index have missing values. Meanwhile for the other companies, missing values are in 70%+ of the cases.
Companies
Agency 1 Rating
Agency 2 rating
is part of main index?
A
40
20
yes
B
20
30
yes
C
55
30
yes
D
55
missing
yes
E
70
30
yes
F
30
50
yes
G
30
60
yes
H
30
60
yes
I
missing
60
no
N
40
missing
no
K
20
missing
no
L
30
missing
no
M
40
missing
no
Z
20
20
no
Y
30
10
no
Is deletion of all those cases where the companies have missing values an accepted method to deal with these missing values to then run the correlation between ratings of different agencies, or am I doing a big mistake (big bias)? I end up deleting almost all entries of the group of companies that aren't in the main market index, only two are remaining, while of the ones that are part of the market index, 26 remain. The total sample is 29 remaining alive, can I run correlation this way?
I am fairly new to R and I am pulling my hair out trying to do what is probably something super simple.
I downloaded the crime data for Los Angeles from 2010 - 2019. There are 2,114,010 rows of data. Right now, it is called 'df' in my Global Environment area.
I want to manipulate one specific column titled "Occurred" - which is a date reference to when the crime occurred.
Right now, it is set up as YYYY-MM-DD (ie., 2010-02-20).
I am trying to separate all three into individual columns. I have Googled, and Googled, and Googled and tried and tried and tried things from this forum and StackExchange and just cannot get it to work.
I have tried Lubridate and followed instructions to other answers, but it simply won't create new columns (one each for Year, Month, Day).
Here is a bit of the reprex from the dataset ... I did not include all of the different variables, because they aren't the issue.
As mentioned, I am trying to separate 'occurred' into individual Year, Month, and Day columns.
> head(df, 10)[c('dr_no','occurred','time','area_name')]
dr_no occurred time area_name
1 1307355 2010-02-20 1350 Newton
2 11401303 2010-09-12 45 Pacific
3 70309629 2010-08-09 1515 Newton
4 90631215 2010-01-05 150 Hollywood
5 100100501 2010-01-02 2100 Central
6 100100506 2010-01-04 1650 Central
7 100100508 2010-01-07 2005 Central
8 100100509 2010-01-08 2100 Central
9 100100510 2010-01-09 230 Central
10 100100511 2010-01-06 2100 Central
We can do this with tidyverse and lubridate
library(dplyr)
library(lubridate)
df <- df %>%
mutate(occurred = as.Date(occurred),
year = year(occurred), month = month(occurred), day = day(occurred))
I tried to read some online pdf document in R. I used readRDF function. My script goes like this
safex <- readPDF(PdftotextOptions='-layout')(elem=list(uri='C:/Users/FCG/Desktop/NoteF7000.pdf'),language='en',id='id1')
R showed the message that running command has status 309. I tried different pdftotext options. however, it is the same message. and the text file created has no content.
Can anyone read this pdf
readPDF has bugs and probably isn't worth bothering with (check out this well-documented struggle with it).
Assuming that...
you've got xpdf installed (see here for details)
your PATHs are all in order (see here for details of how to do that) and you've restarted your computer.
Then you might be better off avoiding readPDF and instead using this workaround:
system(paste('"C:/Program Files/xpdf/pdftotext.exe"',
'"C:/Users/FCG/Desktop/NoteF7000.pdf"'), wait=FALSE)
And then read the text file into R like so...
require(tm)
mycorpus <- Corpus(URISource("C:/Users/FCG/Desktop/NoteF7001.txt"))
And have a look to confirm that it went well:
inspect(mycorpus)
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
Market Notice
Number: Date F7001 08 May 2013
New IDX SSF (EWJG) The following new IDX SSF contract will be added to the list and will be available for trade today.
Summary Contract Specifications Contract Code Underlying Instrument Bloomberg Code ISIN Code EWJG EWJG IShares MSCI Japan Index Fund (US) EWJ US EQUITY US4642868487 1 (R1 per point)
Contract Size / Nominal
Expiry Dates & Times
10am New York Time; 14 Jun 2013 / 16 Sep 2013
Underlying Currency Quotations Minimum Price Movement (ZAR) Underlying Reference Price
USD/ZAR Bloomberg Code (USDZAR Currency) Price per underlying share to two decimals. R0.01 (0.01 in the share price)
4pm underlying spot level as captured by the JSE.
Currency Reference Price
The same method as the one utilized for the expiry of standard currency futures on standard quarterly SAFEX expiry dates.
JSE Limited Registration Number: 2005/022939/06 One Exchange Square, Gwen Lane, Sandown, South Africa. Private Bag X991174, Sandton, 2146, South Africa. Telephone: +27 11 520 7000, Facsimile: +27 11 520 8584, www.jse.co.za
Executive Director: NF Newton-King (CEO), A Takoordeen (CFO) Non-Executive Directors: HJ Borkum (Chairman), AD Botha, MR Johnston, DM Lawrence, A Mazwai, Dr. MA Matooane , NP Mnxasana, NS Nematswerani, N Nyembezi-Heita, N Payne Alternate Directors: JH Burke, LV Parsons
Member of the World Federation of Exchanges
Company Secretary: GC Clarke
Settlement Method
Cash Settled
-
Clearing House Fees -
On-screen IDX Futures Trading: o 1 BP for Taker (Aggressor) o Zero Booking Fees for Maker (Passive) o No Cap o Floor of 0.01 Reported IDX Futures Trades o 1.75 BP for both buyer and seller o No Cap o Floor of 0.01
Initial Margin Class Spread Margin V.S.R. Expiry Date
R 10.00 R 5.00 3.5 14/06/2013, 16/09/2013
The above instrument has been designated as "Foreign" by the South African Reserve Bank
Should you have any queries regarding IDX Single Stock Futures, please contact the IDX team on 011 520-7399 or idx#jse.co.za
Graham Smale Director: Bonds and Financial Derivatives Tel: +27 11 520 7831 Fax:+27 11 520 8831 E-mail: grahams#jse.co.za
Distributed by the Company Secretariat +27 11 520 7346
Page 2 of 2
heres a table, the time when the query runs i.e now is 2010-07-30 22:41:14
number | person | timestamp
45 mike 2008-02-15 15:31:14
56 mike 2008-02-15 15:30:56
67 mike 2008-02-17 13:31:14
34 mike 2010-07-30 22:31:14
56 bob 2009-07-30 22:37:14
67 bob 2009-07-30 22:37:14
22 tom 2010-07-30 22:37:14
78 fred 2010-07-30 22:37:14
Id like a query that can add up the number for each person. Then only display the name totals which have a recent entry say last 60 minutes. The difficult seems to be, that although its possible to use AND timestamp > now( ) - INTERVAL 600, this has the affect of stopping the full sum of the number.
the results I would from above are
Mike 202
tom 22
fred 78
bob is not included his latest entry is not recent enough its a year old! mike although he has several old entries is valid because he has one entry recently - but key, it still adds up his full 'number' and not just those with the time period.
go on get your head round that one in a single query ! and thanks
andy.
You want a HAVING clause:
select name, sum(number), max(timestamp_column)
from table
group by name
HAVING max( timestamp_column) > now( ) - INTERVAL 600;
andrew - in the spirit of education, i'm not going to show the query (actually, i'm being lazy but don't tell anyone) :).
basically tho', you'd have to do a subselect within your main criteria select. in psuedo code it would be:
select person, total as (select sum(number) from table1 t2 where t2.person=t1.person)
from table1 t1 where timestamp > now( ) - INTERVAL 600
that will blow up, but you get the gist...
jim