Quarter over Quarter in R

I have been trying to calculate the quarter over quarter change in shares with no luck. I have a data.table with approximately 15 million rows.
What I need to calculate is the change in absolute values quarter by quarter according to the Holder and the stock they own.
My data table looks like this:
stock Holder Quarter Shares
1: GOOGLE Advance Capital Management, Inc. 2015 Q3 5800
2: GOOGLE Advance Capital Management, Inc. 2015 Q4 9000
3: GOOGLE Advance Capital Management, Inc. 2016 Q1 7000
4: GOOGLE Advance Capital Management, Inc. 2016 Q2 7560
5: GOOGLE Advest, Inc. 2015 Q3 12000
6: GOOGLE Advest, Inc. 2015 Q3 13450
I'm trying to use data.table functions, using
df[, qoq := c(NA, diff(Shares)), by = "Holder,stock,Quarter"]
However, I get only NA.
I was expecting something like this:
stock Holder Quarter Shares qoq
1: GOOGLE Advance Capital Management, Inc. 2015 Q3 5800 NA
2: GOOGLE Advance Capital Management, Inc. 2015 Q4 9000 3200
3: GOOGLE Advance Capital Management, Inc. 2016 Q1 7000 -2000
4: GOOGLE Advance Capital Management, Inc. 2016 Q2 7560 560
5: GOOGLE Advest, Inc. 2015 Q3 12000 NA
6: GOOGLE Advest, Inc. 2015 Q3 13450 1450
After that, I need to calculate the variance of this result, again by Holder and stock. Is there a general function to calculate statistics grouped by several columns? I tried aggregate, but it is taking years:
aggregate(REPORTED_HOLDING~Quarter+FILER_NAME+STOCK_NAME, FUN=sum, data=df)

With dplyr, assuming df is your data.frame:
library(dplyr)
df %>%
  group_by(stock, Holder) %>%
  mutate(qoq = Shares - lag(Shares)) %>%
  summarise(qvar = var(qoq, na.rm = TRUE))
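Since the question asks for data.table, here is a minimal sketch of the same idea there. The attempt in the question groups by Quarter as well, so nearly every group holds a single row and c(NA, diff(Shares)) has nothing to difference. The toy table below is copied from the question; qvar is just an illustrative name.

```r
library(data.table)

# Toy data copied from the question
df <- data.table(
  stock   = "GOOGLE",
  Holder  = rep(c("Advance Capital Management, Inc.", "Advest, Inc."), c(4, 2)),
  Quarter = c("2015 Q3", "2015 Q4", "2016 Q1", "2016 Q2", "2015 Q3", "2015 Q3"),
  Shares  = c(5800, 9000, 7000, 7560, 12000, 13450)
)

# Sort so that differences follow time order within each group
# ("YYYY Qn" strings sort chronologically)
setorder(df, stock, Holder, Quarter)

# Group by Holder and stock only; shift() returns the previous row's value
df[, qoq := Shares - shift(Shares), by = .(stock, Holder)]

# Variance of the quarterly changes per Holder and stock
qvar <- df[, .(qvar = var(qoq, na.rm = TRUE)), by = .(stock, Holder)]
```

A group with fewer than two non-missing changes gets NA for the variance, which is what var() returns for a single value.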

Related

R Dataframe Rolling Measures by Group

I have a dataframe looks like below (the true data has many more people):
Year Player Club
2005 Phelan Chicago Fire
2007 Phelan Boston Pant
2008 Phelan Boston Pant
2010 Phelan Chicago Fire
2002 John New York Jet
2006 John New York Jet
2007 John Atlanta Elephant
2009 John Los Angeles Eagle
I want to calculate a player-level measure (count) for each row (year) that captures the weighted number of clubs that a person has experienced up to that point. The formula is (length of experience 1/total years up to that point)^2+(length of experience 2/total years up to that point)^2+......
Below is the ideal output for Phelan. For example, "count" for his first row is 1, as it is his first year in the data and (1/1)^2=1. For his second row, which covers three years (2005, 2006, 2007) up to this point, count=(1/3)^2+(2/3)^2=0.56 (assuming that in 2006, which is missing from the data, Phelan also stayed at Chicago Fire). For his third row, count=(2/4)^2+(2/4)^2=0.5. For his fourth row, count=(3/6)^2+(3/6)^2=0.5 (assuming that in 2009, which is missing from the data, Phelan also stayed at Boston Pant).
Year Player Club Count
2005 Phelan Chicago Fire 1
2007 Phelan Boston Pant 0.56
2008 Phelan Boston Pant 0.5
2010 Phelan Chicago Fire 0.5
This is a bit convoluted but I think it does what you want.
Using data.table:
library(data.table)
library(zoo) # for na.locf(...)
##
expand.df <- setDT(df)[, .(Year=min(Year):max(Year)), by=.(Player)]
expand.df[df, Club:=i.Club, on=.(Player, Year)]
expand.df[, Club:=na.locf(Club)]
expand.df[, cuml.exp:=1:.N, by=.(Player)]
expand.df <- expand.df[expand.df[, .(Player, cuml.exp)], on=.(Player, cuml.exp <= cuml.exp)]
expand.df <- expand.df[, .(Year=max(Year), club.exp=sum(sapply(unique(Club), \(x) sum(Club==x)^2))), by=.(Player, cuml.exp)]
expand.df[, score:=club.exp/cuml.exp^2]
result <- expand.df[df, on=.(Player, Year), nomatch=NULL]
result[, .(Player, Year, Club, cuml.exp, club.exp, score)]
## Player Year Club cuml.exp club.exp score
## 1: Phelan 2005 Chicago Fire 1 1 1.0000000
## 2: Phelan 2007 Boston Pant 3 5 0.5555556
## 3: Phelan 2008 Boston Pant 4 8 0.5000000
## 4: Phelan 2010 Chicago Fire 6 18 0.5000000
## 5: John 2002 New York Jet 1 1 1.0000000
## 6: John 2006 New York Jet 5 25 1.0000000
## 7: John 2007 Atlanta Elephant 6 26 0.7222222
## 8: John 2009 Los Angeles Eagle 8 30 0.4687500
So this expands your df to include one row per year per player, then joins back the clubs for the appropriate years, then fills the gaps per your description. Then we calculate cumulative years of experience for each player.
The next bit is the convoluted part: we need to expand further so that for each combination of player and cuml.exp we have all the rows up to that point. The join on=.(Player, cuml.exp <= cuml.exp) does that. Then we can count the number of instances of each club by player and cuml.exp to get the numerator of your score.
Then we calculate the scores, drop the extra years and the extra columns.
Note that this assumes you've got R 4.1+. If not, replace \(x)... with function(x)....
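As a quick sanity check of the formula, the score for a single player can also be computed directly from the filled-in sequence of clubs. This is only a sketch: club_score is a hypothetical helper, and the phelan vector is typed in by hand with the gap years filled as described in the question.

```r
# Hypothetical helper: sum of squared shares of years spent at each club
club_score <- function(clubs) {
  counts <- table(clubs)            # years spent at each club so far
  sum(counts^2) / length(clubs)^2   # sum over clubs of (years at club / total years)^2
}

# Phelan's clubs for 2005..2010, with 2006 and 2009 filled forward
phelan <- c("Chicago Fire", "Chicago Fire", "Boston Pant",
            "Boston Pant", "Boston Pant", "Chicago Fire")

# Scores after 1, 3, 4 and 6 years (the rows for 2005, 2007, 2008, 2010)
scores <- sapply(c(1, 3, 4, 6), function(n) club_score(phelan[seq_len(n)]))
round(scores, 2)
```

The values agree with the score column above for Phelan: 1, 0.56, 0.5 and 0.5.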

How to combine row with column headings?

I have a large dataset which, simplified, looks something like this:
Year Name January February March April May Street
2000 Bob $100 $197 $124 $100 ABC
2000 Abe $100 $100 $117 $123 $100 ABC
2001 Bob $100 $100 $197 $103 $150 DEF
2001 Abe $140 $100 $127 $526 $123 ABC
2002 Abe $100 $100 $198 $102 $101 DEF
2002 Bob $102 $110 ABC
2003 Carly $100 $100 $197 ABC
I am trying to combine this data so that each person has one line, with the goal of counting and graphing how many months they paid in a row.
I was thinking of trying to recode the data so that each person gets their own row, with a timeline of how much they paid by year and season, with column names like this, but I am having trouble figuring out how to do that.
Name, 2000 January, 2000 February, 2000 March, 2000 April, 2000 May, 2001 January, 2001 February, 2001 March, 2001 April, 2001 May, 2002 January, 2002 February, 2002 March, 2002 April, 2002 May, Street
Is there a way to condense variables in this way somehow?
Thank you so much!
Using pivot_wider from {tidyr} will achieve this. Calling your dataframe yeardata, you can do the following:
library(dplyr) # for %>%
library(tidyr)
selectmonths <- c("January", "February", "March", "April", "May")
result <- yeardata %>%
  pivot_wider(names_from = Year, values_from = all_of(selectmonths))
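If the goal is strictly one row per person with column names such as "2000 January", a sketch along these lines may help. It assumes a reasonably recent tidyr (for names_glue) and uses a small made-up subset of the table; because Street changes between years for the same person, it is left out of id_cols here so that it does not split a person across several rows.

```r
library(tidyr)

# Illustrative toy data (values are made up; NA marks a month with no payment)
yeardata <- data.frame(
  Year     = c(2000, 2000, 2001),
  Name     = c("Bob", "Abe", "Bob"),
  January  = c(100, 100, 100),
  February = c(197, 100, NA),
  Street   = c("ABC", "ABC", "DEF")
)

wide <- pivot_wider(
  yeardata,
  id_cols     = Name,                  # one row per person; Street is dropped
  names_from  = Year,
  values_from = c(January, February),
  names_glue  = "{Year} {.value}"      # builds names like "2000 January"
)
```

Year and month combinations that a person has no row for come out as NA, which makes it straightforward to count consecutive paid months afterwards.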

How do I lag Quarters in r?

First and foremost - thank you for viewing my question - regardless of if you answer or not.
I am trying to add a column that contains the lagged values of the Quarter value to my DF, however, I get the below warning when I do so:
Warning messages:
1: In mutate_impl(.data, dots) :
Vectorizing 'yearqtr' elements may not preserve their attributes
Below is my sample data (my data starts on 1/3/2018)
Ticker Price Date Quarter
A 10 1/3/18 2018 Q1
A 13.5 2/15/18 2018 Q1
A 12.9 4/2/18 2018 Q2
A 11.2 5/3/18 2018 Q2
B 35.2 1/4/18 2018 Q1
B 33.1 3/2/18 2018 Q1
B 31 4/6/18 2018 Q2
... ... ... ...
XYZ 102 5/6/18 2018 Q2
I have a huge table with multiple stocks and multiple dates. The way I calculate the quarter column is :
df$quarter <- lag(as.yearqtr(df$Date))
However, I can't manage to add a column that lags the values of Quarter. Would anyone know a possible workaround?
I would like the below output:
Ticker Price Date Quarter Lag_Q
A 10 1/3/18 2018 Q1 NA
A 13.5 2/15/18 2018 Q1 NA
A 12.9 4/2/18 2018 Q2 2018 Q1
A 11.2 5/3/18 2018 Q2 2018 Q1
B 35.2 1/4/18 2018 Q1 NA
B 33.1 3/2/18 2018 Q1 NA
B 31 4/6/18 2018 Q2 2018 Q1
... ... ... ...
XYZ 102 5/6/18 2018 Q2 2018 Q1
Firstly, I'd suggest organizing your data so that each column represents prices of an individual security and each row is a specific date. From there, you can transform all securities easily, but I'm not sure what your end goal is. The xts package is excellent, has been optimized in C, and is more or less the securities industry standard. I highly suggest exploring it. But that's beyond the scope of your post!
For your data structure though, a single line should do:
df$lag_Q <- as.yearqtr( ifelse(test = (df$quarter=="2018 Q1"),
yes = NA,
no = df$quarter-0.25) )
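The line above hard-codes "2018 Q1" as the first quarter. A slightly more general sketch, assuming that Lag_Q should be the previous calendar quarter and NA whenever that quarter lies before the ticker's first quarter in the data, could look like this (the toy rows are taken from the question):

```r
library(zoo)

# Toy rows from the question
df <- data.frame(
  Ticker = c("A", "A", "A", "A", "B", "B", "B"),
  Price  = c(10, 13.5, 12.9, 11.2, 35.2, 33.1, 31),
  Date   = c("1/3/18", "2/15/18", "4/2/18", "5/3/18", "1/4/18", "3/2/18", "4/6/18")
)

# Quarter of each observation (yearqtr is stored as year + 0, 0.25, 0.5 or 0.75)
df$Quarter <- as.yearqtr(as.Date(df$Date, format = "%m/%d/%y"))

# First quarter observed per ticker, and the quarter before each row's quarter
first_q <- ave(as.numeric(df$Quarter), df$Ticker, FUN = min)
prev_q  <- as.numeric(df$Quarter) - 0.25

# NA where the previous quarter precedes the ticker's first quarter
df$Lag_Q <- as.yearqtr(ifelse(prev_q < first_q, NA, prev_q))
```

Working on the numeric representation and converting back with as.yearqtr() at the end also sidesteps the attribute warning quoted in the question.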

Need to fetch only first two highest records by grouping two columns of dataframe in R

I have a data.frame which contains 4 columns with 13 rows. Below is the sample data. [Column name is in uppercase and data is in lower case]
Sample input data:
NAME. MARKS MONTH COUNTRY
ram 20. jan India
ranjith 40. jan India
naren. 80. jan. India
Amir. 90. feb. India
kumar. 60. feb India
azhar 80. feb India
mark 90. feb. US
Alex. 55 feb. US
chris 20 feb US
rakesh 60. jan US
Mona. 70. jan. US
mano. 90. mar. UK
Ron. 37. mar. UK
Expected Output:
NAME MARKS. MONTH COUNTRY
naren 80. jan. India
ranjith 40. jan. India
Amir. 90. feb. India
Azhar. 80. feb. India
mark. 90. feb. US
Alex 55. feb. US
Mona. 70. jan. US
Rakesh. 60. jan. US
mano. 90. mar. UK
Ron. 37. mar. UK
Question : From the input dataframe I want to select only the highest two mark values from each group called MONTH and COUNTRY. Sample output is given above.
Can anyone share the sample code to produce the correct output and assign it to new dataframe. Any method is preferable including sqldf.
You can do it as follows using data.table. Thanks to @Arun for his suggestions on improving the answer.
require(data.table)
dat <- fread(txt)
dat[order(MARKS), tail(.SD, 2L), by=c("MONTH", "COUNTRY")]
Note that this only computes the order vector and does not rearrange the entire data.table before performing the grouping operations (hence it is more memory efficient). .SD contains the subset of data for each group and is itself a data.table.
With too many groups, tail(.SD, 2L) could be slightly slower. In that case, we can use .I, which returns the row indices, and then subset once at the end as follows:
ix = dat[order(MARKS), .(I=tail(.I, 2L)), by=c("MONTH", "COUNTRY")][, I]
dat[ix]
The first call results in (dat[ix] returns the same rows with the columns in their original order):
MONTH COUNTRY NAME MARKS
1: jan India ranjith 40
2: jan India naren 80
3: feb US Alex 55
4: feb US mark 90
5: mar UK Ron 37
6: mar UK mano 90
7: feb India azhar 80
8: feb India Amir 90
9: jan US rakesh 60
10: jan US Mona 70
Where txt is your data without the trailing dots (.):
txt <- "NAME MARKS MONTH COUNTRY
ram 20 jan India
ranjith 40 jan India
naren 80 jan India
Amir 90 feb India
kumar 60 feb India
azhar 80 feb India
mark 90 feb US
Alex 55 feb US
chris 20 feb US
rakesh 60 jan US
Mona 70 jan US
mano 90 mar UK
Ron 37 mar UK"
In dplyr, you can group_by, arrange, and slice. With some cleaning:
library(dplyr)
# take out .s
df %>% mutate_all(sub, pattern = '.', replacement = '', fixed = TRUE) %>%
# convert to numbers, if necessary
mutate_all(type.convert, as.is = TRUE) %>%
# set grouping for following operations
group_by(MONTH, COUNTRY) %>%
# sort by MARKS, descending
arrange(desc(MARKS)) %>%
# subset to top two rows of each group
slice(1:2)
## Source: local data frame [10 x 4]
## Groups: MONTH, COUNTRY [5]
##
## NAME. MARKS MONTH COUNTRY
## <chr> <int> <chr> <chr>
## 1 Amir 90 feb India
## 2 azhar 80 feb India
## 3 mark 90 feb US
## 4 Alex 55 feb US
## 5 naren 80 jan India
## 6 ranjith 40 jan India
## 7 Mona 70 jan US
## 8 rakesh 60 jan US
## 9 mano 90 mar UK
## 10 Ron 37 mar UK
Here is an option with base R (no packages used). We extract the first 3 letters from 'MONTH' using substr (as there are some . in some cases). Using ave, we get the logical index based on the rank after grouping by 'COUNTRY' and 'MONTH', it can be used to subset the rows.
df1$MONTH <- substr(df1$MONTH, 1, 3)
df1[with(df1, as.logical(ave(MARKS, COUNTRY, MONTH,
FUN = function(x) rank(-x) %in% 1:2))),]
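One caveat with rank(): by default it assigns average ranks to ties, so two equal top marks would both get 1.5 and neither would match 1:2. A tie-safe base R sketch, assuming the data has already been cleaned of the stray dots (as in txt above), is to sort first and then number the rows within each group:

```r
# Cleaned sample data from the question
df1 <- read.table(header = TRUE, text = "
NAME MARKS MONTH COUNTRY
ram 20 jan India
ranjith 40 jan India
naren 80 jan India
Amir 90 feb India
kumar 60 feb India
azhar 80 feb India
mark 90 feb US
Alex 55 feb US
chris 20 feb US
rakesh 60 jan US
Mona 70 jan US
mano 90 mar UK
Ron 37 mar UK")

# Sort by group, then by MARKS descending within each group
ord <- df1[order(df1$MONTH, df1$COUNTRY, -df1$MARKS), ]

# Number the rows within each group and keep the first two
pos  <- ave(ord$MARKS, ord$MONTH, ord$COUNTRY, FUN = seq_along)
top2 <- ord[pos <= 2, ]
```

With ties this keeps exactly two rows per group, in whatever order the sort leaves the tied rows.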

How to find unique field values from two columns in data frame

I have a data frame containing many columns, including Quarter and CustomerID. In this I want to identify the unique combinations of Quarter and CustomerID.
For eg:
masterdf <- read.csv(text = "
Quarter, CustomerID, ProductID
2009 Q1, 1234, 1
2009 Q1, 1234, 2
2009 Q2, 1324, 3
2009 Q3, 1234, 4
2009 Q3, 1234, 5
2009 Q3, 8764, 6
2009 Q4, 5432, 7")
What I want is:
FilterQuarter UniqueCustomerID
2009 Q1 1234
2009 Q2 1324
2009 Q3 8764
2009 Q3 1234
2009 Q4 5432
How can I do this in R? I tried the unique function, but it is not working as I want.
The long comments under the OP are getting hard to follow. You are looking for duplicated, as pointed out by @RomanLustrik. Use it to subset your original data.frame like this...
masterdf[ ! duplicated( masterdf[ c("Quarter" , "CustomerID") ] ) , c("Quarter" , "CustomerID") ]
# Quarter CustomerID
#1 2009 Q1 1234
#3 2009 Q2 1324
#4 2009 Q3 1234
#6 2009 Q3 8764
#7 2009 Q4 5432
Another simple way is to use SQL queries from R; see the code below.
This assumes masterdf is the name of the original data frame...
library(sqldf)
sqldf("select Quarter, CustomerID from masterdf group by 1,2")
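For completeness, unique() does work here when it is applied to the two columns together as a data frame rather than to a single column. A minimal sketch, reading the question's sample with strip.white = TRUE (an addition here, so that the spaces after the commas do not end up in the column names and values):

```r
# Sample data from the question
masterdf <- read.csv(strip.white = TRUE, text = "
Quarter, CustomerID, ProductID
2009 Q1, 1234, 1
2009 Q1, 1234, 2
2009 Q2, 1324, 3
2009 Q3, 1234, 4
2009 Q3, 1234, 5
2009 Q3, 8764, 6
2009 Q4, 5432, 7")

# unique() on a data frame drops duplicated rows, i.e. repeated combinations
combos <- unique(masterdf[c("Quarter", "CustomerID")])
```

This keeps the first occurrence of each Quarter and CustomerID pair, the same rows that the duplicated() approach selects.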
