How to find unique field values from two columns in a data frame - R

I have a data frame containing many columns, including Quarter and CustomerID. In this I want to identify the unique combinations of Quarter and CustomerID.
For example:
masterdf <- read.csv(strip.white = TRUE, text = "
Quarter, CustomerID, ProductID
2009 Q1, 1234, 1
2009 Q1, 1234, 2
2009 Q2, 1324, 3
2009 Q3, 1234, 4
2009 Q3, 1234, 5
2009 Q3, 8764, 6
2009 Q4, 5432, 7")
What I want is:
FilterQuarter UniqueCustomerID
2009 Q1 1234
2009 Q2 1324
2009 Q3 8764
2009 Q3 1234
2009 Q4 5432
How do I do this in R? I tried the unique function but it does not give me what I want.

The long comments under the OP are getting hard to follow. You are looking for duplicated, as pointed out by @RomanLustrik. Use it to subset your original data.frame like this:
masterdf[ !duplicated(masterdf[c("Quarter", "CustomerID")]), ]
# Quarter CustomerID
#1 2009 Q1 1234
#3 2009 Q2 1324
#4 2009 Q3 1234
#6 2009 Q3 8764
#7 2009 Q4 5432

Another simple way is to run an SQL query from R; see the code below.
This assumes masterdf is the name of the original data frame:
library(sqldf)
sqldf("select Quarter, CustomerID from masterdf group by 1,2")
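If you prefer the tidyverse, dplyr's distinct() expresses the same idea; a sketch assuming the masterdf defined above:

```r
library(dplyr)

# Keep one row per unique (Quarter, CustomerID) pair;
# by default distinct() drops the other columns
distinct(masterdf, Quarter, CustomerID)
```

Add .keep_all = TRUE if you also want to retain the first ProductID seen for each pair.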

Related

rearrange data in a specific structure

I have data in this format:
state   year1   year2
First   2000    2004-2005
Second  2007    2010-2011
Third   2008    2010
Third   2010    2012
I want to make this:
state   year
First   2000
First   2004-2005
Second  2007
Second  2010-2011
Third   2008
Third   2010
Third   2012
The code can be in R or Python. Thanks in advance
There is a function in the data.table package called melt(), which converts data from wide to long format. Here I keep State as the ID variable and pull Year1 and Year2 into the value column. A final step keeps only unique observations to remove duplicates.
library(data.table)
data <- data.table(
  State = c("First","Second","Third","Third"),
  Year1 = c("2000","2007","2008","2010"),
  Year2 = c("2004-2005","2010-2011","2010","2012"))
data
State Year1 Year2
1: First 2000 2004-2005
2: Second 2007 2010-2011
3: Third 2008 2010
4: Third 2010 2012
data2 <- melt(
  data = data,
  id.vars = c("State"),
  measure.vars = c("Year1","Year2"),
  variable.name = "Year",
  value.name = "years")
# Drop duplicates after selecting the columns of interest,
# so the same year is not kept once under Year1 and again under Year2
data2 <- unique(data2[, .(State, years)])
data2[order(State)]
    State     years
1:  First      2000
2:  First 2004-2005
3: Second      2007
4: Second 2010-2011
5:  Third      2008
6:  Third      2010
7:  Third      2012
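The same reshape can also be sketched with tidyr's pivot_longer(), assuming the data table built above:

```r
library(tidyr)
library(dplyr)

data %>%
  pivot_longer(c(Year1, Year2), values_to = "years") %>%  # wide to long
  distinct(State, years) %>%                              # drop duplicate state/year pairs
  arrange(State)
```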

R: Count number of new observations compared to previous groups

I would like to know the number of new observations that occurred between groups.
If I have the following data:
Year  Observation
2009  A
2009  A
2009  B
2010  A
2010  B
2010  C
I would like the output to be:
Year  New_Observation_Count
2009  2
2010  1
I am new to R and don't really know how to move forward. I have tried the count function from the tidyverse but still can't figure it out.
You can use union with Reduce:
y <- split(x$Observation, x$Year)
data.frame(Year = names(y),
           nNew = diff(lengths(Reduce(union, y, NULL, accumulate = TRUE))))
# Year nNew
#1 2009 2
#2 2010 1
Data:
x <- read.table(header=TRUE, text="Year Observation
2009 A
2009 A
2009 B
2010 A
2010 B
2010 C")
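A dplyr sketch of the same counting idea, assuming the x data frame above: keep only the first year in which each observation value appears, then count per year.

```r
library(dplyr)

x %>%
  distinct(Observation, .keep_all = TRUE) %>%  # first row where each value is seen
  count(Year, name = "New_Observation_Count")
```

Note that distinct() keeps the first occurrence in row order, so this relies on the data being sorted by Year, as in the example.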

Conditional statement within loop using multiple datasets R

I would like to figure out who was the most recent previous owner at a location within the last two years before the current owner. The locations are called reflo (reference location). Note that there is not always an exact match for reflo.x and reflo within two years (so a solution that allows me to add additional conditions, such as to find the next closest reflo, would be extra helpful).
The conditions:
the previous owner has to have lived at the same location (lifetime_census$reflo==owners$reflo.x[i]) within two years of the current owner's year (lifetime_census$census_year <= 2 years of owners$spr_census)
if none, then assign NA
Previous owners (>20,000) are stored in a dataset called lifetime_census. Here is a sample of the data:
id previous_id reflo census_year
16161 5587 -310 2001
17723 5587 -310 2002
19345 5879 -310 2003
16848 5101 Q1 2001
17836 6501 Q1 2002
19439 6501 Q1 2003
21815 6057 Q1 2004
I then have an owners dataset (here is a sample):
squirrel_id spr_census reflo.x
6391 2005 Q1
6130 2005 -310
6288 2005 A12
To illustrate what I am trying to achieve:
squirrel_id spr_census reflo.x previous_owner census_year
6391 2005 Q1 6057 2004
6130 2005 -310 5879 2003
6288 2005 A12 NA NA
What I have currently tried is this:
n <- length(owners$squirrel_id)
for(i in 1:n) {
  last_owner <- subset(lifetime_census,
    lifetime_census$previous_id != owners$squirrel_id[i] & # previous owner != current owner
    lifetime_census$reflo == owners$reflo.x[i] &
    lifetime_census$census_year <= owners$spr_census[i])   # owners can be in current or past year
  # Put it all together
  owners[i, "spring_owner"] <- last_owner$previous_id[i]
}
This gives me a new column for the previous owner in any past year for reflo.x, adding NAs after all the conditions are not met. I cannot figure out how to restrict this search to the last two years.
Any ideas?
To figure out who was the most recent previous owner at a location within the last two years before the current owner, you can first arrange by year in descending order:
library(dplyr)
lifetime_census <- lifetime_census %>%
  group_by(reflo) %>%
  arrange(desc(census_year), .by_group = TRUE)
Which puts the most recent years first within each reflo:
id    previous_id reflo census_year
19345 5879        -310  2003
17723 5587        -310  2002
16161 5587        -310  2001
21815 6057        Q1    2004
19439 6501        Q1    2003
17836 6501        Q1    2002
16848 5101        Q1    2001
Then you can run the loop above:
n <- length(owners$squirrel_id)
for(i in 1:n) {
  last_owner <- subset(lifetime_census,
    lifetime_census$previous_id != owners$squirrel_id[i] &
    lifetime_census$reflo == owners$reflo.x[i] &
    lifetime_census$census_year <= owners$spr_census[i]) # owners can be in current or past year
  # Put it all together: row 1 is the most recent match, since the data is sorted descending
  owners[i, "previous_owner"] <- last_owner$previous_id[1]
  owners[i, "prev_census"] <- last_owner$census_year[1]
}
This will give you:
> head(owners)
  squirrel_id spr_census reflo.x previous_owner prev_census
  <chr>       <chr>      <chr>   <chr>                <dbl>
  6391        2005       Q1      6057                  2004
  6130        2005       -310    5879                  2003
  6288        2005       A12     <NA>                    NA
If an individual above was matched to a year more than two years before its spr_census year, you can fix this on a case-by-case basis (not the most elegant solution, but it is workable) with an ifelse() statement, like so:
owners <- owners %>% mutate(previous_owner = ifelse(prev_census < 2003, NA, previous_owner))
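For the two-year restriction itself, here is a join-based sketch (column names as in the samples above; treat it as an outline rather than a drop-in solution):

```r
library(dplyr)

owners %>%
  left_join(lifetime_census, by = c("reflo.x" = "reflo")) %>%
  # keep candidates from the two years up to the owner's census,
  # and exclude the current owner themselves
  filter(census_year <= spr_census,
         census_year >= spr_census - 2,
         previous_id != squirrel_id) %>%
  group_by(squirrel_id, spr_census, reflo.x) %>%
  slice_max(census_year, n = 1, with_ties = FALSE) %>%  # most recent match
  ungroup() %>%
  right_join(owners, by = c("squirrel_id", "spr_census", "reflo.x"))
```

The right_join at the end restores owners with no qualifying match (such as A12 above) as NA rows.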

How do I lag Quarters in r?

First and foremost - thank you for viewing my question - regardless of if you answer or not.
I am trying to add a column containing the lagged values of the Quarter column to my data frame; however, I get the warning below when I do so:
Warning messages:
1: In mutate_impl(.data, dots) :
Vectorizing 'yearqtr' elements may not preserve their attributes
Below is my sample data (my data starts on 1/3/2018)
Ticker Price Date Quarter
A 10 1/3/18 2018 Q1
A 13.5 2/15/18 2018 Q1
A 12.9 4/2/18 2018 Q2
A 11.2 5/3/18 2018 Q2
B 35.2 1/4/18 2018 Q1
B 33.1 3/2/18 2018 Q1
B 31 4/6/18 2018 Q2
... ... ... ...
XYZ 102 5/6/18 2018 Q2
I have a huge table with multiple stocks and multiple dates. The way I calculate the quarter column is :
df$quarter <- lag(as.yearqtr(df$Date))
However, I can't manage to add a column that lags the values of Quarter. Would anyone know a possible workaround?
I would like the below output:
Ticker Price Date Quarter Lag_Q
A 10 1/3/18 2018 Q1 NA
A 13.5 2/15/18 2018 Q1 NA
A 12.9 4/2/18 2018 Q2 2018 Q1
A 11.2 5/3/18 2018 Q2 2018 Q1
B 35.2 1/4/18 2018 Q1 NA
B 33.1 3/2/18 2018 Q1 NA
B 31 4/6/18 2018 Q2 2018 Q1
... ... ... ...
XYZ 102 5/6/18 2018 Q2 2018 Q1
Firstly, I'd suggest organizing your data so that each column represents prices of an individual security and each row is a specific date. From there, you can transform all securities easily, but I'm not sure what your end goal is. The xts package is excellent, has been optimized in C, and is something of a securities industry standard. I highly suggest exploring it. But that's beyond the scope of your post!
For your data structure though, a single line should do:
library(zoo)  # for as.yearqtr

df$lag_Q <- as.yearqtr(ifelse(test = (df$quarter == "2018 Q1"),
                              yes  = NA,
                              no   = df$quarter - 0.25))
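If the goal from the desired output above is simply the previous calendar quarter, with NA where the sample has no earlier quarter, here is a base-R sketch (assuming Quarter is already a zoo yearqtr column, as in the table):

```r
library(zoo)

# yearqtr is stored as year + quarter/4, so subtracting 0.25 moves back one quarter
df$Lag_Q <- df$Quarter - 0.25

# the first quarter in the sample has no predecessor
df$Lag_Q[df$Quarter == min(df$Quarter)] <- NA
```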

All permutations of quarterly data

I have a data set that contains quarterly data for 8 years. If I randomly select each quarter from one of the years I could in theory construct a "new" year. For example: new year = Q1(2009), Q2(2012), Q3(2010), Q4(2015).
The problem I have, is that I would like to construct a data set that contains all such permutations. With 8 years and 4 quarters that would give me 4^8= 65536 "new" years. Is this something best tackled with a nested loop, or are there functions out there that could work better?
We can use expand.grid to create a data frame of all possible combinations:
nrow(do.call('expand.grid', replicate(8, 1:4, simplify=FALSE)))
[1] 65536
I think you want combinations of the 8 years over 4 quarters so the number of combinations is 8^4 = 4096:
> x <- years <- 2008:2015
> length(x)
[1] 8
> comb <- expand.grid(x, x, x, x)
> head(comb)
Var1 Var2 Var3 Var4
1 2008 2008 2008 2008
2 2009 2008 2008 2008
3 2010 2008 2008 2008
4 2011 2008 2008 2008
5 2012 2008 2008 2008
6 2013 2008 2008 2008
> tail(comb)
Var1 Var2 Var3 Var4
4091 2010 2015 2015 2015
4092 2011 2015 2015 2015
4093 2012 2015 2015 2015
4094 2013 2015 2015 2015
4095 2014 2015 2015 2015
4096 2015 2015 2015 2015
> nrow(comb)
[1] 4096
Each row is a year and Var1, Var2, Var3, Var4 are the 4 quarters.
You may want to wait a bit to see if someone gives you a less 'janky' answer, but this example takes a time series, builds all permutations with no repeated quarters within each year, and returns those new years with the old year and quarter information as columns.
set.seed(1234)
# Make some fake data
q_dat <- data.frame(year = c(rep(2011,4),
                             rep(2012,4),
                             rep(2013,4)),
                    quarters = rep(c("Q1","Q2","Q3","Q4"),3),
                    x = rnorm(12))
q_dat
year quarters x
1 2011 Q1 -1.2070657
2 2011 Q2 0.2774292
3 2011 Q3 1.0844412
4 2011 Q4 -2.3456977
5 2012 Q1 0.4291247
6 2012 Q2 0.5060559
7 2012 Q3 -0.5747400
8 2012 Q4 -0.5466319
9 2013 Q1 -0.5644520
10 2013 Q2 -0.8900378
11 2013 Q3 -0.4771927
12 2013 Q4 -0.9983864
So what we are going to do is:
1. Take all possible combinations of the time series.
2. Remove all duplicates, so that each made-up year does not contain the same quarter twice.
# Expand out all possible combinations of our three years
q_perms <- expand.grid(q1 = 1:nrow(q_dat), q2 = 1:nrow(q_dat),
                       q3 = 1:nrow(q_dat), q4 = 1:nrow(q_dat))
# remove any duplicate combinations
# EX: So we don't get c(2011Q1,2011Q1,2011Q1,2011Q1) as a year
q_perms <- q_perms[apply(q_perms,1,function(x) !any(duplicated(x))),]
# Transpose the grid, remake it as a data frame, and lapply over it
l_rand_dat <- lapply(data.frame(t(q_perms)),function(x) q_dat[x,])
# returns one unique year per list
l_rand_dat[[30]]
year quarters x
5 2012 Q1 0.4291247
6 2012 Q2 0.5060559
2 2011 Q2 0.2774292
1 2011 Q1 -1.2070657
# bind all of those together
rand_bind <- do.call(rbind,l_rand_dat)
head(rand_bind)
year quarters x
X172.4 2011 Q4 -2.3456977
X172.3 2011 Q3 1.0844412
X172.2 2011 Q2 0.2774292
X172.1 2011 Q1 -1.2070657
X173.5 2012 Q1 0.4291247
X173.3 2011 Q3 1.0844412
This is a pretty memory intensive answer. If someone can skip the 'make all possible combinations' step then that would be a significant improvement.
