Conditional statement within a loop using multiple datasets in R

I would like to figure out who was the most recent previous owner at a location within the last two years before the current owner. The locations are called reflo (reference location). Note that there is not always an exact match for reflo.x and reflo within two years (so a solution that allows me to add additional conditions, such as to find the next closest reflo, would be extra helpful).
The conditions:
the previous owner has to have lived at the same location (lifetime_census$reflo == owners$reflo.x[i]) within two years of the current owner's year (lifetime_census$census_year within 2 years of owners$spr_census)
if none, then assign NA
Previous owners (>20,000) are stored in a dataset called lifetime_census. Here is a sample of the data:
id previous_id reflo census_year
16161 5587 -310 2001
17723 5587 -310 2002
19345 5879 -310 2003
16848 5101 Q1 2001
17836 6501 Q1 2002
19439 6501 Q1 2003
21815 6057 Q1 2004
I then have an owners dataset (here is a sample):
squirrel_id spr_census reflo.x
6391 2005 Q1
6130 2005 -310
6288 2005 A12
To illustrate what I am trying to achieve:
squirrel_id spr_census reflo.x previous_owner census_year
6391 2005 Q1 6057 2004
6130 2005 -310 5879 2003
6288 2005 A12 NA NA
What I have currently tried is this:
n <- length(owners$squirrel_id)
for(i in 1:n) {
last_owner <- subset(lifetime_census,
lifetime_census$previous_id!=owners$squirrel_id[i] & #previous owner != current owner
lifetime_census$reflo==owners$reflo.x[i] &
lifetime_census$census_year<=owners$spr_census[i]) #owners can be in current or past year
#Put it all together
owners[i,"spring_owner"] <- last_owner$previous_id[i]
}
This gives me a new column with the previous owner from any past year for reflo.x, with NAs when the conditions are not met. I cannot figure out how to restrict the search to the last two years.
Any ideas?

To figure out who was the most recent previous owner at a location within the last two years before the current owner you can first arrange by date in descending order:
library(dplyr)
lifetime_census <- lifetime_census %>%
  group_by(reflo) %>%
  arrange(desc(census_year), .by_group = TRUE)
This puts the most recent year first within each reflo:
id previous_id reflo census_year
19345 5879 -310 2003
17723 5587 -310 2002
16161 5587 -310 2001
21815 6057 Q1 2004
19439 6501 Q1 2003
17836 6501 Q1 2002
16848 5101 Q1 2001
Then you can run the loop above:
n <- length(owners$squirrel_id)
for(i in 1:n) {
last_owner <- subset(lifetime_census,
lifetime_census$previous_id != owners$squirrel_id[i] & #previous owner != current owner
lifetime_census$reflo == owners$reflo.x[i] &
lifetime_census$census_year <= owners$spr_census[i] & #owner can be in the current or a past year...
lifetime_census$census_year >= owners$spr_census[i] - 2) #...but at most two years back
#Put it all together: the data are sorted most recent first, so row 1 is
#the most recent previous owner (and NA when there is no match)
owners[i,"previous_owner"] <- last_owner$previous_id[1]
owners[i,"prev_census"] <- last_owner$census_year[1]
}
This will give you:
> head(owners)
  squirrel_id spr_census reflo.x previous_owner prev_census
1        6391       2005      Q1           6057        2004
2        6130       2005    -310           5879        2003
3        6288       2005     A12             NA          NA
If an individual above was matched to a census year more than two years before its spr_census year, you can fix this on a case-by-case basis (not the most elegant solution, but it's workable) with an ifelse statement, like so:
owners <- owners %>% mutate(previous_owner = ifelse(prev_census < spr_census - 2, NA, previous_owner))
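A loop-free alternative is a join-based sketch: match the two tables on location, filter to the two-year window, and keep the most recent match per owner. This assumes the column names shown above; the data frames here are hypothetical re-creations of the question's samples, and the filter conditions are easy to extend (e.g. with a nearest-reflo fallback).

```r
library(dplyr)

# Hypothetical re-creation of the question's sample data
lifetime_census <- data.frame(
  id = c(16161, 17723, 19345, 16848, 17836, 19439, 21815),
  previous_id = c(5587, 5587, 5879, 5101, 6501, 6501, 6057),
  reflo = c("-310", "-310", "-310", "Q1", "Q1", "Q1", "Q1"),
  census_year = c(2001, 2002, 2003, 2001, 2002, 2003, 2004))

owners <- data.frame(
  squirrel_id = c(6391, 6130, 6288),
  spr_census = c(2005, 2005, 2005),
  reflo.x = c("Q1", "-310", "A12"))

best_match <- lifetime_census %>%
  inner_join(owners, by = c("reflo" = "reflo.x")) %>%
  filter(previous_id != squirrel_id,          # previous owner != current owner
         census_year <= spr_census,           # current or past year...
         census_year >= spr_census - 2) %>%   # ...but at most two years back
  group_by(squirrel_id) %>%
  slice_max(census_year, n = 1, with_ties = FALSE) %>%  # most recent match
  ungroup() %>%
  select(squirrel_id, previous_owner = previous_id, prev_census = census_year)

# Owners with no qualifying match get NA via the left join
owners_out <- owners %>% left_join(best_match, by = "squirrel_id")
```

Because unmatched owners fall out of the inner join and come back as NA in the left join, there is no need for a separate NA-assignment step.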

Related

succinct code for repetitive subsetting in R

I am an R beginner and am having trouble finding a better way to recode an element of a data frame. I have data with a column for the year it was sampled (assessed); however, I want to run some tests on biennial subsets (not annual, as it is currently formatted). Therefore I want each pair of consecutive years to be identified by a single assessment year. I think I could run something like:
ddd$Assessment[ddd$Assessment==1997 | ddd$Assessment==1998] <- 1998
but feel there must be a better way (I know I don't need the second half of the condition above, but left it in for clarity), especially as I have data spanning 23 years.
Any help would be very much appreciated
If your assessment year is consistently every other year, here is one way to create your biennial column by using the properties of the ceiling function.
ddd <- data.frame(Assessment = 1997:2006)
ddd$biennial <- ceiling(ddd$Assessment/2)*2
ddd
# Assessment biennial
#1 1997 1998
#2 1998 1998
#3 1999 2000
#4 2000 2000
#5 2001 2002
#6 2002 2002
#7 2003 2004
#8 2004 2004
#9 2005 2006
#10 2006 2006
To code biennial years and make sure that no future user of the dataset is mistaken about what this column actually represents, I'd rather use cut:
ddd <- data.frame(Assessment = 1997:2006)
ddd$biennial <- cut(ddd$Assessment, breaks = seq(1996, 2008, by=2), right = F)
ddd
# Assessment biennial
#1 1997 [1996,1998)
#2 1998 [1998,2000)
#3 1999 [1998,2000)
#4 2000 [2000,2002)
#5 2001 [2000,2002)
#6 2002 [2002,2004)
#7 2003 [2002,2004)
#8 2004 [2004,2006)
#9 2005 [2004,2006)
#10 2006 [2006,2008)
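If you like cut's explicit intervals but still want a numeric biennial year, you can label each interval by its upper endpoint. This is a sketch combining the two answers above; with the default right-closed intervals it reproduces the ceiling() grouping exactly.

```r
ddd <- data.frame(Assessment = 1997:2006)
breaks <- seq(1996, 2008, by = 2)

# (1996,1998] -> 1998, (1998,2000] -> 2000, ... same pairing as ceiling()
ddd$biennial <- as.numeric(as.character(
  cut(ddd$Assessment, breaks = breaks, labels = breaks[-1])))
```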

Create a local id for a combination of 2 columns [duplicate]

This question already has answers here:
R - add column that counts sequentially within groups but repeats for duplicates
(3 answers)
Closed 7 years ago.
I have a dataset I wish to process, and instead of processing it as a time series, I want to summarize the time behaviour. Here is the dataset:
business_id year
vcNAWiLM4dR7D2nwwJ7nCA 2007
vcNAWiLM4dR7D2nwwJ7nCA 2007
vcNAWiLM4dR7D2nwwJ7nCA 2009
UsFtqoBl7naz8AVUBZMjQQ 2004
UsFtqoBl7naz8AVUBZMjQQ 2005
cE27W9VPgO88Qxe4ol6y_g 2007
cE27W9VPgO88Qxe4ol6y_g 2007
cE27W9VPgO88Qxe4ol6y_g 2008
cE27W9VPgO88Qxe4ol6y_g 2010
I want to turn it into this:
business_id year yr_id
vcNAWiLM4dR7D2nwwJ7nCA 2007 1
vcNAWiLM4dR7D2nwwJ7nCA 2007 1
vcNAWiLM4dR7D2nwwJ7nCA 2009 2
UsFtqoBl7naz8AVUBZMjQQ 2004 1
UsFtqoBl7naz8AVUBZMjQQ 2005 2
cE27W9VPgO88Qxe4ol6y_g 2007 1
cE27W9VPgO88Qxe4ol6y_g 2007 1
cE27W9VPgO88Qxe4ol6y_g 2008 2
cE27W9VPgO88Qxe4ol6y_g 2010 3
In other words, I want the ID to be sequential to the year, but local to the business_id, so that it resets when the program finds another business_id.
Is this something that is easily achievable in R?
I found this other question on SO, and its answer effectively answers this question, so this should be marked as a duplicate.
https://stackoverflow.com/a/27896841/4858065
The way to achieve this is:
library(dplyr)
df %>%
  group_by(business_id) %>%
  mutate(yr_id = dense_rank(year))
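A base R equivalent can be sketched with ave: within each business_id group, match each year against the sorted unique years, which gives a dense rank that resets per group. The data frame below re-creates the question's sample.

```r
df <- data.frame(
  business_id = c(rep("vcNAWiLM4dR7D2nwwJ7nCA", 3),
                  rep("UsFtqoBl7naz8AVUBZMjQQ", 2),
                  rep("cE27W9VPgO88Qxe4ol6y_g", 4)),
  year = c(2007, 2007, 2009, 2004, 2005, 2007, 2007, 2008, 2010))

# match() against the sorted unique years: duplicates share an id,
# and the numbering restarts for each business_id
df$yr_id <- ave(df$year, df$business_id,
                FUN = function(y) match(y, sort(unique(y))))
```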

How to find unique field values from two columns in data frame

I have a data frame containing many columns, including Quarter and CustomerID. In this I want to identify the unique combinations of Quarter and CustomerID.
For eg:
masterdf <- read.csv(text = "
Quarter, CustomerID, ProductID
2009 Q1, 1234, 1
2009 Q1, 1234, 2
2009 Q2, 1324, 3
2009 Q3, 1234, 4
2009 Q3, 1234, 5
2009 Q3, 8764, 6
2009 Q4, 5432, 7")
What i want is:
FilterQuarter UniqueCustomerID
2009 Q1 1234
2009 Q2 1324
2009 Q3 8764
2009 Q3 1234
2009 Q4 5432
How can I do this in R? I tried the unique function but it does not work the way I want.
The long comments under the OP are getting hard to follow. You are looking for duplicated, as pointed out by @RomanLustrik. Use it to subset your original data.frame like this:
masterdf[ ! duplicated( masterdf[ c("Quarter" , "CustomerID") ] ) , ]
# Quarter CustomerID
#1 2009 Q1 1234
#3 2009 Q2 1324
#4 2009 Q3 1234
#6 2009 Q3 8764
#7 2009 Q4 5432
Another simple way is to use an SQL query from R; see the code below. This assumes masterdf is the name of the original data frame:
library(sqldf)
sqldf("select Quarter, CustomerID from masterdf group by 1,2")
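Since only two columns matter here, you can also call unique directly on that two-column subset, which reads a little more plainly than duplicated. A sketch, with the question's data rebuilt as a data frame:

```r
masterdf <- data.frame(
  Quarter = c("2009 Q1", "2009 Q1", "2009 Q2", "2009 Q3",
              "2009 Q3", "2009 Q3", "2009 Q4"),
  CustomerID = c(1234, 1234, 1324, 1234, 1234, 8764, 5432),
  ProductID = 1:7)

# unique() keeps the first occurrence of each Quarter/CustomerID pair
uniq <- unique(masterdf[c("Quarter", "CustomerID")])
```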

Creating lag variables for matched factors

I have a question about creating lag variables depending on a time factor.
Basically I am working with a baseball dataset containing many players' names between 2002-2012. I only want lag variables for the same person, so I can build a career arc to predict the current stat. For example, I want to use the lag 1 average (2003) and lag 2 average (2004) to predict the current average in 2005. So I tried to write a loop that goes through every row (the data frame is already sorted by name and then year, so the previous year is in row n-1), checks whether the name is the same, and if so grabs the value from the previous row.
Here is my loop:
for(i in 2:6264){ #start at 2, since row 1 has no previous row
if(TS$name[i]==TS$name[i-1]){
TS$runvalueL1[i] <- TS$Run_Value[i-1]
}else{
TS$runvalueL1[i] <- NA
}
}
Because each row is dependent on the name I cannot use most of the lag functions. If you have a better idea I am all ears!
Edit: my original sample data wasn't producing usable results, so here are the first rows of my dataset instead. Thanks!
TS[(6:10),c('name','Season','Run_Value')]
name Season ARuns
321 Abad Andy 2003 -1.05
3158 Abercrombie Reggie 2006 27.42
1312 Abercrombie Reggie 2007 7.65
1069 Abercrombie Reggie 2008 5.34
4614 Abernathy Brent 2002 46.71
707 Abernathy Brent 2003 -2.29
1297 Abernathy Brent 2005 5.59
6024 Abreu Bobby 2002 102.89
6087 Abreu Bobby 2003 113.23
6177 Abreu Bobby 2004 128.60
Thank you!
Something along these lines should do it:
names = c("Adams","Adams","Adams","Adams","Bobby","Bobby", "Charlie")
years = c(2002,2003,2004,2005,2004,2005,2010)
Run_value = c(10,15,15,20,10,5,5)
library(data.table)
dt = data.table(names, years, Run_value)
dt[, lag1 := shift(Run_value), by = names] # shift() lags Run_value by one within each name
# names years Run_value lag1
#1: Adams 2002 10 NA
#2: Adams 2003 15 10
#3: Adams 2004 15 15
#4: Adams 2005 20 15
#5: Bobby 2004 10 NA
#6: Bobby 2005 5 10
#7: Charlie 2010 5 NA
An alternative would be to split the data by name, use lapply with the lag function of your choice, and then combine the split data again:
TS$runvalueL1 <- do.call("rbind", lapply(split(TS, list(TS$name)), your_lag_function))
or
TS$runvalueL1 <- do.call("c", lapply(split(TS, list(TS$name)), your_lag_function))
There is probably also a nice solution with plyr, but as you did not provide a reproducible example, this is just a starting point.
Better:
TS$runvalueL1 <- unlist(lapply(split(TS, list(TS$name)), your_lag_function))
This is not a problem where you want to create a matrix with cbind, so a data frame is the better structure:
full=data.frame(names, years, Run_value)
The ave function is quite useful for constructing new columns within categories of other columns:
full$Lag1 <- ave(full$Run_value, full$names,
FUN= function(x) c(NA, x[-length(x)] ) )
full
names years Run_value Lag1
1 Adams 2002 10 NA
2 Adams 2003 15 10
3 Adams 2004 15 15
4 Adams 2005 20 15
5 Bobby 2004 10 NA
6 Bobby 2005 5 10
7 Charlie 2010 5 NA
I think it's safer to construct the lag with NA, since that will surface logic errors that padding a player's first year with 0 would silently hide.
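For completeness, here is a dplyr sketch of the same per-player lag, assuming the columns name, Season, and Run_Value from the question (the small data frame below is illustrative):

```r
library(dplyr)

TS <- data.frame(
  name = c("Adams", "Adams", "Adams", "Bobby", "Bobby", "Charlie"),
  Season = c(2002, 2003, 2004, 2004, 2005, 2010),
  Run_Value = c(10, 15, 20, 10, 5, 5))

TS <- TS %>%
  arrange(name, Season) %>%                # make sure seasons are ordered per player
  group_by(name) %>%
  mutate(runvalueL1 = lag(Run_Value)) %>%  # NA for each player's first season
  ungroup()
```

group_by makes lag() restart at every player, so no name-matching bookkeeping is needed.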

Long Format Function [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
faster way to create variable that aggregates a column by id
I am having trouble with a project. I created a dataframe (called dat) in long format (I copied the first 3 rows below) and I want to calculate, for example, the mean Pretax Income of all banks in the United States for the years 2000 to 2011. How would I do that? I have hardly any experience in R. I am sorry if the answer is too obvious, but I couldn't find anything and I already spent a lot of time on the project. Thank you in advance!
KeyItem Bank Country Year Value
1 Pretax Income WELLS_FARGO_&_COMPANY UNITED STATES 2011 2.365600e+10
2 Total Assets WELLS_FARGO_&_COMPANY UNITED STATES 2011 1.313867e+12
3 Total Liabilities WELLS_FARGO_&_COMPANY UNITED STATES 2011 1.172180e+12
The following should get you started. You basically need to do two things: subset, and aggregate. I'll demonstrate a base R solution and a data.table solution.
First, some sample data.
set.seed(1) # So you can reproduce my results
dat <- data.frame(KeyItem = rep(c("Pretax", "TotalAssets", "TotalLiabilities"),
times = 30),
Bank = rep(c("WellsFargo", "BankOfAmerica", "ICICI"),
each = 30),
Country = rep(c("UnitedStates", "India"), times = c(60, 30)),
Year = rep(c(2000:2009), each = 3, times = 3),
Value = runif(90, min=300, max=600))
Let's aggregate mean of the "Pretax" values by "Country" and "Year", but only for the years 2001 to 2005.
aggregate(Value ~ Country + Year,
dat[dat$KeyItem == "Pretax" & dat$Year >= 2001 & dat$Year <=2005, ],
mean)
# Country Year Value
# 1 India 2001 399.7184
# 2 UnitedStates 2001 464.1638
# 3 India 2002 443.5636
# 4 UnitedStates 2002 560.8373
# 5 India 2003 562.5964
# 6 UnitedStates 2003 370.9591
# 7 India 2004 404.0050
# 8 UnitedStates 2004 520.4933
# 9 India 2005 567.6595
# 10 UnitedStates 2005 493.0583
Here's the same thing in data.table
library(data.table)
DT <- data.table(dat, key = "Country,Bank,Year")
subset(DT, KeyItem == "Pretax")[Year %between% c(2001, 2005),
mean(Value), by = list(Country, Year)]
# Country Year V1
# 1: India 2001 399.7184
# 2: India 2002 443.5636
# 3: India 2003 562.5964
# 4: India 2004 404.0050
# 5: India 2005 567.6595
# 6: UnitedStates 2001 464.1638
# 7: UnitedStates 2002 560.8373
# 8: UnitedStates 2003 370.9591
# 9: UnitedStates 2004 520.4933
# 10: UnitedStates 2005 493.0583
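A dplyr version of the same subset-then-aggregate pattern, sketched with the sample data from above:

```r
library(dplyr)

set.seed(1)  # same sample data as above
dat <- data.frame(KeyItem = rep(c("Pretax", "TotalAssets", "TotalLiabilities"),
                                times = 30),
                  Bank = rep(c("WellsFargo", "BankOfAmerica", "ICICI"), each = 30),
                  Country = rep(c("UnitedStates", "India"), times = c(60, 30)),
                  Year = rep(2000:2009, each = 3, times = 3),
                  Value = runif(90, min = 300, max = 600))

# Subset to Pretax rows in 2001-2005, then average by Country and Year
out <- dat %>%
  filter(KeyItem == "Pretax", Year >= 2001, Year <= 2005) %>%
  group_by(Country, Year) %>%
  summarise(mean_value = mean(Value), .groups = "drop")
```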
