Change numeric code of two variables set differently in two df in r - r

I use R; I hope my question will not be considered too "stupid", but I really can't understand the error I am making.
I have a national survey from 2002 to 2014, and each year it asks for the size of the company (number of workers) in which the interviewee works.
A numeric code (1, 2, ...) is associated with each size class. From 2002 to 2006 there are six size classes, whereas from 2008 to 2014 there are seven:
2002-2006                  2008-2014
0-4 workers     -> 1       0-4 workers     -> 1
5-19 workers    -> 2       5-15 workers    -> 2
20-49 workers   -> 3       16-19 workers   -> 3
50-99 workers   -> 4       20-49 workers   -> 4
100-499 workers -> 5       50-99 workers   -> 5
>500 workers    -> 6       100-499 workers -> 6
                           >500 workers    -> 7
First, I recoded class 3 (16-19 workers) in 2008-14 to code 2 (COMMAND 1 below), so that it falls in the same size class (5-19 workers) as code 2 in 2002-06. Example data:
d.d <- data.frame(id=c(1,2,3,4,5,6), yr=c("2002", "2004", "2006", "2008", "2010", "2014"), dim=c(1,2,3,3,4,7))
which looks like:
id yr dim
1 2002 1
2 2004 2
3 2006 3
4 2008 3
5 2010 4
6 2014 7
the desired output is:
id yr dim
1 2002 1
2 2004 2
3 2006 3
4 2008 2
5 2010 3
6 2014 6
COMMAND 1
d.d$dim2 <- ifelse(d.d$dim=="3" & d.d$yr=="2008",2,
ifelse(d.d$dim=="3" & d.d$yr=="2010",2,
ifelse(d.d$dim=="3" & d.d$yr=="2012",2,
ifelse(d.d$dim=="3" & d.d$yr=="2014",2,
d.d$dim))))
where dim is the company size class and yr is the year. This correctly changes class 3 to class 2 for 2008 to 2014.
Since the remaining codes are still not associated with the same size class (in 2002-06 code 3 means 20-49 workers, while in 2008-14 that class has code 4), I tried to align the codes as before:
COMMAND 2
d.d$dim2 <- ifelse(d.d$dim=="4" & d.d$yr=="2008",3,
ifelse(d.d$dim=="4" & d.d$yr=="2010",3,
ifelse(d.d$dim=="4" & d.d$yr=="2012",3,
ifelse(d.d$dim=="4" & d.d$yr=="2014",3,
d.d$dim))))
I noticed that the second command also changes the code that COMMAND 1 had changed:
RESULT WITH COMMAND 1
d.d
id yr dim dim2
1 1 2002 1 1
2 2 2004 2 2
3 3 2006 3 3
**4 4 2008 3 2**
5 5 2010 4 4
6 6 2014 7 7
RESULT AFTER APPLYING COMMAND 2 (AFTER COMMAND 1)
d.d
id yr dim dim2
1 1 2002 1 1
2 2 2004 2 2
3 3 2006 3 3
**4 4 2008 3 3**
5 5 2010 4 3
6 6 2014 7 7
I can't understand the error.

Try this:
d.d$yr = as.numeric(as.character(d.d$yr))
d.d$dim = as.numeric(as.character(d.d$dim))
idx = d.d$dim >= 3 & d.d$yr >= 2008
d.d$dim[idx] = d.d$dim[idx] - 1
First, convert the year and dim columns to numeric. This simplifies the condition for the subset you want to modify. Going through as.character() matters when a column is a factor: as.numeric() applied directly to a factor returns the internal level codes, not the values you see printed.
Then subtract 1 from dim in every row where dim is 3 or more and the year is 2008 or later.
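As for the error itself: COMMAND 1 and COMMAND 2 both read from d.d$dim and both assign to d.d$dim2, and COMMAND 2 falls back to d.d$dim in its last branch. The second assignment therefore replaces everything the first one did, which is why row 4 goes back to 3. A minimal sketch that avoids this by recoding in a single pass from the original column (it assumes yr and dim are already numeric, and rebuilds the example data for illustration):

```r
# Example data from the question, with yr stored as numbers (an assumption)
d.d <- data.frame(id  = 1:6,
                  yr  = c(2002, 2004, 2006, 2008, 2010, 2014),
                  dim = c(1, 2, 3, 3, 4, 7))

# One pass: from 2008 on, every code of 3 or more moves down by one.
# Reading from dim and writing to dim2 keeps the original column intact.
d.d$dim2 <- ifelse(d.d$yr >= 2008 & d.d$dim >= 3, d.d$dim - 1, d.d$dim)

print(d.d)
```

If you prefer to keep two separate commands, the second one would need to start from d.d$dim2 (the result of the first) rather than from d.d$dim.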

Related

How to select consecutive measurement cycles

I am working with a dataset that contains variables measured from permanent plots. These plots are remeasured every couple of years. The data look roughly like the table below. I used the following code to slice out the initial measurement at t1. Now I want to slice t2, which is the remeasurement one step after the minimum Cycle or minimum Measured_year. This is particularly a problem for plots that have more than two remeasurements (num_obs > 2), where the Measured_year intervals and Cycle intervals differ.
I would really appreciate the help. I have been stuck on this for quite some time now.
df_Time1 <- df %>% group_by(State, County, Plot) %>% slice(which.min(Cycle))
State County Plot Measured_year basal_area tph Cycle num_obs
1 1 1 2006 10 10 8 2
2 1 2 2002 20 20 7 3
1 1 1 2009 30 30 9 2
2 1 1 2005 40 40 6 3
2 1 1 2010 50 50 8 3
2 1 2 2013 60 60 10 2
2 1 2 2021 70 70 12 3
2 1 1 2019 80 80 13 3
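Create a t variable for yourself based on the rank of Cycle within each plot. Start from the full df rather than df_Time1, which already holds only one row per plot, and use rank() rather than order(): order() returns the positions that would sort the vector, which is not the same as each row's rank.
df %>%
group_by(State, County, Plot) %>%
mutate(t = rank(Cycle, ties.method = "first"))
You can then filter on t == 1 or t == 2, etc.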
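A base R sketch of the same idea, using the table from the question, in case it helps to see it run without any packages. It uses ave() to rank Cycle within each State/County/Plot group; the column name t and the object name df_Time2 are only illustrative.

```r
# The table from the question
df <- data.frame(
  State         = c(1, 2, 1, 2, 2, 2, 2, 2),
  County        = c(1, 1, 1, 1, 1, 1, 1, 1),
  Plot          = c(1, 2, 1, 1, 1, 2, 2, 1),
  Measured_year = c(2006, 2002, 2009, 2005, 2010, 2013, 2021, 2019),
  Cycle         = c(8, 7, 9, 6, 8, 10, 12, 13)
)

# Rank of Cycle within each plot: 1 = first measurement, 2 = second, ...
df$t <- ave(df$Cycle, df$State, df$County, df$Plot, FUN = rank)

# Second measurement of every plot
df_Time2 <- df[df$t == 2, ]
print(df_Time2)
```

Because the rank is computed per plot, it does not matter that the Cycle numbers and the gaps between them differ from plot to plot.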

How to add a column by matching with previous year?

I have a data frame as following:
df1
ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007
I want to add a third column (let's call it infoyear) in my df that shows the year before for each data element that I have in my second column.
In other words I want to get the following result:
df1
ID closingdate infoyear
1 31/12/2005 2004
2 01/12/2009 2008
3 02/01/2002 2001
4 09/10/2000 1999
5 15/11/2007 2006
Normally, to add a column presenting only years I would use:
library(data.table)
setDT(df1)[, infoyear := year(as.IDate(closingdate, '%d/%m/%Y'))]
In my case it would produce me the following:
df1
ID closingdate infoyear
1 31/12/2005 2005
2 01/12/2009 2009
3 02/01/2002 2002
4 09/10/2000 2000
5 15/11/2007 2007
But instead of this result for column infoyear, I would like 1 year before closingdate (as the result presented before).
How can I solve a problem like this in R? Thank you!
You can try
df1 <- read.table(header=TRUE, text="ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007")
setDT(df1)
df1[, closingdate:= as.IDate(closingdate,"%d/%m/%Y")]
df1[, infoyear:= year(closingdate)-1]
df1
#Output
ID closingdate infoyear
1: 1 2005-12-31 2004
2: 2 2009-12-01 2008
3: 3 2002-01-02 2001
4: 4 2000-10-09 1999
5: 5 2007-11-15 2006
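For comparison, a base R sketch of the same calculation without data.table: parse the date, pull out the year with format(), and subtract one. The data frame is rebuilt here only so that the example is self-contained.

```r
df1 <- data.frame(
  ID          = 1:5,
  closingdate = c("31/12/2005", "01/12/2009", "02/01/2002",
                  "09/10/2000", "15/11/2007")
)

# Parse day/month/year strings into Date objects
d <- as.Date(df1$closingdate, format = "%d/%m/%Y")

# Year of the closing date, minus one
df1$infoyear <- as.integer(format(d, "%Y")) - 1L

print(df1)
```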

For each group, for each week, find the sum of the observations in the previous X weeks in R

For each group (individual_id), for each week_id, I want to calculate the number of appearances the individual has made in the previous X weeks in each city.
I have experimented with dplyr to no avail. I have tried a loop, but it takes forever on the dataset I am using (around 250,000 observations of more than 1,000 individuals in 20 cities), especially as I want to look up the number of appearances in the previous two years (i.e. X = 104 weeks).
theDates = as.Date(c('07/05/2017','07/05/2017', '07/05/2017', '14/05/2017', '14/05/2017',
'21/05/2017','21/05/2017','21/05/2017', '28/05/2017', '04/06/2017', '04/06/2017', '04/06/2017', '11/06/2017',
'18/06/2017', '18/06/2017'), format='%d/%m/%Y')
someData = data.frame(individual_id = c(1,2,3,2,3,1,2,3,3,1,2,3,3,2,3), week_end_date=theDates,
city=c('Chicago','Chicago','Chicago','Washington', 'Washington', 'Chicago','Chicago', 'Chicago','Washington',
'Washington', 'Washington','Washington','Chicago','Washington', 'Washington'))
someData$nChicagoAppearancesInLastXweeks = NA
someData$nWashingtonAppearancesInLastXweeks = NA
X = 4 # this is the number of weeks for the window length
someData$start_of_period_date = someData$week_end_date - 7*X # this is the start of the range of dates to count appearances over
for (i in 1:dim(someData)[1]) {
WEEK_IDS = seq(someData$start_of_period_date[i], someData$week_end_date[i]-1, by='days')
INDIVIDUAL_ID = someData$individual_id[i]
someData$nChicagoAppearancesInLastXweeks[i] = sum(ifelse(someData$city=='Chicago' & someData$individual_id == INDIVIDUAL_ID & someData$week_end_date %in% WEEK_IDS,1,0))
someData$nWashingtonAppearancesInLastXweeks[i] = with(someData, sum(ifelse(city=='Washington' & individual_id == INDIVIDUAL_ID & week_end_date %in% c(WEEK_IDS),1,0)))
}
The expected output would be two new columns giving the number of times each individual_id appeared in each city in the previous X weeks. The loop code does it, but is clearly not the best way to do this.
Perform a left join for each added column:
library(sqldf)
X <- 4
sql <- "select sum(not b.city is null)
from someData a
left join someData b on
b.city == '$lev' and
a.[individual_id] = b.[individual_id] and
b.[week_end_date] between a.[week_end_date] - 7 * $X and a.[week_end_date] - 1
group by a.rowid"
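# factor() so this also works when city is a character column (the data.frame default since R 4.0)
for(lev in levels(factor(someData$city))) someData[lev] <- fn$sqldf(sql)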
giving:
> someData
individual_id week_end_date city Chicago Washington
1 1 2017-05-07 Chicago 0 0
2 2 2017-05-07 Chicago 0 0
3 3 2017-05-07 Chicago 0 0
4 2 2017-05-14 Washington 1 0
5 3 2017-05-14 Washington 1 0
6 1 2017-05-21 Chicago 1 0
7 2 2017-05-21 Chicago 1 1
8 3 2017-05-21 Chicago 1 1
9 3 2017-05-28 Washington 2 1
10 1 2017-06-04 Washington 2 0
11 2 2017-06-04 Washington 2 1
12 3 2017-06-04 Washington 2 2
13 3 2017-06-11 Chicago 1 3
14 2 2017-06-18 Washington 1 1
15 3 2017-06-18 Washington 2 2
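If you would rather stay in base R, here is a rough sketch of the same window count. It only compares dates within each individual, so the work grows with the number of appearances per person rather than with the square of the whole table. The helper name count_prev is made up for this example, and I have only checked it against the small sample above.

```r
theDates <- as.Date(c('07/05/2017','07/05/2017','07/05/2017','14/05/2017','14/05/2017',
                      '21/05/2017','21/05/2017','21/05/2017','28/05/2017','04/06/2017',
                      '04/06/2017','04/06/2017','11/06/2017','18/06/2017','18/06/2017'),
                    format = '%d/%m/%Y')
someData <- data.frame(
  individual_id = c(1,2,3,2,3,1,2,3,3,1,2,3,3,2,3),
  week_end_date = theDates,
  city = c('Chicago','Chicago','Chicago','Washington','Washington','Chicago','Chicago',
           'Chicago','Washington','Washington','Washington','Washington','Chicago',
           'Washington','Washington')
)

# Hypothetical helper: appearances in city `lev` during the X weeks before each row
count_prev <- function(dat, lev, X) {
  out <- integer(nrow(dat))
  for (rows in split(seq_len(nrow(dat)), dat$individual_id)) {
    d <- as.numeric(dat$week_end_date[rows])
    n <- length(rows)
    # gap[i, j] = days from appearance j to appearance i of the same individual
    gap <- outer(d, d, "-")
    # in_city[i, j] is TRUE when appearance j took place in `lev`
    in_city <- matrix(dat$city[rows] == lev, n, n, byrow = TRUE)
    out[rows] <- as.integer(rowSums(gap >= 1 & gap <= 7 * X & in_city))
  }
  out
}

X <- 4
someData$Chicago    <- count_prev(someData, "Chicago", X)
someData$Washington <- count_prev(someData, "Washington", X)
print(someData)
```

The window is the same as in the SQL above: from 7 * X days before the row's week_end_date up to the day before it.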

Conditional cumulative subtraction

This is what my data.table looks like:
library(data.table)
dt <- fread('
Year Total Shares Balance
2017 10 1 10
2016 12 2 9
2015 10 2 7
2014 10 3 6
2013 10 NA 3
')
**Balance** is my desired column. I am trying to compute cumulative subtractions: take the first value of Total, which is 10 (this is also the first value of Balance), and then cumulatively subtract the values in Shares. So the second value is 10-1=9, the third value is 9-2=7, and so on. There is one condition: if the Year is 2014, the Shares value is divided by 2 before it is subtracted, so the fourth value is 7-(2/2)=6 and the fifth value is 6-3=3. I want the calculation to end at the last row.
My attempt is:
dt[, Balance:= ifelse( Year == 2014, cumsum(Total[1]-Shares/2), cumsum(Total[1] - Shares))]
Here is one method.
dt[, Balance2 := Total[1] - cumsum(shift(Shares * (1 - (0.5 *(Year == 2015))), fill=0))]
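shift is used to create a lagged variable, and its first element is filled with 0 using fill=0. The other elements are calculated as Shares * (1 - (0.5 * (Year == 2015))), which returns Shares except when Year == 2015, in which case Shares * 0.5 is returned. The test is on 2015 rather than 2014 because, after the lag, the Shares value from the 2015 row is the one subtracted in the 2014 row.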
which returns
dt
Year Total Shares Balance Balance2
1: 2017 10 1 10 10
2: 2016 12 2 9 9
3: 2015 10 2 7 7
4: 2014 10 3 6 6
5: 2013 10 NA 3 3
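To make the lag-then-cumsum logic easier to follow, here is the same calculation written step by step in base R. This is only a sketch on the sample data; the intermediate names adj and lagged are made up for illustration.

```r
dt <- data.frame(Year   = c(2017, 2016, 2015, 2014, 2013),
                 Total  = c(10, 12, 10, 10, 10),
                 Shares = c(1, 2, 2, 3, NA))

# Halve the Shares of the 2015 row (the one subtracted in the 2014 row)
adj <- dt$Shares * ifelse(dt$Year == 2015, 0.5, 1)

# Lag by one row: nothing is subtracted in the first row,
# and the trailing NA in Shares drops off the end
lagged <- c(0, head(adj, -1))

# Running balance: first Total minus everything subtracted so far
dt$Balance2 <- dt$Total[1] - cumsum(lagged)

print(dt)
```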
FWIW, I wanted to provide a functional alternative that would allow for more flexible calculations in the cumulative differences, indexing, etc. I also have read in the data with read.table.
dt <- read.table(header=TRUE, text='
Year Total Shares Balance
2017 10 1 10
2016 12 2 9
2015 10 2 7
2014 10 3 6
2013 10 NA 3
')
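makeNewBalance <- function(dt) {
  # preallocate rather than growing a NULL vector
  output <- numeric(nrow(dt))
  for (i in seq_len(nrow(dt))) {
    if (i == 1) {
      # the first balance is the first Total
      output[i] <- dt$Total[i]
    } else {
      # subtract the previous row's Shares, halved when the current Year is 2014
      # (no as.integer() here: it would truncate the half of an odd Shares value)
      output[i] <- output[i-1] - ifelse(dt$Year[i] == 2014,
                                        dt$Shares[i-1] / 2,
                                        dt$Shares[i-1])
    }
  }
  return(output)
}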
dt$NewBalance <- makeNewBalance(dt)
which also returns
> dt
Year Total Shares Balance NewBalance
1 2017 10 1 10 10
2 2016 12 2 9 9
3 2015 10 2 7 7
4 2014 10 3 6 6
5 2013 10 NA 3 3

How can I pass NA when attempt to select fails with "attempted to select less than one element"?

I have two data frames, a and b, both have a "state" and "year" columns (as well as others). I'm trying to transfer the value for VARX from b to a. I used this:
for(i in seq_along(a)) {a$VARX[[i]]<-b$VARX[[which(b$state==a$state[[i]] & b$year==a$year[[i]])]]}
I get this error:
Error in b$VARX[[which(b$state == a$state[[i]] & b$year == :
attempt to select less than one element
The problem seems to be that there are some rows in a without a corresponding row in b, so it can't select an element. How can I either return NA in those cases (so a$VARX[i]=NA), or clean a from all the cases where there's no corresponding rows in b?
A loop is not a good choice here. See this working example, which uses merge() instead:
a=data.frame(state=c("a","b","a"),year=c("2012","2013","2014"),VARX=c(1,2,3))
###################
state year VARX
1 a 2012 1
2 b 2013 2
3 a 2014 3
###################
b=data.frame(state=c("a","b","c"),year=c("2012","2013","2014"),VARX=c(4,5,6))
###################
state year VARX
1 a 2012 4
2 b 2013 5
3 c 2014 6
###################
# merge a and b, keeping every row of a (all.x=TRUE); rows without a match in b get NA
merged=merge(a,b,by=c("state","year"),all.x=TRUE,sort=FALSE,suffixes=c(".a",""))
###################
state year VARX.a VARX
1 a 2012 1 4
2 b 2013 2 5
3 a 2014 3 NA
###################
# drop the original VARX that came from data frame a
subset(merged,select=-VARX.a)
###################
state year VARX
1 a 2012 4
2 b 2013 5
3 a 2014 NA
###################
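The question also asks how to drop the rows of a that have no partner in b. One possible sketch uses match() on a combined state/year key, which gives NA where there is no match and leaves the row order of a untouched. The names key_a, key_b and a_clean are illustrative, and the "|" separator assumes that character does not occur in state or year.

```r
a <- data.frame(state = c("a", "b", "a"), year = c("2012", "2013", "2014"))
b <- data.frame(state = c("a", "b", "c"), year = c("2012", "2013", "2014"),
                VARX  = c(4, 5, 6))

# Build one lookup key per row from state and year
key_a <- paste(a$state, a$year, sep = "|")
key_b <- paste(b$state, b$year, sep = "|")

# match() returns NA when a key from a is absent in b,
# and indexing with NA yields NA, so unmatched rows get NA
a$VARX <- b$VARX[match(key_a, key_b)]

# Alternatively, keep only the rows of a that found a partner in b
a_clean <- a[!is.na(a$VARX), ]

print(a)
print(a_clean)
```

With merge(), leaving out all.x=TRUE (an inner join) would drop the unmatched rows in a similar way.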
