This is what my data.table looks like:
library(data.table)
dt <- fread('
Year Total Shares Balance
2017 10 1 10
2016 12 2 9
2015 10 2 7
2014 10 3 6
2013 10 NA 3')
**Balance** is my desired column. I am trying to compute a cumulative subtraction: start from the first value of Total, which is 10 (this is also the first value of Balance), and then cumulatively subtract the values in Shares, each applied one row later. So the second value is 10-1=9, the third value is 9-2=7, and so on. There is one condition: if the Year is 2014, the Shares value is divided by 2 before it is subtracted. So the fourth value is 7-(2/2)=6 and the fifth value is 6-3=3. The calculation should end at the last row.
My attempt is:
dt[, Balance:= ifelse( Year == 2014, cumsum(Total[1]-Shares/2), cumsum(Total[1] - Shares))]

Here is one method.
dt[, Balance2 := Total[1] - cumsum(shift(Shares * (1 - (0.5 *(Year == 2015))), fill=0))]
shift is used to create a lag variable, and its first element is filled with 0 using fill=0. The other elements are calculated as Shares * (1 - (0.5 *(Year == 2015))), which returns Shares except when Year == 2015, in which case Shares * 0.5 is returned. Because of the lag, the halved 2015 value is the one subtracted in the 2014 row.
This returns
Year Total Shares Balance Balance2
1: 2017 10 1 10 10
2: 2016 12 2 9 9
3: 2015 10 2 7 7
4: 2014 10 3 6 6
5: 2013 10 NA 3 3
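The same steps can be followed one vector at a time. This is only a sketch in base R, with c(0, head(x, -1)) standing in for shift(x, fill=0); the names adj and lagged are made up for illustration.

```r
# Base R sketch of the lagged cumulative subtraction (no data.table needed).
df <- data.frame(Year   = 2017:2013,
                 Total  = c(10, 12, 10, 10, 10),
                 Shares = c(1, 2, 2, 3, NA))

# Halve Shares in the 2015 row, leave the others unchanged: 1 2 1 3 NA
adj <- df$Shares * ifelse(df$Year == 2015, 0.5, 1)

# Lag by one row and fill the first element with 0: 0 1 2 1 3
# (the trailing NA drops out because the last Shares value is never used)
lagged <- c(0, head(adj, -1))

# Subtract the running total of the lagged values from the first Total
df$Balance2 <- df$Total[1] - cumsum(lagged)
```

Dropping the last element with head() is also why the NA in Shares never reaches cumsum.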

FWIW, I wanted to provide a functional alternative that allows for more flexible calculations in the cumulative differences, indexing, etc. I have also read in the data with read.table.
dt <- read.table(header=TRUE, text='
Year Total Shares Balance
2017 10 1 10
2016 12 2 9
2015 10 2 7
2014 10 3 6
2013 10 NA 3')
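makeNewBalance <- function(dt) {
  output <- NULL
  for (i in 1:nrow(dt)) {
    if (i==1) {
      # the first balance is the first Total
      output[i] <- dt$Total[i]
    } else {
      # subtract the previous row's Shares, halved when the current Year is 2014
      output[i] <- output[i-1] - as.integer(ifelse(dt$Year[i]==2014,
                                                   dt$Shares[i-1]/2,
                                                   dt$Shares[i-1]))
    }
  }
  output
}
dt$NewBalance <- makeNewBalance(dt)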
which also returns
> dt
Year Total Shares Balance NewBalance
1 2017 10 1 10 10
2 2016 12 2 9 9
3 2015 10 2 7 7
4 2014 10 3 6 6
5 2013 10 NA 3 3
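If an explicit loop feels heavy, the same running balance could probably be written with Reduce and accumulate = TRUE. This is a sketch of that swapped-in technique, not the function above; the vector names are hypothetical and, unlike the loop, it does not truncate with as.integer.

```r
# Sketch: running balance via Reduce(), using the values from the question.
year   <- 2017:2013
total  <- c(10, 12, 10, 10, 10)
shares <- c(1, 2, 2, 3, NA)

# Amount subtracted to reach rows 2..n: the previous row's Shares,
# halved when the current row's Year is 2014.
prev_shares <- head(shares, -1)
step <- ifelse(year[-1] == 2014, prev_shares / 2, prev_shares)

# accumulate = TRUE keeps every intermediate balance, starting from total[1]
balance <- Reduce(function(prev, s) prev - s, step,
                  init = total[1], accumulate = TRUE)
```

Changing the rule (a different year, a different divisor) then only means changing how step is built.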


find number of customers added each month

customer_id transaction_id month year
1 3 7 2014
1 4 7 2014
2 5 7 2014
2 6 8 2014
1 7 8 2014
3 8 9 2015
1 9 9 2015
4 10 9 2015
5 11 9 2015
2 12 9 2015
I am well familiar with R basics. Any help will be appreciated.
The expected output should look like the following:
month year number_unique_customers_added
7 2014 2
8 2014 0
9 2015 3
In month 7 of year 2014, only customer_ids 1 and 2 are present, so the number of customers added is two. In month 8 of year 2014, no new customer_ids appear, so zero customers are added in this period. Finally, in month 9 of year 2015, customer_ids 3, 4 and 5 are new, so the number of customers added in this period is 3.
Using data.table:
dt[, .SD[1,], by = customer_id][, uniqueN(customer_id), by = .(year, month)]
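Explanation: We first remove all subsequent transactions of each customer (we're interested in the first one, when she is a "new customer"), and then count unique customers by each combination of year and month. Note that a month in which no customer has a first transaction (month 8 of 2014 here) does not appear in this result at all, rather than appearing with a count of 0.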
Using dplyr we can first create a column which indicates if a customer is duplicate or not and then we group_by month and year to count the new customers in each group.
library(dplyr)
df %>%
  mutate(unique_customers = !duplicated(customer_id)) %>%
  group_by(month, year) %>%
  summarise(unique_customers = sum(unique_customers))
# month year unique_customers
# <int> <int> <int>
#1 7 2014 2
#2 8 2014 0
#3 9 2015 3
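The same "flag the first appearance, then sum per period" idea can be sketched in base R as well. The column name is_new is made up for illustration, and the sketch assumes the rows are already in chronological order, since duplicated() marks first occurrences by row order.

```r
# Data from the question
df <- data.frame(
  customer_id    = c(1, 1, 2, 2, 1, 3, 1, 4, 5, 2),
  transaction_id = 3:12,
  month          = c(7, 7, 7, 8, 8, 9, 9, 9, 9, 9),
  year           = c(rep(2014, 5), rep(2015, 5))
)

# TRUE only on the first row in which each customer appears
df$is_new <- !duplicated(df$customer_id)

# Summing the logical flag per month/year counts new customers;
# periods with no new customers are kept with a count of 0.
res <- aggregate(is_new ~ month + year, data = df, FUN = sum)
```

If the data were not sorted by time, it would need an order() on year, month and transaction first.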

Change numeric code of two variables set differently in two df in r

I use R; I hope my question will not be considered too "stupid", but I really can't understand the error I am making.
I have a national survey from 2002 to 2014, and each year respondents are asked the size (number of workers) of the company they work in.
A numeric code (1, 2, ...) is associated with each size class. From 2002 to 2006 I have six size classes, whereas from 2008 to 2014 I have seven:
2002-2006                   2008-2014
0-4 workers     -> 1        0-4 workers     -> 1
5-19 workers    -> 2        5-15 workers    -> 2
20-49 workers   -> 3        16-19 workers   -> 3
50-99 workers   -> 4        20-49 workers   -> 4
100-499 workers -> 5        50-99 workers   -> 5
>500 workers    -> 6        100-499 workers -> 6
                            >500 workers    -> 7
For example:
d.d <- data.frame(id=c(1,2,3,4,5,6), yr=c("2002", "2004", "2006", "2008", "2010", "2014"), dim=c(1,2,3,3,4,7))
id yr dim
1 2002 1
2 2004 2
3 2006 3
4 2008 3
5 2010 4
6 2014 7
The desired output is:
id yr dim
1 2002 1
2 2004 2
3 2006 3
4 2008 2
5 2010 3
6 2014 6
First, I changed code 3 (16-19 workers) in years 2008-14 to code 2, so that it falls in the same size class (5-19 workers) as code 2 in 2002-06:
# COMMAND 1
d.d$dim2 <- ifelse(d.d$dim=="3" & d.d$yr=="2008",2,
            ifelse(d.d$dim=="3" & d.d$yr=="2010",2,
            ifelse(d.d$dim=="3" & d.d$yr=="2012",2,
            ifelse(d.d$dim=="3" & d.d$yr=="2014",2,
                   d.d$dim))))
where dim is the company size class and yr is the year. This correctly changed class 3 to class 2 for the years 2008 to 2014.
Since the codes are still not associated with the same size class (in 2002-06 code 3 is 20-49 workers, while in 2008-14 the same class has code 4), I tried to align the codes as before:
# COMMAND 2
d.d$dim2 <- ifelse(d.d$dim=="4" & d.d$yr=="2008",3,
            ifelse(d.d$dim=="4" & d.d$yr=="2010",3,
            ifelse(d.d$dim=="4" & d.d$yr=="2012",3,
            ifelse(d.d$dim=="4" & d.d$yr=="2014",3,
                   d.d$dim))))
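I noticed that the second command also overwrites the change made by COMMAND 1. The first table below shows dim2 after COMMAND 1, the second shows dim2 after the second command: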
id yr dim dim2
1 1 2002 1 1
2 2 2004 2 2
3 3 2006 3 3
**4 4 2008 3 2**
5 5 2010 4 4
6 6 2014 7 7
id yr dim dim2
1 1 2002 1 1
2 2 2004 2 2
3 3 2006 3 3
**4 4 2008 3 3**
5 5 2010 4 3
6 6 2014 7 7
I can't understand the error.
Try this:
d.d$yr = as.numeric(d.d$yr)
d.d$dim = as.numeric(d.d$dim)
d.d$dim[ d.d$dim >= 3 & d.d$yr >= 2008 ] = d.d$dim[ d.d$dim >= 3 & d.d$yr >= 2008 ] - 1
First, change the year and dim information to numeric. This simplifies the condition for the subset you want to modify.
Then subtract 1 from dim for every row that satisfies the condition: dim is 3 or more and the year is 2008 or later.
If yr or dim are factors, convert them to numeric using as.numeric(as.character(...)) instead.
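Judging from the two printed tables in the question, the second ifelse chain seems to rebuild dim2 from dim again, which would explain why the earlier change to row 4 is lost. A minimal sketch of the subtraction approach that writes to a new column and leaves dim untouched is shown here; yr_num and late are made-up helper names.

```r
# Example data from the question (yr kept as character, not factor)
d.d <- data.frame(id  = 1:6,
                  yr  = c("2002", "2004", "2006", "2008", "2010", "2014"),
                  dim = c(1, 2, 3, 3, 4, 7),
                  stringsAsFactors = FALSE)

yr_num <- as.numeric(d.d$yr)

# Rows whose code has to shift down by one:
# surveys from 2008 onwards with code 3 or higher
late <- yr_num >= 2008 & d.d$dim >= 3

# TRUE/FALSE is coerced to 1/0, so only the flagged rows change
d.d$dim2 <- d.d$dim - late
```

Because every affected code moves in a single step, there is no second command that could undo the first.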

R: Insert and fill missing periods in panel data

I'm trying to learn R coming from Stata, but have run into the following two problems which I cannot seem to find elegant solutions for in R:
1) I have a panel dataset with gaps in my time variable. I would like to expand my time variable to include the gaps despite having no observed data for these rows.
In Stata I would usually go about this by setting my ID and time variables with xtset and then expanding the dataset based on this with tsfill. Is there an equivalently elegant way in R?
2) I would like to fill some of the new, blank cells with data for constant variables.
In Stata I would do this by copying data from previous (relative to my time variable) observations using the l.-prefix; for example using replace Con = l.Con.
In other words I'm asking how to go from something like this:
ID Time Num Con
1 Jan 10 A
1 Feb 15 A
1 May 20 A
2 Feb 12 B
2 Mar 14 B
2 Jun 15 B
To something like this:
ID Time Num Con
1 Jan 10 A
1 Feb 15 A
1 Mar A
1 Apr A
1 May 20 A
2 Feb 12 B
2 Mar 14 B
2 Apr B
2 May B
2 Jun 15 B
Hopefully that makes sense. Thanks in advance.
You can try merge from base R or the data.table join
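library(data.table)
# for each ID/Con, build the full run of months between the first and last observed month
DT2 <- setDT(df1)[, {tmp <- match(Time, month.abb)
                     list(Time = month.abb[min(tmp):max(tmp)])}, .(ID, Con)]
# join the original data onto the expanded grid; unmatched months get NA in Num
setkey(df1[, c(1,4,2,3), with=FALSE], ID, Con, Time)[DT2]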
# ID Con Time Num
# 1: 1 A Jan 10
# 2: 1 A Feb 15
# 3: 1 A Mar NA
# 4: 1 A Apr NA
# 5: 1 A May 20
# 6: 2 B Feb 12
# 7: 2 B Mar 14
# 8: 2 B Apr NA
# 9: 2 B May NA
#10: 2 B Jun 15
NOTE: It may be better to keep the missing values as NA rather than as blanks.
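For the base R merge route mentioned above, one possible sketch is to build the full grid of months per ID first and then left-join the observed data onto it. It assumes Time holds English month abbreviations as in month.abb; grid, m and out are names invented for this example.

```r
df1 <- data.frame(ID   = c(1, 1, 1, 2, 2, 2),
                  Time = c("Jan", "Feb", "May", "Feb", "Mar", "Jun"),
                  Num  = c(10, 15, 20, 12, 14, 15),
                  Con  = c("A", "A", "A", "B", "B", "B"),
                  stringsAsFactors = FALSE)

# Month position (1-12) of every observed row
df1$m <- match(df1$Time, month.abb)

# One block per ID covering every month from its first to its last observation;
# Con is constant within ID, so it is copied from the first row
grid <- do.call(rbind, lapply(split(df1, df1$ID), function(g) {
  data.frame(ID = g$ID[1], Con = g$Con[1], m = min(g$m):max(g$m),
             stringsAsFactors = FALSE)
}))
grid$Time <- month.abb[grid$m]

# Left join: months without data keep NA in Num
out <- merge(grid, df1[c("ID", "Time", "Num")], by = c("ID", "Time"), all.x = TRUE)

# merge() sorts Time alphabetically, so restore calendar order
out <- out[order(out$ID, out$m), c("ID", "Time", "Num", "Con")]
```

Carrying Con along in the grid plays the role of the Stata replace Con = l.Con step.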

How to calculate the exponential in some columns of a dataframe in R?

I have a dataframe:
X Year Dependent.variable.1 Forecast.Dependent.variable.1
1 2009 12.42669703 12.41831191
2 2010 12.39309563 12.40043599
3 2011 12.36596964 12.38256006
4 2012 12.32067284 12.36468414
5 2013 12.303095 12.34680822
6 2014 NA 12.32893229
7 2015 NA 12.31105637
8 2016 NA 12.29318044
9 2017 NA 12.27530452
10 2018 NA 12.25742859
I want to calculate the exponential of the third and fourth columns. How can I do that?
In case your dataframe is called dfs, you can do the following:
dfs[c('Dependent.variable.1','Forecast.Dependent.variable.1')] <- exp(dfs[c('Dependent.variable.1','Forecast.Dependent.variable.1')])
which gives you:
X Year Dependent.variable.1 Forecast.Dependent.variable.1
1 1 2009 249371 247288.7
2 2 2010 241131 242907.5
3 3 2011 234678 238603.9
4 4 2012 224285 234376.5
5 5 2013 220377 230224.0
6 6 2014 NA 226145.1
7 7 2015 NA 222138.5
8 8 2016 NA 218202.9
9 9 2017 NA 214336.9
10 10 2018 NA 210539.5
In case you know the column numbers, this could then also simply be done by using:
dfs[,3:4] <- exp(dfs[,3:4])
which gives you the same result as above. I usually prefer to use the actual column names as the indices might change when the data frame is further processed (e.g. I delete columns, then the indices change).
Or you could do:
dfs$Dependent.variable.1 <- exp(dfs$Dependent.variable.1)
dfs$Forecast.Dependent.variable.1 <- exp(dfs$Forecast.Dependent.variable.1)
In case you want to store these columns in new variables (below they are called exp1 and exp2, respectively), you can do:
exp1 <- exp(dfs$Forecast.Dependent.variable.1)
exp2 <- exp(dfs$Dependent.variable.1)
In case you want to apply it to more than two columns and/or use more complicated functions, I highly recommend looking at apply/lapply.
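As a rough sketch of the lapply route, using three rows taken from the table in the question (the vector name cols is made up):

```r
# Rows 2012-2014 from the question's data frame
dfs <- data.frame(
  Year = 2012:2014,
  Dependent.variable.1          = c(12.32067284, 12.303095, NA),
  Forecast.Dependent.variable.1 = c(12.36468414, 12.34680822, 12.32893229)
)

cols <- c("Dependent.variable.1", "Forecast.Dependent.variable.1")

# lapply() applies exp to each selected column and returns a list of columns,
# which is assigned back into the same positions; NA stays NA
dfs[cols] <- lapply(dfs[cols], exp)
```

Swapping exp for any other function of a single vector works the same way.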
Does that answer your question?

writing the outcome of a nested loop to a vector object in R

I have the following data read into R as a data frame named "data_old":
yes year month
1 15 2004 5
2 9 2005 6
3 15 2006 3
4 12 2004 5
5 14 2005 1
6 15 2006 7
. . ... .
. . ... .
I have written a small loop which goes through the data and sums up the yes variable for each month/year combination:
year_f <- c(2004:2006)
month_f <- c(1:12)
for (i in year_f){
  for (j in month_f){
    x <- subset(data_old, month == j & year == i, select="yes")
    if (nrow(x) > 0){
      print(sum(x))
    }
  }
}
My question is this: I can print the sum for each month/year combination in the terminal, but how do I store it in a vector? (The nested loop is giving me headaches trying to figure this out.)
Another way,
library(plyr)
ddply(data_old,.(year,month),function(x) sum(x[1]))
year month V1
1 2004 5 27
2 2005 1 14
3 2005 6 9
4 2006 3 15
5 2006 7 15
Forget the loops, you want to use an aggregation function. There's a recent discussion of them in this SO question.
with(data_old, tapply(yes, list(year, month), sum))
is one of many solutions.
Also, you don't need to use c() when you aren't concatenating anything. Plain 1:12 is fine.
Just to add a third option:
aggregate(yes ~ year + month, FUN=sum, data=data_old)
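If the loop is kept anyway, the sums can be collected in a named vector instead of being printed. This is only a sketch built on the six rows shown in the question; the vector name sums and the "year-month" naming scheme are choices made here for illustration.

```r
# The six rows visible in the question
data_old <- data.frame(yes   = c(15, 9, 15, 12, 14, 15),
                       year  = c(2004, 2005, 2006, 2004, 2005, 2006),
                       month = c(5, 6, 3, 5, 1, 7))

year_f  <- 2004:2006
month_f <- 1:12

sums <- numeric(0)  # grows by one named element per non-empty combination
for (i in year_f) {
  for (j in month_f) {
    x <- data_old$yes[data_old$year == i & data_old$month == j]
    if (length(x) > 0) {
      # name each entry "year-month" so the combination is not lost
      sums[paste(i, j, sep = "-")] <- sum(x)
    }
  }
}
```

The aggregation functions above are still the shorter route; the loop mainly helps when extra per-combination logic is needed.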
