Change column values based on factors of other columns - r

For example, if I have a data frame like this:
df <- data.frame(profit=c(10,10,10), year=c(2010,2011,2012))
profit year
10 2010
10 2011
10 2012
I want to change the value of profit according to the year: multiply the profit by 3 for 2010, by 4 for 2011, and by 5 for 2012, which should give this:
profit year
30 2010
40 2011
50 2012
How should I approach this? I tried:
inflationtransform <- function(k,v) {
switch(k,
2010,v<-v*3,
2011,v<-v*4,
2012,v<-v*5,
)
}
df$profit <- sapply(df$year,df$profit,inflationtransform)
But it doesn't work. Can someone tell me what to do?

For this particular example, since your factors and years are both ordered and incremented by 1, you could just subtract 2007 from the year column and multiply it by profit.
transform(df, profit = profit * (year - 2007))
# profit year
# 1 30 2010
# 2 40 2011
# 3 50 2012
Otherwise, you could use a lookup vector. This will cover all cases.
lookup <- c("2010" = 3, "2011" = 4, "2012" = 5)
transform(df, profit = profit * lookup[as.character(year)])
# profit year
# 1 30 2010
# 2 40 2011
# 3 50 2012
I wouldn't use switch() unless you really need to. It's not vectorized, and that's where R is most efficient. However, since you ask for it in the comments, here's one way. I find it easier to use a for() loop with switch().
for(i in seq_len(nrow(df))) {
df$profit[i] <- with(df, switch(as.character(year[i]),
"2010" = 3 * profit[i],
"2011" = 4 * profit[i],
"2012" = 5 * profit[i]
))
}
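A hedged variation on the lookup idea: if the data might contain years that have no multiplier, match() against a small lookup table makes the missing case explicit. The rates table, the extra 2013 row, and the choice to leave unmatched years unchanged are assumptions for illustration, not part of the question.

```r
# Sample data, with one extra year (2013) that has no multiplier (assumption)
df <- data.frame(profit = c(10, 10, 10, 10),
                 year   = c(2010, 2011, 2012, 2013))

# Hypothetical lookup table: one multiplier per known year
rates <- data.frame(year = c(2010, 2011, 2012),
                    multiplier = c(3, 4, 5))

# match() returns the row of `rates` for each year, or NA if there is none
idx  <- match(df$year, rates$year)
mult <- rates$multiplier[idx]

# Assumption: years without a multiplier keep their original profit
mult[is.na(mult)] <- 1

df$profit <- df$profit * mult
```

Like the named lookup vector, this stays vectorized; it only adds an explicit decision about what to do with unknown years.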

Related

Grouping data by specific observations in R

I want to create a new variable that's derived from specific values in my existing variables. My data frame looks something like the following:
year <- c("2010", "2011", "2012", "2013", "2014", "2015")
x <- c(2980, 2955, 3110, 2962, 2566, 3788)
y <- c(2453, 2919, 2930, 2864, 2873, 3031)
df <- data.frame(year, x, y)
More specifically, I want to create a third column, z, that is the ratio of x and y. However, I don't want to create this ratio by simply dividing x by y for each individual year. Instead, I want the values in 2015 (and 2014 etc.) to be an average of this ratio in the three preceding years, i.e. 2014, 2013, and 2012.
I've looked at Wickham's dplyr package and, in particular, the group_by function, but I'm stumped because I don't want to group my data by year per se but by each year's three preceding years, as illustrated (hopefully) above.
With dplyr and library(zoo):
df_fin<- df %>% mutate( z = rollmeanr(x/y,3,na.pad=TRUE))
I think the column z is what you want but it would be good to have the desired output.
The answers that use zoo::rollmean are all on the correct track, but they have a couple of "off by one" errors in them. First, you actually want zoo::rollmeanr( ..., na.pad=TRUE ) which will correctly pad the output with NA on the left side:
> zoo::rollmeanr( df$x / df$y, 3, na.pad=TRUE )
[1] NA NA 1.0962018 1.0359948 0.9962648 1.0590378
The second "off by one" error arises from the alignment of this vector with the rest of your data. From your description, you want the value for 2015 to be the average of 2014, 2013, and 2012. However, appending the vector above to your table will make the value for 2015 the average of 2015, 2014, and 2013 instead. To correct this, omit the last value in your input to the rolling average and prepend an NA to compensate:
> c( NA, zoo::rollmeanr( head(df$x / df$y,-1), 3, na.pad=TRUE ) )
[1] NA NA NA 1.0962018 1.0359948 0.9962648
Putting it all together using dplyr notation:
df %>% mutate( z = c( NA, zoo::rollmeanr( head(x/y,-1), 3, na.pad=TRUE ) ) )
year x y z
1 2010 2980 2453 NA
2 2011 2955 2919 NA
3 2012 3110 2930 NA
4 2013 2962 2864 1.0962018
5 2014 2566 2873 1.0359948
6 2015 3788 3031 0.9962648
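For readers who would rather not depend on zoo, the same "mean of the three preceding ratios" can be sketched in base R with stats::filter(). This is only an illustrative alternative to the rollmeanr() call above, not a replacement recommended by the original answers.

```r
year <- c("2010", "2011", "2012", "2013", "2014", "2015")
x <- c(2980, 2955, 3110, 2962, 2566, 3788)
y <- c(2453, 2919, 2930, 2864, 2873, 3031)
df <- data.frame(year, x, y)

ratio <- df$x / df$y

# One-sided moving average: element i is the mean of ratio[i], ratio[i-1], ratio[i-2]
trailing <- as.numeric(stats::filter(ratio, rep(1 / 3, 3), sides = 1))

# Shift by one so that each year only sees the three *preceding* years
df$z <- c(NA, head(trailing, -1))
```

The first three values of z are NA because fewer than three preceding years exist for 2010 through 2012.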
df$z<-0
for (i in 4:6){
df$z[i]<-mean(df$x[(i-3):(i-1)])/mean(df$y[(i-3):(i-1)])
}
With a loop, you can get this:
year x y z
1 2010 2980 2453 0.000000
2 2011 2955 2919 0.000000
3 2012 3110 2930 0.000000
4 2013 2962 2864 1.089497
5 2014 2566 2873 1.036038
6 2015 3788 3031 0.996654
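Note that the loop output differs slightly from the rollmeanr() output because the two compute different quantities: the loop divides the mean of x by the mean of y, while rollmeanr() averages the yearly ratios. A small sketch for the 2013 row, using the question's data:

```r
x <- c(2980, 2955, 3110, 2962, 2566, 3788)
y <- c(2453, 2919, 2930, 2864, 2873, 3031)

# Rows 1 to 3 are the three years preceding 2013
prev <- 1:3

# Mean of the yearly ratios (what the rollmeanr() approach computes)
mean_of_ratios <- mean(x[prev] / y[prev])   # about 1.0962

# Ratio of the means (what the for loop computes)
ratio_of_means <- mean(x[prev]) / mean(y[prev])   # about 1.0895
```

Which of the two is appropriate depends on whether each year should count equally or be weighted by the size of y.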
library(zoo)
library(dplyr)
df %>% mutate(z = x/y, zz = rollmean(z, 3, fill = NA))

How to calculate customer acquisition rate by finding out overlapping with previous years?

I have a data set, CustOrder, about customer purchases from 2008-2013 with the following information (this is just part of the data):
CustID OrderYear Amount
101102 2008 22429.00
101102 2009 11045.00
101435 2010 10740.77
101435 2011 73669.50
107236 2012 162123.50
101416 2010 8102.00
101416 2011 360.00
101416 2012 36576.00
101416 2013 1960.00
101467 2012 997.00
101604 2010 2971.53
101664 2009 91.94
101664 2011 130.93
.........
Some customers may purchase continuously every year (e.g. 101416), or just in certain years (e.g. 101664). I want to figure out the customer acquisition rate, that is, how many new customers were gained each year, both as a rate and as a number (for customers who did not purchase continuously, only consider the first purchase). For instance,
Year Customer TotalCustomerNumber NewCustomerRate
2008 5 5 0%
2009 3 8 37%
2010 4 12 33%
2011 2 14 14%
2012 3 17 17%
2013 2 19 10%
Does anyone have any ideas or hints on how to do this?
I appreciate any help!
I took some time to work out a solution and this method should work. Take a look at the comments for details:
# Setting a seed for reproducibility.
set.seed(10)
# Setting what years we want allowed.
validYears <- 2008:2015
# Generating a "fake" dataset for testing purposes.
custDF <- data.frame(CustID = abs(as.integer(rnorm(250, 50, 50))), OrderYear = 0, Amount = abs(rnorm(250, 100, 1000)))
custDF$OrderYear <- sapply(custDF$OrderYear, function(x) x <- sample(validYears, 1)) # Adding random years for each purchase.
# Initializing a new data frame to store the output values.
newDF <- data.frame(Year = validYears, NewCustomers = 0, RunningNewCustomerTotal = 0, NewCustomerRate = "", stringsAsFactors = FALSE) # Keep the rate column as character, not factor.
custTotal <- 0 # Initializing a variable to be used in the loop.
firstIt <- 1 # Denotes the first iteration.
for (year in validYears) { # For each year in the data set (which I arbitrarily defined before making the dataset)
# Getting the unique IDs of the current year and the unique IDs of all past years.
currentIDs <- unique(custDF[custDF$OrderYear == year, "CustID"])
pastIDs <- unique(custDF[custDF$OrderYear < year, "CustID"])
if (firstIt == 1) { pastIDs <- c(-1) } # Setting a condition for the first iteration.
newIDs <- currentIDs[!(currentIDs %in% pastIDs)] # Getting all IDs that have not been previously used.
numNewIDs <- length(newIDs) # Getting the number of new IDs.
custTotal <- custTotal + numNewIDs # Getting the running total.
# Adding the new data into the data frame.
newDF[newDF$Year == year, "NewCustomers"] <- numNewIDs
newDF[newDF$Year == year, "RunningNewCustomerTotal"] <- custTotal
# Getting the rate.
if (firstIt == 1) {
NewCustRate <- 0
firstIt <- 2
} else { NewCustRate <- (1 - (newDF[newDF$Year == (year - 1), "RunningNewCustomerTotal"] / custTotal)) * 100 }
# Inputting the new data. Format and round are just getting the decimals down.
newDF[newDF$Year == year, "NewCustomerRate"] <- paste0(format(round(NewCustRate, 2)), "%")
}
With output:
> newDF
Year NewCustomers RunningNewCustomerTotal NewCustomerRate
1 2008 32 32 0%
2 2009 22 54 41%
3 2010 19 73 26%
4 2011 14 87 16%
5 2012 7 94 7.4%
6 2013 3 97 3.1%
7 2014 9 106 8.5%
8 2015 5 111 4.5%
Hope this helps!
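A shorter, vectorized sketch of the same idea, assuming the rate is defined as new customers divided by the running total (which matches the table in the question). It uses only the rows shown in the question, so the numbers differ from the simulated output above; the variable names are illustrative.

```r
# The rows shown in the question (Amount omitted, it is not needed here)
cust <- data.frame(
  CustID    = c(101102, 101102, 101435, 101435, 107236, 101416, 101416,
                101416, 101416, 101467, 101604, 101664, 101664),
  OrderYear = c(2008, 2009, 2010, 2011, 2012, 2010, 2011,
                2012, 2013, 2012, 2010, 2009, 2011)
)

# First purchase year of each customer
first_year <- tapply(cust$OrderYear, cust$CustID, min)

# Count first purchases per year, keeping years with no new customers
years <- min(cust$OrderYear):max(cust$OrderYear)
new_customers <- as.vector(table(factor(first_year, levels = years)))

total <- cumsum(new_customers)
rate  <- round(100 * new_customers / total)
rate[1] <- 0  # convention from the question: the first year is reported as 0%

result <- data.frame(Year = years,
                     NewCustomers = new_customers,
                     TotalCustomers = total,
                     NewCustomerRate = paste0(rate, "%"))
```

Because only the first purchase year of each customer is counted, customers who skip years (such as 101664) are not counted twice.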

Adding data points in a column by factors in R

The data.frame my_data consists of two columns ("PM2.5" & "years") and around 6400000 rows. The data.frame has various data points for pollutant levels of "PM2.5" for the years 1999, 2002, 2005 & 2008.
This is what I have done to the data.frame:
{
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
}
I want to find the sum of all PM2.5 levels (i.e. the sum of all data points under PM2.5) for each year. How can I do it?
The image shows the first 20 rows of the data.frame.
Since the column "years" is arranged, it is showing only 1999.
Say this is your data:
library(plyr) # <- don't forget to tell us what libraries you are using
# <- give us an easy sample set
my_data <- data.frame(year=sample(c("1999","2002","2005","2008"), 10, replace=T), PM2.5 = rnorm(10,mean = 5))
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
> my_data
year PM2.5
1 1999 5.556852
2 2002 5.508820
3 2002 4.836500
4 2002 3.766266
5 2005 6.688936
6 2005 5.025600
7 2005 4.041670
8 2005 4.614784
9 2005 4.352046
10 2008 6.378134
One way to do it (out of many, many ways already shown by a simple google search):
> with(my_data, (aggregate(PM2.5, by=list(year), FUN="sum")))
Group.1 x
1 1999 5.556852
2 2002 14.111586
3 2005 24.723037
4 2008 6.378134
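Two other base R spellings of the same grouped sum, shown on a tiny made-up data set (the values are assumptions chosen so the sums are easy to check by hand):

```r
# Small illustrative data set, not the asker's data
my_data <- data.frame(year  = factor(c("1999", "2002", "2002", "2005")),
                      PM2.5 = c(1.5, 2, 3, 4))

# Named vector of sums, one element per year
totals <- tapply(my_data$PM2.5, my_data$year, sum)

# Same result as a data frame, using the formula interface of aggregate()
agg <- aggregate(PM2.5 ~ year, data = my_data, FUN = sum)
```

The formula interface keeps the original column names (year, PM2.5) instead of Group.1 and x.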

How to calculate time-weighted average and create lags

I have searched the forum but found nothing that answers, or even hints at, how to do what I want.
I have yearly measurements of exposure data, from which I wish to calculate an individual-level annual average based on each individual's entry into the study. For each row, the one-year exposure assignment should include data from the preceding 12 months, starting from the last month before joining the study.
As an example, the first person in the sample data joined the study on Feb 7, 2002. His exposure will include a contribution from January 2002 (annual average is 18) and from February to December 2001 (annual average is 19). The time-weighted average for this person would be (1/12*18) + (11/12*19). The two-year average exposure for the same person would extend back from January 2002 to February 2000.
Similarly, the last person, who joined the study in December 2004, will have a contribution of 11 months in 2004 and one month in 2003, so his annual average exposure will be (11/12*5), derived from 2004, plus (1/12*6), which comes from the annual average of 2003.
How can I calculate the 1-, 2- and 5-year average exposure going back from the date of entry into the study? How can I use lags in the manner that I have described?
Sample data is accessed from this link
https://drive.google.com/file/d/0B_4NdfcEvU7La1ZCd2EtbEdaeGs/view?usp=sharing
This is not an elegant answer, but I would like to leave what I tried. I first arranged the data frame. I wanted to identify which year is the key year for each subject, so I created id. variable comes from the column names (e.g., pol_2000) in your original data set; entryYear and entryMonth both come from entry in your data. check was created in order to identify which year is the base year for each participant. In my next step, I extracted six rows for each participant using getMyRows in the SOfun package. After that, I used lapply and did the math as you described in your question. For the two- and five-year averages, I divided the total values by the number of years (2 or 5). I was not sure what the final output should look like, so I decided to use the base year for each subject and added three columns to it.
library(stringi)
library(SOfun)
devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
library(reshape2) # provides melt(), which is used below
### Big thanks to BondedDust for this function
### http://stackoverflow.com/questions/6987478/convert-a-month-abbreviation-to-a-numeric-month-in-r
mo2Num <- function(x) match(tolower(x), tolower(month.abb))
### Arrange the data frame.
ana <- foo %>%
mutate(id = 1:n()) %>%
melt(id.vars = c("id","entry")) %>%
arrange(id) %>%
mutate(variable = as.numeric(gsub("^.*_", "", variable)),
entryYear = as.numeric(stri_extract_last(entry, regex = "\\d+")),
entryMonth = mo2Num(substr(entry, 3,5)) - 1,
check = ifelse(variable == entryYear, "Y", "N"))
### Find a base year for each subject and get some parts of data for each participant.
indx <- which(ana$check == "Y")
bob <- getMyRows(ana, pattern = indx, -5:0)
### Get one-year average
cathy <- lapply(bob, function(x){
x$one <- ((x[6,6] / 12) * x[6,4]) + (((12-x[5,6])/12) * x[5,4])
x
})
one <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Get two-year average
cathy <- lapply(bob, function(x){
x$two <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + (((12-x[4,6])/12) * x[4,4])) / 2
x
})
two <- unnest(lapply(cathy, `[`, i = 6, j =8))
### Get five-year average
cathy <- lapply(bob, function(x){
x$five <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + x[4,4] + x[3,4] + x[2,4] + (((12-x[2,6])/12) * x[1,4])) / 5
x
})
five <- unnest(lapply(cathy, `[`, i =6 , j =8))
### Combine the results with the key observations
final <- cbind(ana[which(ana$check == "Y"),], one, two, five)
colnames(final) <- c(names(ana), "one", "two", "five")
# id entry variable value entryYear entryMonth check one two five
#6 1 07feb2002 2002 18 2002 1 Y 18.916667 18.500000 18.766667
#14 2 06jun2002 2002 16 2002 5 Y 16.583333 16.791667 17.150000
#23 3 16apr2003 2003 14 2003 3 Y 15.500000 15.750000 16.050000
#31 4 26may2003 2003 16 2003 4 Y 16.666667 17.166667 17.400000
#39 5 11jun2003 2003 13 2003 5 Y 13.583333 14.083333 14.233333
#48 6 20feb2004 2004 3 2004 1 Y 3.000000 3.458333 3.783333
#56 7 25jul2004 2004 2 2004 6 Y 2.000000 2.250000 2.700000
#64 8 19aug2004 2004 4 2004 7 Y 4.000000 4.208333 4.683333
#72 9 19dec2004 2004 5 2004 11 Y 5.083333 5.458333 4.800000
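The one-year calculation can also be written as a small standalone function, which makes the weighting explicit. This is a sketch of the arithmetic described in the question only; the function name and arguments are hypothetical, and it does not cover the two- and five-year cases.

```r
# entry_month:    calendar month of entry (1 = January, ..., 12 = December)
# avg_entry_year: annual average exposure for the year of entry
# avg_prev_year:  annual average exposure for the year before entry
one_year_twa <- function(entry_month, avg_entry_year, avg_prev_year) {
  months_in_entry_year <- entry_month - 1          # full months before entry
  months_in_prev_year  <- 12 - months_in_entry_year
  (months_in_entry_year / 12) * avg_entry_year +
    (months_in_prev_year / 12) * avg_prev_year
}

first_person <- one_year_twa(2, 18, 19)   # joined Feb 2002: (1/12*18) + (11/12*19)
last_person  <- one_year_twa(12, 5, 6)    # joined Dec 2004: (11/12*5) + (1/12*6)
```

Both values agree with the "one" column in the output above (18.916667 and 5.083333).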

Filter data frame by lowest common overlap in categorical variable in R

I have the following data frame:
input<-data.frame(
site=c("1","2","3","1","2","3","4","1","2"),
year=c(rep("2006",3),rep("2010",4),rep("2014",2)
))
site year
1 1 2006
2 2 2006
3 3 2006
4 1 2010
5 2 2010
6 3 2010
7 4 2010
8 1 2014
9 2 2014
I would like to return a list of sites surveyed in 2006, 2010, and 2014; so in the example above only site 1 and 2 would be in the list as they are the only sites that were surveyed in 2006, 2010, and 2014.
Any advice is most appreciated.
You can use ddply to count the number of years that are in your list of years of interest, for each site and then pull the sites that have all three.
library(plyr)
res <- ddply(.data = input, .variables = .(site),
summarize, allthree = all(c("2006","2010","2014") %in% year))
res$site[res$allthree]
If your data may contain other years, this solution should work:
yearsneeded <- c("2006","2010","2014")
names(which(tapply(input$year, input$site, function(x) all(yearsneeded %in% x))))
It may be most straightforward to first cross-tabulate year and site using table(), and to then "apply" the function all to each of the table's rows to find which ones have all non-zero entries, like so:
(tb <- table(input))
# year
# site 2006 2010 2014
# 1 1 1 1
# 2 1 1 1
# 3 1 1 0
# 4 0 1 0
rownames(tb)[apply(tb,1,all)]
# [1] "1" "2"
Or, if you really just care that there should be at least one presence in each of 2006, 2010, and 2014 (even if your data might contain other years), try this:
rownames(tb)[apply(tb[,c("2006", "2010", "2014")], 1, all)]
# [1] "1" "2"
This is another approach (updated). It also works if the original input data frame has more than the 3 years in the example
years <- c(2006,2010,2014) #list with required years
df <- input[input$year %in% years,] #data frame containing only the required years
sites <- as.numeric(which(rowSums(table(df)) == length(years))) #sites that fulfill the criteria
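One more base R sketch, based on set intersection: split the sites by year and intersect the groups. The variable names are illustrative. Using a factor with the required years as levels is a deliberate choice here, so that a required year with no surveys at all yields an empty group and therefore an empty result.

```r
input <- data.frame(
  site = c("1", "2", "3", "1", "2", "3", "4", "1", "2"),
  year = c(rep("2006", 3), rep("2010", 4), rep("2014", 2))
)

needed <- c("2006", "2010", "2014")

# Keep only the required years, then group the site IDs by year
sub <- input[input$year %in% needed, ]
sites_by_year <- split(as.character(sub$site),
                       factor(as.character(sub$year), levels = needed))

# Sites present in every required year
common <- Reduce(intersect, sites_by_year)
```

For the example data this returns sites "1" and "2", the same sites the other answers find.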
