I want to create a function that holds an ifelse statement like follows:
ArrearsL1M<-function(input1, input2, input3, input4){
output=0
output=ifelse((df[[input1]] %in% c(1,2,3,4,5)),1,
ifelse((df[[input2]] %in% c(1,2,3,4,5)),1,
ifelse((df[[input3]] %in% c(1,2,3,4,5)),1,
ifelse((df[[input4]] %in% c(1,2,3,4,5)), 1, 0))))
return(output)
}
Then I'd have this:
df$Arrears_L1M<-ArrearsL1M("col_201823","col_201822","col_201821","col_201820")
Here is an example of the data:
col_201823 col_201822 col_201821 col_201820 col_201819 col_201818 col_201817 col_201816 col_201815
1 99 5 4 2 99 99 99 99 99
2 3 0 3 2 3 3 3 3 3
3 2 2 2 2 2 2 2 2 2
4 0 0 0 1 0 0 0 0 0
5 99 99 5 99 99 99 99 99 99
6 2 1 4 99 2 2 2 2 2
7 1 1 99 99 1 1 1 1 1
So the code will check the previous 4 weeks of data starting with the most recent (i.e. 2018 week 23, week 22, week 21 and week 20)
The starting week can change, and I want this to work so that I enter only the first week and the function runs for that week and the three weeks before it. So if I enter col_201820 I will get the answer for weeks col_201820, col_201819, col_201818 and col_201817.
I'll want to run this for 52 weeks of data (i.e. a year) at some point, so I'm trying to make it easy to change the starting week. It also needs to roll back to 201752, 201751 and 201750 if the starting week is 201801.
I'm not sure where to start with this so can't show you anything I've already tried.
** Code for reproducible example
col_201823<-c(99,3,2,0,99,2,1)
col_201822<-c(5,0,2,0,99,1,1)
col_201821<-c(4,3,2,0,5,4,99)
col_201820<-c(2,2,2,1,99,99,99)
col_201819<-c(99,3,2,0,99,2,1)
col_201818<-c(99,3,2,0,99,2,1)
col_201817<-c(99,3,2,0,99,2,1)
col_201816<-c(99,3,2,0,99,2,1)
col_201815<-c(99,3,2,0,99,2,1)
test<-as.data.frame(cbind(col_201823,col_201822,col_201821,col_201820,col_201819,col_201818,col_201817,col_201816,col_201815))
I guess you want to figure out how to create a vector of weeks from a starting week. For instance
weeks_from_start <- function(x) {
week <- as.integer(substring(x, nchar(x) - 1))
rest <- substring(x, 1, nchar(x) - 2)
paste0(rest, seq(week, by = -1, length.out=4))
}
so
> weeks_from_start("col_201823")
[1] "col_201823" "col_201822" "col_201821" "col_201820"
Use this at the top of your ArrearsL1M() function. I would implement this as
ArrearsL1M <- function(df, last_week) {
weeks <- weeks_from_start(last_week)
m <- as.matrix(df[, weeks])
m[] <- m %in% 1:5 # test all elements in 1 call; format as matrix
rowSums(m) != 0
}
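As a quick check, the two functions above can be run against part of the reproducible data from the question. This is only a sketch: the starting week col_201819 is my choice (so that the result is not all TRUE), and the unname()/as.integer() wrapper is added just to get the 0/1 coding the question asked for.

```r
# Subset of the reproducible data from the question.
test <- data.frame(
  col_201819 = c(99, 3, 2, 0, 99, 2, 1),
  col_201818 = c(99, 3, 2, 0, 99, 2, 1),
  col_201817 = c(99, 3, 2, 0, 99, 2, 1),
  col_201816 = c(99, 3, 2, 0, 99, 2, 1)
)

# Build the four column names ending at the starting week.
weeks_from_start <- function(x) {
  week <- as.integer(substring(x, nchar(x) - 1))
  rest <- substring(x, 1, nchar(x) - 2)
  paste0(rest, seq(week, by = -1, length.out = 4))
}

# TRUE for rows with a value in 1:5 in any of the four weeks.
ArrearsL1M <- function(df, last_week) {
  weeks <- weeks_from_start(last_week)
  m <- as.matrix(df[, weeks])
  m[] <- m %in% 1:5
  rowSums(m) != 0
}

arrears <- as.integer(unname(ArrearsL1M(test, "col_201819")))
arrears  # 0 1 1 0 0 1 1
```

Passing the data frame as an argument, rather than relying on a global df, lets the same function be reused on any table with this column naming.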
For more complicated parsing, revise weeks_from_start() as
week0 <- as.integer(substring(x, nchar(x) - 1))
year0 <- as.integer(substring(x, 5, 8))
week0 <- seq(week0, by = -1, length.out = 4)
week <- (week0 - 1) %% 52 + 1
year <- year0 - cumsum(week0 == 0)
sprintf("col_%4d%.2d", year, week)
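Wrapped up as a complete function, the revised body above might look like the sketch below. It assumes column names of the form col_YYYYWW and, as the next paragraph warns, that every year has exactly 52 weeks.

```r
# Sketch: revised weeks_from_start() that rolls back across a year boundary.
# Assumes names like "col_YYYYWW" and 52 weeks in every year.
weeks_from_start <- function(x) {
  week0 <- as.integer(substring(x, nchar(x) - 1))  # two-digit week
  year0 <- as.integer(substring(x, 5, 8))          # four-digit year
  week0 <- seq(week0, by = -1, length.out = 4)     # may reach 0, -1, ...
  week <- (week0 - 1) %% 52 + 1                    # map 0 -> 52, -1 -> 51, ...
  year <- year0 - cumsum(week0 == 0)               # step back a year at week 0
  sprintf("col_%4d%.2d", year, week)
}

weeks_from_start("col_201801")
# "col_201801" "col_201752" "col_201751" "col_201750"
weeks_from_start("col_201823")
# "col_201823" "col_201822" "col_201821" "col_201820"
```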
Probably this is approaching a 'hack', e.g., do all years have 52 weeks? For a year beginning on, say, Tuesday, is week 1 Tues - Sunday, week 52 of the previous year just Monday? Time to rethink how this data is represented...
ArrearsL1M <- function(input1, input2, input3, input4){
cols <- c(input1, input2, input3, input4)
output <- as.numeric(apply(apply(df[, cols], 2, function(x) x %in% 1:5), 1, any))
return(output)
}
With 1 input (a character vector of column names):
ArrearsL1M <- function(cols){
output <- as.numeric(apply(apply(df[, cols], 2, function(x) x %in% 1:5), 1, any))
return(output)
}
Related
I'm trying to create a function that sums the closest n values to a given date. So if I had 5 weeks of data, and n=2, the value on week 1 would be the sum of weeks 2&3, the value on week 2 would be the sum of weeks 1&3, etc. Example:
library(dplyr)
library(data.table)
Week <- 1:5
Sales <- c(1, 3, 5, 7, 9)
frame <- data.table(Week, Sales)
frame
Week Sales Recent
1: 1 1 8
2: 2 3 6
3: 3 5 10
4: 4 7 14
5: 5 9 12
I want to make a function that does this for me with an input for most recent n (not just 2), but for now I want to get 2 right. Here's my function using lag/lead:
RecentSum = function(Variable, Lags){
Sum = 0
for(i in 1:(Lags/2)){ #Lags/2 because I want half values before and half after
#Check to see if you can go backwards. If not, go forward (i.e. use lead).
if(is.na(lag(Variable, i))){
LoopSum = lead(Variable, i)
}
else{
LoopSum = lag(Variable, i)
}
Sum = Sum + LoopSum
}
for(i in 1:(Lags/2)){
if(is.na(lead(Variable, i))){ #Check to see if you can go forward. If not, go backwards (i.e. use lag).
LoopSum = lag(Variable, i)
}
else{
LoopSum = lead(Variable, i)
}
Sum = Sum + LoopSum
}
Sum
}
When I do RecentSum(frame$Sale,2) I get [1] 6 10 14 18 NA which is wrong for a number of reasons:
My if statements are only hitting on week one, so it will always be NA for lag and always be non-NA for lead.
I need to have a way to see if it uses lag/lead the first time. The first value is 6 instead of 8 because the first for-loop sends it to lead(_,1), but then the second for-loop does the same. I can't think of how I'd make my second for-loop recognize this.
Is there a function or library (Zoo?) that makes this task easy? I'd like to get my own function to work for the sake of practice/understanding, but at this point I'd rather just get it done.
Thanks!
To elaborate on my comment: lead and lag are vectorized, so they return a whole shifted vector and are meant to be used inside dplyr verbs such as mutate(), not inside a scalar if(). Here is a way to do it with dplyr without writing a function:
df <- tibble(week = Week, sales = Sales)
df %>%
mutate(recent = case_when(is.na(lag(sales)) ~ lead(sales, n = 1) + lead(sales, n = 2),
is.na(lead(sales)) ~ lag(sales, n = 1) + lag(sales, n = 2),
TRUE ~ lag(sales) + lead(sales)))
That gives you this:
# A tibble: 5 x 3
week sales recent
<int> <dbl> <dbl>
1 1 1 8
2 2 3 6
3 3 5 10
4 4 7 14
5 5 9 12
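To see why the if() in the question misbehaves, it may help to look at what a shifted vector looks like. The sketch below uses two small base R stand-ins for dplyr's lag() and lead() (shift_back and shift_fwd are hypothetical names for this illustration only): is.na() of a shifted vector yields one value per row, so the choice has to be made row by row, here with ifelse().

```r
sales <- c(1, 3, 5, 7, 9)

# Base R stand-ins for dplyr::lag() / dplyr::lead() (illustration only).
shift_back <- function(x, n = 1) c(rep(NA, n), head(x, -n))
shift_fwd  <- function(x, n = 1) c(tail(x, -n), rep(NA, n))

shift_back(sales)          # NA 1 3 5 7
is.na(shift_back(sales))   # TRUE FALSE FALSE FALSE FALSE (one value per row)

# Row-wise choice, mirroring the case_when() call above.
recent <- ifelse(
  is.na(shift_back(sales)),
  shift_fwd(sales, 1) + shift_fwd(sales, 2),       # first row: two values ahead
  ifelse(
    is.na(shift_fwd(sales)),
    shift_back(sales, 1) + shift_back(sales, 2),   # last row: two values behind
    shift_back(sales) + shift_fwd(sales)           # middle rows: one each side
  )
)
recent  # 8 6 10 14 12
```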
1) Assuming that k is even, define to as a vector of indices such that, for each element of to, we sum the k+1 elements of Sales that end at that index and then subtract Sales from the result:
k <- 2 # number of elements to sum
n <- nrow(frame)
to <- pmax(k+1, pmin(1:n + k/2, n))
Sum <- function(to, Sales) sum(Sales[seq(to = to, length = k+1)])
frame %>% mutate(recent = sapply(to, Sum, Sales) - Sales)
giving:
Week Sales recent
1 1 1 8
2 2 3 6
3 3 5 10
4 4 7 14
5 5 9 12
Note that by replacing the last line of code above with the following line the solution can be done entirely in base R:
transform(frame, recent = sapply(to, Sum, Sales) - Sales)
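Since the question asked for a function with n as an input, solution 1 can be wrapped up roughly as follows. This is a sketch under the same assumptions (k even, more than k observations); recent_sum is a hypothetical name, and the window is written as (end - k):end instead of seq(to = ...).

```r
# Sketch: sum of the k nearest neighbours of each element (k even).
recent_sum <- function(x, k) {
  x <- as.numeric(x)
  n <- length(x)
  stopifnot(k %% 2 == 0, n > k)
  # Last index of each (k + 1)-wide window, clamped to stay inside 1..n.
  to <- pmax(k + 1, pmin(seq_len(n) + k / 2, n))
  window_sum <- vapply(to, function(end) sum(x[(end - k):end]), numeric(1))
  window_sum - x  # drop the element itself, leaving its k neighbours
}

recent_sum(c(1, 3, 5, 7, 9), 2)  # 8 6 10 14 12
recent_sum(c(1, 3, 5, 7, 9), 4)  # 24 22 20 18 16
```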
2) This concatenates the appropriate elements before and after the Sales series so that an ordinary rolling sum gives the result.
library(zoo)
ix <- c(seq(to = k+1, length = k/2), 1:n, seq(to = n-k, length = k/2))
frame %>% mutate(recent = rollsum(Sales[ix], k+1) - Sales)
Note that if k = 2 this reduces to the following one-liner:
frame %>% mutate(recent = rollsum(Sales[c(3, 1:n(), n()-2)], 3) - Sales)
giving:
Week Sales recent
1 1 1 8
2 2 3 6
3 3 5 10
4 4 7 14
5 5 9 12
Update: fixed for k > 2
I have a dataframe like this
Day <- c("Day1","Day20","Day5","Day10")
A <- c (5,7,2,0)
B <- c(15,12,16,30)
df <- data.frame(Day,A,B)
df$Day <- as.character(df$Day)
The first column is a character vector, so I used this solution to sort the data frame. It doesn't quite work, though, because it only sorts the first column and leaves columns 2 and 3 unchanged.
df$Day <- df$Day[order(nchar(df$Day), df$Day)]
My desired output is
Day A B
Day1 5 15
Day5 2 16
Day10 0 30
Day20 7 12
What am I missing here? Kindly provide some inputs.
You can try using something like this that does numeric day sorting:
Day <- c("Day1","Day20","Day5","Day10")
A <- c (5,7,2,0)
B <- c(15,12,16,30)
df <- data.frame(Day,A,B, stringsAsFactors = FALSE)
df$DayNum <- as.numeric(gsub('Day', '', df$Day))
df <- df[order(df$DayNum), ]
Output as follows:
df
Day A B DayNum
1 Day1 5 15 1
3 Day5 2 16 5
4 Day10 0 30 10
2 Day20 7 12 20
You can avoid creating a new column by doing the following (was trying to show full detail of what was going on):
df <- df[order(as.numeric(substr(df$Day, 4, nchar(df$Day)))), ]
Output will be same as above.
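It may also be worth noting that the order() call from the question already produces the right permutation; the problem was only that it was assigned back to a single column. A minimal sketch of using it to index whole rows instead (this relies on every label sharing the same "Day" prefix, which is an assumption about the data):

```r
df <- data.frame(
  Day = c("Day1", "Day20", "Day5", "Day10"),
  A = c(5, 7, 2, 0),
  B = c(15, 12, 16, 30),
  stringsAsFactors = FALSE
)

# Shorter labels first, then alphabetical within the same length.
idx <- order(nchar(df$Day), df$Day)

# Index the rows, so that A and B move together with Day.
sorted <- df[idx, ]
sorted
```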
This could be done with mixedorder from library(gtools)
library(gtools)
df[mixedorder(df$Day),]
# Day A B
#1 Day1 5 15
#3 Day5 2 16
#4 Day10 0 30
#2 Day20 7 12
Day <- c("Day1","Day20","Day5","Day10")
A <- c (5,7,2,0)
B <- c(15,12,16,30)
df <- data.frame(Day,A,B, stringsAsFactors = FALSE)
# add leading zero(s) to digits in values of Day column,
# e.g., "Day5" --> "Day05"
# then return the indices of the sorted vector
indices_to_sort_by <- sort(
sub(
pattern = "([a-z]{1})([1-9]{1}$)",
replacement = "\\10\\2",
x = df$Day
),
index.return = TRUE)$ix
df[indices_to_sort_by, ]
# Day A B
# 1 Day1 5 15
# 3 Day5 2 16
# 4 Day10 0 30
# 2 Day20 7 12
I've got a data.frame of monthly values of a variable for many locations (so many rows) and I want to count the numbers of consecutive months (i.e consecutive cells) that have a value of zero. This would be easy if it was just being read left to right, but the added complication is that the end of the year is consecutive to the start of the year.
For example, in the shortened example dataset below (with seasons instead of months), location 1 has 3 consecutive '0' months, location 2 has 2, and location 3 has none.
df<-cbind(location= c(1,2,3),
Winter=c(0,0,3),
Spring=c(0,2,4),
Summer=c(0,2,7),
Autumn=c(3,0,4))
How can I count these consecutive zero values? I've looked at rle but I'm still none the wiser currently!
Many thanks for any help :)
You've identified the two forms the longest run can take: (1) somewhere in the middle of a row, or (2) split between the end and the beginning of a row. Hence you want to calculate each case and take the max, like so:
df<-cbind(
Winter=c(0,0,3),
Spring=c(0,2,4),
Summer=c(0,2,7),
Autumn=c(3,0,4))
#> Winter Spring Summer Autumn
#> [1,] 0 0 0 3
#> [2,] 0 2 2 0
#> [3,] 3 4 7 4
# calculate the number of consecutive zeros at the start and end
startZeros <- apply(df,1,function(x)which.min(x==0)-1)
#> [1] 3 1 0
endZeros <- apply(df,1,function(x)which.min(rev(x==0))-1)
#> [1] 0 1 0
# calculate the longest run of zeros
longestRun <- apply(df,1,function(x){
y = rle(x);
max(y$lengths[y$values==0],0)})
#> [1] 3 1 0
# take the max of the two values
pmax(longestRun,startZeros +endZeros )
#> [1] 3 2 0
Of course an even easier solution is:
longestRun <- apply(cbind(df,df),# tricky way to wrap the zeros from the start to the end
1,# the margin over which to apply the summary function
function(x){# the summary function
y = rle(x);
max(y$lengths[y$values==0],
0)#include zero in case there are no zeros in y$values
})
Note that the above solution works because my df does not include the location field (column).
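One edge case of the doubling trick: a row made up entirely of zeros would be counted twice over. A hedged sketch that caps the result at the row length (longest_circular_zero_run is a hypothetical name, and the fourth row is an extra test case that is not in the question's data):

```r
# Sketch: longest run of zeros in a row, treating the row as circular.
longest_circular_zero_run <- function(x) {
  r <- rle(c(x, x) == 0)              # double the row so runs can wrap around
  run <- max(r$lengths[r$values], 0)  # 0 when the row has no zeros at all
  min(run, length(x))                 # an all-zero row must not count twice
}

m <- rbind(
  c(0, 0, 0, 3),
  c(0, 2, 2, 0),
  c(3, 4, 7, 4),
  c(0, 0, 0, 0)  # extra all-zero row, added for illustration
)
runs <- apply(m, 1, longest_circular_zero_run)
runs  # 3 2 0 4
```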
Try this:
df <- data.frame(location = c(1, 2, 3),
Winter = c(0, 0, 3),
Spring = c(0, 2, 4),
Summer = c(0, 2, 7),
Autumn = c(3, 0, 4))
maxcumzero <- function(x) {
l <- x == 0
max(cumsum(l) - cummax(cumsum(l) * !l))
}
df$N.Consec <- apply(cbind(df[, -1], df[, -1]), 1, maxcumzero)
df
# location Winter Spring Summer Autumn N.Consec
# 1 1 0 0 0 3 3
# 2 2 0 2 2 0 2
# 3 3 3 4 7 4 0
This adds a column to the data frame specifying the maximum number of times zero has occurred consecutively in each row of the data frame. The data frame is column bound to itself to be able to detect consecutive zeroes between autumn and winter.
The method used here is based on that of Martin Morgan in his answer to this similar question.
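To make the cumsum/cummax trick less opaque, here are the intermediate steps of maxcumzero() for location 2, whose row (0, 2, 2, 0) is bound to itself. The variable names are only for illustration.

```r
x <- c(0, 2, 2, 0, 0, 2, 2, 0)   # row for location 2, bound to itself
l <- x == 0

zeros_so_far <- cumsum(l)                       # 1 1 1 2 3 3 3 4
# At each non-zero position, remember how many zeros came before it.
baseline <- cummax(zeros_so_far * !l)           # 0 1 1 1 1 3 3 3
# Zeros seen since the last non-zero value = length of the current run.
run_len <- zeros_so_far - baseline              # 1 0 0 1 2 0 0 1
max(run_len)                                    # 2
```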
I am trying to use the rle function in R to calculate the run lengths for the variable positive in the example below, aggregated by the variable id.
Here is a toy dataset (that admittedly has a few quirks):
test <- c('id', 'positive')
test$id <- rep(1:3, c(24, 24, 24))
set.seed(123456)
test$positive <- round(runif(72, 0, 1))
test <- data.frame(test)
test <- subset(test, select = -X.id.)
test <- subset(test, select = -X.positive.)
result <- aggregate(positive ~ id, data = test, FUN = rle)
The way this currently is set up it reads the run lengths for all possible values (0 and 1) of the variable positive. Is it possible to condition this function such that it only evaluates the run lengths when positive == 1?
At the end of the day, I ultimately want to figure out how to count the number of instances in which two or more consecutive months were positive (positive == 1) for each subject.
UPDATE:
I have a variable called event that has values of 0 or 1. For each of the occurrences of two or more positives that were developed from the code featured in the suggestions below, is it possible to stratify our results such that if event == 1 occurs during any of the positive months it would be classified differently than a run of positives in which event == 0 for all of the months?
The toy dataset looks like this:
set.seed(123456)
x <- c(1, 2, 1)
test <- data.frame(id = rep(1:3, each = 24), positive = round(runif(72, 0, 1)), event = round(runif(72, 0, 1)))
results <- aggregate(positive ~ id + event, data = test, FUN=function(x) with(rle(x), sum(lengths > 1 & values == 1)))
aggregate(positive ~ event, data = results, FUN=sum)
However, this code gives all possible permutations of event and positive, while I would like to delimit the results to counting only those occurrences of two or more consecutive positive months for which any event == 1. Alternatively, if it is easier to evaluate only the number of consecutive positive months for which all event == 0 that would be a fine solution too.
To count occurrences of two or more consecutive positives, use this:
aggregate(positive ~ id, data=test, FUN=function(x) with(rle(x), sum(lengths>=2 & values==1)))
(inspired by @sgibb's answer.)
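For a single subject, the anonymous function in the aggregate() call above works like this. The vector x is made up for illustration.

```r
x <- c(1, 1, 0, 1, 0, 1, 1, 1)   # made-up monthly indicator for one subject

r <- rle(x)
r$lengths  # 2 1 1 1 3
r$values   # 1 0 1 0 1

# Count the runs that are positive and at least two months long.
n_runs <- sum(r$lengths >= 2 & r$values == 1)
n_runs     # 2
```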
EDIT: Counting the number of 2 or more consecutive positives such that any of them has event==1, separated by id:
Calculate the run to which each record belongs:
tmp <- within(test, run <- ave(positive, by=id, FUN=function(x)cumsum(c(1,diff(x)!=0))))
# id positive event run
# 1 1 1 1
# 1 1 0 1
# 1 0 1 2
# 1 0 0 2
# 1 0 1 2
# 1 0 0 2
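The run numbering inside ave() comes from cumsum(c(1, diff(x) != 0)): the counter goes up by one every time the value changes. A small trace on a made-up vector:

```r
x <- c(1, 1, 0, 0, 0, 1)          # made-up positive indicator

changed <- diff(x) != 0            # FALSE TRUE FALSE FALSE TRUE
run_id <- cumsum(c(1, changed))    # start at run 1, add 1 at every change
run_id                             # 1 1 2 2 2 3
```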
For each id and each run mark if there was at least one record with event==1 and run length >= 2:
tmp2 <- aggregate(event~id+positive+run, data=tmp, function(x)any(x>0) && length(x)>=2)
# id positive run event
# 2 0 1 FALSE
# 1 1 1 TRUE
# 3 1 1 FALSE
# 1 0 2 TRUE
# 3 0 2 TRUE
# 2 1 2 TRUE
Now simply count how many marked runs are there in each id and each kind of run (positive==1 or positive==0):
aggregate(event~positive+id, tmp2, sum)
# positive id event
# 0 1 1
# 1 1 2
# 0 2 1
# 1 2 3
# 0 3 3
# 1 3 1
Do you mean something like this?:
aggregate(positive ~ id, data=test, FUN=function(x) {
r <- rle(x);
return(r$length[r$value == 1])
})
# id positive
# 1 1 2, 1, 1, 7, 1
# 2 2 4, 2, 1, 4, 2, 1, 2
# 3 3 1, 7, 1, 1, 1
A ddply version for the 'at the end of the day' part:
library(plyr)
set.seed(123456)
test <- data.frame(id = rep(1:3, each = 24), positive = round(runif(72, 0, 1)))
ddply(.data = test, .variables = .(id), function(x){
rl <- rle(x$positive)
sum(rl$length[rl$value == 1] > 1)
}
)
# id V1
# 1 1 2
# 2 2 5
# 3 3 1
I am trying to calculate the lagged difference (or actual increase) for data that has been inadvertently aggregated. Each successive year in the data includes values from the previous year. A sample data set can be created with this code:
set.seed(1234)
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
(df <- rbind(x, y, z))
I can use a combination of lapply() and split() to calculate the difference between each year for every unique id, like so:
(diffs <- lapply(split(df, df$id), function(x){-diff(x$value)}))
However, because of the nature of the diff() function, there are no results for the values in year 1, which means that after I flatten the diffs list of lists with Reduce(), I cannot add the actual yearly increases back into the data frame, like so:
df$actual <- Reduce(c, diffs) # flatten the list of lists
In this example, there are only 10 calculated differences or lags, while there are 15 rows in the data frame, so R throws an error when trying to add a new column.
How can I create a new column of actual increases with (1) the values for year 1 and (2) the calculated diffs/lags for all subsequent years?
This is the output I'm eventually looking for. My diffs list of lists calculates the actual values for years 2 and 3 just fine.
id value year actual
1 21 3 5
2 26 3 16
3 26 3 14
4 26 3 10
5 29 3 14
1 16 2 10
2 10 2 5
3 12 2 10
4 16 2 7
5 15 2 13
1 6 1 6
2 5 1 5
3 2 1 2
4 9 1 9
5 2 1 2
I think this will work for you. When you run into the diff problem just lengthen the vector by putting 0 in as the first number.
df <- df[order(df$id, df$year), ]
sdf <-split(df, df$id)
df$actual <- as.vector(sapply(seq_along(sdf), function(x) diff(c(0, sdf[[x]][,2]))))
df[order(as.numeric(rownames(df))),]
There are lots of ways to do this, but this one is fairly fast and uses only base R.
Here's a second & third way of approaching this problem utilizing aggregate and by:
aggregate:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- c(unlist(t(aggregate(value~id, df, diff2)[, -1])))
df[order(as.numeric(rownames(df))),]
by:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- unlist(by(df$value, df$id, diff2))
df[order(as.numeric(rownames(df))),]
plyr
df <- df[order(df$id, df$year), ]
df <- data.frame(temp=1:nrow(df), df)
library(plyr)
df <- ddply(df, .(id), transform, actual=diff2(value))
df[order(-df$year, df$temp),][, -1]
It gives you the final product of:
> df[order(as.numeric(rownames(df))),]
id value year actual
1 1 21 3 5
2 2 26 3 16
3 3 26 3 14
4 4 26 3 10
5 5 29 3 14
6 1 16 2 10
7 2 10 2 5
8 3 12 2 10
9 4 16 2 7
10 5 15 2 13
11 1 6 1 6
12 2 5 1 5
13 3 2 1 2
14 4 9 1 9
15 5 2 1 2
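Yet another base R possibility, offered only as a sketch, is ave(), which applies a function within each id and returns the results in the positions of the input. The values below are typed in from the expected output in the question so that the example does not depend on the random number generator.

```r
# Data typed in from the question's expected output.
df <- data.frame(
  id = rep(1:5, times = 3),
  value = c(21, 26, 26, 26, 29, 16, 10, 12, 16, 15, 6, 5, 2, 9, 2),
  year = rep(3:1, each = 5)
)

# Work in (id, year) order so diff() sees the years ascending within each id.
o <- order(df$id, df$year)
actual <- numeric(nrow(df))
actual[o] <- ave(df$value[o], df$id[o], FUN = function(v) diff(c(0, v)))

# Writing through the index o puts each result back on its original row.
df$actual <- actual
df$actual
# 5 16 14 10 14 10 5 10 7 13 6 5 2 9 2
```

Because the results are written back through o, no final re-sort by row name is needed.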
EDIT: Avoiding the Loop
May I suggest avoiding the loop and turning what I gave to you into a function (the by solution is the easiest one for me to work with) and sapply that to the two columns you desire.
set.seed(1234) #make new data with another numeric column
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
df <- rbind(x, y, z)
df <- df.rep <- data.frame(df[, 1:2], new.var=df[, 2]+sample(1:5, nrow(df),
replace=T), year=df[, 3])
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x)) #function one
group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
df <- data.frame(df, sapply(df[, 2:3], group.diff)) #apply group.diff to col 2:3
df[order(as.numeric(rownames(df))),] #reorder it
Of course you'd have to rename these unless you used transform as in:
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x)) #function one
group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
df <- transform(df, actual=group.diff(value), actual.new=group.diff(new.var))
df[order(as.numeric(rownames(df))),]
This would depend on how many variables you were doing this to.
1) diff.zoo. With the zoo package it's just a matter of converting the data to zoo using split= and then performing the diff:
library(zoo)
zz <- zz0 <- read.zoo(df, split = "id", index = "year", FUN = identity)
zz[2:3, ] <- diff(zz)
It gives the following (in wide form rather than the long form you mentioned) where each column is an id and each row is a year minus the prior year:
> zz
1 2 3 4 5
1 6 5 2 9 2
2 10 5 10 7 13
3 5 16 14 10 14
The wide form shown may actually be preferable but you can convert it to long form if you want that like this:
dt <- function(x) as.data.frame.table(t(x))
setNames(cbind(dt(zz), dt(zz0)[3]), c("id", "year", "actual", "value")) # zz holds the diffs, zz0 the original values
This puts the years in ascending order which is the convention normally used in R.
2) rollapply. Also using zoo this alternative uses a rolling calculation to add the actual column to your data. It assumes the data is structured as you show with the same number of years in each group arranged in order:
df$actual <- rollapply(df$value, 6, partial = TRUE, align = "left",
FUN = function(x) if (length(x) < 6) x[1] else x[1]-x[6])
3) subtraction. Making the same assumptions as in the prior solution we can further simplify it to just this which subtracts from each value the value 5 positions hence:
transform(df, actual = value - c(tail(value, -5), rep(0, 5)))
or this variation:
transform(df, actual = replace(value, year > 1, -diff(ts(value), 5)))
EDIT: added rollapply and subtraction solutions.
Kind of hackish but keeping in place your wonderful Reduce you could add mock rows to your df for year 0:
mockRows <- data.frame(id = 1:5, value = 0, year = 0)
(df <- rbind(df, mockRows))
(df <- df[order(df$id, df$year), ])
(diffs <- lapply(split(df, df$id), function(x){diff(x$value)}))
(df <- df[df$year != 0,])
(df$actual <- Reduce(c, diffs)) # flatten the list of lists
df[order(as.numeric(rownames(df))),]
This is the output:
id value year actual
1 1 21 3 5
2 2 26 3 16
3 3 26 3 14
4 4 26 3 10
5 5 29 3 14
6 1 16 2 10
7 2 10 2 5
8 3 12 2 10
9 4 16 2 7
10 5 15 2 13
11 1 6 1 6
12 2 5 1 5
13 3 2 1 2
14 4 9 1 9
15 5 2 1 2