I'd really appreciate some help with an issue I have with my R dataframe. Couldn't find a similar thread, so please share if it exists already!
I have the following data:
mydata <- data.frame(inflow=c(50,60,55,70,80),
outflow=c(70,80,70,65,65),
current=c(100,100,100,100,100))
I want to create a new column which does something like:
mutate(calc=pmax(lag(calc,default=current)+inflow-outflow,inflow))
which basically creates a new column called calc that takes the maximum of a) the previous row's value of calc plus this row's inflow minus outflow, or b) this row's inflow value. pmax is a base R function that returns the element-wise maximum across the given vectors, i.e. the maximum per row.
So my results will be: row1 = max(100+50-70, 50) which is 80, row2 = max(80+60-80, 60) which is 60, and so on.
The main issue is that the lag function doesn't allow for taking previous row values for the same column you're creating, it has to be a column that already exists in the data. I thought of doing it in steps by creating the calc column first and then adding a second calculation step, but can't exactly work it out.
Lastly, I know that a for loop might be a solution, but I was wondering if there is a different way. My data is grouped by an extra column, and I'm not sure a for loop will work well with grouped rows.
Thanks for any help :)
library(dplyr)
library(purrr)

# I don't define the current column, as this is handled with the .init argument of accumulate2
mydata <- data.frame(
  inflow = c(50, 60, 55, 70, 80),
  outflow = c(70, 80, 70, 65, 65)
)
# define your recursive function
flow_function <- function(current, inflow, outflow) {
  pmax(inflow, inflow - outflow + current)
}
mydata %>%
  mutate(result = accumulate2(inflow, outflow, flow_function, .init = 100)[-1] %>% unlist)
# inflow outflow result
# 1 50 70 80
# 2 60 80 60
# 3 55 70 55
# 4 70 65 70
# 5 80 65 85
Detail
The purrr::accumulate family of functions are designed to perform recursive calculations.
accumulate can handle functions which take the previous value plus values from one other column, whilst accumulate2 allows for a second additional column. Your scenario falls into the latter.
accumulate2 expects the following arguments:
.x - the first column for the calculation.
.y - the second column for the calculation.
.f - the function to apply recursively: this should have three arguments, the first of which is the recursive argument.
.init - (optional) the initial value to use as the first argument.
So in your case the function to pass to .f will be
# define your recursive function
flow_function <- function(current, inflow, outflow){
pmax(inflow, inflow - outflow + current)
}
We first test what this produces outside of a dplyr::mutate
# note I don't define the current column, as this is handled with the .init argument
mydata <- data.frame(
inflow=c(50,60,55,70,80),
outflow=c(70,80,70,65,65)
)
purrr::accumulate2(mydata$inflow, mydata$outflow, flow_function, .init = 100)
# returns
# [[1]]
# [1] 100
#
# [[2]]
# [1] 80
#
# [[3]]
# [1] 60
#
# [[4]]
# [1] 55
#
# [[5]]
# [1] 70
#
# [[6]]
# [1] 85
So there are two things to note about the returned value:
The returned object is a list, so we'll want to unlist back to a vector.
The list has 6 entries as it includes the initial value, we'll want to drop this.
These two final steps are brought together in the full example at the top.
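For the grouped case mentioned in the question, the same recursion can also be sketched in base R with Reduce(), which may be easier to check by hand. This is only a sketch under assumptions: the grouping column grp and the helper running_calc are hypothetical names, and the starting value of 100 is assumed to restart in every group.

```r
# Hypothetical grouped version of the example data (grp is an assumed column name)
mydata <- data.frame(
  grp     = c("a", "a", "a", "b", "b"),
  inflow  = c(50, 60, 55, 70, 80),
  outflow = c(70, 80, 70, 65, 65)
)

# Hypothetical helper: runs the recursion over one group's rows
running_calc <- function(inflow, outflow, init = 100) {
  # prev is the previous calc value, i is the current row index within the group
  step <- function(prev, i) max(inflow[i], prev + inflow[i] - outflow[i])
  # accumulate = TRUE keeps every intermediate value; [-1] drops the initial value
  Reduce(step, seq_along(inflow), init, accumulate = TRUE)[-1]
}

# ave() applies the helper to the row indices of each group and
# puts the results back in the original row order
row_id <- as.numeric(seq_len(nrow(mydata)))
mydata$calc <- ave(row_id, mydata$grp, FUN = function(i) {
  running_calc(mydata$inflow[i], mydata$outflow[i])
})

mydata$calc
# [1]  80  60  55 105 120
```

Calling running_calc() on the ungrouped inflow and outflow vectors gives 80, 60, 55, 70, 85, which matches the accumulate2 result above.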
Maybe the cummax function will help:
mutate(calc=pmax(cummax(current+inflow-outflow),inflow))
I am trying to do a couple things here in R. I have a large dataset. I need to find the mean of a column SI.x, which I have done, then break up the data and find the SI.x mean for each of the subsets, which I have also done.
But then I need to subtract the total mean SI.x (which I've called meangen0a as it's the mean of the generation I'm looking at) from each of the subsetted means. I'd like a way to save the subsetted means as a vector, subtract meangen0a from each of these, and save the result as another vector, as I will need to do more vector math later.
Here's what I've done so far:
I got the mean SI.x of the generation I'm looking at (which I called gen0a):
meangen0a <- mean(gen0a$SI.x)
This worked fine.
I split up the generation by treatment (a control and four others) and only used those that were selected for (which was designated by a 1 in the Select column).
gen0ameans <- with(gen0a[gen0a$Select == 1,], aggregate(SI.x, by=list(Generation, SelectTreatment), mean))
colnames(gen0ameans) <- c("Generation", "Treatment", "S")
This gave me a table with the generation (all 0a), the five treatments, and what their respective SI.x means were. This is what I wanted.
Now I want to subtract the total mean meangen0a from each of the five treatment means in the gen0ameans table. I tried doing this:
S0a <- lapply(gen0ameans$S, FUN=function(S) S-meangen0a)
and it gave me the correct numbers, but not in vector format. I need it to be in a vector of some sort because I will later need to subset the next generation and subtract 0a's means from the next generation's. When I tried to save S0a as a vector or matrix, it wasn't giving me a single row or column of the means like I'd like.
Any help would be appreciated. Thanks!
Edit - The mean of gen0a is -0.07267818.
The gen0ameans table looks like:
Generation   Treatment   S
----------   ---------   -----------
0a           Control     -0.07205068
0a           Down1       -0.08288528
0a           Down2       -0.08146745
0a           Up1         -0.06296805
0a           Up2         -0.06401943
When doing the S0a command from #3 above, it gives me:
[[1]]
[1] 0.0006274983
[[2]]
[1] -0.0102071
[[3]]
[1] -0.008789275
[[4]]
[1] 0.009710126
[[5]]
[1] 0.008658747
We can do this in tidyverse
library(tidyverse)
gen0a %>%
mutate(Meanval = mean(SI.x)) %>%
filter(Select == 1) %>%
group_by(Generation, SelectTreatment) %>%
mutate(NewMean = mean(SI.x) - Meanval)
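If the goal is simply a plain numeric vector, arithmetic in R is already vectorised, so the lapply call may not be needed at all. A minimal sketch, with gen0ameans rebuilt by hand from the values posted in the question's edit (the real table would come from the aggregate call):

```r
# Values copied from the question's edit
meangen0a <- -0.07267818
gen0ameans <- data.frame(
  Generation = "0a",
  Treatment  = c("Control", "Down1", "Down2", "Up1", "Up2"),
  S          = c(-0.07205068, -0.08288528, -0.08146745, -0.06296805, -0.06401943)
)

# Subtracting a scalar from a vector returns a vector of the same length
S0a <- gen0ameans$S - meangen0a

# Optional: label each difference with its treatment for later lookups
names(S0a) <- gen0ameans$Treatment

is.vector(S0a)   # TRUE
length(S0a)      # 5
```

The list returned by lapply could also be flattened with unlist(), or sapply could be used in its place, but the direct subtraction is the shortest route to a vector that can be reused for the next generation.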
Motivation: I am currently trying to rethink my coding such as to exclude for-loops where possible. The below problem can easily be solved with conventional for-loops, but I was wondering if R offers a possibility to utilize the apply-family to make the problem easier.
Problem: I have a matrix, say X (n x k matrix) and two matrices of start and stop indices, called index.starts and index.stops, respectively. They are of size n x B and it holds that index.stops = index.starts + m for some integer m. Each pair index.starts[i,j] and index.stops[i,j] are needed to subset X as X[ (index.starts[i,j]:index.stops[i,j]),]. I.e., they should select all the rows of X in their index range.
Can I solve this problem using one of the apply functions?
Application: (Not necessarily important for understanding my problem.) In case you are interested, this is needed for a block bootstrap in a time series application. X represents the original sample. index.starts is sampled as replicate(repetitionNumber, sample.int((n-r), ceiling(n/r), replace=TRUE)) and index.stops is obtained as index.stops = index.starts + m. What I want in the end is a collection of rows of X. In particular, I want to resample repetitionNumber times m blocks of length r from X.
Example:
#generate data
n<-100 #the size of your sample
B<-5 #the number of columns for index.starts and index.stops
#and equivalently the number of block bootstraps to sample
k<-2 #the number of variables in X
X<-matrix(rnorm(n*k), nrow=n, ncol = k)
#take a random sample of the indices 1:100 to get index.starts
r<-10 #this is the block length
#get a sample of the indices 1:(n-r), and get ceiling(n/r) of these
#(for n=100 and r=10, ceiling(n/r) = n/r = 10). Replicate this B times
index.starts<-replicate(B, sample.int((n-r), ceiling(n/r), replace=TRUE))
index.stops<-index.starts + r
#Now can I use apply-functions to extract the r subsequent rows that are
#paired in index.starts[i,j] and index.stops[i,j] for i = 1,2,...,10 = ceiling(n/r) and
#j=1,2,3,4,5=B ?
It's probably way more complicated than what you want/need, but here is a first approach. Just comment if that helps you in any way and I am happy to help.
My approach uses (multiple) *apply-functions. The first lapply "loops" over the 1:B cases, where it first calculates the start and end points, which are combined into take.rows (the row numbers to subset). Next, the initial matrix is subsetted by take.rows (and returned in a list). As a last step, the standard deviation is taken for each column of the subsetted matrices (as a dummy function).
The code (with heavy commenting) looks like this:
# you can use lapply in parallel mode if you want to speed up code...
lapply(1:B, function(i){
starts <- sample.int((n-r), ceiling(n/r), replace=TRUE)
# [1] 64 22 84 26 40 7 66 12 25 15
ends <- starts + r
take.rows <- Map(":", starts, ends)
# [[1]]
# [1] 64 65 66 67 68 69 70 71 72 73 74
# ...
res <- lapply(take.rows, function(subs) X[subs, ])
# res is now a list of 10 with the ten subsets
# [[1]]
# [,1] [,2]
# [1,] 0.2658915 -0.18265235
# [2,] 1.7397478 0.66315385
# ...
# say you want to compute something (sd in this case) you can do the following
# but better you do the computing directly in the former "lapply(take.rows...)"
res2 <- t(sapply(res, function(tmp){
apply(tmp, 2, sd)
})) # simplify into a vector/data.frame
# [,1] [,2]
# [1,] 1.2345833 1.0927203
# [2,] 1.1838110 1.0767433
# [3,] 0.9808146 1.0522117
# ...
return(res2)
})
Does that point you in the right direction or give you the answer?
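If the start and stop matrices from the question should be reused rather than resampled inside the loop, a variant along these lines might work. It is a sketch only: the names extract_block and samples are hypothetical, and set.seed is added just to make the run repeatable.

```r
set.seed(1)                      # only for reproducibility of this sketch
n <- 100; B <- 5; k <- 2; r <- 10
X <- matrix(rnorm(n * k), nrow = n, ncol = k)
index.starts <- replicate(B, sample.int(n - r, ceiling(n / r), replace = TRUE))
index.stops  <- index.starts + r

# Hypothetical helper: rows start..stop of X, always returned as a matrix
extract_block <- function(start, stop) X[start:stop, , drop = FALSE]

# One bootstrap sample per column j: pair up starts and stops with Map(),
# extract each block, then stack the blocks on top of each other
samples <- lapply(seq_len(B), function(j) {
  blocks <- Map(extract_block, index.starts[, j], index.stops[, j])
  do.call(rbind, blocks)
})

length(samples)       # B bootstrap samples
dim(samples[[1]])     # ceiling(n/r) * (r + 1) rows, k columns
```

Note that start:stop with stop = start + r selects r + 1 rows per block, so each sample here has 110 rows; using index.starts + r - 1 would give blocks of exactly r rows.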
I have a big dataset (10+ million rows x 30 variables) and I am trying to compute some new variables based on complicated interactions of current ones. For clarity I am including only the important variables in the question. I have the following code in R, but I am interested in other views and opinions. I am using the dplyr package to compute new columns based on current/following row values of 3 other columns (more explanation below the code).
I am wondering if there is a way to make this faster and more efficient, or maybe completely rewrite it...
library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe used below
# the main function: data is a dataframe, windowSize and ratio are ints
computeNewColumn <- function(data,windowSize,ratio){
#helper function used in the second mutate down...
# all args are ints, i return a boolean out
windowAhead <- function(timeTo,window,reduction){
# subset the original dataframe-only observations with values of
# TimeToGo between timeTo-1 and window (basically the following X rows
# from the current one)
subframe <- data[(timeTo-1 >= data$TimeToGo & data$TimeToGo >= window), ]
isthere <- any(subframe$Price < reduction)
return(isthere)
}
# I group by value of ID first and order by TimeToGo...
data %<>% group_by(ID) %>%
arrange(desc(TimeToGo)) %>%
# ...create two new columns from simple interactions of existing ones...
mutate(Window = ifelse(TimeToGo > windowSize, TimeToGo - windowSize, 0),
Reduction = floor(Price - (ratio * Price))) %>%
rowwise() %>%
#...now comes the more complex stuff- I want to compute a third column
# depending on the next (TimeToGo - Window) number of values of Price
mutate(Advice = ifelse(windowAhead(TimeToGo,Window,Reduction),1,0) )
return(data)
}
We have a dataset with the following columns: ID, Price, TimeToGo.
We first group by values of ID and compute two new columns based on current row values (Window from TimeToGo and Reduction from Price). The next thing we would like to do is compute a new third column based on
1. the current value of Reduction
2. the next (TimeToGo - Window) values of Price in the dataframe.
I am wondering if there is a simple way to reference upcoming values of a column from within mutate()? I am ideally looking for a sliding window function on one column, where the limits of the sliding window are set from two other current column values. My solution for now just uses a custom function which subsets on the original dataframe manually, does a comparison and returns back a value to the mutate() call. Any help and ideas would be much appreciated!
P.S. Here's a sample of the data. Please let me know if you need any more info. Thanks!
> a
ID TimeToGo Price
1 AQSAFOTO30A 96 19
2 AQSAFOTO20A 95 19
3 AQSAFOTO30A 94 17
4 AQSAFOTO20A 93 18
5 AQSAFOTO25A 92 19
6 AQSAFOTO30A 91 17
So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
# ensure character data
LOG_MESSAGE <- as.character(LOG_MESSAGE)
CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
rep(replace(values,
nchar(values)==0,
values[nchar(values) != 0]),
lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
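To see why the rle approach works, it can help to look at the intermediate pieces on a plain character vector. This is an illustrative sketch; the names x, runs, filled_values and current_event are made up for the walkthrough.

```r
x <- c("FIRST_EVENT", "", "SECOND_EVENT", "", "")

# rle() compresses the vector into runs of identical consecutive values
runs <- rle(x)
runs$values    # "FIRST_EVENT" "" "SECOND_EVENT" ""
runs$lengths   # 1 1 1 2

# Each empty run is overwritten with the event that precedes it
is_empty <- nchar(runs$values) == 0
filled_values <- replace(runs$values, is_empty, runs$values[!is_empty])

# rep() expands the runs back to the original length
current_event <- rep(filled_values, runs$lengths)
current_event
```

The replacement step relies on the log starting with an event and on every event being followed by one run of empty cells. If two different events follow each other directly, the counts of empty and non-empty runs no longer match and the fill goes wrong.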
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <-
transform(dat,
Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
"_"),
`[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of 1. is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first elements of these vectors.
4. So I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
5. The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.
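For readers who cannot install zoo, the carry-forward idea in step 1 can be approximated in base R. The sketch below is not zoo's implementation; fill_forward is a hypothetical helper that uses cumsum() to count how many non-NA values have been seen so far and uses that count as a lookup index.

```r
# Hypothetical helper: carry the last non-NA value forward
fill_forward <- function(x) {
  seen <- !is.na(x)
  # cumsum(seen) is 0 before the first non-NA value; the leading NA
  # in c(NA, ...) makes index 1 return NA for those positions
  c(NA, x[seen])[cumsum(seen) + 1]
}

log_message <- c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA)
filled <- fill_forward(log_message)
filled

# Keep only the part before the first underscore, as in steps 3 and 4
current_event <- sub("_.*$", "", filled)
current_event
```

Unlike na.locf() with its default settings, this version keeps leading NAs in place, so the output always has the same length as the input.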
I have a data frame with several columns; some numeric and some character. How to compute the sum of a specific column? I’ve googled for this and I see numerous functions (sum, cumsum, rowsum, rowSums, colSums, aggregate, apply) but I can’t make sense of it all.
For example suppose I have a data frame people with the following columns
people <- read.table(
text =
"Name Height Weight
Mary 65 110
John 70 200
Jane 64 115",
header = TRUE
)
…
How do I get the sum of all the weights?
You can just use sum(people$Weight).
sum sums up a vector, and people$Weight retrieves the weight column from your data frame.
Note - you can get built-in help by using ?sum, ?colSums, etc. (by the way, colSums will give you the sum for each column).
To sum values in data.frame you first need to extract them as a vector.
There are several ways to do it:
# $ operator
x <- people$Weight
x
# [1] 110 200 115
Or using [, ] similar to a matrix:
x <- people[, 'Weight']
x
# [1] 110 200 115
Once you have the vector you can use any vector-to-scalar function to aggregate the result:
sum(people[, 'Weight'])
# [1] 425
If you have NA values in your data, you should specify na.rm parameter:
sum(people[, 'Weight'], na.rm = TRUE)
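Putting the pieces together, a self-contained sketch using the example data from the question might look like this. The variable names total_weight, numeric_cols and col_totals are only illustrative.

```r
# Rebuild the example data frame from the question
people <- read.table(text = "Name Height Weight
Mary 65 110
John 70 200
Jane 64 115", header = TRUE)

# Sum of a single column
total_weight <- sum(people$Weight, na.rm = TRUE)
total_weight
# [1] 425

# colSums() needs numeric input, so drop the character column first
numeric_cols <- sapply(people, is.numeric)
col_totals <- colSums(people[, numeric_cols])
col_totals
# Height Weight
#    199    425
```

Selecting the numeric columns first avoids the error colSums() raises when a character column such as Name is included.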
You can use the tidyverse package to solve it, and it would look like the following (which is more readable for me):
library(tidyverse)
people %>%
  summarise(sum(Weight, na.rm = TRUE))
When you have 'NA' values in the column, then
sum(as.numeric(JuneData1$Account.Balance), na.rm = TRUE)
To get the column order by column sum (numeric columns only, since colSums() fails on character columns):
order(colSums(people[sapply(people, is.numeric)]), decreasing = TRUE)
If there are 20+ columns and the first 4 should be left out:
order(colSums(people[, 5:25]), decreasing = TRUE)