I have some data from a school year that I am working with. The variables are SchoolYear, Aug, Sep, Oct, ..., May, June, where each month column holds the number of participants for that month. I need to sum the months until there is missing info, in this case identified by a 0. I have tried
yeardf <- within(yeardf, {
  Max_enroll <- cummax(Sep)
  Enroll_To_Date <- cumsum(Sep)
})
while putting a condition of Sep > 0 on the Enroll_To_Date line, but have not been successful. Two sample rows of the data look like this:
1973-74,0,0,4,2,14,26,22,8,0,99,
1974-75,0,0,4,26,10,23,10,14,0,91,
Set up your data as vectors and bind them into a matrix:
> row1 <- c("1973-74",0,0,4,2,14,26,22,8,0,99)
> row2 <- c("1974-75",0,0,4,26,10,23,10,14,0,91)
> df <- rbind(row1,row2)
Cumulative sums of row1 can be found like this (note the as.numeric: the year string makes the whole vector character), and it looks like you want to capture the 76 (where it hits the zero):
> (z <- cumsum(as.numeric(row1[2:length(row1)])))
[1] 0 0 4 6 20 46 68 76 76 175
Here's one way to get it. First find the spot where the running total stalls, using lead() from dplyr:
> library(dplyr)
> which(duplicated(lead(z)))
[1] 8
And then look up the cumulative sum at that position:
> z[which(duplicated(lead(z)))]
[1] 76
So here's the calculation for your row2:
> z <- cumsum(as.numeric(row2[2:length(row2)]))
> z[which(duplicated(lead(z)))]
[1] 87
And if you want to do a lot of them, like in the matrix df, chain the steps together in a function and use apply over the rows (MARGIN = 1):
> apply(df, 1, function(x) { z <- cumsum(as.numeric(x[2:length(x)])); z[which(duplicated(lead(z)))] })
row1 row2
76 87
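If you prefer staying in base R (avoiding dplyr's lead), the same totals can be computed by summing up to the first 0 that appears after the counts have started. A minimal sketch; sum_to_gap is a hypothetical helper, not part of the answer above:
sum_to_gap <- function(x) {
  started <- cumsum(x) > 0                # TRUE once the counts have begun
  stop_at <- which(x == 0 & started)[1]   # first 0 after the run starts
  if (is.na(stop_at)) sum(x) else sum(x[seq_len(stop_at - 1)])
}
sum_to_gap(c(0, 0, 4, 2, 14, 26, 22, 8, 0, 99))   # 76
sum_to_gap(c(0, 0, 4, 26, 10, 23, 10, 14, 0, 91)) # 87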
I am using data obtained from a spatially gridded system, for example a city divided up into equally spaced squares (e.g. 250m2 cells). Each cell has a unique column and row number, with corresponding numerical information about the area contained within that 250m2 square (say, temperature for each cell across an entire city). Within the entire gridded section (or the example city) I have various study sites, and I know which cell row and column each site is located within. I have a dataframe containing information on all cells within the city, but I want to subset this to contain only information from the cells where my study sites are located. I previously asked a question on this, 'Matching information from different dataframes and filtering out redundant columns'. Here is some example code again:
###Dataframe showing cell values for my own study sites
Site <- as.data.frame(c("Site.A","Site.B","Site.C"))
Row <- as.data.frame(c(1,2,3))
Column <- as.data.frame(c(5,4,3))
df1 <- cbind(Site,Row, Column)
colnames(df1) <- c("Site","Row","Column")
###Dataframe showing information from ALL cells
eg1 <- rbind(c(1,2,3,4,5),c(5,4,3,2,1)) ##Cell rows and columns
eg2 <- as.data.frame(matrix(sample(0:50, 15*10, replace=TRUE), ncol=5)) ##Numerical information
df2 <- rbind(eg1,eg2)
rownames(df2)[1:2] <- c("Row","Column")
From this, I used the answer from the previous question, which worked perfectly for the example data.
output <- df2[, (df2['Row', ] %in% df1$Row) & (df2['Column', ] %in% df1$Column)]
names(output) <- df1$Site[mapply(function(r, c){which(r == df1$Row & c == df1$Column)}, output[1,], output[2,])]
However, I cannot apply this to my own data and cannot figure out why.
EDIT: Initially, I thought there was a problem with naming the columns (i.e. the names function). But it appears there may be an issue with the 'output' line of code, whereby columns are being included from df2 that shouldn't be (i.e. the output contained columns of df2 whose column and row numbers were not specified within df1).
I have also tried:
output <- df2[, (df2['Row', ] == df1$Row) & (df2['Column', ] == df1$Column)]
But when using my own (seemingly comparable) data, I don't get information from all cells specified in the 'df1' equivalent (although, again, it works fine in the example data above). I can get my own data to work if I do each study site individually.
SiteA <- df2[, which(df2['Row', ] == 1 & df2['Column', ] == 5)]
SiteB <- df2[, which(df2['Row', ] == 2 & df2['Column', ] == 4)]
SiteC <- df2[, which(df2['Row', ] == 3 & df2['Column', ] == 3)]
But I have 1000s of sites and was hoping for a more succinct way. I am sure that I have maintained the same structure and double-checked spellings and variable names. Would anyone be able to shed any light on potential things I could be doing wrong? Or, failing that, an alternative method?
Apologies for not providing example code for the actual problem (I wish I could pinpoint what the specific problem is, but until then the original example is the best I can do)! Thank you.
The only apparent issue I can see is that the mapply call is not wrapped in unlist. mapply can return a list, which is not what you're after for subsetting purposes. So, try:
output <- df2[, (df2['Row', ] %in% df1$Row) & (df2['Column', ] %in% df1$Column)]
names(output) <- df1$Site[unlist(mapply(function(r, c){which(r == df1$Row & c == df1$Column)}, output[1,], output[2,]))]
Edit:
If the goal is to grab columns whose first 2 rows match the 2nd and 3rd elements of a given row in df1, you can try the following:
output_df <- Filter(function(x) !all(is.na(x)),
                    data.frame(do.call(cbind, apply(df2, 2, function(x) {
  ## Create a condition vector for an if-statement or for subsetting
  condition <- paste0(x[1:2], collapse = "") ==
    apply(df1[, c('Row', 'Column')], 1, function(y) paste0(y, collapse = ""))
  ## Return a column if it meets the condition (first 2 rows are matched in df1)
  if (sum(condition) != 0) {
    tempdf <- data.frame(x)
    names(tempdf) <- df1[condition, ]$Site[1]
    tempdf
  } else {
    ## If they are not matched, then return an empty column
    data.frame(rep(NA, nrow(df2)))
  }
}))))
It is quite a condensed piece of code, so I hope the following explanation will help clarify some things:
This goes through every column in df2 (with apply(df2, 2, FUN)) and checks whether its first 2 rows can be found in the 2nd and 3rd elements (Row and Column) of any row in df1. If the condition is met, the column is returned as a data.frame whose name is the value of Site in the matching row of df1; otherwise an empty column (all NA's) is returned. These columns are then bound together with do.call and cbind and coerced into a data.frame. Finally, the Filter function removes the columns whose values are all NA.
All that should give the following:
Site.A Site.B Site.C
1 2 3
5 4 3
40 42 33
13 47 25
23 0 34
2 41 17
10 29 38
43 27 8
31 1 25
31 40 31
34 12 43
43 30 46
46 49 25
45 7 17
2 13 38
28 12 12
16 19 15
39 28 30
41 24 30
10 20 42
11 4 8
33 40 41
34 26 48
2 29 13
38 0 27
38 34 13
30 29 28
47 2 49
22 10 49
45 37 30
29 31 4
25 24 31
I hope this helps.
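As a footnote, the pair-matching can also be written more compactly by building a single "Row,Column" key per column of df2 and matching it against the keys of df1. A sketch, assuming the df1/df2 from the question (keys1 and keys2 are hypothetical names):
keys2 <- paste(unlist(df2['Row', ]), unlist(df2['Column', ]), sep = ",")  # one key per df2 column
keys1 <- paste(df1$Row, df1$Column, sep = ",")                            # one key per site
output <- df2[, keys2 %in% keys1]
names(output) <- as.character(df1$Site[match(keys2[keys2 %in% keys1], keys1)])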
I have data in columns which I need to do calculations on. Is it possible to do this using previous row values without using a loop? E.g. if the value in the first column (ID) is 139, calculate the mean of the last 5 values in Data, and the percent change between the PF value 4 rows above (the first of the last 5 rows) and the PF value in the current row?
ID Data PF
135 5 123
136 4 141
137 5 124
138 6 200
139 1 310
140 2 141
141 4 141
So here in this dataset you would do:
Find 139 in ID column
Return the mean of the last 5 rows in Data (gives 4.2)
Return the percent change in PF from the value 4 rows above to the current value (gives 152%)
If I would do a loop it looks like this:
data$New_column <- NA
for (i in 1:nrow(data)) {
  if (data$ID[i] == "139" & i >= 5) {  # need i >= 5 so that row i - 4 exists
    data$New_column[i] <- data[i, "PF"] / data[i - 4, "PF"] - 1
  }
}
The problem is that the loop takes too long because there are too many data points. The ID 139 will appear several times in the dataset.
Many thanks.
Carlos
As pointed out by Tutuchacn and Sotos, use the zoo package to get the mean of Data over the last N rows (inclusive of the row) you are querying, assuming your data is in the data frame df:
library(zoo)
ind <- which(df$ID==139) ## this is the row you are querying
N <- 5 ## here, N is 5
res <- rollapply(df$Data, width=N, mean)[ind-(N-1)]
print(res)
## [1] 4.2
rollapply(..., mean) returns the rolling mean of the windowed data of width = N. Note that the index used to query the output from rollapply is offset by N - 1, because without padding rollapply returns only the n - N + 1 complete windows.
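If you would rather keep the rolling means aligned with the original rows (so ind can be used directly, without the N - 1 offset), rollapply's align and fill arguments handle the padding; a small variation on the same call:
rollapply(df$Data, width = N, mean, align = "right", fill = NA)[ind]
## [1] 4.2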
To get the percent performance from PF as you specified:
percent.performance <- function(x) {
z <- zoo(x) ## create a zoo series
lz <- lag(z, -4, na.pad = TRUE) ## the lagged version; lag.zoo needs a negative k to look back, and na.pad keeps the original index
return(z/lz - 1)
}
res <- as.numeric(percent.performance(df$PF)[ind])
print(res)
## [1] 1.520325
Here, we define a function percent.performance that returns what you want for all rows of df for which the computation makes sense. We then extract the row we want using ind and convert it to a number.
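As a sanity check, this particular value can also be computed without zoo, using plain vector indexing and the ind from above:
df$PF[ind] / df$PF[ind - 4] - 1
## [1] 1.520325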
Hope this helps.
Is that what you want?
ntest <- 139
sol <- sapply(5:nrow(df), function(ii) {
  tdf <- df[(ii - 4):ii, ]  # the current row and the 4 rows before it
  if (tdf[5, 1] == ntest)
    c(row = ii,
      average = mean(tdf[, "Data"]),
      performance = round(100 * (tdf[5, "PF"] / tdf[1, "PF"] - 1), 0))
})
sol <- sol[!sapply(sol, is.null)] # remove NULLs
sol
[[1]]
row average performance
5.0 4.2 152.0
This could be a decent start:
mytext = "ID,Data,PF
135,5,123
136,4,141
137,5,124
138,6,200
139,1,310
140,2,141
141,4,141"
mydf <- read.table(text=mytext, header = T, sep = ",")
do.call(rbind, lapply(mydf$ID[which(mydf$ID == 139):nrow(mydf)], function(x) {
  tempdf <- mydf[1:which(mydf$ID == x), ]
  data.frame(ID = x,
             Data = mean(tempdf$Data),
             PF = 100 * (tempdf[nrow(tempdf), "PF"] - tempdf[nrow(tempdf) - 4, "PF"]) /
               tempdf[nrow(tempdf) - 4, "PF"])
}))
ID Data PF
139 4.200000 152.03252
140 3.833333 0.00000
141 3.857143 13.70968
The idea here is: you take the IDs from 139 to the end and lapply over each of them, generating a temporary data.frame that includes all the rows up to and including that ID. From it you grab the mean of the Data column and the rate of change (what you call performance) of the PF column over the last 5 rows.
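For completeness, a compact base-R sketch that computes, at every row where ID is 139, the two quantities the question asks for (the mean of the last 5 Data values and the PF percent change). It assumes each such row has at least four earlier rows; idx, mean5 and perf are hypothetical names:
idx   <- which(mydf$ID == 139 & seq_len(nrow(mydf)) >= 5)
mean5 <- sapply(idx, function(i) mean(mydf$Data[(i - 4):i]))  # 4.2
perf  <- 100 * (mydf$PF[idx] / mydf$PF[idx - 4] - 1)          # 152.0325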
I have a vector of dollar values like this (vec):
[1] 460.08 3220.56 1506.20 1363.76 1838.00 1838.00 3684.94 2352.66 1606.02
[10] 1840.05 518.98 1603.53 1556.94 347.32 253.16 12.95 1828.81 1896.32
[19] 4962.60 426.33 3237.04 1601.40 2004.57 183.80 1570.75 3622.96 230.04
[28] 426.33 3237.04 1601.40 2004.57 183.80
If I have a charge that resulted from some sum of these numbers, how could I find it? For example, if the charge was 6747.81, then it must have resulted from 1506.20 + 3237.04 + 2004.57 (the 3rd, 29th and 31st vector elements). How could I solve for these vector elements given the sum?
I would imagine finding all possible sums is the answer then matching it to the vector elements that led to it.
I have played with using combn(vec, 3) to find all combinations of 3, but this doesn't quite give what I want.
You'll want to use colSums (or apply) after combn to get the sums.
set.seed(100)
# Generate fake data
vec <- rpois(10, 20)
# Get all combinations of 3 elements
combs <- combn(vec, 3)
# Find the resulting sums
out <- colSums(combs)
# Making up a value to search for
val <- vec[2]+vec[6]+vec[8]
# Find which combinations lead to that value
id <- which(out == val)
# Pull out those combinations
combs[,id]
Some output to show the results for this example
> vec
[1] 17 12 23 20 21 17 21 18 22 22
> val
[1] 47
> combs[,id]
[,1] [,2]
[1,] 17 12
[2,] 12 17
[3,] 18 18
Edit: Just saw that there isn't necessarily a restriction to use 3 items. One could generalize this by repeating the search over every possible subset size, though it would be fairly slow for even moderately sized problems.
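For what it's worth, here is a minimal sketch of that generalization: brute-force every subset size, with a small tolerance since dollar amounts in floating point rarely compare exactly (find_subsets is a hypothetical helper):
find_subsets <- function(vec, val, tol = 1e-9) {
  hits <- list()
  for (m in seq_along(vec)) {
    combs <- combn(vec, m)
    id <- which(abs(colSums(combs) - val) < tol)  # sums matching within tolerance
    if (length(id)) hits <- c(hits, lapply(id, function(j) combs[, j]))
  }
  hits
}
find_subsets(c(460.08, 3220.56, 1506.20), 3680.64)  # recovers 460.08 + 3220.56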
I have a data frame of vehicle trajectories. Here's a snapshot:
> head(df)
vehicle frame globalx class velocity lane
1 2 43 6451214 2 37.76 2
2 2 44 6451217 2 37.90 2
3 2 45 6451220 2 38.05 2
4 2 46 6451223 2 38.18 2
5 2 47 6451225 2 38.32 2
6 2 48 6451228 2 38.44 2
where:
vehicle = vehicle id (repeated, because the same vehicle is observed in several time frames)
frame = frame id of the time frames in which it was observed
globalx = x coordinate of the front center of the vehicle
class = type of vehicle (1 = motorcycle, 2 = car, 3 = truck)
velocity = speed of the vehicle in feet per second
lane = lane number (there are 6 lanes)
I think the following illustration will explain the problem better:
The 'frame' represents one tenth of a second, i.e. one frame is 0.1 seconds long. At frame 't' the vehicle has globalx coordinate x(t), and at frame 't-1' (0.1 seconds before) it was at x(t-1). The reference location is 'U' (globalx = 6451179.1116). I simply want a new column in df called 'u' which has 'yes' in the row where the vehicle's globalx coordinate is greater than the reference coordinate at 'U' AND its previous consecutive globalx coordinate was less than the reference coordinate at 'U'. This means that if df has 100 vehicles then there will be 100 'yes' values in the 'u' column, because every vehicle meets the above criteria exactly once. I have tried to do this with ifelse and also with a for loop, but neither works for me. The output should have one new column:
vehicle frame globalx class velocity lane u
I assume the data frame is sorted primarily by vehicle and secondarily by globalx. If it's not, you can sort it with:
idx <- with(df,order(vehicle,globalx))
df <- df[idx,]
Now, you can perform it with the following vectorized operations:
# example reference line
U <- 6451220
# adding the extra column
samecar <- duplicated(df[,"vehicle"])  # FALSE on the first row of each vehicle
passU <- c(FALSE, diff(sign(df[,"globalx"] - U + 1e-10)) > 0)  # TRUE where globalx - U flips from negative to positive
df[,"u"] <- ifelse(samecar & passU,"yes","no")
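A quick toy check of the crossing logic, using hypothetical globalx values that straddle U:
toy <- data.frame(vehicle = c(2, 2, 2), globalx = c(6451214, 6451217, 6451225))
U <- 6451220
samecar <- duplicated(toy$vehicle)
passU <- c(FALSE, diff(sign(toy$globalx - U + 1e-10)) > 0)
ifelse(samecar & passU, "yes", "no")
## [1] "no"  "no"  "yes"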
Here is my solution:
First create dummy data based on your provided data (I have saved it to data.txt on my desktop), and duplicate it so that there are two vehicles with identical data but different vehicle ids:
library(plyr)
df <- read.table("~/Desktop/data.txt",header=T)
df.B <- df; df.B$vehicle = 3 #For demonstration
df <- rbind(df,df.B); rm(df.B)
Then we can build a function to process:
mvt <- function(xref = NULL, ..., data = df) {
  if (!is.numeric(xref)) # Input must be numeric
    stop("xref must be numeric", call. = FALSE)
  xref <- xref[1]
  ## Split on vehicle and process.
  ddply(data, "vehicle", function(d) {
    L <- nrow(d)   # Number of rows
    d$u <- FALSE   # Default to not crossing
    ## One or more rows can be checked.
    if (L == 1)
      d$u <- (d$globalx > xref)
    else if (L > 1) {
      ix <- which(d$globalx[2:L] > xref & d$globalx[1:(L - 1)] <= xref)
      if (length(ix) > 0)
        d$u[ix + 1] <- TRUE
    }
    ## Done
    return(d)
  })
}
Which can be used in the following manner:
mvt(6451216)
mvt(6451217)
I have a situation like this: first of all I have a data.frame:
DF
COL1 COL2
29 1623
27 1600
30 1617
8 1620
Then, I have a vector like this:
[1] [2]
50 1602
What I need is to bind the first row of DF with the vector to have:
output
[1] [2]
29 1623
50 1602
On this output I would like to apply the prop.test using this code:
prop.test(output[,1], output[,2], correct=FALSE)
I need to do this on the entire DF, so:
first: bind first row of DF with the vector
second: prop.test
then again
first: bind second row of DF with the vector
second: prop.test
And so on, iteratively.
Any suggestion please?
thanks a lot
apply(DF, 1, function(x) prop.test(c(x[1], 50), c(x[2], 1602), correct = FALSE))
This runs one test per row, passing output[,1] (the successes, e.g. 29 and 50) as x and output[,2] (the totals, e.g. 1623 and 1602) as n, which matches your prop.test(output[,1], output[,2], correct = FALSE) call. Note that passing rbind(x, c(50, 1602)) as a matrix to prop.test would instead treat the second column as failures rather than totals.
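If you only need a particular piece of each result, such as the p-value, a sketch along the same lines (assuming the same DF and comparison vector as above):
sapply(seq_len(nrow(DF)), function(i)
  prop.test(c(DF[i, 1], 50), c(DF[i, 2], 1602), correct = FALSE)$p.value)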