Cumulative sum on preceding rows in the same column in R

data:
test <- structure(list(fgu_clms = c(14621389.763697, 145818119.352026,
21565415.2337476, 20120830.8221406, 12999772.0950838), loss_to_layer = c(0,
125818119.352026, 1565415.23374765, 120830.822140567, 0)), row.names = c(NA,
5L), class = "data.frame")
> test
   fgu_clms loss_to_layer
1  14621390           0.0
2 145818119   125818119.4
3  21565415     1565415.2
4  20120831      120830.8
5  12999772           0.0
I want to create a new column which uses a cumulative sum of the rows above it in that same new column. It's easier if I show how the calculation of the new column works row by row:
Row 1: first calculate the sum of the values in the rows above it in the same column. As this is row 1 there are no rows above, so this value is 0; call it cumsum_1. Then take the minimum of the row 1 value in the "loss_to_layer" column and the quantity "x2 - cumsum_1".
Row 2: calculate the cumulative sum of the new column's values above it, i.e. min(x2 - cumsum_1, row 1's loss_to_layer value); call this cumsum_2. Then repeat as above, i.e. take the minimum of the row 2 value of the loss_to_layer column and x2 - cumsum_2.
And so on.
In Excel, this would be done with MIN(B2, x2 - SUM(C$1:C1)) and dragging the formula down.
The results with x2 = 127,000,000 should be:
   fgu_clms loss_to_layer   new_col
1  14621390           0.0         0
2 145818119   125818119.4 125818119
3  21565415     1565415.2   1181881
4  20120831      120830.8         0
5  12999772           0.0         0
As you can see, "new_col" always sums back up to "x2", in this case 127,000,000.
I have tried:
test <- test %>% mutate(new_col = pmin(loss_to_layer,127e6-cumsum(lag(new_col,1,default=0))))
But I get an error because it cannot find the column new_col inside the lag() function (new_col does not exist yet when mutate() evaluates the expression).

library(dplyr)

test %>%
  mutate(
    # cumulative loss_to_layer of all preceding rows (0 for row 1)
    cumsum_1 = cumsum(lag(loss_to_layer, default = 0)),
    # cap each row at what is left of the 127m layer
    new_col = pmin(loss_to_layer, 127000000 - cumsum_1),
    # once the layer is exhausted the remainder goes negative, so floor at 0
    new_col = ifelse(new_col < 0, 0, new_col)
  ) %>%
  select(-cumsum_1)
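This works because, up to the row where the layer is exhausted, new_col equals loss_to_layer, so the two running totals agree; after that point every row is floored at 0 either way. If you prefer to express the recursion literally (each row subtracting the new column's own running total), here is a minimal base R sketch, assuming test and x2 = 127e6 as above:

x2 <- 127e6
new_col <- numeric(nrow(test))
remaining <- x2                                         # layer capacity left before each row
for (i in seq_len(nrow(test))) {
  new_col[i] <- min(test$loss_to_layer[i], remaining)   # Excel-style MIN(B_i, x2 - SUM above)
  remaining  <- remaining - new_col[i]                  # update the running total of new_col
}
test$new_col <- new_col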

Related

How to generate random data based on some criteria in R

I want to generate 300 random data points based on the following criteria:
Class  Value
0      1-8
1      9-11
2      12-14
3      15-16
4      17-20
Logic: when Class = 0, I want to get random data between 1-8; when Class = 1, I want to get random data between 9-11, and so on.
This gives me the following hypothetical table as an example:
Class Value
0 7
0 4
1 10
1 9
1 11
. .
. .
I want to be able to have both equal and unequal numbers of observations in the classes.
You could do:
df <- data.frame(Class = sample(0:4, 300, TRUE))
df$Value <- sapply(list(1:8, 9:11, 12:14, 15:16, 17:20)[df$Class + 1],
sample, size = 1)
This gives you a data frame with 300 rows and appropriate numbers for each class:
head(df)
#> Class Value
#> 1 0 3
#> 2 1 10
#> 3 4 19
#> 4 2 12
#> 5 4 19
#> 6 1 10
Created on 2022-12-30 with reprex v2.0.2
Here is a version that provides some additional flexibility, so that different probabilities can be used in the sampling, with as few hard-coded values as possible:
# load data.table
library(data.table)
# this is the original data
a = structure(list(Class = 0:4, value = c("1-8", "9-11", "12-14",
"15-16", "17-20")), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
# this is to replace "-" by ":", we will use that in a second
a[, value := gsub("\\-", ":", value)]
# this is a vector of EQUAL probabilities
probs = rep(1/a[, uniqueN(Class)], a[, uniqueN(Class)])
# This is a vector of UNEQUAL Probabilities. If wanted, it should be
# uncommented and adjusted manually
# probs = c(0.05, 0.1, 0.2, 0.4, 0.25)
# This is the number of Class samples wanted
numberOfSamples = 300
# This is the workhorse
a[sample(.N, numberOfSamples, TRUE, prob = probs), ][,
  smpl := apply(.SD, 1, function(x) sample(eval(parse(text = x)), 1)),
  .SDcols = "value"][,
  .(Class, smpl)]
What is good about this code?
If you change your classes, or the value ranges, the only change you need to be concerned about is the original data frame (a, as I called it)
If you want to use uneven probabilities for your sampling, you can set them and the code still runs.
If you want to take a smaller or larger sample, you don't have to edit your code, you only change the value of a variable.
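As a quick sanity check (a hypothetical addition, assuming the result of the chain above has been assigned to a variable named res), you can confirm that each Class only produced values from its own range:

# each Class should only contain values from its own 'value' range
res[, .(min_val = min(smpl), max_val = max(smpl), n = .N), by = Class][order(Class)]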

How to calculate percentage of dataframe column in R with condition?

I would like to find out how to calculate the percentage of a column based on a condition.
My table looks like this:
url | call_count
-------|-----------
bbc.com| 1
bbc.com| 1
bbc.com| 1
bbc.com| 1
ao.com | 0
ab.com | 2
I would like to group the table by the url column and calculate a new column called "percent_calling". This is based on a condition: where the call_count value is greater than 0, the row counts as calling, and I want to express that as a percentage of the whole column, i.e. % calling, since a value > 0 means they made a call.
I'm currently stuck on how to do this with dplyr; the closest I have got is the following:
df %>%
  group_by(url) %>%
  summarise(percent_calling = sum(call_count) / nrow(df))
but as you can see I cannot add a condition, i.e. call_count > 0.
Your data:
df <- data.frame(
  stringsAsFactors = FALSE,
  url = c("bbc.com", "bbc.com", "bbc.com", "bbc.com", "ao.com", "ab.com"),
  call_count = c(1, 1, 1, 1, 0, 2)
)
Does the following work for you?
df %>%
  group_by(url) %>%
  summarise(sum_calling = sum(call_count)) %>%
  mutate(percent_calling = sum_calling / sum(sum_calling) * 100) %>%
  select(-sum_calling)  # remove the sum if not required
  url     percent_calling
  <chr>             <dbl>
1 ab.com             33.3
2 ao.com              0
3 bbc.com            66.7
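If the intent is literally the share of rows with call_count > 0 (rather than the share of total calls), which is one possible reading of the question, here is a sketch of how that condition could be added; the helper column n_calling is just for illustration:

df %>%
  group_by(url) %>%
  summarise(n_calling = sum(call_count > 0)) %>%                  # rows that made at least one call
  mutate(percent_calling = n_calling / sum(n_calling) * 100) %>%  # share of all calling rows
  select(-n_calling)

With the example data this gives ab.com 20, ao.com 0 and bbc.com 80.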

Grouping data in R based on specific column values

I have a set of data in a csv file that I need to group based on transitions of one column. I'm new to R and I'm having trouble finding the right way to accomplish this.
Simplified version of data:
Time Phase Pressure Speed
   1     0    0.015     0
   2    25    0.015     0
   3    25    0.234     0
   4    25    0.111     0
   5     0    0.567     0
   6     0    0.876     0
   7    75    0.234     0
   8    75    0.542     0
   9    75    0.543     0
The phase stays in each state for longer than shown above, but I shortened everything to make it readable, and this pattern continues on and on. What I'm trying to do is calculate the mean of pressure and speed for each instance where the phase is non-zero. For example, in the output from the sample above there would be two lines: one with the average of the three lines where phase is 25, and one with the average of the three lines where phase is 75. It is possible for the same numeric value of phase to show up more than once, and I need to treat each of those occurrences separately. That is, in the case where phase is 0, 0, 25, 25, 25, 0, 0, 0, 25, 25, 0, I would need to record the first group and the second group of 25s as separate events, as well as any other non-zero groups.
What I've tried:
csv <- read.csv("c:\\test.csv")
ins <- subset(csv, csv$Phase == 25)
exs <- subset(csv, csv$Phase == 75)
mean(ins$Pressure)
mean(exs$Pressure)
This obviously returns the average of the entire file when phase is 25 and 75, but I need to somehow split it into groups using the trailing and leading 0s. Any help is appreciated.
Super quick:
df <- read.csv("your_file_name.csv")
cbind(aggregate(Pressure ~ Phase, df[df$Phase != 0,], FUN = mean),
aggregate(Speed ~ Phase, df[df$Phase != 0,], FUN = mean)[2])
The cbind is a bit of a shortcut; depending on the distribution of values of Phase, you may need to merge instead.
EDITED: Based on feedback from the asker, they really want to aggregate across runs of numbers (i.e. the first group of continuous 25s, then the second group of continuous 25s, and so on). Because of that, I suggest using rle, the run-length encoding function, to get a group number that you can use in the aggregate command.
I've modified the original data so that it contains two runs of 25, just for illustrative purposes, but it should work regardless. Using rle we get the encoded runs of data, and then we create a group number for each row: we take the vector of run lengths and use the rep function to repeat each group number by the corresponding length.
After this is done, we can use the same basic aggregation command again.
df_example <- data.frame(Time = 1:9,
                         Phase = c(0, 25, 25, 25, 0, 0, 25, 25, 0),
                         Pressure = c(0.015, 0.015, 0.234, 0.111, 0.567, 0.876, 0.234, 0.542, 0.543),
                         Speed = rep(x = 0, times = 9))
encoded_runs <- rle(x = df_example$Phase)
df_example$Group_No <- rep(x = 1:length(x = encoded_runs$lengths),
                           times = encoded_runs$lengths)
aggregate(x = df_example[df_example$Phase != 0, c("Pressure", "Speed")],
          by = list(Group_No = df_example[df_example$Phase != 0, "Group_No"],
                    Phase = df_example[df_example$Phase != 0, "Phase"]),
          FUN = mean)
  Group_No Phase Pressure Speed
1        2    25    0.120     0
2        4    25    0.388     0
Building upon comment by Solos, and answer by Cheesman,
try:
csv$block = paste(csv$Phase, cumsum(c(1, diff(csv$Phase) != 0)))
df_example = csv
aggregate(x = df_example[df_example$Phase != 0, c("Pressure", "Speed")],
          by = list(Phase = df_example[df_example$Phase != 0, "block"]),
          FUN = mean)
Actually, plyr would be handy here:
csv$block = paste(csv$Phase, cumsum(c(1, diff(csv$Phase) != 0)))
require(plyr)
ddply(csv[csv$Phase != 0, ], .(block), summarize,
      mean.Pressure = mean(Pressure), mean.Speed = mean(Speed))
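For completeness, the same run-grouping idea can also be written with dplyr; a sketch, assuming the data frame is called csv as above:

library(dplyr)
csv %>%
  mutate(block = cumsum(c(1, diff(Phase) != 0))) %>%   # new block each time Phase changes
  filter(Phase != 0) %>%
  group_by(block, Phase) %>%
  summarise(mean_Pressure = mean(Pressure),
            mean_Speed    = mean(Speed),
            .groups = "drop")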

Referring to Previous element in middle of column in R

I have a data frame which has a date column and a cumulative sum column. The cumulative sum data ends at a certain point and I want to use a formula to calculate it for the rest of the dates in the date column. What I am having trouble with is having the formula reference the previous cell in the column, starting from where the count reverts to 0 (where the historical cumulative sum ends).
Example below:
dates.1 <- c("2016-12-06","2016-12-07","2016-12-08","2016-12-09","2016-12-10","2016-12-11","2016-12-12","2016-12-13","2016-12-14")
count.1 <- c(1,3,8,10,0,0,0,0,0)
drift <- .0456
df.1 <- data.frame(cbind(dates.1,count.1))
for (i in df.1$count.1) {
  if (i == 0) {
    head(df.1$count.1, n = 1L) + exp(drift + (qnorm(runif(5, 0, 1))))
  }
}
I can't get the for loop to calculate it right.
The runif call uses n = 5 because that is the number of future entries I want to run the formula for.
The desired output would have something along the lines of
print(df.1$count.1)
[1] 1 3 8 10 12 13 16 17 18
The numbers after the 4th element are just random; the general idea is that the column would be overwritten, keeping the historical data and replacing the zeroes with the newly calculated entries.
Any ideas?
There is no need to loop. You can get what you want by first identifying the row index at which the cumsum stopped:
last.ind <- which(df.1$count.1==0)[1]-1
Then use this last.ind to restart the cumsum:
set.seed(123) ## for reproducibility
## simulation of rest of data to cumulatively sum
rest.of.data <- exp(drift+(qnorm(runif(5,0,1))))
df.1$count.1[last.ind:length(df.1$count.1)] <- cumsum(c(df.1$count.1[last.ind],rest.of.data))
print(df.1$count.1)
##[1] 1.00000 3.00000 8.00000 10.00000 10.59757 12.92824 13.75970 17.20085 22.17527
If you do want to use a loop, then you should do the following, which gives the same result but will be slower:
for (i in seq_len(length(df.1$count.1))) {
  if (df.1$count.1[i] == 0) {
    df.1$count.1[i] <- df.1$count.1[i-1] + exp(drift + (qnorm(runif(1, 0, 1))))
  }
}
Notes:
Loop over the indices of df.1$count.1, not its values.
If the value at the current index i is 0, write over that value with the sum of the previous value at i-1 and the data to be cumulatively summed.
Also, you should not use cbind to create your data.frame. Doing so in this case will result in df.1$count.1 being a factor instead of numeric. The data used is:
Data:
df.1 <- structure(list(dates.1 = structure(1:9, .Label = c("2016-12-06",
"2016-12-07", "2016-12-08", "2016-12-09", "2016-12-10", "2016-12-11",
"2016-12-12", "2016-12-13", "2016-12-14"), class = "factor"),
count.1 = c(1, 3, 8, 10, 0, 0, 0, 0, 0)), .Names = c("dates.1",
"count.1"), row.names = c(NA, -9L), class = "data.frame")
## dates.1 count.1
##1 2016-12-06 1
##2 2016-12-07 3
##3 2016-12-08 8
##4 2016-12-09 10
##5 2016-12-10 0
##6 2016-12-11 0
##7 2016-12-12 0
##8 2016-12-13 0
##9 2016-12-14 0
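As noted above, building the data frame directly (rather than via cbind) keeps count.1 numeric. A minimal sketch, using the dates.1 and count.1 vectors defined in the question:

# construct the data frame column by column so count.1 stays numeric
df.1 <- data.frame(dates.1 = dates.1, count.1 = count.1,
                   stringsAsFactors = FALSE)
str(df.1$count.1)  # num [1:9] 1 3 8 10 0 0 0 0 0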

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!
Thank you so much.
I have now implemented the answer below and it is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in a row if either one does not meet my quartile threshold (the 0.25 quantile). For example, if the quartile for O were 45000, then the row "42046.61549, 152.1321255" would be removed. Is this possible? If I read in both columns as a data frame, can I search each column separately? Or should I find the quartiles first and then feed those values into code that removes the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
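To apply the same idea to both of the asker's columns at once (keeping the pairs together), here is a sketch under the assumption that the data are read into a data frame called Values as in the question:

# keep a row only if BOTH columns are above their own lower quartile
keep <- Values$Abundance_O > quantile(Values$Abundance_O, 0.25) &
        Values$Abundance_S > quantile(Values$Abundance_S, 0.25)
Values_trimmed <- Values[keep, ]
cor(Values_trimmed$Abundance_O, Values_trimmed$Abundance_S)

The 0.25 threshold could be replaced by a value read from commandArgs(), in the same way the input file name is passed in.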
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which converts a numerical vector into its quantile groups. The parameter n determines the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
  return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)

dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
           A0        A1 Q0 Q1
1: 0.72121846 0.1908863  3  1
2: 0.70373594 0.4389152  3  2
3: 0.04604934 0.5301261  1  3
4: 0.10476643 0.1108709  1  1
5: 0.76907762 0.4913463  4  2
6: 0.38265848 0.9291649  2  4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]
