R dataframe aggregate conditionally based on column value, per quarter - r

I am running this command:
aggregated_quarterly_realised <- aggregate(merged_dataset$dependent_variable, list(merged_dataset$qy), mean)
which gives me the mean of the dependent variable per quarter. But I would like to get the sums separately for the cases where merged_dataset$dependent_variable equals 0, where it equals 1, and the overall total. Thus I would like to get three values per quarter. How can I do that?
EDIT:
> dput(head(merged_dataset$dependent_variable,10))
c(0, 0, 0, 0, 1, 0, 0, 0, 1, 0)
> dput(head(merged_dataset$qy,10))
structure(c(2008.25, 2008.25, 2008.50, 2008.75, 2009.25, 2009.50,
2008.25, 2008.25, 2008.25, 2008.25), class = "yearqtr")
> dput(head(merged_dataset$test,10))
c(7101273.07, 6855586.59, 800585.78, 8029604.44, 6707122.59,
646079.46, 14598.96, 1303978, 15244705, 322058.74)
What I want are the aggregated values of the test variable (merged_dataset$test) per quarter (the quarters are in merged_dataset$qy), separately for the rows where the dependent variable is 0, for the rows where it is 1, and the total.

Using data.table:
Code
dtf = dt[, .(Dep1sum = sum(test[depvar == 1]),
             Dep0sum = sum(test[depvar == 0]),
             Sum     = sum(test)),
         .(qy)]
Result
> dtf
qy Dep1sum Dep0sum Sum
1: 2008.25 15244705 15597495.4 30842200.4
2: 2008.50 0 800585.8 800585.8
3: 2008.75 0 8029604.4 8029604.4
4: 2009.25 6707123 0.0 6707122.6
5: 2009.50 0 646079.5 646079.5
Data
library(data.table)
dt = data.table(
  depvar = c(0, 0, 0, 0, 1, 0, 0, 0, 1, 0),
  qy = c(2008.25, 2008.25, 2008.50, 2008.75, 2009.25, 2009.50, 2008.25, 2008.25, 2008.25, 2008.25),
  test = c(7101273.07, 6855586.59, 800585.78, 8029604.44, 6707122.59, 646079.46, 14598.96, 1303978, 15244705, 322058.74)
)
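For comparison, a base R sketch that should give the same three sums per quarter with aggregate() (assuming the dt data above; the indicator terms simply zero out the rows that don't match):
aggregate(cbind(Dep1sum = test * (depvar == 1),
                Dep0sum = test * (depvar == 0),
                Sum     = test) ~ qy,
          data = dt, FUN = sum)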

Related

R Function for Identifying NA Values Incorrectly Entered as Zeroes

I have a data set with a number of columns like this:
pop <- data.table(group_id = c(1, 1, 1, 1, 1),
N = c(4588, 4589, 0, 4590, 4588),
N_surv_1 = c(0, 0, 4264, 4266, 4264),
N_surv_2 = c(3703, 0, 0, 3710, 3715),
N_surv_3 = c(NA, 3054, 3159, 0, 0) )
group_id N N_surv_1 N_surv_2 N_surv_3
1: 1 4588 0 3703 NA
2: 1 4589 0 0 3054
3: 1 0 4264 0 3159
4: 1 4590 4266 3710 0
5: 1 4588 4264 3715 0
The number of rows per group varies and each row represents a measurement for an entity specified by group_id for a particular point in time. I believe the data was incorrectly entered such that in some cases an NA value indicates a missing value, but in other cases a 0 was entered to indicate an NA value. There are legitimate zero values in the dataset, but I can identify the erroneous ones by looking for differences in column values above a particular threshold. For example
1
3
5
0
3
Might be a legit zero but
50
46
50
0
47
probably wouldn't be.
I think the best solution then would be to look for a string of zeroes followed or preceded by a large jump and relabel those zeroes as NA. How could I do something like this in R?
dcarlson's advice is spot on. You'll need to think harder on your definition of true zeros.
library(data.table)
pop <- data.table(group_id = c(1, 1, 1, 1, 1),
N = c(4588, 4589, 0, 4590, 4588),
N_surv_1 = c(0, 0, 4264, 4266, 4264),
N_surv_2 = c(3703, 0, 0, 3710, 3715),
N_surv_3 = c(NA, 3054, 3159, 0, 0) )
#Difference approach
pop[c(diff(N),NA)>100,N:=NA,by=group_id]
#This won't handle two zeros in a row that should both be NA.
pop <- data.table(group_id = c(1, 1, 1, 1, 1),
N = c(4588, 4589, 0, 4590, 4588),
N_surv_1 = c(0, 0, 4264, 4266, 4264),
N_surv_2 = c(3703, 0, 0, 3710, 3715),
N_surv_3 = c(NA, 3054, 3159, 0, 0) )
This will use a rolling mean with na.rm=TRUE and a specified cut-off value (here 100).
pop[frollmean(N,3,fill = NA,na.rm=TRUE,align = "left")-N>100,N:=NA]
pop[frollmean(N,3,fill = NA,na.rm=TRUE,align = "right")-N>100,N:=NA]
Need to use the right and left rolling mean to get 'em all.
#But this uses the column name an excessive number of times (6 times for one operation). You're likely to generate a typo and mess up your data.table if you do that.
#Let's start again.
pop <- data.table(group_id = c(1, 1, 1, 1, 1),
N = c(4588, 4589, 0, 4590, 4588),
N_surv_1 = c(0, 0, 4264, 4266, 4264),
N_surv_2 = c(3703, 0, 0, 3710, 3715),
N_surv_3 = c(NA, 3054, 3159, 0, 0) )
RollReplace <- function(dt, colName, maxDiffAllowed) {
  # Set values to NA when they fall more than maxDiffAllowed below the
  # left- or right-aligned rolling mean of the column (modifies dt by reference).
  dt[frollmean(get(colName), 3, fill = NA, na.rm = TRUE, align = "left") - get(colName) > maxDiffAllowed,
     eval(colName) := NA]
  dt[frollmean(get(colName), 3, fill = NA, na.rm = TRUE, align = "right") - get(colName) > maxDiffAllowed,
     eval(colName) := NA]
}
RollReplace(pop,colName='N',100)
RollReplace(pop,colName='N_surv_1',100)
Still, you want to be careful.
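If several columns need the same treatment, one option (a sketch, assuming the pop and RollReplace() defined above) is to loop over the column names; because data.table modifies by reference, the loop updates pop in place:
# Apply the rolling-mean replacement to each count column (column names taken from pop above)
for (cl in c("N", "N_surv_1", "N_surv_2", "N_surv_3")) {
  RollReplace(pop, colName = cl, 100)
}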

R calculate how many values used to calculate mean in aggregate function

I have a dataframe of daily observations dating from 1963-2022. I want to calculate the mean of the observations for each month. However, some months don't have data for every day and some only have a single data point. This skews some of the results. How do I calculate how many observations have been used to calculate the mean for a given month?
Head of Data frame
structure(list(prcp_amt = c(0, 1.8, 6.4, 5.1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 4.3, 0, 0, 0, 0, 4.6, 0, 0, 0, 0, 0, 0, 0, 0.3,
4.8, 0, 0, 4.1, 0, 0, 0, 0.3, 3.6, 6.6, 0, 0, 0, 0, 0, 0, 0.8,
0, 0, 0, 0, 0), ob_date = structure(c(-220838400, -220752000,
-220665600, -220579200, -220492800, -220406400, -220320000, -220233600,
-220147200, -220060800, -219974400, -219888000, -219801600, -219715200,
-219628800, -219542400, -219456000, -219369600, -219283200, -219196800,
-219110400, -219024000, -218937600, -218851200, -218764800, -218678400,
-218592000, -218505600, -218419200, -218332800, -218246400, -218160000,
-218073600, -217987200, -217900800, -217814400, -217728000, -217641600,
-217555200, -217468800, -217382400, -217296000, -217209600, -217123200,
-217036800, -216950400, -216864000, -216777600, -216691200, -216604800
), class = c("POSIXct", "POSIXt"), tzone = "GMT")), row.names = c(NA,
50L), class = "data.frame")
Existing code
# historic monthly rainfall
rainHist$month <- as.numeric(format(rainHist$ob_date, '%m'))
rainHist$year <- as.numeric(format(rainHist$ob_date, '%Y'))
rainHistMean <- aggregate(prcp_amt ~ month + year, rainHist, FUN=mean)
rainHistMean$day <- 01
library(dplyr)
rainHistMean <- rainHistMean %>%
  mutate(Date = paste(year, month, day, sep = '-'))
rainHistMean[['Date']] <- as.POSIXct(rainHistMean[['Date']],
format='%Y-%m-%d',
tz='GMT'
)
Updated Code
rainHist$month <- as.numeric(format(rainHist$ob_date, '%m'))
rainHist$year <- as.numeric(format(rainHist$ob_date, '%Y'))
rainHistMean <- aggregate(prcp_amt ~ month + year, rainHist, FUN=function(x) c(mean(x), length(x)))
names(rainHistMean) <- c('month', 'year', 'prcp_amt', 'n')
How do I get 4 ordinary columns instead of 3 columns where the last one is a matrix?
Solution
rainHist$month <- as.numeric(format(rainHist$ob_date, '%m'))
rainHist$year <- as.numeric(format(rainHist$ob_date, '%Y'))
rainHistMean <- aggregate(prcp_amt ~ month + year, rainHist, FUN=function(x) c(mean(x), length(x)))
rainHistMean <- data.frame(rainHistMean[1:2], rainHistMean[[3]])
names(rainHistMean) <- c('month', 'year', 'prcp_amt', 'n')
(aggregate() stores the two values returned by the anonymous function in a single matrix column; rainHistMean[[3]] extracts that matrix so data.frame() can expand it into two ordinary columns.)
There may be more elegant solutions, but you can use dplyr to group by month and year (the month() and year() helpers here come from lubridate), then get the count and mean in summarize:
library(dplyr)
library(lubridate)
df %>%
  group_by(month(ob_date), year(ob_date)) %>%
  summarize(mean_prcp = mean(prcp_amt),
            count = n())
Output:
# # Groups: month(ob_date) [2]
# `month(ob_date)` `year(ob_date)` mean_prcp count
# <dbl> <dbl> <dbl> <int>
# 1 1 1963 0.91 30
# 2 2 1963 0.77 20
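If you would rather not load lubridate, the same grouping works with base format() inside group_by(); a sketch, assuming the data frame is called rainHist as in the question:
library(dplyr)
rainHist %>%
  group_by(month = format(ob_date, '%m'), year = format(ob_date, '%Y')) %>%
  summarize(mean_prcp = mean(prcp_amt),
            count = n())
(Note that month and year come out as character strings here rather than numbers.)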

Getting the last number in a series before lowering it below threshold in R

I have the following data:
dat<- structure(list(Pentad = 1:73, RR = c(0, 0.014285714, 0, 0.088571429,
0.071428571, 0, 0.065714286, 0.028571429, 0.094285714, 0.011428571,
0, 0, 0, 0, 0, 0, 0.04, 0, 0.814285714, 0.285714286, 1.14, 5.334285714,
2.351428571, 1.985714286, 1.494285714, 2.005714286, 20.04857143,
25.00857143, 16.32, 11.06857143, 8.965714286, 3.985714286, 5.202857143,
7.802857143, 4.451428571, 9.22, 32.04857143, 19.50571429, 3.148571429,
2.434285714, 9.057142857, 28.70857143, 34.15142857, 33.02571429,
46.50571429, 70.61714286, 3.168571429, 1.928571429, 7.031428571,
0.902857143, 5.377142857, 11.35714286, 15.04571429, 11.66285714,
21.24, 11.43714286, 11.69428571, 2.977142857, 4.337142857, 0.871428571,
1.391428571, 0.871428571, 1.145714286, 2.317142857, 0.182857143,
0.282857143, 0.348571429, 0, 0.345714286, 0.142857143, 0.18,
4.894285714, 0.037142857), YY = c(0.577142857, 0, 1.282857143,
1.445714286, 0.111428571, 0.36, 0, 0, 0, 1, 0.011428571, 0.008571429,
0.305714286, 0, 0, 0, 0, 0.8, 0.062857143, 0, 0, 0, 0, 0.013333333,
0.043333333, 1.486666667, 0, 2.486666667, 1.943333333, 0.773333333,
8.106666667, 7.733333333, 0.5, 4.356666667, 2.66, 6.626666667,
4.404285714, 7.977142857, 12.94285714, 18.49428571, 7.357142857,
11.08285714, 9.034285714, 14.29142857, 34.61428571, 45.30285714,
6.66, 6.702857143, 5.962857143, 14.85428571, 2.1, 2.837142857,
7.391428571, 32.03714286, 9.005714286, 3.525714286, 12.32, 2.32,
7.994285714, 6.565714286, 4.771428571, 2.354285714, 0.005714286,
2.508571429, 0.817142857, 2.885714286, 0.897142857, 0, 0, 0,
0, 0.145714286, 0.434285714)), class = "data.frame", row.names = c(NA,
-73L))
There are three columns: Pentad, RR, and YY.
I would like to get the following:
(a) Get the first pentad when the precipitation exceeds the "annual mean" in "at least three consecutive pentads"
(b) Get the last pentad when the precipitation exceeds the "annual mean" in at least three consecutive pentads BEFORE it drops back below the annual mean.
I was able to do (a) using the following script:
first_exceed_seq <- function(x, thresh = mean(x), len = 3)
{
# Logical vector, does x exceed the threshold
exceed_thresh <- x > thresh
# Indices of transition points; where exceed_thresh[i - 1] != exceed_thresh[i]
transition <- which(diff(c(0, exceed_thresh)) != 0)
# Reference index, grouping observations after each transition
index <- vector("numeric", length(x))
index[transition] <- 1
index <- cumsum(index)
# Break x into groups following the transitions
exceed_list <- split(exceed_thresh, index)
# Get the number of values exceeded in each index period
num_exceed <- vapply(exceed_list, sum, numeric(1))
# Get the starting index of the first run in which at least len values exceed thresh
transition[as.numeric(names(which(num_exceed >= len))[1])]
}
first_exceed_seq(dat$RR)
[Plot of the RR time series omitted]
The correct answer in (a) is 27.
I would like to ask how can I do this for (b). The correct answer for (b) should be 57.
I'd appreciate any help with this in R.
I don't know if I got your problem right.
This is what I tried:
library(dplyr)
dat %>%
  mutate(
    annual_mean = mean(RR),
    exceed_thresh = RR > annual_mean,
    lag1 = lag(exceed_thresh, 1),
    lag2 = lag(exceed_thresh, 2),
    pick_3 = ifelse(exceed_thresh & lag1 & lag2, RR, NA)
  )
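For (b), one possibility is to reuse the same run-based idea with rle(): find the runs of pentads above the annual mean and take the end of the last run of length >= len. A sketch (last_exceed_seq is a new name introduced here, mirroring first_exceed_seq above):
last_exceed_seq <- function(x, thresh = mean(x), len = 3)
{
  # Runs of consecutive values above / not above the threshold
  r <- rle(x > thresh)
  # End position of each run
  ends <- cumsum(r$lengths)
  # Runs that are above the threshold and at least len long
  ok <- which(r$values & r$lengths >= len)
  if (length(ok) == 0) return(NA_integer_)
  # Last index of the last qualifying run
  ends[max(ok)]
}
last_exceed_seq(dat$RR)
With the dat above this should return 57, the last pentad of the final stretch of at least three consecutive pentads above the annual mean.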

Is there a way to count occurrences of a specific value for unique columns in a dataframe in R?

I am relatively new to R and have a dataframe (cn_data2) with several duplicated columns. It looks something like this:
Gene breast_cancer breast_cancer breast_cancer lung_cancer lung_cancer
myc 1 0 1 1 2
ARID1A 0 2 1 1 0
Essentially, the rows are genes and the columns are different types of cancers. What I want is to find, for each gene, the number of times a value (0, 1, or 2) occurs for each unique cancer type.
I have tried several things but haven't been able to achieve what I want. For example, cn_data2$count1 <- rowSums(cn_data == '1') gives me a column with the number of "1"s for each gene, but what I want is the number of "1"s for each individual disease.
Hope my question is clear! I appreciate any help, thank you!
structure(list(gene1 = structure(1:6, .Label = c("ACAP3", "ACTRT2",
"AGRN", "ANKRD65", "ATAD3A", "ATAD3B"), class = "factor"), glioblastoma_multiforme_Primary_Tumor = c(0,
0, 0, 0, 0, 0), glioblastoma_multiforme_Primary_Tumor.1 = c(-1,
-1, -1, -1, -1, -1), glioblastoma_multiforme_Primary_Tumor.2 = c(0,
0, 0, 0, 0, 0), glioblastoma_multiforme_Primary_Tumor.3 = c(2,
2, 2, 2, 2, 2), glioblastoma_multiforme_Primary_Tumor.4 = c(0,
0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, 6L))
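One way to get per-disease counts (a minimal base R sketch, assuming the dput above is stored as cn_data2 and that the duplicated columns only differ by the .1, .2, ... suffixes R appends to repeated names):
# Cancer type for each value column: strip the trailing ".1", ".2", ... suffixes
types <- sub("\\.\\d+$", "", names(cn_data2)[-1])
# For each gene (row), count how often `value` occurs within each cancer type
count_value <- function(value) {
  sapply(unique(types), function(ty) {
    rowSums(cn_data2[-1][, types == ty, drop = FALSE] == value)
  })
}
data.frame(gene = cn_data2$gene1, count_value(0))
The same call with count_value(1) or count_value(2) gives the counts for the other values.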

How to change value of row if doesn't match the row above in a function applied to a dataframe

I have a dataframe as follows
structure(list(chr = 1, leftPos = 240000, OC_AH_026C.res = 0,
OC_AH_026C.1.res = 0, OC_AH_026C.2.res = 0, OC_AH_026T.res = 0,
OC_AH_058T.res = 0, OC_AH_084C.res = 0, OC_AH_084T.res = 0,
OC_AH_086C.res = 0, OC_AH_086C.1.res = 0, OC_AH_086C.2.res = 0,
OC_AH_086C.3.res = 0, OC_AH_086T.res = 0, OC_AH_088C.res = 0,
OC_AH_088T.res = 0, OC_AH_094C.res = 0, OC_AH_094C.1.res = 0,
OC_AH_094C.2.res = 0, OC_AH_094C.3.res = 0, OC_AH_094C.4.res = 0,
OC_AH_094C.5.res = 0, OC_AH_094C.6.res = 0, OC_AH_094C.7.res = 0,
OC_AH_094T.res = 0, OC_AH_096C.res = 0, OC_AH_096T.res = 0,
OC_AH_100C.res = 0, OC_AH_100C.1.res = 0, OC_AH_100T.res = 0,
OC_AH_127C.res = 0, OC_AH_127T.res = 0, OC_AH_133C.res = 0,
OC_AH_133T.res = 0, OC_ED_008C.res = 0, OC_ED_008C.1.res = 0,
OC_ED_008C.2.res = 0, OC_ED_008C.3.res = 0, OC_ED_008T.res = 0,
OC_ED_016C.res = 0, OC_ED_016T.res = 0, OC_ED_031C.res = 0,
OC_ED_031T.res = 0, OC_ED_036C.res = 0, OC_ED_036T.res = 0,
OC_GS_001C.res = 0, OC_GS_001T.res = 0, OC_QE_062C.res = 0,
OC_QE_062T.res = 0, OC_RS_010C.res = 0, OC_RS_010T.res = 0,
OC_RS_027C.res = 0, OC_RS_027C.1.res = 0, OC_RS_027C.2.res = 0,
OC_RS_027T.res = 0, OC_SH_051C.res = 0, OC_SH_051T.res = 0,
OC_ST_014C.res = 0, OC_ST_014C.1.res = 0, OC_ST_014T.res = 0,
OC_ST_016T.res = 0, OC_ST_020C.res = 0, OC_ST_020T.res = 0,
OC_ST_024C.res = 0, OC_ST_024T.res = 0, OC_ST_033C.res = 0,
OC_ST_033T.res = 0, OC_ST_034C.res = 0, OC_ST_034C.1.res = 0,
OC_ST_034C.2.res = 0, OC_ST_034T.res = 0, OC_ST_035C.res = 0,
OC_ST_035T.res = 0, OC_ST_036C.res = 0, OC_ST_036T.res = 0,
OC_ST_037T.res = 0, OC_ST_040C.res = 0, OC_ST_040T.res = 0,
OC_WG_001T.res = 0, OC_WG_002C.res = 0, OC_WG_002T.res = 0,
OC_WG_005C.res = 0, OC_WG_005T.res = 0, OC_WG_006C.res = 0,
OC_WG_006T.res = 0, OC_WG_009T.res = 0, OC_WG_019C.res = 0,
OC_WG_019T.res = 0, Means.res = 0, sd.res = 0, ind = 1L), .Names = c("chr",
"leftPos", "OC_AH_026C.res", "OC_AH_026C.1.res", "OC_AH_026C.2.res",
"OC_AH_026T.res", "OC_AH_058T.res", "OC_AH_084C.res", "OC_AH_084T.res",
"OC_AH_086C.res", "OC_AH_086C.1.res", "OC_AH_086C.2.res", "OC_AH_086C.3.res",
"OC_AH_086T.res", "OC_AH_088C.res", "OC_AH_088T.res", "OC_AH_094C.res",
"OC_AH_094C.1.res", "OC_AH_094C.2.res", "OC_AH_094C.3.res", "OC_AH_094C.4.res",
"OC_AH_094C.5.res", "OC_AH_094C.6.res", "OC_AH_094C.7.res", "OC_AH_094T.res",
"OC_AH_096C.res", "OC_AH_096T.res", "OC_AH_100C.res", "OC_AH_100C.1.res",
"OC_AH_100T.res", "OC_AH_127C.res", "OC_AH_127T.res", "OC_AH_133C.res",
"OC_AH_133T.res", "OC_ED_008C.res", "OC_ED_008C.1.res", "OC_ED_008C.2.res",
"OC_ED_008C.3.res", "OC_ED_008T.res", "OC_ED_016C.res", "OC_ED_016T.res",
"OC_ED_031C.res", "OC_ED_031T.res", "OC_ED_036C.res", "OC_ED_036T.res",
"OC_GS_001C.res", "OC_GS_001T.res", "OC_QE_062C.res", "OC_QE_062T.res",
"OC_RS_010C.res", "OC_RS_010T.res", "OC_RS_027C.res", "OC_RS_027C.1.res",
"OC_RS_027C.2.res", "OC_RS_027T.res", "OC_SH_051C.res", "OC_SH_051T.res",
"OC_ST_014C.res", "OC_ST_014C.1.res", "OC_ST_014T.res", "OC_ST_016T.res",
"OC_ST_020C.res", "OC_ST_020T.res", "OC_ST_024C.res", "OC_ST_024T.res",
"OC_ST_033C.res", "OC_ST_033T.res", "OC_ST_034C.res", "OC_ST_034C.1.res",
"OC_ST_034C.2.res", "OC_ST_034T.res", "OC_ST_035C.res", "OC_ST_035T.res",
"OC_ST_036C.res", "OC_ST_036T.res", "OC_ST_037T.res", "OC_ST_040C.res",
"OC_ST_040T.res", "OC_WG_001T.res", "OC_WG_002C.res", "OC_WG_002T.res",
"OC_WG_005C.res", "OC_WG_005T.res", "OC_WG_006C.res", "OC_WG_006T.res",
"OC_WG_009T.res", "OC_WG_019C.res", "OC_WG_019T.res", "Means.res",
"sd.res", "ind"), class = c("data.table", "data.frame"), row.names = c(NA,
-1L))
For each column I would like to keep a value as it is if it agrees with the value in the row above for the same chr, but only if that value is 1 or -1. If there is no agreement I'd like to convert the value to zero.
For example (not using the dput above)
chr leftPos OC_030_ST.res
1 4324 0
1 23433 1
1 34436 1
1 64755 1
3 234 1
3 354 0
4 1666 0
4 4565 0
5 34777 1
7 2345 1
7 4567 1
should become
chr leftPos OC_030_ST.res
1 4324 0
1 23433 1
1 34436 1
1 64755 1
3 234 0
3 354 0
4 1666 0
4 4565 0
5 34777 0
7 2345 1
7 4567 1
I had a dataframe (called Final) once upon a time that had a column called Def that contained all the res values in one column so I could do something like
ContZ<-setDT(Final)[,ind:=rleid(Def)][, if(.N>1) .SD, .(chr, ind)][, ind:=NULL][]
but assuming I'm going to need apply I'm not sure how to use this.
I tried:
MeOut<-lapply(df_list2res,function (col){
ContZ<-setDT(df_list2res)[,ind:=rleid(col)][, if(.N>1) .SD, .(chr, ind)][, ind:=NULL][]
})
but I get the error when I try to View(MeOut) that
Error in View : arguments imply differing number of rows:
I suspect this is because I have been getting rid of rows rather than replacing values although I can't be sure....
Here's your simple example in dplyr.
library(dplyr)
#create a simple version of your df
df<- data.frame(c(1,1,1,1,3,3,4,4,5,7,7),c(0,1,1,1,1,0,0,0,1,1,1))
names(df) <- c("chr","OC_030_ST.res")
df2 <- df %>%
  mutate(last = lag(chr)) %>%
  mutate(OC_030_ST.res = ifelse(chr == last | is.na(last),
                                ifelse(OC_030_ST.res == 1 | OC_030_ST.res == -1, OC_030_ST.res, 0),
                                0)) %>%
  select(-last)
df2
The logic here is that if the chr value of the current row equals the chr of the previous row and the 'OC_030_ST.res' value is 1 or -1, the value is retained. In all other cases, the value is reset to 0. Please let me know if this isn't the logic you intended.
Note that the first row is a special case (there is no previous row to lag against), hence the is.na(last) check catches this.
edit: I realised you may want to apply this to multiple columns in your data frame. The below will let you do this
df <- data.frame(c(1,1,1,1,3,3,4,4,5,7,7),
                 c(0,1,1,1,1,0,0,0,1,1,1),
                 c(0,1,1,1,1,0,0,0,1,1,1))
names(df) <- c("chr", "OC_030_ST.res", "OC_031_ST.res")
df$count <- 1:nrow(df)
# Zero out 1/-1 values (applied only to the rows flagged below)
make0 <- function(x) {ifelse(x == 1 | x == -1, 0, x)}
dftemp <- df %>%
  mutate(last = lag(chr)) %>%
  mutate(flag = ifelse(chr == last | is.na(last), 0, 1)) %>%
  dplyr::filter(flag == 1) %>%
  # the column range below will need to be changed to the columns you actually need to change
  mutate(across(2:3, make0)) %>%
  select(-c(flag, last))
dftemp
# update the original dataframe with the modified values
df[match(dftemp$count, df$count), ] <- dftemp
df <- subset(df, select = -count)
df
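Since the original attempt used data.table, the same logic can also be sketched with data.table::shift() (the toy data from above is rebuilt here, and res_cols is a helper name introduced for this example):
library(data.table)
dt <- data.table(chr = c(1,1,1,1,3,3,4,4,5,7,7),
                 OC_030_ST.res = c(0,1,1,1,1,0,0,0,1,1,1),
                 OC_031_ST.res = c(0,1,1,1,1,0,0,0,1,1,1))
res_cols <- c("OC_030_ST.res", "OC_031_ST.res")
# TRUE when the row has the same chr as the row above (the first row counts as a match)
dt[, same_chr := chr == shift(chr) | is.na(shift(chr))]
# Keep 1/-1 values only where same_chr is TRUE, otherwise set them to 0
dt[, (res_cols) := lapply(.SD, function(x) ifelse(same_chr & x %in% c(1, -1), x, 0)),
   .SDcols = res_cols]
dt[, same_chr := NULL]
dt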
