Loop same actions in R - r

Have an issue here.
I want to loop my operations in R, however, do not know how to make this properly and efficiently.
I have several different sized datasets and performing the same block of code each time is time-consuming.
Here is the code I need to apply to each of the datasets and write the data or the output from a model into the datasets with different names.
##########################################################################################################################
#the combined list of separate data frames where the last letter is changing A, B, C...
z <- list(Data_A, Data_B, Data_C)
#need to loop these operations performed by using data from datasets. Here is an example by using data from Data_A dataset.
# TFP estimation by using ACF method
ACF_A <- prodest::prodestACF(Data_A$turn, fX = Data_A$cogs, sX = Data_A$tfa, pX = Data_A$cogs, idvar = Data_A$ID, timevar = Data_A$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
omegaACF_A <- prodest::omega(ACF_A)
Data_A$omegaACF_A <- prodest::omega(ACF_A)
#########################################################################################################################
# Growth variables
Data_A <- Data_A %>%
arrange(ID, Year) %>%
group_by(ID) %>%
mutate(domegaACF_A = omegaACF_A - dplyr::lag(omegaACF_A),
debt = LOAN + LTD,
ddebt = debt - dplyr::lag(debt),
dsales = SALE - dplyr::lag(SALE)) %>%
ungroup
# Panel data frame
PData_A <- pdata.frame(Data_A, index = c("ID","Year"))
# Within estimator
within_2way_A <- plm(domegaACF_A ~ dplyr::lag(domegaACF_A, 1) + dplyr::lag(domegaACF_A, 2) + ddebt + lag(ff1, 1) + ddebt:lag(ff1, 1) + log(Age) + ta + dsales,
data = PData_A, effect = "twoways", model ="within", index = c("ID", "Year"))
The main problem is that I do not know how to store the data in separate datasets with according names. For example, in the block of the following code, _A should be changing to _B, _C according to the dataset that is used.
ACF_A <- prodest::prodestACF(Data_A$turn, fX = Data_A$cogs, sX = Data_A$tfa, pX = Data_A$cogs, idvar = Data_A$ID, timevar = Data_A$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
omegaACF_A <- prodest::omega(ACF_A)
Data_A$omegaACF_A <- prodest::omega(ACF_A)
I know there are lapply and for loops but I do not know how to use them with changing names of storing variables:
z <- list (df1, df2, df3)
for (i in z){
ACF_[1 or 2 or 3] <- prodest::prodestACF(i$turn, fX = i$cogs, sX = i$tfa, pX = i$cogs, idvar = i$ID, timevar = i$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
omegaACF_[1 or 2 or 3] <- prodest::omega(ACF_[1 or 2 or 3])
Data_[]$omegaACF_[1 or 2 or 3] <- prodest::omega(ACF_[1 or 2 or 3])
{
UPD: Here are several datasets: https://drive.google.com/drive/folders/1gBV2ZkywW6JqDjRICafCwtYhh2DHWaUq?usp=sharing
UPD2:
Data_A
turn cogs tfa SALE
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
Data_B
turn cogs tfa SALE
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
After running the loop I need:
ACF_A, ACF_B, etc. storage variable, where the results of the estimations of prodest function will be stored
omegaACF_A, omegaACF_B, etc. storage where omega variable from prodest will be stored
omegaACF_A, omegaACF_B results of estimations should be added to Data_A, Data_B datasets accordingly as a new variable.
After that, in Data_A, Data_B datasets, growth variables should be created
The plm regression should be stored in within_2way_A, within_2way_B accordingly
So in the end, I need:
Data_A
turn cogs tfa SALE omegaACF_A domegaACF_A debt ddebt dsales
1 1 1 1 0.1 NA 1 NA NA
2 2 2 2 0.3 0.2 2 1 1
3 3 3 3 0.6 0.3 3 1 1
4 4 4 4 0.9 0.3 4 1 1
Data_B
turn cogs tfa SALE omegaACF_B domegaACF_B debt ddebt dsales
5 5 5 5 1.1 NA 5 NA NA
6 6 6 6 1.5 0.4 6 1 1
7 7 7 7 1.7 0.2 7 1 1
8 8 8 8 2.0 0.3 8 1 1

One approach is to separate the ACF estimation and omega calculation from the summary creation with different lapply() commands. Since you did not supply any example data, it's a blind shot, but try the following. Note that I assumed that every dataset has the same column names! In case it doesn't solve your problem I will remove my answer.
data <- list(Data_A, Data_B, Data_C)
Estimates <- lapply(data, function(x){
prodest::prodestACF(x$turn, fX = x$cogs, sX = x$tfa, pX = x$cogs, idvar = x$ID, timevar = x$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
}
Summaries_estimates <- lapply(Estimates, summary)
Omegas <- lapply(Estimates, function(x) prodest::omega(x))
Summaries_omega <- lapply(Omegas, summary)
Alternative using loops
Since you asked, it is also possible to define a loop that loops everything together though this is usually much slower. For this, we have to define empty lists that carry the results of ACF etc. and loop over the lists of data.frames that we already created.
data <- list(Data_A, Data_B, Data_C)
Estimates <- list()
Summaries_estimates <- list()
Omegas <- list()
Summaries_omegas <- list()
for(i in 1:(length(data))){
Estimates[[i]] <- prodest::prodestACF(data[[i]]$turn, fX = data[[i]]$cogs, sX = data[[i]]$tfa, pX = data[[i]]$cogs, idvar = data[[i]]$ID, timevar = data[[i]]$Year,
R = 100, cX = NULL, opt = 'DEoptim', theta0 = NULL, cluster = NULL)
}
Summaries_estimates[[i]] <- summary(Estimates[[i]])
Omegas[[i]] <- prodest::omega(Estimates[[i]]))
Summaries_omega[[i]] <- summary(Omegas[[i]])
}

Related

How to remove loop from the following model function?

I'm rewriting some code, and I am currently creating a small population model. I have re-created the current model function below from a book, it's a simple population model based on a few parameters. I've left them at default and returned the data frame. Everything works well. However, I was wondering whether I could somehow exclude the loop from the function.
I know R is great because of vectorized calculation, but I'm not sure in this case whether it would be possible. I thought of using something like lead/lag to do it, but would this work? Perhaps not as things need to be calculated sequentially?
# Nt numbers at start of time t
# Ct = removed at the end of time t
# Nt0 = numbers at time 0
# r = intrinsic rate of population growth
# K = carrying capacity
mod_fun = function (r = 0.5, K = 1000, N0 = 50, Ct = 0, Yrs = 10, p = 1)
{
# sets years to year value plus 1
yr1 <- Yrs + 1
# creates sequence of length years from year 1 to Yrs value +!
years <- seq(1, yr1, 1)
# uses years length to create a vector of length Yrs + 1
pop <- numeric(yr1)
# sets population at time 0
pop[1] <- N0
# creates a loop that calculates model for each year after first year
for (i in 2:yr1) {
# sets starting value of population for step to one calculated previous step
# thus Nt is always the previous step pop size
Nt <- pop[i - 1]
pop[i] <- max((Nt + (r * Nt/p) * (1 - (Nt/K)^p) -
Ct), 0)
}
# sets pop2 to original pop length
pop2 <- pop[2:yr1]
# binds together years (sequence from 1 to length Yrs),
# pop which is created in loop and is the population at the start of step t
# pop2 which is the population at the end of step t
out <- cbind(year = years, nt = pop, nt1 = c(pop2, NA))
# sets row names to
rownames(out) <- years
out <- out[-yr1, ]
#returns data.frame
return(out)
}
result = mod_fun()
This is what the output looks like. Basically rowwise starting from row 1 given the starting population of 50 the loop calculates nt1 then sets next nt row to lag(nt1) and then things continue in a similar fashion.
result
#> year nt nt1
#> 1 1 50.0000 73.7500
#> 2 2 73.7500 107.9055
#> 3 3 107.9055 156.0364
#> 4 4 156.0364 221.8809
#> 5 5 221.8809 308.2058
#> 6 6 308.2058 414.8133
#> 7 7 414.8133 536.1849
#> 8 8 536.1849 660.5303
#> 9 9 660.5303 772.6453
#> 10 10 772.6453 860.4776
Created on 2022-04-24 by the reprex package (v2.0.1)
mod_fun = function (r = 0.5, K = 1000, N0 = 50, Ct = 0, Yrs = 10, p = 1)
{
years <- seq_len(Yrs)
pop <- Reduce(function(Nt, y)max((Nt + (r * Nt/p) * (1 - (Nt/K)^p) - Ct), 0),
years, init = N0, accumulate = TRUE)
data.frame(year = years, nt = head(pop,-1), nt1 = pop[-1])
}
year nt nt1
1 1 50.0000 73.7500
2 2 73.7500 107.9055
3 3 107.9055 156.0364
4 4 156.0364 221.8809
5 5 221.8809 308.2058
6 6 308.2058 414.8133
7 7 414.8133 536.1849
8 8 536.1849 660.5303
9 9 660.5303 772.6453
10 10 772.6453 860.4776

How to use mice for multiple imputation of missing values in longitudinal data?

I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variable. I am not sure how to setup this MI process using mice. I like some suggestions or advise from others who have encountered missing data in a repeated measure setting like this and how they used mice to tackle this problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init = mice(dat, maxit=0)
methd = init$method
predM = init$predictorMatrix
methd [c("Soda")]="logreg";
methd [c("Team")]="logreg";
imputed = mice(data, method=methd , predictorMatrix=predM, m=5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x measured on time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
idvar = 'id',
timevar = "measurement_number",
v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply imputed missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
As can be seen in the below three example traceplots (which can be obtained with plot(imp), the algorithm has converged nicely. Refer to this section of Stef van Buuren's book for more info on convergence.
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
reshape, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL

How to apply function with multiple conditions on multiple columns to get new conditional columns in R

Hello all a R noob here,
I hope you guys can help me with the following.
I need to transform multiple columns in my dataset to new columns based on the values in the original columns multiple times. This means that for the first transformation I use column 1, 2, 3 and if certain conditions are met the output results a new column with a 1 or a 0, for the second transformation I use columns 4, 5, 6 and the output should be a 1 or a 0 also. I have to do this 18 times. I already wrote a function which succesfully does the transformation if I impute the variables manually, but I would like to apply this function to all the desired columns at once. My desired output would be 18 new columns with 0's and 1's. Finally I will make a last column which will display a 1 if any of the 18 columns is a 1 and a 0 otherwise.
df <- data.frame(admiss1 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss2 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
admiss3 = sample(seq(as.Date('1990/01/01'), as.Date('2000/01/01'), by="day"), 12),
visit1 = sample(seq(as.Date('1995/01/01'), as.Date('1996/01/01'), by="day"), 12),
visit2 = sample(seq(as.Date('1997/01/01'), as.Date('1998/01/01'), by="day"), 12),
reason1 = sample(3,12, replace = T),
reason2 = sample(3,12, replace = T),
reason3 = sample(3,12, replace = T))
df$discharge1 <- df$admiss1 + 10
df$discharge2 <- df$admiss2 + 10
df$discharge3 <- df$admiss3 + 10
#every discharge date is 10 days after the admission date for the sake of this example
#now I have the following dataframe
#for the sake of it I included only 3 dates and reasons(instead of 18)
admiss1 admiss2 admiss3 visit1 visit2 reason1 reason2 reason3 discharge1 discharge2 discharge3
1 1990-03-12 1992-04-04 1998-07-31 1995-01-24 1997-10-07 2 1 3 1990-03-22 1992-04-14 1998-08-10
2 1999-05-18 1990-11-25 1995-10-04 1995-03-06 1997-03-13 1 2 1 1999-05-28 1990-12-05 1995-10-14
3 1993-07-16 1998-06-10 1991-07-05 1995-11-06 1997-11-15 1 1 2 1993-07-26 1998-06-20 1991-07-15
4 1991-07-05 1992-06-17 1995-10-12 1995-05-14 1997-05-02 2 1 3 1991-07-15 1992-06-27 1995-10-22
5 1995-08-16 1999-03-08 1992-04-03 1995-02-20 1997-01-03 1 3 3 1995-08-26 1999-03-18 1992-04-13
6 1999-10-07 1991-12-26 1995-05-05 1995-10-24 1997-10-15 3 1 1 1999-10-17 1992-01-05 1995-05-15
7 1998-03-18 1992-04-18 1993-12-31 1995-11-14 1997-06-14 3 2 2 1998-03-28 1992-04-28 1994-01-10
8 1992-08-04 1991-09-16 1992-04-23 1995-05-29 1997-10-11 1 2 3 1992-08-14 1991-09-26 1992-05-03
9 1997-02-20 1990-02-12 1998-03-08 1995-10-09 1997-12-29 1 1 3 1997-03-02 1990-02-22 1998-03-18
10 1992-09-16 1997-06-16 1997-07-18 1995-12-11 1997-01-12 1 2 2 1992-09-26 1997-06-26 1997-07-28
11 1991-01-25 1998-04-07 1999-07-02 1995-12-27 1997-05-28 3 2 1 1991-02-04 1998-04-17 1999-07-12
12 1996-02-25 1993-03-30 1997-06-25 1995-09-07 1997-10-18 1 3 2 1996-03-06 1993-04-09 1997-07-05
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(dis))] <= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, 0)
xnew <- ifelse(df[eval(substitute(admis))] >= df[eval(substitute(vis1))] & df[eval(substitute(admis))] <= df[eval(substitute(vis2))] & df[eval(substitute(dis))] >= df[eval(substitute(vis2))] & df[eval(substitute(rsn))] == 2, 1, xnew)
return(xnew)
}
I wrote this function to generate a 1 if the conditions are true and a 0 if the conditions are false.
-Condition 1: admission date and discharge date are between visit 1 and visit 2 + admission reason is 2.
-Condition 2: admission date is after visit 1 but before visit 2 and the discharge date is after visit 2 with also admission reason 2.
It should return 1 if these conditions are true and 0 if these conditions are false. Eventually, I will end up with 18 new variables with 1's or 0's and will combine them to make one variable with Admission between visit 1 and visit 2 (with reason 2).
If I manually impute the variable names it will work, but I cant make it work for all the variables at once. I tried to make a string vector with all the admiss dates, discharge dates and reasons and tried to transform them with mapply, but this does not work.
admiss <- paste0(rep("admiss", 3), 1:3)
discharge <- paste0(rep("discharge", 3), 1:3)
reason <- paste0(rep("reason", 3), 1:3)
visit1 <- rep("visit1",3)
visit2 <- rep("visit2",3)
mapply(admissdate, admis = admiss, dis = discharge, rsn = reason, vis1 = visit1, vis2 = visit2)
I have also considered lapply but here you have to define an X = ..., which I think I cannot use because I have multiple column that I want to impute, please correct me if I am wrong!
Also I considered using a for loop, but I don't know how to use that with multiple conditions.
Any help would be greatly appreciated!
You can change the function to accept values instead of column names.
admissdate <- function(admis, dis, rsn, vis1, vis2){
xnew <- as.integer(admis >= vis1 & dis <= vis2 & rsn == 2)
xnew <- ifelse(admis >= vis1 & admis <= vis2 & dis >= vis2 & rsn == 2, 1, xnew)
return(xnew)
}
Now create new columns -
admiss <- paste0("admiss", 1:3)
discharge <- paste0("discharge", 1:3)
reason <- paste0("reason", 1:3)
new_col <- paste0('newcol', 1:3)
df[new_col] <- Map(function(x, y, z) admissdate(x, y, z, df$visit1, df$visit2),
df[admiss],df[discharge],df[reason])
#Additional column will be 1 if any of the value in the new column is 1.
df$result <- as.integer(rowSums(df[new_col]) > 0)
df

Vectorization in R

I am new to R, and I have researched vectorization. But I am still trying to train my mind to think in vectorization terms. Very often examples of vectorization instead of loops are either too simple, so it's difficult for me to generalize them, or not present at all.
Can anyone suggest how I can vectorize the following?
Model2 <- subset(Cor.RMA, MODEL == Models.Sort[2,1])
RCM2 <- count(Model2$REPAIR_CODE)
colnames(RCM2) <- c("REPAIR_CODE", "FREQ")
M2M <- merge(RCM.Sort, RCM2, by = "REPAIR_CODE", all.x = TRUE)
M2M.Sort <- M2M[order(M2M$FREQ.x, decreasing = TRUE), ]
M2M.Sort[is.na(M2M.Sort)] <- 0
In the above code, each "2" needs to run from 2 to 85
writeWorksheetToFile(file="CL2 - TC - RC.xlsx",
data = M2M.Sort[ ,c("FREQ.y")],
sheet = "RC by Model",
clearSheets = FALSE,
startRow = 6,
startCol = 6)
In the above code, "data" should from from "M2M..." to "M85M..." and "startCol" should run from 6 to 89 for an Excel printout.
The data frame this comes from (Cor.RMA) has columns "MODEL", "REPAIR_CODE", and others that are unused.
RCM.Sort is a frequency table of each "REPAIR_CODE" across all models that I use as a Master list to adjoin Device-specific Repair Code counts. (left-join: all.x = TRUE)
Models.Sort is a frequency table I generated using the "count" function from the plyr package, so I can create subsets for each MODEL.
Then I merge a list of each "REPAIR_CODE" that I generated using the "unique" function.
Sample Data:
CASE_NO DEVICE_TYPE MODEL TRIAGE_CODE REPAIR_CODE
12341 Smartphone X TC01 RC01
12342 Smartphone Y TC02 RC02
12343 Smartphone Z TC01, TC05 RC05
12344 Tablet AA TC02 RC37
12345 Wearable BB TC05 RC37
12346 Smartphone X TC07 RC01
12347 Smartphone Y TC04 RC02
I very much appreciate your time and effort if you are willing to help.
Alright, this is not what your original script did, but here goes:
models <- c("X","Y","Z","AA","BB") # in your case it would be Models.Sort[2:85,1]
new <- Cor.RMA[Cor.RMA$MODEL %in% models,]
new2 <- aggregate(new$REPAIR_CODE, list(new$MODEL), table)
temp <- unlist(new2[[2]])
temp <- temp[, order(colSums(temp), decreasing = T)]
out <- data.frame(group=new2[,1], temp)
out <- out[order(rowSums(out[,-1]), decreasing = T),]
out
# group RC01 RC02 RC37 RC05
# 3 X 2 0 0 0
# 4 Y 0 2 0 0
# 1 AA 0 0 1 0
# 2 BB 0 0 1 0
# 5 Z 0 0 0 1
You can then write it easily to an xlsx file, e.g. with:
require(xlsx)
xlsx:::xlsx.write(out,"test.xlsx",row.names=F)
Edit: Added sorting.
Edit2: Fixed sorting.

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude data that is within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that it is easy to change the threshold by applying arguments from Java (as I have with the input file name) that's even better!
Thank you so much.
I have now implicated the answer below and that is working, however I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in the row if one does not meet my quartile threshold (0.25 quartile). So if the quartile for O was 45000 then the row "42046.61549,152.1321255" would be removed. Is this possible? If I read in both columns as a dataframe can I search each column separately? Or find the quartiles and then input that value into code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which will convert a numerical vector into its quantile groups. Parameter n determines the quantile length (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
qtile = quantile(numvec, probs = seq(0, 1, 1/n))
out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
dt = data.table(
A0 = runif(100),
A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
> A0 A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 + Q1 > 2]

Resources