Chaining Along Data Frames in a list - r

I have a list of data.frames which hold the data for each of the stages of a chemical process. Each of the data.frames has the same number of columns in the same order but the number of rows can vary for each of the data.frames.
See below the example data with the difference that fruits are standing in for chemical substances and reagents.
I've written a function to scale up the raw data and add the data to columns in the original data frames.
I have two problems. First, when I apply a scale factor it is only correct for the last data.frame: the factor is derived from the last element of the last data.frame and then applied to the whole of that data.frame. I can generate the scale factor for the next-to-last data frame by taking the weights of the common fruit (chemical) shared between the two data frames (always in the last row of one and the first row of the next) and dividing them in a similar manner to how we got the first scale factor, then multiplying throughout this data.frame, and repeating until I reach the first data.frame. The other problem is that when I use lapply to apply the scale_up function over the list, how can I feed it these scale factors so that each one is only applied to its particular data frame?
example.data <- list(
  stage1 <- data.frame(code = c("aaa", "ooo", "bbb"),
                       stuff = c("Apples", "Oranges", "Bananas"),
                       Mw = c(1, 2, 3),
                       Density = c(3, 2, 1),
                       Assay = c(8, 9, 1),
                       Wt = c(1, 2, 3), stringsAsFactors = FALSE),
  stage2 <- data.frame(code = c("bbb", "mmm", "ccc", "qqq", "ggg"),
                       stuff = c("Bananas", "Mango", "Cherry", "Quince", "Gooseberry"),
                       Mw = c(8, 9, 10, 1, 2),
                       Density = c(23, 32, 55, 5, 4),
                       Assay = c(0.1, 0.3, 0.4, 0.4, 0.9),
                       Wt = c(45, 23, 56, 99, 2), stringsAsFactors = FALSE),
  stage3 <- data.frame(code = c("ggg", "bbb", "ggg", "bbb"),
                       stuff = c("Gooseberry", "Bread", "Grapes", "Butter"),
                       Mw = c(9, 8, 9, 10),
                       Density = c(34, 45, 67, 88),
                       Assay = c(10, 10, 46, 52),
                       Wt = c(24, 56, 31, 84), stringsAsFactors = FALSE)
)
scale_up <- function(inventory, scale_factor, vessel_volume_L, NoBatches = 1) {
  ## This function accepts a data.frame with code, stuff, Mw, Density,
  ## Assay and Wt columns.
  ## It takes a scale factor and vessel volume and returns input
  ## charges and fill volumes.
  ## rownames(inventory) <- inventory$code  ## optionally use the code as rownames
  inventory <- inventory[, -1]  ## the code column is removed
  ## volumes and moles are calculated for the given data
  inventory$Vol <- round((inventory$Wt / inventory$Density), 3)
  inventory$Moles <- round((inventory$Wt / inventory$Mw), 3)
  inventory$Equivs <- round((inventory$Moles / inventory$Moles[1]), 3)  ## relative to the first charge
  inventory[, paste0(scale_factor, "xWt_kg")] <- round((((inventory$Wt * scale_factor) / 1000) / NoBatches), 3)
  inventory[, paste0(scale_factor, "xVol_L")] <- round((((inventory$Vol * scale_factor) / 1000) / NoBatches), 3)
  inventory$PerCentFill <- round((100 * cumsum(inventory[, paste0(scale_factor, "xVol_L")]) / vessel_volume_L), 2)
  inventory
  ## at which point everything is in place to scale up
}
new.example.data <- lapply(example.data, scale_up, 20e3, 454)
> new.example.data[[1]]
stuff Mw Density Assay Wt Vol Moles Equivs 20000xWt_kg 20000xVol_L PerCentFill
1 Apples 1 3 8 1 0.333 1 1 20 6.66 1.47
2 Oranges 2 2 9 2 1.000 1 1 40 20.00 5.87
3 Bananas 3 1 1 3 3.000 1 1 60 60.00 19.09
So, I've scaled my original data (laboratory scale, grams) to see if it will fit in a hundred-gallon plant vessel (454 L), but the only stage that is scaled properly is the last one ... the other two need those 'fiddle factors', and I need to apply the 'fiddle factors' to each of the stages as I loop (presumably a for loop rather than lapply) through the list.
(PS ... I tried to ask this earlier but disguised my example too much and just confused the Stack Overflowers.)

Based on the details mentioned in this post and the linked question Chaining dataframes in a list, here's the solution that I have come up with:
Extract the weights of the first and last fruit of each stage into a matrix like this:
wts <- sapply(example.data, function(t){ c(t$Wt[1], t$Wt[nrow(t)]) }, simplify = TRUE)
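For the example data this gives a 2 x 3 matrix (first row: the first Wt of each stage; second row: the last):
> wts
     [,1] [,2] [,3]
[1,]    1   45   24
[2,]    3    2   84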
Declare a global variable final.wt holding the target weight you chose initially:
final.wt <<- 20000
Create a scales function to calculate the scaling factor for each corresponding stage:
scales <- function(x, final.wt){
  n <- ncol(x)
  nscales <- numeric(n)
  for(i in n:1){
    if(i == n){
      .GlobalEnv$final.wt <- final.wt / x[2, i]
      nscales[i] <- .GlobalEnv$final.wt
    } else {
      .GlobalEnv$final.wt <- .GlobalEnv$final.wt * x[1, i+1] / x[2, i]
      nscales[i] <- .GlobalEnv$final.wt
    }
  }
  return(nscales)
}
This gives you a vector of the desired scaling factors for each stage:
scale.fact <- scales(wts, final.wt)
Now you can call scale_up for each stage with its own factor using mapply (MoreArgs passes the fixed vessel volume, and SIMPLIFY = FALSE keeps the result as a list of data.frames):
mapply(scale_up, example.data, scale.fact, MoreArgs = list(vessel_volume_L = 454), SIMPLIFY = FALSE)
The values in scale.fact are (to one decimal place):
42857.1 2857.1 238.1
Each value is passed to scale_factor by mapply, pairing each scale factor with its corresponding stage.
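As a quick sanity check (a hedged sketch, assuming the objects defined above), the common substance between two consecutive stages should now receive the same scaled weight, e.g. Bananas: 3 * 42857.1 / 1000 ≈ 128.6 kg in stage1 and 45 * 2857.1 / 1000 ≈ 128.6 kg in stage2.
new.data <- mapply(scale_up, example.data, scale.fact,
                   MoreArgs = list(vessel_volume_L = 454), SIMPLIFY = FALSE)
tail(new.data[[1]], 1)  # Bananas row of stage1
head(new.data[[2]], 1)  # Bananas row of stage2: the scaled Wt columns should agree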

Related

outlier detection for a grouped (clustered) data

I want to find outliers and eliminate them in my data (named "df"):
> head(df)
cluster machine.code age Good.Times repair.price
1 1 13010132 23 58.54 198170000
2 1 13010129 23 105.25 390847500
3 1 13010131 23 20.50 20701747
4 1 13010072 18 14.30 22340000
5 1 13010101 18 57.63 13220000
6 1 13010106 27 49.96 254450000
My data has 65 clusters and I want to run the outlier detection within each cluster separately.
I had used the code below for outlier detection on a single cluster before, and it worked fine:
library("ggstatsplot")
df<- read.csv("C:/Users/gadmin/Desktop/dataE.csv",header = TRUE)
ggbetweenstats(df, cluster, repair.price, outlier.tagging = TRUE)
Q <- quantile(df$repair.price, probs = c(.25, .75), na.rm = FALSE)
iqr <- IQR(df$repair.price)
up  <- Q[2] + 1.5*iqr # Upper Range
low <- Q[1] - 1.5*iqr # Lower Range
eliminated <- subset(df, df$repair.price > (Q[1] - 1.5*iqr) & df$repair.price < (Q[2] + 1.5*iqr))
ggbetweenstats(eliminated, cluster, repair.price, outlier.tagging = TRUE)
Now I want to do the same thing for all 65 clusters using "for", something like this:
for(i in 1:length(unique(df$cluster))) {
...
}
but I don't know how (I mean the part where, after detecting the outliers in the first cluster, they should be removed (subset) before continuing the process with the next cluster).
Core question
There are various ways to detect outliers. As for the core of your question, I understand it as "How do I subset the data so I can apply a for-loop to remove the outliers for each cluster?"
# maybe insert a column id that assigns an id (identical to the row number) to identify individual entries
df$id <- seq_len(nrow(df))
# make a list to store the outlier ids for each cluster
outlrs <- list()
# loop through the clusters
for(clust in unique(df$cluster)){
  sub_df <- df[df$cluster == clust, ]
  outlrs[[clust]] <- [INSERT YOUR OUTLIER DETECTION FUNCTION HERE*]
}
# remove the outliers (unlist copes with clusters yielding different numbers of outliers)
outlier_ids <- unlist(outlrs)
df <- df[!df$id %in% outlier_ids, ]
* the outlier detection function you use should ultimately output the id of the row containing the outlier. This part would have to be adapted to your method of outlier identification.
I didn't test it since I have insufficient data. You could use e.g. dput(df) to output a version of your data you can copy and paste to make it accessible to people who want to test their proposed solutions.
Edit: one (of many) alternative approaches
Alternatively, you could apply the functions you included in your question to a subset of the data within the loop, store the cleaned-up output in a list, and subsequently apply do.call(rbind.data.frame, your_list) to the list, as sketched below.
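A minimal sketch of that alternative, reusing the quantile/IQR rule from your question (assuming df with cluster and repair.price columns as shown):
cleaned <- list()
for (clust in unique(df$cluster)) {
  sub_df <- df[df$cluster == clust, ]
  Q   <- quantile(sub_df$repair.price, probs = c(.25, .75), na.rm = FALSE)
  iqr <- IQR(sub_df$repair.price)
  # keep only the rows inside the 1.5*IQR fences for this cluster
  cleaned[[as.character(clust)]] <- subset(sub_df,
    repair.price > (Q[1] - 1.5*iqr) & repair.price < (Q[2] + 1.5*iqr))
}
df_clean <- do.call(rbind.data.frame, cleaned)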
Note
As Phil pointed out, it is questionable whether outliers should be removed at all, especially when you're just applying a loop that "takes care of them". While we can provide the means by which "outliers" can be removed programmatically, whether you should actually remove them in a given situation is another matter (probably better suited to Cross Validated). It should also be noted that there are many algorithms for determining which values differ "significantly" from the bulk of values, and the border between "significant" and "not significant" is arbitrary.

R code to iteratively and randomly delete entire rows from a data frame based on a column value, and saving as a new data frame each time

Please forgive me if this question has been asked before!
So I have a dataframe (df) of individuals sampled from various populations with each individual given a population name and a corresponding number assigned to that population as follows:
Individual Population Popnum
ALM16-014 AimeesMdw 1
ALM16-024 AimeesMdw 1
ALM16-026 AimeesMdw 1
ALM16-003 AMKRanch 2
ALM16-022 AMKRanch 2
ALM16-075 BearPawLake 3
ALM16-076 BearPawLake 3
ALM16-089 BearPawLake 3
There are a total of 12 named populations (they do not all have the same number of individuals), with Popnum 1-12 in this file. What I need to do is randomly delete one or more populations (preferably using the 'Popnum' column) from the dataframe, repeat this 100 times, and save each result as a separate dataframe (i.e. df1, df2, df3, etc.). The end result is 100 dfs, each having one population removed at random. The next step is to repeat this 100 times removing two random populations, then 3 random populations, and so on.
Any help would be greatly appreciated!!
You can write a function which takes a dataframe as input along with n, i.e. the number of Popnum values to remove.
remove_n_Popnum <- function(data, n) {
  subset(data, !Popnum %in% sample(unique(Popnum), n))
}
To remove one Popnum you can do:
remove_n_Popnum(df, 1)
# Individual Population Popnum
#1 ALM16-014 AimeesMdw 1
#2 ALM16-024 AimeesMdw 1
#3 ALM16-026 AimeesMdw 1
#4 ALM16-003 AMKRanch 2
#5 ALM16-022 AMKRanch 2
To do this 100 times you can use replicate
list_data <- replicate(100, remove_n_Popnum(df, 1), simplify = FALSE)
To pass different n in remove_n_Popnum function you can use lapply
nested_list_data <- lapply(seq_along(unique(df$Popnum)[-1]),
                           function(x) replicate(100, remove_n_Popnum(df, x), simplify = FALSE))
where seq_along generates a sequence whose length is 1 less than the number of unique values, so that at least one population always remains.
seq_along(unique(df$Popnum)[-1])
#[1] 1 2
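For example (hypothetical usage of the objects created above):
list_data[[3]]              # the 3rd of the 100 replicates with one population removed
nested_list_data[[2]][[5]]  # the 5th replicate with two populations removed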

Chaining dataframes in a list

I have a list of data.frames an example of which can be found in the example.data below
example.data <- list(
  stage1 <- data.frame(stuff = c("Apples", "Oranges", "Bananas"),
                       Prop1 = c(1, 2, 3),
                       Prop2 = c(3, 2, 1),
                       Wt = c(1, 2, 3)),
  stage2 <- data.frame(stuff = c("Bananas", "Mango", "Cherry", "Quince", "Gooseberry"),
                       Prop1 = c(8, 9, 10, 1, 2),
                       Prop2 = c(23, 32, 55, 5, 4),
                       Wt = c(45, 23, 56, 99, 2)),
  stage3 <- data.frame(stuff = c("Gooseberry", "Bread", "Grapes", "Butter"),
                       Prop1 = c(9, 8, 9, 10),
                       Prop2 = c(34, 45, 67, 88),
                       Wt = c(24, 56, 31, 84))
)
The data.frames will always have the same number of columns but their rows will vary, as will the number of data.frames in the list. Notice the chain through the list: Apples go to Bananas, Bananas go to Gooseberry, and Gooseberry goes to Butter. That is, each pair of consecutive data.frames has a common element.
I want to scale up the weights throughout the whole list as follows. Firstly, I need to input my final weight, say 20e3. Secondly, I need a scale factor for the last row, last column of the last data frame: in this particular case 20e3/84. I want to use this scale factor at some point to create new columns in the last dataframe.
Next I want to scale between the last dataframe and the previous one. Using the scale factor previously calculated, the scale factor for stage2 is (24 * 20e3/84) / 2: the weight of the stage3 Gooseberry multiplied by the stage3 scale factor, divided by the stage2 Gooseberry weight, giving a new scale factor. This process is repeated (via Bananas) to give the stage1 scale factor.
In this particular example the scale factors should be approximately 42857.1, 2857.1 and 238.1 for stage1, stage2 and stage3.
I tried doing a for loop over the reverse of the length of the list, with appropriate sub-setting after extracting the coordinates of the last element of each data.frame. This failed because the for loop was out of sync. I'm loath to post what I've tried in case I lead anyone astray.
Not getting many responses so here's what I've done so far ...
last.element <- function(a.list) {
  ## The function finds the coordinates of the last element
  ## in the last data.frame of a list of dataframes
  a <- length(a.list)       ## required to subset the last element
  x <- dim(a.list[[a]])[1]
  y <- dim(a.list[[a]])[2]
  details <- c(a, x, y)
  return(details)
}
details <- as.data.frame(matrix(NA, nrow = length(example.data), ncol = 3))
for (i in length(example.data):1) {
  details[i, 1:3] <- last.element(example.data[1:i])
}
The function gives the coordinates of the last element of each data.frame down the list. I've set up a data.frame which I want to populate with the scale factors. Next,
details[, 4] <- 1
for (i in length(example.data):1) {
  details[i, 4] <- 20e3 / as.numeric(example.data[[i]][as.matrix(details[i, 2:3])])
}
I set an extra column in the details data.frame ready for the scale-up factors. But the for loop only gives me a correct scale factor for the last data.frame:
> details
V1 V2 V3 V4
1 1 3 4 6666.6667
2 2 5 4 10000.0000
3 3 4 4 238.0952
If I multiply 238.0952 by 84 it gives me 20000.
But the scale factor for the second data frame should be (24 * 238.0952) / 2; that is, all the weights in the third data.frame are multiplied by its scale factor, and a new scale factor is derived by dividing the scaled-up Gooseberry value in the third data.frame by the Gooseberry value in the second data.frame. The scale factor for the first data frame is found in a similar manner, as sketched below.
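A minimal worked sketch of that backward pass over the example data (the weights 84, 24, 2, 45 and 3 are taken from the stage data above):
sf <- numeric(3)
sf[3] <- 20e3 / 84         # last Wt of stage3
sf[2] <- (24 * sf[3]) / 2  # scaled stage3 Gooseberry / stage2 Gooseberry Wt
sf[1] <- (45 * sf[2]) / 3  # scaled stage2 Bananas / stage1 Bananas Wt
round(sf, 1)
# [1] 42857.1  2857.1   238.1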

How to select specific elements and find their index in a data.frame?

I would like to select specific elements of a data.list after processing it.
To get the process parameters, I describe my problem in the reproducible example below.
In the example code, I have a data.list of three datasets, each with 5 columns.
Each dataset repeats itself three times, and each is assigned a unique number called set_nbr which identifies it.
#to create reproducible data (this part creates three sets of data, each of which
#repeats its Mx, My and Mz values 3 times, along with a set_nbr identifier)
set.seed(1)
data.list <- lapply(1:3, function(x) {
  nrep <- 3
  time <- rep(seq(90, 54000, length.out = 600), times = nrep)
  Mx <- c(replicate(nrep, sort(runif(600, -0.014, 0.012), decreasing = TRUE)))
  My <- c(replicate(nrep, sort(runif(600, -0.02, 0.02), decreasing = TRUE)))
  Mz <- c(replicate(nrep, sort(runif(600, -1, 1), decreasing = TRUE)))
  df <- data.frame(time, Mx, My, Mz, set_nbr = x)
})
After applying some function, I have output like this:
result
time Mz set_nbr
1 27810 -1.917835e-03 1
2 28980 -1.344288e-03 1
3 28350 -3.426615e-05 1
4 27900 -9.934413e-04 1
5 25560 -1.016492e-02 2
6 27360 -4.790767e-03 2
7 28080 -7.062256e-04 2
8 26550 -1.171716e-04 2
9 26820 -2.495893e-03 3
10 26550 -7.397865e-03 3
11 26550 -2.574022e-03 3
12 27990 -1.575412e-02 3
My questions start here.
1) How do I get the min, middle and max values of the time column for each set_nbr?
2) How do I use the evaluated set_nbr and Mz values inside data.list?
In short:
After deciding the min, middle and max values from the time column and the corresponding Mz values for each set_nbr in result, I want to go back to the original data.list and extract the Mx, My and Mz columns according to those set_nbr and Mz values. Since each set_nbr actually corresponds to 600 rows, I would like to extract those set_nbr families from data.list.
We use time as a factor to select set_nbr; here "factor" means an extraction parameter, not a factor in the R sense.
In addition, as you will see, four rows exist for each set_nbr, but they indeed address different datasets in the data.list.
I'm a big advocate of using lists of data frames when appropriate, but in this case it doesn't look like there's any reason to keep them separated as different list items. Let's combine them into a single data frame.
library(dplyr)
dat = bind_rows(data.list)
Then getting your summary stats is easy:
dat %>% group_by(set_nbr) %>%
  summarize(min_time = min(time),
            max_time = max(time),
            middle_time = median(time))
# Source: local data frame [3 x 4]
#
# set_nbr min_time max_time middle_time
# 1 1 90 54000 27045
# 2 2 90 54000 27045
# 3 3 90 54000 27045
In your sample data, time is defined the same way each time, so of course the min, median, and max are all the same.
I'd suggest, in the new question you ask about plotting, starting with the combined data frame dat.
As to your second question:
2) How to select evaluated set_nbr values inside of data.list?
To select a single item from a list, use double brackets:
data.list[[2]]
However, with the combined data, it's just a normal column of a normal data frame so any of these will work:
dat[dat$set_nbr == 2, ]
subset(dat, set_nbr == 2)
filter(dat, set_nbr == 2)
To your clarification in comments, if you want the Mx and My values for the time and set_nbr in the results object, using my combined dat above, simply do a join: left_join(results, dat).
This should work, but I'm a little confused because in your simulated data time is numeric, but in your new text you say "we use time as a factor". If you've converted time to a factor object, this will only work if it has the same levels in each of the data frames in your data list. If not, I would recommend keeping time as numeric.
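A hedged sketch of that join, plus a dplyr way to pull the full row families for each matched set_nbr (the object name result comes from the question; semi_join keeps every row of dat whose set_nbr occurs in result):
library(dplyr)
joined   <- left_join(result, dat)                  # adds Mx and My to the result rows
families <- semi_join(dat, result, by = "set_nbr")  # all rows for the matched set_nbr values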

Sample with constraint, vectorized

What is the most efficient way to sample a data frame under a certain constraint?
For example, say I have a directory of Names and Salaries; how do I select 3 such that their sum does not exceed some value? I'm just using a while loop, but that seems pretty inefficient.
You could face a combinatorial explosion. This simulates the selection of combinations of 3 EEs from a set of 20, with salaries at a mean of 60 and sd of 20. It shows that from the enumeration of the 1140 possible combinations you will find only 263 having a sum of salaries less than 150.
> set.seed(123)
> salry <- data.frame(EEnams = sapply(1:20,
                      function(x){ paste(sample(letters[1:20], 6),
                      collapse="") }), sals = rnorm(20, 60, 20))
> head(salry)
  EEnams     sals
1 fohpqa 67.59279
2 kqjhpg 49.95353
3 nkbpda 53.33585
4 gsqlko 39.62849
5 ntjkec 38.56418
6 trmnah 66.07057
> sum( apply( combn(1:NROW(salry), 3), 2, function(x) sum(salry[x, "sals"]) < 150) )
[1] 263
If you had 1000 EEs then you would have:
> choose(1000, 3) # combination possibilities
[1] 166167000
i.e. over 166 million combinations.
One approach would be to start with the full data frame and sample one case. Create a data frame which consists of all the cases which have a salary less than your constraint minus the selected salary. Select a second case from this and repeat the process of creating a remaining set of cases to choose from. Stop if you get to the number you need (3), or if at any point there are no cases in the data frame to choose from (reject what you have so far and restart the sampling procedure).
Note that different approaches will create different probability distributions for a case being included; generally it won't be uniform.
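A minimal sketch of the sequential approach described above, assuming the salry data frame from earlier, a group size of 3 and a cap of 150 (the function name and arguments are illustrative):
sample_constrained <- function(df, k = 3, cap = 150) {
  repeat {  # restart whenever no feasible case remains
    picked <- integer(0)
    remaining <- seq_len(nrow(df))
    budget <- cap
    for (i in seq_len(k)) {
      ok <- remaining[df$sals[remaining] < budget]
      if (length(ok) == 0) break                # reject what we have and restart
      choice <- ok[sample.int(length(ok), 1)]   # avoids sample()'s length-1 surprise
      picked <- c(picked, choice)
      remaining <- setdiff(remaining, choice)
      budget <- budget - df$sals[choice]
    }
    if (length(picked) == k) return(df[picked, ])
  }
}
sample_constrained(salry)  # three EEs whose salaries sum to less than 150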
How big is your dataset? If it is small (and small really depends on your hardware), you could just list all groups of three, calculate the sum, and sample from that.
## create data frame
N <- 100
salary <- rnorm(N)
## list all possible groups of 3 from this
x <- combn(salary, 3)
## the sum of each group
sx <- colSums(x)
## keep the sums meeting the constraint (here, less than 1)
sxc <- sx[sx < 1]
## sampling with replacement
sample(sxc, 10, replace = TRUE)
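Note that this samples the sums rather than the groups; to draw the groups themselves under the same assumptions, sample column indices of x instead:
idx <- which(sx < 1)  # columns (groups of 3) meeting the constraint
x[, sample(idx, 1)]   # one random group of 3 salaries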
