I have a list of data.frames, an example of which is given in example.data below.
example.data <- list(
  stage1 <- data.frame(stuff = c("Apples", "Oranges", "Bananas"),
                       Prop1 = c(1, 2, 3),
                       Prop2 = c(3, 2, 1),
                       Wt    = c(1, 2, 3)),
  stage2 <- data.frame(stuff = c("Bananas", "Mango", "Cherry", "Quince", "Gooseberry"),
                       Prop1 = c(8, 9, 10, 1, 2),
                       Prop2 = c(23, 32, 55, 5, 4),
                       Wt    = c(45, 23, 56, 99, 2)),
  stage3 <- data.frame(stuff = c("Gooseberry", "Bread", "Grapes", "Butter"),
                       Prop1 = c(9, 8, 9, 10),
                       Prop2 = c(34, 45, 67, 88),
                       Wt    = c(24, 56, 31, 84))
)
The data.frames will always have the same number of columns, but the number of rows will vary, as will the number of data.frames in the list. Notice the chain through the list: Apples go to Bananas, Bananas go to Gooseberry, and Gooseberry goes to Butter. That is, each adjacent pair of data.frames has a common element.
I want to scale up the weights throughout the whole list as follows. Firstly, I input my final weight, say 20e3. Secondly, I need a scale factor for the last row, last column of the last data.frame: in this particular case that is 20e3/84. I want to use this scale factor at some point to create new columns in the last data.frame.
Next I want to scale between the last data.frame and the previous one. Using the scale factor just calculated, the scale factor for stage2 is (24 * 20e3/84) / 2, that is, the stage3 Gooseberry weight multiplied by the stage3 scale factor and then divided by the stage2 Gooseberry weight. This process is repeated (via Bananas) to give the stage1 scale factor.
In this particular example the scale factors should be roughly 42857.1, 2857.1 and 238.1 for stage1, stage2 and stage3 respectively.
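To spell out the arithmetic (a minimal worked check of the chain described above; sf1, sf2 and sf3 are names introduced here for illustration):
sf3 <- 20e3 / 84        ## 238.0952, from the last weight of stage3 (Butter)
sf2 <- (24 * sf3) / 2   ## 2857.143, via Gooseberry (stage3 Wt 24, stage2 Wt 2)
sf1 <- (45 * sf2) / 3   ## 42857.14, via Bananas (stage2 Wt 45, stage1 Wt 3)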
I tried doing a for loop over the reverse of the length of the list, with appropriate sub-setting after extracting the coordinates of the last element of each data.frame. This failed because the for loop was out of sync. I'm loath to post what I've tried in case I lead anyone astray.
Not getting many responses so here's what I've done so far ...
last.element <- function(a.list) {
  ## Finds the coordinates (list index, last row, last column)
  ## of the last element of the last data.frame in a list of data.frames
  a <- length(a.list)        ## required to subset the last data.frame
  x <- dim(a.list[[a]])[1]   ## its number of rows
  y <- dim(a.list[[a]])[2]   ## its number of columns
  details <- c(a, x, y)
  return(details)
}
details <- as.data.frame(matrix(, nrow = length(example.data), ncol = 3))
for (i in length(example.data):1) {
  details[i, 1:3] <- last.element(example.data[1:i])
}
The function gives the coordinates of the last element of each data.frame down the list. I've also set up a details data.frame which I want to populate with the scale factors. Next,
details[, 4] <- 1
for (i in length(example.data):1) {
  details[i, 4] <- 20e3 / as.numeric(example.data[[i]][as.matrix(details[i, 2:3])])
}
I set up an extra column in the details data.frame ready for the scale-up factors, but the for loop only gives me the correct scale factor for the last data.frame:
> details
V1 V2 V3 V4
1 1 3 4 6666.6667
2 2 5 4 10000.0000
3 3 4 4 238.0952
If I multiply 238.0952 by 84 it will give me 20000.
But the scale factor for the second data.frame should be (24 * 238.0952) / 2. That is, all the weights in the third data.frame are multiplied by its scale factor, and a new scale factor is then derived by dividing the scaled-up Gooseberry value in the third data.frame by the Gooseberry value in the second data.frame. The scale factor for the first data.frame is found in a similar manner.
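A minimal sketch of the backward loop I have in mind (it assumes, as in the example, that the shared item always sits in the last row of one stage and the first row of the next; scale.fact and the other names are introduced here for illustration):
scale.fact <- numeric(length(example.data))
last.stage <- example.data[[length(example.data)]]
sf <- 20e3 / last.stage$Wt[nrow(last.stage)]                  ## 238.0952 for stage3
scale.fact[length(example.data)] <- sf
for (i in (length(example.data) - 1):1) {
  wt.next <- example.data[[i + 1]]$Wt[1]                      ## shared item, first row of stage i+1
  wt.here <- example.data[[i]]$Wt[nrow(example.data[[i]])]    ## same item, last row of stage i
  sf <- sf * wt.next / wt.here
  scale.fact[i] <- sf
}
scale.fact   ## approximately 42857.1 2857.1 238.1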
Related
I have a data frame that looks like this:
df <- data.frame(Set = c("A","A","A","B","B","B","B"), Values=c(1,1,2,1,1,2,2))
I want to collapse the data frame so I have one row for A and one for B. I want the Values column for those two rows to reflect the most common Value within each Set.
I could do this as described here (How to find the statistical mode?), but notably when there's a tie (two values that each occur the same number of times, so no "true" mode) it simply takes the first value.
I'd prefer to use my own hierarchy to determine which value is selected in the case of a tie.
Create a data frame that defines the hierarchy and assigns each possible Value a numeric score.
hi <- data.frame(Poss = unique(df$Values), Nums = c(105, 104))
In this case, the value 1 gets a numerical score of 105 and the value 2 gets 104 (so 1 would be preferred over 2 in the case of a tie).
Join the hierarchy to the original data frame.
require(dplyr)
matched <- left_join(df, hi, by = c("Values" = "Poss"))
Then, add a frequency column to your original data frame that lists the number of times each unique Set-Values combination occurs (setDT() comes from data.table).
library(data.table)
setDT(matched)[, freq := .N, by = c("Set", "Values")]
Now that those frequencies have been recorded, we only need one row for each Set-Values combination, so get rid of the rest.
multiplied <- distinct(matched, Set, Values, .keep_all = TRUE)
Now, multiply frequency by the numeric scores.
multiplied$mult <- multiplied$Nums * multiplied$freq
Lastly, sort by Set first (ascending), then mult (descending), and use distinct() to keep the row with the highest score within each Set.
check <- multiplied[with(multiplied, order(Set, -mult)), ]
final <- distinct(check, Set, .keep_all = TRUE)
This works because multiple instances of the lower-ranked value (score 104) are added together (three instances give a total mult score of 312), but whenever the two values occur at the same frequency, the higher-ranked value wins out (105 > 104, 210 > 208, etc.).
If using different numeric scores than the ones provided here, make sure the scores are large enough relative to the gap between them for the dataset at hand. For example, using 2 for the preferred value and 1 for the other doesn't work, because it then takes three instances of the lower-ranked value to trump a single instance of the higher-ranked one, instead of only two. Likewise, if you anticipate large frequencies, use something like 1005 and 1004, since with 105 and 104 the higher-ranked value eventually overtakes a more frequent lower-ranked one (200 * 104 is less than 199 * 105).
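If juggling score magnitudes feels fragile, here is a sketch of an alternative (using the same df, plus a rank table introduced here) that counts frequencies and breaks ties by an explicit rank instead of multiplied scores:
library(dplyr)
rank.tbl <- data.frame(Values = c(1, 2), rank = c(1, 2))   ## rank 1 is preferred in a tie
final2 <- df %>%
  count(Set, Values) %>%                ## frequency of each Set-Values combination
  left_join(rank.tbl, by = "Values") %>%
  arrange(Set, desc(n), rank) %>%       ## most frequent first, rank breaks ties
  distinct(Set, .keep_all = TRUE)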
As shown below, the data frame factorizedss is the factorized version of a source data frame ss.
ss <- data.frame(c('a','b','a'), c(1,2,1)); #There are string columns and number columns.
#So, I factorized them as below.
factorizedss <- data.frame(lapply(ss, as.factor)); #factorized version
indices <- data.frame(c(1,1,2,2), c(1,1,1,2)); #Now, given integer indices
Given these indices, and using factorizedss, is it possible to get the corresponding elements of the source data frame, as below? (The purpose is to access data frame elements by the integer positions of their factor levels.)
a 1
a 1
b 1
b 2
You can access the first column like this
factorizedss[indices[,1],][,1]
and the second in a similar way
factorizedss[indices[,2],][,2]
It gets more difficult when trying to combine them; you might have to convert them back to native types:
t(rbind(as.character(factorizedss[indices[,1],][,1]),as.numeric(factorizedss[indices[,2],][,2])))
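If the indices are really meant to be positions in the factor levels (rather than row numbers, which happen to coincide in this small example), a sketch using levels() directly would be:
data.frame(
  col1 = levels(factorizedss[[1]])[indices[[1]]],              ## "a" "a" "b" "b"
  col2 = as.numeric(levels(factorizedss[[2]]))[indices[[2]]]   ## 1 1 1 2
)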
I have a list of data.frames which hold the data for each of the stages of a chemical process. Each of the data.frames has the same number of columns in the same order but the number of rows can vary for each of the data.frames.
See the example data below, with the difference that fruits are standing in for chemical substances and reagents.
I've written a function to scale up the raw data and add the data to columns in the original data frames.
I have two problems. First, the scale factor I can calculate directly (from the target weight and the last element of the last data.frame) is only correct for the last data.frame; that factor is then applied to the whole of the last data.frame. I can generate the scale factor for the next-to-last data.frame by taking the weights of the common fruit (chemical) shared between the two data.frames (always in the last row of one and the first row of the next) and dividing them in a similar manner to how the first scale factor was obtained, then multiplying throughout that data.frame, and repeating until I reach the first data.frame.
The other problem is: when I use lapply to apply the scale_up function over the list, how can I feed it these scale factors so that each one is applied only to its particular data.frame?
example.data <- list(
  stage1 <- data.frame(code    = c("aaa", "ooo", "bbb"),
                       stuff   = c("Apples", "Oranges", "Bananas"),
                       Mw      = c(1, 2, 3),
                       Density = c(3, 2, 1),
                       Assay   = c(8, 9, 1),
                       Wt      = c(1, 2, 3), stringsAsFactors = FALSE),
  stage2 <- data.frame(code    = c("bbb", "mmm", "ccc", "qqq", "ggg"),
                       stuff   = c("Bananas", "Mango", "Cherry", "Quince", "Gooseberry"),
                       Mw      = c(8, 9, 10, 1, 2),
                       Density = c(23, 32, 55, 5, 4),
                       Assay   = c(0.1, 0.3, 0.4, 0.4, 0.9),
                       Wt      = c(45, 23, 56, 99, 2), stringsAsFactors = FALSE),
  stage3 <- data.frame(code    = c("ggg", "bbb", "ggg", "bbb"),
                       stuff   = c("Gooseberry", "Bread", "Grapes", "Butter"),
                       Mw      = c(9, 8, 9, 10),
                       Density = c(34, 45, 67, 88),
                       Assay   = c(10, 10, 46, 52),
                       Wt      = c(24, 56, 31, 84), stringsAsFactors = FALSE)
)
scale_up <- function(inventory, scale_factor, vessel_volume_L, NoBatches = 1) {
  ## This function accepts a data.frame with code, stuff, Mw, Density,
  ## Assay and Wt columns.
  ## It takes a scale factor and vessel volume and returns input
  ## charges and fill volumes.
  ## rownames(inventory) <- inventory$code
  inventory <- inventory[, -1]   ## drop the identifier column (a smiles string in the real data)
  ## volumes and moles are calculated for the given data
  inventory$Vol    <- round((inventory$Wt / inventory$Density), 3)
  inventory$Moles  <- round((inventory$Wt / inventory$Mw), 3)
  inventory$Equivs <- round((inventory$Moles / inventory$Moles[1]), 3)
  inventory[, paste0(scale_factor, "xWt_kg")] <- round((((inventory$Wt * scale_factor) / 1000) / NoBatches), 3)
  inventory[, paste0(scale_factor, "xVol_L")] <- round((((inventory$Vol * scale_factor) / 1000) / NoBatches), 3)
  inventory$PerCentFill <- round((100 * cumsum(inventory[, paste0(scale_factor, "xVol_L")]) / vessel_volume_L), 2)
  inventory
  ## at which point everything is in place to scale up
}
new.example.data <- lapply(example.data, scale_up,20e3,454)
> new.example.data[[1]]
stuff Mw Density Assay Wt Vol Moles Equivs 20000xWt_kg 20000xVol_L PerCentFill
1 Apples 1 3 8 1 0.333 1 1 20 6.66 1.47
2 Oranges 2 2 9 2 1.000 1 1 40 20.00 5.87
3 Bananas 3 1 1 3 3.000 1 1 60 60.00 19.09
So, I've scaled my original data (laboratory scale, grams) to see if it will fit in a 454 L plant vessel, but the only stage that is scaled properly is the last one. The other two need those 'fiddle factors', and I need to apply the 'fiddle factors' to each of the stages as I loop (presumably a for loop rather than lapply) through the list.
(PS: I tried to ask this earlier, but I disguised my example too much and just confused the Stack Overflow folks.)
Based on the details mentioned in this post and in the linked question, Chaining dataframes in a list, here's the solution I have come up with:
Extract the weights of the first and last fruit of each stage into a matrix like this:
wts <- sapply(example.data, function(t) c(t$Wt[1], t$Wt[nrow(t)]), simplify = TRUE)
Declare a global variable final.wt holding the target weight you started with:
final.wt<<- 20000
Create a scales function to calculate the scaling factor for each corresponding stage:
scales <- function(x, final.wt) {
  n <- ncol(x)
  nscales <- numeric(n)
  for (i in n:1) {
    if (i == n) {
      .GlobalEnv$final.wt <- final.wt / x[2, i]
      nscales[i] <- .GlobalEnv$final.wt
    } else {
      .GlobalEnv$final.wt <- .GlobalEnv$final.wt * x[1, i + 1] / x[2, i]
      nscales[i] <- .GlobalEnv$final.wt
    }
  }
  return(nscales)
}
This gives you a vector of the desired scaling factors for each stage:
scale.fact<-scales(wts,final.wt)
Now you can call scale_up using mapply (with SIMPLIFY = FALSE so the result stays a list of data.frames) like this:
mapply(scale_up, example.data, scale.fact, 454, SIMPLIFY = FALSE)
The values in scale.fact are approximately:
42857.1 2857.1 238.1
Each value will be passed by mapply to scale_factor for its corresponding stage.
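As a quick sanity check (a sketch reusing the objects above), the scaled weight of the shared item should agree between consecutive stages and the last item should hit the 20e3 target:
scale.fact[3] * example.data[[3]]$Wt[4]   ## 20000, Butter in stage3
scale.fact[2] * example.data[[2]]$Wt[5]   ## about 5714.3, Gooseberry as weighed in stage2
scale.fact[3] * example.data[[3]]$Wt[1]   ## about 5714.3, Gooseberry as weighed in stage3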
I have a dataset that contains results for many tests across many samples. The samples are replicated within the dataset, and I would like to compare the test results between replicates within each group of replicated samples. I thought it might be easiest to first split my data frame by SampleID so that I have a list of data frames, one data frame for each SampleID. There could be 2, 3, 4, or even 5 replicates of a sample, so the number of unique combinations of rows to compare is not the same for each sample group.
I have the logic that I am thinking of laid out below. I want to run a function on the list of data frames and output the match results. The function would compare unique pairs of rows within each group of replicated samples and return values of "Match", "Mismatch", or NA (if one or both values for a test are missing). It would also return the count of tests that overlap between the two compared replicates, the number of matches, and the number of mismatches. Lastly, it would include a column where the sample names are pasted together with their row numbers so I know which two samples were compared (e.g. Sample1.1_Sample1.2). Could anyone point me in the right direction?
#Input data structure
data = as.data.frame(cbind(rbind("Sample1","Sample1","Sample2","Sample2","Sample2"),rbind("A","A","C","C","C"), rbind("A","T","C","C","C"),
rbind("A",NA,"C","C","C"), rbind("A","A","C","C","C"), rbind("A","T","C","C",NA), rbind("A","A","C","C","C"),
rbind("A","A","C","C","C"), rbind("A",NA,"C","T","T"), rbind("A","A","C","C","C"), rbind("A","A","C","C","C")))
colnames(data) = c("SampleID", "Test1","Test2","Test3","Test4","Test5","Test6","Test7","Test8","Test9","Test10")
data
data.split = split(data, data$SampleID)
##Row comparison function
#Input is a list of data frames. Each data frame contains results for replicates of the same sample.
RowCompare = function(x){
rowcount = nrow(x)
##ifelse(rowcount==2,
##compare row 1 to row 2
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
#ifelse(rowcount==3,
##compare row 1 to row 2
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
##compare row 1 to row 3
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
##compare row 2 to row 3
##paste sample names being compared together
##how many non-NA values overlap, keep value
##of those that overlap, how many match, keep value
##of those that overlap, how many do not match, keep value
return(results)
}
#Output is a list of data frames - one for each sample name
out = lapply(names(data.split), function(x) RowCompare(data.split[[x]]))
#Row bind the list of data frames back together to one large data frame
out.merge = do.call(rbind.data.frame, out)
head(out.merge)
#Desired output
out.merge = as.data.frame(cbind(rbind("Sample1.1_Sample1.2","Sample2.1_Sample2.2","Sample2.1_Sample2.3","Sample2.2_Sample2.3"),rbind("Match","Match","Match","Match"),
rbind("Mismatch","Match","Match","Match"), rbind(NA,"Match","Match","Match"), rbind("Match","Match","Match","Match"), rbind("Mismatch","Match",NA,NA),
rbind("Match","Match","Match","Match"), rbind("Match","Match","Match","Match"), rbind(NA,"Mismatch","Mismatch","Match"), rbind("Match","Match","Match","Match"),
rbind("Match","Match","Match","Match"), rbind(8,10,9,9), rbind(6,9,8,8), rbind(2,1,1,1)))
colnames(out.merge) = c("SampleID", "Test1","Test2","Test3","Test4","Test5","Test6","Test7","Test8","Test9","Test10", "Num_Overlap", "Num_Match","Num_Mismatch")
out.merge
One thing I did see on another post that I thought might be useful is the line below, which creates a matrix of the unique row combinations that could then be used to define which rows to compare in each group of replicated samples. I'm not sure how to implement it, though.
t(combn(nrow(data),2))
Thank you.
You are on the right track with t(combn(nrow(data),2)). See below for how I would do it.
testCols <- which(grepl("^Test\\d+",colnames(data)))
TestsCompare <- function(x, y) {
  ## how many non-NA values overlap
  overlaps <- sum(!is.na(x) & !is.na(y))
  ## of those that overlap, how many match
  matches <- sum(x == y, na.rm = TRUE)
  ## of those that overlap, how many do not match
  non_matches <- overlaps - matches   # complement of matches
  c(overlaps, matches, non_matches)
}
RowCompare <- function(x) {
  comp <- NULL
  pairs <- t(combn(nrow(x), 2))
  for (i in 1:nrow(pairs)) {
    row_a <- pairs[i, 1]
    row_b <- pairs[i, 2]
    a_tests <- x[row_a, testCols]
    b_tests <- x[row_b, testCols]
    comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
  }
  colnames(comp) <- c("row_a", "row_b", "overlaps", "matches", "non_matches")
  return(comp)
}
out = lapply(data.split, RowCompare)
Produces:
> out
$Sample1
row_a row_b overlaps matches non_matches
[1,] 1 2 8 6 2
$Sample2
row_a row_b overlaps matches non_matches
[1,] 1 2 10 9 1
[2,] 1 3 9 8 1
[3,] 2 3 9 9 0
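If you also want the pasted comparison label and a single combined data frame, as in the desired output, a small follow-up sketch along these lines should work (out.labelled and out.merge are names introduced here):
out.labelled <- lapply(names(out), function(s) {
  m <- as.data.frame(out[[s]])
  m$SampleID <- paste0(s, ".", m$row_a, "_", s, ".", m$row_b)   ## e.g. "Sample1.1_Sample1.2"
  m
})
out.merge <- do.call(rbind, out.labelled)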
I have two data frames. One of them contains 165 columns (POINTID plus the species names) and almost 193,000 rows; each cell holds a number from 0 to 1, which is the probability that the species is present in that cell.
POINTID Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran
2 0.0279037 0.604687 0.0388309 0.0161980 0.0143966 0.240152
3 0.0294101 0.674846 0.0673055 0.0481405 0.0397423 0.231308
4 0.0292839 0.603869 0.0597947 0.0526606 0.0463431 0.188875
6 0.0331264 0.541165 0.0470451 0.0270871 0.0373348 0.256662
8 0.0393825 0.672371 0.0715808 0.0559353 0.0565391 0.230833
9 0.0376557 0.663732 0.0747417 0.0445794 0.0602539 0.229265
The second data frame contains 164 columns (the species names, as in the first data frame) and one row, which is the threshold: above it we assume the species is present, below it the species is absent.
Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran Acta_Spic
0.3155 0.2816 0.2579 0.2074 0.3007 0.3513 0.3514
What I want to do is make a new data frame that, for every species in the presence-probability data frame (my.data), contains the probability where it is above the threshold (thres) and zero where it is below the threshold.
I know this could be done with a for loop and an if statement, but I am new to R and don't know how to do it.
Please help me.
I think you want something like this:
(First, make up a small reproducible example.)
set.seed(101)
speciesdat <- data.frame(pointID=1:10,matrix(runif(100),ncol=10,
dimnames=list(NULL,LETTERS[1:10])))
threshdat <- rbind(seq(0.1,1,by=0.1))
Now process:
thresh <- unlist(threshdat) ## make data frame into a vector
## 'sweep' runs the function column-by-column if MARGIN=2
ss2 <- sweep(as.matrix(speciesdat[,-1]),MARGIN=2,STATS=thresh,
FUN=function(x,y) ifelse(x<y,0,x))
## recombine results with the first column
speciesdat2 <- data.frame(pointID=speciesdat$pointID,ss2)
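Applied to the data in the question (assuming the two frames are called my.data and thres, with POINTID as the first column of my.data; thr, res and my.data2 are names introduced here), the same pattern would be something like:
thr <- unlist(thres[, colnames(my.data)[-1]])   ## align the thresholds to the species columns
res <- sweep(as.matrix(my.data[, -1]), MARGIN = 2, STATS = thr,
             FUN = function(x, y) ifelse(x < y, 0, x))
my.data2 <- data.frame(POINTID = my.data$POINTID, res)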
It's simpler if the two frames have the same number of columns (with the same meanings, of course). Calling the presence data frame frame1 and the one-row threshold frame frame2, pad frame2 with a dummy POINTID:
frame2 = data.frame(POINTID=0, frame2)
R works with vectors, so a row of frame1 can be compared directly to frame2:
frame1[,1] < frame2
You could use an explicit loop over every row of frame1, but it's more common to use the implicit loop of apply():
answer = apply(frame1, 1, function(x) x < frame2)
This was all a rather sloppy solution (especially modifying frame2), but it hopefully demonstrates some basic R. Also, I'd generally prefer arrays and matrices where possible (they can still have labels but are generally faster).
Using sweep() with a comparison operator produces a logical matrix that can be used to make assignments with "[<-". Assuming the multi-row data frame is named cols and the named threshold vector is vec:
sweep(cols[-1], 2, vec, ">") # identifies the items to keep
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
Your example produced a warning about the mismatch of the number of columns with the length of the vector, but presumably you can adjust the length of the vector to be the correct number of entries.
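A minimal check of the logical-mask idea on a toy frame (the names and numbers here are made up):
cols <- data.frame(id = 1:3, A = c(0.1, 0.4, 0.2), B = c(0.5, 0.1, 0.3))
vec  <- c(A = 0.25, B = 0.25)
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
cols
##   id   A   B
## 1  1 0.0 0.5
## 2  2 0.4 0.0
## 3  3 0.0 0.3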