From Excel Table to Data Frame in R

I need to generate a Data Frame in R from the below Excel Table.
Every time I modify one of the values in column Value, the variable Score takes a different value (the cell is protected, so I cannot see the formula).
The idea is to generate enough samples to check the main sources of variability and perform some basic statistics.
I think the only way would be to manually modify the variables in column Value and annotate the result from Score in the data frame.
The main issue I am having is that I am not used to working with data in this format, and because of this I am finding it difficult to visualize how I should structure the data frame.
I am getting stuck because the variable Score depends on 5 different Stages (each of which has 2 different variables) plus a set of dimensions with 7 different variables.
I tried the way I usually create data frames, starting with the vectors, but it feels wrong and I cannot see how to represent the relationships between the different variables.
stage <- c('Inspection','Cut','Assembling','Test','Labelling','Dimensions')
variables <- c('Experience level', 'Equipement', 'User', 'Length', 'Wide', 'Length Body', 'Width Body', 'Tape Wing', 'Tape Body', 'Clip')
range <- c('b','m','a','UA','UB','UC') # not sure what to do about the range
Could anybody help me with the logic on how this should be modelled?

As suggested by @Gregor, to resolve your main issue, consider building a data frame of all needed values in respective columns, then run each row to produce Score.
Specifically, to build the needed data frame from the inputs in the Excel table, consider Map (a wrapper to mapply) and the data.frame constructor on equal-length vectors and lists of 17 items:
Excel Table Inputs
# VECTOR OF 17 CHARACTER ITEMS
stage_list <- c(rep("Inspection", 2),
                rep("Cut", 2),
                rep("Assembling", 2),
                rep("Test", 2),
                rep("Labelling", 2),
                rep("Dimensions", 7))
# VECTOR OF 17 CHARACTER ITEMS
exp_equip <- c("Experience level", "Equipement")
var_list <- c(rep(exp_equip, 3),
              c("User", "Equipement"),
              exp_equip,
              c("Length", "Wide", "Length body", "Width body",
                "Tape wing", "Tape body", "Clip"))
# LIST OF 17 VECTORS
bma_range <- c("b", "m", "a")
noyes_range <- c("no", "yes")
range_list <- c(replicate(6, bma_range, simplify=FALSE),
                list(c("UA", "UB", "UC")),
                replicate(3, bma_range, simplify=FALSE),
                list(seq(6.5, 9.5, by=0.1)),
                list(seq(11.9, 12.1, by=0.1)),
                list(seq(6.5, 9.5, by=0.1)),
                list(seq(4, 6, by=1)),
                replicate(3, noyes_range, simplify=FALSE))
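As a quick sanity check (illustrative, not from the original answer), confirm the three inputs line up before mapping over them:
# EACH SHOULD RETURN 17
length(stage_list); length(var_list); length(range_list)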
Map + data.frame
df_list <- Map(function(s, v, r)
                 data.frame(Stage = s, Variable = v, Range = r, stringsAsFactors = FALSE),
               stage_list, var_list, range_list, USE.NAMES = FALSE)
# APPEND ALL DFS
final_df <- do.call(rbind, df_list)
head(final_df)
# Stage Variable Range
# 1 Inspection Experience level b
# 2 Inspection Experience level m
# 3 Inspection Experience level a
# 4 Inspection Equipement b
# 5 Inspection Equipement m
# 6 Inspection Equipement a
Score Calculation (using unknown score_function, assumed to take three non-optional args)
# VECTORIZED METHOD
final_df$Score <- score_function(final_df$Stage, final_df$Variable, final_df$Range)
# NON-VECTORIZED/LOOP ROW METHOD
final_df$Score <- sapply(1:nrow(final_df), function(i)
  score_function(final_df$Stage[i], final_df$Variable[i], final_df$Range[i]))
# NON-VECTORIZED/LOOP ELEMENTWISE METHOD
final_df$Score <- mapply(score_function, final_df$Stage, final_df$Variable, final_df$Range)
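Once Score is populated, the basic statistics mentioned in the question follow directly from this long format; a minimal sketch, assuming Score comes back numeric:
# MEAN AND SD OF SCORE BY STAGE (ILLUSTRATIVE)
aggregate(Score ~ Stage, data = final_df, FUN = function(x) c(mean = mean(x), sd = sd(x)))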

Related

(Pearson's) Correlation loop through the data frame

I have a data frame with 159 obs and 27 variables, and I want to correlate all 159 obs from column 4 (variable 4) with each one of the following columns (variables); that is, correlate column 4 with 5, then column 4 with 6, and so on. I've been unsuccessfully trying to create a loop, and since I'm a beginner in R, it turned out harder than I thought. The reason I want to simplify this is that I would need to do the same thing for a couple more data frames, and a function that could do it would make things much easier and less time-consuming. Thus, it would be wonderful if anyone could help me.
df <- ZEB1_23genes # CHANGE ZEB1_23genes for df (dataframe)
for (i in colnames(df)){ # Check the class of the variables
  print(class(df[[i]]))
}
print(df)
# Correlate ZEB1 with each of the 23 genes accordingly to Pearson's method
cor.test(df$ZEB1, df$PITPNC1, method = "pearson")
### OR ###
cor.test(df[,4], df[,5])
So I can correlate individually but I cannot create a loop to go back to column 4 and correlate it to the next column (5, 6, ..., 27).
Thank you!
If I've understood your question correctly, the solution below should work well.
#Sample data
df <- data.frame(matrix(data = sample(runif(100000), 4293), nrow = 159, ncol = 27))
#Correlation function
#Takes as input a data.frame containing columns with values to be correlated
#The column against which other columns must be correlated can be specified (start_col; default is 4)
#The number of columns to be correlated against start_col can also be specified (end_col; default is all columns after start_col)
#Function returns a data.frame containing start_col, end_col, and the correlation value as columns, one row per comparison
my_correlator <- function(mydf, start_col = 4, end_col = 0){
  if(end_col == 0){
    end_col <- ncol(mydf)
  }
  out_corr <- list()
  for(i in (start_col + 1):end_col){
    out_corr[[i]] <- data.frame(start_col = start_col,
                                end_col = i,
                                corr_val = as.numeric(cor.test(mydf[, start_col], mydf[, i])$estimate))
  }
  return(do.call("rbind", out_corr))
}
test_run <- my_correlator(df, 4)
head(test_run)
# start_col end_col corr_val
# 1 4 5 -0.027508521
# 2 4 6 0.100414199
# 3 4 7 0.036648608
# 4 4 8 -0.050845418
# 5 4 9 -0.003625019
# 6 4 10 -0.058172227
The function basically takes a data.frame as an input and spits out (as output) another data.frame containing correlations between a given column from the original data.frame against all subsequent columns. I do not know the structure of your data, and obviously, this function will fail if it runs into unexpected conditions (for instance, a column of characters in one of the columns).
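For completeness, the same correlations can be collected without writing a loop at all; a sketch under the same assumptions (column 4 against every later column):
corr_vals <- sapply(5:ncol(df), function(i) cor.test(df[, 4], df[, i])$estimate)
data.frame(start_col = 4, end_col = 5:ncol(df), corr_val = as.numeric(corr_vals))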

Creating Dataframes through R For Loop

Fairly new to R, so any guidance is appreciated.
GOAL: I'm trying to create hundreds of dataframes in a short script. They follow a naming pattern, so I thought a for loop would suffice, but the data.frame function seems to ignore the variable nature of the loop variable, reading it as it appears. Here's an example:
# Defining some dummy variables for the sake of this example
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- c(1:338)
# (Theoretically) creating a separate dataframe for each of the terms in 'dfTitles'
for (i in dfTitles){
  i <- data.frame(matrix(0, nrow = 4, ncol = 338, dimnames = list(Copes, Voxels)))
}
# Trying an alternative method
for (i in 1:length(dfTitles)){
  dfTitles[i] <- data.frame(matrix(0, nrow = 4, ncol = 338, dimnames = list(Copes, Voxels)))
}
This results in the creation of one dataframe named 'i' in the former case, or a list of 4 in the latter. Any ideas? Thank you!
PROBABLY UNNECESSARY BACKGROUND INFORMATION: We're using fMRI data to run an analysis which will run correlations across stimuli, brain voxels, brain regions, and participants. We're correlating whole matrices, so separating the values (aka COPEs) into separate dataframes by both Participant ID and Brain Region is going to make the next step much much easier. I already had tried the next step after having loaded and sorted the data into one large dataframe and it was a big pain in the butt.
rm(list = ls())
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- c(1:3)
# (Theoretically) creating a separate dataframe for each of the terms in 'dfTitles'
nr <- length(Voxels)
nc <- length(Copes)
N <- length(dfTitles) # Number of data frames, same as length of dfTitles
DF <- vector(N, mode="list")
for (i in 1:N){
  DF[[i]] <- data.frame(matrix(rnorm(nr*nc), nrow = nr))
  dimnames(DF[[i]]) <- list(Voxels, Copes)
}
names(DF) <- dfTitles
DF[1:2]
$C2000.AMY
Cope1 Cope2 Cope3 Cope4
1 -0.8293164 -1.813807 -0.3290645 -0.7730110
2 -1.1965588 1.022871 -0.7764960 -0.3056280
3 0.2536782 -0.365232 2.3949076 0.5672671
$C2000.ACC
Cope1 Cope2 Cope3 Cope4
1 -0.7505513 1.023325 -0.3110537 -1.4298174
2 1.2807725 1.216997 1.0644983 1.6374749
3 1.0047408 1.385460 0.1527678 0.1576037
When creating objects in a for loop, they need to be saved somewhere before the next iteration of the loop, or they get overwritten.
One way to handle that is to create an empty list or vector with c() before the beginning of your loop, and append the output of each run of the loop.
Another way to handle it is to assign the object to your environment before moving on to the next iteration of the loop.
# Defining some dummy variables for the sake of this example
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- c(1:338)
# initialize a list to store the data.frame output
df_list <- list()
for (d in dfTitles) {
  # create data.frame with the dfTitle, and 1 row per Copes observation
  df <- data.frame(dfTitle = d,
                   Copes = Copes)
  # append columns for Voxels
  # setting to NA, can be reassigned later as needed
  for (v in Voxels) {
    df[[paste0("Voxel", v)]] <- NA
  }
  # store df in the list as the 'd'th element
  df_list[[d]] <- df
  # or, assign the object to your environment
  # assign(d, df)
}
# data.frames can be referenced by name
names(df_list)
head(df_list$C2000.AMY)
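If you instead enable the assign(d, df) line, the data frames land in your workspace under their dfTitles names and can be fetched back with get(); an illustrative one-liner:
head(get("C2000.AMY"))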

Fast method for combining list elements based on criteria

I'm building a little function in R that takes size measurements from several species and several sites, combines all the data by site (lumping many species together), and then computes some statistics on those combined data.
Here is some simplistic sample data:
SiteID <- rep(c("D00002", "D00003", "D00004"), c(5, 2, 3))
SpeciesID <- c("CHIL", "CHIP", "GAM", "NZMS", "LUMB", "CHIL", "SIMA", "CHIP", "CHIL", "NZMS")
Counts <- data.frame(matrix(sample(0:99,200, replace = TRUE), nrow = 10, ncol = 20))
colnames(Counts) <- paste0('B', 1:20)
spec <- cbind(SiteID, SpeciesID, Counts)
stat1 <- data.frame(unique(SiteID))
colnames(stat1) <- 'SiteID'
stat1$Mean <- NA
Here is the function, which creates a list, lsize1, where each list element is a vector of the sizes (B1 to B20) for a given SpeciesID in a given SiteID, multiplied by the number of counts for each size class. From this, the function creates a list, lsize2, which combines list elements from lsize1 that have the same SiteID. Finally, it gets the mean of each element in lsize2 (i.e., the average size of an individual for each SiteID, regardless of SpeciesID), and outputs that as a result.
fsize <- function(){
  specB <- spec[, 3:22]
  lsize1 <- apply(specB, 1, function(x) rep(1:20, x))
  names(lsize1) <- spec$SiteID
  lsize2 <- sapply(unique(names(lsize1)),
                   function(x) unlist(lsize1[names(lsize1) == x], use.names = FALSE),
                   simplify = FALSE)
  stat1[stat1$SiteID %in% names(lsize2), 'Mean'] <- round(sapply(lsize2, mean), 2)
  return(stat1)
}
In creating this function, I followed the suggestion here: combine list elements based on element names, which gets at the crux of my problem: combining list elements based on some criteria in common (in my case, combining all elements from the same SiteID). The function works as intended, but my question is if there's a way to make it substantially faster?
Note: for my actual data set, which is ~40,000 rows in length, I find that the function runs in ~ 0.7 seconds, with the most time consuming step being the creation of lsize2 (~ 0.5 seconds). I need to run this function many, many times, with different permutations and subsets of the data, so I'm hoping there's a way to cut this processing time down significantly.
There shouldn't be any need for loops here. Here's one attempt:
specB <- spec[, 3:22]  # the B1:B20 count columns (previously defined only inside fsize)
tmp <- data.frame(spec["SiteID"], sums = rowSums(specB * col(specB)), counts = rowSums(specB))
tmp <- aggregate(. ~ SiteID, tmp, sum)
tmp$avg <- tmp$sums / tmp$counts
tmp
# SiteID sums counts avg
#1 D00002 46254 4549 10.16795
#2 D00003 20327 1810 11.23039
#3 D00004 29651 2889 10.26341
Compare:
fsize()
# SiteID Mean
#1 D00002 10.17
#2 D00003 11.23
#3 D00004 10.26
This code essentially multiplies each value by its index (col(specB)), then aggregates the sums and counts by SiteID. This logic should be relatively transferable to other methods (data.table/dplyr) as well. E.g., in data.table:
setDT(spec)
spec[, .(avg = sum(.SD * col(.SD)) / sum(unlist(.SD))), by=SiteID, .SDcols=B1:B20]
# SiteID avg
#1: D00002 10.16795
#2: D00003 11.23039
#3: D00004 10.26341
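The same sums-over-counts logic also translates to dplyr; a rough sketch (assumes the tidyr package for reshaping):
library(dplyr)
library(tidyr)
spec %>%
  pivot_longer(B1:B20, names_to = "size", values_to = "count") %>%
  mutate(size = as.integer(sub("B", "", size))) %>%   # "B7" -> 7, the size class
  group_by(SiteID) %>%
  summarise(avg = sum(size * count) / sum(count))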

I want to split data in R by varying block sizes but each observation is unique

I have managed to read in a data file and subset out the 2 columns of info that I want to work with. I am now stuck because I need to split the data into chunks of varying sizes, apply a function (mean, sd) to them, save the chunks, and plot the sd from each; this is generally known as block averaging. Right now I have a data frame with 2 columns and 10005 rows. The head of it looks like this:
Frame CA
1 0.773
Is there an efficient way that I could subset pieces of the data from a:b so that I can dictate how the data is broken up by the "Frame" column? I have found really good answers on here but I am not sure what they mean fully or if they would work.
chunk <- function(x, n)
  mapply(function(a, b) x[a:b],
         seq.int(from = 1, to = length(x), by = n),
         pmin(seq.int(from = 1, to = length(x), by = n) + (n - 1), length(x)),
         SIMPLIFY = FALSE)
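For reference, a quick run (illustrative) shows that this helper does split a vector into blocks of n, with a shorter final block:
chunk(1:10, 3)
# returns list(1:3, 4:6, 7:9, 10)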
I'm not sure if it is what you're looking for, but with a closure, a data frame can be subsetted by arbitrary indices.
(If Frame can be subsetted by a:b, it is likely to be a sequence and thus a subset may be made by row index?)
df <- data.frame(group = sample(c("a", "b"), 20, replace = TRUE),
                 val = rnorm(20))
# closure - returns a function that accepts from and to
subsetter <- function(from, to) {
  function(x) {
    x[from:to, ]
  }
}
# from and to are specified
sub1 <- subsetter(2, 4)
sub2 <- subsetter(1, 5)
# data is subset from 'from' to 'to'
sub1(df)
#group val
#2 a 0.5518802
#3 b 1.5955093
#4 a -0.8132578
sub2(df)
# group val
#1 b 0.4780080
#2 a 0.5518802
#3 b 1.5955093
#4 a -0.8132578
#5 b 0.4449554
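Applied to the block-averaging goal in the question, the same subsetting idea can be driven by a block size directly; a sketch, assuming your 10005-row data frame is called mydat and has columns Frame and CA:
n <- 500  # hypothetical block size
blocks <- split(mydat$CA, ceiling(seq_len(nrow(mydat)) / n))  # consecutive blocks of n rows
block_means <- sapply(blocks, mean)
block_sds <- sapply(blocks, sd)
plot(block_sds, type = "b", xlab = "Block", ylab = "SD of CA")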

apply a function over groups of columns

How can I use apply or a related function to create a new data frame that contains the results of the row averages of each pair of columns in a very large data frame?
I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are the same length vectors). I'd like to calculate the average (and other stats) on all replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.
For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample), one that is the average each row of the replicates in dat$a, dat$b and dat$c and one that is the average of each row for dat$d, dat$e and dat$f.
Here's some example data
dat <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))
a b c d e f
1 -0.9089594 -0.8144765 0.872691548 0.4051094 -0.09705234 -1.5100709
2 0.7993102 0.3243804 0.394560355 0.6646588 0.91033497 2.2504104
3 0.2963102 -0.2911078 -0.243723116 1.0661698 -0.89747522 -0.8455833
4 -0.4311512 -0.5997466 -0.545381175 0.3495578 0.38359390 0.4999425
5 -0.4955802 1.8949285 -0.266580411 1.2773987 -0.79373386 -1.8664651
6 1.0957793 -0.3326867 -1.116623982 -0.8584253 0.83704172 1.8368212
7 -0.2529444 0.5792413 -0.001950741 0.2661068 1.17515099 0.4875377
8 1.2560402 0.1354533 1.440160168 -2.1295397 2.05025701 1.0377283
9 0.8123061 0.4453768 1.598246016 0.7146553 -1.09476532 0.0600665
10 0.1084029 -0.4934862 -0.584671816 -0.8096653 1.54466019 -1.8117459
11 -0.8152812 0.9494620 0.100909570 1.5944528 1.56724269 0.6839954
12 0.3130357 2.6245864 1.750448404 -0.7494403 1.06055267 1.0358267
13 1.1976817 -1.2110708 0.719397607 -0.2690107 0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443 0.34302096 -0.8024803
15 0.2361756 0.6773727 1.279737692 0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335 0.753090683 2.0394865 0.79006103 0.5704210
I'm after something like this
X1 X2
1 -0.28358147 -0.40067128
2 0.50608365 1.27513471
3 -0.07950691 -0.22562957
4 -0.52542633 0.41103139
5 0.37758930 -0.46093340
6 -0.11784382 0.60514586
7 0.10811540 0.64293184
8 0.94388455 0.31948189
9 0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11 0.07836345 1.28189698
12 1.56269017 0.44897971
13 0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15 0.73109533 0.11872758
16 -0.54599850 1.13332286
which I did with this, but it is obviously no good for my much larger data frame...
data.frame(cbind(
  apply(cbind(dat$a, dat$b, dat$c), 1, mean),
  apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))
I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.
This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:
x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
Works if you just have col names too:
x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
EDIT
Just happened to think maybe you want to automate this to do every three columns. I know there's a better way but here it is on a 100 column data set:
dat <- data.frame(matrix(rnorm(16*100), ncol=100))
n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
EDIT 2
Still not happy with the indexing. I think there's a better/faster way to pass the indexes. Here's a second, though still not satisfying, method:
n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=F, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]
do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
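A cleaner way to build those index groups (a sketch of the same idea) is to let split() and ceiling() do the bookkeeping; this also copes with a column count that is not a multiple of three:
n <- 1:ncol(dat)
ind <- split(n, ceiling(n / 3))  # list(1:3, 4:6, ...); last group may be shorter
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i, drop = FALSE])))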
A similar question was asked here by @david: averaging every 16 columns in r (now closed), which I answered by adapting @TylerRinker's answer above, following a suggestion by @joran and @Ben. Because the resulting function might be of help to the OP or future readers, I am copying that function here, along with an example for the OP's data.
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
  # Create index list
  if (length(by) == 1)
  {
    nc <- ncol(x)
    split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
  } else # 'by' is a vector of groups
  {
    nc <- length(by)
    split.index <- by
  }
  index.list <- split(seq(from = 1, to = nc), split.index)
  # Pass index list to fun using sapply() and return object
  sapply(index.list, function(i)
  {
    do.call(fun, list(x[, i], ...))
  })
}
Then, to find the mean of the replicates:
byapply(dat, 3, rowMeans)
Or, perhaps the standard deviation of the replicates:
byapply(dat, 3, apply, 1, sd)
Update
by can also be specified as a vector of groups:
byapply(dat, c(1,1,1,2,2,2), rowMeans)
Mean for rows from vectors a, b, c:
rowMeans(dat[1:3])
Mean for rows from vectors d, e, f:
rowMeans(dat[4:6])
All in one call you get:
results <- cbind(rowMeans(dat[1:3]), rowMeans(dat[4:6]))
If you only know the names of the columns and not their order, then you can use:
rowMeans(cbind(dat["a"], dat["b"], dat["c"]))
rowMeans(cbind(dat["d"], dat["e"], dat["f"]))
# I don't know how much damage this does to speed but it should still be quick
The rowMeans solution will be faster, but for completeness here's how you might do this with apply:
t(apply(dat,1,function(x){ c(mean(x[1:3]),mean(x[4:6])) }))
Inspired by @joran's suggestion I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):
Make a data frame of example data with p cols to simulate a realistic data set (following @TylerRinker's answer above, and unlike my poor example in the question):
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
Rename the columns in this data frame to create groups of n consecutive columns, so that if I'm interested in groups of three columns I get column names like 1,1,1,2,2,2,3,3,3, etc., or if I wanted groups of four columns it would be 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing, for people like me who don't know much about indexing).
n <- 3 # how many consecutive columns in the groups of interest?
names(dat) <- rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat)))
Now use apply and tapply to get row means for each of the groups
dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))
The main downsides are that the column names in the original data are replaced (though this could be overcome by putting the grouping numbers in a new row rather than the colnames) and that the column names are returned by the apply-tapply function in an unhelpful order.
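If it matters, that ordering can be patched afterwards (an illustrative fix) by sorting the output columns on the numeric part of the generated names:
dat.avs <- dat.avs[, order(as.numeric(sub("^X", "", names(dat.avs))))]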
Further to @joran's suggestion, here's a data.table solution:
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
dat.t <- data.frame(t(dat))
n <- 3 # how many consecutive columns in the groups of interest?
dat.t$groups <- as.character(rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat))))
library(data.table)
DT <- data.table(dat.t)
setkey(DT, groups)
dat.av <- DT[, lapply(.SD,mean), by=groups]
Thanks everyone for your quick and patient efforts!
There is a beautifully simple solution if you are interested in applying a function to each unique combination of columns, in what is known as combinatorics.
combinations <- combn(colnames(dat), 2, function(x) rowMeans(dat[x]))
To calculate statistics for every unique combination of three columns, etc., just change the 2 to a 3. The operation is vectorized and thus faster than loops, such as the apply family functions used above. If the order of the columns matters, then you instead need a permutation algorithm designed to reproduce ordered sets: combinat::permn
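As a small usage note, the result is a matrix with one column per pair, and the columns can be labelled (illustratively) with the source columns they average:
colnames(combinations) <- combn(colnames(dat), 2, paste, collapse = "_")
head(combinations)  # columns named a_b, a_c, ...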
