HEADLINE: Is there a way to get R to recognize data.frame column names contained within lists in the same way that it can recognize free-floating vectors?
SETUP: Say I have a vector named varA:
(varA <- 1:6)
# [1] 1 2 3 4 5 6
To get the length of varA, I could do:
length(varA)
#[1] 6
and if the variable was contained within a larger list, the variable and its length could still be found by doing:
list <- list(vars = "varA")
length(get(list$vars[1]))
#[1] 6
PROBLEM:
This is not the case when I substitute the vector for a dataframe column and I don't know how to work around this:
rows <- 1:6
cols <- c("colA")
(df <- data.frame(matrix(NA,
nrow = length(rows),
ncol = length(cols),
dimnames = list(rows, cols))))
# colA
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
list <- list(vars = "varA",
cols = "df$colA")
length(get(list$vars[1]))
#[1] 6
length(get(list$cols[1]))
#Error in get(list$cols[1]) : object 'df$colA' not found
Though this contrived example seems inane, because I could always use the simple length(variable) approach, I'm actually interested in writing data from hundreds of variables varying in lengths onto respective dataframe columns, and so keeping them in a list that I could iterate through would be very helpful. I've tried everything I could think of, but it may be the case that it's just not possible in R, especially given that I cannot find any posts with solutions to the issue.
You could try:
> length(eval(parse(text = list$cols[1])))
[1] 6
Or:
list <- list(vars = "varA",
cols = "colA")
length(df[, list$cols[1]])
[1] 6
Or with regex:
list <- list(vars = "varA",
cols = "df$colA")
length(df[, sub(".*\\$", "", list$cols[1])])
[1] 6
If you are truly working with a data frame d, then nrow(d) is the length of all of the variables in d. There should be no reason to use length in this case.
If you are actually working with a list x containing variables of potentially different lengths, then you should use the [[ operator to extract those variables by name (see ?Extract):
x <- list(a = 1:10, b = rnorm(20L))
l <- list(vars = "a")
length(x[[l$vars[1L]]]) # 10
If you insist on using get (you shouldn't), then you need to supply a second argument telling it where to look for the variable (see ?get):
length(get(l$vars[1L], x)) # 10
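For the stated goal (writing many variables of different lengths into data frame columns), a minimal sketch of the list-plus-[[ approach might look like the following. The names vars, colA, colB and out are made up for illustration, and padding the shorter vectors with NA is an assumption about what you want:

```r
# Hypothetical input: vectors of different lengths kept in a named list
vars <- list(colA = 1:6, colB = c(2.5, 3.5, 4.5))

# Longest vector decides the number of rows
n <- max(lengths(vars))
out <- data.frame(row.names = seq_len(n))

# Iterate over the names and assign each vector with [[, padding with NA
for (nm in names(vars)) {
  v <- vars[[nm]]
  out[[nm]] <- c(v, rep(NA, n - length(v)))
}

out
```

Because the names are plain strings, the same loop works for hundreds of variables without get or eval(parse()).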
I need to generate a Data Frame in R from the below Excel Table.
Every time I modify one of the values from column Value the variable Score will have a different value (the cell is protected so I cannot see the formula).
The idea is to generate enough samples to check the main sources of variability, and perform some basic statistics.
I think the only way would be to manually modify the variables in the column Value and annotate the result from Score in the data frame.
The main issue I am having is that I am not used to working with data in this format, and because of this I am finding it difficult to visualize how I should structure the data frame.
I am getting stuck because the variable Score depends on 5 different Stages (where each one of them has 2 different variables) and a set of dimensions with 7 different variables.
I was trying the way I usually create data frames, starting with the vectors, but it feels wrong and I cannot see how to represent the relationships between the different variables.
stage <- c('Inspection','Cut','Assembling','Test','Labelling','Dimensions')
variables <- c('Experience level', 'Equipement', 'User','Length','Wide','Length Body','Width Body','Tape Wing','Tape Body','Clip')
range <- c('b','m','a','UA','UB','UC') # ?? not sure what to do about the range ??
Could anybody help me with the logic on how this should be modelled?
As suggested by @Gregor, to resolve your main issue consider building a data frame of all needed values in respective columns. Then run each row to produce Score.
Specifically, to build needed data frame from inputs in Excel table, consider Map (wrapper to mapply) and data.frame constructor on equal-length list or vectors of 17 items:
Excel Table Inputs
# VECTOR OF 17 CHARACTER ITEMS
stage_list <- c(rep("Inspection", 2),
rep("Cut", 2),
rep("Assembling", 2),
rep("Test", 2),
rep("Labelling", 2),
rep("Dimensions", 7))
# VECTOR OF 17 CHARACTER ITEMS
exp_equip <- c("Experience level", "Equipement")
var_list <- c(rep(exp_equip, 3),
c("User", "Equipement"),
exp_equip,
c("Length", "Wide", "Length body", "Width body",
"Tape wing", "Tape body", "Clip"))
# LIST OF 17 VECTORS
bma_range <- c("b", "m", "a")
noyes_range <- c("no", "yes")
range_list <- c(replicate(6, bma_range, simplify=FALSE),
list(c("UA", "UB", "UC")),
replicate(3, bma_range, simplify=FALSE),
list(seq(6.5, 9.5, by=0.1)),
list(seq(11.9, 12.1, by=0.1)),
list(seq(6.5, 9.5, by=0.1)),
list(seq(4, 6, by=1)),
replicate(3, noyes_range, simplify=FALSE))
Map + data.frame
df_list <- Map(function(s, v, r)
data.frame(Stage = s, Variable = v, Range = r, stringsAsFactors=FALSE),
stage_list, var_list, range_list, USE.NAMES = FALSE)
# APPEND ALL DFS
final_df <- do.call(rbind, df_list)
head(final_df)
# Stage Variable Range
# 1 Inspection Experience level b
# 2 Inspection Experience level m
# 3 Inspection Experience level a
# 4 Inspection Equipement b
# 5 Inspection Equipement m
# 6 Inspection Equipement a
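To see what Map and rbind are doing, here is the same pattern on a deliberately tiny input. The three vectors below are made-up miniatures, not the full Excel table:

```r
# Hypothetical miniature inputs: two variables only
stages <- c("Cut", "Test")
vars   <- c("Equipement", "User")
ranges <- list(c("b", "m", "a"), c("UA", "UB"))

# One small data frame per variable; the scalar Stage and Variable
# are recycled to the length of that variable's range
pieces <- Map(function(s, v, r)
  data.frame(Stage = s, Variable = v, Range = r, stringsAsFactors = FALSE),
  stages, vars, ranges)

# Stack the pieces; unname() keeps the row names as plain 1..n
long_df <- do.call(rbind, unname(pieces))
long_df
```

Each variable contributes as many rows as it has possible values, so the result here has 3 + 2 = 5 rows.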
Score Calculation (using unknown score_function, assumed to take three non-optional args)
# VECTORIZED METHOD
final_df$Score <- score_function(final_df$Stage, final_df$Variable, final_df$Range)
# NON-VECTORIZED/LOOP ROW METHOD
final_df$Score <- sapply(seq_len(nrow(final_df)), function(i)
score_function(final_df$Stage[i], final_df$Variable[i], final_df$Range[i]))
# NON-VECTORIZED/LOOP ELEMENTWISE METHOD
final_df$Score <- mapply(score_function, final_df$Stage, final_df$Variable, final_df$Range)
I am trying to use the %in% operator in R to write an equivalent of the SAS code below:
If weather in (2,5) then new_weather=25;
else if weather in (1,3,4,7) then new_weather=14;
else new_weather=weather;
SAS code will produce variable "new_weather" with values 25, 14 and as defined in variable "weather".
R code:
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[newcol] = df[col]
df[df[newcol] %in% c(2,5)]= 25
df[df[newcol] %in% c(1,3,4,7)] = 14
return(df)
}
Result: after passing a data frame through the function "GS", the values of "col" and "newcol" are identical. The syntax does not seem to pick up the second and later values for "newcol". I would appreciate an explanation of the reason and a possible fix.
Is this what you are trying to do?
df <- data.frame(A=seq(1:4), B=seq(1:4))
add_and_adjust <- function(df, copy_column, new_column_name) {
df[new_column_name] <- df[copy_column] # make copy of column
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(2,5), 25, df[,new_column_name])
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(1,3,4,7), 14, df[,new_column_name])
return(df)
}
Usage:
add_and_adjust(df, 'B', 'my_new_column')
df[newcol] is a data frame (with one column), df[[newcol]] or df[, newcol] is a vector (just the column). You need to use [[ here.
You also need to be assigning the result to df[[newcol]], not to the whole df. And to be perfectly consistent and safe you should probably test the col values, not the newcol values.
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[[newcol]] = df[[col]]
df[[newcol]][df[[col]] %in% c(2,5)] = 25
df[[newcol]][df[[col]] %in% c(1,3,4,7)] = 14
return(df)
}
GS(data.frame(x = 1:7), "x", "new")
# x new
# 1 1 14
# 2 2 25
# 3 3 14
# 4 4 14
# 5 5 25
# 6 6 6
# 7 7 14
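A quick way to see why the original version changed nothing is to compare what %in% returns for the two kinds of indexing. This is only an illustration on a throwaway data frame d:

```r
d <- data.frame(x = 1:7)

# Data frame on the left: %in% treats it as a list of columns,
# so there is one result for the single column, not one per row
single <- d["x"] %in% c(2, 5)

# Vector on the left: one result per row, which is what the recode needs
byvalue <- d[["x"]] %in% c(2, 5)

length(single)
length(byvalue)
which(byvalue)
```

With a length-one index there is nothing useful to assign into, whereas the vector version marks exactly the rows to recode.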
@user9231640 before you invest too much time in writing your own function you may want to explore some of the recode functions that already exist in places like car and Hmisc.
Depending on how complex your recoding gets your function will get longer and longer to check various boundary conditions or to change data types.
Just based upon your example you can do this in base R and it will be more self documenting and transparent at one level:
df <- data.frame(A=seq(1:30), B=seq(1:30))
df$my_new_column <- df$B
df$my_new_column <- ifelse(df$my_new_column %in% c(2,5), 25, df$my_new_column)
df$my_new_column <- ifelse(df$my_new_column %in% c(1,3,4,7), 14, df$my_new_column)
I've imported one large table from a SQL database with similar structure to this example table
testData <- data.frame(
BatchNo = c(1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3),
Y = c(1,1.247011378,1.340630851,1.319026357,1.41264583,1.093619473,1.38023909,1.473858563,1,1.093619473,1.038888089,1.081833061,1,1.215913383,1.278861891,1.297746443,1.360694952,1.332368123,1.414201183,1,1.081833061,1,1.063661202),
Categorical1 = c("A9","B5513","B5513","B5514","B5514","A9","B5514","B5514","A9","A9","B1723","A9","A9","B5513","B5514","B5513","B5514","B5514","B5514","A9","A9","A486","B1701"),
Categorical2 = c("A2793","B5512","B5512","B5512","B5512","B5508","B6623","B6623","B5508","B5508","B5508","A127","A127","B5515","B5515","B5515","B5515","B6623","B6623","A127","A127","A2727","A2727"),
Categorical3 = c("A5510","B5511","B5511","B5511","B5511","A5510","B5511","B5511","B5511","B5511","B5511","A5518","A5518","B5517","B5517","B5517","B5517","B5517","B5517","B5517","B5517","A2","A2"),
Categorical4 = c("A5","A5","B649","A5","B649","B649","A5","B649","A5","B649","A5","B649","A5","A5","A5","B649","B649","A5","B649","A5","B649","A649","A649"),
Binary1 = c(rep(0,times=23)),
Binary2 = c(rep(0,times=23)),
Binary3 = c(rep(0,times=23)),
Binary4 = c(rep(0,times=23))
)
What I'd like to do in a for loop is to:
1.Create subset data frames based on the BatchNo column (1 to 2500)
2.Fit linear models using each subset data frame
3.Export the list of coefficient estimates back to a SQL table
I've got the following so far for steps 1 & 2:
n<-max(testData[,1])
for (i in 1:n) {
assign(paste("dat"),droplevels(subset(testData,BatchNo == i, select = 1:10)))
assign(paste("lm.", i, sep =
""),lm(Y~Categorical1+Categorical2+Categorical3+Categorical4+Binary1+Binary2+Binary3+Binary4,data=dat))}
The problem is that there will be subsets created where at least one of the 4 Categorical variables (or maybe all of them) will have a single level (like BatchNo = 3 in this example) and R cannot use those in regression.
It is not a problem for the binary predictors as it only results in a N/A coefficient estimate, and I'll do a step(backward) to remove any of those after the models have been fitted.
At first I tried to use step(forward) to select only meaningful predictors in each loop, but that didn't work as I had to list all potential predictors for selection.
I can think of 2 possible solutions:
Either drop single-level factor columns from "dat" in each loop
Or create a vector/list of the multi-level factor names each loop and use that somehow in the lm formula
I've only got to the point of creating these 2 vectors:
factors<-dat[,3:6]
f<-names(factors)
levels<-c(length(levels(factors[,1])),length(levels(factors[,2])),length(levels(factors[,3])),length(levels(factors[,4])))
So now I just had to drop the nth element from "f" where the nth element of "levels" equals 1.
Eventually I've been able to find a way to do what I intended to. There might be a simpler/more elegant way, but I've used:
l<-nrow(dat)
a<-length(levels(dat[,3]))
b<-length(levels(dat[,4]))
c<-length(levels(dat[,5]))
d<-length(levels(dat[,6]))
zeros<-c(rep(0,times=l))
# zero the same column whose levels were counted (columns 3 to 6; column 2 is Y)
if (a<2) dat[,3]<-zeros
if (b<2) dat[,4]<-zeros
if (c<2) dat[,5]<-zeros
if (d<2) dat[,6]<-zeros
The single-level factors are replaced with an appropriate length of vectors containing zeros each loop, hence the regressions can be run without getting an error.
Try this:
do.call(rbind,
lapply(split(testData, testData$BatchNo), function(i){
#check if factor columns have more than 1 level
cats <- colnames(i)[c(3:6)[sapply(i[, c(3:6)], function(j){length(unique(j))}) > 1]]
cats <- paste(cats, collapse = "+")
fit <- lm(as.formula(paste0("Y~", cats, "+Binary2+Binary3+Binary4")), data = i)
#return coef as df
as.data.frame(coef(fit))
})
)
Output
# coef(fit)
# 1.(Intercept) 1.000000e+00
# 1.Categorical1B1723 3.888809e-02
# 1.Categorical1B5513 3.082241e-01
# 1.Categorical1B5514 3.802391e-01
# 1.Categorical2B5508 5.611389e-16
# 1.Categorical2B5512 -6.121273e-02
# 1.Categorical2B6623 NA
# 1.Categorical3B5511 1.699675e-17
# 1.Categorical4B649 9.361947e-02
# 1.Binary2 NA
# 1.Binary3 NA
# 1.Binary4 NA
# 2.(Intercept) 1.000000e+00
# 2.Categorical1B5513 2.694196e-01
# 2.Categorical1B5514 3.323681e-01
# 2.Categorical2B5515 -5.350623e-02
# 2.Categorical2B6623 NA
# 2.Categorical3B5517 3.289161e-18
# 2.Categorical4B649 8.183306e-02
# 2.Binary2 NA
# 2.Binary3 NA
# 2.Binary4 NA
# 3.(Intercept) 1.000000e+00
# 3.Categorical1B1701 6.366120e-02
# 3.Binary2 NA
# 3.Binary3 NA
# 3.Binary4 NA
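A variation on the same idea, in case it is useful: reformulate() from the stats package builds the formula from a character vector, which avoids pasting strings together. The toy batch and the column names f1 and f2 below are made up for illustration, and falling back to an intercept-only model when no predictor survives is my assumption:

```r
# Hypothetical single batch, not the poster's data
batch <- data.frame(
  Y  = c(1, 1.2, 1.1, 1.3),
  f1 = c("a", "b", "a", "b"),
  f2 = c("k", "k", "k", "k"),   # only one level, so lm() cannot use it
  stringsAsFactors = FALSE
)

candidates <- c("f1", "f2")

# Count distinct values per candidate predictor and keep those with 2 or more
n_levels <- vapply(batch[candidates], function(col) length(unique(col)), integer(1))
keep <- candidates[n_levels > 1]

# Fall back to an intercept-only model if nothing is left
terms_used <- if (length(keep) == 0) "1" else keep
fml <- reformulate(terms_used, response = "Y")

fit <- lm(fml, data = batch)
coef(fit)
```

Inside the lapply over split(testData, testData$BatchNo), the same few lines would replace the paste0/as.formula step.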
How can I use apply or a related function to create a new data frame that contains the results of the row averages of each pair of columns in a very large data frame?
I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are the same length vectors). I'd like to calculate the average (and other stats) on all replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.
For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample), one that is the average of each row of the replicates in dat$a, dat$b and dat$c and one that is the average of each row for dat$d, dat$e and dat$f?
Here's some example data
dat <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))
a b c d e f
1 -0.9089594 -0.8144765 0.872691548 0.4051094 -0.09705234 -1.5100709
2 0.7993102 0.3243804 0.394560355 0.6646588 0.91033497 2.2504104
3 0.2963102 -0.2911078 -0.243723116 1.0661698 -0.89747522 -0.8455833
4 -0.4311512 -0.5997466 -0.545381175 0.3495578 0.38359390 0.4999425
5 -0.4955802 1.8949285 -0.266580411 1.2773987 -0.79373386 -1.8664651
6 1.0957793 -0.3326867 -1.116623982 -0.8584253 0.83704172 1.8368212
7 -0.2529444 0.5792413 -0.001950741 0.2661068 1.17515099 0.4875377
8 1.2560402 0.1354533 1.440160168 -2.1295397 2.05025701 1.0377283
9 0.8123061 0.4453768 1.598246016 0.7146553 -1.09476532 0.0600665
10 0.1084029 -0.4934862 -0.584671816 -0.8096653 1.54466019 -1.8117459
11 -0.8152812 0.9494620 0.100909570 1.5944528 1.56724269 0.6839954
12 0.3130357 2.6245864 1.750448404 -0.7494403 1.06055267 1.0358267
13 1.1976817 -1.2110708 0.719397607 -0.2690107 0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443 0.34302096 -0.8024803
15 0.2361756 0.6773727 1.279737692 0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335 0.753090683 2.0394865 0.79006103 0.5704210
I'm after something like this
X1 X2
1 -0.28358147 -0.40067128
2 0.50608365 1.27513471
3 -0.07950691 -0.22562957
4 -0.52542633 0.41103139
5 0.37758930 -0.46093340
6 -0.11784382 0.60514586
7 0.10811540 0.64293184
8 0.94388455 0.31948189
9 0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11 0.07836345 1.28189698
12 1.56269017 0.44897971
13 0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15 0.73109533 0.11872758
16 -0.54599850 1.13332286
which I did with this, but is obviously no good for my much larger data frame...
data.frame(cbind(
apply(cbind(dat$a, dat$b, dat$c), 1, mean),
apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))
I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.
This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:
x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
Works if you just have col names too:
x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
EDIT
Just happened to think maybe you want to automate this to do every three columns. I know there's a better way but here it is on a 100 column data set:
dat <- data.frame(matrix(rnorm(16*100), ncol=100))
n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
EDIT 2
Still not happy with the indexing. I think there's a better/faster way to pass the indexes. here's a second though not satisfying method:
n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=F, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]
do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
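One possibly simpler way to build the index list is to split the column positions by a running group number. This is a sketch under the same assumption as the code above, namely that trailing columns which do not fill a complete group are dropped; k, n_full and idx are names I made up:

```r
set.seed(1)
dat <- data.frame(matrix(rnorm(16 * 100), ncol = 100))

k <- 3                                # columns per group
n_full <- (ncol(dat) %/% k) * k       # ignore a trailing incomplete group

# Positions 1..n_full split into consecutive groups of k
idx <- split(seq_len(n_full), ceiling(seq_len(n_full) / k))

res <- do.call(cbind, lapply(idx, function(i) rowMeans(dat[, i])))
dim(res)
```

No NA padding or na.omit is needed because the incomplete group is never created.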
A similar question was asked here by @david: averaging every 16 columns in r (now closed), which I answered by adapting @TylerRinker's answer above, following a suggestion by @joran and @Ben. Because the resulting function might be of help to OP or future readers, I am copying that function here, along with an example for OP's data.
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
# Create index list
if (length(by) == 1)
{
nc <- ncol(x)
split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
} else # 'by' is a vector of groups
{
nc <- length(by)
split.index <- by
}
index.list <- split(seq(from = 1, to = nc), split.index)
# Pass index list to fun using sapply() and return object
sapply(index.list, function(i)
{
do.call(fun, list(x[, i], ...))
})
}
Then, to find the mean of the replicates:
byapply(dat, 3, rowMeans)
Or, perhaps the standard deviation of the replicates:
byapply(dat, 3, apply, 1, sd)
Update
by can also be specified as a vector of groups:
byapply(dat, c(1,1,1,2,2,2), rowMeans)
mean for rows from vectors a,b,c
rowMeans(dat[1:3])
means for rows from vectors d,e,f
rowMeans(dat[4:6])
all in one call you get
results<-cbind(rowMeans(dat[1:3]),rowMeans(dat[4:6]))
if you only know the names of the columns and not the order then you can use:
rowMeans(cbind(dat["a"],dat["b"],dat["c"]))
rowMeans(cbind(dat["d"],dat["e"],dat["f"]))
#I dont know how much damage this does to speed but should still be quick
The rowMeans solution will be faster, but for completeness here's how you might do this with apply:
t(apply(dat,1,function(x){ c(mean(x[1:3]),mean(x[4:6])) }))
Inspired by @joran's suggestion I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):
Make a data frame of example data with p cols to simulate a realistic data set (following @TylerRinker's answer above and unlike my poor example in the question)
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
Rename the columns in this data frame to create groups of n consecutive columns, so that if I'm interested in the groups of three columns I get column names like 1,1,1,2,2,2,3,3,3, etc or if I wanted groups of four columns it would be 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing for people like me who don't know much about indexing)
n <- 3 # how many consecutive columns in the groups of interest?
names(dat) <- rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat)))
Now use apply and tapply to get row means for each of the groups
dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))
The main downsides are that the column names in the original data are replaced (though this could be overcome by putting the grouping numbers in a new row rather than the colnames) and that the column names are returned by the apply-tapply function in an unhelpful order.
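Both downsides can probably be avoided by keeping the grouping in a separate vector instead of in the column names. The sketch below uses split.default, which splits a data frame by columns; the variable names groups and dat.avs2 are mine:

```r
set.seed(1)
p <- 99                                  # how many columns?
dat <- data.frame(matrix(rnorm(4 * p), ncol = p))
n <- 3                                   # consecutive columns per group

# Grouping vector kept apart from the data, so names(dat) stay untouched
groups <- rep(seq_len(ncol(dat) / n), each = n)

# split.default() splits by column; numeric groups keep their numeric order
dat.avs2 <- sapply(split.default(dat, groups), rowMeans)
dim(dat.avs2)
```

The result has one column per group, in the order 1, 2, 3, and so on, and the original column names are left alone.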
Further to @joran's suggestion, here's a data.table solution:
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
dat.t <- data.frame(t(dat))
n <- 3 # how many consecutive columns in the groups of interest?
dat.t$groups <- as.character(rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat))))
library(data.table)
DT <- data.table(dat.t)
setkey(DT, groups)
dat.av <- DT[, lapply(.SD,mean), by=groups]
Thanks everyone for your quick and patient efforts!
There is a beautifully simple solution if you are interested in applying a function to each unique combination of columns, a problem from what is known as combinatorics.
combinations <- combn(colnames(df),2,function(x) rowMeans(df[x]))
To calculate statistics for every unique combination of three columns, etc., just change the 2 to a 3. The call is compact and rowMeans itself is vectorized, although combn still invokes the function once per combination, so it is not necessarily faster than the apply family functions used above. If the order of the columns matters, then you instead need a permutation algorithm designed to reproduce ordered sets: combinat::permn
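As a small illustration on data shaped like the question's example (six columns a to f, regenerated here with a seed, so the numbers differ from the post), the result is a matrix with one column per pair. The labels vector is my addition, built with a second combn call so the columns can be identified:

```r
set.seed(1)
dat <- data.frame(a = rnorm(16), b = rnorm(16), c = rnorm(16),
                  d = rnorm(16), e = rnorm(16), f = rnorm(16))

# Row means for every unordered pair of columns: 16 rows, choose(6, 2) columns
pair_means <- combn(colnames(dat), 2, function(x) rowMeans(dat[x]))

# Matching labels such as "a_b", generated in the same order
labels <- combn(colnames(dat), 2, paste, collapse = "_")
colnames(pair_means) <- labels

dim(pair_means)
```

Note that this averages every pair of columns, which is a different grouping from the consecutive replicate blocks asked about in the question.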