Avoid rbind()/cbind() conversion from numeric to factor - r

I'm trying to build a dataset before plotting it. I decided to use a function factory, gammaplot.ff(), and the first version of my code looks like this:
PowerUtility1d <- function(x, delta = 4) {
  return(((x+1)^(1 - delta)) / (1 - delta))
}
PowerUtility1d <- Vectorize(PowerUtility1d, "x")
# function factory allows multiparametrization of PowerUtility1d()
gammaplot.ff <- function(type, gamma) {
  ff <- switch(type,
    original = function(x) PowerUtility1d(x/10, gamma),
    pnorm_wrong = function(x) PowerUtility1d(2*pnorm(x)-1, gamma),
    pnorm_right = function(x) PowerUtility1d(2*pnorm(x/3)-1, gamma)
  )
  ff
}
gammaplot.df <- data.frame(type=numeric(), gamma=numeric(),
                           x=numeric(), y=numeric())
gammaplot.gamma <- c(1.1, 1.3, 1.5, 2:7)
gammaplot.pts <- (-1e4:1e4)/1e3
# building the data set
for (gm in gammaplot.gamma) {
  for (tp in c("original", "pnorm_wrong", "pnorm_right")) {
    fpts <- gammaplot.ff(tp, gm)(gammaplot.pts)
    dataChunk <- cbind(tp, gm, gammaplot.pts, fpts)
    colnames(dataChunk) <- names(gammaplot.df)
    gammaplot.df <- rbind(gammaplot.df, dataChunk)
  }
}
# rbind()/cbind() cast all data to character, but x and y are numeric
gammaplot.df$x <- as.numeric(as.character(gammaplot.df$x))
gammaplot.df$y <- as.numeric(as.character(gammaplot.df$y))
It turns out that the whole data frame contains character data, so I have to convert it back manually (it took me a while to discover that in the first place!). Searching SO indicates that this happens because the type variable is character. To avoid this (you can imagine the performance issues with character data while building the data set!) I changed the code a bit:
gammaplot.ff <- function(type, gamma) {
  ff <- switch(type,
    function(x) PowerUtility1d(x/10, gamma),
    function(x) PowerUtility1d(2*pnorm(x)-1, gamma),
    function(x) PowerUtility1d(2*pnorm(x/3)-1, gamma)
  )
  ff
}
for (gm in gammaplot.gamma) {
  for (tp in 1:3) {
    fpts <- gammaplot.ff(tp, gm)(gammaplot.pts)
    dataChunk <- cbind(tp, gm, gammaplot.pts, fpts)
    colnames(dataChunk) <- names(gammaplot.df)
    gammaplot.df <- rbind(gammaplot.df, dataChunk)
  }
}
This works fine for me, but I lost a self-explanatory character parameter, which is a downside. Is there a way to keep the first version of function factory without an implicit conversion of all data to character?
If there's another way of achieving the same result, I'd be happy to try it out.

You can use rbind.data.frame and cbind.data.frame instead of rbind and cbind.
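To see why the original loop ends up with text columns: cbind() on plain vectors builds a matrix, and a matrix holds a single type, so one character input coerces every cell to character. The data frame method keeps each column's own type. A minimal sketch (the variable names and the three-point grid are illustrative, not taken from the question):

```r
# cbind() on atomic vectors returns a matrix; a single character
# input forces every cell to character.
tp  <- "original"
gm  <- 1.1
pts <- c(-1, 0, 1)

m <- cbind(tp, gm, pts)
is.matrix(m)   # TRUE
typeof(m)      # "character"

# The data.frame method keeps a separate type per column and
# recycles the scalars tp and gm to the length of pts.
d <- cbind.data.frame(type = tp, gamma = gm, x = pts,
                      stringsAsFactors = FALSE)
sapply(d, class)
```

Passing stringsAsFactors = FALSE explicitly only matters on R versions before 4.0.0; it is harmless on later versions.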

I want to bring @mtelesha's comment to the front.
Use stringsAsFactors = FALSE in cbind or cbind.data.frame:
x <- data.frame(a = letters[1:5], b = 1:5)
y <- cbind(x, c = LETTERS[1:5])
class(y$c)
## "factor"
y <- cbind.data.frame(x, c = LETTERS[1:5])
class(y$c)
## "factor"
y <- cbind(x, c = LETTERS[1:5], stringsAsFactors = FALSE)
class(y$c)
## "character"
y <- cbind.data.frame(x, c = LETTERS[1:5], stringsAsFactors = FALSE)
class(y$c)
## "character"
UPDATE (May 5, 2020):
As of R version 4.0.0, R uses a stringsAsFactors = FALSE default in calls to data.frame() and read.table(), so on R 4.0.0 or later the first two calls above also return "character".
https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/

If I use rbind or rbind.data.frame, the columns are turned into characters every time. Even if I use stringsAsFactors = FALSE. What worked for me was using
rbind.data.frame(df, data.frame(ColNam = data, Col2 = data), stringsAsFactors = FALSE)
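Another way to sidestep the coercion is to build every chunk as a data frame and bind them once at the end, which also keeps the self-explanatory character type. The sketch below reuses the question's PowerUtility1d() and gammaplot.ff() with a shortened grid. Collecting the chunks with Map() and calling do.call(rbind, ...) is a substitution for the growing rbind() loop, not something proposed in the answers above.

```r
PowerUtility1d <- function(x, delta = 4) {
  ((x + 1)^(1 - delta)) / (1 - delta)
}

gammaplot.ff <- function(type, gamma) {
  switch(type,
    original    = function(x) PowerUtility1d(x / 10, gamma),
    pnorm_wrong = function(x) PowerUtility1d(2 * pnorm(x) - 1, gamma),
    pnorm_right = function(x) PowerUtility1d(2 * pnorm(x / 3) - 1, gamma)
  )
}

gammaplot.gamma <- c(1.1, 2, 7)          # shortened for illustration
gammaplot.pts   <- seq(-1, 1, by = 0.5)  # shortened for illustration
types <- c("original", "pnorm_wrong", "pnorm_right")

# One row per (type, gamma) combination.
grid <- expand.grid(type = types, gamma = gammaplot.gamma,
                    stringsAsFactors = FALSE)

# Each chunk is a data frame, so every column keeps its own type.
chunks <- Map(function(tp, gm) {
  data.frame(type = tp, gamma = gm,
             x = gammaplot.pts,
             y = gammaplot.ff(tp, gm)(gammaplot.pts),
             stringsAsFactors = FALSE)
}, grid$type, grid$gamma)

# Bind once at the end instead of growing the data frame in a loop.
gammaplot.df <- do.call(rbind, unname(chunks))
str(gammaplot.df)
```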

Related

How to write a function with an unspecified number of arguments where the arguments are column names

I am trying to write a function with an unspecified number of arguments using ... but I am running into issues where those arguments are column names. As a simple example, if I want a function that takes a data frame and uses within() to make a new column that is several other columns pasted together, I would intuitively write it as
example.fun <- function(input,...){
res <- within(input,pasted <- paste(...))
res}
where input is a data frame and ... specifies column names. This gives an error saying that the column names cannot be found (they are treated as objects). e.g.
df <- data.frame(x = c(1,2),y=c("a","b"))
example.fun(df,x,y)
This returns "Error in paste(...) : object 'x' not found "
I can use attach() and detach() within the function as a workaround:
example.fun2 <- function(input,...){
attach(input)
res <- within(input,pasted <- paste(...))
detach(input)
res}
This works, but it's clunky and runs into issues if there happens to be an object in the global environment that is called the same thing as a column name, so it's not my preference.
What is the correct way to do this?
Thanks
1) Wrap the code in eval(substitute(...code...)) like this:
example.fun <- function(data, ...) {
eval(substitute(within(data, pasted <- paste(...))))
}
# test
df <- data.frame(x = c(1, 2), y = c("a", "b"))
example.fun(df, x, y)
## x y pasted
## 1 1 a 1 a
## 2 2 b 2 b
1a) A variation of that would be:
example.fun.2 <- function(data, ...) {
data.frame(data, pasted = eval(substitute(paste(...)), data))
}
example.fun.2(df, x, y)
2) Another possibility is to convert each argument to a character string and then use indexing.
example.fun.3 <- function(data, ...) {
vnames <- sapply(substitute(list(...))[-1], deparse)
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.3(df, x, y)
3) Other possibilities are to change the design of the function and pass the variable names as a formula or character vector.
example.fun.4 <- function(data, formula) {
data.frame(data, pasted = do.call("paste", get_all_vars(formula, data)))
}
example.fun.4(df, ~ x + y)
example.fun.5 <- function(data, vnames) {
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.5(df, c("x", "y"))

problems looping with fastLink

First of all, sorry for my English; I'm translating with Google Translate.
I have two data frames to which I apply fastLink:
df1<-data.frame(col1=c("pruebaA","pruebaA","pruebaA","pruebaB","pruebaB","pruebaB"),col2=c("avion","casa","coche","verde","antonio","jardin"), stringsAsFactors = FALSE)
df2<-data.frame(col1=c("pruebaA","pruebaA","pruebaA","pruebaB","pruebaB","pruebaA"),col2=c("avion","casa grande","coche rojo","Berde","antoñito","jardinn"), stringsAsFactors = FALSE)
library(fastLink)
prueba <- function(d1, d2) {
out <- fastLink(
dfA = d1, dfB = d2,
varnames = c("col1","col2"),
partial.match = c("col2"),
stringdist.match = c("col2")
)
indi<<- out$matches
dfA.match <<- d1[out$matches$inds.a,]
}
prueba(df1,df2)
I get indi and dfA.match so I can query them.
How could I do the same when I have many data frames?
I can't get a loop to work.
For example,
I split df1 and df2 into parts:
df1$M <- paste0(df1$col1, "_df1")
z <- split(df1,df1$M )
list2env(z, .GlobalEnv)
df2$M <- paste0(df2$col1,"_df2")
b <- split(df2,df2$M )
list2env(b, .GlobalEnv)
I get
- pruebaA_df1
- pruebaA_df2
- pruebaB_df1
- pruebaB_df2
prueba(pruebaA_df1,pruebaA_df2)
prueba(pruebaB_df1,pruebaB_df2)
works!
Same with a loop
unique(df1$col1)->nom2b
indices<- list()
uniones<- list()
for (i in nom2b){
d1<-paste0(i,"_df1")
d2<-paste0(i,"_df2")
#cat(d1)->d1
#cat(d2)->d2
prueba(d1,d2)
indices[[paste0("modelo",i)]]<-indi
uniones[[paste0("uniones",i)]]<- dfA.match
}
Wrong!!, it doesn't work!!
Assuming you have objects called pruebaA_df1, pruebaA_df2 .... pruebaA_df1000 in your environment, you can use Reduce as :
result <- Reduce(prueba, mget(paste0('pruebaA_df', 1:1000)))
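A different route, offered here as an assumption rather than as part of the answer above, is to skip list2env() and string names entirely: keep the pieces in the lists returned by split() and pair them up with Map(). To keep the example runnable without fastLink, prueba_stub() below is a hypothetical stand-in that matches on exact equality of col2 and returns its results instead of assigning them with <<-. In real use its body would call fastLink() as in the question. The data are a shortened version of the question's df1 and df2.

```r
df1 <- data.frame(col1 = c("pruebaA", "pruebaA", "pruebaB", "pruebaB"),
                  col2 = c("avion", "casa", "verde", "jardin"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("pruebaA", "pruebaA", "pruebaB", "pruebaB"),
                  col2 = c("avion", "casa grande", "Berde", "jardin"),
                  stringsAsFactors = FALSE)

# Hypothetical stand-in for the fastLink-based prueba(): exact matching
# on col2 only, returning a list instead of writing global variables.
prueba_stub <- function(d1, d2) {
  inds_a <- which(d1$col2 %in% d2$col2)
  list(indi = inds_a, dfA.match = d1[inds_a, , drop = FALSE])
}

# Split both data frames by group; the list names are the group labels.
z <- split(df1, df1$col1)
b <- split(df2, df2$col1)

# Only link groups that exist on both sides, pairing them by name.
groups  <- intersect(names(z), names(b))
results <- Map(prueba_stub, z[groups], b[groups])

indices <- lapply(results, `[[`, "indi")
uniones <- lapply(results, `[[`, "dfA.match")
```

Because the loop in the question passes the strings d1 and d2 to prueba() rather than the data frames themselves, working on the list elements directly avoids the lookup step altogether.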

R: object y not found in function (x,y) [function to pass through data frames in r]

I am writing a function to build new data frames based on existing data frames. So I essentially have
f1 <- function(x,y) {
x_adj <- data.frame("DID*"= df.y$`DM`[x], "LDI"= df.y$`DirectorID*`[-(x)], "LDM"= df.y$`DM`[-(x)], "IID*"=y)
}
I have 4,000 data frames df., so I really need to use this and R is returning an error saying that df.y is not found. y is meant to be used through a list of all the 4000 names of the different df. I am very new at R so any help would be really appreciated.
In case more specifics are needed I essentially have something like
df.1 <- data.frame(x = 1:3, b = 5)
And I need the following as a result using a function
df.11 <- data.frame(x = 1, c = 2:3, b = 5)
df.12 <- data.frame(x = 2, c = c(1,3), b = 5)
df.13 <- data.frame(x = 3, c = 1:2, b = 5)
Thanks in advance!
The OP seems to want to access a data frame whose name is built dynamically.
One option is to use get:
get(paste("df",y,sep = "."))
With y = 1, the call above returns df.1.
Hence, the function can be modified as:
f1 <- function(x,y) {
temp_df <- get(paste("df",y,sep = "."))
x_adj <- data.frame("DID*"= temp_df$`DM`[x], "LDI"= temp_df$`DirectorID*`[-(x)],
"LDM"= temp_df$`DM`[-(x)], "IID*"=y)
}
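As a rough illustration of the get() approach on the small example from the question (df.1 and the desired df.11 to df.13), the sketch below looks the data frame up by name and builds one leave-one-out frame per row. The helper name leave_one_out() and returning the pieces in a named list, rather than creating separate objects, are assumptions made for the example.

```r
df.1 <- data.frame(x = 1:3, b = 5)

# Hypothetical helper: row i supplies x, the remaining rows supply c and b.
leave_one_out <- function(i, y) {
  src <- get(paste("df", y, sep = "."))   # y = 1 looks up df.1
  data.frame(x = src$x[i], c = src$x[-i], b = src$b[-i])
}

# One result per row of df.1, collected in a named list.
res <- lapply(seq_len(nrow(df.1)), leave_one_out, y = 1)
names(res) <- paste0("df.1", seq_along(res))

res$df.12
```

With thousands of source data frames, keeping them in a list in the first place would make the get() lookup unnecessary.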

Adding variable name to column in for statement

I have multiple columns in a table called "Gr1","Gr2",...,"Gr10".
I want to convert the class from character to integer. I want to do it in a dynamic way, I'm trying this, but it doesn't work:
for (i in 1:10) {
Col <- paste0('Students1$Gr',i)
Col <- as.integer(Col)
}
My objective here is to know how to add dynamically the for variable to the name of a column. Something like:
for (i in 1:10) {
Students1$Gr(i) <- as.integer(Students1$Gr(i))
}
Any idea is welcome.
Thank you very much,
Matias
# Example matrix
xm <- matrix(as.character(1:100), ncol = 10)
colnames(xm) <- paste0('Gr', 1:10)
# Example data frame
xd <- as.data.frame(xm, stringsAsFactors = FALSE)
# For matrices, this works
xm <- apply(X = xm, MARGIN = 2, FUN = as.integer)
# For data frames, this works
for (i in 1:10) {
  xd[ , paste0('Gr', i)] <- as.integer(xd[ , paste0('Gr', i)])
}
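The same conversion can also be written without an explicit loop. A short sketch, assuming the columns are named Gr1 to Gr10 as in the question (the Students1 data here is made up for the example):

```r
# Made-up stand-in for the question's table: ten character columns.
Students1 <- as.data.frame(
  matrix(as.character(1:100), ncol = 10,
         dimnames = list(NULL, paste0("Gr", 1:10))),
  stringsAsFactors = FALSE
)

cols <- paste0("Gr", 1:10)

# [[ ]] accepts the column name as a string, which $ cannot do.
Students1[["Gr1"]] <- as.integer(Students1[["Gr1"]])

# Convert all listed columns in one assignment.
Students1[cols] <- lapply(Students1[cols], as.integer)
sapply(Students1, class)
```

The key point for the original attempt is that Students1$Gr(i) is not valid syntax, whereas Students1[[paste0("Gr", i)]] builds the name first and then indexes with it.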

Organizing data from physics experiments for ggplot2

I am currently trying to use ggplot2 to visualize results from simple current-voltage experiments. I managed to achieve good results for one set of data of course.
However, I have a number of current-voltage datasets, which I read into R one after another to get the following organisation (see minimal code):
data.frame(cbind(batch(string list), sample(string list), dataset(data.frame list)))
Edit: My data are stored in text files named batchname_samplenumber.txt, with voltage and current columns. The code I use to import them is:
require(plyr)
require(ggplot2)
#VARIABLES
regex <- "([[:alnum:]_]+).([[:alpha:]]+)"
regex2 <- "G5_([[:alnum:]]+)_([[:alnum:]]+).([[:alpha:]]+)"
#FUNCTIONS
getJ <- function(list, k) llply(list, function(i) llply(i, function(i, indix) getElement(i,indix), indix = k))
#FILES
files <- list.files("Data/",full.names= T)
#NAMES FOR FILES
paths <- llply(llply(files, basename),function(i) regmatches(i,regexec(regex,i)))
paths2 <- llply(llply(files, basename),function(i) regmatches(i,regexec(regex2,i)))
names <- llply(llply(getJ(paths, 2)),unlist)
batches <- llply(llply(getJ(paths2, 2)),unlist)
samples <- llply(llply(getJ(paths2, 3)),unlist)
#SETS OF DATA, NAMED
sets <- llply(files,function(i) read.table(i,skip = 0, header = F))
names(sets) <- names
for (i in as.list(names)) names(sets[[i]]) <- c("voltage","current")
df<-data.frame(cbind(batches,samples,sets))
And a minimal data set can be generated via:
require(plyr)
batch <- list("A","A","B","B")
sample <- list(1,2,1,2)
set <- list(data.frame(voltage = runif(10), current = runif(10)),data.frame(voltage = runif(10), current = runif(10)),data.frame(voltage = runif(10), current = runif(10)),data.frame(voltage = runif(10), current = runif(10)))
df<-data.frame(cbind(batch,sample,set))
My question is: is it possible to plot the data as is, using code similar to the following (which does not work)?
ggplot(data, aes(x = dataset$current, y = dataset$voltage, colour = sample)) + facet_wrap(~batch)
The more general version would be: is ggplot2 capable of handling raw physical data, as opposed to discrete statistical data (like diamonds, cars)?
With the newly-defined problem (two-column files named "batchname_samplenumber.txt"), I would suggest the following strategy:
read_custom <- function(f, ...) {
  d <- read.table(f, ...)
  names(d) <- c("V", "I")
  ## extract batch and sample from the base filename
  ids <- strsplit(sub("\\.txt$", "", basename(f)), "_")
  d$batch <- ids[[1]][1]
  d$sample <- ids[[1]][2]
  d
}
## list files to read
files <- list.files(pattern = "\\.txt$")
## read them all into a single data.frame (ldply() is from plyr)
m <- ldply(files, read_custom)
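To try the strategy without real measurement files, the sketch below writes two small files into a temporary directory and reads them back. The file names (A_1.txt, B_2.txt) and the values are invented, and do.call(rbind, lapply(...)) is used as a base-R stand-in for ldply() so the example runs without plyr.

```r
# Create a scratch directory with two invented two-column files.
tmp <- file.path(tempdir(), "iv_demo")
dir.create(tmp, showWarnings = FALSE)
for (nm in c("A_1.txt", "B_2.txt")) {
  write.table(data.frame(runif(5), runif(5)), file.path(tmp, nm),
              row.names = FALSE, col.names = FALSE)
}

read_custom <- function(f, ...) {
  d <- read.table(f, ...)
  names(d) <- c("V", "I")
  # batch and sample come from the base file name, e.g. "A_1.txt"
  ids <- strsplit(sub("\\.txt$", "", basename(f)), "_")[[1]]
  d$batch  <- ids[1]
  d$sample <- ids[2]
  d
}

files <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)

# Base-R stand-in for plyr::ldply(files, read_custom).
m <- do.call(rbind, lapply(files, read_custom))
table(m$batch, m$sample)
```

The resulting m is already in the long format that ggplot2 expects, with batch and sample available for colour and facet_wrap().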
It's not clear how the sample names are defined with respect to the dataset. The general idea for ggplot2 is that you should group all your data in the form of a melted (long format) data.frame.
library(ggplot2)
library(plyr)
library(reshape2)
l1 <- list(batch="b1", sample=paste("s", 1:4, sep=""),
dataset=data.frame(current=rnorm(10*4), voltage=rnorm(10*4)))
l2 <- list(batch="b2", sample=paste("s", 1:4, sep=""),
dataset=data.frame(current=rnorm(10*4), voltage=rnorm(10*4)))
l3 <- list(batch="b3", sample=paste("s", 1:4, sep=""),
dataset=data.frame(current=rnorm(10*4), voltage=rnorm(10*4)))
list_to_df <- function(l, n=10){
m <- l[["dataset"]]
m$batch <- l[["batch"]]
m$sample <- rep(l[["sample"]], each=n)
m
}
## list_to_df(l1)
m <- ldply(list(l1, l2, l3), list_to_df)
ggplot(m) + facet_wrap(~batch)+
geom_path(aes(current, voltage, colour=sample))
