How to properly loop without eval, parse, text=paste("... in R - r

So I had a friend help me with some R code and I feel bad asking because the code works but I have a hard time understanding and changing it and I have this feeling that it's not correct or proper code.
I am loading files into separate R dataframes, labeled x1, x2... xN etc.
I want to combine the dataframes and this is the code we got to work:
assign("x",eval(parse(text=paste("rbind(",paste("x",rep(1:length(toAppend)),sep="",collapse=", "),")",sep=""))))
"toAppend" is a list of the files that were loaded into the x1, x2 etc. dataframes.
Without all the text to code tricks it should be something like:
x <- rbind(##x1 through xN or some loop for 1:length(toAppend)#)
Why can't R take the code without the evaluate text trick? Is this good code? Will I get fired if I use this IRL? Do you know a proper way to write this out as a loop instead? Is there a way to do it without a loop? Once I combine these files/dataframes I have a data set over 30 million lines long which is very slow to work with using loops. It takes more than 24 hours to run my example line of code to get the 30M line data set from ~400 files.

If these dataframes all have the same structure, you will save considerable time by using the 'colClasses' argument to the read.table or read.csv steps. The lapply function can pass this to read.* functions and if you used Dason's guess at what you were really doing, it would be:
x <- do.call(rbind, lapply(file_names, read.csv,
colClasses=c("numeric", "Date", "character")
)) # whatever the ordered sequence of classes might be
The reason that rbind cannot take your character vector is that the names of objects are 'language' objects and a character vector is ... just not a language type. Pushing character vectors through the semi-permeable membrane separating 'language' from 'data' in R requires using assign, or do.call eval(parse()) or environments or Reference Classes or perhaps other methods I have forgotten.

Related

R merge large number of data frames

I have the output from a data submission which is in the form of multiple vector list objects in rda files.
Each list object is in a separate rda file and i have nearly 2000 files.
I want to merge all the objects into a single object in a single rda file in the fastest way (partly because i may need to repeat this several times).
All the rda files are fairly small (~10mb though this will be a compressed size), but it all adds up with the number of files.
Memory isn't a huge problem as am running it on a server with >700GB RAM,
My first approach to incrementally load them one by one concatenate with the merged list object and remove the object that was appended went badly due to the time it was going to take (something like 40 days at a best guess).
My revised approach is below, but wondering if there is a quicker way to do this given that i may need to repeat the process:
load("data_1.rda")
load("data_2.rda")
load("data_3.rda") ...
load("data_2000.rda")
my.list <- list()
my.list <- c(my.list, data.1, data.2, data.3, ... , data.2000)
save(my.list, file="my_list.rda")
And just to add to things i'm getting an error when doing this:
Error: attempt to set index 18446744071562067968/2877912830 in SET_STRING_ELT
It's not a very helpful error message
All the rdas load as objects into the environment fine, but when i try and concatenate them that is when I get the error message, and it seems like it is when it gets to a particular point as it doesn't fail immediately. Wasn't sure if it is some sort of limit in the number of concatenations you can do or rogue data, but troubleshooting it it appears to be syntax rather than data related.
Have chunked it up into 5 batches and then doing a final concatenation before saving the rda. Have seen other answers for this sort of thing suggesting using rbind, mget, and do.Call or list function - would using any of these functions make it faster and achieve the same thing?
Something like this:
my.list <- do.call(rbind, mget(ls(pattern="^data_")))
Thanks

With vs. lapply - why does with not work here?

I am trying to learn R and can't really figure out when to use with appropriately. I was thinking about this example:
The goal is to convert "dstr" and "died" in the whole dataframe "stroke" (in the ISwR database) to date format in several ways (just for practice). I've managed to do it like this:
#applying a function to the whole data frame - use the fact that data frames are lists actually
rawstroke=read.csv2(system.file("rawdata","stroke.csv",package="ISwR"),na.strings=".")
names(rawstroke)=tolower(names(rawstroke))
ix=c("dstr","died")
rawstroke[ix]=lapply(rawstroke[ix],as.Date,format="%d.%m.%Y")
head(rawstroke)
However, when I try using with function it does not give data frame as output, but only writes the definition of the function myfun. Here is the code I tried.
myfun=function(x)
{y=as.Date(x,format="%d.%m.%Y")
return(y)}
rawstroke=read.csv2(system.file("rawdata","stroke.csv",package="ISwR"),na.strings=".")
names(rawstroke)=tolower(names(rawstroke))
ix=c("dstr","died")
bla=with(rawstroke[ix],myfun)
head(bla)
If somebody could help me with this, it would be great.
Yeah, this doesn't seem like a job for with. To use your function here, you'd just replace as.Date in your first code with myfun and remove the format parameter, like
rawstroke[ix]=lapply(rawstroke[ix], myfun)
with is used to more cleanly access variables in data frames and environments. For example, instead of
t.test(dat$x, dat$y)
you could do
with(dat, t.test(x, y))

Using values from a dataframe to apply a function to a vector

I'll start off by admitting that I'm terrible at the apply functions, and function writing in general, in R. I am working on a course project to clean and model some text data, and I would like to include a step that cleans up contractions.
The qdapDictionaries package includes a contractions data frame with two columns, the first column is the contraction and the second is the expanded version. For example:
contraction expanded
5 aren't are not
I want to use the values in here to run a gsub function on my text, which I still have in a large character element. Something like gsub(contr,expd,text).
Here's an example vector that I am using to test things out:
vct <- c("I've got a problem","it shouldn't be that hard","I'm having trouble 'cause I'm dumb")
I'm stumped on how to loop through the data frame (without actually writing a loop, because it seems like the least efficient way to do it) so I can run all the gsubs that I need.
There's probably a simple answer, but here's what I tried: first, I created a function that would return the expanded version if passed a contraction:
expand <- function(contr) {
expd <- contractions[which(contractions[1]==contr),2]
}
I can use sapply with this and it does work, more or less; looping over the first column in contractions, sapply(contractions[,1],expand) returns a named vector of characters with the expanded phrases.
I can't figure out how to combine this vector with gsub though. I tried writing a second function gsub_expand and changing the expand function to return both the contraction and the expansion:
gsub_expand <- function(list, text) {
text <- gsub(list[[1]],list[[2]],text)
return(text)
}
When I ran gsub_expand(sapply(contractions[,1],expand),vct) it only corrected a portion of my vector.
[1] "I've got a problem" "it shouldn't be that hard" "I'm having trouble because I'm dumb"
The first entry in the contractions data frame is 'cause and because, so the interior sapply doesn't seem to actually be looping. I'm stuck in the logic of what I want to pass to what, and what I'm supposed to loop over.
Thanks for any help.
Two options:
stringr::str_replace_all
The stringr package does mostly the same things you can do with base regex functions, but sometimes in a dramatically simpler way. This is one of those times. You can pass str_replace_all a named list or character vector, and it will use the names as patterns and the values as replacements, so all you need is
library(stringr)
contractions <- c("I've" = 'I have', "shouldn't" = 'should not', "I'm" = 'I am')
str_replace_all(vct, contractions)
and you get
[1] "I have got a problem" "it should not be that hard"
[3] "I am having trouble 'cause I am dumb"
No muss, no fuss, just works.
lapply/mapply/Map and gsub
You can, of course, use lapply or a for loop to repeat gsub. You can formulate this call in a few ways, depending on how your data is stored, and how you want to get it out. Let's first make a copy of vct, because we're going to overwrite it:
vct2 <- vct
Now we can use any of these three:
lapply(1:length(contractions),
function(x){vct2 <<- gsub(names(contractions[x]), contractions[x], vct2)})
# `mapply` is a multivariate version of `sapply`
mapply(function(x, y){vct2 <<- gsub(x, y, vct2)}, names(contractions), contractions)
# `Map` is a multivariate version of `lapply`
Map(function(x, y){vct2 <<- gsub(x, y, vct2)}, names(contractions), contractions)
each of which will return slightly different useless data, but will also save the changes to vct2, which now looks the same as the results of str_replace_all above.
These are a little complicated, mostly because you need to save the internal version of vct as you go with each change made. The vct <<- writes to the initialized vct2 outside the function's environment, allowing us to capture the successive changes. Be a little careful with <<-; it's powerful. See ?assignOps for more info.

returning different data frames in a function - R

Is it possible to return 4 different data frames from one function?
Scenario:
I am trying to read a file, parse it, and return some parts of the file.
My function looks something like this:
parseFile <- function(file){
carFile <- read.table(file, header=TRUE, sep="\t")
carNames <- carFile[1,]
carYear <- colnames(carFile)
return(list(carFile,carNames,carYear))
}
I don't want to have to use list(carFile,carNames,carYear). Is there a way return the 3 data frames without returning them in a list first?
R does not support multiple return values. You want to do something like:
foo = function(x,y){return(x+y,x-y)}
plus,minus = foo(10,4)
yeah? Well, you can't. You get an error that R cannot return multiple values.
You've already found the solution - put them in a list and then get the data frames from the list. This is efficient - there is no conversion or copying of the data frames from one block of memory to another.
This is also logical, the return from a function should conceptually be a single entity with some meaning that is transferred to whatever function is calling it. This meaning is also better conveyed if you name the returned values of the list.
You could use a technique to create multiple objects in the calling environment, but when you do that, kittens die.
Note in your example carYear isn't a data frame - its a character vector of column names.
There are other ways you could do that, if you really really want, in R.
assign('carFile',carFile,envir=parent.frame())
If you use that, then carFile will be created in the calling environment. As Spacedman indicated you can only return one thing from your function and the clean solution is to go for the list.
In addition, my personal opinion is that if you find yourself in such a situation, where you feel like you need to return multiple dataframes with one function, or do something that no one has ever done before, you should really revisit your approach. In most cases you could find a cleaner solution with an additional function perhaps, or with the recommended (i.e. list).
In other words the
envir=parent.frame()
will do the job, but as SpacedMan mentioned
when you do that, kittens die
The zeallot package does what you need in a similar that Python can unpack variables from a function. Reproducible example below.
parseFile <- function(){
carMPG <- mtcars$mpg
carName <- rownames(mtcars)
carCYL <- mtcars$cyl
return(list(carMPG,carName,carCYL))
}
library(zeallot)
c(myFile, myName, myYear) %<-% parseFile()

R equivalent to Stata's `compress` command?

Stata has a command called compress which looks through all the data rows and tries to coerce each to the most efficient format. For instance, if you have a bunch of integers stored as a character vector within a data.frame, it will coerce that to integer.
I can imagine how one might write such a function in R, but does it exist already?
Technically, read.table does exactly that with the help of type.convert. So you could use that - it is not the most efficient way but probably the easiest:
df <- as.data.frame(lapply(df ,function(x) type.convert(as.character(x))))
In practice it may be better to do that selectively, though, so you only touch characters/factors:
for (i in seq.int(df)) if (is.factor(df[[i]]) || is.character(df[[i]]))
df[[i]] <- type.convert(as.character(df[[i]]))

Resources