Using values from a dataframe to apply a function to a vector - r

I'll start off by admitting that I'm terrible at the apply functions, and function writing in general, in R. I am working on a course project to clean and model some text data, and I would like to include a step that cleans up contractions.
The qdapDictionaries package includes a contractions data frame with two columns, the first column is the contraction and the second is the expanded version. For example:
contraction expanded
5 aren't are not
I want to use the values in here to run a gsub function on my text, which I still have in a large character element. Something like gsub(contr,expd,text).
Here's an example vector that I am using to test things out:
vct <- c("I've got a problem","it shouldn't be that hard","I'm having trouble 'cause I'm dumb")
I'm stumped on how to loop through the data frame (without actually writing a loop, because it seems like the least efficient way to do it) so I can run all the gsubs that I need.
There's probably a simple answer, but here's what I tried: first, I created a function that would return the expanded version if passed a contraction:
expand <- function(contr) {
expd <- contractions[which(contractions[1]==contr),2]
}
I can use sapply with this and it does work, more or less; looping over the first column in contractions, sapply(contractions[,1],expand) returns a named vector of characters with the expanded phrases.
I can't figure out how to combine this vector with gsub though. I tried writing a second function gsub_expand and changing the expand function to return both the contraction and the expansion:
gsub_expand <- function(list, text) {
text <- gsub(list[[1]],list[[2]],text)
return(text)
}
When I ran gsub_expand(sapply(contractions[,1],expand),vct) it only corrected a portion of my vector.
[1] "I've got a problem" "it shouldn't be that hard" "I'm having trouble because I'm dumb"
The first entry in the contractions data frame is 'cause and because, so the interior sapply doesn't seem to actually be looping. I'm stuck in the logic of what I want to pass to what, and what I'm supposed to loop over.
Thanks for any help.

Two options:
stringr::str_replace_all
The stringr package does mostly the same things you can do with base regex functions, but sometimes in a dramatically simpler way. This is one of those times. You can pass str_replace_all a named list or character vector, and it will use the names as patterns and the values as replacements, so all you need is
library(stringr)
contractions <- c("I've" = 'I have', "shouldn't" = 'should not', "I'm" = 'I am')
str_replace_all(vct, contractions)
and you get
[1] "I have got a problem" "it should not be that hard"
[3] "I am having trouble 'cause I am dumb"
No muss, no fuss, just works.
lapply/mapply/Map and gsub
You can, of course, use lapply or a for loop to repeat gsub. You can formulate this call in a few ways, depending on how your data is stored, and how you want to get it out. Let's first make a copy of vct, because we're going to overwrite it:
vct2 <- vct
Now we can use any of these three:
lapply(1:length(contractions),
function(x){vct2 <<- gsub(names(contractions[x]), contractions[x], vct2)})
# `mapply` is a multivariate version of `sapply`
mapply(function(x, y){vct2 <<- gsub(x, y, vct2)}, names(contractions), contractions)
# `Map` is a multivariate version of `lapply`
Map(function(x, y){vct2 <<- gsub(x, y, vct2)}, names(contractions), contractions)
each of which will return slightly different useless data, but will also save the changes to vct2, which now looks the same as the results of str_replace_all above.
These are a little complicated, mostly because you need to save the internal version of vct as you go with each change made. The vct <<- writes to the initialized vct2 outside the function's environment, allowing us to capture the successive changes. Be a little careful with <<-; it's powerful. See ?assignOps for more info.

Related

Anonymous function in lapply

I am reading Wickham's Advanced R book. This question is relating to solving Question 5 in chapter 12 - Functionals. The exercise asks us to:
Implement a version of lapply() that supplies FUN with both the name and value of each component.
Now, when I run below code, I get expected answer for one column.
c(class(iris[1]),names(iris[1]))
Output is:
"data.frame" "Sepal.Length"
Building upon above code, here's what I did:
lapply(iris,function(x){c(class(x),names(x))})
However, I only get the output from class(x) and not from names(x). Why is this the case?
I also tried paste() to see whether it works.
lapply(iris,function(x){paste(class(x),names(x),sep = " ")})
I only get class(x) in the output. I don't see names(x) being returned.
Why is this the case? Also, how do I fix it?
Can someone please help me?
Instead of going over the data frame directly you could switch things around and have lapply go over a vector of the column names,
data(iris)
lapply(colnames(iris), function(x) c(class(iris[[x]]), x))
or over an index for the columns, referencing the data frame.
lapply(1:ncol(iris), function(x) c(class(iris[[x]]), names(iris[x])))
Notice the use of both single and double square brackets.
iris[[n]] references the values of the nth object in the list iris (a data frame is just a particular kind of list), stripping all attributes, making something like mean(iris[[1]]) possible.
iris[n] references the nth object itself, all attributes intact, making something like names(iris[1]) possible.

not error, but not results either in R

I am trying to make a function in R that calculates the mean of nitrate, sulfate and ID. My original dataframe have 4 columns (date,nitrate, sulfulfate,ID). So I designed the next code
prueba<-read.csv("C:/Users/User/Desktop/coursera/001.csv",header=T)
columnmean<-function(y, removeNA=TRUE){ #y will be a matrix
whichnumeric<-sapply(y, is.numeric)#which columns are numeric
onlynumeric<-y[ , whichnumeric] #selecting just the numeric columns
nc<-ncol(onlynumeric) #lenght of onlynumeric
means<-numeric(nc)#empty vector for the means
for(i in 1:nc){
means[i]<-mean(onlynumeric[,i], na.rm = TRUE)
}
}
columnmean(prueba)
When I run my data without using the function(), but I use row by row with my data it will give me the mean values. Nevertheless if I try to use the function so it will make all the steps by itself, it wont mark me error but it also won't compute any value, as in my environment the dataframe 'prueba' and the columnmean function
what am I doing wrong?
A reproducible example would be nice (although not absolutely necessary in this case).
You need a final line return(means) at the end of your function. (Some old-school R users maintain that means alone is OK - R automatically returns the value of the last expression evaluated within the function whether return() is specified or not - but I feel that using return() explicitly is better practice.)
colMeans(y[sapply(y, is.numeric)], na.rm=TRUE)
is a slightly more compact way to achieve your goal (although there's nothing wrong with being a little more verbose if it makes your code easier for you to read and understand).
The result of an R function is the value of the last expression. Your last expression is:
for(i in 1:nc){
means[i]<-mean(onlynumeric[,i], na.rm = TRUE)
}
It may seem strange that the value of that expression is NULL, but that's the way it is with for-loops in R. The means vector does get changed sequentially, which means that BenBolker's advice to use return(.) is correct (as his advice almost always is.) . For-loops in R are a notable exception to the functional programming paradigm. They provide a mechanism for looping (as do the various *apply functions) but the commands inside the loop exert their effects in the calling environment via side effects (unlike the apply functions).

Dealing with transposed data in R

I have some functions like this:
myf = function(x) {
# many similar statements involving indexing x
do1(x[, indexfunc1()])
do2(x[, indexfunc1()])
do3(x[, indexfunc1()])
do4(x[, indexfunc1()])
do5(x[, indexfunc1()])
}
In all these functions, I need extract columns or rows
of x, and these functions are used in some loops.
The problem is sometimes we also have data in a transposed
format, so this means for these data we have to get t(x).
This is very ineffecient and very time consuming since
these matrices are often huge.
Is there a smart way to deal with this? It would be very annoying
to have to change code manually.
Well, first of all, if your doX functions expect the transpose of the matrix, you are going to be calling t somewhere, for example
do1(t(x[indexfunc(),])))
So your options are:
Transpose x once at the top
Transpose at each doX call
Rewrite your doX functions so they take an optional isTranspose argument.
Option 3 will be the most work, but also the most efficient. The situation where it would make sense to use option 2 is if x is huge, but you are only selecting a small number of rows/cols each time. In which case you could do something like this:
matrixSelect<-function(x,subset,dim=1){
if(dim==1)
t(x[subset,])
else
x[,subset]
}
and then write
myf = function(x,dim=2) {
# many similar statements involving indexing x
do1(matrixSelect(x,indexfunc1(),dim)
# etc
}

How to properly loop without eval, parse, text=paste("... in R

So I had a friend help me with some R code and I feel bad asking because the code works but I have a hard time understanding and changing it and I have this feeling that it's not correct or proper code.
I am loading files into separate R dataframes, labeled x1, x2... xN etc.
I want to combine the dataframes and this is the code we got to work:
assign("x",eval(parse(text=paste("rbind(",paste("x",rep(1:length(toAppend)),sep="",collapse=", "),")",sep=""))))
"toAppend" is a list of the files that were loaded into the x1, x2 etc. dataframes.
Without all the text to code tricks it should be something like:
x <- rbind(##x1 through xN or some loop for 1:length(toAppend)#)
Why can't R take the code without the evaluate text trick? Is this good code? Will I get fired if I use this IRL? Do you know a proper way to write this out as a loop instead? Is there a way to do it without a loop? Once I combine these files/dataframes I have a data set over 30 million lines long which is very slow to work with using loops. It takes more than 24 hours to run my example line of code to get the 30M line data set from ~400 files.
If these dataframes all have the same structure, you will save considerable time by using the 'colClasses' argument to the read.table or read.csv steps. The lapply function can pass this to read.* functions and if you used Dason's guess at what you were really doing, it would be:
x <- do.call(rbind, lapply(file_names, read.csv,
colClasses=c("numeric", "Date", "character")
)) # whatever the ordered sequence of classes might be
The reason that rbind cannot take your character vector is that the names of objects are 'language' objects and a character vector is ... just not a language type. Pushing character vectors through the semi-permeable membrane separating 'language' from 'data' in R requires using assign, or do.call eval(parse()) or environments or Reference Classes or perhaps other methods I have forgotten.

returning different data frames in a function - R

Is it possible to return 4 different data frames from one function?
Scenario:
I am trying to read a file, parse it, and return some parts of the file.
My function looks something like this:
parseFile <- function(file){
carFile <- read.table(file, header=TRUE, sep="\t")
carNames <- carFile[1,]
carYear <- colnames(carFile)
return(list(carFile,carNames,carYear))
}
I don't want to have to use list(carFile,carNames,carYear). Is there a way return the 3 data frames without returning them in a list first?
R does not support multiple return values. You want to do something like:
foo = function(x,y){return(x+y,x-y)}
plus,minus = foo(10,4)
yeah? Well, you can't. You get an error that R cannot return multiple values.
You've already found the solution - put them in a list and then get the data frames from the list. This is efficient - there is no conversion or copying of the data frames from one block of memory to another.
This is also logical, the return from a function should conceptually be a single entity with some meaning that is transferred to whatever function is calling it. This meaning is also better conveyed if you name the returned values of the list.
You could use a technique to create multiple objects in the calling environment, but when you do that, kittens die.
Note in your example carYear isn't a data frame - its a character vector of column names.
There are other ways you could do that, if you really really want, in R.
assign('carFile',carFile,envir=parent.frame())
If you use that, then carFile will be created in the calling environment. As Spacedman indicated you can only return one thing from your function and the clean solution is to go for the list.
In addition, my personal opinion is that if you find yourself in such a situation, where you feel like you need to return multiple dataframes with one function, or do something that no one has ever done before, you should really revisit your approach. In most cases you could find a cleaner solution with an additional function perhaps, or with the recommended (i.e. list).
In other words the
envir=parent.frame()
will do the job, but as SpacedMan mentioned
when you do that, kittens die
The zeallot package does what you need in a similar that Python can unpack variables from a function. Reproducible example below.
parseFile <- function(){
carMPG <- mtcars$mpg
carName <- rownames(mtcars)
carCYL <- mtcars$cyl
return(list(carMPG,carName,carCYL))
}
library(zeallot)
c(myFile, myName, myYear) %<-% parseFile()

Resources