Anonymous function in lapply - r

I am reading Wickham's Advanced R book. This question is relating to solving Question 5 in chapter 12 - Functionals. The exercise asks us to:
Implement a version of lapply() that supplies FUN with both the name and value of each component.
Now, when I run the code below, I get the expected answer for one column.
c(class(iris[1]),names(iris[1]))
Output is:
"data.frame" "Sepal.Length"
Building on the code above, here's what I did:
lapply(iris,function(x){c(class(x),names(x))})
However, I only get the output from class(x) and not from names(x). Why is this the case?
I also tried paste() to see whether it works.
lapply(iris,function(x){paste(class(x),names(x),sep = " ")})
I only get class(x) in the output. I don't see names(x) being returned.
Why is this the case? Also, how do I fix it?
Can someone please help me?

Instead of going over the data frame directly you could switch things around and have lapply go over a vector of the column names,
data(iris)
lapply(colnames(iris), function(x) c(class(iris[[x]]), x))
or over an index for the columns, referencing the data frame.
lapply(1:ncol(iris), function(x) c(class(iris[[x]]), names(iris[x])))
Notice the use of both single and double square brackets.
iris[[n]] extracts the nth element of the list iris (a data frame is just a particular kind of list): the bare column vector, without its column name, making something like mean(iris[[1]]) possible. This is also what lapply(iris, ...) passes to the function, which is why names(x) returned NULL in your attempt.
iris[n] returns a one-column data frame that still carries the column name, making something like names(iris[1]) possible.
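For the exercise itself, one possible sketch follows the index-based idea above: loop over the positions and hand FUN both the name and the value. The helper name lapply_named and its argument order are made up for illustration, and the sketch assumes X has names.

```r
# Hypothetical helper (not from the book): calls FUN(name, value, ...) for
# each component of a named list or data frame.
lapply_named <- function(X, FUN, ...) {
  nms <- names(X)
  out <- lapply(seq_along(X), function(i) FUN(nms[[i]], X[[i]], ...))
  names(out) <- nms  # keep the names on the result, as lapply() does
  out
}

res <- lapply_named(iris, function(name, value) c(class(value), name))
res$Sepal.Length      # "numeric" "Sepal.Length"
names(iris[[1]])      # NULL: the bare column carries no names
```

Because FUN receives the name as an ordinary argument, it no longer has to recover it from the value.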


Understanding the logic of R code [closed]

I am learning R through tutorials, but I have difficulties in "how to read" R code, which in turn makes it difficult to write R code. For example:
dir.create(file.path("testdir2","testdir3"), recursive = TRUE)
vs
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
While I know what these lines of code do, I cannot read or interpret the logic of each line, whether I read left to right or right to left. What strategies should I use when reading/writing R code?
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
Don't let lines of code like this ruin writing R code for you
I'm going to be honest here. The code is bad. And for many reasons.
Not a lot of people can read a line like this and intuitively know what the output is.
The point is you should not write lines of code that you don't understand. This is not Excel; you are not limited to one single line to fit everything into. You have a whole deliciously large script, an empty canvas. Use that space to break your code into smaller bits that make a beautiful mosaic piece of art! Let's dive in~
Dissecting the code: Data Frames
Reading a line of code is like looking at a face for familiar features. You can read left to right, middle to out, whatever -- as long as you can lock onto something that is familiar.
Okay you see data.combined. You know (hope) it has rows and columns... because it's data!
You spot a $ in the code and you know it has to be a data.frame. This is because only lists and data.frames (which are really just lists) allow you to subset columns using $ followed by the column name. Subset, by the way, just means looking at a portion of the whole. In R, subsetting for data.frames and matrices can be done using single brackets [ ], within which you will see [row, column]. Thus if we type data.combined[1,2], it would give you the value in row 1 of column 2.
Now, if you knew that the name of column 2 was name you can use data.combined[1,"name"] to get the same output as data.combined$name[1]. Look back at that code:
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
Okay, so now our eyes should be locked on data.combined[SOMETHING IS IN HERE?!] and slowly be picking out data.combined[ ?ROW? , Oh the "name" column]. Cool.
Finding those ROW values!
which(duplicated(as.character(data.combined$name)))
Anytime you see the which function, it is just giving you locations. An example: for the numeric vector a = c(1,2,2,1), which(a == 1) would give you 1 and 4, the locations of the 1s in a.
Now duplicated is simple too. duplicated(a) (which is just duplicated(c(1,2,2,1))) will give you back FALSE FALSE TRUE TRUE. If we ran which(duplicated(a)) it would return 3 and 4. Now here is a secret you will learn. If you have TRUEs and FALSEs, you don't need to use the which function, because a logical vector can be used for subsetting directly! So which was unnecessary here. And so was as.character, since duplicated works on numbers and strings alike.
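The behaviour described above can be checked in the console with the same toy vector a. The names by_logical and by_position are only for illustration.

```r
a <- c(1, 2, 2, 1)

which(a == 1)          # 1 4
duplicated(a)          # FALSE FALSE TRUE TRUE
which(duplicated(a))   # 3 4

# Subsetting with the logical vector or with the positions picks the same elements.
by_logical  <- a[duplicated(a)]
by_position <- a[which(duplicated(a))]
identical(by_logical, by_position)   # TRUE
```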
What You Should Be Writing
Who am I to tell you how to write code? But here's my take.
Don't mix up ways of subsetting: use EITHER data.frame[,column] or data.frame$column...
The code could have been written a little bit more legibly as:
dupes <- duplicated(data.combined$name)
dupe.names <- data.combined$name[dupes]
or equally:
dupes <- duplicated(data.combined[,"name"])
dupe.names <- data.combined[dupes,"name"]
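To see that the shorter version returns the same thing as the original one-liner, here is a small check on a made-up data frame. The rows in data.combined below are invented for illustration; only the column name "name" comes from the question.

```r
# Toy stand-in for data.combined (hypothetical data).
data.combined <- data.frame(
  name = c("Ann", "Bob", "Ann", "Cy", "Bob"),
  age  = c(30, 41, 30, 25, 41),
  stringsAsFactors = FALSE
)

# The original one-liner.
original <- as.character(
  data.combined[which(duplicated(as.character(data.combined$name))), "name"]
)

# The two-step rewrite.
dupes      <- duplicated(data.combined$name)
dupe.names <- data.combined$name[dupes]

identical(original, dupe.names)   # TRUE
dupe.names                        # "Ann" "Bob"
```

If name were stored as a factor, the outer as.character in the original would still matter, so the equivalence here assumes a character column.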
I know this was lengthy but I hope it helps.
An easier way to read any code is to break it up into its components.
dup.names <-
  as.character(
    data.combined[which(
      duplicated(
        as.character(
          data.combined$name
        )
      )
    ), "name"]
  )
For each of the functions (those parts with round brackets following them, e.g. as.character()) you can learn more about what they do and how they work by typing ?as.character in the console.
Square brackets [] are used to subset data frames, which are stored in your environment (the box to the upper right if you're using R within RStudio contains your values as well as any defined functions). In this case, you can tell that data.combined is the name that has been given to such a data frame in this example (type ?data.frame to find out more about data frames).
"Unwrapping" long lines of code can be daunting at first. Start by breaking it down into parentheses, brackets, and commas. Parentheses directly tacked onto a word indicate a function, and any commas that lie within them (unless they are part of another nested function or bracket) separate the arguments, which modify the way the function behaves. We can reduce your 2nd line to an outer function as.character and its arguments:
dup.names <- as.character(argument_1)
Just from this, we know that dup.names will be assigned a value with the data type "character" off of a single argument.
Two functions in the first line, file.path() and dir.create(), contain a comma to denote two arguments. Arguments can either be a single value or specified with an equal sign. In this case, the output of file.path happens to perform as argument #1 of dir.create().
file.path(argument_1,argument_2)
dir.create(argument_1,argument_2)
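Splitting the first line into its two calls makes the inside-out order visible. This is only a sketch: the names rel_path and target are made up, and tempdir() is used so that nothing is created in your working directory.

```r
# Step 1: the inner call only builds a path string; nothing is created yet.
rel_path <- file.path("testdir2", "testdir3")
rel_path                                # "testdir2/testdir3"

# Step 2: the outer call receives that string as its first argument;
# recursive = TRUE is its second, named argument.
target <- file.path(tempdir(), rel_path)
dir.create(target, recursive = TRUE)
dir.exists(target)                      # TRUE
```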
Brackets are a way of subsetting data frames, with the general notation of dataframe_object[row,column]. Within your second line is a dataframe object, data.combined. You can tell it's a dataframe object from the brackets directly tacked onto it with a row and a column position inside, and knowing this allows you to infer that any functions inside the brackets contribute to subsetting this data frame.
data.combined[row, column]
So from there, we can see that the internal functions within this bracket will produce an output that specifies the rows of data.combined that will contribute to the subset, and that only columns with name "name" will be selected.
Use the help function to start to unpack these lines by discovering what each function does, and what its arguments are.

Working with "..." input in R function

I am putting together an R function that takes some undefined input through the ... argument described in the docs as:
"..." the special variable length argument ***
The idea is that the user will enter a number of column names here, each belonging to a dataset also specified by the user. These columns will then be cross-tabulated in comparison to the dependent variable by tapply. The function is to return a table (independent variable x independent variable).
Thus, I tried:
plotter=function(dataset, dependent_variable, ...)
{
  indi_variables=list(...); # making a list of the ... input as described in the docs
  result=with(dataset, tapply(dependent_variable, indi_variables, mean)); # this fails
}
I figured this should work as tapply can take a list as input.
But it does not in this case ('Error in tapply...arguments must have same length') and I think it is because indi_variables is a list of strings.
If I input the contents of the list by hand and leave out the quotation marks, everything works just fine.
However, if the user feeds the function the column names as non-strings, R will interpret them as variable names; and I cannot figure out how to transform the list indi_variables in the right way, unsuccessfully trying things like this:
indi_variables=lapply(indi_variables, as.factor)
So I am wondering
What causes the error described above? Is my interpretation correct?
How would one go about transforming the list created through ... in the right way?
Is there an overall better way of doing this, in the input or the implementation of tapply?
Any help is much appreciated!
Thanks to Joran's helpful reading, I have come up with these improvements that make things work out...
indi_variables=substitute(list(...));
result=with (dataset, tapply(dependent_variable, eval(indi_variables, dataset), FUN=mean));
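Put together as a complete function, the substitute/eval idea might look like the sketch below. It is not the exact code from the post: it also captures dependent_variable unevaluated (an assumption, so that a bare column name works there too), evaluates both inside dataset, and uses the built-in mtcars data purely as an example.

```r
plotter <- function(dataset, dependent_variable, ...) {
  caller <- parent.frame()                    # where to look up anything not in dataset
  indi_variables <- substitute(list(...))     # unevaluated call, e.g. list(cyl, am)
  dep <- substitute(dependent_variable)       # unevaluated symbol, e.g. mpg
  # Evaluate the captured expressions with the data frame's columns in scope.
  groups <- eval(indi_variables, dataset, caller)
  values <- eval(dep, dataset, caller)
  tapply(values, groups, FUN = mean)
}

# Mean mpg cross-tabulated by number of cylinders and transmission type.
res <- plotter(mtcars, mpg, cyl, am)
dim(res)   # 3 2
```

The user passes bare column names, and no strings ever reach tapply, which is what caused the "arguments must have same length" error in the first attempt.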

Using values from a dataframe to apply a function to a vector

I'll start off by admitting that I'm terrible at the apply functions, and function writing in general, in R. I am working on a course project to clean and model some text data, and I would like to include a step that cleans up contractions.
The qdapDictionaries package includes a contractions data frame with two columns, the first column is the contraction and the second is the expanded version. For example:
contraction expanded
5 aren't are not
I want to use the values in here to run a gsub function on my text, which I still have in a large character element. Something like gsub(contr,expd,text).
Here's an example vector that I am using to test things out:
vct <- c("I've got a problem","it shouldn't be that hard","I'm having trouble 'cause I'm dumb")
I'm stumped on how to loop through the data frame (without actually writing a loop, because it seems like the least efficient way to do it) so I can run all the gsubs that I need.
There's probably a simple answer, but here's what I tried: first, I created a function that would return the expanded version if passed a contraction:
expand <- function(contr) {
expd <- contractions[which(contractions[1]==contr),2]
}
I can use sapply with this and it does work, more or less; looping over the first column in contractions, sapply(contractions[,1],expand) returns a named vector of characters with the expanded phrases.
I can't figure out how to combine this vector with gsub though. I tried writing a second function gsub_expand and changing the expand function to return both the contraction and the expansion:
gsub_expand <- function(list, text) {
text <- gsub(list[[1]],list[[2]],text)
return(text)
}
When I ran gsub_expand(sapply(contractions[,1],expand),vct) it only corrected a portion of my vector.
[1] "I've got a problem" "it shouldn't be that hard" "I'm having trouble because I'm dumb"
The first entry in the contractions data frame is 'cause and because, so the interior sapply doesn't seem to actually be looping. I'm stuck in the logic of what I want to pass to what, and what I'm supposed to loop over.
Thanks for any help.
Two options:
stringr::str_replace_all
The stringr package does mostly the same things you can do with base regex functions, but sometimes in a dramatically simpler way. This is one of those times. You can pass str_replace_all a named list or character vector, and it will use the names as patterns and the values as replacements, so all you need is
library(stringr)
contractions <- c("I've" = 'I have', "shouldn't" = 'should not', "I'm" = 'I am')
str_replace_all(vct, contractions)
and you get
[1] "I have got a problem" "it should not be that hard"
[3] "I am having trouble 'cause I am dumb"
No muss, no fuss, just works.
lapply/mapply/Map and gsub
You can, of course, use lapply or a for loop to repeat gsub. You can formulate this call in a few ways, depending on how your data is stored, and how you want to get it out. Let's first make a copy of vct, because we're going to overwrite it:
vct2 <- vct
Now we can use any of these three:
lapply(1:length(contractions),
function(x){vct2 <<- gsub(names(contractions[x]), contractions[x], vct2)})
# `mapply` is a multivariate version of `sapply`
mapply(function(x, y){vct2 <<- gsub(x, y, vct2)}, names(contractions), contractions)
# `Map` is a multivariate version of `lapply`
Map(function(x, y){vct2 <<- gsub(x, y, vct2)}, names(contractions), contractions)
each of which will return slightly different useless data, but will also save the changes to vct2, which now looks the same as the results of str_replace_all above.
These are a little complicated, mostly because you need to save the updated version of vct2 as you go with each change made. The vct2 <<- writes to the initialized vct2 outside the function's environment, allowing us to capture the successive changes. Be a little careful with <<-; it's powerful. See ?assignOps for more info.
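If you would rather avoid <<-, a plain for loop inside a small function is one alternative. This is a sketch under assumptions: contr_df is a toy stand-in for the two-column contractions data frame, expand_all is a made-up helper name, and fixed = TRUE is added so the contractions are matched literally rather than as regular expressions.

```r
vct <- c("I've got a problem", "it shouldn't be that hard",
         "I'm having trouble 'cause I'm dumb")

# Toy stand-in for the contractions data frame (hypothetical rows).
contr_df <- data.frame(
  contraction = c("I've", "shouldn't", "I'm"),
  expanded    = c("I have", "should not", "I am"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: applies each pattern/replacement pair in turn,
# carrying the partly cleaned text forward in an ordinary local variable.
expand_all <- function(text, patterns, replacements) {
  for (i in seq_along(patterns)) {
    text <- gsub(patterns[i], replacements[i], text, fixed = TRUE)
  }
  text
}

vct2 <- expand_all(vct, contr_df$contraction, contr_df$expanded)
vct2
```

The loop is not slower in any meaningful way than the lapply versions here, since each of them calls gsub once per contraction as well.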

How to loop a list through lapply after the $ sign in r

I have been trying to figure out how to use the apply functions. I know plyr is out there, and I will learn that later, but I need help now. I can get output by actually typing the object name in, but I am trying to loop a list of names through it. The code is as follows:
list<-noquote(c("T","AAVL"))
lapply(list,function(i) xts(l.df$i[,-1:-5],order.by=as.POSIXct(rownames(l.df$i))))
If I just do xts(l.df$T[,-1:-5],order.by=as.POSIXct(rownames(l.df$T)))
I get the xts file that I need. Could someone please help me loop the names without quotes into the lapply(), so that I could have this work for numerous elements in my list? Thank you!
There are a number of ways to subset a list in R. See https://ramnathv.github.io/pycon2014-r/learn/subsetting.html or http://adv-r.had.co.nz/Subsetting.html for more detailed discussion.
However, in your case the issue is that the dollar operator $ takes its right-hand side literally as a name rather than evaluating it as a variable. So myList[["item"]] and myList$item are equivalent. In the example you gave, you're trying to find the member of l.df called "i", not the one referenced by the variable i. The noquote class you used purely affects printing of a character vector; it has no effect on subsetting.
The version of your code that works doesn't work as you explain in your comment. It works because you're now subsetting the column whose name is stored in i not the one called "i".
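A small illustration of the difference, using a made-up list in place of l.df (the two tiny data frames are placeholders, and the xts step is left out):

```r
# Hypothetical stand-in for l.df: a named list of data frames.
l.df <- list(T = data.frame(x = 1:3), AAVL = data.frame(x = 4:6))

i <- "T"
l.df$i        # NULL: looks for an element literally named "i"
l.df[[i]]     # the element whose name is stored in i

# Looping over the names with [[ ]] inside lapply.
tickers <- c("T", "AAVL")
res <- lapply(tickers, function(i) l.df[[i]])
names(res) <- tickers
```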
I want to add 106 new columns to a dataframe that are the length of the df of course and filled with zeros (0). How would I loop over i in this case:
geo <- unique(df$geo)
geo
[1] "AL" "AT1" "AT2" "AT3" "BE1" "BE2" "BE3"
for(i in geo) {
df$i <- v(0,length(df)
}
(Comment by Emil Krabbe)
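The same $ versus [[ ]] point applies to that loop. A possible sketch, with a tiny invented df standing in for the real one:

```r
# Toy stand-in for df (hypothetical rows).
df <- data.frame(geo = c("AL", "AT1", "AL", "BE1"), value = 1:4,
                 stringsAsFactors = FALSE)

geo <- unique(df$geo)
for (i in geo) {
  df[[i]] <- 0   # [[ ]] uses the value of i; the single 0 is recycled to every row
}

names(df)   # "geo" "value" "AL" "AT1" "BE1"
```

Note that length(df) is the number of columns of a data frame, not the number of rows; nrow(df) gives the rows, although here the recycling makes it unnecessary.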

Select a column from a dynamic variable

How can I select the second column of a dynamically named variable?
I create variables of the form "population.USA", "population.Mexico", "population.Canada". Each variable has a column for the year, and another column for the population value. I would like to select the second column from each of these variables during a loop.
I use this syntax:
sprintf("population.%s", country)[, 2]
R returns the error: Error in sprintf("population.%s", country)[, 2] : incorrect number of dimensions
Based on your sequence of questions over the last few minutes, I have two general recommendations for you as you get familiar with R:
Don't use sprintf.
Don't use assign.
Now, obviously, those functions are both useful at times. But you've learned about them too early, before you've mastered some basic stuff about R's data structures. Try to write code without those crutches (for the time being!), as they're just causing you problems.
Rather than creating separate individual variables for each nation's population, place them in a list.
population <- vector("list",3)
names(population) <- c('USA','Mexico','Russia')
Then you can access each using the string representation of the name of each country:
population[['USA']] <- 10000
Or,
region <- 'USA'
population[[region]]
In this example, I've assigned a single value to a list element, but lists will hold any other data type, including matrices or data frames. It will be a lot less typing than using sprintf and assign, and a lot safer and more efficient as well.
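Applied to the original question, a list of data frames lets you pull the second column of each country inside a loop. The numbers below are placeholders, not real population figures, and the column names year and pop are assumptions.

```r
# Placeholder data: one data frame per country, with year and population columns.
population <- list(
  USA    = data.frame(year = c(2000, 2010), pop = c(100, 110)),
  Mexico = data.frame(year = c(2000, 2010), pop = c(50, 60)),
  Canada = data.frame(year = c(2000, 2010), pop = c(20, 25))
)

# Second column of one country, selected by a name held in a variable.
country <- "Mexico"
population[[country]][, 2]   # 50 60

# Second column of every country at once.
second_cols <- lapply(population, function(d) d[, 2])
```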
See ?get. Here is an example:
> country <- "FOO"
> assign(sprintf("population.%s", country), data.frame(runif(5), runif(5)))
>
> get(sprintf("population.%s", country))[,2]
[1] 0.2241105 0.5640709 0.5945869 0.1830719 0.1895938
It is critically important to look at the object returned by a function if you get an error. It is immediately clear why your example fails if you just look at what it returns:
> sprintf("population.%s", country)
[1] "population.FOO"
At that point it would be clear, even if you didn't already know or hadn't thought to read ?sprintf, that sprintf() returns a string, not the object of that name. Armed with that knowledge, you would have narrowed the problem down to: how do I recall an object from a computed name?
