Having trouble generalizing a data cleaning solution to a function in R

I have a question about generalizing some code into a function in R. Below is the code I want to generalize:
#file name information
years <- c("_1999.XPT","_2003.XPT","_2005.XPT","_2007.XPT","_2009.XPT","_2011.XPT","_2013.XPT","_2015.XPT")
#create initial frame
assign("diabetes", get(paste0("diabetes", years[1])))
#binding rest of frames
for(i in 2:length(years))
{
  update_frame <- bind_rows(get("diabetes"), get(paste0("diabetes", years[i])))
  assign("diabetes", update_frame)
}
The basic idea is that I want to do a vertical join (bind_rows) of multiple year files into a single dataframe.
My attempted solution to this looks something like this:
big_bind <- function(name)
{
  #create initial frame
  assign(name, get(paste0(name, years[1])))
  #binding rest of frames
  for(i in 2:length(years))
  {
    update_frame <- bind_rows(get(name), get(paste0(name, years[i])))
    assign(name, update_frame)
  }
}
big_bind("diabetes")
The solution above doesn't work, which leaves me stumped, because it works if I swap out the name variable for "diabetes". To be a little more specific, the code runs without errors but doesn't do anything. I think it has something to do with how R defines variables inside functions. Does anybody see what I'm missing, or have a solution?
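For reference, one way to sidestep the assign()/get() issue entirely is to have the function return the combined frame and assign the result at the call site, since assign() inside a function writes to the function's own environment rather than the caller's. A minimal sketch, assuming the per-year frames exist in the calling environment under names like diabetes_1999.XPT and that dplyr is available:
big_bind <- function(name, years) {
  #collect the per-year frames from the caller's environment by name
  frames <- mget(paste0(name, years), envir = parent.frame())
  #stack them vertically and return the result instead of assign()-ing it
  dplyr::bind_rows(frames)
}
diabetes <- big_bind("diabetes", years)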

Related

Renaming variables within a function for multiple tables

I'm trying to rename multiple variables which show up in a few different files I'm working with. In this example I'll just show one renaming rule. Here's the code:
renaming <- function(dataset){
  names(dataset)[names(dataset)=="Lookup Code...3"] <- "Recipient Code"
  .
  .
  .
}
data <- read_excel("File.xlsx",sheet = "Sheet name")
renaming(data)
In the above example I am passing in one dataset. At this point the variable is not being renamed. I'm new to making functions in R, so maybe my syntax is off somewhere.
Once that problem is resolved, I would like to be able to pass a list into this function. I would like to do this using a for loop, which would look something like this:
dataset_list <- c("Data","Data_1",...)
for(i in 1:length(dataset_list)){
  renaming(dataset_list[i])
}
I made an attempt at a for loop similar to this, but the dataset doesn't seem to get picked up and passed into the function.
I appreciate the help and if you need clarification on this please ask.
You can try -
renaming <- function(dataset){
  names(dataset)[names(dataset)=="Lookup Code...3"] <- "Recipient Code"
  #Some other code
  #Some more code
  #Return the changed dataset
  dataset
}
#Get all the filenames in a vector
filenames <- list.files(pattern = '\\.xlsx$')
#apply the function to each file
list_data <- lapply(filenames, function(x) {
  renaming(readxl::read_excel(x))
})
list_data will be a list of data frames, where each data frame has the changed column name and whatever other code you write in renaming applied to it. You can access the individual data frames with list_data[[1]], list_data[[2]], etc.
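If the cleaned frames then need to be stacked into a single data frame, a short follow-up sketch (assuming dplyr is available; this step is not part of the original answer):
#keep track of which file each frame came from, then stack them
names(list_data) <- filenames
combined <- dplyr::bind_rows(list_data, .id = "source_file")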

Simple for loop to add columns to dataframes in a list not working, what is going on?

I am new to R but not to programming. I know there is probably a function that does this already, but I want to know why this does not work.
grids <- list(grid_all_2011, grid_all_2017, grid_tenyear_2011, grid_tenyear_2017)
for (i in seq_along(grids)) {
  grids[[i]]$test <- "test"
}
The output is just an object with the name i; it doesn't add the test column to any of the dataframes.
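For what it's worth, a sketch of what is likely happening, assuming the goal is to change the original data frames: list() stores copies, so the loop does add the column, but only to the copies held in grids (check grids[[1]]), not to grid_all_2011 and friends. Writing the modified frames back over the originals needs the list elements to be named, for example:
#name the list elements after the original objects so they can be written back
grids <- list(grid_all_2011     = grid_all_2011,
              grid_all_2017     = grid_all_2017,
              grid_tenyear_2011 = grid_tenyear_2011,
              grid_tenyear_2017 = grid_tenyear_2017)
for (i in seq_along(grids)) {
  grids[[i]]$test <- "test"   #modifies the copy held in the list
}
#overwrite the originals with the modified copies
list2env(grids, envir = globalenv())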

Adding columns to data frame via user-defined function

I am trying to add columns to several data frames. I am trying to create a function that adds the columns, which I would then use with lapply over a list of objects. The function currently just adds empty columns to the data frame, but once I solve the problem below I would like to extend it to populate the new columns automatically (while still keeping the initial name of the object).
This is the code I have so far:
AAA_Metadata <- data.frame(AAA_Code=character(), AAA_REV4=character(), AAA_CONCEPT=character(),
                           AAA_Unit=character(), AAA_Date=character(), AAA_Vintage=character())
add_empty_metadata <- function(x) {
  temp_dataframe <- setNames(data.frame(matrix(ncol=length(AAA_Metadata), nrow=nrow(x))),
                             as.list(colnames(AAA_Metadata)))
  x <- cbind(temp_dataframe, x)
}
However, when I run this
a <- data.frame(matrix(ncol=6,nrow=100))
add_empty_metadata(a)
and look at the Global Environment, object "a" still has 6 columns instead of 12.
I understand that I am actually working on a copy of "a" within the function (based on the other topics I checked, e.g. Update data frame via function doesn't work). So I tried:
x <<- cbind(temp_dataframe,x)
and
x <- cbind(temp_dataframe,x)
assign('x',x, envir=.GlobalEnv)
But neither of those works. I want to have the new a in the Global Environment for future reference and keep the name 'a' unchanged. Any idea what I am doing wrong here?
Is this what you're looking for:
addCol <- function(x, newColNames){
  for(i in newColNames){
    x[,i] <- NA
  }
  return(x)
}
a <- data.frame(matrix(ncol=6, nrow=100)); dim(a)
a <- addCol(a, newColNames = names(AAA_Metadata)); dim(a)
An amazing source for this kind of thing is Advanced R by Hadley Wickham, which is also available free online.
R objects are effectively immutable: they don't change in place, they get rebuilt and rebound to the same name. a is never changed here. It is used as an input to the function, and unless the object built inside the function is returned and the result assigned by the caller, the version inside the function (which lives in a separate environment) is thrown away when the function completes.
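Applied to the function from the question, the same return-and-reassign pattern would look something like this (a sketch, keeping the AAA_Metadata columns defined above):
add_empty_metadata <- function(x) {
  temp_dataframe <- setNames(data.frame(matrix(ncol = ncol(AAA_Metadata), nrow = nrow(x))),
                             colnames(AAA_Metadata))
  cbind(temp_dataframe, x)   #the last expression is the return value
}
a <- add_empty_metadata(a)   #reassign 'a' in the calling environment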

How should I pass a data frame between functions in a package, perhaps using a special environment

I have a data frame that I want to use, unmodified, in a number of functions in my package.
I could create a function that returns the data frame when required, e.g.
lookup <- function() { return(data.frame(id=c(1,2,3), text=c("Alpha","Beta","Gamma"))) }
test <- function() {
  print(lookup())
}
Or I could write a global variable
lookup <<- data.frame(id=c(1,2,3), text=c("Alpha","Beta","Gamma"))
test <- function() {
  print(lookup)
}
Or I could create the data frame in my main function and pass it to every function that needs it.
main <- function() {
  lookup <- data.frame(id=c(1,2,3), text=c("Alpha","Beta","Gamma"))
  test(lookup)
}
test <- function(lookup) {
  print(lookup)
}
Option one doesn't feel right as I'm re-creating the data frame from scratch each time I call the function.
Option two is bad practice as a package shouldn't rely on global variables as the caller may be using that variable themselves.
Is option three the right way to do this and is just the way things work by design in a functional language?
Or is there a way where I don't need to pass the data frame to the called function? Some magic involving environments perhaps? i.e. Can I create the data frame in a special environment and then refer to that environment in the function without passing it?
Something like this:
main <- function() {
  lookup <- data.frame(id=c(1,2,3), text=c("Alpha","Beta","Gamma"))
  my_special_environment <- add(lookup)   #pseudocode
  test()
}
test <- function() {
  print(my_special_environment.get(lookup))   #pseudocode
}
Thanks
Note: I posted a similar question previously, but that one related to large data sets, not small ones as in this example.
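For what it's worth, one common pattern for the "special environment" idea is an environment created at the top level of the package and used only for storage; a minimal sketch (the name .pkg_cache is just for illustration):
#created once when the package is loaded; not exported, so it never touches
#the user's global environment
.pkg_cache <- new.env(parent = emptyenv())

main <- function() {
  lookup <- data.frame(id=c(1,2,3), text=c("Alpha","Beta","Gamma"))
  assign("lookup", lookup, envir = .pkg_cache)
  test()
}

test <- function() {
  print(get("lookup", envir = .pkg_cache))
}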

calling objects in nested function R

First off, I'm an R beginner taking an R programming course at the moment. It is extremely lacking in teaching the fundamentals of R, so I'm trying to teach myself via you wonderful contributors on Stack Overflow. I'm trying to figure out how nested functions work, which means I also need to learn how lexical scoping works. I've got a function that computes the complete cases in multiple CSV files and spits out a nice table.
Here's the CSV files:
https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip
And here's my code. I realize it'd be cleaner if I used the apply family, but it works as is:
complete <- function(directory, id = 1:332){
  data <- NULL
  for (i in 1:length(id)) {
    data[[i]] <- c(paste(directory, "/", formatC(id[i], width=3, flag=0),
                         ".csv", sep=""))
  }
  cases <- NULL
  for (d in 1:length(data)) {
    cases[[d]] <- c(read.csv(data[d]))
  }
  df <- NULL
  for (c in 1:length(cases)){
    df[[c]] <- (data.frame(cases[c]))
  }
  dt <- do.call(rbind, df)
  ok <- (complete.cases(dt))
  finally <- as.data.frame(table(dt[ok, "ID"]), colnames=c("id", "nobs"))
  colnames(finally) <- c('id', 'nobs')
  return(finally)
}
I am now trying to call the different variables in the dataframe finally that is the output of the above function within this new function:
corr <- function(directory, threshold = 0){
  complete(directory, id = 1:332)
  finally$nobs
}
corr('specdata')
Without finally$nobs this function spits out the data frame, as it should, but when I try to call the variable nobs in the object finally, it says object finally is not found. I realize this problem is due to my lack of understanding of lexical scoping; my professor hasn't made it very clear, so I'm not totally sure how to find the object within the nested function environment. Any help would be great.
The object finally is only in scope within the function complete(). If you want to do something further with the object you are returning, you need to store it in a variable in the environment you are working in (in this instance, that environment is the body of corr(); if we weren't working inside any function, it would be the global environment). In other words, this code should work:
corr <- function(directory, threshold = 0){
  this.finally <- complete(directory, id = 1:332)
  this.finally$nobs
}
I am calling the object that is returned by complete() this.finally to help distinguish it from the object finally that is now out of scope. Of course, you can call it anything you like!
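As a small usage sketch (an assumption about how corr() might then be used): the last expression of corr() is its return value, so the nobs counts can be captured in the calling environment like any other result.
nobs_per_id <- corr("specdata")
head(nobs_per_id)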
