Function to update dataframe name stored in variable - r

I have to convert some dates to character format for a project I am working on. To make the code cleaner, I want to write a function that you pass the name of the dataframe to (and possibly the column name, though in this example it doesn't change, so it can be hard-coded) and that applies the formatting, rather than repeating the full line for each dataframe whose column I am formatting.
Is this possible to do? I have done a lot of googling and can't seem to find an answer.
kpidataRM$Period <- format(kpidataRM$Period, "%b-%y")
kpidataAFM$Period <- format(kpidataAFM$Period, "%b-%y")
kpidataNATIONAL$Period <- format(kpidataNATIONAL$Period, "%b-%y")
kpidataHOD$Period <- format(kpidataHOD$Period, "%b-%y")

To answer your specific question, you could create a very simple function like this:
# The function takes the dataframe itself (df) as input, formats the predefined column (Period),
# and returns the modified dataframe
new_function <- function(df){
  df$Period <- format(df$Period, "%b-%y")
  return(df)
}
and then run
df1 <- new_function(df1)
df2 <- new_function(df2)
for each of your dataframes (in your example df1 would be kpidataRM for instance). If you would like to include the column as a variable as well in your function you can write it like this:
# The function takes the dataframe (df) and the column name as a string (col) and formats that column.
new_function2 <- function(df, col){
  df[[col]] <- format(df[[col]], "%b-%y")
  return(df)
}
However, I would say that this is not the best approach in this case, as you only seem to want to format a set of columns from a set of dataframes in one specific way. What I would propose instead, exactly as Roland suggested, is to make a list of dataframes and iterate through each element. A simple example would look like this:
# Push all your dataframes in a list (dflist)
dflist <- list(df1,df2)
# Apply in this list a function that changes the column format (lapply)
dflist <- lapply(dflist, function(x) {
  x[["Period"]] <- format(x[["Period"]], "%b-%y")
  x  # return the whole dataframe, not just the formatted column
})
Hope this works for you.
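A minimal, self-contained sketch of the list approach follows. The two data frames are made-up stand-ins for kpidataRM and kpidataAFM, and naming the list elements is an addition for illustration, so that each result can be retrieved by name afterwards.

```r
# Made-up stand-ins for the real data frames
kpidataRM  <- data.frame(Period = as.Date(c("2023-01-01", "2023-02-01")), value = 1:2)
kpidataAFM <- data.frame(Period = as.Date("2023-03-01"), value = 3L)

# A named list keeps track of which element is which
dflist <- list(kpidataRM = kpidataRM, kpidataAFM = kpidataAFM)

# Format the Period column in every data frame and return the data frame
dflist <- lapply(dflist, function(x) {
  x[["Period"]] <- format(x[["Period"]], "%b-%y")
  x
})

# The formatted versions live in the list; the originals are untouched
formatted_rm <- dflist[["kpidataRM"]]
```

The month abbreviation produced by "%b" depends on the session locale, so the exact strings may differ between machines.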

Related

Adding rows to dataframe using lapply

I am trying to extract data from a pdf then enter it into a row in a dataframe. I have figured out how to extract the data I want, but I have not yet figured out the last two parts. I've set up a basic function to try with lapply and it gives me a 1 row, 39 observation dataframe with the information I want as characters properly formatted and
filenames <- list.files("C:/Users/.../inputfolder", pattern="*.pdf")
function01 <- function(x) {
df1 <- pdf_text(x) |>
str_squish() |>
mgsub ()|>
etc
}
master_list <- lapply(filenames, function01)
mdf <- as.data.frame(do.call(rbind, master_list))
So right now this works for one pdf, and I'm not quite sure how to make it apply to all files in the folder properly and add the data to the rows of mdf.
You can use purrr::map_dfr.
It calls the function once for each element and row-binds the results into a single data.frame, so each file contributes one row when function01 returns a one-row result:
library(purrr)
mdf <- map_dfr(filenames, function01)
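For comparison, the same row-per-file pattern can be sketched in base R. Here extract_row is a hypothetical stand-in for function01 (it does no PDF parsing) and the file names are made up; the only point is that each call returns a one-row data frame, which do.call(rbind, ...) then stacks. Note also that list.files() returns bare file names unless full.names = TRUE is set, so the real function may need full paths to find the files.

```r
# Hypothetical stand-in for function01: returns ONE row per input file
extract_row <- function(path) {
  data.frame(
    file = basename(path),
    n_chars = nchar(basename(path)),
    stringsAsFactors = FALSE
  )
}

# Made-up file names; real ones would come from list.files(..., full.names = TRUE)
filenames <- c("a.pdf", "report.pdf", "summary.pdf")

# One data frame per file, then stack them row-wise
rows <- lapply(filenames, extract_row)
mdf <- do.call(rbind, rows)
```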

How can I get the column/variable names of a dataframe that fit certain parameters?

I came across a problem in my DataCamp exercise that basically asked "Remove the column names in this vector that are not factors." I know what they -wanted- me to do, and that was to simply do glimpse(df) and manually delete elements of the vector containing the column names, but that wasn't satisfying for me. I figured there was a simple way to store the column names of the dataframe that are factors into a vector. So, I tried two things that ended up working, but I worry they might be inefficient.
Example data Frame:
factorVar <- as.factor(LETTERS[1:10])
df1 <- data.frame(x = 1, y = 1:10, factorVar = sample(factorVar, 10))
My first solution was this:
vector1 <- names(select_if(df1, is.factor))
This worked, but select_if returns an entire tibble of a filtered dataframe and then gets the column names. Surely there's an easier way...
Next, I tried this:
vector2 <- colnames(df1)[sapply(df1,is.factor)]
This also worked, but I wanted to know if there's a quicker, more efficient way of filtering column names based on their type and then storing the results as a vector.
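Two base R variants of the second approach, offered as a sketch rather than a benchmark: vapply() is a type-checked form of sapply(), and Filter() keeps the columns for which the predicate returns TRUE. Both reuse the example data frame from the question; the names factor_cols and factor_cols2 are made up.

```r
factorVar <- as.factor(LETTERS[1:10])
df1 <- data.frame(x = 1, y = 1:10, factorVar = sample(factorVar, 10))

# vapply() guarantees one logical value per column
factor_cols <- names(df1)[vapply(df1, is.factor, logical(1))]

# Filter() subsets the columns directly; names() then extracts the labels
factor_cols2 <- names(Filter(is.factor, df1))
```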

sapply use in conjunction with dplyr's mutate_at

I am trying to clean up some data stored in multiple data frames by applying the same function repeatedly. In this example I am trying to leverage mutate_at from dplyr to convert to Date format all columns whose names contain 'date'.
I have a list of tables in my environment such as:
table_list <- c('table_1','table_2','table_3')
The objective for me is to overwrite each of the tables for which the name is listed in table_list with their corrected version. Instead I can only get the results stored in a large list.
I have so far created a basic function as follows:
fix_dates <- function(df_name){
get(df_name) %>%
mutate_at(vars(contains('date')),
funs(as.Date(.,
origin = "1899-12-30")
))
}
The fix_dates() function works perfectly fine if I feed it one element at a time with for example fix_dates('table_1').
However if I use sapply such as results <- sapply(table_list, fix_dates) then I will find in results list all the tables from table_list at their respective indexes. However I would like to instead have table_1 <- fix_dates('table_1') instead for each of the elements of table_list
Is it possible to have sapply store the results in-place instead?
There's probably a more elegant way to do this, but I think this gets you where you want to go:
# use lapply to get a list with the transformed versions of the data frames
# named in table_list
new_tables <- lapply(table_list, function(x) {
  # a formula lambda replaces funs(), which is deprecated in current dplyr
  mutate_at(get(x), vars(contains("date")), ~ as.Date(., origin = "1899-12-30"))
})
# assign the original df names to the elements of that list
names(new_tables) <- table_list
# use list2env to put those list elements in the current environment, overwriting
# the ones that were there before
list2env(new_tables, envir = environment())
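The same get, transform, write-back pattern can be sketched without dplyr. Everything here is illustrative: the two tables are made up, fix_dates_base is a hypothetical base R replacement for fix_dates, and mget() is used to fetch all named objects at once.

```r
# Made-up tables holding Excel-style serial day numbers
table_1 <- data.frame(id = 1:2, start_date = c(43831, 43832))
table_2 <- data.frame(id = 3L, end_date = 43833)
table_list <- c("table_1", "table_2")

env <- environment()  # the environment the tables live in

# Hypothetical helper: convert every column whose name contains "date"
fix_dates_base <- function(df) {
  date_cols <- grep("date", names(df), ignore.case = TRUE, value = TRUE)
  df[date_cols] <- lapply(df[date_cols], as.Date, origin = "1899-12-30")
  df
}

tables <- mget(table_list, envir = env)   # named list of the data frames
fixed  <- lapply(tables, fix_dates_base)  # names are preserved by lapply
invisible(list2env(fixed, envir = env))   # overwrite the originals by name
```

Because mget() returns a named list, the separate names(new_tables) <- table_list step is not needed in this variant.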

For loop to convert variables to Date format

I have two data frames of identical dimensions and column names.
On both, I want to convert the dates currently stored as characters to dates. Is there any way to automate this using a for loop? I thought of something similar to the following script:
names <- c("old.df", "new.df")
# use Date format
for (i in names) {
i$Date <- as.Date(i$Date, "%d/%m/%Y")
i$Datetime <- as.Date(i$Datetime, "%d/%m/%Y %h:%m:%s.000")
i$ClickDatetime <- as.Date(i$ClickDatetime, "%d/%m/%Y %h:%m:%s.000")
}
This actually doesn't work and returns the following error message:
Error in i$Date : $ operator is invalid for atomic vectors
I don't think I can use the i object in this way. I'm wondering if there is a nice workaround you usually use to achieve the same goal in similar conditions.
Correct, it won't work like that because R sees i as the string, not the dataframe named by the string. Something like this should work:
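df_list <- list(old.df = old.df, new.df = new.df)
# Modify the list elements by index; a loop variable such as df would only be a copy
for (k in seq_along(df_list)) {
  # [[ ]] extracts the column as a vector, which as.Date() needs
  df_list[[k]][["Date"]] <- as.Date(df_list[[k]][["Date"]], "%d/%m/%Y")
  # %H:%M:%S are the hour, minute and second codes
  df_list[[k]][["Datetime"]] <- as.Date(df_list[[k]][["Datetime"]], "%d/%m/%Y %H:%M:%S.000")
  df_list[[k]][["ClickDatetime"]] <- as.Date(df_list[[k]][["ClickDatetime"]], "%d/%m/%Y %H:%M:%S.000")
}
old.df <- df_list[["old.df"]]
new.df <- df_list[["new.df"]]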
There are lots of ways to do this.
With only two dataframes, handling each one individually may be just as good an option. With many dataframes with identical columns, you could stack them with rbind (adding an identifier column that records which dataframe each row belongs to), apply your changes, and then split them apart again. Or put them in a list and build a function that can be used with lapply.
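The stack-and-split idea can be sketched as follows. The data is made up, only the Date column is shown, and the identifier column name source is an assumption chosen for illustration.

```r
# Made-up stand-ins for the two data frames in the question
old.df <- data.frame(Date = c("01/02/2020", "15/03/2020"), stringsAsFactors = FALSE)
new.df <- data.frame(Date = "20/04/2021", stringsAsFactors = FALSE)

# Stack the rows, tagging each with the data frame it came from
stacked <- rbind(
  data.frame(source = "old.df", old.df, stringsAsFactors = FALSE),
  data.frame(source = "new.df", new.df, stringsAsFactors = FALSE)
)

# Convert once for all rows
stacked$Date <- as.Date(stacked$Date, format = "%d/%m/%Y")

# Split back apart by the identifier, dropping the helper column
parts <- split(stacked[, names(stacked) != "source", drop = FALSE], stacked$source)
old.df <- parts[["old.df"]]
new.df <- parts[["new.df"]]
```

Row names in the split pieces carry over from the stacked data frame, so they may need resetting if they matter downstream.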

renaming subset of columns in r with paste0

I have a data frame (my_df) with columns named after individual county numbers. I melted/cast the data from a much larger set to get to this point. The first column is named year and holds the years 1970-2011. The next 3010 columns are counties. However, I'd like to rename the county columns to be "county_" + county number.
This code executes in R but for whatever reason doesn't update the column names; they remain solely the numbers. Any help?
new_col_names = paste0("county_",colnames(my_df[,2:ncol(my_df)]))
colnames(my_df[,2:ncol(my_df)]) = new_col_names
The problem is the subsetting within the colnames call.
Try names(my_df) <- c(names(my_df)[1], new_col_names) instead.
Note: names and colnames are interchangeable for data.frame objects.
EDIT: alternate approach suggested by flodel, subsetting outside the function call:
names(my_df)[-1] <- new_col_names
colnames() is for a matrix (or matrix-like object), try simply names() for a data.frame
Example:
my_df <- data.frame(a=c(1,2,3,4,5), b=rnorm(5), c=rnorm(5), d=rnorm(5))
new_col_names <- paste0("county_", colnames(my_df[,2:ncol(my_df)]))
names(my_df) <- c(names(my_df)[1], new_col_names)
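A small self-contained sketch of both behaviours, using made-up county numbers. The first assignment mirrors the form from the question and leaves the names as they were, as reported there; the second subsets outside the names() call and does rename the columns.

```r
# Made-up data: a year column plus two county columns named by number
my_df <- data.frame(year = 1970:1972, `1001` = rnorm(3), `1003` = rnorm(3),
                    check.names = FALSE)
new_col_names <- paste0("county_", names(my_df)[-1])

# Form from the question: the subset is renamed, but only its values are
# written back into the data frame, so the column names stay the same
unchanged <- my_df
colnames(unchanged[, 2:ncol(unchanged)]) <- new_col_names

# Subsetting outside the call replaces the names themselves
names(my_df)[-1] <- new_col_names
```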
