Fully programmatically rename columns in R with dplyr

I have sensor data, for several different sensor types, in many dataframes. I need to perform inner_joins on the dataframes so that I end up with one dataframe. The column names of the dataframes for a given sensor type are identical, e.g.
> z501h001
timeBgn soilTempMean soilTempVar
1 01:00:00 100 4
2 01:30:00 112 6
3 02:00:00 111 6
> z501h002
timeBgn soilTempMean soilTempVar
1 01:00:00 120 4
2 01:30:00 122 6
3 02:00:00 121 5
except that there are many more columns. The column names differ between sensor types, although all of them have timeBgn in common.
I need (in R) a flexible way to rename the columns (so I can tell which column corresponds to which sensor) based on adding a suffix to the existing column names for all columns except timeBgn (which is the common column by which the inner_join will be done).
Here is the Python / Pandas equivalent of what I am trying to do:
def rename_cols_by_sensor(df, sensor_name):
cols = df.columns
new_cols = [f'{c}_{sensor_name}' if c!='timeBgn' else c for c in cols]
df.columns = new_cols
I found most of a solution here:
programmatically rename columns in dplyr
The problem is that I cannot figure out how to make the cnames vector programmatically. I do not want to hard-code all of the myriad column names. As an example for z501h001 it would need to look like
cnames <- c('soilTempMean' = 'soilTempMean_z501h001', 'soilTempVar' = 'soilTempVar_z501h001')
The suffix (in the example: _z501h001) can be passed to the function, so there is no need to discuss obtaining it here. The original names are easily obtained using names(df). All I need to know is how to put them together in this c("character" = "other_character") format.
I have tried:
rename_by_loc <- function(df, loc) {
old_names <- names(df)
new_names <- c()
loc = z501h001
for (name in old_names) {
if (name != "timeBgn") {
new_names <- c(new_names, paste(name, paste(name, loc, sep="_"), sep = " = ") )
}
}
return(new_names)
}
but that gives me names like "soilTempMean = soilTempMean_z501h001"
I need the = to be outside of the character strings. I have tried a few other things. None have been successful.
This problem is trivial using Pandas which makes me think I am missing something about column renaming in R.
Thanks.

We can use mget to collect, in a named list, the values of all objects whose names match the pattern 'z', three digits, 'h', then three digits. imap then loops over that list and renames every column except 'timeBgn' by concatenating (str_c) the original column name with the object name
library(dplyr)
library(purrr)
library(stringr)
out <- mget(ls(pattern = "^z\\d{3}h\\d{3}$")) %>%
imap(~ {
nm1 <- .y
.x %>%
rename_with(~ str_c(., "_", nm1), -timeBgn)
})
The output is a list. If we need to change the column names in the original objects (not recommended), use list2env
list2env(out, .GlobalEnv)
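Keeping the renamed data frames in a list also makes the final inner join (the stated goal of the question) straightforward. The following is a minimal base R sketch, assuming the list already holds the renamed frames; the two small data frames are rebuilt here from the question's sample values, and `joined` is a name chosen for illustration.

```r
# Sample data from the question, with the sensor suffix already applied
out <- list(
  z501h001 = data.frame(timeBgn = c("01:00:00", "01:30:00", "02:00:00"),
                        soilTempMean_z501h001 = c(100, 112, 111),
                        soilTempVar_z501h001 = c(4, 6, 6)),
  z501h002 = data.frame(timeBgn = c("01:00:00", "01:30:00", "02:00:00"),
                        soilTempMean_z501h002 = c(120, 122, 121),
                        soilTempVar_z501h002 = c(4, 6, 5))
)

# merge() with its default all = FALSE behaves as an inner join;
# Reduce() folds it over the list, joining one frame at a time on timeBgn
joined <- Reduce(function(x, y) merge(x, y, by = "timeBgn"), out)

names(joined)
```

With dplyr loaded, the same fold could presumably be written with `inner_join` in place of `merge`.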
Or using base R
v1 <- ls(pattern = "^z\\d{3}h\\d{3}$")
for(v in v1) {
tmp <- get(v)
i1 <- names(tmp) != 'timeBgn'
names(tmp)[i1] <- paste0(names(tmp)[i1], '_', v)
assign(v, tmp)
}
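For the narrower question of how to build the `c("old" = "new")` vector itself, `setNames` attaches names to a character vector without any string pasting of `=`. This is a sketch only: `make_cnames` and the rewritten `rename_by_loc` are hypothetical helpers, and the `key` argument is an assumption added so the join column is not hard-coded.

```r
# Hypothetical helper: build the named vector c(old = new) for all non-key columns
make_cnames <- function(df, suffix, key = "timeBgn") {
  old <- setdiff(names(df), key)
  setNames(paste(old, suffix, sep = "_"), old)
}

# Hypothetical helper: apply that mapping to the data frame's column names
rename_by_loc <- function(df, suffix, key = "timeBgn") {
  cnames <- make_cnames(df, suffix, key)
  idx <- match(names(cnames), names(df))  # positions of the old names
  names(df)[idx] <- unname(cnames)        # overwrite with the new names
  df
}

z501h001 <- data.frame(timeBgn = c("01:00:00", "01:30:00", "02:00:00"),
                       soilTempMean = c(100, 112, 111),
                       soilTempVar = c(4, 6, 6))

cnames <- make_cnames(z501h001, "z501h001")
renamed <- rename_by_loc(z501h001, "z501h001")
names(renamed)
```

Note that `dplyr::rename` expects the opposite orientation (new name on the left, old name on the right), so for use with `rename(df, !!!x)` the names and values of the vector would need to be swapped.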

Related

Using Dataframe to Automatically create a list of values based off Subproduct

df <- data.frame("date"=
1:4,"product"=c("B","B","A","A"),"subproduct"=c("1","2","x","y"),"actuals"=1:4)
#creates df1,df2,dfx,dfy
for(i in unique(df$subproduct)) {
nam <- paste("df", i, sep = ".")
assign(nam, df[df$subproduct==i,])
}
# CREATES LIST OF DATAFRAMES
# How do I make this so i don't have to manually type list(df.,df.,df.)
list_df <- list(df.1,df.2,df.x,df.y) %>%
lapply( function(x) x[(names(x) %in% c("date", "actuals"))])
# creates df1,df2,df3,df4 only dates and actuals, removes the other column names
for (i in 1:length(list_df)) {
assign(paste0("df", i), as.data.frame(list_df[[i]]))
}
The first for loop creates one df object per unique subproduct. For the list() call, I don't want to have to type df.1, df.2, etc. by hand: with 100 unique subproducts in my data, I would otherwise have to type df.1, df.2, df.x, df.y, df.z, df.zzz, ... over and over again. How would I best do this? (1st question)
The last for loop creates separate dataframe objects containing only date and actuals, which will be used to create a time series for each. How can I put the values of these objects into a single dataframe or a list of dfs? (2nd question)
We can use mget to return the values of the objects whose names are selected by ls. The pattern matches object names that start with 'df', followed by a `.` and one or more alphanumeric characters
mget(ls(pattern = '^df\\.[[:alnum:]]+$'))
If the OP wanted to create those objects in a different env
new_env <- new.env()
list2env(mget(ls(pattern = '^df\\.[[:alnum:]]+$')), envir = new_env)
If we want to create new objects from scratch, do a group_split on the 'subproduct' column, set the names accordingly, and create multiple objects (list2env - not recommended though)
library(dplyr)
library(stringr)
df %>%
group_split(subproduct) %>%
setNames(str_c('df.', c(1, 2, 'x', 'y'))) %>%
list2env(.GlobalEnv)
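If the individual objects are not needed at all, both questions can be handled by never leaving the list. The following base R sketch uses `split` to create one data frame per subproduct and `lapply` to keep only the two columns of interest; the `df.` prefix on the names simply mirrors the naming in the question.

```r
df <- data.frame(date = 1:4,
                 product = c("B", "B", "A", "A"),
                 subproduct = c("1", "2", "x", "y"),
                 actuals = 1:4)

# One data frame per subproduct, keeping only date and actuals
list_df <- lapply(split(df, df$subproduct),
                  function(x) x[c("date", "actuals")])

# Names are derived from the data, so nothing is typed by hand
names(list_df) <- paste0("df.", names(list_df))

names(list_df)
list_df[["df.x"]]
```

Because the names come from `split`, the approach scales to any number of subproducts without hard-coding them.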

I give three arguments, the input df, the column I want to clean,the new column I want to be added with cleansed names. Where am I going wrong?

library(dplyr)
clean_name <- function(df,col_name,new_col_name){
#remove whitespace and common titles.
df$new_col_name <- mutate_all(df,
trimws(gsub("MR.?|MRS.?|MS.?|MISS.?|MASTER.?","",df$col_name)))
#remove any chunks of text where a number is present
df$new_col_name<- transmute_all(df,
gsub("[^\\s]*[\\d]+[^\\s]*","",df$col_name,perl = TRUE))
}
I get the following error
"Error: Column new_col_name must be a 1d atomic #vector or a list"
What you want to do is make sure that the output of the functions you're using is either a vector or a list with only one dimension, so that it can be added as a new column to the desired data frame. You can check the class of an object with the class function from base R.
The mutate function by itself should do what you want, it returns the same data frame but with the new column:
library(dplyr)
clean_name <- function(df, col_name, new_col_name) {
# first_cleaning_to_colname = The first change you want to make to the col_name column. This should be a vector.
# second_cleaning_to_colname = The change you're going to make to the col_name column after the first one. This should be a vector too.
first_change <- mutate(df, col_name = first_cleaning_to_colname)
second_change <- mutate(first_change, new_col_name = second_cleaning_to_colname)
return(second_change)
}
You can make both of these changes at the same time, but I thought it was easier to read this way.
If we are passing unquoted column names, then use
library(tidyverse)
clean_name <- function(df,col_name, new_col_name){
col_name <- enquo(col_name)
new_col_name <- enquo(new_col_name)
df %>%
mutate(!! new_col_name :=
trimws(str_replace_all(!!col_name, "MR.?|MRS.?|MS.?|MISS.?|MASTER.?","")) ) %>%
transmute(!! new_col_name := trimws(str_replace_all(!! new_col_name,
"[^\\s]*[\\d]+[^\\s]*","")))
}
clean_name(dat1, col1, colN)
# colN
#1 one
#2 two
data
dat1 <- data.frame(col1 = c("MR. one", "MS. two 24"), stringsAsFactors = FALSE)
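If the column names are passed as strings instead, no tidy evaluation is needed. The following is a base R sketch of that variant, not the answer's method. Two choices here are assumptions about the intent: the titles are wrapped in word boundaries so that `MRS` is not partially consumed by the `MR` alternative, and the new column is appended rather than replacing the data frame's other columns.

```r
# Sketch: column names are given as character strings
clean_name <- function(df, col_name, new_col_name) {
  x <- df[[col_name]]
  # Remove common titles, with an optional trailing period
  x <- gsub("\\b(MASTER|MISS|MRS|MR|MS)\\b\\.?", "", x, perl = TRUE)
  # Remove any whitespace-delimited chunk that contains a digit
  x <- gsub("\\S*\\d+\\S*", "", x, perl = TRUE)
  df[[new_col_name]] <- trimws(x)
  df
}

dat1 <- data.frame(col1 = c("MR. one", "MS. two 24"), stringsAsFactors = FALSE)
res <- clean_name(dat1, "col1", "colN")
res$colN
```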

Recoding a large number of variables using another data frame in R

I'd like to use a data frame (Df2) to recode the variables of another data frame (Df1), so that the end result is a data frame that contains text like local/international rather than 1s/2s (Df3). Missingness is present in the Df1 data frame, and I'd like to make sure it's represented as NA.
This is a minimal working example, the actual data set contains more than a hundred variables (all of which are of the character class) with between one and fifteen levels. Any help would be much appreciated.
Starting point (dfs)
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Desired outcome (df)
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))
Thoughts, not really code, so far: (If there's a match between a row of the df2 NameOfVariable and a df1 variable name, as well as a match between a row of df2 VariableLevel and a df1 observation, then paste the corresponding row of df2 VariableDef into df1. Wondering if you can use if statements for it.)
if (Df2["NameOfVariable"]==names(Df1))
{
if (Df2["VariableLevel"]==Df1[ ])
{
Df1[ ] <- paste0("VariableDef")
}
}
Here is one method in base R using match and Map. Map applies a function to corresponding elements of its list arguments. Here there are two: Df1 (a list of columns) and a list made of the second and third columns of Df2, split by the first column. The second list is reordered to match the order of the names in Df1.
The applied function matches the elements of a Df1 column against the VariableLevel column of the corresponding piece of Df2 and uses the result as an index to return the corresponding VariableDef values. Map returns a list, which is converted to a data.frame with the function of the same name.
data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])],
Df1,
split(Df2[2:3], Df2[1])[names(Df1)]))
this returns
buyer_Q1 seller_Q2 price_Q1_2
1 local internat 50-100K
2 internat local 100-200K
3 local NA 200+K
4 local internat 100-200K
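To make the match step easier to follow, the same lookup can be unrolled into an explicit loop. This is a sketch using the question's data (created with `stringsAsFactors = FALSE`); the names `lookup`, `key` and `idx` are illustrative. Since the question asks for missingness to appear as real `NA`, the literal string "NA" from Df2 is also converted, which is an addition to the answer above.

```r
Df1 <- data.frame(buyer_Q1 = c(1, 2, 1, 1),
                  seller_Q2 = c(2, 1, 3, 2),
                  price_Q1_2 = c(2, 5, 7, 5))
Df2 <- data.frame(
  NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                     "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
  VariableLevel = c(1, 2, 1, 2, 3, 2, 5, 7),
  VariableDef = c("local", "internat", "local", "internat", "NA",
                  "50-100K", "100-200K", "200+K"),
  stringsAsFactors = FALSE)

# One small lookup table per variable name
lookup <- split(Df2[c("VariableLevel", "VariableDef")], Df2$NameOfVariable)

Df3 <- Df1
for (v in names(Df1)) {
  key <- lookup[[v]]
  idx <- match(Df1[[v]], key$VariableLevel)  # row of the lookup for each value
  vals <- key$VariableDef[idx]               # unmatched values become NA
  vals[vals %in% "NA"] <- NA_character_      # literal "NA" text to real NA
  Df3[[v]] <- vals
}
Df3
```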
Solution using a loop and factors. Be careful: the results look equivalent but they are not. The function fun returns a data frame with factor columns. If needed, you can convert them to character.
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))
fun <- function(df, mdf) {
for (varn in names(df)) {
dat <- mdf[mdf$NameOfVariable == varn & !is.na(mdf$VariableDef),]
df[[varn]] <- factor(df[[varn]], dat$VariableLevel, dat$VariableDef)
}
return(df)
}
fun(Df1, Df2)
Df3
A solution using dplyr and tidyr. The code works even though it emits warning messages, which arise because the columns are factors. If you don't want to see the warnings, set stringsAsFactors = FALSE when creating the data frames, as in the example data provided below.
library(dplyr)
library(tidyr)
Df3 <- Df1 %>%
mutate(ID = 1:n()) %>%
gather(NameOfVariable, VariableLevel, -ID) %>%
left_join(Df2, by = c("NameOfVariable", "VariableLevel")) %>%
select(-VariableLevel) %>%
spread(NameOfVariable, VariableDef) %>%
select(-ID)
Df3
buyer_Q1 price_Q1_2 seller_Q2
1 local 50-100K internat
2 internat 100-200K local
3 local 200+K NA
4 local 100-200K internat
DATA
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),
"seller_Q2"=c(2,1,3,2),
"price_Q1_2"=c(2,5,7,5),
stringsAsFactors = FALSE)
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),
"VariableLevel"=c(1,2,1,2,3,2,5,7),
"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"),
stringsAsFactors = FALSE)

assign dynamic variable names using a loop and mutate from dplyr

I would like to split a character field into individual variables, one for each character in a string.
library(dplyr)
temp1 <- data.frame(a = c('dedefdewfe' , 'rewewqreqw'))
for(i in 1:10){
temp1 <- temp1 %>%
mutate(paste('v' , i , ,sep = '') = substr(a , i , i))
}
The resulting dataframe would have 11 variables, the original a , v1 through v10
tidyr::separate is good for this. You can't split on an empty string, but you can specify splitting positions ...
library(tidyr)
library(dplyr)
temp1 %>%
mutate(b=a) %>% ## make a copy
separate(b,into=paste0("v",1:10),sep=1:9)
(probably better practice to use nc <- nchar(temp1$a[1]) and then use nc, nc-1 instead of 10, 9 respectively)
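The loop from the question can also be made to work directly in base R, because `[[<-` accepts a column name computed at run time, which `mutate` with a plain `=` does not. A minimal sketch, assuming the number of columns should follow the longest string in `a`:

```r
temp1 <- data.frame(a = c("dedefdewfe", "rewewqreqw"), stringsAsFactors = FALSE)

nc <- max(nchar(temp1$a))  # number of character positions to extract

for (i in seq_len(nc)) {
  # paste0("v", i) is evaluated first, then used as the new column's name
  temp1[[paste0("v", i)]] <- substr(temp1$a, i, i)
}

names(temp1)
```

Strings shorter than `nc` would get empty strings in the trailing columns, since `substr` returns "" past the end of its input.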

combine multiple dataframes based on sequence of names

Say I have 30 dataframes all named with a date from 01/01/2000 to 30/01/2000 in the format of ddmmyy (code below) :
Season <- seq(as.Date("2000-01-01"),as.Date("2000-01-30"),1)
Season <- format(Season,"%d%m%y")
for (s in Season) {
df <- data.frame(X=1:10, Y=1:10)
aa <- paste(s,"tests",s ,sep = "_")
assign(aa,df)
}
Each name, as you can see, has the word tests added to it. I want to combine (rbind?) the data.frames based on the date. In this case, combine the data.frames for the dates from 01-01-00 to 10-01-00.
I have the below code to combine all dataframes but what if I only want to select the ones shown above?
All_dfs <- do.call(rbind, eapply(.GlobalEnv,function(x) if(is.data.frame(x)) x))
Is it better to create a list first?
We can use mget to get the values of the objects in a list and then rbind that list of data.frames. Each object name is a 'Season' value followed by "_tests_" and the same 'Season' value again, so we can build the names with paste0 and pass them to mget.
res <- do.call(rbind, mget( paste0(Season[1:10], "_tests_", Season[1:10])))
dim(res)
#[1] 100 2
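On the question of whether to create a list first: doing so allows the selection to be made on actual dates instead of positions. The sketch below keeps the `Date` vector alongside the formatted names; `Season_dates`, `dfs` and `keep` are names chosen for illustration.

```r
Season_dates <- seq(as.Date("2000-01-01"), as.Date("2000-01-30"), 1)
Season <- format(Season_dates, "%d%m%y")

# Build the data frames directly in a named list instead of using assign()
dfs <- setNames(lapply(Season, function(s) data.frame(X = 1:10, Y = 1:10)),
                paste0(Season, "_tests_", Season))

# Select by date range rather than by position
keep <- Season_dates >= as.Date("2000-01-01") &
        Season_dates <= as.Date("2000-01-10")

res <- do.call(rbind, dfs[keep])
dim(res)
```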

Resources