lapply to a list of dataframes only if column exists - R

I have a list of dataframes for which I want to obtain (in a separate dataframe) the row mean of a specified column which may or may not exist in all dataframes of the list. My problem comes when the specified column does not exist in at least one of the dataframes of the list.
Assume the following example list of dataframes:
df1 <- read.table(text = 'X A B C
name1 1 2 3
name2 5 10 4',
header = TRUE)
df2 <- read.table(text = 'X B C A
name1 8 1 31
name2 9 9 8',
header = TRUE)
df3 <- read.table(text = 'X B A E
name1 9 9 29
name2 5 15 55',
header = TRUE)
mylist_old <-list(df1, df2)
mylist_new <-list(df1, df2, df3)
Assume I want to take the row means of column C. The following piece of code works perfectly when the list of dataframes (mylist_old) is composed of elements df1 and df2:
Mean_C <- rowMeans(do.call(cbind, lapply(mylist_old, "[", "C")))
Mean_C <- as.data.frame(Mean_C)
The trouble comes when the list contains at least one dataframe in which column C does not exist, which in my example is the case for df3, that is, for the list mylist_new:
Mean_C <- rowMeans(do.call(cbind, lapply(mylist_new, "[", "C")))
Leads to: "Error in [.data.frame(X[[i]], ...) : undefined columns selected"
One way to circumvent this issue is to exclude df3 from mylist_new. However, my real program has a list of 64 dataframes for which I do not know whether column C exists or not. I would have liked to lapply my piece of code only if column C is detected as existing, that is, apply the command to the list of dataframes but only for those dataframes in which column C exists.
I tried this
if("C" %in% colnames(mylist_new))
{
Mean_C <- rowMeans(do.call(cbind, lapply(mylist_new, "[", "C")))
Mean_C <- as.data.frame(Mean_C)
}
But nothing happens, probably because colnames refers to the list and not to each dataframe of the list. With 64 dataframes, I cannot refer to each "manually" and need an automated procedure.
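Indeed, colnames() on a plain list returns NULL, so the condition above is never TRUE; a per-dataframe check gives a logical vector instead, e.g.:
colnames(mylist_new)
# NULL
sapply(mylist_new, function(x) "C" %in% colnames(x))
# [1]  TRUE  TRUE FALSE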

Here is one option: Filter the list elements and then apply lapply on the filtered list
rowMeans(do.call(cbind, lapply(Filter(function(x) "C" %in% names(x),
mylist_new), `[[`, "C")))
#[1] 2.0 6.5
or using tidyverse without Filtering, but making use of select to ignore the cases where the column is not present
library(tidyverse)
map(mylist_new, ~ .x %>%
select(one_of("C"))) %>% # gives a warning
bind_cols %>%
rowMeans
#[1] 2.0 6.5
It may be better to have some warning that the column is not present
Or without a warning
map(mylist_new, ~ .x %>%
select(matches("^C$"))) %>%
bind_cols %>%
rowMeans
#[1] 2.0 6.5
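If an explicit warning naming the offending list element is preferred, a small base R helper along these lines could be used (get_C is a made-up name, shown only as a sketch):
get_C <- function(x, i) {
  if (!"C" %in% names(x)) {
    warning("column C not found in list element ", i)  # explicit, named warning
    return(NULL)
  }
  x["C"]
}
picked <- Map(get_C, mylist_new, seq_along(mylist_new))
rowMeans(do.call(cbind, Filter(Negate(is.null), picked)))
# [1] 2.0 6.5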

We can use if to check names before we do the subset
rowMeans(do.call(cbind,
lapply(mylist_new, function(x) if('C' %in% names(x)) x['C'] else NA)),na.rm = TRUE)
Or using map_if in purrr 0.3.2
library(purrr)
rowMeans(do.call(cbind,map_if(mylist_new,
function(x) 'C' %in% names(x),
'C', .else=~return(NA))),na.rm = TRUE)
[1] 2.0 6.5

One way is to use purrr::safely: it will return, for each iteration, a list with a result and an error element; then we can transpose, extract the result, and remove the NULL results with compact:
library(tidyverse)
rowMeans(do.call(cbind, transpose(
lapply(mylist_new, safely(`[`), "C"))$result %>% compact()))
# [1] 2.0 6.5
We could also use the otherwise parameter to get an NA result rather than NULL, and we can set na.rm to TRUE in rowMeans.
rowMeans(na.rm = TRUE, do.call(cbind, transpose(
lapply(mylist_new, safely(`[`, otherwise= NA), "C"))$result))
# [1] 2.0 6.5
This was to address your case with minimal modifications. If I had to solve this precise issue, I would do it the following way:
map(mylist_new, "C") %>% compact() %>% pmap_dbl(~mean(c(...)))
# [1] 2.0 6.5
We extract the C element, remove it when it's NULL, and then compute the mean element by element.
This might be more efficient (not sure):
map(set_names(mylist_new), "C") %>% compact() %>% as_tibble() %>% rowMeans()
# [1] 2.0 6.5
One more, using reshaping this time:
map_dfr(mylist_new, ~gather(.,,,-1)) %>%
group_by(X) %>%
filter(key == "C") %>%
summarize_at("value", mean)
# # A tibble: 2 x 2
# X value
# <fct> <dbl>
# 1 name1 2
# 2 name2 6.5
And a base version, quite readable, with a somewhat awkward step where several columns have the same name, but it's on a temp object so that's not that bad:
wide <- do.call(cbind, mylist_new)
rowMeans(wide[names(wide) == "C"])
# [1] 2.0 6.5

Related

Sum numeric sub-dataframe within a list

Here I have an R list of dataframes; all dataframes are in the same format and have the same dimensionality. The first 2 columns are strings, like IDs and names (identical for all dataframes), and the rest are numeric values. I want to sum the numeric parts of all the dataframes element-wise, i.e. the output at index (1,3) is the sum of the values at index (1,3) of all the dataframes.
E.g. given a list L consisting of dataframes x and y, I want to get output like z:
x <-data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(1,2),v2=c(3,4))
y <-data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(5,6),v2=c(7,8))
L <- list(x,y)
z <- data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(1+5,2+6),v2=c(3+7,4+8))
I know how to do this using a for loop, but I want to learn to do it in a more R-like way; by that I mean using vectorized functions, like the apply family.
Currently my idea is to create a new dataframe with only the ID and name columns, then use a global dataframe variable to sum the numeric parts, and finally cbind these 2 parts:
output <- x[,1:2]
num_sum <- matrix(0,nrow=nrow(L[[1]]),ncol=ncol(L[[1]][,-c(1,2)]))
lapply(L,function(a){num_sum <<- a[3:length(a)]+num_sum})
cbind(output,num_sum)
But this approach has some problems I prefer to avoid:
I need to manually set the 2 parts of the output and then manually join them.
lapply() will return a list in which each element is an intermediate num_sum returned by an iteration, which requires much more memory.
I'm using the global variable num_sum to keep track of the progress, but num_sum is not needed later and I have to manually remove it.
If the order of the first two columns is always the same, you can do:
#Get all numeric columns
num <- sapply(L[[1]], is.numeric)
#Sum them across elements of the list
df_num <- Reduce(`+`, lapply(L, `[`, num))
#Get the non-numeric columns and bind them with sum of numeric columns
cbind(L[[1]][!num], df_num)
output
ID name v1 v2
1 aa cc 6 10
2 bb dd 8 12
If they are different, you can use powerjoin to do an inner join on the selected columns and sum the rest with the conflict argument:
library(powerjoin)
sum_inner_join <-
function(x, y) power_inner_join(x, y, by = c("ID", "name"), conflict = ~ .x + .y)
Reduce(sum_inner_join, L)
output
ID name v1 v2
1 aa cc 6 10
2 bb dd 8 12
Using dplyr and purrr (which has somewhat nicer map functions), you could do something like this:
library(purrr)
library(dplyr)
result <- reduce(L, function(x,y){
xVals <- x |> select(-ID, -name)
yVals <- y |> select(-ID, -name)
totalVals <- xVals |> map2(yVals, function(x,y) {
rowSums(cbind(x,y))
})
return(x |> select(ID, name) |> cbind(totalVals))
})
Similar logic to Maël's answer, but squishing it all into a Map call:
data.frame(do.call(Map,
c(\(...) if(is.numeric(..1)) Reduce(`+`, list(...)) else ..1, L)
))
# ID name v1 v2
#1 aa cc 6 10
#2 bb dd 8 12
If the first chunk (..1) of the column is numeric, sum (+) all the values across the lists; otherwise return the first chunk (..1).
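To see what ..1 refers to, here is a tiny standalone illustration (made-up lists): when Map calls the anonymous function on the corresponding columns of each element, ..1 is the chunk coming from the first list.
Map(\(...) ..1, list(a = 1:2, b = 3:4), list(a = 5:6, b = 7:8))
# $a
# [1] 1 2
# $b
# [1] 3 4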
You could also do it via an aggregation if all the rows are unique:
tmp <- do.call(rbind, L)
nums <- sapply(tmp, is.numeric)
aggregate(tmp[nums], tmp[!nums], FUN=sum)
# ID name v1 v2
#1 aa cc 6 10
#2 bb dd 8 12

Create data frame, matching based on first elements of a list

I want to create a data frame based on the first element of a list. Specifically, I have
One vector containing variables (names1);
One list whose elements each contain two variables (some of the names1 entries and their values);
And the end product should be a data.frame with the "names1" variables that contains as many rows as there are matching cases.
If there is no match between a specific list element and the vector, it should be NA.
The values can also be factors or strings.
names1 <- c("a", "b", "c")
dat1 <- data.frame(names1 =c("a", "b", "c", "f"),values= c("val1", 13, 11, 0))
dat1$values <- as.factor(dat1$values)
dat2 <- data.frame(names1 =c("a", "b", "x"),values= c(12, 10, 2))
dat2$values <- as.factor(dat2$values)
list1 <- list(dat1, dat2)
The result should be a new data frame with the "names1" variables and all matching values from each list element:
a b c
val1 13 11
12 10 NA
One option would be to loop through the list ('list1'), filter the 'names' column based on the 'names' vector, convert it to a single dataset while creating an identification column with .id, spread from 'long' to 'wide' and remove the 'grp' column
library(tidyverse)
map_df(list1, ~ .x %>%
filter(names %in% !! names), .id = 'grp') %>%
spread(names, values) %>%
select(-grp)
# a b c
#1 25 13 11
#2 12 10 NA
Or another option is to bind the datasets together with bind_rows, created a grouping id 'grp' to specify the list element, filter the rows by selecting only 'names' column that match with the 'names' vector and spread from 'long' to 'wide'
bind_rows(list1, .id = 'grp') %>%
filter(names %in% !! names) %>%
spread(names, values)
NOTE: It is better not to use reserved keywords for specifying object names (names). Also, to avoid confusion, the object name should be different from the column names of the dataframe object.
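A quick illustration of why the !! above is needed: inside filter(), a column masks an environment object of the same name, so the object has to be injected explicitly (toy data, made up for this note):
library(dplyr)
names <- c("a", "b")
d <- data.frame(names = c("a", "x"), values = 1:2)
filter(d, names %in% names)   # column compared with itself: keeps both rows
filter(d, names %in% !!names) # compares against the vector: keeps only "a"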
It can also be done with only base R. Create a group identifier with Map, rbind the list elements into a single dataset, subset the rows by keeping only the values from the 'names' vector, and reshape from 'long' to 'wide':
df1 <- subset(do.call(rbind, Map(cbind, list1,
ind = seq_along(list1))), names %in% .GlobalEnv$names)
reshape(df1, idvar = 'ind', direction = 'wide', timevar = 'names')[-1]
A mix of base R and dplyr. For every list element we create a dataframe with 1 row. Using dplyr's rbind_list we row-bind them together and then subset only those columns we need using names.
library(dplyr)
rbind_list(lapply(list1, function(x)
setNames(data.frame(t(x$values)), x$names)))[names]
# a b c
# <dbl> <dbl> <dbl>
#1 25 13 11
#2 12 10 NA
Output without subset looks like this
rbind_list(lapply(list1, function(x) setNames(data.frame(t(x$values)), x$names)))
# a b c x
# <dbl> <dbl> <dbl> <dbl>
#1 25 13 11 NA
#2 12 10 NA 2
In base R
t(sapply(list1, function(x) setNames(x$values, names)[match(names, x$names)]))
# a b c
# [1,] 25 13 11
# [2,] 12 10 NA
Using base R only
body <- do.call('rbind', lapply(list1, function(list.element){
element.vals <- list.element[['values']]
element.names <- list.element[['names']]
names(element.vals) <- element.names
return.vals <- element.vals[names]
if(all(is.na(return.vals))) NULL else return.vals
}))
df <- as.data.frame(body)
names(df) <- names
df
For the sake of completeness, here is a data.table approach using dcast() and rowid():
library(data.table)
nam <- names1 # avoid name conflict with column name
rbindlist(list1)[names1 %in% nam, dcast(.SD, rowid(names1) ~ names1)][, names1 := NULL][]
a b c
1: val1 13 11
2: 12 10 <NA>
Or, more concisely, pick columns after reshaping:
library(data.table)
rbindlist(list1)[, dcast(.SD, rowid(names1) ~ names1)][, .SD, .SDcols = names1]

How to rename individual column in dataframe [duplicate]

I know if I have a data frame with more than 1 column, then I can use
colnames(x) <- c("col1","col2")
to rename the columns. How to do this if it's just one column?
Meaning a vector or data frame with only one column.
Example:
trSamp <- data.frame(sample(trainer$index, 10000))
head(trSamp )
# sample.trainer.index..10000.
# 1 5907862
# 2 2181266
# 3 7368504
# 4 1949790
# 5 3475174
# 6 6062879
ncol(trSamp)
# [1] 1
class(trSamp)
# [1] "data.frame"
class(trSamp[1])
# [1] "data.frame"
class(trSamp[,1])
# [1] "numeric"
colnames(trSamp)[2] <- "newname2"
# Error in names(x) <- value :
# 'names' attribute [2] must be the same length as the vector [1]
This is a generalized way in which you do not have to remember the exact location of the variable:
# df = dataframe
# old.var.name = The name you don't like anymore
# new.var.name = The name you want to get
names(df)[names(df) == 'old.var.name'] <- 'new.var.name'
This code pretty much does the following:
names(df) looks into all the names in the df
[names(df) == old.var.name] extracts the variable name you want to check
<- 'new.var.name' assigns the new variable name.
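For instance, on a small throwaway data frame (the names here are purely illustrative):
df <- data.frame(old.var.name = 1:3, other = 4:6)
names(df)[names(df) == 'old.var.name'] <- 'new.var.name'
names(df)
# [1] "new.var.name" "other"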
colnames(trSamp)[2] <- "newname2"
attempts to set the second column's name. Your object only has one column, so the command throws an error. This should be sufficient:
colnames(trSamp) <- "newname2"
colnames(df)[colnames(df) == 'oldName'] <- 'newName'
This is an old question, but it is worth noting that you can now use setnames from the data.table package.
library(data.table)
setnames(DF, "oldName", "newName")
# or since the data.frame in question is just one column:
setnames(DF, "newName")
# And for reference's sake, in general (more than one column)
nms <- c("col1.name", "col2.name", etc...)
setnames(DF, nms)
This can also be done using Hadley's plyr package, and the rename function.
library(plyr)
df <- data.frame(foo=rnorm(1000))
df <- rename(df,c('foo'='samples'))
You can rename by the name (without knowing the position) and perform multiple renames at once. After doing a merge, for example, you might end up with:
letterid id.x id.y
1 70 2 1
2 116 6 5
3 116 6 4
4 116 6 3
5 766 14 9
6 766 14 13
Which you can then rename in one step using:
letters <- rename(letters,c("id.x" = "source", "id.y" = "target"))
letterid source target
1 70 2 1
2 116 6 5
3 116 6 4
4 116 6 3
5 766 14 9
6 766 14 13
I think the best way of renaming columns is by using the dplyr package like this:
require(dplyr)
df = rename(df, new_col01 = old_col01, new_col02 = old_col02, ...)
It works the same for renaming one or many columns in any dataset.
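For example, a single rename on the built-in cars data set, just to illustrate the syntax:
library(dplyr)
names(rename(cars, velocity = speed))
# [1] "velocity" "dist"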
I find that the most convenient way to rename a single column is using dplyr::rename_at:
library(dplyr)
cars %>% rename_at("speed",~"new") %>% head
cars %>% rename_at(vars(speed),~"new") %>% head
cars %>% rename_at(1,~"new") %>% head
# new dist
# 1 4 2
# 2 4 10
# 3 7 4
# 4 7 22
# 5 8 16
# 6 9 10
Works well in pipe chains
Convenient when names are stored in variables
Works with a name or a column index
Clear and compact
I like the following style for renaming dataframe columns one by one.
colnames(df)[which(colnames(df) == 'old_colname')] <- 'new_colname'
where
which(colnames(df) == 'old_colname')
returns the index of the specific column.
Let df be the dataframe you have with col names myDays and temp.
If you want to rename "myDays" to "Date",
library(plyr)
rename(df,c("myDays" = "Date"))
or with pipe, you can
dfNew <- df %>%
plyr::rename(c("myDays" = "Date"))
Try:
colnames(x)[2] <- 'newname2'
This is likely already out there, but I was playing with renaming fields while searching for a solution and tried this on a whim. It worked for my purposes.
Table1$FieldNewName <- Table1$FieldOldName
Table1$FieldOldName <- NULL
Edit: This works as well.
df <- rename(df, c("oldColName" = "newColName"))
You can use the rename.vars in the gdata package.
library(gdata)
df <- rename.vars(df, from = "oldname", to = "newname")
This is particularly useful when you have more than one variable name to change or you want to append or prepend some text to the variable names; then you can do something like:
df <- rename.vars(df, from = c("old1", "old2", "old3"),
to = c("new1", "new2", "new3"))
For an example of appending text to a subset of variable names see:
https://stackoverflow.com/a/28870000/180892
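For the append/prepend case, a plain base R sketch (with hypothetical column names) would be:
df <- data.frame(old1 = 1, old2 = 2, old3 = 3, keep = 4)
names(df)[1:3] <- paste0("pre_", names(df)[1:3])
names(df)
# [1] "pre_old1" "pre_old2" "pre_old3" "keep"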
You could also try 'upData' from the 'Hmisc' package.
library(Hmisc)
trSamp = upData(trSamp, rename=c(sample.trainer.index..10000. = 'newname2'))
If you know that your dataframe has only one column, you can use:
names(trSamp) <- "newname2"
The OP's question has been well and truly answered. However, here's a trick that may be useful in some situations: partial matching of the column name, irrespective of its position in a dataframe:
Partial matching on the name:
d <- data.frame(name1 = NA, Reported.Cases..WHO..2011. = NA, name3 = NA)
## name1 Reported.Cases..WHO..2011. name3
## 1 NA NA NA
names(d)[grepl("Reported", names(d))] <- "name2"
## name1 name2 name3
## 1 NA NA NA
Another example: partial matching on the presence of "punctuation":
d <- data.frame(name1 = NA, Reported.Cases..WHO..2011. = NA, name3 = NA)
## name1 Reported.Cases..WHO..2011. name3
## 1 NA NA NA
names(d)[grepl("[[:punct:]]", names(d))] <- "name2"
## name1 name2 name3
## 1 NA NA NA
These were examples I had to deal with today; I thought they might be worth sharing.
I would simply change the column name in the dataset to the new name I want with the following code:
names(dataset)[index_value] <- "new_col_name"
I found colnames() easier:
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/row%2Bcolnames
Select some columns from the data frame:
df <- data.frame(df[, c( "hhid","b1005", "b1012_imp", "b3004a")])
and rename the selected columns in order:
colnames(df) <- c("hhid", "income", "cost", "credit")
Check the names and the values to be sure:
names(df);head(df)
I would simply add a new column to the data frame with the name I want and get the data for it from the existing column, like this:
dataf$value=dataf$Article1Order
Then I remove the old column, like this:
dataf$Article1Order<-NULL
This code might seem silly! But it works perfectly...
We can use rename_with to rename columns with a function (stringr functions, for example).
Consider the following data df_1:
df_1 <- data.frame(
x = replicate(n = 3, expr = rnorm(n = 3, mean = 10, sd = 1)),
y = sample(x = 1:2, size = 3, replace = TRUE)
)
names(df_1)
#[1] "x.1" "x.2" "x.3" "y"
Rename all variables with dplyr::everything():
library(tidyverse)
df_1 %>%
rename_with(.data = ., .cols = everything(.),
.fn = str_replace, pattern = '.*',
replacement = str_c('var', seq_along(.), sep = '_')) %>%
names()
#[1] "var_1" "var_2" "var_3" "var_4"
Rename by name pattern with some tidyselect helpers (starts_with, ends_with, contains, matches, ...).
Example with . (x variables):
df_1 %>%
rename_with(.data = ., .cols = contains('.'),
.fn = str_replace, pattern = '.*',
replacement = str_c('var', seq_along(.), sep = '_')) %>%
names()
#[1] "var_1" "var_2" "var_3" "y"
Rename by class with predicate functions such as is.integer, is.numeric, is.factor...
Example with is.integer (y):
df_1 %>%
rename_with(.data = ., .cols = is.integer,
.fn = str_replace, pattern = '.*',
replacement = str_c('var', seq_along(.), sep = '_')) %>%
names()
#[1] "x.1" "x.2" "x.3" "var_1"
The warning:
Warning messages:
1: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
longer object length is not a multiple of shorter object length
2: In names[cols] <- .fn(names[cols], ...) :
number of items to replace is not a multiple of replacement length
It is not relevant; it is just an inconsistency between seq_along(.) and the replacement function.
library(dplyr)
rename(data, de=de.y)

Select column name based on data frame content R

I want to build a matrix or data frame by choosing the names of the columns where the element in the data frame is not NA. For example, suppose I have:
zz <- data.frame(a = c(1, NA, 3, 5),
b = c(NA, 5, 4, NA),
c = c(5, 6, NA, 8))
which gives:
a b c
1 1 NA 5
2 NA 5 6
3 3 4 NA
4 5 NA 8
I want to recognize each NA and build a new matrix or df that looks like:
a c
b c
a b
a c
There will be the same number of NAs in each row of the input matrix/df. I can't seem to get the right code to do this. Suggestions appreciated!
library(dplyr)
library(tidyr)
zz %>%
mutate(k = row_number()) %>%
gather(column, value, a, b, c) %>%
filter(!is.na(value)) %>%
group_by(k) %>%
summarise(temp_var = paste(column, collapse = " ")) %>%
separate(temp_var, into = c("var1", "var2"))
# A tibble: 4 × 3
k var1 var2
* <int> <chr> <chr>
1 1 a c
2 2 b c
3 3 a b
4 4 a c
Here's a possible vectorized base R approach
indx <- which(!is.na(zz), arr.ind = TRUE)
matrix(names(zz)[indx[order(indx[, "row"]), "col"]], ncol = 2, byrow = TRUE)
# [,1] [,2]
#[1,] "a" "c"
#[2,] "b" "c"
#[3,] "a" "b"
#[4,] "a" "c"
This finds the non-NA indices, sorts them by row order, and then subsets the names of your zz data set according to the sorted index. You can wrap it in as.data.frame if you prefer that over a matrix.
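For example, wrapped as a data frame it looks like this (same values, default V1/V2 column names):
as.data.frame(matrix(names(zz)[indx[order(indx[, "row"]), "col"]],
ncol = 2, byrow = TRUE))
#   V1 V2
# 1  a  c
# 2  b  c
# 3  a  b
# 4  a  c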
EDIT: transpose the data frame once before processing, so we don't need to transpose twice in the loop as in the first version.
cols <- names(zz)
for (column in cols) {
zz[[column]] <- ifelse(is.na(zz[[column]]), NA, column)
}
t_zz <- t(zz)
cols <- vector("list", length = ncol(t_zz))
for (i in 1:ncol(t_zz)) {
cols[[i]] <- na.omit(t_zz[, i])
}
new_dt <- as.data.frame(t(do.call("cbind", cols)))
The tricky part here is that your goal actually changes the data frame structure, so the task of "remove NA in each row" has to build a new data frame row by row, since every column in each row could come from a different column of the original data frame.
zz[1, ] is a one-row data frame; use t to convert it into a vector so we can use na.omit, then transpose it back into a row.
I used 2 for loops, but for loops are not necessarily bad in R. The first one is vectorized for each column. The second one needs to be done row by row anyway.
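In code, the per-row step from that first version would have looked roughly like this (illustrative, using the zz whose values were already replaced by column names in the loop above):
t(na.omit(t(zz[1, ])))
# a one-row matrix: "a" "c"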
EDIT: growing objects is very bad for performance in R. I knew I could use rbindlist from data.table, which can take a list of data frames, but the OP doesn't want new packages. My first attempt just used rbind, which cannot take a list as input. Later I found that an alternative is to use do.call. It's still slower than rbindlist, though.

Resources