I know I can use names(df) to get the columns of a data frame. But is there a more convenient way to rename them using dplyr in RStudio?
Earlier:
names(df)=c("anew","bnew","cnew")
Now?:
library(dplyr)
rename(df, anew = aold, bnew = bold, cnew = cold)
dplyr makes it more difficult as I have to know/type both the old and new column names.
I can see certain conversations around autocompletion of column names in dplyr toolchain. But I can't seem to make it work and I have the latest RStudio.
https://plus.google.com/+SharonMachlis/posts/FHknZcbAdLE
You can try something like this (you don't need to use dplyr to transform names automatically). Just replace the modify_names function with whatever transformation you want to apply to the names.
> modify_names <- function(any_string) {
+ return(paste0(any_string, "-new"))
+ }
>
> df <- data.frame(c(0, 1, 2), c(3, 4, 5))
> names(df) <- c("a", "b")
> df
a b
1 0 3
2 1 4
3 2 5
> names(df) <- modify_names(names(df))
> df
a-new b-new
1 0 3
2 1 4
3 2 5
There's nothing wrong with using names(*) <- new_value. dplyr isn't the be-all and end-all of data manipulation in R.
That said, if you want to include this in a dplyr pipeline, here's how to do it:
df %>% `names<-`(c("a_new", "b_new", "c_new"))
This works because (almost) everything in R is a function, and in particular assigning new names is really a call to the names<- function.
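To make that concrete, here is a small sketch (the data frame and the new names are made up for illustration) suggesting that the usual assignment form, an explicit call to `names<-`, and base R's setNames() all end up with the same result:

```r
# Illustrative data; these column names are assumptions, not from the question
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
new_names <- c("a_new", "b_new", "c_new")

# 1. The usual replacement form
via_assign <- df
names(via_assign) <- new_names

# 2. The same operation spelled as an ordinary function call
via_call <- `names<-`(df, new_names)

# 3. setNames() from the stats package wraps the same assignment
via_setnames <- setNames(df, new_names)

identical(via_assign, via_call)      # TRUE
identical(via_assign, via_setnames)  # TRUE
```

setNames() is probably the more readable choice inside a pipeline, since it avoids the backticks.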
Recently I had the same question and found this RStudio article: https://support.rstudio.com/hc/en-us/articles/205273297-Code-Completion
Following the article, to autocomplete column names with dplyr in RStudio you have to use magrittr's %>% operator (pipelines):
library(dplyr)
df %>% rename(anew = aold, bnew = bold, cnew = cold) # new_name = old_name; type the first 3 letters of the existing (old) name, then press Tab to complete it
You can find a visual example in the article, and you can adjust the completion delay (to type less) and other completion options under RStudio > Tools > Global Options... > Code > Completion > Completion delay.
Related
I have a script generating a dataframe with multiple columns named with numbers 1, 2, 3 –> n
I want to rename the columns with the following names: "Cluster_1", "Cluster_2", "Cluster_3" –> "Cluster_n" (with incrementation).
As the number of columns in my dataframe can change accordingly to another part of my script, I would like to be able to have a kind of loop structure that would go through my dataframe and change columns accordingly.
I would like to do something like:
for (i in colnames(df)){
an expression that would change the column name to a concatenation of "Cluster_" + i
}
Outside the loop context, I generally use this expression to rename a column:
names(df)[names(df) == '1'] <- 'Cluster_1'
But I struggle to produce an adapted version of this expression that would properly integrate in my for loop with a concatenation of string and variable value.
How can I adjust the expression that renames the column of the dataframe to integrate in my for loop?
Or is there a better way than a for loop to do this?
A tidyverse solution: rename_with()
library(dplyr)
## The '~' formula shorthand defines an anonymous function; '.' stands for the current column names:
df <- rename_with(df, ~ paste0("Cluster_", .))
Using paste0.
names(df) <- paste0('cluster_', seq_len(length(df)))
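If the columns really are named "1", "2", ..., as in the question, a variation is to prefix the existing names instead of regenerating them by position. A minimal sketch, assuming a made-up three-column data frame:

```r
# Hypothetical data frame whose columns are named "1", "2", "3"
df <- data.frame(matrix(1:9, nrow = 3))
names(df) <- as.character(1:3)

# Prefix every existing name; this works for any number of columns
names(df) <- paste0("Cluster_", names(df))
names(df)
# "Cluster_1" "Cluster_2" "Cluster_3"
```

This keeps the original numbers even if they are not consecutive.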
If you really need a for loop, try
for (i in seq_along(names(df))) {
names(df)[i] <- paste0('cluster_', i)
}
df
# cluster_1 cluster_2 cluster_3 cluster_4
# 1 1 4 7 10
# 2 2 5 8 11
# 3 3 6 9 12
Note: colnames()/rownames() are designed for class "matrix"; for a "data.frame", you might want to use names()/row.names() instead.
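A quick way to see the difference, using a small made-up matrix:

```r
# Illustrative matrix with column names only
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("x", "y", "z")))
colnames(m)  # "x" "y" "z"
names(m)     # NULL: the column names live in dimnames, not in a names attribute

# For a data frame both accessors return the column names
df <- as.data.frame(m)
identical(names(df), colnames(df))  # TRUE
```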
Data:
df <- data.frame(matrix(1:12, 3, 4))
I am looking for a way to create a new column (using dplyr's mutate) based on certain "conditions".
library(tidyverse)
qq <- 5
df <- data.frame(rn = 1:qq,
a = rnorm(qq,0,1),
b = rnorm(qq,10,5))
myf <- function(dataframe,value){
result <- dataframe %>%
filter(rn<=value) %>%
nrow
return(result)
}
The example above is a simplified version of my problem: I am trying to filter the piped data frame (df) and obtain a new column (foo) whose values show how many rows have rn less than or equal to the current row's rn (taken from the piped df). Below you can see the output I am getting versus the one I expect:
df %>%
mutate(
foo_i_am_getting = myf(.,rn),
foo_expected = 1:qq)
rn a b foo_i_am_getting foo_expected
1 1 -0.5403937 -4.945643 5 1
2 2 0.7169147 2.516924 5 2
3 3 -0.2610024 -7.003944 5 3
4 4 -0.9991419 -1.663043 5 4
5 5 1.4002610 15.501411 5 5
The actual calculation I am trying to perform is more cumbersome, however, if I solve the above simplified version, I believe I can handle the rest of the manipulation/calculations inside the custom function.
BONUS QUESTION : Currently the name of the column I want to apply the filter on (i.e. rn) is hardcoded in the custom function (filter(rn<=value)). It would be great if this was an argument of the custom function, to be passed 'tidyverse' style - i.e. without quotation marks - e.g. myf <- function(dataframe,rn,value)
Disclaimer: I've done my best to describe the problem at hand; however, if anything is still unclear, please let me know so I can elaborate further.
Thanks in advance for your support!
You need to do it value by value, because right now you are passing the whole rn vector to filter at once instead of one value at a time:
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq)
Now we pass 1 to filter for the rn column (and the function returns the number of matching rows), then 2, and so on for each value of rn.
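The same idea can be sketched in base R, without dplyr or purrr, which may make the difference between the two comparisons easier to see. The data frame below is a stripped-down stand-in for the one in the question:

```r
# Stand-in for the question's data: only the rn column matters here
df <- data.frame(rn = 1:5)

# Comparing the whole column with itself is element-wise, so every row passes
sum(df$rn <= df$rn)  # 5

# Comparing against one value at a time gives the count for each row
foo <- vapply(df$rn, function(x) sum(df$rn <= x), integer(1))
foo  # 1 2 3 4 5
```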
Function could be:
myf <- function(vec_filter, dataframe, vec_rn) {
map_dbl(vec_filter, ~ nrow(filter(dataframe, {{vec_rn}} <= .x)))
}
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq,
foo_function = myf(rn, ., rn))
After searching for some time, I cannot find a smooth R-esque solution.
I have a list of vectors that I want to convert to dataframes, adding a column with the names of the vectors. I can't do this with cbind() and melt() into a single dataframe because the vectors have different lengths.
Basic example would be:
list<-list(a=c(1,2,3),b=c(4,5,6,7))
var<-"group"
What I have come up with and works is:
list<-lapply(list, function(x) data.frame(num=x,grp=""))
for (j in 1:length(list)){
list[[j]][,2]<-names(list[j])
names(list[[j]])[2]<-var
}
But I am trying to better use lapply() and have cleaner coding practices. Right now I rely so heavily on for and if statements, which a lot of the base functions do already and much more efficiently than I can code at this point.
The pseudocode I would like is something like:
list<-lapply(list, function(x) data.frame(num=x,get(var)=names(x))
Is there a clean way to get this done?
Second closely related question, if I already have a list of dataframes, why is it so hard to reassign column values and names using lapply()?
So using something like:
list<-list(a=data.frame(num=c(1,2,3),grp=""),b=data.frame(num=c(4,5,6,7),grp=""))
var<-"group"
#pseudo code
list<-lapply(list, function(x) x[,2]<-names(x)) #populate second col with name of df[x]
list<-lapply(list, function(x) names[[x]][2]<-var) #set 2nd col name to 'var'
The first line of pseudo code throws an error about matching row lengths. Why does lapply() not just loop over and repeat names(x) like the same function on a single dataframe does in a for loop?
For the second line, as I understand it I can use setNames() to reassign all the column names, but how do I make this work for just one of the col names?
Many thanks for any ideas or pointing to other threads that cover this and helping me understand the behavior of lapply() in this context.
A full R base approach without using loops
> l<-list(a=c(1,2,3),b=c(4,5,6,7))
> data.frame(grp=rep(names(l), lengths(l)), num=unlist(l), row.names = NULL)
grp num
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 b 7
Regarding your first (main) question: you can use the function enframe from the tibble package for this purpose.
library(tibble)
library(tidyr)
library(dplyr)
l<-list(a=c(1,2,3),b=c(4,5,6,7))
l %>%
enframe(name = "group", value="value") %>%
unnest(value) %>%
group_split(group)
Try this:
library(dplyr)
mylist <- list(a = c(1,2,3), b = c(4,5,6,7))
bind_rows(lapply(names(mylist), function(x) tibble(grp = x, num = mylist[[x]])))
# A tibble: 7 x 2
grp num
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 b 7
This is essentially a lapply-based solution where you iterate over the names of your list, and not the individual list elements themselves. If you prefer to do everything in base R, note that the above is equivalent to
do.call(rbind, lapply(names(mylist), function(x) data.frame(grp = x, num = mylist[[x]], stringsAsFactors = FALSE)))
Having said that, I would prefer tibbles (a modern reimplementation of data.frame) here, and bind_rows over the do.call(rbind, ...) construct.
As to the second question, note the following:
lapply(mylist, function(x) str(x))
num [1:3] 1 2 3
num [1:4] 4 5 6 7
....
lapply(mylist, function(x) names(x))
$a
NULL
$b
NULL
What you see here is that the function inside lapply receives the elements of mylist. In this case, it gets to work with the bare numeric vector, which carries no names as far as the function called inside lapply is concerned. To highlight this, consider the following:
names(c(1,2,3))
NULL
Which is the same: the vector c(1,2,3) does not have a name attribute.
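One way around this, sketched here in base R, is to iterate over the elements and their names in parallel with Map(). The temporary column name grp is just a placeholder, and var is the variable from the question holding the desired column name:

```r
mylist <- list(a = c(1, 2, 3), b = c(4, 5, 6, 7))
var <- "group"

# Map() hands each vector and its name to the function together
dfs <- Map(function(x, nm) {
  out <- data.frame(num = x, grp = nm, stringsAsFactors = FALSE)
  # Rename only the placeholder column to whatever `var` holds
  names(out)[names(out) == "grp"] <- var
  out
}, mylist, names(mylist))

names(dfs$a)  # "num" "group"
dfs$b$group   # "b" "b" "b" "b"
```

The result is still a list of data frames, one per original vector, with the list names preserved.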
I want to rename one column of a dataframe, referencing it by its original name. Is this possible? For example, say I had the table 'data':
a b c
1 2 2
3 2 3
4 1 2
and I wanted to update the name of column b to 'd'. I know I could use
colnames(data)[2] <- 'd'
but can I make the change by specifically referencing b, i.e. something like
colnames(data)['b'] <- 'd'
so that if the column ordering of the dataframe changes the correct column name will still be updated.
Thanks in advance
There is a function setnames built into package data.table for exactly that.
setnames(DT, "b", "d")
It changes the names by reference with no copy at all. Any other method using names(data)<- or names(data)[i]<- or similar will copy the object, usually several times, even though all you're doing is changing a column name. (In R versions before 3.1.0 these were full copies; since then they are shallow copies, which are much cheaper.)
When this was written, DT had to be a data.table for setnames to work; current versions of data.table also accept a plain data.frame. Otherwise, switch to data.table or convert using as.data.table.
Here is the extract from ?setnames. The intention is that you run example(setnames) at the prompt and then the comments relate to the copies you see being reported by tracemem.
DF = data.frame(a=1:2,b=3:4) # base data.frame to demo copies
tracemem(DF)
colnames(DF)[1] <- "A" # 4 copies of entire object
names(DF)[1] <- "A" # 3 copies of entire object
names(DF) <- c("A", "b") # 2 copies of entire object
`names<-`(DF,c("A","b")) # 1 copy of entire object
x=`names<-`(DF,c("A","b")) # still 1 copy (so not print method)
# What if DF is large, say 10GB in RAM. Copy 10GB just to change a column name?
DT = data.table(a=1:2,b=3:4,c=5:6)
tracemem(DT)
setnames(DT,"b","B") # by name; no match() needed. No copy.
setnames(DT,3,"C") # by position. No copy.
setnames(DT,2:3,c("D","E")) # multiple. No copy.
setnames(DT,c("a","E"),c("A","F")) # multiple by name. No copy.
setnames(DT,c("X","Y","Z")) # replace all. No copy.
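Applied to the table from the question, a minimal sketch might look like this (it assumes the data.table package is installed; the column reordering is only there to show that position does not matter):

```r
library(data.table)

# The table from the question, built as a data.table
DT <- data.table(a = c(1, 3, 4), b = c(2, 2, 1), c = c(2, 3, 2))

# Reorder the columns first to show that position no longer matters
setcolorder(DT, c("c", "b", "a"))

# Rename by name, in place; no assignment is needed
setnames(DT, "b", "d")
names(DT)
# "c" "d" "a"
```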
As of October 2014 this can now be done easily in the dplyr package:
rename(data, d = b)
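As a sketch for the table in the question (assuming dplyr 1.0 or later is installed): rename() takes new_name = old_name, and if the names are stored in variables, a named character vector can be passed through all_of(). The lookup vector below is a name made up for this example:

```r
library(dplyr)

# The table from the question
data <- data.frame(a = c(1, 3, 4), b = c(2, 2, 1), c = c(2, 3, 2))

# new_name = old_name
renamed <- rename(data, d = b)

# Same rename driven by a named vector: names are new, values are old
lookup <- c(d = "b")
renamed_lookup <- rename(data, all_of(lookup))

names(renamed)  # "a" "d" "c"
```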
This seems like a hack, but the first thing that came to mind was to use grepl() with a search string detailed enough to match only the column you want. I'm sure there are better options:
dat <- data.frame(a = 1:3, b = 1:3, c = 1:3)
colnames(dat)[grepl("b", colnames(dat))] <- "foo"
dat
#------
a foo c
1 1 1 1
2 2 2 2
3 3 3 3
As Joran points out below, I overcomplicated things...no need for a regex at all. This saves a few characters on the typing too.
colnames(dat)[colnames(dat) == "foo"] <- "bar"
#------
a bar c
1 1 1 1
2 2 2 2
3 3 3 3
Yes but it's more difficult (as far as I know) than numeric indexing. I'm going to provide a dirty function that will do this and if you want to see how to do it just tear the function apart line by line:
rename <- function(df, column, new){
x <- names(df) #Did this to avoid typing twice
if (is.numeric(column)) column <- x[column] #Take numeric input by indexing
names(df)[x %in% column] <- new #What you're interested in
return(df)
}
#try it out
rename(mtcars, 'mpg', 'NEW')
rename(mtcars, 1, 'NEW')
I disagree with @Chase - the grepl solution ain't the luckiest one. I'd say: go with simple ==. Here's why:
d <- data.frame(matrix(rnorm(100), 10))
colnames(d) <- replicate(10, paste(sample(letters[1:5], size = 5, replace=TRUE, prob=c(.1, .6, .1, .1, .1)), collapse = ""))
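Now try doing grepl("b", colnames(d)): it matches every name that contains a "b", not just a column named exactly "b". Passing fixed = TRUE does not change that, since it only turns off regex interpretation. Either anchor the pattern ("^b$") or, even better, do a simple colnames(d) == "b" like @joran suggested. Regex matching is also generally slower than ==, so for simple tasks like this you may want to use ==.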
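A tiny illustration with three made-up column names shows the difference between substring matching and exact matching:

```r
# Illustrative names: only the first one is exactly "b"
nm <- c("b", "ab", "bbb")

grepl("b", nm)                # TRUE TRUE TRUE: substring match
grepl("b", nm, fixed = TRUE)  # TRUE TRUE TRUE: still a substring match
grepl("^b$", nm)              # TRUE FALSE FALSE: anchored regex
nm == "b"                     # TRUE FALSE FALSE: exact comparison
```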
I know that I can change a data.frame column name by:
colnames(df)[3] <- "newname"
But there might be cases where the column I want to change is not in the 3rd position. Is there a way to look up the column by name and change it? Like this...
colnames(df)[,"oldname"] <- "newname"
BTW, I have tried this code and I keep getting incorrect number of subscripts on matrix.
Thanks.
colnames(df)[colnames(df)=="oldname"] <- "newname"
or just names
names(df)[names(df)=="oldname"] <- "newname"
There are various functions for renaming columns in packages as well.
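One thing to keep in mind with the == approach is that it silently does nothing when "oldname" is not present. If you would rather get an error in that case, a small wrapper around match() is one option. The helper name rename_col below is made up for this sketch and is not from any package:

```r
# Hypothetical helper: rename columns by name, failing loudly on a typo
rename_col <- function(df, old, new) {
  stopifnot(length(old) == length(new))
  idx <- match(old, names(df))  # positions of the old names, NA if absent
  if (anyNA(idx)) {
    stop("column(s) not found: ", paste(old[is.na(idx)], collapse = ", "))
  }
  names(df)[idx] <- new
  df
}

dat <- data.frame(a = 1:3, oldname = 4:6, c = 7:9)
names(rename_col(dat, "oldname", "newname"))
# "a" "newname" "c"
```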
colnames(df)[colnames(df)=="oldname"] <- "newname"
or
names(df)[names(df)=="oldname"] <- "newname"
(since names and colnames are equivalent for a data frame)
or you might be looking for
library(reshape)
df <- rename(df,c(oldname="newname"))
I was using package data.table today and when I tried to change a column name using my usual method a message appeared recommending this approach:
library(data.table)
df <- read.table(text= "
region state county
1 1 1
1 2 2
1 2 3
2 1 4
2 1 4
", header=TRUE, na.strings=NA)
df
setnames(df, "county", "district")
df
A somewhat more general approach that will replace all of the "old"s at the beginning of any current name with "new" in the same character location:
names(df) <- sub("^old", "new", names(df) )
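For instance, with a made-up data frame, note how the ^ anchor leaves a name alone when "old" appears somewhere other than the start:

```r
# Illustrative data frame; "threshold" contains "old" but does not start with it
df <- data.frame(old_a = 1:2, old_b = 3:4, threshold = 5:6)

names(df) <- sub("^old", "new", names(df))
names(df)
# "new_a" "new_b" "threshold"
```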