I want to rename one column of a data frame by referring to its current name. Is this possible? For example, say I had the table 'data'
a b c
1 2 2
3 2 3
4 1 2
and I wanted to update the name of column b to 'd'. I know I could use
colnames(data)[2] <- 'd'
but can I make the change by specifically referencing b, i.e. something like
colnames(data)['b'] <- 'd'
so that if the column ordering of the dataframe changes the correct column name will still be updated.
Thanks in advance
There is a function setnames built into package data.table for exactly that.
setnames(DT, "b", "d")
It changes the names by reference with no copy at all. Other methods such as names(data)<- or names(data)[i]<- copy the object, usually several times, even though all you're doing is changing a column name. (In R 3.1.0 and later these are shallow copies; in earlier versions the entire object was copied.)
When this answer was written, DT had to be a data.table for setnames to work, so you'd need to switch to data.table or convert using as.data.table. Current versions of data.table also accept a plain data.frame.
Here is the extract from ?setnames. The intention is that you run example(setnames) at the prompt and then the comments relate to the copies you see being reported by tracemem.
DF = data.frame(a=1:2,b=3:4) # base data.frame to demo copies
tracemem(DF)
colnames(DF)[1] <- "A" # 4 copies of entire object
names(DF)[1] <- "A" # 3 copies of entire object
names(DF) <- c("A", "b") # 2 copies of entire object
`names<-`(DF,c("A","b")) # 1 copy of entire object
x=`names<-`(DF,c("A","b")) # still 1 copy (so not print method)
# What if DF is large, say 10GB in RAM. Copy 10GB just to change a column name?
DT = data.table(a=1:2,b=3:4,c=5:6)
tracemem(DT)
setnames(DT,"b","B") # by name; no match() needed. No copy.
setnames(DT,3,"C") # by position. No copy.
setnames(DT,2:3,c("D","E")) # multiple. No copy.
setnames(DT,c("a","E"),c("A","F")) # multiple by name. No copy.
setnames(DT,c("X","Y","Z")) # replace all. No copy.
As of October 2014 this can now be done easily in the dplyr package:
data <- rename(data, d = b) # new_name = old_name; rename() returns a new data frame, so assign it back
This seems like a hack, but the first thing that came to mind was to use grepl() with a search string detailed enough to match only the column you want. I'm sure there are better options:
dat <- data.frame(a = 1:3, b = 1:3, c = 1:3)
colnames(dat)[grepl("b", colnames(dat))] <- "foo"
dat
#------
a foo c
1 1 1 1
2 2 2 2
3 3 3 3
As Joran points out below, I overcomplicated things...no need for a regex at all. This saves a few characters on the typing too.
colnames(dat)[colnames(dat) == "foo"] <- "bar"
#------
a bar c
1 1 1 1
2 2 2 2
3 3 3 3
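One caveat with the == approach: if the old name is misspelled or missing, the assignment silently does nothing. A minimal sketch of a wrapper that fails loudly instead is below; rename_col is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper (not part of base R): rename one column by its
# current name and stop with an error if that name is not present.
rename_col <- function(df, old, new) {
  idx <- match(old, names(df))   # position of the first exact match, or NA
  if (is.na(idx)) {
    stop("no column named '", old, "'")
  }
  names(df)[idx] <- new
  df
}

dat <- data.frame(a = 1:3, b = 1:3, c = 1:3)
dat <- rename_col(dat, "b", "d")
names(dat)
# [1] "a" "d" "c"
```

Because match() looks the name up each time, this keeps working if the column order changes.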
Yes but it's more difficult (as far as I know) than numeric indexing. I'm going to provide a dirty function that will do this and if you want to see how to do it just tear the function apart line by line:
rename <- function(df, column, new){
    x <- names(df)                               #Did this to avoid typing twice
    if (is.numeric(column)) column <- x[column]  #Take numeric input by indexing
    names(df)[x %in% column] <- new              #What you're interested in
    return(df)
}
#try it out
rename(mtcars, 'mpg', 'NEW')
rename(mtcars, 1, 'NEW')
I disagree with @Chase: the grepl solution isn't the best choice. I'd say go with a simple ==. Here's why:
d <- data.frame(matrix(rnorm(100), 10))
colnames(d) <- replicate(10, paste(sample(letters[1:5], size = 5, replace=TRUE, prob=c(.1, .6, .1, .1, .1)), collapse = ""))
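Now try doing grepl("b", colnames(d)): it matches every name that contains a "b", not just a column named exactly "b". Passing fixed = TRUE skips the regex engine but still matches substrings, so either anchor the pattern ("^b$") or, even better, do a simple colnames(d) == "b" like @joran suggested. Regex matching is generally slower than ==, so for simple tasks like this you may want to use ==.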
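To make the over-matching concrete, here is a small sketch on a made-up vector of column names (the names are just for illustration):

```r
# Made-up column names for illustration.
cols <- c("a", "b", "ab", "bcd")

pattern_hits  <- grepl("b", cols)                # substring match: hits "b", "ab", "bcd"
fixed_hits    <- grepl("b", cols, fixed = TRUE)  # no regex, but still a substring match
anchored_hits <- grepl("^b$", cols)              # anchored: only the exact name "b"
exact_hits    <- cols == "b"                     # plain equality: only the exact name "b"

cols[pattern_hits]
cols[exact_hits]
```

Only the anchored pattern and == single out the one column you meant.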
I am trying to train a model on data that was converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I want to add a string to the column names to serve as a "tag" that differentiates the same word coming from the different fields. For example, the word hello can appear in both the positive and negative comment fields (and thus be represented as a column in my dataframe), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
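Applied to the two cases in the question, the same idea might look like the sketch below. The pos and neg data frames are made-up stand-ins for the two document term matrices, not the asker's real data.

```r
# Suffix every column of a copy of mtcars with "_sample".
cars <- mtcars
names(cars) <- paste0(names(cars), "_sample")
head(names(cars), 3)   # mpg_sample, cyl_sample, disp_sample

# Made-up stand-ins for the positive and negative term matrices.
pos <- data.frame(hello = c(1, 0), great = c(0, 1))
neg <- data.frame(hello = c(0, 1), awful = c(1, 1))

# Tag each set of columns before combining, so "hello" stays distinguishable.
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))
combined <- cbind(pos, neg)
names(combined)
```

paste0() is vectorised over the names, so no sapply or lapply is needed.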
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
I have a general problem in understanding how to create a user defined function that can accept variables as arguments that can be manipulated inside the defined function. I want to create a function in which I can pass variables as arguments to internal functions for manipulation. It appears that many of the functions I want to use require the c() operator which requires quotes around the arguments.
So my function has to be able to pass the name of a variable from a dataframe into the quotes for c() and other functions requiring quoted strings. I read through many posts on paste0, paste and cat(x), but I cannot figure out how to solve my problem completely.
Here is a simple dataset and shortened code to help structure the problem. Here I just want to be able to provide a dataframe, and three variables. The function should provide the mean of the variable in the y position for each combo of the x and z variable. The resultant aggregate table should have the names of the variables provided as arguments to XTABAR as column headers.
n=50
DataTest = data.frame( xcol=sample(1:3, n, replace=TRUE), ycol = rnorm(n, 5, 2), Catg=letters[1:5])
XTABAR<- function(DS,xcat,yvar,group){
    library(plyr)
    #library(ggplot2)
    #library(dplyr)
    #library(scales)
    localenv<-environment()
    gg<-data.frame(DS,x=DS[,xcat],y=DS[,yvar],z=DS[,group] )
    cnames<-colnames(gg)
    ag.gg<-aggregate(gg$y, by=list(gg$x,gg$z),FUN=mean)
    colnames(ag.gg)<-c(cat('"',cnames[1],'"'),cat('"',cnames[2],'"'),cat('"',cnames[3],'"'))
    return(ag.gg)
}
XTABAR(DataTest,"xcol","ycol","Catg")
This code is as close as I can get to solving the simple problem. I don't know how to remove the quotes from the column names nor how to get rid of the NA's.
Thank you for any help on the logic and or code.
Try the following. I was not too clear about the desire to quote the names but we put stars around them in the code below. If that is not needed then remove the setNames statement.
XTABAR <- function(DS, xcat, yvar, group) {
    ag <- aggregate(DS[yvar], DS[c(xcat, group)], mean)
    setNames(ag, paste0("*", names(ag), "*"))
}
Test it:
XTABAR(DataTest, "xcol", "ycol", "Catg")
giving:
*xcol* *Catg* *ycol*
1 1 a 5.700938
2 2 a 5.292628
3 3 a 5.204395
4 1 b 4.054289
5 2 b 5.119659
6 3 b 4.050799
7 1 c 2.937309
8 2 c 5.696256
9 3 c 6.773029
10 1 d 5.323572
11 2 d 3.430644
12 3 d 4.892041
13 1 e 4.024070
14 3 e 5.038122
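If you prefer the formula interface of aggregate(), the column names passed as strings can be turned into a formula with reformulate() from the stats package. This is only a sketch of an alternative; xtabar_formula and the toy data are made-up names for illustration.

```r
# Hypothetical variant of XTABAR that builds a formula from the column names.
xtabar_formula <- function(DS, xcat, yvar, group) {
  f <- reformulate(c(xcat, group), response = yvar)   # e.g. ycol ~ xcol + Catg
  aggregate(f, data = DS, FUN = mean)
}

# Small deterministic toy data so the result is easy to check by hand.
toy <- data.frame(
  xcol = c(1, 1, 2, 2),
  ycol = c(2, 4, 6, 10),
  Catg = c("a", "a", "b", "b")
)
res <- xtabar_formula(toy, "xcol", "ycol", "Catg")
res   # one row per (xcol, Catg) combination, with the mean of ycol
```

The result already carries the original column names, so no renaming step is needed.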
I make heavy use of eval(parse(text=)) for this purpose. It evaluates a character string as though it is a command. For example:
> x <- "5 + 5"
> eval(parse(text=x))
[1] 10
Using your example, this should work if you input your parameters as character strings:
XTABAR<- function(DS,xcat,yvar,group){
    # DS is the *name* of the data frame as a string, e.g. "DataTest"
    library(plyr)
    #library(ggplot2)
    #library(dplyr)
    #library(scales)
    var1 <- eval(parse(text=paste(DS, "$", xcat, sep="")))
    var2 <- eval(parse(text=paste(DS, "$", yvar, sep="")))
    var3 <- eval(parse(text=paste(DS, "$", group, sep="")))
    gg<-data.frame(x=var1, y=var2, z=var3)
    ag.gg<-aggregate(gg$y, by=list(gg$x,gg$z),FUN=mean)
    colnames(ag.gg)<-c(xcat, group, yvar) # the two grouping columns come first, then the mean
    return(ag.gg)
}
I'm going to go ahead and anticipate a criticism of my answer.
> require(fortunes)
Loading required package: fortunes
> fortune(106)
If the answer is parse() you should usually rethink
the question.
-- Thomas Lumley
R-help (February 2005)
Mr. Lumley is probably correct in this case. There are probably simpler solutions, but this should at least get you going.
To set the column names, use colnames(ag.gg) <- c(xcat, group, yvar): aggregate() returns the two grouping columns first and the aggregated value last.
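For what it's worth, one of the simpler solutions is to index the data frame with [[ and the column name as a string, which avoids eval(parse()) altogether. A rough sketch, with xtabar_brackets and the toy data as made-up names:

```r
# Hypothetical variant: pass the data frame itself, column names as strings.
xtabar_brackets <- function(DS, xcat, yvar, group) {
  ag <- aggregate(DS[[yvar]], by = list(DS[[xcat]], DS[[group]]), FUN = mean)
  colnames(ag) <- c(xcat, group, yvar)   # groups first, aggregated value last
  ag
}

# Small deterministic toy data so the result is easy to check by hand.
toy <- data.frame(
  xcol = c(1, 1, 2, 2),
  ycol = c(2, 4, 6, 10),
  Catg = c("a", "a", "b", "b")
)
out <- xtabar_brackets(toy, "xcol", "ycol", "Catg")
out
```

Here DS is the data frame object rather than its name, so the call is xtabar_brackets(DataTest, "xcol", "ycol", "Catg").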
Suppose that I have a vector x whose elements I want to use to extract columns from a matrix or data frame M.
If x[1] = "A", I cannot use M$x[1] to extract the column with header name A, because M$A is recognized while M$"A" is not. How can I remove the quotes so that M$x[1] is M$A rather than M$"A" in this instance?
Don't use $ in this case; use [ instead. Here's a minimal example (if I understand what you're trying to do).
mydf <- data.frame(A = 1:2, B = 3:4)
mydf
# A B
# 1 1 3
# 2 2 4
x <- c("A", "B")
x
# [1] "A" "B"
mydf[, x[1]] ## As a vector
# [1] 1 2
mydf[, x[1], drop = FALSE] ## As a single column `data.frame`
# A
# 1 1
# 2 2
I think you would find your answer in the R Inferno. Start around Circle 8: "Believing it does as intended", one of the "string not the name" sub-sections. You might also find some explanation in this line from the help page at ?Extract: "The main difference is that $ does not allow computed indices, whereas [[ does."
Note that this approach is taken because the question specified using the approach to extract columns from a matrix or data frame, in which case, the [row, column] mode of extraction is really the way to go anyway (and the $ approach would not work with a matrix).
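A short sketch of the difference, reusing mydf and x from above and adding a matrix m (a made-up name) to show the matrix case:

```r
mydf <- data.frame(A = 1:2, B = 3:4)
x <- c("A", "B")

via_dollar  <- mydf$x          # looks for a column literally named "x": NULL
via_bracket <- mydf[[x[1]]]    # computed index: the column named by x[1]

m <- as.matrix(mydf)           # $ does not work on a matrix at all
m_col <- m[, x[2]]             # [row, column] indexing by name does

via_bracket
m_col
```

So the quotes never need to be removed; [[ and [ simply accept the string as it is.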
In my code, I am filling the columns of a dataframe with vectors, as so:
df1[columnNum] <- barWidth
This works fine, except for one thing: I want the name of the vector variable (barWidth above) to be retained as the column header, one column at a time. Furthermore, I do not wish to use cbind. This slows the execution of my code down considerably. Consequently, I am using a pre-allocated dataframe.
Can this be done in the vector-to-column assignment? If not, then how do I change it after the fact? I can't find the right syntax to do this with colnames().
TIA
It's being done by the [<-.data.frame function. It could conceivably be replaced by one that looked at the name of the argument but it's such a fundamental function I would be hesitant. Furthermore there appears to be an aversion to that practice signaled by this code at the top of the function definition:
> `[<-.data.frame`
function (x, i, j, value)
{
if (!all(names(sys.call()) %in% c("", "value")))
warning("named arguments are discouraged")
nA <- nargs()
if (nA == 4L) {
<snipped rest of rather long definition>
I don't know why that is there, but it is. Maybe you should either be thinking about using names<- after the column assignment, or using this method:
> dfrm["barWidth"] <- barWidth
> dfrm
a V2 barWidth
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
This can be generalized to a list of new columns:
dfrm <- data.frame(a=letters[1:4])
barWidth <- 1:4
newcols <- list(barWidth=barWidth, bw2 =barWidth)
dfrm[names(newcols)] <- newcols
dfrm
#
a barWidth bw2
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
If you have the list of names of vectors you want to apply you could do:
namevec <- c(...,"barWidth"...,)
columnNums <- c(...,10,...)
df1[columnNums[i]] <- get(namevec[i])
names(df1)[columnNums[i]] <- namevec[i]
or even
columnNums <- c(barWidth=4,...)
for (i in seq_along(columnNums)) {
df1[columnNums[i]] <- get(names(columnNums)[i])
}
names(df1)[columnNums] <- names(columnNums)
but the deeper question would be where this set of vectors is coming from in the first place: could you have them in a list all along?
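If the vectors do live in a named list, filling a pre-allocated data frame and fixing the headers takes two assignments and no get(). A minimal sketch follows; barHeight, positions and the layout of df1 are assumptions made up for the example.

```r
barWidth  <- 1:4
barHeight <- c(2.5, 3, 1, 4)   # made-up second vector
newcols <- list(barWidth = barWidth, barHeight = barHeight)

# Pre-allocated data frame with placeholder columns (assumed layout).
df1 <- data.frame(id = letters[1:4], V2 = NA, V3 = NA)

positions <- c(2, 3)                      # target column numbers
df1[positions] <- newcols                 # fill the columns by position
names(df1)[positions] <- names(newcols)   # carry the vector names over as headers
df1
```

Assigning into existing columns keeps their old headers, which is why the names are set in a separate step.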
I'd simply use cbind():
df1 <- cbind( df1, barWidth )
which retains the name. It will, however, end up as the last column in df1
How do I add a column in the middle of an R data frame? I want to see if I have a column named "LastName" and then add it as the third column if it does not already exist.
One approach is to just add the column to the end of the data frame, and then use subsetting to move it into the desired position:
d$LastName <- c("Flim", "Flom", "Flam")
bar <- d[c("x", "y", "LastName", "fac")]
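The same add-then-reorder idea can be wrapped up so the column order does not have to be typed out by hand. This is only a sketch: insert_after is a hypothetical helper, and d is rebuilt here as a small stand-in with the same column names.

```r
# Hypothetical helper: add a column after position `after`,
# but only if no column of that name exists yet.
insert_after <- function(df, new_name, values, after) {
  if (new_name %in% names(df)) {
    return(df)                       # column already present: nothing to do
  }
  df[[new_name]] <- values           # add at the end
  ord <- append(setdiff(names(df), new_name), new_name, after = after)
  df[ord]                            # reorder so the new column sits after `after`
}

# Stand-in for the example data frame 'd'.
d <- data.frame(x = 1, y = 1:3, fac = c("C", "A", "A"))
d <- insert_after(d, "LastName", c("Flim", "Flom", "Flam"), after = 2)
names(d)
# [1] "x"        "y"        "LastName" "fac"
```

Calling it a second time leaves d unchanged, which covers the "only if it does not already exist" part of the question.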
1) Testing for existence: Use %in% on the colnames, e.g.
> example(data.frame) # to get 'd'
> "fac" %in% colnames(d)
[1] TRUE
> "bar" %in% colnames(d)
[1] FALSE
2) You essentially have to create a new data.frame from the first half of the old, your new column, and the second half:
> bar <- data.frame(d[1:3,1:2], LastName=c("Flim", "Flom", "Flam"), fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim C
2 1 2 Flom A
3 1 3 Flam A
>
Of the many silly little helper functions I've written, this gets used every time I load R. It just makes a list of the column names and indices but I use it constantly.
##creates an object from a data.frame listing the column names and location
namesind=function(df){
    temp1=names(df)
    temp2=seq_along(temp1)          # 1..ncol; also safe when df has no columns
    temp3=data.frame(temp1,temp2)
    names(temp3)=c("VAR","COL")
    return(temp3)                   # temporaries are local, so no rm() is needed
}
ni <- namesind
Use ni to see your column numbers. (ni is just an alias for namesind; I never use namesind but originally thought it was the better name.) Then, if you want to insert your column in, say, position 12, and your data.frame is named bob with 20 columns, it would be
bob2 <- data.frame(bob[,1:11],newcolumn, bob[,12:20])
though I liked the add at the end and rearrange answer from Hadley as well.
Dirk Eddelbuettel's answer works, but you don't need to indicate row numbers or specify entries in the lastname column. This code should do it for a data frame named df:
if(!("LastName" %in% names(df))){
    df <- cbind(df[1:2],LastName=NA,df[3:length(df)])
}
(this defaults LastName to NA, but you could just as easily use "LastName='Smith'")
or using cbind:
> example(data.frame) # to get 'd'
> bar <- cbind(d[1:3,1:2],LastName=c("Flim", "Flom", "Flam"),fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim A
2 1 2 Flom B
3 1 3 Flam B
I always thought something like append() (unfortunate though the name is) should be a generic function:
## redefine append() as generic function
append.default <- append
append <- `body<-`(args(append),value=quote(UseMethod("append")))
append.data.frame <- function(x,values,after=length(x))
`row.names<-`(data.frame(append.default(x,values,after)),
row.names(x))
## apply the function
d <- (if( !"LastName" %in% names(d) )
append(d,values=list(LastName=c("Flim","Flom","Flam")),after=2) else d)