How do I add a column in the middle of an R data frame? I want to see if I have a column named "LastName" and then add it as the third column if it does not already exist.
One approach is to just add the column to the end of the data frame, and then use subsetting to move it into the desired position:
d$LastName <- c("Flim", "Flom", "Flam")
bar <- d[c("x", "y", "LastName", "fac")]
1) Testing for existence: Use %in% on the colnames, e.g.
> example(data.frame) # to get 'd'
> "fac" %in% colnames(d)
[1] TRUE
> "bar" %in% colnames(d)
[1] FALSE
2) You essentially have to create a new data.frame from the first half of the old, your new column, and the second half:
> bar <- data.frame(d[1:3,1:2], LastName=c("Flim", "Flom", "Flam"), fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim C
2 1 2 Flom A
3 1 3 Flam A
>
Of the many silly little helper functions I've written, this gets used every time I load R. It just makes a list of the column names and indices but I use it constantly.
##creates an object from a data.frame listing the column names and location
namesind <- function(df) {
  # one row per column: its name (VAR) and its position (COL)
  data.frame(VAR = names(df), COL = seq_along(df))
}
ni <- namesind
Use ni to see your column numbers. (ni is just an alias for namesind; I never use namesind but originally thought it was the better name.) Then, if you want to insert your column at, say, position 12, and your data.frame is named bob and has 20 columns, it would be
bob2 <- data.frame(bob[,1:11], newcolumn, bob[,12:20])
though I liked the add at the end and rearrange answer from Hadley as well.
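The position does not have to be read off by eye; it can be looked up from a column name with match(). A minimal sketch, assuming a made-up 20-column bob (the V1..V20 names and the values are placeholders, not from the original answer):

```r
# Hypothetical stand-in for 'bob': 20 columns named V1..V20
bob <- as.data.frame(matrix(0, nrow = 2, ncol = 20))
newcolumn <- c(1, 2)

# Find the position of the column the new one should go in front of
pos <- match("V12", names(bob))

# Everything before 'pos', then the new column, then the rest
bob2 <- data.frame(bob[, 1:(pos - 1)], newcolumn, bob[, pos:ncol(bob)])
names(bob2)[pos]
```

This form assumes pos is neither the first nor past the last column; those edge cases need separate handling.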
Dirk Eddelbuettel's answer works, but you don't need to indicate row numbers or specify entries in the lastname column. This code should do it for a data frame named df:
if(!("LastName" %in% names(df))){
df <- cbind(df[1:2],LastName=NA,df[3:length(df)])
}
(this defaults LastName to NA, but you could just as easily use "LastName='Smith'")
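One caveat with df[3:length(df)]: if df has only two columns, 3:length(df) evaluates to c(3, 2) and the subset fails with an undefined-columns error. A sketch that sidesteps this by adding the column at the end and then reordering by index; insert_col is a hypothetical helper name, not a base R function:

```r
# insert_col: hypothetical helper, adds 'name' after column 'after' if missing
insert_col <- function(df, name, value = NA, after = 2) {
  if (name %in% names(df)) return(df)      # already present: leave df unchanged
  n <- ncol(df)
  after <- max(0, min(after, n))           # clamp to a valid position
  df[[name]] <- value                      # appended as column n + 1 (value is recycled)
  df[append(seq_len(n), n + 1, after = after)]  # move it into place
}

two   <- insert_col(data.frame(a = 1:2, b = 3:4), "LastName")
three <- insert_col(data.frame(a = 1:2, b = 3:4, c = 5:6), "LastName", "Smith")
names(two)
names(three)
```

Because the order is built with append() on column indices, inserting after the last column needs no special case.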
or using cbind:
> example(data.frame) # to get 'd'
> bar <- cbind(d[1:3,1:2],LastName=c("Flim", "Flom", "Flam"),fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim A
2 1 2 Flom B
3 1 3 Flam B
I always thought something like append() [unfortunate though the name is] should be a generic function
## redefine append() as generic function
append.default <- append
append <- `body<-`(args(append),value=quote(UseMethod("append")))
append.data.frame <- function(x,values,after=length(x))
`row.names<-`(data.frame(append.default(x,values,after)),
row.names(x))
## apply the function
d <- (if( !"LastName" %in% names(d) )
append(d,values=list(LastName=c("Flim","Flom","Flam")),after=2) else d)
I am simply extracting a single row from a data.frame. Consider for example
d=data.frame(a=1:3,b=1:3)
d[1,] # returns a data.frame
# a b
# 1 1 1
The output matched my expectation. The result was not as I expected though when dealing with a data.frame that contains a single column.
d=data.frame(a=1:3)
d[1,] # returns an integer
# [1] 1
Indeed, here, the extracted data is not a data.frame anymore but an integer! To me, it seems a little strange that the same function applied to the same data type returns different data types. One of the issues with this conversion is the loss of the column name.
To solve the issue, I did
extractRow = function(d,index)
{
if (ncol(d) > 1)
{
return(d[index,])
} else
{
d2 = as.data.frame(d[index,])
names(d2) = names(d)
return(d2)
}
}
d=data.frame(a=1:3,b=1:3)
extractRow(d,1)
# a b
# 1 1 1
d=data.frame(a=1:3)
extractRow(d,1)
# a
# 1 1
But it seems unnecessarily cumbersome. Is there a better solution?
Just subset with the drop = FALSE option:
extractRow = function(d, index) {
return(d[index, , drop=FALSE])
}
R tries to simplify data.frame subsets by default. The same thing happens with columns:
d[, "a"]
# [1] 1 2 3
Alternatives are:
d[1, , drop = FALSE]
tibble::tibble which has drop = FALSE by default
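To make the effect of drop concrete, here is a small base R sketch that checks the class of each result (d and d2 are throwaway example frames):

```r
d <- data.frame(a = 1:3)

simplified <- d[1, ]                 # single column: simplified to a bare integer
kept       <- d[1, , drop = FALSE]   # stays a one-row data.frame
class(simplified)
class(kept)
names(kept)                          # the column name "a" survives

d2 <- data.frame(a = 1:3, b = 1:3)
class(d2[, "a"])                     # matrix-style column pick: simplified to a vector
class(d2["a"])                       # list-style single bracket: still a data.frame
```

The list-style form d2["a"] is worth knowing because it keeps the data.frame without needing drop = FALSE.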
I can't tell you why that happens - it seems weird. One workaround would be to use slice from dplyr (although using a library seems unnecessary for such a simple task).
library(dplyr)
slice(d, 1)
a
1 1
data.frames will simplify to vectors or scalars with base subsetting [,].
If you want to avoid that, you can use tibbles instead:
> tibble(a=1:2)[1,]
# A tibble: 1 x 1
a
<int>
1 1
tibble(a=1:2)[1,] %>% class
[1] "tbl_df" "tbl" "data.frame"
I am trying to train a model on data that has been converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I want to add a string to the column names to serve as a "tag" that differentiates the same word coming from different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my dataframe), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
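Applied to the two cases in the question, the same idea might look like the sketch below. The pos and neg tables and their word columns are invented for illustration:

```r
# Suffix every column name of (part of) mtcars
df <- head(mtcars[1:3])
names(df) <- paste0(names(df), "_sample")
names(df)

# Tag two hypothetical term tables before combining them,
# so that 'hello' from each field stays distinguishable
pos <- data.frame(hello = c(1, 0), great = c(0, 2))
neg <- data.frame(hello = c(0, 1), awful = c(3, 0))
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))
combined <- cbind(pos, neg)
names(combined)
```

No sapply or lapply is needed because paste0 is already vectorised over the names.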
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package, which renames by reference and doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
Suppose that I have a vector x whose elements I want to use to extract columns from a matrix or data frame M.
If x[1] = "A", I cannot use M$x[1] to extract the column with header name A, because M$A is recognized while M$"A" is not. How can I remove the quotes so that M$x[1] is M$A rather than M$"A" in this instance?
Don't use $ in this case; use [ instead. Here's a minimal example (if I understand what you're trying to do).
mydf <- data.frame(A = 1:2, B = 3:4)
mydf
# A B
# 1 1 3
# 2 2 4
x <- c("A", "B")
x
# [1] "A" "B"
mydf[, x[1]] ## As a vector
# [1] 1 2
mydf[, x[1], drop = FALSE] ## As a single column `data.frame`
# A
# 1 1
# 2 2
I think you would find your answer in the R Inferno. Start around Circle 8: "Believing it does as intended", one of the "string not the name" sub-sections. You might also find some explanation in this line from the help page at ?Extract: "The main difference is that $ does not allow computed indices, whereas [[ does."
Note that this approach is taken because the question specified using the approach to extract columns from a matrix or data frame, in which case, the [row, column] mode of extraction is really the way to go anyway (and the $ approach would not work with a matrix).
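For completeness, a short sketch of the other computed-index forms mentioned above, reusing the same mydf and x. The matrix M is created here only for illustration:

```r
mydf <- data.frame(A = 1:2, B = 3:4)
x <- c("A", "B")

col_a <- mydf[[x[1]]]   # [[ accepts a computed name; same values as mydf$A
both  <- mydf[x]        # several columns at once, still a data.frame

M <- as.matrix(mydf)    # $ is not available on a matrix
m_col <- M[, x[1]]      # [row, column] indexing by name still works
```

Writing M$A instead would fail, since $ is invalid for atomic vectors such as matrices.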
I want to rename one column of a dataframe, referencing it by its original name. Is this possible? For example, say I had the table 'data'
a b c
1 2 2
3 2 3
4 1 2
and I wanted to update the name of column b to 'd'. I know I could use
colnames(data)[2] <- 'd'
but can I make the change by specifically referencing b, i.e. something like
colnames(data)['b'] <- 'd'
so that if the column ordering of the dataframe changes the correct column name will still be updated.
Thanks in advance
There is a function setnames built into package data.table for exactly that.
setnames(DT, "b", "d")
It changes the names by reference with no copy at all. Any other method using names(data)<- or names(data)[i]<- or similar will copy the entire object, usually several times, even though all you're doing is changing a column name.
Recent versions of data.table also accept a plain data.frame in setnames. In older versions DT must be a data.table, so you'd need to switch to data.table or convert using as.data.table to use it.
Here is the extract from ?setnames. The intention is that you run example(setnames) at the prompt and then the comments relate to the copies you see being reported by tracemem.
DF = data.frame(a=1:2,b=3:4) # base data.frame to demo copies
tracemem(DF)
colnames(DF)[1] <- "A" # 4 copies of entire object
names(DF)[1] <- "A" # 3 copies of entire object
names(DF) <- c("A", "b") # 2 copies of entire object
`names<-`(DF,c("A","b")) # 1 copy of entire object
x=`names<-`(DF,c("A","b")) # still 1 copy (so not print method)
# What if DF is large, say 10GB in RAM. Copy 10GB just to change a column name?
DT = data.table(a=1:2,b=3:4,c=5:6)
tracemem(DT)
setnames(DT,"b","B") # by name; no match() needed. No copy.
setnames(DT,3,"C") # by position. No copy.
setnames(DT,2:3,c("D","E")) # multiple. No copy.
setnames(DT,c("a","E"),c("A","F")) # multiple by name. No copy.
setnames(DT,c("X","Y","Z")) # replace all. No copy.
As of October 2014 this can now be done easily in the dplyr package:
rename(data, d = b)
This seems like a hack, but the first thing that came to mind was to use grepl() with a search string detailed enough to match only the column you want. I'm sure there are better options:
dat <- data.frame(a = 1:3, b = 1:3, c = 1:3)
colnames(dat)[grepl("b", colnames(dat))] <- "foo"
dat
#------
a foo c
1 1 1 1
2 2 2 2
3 3 3 3
As Joran points out below, I overcomplicated things...no need for a regex at all. This saves a few characters on the typing too.
colnames(dat)[colnames(dat) == "foo"] <- "bar"
#------
a bar c
1 1 1 1
2 2 2 2
3 3 3 3
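One thing to keep in mind with the == form is that it silently does nothing when no column has the old name. A sketch of a stricter variant using match(), applied to the table from the question; rename_col is a hypothetical helper name:

```r
# rename_col: hypothetical helper that renames by name and fails loudly
rename_col <- function(df, old, new) {
  i <- match(old, names(df))                 # position of the first exact match
  if (is.na(i)) stop("no column named '", old, "'")
  names(df)[i] <- new
  df
}

data <- data.frame(a = c(1, 3, 4), b = c(2, 2, 1), c = c(2, 3, 2))
shuffled <- data[c("c", "b", "a")]           # change the column order first
renamed <- rename_col(shuffled, "b", "d")    # still renames the right column
names(renamed)
```

Whether a missing name should be an error or a no-op is a design choice; the sketch picks the error so typos surface early.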
Yes but it's more difficult (as far as I know) than numeric indexing. I'm going to provide a dirty function that will do this and if you want to see how to do it just tear the function apart line by line:
rename <- function(df, column, new){
x <- names(df) #Did this to avoid typing twice
if (is.numeric(column)) column <- x[column] #Take numeric input by indexing
names(df)[x %in% column] <- new #What you're interested in
return(df)
}
#try it out
rename(mtcars, 'mpg', 'NEW')
rename(mtcars, 1, 'NEW')
I disagree with @Chase - the grepl solution ain't the luckiest one. I'd say: go with simple ==. Here's why:
d <- data.frame(matrix(rnorm(100), 10))
colnames(d) <- replicate(10, paste(sample(letters[1:5], size = 5, replace=TRUE, prob=c(.1, .6, .1, .1, .1)), collapse = ""))
Now try doing grepl("b", colnames(d)): it matches every name that contains a b, not just a column named exactly "b". Either pass fixed = TRUE, or even better do a simple colnames(d) == "b" as @joran suggested. Regex matching will generally be slower than ==, so for simple tasks like this you may want to use ==.
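A deterministic version of the same comparison, using a fixed set of made-up names instead of random ones:

```r
nms <- c("b", "ab", "bb", "c")

sub_regex <- grepl("b", nms)                # substring match: hits "b", "ab", "bb"
sub_fixed <- grepl("b", nms, fixed = TRUE)  # literal, but still a substring match
exact_eq  <- nms == "b"                     # exact comparison: only "b"
exact_rx  <- grepl("^b$", nms)              # anchored regex: also only "b"
```

Note that fixed = TRUE only turns off regex interpretation; it does not make the match exact, so == (or an anchored pattern) is what restricts the result to a single column.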
In my code, I am filling the columns of a dataframe with vectors, as so:
df1[columnNum] <- barWidth
This works fine, except for one thing: I want the name of the vector variable (barWidth above) to be retained as the column header, one column at a time. Furthermore, I do not wish to use cbind. This slows the execution of my code down considerably. Consequently, I am using a pre-allocated dataframe.
Can this be done in the vector-to-column assignment? If not, then how do I change it after the fact? I can't find the right syntax to do this with colnames().
TIA
It's being done by the [<-.data.frame function. It could conceivably be replaced by one that looked at the name of the argument but it's such a fundamental function I would be hesitant. Furthermore there appears to be an aversion to that practice signaled by this code at the top of the function definition:
> `[<-.data.frame`
function (x, i, j, value)
{
if (!all(names(sys.call()) %in% c("", "value")))
warning("named arguments are discouraged")
nA <- nargs()
if (nA == 4L) {
<snipped rest of rather long definition>
I don't know why that is there, but it is. Maybe you should either be thinking about using names<- after the column assignment, or using this method:
> dfrm["barWidth"] <- barWidth
> dfrm
a V2 barWidth
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
This can be generalized to a list of new columns:
dfrm <- data.frame(a=letters[1:4])
barWidth <- 1:4
newcols <- list(barWidth=barWidth, bw2 =barWidth)
dfrm[names(newcols)] <- newcols
dfrm
#
a barWidth bw2
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
If you have the list of names of vectors you want to apply you could do:
namevec <- c(...,"barWidth"...,)
columnNums <- c(...,10,...)
df1[columnNums[i]] <- get(namevec[i])
names(df1)[columnNums[i]] <- namevec[i]
or even
columnNums <- c(barWidth=4,...)
for (i in seq_along(columnNums)) {
df1[columnNums[i]] <- get(names(columnNums)[i])
}
names(df1)[columnNums] <- names(columnNums)
but the deeper question would be where this set of vectors is coming from in the first place: could you have them in a list all along?
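If the vectors do live in a named list, the loop and get() are not needed. A minimal sketch, assuming a pre-allocated df1 and two invented vectors (barHeight and all values are placeholders):

```r
# Pre-allocated data frame with placeholder columns
df1 <- data.frame(id = 1:4, V2 = NA, V3 = NA)

# Vectors kept in a named list from the start
vecs <- list(barWidth  = c(0.5, 0.6, 0.7, 0.8),
             barHeight = c(10, 20, 30, 40))
cols <- c(2, 3)                      # target column positions, one per list element

df1[cols] <- vecs                    # fill both columns in one assignment
names(df1)[cols] <- names(vecs)      # carry the vector names over as headers
names(df1)
```

The list names play the role that the variable names played in the original code, so nothing has to be recovered from the calling expression.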
I'd simply use cbind():
df1 <- cbind( df1, barWidth )
which retains the name. It will, however, end up as the last column in df1.