R: control auto-created column names in call to rbind() - r

If I do something like this:
> df <- data.frame()
> rbind(df, c("A","B","C"))
X.A. X.B. X.C.
1 A B C
You can see the row gets added to the empty data frame. However, the columns get named automatically based on the content of the data.
This causes problems if I later want to:
> df <- rbind(df, c("P", "D", "Q"))
Is there a way to control the names of the columns that get automatically created by rbind? Or some other way to do what I'm attempting to do here?

#baha-kev has a good answer regarding strings and factors.
I just want to point out the weird behavior of rbind for data.frame:
# This is "should work", but it doesn't:
# Create an empty data.frame with the correct names and types
df <- data.frame(A=numeric(), B=character(), C=character(), stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Messes up names!
rbind(df, list(A=42, B='foo', C='bar')) # OK...
# If you have at least one row, names are kept...
df <- data.frame(A=0, B="", C="", stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Names work now...
But if you only have strings then why not use a matrix instead? Then it works fine to start with an empty matrix:
# Create a 0x3 matrix:
m <- matrix('', 0, 3, dimnames=list(NULL, LETTERS[1:3]))
# Now add a row:
m <- rbind(m, c('foo','bar','baz')) # This works fine!
m
# Then optionally turn it into a data.frame at the end...
as.data.frame(m, stringsAsFactors=FALSE)

Set the option "stringsAsFactors" to False, which stores the values as characters:
df=data.frame(first = 'A', second = 'B', third = 'C', stringsAsFactors=FALSE)
rbind(df,c('Horse','Dog','Cat'))
first second third
1 A B C
2 Horse Dog Cat
sapply(df2,class)
first second third
"character" "character" "character"
Later, if you want to use factors, you could convert it like this:
df2 = as.data.frame(df, stringsAsFactors=T)

Related

Filter out all data frames which don't have the column Z in a list of data frames?

I have a list of six data frames, from which 5/6 data frames have a column "Z". To proceed with my script, I need to remove the data frame which doesn't have column Z, so I tried the following code:
for(i in 1:length(df)){
if(!("Z" %in% colnames(df[[i]])))
{
df[[i]] = NULL
}
}
This seem'd to actually do the job (it removed the one data frame from the list, which didn't have the column Z), BUT however I still got a message "Error in df[[i]] : subscript out of bounds". Why is that, and how could I get around the error?
The base Filter function works well here:
df <- Filter(\(x) "Z" %in% names(x), df)
As to why your method doesn't work, for(i in 1:length(df)) iterates over each item in the original length(df). As soon as df[[i]] = NULL happens once, then df is shorter than it was when the loop started, so the last iteration will be out of bounds. And you'll also skip some items: if df[[2]] is removed then the original df[[3]] is now df[[2]], and the current df[[3]] was originally df[[4]], so you hop over the original df[[3]] without checking it. Lesson: don't change the length of objects in the midst of iterating over them.
If df is your list of 6 dataframes, you can do this:
df <- df[sapply(df, \(i) "Z" %in% colnames(i))]
The reason you get the error is that your loop will reduce the length of df, such that i will eventually be beyond the (new) length of df. There will be no error if the only frame in df without column Z is the last frame.
Using discard:
list_df <- list(df1, df2)
purrr::discard(list_df, ~any(colnames(.x) == "Z"))
Output:
[[1]]
A B
1 1 3
2 3 4
As you can see it removed the first dataframe which had column Z.
data
df1 <- data.frame(A = c(1,2),
Z = c(1,4))
df2 <- data.frame(A = c(1,3),
B = c(3,4))

Add different suffix to column names on multiple data frames in R

I'm trying to add different suffixes to my data frames so that I can distinguish them after I've merge them. I have my data frames in a list and created a vector for the suffixes but so far I have not been successful.
data2016 is the list containing my 7 data frames
new_names <- c("june2016", "july2016", "aug2016", "sep2016", "oct2016", "nov2016", "dec2016")
data2016v2 <- lapply(data2016, paste(colnames(data2016)), new_names)
Your query is not quite clear. Therefore two solutions.
The beginning is the same for either solution. Suppose you have these four dataframes:
df1x <- data.frame(v1 = rnorm(50),
v2 = runif(50))
df2x <- data.frame(v3 = rnorm(60),
v4 = runif(60))
df3x <- data.frame(v1 = rnorm(50),
v2 = runif(50))
df4x <- data.frame(v3 = rnorm(60),
v4 = runif(60))
Suppose further you assemble them in a list, something akin to your data2016using mgetand ls and describing a pattern to match them:
my_list <- mget(ls(pattern = "^df\\d+x$"))
The names of the dataframes in this list are the following:
names(my_list)
[1] "df1x" "df2x" "df3x" "df4x"
Solution 1:
Suppose you want to change the names of the dataframes thus:
new_names <- c("june2016", "july2016","aug2016", "sep2016")
Then you can simply assign new_namesto names(my_list):
names(my_list) <- new_names
And the result is:
names(my_list)
[1] "june2016" "july2016" "aug2016" "sep2016"
Solution 2:
You want to add the new_names literally as suffixes to the 'old' names, in which case you would use pasteor paste0 thus:
names(my_list) <- paste0(names(my_list), "_", new_names)
And the result is:
names(my_list)
[1] "df1x_june2016" "df2x_july2016" "df3x_aug2016" "df4x_sep2016"
You could use an index number within lapply to reference both the list and your vector of suffixes. Because there are a couple steps, I'll wrap the process in a function(). (Called an anonymous function because we aren't assigning a name to it.)
data2016v2 <- lapply(1:7, function(i) {
this_data <- data2016[[i]] # Double brackets for a list
names(this_data) <- paste0(names(this_data), new_names[i]) # Single bracket for vector
this_data # The renamed data frame to be placed into data2016v2
})
Notice in the paste0() line we are recycling the term in new_names[i], so for example if new_names[i] is "june2016" and your first data.frame has columns "A", "B", and "C" then it would give you this:
> paste0(c("A", "B", "C"), "june2016")
[1] "Ajune2016" "Bjune2016" "Cjune2016"
(You may want to add an underscore in there?)
As an aside, it sounds like you might be better served by adding the "june2016" as a column in your data (like say a variable named month with "june2016" as the value in each row) and combining your data using something like bind_rows() from the dplyr package, running it "long" instead of "wide".

Extract specific information from a list of dataframes

I am facing the following challenge:
I have a list of dataframes in R and I'd like to extract some specific information from it. Here is an example:
df_1 <- data.frame(A = c(1,2), B = c(3,4), D = c(5,6))
df_2 <- data.frame(A = c(7,8), B = c(9,10), D = c(11,12))
df_3 <- data.frame(A = c(0,1), B = c(2,3), D = c(4,5))
L <- list(df_1, df_2, df_3)
What I'd like to extract are the values at position (1,1) in each of these dataframes. In the above case this would be: 1, 7, 0.
Is there a way to extract this information easily, probably with one line of code?
As Ronak has suggested , you can use function like lapply and wrap it with unlist for desired output.
unlist(lapply(L,function(x) x[1,1]))
In addition to the *apply methods shown above, you can also do this in a Vectorized manner. Since all the data frames in your list have the same column names, and you want the first element from the first column, i.e. 'A1', then you can simply unlist (which will create a named vector) and grab the values with the name A1.
v1 <- unlist(L)
v1[names(v1) == 'A1']
#A1 A1 A1
# 1 7 0

One liner wanted: Create data frame and give colnames: R data.frame(..., colnames = c("a", "b", "c"))

Is there an easier (i.e. one line of code instead of two!) way to do the following:
results <- as.data.frame(str_split_fixed(c("SampleID_someusefulinfo.countsA" , "SampleID_someusefulinfo.countsB" , "SampleID_someusefulinfo.counts"), "\\.", n=2))
names(results) <- c("a", "b")
Something like:
results <- data.frame(str_split_fixed(c("SampleID_someusefulinfo.countsA" , "SampleID_someusefulinfo.countsB" , "SampleID_someusefulinfo.counts"), "\\.", n=2), colnames = c("a", "b"))
I do this a lot, and would really love to have a way to have this in one line of code.
/data.table works too, if it's easier to do there than in base data.frame/
Clarifying:
My expected output (which is achieved by running the two lines of code at the top - AND I WANT IT TO BE ONE - THAT's IT!!!) is a result data frame of the structure:
results
a b
1 SampleID_someusefulinfo countsA
2 SampleID_someusefulinfo countsB
3 SampleID_someusefulinfo counts
What I would like to do is:
CREATE the data frame from a matrix or with some content (for example the toy code of matrix(c(1,2,3,4),nrow=2,ncol=2) I provided in the first example I wrote)
SPECIFY IN THAT SAME LINE what I would like the column names of my data frame to be
Use setNames() around a data.frame
setNames(data.frame(matrix(c(1,2,3,4),nrow=2,ncol=2)), c("a","b"))
# a b
#1 1 3
#2 2 4
?setNames:
a convenience function that sets the names on an object and returns the object
> setNames
function (object = nm, nm)
{
names(object) <- nm
object
}
We can use the dimnames option in matrix as the OP was using matrix to create the data.
data.frame(matrix(1:4, 2, 2, dimnames=list(NULL, c("a", "b"))))
Or
`colnames<-`(data.frame(matrix(1:4, 2, 2)), c('a', 'b'))

rename the columns name after cbind the data

merger <- cbind(as.character(Date),weather1$High,weather1$Low,weather1$Avg..High,weather1$Avg.Low,sale$Scanned.Movement[a])
After cbind the data, the new DF has column names automatically V1, V2......
I want rename the column by
colnames(merger)[,1] <- "Date"
but failed. And when I use merger$V1 ,
Error in merger$V1 : $ operator is invalid for atomic vectors
You can also name columns directly in the cbind call, e.g.
cbind(date=c(0,1), high=c(2,3))
Output:
date high
[1,] 0 2
[2,] 1 3
Try:
colnames(merger)[1] <- "Date"
Example
Here is a simple example:
a <- 1:10
b <- cbind(a, a, a)
colnames(b)
# change the first one
colnames(b)[1] <- "abc"
# change all colnames
colnames(b) <- c("aa", "bb", "cc")
you gave the following example in your question:
colnames(merger)[,1]<-"Date"
the problem is the comma: colnames() returns a vector, not a matrix, so the solution is:
colnames(merger)[1]<-"Date"
If you pass only vectors to cbind() it creates a matrix, not a dataframe. Read ?data.frame.
A way of producing a data.frame and being able to do this in one line is to coerce all matrices/data frames passed to cbind into a data.frame while setting the column names attribute using setNames:
a = matrix(rnorm(10), ncol = 2)
b = matrix(runif(10), ncol = 2)
cbind(setNames(data.frame(a), c('n1', 'n2')),
setNames(data.frame(b), c('u1', 'u2')))
which produces:
n1 n2 u1 u2
1 -0.2731750 0.5030773 0.01538194 0.3775269
2 0.5177542 0.6550924 0.04871646 0.4683186
3 -1.1419802 1.0896945 0.57212043 0.9317578
4 0.6965895 1.6973815 0.36124709 0.2882133
5 0.9062591 1.0625280 0.28034347 0.7517128
Unfortunately, there is no setColNames function analogous to setNames for data frames that returns the matrix after the column names, however, there is nothing to stop you from adapting the code of setNames to produce one:
setColNames <- function (object = nm, nm) {
colnames(object) <- nm
object
}
See this answer, the magrittr package contains functions for this.
If you offer cbind a set of arguments all of whom are vectors, you will get not a dataframe, but rather a matrix, in this case an all character matrix. They have different features. You can get a dataframe if some of your arguments remain dataframes, Try:
merger <- cbind(Date =as.character(Date),
weather1[ , c("High", "Low", "Avg..High", "Avg.Low")] ,
ScnMov =sale$Scanned.Movement[a] )
It's easy just add the name which you want to use in quotes before adding
vector
a_matrix <- cbind(b_matrix,'Name-Change'= c_vector)

Resources