Create empty data.frame with column names

I am trying to create an empty data frame with two columns and an unknown number of rows. I would like to specify the names of the columns. I ran the following command:
dat <- data.frame("id"=numeric(),"nobs"=numeric())
I can test the result by running
> str(dat)
'data.frame': 0 obs. of 2 variables:
$ id : num
$ nobs: num
But later on, when I insert data into this data frame using rbind in the following loop, the names of the columns are also changed:
for (i in id) {
nobs = nrow(na.omit(read.csv(files_list[i])))
dat = rbind(dat, c(i,nobs))
}
After the for loop, this is the value of dat:
dat
X3 X243
1 3 243
And str command shows the following
str(dat)
'data.frame': 1 obs. of 2 variables:
$ X3 : num 3
$ X243: num 243
Can anyone tell me why the column names in the data frame change?
EDIT:
My lazy solution to fix the problem is to run the following commands after the for loop that binds data to my data.frame:
names(dat)[1] = "id"
names(dat)[2] = "nobs"

Interestingly, the rbind.data.frame function throws away all values passed that have zero rows. It basically happens in this line
allargs <- allargs[nr > 0L]
so passing in a data.frame with no rows is really like not passing it in at all. This is another good example of why it's almost always a bad idea to build a data.frame row by row. It is better to build vectors and combine them into a data.frame only when done.
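A minimal sketch of the build-vectors-first approach, applied to the per-file counts from the question. The `frames` list is a hypothetical stand-in for the results of read.csv(files_list[i]), so the example runs without any files:

```r
# Hypothetical stand-ins for the data frames read from files_list
frames <- list(
  data.frame(x = c(1, NA, 3)),
  data.frame(x = c(4, 5, 6)),
  data.frame(x = c(NA, NA, 9))
)
id <- seq_along(frames)

# Count the complete rows for every id in one pass
nobs <- vapply(frames, function(f) nrow(na.omit(f)), integer(1))

# Combine into a data.frame once; the column names are set explicitly
dat <- data.frame(id = id, nobs = nobs)
str(dat)
```

Because the data frame is created only once, there is no zero-row frame for rbind to drop, and the names never need repairing afterwards.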

dat = data.frame(col1=numeric(), col2=numeric())
...loop
dat[nrow(dat) + 1, ] = c(324, 234)
This keeps the column names, because the new row is assigned by index instead of being bound with rbind.

You should try specifying your column names inside the rbind():
dat = rbind(dat, data.frame("id" = i, "nobs" = nobs))

I would change how you're appending the data to the data frame. Since rbind seems to remove the column names, just assign to the indexed location instead.
dat <- data.frame("id"=numeric(),"nobs"=numeric())
for (i in id) {
  # fill both columns of the next free row; the column names are kept
  dat[nrow(dat) + 1, ] <- c(i, nrow(na.omit(read.csv(files_list[i]))))
}
FYI, in R versions before 4.0.0, data frame creation converts all strings to factors by default. That is not an issue here, since all your columns are numeric. But if you had a character() column, you would want to override that default with stringsAsFactors=FALSE so that you can append character values.

Related

Add label with names of the variables using expss

I have a data frame with 383 variables. Because the names of the variables are long and self-explanatory, I would like to add these names to the labels of variables, then in a second step (already successfully done), I would rename variables for easier coding. I have tried the following with the error:
library(expss)
REGCON_CA_FIRM <- apply_labels(REGCON_CA_FIRM,names(REGCON_CA_FIRM)<-names(REGCON_CA_FIRM))
# Error in if (curr_name %in% data_names) { : argument is of length zero

A one-liner using mtcars:
do.call(apply_labels, c(list(data=mtcars),setNames(names(mtcars), names(mtcars)) %>% as.list()))
However, for your use case, you can create a small function as below that takes a dataframe and a vector of new names, and basically moves the current column names to labels, and replaces the original (i.e. too long) names with the new names
replace_long_with_short <- function(d,short_names) {
  setNames(
    # use the argument d here, not a global df
    do.call(apply_labels, c(list(data=d),setNames(names(d), names(d)) %>% as.list())),
    short_names
  )
}
Pass your dataframe to this function, along with desired new names. The function will return the frame with the original column names as labels, and the new colnames will be the desired new names:
Example: Let's say you have a data frame that looks like this:
X.is.an.important.variable Y.is.also.important
1 -0.003643385 1.1052905
2 1.641458152 0.5303247
3 -1.058337452 0.5490569
and you want those descriptive column names to be the labels, and the new names to be x and y.
Then calling the above function like this:
df = replace_long_with_short(df,c("x", "y"))
will convert df to this:
x y
1 -0.003643385 1.1052905
2 1.641458152 0.5303247
3 -1.058337452 0.5490569
and the labels will be attached:
str(df)
'data.frame': 3 obs. of 2 variables:
$ x:Class 'labelled' num -0.00364 1.64146 -1.05834
.. .. LABEL: X.is.an.important.variable
$ y:Class 'labelled' num 1.105 0.53 0.549
.. .. LABEL: Y.is.also.important

Creating/Populating Empty Data Frames in R

I am working with R. I found this link on creating empty data frames in R: Create an empty data.frame.
I tried to do something similar:
df <- data.frame(Date=as.Date(character()),
country=factor(),
total=numeric(),
stringsAsFactors=FALSE)
Yet, when I try to populate it:
df$total = 7
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, total, value = 7) :
replacement has 1 row, data has 0
df[1, "total"] <- rnorm(100,100,100)
Error in `[<-.data.frame`(`*tmp*`, 1, "total", value = c(-79.4584309347689, :
replacement has 100 rows, data has 1
Does anyone know how to fix this error?
Thanks

An option is to specify the row index
df[1, "total"] <- 7
-output
str(df)
#'data.frame': 1 obs. of 3 variables:
# $ Date : Date, format: NA
# $ country: Factor w/ 0 levels: NA
# $ total : num 7
The issue is that when we select a single column and assign to it on a 0-row dataset, the rows of the other columns are not expanded automatically. By specifying the row index, the other columns are automatically filled with the default NA.
Regarding the second question (updated), a standard data.frame column is a vector, and the length of that vector should match the index we are specifying. Suppose we want to expand to 100 rows; change the index accordingly:
df[1:100, "total"] <- rnorm(100, 100, 100) # length is 100 here
dim(df)
#[1] 100 3
Or if we need to cram everything in a single row, then wrap the rnorm in a list
df[1, "total"] <- list(rnorm(100, 100, 100))
In short, the lhs should be of the same length as the rhs. Another case is when we are assigning from a different dataset
df[seq_along(aa$bb), "total"] <- aa$bb
This can also be done without initialization i.e.
df <- data.frame(total = aa$bb)

Exponential notation not precise

I have imported a dataset which contains large numbers which were automatically converted to exponential notation. Because I had to see the full number, I used options(scipen = 999). I discovered that the imported number did not equal the original number from the dataset. For example: 5765949338897345178 was changed to 5765949338897345536.
How can it be that these numbers are not the same? The weird thing is that when I use: which(dim_alias1$id == 5765949338897345536) and which(dim_alias1$id == 5765949338897345178), it returns the same rownumber. How is this possible?

As you are using the variable as an id number, it doesn't need to be numeric. So set the column class to character when reading in.
Example:
dat <- data.frame(id=12345, x=1)
write.table(dat, tmp <- tempfile())
dat2 <- read.table(tmp, colClasses = c(id="character"))
str(dat2)
#'data.frame': 1 obs. of 2 variables:
# $ id: chr "12345"
# $ x : int 1
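As for why the two ids compare equal: R stores numeric values as double-precision floats, which represent integers exactly only up to 2^53. The ids in the question are far above that, so nearby integers collapse onto the same double. A short illustration (the ids are kept as strings here so that no digits are lost):

```r
# Doubles have a 53-bit significand, so above 2^53 neighbouring integers collapse
limit <- 2^53
limit == limit + 1          # TRUE: limit + 1 is not representable as a double

# Kept as character, the two ids from the question stay distinct
id_a <- "5765949338897345178"
id_b <- "5765949338897345536"
id_a == id_b                # FALSE
```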

Reading in Data.Frames with Strings as factors = False in R using chain operator

I have a table source that reads into a data frame. I know that by default, external sources are read into data frames as factors. I'd like to apply stringsAsFactors=FALSE in the data frame call below, but it throws an error when I do this. Can I still use chaining and turn stringsAsFactors=FALSE?
library(rvest)
pvbData <- read_html(pvbURL)
pvbDF <- pvbData %>%
html_nodes(xpath = '//*[@id="ajax_result_table"]') %>%
html_table() %>%
data.frame()
data.frame(,stringsAsFactors=FALSE) <- Throws an error
I know this is probably something very simple, but I'm having trouble finding a way to make this work. Thank you for your help.

Though the statement should logically be data.frame(stringsAsFactors=FALSE) if you are applying chaining, even this statement doesn't produce the required output.
The reason is a misunderstanding of how the stringsAsFactors option is used. This option works only if you build the data.frame column by column. Example:
a <- data.frame(x = c('a','b'),y=c(1,2),stringsAsFactors = T)
str(a)
'data.frame': 2 obs. of 2 variables:
$ x: Factor w/ 2 levels "a","b": 1 2
$ y: num 1 2
a <- data.frame(x = c('a','b'),y=c(1,2),stringsAsFactors = F)
str(a)
'data.frame': 2 obs. of 2 variables:
$ x: chr "a" "b"
$ y: num 1 2
If you give a data.frame as input, the stringsAsFactors option has no effect.
Solution:
Store the chaining result to a variable like this:
library(rvest)
pvbData <- read_html(pvbURL)
pvbDF <- pvbData %>%
html_nodes(xpath = '//*[@id="ajax_result_table"]') %>%
html_table()
And then apply this command:
data.frame(as.list(pvbDF),stringsAsFactors=F)
Update:
If the column is already a factor, then you can't convert it to a character vector using this command. It is better to apply as.character to it first and retry.
You may refer to Change stringsAsFactors settings for data.frame for more details.
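To make the as.character step concrete, here is a minimal sketch that converts every factor column of an existing data frame to character. The data frame `d` is made up for illustration and is not the scraped table from the question:

```r
# Made-up data frame with a factor column (forced via stringsAsFactors = TRUE)
d <- data.frame(x = c("a", "b"), y = c(1, 2), stringsAsFactors = TRUE)

# Convert the factor columns to character; leave the other columns untouched
is_fac <- vapply(d, is.factor, logical(1))
d[is_fac] <- lapply(d[is_fac], as.character)

str(d)
```

Since this is a plain assignment on the finished data frame, it can be applied to whatever the chain returns without changing the chain itself.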

R: losing column names when adding rows to an empty data frame

I am just starting with R and encountered a strange behaviour: when inserting the first row in an empty data frame, the original column names get lost.
example:
a<-data.frame(one = numeric(0), two = numeric(0))
a
#[1] one two
#<0 rows> (or 0-length row.names)
names(a)
#[1] "one" "two"
a<-rbind(a, c(5,6))
a
# X5 X6
#1 5 6
names(a)
#[1] "X5" "X6"
As you can see, the column names one and two were replaced by X5 and X6.
Could somebody please tell me why this happens and is there a right way to do this without losing column names?
A shotgun solution would be to save the names in an auxiliary vector and then add them back when finished working on the data frame.
Thanks
Context:
I created a function which gathers some data and adds them as a new row to a data frame received as a parameter.
I create the data frame, iterate through my data sources, passing the data.frame to each function call to be filled up with its results.

The rbind help page specifies that:
For ‘cbind’ (‘rbind’), vectors of zero
length (including ‘NULL’) are ignored
unless the result would have zero rows
(columns), for S compatibility.
(Zero-extent matrices do not occur in
S3 and are not ignored in R.)
So, in fact, a is ignored in your rbind instruction. Not totally ignored, it seems, because as it is a data frame the rbind function is called as rbind.data.frame :
rbind.data.frame(c(5,6))
# X5 X6
#1 5 6
Maybe one way to insert the row could be :
a[nrow(a)+1,] <- c(5,6)
a
# one two
#1 5 6
But there may be a better way to do it depending on your code.
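Given the context in the question (a function that receives the data frame and adds a row to it), the indexed assignment could be wrapped as below. `append_row` is a hypothetical helper name, not a base R function:

```r
# Hypothetical helper: append one row by index so the column names survive
append_row <- function(df, values) {
  df[nrow(df) + 1, ] <- values
  df
}

a <- data.frame(one = numeric(0), two = numeric(0))
a <- append_row(a, c(5, 6))
a <- append_row(a, c(7, 8))
names(a)
```

The function works on a copy of the data frame, so the caller has to reassign the returned value, as done above.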

I was almost surrendering to this issue.
1) Create the data frame with stringsAsFactors set to FALSE, or you run straight into the next issue.
2) Don't use rbind - no idea why on earth it is messing up the column names. Simply do it this way:
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df <- data.frame(a = character(0), b=character(0), c=numeric(0))
df[nrow(df)+1,] <- c("d","gsgsgd",4)
#Warning messages:
#1: In `[<-.factor`(`*tmp*`, iseq, value = "d") :
# invalid factor level, NAs generated
#2: In `[<-.factor`(`*tmp*`, iseq, value = "gsgsgd") :
# invalid factor level, NAs generated
df <- data.frame(a = character(0), b=character(0), c=numeric(0), stringsAsFactors=F)
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df
# a b c
#1 d gsgsgd 4
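One caveat with the assignment above: c("d","gsgsgd",4) is a character vector, so the value assigned to column c is the string "4" rather than a number. A sketch of the same append using list(), which keeps each value's own type:

```r
df <- data.frame(a = character(0), b = character(0), c = numeric(0),
                 stringsAsFactors = FALSE)

# list() keeps "d" and "gsgsgd" as character and 4 as numeric
df[nrow(df) + 1, ] <- list("d", "gsgsgd", 4)
str(df)
```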

Workaround would be:
a <- rbind(a, data.frame(one = 5, two = 6))
?rbind states that merging objects demands matching names:
It then takes the classes of the
columns from the first data frame, and
matches columns by name (rather than
by position)

FWIW, an alternative design might have your functions building vectors for the two columns, instead of rbinding to a data frame:
ones <- c()
twos <- c()
Modify the vectors in your functions:
ones <- append(ones, 5)
twos <- append(twos, 6)
Repeat as needed, then create your data.frame in one go:
a <- data.frame(one=ones, two=twos)

One way to make this work generically and with the least amount of re-typing the column names is the following. This method doesn't require hacking the NA or 0.
rs <- data.frame(i=numeric(), square=numeric(), cube=numeric())
for (i in 1:4) {
calc <- c(i, i^2, i^3)
# append calc to rs
names(calc) <- names(rs)
rs <- rbind(rs, as.list(calc))
}
rs will have the correct names
> rs
  i square cube
1 1      1    1
2 2      4    8
3 3      9   27
4 4     16   64
Another way to do this more cleanly is to use data.table:
> library(data.table)
> df <- data.frame(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are messed up
  X1 X2
1  1  2
> df <- data.table(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are preserved
   a b
1: 1 2
Notice that a data.table is also a data.frame.
> class(df)
[1] "data.table" "data.frame"

You can do this:
Give one row to the initial data frame:
df=data.frame(matrix(nrow=1,ncol=length(newrow)))
Add your new row and take out the NAs:
newdf=na.omit(rbind(newrow,df))
But watch out that your newrow does not have NAs, or it will be erased too.
Cheers
Agus

I use the following solution to add a row to an empty data frame:
d_dataset <-
data.frame(
variable = character(),
before = numeric(),
after = numeric(),
stringsAsFactors = FALSE)
d_dataset <-
rbind(
d_dataset,
data.frame(
variable = "test",
before = 9,
after = 12,
stringsAsFactors = FALSE))
print(d_dataset)
variable before after
1 test 9 12
HTH.
Kind regards
Georg

Researching this venerable R annoyance brought me to this page. I wanted to add a bit more explanation to Georg's excellent answer (https://stackoverflow.com/a/41609844/2757825), which not only solves the problem raised by the OP (losing field names) but also prevents the unwanted conversion of all fields to factors. For me, those two problems go together. I wanted a solution in base R that doesn't involve writing extra code but preserves the two distinct operations: define the data frame, append the row(s)--which is what Georg's answer provides.
The first two examples below illustrate the problems and the third and fourth show Georg's solution.
Example 1: Append the new row as vector with rbind
Result: loses column names AND converts all variables to factors
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
c("Bob", 250)
)
my.df
X.Bob. X.250.
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ X.Bob.: Factor w/ 1 level "Bob": 1
$ X.250.: Factor w/ 1 level "250": 1
Example 2: Append the new row as a data frame inside rbind
Result: keeps column names but still converts character variables to factors.
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : Factor w/ 1 level "Bob": 1
$ score: num 250
Example 3: Append the new row inside rbind as a data frame, with stringsAsFactors=FALSE
Result: problem solved.
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250, stringsAsFactors=FALSE)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : chr "Bob"
$ score: num 250
Example 4: Like example 3, but adding multiple rows at once.
my.df <- data.frame(
table = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(
name=c("Bob", "Carol", "Ted"),
score=c(250, 124, 95),
stringsAsFactors=FALSE)
)
str(my.df)
'data.frame': 3 obs. of 2 variables:
$ name : chr "Bob" "Carol" "Ted"
$ score: num 250 124 95
my.df
name score
1 Bob 250
2 Carol 124
3 Ted 95

Instead of constructing the data.frame with numeric(0) I use as.numeric(0).
a<-data.frame(one=as.numeric(0), two=as.numeric(0))
This creates an extra initial row
a
# one two
#1 0 0
Bind the additional rows
a<-rbind(a,c(5,6))
a
# one two
#1 0 0
#2 5 6
Then use negative indexing to remove the first (bogus) row
a<-a[-1,]
a
# one two
#2 5 6
Note: it messes up the index (far left). I haven't figured out how to prevent that (anyone else?), but most of the time it probably doesn't matter.
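The leftover index is just the row names, which can be reset with rownames(a) <- NULL. A minimal sketch (the two-row data frame is built directly here, so the example does not depend on the rbind step):

```r
a <- data.frame(one = c(0, 5), two = c(0, 6))
a <- a[-1, ]           # drop the placeholder row; the remaining row is labelled "2"
rownames(a) <- NULL    # renumber the rows starting from 1
a
```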