Creating/Populating Empty Data Frames in R

I am working with R. I found this link on creating empty data frames in R: Create an empty data.frame.
I tried to do something similar:
df <- data.frame(Date=as.Date(character()),
country=factor(),
total=numeric(),
stringsAsFactors=FALSE)
Yet, when I try to populate it:
df$total = 7
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, total, value = 7) :
replacement has 1 row, data has 0
df[1, "total"] <- rnorm(100,100,100)
Error in `[<-.data.frame`(`*tmp*`, 1, "total", value = c(-79.4584309347689, :
replacement has 100 rows, data has 1
Does anyone know how to fix this error?
Thanks

An option is to specify the row index
df[1, "total"] <- 7
-output
str(df)
#'data.frame': 1 obs. of 3 variables:
# $ Date : Date, format: NA
# $ country: Factor w/ 0 levels: NA
# $ total : num 7
The issue is that when we select a single column and assign to it on a 0-row dataset, the rows of the other columns are not expanded automatically. When we specify the row index, the other columns are automatically filled with the default NA.
Regarding the second question (updated), a standard data.frame column is a vector, and the length of that vector should match the row index we specify. To expand to 100 rows, change the index accordingly:
df[1:100, "total"] <- rnorm(100, 100, 100) # length is 100 here
dim(df)
#[1] 100 3
Or if we need to cram everything in a single row, then wrap the rnorm in a list
df[1, "total"] <- list(rnorm(100, 100, 100))
In short, the lhs should be of the same length as the rhs. Another case is when we are assigning from a different dataset
df[seq_along(aa$bb), "total"] <- aa$bb
This can also be done without initialization i.e.
df <- data.frame(total = aa$bb)
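The behaviour described in this answer can be checked with a small base R sketch. The vector new_totals is a hypothetical stand-in for aa$bb, which is not defined in the question.

```r
# Empty data frame with the same columns as in the question
df <- data.frame(Date = as.Date(character()),
                 country = factor(),
                 total = numeric(),
                 stringsAsFactors = FALSE)

# Assigning a whole column on a 0-row data frame fails; capture the error
err <- tryCatch({ df$total <- 7; NULL }, error = function(e) e)

# Hypothetical values standing in for aa$bb
new_totals <- c(10, 20, 30)

# Indexing by row expands the data frame; other columns are filled with NA
df[seq_along(new_totals), "total"] <- new_totals

dim(df)
```

The row index on the left-hand side has the same length as the vector on the right-hand side, which is the rule the answer states.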

Related

Using the apply function over each column for adjusting of data.frame

My hope is to collapse columns 14:18 into one column, "Type". Each entry in this new column should hold the name of whichever of the 5 columns is 1 for that observation (only one of them can be true). This is my best attempt at doing this in R (and I am beyond frustrated).
library(caret)
data("cars")
carSubset <- subset(cars)
head(carSubset)
# I want to convert the columns of carSubset with the following names
types <- c("convertible","coupe", "hatchback", "sedan", "wagon")
# into 1 column named Type, with the corresponding column name
carSubset$Type <- NULL
carSubset <- apply(carSubset[,types],
2,
function(each_obs){
hit_index <- which(each_obs == 1)
carSubset$Type <- types[hit_index]
})
head(carSubset) # output:
1 2 3 4 5
"sedan" "coupe" "convertible" "convertible" "convertible"
Which is what I wanted ... however, I also wanted the rest of my data.frame to come along with it, like I just wanted the new column of "Type" but I cannot even access it with the following line of code...
head(carSubset$Type) # output: Error in carSubset$Type : $ operator is invalid for atomic vectors
Any help on how to Add a new column dynamically while appending previously related data observations to it?

I actually figured it out! Probably not the best way to do it, but hey, it works.
library(caret)
data("cars")
carSubset <- subset(cars)
head(carSubset)
# I want to convert the columns of carSubset with the following names
types <- c("convertible","coupe", "hatchback", "sedan", "wagon")
head(carSubset[,types])
carSubset[,types]
# into 1 column named Type, with the corresponding column name
carSubset$Type <- NULL
newSubset <- c()
newSubset <- apply(carSubset[,types],
1,
function(obs){
hit_index <- which(obs == 1)
newSubset <- types[hit_index]
})
newSubset
carSubset$Type <- cbind(Type = newSubset)
head(carSubset[, !(names(carSubset) %in% types)])
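As a possible alternative to the row-wise apply, base R's max.col can pick the position of the 1 in each row. This is only a sketch: cars_demo is a small hypothetical data frame used so that the example runs without the caret package, and it assumes exactly one indicator column is 1 per row.

```r
# Hypothetical stand-in for the caret cars data: one-hot body type columns
cars_demo <- data.frame(price       = c(100, 200, 300),
                        convertible = c(0, 1, 0),
                        coupe       = c(1, 0, 0),
                        sedan       = c(0, 0, 1))
types <- c("convertible", "coupe", "sedan")

# max.col returns, for each row, the index of the column holding the maximum
hit_index <- max.col(as.matrix(cars_demo[, types]), ties.method = "first")
cars_demo$Type <- types[hit_index]

# Drop the indicator columns and keep the rest of the data frame
cars_demo <- cars_demo[, !(names(cars_demo) %in% types)]
cars_demo
```

Because the result is assigned to a new column instead of overwriting the data frame, the other columns come along unchanged.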

median() coming back as NA for df of integers?

Have a data frame of numerical data and using apply with median along columns. I'm getting NA for the median even though there are some non-zero entries in the columns. I did str(df) to ensure all of the df is integer and it is. What does it mean when R says the median is NA? Thanks.
v1 v2 v3.....
1 3 4
0 0 0
. . .
Also, I got a bunch warnings like this:
"1: In mean.default(sort(x, partial = half + 0L:1L)[half + ... :
argument is not numeric or logical: returning NA"

My solution is trivial, but maybe there are some NAs you did not see. Try using apply with na.rm = TRUE as the last argument (it is passed on through the ellipsis).
Using the code provided by akrun.
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:5, 10*5, replace=TRUE), ncol=5))
apply(df1, 2, median)
I add some NA
df1[ 3 , "V2" ] <- NA
and then use sapply (which gives the same result because a data frame is a type of list)
sapply(df1, median, na.rm = TRUE)
edit:
Note that str(df1) still reports int even when there is an NA at row 3, column V2.
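The effect of a hidden NA can be seen in a small deterministic sketch. The data frame df_demo is hypothetical and chosen so the medians are easy to verify by hand.

```r
# Hypothetical integer data with one NA in column V2
df_demo <- data.frame(V1 = c(1L, 0L, 3L, 2L),
                      V2 = c(4L, NA, 0L, 2L))

# Default: any NA in a column makes its median NA
med_default <- sapply(df_demo, median)

# na.rm = TRUE is passed on to median() and drops the NA first
med_narm <- sapply(df_demo, median, na.rm = TRUE)

med_default
med_narm
```

A quick check such as colSums(is.na(df_demo)) shows which columns contain missing values before the medians are computed.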

dplyr join define NA values

Can I define a "fill" value for NA in dplyr join? For example in the join define that all NA values should be 1?
require(dplyr)
lookup <- data.frame(cbind(c("USD","MYR"),c(0.9,1.1)))
names(lookup) <- c("rate","value")
fx <- data.frame(c("USD","MYR","USD","MYR","XXX","YYY"))
names(fx)[1] <- "rate"
left_join(x=fx,y=lookup,by=c("rate"))
Above code will create NA for values "XXX" and "YYY". In my case I am joining a large number of columns and there will be a lot of non-matches. All non-matches should have the same value. I know I can do it in several steps but the question is can all be done in one?
Thanks!

First off, I would like to recommend not to use the combination data.frame(cbind(...)). Here's why: cbind creates a matrix by default if you only pass atomic vectors to it. And matrices in R can only have one type of data (think of matrices as a vector with dimension attribute, i.e. number of rows and columns). Therefore, your code
cbind(c("USD","MYR"),c(0.9,1.1))
creates a character matrix:
str(cbind(c("USD","MYR"),c(0.9,1.1)))
# chr [1:2, 1:2] "USD" "MYR" "0.9" "1.1"
although you probably expected a final data frame with a character or factor column (rate) and a numeric column (value). But what you get is:
str(data.frame(cbind(c("USD","MYR"),c(0.9,1.1))))
#'data.frame': 2 obs. of 2 variables:
# $ X1: Factor w/ 2 levels "MYR","USD": 2 1
# $ X2: Factor w/ 2 levels "0.9","1.1": 1 2
because, in R versions before 4.0.0, strings (characters) are converted to factors by data.frame by default (you can circumvent this by specifying stringsAsFactors = FALSE in the data.frame() call; from R 4.0.0 onward that is the default).
I suggest the following alternative approach to create the sample data (also note that you can easily specify the column names in the same call):
lookup <- data.frame(rate = c("USD","MYR"),
value = c(0.9,1.1))
fx <- data.frame(rate = c("USD","MYR","USD","MYR","XXX","YYY"))
Now, for your actual question: if I understand correctly, you want to replace all NAs with a 1 in the joined data. If that's correct, here's a custom function using left_join and mutate_each to do that:
library(dplyr)
left_join_NA <- function(x, y, ...) {
left_join(x = x, y = y, by = ...) %>%
mutate_each(funs(replace(., which(is.na(.)), 1)))
}
Now you can apply it to your data like this:
> left_join_NA(x = fx, y = lookup, by = "rate")
# rate value
#1 USD 0.9
#2 MYR 1.1
#3 USD 0.9
#4 MYR 1.1
#5 XXX 1.0
#6 YYY 1.0
#Warning message:
#joining factors with different levels, coercing to character vector
Note that you end up with a character column (rate) and a numeric column (value) and all NAs are replaced by 1.
str(left_join_NA(x = fx, y = lookup, by = "rate"))
#'data.frame': 6 obs. of 2 variables:
# $ rate : chr "USD" "MYR" "USD" "MYR" ...
# $ value: num 0.9 1.1 0.9 1.1 1 1

If you're using dplyr anyway, you might as well take advantage of dplyr::coalesce, and use the dplyr syntax to pass into that a 1 or 0. I think this looks nice...
... %>%
mutate_if(is.numeric,coalesce,0)
Where the 0 is the arg passed to dplyr::coalesce to replace NAs.
In the example in the question, the data frames contain factors. Real FX rates (or any other column in which you would replace NA with zero) would not be stored as factors, so I add a conversion step below just to make the answer executable with the provided example.
# replace NAs with zeros for all numeric columns
#
# ... code from question above
left_join(x=fx,y=lookup,by=c("rate")) %>%
# convert value to numeric; needed only because the toy example does not store it as numeric
mutate(value = as.numeric(as.character(value))) %>%
# the good stuff here
mutate_if(is.numeric,coalesce,0)

I stumbled on the same problem with dplyr and wrote a small function that solved my problem. (the solution requires tidyr and dplyr)
left_join0 <- function(x, y, fill = 0L){
z <- left_join(x, y)
tmp <- setdiff(names(z), names(x))
z <- replace_na(z, setNames(as.list(rep(fill, length(tmp))), tmp))
z
}
Originally answered at: R Left Outer Join with 0 Fill Instead of NA While Preserving Valid NA's in Left Table

A tidyverse solution is to use tidyr::replace_na after the join:
left_join(x = fx, y = lookup, by = c("rate")) %>%
replace_na(list(value = 0))
Or, for more general cases:
left_join(x = fx, y = lookup, by = c("rate")) %>%
mutate(across(where(is.numeric), ~ replace_na(.x, 0)))
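For readers without dplyr or tidyr installed, roughly the same result can be sketched in base R with merge. This is an assumption-laden alternative rather than a replacement for the answers above: it still takes two steps, and merge does not guarantee the original row order.

```r
# Sample data as suggested in the earlier answer
lookup <- data.frame(rate = c("USD", "MYR"),
                     value = c(0.9, 1.1),
                     stringsAsFactors = FALSE)
fx <- data.frame(rate = c("USD", "MYR", "USD", "MYR", "XXX", "YYY"),
                 stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of fx, like a left join
joined <- merge(fx, lookup, by = "rate", all.x = TRUE, sort = FALSE)

# Fill the non-matches with the chosen value (1 here)
joined$value[is.na(joined$value)] <- 1
joined
```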

lapply on single column in data frame

I have a data frame which I populate from a csv file as follows (data for sample only) :
> csv_data <- read.csv('test.csv')
> csv_data
gender country income
1 1 20 10000
2 2 20 12000
3 2 23 3000
I want to convert country to factor. However when I do the following, it fails :
> csv_data[,2] <- lapply(csv_data[,2], factor)
Warning message:
In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
provided 3 variables to replace 1 variables
However, if I convert both gender and country to factor, it succeeds :
> csv_data[,1:2] <- lapply(csv_data[,1:2], factor)
> is.factor(csv_data[,1])
[1] TRUE
> is.factor(csv_data[,2])
[1] TRUE
Is there something I am doing wrong? I want to use lapply since I want to programmatically convert the columns into factors and it could be possible that the number of columns to be converted is only 1(it could be more as well, this number is driven from arguments to a function). Any way I can do it using lapply only?

When subsetting for one single column, you'll need to change it slightly.
There's a big difference between
lapply(df[,2], factor)
and
lapply(df[2], factor)
## and/or
lapply(df[, 2, drop=FALSE], factor)
Have a look at the output of each. If you remove the comma, everything should work fine. Using the comma in [,] turns a single column into a vector and therefore each element in the vector is factored individually. Whereas leaving it out keeps the column as a list, which is what you want to give to lapply in this situation. However, if you use drop=FALSE, you can leave the comma in, and the column will remain a list/data.frame.
No good:
df[,2] <- lapply(df[,2], factor)
# Warning message:
# In `[<-.data.frame`(`*tmp*`, , 2, value = list(1L, 1L, 1L)) :
# provided 3 variables to replace 1 variables
Succeeds on a single column:
df[,2] <- lapply(df[,2,drop=FALSE], factor)
df[,2]
# [1] 20 20 23
# Levels: 20 23
In my opinion, the best way to subset data frame columns is without the comma. This also succeeds:
df[2] <- lapply(df[2], factor)
df[[2]]
# [1] 20 20 23
# Levels: 20 23
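The difference can be made concrete with the sample data from the question. In this sketch, cols_to_convert is a hypothetical variable standing in for the column selection that the asker says is driven by function arguments.

```r
# Sample data from the question, built inline instead of read from test.csv
csv_data <- data.frame(gender  = c(1L, 2L, 2L),
                       country = c(20L, 20L, 23L),
                       income  = c(10000L, 12000L, 3000L))

# With the comma a single column is dropped to a vector; without it, it stays a list
is_list_with_comma    <- is.list(csv_data[, 2])
is_list_without_comma <- is.list(csv_data[2])

# Hypothetical selection; works the same for one column or several
cols_to_convert <- "country"
csv_data[cols_to_convert] <- lapply(csv_data[cols_to_convert], factor)

str(csv_data)
```

Because single-bracket indexing without a comma always returns a data frame, the same line handles one column or many.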

R: losing column names when adding rows to an empty data frame

I am just starting with R and encountered a strange behaviour: when inserting the first row in an empty data frame, the original column names get lost.
example:
a<-data.frame(one = numeric(0), two = numeric(0))
a
#[1] one two
#<0 rows> (or 0-length row.names)
names(a)
#[1] "one" "two"
a<-rbind(a, c(5,6))
a
# X5 X6
#1 5 6
names(a)
#[1] "X5" "X6"
As you can see, the column names one and two were replaced by X5 and X6.
Could somebody please tell me why this happens and is there a right way to do this without losing column names?
A shotgun solution would be to save the names in an auxiliary vector and then add them back when finished working on the data frame.
Thanks
Context:
I created a function which gathers some data and adds them as a new row to a data frame received as a parameter.
I create the data frame, iterate through my data sources, passing the data.frame to each function call to be filled up with its results.

The rbind help page specifies that:
For ‘cbind’ (‘rbind’), vectors of zero
length (including ‘NULL’) are ignored
unless the result would have zero rows
(columns), for S compatibility.
(Zero-extent matrices do not occur in
S3 and are not ignored in R.)
So, in fact, a is ignored in your rbind instruction. Because a is a data frame, the call is dispatched to rbind.data.frame, which drops zero-row arguments before binding (see the data frame section of ?rbind), so the column names are built from the vector alone:
rbind.data.frame(c(5,6))
# X5 X6
#1 5 6
Maybe one way to insert the row could be :
a[nrow(a)+1,] <- c(5,6)
a
# one two
#1 5 6
But there may be a better way to do it depending on your code.

I was almost ready to give up on this issue.
1) Create the data frame with stringsAsFactors set to FALSE, or you run straight into the next issue.
2) Don't use rbind (no idea why on earth it messes up the column names). Simply do it this way:
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df <- data.frame(a = character(0), b=character(0), c=numeric(0))
df[nrow(df)+1,] <- c("d","gsgsgd",4)
#Warning messages:
#1: In `[<-.factor`(`*tmp*`, iseq, value = "d") :
# invalid factor level, NAs generated
#2: In `[<-.factor`(`*tmp*`, iseq, value = "gsgsgd") :
# invalid factor level, NAs generated
df <- data.frame(a = character(0), b=character(0), c=numeric(0), stringsAsFactors=F)
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df
# a b c
#1 d gsgsgd 4
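One caveat with this approach: c("d","gsgsgd",4) is a character vector, so the value stored in column c is the string "4". A minimal sketch of a variant that passes the row as a list, so each column keeps its own type (df_vec and df_list are hypothetical names):

```r
# Same empty data frame as above, created twice for comparison
make_empty <- function() {
  data.frame(a = character(0), b = character(0), c = numeric(0),
             stringsAsFactors = FALSE)
}

# Row given as a character vector: column c is coerced to character
df_vec <- make_empty()
df_vec[nrow(df_vec) + 1, ] <- c("d", "gsgsgd", 4)

# Row given as a list: each element keeps its own type
df_list <- make_empty()
df_list[nrow(df_list) + 1, ] <- list("d", "gsgsgd", 4)

sapply(df_vec, class)
sapply(df_list, class)
```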

Workaround would be:
a <- rbind(a, data.frame(one = 5, two = 6))
?rbind states that merging objects demands matching names:
It then takes the classes of the
columns from the first data frame, and
matches columns by name (rather than
by position)

FWIW, an alternative design might have your functions building vectors for the two columns, instead of rbinding to a data frame:
ones <- c()
twos <- c()
Modify the vectors in your functions:
ones <- append(ones, 5)
twos <- append(twos, 6)
Repeat as needed, then create your data.frame in one go:
a <- data.frame(one=ones, two=twos)
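A variation on the same design, sketched under the assumption that each function call produces one complete row: collect the rows in a list and bind them once at the end. The second row (7, 8) is a made-up value for illustration.

```r
# Accumulate one-row data frames in a list instead of growing a data frame
rows <- list()
rows[[length(rows) + 1]] <- data.frame(one = 5, two = 6)
rows[[length(rows) + 1]] <- data.frame(one = 7, two = 8)  # hypothetical second result

# Bind everything in one go; names come from the one-row data frames
a <- do.call(rbind, rows)
a
```

Since no empty data frame is ever passed to rbind, the column names cannot be lost.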

One way to make this work generically and with the least amount of re-typing the column names is the following. This method doesn't require hacking the NA or 0.
rs <- data.frame(i=numeric(), square=numeric(), cube=numeric())
for (i in 1:4) {
calc <- c(i, i^2, i^3)
# append calc to rs
names(calc) <- names(rs)
rs <- rbind(rs, as.list(calc))
}
rs will have the correct names
> rs
i square cube
1 1 1 1
2 2 4 8
3 3 9 27
4 4 16 64
>
Another way to do this more cleanly is to use data.table:
> library(data.table)
> df <- data.frame(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are messed up
X1 X2
1 1 2
> df <- data.table(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are preserved
a b
1: 1 2
Notice that a data.table is also a data.frame.
> class(df)
"data.table" "data.frame"

You can do this:
Give the initial data frame one row:
df=data.frame(matrix(nrow=1,ncol=length(newrow)))
Add your new row and take out the NAs:
newdf=na.omit(rbind(newrow,df))
but watch out: if your newrow contains NAs, it will be removed too.
Cheers
Agus

I use the following solution to add a row to an empty data frame:
d_dataset <-
data.frame(
variable = character(),
before = numeric(),
after = numeric(),
stringsAsFactors = FALSE)
d_dataset <-
rbind(
d_dataset,
data.frame(
variable = "test",
before = 9,
after = 12,
stringsAsFactors = FALSE))
print(d_dataset)
variable before after
1 test 9 12
HTH.
Kind regards
Georg

Researching this venerable R annoyance brought me to this page. I wanted to add a bit more explanation to Georg's excellent answer (https://stackoverflow.com/a/41609844/2757825), which not only solves the problem raised by the OP (losing field names) but also prevents the unwanted conversion of all fields to factors. For me, those two problems go together. I wanted a solution in base R that doesn't involve writing extra code but preserves the two distinct operations: define the data frame, append the row(s)--which is what Georg's answer provides.
The first two examples below illustrate the problems and the third and fourth show Georg's solution.
Example 1: Append the new row as vector with rbind
Result: loses column names AND converts all variables to factors
my.df <- data.frame(
name = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
c("Bob", 250)
)
my.df
X.Bob. X.250.
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ X.Bob.: Factor w/ 1 level "Bob": 1
$ X.250.: Factor w/ 1 level "250": 1
Example 2: Append the new row as a data frame inside rbind
Result: keeps column names but still converts character variables to factors.
my.df <- data.frame(
name = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : Factor w/ 1 level "Bob": 1
$ score: num 250
Example 3: Append the new row inside rbind as a data frame, with stringsAsFactors=FALSE
Result: problem solved.
my.df <- data.frame(
name = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250, stringsAsFactors=FALSE)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : chr "Bob"
$ score: num 250
Example 4: Like example 3, but adding multiple rows at once.
my.df <- data.frame(
name = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(
name=c("Bob", "Carol", "Ted"),
score=c(250, 124, 95),
stringsAsFactors=FALSE)
)
str(my.df)
'data.frame': 3 obs. of 2 variables:
$ name : chr "Bob" "Carol" "Ted"
$ score: num 250 124 95
my.df
name score
1 Bob 250
2 Carol 124
3 Ted 95

Instead of constructing the data.frame with numeric(0) I use as.numeric(0).
a<-data.frame(one=as.numeric(0), two=as.numeric(0))
This creates an extra initial row
a
# one two
#1 0 0
Bind the additional rows
a<-rbind(a,c(5,6))
a
# one two
#1 0 0
#2 5 6
Then use negative indexing to remove the first (bogus) row
a<-a[-1,]
a
# one two
#2 5 6
Note: it messes up the index (far left). I haven't figured out how to prevent that (anyone else?), but most of the time it probably doesn't matter.
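On the row index: one way to reset it, as far as base R is concerned, is to set the row names to NULL after dropping the placeholder row. A minimal sketch repeating the steps above:

```r
# Start with a placeholder row, append, then drop the placeholder
a <- data.frame(one = as.numeric(0), two = as.numeric(0))
a <- rbind(a, c(5, 6))
a <- a[-1, ]

# Reset the row names so numbering starts at 1 again
rownames(a) <- NULL
a
```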
