R: Adding row to a dataframe with multiple classes

I have a seemingly simple task of adding a row to a data frame in R but I just can't do it!
I have a data frame with 50 rows and 100 columns. The data frame, which I would like to keep in the same format, has the first column as a factor, and all other columns as characters -- this is what lapply produced. I would simply like to append a 51st row, but I get warnings every time.
My added data is of the form Cat <- c("Cat", 1,NA,3,NA,5). (I have no clue where " or ' need to go - quite new to R!)
rbind shows "invalid factor levels" every time.
e.g.
df <- rbind(df,Cat)
I believe this is because of the factor/character divide.

The factor levels in your data.frame should also contain the values in your "Cat" object for the relevant factor column.
Here's a simple example:
df <- data.frame(v1 = c("a", "b"), v2 = 1:2, stringsAsFactors = TRUE) # needed since R 4.0.0, where strings are no longer factors by default
toAdd <- list("c", 3)
## Warnings...
rbind(df, toAdd)
# v1 v2
# 1 a 1
# 2 b 2
# 3 <NA> 3
# Warning message:
# In `[<-.factor`(`*tmp*`, ri, value = "c") :
# invalid factor level, NA generated
## Possible fix
df$v1 <- factor(df$v1, unique(c(levels(df$v1), toAdd[[1]])))
rbind(df, toAdd)
# v1 v2
# 1 a 1
# 2 b 2
# 3 c 3
Alternatively, consider rbindlist from "data.table", which should save you from having to convert the factor levels:
> library(data.table)
> df <- data.frame(v1 = c("a", "b"), v2 = 1:2, stringsAsFactors = TRUE)
> rbindlist(list(df, toAdd))
v1 v2
1: a 1
2: b 2
3: c 3
> str(.Last.value)
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ v1: Factor w/ 3 levels "a","b","c": 1 2 3
$ v2: num 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
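For the situation in the question (first column a factor, all others character), another option that sidesteps factor levels is to convert the factor to character before binding and convert it back afterwards. A minimal base R sketch, assuming made-up column names (name, a, b) and a small stand-in data frame rather than the asker's real data:

```r
# Hypothetical stand-in for the asker's data:
# a factor first column followed by character columns.
animals <- data.frame(
  name = factor(c("Dog", "Bird")),
  a = c("1", "2"),
  b = c(NA, "4"),
  stringsAsFactors = FALSE
)

# Build the new row as a one-row data frame with matching column names.
# Values are quoted because the target columns are character.
cat_row <- data.frame(
  name = "Cat", a = "1", b = NA_character_,
  stringsAsFactors = FALSE
)

# Temporarily drop the factor class, bind, then restore it.
animals$name <- as.character(animals$name)
combined <- rbind(animals, cat_row)
combined$name <- factor(combined$name)
```

Building the new row as a data frame rather than with c() also answers the quoting question: c("Cat", 1, NA) coerces every value to character anyway, so only the string labels need quotes.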

Related

R: replacing <NA> within factor variables as 0

I am working with the R programming language. I have a dataset with both character and numeric variables - I am trying to replace all NA's and empty values in this data with "0". For a continuous variable, the NA/empty value should be replaced with a "numeric 0". For factor variables, the NA/empty value should be replaced with a "factor 0".
In the past, I used to use a standard command for replacing all NA's with 0 (in the below code, "df" represents the data frame containing the data):
df[df == NA] <- 0
I tried the above code on my data, but I still noticed that within the factor variables, this code was not able to replace <NA> values with 0. <NA> 's are still present.
I tried several approaches:
1st Approach:
df[is.na(df)] <- 0
But this did not work:
Warning message:
In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
invalid factor level, NA generated
Second Approach: I tried for one of the factor variables
library(car)
df$some_factor_var <- recode(df$some_factor_var, "NA = 0")
But this replaced every value within "some_factor_var" as 0
Third Approach : I tried again for one of the factor variables
library(forcats)
fct_explicit_na(df$some_factor_var,0)
Error: Can't convert a double vector to a character vector
Can someone please show me how to fix this problem? Is there a way to replace ALL empty/missing/NA values for all variables at once?
Thanks
For factor variables you need to first include the new level (0) in the data if it is not already present.
See this example -
df <- data.frame(a = factor(c(1, NA, 2, 5)), b = 1:4,
c = c('a', 'b', 'c', NA), d = c(1, 2, NA, 1))
#Include 0 in the levels for "a" variable
levels(df$a) <- c(levels(df$a), 0)
#Replace NA to 0
df[is.na(df)] <- 0
df
# a b c d
#1 1 1 a 1
#2 0 2 b 2
#3 2 3 c 0
#4 5 4 0 1
str(df)
#'data.frame': 4 obs. of 4 variables:
# $ a: Factor w/ 4 levels "1","2","5","0": 1 4 2 3
# $ b: int 1 2 3 4
# $ c: chr "a" "b" "c" "0"
# $ d: num 1 2 0 1
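To apply the same idea to every column at once, the level can be added inside a loop over the columns. The following is only a sketch: replace_na_zero is a hypothetical helper name, and it assumes that blank strings in character columns should be treated like NA, and that no column is a date or another class where 0 would be meaningless.

```r
# Hypothetical helper: replace NA with "0" in factor and character columns
# and with numeric 0 in all other columns.
replace_na_zero <- function(data) {
  data[] <- lapply(data, function(col) {
    if (is.factor(col)) {
      # The new level must exist before it can be assigned.
      if (!"0" %in% levels(col)) levels(col) <- c(levels(col), "0")
      col[is.na(col)] <- "0"
    } else if (is.character(col)) {
      # %in% never returns NA, so it is safe as an index.
      col[is.na(col) | col %in% ""] <- "0"
    } else {
      col[is.na(col)] <- 0
    }
    col
  })
  data
}

demo <- data.frame(
  a = factor(c("x", NA)),
  b = c(1, NA),
  c = c("p", NA)
)
filled <- replace_na_zero(demo)
```

Assigning to data[] rather than data keeps the result a data frame instead of a plain list.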
With tidyverse, try:
library(tidyverse)
df <-
tibble(var_numeric = c(1,2,3,NA),
var_factor = as.factor(c(4,5,6,NA)))
df %>%
replace_na(list(var_numeric = 0)) %>%
mutate(var_factor = fct_explicit_na(var_factor, "0"))
# A tibble: 4 x 2
# var_numeric var_factor
# <dbl> <fct>
# 1 1 4
# 2 2 5
# 3 3 6
# 4 0 0

Why doesn't R change column names when I am trying to change all column names based on a single row?

I am trying to change the names of all columns in my dataframe to the values in a row of the same dataframe.
When I try this in R, it changes it to different numbers.
This is what my data looks like:
QS201EW... Group X X.1
1 Data
2 Area All categories: Ethnic group White
3 Date : 2011
This is the output of str:
'data.frame': 3 obs. of 3 variables:
$ QS201EW...group: Factor w/ 34760 levels "","Area",..: 32848 2 3
$ X : Factor w/ 1849 levels "","1001","1002",
$ X.1 : Factor w/ 2462 levels "","100"
I'm finding it difficult to insert the dput of my data because it is too large, but all the columns are factors. Is this why I can't change the column names?
This is the code I've tried before:
colnames(df) <- df[2,]
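Your columns are factors, which is why your code does not work: the row you assign holds the integer level codes rather than the labels.
Try converting each column of that row to character first:
colnames(df) <- unname(sapply(df[2, ], as.character))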
But you can solve your problem before it starts. Depending on how you read your data you can skip certain lines. For example if you read your data with read.table you can specify the skip argument:
mydata <- read.table("mydata.csv", sep = ",", skip = 2)
This will skip the first two rows of a csv file.
Furthermore, if you want to avoid working with factors (which will be true most of the time), you can use stringsAsFactors = FALSE.
Yes, it is because all the values in the dataframe are factors.
Consider this example,
df <- data.frame(col1 = LETTERS[1:3], col2 = LETTERS[4:6], col3 = LETTERS[7:9], stringsAsFactors = TRUE)
which is
df
# col1 col2 col3
#1 A D G
#2 B E H
#3 C F I
Now if you try to assign names
names(df) <- df[2, ]
df
# 2 2 2
#1 A D G
#2 B E H
#3 C F I
Unlist the row first, then use as.character to get the labels before assigning the names.
names(df) <- as.character(unlist(df[2, ]))
df
# B E H
#1 A D G
#2 B E H
#3 C F I
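Once a row has been promoted to column names, it usually makes sense to drop that row and its now-unused factor levels as well. A sketch of the whole sequence, using a small made-up data frame (raw, with columns V1 and V2) in place of the asker's data:

```r
# Hypothetical example data; stringsAsFactors = TRUE reproduces the
# all-factor columns described in the question.
raw <- data.frame(
  V1 = c("Data", "Area", "Date"),
  V2 = c("", "All categories", "2011"),
  stringsAsFactors = TRUE
)

# Convert each cell of row 2 to its label (not its integer code).
header <- unname(vapply(raw[2, ], as.character, character(1)))

# Drop the header row, keeping the data frame structure, then rename.
tidy <- raw[-2, , drop = FALSE]
names(tidy) <- header

# Remove factor levels that no longer occur in the data.
tidy <- droplevels(tidy)
```

vapply is used here instead of sapply only so that a non-character result fails loudly; either would work for this data.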

r - subsetting dataframe creates factors

I have a huge data frame (call it huge) that I would like to split in two by row number. However, I notice that the way I do it turns the resulting subsets into large factors instead of data frames.
list1 <- huge[c(1:8175),]
list2 <- huge[c(8176:nrow(huge)),]
class(list1)
[1] "factor"
Can someone explain to me why it is like that, and how do I prevent that?
It is likely that you subset a one-column data frame. Consider the following example.
# Create an example data frame (stringsAsFactors = TRUE so that b is a factor in R >= 4.0.0)
dt <- data.frame(a = 1:5, b = letters[1:5], stringsAsFactors = TRUE)
dt
# a b
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
str(dt)
# 'data.frame': 5 obs. of 2 variables:
# $ a: int 1 2 3 4 5
# $ b: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
# Subset the data frame
list1 <- dt[1:2, ]
list2 <- dt[3:nrow(dt), ]
class(list1)
# [1] "data.frame"
The code to subset dt works well. However, when a one-column data frame is created from dt and subset the same way, the output is automatically simplified to a vector.
# Create a one-column data frame
dt2 <- dt[, 2, drop = FALSE]
# Subset the data frame
list3 <- dt2[1:2, ]
list4 <- dt2[3:nrow(dt2), ]
class(list3)
# [1] "factor"
list3
# [1] a b
# Levels: a b c d e
The solution would be to add drop = FALSE when subsetting the data frame to keep the output as a data frame.
# Subset the data frame
list5 <- dt2[1:2, , drop = FALSE]
class(list5)
# [1] "data.frame"
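For a split by row number specifically, head() and tail() are another way to keep the data frame class, since their data frame methods do not simplify a single column to a vector. A small sketch with a made-up one-column data frame (one_col) standing in for huge:

```r
# Hypothetical one-column data frame standing in for `huge`.
one_col <- data.frame(b = letters[1:5], stringsAsFactors = TRUE)

# First 2 rows, and everything except the first 2 rows.
first_part <- head(one_col, 2)
second_part <- tail(one_col, -2)

class(first_part)
class(second_part)
```

With the sizes from the question, this would correspond to head(huge, 8175) and tail(huge, -8175).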

How do I remove a particular level occurring in all factors in a dataframe

After reading in data and cleaning it, I ended up with factor columns that have levels that should no longer be there.
For example, d below has one blank cell in Excel. When it’s read in, the factor columns have a level "", which shouldn’t be part of the data.
d <- read.csv(header = TRUE, stringsAsFactors = TRUE, text='
x,y,value
a,one,1
,,5
b,two,4
c,three,10
')
d
#> x y value
#> 1 a one 1
#> 2 5
#> 3 b two 4
#> 4 c three 10
str(d)
#> 'data.frame': 4 obs. of 3 variables:
#> $ x : Factor w/ 4 levels "","a","b","c": 2 1 3 4
#> $ y : Factor w/ 4 levels "","one","three",..: 2 1 4 3
#> $ value: int 1 5 4 10
How do I remove this level, "", from the factors (there are about 20 factor columns in the data frame) without deleting every row that has just one empty cell? Deleting those rows would reduce my sample size from 299000 to just 7 observations (I have tried this before).
One way would be to replace the '' with NA and use droplevels to remove the unused levels
d[1:2] <- lapply(d[1:2], function(x) droplevels(replace(x, x=="", NA)))
levels(d$x)
#[1] "a" "b" "c"
levels(d$y)
#[1] "one" "three" "two"
Another option, applied while reading the dataset (as we assume the OP wanted factor columns), would be
d <- read.csv("yourfile.csv", na.strings = "", stringsAsFactors = TRUE)
This should make sure that the '' will be read as NA.
Update
Suppose there are numeric columns in between and we need to apply replace/droplevels only to the factor columns:
d[] <- lapply(d, function(x) if(is.factor(x)) droplevels(replace(x, x== "", NA))
else x)
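The same approach can be wrapped in a small function that also covers character columns, which is what read.csv produces by default from R 4.0.0 onwards. This is a sketch only; blank_to_na is a hypothetical name, and the example data frame is made up to mirror the one in the question.

```r
# Hypothetical helper: turn "" into NA in factor and character columns,
# dropping the then-unused "" level from factors. Rows are never removed.
blank_to_na <- function(data) {
  data[] <- lapply(data, function(col) {
    if (is.factor(col)) {
      # %in% never yields NA, so existing NAs do not break the index.
      droplevels(replace(col, col %in% "", NA))
    } else if (is.character(col)) {
      replace(col, col %in% "", NA)
    } else {
      col
    }
  })
  data
}

d2 <- data.frame(
  x = factor(c("a", "", "b")),
  y = c("one", "", "two"),
  value = c(1L, 5L, 4L),
  stringsAsFactors = FALSE
)
cleaned <- blank_to_na(d2)
```

Because only individual cells are set to NA, the number of rows stays the same, which was the concern in the question.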

assign data.frame as a component in a data.frame in R

This does not work
> dfi=data.frame(v1=c(1,1),v2=c(2,2))
> dfi
v1 v2
1 1 2
2 1 2
> df$df=dfi
Error in `$<-.data.frame`(`*tmp*`, "df", value = list(v1 = c(1, 1), v2 = c(2, :
replacement has 2 rows, data has 0
df$df=I(dfi) has the same error. Please help.
Thank you.
Moved this from comments for formatting reasons:
What exactly are you trying to achieve? If you want the contents of dfi passed to df you can use this code:
df <- data.frame(matrix(vector(), 0, 2, dimnames=list(c(), c("V1", "V2"))), stringsAsFactors=F)
df=dfi
As @joran says, it is unclear why you would ever want to do this. Nevertheless, it is possible.
One of the requirements of a data frame is that all the columns have the same number of rows. This is why you are getting the error. Something like this will work:
dfi <- data.frame(v1=c(1,1),v2=c(2,2)) # 2 rows
df <- data.frame(x=1:2) # also 2 rows
df$df <- dfi # works now
Printing would lead you to believe that df has three columns...
df
# x df.v1 df.v2
# 1 1 1 2
# 2 2 1 2
but it does not!
str(df)
# 'data.frame': 2 obs. of 2 variables:
# $ x : int 1 2
# $ df:'data.frame': 2 obs. of 2 variables:
# ..$ v1: num 1 1
# ..$ v2: num 2 2
Since df$df is a data frame
class(df$df)
# [1] "data.frame"
you can use the standard data frame accessors...
df$df$v1
# [1] 1 1
df$df[1,]
# v1 v2
# 1 1 2
Incidentally, RStudio has trouble displaying this type of data structure; View(df) gives an inaccurate display of the structure.
Finally, you are probably better off creating a list of data frames, rather than a data frame containing data frames:
df <- data.frame(grp=rep(LETTERS[1:3],each=5),x=rnorm(15),y=rpois(15,5))
df.lst <- split(df,df$grp) # creates a list of data frames
df.lst$A
# grp x y
# 1 A -1.3606420 10
# 2 A -0.4511408 5
# 3 A -1.1951950 4
# 4 A -0.8017765 5
# 5 A -0.2816298 9
df.lst$A$x
# [1] -1.3606420 -0.4511408 -1.1951950 -0.8017765 -0.2816298
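A list of data frames is also easy to work on group by group and to put back together. A brief sketch along the lines of the example above; the names dat, parts and recombined are made up for illustration, and set.seed is only there to make the random data repeatable:

```r
set.seed(1)  # arbitrary seed, only for repeatable example data
dat <- data.frame(
  grp = rep(LETTERS[1:3], each = 5),
  x = rnorm(15),
  y = rpois(15, 5)
)

# One data frame per group, named after the group.
parts <- split(dat, dat$grp)

# Apply a function to each piece.
rows_per_group <- vapply(parts, nrow, integer(1))
mean_y_per_group <- vapply(parts, function(g) mean(g$y), numeric(1))

# Stack the pieces back into a single data frame.
recombined <- do.call(rbind, parts)
```

Each element of parts is an ordinary data frame, so anything that works on dat works on parts$A as well.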
