rbind datasets with different classes

I'm trying to merge/bind two datasets (mydata_103 and mydata_17). They have exactly the same variable names, but I get four of these warning messages:
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = c(1, 1, 2, 1, 1, 1, 1, 1, 5, :
invalid factor level, NA generated
This seems to be caused by some variables having different classes. For example, I have a variable "gender" (1 = male, 2 = female). In the merged dataset I see value labels for the rows from mydata_17, but NAs for the rows from the other dataset. When I checked the classes, R reported that they differ (I don't know why this is the case):
> lapply(mydata_103[7], class)
$prgesl
[1] "numeric"
> lapply(mydata_17[7], class)
$prgesl
[1] "factor"
I changed the class of mydata_103 to factor
mydata_103$prgesl <- as.factor(mydata_103$prgesl)
Now, I do get the numeric values, but it still doesn't translate to the value labels:
prgesl
15 Man
16 Man
17 Vrouw
18 2
19 2
20 1
21 2
Does anyone know how to fix this? And is there a way to make the classes of my two datasets the same, or to check which ones differ? (I have 404 variables, so checking this by visual inspection seems inefficient and prone to errors.)
Best, Hanneke
Edit: The code to merge my datasets right now is simply:
data1 <- rbind.data.frame(mydata_17, mydata_103)

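Following mtoto's suggestion, first convert everything to numeric, then use the levels() function to turn the numbers back into labels. Save the levels before the conversion, because as.numeric() drops them and the numeric column in mydata_103 has no labels of its own. This assumes that mydata_103$prgesl is still numeric and that the level order in mydata_17 matches the numeric codes (1 = Man, 2 = Vrouw).
labels <- levels(mydata_17$prgesl)  # save the labels before they are lost
mydata_17$prgesl <- as.numeric(mydata_17$prgesl)  # factor -> integer codes
mydata <- rbind(mydata_17, mydata_103)
mydata$prgesl <- factor(mydata$prgesl, levels = seq_along(labels), labels = labels)
levels() returns the factor's labels in the order of the underlying integer codes.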
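The question also asks how to find which of the 404 columns differ in class without checking by eye. A minimal sketch, assuming both data frames share the same column names; the small data frames a and b are hypothetical stand-ins for mydata_17 and mydata_103:

```r
# Hypothetical stand-ins for mydata_17 and mydata_103
a <- data.frame(prgesl = factor(c("Man", "Vrouw")), age = c(30, 41))
b <- data.frame(prgesl = c(1, 2), age = c(25, 37))

# First class of every column (a column can carry more than one class)
cls_a <- vapply(a, function(col) class(col)[1], character(1))
cls_b <- vapply(b, function(col) class(col)[1], character(1))

# Names of the columns whose class differs between the two data frames
differs <- names(cls_a)[cls_a != cls_b[names(cls_a)]]
differs
# [1] "prgesl"
```

Indexing cls_b by names(cls_a) means the comparison is done by column name, so it should still work if the columns are in a different order.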

Convert factor columns to character then rbind, example:
# reproducible data
set.seed(1)
df1 <- data.frame(x = 1:3, y = runif(3))
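# stringsAsFactors = TRUE is needed since R 4.0.0, where strings stay character by default
df2 <- data.frame(x = letters[2:4], y = runif(3), stringsAsFactors = TRUE)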
# below rbind will introduce NAs
rbind.data.frame(df2, df1)
# x y
# 1 b 0.9082078
# 2 c 0.2016819
# 3 d 0.8983897
# 4 <NA> 0.2655087
# 5 <NA> 0.3721239
# 6 <NA> 0.5728534
# Warning message:
# In `[<-.factor`(`*tmp*`, ri, value = 1:3) :
# invalid factor level, NA generated
# Convert factors to character
i <- sapply(df1, is.factor)
df1[i] <- lapply(df1[i], as.character)
i <- sapply(df2, is.factor)
df2[i] <- lapply(df2[i], as.character)
# now bind
res <- rbind.data.frame(df2, df1)
str(res)
# 'data.frame': 6 obs. of 2 variables:
# $ x: chr "b" "c" "d" "1" ...
# $ y: num 0.908 0.202 0.898 0.266 0.372 ...
res
# x y
# 1 b 0.9082078
# 2 c 0.2016819
# 3 d 0.8983897
# 4 1 0.2655087
# 5 2 0.3721239
# 6 3 0.5728534
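With many data frames, the same conversion can be wrapped in a function and applied to each one before binding. This is only a sketch: factors_to_character is a hypothetical helper name, not part of base R, and the data frames are made-up examples.

```r
# Hypothetical helper: turn every factor column of a data frame into character
factors_to_character <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  if (any(is_fac)) {
    df[is_fac] <- lapply(df[is_fac], as.character)
  }
  df
}

df1 <- data.frame(x = 1:3, y = c(0.1, 0.2, 0.3))
df2 <- data.frame(x = factor(c("b", "c", "d")), y = c(0.4, 0.5, 0.6))

# Convert each data frame, then bind them all in one call
res <- do.call(rbind, lapply(list(df2, df1), factors_to_character))
res$x
# [1] "b" "c" "d" "1" "2" "3"
```

The bound column is character; whether to convert it back to a factor afterwards depends on what the column represents.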

Related

How to combine columns that have the same name and remove NA's?

Relatively new to R, but I have an issue combining columns that have the same name. I have a very large dataframe (~70 cols and 30k rows). Some of the columns have the same name. I wish to merge these columns and remove the NA's.
An example of what I would like is below (although on a much larger scale).
df <- data.frame(x = c(2,1,3,5,NA,12,"blah"),
x = c(NA,NA,NA,NA,9,NA,NA),
y = c(NA,5,12,"hop",NA,2,NA),
y = c(2,NA,NA,NA,8,NA,4),
z = c(9,5,NA,3,2,6,NA))
desired.result <- data.frame(x = c(2,1,3,5,9,12,"blah"),
y = c(2,5,12,"hop",8,2,4),
z = c(9,5,NA,3,2,6,NA))
I have tried a number of things including suggestions such as:
R: merging columns and the values if they have the same column name
Combine column to remove NA's
However, these solutions either require a numeric dataset (I need to keep the character information) or they require you to manually input the columns that are the same (which is too time consuming for the size of my dataset).
I have managed to solve the issue manually by creating new columns that are combinations:
df$x <- apply(df[,1:2], 1, function(x) x[!is.na(x)][1])
However I don't know how to get R to auto-identify where the columns have the same names and then apply something like the above such that I don't need to specify the index each time.
Thanks

here is a base R approach
# split into a named list, based on colnames before the .-character
L <- split.default(df, f = gsub("(.*)\\..*", "\\1", names(df)))
#get the first non-na value for each row in each chunk
L2 <- lapply(L, function(x) apply(x, 1, function(y) na.omit(y)[1]))
# result in a data.frame
as.data.frame(L2)
# x y z
# 1 2 2 9
# 2 1 5 5
# 3 3 12 NA
# 4 5 hop 3
# 5 9 8 2
# 6 12 2 6
# 7 blah 4 NA
# since you are using mixed formats, the columns are not of the same class!!
str(as.data.frame(L2))
# 'data.frame': 7 obs. of 3 variables:
# $ x: chr "2" "1" "3" "5" ...
# $ y: chr " 2" "5" "12" "hop" ...
# $ z: num 9 5 NA 3 2 6 NA
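The split on the dot works because data.frame() makes duplicated names unique by appending .1, .2 and so on. The sketch below shows that renaming and a variant of the same idea that strips only a trailing numeric suffix and keeps numeric columns numeric. It assumes no genuine column name ends in a dot followed by digits; coalesce_cols is a hypothetical helper name.

```r
# data.frame() renames the second "x" to "x.1"
df <- data.frame(x = c(2, NA), x = c(NA, 9), z = c(1, 2))
names(df)
# [1] "x"   "x.1" "z"

# Strip only a trailing ".<digits>" to recover the shared base name
base <- sub("\\.[0-9]+$", "", names(df))

# Within each group, fill NAs in earlier columns from later ones
coalesce_cols <- function(chunk) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), chunk)
}
out <- as.data.frame(lapply(split.default(df, base), coalesce_cols))
str(out)
```

Note that ifelse() drops attributes, so factor columns would need to be converted to character first.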

Changing the category of a variable in R: Warning message

I'm trying to change the participants' Age variable (in my dataset) that's showing as character (rather than numeric) using the following code..
bwdata6 <- bwdata6 %>% mutate(Age <- as.numeric(Age))
I get the following warning message when I run the code...
Warning messages:
1: Problem with mutate() input ..1.
i NAs introduced by coercion
i Input ..1 is Age <- as.numeric(Age).
2: In mask$eval_all_mutate(dots[[i]]) :
NAs introduced by coercion
Any ideas how to resolve this?

To avoid the warning, you can use gsub to replace every value that contains a non-digit character with NA before converting:
d$x.num <- as.numeric(gsub("\\D", NA, d$x))
To identify which values become NA:
grep("\\D", d$x)
# [1] 2 4 6
d
# x x.num
# 1 1 1
# 2 A NA
# 3 2 2
# 4 B NA
# 5 3 3
# 6 C NA
Data:
d <- data.frame(x=c(1, "A", 2, "B", 3, "C"))
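The "\\D" pattern treats any non-digit as invalid, which would also turn decimals such as "3.5" into NA. A possible alternative, sketched here with made-up data, is to let as.numeric() decide what is convertible and then list the entries it rejected:

```r
d <- data.frame(x = c("1", "A", "2", "B", "3.5"), stringsAsFactors = FALSE)

# Convert once, silencing the expected coercion warning
num <- suppressWarnings(as.numeric(d$x))

# Entries that were not missing before but are NA after conversion
bad <- is.na(num) & !is.na(d$x)
d$x[bad]
# [1] "A" "B"

d$x.num <- num
```

Separately, inside mutate() the new column should be given with =, as in mutate(Age = as.numeric(Age)); with <- the result lands in a new column named after the whole expression and Age itself stays character.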

R: Adding row to a dataframe with multiple classes

I have a seemingly simple task of adding a row to a data frame in R but I just can't do it!
I have a data frame with 50 rows and 100 columns. The data frame, which I would like to keep in the same format, has the first column as a factor and all other columns as characters -- this is what lapply produced. I would simply like to append a 51st row... but I get warnings every time.
My added data is of the form Cat <- c("Cat", 1,NA,3,NA,5). (I have no clue where " or ' need to go - quite new to R!)
rbind shows "invalid factor levels" every time.
e.g.
df <- rbind(df,Cat)
I believe this is because of the factor/character divide.

The factor levels in your data.frame should also contain the values in your "Cat" object for the relevant factor column.
Here's a simple example:
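## stringsAsFactors = TRUE makes v1 a factor (no longer the default since R 4.0.0)
df <- data.frame(v1 = c("a", "b"), v2 = 1:2, stringsAsFactors = TRUE)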
toAdd <- list("c", 3)
## Warnings...
rbind(df, toAdd)
# v1 v2
# 1 a 1
# 2 b 2
# 3 <NA> 3
# Warning message:
# In `[<-.factor`(`*tmp*`, ri, value = "c") :
# invalid factor level, NA generated
## Possible fix
df$v1 <- factor(df$v1, unique(c(levels(df$v1), toAdd[[1]])))
rbind(df, toAdd)
# v1 v2
# 1 a 1
# 2 b 2
# 3 c 3
Alternatively, consider rbindlist from "data.table", which should save you from having to convert the factor levels:
> library(data.table)
> df <- data.frame(v1 = c("a", "b"), v2 = 1:2, stringsAsFactors = TRUE)
> rbindlist(list(df, toAdd))
v1 v2
1: a 1
2: b 2
3: c 3
> str(.Last.value)
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ v1: Factor w/ 3 levels "a","b","c": 1 2 3
$ v2: num 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
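Another option is to sidestep the factor levels: bind as character and rebuild the factor afterwards. The sketch below uses a made-up two-column data frame shaped like the one in the question (first column factor, the rest character). The new row is built as a one-row data frame, because c("Cat", 1, NA, 3) would coerce every value to character in a single vector.

```r
df <- data.frame(v1 = factor(c("a", "b")), v2 = c("1", "2"),
                 stringsAsFactors = FALSE)

# Build the new row as a one-row data frame so each value keeps its own column
new_row <- data.frame(v1 = "Cat", v2 = "3", stringsAsFactors = FALSE)

# Bind as character, then restore the factor with levels in order of appearance
df$v1 <- as.character(df$v1)
out <- rbind(df, new_row)
out$v1 <- factor(out$v1, levels = unique(out$v1))
levels(out$v1)
# [1] "a"   "b"   "Cat"
```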

Difference between `names(df[1]) <- ` and `names(df)[1] <- `

Consider the following:
df <- data.frame(a = 1, b = 2, c = 3)
names(df[1]) <- "d" ## First method
## a b c
##1 1 2 3
names(df)[1] <- "d" ## Second method
## d b c
##1 1 2 3
Neither method returned an error, but the first didn't change the column name, while the second did.
I thought it had something to do with the fact that I'm operating only on a subset of df, but why, then, does the following work fine?
df[1] <- 2
## a b c
##1 2 2 3

What I think is happening is that replacement into a data frame ignores the attributes of the data frame that is drawn from. I am not 100% sure of this, but the following experiments appear to back it up:
df <- data.frame(a = 1:3, b = 5:7)
# a b
# 1 1 5
# 2 2 6
# 3 3 7
df2 <- data.frame(c = 10:12)
# c
# 1 10
# 2 11
# 3 12
df[1] <- df2[1] # in this case `df[1] <- df2` is equivalent
Which produces:
# a b
# 1 10 5
# 2 11 6
# 3 12 7
Notice how the values changed for df, but not the names. Basically the replacement operator `[<-` only replaces the values. This is why the name was not updated. I believe this explains all the issues.
In the scenario:
names(df[2]) <- "x"
You can think of the assignment as follows (this is a simplification, see end of post for more detail):
tmp <- df[2]
# b
# 1 5
# 2 6
# 3 7
names(tmp) <- "x"
# x
# 1 5
# 2 6
# 3 7
df[2] <- tmp # `tmp` has "x" for names, but it is ignored!
# a b
# 1 10 5
# 2 11 6
# 3 12 7
The last step of which is an assignment with `[<-`, which doesn't respect the names attribute of the RHS.
But in the scenario:
names(df)[2] <- "x"
you can think of the assignment as (again, a simplification):
tmp <- names(df)
# [1] "a" "b"
tmp[2] <- "x"
# [1] "a" "x"
names(df) <- tmp
# a x
# 1 10 5
# 2 11 6
# 3 12 7
Notice how we directly assign to names, instead of assigning to df which ignores attributes.
df[2] <- 2
works because we are assigning directly to the values, not the attributes, so there are no problems here.
EDIT: based on some commentary from @AriB.Friedman, here is a more elaborate version of what I think is going on (note I'm omitting the S3 dispatch to `[.data.frame`, etc., for clarity):
Version 1 names(df[2]) <- "x" translates to:
df <- `[<-`(
df, 2,
value=`names<-`( # `names<-` here returns a re-named one column data frame
`[`(df, 2),
value="x"
) )
Version 2 names(df)[2] <- "x" translates to:
df <- `names<-`(
df,
`[<-`(
names(df), 2, "x"
) )
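The two translations can be evaluated directly as ordinary function calls to see the difference. This is only a sketch: v1 and v2 are hypothetical names for the results, and df is left unassigned so both forms start from the same input.

```r
df <- data.frame(a = 1:3, b = 5:7)

# Version 1: rename a one-column copy, then put it back with `[<-`
v1 <- `[<-`(df, 2, value = `names<-`(`[`(df, 2), value = "x"))

# Version 2: modify the names vector, then set it with `names<-`
v2 <- `names<-`(df, `[<-`(names(df), 2, "x"))

names(v1)
# [1] "a" "b"
names(v2)
# [1] "a" "x"
```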
Also, it turns out this is "documented" in R Inferno Section 8.2.34 (thanks @Frank):
right <- wrong <- c(a=1, b=2)
names(wrong[1]) <- 'changed'
wrong
# a b
# 1 2
names(right)[1] <- 'changed'
right
# changed b
# 1 2

Insert nonexistent columns in matrix or dataframe in given order

I am on the lookout for a function in R that would check for the presence of particular columns, e.g.
cols=c("a","b","c","d")
in a matrix or data frame and inserts a column of NAs for any that do not exist, at the position given in the vector cols. For example, given a matrix or data frame with named columns "a" and "d", it would insert columns "b" and "c", filled with NAs, before column "d"; any columns not listed in cols (e.g. column "e") would be deleted. What would be the easiest and fastest way to achieve this (I am dealing with a fairly large dataset of about 1 million rows)? Or is there already a function that does this?

I would separate the creation step and the ordering step. Here is an example:
cols <- letters[1:4]
## initialize test data set
my.df <- data.frame(a = rnorm(100), d = rnorm(100), e = rnorm(100))
## exclude columns not in cols (drop = FALSE keeps a data frame even if only one column remains)
my.df <- my.df[ , colnames(my.df) %in% cols, drop = FALSE]
## add missing columns filled with NA
my.df[, cols[!(cols %in% colnames(my.df))]] <- NA
## reorder
my.df <- my.df[, cols]
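The three steps above can be folded into one function. This is a sketch for data frames only; align_columns is a hypothetical name, and the input is a small made-up example.

```r
# Hypothetical helper: keep only `cols`, in that order, adding NA columns as needed
align_columns <- function(df, cols) {
  for (nm in setdiff(cols, names(df))) {
    df[[nm]] <- NA              # recycled to the number of rows
  }
  df[, cols, drop = FALSE]      # drops unlisted columns and reorders
}

my.df <- data.frame(a = 1:2, d = 3:4, e = 5:6)
out <- align_columns(my.df, c("a", "b", "c", "d"))
names(out)
# [1] "a" "b" "c" "d"
```

The added columns are logical NA; if a specific type is needed, NA_real_ or NA_character_ could be assigned instead.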

Another approach I just discovered uses match, but it only works for matrices:
# original matrix
matrix=cbind(a = 1:2, d = 3:4)
# required columns
coln=c("a","b","c","d")
colnmatrix=colnames(matrix)
matrix=matrix[,match(coln,colnmatrix)]
colnames(matrix)=coln
matrix
a b c d
[1,] 1 NA NA 3
[2,] 2 NA NA 4

Another possibility if your data is in a matrix
# original matrix
m1 <- cbind(a = 1:2, d = 3:4)
m1
# a d
# [1,] 1 3
# [2,] 2 4
# matrix with all columns, filled with NA
all.cols <- letters[1:4]
m2 <- matrix(nrow = nrow(m1), ncol = length(all.cols), dimnames = list(NULL, all.cols))
m2
# a b c d
# [1,] NA NA NA NA
# [2,] NA NA NA NA
# replace columns in 'NA matrix' with values from original matrix
m2[ , colnames(m1)] <- m1
m2
# a b c d
# [1,] 1 NA NA 3
# [2,] 2 NA NA 4
