Invalid factor level with rbind to data frame

I'm new to R and I don't know exactly how to add a row to a data frame.
I have two vectors:
b=c("one","lala",1)
d=c("two","lele",2)
I want to add these to a data.frame called a.
a<-rbind(a,b)
Now I have one correct row:
A B C
1 one lala 1
Next I add
a<-rbind(a,d)
and the result is:
A B C
1 one lala 1
2 NA NA NA
and the console prints warning messages: invalid factor level, NA generated.
What am I doing wrong, or what is a better, simpler way to add a new row?
I don't want to create the full data.frame at the start; I want to add rows one at a time.

When you do
c("one","lala",1)
this creates a vector of strings. The 1 is converted to character type,
so that all elements in the vector are the same type.
Then rbind(a,b) tries to combine a, which is a data frame, with b,
which is a character vector, and this is not what you want.
The way to do this is to use rbind with data frame objects.
a <- NULL
b <- data.frame(A="one", B="lala", C=1)
d <- data.frame(A="two", B="lele", C=2)
a <- rbind(a, b)
a <- rbind(a, d)
Now we can see that the columns in data frame a are the proper type.
> lapply(a, class)
$A
[1] "factor"
$B
[1] "factor"
$C
[1] "numeric"
>
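(In R 4.0.0 and later, where stringsAsFactors defaults to FALSE, columns A and B are reported as "character" instead of "factor".)
Notice that you must name the columns when you create the individual data
frames, otherwise rbind will fail. If you do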
b <- data.frame("one", "lala", 1)
d <- data.frame("two", "lele", 2)
then
> rbind(b, d)
Error in match.names(clabs, names(xi)) :
names do not match previous names
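
To make the difference concrete, here is a minimal sketch (not part of the original answer) that contrasts appending a bare character vector with appending a one-row data frame. It sets stringsAsFactors = TRUE explicitly so that the factor behaviour from the question is reproduced on any R version; the names bad and good are only illustrative.

```r
# A data frame whose text columns are factors, as older R versions create by default
a <- data.frame(A = "one", B = "lala", stringsAsFactors = TRUE)

# Appending a bare character vector: "two" and "lele" are not existing factor
# levels, so they become NA (R warns "invalid factor level, NA generated")
bad <- suppressWarnings(rbind(a, c("two", "lele")))
print(bad)

# Appending a one-row data frame with the same column names: the factor
# levels of both data frames are merged, so no value is lost
good <- rbind(a, data.frame(A = "two", B = "lele", stringsAsFactors = TRUE))
print(good)
```

The second form is the one the answer recommends: every appended row is itself a data frame with matching column names.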

You need to add stringsAsFactors = FALSE to BOTH the data.frame() call and the rbind() call.
In some versions of R, but not others, rbind() will automatically convert strings to factors. For example, in R version 3.6.2, rbind will do the factor conversion automatically even if the global setting is options(stringsAsFactors = FALSE). This is not the case in R version 4.0.4, and so stringsAsFactors = FALSE does not need to be added to the rbind() statement in version 4.0.4.
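
As a hedged sketch of that advice, the following keeps the text columns as character throughout, so a new row can be appended without any factor levels getting in the way. Passing the new row as a list rather than with c() is my own assumption for the example: it keeps the number 2 numeric instead of coercing it to a string.

```r
# Text columns stay character because of stringsAsFactors = FALSE
a <- data.frame(A = "one", B = "lala", C = 1, stringsAsFactors = FALSE)

# list() keeps each element's own type; c("two", "lele", 2) would turn 2 into "2"
a <- rbind(a, list("two", "lele", 2), stringsAsFactors = FALSE)

print(a)
print(sapply(a, class))
```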

One more point, from my own testing.
Suppose we have a data frame df with 6 columns, and we use rbind to append
a second row after the first:
df <- data.frame()
df <- rbind(df, row1)
df <- rbind(df, row2)
The result looks like this:
col1 col2 col3 col4 col5 col6
1 1 1 Pilot Greg Andy Dwyer 95.00
2 1 1 Pilot Greg NA 92.00
In my tests, stringsAsFactors = FALSE has to be set not only when initializing the df data frame but also in each rbind call. After changing the code to:
df <- data.frame(stringsAsFactors = FALSE)
df <- rbind(df, row1, stringsAsFactors = FALSE)
df <- rbind(df, row2, stringsAsFactors = FALSE)
it works fine:
col1 col2 col3 col4 col5 col6
1 1 1 Pilot Greg Andy Dwyer 95.00
2 1 1 Pilot Greg Audi Sier 92.00

Related

R: efficient way to apply a function according to the columns of a dataframe

I feel extremely stupid now but I can't come up with more than a for loop...
I have a data frame with numeric and factor columns. I simply want the numeric columns to be scaled and the factor columns to be kept as they are. For example
> set.seed(160)
> df1 <- data.frame(as.data.frame(matrix(rnorm(8), ncol=2)),
V3=factor(c("A", "A", "B", "B")))
> df1
V1 V2 V3
1 0.6185496 -0.6410203 A
2 -0.8722777 2.6520986 A
3 0.8529240 -1.4156009 B
4 0.3678875 -1.1615607 B
I'd like to get
> df1
V1 V2 V3
1 0.4901808 -0.2642698 A
2 -1.4493527 1.4780179 A
3 0.7950968 -0.6740765 B
4 0.1640750 -0.5396717 B
with a more efficient command than
for(i in 1:ncol(df1)) {
if(is.factor(df1[,i])) {df1[,i] <- df1[,i]}
else{df1[,i] <- scale(df1[,i])}
}
I tried various combinations of lapply(), sapply(), if(), ifelse() but nothing seemed to work (apply doesn't work because the df gets transformed into a matrix and I lose the factor/numeric structure). Any suggestions?
NB: I am not trying to apply a function based on the values in the columns but based on the type of column.

You can try the following, which is similar to a suggestion in the comments:
df1[sapply(df1, is.numeric)] <- scale(df1[sapply(df1, is.numeric)])
#> df1
# V1 V2 V3
#1 0.4901808 -0.2642698 A
#2 -1.4493527 1.4780179 A
#3 0.7950968 -0.6740765 B
#4 0.1640750 -0.5396717 B
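
A small self-contained variant of the same idea, as a sketch with made-up data rather than the question's random numbers. It selects the numeric columns once, scales each with lapply(), and checks that the factor column is untouched. The as.numeric() call is my addition; it drops the matrix attributes that scale() returns.

```r
df1 <- data.frame(
  V1 = c(1, 2, 3, 4),
  V2 = c(10, 20, 30, 60),
  V3 = factor(c("A", "A", "B", "B"))
)

# Logical vector marking which columns are numeric
num <- vapply(df1, is.numeric, logical(1))

# Scale only those columns; everything else is left as it is
df1[num] <- lapply(df1[num], function(x) as.numeric(scale(x)))

print(df1)
print(colMeans(df1[num]))   # approximately 0 for each scaled column
```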

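This should work. Note that it uses lapply rather than sapply: sapply would simplify the result to a matrix and turn the factor column into its integer codes.
df1[] <- lapply(df1, function(i) if(is.numeric(i)) as.numeric(scale(i)) else i)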

Column of an existing data frame as a new data frame in R, but result is NULL

I have a data.frame (name: sample) in R which I imported from a CSV file containing 15 fields and 100516 rows. I want to create a new data frame "sample2" from the 3rd column of "sample".
sample2 = sample[,3]
When I check nrow(sample2), the result is NULL.
But when I use head(sample2), I can see the content.

Your problem is that you are using nrow on a vector.
If you want to keep the data.frame structure when selecting a single column in this way, you need to add drop = FALSE when subsetting.
Consider the following example:
## Sample data
mydf <- data.frame(v1 = 1:2, v2 = 3:4)
nrow(mydf)
# [1] 2
## What you did
mydf[, 1]
# [1] 1 2
nrow(.Last.value)
# NULL
## What you wanted to do
mydf[, 1, drop = FALSE]
# v1
# 1 1
# 2 2
nrow(.Last.value)
# [1] 2
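
For completeness, a short sketch of two related base R facts that the answer does not mention: single-bracket indexing without a comma also keeps the data frame structure, and NROW() (upper case) treats a plain vector as a one-column object, so it returns a length where nrow() returns NULL. The variable names are illustrative.

```r
mydf <- data.frame(v1 = 1:2, v2 = 3:4)

as_vector <- mydf[, 1]                 # dimensions dropped: plain integer vector
kept_drop <- mydf[, 1, drop = FALSE]   # one-column data frame
kept_list <- mydf[1]                   # list-style indexing also keeps the data frame

print(is.null(nrow(as_vector)))   # nrow() has no dim attribute to read
print(NROW(as_vector))            # NROW() falls back to the vector's length
print(nrow(kept_drop))
print(nrow(kept_list))
```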

How can this code be compacted?

Can the following code be made more "R like"?
Given data.frame inDF:
V1 V2 V3 V4
1 a ha 1;2;3 A
2 c hb 4 B
3 d hc 5;6 C
4 f hd 7 D
Inside inDF I want to:
find all rows whose "V3" column holds multiple values separated by ";",
then replicate each such row as many times as there are individual values in its "V3" column,
and then give each replicated row only one of the initial values in its "V3" column.
In short, the output data.frame (= outDF) will look like:
V1 V2 V3 V4
1 a ha 1 A
1 a ha 2 A
1 a ha 3 A
2 c hb 4 B
3 d hc 5 C
3 d hc 6 C
4 f hd 7 D
So, if from inDF I want to get to outDF, I would write the following code:
library(stringr) # provides str_count() and str_split() used below
#load inDF from csv file
inDF <- read.csv(file='example.csv', header=FALSE, sep=",", fill=TRUE)
#search in inDF, on the V3 column, all the cells with multiple values
rowlist <- grep(";", inDF[,3])
# create empty data.frame and add headers from "headDF"
xDF <- data.frame(matrix(0, nrow=0, ncol=4))
colnames(xDF)=colnames(inDF)
#take every row from the inDF data.frame which has multiple values in col3 and break it in several rows with only one value
for(i in rowlist[])
{
#count the number of individual values in one cell
value_nr <- str_count(inDF[i,3], ";"); value_nr <- value_nr+1
# replicate each row a number of times equal with its value number, and transform it to character
extracted_inDF <- inDF[rep(i, times=value_nr[]),]
extracted_inDF <- data.frame(lapply(extracted_inDF, as.character), stringsAsFactors=FALSE)
# split the values in V3 cell in individual values, place them in a list
value_ls <- str_split(inDF[i, 3], ";")
#initialize f, to use it later to increment both row number and element in the list of values
f = 1
# replace the multiple values with individual values
for(j in extracted_inDF[,3])
{
extracted_inDF[f,3] <- value_ls[[1]][as.integer(f)]
f <- f+1
}
#put all the "demultiplied" rows in xDF
xDF <- merge(extracted_inDF[], xDF[], all=TRUE)
}
# delete the rows with multiple values from the inDF
inDF <- inDF[-rowlist[],]
#create outDF
outDF <- merge(inDF, xDF, all=TRUE)
Could you please

I'm not sure that I'm one to speak about whether you are using R in the "right" or "wrong" way... I mostly just use it to answer questions on Stack Overflow. :-)
However, there are many ways in which your code could be improved. For starters, YES, you should try to become familiar with the predefined functions. They will often be much more efficient, and will make your code much more transparent to other users of the same language. Despite your concise description of what you wanted to achieve, and my knowing an answer virtually right away, I found your code daunting to look through.
I would break up your problem into two main pieces: (1) splitting up the data and (2) recombining it with your original dataset.
For part 1: You obviously know some of the functions you need--or at least the main one you need: strsplit. If you use strsplit, you'll see that it returns a list, but you need a simple vector. How do you get there? Look for unlist. The first part of your problem is now solved.
For part 2: You first need to determine how many times you need to replicate each row of your original dataset. For this, you drill through your list (for example, with l/s/v-apply) and count each item's length. I picked sapply since I knew it would create a vector that I could use with rep.
Then, if you've played with data.frames enough, particularly with extracting data, you would have come to realize that mydf[c(1, 1, 1, 2), ] will result in a data.frame where the first row is repeated two additional times. Knowing this, we can use the length calculation we just made to "expand" our original data.frame.
Finally, with that expanded data.frame, we just need to replace the relevant column with the unlisted values.
Here is the above in action. I've named your dataset "mydf":
V3 <- strsplit(mydf$V3, ";", fixed=TRUE)
sapply(V3, length) ## How many times to repeat each row?
# [1] 3 1 2 1
## ^^ Use that along with `[` to "expand" your data.frame
mydf2 <- mydf[rep(seq_along(V3), sapply(V3, length)), ]
mydf2$V3 <- unlist(V3)
mydf2
# V1 V2 V3 V4
# 1 a ha 1 A
# 1.1 a ha 2 A
# 1.2 a ha 3 A
# 2 c hb 4 B
# 3 d hc 5 C
# 3.1 d hc 6 C
# 4 f hd 7 D
To share some more options...
The "data.table" package can actually be pretty useful for something like this.
library(data.table)
DT <- data.table(mydf)
DT2 <- DT[, list(new = unlist(strsplit(as.character(V3), ";", fixed = TRUE))), by = V1]
merge(DT, DT2, by = "V1")
Alternatively, concat.split.multiple from my "splitstackshape" package pretty much does it in one step, but if you want your exact output, you'll need to drop the NA values and reorder the rows.
library(splitstackshape)
df2 <- concat.split.multiple(mydf, split.cols="V3", seps=";", direction="long")
df2 <- df2[complete.cases(df2), ] ## Optional, perhaps
df2[order(df2$V1), ] ## Optional, perhaps
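
The base R steps from this answer can also be wrapped into one small helper. This is only a sketch: the function name expand_rows and its arguments are hypothetical, the data is typed in by hand to match the question, and lengths() is used in place of sapply(V3, length).

```r
mydf <- data.frame(
  V1 = c("a", "c", "d", "f"),
  V2 = c("ha", "hb", "hc", "hd"),
  V3 = c("1;2;3", "4", "5;6", "7"),
  V4 = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per separated value in column `col`
expand_rows <- function(df, col, sep = ";") {
  parts <- strsplit(as.character(df[[col]]), sep, fixed = TRUE)
  # Repeat each row index as many times as its cell has values
  out <- df[rep(seq_len(nrow(df)), lengths(parts)), , drop = FALSE]
  out[[col]] <- unlist(parts)
  rownames(out) <- NULL   # replace 1, 1.1, 1.2, ... with plain row numbers
  out
}

outDF <- expand_rows(mydf, "V3")
print(outDF)
```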

In this case, you can use the split-apply-combine paradigm for reshaping the data.
You want to split inDF by its rows, since you want to operate on each row separately. I've used the split function here to split it up by row:
spl = split(inDF, 1:nrow(inDF))
spl is a list that contains a 1-row data frame for each row in inDF.
Next, you'll want to apply a function to transform the split up data into the final format you need. Here, I'll use the lapply function to transform the 1-row data frames, using strsplit to break up the variable V3 into its appropriate parts:
transformed = lapply(spl, function(x) {
data.frame(V1=x$V1, V2=x$V2, V3=strsplit(x$V3, ";")[[1]], V4=x$V4)
})
transformed is now a list where the first element has a 3-row data frame, the third element has a 2-row data frame, and the second and fourth have 1-row data frames.
The last step is to combine this list together into outDF, using do.call with the rbind function. That has the same effect as calling rbind with all of the elements of the transformed list.
outDF = do.call(rbind, transformed)
This yields the desired final data frame:
outDF
# V1 V2 V3 V4
# 1.1 a ha 1 A
# 1.2 a ha 2 A
# 1.3 a ha 3 A
# 2 c hb 4 B
# 3.1 d hc 5 C
# 3.2 d hc 6 C
# 4 f hd 7 D

How do you delete the header in a dataframe?

I want to delete the header from a dataframe that I have. I read in the data from a csv file then I transposed it, but it created a new header that is the name of the file and the row that the data is from in the file.
Here's an example for a dataframe df:
a.csv.1 a.csv.2 a.csv.3 ...
x 5 6 1 ...
y 2 3 2 ...
I want to delete the a.csv.n row, but when I try df <- df[-1,] it deletes row x and not the top.

If you really, really, really don't like column names, you may convert your data frame to a matrix (keeping possible coercion of variables of different class in mind), and then remove the dimnames.
dd <- data.frame(x1 = 1:5, x2 = 11:15)
mm1 <- as.matrix(dd)
mm2 <- matrix(mm1, ncol = ncol(dd), dimnames = NULL)
I add my previous comment here as well:
?data.frame: "The column names should be non-empty, and attempts to use empty names will have unsupported results.".
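
If the matrix route is acceptable, base R's unname() may be a shorter way to get the same result. This is a sketch, not the answerer's code; as above, as.matrix() will coerce all columns to a common type.

```r
dd <- data.frame(x1 = 1:5, x2 = 11:15)

# as.matrix() keeps the column names as dimnames; unname() removes them
mm <- unname(as.matrix(dd))

print(mm)
print(is.null(dimnames(mm)))
```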

Set names to NULL
names(df) <- NULL
You can also use the header option in read.csv

You can use names(df) to change the header (the column names). If newnames is a list of names such as newnames<-list("col1","col2","col3"), then names(df)<-newnames will give you a data frame with the column names col1, col2, col3.
As @Henrik said, the column names should be non-empty. Setting names(df) <- NULL will give NA in the column names.
If your data is in a CSV file and you read it with header=TRUE, the data frame will have the same column names as the CSV file. If you set header=FALSE, R will assign the column names V1, V2, ... and the column names from the original CSV file will appear as the first row.
anydata.csv
a b c d
1 1 2 3 13
2 2 3 1 21
read.csv("anydata.csv",header=TRUE)
a b c d
1 1 2 3 13
2 2 3 1 21
read.csv("anydata.csv",header=FALSE)
V1 V2 V3 V4
1 a b c d
2 1 2 3 13
3 2 3 1 21
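
The same comparison can be tried without a file on disk, using the text argument of read.csv. This is a sketch that assumes a comma-separated version of the anydata.csv content shown above.

```r
csv_text <- "a,b,c,d\n1,2,3,13\n2,3,1,21"

# First line is used as the column names
with_header <- read.csv(text = csv_text, header = TRUE)

# First line is treated as data; R generates V1, V2, ... as names
no_header <- read.csv(text = csv_text, header = FALSE)

print(names(with_header))
print(names(no_header))
print(nrow(no_header))   # one more row than with_header
```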

You could use
setNames(dat, rep(" ", length(dat)))
where dat is the name of the data frame. Then all columns will have the name " " and hence will be 'invisible'.

This comes some years late, but you can simply assign a vector of new names to the columns:
## if you want to delete all column names:
colnames(df)[] <- ""
## if you want to delete let's say column 1:
colnames(df)[1] <- ""
## if you want to delete 1 to 3 and 7:
colnames(df)[c(1:3,7)] <- ""

As already mentioned, a data frame without column names just isn't going to happen. But I'm guessing you don't care much whether the names are there; you just don't want to see them when you print your data frame. If so, you can write a new print function to get around that, like so:
> dat <- data.frame(var1=c("A","B","C"),var2=rnorm(3),var3=rnorm(3))
> print(dat)
var1 var2 var3
1 A 1.2771777 -0.5726623
2 B -1.5000047 1.3249348
3 C 0.1989117 -1.4016253
> ncol.print <- function(dat) print(matrix(as.matrix(dat),ncol=ncol(dat),dimnames=NULL),quote=F)
> ncol.print(dat)
[,1] [,2] [,3]
[1,] A 1.2771777 -0.5726623
[2,] B -1.5000047 1.3249348
[3,] C 0.1989117 -1.4016253
Your other option is to set your variable names to unique amounts of whitespace, for example:
> names(dat) <- c(" ", "  ", "   ")
> dat
1 A 1.2771777 -0.5726623
2 B -1.5000047 1.3249348
3 C 0.1989117 -1.4016253
You can also write a function to do this:
> blank.names <- function(dat){
+ for(i in 1:ncol(dat)){
+ names(dat)[i] <- paste(rep(" ",i),collapse="")
+ }
+ return(dat)
+ }
> dat <- data.frame(var1=c("A","B","C"),var2=rnorm(3),var3=rnorm(3))
> dat
var1 var2 var3
1 A -1.01230289 1.2740237
2 B -0.13855777 0.4689117
3 C -0.09703034 -0.4321877
> blank.names(dat)
1 A -1.01230289 1.2740237
2 B -0.13855777 0.4689117
3 C -0.09703034 -0.4321877
But generally I don't think any of this should be done.

A function that I use in one of my R scripts:
read_matrix <- function (csvfile) {
a <- read.csv(csvfile, header=FALSE)
matrix(as.matrix(a), ncol=ncol(a), dimnames=NULL)
}
How to call this:
iops_even <- read_matrix('even_iops_Jan15.csv')
iops_odd <- read_matrix('odd_iops_Jan15.csv')

If you are working with a pandas DataFrame in Python rather than an R data frame, you can simply do:
print(df.to_string(header=False))
if you want to remove the line indexes as well, you can do:
print(df.to_string(index=False,header=False))

Improving performance of updating contents of large data frame using contents of similar data frame

I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1
...and the update data frame "df2":
K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
Thanks

This loops by column, setting dt1 by reference and (hopefully) should be quick.
library(data.table) # as.data.table(), chmatch() and set() come from data.table
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1),names(dt2)))
stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)
for (i in 2:ncol(dt2)) {
nna = !is.na(dt2[[i]])
set(dt1,w[nna],i,dt2[[i]][nna])
}
dt1 = rbind(dt1,dt2[is.na(w)])
dt1
K P1 P2 P3
[1,] A 1 1 1
[2,] B 2 1 2
[3,] C 1 2 2
[4,] D 2 2 2

This is likely not the fastest solution but is done entirely in base.
(updated answer per Tommy's comments)
#READING IN YOUR DATA FRAMES
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(levels(df1$K), levels(df2$K)) #all cells of key column
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ],
FUN = decider)) #replace all NAs of df2 with df1 values if they exist
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
df5
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
On closer reading, not all columns have the same names, but I am assuming they are in the same order. This may be a more helpful approach:
all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
colnames(LS[[i]]) <- colnames(LS[[2]])
return(LS[[i]])
}
)
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5

EDIT : Please ignore this answer. Bad idea to loop by row. It works but is very slow. Left for posterity! See my 2nd attempt as separate answer.
require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
k = K[i]
p = unlist(dt2[i,-1,with=FALSE])
p = p[!is.na(p)]
dt1[J(k),names(p):=as.list(p),with=FALSE]
}
Or can you use a matrix instead of a data.frame? If so, it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
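
One possible reading of that matrix suggestion, as a base R sketch rather than the answerer's own code. It assumes that all value columns are numeric, so they fit into one matrix, and that both data frames have the same columns. The intermediate names (m1, m2, idx and so on) are illustrative.

```r
df1 <- data.frame(K = c("A", "B", "C"), P1 = 1, P2 = 1, P3 = 1,
                  stringsAsFactors = FALSE)
df2 <- data.frame(K = c("B", "C", "D"),
                  P1 = c(2, NA, 2), P2 = c(NA, 2, 2), P3 = c(2, 2, 2),
                  stringsAsFactors = FALSE)

# Move the value columns into matrices with the key as row names
value_cols <- setdiff(names(df1), "K")
m1 <- as.matrix(df1[value_cols]); rownames(m1) <- df1$K
m2 <- as.matrix(df2[value_cols]); rownames(m2) <- df2$K

# (row, col) positions of the non-NA cells among the overlapping keys
common <- intersect(df1$K, df2$K)
idx <- which(!is.na(m2[common, , drop = FALSE]), arr.ind = TRUE)

# Translate those positions into row numbers of each matrix
rows1 <- match(common, rownames(m1))[idx[, "row"]]
rows2 <- match(common, rownames(m2))[idx[, "row"]]

# The A[B] update: B is a 2-column matrix of row and column numbers
m1[cbind(rows1, idx[, "col"])] <- m2[cbind(rows2, idx[, "col"])]

# Append the rows whose keys exist only in the update ("inserts")
new_keys <- setdiff(df2$K, df1$K)
result_m <- rbind(m1, m2[new_keys, , drop = FALSE])

result <- data.frame(K = rownames(result_m), result_m, stringsAsFactors = FALSE)
rownames(result) <- NULL
print(result)
```

Because the update is a single vectorised assignment over index matrices, nothing loops over individual cells, which is where the question reports the time being lost.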

The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
library(data.table)
library(reshape2)
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table
dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=T), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
setnames(dt2s,"value","value.new")
dt1s[dt2s,value:=value.new] # Update new values
dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names
