rbind data.frames without names - r

I am trying to figure out why the rbind function is not working as intended when joining data.frames without names.
Here is my testing:
test <- data.frame(
id=rep(c("a","b"),each=3),
time=rep(1:3,2),
black=1:6,
white=1:6,
stringsAsFactors=FALSE
)
# take some subsets with different names
pt1 <- test[,c(1,2,3)]
pt2 <- test[,c(1,2,4)]
# method 1 - rename to same names - works
names(pt2) <- names(pt1)
rbind(pt1,pt2)
# method 2 - works - even with duplicate names
names(pt1) <- letters[c(1,1,1)]
names(pt2) <- letters[c(1,1,1)]
rbind(pt1,pt2)
# method 3 - works - with a vector of NA's as names
names(pt1) <- rep(NA,ncol(pt1))
names(pt2) <- rep(NA,ncol(pt2))
rbind(pt1,pt2)
# method 4 - but... does not work without names at all?
pt1 <- unname(pt1)
pt2 <- unname(pt2)
rbind(pt1,pt2)
This seems a bit odd to me. Am I missing a good reason why this shouldn't work out of the box?
edit for additional info
Using @JoshO'Brien's suggestion to debug, I can identify the error as occurring in this if statement of the rbind.data.frame function:
if (is.null(pi) || is.na(jj <- pi[[j]]))
(online version of code here: http://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R starting at: "### Here are the methods for rbind and cbind.")
From stepping through the program, the value of pi does not appear to have been set at this point, hence the program tries to index the built-in constant pi like pi[[3]] and errors out.
From what I can figure, the internal pi object doesn't appear to be set due to this earlier line where clabs has been initialized as NULL:
if (is.null(clabs)) clabs <- names(xi) else { #pi gets set here
I am in a tangle trying to figure this out, but will update as it comes together.

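Because unname() and explicitly assigning NA as column names are not the same operation. unname() removes the names entirely, so names() returns NULL, whereas assigning NA leaves a vector of NA names in place. rbind() builds its column labels from the names of the data frames it is given: with NA names it still has labels to work with, but with no names at all it has nothing to match on, and hence rbind() fails.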
Here is some code to help see what I mean:
> c1 <- c(1,2,3)
> c2 <- c('A','B','C')
> df1 <- data.frame(c1,c2)
> df1
c1 c2
1 1 A
2 2 B
3 3 C
> df2 <- data.frame(c1,c2) # df1 & df2 are identical
>
> #Let's perform unname on one data frame &
> #replacement with NA on the other
>
> unname(df1)
NA NA
1 1 A
2 2 B
3 3 C
> tem1 <- names(unname(df1))
> tem1
NULL
>
> #Please note above that the column headers though showing as NA are null
>
> names(df2) <- rep(NA,ncol(df2))
> df2
NA NA
1 1 A
2 2 B
3 3 C
> tem2 <- names(df2)
> tem2
[1] NA NA
>
> #Though unname(df1) & df2 look identical, they aren't
> #Also note difference in tem1 & tem2
>
> identical(unname(df1),df2)
[1] FALSE
>
I hope this helps. The names print as NA in both cases, but the two operations are different.
Hence, two data frames whose column names have been replaced with NA can be "rbound", but two data frames without any column names at all (achieved using unname()) cannot.
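The difference can be checked directly, and one possible workaround is to give the unnamed data frames temporary placeholder names before binding. This is only a sketch: rbind_unnamed() is a hypothetical helper (not part of base R), and it assumes all inputs have the same number of columns in the same order.

```r
# Toy data mirroring the example above
df1 <- data.frame(c1 = c(1, 2, 3), c2 = c("A", "B", "C"))
df2 <- df1

no_names <- unname(df1)            # names attribute removed entirely
names(df2) <- rep(NA, ncol(df2))   # names still present, but all NA

is.null(names(no_names))           # TRUE
is.null(names(df2))                # FALSE

# rbind_unnamed() is a hypothetical helper, not a base R function.
# It assigns placeholder names, binds by row, then strips the names again.
rbind_unnamed <- function(...) {
  dfs <- list(...)
  placeholder <- paste0("V", seq_len(ncol(dfs[[1]])))
  dfs <- lapply(dfs, function(d) {
    names(d) <- placeholder
    d
  })
  unname(do.call(rbind, dfs))
}

stacked <- rbind_unnamed(no_names, no_names)
dim(stacked)                       # 6 rows, 2 columns
```

The placeholder names only exist for the duration of the call, so the result is unnamed like the inputs.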

Related

Generate a sequence of Data frame from function

I searched but couldn't find a similar question, so apologies in advance if this is a duplicate. I am trying to generate data frames from within a for loop in R.
what I want to do:
Define each columns of each data frame by a function,
Generate n data frames (length of my sequence of data frame) using loop,
As example I will use n=100 :
n<-100
k<-8
d1 <- data.frame()
for(i in 1:(k)) {d1 <- rbind(d1,c(a="i+1",b="i-1",c="i/1"))}
d2 <- data.frame()
for(i in 1:(k+2)) {d2 <- rbind(d2,c(a="i+2",b="i-2",c="i/2"))}
...
d100 <- data.frame()
for(i in 1:(k+100)) {d100 <- rbind(d100,c(i+100, i-100, i/100))}
It is clear that It'll be difficult to construct one by one each data.frame. I tried this:
d<-list()
for(j in 1:100) {
d[j] <- data.frame()
for(i in 1:(k+j)) {d[j] <- rbind(d[j] ,c(i+j, i-j, i/j))}
}
But I can't really do anything with it; I run into an error:
Error in d[j] <- data.frame() : replacement has length zero
In addition: Warning messages:
1: In d[j] <- rbind(d[j], c(i + j, i - j, i/j)) :
number of items to replace is not a multiple of replacement length
And a few more remarks about my example:
the number of rows is not the same in each data frame: d1 has 8 rows, d2 has 10 rows, and d100 has 8+100 rows,
the algorithm should give us : D=(d1,d2, ... ,d100).
It would be great to get an answer using the same approach (rbind) and a more base like approach. Both will aid in my understanding. Of course, please point out where I'm going wrong if it's obvious.
Here's how to create an empty data.frame (and it's not what you are trying):
Create an empty data.frame
And you should not be creating 100 separate dataframes but rather a list of dataframes. I would not do it with rbind, since that would be very slow. Instead I would create them with a function that returns a dataframe of the required structure:
make_df <- function(n,var) {data.frame( a=(1:n)+var,b=(1:n)-var,c=(1:n)/var) }
mylist <- setNames(
lapply(1:100, function(n) make_df(n,n)) , # the dataframes
paste0("d_", 1:100)) # the names for access
head(mylist,3)
#---------------
$d_1
a b c
1 2 0 1
$d_2
a b c
1 3 -1 0.5
2 4 0 1.0
$d_3
a b c
1 4 -2 0.3333333
2 5 -1 0.6666667
3 6 0 1.0000000
Then if you want the "d_40" dataframe it's just:
mylist[[ "d_40" ]]
Or
mylist$d_40
If you want to perform the same operation on all of them, or get a result from all of them at once, just use lapply:
lapply(mylist, nrow) # will be a list
Or:
sapply(mylist, nrow) #will be a vector because each value is the same length.
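To tie this back to the loop in the question, here is a minimal sketch that builds the list both ways. It assumes the row count of the j-th data frame is k + j, as in the attempted loop, and that the columns are named a, b and c. The key change in the rbind version is indexing the list with d[[j]] rather than d[j].

```r
k <- 8

# Vectorised version: one function call per data frame
make_df <- function(j, k) {
  i <- seq_len(k + j)
  data.frame(a = i + j, b = i - j, c = i / j)
}
D <- lapply(1:100, make_df, k = k)
names(D) <- paste0("d", 1:100)

# rbind-based version, closer to the original attempt.
# Each row is built as a one-row data frame, then bound once per j.
d <- vector("list", 100)
for (j in 1:100) {
  rows <- lapply(seq_len(k + j), function(i) {
    data.frame(a = i + j, b = i - j, c = i / j)
  })
  d[[j]] <- do.call(rbind, rows)
}

nrow(D$d100)   # k + 100 rows
```

Both versions should hold the same values; the vectorised one avoids creating a data frame per row.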

Don't understand how apply gets its parameters in r

I am struggling to make my apply() work: I have two dataframes:
from <- c(1,2,3)
to <- c(2,3,4)
df1 <- data.frame(from, to)
long <-c(9,9.2,9.4,9.6)
lat <- c(45,45.2,45.4,45.6)
id <- c(1,2,3,4)
df2 <- data.frame(long, lat, id)
Now I want something like this:
myFunction <- function(arg){
# >>> How do I access arg$from and arg$to? <<<
}
apply(df1,1,myFunction)
In myFunction I need to make some calculations and return a value for each from-to pair. I don't understand how to access parts of the arg, since arg[0] gives me numeric(0) and arg$from just crashes.
The problem is that apply(...) requires a matrix or array as the first argument. If you pass a dataframe, it will coerce that to a matrix. Matrices are 1 indexed, so the upper left element is [1,1], not [0,0]. Also, matrix columns cannot be referenced using the $ notation.
So,
f <- function(x) {
from <- x[1]
to <- x[2]
# do stuff with from and to...
}
apply(df1,1,f)
would work.
One other thing to watch out for is that if your dataframe has (other) columns that have character strings, the conversion will make everything character (including the numbers!). This is because, by definition, all elements of a matrix must have the same data type. Your example does not have that problem, though.
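As a small sketch of the points above: because the data frame has column names, each row that apply() hands to the function is a named vector, so elements can also be picked out by name with [[ ]] instead of by position. The function f below is only an illustration, and the mixed data frame is made up to show the character coercion.

```r
df1 <- data.frame(from = c(1, 2, 3), to = c(2, 3, 4))

# Each row arrives as a named numeric vector, so x[["from"]] works
# where x$from does not.
f <- function(x) {
  x[["to"]] - x[["from"]]
}
res <- apply(df1, 1, f)

# A data frame with a character column is coerced to a character matrix,
# so the numbers would reach the function as strings.
mixed <- data.frame(from = c(1, 2), label = c("a", "b"))
is.character(as.matrix(mixed))   # TRUE
```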
Try mapply(). It's a multivariate version of sapply(). For example:
> myFunction <- function(arg1, arg2){
+ return(sum(arg1, arg2))
+ }
>
> mapply(myFunction, df1$from, df1$to)
[1] 3 5 7
You can also use it to make a new variable in your data frame.
> df1$newvar <- mapply(myFunction, df1$from, df1$to)
> df1
from to newvar
1 1 2 3
2 2 3 5
3 3 4 7
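The same pattern extends to calculations that need df2. The sketch below assumes that from and to refer to the id column of df2; pair_dist() is a hypothetical function that returns the straight-line distance in degree units, chosen only to have something to compute.

```r
df1 <- data.frame(from = c(1, 2, 3), to = c(2, 3, 4))
df2 <- data.frame(long = c(9, 9.2, 9.4, 9.6),
                  lat  = c(45, 45.2, 45.4, 45.6),
                  id   = c(1, 2, 3, 4))

# pair_dist() is hypothetical: it looks up both ids in df2 and
# returns the Euclidean distance between the two points.
pair_dist <- function(from_id, to_id) {
  a <- df2[match(from_id, df2$id), ]
  b <- df2[match(to_id, df2$id), ]
  sqrt((a$long - b$long)^2 + (a$lat - b$lat)^2)
}

df1$dist <- mapply(pair_dist, df1$from, df1$to)
```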

Referencing a column in R dataframe

I am having trouble referencing columns in a dataframe by name. The function I have begins with extracting rows where no NA's are present:
prepare <- function(dataframe, attr1,attr2){
subset_na_still_there <- dataframe[!is.na(attr1) & !is.na(attr2),]
subset_na_still_there2 <- subset(dataframe, !is.na(attr1) & !is.na(attr2))
### someother code goes here
}
However, the subsets that are returned still contain NA's. I get no errors.
Here is a related question
edit:
Selecting the columns and then referencing them by number does the trick:
prepare <- function(dataframe, attr1,attr2){
subset_cols <- dataframe[,c(attr1, attr2)]
subset_gone <- subset_cols[!is.na(subset_cols[,1]) & !is.na(subset_cols[,2]),]
}
Why does the first version not work as expected?
How about this:
prepare <- function(x, attr1, attr2){
x[!is.na(x[[attr1]]) & !is.na(x[[attr2]]),]
}
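A likely explanation for the first version, assuming attr1 and attr2 were passed as column names (character strings): is.na() is then applied to the string itself rather than to the column, so the condition is a single TRUE and every row is kept. subset() has the same problem inside the function, because attr1 is not a column of the data frame and so resolves to the string argument. The sketch below uses made-up data to show both behaviours.

```r
df <- data.frame(att1 = c(1, NA, NA, 10),
                 att2 = c(NA, 1, 2, 3))

# The string "att1" is not NA, so this condition is just TRUE
is.na("att1")                                    # FALSE
kept_all <- df[!is.na("att1") & !is.na("att2"), ]

# Looking the columns up by name with [[ ]] tests the actual values
prepare <- function(x, attr1, attr2) {
  keep <- !is.na(x[[attr1]]) & !is.na(x[[attr2]])
  x[keep, , drop = FALSE]
}
cleaned <- prepare(df, "att1", "att2")
```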
Rather than creating your own function, try subset:
subset(mydata, !is.na(attr1) & !is.na(attr2))
If you want to get rid of rows with NAs in any field try
na.omit(mydata)
df <- data.frame(att1=c(1,NA,NA,10),att2=c(NA,1,2,3),val=c("a","z","e","r"))
df
att1 att2 val
1 1 NA a
2 NA 1 z
3 NA 2 e
4 10 3 r
test <- function(df,att1,att2){
df_no_na <- df[!is.na(att1) & !is.na(att2),]
df_no_na
}
test(df,df$att1,df$att2)
att1 att2 val
4 10 3 r
It works for me. Are you sure about the NAs? Does is.na(df$att1) return TRUE?

Retain Vector Names as Dataframe Column Names

In my code, I am filling the columns of a dataframe with vectors, as so:
df1[columnNum] <- barWidth
This works fine, except for one thing: I want the name of the vector variable (barWidth above) to be retained as the column header, one column at a time. Furthermore, I do not wish to use cbind. This slows the execution of my code down considerably. Consequently, I am using a pre-allocated dataframe.
Can this be done in the vector-to-column assignment? If not, then how do I change it after the fact? I can't find the right syntax to do this with colnames().
TIA
It's being done by the [<-.data.frame function. It could conceivably be replaced by one that looked at the name of the argument but it's such a fundamental function I would be hesitant. Furthermore there appears to be an aversion to that practice signaled by this code at the top of the function definition:
> `[<-.data.frame`
function (x, i, j, value)
{
if (!all(names(sys.call()) %in% c("", "value")))
warning("named arguments are discouraged")
nA <- nargs()
if (nA == 4L) {
<snipped rest of rather long definition>
I don't know why that is there, but it is. Maybe you should either be thinking about using names<- after the column assignment, or using this method:
> dfrm["barWidth"] <- barWidth
> dfrm
a V2 barWidth
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
This can be generalized to a list of new columns:
dfrm <- data.frame(a=letters[1:4])
barWidth <- 1:4
newcols <- list(barWidth=barWidth, bw2 =barWidth)
dfrm[names(newcols)] <- newcols
dfrm
#
a barWidth bw2
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
If you have the list of names of vectors you want to apply you could do:
namevec <- c(...,"barWidth"...,)
columnNums <- c(...,10,...)
df1[columnNums[i]] <- get(namevec[i])
names(df1)[columnNums[i]] <- namevec[i]
or even
columnNums <- c(barWidth=4,...)
for (i in seq_along(columnNums)) {
df1[columnNums[i]] <- get(names(columnNums)[i])
}
names(df1)[columnNums] <- names(columnNums)
but the deeper question would be where this set of vectors is coming from in the first place: could you have them in a list all along?
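For a pre-allocated data frame filled by column position, the name can be set right after the assignment with names<-. The sketch below also shows a hypothetical helper, set_col(), that picks up the name of the variable passed to it; note that it returns a modified copy, so whether it is fast enough for a tight loop is something to measure rather than assume.

```r
n <- 4
df1 <- data.frame(matrix(NA_real_, nrow = n, ncol = 3))  # pre-allocated
barWidth <- c(0.5, 1.0, 1.5, 2.0)

# Assign by position, then rename that one column
columnNum <- 2
df1[[columnNum]] <- barWidth
names(df1)[columnNum] <- "barWidth"

# set_col() is hypothetical: it captures the name of the variable
# supplied as `vec` and uses it as the column name.
set_col <- function(df, pos, vec) {
  col_name <- deparse(substitute(vec))
  df[[pos]] <- vec
  names(df)[pos] <- col_name
  df
}

barHeight <- c(3, 4, 5, 6)
df1 <- set_col(df1, 3, barHeight)
names(df1)
```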
I'd simply use cbind():
df1 <- cbind( df1, barWidth )
which retains the name. It will, however, end up as the last column in df1

Update Variable within Loop in R

How should I modify my code to update variables within a loop?
Specifically, I want to do something like the following:
myMatrix1 <- read.table(someFile)
myMatrix2 <- read.table(someFile2)
for (i in nrow(myMatrix2))
{
myMatrix3 <- myMatrix1[which(doSomeTest),]
myMatrix4 <- rep(myMatrix2$header1,nrow(myMatrix1))
myMatrix5 <- rep(myMatrix2$header2, nrow(myMatrix1))
myMatrix6 <- cbind(myMatrix3, myMatrix4, myMatrix5)
# *see question
}
How can I get myMatrix6 to be updated instead of reassigned the product of cbind(myMatrix3, myMatrix4, myMatrix5)? In other words, if the first iteration (i = 1) gave a myMatrix6 of:
> 1 1 1 1
> 2 2 2 2
and the second iteration (i = 2) gave a myMatrix6 of:
> 3 3 3 3
> 4 4 4 4
how do I get a dataframe(?) of:
> 1 1 1 1
> 2 2 2 2
> 3 3 3 3
> 4 4 4 4
UPDATE:
I have - thanks to DWin and Timo's suggestions - got the following. However, the following code has taken me about 2 hours to run on my datasets. Are there any ways to make it run any faster??? (without using a more powerful computer I may add)
# create empty matrix for sedimentation
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
# create empty matrix for bore
myMatrix7 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix7) <- letters[1:4]
for (i in 1:nrow(myMatrix2))
{
# create matrix that has the value of myMatrix1$begin being
# situated between the values of myMatrix2$begin[i] and myMatrix2$finish[i]
myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin < myMatrix2$finish[i])),]
myMatrix4 <- rep(myMatrix2$sedimentation[i], nrow(myMatrix3))
if (is.na(myMatrix2$boreWidth[i])) {
myMatrix5 <- rep(NA, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] == 0) {
myMatrix5 <- rep(TRUE, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] > 0) {
myMatrix5 <- rep(FALSE, nrow(myMatrix3))
}
myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}
Instead, initialize myMatrix6 as an empty data.frame and rbind each result onto it (which may be inefficient). If efficiency is a concern, pre-allocate the data.frame to the size you need and fill in its rows by indexing.
# Method # 1 code
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
for (i in seq_len(nrow(myMatrix2))) {
myMatrix3 <- myMatrix1[which(doSomeTest),]
myMatrix4 <- rep(myMatrix2$header1,nrow(myMatrix1))
myMatrix5 <- rep(myMatrix2$header2, nrow(myMatrix1))
myMatrix6 <- rbind( myMatrix6, cbind(myMatrix3, myMatrix4, myMatrix5) )
}
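A middle ground between growing the data frame with rbind on every iteration and fully pre-allocating it is to store each piece in a pre-sized list and bind once at the end. The sketch below uses small made-up data; the column names begin, finish and sedimentation are taken from the updated question, and everything else is an assumption.

```r
# Made-up stand-ins for the two tables read from file
myMatrix1 <- data.frame(begin = c(1, 5, 12, 18))
myMatrix2 <- data.frame(begin = c(0, 10),
                        finish = c(10, 20),
                        sedimentation = c("sand", "clay"))

pieces <- vector("list", nrow(myMatrix2))   # one slot per iteration
for (i in seq_len(nrow(myMatrix2))) {
  # rows of myMatrix1 whose begin lies strictly inside interval i
  hit <- myMatrix1$begin > myMatrix2$begin[i] &
         myMatrix1$begin < myMatrix2$finish[i]
  piece <- myMatrix1[hit, , drop = FALSE]
  piece$sedimentation <- rep(myMatrix2$sedimentation[i], nrow(piece))
  pieces[[i]] <- piece
}

# A single rbind over all pieces instead of one per iteration
result <- do.call(rbind, pieces)
```

This keeps the loop structure intact while avoiding the repeated copying that comes from rebinding an ever-larger data frame.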
In your code, you are not dealing with matrices (in the sense of R), but data frames, as read.table returns a data frame.
Either way, you can append one matrix/data frame to another (assuming the column names match) with the rbind command.
For example, if
> a = data.frame(x=c(1,2,3),y=c(4,5,6),z=c(7,8,9))
> b = data.frame(x=c(4,5),y=c(5,6),z=c(6,7))
then
> rbind(a,b)
x y z
1 1 4 7
2 2 5 8
3 3 6 9
4 4 5 6
5 5 6 7
There are other gotchas in the code you provide. For example
for (i in length(someVector))
should be
for (i in 1:length(someVector))
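A short sketch of why this matters: length(someVector) is a single number, so the first form runs the loop body exactly once. As a side note, seq_along() is a commonly used alternative to 1:length() because it also behaves sensibly for an empty vector.

```r
someVector <- c(10, 20, 30)

# Iterates once, with i equal to the length of the vector
visited <- integer(0)
for (i in length(someVector)) visited <- c(visited, i)

# Iterates over every index
visited_all <- integer(0)
for (i in seq_along(someVector)) visited_all <- c(visited_all, i)

# Edge case: with an empty vector, 1:length() counts down from 1 to 0
empty <- numeric(0)
bad  <- 1:length(empty)     # two values: 1 and 0
good <- seq_along(empty)    # no values, so a loop body never runs
```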
R has many functions for iterating over data.frames, vectors etc and can do all kinds of data transformations. Most of the time one does not need to write a for loop.
If you would provide more details about what you are trying to do, maybe we can find a simpler solution.
EDIT:
It seems from your post update that you are trying to do some sort of conversion between 'wide' and 'long' format and filter out some lines that fail a test. Correct me, if I am wrong.
Anyway, if that is the case, you should check out the reshape command. There is also a reshape package containing the extremely useful commands melt and cast, which can do that kind of transformation quite efficiently, and there is the merge command for doing certain "join" operations on data frames. I'm quite sure your problem could be solved by using a combination of the above commands, but it depends on the exact details.
For filtering rows/columns by some criteria, check out the subset command.
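As one possible illustration of combining merge and subset, the sketch below forms every pairing of rows from two small made-up tables and then keeps the pairs where the value falls inside the interval. The column names lo and hi are invented to avoid a name clash with begin; note that the cross join has nrow(m1) * nrow(m2) rows, so it may be impractical for large inputs.

```r
m1 <- data.frame(begin = c(1, 5, 12, 18))
m2 <- data.frame(lo = c(0, 10),
                 hi = c(10, 20),
                 sedimentation = c("sand", "clay"))

# by = NULL asks merge() for the Cartesian product of the two tables
all_pairs <- merge(m1, m2, by = NULL)

# Keep only the pairs where begin lies strictly between lo and hi
joined <- subset(all_pairs, begin > lo & begin < hi)
```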
