R data.frame explanation of results

I have 3 vectors: MRI, MRI_high, MRI_low. The _low and _high vectors are each half the length of the first. My objective is to put them into a single object so that I can iterate over them (I have a bunch of vectors following the same format).
When I wrote: data.entry(MRI, MRI_high, MRI_low) a window popped up with my data arranged in columns of the correct lengths. The problem is that I can't use that.
When I used MRI_vector <- data.frame(MRI, MRI_high, MRI_low) the function somehow gave me 3 columns of equal length by duplicating the shorter vectors.
What is a solution to this? And do data frames need equal lengths for their elements?
Moreover, I tried using lists, but then my values are not numeric.

One option is to place them in a list and pad with NA to make the lengths equal before converting to a data.frame:
lst <- mget(ls(pattern = "MRI.*"))
df1 <- data.frame(lapply(lst, `length<-`, max(lengths(lst))))
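To illustrate what is going on, here is a minimal sketch with made-up values (the numbers are assumptions, not the asker's data). data.frame() recycles a shorter atomic vector when the longest length is a whole multiple of it, which is why the half-length vectors appeared duplicated. Padding with NA first avoids that.

```r
# Made-up example vectors; the shorter two are half the length of MRI
MRI      <- c(1.2, 3.4, 5.6, 7.8)
MRI_high <- c(5.6, 7.8)
MRI_low  <- c(1.2, 3.4)

# data.frame() recycles the shorter vectors to fill four rows
recycled <- data.frame(MRI, MRI_high, MRI_low)

# Padding each vector with NA up to the longest length keeps the real values only
lst <- list(MRI = MRI, MRI_high = MRI_high, MRI_low = MRI_low)
padded <- data.frame(lapply(lst, `length<-`, max(lengths(lst))))

# Columns stay numeric, so statistics work as long as NAs are skipped
colMeans(padded, na.rm = TRUE)
```

Building the list explicitly, rather than with mget(ls(pattern = ...)), avoids accidentally picking up other objects in the workspace whose names match the pattern.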

I found a solution to my problem:
I used the list; however, when computing statistics I used as.numeric(unlist()), which still allows me to iterate.
Any further comments are appreciated to help me understand this strange data-frame behaviour!

Related

R: Check for finite values in DataFrame

I need to check whether a data frame is "empty" or not ("empty" in the sense that the data frame contains no finite values; if there is a mix of finite and non-finite values, it should NOT be considered "empty").
Referring to How to check a data.frame for any non-finite, I came up with a one-line piece of code that almost achieves this objective:
nrow(tmp[rowSums(sapply(tmp, function(x) is.finite(x))) > 0,]) == 0
where tmp is some data frame.
This code works fine for most cases, but it fails if the data frame contains a single row.
For example, the above code works fine for
tmp <- data.frame(a=c(NA,NA), b=c(NA,NA))
or
tmp <- data.frame(a=c(3,NA), b=c(4,NA))
but not for
tmp <- data.frame(a=NA, b=NA)
because I think rowSums expects at least two rows.
I looked at some other posts such as https://stats.stackexchange.com/questions/6142/how-to-calculate-the-rowmeans-with-some-single-rows-in-data, but I still couldn't come up with a solution for my problem.
My question is: are there any clean ways (i.e. avoiding loops, and ideally a one-liner) to check whether any data frame is "empty"?
Thanks
If you are checking all columns, then you can just do
all(!sapply(tmp, is.finite))
Here we are using all rather than the rowSums trick, so we don't have to worry about preserving matrices. The negation is needed because "empty" means that no value is finite.
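A minimal sketch of why the single-row case breaks and how an all()-based check sidesteps it. With one row, sapply() simplifies to a plain vector instead of a matrix, and rowSums() refuses anything with fewer than two dimensions. The helper name is_empty is hypothetical, and the negation inside all() reflects the question's definition of "empty" (no finite value anywhere).

```r
# Hypothetical helper: TRUE when the data frame holds no finite value at all
is_empty <- function(df) all(!sapply(df, is.finite))

one_row        <- data.frame(a = NA, b = NA)
two_rows_empty <- data.frame(a = c(NA, NA), b = c(NA, NA))
mixed          <- data.frame(a = c(3, NA), b = c(4, NA))

# With two rows sapply() returns a matrix; with one row it returns a vector
is.matrix(sapply(two_rows_empty, is.finite))
is.matrix(sapply(one_row, is.finite))

# rowSums() fails on the vector, which is the error seen in the question
row_sums_failed <- inherits(
  try(rowSums(sapply(one_row, is.finite)), silent = TRUE),
  "try-error"
)

# all() does not care about the shape of its input
is_empty(one_row)
is_empty(two_rows_empty)
is_empty(mixed)
```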

Convert several columns of data frame to numeric

I am reading a txt file into R and have several columns that should be numeric, but everything is interpreted as character. I would like to convert only a few columns of that matrix to numeric (I converted the data to a matrix in a first step), but I only managed to extract single columns, and that way I lost the matrix type:
data <- as.numeric(data[,1])
I've found similar questions here, but none of the answers worked in a way that preserved the matrix type.
For example, I've tried storing the affected columns in a vector and then applying the conversion to that vector with lapply:
cols <- c("a","b","d")
data <- as.matrix(lapply(cols, as.numeric))
But this gives me only empty fields, and of course it only shows the columns I selected and not the rest of the matrix. I also got the warning
NAs introduced by coercion
As a last step I tried the following, but I ended up with a list and not a matrix anymore:
data[1:25] <- as.matrix(lapply(data[1:25], as.numeric))
What I would like to have is a matrix where several columns (not just 1:25 as in my example above, but rather, say, columns 1, 3 and 6) are converted to numeric and the rest stays the same.
Does someone have an answer and maybe even an explanation for why the things I've tried didn't work?
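A possible explanation, sketched with made-up data: a matrix can hold only one type, so numbers assigned back into a character matrix are coerced to character again. Also, lapply(cols, as.numeric) converts the column names "a", "b", "d" themselves, which is where the NA warning comes from. Keeping the data in a data frame allows mixed column types. The object names raw, m and df below are assumptions for illustration only.

```r
# Made-up character data standing in for the imported txt file
raw <- data.frame(
  a = c("1", "2"),
  b = c("3.5", "4"),
  c = c("x", "y"),
  d = c("10", "20"),
  stringsAsFactors = FALSE
)

# In a matrix, converted values are coerced straight back to character
m <- as.matrix(raw)
m[, "a"] <- as.numeric(m[, "a"])
typeof(m)

# In a data frame, only the selected columns change type
df <- raw
cols <- c("a", "b", "d")
df[cols] <- lapply(df[cols], as.numeric)
sapply(df, class)
```

Selecting by position works the same way, for example df[c(1, 3, 6)] <- lapply(df[c(1, 3, 6)], as.numeric) on a data frame with at least six columns.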

matrix subsetting by column's name using `subset` function

Consider the following simulation snippet:
k <- 1:5
x <- seq(0,10,length.out = 100)
dsts <- lapply(1:length(k), function(i) cbind(x=x, distri=dchisq(x,k[i]),i) )
dsts <- do.call(rbind,dsts)
Why does this code throw an error (dsts is a matrix)?
subset(dsts,i==1)
#Error in subset.matrix(dsts, i == 1) : object 'i' not found
Even this one fails:
colnames(dsts)[3] <- 'iii'
subset(dsts,iii==1)
But not this one (matrix coerced to a data frame):
subset(as.data.frame(dsts),i==1)
This one also works, because x is already defined in the workspace:
subset(dsts,x> 500)
The error occurs in subset.matrix() on this line:
else if (!is.logical(subset))
Is this a bug that should be reported to R Core?
The behavior you are describing is by design and is documented on the ?subset help page.
From the help page:
For data frames, the subset argument works on the rows. Note that subset will be evaluated in the data frame, so columns can be referred to (by name) as variables in the expression (see the examples).
In R, data.frames and matrices are very different types of objects. If this is causing a problem, you are probably using the wrong data structure for your data. Matrices are really only necessary if you need matrix arithmetic. If you are thinking of your columns as different attributes of row observations, then you should be storing your data in a data.frame in the first place. You could store all your values in a simple vector where every three values represent one observation, but that would also be a poor choice of data structure for your data. I'm not sure if you were trying to be more efficient by choosing a matrix, but it seems like just the wrong choice.
A data.frame is stored as a named list, while a matrix is stored as a dimensioned vector. A list can be used as an environment, which makes it easy to evaluate variable names in that context. The biggest difference between the two is that data.frames can hold columns of different classes (numerics, characters, dates) while matrices can only hold values of exactly one data type. You cannot always easily convert between the two without a loss of information.
Things like $ work with data.frames but not with matrices.
dd <- data.frame(x=1:10)
dd$x
mm <- matrix(1:10, ncol=1, dimnames=list(NULL, "x"))
mm$x # Error
If you want to subset a matrix, you are better off using standard [ subsetting rather than the subset() function.
dsts[ dsts[,"i"]==1, ]
This behavior has been a part of R for a very long time. Any change to this behavior is likely to break existing code that relies on variables being evaluated in a certain context. I think the problem lies with whoever told you to use a matrix in the first place. Rather than cbind(), you should have used data.frame().
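As a sketch of that last suggestion, the simulation from the question can build data frames instead of matrices, after which subset() can refer to columns by name. Only cbind() is swapped for data.frame(); seq_along(k) is a small stylistic substitution for 1:length(k).

```r
k <- 1:5
x <- seq(0, 10, length.out = 100)

# Build one data frame per value of k instead of one matrix
dsts <- lapply(seq_along(k), function(i) {
  data.frame(x = x, distri = dchisq(x, k[i]), i = i)
})
dsts <- do.call(rbind, dsts)

# The column name i is now evaluated inside the data frame
first <- subset(dsts, i == 1)
nrow(first)
```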

Convert a List to a Data Frame after a Split

The top post of this question helped me equally divide a vector into an even set of chunks:
Split a vector into chunks in R
My problem now is that I would like to construct data frames out of the output. Here is the problem in R syntax:
d <- rpois(73,5)
solution1 <- split(d, ceiling(seq_along(d)/20))
ERROR <- as.data.frame(solution1)
The error that you should see is "arguments imply differing number of rows." I'm especially confused because I thought that the as.data.frame() function could handle this problem, as evident here:
http://www.r-bloggers.com/converting-a-list-to-a-data-frame-2/
Thanks for all your help!
EDIT 1:
I am close to a solution with this line, however, there are NA values that are being introduced that distort the output that I seek:
ldply(solution1,data.frame)
ldply is from the plyr package
Did you read the ?split help page? Did you notice the unsplit() function? That sounds like exactly what you're trying to do here.
d <- rpois(73,5)
f <- ceiling(seq_along(d)/20) #factor for splitting
solution1 <- split(d, f)
unsplit(solution1 , f)
I'm not sure what you expected your data.frame to look like, but the error message you got was because as.data.frame() was trying to create a new column in your data.frame for each item in solution1. And since each of those vectors in the list has a different number of elements, you cannot make a data.frame from that. A data.frame requires that every column has the same number of rows.
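Depending on the shape you are after, two possible layouts can be sketched as follows. The names long and wide are hypothetical. The long layout keeps one row per value with its chunk label, so no padding is needed. The wide layout pads the short last chunk with NA so that every column has the same number of rows.

```r
d <- rpois(73, 5)
f <- ceiling(seq_along(d) / 20)  # chunk label for every element
solution1 <- split(d, f)

# Long layout: one row per value, no NA padding required
long <- data.frame(chunk = f, value = d)

# Wide layout: one column per chunk, shorter chunks padded with NA
n <- max(lengths(solution1))
wide <- as.data.frame(lapply(solution1, `length<-`, n))

dim(long)
dim(wide)
```

In the wide layout the NA values are only filler for the last chunk, which has 13 elements instead of 20, so summaries should skip them (for example with na.rm = TRUE).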

How should I apply the same formatting to a list of dataframes in R?

Here is what I've done so far. So, that's basically grabbing some tables off the internet using XML, putting them into a list of dataframes and then some mess trying (and failing) to format them in an efficient and consistent way.
I can't work out how to apply the same changes to all of the dataframes. I think I need to use llply, but I can't get it right. Overall I am trying to achieve:
Column names all legitimate R names using make.names, then use the str_replace_all towards the end of the file to strip all non-alpha characters so the names are the same
Next I want to remove all but the first four columns from all of the dataframes
Then I want to add a column with the title for each book. I guess I'll have to do this manually.
Finally, I want to do an rbind to join all of the dataframes together
What's really got me stumped is how to apply the same transformations to each dataframe in the list such as modifying their column names and cutting off rows. Is llply the right tool for the job? How do I use it?
So far the most I've been able to achieve is turning my list of dataframes into a list of vectors with the right names. I believe this is because when I tried using names() it returned the vector of correct names, rather than a dataframe with the correct names. This was my attempt:
tlist <- llply(tabs, function(x) as.data.frame(str_replace_all(make.names(names(x)), "[^[:alpha:]]", "")))
I don't think I'm a million miles away here, but I can't think how to get it to return the full df.
Use this instead:
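library(plyr)     # llply(), rbind.fill()
library(stringr)  # str_replace_all()
# keep the first four columns and clean up their names
f <- function(x) {
  y <- x[, 1:4]
  names(y) <- str_replace_all(make.names(names(y)), "[^[:alpha:]]", "")
  y
}
result <- rbind.fill(llply(tabs, f))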
EDIT: following @baptiste's suggestion, this may be better:
result <- ldply(tabs, f)
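For readers without plyr and stringr, the same pattern can be sketched in base R. Everything below is an assumption for illustration: the toy tables, the book titles used as list names, the helper name clean, and the use of gsub() in place of str_replace_all(). It also covers the step of adding a title column, here taken from the list names rather than typed in manually.

```r
# Toy stand-ins for the scraped tables; list names act as book titles
tabs <- list(
  BookA = data.frame(`Pub. Year` = c(1990, 1995), `Title!` = c("a", "b"),
                     Pages = c(100, 200), `Price ($)` = c(5, 6),
                     Extra = c(0, 0), check.names = FALSE),
  BookB = data.frame(`Pub Year` = 2001, Title = "c",
                     Pages = 300, Price = 7,
                     Notes = "n", check.names = FALSE)
)

# Keep the first four columns and reduce their names to letters only
clean <- function(x) {
  y <- x[, 1:4]
  names(y) <- gsub("[^[:alpha:]]", "", make.names(names(y)))
  y
}

cleaned <- lapply(tabs, clean)

# Add the title of each book as a column, then stack all tables
cleaned <- Map(function(df, title) { df$book <- title; df },
               cleaned, names(cleaned))
result <- do.call(rbind, cleaned)
names(result)
```

Stripping every non-letter can make two column names collide (for example p3 and p4 would both become p), so it is worth checking the cleaned names before binding.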
