I have a big data frame (22k rows, 400 columns) generated with read.csv from a CSV file. Every column appears to be a factor, with all the values in that column as its levels.
I now want to do some analysis (like PCA), but that requires a matrix, and even when I convert it with as.matrix, all I get is
> prcomp(as.matrix(my_data))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Is there a way of transforming this data frame with factors to a simple big matrix?
I am new to R, so forgive all the (maybe terrible) mistakes.
Thanks
You can do it this way:
df<-data.frame(a=as.factor(c(1,2,3)), b=as.factor(c(2,3,4)))
m<-apply(apply(df, 1, as.character), 1, as.numeric)
apply calls a function on each row (margin 1) or column (margin 2) of the given data.frame. Do not skip the conversion to character first: calling as.numeric directly on a factor returns the factor's internal integer codes, not the values you see printed.
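To see why the character step matters, here is a minimal sketch on a made-up data frame (the values are arbitrary). It also shows a column-wise variant with sapply, which is an alternative to the nested apply call and keeps the column names:

```r
# Made-up data: factors whose labels are numbers
df <- data.frame(a = factor(c(10, 20, 30)), b = factor(c(2, 3, 4)))

# as.numeric() on a factor returns the internal level codes
codes <- as.numeric(df$a)

# Going through character first recovers the printed values
values <- as.numeric(as.character(df$a))

# Column-wise conversion; sapply() returns a numeric matrix with column names
m <- sapply(df, function(col) as.numeric(as.character(col)))

print(codes)
print(values)
print(m)
```

Here codes holds the level positions while values holds the original numbers, which is the difference the answer warns about.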
To add column names, do this:
m<-m[-1,] # removes the first 'empty' row (only needed if the header line was read in as a data row)
colnames(m)<-c("a", "b") # replace the right hand side with your desired column names, e.g. the first row of your data.frame
One more tip: you probably read the data.frame from a file. If you set the parameter header=TRUE, the first line of the file is used for the column names instead of being read as a data row, so the column names of the data.frame will be correct.
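A small sketch of the header effect, using made-up CSV text in place of the real file (the text argument of read.csv lets the example run without a file on disk):

```r
# Made-up CSV content standing in for the real file
csv_text <- "a,b\n1,2\n2,3\n3,4"

# header = FALSE: the header line becomes a data row, so no column is numeric
wrong <- read.csv(text = csv_text, header = FALSE)

# header = TRUE: the first line supplies the column names
right <- read.csv(text = csv_text, header = TRUE)

# With numeric columns, as.matrix() gives a numeric matrix directly
m <- as.matrix(right)

str(wrong)
str(right)
```

If the columns are still not numeric with header=TRUE, some cells probably contain non-numeric text, and the character conversion shown earlier is still needed.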
Related
I would like to convert a dataframe into a matrix in R. The dataframe has more than 30 different variables with different types, some are numeric, some factors and some characters. When converting it into a matrix, I would like to keep all types exactly the same as in the dataframe.
I tried converting it with as.matrix(), see code below (this is just a simple example dataframe with only two variables).
test_df <- data.frame(a = c(1:10), b = c(letters[1:10]))
test_df <- as.matrix(test_df)
typeof(test_df[,1])
typeof(test_df[,2])
Column 'a' in the example has type integer while column 'b' has type factor. I expect each column to keep its type when converting a dataframe into a matrix. However, when I convert it into a matrix, all variables are being converted into type character.
No, you can't do that. In R, a matrix has to be all one type: it is stored as a vector of that type together with an attribute saying how many rows and columns it has.
For efficiency, you're right that matrices are a lot faster than dataframes. Maybe you can split your dataframe into one numeric one and one character one. Most other types can be coerced to those types without much loss.
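One way the split could look, as a sketch on the question's example data (the names is_num, num_mat and chr_mat are made up for illustration):

```r
test_df <- data.frame(a = 1:10, b = letters[1:10], stringsAsFactors = FALSE)

# Logical index marking the numeric columns
is_num <- sapply(test_df, is.numeric)

# One matrix per type; drop = FALSE keeps a single column as a data frame
num_mat <- as.matrix(test_df[, is_num, drop = FALSE])
chr_mat <- as.matrix(test_df[, !is_num, drop = FALSE])

print(is.numeric(num_mat))
print(is.character(chr_mat))
```

Both matrices keep the original row order, so row i of num_mat and row i of chr_mat still describe the same observation.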
I am reading a txt file into R and have several columns that should be numeric, but everything is interpreted as character. Now I would like to convert only a few columns within that matrix (I converted it to a matrix in a first step) to numeric. So far I have only managed to extract single columns, and that way I lost the matrix structure:
data <- as.numeric(data[,1])
Now, I've found similar questions here but none of the answers worked in the way that it conserved the type matrix.
For example, I've tried to store the affected columns in a vector and then perform the action on that vector with lapply
cols<- c("a","b","d")
data<- as.matrix(lapply(cols, as.numeric))
But this gives me only empty fields, and of course it only shows the columns I selected and not the rest of the matrix. I also got the error message
NAs introduced by coercion
As a last step I tried the following, but I ended up having a list and not a matrix anymore
data[1:25] <- as.matrix(lapply(data[1:25], as.numeric))
What I would like to have, is a matrix where several columns (not just 1:25 as in my example above but rather, say, columns 1,3 and 6) are converted to numeric and the rest stays the same.
Does someone have an answer and maybe even an explanation for why the things I've tried didn't work?
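A possible explanation, sketched on a made-up character matrix: a matrix holds a single type, so numbers assigned back into a character matrix become text again, and lapply(cols, as.numeric) converts the strings "a", "b", "d" themselves, which is where the NAs come from. Under the assumption that a data frame is acceptable instead of a matrix, selected columns can be converted like this (the column choice in cols is hypothetical):

```r
# Made-up character matrix standing in for the data read from the txt file
data <- matrix(c("1", "2", "x", "y", "3.5", "4.5"), nrow = 2,
               dimnames = list(NULL, c("a", "b", "c")))

# Assigning numbers into a character matrix coerces them back to text
data[, "a"] <- as.numeric(data[, "a"])
print(typeof(data))

# A data frame can hold a different type in each column
df <- as.data.frame(data, stringsAsFactors = FALSE)
cols <- c("a", "c")                       # hypothetical columns to convert
df[cols] <- lapply(df[cols], as.numeric)

str(df)
```

Indexing with df[cols] on the left-hand side replaces only those columns, so the rest of the data frame stays as it was.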
I have two columns I am trying to subtract, putting the result into a new one, but one of them contains values that read '#NULL!' after converting over from SPSS and Excel, so R reads it as a factor and will not let me subtract. What is the easiest way to fix it, knowing I have 19,000+ rows of data?
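While reading the dataset using read.table/read.csv, we can specify the na.strings argument for those values that need to be transformed to 'NA' or missing values. Note that read.table treats '#' as a comment character by default (read.csv does not), so it has to be switched off for '#NULL!' to be read as a value. So, in your dataset it would be
dat <- read.table('yourfile.txt', na.strings=c("#NULL!", "-99999", "-88888"),
                  header=TRUE, stringsAsFactors=FALSE, comment.char="")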
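A short sketch of the effect, with made-up CSV text in place of the real file and hypothetical column names x and y. Once '#NULL!' is read as NA, both columns come in as numbers and the subtraction works:

```r
# Made-up data; "#NULL!" marks missing values as in the question
csv_text <- "x,y\n5,2\n#NULL!,3\n9,#NULL!"

# read.csv does not treat '#' as a comment, so only na.strings is needed
dat <- read.csv(text = csv_text, na.strings = "#NULL!")

# Rows with a missing value on either side give NA in the result
dat$diff <- dat$x - dat$y

print(dat)
```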
I'm trying to convert a large list (220559 elements) into a data frame. Each element is either chr (RT) or chr(0)
I tried:
data.frame(t(sapply(my.list, c)))
I got the data frame, but it turned out to be one observation with 220559 variables instead of one variable with 220559 observations.
Is there an easy way to switch the observations with the variables? Or do I have to create the data frame differently? I'm new to R and really looking forward to your help.
So you have a giant list where each element is either the character "RT" or an empty character vector (character(0)). And you want to turn this into a data frame with one row and one column for each item in the list (220559 columns).
The problem is that data.frames like all columns to have the same number of observations (rows). And length("RT")==1 while length(character(0))==0. So you can either drop those columns, or convert those values to NA. I'm going to assume the latter for my example.
# "large" list
xx <- sample(list(character(), "RT"), 1000, replace=TRUE)
# make into data.frame
df <- data.frame(lapply(xx, function(x) if (length(x)==0) NA else x))
# add nicer names
names(df) <- paste0("V", seq_along(df))
That's it. Normally to turn a list into a data.frame you just call data.frame(). It was just a bit trickier because of your zero-length vectors.
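If the goal is instead the layout described in the question (one variable with one observation per list element), a variation along these lines might work. This is a sketch on a made-up four-element list, and the column name value is hypothetical:

```r
# Made-up small list with the same two kinds of elements
xx <- list("RT", character(0), "RT", character(0))

# Replace zero-length elements with NA and flatten to one character vector
vals <- vapply(xx, function(x) if (length(x) == 0) NA_character_ else x,
               character(1))

# One column, one row per list element
df_long <- data.frame(value = vals, stringsAsFactors = FALSE)

print(dim(df_long))
```

vapply is used here rather than sapply so that the result is guaranteed to be a character vector with one entry per list element.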
I am having trouble turning my data.frame into a matrix format. Because I wanted to change my data.frame with mostly factor variables into a numeric matrix, I used the following code
UN2010frame <- data.matrix(lapply(UN2010, as.numeric))
However, when I checked the mode of UN2010frame, it still showed up as a list. Because the code I want to run (Ordrating) does not accept data in a list format, I used UN2010matrix <- unlist(UN2010frame) to unlist my matrix. When I did this, my first row (which was formerly a row with column names) turned into NAs. This was a problem for me because when I tried to run an ordinal IRT model using this data set, I got the following error message.
> Error in 1:nrow(Y) : argument of length 0
I think it is because all the values in my first row are now gone.
If you could help me on any front, it would be deeply appreciated.
Thank you very much!
Haillie
First, the correct use of data.matrix is :
data.matrix(UN2010)
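as it converts to numeric automatically. The lapply in your code is the first source of the error you get: you pass a list to data.matrix, not a dataframe. For anything that is not a dataframe, data.matrix falls back to as.matrix, which for a list returns a list with a dim attribute rather than a numeric matrix. That is why the mode is still list.
Second, unlist returns a vector, not a matrix. So pretty sure you won't find a "first row with NA", as a vector has no rows. It also means nrow() returns NULL on it, which would explain the "argument of length 0" error from 1:nrow(Y).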
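The two points above can be checked on a small made-up data frame (toy and its columns are hypothetical, not the asker's data):

```r
toy <- data.frame(f = factor(c("lo", "hi", "lo")), n = c(1.5, 2.5, 3.5))

# A list goes in: the result still has mode "list"
as_list <- data.matrix(lapply(toy, as.numeric))
print(mode(as_list))

# A data frame goes in: the result is a numeric matrix
as_mat <- data.matrix(toy)
print(mode(as_mat))
print(dim(as_mat))

# unlist() flattens to a plain vector, which has no dim and no rows
flat <- unlist(lapply(toy, as.numeric))
print(dim(flat))
print(nrow(flat))
```

Since nrow(flat) is NULL, an expression such as 1:nrow(flat) fails with the "argument of length 0" message quoted in the question.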
You probably have a character column somewhere. Converting this to numeric gives NA. If you don't want this, then exclude them from the further analysis. One possibility is to use colwise() from the plyr package to convert only the factors:
colwise(as.numeric,is.factor)(UN2010)
This returns a dataframe with only the factors, which can easily be converted by data.matrix() or as.matrix(). Alternatively, you can use the base solution:
id <- sapply(UN2010,is.character)
sapply(UN2010[!id],as.numeric)
which will return a matrix with all non-character columns converted to numeric. If you really want to keep the dataframe with all original columns, you can do:
UN2010frame <- UN2010
UN2010frame[!id] <- lapply(UN2010[!id],as.numeric)
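A compact run of the base approach on a made-up four-row data frame (toy, its columns and toy_frame are hypothetical stand-ins for UN2010):

```r
toy <- data.frame(
  grade = factor(c("low", "high", "mid", "low")),
  label = c("w", "x", "y", "z"),
  score = c(1.5, 2.5, 3.5, 4.5),
  stringsAsFactors = FALSE
)

# TRUE only for the character column
id <- sapply(toy, is.character)

# Numeric matrix built from the non-character columns
num_mat <- sapply(toy[!id], as.numeric)

# Same conversion, but keeping every original column in a data frame
toy_frame <- toy
toy_frame[!id] <- lapply(toy[!id], as.numeric)

print(num_mat)
str(toy_frame)
```

Note that the factor codes follow the level order, which is alphabetical by default (high, low, mid here), so for ordinal data the levels may need to be set explicitly before converting.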
Toy example code :
UN2010 <- data.frame(
  F1 = factor(rep(letters[1:3], 10)),
  F2 = factor(rep(letters[5:10], 5)),
  Char = rep(letters[11:16], each=5),
  Num = 1:30,
  stringsAsFactors=FALSE
)
Try as.data.frame instead of data.matrix.