Removing levels in data frame when importing csv data - r

I am importing csv data to R using
data <- read.csv(file="file_name.csv")
The data has 9 columns and 5000 rows, and all values are real numbers. I want to use it as a data frame, but the first column is imported as a factor with levels. I don't want these levels.
Here is a sample of the data in .csv format
Could anyone please help me remove the levels from the first column after it is imported into R?
Here is my attempt:
data$col_1 = as.numeric(as.character(data$col_1))
But it shows a warning:
Warning message:
NAs introduced by coercion

read.csv is basically a wrapper around read.table; turning off stringsAsFactors will work.
data <- read.csv(file="filename", stringsAsFactors=FALSE)
Then I guess that column will be treated as character. You can then convert it to numeric:
data$col <- as.numeric(data$col)
Note: if a column is clean and contains only numbers, read.csv will read it in as numeric. If it is read in as a factor, R detected something that is text or otherwise non-numeric. Pay attention to the warnings and check which records were converted to NA, and why.
For example, I have a csv file.
When I read it in, the id column is treated as character simply because one row contains ohyeah (if that value were empty or NA, R would still treat the column as numeric). I would recommend first subsetting the records that are contaminated to see whether it is a big issue or not.
> subset(data, is.na(as.numeric(id)))
name id
4 dan ohyeah
Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
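The check described above can be sketched end to end. This is a minimal illustration, not the answerer's exact file: the CSV contents (names and ids other than the "dan"/"ohyeah" row) are hypothetical, and the text is inlined with the text= argument so no file is needed.

```r
# Hypothetical CSV contents, inlined so the example runs without a file
csv_text <- "name,id
amy,1
bob,2
cat,3
dan,ohyeah"

data <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Convert once, silencing the coercion warning we already expect
id_num <- suppressWarnings(as.numeric(data$id))

# Rows whose id is present in the file but could not be parsed as a number
bad_rows <- data[is.na(id_num) & !is.na(data$id), ]
print(bad_rows)

# Once the offending rows are understood, store the numeric version
data$id <- id_num
```

Separating "was already missing" from "became NA during conversion" makes it easier to judge whether the contaminated records matter.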

Related

'row.names' is not a character vector of length

I am simply trying to create a dataframe.
I read in data by doing:
>example <- read.csv(choose.files(), header=TRUE, sep=";")
The data contains 2 columns with 8736 rows plus a header.
I then simply want to combine this with the column of a dataframe with the same amount of rows (!) by doing:
>data_frame <- as.data.frame(example$x, example$y, otherdata$z)
It produces the following error
Warning message:
In as.data.frame.numeric(example$x, example$y, otherdata$z) :
'row.names' is not a character vector of length 8736 -- omitting it. Will be an error!
I have never had this problem before. It seems so easy to tackle, but I can't figure it out at the moment.
Overview
As long as nrow(example) equals length(otherdata$z), use cbind.data.frame to combine the columns into one data frame. An advantage of cbind.data.frame() is that there is no need to name the individual columns of example when binding them with otherdata$z.
# create a new data frame that adds the 'z' field from another source
df_example <- cbind.data.frame(example, otherdata$z)
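For context, the warning in the question arises because the second positional argument of as.data.frame() is row.names, so example$y is taken as row names rather than as a column. The sketch below uses small hypothetical stand-ins for example and otherdata and shows two ways to build the combined data frame.

```r
# Hypothetical stand-ins for the asker's objects
example   <- data.frame(x = c(1.5, 2.5, 3.5), y = c(10, 20, 30))
otherdata <- data.frame(z = c(100, 200, 300))

# Both approaches need matching lengths
stopifnot(nrow(example) == length(otherdata$z))

# Option 1: data.frame() treats every named argument as a column
df_named <- data.frame(x = example$x, y = example$y, z = otherdata$z)

# Option 2: cbind.data.frame() keeps all columns of 'example';
# naming the extra argument sets the name of the new column
df_bound <- cbind.data.frame(example, z = otherdata$z)

print(names(df_bound))
```

Naming the bound column (z = ...) avoids ending up with a column literally called "otherdata$z".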

Couldn't convert factor to numeric values for binning operation

I am provided with a dataset and asked to perform binning based on a particular column. The column is a factor, and when I try converting it to numeric I get either NAs from coercion or the factor codes instead of the data in the table.
data$imdbVotes <- as.numeric(as.character(data$imdbVotes))
When I tried this code I got the warning:
Warning message:
NAs introduced by coercion
This is the table provided and I have to perform binning based on IMDB votes.
Hi, nice meeting you outside of Edwisor. What you are doing is perfectly right. The warning means that some values in the column are not numbers, so they become NA.
For example, if you try tail(data, 7) you will see that the value of imdbVotes for the movie Venky is missing. Now we have two options: either get the data for this item, or keep it as NA.
In an ideal scenario when the data is critical, I would extract the data again so that there are no missing values. In this case, I am going to leave it as NA, so it doesn't mess with the calculations.
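A small sketch of the whole path, from factor to bins, may help. The vote counts, the "N/A" placeholder, and the break points below are all hypothetical; only the as.numeric(as.character(...)) conversion comes from the question.

```r
# Hypothetical vote counts stored as a factor; one entry is not a number
votes_raw <- factor(c("1200", "35000", "N/A", "870", "250000"))

# Convert via character so we get the printed values, not the factor codes
votes <- suppressWarnings(as.numeric(as.character(votes_raw)))

# Positions that could not be converted
print(which(is.na(votes)))

# cut() keeps NA as NA, so the missing entry simply falls into no bin
bins <- cut(votes,
            breaks = c(0, 1000, 100000, Inf),      # assumed break points
            labels = c("low", "medium", "high"))
print(table(bins, useNA = "ifany"))
```

Leaving the entry as NA, as suggested above, means it is counted separately instead of being forced into a bin.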

Converting NA's to factor in r

I want to convert all the NA's in one column (and only one column) of my data frame into "non-PA" instead. The class of the column is factor.
In the past I've successfully used:
df$column[is.na(df$column)] <- "non-PA"
But for some reason this time I get this error message:
In `[<-.factor`(`*tmp*`, is.na(management.points$management),
value = c(NA, : invalid factor level, NA generated
I've tried converting the column to characters and various other ways around it but I still get the same error message. What am I doing wrong?
You have to turn the column into a character vector first:
df$column <- as.character(df$column)
df$column[is.na(df$column)] <- "non-PA"
df$column <- factor(df$column)
The error happens because you cannot input a value in a factor if it is not already a level of that factor.
One potential downside (from @docendo's comment) is that this may remove unused factor levels. To keep them, you could just add "non-PA" to the levels instead of transforming to character:
levels(df$column) <- union(levels(df$column), "non-PA")
df$column[is.na(df$column)] <- "non-PA"
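As a quick self-contained check of the level-adding approach, here is a minimal sketch on a hypothetical factor (the values "PA" and the variable name mgmt are made up for illustration). The level that is added must match the assigned string exactly, otherwise the assignment produces NA again.

```r
# Hypothetical factor with missing entries
mgmt <- factor(c("PA", NA, "PA", NA))

# Add the new level first; the spelling must match the value assigned below
levels(mgmt) <- union(levels(mgmt), "non-PA")

# Now the assignment is valid because "non-PA" is an existing level
mgmt[is.na(mgmt)] <- "non-PA"

print(table(mgmt))
```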

Change data frame with factors to a big matrix R

I have a big data frame (22k rows, 400 columns) which is generated using read.csv from a csv file. It appears that every column is a factor and all the row values are the levels of this factor.
I now want to do some analysis (like PCA), but I can't work with the data unless it is a numeric matrix, and even when I convert it with as.matrix, all I get is
> prcomp(as.matrix(my_data))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Is there a way of transforming this data frame with factors to a simple big matrix?
I am new in R so forgive all the (maybe terrible) mistakes.
Thanks
You can do it this way:
df<-data.frame(a=as.factor(c(1,2,3)), b=as.factor(c(2,3,4)))
m<-apply(apply(df, 1, as.character), 1, as.numeric)
apply calls a function on each row or column of the given data.frame. Do not skip the conversion to character first: calling as.numeric directly on a factor returns its internal integer codes, not the values you see.
To drop a leading empty row (only if your result has one) and add column names, do this:
m<-m[-1,] # removes the first 'empty' row
colnames(m)<-c("a", "b") # replace the right hand side with your desired column names, e.g. the first row of your data.frame
One more tip: you probably read the data.frame from a file. If you set the parameter header=TRUE, the first row is used as the column names of the data.frame instead of being read as data.
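An alternative sketch, not the answerer's method: converting column by column with sapply() gives a numeric matrix that keeps the column names, so no renaming step is needed. The data frame is the same toy example as above; the helper name to_num is hypothetical.

```r
df <- data.frame(a = factor(c(1, 2, 3)), b = factor(c(2, 3, 4)))

# Pitfall: as.numeric() on a factor returns the internal codes
codes <- as.numeric(df$b)

# Hypothetical helper: go through character to recover the printed values
to_num <- function(col) as.numeric(as.character(col))

# sapply() over the columns returns a numeric matrix with column names a, b
m <- sapply(df, to_num)

print(codes)
print(m)
```

If a column contains entries that are not numbers, this conversion turns them into NA with a warning, which is worth checking before running something like prcomp.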

Subtracting Columns in R

I have two columns I am trying to subtract and put into a new one, but one of them contains values that read '#NULL!' after converting over from SPSS and Excel, so R reads it as a factor and will not let me subtract. What is the easiest way to fix it, knowing I have 19,000+ rows of data?
While reading the dataset with read.table/read.csv, we can use the na.strings argument to list the values that should be read as NA (missing). So, for your dataset it would be
dat <- read.table('yourfile.txt', na.strings=c("#NULL!", "-99999", "-88888"),
header=TRUE, stringsAsFactors=FALSE,
comment.char="") # read.table treats '#' as a comment character by default
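To see the effect on the subtraction itself, here is a minimal sketch with hypothetical column names (before, after) and inlined data. It uses read.csv, which does not treat '#' as a comment character by default, so "#NULL!" reaches na.strings intact.

```r
# Hypothetical data with Excel-style '#NULL!' placeholders
csv_text <- "before,after
10,7
#NULL!,5
8,#NULL!"

dat <- read.csv(text = csv_text, na.strings = "#NULL!",
                stringsAsFactors = FALSE)

# Both columns are now numeric, so subtraction works; NA propagates
dat$diff <- dat$before - dat$after
print(dat)
```

Rows where either value was '#NULL!' end up with NA in the new column, and the remaining rows subtract normally.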
