I have two columns I am trying to subtract and put into a new one, but one of them contains values that read '#NULL!' (left over from an SPSS/Excel export), so R reads the column as a factor and will not let me subtract. What is the easiest way to fix it, knowing I have 19,000+ rows of data?
While reading the dataset with read.table/read.csv, you can specify the na.strings argument for the values that need to be treated as NA (missing). So for your dataset it would be
dat <- read.table('yourfile.txt', na.strings=c("#NULL!", "-99999", "-88888"),
header=TRUE, stringsAsFactors=FALSE)
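If the data is already loaded and the columns came in as factors, you can also convert them in place without re-reading the file. A minimal sketch with made-up data (`dat`, `col1`, and `col2` are placeholder names, not from the question):

```r
# Toy data standing in for the imported columns; '#NULL!' marks missing values
dat <- data.frame(col1 = factor(c("5", "#NULL!", "3")),
                  col2 = factor(c("1", "2", "#NULL!")))

# as.character() first, so we get the labels rather than the internal factor
# codes; "#NULL!" then becomes NA under as.numeric (with a coercion warning)
dat$col1 <- suppressWarnings(as.numeric(as.character(dat$col1)))
dat$col2 <- suppressWarnings(as.numeric(as.character(dat$col2)))

# Subtraction now works; the result is NA wherever either input was missing
dat$diff <- dat$col1 - dat$col2
```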
Related
I am simply trying to create a dataframe.
I read in data by doing:
>example <- read.csv(choose.files(), header=TRUE, sep=";")
The data contains 2 columns with 8736 rows plus a header.
I then simply want to combine this with the column of a dataframe with the same amount of rows (!) by doing:
>data_frame <- as.data.frame(example$x, example$y, otherdata$z)
It produces the following error
Warning message:
In as.data.frame.numeric(example$x, example$y, otherdata$z) :
'row.names' is not a character vector of length 8736 -- omitting it. Will be an error!
I have never had this problem before. It seems so easy to tackle, but I can't figure it out at the moment.
Overview
As long as nrow(example) equals length(otherdata$z), use cbind.data.frame to combine the columns into one data frame. An advantage of cbind.data.frame() is that there is no need to call the individual columns within example when binding them with otherdata$z.
# create a new data frame that adds the 'z' field from another source
df_example <- cbind.data.frame(example, otherdata$z)
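For example, with made-up data matching the shapes in the question (naming the new column at the same time keeps it from being called `otherdata$z` in the result):

```r
# Stand-in data: two columns in 'example', one extra column elsewhere
example   <- data.frame(x = 1:3, y = 4:6)
otherdata <- data.frame(z = 7:9)

# Bind the extra column on, giving it a clean name
df_example <- cbind.data.frame(example, z = otherdata$z)
# df_example now has columns x, y, z
```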
I have a big data frame (22k rows, 400 columns) which is generated using read.csv from a csv file. It appears that every column is a factor and all the row values are the levels of this factor.
I now want to do some analysis (like PCA), which needs a numeric matrix, but even when I convert it with as.matrix, all I get is
> prcomp(as.matrix(my_data))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Is there a way of transforming this data frame with factors to a simple big matrix?
I am new to R, so forgive any (maybe terrible) mistakes.
Thanks
You can do it that way:
df<-data.frame(a=as.factor(c(1,2,3)), b=as.factor(c(2,3,4)))
m<-apply(apply(df, 1, as.character), 1, as.numeric)
apply calls the given function over the data.frame. It is important not to skip the conversion to character first, because otherwise you get the internal numeric codes of the factor levels rather than the values themselves.
To add column names, do this:
m<-m[-1,] # removes the first 'empty' row
colnames(m)<-c("a", "b") # replace the right hand side with your desired column names, e.g. the first row of your data.frame
One more tip: you probably read the data.frame from a file. If you set the parameter header=TRUE, the first row will not end up in the data, and the column names of the data.frame will be correct.
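As an aside, a minimal sketch of an arguably simpler route: sapply iterates over the columns of a data frame and returns a matrix, carrying the column names over automatically, so no row-dropping or manual colnames step is needed.

```r
# Same toy data as above: every column is a factor
df <- data.frame(a = as.factor(c(1, 2, 3)), b = as.factor(c(2, 3, 4)))

# Convert each column: factor -> character -> numeric, collected into a matrix
m <- sapply(df, function(x) as.numeric(as.character(x)))
```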
This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 7 years ago.
I am working with a csv data set with around 1 million records. I need to perform two operations on the data set:
Prepare a dataset that does not contain the rows that have any missing (blank) values.
Prepare another dataset that replaces empty values with "unknown".
I have tried to use Excel for this, but it is taking too much time. Could someone please show me how it can be done in R?
To get complete cases, use this:
complete_df <- df[complete.cases(df),]
complete.cases returns a logical vector that tells you which rows of dataframe df are complete, and you can use that to subset the data.
To replace the NAs, you can use this:
new_df <- df
new_df[is.na(new_df)] <- 'Unknown'
But this can change the datatypes of the columns with missing data. For example, if you have a column of numeric data and you fill the missing values with 'Unknown', that whole column becomes a character variable, so be aware of this.
I have a simple problem. I have a data frame with 121 columns. columns 9:121 need to be numeric, but when imported into R, they are a mixture of numeric and integers and factors. Columns 1:8 need to remain characters.
I’ve seen some people use loops, and others use apply(). What do you think is the most elegant way of doing this?
Thanks very much,
Paul M
Try the following... The apply function allows you to loop over the rows, the columns, or both, of a dataframe and apply any function. To make sure all your columns from 9:121 are numeric, you can do the following:
table[,9:121] <- apply(table[,9:121],2, function(x) as.numeric(as.character(x)))
table[,1:8] <- apply(table[,1:8], 2, as.character)
Where table is the dataframe you read into R.
Briefly: in the apply call I specify the table to loop over (here, the subset of your table we want to change), then the number 2 to indicate columns, and finally the function to apply (as.numeric or as.character). The assignment operator then replaces the old values in your table with the new, correctly typed ones.
EDIT: I just changed the first line, because I recalled that if you convert a factor directly to a number, you get the integer code of the factor level, not the number you think you are getting. Factors first need to be converted to characters and then to numbers, which we can do by wrapping as.character inside as.numeric.
When you read in the table, use stringsAsFactors=FALSE and there will not be any factors in the first place.
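The trap the edit describes is easy to reproduce with a tiny example:

```r
# A factor whose labels happen to be numbers
f <- factor(c("10", "20", "30"))

wrong <- as.numeric(f)                # 1 2 3  -- the internal level codes
right <- as.numeric(as.character(f))  # 10 20 30 -- the actual values
```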
I have a .csv file which I have read into R as a dataframe (say df).
The first column is date in mm/dd/yyyy format. The second column is a double number. What I want to do is to create a new dataframe like:
df2<-data.frame(date=c(df[10,1],df[15,2]),num=c(111,222))
When I try to do this I get very messy df2. Most probably I am doing it wrong because I do not understand the data frame concept.
Whenever I try to do df[10,1], the output is the 10th row and 1st column of df, including all the levels of column 1.
You can control how R will interpret the classes of data being read in by specifying a vector of column classes as an argument to read.table with colClasses. Otherwise R will use type.convert which will convert a character vector in a "logical" fashion, according to R's definition of logical. That obviously has some potential quirks to it if you aren't familiar with them.
You can also prevent R from creating a factor by specifying stringsAsFactors = FALSE as an argument in read.table, this is generally an easier option than specifying all of the colClasses.
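A minimal sketch of the colClasses approach (the column layout and names here are invented; read.csv can take the input as a text= string, which stands in for your file):

```r
# Inline stand-in for a three-column csv file
csv_text <- "date,value,label
01/15/2010,3.14,a
02/20/2010,2.72,b"

# colClasses fixes each column's type up front, so nothing becomes a factor
df <- read.csv(text = csv_text,
               colClasses = c("character", "numeric", "character"))
```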
You can format the date with strptime(). Taking all of this into consideration, I would recommend reading your data into R without turning character data into factors and then use strptime to format.
df <- read.csv("myFile.csv", stringsAsFactors = FALSE)
#Convert time to proper time format
df$time <- strptime(df$time, "%m/%d/%Y")
If you don't want to type out stringsAsFactors=FALSE each time you read in or construct a data frame, you can specify at the outset
options(stringsAsFactors=FALSE)