How do I export a data frame completely as.character in R? I have digits that need to be treated as text in large data frames, and I'm using write.csv, but even though I imported the digits into R as characters, they are exporting as numbers (not surrounded by "" when viewed in Notepad) and are occasionally rewritten as, e.g., 1e-04 (for a small decimal value). This is for data munging, and I need things to stay as formatted (once formatted). Shouldn't that be possible with some form of as.character or similar?
Make it into a matrix. If there is at least one character column in your data frame, as.matrix() will coerce the rest to character to match, since a matrix can only hold one type of data.
new <- as.matrix(old_data_frame)
If there are no character columns in your old data frame, do:
# flatten to a numeric vector, convert to character, then reshape
new <- matrix(as.character(as.numeric(as.matrix(old_data_frame))),
              ncol = ncol(old_data_frame))
If you use the function
write.table(x, file = "out.csv", quote = TRUE, ...)
anything that is a string will be quoted on output.
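Putting the two pieces together, a minimal sketch (the column names, values, and file name here are made up for illustration):
old_data_frame <- data.frame(station = c("A1", "B2"),
                             code = c("007", "042"),
                             stringsAsFactors = FALSE)
new <- as.matrix(old_data_frame)    # every cell is now character
write.table(new, file = "out.csv", sep = ",", quote = TRUE,
            row.names = FALSE)      # all values are written wrapped in quotes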
When I have my data set in Excel, I set it to numeric format and all data points are numbers. I then convert it to a .txt file and load it into R. Somehow, once it is in R, some data points are read as character and some as numeric. I am unsure how to make sure that all data points are in numeric format when I load them.
I need them all to be numeric because at a later stage I use cTree, which does not work if the data is character.
Thanks
What R function are you using to load the data? If it is read_csv(), you can use the col_types argument to control how each column is read.
For example:
library(readr)
dataframe <- read_csv("file.csv",
                      col_types = cols(a = col_double(), b = col_character()))
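If you are using base R instead, read.table()/read.csv() have a colClasses argument that does the same job. A sketch, assuming a tab-delimited .txt file with a header row (the file name is hypothetical):
df <- read.table("file.txt", header = TRUE, sep = "\t",
                 colClasses = "numeric")  # a single class is recycled to all columns
Forcing colClasses = "numeric" also makes the import fail loudly on any cell that cannot be parsed as a number, which helps you locate the rows causing the character/numeric mix.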
I have tables of weather data from several stations. When I import them separately using read.csv, the fields are factors, integers, and numerics. However, when I try to import one csv file with all of the data combined, the resulting fields in the data frame are all factors. In the combined file the first field has several alphanumeric values, whereas in each individual file there is only one value (the name of the station).
This is a common behaviour of data.frame() from base R, and most of the time the result of read.csv() is stored in a data.frame. As @Duck suggested in the comment section, you can avoid this behaviour by setting the stringsAsFactors argument to FALSE.
read.csv('myfile.csv', stringsAsFactors = FALSE)
You can check the description below on the documentation page of the data.frame function, which you can access with the ?data.frame command.
Character variables passed to data.frame are converted to factor columns unless protected by I() or argument stringsAsFactors is false.
So in your case, this happens in your combined file because R is interpreting all variables as characters. Why? Probably because in one (or some) of your files, some lines of data in the numeric and integer columns are out of format. For example, maybe in a row you have an "x" to represent a missing value. read.csv() uses the entire file to decide which format each column has, so as soon as the function hits this "x" value, it interprets the entire column as character. When this data is passed to data.frame(), the function converts these characters to factors. You said that, in the combined file, you have some alphanumeric values in the first field. Those values are probably the "x"s that are generating your problem.
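As a hypothetical illustration of this coercion (the stations and the stray "x" are invented):
csv_text <- "station,temp\nA,12.5\nB,x\nC,14.1"
# One stray "x" makes read.csv() treat temp as non-numeric; with the pre-R 4.0
# default stringsAsFactors = TRUE it then becomes a factor
d <- read.csv(text = csv_text, stringsAsFactors = TRUE)
str(d)                                       # temp is a factor, not numeric
d$temp <- as.numeric(as.character(d$temp))   # "x" becomes NA, with a warning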
I have a large data.frame d that was read from a .csv file (it is actually a data.table, the result of applying fread to the .csv file). I want to check every column of type character for weird/corrupted characters, meaning the weird sequences of characters that result from corrupted parts of a text file or from using the wrong encoding.
A data.table solution, or some other fast solution, would be best.
This is pseudo-code for a possible solution:
1. Create a vector str_cols with the names of all the character columns of d.
2. For each column j in str_cols, compute a frequency table of the values: tab <- d[, .N, j]. (This step is probably not necessary; it is just used to reduce the dimensions of the object that will be checked, in columns with repetitions.)
3. Check the values of j in the summary table tab.
The crucial step is 3. Is there a function that does that?
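In runnable form, steps 1 and 2 might look like this (a sketch, assuming d is already a data.table):
library(data.table)
str_cols <- names(d)[vapply(d, is.character, logical(1))]  # step 1
for (j in str_cols) {
  tab <- d[, .N, by = j]  # step 2: frequency table of the values of column j
  # step 3 would inspect tab[[j]] here
}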
Edit1: Perhaps some smart regular expression? This is a related non-R question, trying to explicitly list all weird characters. Another possible solution is to find any character outside of an accepted list of characters [a-z 0-9 + punctuation].
If you post some example data it would be easier to give a more definitive answer. You could likely try something like this though.
DT[, lapply(.SD, stringr::str_detect, "[^[:print:]]")]
It will return a data.table of the same size, but any string that contains a character that isn't alphanumeric, punctuation, or space will be replaced with TRUE, and everything else with FALSE. That is my interpretation of your question: you want to detect values that contain these characters.
You can change the behavior by replacing str_detect with whatever base R or stringr function you want, and slightly modifying the regex as needed. For example, you can remove the offending characters with the following code.
DT[, lapply(.SD, stringr::str_replace_all, "[^[:print:]]", "")]
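A self-contained illustration (the table and the embedded control character \x01 are made up):
library(data.table)
library(stringr)
DT <- data.table(id = 1:3, txt = c("ok", "bad\x01value", "fine"))
str_cols <- names(DT)[vapply(DT, is.character, logical(1))]
DT[, lapply(.SD, str_detect, "[^[:print:]]"), .SDcols = str_cols]   # flags row 2
DT[, (str_cols) := lapply(.SD, str_replace_all, "[^[:print:]]", ""),
   .SDcols = str_cols]                                              # clean in place
Restricting .SDcols to the character columns corresponds to the str_cols step in your pseudo-code and avoids pointlessly coercing numeric columns to strings.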
This is probably a basic question, but why does R think my vector, which has a bunch of words in it, contains numbers when I try to use it for column names?
I imported a data set, and it turns out the first row of data holds the column headers I actually want; the column headers that came with the data set are the wrong ones. So I want to replace the column names. I figured this should be easy.
So what I did was I extracted the first row of data into a new object:
names <- data[1,]
Then I deleted the first row of data:
data <- data[-1,]
Then I tried to rename the column headers with the "names" object:
colnames(data) <- names
However, when I do this, instead of changing my column names to the words within the names object, it turns them into a bunch of numbers. I have no idea where these numbers come from.
Thanks
You need to actually show us the data, and the read.csv()/read.table() command you used to import.
If R thinks your numeric column is a string, it sounds like that's because it wrongly includes the column name, i.e. you omitted header=TRUE in your read.csv()/read.table() import.
But show us your actual data and commands used.
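As for where the numbers come from: most likely they are factor integer codes. If the file was imported with the old default stringsAsFactors = TRUE, data[1, ] is a row of factors, and coercing that row to column names uses the factors' underlying codes rather than their labels. A minimal sketch of a fix (the file name is hypothetical):
data <- read.csv("myfile.csv", stringsAsFactors = FALSE)  # keep strings as character
names <- unlist(data[1, ])   # now a plain character vector, not factor codes
data <- data[-1, ]
colnames(data) <- names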
I imported a csv file from excel. All the Revenue columns are importing as string. I want them to be numeric.
I thought it would be as easy as a$Revenue <- as.numeric(a$Revenue), but this coerces NAs into all the cells, wiping out the data. So the column does convert to numeric, but I lose all the data.
Is there another technique?
When you load the data, try setting stringsAsFactors = FALSE in the read.csv call, and then try the conversion again.
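If you still get NAs after that, the strings themselves probably contain formatting such as currency symbols or thousands separators. A hedged sketch (the file name and the characters stripped are assumptions about your data):
a <- read.csv("revenue.csv", stringsAsFactors = FALSE)
a$Revenue <- as.numeric(gsub("[$,]", "", a$Revenue))   # strip "$" and "," first
And if the column has already been read as a factor, convert via as.numeric(as.character(a$Revenue)); calling as.numeric() directly on a factor returns its integer codes rather than the printed values.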