I am reading a CSV file into R and trying to take the log of the data. The CSV file has columns of data, with the first row containing text headers and the rest numeric data.
data<-read.csv("rawdata.csv",header=T)
trans<-log(csv2)
I get the following error when I do this:
Error in Math.data.frame(list(Revenue = c(18766L, 20197L, 20777L,
23410L, : non-numeric variable in data frame: Costs
EDIT: Here is the output of str(data), which should have been in the question body:
'data.frame': 167 obs. of 3 variables:
$ X: int 18766 20197 20777 23410 23434 22100 22337 21511 22683 23151 ...
$ Y: Factor w/ 163 levels "1,452.70","1,469.00",..: 22 9 55 109 158 82 131 112 119 137 ...
$ Z: num 564 608 636 790 843 ...
How do I correct this?
Tada! Y is a factor - big problem. The commas shouldn't be in there.
Also, your original question has some anomalies: data is the loaded data.frame, yet the transformation is applied to csv2. Did you rename the columns? If so, you've not given a full summary of the steps involved. Anyway, the issue is that you have commas in your second column.
EDIT: removed speculation about structure given that it has now been offered.
Data frames are lists, so lapply will loop over the columns and return the result of applying the math function to each of them.
If the column is a factor (and here str(data) would tell you) then you could take the possibly inefficient approach of converting all the columns as if they were factors:
Costs_logged <- lapply(data, function(x) log(as.numeric(as.character(x))) )
Costs_logged
(See the FAQ about factor conversion to numeric.)
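To see why the as.character() step matters: calling as.numeric() directly on a factor returns the internal level codes, not the labels. A minimal illustration:
f <- factor(c("10", "20", "30"))
as.numeric(f)                # [1] 1 2 3  (the underlying level codes)
as.numeric(as.character(f))  # [1] 10 20 30  (the intended values)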
EDIT2: If you want to convert the factor variable with commas in the labels use this method:
data$Y <- as.numeric( gsub("\\,", "", as.character(data$Y) ) )
The earlier version of this only had a single backslash, but since both regex and R use backslashes as escape characters, special regex characters (see ?regex for a listing) need to be doubly escaped. (A comma is not actually special in regex, so plain "," would also work here.)
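As a quick sanity check on a single value of the sort shown in your str() output:
as.numeric( gsub("\\,", "", as.character("1,452.70") ) )
# [1] 1452.7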
Can you give us the first few values for the variable that is giving you trouble? If the "Costs" variable is the problem (which is what it looks like from your example), execute something like this:
data <- read.csv("rawdata.csv",header=T)
data[1:5, "Costs"]
It sounds as though you have a column of values in the csv file -- column Y -- that has commas in the numbers. That is, it sounds like your csv file looks like this:
X,Y,Z
"18766","1,452.70","564"
"20197","1,469.00","608"
or
X,Y,Z
18766,"1,452.70",564
20197,"1,469.00",608
or something similar. If this is the case, the problem is that column Y can't be read easily by R with a comma in it (even though it makes it easier for us humans to read). You need to get rid of those commas; that is, make your data file look like this:
X,Y,Z
18766,1452.70,564
20197,1469.00,608
(you can leave the quotes in -- just get rid of the commas in the numbers themselves).
There are a number of ways to do this. If you exported your data from Excel, format that column differently before exporting. Alternatively, open the CSV in Excel, save it as a tab-delimited file, open that file in your favorite text editor, and find-and-delete the commas ("find and replace with nothing").
Then try to pull it back into R with your original command.
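Alternatively (a sketch, assuming Y is the comma-formatted column), you can strip the commas in R itself after reading, along the lines of the gsub() approach in another answer, and avoid touching the file at all:
data <- read.csv("rawdata.csv", header=T)
data$Y <- as.numeric(gsub(",", "", as.character(data$Y)))
trans <- log(data)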
Clearly the columns are not all numeric, so just ensure that they are. You can do this by forcing the class of every column when read in:
data <- read.csv("rawdata.csv", colClasses = "numeric")
(read.csv is just a wrapper on read.table, and header = TRUE by default)
That will ensure all columns are of class numeric if that is in fact possible.
If they really are not numeric columns, exclude the ones you don't want to transform, or just work on the columns individually:
x <- data.frame(x = 1:10, y = runif(10, 2, 10), z = letters[1:10])
colClasses can be used to ignore columns by specifying "NULL" if that makes things simpler.
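For example, assuming the file has three columns and you want to drop the third entirely:
data <- read.csv("rawdata.csv", colClasses = c("numeric", "numeric", "NULL"))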
These are equivalent since "x" and "y" are the first 2 columns:
log(x[ , 1:2])
log(x[ , c("x", "y")])
Individually:
log(x$x)
log(x$y)
It's always important to check assumptions about the data read from external sources. Basic checks like summary(x), head(x) and str(x) will show you what the data actually are.
I asked a question about this a few months back, and I thought the answer had solved my problem, but I ran into the problem again and the solution didn't work for me.
I'm importing a CSV:
orders <- read.csv("<file_location>", sep=",", header=T, check.names = FALSE)
Here's the structure of the dataframe:
str(orders)
'data.frame': 3331575 obs. of 2 variables:
$ OrderID : num -2034590217 -2034590216 -2031892773 -2031892767 -2021008573 ...
$ OrderDate: Factor w/ 402 levels "2010-10-01","2010-10-04",..: 263 263 269 268 301 300 300 300 300 300 ...
If I run the length command on the first column, OrderID, I get this:
length(orders$OrderID)
[1] 0
If I run the length on OrderDate, it returns correctly:
length(orders$OrderDate)
[1] 3331575
This is a copy/paste of the head of the CSV.
OrderID,OrderDate
-2034590217,2011-10-14
-2034590216,2011-10-14
-2031892773,2011-10-24
-2031892767,2011-10-21
-2021008573,2011-12-08
-2021008572,2011-12-07
-2021008571,2011-12-07
-2021008570,2011-12-07
-2021008569,2011-12-07
Now, if I re-run the read.csv, but take out the check.names option, the first column of the dataframe now has an X. at the start of the name.
orders2 <- read.csv("<file_location>", sep=",", header=T)
str(orders2)
'data.frame': 3331575 obs. of 2 variables:
$ X.OrderID: num -2034590217 -2034590216 -2031892773 -2031892767 -2021008573 ...
$ OrderDate: Factor w/ 402 levels "2010-10-01","2010-10-04",..: 263 263 269 268 301 300 300 300 300 300 ...
length(orders2$X.OrderID)
[1] 3331575
This works correctly.
My question is: why does R add an X. to the beginning of the first column name? As you can see from the CSV file, there are no special characters, so it should be a simple load. Adding check.names = FALSE, while it does import the name from the CSV verbatim, causes the data not to load correctly for me to perform analysis on.
What can I do to fix this?
Side note: I realize this is a minor issue; I'm just frustrated that I think I am loading the file correctly, yet I am not getting the result I expected. I could rename the column using colnames(orders)[1] <- "OrderID", but I still want to know why it doesn't load correctly.
read.csv() is a wrapper around the more general read.table() function. That latter function has argument check.names which is documented as:
check.names: logical. If ‘TRUE’ then the names of the variables in the
data frame are checked to ensure that they are syntactically
valid variable names. If necessary they are adjusted (by
‘make.names’) so that they are, and also to ensure that there
are no duplicates.
If your header contains labels that are not syntactically valid then make.names() will replace them with a valid name, based upon the invalid name, removing invalid characters and possibly prepending X:
R> make.names("$Foo")
[1] "X.Foo"
This is documented in ?make.names:
Details:
A syntactically valid name consists of letters, numbers and the
dot or underline characters and starts with a letter or the dot
not followed by a number. Names such as ‘".2way"’ are not valid,
and neither are the reserved words.
The definition of a _letter_ depends on the current locale, but
only ASCII digits are considered to be digits.
The character ‘"X"’ is prepended if necessary. All invalid
characters are translated to ‘"."’. A missing value is translated
to ‘"NA"’. Names which match R keywords have a dot appended to
them. Duplicated values are altered by ‘make.unique’.
The behaviour you are seeing is entirely consistent with the documented way read.table() loads in your data. That suggests you have syntactically invalid labels in the header row of your CSV file. Note the point above from ?make.names that what counts as a letter depends on the locale of your system: the CSV file might include a character that your text editor displays as valid, but if R is not running in the same locale, that character may not be valid there.
I would look at the CSV file and identify any non-ASCII characters in the header line; there may also be non-visible characters (or escape sequences, \t?) in the header row. A lot can happen between reading in the file with the invalid names and displaying it in the console, which might mask the invalid characters, so don't take the fact that nothing looks wrong without check.names as an indication that the file is OK.
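One way to inspect the raw header line for such characters (substituting your actual file path):
hdr <- readLines("<file_location>", n = 1)
charToRaw(hdr)
# a UTF-8 byte-order mark, one common culprit, shows up as the leading bytes ef bb bf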
Posting the output of sessionInfo() would also be useful.
I just came across this problem and it had a simple cause. I had labels that began with a number, and R was adding an X in front of them all. I think R gets confused by a number at the start of a header and prepends a letter to distinguish names from values.
So, "3_in" became "X3_in" etc...
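You can reproduce this directly with make.names():
R> make.names("3_in")
[1] "X3_in"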
I solved it by switching the label to "in_3".
I hope this helps someone.
When the column names are not syntactically valid, R puts an "X" at the start of the column name during import. This usually happens when a column name starts with a number or some special character. Setting check.names = FALSE prevents this: there will be no "X".
However, some functions may not work if the column names start with numbers or other special characters; rbind.fill from the plyr package is one example.
So after applying such a function (with the "corrected" colnames), I use this simple function to get rid of the "X":
destroyX = function(es) {
  f = es
  for (col in c(1:ncol(f))) { # for each column in the dataframe
    if (startsWith(colnames(f)[col], "X")) { # if the name starts with 'X'...
      colnames(f)[col] <- substr(colnames(f)[col], 2, nchar(colnames(f)[col])) # ...get rid of it
    }
  }
  assign(deparse(substitute(es)), f, inherits = TRUE) # assign corrected data to the original name
}
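A usage sketch with a hypothetical data frame (note that any column whose name legitimately begins with "X" will be stripped too):
df <- data.frame(X3_in = 1:3, ok = 4:6)  # hypothetical example data
destroyX(df)
colnames(df)
# [1] "3_in" "ok"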
I ran into a similar problem and wanted to share the following lines of code to correct the column names. Certainly not perfect, since cleaner programming beforehand would be better, but maybe helpful to someone as a quick-and-dirty starting point. (I would have liked to add this as a comment to Ryan's question/Gavin's answer, but my reputation is not high enough, so I had to post an additional answer; sorry.)
In my case, several steps of writing and reading data produced one or more columns named "X", "X.1", ..., with the actual content in the X column and row numbers in the X.1, ... columns. The content of the X column should be used as row names, and the other X.1, ... columns should be deleted.
Correct_Colnames <- function(df) {
delete.columns <- grep("(^X$)|(^X\\.)(\\d+)($)", colnames(df), perl=T)
if (length(delete.columns) > 0) {
row.names(df) <- as.character(df[, grep("^X$", colnames(df))])
#other data types might apply than character or
#introduction of a new separate column might be suitable
df <- df[,-delete.columns]
colnames(df) <- gsub("^X", "", colnames(df))
#X might be replaced by different characters, instead of being deleted
}
return(df)
}
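A usage sketch, with a hypothetical frame shaped like the result of write.csv(..., row.names = TRUE) re-read with the default check.names = TRUE:
df <- data.frame(X = c("r1", "r2"), X.1 = 1:2, value = 3:4, other = 5:6)
Correct_Colnames(df)
#    value other
# r1     3     5
# r2     4     6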
I solved a similar problem by including row.names=FALSE as an argument in the write.csv function. write.csv was including the row names as an unnamed column in the CSV file and read.csv was naming that column 'X' when it read the CSV file.
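For example, assuming mydf is the data frame being written out:
write.csv(mydf, "data.csv", row.names = FALSE)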
I am trying to import a SAS data set to R (I cannot share the data set). SAS sees columns as number or character. However, some of the number columns have coded character values. I've used the sas7bdat package to bring in the data set but those character values in number columns return NaN. I would like the actual character value. I have tried exporting the data set to csv and tab delimited files. However, I end up with observations that take 2 lines (a problem with SAS that I haven't been able to figure out). Since there are over 9000 observations I cannot go back and look for those observations that take 2 lines manually. Any ideas how I can fix this?
SAS does NOT store character values in numeric columns. But there are some ways that numeric values will be printed using characters.
The first is if you are using the BEST format (which is the default for numeric variables). If the value cannot be represented exactly in that number of characters then it will use scientific notation.
The second is special missing values. SAS has 28 missing values: regular missing is represented by a period, and the others by a single letter or an underscore.
The third would be a custom format that displays the numbers using letters.
The first should not cause any trouble when importing into R. The last two can be handled by the haven package; see the semantics vignette in its documentation.
As to your multiple-line CSV file, there are two possible issues. The first is just that you did not tell SAS to use long enough lines for your data. Make sure to use a longer LRECL setting on the file you are writing to:
filename csv 'myfile.csv' lrecl=1000000 ;
proc export data=mydata file=csv dbms=csv ; run;
The second possible issue is that some of your character variables include end-of-line characters. It is best to just remove or replace those characters; you could always add them back if they are really wanted. For example, these steps will export the same file as above, after first replacing the carriage returns and line feeds in the character variables with pipe characters:
data for_export ;
set mydata;
array _c _character_;
do over _c;
_c = translate(_c,'||','0A0D'x);
end;
run;
proc export data=for_export file=csv dbms=csv ; run;
Partial answer for dealing with data spread across multiple rows:
library( data.table )
# first, read the whole lines into a single column, for example with
DT <- data.table::fread( myfile, sep = "" )
# sample data for this example: a data.table with ten rows containing the numbers 1 to 10
DT <- data.table( 1:10 )
# column-bind two subsets of the data, using logical vectors to select every first
# and every second row, then paste the columns together and collapse using a
# comma separator (or whatever separator you like)
ans <- as.data.table(
  cbind ( DT[ rep( c(TRUE, FALSE), length = .N), 1],
          DT[ rep( c(FALSE, TRUE), length = .N), 1] )[, do.call( paste, c(.SD, sep = ","))] )
ans
#      V1
# 1:  1,2
# 2:  3,4
# 3:  5,6
# 4:  7,8
# 5: 9,10
I prefer the read_sas function from the 'haven' package for reading SAS data:
library(haven)
data <- read_sas("data.sas7bdat")
I am trying to load data.csv in R using
S<-read.csv(file="data.csv")
Since it is a single column of numbers (I believe tab-delimited) without a header, I was hoping for S to be a vector. But S displays as
X40.87
1 40.69
2 40.94
... ...
(The numbers 40.87, 40.69, ... are my numbers.)
To access the third number, I need to invoke S[2,1]. Why not S[3]?
Use scan()
S <- scan("file.csv")
S[3]
# 40.94
Alternatively, as billinkc said, you can use read.csv("file.csv", header=FALSE) or just read.table("file.csv"), since the delimiter is irrelevant in a file with a single column.
Since your CSV has no header, you need to say so when you read the file; otherwise the first row is taken as the column name.
Thus with input file like
40.87
40.69
40.94
I open this with the same logic you used
> s <- read.csv(file="~/Documents/r/data.txt",header=FALSE)
> s
V1
1 40.87
2 40.69
3 40.94
References
read.table {utils}
If you really just want a vector, subset the 1-column data frame:
read.csv(file="data.csv", header=FALSE)[,1]
This works because of the argument drop, which defaults to TRUE and drops the empty dimension (in this case, the column dimension).
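So with the sample values above:
S <- read.csv(file="data.csv", header=FALSE)[,1]
S[3]
# [1] 40.94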
I am reading this from a CSV file, and I need to write a function that churns out a final data frame. Given a particular entry, I have
x
[1] {2,4,5,11,12}
139 Levels: {1,2,3,4,5,6,7,12,17} ...
I can change it to
x2<-as.character(x)
which gives me
x
[1] "{2,4,5,11,12}"
How do I extract 2, 4, 5, 11, 12 (five elements) out of this?
I have tried various approaches, like gsub, but to no avail.
Can anyone please help me?
It sounds like you're trying to import a database table that contains arrays. Since R doesn't know about such data structures, it treats them as text.
Try this. I assume the column in question is x. The result will be a list, with each element being the vector of array values for that row in the table.
dat <- read.csv("<file>", stringsAsFactors=FALSE)
dat$x <- strsplit(gsub("\\{(.*)\\}", "\\1", dat$x), ",")
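As a quick check on a single value (wrap the result in as.numeric() if you want numbers rather than strings):
x <- "{2,4,5,11,12}"
strsplit(gsub("\\{(.*)\\}", "\\1", x), ",")[[1]]
# [1] "2"  "4"  "5"  "11" "12"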