Export dataframe from R to SAS - r

I have a data.frame in R, and I want to export it to a SAS file. I am using write.xport to do that. The column names are like:
a.b.c, a.b.d, a.f.g, ...
When I get the data in SAS, column names are like: a(1),a(2),..
How can I keep the original labels in exported SAS file?
I get the error:
Warning messages:
1: In makeSASNames(colnames(df)) :
Truncated 119 long names to 8 characters.
2: In makeSASNames(colnames(df)) : Made 106 duplicate names unique.

In addition to the length, it seems your column names contain '.'-character? SAS doesnt allow for those kind of names. SAS uses the . to represent e.g. library.dataset -notation and it has many others uses. The colnames cannot contain many + or - or & -chars either.
So to summarize; make your column names SAS -compatible. See the SAS documentation for more.
SAS uses the column labels, which allow for more complexity, only for the purposes of outputting, afaik. Thus, if you want to manipulate data in SAS, you need to rethink your column names first.

Related

Import excel (csv) data into R conducting bioinformatics task

I'm a new who is exploring bioinformatics via R. Right now I've encounter a trouble, where I imported my data in excel into R through changing it into csv format and using read.csv command, as you see in the pic there are 37 variables (column) where first column is supposed to be considered as fixed factor. And I would like to match it with another matirx which has only 36 variables in the downstream processing, what should I do to reduce variable numbers by fixing first column?
Many thanks in advance.
sure, I added str() properties of my data here.
If I am not mistaken, what you are looking for is setting the "Gene" column as metadata, indicating what gene those values in every row correspond to. You can try then to delete the word "Gene" in the Excel file because when you import it with the read.csv() function, the argument row.names = TRUE is set as default when "there is a header and the first row contains one fewer field than the number of columns".
You can find more information about this function using ?read.csv

What's the easiest way to ignore one row of data when creating a histogram in R?

I have this csv with 4000+ entries and I am trying to create a histogram of one of the variables. Because of the way the data was collected, there was a possibility that if data was uncollectable for that entry, it was coded as a period (.). I still want to create a histogram and just ignore that specific entry.
What would be the best or easiest way to go about this?
I tried making it so that the histogram would only use the data for every entry except the one with the period by doing
newlist <- data1$var[1:3722]+data1$var[3724:4282]
where 3723 is the entry with the period, but R said that + is not meaningful for factors. I'm not sure if I went about this the right way, my intention was to create a vector or list or table conjoining those two subsets above into one bigger list called newlist.
Your problem is deeper that you realize. When R read in the data and saw the lone . it interpreted that column as a factor (categorical variable).
You need to either convert the factor back to a numeric variable (this is FAQ 7.10) or reread the data forcing it to read that column as numeric, if you are using read.table or one of the functions that calls read.table then you can set the colClasses argument to specify a numeric column.
Once the column of data is a numeric variable then a negative subscript or !is.na will work (or some functions will automatically ignore the missing value).

Excel and R do not see two values as being equal

I loaded data into two Excel sheets from online tables. Both tables include distinct information about the same group of baseball players, who are named in column B (or column 2 when converted to R) of each table. Neither Excel (VLOOKUP/MATCH) nor R will match up the players' names between the two tables, despite those names looking exactly the same in every way.
Yes, I have checked for extra spaces, capitalization, etc. I have attempted reformatting the cells in Excel that include the players' names. Please see input and output below from R (data was loaded as csv file):
> as.character(freeagentvalue$Name)[3064]
[1] "Travis Hafner"
> as.character(freeagentdata$Name)[294]
[1] "Travis Hafner"
> as.character(freeagentdata$Name)[294] == as.character(freeagentvalue$Name)[3064]
[1] FALSE
I would appreciate any information on why Excel and R are finding differences like the one above. Otherwise I have to retype a lot of names. Thank you in advance.
The two Travis Hafner strings in your example above differ in that in that the first example has a NBSP between the two names; the second has a normal space.
I suggest preprocessing the tables by Replacing all NBSP's with space You can do that either on the worksheet, using the SUBSTITUTE function; or in VBA, using Replace.

NA Values Appear for All Data in Imported .csv File

I imported a set of data into RStudio containing 85 variables and 139 observations. All values are integers except for the last column which is blank and for some reason was imported alongside everything else in the .csv file I created from a .xls file.
As such, this last column is all NA values. The problem is that when I try to run any kind of analysis it seems to be reading that all values are NA values. Despite this, in the data window in RStudio everything seems to be fine. Are there solutions to this problem that don't involve the data? Is it almost certainly the data that's the problem?
It seems strange that when opening the file anywhere else and even viewing it in R
The most likely issue is that the file is being imported as all text rather than as numeric data. If all of the data is numeric you can just use colClasses="numeric" as an argument to the read.csv() function and that should import correctly. You could also change the data class once it is in R, or give colClasses a vector of different classes if you have a variety of different data types (logical, character, numeric etc.) in your file.
Edit
Seeing as colClasses is not working (it is hard to say why without looking at your data), you can try this:
MyDF<-data.frame(sapply(MyDF,FUN=as.numeric))
Where MyDF is your datafraome. That will change all of your columns to numeric. If you have some charcter/factor/logical values in there this may not work as expected. You might want to check your excel file/csv to see why it is importing a NA column. It could be that there is a cell with a space in it that is being pulled in and this is throwing things off. You could always try deleting that empty column and retrying your import.
If you want to omit your last column while reading the data itself, you can try the following code. In this example, I am assuming that your file has 5 columns and the 5th column has NA values. So, you want to skip reading 5th column in your data set.
data <- read.csv (fileName, ....) [,1:4]
or, if you want to use column names, you can use:
data <- read.csv (fileName, ....) [,c('col1','col2','col3','col4')]
This will read all the observations from selected columns within your data set.
Hope this helps.
If you are trying too find the mean and standard deviation you can use
Data<- mean( dataframe$colname , na.rm = TRUE)
Data1<- sd( dataframe$colname , na.rm = TRUE)
This will give u the answer after omitting the na values from the column

Why does R mix up numerical with categorial variables?

I am confused. I input a .csv file in R and want to fit a linear multivariate regression model.
However, R declares all my obvious numeric variables to be factors and my categorial variables to be integers. Therefore, I cannot fit the model.
Does anyone know how to resolve this?
I know this is probably so basic. But I really need to know this. Elsewhere, I found only posts concerning how to declare factors. But this does not apply here.
Any suggestions very much appreciated!
The easiest way, imo, to handle this is to just tell R what type of data your columns contain when you read them into the workspace. For example, if you have a csv file where the first column should be characters, columns 2-21 should be numeric, and column 22 should be a factor, here's how I would read that csv file into the workspace:
Data <- read.csv("MyData.csv", colClasses=c("character", rep("numeric", 20), "factor"))
Sometimes (with certain versions of R, as Andrew points out) float entries in a CSV are long enough that it thinks they are strings and not floats. In this case, you can do the following
data <- read.csv("filename.csv")
data$some.column <- as.numeric(as.character(data$some.column))
Or you could pass stringsAsFactors=F to the read.csv call, and just apply as.numeric in the next line. That might be a bad idea though if you have a lot of data.
It's a little harder to say what's going on with the categorical variables. You might want to try just treating those as strings and see how that works. Sometimes R will treat factor vectors as being of numeric type, so this is a good first sanity check. If that doesn't work, you can also see if the regression functions in question will let you declare how the variables should be treated.
It is hard to tell without a sample of your data file and the commands that you have been using to try and work with the data, but here are some general problems that can lead to what you describe (though there could be other possibilities as well).
The read.csv and read.table (which is called by read.csv) function will try and guess the types of data when they are not told what each column should be (the colClasses argument). If everything looks like a number then it will convert to a number, but if it sees anything in the first lines that does not look like part of a number then it will read it in as character and convert to a factor. Some of the common reasons why what you think should be a number but R sees something non-numeric include: a finger slip results in a letter somewhere in the column; similar looking substitutions, O for 0 or l for 1; a comma where one is not expected, many European files use , where R expects . (but there are options to tell R what you want here) or if you use read.table without setting sep when it really is a comma separated file.
If you have a categorical variable represented by integers, then R will convert it to integers unless you tell it to make a factor. If you use as.numeric on a factor then it will return the integers used to represent the factor internally. How to convert a factor with labels that are numbers to a numeric is a question (and answer) in the FAQ.
If this does not point you in the right direction then give us a sample of your data and what commands you are using.

Resources