Import Excel (CSV) data into R for a bioinformatics task - r

I'm a newcomer exploring bioinformatics via R. I've run into a problem: I imported my Excel data into R by converting it to csv format and using the read.csv command. As you can see in the picture, there are 37 variables (columns), where the first column is supposed to be treated as a fixed factor. I would like to match it against another matrix that has only 36 variables in the downstream processing. What should I do to reduce the number of variables by fixing the first column?
Many thanks in advance.
Sure, I added the str() output of my data here.

If I am not mistaken, what you are looking for is setting the "Gene" column as metadata (row names), indicating which gene the values in each row correspond to. You can try deleting the word "Gene" from the header in the Excel file, because when you import it with the read.csv() function and the header line contains one fewer field than the number of columns, the first column is taken to be the row names by default.
You can find more information about this function using ?read.csv
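As a minimal sketch of the same idea done entirely in R (the file name genes.csv below is just a placeholder), you can also point read.csv at the first column explicitly with row.names = 1, which leaves the 36 data columns you need:

expr <- read.csv("genes.csv", row.names = 1)  # first (Gene) column becomes the row names
dim(expr)             # now 36 data columns, matching the other matrix
head(rownames(expr))  # the gene identifiers from the former first column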

Related

I am importing a csv table into R but it takes my observations as variables

I imported a csv file into R. The first column holds my observations and I have 5 variables. However, R treats my column of observations as a variable and tells me I have 6 variables. How do I make it understand that the first column of "cars" is a column of observations? I attach a picture for reference.
Thank you,
Mariana
You should be able to specify this with the row.names parameter in read.csv. Although I can't say exactly what to type since I don't have the original dataset, it should be something like:
read.csv(file = "myfile.csv", row.names = 1, [other options])
indicating that row names can be found in the first column.
If you're using some other method of importing the file (e.g. by using the RStudio graphical interface), there should be an option somewhere along the way to specify the location of row names.
Alternatively, a possibly easier approach is suggested by the read.csv documentation:
If row.names is not specified and the header line has one less entry than the number of columns, the first column is taken to be the row names. This allows data frames to be read in from the format in which they are printed. If row.names is specified and does not refer to the first column, that column is discarded from such files.
Try deleting the X in the top left corner of your .csv file (and delete the comma that follows it) and see if that gets you anywhere.
EDIT Marius has the right suggestion, by the way - just ignore the junk column and work with row numbers instead. (What's the harm?)
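As a concrete (hypothetical) sketch, assuming the file is saved as cars.csv, either route looks like this:

cars_df <- read.csv("cars.csv", row.names = 1)  # explicitly use column 1 as row names
str(cars_df)                                    # should now report 5 variables, not 6

# or, after deleting the stray "X" and its trailing comma from the header row:
cars_df <- read.csv("cars.csv")                 # header is one field short, so the
                                                # first column becomes the row names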

Export dataframe from R to SAS

I have a data.frame in R, and I want to export it to a SAS file. I am using write.xport to do that. The column names are like:
a.b.c, a.b.d, a.f.g, ...
When I get the data in SAS, the column names look like: a(1), a(2), ..
How can I keep the original labels in exported SAS file?
I get these warnings:
Warning messages:
1: In makeSASNames(colnames(df)) :
Truncated 119 long names to 8 characters.
2: In makeSASNames(colnames(df)) : Made 106 duplicate names unique.
In addition to the length, it seems your column names contain the '.' character? SAS doesn't allow names like that: it uses the '.' to represent, for example, the library.dataset notation, and it has many other uses. Column names cannot contain characters such as '+', '-' or '&' either.
So, to summarize: make your column names SAS-compatible. See the SAS documentation for more.
SAS uses column labels, which allow for more complexity, only for output purposes, as far as I know. Thus, if you want to manipulate data in SAS, you need to rethink your column names first.
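A rough sketch of that clean-up in base R before calling write.xport (df stands for the data frame from the question; the exact renaming rules are assumptions and may need adjusting for your data):

nm <- colnames(df)
nm <- gsub(".", "_", nm, fixed = TRUE)  # SAS names cannot contain '.'
nm <- substr(nm, 1, 8)                  # the transport format limits names to 8 characters
nm <- make.unique(nm, sep = "")         # de-duplicate names that collide after truncation
colnames(df) <- nm
# note: de-duplication can push a name past 8 characters again, so very similar
# names may still need to be shortened further by hand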

Formatting Header to Append to Data Frame in R

I'm attempting to create a specifically formatted header to append to a data frame I have created in R.
The essence of my problem is that it seems increasingly difficult (maybe impossible?) to create a header that breaks away from the typical one-row-by-one-column framework, without merging the underlying table, using the data frame concept in R.
The issue stems from me not being able to figure out a way to import this particular header format into R through methods such as read.csv or read.xlsx while preserving the format of the header.
Reading a header of this format into R from a .csv or .xlsx is quite ugly and doesn't preserve the original format. The header I'm trying to create and append to an already existing data frame of 17 nameless columns in R could be represented in such a way:
Where the number series 1 - 17 represents the already existing data frame of 17 nameless columns of data that I have created in R, which I wish to append to this header. Could anyone point me in the right direction?
You are correct that this header will not work within R. A data frame only supports single header values and won't do anything akin to a merged cell in Excel.
However, if you simply want to export your data to a .csv or .xlsx (use write.csv), you could just paste your header in afterwards; that could work.
OR
You could add in a factor column to your data frame to capture the information contained in the top level of your header.
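As a minimal sketch of that second idea (the group labels and column names below are purely hypothetical; the real ones would come from the top level of your intended header), one way to keep that information around is as a factor keyed by column, which you can then use to select or label columns later:

col_group <- factor(c(rep("GroupA", 8), rep("GroupB", 9)))  # hypothetical top-level groups
names(col_group) <- paste0("V", 1:17)                       # hypothetical column names

# e.g. pull out every column that belongs to the first top-level group:
# df[, names(col_group)[col_group == "GroupA"]]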

Excel and R do not see two values as being equal

I loaded data into two Excel sheets from online tables. Both tables include distinct information about the same group of baseball players, who are named in column B (or column 2 when imported into R) of each table. Neither Excel (VLOOKUP/MATCH) nor R will match up the players' names between the two tables, despite those names looking exactly the same in every way.
Yes, I have checked for extra spaces, capitalization, etc. I have tried reformatting the cells in Excel that contain the players' names. Please see the input and output below from R (the data was loaded from a csv file):
> as.character(freeagentvalue$Name)[3064]
[1] "Travis Hafner"
> as.character(freeagentdata$Name)[294]
[1] "Travis Hafner"
> as.character(freeagentdata$Name)[294] == as.character(freeagentvalue$Name)[3064]
[1] FALSE
I would appreciate any information on why Excel and R are finding differences like the one above. Otherwise I have to retype a lot of names. Thank you in advance.
The two "Travis Hafner" strings in your example above differ in that the first one has a non-breaking space (NBSP) between the two names, while the second has a normal space.
I suggest preprocessing the tables by replacing all NBSPs with normal spaces. You can do that either in the worksheet, using the SUBSTITUTE function, or in VBA, using Replace.
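If you would rather fix it on the R side instead of in the worksheet or VBA, here is a small sketch using the column names from your question: strip the non-breaking space, character U+00A0, with gsub before comparing.

utf8ToInt(substr(as.character(freeagentvalue$Name)[3064], 7, 7))  # 160 means NBSP at the separator

clean_nbsp <- function(x) gsub("\u00A0", " ", as.character(x), fixed = TRUE)
freeagentvalue$Name <- clean_nbsp(freeagentvalue$Name)
freeagentdata$Name  <- clean_nbsp(freeagentdata$Name)

freeagentdata$Name[294] == freeagentvalue$Name[3064]  # should now be TRUE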

R XLConnect getting index/formula to a chunk of data using content found in first cell

Sorry if this is difficult to understand - I don't have enough karma to add a picture, so I will do the best I can to describe this! I am using the XLConnect package within R to read from and write to Excel spreadsheets.
I am working on a project in which I am trying to take columns of data out of many workbooks and concatenate them together into rows of a new workbook based on which workbook they came from (each workbook is data from a consecutive business day). The snag is that the data that I seek is only a small part (10 rows X 3 columns) of each workbook/worksheet and is not always located in the same place within the worksheet due to sloppiness on behalf of the person who originally created the spreadsheets. (e.g. I can't just start at cell A2 because the dataset that starts at A2 in one workbook might start at B12 or C3 in another workbook).
I am wondering if it is possible to search for a cell based on its contents (e.g. a cell containing the title "Table of Arb Prices") and return either the index or reference formula to be able to access that cell.
Also wondering if, once I reference that cell based on its contents, if there is a way to adjust that formula to get to where I know another cell is compared to that one. For example if a cell with known contents is always located 2 rows above and 3 columns to the left of the cell where I wish to start collecting data, is it possible for me to take that first reference formula and increment it by 2 rows and 3 columns to get the reference formula for the cell I want?
Thanks for any help and please advise me if you need further information to be able to understand my questions!
You can just read the entire worksheet in as a matrix with something like
library(XLConnect)
demoExcelFile <- system.file("demoFiles/mtcars.xlsx", package = "XLConnect")
mm <- as.matrix(readWorksheetFromFile(demoExcelFile, sheet = 1))
class(mm) <- "character"  # convert all to character
Then you can search for values and get the row/column:
which(mm == "3.435", arr.ind = TRUE)
# row col
# [1,] 23 6
Then you can offset those and extract values from the matrix however you like. In the end, when you know where you want to read from, you can convert to a cleaner data frame with
read.table(text = apply(mm[25:27, 6:8], 1, paste, collapse = "\t"), sep = "\t")
Hopefully that gives you a general idea of something you can try. It's hard to be more specific without knowing exactly what your input data looks like.
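Putting those pieces together for the layout you describe might look roughly like the following; the file name, the title text "Table of Arb Prices", and the 2-row / 3-column offsets are taken from your examples and are only assumptions:

library(XLConnect)

mm <- as.matrix(readWorksheetFromFile("day_workbook.xlsx", sheet = 1, header = FALSE))
class(mm) <- "character"                                   # match on text everywhere

loc <- which(mm == "Table of Arb Prices", arr.ind = TRUE)  # locate the title cell
start_row <- loc[1, "row"] + 2                             # data assumed 2 rows below
start_col <- loc[1, "col"] + 3                             # and 3 columns to the right

block <- mm[start_row:(start_row + 9), start_col:(start_col + 2)]  # the 10 x 3 chunk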
