How to convert categorical string variables in Excel to numeric variables? - r

I am trying to analyze road collision data. I need to convert the categorical data into numeric data in Excel before I import it to RStudio, but I am not able to do it.
Excel file containing the sample data to be used:

Have a look at this, as I don't know which value you want to search with and which you want as the result, I have shown it taking the weather and returning the leftmost column.
You can change that as you need as I have used Index() with match().
So, based on the comment, all you need is left() to get the number, I have used find() to get the space just in case you have 2 or 3 digit numbers...
=LEFT(C3,FIND(" ",C3,1)-1)
C3 is the target cell, drag down as far as...

Related

Import excel (csv) data into R conducting bioinformatics task

I'm a new who is exploring bioinformatics via R. Right now I've encounter a trouble, where I imported my data in excel into R through changing it into csv format and using read.csv command, as you see in the pic there are 37 variables (column) where first column is supposed to be considered as fixed factor. And I would like to match it with another matirx which has only 36 variables in the downstream processing, what should I do to reduce variable numbers by fixing first column?
Many thanks in advance.
sure, I added str() properties of my data here.
If I am not mistaken, what you are looking for is setting the "Gene" column as metadata, indicating what gene those values in every row correspond to. You can try then to delete the word "Gene" in the Excel file because when you import it with the read.csv() function, the argument row.names = TRUE is set as default when "there is a header and the first row contains one fewer field than the number of columns".
You can find more information about this function using ?read.csv

How do I export a custom list of numbers and letters to Excel from R?

To help with some regular label-making and printing I need to do, I am looking to write a script that allows me to enter a range of sequential numbers (some with string identifiers) that I can export with a specific format to Excel. For example, if I entered the range '1:16', I am looking for an output in Excel exactly as:
Example Excel Output
For each unique sequential number (i.e., 1 to 16) the first five rows must be labeled with a 'U", the next three rows with an 'F' and the last two rows must be the number alone. The final exported matrix will be n columns x 21 rows, where n will vary depending on the number range I enter.
My main problem is in writing to Excel. I can't find out how to customize this output and write to specific rows and columns as in the example above. I am limited to 'openxlsx' since I work on a corporate secure workstation. Here is what I have so far:
Example Code
Any help you may have would be very appreciated, thanks in advance!

R: referencing a character string

Is it possible to feed R a character string and for it to know I'm looking for the data frame with that name?
Example:
TestData <- matrix(1:100,nrow=10, ncol=10)
Then, if I want to reference it later, can I do something similar to this to have R pull dataset?
paste("TestData$",x[1,],sep="")
When entered this way, it comes up as a character string and obviously returns no data. For context, I'm trying to do it this way because I'm creating a loop that goes through several data sets (and columns within those datasets), but does similar operations, so I would like to be able to dynamically change the referenced dataset.
Any help is appreciated. Thanks!

What's the easiest way to ignore one row of data when creating a histogram in R?

I have this csv with 4000+ entries and I am trying to create a histogram of one of the variables. Because of the way the data was collected, there was a possibility that if data was uncollectable for that entry, it was coded as a period (.). I still want to create a histogram and just ignore that specific entry.
What would be the best or easiest way to go about this?
I tried making it so that the histogram would only use the data for every entry except the one with the period by doing
newlist <- data1$var[1:3722]+data1$var[3724:4282]
where 3723 is the entry with the period, but R said that + is not meaningful for factors. I'm not sure if I went about this the right way, my intention was to create a vector or list or table conjoining those two subsets above into one bigger list called newlist.
Your problem is deeper that you realize. When R read in the data and saw the lone . it interpreted that column as a factor (categorical variable).
You need to either convert the factor back to a numeric variable (this is FAQ 7.10) or reread the data forcing it to read that column as numeric, if you are using read.table or one of the functions that calls read.table then you can set the colClasses argument to specify a numeric column.
Once the column of data is a numeric variable then a negative subscript or !is.na will work (or some functions will automatically ignore the missing value).

R XLConnect getting index/formula to a chunk of data using content found in first cell

Sorry if this is difficult to understand - I don't have enough karma to add a picture so I will do the best I can to describe this! Using XLConnect package within R to read & write from/to Excel spreadsheets.
I am working on a project in which I am trying to take columns of data out of many workbooks and concatenate them together into rows of a new workbook based on which workbook they came from (each workbook is data from a consecutive business day). The snag is that the data that I seek is only a small part (10 rows X 3 columns) of each workbook/worksheet and is not always located in the same place within the worksheet due to sloppiness on behalf of the person who originally created the spreadsheets. (e.g. I can't just start at cell A2 because the dataset that starts at A2 in one workbook might start at B12 or C3 in another workbook).
I am wondering if it is possible to search for a cell based on its contents (e.g. a cell containing the title "Table of Arb Prices") and return either the index or reference formula to be able to access that cell.
Also wondering if, once I reference that cell based on its contents, if there is a way to adjust that formula to get to where I know another cell is compared to that one. For example if a cell with known contents is always located 2 rows above and 3 columns to the left of the cell where I wish to start collecting data, is it possible for me to take that first reference formula and increment it by 2 rows and 3 columns to get the reference formula for the cell I want?
Thanks for any help and please advise me if you need further information to be able to understand my questions!
You can just read the entire worksheet in as a matrix with something like
library(XLConnect)
demoExcelFile <- system.file("demoFiles/mtcars.xlsx", package = "XLConnect")
mm <- as.matrix(readWorksheetFromFile(demoExcelFile, sheet=1))
class(mm)<-"character" # convert all to character
Then you can search for values and get the row/colum
which(mm=="3.435", arr.ind=T)
# row col
# [1,] 23 6
Then you can offset those and extract values from the matrix how ever you like. In the end, when you know where you want to read from, you can convert to a cleaner data frame with
read.table(text=apply(mm[25:27, 6:8],1,paste, collapse="\t"), sep="\t")
Hopefully that gives you a general idea of something you can try. It's hard to be more specific without knowing exactly what your input data looks like.

Resources