Scraping tables from Excel files into R

I have several Excel files (*.xlsx) that I want to import into R, but each file has 6 to 7 tables in a single sheet, separated by chunks of text, as in the picture.
I know how to import several Excel files using a loop, but my issue is that I cannot figure out how to select each of the tables scattered across each sheet, skip the rows with text, and bind the tables together. Also, each table in each Excel file starts in a different cell, so I cannot just define a fixed coordinate (a specific cell) to import from, and every file has a different number of rows. I'd appreciate any help.
For instance, the picture above is about Maryland (a US state), and I want to transform it into what is presented in the following picture:
Here is a toy file for anyone willing to help: LINK
Thanks!

Based on the image of the data you showed, it seems that every row where the second column is NA can be removed. In that case, subsetting in base R is pretty straightforward:
test <- test[!is.na(test[, 2]), ]
Quick explanation:
test[, 2] --> select all rows of column 2
is.na(test[, 2]) --> returns TRUE where the cell is NA
!is.na(test[, 2]) --> returns FALSE where the cell is NA (TRUE where it holds a value)
test[!is.na(test[, 2]), ] --> all rows of the test data frame where the cell in column 2 is not NA
Again, based on the data you showed, this should work, but it is hard to be sure without true sample data.
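In case it is useful, here is a minimal sketch combining that subsetting rule with a file loop, using the readxl package. The "data" folder path is a placeholder, and I am assuming the drop-rows-where-column-2-is-NA rule holds in every file:
library(readxl)
files <- list.files("data", pattern = "\\.xlsx$", full.names = TRUE)
tables <- lapply(files, function(f) {
  sheet <- read_excel(f, col_names = FALSE)  # read the whole sheet, no header row
  sheet[!is.na(sheet[[2]]), ]                # keep only rows with a value in column 2
})
combined <- do.call(rbind, tables)           # assumes all tables share the same columns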

Related

Import Excel (CSV) data into R for a bioinformatics task

I'm new to bioinformatics and am exploring it via R. I've run into a problem: I imported my Excel data into R by converting it to CSV format and using the read.csv command. As you can see in the pic, there are 37 variables (columns), where the first column is supposed to be treated as a fixed factor. I would like to match the data with another matrix that has only 36 variables in downstream processing. What should I do to reduce the number of variables by fixing the first column?
Many thanks in advance.
Sure, I added the str() properties of my data here.
If I am not mistaken, what you are looking for is setting the "Gene" column as row names, i.e. metadata indicating which gene the values in each row correspond to. You can try deleting the word "Gene" from the header in the Excel file: when you then import with the read.csv() function, the first column is used as row names by default whenever "there is a header and the first row contains one fewer field than the number of columns".
You can find more information about this function using ?read.csv
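Alternatively, a minimal sketch that keeps the header intact and sets the row names explicitly; "expr.csv" is a placeholder for your file name:
expr <- read.csv("expr.csv", row.names = 1)  # first column becomes row names
ncol(expr)  # 36 variables now; the gene IDs live in rownames(expr)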

read.xlsx function limitations in the openxlsx library

When I try to import an Excel worksheet with the read.xlsx function, the preview shows that more than 100 columns were inserted into the data frame, but when I look inside the data frame, only the first 100 columns are visible. As a result, adding some columns and then using write.xlsx omits those columns. Is there any way to avoid this?
Regards,
Rafał
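For what it's worth, a quick hedged check of whether the columns are truly missing or merely hidden by the preview; "workbook.xlsx" and "out.xlsx" are placeholder file names:
library(openxlsx)
df <- read.xlsx("workbook.xlsx", sheet = 1)
ncol(df)                    # the real column count, independent of any preview
df$extra <- 1               # add a new column
write.xlsx(df, "out.xlsx")  # write.xlsx writes every column present in df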

Excel and R do not see two values as being equal

I loaded data into two Excel sheets from online tables. Both tables include distinct information about the same group of baseball players, who are named in column B (or column 2 when converted to R) of each table. Neither Excel (VLOOKUP/MATCH) nor R will match up the players' names between the two tables, despite those names looking exactly the same in every way.
Yes, I have checked for extra spaces, capitalization, etc. I have attempted reformatting the cells in Excel that include the players' names. Please see the input and output below from R (the data was loaded as a CSV file):
> as.character(freeagentvalue$Name)[3064]
[1] "Travis Hafner"
> as.character(freeagentdata$Name)[294]
[1] "Travis Hafner"
> as.character(freeagentdata$Name)[294] == as.character(freeagentvalue$Name)[3064]
[1] FALSE
I would appreciate any information on why Excel and R are finding differences like the one above. Otherwise I have to retype a lot of names. Thank you in advance.
The two "Travis Hafner" strings in your example differ in that the first one has a non-breaking space (NBSP) between the two names, while the second has a normal space.
I suggest preprocessing the tables by replacing all NBSPs with regular spaces. You can do that either on the worksheet, using the SUBSTITUTE function, or in VBA, using Replace.
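If you would rather fix it on the R side, here is a small sketch using the data frame names from the question:
# Inspect the code points to confirm the hidden character (160 is NBSP)
utf8ToInt(as.character(freeagentvalue$Name)[3064])
# Replace non-breaking spaces (U+00A0) with ordinary spaces, then compare again
freeagentvalue$Name <- gsub("\u00A0", " ", as.character(freeagentvalue$Name))
freeagentdata$Name  <- gsub("\u00A0", " ", as.character(freeagentdata$Name))
freeagentdata$Name[294] == freeagentvalue$Name[3064]  # should now be TRUE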

NA Values Appear for All Data in Imported .csv File

I imported a set of data into RStudio containing 85 variables and 139 observations. All values are integers, except for the last column, which is blank and for some reason was imported alongside everything else from the .csv file I created from an .xls file.
As such, this last column is all NA values. The problem is that when I try to run any kind of analysis, R seems to treat every value as NA, even though everything looks fine in the data window in RStudio. Are there solutions to this problem that don't involve changing the data, or is it almost certainly the data that's the problem?
It seems strange, given that the file looks fine when opened anywhere else, and even when viewed in R.
The most likely issue is that the file is being imported as text rather than as numeric data. If all of the data is numeric, you can pass colClasses = "numeric" to the read.csv() function and it should import correctly. You could also change the class of each column once the data is in R, or give colClasses a vector of classes if your file mixes types (logical, character, numeric, etc.).
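For this particular file (85 columns, all integers except the blank last one), a hedged sketch; "mydata.csv" is a placeholder, and "NULL" in colClasses tells read.csv() to skip a column entirely:
df <- read.csv("mydata.csv",
               colClasses = c(rep("integer", 84), "NULL"))
str(df)  # 84 integer columns; the blank 85th column is dropped on import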
Edit
Seeing as colClasses is not working (it is hard to say why without looking at your data), you can try this:
MyDF <- data.frame(sapply(MyDF, FUN = as.numeric))
where MyDF is your data frame. That will convert all of your columns to numeric. If you have character/factor/logical values in there, this may not work as expected. You might want to check your Excel/CSV file to see why it is importing an NA column; it could be that a cell containing only a space is being pulled in and throwing things off. You could also try deleting that empty column and re-importing.
If you want to omit the last column while reading the data itself, you can try the following. In this example, I assume your file has 5 columns and the 5th column holds the NA values, so you want to skip reading the 5th column:
data <- read.csv(fileName, ....)[, 1:4]
or, if you want to use column names:
data <- read.csv(fileName, ....)[, c('col1', 'col2', 'col3', 'col4')]
This will read all observations from the selected columns of your data set.
Hope this helps.
If you are trying to find the mean and standard deviation, you can use
Data <- mean(dataframe$colname, na.rm = TRUE)
Data1 <- sd(dataframe$colname, na.rm = TRUE)
This will give you the answer after omitting the NA values from the column.

R XLConnect getting index/formula to a chunk of data using content found in first cell

Sorry if this is difficult to understand - I don't have enough karma to add a picture, so I will do the best I can to describe this! I am using the XLConnect package within R to read from and write to Excel spreadsheets.
I am working on a project in which I am trying to take columns of data out of many workbooks and concatenate them into rows of a new workbook based on which workbook they came from (each workbook is data from a consecutive business day). The snag is that the data I seek is only a small part (10 rows x 3 columns) of each workbook/worksheet and is not always located in the same place within the worksheet, due to sloppiness on the part of the person who originally created the spreadsheets (e.g., I can't just start at cell A2, because the dataset that starts at A2 in one workbook might start at B12 or C3 in another).
I am wondering if it is possible to search for a cell based on its contents (e.g. a cell containing the title "Table of Arb Prices") and return either the index or reference formula to be able to access that cell.
Also wondering if, once I reference that cell based on its contents, if there is a way to adjust that formula to get to where I know another cell is compared to that one. For example if a cell with known contents is always located 2 rows above and 3 columns to the left of the cell where I wish to start collecting data, is it possible for me to take that first reference formula and increment it by 2 rows and 3 columns to get the reference formula for the cell I want?
Thanks for any help and please advise me if you need further information to be able to understand my questions!
You can just read the entire worksheet in as a matrix with something like
library(XLConnect)
demoExcelFile <- system.file("demoFiles/mtcars.xlsx", package = "XLConnect")
mm <- as.matrix(readWorksheetFromFile(demoExcelFile, sheet=1))
class(mm) <- "character"  # convert all cells to character
Then you can search for values and get the row/column indices:
which(mm=="3.435", arr.ind=T)
# row col
# [1,] 23 6
Then you can offset those indices and extract values from the matrix however you like. In the end, when you know where you want to read from, you can convert the block to a cleaner data frame with
read.table(text = apply(mm[25:27, 6:8], 1, paste, collapse = "\t"), sep = "\t")
Hopefully that gives you a general idea of something you can try. It's hard to be more specific without knowing exactly what your input data looks like.
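To make the offsetting step concrete, here is a small sketch building on the matrix approach above. The title string comes from the question; the 2-rows-down / 3-columns-right offset and the 10 x 3 block size are taken from the question's example, so adjust them to your sheets:
loc <- which(mm == "Table of Arb Prices", arr.ind = TRUE)
start_row <- loc[1, "row"] + 2  # data starts 2 rows below the known cell
start_col <- loc[1, "col"] + 3  # and 3 columns to its right
block <- mm[start_row:(start_row + 9), start_col:(start_col + 2)]  # the 10 x 3 block
read.table(text = apply(block, 1, paste, collapse = "\t"), sep = "\t")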
