Import only specific cells from Excel to R by hard coding

I have around 100 identically structured .xls files containing 10 sheets each, with very messy data; here is a mock-up example of one sheet:
I want to combine everything into one R dataframe/tibble.
I don't know the right approach here, but I believe I can hard-code this within readxl::read_excel(). It should look like this:
I would appreciate it if somebody could show a short piece of code for picking a cell to be the column name by its position, and the data belonging to that column, also by its position/range.
Afterwards, I will find a way to loop this over all sheets within all files, or better: if I can specify the needed range for a certain sheet name within the read_excel() call, then I only have to loop over the files.
Thanks, and let me know if you need more information on this.
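A minimal sketch of how this could look with readxl (note the function is read_excel(), not read.xls()); the file name, sheet name, and cell positions below are placeholders for your actual layout:

library(readxl)
library(cellranger)  # cell_limits() builds ranges from numeric positions

path <- "file1.xls"   # hypothetical file name
sheet <- "Sheet1"     # hypothetical sheet name

# Read the single cell holding the column name (here: row 2, column 3)
header <- read_excel(path, sheet = sheet,
                     range = cell_limits(c(2, 3), c(2, 3)),
                     col_names = FALSE)

# Read the data belonging to that column (here: rows 4-20 of column 3)
values <- read_excel(path, sheet = sheet,
                     range = cell_limits(c(4, 3), c(20, 3)),
                     col_names = as.character(header[[1]]))

Because range accepts either an Excel-style string such as "C2:C20" or a cellranger::cell_limits() object built from numeric row/column indices, the same call can be generated programmatically inside a loop over sheets and files.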

Related

Reading an XLSX file with multiple sheets and special cell styles/formatting into R

A coworker on my project works mainly in Excel, and he provided me with an .xlsx file of more than 400 sheets with a similar structure. Every sheet has information for more than one individual (a maximum of 12), and for each individual there are about 50 different measurements (lengths of special characters). These are organized identically across the sheets, so that is not the issue.
The issue is that scattered among all this data are manually inserted mean values, added for cases where the measurement could not be done, which I would like to remove. The only way to identify these mean values is by their font color.
Does someone know how I can loop through all sheets and get not only the cell information but also the style of each cell?
I already tried the "readxl", "tidyxl" & "openxlsx" packages, but my problem is that I am not able to loop through all sheets using the tidyxl::xlsx_formats(path = path) function or the openxlsx::loadWorkbook function.
I'm sure that I'm just missing a function or another way to deal with this problem.
Thanks already for any help :).
So far I have been using readxl::excel_sheets to loop through all sheets and get the information given in each cell, but I can't find a way to get the cell style information too (R version 4.2.2 Patched, packageVersion("readxl") '1.4.0').
Then I tried tidyxl::xlsx_formats, which gives me the cell style information, but only for the first sheet (packageVersion("tidyxl") '1.0.8'). I followed the instructions on its CRAN page.
I also tried different approaches such as openxlsx::loadWorkbook, which I found in other questions like https://stackoverflow.com/questions/62519400/filter-data-highlighted-in-excel-by-cell-fill-color-using-openxlsx.
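For what it's worth, tidyxl does not need a per-sheet loop here: xlsx_cells() reads every sheet at once and records each cell's sheet name, while xlsx_formats() describes the formats of the whole workbook, indexed by each cell's local_format_id. A sketch under the assumption that the means are marked in red (the file name and color code are placeholders):

library(tidyxl)

path <- "measurements.xlsx"   # hypothetical file name

cells <- xlsx_cells(path)     # one row per cell, all sheets, with a `sheet` column
formats <- xlsx_formats(path) # formats for the whole workbook

# Format ids whose font color matches the manually inserted means
# ("FFFF0000" is red in ARGB -- substitute the color actually used)
mean_ids <- which(formats$local$font$color$rgb == "FFFF0000")

# Drop those cells before reshaping the remaining measurements
clean <- cells[!(cells$local_format_id %in% mean_ids), ]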

Parsing Large XML files efficiently using R

New here. I use R a lot, but I'm pretty unfamiliar with XML. I'm looking for advice on efficiently looping through and aggregating large XML files (~200MB) from here. I have XML files with elements that look like this:
<OpportunitySynopsisDetail_1_0>
<OpportunityID>134613</OpportunityID>
<OpportunityTitle>Research Dissemination and Implementation Grants (R18)</OpportunityTitle>
<OpportunityNumber>PAR-12-063</OpportunityNumber>
...
</OpportunitySynopsisDetail_1_0>
None of the sub-elements have children or attributes. The only complicating factor is that some elements can have multiple instances of a single child type.
I've already downloaded and parsed one file using the xml2 package, and I have my xml_nodeset. I've also played around successfully with extracting data from subsets of the nodeset (e.g. the first 100 nodes). Here's an example of what I did to extract elements without an "ArchivedDate" sub-element:
for (i in 1:100) {
  if (is.na(xml_child(nodeset[[i]], "d1:ArchiveDate", xml_ns(xmlfile)))) {
    # xml_child() returns a missing node when the child is absent
    print(paste0("Entry ", i, " is not archived."))
  }
}
Here's the problem: if I replace 100 with length(nodeset), which is 56k+, this is going to take forever to iterate through. Is there a better way to filter out and analyze XML elements without iterating through each and every one? Or is this just a limitation of the file format? The long-term goal is to get a very small subset of this file into a data frame.
Thanks!
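One way to avoid the R-level loop is to push the filter into XPath: xml2's xml_find_all() hands the predicate to libxml2, which evaluates it in C across the whole document. A sketch that assumes the d1:ArchiveDate element name and namespace prefix from the code above (the file name is a placeholder):

library(xml2)

xmlfile <- read_xml("grants_extract.xml")  # hypothetical file name
ns <- xml_ns(xmlfile)                      # default namespace appears as d1

# Select every synopsis node lacking an ArchiveDate child in one query
active <- xml_find_all(
  xmlfile,
  "//d1:OpportunitySynopsisDetail_1_0[not(d1:ArchiveDate)]",
  ns
)

# Pull a couple of fields into a data frame
df <- data.frame(
  id = xml_text(xml_find_first(active, "d1:OpportunityID", ns)),
  title = xml_text(xml_find_first(active, "d1:OpportunityTitle", ns))
)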

Load multiple tables from one worksheet and split them by detecting one fully empty row

So generally what I would like to do is load a sheet with 4 different tables and split this one big dataset into smaller tables, using str_detect() to detect the one fully blank row that's dividing those tables. After that I want to plug that information into the startRow, startCol, endRow, endCol arguments.
I have tried using the function as follows:
str_detect(my_data, "") but the my_data format is wrong. I'm not sure what step I should take to prevent this and make it work.
I'm using read_xlsx() to read my dataset.
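str_detect() expects a character vector, which is why passing the whole data frame fails. A sketch of an alternative that skips string matching altogether: read the sheet without headers so the separator rows arrive as all-NA rows, then split on those (the file name is a placeholder):

library(readxl)

# Read the whole sheet headerless so blank rows survive as all-NA rows
raw <- read_xlsx("tables.xlsx", col_names = FALSE)

blank <- apply(raw, 1, function(r) all(is.na(r)))  # TRUE for fully empty rows
group <- cumsum(blank)                             # group id increments after each blank row

# One data frame per table, with the blank separator rows dropped
tables <- split(raw[!blank, ], group[!blank])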

Custom column design

I have a raw dataset whose columns are not clearly delimited at all. When I import the data using read.table in R, it automatically tries to approximate where the columns begin and end, but it is not correct. I know the number of characters per variable, but I am not sure how to specify them as one would in Excel (=LEFT(x,3) or =MID(x,4,1), etc.). Some variables are separated by spaces, some aren't. It is not consistent.
FYI: The document was originally ".dat", then I saved the file as a ".R" file.
Here is an example of my data
Any help is much appreciated! Let me know
You can use read_fwf from the great readr package to specify the fixed widths per variable.
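A minimal sketch; the widths and column names below are placeholders for the character counts you already know (and keep the data in its original .dat form rather than resaving it as .R):

library(readr)

# Hypothetical layout: 3 characters for id, 1 for group, 5 for score
raw <- read_fwf(
  "mydata.dat",
  fwf_widths(c(3, 1, 5), col_names = c("id", "group", "score"))
)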

How to get R to use a certain dataset for multiple commands without using attach() or adding data= to every command

So I'm trying to manipulate a simple Qualtrics CSV, and I want to use colSums on certain columns of data, given a certain filter.
For example: within the .csv file called data, I want to get the sum of a few columns, and print them with certain labels (say choice1, choice2 etc). That is easy enough by itself:
firstqn<-data.frame(choice1=data$Q7_2,choice2=data$Q7_3,choice3=data$Q7_4);
secondqn<-data.frame(choice1=data$Q8_6,choice2=data$Q8_7,choice3=data$Q8_8)
print(colSums(firstqn)); print(colSums(secondqn))
The problem comes when I want to repeat the above steps with different filters - say, only the rows where gender == 2.
The only way I know how is to create a new dataset data2 and replace data$ with data2$ in every line of the above code, such as:
data2<-(data[data$Q2==2,])
firstqn<-data.frame(choice1=data2$Q7_2,choice2=data2$Q7_3,choice3=data2$Q7_4);
However, I have 6 choices for each of 5 questions and am planning to apply about 5-10 different filters, and I don't relish the thought of copy/pasting data2, data3, etc. hundreds of times.
So my question is: Is there any way of getting R to reference data by default without using data$ in front of every variable name?
I can probably use attach() to achieve this, but I really don't want to:
data2<-(data[data$Q2==2,])
attach(data2)
firstqn<-data.frame(choice1=Q7_2,choice2=Q7_3,choice3=Q7_4);
detach(data2)
Is there a command like attach() that would allow me to avoid using data$ in front of every variable, for a specified block of code? Then whenever I wanted to create a new filter, I could just copy/paste the same code and change the first command (defining a new dataset).
I guess I'm looking for some command like with(data2, *insert multiple commands here*)
Alternatively, if anyone has a better way to do the above in an entirely different way, please enlighten me - I'm not very proficient at R (yet).
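with() is exactly the base-R tool this question is reaching for: it evaluates a block of code with the data frame's columns in scope, without attach()'s side effects. A sketch reusing the column names from the question:

data2 <- data[data$Q2 == 2, ]

with(data2, {
  firstqn <- data.frame(choice1 = Q7_2, choice2 = Q7_3, choice3 = Q7_4)
  secondqn <- data.frame(choice1 = Q8_6, choice2 = Q8_7, choice3 = Q8_8)
  print(colSums(firstqn))
  print(colSums(secondqn))
})

Wrapping that block in a function of the data frame, say sum_choices <- function(d) with(d, { ... }), turns each new filter into a one-line call such as sum_choices(data[data$Q2 == 2, ]).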
