How to split large Excel file into multiple Excel files using R

I'm looking for a way to split up a large Excel file into a number of smaller Excel files using R.
Specifically, there are three things I would like to do:
1. I have a large data set consisting of information regarding students (their school, the area in which the school is located, test score A, test score B) that I would like to split up into individual files, one file per school, each containing all of the students attending that specific school.
2. I would also like every individual Excel file to contain an image covering the first row and columns A, B, C & D. The image will be the same for all the schools in the data set.
3. Lastly, I would like the Excel files, after being created, to end up in individual folders on my desktop. Each folder's name would be the area in which the schools are located. An area has about 3-5 schools, so each folder would contain 3-5 Excel files, one per school.
My data is structured like this:
Area   School  Student ID  Test score A  Test score B
North  A       134         24            31
North  A       221         26            33
South  B       122         22            21
South  B       126         25            25
I have data covering roughly 200 schools located in 5 different areas.
Any guidance on how to do this would be greatly appreciated!

As some of the comments noted, this is hard to solve without knowing your specific operating environment and folder structure. I solved it on Windows 10 using the C: drive user folder, but you can adapt it to your system. You will need a folder containing the image for each school, saved under the school's name (or the ID I created), and all images need to be in the same format (JPG or PNG). You also need a folder for each Area you want to output to, because openxlsx writes the files but does not create the folders for you. Once those are set up, something similar to this should work for you, but I would highly recommend referring to the openxlsx documentation for more info:
library(dplyr)
library(openxlsx)
# Load your Excel file into a data frame
# g0 = openxlsx::read.xlsx(<your excel file & sheet..see openxlsx documentation>)
# Replace this tibble with your actual Excel file; it is only an example
g0 = tibble(
  Area = c("North","North","North","North"),
  School = c("A","A","B","B"),
  Student_ID = c(134,221,122,126),
  test_score_a = c(24,26,22,25),
  test_score_b = c(31,33,21,25))
# Need a numeric school id for the loop
g0$school_id = as.numeric(as.factor(g0$School))
# Loop through schools, filter using dplyr and create one workbook per school
for (i in seq_len(n_distinct(g0$school_id))){
  g1 = g0 %>%
    filter(school_id == i)
  # Every row of g1 shares one school and one area, so take the first value
  school <- as.character(g1$School[1])
  area <- as.character(g1$Area[1])
  ## Create a new workbook
  wb <- createWorkbook()
  ## Add a worksheet named after the school
  addWorksheet(wb, school)
  ## Write this school's students to the sheet (helper id column dropped);
  ## adjust startRow/startCol here or below so data and image do not overlap
  writeData(wb, school, select(g1, -school_id))
  ## Insert images
  ## I left the image as a direct path for example but you can change to a
  ## relative path once you get it working.
  ## file.path() builds an ordinary file system path; system.file() only
  ## looks inside installed packages
  img <- file.path("C:","Users","your name","Documents","A","A.jpg")
  insertImage(wb, school, img, startRow = 5, startCol = 3, width = 6, height = 5)
  ## Save workbook into the (already existing) folder for this school's area
  saveWorkbook(wb, paste0("C://Users//your name//Documents//",area,"//",school,".xlsx"), overwrite = TRUE)
}
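If you would rather not create the Area folders by hand, base R's dir.create() can make them from inside the loop. Below is a minimal sketch of that idea, not a drop-in replacement for the code above: the data frame `students`, the output root `out_root` (a temporary folder here) and the use of `openxlsx::write.xlsx()` as a shortcut are my own assumptions for illustration, and it assumes school names are unique across areas.

```r
# Example data shaped like the question's table (values copied from it)
students <- data.frame(
  Area = c("North", "North", "South", "South"),
  School = c("A", "A", "B", "B"),
  Student_ID = c(134, 221, 122, 126),
  test_score_a = c(24, 26, 22, 25),
  test_score_b = c(31, 33, 21, 25),
  stringsAsFactors = FALSE
)

# Hypothetical output root; point this at your desktop folder instead
out_root <- file.path(tempdir(), "schools_by_area")

# split() returns one data frame per school
by_school <- split(students, students$School)

out_paths <- vapply(by_school, function(school_df) {
  # Create the area folder if it does not exist yet
  area_dir <- file.path(out_root, school_df$Area[1])
  dir.create(area_dir, recursive = TRUE, showWarnings = FALSE)
  out_file <- file.path(area_dir, paste0(school_df$School[1], ".xlsx"))
  # Only write the workbook when openxlsx is installed
  if (requireNamespace("openxlsx", quietly = TRUE)) {
    openxlsx::write.xlsx(school_df, out_file, overwrite = TRUE)
  }
  out_file
}, character(1))

print(out_paths)
```

To add the image as well, you would keep the createWorkbook()/insertImage()/saveWorkbook() sequence from the answer inside the function instead of the write.xlsx() shortcut.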

Related

Import fixed width data in R

I have a problem importing a file in R. When correctly organized, the file should contain 5 million records and 22 columns, but I cannot split the data properly. I tried it with this code:
content <- scan("filepath",'character',sep='~') # Read the file as a single line
# To split content in lines:
lines <- regmatches(content,gregexpr(".{211}",content)) #Each line must have 211 characters with 5 million rows in total
x <- tempfile()
library(erer)
write.list(lines,x)
data <- read.fw(x, widths = c(12,9,9,3,4,8,1,1,3,3,3,1,12,14,13,30,8,9,12,6,6,27))
unlink(x)
Each record contains numbers and letters. I don't know what to correct so that the columns separate properly.
All rows look like this:
1000100060040000000000808040512000000188801072010010010000000000000 CABANILLAS GONZALES MARIA MANUEL CABANILLAS MARIA GONZALES 00000000000000000000000
I want to separate it according to the widths specified in the function.
It includes some spaces that I cannot include in the final view.
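One thing that may be worth checking first: the 22 widths passed to read.fw() above add up to 194, while each record is said to be 211 characters long. As a rough sketch of the general technique only (a toy string with 10-character records and made-up widths, not the real file), base R's substring() can cut one long line into records and then into fields:

```r
# Toy input: three 10-character records stored back to back on one line
content <- "AAA0012345BBB0067890CCC0000042"
record_len <- 10
widths <- c(3, 3, 4)  # made-up field widths for the toy data

# The field widths must account for every character of a record
stopifnot(sum(widths) == record_len)

# Cut the single line into fixed-length records
starts <- seq(1, nchar(content), by = record_len)
records <- substring(content, starts, starts + record_len - 1)

# Cut each record into fields at the cumulative width boundaries
field_end <- cumsum(widths)
field_start <- field_end - widths + 1
fields <- t(sapply(records, substring, field_start, field_end,
                   USE.NAMES = FALSE))
parsed <- as.data.frame(fields, stringsAsFactors = FALSE)

print(parsed)
```

If padding spaces inside a field are unwanted, trimws() can be applied to the character columns afterwards.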

Loop through a list of identifiers to load corresponding excel files?

I can import a column of unique identifiers into R as an object. There is an Excel file with a matching name for each identifier (around 500). I am trying to write a loop that goes through all of these unique IDs and loads the corresponding Excel file.
What I tried is:
for (i in 1:nrow(pi)){
read_excel()
}
Update:
Just to clarify, because I don't think I provided adequate examples:
I have an Excel column which consists of about 500 unique IDs, each of which is a series of 11 numbers or so. For each ID, I have an Excel file with a matching name. All the Excel files are in the same folder. For each unique ID, I would like to open the file with the matching name and retrieve specific cells, i.e. the bottom and top values in a particular column, the maximum value or the mean of another column, etc.
In the code above, "pi" is the object which should be a vector of the unique IDs. I'm not sure how to complete this. Alternative methods of solving this problem are welcome. Realistically, I am just trying to retrieve specific values from the Excel files, i.e. first and last values in a certain column, maximum and mean of another column, etc.
Since the original post didn't provide data, I will illustrate one technique where we use a vector of id numbers to generate file names to read multiple spreadsheets associated with Basic Pokémon stats for generations 1 - 8 of Pokémon.
To make the example completely reproducible, I maintain a zip file with this data on GitHub which we can download and load into R.
We will use the sprintf() function to create the file names, because sprintf() allows us not only to add the directory information needed to locate the files, but also to format the numbers with leading zeroes, which are required to generate the right file names.
Instead of a for() loop we will use lapply() along with an anonymous function to create the file names and read them as Excel files with readxl::read_excel().
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonXLSX.zip",
"PokemonXLSX.zip",
method="curl",mode="wb")
unzip("PokemonXLSX.zip",exdir="./pokemonData")
library(readxl)
# create a set of numbers to be used to generate
generationIds <- 1:8
spreadsheets <- lapply(generationIds,function(x) {
  # use generation number to create individual file name
  # (directory name must match the exdir= used in unzip() above)
  aFile <- sprintf("./pokemonData/gen%02i.xlsx",x)
  read_excel(aFile)
})
At this point the object spreadsheets is a list with eight elements, one corresponding to each generation of Pokémon (i.e one element per spreadsheet).
We can combine the eight files with rbind(), and then print the last few rows of the resulting data frame.
pokemon <- do.call(rbind,spreadsheets)
tail(pokemon)
...and the result:
> tail(pokemon)
# A tibble: 6 x 13
ID Name Form Type1 Type2 Total HP Attack Defense Sp..Atk Sp..Def
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 895 Regi… NA Drag… NA 580 200 100 50 100 50
2 896 Glas… NA Ice NA 580 100 145 130 65 110
3 897 Spec… NA Ghost NA 580 100 65 60 145 80
4 898 Caly… NA Psyc… Grass 500 100 80 80 80 80
5 898 Caly… Ice … Psyc… Ice 680 100 165 150 85 130
6 898 Caly… Shad… Psyc… Ghost 680 100 85 80 165 100
# … with 2 more variables: Speed <dbl>, Generation <dbl>
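Since the question asks for specific values from each file (first and last values of one column, maximum and mean of another), here is a hedged sketch of how that could look once the files are in a list. The list `sheets`, the IDs and the column names `value` and `score` are invented for illustration; in real code the list would come from lapply() with read_excel() as above.

```r
# Stand-in for the list of spreadsheets, named by (made-up) ID
sheets <- list(
  "00000000001" = data.frame(value = c(5, 7, 9), score = c(1, 4, 2)),
  "00000000002" = data.frame(value = c(3, 8), score = c(10, 6))
)

# Build one summary row per ID and stack the rows into a data frame
summaries <- do.call(rbind, lapply(names(sheets), function(id) {
  d <- sheets[[id]]
  data.frame(
    id = id,
    first_value = d$value[1],        # top value of the column
    last_value = d$value[nrow(d)],   # bottom value of the column
    max_score = max(d$score),
    mean_score = mean(d$score),
    stringsAsFactors = FALSE
  )
}))

print(summaries)
```

Naming the list elements by ID keeps each summary row traceable to the file it came from.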
Spotlight: accessing the files from disk
To isolate the downloaded files, we use the exdir= argument on unzip() to write the unzipped files to a subdirectory of the R working directory.
We can access files in this subdirectory by adding ./pokemonData/ to their file names. The . in this syntax references the current directory.
We can illustrate how the filenames are created with the following code.
theFiles <- lapply(generationIds,function(x) {
  # use generation number to create individual file name
  aFile <- sprintf("./pokemonData/gen%02i.xlsx",x)
  message(paste("current file is: ",aFile))
  aFile
})
...and the output:
> theFiles <- lapply(generationIds,function(x) {
+ # use generation number to create individual file name
+ aFile <- sprintf("./pokemonData/gen%02i.xlsx",x)
+ message(paste("current file is: ",aFile))
+ aFile
+ })
current file is: ./pokemonData/gen01.xlsx
current file is: ./pokemonData/gen02.xlsx
current file is: ./pokemonData/gen03.xlsx
current file is: ./pokemonData/gen04.xlsx
current file is: ./pokemonData/gen05.xlsx
current file is: ./pokemonData/gen06.xlsx
current file is: ./pokemonData/gen07.xlsx
current file is: ./pokemonData/gen08.xlsx
One can identify the R working directory from within RStudio with the getwd() function. On my MacBook Pro, I get the following result.
> getwd()
[1] "/Users/lgreski/gitrepos/datascience"
>

Create a new row to assign M/F to a column based on heading, referencing second table?

I am new to R (and coding in general) and am really stuck on how to approach this problem.
I have a very large data set; columns are sample ID# (~7000 samples) and rows are gene expression (~20,000 genes). Column headings are BIOPSY1-A, BIOPSY1-B, BIOPSY1-C, ..., BIOPSY200-Z. Each number (1-200) is a different patient, and each sample for that patient is a different letter (-A, -Z).
I would like to do some comparisons between samples that came from men and women. Gender is not included in this gene expression table. I have a separate file with patient numbers (BIOPSY1-200) and their gender M/F.
I would like to code something that will look at the column ID (ex: BIOPSY7-A), recognize that it includes "BIOPSY7" (but not == BIOPSY7 because there is BIOPSY7-A through BIOPSY7-Z), find "BIOPSY7" in the reference file, extrapolate M/F, and create a new row with M/F designation.
Honestly, I am so overwhelmed with coding this that I tried to open the file in Excel to manually input M/F, for the 7000 columns as it would probably be faster. However, the file is so large that Excel crashes when it opens.
Any input or resources that would put me on the right path would be extremely appreciated!!
I don't know exactly what your data looks like, so I made mine based on your description. I'm sure you can modify this answer based on your needs and the structure of your dataset:
library(data.table)
genderfile <-data.frame("ID"=c("BIOPSY1", "BIOPSY2", "BIOPSY3", "BIOPSY4", "BIOPSY5"),"Gender"=c("F","M","M","F","M"))
#you can just read in your gender file to r with the line below
#genderfile <- read.csv("~/gender file.csv")
View(genderfile)
df<-matrix(rnorm(45, mean=10, sd=5),nrow=3)
colnames(df)<-c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY1-C", "BIOPSY2-A", "BIOPSY2-B", "BIOPSY2-C","BIOPSY3-A", "BIOPSY3-B", "BIOPSY3-C","BIOPSY4-A", "BIOPSY4-B", "BIOPSY4-C","BIOPSY5-A", "BIOPSY5-B", "BIOPSY5-C")
df<-cbind(Gene=1:3,df)
df<-as.data.frame(df)
#you can read your main df into R with the line below; fread keeps the dashes in the column names instead of turning them into periods (requires the data.table package to be installed and loaded)
#df<-fread("~/first file.csv")
View(df)
Note that the following line of code removes the dash and letter from the column names of df (I removed the first column by df[,-c(1)] because it is the Gene id):
substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2)
#[1] "BIOPSY1" "BIOPSY1" "BIOPSY1" "BIOPSY2" "BIOPSY2" "BIOPSY2" "BIOPSY3" "BIOPSY3" "BIOPSY3" "BIOPSY4" "BIOPSY4"
#[12] "BIOPSY4" "BIOPSY5" "BIOPSY5" "BIOPSY5"
Now, we are ready to match the columns of df with the ID in genderfile to get the Gender column:
Gender<-genderfile[, "Gender"][match(substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2), genderfile[,"ID"])]
Gender
#[1] F F F M M M M M M F F F M M M
Last step is to add the Gender defined above as a row to the df:
df_withGender<-rbind(c("Gender", as.character(Gender)), df)
View(df_withGender)
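A variation on the same idea, offered as a sketch rather than a correction: sub() with a regular expression strips the suffix regardless of how many letters follow the dash, and the result can be kept as a separate vector aligned with the sample columns. The sample IDs and the lookup table below are made up for illustration.

```r
# Made-up sample column names and gender lookup table
sample_ids <- c("BIOPSY1-A", "BIOPSY7-Z", "BIOPSY12-B", "BIOPSY200-AA")
genders <- data.frame(
  ID = c("BIOPSY1", "BIOPSY7", "BIOPSY12", "BIOPSY200"),
  Gender = c("F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# Drop the trailing dash and letters to recover the patient ID
patient_ids <- sub("-[A-Za-z]+$", "", sample_ids)

# Look up each sample's gender by exact match on the patient ID
sample_gender <- genders$Gender[match(patient_ids, genders$ID)]

print(data.frame(sample_ids, patient_ids, sample_gender))
```

Keeping the gender in its own vector also avoids a side effect of rbind(): adding a character row to a numeric data frame converts every column to character, which gets in the way of later comparisons between the male and female samples.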

How to convert a PDF listing the world's ministers and cabinet members by country to a .csv in R

The CIA publishes a list of world leaders and cabinet ministers for all countries multiple times a year. This information is in PDF form.
I want to convert this PDF to CSV using R and then separate and tidy the data.
I am getting the PDF from "https://www.cia.gov/library/publications/resources/world-leaders-1/"
under the link 'PDF Version for Prior Years' located at the center right hand side of the page.
Each PDF has some introductory pages and then lists the Leaders and Ministers for each country, with each 'Title' and 'Name' separated by a '..........' of varying length.
I have tried to use the pdftools package to convert from PDF, but I am not quite sure how to deal with the format of the data for sorting and tidying.
Here are the first steps I have taken with a downloaded PDF:
library(pdftools)
text <- pdf_text("Data/April2006ChiefsDirectory.pdf")
test <- as.data.frame(text)
Starting with a single PDF, I want to list each Minister in a separate row, with individual columns for year, country, title and name.
With the steps I have taken so far, converting the PDF into .csv without any additional tidying, the data is in a single column and each row has a string of text containing titles and names for multiple countries.
I am a novice at data tidying, so any help would be much appreciated.
You can do it with tabulizer, but it is going to require some work to clean it up if you want to import all 240 pages of the document.
Here I import page 4, which is the first page with information about the leaders:
library(tabulizer)
mw_table <- extract_tables(
  "https://www.cia.gov/library/publications/resources/world-leaders-1/pdfs/2019/January2019ChiefsDirectory.pdf",
  output = "data.frame",
  pages = 4,
  area = list(c(35.68168, 40.88842, 740.97853, 497.74737 )),
  guess = FALSE
)
head(mw_table[[1]])
#> X Afghanistan
#> 1 Last Updated: 20 Dec 2017
#> 2 Pres. Ashraf GHANI
#> 3 CEO Abdullah ABDULLAH, Dr.
#> 4 First Vice Pres. Abdul Rashid DOSTAM
#> 5 Second Vice Pres. Sarwar DANESH
#> 6 First Deputy CEO Khyal Mohammad KHAN
You can pass a vector of the pages you want to import as the pages argument. Keep in mind that the country names will be buried among the people's names in the second column. You can probably identify the rows that hold a country name by looking for the empty "" entries in the first column.
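A minimal sketch of that clean-up idea, on a hand-made data frame rather than real tabulizer output: the column names `title` and `name` and the second country's rows are invented, and it assumes "Last Updated" lines also have an empty first column, as in the printed output above.

```r
# Hand-made stand-in for one extracted table (second country is invented)
raw <- data.frame(
  title = c("", "", "Pres.", "CEO", "", "Pres."),
  name = c("Afghanistan", "Last Updated: 20 Dec 2017", "Ashraf GHANI",
           "Abdullah ABDULLAH, Dr.", "Country B", "Person One"),
  stringsAsFactors = FALSE
)

# A row is a country header if its title is empty and it is not a date line
is_header <- raw$title == "" & !grepl("^Last Updated", raw$name)

# Carry each country name down to the rows that follow it;
# rows before the first header (if any) get NA
country <- c(NA, raw$name[is_header])[cumsum(is_header) + 1]

# Keep only the rows that describe a person
keep <- raw$title != ""
tidy <- data.frame(
  country = country[keep],
  title = raw$title[keep],
  name = raw$name[keep],
  stringsAsFactors = FALSE
)

print(tidy)
```

A year column could then be added from the file name of the PDF being processed.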

Picking up values from several csv files in R

I have a folder where several csv files are dumped. The file names follow a pattern such as Product_1234.csv, Product_2121.csv, etc.
The column names in these files differ, but one column, "Profit", is present in all of them. Hence Product_1234.csv and Product_2121.csv will both have Profit as a column.
I have another csv file, my_csv.csv, in which the data is in the following format:
Product Cost
1234 12
2345 10
2121 15
I want to add another column to my_csv named Profit. This column should hold the Profit from the product files described above. For example, to get the Profit for Product 1234, we have to find the file whose name contains "1234" and then pick up the Profit from that file. I am not sure if this can be done in R. Please help.
The output file, i.e. my_csv, will be something like this:
Product Cost Profit
1234 12 3
2121 15 1
Something like this would do it:
# Dummy data - read this in from your my_csv.csv file
my_csv_data = data.frame(
  Product = c(1234, 2121),
  Cost = c(12, 15)
)
profits <- c()
for(productNumber in my_csv_data$Product) {
  # Build the file name for this product (looked up in the working directory)
  fileName <- paste0("Product_", productNumber, ".csv")
  productData <- read.csv(fileName)
  # Take the first Profit value found in the product file
  profits <- c(profits, productData$Profit[1])
}
my_csv_data$Profit <- profits
This gives you somewhere to start from. There are certainly faster ways of doing it if performance becomes an issue.
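As one possible variation (a sketch under assumptions, not the only way), vapply() avoids growing the vector inside a loop, and a file.exists() check handles products such as 2345 that have no matching file. The temporary folder `csv_dir`, the two product files written into it and the helper `read_profit()` are created here purely so the example is self-contained.

```r
# Create two small product files in a temporary folder for the example
csv_dir <- tempfile("products_")
dir.create(csv_dir)
write.csv(data.frame(Units = 5, Profit = 3),
          file.path(csv_dir, "Product_1234.csv"), row.names = FALSE)
write.csv(data.frame(Profit = 1, Region = "X"),
          file.path(csv_dir, "Product_2121.csv"), row.names = FALSE)

my_csv_data <- data.frame(Product = c(1234, 2345, 2121),
                          Cost = c(12, 10, 15))

# Hypothetical helper: first Profit value for a product, NA if no file exists
read_profit <- function(product) {
  path <- file.path(csv_dir, paste0("Product_", product, ".csv"))
  if (!file.exists(path)) return(NA_real_)
  as.numeric(read.csv(path)$Profit[1])
}

my_csv_data$Profit <- vapply(my_csv_data$Product, read_profit, numeric(1))

# Drop products without a matching file, as in the desired output
result <- my_csv_data[!is.na(my_csv_data$Profit), ]
print(result)
```

write.csv(result, "my_csv.csv", row.names = FALSE) would then save the table back to disk.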
