Picking up values from several csv files in R

I have a folder where several csv files are dumped. File names can be Product_1234.csv, Product_2121.csv, etc.
The column names in these files differ. However, there is always one column, "Profit", which is present in all of them. Hence Product_1234.csv and Product_2121.csv will both have Profit as a column.
I have another csv file, my_csv.csv, in which the data is in the following format:
Product Cost
1234 12
2345 10
2121 15
I want to add another column to my_csv named Profit. This column should hold the Profit from the multiple files described above. For example, to get the Profit for Product 1234 we have to search for the filename that contains "1234" and then pick up the Profit from that file. I am not sure if this can be done in R. Please help.
The output file, my_csv, will be something like this:
Product Cost Profit
1234 12 3
2121 15 1

Something like this would do it:
# Dummy data - read this in from your my_csv.csv file
my_csv_data <- data.frame(
  Product = c(1234, 2121),
  Cost = c(12, 15)
)
profits <- c()
for (productNumber in my_csv_data$Product) {
  # Build the file name for this product, e.g. "Product_1234.csv"
  fileName <- paste0("Product_", productNumber, ".csv")
  productData <- read.csv(fileName)
  # Pick up the Profit value from that product's file
  profits <- c(profits, productData$Profit[1])
}
my_csv_data$Profit <- profits
There are certainly faster ways of doing this, but it gives you somewhere to start if performance turns out to be an issue.
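If performance does matter, here is a minimal sketch of a vectorised version, assuming every Product_<id>.csv file exists and contains a Profit column; it avoids growing the vector inside the loop:
# Sketch: build the whole Profit column in one sapply call;
# assumes one Product_<id>.csv file per product, each with a Profit column
my_csv_data$Profit <- sapply(my_csv_data$Product, function(productNumber) {
  read.csv(paste0("Product_", productNumber, ".csv"))$Profit[1]
})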

Related

R xml2: How to query only corresponding xml nodes

I'm trying to read and transform many XML files into R data frames (or preferably tibbles).
All R packages I've tried (XML, flatxml, xmlconvert) unfortunately failed when I tried to convert the files using their built-in functions (e.g. xmlToDataFrame from the XML package and xml_to_df from the xmlconvert package), so I have to do it manually with xml2.
Here is my question with a small working example:
# Minimal Working Example
library(tidyverse)
library(xml2)
interimxml <- read_xml("<Subdivision>
                          <Name>Charles</Name>
                          <Salary>100</Salary>
                          <Name>Laura</Name>
                          <Name>Steve</Name>
                          <Salary>200</Salary>
                        </Subdivision>")
names <- xml_text(xml_find_all(interimxml, "//Subdivision/Name"))
salary <- xml_text(xml_find_all(interimxml, "//Subdivision/Salary"))
names
salary
# combine into a tibble (doesn't work because of unequal vector lengths)
result <- tibble(names = names,
                 salary = salary)
result
rbind(names, salary)
From the (made-up) XML file you can see that Charles earns 100 dollars, Laura earns nothing (because of the missing entry; this is the problem) and Steve earns 200 dollars.
What I want xml2 to do, when querying the name and salary nodes, is return an "NA" (or zero, which would also be okay) whenever it finds a name but no corresponding salary entry, so that I would end up with a nice table like this:
Name     Salary
Charles  100
Laura    NA
Steve    200
I know that I could modify the "xpath" to only pick up the last value (for Steve), but that wouldn't help me, since (in the real data) it could also be the 100th or the 23rd person with missing salary information.
[I'm aware that the salary numbers are pulled as character values from the xml file. I would mutate(across(salary, as.double)) over the columns afterwards.]
Any help is highly appreciated. Thank you very much in advance.
You need to be a bit more careful to match up the names and salaries. Basically, first find all the <Name> nodes, then for each one check whether its next sibling is a <Salary> node. If not, return NA.
nameNodes <- xml_find_all(interimxml, "//Subdivision/Name")
names <- xml_text(nameNodes)
# For each <Name>, take the text of its immediately following sibling,
# but only when that sibling is a <Salary> node; otherwise this yields NA
salary <- map_chr(nameNodes, ~ xml_text(xml_find_first(., "./following-sibling::*[1][self::Salary]")))
tibble::tibble(names, salary)
# names salary
# <chr> <chr>
# 1 Charles 100
# 2 Laura NA
# 3 Steve 200
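And since you mentioned converting the salaries afterwards, a small follow-up sketch using the mutate/across call from your question to turn the character column into numbers:
# Convert the character salary column pulled from the XML to numeric
result <- tibble::tibble(names, salary) %>%
  mutate(across(salary, as.double))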

How to split large Excel file into multiple Excel files using R

I'm looking for a way to split up a large Excel file into a number of smaller Excel files using R.
Specifically, there are three things I would like to do:
I have a large data set consisting of information regarding students (their school, the area in which the school is located, test score A, test score B) that I would like to split up into individual files, one file per school containing all of the students attending that specific school.
I would also like each of the individual Excel files to contain an image covering the first row and columns A, B, C & D. The image will be the same for all the schools in the data set.
Lastly, I would also like the Excel files, after being created, to end up in individual folders on my desktop. Each folder's name would be the area in which its schools are located. An area has about 3-5 schools, so each folder would contain 3-5 Excel files, one for each school.
My data is structured like this:
Area   School  Student ID  Test score A  Test score B
North  A       134         24            31
North  A       221         26            33
South  B       122         22            21
South  B       126         25            25
I have data covering roughly 200 schools located in 5 different areas.
Any guidance on how to do this would be greatly appreciated!
As some of the comments have referenced, this will be hard to solve without knowing your specific operating environment and folder structure. I solved this using a Windows 10 user folder on the C drive, but you can customize the paths to your system. You're going to need a folder with all the images for the schools, saved under the name (or the numeric ID created below) of each school, and they will all need to be in the same format (JPG or PNG). Plus, you need folders created for each Area you want to output to (openxlsx can write the files but not create the folders for you; see the sketch after the code). Once you have those set up, something similar to this should work for you, but I would highly recommend referring to the openxlsx documentation for more info:
library(dplyr)
library(openxlsx)
# Load your excel file into a df
# g0 <- openxlsx::read.xlsx(<your excel file & sheet..see openxlsx documentation>)
# Replace this tibble with your actual excel file, this was just for an example
g0 <- tibble(
  Area = c("North", "North", "South", "South"),
  School = c("A", "A", "B", "B"),
  Student_ID = c(134, 221, 122, 126),
  test_score_a = c(24, 26, 22, 25),
  test_score_b = c(31, 33, 21, 25))
# Need a numeric school id for the loop
g0$school_id <- as.numeric(as.factor(g0$School))
# Loop through schools, filter using dplyr and create a workbook per school
for (i in 1:n_distinct(g0$school_id)) {
  g1 <- g0 %>%
    filter(school_id == i)
  school <- as.character(g1$School[1])
  area <- as.character(g1$Area[1])
  ## Create a new workbook with one worksheet named after the school
  wb <- createWorkbook(school)
  addWorksheet(wb, school)
  ## Insert the school's image
  ## I left the image as a direct path for the example but you can change to a
  ## relative path once you get it working
  img <- file.path("C:", "Users", "your name", "Documents", school, paste0(school, ".jpg"))
  insertImage(wb, school, img, startRow = 5, startCol = 3, width = 6, height = 5)
  ## Write the school's student data (adjust startRow so it sits below the image)
  writeData(wb, school, g1, startRow = 12)
  ## Save the workbook into the folder for the school's area
  saveWorkbook(wb, paste0("C:/Users/your name/Documents/", area, "/", school, ".xlsx"),
               overwrite = TRUE)
}
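Since openxlsx won't create the area folders for you, you could also create one folder per area up front from the data. A small sketch, using the same illustrative base path as above:
# Create one output folder per area before running the loop above
# (the base path is illustrative; adjust it to your system)
for (area in unique(g0$Area)) {
  dir.create(file.path("C:", "Users", "your name", "Documents", area),
             showWarnings = FALSE)
}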

Create a new row to assign M/F to a column based on heading, referencing second table?

I am new to R (and coding in general) and am really stuck on how to approach this problem.
I have a very large data set; columns are sample ID# (~7000 samples) and rows are gene expression (~20,000 genes). Column headings are BIOPSY1-A, BIOPSY1-B, BIOPSY1-C, ..., BIOPSY200-Z. Each number (1-200) is a different patient, and each sample from that patient is a different letter (-A through -Z).
I would like to do some comparisons between samples that came from men and women. Gender is not included in this gene expression table. I have a separate file with patient numbers (BIOPSY1-200) and their gender M/F.
I would like to code something that will look at the column ID (e.g. BIOPSY7-A), recognize that it includes "BIOPSY7" (but not just match == "BIOPSY7", because there is BIOPSY7-A through BIOPSY7-Z), find "BIOPSY7" in the reference file, pull out the M/F value, and create a new row with the M/F designation.
Honestly, I am so overwhelmed by coding this that I tried to open the file in Excel to manually input M/F for the 7000 columns, as that would probably be faster. However, the file is so large that Excel crashes when it opens.
Any input or resources that would put me on the right path would be extremely appreciated!!
I don't quite know what your data looks like, so I made mine up based on your description. I'm sure you can modify this answer based on your needs and your dataset structure:
library(data.table)
genderfile <- data.frame("ID" = c("BIOPSY1", "BIOPSY2", "BIOPSY3", "BIOPSY4", "BIOPSY5"),
                         "Gender" = c("F", "M", "M", "F", "M"))
#you can just read in your gender file to r with the line below
#genderfile <- read.csv("~/gender file.csv")
View(genderfile)
df <- matrix(rnorm(45, mean = 10, sd = 5), nrow = 3)
colnames(df) <- c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY1-C", "BIOPSY2-A", "BIOPSY2-B", "BIOPSY2-C", "BIOPSY3-A", "BIOPSY3-B", "BIOPSY3-C", "BIOPSY4-A", "BIOPSY4-B", "BIOPSY4-C", "BIOPSY5-A", "BIOPSY5-B", "BIOPSY5-C")
df <- cbind(Gene = 1:3, df)
df <- as.data.frame(df)
#you can just read your main df into r with the line below; fread keeps the dashes
#in the column names from turning into periods (you need the data.table package
#installed and loaded)
#df <- fread("~/first file.csv")
View(df)
Note that the following line of code removes the dash and letter from the column names of df (I removed the first column with df[,-c(1)] because it is the Gene id):
substr(x = names(df[, -c(1)]), start = 1, stop = nchar(names(df[, -c(1)])) - 2)
#[1] "BIOPSY1" "BIOPSY1" "BIOPSY1" "BIOPSY2" "BIOPSY2" "BIOPSY2" "BIOPSY3" "BIOPSY3" "BIOPSY3" "BIOPSY4" "BIOPSY4"
#[12] "BIOPSY4" "BIOPSY5" "BIOPSY5" "BIOPSY5"
Now, we are ready to match the columns of df with the ID in genderfile to get the Gender column:
Gender <- genderfile[, "Gender"][match(substr(x = names(df[, -c(1)]), start = 1, stop = nchar(names(df[, -c(1)])) - 2), genderfile[, "ID"])]
Gender
#[1] F F F M M M M M M F F F M M M
The last step is to add the Gender defined above as a row to the df:
df_withGender <- rbind(c("Gender", as.character(Gender)), df)
View(df_withGender)
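As a side note, if your patient IDs can have different lengths, a regex-based sketch (not part of the original answer) is more robust than counting characters with substr:
# Strip the "-<letter>" suffix with a regex instead of fixed-width substr,
# so IDs of any length (e.g. BIOPSY7 vs BIOPSY100) are handled the same way
ids <- sub("-[A-Z]$", "", names(df)[-1])
Gender <- genderfile$Gender[match(ids, genderfile$ID)]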

Extracting and storing data from a very large file in R

I have a very large DAT file (16 GB). It contains some information on, let's say, 1000 customers. The data look like the sample below, where the first column represents the customer IDs:
9909814 246766 0 31/07/2012 7:00 0.03 0 0 0 0
8211675 262537 0 8/04/2013 3:00 0.52 0 0 0 0
However, each customer's data are not stored contiguously. So, I want to extract the data for each customer and store it in a separate file. (I have a file that contains the customer IDs.)
For just one customer, I wrote the following code, which can search through the file and extract the data. However, my problem is how to do this for all the customers while reading this big file into R.
con <- file('D:/CD_INTERVAL_READING.DAT')
open(con)
n <- 20
nk <- 100000
B <- 9909814 # customer ID for customer no. 1
customer1 <- read.table(con, sep = ",", nrow = 1)
for (i in 1:n) {
  conn <- read.table(con, sep = ",", skip = (i - 1) * nk, nrow = nk)
  ## extracts just those rows that belong to a specific customer ID
  temp1 <- conn[conn$V1 == B, ]
  customer1 <- rbind(customer1, temp1)
}
customer1 <- customer1[-1, ]
library(xlsx)
write.xlsx(customer1, "D:/customer1.xlsx")
The optimal solution would probably be to import the data into a proper database, but if you really want to split the file into multiple files based on the first token, you can use awk with this one-liner:
awk '/^/ {ofn=$1 ".txt"} ofn {print > ofn}' filetosplit.txt
It works by:
/^/ matching the start of every line (i.e. matching every line);
{ofn=$1 ".txt"} setting the ofn variable to the first field (split by whitespace) with ".txt" appended;
ofn {print > ofn} printing each line to the file named by ofn.
It takes me just under two minutes on my laptop to split a 1 GB file with the same format as you listed above into multiple text files. I have no idea how well that scales or whether it's fast enough for you. If you want an R solution you can always wrap it in a system() call ;o)
Addendum:
Oh ... I'm guessing you are on Windows based on the path you mentioned. Then you may need to install Cygwin to get awk.
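For completeness, a minimal sketch of that system() wrapper, assuming awk is on your PATH (e.g. via Cygwin) and using the input path from your question; the quoting may need adjusting depending on which shell system() hands the command to:
# Run the awk splitter from inside R (path and quoting are illustrative)
cmd <- "awk '/^/ {ofn=$1 \".txt\"} ofn {print > ofn}' D:/CD_INTERVAL_READING.DAT"
system(cmd)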

Collecting data in one row from different csv files by the name

It's hard to explain what exactly I want to achieve with my script but let me try.
I have 20 different csv files, so I loaded them into R:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
then with your help I combined them into one and removed all of the duplicates:
data_rd <- subset(transform(all_data, X = sub("\\..*", "", X)),
!duplicated(X))
I have now 1 master table which includes all of the names (Accession):
Accession
AT1G19570
AT5G38480
AT1G07370
AT4G23670
AT5G10450
AT4G09000
AT1G22300
AT1G16080
AT1G78300
AT2G29570
Now I would like to find each accession in the other csv files and put the data for that accession in the same row. There are about 20 csv files, and each csv has about 20 columns, so in some cases it might give me 400 columns. It doesn't matter how long it takes. It has to be done. Is it even possible to do with R?
Example:
First csv Second csv Third csv
Accession Size Length Weight Size Length Weight Size Length Weight
AT1G19570 12 23 43 22 77 666 656 565 33
AT5G38480
AT1G07370 33 22 33 34 22
AT4G23670
AT5G10450
AT4G09000 12 45 32
AT1G22300
AT1G16080
AT1G78300 44 22 222
AT2G29570
It looks like a hard task to do. Probably it has to be done with a loop. Any ideas?
This is a merge loop. Here is rough R code that will inefficiently grow with each merge.
Begin as before:
tbl <- list.files(pattern = "*.csv")
list_of_data <- lapply(tbl, read.csv)
tbl <- list_of_data[[1]]
for (i in 2:length(list_of_data)) {
  tbl <- merge(tbl, list_of_data[[i]], by = "Accession", all = T)
}
Matching column names (not used as the key) will be renamed with a suffix (.x, .y, and so on), and the all=T argument ensures that whenever a new Accession key is merged a new row is made, with the missing cells filled with NA.
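If you'd rather avoid the explicit loop, the same repeated merge can be written as a fold with Reduce (a sketch equivalent to the loop above):
# Fold merge() over the whole list; all = TRUE keeps every Accession
merged <- Reduce(function(x, y) merge(x, y, by = "Accession", all = TRUE),
                 list_of_data)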
