How do I download and combine multiple files in R?

I am trying to combine files but find myself writing very redundant code, which is cumbersome. I have looked at the documentation but cannot find anything about how to do this.
Basically, I load the files from my local machine and then want to combine the exact same columns from each file (the only difference is the year).
Can you help?
I load the files from my machine ("C:/SAM/CODE1_2005.csv", then "C:/SAM/CODE1_2006.csv", then "C:/SAM/CODE1_2007.csv", and so on until 2016).
I then define the columns, all the same for each year I have downloaded, such as COLLEGESCORECARD05_A<-subset(COLLEGESCORECARD05, select=c(ï..UNITID,OPEID,OPEID6,INSTNM)) and so forth...
and then combine the files into one database.
The issue is that this seems inefficient. Is there a more efficient way?

You can make a list of the .csv files in the folder and then read them all into a single data frame with purrr::map_df(). You can add a column to tell the files apart:
library(tidyverse)
# full.names = TRUE returns full paths, so read_csv() can find the files
df <- list.files(path = "C:/SAM",
pattern = "\\.csv$",
full.names = TRUE) %>%
purrr::map_df(function(x) readr::read_csv(x) %>%
mutate(filename = gsub("\\.csv$", "", basename(x))))
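Since the files in the question differ only by year, another option is to build the paths from the year range and keep the same columns from each file. The following is a minimal base R sketch, assuming the CODE1_<year>.csv naming from the question. The temporary folder, the two-year range, the EXTRA column, and the toy values are made up so the example runs anywhere:

```r
# Hypothetical stand-in for "C:/SAM": a temporary folder with two toy files.
dir_path <- file.path(tempdir(), "SAM")
dir.create(dir_path, showWarnings = FALSE)
years <- 2005:2006  # the question's range would be 2005:2016
for (y in years) {
  toy <- data.frame(UNITID = 1:2, OPEID = 3:4, OPEID6 = 5:6,
                    INSTNM = c("A", "B"), EXTRA = 0)
  write.csv(toy, file.path(dir_path, paste0("CODE1_", y, ".csv")),
            row.names = FALSE)
}

keep <- c("UNITID", "OPEID", "OPEID6", "INSTNM")  # columns shared by every year

# Read one year, keep the shared columns, and tag the rows with the year.
read_year <- function(y) {
  df <- read.csv(file.path(dir_path, paste0("CODE1_", y, ".csv")))
  df <- df[, keep]
  df$year <- y
  df
}

combined <- do.call(rbind, lapply(years, read_year))
```

The ï..UNITID name in the question usually comes from a byte-order mark at the start of the file; passing fileEncoding = "UTF-8-BOM" to read.csv() typically avoids it.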

At the risk of seeming icky for self-promotion, I wrote a function that does exactly this (desiderata::apply_to_files()):
# Apply a function to every file in a folder that matches a regex pattern
rain <- apply_to_files(path = "Raw data/Rainfall", pattern = "csv",
func = readr::read_csv, col_types = "Tiic",
recursive = FALSE, ignorecase = TRUE,
method = "row_bind")
dplyr::sample_n(rain, 5)
#> # A tibble: 5 x 5
#>
#> orig_source_file Time Tips mV Event
#> <chr> <dttm> <int> <int> <chr>
#> 1 BOW-BM-2016-01-15.csv 2015-12-17 03:58:00 0 4047 Normal
#> 2 BOW-BM-2016-01-15.csv 2016-01-03 00:27:00 2 3962 Normal
#> 3 BOW-BM-2016-01-15.csv 2015-11-27 12:06:00 0 4262 Normal
#> 4 BIL-BPA-2018-01-24.csv 2015-11-15 10:00:00 0 4378 Normal
#> 5 BOW-BM-2016-08-05.csv 2016-04-13 19:00:00 0 4447 Normal
In this case, all of the files have identical columns and order (Time, Tips, mV, Event), so I can just method = "row_bind" and the function will automatically add the filename as an extra column. There are other methods available:
"full_join" (the default) returns all columns and rows. "left_join" returns all rows from the first file, and all columns from subsequent files. "inner_join" returns rows from the first file that have matches in subsequent files.
Internally, the function builds a list of files in the path (recursive or not), runs an lapply() on the list, and then handles merging the new list of dataframes into a single dataframe:
apply_to_files <- function(path, pattern, func, ..., recursive = FALSE, ignorecase = TRUE,
method = "full_join") {
file_list <- list.files(path = path,
pattern = pattern,
full.names = TRUE, # Return full relative path.
recursive = recursive, # Search into subfolders.
ignore.case = ignorecase)
df_list <- lapply(file_list, func, ...)
# The .id arg of bind_rows() uses the names to create the ID column.
names(df_list) <- basename(file_list)
out <- switch(method,
"full_join" = plyr::join_all(df_list, type = "full"),
"left_join" = plyr::join_all(df_list, type = "left"),
"inner_join" = plyr::join_all(df_list, type = "inner"),
# The fancy joins don't have orig_source_file because the values were
# getting all mixed together.
"row_bind" = dplyr::bind_rows(df_list, .id = "orig_source_file"))
return(invisible(out))
}
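To make the difference between the join methods concrete, here is a small sketch that uses base R's merge() in place of plyr::join_all(). The two toy tables and their column names are invented for illustration:

```r
# Two toy tables that share a key column but cover different keys.
a <- data.frame(id = c(1, 2, 3), x = c("a1", "a2", "a3"))
b <- data.frame(id = c(2, 3, 4), y = c("b2", "b3", "b4"))

full  <- merge(a, b, by = "id", all = TRUE)    # like "full_join": every id from both
left  <- merge(a, b, by = "id", all.x = TRUE)  # like "left_join": every id from a
inner <- merge(a, b, by = "id")                # like "inner_join": ids present in both

# Number of rows each method keeps.
row_counts <- c(full = nrow(full), left = nrow(left), inner = nrow(inner))
```

Row binding is a different operation: it stacks the tables on top of each other instead of matching rows on a key, which is why it suits files with identical columns.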

Related

Read all csv files in a directory and add the name of each file in a new column

I have this code that reads all CSV files in a directory.
nm <- list.files()
df <- do.call(rbind, lapply(nm, function(x) read_delim(x,';',col_names = T)))
I want to modify it in a way that appends the filename to the data. The result would be a single data frame that has all the CSV files, and inside the data frame, there is a column that specifies from which file the data came. How to do it?
Instead of do.call(rbind, lapply(...)), you can use purrr::map_dfr() with the .id argument:
library(readr)
library(purrr)
df <- list.files() |>
set_names() |>
map_dfr(read_delim, .id = "file")
df
# A tibble: 9 × 3
file col1 col2
<chr> <dbl> <dbl>
1 f1.csv 1 4
2 f1.csv 2 5
3 f1.csv 3 6
4 f2.csv 1 4
5 f2.csv 2 5
6 f2.csv 3 6
7 f3.csv 1 4
8 f3.csv 2 5
9 f3.csv 3 6
Example data:
for (f in c("f1.csv", "f2.csv", "f3.csv")) {
readr::write_delim(data.frame(col1 = 1:3, col2 = 4:6), f, ";")
}
readr::read_csv() can accept a vector of file names. The id parameter is "the name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date."
nm |>
readr::read_csv(
id = "file_path"
)
I see other answers use file name without the directory. If that's desired, consider using functions built for file manipulation, instead of regexes, unless you're sure the file names & paths are always well-behaved.
nm |>
readr::read_csv(
id = "file_path"
) |>
dplyr::mutate(
file_name_1 = basename(file_path), # If you want the extension
file_name_2 = tools::file_path_sans_ext(file_name_1), # If you don't
)
Here is another solution using purrr, which removes the file extension from the value in the filename column.
library(tidyverse)
nm <- list.files(pattern = "\\.csv$")
df <- map_dfr(
.x = nm,
~ read.csv(.x) %>%
mutate(
filename = stringr::str_replace(
.x,
"\\.csv$",
""
)
)
)
View(df)
EDIT
Actually, you can still remove the file extension from the file name column when you apply @zephryl's approach by adding a mutate() step as follows:
df <- nm %>%
set_names() %>%
map_dfr(read_delim, .id = "file") %>%
mutate(
file = stringr::str_replace(
file,
"\\.csv$",
""
)
)
You can use bind_rows() from dplyr and supply the argument .id that creates a new column of identifiers to link each row to its original data frame.
df <- dplyr::bind_rows(
lapply(setNames(nm, basename(nm)), read_csv2),
.id = 'src'
)
The use of basename() removes the directory paths prepended to the file names.
For conventional scenarios, I prefer to let readr loop through the CSVs by itself. But there are some scenarios where it helps to process files individually before stacking them together.
A few weeks ago, purrr 1.0's map_dfr() function was "superseded in favour of using the appropriate map function along with list_rbind()".
@zephryl's snippet is slightly modified to become
list.files() |>
rlang::set_names() |>
purrr::map(readr::read_delim) |>
# { possibly process files here before stacking/binding } |>
purrr::list_rbind(names_to = "file")
The functions were superseded in purrr 1.0.0 because their names suggest they work like _lgl(), _int(), etc which require length 1 outputs, but actually they return results of any size because the results are combined without any size checks. Additionally, they use dplyr::bind_rows() and dplyr::bind_cols() which require dplyr to be installed and have confusing semantics with edge cases. Superseded functions will not go away, but will only receive critical bug fixes.
Instead, we recommend using map(), map2(), etc with list_rbind() and list_cbind(). These use vctrs::vec_rbind() and vctrs::vec_cbind() under the hood, and have names that more clearly reflect their semantics.
Source: https://purrr.tidyverse.org/reference/map_dfr.html
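If readr or purrr is not available, the same idea can be sketched in base R: read each file, add its name as a column, then stack the results. The folder and file names below are hypothetical, and the files are semicolon-delimited like the question's data:

```r
# Toy files in a temporary folder (hypothetical names).
folder <- file.path(tempdir(), "csv_demo")
dir.create(folder, showWarnings = FALSE)
for (f in c("f1.csv", "f2.csv")) {
  write.table(data.frame(col1 = 1:3, col2 = 4:6), file.path(folder, f),
              sep = ";", row.names = FALSE, quote = FALSE)
}

paths <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)

# Read each file, then add its name (without the directory) as a column.
read_one <- function(p) {
  df <- read.delim(p, sep = ";")
  df$file <- basename(p)
  df
}

df <- do.call(rbind, lapply(paths, read_one))
```

Because each file is read separately, read_one() is also a natural place for any per-file processing before the rows are stacked.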

How can I process my StringTie data so that I can run DEseq2 using R?

I have StringTie data for a parental cell line and a KO cell line (which I'll refer to as B10). I am interested in comparing the parental and B10 cell lines. The issue seems to be that my StringTie files are separate, meaning I have one for the parental cell line and one for B10. I've included the code I have written to date for context along with the error messages I received and troubleshooting steps I have already tried. I have no idea where to go from here and I'd appreciate all the help I could get. This isn't something that anyone in my lab has done before so I'm struggling to do this without any guidance.
Thank you all in advance!
`# My code to go from StringTie to count data:
(I copy pasted this so all my notes are included. I'm new to R so they're really just for me. I'm not trying to explain to everyone what every bit of the code means condescendingly. You all likely know much more that I do)
# Open Data
# List StringTie output files for all samples
# All files should be in same directory
files_B10 <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/B10", recursive = TRUE, full.names = TRUE)
files_parental <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/parental", recursive = TRUE, full.names = TRUE)
tmp_B10 <- read_tsv(files_B10[1])
tx2gene_B10 <- tmp_B10[, c("t_name", "gene_name")]
txi_B10 <- tximport(files_B10, type = "stringtie", tx2gene = tx2gene_B10)
tmp_parental <- read_tsv(files_parental[1])
tx2gene_parental <- tmp_parental[, c("t_name", "gene_name")]
txi_parental <- tximport(files_parental, type = "stringtie", tx2gene = tx2gene_parental)
# Create a filter (logical vector) marking rows where at least two columns have more than 5 counts
txi_B10.filter<-apply(txi_B10$counts,1,function(x) length(x[x>5])>=2)
txi_parental.filter<-apply(txi_parental$counts,1,function(x) length(x[x>5])>=2)
head(txi_parental.filter)
sum(txi_B10.filter)
# Now filter the txi object to keep only the rows of $counts, $abundance, and $length where the filter is TRUE
txi_B10$counts<-txi_B10$counts[txi_B10.filter,]
txi_B10$abundance<-txi_B10$abundance[txi_B10.filter,]
txi_B10$length<-txi_B10$length[txi_B10.filter,]
txi_parental$counts<-txi_parental$counts[txi_parental.filter,]
txi_parental$abundance<-txi_parental$abundance[txi_parental.filter,]
txi_parental$length<-txi_parental$length[txi_parental.filter,]
# save count data as csv files
write.csv(txi_B10$counts, "txi_B10.counts.csv")
write.csv(txi_parental$counts, "txi_parental.counts.csv")
# Open count data
# Do this in order that the files are organized in file manager
txi_B10_counts <- read_csv("txi_B10.counts.csv")
txi_parental_counts <- read_csv("txi_parental.counts.csv")
# Set column names
colnames(txi_B10_counts) = c("Gene_name", "B10_n1", "B10_n2")
View(txi_B10_counts)
colnames(txi_parental_counts) = c("Gene_name", "parental_n1", "parental_n2")
View(txi_parental_counts)
## R is case sensitive so you just wanna ensure that everything is in the same case
## convert Gene names which is column [[1]] into lowercase
txi_parental_counts[[1]] <- tolower( txi_parental_counts[[1]])
View(txi_parental_counts)
txi_B10_counts[[1]] <- tolower(txi_B10_counts[[1]])
View(txi_B10_counts)
## Capitalize the first letter of each gene name
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_parental_counts$Gene_name <- capFirst(txi_parental_counts$Gene_name)
View(txi_parental_counts)
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_B10_counts$Gene_name <- capFirst(txi_B10_counts$Gene_name)
View(txi_B10_counts)
# Merge PL and KO into one table
# full_join takes all counts from PL and KO even if the gene names are missing
# If a value is missing it writes it as NA
# This site explains different types of merging https://remiller1450.github.io/s230s19/Merging_and_Joining.html
mergedCounts <- full_join (x = txi_parental_counts, y = txi_B10_counts, by = "Gene_name")
view(mergedCounts)
# Replace NA with value = 0
mergedCounts[is.na(mergedCounts)] = 0
view(mergedCounts)
# Save file for merged counts
write.csv(mergedCounts, "MergedCounts.csv")
## --------------------------------------------------------------------------------
# My code to go from count data to DEseq2
# Import data
# I added my metadata incase the issue is how I set up the columns
# metaData is a file with your samples name and Comparison
# Your second column in metadata must be called Comparison, otherwise you'll get error in dds line
metaData <- read.csv('metadata.csv', header = TRUE, sep = ",")
countData <- read.csv('MergedCounts.csv', header = TRUE, sep = ",")
# Assign "Gene Names" as row names
# Notice how there's suddenly an extra row (x)?
# R automatically created and assigned column x as row names
# If you don't fix this the # of columns won't add up
rownames(countData) <- countData[,1]
countData <- countData[,-1]
# Create DEseq2 object
# !!!!!!! Here is where I get stuck!!!!!!!
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = metaData,
design = ~ Comparison, tidy = TRUE)
# I can't run this line
# It says Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are not integers
## --------------------------------------------------------------------------------
# How I tried to fix this:
# 1) I saw something here that suggested this might be an issue with having zeros in the count data
# I viewed the countData files to make sure there were no zeros and there weren't any
# I thought that would be the case since I replaced NA with value = 0 earlier using this bit of code
mergedCounts[is.na(mergedCounts)] = 0
view(mergedCounts)
# 2) I was then informed that StringTie outputs non integer values
# It was recommended that I try DESeqDataSetFromTximport instead
dds <- DESeqDataSetFromTximport(countData,
colData = metaData,
design = ~ Comparison, tidy = TRUE)
# I can't run this line either
# It says Error in DESeqDataSetFromTximport(countData, colData = metaData, design = ~Comparison, : is(txi, "list") is not TRUE
# I think this might be because merging the parental and B10 counts led to a file that's no longer a txi or accessible through Tximport
# It seems like this should be done with the original StringTie files from the very beginning of the code
# My concern with doing that is that the files for parental and B10 are separate so I don't see how I could end up comparing the two
# I think this approach would work if I was interested in comparing n1 verses n2 for each cell line but that is not of interest to me
`
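One possible way around the concern about separate files is to pass the parental and B10 files to tximport together, so that a single txi object covers all four samples. The sketch below only builds the combined file vector and a matching sample table. The file paths and sample names are hypothetical, and the tximport and DESeq2 calls are left as comments because they need the real StringTie output and Bioconductor:

```r
# Hypothetical file paths standing in for the two list.files() results.
files_parental <- c("data/parental/n1/t_data.ctab", "data/parental/n2/t_data.ctab")
files_B10      <- c("data/B10/n1/t_data.ctab", "data/B10/n2/t_data.ctab")

# One vector of files, one name per sample.
files <- c(files_parental, files_B10)
names(files) <- c("parental_n1", "parental_n2", "B10_n1", "B10_n2")

# Sample table whose rows are in the same order as `files`.
metaData <- data.frame(
  Comparison = factor(rep(c("parental", "B10"), each = 2),
                      levels = c("parental", "B10")),
  row.names = names(files)
)

# With real files, the remaining steps would look roughly like this (not run):
# tmp     <- readr::read_tsv(files[1])
# tx2gene <- tmp[, c("t_name", "gene_name")]
# txi     <- tximport::tximport(files, type = "stringtie", tx2gene = tx2gene)
# dds     <- DESeq2::DESeqDataSetFromTximport(txi, colData = metaData,
#                                             design = ~ Comparison)
```

Under this assumption, the txi object stays a list, which is what DESeqDataSetFromTximport() expects, and the comparison between parental and B10 comes from the Comparison column rather than from merging count tables by hand.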

Data frame from multiple XML files

I have a folder with multiple XML files. All the files have the same basic structure. However, each file actually contains data related to a single entity distributed among 16 parent nodes. Some nodes have children, some even have grandchildren, and some great-grandchildren.
I want to create a data frame from multiple files with only selected nodes/children/grandchildren.
As a first step, I read a single XML file as a list. Then ran a few lines of code to get the required data into a vector. Eventually, converted the vector to a dataframe, like I want one.
This the code:
library(xml2)
library(tidyverse)
x = as_list(read_xml("ACTRN12605000003673.xml"))
tmp = c(ACTR_Number = as.character(x$ANZCTR_Trial$actrnumber),
primary_sponsor_type = as.character(x$ANZCTR_Trial$sponsorship$primarysponsortype),
primary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorname),
primary_sponsor_address = as.character(x$ANZCTR_Trial$sponsorship$primarysponsoraddress),
primary_sponsor_country = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorcountry),
funding_source_type = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingtype),
funding_source_name = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingname),
funding_source_address = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingaddress),
funding_source_country = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingcountry),
secondary_sponsor_type = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsortype),
secondary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsorname),
secondary_sponsor_address = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsoraddress),
secondary_sponsor_country = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsorcountry))
tmp = as.list(tmp)
tmp = as.data.frame(tmp)
For the next step, I tried to work with 2 XML files together. I tried the following code to read two files simultaneously. However, beyond that, I don't know how to go ahead.
all_files = list.files(pattern=".xml", path = getwd(), full.names = TRUE)
x = lapply(all_files, read_xml)
class(x)
Sample files here
We can encapsulate your data gathering process into a function. Since each XML file seems to represent one row in your desired output dataframe, we can use purrr::map_dfr to construct the data rowwise. I slightly modified your code. See below:
get_row_data <- function(file_name) {
xml <- xml2::read_xml(file_name)
read_text <- \(x, xpath) rvest::html_elements(x, xpath = xpath) |> rvest::html_text(TRUE)
xpaths <- c(
ACTR_Number = "/ANZCTR_Trial/actrnumber",
primary_sponsor_type = "/ANZCTR_Trial/sponsorship/primarysponsortype",
primary_sponsor_name = "/ANZCTR_Trial/sponsorship/primarysponsorname",
primary_sponsor_address = "/ANZCTR_Trial/sponsorship/primarysponsoraddress",
primary_sponsor_country = "/ANZCTR_Trial/sponsorship/primarysponsorcountry",
funding_source_type = "/ANZCTR_Trial/sponsorship/fundingsource/fundingtype",
funding_source_name = "/ANZCTR_Trial/sponsorship/fundingsource/fundingname",
funding_source_address = "/ANZCTR_Trial/sponsorship/fundingsource/fundingaddress",
funding_source_country = "/ANZCTR_Trial/sponsorship/fundingsource/fundingcountry",
secondary_sponsor_type = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsortype",
secondary_sponsor_name = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsorname",
secondary_sponsor_address = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsoraddress",
secondary_sponsor_country = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsorcountry"
)
x <- read_text(xml, paste0(xpaths, collapse = " | "))
names(x) <- names(xpaths)
tibble::as_tibble(as.list(x))
}
all_files <- list.files(pattern = "\\.xml$", path = getwd(), full.names = TRUE)
purrr::map_dfr(all_files, get_row_data)
Output
# A tibble: 2 x 13
ACTR_Number primary_sponsor_~ primary_sponsor_n~ primary_sponsor_~ primary_sponsor~ funding_source_~ funding_source_~
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ACTRN12605000003673 Hospital Barwon Health "272-322 Ryrie S~ Australia Commercial sect~ Astra Zeneca
2 ACTRN12605000025639 Other Collaborat~ Australian Gastro~ "88 Mallett St\n~ Australia Commercial sect~ Roche Products ~
# ... with 6 more variables: funding_source_address <chr>, funding_source_country <chr>, secondary_sponsor_type <chr>,
# secondary_sponsor_name <chr>, secondary_sponsor_address <chr>, secondary_sponsor_country <chr>
R supports lists within lists. You can define one list that holds all of your tmp lists, one per file. Once every file is in that list, you can turn it into a data frame or tibble whose columns hold those lists (tmp). You can then work with them using the map functions from library(purrr).
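As a rough illustration of the list-of-lists idea, the sketch below stacks one inner list per file into a data frame using base R only. The field names follow the question, but the values are invented:

```r
# Each element stands in for one file's `tmp` list (values are made up).
records <- list(
  list(ACTR_Number = "A1", primary_sponsor_type = "Hospital"),
  list(ACTR_Number = "A2", primary_sponsor_type = "Other")
)

# Turn every inner list into a one-row data frame, then stack the rows.
rows <- lapply(records, function(r) as.data.frame(r, stringsAsFactors = FALSE))
trials <- do.call(rbind, rows)
```

This assumes every inner list has the same fields; files with missing nodes would need those fields filled with NA before binding.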

How to merge files in a directory with r?

Good afternoon,
I have a folder with 231 .csv files and I would like to merge them in R. Each file is one spectrum with 2 columns (Wavenumber and Reflectance), but as they come from the spectrometer they don't have colnames. So they look like this when I import them:
C_Sycamore = read.csv("#C_SC_1_10 average.CSV", header = FALSE)
head(C_Sycamore)
V1 V2
1 399.1989 7.750676e+001
2 401.1274 7.779499e+001
3 403.0559 7.813432e+001
4 404.9844 7.837078e+001
5 406.9129 7.837600e+001
6 408.8414 7.822227e+001
The first column (Wavenumber) is identical in all 231 files and all spectra contain exactly 1869 rows. Therefore, it should be possible to merge the whole folder into one big dataframe, right? At least this would be very practical for me.
So here is what I tried. I set the working directory to the relevant folder, define an empty variable d, store all the file names in file.list, and then loop through the names in file.list. First, I want to change the colnames of every file to "Wavenumber" and the file's own name, so I use deparse(substitute(i)). Then I want to read in the file and merge it with the others. I could probably then do merge(d, read.csv(i, header = FALSE), by = "Wavenumber"), but I don't even get this far.
d = NULL
file.list = list.files()
for(i in file.list){
colnames(i) = c("Wavenumber", deparse(substitute(i)))
d = merge(d, read.csv(i, header = FALSE))
}
When I run this I get the error code
"Error in colnames<-(*tmp*, value = c("Wavenumber", deparse(substitute(i)))) :
So I tried running it without the "colnames()" line, which does not produce an error code, but doesn't work either. Instead of my desired dataframe I get an empty dataframe with only two columns and the message:
"reread"#S_BE_1_10 average.CSV" "#S_P_1_10 average.CSV""
This kind of programming is new to me. So I am thankful for all useful suggestions. Also I am happy to share more data if it helps.
Thanks in advance.
Solution
library(tidyr)
library(purrr)
path <- "your/path/to/folder"
# in one pipeline:
C_Sycamore <- path %>%
# get csvs full paths. (?i) is for case insensitive
list.files(pattern = "(?i)\\.csv$", full.names = TRUE) %>%
# create a named vector: you need it to assign ids in the next step.
# and remove file extension to get clean colnames
set_names(tools::file_path_sans_ext(basename(.))) %>%
# read file one by one, bind them in one df and create id column.
# header = FALSE because the files have no header row; without it the
# first data row of every file would be consumed as column names
map_dfr(read.csv, header = FALSE, col.names = c("Wavenumber", "V2"), .id = "colname") %>%
# pivot to create one column for each .id
pivot_wider(names_from = colname, values_from = V2)
Explanation
I would suggest not to change the working directory.
I think it's better if you read from that folder instead.
You can read each CSV file in a loop and bind them together by row. You can use map_dfr to loop over each item and then bind every dataframe by row (that's what the _dfr stands for).
Note that I've used .id = to create a new column called colname. It gets populated out of the names of the vector you're looping over. (That's why we added the names with set_names)
Then, to have one row for each Wavenumber, you need to reshape your data. You can use pivot_wider.
You will have at the end a dataframe with as many rows as there are Wavenumber values and as many columns as the number of CSV files plus 1 (the Wavenumber column).
Reproducible example
To double check my results, you can use this reproducible example:
path <- tempdir()
csv <- "399.1989,7.750676e+001
401.1274,7.779499e+001
403.0559,7.813432e+001
404.9844,7.837078e+001
406.9129,7.837600e+001
408.8414,7.822227e+001"
write(csv, file.path(path, "file1.csv"))
write(csv, file.path(path, "file2.csv"))
You should expect this output:
C_Sycamore
#> # A tibble: 6 x 3
#> Wavenumber file1 file2
#> <dbl> <dbl> <dbl>
#> 1 399. 77.5 77.5
#> 2 401. 77.8 77.8
#> 3 403. 78.1 78.1
#> 4 405. 78.4 78.4
#> 5 407. 78.4 78.4
#> 6 409. 78.2 78.2
Thanks a lot to @Konrad Rudolph for the suggestions!!
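There is no need for a loop here; simply use lapply(). First set your working directory to the file location. Note that bind_rows() stacks the files on top of each other rather than merging them side by side.
library(dplyr)
files_to_upload <- list.files(pattern = "\\.csv$", ignore.case = TRUE)
theData_list <- lapply(files_to_upload, read.csv, header = FALSE)
C_Sycamore <- bind_rows(theData_list)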
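The asker's original idea of merging every file on the shared Wavenumber column can also be sketched in base R with Reduce() and merge(). The two toy spectra and the temporary folder below are invented; the real files would be read from the spectrometer folder instead:

```r
# Two toy spectra without headers, written to a temporary folder.
folder <- file.path(tempdir(), "spectra_demo")
dir.create(folder, showWarnings = FALSE)
wn <- c(399.1989, 401.1274, 403.0559)
write.table(data.frame(wn, c(77.5, 77.8, 78.1)), file.path(folder, "file1.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(wn, c(60.1, 60.2, 60.3)), file.path(folder, "file2.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

paths <- list.files(folder, pattern = "\\.csv$", full.names = TRUE,
                    ignore.case = TRUE)

# Read one spectrum and name its value column after the file.
read_spectrum <- function(p) {
  read.csv(p, header = FALSE,
           col.names = c("Wavenumber", tools::file_path_sans_ext(basename(p))))
}

# Merge all spectra pairwise on the shared Wavenumber column.
spectra <- Reduce(function(a, b) merge(a, b, by = "Wavenumber"),
                  lapply(paths, read_spectrum))
```

With file names such as "#C_SC_1_10 average", read.csv() will convert the column names into syntactically valid ones; adding check.names = FALSE should keep them as written. For 231 files, the pivot_wider() pipeline above is likely faster than repeated merging.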

R Plyr Write CSV

I am trying to split a data frame and write it to a csv file in r using the unique values in one variable. I am new to r and I'm not entirely sure I know what I'm doing.
## trying to subset data
library(dplyr)
library(plyr)
#set the working directory
setwd("S:/some stuff")
## load the datafile into an object called data.
data <- read.csv("S:/some stuff/Area.csv",
header = TRUE, sep = ",")
#Create subsets of data by LA
LA<-subset(data,AREA == "LA")
My data frame has 2,500 observations and 20 variables.
My dataframe is called LA
The variable I'd like to split it by is called Disease
I found this: How to create multiple .csv files in R?
And reappropriated it accordingly
from
plyr::d_ply(iris, .(Species), function(x) write.csv(x,
file = paste(x$Species, ".csv", sep = "")))
to
plyr::d_ply(LA, .(Disease), function(x) write.csv(x,
file = paste(LA$Disease, ".csv", )))
However....
Error in file(file, ifelse(append, "a", "w")) :
invalid 'description' argument
In addition: Warning message:
In if (file == "") file <- stdout() else if (is.character(file)) { :
There are two things I'd like to solve.
1) subsetting a dataframe
2) writing to a path
Ideally I'd like to loop through it from the import of data (the Area.csv file).
This has areas and disease. There are 12 areas and 20 diseases.
I would like to create csv files of each disease by area.
In this example Area = LA and then disease.
How can I step through using a loop to create the 20 different files for each area?
I thought this:
https://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/
mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")
#mpExpenses2012 is the large dataframe containing data for each MP
#Get the list of unique MP names
for (name in levels(mpExpenses2012$MP.s.Name)){
#Subset the data by MP
tmp=subset(mpExpenses2012,MP.s.Name==name)
#Create a new filename for each MP - the folder 'mpExpenses2012' should already exist
fn=paste('mpExpenses2012/',gsub(' ','',name),sep='')
#Save the CSV file containing separate expenses data for each MP
write.csv(tmp,fn,row.names=FALSE)
}
might be helpful, but it's writing to a path that's getting me down.
EDIT
library(tidyr)
library(purrr)
temp_dir <- tempfile()
dir.create(temp_dir)
LA %>%
nest(-FinalDiseaseForMonthlyAnalysis) %>%
pwalk(function(FinalDiseaseForMonthlyAnalysis, data) write.csv(data, file.path(temp_dir, paste0(FinalDiseaseForMonthlyAnalysis, ".csv"))))
list.files(temp_dir)
temp_dir
unlink(temp_dir, recursive = T)
This works. But now comes the "where are the files?" question.
Yes: I get the temp file and then the unlink.
But how do I save in a folder on S:/some stuff/?
EDIT FINAL: SOLVED
I've read that in r everything is a list. And I found a way to split by two columns to do what I needed. Annoyingly it's linked in the comments in here:
https://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/
and I missed it.
I was also having problems generating a dir using dir.create. Who knew that dir.create needs to have recursive = TRUE when you're trying to do stuff? I DO NOW.
Anyway. here's what I did:
## trying to subset data
# generate data:
library(tidyr)
library(purrr)
library(dplyr)
## set working directory
setwd("S:/somestuff")
#create the directories - pretty sure there's a way to avoid doing this long hand
dir.create("S:/somestuff/CSV source files", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA1", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA2", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA3", recursive = TRUE)
#Read in the CSV
DF = read.csv("S:/somestuff/CSV source files/ALL.csv",
header = TRUE, sep = ",")
glimpse(DF)
#This splits the dataframe generated above (DF) and calls it DF4
DF4 <- split(DF,list(DF$LA,DF$FinalDiseaseForMonthlyAnalysis))
lapply(names(DF4), function(name) write.csv(DF4[[name]], file = paste("S:/somestuff/CSV source files/",gsub('','',name),sep = ''), row.names = F))
I'm guessing if I read in the dataframe I could then use dir.create to create paths from the names in LA in the data frame.
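That guess can be sketched as follows: split the data by area and disease, then create each area's folder from the values in LA before writing. The toy data, the cases column, and the temporary output root are assumptions for illustration; in the question the root would be a folder on S:/:

```r
# Toy stand-in for DF; column names follow the question, values are invented.
DF <- data.frame(
  LA = c("LA1", "LA1", "LA2"),
  FinalDiseaseForMonthlyAnalysis = c("Flu", "Measles", "Flu"),
  cases = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical output root.
out_root <- file.path(tempdir(), "CSV source files")

# Split by area and disease, dropping combinations that have no rows.
pieces <- split(DF, list(DF$LA, DF$FinalDiseaseForMonthlyAnalysis), drop = TRUE)

for (piece in pieces) {
  area_dir <- file.path(out_root, as.character(piece$LA[1]))
  # recursive = TRUE also creates out_root if it does not exist yet.
  dir.create(area_dir, recursive = TRUE, showWarnings = FALSE)
  disease <- as.character(piece$FinalDiseaseForMonthlyAnalysis[1])
  write.csv(piece, file.path(area_dir, paste0(disease, ".csv")),
            row.names = FALSE)
}

written <- list.files(out_root, recursive = TRUE)
```

Building paths with file.path() and paste0() avoids the stray spaces that paste() inserts by default, and creating the folders inside the loop removes the need to list every area by hand.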
After returning to the problem: it's much easier with a recent version of dplyr, using group_walk() on the original data frame DF (not the split list DF4)
library(readr)
ourdata <- DF %>%
group_by(LA, FinalDiseaseForMonthlyAnalysis) %>%
group_walk(~ write_csv(.x, paste0(.y$LA, .y$FinalDiseaseForMonthlyAnalysis, ".csv")))
This was really helpful to me! Thanks!! I tried to simplify the crux of the matter.
library(tidyverse)
library(reprex)
states4 <- tribble(~state,~name,~area,
"AL","Alabama",50645.3242,
"AZ","Arizona",113594.0781,
"AR","Arkansas",52035.4727,
"CA","California",155779.2031
)
chain4 <- states4 %>% split(.$state)
map(names(chain4),function(stateabbrev){write_csv(chain4[[stateabbrev]],paste0("~/Downloads/","testtoken_",stateabbrev,".csv"))})
#> [[1]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AL Alabama 50645.
#>
#> [[2]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AR Arkansas 52035.
#>
#> [[3]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AZ Arizona 113594.
#>
#> [[4]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 CA California 155779.
list.files(path="~/Downloads", pattern = "testtoken.*csv")
#> [1] "testtoken_AL.csv" "testtoken_AR.csv" "testtoken_AZ.csv"
#> [4] "testtoken_CA.csv"
reprex()
Created on 2019-10-02 by the reprex package (v0.3.0)
In the end I used:
## trying to subset data
# generate data:
library(tidyr)
library(purrr)
library(dplyr)
library(stringr)
library(plyr)
library (car)
## set working directory
setwd("S:/Somestuff/Borough profile maps/Working")
## read data in from geocoded file
geocoded<-read.csv("geocoded 2015 - 2018.csv",na.strings=c(""," ","N/A"))
str(geocoded)
str(geocoded$GENDER)
levels(geocoded$LA)
#split geocoded data by LA
x <-split(geocoded,geocoded$LA)
str(x)
#Split geocoded data by LA and Final
#split(x, f, drop = FALSE, sep = ".", lex.order = FALSE, .)
y<-split(geocoded,list(geocoded$Final,geocoded$LA), drop = TRUE, sep = "_")
str(y)
#create dir and then write CSV files of geocoded to file locations
dir.create("S:/Somestuff/Borough profile maps/Working/TEST/", recursive = TRUE)
dir.create("S:/Somestuff/Borough profile maps/Working/TEST/TEST2", recursive = TRUE)
lapply(names(x), function(name) write.csv(x[[name]], file = paste('S:/Somestuff/Borough profile maps/Working/TEST/',gsub(' ','',name),sep = ''), row.names = F))
# paste0() avoids the spaces that paste() would insert around the name
lapply(names(y), function(name) write.csv(y[[name]], file = paste0('S:/Somestuff/Borough profile maps/Working/TEST/TEST2/', name, ".csv")))
The problem was that in my original code you'll notice I was using read.csv BUT feeding in a .txt file. I changed the file to .csv and BANG. it worked. First time.
I realise that you don't need all the libraries I called at the beginning, but they're left in from my ridiculous number of attempts.
