I have been using XLConnect's loadWorkbook() to load each xlsx file into R and then rbind() to merge them together. What is the best way of doing this without creating a separate data frame per file and merging them afterwards? I am trying to use the code below to merge my Excel files into 2 data frames (most files have 2 sheet names). The columns are always the same, but the file names will change.
Current (slow) way:
require(XLConnect)
df <- loadWorkbook(paste(location,'UK.xlsx',sep=""))
dfb <- loadWorkbook(paste(location,'US.xlsx',sep=""))
UK <-readWorksheet(df,sheet="School",startRow=0,startCol=0,autofitRow=TRUE,endCol=21,header=TRUE)
US <-readWorksheet(dfb,sheet="School",startRow=0,startCol=0,autofitRow=TRUE,endCol=21,header=TRUE)
School <- rbind(UK,US)
UK <-readWorksheet(df,sheet="College",startRow=0,startCol=0,autofitRow=TRUE,endCol=21,header=TRUE)
US <-readWorksheet(dfb,sheet="College",startRow=0,startCol=0,autofitRow=TRUE,endCol=21,header=TRUE)
College <- rbind(UK,US)
New code
require(readxl)
filelist <- list.files(location, pattern = "\\.xlsx$", full.names = TRUE)
How can I read each sheet into a data frame when not every file has both sheets? I need 2 data frames: one for School and one for College.
I think I need to try something like Schools <- lapply(filelist, read_excel, sheet = "School"), but I get Error: Sheet 'School' not found. I think this error occurs because the School sheet is not in every file. I am using list.files() because the file names are not always the same.
What about this approach?
library(purrr)
library(readxl)
# file names of the Excel workbooks
files <- sprintf("Mappe%i.xlsx", 1:3)
# read the School sheet, only from xl-files that contain it
xl_school <- map_if(files, ~ "School" %in% excel_sheets(.x), ~ read_excel(.x, sheet = "School"))
# read the College sheet, only from xl-files that contain it
xl_college <- map_if(files, ~ "College" %in% excel_sheets(.x), ~ read_excel(.x, sheet = "College"))
# combine school-files to data frame (repeat same for college)
school_df <- map_df(xl_school, function(x) if(is.data.frame(x)) x)
school_df
#> # A tibble: 3 × 1
#> Test
#> <chr>
#> 1 fdsf
#> 2 543534
#> 3 gfdgfdd
You might need to force the column type to be text. Just add col_types = "text" to the read_excel()-call:
# read the School sheet as text, only from xl-files that contain it
xl_school <- map_if(files, ~ "School" %in% excel_sheets(.x), ~ read_excel(.x, sheet = "School", col_types = "text"))
# read the College sheet as text, only from xl-files that contain it
xl_college <- map_if(files, ~ "College" %in% excel_sheets(.x), ~ read_excel(.x, sheet = "College", col_types = "text"))
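To end up with one combined data frame per sheet while skipping workbooks that lack it, the same idea can be wrapped in a small helper. This is only a sketch under stated assumptions: read_sheet_if_present() is a hypothetical helper name, the two demo workbooks and their contents are made up, and the writexl package is assumed to be installed purely to create them.

```r
library(readxl)
library(writexl)  # assumption: used here only to create the demo workbooks

# Demo workbooks in a temporary folder (made-up data):
# UK.xlsx has both sheets, US.xlsx has only "School".
location <- file.path(tempdir(), "xl_demo")
dir.create(location, showWarnings = FALSE)
write_xlsx(list(School  = data.frame(id = 1:2, name = c("a", "b")),
                College = data.frame(id = 3L, name = "c")),
           file.path(location, "UK.xlsx"))
write_xlsx(list(School = data.frame(id = 4L, name = "d")),
           file.path(location, "US.xlsx"))

# Hypothetical helper: read `sheet` from every file that contains it,
# record the source file, and stack the results.
read_sheet_if_present <- function(files, sheet) {
  has_sheet <- vapply(files, function(f) sheet %in% excel_sheets(f), logical(1))
  parts <- lapply(files[has_sheet], function(f) {
    df <- as.data.frame(read_excel(f, sheet = sheet))
    df$source_file <- basename(f)
    df
  })
  do.call(rbind, parts)  # NULL if no file contains the sheet
}

filelist <- list.files(location, pattern = "\\.xlsx$", full.names = TRUE)
School  <- read_sheet_if_present(filelist, "School")
College <- read_sheet_if_present(filelist, "College")
```

Because the sheet is checked per file, a missing sheet no longer raises the "Sheet not found" error; rbind() does still require that the columns match across files, which the question states is the case.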
I have this code that reads all CSV files in a directory.
nm <- list.files()
df <- do.call(rbind, lapply(nm, function(x) read_delim(x,';',col_names = T)))
I want to modify it so that it adds the file name to the data. The result should be a single data frame containing all the CSV files, with a column that specifies which file each row came from. How can I do this?
Instead of do.call(rbind, lapply(...)), you can use purrr::map_dfr() with the .id argument:
library(readr)
library(purrr)
df <- list.files() |>
set_names() |>
map_dfr(read_delim, .id = "file")
df
# A tibble: 9 × 3
file col1 col2
<chr> <dbl> <dbl>
1 f1.csv 1 4
2 f1.csv 2 5
3 f1.csv 3 6
4 f2.csv 1 4
5 f2.csv 2 5
6 f2.csv 3 6
7 f3.csv 1 4
8 f3.csv 2 5
9 f3.csv 3 6
Example data:
for (f in c("f1.csv", "f2.csv", "f3.csv")) {
readr::write_delim(data.frame(col1 = 1:3, col2 = 4:6), f, ";")
}
readr::read_csv() can accept a vector of file names. The id parameter is "the name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date."
nm |>
readr::read_csv(
id = "file_path"
)
I see other answers use file name without the directory. If that's desired, consider using functions built for file manipulation, instead of regexes, unless you're sure the file names & paths are always well-behaved.
nm |>
readr::read_csv(
id = "file_path"
) |>
dplyr::mutate(
file_name_1 = basename(file_path), # If you want the extension
file_name_2 = tools::file_path_sans_ext(file_name_1), # If you don't
)
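As a small illustration of why the file helpers are safer than a hand-written regex, here is what they return for a path with dots in both the directory and the file name. The path is made up for the example.

```r
# Made-up path with dots in the directory and in the file name
path <- "data.v2/raw/results.final.csv"

file_name <- basename(path)                        # drops the directories
file_stem <- tools::file_path_sans_ext(file_name)  # drops only the last extension

file_name
#> [1] "results.final.csv"
file_stem
#> [1] "results.final"
```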
Here is another solution using purrr, which removes the file extension from the value in the filename column.
library(tidyverse)
nm <- list.files(pattern = "\\.csv$")
df <- map_dfr(
.x = nm,
~ read.csv(.x) %>%
mutate(
filename = stringr::str_replace(
.x,
"\\.csv$",
""
)
)
)
View(df)
EDIT
Actually, you can still remove the file extension from the file-name column when you apply @zephryl's approach, by adding a mutate() step as follows:
df <- nm %>%
set_names() %>%
map_dfr(read_delim, .id = "file") %>%
mutate(
file = stringr::str_replace(
file,
"\\.csv$",
""
)
)
You can use bind_rows() from dplyr and supply the argument .id that creates a new column of identifiers to link each row to its original data frame.
df <- dplyr::bind_rows(
lapply(setNames(nm, basename(nm)), read_csv2),
.id = 'src'
)
The use of basename() removes the directory paths prepended to the file names.
For conventional scenarios, I prefer to let readr loop through the CSVs by itself. But there are some scenarios where it helps to process files individually before stacking them together.
A few weeks ago, purrr 1.0's map_dfr() function was "superseded in favour of using the appropriate map function along with list_rbind()".
@zephryl's snippet is slightly modified to become
list.files() |>
rlang::set_names() |>
purrr::map(readr::read_delim) |>
# { possibly process files here before stacking/binding } |>
purrr::list_rbind(names_to = "file")
The functions were superseded in purrr 1.0.0 because their names suggest they work like _lgl(), _int(), etc which require length 1 outputs, but actually they return results of any size because the results are combined without any size checks. Additionally, they use dplyr::bind_rows() and dplyr::bind_cols() which require dplyr to be installed and have confusing semantics with edge cases. Superseded functions will not go away, but will only receive critical bug fixes.
Instead, we recommend using map(), map2(), etc with list_rbind() and list_cbind(). These use vctrs::vec_rbind() and vctrs::vec_cbind() under the hood, and have names that more clearly reflect their semantics.
Source: https://purrr.tidyverse.org/reference/map_dfr.html
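A minimal in-memory sketch of the map() plus list_rbind() pattern, assuming purrr 1.0.0 or later. The two data frames and the added `total` column are made up and stand in for files that are processed individually before stacking.

```r
library(purrr)  # assumes purrr >= 1.0.0, which provides list_rbind()

# Stand-ins for data frames read from two files (made-up values)
per_file <- list(
  "f1.csv" = data.frame(col1 = 1:3, col2 = 4:6),
  "f2.csv" = data.frame(col1 = 7:8, col2 = 9:10)
)

# Per-file processing happens on the list, before stacking
per_file <- map(per_file, function(d) transform(d, total = col1 + col2))

# Stack the pieces; the list names become the "file" column
stacked <- list_rbind(per_file, names_to = "file")
stacked
```

The file column comes from the list names, which is why set_names() is applied to the vector of file names before map() in the snippet above.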
I wrote a small R script. The inputs are text files (thousands of journal articles). I generated the metadata (including the publication year) from the file names. Now I want to calculate the total number of tokens per year. However, I am not getting anywhere here.
# Metadata from filenames
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt", docvarsfrom = "filenames", dvsep="_",
docvarnames = c("Unit", "Year", "Volume", "Issue"))
# we add some more metadata columns to the data frame
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 0, 4)
# Corpus
SPARA_corp <- corpus(rawdata_SPARA)
Does anyone here know a solution?
I tried the tokens_by function of the quanteda package, which seems to be outdated.
Thanks! I could not get your script to work. But it inspired me to develop an alternative solution:
# Load the necessary libraries
library(readtext)
library(dplyr)
library(quanteda)
# Set the directory containing the text files
dir <- "/Textfiles/SPARA_paragraphs"
# Read in the text files using the readtext function
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt", docvarsfrom = "filenames", dvsep="_", docvarnames = c("Unit", "Year", "Volume", "Issue"))
# Extract the year from the file name
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 0, 4)
# Group the data by year and summarize by tokens
rawdata_SPARA_grouped <- rawdata_SPARA %>%
group_by(Year) %>%
summarize(tokens = sum(ntoken(text)))
# Print number of absolute tokens per year
print(rawdata_SPARA_grouped)
You do not need substr(rawdata_SPARA$Year, 0, 4): the readtext() call already extracts the year from the file name. In the example below the file names have a structure like EU_euro_2004_de_PSE.txt, and 2004 is automatically inserted into the readtext object. Since that object inherits from data.frame, you can use standard data manipulation functions, e.g. from the dplyr package.
Then just group_by() year and summarize() the tokens. The number of tokens is calculated with quanteda's ntoken() function.
See the code below:
library(readtext)
library(quanteda)
library(dplyr) # provides %>%, group_by() and summarize() used below
# Prepare sample corpus
set.seed(123)
DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
docvarsfrom = "filenames",
docvarnames = c("unit", "context", "year", "language", "party"),
encoding = "LATIN1")
rt$year = sample(2005:2007, nrow(rt), replace = TRUE)
# Calculate tokens
rt$tokens <- ntoken(corpus(rt), remove_punct = TRUE)
# Find distribution by year
rt %>% group_by(year) %>% summarize(total_tokens = sum(tokens))
Output:
# A tibble: 3 × 2
year total_tokens
<int> <int>
1 2005 5681
2 2006 26564
3 2007 24119
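The same per-year total can also be sketched without dplyr, by tokenising explicitly and summing the per-document counts with tapply(). The three texts and their years below are made up; in the real data the year would be the docvar parsed from the file names.

```r
library(quanteda)

# Made-up mini corpus; `year` stands in for the docvar parsed from file names
texts <- c(d1 = "one two three", d2 = "four five", d3 = "six seven eight nine")
corp <- corpus(texts, docvars = data.frame(year = c(2005, 2005, 2006)))

# Tokenise explicitly, then count tokens per document
n_tok <- ntoken(tokens(corp, remove_punct = TRUE))

# Sum the per-document counts within each year
tokens_per_year <- tapply(n_tok, docvars(corp, "year"), sum)
tokens_per_year
#> 2005 2006 
#>    5    4
```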
I have a script that works fine for one file. It takes the information from a json file, extracts a list and a sublist of it (A), and then another list B with the third element of list A. It creates a data frame with list B and compares it with a master file. Finally, it provides two numbers: the number of elements in the list B and the number of matching elements of that list when comparing with the master file.
However, I have 180 different json files in a folder and I need to run the script for all of them, and build a data frame with the results for each file. So the final result should be something like this (note that the last line's figures are correct, the first two are fictitious):
The code I have so far is the following:
library(rjson)
library(dplyr)
library(tidyverse)
#load data from file
file <- "./raw_data/whf.json"
json_data <- fromJSON(file = file)
org_name <- json_data$id
# extract lists and the sublist
usernames <- json_data$twitter
following <- usernames$following
# create empty vector to populate
longitud = length(following)
names <- vector(length = longitud)
# loop to populate the empty vector with third element of the sub-list
for(i in 1:longitud){
names[i] <- following[[i]][3]
}
# create a data frame and change column name
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# create a data frame with the correct formatting ready to comparison
org_handles <- data.frame(paste("#", names_list$usernames, sep=""))
colnames(org_handles) <- "Twitter"
# load master file and select the needed columns
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv") %>%
select(Name, AKA, Twitter)
# merge data frames and present the results
org_list <- inner_join(psa_handles, org_handles)
length(org_list$Twitter)
length(usernames$following)
My first attempt is to include this code at the beginning:
files <- list.files()
for(f in files){
json_data <- fromJSON(file = f)
# the rest of the script for one file here
}
but I do not know how to write the code for the data frame, or even how to integrate both ideas: the working script and the loop over the file names. I took the idea from here.
The new code after Alvaro Morales' answer is the following
library(rjson)
library(dplyr)
library(tidyverse)
archivos <- list.files("./raw_data/")
calculate_accounts <- function(archivos){
#load data from file
path <- paste("./raw_data/", archivos, sep = "")
json_data <- fromJSON(file = path)
org_name <- json_data$id
# extract lists and the sublist
usernames <- json_data$twitter
following <- usernames$following
# create empty vector to populate
longitud = length(following)
names <- vector(length = longitud)
# loop to populate the empty vector with third element of the sub-list
for(i in 1:longitud){
names[i] <- following[[i]][3]
}
# create a data frame and change column name
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# create a data frame with the correct formatting ready to comparison
org_handles <- data.frame(paste("#", names_list$usernames, sep=""))
colnames(org_handles) <- "Twitter"
# load master file and select the needed columns
psa_handles <- read_csv(file = "./psa_handles.csv") %>%
select(Name, AKA, Twitter)
# merge data frames and present the results
org_list <- inner_join(psa_handles, org_handles)
accounts_db_org <- length(org_list$Twitter)
accounts_total_org <- length(usernames$following)
}
table_psa <- map_dfr(archivos, calculate_accounts)
However, now there is an error at the step that prints Joining, by = "Twitter": it says subscript out of bounds.
Links to 3 test files to put together in raw_data folder:
https://drive.google.com/file/d/1ilUHwLjgtZCzh0LneIJEhTryrGumDF1V/view?usp=sharing
https://drive.google.com/file/d/1KM3hRZ8DzgPMEsMFmwBdmMNHrPCttuaB/view?usp=sharing
https://drive.google.com/file/d/17cWXJ9ltGXZ6izkgJv0uyNwStrE95_OA/view?usp=sharing
Link to the master file to compare:
https://drive.google.com/file/d/11fOpYFFfHijhZl_CuWHKvkrI7edkpUNQ/view?usp=sharing
<<<<< UPDATE >>>>>>
I have been trying to find the solution, and I got the code to work and produce a valid output (a 180x3 data frame), but the columns that should hold the values of the objects accounts_db_org and accounts_total_org show NA. When I check the values stored in those objects, they are correct (for the last iteration). So the output now has the right format, but with NA instead of numbers.
I am really close, but I cannot get the code to show the right numbers. My last attempt is:
library(rjson)
library(dplyr)
library(tidyverse)
archivos <- list.files("./raw_data", pattern = "json", full.names = TRUE)
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv", show_col_types = FALSE) %>%
select(Name, AKA, Twitter)
nr_archivos <- length(archivos)
psa_result <- matrix(nrow = nr_archivos, ncol = 3)
# loop for working with all files, one by one
for(f in 1:nr_archivos){
# load file
json_data <- fromJSON(file = archivos[f])
org_name <- json_data$id
# extract lists and the sublist
usernames <- json_data$twitter
following <- usernames$following
# empty vector
longitud = length(following)
names <- vector(length = longitud)
# loop to populate with the third element of each i item of the sublist
for(i in 1:longitud){
names[i] <- following[[i]][3]
}
# convert the list into a data frame
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# applying some format prior to comparison
org_handles <- data.frame(paste("#", names_list$usernames, sep=""))
colnames(org_handles) <- "Twitter"
# merge tables and calculate the results for each iteration
org_list <- inner_join(psa_handles, org_handles)
accounts_db_org <- length(org_list$Twitter)
accounts_total_org <- length(usernames$following)
# populate the matrix row by row
psa_result[f] <- c(org_name, accounts_db_org, accounts_total_org)
}
# create a data frame from the matrix and save the result
psa_result <- data.frame(psa_result)
write_csv(psa_result, file = "./outputs/cuentas_seguidas_en_psa.csv")
The subscript out of bounds error was caused by a json file with 0 records. That was fixed by deleting the file.
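Both symptoms are consistent with two base R details: m[f] addresses a single cell of a matrix while m[f, ] addresses the whole row, and 1:length(x) counts down over 1 and 0 when x is empty, whereas seq_along(x) gives an empty sequence. A minimal sketch with made-up values (not the real json data):

```r
# Made-up results for three files: name, matches, total
results <- list(c("org_a", 20, 2924), c("org_b", 11, 150), c("org_c", 0, 0))

filled <- matrix(nrow = length(results), ncol = 3)
for (f in seq_along(results)) {
  filled[f, ] <- results[[f]]   # note the comma: assign the whole row f
}
filled

# seq_along() yields an empty sequence for empty input, so the loop body is
# skipped; 1:length(x) would instead iterate over 1 and then 0
empty <- list()
iterations <- 0
for (i in seq_along(empty)) iterations <- iterations + 1
iterations
#> [1] 0
```

With seq_along() in the inner loop, a file with 0 records no longer triggers the error, so it would not have to be deleted.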
You can do it with purrr::map or purrr::map_dfr.
Is this what you are looking for?
archivos <- list.files("./raw_data", pattern = "json", full.names = TRUE)
# load master file and select the needed columns. This needs to be out of "calculate_accounts" because you only read it once.
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv") %>%
select(Name, AKA, Twitter)
# calculate accounts
calculate_accounts <- function(archivo){
json_data <- rjson::fromJSON(file = archivo)
org_handles <- json_data %>%
pluck("twitter", "following") %>%
map_chr("username") %>%
as_tibble() %>%
rename(usernames = value) %>%
mutate(Twitter = str_c("#", usernames)) %>%
select(Twitter)
org_list <- inner_join(psa_handles, org_handles)
org_list %>%
mutate(accounts_db_org = length(Twitter),
accounts_total_org = nrow(org_handles)) %>%
select(-Twitter)
}
table_psa <- map_dfr(archivos, calculate_accounts)
#output:
# A tibble: 53 x 4
Name AKA accounts_db_org accounts_total_org
<chr> <chr> <int> <int>
1 Association of American Medical Colleges AAMC 20 2924
2 American College of Cardiology ACC 20 2924
3 American Heart Association AHA 20 2924
4 British Association of Dermatologists BAD 20 2924
5 Canadian Psoriasis Network CPN 20 2924
6 Canadian Skin Patient Alliance CSPA 20 2924
7 European Academy of Dermatology and Venereology EADV 20 2924
8 European Society for Dermatological Research ESDR 20 2924
9 US Department of Health and Human Service HHS 20 2924
10 International Alliance of Dermatology Patients Organisations (Global Skin) IADPO 20 2924
# ... with 43 more rows
Unfortunately, the answer provided by Álvaro does not work as expected, since the output repeats the same number with different organisation names, making it really difficult to read. Actually, the number 20 is repeated 20 times, the number 11, 11 times, and so on. The information is there, but it is not accessible without further data treatment.
I was doing my own research in the meantime and arrived at the following code. I finally got it to work, but the data format was "matrix" "array", which was really confusing. Fortunately, I wrote the last lines to transpose the data, unlist the array and convert it into a matrix, which can then be converted to a data frame and manipulated as usual.
Maybe my explanation is not very useful, and since I am a newbie, I am sure the code is far from being elegant and optimised. Anyway, please review the code below:
library(purrr)
library(rjson)
library(dplyr)
library(tidyverse)
setwd("~/documentos/varios/proyectos/programacion/R/psa_twitter")
# Load data from files.
archivos <- list.files("./raw_data/json_files",
pattern = ".json",
full.names = TRUE)
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv") %>%
select(Name, AKA, Twitter)
nr_archivos <- length(archivos)
calcula_cuentas <- function(a){
# Extract lists
json_data <- fromJSON(file = a)
org_aka <- json_data$id
org_meta <- json_data$metadata
org_name <- org_meta$company
twitter <- json_data$twitter
following <- twitter$following
# create an empty vector to populate
longitud = length(following)
names <- vector(length = longitud)
# loop to populate the empty vector with third element of the sub-list
for(i in 1:longitud){
names[i] <- following[[i]][3]
}
# create a data frame and change column name
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# Create a data frame with the correct formatting ready to comparison
org_handles <- data.frame(paste("#",
names_list$usernames,
sep="")
)
colnames(org_handles) <- "Twitter"
# merge tables
org_list <- inner_join(psa_handles, org_handles)
cuentas_db_org <- length(org_list$Twitter)
cuentas_total_org <- length(twitter$following)
results <- data.frame(Name = org_name,
AKA = org_aka,
Cuentas_db = cuentas_db_org,
Total = cuentas_total_org)
results
}
# apply function to list of files and unlist the result
psa <- sapply(archivos, calcula_cuentas)
psa1 <- t(as.data.frame(psa))
psa2 <- matrix(unlist(psa1), ncol = 4) %>%
as.data.frame()
colnames(psa2) <- c("Name", "AKA", "tw_int_outbound", "tw_ext_outbound")
# Save the results.
saveRDS(psa2, file = "rda/psa.RDS")
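The "matrix" "array" result comes from sapply(), which tries to simplify the list of one-row data frames. A possible alternative is to keep the list with lapply() and stack it with rbind(), which returns a data frame directly. The sketch below is not the code above: summarise_org() is a hypothetical stand-in for calcula_cuentas(), the parsed lists replace fromJSON() output, the field name "username" is taken from the earlier answer, and matches are counted with %in% instead of inner_join().

```r
# Made-up master list and two parsed "files" (stand-ins for fromJSON() output)
psa_handles <- data.frame(Twitter = c("#aha", "#acc", "#bad"))
parsed <- list(
  list(id = "ORG1", twitter = list(following = list(
    list(id = 1, name = "AHA", username = "aha"),
    list(id = 2, name = "Other", username = "other")))),
  list(id = "ORG2", twitter = list(following = list()))  # file with 0 records
)

# Hypothetical stand-in for calcula_cuentas(): one row per organisation
summarise_org <- function(json_data) {
  following <- json_data$twitter$following
  # vapply() returns character(0) for an empty list, so empty files are safe
  usernames <- vapply(following, function(x) x[["username"]], character(1))
  handles <- sprintf("#%s", usernames)  # zero-length in, zero-length out
  data.frame(AKA = json_data$id,
             Cuentas_db = sum(handles %in% psa_handles$Twitter),
             Total = length(following))
}

# lapply() keeps a plain list of data frames; rbind() stacks them
psa <- do.call(rbind, lapply(parsed, summarise_org))
psa
```

Counting with %in% gives the same number as the join only if each handle appears once in the master file, which is an assumption here.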
I am looking for the most elegant way to loop through and read in multiple files organized by date, and to select the most recent value, based on multiple keys, if anything changed.
Sadly, I need to read in all the files and not just the last one, because a record could disappear from a later file and I would still like to capture it.
Here is an example of what the files look like (I'm posting them comma-separated even though they are fixed width):
file_20200101.txt
key_1,key_2,value,date_as_numb
123,abc,100,20200101
456,def,200,20200101
789,xyz,100,20200101
100,foo,15,20200101
file_20200102.txt
key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102
and an example of the desired output:
desired_df
key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102
100,foo,15,20200101
In addition, here is some code that I know works to read in multiple files and then get my ideal output, but I need this to happen inside the loop. The data frame would be way too big if I imported and bound all the files:
files <- list.files(path, pattern = ".txt")
df <- files %>%
map(function(f) {
print(f)
df <- fread(f)
df <- df %>% mutate(date_as_numb = f)
return(df)
}) %>% bind_rows()
df <- df %>%
mutate(file_date = as.numeric(str_remove_all(date_as_numb, ".*_"))) %>%
group_by(key_1, key_2) %>%
filter(date_as_numb == max(date_as_numb))
Thanks in advance!
I don't know what counts as "way too big" for you. data.table can (allegedly) handle really big data, so if bind_rows() on the list is not OK, maybe use data.table.
(In my own experience, dplyr::group_by can be really slow with many groups (say 10^5 groups) in large-ish data (say around 10^6 rows). I don't have much experience with data.table, but all the threads mention its superiority for large data.)
I've used this answer for merging a list of data tables:
library(data.table)
dt1 <- fread(text = "key_1,key_2,value,date_as_numb
123,abc,100,20200101
456,def,200,20200101
789,xyz,100,20200101
100,foo,15,20200101")
dt2 <- fread(text = "key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102")
ls_files <- list(dt1, dt2)
# you would have created this list by calling fread with lapply, like
# ls_files <- lapply(files, fread)
# Now here a two-liner with data table.
alldata <- Reduce(function(...) merge(..., all = TRUE), ls_files)
alldata[alldata[, .I[which.max(date_as_numb)], by = .(key_1, key_2)]$V1]
#> key_1 key_2 value date_as_numb
#> 1: 100 foo 15 20200101
#> 2: 123 abc 50 20200102
#> 3: 456 def 500 20200102
#> 4: 789 xyz 300 20200102
Created on 2021-01-28 by the reprex package (v0.3.0)
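If holding every file in memory at once is the concern, one possible variation is to fold the files in one at a time and keep only the most recent row per key after each step. This is a sketch, not the approach above: keep_latest() is a hypothetical helper, and the text strings stand in for the daily files.

```r
library(data.table)

# Stand-ins for the daily files, oldest first (same data as the example above)
inputs <- c(
  "key_1,key_2,value,date_as_numb
123,abc,100,20200101
456,def,200,20200101
789,xyz,100,20200101
100,foo,15,20200101",
  "key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102"
)

# Hypothetical helper: merge one new file into the running result and
# keep only the most recent row per (key_1, key_2)
keep_latest <- function(latest, new) {
  both <- rbindlist(list(latest, new), use.names = TRUE)
  setorder(both, key_1, key_2, -date_as_numb)  # newest row first within a key
  unique(both, by = c("key_1", "key_2"))       # keeps the first row per key
}

latest <- NULL
for (txt in inputs) {   # with real files: loop over file names and fread() each
  latest <- keep_latest(latest, fread(text = txt))
}
latest
```

The running table never holds more than one row per key plus the file currently being read, and keys that vanish from later files (such as 100/foo) are retained with their last known date.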
I am trying to split a data frame and write it to CSV files in R using the unique values in one variable. I am new to R and I'm not entirely sure I know what I'm doing.
## trying to subset data
library(dplyr)
library(plyr)
#set the working directory
setwd("S:/some stuff")
## load the datafile into an object called data.
data <- read.csv("S:/some stuff/Area.csv",
header = TRUE, sep = ",")
#Create subsets of data by LA
LA<-subset(data,AREA == "LA")
My data frame has 2,500 observations and 20 variables.
My dataframe is called LA
The variable I'd like to split it by is called Disease
I found this: How to create multiple .csv files in R?
And reappropriated it accordingly
from
plyr::d_ply(iris, .(Species), function(x) write.csv(x,
file = paste(x$Species, ".csv", sep = "")))
to
plyr::d_ply(LA, .(Disease), function(x) write.csv(x,
file = paste(LA$Disease, ".csv", )))
However....
Error in file(file, ifelse(append, "a", "w")) :
invalid 'description' argument
In addition: Warning message:
In if (file == "") file <- stdout() else if (is.character(file)) { :
There are two things I'd like to solve.
1) subsetting a dataframe
2) writing to a path
Ideally I'd like to loop through it from the import of data (the Area.csv file).
This has areas and disease. There are 12 areas and 20 diseases.
I would like to create csv files of each disease by area.
In this example Area = LA and then disease.
How can I step through using a loop to create the 20 different files for each area?
I thought this:
https://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/
mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")
#mpExpenses2012 is the large dataframe containing data for each MP
#Get the list of unique MP names
for (name in levels(mpExpenses2012$MP.s.Name)){
#Subset the data by MP
tmp=subset(mpExpenses2012,MP.s.Name==name)
#Create a new filename for each MP - the folder 'mpExpenses2012' should already exist
fn=paste('mpExpenses2012/',gsub(' ','',name),sep='')
#Save the CSV file containing separate expenses data for each MP
write.csv(tmp,fn,row.names=FALSE)
}
might be helpful, but it's writing to a path that's getting me down.
EDIT
library(tidyr)
library(purrr)
temp_dir <- tempfile()
dir.create(temp_dir)
LA %>%
nest(-FinalDiseaseForMonthlyAnalysis) %>%
pwalk(function(FinalDiseaseForMonthlyAnalysis, data) write.csv(data, file.path(temp_dir, paste0(FinalDiseaseForMonthlyAnalysis, ".csv"))))
list.files(temp_dir)
temp_dir
unlink(temp_dir, recursive = T)
This works. But now comes the "where are the files?" question.
Yes: I get the temp file and then the unlink.
But how do I save in a folder on S:/some stuff/?
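A minimal base R sketch of writing one CSV per disease into a folder of your choosing. The data are made up, and out_dir points into tempdir() only so the example runs anywhere; a real path such as a folder on the S: drive would go in its place.

```r
# Made-up data standing in for the LA data frame
LA <- data.frame(Disease = c("flu", "flu", "measles"), cases = c(3, 5, 1))

# Stand-in for the real target folder
out_dir <- file.path(tempdir(), "disease_csv")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# One data frame per disease, written as <out_dir>/<disease>.csv
by_disease <- split(LA, LA$Disease)
for (disease in names(by_disease)) {
  write.csv(by_disease[[disease]],
            file = file.path(out_dir, paste0(disease, ".csv")),
            row.names = FALSE)
}
list.files(out_dir)
```

The key point is that file.path() builds the full path from the folder and a single file name per group, so write.csv() always receives exactly one path.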
EDIT FINAL: SOLVED
I've read that in r everything is a list. And I found a way to split by two columns to do what I needed. Annoyingly it's linked in the comments in here:
https://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/
and I missed it.
I was also having problems generating a dir using dir.create. Who knew that dir.create needs to have recursive = TRUE when you're trying to do stuff? I DO NOW.
Anyway. here's what I did:
## trying to subset data
# generate data:
library(tidyr)
library(purrr)
library(dplyr)
library(readr) # there is no package called 'write'; readr provides write_csv()
## set working directory
setwd("S:/somestuff")
#create the directories - pretty sure there's a way to avoid doing this long hand
dir.create("S:/somestuff/CSV source files", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA1", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA2", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA3", recursive = TRUE)
#Read in the CSV
DF = read.csv("S:/somestuff/CSV source files/ALL.csv",
header = TRUE, sep = ",")
glimpse(DF)
#This splits the dataframe generated above (DF) and calls it DF4
DF4 <- split(DF,list(DF$LA,DF$FinalDiseaseForMonthlyAnalysis))
lapply(names(DF4), function(name) write.csv(DF4[[name]], file = paste("S:/somestuff/CSV source files/",gsub('','',name),sep = ''), row.names = F))
I'm guessing if I read in the dataframe I could then use dir.create to create paths from the names in LA in the data frame.
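That guess can be sketched as follows: split by both columns with "/" as the separator, so each group name doubles as a relative path, and create the area folder on demand. The data, the short column name Disease and the root folder under tempdir() are all made up for the example.

```r
# Made-up data with an area column (LA) and a disease column
DF <- data.frame(LA = c("LA1", "LA1", "LA2"),
                 Disease = c("flu", "measles", "flu"),
                 cases = c(3, 1, 7))

root <- file.path(tempdir(), "csv_by_area")  # stand-in for the real root folder

# drop = TRUE skips area/disease combinations that have no rows
groups <- split(DF, list(DF$LA, DF$Disease), drop = TRUE, sep = "/")
for (key in names(groups)) {
  target <- file.path(root, paste0(key, ".csv"))  # e.g. <root>/LA1/flu.csv
  dir.create(dirname(target), recursive = TRUE, showWarnings = FALSE)
  write.csv(groups[[key]], target, row.names = FALSE)
}
list.files(root, recursive = TRUE)
```

This avoids listing each dir.create() call by hand, since the folder names come from the values in the LA column.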
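After returning to the problem: it's much easier in the latest version of dplyr
# group the original data frame DF (not the split list DF4) by bare column names,
# so that .y$LA and .y$FinalDiseaseForMonthlyAnalysis exist inside group_walk()
ourdata <- DF %>%
group_by(LA, FinalDiseaseForMonthlyAnalysis) %>%
group_walk(~ write_csv(.x, paste0(.y$LA, .y$FinalDiseaseForMonthlyAnalysis, ".csv")))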
This was really helpful to me! Thanks!! I tried to simplify the crux of the matter.
library(tidyverse)
library(reprex)
states4 <- tribble(~state,~name,~area,
"AL","Alabama",50645.3242,
"AZ","Arizona",113594.0781,
"AR","Arkansas",52035.4727,
"CA","California",155779.2031
)
chain4 <- states4 %>% split(.$state)
map(names(chain4),function(stateabbrev){write_csv(chain4[[stateabbrev]],paste0("~/Downloads/","testtoken_",stateabbrev,".csv"))})
#> [[1]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AL Alabama 50645.
#>
#> [[2]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AR Arkansas 52035.
#>
#> [[3]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AZ Arizona 113594.
#>
#> [[4]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 CA California 155779.
list.files(path="~/Downloads", pattern = "testtoken.*csv")
#> [1] "testtoken_AL.csv" "testtoken_AR.csv" "testtoken_AZ.csv"
#> [4] "testtoken_CA.csv"
reprex()
Created on 2019-10-02 by the reprex package (v0.3.0)
In the end I used:
## trying to subset data
# generate data:
library(tidyr)
library(purrr)
library(dplyr)
library(stringr)
library(plyr)
library (car)
## set working directory
setwd("S:/Somestuff/Borough profile maps/Working")
## read data in from geocoded file
geocoded<-read.csv("geocoded 2015 - 2018.csv",na.strings=c(""," ","N/A"))
str(geocoded)
str(geocoded$GENDER)
levels(geocoded$LA)
#split geocoded data by LA
x <-split(geocoded,geocoded$LA)
str(x)
#Split geocoded data by LA and Final
#split(x, f, drop = FALSE, sep = ".", lex.order = FALSE, .)
y<-split(geocoded,list(geocoded$Final,geocoded$LA), drop = TRUE, sep = "_")
str(y)
#create dir and then write CSV files of geocoded to file locations
dir.create("S:/Somestuff/Borough profile maps/Working/TEST/", recursive = TRUE)
dir.create("S:/Somestuff/Borough profile maps/Working/TEST/TEST2", recursive = TRUE)
lapply(names(x), function(name) write.csv(x[[name]], file = paste('S:/Somestuff/Borough profile maps/Working/TEST/',gsub(' ','',name),sep = ''), row.names = F))
# paste0() avoids the spaces that paste() would otherwise insert into the file name
lapply(names(y),function(name) write.csv(y[[name]], file = paste0('S:/Somestuff/Borough profile maps/Working/TEST/TEST2/',name,".csv")))
The problem was that in my original code you'll notice I was using read.csv BUT feeding in a .txt file. I changed the file to .csv and BANG. it worked. First time.
I realise that you don't need all the libraries I called at the beginning, but they're left in from my ridiculous number of attempts.
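After returning to the problem: it's much easier in the latest version of dplyr
# group the original data frame DF (not the split list DF4) by bare column names,
# so that .y$LA and .y$FinalDiseaseForMonthlyAnalysis exist inside group_walk()
DF %>%
group_by(LA, FinalDiseaseForMonthlyAnalysis) %>%
group_walk(~ write_csv(.x, paste0(.y$LA, .y$FinalDiseaseForMonthlyAnalysis, ".csv")))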