I am trying to split a data frame by the unique values in one variable and write each subset to a CSV file in R. I am new to R and I'm not entirely sure I know what I'm doing.
## trying to subset data
library(dplyr)
library(plyr)
#set the working directory
setwd("S:/some stuff")
## load the datafile into an object called data.
data <- read.csv("S:/some stuff/Area.csv",
header = TRUE, sep = ",")
#Create subsets of data by LA
LA<-subset(data,AREA == "LA")
My data frame has 2,500 observations and 20 variables.
My data frame is called LA.
The variable I'd like to split it by is called Disease
I found this: How to create multiple .csv files in R?
and adapted it accordingly,
from
plyr::d_ply(iris, .(Species), function(x) write.csv(x,
file = paste(x$Species, ".csv", sep = "")))
to
plyr::d_ply(LA, .(Disease), function(x) write.csv(x,
file = paste(LA$Disease, ".csv", )))
However....
Error in file(file, ifelse(append, "a", "w")) :
invalid 'description' argument
In addition: Warning message:
In if (file == "") file <- stdout() else if (is.character(file)) { :
There are two things I'd like to solve.
1) subsetting a dataframe
2) writing to a path
Ideally I'd like to loop through it from the import of data (the Area.csv file).
This has areas and diseases. There are 12 areas and 20 diseases.
I would like to create csv files of each disease by area.
In this example Area = LA and then disease.
How can I step through using a loop to create the 20 different files for each area?
I thought this:
https://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/
mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")
#mpExpenses2012 is the large dataframe containing data for each MP
#Get the list of unique MP names
for (name in levels(mpExpenses2012$MP.s.Name)){
#Subset the data by MP
tmp=subset(mpExpenses2012,MP.s.Name==name)
#Create a new filename for each MP - the folder 'mpExpenses2012' should already exist
fn=paste('mpExpenses2012/',gsub(' ','',name),sep='')
#Save the CSV file containing separate expenses data for each MP
write.csv(tmp,fn,row.names=FALSE)
}
might be helpful, but it's writing to a path that's getting me down.
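Roughly, what I'm imagining is a nested loop like this (just an untested sketch; it assumes the columns in Area.csv are called AREA and Disease, as above):
data <- read.csv("S:/some stuff/Area.csv", header = TRUE, sep = ",")
for (area in unique(data$AREA)) {
  # one folder per area, e.g. S:/some stuff/LA
  out_dir <- file.path("S:/some stuff", area)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  area_data <- subset(data, AREA == area)
  for (disease in unique(area_data$Disease)) {
    # one file per disease within the area
    write.csv(subset(area_data, Disease == disease),
              file = file.path(out_dir, paste0(disease, ".csv")),
              row.names = FALSE)
  }
}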
EDIT
library(tidyr)
library(purrr)
temp_dir <- tempfile()
dir.create(temp_dir)
LA %>%
nest(-FinalDiseaseForMonthlyAnalysis) %>%
pwalk(function(FinalDiseaseForMonthlyAnalysis, data) write.csv(data, file.path(temp_dir, paste0(FinalDiseaseForMonthlyAnalysis, ".csv"))))
list.files(temp_dir)
temp_dir
unlink(temp_dir, recursive = T)
This works. But now comes the "where are the files?" question.
Yes, I see that I get the temp directory and then the unlink.
But how do I save the files in a folder on S:/some stuff/ instead?
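My guess is I just need to swap temp_dir for a real folder, something like this (untested; the folder name is only an example):
out_dir <- "S:/some stuff/by disease"
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
LA %>%
  nest(-FinalDiseaseForMonthlyAnalysis) %>%
  pwalk(function(FinalDiseaseForMonthlyAnalysis, data)
    write.csv(data, file.path(out_dir, paste0(FinalDiseaseForMonthlyAnalysis, ".csv"))))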
EDIT FINAL: SOLVED
I've read that in R everything is a list, and I found a way to split by two columns to do what I needed. Annoyingly, it's linked in the comments here:
https://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/
and I missed it.
I was also having problems generating a directory using dir.create. Who knew that dir.create needs recursive = TRUE when you're creating nested directories? I DO NOW.
Anyway, here's what I did:
## trying to subset data
# generate data:
library(tidyr)
library(purrr)
library(dplyr)
library(readr)  # for write_csv
## set working directory
setwd("S:/somestuff")
#create the directories - pretty sure there's a way to avoid doing this long hand
dir.create("S:/somestuff/CSV source files", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA1", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA2", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA3", recursive = TRUE)
#Read in the CSV
DF = read.csv("S:/somestuff/CSV source files/ALL.csv",
header = TRUE, sep = ",")
glimpse(DF)
#This splits the dataframe generated above (DF) and calls it DF4
DF4 <- split(DF,list(DF$LA,DF$FinalDiseaseForMonthlyAnalysis))
lapply(names(DF4), function(name)
  write.csv(DF4[[name]],
            file = paste0("S:/somestuff/CSV source files/", gsub(" ", "", name)),
            row.names = F))
I'm guessing that if I read in the data frame I could then use dir.create to create paths from the names in the LA column of the data frame.
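Something along these lines, I think (untested sketch, assuming the area column in ALL.csv is called LA):
# create one folder per area from the values in the LA column
for (area in unique(DF$LA)) {
  dir.create(file.path("S:/somestuff/CSV source files", area),
             recursive = TRUE, showWarnings = FALSE)
}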
After returning to the problem: it's much easier in the latest version of dplyr.
ourdata <- DF %>%
  group_by(LA, FinalDiseaseForMonthlyAnalysis) %>%
  group_walk(~ write_csv(.x, paste0(.y$LA, .y$FinalDiseaseForMonthlyAnalysis, ".csv")))
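And to get those straight into a folder on S:/ rather than the working directory, wrapping the file name in file.path() should do it (untested sketch; the folder is just an example):
out_dir <- "S:/somestuff/CSV source files"
DF %>%
  group_by(LA, FinalDiseaseForMonthlyAnalysis) %>%
  group_walk(~ write_csv(.x, file.path(out_dir, paste0(.y$LA, .y$FinalDiseaseForMonthlyAnalysis, ".csv"))))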
This was really helpful to me! Thanks!! I tried to simplify the crux of the matter.
library(tidyverse)
library(reprex)
states4 <- tribble(~state,~name,~area,
"AL","Alabama",50645.3242,
"AZ","Arizona",113594.0781,
"AR","Arkansas",52035.4727,
"CA","California",155779.2031
)
chain4 <- states4 %>% split(.$state)
map(names(chain4),function(stateabbrev){write_csv(chain4[[stateabbrev]],paste0("~/Downloads/","testtoken_",stateabbrev,".csv"))})
#> [[1]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AL Alabama 50645.
#>
#> [[2]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AR Arkansas 52035.
#>
#> [[3]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AZ Arizona 113594.
#>
#> [[4]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 CA California 155779.
list.files(path="~/Downloads", pattern = "testtoken.*csv")
#> [1] "testtoken_AL.csv" "testtoken_AR.csv" "testtoken_AZ.csv"
#> [4] "testtoken_CA.csv"
reprex()
Created on 2019-10-02 by the reprex package (v0.3.0)
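A small variation: if you don't want the list of tibbles echoed back at the console, purrr::iwalk() (or walk()) runs the same write-out purely for its side effects:
# .x is each split tibble, .y its name (the state abbreviation); nothing is printed
purrr::iwalk(chain4, ~ write_csv(.x, paste0("~/Downloads/", "testtoken_", .y, ".csv")))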
In the end I used:
## trying to subset data
# generate data:
library(tidyr)
library(purrr)
library(dplyr)
library(stringr)
library(plyr)
library(car)
## set working directory
setwd("S:/Somestuff/Borough profile maps/Working")
## read data in from geocoded file
geocoded<-read.csv("geocoded 2015 - 2018.csv",na.strings=c(""," ","N/A"))
str(geocoded)
str(geocoded$GENDER)
levels(geocoded$LA)
#split geocoded data by LA
x <-split(geocoded,geocoded$LA)
str(x)
#Split geocoded data by LA and Final
#split(x, f, drop = FALSE, sep = ".", lex.order = FALSE, .)
y<-split(geocoded,list(geocoded$Final,geocoded$LA), drop = TRUE, sep = "_")
str(y)
#create dir and then write CSV files of geocoded to file locations
dir.create("S:/Somestuff/Borough profile maps/Working/TEST/",, recursive = TRUE)
dir.create("S:/Somestuff/Borough profile maps/Working/TEST/TEST2",, recursive = TRUE)
lapply(names(x), function(name) write.csv(x[[name]], file = paste('S:/Somestuff/Borough profile maps/Working/TEST/',gsub(' ','',name),sep = ''), row.names = F))
lapply(names(y), function(name) write.csv(y[[name]], file = paste0('S:/Somestuff/Borough profile maps/Working/TEST/TEST2/', name, ".csv")))
The problem was that in my original code, you'll notice, I was using read.csv but feeding it a .txt file. I changed the file to .csv and bang: it worked. First time.
I realise that you don't need all the libraries I called at the beginning, but they're left in from my ridiculous number of attempts.
Related
I wrote a small R script. The input is text files (thousands of journal articles). I generated the metadata (including the publication year) from the file names. Now I want to calculate the total number of tokens per year, but I am not getting anywhere.
# Metadata from filenames
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt", docvarsfrom = "filenames", dvsep="_",
docvarnames = c("Unit", "Year", "Volume", "Issue"))
# we add some more metadata columns to the data frame
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 0, 4)
# Corpus
SPARA_corp <- corpus(rawdata_SPARA)
Does anyone here know a solution?
I used the tokens_by function of the quanteda package, which seems to be outdated.
Thanks! I could not get your script to work. But it inspired me to develop an alternative solution:
# Load the necessary libraries
library(readtext)
library(dplyr)
library(quanteda)
# Set the directory containing the text files
dir <- "/Textfiles/SPARA_paragraphs"
# Read in the text files using the readtext function
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt", docvarsfrom = "filenames", dvsep="_", docvarnames = c("Unit", "Year", "Volume", "Issue"))
# Extract the year from the file name
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 0, 4)
# Group the data by year and summarize by tokens
rawdata_SPARA_grouped <- rawdata_SPARA %>%
group_by(Year) %>%
summarize(tokens = sum(ntoken(text)))
# Print number of absolute tokens per year
print(rawdata_SPARA_grouped)
You do not need the substr(rawdata_SPARA$Year, 0, 4) call. When readtext parses the file names, it extracts the year for you. In the example below the file names have a structure like EU_euro_2004_de_PSE.txt, so 2004 is automatically inserted into the readtext object. As that object inherits from data.frame, you can use standard data manipulation functions, e.g. from the dplyr package.
Then just group_by year and summarize the tokens. The number of tokens is calculated with quanteda's ntoken function.
See the code below:
library(readtext)
library(quanteda)
library(dplyr)  # for group_by()/summarize() below
# Prepare sample corpus
set.seed(123)
DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
docvarsfrom = "filenames",
docvarnames = c("unit", "context", "year", "language", "party"),
encoding = "LATIN1")
rt$year = sample(2005:2007, nrow(rt), replace = TRUE)
# Calculate tokens
rt$tokens <- ntoken(corpus(rt), remove_punct = TRUE)
# Find distribution by year
rt %>% group_by(year) %>% summarize(total_tokens = sum(tokens))
Output:
# A tibble: 3 × 2
year total_tokens
<int> <int>
1 2005 5681
2 2006 26564
3 2007 24119
I have a folder with multiple XML files. All the files have the same basic structure. However, each file actually contains data related to a single entity distributed among 16 parent nodes. Some nodes have children, some even have grandchildren, and some great-grandchildren.
I want to create a data frame from multiple files with only selected nodes/children/grandchildren.
As a first step, I read a single XML file as a list, then ran a few lines of code to get the required data into a vector. Eventually, I converted the vector to a data frame in exactly the shape I want.
This is the code:
library(xml2)
library(tidyverse)
x = as_list(read_xml("ACTRN12605000003673.xml"))
tmp = c(ACTR_Number = as.character(x$ANZCTR_Trial$actrnumber),
primary_sponsor_type = as.character(x$ANZCTR_Trial$sponsorship$primarysponsortype),
primary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorname),
primary_sponsor_address = as.character(x$ANZCTR_Trial$sponsorship$primarysponsoraddress),
primary_sponsor_country = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorcountry),
funding_source_type = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingtype),
funding_source_name = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingname),
funding_source_address = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingaddress),
funding_source_country = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingcountry),
secondary_sponsor_type = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsortype),
secondary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsorname),
secondary_sponsor_address = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsoraddress),
secondary_sponsor_country = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsorcountry))
tmp = as.list(tmp)
tmp = as.data.frame(tmp)
For the next step, I tried to work with two XML files together. I used the following code to read the two files simultaneously, but beyond that I don't know how to proceed.
all_files = list.files(pattern=".xml", path = getwd(), full.names = TRUE)
x = lapply(all_files, read_xml)
class(x)
Sample files here
We can encapsulate your data gathering process into a function. Since each XML file seems to represent one row in your desired output dataframe, we can use purrr::map_dfr to construct the data rowwise. I slightly modified your code. See below:
get_row_data <- function(file_name) {
xml <- xml2::read_xml(file_name)
read_text <- \(x, xpath) rvest::html_elements(x, xpath = xpath) |> rvest::html_text(TRUE)
xpaths <- c(
ACTR_Number = "/ANZCTR_Trial/actrnumber",
primary_sponsor_type = "/ANZCTR_Trial/sponsorship/primarysponsortype",
primary_sponsor_name = "/ANZCTR_Trial/sponsorship/primarysponsorname",
primary_sponsor_address = "/ANZCTR_Trial/sponsorship/primarysponsoraddress",
primary_sponsor_country = "/ANZCTR_Trial/sponsorship/primarysponsorcountry",
funding_source_type = "/ANZCTR_Trial/sponsorship/fundingsource/fundingtype",
funding_source_name = "/ANZCTR_Trial/sponsorship/fundingsource/fundingname",
funding_source_address = "/ANZCTR_Trial/sponsorship/fundingsource/fundingaddress",
funding_source_country = "/ANZCTR_Trial/sponsorship/fundingsource/fundingcountry",
secondary_sponsor_type = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsortype",
secondary_sponsor_name = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsorname",
secondary_sponsor_address = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsoraddress",
secondary_sponsor_country = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsorcountry"
)
x <- read_text(xml, paste0(xpaths, collapse = " | "))
names(x) <- names(xpaths)
tibble::as_tibble(as.list(x))
}
all_files <- list.files(pattern = ".xml", path = getwd(), full.names = TRUE)
purrr::map_dfr(all_files, get_row_data)
Output
# A tibble: 2 x 13
ACTR_Number primary_sponsor_~ primary_sponsor_n~ primary_sponsor_~ primary_sponsor~ funding_source_~ funding_source_~
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ACTRN12605000003673 Hospital Barwon Health "272-322 Ryrie S~ Australia Commercial sect~ Astra Zeneca
2 ACTRN12605000025639 Other Collaborat~ Australian Gastro~ "88 Mallett St\n~ Australia Commercial sect~ Roche Products ~
# ... with 6 more variables: funding_source_address <chr>, funding_source_country <chr>, secondary_sponsor_type <chr>,
# secondary_sponsor_name <chr>, secondary_sponsor_address <chr>, secondary_sponsor_country <chr>
R supports lists within lists. You can define a list that will hold all your tmp lists, one per file. After putting every file's result into that list, you can turn it into a data frame or tibble (or keep the tmp lists as list columns) and address them with the map functions from library(purrr).
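A rough sketch of that idea, reusing the extraction from the question (only two of the fields are shown; the rest follow the same pattern):
library(xml2)
library(dplyr)
all_files <- list.files(pattern = "\\.xml$", path = getwd(), full.names = TRUE)
# one list element per file, each a named list like the question's tmp
all_rows <- lapply(all_files, function(f) {
  x <- as_list(read_xml(f))
  list(ACTR_Number = as.character(x$ANZCTR_Trial$actrnumber),
       primary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorname))
})
# a list of lists becomes one tibble, one row per file
bind_rows(all_rows)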
I have incredibly raw data in the format of a .zip with a .txt file inside. For the most part, it cleanly reads in using read_csv, but there are some lines where the data is logging something else and completely skews the column structure. This data has no chance of being fixed.
When using read_csv, it shows up as a parsing problem. I want to set up my code where if this problem appears in the data, the whole file is ignored. It'd be great if there was a log of which files were ignored/thrown out. I looked into possibly(), but since it's not a full error with the file, only the lines, it doesn't skip the file.
This is my code at the moment.
library(dplyr)
library(readr)
library(purrr)
read_log <- function(path) {
read_csv(path, col_types = cols(.default = col_character())) %>%
mutate(filename = basename(path))
}
test_files <- file.path("example.txt") #would normally be list.files, simplified for this reprex
raw_data <- map_dfr(test_files, read_log)
#> Warning: 6 parsing failures.
#> row col expected actual file
#> 3 -- 17 columns 4 columns 'example.txt'
#> 4 -- 17 columns 23 columns 'example.txt'
#> 5 -- 17 columns 23 columns 'example.txt'
#> 6 -- 17 columns 23 columns 'example.txt'
#> 7 -- 17 columns 23 columns 'example.txt'
#> ... ... .......... .......... .............
#> See problems(...) for more details.
You can return NULL when a warning is raised. Try using this function:
library(readr)
library(purrr)
library(dplyr)
read_log <- function(path) {
data <- tryCatch(read_csv(path,col_types = cols(.default = col_character())),
warning = function(e) return(NULL))
if(!is.null(data))
data <- data %>% mutate(filename = basename(path))
return(data)
}
Read the data with map instead of map_dfr:
all_data <- map(test_files, read_log)
Files which were not read:
not_read_files <- test_files[sapply(all_data, is.null)]
Combine the data:
total_data <- bind_rows(all_data)
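If you also want that log on disk, you can simply write the vector of skipped files out (the file name here is just an example):
# keep a record of the files that were thrown out because of parsing problems
writeLines(not_read_files, "skipped_files.txt")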
I am endeavoring to combine files but find myself writing very redundant code, which is cumbersome. I have looked at the documentation but for some reason cannot find anything about how to do this.
Basically, I read the files in from my machine and then want to combine the exact same columns from each file (the only difference is the year).
Can you help?
I read in the files from my machine ("C:/SAM/CODE1_2005.csv", then "C:/SAM/CODE1_2006.csv", then "C:/SAM/CODE1_2007.csv", and so on until 2016).
I then define the columns, all the same for each year I have downloaded, such as COLLEGESCORECARD05_A<-subset(COLLEGESCORECARD05, select=c(ï..UNITID,OPEID,OPEID6,INSTNM)) and so forth...
and then combine the files into one database.
The issue is that this seems inefficient. Is there a more efficient way?
You can make a list of the .csv files in the folder and then read them all into a single data frame with purrr::map_df. You can then add a column to differentiate between the files:
library(tidyverse)
df <- list.files(path="C://SAM",
pattern="*.csv") %>%
purrr::map_df(function(x) readr::read_csv(x) %>%
mutate(filename=gsub(" .csv", "", basename(x)))
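If you only want the handful of columns from your subset() call (UNITID, OPEID, OPEID6, INSTNM; adjust the names to whatever read_csv actually reports for your files), a sketch of the same pipeline with a select() added:
df <- list.files(path = "C://SAM",
                 pattern = "*.csv",
                 full.names = TRUE) %>%
  purrr::map_df(function(x) readr::read_csv(x) %>%
                  select(any_of(c("UNITID", "OPEID", "OPEID6", "INSTNM"))) %>%
                  mutate(filename = gsub("\\.csv$", "", basename(x))))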
At the risk of seeming icky for self-promotion, I wrote a function that does exactly this (desiderata::apply_to_files()):
# Apply a function to every file in a folder that matches a regex pattern
rain <- apply_to_files(path = "Raw data/Rainfall", pattern = "csv",
func = readr::read_csv, col_types = "Tiic",
recursive = FALSE, ignorecase = TRUE,
method = "row_bind")
dplyr::sample_n(rain, 5)
#> # A tibble: 5 x 5
#>
#> orig_source_file Time Tips mV Event
#> <chr> <dttm> <int> <int> <chr>
#> 1 BOW-BM-2016-01-15.csv 2015-12-17 03:58:00 0 4047 Normal
#> 2 BOW-BM-2016-01-15.csv 2016-01-03 00:27:00 2 3962 Normal
#> 3 BOW-BM-2016-01-15.csv 2015-11-27 12:06:00 0 4262 Normal
#> 4 BIL-BPA-2018-01-24.csv 2015-11-15 10:00:00 0 4378 Normal
#> 5 BOW-BM-2016-08-05.csv 2016-04-13 19:00:00 0 4447 Normal
In this case, all of the files have identical columns and order (Time, Tips, mV, Event), so I can just use method = "row_bind" and the function will automatically add the filename as an extra column. There are other methods available:
"full_join" (the default) returns all columns and rows. "left_join" returns all rows from the first file, and all columns from subsequent files. "inner_join" returns rows from the first file that have matches in subsequent files.
Internally, the function builds a list of files in the path (recursive or not), runs an lapply() on the list, and then handles merging the new list of dataframes into a single dataframe:
apply_to_files <- function(path, pattern, func, ..., recursive = FALSE, ignorecase = TRUE,
method = "full_join") {
file_list <- list.files(path = path,
pattern = pattern,
full.names = TRUE, # Return full relative path.
recursive = recursive, # Search into subfolders.
ignore.case = ignorecase)
df_list <- lapply(file_list, func, ...)
# The .id arg of bind_rows() uses the names to create the ID column.
names(df_list) <- basename(file_list)
out <- switch(method,
"full_join" = plyr::join_all(df_list, type = "full"),
"left_join" = plyr::join_all(df_list, type = "left"),
"inner_join" = plyr::join_all(df_list, type = "inner"),
# The fancy joins don't have orig_source_file because the values were
# getting all mixed together.
"row_bind" = dplyr::bind_rows(df_list, .id = "orig_source_file"))
return(invisible(out))
}
I have been using the XLConnect function loadWorkbook to load each xlsx file into R and then rbind to merge them together. What is the best way of doing this instead of writing multiple data frames and merging them later? I am trying to use the code below to merge my Excel files into two data frames (two sheet names for most files). The columns are always the same, but the file names will change.
Current (slow) way
require(XLConnect)
df <- loadWorkbook(paste(location,'UK.xlsx',sep=""))
dfb <- loadWorkbook(paste(location,'US.xlsx',sep=""))
UK <-readWorksheet(df,sheet="School",startRow=0,startCol=0,autofitRow=TRUE,endCol=21,header=TRUE)
US <-readWorksheet(dfb,sheet="School",startRow=0,startCol=0,autofitRow=TRUE,endCol=21,header=TRUE)
School <- rbind(UK,US)
UK <-readWorksheet(df,sheet="College",startRow=0,startCol=0,autofitRow=TRUE,endCol=21,header=TRUE)
US <-readWorksheet(dfb,sheet="College",startRow=0,startCol=0,autofitRow=TRUE,endCol=21,header=TRUE)
College <- rbind(UK,US)
New code
require(readxl)
filelist<- list.files(location,pattern='xlsx',full.names = T)
How can I read each sheet name into a data frame when not every file has both sheet names? I need two data frames, one for School and one for College.
I think I need to try something like Schools <- lapply(filelist, read_excel, sheet = "School"), but I get Error: Sheet 'School' not found. I think this error occurs because the School sheet is not in every file. I am using list.files because the file names are not always the same.
What about this approach?
library(purrr)
library(readxl)
# filenames to xl-sheets
files <- sprintf("Mappe%i.xlsx", 1:3)
# read only df for xl-files with school-sheet
xl_school <- map_if(files, ~ "School" %in% excel_sheets(.x), ~ read_excel(.x, sheet = "School"))
# read only df for xl-files with college-sheet
xl_college <- map_if(files, ~ "College" %in% excel_sheets(.x), ~ read_excel(.x, sheet = "College"))
# combine school-files to data frame (repeat same for college)
school_df <- map_df(xl_school, function(x) if(is.data.frame(x)) x)
school_df
#> # A tibble: 3 × 1
#> Test
#> <chr>
#> 1 fdsf
#> 2 543534
#> 3 gfdgfdd
You might need to force the column type to be text. Just add col_types = "text" to the read_excel() call:
# read only df for xl-files with school-sheet
xl_school <- map_if(files, ~ "School" %in% excel_sheets(.x), ~ read_excel(.x, sheet = "School", col_types = "text"))
# read only df for xl-files with college-sheet
xl_college <- map_if(files, ~ "College" %in% excel_sheets(.x), ~ read_excel(.x, sheet = "College", col_types = "text"))
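The College data frame combines the same way. If you also want to keep track of which file each row came from, one option (a sketch; source_file is just an example column name) is to name the list elements by file first:
library(dplyr)
# name each element after its file, drop the non-data-frame entries, then stack
college_df <- xl_college %>%
  purrr::set_names(files) %>%
  purrr::keep(is.data.frame) %>%
  bind_rows(.id = "source_file")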