I have a folder with multiple XML files. All the files have the same basic structure. However, each file actually contains data related to a single entity distributed among 16 parent nodes. Some nodes have children, some even have grandchildren, and some great-grandchildren.
I want to create a data frame from multiple files with only selected nodes/children/grandchildren.
As a first step, I read a single XML file as a list, ran a few lines of code to get the required data into a vector, and finally converted the vector to a data frame, which is the format I want.
This is the code:
library(xml2)
library(tidyverse)
x = as_list(read_xml("ACTRN12605000003673.xml"))
tmp = c(ACTR_Number = as.character(x$ANZCTR_Trial$actrnumber),
primary_sponsor_type = as.character(x$ANZCTR_Trial$sponsorship$primarysponsortype),
primary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorname),
primary_sponsor_address = as.character(x$ANZCTR_Trial$sponsorship$primarysponsoraddress),
primary_sponsor_country = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorcountry),
funding_source_type = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingtype),
funding_source_name = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingname),
funding_source_address = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingaddress),
funding_source_country = as.character(x$ANZCTR_Trial$sponsorship$fundingsource$fundingcountry),
secondary_sponsor_type = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsortype),
secondary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsorname),
secondary_sponsor_address = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsoraddress),
secondary_sponsor_country = as.character(x$ANZCTR_Trial$sponsorship$secondarysponsor$sponsorcountry))
tmp = as.list(tmp)
tmp = as.data.frame(tmp)
For the next step, I tried to work with two XML files together. The following code reads both files, but beyond that I don't know how to proceed.
all_files = list.files(pattern=".xml", path = getwd(), full.names = TRUE)
x = lapply(all_files, read_xml)
class(x)
Sample files here
We can encapsulate your data gathering process into a function. Since each XML file seems to represent one row in your desired output dataframe, we can use purrr::map_dfr to construct the data rowwise. I slightly modified your code. See below:
get_row_data <- function(file_name) {
xml <- xml2::read_xml(file_name)
read_text <- \(x, xpath) rvest::html_elements(x, xpath = xpath) |> rvest::html_text(TRUE)
xpaths <- c(
ACTR_Number = "/ANZCTR_Trial/actrnumber",
primary_sponsor_type = "/ANZCTR_Trial/sponsorship/primarysponsortype",
primary_sponsor_name = "/ANZCTR_Trial/sponsorship/primarysponsorname",
primary_sponsor_address = "/ANZCTR_Trial/sponsorship/primarysponsoraddress",
primary_sponsor_country = "/ANZCTR_Trial/sponsorship/primarysponsorcountry",
funding_source_type = "/ANZCTR_Trial/sponsorship/fundingsource/fundingtype",
funding_source_name = "/ANZCTR_Trial/sponsorship/fundingsource/fundingname",
funding_source_address = "/ANZCTR_Trial/sponsorship/fundingsource/fundingaddress",
funding_source_country = "/ANZCTR_Trial/sponsorship/fundingsource/fundingcountry",
secondary_sponsor_type = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsortype",
secondary_sponsor_name = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsorname",
secondary_sponsor_address = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsoraddress",
secondary_sponsor_country = "/ANZCTR_Trial/sponsorship/secondarysponsor/sponsorcountry"
)
x <- read_text(xml, paste0(xpaths, collapse = " | "))
names(x) <- names(xpaths)
tibble::as_tibble(as.list(x))
}
all_files <- list.files(pattern = ".xml", path = getwd(), full.names = TRUE)
purrr::map_dfr(all_files, get_row_data)
Output
# A tibble: 2 x 13
ACTR_Number primary_sponsor_~ primary_sponsor_n~ primary_sponsor_~ primary_sponsor~ funding_source_~ funding_source_~
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ACTRN12605000003673 Hospital Barwon Health "272-322 Ryrie S~ Australia Commercial sect~ Astra Zeneca
2 ACTRN12605000025639 Other Collaborat~ Australian Gastro~ "88 Mallett St\n~ Australia Commercial sect~ Roche Products ~
# ... with 6 more variables: funding_source_address <chr>, funding_source_country <chr>, secondary_sponsor_type <chr>,
# secondary_sponsor_name <chr>, secondary_sponsor_address <chr>, secondary_sponsor_country <chr>
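One caveat about collapsing all the XPaths with " | " (an observation about the code above, not something from the original answer): the matched nodes come back in document order, so if an element is missing from a file the remaining names and values can shift. A sketch of a per-XPath variant that yields NA for missing nodes instead, assuming each XPath matches at most one node:
# drop-in for the two lines that build and name `x` inside get_row_data()
x <- vapply(xpaths, function(p) {
  node <- rvest::html_elements(xml, xpath = p)
  if (length(node) == 0) NA_character_ else rvest::html_text(node, trim = TRUE)
}, character(1))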
R supports lists within lists. You can define one list that will hold all of your per-file tmp lists; once every file has been added, you can turn that list into a data frame or tibble. The tibble then contains a list column holding the tmp lists, and you can work with its elements using the map functions from library(purrr). See the sketch below.
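For illustration, a minimal sketch of that list-column idea, reusing the extraction from the question (only two fields shown for brevity; read_trial is a hypothetical helper name):
library(tidyverse)
library(xml2)
# hypothetical helper: reduce one file to a named list (the tmp list from the question)
read_trial <- function(file) {
  x <- as_list(read_xml(file))
  list(
    ACTR_Number          = as.character(x$ANZCTR_Trial$actrnumber),
    primary_sponsor_name = as.character(x$ANZCTR_Trial$sponsorship$primarysponsorname)
  )
}
all_files <- list.files(pattern = "\\.xml$", full.names = TRUE)
trials <- tibble(file = all_files) %>%
  mutate(fields = map(file, read_trial)) %>%   # list column: one tmp-style list per file
  unnest_wider(fields)                         # spread each list into its own columns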
I wrote a small R script whose input is text files (thousands of journal articles). I generated the metadata (including the publication year) from the file names. Now I want to calculate the total number of tokens per year, but I am not getting anywhere.
# Metadata from filenames
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt", docvarsfrom = "filenames", dvsep="_",
docvarnames = c("Unit", "Year", "Volume", "Issue"))
# we add some more metadata columns to the data frame
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 0, 4)
# Corpus
SPARA_corp <- corpus(rawdata_SPARA)
Does anyone here know a solution?
I used the tokens_by function of the quanteda package, but it seems to be outdated.
Thanks! I could not get your script to work. But it inspired me to develop an alternative solution:
# Load the necessary libraries
library(readtext)
library(dplyr)
library(quanteda)
# Set the directory containing the text files
dir <- "/Textfiles/SPARA_paragraphs"
# Read in the text files using the readtext function
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt", docvarsfrom = "filenames", dvsep="_", docvarnames = c("Unit", "Year", "Volume", "Issue"))
# Extract the year from the file name
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 0, 4)
# Group the data by year and summarize by tokens
rawdata_SPARA_grouped <- rawdata_SPARA %>%
group_by(Year) %>%
summarize(tokens = sum(ntoken(text)))
# Print number of absolute tokens per year
print(rawdata_SPARA_grouped)
You do not need substr(rawdata_SPARA$Year, 0, 4): the readtext function already extracts the year from the file name. In the example below the file names have a structure like EU_euro_2004_de_PSE.txt, and 2004 is automatically inserted into the readtext object. Since that object inherits from data.frame, you can use standard data manipulation functions, e.g. from the dplyr package.
Then just group_by year and summarize the tokens. The number of tokens is calculated with quanteda's ntoken function.
See the code below:
library(readtext)
library(quanteda)
library(dplyr)  # for %>%, group_by() and summarize() used below
# Prepare sample corpus
set.seed(123)
DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
docvarsfrom = "filenames",
docvarnames = c("unit", "context", "year", "language", "party"),
encoding = "LATIN1")
rt$year = sample(2005:2007, nrow(rt), replace = TRUE)
# Calculate tokens
rt$tokens <- ntoken(corpus(rt), remove_punct = TRUE)
# Find distribution by year
rt %>% group_by(year) %>% summarize(total_tokens = sum(tokens))
Output:
# A tibble: 3 × 2
year total_tokens
<int> <int>
1 2005 5681
2 2006 26564
3 2007 24119
I have a script that works fine for one file. It takes the information from a json file, extracts a list and a sublist of it (A), and then another list B with the third element of list A. It creates a data frame with list B and compares it with a master file. Finally, it provides two numbers: the number of elements in the list B and the number of matching elements of that list when comparing with the master file.
However, I have 180 different json files in a folder and I need to run the script for all of them, and build a data frame with the results for each file. So the final result should be something like this (note that the last line's figures are correct, the first two are fictitious):
The code I have so far is the following:
library(rjson)
library(dplyr)
library(tidyverse)
#load data from file
file <- "./raw_data/whf.json"
json_data <- fromJSON(file = file)
org_name <- json_data$id
# extract lists and the sublist
usernames <- json_data$twitter
following <- usernames$following
# create empty vector to populate
longitud = length(following)
names <- vector(length = longitud)
# loop to populate the empty vector with third element of the sub-list
for(i in 1:longitud){
names[i] <- following[[i]][3]
}
# create a data frame and change column name
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# create a data frame with the correct formatting ready to comparison
org_handles <- data.frame(paste("#", names_list$usernames, sep=""))
colnames(org_handles) <- "Twitter"
# load master file and select the needed columns
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv") %>%
select(Name, AKA, Twitter)
# merge data frames and present the results
org_list <- inner_join(psa_handles, org_handles)
length(org_list$Twitter)
length(usernames$following)
My first attempt is to include this code at the beginning:
files <- list.files()
for(f in files){
json_data <- fromJSON(file = f)
# the rest of the script for one file here
}
but I do not know how to write the code for the data frame, or even how to integrate both ideas (the working script and the loop over the file names). I took the idea from here.
The new code, after Álvaro Morales' answer, is the following:
library(rjson)
library(dplyr)
library(tidyverse)
archivos <- list.files("./raw_data/")
calculate_accounts <- function(archivos){
#load data from file
path <- paste("./raw_data/", archivos, sep = "")
json_data <- fromJSON(file = path)
org_name <- json_data$id
# extract lists and the sublist
usernames <- json_data$twitter
following <- usernames$following
# create empty vector to populate
longitud = length(following)
names <- vector(length = longitud)
# loop to populate the empty vector with third element of the sub-list
for(i in 1:longitud){
names[i] <- following[[i]][3]
}
# create a data frame and change column name
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# create a data frame with the correct formatting ready to comparison
org_handles <- data.frame(paste("#", names_list$usernames, sep=""))
colnames(org_handles) <- "Twitter"
# load master file and select the needed columns
psa_handles <- read_csv(file = "./psa_handles.csv") %>%
select(Name, AKA, Twitter)
# merge data frames and present the results
org_list <- inner_join(psa_handles, org_handles)
accounts_db_org <- length(org_list$Twitter)
accounts_total_org <- length(usernames$following)
}
table_psa <- map_dfr(archivos, calculate_accounts)
However, now there is an error when joining (Joining, by = "Twitter"): subscript out of bounds.
Links to 3 test files to put together in raw_data folder:
https://drive.google.com/file/d/1ilUHwLjgtZCzh0LneIJEhTryrGumDF1V/view?usp=sharing
https://drive.google.com/file/d/1KM3hRZ8DzgPMEsMFmwBdmMNHrPCttuaB/view?usp=sharing
https://drive.google.com/file/d/17cWXJ9ltGXZ6izkgJv0uyNwStrE95_OA/view?usp=sharing
Link to the master file to compare:
https://drive.google.com/file/d/11fOpYFFfHijhZl_CuWHKvkrI7edkpUNQ/view?usp=sharing
<<<<< UPDATE >>>>>
I am still trying to find the solution. I got the code to work and produce output in a valid format (a 180x3 data frame), but the columns that should hold the values of accounts_db_org and accounts_total_org show NA. When I check the values stored in those objects (for the last iteration), they are correct. So the output now has the right shape, but with NA instead of the numbers.
I am really close, but I cannot get the code to show the right numbers. My last attempt is:
library(rjson)
library(dplyr)
library(tidyverse)
archivos <- list.files("./raw_data", pattern = "json", full.names = TRUE)
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv", show_col_types = FALSE) %>%
select(Name, AKA, Twitter)
nr_archivos <- length(archivos)
psa_result <- matrix(nrow = nr_archivos, ncol = 3)
# loop for working with all files, one by one
for(f in 1:nr_archivos){
# load file
json_data <- fromJSON(file = archivos[f])
org_name <- json_data$id
# extract lists and the sublist
usernames <- json_data$twitter
following <- usernames$following
# empty vector
longitud = length(following)
names <- vector(length = longitud)
# loop to populate with the third element of each i item of the sublist
for(i in 1:longitud){
names[i] <- following[[i]][3]
}
# convert the list into a data frame
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# applying some format prior to comparison
org_handles <- data.frame(paste("#", names_list$usernames, sep=""))
colnames(org_handles) <- "Twitter"
# merge tables and calculate the results for each iteration
org_list <- inner_join(psa_handles, org_handles)
accounts_db_org <- length(org_list$Twitter)
accounts_total_org <- length(usernames$following)
# populate the matrix row by row
psa_result[f] <- c(org_name, accounts_db_org, accounts_total_org)
}
# create a data frame from the matrix and save the result
psa_result <- data.frame(psa_result)
write_csv(psa_result, file = "./outputs/cuentas_seguidas_en_psa.csv")
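A note on the NA columns described above: indexing the matrix with a single subscript, as in psa_result[f] <- ..., writes only to linear position f, which in a column-major matrix is the first column, and that matches the symptom. A minimal sketch of the likely fix, assuming everything else stays as it is:
# assign the whole row (all three columns), not a single element
psa_result[f, ] <- c(org_name, accounts_db_org, accounts_total_org)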
The subscript out of bounds error was caused by a json file with 0 records. That was fixed deleting the file.
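Instead of deleting the empty file, a length check also avoids the error: with zero records, 1:longitud evaluates to c(1, 0), so following[[1]] fails with subscript out of bounds. A minimal sketch, assuming the loop structure above:
# skip files with no followers instead of deleting them
if (length(following) == 0) next   # or iterate with for (i in seq_len(longitud))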
You can do it with purrr::map or purrr::map_dfr.
Is this what you are looking for?
archivos <- list.files("./raw_data", pattern = "json", full.names = TRUE)
# load master file and select the needed columns. This needs to be out of "calculate_accounts" because you only read it once.
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv") %>%
select(Name, AKA, Twitter)
# calculate accounts
calculate_accounts <- function(archivo){
json_data <- rjson::fromJSON(file = archivo)
org_handles <- json_data %>%
pluck("twitter", "following") %>%
map_chr("username") %>%
as_tibble() %>%
rename(usernames = value) %>%
mutate(Twitter = str_c("#", usernames)) %>%
select(Twitter)
org_list <- inner_join(psa_handles, org_handles)
org_list %>%
mutate(accounts_db_org = length(Twitter),
accounts_total_org = nrow(org_handles)) %>%
select(-Twitter)
}
table_psa <- map_dfr(archivos, calculate_accounts)
#output:
# A tibble: 53 x 4
Name AKA accounts_db_org accounts_total_org
<chr> <chr> <int> <int>
1 Association of American Medical Colleges AAMC 20 2924
2 American College of Cardiology ACC 20 2924
3 American Heart Association AHA 20 2924
4 British Association of Dermatologists BAD 20 2924
5 Canadian Psoriasis Network CPN 20 2924
6 Canadian Skin Patient Alliance CSPA 20 2924
7 European Academy of Dermatology and Venereology EADV 20 2924
8 European Society for Dermatological Research ESDR 20 2924
9 US Department of Health and Human Service HHS 20 2924
10 International Alliance of Dermatology Patients Organisations (Global Skin) IADPO 20 2924
# ... with 43 more rows
Unfortunately, the answer provided by Álvaro does not work as expected: the output repeats the same numbers next to different organisation names, which makes it really difficult to read. The number 20 is repeated 20 times, the number 11 eleven times, and so on. The information is there, but it is not usable without further processing.
I kept researching in the meantime and arrived at the following code. I finally got it to work, but the result had class "matrix" "array", which was really confusing. Fortunately, the last lines transpose the data, unlist the array and convert it into a matrix, which can then be turned into a data frame and manipulated as usual.
Maybe my explanation is not very useful, and since I am a newbie I am sure the code is far from elegant and optimised. Anyway, please review the code below:
library(purrr)
library(rjson)
library(dplyr)
library(tidyverse)
setwd("~/documentos/varios/proyectos/programacion/R/psa_twitter")
# Load data from files.
archivos <- list.files("./raw_data/json_files",
pattern = ".json",
full.names = TRUE)
psa_handles <- read_csv(file = "./raw_data/psa_handles.csv") %>%
select(Name, AKA, Twitter)
nr_archivos <- length(archivos)
calcula_cuentas <- function(a){
# Extract lists
json_data <- fromJSON(file = a)
org_aka <- json_data$id
org_meta <- json_data$metadata
org_name <- org_meta$company
twitter <- json_data$twitter
following <- twitter$following
# create an empty vector to populate
longitud = length(following)
names <- vector(length = longitud)
# loop to populate the empty vector with third element of the sub-list
for(i in 1:longitud){
names[i] <- following[[i]][3]
}
# create a data frame and change column name
names_list <- data.frame(sapply(names, c))
colnames(names_list) <- "usernames"
# Create a data frame with the correct formatting ready to comparison
org_handles <- data.frame(paste("#",
names_list$usernames,
sep="")
)
colnames(org_handles) <- "Twitter"
# merge tables
org_list <- inner_join(psa_handles, org_handles)
cuentas_db_org <- length(org_list$Twitter)
cuentas_total_org <- length(twitter$following)
results <- data.frame(Name = org_name,
AKA = org_aka,
Cuentas_db = cuentas_db_org,
Total = cuentas_total_org)
results
}
# apply function to list of files and unlist the result
psa <- sapply(archivos, calcula_cuentas)
psa1 <- t(as.data.frame(psa))
psa2 <- matrix(unlist(psa1), ncol = 4) %>%
as.data.frame()
colnames(psa2) <- c("Name", "AKA", "tw_int_outbound", "tw_ext_outbound")
# Save the results.
saveRDS(psa2, file = "rda/psa.RDS")
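Since calcula_cuentas() already returns a one-row data frame, a purrr::map_dfr() call should bind those rows directly and avoid the matrix/array/transpose juggling; a minimal sketch under that assumption:
library(purrr)
# one row per file, bound together; column names come from the data frame built in calcula_cuentas()
psa2 <- map_dfr(archivos, calcula_cuentas)
# rename afterwards if desired, e.g. colnames(psa2) <- c("Name", "AKA", "tw_int_outbound", "tw_ext_outbound")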
I am using the help of https://ipstack.com to geocode IP addresses and am having a difficult time trying to geocode all 1200 addresses in a short amount of time.
With R, I've collected the URLs into a list (e.g. http://api.ipstack.com/[IP address]?access_key=[access key]) and can use read_json to read the json data of each URL. But I've not been able to develop a loop to extract the data from each URL.
library(RCurl)
library(jsonlite)
x <- c("http://api.ipstack.com/178.140.119.217?access_key=[access_key]", "http://api.ipstack.com/68.37.21.125?access_key=[access_key]", "http://api.ipstack.com/68.10.255.89?access_key=[access_key]")
read_json(x)
Error in file(path) : invalid 'description' argument
I'm looking for a solution that will be able to read multiple IP addresses and then attach the information to a dataframe.
*Edit 1: Still stuck, but I'm making some progress with the loop:
library(RCurl)
library(jsonlite)
url_lst = as.character(df$URL)
output = NULL
for (i in url_lst) {
x = as.data.frame(read_json(i))
output = rbind(output,x)
}
However, this results in an error:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0
As well, the code only produces 8 observations rather than 1200.
*Edit 2: Bill Ash's answer got me further than I was, but it looks like some values in the JSON data aren't allowing the code to be successful.
Bill Ash's code:
library(httr)
library(tibble)
library(purrr)
library(jsonlite)
ip_addresses <- core_members$ip_address
# a simple function
ip_locate <- function(your_vector_of_ip_addresses, access_key) {
ip <- your_vector_of_ip_addresses
map_df(ip, ~{
out <- httr::GET(url = paste0("http://api.ipstack.com/", .,
"?access_key=", access_key))
resp <- fromJSON(httr::content(out, "text"), flatten = TRUE)
tibble::tibble(ip = resp$ip,
country = resp$country_name,
region = resp$region_name,
city = resp$city,
zip = resp$zip,
lat = resp$latitude,
lng = resp$longitude)
})
}
ip_info <- ip_locate(your_vector_of_ip_addresses = ip_addresses,
access_key = "[access_key]")
# output
ip_info %>%
head()
Where the error begins
ip_info <- ip_locate(your_vector_of_ip_addresses = ip_addresses,
access_key = "[access_key]")
Error: All columns in a tibble must be 1d or 2d objects:
* Column `zip` is NULL
9. stop(cnd)
8. abort(error_column_must_be_vector(names_x[is_xd], classes))
7. check_valid_cols(x)
6. lst_to_tibble(xlq$output, .rows, .name_repair, lengths = xlq$lengths)
5. tibble::tibble(ip = resp$ip, country = resp$country_name, region = resp$region_name,
   city = resp$city, zip = resp$zip, lat = resp$latitude, lng = resp$longitude)
4. .f(.x[[i]], ...)
3. map(.x, .f, ...)
2. map_df(ip, ~{
   out <- httr::GET(url = paste0("http://api.ipstack.com/",
   ., "?access_key=", access_key))
   resp <- fromJSON(httr::content(out, "text"), flatten = TRUE) ...
1. ip_locate(your_vector_of_ip_addresses = ip_addresses, access_key = "[access_key]")
Because I only need the coordinates from these IP addresses, I believe this has been resolved. Hopefully, someone is willing to continue advising on this issue, but I won't be updating this any further.
It looks like you can pay for bulk lookup as well.
From their documentation page:
Bulk IP Lookup
The ipstack API also offers the ability to request data for multiple IPv4 or IPv6 addresses at the same time. In order to process IP addresses in bulk, simply append multiple comma-separated IP addresses to the API's base URL.
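For what it's worth, a sketch of what that bulk request could look like from R, based only on the quoted documentation (the exact response shape and the plan requirements are assumptions here):
library(httr)
library(jsonlite)
ips <- c("178.140.119.217", "68.37.21.125", "68.10.255.89")
bulk_url <- paste0("http://api.ipstack.com/",
                   paste(ips, collapse = ","),      # comma-separated IPs appended to the base URL
                   "?access_key=", "[access_key]")
resp <- fromJSON(content(GET(bulk_url), "text"), flatten = TRUE)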
library(httr)
library(tibble)
library(purrr)
library(jsonlite)
# some ip addresses
ip_addresses <- c("178.140.119.217", "68.37.21.125", "68.10.255.89")
# a simple function
ip_locate <- function(your_vector_of_ip_addresses, access_key) {
ip <- your_vector_of_ip_addresses
map_df(ip, ~{
out <- httr::GET(url = paste0("http://api.ipstack.com/", .,
"?access_key=", access_key))
resp <- fromJSON(httr::content(out, "text"), flatten = TRUE)
tibble::tibble(ip = resp$ip,
country = resp$country_name,
region = resp$region_name,
city = resp$city,
zip = resp$zip,
lat = resp$latitude,
lng = resp$longitude)
})
}
# an example
ip_info <- ip_locate(your_vector_of_ip_addresses = ip_addresses,
access_key = "had to edit out my key")
# output
ip_info %>%
head()
# A tibble: 3 x 7
ip country region city zip lat lng
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 178.140.119.217 Russia Moscow Moscow 101001 55.8 37.6
2 68.37.21.125 United States Michigan Southgate 48195 42.2 -83.2
3 68.10.255.89 United States Virginia Chesapeake 23323 36.8 -76.3
Hope this helps.
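Regarding the error in Edit 2 (Column `zip` is NULL): a NULL-safe variant of the same function is sketched below, coercing missing fields to NA so tibble() does not fail. The field names are assumed to be unchanged from the answer above.
library(httr)
library(jsonlite)
library(purrr)
library(tibble)
ip_locate_safe <- function(your_vector_of_ip_addresses, access_key) {
  grab <- function(x) if (is.null(x)) NA else x   # coerce absent JSON fields to NA
  map_df(your_vector_of_ip_addresses, ~{
    out  <- GET(url = paste0("http://api.ipstack.com/", ., "?access_key=", access_key))
    resp <- fromJSON(content(out, "text"), flatten = TRUE)
    tibble(ip      = grab(resp$ip),
           country = grab(resp$country_name),
           region  = grab(resp$region_name),
           city    = grab(resp$city),
           zip     = grab(resp$zip),
           lat     = grab(resp$latitude),
           lng     = grab(resp$longitude))
  })
}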
I am endeavoring to combine files but find myself writing very redundant code, which is cumbersome. I have looked at the documentation but for some reason cannot find anything about how to do this.
Basically, I read the files in from my machine and then want to combine the exact same columns from each file (the only difference is the year).
Can you help?
I read the files from my machine ("C:/SAM/CODE1_2005.csv", then "C:/SAM/CODE1_2006.csv", then "C:/SAM/CODE1_2007.csv", and so on until 2016).
I then define the columns, all the same for each year I have downloaded, such as COLLEGESCORECARD05_A<-subset(COLLEGESCORECARD05, select=c(ï..UNITID,OPEID,OPEID6,INSTNM)) and so forth...
and then combine the files into one database.
The issue is that this seems inefficient. Is there a more efficient way?
You can make a list of the .csv files in the folder and then read them all into a single data frame with purrr::map_df. You can then add a column to tell the files apart:
library(tidyverse)
df <- list.files(path = "C:/SAM", pattern = "\\.csv$", full.names = TRUE) %>%
  purrr::map_df(function(x) readr::read_csv(x) %>%
                  mutate(filename = gsub("\\.csv$", "", basename(x))))
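If you only want the columns from your subset() step (UNITID, OPEID, OPEID6, INSTNM, per the question), a select() in the same pipeline keeps each year's file consistent. A sketch of that variant; note that readr::read_csv() normally strips the UTF-8 byte-order mark, so the ï.. prefix that read.csv() produced should not be needed:
df <- list.files(path = "C:/SAM", pattern = "\\.csv$", full.names = TRUE) %>%
  purrr::map_df(function(x) readr::read_csv(x) %>%
                  select(UNITID, OPEID, OPEID6, INSTNM) %>%   # columns from the question
                  mutate(filename = gsub("\\.csv$", "", basename(x))))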
At the risk of seeming icky for self-promotion, I wrote a function that does exactly this (desiderata::apply_to_files()):
# Apply a function to every file in a folder that matches a regex pattern
rain <- apply_to_files(path = "Raw data/Rainfall", pattern = "csv",
func = readr::read_csv, col_types = "Tiic",
recursive = FALSE, ignorecase = TRUE,
method = "row_bind")
dplyr::sample_n(rain, 5)
#> # A tibble: 5 x 5
#>
#> orig_source_file Time Tips mV Event
#> <chr> <dttm> <int> <int> <chr>
#> 1 BOW-BM-2016-01-15.csv 2015-12-17 03:58:00 0 4047 Normal
#> 2 BOW-BM-2016-01-15.csv 2016-01-03 00:27:00 2 3962 Normal
#> 3 BOW-BM-2016-01-15.csv 2015-11-27 12:06:00 0 4262 Normal
#> 4 BIL-BPA-2018-01-24.csv 2015-11-15 10:00:00 0 4378 Normal
#> 5 BOW-BM-2016-08-05.csv 2016-04-13 19:00:00 0 4447 Normal
In this case, all of the files have identical columns and order (Time, Tips, mV, Event), so I can just method = "row_bind" and the function will automatically add the filename as an extra column. There are other methods available:
"full_join" (the default) returns all columns and rows. "left_join" returns all rows from the first file, and all columns from subsequent files. "inner_join" returns rows from the first file that have matches in subsequent files.
Internally, the function builds a list of files in the path (recursive or not), runs an lapply() on the list, and then handles merging the new list of dataframes into a single dataframe:
apply_to_files <- function(path, pattern, func, ..., recursive = FALSE, ignorecase = TRUE,
method = "full_join") {
file_list <- list.files(path = path,
pattern = pattern,
full.names = TRUE, # Return full relative path.
recursive = recursive, # Search into subfolders.
ignore.case = ignorecase)
df_list <- lapply(file_list, func, ...)
# The .id arg of bind_rows() uses the names to create the ID column.
names(df_list) <- basename(file_list)
out <- switch(method,
"full_join" = plyr::join_all(df_list, type = "full"),
"left_join" = plyr::join_all(df_list, type = "left"),
"inner_join" = plyr::join_all(df_list, type = "inner"),
# The fancy joins don't have orig_source_file because the values were
# getting all mixed together.
"row_bind" = dplyr::bind_rows(df_list, .id = "orig_source_file"))
return(invisible(out))
}
I am trying to split a data frame and write each piece to a CSV file in R, using the unique values in one variable. I am new to R and I'm not entirely sure I know what I'm doing.
## trying to subset data
library(dplyr)
library(plyr)
#set the working directory
setwd("S:/some stuff")
## load the datafile into an object called data.
data <- read.csv("S:/some stuff/Area.csv",
header = TRUE, sep = ",")
#Create subsets of data by LA
LA<-subset(data,AREA == "LA")
My data frame has 2,500 observations and 20 variables.
My dataframe is called LA
The variable I'd like to split it by is called Disease
I found this: How to create multiple .csv files in R?
And reappropriated it accordingly,
from
plyr::d_ply(iris, .(Species), function(x) write.csv(x,
file = paste(x$Species, ".csv", sep = "")))
to
plyr::d_ply(LA, .(Disease), function(x) write.csv(x,
file = paste(LA$Disease, ".csv", )))
However....
Error in file(file, ifelse(append, "a", "w")) :
invalid 'description' argument
In addition: Warning message:
In if (file == "") file <- stdout() else if (is.character(file)) { :
There are two things I'd like to solve.
1) subsetting a dataframe
2) writing to a path
Ideally I'd like to loop through it from the import of data (the Area.csv file).
This has areas and disease. There are 12 areas and 20 diseases.
I would like to create csv files of each disease by area.
In this example Area = LA and then disease.
How can I step through using a loop to create the 20 different files for each area?
I thought this:
https://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/
mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")
#mpExpenses2012 is the large dataframe containing data for each MP
#Get the list of unique MP names
for (name in levels(mpExpenses2012$MP.s.Name)){
#Subset the data by MP
tmp=subset(mpExpenses2012,MP.s.Name==name)
#Create a new filename for each MP - the folder 'mpExpenses2012' should already exist
fn=paste('mpExpenses2012/',gsub(' ','',name),sep='')
#Save the CSV file containing separate expenses data for each MP
write.csv(tmp,fn,row.names=FALSE)
}
might be helpful, but it's writing to a path that's getting me down.
EDIT
library(tidyr)
library(purrr)
temp_dir <- tempfile()
dir.create(temp_dir)
LA %>%
nest(-FinalDiseaseForMonthlyAnalysis) %>%
pwalk(function(FinalDiseaseForMonthlyAnalysis, data) write.csv(data, file.path(temp_dir, paste0(FinalDiseaseForMonthlyAnalysis, ".csv"))))
list.files(temp_dir)
temp_dir
unlink(temp_dir, recursive = T)
This works. But now comes the "where are the files?" question. Yes: I get the temp file and then the unlink. But how do I save in a folder on S:/some stuff/?
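A minimal sketch, assuming the target folder S:/some stuff already exists: the same nest()/pwalk() pipeline works with any path, so pointing file.path() at that folder instead of temp_dir writes the CSVs there.
library(dplyr)
library(tidyr)
library(purrr)
out_dir <- "S:/some stuff"   # existing target folder from the question
LA %>%
  nest(data = -FinalDiseaseForMonthlyAnalysis) %>%
  pwalk(function(FinalDiseaseForMonthlyAnalysis, data)
    write.csv(data, file.path(out_dir, paste0(FinalDiseaseForMonthlyAnalysis, ".csv")), row.names = FALSE))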
EDIT FINAL: SOLVED
I've read that in R everything is a list, and I found a way to split by two columns to do what I needed. Annoyingly, it's linked in the comments here:
https://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/
and I missed it.
I was also having problems generating a directory using dir.create. Who knew that dir.create needs recursive = TRUE when the parent folders don't exist yet? I DO NOW.
Anyway. here's what I did:
## trying to subset data
# generate data:
library(tidyr)
library(purrr)
library(dplyr)
library(write)
## set working directory
setwd("S:/somestuff")
#create the directories - pretty sure there's a way to avoid doing this long hand
dir.create("S:/somestuff/CSV source files", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA1", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA2", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA3", recursive = TRUE)
#Read in the CSV
DF = read.csv("S:/somestuff/CSV source files/ALL.csv",
header = TRUE, sep = ",")
glimpse(DF)
#This splits the dataframe generated above (DF) and calls it DF4
DF4 <- split(DF,list(DF$LA,DF$FinalDiseaseForMonthlyAnalysis))
lapply(names(DF4), function(name) write.csv(DF4[[name]], file = paste("S:/somestuff/CSV source files/",gsub('','',name),sep = ''), row.names = F))
I'm guessing if I read in the dataframe I could then use dir.create to create paths from the names in LA in the data frame.
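A small sketch of that idea, assuming DF and the base folder from above: dir.create() can be called once per unique LA value.
# create one output folder per LA value found in the data
for (la in unique(DF$LA)) {
  dir.create(file.path("S:/somestuff/CSV source files", la),
             recursive = TRUE, showWarnings = FALSE)
}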
This was really helpful to me! Thanks!! I tried to simplify the crux of the matter.
library(tidyverse)
library(reprex)
states4 <- tribble(~state,~name,~area,
"AL","Alabama",50645.3242,
"AZ","Arizona",113594.0781,
"AR","Arkansas",52035.4727,
"CA","California",155779.2031
)
chain4 <- states4 %>% split(.$state)
map(names(chain4),function(stateabbrev){write_csv(chain4[[stateabbrev]],paste0("~/Downloads/","testtoken_",stateabbrev,".csv"))})
#> [[1]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AL Alabama 50645.
#>
#> [[2]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AR Arkansas 52035.
#>
#> [[3]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AZ Arizona 113594.
#>
#> [[4]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 CA California 155779.
list.files(path="~/Downloads", pattern = "testtoken.*csv")
#> [1] "testtoken_AL.csv" "testtoken_AR.csv" "testtoken_AZ.csv"
#> [4] "testtoken_CA.csv"
reprex()
Created on 2019-10-02 by the reprex package (v0.3.0)
In the end I used:
## trying to subset data
# generate data:
library(tidyr)
library(purrr)
library(dplyr)
library(stringr)
library(plyr)
library (car)
## set working directory
setwd("S:/Somestuff/Borough profile maps/Working")
## read data in from geocoded file
geocoded<-read.csv("geocoded 2015 - 2018.csv",na.strings=c(""," ","N/A"))
str(geocoded)
str(geocoded$GENDER)
levels(geocoded$LA)
#split geocoded data by LA
x <-split(geocoded,geocoded$LA)
str(x)
#Split geocoded data by LA and Final
#split(x, f, drop = FALSE, sep = ".", lex.order = FALSE, .)
y<-split(geocoded,list(geocoded$Final,geocoded$LA), drop = TRUE, sep = "_")
str(y)
#create dir and then write CSV files of geocoded to file locations
dir.create("S:/Somestuff/Borough profile maps/Working/TEST/",, recursive = TRUE)
dir.create("S:/Somestuff/Borough profile maps/Working/TEST/TEST2",, recursive = TRUE)
lapply(names(x), function(name) write.csv(x[[name]], file = paste('S:/Somestuff/Borough profile maps/Working/TEST/',gsub(' ','',name),sep = ''), row.names = F))
lapply(names(y),function(name) write.csv(y[[name]], file = paste('S:/Somestuff/Borough profile maps/Working/TEST/TEST2/',name,".csv")))
The problem was that in my original code, you'll notice, I was using read.csv BUT feeding in a .txt file. I changed the file to .csv and BANG, it worked. First time.
I realise that you don't need all the libraries I called at the beginning, but they're left in from my ridiculous number of attempts.
After returning to the problem: it's much easier in recent versions of dplyr (0.8 and later, which added group_walk)
DF %>%
  group_by(LA, FinalDiseaseForMonthlyAnalysis) %>%
  group_walk(~ readr::write_csv(.x, paste0(.y$LA, .y$FinalDiseaseForMonthlyAnalysis, ".csv")))