Importing data using readxl - r

Issue description
I'm trying to iterate over multiple sheets in a spreadsheet, picking up the first row as column names and rows 11+ as data, and import them as a single dataframe. I'm having trouble because there are 10 header rows in each sheet and I don't seem to be able to aggregate the sheets without losing data.
Data
The file in question is found at Table 6 on this page of the ABS website.
My attempt
The first chunk does the heavy lifting of getting the data into R. The map function naturally results in a list of lists, containing the data found in the sheets whose names contain the text "Data" (done this way because every one of these spreadsheets has two sheets containing irrelevant info).
BUT I want the output in a dataframe, so I tried the map_df function, but all data from sheets after the first is imported as NA values (incorrect).
library(tidyverse)
library(readxl)   # excel_sheets() and read_excel() are not attached by library(tidyverse)
library(stringr)

# path <- "abs_table6.xlsx"   # hypothetical path to a local copy of the workbook

df1 <- path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map(read_excel, path = path, skip = 9)
The second chunk picks up the column names in each of the sheets so that they can be applied to df1.
nms <- path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map_df(read_excel, path = path, n_max = 0, col_names = TRUE) %>%
  mutate(date = 1) %>%
  select(date, everything())
names(df1) <- names(nms)
If anyone could show me how to import the data into a single dataframe without the NAs, that would be great. Bonus points for showing me how to do it in a single step, without needing the second chunk to name the columns.

Not exactly sure what you're looking for, but if you want to read all the "Data" sheets in that workbook while skipping the first 9 rows, you just need to stitch them all together through a reduce using left_join, which gets rid of the NA values.
df1 <- path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map(~ read_excel(path, sheet = .x, skip = 9)) %>%
  reduce(left_join, by = "Series ID")
If you want to keep the original header names:
path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map(~ read_excel(path, sheet = .x, col_names = FALSE) %>%
        set_names(., c("Series ID", unlist(.[1, 2:ncol(.)]))) %>%
        slice(-1:-10)) %>%
  reduce(left_join, by = "Series ID") %>%
  mutate_at(vars(-`Series ID`), as.numeric)
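Side note: mutate_at() is superseded in current dplyr releases; an across() version of that last step behaves the same. A small sketch on a hypothetical data frame:

library(dplyr)

# convert every column except `Series ID` to numeric, across() style
df <- tibble(`Series ID` = c("A1", "A2"), x = c("1", "2"), y = c("3", "4"))
df %>% mutate(across(-`Series ID`, as.numeric))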

Related

Avoid repeated column headers while importing MS-Excel sheets

Based on the question below, I have a follow-up question in reference to the following two similar solutions. What if the headers in the sheets are the same and you don't want the headers to be repeated as rows from sheet 2 onwards?
Purpose: Combining all sheets into one dataframe.
I am guessing one will have to match the headers from sheet 2 and so on with the headers of the first sheet and remove them if found.
I guess the most straightforward way would be to delete the header rows from sheet 2 and so on before importing them into R.
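A minimal base-R sketch of that first guess, assuming the stacked result is in a data frame called combined (a hypothetical name) and every stray header row repeats the column names exactly:

# drop any data row whose values merely repeat the column names;
# apply() coerces each row to character, so the comparison is type-safe,
# and isTRUE() guards against rows containing NA
deduped <- combined[!apply(combined, 1,
                           function(r) isTRUE(all(r == names(combined)))), ]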
Read all worksheets in an Excel workbook into an R list with data.frames
Solutions for combining sheets:
library(readxl)
library(purrr)
library(dplyr)   # for bind_rows()

# Solution 1: all sheets in one df
final_df <- bind_rows(
  path %>%
    excel_sheets() %>%
    set_names() %>%
    map(read_excel, path = path)
)

# Solution 2: concatenate, keeping the sheet name as an id column
data <- path %>%
  excel_sheets() %>%
  set_names() %>%
  map_df(~ read_excel(path = path, sheet = .x), .id = "Sheet")

How to check in R if the name of the list element contains "this text" in it and pass to the next element in a for loop?

I'm new to R and have a large list of 30 elements, each of which is a dataframe that contains a few hundred rows and around 20 columns (this varies between dataframes). Each dataframe is named after the original .csv filename (for example "experiment data XYZ QWERTY 01"). How can I check through the whole list and filter only those dataframes that don't have specific text in their filename, AND also add a unique id column to those filtered dataframes (the id value would be the first three characters of the filename)? For example, all the elements/dataframes/files in the list whose names include "XYZ QWERTY" won't be filtered and don't need a unique id. I had this pseudo-style code:
for (i in 1:length(list_of_dataframes)) {
  if
    list_of_dataframes[[i]] contains "this text" then don't filter
  else
    list_of_dataframes[[i]] <- filter(list_of_dataframes[[i]], rule) AND add unique.id.of.first.three.char.of.list_of_dataframes[[i]]
}
Sorry if the terminology used here is a bit awkward; I'm just starting out with programming and this is my first time posting here, so there's still a lot to learn (as a bonus, if you have any good resources/websites for learning to automate and do similar stuff with R, I would be more than glad to get some recommendations! :-))
EDIT:
The code I tried for the filtering part was:
for (i in 1:length(tbl)) {
  if (!(str_detect(tbl[[i]], "OLD"))) {
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}
However, there were error messages stating "argument is not an atomic vector; coercing" and "the condition has length > 1 and only the first element will be used". Is there any way to get this code working?
Let there be a directory called files containing these csv files:
'experiment 1.csv' 'experiment 2.csv' 'experiment 3.csv'
'OLDexperiment 1.csv' 'OLDexperiment 2.csv'
This will give you a list of data frames with a filter condition (here: do not contain the substring OLD in the filename). Just remove the ! to only include old experiments instead. A new column id is added containing the file path:
library(tidyverse)

list.files("files")                             # peek at the file names
paths <- list.files("files", full.names = TRUE)
names(paths) <- paths                           # keep the paths as element names
list_of_dataframes <- paths %>% map(read_csv)
list_of_dataframes %>%
  enframe() %>%
  filter(! name %>% str_detect("OLD")) %>%
  mutate(value = map2(name, value, ~ .y %>% mutate(id = .x))) %>%
  pull(value)
A good resource to start with is the free book R for Data Science.
This is a much simpler approach without a list to get one big combined table of files matching the same condition:
list.files("files", full.names = TRUE) %>%
tibble(id = .) %>%
# discard old experiments
filter(! id %>% str_detect("OLD")) %>%
# read the csv table for every matching file
mutate(data = id %>% map(read_csv)) %>%
# combine the tables into one big one
unnest(data)
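If you'd rather keep your original for loop, one likely fix (a guess at the intent, assuming tbl is a named list as in your description) is to test the element's name rather than the data frame itself, which is what triggered the coercion warning:

library(dplyr)
library(stringr)

for (i in seq_along(tbl)) {
  # run str_detect() on the element *name*, not on the data frame
  if (!str_detect(names(tbl)[i], "OLD")) {
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}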

Making a loop in R for counting word frequencies from specific columns in different tables

I have 15 different tables, each of which contains a "text" column with a long text (a series of answers to a poll question). I want to tidy the tables by creating a row for each word of "text" in a column named "word". Then I want to know the word frequencies for each table. I wrote this piece of code:
Table1.tidy <- Table1 %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  Table1.tidy %>%
  count(word, sort = TRUE)
It works fine, but now I'd like to avoid repeating this code for each table. Does anyone know how?
(1) Put all of your data.frames into a list.
(2) Use purrr's map function to apply your workflow:
library(dplyr)
library(tidyr)
library(purrr)
library(tidytext)   # unnest_tokens() and stop_words live here

my_list <- list(Table1, Table2, Table3)

my_tidy_list <- my_list %>%
  map(~ .x %>%
        unnest_tokens(word, text) %>%
        anti_join(stop_words) %>%
        # Table1.tidy %>% # I think this line is a mistake?
        count(word, sort = TRUE))
my_tidy_list[[1]] returns Table1.tidy, my_tidy_list[[2]] returns Table2.tidy, etc.
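If you prefer addressing results by name rather than position, one small optional step (a sketch; the element names are assumptions matching the input tables) is to name the list:

# name the elements so results can be addressed as my_tidy_list[["Table1"]]
my_tidy_list <- set_names(my_tidy_list, c("Table1", "Table2", "Table3"))
my_tidy_list[["Table1"]]   # same as my_tidy_list[[1]]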

load JSON data into a dataframe

I am a beginner working with R and especially JSON files, and this is probably a simple question but I have been unsuccessful for a while.
Here is a sample row of data from a provided text file (there are ~4000 rows):
{"040070005001":4,"040070005003":4,"040138101003":4,"040130718024":4}
Each row has a variable number of values in the string.
I am trying to use a loop, but it only keeps the last row of the data set rather than capturing the data from each row.
for (row in 1:nrow(origins)) {
  json <- origins$home_cbgs[row] %>%
    fromJSON() %>%
    unlist() %>%
    as.data.frame() %>%
    rownames_to_column() %>%
    rename(
      origin_census_block_group = "rowname",
      origin_visitors = "."
    )
}
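The loop overwrites json on every pass, so only the last row survives. A minimal sketch of one way to keep every row, assuming origins$home_cbgs holds the JSON strings as above, is to map over the rows and row-bind the results:

library(jsonlite)
library(tibble)
library(purrr)

# parse each JSON string into a named vector, turn it into a two-column
# tibble, and row-bind everything into one long data frame
json_long <- origins$home_cbgs %>%
  map_dfr(~ enframe(unlist(fromJSON(.x)),
                    name  = "origin_census_block_group",
                    value = "origin_visitors"))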

How to use purrr with dplyr to filter list elements and export lists into Excel

I'm fairly new to working with lists in R and have a quick question that also involves using purrr. Below are two small sample data frames as an example.
Client1 <- c("John","Chris","Yutaro","Dean","Andy")
Animals <- c("Cat","Cat","Dog","Rat","Bird")
Living <- c("House","Condo","Condo","Apartment","House")
Data1 <- data.frame(Client1,Animals,Living)
Client1 <- c("John","Chris","Yutaro","Dean","Andy")
Animals2 <- c("Cat","Dog","Dog","Rat","Cat")
Living2 <- c("House","Apartment","Apartment","Family","Apartment")
Data2 <- data.frame(Client1,Animals2,Living2)
Bonus if you can include how to rename list elements at once instead of using the two lines below:
names(Data1)[1:3] <- c("Client","Animals","Living")
names(Data2)[1:3] <- c("Client","Animals","Living")
So next I want to filter each data frame by Animals and then export each into a spreadsheet, using the two lines of code below:
Data1 %>% filter(Animals=="Cat") %>% write.csv(.,file="Data1.csv")
Data2 %>% filter(Animals=="Cat") %>% write.csv(.,file="Data2.csv")
However, to be more efficient I can join both data frames into a list and use purrr to filter each at the same time.
DataList <- list(Data1,Data2)
DataList %>% map(~filter(.,Animals=="Cat"))
For the above code, I would need a separate ~filter line for each animal, so I'm not sure if there's a more efficient way that avoids writing many different lines of code while still using purrr and dplyr?
Also, how do I use write.csv with purrr? I can export the list into one spreadsheet, but I'm not sure how to break up the list so that it exports properly. Alternatively, I could export each list element into a separate spreadsheet. It would be great to see a solution for both of these situations.
If I understand your question correctly, you want to write a separate file for each of the Animals of both the data frames:
DataList <- list(Data1, Data2)

library(purrr)

a <- DataList %>%
  map(function(x) {
    colnames(x) <- c("Client", "Animals", "Living")
    x
  }) %>%
  map(function(x) split(x, x$Animals)) %>%
  flatten()

names(a) <- paste0("Data", 1:length(a))

lapply(1:length(a), function(x)
  write.csv(a[[x]], file = paste0(names(a[x]), ".csv"), row.names = FALSE))
We first dump both the data frames in DataList, then rename the columns for both the data frames with the first map, then split both the data frames by Animals, and finally flatten the nested list.
I wish I could do this without breaking the chain, but I couldn't find another way.
From here, we first rename the elements of the list, then use lapply to loop over all the elements in the list and apply write.csv on each of them.
You mentioned Excel: you can just as easily replace write.csv with any of the functions for writing Excel files from R.
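For instance, a small sketch with the writexl package (one option among several; it is not loaded above):

library(writexl)

# same loop as above, but writing .xlsx files instead of .csv
lapply(seq_along(a), function(i)
  write_xlsx(a[[i]], path = paste0(names(a)[i], ".xlsx")))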
Here is one option, involving binding the two datasets together before re-splitting.
library(purrr)
library(dplyr)

DataList %>%
  map(~ setNames(.x, c("Client", "Animals", "Living"))) %>%
  setNames(c("Data1", "Data2")) %>%
  bind_rows(.id = "id") %>%
  split(list(.$id, .$Animals), drop = TRUE) %>%
  map(~ select(.x, -id) %>%
        write.csv(file = paste0(unique(.x$id), unique(.x$Animals), ".csv"),
                  row.names = FALSE))
The first map line shows how to rename the columns of all the datasets in a list at once via setNames.
DataList %>%
  map(~ setNames(.x, c("Client", "Animals", "Living")))
I then set the names of the datasets in the list via setNames. While stacking the datasets together into a single data.frame via dplyr's bind_rows, these names are added as a new column, id.
setNames(c("Data1", "Data2")) %>%
bind_rows(.id = "id")
The last step is to split the combined data.frame by id and Animal before writing each split into a separate csv file. Information is pulled out of the dataset for naming the individual files by dataset and animal (this was the reason to name the elements of DataList). I removed the id variable via select prior to writing the files, as it may be extraneous to your needs.
split(list(.$id, .$Animals), drop = TRUE) %>%
  map(~ select(.x, -id) %>%
        write.csv(file = paste0(unique(.x$id), unique(.x$Animals), ".csv"),
                  row.names = FALSE))
This can all be done without putting these into a single data.frame, but I had trouble with naming the files at the end.
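For completeness, here is a sketch of the no-combine route, with the names supplied up front so the file naming works (same assumed column and dataset names as above):

library(purrr)

DataList %>%
  map(~ setNames(.x, c("Client", "Animals", "Living"))) %>%
  setNames(c("Data1", "Data2")) %>%
  # iterate over each named dataset, splitting by animal and writing
  # one file per dataset/animal pair, named from the list element (.y in
  # the inner iwalk is the animal, nm the dataset name)
  iwalk(function(df, nm) {
    df %>%
      split(.$Animals) %>%
      iwalk(~ write.csv(.x, file = paste0(nm, .y, ".csv"), row.names = FALSE))
  })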