Avoid repeated column headers while importing MS-Excel sheets in R

Based on the question below, I have a follow-up question about the following two similar solutions. What if the headers in the sheets are the same and you don't want the headers repeated as rows from sheet 2 onwards?
Purpose: combining all sheets into one data frame.
I am guessing one would have to match the headers of sheet 2 onwards against the headers of the first sheet and remove them if found.
The most straightforward way would be to delete the header rows from sheet 2 onwards before importing them into R; a sketch of an in-R alternative follows the solutions below.
Read all worksheets in an Excel workbook into an R list with data.frames
Solutions for combining sheets:
library(readxl)
library(purrr)
library(dplyr)  # for bind_rows()

# Solution 1: read every sheet, then bind all into one data frame
final_df <- bind_rows(path %>%
  excel_sheets() %>%
  set_names() %>%
  map(read_excel, path = path))

# Solution 2: concatenate, recording the sheet name in a "Sheet" column
data <- path %>%
  excel_sheets() %>%
  set_names() %>%
  map_df(~ read_excel(path = path, sheet = .x), .id = "Sheet")
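If the header rows really do arrive as data rows from sheet 2 onwards, you can drop them after binding instead of editing the workbook. A minimal sketch, assuming the repeated header text shows up in the first column and matches that column's name (col1 is a hypothetical stand-in for your real first header):
library(dplyr)
# Drop any row whose first column just repeats the header text
clean_df <- final_df %>%
  filter(col1 != "col1")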

Related

Appending a year from csv files to their corresponding data in a data frame

I have 12 csv files with names filename_2009.csv, filename_2010.csv, and so on up to filename_2021.csv. Each file contains similar data. I have merged all the data into one data frame in R. However, I need to add a column indicating which year each row's data comes from, without modifying the original csv files. How do I do it?
Sorry, I am new to R, so this may be a stupid question.
You could create a new column with the original filename and then extract the year, e.g.
library(tidyverse)
list.files(path = "Xxxxxx", full.names = TRUE) |>
  set_names() |>
  map_dfr(read_csv, .id = "filename") |>
  # str_extract() (not str_extract_all(), which returns a list-column)
  # pulls the four-digit year out of the filename
  mutate(file_year = stringr::str_extract(filename, "\\d{4}"))
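str_extract() leaves the year as text; wrapping it in as.integer() converts it, and basename() guards against stray digits in the directory path (a sketch of the last step only):
mutate(file_year = as.integer(stringr::str_extract(basename(filename), "\\d{4}")))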

How to check in R if the name of the list element contains "this text" in it and pass to the next element in a for loop?

I'm new to R and have a large list of 30 elements, each of which is a data frame with a few hundred rows and around 20 columns (this varies by data frame). Each data frame is named after the original .csv filename (for example "experiment data XYZ QWERTY 01"). How can I go through the whole list and filter only those data frames that don't have a specific text in their filename, and also add a unique id column to the filtered data frames (the id value being the first three characters of the filename)? For example, all elements/data frames/files in the list whose name includes "XYZ QWERTY" would not be filtered and don't need a unique id. I had this pseudo-style code:
for (i in 1:length(list_of_dataframes)) {
  if (list_of_dataframes[[i]] contains "this text") {
    # don't filter
  } else {
    list_of_dataframes[[i]] <- filter(list_of_dataframes[[i]], rule)
    # AND add unique id = first three characters of the name of list_of_dataframes[[i]]
  }
}
Sorry if the terminology used here is a bit awkward; I'm just starting out with programming and this is my first time posting here, so there's still a lot to learn. (As a bonus, if you have any good resources/websites for learning to automate similar tasks with R, I would be more than glad to get some recommendations! :-))
EDIT:
The code I tried for the filtering part was:
for (i in 1:length(tbl)) {
  if (!(str_detect(tbl[[i]], "OLD"))) {
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}
However, I got the errors "argument is not an atomic vector; coercing" and "the condition has length > 1 and only the first element will be used". Is there any way to get this code working?
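Those errors come from applying str_detect() to the data frame itself rather than to its name. A minimal fix for the loop, assuming the list elements are named after the files (names(tbl)):
library(dplyr)
library(stringr)

for (i in seq_along(tbl)) {
  # test the element's *name*, not the data frame's contents
  if (!str_detect(names(tbl)[i], "OLD")) {
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}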
Let there be a directory called files containing these csv files:
'experiment 1.csv' 'experiment 2.csv' 'experiment 3.csv'
'OLDexperiment 1.csv' 'OLDexperiment 2.csv'
This will give you a list of data frames that satisfy a filter condition (here: the filename does not contain the substring OLD). Just remove the ! to include only the old experiments instead. A new column id is added containing the file path:
library(tidyverse)

paths <- list.files("files", full.names = TRUE)
names(paths) <- paths
list_of_dataframes <- paths %>% map(read_csv)

list_of_dataframes %>%
  enframe() %>%
  filter(!name %>% str_detect("OLD")) %>%
  mutate(value = name %>% map2(value, ~ {
    .y %>% mutate(id = .x)
  })) %>%
  pull(value)
A good resource to start with is the free book R for Data Science.
This is a much simpler approach that skips the list and yields one big combined table of the files matching the same condition:
list.files("files", full.names = TRUE) %>%
tibble(id = .) %>%
# discard old experiments
filter(! id %>% str_detect("OLD")) %>%
# read the csv table for every matching file
mutate(data = id %>% map(read_csv)) %>%
# combine the tables into one big one
unnest(data)

R: import Excel sheets from a file only if the sheet contains a specific value in a specific column

I am importing multiple similarly structured sheets from a single Excel file into R, but would like to know how to adapt the code below so it only imports those sheets that contain (among other values) a certain value in a specific column (column = sport, value = football):
excel_sheets("mydata") %>%
  map_df(~ read_xlsx("mydata", .x))
Is it possible to adapt my code to do this?
This will include a sheet only if its sport column has at least one value of 'football'; map_df() silently drops the NULL returned for non-matching sheets.
library(readxl)
library(purrr)

excel_sheets("mydata") %>%
  map_df(~ {
    tmp <- read_xlsx("mydata", .x)
    # return the sheet only when it contains at least one 'football' row
    if (any(tmp$sport %in% 'football')) tmp
  })
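If you also want to record which sheet each row came from, a variant of the same idea (a sketch, assuming the same mydata workbook and sport column):
library(readxl)
library(purrr)
library(dplyr)

excel_sheets("mydata") %>%
  set_names() %>%
  map(~ read_xlsx("mydata", .x)) %>%
  keep(~ any(.x$sport %in% "football")) %>%  # keep only matching sheets
  bind_rows(.id = "Sheet")                   # sheet name lands in a "Sheet" column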

Using ldply to combine multiple csv files AND add a column with file names via mutate/basename

I'm trying to apply the code here, which uses ldply to combine multiple csv files into one data frame.
I'm trying to figure out the appropriate tidyverse syntax for adding a column that lists the name of the file the data comes from.
Here's what I have:
test <- ldply(.data = list.files(pattern = "*.csv"),
              .fun = read.csv,
              header = TRUE) %>%
  mutate(filename = gsub(".csv", "", basename(x)))
I get "Error in basename(x) : object 'x' not found".
My understanding is that basename() takes a path, but when I set the path to the folder containing the files, the filename column that ends up getting added just has the folder name.
Any help is much appreciated!
You could use purrr::map_dfr:
purrr::map_dfr(list.files(pattern = "*.csv", full.names = TRUE),
               ~ read.csv(.x) %>%
                 mutate(file = sub(".csv$", "", basename(.x))))
We can use imap:
library(purrr)
library(dplyr)
library(stringr)
library(readr)

files <- list.files(pattern = "*.csv", full.names = TRUE)
fileSub <- str_remove(basename(files), "\\.csv$")

imap_dfr(setNames(files, fileSub), ~ read_csv(.x) %>%
           mutate(file = .y))
I don't know if this helps anyone, but I stumbled across a very simple solution.
Context: the .id column created by ldply lists the name of each item in your input vector. So, to combine multiple csv files and create a new column with the file name, you can do:
library(plyr)  # for ldply()

# Get the csv files in the current working directory as a character vector
file_names <- list.files(pattern = "*.csv")  # for the example above it is .data = list.files(pattern="*.csv")

# Name the items (here equal to the items themselves, but these can be subbed out for sample IDs)
names(file_names) <- file_names

# Then let ldply do the hard work
combined_csv <- ldply(file_names, read.csv)

# The names are stored under .id
combined_csv$.id
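If you'd rather have that column called filename with the extension stripped, a small follow-up (a sketch using dplyr; the column names are hypothetical):
library(dplyr)
combined_csv <- combined_csv %>%
  rename(filename = .id) %>%
  mutate(filename = sub("\\.csv$", "", filename))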

Importing data using readxl

Issue description
I'm trying to iterate over multiple sheets in a spreadsheet, picking up the first row as column names and rows 11+ as data, and to import them as a single data frame. I'm having trouble because there are 10 header rows in each sheet and I don't seem to be able to aggregate the sheets without losing data.
Data
The file in question is found at Table 6 on this page of the ABS website.
My attempt
The first chunk does the heavy lifting of getting the data into R. The map function naturally results in a list of lists containing the data found in the sheets whose names contain the text "Data" (done this way because every one of these spreadsheets has two sheets of irrelevant info).
BUT I want the output in a data frame, so I tried the map_df function; however, all data from sheets after the first is imported as NA values (incorrect).
library(tidyverse)
library(readxl)  # excel_sheets() and read_excel() are not attached by tidyverse

df1 <- path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map(read_excel, path = path, skip = 9)
The second chunk picks up the column names in each of the sheets so that they can be applied to df1.
nms <- path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map_df(read_excel, path = path, n_max = 0, col_names = TRUE) %>%
  mutate(date = 1) %>%
  select(date, everything())

names(df1) <- names(nms)
If anyone could show me how to import the data without the NAs into a single data frame, that would be great. Bonus points for showing me how to do it in a single step, without needing the second chunk to name the columns.
Not exactly sure what you're looking for, but if you want to read all the sheets in that workbook while skipping the first 9 rows, you just need to stitch them all together through a reduce() using left_join() to get rid of the NA values:
df1 <- path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map(~ read_excel(path, .x, skip = 9)) %>%  # `path`, not `file`: the workbook path defined above
  reduce(left_join, by = "Series ID")
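Note this assumes every "Data" sheet carries a Series ID column with matching, unique keys; if a key appears more than once in a sheet, left_join() will duplicate rows.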
If you want to keep the original header names:
path %>%
  excel_sheets() %>%
  str_subset("Data") %>%
  map(~ read_excel(path, .x, col_names = FALSE) %>%  # again `path`, not `file`
        set_names(., c("Series ID", .[1, 2:ncol(.)])) %>%
        slice(-1:-10)) %>%
  reduce(left_join, by = "Series ID") %>%
  mutate_at(vars(-`Series ID`), as.numeric)
