Read a set of files into a matrix in R

I am trying to read a set of tab-separated files into a matrix or data.frame. For each file I need to extract one column and then concatenate all the columns into a single matrix, keeping both column and row names.
I am using the tidyverse (and I am terrible at it). I successfully get the column names, but I lose the row names at the very last stage of processing.
library("purrr")
library("tibble")
samples <- c("a","b","c","d")
a <- samples %>%
purrr::map_chr(~ file.path(getwd(), TARGET_FOLDER, paste(., "tsv", sep = "."))) %>%
purrr::map(safely(~ read.table(., row.names = 1, skip = 4))) %>%
purrr::set_names(rownames(samples)) %>%
purrr::transpose()
is_ok <- a$error %>% purrr::map_lgl(is_null)
x <- a$result[is_ok] %>%
  purrr::map(~ {
    v <- .[, 1]
    names(v) <- rownames(.)
    v
  }) %>%
  as_tibble(rownames = NA)
The x data.frame has the correct colnames but lacks rownames. All the elements of the a$result list have the same rownames, in exactly the same order. I am aware of tricks like rownames(x) <- rownames(a$result[[1]]), but I am looking for a more consistent solution.

It turned out that the solution was easier than expected: using as.data.frame instead of the final as_tibble call solved it.
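For reference, a minimal sketch of the corrected final step, reusing a and is_ok from the question; only the last call changes, since as.data.frame() keeps the vector names as rownames while as_tibble() drops them:

library(purrr)

x <- a$result[is_ok] %>%
  purrr::map(~ {
    v <- .[, 1]
    names(v) <- rownames(.)
    v
  }) %>%
  as.data.frame() # keeps the names of each vector as rownames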

Related

multiple kableExtra::column_spec based on number of variables

I want to reproduce the figure below for a data frame with any number of columns (assuming all columns have the same format).
For example, I have a data frame where each cell is a list containing numeric values:
# data frame containing the data
df <- data.frame(YEAR = 1980:1990) %>%
  tibble::as_tibble()
vars <- c("a", "b", "c")
df["a"] <- list(list(rnorm(100)))
df["b"] <- list(list(rnorm(100)))
df["c"] <- list(list(rnorm(100)))
I then create a table
# data frame used to create the table
newdf <- data.frame(YEAR = 1980:1990) %>%
  tibble::as_tibble()
newdf[vars] <- ""
# create table
kableExtra::kbl(newdf,
                col.names = c("YEAR", vars),
                caption = paste0("Title"),
                escape = FALSE) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
  kableExtra::column_spec(2, image = kableExtra::spec_hist(df$a)) %>%
  kableExtra::column_spec(3, image = kableExtra::spec_hist(df$b)) %>%
  kableExtra::column_spec(4, image = kableExtra::spec_hist(df$c))
It looks something like this (screenshot not included), and it all works great.
However, in reality I have a data frame whose number of columns to be plotted by kableExtra varies (it is created from user inputs), and I can't work out how to handle this, since in the example above the column_spec call has to be repeated for each column. So I need a way to generate the table for a data frame of variable size.
This seems to be compounded by the use of the pipe operator.
I have looked at piping a function, but the function still has the same problem of piping a variable number of sequential commands.
Any help greatly appreciated.
You can format multiple columns at once with a purrr::reduce call, setting the .init argument to the table. That way, column_spec can be applied to any number of columns in an elegant way.
The call will look like:
reduce(columns, column_spec, [column_spec arguments], .init = table)
The reduce will call column_spec(table, columns[1], [column_spec arguments]), then send that output (call it modified_table) to column_spec(modified_table, columns[2], [column_spec arguments]), and so on.
Here's some example code. Sorry, I tried to create a reprex, but I can't get it to work with the HTML tables.
library(tidyverse)
library(kableExtra)

df <- data.frame(a = 1:10, b = 1:10, c = 1:10)
which_col <- c("b", "c") # which columns to format in the reduce()

df %>%
  kbl() %>%
  reduce(
    which(names(df) %in% which_col), # column_spec wants a vector of column indices
    column_spec,
    bold = TRUE, # this is a ... argument, which gets passed on to column_spec
    .init = .
  )
# for more complex cases, the ... argument can't be used as elegantly
df %>%
  kbl() %>%
  reduce(
    which(names(df) %in% which_col),
    ~ column_spec(.x, .y, bold = rep(c(TRUE, FALSE), 5)),
    .init = .
  )
Edit: here is how this would be applied to your table.
library(kableExtra)

reduce_inputs <- lst(
  col = match(vars, names(newdf)),
  dat = df[, vars]
) %>%
  transpose()

# create table (note: newdf is piped in, so it must not be repeated inside kbl())
newdf %>%
  kbl(
    col.names = c("YEAR", vars),
    caption = paste0("Title"),
    escape = FALSE
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  reduce(
    reduce_inputs,
    ~ column_spec(.x, .y$col, image = spec_hist(.y$dat)),
    .init = .
  )

Create loop to change structure of multiple data frames in R

I have a bunch of Excel files that I have loaded into R as separate data frames. I now need to change the structure/layout of every one of these data frames. I have done all of this separately, but it is becoming very time consuming, and I am not sure of a better way to accomplish it. My guess is that I need to combine them all into a list and then create some type of loop to go through every data frame in that list. I need to be able to remove rows and columns from the edge, add 'row' to the top-left cell that is currently empty, and then apply the pivot_longer, mutate, and select steps listed below, which I have so far run separately for each data frame.
names(df)[1] <- 'row'
df <- df %>%
  pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
df <- df %>%
  mutate(wellID = paste0(row, plateColumn)) %>%
  select(-c(row, plateColumn))
I have tried what is below, and I get an error. Does anyone have a better way to accomplish this than what I am currently doing?
for(x in seq_along(files.list)){
  names(files.list)[1] <- 'row'
  df <- df %>%
    pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
  df <- df %>%
    mutate(wellID = paste0(row, plateColumn)) %>%
    select(-c(row, plateColumn))
}
If you have a vector of filenames my_files, I think this will work
library(tidyverse)
library(readxl)

prepare_df <- function(df) {
  # make changes to df
  names(df)[1] <- 'row'
  df <- df %>%
    pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
  df <- df %>%
    mutate(wellID = paste0(row, plateColumn)) %>%
    select(-c(row, plateColumn))
  return(df)
}

names(my_files) <- my_files # often useful if the vector we're mapping over has names
dfs <- map(my_files, read_excel) # read into a list of data frames
dfs <- map(dfs, prepare_df)      # prepare each one
df <- bind_rows(dfs, .id = "file") # if you prefer one data frame instead
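If you prefer, the read and prepare steps can also be collapsed into a single pass; a minimal sketch, assuming the same my_files vector and prepare_df() function as above:

library(tidyverse)
library(readxl)

# read, reshape, and row-bind in one step; .id records the source file
df <- map_dfr(set_names(my_files), ~ prepare_df(read_excel(.x)), .id = "file")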

error flattening (convert to data.frame) XML file in R using xml2 and xmltools

I am trying to convert this xml_file (and many other similar ones) to a data.frame in R. Desired outcome: a data.frame (or tibble, data.table, etc.) with:
One row per Deputado (the main tag/level of xml_file; there are 4 of those).
All variables within each Deputado as columns.
Nested categories with multiple values (such as comissao, cargoComissoes, etc.) can be ignored.
In the code below I tried to follow Example 2 in the readme of github/.../xmltools closely, but I got the error:
...
+ dplyr::mutate_all(empty_as_na)
Error: Argument 4 must be length 4, not 39
Any help fixing this (or a different strategy with a complete example) would be greatly appreciated.
The code (with reproducible error) is:
file <- "https://www.camara.leg.br/SitCamaraWS/Deputados.asmx/ObterDetalhesDeputado?ideCadastro=141428&numLegislatura="

doc <- file %>%
  xml2::read_xml()

nodeset <- doc %>%
  xml2::xml_children()

length(nodeset) # lots of nodes!

nodeset[1] %>% # let's look at ONE node's tree
  xml_view_tree()

# let's assume that most nodes share the same structure
terminal_paths <- nodeset[1] %>%
  xml_get_paths(only_terminal_parent = TRUE)

terminal_xpaths <- terminal_paths %>% ## collapse xpaths to unique only
  unlist() %>%
  unique()

# xml_to_df (XML package based)
## note that we use file, not doc, hence is_xml = FALSE
# df1 <- lapply(xpaths, xml_to_df, file = file, is_xml = FALSE, dig = FALSE) %>%
#   dplyr::bind_cols()
# df1

# xml_dig_df (xml2 package based)
## faster!
empty_as_na <- function(x){
  if ("factor" %in% class(x)) x <- as.character(x) ## since ifelse won't work with factors
  if (class(x) == "character") ifelse(as.character(x) != "", x, NA) else x
}

terminal_nodesets <- lapply(terminal_xpaths, xml2::xml_find_all, x = doc) # use xml docs, not nodesets! I think this is because it searches the 'root'.

df2 <- terminal_nodesets %>%
  purrr::map(xml_dig_df) %>%
  purrr::map(dplyr::bind_rows) %>%
  dplyr::bind_cols() %>%
  dplyr::mutate_all(empty_as_na)
Here is an approach with the XML package.
library(tidyverse)
library(XML)

df <- xmlInternalTreeParse("./Data/ObterDetalhesDeputado.xml")
df_root <- xmlRoot(df)
df_children <- xmlChildren(df_root)

df_flattened <- map_dfr(df_children, ~ .x %>%
                          xmlToList() %>%
                          unlist() %>%
                          stack() %>%
                          mutate(ind = as.character(ind),
                                 ind = make.unique(ind)) %>% # for duplicate identifiers
                          spread(ind, values))
The following nodes are nested lists, so they will appear as duplicate columns with numbers affixed (via make.unique). You can remove them accordingly:
cargosComissoes: 2
partidoAtual: 3
gabinete: 3
historicoLider: 4
comissoes: 11
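A minimal sketch of one way to drop those duplicated nested-list columns afterwards, assuming the df_flattened object from above; the regex is an assumption for illustration, so adjust it to the actual suffixes in your data:

library(dplyr)

# make.unique()/unlist() append numbers to repeated names, so drop columns
# whose names end in digits (check first that no legitimate column does!)
df_clean <- df_flattened %>%
  select(-matches("[0-9]+$"))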

Efficiently create data.frames for a changing number of input csv files with identical 'tidy' format and size

I can't figure out how to:
efficiently create, with rbind or another way, a data.frame compiling csv-derived data.frames whose number varies from project to project; or, similarly:
efficiently create a data.frame of the differences between a csv-derived "baseline scenario"'s values and those of the rest of the csv-based alternative scenarios.
The csvs are timeseries of hydrologic model output, already in long, 'tidy' format, and they are identical in format, size, and order; there is just a different number of them for each project. There are always at least two, a baseline and an alternative, but there are usually quite a few. E.g., Project A might have four csvs/scenarios and Project B might have thirty.
I'm hoping to have one code template that efficiently accommodates projects with any number of scenarios. Without one, I need to add or delete quite a few lines to match the number of scenarios, sometimes several times a day, which is a time-consuming step I'd like to avoid. After df and df_diff are created, both are used for later summaries and plots.
I'll manually enter the names of the scenarios, as they always differ, e.g.:
library(dplyr)
scenarios <- c("baseline", "alt1", "alt1b", "no dam")
length(scenarios) will always match the number of CSVs I have for a given project.
Read in the csvs (one csv for each scenario) and keep them unmodified for later, separate processing:
# In my case these csv#s come from a separate file's list of csvs, e.g.
# csv1 <- read.csv("baseline.csv")
# csv2 <- read.csv("alt1.csv"), etc. - all tidy monthly timeseries of many variables
# For reproducibility, simplifying:
csv1 <- data.frame("variable" = "x", "value" = 13)  # baseline scenario
csv2 <- data.frame("variable" = "x", "value" = 5)   # "alternative 1"
csv3 <- data.frame("variable" = "x", "value" = 109) # "alternative 1b"
csv4 <- data.frame("variable" = "x", "value" = 11)  # "dam removal"
# csv5 <- data.frame("variable" = "x", "value" = 2.5) # "100 extra flow for salmon sep-dec"
# ...
# csv30 <- data.frame("variable" = "x", "value" = 41) # "alternative H3"
Copy the csvs and connect data to scenario:
baseline <- csv1 %>% mutate(scenario = as.factor(paste0(scenarios[1])))
scen2 <- csv2 %>% mutate(scenario = as.factor(paste0(scenarios[2])))
scen3 <- csv3 %>% mutate(scenario = as.factor(paste0(scenarios[3])))
scen4 <- csv4 %>% mutate(scenario = as.factor(paste0(scenarios[4])))

df <- rbind(baseline, scen2, scen3, scen4) # data.frame #1 I'm looking for.
# e.g., if csv1-csv30 were included, how to compile df efficiently, w/o needing the "scen" lines?
There are 4 scenarios in this case, so df$scenario has 4 levels.
Now for the second "difference" data.frame:
bslnevals <- baseline %>% select(value)
scen2vals <- scen2 %>% select(value)
scen3vals <- scen3 %>% select(value)
scen4vals <- scen4 %>% select(value)

scen2diff <- (scen2vals - bslnevals) %>%
  transmute(value_diff = value,
            scenario_diff = as.factor(paste0(scenarios[2], " - baseline"))) %>%
  data.frame(scen2) %>%
  select(-value, -scenario)

scen3diff <- (scen3vals - bslnevals) %>%
  transmute(value_diff = value,
            scenario_diff = as.factor(paste0(scenarios[3], " - baseline"))) %>%
  data.frame(scen3) %>%
  select(-value, -scenario)

scen4diff <- (scen4vals - bslnevals) %>%
  transmute(value_diff = value,
            scenario_diff = as.factor(paste0(scenarios[4], " - baseline"))) %>%
  data.frame(scen4) %>%
  select(-value, -scenario)

df_diff <- rbind(scen2diff, scen3diff, scen4diff) # data.frame #2 I'm looking for.
# same as above: if csv1-csv30 were included, how to compile df_diff efficiently,
# w/o needing the "scen#vals" and "scen#diff" lines?

rm(baseline, scen2, scen3, scen4) # declutter - now unneeded (but csv1, csv2, etc., the original csv#s, are needed later)
rm(bslnevals, scen2vals, scen3vals, scen4vals) # unneeded
rm(scen2diff, scen3diff, scen4diff) # unneeded
With 4 scenarios, there are 3 differences from the baseline so df_diff$scenario has 3 levels.
So, whether I have 4 csvs (1 baseline, 3 alternatives) or 30 (1 baseline, 29 alternatives), I tried to write functions and for loops that would assign the scen2, scen3, ... scen28 and scen2diff, scen3diff, ... scen28diff variables dynamically, but I failed. I'm looking for a way that works and that doesn't need much modification when applied to a project with any number of scenarios: a clean way for a user to create df and df_diff for however many scenarios (i.e., csvs) a given project happens to include.
Any help is greatly appreciated.
I can't test with your actual files, but this may be a good starting point for refactoring your code. I use case_when to generate rules that map each CSV filename to a scenario, and I subtract the baseline value from the value in each scenario.
library(dplyr)
library(readr)
library(purrr)
library(tidyr)

baseline_df <- read_csv("baseline.csv") %>%
  mutate(id = row_number())

# list all csv files (in the current directory), then read them all and row-bind them.
# use case_when to map filenames to "scenarios" (grepl checks for the presence of a string).
# join with the baseline df (by row number within each scenario) for easy subtracting.
# calculate the differences.
# remove the baseline-baseline rows (their diff is 0).
diff_df <- list.files(path = getwd(), pattern = "*.csv", full.names = TRUE) %>%
  tibble(filename = .) %>%
  mutate(data = map(filename, read_csv)) %>%
  unnest(data) %>%
  mutate(scenario = case_when(
    grepl("baseline", filename) ~ "baseline",
    grepl("alternative1", filename) ~ "alt1",
    grepl("alternative2", filename) ~ "alt2",
    grepl("dam_removal", filename) ~ "no dam",
    TRUE ~ "other"
  )) %>%
  group_by(scenario) %>%
  mutate(id = row_number()) %>%
  left_join(baseline_df, by = "id", suffix = c("_new", "_baseline")) %>%
  mutate(value_diff = value_new - value_baseline) %>%
  filter(scenario != "baseline")
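As a complement, the question's first data.frame (df) can also be built without the per-scenario lines; a minimal sketch, assuming the csv1..csv4 objects and the scenarios vector from the question:

library(dplyr)
library(purrr)

# name each csv's data frame with its scenario, then row-bind;
# .id turns the names into a scenario column
df <- list(csv1, csv2, csv3, csv4) %>%
  set_names(scenarios) %>%
  bind_rows(.id = "scenario") %>%
  mutate(scenario = as.factor(scenario))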

In dplyr, how to delete and rename columns that don't exist, manipulate all names, and name a new variable using a string?

How can I simplify or perform the following operations using dplyr:
Run a function on all data.frame names, like mutate_each(funs()) for values, e.g.
names(iris) <- make.names(names(iris))
Delete columns that do NOT exist (i.e. delete nothing), e.g.
iris %>% select(-matches("Width")) # ok
iris %>% select(-matches("X")) # returns empty data.frame, why?
Add a new column by name (string), e.g.
iris %>% mutate_("newcol" = 0) # ok
x <- "newcol"
iris %>% mutate_(x = 0) # adds a column with name "x" instead of "newcol"
Rename a data.frame column that does not exist, e.g.
names(iris)[names(iris)=="X"] <- "Y"
iris %>% rename(sl=Sepal.Length) # ok
iris %>% rename(Y=X) # error, instead of no change
I would use setNames for this:
iris %>% setNames(make.names(names(.)))
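(In more recent dplyr, 1.0 or later, rename_with() applies a function to the column names directly in the pipe; a small sketch:)

library(dplyr)

iris %>%
  rename_with(make.names) %>% # apply a function to every column name
  head()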
For #2, include everything() as an argument to select:
iris %>% select(-matches("Width"), everything())
iris %>% select(-matches("X"), everything())
For #3, to my understanding there is no shortcut other than explicitly naming the string, as you already do:
iris %>% mutate_("newcol" = 0)
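(Since this question was written, the underscore verbs have been deprecated; in dplyr 1.0+ a string can name a new column via the glue or injection syntax. A small sketch:)

library(dplyr)

x <- "newcol"
iris %>% mutate("{x}" := 0) %>% head() # creates a column named "newcol"
# equivalently: iris %>% mutate(!!x := 0)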
For #4, I came up with the following solution:
iris %>%
  rename_at(vars(everything()),
            function(nm)
              recode(nm,
                     Sepal.Length = "sl",
                     Sepal.Width = "sw",
                     X = "Y")) %>%
  head()
The final head() call is just for convenient output, of course.
Questions 1 through 3 are answered above. I came here because I had the same problem as number 4; here is my solution:
df <- iris
Set up a name key with the columns to be renamed and the new values:
name_key <- c(
  sl = "Sepal.Length",
  sw = "Sepal.Width",
  Y = "X"
)
Set the values that are not in the data frame to NA. This works better for my purposes; you could probably just remove them from name_key instead.
for (var in names(name_key)) {
  if (!(name_key[[var]] %in% names(df))) {
    name_key[var] <- NA
  }
}
Get a vector of the new names whose old columns actually exist in the data frame:
cols <- names(name_key[!is.na(name_key)])
Rename columns
for (nm in names(name_key)) {
  names(df)[names(df) == name_key[[nm]]] <- nm
}
Select columns
df2 <- df %>%
  select(cols)
I'm almost positive this can be done more elegantly, but this is what I have so far. Hope this helps, if you haven't solved it already!
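(For what it's worth, recent dplyr versions do offer a more elegant route: rename() accepts a named lookup vector through any_of(), which silently skips old names that don't exist. A small sketch, assuming dplyr 1.0+:)

library(dplyr)

lookup <- c(sl = "Sepal.Length", sw = "Sepal.Width", Y = "X") # new = old
iris %>%
  rename(any_of(lookup)) %>% # "X" is absent, so that entry is simply ignored
  head()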
An answer to question #2:
You can use the function any_of if you want to give the full column names explicitly.
iris %>%
  select(-any_of(c("X", "Sepal.Width", "Petal.Width")))
This does not try to remove the non-existent column X, and it removes the other two listed.
Otherwise, you can use the solution with matches, or a combination of any_of and matches:
iris %>%
  select(-any_of("X")) %>%
  select(-matches("Width"))
This removes X explicitly, plus everything matched. Multiple match patterns are also possible:
iris %>%
  select(-any_of("X")) %>%
  select(-matches(c("Width", "Spec"))) # use c() for multiple patterns
