read_excel won't trim whitespace

I am using the readxl package to load an Excel file. By default, read_excel should strip leading and trailing whitespace (trim_ws = TRUE), but it is not doing so.
The file can be downloaded directly from the second link below, or through the website at the first link, where it is Appendix B.
http://www2.nationalgrid.com/UK/Industry-information/Future-of-Energy/Electricity-Ten-Year-Statement/
http://www2.nationalgrid.com/WorkArea/DownloadAsset.aspx?id=8589937799
library(readxl); library(tidyverse)
test <- read_excel("ETYS 2016 Appendix B.xlsx", skip = 1, sheet = 22, trim_ws = TRUE)
print(test$`MVAr Generation`)
test$`MVAr Generation` %>% str_count(pattern = "\\s")
test$`MVAr Generation` %>% table # all values look numeric
test$`MVAr Generation` %>% class # however, the class is character
test$`MVAr Generation` %>% str_count(pattern = "\\s") %>%
  sum(na.rm = TRUE) # it should be 0, but it is 2
This breaks the analysis, as the example shows: a column that should be numeric is read as character.
Help would be appreciated.

library(readxl)
readxl::excel_sheets('ETYS 2016 Appendix B.xlsx')[22]
test <- read_excel("ETYS 2016 Appendix B.xlsx", skip = 1, sheet = 22,
trim_ws = FALSE)
test$`MVAr Generation` <- as.numeric(gsub('^\\s', "", test$`MVAr Generation`))
The problem is probably due to character encoding. I get this message when I force the column to be interpreted as numeric:
Expecting numeric in D9 / R9C4: got 'Â 225'
You can work around it manually by removing the leading whitespace with gsub, as in the code above.
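The 'Â 225' in that message looks like what a non-breaking space (U+00A0) becomes when UTF-8 bytes are displayed in a single-byte encoding, which would explain why ordinary whitespace trimming leaves it alone. As a minimal sketch under that assumption (the vector x below is made-up data standing in for the `MVAr Generation` column, not values read from the workbook), the non-breaking space can be replaced explicitly before converting:

```r
# Hypothetical values standing in for the `MVAr Generation` column;
# the first one starts with a non-breaking space (U+00A0).
x <- c("\u00a0225", " 10", "7")

# trimws() strips only space, tab, CR and LF by default, so first turn
# non-breaking spaces into ordinary spaces, then trim and convert.
cleaned <- trimws(gsub("\u00a0", " ", x, fixed = TRUE))
values <- as.numeric(cleaned)
```

If the column contains other invisible characters, this will not catch them; inspecting the raw bytes with charToRaw() on one offending cell is a quick way to see what is actually there.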

Maybe this is what you want:
library(xlsx)
test <- read.xlsx("ETYS 2016 Appendix B.xlsx", sheetIndex = 22,
                  colIndex = 1:7, startRow = 2, header = TRUE,
                  stringsAsFactors = FALSE)
# remove leading whitespace and the stray "Â" left by the encoding problem
test <- data.frame(lapply(test, function(y) {
  y <- gsub("^\\s+", "", y)
  y <- gsub("Â", "", y)
  gsub("^\\s+", "", y) # trim again in case the "Â" was followed by a space
}), stringsAsFactors = FALSE)
# set tidy cols to numeric
cols = c(3, 4, 5, 7)
test[,cols] = apply(test[,cols], 2, function(x) as.numeric(x))
# test
class(test$Unit.Number)
test$MVAr.Absorption

@troh's insight about the character encoding got me thinking about using a regex, and @jaySF's application across the whole data frame was a good way to process all the columns at the same time. The two suggestions led me to the answer below.
require(dplyr);require(purrr);require(readr)
RemoveSymbols <-function(df) {
df %>% mutate_all( funs(gsub("^[^A-Z0-9]", "", ., ignore.case = FALSE))) %>%
map_df(parse_guess)
}
test2 <- RemoveSymbols(test)
sapply(test2,class)

Related

Error in R: The subscript var has the wrong type quosure/formula. It must be numeric or character

Code:
GeoSeparate <- function(Dataset, GeoColumn) {
GeoColumn <- enquo(GeoColumn)
Dataset %>%
separate(GeoColumn, into = c("Section1", "Section2"), sep = "\\(")%>%
separate(Section1, into = c("Section3", "Section4"), sep = ",")%>%
separate(Section2, into = c("GeoColumn", "Section5"), sep = "\\)")%>%
separate(GeoColumn, into = c("GeoColumnLat", "GeoColumnLon"), sep = ",")%>%
select(-Section3, -Section4, -Section5) #remove sections we don't need
}
Test:
GeoSeparate(df3, DeathCityGeo)
Error:
Must extract column with a single subscript.
x The subscript `var` has the wrong type `quosure/formula`.
ℹ It must be numeric or character.
My function separates a column that has the format: "Norwalk, CT\n(41.11805, -73.412906)" so that the latitude and longitude are all that remain and they are in two separate columns. It worked for a while, but now I get the error message described above. It may be because I updated my libraries but I'm not sure. Any help would be amazing! Thank you.

We need to unquote the quosure with !! where the argument is used:
GeoSeparate <- function(Dataset, GeoColumn) {
GeoColumn <- enquo(GeoColumn)
Dataset %>%
separate(!!GeoColumn, into = c("Section1", "Section2"), sep = "\\(")%>%
separate(Section1, into = c("Section3", "Section4"), sep = ",")%>%
separate(Section2, into = c("GeoColumn", "Section5"), sep = "\\)")%>%
separate(GeoColumn, into = c("GeoColumnLat", "GeoColumnLon"), sep = ",")%>% # the literal "GeoColumn" column created in the previous step
select(-Section3, -Section4, -Section5) #remove sections we don't need
}
Or another option is curly-curly ({{}})
GeoSeparate <- function(Dataset, GeoColumn) {
Dataset %>%
separate({{GeoColumn}}, into = c("Section1", "Section2"), sep = "\\(")%>%
separate(Section1, into = c("Section3", "Section4"), sep = ",")%>%
separate(Section2, into = c("GeoColumn", "Section5"), sep = "\\)")%>%
separate(GeoColumn, into = c("GeoColumnLat", "GeoColumnLon"), sep = ",")%>% # the literal "GeoColumn" column created in the previous step
select(-Section3, -Section4, -Section5) #remove sections we don't need
}
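For reference, here is a shortened, self-contained sketch of the curly-curly approach that can be run on the sample string from the question. The intermediate column names (Place, Coords, Lat, Lon) and the one-row df3 are illustrative choices, not taken from the original data, and the steps are condensed rather than a line-for-line copy of the function above.

```r
library(dplyr)
library(tidyr)

# Sketch: the caller passes a bare column name, which is forwarded with {{ }}.
GeoSeparate <- function(Dataset, GeoColumn) {
  Dataset %>%
    # split "Norwalk, CT\n(41.1, -73.4)" at the opening parenthesis
    separate({{ GeoColumn }}, into = c("Place", "Coords"), sep = "\\(") %>%
    # drop the closing parenthesis
    mutate(Coords = gsub("\\)", "", Coords)) %>%
    # Coords is a column created above, so it is referenced directly
    separate(Coords, into = c("Lat", "Lon"), sep = ",\\s*", convert = TRUE) %>%
    select(-Place)
}

# One-row stand-in for the real data (hypothetical)
df3 <- data.frame(DeathCityGeo = "Norwalk, CT\n(41.11805, -73.412906)")
out <- GeoSeparate(df3, DeathCityGeo)
```

Only the first separate() needs {{ }}, because that is the only place where the column name comes from the caller; later steps work on columns the function itself created.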

Optimize calls to mutate and summarise?

I have this R script:
rm(list = ls())
library(tidyr)
suppressWarnings(library(dplyr))
outFile = "zFinal.lua"
cat("\014\n")
cat(file = outFile, sep = "")
filea <- read.csv("csva.csv", strip.white = TRUE)
fileb <- read.csv("csvb.csv", strip.white = TRUE, sep = ";", header=FALSE)
df <-
merge(filea, fileb, by.x = c(3), by.y = c(1)) %>%
subset(select = c(1, 3, 6, 2)) %>%
arrange(ColA, ColB, V2) %>%
group_by(ColA) %>%
mutate(V2 = paste0('"', V2, "#", ColB, '"')) %>%
summarise(ID = paste(V2, collapse = ", ", sep=";")) %>%
mutate(ID = paste0('["', ColA, '"] = {', ID, '},')) %>%
mutate(ID = paste0('\t\t', ID))
df <- df[c("ID")]
cat("\n\tmyTable = {\n", file = outFile, append = TRUE, sep = "\n")
write.table(df, append = TRUE, file = outFile, sep = ",", quote = FALSE, row.names = FALSE, col.names = FALSE)
cat("\n\t}", file = outFile, append = TRUE, sep = "\n")
# Done
cat("\nDONE.", sep = "\n")
As you can see, this script opens csva.csv and csvb.csv.
This is csva.csv:
ID,ColA,ColB,ColC,ColD
2,3,100,1,1
3,7,300,1,1
5,7,200,1,1
11,22,900,1,1
14,27,500,1,1
16,30,400,1,1
20,36,900,1,1
23,39,800,1,1
24,42,700,1,1
29,49,800,1,1
45,3,200,1,1
And this is csvb.csv:
100;file1
200;file2
300;file3
400;file4
This is the output file that my script and the csv files produce:
myTable = {
["3"] = {"file1#100", "file2#200"},
["7"] = {"file2#200", "file3#300"},
["30"] = {"file4#400"},
}
This output file is exactly what I want. It's perfect.
This is what the script does. I'm not sure I can explain this very well so if I don't do a good job at that, please skip this section.
For each line in csva.csv, if ColB (csva) contains a number that also appears in column 1 of csvb, then the output file should contain a line like this:
["3"] = {"file1#100", "file2#200"},
So, in the above example, the first line of csva has 3 in ColA and 100 in ColB. In csvb, column 1 contains 100 and column 2 contains file1, which gives the entry "file1#100".
Because csva contains another 3 in ColA (the last line), that row is also processed and its entry is output on the same line.
Ok so my script runs very well indeed and produces perfect output. The problem is it takes too long to run. csva and csvb in my question here are only a few lines long so the output is instant.
However, the data I have to work with in the real world - csva is over 300,000 lines and csvb is over 900,000 lines. So the script takes a long, long time to run (too long to make it feasible). It does work beautifully but it takes far too long to run.
From commenting out lines gradually, it seems that the slowdown is with mutate and summarise. Without those lines, the script runs in about 30 seconds. But with mutate and summarise, it takes hours.
I'm not too advanced with R so how can I make my script run faster possibly by improving my syntax or providing faster alternatives to mutate and summarise?

Here is a more compact version of your code in base R that should offer something of a performance boost.
(Edited to match the data provided by wibeasley.)
ds_a$file_name <- ds_b$file_name[match(ds_a$ColB, ds_b$ColB)]
ds_a <- ds_a[!is.na(ds_a$file_name), -4]
ds_a <- ds_a[order(ds_a$ColB),]
ds_a$file_name <- paste0('"', ds_a$file_name, "#", ds_a$ColB, '"')
res <- tapply(ds_a$file_name, ds_a$ColA, FUN = paste, collapse = ", ", sep=";")
res <- paste0("\t\t[\"", names(res), "\"] = {", res, "},", collapse = "\n")
cat("\n\tmyTable = {", res, "\t}", sep = "\n\n")
Outputting:
myTable = {
["3"] = {"file1#100", "file2#200"},
["7"] = {"file2#200", "file3#300"},
["30"] = {"file4#400"},
}

Here's a dplyr approach that closely follows yours. The real differences are that rows and columns are dropped from the object as soon as possible so there's less baggage to move around.
I'm making some guesses about what will actually help with the large datasets. Please report back the before and after durations. I like how you said which calls were taking the longest; reporting the new bottlenecks would help too.
If this isn't fast enough, the next easiest move is probably to switch to sqldf (which uses SQLite under the hood) or data.table. Both require learning a different syntax (unless you already know SQL), but could be worth your time in the long run.
# Pretend this info is being read from a file
str_a <-
"ID,ColA,ColB,ColC,ColD
2,3,100,1,1
3,7,300,1,1
5,7,200,1,1
11,22,900,1,1
14,27,500,1,1
16,30,400,1,1
20,36,900,1,1
23,39,800,1,1
24,42,700,1,1
29,49,800,1,1
45,3,200,1,1"
str_b <-
"100;file1
200;file2
300;file3
400;file4"
# Declare the desired columns and their data types.
# Include only the columns needed. Use the smaller 'integer' data type where possible.
col_types_a <- readr::cols_only(
`ID` = readr::col_integer(),
`ColA` = readr::col_integer(),
`ColB` = readr::col_integer(),
`ColC` = readr::col_integer()
# `ColD` = readr::col_integer() # Exclude columns never used
)
col_types_b <- readr::cols_only(
`ColB` = readr::col_integer(),
`file_name` = readr::col_character()
)
# Read the file into a tibble
ds_a <- readr::read_csv(str_a, col_types = col_types_a)
ds_b <- readr::read_delim(str_b, delim = ";", col_names = c("ColB", "file_name"), col_types = col_types_b)
ds_a %>%
dplyr::select( # Quickly drop as many columns as possible; avoid reading if possible
ID,
ColB,
ColA
) %>%
dplyr::left_join(ds_b, by = "ColB") %>% # Join the two datasets
tidyr::drop_na(file_name) %>% # Dump the records you'll never use
dplyr::mutate( # Create the hybrid column
entry = paste0('"', file_name, "#", ColB, '"')
) %>%
dplyr::select( # Dump the unneeded columns
-ID,
-file_name
) %>%
dplyr::group_by(ColA) %>% # Create a bunch of subdatasets
dplyr::arrange(ColB, entry) %>% # Sorting inside the group usually is faster?
dplyr::summarise(
entry = paste(entry, collapse = ", ", sep = ";")
) %>%
dplyr::ungroup() %>% # Stack all the subsets on top of each other
dplyr::mutate( # Mush the two columns
entry = paste0('\t\t["', ColA, '"] = {', entry, '},')
) %>%
dplyr::pull(entry) %>% # Isolate the desired vector
paste(collapse = "\n") %>% # Combine all the elements into one.
cat()
result:
["3"] = {"file1#100", "file2#200"},
["7"] = {"file2#200", "file3#300"},
["30"] = {"file4#400"},

You could try loading your table as a data.table instead. Operations on data.tables are usually faster than on data.frames.
library(data.table)
filea <- fread("csva.csv")
Just check that it is still a data.table before you reach the mutate call (print it and you will see the obvious difference from a data.frame).
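To get the most out of data.table, the join and the grouped paste can also be written in data.table syntax. The following is a minimal sketch on the sample data from the question, not a benchmarked solution; the column names ColB and file_name for csvb are assumptions, since that file has no header.

```r
library(data.table)

# Sample data from the question, read from text instead of files
filea <- fread(text = "ID,ColA,ColB,ColC,ColD
2,3,100,1,1
3,7,300,1,1
5,7,200,1,1
11,22,900,1,1
14,27,500,1,1
16,30,400,1,1
20,36,900,1,1
23,39,800,1,1
24,42,700,1,1
29,49,800,1,1
45,3,200,1,1")
fileb <- fread(text = "100;file1
200;file2
300;file3
400;file4", sep = ";", header = FALSE, col.names = c("ColB", "file_name"))

# Inner join on ColB, then sort so entries appear in ColB order within ColA
joined <- merge(filea, fileb, by = "ColB")
setorder(joined, ColA, ColB)

# One row per ColA, with the quoted entries collapsed into a single string
res <- joined[, .(entry = paste0('"', file_name, "#", ColB, '"', collapse = ", ")),
              by = ColA]
lines <- sprintf('\t\t["%s"] = {%s},', res$ColA, res$entry)
cat("\tmyTable = {", lines, "\t}", sep = "\n")
```

Whether this is fast enough on 300,000 and 900,000 rows would need to be timed on the real files.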

Here's another solution that leverages data.table's performance while still staying within your dplyr knowledge. I'm not sure there's much room to improve within only 10 seconds, but theoretically this could help larger datasets where the cost to create the indexes is amortized over a longer stretch of execution.
The dtplyr package is translating the dplyr verbs (that are familiar to you) to data.table syntax under the hood. That's leveraging the keys, which should improve the performance, especially with joining and grouping.
The dtplyr::lazy_dt feature might help optimize the dplyr-to-data.table translation.
Finally, vroom replaces readr, mostly out of curiosity. It's independent of the other changes, and it sounds like reading the files has never been a bottleneck.
col_types_a <- vroom::cols_only(
`ID` = vroom::col_integer(),
`ColA` = vroom::col_integer(),
`ColB` = vroom::col_integer(),
`ColC` = vroom::col_integer()
# `ColD` = vroom::col_integer() # Leave out this column b/c it's never used
)
col_types_b <- vroom::cols_only(
`ColB` = vroom::col_integer(),
`file_name` = vroom::col_character()
)
ds_a <- vroom::vroom(str_a, col_types = col_types_a)
ds_b <- vroom::vroom(str_b, delim = ";", col_names = c("ColB", "file_name"), col_types = col_types_b)
# ds_a <- data.table::setDT(ds_a, key = c("ColB", "ColA"))
# ds_b <- data.table::setDT(ds_b, key = "ColB")
ds_a <- dtplyr::lazy_dt(ds_a, key_by = c("ColB", "ColA")) # New line 1
ds_b <- dtplyr::lazy_dt(ds_b, key_by = "ColB") # New line 2
ds_a %>%
dplyr::select( # Quickly drop as many columns as possible; avoid reading if possible
ID,
ColB,
ColA
) %>%
dplyr::inner_join(ds_b, by = "ColB") %>% # New line 3 (replaces left join)
# tidyr::drop_na(file_name) %>% # Remove this line
# dplyr::filter(!is.na(file_name)) %>% # Alternative w/ left join
dplyr::mutate(
entry = paste0('"', file_name, "#", ColB, '"')
) %>%
dplyr::select( # Dump the unneeded columns
-ID,
-file_name
) %>%
dplyr::group_by(ColA) %>%
dplyr::arrange(ColB, entry) %>% # Sort inside the group usually helps
dplyr::summarise(
entry = paste(entry, collapse = ", ", sep=";")
) %>%
dplyr::ungroup() %>%
dplyr::mutate(
entry = paste0('\t\t["', ColA, '"] = {', entry, '},')
) %>%
dplyr::pull(entry) %>% # Isolate the desired vector
paste(collapse = "\n") %>%
cat()

How do I avoid 'NA' values when coercing a .tsv column into numeric via as.numeric?

I have a dataframe with several columns from a .tsv file and want to transform one of them into the 'numeric' type for analysis. However, I keep getting the 'NAs' introduced by coercion warning all the time and do not know exactly why. There is some unnecessary info at the beginning of another column, which is pretty much the only formatting I did.
Originally, I thought the file might have added some extra tabs or spaces, which is why I tried to delete these via giving sub() as an argument.
I should also mention that I get the NA errors also when I do not replace the values and run the dataframe as is:
library(tidyverse)
data_2018 <- read_tsv('teina230.tsv')
data_1995 <- read_csv('OECD_1995.csv')
#get rid of long colname & select only columns containing %GDP
clean_data_2018 <- data_2018 %>%
select('na_item,sector,unit,geo','2018Q1','2018Q2','2018Q3','2018Q4') %>%
rename(country = 'na_item,sector,unit,geo')
clean_data_2018 <- clean_data_2018[grep("PC_GDP", clean_data_2018$'country'), ]
#remove unnecessary info
clean_data_2018 <- clean_data_2018 %>%
mutate(country=gsub('\\GD,S13,PC_GDP,','',country))
clean_data_2018 <- clean_data_2018 %>%
mutate(
'2018Q1'=as.numeric(sub("", "", '2018Q1', fixed = TRUE)),
'2018Q2'=as.numeric(sub(" ", "", '2018Q2', fixed = TRUE)),
'2018Q3'=as.numeric(sub(" ", "", '2018Q3', fixed = TRUE)),
'2018Q4'=as.numeric(sub(" ", "", '2018Q4', fixed = TRUE))
)
Is there another way to get around the problem and convert the column without replacing all the values with 'NA'?
Thanks guys :)

Thanks for the hint, @divibisan!
Renaming the columns via rename() actually solved the problem. Here is the code that finally worked:
library(tidyverse)
data_2018 <- read_tsv('teina230.tsv')
#get rid of long colname & select only columns containing %GDP
clean_data_2018 <- data_2018 %>%
select('na_item,sector,unit,geo','2018Q1','2018Q2','2018Q3','2018Q4') %>%
rename(country = 'na_item,sector,unit,geo',
quarter_1 = '2018Q1',
quarter_2 = '2018Q2',
quarter_3 = '2018Q3',
quarter_4 = '2018Q4')
clean_data_2018 <- clean_data_2018[grep("PC_GDP", clean_data_2018$'country'), ]
#remove unnecessary info
clean_data_2018 <- clean_data_2018 %>%
mutate(country=gsub('\\GD,S13,PC_GDP,','',country))
clean_data_2018 <- clean_data_2018 %>%
mutate(
quarter_1 = as.numeric(quarter_1),
quarter_2 = as.numeric(quarter_2),
quarter_3 = as.numeric(quarter_3),
quarter_4 = as.numeric(quarter_4)
)
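A likely reason the rename helps: inside mutate(), a name in single or double quotes such as '2018Q1' is a string literal, so as.numeric('2018Q1') tries to convert the text "2018Q1" itself and returns NA. Names that are not syntactically valid need backticks to be treated as columns. A small sketch with made-up values (the tibble below is not the Eurostat data):

```r
library(dplyr)

# Made-up stand-in for one quarter column
df <- tibble(`2018Q1` = c("1.5", "2.0"))

# Quoted name: as.numeric() receives the string "2018Q1", so every row is NA
quoted <- suppressWarnings(
  df %>% mutate('2018Q1' = as.numeric('2018Q1'))
)

# Backticked name: as.numeric() receives the column's values
fixed <- df %>% mutate(`2018Q1` = as.numeric(`2018Q1`))
```

So the original column names can be kept as long as they are written with backticks on the right-hand side.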

unexpected error when manipulating list of data.frame

I have a list of data.frames as the output of a custom function, and I intend to split each data.frame by its last column, using a given threshold. I manipulated the two lists and combined them into a single table, but I get an error when manipulating this new table and can't figure out where the issue comes from. How can I fix this error? Once it is fixed, I want to wrap this in a function. Is there an easier way to manipulate a list of data.frames, or a better way to debug the error?
mini example :
savedDF <- list(
bar = data.frame(.start=c(12,21,37), .stop=c(14,29,45), .score=c(5,9,4)),
cat = data.frame(.start=c(18,42,18,42,81), .stop=c(27,46,27,46,114), .score=c(10,5,10,5,34)),
foo = data.frame(.start=c(3,3,33,3,33,91), .stop=c(24,24,10,24,10,17), .score=c(22,22,6,22,6,7))
)
discardedDF <- list(
bar = data.frame(.start=c(16,29), .stop=c(20,37), .score=c(2,11)),
cat = data.frame(.start=c(21,31), .stop=c(23,43), .score=c(1,9)),
foo = data.frame(.start=c(54, 79), .stop=c(71,93), .score=c(3,8))
)
I can manipulate this way :
both <- do.call("rbind", c(savedDF, discardedDF))
cn <- c("letter", "seq")
# FIXME :
DF <- cbind(
read.table(text = chartr("_", ".", rownames(both)), header=T, sep = ".", col.names = cn),
both)
DF <- transform(DF, isPassed = ifelse(.score > 8, "Pass", "Fail"))
by(DF, DF[c("letter", "isPassed")],
function(x) write.csv(x[-(1:length(savedDF))],
sprintf("%s_%s_%s.csv", x$letter[1], x$isPassed[1])))
But I get an error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 15 did not have 2 elements
Why do I get this error? Can anyone point out how to fix it?
My desired output is a set of CSV files as follows:
bar.saved.Pass.csv
bar.saved.Fail.csv
bar.discarded.Pass.csv
bar.discarded.Fail.csv
cat.saved.Pass.csv
cat.saved.Fail.csv
cat.discarded.Pass.csv
cat.discarded.Fail.csv
foo.saved.Pass.csv
foo.saved.Fail.csv
foo.discarded.Pass.csv
foo.discarded.Fail.csv
But I think the way the exported CSV files are controlled is still not ideal. How can I improve this wrapper? I would like to let the user choose the output directory, or make it more dynamic in general. Any ideas? Thanks a lot.

Is this what you are looking for?
library(tidyverse)
library(magrittr)
both <- do.call("rbind", c(savedDF, discardedDF))
both %<>% rownames_to_column(var = "cn")
both %<>% separate(cn, c("letters", "seq"), sep = "\\.")
both %<>% mutate(isPassed = ifelse(.score > 8, "Passed", "Failed"),
isDiscard = ifelse(is.na(seq), "Saved", "Discarded"))
list_of_dfs <- both %>% split(list(.$letters, .$isPassed, .$isDiscard))
csv_names <- paste0("/Users/nathanday/Desktop/", names(list_of_dfs), ".csv") # change this path
mapply(write.csv, list_of_dfs, csv_names)
The %<>% operator is shorthand, so both %<>% rownames_to_column(var = "cn") is identical to both <- rownames_to_column(both, var = "cn").
To make it more "dynamic" for allowing output path input, you could wrap this in the function structure you already have like this:
output_where <- function(output_path, list1, list2) {
if (!dir.exists(output_path)) {
dir.create(file.path(output_path))
}
both <- do.call(rbind, c(list1, list2))
both %<>% rownames_to_column(var = "cn")
both %<>% separate(cn, c("letters", "seq"), sep = "\\.")
both %<>% mutate(isPassed = ifelse(.score > 8, "Passed", "Failed"), isDiscard = ifelse(is.na(seq), "Saved", "Discarded"))
list_of_dfs <- both %>% split(list(.$letters, .$isPassed, .$isDiscard))
csv_names <- paste0(output_path, names(list_of_dfs), ".csv")
return(mapply(write.csv, list_of_dfs, csv_names))
}
output_where("~/Desktop/", savedDF, discardedDF)
For even more flexibility:
output_where <- function(output_path, list1, list2) {
if (!dir.exists(output_path)) {
dir.create(file.path(output_path))
}
names(list1) <- paste("list1", names(list1), sep = ".")
names(list2) <- paste("list2", names(list2), sep = ".")
both <- do.call(rbind, c(list1, list2))
both %<>% rownames_to_column(var = "cn")
both %<>% separate(cn, c("original_list", "letters", "seq"), sep = "\\.")
both %<>% mutate(isPassed = ifelse(.score > 8, "Passed", "Failed"))
list_of_dfs <- both %>% split(list(.$letters, .$isPassed, .$original_list))
csv_names <- paste0(output_path, names(list_of_dfs), ".csv")
return(mapply(write.csv, list_of_dfs, csv_names))
}
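An alternative that avoids parsing row names altogether is to attach the list name and the origin as ordinary columns before binding. This is a base R sketch, not the answerer's method; the tag() helper, the shortened example lists and the tempdir() output folder are all illustrative assumptions.

```r
# Shortened stand-ins for savedDF and discardedDF
saved <- list(bar = data.frame(.score = c(5, 9)),
              cat = data.frame(.score = c(10, 5)))
discarded <- list(bar = data.frame(.score = c(2, 11)))

# Hypothetical helper: add the list element name and the origin as columns
tag <- function(lst, origin) {
  Map(function(df, nm) cbind(letter = nm, origin = origin, df),
      lst, names(lst))
}

both <- do.call(rbind, c(tag(saved, "saved"), tag(discarded, "discarded")))
both$isPassed <- ifelse(both$.score > 8, "Pass", "Fail")

# One data frame per letter/origin/result combination that actually occurs
groups <- split(both, list(both$letter, both$origin, both$isPassed), drop = TRUE)

# Write each group to <out_dir>/<letter>.<origin>.<result>.csv
out_dir <- file.path(tempdir(), "csv_out")
dir.create(out_dir, showWarnings = FALSE)
paths <- file.path(out_dir, paste0(names(groups), ".csv"))
for (i in seq_along(groups)) {
  write.csv(groups[[i]], paths[i], row.names = FALSE)
}
```

Using file.path() for the output paths also means the directory argument works with or without a trailing slash.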

recognizing items in values list in R in for loop

I am having some trouble getting R to recognize items in my Values list (in RStudio) in a function call (just referring to it as a generic function here). Here's an example...the following works just fine if I type it in directly:
result <- function(cnv.chr1.S1, cnv.chr1.S2, cnv.chr1.S3)
because cnv.chr1.S1, cnv.chr1.S2, and cnv.chr1.S3 are objects (specifically GRanges objects) that I've created previously.
But as I'm looping over different chromosomes and there are really many more than 3 samples (S1, S2, S3), I've tried the following (simplified here)
chrom <- paste("chr", 1:1, sep = "")
sample.names <- paste("S", 1:3, sep = "")
for (thischrom in chrom)
{
for (sample in sample.names)
{
a <- function(list(paste(paste("cnv", thischrom, sep = "."), sample.names, sep = ".")))
}
}
However, it doesn't work because
paste(paste("cnv", thischrom, sep = "."), sample.names, sep = ".")
just creates a character vector of strings that have the same names as the objects in my Values list. How do I get R to access the corresponding objects?
Thanks for any thoughts you might have!
Steve

Are you looking for something like this?
library(dplyr)
chrom <- paste("chr", 1:1, sep = "")
sample.names <- paste("S", 1:2, sep = "")
cnv.chr1.S1 = c(1, 2)
cnv.chr1.S2 = c(2, 3)
result =
data_frame(chrom = chrom) %>%
merge(data_frame(sample.names = sample.names) ) %>%
rowwise %>%
mutate(object =
paste("cnv", chrom, sample.names, sep = ".") %>%
parse(text = .) %>%
eval %>%
list)
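Base R also has get() and mget() for looking objects up by name, which avoids parse() and eval(). A minimal sketch, using plain numeric vectors as stand-ins for the GRanges objects and rbind() as a placeholder for the real function:

```r
# Stand-ins for the GRanges objects in the question
cnv.chr1.S1 <- c(1, 2)
cnv.chr1.S2 <- c(2, 3)
cnv.chr1.S3 <- c(3, 4)

chrom <- paste0("chr", 1)
sample.names <- paste0("S", 1:3)

for (thischrom in chrom) {
  # Build all object names for this chromosome at once
  object.names <- paste("cnv", thischrom, sample.names, sep = ".")
  # Fetch the objects themselves as a named list
  objects <- mget(object.names, envir = environment())
  # Pass them to the function as separate arguments (rbind is a placeholder)
  result <- do.call(rbind, objects)
}
```

In the longer run, storing the objects in a named list from the start is usually easier than creating many separately named variables.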
