I am trying to import my data using read_excel, and I need any "NA" string value to be interpreted as a missing value, but I am stuck.
Currently, my data has "NA" strings all over the place, and I need them to become genuinely missing (blank) so that colSums(is.na(data)) no longer shows 0s.
The "NA" values are stored as character strings inside otherwise numeric columns, as you can see in the screenshot.
data <- read_excel(workbook_path, na = c(""))
colSums(is.na(data))
Using one of readxl's example files for a reproducible example (you can open its location by running browseURL(dirname(readxl_example("type-me.xlsx")))), the sheet looks like this:
library(readxl)
library(dplyr)
xlsx <- readxl_example("type-me.xlsx")
# open file location explorer:
# browseURL(dirname(readxl_example("type-me.xlsx")))
# by default blank cells are treated as missing data, note the single <NA>:
df <- read_excel(xlsx, sheet = "text_coercion") %>% head(n = 2)
df
#> # A tibble: 2 × 2
#> text explanation
#> <chr> <chr>
#> 1 <NA> "empty"
#> 2 cabbage "\"cabbage\""
# add "empty" to na vector, note 2 <NA> values:
df <- readxl::read_excel(xlsx, sheet = "text_coercion", na = c("", "empty")) %>% head(n = 2)
df
#> # A tibble: 2 × 2
#> text explanation
#> <chr> <chr>
#> 1 <NA> <NA>
#> 2 cabbage "\"cabbage\""
# to replace all(!) NA values with ""
df[is.na(df)] <- ""
df
#> # A tibble: 2 × 2
#> text explanation
#> <chr> <chr>
#> 1 "" ""
#> 2 "cabbage" "\"cabbage\""
Created on 2023-01-26 with reprex v2.0.2
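Applied to the original data, the analogous call would presumably be to add the literal string "NA" to the na vector:
data <- read_excel(workbook_path, na = c("", "NA"))
colSums(is.na(data))  # cells that contained the text "NA" are now counted as missing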
Note from your screenshot: you have column names in the first row of your dataframe. This breaks data type detection (everything becomes chr), and you should deal with that first. Once the columns are properly numeric, data[is.na(data)] <- "" will no longer work, as you cannot write strings into numeric columns; and that is perfectly fine, because missing numeric values should stay NA.
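If the real headers ended up in the first data row because the sheet has an extra row above them, skipping that row at import time usually restores correct type guessing. A minimal sketch, assuming exactly one extra row sits at the top of the sheet:
data <- read_excel(workbook_path, skip = 1, na = c("", "NA"))
str(data)  # numeric columns should now come in as <dbl> with real NA values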
I'm trying to use tidyxl and unpivotr to clean messy Excel data.
I'm trying to use behead() inside a function of my own, with the name argument of behead() supplied as one of my function's arguments.
Code:
data_prep <- function(variable_col){
# Read in excel cells
cells <- xlsx_cells(paste0(data_folder, "Data_name.xlsx"),
                    include_blank_cells = TRUE)
# Cell manipulation
cells1 <- cells %>%
# Select cells to be columns
behead("up-left", "year") %>%
behead("left", variable_col)
Here, as variable_col is one of the data_prep() function's arguments, I want it to be changeable, so that the new column takes the desired name (e.g. dog_names).
But instead, when running the function, e.g.
data_prep(variable_col = 'dog_names')
the output still has the column name as "variable_col", and not "dog_names".
Therefore, assigning the variable inside a function isn't working.
I've tried putting different quotation and speech marks etc around "variable_col", but no luck.
Anyone used unpivotr::behead in a function before and/or can help me with this?
Thanks.
We can unquote (!!) the argument so that the string it holds is evaluated as the column name:
library(tidyxl)
library(unpivotr)
data_prep <- function(variable_col){
# Read in excel cells
cells <- xlsx_cells(paste0(data_folder, "Data_name.xlsx"),
                    include_blank_cells = TRUE)
# Cell manipulation
cells1 <- cells %>%
# Select cells to be columns
behead("up-left", "year") %>%
behead("left", !!variable_col)
}
Reproducible example:
x <- data.frame(a = 1:2, b = 3:4)
cells <- as_cells(x, col_names = TRUE)
> variable_col <- "Sex"
> cells %>%
+ behead("up-left", variable_col)
# A tibble: 4 × 6
row col data_type chr int variable_col
<int> <int> <chr> <chr> <int> <chr>
1 2 1 int <NA> 1 a
2 3 1 int <NA> 2 a
3 2 2 int <NA> 3 b
4 3 2 int <NA> 4 b
> cells %>%
+ behead("up-left", !!variable_col)
# A tibble: 4 × 6
row col data_type chr int Sex
<int> <int> <chr> <chr> <int> <chr>
1 2 1 int <NA> 1 a
2 3 1 int <NA> 2 a
3 2 2 int <NA> 3 b
4 3 2 int <NA> 4 b
The raw data is an XML file:
https://drive.google.com/file/d/1WOylIDRVDSicDjPZkDL0FyoyESfHIE_9/view?usp=sharing
library(XML)
doc <- xmlParse("./p48cierre_01-01-2019.xml")
docList <- xmlToList(doc)
mylist_SeriesTemporales <- sapply(docList, '[', 'SeriesTemporales')
$IdentificacionMensaje.NA
[1] NA
$VersionMensaje.NA
[1] NA
$TipoMensaje.NA
[1] NA
$TipoProceso.NA
[1] NA
$TipoClasificacion.NA
[1] NA
I am getting all NAs in the SeriesTemporales list as well; a sample of that output is shown above.
I want to convert all SeriesTemporales lists into a single data frame. Please help me out.
Expected Output:
> xmlDataOut
# A tibble: 30,000 x 7
`Periodo.IntervaloTiempo.Attribute~ `Periodo.Resolucion.Attribute:~ `UnidadMedida.Attribute:~ `UPSalida.Attribute:v` `UPEntrada.Attribute:~ `TipoNegocio.Attribute~ `Periodo.Intervalo.Pos.Attribut~
<chr> <chr> <chr> <lgl> <chr> <chr> <dbl>
1 2021-04-20T22:00Z/2021-04-21T22:00Z PT60M MWH NA ZERBI Z21 1
2 2021-04-20T22:00Z/2021-04-21T22:00Z PT60M MWH NA ZERBI Z21 10
I cannot understand how SeriesTemporales can be converted into a data frame when its elements' lengths differ. However, you can extract all SeriesTemporales elements into another list, say l2, simply by doing this:
l2 <- docList[names(docList) == 'SeriesTemporales']
Now, if the first element of each item in l2 is converted to a data frame:
library(purrr)
map_df(l2, ~.x[1])
# A tibble: 1,256 x 1
IdentificacionSeriesTemporales
<chr>
1 STP0
2 STP1
3 STP2
4 STP3
5 STP4
6 STP5
7 STP6
8 STP7
9 STP8
10 STP9
# ... with 1,246 more rows
But the third element gives this:
map_df(l2, ~.x[3])
# A tibble: 2,512 x 2
UPSalida UPEntrada
<chr> <chr>
1 LUBAC01 NA
2 NES NA
3 FUSIC01 NA
4 NES NA
5 NA ECEGRG
6 NA NES
7 NA HYGESTE
8 NA NES
9 GNRAC01 NA
10 NES NA
# ... with 2,502 more rows
This seems like a straightforward XML document to parse. The only catch is that the information is stored in the nodes' attributes and not in the nodes themselves.
Here is an xml2 solution.
See the comments for an explanation.
library(xml2)
library(dplyr)
page <- read_xml("p48cierre_01-01-2019.xml")
#check for namespace
xml_ns(page)
#strip namespace
xml_ns_strip(page)
#find all SeriesTemporales nodes
seriesT <- page %>% xml_find_all(".//SeriesTemporales")
#get requested information from each parent node
# find the correct subnote and attribute
#assuming only one sub node per parent
Intervalo <- seriesT %>% xml_find_first(".//IntervaloTiempo") %>% xml_attr("v")
Resolution <- seriesT %>% xml_find_first(".//Resolucion") %>% xml_attr("v")
UnidadMedida <- seriesT %>% xml_find_first(".//UnidadMedida") %>% xml_attr("v")
UPSalida <- seriesT %>% xml_find_first(".//UPSalida") %>% xml_attr("v")
UPEntrada <- seriesT %>% xml_find_first(".//UPEntrada") %>% xml_attr("v")
TipoNegocio <- seriesT %>% xml_find_first(".//TipoNegocio") %>% xml_attr("v")
#combine into a final answer
head(data.frame(Intervalo, Resolution, UnidadMedida, UPSalida, UPEntrada, TipoNegocio))
I am not sure about your request for the "Pos" node: there are 24 per parent node, so they do not fit conveniently in a single data.frame. If you are just looking for the first one, follow the format above; if not, maybe ask another question.
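If you do want all of the Pos values, a rough sketch (assuming, as the expected output suggests, that each Pos node also carries its value in a v attribute) is to loop over the parent nodes and build a long data frame:
library(purrr)
# one row per Pos node, tagged with the index of its parent SeriesTemporales node
pos_df <- map_df(seq_along(seriesT), function(i) {
  v <- xml_attr(xml_find_all(seriesT[[i]], ".//Pos"), "v")
  data.frame(serie = rep(i, length(v)), Pos = v)
})
head(pos_df)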
Hi and happy new year to all.
I have a tricky task (in my opinion) and I cannot find a way to solve it.
Please see the following toy data. The original dataset has hundreds of columns/rows.
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan"),
US=c(8,2,NA,7),
UK=c(5,4,1,7))
I want to create a new column, called "origin", which pastes together the column name of every non-NA cell in a row, separated by "|" and ordered by the corresponding values. Higher values should be pasted first. For ties (like Zlatan), the sequence isn't relevant: the output for Zlatan could be US|UK or UK|US.
This is the desired output:
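(The original post shows a screenshot; based on the description it is roughly:)
name    US  UK  origin
Amber    8   5  US|UK
Thomas   2   4  UK|US
Stefan  NA   1  UK
Zlatan   7   7  US|UK (or UK|US)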
I tried for some hours to solve it, but no approach worked. Maybe it makes sense to convert the values with as.factor...
Help is much appreciated. Thank you in advance!
Here's a dplyr approach. First, we can use rowwise to work on individual rows independently. Next, we can use c_across which allows us to select values from that row only. We can subset a vector of c("US","UK") based on whether the US and UK columns are not NA.
paste with collapse = "|" allows us to put the values together with the separator. I added a row to see what would happen if they are both NA.
library(dplyr)
test %>%
rowwise() %>%
mutate(origin = paste(c("US","UK")[rev(order(c_across(US:UK), na.last = NA))], collapse = "|"))
# A tibble: 5 x 4
# Rowwise:
name US UK origin
<chr> <dbl> <dbl> <chr>
1 Amber 8 5 "US|UK"
2 Thomas 2 4 "UK|US"
3 Stefan NA 1 "UK"
4 Zlatan 7 7 "UK|US"
5 Bob NA NA ""
This is also trivially expanded to more columns:
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan","Bob"),
US=c(8,2,NA,7,NA),
UK=c(5,4,1,7,NA),
AUS=c(1,2,NA,NA,1))
test %>%
rowwise() %>%
mutate(origin = paste(c("US","UK","AUS")[rev(order(c_across(US:AUS), na.last = NA))], collapse = "|"))
# A tibble: 5 x 5
# Rowwise:
name US UK AUS origin
<chr> <dbl> <dbl> <dbl> <chr>
1 Amber 8 5 1 US|UK|AUS
2 Thomas 2 4 2 UK|AUS|US
3 Stefan NA 1 NA UK
4 Zlatan 7 7 NA UK|US
5 Bob NA NA 1 AUS
Or with tidyselect assistance to operate on all columns except name:
test %>%
rowwise() %>%
mutate(origin = paste(names(across(-name))[rev(order(c_across(-name), na.last = NA))], collapse = "|"))
Another possibility with tidyverse. It is longer than the other two solutions, but it should work directly with a dataframe with as many columns as you need.
I changed the dataframe to long format, filtered out NAs, grouped by name, summarized using paste, and joined with the original dataframe to get the original columns (and rows with all NAs).
library(tidyverse)
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan","Bob"),
US=c(8,2,NA,7,NA),
UK=c(5,4,1,7,NA),
AUS=c(1,2,NA,NA,1))
test %>%
# change to long format
tidyr::pivot_longer(cols=-name, names_to = "country", values_to = "value") %>%
# remove rows with NA
dplyr::filter(!is.na(value)) %>%
# group by name and sort
dplyr::group_by(name) %>% dplyr::arrange(-value) %>%
# create summary of countries for each name in column 'origin'
dplyr::summarise(origin=paste(country, collapse = "|")) %>%
# join with original data frame to include original columns (and names with only NA) and change NA to '' in origin
dplyr::right_join(test, by='name') %>% dplyr::mutate(origin=ifelse(is.na(origin), '', origin)) %>%
# move origin column to end
dplyr::relocate(origin, .after = last_col())
Result
name US UK AUS origin
<chr> <dbl> <dbl> <dbl> <chr>
1 Amber 8 5 1 US|UK|AUS
2 Bob NA NA 1 AUS
3 Stefan NA 1 NA UK
4 Thomas 2 4 2 UK|US|AUS
5 Zlatan 7 7 NA US|UK
Here's a different tidyverse solution using case_when:
library(tidyverse)
data <- data.frame(
  name = c("Amber","Thomas","Stefan","Zlatan"),
  US = c(8,2,NA,7),
  UK = c(5,4,1,7))
data <- data %>% mutate(origin = case_when( US > UK ~ "US|UK",
UK >= US ~ "UK|US",
is.na(UK) & !is.na(US) ~ "US",
is.na(US) & !is.na(UK) ~ "UK"))
data
#> name US UK origin
#> 1 Amber 8 5 US|UK
#> 2 Thomas 2 4 UK|US
#> 3 Stefan NA 1 UK
#> 4 Zlatan 7 7 UK|US
Created on 2021-01-06 by the reprex package (v0.3.0)
When trying to combine multiple character columns using unite from tidyr, the na.rm = TRUE option does not remove NA.
Step by step:
The original dataset has 5 columns, word1:word5 (image of the original data).
I am looking to combine word1:word5 into a single column using this code:
data_unite_5 <- data_original_5 %>%
unite("pentawords", word1:word5, sep=" ", na.rm=TRUE, remove=FALSE)
Here's an image of the output: data_unite_5
I've tried using mutate_if(is.factor, as.character) but that did not work.
Any suggestions would be appreciated.
You have misinterpreted how the na.rm argument works for unite. Following the examples on the tidyverse page here, z is the unite of x and y.
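The two tables below can be reproduced with something like this (expand_grid is from tidyr):
library(tidyr)
df <- expand_grid(x = c("a", NA), y = c("b", NA))
# na.rm = FALSE keeps the literal "NA" text inside z
df %>% unite("z", x:y, na.rm = FALSE, remove = FALSE)
# na.rm = TRUE drops the missing pieces from z instead
df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)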
With na.rm = FALSE
#> z x y
#> <chr> <chr> <chr>
#> 1 a_b a b
#> 2 a_NA a NA
#> 3 NA_b NA b
#> 4 NA_NA NA NA
With na.rm = TRUE
#> z x y
#> <chr> <chr> <chr>
#> 1 "a_b" a b
#> 2 "a" a NA
#> 3 "b" NA b
#> 4 "" NA NA
Hence na.rm determines how NA values appear in the assembled strings (pentawords); it does not drop rows from the data.
If you were wanting to remove the fourth row of the dataset, I would recommend filter.
data_unite_5 <- data_original_5 %>%
unite("pentawords", word1:word5, sep =" " , na.rm = TRUE, remove = FALSE) %>%
filter(pentawords != "")
This will exclude from your output all rows whose pentawords string is empty.
I have a data frame with a large number of string columns.
Each of those columns consists of strings with three parts which I would like to split. So in the end the total number of string columns would triple.
When doing that I would additionally like to directly name the new columns by attaching certain predefined strings to their original column name.
As a simplified example
test_frame<-tibble(x=c("a1!","b2#","c3$"), y=c("A1$","G2%", NA))
x y
a1! A1$
b2# G2%
c3$ NA
should become something like
x_letter x_number x_sign y_letter y_number y_sign
a 1 ! A 1 $
b 2 # G 2 %
c 3 $ NA NA NA
The order of the elements within the string is always the same.
The real data frame has over 100 string columns that can all be split into their three parts in the same way. The only exception might be rows where a string is missing.
I've looked into combinations of str_split_fixed(), strsplit() and separate() and apply functions but couldn't figure out how to directly name the columns while also looping over the columns.
What would be a simple approach here?
This should be what you need; not the cleanest solution, but simple:
library(tidyverse)
test_frame<-tibble(x=c("a1!","b2#","c3$"), y=c("A1$","G2%", NA))
pipe_to_do <- . %>%
str_split_fixed(string = .,pattern = "(?<=.)",n = 3) %>%
as_tibble() %>%
rename(letter = V1,
number = V2,
sign = V3)
xx <- test_frame %>%
summarise(across(everything(),.fns = pipe_to_do))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
names_xx <- names(xx)
combine_names <- function(df,name) {
str_c(name,"_",df)
}
combine_names_func <- function(df,name){
df %>%
rename_with(.fn = ~ combine_names(.x,name))
}
map2(xx,names_xx,combine_names_func) %>%
reduce(bind_cols)
#> # A tibble: 3 x 6
#> x_letter x_number x_sign y_letter y_number y_sign
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a 1 ! "A" "1" "$"
#> 2 b 2 # "G" "2" "%"
#> 3 c 3 $ "" "" ""
Created on 2020-08-04 by the reprex package (v0.3.0)
You can use str_extract:
library(stringr)
df <- data.frame(
x_letter = str_extract(test_frame$x,"^[a-z]"),
x_number = str_extract(test_frame$x,"(?<=^[a-z])[0-9]"),
x_sign = str_extract(test_frame$x,".$"),
y_letter = str_extract(test_frame$y,"^[A-Z]"),
y_number = str_extract(test_frame$y,"(?<=^[A-Z])[0-9]"),
y_sign = str_extract(test_frame$y,".$")
)
Result:
df
x_letter x_number x_sign y_letter y_number y_sign
1 a 1 ! A 1 $
2 b 2 # G 2 %
3 c 3 $ <NA> <NA> <NA>
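If the same three patterns apply to every string column, the extraction can also be generalised with across(), so each column does not have to be written out by hand. A sketch, assuming every string follows the letter-number-sign order (the original columns are kept and can be dropped with select() afterwards):
library(dplyr)
library(stringr)
test_frame %>%
  mutate(across(
    everything(),
    list(letter = ~ str_extract(.x, "^[A-Za-z]"),
         number = ~ str_extract(.x, "[0-9]"),
         sign = ~ str_extract(.x, ".$")),
    .names = "{.col}_{.fn}"
  ))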