Splitting multiple string columns and renaming the new columns adequately in R

I have a data frame with a large number of string columns.
Each of those columns consists of strings with three parts that I would like to split. So in the end the total number of string columns would triple.
When doing that I would additionally like to directly name the new columns by attaching certain predefined strings to their original column name.
As a simplified example
test_frame<-tibble(x=c("a1!","b2#","c3$"), y=c("A1$","G2%", NA))
x   y
a1! A1$
b2# G2%
c3$ NA
should become something like
x_letter x_number x_sign y_letter y_number y_sign
a        1        !      A        1        $
b        2        #      G        2        %
c        3        $      NA       NA       NA
The order of the elements within the string is always the same.
The real data frame has over 100 string columns that can all be split into their three parts using a separator. The only exception might be rows where a string is missing.
I've looked into combinations of str_split_fixed(), strsplit() and separate() and apply functions but couldn't figure out how to directly name the columns while also looping over the columns.
What would be a simple approach here?

This should be what you need. It is not the cleanest solution, but it is simple.
library(tidyverse)
test_frame<-tibble(x=c("a1!","b2#","c3$"), y=c("A1$","G2%", NA))
pipe_to_do <- . %>%
  str_split_fixed(string = ., pattern = "(?<=.)", n = 3) %>%
  as_tibble() %>%
  rename(letter = V1,
         number = V2,
         sign = V3)
xx <- test_frame %>%
  summarise(across(everything(), .fns = pipe_to_do))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
names_xx <- names(xx)
combine_names <- function(df, name) {
  str_c(name, "_", df)
}
combine_names_func <- function(df, name) {
  df %>%
    rename_with(.fn = ~ combine_names(.x, name))
}
map2(xx, names_xx, combine_names_func) %>%
  reduce(bind_cols)
#> # A tibble: 3 x 6
#> x_letter x_number x_sign y_letter y_number y_sign
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a 1 ! "A" "1" "$"
#> 2 b 2 # "G" "2" "%"
#> 3 c 3 $ "" "" ""
Created on 2020-08-04 by the reprex package (v0.3.0)

You can use str_extract:
library(stringr)
df <- data.frame(
  x_letter = str_extract(test_frame$x, "^[a-z]"),
  x_number = str_extract(test_frame$x, "(?<=^[a-z])[0-9]"),
  x_sign = str_extract(test_frame$x, ".$"),
  y_letter = str_extract(test_frame$y, "^[A-Z]"),
  y_number = str_extract(test_frame$y, "(?<=^[A-Z])[0-9]"),
  y_sign = str_extract(test_frame$y, ".$")
)
Result:
df
  x_letter x_number x_sign y_letter y_number y_sign
1        a        1      !        A        1      $
2        b        2      #        G        2      %
3        c        3      $     <NA>     <NA>     <NA>
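Writing one str_extract() call per new column does not scale well to over 100 columns, so here is a minimal base R sketch that loops over every column and builds the new names from the original column name. It assumes, as in the simplified example, that each part is exactly one character wide; the suffixes vector and the split_one() helper are names introduced here for illustration only. Real data with a separator would need strsplit() in place of substr().

```r
# Example data from the question (a plain data.frame, to stay in base R)
test_frame <- data.frame(
  x = c("a1!", "b2#", "c3$"),
  y = c("A1$", "G2%", NA),
  stringsAsFactors = FALSE
)

# Assumed predefined name parts, one per piece of the string
suffixes <- c("letter", "number", "sign")

# Hypothetical helper: split one column by character position and
# name the pieces <column>_<suffix>; NA input stays NA in every piece
split_one <- function(values, name) {
  parts <- lapply(seq_along(suffixes), function(i) substr(values, i, i))
  names(parts) <- paste(name, suffixes, sep = "_")
  as.data.frame(parts, stringsAsFactors = FALSE)
}

# Apply the helper to every column, then bind the pieces side by side;
# unname() stops cbind() from prefixing the column names a second time
pieces <- Map(split_one, test_frame, names(test_frame))
result <- do.call(cbind, unname(pieces))
print(result)
```

Because the new names are generated from names(test_frame), the same code works unchanged for any number of columns.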

Related

read_excel and NA values

I am trying to import my data using read_excel, and I need every "NA" string to be interpreted as a missing value, but I am stuck.
Currently, my data has "NA" strings all over the place, and I need them to be treated as blank so that colSums(is.na(data)) no longer shows 0s.
NA is stored as chr in this table, among otherwise numeric values, as you can see in the screenshot.
data <- read_excel(workbook_path, na = c(""))
colSums(is.na(data))
Here is a reproducible example using one of the readxl example files; you can open its location by running browseURL(dirname(readxl_example("type-me.xlsx"))):
library(readxl)
library(dplyr)
xlsx <- readxl_example("type-me.xlsx")
# open file location explorer:
# browseURL(dirname(readxl_example("type-me.xlsx")))
# by default blank cells are treated as missing data, note the single <NA>:
df <- read_excel(xlsx, sheet = "text_coercion") %>% head(n = 2)
df
#> # A tibble: 2 × 2
#> text explanation
#> <chr> <chr>
#> 1 <NA> "empty"
#> 2 cabbage "\"cabbage\""
# add "empty" to na vector, note 2 <NA> values:
df <- readxl::read_excel(xlsx, sheet = "text_coercion", na = c("", "empty")) %>% head(n = 2)
df
#> # A tibble: 2 × 2
#> text explanation
#> <chr> <chr>
#> 1 <NA> <NA>
#> 2 cabbage "\"cabbage\""
# to replace all(!) NA values with ""
df[is.na(df)] <- ""
df
#> # A tibble: 2 × 2
#> text explanation
#> <chr> <chr>
#> 1 "" ""
#> 2 "cabbage" "\"cabbage\""
Created on 2023-01-26 with reprex v2.0.2
Note from your screenshot: you have column names in the first row of your data frame. This breaks data type detection (everything is chr), and you should deal with that first. Once the columns are numeric, data[is.na(data)] <- "" will no longer work, because you cannot write strings to numeric columns. And that is perfectly fine.
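To make literal "NA" strings count as missing at import time, they can be listed in the na argument, for example na = c("", "NA"), in the same way "empty" was added above. If the data has already been imported as text, the sketch below shows a base R alternative using type.convert(). The raw data frame is made up for illustration and only stands in for an imported sheet.

```r
# Made-up stand-in for a sheet that was imported with every column as text
raw <- data.frame(
  id    = c("1", "2", "3"),
  score = c("4.5", "NA", "6"),
  stringsAsFactors = FALSE
)

# The "NA" strings are ordinary text here, so nothing is counted as missing
print(colSums(is.na(raw)))

# Re-detect column types and turn "NA" and "" strings into real missing values
clean <- type.convert(raw, na.strings = c("NA", ""), as.is = TRUE)

str(clean)
print(colSums(is.na(clean)))
```

After the conversion the score column is numeric, so missing entries stay NA rather than becoming empty strings.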

dplyr::full_join two data frames with part-match in the "by" argument in R

I would like to join two data sets that look like the following data sets. The matching rule would be that the Item variable from mykey matches the first part of the Item entry in mydata to some degree.
mydata <- tibble(Item = c("ab_kssv", "ab_kd", "cde_kh", "cde_ksa", "cde"),
                 Answer = c(1, 2, 3, 4, 5),
                 Avg = rep(-100, length(Item)))
mykey <- tibble(Item = c("ab", "cde"),
                Avg = c(0, 10))
The result should be the following:
  Item    Answer Avg
1 ab_kssv      1   0
2 ab_kd        2   0
3 cde_kh       3  10
4 cde_ksa      4  10
5 cde          5  10
I looked at these three SO questions, but did not find a nice solution there. I also briefly tried the fuzzyjoin package, but that did not work. Finally, I have a for-loop-based solution:
for (currLine in 1:nrow(mydata)) {
  mydata$Avg[currLine] <- mykey$Avg[str_starts(mydata$Item[currLine], mykey$Item)]
}
It does the job, but it is not nice to read or understand, and I wonder whether the "by" argument of full_join() from the dplyr package can be made a bit more tolerant in its matching. Any help will be appreciated!
Using a fuzzyjoin::regex_left_join you could do:
Note: I renamed the Item column in your mykey dataset to regex to make clear that this is the regex to match by and added a "^" to ensure that we match at the beginning of the Item column in the mydata dataset.
library(fuzzyjoin)
library(dplyr)
mykey <- mykey %>%
  rename(regex = Item) %>%
  mutate(regex = paste0("^", regex))
mydata %>%
  select(-Avg) %>%
  regex_left_join(mykey, by = c(Item = "regex")) %>%
  select(-regex)
#> # A tibble: 5 × 3
#> Item Answer Avg
#> <chr> <dbl> <dbl>
#> 1 ab_kssv 1 0
#> 2 ab_kd 2 0
#> 3 cde_kh 3 10
#> 4 cde_ksa 4 10
#> 5 cde 5 10
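If every key is simply the part of Item before the first underscore, as in the example data, the lookup can also be done without fuzzyjoin. The following base R sketch rests on that assumption (it would not handle keys that themselves contain underscores); the prefix variable is a name introduced here for illustration.

```r
mydata <- data.frame(
  Item   = c("ab_kssv", "ab_kd", "cde_kh", "cde_ksa", "cde"),
  Answer = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)
mykey <- data.frame(
  Item = c("ab", "cde"),
  Avg  = c(0, 10),
  stringsAsFactors = FALSE
)

# Assumption: the key is everything before the first underscore
prefix <- sub("_.*$", "", mydata$Item)

# Exact lookup of each prefix in the key table; unmatched prefixes give NA
mydata$Avg <- mykey$Avg[match(prefix, mykey$Item)]
print(mydata)
```

The regex join remains the more general choice when the keys cannot be derived by a fixed rule like this.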

Issue with na.rm = TRUE when combining multiple character columns using unite from tidyr

When trying to combine multiple character columns using unite from tidyr, the na.rm = TRUE option does not remove NA.
Step by step:
The original dataset has 5 columns, word1:word5 (image of the original data).
I am looking to combine word1:word5 into a single column using this code:
data_unite_5 <- data_original_5 %>%
  unite("pentawords", word1:word5, sep = " ", na.rm = TRUE, remove = FALSE)
Here's an image of the output: data_unite_5
I've tried using mutate_if(is.factor, as.character) but that did not work.
Any suggestions would be appreciated.
You have misinterpreted how the na.rm argument works for unite. Following the examples on the tidyverse page here, z is the unite of x and y.
With na.rm = FALSE
#> z x y
#> <chr> <chr> <chr>
#> 1 a_b a b
#> 2 a_NA a NA
#> 3 NA_b NA b
#> 4 NA_NA NA NA
With na.rm = TRUE
#> z x y
#> <chr> <chr> <chr>
#> 1 "a_b" a b
#> 2 "a" a NA
#> 3 "b" NA b
#> 4 "" NA NA
Hence na.rm determines how NA values appear in the assembled strings (pentawords); it does not drop rows from the data.
If you want to remove rows like the fourth one from the dataset, I would recommend filter.
data_unite_5 <- data_original_5 %>%
  unite("pentawords", word1:word5, sep = " ", na.rm = TRUE, remove = FALSE) %>%
  filter(pentawords != "")
This excludes from your output every row whose combined string is empty.
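The behavior described above can be checked on a small table. This sketch rebuilds the four-row x/y example with tidyr (version 1.0.0 or later is assumed, for expand_grid() and the na.rm argument) and then drops the empty result; the object names df, united and kept are chosen here for illustration.

```r
library(tidyr)

# All four combinations of present and missing values
df <- expand_grid(x = c("a", NA), y = c("b", NA))

# na.rm = TRUE drops NA values inside each combined string, not rows
united <- unite(df, "z", x, y, na.rm = TRUE, remove = FALSE)
print(united)

# Rows whose combined string is empty have to be removed explicitly
kept <- united[united$z != "", ]
print(kept)
```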

How to join two dataframes using dplyr in order to aggregate values of the same column?

Is there a simple and elegant way to left join (with dplyr) a "b" table onto an "a" table when both contain the same column, but the first has NAs and the second has the missing values? Here follows an example:
# Tables A and B
a <- tibble(
  "ID" = c(1, 2, 3),
  "x" = c(NA, 5, NA)
)
b <- tibble(
  "ID" = c(1, 3),
  "x" = c(7, 4)
)
# Table I want as result
c <- tibble(
  "ID" = c(1, 2, 3),
  "x" = c(7, 5, 4)
)
You could use the coalesce function in the dplyr package to assemble a complete vector from missing pieces. It is inspired by the SQL COALESCE function.
left_join(a, b, by = 'ID') %>%
  mutate(col = coalesce(x.x, x.y)) %>%
  select(ID, col)
# A tibble: 3 x 2
ID col
<dbl> <dbl>
1 1 7
2 2 5
3 3 4
Joining and then removing rows with an NA should do it. If an ID has non-NA values of x in both tables, this code will return 2 rows for that ID, but that is probably the behavior you'd want.
library(dplyr)
full_join(a,b, by = c('ID', 'x')) %>%
na.omit()
# A tibble: 3 x 2
ID x
<dbl> <dbl>
1 2 5
2 1 7
3 3 4
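Another option, assuming dplyr 1.0.0 or later is available, is rows_patch(), which fills only the NA values of the first table from matching rows of the second. A minimal sketch using the tables from the question (the name patched is illustrative):

```r
library(dplyr)

a <- tibble(ID = c(1, 2, 3), x = c(NA, 5, NA))
b <- tibble(ID = c(1, 3), x = c(7, 4))

# Only NA cells in a are overwritten; the existing value for ID 2 is kept
patched <- rows_patch(a, b, by = "ID")
print(patched)
```

By default rows_patch() expects every ID in b to exist in a, so it suits cases like this one where b only supplies values for known rows.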

skipping elements with Map() and match() in R

I'd like to recode the values in the df1 data frame using the df2 data frame so that I end up with a data frame like df3.
The current code almost does the trick, but there are two problems. First, it introduces NA when there's no match, e.g. there is no match in df2 for the df1 aed_bloodpr variable value "1,2" so the value becomes NA. Second, when a variable in df1 can't be mapped to df2, the code won't run (error message).
I have looked into the nomatch argument for match() and the .default argument for Map(), but I can't figure out how to use them so that I end up with df3.
Starting point:
Df1 <- data.frame("aed_bloodpr" = c("1,2", "2", "1", "1"),
                  "aed_gluco" = c("2", "1", "3", "2"),
                  "add_bmi" = c("2", "5,7", "7", "5"),
                  "add_asthma" = c("2", "2", "7", "5"),
                  "nausea" = c("3", "3", "4", "5"))
Df2 <- data.frame("NameOfVariable" = c("aed_bloodpr", "aed_bloodpr", "aed_gluco", "aed_gluco", "aed_gluco", "add_bmi", "add_bmi", "add_bmi"),
                  "VariableLevel" = c(1, 2, 1, 2, 3, 2, 5, 7),
                  "VariableDef" = c("high", "normal", "elevated", "normal", "NA", "above", "normal", "below"))
End point:
Df3 <- data.frame("aed_bloodpr" = c("1,2", "normal", "high", "high"),
                  "aed_gluco" = c("normal", "elevated", "NA", "normal"),
                  "add_bmi" = c("above", "5,7", "below", "normal"),
                  "add_asthma" = c("2", "2", "7", "5"),
                  "nausea" = c("3", "3", "4", "5"))
Current code:
data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])],
               Df1,
               split(Df2[2:3], Df2[1])[names(Df1)]))
You need to clean up before you can relabel. The actual relabeling is more easily accomplished by a join. Here using the tidyverse (translate as you like):
library(tidyverse)
Df1 <- data.frame("aed_bloodpr" = c("1,2","2","1","1"),
"aed_gluco" = c("2","1","3","2"),
"add_bmi" = c("2","5,7","7","5"),
"add_asthma" = c("2","2","7","5"),
"nausea" = c("3","3","4","5"))
Df2 <- data.frame("NameOfVariable" = c("aed_bloodpr","aed_bloodpr","aed_gluco","aed_gluco","aed_gluco","add_bmi","add_bmi","add_bmi"),
"VariableLevel" = c(1,2,1,2,3,2,5,7),
"VariableDef" = c("high","normal","elevated","normal","NA","above","normal","below"))
Df1_long <- Df1 %>%
  mutate_all(as.character) %>%    # change factors to strings
  rowid_to_column('i') %>%    # add row index to enable later long-to-wide reshape
  gather(variable, value, -i) %>%    # reshape to long form
  separate_rows(value, convert = TRUE)    # unnest nested values and convert to numeric
str(Df1_long)
#> 'data.frame': 22 obs. of 3 variables:
#> $ i : int 1 1 2 3 4 1 2 3 4 1 ...
#> $ variable: chr "aed_bloodpr" "aed_bloodpr" "aed_bloodpr" "aed_bloodpr" ...
#> $ value : int 1 2 2 1 1 2 1 3 2 2 ...
Df2_clean <- Df2 %>%
  mutate_if(is.factor, as.character) %>%    # change factors to strings
  mutate_all(na_if, 'NA')    # change "NA" to NA
Df3 <- Df1_long %>%
  left_join(Df2_clean, by = c('variable' = 'NameOfVariable',    # merge
                              'value' = 'VariableLevel')) %>%
  mutate(VariableDef = coalesce(VariableDef, as.character(value))) %>%    # combine labels and values
  group_by(i, variable) %>%
  summarise(value = toString(VariableDef)) %>%    # re-aggregate multiple values
  spread(variable, value)    # reshape to wide form
Df3
#> # A tibble: 4 x 6
#> # Groups: i [4]
#> i add_asthma add_bmi aed_bloodpr aed_gluco nausea
#> * <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 2 above high, normal normal 3
#> 2 2 2 normal, below normal elevated 3
#> 3 3 7 below high 3 4
#> 4 4 5 normal high normal 5
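For comparison, the original Map()/match() approach can also be adjusted to address both problems from the question: keep the original value when there is no match, and leave a column untouched when it has no entries in Df2. The sketch below is one way to do this in base R; lookup and recode_col() are names introduced here for illustration. Unlike the join-based answer, it leaves combined values such as "1,2" as they are, which matches the Df3 shown in the question.

```r
Df1 <- data.frame(
  aed_bloodpr = c("1,2", "2", "1", "1"),
  aed_gluco   = c("2", "1", "3", "2"),
  add_bmi     = c("2", "5,7", "7", "5"),
  add_asthma  = c("2", "2", "7", "5"),
  nausea      = c("3", "3", "4", "5"),
  stringsAsFactors = FALSE
)
Df2 <- data.frame(
  NameOfVariable = c("aed_bloodpr", "aed_bloodpr", "aed_gluco", "aed_gluco",
                     "aed_gluco", "add_bmi", "add_bmi", "add_bmi"),
  VariableLevel  = c(1, 2, 1, 2, 3, 2, 5, 7),
  VariableDef    = c("high", "normal", "elevated", "normal", "NA",
                     "above", "normal", "below"),
  stringsAsFactors = FALSE
)

# One small level/definition table per variable named in Df2
lookup <- split(Df2[2:3], Df2[[1]])

# Hypothetical helper: recode one column, falling back to the original value
recode_col <- function(x, name) {
  key <- lookup[[name]]
  if (is.null(key)) return(x)           # variable absent from Df2: leave as is
  idx <- match(x, key$VariableLevel)    # NA where the value has no definition
  ifelse(is.na(idx), x, key$VariableDef[idx])
}

Df3 <- data.frame(Map(recode_col, Df1, names(Df1)), stringsAsFactors = FALSE)
print(Df3)
```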
