I am trying to import my data using read_excel, and I need every "NA" string value to be interpreted as a missing value, but I am stuck.
Currently, my data has "NA" strings all over the place, and I need them to be treated as blank cells so that colSums(is.na(data)) no longer shows 0s.
The NA entries are chr values sitting among numeric values in this table, as you can see in the screenshot.
data <- read_excel(workbook_path, na = c(""))
colSums(is.na(data))
Here is a reproducible example using one of the readxl example files. You can open its location by running browseURL(dirname(readxl_example("type-me.xlsx"))); the relevant sheet is read in below:
library(readxl)
library(dplyr)
xlsx <- readxl_example("type-me.xlsx")
# open file location explorer:
# browseURL(dirname(readxl_example("type-me.xlsx")))
# by default blank cells are treated as missing data, note the single <NA>:
df <- read_excel(xlsx, sheet = "text_coercion") %>% head(n = 2)
df
#> # A tibble: 2 × 2
#> text explanation
#> <chr> <chr>
#> 1 <NA> "empty"
#> 2 cabbage "\"cabbage\""
# add "empty" to na vector, note 2 <NA> values:
df <- readxl::read_excel(xlsx, sheet = "text_coercion", na = c("", "empty")) %>% head(n = 2)
df
#> # A tibble: 2 × 2
#> text explanation
#> <chr> <chr>
#> 1 <NA> <NA>
#> 2 cabbage "\"cabbage\""
# to replace all(!) NA values with ""
df[is.na(df)] <- ""
df
#> # A tibble: 2 × 2
#> text explanation
#> <chr> <chr>
#> 1 "" ""
#> 2 "cabbage" "\"cabbage\""
Created on 2023-01-26 with reprex v2.0.2
A note on your screenshot: the column names sit in the first row of your data frame. This breaks data type detection (every column becomes chr), so you should deal with that first. Once the columns are numeric, data[is.na(data)] <- "" will no longer work, because you cannot write strings into numeric columns. That is perfectly fine; numeric columns should keep NA for missing values.
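For the original problem (literal "NA" strings inside otherwise numeric columns), adding "NA" to the na vector, e.g. na = c("", "NA"), should handle it at import time. If the data is already loaded as text, a minimal base R sketch of the same idea follows. The data frame below is hypothetical and only mimics what the screenshot seems to show:

```r
# Hypothetical data: numbers imported as text, with literal "NA" strings
data <- data.frame(
  id    = c("1", "2", "3"),
  score = c("10.5", "NA", "7"),
  stringsAsFactors = FALSE
)

# The "NA" strings are ordinary text, so nothing counts as missing yet
colSums(is.na(data))

# type.convert() treats "NA" as missing by default (na.strings = "NA")
# and re-detects each column's type
data_fixed <- data
data_fixed[] <- lapply(data, type.convert, as.is = TRUE)

str(data_fixed)
colSums(is.na(data_fixed))
```

After the conversion, score is numeric and its "NA" entry is a real missing value.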
I would like to join two data sets that look like the following data sets. The matching rule would be that the Item variable from mykey matches the first part of the Item entry in mydata to some degree.
mydata <- tibble(Item = c("ab_kssv", "ab_kd", "cde_kh", "cde_ksa", "cde"),
Answer = c(1,2,3,4,5),
Avg = rep(-100, length(Item)))
mykey <- tibble(Item = c("ab", "cde"),
Avg = c(0 ,10))
The result should be the following:
Item Answer Avg
1 ab_kssv 1 0
2 ab_kd 2 0
3 cde_kh 3 10
4 cde_ksa 4 10
5 cde 5 10
I looked at these three SO questions, but did not find a nice solution there. I also briefly tried the fuzzyjoin package, but that did not work. Finally, I have a for-loop-based solution:
for (currLine in 1:nrow(mydata)) {
mydata$Avg[currLine] <- mykey$Avg[str_starts(mydata$Item[currLine], mykey$Item)]
}
It does the job, but it is not nice to read or understand, and I wonder whether the "by" argument of full_join() from the dplyr package can be made a bit more tolerant in its matching. Any help will be appreciated!
Using a fuzzyjoin::regex_left_join you could do:
Note: I renamed the Item column in your mykey dataset to regex to make clear that this is the regex to match by and added a "^" to ensure that we match at the beginning of the Item column in the mydata dataset.
library(fuzzyjoin)
library(dplyr)
mykey <- mykey %>%
rename(regex = Item) %>%
mutate(regex = paste0("^", regex))
mydata %>%
select(-Avg) %>%
regex_left_join(mykey, by = c(Item = "regex")) %>%
select(-regex)
#> # A tibble: 5 × 3
#> Item Answer Avg
#> <chr> <dbl> <dbl>
#> 1 ab_kssv 1 0
#> 2 ab_kd 2 0
#> 3 cde_kh 3 10
#> 4 cde_ksa 4 10
#> 5 cde 5 10
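If you would rather avoid the extra dependency, the same prefix rule can be sketched in base R with startsWith(). This is only an alternative sketch, not what fuzzyjoin does internally; it assumes each Item matches at most one key and takes the first match otherwise:

```r
# Same example data as in the question, as plain data frames
mydata <- data.frame(
  Item   = c("ab_kssv", "ab_kd", "cde_kh", "cde_ksa", "cde"),
  Answer = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)
mykey <- data.frame(
  Item = c("ab", "cde"),
  Avg  = c(0, 10),
  stringsAsFactors = FALSE
)

# For each Item, find the index of the first key that is a prefix of it
# (NA if no key matches)
idx <- vapply(
  mydata$Item,
  function(item) which(startsWith(item, mykey$Item))[1],
  integer(1),
  USE.NAMES = FALSE
)

# Look up Avg by that index; unmatched items get NA
mydata$Avg <- mykey$Avg[idx]
mydata
```

Unlike a regex join, startsWith() compares literal text, so special characters in the keys need no escaping.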
When trying to combine multiple character columns using unite from tidyr, the na.rm = TRUE option does not remove NA.
Step by step:
The original dataset has 5 columns, word1:word5 (image of the original data).
Looking to combine word1:word5 in a single column using code:
data_unite_5 <- data_original_5 %>%
unite("pentawords", word1:word5, sep=" ", na.rm=TRUE, remove=FALSE)
Here's an image of the output: data_unite_5
I've tried using mutate_if(is.factor, as.character) but that did not work.
Any suggestions would be appreciated.
You have misinterpreted how the na.rm argument works for unite. Following the examples on the tidyverse page here, z is the unite of x and y.
With na.rm = FALSE
#> z x y
#> <chr> <chr> <chr>
#> 1 a_b a b
#> 2 a_NA a NA
#> 3 NA_b NA b
#> 4 NA_NA NA NA
With na.rm = TRUE
#> z x y
#> <chr> <chr> <chr>
#> 1 "a_b" a b
#> 2 "a" a NA
#> 3 "b" NA b
#> 4 "" NA NA
Hence na.rm determines how NA values appear in the assembled strings (pentawords); it does not drop rows from the data.
If you were wanting to remove the fourth row of the dataset, I would recommend filter.
data_unite_5 <- data_original_5 %>%
unite("pentawords", word1:word5, sep =" " , na.rm = TRUE, remove = FALSE) %>%
filter(pentawords != "")
This excludes from your output every row where pentawords is an empty string.
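To make the row-wise behaviour concrete, here is a rough base R sketch of what na.rm = TRUE amounts to: within each row the NA values are skipped before pasting, and the row itself stays. It is only an illustration on a hypothetical two-column data frame, not how unite is implemented:

```r
# Hypothetical two-column version of the word1:word5 data
df <- data.frame(
  word1 = c("a", "a", NA, NA),
  word2 = c("b", NA, "b", NA),
  stringsAsFactors = FALSE
)

# Per row: drop the NA entries, then paste what is left
df$pentawords <- apply(
  df[, c("word1", "word2")], 1,
  function(row) paste(row[!is.na(row)], collapse = " ")
)

# All four rows are still there; the all-NA row becomes ""
df

# Dropping the empty result is a separate step
df_nonempty <- df[df$pentawords != "", ]
df_nonempty
```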
Is there a simple and elegant way to left join (with dplyr) a table "b" onto a table "a" when both contain the same column, but the first has NAs where the second has the missing values? Here follows an example:
# Tables A and B
a <- tibble(
"ID" = c(1,2,3),
"x" = c(NA,5, NA)
)
b <- tibble(
"ID" = c(1,3),
"x" = c(7, 4)
)
# Table I want as result
c <- tibble(
"ID" = c(1,2,3),
"x" = c(7,5,4)
)
You could use the coalesce function from the dplyr package to assemble a complete vector from its missing pieces. It is inspired by the SQL COALESCE function.
left_join(a,b, by='ID') %>%
mutate(col = coalesce(x.x, x.y)) %>%
select(ID, col)
# A tibble: 3 x 2
ID col
<dbl> <dbl>
1 1 7
2 2 5
3 3 4
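For comparison, the same fill-in can be sketched in base R without a join, using match() to line up the IDs. This assumes every ID appears at most once in b:

```r
# Same example tables as in the question, as plain data frames
a <- data.frame(ID = c(1, 2, 3), x = c(NA, 5, NA))
b <- data.frame(ID = c(1, 3),    x = c(7, 4))

# Position of each ID from a within b (NA where b has no such ID)
idx <- match(a$ID, b$ID)

# Keep a's value where present, otherwise take the value from b
result <- a
result$x <- ifelse(is.na(a$x), b$x[idx], a$x)
result
```

An ID that is missing in a and absent from b simply stays NA.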
Joining and then removing rows with an NA should do it. If an ID has non-NA values of x in both tables, this code will return 2 rows for that ID, but that is probably the behavior you'd want.
library(dplyr)
full_join(a,b, by = c('ID', 'x')) %>%
na.omit()
# A tibble: 3 x 2
ID x
<dbl> <dbl>
1 2 5
2 1 7
3 3 4
I'd like to recode the values in the Df1 data frame using the Df2 data frame so that I end up with a data frame like Df3.
The current code almost does the trick, but there are two problems. First, it introduces NA when there is no match, e.g. there is no match in Df2 for the Df1 aed_bloodpr value "1,2", so the value becomes NA. Second, when a variable in Df1 can't be mapped to Df2, the code won't run (error message).
I have looked into the nomatch argument for match() and the .default argument for Map(), but I can't figure out how to use them so that I end up with Df3.
Starting point:
Df1 <- data.frame("aed_bloodpr" = c("1,2","2","1","1"),
"aed_gluco" = c("2","1","3","2"),
"add_bmi" = c("2","5,7","7","5"),
"add_asthma" = c("2","2","7","5"),
"nausea" = c("3","3","4","5"))
Df2 <- data.frame("NameOfVariable" = c("aed_bloodpr","aed_bloodpr","aed_gluco","aed_gluco","aed_gluco","add_bmi","add_bmi","add_bmi"),
"VariableLevel" = c(1,2,1,2,3,2,5,7),
"VariableDef" = c("high","normal","elevated","normal","NA","above","normal","below"))
End point:
Df3 <- data.frame("aed_bloodpr" = c("1,2","normal","high","high"),
"aed_gluco" = c("normal","elevated","NA","normal"),
"add_bmi" = c("above","5,7","below","normal"),
"add_asthma"=c("2","2","7","5"),
"nausea" = c("3","3","4","5"))
Current code:
data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])],
Df1,
split(Df2[2:3], Df2[1])[names(Df1)]))
You need to clean up before you can relabel. The actual relabeling is more easily accomplished by a join. Here using the tidyverse (translate as you like):
library(tidyverse)
Df1 <- data.frame("aed_bloodpr" = c("1,2","2","1","1"),
"aed_gluco" = c("2","1","3","2"),
"add_bmi" = c("2","5,7","7","5"),
"add_asthma" = c("2","2","7","5"),
"nausea" = c("3","3","4","5"))
Df2 <- data.frame("NameOfVariable" = c("aed_bloodpr","aed_bloodpr","aed_gluco","aed_gluco","aed_gluco","add_bmi","add_bmi","add_bmi"),
"VariableLevel" = c(1,2,1,2,3,2,5,7),
"VariableDef" = c("high","normal","elevated","normal","NA","above","normal","below"))
Df1_long <- Df1 %>%
mutate_all(as.character) %>% # change factors to strings
rowid_to_column('i') %>% # add row index to enable later long-to-wide reshape
gather(variable, value, -i) %>% # reshape to long form
separate_rows(value, convert = TRUE) # unnest nested values and convert to numeric
str(Df1_long)
#> 'data.frame': 22 obs. of 3 variables:
#> $ i : int 1 1 2 3 4 1 2 3 4 1 ...
#> $ variable: chr "aed_bloodpr" "aed_bloodpr" "aed_bloodpr" "aed_bloodpr" ...
#> $ value : int 1 2 2 1 1 2 1 3 2 2 ...
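Df2_clean <- Df2 %>%
mutate_if(is.factor, as.character) %>% # change factors to strings
mutate(across(where(is.character), ~ na_if(.x, 'NA'))) # change "NA" to NA in character columns only; newer dplyr versions reject na_if() with 'NA' on the numeric column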
Df3 <- Df1_long %>%
left_join(Df2_clean, by = c('variable' = 'NameOfVariable', # merge
'value' = 'VariableLevel')) %>%
mutate(VariableDef = coalesce(VariableDef, as.character(value))) %>% # combine labels and values
group_by(i, variable) %>%
summarise(value = toString(VariableDef)) %>% # re-aggregate multiple values
spread(variable, value) # reshape to wide form
Df3
#> # A tibble: 4 x 6
#> # Groups: i [4]
#> i add_asthma add_bmi aed_bloodpr aed_gluco nausea
#> * <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 2 above high, normal normal 3
#> 2 2 2 normal, below normal elevated 3
#> 3 3 7 below high 3 4
#> 4 4 5 normal high normal 5
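If you want exactly the Df3 from the question (multi-value entries such as "1,2" left untouched, and unmapped variables passed through), a base R sketch along the lines of your Map() attempt could look like this. The helper recode_col is a hypothetical name introduced here for illustration:

```r
Df1 <- data.frame(aed_bloodpr = c("1,2", "2", "1", "1"),
                  aed_gluco   = c("2", "1", "3", "2"),
                  add_bmi     = c("2", "5,7", "7", "5"),
                  add_asthma  = c("2", "2", "7", "5"),
                  nausea      = c("3", "3", "4", "5"),
                  stringsAsFactors = FALSE)
Df2 <- data.frame(NameOfVariable = c("aed_bloodpr", "aed_bloodpr", "aed_gluco", "aed_gluco",
                                     "aed_gluco", "add_bmi", "add_bmi", "add_bmi"),
                  VariableLevel  = c(1, 2, 1, 2, 3, 2, 5, 7),
                  VariableDef    = c("high", "normal", "elevated", "normal",
                                     "NA", "above", "normal", "below"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: recode one column, keeping values that have no match
recode_col <- function(x, name) {
  map <- Df2[Df2$NameOfVariable == name, ]
  if (nrow(map) == 0) return(x)                    # variable absent from Df2: leave as-is
  hit <- match(x, as.character(map$VariableLevel)) # NA where the value is not in the lookup
  ifelse(is.na(hit), x, map$VariableDef[hit])      # fall back to the original value
}

Df3_alt <- as.data.frame(Map(recode_col, Df1, names(Df1)),
                         stringsAsFactors = FALSE)
Df3_alt
```

Here the label "NA" in Df2 stays a plain string, as in the desired Df3; convert it to a real NA first if that is what you need.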