How to remove all rows that contain character - r

I'm trying to remove all rows that contain a ? in any column in a data frame. I have 950 rows by 11 columns.
I've tried this to do it all at once.
dataNew <- data %>% filter_all(all_vars(!grepl("?",.)))
and this to see if I could even get it to work for one column.
dataNew <- data[!grepl('?',data$column),]
Both of these attempts resulted in an empty dataframe. Any help is appreciated, thank you.

We can use fixed = TRUE because ? is a regex metacharacter. Alternatively, with grepl's default of fixed = FALSE, escape it (\\?) or wrap it inside square brackets ([?]).
library(dplyr)
data %>%
filter_all(all_vars(!grepl("?",., fixed = TRUE)))
# col1 col2
#1 1 2
Or using across, which was in the devel version of dplyr at the time of writing and was released in dplyr 1.0.0
data %>%
filter(across(everything(), ~ !grepl("?", ., fixed = TRUE)))
# col1 col2
#1 1 2
Or using base R
data[!Reduce(`|`, lapply(data, grepl, pattern = '?', fixed = TRUE)),]
data
data <- data.frame(col1 = c("?", 1, 3, "?"), col2 = c(1, 2, "?", "?"),
stringsAsFactors = FALSE)
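As a small base R sketch of the three options named above (fixed = TRUE, escaping, and a bracket class), the following checks that they agree on a toy vector and then filters the example data. The object names x, has_qmark and clean are made up for illustration.

```r
# Toy vector; only the first and last elements contain a literal "?".
x <- c("?", "1", "3", "a?b")

fixed_match   <- grepl("?", x, fixed = TRUE)  # treat "?" as plain text
escaped_match <- grepl("\\?", x)              # escape the metacharacter
bracket_match <- grepl("[?]", x)              # put it in a character class

# Same data as in the answer above.
data <- data.frame(col1 = c("?", 1, 3, "?"), col2 = c(1, 2, "?", "?"),
                   stringsAsFactors = FALSE)

# TRUE for every row that has a "?" in at least one column.
has_qmark <- Reduce(`|`, lapply(data, grepl, pattern = "?", fixed = TRUE))
clean <- data[!has_qmark, ]
```

All three calls should flag the same elements, so which one to use is mostly a matter of readability; fixed = TRUE avoids regex interpretation altogether.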

Related

How to make the column names of a data-frame variables

I can construct a data.frame like this -
data.frame('a_1' = 3)
However, I want the column name a_1 to be built from a variable. So I tried this -
data.frame(get(paste("a", 1, sep = "_")) = 3)
With this I get below error -
Error: unexpected '=' in "data.frame(get(paste("a", 1, sep = "_")) ="
Can you please help me to understand the right approach to make the colnames as variable?
Thanks for your pointer.
We can use tibble with := to do this
library(stringr)
library(tibble)
tibble(!! str_c("a", "_", 1) := 3)
-output
# A tibble: 1 x 1
a_1
<dbl>
1 3
In base R, this can be done using setNames
df1 <- setNames(data.frame(3), paste0("a", "_", 1))
-output
df1
a_1
1 3
Or if it is only for a specific number of columns, create the dataset, and use names
df1 <- data.frame(3)
names(df1)[1] <- paste0("a_", 1)
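A short sketch of the setNames approach with the name held in a variable, assuming base R only. The names col_name, col_names and df2 are illustrative and not part of the original question.

```r
# Build the column name at run time (the value here is only an example).
col_name <- paste("a", 1, sep = "_")

# One column: create the data frame unnamed, then attach the name.
df1 <- setNames(data.frame(3), col_name)

# Several columns: name the list first, then convert it to a data frame.
col_names <- paste0("a_", 1:2)
df2 <- as.data.frame(setNames(list(3, "x"), col_names))
```

The original attempt fails at parse time because the left-hand side of = in a function call has to be a plain name or string, not an expression such as get(...); assigning the names afterwards sidesteps that restriction.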

How to make a column out of a column name its value?

I have below data
df<- data_frame(State= c('CA', 'IN', 'CHI'),
Age= c(46,29,32),
Status= c('Employed', '', 'Employed')
)
In the end, I want to create data that looks like this:
df<- data_frame(col1= c('State-CA', 'State-IN', 'State-CHI'),
col2= c('Age-46','Age-29','Age-32'),
col3= c('Status-Employed', '', 'Status-Employed')
)
That is, each column's name should be joined to its value with a dash. If a value is missing, the cell should stay empty instead of getting the column name attached. Could anyone help? Thanks in advance!
With imap, it is a single step. Because a data.frame is a named list of equal-length columns, imap loops over that list and calls the anonymous function (~) with the column values as .x and the column name as .y; str_c then pastes them together. case_when comes from dplyr, so that package is loaded as well.
library(dplyr)
library(purrr)
library(stringr)
imap_dfc(df, ~ case_when(.x ==""|is.na(.x) ~ as.character(.x), TRUE ~ str_c(.y, .x, sep='-')))
# A tibble: 3 x 3
# State Age Status
# <chr> <chr> <chr>
#1 State-CA Age-46 Status-Employed
#2 State-IN Age-29 ""
#3 State-CHI Age-32 Status-Employed
In base R
# x is the column, y is its name; paste the name first to get "State-CA"
df[] <- Map(function(x, y) ifelse(x=="", x, paste(y, x, sep="-")),df, names(df))
I think what you are looking for has been answered on this thread - Insert Column Name into its Value using R. Hope you find this helpful!
Also, this code should work for you -
col_names <- names(df)
for (col in col_names) {
df[[col]] <- ifelse(df[[col]] != "", paste(col, df[[col]], sep = "-"), "")
}
df
Output -
State Age Status
1 State-CA Age-46 Status-Employed
2 State-IN Age-29
3 State-CHI Age-32 Status-Employed
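The answers above handle empty strings; a base R sketch that also leaves NA untouched could look like the following. The helper name prefix_with_name and the NA in the sample data are assumptions added for illustration.

```r
# Sample data modelled on the question, with one NA added for illustration.
df <- data.frame(State = c("CA", "IN", "CHI"),
                 Age = c(46, 29, 32),
                 Status = c("Employed", "", NA),
                 stringsAsFactors = FALSE)

# Hypothetical helper: prepend the column name unless the cell is "" or NA.
prefix_with_name <- function(values, name) {
  values <- as.character(values)
  blank <- is.na(values) | values == ""
  ifelse(blank, values, paste(name, values, sep = "-"))
}

out <- df
# Map pairs each column with its name; out[] keeps the data frame structure.
out[] <- Map(prefix_with_name, df, names(df))
```

Converting to character first keeps the numeric Age column from being compared with "" as a number, and checking is.na before the comparison keeps missing values as NA in the result.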

flag specific pattern using string r

I am working with a data set where I need to flag all specific codes that start with "C13.xxx". There are other tree codes in the column, and all tree codes are separated as follows: "C13.xxx|B12.xxx". All tree codes have a period in them. But the data set has other values that cause my stringr function to flag strings that are not tree codes. Example:
library(tidyverse)
# test data
test <- tribble(
~id, ~treecode, ~contains_c13_xxx,
#--|--|----
1, "B12.123|C13.234.432|A11.123", "yes",
2, "C12.123|C13039|", "no"
)
# what I tried
test %>% mutate(contains_C13_error = ifelse(str_detect(treecode, "C13."), 1, 0))
# code above is flagging both id's as containing C13.xxx
In id 2, there is a value that begins with C13, but it is not a tree code (all tree codes have a period). The contains_c13_xxx variable is what I would like the code to produce. I specified the period in str_detect, so I'm not sure what is going wrong here.
The tricky part is there are multiple tree codes in the same column with a separator which makes it difficult to flag. We can bring each treecode into separate rows and then check for the code that we need. Using separate_rows from tidyr.
library(dplyr)
test %>%
tidyr::separate_rows(treecode, sep = "\\|") %>%
group_by(id) %>%
summarise(contains_C13_error = any(startsWith(treecode, "C13.")),
treecode = paste(treecode, collapse = "|"))
# A tibble: 2 x 3
# id contains_C13_error treecode
# <dbl> <lgl> <chr>
#1 1 TRUE B12.123|C13.234.432|A11.123
#2 2 FALSE C12.123|C13039|
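This assumes that there could be codes of the pattern "C13" without a dot. In the original attempt, the unescaped . in "C13." is a regex wildcard that matches any character, which is why "C13039" was flagged. If the tree code always has "C13" followed by a dot, then simply escaping the dot in your regex ("C13\\.") would work.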
Base R solution:
# Split on the | delim:
split_treecode <- strsplit(df$treecode, "[|]")
# Roll out the ids the number of times of each relevant treecode:
rolled_out_df <- data.frame(id = rep(df$id, sapply(split_treecode, length)), tc = unlist(split_treecode))
# Test whether or not each code contains the literal string "C13."
rolled_out_df$contains_c13_xxx <- grepl("C13.", rolled_out_df$tc, fixed = TRUE)
# Does the id have any element containing "C13."?
rolled_out_df$contains_c13_xxx <- ifelse(ave(rolled_out_df$contains_c13_xxx,
rolled_out_df$id,
FUN = function(x){any(x)}), "yes", "no")
# Build back the original df:
df <- merge(df[,c("id", "treecode")], unique(rolled_out_df[,c("id", "contains_c13_xxx")]), by = "id")
Data:
df <-
structure(
list(
id = c(1, 2),
treecode = c("B12.123|C13.234.432|A11.123",
"C12.123|C13039|"),
contains_c13_xxx = c("yes", "no")
),
row.names = c(NA,-2L),
class = "data.frame"
)
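For comparison, here is a sketch that skips the splitting step and relies on the regex alone, assuming a tree code of interest always begins either at the start of the string or right after a | separator. The pattern and the object names are illustrative, not taken from the answers above.

```r
treecode <- c("B12.123|C13.234.432|A11.123", "C12.123|C13039|")

# Unescaped dot: "." matches any character, so "C130" in "C13039" also matches.
loose <- grepl("C13.", treecode)

# Anchor at the string start or a "|" and require a literal dot after C13.
strict <- grepl("(^|[|])C13[.]", treecode)

contains_c13_xxx <- ifelse(strict, "yes", "no")
```

Writing the dot as [.] (or \\.) is what makes it literal; the (^|[|]) part guards against a hypothetical code such as "XC13.1" being flagged.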

How to remove missing values (NA) when uniting columns?

I am trying to unite 5 columns into one new column using the unite function. However, all rows contain lots of NA values, creating values that look like
Mother|NA|NA|NA|NA
NA|NA|Father|Mother|NA
Mother|Father|NA|Stepmother|NA
I've tried to unite them using this code:
df2 <- df %>%
unite(Parent_full, Parent:Parent5, sep = "|", remove = TRUE, na.rm = TRUE)
But that gives me the following error:
Error: TRUE must evaluate to column positions or names, not a logical vector
I've also looked on the forum and found that possibly the na.rm argument of unite is not active?
Here is some data to recreate my dataset
Name <- c('Paul', 'Edward', 'Mary')
Postalcode <- c('4732', '9045', '3476')
Parent <- c('Mother', 'NA', 'Mother')
Parent2 <- c('NA', 'NA', 'Father')
Parent3 <- c('NA', 'Father', 'NA')
Parent4 <- c('NA', 'Mother', 'Stepmother')
Parent5 <- c('NA', 'NA', 'NA')
df <- data.frame(Name, Postalcode, Parent, Parent2, Parent3, Parent4, Parent5)
Would love to know how to unite my columns without NA's.
UPDATE:
I've now updated the tidyr package and I added "na = c("", "NA")" to my read_csv command.
Now the
df2 <- df %>%
unite(Parent_full, Parent:Parent5, sep = "|", remove = TRUE, na.rm = TRUE)
command works; however, for some reason the NA at the end of the value stays. Now my columns look like this:
Mother|NA
Father|Mother|NA
Mother|Father|Stepmother|NA
Does anyone know what went wrong now?
You have a couple of problems:
1) The NAs are not real NAs; they are the string "NA" (check with is.na(df$Parent2)).
2) Your columns are factors.
When constructing the data frame, use stringsAsFactors = FALSE
df <- data.frame(Name, Postalcode, Parent, Parent2, Parent3, Parent4,
Parent5, stringsAsFactors = FALSE)
and then replace NA and use unite
library(dplyr)
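df %>%
# na_if() on a whole data frame no longer works in dplyr >= 1.1.0, so apply it per column
mutate(across(Parent:Parent5, ~ na_if(.x, 'NA'))) %>%
tidyr::unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE)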
# Name Postalcode Parent_full
#1 Paul 4732 Mother
#2 Edward 9045 Father|Mother
#3 Mary 3476 Mother|Father|Stepmother
If the data is already loaded, we can change them by using mutate_if
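df %>%
mutate_if(is.factor, as.character) %>%
# na_if() on a whole data frame no longer works in dplyr >= 1.1.0, so apply it per column
mutate(across(Parent:Parent5, ~ na_if(.x, 'NA'))) %>%
tidyr::unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE)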
Your main problem here is that you haven't updated to tidyr 1.0 yet. That error message is the best that the previous version can do with the input na.rm = TRUE, since that argument didn't exist before. It thinks you're giving it a named argument as part of the ....
Specifically, just run install.packages("tidyr") and it should work. You might need to restart R first, so tidyr isn't currently loaded.
If your missing values are "NA" strings, then, as Ronak pointed out, you need to use na_if() on them first. It's strange to me because your initial code chunk makes it look like those are proper NAs, due to the red highlighting. But then your reprex code has 'NA' values which would definitely be strings. Anyway, you say you're reading in from CSV, so, it would be cleaner and quicker to run the CSV-reading code so as to read NAs in properly with an na argument or the like.
Response to Edit: That does seem like a bug, that NAs at the end of the united string don't get properly removed. Well, anyway, the fix is easy, and probably better than anything else we could do:
df2 <- df %>%
unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE) %>%
mutate_at("Parent_full", . %>%
str_remove("(^|\\|)NA$") %>%
na_if(""))
This ensures two things: 1) that the letters "NA" at the end of a string are only removed if they're there because of the unite(), with a pipe (if anything) in front of them; and 2) if there's no non-missing values on a line here, then the value will be a proper NA rather than "NA", "", or what have you, which I assume is what you want.
Update: I've found that the bug applies to any column that contains nothing but NAs, i.e. na.rm = TRUE only removes NAs from columns that have at least one non-missing value. I've filed a bug report: https://github.com/tidyverse/tidyr/issues/765
Given this, though, the optimal solution is probably just to remove any columns that are all NA beforehand. If this is production code, though, then that gets real tricky, since you have to specify the unite() so as to not break if any or even all of the columns to be united are dropped by that prior step.
Update 2: As a response to the bug report pointed out, the issue is actually that that all-missing column is logicals. So that makes the optimal solution: read in such columns as character, or coerce them to character before uniting. Full reprex for that:
library(tidyverse)
Name <- c('Paul', 'Edward', 'Mary')
Postalcode <- c('4732', '9045', '3476')
Parent <- c('Mother', NA, 'Mother')
Parent2 <- c(NA, NA, 'Father')
Parent3 <- c(NA, 'Father', NA)
Parent4 <- c(NA, 'Mother', 'Stepmother')
Parent5 <- c(NA, NA, NA)
(df <- data.frame(Name, Postalcode, Parent, Parent2, Parent3, Parent4, Parent5))
#> Name Postalcode Parent Parent2 Parent3 Parent4 Parent5
#> 1 Paul 4732 Mother <NA> <NA> <NA> NA
#> 2 Edward 9045 <NA> <NA> Father Mother NA
#> 3 Mary 3476 Mother Father <NA> Stepmother NA
(df2 <- df %>%
mutate_at(vars(Parent:Parent5), as.character) %>%
unite(Parent_full, Parent:Parent5, sep = "|", na.rm = TRUE))
#> Name Postalcode Parent_full
#> 1 Paul 4732 Mother
#> 2 Edward 9045 Father|Mother
#> 3 Mary 3476 Mother|Father|Stepmother
Created on 2019-09-27 by the reprex package (v0.3.0)
unite() (and na.rm = TRUE) only works for character columns (as far as I can tell). This isn't made clear in the help docs.
For factors, it also returns the integer code rather than the factor level - something to watch out for.
Numeric: Doesn't remove NAs:
df <- data.frame("to.combine1" = c(NA, 1, 3),
"to.combine2" = c(2, NA, 3))
sapply(df, class) #not functional, just illustrative
#> to.combine1 to.combine2
#> "numeric" "numeric"
unite(df, "combined", to.combine1:to.combine2, sep="_", na.rm = TRUE)
#> combined
#> 1 NA_2
#> 2 1_NA
#> 3 3_3
Factor: Doesn't remove NAs and uses integer code rather than level:
df <- data.frame("to.combine1" = as.character(c(NA, 1, "a")),
"to.combine2" = as.character(c(2, NA, "a")),
stringsAsFactors = TRUE)
sapply(df, class) #not functional, just illustrative
#> to.combine1 to.combine2
#> "factor" "factor"
unite(df, "combined", to.combine1:to.combine2, sep="_", na.rm = TRUE)
#> combined
#>1 NA_1
#>2 1_NA
#>3 2_2
Character: Expected behaviour
df <- data.frame("to.combine1" = as.character(c(NA, 1, "a")),
"to.combine2" = as.character(c(2, NA, "a")),
stringsAsFactors = FALSE)
sapply(df, class) #not functional, just illustrative
#>to.combine1 to.combine2
#>"character" "character"
unite(df, "combined", to.combine1:to.combine2, sep="_", na.rm = TRUE)
#> combined
#> 1 2
#> 2 1
#> 3 a_a
You can remove the NAs later with something like this
df %>%
unite(Parent_full, Parent:Parent5, sep = "|", remove = TRUE) %>%
mutate(Parent_full = gsub("(?<![a-zA-Z])NA\\||\\|NA(?![a-zA-Z])|\\|NA$", '', Parent_full, perl = T))
Name Postalcode Parent_full
1 Paul 4732 Mother
2 Edward 9045 Father|Mother
3 Mary 3476 Mother|Father|Stepmother
It replaces NA| not preceded by a letter or |NA not followed by a letter or |NA at the end of the string, with an empty string
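If tidyr is not available, or its version is in doubt, a base R sketch of the same idea is to paste each row after dropping its NA entries. This assumes the NAs are real missing values rather than the string "NA"; the names parent_cols and Parent_full follow the question, and the rest is illustrative.

```r
Name <- c("Paul", "Edward", "Mary")
Parent  <- c("Mother", NA, "Mother")
Parent2 <- c(NA, NA, "Father")
Parent3 <- c(NA, "Father", NA)
Parent4 <- c(NA, "Mother", "Stepmother")
Parent5 <- c(NA, NA, NA)  # all-NA column, stored as logical
df <- data.frame(Name, Parent, Parent2, Parent3, Parent4, Parent5,
                 stringsAsFactors = FALSE)

parent_cols <- c("Parent", "Parent2", "Parent3", "Parent4", "Parent5")

# For each row, keep the non-missing entries and join them with "|".
parent_full <- apply(df[parent_cols], 1,
                     function(row) paste(row[!is.na(row)], collapse = "|"))

# Drop the original columns and attach the united one.
df2 <- cbind(df[setdiff(names(df), parent_cols)],
             Parent_full = unname(parent_full))
```

A row with no non-missing entries would come out as an empty string here; whether that should become NA is a separate decision.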

Concatenate columns and add them to beginning of Data Frame

Noob here to R. Trying to figure something out. I need to build a function that adds a new column to the beginning of a dataset. This new column is a concatenation of the values in other columns that the user specifies.
Imagine this is the data set named myDataSet:
col_1 col_2 col_3 col_4
bat red 1 a
cow orange 2 b
dog green 3 c
The user could use the function like so:
addPrimaryKey(myDataSet, cols=c(1,3,4))
to get the result of a new data set with columns 1, 3 and 4 concatenated into a column called ID and added to the beginning, like so:
ID col_1 col_2 col_3 col_4
bat1a bat red 1 a
cow2b cow orange 2 b
dog3c dog green 3 c
This is the script I have been working on but I have been staring at it so long, I think I have made a few mistakes. I can't figure out how to get the column numbers from the arguments into the paste function properly.
addPrimaryKey <- function(df, cols=NULL){
newVector = rep(NA, length(cols)) ##initialize vector to length of columns
colsN <- as.numeric(cols)
df <- cbind(ID=paste(
for(i in 1:length(colsN)){
holder <- df[colsN[i]]
holder
}
, sep=""), df) ##concatenate the selected columns and add as ID column to df
df
}
Any help would be greatly appreciated. Thanks so much
paste0 works fine, with some help from do.call:
do.call(paste0, mydf[c(1, 3, 4)])
# [1] "bat1a" "cow2b" "dog3c"
Your function, thus, can be something like:
addPrimaryKey <- function(inDF, cols) {
cbind(ID = do.call(paste0, inDF[cols]),
inDF)
}
You may also want to look at interaction:
interaction(mydf[c(1, 3, 4)], drop=TRUE)
# [1] bat.1.a cow.2.b dog.3.c
# Levels: bat.1.a cow.2.b dog.3.c
This should do the trick
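addPrimaryKey <- function(df, cols){
# drop = FALSE keeps a data frame even when a single column is selected
ID <- apply(df[, cols, drop = FALSE], 1, function(x) paste(x, collapse=""))
# naming the vector ID gives the new column the requested name
df <- cbind(ID, df)
return(df)
}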
Just add in some conditional logic for your nulls
Two other options for combining columns are dplyr::mutate() and tidyr::unite():
library(dplyr)
df %>%
mutate(new_col = paste0(col_1, col_3, col_4)) %>%
select(new_col, everything()) # to order the column names with the new column first
library(tidyr)
df %>%
unite(new_col, c(col_1, col_3, col_4), sep = '', remove = FALSE)
The default in tidyr::unite() is remove = TRUE, which drops the input columns from the data frame and keeps only the new column in their place.
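Putting the do.call(paste0, ...) idea from the first answer into a complete, runnable form, a base R sketch might look like this. The input checks and the use of unname are additions made for illustration, not part of the original answers.

```r
myDataSet <- data.frame(col_1 = c("bat", "cow", "dog"),
                        col_2 = c("red", "orange", "green"),
                        col_3 = 1:3,
                        col_4 = c("a", "b", "c"),
                        stringsAsFactors = FALSE)

addPrimaryKey <- function(df, cols) {
  stopifnot(is.data.frame(df), length(cols) > 0)
  # unname() keeps a column that happens to be called "collapse"
  # from being read as a paste0 argument.
  key <- do.call(paste0, unname(as.list(df[cols])))
  # Put the key first; check.names = FALSE leaves existing names untouched.
  data.frame(ID = key, df, stringsAsFactors = FALSE, check.names = FALSE)
}

keyed <- addPrimaryKey(myDataSet, cols = c(1, 3, 4))
```

Because cols is passed straight to df[cols], it can hold column positions or column names.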

Resources