Question:
How can you use R to remove all special characters from a dataframe, quickly and efficiently?
Progress:
This SO post details how to remove special characters. I can apply the gsub function to single columns (images 1 and 2), but not the entire dataframe.
Problem:
My dataframe consists of 100+ columns of integers, strings, etc. When I try to run gsub on the whole dataframe, it doesn't return the output I want. Instead, I get what's shown in image 3.
df <- read.csv("C:/test.csv")
dfa <- gsub("[[:punct:]]", "", df$a) #this works on a single column
dfb <- gsub("[[:punct:]]", "", df$b) #this works on a single column
df_all <- gsub("[[:punct:]]", "", df) #this does not work on the entire df
View(df_all)
df - This is the original dataframe.
dfa - This is gsub applied to column a. Good!
df_all - This is gsub applied to the entire dataframe. Bad!
Summary:
Is there a way to gsub an entire dataframe? If not, should an apply function be used instead?
Here is a possible solution using dplyr:
# Example data
bla <- data.frame(a = c(1, 2, 3),
                  b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
                  c = c("%#%$#^#", "%#$#%#", ",.,gdgd$%,."))
# Use mutate_all from dplyr
bla %>%
  mutate_all(funs(gsub("[[:punct:]]", "", .)))
  a           b    c
1 1        fefa
2 2         fes
3 3 gDEEwfseges gdgd
Update:
mutate_all has been superseded, and funs is deprecated as of dplyr 0.8.0. Here is an updated solution using mutate and across:
# Example data
df <- data.frame(a = c(1, 2, 3),
                 b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
                 c = c("%#%$#^#", "%#$#%#", ",.,gdgd$%,."))
# Use mutate() and across() from dplyr
df %>%
  mutate(across(everything(), ~gsub("[[:punct:]]", "", .x)))
Another solution is to convert the data frame to a matrix first then run the gsub and then convert back to a data frame as follows:
as.data.frame(gsub("[[:punct:]]", "", as.matrix(df)))
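One caveat with the matrix round trip: as.matrix() coerces every column to character, so numeric columns come back as character. A minimal sketch, assuming the example df above, that restores the column types afterwards with type.convert():
df_clean <- as.data.frame(gsub("[[:punct:]]", "", as.matrix(df)),
                          stringsAsFactors = FALSE)
# restore numeric columns that the matrix round trip turned into character
df_clean[] <- lapply(df_clean, type.convert, as.is = TRUE)
str(df_clean)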
I like Ryan's answer using dplyr. As mutate_all and funs are now deprecated, here is my suggested updated solution using mutate and across:
# Example data
df <- data.frame(a = c(1, 2, 3),
                 b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
                 c = c("%#%$#^#", "%#$#%#", ",.,gdgd$%,."))
# Use across() from dplyr
df %>%
  mutate(across(everything(), ~gsub("[[:punct:]]", "", .x)))
  a           b    c
1 1        fefa
2 2         fes
3 3 gDEEwfseges gdgd
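To answer the apply part of the original question: in base R you can also lapply the gsub over the columns and assign the result back into the data frame. A minimal sketch (note that any non-character columns come back as character):
df[] <- lapply(df, function(x) gsub("[[:punct:]]", "", x))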
Related
This is my current data set
I want to take the numbers after "narrow" (e.g. 20) and make another vector. Any idea how I can do that?
We can use sub to remove the substring "Narrow" followed by a comma and zero or more spaces (\\s*), replace it with blank ("") and convert the remainder to numeric:
df1$New <- as.numeric(sub("Narrow,\\s*", "", df1$Stimulus))
You could use separate to separate the stimulus column into two vectors.
library(tidyr)
df %>%
  separate(col = stimulus,
           sep = ", ",
           into = c("Text", "Number"))
Maybe you can try the code below, using regmatches
df$new <- with(df, as.numeric(unlist(regmatches(stimulus,gregexpr("\\d+",stimulus)))))
You want separate from the tidyr package.
library(tidyr)
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate(x, c("A", "B"))
#>      A    B
#> 1 <NA> <NA>
#> 2    a    b
#> 3    a    d
#> 4    b    c
I'm using filter on my dataset to select certain values from a column:
%>%
  filter(col1 %in% c("value1", "value2"))
However, I don't understand how to filter values in a column by pattern without writing them out fully. For example, I also want all values which start with "value3" ("value33", "value34", ...) along with "value1" and "value2". Can I add grepl to that vector?
You can use regular expressions to do that:
library(stringr)
df %>%
  filter(str_detect(col1, '^value[1-3]'))
If you want to use another tidyverse package to help, you can use str_starts from stringr to find strings that start with a certain value
dd %>% filter(stringr::str_starts(col1, "value"))
Here are few options in base R :
Using grepl :
subset(df, grepl('^value', b))
#  a        b
#1 1   value3
#3 3 value123
#4 4  value12
A similar option is grep, which returns the indices of the matches:
df[grep('^value', df$b),]
However, a faster option would be to use startsWith
subset(df, startsWith(b, "value"))
All of this would select rows where column b starts with "value".
data
df <- data.frame(a = 1:5, b = c('value3', 'abcd', 'value123', 'value12', 'def'),
stringsAsFactors = FALSE)
Using the mtcars dataframe, how can I get a new dataframe that keeps only the rows containing the string "3"?
So far I have:
mtcars<-lapply(mtcars, function(x) as.character(x))
myindices<-sapply(mtcars, function(x) { grep("3",x, ignore.case = TRUE) })
This gives me a list of indices. How do I just get a filtered dataframe from the original?
Feel free to criticise my approach; it is the end result that I am really interested in.
We can use filter_all from dplyr. This returns a dataframe with rows that have at least one column containing the string "3":
library(dplyr)
mtcars %>%
  filter_all(any_vars(grepl("3", .)))
If we want a dataframe with rows where all columns contain the string "3", we use all_vars instead of any_vars:
mtcars %>%
  filter_all(all_vars(grepl("3", .)))
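Note that filter_all(), any_vars() and all_vars() are superseded in current dplyr. A minimal sketch of the same idea using if_any()/if_all() (available in dplyr 1.0.4 and later):
library(dplyr)
# rows where at least one column contains "3"
mtcars %>% filter(if_any(everything(), ~ grepl("3", .x)))
# rows where every column contains "3"
mtcars %>% filter(if_all(everything(), ~ grepl("3", .x)))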
We can use grepl with Reduce from base R:
out <- mtcars[Reduce(`|`, lapply(mtcars, grepl, pattern = "3")),]
dim(out)
#[1] 31 11
Similar to your sapply solution:
mtcars[sapply(1:nrow(mtcars), function(i) any(grepl("3", mtcars[i,], fixed = T))),]
Or, you could do this as well:
mtcars[grepl("3", do.call(paste0, mtcars), fixed = T),]
Another base R solution:
mtcars[apply(mtcars,1,function(x) grepl("3",paste(x,collapse=""))),]
We may use toString to collapse each row into a single string and then grep on it:
mtcars.3 <- mtcars[grep("3", apply(mtcars, 1, toString)), ]
Check:
rbind(mtcars=dim(mtcars), mtcars.3=dim(mtcars.3))
         [,1] [,2]
mtcars     32   11
mtcars.3   31   11
I have a data frame with 300 columns, which has a string variable somewhere that I am trying to remove. I have found this solution on Stack Overflow using lapply (see below), which is what I want to do, but using the dplyr package instead. I have tried the mutate_each function but can't seem to make it work.
"If your data frame (df) is really all integers except for NAs and garbage then then the following converts it.
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
You'll have a warning about NAs introduced by coercion but that's just all those non numeric character strings turning into NAs.
dplyr 0.5 now includes a select_if() function.
For example:
person <- c("jim", "john", "harry")
df <- data.frame(matrix(c(1:9,NA,11,12), nrow=3), person)
library(dplyr)
df %>% select_if(is.numeric)
#  X1 X2 X3 X4
#1  1  4  7 NA
#2  2  5  8 11
#3  3  6  9 12
Of course you could add further conditions if necessary.
If you want to use this line of code:
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
with dplyr (by which I assume you mean "using pipes") the easiest would be
df2 = df %>% lapply(function(x) as.numeric(as.character(x))) %>%
  as.data.frame
To "translate" this into the mutate_each idiom:
mutate_each(df, funs(as.numeric(as.character(.))))
This function will, of course, convert all columns to character, then to numeric. To improve efficiency, don't bother doing two conversions on columns that are already numeric:
mutate_each(df, funs({
  if (is.numeric(.)) return(.)
  as.numeric(as.character(.))
}))
Data for testing:
df = data.frame(v1 = 1:10, v2 = factor(11:20))
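In current dplyr, where mutate_each() and funs() are deprecated, a minimal sketch of the same conditional conversion using mutate() with across() and where() (assuming the test data above):
library(dplyr)
df %>%
  mutate(across(where(~ !is.numeric(.x)),
                ~ as.numeric(as.character(.x))))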
mutate_all works here; simply wrap the gsub in a function. (I also assume you aren't necessarily hunting for strings so much as trawling for non-integers.)
StrScrub <- function(x) {
  # anything that is not an integer-like string ends up as NA after the coercion
  as.integer(gsub("^\\D+$", NA, x))
}
ScrubbedDF <- mutate_all(data, funs(StrScrub))
Example dataframe:
library(dplyr)
options(stringsAsFactors = F)
data = data.frame("A" = c(2:5),"B" = c(5,"gr",3:2), "C" = c("h", 9, "j", "1"))
with reference/help from Tony Ladson
I'm new to R. In a data frame, I wanted to create a new column #21 that is equal to the sum of columns #1 to #20, row by row.
I knew I could do
df$Col21<-df$Col1+df$Col2+.....+df$Col20
But is there a more concise expression?
Also, can I achieve this if using column names not numbers? Thanks!
There is rowSums:
df$Col21 = rowSums(df[,1:20])
should do the trick, and with names:
df$Col21 = rowSums(df[,paste("Col", 1:20, sep="")])
With leading zeros and 3 digits, try:
df$Col21 = rowSums(df[, sprintf("Col%03d", 1:20)])
I find the dplyr functions for column selection very intuitive, like starts_with(), ends_with(), contains(), matches() and num_range():
df <- as.data.frame(replicate(20, runif(10)))
names(df) <- paste0("Col", 1:20)
library(dplyr)
# e.g.
summarise_each(df, funs(sum), starts_with("Col"))
# or
rowSums(select(df, contains("8")))
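Since summarise_each() is deprecated in current dplyr, a minimal sketch of the original Col21 row sum using mutate() with rowSums() over the selected columns (assuming the Col1:Col20 names above):
library(dplyr)
df %>%
  mutate(Col21 = rowSums(across(starts_with("Col"))))
# in dplyr >= 1.1.0, pick() is the preferred selector here:
# df %>% mutate(Col21 = rowSums(pick(starts_with("Col"))))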