How to use regex over entire dataframe in R

New user to R, so please go easy on me.
I have a dataframe like:
df = data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                Confidence = c("ZLow", "High", "Med"),
                Coverage = c("sub", "sub", "super"),
                Aspect = c("ZPos", "ZUnd", "Neg"))
The actual file is much larger and was output from old hardware. For some reason some entries have a "Z" put in front of them. How do I remove it from the entire dataset?
I tried df = gsub("Z", " ", df) but it just gives me nonsense. This darn thing!
[1] "1:3" "c(3, 1, 2)" "c(1, 1, 2)" "c(2, 3, 1)"
Looked on here at stackoverflow and tried stringr package but could also not get to work. Anyone know what to do?

Your approach with gsub() is not working because that function operates on vectors, not dataframes. However, you can apply gsub() over each column of your dataframe to get what you want:
df[] <- lapply(df, function (x) {gsub("Z", "", x)})
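To see why the original call misbehaved, note that gsub() on a single column (which is a vector) works as expected; a quick check with the sample data:
gsub("Z", "", df$Mineral)
# [1] "feldspar" "granite"  "Silica"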
For a stringr solution (that also uses dplyr), try:
library(tidyverse)
df <- mutate_all(df, ~ str_replace_all(., "Z", ""))
P.S. I recommend using df <- instead of df = in the future. Good luck!
EDIT: corrected typo - thanks @thelatemail

You may use a simple ^Z regex in the following way:
df = data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                Confidence = c("ZLow", "High", "Med"),
                Coverage = c("sub", "sub", "super"),
                Aspect = c("ZPos", "ZUnd", "Neg"))
df[] <- lapply(df, sub, pattern = '^Z', replacement ="")
> df
   Mineral Confidence Coverage Aspect
1 feldspar        Low      sub    Pos
2  granite       High      sub    Und
3   Silica        Med    super    Neg
The ^Z pattern matches the start of the string with the ^ anchor, and then Z is matched and removed using sub (as there is only one possible match in each string, there is no point using gsub).
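A quick illustration of the sub()/gsub() difference:
sub("Z", "", "ZapZone")   # "apZone" (first match only)
gsub("Z", "", "ZapZone")  # "apone"  (every match)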

You are close. If you want to go with base gsub:
data$Mineral = gsub("Z", "", data$Mineral)
You can do this for all columns. Or use a combination of apply strategies (see other answers!)
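For example, a plain for loop over every column does the same job (a minimal sketch reusing the same gsub() call; note that factor columns are coerced to character):
for (col in names(data)) {
  data[[col]] <- gsub("Z", "", data[[col]])
}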
P.S. Naming your data "data" is not a good idea; at least use my_data.

You could do:
as.data.frame(sapply(data, function(x) {gsub("Z", "", x)}))

You asked how to do it with the stringr (or stringi) package, to avoid the unwanted vector of indices you got:
> as.data.frame(apply(df, 2,
function(col) stringr::str_replace_all(col, '^Z', '')))
> as.data.frame(apply(df, 2,
function(col) stringi::stri_replace_first_regex(col, '^Z', '')))
   Mineral Confidence Coverage Aspect
1 feldspar        Low      sub    Pos
2  granite       High      sub    Und
3   Silica        Med    super    Neg
(where the as.data.frame() call is needed to turn the output matrix back into a data frame; see the Stack Overflow question "R: apply-like function that returns a data frame?")
As to figuring out how exactly to call a str*_replace function over an entire dataframe, I tried...
the entire df: stri_replace_first_regex(df, '^Z', '')
by rows: stri_replace_first_regex(df[1,], '^Z', '')
by columns: stri_replace_first_regex(df[,1], '^Z', '')
Only the last one works properly. Arguably a design flaw in the str*_replace functions: they should at minimum recognize an invalid input object and produce a useful error message, instead of spewing out indices.

Related

Resolving a formatter string

Suppose I have the following:
format.string <- "#AB#-#BC#/#DF#" #wanted to use $ but it is problematic
value.list <- c(AB="a", BC="bcd", DF="def")
I would like to apply the value.list to the format.string so that each named value is substituted. So in this example I should end up with the string: a-bcd/def
I tried to do it like the following:
resolved.string <- lapply(names(value.list),
                          function(x) {
                            sub(x = save.data.path.pattern,
                                pattern = paste0(c("#", x, "#"), collapse = ""),
                                replacement = value.list[x]) })
But it doesn't seem to be working correctly. Where am I going wrong?
The glue package is designed for this. You can change the opening and closing delimiters using .open and .close, but they have to be different. Also note that value.list has to be either a list or a dataframe:
library(glue)
format.string <- "{AB}-{BC}/{DF}"
value.list <- list(AB="a", BC="bcd", DF="def")
glue_data(value.list, format.string)
# a-bcd/def
To answer your actual question, by using lapply over names(value.list) you, as your output shows, take each of the elements of value.list and perform the replacement. However, all this happens independently, i.e., the replacements aren't ultimately combined to a single result.
As to make something very similar to your approach work, we can use Reduce which does exactly this combining:
Reduce(function(x, y) sub(paste0(c("#", y, "#"), collapse = ""), value.list[y], x),
init = format.string, names(value.list))
# [1] "a-bcd/def"
If we call the anonymous function f, then the result is
f(f(f(format.string, "AB"), "BC"), "DF")
exactly as you intended, I believe.
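Spelled out step by step, each call substitutes one placeholder:
f(format.string, "AB")                   # "a-#BC#/#DF#"
f(f(format.string, "AB"), "BC")          # "a-bcd/#DF#"
f(f(f(format.string, "AB"), "BC"), "DF") # "a-bcd/def"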
We can use gsubfn, which can take a list of key/value pairs as the replacement, substituting each matched key with its 'value':
library(gsubfn)
gsub("#", "", gsubfn("[^#]+", as.list(value.list), format.string))
#[1] "a-bcd/def"
NOTE: 'value.list' is a vector and not a list

How do I convert a string to a number in R if the string contains a letter?

I am currently helping a friend with his research and am gathering information about different natural disasters that occurred from 2004-2016. The data can be found using this link:
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/
When you import it into R it gives helpful information; however, my friend, and now I, are only interested in State, Year, Month, Event, Type, County, direct and indirect deaths and injuries, and property damage. So first I am extracting the columns I need and will later in the code combine them back together. However, the data is currently stored as strings, and I need the Property Damage column to be numeric since it is a cash value. So for example, I have a data entry in that column that looks like "8.6k" and I need it as 8600, and all the "NA" entries need to be replaced with a 0.
I have this so far, but it gives me back a vector of NAs. Can anyone think of a better way of doing this?
State<- W2004$STATE
Year<-W2004$YEAR
Month<-W2004$MONTH_NAME
Event<-W2004$EVENT_TYPE
Type<-W2004$CZ_TYPE
County<-W2004$CZ_NAME
Direct_Death<-W2004$DEATHS_DIRECT
Indirect_Death<-W2004$DEATHS_INDIRECT
Direct_Injury<-W2004$INJURIES_DIRECT
Indirect_Injury<-W2004$INJURIES_INDIRECT
W2004$DAMAGE_PROPERTY<-as.numeric(W2004$DAMAGE_PROPERTY)
Damage_Property<-W2004$DAMAGE_PROPERTY
l <- cbind( all the columns up there)
print(l)
We can try using a case_when expression here, to map each type of unit to a bona fide number. Going with the two examples you actually showed us:
library(dplyr)
x <- c("1.00M", "8.6k")
result <- case_when(
  grepl("\\d+k$", x) ~ as.numeric(sub("\\D+$", "", x)) * 1000,
  grepl("\\d+M$", x) ~ as.numeric(sub("\\D+$", "", x)) * 1000000,
  TRUE ~ as.numeric(sub("\\D+$", "", x))
)
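With the two sample inputs, printing result should give:
result
# [1] 1000000    8600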
You can extract the letter and use switch(), which is easily maintainable: if you want to add additional symbols later, it is very easy.
First, the setup:
options(scipen = 999) # to prevent R from printing scientific numbers
library(stringr) # to extract letters
This is the sample vector (an NA is included to show the replacement with 0):
numbers_with_letters <- c("1.00M", "8.6k", 50, NA)
Use lapply() to loop through the vector, extract the letter, replace it with a number, remove the letter, convert to numeric, and multiply:
lapply(numbers_with_letters, function(x) {
  letter <- str_extract(x, "[A-Za-z]")
  letter_to_num <- switch(letter,
                          k = 1000,
                          M = 1000000,
                          1) # 1 is the default option if no letter is found
  number <- as.numeric(gsub("[A-Za-z]", "", x)) # drop the letter, convert to numeric
  number[is.na(number)] <- 0 # replace NAs with 0
  return(number * letter_to_num)
})
This returns:
[[1]]
[1] 1000000
[[2]]
[1] 8600
[[3]]
[1] 50
[[4]]
[1] 0
Maybe I'm oversimplifying here, but . . .
library(tidyverse)
data <- tibble(property_damage = c("8.6k", "NA"))
data %>%
  mutate(
    as_number = if_else(
      property_damage != "NA",
      str_extract(property_damage, "\\d+\\.*\\d*"),
      "0"
    ),
    as_number = as.numeric(as_number)
  )
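Note that str_extract() alone turns "8.6k" into 8.6, not 8600. A sketch combining this with the multiplier logic from the answers above (same dplyr/stringr tools):
data %>%
  mutate(
    number = as.numeric(str_extract(property_damage, "\\d+\\.*\\d*")),
    multiplier = case_when(
      str_detect(property_damage, "k$") ~ 1000,
      str_detect(property_damage, "M$") ~ 1e6,
      TRUE ~ 1
    ),
    as_number = coalesce(number * multiplier, 0)  # NAs (e.g. the "NA" rows) become 0
  )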

modify data frame based on regex using gsub in r

I have been matching text strings between two vectors in a dataframe. Several values have exactly three characters and match up as part of another word in some other string. I would like to find the regular expression for this. Here is an example:
a <- c("urban", "crabtree", "rba", "rba hks","barbara", "lederbach")
b <- c("rba", "rba", "rba", "rba", "rba", "rba")
df <- data.frame(a, b)
I would like to substitute an empty string (i.e. "") for those values where "rba" only appears as part of a word. The desired output is:
b <- c("", "", "rba", "rba", "", "")
So it's sort of like:
grep("\\b...\\b", df$a, value = TRUE)
But I want to modify column b and insert "" wherever there is no match.
I'm aware that %in% can be used for exact matches, but I was hoping for something using gsub:
funb <- function(x) gsub("\\b...\\b", "", x)
df$b <- lapply(df$b, funb)
but I haven't had much luck. Clearly something is amiss; can someone help me get the desired result? Any advice or suggestions would be much appreciated. Thanks.
Based on @David Arenburg's comment above, the general solution to this problem is:
library(stringi)
b[!stri_detect_regex(a, paste0("\\b", b, "\\b"))] <- ""
which blanks the elements of b that only match as part of a word, as desired.
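A base R equivalent (a sketch using mapply() and grepl() on the same vectors):
b[!mapply(grepl, paste0("\\b", b, "\\b"), a)] <- ""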

searching fields using grepl in R

I'm trying to use grepl to flag some data that might be interesting in a genetics dataset I have.
An example of the data looks like this
test <- c("AAT,TAA,TGA,A,G", "A,AAT,AAAT,AATAAT", "CA,CAA,CAAA")
pattern <- c("TAA", "G", "CAA")
df <- data.frame(test, pattern)
What I am trying to do is to create a third column, say result that evaluates whether the value in the pattern column is in the test column.
I tried this:
df.result <- df %>% mutate(result = grepl(pattern, test))
But for some reason I get a TRUE, TRUE, FALSE in the result column, which isn't what I'm expecting - I would expect a TRUE, FALSE, TRUE result.
I've played around with things like adding a comma to the end of each field, but that didn't seem to work either.
Would appreciate any help with this!
Thanks,
Steve
Use the apply() function:
df$result <- apply(df, 1, FUN=function(x) grepl(x[2], x[1]))
df
#                test pattern result
# 1   AAT,TAA,TGA,A,G     TAA   TRUE
# 2 A,AAT,AAAT,AATAAT       G  FALSE
# 3       CA,CAA,CAAA     CAA   TRUE
The apply function loops through each row of the df separately, feeding grepl with per-row information. grepl cannot process a vector with three elements in the pattern argument. The help page says:
If a character vector of length 2 or more is supplied [as pattern], the first element is used with a warning.
Thus, the original command grepl(df$pattern, df$test) compared the first element from pattern (TAA) to the whole vector in test.
This can be otherwise done with mapply
df$result <- mapply(grepl, df$pattern, df$test)
df$result
#[1] TRUE FALSE TRUE
The stringi package provides string matching functions that are vectorised over both string and pattern;
library(stringi)
df %>% mutate(result = stri_detect_regex(test, pattern))
is one answer to the original question. An answer to the question about avoiding substring matches is
df %>% mutate(result = stri_detect_regex(test, stri_join('(^|,)', pattern, '(,|$)')))
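To see why the anchored version matters, compare a pattern that also occurs as a substring of another field:
library(stringi)
stri_detect_regex("CAA,CAAA", "CA")           # TRUE: "CA" matches inside "CAA"
stri_detect_regex("CAA,CAAA", "(^|,)CA(,|$)") # FALSE: there is no exact "CA" field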

Removing Whitespace From a Whole Data Frame in R

I've been trying to remove the white space that I have in a data frame (using R). The data frame is large (>1 GB) and has multiple columns that contain white space in every data entry.
Is there a quick way to remove the white space from the whole data frame? I've been trying to do this on a subset of the first 10 rows of data using:
gsub( " ", "", mydata)
This didn't seem to work, although R returned an output which I have been unable to interpret.
str_replace( " ", "", mydata)
R returned 47 warnings and did not remove the white space.
erase_all(mydata, " ")
R returned an error saying 'Error: could not find function "erase_all"'
I would really appreciate some help with this, as I've spent the last 24 hours trying to tackle this problem.
Thanks!
A lot of the answers are older, so here in 2019 is a simple dplyr solution that will operate only on the character columns to remove trailing and leading whitespace.
library(dplyr)
library(stringr)
data %>%
mutate_if(is.character, str_trim)
## ===== 2020 edit for dplyr (>= 1.0.0) =====
df %>%
mutate(across(where(is.character), str_trim))
You can switch out the str_trim() function for other ones if you want a different flavor of whitespace removal.
# for example, remove all spaces
df %>%
mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
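In recent dplyr versions, passing extra arguments through across() is superseded; the lambda form does the same thing:
df %>%
  mutate(across(where(is.character), ~ str_remove_all(.x, fixed(" "))))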
If I understood you correctly, you want to remove all the white space from the entire data frame. I guess the code you are using is good for removing spaces in the column names. I think you should try this:
apply(myData, 2, function(x)gsub('\\s+', '',x))
Hope this works.
This will return a matrix, however; if you want to change it to a data frame, do:
as.data.frame(apply(myData, 2, function(x) gsub('\\s+', '', x)))
EDIT In 2020:
Using lapply and the trimws function (with the default which = "both") can remove leading and trailing spaces, but not those inside a string. Since there was no input data provided by the OP, I am adding a dummy example to produce the results.
DATA:
df <- data.frame(val = c(" abc", " kl m", "dfsd "),
val1 = c("klm ", "gdfs", "123"),
num = 1:3,
num1 = 2:4,
stringsAsFactors = FALSE)
# situation 1 (base R): when we want to remove spaces only at the leading and trailing ends, NOT inside the string values, we can use trimws
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified], trimws)
# situation 2 (base R): when we want to remove spaces everywhere in the character columns of the dataframe (inside a string as well as at the leading and trailing ends).
(This was the initial solution proposed using apply; note that an apply-based solution seems to work but would be very slow. Also, from the question it is not entirely clear whether the OP really wanted to remove only leading/trailing blanks or every blank in the data.)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified],
function(x) gsub('\\s+', '', x))
## situation: 1 (Using data.table, removing only leading and trailing blanks)
library(data.table)
setDT(df)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, trimws), .SDcols = cols_to_be_rectified]
Output from situation 1:
    val val1 num num1
1:  abc  klm   1    2
2: kl m gdfs   2    3
3: dfsd  123   3    4
## situation: 2 (Using data.table, removing every blank inside as well as leading/trailing blanks)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, function(x) gsub('\\s+', '', x)), .SDcols = cols_to_be_rectified]
Output from situation 2:
    val val1 num num1
1:  abc  klm   1    2
2:  klm gdfs   2    3
3: dfsd  123   3    4
Note the difference between the two outputs in row 2: with trimws we can remove only the leading and trailing blanks, but with the regex solution we are able to remove every blank. I hope this helps, thanks!
One possibility involving just dplyr could be:
data %>%
mutate_if(is.character, trimws)
Or considering that all variables are of class character:
data %>%
mutate_all(trimws)
Since dplyr 1.0.0 (only strings):
data %>%
mutate(across(where(is.character), trimws))
Or if all columns are strings:
data %>%
mutate(across(everything(), trimws))
Picking up on Fremzy and the comment from Stamper, this is now my handy routine for cleaning up whitespace in data:
df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)
As others have noted this changes all types to character. In my work, I first determine the types available in the original and conversions required. After trimming, I re-apply the types needed.
If your original types are OK, apply the solution from MarkusN: https://stackoverflow.com/a/37815274/2200542
Those working with Excel files may wish to explore the readxl package which defaults to trim_ws = TRUE when reading.
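For example (the file name here is hypothetical):
library(readxl)
dat <- read_excel("mydata.xlsx", trim_ws = TRUE)  # trim_ws = TRUE is the default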
Picking up on Fremzy and Mielniczuk, I came to the following solution:
data.frame(lapply(df, function(x) if (is.character(x)) trimws(x) else x), stringsAsFactors = FALSE)
It works for mixed numeric/character dataframes and manipulates only the character columns.
You could use the trimws function (available from R 3.2) on all the columns.
myData[, 1] <- trimws(myData[, 1])
You can loop this for all the columns in your dataset. It has good performance with large datasets as well.
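A sketch of that loop, trimming only the character columns so numeric columns are not coerced:
for (j in seq_along(myData)) {
  if (is.character(myData[[j]])) myData[[j]] <- trimws(myData[[j]])
}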
If you're dealing with large data sets like this, you could really benefit from the speed of data.table.
library(data.table)
setDT(df)
for (j in names(df)) set(df, j = j, value = trimws(df[[j]]))
I would expect this to be the fastest solution. This line of code uses the set operator of data.table, which loops over columns really fast. There is a nice explanation here: Fast looping with set.
R is simply not the right tool for such a file size. However, you have 2 options:
Use the ff and ffbase packages:
library(ff)
library(ffbase)
x <- read.csv.ffdf(file=your_file,header=TRUE, VERBOSE=TRUE,
first.rows=1e4, next.rows=5e4)
splits <- 10  # 'splits' was undefined in the original; 10 chunks is an arbitrary example
x$split <- as.ff(rep(seq(splits), each=nrow(x)/splits))
ffdfdply(x, x$split, BATCHBYTES=0, function(myData)
  apply(myData, 2, function(x) gsub('\\s+', '', x)))
Use sed (my preference)
sed -E -i "s/(\S)\s+(\S)/\1\2/g;s/^\s+//;s/\s+$//" your_file
If you want to maintain the variable classes in your data.frame, you should know that using apply will clobber them, because it outputs a matrix where all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk, you can loop through the columns of your data.frame and trim the white space off only columns of class factor or character (and maintain your data classes):
for (i in names(mydata)) {
if(class(mydata[, i]) %in% c("factor", "character")){
mydata[, i] <- trimws(mydata[, i])
}
}
I think a simple approach with sapply also works, given a df like:
dat <- data.frame(S=LETTERS[1:10],
                  M=LETTERS[11:20],
                  X=c(rep("A:A",3), "?", "A:A ", rep("G:G",5)),
                  Y=c(rep("T:T",4), "T:T ", rep("C:C",5)),
                  Z=c(rep("T:T",4), "T:T ", rep("C:C",5)),
                  N=c(1:3, '4 ', '5 ', 6:10),
                  stringsAsFactors = FALSE)
You will notice that dat$N is going to become class character due to '4 ' & '5 ' (you can check with class(dat$N))
To get rid of the spaces in the numeric column, simply convert to numeric with as.numeric or as.integer.
dat$N<-as.numeric(dat$N)
If you want to remove all the spaces, do:
dat.b<-as.data.frame(sapply(dat,trimws),stringsAsFactors = FALSE)
And again use as.numeric on col N (because sapply will convert it to character):
dat.b$N<-as.numeric(dat.b$N)
