how to eliminate specific columns by column name

I have a data set df with 300 columns, and a character vector called names. I'm trying to eliminate the columns whose names appear in names. I tried
> head(names)
[1] "X749.-4" "X339" "X449" "X486" "X300" "X301"
real.final<-df[-names]
Error in -names : invalid argument to unary operator
Would there be a way to remove the columns mentioned in the names?

I would use setdiff instead. Here's an example:
## This is head(names)
x <- c("X749.-4", "X339", "X449", "X486", "X300", "X301")
## Imagine this is names(df)
y <- c(letters[1:2], x, LETTERS[1:2])
setdiff(y, x)
# [1] "a" "b" "A" "B"
## So, you could try:
df[, setdiff(y, x)]
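A minimal sketch of the same idea on an actual data frame. The small df and the vector drop_names are made up for illustration and are not from the question. drop = FALSE is added here so the result stays a data frame even if only one column survives.

```r
# Toy data frame standing in for the 300-column df (illustrative only)
df <- data.frame(a = 1:3, X339 = 4:6, X449 = 7:9, b = 10:12)

# Hypothetical vector of columns to remove; "X999" is not in df,
# and setdiff() simply ignores it
drop_names <- c("X339", "X449", "X999")

# Keep every column whose name is not in drop_names
kept <- df[, setdiff(names(df), drop_names), drop = FALSE]
names(kept)
# [1] "a" "b"
```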

The negation operator "-" will not work with character arguments passed as arguments to "[". You need to either use a logical vector with "!" as illustrated by user2568648, or you need to convert the character vector into a numeric vector with grep:
# Failed attempt: real.final <- df[-grep(names, names(df) )]
Perhaps:
real.final <- df[ -unlist(lapply(names, grep, x=names(df), fixed=TRUE)) ]
Another error:
real.final <- subset( df, select=-names)
Error in -"Result" : invalid argument to unary operator
Success with:
subset(df, select=-which(names(df) %in% names))
I don't like to use -which() because it will bite you if there are no "hits", but it's probably safe as an argument to subset.
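To make the "no hits" problem concrete, here is a small sketch on a made-up two-column data frame (df and no_hits are illustrative, not from the question). When nothing matches, which() returns integer(0), negating it still gives integer(0), and indexing with that selects no columns at all. A logical index negated with ! does not have this problem.

```r
df <- data.frame(a = 1:3, b = 4:6)
no_hits <- "zzz"                        # hypothetical name that is not in df

hits <- which(names(df) %in% no_hits)   # integer(0): nothing matched
lost <- df[, -hits]                     # -integer(0) is integer(0): no columns selected
kept <- df[, !(names(df) %in% no_hits), drop = FALSE]  # logical negation keeps everything

ncol(lost)
# [1] 0
ncol(kept)
# [1] 2
```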

You can use the which function. For example to drop the columns named "X749.-4" and "X486":
df <- df[ , -which(names(df) %in% c("X749.-4", "X486"))]

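Would this work? [No: as DWin's answer shows, the columns have to be passed through the select argument of subset(); a bare logical second argument is treated as a row filter.]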
subset.df<-subset(df, !(colnames(df) %in% names))

Related

extract substring in R

Suppose I have a string such as "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a string that contains only the bracketed numbers, e.g. [+229][+57].
Is there a convenient way in R to do this?

In base R, you can try
> unlist(regmatches(s,gregexpr("\\[\\+\\d+\\]",s)))
[1] "[+229]" "[+57]" "[+229]"
Or you can use
> gsub(".*?(\\[.*\\]).*","\\1",gsub("\\].*?\\[","] | [",s))
[1] "[+229] | [+57] | [+229]"

We can use str_extract_all from stringr
stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]" "[+229]"
Wrap it in unique if you need only unique values.
Similarly, in base R using regmatches and gregexpr
regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]
data
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
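Putting the pieces together, a short sketch that produces the single "[+229][+57]" string asked for in the question. It assumes the unique tags should be kept in order of first appearance; the variable names tags and result are only illustrative.

```r
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"

# Extract every "[+digits]" token; [[1]] because x has length one
tags <- regmatches(x, gregexpr("\\[\\+[0-9]+\\]", x))[[1]]

# Drop repeats and glue the remaining tokens into one string
result <- paste(unique(tags), collapse = "")
result
# [1] "[+229][+57]"
```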

Seems like you want to remove the alphabetical characters, so
gsub("[[:alpha:]]", "", x)
where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems better than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches to a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract the matches. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57" "+229"
xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds the values with [ and ], and concatenates them
fun <- function(x)
    paste0("[", unique(x), "]", collapse = "")
This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().
> sapply(xx, fun)
[1] "[+229][+57]"
A minor improvement is to use vapply(), so that the result is robust (always returning a character vector with length equal to x) to zero-length inputs
> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun) # Hey, this returns a list :(
list()
> vapply(xx, fun, "character") # vapply() deals with 0-length inputs
character(0)

Replace Column Names With String Right of "_"

I have a dataframe (d3) in which some column names have the form "Date_Month.Year". I want to replace those column names with just "Month.Year", so that multiple columns with the same "Month.Year" can then be summed into a single column.
Below is the code I tried and the output
library(stringr)
print(colnames(d3))
#below is output of the print statement
#[1] "ProductCategoryDesc" "RegionDesc" "SourceDesc" "variable"
#[5] "2019-02-28_Feb.2019" "2019-03-01_Mar.2019" "2019-03-04_Mar.2019" "2019-03-05_Mar.2019"
#[9] "2019-03-06_Mar.2019" "2019-03-07_Mar.2019" "2019-03-08_Mar.2019"
d3 <- d3 %>% mutate(col = str_remove(col, '*._'))
Here is the error I get:
Evaluation error: argument str should be a character vector (or an object coercible to).
So I got the first part of my problem answered: I used the gsub() line below to get all column names in Month.Year format. Now I am having issues summing the columns that have the same name; for that I looked at the question "Sum and replace columns with same name R for a data frame containing different classes".
colnames(d3) <- gsub('.*_', '', colnames(d3))
Below is the code I used to get the columns summed that have a duplicate name, however with this code it is not necessarily putting the summed values in the correct columns.
indx <- sapply(d3, is.numeric)#check which columns are numeric
nm1 <- which(indx)#get the numeric index of the column
indx2 <- duplicated(names(nm1))|duplicated(names(nm1),fromLast=TRUE)
nm2 <- nm1[indx2]
indx3 <- duplicated(names(nm2))
d3[nm2[!indx3]] <- Map(function(x,y) rowSums(x[y],na.rm = FALSE),
list(d3),split(nm2, names(nm2)))
d3 <- d3[ -nm2[indx3]]

If you want to change the column names, you should be changing colnames:
colnames(d3) <- gsub('.*_', '', colnames(d3))
Note that in your regex, quantifiers (i.e. *) go after the thing they quantify. So it should be .*_ rather than *._
An example where we remove text before a . in iris:
colnames(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
# In regex, . means any character, so to match an actual '.',
# we need to 'escape' it with \\.
colnames(iris) <- gsub('.*\\.', '', colnames(iris))
colnames(iris)
[1] "Length" "Width" "Length" "Width" "Species"
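For the follow-up about summing columns that end up with the same name, one possible sketch is shown here. It uses a made-up d3 with one descriptive column and three numeric columns, and it assumes every numeric column should be summed by name. Columns are grouped by position rather than by name, because subsetting a data frame with [ renames duplicated column names.

```r
# Toy stand-in for d3 after the rename step (values are illustrative only)
d3 <- data.frame(RegionDesc = c("N", "S"),
                 a = c(1, 2), b = c(10, 20), c = c(100, 200),
                 stringsAsFactors = FALSE)
names(d3) <- c("RegionDesc", "Feb.2019", "Mar.2019", "Mar.2019")

is_num <- vapply(d3, is.numeric, logical(1))

# Column positions of the numeric columns, grouped by column name
groups <- split(which(is_num), names(d3)[is_num])

# One summed column per distinct name, bound together as a matrix
summed <- do.call(cbind, lapply(groups, function(idx) rowSums(d3[idx])))

# Non-numeric columns first, then the summed columns
result <- cbind(d3[!is_num], summed)
names(result)
# [1] "RegionDesc" "Feb.2019"   "Mar.2019"
result$Mar.2019
# [1] 110 220
```

Note that split() orders the groups alphabetically by name, so the summed columns may not come out in the original column order.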

colnames(d3) <- sapply(colnames(d3), function(colname){
return( str_remove(colname, '.*_') )
})
The regex should be ".*_" to match the case you need

With rlang, convert contents of `...` to a character vector

I'd like to be able to create a character vector based on the names supplied to the ... part of a function.
For instance, if I have the function foo(...) and I type foo(x, y), how do I create a character vector that looks like c("x", "y")?
I'm most interested in figuring out how to use rlang for this, but base solutions would be great as well.

Do you mean something like this?
foo <- function(...) unname(purrr::map_chr(rlang::exprs(...), as.character))
foo(x, y)
#[1] "x" "y"
identical(foo(x, y), c("x", "y"))
#[1] TRUE
Alternatively we can use as.character directly on the list returned from rlang::exprs
foo <- function(...) as.character(rlang::exprs(...))
In response to @joran's question, I'm not sure, to be honest; consider the following case
as.character(rlang::exprs(NULL, a, b))
#[1] "NULL" "a" "b"
purrr::map_chr(rlang::exprs(NULL, a, b), as.character)
#Error: Result 1 is not a length 1 atomic vector
So as.character converts NULL to "NULL" whereas map_chr(..., as.character) throws an error on account of the NULL list entry.
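Since base solutions were also welcome, here is a minimal base R sketch. The function name foo_base is made up for illustration. It assumes each argument deparses to a single string, which holds for plain names like x and y; a very long expression could deparse to several lines and make vapply() fail.

```r
# Capture the unevaluated arguments in `...` and turn each into a string
foo_base <- function(...) {
  args <- as.list(substitute(list(...)))[-1]   # drop the leading `list` symbol
  unname(vapply(args, deparse, character(1)))
}

foo_base(x, y)
# [1] "x" "y"
```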

Using R to compare two words and find letters unique to second word (across c. 6000 cases)

I have a dataframe comprising two columns of words. For each row I'd like to identify any letters that occur in only the word in the second column e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.

The function you’re searching for is setdiff:
chars_for = function (str)
    strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
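As a runnable variant of the same setdiff idea, the sketch below walks over the two columns in parallel with mapply() instead of apply(). The data frame words, its column names word1 and word2, and the new column extra are assumptions made for the example.

```r
# Toy data with the three rows from the question
words <- data.frame(word1 = c("carpet", "bag", "dog"),
                    word2 = c("carpelt", "flag", "dig"),
                    stringsAsFactors = FALSE)

chars_for <- function(str) strsplit(str, "")[[1]]

# For each row, letters of word2 that do not occur in word1, pasted together
words$extra <- mapply(
  function(w1, w2) paste(setdiff(chars_for(w2), chars_for(w1)), collapse = ""),
  words$word1, words$word2,
  USE.NAMES = FALSE
)
words$extra
# [1] "l"  "fl" "i"
```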

Use regex :) Paste your word with brackets [] and then use replace function for regex. This regex finds any letter from those in brackets and replaces it with empty string (you can say that it "removes" these letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"

x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to determine those characters that are in y but not in x.

excluding FALSE elements from a character vector by using logical vector

I manage to do the following:
stuff <- c("banana_fruit","apple_fruit","coin","key","crap")
fruits <- stuff[stuff %in% grep("fruit",stuff,value=TRUE)]
but I can't select the not-so-healthy stuff. The usual ideas like
no_fruit <- stuff[stuff %not in% grep("fruit",stuff,value=TRUE)]
#or
no_fruit <- stuff[-c(stuff %in% grep("fruit",stuff,value=TRUE))]
don't work. The latter just ignores the "-".

> stuff[grep("fruit",stuff)]
[1] "banana_fruit" "apple_fruit"
> stuff[-grep("fruit",stuff)]
[1] "coin" "key" "crap"
You can only use negative subscripts with numeric/integer vectors, not logical because:
> -TRUE
[1] -1
If you want to negate a logical vector, use !:
> !TRUE
[1] FALSE

As Joshua mentioned: you can't use - to negate your logical index; use ! instead.
stuff[!(stuff %in% grep("fruit",stuff,value=TRUE))]
See also the stringr package for this kind of thing.
stuff[!str_detect(stuff, "fruit")]
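A base R sketch along the same lines, using grepl() to build the logical index directly. It also shows what happens when - is applied to a logical vector. The variable names is_fruit and wrong are only illustrative.

```r
stuff <- c("banana_fruit", "apple_fruit", "coin", "key", "crap")

is_fruit <- grepl("fruit", stuff)   # TRUE TRUE FALSE FALSE FALSE

fruits   <- stuff[is_fruit]
no_fruit <- stuff[!is_fruit]        # logical negation: keeps the non-matches

# Unary minus coerces the logicals to numbers: -1 -1 0 0 0.
# As an index that means "drop element 1" (zeros are ignored),
# so only the first element disappears.
wrong <- stuff[-is_fruit]

no_fruit
# [1] "coin" "key"  "crap"
wrong
# [1] "apple_fruit" "coin"        "key"         "crap"
```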

There is also a parameter called 'invert' in grep that does essentially what you're looking for:
> stuff <- c("banana_fruit","apple_fruit","coin","key","crap")
> fruits <- stuff[stuff %in% grep("fruit",stuff,value=TRUE)]
> fruits
[1] "banana_fruit" "apple_fruit"
> grep("fruit", stuff, value = T)
[1] "banana_fruit" "apple_fruit"
> grep("fruit", stuff, value = T, invert = T)
[1] "coin" "key" "crap"
