Keep duplicates using `make_clean_names` from the R janitor package

I am trying to clean a character column using the make_clean_names function from the janitor package in R. I need to keep the duplicates in this case, without a numeric suffix being added to them. Is this possible? My code is like this:
x <- c(' x y z', 'xyz', 'x123x', 'xy()','xyz','xyz')
janitor::make_clean_names(x)
[1] "x_y_z" "xyz" "x123x" "xy" "xyz_2" "xyz_3"
janitor::make_clean_names(x, unique_sep = '.')
[1] "x_y_z" "xyz" "x123x" "xy" "xyz.1" "xyz.2"
janitor::make_clean_names(x, unique_sep = NULL)
[1] "x_y_z" "xyz" "x123x" "xy" "xyz_2" "xyz_3"
Using unique_sep = NULL doesn't seem to work. Is there another way to keep the duplicated values as they are?
Desired Output:
[1] "x_y_z" "xyz" "x123x" "xy" "xyz" "xyz"
I know how to use regular expressions to do this. Just searching for a shortcut.
PS: I know this function was created to clean the names of a data.frame; I am trying to apply it to a different use case. This functionality could help a lot in cleaning character columns.

You can use sapply to go through the vector elements one by one and thus avoid adding numeric suffixes to duplicates:
sapply(x, janitor::make_clean_names, USE.NAMES = FALSE)
[1] "x_y_z" "xyz" "x123x" "xy" "xyz" "xyz"

Unfortunately no, it's not possible. If you look at the code for make_clean_names you'll see it ends with this:
# Handle duplicated names - they mess up dplyr pipelines. This appends the
# column number to repeated instances of duplicate variable names.
while (any(duplicated(cased_names))) {
  dupe_count <-
    vapply(
      seq_along(cased_names), function(i) {
        sum(cased_names[i] == cased_names[1:i])
      },
      1L
    )

  cased_names[dupe_count > 1] <-
    paste(
      cased_names[dupe_count > 1],
      dupe_count[dupe_count > 1],
      sep = "_"
    )
}
I think you're on the right track passing the unique_sep argument through to the underlying function that make_clean_names uses, snakecase::to_any_case. But that while loop, recently introduced to ensure there are never duplicated names resulting from make_clean_names, will always deduplicate at the end.
You could adapt your own function from the first part of make_clean_names, without the loop, or you could perhaps make use of snakecase::to_any_case directly.
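For example, a minimal sketch of the to_any_case route (note this skips the pre-processing that make_clean_names does first, such as transliteration and replacing symbols like "%", so results for unusual characters may differ slightly):
library(snakecase)
# sep_in treats any non-alphanumeric character as a word separator;
# no deduplication step runs here, so the repeated "xyz" values stay "xyz"
to_any_case(x, case = "snake", sep_in = "[^[:alnum:]]")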

Related

Extracting numbers (in decimal and </> form) from strings in R

I have a dataset in which a column (the result variable) contains data in both numeric and character form [e.g. positive, negative, <0.1, 600, >1000 etc].
I want to extract only the numeric data in this column (i.e. <0.1, 600, >1000). Ideally without the use of any external packages.
I tried the following:
x<-gsub('\\D','', x)
But it removes the decimal point and the less-than/greater-than signs (e.g. 1.56 became 156, <1.0 became 10).
I then tried the following:
x<-as.numeric(gsub("(\\D)\\.","", x))
This time it keeps the decimal but coerces other values such as <0.1 and >100 to NA instead.
So my question is: is there any way I can modify the call so that it keeps the values containing '<' or '>' as they are, without replacement?
Meaning from
x = c("negative","positive","1.22","<1.0",">200")
I will be able to get back
x = c("", "", "1.22", "<1.0", ">200")
I would really appreciate it if someone could teach me how to resolve this issue. Thanks!
Do you need this?
> gsub("[^0-9.<>]", "", x)
[1] "" "" "1.22" "<1.0" ">200"
Does this work for you? Using grep we can find which items of the vector contain numbers, and value=TRUE will return those items. Another way could be using grepl to get a logical output for the matches. Also, in your case \\D would not work, since it matches all non-digits, including the dot and the less-than/greater-than signs.
grep('\\d+', x, value=TRUE)
would yield : [1] "1.22" "<1.0" ">200"
grepl('\\d+', x)
would yield: [1] FALSE FALSE TRUE TRUE TRUE
You may also try gsub using:
> gsub('[a-zA-Z]+', '', x)
[1] "" "" "1.22" "<1.0" ">200"
Using str_remove_all from stringr:
library(stringr)
str_remove_all(x, "[A-Za-z]+")
#[1] "" "" "1.22" "<1.0" ">200"
What about something like this? Find the elements that do not match your conditions and set them to an empty string.
x[grep('[a-zA-Z]', x)] <- ""

R subsetting list "incorrect number of dimensions"

I am working with some text in a list. The text is separated by CR/LF, so I split the string on that. Then I have to clean up the list to make it usable.
library(tidyverse)
my_list <-("abc\r\ndef\r\nghi\r\njkl\r\n")
# The str_split gives me a list that has an empty element at the end. Why?
split_list <- str_split(my_list, "\r\n")
[[1]]
[1] "abc" "def" "ghi" "jkl" ""
I need to remove the first two elements and then sort in reverse order:
split_list %>%
split_list[[1]][-1:-2] %>%
sort(split_list, decreasing = TRUE)
But it fails with: Error in .[split_list[[1]], -1:-2] : incorrect number of dimensions
I've read so many discussions of subsetting but they all seem more complicated than my example. I clearly don't understand this yet. Thank you for your suggestions!
You could do:
library(magrittr)
split_list %>% .[[1]] %>% tail(-2) %>% sort(decreasing = TRUE)
#[1] "jkl" "ghi" ""
Here's a way of using "[[" and "[" inside the tidyverse framework. They are both functions so you need to backtick them when they are used in this manner. (Your error arises from referring to the data-object twice. You should not need to refer to split_list twice.) The tidyverse creates an implicit pass-through of the leading data-object as it gets progressively modified by the sequence of functions. Functions become somewhat like 'infix'-functions in base R:
split_list %>%
  `[[`(1) %>%               # pulls the first element (a character vector) from split_list
  `[`(-1:-2) %>%            # both extraction functions used via their back-ticked names
  sort(decreasing = TRUE)
[1] "jkl" "ghi" ""
It's actually quite similar to the arrangement you could use in base R, where these extraction functions again behave like infix operators:
sort( split_list [[ 1 ]] [ -1:-2 ],
      decreasing = TRUE)
[1] "jkl" "ghi" ""
If you are only working on one vector such that str_split only ever returns a list with one element containing the split vector, you could wrap your str_split() inside the unlist() function to obtain the vector of split elements directly. It could look something like this:
sort(unlist(str_split(my_list, "\r\n"))[-c(1:2)], decreasing = TRUE)
Above I also subset the unlisted vector to remove the first two elements and then wrap the entire expression inside the sort() function with decreasing = TRUE.
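If your stringr version is 1.5.0 or later, str_split_1() can stand in for the unlist() step, since it returns a plain character vector when splitting a single string. A small sketch:
library(stringr)
sort(str_split_1(my_list, "\r\n")[-(1:2)], decreasing = TRUE)
# [1] "jkl" "ghi" ""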

strsplit not behaving as expected R

I have a basic problem in R. Everything I'm working with is familiar to me (data, functions), but for some reason I can't get strsplit or gsub to work as expected. I also tried the stringr package, but I'm not going to post that code because I know this problem is simple and can be solved with the two functions mentioned above. Personally, I feel like putting up a page for this isn't even necessary, but my patience is pretty thin at this point.
I am trying to remove the "." and the number that follows it in an Ensembl gene ID. Simple, I know.
id <- "ENSG00000223972.5"
gsub(".*", "", id)
strsplit(id, ".")
The asterisk was meant to catch anything after the '.' and remove it, but I don't know for sure if that's what it does. The strsplit should definitely output a list of two items, the first being everything before the '.' and the second being the one digit after it. All it returns is a list of 17 empty strings, one for each character in the string. I think it's something obvious that I'm missing, but I haven't been able to figure it out. Thanks in advance.
Read the help file for ?strsplit: you cannot use "." on its own, because the split argument is interpreted as a regular expression. Escape it, put it in a character class, or use fixed = TRUE:
id <- "ENSG00000223972.5"
gsub("[.]", "", id)
strsplit(id, split = "[.]")
Output:
> gsub("[.]", "", id)
[1] "ENSG000002239725"
> strsplit(id, split = "[.]")
[[1]]
[1] "ENSG00000223972" "5"
From the help file (?strsplit):
unlist(strsplit("a.b.c", "."))
## [1] "" "" "" "" ""
## Note that 'split' is a regexp!
## If you really want to split on '.', use
unlist(strsplit("a.b.c", "[.]"))
## [1] "a" "b" "c"
## or
unlist(strsplit("a.b.c", ".", fixed = TRUE))

R List with sub-lists: Extract all elements that match a rule into array

I have an R list of objects which are themselves lists of various types. I want the "cost" value for all objects whose category is "internal". What's a good way of achieving this?
If I had a data frame I'd have done something like
my_dataframe$cost[my_dataframe$category == "internal"]
What's the analogous idiom for a list?
mylist <- list(list(category = "internal", cost = 2),
               list(category = "bar", cost = 3),
               list(category = "internal", cost = 4),
               list(category = "foo", age = 56))
Here I'd want to get c(2,4). Subsetting like this does not work:
mylist[mylist$category == "internal"]
I can do part of this by:
temp<-sapply(mylist,FUN = function(x) x$category=="internal")
mylist[temp]
[[1]]
[[1]]$category
[1] "internal"
[[1]]$cost
[1] 2
[[2]]
[[2]]$category
[1] "internal"
[[2]]$cost
[1] 4
But how do I get just the costs so that I can (say) sum them up, etc.? I tried this, but it does not help much:
unlist(mylist[temp])
category cost category cost
"internal" "2" "internal" "4"
Is there a neat, compact idiom for doing what I want?
The idiom you are looking for is
sapply(mylist, "[[", "cost")
which returns a list of the extracted values where they exist, and NULL where they do not:
[[1]]
[1] 2
[[2]]
[1] 3
[[3]]
[1] 4
[[4]]
NULL
If you just want the sum of the costs for the "internal" categories, you can do the following (reusing the temp indicator from the question):
sum(sapply(mylist[temp], "[[", "cost"))
And if you want the costs as a list (with NULL for the non-matching elements) you can do
sapply(mylist,function(x) x[x$category == "internal"]$cost)
One of the beautiful, but challenging, things about R is that there are so many ways to express the same thing.
You might note from the other answers that you can often interchange sapply and lapply, since lists are just heterogeneous vectors; the following will also return 6.
do.call("sum",lapply(mylist, function(x) x[x[["category"]] == "internal"]$cost))
Yet another attempt, this time using ?Filter and a custom function to do the necessary selecting:
sum(sapply(Filter(function(x) x$category=="internal", mylist), `[[`, "cost"))
#[1] 6
You could try something like this. For each sub-list, if the category is "internal", get the cost; otherwise return NULL, which is dropped when you unlist the result:
sum(unlist(lapply(mylist, function(x) if(x$category == "internal") x$cost)))
# [1] 6
A safer way is to also check if category exists in the sublist by checking the length of category:
sum(unlist(lapply(mylist, function(x) if(length(x$category) && x$category == "internal") x$cost)))
# [1] 6
This will avoid raising an error if the sublist doesn't contain the category field.
I approached your question with the rlist package. This method is similar to the purrr approach that @alistaire mentioned.
library(rlist); library(dplyr)
mylist %>%
  list.filter(category == "internal") %>%
  list.mapv(cost) %>%
  sum()
# list.mapv evaluates an expression for each member of the list and returns the results as a vector.
The purrr package has some nice utilities for manipulating lists. Here, keep lets you specify a predicate function that returns a Boolean for whether to keep a list element:
library(purrr)
mylist %>%
  keep(~.x[['category']] == 'internal') %>%
  # now select the `cost` element of each, and simplify to numeric
  map_dbl('cost') %>%
  sum()
## [1] 6
The predicate structure with ~ and .x is a shorthand equivalent to
function(x){x[['category']] == 'internal'}
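Under that equivalence, both of these forms should give the same filtered list here:
library(purrr)
keep(mylist, ~ .x[["category"]] == "internal")
keep(mylist, function(x) x[["category"]] == "internal")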
Here's a dplyr option:
library(dplyr)
bind_rows(mylist) %>%
  filter(category == 'internal') %>%
  summarize(total = sum(cost))
# A tibble: 1 x 1
  total
  <dbl>
1     6

R: Replacing rownames of data frame by a substring[2]

I have a question about the use of gsub. The rownames of my data have the same partial names. See below:
> rownames(test)
[1] "U2OS.EV.2.7.9" "U2OS.PIM.2.7.9" "U2OS.WDR.2.7.9" "U2OS.MYC.2.7.9"
[5] "U2OS.OBX.2.7.9" "U2OS.EV.18.6.9" "U2O2.PIM.18.6.9" "U2OS.WDR.18.6.9"
[9] "U2OS.MYC.18.6.9" "U2OS.OBX.18.6.9" "X1.U2OS...OBX" "X2.U2OS...MYC"
[13] "X3.U2OS...WDR82" "X4.U2OS...PIM" "X5.U2OS...EV" "exp1.U2OS.EV"
[17] "exp1.U2OS.MYC" "EXP1.U20S..PIM1" "EXP1.U2OS.WDR82" "EXP1.U20S.OBX"
[21] "EXP2.U2OS.EV" "EXP2.U2OS.MYC" "EXP2.U2OS.PIM1" "EXP2.U2OS.WDR82"
[25] "EXP2.U2OS.OBX"
In my previous question, I asked if there is a way to get the same names for the same partial names. See this question: Replacing rownames of data frame by a sub-string
The answer is a very nice solution. The function gsub is used in this way:
transfecties = gsub(".*(MYC|EV|PIM|WDR|OBX).*", "\\1", rownames(test))
Now I have another problem: the program I use to run R (Galaxy) doesn't recognize the | character. My question is: is there another way to get to the same solution without using |?
Thanks!
If you don't want to use the "|" character, you can try something like :
Rnames <-
c( "U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9" ,
"U2OS.OBX.2.7.9" , "U2OS.EV.18.6.9" ,"U2O2.PIM.18.6.9" ,"U2OS.WDR.18.6.9" )
Rlevels <- c("MYC","EV","PIM","WDR","OBX")
tmp <- sapply(Rlevels,grepl,Rnames)
apply(tmp,1,function(i)colnames(tmp)[i])
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR"
But I would seriously consider mentioning this to the Galaxy team, as it seems rather awkward not to be able to use the symbol for OR...
I wouldn't recommend doing this in general in R, as it is far less efficient than the solution @csgillespie provided, but an alternative is to loop over the various strings you want to match and do the replacements on each string separately, i.e. search for "MYC" and replace only in those rownames that match "MYC".
Here is an example using the x data from @csgillespie's answer:
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
"U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9","U2OS.WDR.18.6.9",
"U2OS.MYC.18.6.9","U2OS.OBX.18.6.9", "X1.U2OS...OBX","X2.U2OS...MYC")
Copy the data so we have something to compare with later (this just for the example):
x2 <- x
Then create a list of strings you want to match on:
matches <- c("MYC","EV","PIM","WDR","OBX")
Then we loop over the values in matches and do three things (marked ## 1, ## 2, ## 3 in the code):
1. Create the regular expression by pasting together the current match string i with the other bits of the regular expression we want to use.
2. Use grepl() to return a logical indicator of which elements contain the string i.
3. Use the same style of gsub() call you were already shown, but apply it only to the elements of x2 that matched the string, replacing only those elements.
The loop is:
for(i in matches) {
    rgexp <- paste(".*(", i, ").*", sep = "")    ## 1
    ind <- grepl(rgexp, x)                       ## 2
    x2[ind] <- gsub(rgexp, "\\1", x2[ind])       ## 3
}
x2
Which gives:
> x2
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR" "MYC" "OBX" "OBX" "MYC"
