Sets in R DataFrame - r

I have a csv that looks like
Deamon,Host,1:2:4,aaa.03
Pixe,Paradigm,1:3:5,11.us
I need to read this into a data frame for analysis, but the third column in my data is separated by : and needs to be read as a set or list, i.e. split on : so that it returns (1,2,4). Is it possible to have a column of class list in R? Or how do you think I can best approach this problem?

You can use strsplit to split a character vector into a list of components:
x <- c("1:2:4", "1:3:5")
strsplit(x, split=":")
[[1]]
[1] "1" "2" "4"
[[2]]
[1] "1" "3" "5"
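To answer the "list column" part of the question directly: a data frame column can itself be a list, so each row can hold its own vector. Here is a minimal sketch, assuming the two sample rows from the question are supplied inline via text= rather than read from a file; the column name V3_set is made up for this example.

```r
# Read the two sample rows inline; V1..V4 are the default names read.csv assigns
dat <- read.csv(text = "Deamon,Host,1:2:4,aaa.03\nPixe,Paradigm,1:3:5,11.us",
                header = FALSE, stringsAsFactors = FALSE)

# Split the third column on ":" and store the pieces as a list column
# (V3_set is a hypothetical name chosen for this example)
dat$V3_set <- lapply(strsplit(dat$V3, ":", fixed = TRUE), as.integer)

dat$V3_set[[1]]          # the integer vector for the first row
3 %in% dat$V3_set[[2]]   # set-style membership test on the second row
```

Each element of the list column is an ordinary integer vector, so set functions such as union, intersect and %in% can be applied row by row.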

As noted above, the answer depends on whether the number of separators in the column is consistent. It is more straightforward if that number is consistent. Here's one way to do it, building on Andrie's strsplit answer:
dat <- read.csv("yourData.csv", header=FALSE, stringsAsFactors = FALSE)
#If always going to be a consistent number of separators
dat <- cbind(dat, do.call("rbind", strsplit(dat[, 3], ":")))
V1 V2 V3 V4 1 2 3
1 Deamon Host 1:02:04 aaa.03 1 02 04
2 Pixe Paradigm 1:03:05 11.us 1 03 05
Note that the above is essentially how colsplit.character from package reshape is implemented and may be a better option for you as it forces you to give proper names.
If the number of separators is different, then using rbind.fill is an option from package plyr. rbind.fill expects data.frames which was a bit annoying, and I couldn't figure out how to get a one row data.frame without first converting to a matrix, so I imagine this can be made more efficient, but here's the basic idea:
library(plyr)
x <- c("1:2:4", "1:3:5:6:7")
rbind.fill(
lapply(
lapply(strsplit(x, ":"), matrix, nrow = 1)
, as.data.frame)
)
V1 V2 V3 V4 V5
1 1 2 4 <NA> <NA>
2 1 3 5 6 7
Which can then be cbinded as shown above.
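If you would rather not depend on plyr, the ragged case can probably also be handled in base R by padding each piece with NA before binding. This is only a sketch of that alternative, not how rbind.fill works internally; the names parts, padded and wide are made up here.

```r
x <- c("1:2:4", "1:3:5:6:7")
parts <- strsplit(x, ":", fixed = TRUE)

# Pad every piece to the longest length; `length<-` fills with NA
width <- max(lengths(parts))
padded <- lapply(parts, `length<-`, width)

# Stack the padded vectors as rows of a character matrix, then convert
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
wide
```

The result has one row per input string and as many columns as the longest split, and can be cbinded onto the original data in the same way.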

Try using gsub to replace that character:
R> str <- "1:2:4"
R> str
[1] "1:2:4"
R> gsub(":", ",", str)
[1] "1,2,4"
Make sure the column is a string not a factor beforehand.

Related

Coercing String to Vector

I'm trying to create a calculator that multiplies permutation groups written in cyclic form (the process of which is described in this post, for anyone unfamiliar: https://math.stackexchange.com/questions/31763/multiplication-in-permutation-groups-written-in-cyclic-notation). Although I know this would be easier to do with Python or something else, I wanted to practice writing code in R since it is relatively new to me.
My game plan is to take an input such as "(1 2 3)(2 4 1)" and split it into two separate lists or vectors. However, I am having trouble getting started, because from my understanding of character functions (which I researched here: https://www.statmethods.net/management/functions.html) I will ultimately have to use grep() to find the points where ")(" occurs in my string and split from there. However, grep only takes vectors as its argument, so I am trying to coerce my string into a vector. In researching this problem, I have mostly seen people suggest as.integer(unlist(str_split())); however, this doesn't work for me because when I split, not everything is an integer and the values become NA, as seen in this example.
library(tidyverse)
x <- "(1 2 3)(2 4 1)"
x <- as.integer(unlist(str_split(x," ")))
x
Is there an alternative way to turn a string into a vector when there are not just integers involved? I also realize that the way I am trying to split up the two permutations is very roundabout, but of the character functions I researched, this seems like the only way. If there are other functions that would make this easier, please let me know.
Thank you!
Comments in the code.
x <- "(1 2 3)(2 4 1)"
out1 <- strsplit(x, split = ")(", fixed = TRUE)[[1]] # split on close and open bracket
out2 <- gsub("[()]", replacement = "", out1) # remove brackets (inside [] the parentheses are literal; a | there would also be matched literally)
out3 <- strsplit(out2, " ") # tease out numbers between spaces
lapply(out3, as.integer)
[[1]]
[1] 1 2 3
[[2]]
[1] 2 4 1
There aren't really any scalars in R. Single values like 1, TRUE, and "a" are all 1-element vectors. grep(pattern, x) will work fine on your original string. As a starting point for getting towards your desired goal, I would suggest splitting the groups using:
> str_extract_all(x, "\\([0-9 ]+\\)")
[[1]]
[1] "(1 2 3)" "(2 4 1)"
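For completeness, the same idea can be carried through to integer vectors without stringr. This is a rough base R sketch; the names groups and cycles are made up for illustration, and it assumes the input contains only digits, spaces and parentheses.

```r
x <- "(1 2 3)(2 4 1)"

# A single string is already a length-1 character vector, so grepl() accepts it
grepl(")(", x, fixed = TRUE)

# Pull out each parenthesised group, then strip the brackets and split on spaces
groups <- regmatches(x, gregexpr("\\([0-9 ]+\\)", x))[[1]]
cycles <- lapply(strsplit(gsub("[()]", "", groups), " ", fixed = TRUE), as.integer)
cycles
```

cycles is then a list with one integer vector per permutation cycle, which is a convenient shape for implementing the multiplication.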
If we need to split the strings with the brackets
strsplit(x, "(?<=\\))(?=\\()", perl = TRUE)[[1]]
#[1] "(1 2 3)" "(2 4 1)"
Or we can use a convenient wrapper from qdapRegex
library(qdapRegex)
ex_round(x, include.marker = TRUE)[[1]]
#[1] "(1 2 3)" "(2 4 1)"
alternative: using library(magrittr)
x <- "(1 2 3)(2 4 1)"
x %>%
gsub("^\\(","c(",.) %>% gsub("\\)\\(","),c(",.) %>% gsub("(?=\\s\\d)",", ",.,perl=T) %>%
paste0("list(",.,")") %>% {eval(parse(text=.))}
result:
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 2 4 1
You could use chartr with read.table :
read.table(text= chartr("()"," \n",x))
# V1 V2 V3
# 1 1 2 3
# 2 2 4 1

Convert letters to numbers

I have a bunch of letters, and cannot for the life of me figure out how to convert them to their number equivalent.
letters[1:4]
Is there a function
numbers['e']
which returns
5
or something user defined (ie 1994)?
I want to convert all 26 letters to a specific value.
I don't know of a "pre-built" function, but such a mapping is pretty easy to set up using match. For the specific example you give, matching a letter to its position in the alphabet, we can use the following code:
myLetters <- letters[1:26]
match("a", myLetters)
[1] 1
It is almost as easy to associate other values to the letters. The following is an example using a random selection of integers.
# assign values for each letter, here a sample from 1 to 2000
set.seed(1234)
myValues <- sample(1:2000, size=26)
names(myValues) <- myLetters
myValues[match("a", names(myValues))]
a
228
Note also that this method can be extended to ordered collections of letters (strings) as well.
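As a small illustration of that last point: match is vectorised, so it can look up several letters, or the letters of a whole word, in one call. The helper name word_positions below is made up, and the sketch assumes lowercase input.

```r
# match() is vectorised, so several letters can be looked up at once
match(c("e", "a", "z"), letters)

# A word can be split into single characters first
# (word_positions is a hypothetical helper name)
word_positions <- function(word) match(strsplit(word, "")[[1]], letters)
word_positions("cab")
```

Characters that are not in letters (uppercase, digits, punctuation) come back as NA.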
You could try this function:
letter2number <- function(x) {utf8ToInt(x) - utf8ToInt("a") + 1L}
Here's a short test:
letter2number("e")
#[1] 5
set.seed(123)
myletters <- letters[sample(26,8)]
#[1] "h" "t" "j" "u" "w" "a" "k" "q"
unname(sapply(myletters, letter2number))
#[1] 8 20 10 21 23 1 11 17
The function takes the UTF-8 code of the letter passed to it, subtracts the UTF-8 code of the letter "a", and adds one so that numbering starts at 1 rather than 0, in line with R's indexing convention.
The code works because the numeric sequence of the utf8 codes representing letters respects the alphabetic order.
For capital letters you could use, accordingly,
LETTER2num <- function(x) {utf8ToInt(x) - utf8ToInt("A") + 1L}
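Since utf8ToInt returns one code point per character of a single string, the sapply call can probably be avoided by pasting the letters together first. A sketch under the assumption that the input contains only lowercase a to z; letters2numbers is a made-up name.

```r
# letters2numbers is a hypothetical name; it assumes lowercase a-z input only
letters2numbers <- function(x) {
  # Collapse to one string so utf8ToInt() returns one code point per letter
  utf8ToInt(paste(x, collapse = "")) - utf8ToInt("a") + 1L
}

letters2numbers(c("h", "t", "j"))
```

This works for a vector of single letters as well as for one multi-letter string.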
The which function seems appropriate here.
which(letters == 'e')
#[1] 5
Create a lookup vector and use simple subsetting:
x <- letters[1:4]
lookup <- setNames(seq_along(letters), letters)
lookup[x]
#a b c d
#1 2 3 4
Use unname if you want to remove the names.
thanks for all the ideas, but I am a dumdum.
Here's what I did. Made a mapping from each letter to a specific number, then called each letter
df=data.frame(L=letters[1:26],N=rnorm(26))
df[df$L=='e',2]

R lists of characters to one data.frame

I've been looking around for quite a while now, but can't seem to solve this problem, although I feel like it should be an easy one.
I have 54 factors containing differing amounts of strings, names of pathways to be exact. For example, here are two factors with the elements they contain:
> PWe1
[1] Gene_Expression
[2] miR-targeted_genes_in_muscle_cell_-_TarBase
[3] Generic_Transcription_Pathway
> PWe2
[1] miR-targeted_genes_in_epithelium_-_TarBase
[2] miR-targeted_genes_in_leukocytes_-_TarBase
[3] miR-targeted_genes_in_lymphocytes_-_TarBase
[4] miR-targeted_genes_in_muscle_cell_-_TarBase
What I would like to do is take these, and combine them into one big data frame with 54 columns, where each column has the names of one corresponding factor. I've tried cbind, cbind.data.frame and a couple of other options but those return numeric values instead of strings.
Expected output:
PWe1 PWe2
Gene_Expression miR-targeted_genes_in_epithelium_-_TarBase
miR-targeted_genes_in_muscle_cell_-_TarBase miR-targeted_genes_in_leukocytes_-_TarBase
Generic_Transcription_Pathway miR-targeted_genes_in_lymphocytes_-_TarBase
NA miR-targeted_genes_in_muscle_cell_-_TarBase
I'm quite a beginner when it comes to R, could anyone nudge me towards a possible solution?
Thanks in advance!
lst <- mget(ls(pattern="PW")) #<--- Create list with all necessary vectors.
ind <- lengths(lst) #<--- Length of each vector
as.data.frame(do.call(cbind,
lapply(lst, `length<-`, max(ind)))) #<--- Pad to the maximum length, then convert to data.frame
# PWe1 PWe2
# 1 Gene_Expression miR-targeted_genes_in_epithelium_-_TarBase
# 2 miR-targeted_genes_in_muscle_cell_-_TarBase miR-targeted_genes_in_leukocytes_-_TarBase
# 3 Generic_Transcription_Pathway miR-targeted_genes_in_lymphocytes_-_TarBase
# 4 <NA> miR-targeted_genes_in_muscle_cell_-_TarBase
l1 <- max(length(v1), length(v2))
length(v1) <- l1
length(v2) <- l1
cbind(as.character(v1), as.character(v2))
# [,1] [,2]
#[1,] "Gene_Expression" "miR-targeted_genes_in_epithelium_-_TarBase"
#[2,] "miR-targeted_genes_in_muscle_cell_-_TarBase" "miR-targeted_genes_in_leukocytes_-_TarBase"
#[3,] "Generic_Transcription_Pathway" "miR-targeted_genes_in_lymphocytes_-_TarBase"
#[4,] NA "miR-targeted_genes_in_muscle_cell_-_TarBase"
If you convert your factors to characters before you use cbind, you don't get numeric values:
testFrame <- data.frame(cbind(as.character(PWe1), as.character(PWe3)))
If the lengths of the vectors differ, cbind throws a warning and elements of the shorter vector are recycled. If that is unsatisfying in your case, maybe a data.frame is not the right structure?
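The two ideas in this thread (convert factors to character, and pad to a common length) can be combined so that nothing is recycled and no factor codes leak through. The following is only a sketch with shortened, made-up stand-ins for PWe1 and PWe2; with the real data the list would come from mget as shown earlier.

```r
# Shortened stand-ins for two of the 54 factors (values are illustrative only)
PWe1 <- factor(c("Gene_Expression", "Generic_Transcription_Pathway"))
PWe2 <- factor(c("pathway_a", "pathway_b", "pathway_c"))
lst <- list(PWe1 = PWe1, PWe2 = PWe2)

# Convert factors to character first so the labels, not the codes, are kept
chr <- lapply(lst, as.character)

# Pad each vector with NA up to the longest one, then build the data frame
n <- max(lengths(chr))
out <- as.data.frame(lapply(chr, `length<-`, n), stringsAsFactors = FALSE)
out
```

Each column keeps the name of the factor it came from, and shorter columns end in NA.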

Paste column values together in a data frame

I am trying to paste together the row name along with the data in the desired columns. I wrote the following code but could not find a way to do it correctly.
The desired output will be: "a,1,11" "b,2,22" "c,3,33"
x = data.frame(cbind(f1 = c(1,2,3), f2 = c(5,6,7), f3=c(11,22,33)), row.names= c('a','b','c'))
x
# f1 f2 f3
# a 1 5 11
# b 2 6 22
# c 3 7 33
do.call("paste", c(rownames(x), x[c('f1','f3')], sep=","))
# [1] "a,b,c,1,11" "a,b,c,2,22" "a,b,c,3,33"
Two main points:
Use apply instead of do.call(paste, .)
Use cbind instead of c in this case.
If you would rather use c, you would need to coerce the row names to a list or column first, eg: c(list(rownames(x)), x)
Try the following:
apply(cbind(rownames(x), x[c('f1','f3')]), 1, paste, collapse=",")
a b c
"a,1,11" "b,2,22" "c,3,33"
Your do.call instructs R to paste the list c(rownames(x), x[c('f1','f3')]) together. But take a look at your list.
> c(rownames(x), x[c('f1','f3')])
[[1]]
[1] "a"
[[2]]
[1] "b"
[[3]]
[1] "c"
$f1
[1] 1 2 3
$f3
[1] 11 22 33
The c command takes the elements of each argument and joins them together. This properly deconstructs x[c('f1','f3')] but also deconstructs rownames(x) in a way you don't want. Obeying the standard recycling rule, paste then takes an item from each list element and patches them together with sep=",".
You could fix this by encapsulating rownames(x) inside a list structure so that your list of arguments comes out properly:
do.call("paste", c(list(rownames(x)), x[c('f1','f3')], sep=","))
No need for do.call or apply:
paste(rownames(x),x[[1]],x[[3]] , sep=",")
[1] "a,1,11" "b,2,22" "c,3,33"
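One difference between the apply and paste approaches that may matter on other data: apply converts the data frame to a character matrix first, and as far as I can tell numeric columns are then run through format(), which pads values to a common width. A small sketch with made-up data (y is not from the question) to show the effect:

```r
# Hypothetical data where one column mixes one- and two-digit values
y <- data.frame(f1 = c(1, 10), row.names = c("a", "b"))

# apply() first converts the data frame to a character matrix, and numeric
# columns are passed through format(), which pads them to a common width
via_apply <- apply(cbind(rownames(y), y), 1, paste, collapse = ",")

# paste() on the columns themselves does no padding
via_paste <- paste(rownames(y), y$f1, sep = ",")

via_apply
via_paste
```

With the data in the question every column has values of equal width, so both approaches give the same strings there.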

Extracting unique numbers from string in R

I have a list of strings which contain random characters such as:
list=list()
list[1] = "djud7+dg[a]hs667"
list[2] = "7fd*hac11(5)"
list[3] = "2tu,g7gka5"
I'd like to know which numbers are present at least once (unique()) in this list. The solution of my example is:
solution: c(7,667,11,5,2)
If someone has a method that does not consider 11 as "eleven" but as "one and one", it would also be useful. The solution in this condition would be:
solution: c(7,6,1,5,2)
(I found this post on a related subject: Extracting numbers from vectors of strings)
For the second answer, you can use gsub to remove everything from the string that's not a number, then split the string as follows:
unique(as.numeric(unlist(strsplit(gsub("[^0-9]", "", unlist(ll)), ""))))
# [1] 7 6 1 5 2
For the first answer, similarly using strsplit,
unique(na.omit(as.numeric(unlist(strsplit(unlist(ll), "[^0-9]+")))))
# [1] 7 667 11 5 2
PS: don't name your variable list (as there's an inbuilt function list). I've named your data as ll.
Here is yet another answer, this one using gregexpr to find the numbers, and regmatches to extract them:
l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")
temp1 <- gregexpr("[0-9]", l) # Individual digits
temp2 <- gregexpr("[0-9]+", l) # Numbers with any number of digits
as.numeric(unique(unlist(regmatches(l, temp1))))
# [1] 7 6 1 5 2
as.numeric(unique(unlist(regmatches(l, temp2))))
# [1] 7 667 11 5 2
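If both variants are needed regularly, the two calls can be wrapped in one small function. This is just a convenience sketch around the gregexpr/regmatches approach above; the function name and the single_digits argument are made up.

```r
# extract_unique_numbers is a hypothetical helper name
extract_unique_numbers <- function(x, single_digits = FALSE) {
  # "[0-9]" matches one digit at a time, "[0-9]+" matches whole runs of digits
  pattern <- if (single_digits) "[0-9]" else "[0-9]+"
  found <- regmatches(x, gregexpr(pattern, x))
  as.numeric(unique(unlist(found)))
}

l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")
extract_unique_numbers(l)                        # whole numbers
extract_unique_numbers(l, single_digits = TRUE)  # individual digits
```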
A solution using stringi
library(stringi)
# extract the numbers:
nums <- stri_extract_all_regex(list, "[0-9]+")
# Make vector and get unique numbers:
nums <- unlist(nums)
nums <- unique(nums)
And that's your first solution
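For the second solution I would use substr. Note that this keeps only the first digit of each number, which is enough for this example but would miss a later digit that appears nowhere else: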
nums_first <- sapply(nums, function(x) unique(substr(x,1,1)))
You could use ?strsplit (as suggested in @Arun's answer in Extracting numbers from vectors (of strings)):
l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")
## split string at non-digits
s <- strsplit(l, "[^[:digit:]]")
## convert strings to numeric ("" become NA)
solution <- as.numeric(unlist(s))
## remove NA and duplicates
solution <- unique(solution[!is.na(solution)])
# [1] 7 667 11 5 2
A stringr solution with str_match_all and piped operators. For the first solution:
library(stringr)
str_match_all(ll, "[0-9]+") %>% unlist %>% unique %>% as.numeric
Second solution:
str_match_all(ll, "[0-9]") %>% unlist %>% unique %>% as.numeric
(Note: I've also called the list ll)
Use strsplit with a pattern that matches anything other than the digits 0-9
For the example you have provided, do this:
tmp <- sapply(list, function (k) strsplit(k, "[^0-9]"))
Then simply take a union of all 'sets' in the list, like so:
tmp <- Reduce(union, tmp)
Then you only have to remove the empty string.
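Put together, the whole approach might look like the sketch below. The variable names are made up (strings is used instead of list to avoid masking the built-in function), and the final conversion to numeric is an addition not spelled out above.

```r
strings <- list("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")

# Split each string at every non-digit character
pieces <- lapply(strings, function(k) strsplit(k, "[^0-9]")[[1]])

# Union of all pieces across the list (still contains the empty string)
all_pieces <- Reduce(union, pieces)

# Drop the empty string and convert to numbers
numbers <- as.numeric(all_pieces[all_pieces != ""])
numbers
```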
Check out the str_extract_numbers() function from the strex package.
pacman::p_load(strex)
list=list()
list[1] = "djud7+dg[a]hs667"
list[2] = "7fd*hac11(5)"
list[3] = "2tu,g7gka5"
charvec <- unlist(list)
print(charvec)
#> [1] "djud7+dg[a]hs667" "7fd*hac11(5)" "2tu,g7gka5"
str_extract_numbers(charvec)
#> [[1]]
#> [1] 7 667
#>
#> [[2]]
#> [1] 7 11 5
#>
#> [[3]]
#> [1] 2 7 5
unique(unlist(str_extract_numbers(charvec)))
#> [1] 7 667 11 5 2
Created on 2018-09-03 by the reprex package (v0.2.0).

Resources