I have a numeric vector imported from Excel that is formatted in a "weird" way. For example, 12.000 stands for 12000. I want to convert all numeric values that have decimals into whole values (in this example by multiplying by 1000, since R reads 12.000 as 12 and what I really want is 12000). I've tried converting the vector to character and then manipulating it to add the zeros. I don't think this is the best way, but what I'm trying looks like this:
vec <- c(12.000, 5.300, 5.000, 33.400, 340, 3200)
vec <- as.character(vec)
> vec
[1] "12" "5.3" "5" "33.4" "340" "3200"
x <- "([0 -9]{1})"
xx <- "([0 -9]{2})"
x.x <- "([0 -9]{1}\\.[0 -9]{1})"
xx.x <- "([0 -9]{2}\\.[0 -9]{1})"
I created these regular expressions so that I could set up a condition: if grep(x, vec) matches, then do paste0("000", vec) for the elements that satisfy the condition. My idea is to do this for all possible cases, which are: add "000" for matches of x or xx, and add "00" for matches of x.x or xx.x.
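A rough sketch of that idea (only as an illustration; it appends the zeros rather than prepending them, and collapses the four patterns into two) might look like this:
# sketch only: short integers get "000" appended, one-decimal values lose the "." and get "00"
vec <- as.character(c(12.000, 5.300, 5.000, 33.400, 340, 3200))
isInt <- grepl("^[0-9]{1,2}$", vec)             # "12", "5"     -> append "000"
isDec <- grepl("^[0-9]{1,2}\\.[0-9]{1}$", vec)  # "5.3", "33.4" -> drop ".", append "00"
vec[isInt] <- paste0(vec[isInt], "000")
vec[isDec] <- paste0(sub("\\.", "", vec[isDec]), "00")
as.numeric(vec)
# [1] 12000  5300  5000 33400   340  3200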
Does anyone have an idea of what I could do, or a simpler approach?
Thank you!!
You need to read the vector in as character in the first place. If you read it as numeric, R interprets each value as a number and drops the trailing zeros after the decimal point.
df <- read.csv(text= "Index, Vec
1, 12.00
2, 5.3
3, 5
4, 33.4
5, 340
6, 3200",
colClasses = c("numeric", "character"))
# multiply only the values that contain a decimal point by 1000
isDot <- grepl("\\.", df$Vec)
df$Vec[isDot] <- as.numeric(df$Vec[isDot]) * 1000
df$Vec <- as.numeric(df$Vec)
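For the sample above, the result should look roughly like this:
df$Vec
# [1] 12000  5300     5 33400   340  3200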
I'm trying to find patterns in a set of strings, as in the following example:
"2100780D001378FF01E1000000040000--------01A456000000------------"
"3100782D001378FF03E1008100040000--------01A445800000------------"
If I use the standard get_pattern() from the bpa library, since it looks at each string individually, I will get
"9999999A999999AA99A9999999999999--------99A999999999------------"
But my idea would be to find something like:
"X10078XD001378FF0XE100XX00040000--------01A4XXX00000------------"
The main objective is to find the set of strings with the most similar "pattern".
My first idea was to calculate the hamming distance between them and then analyze the groups resulting from that distance, but it gets tedious. Is there any "automatic" approach?
Any idea of how I can accomplish this mission?
For your sample data, the code below works... no idea how it scales to production, though.
library( data.table )
#sample data
data <- data.table( name = c("2100780D001378FF01E1000000040000--------01A456000000------------",
"3100782D001378FF03E1008100040000--------01A445800000------------"))
# name
# 1: 2100780D001378FF01E1000000040000--------01A456000000------------
# 2: 3100782D001378FF03E1008100040000--------01A445800000------------
#use data.table::tstrsplit() to split the string to individual characters
l <- lapply( data.table::tstrsplit( data$name, ""), function(x) {
#if the same character appears in all strings at the same position, return the character, else return 'X'
if ( length( unique( x ) ) == 1 ) as.character(x[1]) else "X"
})
#paste it all together
paste0(l, collapse = "")
# [1] "X10078XD001378FF0XE100XX00040000--------01A4XXX00000------------"
Small explanation:
data.table::tstrsplit( data$name, "") returns the following list
[[1]]
[1] "2" "3"
[[2]]
[1] "1" "1"
[[3]]
[1] "0" "0"
etc...
Using lapply(), you can loop over this list and determine the length of the vector of unique elements. If this length == 1, the same character exists in all strings at this position, so return that character.
If the length is > 1, multiple characters appear at this position in different strings, so return "X".
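For example, for the first two positions of the sample data:
length(unique(c("2", "3")))   # 2 -> the strings differ here, so this position becomes "X"
length(unique(c("1", "1")))   # 1 -> same character in every string, so keep "1"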
Update
If you are after the hamming distances, use the stringdist package:
library(stringdist)
m <- stringdist::stringdistmatrix(a = data$name, b = data$name, method = "hamming")
# [,1] [,2]
# [1,] 0 8
# [2,] 8 0
#to get the minimum value for each row, exclude the diagonal first (by making it NA)
# and then find the position with the minimum value
diag(m) <- NA
apply( m, 1, which.min )
# [1] 2 1
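If the actual goal is to group the most similar strings, one possible follow-up (my addition, not from the original answer) is to feed the hamming distances into hierarchical clustering; the cutoff of 10 is just an assumption you would tune:
d <- stringdist::stringdistmatrix(data$name, method = "hamming")  # single argument -> a 'dist' object
hc <- hclust(d)
cutree(hc, h = 10)   # strings within hamming distance ~10 of each other fall into the same group
# [1] 1 1             (with only the two sample strings, both land in one group)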
Here is a base R solution, where a custom function findPat is defined and Reduce() is applied to find the common pattern among a set of strings, i.e.,
findPat <- function(s1, s2){
  # convert both strings to their integer code points
  r1 <- utf8ToInt(s1)
  r2 <- utf8ToInt(s2)
  # wherever the code points differ, replace the character with "X"
  r1[bitwXor(r1, r2) != 0] <- utf8ToInt("X")
  pat <- intToUtf8(r1)
}
pat <- Reduce(findPat,list(s1,s2,s3))
such that
> pat
[1] "X10078XDX0X378FF0XE100XX00040000--------01AXXXXX0000------------"
DATA
s1 <- "2100780D001378FF01E1000000040000--------01A456000000------------"
s2 <- "3100782D001378FF03E1008100040000--------01A445800000------------"
s3 <- "4100781D109378FF03E1008100040000--------01A784580000------------"
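Side note (my addition, not part of the original answer): Reduce() also accepts a plain character vector, so if all the strings sit in one vector or data frame column, the same call reduces the whole set at once:
strs <- c(s1, s2, s3)        # or e.g. df$name for a data frame column
pat <- Reduce(findPat, strs)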
What I tried to do:
In the aphid package there is a function deriveHMM(), which needs to be fed a list like:
x <- list(c("c"="10.0", "b"="5.0","c"="10.0", "a"="1.0", "a"="2.0",...))
which needs to be created from a very large input vector like
iv <- c(10, 5, 10, 1, 2,...)
It is important, that the order of my original input vector remains unchanged.
I need to create this list automatically from a large input of doubles read from a .csv file (importing the doubles into R worked fine). Each double has to get a name depending on which predefined range it falls into, for example:
all doubles ranging from 0 to 2.5 should be named "a"
all doubles ranging from 2.5 to 7.5 should be named "b"
all doubles greater than 7.5 should be named "c"
and after that, all doubles should be converted to character (i.e. strings) so that deriveHMM() accepts the input.
I would be very happy to have suggestions. I am new to R and this is my first post on Stackoverflow.com. I am not an experienced programmer, but I try my best to understand your help.
EDIT:
Updated the question, because what I need is a "list of named character vectors", exactly like in my example above, without changing the order.
This solution uses findInterval to get an index into a tags vector, the vector of names.
set.seed(1234) # Make the results reproducible
x <- runif(10, 0, 20)
tags <- letters[1:3]
breaks <- c(0, 2.5, 7.5, Inf)
names(x) <- tags[findInterval(x, breaks)]
x
# a c c c c
# 2.2740682 12.4459881 12.1854947 12.4675888 17.2183077
# c a b c c
#12.8062121 0.1899151 4.6510101 13.3216752 10.2850228
Edit.
If you need x to be of class "character", get the index into tags first, then coerce x to character and only then assign the names attribute.
i <- findInterval(x, breaks)
x <- as.character(x)
names(x) <- tags[i]
x
# a c c
# "2.27406822610646" "12.4459880962968" "12.1854946576059"
# c c c
# "12.4675888335332" "17.2183076711372" "12.8062121057883"
# a b c
#"0.189915127120912" "4.65101012028754" "13.321675164625"
# c
# "10.2850228268653"
Here is an example, where x represents your input vector.
x <- seq(1, 10, 0.5)
The first step is to give your elements names depending on their values.
names(x) <- ifelse(x <= 2.5, "a", ifelse(x > 2.5 & x <= 7.5, "b", "c"))
Next, split your vector and apply as.character(). We can use by() here.
lst <- by(x, names(x), as.character, simplify = TRUE)
is.list(lst)
# [1] TRUE
Result
lst
#names(x): a
#[1] "1" "1.5" "2" "2.5"
#-----------------------------------------------------------------------------------------------------------------------
#names(x): b
# [1] "3" "3.5" "4" "4.5" "5" "5.5" "6" "6.5" "7" "7.5"
#-----------------------------------------------------------------------------------------------------------------------
#names(x): c
#[1] "8" "8.5" "9" "9.5" "10"
You could also use split() and lapply() as shown below; by() is essentially shorthand for this approach.
lapply(split(x, names(x)), as.character)
I have data in matrices, the matrices are stored in a list, and I want the sum of a specific row in each matrix.
Some example data:
A1<-matrix(0:9, nrow=5, ncol=2)
A2<-matrix(10:19, nrow=5, ncol = 2)
A3<-matrix(20:29, nrow=5, ncol = 2)
Mylist<-list(A1, A2, A3)
I can get the sum of all rows in each matrix with
lapply(Mylist, function(x) apply(x, 1, sum) )
but I only want the sum of a specific row; it could be row 1 or row 4, depending on what I want to look at. I know I can read it off the results generated by the code above, but I want a cleaner solution that only gives me those sums. Thanks
You can use purrr::map().
If you know the output type (in this case it seems to be all integers), you can be more specific and use map_int(). With map() you'll get a list back; with a type-specific variant like map_int(), you get a vector back instead.
library(tidyverse)
ix <- 3 # let's say we want the sum of the third row
map_int(Mylist, ~sum(.x[ix, ]))
[1] 9 29 49
If the row index you care about changes per matrix, you can use map2() instead, which takes two inputs of the same length:
ixs <- c(1, 2, 3)
map2_int(Mylist, ixs, ~sum(.x[.y, ]))
[1] 5 27 49
Alternatively, if you need to work in base R, you can just sum the desired row (here, ix) directly; you don't need apply() inside lapply():
lapply(Mylist, function(x) sum(x[ix, ]))
[[1]]
[1] 9
[[2]]
[1] 29
[[3]]
[1] 49
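If you would rather get a plain numeric vector back in base R (a small variation of my own, not from the original answer), vapply() collapses the list:
vapply(Mylist, function(m) sum(m[ix, ]), numeric(1))
# [1]  9 29 49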
one.row.sum <- function(mat.list, row.num) lapply(mat.list, function(m) sum(m[row.num, ]))
one.row.sum(Mylist, 1)
[[1]]
[1] 5
[[2]]
[1] 25
[[3]]
[1] 45
I have a bunch of letters, and cannot for the life of me figure out how to convert them to their number equivalent.
letters[1:4]
Is there a function
numbers['e']
which returns
5
or something user-defined (i.e. 1994)?
I want to convert all 26 letters to a specific value.
I don't know of a "pre-built" function, but such a mapping is pretty easy to set up using match. For the specific example you give, matching a letter to its position in the alphabet, we can use the following code:
myLetters <- letters[1:26]
match("a", myLetters)
[1] 1
It is almost as easy to associate other values with the letters. The following is an example using a random selection of integers.
# assign values for each letter, here a sample from 1 to 2000
set.seed(1234)
myValues <- sample(1:2000, size=26)
names(myValues) <- myLetters
myValues[match("a", names(myValues))]
a
228
Note also that this method can be extended to ordered collections of letters (strings) as well.
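For instance, one way to handle a whole string (a sketch of my own, not from the original answer) is to split it into single letters first:
word <- "cab"
match(strsplit(word, "")[[1]], myLetters)
# [1] 3 1 2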
You could try this function:
letter2number <- function(x) {utf8ToInt(x) - utf8ToInt("a") + 1L}
Here's a short test:
letter2number("e")
#[1] 5
set.seed(123)
myletters <- letters[sample(26,8)]
#[1] "h" "t" "j" "u" "w" "a" "k" "q"
unname(sapply(myletters, letter2number))
#[1] 8 20 10 21 23 1 11 17
The function takes the UTF-8 code of the letter passed to it, subtracts the UTF-8 code of the letter "a", and adds one so that R's indexing convention is respected, i.e. the numbering of the letters starts at 1, not at 0.
The code works because the numeric sequence of the UTF-8 codes representing the letters follows alphabetic order.
For capital letters you could use, accordingly,
LETTER2num <- function(x) {utf8ToInt(x) - utf8ToInt("A") + 1L}
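A small extra (my addition, not from the answer): if the input can be mixed case, lower-casing first keeps a single function working for both:
letter2num_any <- function(x) {utf8ToInt(tolower(x)) - utf8ToInt("a") + 1L}
letter2num_any("E")
#[1] 5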
The which function seems appropriate here.
which(letters == 'e')
#[1] 5
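Since which() handles one letter at a time, a quick vectorised wrapper (my addition) could be:
sapply(c("e", "a", "z"), function(l) which(letters == l))
# e  a  z
# 5  1 26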
Create a lookup vector and use simple subsetting:
x <- letters[1:4]
lookup <- setNames(seq_along(letters), letters)
lookup[x]
#a b c d
#1 2 3 4
Use unname if you want to remove the names.
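For example:
unname(lookup[x])
#[1] 1 2 3 4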
Thanks for all the ideas, but I am a dumdum.
Here's what I did: I made a mapping from each letter to a specific number, then looked up each letter.
df <- data.frame(L = letters[1:26], N = rnorm(26))
df[df$L == 'e', 2]
I have a .csv file containing 22,388 rows of comma-separated numbers. I want to find all possible combinations of pairs of the numbers for each row separately and list them pair by pair, so that I'll be able to make a visual representation of them as clusters.
An example of two rows from my file would be
"2, 13"
"2, 8, 6"
When I use the str() function, R says the file contains factors. I guess the values need to be integers, but I need the rows to stay separate, which is why I've wrapped each row in " ".
I want the possible combinations of pairs for each row, like this:
2, 13
2, 8
2, 6
8, 6
I've already gotten an answer from @flodel saying:
Sample input - replace textConnection(...) with your csv filename.
csv <- textConnection("2,13
2,8,6")
This reads the input into a list of values:
input.lines <- readLines(csv)
input.values <- strsplit(input.lines, ',')
This creates a nested list of pairs:
pairs <- lapply(input.values, combn, 2, simplify = FALSE)
This puts everything in a nice matrix of integers:
pairs.mat <- matrix(as.integer(unlist(pairs)), ncol = 2, byrow = TRUE)
pairs.mat
But I need the function to run through each row in my .csv file separately, so I think I need to wrap it in a for loop - I just can't get my head around it.
Thanks in advance.
Not sure exactly what you're after but maybe something like this:
dat <- readLines(n=2) #read in your data
2, 13
2, 8, 6
## split each string on "," and then remove white space
## and put into a list with lapply
dat2 <- lapply(dat, function(x) {
as.numeric(gsub("\\s+", "", unlist(strsplit(x, ","))))
})
## find all combinations using outer() (faster than
## expand.grid, and we can take just a triangle)
dat3 <- lapply(dat2, function(x) {
y <- outer(x, x, paste)
c(y[upper.tri(y)])
})
## then split on the spaces and convert back to numeric
## stored as a list
lapply(strsplit(unlist(dat3), " "), as.numeric)
## > lapply(strsplit(unlist(dat3), " "), as.numeric)
## [[1]]
## [1] 2 13
##
## [[2]]
## [1] 2 8
##
## [[3]]
## [1] 2 6
##
## [[4]]
## [1] 8 6
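To run this over the whole .csv file instead of two typed-in lines, readLines(n=2) can be replaced by reading the file; a rough sketch (the filename "myfile.csv" is an assumption, and every row is assumed to contain at least two numbers):
dat <- readLines("myfile.csv")   # one element per row of the file
dat <- gsub('"', "", dat)        # strip the surrounding quotes, if present
dat2 <- lapply(dat, function(x) {
  as.numeric(gsub("\\s+", "", unlist(strsplit(x, ","))))
})
## pairs per row, then flattened into a two-column integer matrix
pairs <- lapply(dat2, combn, 2, simplify = FALSE)
pairs.mat <- matrix(as.integer(unlist(pairs)), ncol = 2, byrow = TRUE)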