convert character string into integer for modulo operation - r

I want to map md5 hashed character strings to weekday numbers (0-6) via modulo operation. Therefore I need to transform the character hashes into integers (numeric). I haven't found a way to output the hashes in byte form instead of ascii strings (via digest package). Any hints with base R or different approaches appreciated.

If you really want to do this, you'll need multiple-precision arithmetic, because a single md5 hash has 128 bits, which is too large for R's 32-bit integer type (and too large to be represented exactly as a double). This can be done using the gmp package.
library(digest)
library(gmp)
as.integer(do.call(c, lapply(
  strsplit(sapply(letters, digest, 'md5'), ''),
  function(x) sum(as.bigz(match(x, c(0:9, letters[1:6])) - 1) * as.bigz(16)^((length(x) - 1):0))
)) %% 7)
## [1] 3 2 1 1 5 5 5 5 1 4 4 6 5 3 5 4 0 2 0 4 5 4 6 3 6 1
Let's break that down:
sapply(letters,digest,'md5')
## a b c ...
## "127a2ec00989b9f7faf671ed470be7f8" "ddf100612805359cd81fdc5ce3b9fbba" "6e7a8c1c098e8817e3df3fd1b21149d1" ...
I wanted to design this algorithm to be vectorized, and decided to use the built-in letters vector as 26 arbitrary input values for demonstration purposes. Unfortunately the dream of a fully vectorized algorithm (i.e. with no hidden loops) was dashed right away, since digest() is not vectorized for some reason, which is why I had to use sapply() here to produce a vector of md5 hashes corresponding to the inputs.
strsplit(...,'')
## $a
## [1] "1" "2" "7" "a" "2" "e" "c" "0" "0" "9" "8" "9" "b" "9" "f" "7" "f" "a" "f" "6" "7" "1" "e" "d" "4" "7" "0" "b" "e" "7" "f" "8"
##
## $b
## [1] "d" "d" "f" "1" "0" "0" "6" "1" "2" "8" "0" "5" "3" "5" "9" "c" "d" "8" "1" "f" "d" "c" "5" "c" "e" "3" "b" "9" "f" "b" "b" "a"
##
## $c
## [1] "6" "e" "7" "a" "8" "c" "1" "c" "0" "9" "8" "e" "8" "8" "1" "7" "e" "3" "d" "f" "3" "f" "d" "1" "b" "2" "1" "1" "4" "9" "d" "1"
## ...
Splits the hashes into character vectors, each element being one hex digit of the hash. We now have a list of 26 character vectors.
lapply(..., function(x) ... )
Process each character vector one at a time. Diving into the function (example output will be given for the value of x corresponding to input string 'a'):
match(x,c(0:9,letters[1:6]))-1
## [1] 1 2 7 10 2 14 12 0 0 9 8 9 11 9 15 7 15 10 15 6 7 1 14 13 4 7 0 11 14 7 15 8
This returns the value of each digit as a plain old integer, by finding the index within the hex digit sequence (c(0:9,letters[1:6])) and subtracting one.
as.bigz(...)
## Big Integer ('bigz') object of length 32:
## [1] 1 2 7 10 2 14 12 0 0 9 8 9 11 9 15 7 15 10 15 6 7 1 14 13 4 7 0 11 14 7 15 8
Cast to big integer, required for the arithmetic we're about to do.
...*as.bigz(16)^((length(x)-1):0)
## Big Integer ('bigz') object of length 32:
## [1] 21267647932558653966460912964485513216 2658455991569831745807614120560689152 581537248155900694395415588872650752 51922968585348276285304963292200960 649037107316853453566312041152512
## [6] 283953734451123385935261518004224 15211807202738752817960438464512 0 0 2785365088392105618523029504
## [11] 154742504910672534362390528 10880332376531662572355584 831136500985057557610496 42501298345826806923264 4427218577690292387840
## [16] 129127208515966861312 17293822569102704640 720575940379279360 67553994410557440 1688849860263936
## [21] 123145302310912 1099511627776 962072674304 55834574848 1073741824
## [26] 117440512 0 720896 57344 1792
## [31] 240 8
Treating the hash as a big-endian hex number, multiply each digit value by its place value.
sum(...)
## Big Integer ('bigz') :
## [1] 24560512346470571536449760694956189688
Add up each place-value-weighted digit value to get the bigz representation of the hash.
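To see the place-value arithmetic on a small scale, here is a sketch that applies the same steps to a two-digit hex string using ordinary R numbers. The input "1f" is an illustrative value, not one of the hashes above, and at this size gmp is not needed.

```r
x <- strsplit("1f", "")[[1]]                   # split into hex digits: "1" "f"
digits <- match(x, c(0:9, letters[1:6])) - 1   # digit values: 1 15
place <- 16^((length(x) - 1):0)                # place values: 16 1
value <- sum(digits * place)                   # 1*16 + 15*1
value
```

The result agrees with base R's strtoi("1f", base = 16L), which only works for values that fit in a 32-bit integer.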
This completes the lapply() function. Thus, coming out of the lapply() call is a list of bigz values corresponding to the hashes:
lapply(..., function(x) ... )
## $a
## Big Integer ('bigz') :
## [1] 24560512346470571536449760694956189688
##
## $b
## Big Integer ('bigz') :
## [1] 295010738308890763454498908323798711226
##
## $c
## Big Integer ('bigz') :
## [1] 146851381511772731860674382282097773009
## ...
do.call(c,...)
## Big Integer ('bigz') object of length 26:
## [1] 24560512346470571536449760694956189688 295010738308890763454498908323798711226 146851381511772731860674382282097773009 277896596675540352347406615789605003835 196274166648971101707441276945175337351
## [6] 152164057440943545205375583549802787690 177176961461451259509149953911555923867 104722841650969351697149582356678916643 338417919426764038104581950237023359466 337938589168387959049175020406476846763
## [11] 182882473465429367490220828342074920857 80661780033646501757972845962914093977 251563583963884775614900275564391350478 279860001817578054753205218523665183571 158142488666995307556311659134646734337
## [16] 116423801372716526262639744414150237351 97172586736798383425273805088952414146 316382305028166656556246910315962582893 245775506345085992020540282526076959865 96713787940004003047734284080139522561
## [21] 227309401343419671779216095382349119699 250431221767618781785406207793096585421 33680856367414392588062933086110875192 119974848773126933055729663395967301868 296965764652868210844163281547943654188
## [26] 118199003122415992890118393158735259681
This "unlists" the list. Note: I tried sapply() instead of lapply(), and alternatively unlist(), and neither worked. This is probably related to the bigz class, possibly to the fact that a vector of bigz values is actually weirdly encoded as a single vector of raw.
...%%7
## Big Integer ('bigz') object of length 26:
## [1] 3 2 1 1 5 5 5 5 1 4 4 6 5 3 5 4 0 2 0 4 5 4 6 3 6 1
Finally, we take each value modulo 7.
as.integer(...)
## [1] 3 2 1 1 5 5 5 5 1 4 4 6 5 3 5 4 0 2 0 4 5 4 6 3 6 1
Last step is to convert back to plain old integer from bigz.
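As an aside, if only the remainder is needed (and not the full 128-bit value), the big integers can be avoided by reducing modulo 7 after every hex digit (Horner's rule). This is a different technique from the gmp approach above, sketched in base R only. The name hex_mod is a hypothetical helper, and the two hash strings are copied from the sapply() output shown earlier.

```r
# hex_mod() is a hypothetical helper, not part of digest or gmp.
# It processes the hex digits left to right and keeps only the running
# remainder, so no intermediate value ever exceeds 16 * (m - 1) + 15.
hex_mod <- function(hash, m) {
  digits <- strtoi(strsplit(hash, "")[[1]], base = 16L)  # hex digit values 0..15
  r <- 0L
  for (d in digits) r <- (r * 16L + d) %% m
  r
}

# md5 strings for inputs 'a' and 'b', taken from the output above
hashes <- c(a = "127a2ec00989b9f7faf671ed470be7f8",
            b = "ddf100612805359cd81fdc5ce3b9fbba")
weekday <- vapply(hashes, hex_mod, integer(1), m = 7L)
weekday
```

For these two hashes the remainders are 3 and 2, matching the first two values of the gmp result.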

Related

Appending lists with corresponding values from tibbles

Suppose my data is organized as a list of tibbles, with a corresponding tibble that provides further info. Row "a" in infos thus refers to tibble "a" in the list.
library(tibble)
list_in <- list(a=tibble(I=c(6:10),
                         II=c(2:6),
                         III=letters[1:5]),
                b=tibble(I=c(1:5),
                         II=c(2:6),
                         III=letters[2:6]),
                c=tibble(I=c(7:11),
                         II=c(3:7),
                         III=letters[5:9]))
infos <- tibble(id=c("a","b","c"),
                weights=c(1:3),
                grades=letters[4:6])
In order to do further calculations, is there a way to use lapply or a loop to append list_in, so that list_out also contains the corresponding values from infos? The expected output would look like this:
# install.packages("rlist")
library(rlist)
list_out <- list((list.append(list_in$a, weights=infos$weights[1], grades=infos$grades[1])),
(list.append(list_in$b, weights=infos$weights[2], grades=infos$grades[2])),
(list.append(list_in$c, weights=infos$weights[3], grades=infos$grades[3])))
but this way to get there feels very awkward and only works for very small data sets.
Thanks in advance!
You can use lapply and c() to append each tibble with the corresponding row of infos.
list_out2 <- lapply(names(list_in), \(x) {
c(list_in[[x]], infos[infos$id == x, -1])
})
all.equal(list_out, list_out2)
# [1] TRUE
list_out2
[[1]]
[[1]]$I
[1] 6 7 8 9 10
[[1]]$II
[1] 2 3 4 5 6
[[1]]$III
[1] "a" "b" "c" "d" "e"
[[1]]$weights
[1] 1
[[1]]$grades
[1] "d"
[[2]]
[[2]]$I
[1] 1 2 3 4 5
[[2]]$II
[1] 2 3 4 5 6
[[2]]$III
[1] "b" "c" "d" "e" "f"
[[2]]$weights
[1] 2
[[2]]$grades
[1] "e"
[[3]]
[[3]]$I
[1] 7 8 9 10 11
[[3]]$II
[1] 3 4 5 6 7
[[3]]$III
[1] "e" "f" "g" "h" "i"
[[3]]$weights
[1] 3
[[3]]$grades
[1] "f"
You can do a left_join between each tibble in the list and the extra info (this uses dplyr):
library(dplyr)
append_info <- function(n) {
  out <- list_in[[n]] %>%
    mutate(id = n) %>%
    left_join(infos, by = 'id') %>%
    select(-id)
  return(out)
}
lapply(names(list_in), append_info)
Using Map
Map(c, list_in, split(infos[-1], infos$id))
-output
$a
$a$I
[1] 6 7 8 9 10
$a$II
[1] 2 3 4 5 6
$a$III
[1] "a" "b" "c" "d" "e"
$a$weights
[1] 1
$a$grades
[1] "d"
$b
$b$I
[1] 1 2 3 4 5
$b$II
[1] 2 3 4 5 6
$b$III
[1] "b" "c" "d" "e" "f"
$b$weights
[1] 2
$b$grades
[1] "e"
$c
$c$I
[1] 7 8 9 10 11
$c$II
[1] 3 4 5 6 7
$c$III
[1] "e" "f" "g" "h" "i"
$c$weights
[1] 3
$c$grades
[1] "f"
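One caveat with the Map approach: split() orders its groups by the sorted values of id, so it lines up with list_in only when the list is in that same order. A minimal sketch of a name-based variant follows, using plain data frames and made-up data in which the list order differs from the id order. The names info_by_id and list_out are illustrative.

```r
# Made-up data: the list order (b, a) differs from the sorted id order (a, b)
list_in <- list(b = data.frame(I = 1:2), a = data.frame(I = 3:4))
infos <- data.frame(id = c("a", "b"), weights = 1:2, grades = c("d", "e"))

info_by_id <- split(infos[-1], infos$id)             # one-row data frames named by id
list_out <- Map(c, list_in, info_by_id[names(list_in)])  # align by name, not position
str(list_out)
```

Indexing info_by_id with names(list_in) makes each element pick up its own row of infos regardless of order.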

How to compare vectors with different structures

I have two vectors (fo, fo2) and I would like to compare if the numbers are matching between them (such as with intersect(fo,fo2)).
However, fo and fo2 can't be compared directly. fo is numeric (each element is typed into c() ) while fo2 is read from a string such as "1 3 6 7 8 10 11 13 14 15".
The vectors are printed below for illustration. Any help is greatly appreciated!
# fo is a vector
> fo <- c(1,3,6,7,8,9,10,11)
> fo
[1]  1  3  6  7  8  9 10 11
> is.vector(fo)
[1] TRUE
# fo2 is also a vector
> library(stringr)
> fo2 <- str_split("1 3 6 7 8 10 11 13 14 15", " ")
> fo2
[[1]]
[1] "1" "3" "6" "7" "8" "10" "11" "13" "14" "15"
> is.vector(fo2)
[1] TRUE
> intersect(fo,fo2)
list()
fo2 is a list, whereas fo is an atomic vector, so extract the list's first element before intersecting, e.g.
intersect(fo , fo2[[1]])
#> [1] "1" "3" "6" "7" "8" "10" "11"
To learn the difference, see Vectors.
Another option:
fo %in% fo2[[1]]
Output:
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
Check with setdiff:
setdiff(fo, fo2[[1]])
Output:
[1] 9
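If numeric results are wanted rather than character ones, the string can be converted once up front. A minimal sketch in base R (strsplit instead of stringr; fo2_num is an illustrative name):

```r
fo <- c(1, 3, 6, 7, 8, 9, 10, 11)

# Split the string on spaces, take the single list element, convert to numeric
fo2_num <- as.numeric(strsplit("1 3 6 7 8 10 11 13 14 15", " ")[[1]])

common <- intersect(fo, fo2_num)   # values present in both vectors
only_fo <- setdiff(fo, fo2_num)    # values in fo that are missing from fo2_num
common
only_fo
```

Both vectors are now numeric, so intersect() and setdiff() return numbers instead of strings.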

how to select only integer values of a column [duplicate]

My data has many columns with different names. I want to see only the numeric values in column name_id and store them in z.
z should contain only the numeric values of column name_id; any entry containing letters should not be stored in z.
z <- unique(data$name_id)
z
#[1] 10 11 12 13 14 3 4 5 6 7 8 9
#Levels: 10 11 12 13 14 3 4 5 6 7 8 9 a b c d e f
When I tried this:
z <- unique(as.numeric(data$name_id))
z
# [1] 1 2 3 4 5 6 7 8 9 10 11 12
the output only goes up to 12, but the column also has values greater than 12.
Treating your data as a character vector:
> b
[1] "1" "2" "3" "4" "5" "13" "14" "15" "45" "567" "999" "Name" "Age"
Apply this (str_extract comes from the stringr package):
> library(stringr)
> regexp <- "[[:digit:]]+"
> z <- str_extract(b , regexp)
> z[is.na(z)] <- ""
> z
[1] "1" "2" "3" "4" "5" "13" "14" "15" "45" "567" "999" "" ""
Hope this helps.
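Since name_id in the question is a factor, as.numeric() returns the internal level codes rather than the labels, which is why the output stopped at 12. A minimal base R sketch of converting via character first; the name_id vector below is reconstructed from the levels printed in the question and is only illustrative.

```r
# Illustrative factor with the same levels as in the question
name_id <- factor(c("10", "11", "12", "13", "14", "3", "4", "5", "6",
                    "7", "8", "9", "a", "b", "c", "d", "e", "f"))

codes <- as.numeric(name_id)   # level codes 1..18, not the labels

# Convert the labels instead; non-numeric labels become NA (warning suppressed)
vals <- suppressWarnings(as.numeric(as.character(name_id)))
z <- unique(vals[!is.na(vals)])
z
```

Here z holds the actual numbers 10 to 14 and 3 to 9, and the letter entries are dropped.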

Find the min and max from an unstructured array

I have the following vector, which shows the possible values that a variable can take. As you can see, it's not user-friendly and I'm having a hard time finding a systematic way of going through and identifying the min and max values. Does anyone have any suggestions?
[211] "-1\n1-960" "-1\n1-960"
[213] "-1\n1-960" "-1\n1\n2\n3"
[215] "-1\n0\n1\n\n2\n3\n\n4\n\n5" "-1\nF\nG\nH\nP\nR\nS\nU"
[217] "-1\n0\n1\n2\n3" "-1\n0\n1"
[219] "-1\n0\n1\n2\n3\n4\n5\n6" "-1\n0-255"
[221] "-1\n0-255" "-1\n0-255"
[223] "-1\n0-255" "-1\n0-255"
[225] "-1\n0\n0.01–0.99\n1\n1.01–99.99" "-1\n0\n1\n2\n3\n4\n5\n\n6\n\n7\n8\n\n9\n10\n11\n12"
[227] "-1\n0\n1\n\n2\n\n3\n4\n5\n\n6" "-1\n0\n1\n2\n\n3\n\n4\n5\n6"
The value "-1\n1-960" refers to the possible range of values being between 1 and 960. -1 doesn't mean anything and should be disregarded, along with all letters.
For example:
"-1\n1-960"
"-1\n0\n1\n\n2\n\n3\n4\n5\n\n6" "-1\n0\n1\n2\n\n3\n\n4\n5\n6"
Should result in:
max min
960 1
6 0
6 0
After removing the leading -1, you can split on newlines. Then, since a - means a range, you can also split on - characters, as the two numbers give the min and max of the range. So here's some code:
lapply(
  strsplit(
    gsub('^-1\n', '', dat),
    '\n|-'
  ),
  function(x) range(x)
)
[[1]]
[1] "1" "960"
[[2]]
[1] "1" "960"
[[3]]
[1] "1" "960"
[[4]]
[1] "1" "3"
[[5]]
[1] "" "5"
[[6]]
[1] "F" "U"
[[7]]
[1] "0" "3"
[[8]]
[1] "0" "1"
[[9]]
[1] "0" "6"
[[10]]
[1] "0" "255"
[[11]]
[1] "0" "255"
[[12]]
[1] "0" "255"
[[13]]
[1] "0" "255"
[[14]]
[1] "0" "255"
[[15]]
[1] "0" "1.01–99.99"
[[16]]
[1] "" "9"
[[17]]
[1] "" "6"
[[18]]
[1] "" "6"
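Note that range() on character vectors compares strings, which is why results such as "" or "9" appear above. A hedged sketch of a numeric variant follows: it also treats the en dash as a range separator and drops empty and non-numeric pieces. The helper name num_range and the shortened dat are illustrative, not part of the original answer.

```r
# num_range() is a hypothetical helper: numeric min/max for one entry
num_range <- function(x) {
  x <- sub("^-1\n", "", x)                       # drop the leading -1 marker
  x <- gsub("\u2013", "-", x, fixed = TRUE)      # treat en dash like a hyphen
  parts <- strsplit(x, "\n|-")[[1]]              # split on newlines and range dashes
  nums <- suppressWarnings(as.numeric(parts))    # letters and "" become NA
  nums <- nums[!is.na(nums)]
  if (length(nums) == 0) return(c(min = NA_real_, max = NA_real_))
  c(min = min(nums), max = max(nums))
}

# A few entries in the same format as the question's vector
dat <- c("-1\n1-960",
         "-1\n0\n1\n\n2\n\n3\n4\n5\n\n6",
         "-1\nF\nG\nH",
         "-1\n0\n0.01\u20130.99\n1\n1.01\u201399.99")
res <- t(sapply(dat, num_range, USE.NAMES = FALSE))
res
```

Because every hyphen is treated as a separator, this sketch assumes that no value other than the leading -1 is negative.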
Expanding my comment with additional code which might or might not be a partial answer:
I'm guessing that -255 is some sort of missing value marker. Some of those character values (at the moment) could be parsed in R as "numeric" values, but others would throw an error if you tried to parse them as such. What were you expecting from 1-960? That's an expression, so neither numeric nor character.
dat <- c( "-1\n1-960" , "-1\n1-960",
"-1\n1-960" , "-1\n1\n2\n3" ,
"-1\n0\n1\n\n2\n3\n\n4\n\n5" , "-1\nF\nG\nH\nP\nR\nS\nU",
"-1\n0\n1\n2\n3" , "-1\n0\n1" ,
"-1\n0\n1\n2\n3\n4\n5\n6" , "-1\n0-255" ,
"-1\n0-255" , "-1\n0-255" ,
"-1\n0-255" , "-1\n0-255" ,
"-1\n0\n0.01–0.99\n1\n1.01–99.99" , "-1\n0\n1\n2\n3\n4\n5\n\n6\n\n7\n8\n\n9\n10\n11\n12" ,
"-1\n0\n1\n\n2\n\n3\n4\n5\n\n6" , "-1\n0\n1\n2\n\n3\n\n4\n5\n6" )
scandat <- sapply( dat, function(x) try( scan(textConnection(x)) ) )
# Lots of error messages, but wrapping the scan call in try lets it continue
# So these are the items that could be parsed as numeric:
> scandat[ sapply(scandat,class)=="numeric" ]
$`-1\n1\n2\n3`
[1] -1 1 2 3
$`-1\n0\n1\n\n2\n3\n\n4\n\n5`
[1] -1 0 1 2 3 4 5
$`-1\n0\n1\n2\n3`
[1] -1 0 1 2 3
$`-1\n0\n1`
[1] -1 0 1
$`-1\n0\n1\n2\n3\n4\n5\n6`
[1] -1 0 1 2 3 4 5 6
$`-1\n0\n1\n2\n3\n4\n5\n\n6\n\n7\n8\n\n9\n10\n11\n12`
[1] -1 0 1 2 3 4 5 6 7 8 9 10 11 12
$`-1\n0\n1\n\n2\n\n3\n4\n5\n\n6`
[1] -1 0 1 2 3 4 5 6
$`-1\n0\n1\n2\n\n3\n\n4\n5\n6`
[1] -1 0 1 2 3 4 5 6
I'm not cleaning this up, but you could replace the funky names with something else and it would print better:
> sapply( scandat[ sapply(scandat,class)=="numeric" ], function(x) list(minx=min(x), maxx=max(x) )
+ )
-1\n1\n2\n3 -1\n0\n1\n\n2\n3\n\n4\n\n5 -1\n0\n1\n2\n3 -1\n0\n1 -1\n0\n1\n2\n3\n4\n5\n6
minx -1 -1 -1 -1 -1
maxx 3 5 3 1 6
-1\n0\n1\n2\n3\n4\n5\n\n6\n\n7\n8\n\n9\n10\n11\n12 -1\n0\n1\n\n2\n\n3\n4\n5\n\n6 -1\n0\n1\n2\n\n3\n\n4\n5\n6
minx -1 -1 -1
maxx 12 6 6

R - Splitting a column text into 2 columns without delimiter

I need to manipulate the following data frame (data) so that the PATCH_CODE column is split into 2 resulting columns where the 1st column contains the letter of the string and the 2nd column contains the number as in the 2nd example dataframe below.
EDIT: PATCH_CODE is not always two characters; occasionally it is a single letter, in which case I need to force a 1 into the resulting CODE column.
initial data frame: head(data,4)
PATCH_CODE TERR PC1
A1 MENS_10 0.8629186
A3 MENS_10 -0.2703238
B1 MENS_10 0.9516067
B2 MENS_10 -0.1722446
resulting data frame:
PATCH CODE TERR PC1
A 1 MENS_10 0.8629186
A 3 MENS_10 -0.2703238
B 1 MENS_10 0.9516067
B 2 MENS_10 -0.1722446
I have seen examples of how to accomplish this when the column to be split has an identifiable text delimiter such as a comma by using colsplit in reshape but I have failed to find a solution for a structure like mine. Is this possible?
output of str(data)
'data.frame': 240 obs. of 3 variables:
$ PATCH_CODE: Factor w/ 42 levels "A","A1","A2",..: 2 3 4 7 8 12 13 16 17 18 ...
$ TERR : Factor w/ 19 levels "MENS_10","MENS_14",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PC1 : num 0.548 1.228 0.273 5.548 3.853 ...
You can use strsplit. Passing an empty string as the delimiter splits at every character.
a <- c("A1", "B1", "C2", "D5", "R3")
strsplit(a, "")
[[1]]
[1] "A" "1"
[[2]]
[1] "B" "1"
[[3]]
[1] "C" "2"
[[4]]
[1] "D" "5"
[[5]]
[1] "R" "3"
If you want to put that in a matrix
> do.call(rbind, strsplit(a, ""))
[,1] [,2]
[1,] "A" "1"
[2,] "B" "1"
[3,] "C" "2"
[4,] "D" "5"
[5,] "R" "3"
By the sounds of your description, strsplit should work fine. If your data are a little more complicated, you can also look at a possible regex-based solution.
For this particular example, try:
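# as.character() is needed if PATCH_CODE is a factor, as in the question's str() output
do.call(rbind, strsplit(as.character(mydf$PATCH_CODE),
                        split = "(?<=[a-zA-Z])(?=[0-9])",
                        perl = TRUE))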
# [,1] [,2]
# [1,] "A" "1"
# [2,] "A" "3"
# [3,] "B" "1"
# [4,] "B" "2"
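Neither approach above covers the single-letter codes mentioned in the question's edit. A minimal base R sketch that does, assuming every code is one or more letters optionally followed by digits. The sample rows are adapted from the question, with a made-up single-letter "B" row added for illustration.

```r
# Sample data adapted from the question; the "B" row is a made-up single-letter case
data <- data.frame(
  PATCH_CODE = factor(c("A1", "A3", "B", "B2")),
  TERR = "MENS_10",
  PC1 = c(0.8629186, -0.2703238, 0.9516067, -0.1722446)
)

code <- as.character(data$PATCH_CODE)    # factor -> character before string work
data$PATCH <- sub("[0-9]+$", "", code)   # letter part: strip trailing digits
num <- sub("^[A-Za-z]+", "", code)       # number part: strip leading letters
num[num == ""] <- "1"                    # single-letter codes get a forced 1
data$CODE <- as.integer(num)

data[c("PATCH", "CODE", "TERR", "PC1")]
```

Using two sub() calls instead of a split keeps one value per row even when the number is missing, so no row ends up with a recycled or misaligned column.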
