How to compare vectors with different structures - r

I have two vectors (fo, fo2) and I would like to check which numbers match between them (for example with intersect(fo,fo2)).
However, fo and fo2 can't be compared directly. fo is numeric (each element is typed into c()), while fo2 is read from a string such as "1 3 6 7 8 10 11 13 14 15".
The vectors and their output are shown here for illustration. Any help is greatly appreciated!
# fo is a vector
> fo <- c(1,3,6,7,8,9,10,11)
> fo
[1]  1  3  6  7  8  9 10 11
> is.vector(fo)
[1] TRUE
# fo2 is also a vector
> library(stringr)
> fo2 <- str_split("1 3 6 7 8 10 11 13 14 15", " ")
> fo2
[[1]]
[1] "1" "3" "6" "7" "8" "10" "11" "13" "14" "15"
> is.vector(fo2)
[1] TRUE
> intersect(fo,fo2)
list()

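fo2 here is a list (str_split() returns a list with one character vector per input string), whereas fo is an atomic vector. To get the intersection, extract the first list element, e.g.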
intersect(fo , fo2[[1]])
#> [1] "1" "3" "6" "7" "8" "10" "11"
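Note that the result is a character vector, because fo2[[1]] holds strings. To learn the difference between atomic vectors and lists, see Vectors.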

Another option:
fo %in% fo2[[1]]
Output:
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
Check with setdiff:
setdiff(fo, fo2[[1]])
Output:
[1] 9
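If a numeric result is wanted, one option is to convert the split string to numbers before comparing. This is a minimal sketch using base R's strsplit() in place of stringr::str_split(); the names fo2_num and common are made up for illustration.

```r
fo <- c(1, 3, 6, 7, 8, 9, 10, 11)

# Split the string, take the single list element, and convert it to numeric
fo2_num <- as.numeric(strsplit("1 3 6 7 8 10 11 13 14 15", " ")[[1]])

# Both inputs are numeric now, so the intersection is numeric as well
common <- intersect(fo, fo2_num)

# Values of fo that do not occur in fo2_num
missing_from_fo2 <- setdiff(fo, fo2_num)
```

With both sides numeric, the result can be used directly in arithmetic or for indexing.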

Related

r - How to assign a list of results to a list of names

How does one take a list of character values and a list of numeric values and use the characters as labels? For example, if I have a list of 30 numbers and 10 characters, how do I insert those characters into the list of numbers with even spacing?
a
b
c
1
2
3
4
5
6
7
8
9
turn this into:
a
1
2
3
b
1
...
The input letters are formatted like:
c("a", "b", "c")
and the numbers:
"x"
"1" 0.56
"x"
"1" 0.45
"x"
"1" 0.44
"x"
"1" 0.67
"x"
"1" 0.29
"x"
"1" 0.02
"x"
"1" 0.13
"x"
"1" 0.15
As I mentioned in the comment, I'm unclear on what data types you want for the input and output. I'm taking a guess here:
c <-c("a","b","c") # some character
n <- c(1,2,3,4 ,5,6,7,8,9) # some numeric
m <- 1
for(i in 1:length(c)){
print(c[i])
print(n[m]); m <- m+1
print(n[m]); m <- m+1
}
[1] "a"
[1] 1
[1] 2
[1] "b"
[1] 3
[1] 4
[1] "c"
[1] 5
[1] 6
Or, to collect the values in a vector instead of printing them:
m <- 1
a <- character()
for(i in 1:length(c)){
a <- c(a,(c[i]))
a <- c(a,n[m]); m <- m+1
a <- c(a,n[m]); m <- m+1
}
> a
[1] "a" "1" "2" "b" "3" "4" "c" "5" "6"
We can build a matrix and unwrap it:
x <- letters[1:3]
y <- 1:9
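# one column per label; ncol (rather than nrow) also works when length(y)/length(x) differs from length(x)
c(rbind(x,matrix(y,ncol=length(x))))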
# [1] "a" "1" "2" "3" "b" "4" "5" "6" "c" "7" "8" "9"
Or we could use split<- :
lx <- length(x)
ly <- length(y)
res <- character(lx+ly)
split(res,c(1,rep(2,ly/lx))) <- list(x,y)
res
# [1] "a" "1" "2" "3" "b" "4" "5" "6" "c" "7" "8" "9"
c(1,rep(2,ly/lx)) builds the grouping factor, which is recycled along res: group 1 marks the positions of the labels and group 2 the positions of the numbers. The original values are then assigned to their new positions.
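The matrix approach can be wrapped in a small helper. This is only a sketch: interleave_labels is a made-up name, and it assumes the number of values is an exact multiple of the number of labels.

```r
# Hypothetical helper: put each label in front of its share of the values.
interleave_labels <- function(labels, values) {
  # The values must split evenly across the labels
  stopifnot(length(values) %% length(labels) == 0)
  # One column per label: row 1 holds the labels, the rows below hold that
  # label's values. c() then reads the matrix column by column.
  c(rbind(labels, matrix(values, ncol = length(labels))))
}

res_two_each   <- interleave_labels(c("a", "b", "c"), 1:6)
res_three_each <- interleave_labels(c("a", "b", "c"), 1:9)
```

Because labels and numbers end up in one vector, the numbers are coerced to character, as in the outputs above.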

R Writing to data frame from inside for-loop

Brand new to R programming, so please forgive me if I'm using the wrong terminology.
I'm trying to insert/append values to a data frame from inside a for-loop.
I can get the right values if I just print() them, but when I try to put them inside the data frame, I get mostly NAs. If I run this code, it prints out the values I want.
output <- data.frame()
for (i in seq_along(Reasons)){
assign(paste(Reasons[i]), sum(ER$Reason == paste(Reasons[i])))
Tot <- get(paste(Reasons[i]))
assign(paste(Reasons[i],'ER',sep="_"), sum(grepl("ER|Er", ER$Disposition) & ER$Reason == paste(Reasons[i])))
Er <- get(paste(Reasons[i],'ER',sep="_"))
assign(paste(Reasons[i],'adm',sep="_"), sum(grepl("Admi|admi|ADMI|ADmi", ER$Disposition) & ER$Reason == paste(Reasons[i])))
Adm <- get(paste(Reasons[i],'adm',sep="_"))
assign(paste(Reasons[i],'admrate',sep="_"), sprintf("%.0f%%", (sum(grepl("Admi|admi|ADMI|ADmi", ER$Disposition) & ER$Reason == paste(Reasons[i])))/(sum(ER$Reason == paste(Reasons[i])))*100))
Rate <- get(paste(Reasons[i],'admrate',sep="_"))
print(c(Er,Adm,Tot,Rate))
#clear variables just created
rm(list=ls(pattern=Reasons[i]))
rm(Tot,Er,Adm,Rate)
}
[1] "7" "13" "20" "65%"
[1] "4" "8" "12" "67%"
[1] "12" "12" "24" "50%"
[1] "23" "7" "30" "23%"
[1] "7" "1" "8" "12%"
[1] "3" "1" "4" "25%"
[1] "3" "0" "3" "0%"
[1] "6" "5" "11" "45%"
[1] "2" "9" "11" "82%"
[1] "2" "4" "6" "67%"
[1] "10" "4" "14" "29%"
[1] "5" "0" "5" "0%"
[1] "10" "4" "14" "29%"
[1] "0" "3" "3" "100%"
[1] "7" "3" "10" "30%"
[1] "0" "4" "4" "100%"
But when I use
output <- rbind(output, c(Er, Adm, Tot, Rate))
Instead of
print(c(Er,Adm,Tot,Rate))
I get the first row of values (7, 13, 20, 65%), then all NA's except the "7" in rows 5 and 15... What am I doing wrong?
Thank you in advance
As I don't know what your data look like I cannot reproduce your error. If I understand it correctly, for each value in Reasons you want to find (a) the total number of observations, (b) the number of observations with the string "Er" in the variable Disposition, (c) the number of observations with the string "Admi" in the variable Disposition and (d) the percentage of observations with the string "Admi" in the variable Disposition. If that is the case then you don't have to use assign and get to do this.
Here is a simpler way to do it (although it's not the best way to do it, see below):
## Here I just generated some data that might look like the data
## you are dealing with:
Reasons <- LETTERS[1:10]
ER <- data.frame(Reason = LETTERS[sample.int(10,100, replace = TRUE)],
Disposition = c("ER", "Admi", "SomethingElse")[sample.int(3,100, replace = TRUE)])
output <- data.frame()
for (i in seq_along(Reasons)){
Tot <- sum(ER$Reason ==Reasons[i])
Er <- sum(grepl("ER|Er", ER$Disposition) & (ER$Reason ==Reasons[i]))
Adm <- sum(grepl("Admi|admi|ADMI|ADmi", ER$Disposition) & (ER$Reason ==Reasons[i]))
Rate <- paste(round(Adm/Tot*100), "%")
output <- rbind(output, c(Er, Adm, Tot, Rate))
}
> output
X.4. X.3. X.10. X.30...
1 4 3 10 30 %
2 2 3 6 50 %
3 2 1 6 17 %
4 5 2 14 14 %
5 3 5 11 45 %
6 2 4 11 36 %
7 3 6 14 43 %
8 2 2 5 40 %
9 1 7 11 64 %
10 4 4 12 33 %
Dynamically appending rows to a data frame or matrix is generally not a very good idea as it is quite memory intensive. If you know the dimensions of your matrix beforehand (as you do) you should initialize it with the right size and then fill the entries inside your loop:
## Initialize data:
output <- matrix(nrow = length(Reasons), ncol = 4)
for (i in seq_along(Reasons)){
Tot <- sum(ER$Reason ==Reasons[i])
Er <- sum(grepl("ER|Er", ER$Disposition) & (ER$Reason ==Reasons[i]))
Adm <- sum(grepl("Admi|admi|ADMI|ADmi", ER$Disposition) & (ER$Reason ==Reasons[i]))
Rate <- paste(round(Adm/Tot*100), "%")
output[i,] <- c(Er, Adm, Tot, Rate)
}
There are, however, even simpler ways to do this kind of evaluation. You could, for example, use the dplyr package, where you can group the data by a variable (the different values of ER$Reason in your case) and then evaluate the values you need:
## Load the package 'dplyr'
library(dplyr)
## Group the variable and evaluate:
output <- ER %>% group_by(Reason) %>%
dplyr::summarise(Er = sum(grepl("ER|Er", Disposition)),
Adm = sum(grepl("Admi|admi|ADMI|ADmi", Disposition)),
Tot = n(),
Rate = paste(round(Adm/Tot*100), "%"))
> output
# A tibble: 10 × 5
Reason Er Adm Tot Rate
<chr> <int> <int> <int> <chr>
1 A 4 3 10 30 %
2 B 2 3 6 50 %
3 C 2 1 6 17 %
4 D 5 2 14 14 %
5 E 3 5 11 45 %
6 F 2 4 11 36 %
7 G 3 6 14 43 %
8 H 2 2 5 40 %
9 I 1 7 11 64 %
10 J 4 4 12 33 %
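As for the NAs in the question, a likely explanation (assuming an R version before 4.0, where strings became factors by default) is that the first rbind() turned each value into a factor column whose only level was the first row's value. Later rows with different values then became NA, which would fit the "7" surviving only in rows 5 and 15. A sketch that avoids the problem in base R by building one typed row per reason and binding them once at the end; the sample data here is invented for illustration:

```r
set.seed(1)
Reasons <- LETTERS[1:3]
# Invented sample data, only for illustration
ER <- data.frame(
  Reason = sample(Reasons, 30, replace = TRUE),
  Disposition = sample(c("ER", "Admitted", "Other"), 30, replace = TRUE),
  stringsAsFactors = FALSE
)

rows <- vector("list", length(Reasons))
for (i in seq_along(Reasons)) {
  in_reason <- ER$Reason == Reasons[i]
  Tot <- sum(in_reason)
  Er  <- sum(grepl("ER|Er", ER$Disposition) & in_reason)
  Adm <- sum(grepl("Admi", ER$Disposition, ignore.case = TRUE) & in_reason)
  # Counts stay numeric; only the formatted rate is a string
  rows[[i]] <- data.frame(Reason = Reasons[i], Er = Er, Adm = Adm, Tot = Tot,
                          Rate = sprintf("%.0f%%", 100 * Adm / Tot),
                          stringsAsFactors = FALSE)
}
# Bind all rows in a single call instead of growing the data frame
output <- do.call(rbind, rows)
```

Keeping the counts numeric also means they can still be summed or sorted afterwards.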

Most parsimonious way to make R stopifnot() (or similar) evaluate to TRUE in two cases of NA?

Let's say I have some vectors:
> a=c(1:5, NA, 7:10)
> b=a
> a
[1] 1 2 3 4 5 NA 7 8 9 10
> b
[1] 1 2 3 4 5 NA 7 8 9 10
If I use the stopifnot() function, it raises an error because of the NA values, but I would like it not to do so...
> stopifnot(a==b)
Error: a == b are not all TRUE
> a==b
[1] TRUE TRUE TRUE TRUE TRUE NA TRUE TRUE TRUE TRUE
>
I could modify my vectors so that I get the behaviour that I want
> a[is.na(a)]="missing"
> b[is.na(b)]="missing"
> a
[1] "1" "2" "3" "4" "5" "missing" "7" "8" "9" "10"
> b
[1] "1" "2" "3" "4" "5" "missing" "7" "8" "9" "10"
> stopifnot(a==b)
> a==b
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
>
But then I have the hassle of having to reset the "missing" values back to NA
> a[a=="missing"]=NA
> b[b=="missing"]=NA
> a
[1] "1" "2" "3" "4" "5" NA "7" "8" "9" "10"
> b
[1] "1" "2" "3" "4" "5" NA "7" "8" "9" "10"
And I have to reconvert the type, which is annoying
> typeof(a)
[1] "character"
> typeof(b)
[1] "character"
> a=as.numeric(a)
> b=as.numeric(b)
> a
[1] 1 2 3 4 5 NA 7 8 9 10
> b
[1] 1 2 3 4 5 NA 7 8 9 10
Is there a better way?
Typically you would use identical or, in particular for floating point numbers, the less strict all.equal:
a <- c(1:5, NA, 7:10)
b <- a
stopifnot(isTRUE(all.equal(a, b)))
#no output
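For completeness, here is a sketch of both variants: identical() for an exact whole-object check, and an element-wise comparison in which two NAs in the same position count as equal. The helper name equal_with_na is made up for illustration.

```r
a <- c(1:5, NA, 7:10)
b <- a

# identical() compares the whole objects, so an NA in the same position counts as equal
stopifnot(identical(a, b))

# Hypothetical helper: element-wise equality where two NAs count as a match
# and an NA paired with a non-NA value counts as a mismatch
equal_with_na <- function(x, y) {
  same <- (x == y) | (is.na(x) & is.na(y))
  same[is.na(same)] <- FALSE
  same
}

stopifnot(equal_with_na(a, b))
```

The element-wise version is useful when you also need to know which positions differ, not just whether the vectors match.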

Converting a vector of strings into a numerical vector, based on string-sequences

I have a vector like
A <- c("A","A","B","B", "C","C","C", "D")
I would like to convert it into a numerical vector based on the sequence in A, that would look like:
c(1:2, 3:4, 5:7, 8)
Is this possible?
Try:
A <- c("A","A","B","B", "C","C","C", "D")
as.numeric(factor(A))
[1] 1 1 2 2 3 3 3 4
and in case you really want a sequence from 1 to the length of the vector:
labels(factor(A))
[1] "1" "2" "3" "4" "5" "6" "7" "8"
or
1:length(A)
[1] 1 2 3 4 5 6 7 8
If the first sequence is what you want, you may find plyr::mapvalues interesting in case you have more complicated cases at some point. For instance,
library(plyr)
mapvalues(A, from=unique(A), to=1:4)
[1] "1" "1" "2" "2" "3" "3" "3" "4"
This comes in handy when you need a bit more control. For instance, you could easily supply other output as the to argument, e.g. month.name[1:4].
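One caveat with factor(): group numbers follow the sorted levels, not the order in which values appear. If the numbering should follow first appearance, match() against unique() is a base R alternative. A small sketch with a deliberately unsorted input (the variable names are made up):

```r
A <- c("B", "B", "A", "C", "C")

# factor() sorts its levels, so "A" becomes group 1 although "B" appears first
by_sorted_level <- as.integer(factor(A))

# match() against unique() numbers the groups in order of first appearance
by_first_seen <- match(A, unique(A))
```

Here by_sorted_level is 2 2 1 3 3, while by_first_seen is 1 1 2 3 3. For the vector in the question, which is already sorted, both give the same result.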

convert character string into integer for modulo operation

I want to map md5 hashed character strings to weekday numbers (0-6) via modulo operation. Therefore I need to transform the character hashes into integers (numeric). I haven't found a way to output the hashes in byte form instead of ascii strings (via digest package). Any hints with base R or different approaches appreciated.
If you really want to do this, you'll require multiple-precision arithmetic, because a single md5 hash has 128 bits, which is too large to fit into a normal integer value. This can be done using the gmp package.
library(digest)
library(gmp)
as.integer(do.call(c, lapply(
  strsplit(sapply(letters, digest, 'md5'), ''),
  function(x) sum(as.bigz(match(x, c(0:9, letters[1:6])) - 1) * as.bigz(16)^((length(x) - 1):0))
)) %% 7)
## [1] 3 2 1 1 5 5 5 5 1 4 4 6 5 3 5 4 0 2 0 4 5 4 6 3 6 1
Let's break that down:
sapply(letters,digest,'md5')
## a b c ...
## "127a2ec00989b9f7faf671ed470be7f8" "ddf100612805359cd81fdc5ce3b9fbba" "6e7a8c1c098e8817e3df3fd1b21149d1" ...
I wanted to design this algorithm to be vectorized, and decided to use the built-in letters vector as 26 arbitrary input values for demonstration purposes. Unfortunately the dream of a fully vectorized algorithm (i.e. with no hidden loops) was dashed right away, since digest() is not vectorized for some reason, which is why I had to use sapply() here to produce a vector of md5 hashes corresponding to the inputs.
strsplit(...,'')
## $a
## [1] "1" "2" "7" "a" "2" "e" "c" "0" "0" "9" "8" "9" "b" "9" "f" "7" "f" "a" "f" "6" "7" "1" "e" "d" "4" "7" "0" "b" "e" "7" "f" "8"
##
## $b
## [1] "d" "d" "f" "1" "0" "0" "6" "1" "2" "8" "0" "5" "3" "5" "9" "c" "d" "8" "1" "f" "d" "c" "5" "c" "e" "3" "b" "9" "f" "b" "b" "a"
##
## $c
## [1] "6" "e" "7" "a" "8" "c" "1" "c" "0" "9" "8" "e" "8" "8" "1" "7" "e" "3" "d" "f" "3" "f" "d" "1" "b" "2" "1" "1" "4" "9" "d" "1"
## ...
Splits the hashes into character vectors, each element being one hex digit of the hash. We now have a list of 26 character vectors.
lapply(..., function(x) ... )
Process each character vector one at a time. Diving into the function (example output will be given for the value of x corresponding to input string 'a'):
match(x,c(0:9,letters[1:6]))-1
## [1] 1 2 7 10 2 14 12 0 0 9 8 9 11 9 15 7 15 10 15 6 7 1 14 13 4 7 0 11 14 7 15 8
This returns the value of each digit as a plain old integer, by finding the index within the hex digit sequence (c(0:9,letters[1:6])) and subtracting one.
as.bigz(...)
## Big Integer ('bigz') object of length 32:
## [1] 1 2 7 10 2 14 12 0 0 9 8 9 11 9 15 7 15 10 15 6 7 1 14 13 4 7 0 11 14 7 15 8
Cast to big integer, required for the arithmetic we're about to do.
...*as.bigz(16)^((length(x)-1):0)
## Big Integer ('bigz') object of length 32:
## [1] 21267647932558653966460912964485513216 2658455991569831745807614120560689152 581537248155900694395415588872650752 51922968585348276285304963292200960 649037107316853453566312041152512
## [6] 283953734451123385935261518004224 15211807202738752817960438464512 0 0 2785365088392105618523029504
## [11] 154742504910672534362390528 10880332376531662572355584 831136500985057557610496 42501298345826806923264 4427218577690292387840
## [16] 129127208515966861312 17293822569102704640 720575940379279360 67553994410557440 1688849860263936
## [21] 123145302310912 1099511627776 962072674304 55834574848 1073741824
## [26] 117440512 0 720896 57344 1792
## [31] 240 8
Treating the hash as a big-endian hex number, multiply each digit value by its place value.
sum(...)
## Big Integer ('bigz') :
## [1] 24560512346470571536449760694956189688
Add up each place-value-weighted digit value to get the bigz representation of the hash.
This completes the lapply() function. Thus, coming out of the lapply() call is a list of bigz values corresponding to the hashes:
lapply(..., function(x) ... )
## $a
## Big Integer ('bigz') :
## [1] 24560512346470571536449760694956189688
##
## $b
## Big Integer ('bigz') :
## [1] 295010738308890763454498908323798711226
##
## $c
## Big Integer ('bigz') :
## [1] 146851381511772731860674382282097773009
## ...
do.call(c,...)
## Big Integer ('bigz') object of length 26:
## [1] 24560512346470571536449760694956189688 295010738308890763454498908323798711226 146851381511772731860674382282097773009 277896596675540352347406615789605003835 196274166648971101707441276945175337351
## [6] 152164057440943545205375583549802787690 177176961461451259509149953911555923867 104722841650969351697149582356678916643 338417919426764038104581950237023359466 337938589168387959049175020406476846763
## [11] 182882473465429367490220828342074920857 80661780033646501757972845962914093977 251563583963884775614900275564391350478 279860001817578054753205218523665183571 158142488666995307556311659134646734337
## [16] 116423801372716526262639744414150237351 97172586736798383425273805088952414146 316382305028166656556246910315962582893 245775506345085992020540282526076959865 96713787940004003047734284080139522561
## [21] 227309401343419671779216095382349119699 250431221767618781785406207793096585421 33680856367414392588062933086110875192 119974848773126933055729663395967301868 296965764652868210844163281547943654188
## [26] 118199003122415992890118393158735259681
This "unlists" the list. Note: I tried sapply() instead of lapply(), and alternatively unlist(), and neither worked. This is probably related to the bigz class, possibly to the fact that a vector of bigz values is actually weirdly encoded as a single vector of raw.
...%%7
## Big Integer ('bigz') object of length 26:
## [1] 3 2 1 1 5 5 5 5 1 4 4 6 5 3 5 4 0 2 0 4 5 4 6 3 6 1
And finally we can take each value modulo 7.
as.integer(...)
## [1] 3 2 1 1 5 5 5 5 1 4 4 6 5 3 5 4 0 2 0 4 5 4 6 3 6 1
Last step is to convert back to plain old integer from bigz.
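If only the remainder is needed, the big integers can be avoided altogether by reducing modulo 7 after every hex digit (Horner's method). This is a different technique from the gmp approach above, sketched here in base R under the assumption that the input is a lowercase or uppercase hex string; hex_mod is a made-up name.

```r
# Hypothetical helper: remainder of a hex string modulo a small integer,
# computed digit by digit so the running value never grows large.
hex_mod <- function(hex, modulus = 7L) {
  # Value of each hex digit, most significant first
  digits <- strtoi(strsplit(tolower(hex), "")[[1]], base = 16L)
  # Horner's method: shift the accumulator by one hex place, add the digit, reduce
  Reduce(function(acc, d) (acc * 16L + d) %% modulus, digits, 0L)
}

# The md5 hash shown above for input 'a'
hex_mod("127a2ec00989b9f7faf671ed470be7f8")
```

For that hash this gives 3, matching the first element of the gmp result. To process several hashes, wrap the call in vapply(hashes, hex_mod, integer(1)).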
