Disappearing values in a matrix (but same number of rows) in R

Something incredibly weird is happening with my data. Some values disappear when I convert the data frame to a matrix, even though the number of rows stays the same. Other values are still in the matrix, so I don't understand.
# Data : meps >> https://github.com/JMcrocs/MEPVote/raw/master/meps.rds
> str(meps)
'data.frame': 784 obs. of 2338 variables:
$ mepid: num 197701 197533 197521 187917 124986 ...
$ EPG : chr "GUE.NGL" "GUE.NGL" "GUE.NGL" "GUE.NGL" ...
> mepsMatrix <- as.matrix(meps)
> str(mepsMatrix)
chr [1:784, 1:2338] "197701" "197533" "197521" "187917" "124986" "197529" "197468" " 96706" " 88715" "197416" ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:784] "197701" "197533" "197521" "187917" ...
..$ : chr [1:2338] "mepid" "EPG" "1" "2" ...
> nrow(meps)
[1] 784
> nrow(mepsMatrix)
[1] 784
> 28229 %in% meps[,'mepid']
[1] TRUE
> 28229 %in% mepsMatrix[,'mepid']
[1] FALSE
The weirdest part is that I can find it with the RStudio viewer.
Can someone help me, please? I would be grateful!

Look at this minimal example:
df <- data.frame(a = c(2, 20), b = c("a", "b"))
m <- as.matrix(df)
2 %in% df[, "a"]
#> TRUE
2 %in% m[, "a"]
#> FALSE
" 2" %in% m[, "a"]
#> TRUE
2 %in% trimws(m[, "a"])
#> TRUE
m
#> a b
#> [1,] " 2" "a"
#> [2,] "20" "b"
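When a data frame has at least one non-numeric column, as.matrix converts the whole thing to a character matrix, and numeric columns go through format(), which pads every value in a column to the same width. That is why 2 becomes " 2" here and 28229 no longer matches in your matrix.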
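A minimal sketch of two ways around the padding, assuming a small stand-in for meps (the EPG value "PPE" is made up for illustration): either convert the matrix column back to numeric before comparing, or build the matrix from the numeric columns only so nothing is coerced to character.

```r
# Hypothetical two-row stand-in for the meps data frame
df <- data.frame(mepid = c(28229, 197701),
                 EPG = c("PPE", "GUE.NGL"),
                 stringsAsFactors = FALSE)

m <- as.matrix(df)
m[, "mepid"]
#> [1] " 28229" "197701"

# Option 1: convert back to numeric before comparing
# (as.numeric ignores the leading whitespace)
found_after_conversion <- 28229 %in% as.numeric(m[, "mepid"])

# Option 2: keep only the numeric columns, so the matrix stays numeric
num_m <- as.matrix(df[sapply(df, is.numeric)])
found_in_numeric_matrix <- 28229 %in% num_m[, "mepid"]
```

Option 2 avoids the character conversion entirely, which also keeps the matrix much smaller when most columns are numeric.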

Related

list a combination of 3 numbers in 6 different ways?

From a range of numbers from 001 to 999, I would like to be able to formulate a function where from 001 to 199, the combinations of numbers will be listed in up to 6 different ways. Example 192 as 192, 129, 291, 219, 912, 921. The listing should obviously begin with 001 which will show as: 001, 010, 100.
I'm not sure what format you want the results in.
As commented, these are permutations: combinat::permn is probably the most convenient way to achieve this.
Format a number with zero-padding ("%03d"), split into characters (strsplit(.,"")):
f0 <- function(x) strsplit(sprintf("%03d",x),"")[[1]]
Create all permutations, squash them back into strings (paste/collapse), and select the unique values (e.g. 000 has only one unique value)
f1 <- function(x) unique(sapply(combinat::permn(f0(x)),paste,collapse=""))
Apply to each of the integers
result <- lapply(0:999,f1)
head(result)
[[1]]
[1] "000"
[[2]]
[1] "001" "010" "100"
[[3]]
[1] "002" "020" "200"
[[4]]
[1] "003" "030" "300"
[[5]]
[1] "004" "040" "400"
[[6]]
[1] "005" "050" "500"
Later values do indeed have up to six entries.
You could make vectors of indices with tidyr::crossing or expand.grid:
library(tidyverse)
indices <- crossing(x = 1:3, y = 1:3, z = 1:3) %>%
filter(x != y, x != z, y != z) %>%
pmap(~unname(c(...)))
indices %>% str
#> List of 6
#> $ : int [1:3] 1 2 3
#> $ : int [1:3] 1 3 2
#> $ : int [1:3] 2 1 3
#> $ : int [1:3] 2 3 1
#> $ : int [1:3] 3 1 2
#> $ : int [1:3] 3 2 1
...which you can then use to subset each input vector as you iterate across them:
perms <- pmap(crossing(x = 0:9, y = 0:9, z = 0:9), function(...){
map_chr(indices, function(x) paste(c(...)[x], collapse = "")) %>%
unique()
})
perms[500:510] %>% str(vec.len = 6)
#> List of 11
#> $ : chr [1:3] "499" "949" "994"
#> $ : chr [1:3] "500" "050" "005"
#> $ : chr [1:6] "501" "510" "051" "015" "150" "105"
#> $ : chr [1:6] "502" "520" "052" "025" "250" "205"
#> $ : chr [1:6] "503" "530" "053" "035" "350" "305"
#> $ : chr [1:6] "504" "540" "054" "045" "450" "405"
#> $ : chr [1:3] "505" "550" "055"
#> $ : chr [1:6] "506" "560" "056" "065" "650" "605"
#> $ : chr [1:6] "507" "570" "057" "075" "750" "705"
#> $ : chr [1:6] "508" "580" "058" "085" "850" "805"
#> $ : chr [1:6] "509" "590" "059" "095" "950" "905"
This ultimately is still a lot of iteration, so while it works fast enough for 6000 iterations, a vectorized approach would scale better.
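As a rough sketch of what a more vectorized version could look like in base R (the names digits, orders, perm_mat and perms_vec are made up here, not part of the answer above): build each of the six orderings for all 1000 numbers with one paste0 call, then deduplicate row by row.

```r
# Rows are the digit triples 000, 001, ..., 999 (columns x, y, z)
digits <- as.matrix(expand.grid(z = 0:9, y = 0:9, x = 0:9))[, 3:1]

# The six possible orderings of three positions
orders <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
                c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

# 1000 x 6 character matrix: one column per ordering,
# each column built by a single vectorized paste0 call
perm_mat <- apply(orders, 1, function(o) {
  paste0(digits[, o[1]], digits[, o[2]], digits[, o[3]])
})

# Drop repeats within each row (e.g. 001 has only three distinct orderings)
perms_vec <- lapply(seq_len(nrow(perm_mat)), function(i) unique(perm_mat[i, ]))

perms_vec[[2]]
#> [1] "001" "010" "100"
```

The deduplication step still loops over rows, so this only removes the inner iteration; whether it is faster in practice would need benchmarking.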
Here is a solution that gives the desired output with no duplication and no additional calls to clean up duplicate results. We take advantage of std::next_permutation from the algorithm library in C++, which rearranges its input into the next lexicographical permutation and returns false once it wraps around to the first (sorted) permutation. Starting from sorted input, this means we generate only 3 permutations for 001, 1 permutation for 999, and 6 permutations for 123.
We start by generating all combinations of as.character(0:9) of length 3 with repetition by utilizing gtools::combinations.
## install.packages("gtools")
myCombs <- gtools::combinations(10, 3, as.character(0:9), repeats.allowed = TRUE)
nrow(myCombs)
[1] 220
Here is an Rcpp version that exposes std::next_permutation to R:
## install.packages("Rcpp")
Rcpp::cppFunction(
"CharacterVector permuteDigits(CharacterVector v) {
std::string myStr;
std::vector<std::string> result;
for (std::size_t i = 0; i < v.size(); ++i)
myStr += v[i];
do {
result.push_back(myStr);
} while(std::next_permutation(myStr.begin(), myStr.end()));
return wrap(result);
}"
)
And finally, we bring it all together with lapply:
permutedCombs <- lapply(1:nrow(myCombs), function(x) {
permuteDigits(myCombs[x, ])
})
Here is some sample output:
permutedCombs[1:5]
[[1]]
[1] "000"
[[2]]
[1] "001" "010" "100"
[[3]]
[1] "002" "020" "200"
[[4]]
[1] "003" "030" "300"
[[5]]
[1] "004" "040" "400"
permutedCombs[151:155]
[[1]]
[1] "356" "365" "536" "563" "635" "653"
[[2]]
[1] "357" "375" "537" "573" "735" "753"
[[3]]
[1] "358" "385" "538" "583" "835" "853"
[[4]]
[1] "359" "395" "539" "593" "935" "953"
[[5]]
[1] "366" "636" "663"
And here is proof that we have all 1000 results with no duplications:
sum(lengths(permutedCombs))
[1] 1000
identical(sort(as.integer(do.call(c, permutedCombs))), 0:999)
[1] TRUE

R spread function (error in ... undefined columns selected)

I googled my error, but that didn't help me.
Got a data frame, with a column x.
unique(df$x)
The result is:
[1] "fc_social_media" "fc_banners" "fc_nat_search"
[4] "fc_direct" "fc_paid_search"
When I try this:
df <- spread(data = df, key = x, value = x, fill = "0")
I got the error:
Error in `[.data.frame`(data, setdiff(names(data), c(key_var, value_var))) :
undefined columns selected
But that is very weird, because I used the spread function (in the same script) different times.
So I googled, saw some "solutions":
I removed all the "special" characters. As you can see, my unique
values do not contain special characters (cleaned it). But this didn't
help.
I checked if there are any columns with the same name. But all column names
are unique.
@Gregor, @Akrun:
> str(df)
'data.frame': 100 obs. of 22 variables:
$ visitor_id : chr "321012312666671237877-461170125342559040419" "321012366667112237877-461121705342559040419" "321012366661271237877-461170534255901240419" "321012366612671237877-461170534212559040419" ...
$ visit_num : chr "1" "1" "1" "1" ...
$ ref_domain : chr "l.facebook.com" "X.co.uk" "x.co.uk" "" ...
$ x : chr "fc_social_media" "fc_social_media" "fc_social_media" "fc_social_media" ...
$ va_closer_channel : chr "Social Media" "Social Media" "Social Media" "Social Media" ...
$ row : int 1 2 3 4 5 6 7 8 9 10 ...
$ : chr "0" "0" "0" "0" ...
$ Hard Drive : chr "0" "0" "0" "0" ...
The error could be due to a column without a name, i.e. "" (note the unnamed column in your str output). Using a reproducible example (data at the end):
library(tidyr)
spread(df, x, x)
Error in `[.data.frame`(data, setdiff(names(data), c(key_var,
value_var))) : undefined columns selected
We could make it work by changing the column name
names(df) <- make.names(names(df))
spread(df, x, x, fill = "0")
# X fc_banners fc_direct fc_nat_search fc_paid_search fc_social_media
#1 1 0 0 0 0 fc_social_media
#2 2 fc_banners 0 0 0 0
#3 3 0 0 fc_nat_search 0 0
#4 4 0 fc_direct 0 0 0
#5 5 0 0 0 fc_paid_search 0
data
df <- data.frame(x = c("fc_social_media", "fc_banners",
"fc_nat_search", "fc_direct", "fc_paid_search"), x1 = 1:5, stringsAsFactors = FALSE)
names(df)[2] <- ""
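A small base R sketch for spotting such columns before reshaping, assuming you would rather give them an explicit name than rely on make.names (the "unnamed_" prefix is just an illustrative choice):

```r
# Example data with an empty column name, mirroring the case above
df <- data.frame(x = c("fc_banners", "fc_direct"),
                 x1 = 1:2,
                 stringsAsFactors = FALSE)
names(df)[2] <- ""

# Positions of missing or empty column names
bad <- which(is.na(names(df)) | names(df) == "")
bad
#> [1] 2

# Give each offending column an explicit placeholder name
names(df)[bad] <- paste0("unnamed_", bad)
names(df)
#> [1] "x"         "unnamed_2"
```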

how can I extract numbers from a string in R?

names(score)
[1] "(Intercept)" "aado2_calc(20,180]" "aado2_calc(360,460]"
[4] "aado2_calc(460,629]" "albumin[1,1.8]" "albumin(1.8,2.2]"
[7] "albumin(2.2,2.8]" "aniongap(15,18]" "aniongap(18,20]"
[10] "aniongap(20,22]" "aniongap(22,25]" "aniongap(25,49]"
I want to extract the two numbers within parenthesis (numbers outside the parenthesis are not needed) and there are "(" or "[". the first number will be assigned to an object "low" and the second to "high".
You can use the readr package and the function parse_number for ease of use. For more power you'd want to use something like the base regular expression functions in R, or a package like stringi.
Just like @jake-kaupp said: use stringi :) As you can see from the benchmark below, the stringi solution is shorter, easier to understand and much faster, about 30 times in this test!
Short answer:
arr <- stri_extract_all_regex(x, "(?<=[\\[\\(,])[0-9.]+(?=[\\]\\),])", simplify = NA)
data.frame(low = as.numeric(arr[,1]), high = as.numeric(arr[,2]))
Long answer:
require(stringi)
require(microbenchmark)
grepFun <- function(x){
mat <- regmatches(x,
gregexpr("(?<=[\\[\\(,])[0-9.]+(?=[\\]\\),])", x, perl = TRUE))
newnames <- lapply(mat, function(m) {
if (! length(m)) return(list(low = NA, high = NA))
setNames(as.list(as.numeric(m)), nm = c("low", "high"))
})
do.call(rbind.data.frame, newnames)
}
striFun <- function(x){
arr <- stri_extract_all_regex(x, "(?<=[\\[\\(,])[0-9.]+(?=[\\]\\),])", simplify = NA)
data.frame(low = as.numeric(arr[,1]), high = as.numeric(arr[,2]))
}
# both functions work the same
grepFun(scorenames)
low high
1 NA NA
2 20.0 180.0
3 360.0 460.0
4 460.0 629.0
...
12 25.0 49.0
striFun(scorenames)
low high
1 NA NA
2 20.0 180.0
3 360.0 460.0
4 460.0 629.0
...
12 25.0 49.0
# generating more complicated vector
n <- 10000
x <- stri_paste(stri_rand_strings(n, length = 1:10), sample(c("(","["),n,TRUE),
sample(1000,n,TRUE), ",", sample(1000,n,TRUE), sample(c(")","]"), n, TRUE))
head(x) # check first elements
[1] "O[68,434]" "Ql[783,151)" "Zk0(773,60)" "ETfV(446,518]" "Xixbr(576,855)" "G6QnHu(92,955)"
#short test using new data
grepFun(x[1:6])
low high
1 68 434
2 783 151
3 773 60
4 446 518
5 576 855
6 92 955
striFun(x[1:6])
low high
1 68 434
2 783 151
3 773 60
4 446 518
5 576 855
6 92 955
#and some benchmark to prove performance
microbenchmark(grepFun(x), striFun(x))
Unit: milliseconds
expr min lq mean median uq max neval
grepFun(x) 330.27733 366.09306 416.56330 406.08914 465.29829 568.15250 100
striFun(x) 11.57449 11.97825 13.38157 12.46927 13.67699 25.97455 100
scorenames <- c(
"(Intercept)" ,"aado2_calc(20,180]" ,"aado2_calc(360,460]"
,"aado2_calc(460,629]" ,"albumin[1,1.8]" ,"albumin(1.8,2.2]"
,"albumin(2.2,2.8]" ,"aniongap(15,18]" ,"aniongap(18,20]"
,"aniongap(20,22]" ,"aniongap(22,25]" ,"aniongap(25,49]"
)
The first step might be to extract everything within the "parens"-delimiters (to include (), [], and the comma ,).
mat <- regmatches(scorenames,
gregexpr("(?<=[\\[\\(,])[0-9.]+(?=[\\]\\),])", scorenames, perl = TRUE))
str(mat)
# List of 12
# $ : chr(0)
# $ : chr [1:2] "20" "180"
# $ : chr [1:2] "360" "460"
# $ : chr [1:2] "460" "629"
# $ : chr [1:2] "1" "1.8"
# $ : chr [1:2] "1.8" "2.2"
# $ : chr [1:2] "2.2" "2.8"
# $ : chr [1:2] "15" "18"
# $ : chr [1:2] "18" "20"
# $ : chr [1:2] "20" "22"
# $ : chr [1:2] "22" "25"
# $ : chr [1:2] "25" "49"
From here, we can see that (1) the first one is problematic (no surprise, you need to figure out what you want here), and (2) the rest look about right.
Here's one rough way to process this list. This is very trusting and naïve ... you should probably add checks to ensure the list is of length 2, that everything converts correctly (perhaps in a tryCatch), etc.
newnames <- lapply(mat, function(m) {
if (! length(m)) return(list(low = NA, high = NA))
setNames(as.list(as.numeric(m)), nm = c("low", "high"))
})
str(newnames)
# List of 12
# $ :List of 2
# ..$ low : logi NA
# ..$ high: logi NA
# $ :List of 2
# ..$ low : num 20
# ..$ high: num 180
# $ :List of 2
# ..$ low : num 360
# ..$ high: num 460
# ...snip...
You can turn this into a data.frame with:
head(do.call(rbind.data.frame, newnames))
# low high
# 1 NA NA
# 2 20.0 180.0
# 3 360.0 460.0
# 4 460.0 629.0
# 5 1.0 1.8
# 6 1.8 2.2
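The checks suggested above could be sketched roughly as follows. This is only one possible take, with a made-up function name (parse_bounds): any name that does not yield exactly two matches gets NA for both bounds, and values that fail to convert become NA instead of raising a warning.

```r
# Extract the two interval bounds from names like "albumin(1.8,2.2]".
# Returns a data frame with numeric columns low and high.
parse_bounds <- function(x) {
  pattern <- "(?<=[\\[\\(,])[0-9.]+(?=[\\]\\),])"
  mat <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  rows <- lapply(mat, function(m) {
    # Anything other than exactly two numbers is treated as "no interval"
    if (length(m) != 2) return(c(low = NA_real_, high = NA_real_))
    # Malformed numbers such as "1..2" become NA without a warning
    vals <- suppressWarnings(as.numeric(m))
    c(low = vals[1], high = vals[2])
  })
  as.data.frame(do.call(rbind, rows))
}

bounds <- parse_bounds(c("(Intercept)", "albumin[1,1.8]", "aniongap(25,49]"))
low <- bounds$low
high <- bounds$high
```

Returning fixed-length numeric vectors from every branch keeps both columns numeric even when the first element has no interval.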

converting string to numeric in R

I have a problem regarding data conversion using R language.
I have two data objects stored in variables named lung.X and lung.y; below is a description of my data.
> str(lung.X)
chr [1:86, 1:7129] " 170.0" " 104.0" " 53.7" " 119.0" " 105.5" " 130.0" ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:86] "V3" "V4" "V5" "V6" ...
..$ : chr [1:7129] "A28102_at" "AB000114_at" "AB000115_at" "AB000220_at" ...
and
> str(lung.y)
num [1:86] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
lung.X is a matrix (row: 86 col: 7129) and lung.y is an array of numbers (86 entries)
Does anyone know how to convert the above data into the format below?
> str(lung.X)
num [1:86, 1:7129] 170 104 53.7 119 105.5 130...
I thought I should do like this
lung.X <- as.numeric(lung.X)
but I got this instead
> str(lung.X)
num [1:613094] 170 104 53.7 119 105.5 130...
I am doing this because I need lung.X to be numeric only.
Thank you.
You could change the mode of your matrix to numeric:
## example data
m <- matrix(as.character(1:10), nrow=2,
dimnames = list(c("R1", "R2"), LETTERS[1:5]))
m
# A B C D E
# R1 "1" "3" "5" "7" "9"
# R2 "2" "4" "6" "8" "10"
str(m)
# chr [1:2, 1:5] "1" "2" "3" "4" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:2] "R1" "R2"
# ..$ : chr [1:5] "A" "B" "C" "D" ...
# NULL
mode(m) <- "numeric"
str(m)
# num [1:2, 1:5] 1 2 3 4 5 6 7 8 9 10
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:2] "R1" "R2"
# ..$ : chr [1:5] "A" "B" "C" "D" ...
# NULL
m
# A B C D E
# R1 1 3 5 7 9
# R2 2 4 6 8 10
Give this a try (note that it drops the row and column names): m <- matrix(as.numeric(lung.X), nrow = 86, ncol = 7129)
If you need it in dataframe/list format, df <- data.frame(m)
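If the row and column names matter, a sketch along these lines keeps them. It uses a tiny made-up matrix (lung_demo) in place of lung.X, with values padded the same way as in the question.

```r
# Small stand-in for lung.X: a character matrix with padded numbers
lung_demo <- matrix(c(" 170.0", " 104.0", "  53.7", " 119.0"),
                    nrow = 2,
                    dimnames = list(c("V3", "V4"),
                                    c("A28102_at", "AB000114_at")))

# Option 1: change the storage mode in place; dim and dimnames are kept
num_demo <- lung_demo
storage.mode(num_demo) <- "double"

# Option 2: rebuild the array explicitly from the converted values
num_demo2 <- array(as.numeric(lung_demo),
                   dim = dim(lung_demo),
                   dimnames = dimnames(lung_demo))

str(num_demo)
```

Both options rely on as.numeric-style conversion, which tolerates the leading spaces; entries that are not valid numbers would become NA with a warning.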

Why does R convert character to factor

New to R and can't figure this out. I have a vector of characters, place it into a data.frame and they change to "factor":
> name <- c("Ann","Bob", "Carl", "Dan","Ed")
> class(name)
[1] "character" # Expected this.
> wt <- c(123,234,222,199,201)
> class(wt)
[1] "numeric" # Expected this.
> a <- data.frame(name, wt)
> class(a$wt)
[1] "numeric" # Expected this.
> class(a$name)
[1] "factor" # ???
I am not sure why this is happening.
As mentioned in the comments, use stringsAsFactors = FALSE when creating your data.frame:
str(data.frame(name, wt, stringsAsFactors = FALSE))
# 'data.frame': 5 obs. of 2 variables:
# $ name: chr "Ann" "Bob" "Carl" "Dan" ...
# $ wt : num 123 234 222 199 201
In versions of R before 4.0.0, the default is stringsAsFactors = TRUE; since R 4.0.0 the default is FALSE, so character columns stay character. In older versions the default can be changed at startup, but you may not want to do this for compatibility with other people's scripts.
Some other packages that build upon data.frames have different default behavior. For instance, consider data.table from the "data.table" package or data_frame from the "dplyr" package (data_frame has since been deprecated in favor of tibble()):
library(data.table)
str(data.table(name, wt))
# Classes ‘data.table’ and 'data.frame': 5 obs. of 2 variables:
# $ name: chr "Ann" "Bob" "Carl" "Dan" ...
# $ wt : num 123 234 222 199 201
# - attr(*, ".internal.selfref")=<externalptr>
library(dplyr)
str(data_frame(name, wt))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5 obs. of 2 variables:
# $ name: chr "Ann" "Bob" "Carl" "Dan" ...
# $ wt : num 123 234 222 199 201
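If the data frame already exists with a factor column, a minimal sketch of converting it back looks like this. stringsAsFactors = TRUE is passed explicitly only to reproduce the old default on any R version.

```r
name <- c("Ann", "Bob", "Carl", "Dan", "Ed")
wt <- c(123, 234, 222, 199, 201)

# Reproduce the pre-4.0.0 behaviour explicitly
a <- data.frame(name, wt, stringsAsFactors = TRUE)
class(a$name)
#> [1] "factor"

# Convert the factor column back to character after the fact
a$name <- as.character(a$name)
class(a$name)
#> [1] "character"

# Or avoid the conversion up front, regardless of R version
b <- data.frame(name, wt, stringsAsFactors = FALSE)
```

Being explicit about stringsAsFactors in either direction makes a script behave the same way on old and new R installations.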
