Format thousands as Ks in R

How can I format numbers like 465456.6789 as a compact 465,4K in R? Other examples: 13567.566 to 13,5K, 3567.5 to 3,5K, and so on. In general I want the result of something like
roundup_to <- function(x, to = 10, up = FALSE){
if(up) round(.Machine$double.eps^0.5 + x/to)*to else round(x/to)*to
}
roundup_to(c((74453.867574737)), to = 100)
to become 74,5K

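You can also have a look at the function scales::label_number_si, which rounds the number and appends a K/M suffix (deprecated since scales 1.2.0 in favour of label_number(scale_cut = cut_short_scale())).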
a <- c(465456.6789, 3567.5, 1465458.12)
scales::label_number_si(accuracy = 0.1)(a)
#[1] "465.5K" "3.6K" "1.5M"

You could do:
a <- c(465456.6789, 13567.566, 3567.5)
sprintf("%sK", format(round(a/1000, 1), decimal.mark=","))
[1] "465,5K" " 13,6K" " 3,6K"
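If the padding spaces in the output above are unwanted, a base R variant along these lines might help. It is only a sketch: format_k is a hypothetical helper name, and trim = TRUE and nsmall = 1 are choices made here rather than part of the answer above.

```r
# Hypothetical helper: thousands with one decimal, comma as decimal mark
format_k <- function(x) {
  k <- round(x / 1000, 1)
  # trim = TRUE drops the left padding; nsmall = 1 keeps one decimal place
  paste0(format(k, nsmall = 1, trim = TRUE, decimal.mark = ","), "K")
}

a <- c(465456.6789, 13567.566, 3567.5)
format_k(a)
# "465,5K" "13,6K" "3,6K"
```

Note that this rounds rather than truncates, so 465456.6789 becomes 465,5K, as in the answers above.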

K-mer words in R

I am still new to R programming and have no idea how to write the Python code below in R.
human_data is a data frame read from a CSV file; its sequence column holds strings of letters. Basically, I want to convert each sequence string into all possible k-mer words of length 6.
def getKmers(sequence, size=6):
    return [sequence[x:x+size] for x in range(len(sequence) - size + 1)]

human_data['words'] = human_data.apply(lambda x: getKmers(x['sequence']), axis=1)
You could use the library quanteda too, in order to compute the k-mers (k-grams), the following code shows an example:
library(quanteda)
k = 6 # 6-mers
human_data = data.frame(sequence=c('abcdefghijkl', 'xxxxyyxxyzz'))
human_data$words <- apply(human_data, 1,
function(x) char_ngrams(unlist(tokens(x['sequence'],
'character')), n=k, concatenator = ''))
human_data
# sequence words
#1 abcdefghijkl abcdef, bcdefg, cdefgh, defghi, efghij, fghijk, ghijkl
#2 xxxxyyxxyzz xxxxyy, xxxyyx, xxyyxx, xyyxxy, yyxxyz, yxxyzz
I hope this helps, using R basic commands:
df = data.frame(words=c('asfdklajsjahk', 'dkajsadjkfggfh', 'kfjlhdaDDDhlw'))
getKmers = function(sequence, size=6) {
  kmers = c()
  # seq_len() yields an empty loop when the sequence is shorter than size
  for (x in seq_len(max(0, nchar(sequence) - size + 1))) {
    kmers = c(kmers, substr(sequence, x, x+size-1))
  }
  return(kmers)
}
sapply(df$words, getKmers)
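The same idea can be written without a loop, since substring() is vectorised over its start and stop positions. This is a sketch only: get_kmers is a hypothetical name, and the toy human_data frame is borrowed from the quanteda answer above rather than from the asker's CSV.

```r
# Hypothetical vectorised variant of getKmers()
get_kmers <- function(sequence, size = 6) {
  n <- nchar(sequence) - size + 1
  if (n < 1) return(character(0))  # sequence shorter than the k-mer size
  starts <- seq_len(n)
  substring(sequence, starts, starts + size - 1)
}

human_data <- data.frame(sequence = c("abcdefghijkl", "xxxxyyxxyzz"),
                         stringsAsFactors = FALSE)
# lapply() keeps one character vector of k-mers per row (a list column)
human_data$words <- lapply(human_data$sequence, get_kmers)
human_data$words[[1]]
# "abcdef" "bcdefg" "cdefgh" "defghi" "efghij" "fghijk" "ghijkl"
```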

How to read a .MAP file extension in R?

Is there a simple way to read a file of .MAP extension in R? I have tried a few options below but had no success. Here is a .MAP file for a reproducible example.
context: For some odd reason, the spatial regionalization used in health planning policies in Brazil is only available in this format. I would like to convert it to geopackage so we can add it to the geobr package.
# none of these options work
mp <- sf::st_read("./se_mapas_2013/se_regsaud.MAP")
mp <- rgdal::readGDAL("./se_mapas_2013/se_regsaud.MAP")
mp <- rgdal::readOGR("./se_mapas_2013/se_regsaud.MAP")
mp <- raster::raster("./se_mapas_2013/se_regsaud.MAP")
mp <- stars::read_stars("./se_mapas_2013/se_regsaud.MAP")
ps. there is a similar question on SO focused on Python, unfortunately unanswered
UPDATE
We have found a publication that uses a custom function that reads the .MAP file. See example below. However, it returns a "polylist" object. Is there a simple way to convert it to a simple feature?
original custom function
read.map = function(filename){
zz=file(filename,"rb")
#
# header of .map
#
versao = readBin(zz,"integer",1,size=2) # 100 = versao 1.00
#Bounding Box
Leste = readBin(zz,"numeric",1,size=4)
Norte = readBin(zz,"numeric",1,size=4)
Oeste = readBin(zz,"numeric",1,size=4)
Sul = readBin(zz,"numeric",1,size=4)
geocodigo = ""
nome = ""
xleg = 0
yleg = 0
sede = FALSE
poli = list()
i = 0
#
# repeat of each object in file
#
repeat{
tipoobj = readBin(zz,"integer",1,size=1) # 0=Poligono, 1=PoligonoComSede, 2=Linha, 3=Ponto
if (length(tipoobj) == 0) break
i = i + 1
Len = readBin(zz,"integer",1,size=1) # length byte da string Pascal
geocodigo[i] = readChar(zz,10)
Len = readBin(zz,"integer",1,size=1) # length byte da string Pascal
nome[i] = substr(readChar(zz,25),1,Len)
xleg[i] = readBin(zz,"numeric",1,size=4)
yleg[i] = readBin(zz,"numeric",1,size=4)
numpontos = readBin(zz,"integer",1,size=2)
sede = sede || (tipoobj == 1) # compare with ==; a single = would assign 1 to tipoobj
x=0
y=0
for (j in 1:numpontos){
x[j] = readBin(zz,"numeric",1,size=4)
y[j] = readBin(zz,"numeric",1,size=4)
}
# separate polygons
xInic = x[1]
yInic = y[1]
for (j in 2:numpontos){
if (x[j] == xInic & y[j] == yInic) {x[j]=NA; y[j] = NA}
}
poli[[i]] = c(x,y)
dim(poli[[i]]) = c(numpontos,2)
}
class(poli) = "polylist"
attr(poli,"region.id") = geocodigo
attr(poli,"region.name") = nome
attr(poli,"centroid") = list(x=xleg,y=yleg)
attr(poli,"sede") = sede
attr(poli,"maplim") = list(x=c(Oeste,Leste),y=c(Sul,Norte))
close(zz)
return(poli)
}
using original custom function
mp <- read.map("./se_mapas_2013/se_regsaud.MAP")
class(mp)
>[1] "polylist"
# plot
plot(attributes(mp)$maplim, type='n', asp=1, xlab=NA, ylab=NA)
title('Map')
lapply(mp, polygon, asp=T, col=3)
The problems were: use of readChar with trailing nul bytes - changed to readBin(); 8-bit characters that rawToChar() would not accept (on my UTF-8 system); multiple slivers in some files that needed dropping; and some others. I added the edited read.map() function above to maptools, but with a different name and not exported. So now (with maptools rev 370 from https://r-forge.r-project.org/R/?group_id=943 when build completes):
library(maptools)
o <- maptools:::readMAP2polylist("se_regsaud.MAP")
oo <- maptools:::.makePolylistValid(o)
ooo <- maptools:::.polylist2SpP(oo, tol=.Machine$double.eps^(1/4))
rn <- row.names(ooo)
df <- data.frame(ID=rn, row.names=rn, stringsAsFactors=FALSE)
res <- SpatialPolygonsDataFrame(ooo, data=df)
library(sf)
res_sf <- st_as_sf(res)
res_sf
plot(st_geometry(res_sf))
This approach re-uses the maptools code dating back almost twenty years, with minor edits to handle subsequent changes in reading binary files, and fixing slivers.
EDIT: looks like this doesn't work generally across all files so proper conversion to sf would need a deeper look.
Here's a quick stab at resurrection. It might be incorrect to cumulatively sum to get the multi-linestrings: I tested with se_municip.MAP, and it only had NAs as the closing row of each ring. If a file potentially has non-connected multi-rings (multipolygons), then this approach won't work completely.
x <- read.map("se_municip.MAP")
df <- setNames(as.data.frame(do.call(rbind, x)), c("x", "y"))
df$region.name <- rep(attr(x, "region.name"), unlist(lapply(x, nrow)))
## in case there are multi-rings
df$linestring_id <- cumsum(c(0, diff(is.na(df$x))))
df$polygon_id <- as.integer(factor(df$region.name))
df <- df[!is.na(df$x), ]
sfx <- sfheaders::sf_polygon(df, x = "x", y = "y", linestring_id = "linestring_id", polygon_id = "polygon_id", keep = TRUE)
#sf::st_crs(sfx) <- sf::st_crs(<whatever it is probably 4326>)
plot(sf::st_geometry(sfx), reset = FALSE)
maps::map(add = TRUE)
Interesting that you came across an official version of a forgotten legacy!
(BTW can I publish the data sets in a package?)
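On the ring-numbering caveat in the answer above: with NA rows acting as ring separators, cumsum(c(0, diff(is.na(x)))) falls back to 0 right after each NA, so every remaining row ends up with the same id. A small sketch on made-up coordinates (the toy vector is an assumption, not data from a .MAP file) shows one way to number rings instead:

```r
# Toy x coordinates: two rings, each terminated by an NA row (assumed layout)
x <- c(0, 1, 1, NA, 5, 6, 6, NA)

# Expression from the answer: returns to 0 after every NA row
id_diff <- cumsum(c(0, diff(is.na(x))))[!is.na(x)]

# Alternative: count the NA separators seen so far
id_cum <- cumsum(is.na(x))[!is.na(x)]

id_diff  # 0 0 0 0 0 0
id_cum   # 0 0 0 1 1 1
```

Whether rings of one region should then become holes or separate polygons is a separate question that this sketch does not address.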

order strings according to some characters

I have a vector of strings, each containing a number, and I'd like to sort the vector according to this number.
MWE:
> str = paste0('N', sample(c(1,2,5,10,11,20), 6, replace = FALSE), 'someotherstring')
> str
[1] "N11someotherstring" "N5someotherstring" "N2someotherstring" "N20someotherstring" "N10someotherstring" "N1someotherstring"
> sort(str)
[1] "N10someotherstring" "N11someotherstring" "N1someotherstring" "N20someotherstring" "N2someotherstring" "N5someotherstring"
while I'd like to have
[1] "N1someotherstring" "N2someotherstring" "N5someotherstring" "N10someotherstring" "N11someotherstring" "N20someotherstring"
I have thought of using something like:
num = sapply(strsplit(str, split = NULL), function(s) {
as.numeric(paste0(head(s, -15)[-1], collapse = ""))
})
str = str[sort(num, index.return=TRUE)$ix]
but I guess there might be something simpler
There is an easy way to do this with mixedsort() from the gtools package:
gtools::mixedsort(str)
#[1] "N1someotherstring" "N2someotherstring" "N5someotherstring" "N10someotherstring" "N11someotherstring" "N20someotherstring"
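If adding a package is not an option, a base R sketch along these lines should give the same order. It assumes every string starts with "N" followed by the digits, as in the MWE; the example vector is typed out by hand instead of sampled.

```r
str <- c("N11someotherstring", "N5someotherstring", "N2someotherstring",
         "N20someotherstring", "N10someotherstring", "N1someotherstring")

# Capture the digits after the leading "N" and sort by their numeric value
num <- as.numeric(sub("^N([0-9]+).*$", "\\1", str))
sorted <- str[order(num)]
sorted
# "N1someotherstring" "N2someotherstring" "N5someotherstring"
# "N10someotherstring" "N11someotherstring" "N20someotherstring"
```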

Avoiding for loop, Naming Example

I would like to avoid using a for loop in the following example. The goal is to repeat a string vector multiple times, with a different second part in each repetition. Is that possible?
str2D = mtcars
Vector = c(10,20)
Dimen = dim( str2D )
nn = c()
for ( i in Dimen[2]*(1:length(Vector)) ){
nn[ (i+1-Dimen[2]): i ] = rep(paste("|d",Vector[i/Dimen[2]],sep=""), Dimen[2] )
}
Name = paste( rep(names(str2D) , length(Vector) ),nn,sep="")
Correct result for "Name" vector is following:
"mpg|d10" "cyl|d10" "disp|d10" "hp|d10" "drat|d10" "wt|d10" "qsec|d10" "vs|d10" "am|d10" "gear|d10" "carb|d10" "mpg|d20" "cyl|d20" "disp|d20" "hp|d20" "drat|d20" "wt|d20" "qsec|d20" "vs|d20" "am|d20" "gear|d20" "carb|d20"
Thank you
I don't quite understand the end goal here but at least this achieves your desired output without a loop:
Name <- paste0(paste(names(mtcars)), "|d", rep(1:2, each = length(names(mtcars))), "0")
> Name
[1] "mpg|d10" "cyl|d10" "disp|d10" "hp|d10" "drat|d10" "wt|d10" "qsec|d10"
[8] "vs|d10" "am|d10" "gear|d10" "carb|d10" "mpg|d20" "cyl|d20" "disp|d20"
[15] "hp|d20" "drat|d20" "wt|d20" "qsec|d20" "vs|d20" "am|d20" "gear|d20"
[22] "carb|d20"
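The answer above hard-codes 1:2 and the trailing "0". A slightly more general sketch builds the suffix from the vector itself with rep(); vec and cols are just local names chosen for this illustration.

```r
vec <- c(10, 20)
cols <- names(mtcars)

# Repeat the whole set of column names once per value,
# and each value once per column name
Name <- paste0(rep(cols, times = length(vec)),
               "|d",
               rep(vec, each = length(cols)))
head(Name, 3)
# "mpg|d10" "cyl|d10" "disp|d10"
```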

Converting geo coordinates from degree to decimal

I want to convert my geographic coordinates from degrees and decimal minutes to decimal degrees. My data are as follows:
lat long
105252 30°25.264 9°01.331
105253 30°39.237 8°10.811
105255 31°37.760 8°06.040
105258 31°41.190 8°06.557
105259 31°41.229 8°06.622
105260 31°38.891 8°06.281
I have this code but I cannot see why it does not work:
convert<-function(coord){
tmp1=strsplit(coord,"°")
tmp2=strsplit(tmp1[[1]][2],"\\.")
dec=c(as.numeric(tmp1[[1]][1]),as.numeric(tmp2[[1]]))
return(dec[1]+dec[2]/60+dec[3]/3600)
}
don_convert=don1
for(i in 1:nrow(don1)){
  don_convert[i,2]=convert(as.character(don1[i,2]))
  don_convert[i,3]=convert(as.character(don1[i,3]))
}
The convert function works, but the loop that is supposed to apply it to every row does not.
Any suggestion is appreciated.
Use the measurements package from CRAN which has a unit conversion function already so you don't need to make your own:
x = read.table(text = "
lat long
105252 30°25.264 9°01.331
105253 30°39.237 8°10.811
105255 31°37.760 8°06.040
105258 31°41.190 8°06.557
105259 31°41.229 8°06.622
105260 31°38.891 8°06.281",
header = TRUE, stringsAsFactors = FALSE)
Once your data.frame is set up then:
# change the degree symbol to a space
x$lat = gsub('°', ' ', x$lat)
x$long = gsub('°', ' ', x$long)
# convert from decimal minutes to decimal degrees
x$lat = measurements::conv_unit(x$lat, from = 'deg_dec_min', to = 'dec_deg')
x$long = measurements::conv_unit(x$long, from = 'deg_dec_min', to = 'dec_deg')
Resulting in the end product:
lat long
105252 30.4210666666667 9.02218333333333
105253 30.65395 8.18018333333333
105255 31.6293333333333 8.10066666666667
105258 31.6865 8.10928333333333
105259 31.68715 8.11036666666667
105260 31.6481833333333 8.10468333333333
Try using the char2dms function in the sp library. It has other functions that will additionally do decimal conversion.
library("sp")
?char2dms
A bit of vectorization and matrix manipulation will make your function much simpler:
x <- read.table(text="
lat long
105252 30°25.264 9°01.331
105253 30°39.237 8°10.811
105255 31°37.760 8°06.040
105258 31°41.190 8°06.557
105259 31°41.229 8°06.622
105260 31°38.891 8°06.281",
header=TRUE, stringsAsFactors=FALSE)
x
The function itself makes use of:
strsplit() with the regex pattern "[°\\.]" - this does the string split in one step
sapply to loop over the vector
Try this:
convert<-function(x){
z <- sapply((strsplit(x, "[°\\.]")), as.numeric)
z[1, ] + z[2, ]/60 + z[3, ]/3600
}
Try it:
convert(x$long)
[1] 9.108611 8.391944 8.111111 8.254722 8.272778 8.178056
Disclaimer: I didn't check your math. Use at your own discretion.
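On the math: the data look like degrees plus decimal minutes (30°25.264 is 30 degrees and 25.264 minutes), so treating the digits after the dot as seconds gives a different result from the measurements::conv_unit output shown earlier. A sketch of a conversion under that reading, where convert_ddm is a hypothetical name:

```r
# Hypothetical helper: degrees + decimal minutes -> decimal degrees
convert_ddm <- function(x) {
  parts <- strsplit(x, "°", fixed = TRUE)
  deg  <- as.numeric(vapply(parts, `[`, character(1), 1))
  mins <- as.numeric(vapply(parts, `[`, character(1), 2))
  deg + mins / 60
}

convert_ddm(c("9°01.331", "8°10.811"))
# approximately 9.022183 8.180183
```

These values agree with the first two long entries in the conv_unit result above.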
Thanks for the answers by @Gord Stephen and @CephBirk. They sure helped me out.
I thought I'd just mention that I also found that measurements::conv_unit doesn't deal with "E/W" and "N/S" entries; it requires positive/negative degrees.
My coordinates come as character strings such as "1 1 1W" and first need to be converted to "-1 1 1".
I thought I'd share my solution for that.
df <- c("1 1 1E", "1 1 1W", "2 2 2N","2 2 2S")
measurements::conv_unit(df, from = 'deg_min_sec', to = 'dec_deg')
[1] "1.01694444444444" NA NA NA
Warning message:
In split(as.numeric(unlist(strsplit(x, " "))) * c(3600, 60, 1), :
NAs introduced by coercion
library(stringr) # provides str_extract(), str_sub(), str_length()
ewns <- ifelse( str_extract(df,"\\(?[EWNS,.]+\\)?") %in% c("E","N"),"+","-")
dms <- str_sub(df,1,str_length(df)-1)
df2 <- paste0(ewns,dms)
df_dec <- measurements::conv_unit(df2,
from = 'deg_min_sec',
to = 'dec_deg')
df_dec
[1] "1.01694444444444" "-1.01694444444444" "2.03388888888889" "-2.03388888888889"
as.numeric(df_dec)
[1] 1.016944 -1.016944 2.033889 -2.033889
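For comparison, the same hemisphere handling can be sketched in base R without stringr or measurements. It assumes the "D M S" layout with a single trailing E/W/N/S letter used in the example; dms_to_decimal is a hypothetical name.

```r
# Hypothetical helper: "D M S" plus hemisphere letter -> signed decimal degrees
dms_to_decimal <- function(x) {
  hemi <- substr(x, nchar(x), nchar(x))                 # last character
  parts <- strsplit(substr(x, 1, nchar(x) - 1), " ", fixed = TRUE)
  # degrees + minutes/60 + seconds/3600 for each entry
  dd <- vapply(parts, function(p) sum(as.numeric(p) / c(1, 60, 3600)), numeric(1))
  ifelse(hemi %in% c("W", "S"), -dd, dd)                # west and south are negative
}

dms_to_decimal(c("1 1 1E", "1 1 1W", "2 2 2N", "2 2 2S"))
# approximately 1.016944 -1.016944 2.033889 -2.033889
```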
Have a look at the command degree in the package OSMscale.
As Jim Lewis commented before, it seems you are using floating-point minutes. In that case you only concatenate two elements in
dec=c(as.numeric(tmp1[[1]][1]),as.numeric(tmp2[[1]]))
My coordinates have degrees, minutes and seconds in the form 43°21'8.02", which as.character() returns as "43°21'8.02\"", so I updated your function to
convert<-function(coord){
tmp1=strsplit(coord,"°")
tmp2=strsplit(tmp1[[1]][2],"'")
tmp3=strsplit(tmp2[[1]][2],"\"")
dec=c(as.numeric(tmp1[[1]][1]),as.numeric(tmp2[[1]][1]),as.numeric(tmp3[[1]]))
dd<-abs(dec[1])+dec[2]/60+dec[3]/3600 # named dd to avoid masking base::c()
dd<-ifelse(dec[1]<0,-dd,dd)
return(dd)
}
adding handling for negative coordinates, and it works great for me. I still don't get why the char2dms function in the sp library didn't work for me.
Thanks
Another less elegant option using substring instead of strsplit. This will only work if all your positions have the same number of digits. For negative co-ordinates just multiply by -1 for the correct decimal degree.
x$LatDD<-(as.numeric(substring(x$lat, 1,2))
+ (as.numeric(substring(x$lat, 4,9))/60))
x$LongDD<-(as.numeric(substring(x$long, 1,1))
+ (as.numeric(substring(x$long, 3,8))/60))
