How to read letters as numbers in R

I have some data coded like the Battleship game, for example A0, A1, B0, B4, K12, and I want to transform these into coordinate points. The letter should be the x-coordinate and the number the y-coordinate. Besides that, I need to convert the letters to numbers so that I can multiply them. Like this:
A0 = 0 , 0;
A1 = 0 , 15;
A2 = 0 , 30;
B3 = 15 , 45

Here you go:
BattleshipConversion <- function(mystring)
{
return(c(which(LETTERS==substr(mystring,1,1))-1,as.integer(substr(mystring,2,3)))*15)
}
Result:
>BattleshipConversion("B1")
15 15
>BattleshipConversion("A10")
0 150
So what is happening above?
LETTERS is a built-in R vector of capital letters. which returns the index position of the letter in that vector, so which(LETTERS=='A') gives 1. We subtract 1 from that.
substr is a function that extracts a substring from a string, taking the string, start and stop as arguments. Counting starts at the first character, which in R is position 1. substr(mystring,1,1) takes the first character of mystring and stops there.
as.integer simply converts the 1-2 digit number stored as a character string into a proper integer.
We combine both values into one vector using c(), and everything gets multiplied by 15, per the OP's specification.
The function returns the result.
Note that this assumes your input string is correctly formatted. It will only work up to Z and 99, i.e. will fail on an AA14 or B101. You may want to add in some safeguards.
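One way to add the safeguards mentioned above is to validate the input with a regular expression before converting it. This is only a sketch: the name BattleshipConversionSafe, the step argument and the exact validation rule (one capital letter followed by one or two digits, scalar input) are my assumptions, not part of the original answer.

```r
# Hypothetical wrapper around the same logic; assumes a single input string.
BattleshipConversionSafe <- function(mystring, step = 15) {
  # Accept exactly one capital letter followed by one or two digits.
  if (!grepl("^[A-Z][0-9]{1,2}$", mystring)) {
    stop("expected one capital letter followed by 1-2 digits, got: ", mystring)
  }
  x <- match(substr(mystring, 1, 1), LETTERS) - 1          # A = 0, B = 1, ...
  y <- as.integer(substr(mystring, 2, nchar(mystring)))    # all remaining digits
  c(x, y) * step
}

BattleshipConversionSafe("K12")
# [1] 150 180
```

Inputs such as "AA14" or "B101" now stop with an error instead of silently returning a wrong or truncated result.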

This is vectorized and can be extended to double letters easily:
fun <- function(s) {
x <- gsub("[[:digit:]]", "", s) #remove numbers
y <- gsub("[[:alpha:]]", "", s) #remove letters
x <- match(x, LETTERS) - 1 #match against letters
y <- as.integer(y)
cbind(x = x * 15, y = y * 15)
}
fun(c("A0", "A1", "A2", "B3"))
# x y
#[1,] 0 0
#[2,] 0 15
#[3,] 0 30
#[4,] 15 45
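To illustrate the extension to double letters, here is a rough sketch. It assumes spreadsheet-style numbering (A = 0, ..., Z = 25, AA = 26, AB = 27), which the question does not specify; letters_to_index and fun2 are hypothetical names introduced for this example.

```r
# Convert letter codes to zero-based indices, reading them as base-26 "digits".
letters_to_index <- function(x) {
  vapply(strsplit(x, "", fixed = TRUE), function(ch) {
    digits <- match(ch, LETTERS)                      # A = 1, ..., Z = 26
    sum(digits * 26^rev(seq_along(digits) - 1)) - 1   # shift so that A = 0
  }, numeric(1))
}

fun2 <- function(s) {
  x <- gsub("[[:digit:]]", "", s)                 # keep the letters
  y <- as.integer(gsub("[[:alpha:]]", "", s))     # keep the digits
  cbind(x = letters_to_index(x) * 15, y = y * 15)
}

fun2(c("A0", "B3", "AA14"))
#        x   y
# [1,]   0   0
# [2,]  15  45
# [3,] 390 210
```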

Say you have these positions:
pos<-c("A0","A1","A2","B3","K12")
You can:
require(data.table) #just to use tstrsplit
res<-setNames(as.data.frame(tstrsplit(pos,"(?<=[A-Z])",perl=TRUE),stringsAsFactors=FALSE),c("x","y"))
res[[1]]<-(match(res[[1]],LETTERS)-1)*15
res[[2]]<-as.numeric(res[[2]])*15
cbind(pos,res)
# pos x y
#1 A0 0 0
#2 A1 0 15
#3 A2 0 30
#4 B3 15 45
#5 K12 150 180

Here is a dplyr answer:
library(dplyr)
library(tidyr)
library(rex)
# Regex with two capture groups: the letters, then the digits
template = rex(capture(letters),
capture(numbers) )
coordinates = c("A0","A1","B0","B4","K12")
# Lookup table mapping each letter to a zero-based index (A = 0)
letter_frame =
tibble(letter = LETTERS,
x_small = 0:25)
result =
tibble(coordinate = coordinates) %>%
extract(coordinate, c("letter", "y_small"), template, convert = TRUE) %>%
left_join(letter_frame, by = "letter") %>%
mutate(x = x_small*15,
y = y_small*15)

BSconverter <- function(str){
let <- substr(str,1,1)
num <- as.integer(substr(str,2,nchar(str))) * 15
letnum <- (which(LETTERS==let)-1) * 15
c(letnum, num)
}
> BSconverter("K12")
[1] 150 180

Related

Creating a vector of numbers based on letters

So, this is the question:
"Create a function that given one word, return the position
of word’s letters on letters vector. For example, if the word
is ‘abba’, the function will return 1 2 2 1."
What I have so far is this:
l <- function(word) {
chr <- c()
y <- c()
strsplit(chr,word)
i<-1
while(i<length) {
o<-letters[i]
x<-chr[i]
if(o==x) {
y[i]<-i
}
i+1
}
y
}
I have tried running l("hello") and it returns NULL. I'm very lost and would appreciate any help! Thank you!
With base R:
x <- "abba"
lapply(strsplit(x, "", fixed = TRUE), match, letters)
[[1]]
[1] 1 2 2 1
I provide another interesting function in base:
x <- "abcxyz"
strtoi(strsplit(x, "")[[1]], 36) - 9
# [1] 1 2 3 24 25 26
strtoi() transforms the base-n numeral system into base-10 (i.e. decimal) numeral system. Take base-16 (i.e. hexadecimal) for example, strtoi("12", base = 16) will get 18 because 12 in hexadecimal is 18 in decimal. If base is 36, strtoi() will map (1~9, a~z) to 1~35, namely, a~z in a base-36 system is 10~35 in decimal. -9 in my code will convert 10~35 to 1~26, which is what the OP requires. Another common use is to transform binary number into decimal. E.g. strtoi("01001", base = 2) gets 9.
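Wrapped as a function, the base-36 trick might look like the sketch below. The name word_positions is hypothetical, and the function assumes the word contains letters only: digits are also valid base-36 numerals and would come out as zero or negative values.

```r
# Letter positions via base-36 parsing: "a" parses to 10, so subtract 9.
word_positions <- function(word) {
  chars <- strsplit(tolower(word), "", fixed = TRUE)[[1]]
  strtoi(chars, base = 36L) - 9L
}

word_positions("abba")
# [1] 1 2 2 1
```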
library(purrr)
my_fun <- function(x) {
x %>%
strsplit("") %>%
map(factor, levels = letters) %>%
map(as.numeric)
}
x <- c("abba", "hello")
my_fun(x)
#> [[1]]
#> [1] 1 2 2 1
#>
#> [[2]]
#> [1] 8 5 12 12 15
Here we use that factors are integers under the hood.
Let str be a character vector, e.g. str <- c('a', 'b', 'b', 'a'). When we run factor(str, levels = letters) we convert it to a factor with 26 levels: 'a', 'b', 'c', and so on. If we apply as.integer to it, a will become 1, because it's the first level, 'b' - 2 and so on.
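A small demonstration of that explanation (the variable f is just an illustrative name):

```r
str <- c("a", "b", "b", "a")
f <- factor(str, levels = letters)

nlevels(f)      # 26 levels, one per lowercase letter
# [1] 26
as.integer(f)   # the underlying integer codes
# [1] 1 2 2 1

# Anything that is not one of the levels becomes NA
as.integer(factor("A", levels = letters))
# [1] NA
```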

Counting the frequency of differing patterns in a character string

I currently have a string in R that looks like this:
a <- "BMMBMMMMBMMMBMMBBMMM"
First, I need to determine the frequency of different patterns of "M" that appear in the string.
In this example it would be:
MM = 2
MMM = 2
MMMM = 1
Secondly, I then need to designate a numerical value/score for each different pattern.
i.e:
MM = 1
MMM = 2
MMMM = 3
This would mean that the total value/score of M's in a would equal 9.
If anyone knows any script that would allow me to do this for multiple strings like this in a dataframe that would be great?
Thank you.
a <- "BMMBMMMMBMMMBMMBBMMM"
tbl <- table(strsplit(a, "B"), exclude="")
tbl
# MM MMM MMMM
# 2 2 1
score <- sum(tbl * 1:3)
score
# 9
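For the second part of the question (several strings in a data frame), one possible sketch uses rle(). It assumes the scoring rule generalises to "run length minus 1" (MM = 1, MMM = 2, MMMM = 3, a single M = 0); score_runs and the example data frame are made up for illustration.

```r
# Score one string: every run of `char` contributes (run length - 1).
score_runs <- function(s, char = "M") {
  r <- rle(strsplit(s, "", fixed = TRUE)[[1]])   # runs of identical characters
  len <- r$lengths[r$values == char]             # lengths of the runs of `char`
  sum(pmax(len - 1, 0))
}

df <- data.frame(id = 1:2,
                 seq = c("BMMBMMMMBMMMBMMBBMMM", "MBMMB"),
                 stringsAsFactors = FALSE)
df$score <- vapply(df$seq, score_runs, numeric(1), USE.NAMES = FALSE)
df$score
# [1] 9 1
```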
You could also use the table function.
a_list<-unlist(strsplit(a,"B"))
a_list<-a_list[!a_list==""] #remove cases when 2 B are together
a_list<-table(a_list)
# MM MMM MMMM
# 2 2 1
Here's a solution that uses the dplyr package. First, I load the library and define my string.
library(dplyr)
a <- "BMMBMMMMBMMMBMMBBMMM"
Next, I define a function that counts the occurrences of character x in string y.
char_count <- function(x, y){
# Get runs of same character
tmp <- rle(strsplit(y, split = "")[[1]])
# Count runs of character stored in `x`
tmp <- data.frame(table(tmp$lengths[tmp$values == x]))
# Return strings and frequencies
tmp %>%
mutate(String = strrep(x, as.integer(as.character(Var1)))) %>% # Var1 is a factor; convert its labels, not its codes
select(String, Freq)
}
Then, I run the function.
# Run the function
res <- char_count("M", a)
# String Freq
# 1 MM 2
# 2 MMM 2
# 3 MMMM 1
Finally, I define my value vector and calculate the total value of string a.
# My value vector
value_vec <- c(MM = 1, MMM = 2, MMMM = 3)
# Total `value` of string `a`
sum(value_vec * res$Freq)
#[1] 9
If it's acceptable to skip the first step, you could do:
nchar(gsub("(B+M)|(^M)","",a))
# [1] 9
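The trick works because the pattern removes every B together with the first M of each run, so each run of n M's leaves n - 1 characters behind. One caveat: a B at the end of the string is not followed by an M, so it is not removed and gets counted. The variant below is my own adjustment, not part of the answer above, and score_m is a hypothetical name.

```r
a <- "BMMBMMMMBMMMBMMBBMMM"

# Original one-liner: fine for `a`, but a trailing B is left in and counted.
nchar(gsub("(B+M)|(^M)", "", a))        # 9
nchar(gsub("(B+M)|(^M)", "", "BMMB"))   # 2, although the expected score is 1

# Variant: remove every run of B's plus (optionally) the M that follows it.
score_m <- function(s) nchar(gsub("B+M?|^M", "", s))
score_m(a)        # 9
score_m("BMMB")   # 1
```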
First compute all the different patterns that appear in your string:
a <- "BMMBMMMMBMMMBMMBBMMM"
chars = unlist(strsplit(a, ""))
pat = c()
for ( i in 1:length(chars)){
for (j in 1:(length(chars) - i+1)){ pat = c(pat, paste(chars[j:(j+i-1)], collapse = ""))}}
pat =sort(unique(pat))
pat[1:5]
# [1] "B" "BB" "BBM" "BBMM" "BBMMM"
Next, count the occurrences of each pattern:
counts = sapply(pat, function(w) length(gregexpr(w, a, fixed = TRUE)[[1]]))
Finally, build a data frame to summarize everything:
df = data.frame(counts = counts, num = 1:length(pat))
head(df, 10)
counts num
B 6 1
BB 1 2
BBM 1 3
BBMM 1 4
BBMMM 1 5
BM 5 6
BMM 5 7
BMMB 2 8
BMMBB 1 9
BMMBBM 1 10
library(stringr)
str_count(a, "MMMM")
gives 1
str_count(gsub("MMMM", "", a), "MMM") # now count how many times "MMM" occurs, but first delete the "MMMM"
gives 2
str_count(gsub("MMM", "", a), "MM") #now count how many times "MM" occurs, but first delete the "MMM"'s
gives 2

Removing points in a vector based on another

I have two vectors of data, x and y. Let's say the first one is distance and the second is temperature.
How can I remove from both x and y all points whose distance between two consecutive points (x[i] - x[i-1]) is lower than a constant distance d?
x <- c(1, 2, 3, 8, 12)
y <- c(10, 12, 11, 9, 12)
For example, remove points with a distance smaller than 5:
x = 1, 2 (out, as 2-1 < 5), 3 (out, as 3-1 < 5), 8, 12 (kept as the last point even though 12-8 < 5)
x = (1,8,12)
y = (10,9,12)
Here is one idea assuming that your first and last elements are never removed,
v1 <- setNames(x, y)[c(TRUE, (diff(x) >= 5)[-(length(x)-1)], TRUE)]
#10 9 12
# 1 8 12
#To make it a bit more clear on how the named vector is structured (still a vector)
names(v1)
#[1] "10" "9" "12" <- Note: I get 9 whereas you get 11
unname(v1)
#[1] 1 8 12
Or you can make it a function,
rm_elements <- function(x, y, n){
v1 <- setNames(x, y)[c(TRUE, (diff(x) >= n)[-(length(x)-1)], TRUE)]
return(list(x = unname(v1), y = as.numeric(names(v1))))
}
rm_elements(x, y, 5)
#$x
#[1] 1 8 12
#$y
#[1] 10 9 12
EDIT: To accommodate your comment about having them in a data frame, we can alter the function a bit so that it accepts a data frame (no matter how you name the variables) and returns a subset of that data frame, i.e.
rm_elements <- function(df, n){
v1 <- df[c(TRUE, (diff(df[[1]]) >= n)[-(nrow(df)-1)], TRUE),]
return(v1)
}
#Make a data frame from the vectors,
d1 <- data.frame(x=x, y=y)
rm_elements(d1, 5)
which gives,
x y
1 1 10
4 8 9
5 12 12
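Note that the approach above compares each point with its immediate neighbour. The worked example in the question ("3 out as 3-1 < 5") can also be read as measuring from the last point that was kept. A minimal sketch of that reading, assuming x is sorted and non-empty and that the first and last points are always kept (keep_spaced is a hypothetical name):

```r
# Keep a point only if it is at least `d` away from the last kept point.
keep_spaced <- function(x, d) {
  keep <- logical(length(x))
  keep[c(1, length(x))] <- TRUE   # endpoints are always kept (assumption)
  last <- x[1]
  for (i in seq_along(x)[-1]) {
    if (x[i] - last >= d) {
      keep[i] <- TRUE
      last <- x[i]
    }
  }
  keep
}

x <- c(1, 2, 3, 8, 12)
y <- c(10, 12, 11, 9, 12)
k <- keep_spaced(x, 5)
x[k]
# [1]  1  8 12
y[k]
# [1] 10  9 12
```

On this data both readings give the same result; they differ when several small steps add up to more than d.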

In R, is there a better way than a for loop over a data frame to create a 3d binary vector?

I have a data frame with a character column, and I need to convert its values to a 3d binary vector based on the value.
DF = data.frame(Names = c("A1","A2","A3","A4"), TestScore = c("100 pts","NA","45 pt","75 pt."))
I need to create a function that reduces TestScore to just its numeric value (100, NA, 45, 75) and then creates a 3d binary vector with three categories, where the 2nd and 3rd categories would be average vs. best student. Any 0 or NA would be considered the first category, NoScore. I need to use this 3d vector to analyze the relationship of NoScore to another variable.
The only way I can think of is a for loop over the rows of the data frame that splits TestScore and then runs an if/else chain. But that does not create a 3-dimensional vector: it creates two variables (Ind and Cat) that hold values for the score, whereas I think I should use an array indexed [i,j,k] to store values of 0 or 1.
Here is what I have for looking at the dataframe :
my_awards_array <- array(0, dim=c(movie_nrow,3,3))
convert_Awards <- function(df){
library(stringr)
awards_split_df = data.frame(str_split_fixed(df$TestScore," ",2))
summary(awards_split_df)
ls.str(awards_split_df)
awards_split_df[] <- data.frame(lapply(awards_split_df,as.character),stringsAsFactors=FALSE)
summary(awards_split_df)
ls.str(awards_split_df)
############################################################################
## First column contain the Number to convert as the value for binary Vector
############################################################################
### Second column would be eliminated ("win" or "wins." "win.")
awards_split_df$X2 <- NA
awards_conv_df <- data.frame(lapply(awards_split_df,as.integer))
colnames(awards_conv_df) = c("AwardsNum","AwardsType")
awards_final_df <- awards_conv_df
for (i in 1:nrow(awards_final_df)) {
origAwardsNum = awards_final_df[i,c('AwardsNum')]
if (is.na(origAwardsNum) ) {
awards_final_df[i,c('AwardsCat')] = "NoAwards";
awards_final_df[i,c('AwardsInd')] = 0;
break
} else if (origAwardsNum == 0 ){
awards_final_df[i,c('AwardsCat')] = "NoAwards";
awards_final_df[i,c('AwardsInd')] = 1;
break
} else if ((origAwardsNum) > 0 & (origAwardsNum <= 50)) {
awards_final_df[i,c('AwardsInd')] = 2;
awards_final_df[i,c('AwardsCat')] = "SomeAwards";
break
} else {
awards_final_df[i,c('AwardsInd')] = 3;
awards_final_df[i,c('AwardsCat')] = "ManyAwards";
break
}
}
return(awards_final_df)
}
Here is the input and the expected result:
DF
Names TestScore
1 A1 100 pts
2 A2 NA
3 A3 45 pt
4 A4 75 pt.
AFTER
Names TestScore AwardsCat AwardsInd <- 3d binary vector [0,0,0]
1 A1 100 pts ManyAwards [0,0,1]
2 A2 NA NA NA
3 A3 45 pt SomeAwards [0,1,0]
4 A4 75 pt. ManyAwards [0,0,1]
UPDATE:
I have to improve my strsplit approach, I found out that the data for TestScore can come like this
TestScore = 45 pt & 10 bonus --> which will equal to 45 + 10 = 55 to determine AwardsCat
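A loop-free way to approach this might look like the sketch below. It is only one reading of the question: it treats missing or zero scores as NoScore (as the prose says), uses 50 as the cut-off between the other two categories (as in the if/else chain), sums every number found in the text (as in the update), and encodes the category as one 0/1 column per level. The labels "Average" and "Best" and all variable names are assumptions.

```r
scores <- c("100 pts", NA, "45 pt", "75 pt.", "45 pt & 10 bonus")

# Treat missing entries as empty text so the regex step finds nothing in them.
txt <- ifelse(is.na(scores), "", scores)

# Pull out every run of digits and add them up ("45 pt & 10 bonus" -> 55).
# Entries without any digits sum to 0.
total <- vapply(
  regmatches(txt, gregexpr("[0-9]+", txt)),
  function(m) sum(as.numeric(m)),
  numeric(1)
)

# Three categories: 0 or missing, (0, 50], above 50.
category <- cut(total, breaks = c(-Inf, 0, 50, Inf),
                labels = c("NoScore", "Average", "Best"))

# One 0/1 column per category, i.e. a 3-element binary vector per row.
onehot <- sapply(levels(category), function(l) as.integer(category == l))
onehot
#      NoScore Average Best
# [1,]       0       0    1
# [2,]       1       0    0
# [3,]       0       1    0
# [4,]       0       0    1
# [5,]       0       0    1
```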

How to find and label start / end points of unique sequences in R

I have a sequence of 1s and 0s alongside a time vector. I'd like to find the start and end time points of all the sequences of 1s, and give each sequence a unique ID. Here is some example data and my attempt so far.
Create dummy data
# Create the sequence
x = c(0,0,0,0,1,1,1,1,0,0,0,1,1,1,1,1,1,0,0,0,1,1,1,1,0)
# Create the time vector
t = 10:34
This is my effort
#Get changepoints using diff()
diff_result <- diff(x)
# Use ifelse() to get start and end times (i.e. on and off)
on_t <- ifelse(diff_result == 1, t, NA)
off_t <- ifelse(diff_result == -1, t, NA)
# Combine into data frame and remove NAs, add 1 to on_t
results <- data.frame(on_t = on_t[!is.na(on_t)] + 1, off_t = off_t[!is.na(off_t)])
# Create unique ID for each sequence
results$ID <- factor(1:nrow(results))
print(results)
on_t off_t ID
1 14 17 1
2 21 26 2
3 30 33 3
I'm sure there's a better way...
Another option is to put the two vectors in a data.table and then do a typical group-by, filter and mutate transformation:
library(data.table)
dt = data.table(seq = x, time = t)
# Use rleid() to create a unique ID for each run as the group-by variable, and find the
# start and end time of each run along with its value (lab). Then keep only the runs
# of 1s, drop the label and recreate the ID.
dt[, .(on_t = min(time), off_t = max(time), lab = unique(seq)), .(ID = rleid(seq))][
lab == 1][
, `:=`(lab = NULL, ID = seq_along(ID))][]
# ID on_t off_t
# 1: 1 14 17
# 2: 2 21 26
# 3: 3 30 33
Following OP's logic, which might be a better way:
d = diff(c(0, x, 0))
# pad x with a 0 at both ends so that this also works
# when the sequence starts or ends with 1.
results = data.frame(on_t = t[d == 1], off_t = t[(d == -1)[-1]])
# pick up the time where 1 sequence starts as on time, and 0 starts as off time. Here d is
# one element longer than t and x but since the last element for d == 1 will always be false, it won't affect the result.
results$ID = 1:nrow(results)
# create an ID
results
# on_t off_t ID
# 1 14 17 1
# 2 21 26 2
# 3 30 33 3
You can also do it this way.
x = c(0,0,0,0,1,1,1,1,0,0,0,1,1,1,1,1,1,0,0,0,1,1,1,1,0)
# Create the time vector
t = 10:34
xy <- data.frame(x, t)
mr <- rle(xy$x)$lengths
xy$group <- rep(letters[1:length(mr)], times = mr)
onesies <- xy[xy$x == 1, ]
out <- by(onesies, INDICES = onesies$group,
FUN = function(x) {
data.frame(on_t = x$t[1], off_t = x$t[nrow(x)], ID = unique(x$group))
})
do.call("rbind", out)
on_t off_t ID
b 14 17 b
d 21 26 d
f 30 33 f
Here is one method for finding the starting and stopping positions of the above vector:
# get positions of the 1s
onePos <- which(x == 1)
# get the ending positions
stopPos <- onePos[c(which(diff(onePos) != 1), length(onePos))]
# get the starting positions
startPos <- onePos[c(1, which(diff(onePos) != 1) + 1)]
The corresponding values of t can be obtained through subsetting:
t[startPos]
[1] 14 21 30
t[stopPos]
[1] 17 26 33
Finally, to add an id:
df <- data.frame(id=seq_along(startPos), on_t=t[startPos], off_t=t[stopPos])
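The same idea can be packaged as a reusable function built on rle(), which also covers sequences that start or end with a 1. This is a sketch; run_bounds and its value argument are hypothetical names.

```r
# Start and end times of every run of `value` in x.
run_bounds <- function(x, t, value = 1) {
  r <- rle(x)
  end <- cumsum(r$lengths)          # last index of each run
  start <- end - r$lengths + 1      # first index of each run
  hit <- r$values == value          # runs made of `value`
  data.frame(ID = seq_len(sum(hit)),
             on_t = t[start[hit]],
             off_t = t[end[hit]])
}

x <- c(0,0,0,0,1,1,1,1,0,0,0,1,1,1,1,1,1,0,0,0,1,1,1,1,0)
t <- 10:34
run_bounds(x, t)
#   ID on_t off_t
# 1  1   14    17
# 2  2   21    26
# 3  3   30    33
```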
