I'm trying to convert an integer value to its bit expression and convert parts of that again to integer (as per this blogpost, for context).
Transforming integers to bits does seem to work, but I'm missing something when transforming the values back into the initial integer.
'intToBase2' <- function(x){
x %>%
intToBits() %>%
rev %>%
as.character() %>%
{sapply(strsplit(., "", fixed = TRUE), '[', 2)} %>%
paste0(.,collapse = '')
}
val <- 1855928457
intToBase2(val) # does seem to return the correct bit expression, as expected
However, if I try to reverse the same logic, something gets lost somewhere:
val %>%
intToBase2 %>% # Get expression in bits
{strsplit(.,'')[[1]]} %>% # split
sprintf('0%s',.) %>% # add leading zeros
rev %>% # reverse order
as.raw %>% # expression to raw
readBin(.,what='integer')
R> [1] 16777217
What am I missing? I assume some of the steps I did are incorrect.
This function reverses yours, and is vectorized:
base2ToInt <- function(string)
{
sapply(strsplit(string, ""), function(x) sum(rev(+(x == "1")) * 2^(seq(x)-1)))
}
So you can do
base2ToInt(c("1", "0", "10", "11111111"))
#> [1] 1 0 2 255
and
base2ToInt(intToBase2(val))
#> [1] 1855928457
> base2ToInt(intToBase2(val)) == val
#> [1] TRUE
It works by splitting the string(s) into characters, so each string becomes a vector of "1"s and "0"s. These are converted to numeric values, then reversed, then multiplied by a sequence of powers of 2 of the same length. This is summed to give an answer for each string.
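To see those steps one at a time, here is a small sketch that unrolls the one-liner for the single string "1011". The intermediate variable names are illustrative only, and strtoi() (base R) is used purely as a cross-check.

```r
# Unroll base2ToInt() for one string; variable names are illustrative
chars    <- strsplit("1011", "")[[1]]   # "1" "0" "1" "1"
bits     <- +(chars == "1")             # 1 0 1 1
rev_bits <- rev(bits)                   # 1 1 0 1 (least significant bit first)
weights  <- 2^(seq_along(chars) - 1)    # 1 2 4 8
result   <- sum(rev_bits * weights)     # 1 + 2 + 0 + 8
result
#> [1] 11

# Base R cross-check for short bit strings
strtoi("1011", base = 2L)
#> [1] 11
```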
It's a bit "code golf", but bit manipulation often is...
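As for what went wrong in the original pipeline: as far as I can tell, the steps up to as.raw are fine and produce 32 raw values of 00 or 01, one per bit. But readBin() treats each raw value as a whole byte, so it only consumes the first four of them. A sketch of the difference, assuming a 32-bit string like the one intToBase2() returns (the endian argument is set explicitly so the result does not depend on the platform):

```r
val <- 1855928457

# Build the 32-character bit string (most significant bit first)
bit_string <- paste(rev(as.integer(intToBits(val))), collapse = "")

# Back to one 0/1 value per bit, least significant bit first
bits <- rev(as.integer(strsplit(bit_string, "")[[1]]))

# readBin() reads four *bytes* (01 00 00 01), not 32 bits
wrong <- readBin(as.raw(bits), what = "integer", endian = "little")
wrong
#> [1] 16777217

# packBits() is the inverse of intToBits() and combines all 32 bits
restored <- packBits(bits, type = "integer")
restored
#> [1] 1855928457
```

packBits() needs the input length to be a multiple of 32, so this variant only suits full-width bit strings; base2ToInt() above has no such restriction.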
I am trying to create a couple of new variables, region_code and landowner_code, from substrings of another variable, site_name. The original site_name is mostly, but not always, segmented with the "-" character; "ABCDE" in df2 is one exception. I need to handle these anomalies separately.
library(tidyverse)
df1 <- data.frame(site_name = c("S10-020-38",
"M4-010-8",
"P10-005-001"))
# df with an anomaly "ABCDE"
df2 <- data.frame(site_name = c("S10-020-38",
"M4-010-8",
"P10-005-001",
"ABCDE"))
I am expecting the result to be (from df2):
site_name    region_code  landowner_code
S10-020-38   10           10-020
M4-010-8     4            4-010
P10-005-001  10           10-005
ABCDE        ABCDE        ABCDE
I have been able to do region_code successfully:
# region code -- this works
df2 %>% mutate(
region_code = case_when(
site_name == "ABCDE" ~ "ABCDE",
TRUE ~ str_sub(site_name,
start = 2,
end = str_locate(site_name, "-")[,1]-1)))
but not landowner_code, even if I exclude the anomaly record:
# site code -- this doesn't work, but why?
str_locate_all(df2$site_name, "-")
df2 %>%
head(3) %>%
mutate(
site_code = case_when(
site_name == "ABCDE" ~ "ABCDE",
TRUE ~ str_sub(site_name,
start = 2,
end = str_locate_all(df2$site_name, "-") %>% map(~ .x[2,2])
)))
Running the landowner_code code gives an error message:
Error in mutate():
Problem while computing site_code = case_when(...).
Caused by error in .x[2, 2]:
subscript out of bounds
But why out of bounds? From str_locate_all(df2$site_name, "-") I believed each element in the list is a 2x2 matrix. And how do I create the landowner_code variable properly?
You can use gsub instead of your combination of str_sub and str_locate_all. gsub returns the text unchanged if the pattern isn't matched, which is exactly how you want to handle the anomalies, and in my opinion it makes the code easier to understand.
pattern = "[A-Z]{1}([0-9]+)-([0-9]+)-[0-9]+"
df2 %>%
mutate(region_code = gsub(pattern, "\\1", site_name),
landowner_code = gsub(pattern, "\\1-\\2", site_name))
site_name region_code landowner_code
1 S10-020-38 10 10-020
2 M4-010-8 4 4-010
3 P10-005-001 10 10-005
4 ABCDE ABCDE ABCDE
Regex explanation:
[A-Z]{1}: matches any uppercase letter exactly once
([0-9]+)-: after that, one or more digits followed by a -. The () define a capture group, which can be referred to with the "\\1" syntax in the second argument. The digit run appears three times in the pattern, the last one without () because we don't need to refer to it later.
"\\1": return only the first group
"\\1-\\2": return the first group, followed by a '-', followed by the second group
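To inspect what each group captures, a base R sketch with regexec() and regmatches() may help; the variable names below are only illustrative. It also shows that the anomaly yields no match at all, which is why gsub leaves it untouched. The same absence of any "-" in "ABCDE" is, I believe, what made .x[2, 2] fail in the question: the map() call iterates over all of df2$site_name, and the match matrix for "ABCDE" has no rows to index.

```r
pattern   <- "[A-Z]{1}([0-9]+)-([0-9]+)-[0-9]+"
site_name <- c("S10-020-38", "M4-010-8", "P10-005-001", "ABCDE")

# One element per string: the full match followed by each capture group
groups <- regmatches(site_name, regexec(pattern, site_name))

groups[[1]]
#> [1] "S10-020-38" "10"         "020"

# The anomaly has no match, so it contributes zero elements
lengths(groups)
#> [1] 3 3 3 0
```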
I have a data frame where some values for "revenue" are listed in the hundreds, say "300," and others are listed as "1.5k." Obviously this is annoying, so I need to find some way of splitting the "k" and "." characters from those values and only those values. Any thoughts?
Another way to do this is just with Regex (and tidyverse for pipes)
library(tidyverse)
string <- c("300", "1.5k")
string %>% ifelse(
# check if string ends in k (upper/lower case)
grepl("[kK]$", .),
# if string ends in k, remove it and multiply by 1000
1000 * as.numeric(gsub("[kK]$", "", .)),
.) %>% as.numeric()
[1] 300 1500
You could create a function that removes the "k", converts the result to a numeric vector, and multiplies it by 1,000.
to_1000 <- function(x){
x %>%
str_remove("k") %>%
as.numeric() %>%
{.*1000}
}
x <- c("3000","1.5k")
tibble(x) %>%
mutate(x_num = if_else(str_detect(x,"k"),to_1000(x),as.numeric(x)))
# A tibble: 2 x 2
x x_num
<chr> <dbl>
1 3000 3000
2 1.5k 1500
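Both answers rely on the same idea: detect the suffix, strip it, and scale. A base R sketch of that idea follows; parse_revenue is a hypothetical helper name, and it assumes the only suffix is a trailing "k" or "K". Stripping the suffix before calling as.numeric() avoids the coercion warning that as.numeric("1.5k") would raise.

```r
# Hypothetical helper: convert strings like "300" or "1.5k" to numbers
parse_revenue <- function(x) {
  has_k <- grepl("[kK]$", x)               # TRUE where a k/K suffix is present
  num   <- as.numeric(sub("[kK]$", "", x)) # strip the suffix, then convert
  ifelse(has_k, num * 1000, num)           # scale only the suffixed values
}

parse_revenue(c("300", "1.5k", "2K"))
#> [1]  300 1500 2000
```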
I have a scenario where a lengthy (12-digit) index value is being read into R as a double. I need to concatenate this with some other identifiers, but mutate(x = as.character(x)) converts it to scientific format:
index <- c(123000789000, 123456000000, 123000000012)
concact_val <- c("C", "A", "B")
df <-
bind_cols(
as_tibble(index),
as_tibble(concact_val)
)
df %>%
mutate(index = as.character(index))
This outputs:
index concact_val
1.23e11 C
1.23e11 A
1.23e11 B
Whereas ideally I'd like to be able to do this:
df %>%
mutate(index = as.character(index),
index = paste0(concact_val, index)) %>%
select(-concact_val)
to output:
index
C123000789000
A123456000000
B123000000012
Is there a way around this? In this example, I created a vector for the index, but in the frame I'm reading in it's being read as a double via an API (unfortunately, I can't change the col type prior to reading in, it's being read differently than read_csv).
Use sprintf:
df %>%
mutate(result = sprintf("%s%0.0f", concact_val, index))
# # A tibble: 3 x 3
# index concact_val result
# <dbl> <chr> <chr>
# 1 123000789000 C C123000789000
# 2 123456000000 A A123456000000
# 3 123000000012 B B123000000012
If there is a chance that some index values have fractional components, this will round them silently. If that's a concern (and you don't want to round), you can use floor(index) inside the sprintf instead.
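A quick sketch of that difference, using a made-up index value with a fractional part (the second element is an assumption for illustration, not from the question's data):

```r
# The second value is invented to show the rounding behaviour
index <- c(123000789000, 123456000000.7)

rounded   <- sprintf("%0.0f", index)         # rounds to the nearest integer
truncated <- sprintf("%0.0f", floor(index))  # drops the fractional part

rounded
#> [1] "123000789000" "123456000001"
truncated
#> [1] "123000789000" "123456000000"
```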
We may use as.bigz from gmp
paste0(concact_val, gmp::as.bigz(index))
[1] "C123000789000" "A123456000000" "B123000000012"
Or another option is to specify the scipen in options to avoid converting to scientific format
options(scipen = 999)
In addition to the sprintf and gmp solutions, we may try another option, shown below, as a programming exercise
f <- function(x) {
res <- c()
while (x) {
res <- append(res, x %% 10)
x <- x %/% 10
}
paste0(rev(res), collapse = "")
}
paste0(concact_val, Vectorize(f)(index))
# [1] "C123000789000" "A123456000000" "B123000000012"
I have a vector of strings in which people were asked to guess someone's age. It includes statements like "50-60", "ca. 50" or ">50". I want to use regular expressions to match these cases and get the real numerical values. "50-60" should produce 55 (the mean of both values), and the other two examples 50.
For each variant, I wanted to have a case in a switch like below, but it doesn't seem to work. Is it even possible to use a regex in a switch?
switch (string,
str_detect(string, "[:digit:]+[:blank:]*(-|_)[:blank:]*[:digit:]+") = {
first <- str_sub(string, 1, 2) %>% as.numeric()
second <- str_sub(string, str_length(string)-1, str_length(string)) %>% as.numeric()
value <- mean(c(first, second))
},
str_detect(string, "((ca)\.?)|>|~[:blank:]*[:digit:]+") = {
value <- str_sub(string, str_length(string)-1, str_length(string)) %>% as.numeric()
},
str_detect(string, "[:digit:]+[:punct:]") = {
value <- str_sub(string, 1, 2) %>% as.numeric()
},
print(string, " could not be matched")
)
The expressions themselves are working as intended (as far as I tested), so I guess I can't use them in the switch like this. However I couldn't find a solution anywhere.
Edit: Added what the expected output for the examples is
We can do this with tidyverse methods:
1. Convert the string to a tibble/data.frame
2. Remove the characters that are not needed with str_remove_all
3. Then, separate the column into two by specifying the sep
4. Get the rowMeans
library(dplyr)
library(tidyr)
library(stringr)
tibble(mystring) %>%
mutate(mystring = str_remove_all(mystring, "[A-Za-z.><]+")) %>%
separate(mystring, into = c('col1', 'col2'), sep="[- ]+",
convert = TRUE) %>%
transmute(out = rowMeans(., na.rm = TRUE))
-output
# A tibble: 3 x 1
out
<dbl>
1 55
2 50
3 50
data
mystring <- c("50-60", "ca. 50", ">50")
You can use a nested if/else approach -
library(stringr)
string <- "50-60"
if(str_detect(string, "[:digit:]+[:blank:]*(-|_)[:blank:]*[:digit:]+")) {
first <- str_sub(string, 1, 2) %>% as.numeric()
second <- str_sub(string, str_length(string)-1, str_length(string)) %>% as.numeric()
value <- mean(c(first, second))
value
} else if(str_detect(string, "((ca)\\.?)|>|~[:blank:]*[:digit:]+")) {
value <- str_sub(string, str_length(string)-1, str_length(string)) %>% as.numeric()
value
} else if(str_detect(string, "[:digit:]+[:punct:]")) {
value <- str_sub(string, 1, 2) %>% as.numeric()
value
} else NA
#[1] 55
For string <- "ca. 50" it returns 50.
mystring <- c("50-60", "ca. 50", ">50")
library(stringr)
lapply(str_extract_all(mystring, "[0-9]+"),
function(x) if (length(x) == 1) as.numeric(x[1]) else mean(as.numeric(x)))
[[1]]
[1] 55
[[2]]
[1] 50
[[3]]
[1] 50
I have values such as:
df[,1:2]
Results in
I want to create a new column that holds the difference between the In and Out values.
These are time values.
Expected output is:
1201
0718 (neglecting negative values)
... and so on.
library(stringr)
# generate few rows of data
In <- c('143','1239')
Out <- c('1344','521')
df <- data.frame(cbind(In, Out), stringsAsFactors=FALSE)
# pad with zero if needed (e.g. 143 -> 0143)
df$In[str_length(df$In) == 3] <- paste(0,df$In[str_length(df$In) == 3], sep='')
df$Out[str_length(df$Out) == 3] <- paste(0,df$Out[str_length(df$Out) == 3], sep='')
df$In <- strptime(df$In, format='%H%M')
df$Out <- strptime(df$Out, format='%H%M')
df$diff <- df$In - df$Out
This gives:
> df$diff
Time differences in hours
[1] -12.01667 7.30000
Is this what you are looking for?
If I understand correctly, the OP wants to compute the absolute time difference where the time of the day (neglecting the date) is given as character strings in the form HMM or HHMM.
There are classes which support time of the day (without date) directly, e.g., the hms package or the ITime class of the data.table package.
As an additional challenge, the timestamps are not given in a standard time format HH:MM, e.g., 09:43.
Here is an approach which uses as.ITime() after the strings have been padded.
# create sample data frame
df <- data.frame(In = c("143", "1239"),
Out = c("1344", "521"))
library(magrittr) # piping is used for readability
# pad strings and coerce to ITime class
df$In %<>%
stringr::str_pad(4L, pad = "0") %>%
data.table::as.ITime("%H%M")
df$Out %<>%
stringr::str_pad(4L, pad = "0") %>%
data.table::as.ITime("%H%M")
# compute absolute difference
df$absdiff <- abs(df$In - df$Out)
df
In Out absdiff
1 01:43:00 13:44:00 12:01:00
2 12:39:00 05:21:00 07:18:00
Now, the OP seems to expect the result in the same non-standard format HHMM (without the : field separator) as the input values. This can be achieved by
df$absdiff %>%
as.POSIXct() %>%
format("%H%M")
[1] "1201" "0718"
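If pulling in data.table just for this feels heavy, the same arithmetic can be sketched in base R by converting each value to minutes since midnight. hhmm_to_minutes is a hypothetical helper name, and the sketch assumes every input is a valid HMM or HHMM string.

```r
# Hypothetical helper: "HMM"/"HHMM" string -> minutes since midnight
hhmm_to_minutes <- function(x) {
  n <- as.integer(x)
  (n %/% 100L) * 60L + (n %% 100L)  # hours * 60 + minutes
}

In  <- c("143", "1239")
Out <- c("1344", "521")

# Absolute difference in minutes, then back to zero-padded HHMM
diff_min <- abs(hhmm_to_minutes(In) - hhmm_to_minutes(Out))
absdiff  <- sprintf("%02d%02d", diff_min %/% 60L, diff_min %% 60L)
absdiff
#> [1] "1201" "0718"
```

Going through integer minutes avoids any date or time zone handling, at the cost of doing no validation on the input strings.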