Remove a sequence in a character in R

I have the following character in a data.frame:
b <- "http://datos.labcd.mx/dataset/5b18cc1e-d2f2-46b0-bf2c-e699ae2af713/resource/e265a46f-7a9f-4a30-ae0d-d5937fff17c1/download/201003.csv"
I just want to extract the number 201003.
How should I do that?

b <- "http://datos.labcd.mx/dataset/5b18cc1e-d2f2-46b0-bf2c-e699ae2af713/resource/e265a46f-7a9f-4a30-ae0d-d5937fff17c1/download/201003.csv"
Try this on 'b':
file_name <- basename(b)
file_name
# [1] "201003.csv"
number <- strsplit(file_name, "\\.")[[1]]
number
# [1] "201003" "csv"
number = as.numeric(number[1])
number
# [1] 201003
Hope this helped.
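The same result can be reached in fewer steps. The sketch below is only an illustration, assuming the URL always ends in a numeric file name followed by `.csv`; the variable names `number_a` and `number_b` are made up for this example.

```r
# URL taken from the question above.
b <- "http://datos.labcd.mx/dataset/5b18cc1e-d2f2-46b0-bf2c-e699ae2af713/resource/e265a46f-7a9f-4a30-ae0d-d5937fff17c1/download/201003.csv"

# Option 1: drop the directory part, then the file extension.
number_a <- as.numeric(tools::file_path_sans_ext(basename(b)))

# Option 2: capture the digits between the last "/" and ".csv".
# If the pattern does not match, sub() returns the input unchanged
# and as.numeric() then yields NA with a warning.
number_b <- as.numeric(sub("^.*/([0-9]+)\\.csv$", "\\1", b))

number_a
number_b
```

Both options work on a whole character column as well, since `basename()`, `sub()` and `as.numeric()` are vectorized.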

Related

Extracting exact value from a data.frame in R

I cannot find the answer to this question. All the results my search engine gives me are about how to round a number, not how to get an unrounded one. Suppose I have a data.frame:
a <- c(1:4)
b <- c(1.123456789, 2.123456789, 3.123456789, 4.123456789)
df <- data.frame(a, b)
Every method I know returns the number rounded to the 6th digit after the decimal point:
df[2,2]
# [1] 2.123457
df[2,]
# a b
# 2 2 2.123457
df$b
# [1] 1.123457 2.123457 3.123457 4.123457
df$b[2]
# [1] 2.123457
df[df$a == 2, ]
# a b
# 2 2 2.123457
So, how to get the exact value? My desired output would be
[1] 2.123456789
Thank you!
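One likely explanation for the output above: R keeps the full double value, and only the default print setting (7 significant digits, see `getOption("digits")`) shortens what is displayed. A minimal sketch under that assumption; the names `exact` and `fixed` are made up for illustration.

```r
a <- 1:4
b <- c(1.123456789, 2.123456789, 3.123456789, 4.123456789)
df <- data.frame(a, b)

# The stored value is unchanged; only the printed form is shortened.
print(df$b[2], digits = 10)

# As character strings, with 10 significant digits or 9 decimals.
exact <- format(df$b[2], digits = 10)
fixed <- sprintf("%.9f", df$b[2])
```

Setting `options(digits = 10)` would change the default for the whole session instead of a single call.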

How to find patterns between sets of strings in R?

I'm trying to find patterns in a set of strings as the following example:
"2100780D001378FF01E1000000040000--------01A456000000------------"
"3100782D001378FF03E1008100040000--------01A445800000------------"
If I use the standard get_pattern from the bpa library, it looks at each string individually, so I get
"9999999A999999AA99A9999999999999--------99A999999999------------"
But my idea would be to find something like:
"X10078XD001378FF0XE100XX00040000--------01A4XXX00000------------"
The main objective is to find the set of strings with the most similar "pattern".
My first idea was to calculate the Hamming distance between them and then analyze the groups resulting from this distance, but it gets tedious. Is there any "automatic" approach?
Any idea of how I can accomplish this mission?

For your sample data, the code below works. I have no idea how it scales to production.
library( data.table )
#sample data
data <- data.table( name = c("2100780D001378FF01E1000000040000--------01A456000000------------",
"3100782D001378FF03E1008100040000--------01A445800000------------"))
# name
# 1: 2100780D001378FF01E1000000040000--------01A456000000------------
# 2: 3100782D001378FF03E1008100040000--------01A445800000------------
#use data.table::tstrsplit() to split the string to individual characters
l <- lapply( data.table::tstrsplit( data$name, ""), function(x) {
  #if the same character appears in all strings on the same position, return the character, else return 'X'
  if ( length( unique( x ) ) == 1 ) as.character(x[1]) else "X"
})
#paste it all together
paste0(l, collapse = "")
# [1] "X10078XD001378FF0XE100XX00040000--------01A4XXX00000------------"
small explanation
data.table::tstrsplit( data$name, "") returns the following list
[[1]]
[1] "2" "3"
[[2]]
[1] "1" "1"
[[3]]
[1] "0" "0"
etc...
Using lapply(), you can loop over this list and determine the number of unique elements at each position. If this length == 1, the same character exists in all strings at this position, so return that character.
If the length > 1, multiple characters appear at this position across the strings, so return "X".
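The per-position comparison described above can also be sketched in base R, without data.table. This is an illustrative variant, not the answer's own code, and it assumes all strings have the same length; `strings`, `chars` and `pattern` are names chosen for this example.

```r
strings <- c(
  "2100780D001378FF01E1000000040000--------01A456000000------------",
  "3100782D001378FF03E1008100040000--------01A445800000------------"
)

# One row per string, one column per character position.
chars <- do.call(rbind, strsplit(strings, ""))

# Keep a character if every string agrees at that position, else "X".
pattern <- apply(chars, 2, function(column) {
  if (length(unique(column)) == 1) column[1] else "X"
})

paste(pattern, collapse = "")
# [1] "X10078XD001378FF0XE100XX00040000--------01A4XXX00000------------"
```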
Update
If you are after the Hamming distances, use the stringdist package
library(stringdist)
m <- stringdist::stringdistmatrix(a = data$name, b = data$name, method = "hamming")
# [,1] [,2]
# [1,] 0 8
# [2,] 8 0
#to get the minimum value for each row, exclude the diagonal first (by making it NA)
# and then find the position with the minimum value
diag(m) <- NA
apply( m, 1, which.min )
# [1] 2 1
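For reference, the distance of 8 in the matrix above can be reproduced without any package by counting the positions where two equal-length strings differ. In this sketch `hamming()` is a hypothetical helper written for illustration, not a function from stringdist.

```r
s1 <- "2100780D001378FF01E1000000040000--------01A456000000------------"
s2 <- "3100782D001378FF03E1008100040000--------01A445800000------------"

# Hypothetical helper: Hamming distance between two strings of equal length.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

hamming(s1, s2)
# [1] 8
```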

Here is a base R solution: a custom function findPat is defined, and Reduce applies it across a set of strings to find their common pattern (s1, s2 and s3 are defined under DATA below), i.e.,
findPat <- function(s1, s2) {
  r1 <- utf8ToInt(s1)
  r2 <- utf8ToInt(s2)
  # positions where the two code points differ are replaced by "X"
  r1[bitwXor(r1, r2) != 0] <- utf8ToInt("X")
  intToUtf8(r1)
}
pat <- Reduce(findPat,list(s1,s2,s3))
such that
> pat
[1] "X10078XDX0X378FF0XE100XX00040000--------01AXXXXX0000------------"
DATA
s1 <- "2100780D001378FF01E1000000040000--------01A456000000------------"
s2 <- "3100782D001378FF03E1008100040000--------01A445800000------------"
s3 <- "4100781D109378FF03E1008100040000--------01A784580000------------"

Looping over multiple lists with base R

In python we can do this..
numbers = [1, 2, 3]
characters = ['foo', 'bar', 'baz']
for item in zip(numbers, characters):
    print(item[0], item[1])
(1, 'foo')
(2, 'bar')
(3, 'baz')
We can also unpack the tuple rather than using the index.
for num, char in zip(numbers, characters):
    print(num, char)
(1, 'foo')
(2, 'bar')
(3, 'baz')
How can we do the same using base R?

To do something like this in an R-native way, you'd use the idea of a data frame. A data frame has multiple variables which can be of different types, and each row is an observation of each variable.
d <- data.frame(numbers = c(1, 2, 3),
characters = c('foo', 'bar', 'baz'))
d
## numbers characters
## 1 1 foo
## 2 2 bar
## 3 3 baz
You then access each row using matrix notation, where leaving an index blank includes everything.
d[1,]
## numbers characters
## 1 1 foo
You can then loop over the rows of the data frame to do whatever you want. Presumably you actually want to do something more interesting than printing.
for (i in seq_len(nrow(d))) {
  print(d[i, ])
}
## numbers characters
## 1 1 foo
## numbers characters
## 2 2 bar
## numbers characters
## 3 3 baz

For another option, how about mapply, which is the closest analog to zip I can think of in R. Here I'm using the c function to make a new vector, but you could use any function you'd like:
numbers<- c(1, 2, 3)
characters<- c('foo', 'bar', 'baz')
mapply(c,numbers, characters, SIMPLIFY = FALSE)
[[1]]
[1] "1" "foo"
[[2]]
[1] "2" "bar"
[[3]]
[1] "3" "baz"
Which way is of most use depends on what you want to do with your output, but as the other answers mention, a dataframe is the most natural approach in R (and pandas dataframe probably in python).
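To mirror the unpacking form of the Python loop, mapply can also be given a function with one named parameter per vector. A small sketch; the helper name `describe` and the result name `pairs` are made up for illustration.

```r
numbers <- c(1, 2, 3)
characters <- c("foo", "bar", "baz")

# Each call receives one element of each vector, like `for num, char in zip(...)`.
describe <- function(num, char) paste(num, char)

pairs <- mapply(describe, numbers, characters, USE.NAMES = FALSE)
pairs
# [1] "1 foo" "2 bar" "3 baz"
```

With the default `SIMPLIFY = TRUE` the result collapses to a character vector; `SIMPLIFY = FALSE` would keep it as a list, as in the answer above.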

To index a vector x in R, write x[1]; this returns the first element of the vector. R element numbering starts at 1, in contrast to Python, which starts at 0.
For this problem it would be:
x <- seq(1, 10)
j <- seq(11, 20)
# seq_along(x) is safer than 1:length(x), which misbehaves for empty vectors
for (i in seq_along(x)) {
  print(c(x[i], j[i]))
}

Many functions in R are vectorized and don't require loops:
numbers = c(1, 2, 3)
characters = c('foo', 'bar', 'baz')
myList <- list(numbers, characters)
myDF <- data.frame(numbers, characters, stringsAsFactors = FALSE)
print(myList)
print(myDF)

This is the conceptual equivalent (the output shown below appears to come from a run with numbers 1 to 5 and characters "a" to "e", rather than the vectors in the question):
for (item in Map(list, numbers, characters)) {
  print(item[c(1, 2)])
}
# [[1]]
# [1] 1
#
# [[2]]
# [1] "a"
#
# [[1]]
# [1] 2
#
# [[2]]
# [1] "b"
#
# [[1]]
# [1] 3
#
# [[2]]
# [1] "c"
#
# [[1]]
# [1] 4
#
# [[2]]
# [1] "d"
#
# [[1]]
# [1] 5
#
# [[2]]
# [1] "e"
Though most of the time you would actually do all your work inside Map and do something like this:
Map(function(nu,ch){print(data.frame(nu,ch))},numbers,characters)
This is the closest I could get to a clone:
zip <- function(...){ Map(list,...)}
print2 <- function(...){do.call(cat,c(list(...),"\n"))}
for (item in zip(numbers, characters)) {
  print2(item[[1]], item[[2]])
}
# 1 a
# 2 b
# 3 c
# 4 d
# 5 e
To be able to refer to items by name (indices still work):
zip <- function(...) {
  # capture the argument expressions as written, e.g. "numbers", "characters"
  names <- sapply(substitute(list(...))[-1], deparse)
  Map(function(...) { setNames(list(...), names) }, ...)
}
for (item in zip(numbers, characters)) {
  print2(item[["numbers"]], item[["characters"]])
}

The tidyverse solution would be to use the purrr::map2 function. For example:
library(purrr)
numbers <- c(1, 2, 3)
characters <- c('foo', 'bar', 'baz')
map2(numbers, characters, ~paste0(.x, ',', .y))
#[[1]]
#[1] "1,foo"
#[[2]]
#[1] "2,bar"
#[[3]]
#[1] "3,baz"

Another scalable alternative: store the vectors in a list and iterate over the indices.
vect1 <- c('foo', 'bar', 'baz')
vect2 <- c('a', 'b', 'c')
idx_list <- list(vect1, vect2)
idx_vect <- seq_along(idx_list[[1]])
for (i in idx_vect) {
  x <- idx_list[[1]][i]
  j <- idx_list[[2]][i]
  print(c(i, x, j))
}

R: cluster strings with same begin

I've got the following vector
words <- c("verkoop", "verkoopartikel", "artikelnummer", "bank", "bankinfo", "bankrekeningnummer", "artikelnaam")
How can I cluster the words that begin with the same letters?
So here, this would be:
verkoop, verkoopartikel
artikelnummer, artikelnaam
bank, bankinfo, bankrekeningnummer

Here's a potential solution which first extracts the unique starting letters and then clusters the words in the vector using pattern matching:
words <- c("verkoop", "verkoopartikel", "artikelnummer", "bank", "bankinfo", "bankrekeningnummer", "artikelnaam")
l <- unique(substring(words,1,1))
l <- paste0("^", l) # the ^ indicates that the string should start with this letter
lapply(l, function(x,y) y[grep(x,y)], y=words)
# [[1]]
# [1] "verkoop" "verkoopartikel"
# [[2]]
# [1] "artikelnummer" "artikelnaam"
# [[3]]
# [1] "bank" "bankinfo" "bankrekeningnummer"

For two words to belong to the same cluster, how many initial letters should they share? The following example works with n_init = 4 letters.
library(dplyr)
n_init <- 4
data.frame(words) %>%
  mutate(cluster = as.numeric(as.factor(substring(words, 1, n_init))))
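The same prefix-based grouping can be sketched in base R with split(), which returns the words themselves grouped by their first n_init letters. This is an illustrative variant under the same assumption (a fixed prefix length of 4); the name `clusters` is made up for this example.

```r
words <- c("verkoop", "verkoopartikel", "artikelnummer", "bank",
           "bankinfo", "bankrekeningnummer", "artikelnaam")
n_init <- 4

# Group the words by their first n_init characters; the list names are the prefixes.
clusters <- split(words, substring(words, 1, n_init))
clusters
# $arti
# [1] "artikelnummer" "artikelnaam"
#
# $bank
# [1] "bank"               "bankinfo"           "bankrekeningnummer"
#
# $verk
# [1] "verkoop"        "verkoopartikel"
```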

Looping a function for each row in csv

I have a .csv file containing 22,388 rows of comma-separated numbers. I want to find all possible combinations of pairs of the numbers for each row separately and list them pair by pair, so that I'll be able to make a visual representation of them as clusters.
An example of two rows from my file would be
"2, 13"
"2, 8, 6"
When I use the str() function, R says the file contains factors. I guess the values need to be integers, but I need the rows to stay separate, so I've wrapped each row in quotes.
I want possible combinations of pairs for each row like this.
2, 13
2, 8
2, 6
8, 6
I've already gotten an answer from @flodel saying
Sample input - replace textConnection(...) with your csv filename.
csv <- textConnection("2,13
2,8,6")
This reads the input into a list of values:
input.lines <- readLines(csv)
input.values <- strsplit(input.lines, ',')
This creates a nested list of pairs:
pairs <- lapply(input.values, combn, 2, simplify = FALSE)
This puts everything in a nice matrix of integers:
pairs.mat <- matrix(as.integer(unlist(pairs)), ncol = 2, byrow = TRUE)
pairs.mat
But I need the function to run through each row in my .csv file separately, so I think I need a for loop around the function. I just can't get my head around it.
Thanks in advance.

Not sure exactly what you're after but maybe something like this:
dat <- readLines(n=2) #read in your data
2, 13
2, 8, 6
## split each string on "," and then remove white space
## and put into a list with lapply
dat2 <- lapply(dat, function(x) {
  as.numeric(gsub("\\s+", "", unlist(strsplit(x, ","))))
})
## find all combinations using outer (faster
## than expand.grid and we can take just a triangle)
dat3 <- lapply(dat2, function(x) {
  y <- outer(x, x, paste)
  c(y[upper.tri(y)])
})
## then split on the spaces and convert back to numeric
## stored as a list
lapply(strsplit(unlist(dat3), " "), as.numeric)
## > lapply(strsplit(unlist(dat3), " "), as.numeric)
## [[1]]
## [1] 2 13
##
## [[2]]
## [1] 2 8
##
## [[3]]
## [1] 2 6
##
## [[4]]
## [1] 8 6
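The per-row pairing can also be sketched with combn() applied inside lapply(), which already handles each row separately, so no explicit for loop is needed. This is only an illustration using the two sample rows; `lines`, `values`, `pair_list` and `pairs` are names chosen for this example.

```r
# Two sample rows from the question; in practice these would come from readLines().
lines <- c("2, 13", "2, 8, 6")

# Split each row on commas and convert to integers (one vector per row).
values <- lapply(strsplit(lines, ","), function(x) as.integer(trimws(x)))

# Rows with fewer than two numbers have no pairs; dropping them also avoids
# combn() treating a single number n as the sequence 1:n.
values <- Filter(function(v) length(v) >= 2, values)

# combn() returns one pair per column, so transpose to get one pair per row.
pair_list <- lapply(values, function(v) t(combn(v, 2)))

# Stack the per-row results into a single two-column matrix.
pairs <- do.call(rbind, pair_list)
pairs
```

Keeping `pair_list` around preserves which pairs came from which input row, in case that is needed for the clustering step.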
