Disappearing row after string split - r

I have a column of coordinates that I am splitting with strsplit() and removing unwanted characters from with gsub(). Note that there are 3034 rows.
> head(bike_parking$Geom)
[1] "(37.7606289177, -122.410647009)" "(37.752476948, -122.410625009)"
[3] "(37.7871729481, -122.402401009)" "(37.7776039475, -122.422764009)"
[5] "(37.7658325695, -122.46649784)" "(37.7693399479, -122.432820008)"
> length(bike_parking$Geom)
[1] 3034
> sum(is.na(bike_parking$Geom))
[1] 0
For some reason, after I run
dat <- data.frame(do.call(rbind, strsplit(as.vector(gsub("[()]", "", bike_parking$Geom)), split = ",")))
I am left with 3033 rows. How did that happen, and what steps can I take to figure out what went wrong?
> head(dat)
X1 X2
1 37.7606289177 -122.410647009
2 37.752476948 -122.410625009
3 37.7871729481 -122.402401009
4 37.7776039475 -122.422764009
5 37.7658325695 -122.46649784
6 37.7693399479 -122.432820008
> nrow(dat)
[1] 3033

It seems that your strings do not all have the same structure. To split them properly, you need to know which structure they have in common. From the comments below the question, I gather that some strings may not contain a comma between the coordinates. You can remove all commas and split the strings at the space instead. I'll post a solution in base R and a solution with the stringr package.
Option 1: Base R:
We can remove the parentheses and commas from your strings by using gsub(). Then we can split the strings at the space using strsplit(). The result will be:
splitted <- strsplit(gsub("[(),]", "", bike_parking$Geom), " ")
# [[1]]
# [1] "37.7606289177" "-122.410647009"
# [[2]]
# [1] "37.752476948" "-122.410625009"
# [[3]]
# [1] "37.7871729481" "-122.402401009"
# [[4]]
# [1] "37.7776039475" "-122.422764009"
# [[5]]
# [1] "37.7658325695" "-122.46649784"
# [[6]]
# [1] "37.7693399479" "-122.432820008"
We have to reorganise these results a bit, so you end up with a two-column character matrix (wrap it in data.frame() if you need a data frame):
sapply(1:2, function(x) sapply(splitted, `[[`, x))
# [,1] [,2]
# [1,] "37.7606289177" "-122.410647009"
# [2,] "37.752476948" "-122.410625009"
# [3,] "37.7871729481" "-122.402401009"
# [4,] "37.7776039475" "-122.422764009"
# [5,] "37.7658325695" "-122.46649784"
# [6,] "37.7693399479" "-122.432820008"
Option 2: Stringr: This package contains a function str_split() (not strsplit()!) that lets you skip the last step of the base R solution, because with simplify=TRUE you immediately get a character matrix instead of a list of vectors:
str_split(gsub("[(),]", "", bike_parking$Geom), " ", simplify=TRUE)
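To answer the "what steps do I take" part of the question, here is a minimal diagnostic sketch. The vector geom is a made-up stand-in for bike_parking$Geom, and the empty string in it is only an assumed culprit: strsplit() returns character(0) for an empty string, and rbind() ignores zero-length vectors, which is one way a row can silently vanish.

```r
# Hypothetical stand-in for bike_parking$Geom; the "" entry is an assumption
geom <- c("(37.7606289177, -122.410647009)",
          "",
          "(37.752476948, -122.410625009)")

# Same split as in the question
parts <- strsplit(gsub("[()]", "", geom), split = ",")

# Count the pieces per row; every well-formed row should give exactly 2
n_parts <- lengths(parts)

# Rows that did not split into exactly two pieces
bad <- which(n_parts != 2)
bad
geom[bad]

# rbind() drops the zero-length element, so one row is lost
dat <- data.frame(do.call(rbind, parts))
nrow(dat)
```

On the real data, running which(lengths(parts) != 2) should point at the offending rows, whatever their actual content turns out to be.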

Related

How can I extract unit preceded by number with str_extract?

I think str_extract can do this, but I can't figure out how. My data contains Chinese characters, so there is no whitespace between words. I simulated the data in English as:
> dd<-c("wwe12hours,fgg23days","ffgg12334hours,23days","ffff1days")
> target <- c("hours","days","hours","days")
> target
[1] "hours" "days" "hours" "days"
How can I achieve the target?
My real case is:
> dd <- c("腹痛发热12小时,再发2天","腹痛132324月,再发1天","发热4天")
> target <- c("小时","月","天")
> target
[1] "小时" "月" "天"
It seems you are looking for a regex to capture the units. Since your input is a vector of length three, it makes sense to return a result of length three as well. From your English example it is not clear how you arrive at a target of 4 units; I assume you meant 5 (one per number), or 3 (one per element).
Here is how you could tackle it. This approach works for any language:
English:
gsub("\\p{L}*+\\d+", "", dd, perl = TRUE)
[1] "hours,days" "hours,days" "days"
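Chinese (here ddc is the Chinese vector from the question):
gsub("\\p{L}*+\\d+", "", ddc, perl = TRUE)
[1] "小时,天" "月,天" "天"
To get each unit as a separate element, use regmatches():
regmatches(ddc,gregexpr("(?<=\\d)\\p{L}+",ddc,perl = TRUE))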
[[1]]
[1] "小时" "天"
[[2]]
[1] "月" "天"
[[3]]
[1] "天"
Or, if you want to use other packages, with str_extract_all:
library(stringr)
str_extract_all(ddc,"(?<=\\d)\\p{L}+")
You could use str_match_all :
library(stringr)
unlist(sapply(str_match_all(dd, '\\d+(\\w+)'), function(x) x[, 2]))
#[1] "hours" "days" "hours" "days" "days"
This captures the first word that comes after a number.
where
str_match_all(dd, '\\d+(\\w+)') #returns
#[[1]]
# [,1] [,2]
#[1,] "12hours" "hours"
#[2,] "23days" "days"
#[[2]]
# [,1] [,2]
#[1,] "12334hours" "hours"
#[2,] "23days" "days"
#[[3]]
# [,1] [,2]
#[1,] "1days" "days"
As mentioned by @Onyambu, we can use a lookbehind regex to avoid using sapply to subset the capture group.
unlist(str_extract_all(dd,"(?<=\\d)[A-Za-z]+"))
Base R solution:
cleaned_dd <- gsub(
  "[[:punct:]].*", "",
  unlist(lapply(
    strsplit(gsub("[[:digit:]]", " ", dd), "\\s+"),
    '[', -1
  ))
)
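For comparison, the lookbehind idea from the earlier answers can also be written in base R only. This is a small sketch on the English sample data; the variable names units_per_element and units are mine. It also shows why the flattened target has five entries rather than four.

```r
dd <- c("wwe12hours,fgg23days", "ffgg12334hours,23days", "ffff1days")

# One character vector of units per input element:
# letters that are immediately preceded by a digit
units_per_element <- regmatches(
  dd,
  gregexpr("(?<=\\d)[A-Za-z]+", dd, perl = TRUE)
)

# Number of units found in each element
lengths(units_per_element)
# [1] 2 2 1

# Flattened result
units <- unlist(units_per_element)
units
# [1] "hours" "days"  "hours" "days"  "days"
```

For the Chinese data, [A-Za-z] would need to be replaced by \\p{L}, as in the answer above.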

How to insert back a character in a string at the exact position where it was originally

I have strings with dots here and there. I would like to remove the dots (done), perform some other operations (also done), and then insert the dots back at their original positions (not done). How could I do that?
library(stringr)
stringOriginal <- c("abc.def","ab.cd.ef","a.b.c.d")
dotIndex <- str_locate_all(pattern ='\\.', stringOriginal)
stringModified <- str_remove_all(stringOriginal, "\\.")
I see that str_sub() may help, for example str_sub(stringModified[2], 3,2) <- "." gets me somewhere, but it is still far from the right place, and also I have no idea how to do it programmatically. Thank you for your time!
Update:
stringOriginal <- c("11.123.100","11.123.200","1.123.1001")
stringOriginalF <- as.factor(stringOriginal)
dotIndex <- str_locate_all(pattern ='\\.', stringOriginal)
stringModified <- str_remove_all(stringOriginal, "\\.")
stringNumFac <- sort(as.numeric(stringModified))
stringi::stri_sub(stringNumFac[1:2], 3, 2) <- "."
stringi::stri_sub(stringNumFac[1:2], 7, 6) <- "."
stringi::stri_sub(stringNumFac[3], 2, 1) <- "."
stringi::stri_sub(stringNumFac[3], 6, 5) <- "."
factor(stringOriginal, levels = stringNumFac)
After this manipulation, I am able to order the numbers, convert them back to strings, and use them later for plotting.
But since I wouldn't know the positions of the dots in advance, I want to do this programmatically. Another approach to factor ordering is also welcome, although I am still curious how to programmatically insert a character back into a string at exactly the position where it originally was.
This might be one of the cases for using base R's strsplit, which gives you a list, with a vector of substrings for each entry in your original vector. You can manipulate these with lapply or sapply very easily.
split_string <- strsplit(stringOriginal, "[.]")
#> split_string
#> [[1]]
#> [1] "11" "123" "100"
#>
#> [[2]]
#> [1] "11" "123" "200"
#>
#> [[3]]
#> [1] "1" "123" "1001"
Now you can do this to get the numbers
sapply(split_string, function(x) as.numeric(paste0(x, collapse = "")))
# [1] 11123100 11123200 11231001
And this to put the dots (or any replacement for the dots) back in:
sapply(split_string, paste, collapse = ".")
# [1] "11.123.100" "11.123.200" "1.123.1001"
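Building on these two steps, the factor ordering from the question's update could be sketched as follows. This assumes, as the question does, that ordering by the number obtained after removing the dots is what you want; the input order is shuffled here only to make the effect visible, and the names as_number and ordered_levels are mine.

```r
stringOriginal <- c("11.123.200", "1.123.1001", "11.123.100")

split_string <- strsplit(stringOriginal, "[.]")

# Numeric value of each string with the dots removed
as_number <- sapply(split_string, function(x) as.numeric(paste0(x, collapse = "")))

# Original strings (dots intact), sorted by that numeric value
ordered_levels <- unique(stringOriginal[order(as_number)])

f <- factor(stringOriginal, levels = ordered_levels)
levels(f)
# [1] "11.123.100" "11.123.200" "1.123.1001"
```

Because the levels are taken from the original strings, the dots never need to be re-inserted for this use case.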
And you could get the locations of the dots within each element of your original vector like this (the last value in each result is one past the end of the string, not a dot):
lapply(split_string, function(x) cumsum(nchar(x) + 1))
# [[1]]
# [1] 3 7 11
#
# [[2]]
# [1] 3 7 11
#
# [[3]]
# [1] 2 6 11
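If you really do want to put a character back at recorded positions, a minimal base R sketch could look like this. The helper reinsert() is hypothetical (not from any package), and it assumes that the operations performed in between do not change the length or order of the remaining characters.

```r
stringOriginal <- c("abc.def", "ab.cd.ef", "a.b.c.d")

# Positions of the dots in the original strings (-1 means "no dot")
dot_pos <- lapply(gregexpr(".", stringOriginal, fixed = TRUE), as.integer)

stringModified <- gsub(".", "", stringOriginal, fixed = TRUE)

# Hypothetical helper: insert `ch` so that it ends up at each position in `pos`.
# Going from left to right keeps the original positions valid, because every
# earlier insertion restores the prefix of the original string.
reinsert <- function(s, pos, ch = ".") {
  for (p in sort(pos[pos > 0])) {
    s <- paste0(substr(s, 1, p - 1), ch, substr(s, p, nchar(s)))
  }
  s
}

restored <- unname(mapply(reinsert, stringModified, dot_pos))
restored
# [1] "abc.def"  "ab.cd.ef" "a.b.c.d"
```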

How to replace space with "_" after last slash in a string with R

I have a list of strings, and for each string, I need to replace all spaces after the last slash with an "_". Here's a minimum reproducible example.
my_list <- list("abc/as 345/as df.pdf", "adf3344/aer4 ffsd.doc", "abc/3455/dfr.xls", "abc/3455/dfr serf_dff.xls", "abc/34 5 5/dfr 345 dsdf 334.pdf")
After doing the replacement, the result should be:
list("abc/as 345/as_df.pdf", "adf3344/aer4_ffsd.doc", "abc/3455/dfr.xls", "abc/3455/dfr_serf_dff.xls", "abc/34 5 5/dfr_345_dsdf_334.pdf")
I thought of matching the text after the last slash using a regex and then replacing " " with "_", but I didn't find a way to implement it.
It would be something like this:
gsub(pattern, "_", my_list),
in which pattern would be a regex that would be saying: match every space after the last slash (there is at least one slash in every element of the list).
You may use negative lookahead:
gsub(" (?!.*/.*)", "_", unlist(my_list), perl = TRUE)
# [1] "abc/as 345/as_df.pdf" "adf3344/aer4_ffsd.doc"
# [3] "abc/3455/dfr.xls" "abc/3455/dfr_serf_dff.xls"
# [5] "abc/34 5 5/dfr_345_dsdf_334.pdf"
Here we match and replace only those spaces that have no slash anywhere after them.
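To make the lookahead easier to read, here is a small sketch on a single path. The shorter pattern and the lookahead-to-end variant are my own rewordings of the same idea, not part of the answer above.

```r
path <- "abc/34 5 5/dfr 345 dsdf 334.pdf"

# A space, provided no "/" occurs anywhere after it
res_negative <- gsub(" (?!.*/)", "_", path, perl = TRUE)

# Equivalent positive form: a space followed only by
# non-slash characters up to the end of the string
res_positive <- gsub(" (?=[^/]*$)", "_", path, perl = TRUE)

res_negative
# [1] "abc/34 5 5/dfr_345_dsdf_334.pdf"
identical(res_negative, res_positive)
# [1] TRUE
```

The spaces in "34 5 5" are left alone because a slash still follows them.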
You can use dirname, basename and file.path :
as.list(file.path(
dirname(unlist(my_list)),
gsub(" ", "_", basename(unlist(my_list)))
))
# [[1]]
# [1] "abc/as 345/as_df.pdf"
#
# [[2]]
# [1] "adf3344/aer4_ffsd.doc"
#
# [[3]]
# [1] "abc/3455/dfr.xls"
#
# [[4]]
# [1] "abc/3455/dfr_serf_dff.xls"
#
# [[5]]
# [1] "abc/34 5 5/dfr_345_dsdf_334.pdf"
or a bit more efficient and compact :
as.list(file.path(
dirname(. <- unlist(my_list)),
gsub(" ", "_", basename(.))
))
Here's a thought. First, split by slash:
l2 <- strsplit(unlist(my_list), "/")
l2
# [[1]]
# [1] "abc" "as 345" "as df.pdf"
# [[2]]
# [1] "adf3344" "aer4 ffsd.doc"
# [[3]]
# [1] "abc" "3455" "dfr.xls"
# [[4]]
# [1] "abc" "3455" "dfr serf_dff.xls"
# [[5]]
# [1] "abc" "34 5 5" "dfr 345 dsdf 334.pdf"
Now we do a gsub on just the last element of each split-string, recombining with slashes:
mapply(function(a,i) paste(c(a[-i], gsub(" ", "_", a[i])), collapse="/"),
l2, lengths(l2), SIMPLIFY=FALSE)
# [[1]]
# [1] "abc/as 345/as_df.pdf"
# [[2]]
# [1] "adf3344/aer4_ffsd.doc"
# [[3]]
# [1] "abc/3455/dfr.xls"
# [[4]]
# [1] "abc/3455/dfr_serf_dff.xls"
# [[5]]
# [1] "abc/34 5 5/dfr_345_dsdf_334.pdf"
Here's a solution that uses the gsubfn package.
You use the regex (/[^/]+)$ to find the content following the last slash and you edit that content with a function that converts spaces to underscores.
library(gsubfn)
change_space_to_underscore <- function(x) gsub(x = x, pattern = "[[:space:]]+", replacement = "_")
gsubfn(x = my_list,
pattern = "(/[^/]+)$",
replacement = change_space_to_underscore)

R: strsplit based on two conditions, keeping delimiter

I am trying to split sentences based on different criteria. I am looking to split some sentences after "traction" and some after "ramasse". I looked up the grammar rules for grepl but didn't really understand them.
A data frame called export has a column ref, which contains strings ending in either "traction" or "ramasse".
>export$ref
ref
[1] "62133130_074_traction"
[2] "62156438_074_ramasse"
[3] "62153874_070_ramasse"
[4] "62138861_074_traction"
And I want to split the strings in the ref column into two parts.
ref R&T
[1] "62133130_074_" "traction"
[2] "62156438_074_" "ramasse"
[3] "62153874_070_" "ramasse"
[4] "62138861_074_" "traction"
What I tried(none of them was good)
strsplit(export$ref, c("traction", "ramasse"))
strsplit(export$ref, "\\_(?<=\\btraction)|\\_(?<=\\bramasse)", perl = TRUE)
strsplit(export$ref, "(?=['traction''ramasse'])", perl = TRUE)
Any help would be appreciated!
Here's a different approach (x is the character vector, e.g. export$ref):
strsplit(x, "_(?=[^_]+$)", perl = TRUE)
[[1]]
[1] "62133130_074" "traction"
[[2]]
[1] "62156438_074" "ramasse"
[[3]]
[1] "62153874_070" "ramasse"
[[4]]
[1] "62138861_074" "traction"
This splits each string at an underscore ("_") that is followed only by non-underscore characters up to the end of the string, i.e. at the last underscore.
Here is another option using stringr::str_split:
library(stringr)
str_split(ref, pattern = "_(?=[A-Za-z]+)", simplify = TRUE)
# [,1] [,2]
#[1,] "62133130_074" "traction"
#[2,] "62156438_074" "ramasse"
#[3,] "62153874_070" "ramasse"
#[4,] "62138861_074" "traction"
Sample data
ref <- c(
"62133130_074_traction",
"62156438_074_ramasse",
"62153874_070_ramasse",
"62138861_074_traction")
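Both answers drop the underscore, whereas the desired output in the question keeps it at the end of the first part. One way to get that, sketched in base R with two sub() calls instead of strsplit(), is shown below. The column name rt is a placeholder I chose for the question's "R&T".

```r
ref <- c(
  "62133130_074_traction",
  "62156438_074_ramasse",
  "62153874_070_ramasse",
  "62138861_074_traction")

pieces <- data.frame(
  # Remove the trailing run of non-underscore characters: keeps "..._074_"
  ref = sub("[^_]+$", "", ref),
  # Remove everything up to and including the last underscore
  rt  = sub("^.*_", "", ref),
  stringsAsFactors = FALSE
)
pieces
```

This relies on every string containing at least one underscore, as in the sample data.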

Collapsing mixed types into a neat comma separated string

I have a list of mixed types which I would like to collapse into a neat comma separated string to be read somewhere else. The following is a MWE:
a <- "name"
b <- as.vector(c(10))
names(b) <- c('s')
c <- as.vector(c(1, 2))
names(c) <- c('p1', 'p2')
d <- 20
r <- list(a, b, c, d)
r
# [[1]]
# [1] "name"
#
# [[2]]
# s
# 10
#
# [[3]]
# p1 p2
# 1 2
#
# [[4]]
# [1] 20
I want this:
# [1] '"name","10","1,2","20"'
But this is as far as I got:
# Collapse individual elements into individual strings.
# `sapply` with `paste` works perfectly:
> sapply(r, paste, collapse = ",")
# [1] "name" "10" "1,2" "20"
# Try paste again (doesn't work):
> paste(sapply(r, paste, collapse = ","), collapse = ',')
# [1] "name,10,1,2,20"
I tried paste0 and cat to no avail. The only way I could do it was by using write.table and passing it an in-memory buffer. That approach is too complicated and quite error-prone. I need my code to work on a cluster with MPI.
You need to add in the quotes - the ones printed after your sapply are just markers to show they are strings. This seems to work...
cat(paste0('"',sapply(r, paste, collapse = ','),'"',collapse=','))
"name","10","1,2","20"
You might need to try with and without the cat if you are writing to a file. Without it, at the terminal, you get backslashes before the 'real' quotes.
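The difference between the stored string and its printed form can be checked directly. In this sketch the vector from the question is renamed to c_ (my choice, to avoid masking the c() function), and vapply() is used in place of sapply() only to make the return type explicit.

```r
a  <- "name"
b  <- c(s = 10)
c_ <- c(p1 = 1, p2 = 2)
d  <- 20
r  <- list(a, b, c_, d)

# Collapse each element to one string, then wrap each in literal double quotes
fields <- vapply(r, paste, character(1), collapse = ",")
line   <- paste0('"', fields, '"', collapse = ",")

# print() shows escaped quotes; the string itself contains no backslashes
print(line)
# [1] "\"name\",\"10\",\"1,2\",\"20\""

# writeLines() (like cat()) shows the actual characters
writeLines(line)
# "name","10","1,2","20"
```

So the backslashes are only part of how R displays the string at the console; what gets written to a file with writeLines() or cat() is the plain quoted text.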
