Concatenate rows of matrix with two separators - r

My goal is to turn a matrix of coordinate pairs into a single character string with the coordinate pairs pasted together. For example, I have the coordinates of a line string:
mat <- routes@lines[[9]]@Lines[[1]]@coords
mat
[,1] [,2]
[1,] -122.4491 37.7698
[2,] -122.4519 37.7694
[3,] -122.4491 37.7698
Which I would like to convert into this:
"-122.4491,37.7698;-122.4519,37.7694;-122.4491,37.7698"
Where lat and lon of a single pair are separated by a comma, and pairs are separated by a semicolon.
apply(format(mat), 1, paste, sep=";", collapse = "")
does not produce the desired output. How would one do this in R?
Here is the sample data:
dput(mat)
structure(c(-122.4491, -122.4519, -122.4491, 37.7698, 37.7694,
37.7698), .Dim = c(3L, 2L))

mat <- structure(c(-122.4491, -122.4519, -122.4491, 37.7698, 37.7694, 37.7698), .Dim = c(3L, 2L))
This is a matrix, with coordinates in two columns. You can just use one call to paste.
paste(mat[,1], mat[,2], sep = ",", collapse = ";")
# [1] "-122.4491,37.7698;-122.4519,37.7694;-122.4491,37.7698"
Here sep sets the delimiter between lat and long coordinates (cells in the same row), and collapse sets the delimiter between coordinate pairs (the delimiter between different rows).
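To see the two roles separately, the same call can be split into two steps. This is only an illustrative sketch; lon and lat are made-up stand-ins for the two matrix columns.

```r
# Hypothetical stand-ins for mat[, 1] and mat[, 2]
lon <- c(-122.4491, -122.4519)
lat <- c(37.7698, 37.7694)

# Step 1: sep joins the elements of each pair (one string per row)
pairs <- paste(lon, lat, sep = ",")

# Step 2: collapse joins the per-row strings into a single string
joined <- paste(pairs, collapse = ";")

# Doing both in one call gives the same result
one_call <- paste(lon, lat, sep = ",", collapse = ";")
```

Because paste() is vectorised over its arguments, no apply() or loop is needed for the row-wise step.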

mat <- structure(c(-122.4491, -122.4519, -122.4491, 37.7698, 37.7694, 37.7698), .Dim = c(3L, 2L))
You were close. Your use of apply is appropriate, but because you are operating row-wise first, you need to handle one delimiter at a time:
apply(mat, 1, paste, collapse=",")
# [1] "-122.4491,37.7698" "-122.4519,37.7694" "-122.4491,37.7698"
... and then combine all of those with a single external paste:
paste(apply(mat, 1, paste, collapse=","), collapse=";")
# [1] "-122.4491,37.7698;-122.4519,37.7694;-122.4491,37.7698"

Another option is to convert to data.frame and then use do.call
do.call(paste, c(as.data.frame(mat), collapse=";", sep=","))
#[1] "-122.4491,37.7698;-122.4519,37.7694;-122.4491,37.7698"
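As a rough illustration of what do.call is doing here: it passes each data frame column to paste() as a separate argument, together with sep and collapse. The sketch below uses a made-up three-column matrix (mat3 is not from the question) to suggest that the same call should work for any number of columns.

```r
# Hypothetical 2 x 3 matrix; rows are (1, 3, 5) and (2, 4, 6)
mat3 <- matrix(1:6, nrow = 2)
cols <- as.data.frame(mat3)  # columns are named V1, V2, V3

# do.call builds the call paste(V1 = ..., V2 = ..., V3 = ..., sep = ",", collapse = ";")
via_do_call <- do.call(paste, c(cols, sep = ",", collapse = ";"))

# The same call written out by hand
by_hand <- paste(cols$V1, cols$V2, cols$V3, sep = ",", collapse = ";")
```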

Related

Extract a portion of a string in R

I have a dataframe with ~200,000 rows and I need to extract a portion of a string from one of the columns. I have tried split and str_split but am not getting what I need.
I need to cut/split/explode on the / forward slash and only return the first field.
From the following sample I need only subdomain.domain.org
subdomain.domain.org/Site/SiteXYZ/Users/Usergroup
I have tried:
splitA <- str_split(my_data$OrganizationalUnit, pattern = "/", simplify = TRUE)
which returns a dataframe with multiple columns, one for each string between the delimiters
head(splitA)
[,1] [,2] [,3] [,4] [,5]
[1,] "subdomain.domain.org" "Site" "SiteXYZ" "Users" "Usergroup"
and
splitB <- str_split(my_data$OrganizationalUnit, pattern = "/", n = '2', simplify = TRUE)
which returns a dataframe with multiple columns, one for each string between the delimiters
head(splitB)
[,1] [,2]
[1,] "subdomain.domain.org" "Site/SiteXYZ/Users/Usergroup"
and
splitC <- str_split(my_data$OrganizationalUnit, pattern = "/", n = '1', simplify = TRUE)
which returns a dataframe with one column, that looks like the original
head(splitC)
[,1]
[1,] "subdomain.domain.org/Site/SiteXYZ/Users/Usergroup"
My end goal is to either:
Extract this string as part of larger queries
Add a column with this field
or add this as a column upon csv import
Any help will be much appreciated.
Thank you,
-Jacob
You were very close, if I understood right what you are after. As stated in the help file of ?str_split(), the simplify argument means:
"If FALSE, the default, returns a list of character vectors. If TRUE returns a character matrix."
If you are dealing with a character vector of domains, you can use simplify=TRUE and extract the first column of the matrix with (...)[, 1]. In any case, it is more efficient to use a combination of str_sub() and str_locate().
library(tidyverse)
x <- c("subdomain.domain.org/Site/SiteXYZ/Users/Usergroup",
"othersubdomain.domain.com/Site/SiteXYZ/Users/Usergroup",
"yetanothersubdomain.domain.com/Site/SiteXYZ/Users")
str_sub(x, start = 1, str_locate(string = x, "/")[, 1]-1)
# -1 otherwise / is kept in resulting string
str_split(x, pattern = "/", n = 2, simplify = TRUE)[, 1]
# [1] "subdomain.domain.org" "othersubdomain.domain.com" "yetanothersubdomain.domain.com"
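If loading stringr is not an option, a base R alternative (not part of the answer above, just a sketch) is to delete everything from the first slash onwards with sub(). The second input below is a made-up value without any slash, included to show that such strings pass through unchanged.

```r
# The second element is a hypothetical value with no "/" at all
x <- c("subdomain.domain.org/Site/SiteXYZ/Users/Usergroup",
       "nodelimiter.example.org")

# Remove the first "/" and everything after it; strings without "/" are kept as-is
first_field <- sub("/.*$", "", x)
```

As far as I know, the str_locate() approach returns NA for strings that contain no slash, so which behaviour is preferable depends on the data.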
In a data.frame
If it is a column in a data.frame, you can use mutate() from {dplyr}:
library(tidyverse)
df <- tibble(domain = x,
y = rnorm(length(x))
)
df_extracted_domain <- df %>% mutate(
domain_suffix = str_sub(domain, start = 1, end = str_locate(domain, "/")[, 1] - 1)
)
If you are not familiar with the tidyverse, and especially the pipe operator %>%, just read it as "and then". So the code above reads:
take df, "and then" mutate it (by adding a new variable called domain_suffix). The mutated data.frame is assigned to a new object (or overwrites the original, if you give it the same name).
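To make the "and then" reading concrete, here is a small sketch showing that a piped call and the equivalent nested call give the same result. It deliberately avoids extra packages: it assumes R 4.1 or later for the native pipe |>, and uses base transform() and sub() as stand-ins for mutate() and str_sub(). The data and the column name host are made up.

```r
# Hypothetical data; requires R >= 4.1 for the native pipe |>
df <- data.frame(domain = c("a.example.org/x/y", "b.example.org/z"),
                 stringsAsFactors = FALSE)

# "take df, and then add a column"
piped <- df |> transform(host = sub("/.*$", "", domain))

# The same operation written as an ordinary nested call
nested <- transform(df, host = sub("/.*$", "", domain))
```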
benchmarking
y <- rep(x, 1e3)
bm <- rbenchmark::benchmark(
str_split = {
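str_split(y, pattern = "/", n = 2, simplify = TRUE)[,1]
},
str_sub = {
# locate on y (not x) so that both expressions process the same input
str_sub(y, start = 1, end = str_locate(y, "/")[, 1] - 1)
},
replications = 1000
)
# Timings reported for the original run; expect them to vary by machine and input
# test replications elapsed relative user.self sys.self user.child sys.child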
# 1 str_split 1000 1.4 7.1 1.4 0.011 0 0
# 2 str_sub 1000 0.2 1.0 0.2 0.000 0 0

Text to columns by fixed width in R

I have a large data frame in which I'm trying to separate the values from one column into two. The values are character then text such as AU2847 or AU1824. I want the first column to be AU and the second to be the corresponding 4 digit number.
I am also restricted to the base R packages, so I believe strsplit will be the best bet, but I can't figure out how to make it split after the 2nd character and create 2 columns from the result.
I regularly use these two functions:
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
and
substrLeft <- function(x, n){
substr(x, 1,n)
}
These return the n leftmost or rightmost characters of the string.
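Applied to the values from the question, the two helpers could be used roughly like this. The column names prefix and number are made up for the example, and converting the digits to integer is an extra step that the question does not ask for.

```r
substrLeft <- function(x, n) substr(x, 1, n)
substrRight <- function(x, n) substr(x, nchar(x) - n + 1, nchar(x))

codes <- c("AU2847", "AU1824")

# Hypothetical column names; the first 2 characters and the last 4 characters
parts <- data.frame(prefix = substrLeft(codes, 2),
                    number = as.integer(substrRight(codes, 4)),
                    stringsAsFactors = FALSE)
```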
There are several ways to do this. You can subset by position using substr(), or you can use gsub() with backreferences. Subsetting by position is faster but inflexible (you would need a huge data frame to notice the difference in time), while the regex approach with gsub() is a little slower but much more flexible. E.g.:
df[c("col2", "col3", "col2b", "col3b")] <- list(substr(df$col1, 1, 2),
substr(df$col1, 3, 6),
gsub("([[:alpha:]]+)(\\d+)", "\\1", df$col1),
gsub("([[:alpha:]]+)(\\d+)", "\\2", df$col1))
df
col1 col2 col3 col2b col3b
1 AU2847 AU 2847 AU 2847
2 AU1824 AU 1824 AU 1824
Data:
df <- data.frame(col1 = c("AU2847", "AU1824"), stringsAsFactors = FALSE)
You could try:
as.data.frame(
do.call(rbind,
strsplit(sub("^(.+?)(\\d+)", "\\1_\\2", df$col),
split="_")
)
)
Here df is the name of your data frame and col the name of your column.
The sub() call artificially inserts an underscore between the text and the first digit, so that the underscore can then be used as the split argument to strsplit().
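To make the intermediate step visible, here is a small sketch of the same idea. Note that it uses a slightly different pattern than the answer (explicit letter and digit classes instead of the lazy quantifier), so treat it as an illustration rather than the exact code above.

```r
x <- c("AU2847", "AU1824")

# Step 1: insert an underscore between the letters and the digits
marked <- sub("^([[:alpha:]]+)([[:digit:]]+)$", "\\1_\\2", x)

# Step 2: split on the literal underscore and bind the pieces into a matrix
parts <- do.call(rbind, strsplit(marked, "_", fixed = TRUE))
```

This only works as intended if the original values never contain an underscore themselves.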
We can use strsplit() together with a regular expression which uses a lookbehind assertion:
x <- c("AU2847", "AU1824")
strsplit(x, "(?<=[A-Z]{2})", perl = TRUE)
[[1]]
[1] "AU" "2847"
[[2]]
[1] "AU" "1824"
The lookbehind regular expression tells strsplit() to split each string after two capital letters. There is no need to artificially introduce a character to split on as in arg0naut91's answer.
Now, the OP has mentioned that the character vector to be split is a column of a larger data.frame. This requires some additional code to append the list output of strsplit() as new columns to the data.frame:
Let's assume we have this data.frame
DF <- data.frame(x, stringsAsFactors = FALSE)
Now, the new columns can be appended by:
DF[, c("col1", "col2")] <- do.call(rbind, strsplit(DF$x, "(?<=[A-Z]{2})", perl = TRUE))
DF
x col1 col2
1 AU2847 AU 2847
2 AU1824 AU 1824
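For comparison, a different base R route (not the lookbehind technique above) is to capture both parts with regexec() and regmatches(). The sketch below assumes every value is exactly two capital letters followed by digits; the column names col1 and col2 mirror the ones used above.

```r
DF <- data.frame(x = c("AU2847", "AU1824"), stringsAsFactors = FALSE)

# Each list element is c(full match, group 1, group 2)
m <- regmatches(DF$x, regexec("^([A-Z]{2})([0-9]+)$", DF$x))

# Keep only the two captured groups and bind them row-wise into a matrix
parts <- do.call(rbind, lapply(m, `[`, 2:3))

DF$col1 <- parts[, 1]
DF$col2 <- as.integer(parts[, 2])  # optional: store the digits as integers
```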

Splitting values of cut() into additional columns?

I have used cut() on a column to bin the data. I have about 130 rows that are of the structure (93.7,94.1] all in one column. I would like to split the values into their own columns but am struggling to do this. Here is my code so far:
binned=cut(df$value, 136)
binned_df = data.frame(levels(binned))
Here are the first 10 rows of binned_df:
(42.9,43.4]
(43.4,43.8]
(43.8,44.2]
(44.2,44.6]
(44.6,44.9]
(44.9,45.3]
(45.3,45.7]
(45.7,46.1]
(46.1,46.5]
(46.5,46.9]
Are there any functions to do this? Any help would be much appreciated as I am quite new to R.
We can use read.csv to split the column into two after removing the ( and ] with gsub. read.csv uses , as the separator by default.
df1 <- read.csv(text = gsub("\\(|\\]", "", binned_df[[1]]), header = FALSE)
data
binned_df <- structure(list(col = c("(42.9,43.4]", "(43.4,43.8]", "(43.8,44.2]",
"(44.2,44.6]", "(44.6,44.9]", "(44.9,45.3]", "(45.3,45.7]", "(45.7,46.1]",
"(46.1,46.5]", "(46.5,46.9]")), class = "data.frame", row.names = c(NA,
-10L))
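A self-contained sketch of the read.csv approach, with column names added, might look like this. The names lower and upper are my own choice, not something cut() provides.

```r
bins <- c("(42.9,43.4]", "(43.4,43.8]")

# Strip the opening "(" and the closing "]" so only "lower,upper" remains
stripped <- gsub("\\(|\\]", "", bins)

# Parse the remaining text as comma-separated values; columns come back numeric
limits <- read.csv(text = stripped, header = FALSE,
                   col.names = c("lower", "upper"))
```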
x <- c("(42.9,43.4]", "(43.4,43.8]", "(43.8,44.2]")
limits <- do.call(rbind, #combine result in matrix
strsplit( #split by ,
substring(x, 2, nchar(x) - 1), #remove first and last char
",", fixed = TRUE))
mode(limits) <- "numeric" #change to numeric
limits
# [,1] [,2]
#[1,] 42.9 43.4
#[2,] 43.4 43.8
#[3,] 43.8 44.2

Convert named vector to data frame using attribute values

I have a vector of characters. Each element of the vector has a name attribute which represents the row index of a data frame and the column index of a data frame, separated by a period. Here's a toy data set:
# Create vector of characters
a <- c("foo","bar","dog","cat")
# Assign attributes. The data frame is 2x2:
attr(a, "names") <- c("1.1", "1.2", "2.1", "2.2")
I am trying to use the attribute names to convert the vector into a data frame, where each element in the data frame is the value in the vector and the element's row is the number before the period in the attribute name and the element's column is the number after the decimal in the attribute name. The toy example's output should look like:
data.frame(var1 = c("foo","dog"), var2 = c("bar", "cat"))
My actual vector is quite large so I am looking to do this efficiently.
You can use indexing by row/column value to do this efficiently:
row.nums <- as.numeric(sapply(strsplit(names(a), "\\."), "[", 1))
col.nums <- as.numeric(sapply(strsplit(names(a), "\\."), "[", 2))
mat <- matrix(NA, max(row.nums), max(col.nums))
mat[cbind(row.nums, col.nums)] <- a
mat
# [,1] [,2]
# [1,] "foo" "bar"
# [2,] "dog" "cat"
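The key step is indexing a matrix with a two-column matrix of (row, column) positions, which assigns all cells in one vectorised operation. A tiny sketch with made-up positions and values:

```r
m <- matrix(NA_character_, nrow = 2, ncol = 2)

# Each row of idx is one (row, column) position
idx <- cbind(c(1, 2), c(2, 1))

# Assigns "x" to m[1, 2] and "y" to m[2, 1]
m[idx] <- c("x", "y")

# If a data frame is needed, the matrix can be converted afterwards
m_df <- as.data.frame(m, stringsAsFactors = FALSE)
```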
Split a on the suffix values and coerce that to a data frame. Omit stringsAsFactors=FALSE if you prefer factor columns, and omit unname if rownames on the result are acceptable.
Code--
as.data.frame(split(unname(a), sub(".*[.]", "", names(a))), stringsAsFactors = FALSE)
giving:
X1 X2
1 foo bar
2 dog cat
I would probably use regex to extract row and column positions, as follows.
my.rows <- as.integer(gsub("\\..*$", "", names(a)))
my.cols <- as.integer(gsub("^.*\\.", "", names(a)))
new.data <- data.frame(matrix(NA, nrow = max(my.rows), ncol = max(my.cols)))
for (i in seq_along(a)) {
new.data[my.rows[i], my.cols[i]] <- a[i]
}
new.data
We can use dplyr and tidyr. b2 is the final output.
library(dplyr)
library(tidyr)
b <- tibble(Name = names(a), Value = a)
b2 <- b %>%
separate(Name, into = c("Group", "Var")) %>%
spread(Var, Value) %>%
select(-Group)

Find a vector of strings in R

I have a vector of strings like:
vector=c("a","hb","cd")
I also have a matrix with a column in which each element is a list of strings separated by "|", like:
1 "ab|hb"
2 "ab|hbc|cd"
For each string in vector, I want to find the row of the matrix in which it appears as a complete element.
For the above vector, the result is:
NA, 1, 2
You can use strsplit for splitting strings:
x <- strsplit("ab|hbc|cd", split="|", fixed=T)
and then check if values of vector appear in the data, e.g.
sapply(vector, function(x) x %in% strsplit("a|ab|cd|efg|bh",
split="|", fixed=T)[[1]])
Warning: strsplit returns a list, so in the example above I extract only the first element of the list with [[1]]; you can handle it another way if you prefer.
EDIT: to answer your question about data stored as a vector:
data <- c("ab|cd|ef", "aaa|b", "ab", "wf", "fg|hb|a", "cd|cd|df")
sapply(sapply(data, function(x) strsplit(x, split="|", fixed=T)[[1]]),
function(y) sapply(vector, function(z) z %in% y))
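The nested sapply returns TRUE/FALSE values rather than row numbers. One possible way to get from there to the row indices asked for in the question is sketched below; it assumes the first matching row is wanted when a string occurs in several rows, and rows is a stand-in for the matrix column.

```r
vector <- c("a", "hb", "cd")
rows <- c("ab|hb", "ab|hbc|cd")  # stand-in for the matrix column

# Split every row into its complete elements once
tokens <- strsplit(rows, "|", fixed = TRUE)

# For each string, find the rows whose elements contain it exactly
first_row <- vapply(vector, function(v) {
  hits <- which(vapply(tokens, function(tok) v %in% tok, logical(1)))
  if (length(hits)) hits[1] else NA_integer_
}, integer(1), USE.NAMES = FALSE)
```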
Here's an approach using regular expressions:
# Example data
vector <- c("a","hb","cd")
mat <- matrix(c("ab|hb", "ab|hbc|cd"), nrow = 2)
sapply(paste0("\\b", vector, "\\b"), function(x)
if(length(tmp <- grep(x, mat[ , 1]))) tmp else NA,
USE.NAMES = FALSE)
# [1] NA 1 2
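One caveat with the word-boundary pattern, shown here with a made-up input: \\b treats any non-word character as a boundary, not only "|". If the elements can contain characters such as "-", the regex may match part of an element, whereas an exact comparison after strsplit does not.

```r
# Hypothetical row whose first element is "x-hb", not "hb"
row <- "x-hb|cd"

# The regex sees a boundary between "-" and "h", so it reports a match
regex_hit <- grepl("\\bhb\\b", row)

# Exact comparison against the split elements does not
exact_hit <- "hb" %in% strsplit(row, "|", fixed = TRUE)[[1]]
```

For data made up purely of letters and digits, as in the question, the two approaches should agree.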

Resources