Convert string data into a data frame in R

I am new to R, any suggestions would be appreciated.
This is the data:
coordinates <- "(-79.43591570873059, 43.68015339477487), (-79.43491506339724, 43.68036886994886), (-79.43394727223847, 43.680578504490335), (-79.43388162422195, 43.68058996121469), (-79.43281544978878, 43.680808044458765), (-79.4326971769691, 43.68079658822322)"
I would like this to become:
Latitude Longitude
-79.43591570873059 43.68015339477487
-79.43491506339724 43.68036886994886
-79.43394727223847 43.680578504490335
-79.43388162422195 43.68058996121469
-79.43281544978878 43.680808044458765
-79.4326971769691 43.68079658822322

You can use scan with a little gsub:
matrix(scan(text = gsub("[()]", "", coordinates), sep = ","),
       ncol = 2, byrow = TRUE, dimnames = list(NULL, c("Lat", "Long")))
# Read 12 items
# Lat Long
# [1,] -79.43592 43.68015
# [2,] -79.43492 43.68037
# [3,] -79.43395 43.68058
# [4,] -79.43388 43.68059
# [5,] -79.43282 43.68081
# [6,] -79.43270 43.68080
The precision is still there--just truncated in the matrix display.
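To confirm that for yourself, print with more digits; a quick check using base R's print:
m <- matrix(scan(text = gsub("[()]", "", coordinates), sep = ","),
            ncol = 2, byrow = TRUE, dimnames = list(NULL, c("Lat", "Long")))
print(m, digits = 15)  # shows the full stored values instead of the 7-digit default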
Two clear advantages:
Fast.
Handles a multi-element "coordinates" vector (e.g. coordinates <- rep(coordinates, 10) as an input).
Here's another option:
library(data.table)
fread(gsub("[()]", "", gsub("), (", "\n", toString(coordinates), fixed = TRUE)), header = FALSE)
The toString(coordinates) is for cases when length(coordinates) > 1. You could also use fread(text = gsub(...), ...) and skip using toString. I'm not sure of the advantages or limitations of either approach.
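For reference, a minimal sketch of the text = variant with named columns (assuming data.table >= 1.11.0, where the text argument was added):
library(data.table)
dt <- fread(text = gsub("[()]", "", gsub("), (", "\n", toString(coordinates), fixed = TRUE)),
            header = FALSE, col.names = c("Lat", "Long"))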

We can use str_extract_all from stringr
library(stringr)
df <- data.frame(Latitude = str_extract_all(coordinates, "(?<=\\()-\\d+\\.\\d+")[[1]],
                 Longitude = str_extract_all(coordinates, "(?<=,\\s)\\d+\\.\\d+(?=\\))")[[1]])
df
# Latitude Longitude
#1 -79.43591570873059 43.68015339477487
#2 -79.43491506339724 43.68036886994886
#3 -79.43394727223847 43.680578504490335
#4 -79.43388162422195 43.68058996121469
#5 -79.43281544978878 43.680808044458765
#6 -79.4326971769691 43.68079658822322
The Latitude pattern captures the negative decimal number after an opening round bracket ((), whereas the Longitude pattern captures the number between a comma (,) and a closing round bracket ()).
Or, without lookahead and lookbehind, capture both numbers together using str_match_all:
df <- data.frame(str_match_all(coordinates,
                               "\\((-\\d+\\.\\d+),\\s(\\d+\\.\\d+)\\)")[[1]][, c(2, 3)])
To convert the columns to their respective types, you could use type.convert (recent versions of R warn unless as.is is specified):
df <- type.convert(df, as.is = TRUE)
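A quick check that the conversion worked:
sapply(df, class)
#  Latitude Longitude
# "numeric" "numeric"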

Here is a base R option:
coordinates <- "(-79.43591570873059, 43.68015339477487), (-79.43491506339724, 43.68036886994886), (-79.43394727223847, 43.680578504490335), (-79.43388162422195, 43.68058996121469), (-79.43281544978878, 43.680808044458765), (-79.4326971769691, 43.68079658822322)"
coordinates <- gsub("^\\(|\\)$", "", coordinates)
x <- strsplit(coordinates, "\\), \\(")[[1]]
df <- data.frame(lat = sub(",.*$", "", x), lng = sub("^.*, ", "", x),
                 stringsAsFactors = FALSE)
df
The strategy here is to first strip the leading and trailing parentheses, then split the string on \), \( to generate a character vector with one latitude/longitude pair per element. Finally, we take the text before the comma (lat) and after it (lng) to generate the data frame output.
lat lng
1 -79.43591570873059 43.68015339477487
2 -79.43491506339724 43.68036886994886
3 -79.43394727223847 43.680578504490335
4 -79.43388162422195 43.68058996121469
5 -79.43281544978878 43.680808044458765
6 -79.4326971769691 43.68079658822322
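Note that both columns are still character vectors here; a one-line sketch to make them numeric:
df[] <- lapply(df, as.numeric)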

Yet another base R version with a bit of regex, relying on the fact that replacing the punctuation with blank lines will mean they get skipped on import.
read.csv(text=gsub(")|(, |^)\\(", "\n", coordinates), col.names=c("lat","long"), header=FALSE)
# lat long
#1 -79.43592 43.68015
#2 -79.43492 43.68037
#3 -79.43395 43.68058
#4 -79.43388 43.68059
#5 -79.43282 43.68081
#6 -79.43270 43.68080
Advantages:
Deals with vector input as well, like the scan answer above.
Converts to the correct numeric types in the output.
Disadvantages:
Not super fast.
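If speed matters for your data, it is worth measuring rather than guessing; a benchmark sketch (assuming the microbenchmark package is installed, with the input repeated to make timings meaningful):
library(microbenchmark)
big <- rep(coordinates, 1000)
microbenchmark(
  scan     = matrix(scan(text = gsub("[()]", "", big), sep = ",", quiet = TRUE),
                    ncol = 2, byrow = TRUE),
  read.csv = read.csv(text = gsub(")|(, |^)\\(", "\n", big), header = FALSE),
  times = 10
)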

We can use rm_round from qdapRegex
library(qdapRegex)
read.csv(text = rm_round(coordinates, extract = TRUE)[[1]], header = FALSE,
         col.names = c('lat', 'lng'))
# lat lng
#1 -79.43592 43.68015
#2 -79.43492 43.68037
#3 -79.43395 43.68058
#4 -79.43388 43.68059
#5 -79.43282 43.68081
#6 -79.43270 43.68080
Or in combination with the tidyverse:
library(tidyr)
library(dplyr)
rm_round(coordinates, extract = TRUE)[[1]] %>%
  tibble(col1 = .) %>%
  separate(col1, into = c('lat', 'lng'), sep = ",\\s*", convert = TRUE)
# A tibble: 6 x 2
# lat lng
# <dbl> <dbl>
#1 -79.4 43.7
#2 -79.4 43.7
#3 -79.4 43.7
#4 -79.4 43.7
#5 -79.4 43.7
#6 -79.4 43.7
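As with the matrix display earlier, the tibble printout truncates the values for display only; the full precision is retained. To show more significant digits, set the option used by pillar, the package that formats tibble columns:
options(pillar.sigfig = 15)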

Related

Read file with decimals as comma not working in R (coordinates)?

I have received a field with XY coordinates that use a comma as the decimal separator, but I cannot correctly open this file in R.
Here is what my data should look like - note the comma as decimal indicator in the coordinates:
"OBJECTID";"x_coord";"y_coord"
"1";"664936,3059";"5582773,2319" # comma as separator
"2";"604996,5803";"5471445,4964"
"3";"772846,82";"5353980,45"
"4";"552181,8639";"5535271,7626"
"5";"604022,9011";"5470134,0649"
But specifying dec = ',' in read.delim just reads them as normal values:
xy <- read.delim(paste(path, "test_decimals2.txt", sep = '/'),
                 sep = '\t', dec = ",", skip = 0)
The decimal commas are now missing from the coordinates:
OBJECTID x_coord y_coord
1 1 6649363059 55827732319 # not a comma separator anymore
2 2 6049965803 54714454964
3 3 77284682 535398045
4 4 5521818639 55352717626
5 5 6040229011 54701340649
I have tried to convert the data to txt, etc. but still have the same problem. Does someone know how to make sure that dec = ',' will work correctly? Thank you!
(coordinates are in UTM; that's why they look a bit weird)
I copied the data into a txt file and this worked just fine for me:
read.csv2("test_decimals2.txt")
Out:
OBJECTID x_coord y_coord
1 1 664936.3 5582773
2 2 604996.6 5471445
3 3 772846.8 5353980
4 4 552181.9 5535272
5 5 604022.9 5470134
Alternatively, read the columns as character and substitute the commas yourself (this variant needs dplyr for mutate and select):
library(dplyr)
a <- read.csv2("test_decimals2.txt", dec = ",", as.is = TRUE, header = FALSE, skip = 1)
a |> mutate(X = sub(",", ".", V2), Y = sub(",", ".", V3)) |>
  select(V1, X, Y)
I think read.csv2 is the answer for European-style CSV data:
xy <- read.csv2(file = paste(path, "test_decimals2.txt", sep = '/'))
Your approach might also be correct: the data is read correctly, it just does not display enough digits when printed.
Try printing your data with:
> print.data.frame(xy, digits = 10)
OBJECTID x_coord y_coord
1 1 664936.3059 5582773.232
2 2 604996.5803 5471445.496
3 3 772846.8200 5353980.450
4 4 552181.8639 5535271.763
5 5 604022.9011 5470134.065
This prints the data frame with up to ten significant digits.
Adding to the previous answers: if you want to use the read.delim() function, you can simply type:
xy <- read.delim(file_path, sep = ';', skip = 0, dec = ",")
OR
xy <- read.delim(file_path, sep = ';', skip = 0, colClasses = "character")
xy$x_coord <- as.numeric(gsub(",", ".", xy$x_coord))
xy$y_coord <- as.numeric(gsub(",", ".", xy$y_coord))
Then, you may change the default display to print more digits by using:
options(digits = 12)
Then, printing the data frame leads to:
OBJECTID x_coord y_coord
1 1 664936.3059 5582773.2319
2 2 604996.5803 5471445.4964
3 3 772846.8200 5353980.4500
4 4 552181.8639 5535271.7626
5 5 604022.9011 5470134.0649
One possible approach:
library(readr)
testcommas <- read_delim(path, "\t", col_types = cols(.default = "c"))
testcommas
# A tibble: 4 x 3
item coord_y coord_x
<chr> <chr> <chr>
1 1 156,158 543,697
2 2 324,678 169,385
3 3 097,325 325,734
4 4 400,211 158,687
With this, all columns will be of type "character". Then you may change them to numeric if/as needed. One way:
testcommas <- data.frame(sapply(testcommas, function(x) as.numeric(sub(",", ".", x, fixed = TRUE))))
testcommas
item coord_y coord_x
1 1 156.158 543.697
2 2 324.678 169.385
3 3 97.325 325.734
4 4 400.211 158.687
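If you are already using the tidyverse, a sketch of the same conversion with dplyr's across (assuming dplyr >= 1.0.0):
library(dplyr)
testcommas <- testcommas %>%
  mutate(across(everything(), ~ as.numeric(sub(",", ".", .x, fixed = TRUE))))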

How do you remove multiple sets of quotes from columns in R?

I am trying to clean some data in R, but I am having trouble working through the regex. I tried using the noquote function in R, but it didn't seem to help:
data %>% head()
X..Latitude.. X..Longitude..
1 ""52","3726380"" ""4","8941060""
2 ""52","4103320"" ""4","7490690""
3 ""52","3828340"" ""4","9204560""
4 ""52","4362550"" ""4","8167080""
5 ""52","3615820"" ""4","8854790""
6 ""52","3702150"" ""4","8951670""
data %>% noquote()
1 ""52","3726380"" ""4","8941060""
2 ""52","4103320"" ""4","7490690""
3 ""52","3828340"" ""4","9204560""
4 ""52","4362550"" ""4","8167080""
5 ""52","3615820"" ""4","8854790""
6 ""52","3702150"" ""4","8951670""
Reproducible data
structure(list(X..Latitude.. = c("\"\"52\",\"3726380\"\"", "\"\"52\",\"4103320\"\"", "\"\"52\",\"3828340\"\"", "\"\"52\",\"4362550\"\"", "\"\"52\",\"3615820\"\"", "\"\"52\",\"3702150\"\""), X..Longitude.. = c("\"\"4\",\"8941060\"\"", "\"\"4\",\"7490690\"\"", "\"\"4\",\"9204560\"\"", "\"\"4\",\"8167080\"\"", "\"\"4\",\"8854790\"\"", "\"\"4\",\"8951670\"\"")), row.names = c(NA, 6L), class = "data.frame")
Looks like the data was read incorrectly.
A way to correct this after reading is to remove all the quotes and replace "," with "." to mark the decimals. We can also clean up the names of the columns.
data[] <- lapply(data, function(x) gsub('"', '', sub(',', '.', x)))
names(data) <- gsub('[X.]', '', names(data))
data
# Latitude Longitude
#1 52.3726380 4.8941060
#2 52.4103320 4.7490690
#3 52.3828340 4.9204560
#4 52.4362550 4.8167080
#5 52.3615820 4.8854790
#6 52.3702150 4.8951670
In base R, you could just re-read your data:
read.table(text = do.call(paste, data), sep = " ", dec = ",",
           col.names = c("Latitude", "Longitude"))
Latitude Longitude
1 52.37264 4.894106
2 52.41033 4.749069
3 52.38283 4.920456
4 52.43626 4.816708
5 52.36158 4.885479
6 52.37022 4.895167

How do I turn a redis string value that is a numpy array into a dataframe quickly?

I currently have a Python program which does some calculations on time series data and sends data points to a redis cache. Each data point is a numpy array which looks like this:
"[ 1.18103230e+07 7.89070000e+04 -1.88109969e-01 -2.17373938e-01\n 1.00433488e+01 -1.39566174e-03 -1.95357823e-03 8.36936470e-02\n -1.26680427e+00 -1.85034338e+00 2.00000000e+00]"
Then, I want R to call the cache and convert this string into a data frame or list of some sort. Currently I have this:
len <- as.numeric(as.character(redisLLen(list)))
v <- redisLRange(list, 0, len)
counter <- 1
for (item in v) {
  item <- strsplit(gsub("(^\\[|\\]$)", "", v), ",")[[counter]]
  item <- strsplit(item, " +")
  df <- rbind(df, item)
  counter <- counter + 1
}
This works fine and there are no issues, but the R code has to work in real time and this is actually a very slow method. Are there any faster ways to turn this redis string value into an R data frame? Any help would be appreciated.
UPDATE:
In my Python code I removed the square brackets, so that there would be less to remove in gsub() in R. Moreover, a good chunk of time was spent doing rbind(), so instead of appending to a data frame, I made an initial one and inserted elements into it. This is now the data point:
"1.18103230e+07 7.89070000e+04 -1.88109969e-01 -2.17373938e-01\n 1.00433488e+01 -1.39566174e-03 -1.95357823e-03 8.36936470e-02\n -1.26680427e+00 -1.85034338e+00 2.00000000e+00",
and this is the updated code I now use:
len <- as.numeric(redisLLen(list))
v <- redisLRange(list, 0, len-1)
df <- data.frame(matrix(NA, ncol = 8, nrow=len))
counter <- 1
for (item in v) {
  item <- unlist(strsplit(item, " +"))
  df[counter, 1:ncol(df)] <- item
  counter <- counter + 1
}
If it is a string, then we can use read.table:
read.table(text = gsub("[][]", "", v1), header = FALSE, fill = TRUE)
# V1 V2 V3 V4
#1 1.181032e+07 7.890700e+04 -0.188109969 -0.21737394
#2 1.004335e+01 -1.395662e-03 -0.001953578 0.08369365
#3 -1.266804e+00 -1.850343e+00 2.000000000 NA
Or, faster, fread:
library(data.table)
fread(text = gsub("[][]", "", v1))
If we need it as a single column, read with scan and convert to a data.frame:
v2 <- scan(text = gsub("[][\n]", "", v1), what = numeric(), quiet = TRUE)
data.frame(col1 = v2)
# col1
#1 1.181032e+07
#2 7.890700e+04
#3 -1.881100e-01
#4 -2.173739e-01
#5 1.004335e+01
#6 -1.395662e-03
#7 -1.953578e-03
#8 8.369365e-02
#9 -1.266804e+00
#10 -1.850343e+00
#11 2.000000e+00
Or with fread
fread(text = gsub(' ', '\n', gsub("[][]|\n", "", v1)))
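To process the whole list from redis without growing a data frame in a loop, a vectorized sketch (assuming v is the character vector returned by redisLRange, each element holds 11 values as in the example, and the brackets are still present; drop the gsub if they were already stripped):
vals <- scan(text = gsub("[][]", "", unlist(v)), what = numeric(), quiet = TRUE)
df <- as.data.frame(matrix(vals, ncol = 11, byrow = TRUE))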
It is not clear how the object was created. We could convert Python objects to R using reticulate:
library(reticulate)
np <- import("numpy", convert=FALSE)
np1 <- np$array(c(1.18103230e+07, 7.89070000e+04, -1.88109969e-01, -2.17373938e-01,
                  1.00433488e+01, -1.39566174e-03, -1.95357823e-03, 8.36936470e-02,
                  -1.26680427e+00, -1.85034338e+00, 2.00000000e+00))
data.frame(col1 = py_to_r(np1))
# col1
#1 1.181032e+07
#2 7.890700e+04
#3 -1.881100e-01
#4 -2.173739e-01
#5 1.004335e+01
#6 -1.395662e-03
#7 -1.953578e-03
#8 8.369365e-02
#9 -1.266804e+00
#10 -1.850343e+00
#11 2.000000e+00
data
v1 <- "[ 1.18103230e+07 7.89070000e+04 -1.88109969e-01 -2.17373938e-01\n 1.00433488e+01 -1.39566174e-03 -1.95357823e-03 8.36936470e-02\n -1.26680427e+00 -1.85034338e+00 2.00000000e+00]"

R: How to split a column of strings into multiple columns using a format code/string?

I'm working with Census (CTPP) data, and the GEOID field is a long string that contains lots of geographic information. The format of this string changes for various Census tables, but they provide a code lookup. Here are a sample GEOID and format 'code'. (The parts I can already parse have been removed. This is the part of the GEOID I can't parse.)
geoid <- "0202000000126"
format <- "ssccczzzzzzzz"
This means that the first two characters ("02") signify the state (Alaska), the next three ("020") are the county, and the remaining characters are the zone.
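For a single known format, plain substr calls illustrate the encoding:
substr(geoid, 1, 2)   # state:  "02"
substr(geoid, 3, 5)   # county: "020"
substr(geoid, 6, 13)  # zone:   "00000126"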
I have a table of these geoid/format pairs, and the format can be different for each row.
s: state
c: county
p: place
z: zone
(others not used in this simple example)
df <- data.frame(
  geoid = c(
    "0224230",
    "0202000000126"
  ),
  format = c(
    "ssppppp",
    "ssccczzzzzzzz"
  )
)
# A tibble: 2 x 2
geoid format
<chr> <chr>
1 0224230 ssppppp
2 0202000000126 ssccczzzzzzzz
What I'd like to do is break up the geoid column into columns for each geography like so:
# A tibble: 2 x 6
geoid format s p c z
<chr> <chr> <chr> <chr> <chr> <chr>
1 0224230 ssppppp 02 24230 NA NA
2 0202000000126 ssccczzzzzzzz 02 NA 020 00000126
I've looked at several approaches. extract() from tidyr looked promising. I'm also pretty sure I'll need a custom function that I mapply(?)/map over my data frame.
A base alternative:
geo_codes <- c("s", "c", "p", "z")
# get starting position and lengths of consecutive characters in 'format'
g <- gregexpr("(.)\\1+", df$format)
# use the result above to extract corresponding substrings from 'geoid'
geo <- regmatches(df$geoid, g)
# select first element in each run of 'format' and split
# used to name substrings from above
fmt <- strsplit(gsub("(.)\\1+", "\\1", df$format), "")
# for each element in 'geo' and 'fmt',
# 1. create a named vector
# 2. index the vector with 'geo_codes'
# 3. set names of the full length vector
t(mapply(function(geo, fmt) {
  setNames(setNames(geo, fmt)[geo_codes], geo_codes)
}, geo, fmt))
# s c p z
# [1,] "02" NA "24230" NA
# [2,] "02" "020" NA "00000126"
Another alternative,
geo <- strsplit(df$geoid, "")
fmt <- strsplit(df$format, "")
t(mapply(function(geo, fmt) unlist(lapply(split(geo, factor(fmt, levels = geo_codes)),
  function(x) if (length(x)) paste(x, collapse = "") else NA)), geo, fmt))
My first alternative is about 2 times faster than the second, benchmarked on 2e5 rows.
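To attach either result back to the original data frame, a small sketch (assuming you keep the first alternative's matrix in res, a name introduced here for illustration):
res <- t(mapply(function(geo, fmt) {
  setNames(setNames(geo, fmt)[geo_codes], geo_codes)
}, geo, fmt))
cbind(df, as.data.frame(res, stringsAsFactors = FALSE))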
As is so often the case, writing up the question and the minimum example helped me simplify the problem and identify a solution. I'm sure there is a fancier solution out there, but this is what I came up with, and it's easy(ish) to get your head around.
While the formats vary, there are a limited number of unique characters. In the toy example in this problem, only s, c, p, z. So here's what I did:
First, I created a function that takes a single format string, a single geoid string, and a single subgeo character/code. The function determines which character positions in format match subgeo and then returns those positions from geoid.
extract_sub_geo <- function(format, geoid, subgeo) {
  geoid_v <- unlist(strsplit(geoid, ""))
  format_v <- unlist(strsplit(format, ""))
  positions <- which(format_v == subgeo)
  result <- paste(geoid_v[positions], collapse = "")
  return(result)
}
extract_sub_geo("ssccczzzzzzzz", "0202000000126", "s")
[1] "02"
I then looped over each unique code and used pmap_chr() to apply the function across my entire data frame.
geo_codes <- c("s", "c", "p", "z")
for (code in geo_codes) {
  df <- df %>%
    mutate(
      !!code := pmap_chr(list(format, geoid, !!code), extract_sub_geo)
    )
}
# A tibble: 2 x 6
  geoid         format        s     c     p     z
  <chr>         <chr>         <chr> <chr> <chr> <chr>
1 0224230       ssppppp       02    ""    24230 ""
2 0202000000126 ssccczzzzzzzz 02    020   ""    00000126
Probably cleaner to do the loop in base R instead of dplyr.
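A sketch of that base R loop, reusing the same helper (note it returns "" rather than NA for codes absent from a format):
for (code in geo_codes) {
  df[[code]] <- mapply(extract_sub_geo, df$format, df$geoid, code, USE.NAMES = FALSE)
}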
A tidyverse solution:
library(tidyverse)
create_new_code <- function(id, format, char) {
  format %>%
    str_locate_all(paste0(char, "*", char)) %>%
    unlist() %>%
    {substr(id, .[1], .[2])}
}
create_new_codes <- function(id, format) {
  c("s", "p", "c", "z") %>%
    set_names() %>%
    map(create_new_code, id = id, format = format)
}
bind_cols(df,
          with(df, map2_df(geoid, format, create_new_codes)))
# geoid format s p c z
#1 0224230 ssppppp 02 24230 <NA> <NA>
#2 0202000000126 ssccczzzzzzzz 02 <NA> 020 00000126

How can I replace all the spaces between words with points in a data frame in R?

I have a data frame, like this:
my.tree <- data.frame(Tree = c("Acer campestre", "Abies alba", "Pyrus communis",
                               "Robinia pseudoacacia", "Tilia cordata"),
                      Freq = c(23, 65, 47, 69, 65))
I want to replace all the spaces between words with points at once: create a new data frame (or modify this one) where there are points between the words of each tree's name, e.g. Acer.campestre, Abies.alba, Pyrus.communis, etc.
Is it possible to replace them all at once, or how can I make this change more easily?
You can do:
> library(dplyr); mutate(my.tree, Tree = gsub(" ", ".", Tree))
# Tree Freq
#1 Acer.campestre 23
#2 Abies.alba 65
#3 Pyrus.communis 47
#4 Robinia.pseudoacacia 69
#5 Tilia.cordata 65
It might be safer (and more conventional) to use gsub, but you could also use make.names:
make.names(my.tree$Tree)
# [1] "Acer.campestre" "Abies.alba" "Pyrus.communis"
# [4] "Robinia.pseudoacacia" "Tilia.cordata"
Or even chartr:
chartr(" ", ".", my.tree$Tree)
# [1] "Acer.campestre" "Abies.alba" "Pyrus.communis"
# [4] "Robinia.pseudoacacia" "Tilia.cordata"
You can do:
my.tree$Tree <- gsub(pattern = " ", replacement = ".", x = my.tree$Tree)
> my.tree
# Tree Freq
#1 Acer.campestre 23
#2 Abies.alba 65
#3 Pyrus.communis 47
#4 Robinia.pseudoacacia 69
#5 Tilia.cordata 65
