Parsing in a R Data Frame by columns, not by sep - r

I have Two-Line Element (TLE) data (https://celestrak.org/NORAD/elements/) that I have flattened to a single line per record, resulting in 1,000s of rows of 160 numbers and characters. Unlike a CSV, there are no separators. Using R, how do I parse the data into the correct column widths? Here is an example of the data, and some of the first columns.
1 00011U 59001A 18243.16403752 .00000112
123456789012345678901234567890
... col# content
1 01–01 Line number, example - 1
2 03–07 Satellite number, example - 25544
3 08–08 Classification (U=Unclassified), example - U
4 10–11 Intl Designator (Last two digits of launch year), example - 98
5 12–14 Intl Designator (Launch number - year), example - 067
6 15–17 Intl Designator (piece of the launch), example - A
thank you so much in advance

You can parse these kinds of "fixed width format" files in R using read.fwf(). You have to specify the width of each column. I'm having a little trouble matching your example data to the column descriptions you provided, but this mostly works:
read.fwf(
  textConnection("1 00011U 59001A 18243.16403752 .00000112"),
  widths = c(2, 5, 2, 2, 3, 4),
  # Just reading everything as a string for the moment
  colClasses = "character"
)
Output:
V1 V2 V3 V4 V5 V6
1 1 00011 U 59 001 A 18
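To follow the column table from the question more closely, one possible refinement is to pass negative widths, which read.fwf() treats as columns to skip, and to name the fields with col.names. The following is only a sketch under assumptions: the field names are invented for illustration, the widths are derived from the question's table (columns 1, 3-7, 8, 10-11, 12-14, 15-17), and the sample line is a hypothetical 17-character string assembled from the example values in that table, not real data.

```r
# Hypothetical sample: the first 17 characters of a record, built from the
# example values in the question's table (25544, U, 98, 067, A)
tle_line <- "1 25544U 98067A  "

tle <- read.fwf(
  textConnection(tle_line),
  # Negative widths skip the blank columns 2 and 9
  widths = c(1, -1, 5, 1, -1, 2, 3, 3),
  # Illustrative names, not part of any official specification
  col.names = c("line_no", "sat_no", "class", "launch_yr", "launch_no", "piece"),
  # Keep everything as text so that leading zeros such as "067" survive
  colClasses = "character",
  # Passed on to read.table(); trims the padding around short fields
  strip.white = TRUE
)
print(tle)
```

The remaining fields of the record would need further entries in widths and col.names, following the same pattern.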

Related

Extract portion of nested tibbles?

I've got a pile of nested tibbles that are from the tidyrss package. The data looks like this:
What I'm trying to do is take the four common items from each tibble and tidy them, so that the output looks like this:
item_title            item_link  item_description  item_pub_date
title from article 1  some url   longer text       posix date
title from article 2  some url   longer text       posix date
title from article 3  some url   longer text       posix date
title from article 4  some url   longer text       posix date
Thus far I've tried unlist() and deframe() and both of those just make a general mess of things - and an added twist is not all the list items are tibbles. Some are functions, and I want to ignore those. What's the best tidyverse approach to tackle this task?
map_dfr seems to do what you want! It loops over a list and applies a function to each one - in this case, the only "function" we want to apply is returning the data frame/tibble, but that also allows us to skip the functions:
library(purrr)  # provides map_dfr(), which relies on dplyr being installed

clean_feed_df <- list(
  data.frame(item_title = sample(letters, 3),
             item_link = sample(letters, 3),
             item_desc = sample(letters, 3),
             item_date = sample(letters, 3)),
  data.frame(item_title = as.character(sample(1:100, 5)),
             item_link = as.character(sample(1:100, 5)),
             item_desc = as.character(sample(1:100, 5)),
             item_date = as.character(sample(1:100, 5))),
  function(x) sum(x)  # a non-data-frame element that should be skipped
)
# Return each data frame unchanged; other elements yield NULL and are dropped
map_dfr(clean_feed_df, function(rssentry) {
  if (is(rssentry, "data.frame")) {
    return(rssentry)
  }
})
which returns
item_title item_link item_desc item_date
1 s s u i
2 x d o x
3 t x d h
4 40 51 21 91
5 4 25 37 34
6 5 44 18 71
7 65 70 83 90
8 32 85 76 89
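For comparison, the same "keep only the data frames, then stack them" idea can be sketched in base R with Filter() and rbind(). This is an illustrative alternative rather than what the answer above uses; the tiny feed list is made up, and rbind() assumes that every data frame has the same column names.

```r
# Made-up stand-in for the nested feed list: two data frames and one function
feed <- list(
  data.frame(item_title = c("a", "b"), item_link = c("u1", "u2")),
  data.frame(item_title = "c", item_link = "u3"),
  function(x) sum(x)  # non-data-frame element to be ignored
)

# Keep only the elements that are data frames (tibbles are data frames too)
tables_only <- Filter(is.data.frame, feed)

# Stack the remaining data frames row-wise
combined <- do.call(rbind, tables_only)
print(combined)
```

With purrr loaded, keep(feed, is.data.frame) should give the same filtered list as Filter() here.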

R-- Extract specific text and the subsequent text/numbers

I have a column with text strings in it, and I would like to extract not just a specific string, but also the string or number/s following this specified string. What is a good solution for this?
In the example below- I would like to create a column "extract" and str_extract the words "lot" and "unit" AND also extract the subsequent numbers following this text.
id  notes                                       extract
1   LOT 56, STRATA TITLE, 56/SP77100,           LOT 56
2   18/SP71866, COMMERCIAL, 17/SP71866, lot 18  lot 18
3   unit 9; 3R/PS732002                         unit 9
4   V1602 F63, Section 8 Block 68 Unit 3        Unit 3
Have looked at a lot of regex code but nothing helpful to find how to extract subsequent values from the specified target text string.
Tried this so far from another StackOverflow problem-
result <- table %>%
mutate(extract = str_extract(notes, "(?lot\\s)\\W\\s?\\d+\\")) %>%
mutate(lot = str_squish(lot))
You can use
str_extract(notes, "(?i)\\b(?:lot|unit)\\W*\\d+")
See the regex demo.
Details
(?i) - case insensitive flag
\b - a word boundary
(?:lot|unit) - either lot or unit
\W* - any zero or more non-word chars
\d+ - one or more digits.
R test:
library(dplyr)
library(stringr)
df <- data.frame(notes = c(
  "LOT 56, STRATA TITLE, 56/SP77100,",
  "18/SP71866, COMMERCIAL, 17/SP71866, lot 18",
  "unit 9; 3R/PS732002",
  "V1602 F63, Section 8 Block 68 Unit 3"
))
df %>%
  mutate(extract = str_extract(notes, "(?i)\\b(?:lot|unit)\\W*\\d+"))
notes extract
1 LOT 56, STRATA TITLE, 56/SP77100, LOT 56
2 18/SP71866, COMMERCIAL, 17/SP71866, lot 18 lot 18
3 unit 9; 3R/PS732002 unit 9
4 V1602 F63, Section 8 Block 68 Unit 3 Unit 3
A base R option using regmatches
transform(
df,
extract = unlist(regmatches(notes, gregexpr("\\b(lot|unit)\\s\\d+", notes, ignore.case = TRUE)))
)
gives
id notes extract
1 1 LOT 56, STRATA TITLE, 56/SP77100 LOT 56
2 2 18/SP71866, COMMERCIAL, 17/SP71866, lot 18 lot 18
3 3 unit 9; 3R/PS732002 unit 9
4 4 V1602 F63, Section 8 Block 68 Unit 3 Unit 3
Data
> dput(df)
structure(list(id = 1:4, notes = c("LOT 56, STRATA TITLE, 56/SP77100",
"18/SP71866, COMMERCIAL, 17/SP71866, lot 18", "unit 9; 3R/PS732002",
"V1602 F63, Section 8 Block 68 Unit 3")), class = "data.frame", row.names = c(NA,
-4L))
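One caveat with the base R version: unlist() over gregexpr() matches assumes exactly one match per row. If a row has no match, its entry is dropped and the lengths no longer line up with the data frame. A possible variant, sketched here with made-up notes, uses regexpr() (first match only) and fills unmatched rows with NA:

```r
# Made-up input: the second element deliberately contains no "lot"/"unit" match
notes <- c("LOT 56, STRATA TITLE", "COMMERCIAL ONLY", "Section 8 Block 68 Unit 3")

# regexpr() returns the position of the first match per element, or -1 for none
m <- regexpr("\\b(lot|unit)\\s\\d+", notes, ignore.case = TRUE, perl = TRUE)

# Start with NA everywhere, then fill in only the rows that matched
extract <- rep(NA_character_, length(notes))
extract[m != -1] <- regmatches(notes, m)
print(extract)
```

This keeps one result per input row, which is what a new data frame column needs.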

Separate character variable into two columns

I have scraped some data from a URL to analyse cycling results. Unfortunately, the name column consists of the rider's name and the name of the team in one field. I would like to separate these. Here's the code (the last part doesn't work):
#get url
stradebianchi_2020 <- read_html("https://www.procyclingstats.com/race/strade-bianche/2020/result")
#scrape table
results_2020 <- stradebianchi_2020%>%
html_nodes("td")%>%
html_text()
#transpose scraped data into dataframe
results_stradebianchi_2020 <- as.data.frame(t(matrix(results_2020, 8, byrow = F)))
#rename
names(results_stradebianchi_2020) <- c("rank", "#", "name", "age", "team", "UCI point", "PCS points", "time")
#split rider from team
separate(data = results_stradebianchi_2020, col = name, into = c("left", "right"), sep = " ")
I think the best option is to get the team variable name and use that name to remove it from the 'name' column.
All suggestions are welcome!
I think your request is wrongly formulated. You want to remove team from name.
That's how you should do it in my opinion:
results_stradebianchi_2020 %>%
mutate(name = stringr::str_remove(name, team))
Write this instead of your line with separate.
In this case separate is not an optimal solution for you because the separation character is not clearly defined.
Also, I would advise you to remove the initial blanks from name with stringr::str_trim(name)
You could do this in base R with gsub, replacing the pattern from the team column with "" (i.e. nothing) in the name column. We use apply() with MARGIN=1 to go through the data frame row by row. Finally, we use trimws to strip the surrounding whitespace, setting whitespace="[\\h\\v]" so that all horizontal and vertical whitespace characters are matched.
res <- transform(results_stradebianchi_2020,
name=trimws(apply(results_stradebianchi_2020, 1, function(x)
gsub(x["team"], "", x["name"])), whitespace="[\\h\\v]"))
head(res)
# rank X. name age team UCI.point PCS.points time
# 1 1 201 van Aert Wout 25 Team Jumbo-Visma 300 200 4:58:564:58:56
# 2 2 234 Formolo Davide 27 UAE-Team Emirates 250 150 0:300:30
# 3 3 87 Schachmann Maximilian 26 BORA - hansgrohe 215 120 0:320:32
# 4 4 111 Bettiol Alberto 26 EF Pro Cycling 175 100 1:311:31
# 5 5 44 Fuglsang Jakob 35 Astana Pro Team 120 90 2:552:55
# 6 6 7 Štybar Zdenek 34 Deceuninck - Quick Step 115 80 3:593:59
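Both answers above pass the team name as a regular expression, so a team name containing characters such as "." or "(" could match more or less than intended. A hedged variant is to match the team literally with fixed = TRUE, pairing each name with its team via mapply(). The two sample rows below are an assumption about what the scraped name field looks like (rider name immediately followed by the team name), not actual scraped output.

```r
# Assumed shape of the scraped data: name has the team name appended to it
results <- data.frame(
  name = c("  van Aert WoutTeam Jumbo-Visma", " Formolo DavideUAE-Team Emirates"),
  team = c("Team Jumbo-Visma", "UAE-Team Emirates"),
  stringsAsFactors = FALSE
)

# For each row, delete the literal team string from the name, then trim spaces
results$name <- trimws(mapply(
  function(pattern, text) sub(pattern, "", text, fixed = TRUE),
  results$team, results$name,
  USE.NAMES = FALSE
))
print(results)
```

With stringr, wrapping the pattern as stringr::fixed(team) inside str_remove() expresses the same literal matching.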

Fixing Column Issue When Importing Data in R

Currently having an issue importing a data set of tweets so that every observation is in one column
This is the data before import; it includes three cells for each tweet, and a blank space in between.
T 2009-06-11 00:00:03
U http://twitter.com/imdb
W No Post Title
T 2009-06-11 16:37:14
U http://twitter.com/ncruralhealth
W No Post Title
T 2009-06-11 16:56:23
U http://twitter.com/boydjones
W listening to "Big Lizard - The Dead Milkmen" ♫ http://blip.fm/~81kwz
library(tidyverse)
tweets1 <- read_csv("tweets.txt.gz", col_names = F,
skip_empty_rows = F)
This is the output:
Parsed with column specification:
cols(
X1 = col_character()
)
Warning message:
“71299 parsing failures.
row col expected actual file
35 -- 1 columns 2 columns 'tweets.txt.gz'
43 -- 1 columns 2 columns 'tweets.txt.gz'
59 -- 1 columns 2 columns 'tweets.txt.gz'
71 -- 1 columns 5 columns 'tweets.txt.gz'
107 -- 1 columns 3 columns 'tweets.txt.gz'
... ... ......... ......... ...............
See problems(...) for more details.
”
# A tibble: 1,220,233 x 1
X1
<chr>
1 "T\t2009-06-11 00:00:03"
2 "U\thttp://twitter.com/imdb"
3 "W\tNo Post Title"
4 NA
5 "T\t2009-06-11 16:37:14"
6 "U\thttp://twitter.com/ncruralhealth"
7 "W\tNo Post Title"
8 NA
9 "T\t2009-06-11 16:56:23"
10 "U\thttp://twitter.com/boydjones"
# … with 1,220,223 more rows
The only issue is the many parsing failures, where problems(tweets1) shows that R expected one column but got multiple. Any ideas on how to fix this? My output should have 1.4 million rows according to my professor, so I'm unsure whether this parsing issue is the key here. Any help is appreciated!
Maybe something like this will work for you.
data
data <- 'T 2009-06-11 00:00:03
U http://twitter.com/imdb
W No Post Title
T 2009-06-11 16:37:14
U http://twitter.com/ncruralhealth
W No Post Title
T 2009-06-11 16:56:23
U http://twitter.com/boydjones
W listening to "Big Lizard - The Dead Milkmen" ♫ http://blip.fm/~81kwz'
For a large file, fread() should be quick. The sep = NULL is saying basically just read in full lines. You will replace input = data with file = "tweets.txt.gz".
library(data.table)
read_rows <- fread(input = data, header = FALSE, sep = NULL, blank.lines.skip = TRUE)
processing
You could just stay with data.table, but I noticed you are in the tidyverse already.
library(dplyr)
library(stringr)
library(tidyr)
Basically I am grabbing the first character (T, U, W) and storing it into a variable called Column. I am adding another column called Content for the rest of the string, with white space trimmed on both ends. I also added an ID column so I know how to group the clusters of 3 rows.
Then you basically just pivot on the Column. I am not sure if you wanted this last step or not, so remove as needed.
read_rows %>%
  # One ID per cluster of 3 rows (T, U, W): 1, 1, 1, 2, 2, 2, ...
  mutate(ID = rep(seq_len(n() / 3), each = 3),
         Column = str_sub(V1, 1, 1),
         Content = str_trim(str_sub(V1, 2))) %>%
  select(-V1) %>%
  pivot_wider(names_from = Column, values_from = Content)
result
# A tibble: 3 x 4
ID T U W
<int> <chr> <chr> <chr>
1 1 2009-06-11 00:00:03 http://twitter.com/imdb No Post Title
2 2 2009-06-11 16:37:14 http://twitter.com/ncruralhealth No Post Title
3 3 2009-06-11 16:56:23 http://twitter.com/boydjones "listening to \"Big Lizard - The Dead Milkmen\" ♫ http://blip.fm/~81kwz"
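The grouping above relies on every record having exactly three lines. If that cannot be guaranteed, one possible alternative is to start a new record whenever a line begins with "T", using cumsum(). The base R sketch below illustrates the idea on a few lines copied from the question; it assumes that only timestamp lines start with "T" and that each key appears at most once per record.

```r
# A few lines from the question, including the blank separator line
lines <- c(
  "T\t2009-06-11 00:00:03", "U\thttp://twitter.com/imdb", "W\tNo Post Title", "",
  "T\t2009-06-11 16:37:14", "U\thttp://twitter.com/ncruralhealth", "W\tNo Post Title"
)

# Drop the blank separator lines
lines <- lines[nzchar(lines)]

# The first character is the field key (T, U or W); the rest is the content
key <- substr(lines, 1, 1)
long <- data.frame(
  id = cumsum(key == "T"),  # a new record starts at every "T" line
  key = key,
  value = trimws(substring(lines, 2)),
  stringsAsFactors = FALSE
)

# Pivot to one row per record, then strip the "value." prefix reshape() adds
wide <- reshape(long, idvar = "id", timevar = "key", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))
print(wide)
```

The same id column could be fed to pivot_wider() in place of the rep()-based ID if you prefer to stay in the tidyverse.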

How do I add a leading numeric identifier (not necessarily zero) to a character string in r

I apologize if this is a duplicate, I've searched through all of the "add leading zero" content I can find, and I'm struggling to find a solution I can work with. I have the following:
siteid<-c("1","11","111")
modifier<-c("44","22","11")
df<-data.frame(siteid,modifier)
and I want a modified siteid that is always six (6) characters long with zeroes to fill the gaps. The Site ID can vary in nchar from 1-3, the modifier is always a length of 2, and the number of zeroes can vary depending on the length of the site ID (so that 6 is always the final modified length).
I would like the following final output:
df
# siteid modifier mod.siteid
#1 1 44 440001
#2 11 22 220011
#3 111 11 110111
Thanks for any suggestions or direction. This could also be numeric, but it seems like character manipulation has more options...?
The vocabulary here is "left pad" and "paste". Here is one way using sprintf():
df$mod.siteid <- with(df, sprintf("%s%04d", modifier, as.integer(siteid)))
# Note:
# code simplified thanks to suggestion by Maurits.
Output:
siteid modifier mod.siteid
1 1 44 440001
2 11 22 220011
3 111 11 110111
Data:
df <- data.frame(
siteid = c("1", "11", "111"),
modifier = c("44", "22", "11"),
stringsAsFactors = FALSE
)
Extra: If you don't want to left pad with 0, then using the stringi package is one option: with(df, paste0(modifier, stringi::stri_pad_left(siteid, 4, "q")))
Another option is formatC(), pasting the modifier in front of the zero-padded site ID:
siteid<-c("1","11","111")
modifier<-c("44","22","11")
df<-data.frame(siteid,modifier, stringsAsFactors = FALSE)
df$mod.siteid = paste0( df$modifier,
formatC( as.numeric(df$siteid), width = 4, format = "d", flag="0") )
df
# siteid modifier mod.siteid
# 1 1 44 440001
# 2 11 22 220011
# 3 111 11 110111
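If the pad character is not zero and you would rather avoid an extra package, left padding can also be sketched in base R with strrep(). The helper name pad_left below is hypothetical, introduced only for illustration, and it assumes a single-character pad.

```r
# Hypothetical helper: left-pad each string in x to `width` characters with `pad`
pad_left <- function(x, width, pad = "0") {
  # Number of pad characters each element needs (never negative)
  n_missing <- pmax(width - nchar(x), 0)
  paste0(strrep(pad, n_missing), x)
}

siteid <- c("1", "11", "111")
modifier <- c("44", "22", "11")

# Same layout as the sprintf() answer, but padding with "q" instead of "0"
mod.siteid <- paste0(modifier, pad_left(siteid, 4, "q"))
print(mod.siteid)
```

Calling pad_left(siteid, 4) with the default pad gives the zero-padded site IDs used in the desired output.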
