I have a dataset that I have tried to give a sample of using the dput command below. The problem I'm running into is trying to separate out the data by delimiter.
> dput(head(team_data))
structure(list(X1 = 2:6,
names2 = c("Andre Callender Seton Hall Preparatory School (West Orange, NJ)", "Gosder Cherilus Somerville (Somerville, MA)", "Justin Bell Mount Vernon (Alexandria, VA)", "Tom Anevski Elder (Cincinnati, OH)", "Brad Mueller Mars Area (Mars, PA)"),
pos2 = c("RB 5-10 185", "OT 6-7 270", "TE 6-3 250", "OT 6-5 265", "CB 6-0 170"), rating2 = c("0.8667 194 18 8", "0.8667 262 20 1", "0.8333 306 14 7", "0.8333 377 25 13", "0.8333 496 36 16"),
status2 = c("Enrolled 6/30/2003", "Enrolled 6/30/2003", "Enrolled 6/30/2003", "Enrolled 6/30/2003", "Enrolled 6/30/2003"), team = c("Boston-College", "Boston-College", "Boston-College", "Boston-College", "Boston-College"), year = c(2003L, 2003L, 2003L, 2003L, 2003L)),
.Names = c("X1", "names2", "pos2", "rating2", "status2", "team", "year"), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
The following is the code I am trying to execute on the above dataset. The following two functions work fine and as expected as far as I can tell.
library(rvest)
library(stringr)
library(tidyr)
library(readxl)
df2<-separate(data=team_data,col=pos2,into= c("Position","Height","Weight"),sep=" ")
df3<-separate(data=df2,col=rating2,into= c("Rating","National","Position","State Rank"),sep=" ")
But then I have significant trouble trying to further separate out the columns of the dataframe. I have tried various ways (examples below) but all of the pieces of code below produce the same error, "Error: Data source must be a dictionary".
df4<-separate(data=df3,col=names2,into= c("Name","Geo"),sep="(")
df4<-separate(data=df3,col=names2,into= c("Name","Geo"),sep='\\(|\\)')
df4<-separate(data=df3,col=status2,into= c("Date_Enrollment","Enroll_Status"),sep=" ")
df4<-separate(data=df3,col=status2,into= c("Date_Enrollment","Enroll_Status"),sep=" ")
The ultimate goal would be to separate out the "names2" column at the "(" and the "," and remove the ")" so that I would end up with 3 columns of data. For the other column ("status2") the goal would be to separate out the "Enrolled" from the date of enrollment.
From what I have read the error I'm getting indicates that I am duplicating column names, but I can't figure out where that is happening.
You are using Position twice, once in df2 and once in df3. This works for me:
team_data %>%
separate(col=pos2, into= c("Position","Height","Weight"), sep=" ") %>%
separate(col=rating2,into= c("Rating","National","Position2","State Rank"),sep=" ")%>%
separate(col=names2,into= c("Name","Geo"),sep="\\(") %>%
separate(col=status2,into= c("Date_Enrollment","Enroll_Status"),sep=" ")
Related
I have a dataset where sometimes the unit of measure is not separated from the number by a space, and I would like to add it in. I have a list of units of measure that may be used in the dataset and I want to make sure that every time they appear there is a space.
My data is something like:
mydata <- c("black box 125CM", "10KG white chair", "bottle of water 1000ML")
And I would like:
result <- c("black box 125 CM", "10 KG white chair", "bottle of water 1000 ML")
The units of measure that might appear:
measure <- c("ML", "MG", "F", "CM", "CPR", "FL", "CPS", "KG")
So far I have tried (but it is not working):
for (i in 1:NROW(measure)) {
replacement <- paste0("\\s", measure[i])
result <- gsub("(?<=[[:digit:]])"measure[i], replacement, mydata, perl = TRUE)
}
If it were for one substitution I would be able to do it with:
result <- gsub("(?<=[[:digit:]])MG", " MG", mydata, perl = TRUE)
I just do not know how I am supposed to write measure[i] in the gsub function, I cannot find the right syntax.
Any suggestions? Thank you very much in advance.
Regex lookahead can do this.
gsub(paste0("(?<=[0-9])(", paste(measure, collapse = "|"), ")"), " \\1",
mydata, perl = TRUE)
# [1] "black box 125 CM" "10 KG white chair" "bottle of water 1000 ML"
mydata <- c("black box 125CM", "10KG white chair", "bottle of water 1000ML")
stringr::str_replace_all(mydata, "[:digit:]([ML|MG|F|C[M|PR|PS]|FL|KG])", " \\1")
Gives
[1] "black box 12 CM" "1 KG white chair" "bottle of water 100 ML"
Note the special handling of the three cases which all begin with C.
As an aside, if I were having to be this fussy about spaces, I'd also be minded to be fussy about getting the case of the SI units corrrect: "KG" is not kilogramme but Kelvin ⋅ 6.674×10−11 m3⋅kg−1⋅s−2, as near as I can figure it!
This is what I came up with and works for me.
mydata <- c("black box 125CM", "10KG white chair", "bottle of water 1000ML")
measure <- c("ML", "MG", "F", "CM", "CPR", "FL", "CPS", "KG")
measure <- paste(measure, collapse = "|")
result <- sub(paste0("([", measure, "])"), " \\1", mydata)
edit: This will also add spaces if there is already a space,r2evans solution would be more preferable.
If, as in the example, the measure always appears after the numbers, then this works:
sub("(\\d+)", "\\1 ", mydata)
[1] "black box 125 CM" "10 KG white chair" "bottle of water 1000 ML"
I have a large dataframe of published articles for which I would like to extract all articles relating to a few authors specified in a separate list. The authors in the dataframe are grouped together in one column separated by a ; . Not all authors need to match, I would like to extract any article which has one author matched to the list. An example is below.
Title<-c("A", "B", "C")
AU<-c("Mark; John; Paul", "Simone; Lily; Poppy", "Sarah; Luke")
df<-cbind(Title, AU)
authors<-as.character(c("Mark", "John", "Luke"))
df[sapply(strsplit((as.character(df$AU)), "; "), function(x) any(authors %in% x)),]
I would expect to return;
Title AU
A Mark; John
C Sarah; Luke
However with my large dataframe this command does not work to return all AU, it only returns rows which have a single AU not multiple ones.
Here is a dput from my larger dataframe of 5 rows
structure(list(AU = c("FOOKES PG;DEARMAN WR;FRANKLIN JA", "SIMS DG;DOWNHAM MAPS;MCQUILLIN J;GARDNER PS",
"TURNER BR", "BUTLER J;MARSH H;GOODARZI F", "OVERTON M"), TI = c("SOME ENGINEERING ASPECTS OF ROCK WEATHERING WITH FIELD EXAMPLES FROM DARTMOOR AND ELSEWHERE",
"RESPIRATORY SYNCYTIAL VIRUS INFECTION IN NORTH-EAST ENGLAND",
"TECTONIC AND CLIMATIC CONTROLS ON CONTINENTAL DEPOSITIONAL FACIES IN THE KAROO BASIN OF NORTHERN NATAL, SOUTH AFRICA",
"WORLD COALS: GENESIS OF THE WORLD'S MAJOR COALFIELDS IN RELATION TO PLATE TECTONICS",
"WEATHER AND AGRICULTURAL CHANGE IN ENGLAND, 1660-1739"), SO = c("QUARTERLY JOURNAL OF ENGINEERING GEOLOGY",
"BRITISH MEDICAL JOURNAL", "SEDIMENTARY GEOLOGY", "FUEL", "AGRICULTURAL HISTORY"
), JI = c("Q. J. ENG. GEOL.", "BRIT. MED. J.", "SEDIMENT. GEOL.",
"FUEL", "AGRICULTURAL HISTORY")
An option with str_extract
library(dplyr)
library(stringr)
df %>%
mutate(Names = str_extract_all(Names, str_c(authors, collapse="|"))) %>%
filter(lengths(Names) > 0)
# Title Names
#1 A Mark, John
#2 C Luke
data
df <- data.frame(Title, Names)
in Base-R you can access it like so
df[sapply(strsplit(as.character(df$Names, "; "), function(x) any(authors %in% x)),]
Title Names
1 A Mark; John; Paul
3 C Sarah; Luke
This can be accomplished by subsetting on those Names that match the pattern specified in the first argument to the function grepl:
df[grepl(paste0(authors, collapse = "|"), df[,2]),]
Title Names
[1,] "A" "Mark; John; Paul"
[2,] "C" "Sarah; Luke"
I have data from a PDF file that I am reading into R.
library(pdftools)
library(readr)
library(stringr)
library(dplyr)
results <- pdf_text("health_data.pdf") %>%
readr::read_lines()
When I read it in with this method, a character vector is returned. Multi-line information for a given column is spread out on different lines (and not all columns for each observation will have data.
A reproducible example is below:
ex_result <- c("03/11/2012 BES 3RD BES inc and corp no- no- sale -",
" group with sale no- sale",
" boxes",
"03/11/2012 KRS six and firefly 45 mg/dL 100 - 200",
" seven",
"03/11/2012 KRS core ladybuyg 55 mg/dL 42 - 87")
I am trying to use read_fwf with fwf_widths as I read that it can handle multi-line input if you give the widths for multi-line records.
ex_result_width <- read_fwf(ex_result, fwf_widths(
c(10, 24, 16, 7, 5, 15,100),
c("date", "name","description", "value", "unit","range","ab_flag")))
I determined the sizes by typing in the console nchar with the longest string that I saw for that column.
Using fwf_widths I can get the date column by defining in the width = argument with 10 bytes, but for the NAME column if I set it to say 24 bytes it returns back columns concatenated instead of rows split to account for multi-line which then cascades to the other columns now having the wrong data and the rest being dropped when space has run out.
Ultimately this is the desired output:
desired_output <-tibble(
date = c("03/11/2012","03/11/2012","03/11/2012"),
name = c("BES 3RD group with boxes", "KRS six and seven", "KRS core"),
description = c("BES inc and corp", "firefly", "ladybug"),
value = c("no-sale", "45", "55"),
unit = c("","mg/dL","mg/dL"),
range = c("no-sale no-sale", "100 - 200", "42 - 87"),
ab_flag = c("", "", ""))
I am trying to see:
How can I get fwf_widths to recognize multi-line text and missing columns?
Is there a better way to read in the pdf file to account for multi-line values and missing columns? (I was following this tutorial but it seems to have a more structured pdf file)
str_subset(ex_result,pattern = "\/\d{2}\/")
[1] "03/11/2012 BES 3RD BES inc and corp no- no- sale -"
[2] "03/11/2012 KRS six and firefly 45 mg/dL 100 - 200"
[3] "03/11/2012 KRS core ladybuyg 55 mg/dL 42 - 87"
After this expression
good.rows<-ifelse(nchar(ufo$DateOccurred)!=10 | nchar(ufo$DateReported)!=10,
FALSE, TRUE)
I expected to get vectors of Booleans but I got
length(good.rows)
[1] 0
This is logical(empty) as I can see in R studio. What can I do to solve this?
dput(head(ufo))
"structure(list(DateOccured = structure(c(9412, 9413, 9131, 9260,
9292, 9428), class = "Date"), DateReported = structure(c(9412,
9414, 9133, 9260, 9295, 9427), class = "Date"), Location = c(" Iowa City, IA",
" Milwaukee, WI", " Shelton, WA", " Columbia, MO", " Seattle, WA",
" Brunswick County, ND"), ShortDescription = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Duration = c(NA, "2 min.", NA, "2 min.", NA, "30 min."), LongDescription = c("Man repts. witnessing "flash, followed by a classic UFO, w/ a tailfin at back." Red color on top half of tailfin. Became triangular.",
"Man on Hwy 43 SW of Milwaukee sees large, bright blue light streak by his car, descend, turn, cross road ahead, strobe. Bizarre!",
"Telephoned Report:CA woman visiting daughter witness discs and triangular ships over Squaxin Island in Puget Sound. Dramatic. Written report, with illustrations, submitted to NUFORC.",
"Man repts. son's bizarre sighting of small humanoid creature in back yard. Reptd. in Acteon Journal, St. Louis UFO newsletter.",
"Anonymous caller repts. sighting 4 ufo's in NNE sky, 45 deg. above horizon. (No other facts reptd. No return tel. #.)",
"Sheriff's office calls to rept. that deputy, 20 mi. SSE of Wilmington, is looking at peculiar, bright white, strobing light."
)), row.names = c(NA, 6L), class = "data.frame")"
There are a couple of reasons why this could be happening:
You're dataset is empty, check this using the dim() method.
The columns are not of type Character check this using the class()
method.
If both of these are correct try running the nchar(...) statements
separately.
Below I've create an example that works correctly, where I've gone through the above mentioned steps. In future please provide a reproducible example as part of your question.
# Create sample data
ufo <- data.frame(DateOccurred=c("a","bb","ccc"),
DateReported=c("a","bb","ccc"),
stringsAsFactors = FALSE)
print(ufo)
# Check size of data (make sure data has rows and columns are of type Character)
dim(ufo)
class(ufo$DateOccurred)
class(ufo$DateReported)
# Check nchar statements (Should run without error/warnings)
nchar(ufo$DateOccurred)
nchar(ufo$DateReported)
# Actual
good.rows <- ifelse(nchar(ufo$DateOccurred)!=3 | nchar(ufo$DateReported)!=3,
FALSE, TRUE)
print(good.rows)
length(good.rows)
I have a dataframe which consists of one column. I would like to separate the text into seperate columns based on a vector of delimiters.
Input:
Mypath<-"Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely"
Mypath<-data.frame(Mypath)
names(Mypath)<- "PathReportWhole"
The intended output:
structure(list(PathReportWhole = structure(1L, .Label = "Hospital Number 233456 Patient Name: Jonny Begood\n DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely", class = "factor"),
HospitalNumber = " 233456 ", PatientName = " Jonny Begood",
DOB = " 13/01/77 ", GeneralPractitioner = NA_character_,
Dateofprocedure = NA_character_, ClinicalDetails = " Dyaphagia and reflux ",
Macroscopicdescription = " 3 pieces of oesophagus, all good biopsies\n ",
Histology = " These show chronic reflux and other bits n bobs\n ",
Diagnosis = " Acid reflux likely"), row.names = c(NA, -1L
), .Names = c("PathReportWhole", "HospitalNumber", "PatientName",
"DOB", "GeneralPractitioner", "Dateofprocedure", "ClinicalDetails",
"Macroscopicdescription", "Histology", "Diagnosis"), class = "data.frame")
I was keen to use the separate function from tidyr but can't quite figure it out so that it will separate according to a list of delimiters
The list would be:
mywords<-c("Hospital Number","Patient Name","DOB:","General Practitioner:","Date of Procedure:","Clinical Details:","Macroscopic description:","Histology:","Diagnosis:")
I then tried:
Mypath %>% separate(Mypath, mywords)
But I amd clearly mis-understanding the function which I guess can't take a list of delimiters
Error: `var` must evaluate to a single number or a column name, not a list
Is there a simple way of doing this using tidyr (or csplit with a list or any other way for that matter)
Maybe make sure that it's like a dcf file, and you can use read.dcf:
Notice that "mywords" is a little different from yours. I've added colons to "Hospital Number" and "Patient Name".
mywords<-c("Hospital Number:","Patient Name:","DOB:","General Practitioner:",
"Date of Procedure:","Clinical Details:","Macroscopic description:",
"Histology:","Diagnosis:")
Convert the relevant column to character, add a colon after "Hospital Number".
Mypath$PathReportWhole <- as.character(Mypath$PathReportWhole)
Mypath$PathReportWhole <- gsub("Hospital Number", "Hospital Number:", Mypath$PathReportWhole)
Make it such that each key: value pair is on its own line.
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", Mypath$PathReportWhole)
Use read.dcf to read it in:
out <- read.dcf(textConnection(temp))
Here's some sample data that makes it easier to see the resulting structure:
example <- c("var 1 abc var 2: some, text var 3: 112 var 4: value var 5: even more here",
"var 1 xyz var 2: more text here var 5: not all values are there")
example <- data.frame(report = example)
example
# report
# 1 var 1 abc var 2: some, text var 3: 112 var 4: value var 5: even more here
# 2 var 1 xyz var 2: more text here var 5: not all values are there
And, going through the same steps:
mywords <- c("var 1:", "var 2:", "var 3:", "var 4:", "var 5:")
temp <- as.character(example$report)
temp <- gsub("var 1", "var 1:", temp)
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", temp)
read.dcf(textConnection(temp))
# var 1 var 2 var 3 var 4 var 5
# [1,] "abc" "some, text" "112" "value" "even more here"
# [2,] "xyz" "more text here" NA NA "not all values are there"
read.dcf(textConnection(temp), fields = c("var 1", "var 3", "var 5"))
# var 1 var 3 var 5
# [1,] "abc" "112" "even more here"
# [2,] "xyz" NA "not all values are there"