I have two data frames:
df<-structure(list(`Active Contact*` = "Entries must be in a Yes or No format. Only active staff may be added to a protocol.",
`First Name*` = "Free text field. [255]", `Middle Name
` = "Free text field. [255]",
`Last Name*` = "Free text field. [255]", `Email**
` = "This field is required when the contact is a user or the contact has any of the Receives Broadcast Emails, Receives Notifications, or Receives Administrative System Notifications settings set to Yes.\r\nThis field must be mapped if Email is selected in the Unique Identifier field. Entries must be unique across all contacts (both active and inactive) and must be in a valid email format (abc#efg.zyx). [254]"), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"))
df2<-structure(list(ActiveContact = c("Yes", "Yes", "Yes", "Yes",
"Yes", "Yes", "Yes"), fname = c("practice", "practice", "practice",
"practice", "practice", "practice", "practice"), middlename = c(NA,
NA, NA, NA, NA, NA, NA), lname = c("PI", "research nurse", "research nurse",
"research nurse", "regulatory", "regulatory", "regulatory"),
email = c("ppi#lifespan.org", "prn#lifespan.org", "prn#lifespan.org",
"prn#lifespan.org", "preg#lifespan.org", "preg#lifespan.org",
"preg#lifespan.org")), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
I need to use the column names from df, and also the first row from df, as the column names and first row of df2 (replacing the column names of df2, and pushing the existing rows of df2 down by one row to fit).
My expected output would be:
I know the column names are terrible (weird symbols and spaces and things I hate), and also I know the first row that I need is full of all sorts of stuff I typically hate, but I need this for my output format.
Thank you!
You can row-bind them while renaming the columns of df2 in the same step:
rbind(df,setNames(df2,names(df)))
Output:
`Active Contact*` `First Name*` `Middle Name\n ~ `Last Name*` `Email**\n \n ~
<chr> <chr> <chr> <chr> <chr>
1 Entries must be in a Yes or No fo~ Free text field~ Free text field. [255] Free text fie~ "This field is required when the contact is a us~
2 Yes practice NA PI "ppi#lifespan.org"
3 Yes practice NA research nurse "prn#lifespan.org"
4 Yes practice NA research nurse "prn#lifespan.org"
5 Yes practice NA research nurse "prn#lifespan.org"
6 Yes practice NA regulatory "preg#lifespan.org"
7 Yes practice NA regulatory "preg#lifespan.org"
8 Yes practice NA regulatory "preg#lifespan.org"
names(df2) <- names(df)
df3 <- rbind(df, df2)
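Both answers rely on the two data frames having the same number of columns in the same order. Here is a minimal sketch of that idea on toy data. The frames `template` and `records` are made-up stand-ins for df and df2, not the data from the question:

```r
# Toy stand-ins for df and df2 (hypothetical values)
template <- data.frame(`Active Contact*` = "Yes or No",
                       `First Name*` = "Free text",
                       check.names = FALSE, stringsAsFactors = FALSE)
records <- data.frame(ActiveContact = c("Yes", "Yes"),
                      fname = c("a", "b"),
                      stringsAsFactors = FALSE)

# Renaming is positional, so check the column counts agree first
stopifnot(ncol(template) == ncol(records))

# Rename the columns of records, then stack it under the template row
combined <- rbind(template, setNames(records, names(template)))
combined
```

Because the renaming is purely positional, it is worth confirming that the columns of both frames are in the same order before binding.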
We have a dataframe with a Phone column that contains phone numbers; however, in many of the rows the phone number appears twice within the same cell:
structure(list(Title = c("Head Coach", "Athletic Trainer", "Head Coach",
"Assistant Coach", "Student Assistant", "Head Men's Basketball Coach", "Coach"
), Phone = c("(904) 256-7242\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t(904) 256-7242",
"256-765-5020\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t256-765-5020",
NA, "765.285.8142\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t765.285.8142",
"", "549-5849\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t549-5849", "516-302-1039"
)), row.names = c(1L,2L, 3L,4L,5L,6L,7L ), class = "data.frame")
Title Phone
1 Head Coach (904) 256-7242\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t(904) 256-7242
2 Athletic Trainer 256-765-5020\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t256-765-5020
3 Head Coach <NA>
4 Assistant Coach 765.285.8142\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t765.285.8142
5 Student Assistant
6 Head Men's Basketball Coach 549-5849\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t549-5849
7 Coach 516-302-1039
The correct output would remove phone number duplicates:
structure(list(Title = c("Head Coach", "Athletic Trainer", "Head Coach",
"Assistant Coach", "Student Assistant", "Head Men's Basketball Coach", "Coach"
), Phone = c("(904) 256-7242",
"256-765-5020",
NA, "765.285.8142",
"", "549-5849", "516-302-1039"
)), row.names = c(1L,2L, 3L,4L,5L,6L,7L ), class = "data.frame")
Typically I would share our progress on this, but quite frankly we are lost as to how to even get started on this. Seems like a very difficult problem especially given (a) the \r\n\t\t\t\ that appear in the strings, (b) that there are NA and missing values and (c) not every row is duplicated, (d) different formats (some area codes, some with ., some with -, some with ()). Any recommendations on how to clean this column?
df$Phone = sub('\r.*', '', df$Phone)
Title Phone
1 Head Coach (904) 256-7242
2 Athletic Trainer 256-765-5020
3 Head Coach <NA>
4 Assistant Coach 765.285.8142
5 Student Assistant
6 Head Men's Basketball Coach 549-5849
7 Coach 516-302-1039
We could replace each run of whitespace characters with a comma using gsub, split at that delimiter (,) and extract the first element
df1$Phone <- sapply(strsplit(gsub("[\r\n\t]+", ",", df1$Phone), ","), \(x) x[1])
-output
df1$Phone
[1] "(904) 256-7242" "256-765-5020" NA
[4] "765.285.8142" NA "549-5849" "516-302-1039"
Or another option is trimws - specify the whitespace to match the one or more [\r\n\t] followed by other characters (.*)
trimws(df1$Phone, whitespace = "[\r\n\t]+.*")
[1] "(904) 256-7242" "256-765-5020" NA
[4] "765.285.8142" "" "549-5849" "516-302-1039"
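The answers above keep whatever comes before the first control character. If you would rather confirm that the repeated pieces really are identical before dropping them, one possible sketch is below. The helper name `dedupe_phone` and the shortened sample vector are made up for illustration. It splits each cell on runs of `\r`, `\n` and `\t`, keeps the unique pieces, and leaves NA and empty strings as they were:

```r
# Hypothetical helper: keep the unique pieces of each cell
dedupe_phone <- function(x) {
  parts <- strsplit(x, "[\r\n\t]+")
  out <- vapply(parts,
                function(p) paste(unique(p), collapse = "; "),
                character(1))
  out[is.na(x)] <- NA_character_  # restore missing values
  out
}

# Shortened sample in the same shape as the Phone column
phone <- c("(904) 256-7242\r\n\t\t\r\n\t\t\t(904) 256-7242",
           NA, "", "516-302-1039")
cleaned <- dedupe_phone(phone)
cleaned
```

If a cell ever held two different numbers, both would be kept, separated by "; ", rather than the second being silently discarded.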
I have a one-column CSV. Each cell has a large text that's the source code from a page in a database of company descriptions (I'm only including a small portion of the text, it's more extensive.) There are about 30,000 cells/companies in the CSV.
structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Chicago</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>70956655</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>",
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
I'd like to extract specific elements from each text (Company Class, Location, Founding Date, Company Status, etc.) and put them in a structured CSV.
There are two issues with the texts. 1. Not all cells have the same text information. For instance, some don't have "Location" (it's not that the element "Location" has an empty value in the text, it's simply not there). If the text doesn't contain "Location" as a possibility, I'd want that field left empty in the extracted CSV. 2. Some texts have the field more than once. For example, some companies have two or more locations. I'd like to extract all of them, the first time a location appears would be "Location" in the CSV and the second "Location2" and so on.
What's the best way to approach this? I don't have much experience working with texts in R so any help would be appreciated.
Edit: To clarify, in the final CSV, I need to create a single row per text with multiple column entries ("Company Name", "Location", etc.). Where the column name is not in the text, I'd like the column cell left blank.
Edit: Here is an example with two locations:
structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>Banking</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>NYC</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Atlanta</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>43448901</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>07/03/1997</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Solvent</span></td>",
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
Here's an approach based on the package stringr:
library(stringr)
df <- data.frame(
desc = unlist(str_extract_all(x, "(?<=>)[\\w\\s]+(?=:<)")),
detail = unlist(str_extract_all(x, "(?<=span>)[\\w\\s/-]+(?!:)(?=</span>)"))
)
Result:
df
desc detail
1 Company Class E-Commerce
2 Location Chicago
3 Company Number 70956655
4 Founding Date 01/09/2015
5 Company Status Bankrupt
EDIT:
If you want to have the details each in an extra column, then this should work (where the column name is not in the text, you'll get NA):
df <- data.frame(
Company_Class = str_extract(x, "(?<=Company Class:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"),
Location = str_extract(x, "(?<=Location:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"),
Company_Number = str_extract(x, "(?<=Company Number:[<>/a-z]{1,50})\\d+"),
Founding_Date = str_extract(x, "(?<=Founding Date:[<>/a-z]{1,50})\\d+/\\d+/\\d+"),
Company_Status = str_extract(x, "(?<=Company Status:[<>/a-z]{1,50})[A-Z][\\w\\s-]+")
)
Result:
df
Company_Class Location Company_Number Founding_Date Company_Status
1 E-Commerce Chicago 70956655 01/09/2015 Bankrupt
EDIT 2:
If you may have two Location values:
df <- data.frame(
Company_Class = str_extract(x, "(?<=Company Class:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"),
Location1 = unlist(str_extract_all(x, "(?<=Location:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"))[1],
Location2 = unlist(str_extract_all(x, "(?<=Location:[<>/a-z]{1,50})[A-Z][\\w\\s-]+"))[2],
Company_Number = str_extract(x, "(?<=Company Number:[<>/a-z]{1,50})\\d+"),
Founding_Date = str_extract(x, "(?<=Founding Date:[<>/a-z]{1,50})\\d+/\\d+/\\d+"),
Company_Status = str_extract(x, "(?<=Company Status:[<>/a-z]{1,50})[A-Z][\\w\\s-]+")
)
Result:
df
Company_Class Location1 Location2 Company_Number Founding_Date Company_Status
1 Banking NYC Atlanta 43448901 07/03/1997 Solvent
Data:
x <- structure(list(Company = c("<H5>Company Information</H5>\r\n\r\n</td>\r\n\r\n</tr>\r\n\r\n</table>\r\n\r\n<table>\r\n\r\n<tr>\r\n\r\n<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Location:</span></td><td><span>Chicago</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Number:</span></td><td><span>70956655</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>\r\n\r\n</tr>\r\n\r\n<tr>\r\n\r\n<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>",
NA)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
Since the source code is HTML, you can use the rvest package.
Here is an example based on the code you submitted:
html = "<H5>Company Information</H5>
</td>
</tr>
</table>
<table>
<tr>
<td><span>Company Class:</span></td><td><span>E-Commerce</span></td>
</tr>
<tr>
<td><span>Location:</span></td><td><span>Chicago</span></td>
</tr>
<tr>
<td><span>Company Number:</span></td><td><span>70956655</span></td>
</tr>
<tr>
<td><span>Founding Date:</span></td><td><span>01/09/2015</span></td>
</tr>
<tr>
<td><span>Company Status:</span></td><td><span>Bankrupt</span></td>"
library(rvest)
#> Loading required package: xml2
x = read_html(html)
tble = html_table(x)
tble
#> [[1]]
#> X1 X2
#> 1 Company Class: E-Commerce
#> 2 Location: Chicago
#> 3 Company Number: 70956655
#> 4 Founding Date: 01/09/2015
#> 5 Company Status: Bankrupt
Created on 2021-02-07 by the reprex package (v0.3.0)
After that, extracting the values you require from the table should be easy.
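As a rough sketch of that last step, the two-column key/value table can be turned into a single row, numbering repeated keys so that a second "Location:" becomes `Location2`. The data frame `kv` below is typed in by hand to mimic the shape of the `html_table()` output for the two-location example; the underscore-style column names are an assumption, not something the question asks for:

```r
# Hand-typed stand-in for one parsed table (columns X1 and X2)
kv <- data.frame(
  X1 = c("Company Class:", "Location:", "Location:", "Company Number:"),
  X2 = c("Banking", "NYC", "Atlanta", "43448901"),
  stringsAsFactors = FALSE
)

# Drop the trailing colon and make the keys syntactic column names
key <- gsub(" ", "_", sub(":$", "", kv$X1))

# Number repeated keys: Location, Location2, ...
occurrence <- ave(seq_along(key), key, FUN = seq_along)
key <- ifelse(occurrence > 1, paste0(key, occurrence), key)

# One row per company, one column per key
company_row <- as.data.frame(as.list(setNames(kv$X2, key)),
                             stringsAsFactors = FALSE)
company_row
```

Rows built this way will not all have the same columns. A binding function that fills absent columns with NA, such as `dplyr::bind_rows()`, is one way to stack them into the final table.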
I have a dataset called colours, in which I would like to find some keywords based on a list (colouryellow, colourblue, colourwhite) that I have created. This is an example of the dataset:
USER   MESSAGE
23456  The colouryellow is very bright!
31245  Most girls like colourpink
99999  I am having a break
9877   The colouryellow is like the sun
Is there a way to obtain the number of times each keyword in the list appears in the column MESSAGE?
For example, the output would be like:
Keyword       Frequency of Keywords
colouryellow  2
colourblue    0
colourwhite   0
I have tried the following code but it does not provide me the frequency for each keyword, instead displays them together.
colour= read.csv("C: xxxxxx")
keywordcount= dplyr::filter(colour, grepl("colouryellow|colourblue|colourwhite", MESSAGE))
Thank you in advance.
Some things you can do.
some_colours <- c("colouryellow", "colourblue", "colourwhite")
some_col_regex <- paste0("\\b(", paste(some_colours, collapse = "|"), ")\\b")
grepl(some_col_regex, colour$MESSAGE)
# [1] TRUE FALSE FALSE TRUE
lengths(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE)))
# [1] 1 0 0 1
table(unlist(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE))))
# colouryellow
# 2
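The `table()` call above only lists keywords that actually occur. To also report the keywords with zero matches, as in the requested output, one option is to convert the matches to a factor whose levels are the full keyword list. A minimal sketch, reusing the sample data and names from this answer (the `result` data frame and its column names are my own choice):

```r
colour <- data.frame(
  USER = c(23456L, 31245L, 99999L, 9877L),
  MESSAGE = c("The colouryellow is very bright!", "Most girls like colourpink",
              "I am having a break", "The colouryellow is like the sun"),
  stringsAsFactors = FALSE
)
some_colours <- c("colouryellow", "colourblue", "colourwhite")
some_col_regex <- paste0("\\b(", paste(some_colours, collapse = "|"), ")\\b")

# All matched keywords across every message
hits <- unlist(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE)))

# Fixing the factor levels keeps keywords that never matched, with a count of 0
counts <- table(factor(hits, levels = some_colours))
result <- data.frame(Keyword = names(counts), Frequency = as.vector(counts))
result
```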
Data
colour <- structure(list(USER = c(23456L, 31245L, 99999L, 9877L), MESSAGE = c("The colouryellow is very bright!", "Most girls like colourpink", "I am having a break", "The colouryellow is like the sun")), class = "data.frame", row.names = c(NA, -4L))
After this expression
good.rows<-ifelse(nchar(ufo$DateOccurred)!=10 | nchar(ufo$DateReported)!=10,
FALSE, TRUE)
I expected to get vectors of Booleans but I got
length(good.rows)
[1] 0
RStudio shows this as an empty logical vector, logical(0). What can I do to solve this?
dput(head(ufo))
structure(list(DateOccured = structure(c(9412, 9413, 9131, 9260,
9292, 9428), class = "Date"), DateReported = structure(c(9412,
9414, 9133, 9260, 9295, 9427), class = "Date"), Location = c(" Iowa City, IA",
" Milwaukee, WI", " Shelton, WA", " Columbia, MO", " Seattle, WA",
" Brunswick County, ND"), ShortDescription = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Duration = c(NA, "2 min.", NA, "2 min.", NA, "30 min."), LongDescription = c("Man repts. witnessing \"flash, followed by a classic UFO, w/ a tailfin at back.\" Red color on top half of tailfin. Became triangular.",
"Man on Hwy 43 SW of Milwaukee sees large, bright blue light streak by his car, descend, turn, cross road ahead, strobe. Bizarre!",
"Telephoned Report:CA woman visiting daughter witness discs and triangular ships over Squaxin Island in Puget Sound. Dramatic. Written report, with illustrations, submitted to NUFORC.",
"Man repts. son's bizarre sighting of small humanoid creature in back yard. Reptd. in Acteon Journal, St. Louis UFO newsletter.",
"Anonymous caller repts. sighting 4 ufo's in NNE sky, 45 deg. above horizon. (No other facts reptd. No return tel. #.)",
"Sheriff's office calls to rept. that deputy, 20 mi. SSE of Wilmington, is looking at peculiar, bright white, strobing light."
)), row.names = c(NA, 6L), class = "data.frame")
There are a couple of reasons why this could be happening:
1. Your dataset is empty. Check this using the dim() function.
2. The columns are not of type character. Check this using the class() function.
3. If both of these look correct, try running the nchar(...) statements separately.
Below I've created an example that works correctly, where I've gone through the steps mentioned above. In future, please provide a reproducible example as part of your question.
# Create sample data
ufo <- data.frame(DateOccurred=c("a","bb","ccc"),
DateReported=c("a","bb","ccc"),
stringsAsFactors = FALSE)
print(ufo)
# Check size of data (make sure data has rows and columns are of type Character)
dim(ufo)
class(ufo$DateOccurred)
class(ufo$DateReported)
# Check nchar statements (Should run without error/warnings)
nchar(ufo$DateOccurred)
nchar(ufo$DateReported)
# Actual
good.rows <- ifelse(nchar(ufo$DateOccurred)!=3 | nchar(ufo$DateReported)!=3,
FALSE, TRUE)
print(good.rows)
length(good.rows)
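One further possibility, judging only from the dput output in the question: the column there is spelled `DateOccured` (one "r"), while the expression refers to `ufo$DateOccurred`. Using `$` with a name that does not match any column returns NULL, and any comparison involving a zero-length vector is itself zero-length, which would explain `logical(0)`. A small sketch using the first two date values from the dput output:

```r
# Same column names as in the dput output (note the spelling DateOccured)
ufo <- data.frame(
  DateOccured  = structure(c(9412, 9413), class = "Date"),
  DateReported = structure(c(9412, 9414), class = "Date")
)

# The misspelled name matches no column, so $ returns NULL
missing_col <- is.null(ufo$DateOccurred)

# A zero-length operand makes the whole condition zero-length
good.rows <- ifelse(nchar(ufo$DateOccurred) != 10 | nchar(ufo$DateReported) != 10,
                    FALSE, TRUE)
length(good.rows)

# Checking the name first catches the problem early
has_col <- "DateOccurred" %in% names(ufo)

# The columns are already of class Date, so a missing-value check
# may be a closer fit than counting characters (an assumption about the intent)
good.rows.fixed <- !is.na(ufo$DateOccured) & !is.na(ufo$DateReported)
length(good.rows.fixed)
```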
In a single dataset (QueryTM), I have two columns, Query and TM. I want to check whether Query contains the value of TM from the same row. For example, if TM is "Coca Cola" and Query is "Coca Cola India", Query should match TM. However, if Query is "Coca Colala India", it shouldn't match. The results are to be stored in another column, say Result.
I am using R as the platform.
You will need to add word boundaries to capture exact matching. Using mapply you can do,
dd$result <- mapply(grepl, paste0('\\b', dd$TM, '\\b'), dd$Query)
dd
# TM Query result
#1 Coca Cola Coca Colala India FALSE
#2 Fanta Orange Fanta Orange India TRUE
DATA
dput(dd)
structure(list(TM = c("Coca Cola", "Fanta Orange"), Query = c("Coca Colala India",
"Fanta Orange India")), .Names = c("TM", "Query"), row.names = c(NA,
-2L), class = "data.frame")
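The sample data above does not include the matching "Coca Cola India" case from the question, so here is a small sketch that covers both. The three-row data frame is made up from the strings in the question, and `USE.NAMES = FALSE` is added only so the result comes back as a plain logical vector without the patterns as names:

```r
# Made-up rows built from the examples in the question
dd <- data.frame(
  TM = c("Coca Cola", "Coca Cola", "Fanta Orange"),
  Query = c("Coca Cola India", "Coca Colala India", "Fanta Orange India"),
  stringsAsFactors = FALSE
)

# Row-wise grepl: each TM, wrapped in word boundaries, is searched in its own Query
dd$Result <- mapply(grepl, paste0("\\b", dd$TM, "\\b"), dd$Query,
                    USE.NAMES = FALSE)
dd
```

Note that TM is treated as a regular expression here, so values containing characters such as `.` or `(` would need escaping first.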