Selecting / grouping by first character string until a specific symbol - r

Good day,
I'm currently working on a dataset where there's a column in this format.
PA-121-1512-asa-1241
PWW-121-1571-accs-21561
PSAWA-171-1616-gfaa-161
QSF-16-1613-63-asdfa
H-Elevator-15-asf-1112
QSF-asa-sda-afas-112
The first sequence of letters before the "-" symbol identifies the building location, so I would like to save this first sequence of letters in a separate column.
I would like to know how to select > copy > paste these values into a new column so I end up with a column like this, e.g.:
Location:
PA
PWW
PSAWA
QSF
H
QSF
I tried the function:
str_extract("PA-121-1512-asa-1241", ".+?(?<=-)")
PA-121-1512-asa-1241 is just an example; in my actual code I selected the whole column.
What I got printed out was PA- instead of just PA.
If more data or a more elaborate explanation is needed, please do tell me. I'm still fairly new to writing questions on this site.
Happy holidays!
E.D.D.
Post post...
After looking at my code again to copy and paste a proper example, as Mr. Cyrus suggested, I found my mistake. Instead of:
str_extract("PA-121-1512-asa-1241", ".+?(?<=-)")
it is:
str_extract("PA-121-1512-asa-1241", "[^-]+")
This returns PA instead of PA-
This shows that reading through your code 50 times does help, because the previous 49 didn't.
If anyone has a more elegant / efficient method, I'm still interested, since running this code through 5 million rows took me quite a while.

The format that str_extract() requires for its "pattern" argument is quite confusing to get right. My teacher advised me to read more about grep.
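For the efficiency question, a minimal base R sketch may help. The lookbehind pattern ".+?(?<=-)" stops only once the hyphen has already been consumed, which is why it returned PA-. Deleting the first hyphen and everything after it avoids that. The data frame df and the column names code and location are hypothetical, and whether this is faster than str_extract() on 5 million rows is an assumption to check by timing it on the real data.

```r
# Hypothetical data frame holding the codes from the question
df <- data.frame(
  code = c("PA-121-1512-asa-1241", "PWW-121-1571-accs-21561",
           "PSAWA-171-1616-gfaa-161", "QSF-16-1613-63-asdfa",
           "H-Elevator-15-asf-1112", "QSF-asa-sda-afas-112"),
  stringsAsFactors = FALSE
)

# Delete the first "-" and everything after it; sub() is vectorised,
# so the whole column is handled in one call
df$location <- sub("-.*", "", df$code)

# Alternative without a regex: split on the literal "-" and keep the first piece
location_split <- vapply(strsplit(df$code, "-", fixed = TRUE),
                         `[`, character(1), 1)

print(df$location)
```

Both approaches leave a value unchanged when it contains no hyphen at all.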

Related

When does pressing the Enter (Return) key matter when creating a regex matching expression?

I want to search for multiple codes appearing in a cell. There are so many codes that I'd like to write parts of the code in succeeding lines. For example, let's say I am looking for "^a11","^b12", "^c67$" or "^d13[[:blank:]]". I am using:
^a11|^b12|^c67$|^d13[[:blank:]]
This seems to work. Now, I tried:
^a11|^b12|
^c67$|^d13[[:blank:]]
That also seemed to work. However when I tried:
^a11|^b12|^c67$|
^d13[[:blank:]]
It did not count the last one.
Note that my code is wrapped into a function. So the above is an argument that I feed the function. I'm thinking that's the problem, but I still don't know why one truncation works while the other does not.
I realized the answer today. The problem is that because I am passing the regex as an argument, the line break becomes part of the pattern and is attached to the code on the following line.
Thus, the code below was only "working" because ^c67$ has no matches in my data, so losing it went unnoticed.
^a11|^b12|
^c67$|^d13[[:blank:]]
And the code below was not working because ^d13 does have matches, but this setup looks for (line break)^d13[[:blank:]] instead of just ^d13[[:blank:]]
^a11|^b12|^c67$|
^d13[[:blank:]]
So an inelegant fix is:
^a11|^b12|^c67$|
^nothinghere|^d13[[:blank:]]
This inserts a dummy code that matches nothing and absorbs the line break.
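A less fragile way to spread many codes over several lines is to keep each code as its own string and join them afterwards, so no line break ever ends up inside the pattern. The following is only a sketch; the vectors codes and x are made-up examples, not the original data.

```r
# Each alternative is a separate string, so line breaks between them are harmless
codes <- c("^a11", "^b12",
           "^c67$",
           "^d13[[:blank:]]")
pattern <- paste(codes, collapse = "|")

x <- c("a11 foo", "b12", "c67", "d13 bar", "zzz")
matches <- grepl(pattern, x)

# For comparison: a line break typed inside the quotes is a literal newline
broken <- "^a11|^b12|^c67$|
^d13[[:blank:]]"
has_newline <- grepl("\n", broken, fixed = TRUE)
broken_matches <- grepl(broken, x)

print(matches)
print(broken_matches)
```

With the joined pattern the "d13 bar" element is found, while the version containing the newline misses it.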

R read csv with comma in column

Update 2020-5-14
Working with a different but similar dataset from here, I found read_csv seems to work fine. I haven't tried it with the original data yet though.
Although the replies didn't help solve the problem because my question was not correct, Shan's reply fits the original question I posted the most, so I accepted his answer.
Update 2020-5-12
I think my original question was not correct. As mentioned in the comments, the data was quoted. Although changing the separator made row 11582 in R look the same as row 11583 in Excel, that doesn't mean it's "right". Maybe some line breaks are misplaced due to an inappropriate encoding or something similar, causing some of the columns to be displaced. If I open the data with Notepad++, the record at row 11583 in Excel is at row 11596.
Original question
I am trying to read the listings.csv from this dataset on Kaggle into R. I downloaded the file and wrote the code read.csv('listings.csv'). The first column, id, is supposed to be numeric. However, it shows:
listing$id[1:10]
[1] 2015 2695 3176 3309 7071 9991 14325 16401 16644 17409
13129 Levels: Ole Berl穩n!,16736423,Nerea,Mitte,Parkviertel,52.55554132116211,13.340658248460871,Entire home/apt,36,6,3,2018-01-26,0.16,1,279\n17312576,Great 2 floor apartment near Friederich Str MITTE,116829651,Selin,Mitte,Alexanderplatz,52.52349354926847,13.391003496971203,Entire home/apt,170,3,31,2018-10-13,1.63,1,92\n17316675,80簡 m of charm in 3 rooms with office space,116862833,Jon,Neuk繹lln,Schillerpromenade,52.47499080234379,13.427509313575928...
I think it is because there are values with commas in the second column. For example, opening the file with Microsoft Excel, I can see that one of the values in the second column is Ole,Ole...:
How can I read a csv file into R correctly when some values contain commas?
Since you have access to the data in Excel, you can 'Save As' in Excel with a separator other than comma (,). First go to Control Panel -> Region and Language -> Additional settings, where you can change the "List separator". The most common one other than comma is the pipe symbol (|). In R, specify the separator as '|' when you read the file, for example with read.csv(..., sep = "|") or readr's read_delim(..., delim = "|").
You could try this?
listings <- read.csv("listings.csv", stringsAsFactors = FALSE)
listings$name <- gsub(",", "", listings$name)  # removes the commas in the name column
If you don't need the information in the second column, then you can always delete it (in Excel) before importing into R. The read.csv function, which calls scan, can also omit unwanted columns using the colClasses argument. However, the fread function from the data.table package does this much more simply with the drop argument:
library(data.table)
listings <- fread("listings.csv", drop=2)
If you do need the information in that column, then other methods are needed (see other solutions).
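As the update notes, the data was quoted, and a quoted comma on its own should not confuse read.csv. The sketch below uses two made-up rows (not the real listings.csv) to illustrate this, and also shows the colClasses approach mentioned above for skipping a column.

```r
# Made-up sample in the style of listings.csv; the quoted name contains a comma
csv_text <- 'id,name,host_id
2015,"Ole, Ole Berlin!",16736423
2695,Great 2 floor apartment,116829651'

# The default quote character of read.csv is the double quote,
# so the comma inside the quoted field is not treated as a separator
listings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Skipping the second column at read time with colClasses = "NULL"
listings_no_name <- read.csv(text = csv_text,
                             colClasses = c("integer", "NULL", "integer"))

str(listings)
```

If the id column still comes back as text with the real file, the cause is more likely an encoding or line-break problem than the commas themselves.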

How to edit hidden character in String

The appearance of "textparcali" in RStudio Source Editor was as follows.
In textparcali (tbl_df), I ran the following code to delete single-character words.
textparcali$word<-gsub("\\W*\\b\\w\\b\\W*",'', textparcali$word)
But the result of the deletion was unexpected. You can see the picture below. Please note lines 67 and 50.
Everything was fine for line 50 and lines like it. However, this was not the case for line 67 (and I think there are others like it).
I focused on one line (67) to understand why it was deleted incorrectly. I had already seen what this line says in the editor, but I also wanted to look at it in the console. I ran the following code in the console.
textparcali$word[67]
The word on line 67 looks different in the console. There is a value that does not survive copy and paste but, surprisingly, appears in the console:
The reason I posted it as a picture is that this character disappears when copied and pasted.
You can download the file containing this character from the link below. However, you should open it with Notepad++.
Character.txt
gsub did its job correctly. How is that possible? What is the name of this character? When I try to write code that removes this character, the " sign changes and nothing is deleted.
The command textparcali$word<-gsub('[[:punct:]]+',' ',textparcali$word) does not work either.
What explains this behaviour? Is there a way to remove this character? What caused it? I know I have asked a lot.
Thank you all.
(I apologize for the bad scribbles in the pictures.)
I found the surprise character.
Combining Dot Above Right (U+0358): ͘
The following code is what is required to remove this character.
c<-"surprise character"
c
[1] "\u0358"
textparcali$word<-gsub("\u0358","",textparcali$word,ignore.case = FALSE)
textparcali$word<-gsub("\u0307","",textparcali$word,ignore.case = FALSE)
U+0307 (Combining Dot Above) did the job for me. However, you should check which code point your own data actually contains; otherwise you may be removing the wrong character.
More detailed information can be found in the links below.
https://gist.github.com/ngs/2782436
https://www.charbase.com/0358-unicode-combining-dot-above-right
Thanks a lot!
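To find out which code point is actually hiding in a string, it can help to list the code points directly instead of guessing. This is a minimal sketch; the sample value in word is invented (a letter followed by U+0307), not taken from the original textparcali data.

```r
# Hypothetical sample: "i" followed by an invisible combining dot above (U+0307)
word <- "i\u0307stanbul"

# List every code point in the string so hidden characters become visible
code_points <- sprintf("U+%04X", utf8ToInt(word))
print(code_points)

# Remove the two combining dots discussed above; fixed = TRUE treats the
# pattern as a literal character rather than a regular expression
cleaned <- gsub("\u0307", "", word, fixed = TRUE)
cleaned <- gsub("\u0358", "", cleaned, fixed = TRUE)
print(cleaned)
```

Note that utf8ToInt() accepts a single string, so for a whole column it would have to be applied element by element, for example with lapply().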

Cleaning forum post with multiple quotations in rvest + stringr

I am scraping a very long forum thread, and I want to come up with a database that has columns containing the following info: date / full post text / quoted user / quoted text / clean text
The clean text should be each user's post, without the quotations if they are replying to anyone. If the post is not a reply, I would leave it as NA. The following is an invented post, with an invented user, to illustrate what I have managed to do so far:
post<-"Meow1 wrote: »\noday is gonna be the day that they're gonna throw it back to you?\nBy now you should've somehow Realized what you gotta do\n\n\nI don't believe that anybody Feels the way I do, about you now\nMeow1 wrote: »\nI'm sure you've heard it all before But you never really had a doubt\n\n\nBecause maybe, you're gonna be the one that saves me\nMeow1 wrote: »\nAnd after all, you're my wonderwall\n\n\nAnd all the lights that lead us there are blinding"
Then I try to pull out the quoted user (Meow1) and it works:
QuotedUser_1<-ifelse(grepl('wrote:', post), gsub('\\s*wrote.*$', '', post), NA)
QuotedUser_1
[1] "Meow1"
Then I wrote this code for pulling out the quoted text and the clean text:
Quotedtext_1<- ifelse(grepl('wrote:', post), gsub('^.*wrote\\s*|\\s*\\n\\n\\n.*$', '', post), NA)
It works when there is only one quoted text, but otherwise it only gives the last quoted bit (in the example, 'And after all, you're my wonderwall')
And same for the clean text, it only returns the last reply:
Clean_text<- sub('^.*\\n\\n\\n\\s*|\\s*wrote.*', '', post)
If anyone has a suggestion to improve the code, so that I can have a vector with all the quotations, and a vector with all the replies, I would be very grateful...
Cheers
Are you sure you cannot scrape the author and text information separately? Without a source it's difficult to know, but I guess they can be obtained with different CSS selectors, which would make it much easier to split the data.
If not, it might be helpful to look into stringr's str_locate_all, which allows you to locate all occurrences of e.g. "wrote:" and split the string accordingly.
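If the text has to be split after scraping, one possible sketch in base R is to cut the post at every "<user> wrote:" header and then cut each block at the triple line break. It assumes user names contain no spaces and that every quote is followed by exactly three line breaks, as in the invented post; the shortened post below is a hypothetical stand-in for the real data.

```r
# Shortened, invented post with the same structure as the example above
post <- paste0("Meow1 wrote: \u00bb\nfirst quote\n\n\nfirst reply\n",
               "Meow1 wrote: \u00bb\nsecond quote\n\n\nsecond reply")

# All quoted users: find every "<user> wrote:" header, then strip the suffix
headers <- regmatches(post, gregexpr("\\S+ wrote:", post))[[1]]
quoted_user <- sub(" wrote:.*", "", headers)

# Split the post at each header line; drop the empty piece before the first header
blocks <- strsplit(post, "\\S+ wrote:[^\n]*\n")[[1]]
blocks <- blocks[nzchar(blocks)]

# Within each block, the quote comes before the triple line break, the reply after
parts <- strsplit(blocks, "\n\n\n", fixed = TRUE)
quoted_text <- vapply(parts, `[`, character(1), 1)
clean_text <- trimws(vapply(parts, `[`, character(1), 2))

print(quoted_text)
print(clean_text)
```

This yields one vector with all the quotations and one with all the replies, in the order they appear in the post.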

Grabbing part of a link from a URL in R

I have parts of links pertaining to baseball players in my character vector:
teamplayerlinks <- c(
"/players/i/iannech01.shtml",
"/players/l/lindad01.shtml",
"/players/c/canoro01.shtml"
)
I would like to isolate the letters/numbers after the 3rd / sign and before the .shtml portion. I want my resulting string to read:
desiredlinks
# [1] "iannech01" "lindad01" "canoro01"
I assume this may be a job for sub, but after much trial and error I'm having a very tough time learning the escape and character sequences. I know it can be done with two sub calls to remove the front and back portions, but I'd rather do this in one step so that it dynamically handles other links.
Thank you in advance to anyone who replies - I'm still learning R and trying to get better everyday.
You could try
gsub(".*/|\\..*$", "", teamplayerlinks)
# [1] "iannech01" "lindad01" "canoro01"
Here we have
.*/ remove everything up to and including the last /
| or
\\..*$ remove the first . and everything after it, up to the end of the string
By the way, these look a bit like player IDs given in the Lahman baseball data sets. If so, you can use the Lahman package in R and not have to scrape the web. It has numerous baseball data sets. It can be installed with install.packages("Lahman"). I also wrote a package retrosheet for downloading data sets from retrosheet.com. It's also on CRAN. Check it out!
The basename function is useful here.
gsub("\\.shtml", "", basename(teamplayerlinks))
# [1] "iannech01" "lindad01" "canoro01"
This can be also done without regex
tools::file_path_sans_ext(basename(teamplayerlinks))
#[1] "iannech01" "lindad01" "canoro01"
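Since the question asked about doing this with a single sub call, here is one possible sketch using a capture group. It assumes every link ends in .shtml and that the ID itself contains no / or . characters.

```r
teamplayerlinks <- c(
  "/players/i/iannech01.shtml",
  "/players/l/lindad01.shtml",
  "/players/c/canoro01.shtml"
)

# ^.*/       everything up to and including the last /
# ([^/.]+)   the ID, captured as group 1
# \\.shtml$  the literal extension at the end of the string
desiredlinks <- sub("^.*/([^/.]+)\\.shtml$", "\\1", teamplayerlinks)
print(desiredlinks)
```

Links that do not fit this shape are returned unchanged, which makes unexpected inputs easy to spot.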
