Dealing with spaces and "weird" characters in column names with dplyr::rename() - r

I have a table with difficult headers like this:
  Subject Cat Nbr Title    Instruction..Mode!
1 XYZ     101     Intro I  ONLINE
2 XYZ     102     Intro II CAMPUS
3 XYZ     135     Advanced CAMPUS
I would like to rename the columns with dplyr::rename()
df %>%
rename(subject = Subject,
code = Cat Nbr,
title = title,
mode = Instruction..Mode!)
But I am getting an Error: unexpected symbol in:
How might I reconcile this?

To refer to variables that contain non-standard characters or start with a number, wrap the name in back ticks, e.g., `Instruction..Mode!`
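As a minimal sketch of that advice, assuming the headers were preserved exactly as in the question (the `df` below is a hypothetical stand-in built with `check.names = FALSE`), the rename could look like this:

```r
library(dplyr)

# Hypothetical data frame mirroring the question; check.names = FALSE keeps
# the space and punctuation in the column names instead of sanitising them.
df <- data.frame(
  Subject = c("XYZ", "XYZ", "XYZ"),
  `Cat Nbr` = c(101, 102, 135),
  Title = c("Intro I", "Intro II", "Advanced"),
  `Instruction..Mode!` = c("ONLINE", "CAMPUS", "CAMPUS"),
  check.names = FALSE
)

renamed <- df %>%
  rename(
    subject = Subject,
    code    = `Cat Nbr`,            # backticks: the name contains a space
    title   = Title,
    mode    = `Instruction..Mode!`  # backticks: the name contains "!"
  )

names(renamed)
```

Only the non-syntactic names need backticks; plain names such as Subject and Title can be written as usual.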

Related

Extract text from CSV in R

I have an Excel .CSV file in which one column has the transcription of a conversation. Whenever the speaker uses Spanish, the Spanish is written within brackets.
One example sentence:
so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day
Ideally, I'd like to extract the English and Spanish separately, so one file would contain all the Spanish words, and another would contain all the English words.
Any ideas on how to do this? Or which function/package to use?
Edited to add: there are about 100 cells that contain text in this Excel sheet. I guess where I'm confused is how do I treat this entire CSV as a "string"?
I don't want to copy and paste every cell as a "string"; I was hoping I could somehow just upload the entire CSV.
To load the CSV into R, you could use readr::read_csv("YOUR_FILE.csv"). There are more options, some of which are available to you if you use the "File > Import Dataset > From Text (readr)" menu option in RStudio.
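On the "how do I treat the whole CSV as a string" part: once the file is loaded, the text column is already a character vector with one element per cell, so string functions apply to every cell at once. A small sketch, assuming a hypothetical column named `transcript` and a temporary stand-in file in place of the real one:

```r
library(readr)

# Write a tiny stand-in CSV to a temporary file (hypothetical data and
# column name; point `path` at your real file instead).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "transcript",
  "\"so [usualmente] maybe [me levanto] like I exercise\"",
  "\"I go to class [a veces] online or in person\""
), path)

conversations <- read_csv(path, col_types = cols(transcript = col_character()))

# The whole column is one character vector, one element per cell,
# so a single call checks every cell for an opening bracket.
has_spanish <- grepl("[", conversations$transcript, fixed = TRUE)
has_spanish
```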
Supposing you have the data loaded, you will likely need to rely on some form of "regex" to parse the text into sections based on the brackets. There are some base R functions for this, but I find the functions in stringr (part of the tidyverse meta-package) to be useful for this. And tidyr::separate_rows is a nice way to split the text into more lines.
In the regex below, there are a few ingredients:
(?=...) means to split before the [ but to keep it.
\\[ is how we refer to [ because brackets have special meaning in regex so we need to "escape" them to treat them as a literal character.
(?<=...) means to split after the ] but keep it.
| in the last row means "or"
(Granted, I'm still a regex beginner, so I expect there are more concise ways to do this.)
So we could do something like:
df1 <- data.frame(text = "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day")
library(tidyverse)
df1 %>%
  mutate(orig_row = row_number()) %>%
  separate_rows(text, sep = "(?=\\[)") %>%
  separate_rows(text, sep = "(?<=\\] )") %>%
  mutate(language = if_else(str_detect(text, "\\[|\\]"), "Espanol", "English"),
         text = str_remove_all(text, "\\[|\\]"))
Result
# A tibble: 5 × 3
text orig_row language
<chr> <int> <chr>
1 "so " 1 English
2 "usualmente " 1 Espanol
3 "maybe " 1 English
4 "me levanto como a las nueve y media " 1 Espanol
5 "like I exercise and the I like either go to class online or in person like it depends on the day" 1 English
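If the goal is simply two separate collections, one Spanish and one English, a shorter route is to pull out the bracketed chunks and delete them from the rest. This is a sketch using stringr rather than the row-splitting approach above, and it assumes the brackets are always balanced and never nested:

```r
library(stringr)

text <- c(
  "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise",
  "it depends on the day"
)

# Spanish: everything between "[" and "]", without the brackets themselves.
spanish <- str_extract_all(text, "(?<=\\[)[^\\]]+(?=\\])")

# English: drop each bracketed chunk, then tidy up the leftover whitespace.
english <- str_squish(str_remove_all(text, "\\[[^\\]]*\\]"))

# Flatten the per-row list into one vector of Spanish phrases.
spanish_words <- unlist(spanish)
```

Each vector can then be written to its own file with writeLines(), for example writeLines(spanish_words, "spanish.txt"), where the file name is only a placeholder.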

Grepl for 2 words/phrases in proximity in R (dplyr)

I'm trying to create a filter for a large data frame. I'm trying to use grepl to search for a series of text within a specific column. I've done this for single words/combinations, but now I want to search for two words in close proximity (i.e., the word tumo(u)r within 3 words of the word colon).
I've checked my regular expression on https://www.regextester.com/109207 and my search works there, but it doesn't work within R.
The error I get is
Error: '\W' is an unrecognized escape in character string starting ""\btumor|tumour)\W"
Example below: trying to search for tumo(u)r within 3 words of colon.
Can anyone help?
library(tibble)
example.df <- tibble(number = 1:4, AB = c('tumor of the colon is a very hard disease to cure', 'breast cancer is also known as a neoplasia of the breast', 'tumour of the colon is bad', 'colon cancer is also bad'))
filtered.df <- example.df %>%
filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB, ignore.case=T)
R uses backslashes as escapes in string literals, and the regex engine does, too, so you need to double your backslashes. This is explained in multiple prior questions on Stack Overflow as well as in the help page brought up by ?regex. Try the escaped operators in a simpler set of tests before attempting complex operations, and pay close attention to the placement of parentheses and quotes in the pattern argument.
filtered.df <- example.df %>%
#filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB,
# errors here ....^.^..............^..^...^..^.............^.^
filter(grepl( "(\\btumor|tumour)\\W|\\w+(\\w+\\W+){0,3}colon\\b", AB,
ignore.case=T) )
> filtered.df
# A tibble: 2 × 2
number AB
<int> <chr>
1 1 tumor of the colon is a very hard disease to cure
2 3 tumour of the colon is bad
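The pattern above fixes the escaping, but it does not strictly enforce the "within 3 words" requirement. One possible sketch of a proximity pattern, in base R, follows; it only covers tumo(u)r appearing before colon, and the reverse order would need a second alternative:

```r
AB <- c(
  "tumor of the colon is a very hard disease to cure",
  "breast cancer is also known as a neoplasia of the breast",
  "tumour of the colon is bad",
  "colon cancer is also bad"
)

# tumo(u)r, then at most three intervening words, then colon.
# Every regex backslash is doubled because it sits inside an R string.
pattern <- "\\btumou?r\\W+(?:\\w+\\W+){0,3}colon\\b"

# perl = TRUE selects the PCRE engine, which supports the (?: ) group.
hits <- grepl(pattern, AB, ignore.case = TRUE, perl = TRUE)
AB[hits]
```

The same logical vector can be used inside dplyr::filter() in place of the grepl() call shown earlier.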

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets that contain mentions starting with '#'. I need to extract all of them and save the mentions of each tweet as a single string, like "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to form a string separated by spaces, as mentioned earlier?
Thanks in advance.
I think it would be best to use an AsIs list column (created with I()) in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. If you instead want one comma-separated string per row, you can collapse each list element with sapply, a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
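The question asks for a space-separated string ("#mention1 #mention2") rather than a comma-separated one, which only changes the collapse argument. A sketch on a few shortened, hypothetical tweets, using vapply so the result is guaranteed to be a character vector:

```r
library(stringr)

tweets <- c(
  "It is #DineshKarthik's birthday and a rare image of the captain of #KKRiders.",
  "CHAMPIONS!!",
  "A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets"
)

# One character vector of hashtags per tweet (empty if there are none).
tags <- str_extract_all(tweets, "#\\w+")

# Collapse each vector into a single space-separated string.
mentions <- vapply(tags, function(x) paste(x, collapse = " "), character(1))
mentions
```

Tweets without any hashtag end up as an empty string, so the result always has one element per row and can be assigned directly to a data frame column.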
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[[:alnum:]_]#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice.

Combine separate column/rows as one column/row in R

Using a .txt file, I have this dataset:
Xenopsylla cheopis Echinolaelaps sp.
Maxomys rajah 1 3
Callosciurus prevostii borneensis 4 2
Using this function,
test<-read.table("data.txt",header=T)
Xenopsylla cheopis Echinolaelaps sp.
Maxomys rajah 1 3
Callosciurus prevostii borneensis 4 2
R seems to recognize my data as different columns/rows and produces this error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 2 did not have 4 elements
I tried to use textConnection, but it does not seem to produce what I want.
First of all just store your data in a character vector as I did here:
test<-readChar("C:/Users/Julian/Downloads/file.txt", file.info("C:/Users/Julian/Downloads/file.txt")$size)
Obviously, you need to replace the path of my file with yours.
Then you get rid of the space between Genus and Species using gsub()
test<-gsub("([[:lower:]])([[:space:]])([[:lower:]])", "\\1\\3",test)
Finally, you can read your data using read.table() with the text argument:
a<-read.table(text=test,sep="\t",header=TRUE,row.names = 1)
a
Xenopsyllacheopis Echinolaelapssp. Ixodessp.
Maxomysrajah 3 8 9
Callosciurusprevostiiborneensis 5 7 1
Sundamysmuelleri 3 5 7
Niviventercremoriventer 6 8 9
EDIT:
To answer OP's new question in the comments:
"([[:lower:]])([[:space:]])([[:lower:]])"
enables us to find all the parts of the strings that we created with readChar() that match this pattern. This pattern is: a lowercase letter followed by a blank space followed by a lowercase letter.
This matches the space between a genus and its species name, but not the space between a species name and the following genus, because a genus starts with an uppercase letter.
Now the "\\1\\3" part means that we keep the first and third groups of our "([[:lower:]])([[:space:]])([[:lower:]])" pattern, that is, the two ([[:lower:]]) groups. Because there is no space between \\1 and \\3 in "\\1\\3", they are joined without a space. Therefore we get Genusspecies instead of Genus species.
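To see the substitution and the read step together without a file on disk, here is a small self-contained sketch. The `raw` string is a hypothetical stand-in for what readChar() would return, and it assumes the columns are tab-separated, as the sep = "\t" argument above implies:

```r
# Hypothetical file contents: tab-separated, with spaces inside the names.
raw <- paste0(
  "\tXenopsylla cheopis\tEchinolaelaps sp.\n",
  "Maxomys rajah\t1\t3\n",
  "Callosciurus prevostii borneensis\t4\t2\n"
)

# Remove any whitespace that sits between two lowercase letters,
# keeping the letters themselves (groups 1 and 3).
cleaned <- gsub("([[:lower:]])([[:space:]])([[:lower:]])", "\\1\\3", raw)

# Parse the cleaned text; the first column becomes the row names.
a <- read.table(text = cleaned, sep = "\t", header = TRUE, row.names = 1)
a
```

Tabs and newlines are left alone here because each of them is next to a digit, a full stop, or an uppercase letter rather than between two lowercase letters.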

R - Change value names in data.frame but keeping a distinguishing mark

I have the following data.frame:
> goals.names
id name
1 1 Registro NL Widget
2 2 Fidelizado
3 3 Entusiasmado
4 4 Registro Newsletter
How can I change every id value to look like goal1, goal2, goal3, goal4, goalX? I suppose that I have to first get the id, save it to a variable, and use it again to substitute it with the new key. Something like "for each" would be really helpful, but I have not found a substitute for "for each" in R.
You could try mutate() from the dplyr package, like this:
library(dplyr)
goals.names <- mutate(goals.names, id = paste("goal",id,sep=""))
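No "for each" is needed because paste() is vectorised: it is applied to every element of the id column in one call. The same idea in base R, as a sketch on a data frame rebuilt from the question:

```r
# Data frame rebuilt from the question.
goals.names <- data.frame(
  id = 1:4,
  name = c("Registro NL Widget", "Fidelizado", "Entusiasmado", "Registro Newsletter")
)

# paste0() is paste() with sep = ""; it prefixes every id in one step.
goals.names$id <- paste0("goal", goals.names$id)
goals.names
```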
