I have a data frame with a variable called "Control_Category". The variable has six names in it, which for simplicity's sake I am going to make generic:
df <- data.frame(Control_Category = c("Really Long Name One",
                                      "Super Really Long Name Two",
                                      "Another Really Flippin' Long Name Three",
                                      ",Seriously, It's a Fourth Long Name",
                                      "Definitely a Fifth Long Name",
                                      "Finally, This guy is done, number six"))
I'm using this to make a slight joke. So, while the names are long, they are tidy in that the values for each (1-6) are consistent. In this character vector of the data frame, there are hundreds and hundreds of entries that match one of those six.
What I need to do is to replace the long names with a short name. Therefore, where any of the above names are identified, replace that name with a shorter version, like:
One
Two
Three
Four
Five
Six
I tried a function using 'case_when' and it failed miserably. Any help would be appreciated.
Additional Information Based on Questions From Community
The order of the items doesn't matter. There isn't a designation of 1 - 6. There just happen to be six and I made six stupid long strings. The strings themselves are long.
So, anywhere "Super Really Long Name Two" exists, that value needs to be updated to something like "TWO", or a "Short_Name" that approximates "TWO". In reality, the category is called "Audit, Testing and Examination Results". The short name would ideally just be "AUDIT".
You could just use gsub() once for each replacement:
df$Control_Category <- gsub('Really Long Name One', 'One', df$Control_Category)
You can repeat similar logic to handle the other five long/short name pairs.
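If writing the call six times feels repetitive, the pairs can be collected in a named vector and applied in a loop — a sketch using the generic names from the question (fixed = TRUE makes gsub() match the long names literally, so the commas and apostrophes can't be misread as regex syntax):

```r
# Named vector: names are the long strings, values the short replacements
replacements <- c(
  "Really Long Name One"                    = "One",
  "Super Really Long Name Two"              = "Two",
  "Another Really Flippin' Long Name Three" = "Three",
  ",Seriously, It's a Fourth Long Name"     = "Four",
  "Definitely a Fifth Long Name"            = "Five",
  "Finally, This guy is done, number six"   = "Six"
)

df <- data.frame(Control_Category = names(replacements),
                 stringsAsFactors = FALSE)

# Apply each replacement; fixed = TRUE matches the pattern as a
# literal string rather than as a regular expression
for (long in names(replacements)) {
  df$Control_Category <- gsub(long, replacements[[long]],
                              df$Control_Category, fixed = TRUE)
}
```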
Here's a larger data frame with long names:
set.seed(101)
long_names <- c("Really Long Name One",
                "Super Really Long Name Two",
                "Another Really Flippin' Long Name Three",
                ",Seriously, It's a Fourth Long Name",
                "Definitely a Fifth Long Name",
                "Finally, This guy is done, number six")
df <- data.frame(control_category=sample(long_names, 100, replace=TRUE))
head(df)
## control_category
## 1 Another Really Flippin' Long Name Three
## 2 Really Long Name One
## 3 Definitely a Fifth Long Name
## 4 ,Seriously, It's a Fourth Long Name
## 5 Super Really Long Name Two
## 6 Super Really Long Name Two
Using the unique function will give you the category names:
category <- unique(df$control_category)
print(category)
## [1] Another Really Flippin' Long Name Three
## [2] Really Long Name One
## [3] Definitely a Fifth Long Name
## [4] ,Seriously, It's a Fourth Long Name
## [5] Super Really Long Name Two
## [6] Finally, This guy is done, number six
## 6 Levels: ,Seriously, It's a Fourth Long Name ...
Notice that the levels are in alphabetical order (see levels(category)). The simplest fix is to reorder them manually by inspecting the current order: here, category[c(2, 5, 1, 4, 3, 6)] gives the right order. Finally,
df$control_category <- factor(
df$control_category,
levels=category[c(2, 5, 1, 4, 3, 6)],
labels=c("one", "two", "three", "four", "five", "six")
)
head(df)
## control_category
## 1 three
## 2 one
## 3 five
## 4 four
## 5 two
## 6 two
Disclaimer: Totally inexperienced with R so please bear with me!...
Context: I have a series of .csv files in a directory. These files contain 7 columns and approx 100 rows. I've compiled some scripts that will read in all of the files, loop over each one adding some new columns based on different factors (e.g. if a specific column makes reference to a "box set" then it creates a new column called "box_set" with "yes" or "no" for each row), and write out over the original files. The only thing that I can't quite figure out (and yes, I've Googled high and low) is how to split one of the columns into two, based on a particular string. The string always begins with ": Series" but can end with different numbers or ranges of numbers. E.g. "Poldark: Series 4", "The Musketeers: Series 1-3".
I want to be able to split that column (currently named Programme_Title) into two columns (one called Programme_Title and one called Series_Details). Programme_Title would just contain everything before the ":" whilst Series_Details would contain everything from the "S" onwards.
To further complicate matters, the Programme_Title column contains a number of different strings, not all of which follow the examples above. Some don't contain ": Series", some will include the ":" but will not be followed by "Series".
Because I'm terrible at explaining these things, here's a sample of what it currently looks like:
Programme_Title
Hidden
Train Surfing Wars: A Matter of Life and Death
Bollywood: The World's Biggest Film Industry
Cuckoo: Series 4
Mark Gatiss on John Minton: The Lost Man of British Art
Love and Drugs on the Street
Asian Provocateur: Series 1-2
Poldark: Series 4
The Musketeers: Series 1-3
War and Peace
And here's what I want it to look like:
Programme_Title Series_Details
Hidden
Train Surfing Wars: A Matter of Life and Death
Bollywood: The World's Biggest Film Industry
Cuckoo Series 4
Mark Gatiss on John Minton: The Lost Man of British Art
Love and Drugs on the Street
Asian Provocateur Series 1-2
Poldark Series 4
The Musketeers Series 1-3
War and Peace
As I said, I'm a total R novice so imagine that you're speaking to a 5 yr old. If you need more info to be able to answer this then please let me know.
Here's the code that I'm using to do everything else (I'm sure it's a bit messy but I cobbled it together from different sources, and it works!)
library(stringr)     # for str_sub()
library(data.table)  # for data.table() and .GRP

### Read in files ###
filenames = dir(pattern="*.csv")
### Loop through all files, add various columns, then save ###
for (i in 1:length(filenames)) {
tmp <- read.csv(filenames[i], stringsAsFactors = FALSE)
### Add date part of filename to column labelled "date" ###
tmp$date <- str_sub(filenames[i], start = 13L, end = -5L)
### Create new column labelled "Series" ###
tmp$Series <- ifelse(grepl(": Series", tmp$Programme_Title), "yes", "no")
### Create "rank" for Programme_Category ###
tmp$rank <- sequence(rle(as.character(tmp$Programme_Category))$lengths)
### Create new column called "row" to assign numerical label to each group ###
DT = data.table(tmp)
tmp <- DT[, row := .GRP, by=.(Programme_Category)][]
### Identify box sets and create new column with "yes" / "no" ###
tmp$Box_Set <- ifelse(grepl("Box Set", tmp$Programme_Synopsis), "yes", "no")
### Remove the data.table which we no longer need ###
rm (DT)
### Write out the new file###
write.csv(tmp, filenames[[i]])
}
I don't have your exact data structure, but I created an example for you that should work:
library(tidyr)
movieName <- c("This is a test", "This is another test: Series 1-5", "This is yet another test")
df <- data.frame(movieName)
df
movieName
1 This is a test
2 This is another test: Series 1-5
3 This is yet another test
df <- df %>% separate(movieName, c("Title", "Series"), sep= ": Series")
for (row in 1:nrow(df)) {
df$Series[row] <- ifelse(is.na(df$Series[row]), "", paste("Series", df$Series[row], sep = ""))
}
df
Title Series
1 This is a test
2 This is another test Series 1-5
3 This is yet another test
I tried to capture all the examples you might encounter, but you can easily add things to capture variants not covered in the examples I provided.
Edit: I added a test case that did not include ":" or "Series". It will just produce an NA for the Series details.
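As a side note, the row-by-row loop above isn't strictly needed: ifelse() is already vectorised, so the cleanup can be a one-liner. A quick sketch on the same example data:

```r
library(tidyr)

df <- data.frame(movieName = c("This is a test",
                               "This is another test: Series 1-5",
                               "This is yet another test"),
                 stringsAsFactors = FALSE)

# Split on ": Series"; rows without it get NA in the Series column
df <- separate(df, movieName, c("Title", "Series"), sep = ": Series")

# ifelse() works on whole vectors, so no explicit loop over rows is needed
df$Series <- ifelse(is.na(df$Series), "", paste0("Series", df$Series))
```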
## load library: the main ones used are stringr, dplyr, tidyr, and tibble from the tidyverse, but I would recommend just installing the tidyverse
library(tidyverse)
## example of your data, hard to know all the unique types of data, but this will get you in the right direction
data <- tibble(title = c("X:Series 1-6",
                         "Y: Series 1-2",
                         "Z : Series 1-10",
                         "The Z and Z: 1-3",
                         "XX Series 1-3",
                         "AA AA"))
## Example of the data we want to format, see the different cases covered
print(data)
title
<chr>
1 X:Series 1-6
2 Y: Series 1-2
3 Z : Series 1-10
4 The Z and Z: 1-3
5 XX Series 1-3
6 AA AA
## These %>% are called pipes, and used to feed data through a pipeline, very handy and useful.
data_formatted <- data %>%
## Need to fix cases where you have Series but no :, or vice versa; this keeps everything consistent.
## Sounds like you will always have either :, Series, or :Series. If this is different you can easily
## change/update this to capture other cases
mutate(title = case_when(
str_detect(title,'Series') & !(str_detect(title,':')) ~ str_replace(title,'Series',':Series'),
!(str_detect(title,'Series')) & (str_detect(title,':')) ~ str_replace(title,':',':Series'),
TRUE ~ title)) %>%
## first separate the columns based on :
separate(col = title,into = c("Programme_Title","Series_Details"), sep = ':') %>%
##This just removes all white space at the ends to clean it up
mutate(Programme_Title = str_trim(Programme_Title),
Series_Details = str_trim(Series_Details))
## Output of the data to see how it was formatted
print(data_formatted)
Programme_Title Series_Details
<chr> <chr>
1 X Series 1-6
2 Y Series 1-2
3 Z Series 1-10
4 The Z and Z Series 1-3
5 XX Series 1-3
6 AA AA NA
I would like an output that shows the column names that have rows containing a string value. Assume the following...
Animals Sex
I like Dogs Male
I like Cats Male
I like Dogs Female
I like Dogs Female
Data Missing Male
Data Missing Male
I found an SO thread here, where David Arenburg provided an answer which works very well, but I was wondering if it is possible to get an output that doesn't show all the rows. So if I want to find the string "Data Missing", the output I would like to see is...
Animals
Data Missing
or
Animal
TRUE
instead of
Animals Sex
Data Missing Male
Data Missing Male
I have also found that using filters such as df$columnName works, but I have a big file with a large number of columns, so typing out column names would be tedious. Assume the string "Data Missing" is also in other columns and there could be different types of strings. That is why I like David Arenburg's answer; bear in mind I don't have just the two columns shown in the sample above.
Cheers
One thing you could do is grep for "Data Missing" in every column, then check which columns had any matches:
x <- apply(data, 2, grep, pattern = "Data Missing")
lengths(x) > 0
This returns a named logical vector, TRUE for every column that contains the string:
## Animals     Sex
##    TRUE   FALSE
That's essentially the result you're after. It's also good because it checks all columns, which you mentioned was something you wanted.
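A quick self-contained check against the sample data from the question (a sketch; lengths() counts the matches per column, so comparing against zero flags every column containing the string):

```r
data <- data.frame(
  Animals = c("I like Dogs", "I like Cats", "I like Dogs",
              "I like Dogs", "Data Missing", "Data Missing"),
  Sex = c("Male", "Male", "Female", "Female", "Male", "Male"),
  stringsAsFactors = FALSE
)

# grep() returns the matching row indices for each column
x <- apply(data, 2, grep, pattern = "Data Missing")

# TRUE for every column with at least one match
has_match <- lengths(x) > 0
names(has_match)[has_match]
```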
If we want only the first row where it matches, use match
data[match("Data Missing", data$Animals), "Animals", drop = FALSE]
# Animals
#5 Data Missing
My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets which have mentions starting with '#', I need to extract all of them and save each mention in that particular tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to form a single string separated by spaces, as mentioned earlier?
Thanks in advance.
I trust it would be best if you used an "AsIs" list column (created with I()) in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy sub setting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a single collapsed string per tweet, you may try using sapply for a base R option (use collapse = " " if you want the space-separated form from your question):
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice.
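A quick sketch of how that pattern behaves, using base R's gregexec() (available from R 4.1) to pull out the second capture group; the sample tweets here are made up for illustration:

```r
tweets <- c("email me at foo#bar.com",            # '#' inside an address
            "#DineshKarthik turns a year older",  # mention at tweet start
            "great knock by #KKRiders in the #IPL final")

pattern <- "(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b"

# gregexec() reports every match together with its capture groups;
# row 3 of each result matrix is the captured username
m <- regmatches(tweets, gregexec(pattern, tweets, perl = TRUE))
mentions <- lapply(m, function(g) if (is.matrix(g)) g[3, ] else character(0))
```

The address in the first string is skipped because the character before the "#" is alphanumeric, while the other two strings yield their usernames.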
Thanks to lawyeR for recommending the tidytext package. Here is some code based on that package that seems to work pretty well on my sample data. It doesn't work quite so well, though, when the value of the text column is blank. (There are times when this will happen and it will make sense to keep the blank rather than filtering it.) I've set the first observation for TVAR to a blank to illustrate. The code drops this observation. How can I get R to keep the observation and to set the frequencies for each word to zero? I tried some ifelse statements, using and not using the pipe, but it's not working so well. The trouble seems to center on the unnest_tokens function from the tidytext package.
sampletxt$TVAR[1] <- ""
chunk_words <- sampletxt %>%
group_by(PTNO, DATE, TYPE) %>%
unnest_tokens(word, TVAR, to_lower = FALSE) %>%
count(word) %>%
spread(word, n, 0)
I have an R data frame. I want to use it to create a document term matrix. Presumably I would want to use the tm package to do that but there might be other ways. I then want to convert that matrix back to a data frame. I want the final data frame to contain identifying variables from the original data frame.
Question is, how do I do that? I found some answers to a similar question, but that was for a data frame with text and a single ID variable. My data are likely to have about half a dozen variables that identify a given text record. So far, attempts to scale up the solution for a single ID variable haven't proven all that successful.
Below are some sample data. I created these for another task which I did manage to solve.
How can I get a version of this data frame that has an additional frequency column for each word in the text entries and that retains variables like PTNO, DATE, and TYPE?
sampletxt <-
structure(
list(
PTNO = c(1, 2, 2, 3, 3),
DATE = structure(c(16801, 16436, 16436, 16832, 16845), class = "Date"),
TYPE = c(
"Progress note",
"Progress note",
"CAT scan",
"Progress note",
"Progress note"
),
TVAR = c(
"This sentence contains the word metastatic and the word Breast plus the phrase denies any symptoms referable to.",
"This sentence contains tnm code T-1, N-O, M-0. This sentence contains contains tnm code T-1, N-O, M-1. This sentence contains tnm code T1N0M0. This sentence contains contains tnm code T1NOM1. This sentence is a sentence!?!?",
"This sentence contains Dr. Seuss and no target words. This sentence contains Ms. Mary J. blige and no target words.",
"This sentence contains the term stageIV and the word Breast. This sentence contains no target words.",
"This sentence contains the word breast and the term metastatic. This sentence contains the word breast and the term stage IV."
)), .Names = c("PTNO", "DATE", "TYPE", "TVAR"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
The quanteda package is faster and more straightforward than tm, and works nicely with tidytext as well. Here's how to do it:
These operations create a corpus from your object, create a document-feature matrix, and then return a data.frame that combines the variables with the feature counts. (Additional options are available when creating the dfm, see ?dfm).
library("quanteda")
samplecorp <- corpus(sampletxt, text_field = "TVAR")
sampledfm <- dfm(samplecorp)
result <- cbind(docvars(sampledfm), as.data.frame(sampledfm))
You can then group by the variables to get your result. (Here I am showing just the first 6 columns.)
dplyr::group_by(result[, 1:6], PTNO, DATE, TYPE)
# # A tibble: 5 x 6
# # Groups: PTNO, DATE, TYPE [5]
# PTNO DATE TYPE this sentence contains
# * <dbl> <date> <chr> <dbl> <dbl> <dbl>
# 1 1 2016-01-01 Progress note 1 1 1
# 2 2 2015-01-01 Progress note 5 6 6
# 3 2 2015-01-01 CAT scan 2 2 2
# 4 3 2016-02-01 Progress note 2 2 2
# 5 3 2016-02-14 Progress note 2 2 2
packageVersion("quanteda")
# [1] ‘0.99.6’
The ngram_tokenize() function from the "SentimentAnalysis" package is the easiest way to do this, especially if you are trying to convert a column of a dataframe to a DTM (though it also works on a txt file). Note that VCorpus() and TermDocumentMatrix() come from the tm package, so load that as well:
library("tm")
library("SentimentAnalysis")

corpus <- VCorpus(VectorSource(df$column_or_txt))
tdm <- TermDocumentMatrix(corpus,
                          control = list(wordLengths = c(1, Inf),
                                         tokenize = function(x)
                                           ngram_tokenize(x, char = FALSE,
                                                          ngmin = 1, ngmax = 2)))
It is simple and works like a charm each time, with both Chinese and English, for those doing text mining in Chinese.
If I have a dataframe with values such as:
df<- c("One", "Two Three", "Four", "Five")
df<-data.frame(df)
df
"One"
"Two Three"
"Four"
"Five"
And I have another dataframe such as:
df2<-c("the park was number one", "I think the park was number two three", "Nah one and two is ok", "tell me about four and five")
df2<-data.frame(df2)
df2
the park was number one
I think the park was number two three
Nah one and two is ok
tell me about four and five
If one of the values found in df is found in any of the strings of df2[,1], how do I replace it with a word like "it"?
I want to replace my final df2 with this:
df3
the park was number it
I think the park was number it
Nah it and two is ok
tell me about it and it
I know it probably has to do with something like:
gsub(df,"it", df2)
But I don't think that is right.
Thanks!
You could do something like
sapply(df$df,function(w) df2$df2 <<- gsub(paste0(w,"|",tolower(w)),"it",df2$df2))
df2
df2
1 the park was number it
2 I think the park was number it
3 Nah it and two is ok
4 tell me about it and it
The <<- operator makes sure that the version of df2 in the global environment is changed by the function. The paste0(w,"|",tolower(w)) allows for differences of capitalisation, as in your example.
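An alternative sketch that avoids <<- by doing all the replacements in one gsub() call: the lookup values are joined into a single alternation pattern and ignore.case handles the capitalisation differences (add \\b word boundaries around the pattern if partial-word matches ever become a problem):

```r
df <- data.frame(df = c("One", "Two Three", "Four", "Five"),
                 stringsAsFactors = FALSE)
df2 <- data.frame(df2 = c("the park was number one",
                          "I think the park was number two three",
                          "Nah one and two is ok",
                          "tell me about four and five"),
                  stringsAsFactors = FALSE)

# Build one pattern, "One|Two Three|Four|Five", and replace all
# matches in a single case-insensitive pass
pattern <- paste(df$df, collapse = "|")
df2$df2 <- gsub(pattern, "it", df2$df2, ignore.case = TRUE)
```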
Note that you should add stringsAsFactors=FALSE to your dataframe definitions in the question.