Using gsub from all values in dataframe with strings - r

If I have a dataframe with values such as:
df<- c("One", "Two Three", "Four", "Five")
df<-data.frame(df)
df
"One"
"Two Three"
"Four"
"Five"
And I have another dataframe such as:
df2<-c("the park was number one", "I think the park was number two three", "Nah one and two is ok", "tell me about four and five")
df2<-data.frame(df2)
df2
the park was number one
I think the park was number two three
Nah one and two is ok
tell me about four and five
If one of the values in df is found in any of the strings of df2[,1], how do I replace it with a word like "it"?
I want my final df2 to look like this:
df3
the park was number it
I think the park was number it
Nah it and two is ok
tell me about it and it
I know it probably has to do with something like:
gsub(df,"it", df2)
But I don't think that is right.
Thanks!

You could do something like
sapply(df$df,function(w) df2$df2 <<- gsub(paste0(w,"|",tolower(w)),"it",df2$df2))
df2
df2
1 the park was number it
2 I think the park was number it
3 Nah it and two is ok
4 tell me about it and it
The <<- operator makes sure that the version of df2 in the global environment is changed by the function. The paste0(w,"|",tolower(w)) allows for differences of capitalisation, as in your example.
Note that you should add stringsAsFactors = FALSE to your data frame definitions in the question.
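If you would rather avoid the side effect of <<-, one possible alternative is to build a single case-insensitive pattern from all the keywords and call gsub() once. This is a minimal sketch, assuming stringsAsFactors = FALSE so that both columns are plain character vectors:
pattern <- paste(df$df, collapse = "|")   # "One|Two Three|Four|Five"
df2$df2 <- gsub(pattern, "it", df2$df2, ignore.case = TRUE)
df2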

Related

How to remove the first words of specific rows that appear in another column?

Is there a way to remove the first n words of the column "content" when there are words present in the "keyword" column?
I am working with a data frame similar to this:
keyword <- c("Mr. Jones", "My uncle Sam", "Tom", "", "The librarian")
content <- c("Mr. Jones is drinking coffee", "My uncle Sam is sitting in the kitchen with my uncle Richard", "Tom is playing with Tom's family's dog", "Cassandra is jogging for her first time", "The librarian is jogging with her")
data <- data.frame(keyword, content)
data
In some cases, the first few words of the "keyword" string are contained in the "content" string.
In others, the "keyword" string remains empty and only "content" is filled.
What I want to achieve here is to remove the first appearance of the word combination in "keyword" that appears in the same row in "content".
Unfortunately, I am only able to create code that deletes all the matching words. But as you can see, some words (like "uncle" or "Tom") appear more than once in a cell.
I'd like to only delete the first appearance and keep all that come after in the same cell.
My next-best solution was to use the following code:
data$content <- mapply(function(x, y) gsub(x, "", y), gsub(" ", "|", data$keyword), data$content)
This code was designed to remove all of the words from "content" that are present in "keyword" of the same row. (It was initially posted here).
Another option that I tried was to design a function for this:
I first created a new variable which counted the number of words that are included in the "keyword" string of the corresponding line:
numw <- lengths(gregexpr("\\S+", data$keyword))
data <- cbind(data, numw)
Second, I tried to formulate a function to remove the first n words of content[i], with n = numw[i]
shorten <- function(v, z){
v <- gsub(".*^\\w+", z, v)
}
shorten(data$content, data$numw)
Unfortunately, I am not able to make the function work and the following error message will be generated:
Error in gsub(".*^\w+", z, v) : invalid 'replacement' argument
So, I'd be incredibly grateful if someone could help me formulate a function that deals with the issue more appropriately.
Here is a solution based on str_remove. Because str_remove gives warnings if the pattern is '', the first mutate replaces '' with NA. Then, where keyword is not NA, it is stripped from content; otherwise content is taken as is.
library(tidyverse)
data |>
  mutate(keyword = na_if(keyword, '')) |>
  mutate(content = case_when(
    !is.na(keyword) ~ str_remove(content, keyword),
    is.na(keyword) ~ content))
#>         keyword                                          content
#> 1     Mr. Jones                               is drinking coffee
#> 2  My uncle Sam  is sitting in the kitchen with my uncle Richard
#> 3           Tom               is playing with Tom's family's dog
#> 4          <NA>          Cassandra is jogging for her first time
#> 5 The librarian                              is jogging with her
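One caveat worth noting: str_remove() interprets keyword as a regular expression, so the "." in "Mr. Jones" acts as a wildcard. A minimal sketch that matches the keywords literally (via stringr::fixed()) and simply skips the empty rows, assuming the same data frame as above:
library(stringr)
has_kw <- data$keyword != ""
data$content[has_kw] <- str_remove(data$content[has_kw], fixed(data$keyword[has_kw]))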

Matching phrases (word-search), noting findings in different columns

I'm working with feedback from teachers: multiple selections from a Google Form, which all end up in a single column.
There was also the option to type in responses... so some words/responses are only present once. There are about 50 responses, with between 1 and 7 words(ish) per entry. For the MRE I've included 5 as a vector.
eg_responses <- c("Difficult, Challenging, Fair", "Necessary, Useful", "Cruel", "Easy, Challenging", "School's shouldn't have to do that")
words_example <- c("Difficult","Easy","Fair","Challenging","Necessary","Useful")
While I can get a summary of the results, I need to create isolated responses so that I can select and compare with other variables later on...
What I would like to be able to do:
have columns representative of each word, and an extra column with other comments.
Look through each row for each word that was in the drop-down:
while there are rows left to check, look through for word x;
if the word is present, mark it in the x column;
if the word is not present, mark NA in the x column;
move to the next column and the next word when the rows run out;
any other words/phrases left at the end go in the last column;
if no other words/phrases are left, mark the end column with NA.
I'm sure I've been overcomplicating things. But I'm an absolute beginner. I'm working in R.
I have tried separating the words, turning things into factors, then turning them back to character... Being able to search for exact complete phrases, or to search word by word, would be really helpful.
There are 6 other sections with the same problem, and other sections of other variables.
I would like to have something like this at the end...
(apologies, I'm just going to make it up; bit of a chicken-and-egg situation)
respcolnames <- c("Challenging", "Fair", "Unfair", "Other")
row1 <- c("NA", "Fair", "NA", "NA")
row2 <- c("Challenging", "NA", "Unfair", "NA")
row3 <- c("NA", "NA", "NA", "School's shouldn't have to do that")
teacher_responses <- rbind(row1, row2, row3)   # rbind so each response is a row
colnames(teacher_responses) <- respcolnames
so that the corresponding response is in the correct column. I can then strip away the "Other" responses for a more social science analysis, and use the drop down response selections for some graphics.
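A minimal sketch of the steps described above, using the example vectors from the question; the comma splitting and the handling of the "Other" column are assumptions, not taken from an original answer:
parts <- strsplit(eg_responses, ",\\s*")   # split each response into its words/phrases
# one column per drop-down word: the word itself if present, NA otherwise
out <- as.data.frame(sapply(words_example, function(w)
  ifelse(sapply(parts, function(p) w %in% p), w, NA)))
# anything not in the word list ends up in "Other" (NA if nothing is left)
out$Other <- sapply(parts, function(p) {
  rest <- setdiff(p, words_example)
  if (length(rest) == 0) NA else paste(rest, collapse = ", ")
})
out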

R - Filter any rows and show all columns

I would like an output that shows the column names that have rows containing a string value. Assume the following...
Animals Sex
I like Dogs Male
I like Cats Male
I like Dogs Female
I like Dogs Female
Data Missing Male
Data Missing Male
I found an SO thread here; David Arenburg provided an answer which works very well, but I was wondering if it is possible to get an output that doesn't show all the rows. So if I want to find the string "Data Missing", the output I would like to see is...
Animals
Data Missing
or
Animal
TRUE
instead of
Animals Sex
Data Missing Male
Data Missing Male
I have also found that using filters such as df$columnName works, but I have a big file with a large number of columns, so typing column names would be tedious. Assume the string "Data Missing" could also appear in other columns, and that there could be different kinds of strings. That is why I like David Arenburg's answer; bear in mind my real data has more than the two columns in the sample above.
Cheers
One thing you could do is grep for "Data Missing" like this:
x <- apply(data, 2, grep, pattern = "Data Missing")
sapply(x, length) > 0  # TRUE for every column with at least one match
This will give you the:
Animal
TRUE
result you're after. It's also good because it checks all columns, which you mentioned was something you wanted.
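For the TRUE/FALSE style of output specifically, a more compact sketch (same sample data and search string) could be:
sapply(data, function(col) any(grepl("Data Missing", as.character(col))))
## Animals     Sex
##    TRUE   FALSE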
If we want only the first row where it matches, use match
data[match("Data Missing", data$Animals), "Animals", drop = FALSE]
# Animals
#5 Data Missing

Modify string names in a data frame based on a condition

I have a data frame with a variable called "Control_Category". The variable has six names in it, which for simplicity sake I am going to make generic:
df <- data.frame(Control_Category = c("Really Long Name One",
"Super Really Long Name Two",
"Another Really Flippin' Long Name Three",
",Seriously, It's a Fourth Long Name",
"Definitely a Fifth Long Name",
"Finally, This guy is done, number six"))
I'm using this to make a slight joke. So, while the names are long they are tidy in that the values for each (1-6) are consistent. In this specific character vector of the data.frame, there are hundreds and hundreds of entries that match any one of those six.
What I need to do is to replace the long names with a short name. Therefore, where any of the above names are identified, replace that name with a shorter version, like:
One
Two
Three
Four
Five
Six
I tried a function using 'case_when' and it failed miserably. Any help would be appreciated.
Additional Information Based on Questions From Community
The order of the items doesn't matter. There isn't a designation of 1 - 6. There just happen to be six and I made six stupid long strings. The strings themselves are long.
So, anywhere "Super Really Long Name Two" exists, that value needs to be updated to something like "TWO", or a "Short_Name" that approximates "TWO". In reality, the category is called "Audit, Testing and Examination Results". The short name would ideally just be "AUDIT".
You could just use gsub() once for each replacement:
df$Control_Category <- gsub('Really Long Name One', 'One', df$Control_Category)
You can repeat similar logic to handle the other five long/short name pairs.
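If you would rather not write six near-identical gsub() calls, one possible sketch is a named lookup vector with exact matching. The short names below are placeholders, and this assumes Control_Category is a character column (add stringsAsFactors = FALSE on older R versions):
lookup <- c("Really Long Name One"                    = "One",
            "Super Really Long Name Two"              = "Two",
            "Another Really Flippin' Long Name Three" = "Three",
            ",Seriously, It's a Fourth Long Name"     = "Four",
            "Definitely a Fifth Long Name"            = "Five",
            "Finally, This guy is done, number six"   = "Six")
df$Control_Category <- unname(lookup[df$Control_Category])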
Here's a larger data frame with long names:
set.seed(101)
long_names <- c("Really Long Name One",
"Super Really Long Name Two",
"Another Really Flippin' Long Name Three",
",Seriously, It's a Fourth Long Name",
"Definitely a Fifth Long Name",
"Finally, This guy is done, number six")
df <- data.frame(control_category=sample(long_names, 100, replace=TRUE))
head(df)
## control_category
## 1 Another Really Flippin' Long Name Three
## 2 Really Long Name One
## 3 Definitely a Fifth Long Name
## 4 ,Seriously, It's a Fourth Long Name
## 5 Super Really Long Name Two
## 6 Super Really Long Name Two
Using the unique function will give you the category names:
category <- unique(df$control_category)
print(category)
## [1] Another Really Flippin' Long Name Three
## [2] Really Long Name One
## [3] Definitely a Fifth Long Name
## [4] ,Seriously, It's a Fourth Long Name
## [5] Super Really Long Name Two
## [6] Finally, This guy is done, number six
## 6 Levels: ,Seriously, It's a Fourth Long Name ...
Notice that the levels are in alphabetical order (see levels(category)). The simplest fix is to change the order manually by looking at the current order: here, category[c(2, 5, 1, 4, 3, 6)] gives the right order. Finally,
df$control_category <- factor(
df$control_category,
levels=category[c(2, 5, 1, 4, 3, 6)],
labels=c("one", "two", "three", "four", "five", "six")
)
head(df)
## control_category
## 1 three
## 2 one
## 3 five
## 4 four
## 5 two
## 6 two

Transforming character strings in R

I have to merge two data frames in R. The two data frames share a common id variable, the name of the subject. However, the names in one data frame are partly capitalized, while in the other they are in lower case. Furthermore, the names appear in reverse order. Here is a sample from the data frames:
DataFrame1$Name:
"Van Brempt Kathleen"
"Gräßle Ingeborg"
"Gauzès Jean-Paul"
"Winkler Iuliu"
DataFrame2$Name:
"Kathleen VAN BREMPT"
"Ingeborg GRÄSSLE"
"Jean-Paul GAUZÈS"
"Iuliu WINKLER"
Is there a way in R to make these two variables usable as an identifier for merging the data frames?
Best, Thomas
You can use gsub to convert the names around:
> names
[1] "Kathleen VAN BREMPT" "jean-paul GAULTIER"
> gsub("([^\\s]*)\\s(.*)","\\2 \\1",names,perl=TRUE)
[1] "VAN BREMPT Kathleen" "GAULTIER jean-paul"
>
This works by matching first anything up to the first space and then anything after that, and switching them around. Then add tolower() or toupper() if you want, and use match() for joining your data frames.
Good luck matching Grassle with Graßle though. Lots of other things will probably bite you too, such as people with two first names separated by space, or someone listed with a title!
Barry
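A minimal sketch of the tolower() + match() join suggested above; note that pairs like Gräßle / GRÄSSLE will not match exactly, which is where the agrep-based answer below comes in:
key2 <- tolower(gsub("([^\\s]*)\\s(.*)", "\\2 \\1", DataFrame2$Name, perl = TRUE))
idx <- match(key2, tolower(DataFrame1$Name))
cbind(DataFrame1[idx, , drop = FALSE], DataFrame2)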
Here's a complete solution that combines the two partial methods offered so far (and overcomes the fears expressed by Spacedman about "matching Grassle with Graßle"):
DataFrame2$revname <- gsub("([^\\s]*)\\s(.*)", "\\2 \\1", DataFrame2$Name, perl=TRUE)
DataFrame2$agnum <- sapply(tolower(DataFrame2$revname), agrep, tolower(DataFrame1$Name))
DataFrame1$num <- 1:nrow(DataFrame1)
merge(DataFrame1, DataFrame2, by.x="num", by.y="agnum")
Output:
num Name.x Name.y revname
1 1 Van Brempt Kathleen Kathleen VAN BREMPT VAN BREMPT Kathleen
2 2 Gräßle Ingeborg Ingeborg GRÄSSLE GRÄSSLE Ingeborg
3 3 Gauzès Jean-Paul Jean-Paul GAUZÈS GAUZÈS Jean-Paul
4 4 Winkler Iuliu Iuliu WINKLER WINKLER Iuliu
The third step would not be necessary if DataFrame1 had rownames that were still sequentially numbered (as they would be by default). The merge statement would then be:
merge(DataFrame1, DataFrame2, by.x="row.names", by.y="agnum")
--
David.
Can you add an additional column/variable to each data frame which is a lowercase version of the original name:
DataFrame1$NameLower <- tolower(DataFrame1$Name)
DataFrame2$NameLower <- tolower(DataFrame2$Name)
Then perform a merge on this:
MergedDataFrame <- merge(DataFrame1, DataFrame2, by="NameLower")
In addition to the answer using gsub to rearrange the names, you might also want to look at the agrep function, which looks for approximate matches. You can use it with sapply to find the matching rows from one data frame to the other, e.g.:
> sapply( c('newyork', 'NEWJersey', 'Vormont'), agrep, x=state.name, ignore.case=TRUE )
  newyork NEWJersey   Vormont
       32        30        45
