This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 11 months ago.
Need some help with my R code please folks!
My table has two columns
a list of codes, with numerous codes in the same "cell" separated by commas
a description that applies to all of the codes in the same row
I want to split the values in the first column so that there is only 1 code per row and the corresponding description is repeated for every relevant code.
I really don't know where to start sorry, I actually don't really know what to search for!
You can use separate_rows from tidyr:
library(tidyr)
separate_rows(df, numbers, convert = TRUE)
Or in base R, we can use strsplit:
s <- strsplit(df$numbers, split = ",")
output <- data.frame(numbers = unlist(s), descriptions = rep(df$descriptions, sapply(s, length)))
Output
numbers descriptions
<int> <chr>
1 This is a description for ID1
2 This is a description for ID2
3 This is a description for ID2
4 This is a description for ID2
5 This is a description for ID3
6 This is a description for ID3
Data
library(tibble)
df <- tibble(
numbers = c("1", "2,3,4", "5,6"),
descriptions = c("This is a description for ID1", "This is a description for ID2", "This is a description for ID3")
)
# numbers descriptions
# <chr> <chr>
# 1 This is a description for ID1
# 2,3,4 This is a description for ID2
# 5,6 This is a description for ID3
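For readers who want the base R route as a reusable piece, here is a minimal sketch. The helper name `split_rows()` and the shortened descriptions are made up for illustration and are not part of the answer above. It also trims stray spaces around the codes, which the original data does not contain but real data often does.

```r
# Hypothetical helper: one row per comma-separated code, other columns repeated
split_rows <- function(data, col, sep = ",") {
  pieces <- strsplit(as.character(data[[col]]), split = sep, fixed = TRUE)
  # Repeat each original row once per piece it produced
  out <- data[rep(seq_len(nrow(data)), lengths(pieces)), , drop = FALSE]
  # Overwrite the split column with the individual, trimmed codes
  out[[col]] <- trimws(unlist(pieces))
  rownames(out) <- NULL
  out
}

# Illustrative data (note the stray space in "2, 3,4")
codes_df <- data.frame(
  numbers = c("1", "2, 3,4", "5,6"),
  descriptions = c("ID1", "ID2", "ID3"),
  stringsAsFactors = FALSE
)
result <- split_rows(codes_df, "numbers")
```

The codes stay as character strings here; wrap the column in `as.integer()` afterwards if numeric codes are needed.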
Related
Context
I have two vectors. fruits_Jack_eat is a vector of length=1 that stores the fruits Jack ate. fruits_list is a vector of length=3 that stores different types of fruits.
Question
I want to find out whether Jack ate one or more of the fruits in fruits_list. But the situation is not that simple. fruits_list[1] is 'Navel orange', and one of the fruits Jack ate is XXXorange. Although XXXorange is not exactly the same as Navel orange, I still want to count it as a match.
Reproducible code
fruits_Jack_eat = 'XXXorange,PPPapple,QQQbanana'
fruits_list = c('Navel orange', 'Super big apple', 'Very yellow banana')
Expected output
When I enter fruits_Jack_eat and fruits_list, the result should return a dataframe. The first column is a logical vector that indicates whether or not the match is on. The second column is a character vector indicating the characters in fruits_Jack_eat that are similar to fruits_list. Maybe like this:
df_output = data.frame(matched = TRUE, matched_char = c('orange,apple,banana'))
> df_output
matched matched_char
1 TRUE orange,apple,banana
What I've done
how to get percentage character match between two strings using sqldf in R
Identify the percentage of string Match in R
Maybe this helps
library(stringr)
library(tibble)
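# build the pattern "orange|apple|banana" by dropping the capital-letter prefixes
matched_char <- str_extract(fruits_list,
str_replace_all(str_remove_all(fruits_Jack_eat, "[A-Z]+"), ",", "|"))
# str_extract returns NA where nothing matched, so test for non-NA values
tibble(matched = any(!is.na(matched_char)),
matched_char = str_c(na.omit(matched_char), collapse = ","))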
# A tibble: 1 × 2
matched matched_char
<lgl> <chr>
1 TRUE orange,apple,banana
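The same idea can be sketched in base R without stringr. This is only an illustration and it assumes, as the example data suggests, that the meaningful part of each eaten fruit is whatever remains after its leading run of capital letters is removed.

```r
fruits_Jack_eat <- "XXXorange,PPPapple,QQQbanana"
fruits_list <- c("Navel orange", "Super big apple", "Very yellow banana")

# Assumption: drop the leading capital letters to get the comparable stem
eaten <- strsplit(fruits_Jack_eat, ",", fixed = TRUE)[[1]]
stems <- sub("^[A-Z]+", "", eaten)

# A stem counts as a match if it occurs inside any entry of fruits_list
hit <- vapply(stems, function(s) any(grepl(s, fruits_list, fixed = TRUE)),
              logical(1))

df_output <- data.frame(
  matched = any(hit),
  matched_char = paste(stems[hit], collapse = ","),
  stringsAsFactors = FALSE
)
```

Because only the stems that actually matched are pasted together, an unmatched fruit is simply left out of matched_char.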
I have a data frame df which contains a column named strings. The values in this column are some sentences.
For example:
id strings
1 "I want to go to school, how about you?"
2 "I like you."
3 "I like you so much"
4 "I like you very much"
5 "I don't like you"
Now, I have a list of stop words:
["I", "don't", "you"]
How can I make another data frame that stores the total number of occurrences of each unique word (except the stop words) in that column of the previous data frame?
keyword frequency
want 1
to 2
go 1
school 1
how 1
about 1
like 4
so 1
very 1
much 2
My idea is that:
combine the strings in the column to a big string.
Make a list storing the unique words in the big string.
Make the df whose one column is the unique words.
Compute the frequency.
But this seems really inefficient and I don't know how to really code this.
At first, you can create a vector of all words through str_split and then create a frequency table of the words.
library(stringr)
stop_words <- c("I", "don't", "you")
# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))
# create a frequency table
word_list <- as.data.frame(table(all_words))
# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
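Splitting on spaces alone leaves punctuation attached, so tokens such as "you?" or "school," would not be recognised as "you" or "school". A base R sketch that strips punctuation from the edges of each token first is shown below. The column names keyword and frequency are taken from the question; everything else is an illustrative choice.

```r
df <- data.frame(
  id = 1:5,
  strings = c("I want to go to school, how about you?", "I like you.",
              "I like you so much", "I like you very much", "I don't like you"),
  stringsAsFactors = FALSE
)
stop_words <- c("I", "don't", "you")

# Split on whitespace, then strip punctuation at the edges of each token
# so that "you?" and "school," become "you" and "school"
tokens <- unlist(strsplit(df$strings, "\\s+"))
tokens <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)

# Drop empty tokens and stop words, then count what is left
tokens <- tokens[nzchar(tokens) & !tokens %in% stop_words]
freq <- as.data.frame(table(keyword = tokens), responseName = "frequency",
                      stringsAsFactors = FALSE)
```

The apostrophe inside "don't" is kept because only leading and trailing punctuation is removed.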
One way is to use tidytext. Here is a book and the code:
library("tidytext")
library("tidyverse")
#> df <- data.frame( id = 1:6, strings = c("I want to go to school", "how about you?",
#> "I like you.", "I like you so much", "I like you very much", "I don't like you"))
df %>%
mutate(strings = as.character(strings)) %>%
unnest_tokens(word, strings) %>% # this tokenizes the strings and extracts the words
filter(!word %in% c("I", "i", "don't", "you")) %>%
count(word)
#> # A tibble: 11 x 2
#> word n
#> <chr> <int>
#> 1 about 1
#> 2 go 1
#> 3 how 1
#> 4 like 4
#> 5 much 2
EDIT
All the tokens are transformed to lower case, so you either include i in the stop words or add the argument to_lower = FALSE to unnest_tokens.
Assuming you have a mystring object and a vector of stopWords, you can do it like this:
# split text into a vector of words
wordvector = strsplit(mystring, " ")[[1]]
# remove stop words from the vector
wordvector = wordvector[!wordvector %in% stopWords]
At this point you can turn a frequency table() into a data frame object:
frequency_df = data.frame(table(wordvector))
Let me know if this can help you.
I'm trying to do something that I thought would be pretty simple that has me stumped.
Say I have the following data frame:
id <- c("bob_geldof", "billy_bragg", "melvin_smith")
code <- c("blah", "di", "blink")
df <- as.data.frame(cbind(id,code))
> df
id code
1 bob_geldof blah
2 billy_bragg di
3 melvin_smith blink
And another like this:
ID1 <- c("bob_geldof", "melvin_smith")
ID2 <- c("the_builder", "kelvin")
alternates <- as.data.frame(cbind(ID1, ID2))
> alternates
ID1 ID2
1 bob_geldof the_builder
2 melvin_smith kelvin
If the character string in df$id matches alternates$ID1, I'd like to replace it with alternates$ID2. If it doesn't match I'd like to just leave it as it is.
The final df should look like
> df
id code
1 bob_the_builder blah
2 billy_bragg di
3 melvin_kelvin blink
This is obviously a silly example and my real dataset requires lots of replacements.
I've included the 'code' column to demonstrate that I'm working with a data frame and not just a character vector.
I’ve been using gsub to replace them individually but it's time consuming and the list keeps changing.
I looked into str_replace but it seems you can only specify one replacement value.
Any help would be much appreciated.
Cheers!
EDIT: Not all ids contain underscores, and I need to retain the bit that does match. E.g. bob_geldof becomes bob_the_builder.
EDIT 2(!): Thanks for your suggestions everyone. I've got round the problem by merging the data frames (so that there are NAs where there's no change to be made), and creating new IDs using an ifelse statement. It's a bit clunky but it works!
When creating the dataframes use stringsAsFactors = FALSE so as to not deal with factors. Then, if the rows are ordered, just apply:
df <- as.data.frame(cbind(id,code),stringsAsFactors = FALSE)
alternates <- as.data.frame(cbind(ID1, ID2),stringsAsFactors = FALSE)
df$id[c(TRUE,FALSE)]=paste(gsub("(.*)(_.*)","\\1",df$id[c(TRUE,FALSE)]),
alternates$ID2,sep="_")
> df
id code
1 bob_the_builder blah
2 billy_bragg di
3 melvin_kelvin blink
If they are unordered, we can use dplyr:
library(dplyr)
df%>%rowwise()%>%mutate(id=if_else(length(which(alternates$ID1==id))>0,
paste(gsub("(.*)(_.*)","\\1",id),
alternates$ID2[which(alternates$ID1==id)],sep="_"),
id))
# A tibble: 3 x 2
id code
<chr> <chr>
1 bob_the_builder blah
2 billy_bragg di
3 melvin_kelvin blink
We are using the same logic as before. Here we check df row by row. If its id matches any of alternates$ID1 (checked by length()), we update it.
The following solution uses base R and is streamlined a bit. Step 1: merge the main "df" and the "alternates" df together, using a left join. Step 2: check where the ID2 value is not missing (NA) and then assign those values to "id". This keeps your original id where there is no match and replaces it with ID2 where a matching ID is available.
The solution:
combined <- merge(x=df,y=alternates,by.x="id",by.y="ID1",all.x=T)
combined$id[!is.na(combined$ID2)] <- combined$ID2[!is.na(combined$ID2)]
With full original data frame definitions (using stringsAsFactors=F):
id <- c("bob_geldof", "billy_bragg", "melvin_smith")
code <- c("blah", "di", "blink")
df <- as.data.frame(cbind(id,code),stringsAsFactors = F)
ID1 <- c("bob_geldof", "melvin_smith")
ID2 <- c("the_builder", "kelvin")
alternates <- as.data.frame(cbind(ID1, ID2),stringsAsFactors = F)
combined <- merge(x=df,y=alternates,by.x="id",by.y="ID1",all.x=T)
combined$id[!is.na(combined$ID2)] <- combined$ID2[!is.na(combined$ID2)]
Results: (the full merge below, you can also do combined[,c("id","code")] for the streamlined results). Here, the non-matching "billy_bragg" is kept; and the others are replaced with the matched ID
> combined
id code ID2
1 billy_bragg di <NA>
2 the_builder blah the_builder
3 kelvin blink kelvin
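The merge above replaces the whole id and reorders the rows. If the row order should stay as it is and the first part of the id should be retained (bob_geldof becoming bob_the_builder), a lookup with match() is one possible sketch. It assumes that the part to keep is everything before the first underscore, which holds for the example ids but may not hold for real data.

```r
df <- data.frame(
  id = c("bob_geldof", "billy_bragg", "melvin_smith"),
  code = c("blah", "di", "blink"),
  stringsAsFactors = FALSE
)
alternates <- data.frame(
  ID1 = c("bob_geldof", "melvin_smith"),
  ID2 = c("the_builder", "kelvin"),
  stringsAsFactors = FALSE
)

# Position of each id in the lookup table (NA when there is no match)
pos <- match(df$id, alternates$ID1)
found <- !is.na(pos)

# Assumption: the part to keep is everything before the first underscore
prefix <- sub("_.*$", "", df$id[found])
df$id[found] <- paste(prefix, alternates$ID2[pos[found]], sep = "_")
```

Ids without a match are never touched, so billy_bragg stays as it is and the code column keeps its original alignment.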
This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 5 years ago.
I am trying to import a CSV file with R, and am having trouble.
The CSV entries look like
"Name 1" , "Name 2, Name 3, Name 4"
If I import straight to R, the data is read in like
Name 1 Name 2,Name 3,Name 4
but I would like it to look like
Name 1 Name 2
Name 1 Name 3
Name 1 Name 4
Is there a way to break up the second column during import so I can have two columns with only one name in each?
Thanks
Use strsplit to break the column up into an actual list class, then use tidyr::unnest to get your result:
d = data.frame(a = "Name 1" , b = "Name 2, Name 3, Name 4", stringsAsFactors = FALSE)
d$b = strsplit(d$b, ",")
library(tidyr)
unnest(d, cols = b)
# a b
# 1 Name 1 Name 2
# 2 Name 1 Name 3
# 3 Name 1 Name 4
Depending on how clean things are, you might need to trim whitespace of the new column, see ?trimws for help there.
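As a rough illustration of the trimming step, the whole reshaping can also be done in base R, with trimws() applied to the split pieces. This is a sketch on the one-row example, not a replacement for the tidyr answer.

```r
d <- data.frame(a = "Name 1", b = "Name 2, Name 3, Name 4",
                stringsAsFactors = FALSE)

# Split the second column on commas; each row yields a vector of pieces
pieces <- strsplit(d$b, ",", fixed = TRUE)

long <- data.frame(
  a = rep(d$a, lengths(pieces)),   # repeat column a once per piece
  b = trimws(unlist(pieces)),      # drop the space left after each comma
  stringsAsFactors = FALSE
)
```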
This question already has answers here:
How to strsplit different number of strings in certain column by do function
(1 answer)
tidyr separate only first n instances [duplicate]
(2 answers)
Closed 5 years ago.
I have a column with full names that should be separated into three columns on spaces. The problem is that some full names contain more than three words, and the fourth and later words shouldn't be dropped but added to the third part.
For instance, "Abdullaeva Mehseti Nuraddin Kyzy" should be separated as:
| Abdullaeva | Mehseti | Nuraddin Kyzy |
I tried to split the column with the tidyr package as follows, but this way the third part contains only the one word after the second space.
df<-df %>%
separate('FULL_NAME', c("1st_part","2d_part","3d_part"), sep=" ")
Any help will be appreciated.
Use the extra argument:
# dummy data
df1 <- data.frame(x = c(
"some name1",
"justOneName",
"some three name",
"Abdullaeva Mehseti Nuraddin Kyzy"))
library(tidyr)
library(dplyr)
df1 %>%
separate(x, c("a1", "a2", "a3"), extra = "merge")
# a1 a2 a3
# 1 some name1 <NA>
# 2 justOneName <NA> <NA>
# 3 some three name
# 4 Abdullaeva Mehseti Nuraddin Kyzy
# Warning message:
# Too few values at 2 locations: 1, 2
From manual:
extra
If sep is a character vector, this controls what happens when
there are too many pieces. There are three valid options:
- "warn" (the default): emit a warning and drop extra values.
- "drop": drop any extra values without a warning.
- "merge": only splits at most length(into) times
Since for this dataset you said that you only have name1, name2, last name, then you can also use str_split_fixed from stringr, i.e.
setNames(data.frame(stringr::str_split_fixed(df1$x, ' ', 3)), paste0('a', 1:3))
Which gives,
a1 a2 a3
1 some name1
2 justOneName
3 some three name
4 Abdullaeva Mehseti Nuraddin Kyzy
Note that you can fill the empty slots with NA as per usual
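To illustrate the NA filling without relying on stringr, here is a base R sketch that splits on single spaces, keeps the first two words and merges the rest. The helper name `to_three()` is hypothetical; missing parts come out as NA directly instead of as empty strings.

```r
x <- c("some name1", "justOneName", "some three name",
       "Abdullaeva Mehseti Nuraddin Kyzy")

# Split on single spaces; keep the first two words and merge the rest
parts <- strsplit(x, " ", fixed = TRUE)
to_three <- function(p) {
  rest <- if (length(p) > 2) paste(p[-(1:2)], collapse = " ") else NA_character_
  c(p[1], p[2], rest)   # indexing past the end of p yields NA
}

# vapply returns a 3 x n matrix, so transpose to get one row per name
m <- t(vapply(parts, to_three, character(3)))
names_df <- setNames(as.data.frame(m, stringsAsFactors = FALSE),
                     paste0("a", 1:3))
```

With the str_split_fixed result, the equivalent step would be to replace the empty strings with NA after the split.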