I'm working with a very messy data set that has a column that needs to be split into several more columns based on a standard delimiter ",|".
This is what entries in said column look like:
Color:Red,|Texture:Rough,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes
The main problem I'm having is that not all descriptors that need to be split appear in the same order. Sometimes color is first, other times it appears last. Additionally, some metrics do not appear at all in certain rows; for example, "Video-Status" isn't in every row.
What would be the best way to go about creating 6 new columns from the data I've provided? Scratching my head here...
There is an obscure R function, read.dcf, that can deal with Name:Value pair data. Here's an example with multiple rows, where both the order and the completeness of the pairs vary:
x <- "Color:Red,|Texture:Rough,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes"
x2 <- "Texture:Rough,|Color:Red,|Shape:Circular,|ID:1323,|Location:Canada"
dat <- data.frame(col = c(x,x2), stringsAsFactors=FALSE)
dat
# col
#1 Color:Red,|Texture:Rough,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes
#2 Texture:Rough,|Color:Red,|Shape:Circular,|ID:1323,|Location:Canada
Then process after collapsing to one long piece of text with line breaks:
read.dcf(textConnection(paste(gsub(",[|]", "\n", dat$col), collapse="\n\n")))
# Color Texture Shape ID Location Video-Status
#[1,] "Red" "Rough" "Circular" "1323" "Canada" "Yes"
#[2,] "Red" "Rough" "Circular" "1323" "Canada" NA
I would do this using various tidyr functions. I created some sample data (shown at the bottom of this answer) with the entries swapped around and some missing.
library(tidyverse)
df %>%
  rowid_to_column("row") %>%
  separate_rows(V1, sep = "\\|") %>%
  mutate(V1 = str_replace(V1, ",$", "")) %>%
  separate(V1, c("key", "value"), sep = ":") %>%
  spread(key, value, fill = NA)
# row Color ID Location Shape Texture Video-Status
#1 1 Red 1323 Canada Circular Rough Yes
#2 2 Red 1323 Canada Circular Rough Yes
#3 3 Red 1323 Canada Circular Rough <NA>
Explanation: we first split the entries into separate rows at "|", remove the trailing ",", split each entry into a key and a value at ":", and finally reshape from long to wide to produce your expected output.
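As a side note, spread() has since been superseded in tidyr; on tidyr 1.0 or later the last step could equally be written with pivot_wider(), roughly like this (a sketch of the same pipeline):
df %>%
  rowid_to_column("row") %>%
  separate_rows(V1, sep = "\\|") %>%
  mutate(V1 = str_replace(V1, ",$", "")) %>%
  separate(V1, c("key", "value"), sep = ":") %>%
  pivot_wider(names_from = key, values_from = value)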
Sample data
df <- read.table(text =
"Color:Red,|Texture:Rough,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes
Texture:Rough,|Color:Red,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes
Texture:Rough,|Color:Red,|Shape:Circular,|ID:1323,|Location:Canada")
I am analysing some fMRI data; in particular, I am looking at what sorts of cognitive functions are associated with coordinates from an fMRI scan (conducted while subjects were performing a task). My data can be obtained with the following function:
library(httr)
scrape_and_sort = function(neurosynth_link){
  # pull the parsed JSON payload from the Neurosynth API
  result = content(GET(neurosynth_link), "parsed")$data
  names = c("Name", "z_score", "post_prob", "func_con", "meta_analytic")
  # bind the per-term entries into one data frame with sensible column names
  df = do.call(rbind, lapply(result, function(x) setNames(as.data.frame(x), names)))
  df$z_score = as.numeric(df$z_score)
  # sort by descending z-score, drop z-scores below 3, and remove rows with NAs
  df = df[order(-df$z_score), ]
  df = df[-which(df$z_score < 3), ]
  df = na.omit(df)
  return(df)
}
RO4 = scrape_and_sort('https://neurosynth.org/api/locations/-58_-22_6_6/compare')
Now, I want to know which keywords come up most often and ideally construct a list of the most common words. I tried the following:
sort(table(RO4$Name),decreasing=TRUE)
But this clearly won't work. The problem is that the names (for example, "auditory cortex") are strings containing multiple words, so results such as 'auditory' and 'auditory cortex' come out as two separate entries, whereas I want them counted as two instances of 'auditory'.
But I am not sure how to search inside each string and record individual words like that. Any ideas?
Using the packages {jsonlite}, {dplyr}, and {tidyr}, plus the pipe operator %>% for legibility:
Store the response as data frame df:
url <- 'https://neurosynth.org/api/locations/-58_-22_6_6/compare/'
df <- jsonlite::fromJSON(url) %>% as.data.frame
Reshape and aggregate:
df %>%
  ## keep first column only and name it 'keywords':
  select('keywords' = 1) %>%
  ## split multi-word cell values (separated by a blank)
  ## into separate rows:
  separate_rows(keywords, sep = " ") %>%
  group_by(keywords) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
result:
# A tibble: 965 x 2
keywords count
<chr> <int>
1 cortex 53
2 gyrus 26
3 temporal 26
4 parietal 23
5 task 22
6 anterior 19
7 frontal 18
8 visual 17
9 memory 16
10 motor 16
# ... with 955 more rows
Edit: or, if you want to proceed from your data frame:
RO4 %>%
  select(Name) %>%
  ## select(everything())
  ## select(Name:func_con)
  separate_rows(Name, sep = ' ') %>%
  ## do remaining stuff
You can of course select more columns in a number of convenient ways (see the commented lines above and ?dplyr::select). Mind that values of the other variables will be repeated as many times as rows are needed to accommodate any multi-word value in column "Name", so that will introduce some redundancy.
If you want to adopt {dplyr} style, arranging by descending z-score and excluding unwanted z-scores would read like this:
RO4 %>%
  filter(!is.na(z_score), z_score >= 3) %>%
  arrange(desc(z_score))
Not sure I understand. Can't you proceed like this?
x <- c("auditory cortex", "auditory", "auditory", "hello friend")
unlist(strsplit(x, " "))
# "auditory" "cortex" "auditory" "auditory" "hello" "friend"
So basically I have a vector of tags that I want to find in my Transcript column (row by row). If I find any word from the tags in my Transcript string, I want to create a separate column concatenating all the matched tags, as shown in the example below (see image):
tags=c("loan","deposit","quarter","morning")
So, the output should look like this:
Output Result
Currently, I am able to tag this by using two for loops i.e. one to go over Tags vector and the other to go over my data frame's Transcript column one-by-one. But, I have a tag list of around 500 words and data frame has more than 100,000 rows. So, I am concerned about the run time. Is there any better way to optimize my R code using apply function or any other method?
I am currently using the following code to tag all the rows of the Transcript column one by one:
library(stringr)  # str_extract
library(stringi)  # stri_remove_empty

for (i in 1:length(tags)) {
  for (j in 1:nrow(FinalData)) {
    check_tag <- str_extract(string = FinalData$Cleaned_Transcript[j], pattern = tags[i])
    if (!is.na(check_tag)) {
      FinalData$Tags[j] <- stri_remove_empty(paste(FinalData$Tags[j], check_tag, sep = ","))
    }
  }
}
Not sure if you are open to not using a for loop, but if so, here's a tidyverse approach.
library(tidyverse)
dat <- data.frame(Transcript = c("This is example text a", "this is loan", "deposit is not quarter"))
# as per the OP's comment, we want to provide an input vector of tags
my_tags <- c("loan", "deposit", "quarter", "morning")
my_tags_collapsed <- str_c(my_tags, collapse = "|")
# We can now use the collapsed tags in the str_extract_all function
dat %>%
  mutate(test = str_extract_all(Transcript, my_tags_collapsed)) %>%
  unnest_wider(test) %>%
  mutate(across(-Transcript, replace_na, "")) %>%
  mutate(Tags_Marked = apply(across(-Transcript), 1, str_c, collapse = ",")) %>%
  select(Transcript, Tags_Marked)
Which gives:
# A tibble: 3 x 2
Transcript Tags_Marked
<chr> <chr>
1 This is example text a ,
2 this is loan loan,
3 deposit is not quarter deposit,quarter
Admittedly, this is not 100% OK, since you still get the comma separator for rows with zero-length matches.
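If the stray separator bothers you, one possible workaround (a sketch that mixes in base sapply()/paste(), so rows with no match simply come out as an empty string) is to collapse each row's matches directly:
dat %>%
  mutate(Tags_Marked = sapply(str_extract_all(Transcript, my_tags_collapsed),
                              paste, collapse = ",")) %>%
  select(Transcript, Tags_Marked)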
An alternative could be to not concatenate the strings into one column, but to keep them as separate columns, which would mean you could stop much earlier:
dat %>%
  mutate(test = str_extract_all(Transcript, my_tags_collapsed)) %>%
  unnest_wider(test)
which would give you:
# A tibble: 3 x 3
Transcript ...1 ...2
<chr> <chr> <chr>
1 This is example text a NA NA
2 this is loan loan NA
3 deposit is not quarter deposit quarter
In a data frame, I have a column (type: chr) that contains answers separated by a comma. I want to create another column based on the size of the string and award points. For example, some of the entries in a column are:
Column1
word1,word2,word3
word1,word2
word1
Now, for the first cell, I want the size of the cell to be evaluated as 3 (as it contains three distinct words and there are no duplicates among the cell values). I'm not sure how to achieve this.
An option is to split the column with strsplit into a list of vectors, get the unique elements by looping over the list with lapply, and get the lengths:
df1$Size <- lengths(lapply(strsplit(df1$Column1, ",\\s*"), unique))
Another option is separate_rows from tidyr
library(dplyr)
library(tidyr)
df1 %>%
  mutate(rn = row_number()) %>%
  separate_rows(Column1) %>%
  group_by(rn) %>%
  summarise(Size = n_distinct(Column1), .groups = 'drop') %>%
  select(Size) %>%
  bind_cols(df1, .)
-output
# Column1 Size
#1 word1,word2,word3 3
#2 word1,word2 2
#3 word1 1
data
df1 <- data.frame(Column1 = c('word1,word2,word3', 'word1,word2', 'word1'))
Original Answer:
Another option:
library(dplyr)
library(stringr)
df %>%
  mutate(Lengths = str_count(Column1, ",") + 1)
Edit:
I hadn't noticed the OP's requirement properly (about non-duplicates). As @Onyambu pointed out in the comments, this chunk will only work if there are no duplicated words in the data.
It basically counts how many words there are.
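If duplicates can occur, a variant in the same dplyr/stringr style (a sketch, adding purrr for map_int()) is to split each cell and count the distinct pieces instead:
library(purrr)

df %>%
  mutate(Size = map_int(str_split(Column1, ",\\s*"), n_distinct))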
I want to transpose the first two rows into two new columns and keep the rest of the data frame. How do I do it in R?
My original data
A <- c("2012","PL",3,2)
B <- c("2012","PL",6,1)
C <- c("2012","PL",7,4)
DF <- data.frame(A,B,C)
My final data after transpose
V1 <- c("2012","2012")
V2 <- c("PL","PL")
A <- c(3,2)
B <- c(6,1)
C <- c(7,4)
DF <- data.frame(V1,V2,A,B,C)
V1 and V2 are the names for the new columns, and they should be created automatically.
Thank you for any assistance.
Base R:
cbind(t(DF[1:2, 1, drop=FALSE]), DF[-(1:2),])
# Warning in data.frame(..., check.names = FALSE) :
# row names were found from a short variable and have been discarded
# 1 2 A B C
# 1 2012 PL 3 6 7
# 2 2012 PL 2 1 4
though I have some concerns about the apparent key property of "2012" and "PL". That is, you start with three instances of each and end with two. Logically it makes sense, though really to me it looks as if you have a matrix of numbers associated with a single "2012","PL", but perhaps that's not how the data is coming to you. (If you can change the format of the data before getting to this point such that you have a matrix and its associated keys, then it might make data munging more direct, declarative, and resistant to bugs.)
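Purely to illustrate that last suggestion (the object names here are made up), a "matrix plus keys" layout might look like:
keys <- data.frame(year = "2012", country = "PL")
vals <- matrix(c(3, 2, 6, 1, 7, 4), nrow = 2,
               dimnames = list(NULL, c("A", "B", "C")))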
Here is an option with slice
library(dplyr)
DF %>%
  select(A) %>%
  slice(1:2) %>%
  t %>%
  as.data.frame %>%
  bind_cols(DF %>%
              slice(-(1:2)))
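If you also want the V1/V2 column names from the target output, one option (a sketch) is to set the names at the end of the pipe:
DF %>%
  select(A) %>%
  slice(1:2) %>%
  t %>%
  as.data.frame %>%
  bind_cols(DF %>% slice(-(1:2))) %>%
  setNames(c("V1", "V2", "A", "B", "C"))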
I have a dataset containing a column of strings from which I wish to calculate an overall sentiment score, and a data frame containing all the unique words that appear in the strings, each of which is assigned a score:
library(stringr)
df <- data.frame(text = c('recommend good value no problem','terrible quality no good','good service excellent quality commend'), score = 0)
words <- c('recommend','good','value','problem','terrible','quality','service','excellent','commend')
scores <- c(1,2,1,-2,-3,1,0,3,1)
wordsdf <- data.frame(words,scores)
The only way I have been able to get close to this is by using a nested for loop and the str_count function from the stringr package:
for (i in 1:3){
  count = 0
  for (j in 1:9){
    count <- count + (str_count(df$text[i], as.character(wordsdf$words[j])) * wordsdf$scores[j])
  }
  df$score[i] <- count
}
This almost achieves what I want:
text score
1 recommend good value no problem 3
2 terrible quality no good 0
3 good service excellent quality commend 7
However, since the word 'commend' is also contained in the word 'recommend', my code calculates the scores as if both words are contained in the string.
So I have two queries:
1 - Is there a way to get it to match only exact words?
2 - Is there a way to achieve this without using the nested loop?
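On query 1, one common fix (a sketch that stays close to your str_count approach, and which also drops the inner loop in favour of vectorised counting) is to wrap each word in regex word boundaries so that 'commend' no longer matches inside 'recommend':
library(stringr)

## \\b marks a word boundary, so each pattern only matches whole words
patterns <- paste0("\\b", wordsdf$words, "\\b")
df$score <- sapply(as.character(df$text),
                   function(s) sum(str_count(s, patterns) * wordsdf$scores),
                   USE.NAMES = FALSE)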
One tidyverse possibility could be:
library(tidyverse)

df %>%
  rowid_to_column() %>%
  mutate(text = strsplit(text, " ", fixed = TRUE)) %>%
  unnest() %>%
  full_join(wordsdf, by = c("text" = "words")) %>%
  group_by(rowid) %>%
  summarise(text = paste(text, collapse = " "),
            scores = sum(scores, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-rowid)
text scores
<chr> <dbl>
1 recommend good value no problem 2
2 terrible quality no good 0
3 good service excellent quality commend 7
It first splits the "text" column into separate words, then performs a full join on these words, and finally recombines the words from the "text" column and sums the scores per row.
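As a side note, newer tidyr versions want an explicit cols argument for unnest(); a roughly equivalent pipeline (a sketch) that avoids the list-column step altogether uses separate_rows() and a left join:
library(dplyr)
library(tidyr)
library(tibble)

df %>%
  mutate(text = as.character(text)) %>%  # in case text was read in as a factor
  rowid_to_column() %>%
  separate_rows(text, sep = " ") %>%
  left_join(wordsdf, by = c("text" = "words")) %>%
  group_by(rowid) %>%
  summarise(text = paste(text, collapse = " "),
            scores = sum(scores, na.rm = TRUE)) %>%
  select(-rowid)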