Concatenate strings of one column separated by commas in R

I would like to take the strings in a column, wrap each in double quotes, and join them with commas, because I want to paste the result into an SQL query.
A<-c('John', 'Kate', 'Kaitlyn', 'Arun',' Chen')
df<- data.frame(A)
I would like to concatenate all rows of column A into a single string and write it to a text file. The expected text is below. Any suggestions?
"John", "Kate", "Kaitlyn", "Arun", "Chen"

Try this approach:
#Data
A<-c('John', 'Kate', 'Kaitlyn', 'Arun',' Chen')
df<- data.frame(A,stringsAsFactors = F)
df$A <- paste0('\"',trimws(df$A),'\"')
#Collapse
df2 <- data.frame(val=paste0(df$A,collapse = ', '),stringsAsFactors = F)
#Export
write.table(df2,file='File.txt',row.names = F,col.names = F,quote = F)
Output (contents of File.txt):
"John", "Kate", "Kaitlyn", "Arun", "Chen"
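The same result can probably be had without the intermediate data frames. The following is a minimal sketch, assuming the names live in a plain character vector; `out_file` is a hypothetical path (a temporary file here) and not part of the original answer.

```r
# Sample data from the question (note the stray leading space in ' Chen')
A <- c('John', 'Kate', 'Kaitlyn', 'Arun', ' Chen')

# Trim whitespace, wrap each value in double quotes, join with ", "
quoted <- paste0('"', trimws(A), '"', collapse = ", ")

# Hypothetical output location; replace with a real path such as 'File.txt'
out_file <- tempfile(fileext = ".txt")
writeLines(quoted, out_file)

print(quoted)
```

writeLines() writes the string as is, so no quote or row.names arguments are needed.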

Related

Multiline text in R dataframe

I'm trying to include multiline text in a data frame cell, but R seems to read each \n as a new row, which results in row mismatches. If I change the 'Code' input to a simple string, the code works fine.
Defined dataframe:
df <- data.frame(
"Id" = character(),
"Name" = character(),
"Code" = character()
)
Adding new row:
NewRow <- data.frame(
"Id" = Id, # Simple string
"Name" = Name, # Simple string
"Code" = Code # Complex multiline string containing '#' and '\n' (10+ lines)
)
df <- rbind(df, NewRow)
Received error: Error in data.frame: arguments imply differing number of rows: 1, 0
Does anyone know how to get around this problem?
Many thanks in advance!
You could try cleaning up the Code variable before adding it to the data frame, that is, remove \n and # from Code first and then build the row. You can use stringr and dplyr for this:
library(stringr) # str_replace_all(), str_remove_all()
library(dplyr)   # %>%
### Using the replace option:
Code <- Code %>%
str_replace_all("\n", "") %>% # drop newline characters
str_replace_all("#", "")
### Using the remove option:
Code <- Code %>%
str_remove_all("\n") %>% # drop newline characters
str_remove_all("#")
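It may also be worth checking the length of each input before building the row. A multiline string is still a single character element, while the reported message (1, 0) is what data.frame() produces when one argument has length zero. The sketch below uses made-up values for Id, Name and Code, since the originals are not shown.

```r
# Hypothetical inputs; the asker's real values are unknown
Id   <- "42"
Name <- "example"
Code <- "x <- 1  # a comment\ny <- 2\nprint(x + y)"

df <- data.frame(Id = character(), Name = character(), Code = character(),
                 stringsAsFactors = FALSE)

# A multiline string has length 1, so this adds exactly one row
NewRow <- data.frame(Id = Id, Name = Name, Code = Code,
                     stringsAsFactors = FALSE)
df <- rbind(df, NewRow)
print(nrow(df))

# A zero-length input triggers the "differing number of rows" error
Code_empty <- character(0)
err <- tryCatch(
  data.frame(Id = Id, Name = Name, Code = Code_empty),
  error = function(e) conditionMessage(e)
)
print(err)
```

If length(Code) turns out to be 0 (or greater than 1) in the real script, the step that produces Code is the place to look.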

data.table with unique names from nested list with different classes and missing names

I have a nested list of this structure:
test_list <- list(
"some string",
list(type = "entry_type", text = "some content"),
list("more strings"),
list(type = "another_type", text = "more text yet"),
""
)
So it is a list containing lists and plain entries, while the only names are those of the items of the nested lists - which are duplicated.
My goal is to transfer this into a data.table with the original names (provided they exist) but made unique.
Currently I use this pipeline:
library(data.table)
dt <- as.data.table(flatten(test_list))
unique_names <- paste0("V", seq_len(length(names(dt))))
propper_names <- names(dt)
new_names <- propper_names
blank_names <- which(new_names == "")
new_names[blank_names] <- unique_names[blank_names]
duplicates_names <- which(duplicated(new_names))
new_names[duplicates_names] <- paste(
propper_names[duplicates_names],
unique_names[duplicates_names],
sep = "_"
)
setnames(
dt,
new_names
)
Is there a nicer/faster/better/more robust way to accomplish this goal?
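You could use rapply() to retain the deeper-nested names:
library(data.table)
# rapply() with its default how = "unlist" flattens the list but keeps the inner names
DT <- setDT(as.list(rapply(test_list, identity)))
setnames(DT, make.names(names(DT), unique = TRUE))
DT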
#             V1       type         text           V2       type.1        text.1 V3
# 1: some string entry_type some content more strings another_type more text yet
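The renaming step from the question can also be condensed in base R. This is only a sketch: `uniquify_names()` is a hypothetical helper, and `flat_names` is typed in by hand to mimic the names of the flattened list rather than computed from `test_list`.

```r
# Hypothetical helper: fill blank names with V<position>, then de-duplicate
uniquify_names <- function(nm) {
  blank <- is.na(nm) | !nzchar(nm)
  nm[blank] <- paste0("V", which(blank))
  make.unique(nm, sep = "_")
}

# Names as they would look after flattening the example list
flat_names <- c("", "type", "text", "", "type", "text", "")
new_names <- uniquify_names(flat_names)
print(new_names)
```

With a data.table at hand, the result could be applied via setnames(dt, uniquify_names(names(dt))).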

R Splitting strings by multiple splitting criteria and storing the splitting criteria

I'm trying to split a string by multiple criteria and store the splitting criteria for each split.
I have been trying to use stringr::str_split() but cannot pass more than one splitting criterion to the function.
For example if I have the following string:
data = "Julie (title) : This is the text Julie has: said. Extra sentence one. Extra sentence 2 and so on. Rt Hon Ellen: This is the text Ellen has said in response to Julie. TITLE OF SECTION Julie: More words from Julie."
and splitting criteria:
names = c("Julie:", "Ellen:")
I would like an output like this:
data.frame(Names = c("Julie:", "Ellen:","Julie:"),
text = c(" This is the text Julie has: said. Extra sentence one. Extra sentence 2 and so on. ", "This is the text Ellen has said in response to Julie.","More words from Julie."))
I saw your comments on coatless's answer and created sample data that probably reflects what you described. One way would be the following. I first create a data frame, then split the string into sentences using unnest_tokens(). Next, I split each sentence into person and sentence using separate(). Finally, I replace the NAs with the person's name using na.locf(). Note that unnest_tokens() lowercases the text by default, as the output shows. I hope this helps to some extent.
library(tidyverse)
library(tidytext)
library(zoo)
so <- tibble(text = "Ana: I went to school today. I learned text mining. Bob: That is great! Ana: I know what to do: practice.")
unnest_tokens(so, output = sentence,
input = text,
token = "sentences") %>%
separate(col = sentence, into = c("person", "sentence"), sep = ": ",
extra = "merge", fill = "left") %>%
mutate(person = na.locf(person))
# A tibble: 4 x 2
# person sentence
# <chr> <chr>
#1 ana i went to school today.
#2 ana i learned text mining.
#3 bob that is great!
#4 ana i know what to do: practice.
Long-winded inefficient Base R solution:
# Store a vector of the names:
text_names <- c("Julie", "Ellen")
# Create a dataframe of the patterns:
pattern_search <- data.frame(name_search = c(paste0(text_names, ":"),
paste0(text_names, " :"),
paste0(text_names, ".* :")),
stringsAsFactors = F)
# Split the text into sentences:
split_text <- data.frame(sentences = trimws(unlist(strsplit(df$Text, "[.]")), "both"), stringsAsFactors = F)
# Extract the names, store them in a vector:
names_in_order <- gsub("[[:punct:]]|\\s+.*",
"",
regmatches(grep(paste0(pattern_search$name_search, collapse = "|"),
split_text$sentences, value = T),
regexpr(paste0(pattern_search$name_search, collapse = "|"),
grep(paste0(pattern_search$name_search, collapse = "|"),
split_text$sentences, value = T))))
# Store a logical vector denoting which elements the names should go:
split_text$who_said_this <- grepl(paste0(pattern_search$name_search, collapse = "|"),
split_text$sentences)
# Replace all occurrences of TRUE with the elements of the vector of names:
split_text$who_said_this[which(split_text$who_said_this == TRUE)] <- names_in_order
# Replace FALSE with NA values:
split_text$who_said_this[which(split_text$who_said_this == "FALSE")] <- NA
# Store a vector whose values denote the number of times dialogue changes between the names:
split_text$speech_group_no <- ave(split_text$who_said_this,
split_text$who_said_this,
FUN = seq.int)
# Apply a function to fill NA values with the non-NA value above it:
split_text <- data.frame(lapply(split_text, function(x){na.omit(x)[cumsum(!is.na(x))]}),
stringsAsFactors = F)
# Row-wise concatenate the dataframe by group:
split_text <- aggregate(list(sentences = c(split_text$sentences)),
list(speech_group_no = paste0(split_text$who_said_this, " - ", split_text$speech_group_no)),
paste0,
sep = ". ")
# Flatten list vector into a character vector and clean up punctuation:
split_text$sentences <- gsub(" [,] ", " ", sapply(split_text$sentences, toString))
# Order the dialogue:
split_text <- split_text[match(split_text$speech_group_no,
paste(names_in_order, ave(names_in_order, names_in_order, FUN = seq.int), sep = " - ")),]
Data:
df <- structure(
list(Text = "Julie (title) : This is the text Julie has: said. Extra sentence one. Extra sentence 2 and so on. Rt Hon Ellen: This is the text Ellen has said in response to Julie. TITLE OF SECTION Julie: More words from Julie."),
class = "data.frame",
row.names = c(NA,-1L)
)
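A shorter base R route may be to match the speaker markers with one regular expression and keep both the matches and the text between them. The sketch below assumes a marker is a known name, optionally followed by a parenthetical such as "(title)", and then a colon; `speakers`, `pattern` and `result` are names introduced here for illustration.

```r
data <- "Julie (title) : This is the text Julie has: said. Extra sentence one. Extra sentence 2 and so on. Rt Hon Ellen: This is the text Ellen has said in response to Julie. TITLE OF SECTION Julie: More words from Julie."
speakers <- c("Julie", "Ellen")

# Name, optional "(...)" block, optional spaces, then a colon
pattern <- paste0("(", paste(speakers, collapse = "|"),
                  ")(\\s*\\([^)]*\\))?\\s*:")

hits   <- gregexpr(pattern, data, perl = TRUE)
delims <- regmatches(data, hits)[[1]]                 # the splitting criteria
pieces <- regmatches(data, hits, invert = TRUE)[[1]]  # text between them

result <- data.frame(
  Names = sub(pattern, "\\1", delims, perl = TRUE),   # keep only the name
  text  = trimws(pieces[-1]),                         # drop text before the first marker
  stringsAsFactors = FALSE
)
print(result)
```

Text that precedes a name without belonging to it, such as "Rt Hon" or "TITLE OF SECTION", stays at the end of the previous piece and would need separate cleanup. The Names column holds the bare name; append ":" if the colon should be kept.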

Filtering process not fetching full data? Using dplyr filter and grep

I have this log file that has about 1200 characters (max) on a line. What I want to do is read this first and then extract certain portions of the file into new columns. I want to extract rows that contain the text “[DF_API: input string]”.
When I read it and then filter for the rows that I am interested in, it seems like I am losing data. I tried this with dplyr's filter and with standard grep, with the same result.
I'm not sure why this is the case and would appreciate your help. The code and the data are available at the link below.
Satish
Code is given below
library(dplyr)
library(stringr) # needed for str_detect()
setwd("C:/Users/satis/Documents/VF/df_issue_dec01")
sec1 <- read.delim(file="secondary1_aa_small.log")
head(sec1)
names(sec1) <- c("V1")
sec1_test <- filter(sec1,str_detect(V1,"DF_API: input string")==TRUE)
head(sec1_test)
sec1_test2 = sec1[grep("DF_API: input string",sec1$V1, perl = TRUE),]
head(sec1_test2)
write.csv(sec1_test, file = "test_out.txt", row.names = F, quote = F)
write.csv(sec1_test2, file = "test2_out.txt", row.names = F, quote = F)
Data (and code) is given at the link below. Sorry, I should have used dput.
https://spaces.hightail.com/space/arJlYkgIev
Try the code below, which gives you a data frame of the lines in your file that match a condition.
#to read your file
sec1 <- readLines("secondary1_aa_small.log")
#framing a dataframe by extracting required lines from above file
new_sec1 <- data.frame(grep("DF_API: input string", sec1, value = T))
names(new_sec1) <- c("V1")
Edit: Simple way to split the above column into multiple columns
#extracting substring in between < & >
new_sec1$V1 <- gsub(".*[<\t]([^>]+)[>].*", "\\1", new_sec1$V1)
#replacing comma(,) with a white space
new_sec1$V1 <- gsub("[,]+", " ", new_sec1$V1)
#splitting into separate columns
new_sec1 <- strsplit(new_sec1$V1, " ")
new_sec1 <- lapply(new_sec1, function(x) x[x != ""] )
new_sec1 <- do.call(rbind, new_sec1)
new_sec1 <- data.frame(new_sec1)
Rename the columns as needed for your analysis.
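One possible reason for lines going missing with read.delim() is its defaults: it treats the first line as a header and interprets double quotes. Whether that is what happened here is an assumption, since the log itself is not shown. The sketch below uses a small made-up log written to a temporary file and compares readLines() with read.delim() called with header = FALSE and quote = "".

```r
# Made-up sample log; the real file is not available
log_file <- tempfile(fileext = ".log")
writeLines(c(
  'startup banner',
  '[DF_API: input string] <a,b,c> said "hello',
  'unrelated line',
  '[DF_API: input string] <d,e,f>'
), log_file)

# Option 1: read raw lines, then filter on the literal text
raw_lines <- readLines(log_file)
hits <- grep("DF_API: input string", raw_lines, fixed = TRUE, value = TRUE)

# Option 2: keep read.delim() but switch off header and quote handling
sec1 <- read.delim(log_file, header = FALSE, quote = "",
                   stringsAsFactors = FALSE)

print(length(hits))
print(nrow(sec1))
```

fixed = TRUE makes grep() treat the search text literally, which matters if the brackets from "[DF_API: input string]" are ever included in the pattern.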

Formatting / adjusting incoming string to R

I'm having trouble doing some extraction & coercing of a string in R. I'm not very good with R... just enough to be dangerous. Any help would be appreciated.
I am trying to take a string of this form:
"AAA,BBB,CCC'
And create two items:
A list containing each element separately (i.e. 3 entries) - c("AAA","BBB","CCC"). I've tried strsplit(string, ",") but I get a list of length 1
A data frame with names = lower case entries, values = entries. e.g. df = data.frame(aaa=AAA, bbb=BBB, ccc=CCC). I'm not sure how to pull out each of the elements, and lowercase the references.
Hopefully this is doable with R. Appreciate your time!
If the string is malformed, read it in with quoting disabled and then strip the stray quote characters:
malform <- read.table("weirdstring.txt", colClasses='character',quote = "")
str = gsub("\'|\"", "", malform[1,1])
The string should now look like:
str = "AAA,BBB,CCC"
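## as a character vector (unlist() flattens the length-1 list that strsplit() returns)
ll <- unlist(strsplit(str, ","))
## as a one-row data frame with lower-case column names
df <- data.frame(t(ll), stringsAsFactors = FALSE)
names(df) <- tolower(ll)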
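The "list of length 1" from the question is expected: strsplit() returns one list element per input string, and that element holds the three pieces. A minimal sketch, assuming the string is already well formed; `string`, `parts` and `df` are illustrative names.

```r
string <- "AAA,BBB,CCC"

# One list element per input string ...
split_result <- strsplit(string, ",", fixed = TRUE)
# ... so take the first element to get the three entries
parts <- split_result[[1]]

# One-row data frame: lower-case names, original entries as values
df <- as.data.frame(as.list(setNames(parts, tolower(parts))),
                    stringsAsFactors = FALSE)

print(parts)
print(df)
```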
