I'm using a for loop to create a document term matrix. My actual problem uses an obscure package called RMeCab to tokenize Japanese text, but here a more standard equivalent using strsplit. My current code:
Documents <- data.frame(Names= c("A","B"),Texts=c("A string of words","A different string"), stringsAsFactors = FALSE)
OUTPUT <- NULL
COMBINED <- NULL
i <- 1
for (i in 1:length(Documents$Texts)){
OUTPUT <- data.frame(unlist(strsplit(Documents$Texts, " ")))
OUTPUT$doc <- Documents$Names[i]
COMBINED <- rbind(COMBINED, OUTPUT)
}
Document_Term_Matrix <- as.data.frame.matrix(table(COMBINED))
It works, but I'd like to use a more efficient apply function. If I run
L_OUTPUT <- lapply(Documents[,2],function(x) strsplit(x, " "))
I get the separate words as elements of a list, but how do I append the document name from Documents$Names?
More specifically with a list structure:
[[1]]
これ です は ぺん
1 1 1 1
[[2]]
です は 人 彼
1 1 1 1
How do I get a data with a column like this
これ は ぺん です 彼 は 人 です
And the second column showing the documents names
One One One One Two Two Two Two
Those words corresponding to the list elements [[1]], [[2]], etc.
It is better to use packages such as tm for these kind of operations, but here is a solution using base R,
list1 <- strsplit(Documents$Texts, ' ')
v1 <- unique(unlist(list1))
Document_Term_Matrix <- as.data.frame(t(sapply(v1, function(i) lapply(list1, function(j)
sum(grepl(i, j))))))
names(Document_Term_Matrix)<- Documents$Names
Document_Term_Matrix
# A B
#A 1 1
#string 1 1
#of 1 0
#words 1 0
#different 0 1
you can use functions from tm package which are suitable for large text datasets:
library(tm)
# create corpora from your documents
corp = VCorpus(DataframeSource(Documents), readerControl = list(reader = readTabular(mapping = list(content = "Texts", id = "Names"))))
# create term document matrix
tdm = TermDocumentMatrix(corp, control = list(tokenize = function(x) unlist(strsplit(as.character(x), "[[:space:]]+"))
, stopwords = FALSE
, tolower = TRUE
, weighting = weightTf))
inspect(tdm)
# get the result as matrix
tdm.m = matrix(tdm, nrow = tdm$nrow, ncol = tdm$ncol)
rownames(tdm.m) = tdm$dimnames$Terms
colnames(tdm.m) = tdm$dimnames$Docs
I also think there is a mistake in your question (but I cannot add comments).
You're missing [i] in your for cycle, so you get the total number of terms in all the documents. It should be something like this:
OUTPUT <- data.frame(unlist(strsplit(Documents$Texts[i], " ")))
Related
I have StringTie data for a parental cell line and a KO cell line (which I'll refer to as B10). I am interested in comparing the parental and B10 cell lines. The issue seems to be that my StringTie files are separate, meaning I have one for the parental cell line and one for B10. I've included the code I have written to date for context along with the error messages I received and troubleshooting steps I have already tried. I have no idea where to go from here and I'd appreciate all the help I could get. This isn't something that anyone in my lab has done before so I'm struggling to do this without any guidance.
Thank you all in advance!
`# My code to go from StringTie to count data:
(I copy pasted this so all my notes are included. I'm new to R so they're really just for me. I'm not trying to explain to everyone what every bit of the code means condescendingly. You all likely know much more that I do)
# Open Data
# List StringTie output files for all samples
# All files should be in same directory
files_B10 <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/B10", recursive = TRUE, full.names = TRUE)
files_parental <- list.files("C:/Users/kimbe/OneDrive/Documents/Lab/RNAseq/StringTie/data/parental", recursive = TRUE, full.names = TRUE)
tmp_B10 <- read_tsv(files_B10[1])
tx2gene_B10 <- tmp_B10[, c("t_name", "gene_name")]
txi_B10 <- tximport(files_B10, type = "stringtie", tx2gene = tx2gene_B10)
tmp_parental <- read_tsv(files_parental[1])
tx2gene_parental <- tmp_parental[, c("t_name", "gene_name")]
txi_parental <- tximport(files_parental, type = "stringtie", tx2gene = tx2gene_parental)
# Create a filter (vector) showing which rows have at least two columns with 5 or more counts
txi_B10.filter<-apply(txi_B10$counts,1,function(x) length(x[x>5])>=2)
txi_parental.filter<-apply(txi_parental$counts,1,function(x) length(x[x>5])>=2)
head(txi_parental.filter)
sum(txi_B10.filter)
# Now filter the txi object to keep only the rows of $counts, $abundance, and $length where the txi.filter value is >=5 is true
txi_B10$counts<-txi_B10$counts[txi_B10.filter,]
txi_B10$abundance<-txi_B10$abundance[txi_B10.filter,]
txi_B10$length<-txi_B10$length[txi_B10.filter,]
txi_parental$counts<-txi_parental$counts[txi_parental.filter,]
txi_parental$abundance<-txi_parental$abundance[txi_parental.filter,]
txi_parental$length<-txi_parental$length[txi_parental.filter,]
# save count data as csv files
write.csv(txi_B10$counts, "txi_B10.counts.csv")
write.csv(txi_parental$counts, "txi_parental.counts.csv")
# Open count data
# Do this in order that the files are organized in file manager
txi_B10_counts <- read_csv("txi_B10.counts.csv")
txi_parental_counts <- read_csv("txi_parental.counts.csv")
# Set column names
colnames(txi_B10_counts) = c("Gene_name", "B10_n1", "B10_n2")
View(txi_B10_counts)
colnames(txi_parental_counts) = c("Gene_name", "parental_n1", "parental_n2")
View(txi_parental_counts)
## R is case sensitive so you just wanna ensure that everything is in the same case
## convert Gene names which is column [[1]] into lowercase
txi_parental_counts[[1]] <- tolower( txi_parental_counts[[1]])
View(txi_parental_counts)
txi_B10_counts[[1]] <- tolower(txi_B10_counts[[1]])
View(txi_B10_counts)
## Capitalize the first letter of each gene name
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_parental_counts$Gene_name <- capFirst(txi_parental_counts$Gene_name)
View(txi_parental_counts)
capFirst <- function(s) {
paste(toupper(substring(s, 1, 1)), substring(s, 2), sep = "")
}
txi_B10_counts$Gene_name <- capFirst(txi_B10_counts$Gene_name)
View(txi_B10_counts)
# Merge PL and KO into one table
# full_join takes all counts from PL and KO even if the gene names are missing
# If a value is missing it writes it as NA
# This site explains different types of merging https://remiller1450.github.io/s230s19/Merging_and_Joining.html
mergedCounts <- full_join (x = txi_parental_counts, y = txi_B10_counts, by = "Gene_name")
view(mergedCounts)
# Replace NA with value = 0
mergedCounts[is.na(mergedCounts)] = 0
view(mergedCounts)
# Save file for merged counts
write.csv(mergedCounts, "MergedCounts.csv")
## --------------------------------------------------------------------------------
# My code to go from count data to DEseq2
# Import data
# I added my metadata incase the issue is how I set up the columns
# metaData is a file with your samples name and Comparison
# Your second column in metadata must be called Comparison, otherwise you'll get error in dds line
metadata <- read.csv(metadata.csv', header = TRUE, sep = ",")
countData <- read.csv('MergedCounts.csv', header = TRUE, sep = ",")
# Assign "Gene Names" as row names
# Notice how there's suddenly an extra row (x)?
# R automatically created and assigned column x as row names
# If you don't fix this the # of columns won't add up
rownames(countData) <- countData[,1]
countData <- countData[,-1]
# Create DEseq2 object
# !!!!!!! Here is where I get stuck!!!!!!!
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = metaData,
design = ~ Comparison, tidy = TRUE)
# I can't run this line
# It says Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are not integers
## --------------------------------------------------------------------------------
# How I tried to fix this:
# 1) I saw something here that suggested this might be an issue with having zeros in the count data
# I viewed the countData files to make sure there were no zeros and there weren't any
# I thought that would be the case since I replaced NA with value = 0 earlier using this bit of code
mergedCounts[is.na(mergedCounts)] = 0
view(mergedCounts)
# 2) I was then informed that StringTie outputs non integer values
# It was recommended that I try DESeqDataSetFromTximport instead
dds <- DESeqDataSetFromTximport(countData,
colData = metaData,
design = ~ Comparison, tidy = TRUE)
# I can't run this line either
# It says Error in DESeqDataSetFromTximport(countData, colData = metaData, design = ~Comparison, : is(txi, "list") is not TRUE
# I think this might be because merging the parental and B10 counts led to a file that's no longer a txi or accessible through Tximport
# It seems like this should be done with the original StringTie files from the very beginning of the code
# My concern with doing that is that the files for parental and B10 are separate so I don't see how I could end up comparing the two
# I think this approach would work if I was interested in comparing n1 verses n2 for each cell line but that is not of interest to me
`
I'm quite a novice, but I've successfully managed to make some code do what I want.
Right now my code does what I want for one file at a time.
I want to make my code automate this process for 600 files.
I kind of have an idea, that I need to put the list of files in a vector, then maybe use lapply and a function, but I'm not sure how to do this. The syntax and code are beyond me at the moment.
Here's my code...
#Packages are callled
library(tm) #text mining
library(SnowballC) #stemming - reducing words to their root
library(stringr) #for str_trim
library(plyr)
library(dplyr)
library(readtext)
#this is my code to run the code on a bunch of text files. Obviously it's unfinished, and I'm not sure if this is the right approach. Where do I put this? Will it even work?
data_files <- list.files(path = "data/", pattern = '*.txt', full.names = T, recursive = T)
lapply(
#
# where do I put this chunk of code?
# do I need to make all the code below a function?
##this bit cleans the document
company <- "CompanyXReport2015"
txt_raw = readLines("data/CompanyXReport2015.txt")
# remove all extra white space, also splits on lines
txt_format1 <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", txt_raw)
txt_format1.5 <- gsub("^ +| +$|( ) +", "\\1", txt_format1)
# recombine now that all white space is stripped
txt_format2 <- str_c(txt_format1.5, collapse=" ")
#split strings on space now to get a list of all words
txt_format3 <- str_split(txt_format2," ")
txt_format3
# convert to vector
txt_format4 <- unlist(txt_format3)
# remove empty strings and those with words shorter than 3 length
txt_format5 <- txt_format4[str_length(txt_format4) > 3]
# combine document back to single string
cleaned <- str_c(txt_format5, collapse=" ")
head(cleaned, 2)
##import key words and run analysis on frequency for the document
s1_raw = readLines("data/stage1r.txt")
str(s1_raw)
s2_raw = readLines("data/stage2r.txt")
str(s2_raw)
s3_raw = readLines("data/stage3r.txt")
str(s3_raw)
s4_raw = readLines("data/stage4r.txt")
str(s4_raw)
s5_raw = readLines("data/stage5r.txt")
str(s5_raw)
# str_count(cleaned, "legal")
# apply str_count function using each stage vector
level1 <- sapply(s1_raw, str_count, string=cleaned)
level2 <- sapply(s2_raw, str_count, string=cleaned)
level3 <- sapply(s3_raw, str_count, string=cleaned)
level4 <- sapply(s4_raw, str_count, string=cleaned)
level5 <- sapply(s5_raw, str_count, string=cleaned)
#make a vector from this for the report later
wordcountresult <- c(level1,level2,level3,level4,level5)
# convert to dataframes
s1 <- as.data.frame(level1)
s2 <- as.data.frame(level2)
s3 <- as.data.frame(level3)
s4 <- as.data.frame(level4)
s5 <- as.data.frame(level5)
# add a count column that each df shares
s1$count <- s1$level1
s2$count <- s2$level2
s3$count <- s3$level3
s4$count <- s4$level4
s5$count <- s5$level5
# add a stage column to identify what stage the word is in
s1$stage <- "Stage 1"
s2$stage <- "Stage 2"
s3$stage <- "Stage 3"
s4$stage <- "Stage 4"
s5$stage <- "Stage 5"
# drop the unique column
s1 <- s1[c("count","stage")]
s2 <- s2[c("count","stage")]
s3 <- s3[c("count","stage")]
s4 <- s4[c("count","stage")]
s5 <- s5[c("count","stage")]
# s1
df <- rbind(s1, s2,s3, s4, s5)
df
#write the summary for each company to a csv
#Making the report
#Make a vector to put in the report
#get stage counts and make a vector
s1c <- sum(s1$count)
s2c <- sum(s2$count)
s3c <- sum(s3$count)
s4c <- sum(s4$count)
s5c <- sum(s5$count)
stagesvec <- c(s1c,s2c,s3c,s4c,s5c)
names(stagesvec) <- c("Stage1","Stage2","Stage3","Stage4","Stage5")
#get the company report name for a vector
companyvec <- c(company)
names(companyvec) <- c("company")
# combine the vectors for the vector row to be inserted into the report
reportresult <- c(companyvec, wordcountresult, stagesvec)
rrdf <- data.frame(t(reportresult))
newdf <- data.frame(t(reportresult))
#if working file exists-use it
if (file.exists("data/WordCount12.csv")){
write.csv(
rrdf,
"data/WordCountTemp12.csv", row.names=FALSE
)
rrdf2 <-
read.csv("data/WordCountTemp12.csv")
df2 <-
read.csv("data/WordCount12.csv")
df2 <- rbind(df2, rrdf2)
write.csv(df2,
"data/WordCount12.csv", row.names=FALSE)
}else{ #if NO working file exists-make it
write.csv(newdf,
"data/WordCount12.csv", row.names=FALSE)
}
Hello :) Here is an example of workflow, you might find better ones but I started with it when learning.
listoftextfiles = list.files(...)
analysis1 = function(an element of listoftextfiles){
# your 1st analysis
}
res1 = lapply(listoftextfile, analysis1) # results of the 1st analysis
analysis2 = function(an element of res1){
# your 2nd analysis
}
res2 = lapply(res1, analysis2) # results of the 2nd analysis
# ect.
You will find many tutorials about custom functions on internet.
I have a nested list with phrases after applying phrasemachine(). Now I would like to create a document-feature matrix having the documents (user) in the first column and all features as the remaining columns with each user's frequency of usage in the cells.
library(rJava)
library(phrasemachine)
library(quanteda)
#creating dummy data
id <- c(1:2)
text <- c("Election day is coming up and I am super excited. Election day. Wooho. I voted President Obama.", "School is boring. Partying is cool. Happy Birthday to me. When is Election Day?")
test <- data.frame(id, text)
test$text <- as.character(test$text)
corpus_test <- corpus(test[["text"]], docnames = test[["id"]])
tokens_test <- tokens(corpus_test)
phrases_test <- phrasemachine(tokens_test, minimum_ngram_length = 2, maximum_ngram_length = 3, return_phrase_vectors = TRUE, return_tag_sequences = TRUE)
phrases_test
# > phrases_test
# [[1]]
# [[1]]$phrases
# [1] "Election_day" "Election_day" "President_Obama"
#
# [[1]]$tags
# [1] "NN" "NN" "NN"
#
#
# [[2]]
# [[2]]$phrases
# [1] "Happy_Birthday" "Election_Day"
#
# [[2]]$tags
# [1] "AN" "NN"
This is the output I am looking for (a document-feature matrix):
# user Election_day President_Obama Happy_Birthday
# 1 2 1 0
# 2 1 0 1
I tried using lapply but since each user's phrases are of different dimensions, that wouldn't work.
Here is what I tried:
library(plyr)
phrases_user <- laply(phrases_test, function(x) laply(x, identity)) #Error: Results must have the same dimensions.
library(dplyr)
phrases_user <- lapply(phrases_test, `[`, "phrases")
After figuring out the issue in extracting the phrases per Id, I suppose I would have to do the following:
corpus_test_2 <- corpus(phrases_user[["phrases"]], docnames = phrases_user[["id"]])
dfm_test <- dfm(corpus_test_2)
Can anyone help? :)
Example using udpipe with phrasemachine
library(udpipe)
text <- c("Election day is coming up and I am super excited. Election day. Wooho. I voted President Obama.", "School is boring. Partying is cool. Happy Birthday to me. When is Election Day?")
x <- udpipe(text, "english")
x$tags <- as_phrasemachine(x$upos, type = "upos")
keyw <- keywords_phrases(x$tags,
term = x$token, pattern = "(A|N)*N(P+D*(A|N)*N)*",
is_regex = TRUE, detailed = FALSE)
head(keyw)
x$term <- txt_recode_ngram(x$token,
compound = keyw$keyword,
ngram = keyw$ngram)
dtm <- document_term_frequencies(x, document = "doc_id", term = c("term", "token"))
dtm <- document_term_matrix(dtm)
Note though you might also be interested in using the dependency parsing output. You can extract multi-word expressions based on the dep_rel field of the udpipe output - if it says fixed/flat/compound, that are multi-word-expressions. The definition of fixed/flat/compound is defined at http://universaldependencies.org/u/dep/index.html.
I am relatively new to R and even more new to Shiny (literally first day).
I would like a user to input multiple phrases separated by a comma such as female, aged, diabetes mellitus. I have a dataframe in which one variable, MH2 contains text words. I would like to output a dataframe that contains only the rows in which all of the inputted phrases are present. Sometimes a user may input only one phrase, other times 5.
This is my ui.R
library(shiny)
library(stringr)
# load dataset
load(file = "./data/all_cardiovascular_case_reports.Rdata")
ui <- fluidPage(
sidebarLayout(
sidebarPanel(
textInput(inputId = "phrases",
label = "Please enter all the MeSH terms that you would like to search, each separated by a comma:",
value = ""),
helpText("Example: female, aged, diabetes mellitus")
),
mainPanel(DT::dataTableOutput("dataframe"))
)
)
and here is my server.R
library(shiny)
server <- function(input, output)
{
# where all the code will go
df <- reactive({
# counts how many phrases there are
num_phrases <- str_count(input$phrases, pattern = ", ") + 1
a <- numeric(num_phrases) # initialize vector to hold all phrases
# create vector of all entered phrases
for (i in 1:num_phrases)
{
a[i] <- noquote(strsplit(input$phrases, ", ")[[i]][1])
}
# make all phrases lowercase
a <- tolower(a)
# do exact case match so that each phrase is bound by "\\b"
a <- paste0("\\b", a, sep = "")
exact <- "\\b"
a <- paste0(a, exact, sep = "")
# subset dataframe over and over again until all phrases used
for (i in 1:num_phrases)
{
final <- final[grepl(pattern = a, x = final$MH2, ignore.case = TRUE), ]
}
return(final)
})
output$dataframe <- DT::renderDataTable({df()})
}
When I tried running renderText({num_phrases}) I consistently got 1 even when I would input multiple phrases separated by commas. Since then, whenever I try to input multiple phrases, I run into "error: subscript out of bounds." However, when I enter the words separated by a comma only versus a comma and space (entering "female,aged" instead of "female, aged") then that problem disappears, but my dataframe doesn't subset correctly. It can only subset one phrase.
Please advise.
Thanks.
I think your Shiny logic looks good, but the function for subsetting the dataframe has a few small issues. In particular:
a[i] <- noquote(strsplit(input$phrases, ", ")[[i]][1])
The indices [[i]] and 1 are in the wrong place here, should be [[1]][i]
final <- final[grepl(pattern = a, x = final$MH2, ignore.case = TRUE), ]
You can not match multiple patterns like this, only the first element of a will be used, which is also the warning R gives.
Example working code
I have changed input$phrases to inp_phrases here. If this script does what you want I think you can easily copy it into you reactive, making the necessary changes (i.e. changing inp_phrases back, and adding the return(result) statement.). I was also not entirely clear if you wanted all patterns to be matched within one row, or return all rows were any of the patterns were matched, so I added them both, you can uncomment the one you need:
library(stringr)
# some example data
inp_phrases = "ab, cd"
final = data.frame(index = c(1,2,3,4),MH2 = c("ab cd ef","ab ef","cd ef ab","ef gx"),stringsAsFactors = F)
# this could become just two lines:
a <- sapply(strsplit(inp_phrases, ", ")[[1]], function(x) tolower(noquote(x)))
a <- paste0("\\b", a, "\\b")
# Two options here, uncomment the one you need.
# Top one: match any pattern in a. Bottom: match all patterns in a
# indices = grepl(pattern = paste(a,collapse="|"), x = final$MH2, ignore.case = TRUE)
indices = colSums(do.call(rbind,lapply(a, function(x) grepl(pattern = x, x = final$MH2, ignore.case = TRUE))))==length(a)
result <- final[indices,]
Returns:
index MH2
1 1 ab cd ef
3 3 cd ef ab
... with the second version of indices (match all) or
index MH2
1 1 ab cd ef
2 2 ab ef
3 3 cd ef ab
... with the first version of indices (match any)
Hope this helps!
I am trying to define a data frame initially and then add rows using rbind one by one later. My problem is that in the following definition of result, I want humanReplicate, ratReplicate to be in string format and other columns to be in numeric format.
result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate",
pvalue = "p-value", alternative = "alternative_hypothesis", Conf.int1 = "conf.int1",
Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
Then, I am currently adding new rows (which are results of a Fisher's test) in the following way :
newLine <- data.frame(t(c(humanReplicate = humanReplicateName,
ratReplicate = ratReplicateName, pvalue = fisherTest$p,
alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1],
Conf.int2 =fisherTest$conf.int[2], oddratio = fisherTest$estimate[[1]])))
result <-rbind(result,newLine)
Here all the values of the fisherTest$X is numeric. humanReplicateName and ratReplicateName are strings. But if I define result in the above way and then define newLine in this way then all the columns of the data frame becomes string. I understand when I define result here, I am making all of them string. But if I want to make it mixture of string and number, how should I define it?
My final goal is to get result as csv file and I am doing the following to do this:
write.table(result, file = "newData4.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")
Your problem is in the concatenate function c(). You do not need it as well as the tfunction. Indeed
tmp <- data.frame(t(c("test" = 2,"human" = "Paul")))
apparently gives the same than this line
tmp2 <- data.frame("test" = 2,"human" = "Paul")
> tmp
test human
1 2 Paul
> tmp2
test human
1 2 Paul
but
> tmp$test
test
2
Levels: 2
> tmp2$test
[1] 2
and
> is.numeric(tmp$test)
[1] FALSE
> is.numeric(tmp2$test)
[1] TRUE
What happens is that with c first you make 1 vector, and mixing text and numeric is interpreted as a vector of factors, while with the directly call to dataframe you fill two different and indipendent columns