Summarizing R corpus with doc ID

Summarizing R corpus with doc ID - r

I've created a DocumentTermMatrix similar to the one in this post:
Keep document ID with R corpus
Where I've maintained the doc_id so I can join the data back to a larger data set.
My issue is that I can't figure out how to summarize the words and word count and keep the doc_id. I'd like to be able to join this data to an existing data set using only 3 columns (doc_id, word, freq).
Without needing the doc_id, this is straight forward and I use this code to get my end result.
df_source=DataframeSource(df)
df_corpus=VCorpus(df_source)
tdm=TermDocumentMatrix(df_corpus)
tdm_m=as.matrix(tdm)
word_freqs=sort(rowSums(tdm_m), decreasing = TRUE)
tdm_sorted=data.frame(word = names(word_freqs), freq = word_freqs)
I've tried several different approaches to this and just cannot get it to work. This is where I am now (image). I've used this code:
tdm_m=cbind("doc.id" =rownames(tdm_m),tdm_m)
to move the doc_id into a column in the matrix, but cannot get the numeric columns to sum and keep the doc_id associated.
Any help, greatly appreciated, thanks!
Expected result:
doc.id | word | frequency
1 | Apple | 2
2 | Apple | 1
3 | Banana | 4
3 | Orange | 1
4 | Pear | 3

If I look at your expected output, you don't need to use this line of code word_freqs=sort(rowSums(tdm_m), decreasing = TRUE). Because this creates a total sum of the word, like Apple = 3 instead of 2 and 1 over multiple documents.
To get to the output you want, instead of using TermDocumentMatrix, using DocumentTermMatrix is slightly easier. No need in switching columns around. I'm showing you two examples on how to get the result. One with melt from the reshape2 package and one with the tidy function from the tidytext package.
# example 1
dtm <- DocumentTermMatrix(df_corpus)
dtm_df <- reshape2::melt(as.matrix(dtm))
# remove 0 values and order the data.frame
dtm_df <- dtm_df[dtm_df$value > 0, ]
dtm_df <- dtm_df[order(dtm_df$value, decreasing = TRUE), ]
or using tidytext::tidy to get the data into a tidy format. No need to remove the 0 values as tidytext doesn't transform it into a matrix before casting it into a data.frame
# example 2
dtm_tidy <- tidytext::tidy(dtm)
# order the data.frame or start using dplyr syntax if needed
dtm_tidy <- dtm_tidy[order(dtm_tidy$count, decreasing = TRUE), ]
In my tests tidytext is a lot faster and uses less memory as there is no need to first create a dense matrix.

Related

Select all the rows in R that have some attributes from other columns R [duplicate]

This question already has answers here:
Subset dataframe by multiple logical conditions of rows to remove
(8 answers)
Closed 6 years ago.
I want to search my dataset for those values that some attributes from multiple columns.
For that, I found that I can use grep like so:
df <- read.csv('example.csv', header = TRUE, sep='\t')
df[grep("region+druggable", df$locus_type=="region", df$drug_binary==1),]
But when I run this, my output is the different column names.
Why is this happening?
my dataframe is like this:
id locus_type drug_binary
1 pseudogene 1
2 unknown 0
3 region 1
4 region 0
5 phenotype_only 1
6 region 1
...
So ideally, I would expect to get the 3rd and 6th row as a result of my query.

If you want to use base R, the correct syntax is the following:
df[grepl("region|druggable",df$locus_type) & df$drug_binary==1,]
Which gives the following ouput:
id locus_type drug_binary
3 3 region 1
6 6 region 1
Since you want to combine logic vectors you need to use grepl that has a logic output.
Also I assumed you wanted to check for locus type equal to region or druggable, the correct logic for the regex in grepl is the one I used above.

I like dplyr for its of readability
library(dplyr)
subdf <- filter(df, locus_type=="region", drug_binary==1)

sometimes it can be helpful to use the sqldf library.
?sqldf
SQL select on data frames
Description
SQL select on data frames
this is how you could get the result you need:
# load the sqldf library
# if you get error "Error in library(sqldf) : there is no package called sqldf"
# you can install it simply by typing
# install.packages('sqldf') <-- please notice the quotes!
library(sqldf)
# load your input dataframe
input.dataframe <- read.csv('/tmp/data.csv', stringsAsFactors = F)
# of course it's a data.frame
class(input.dataframe)
# express your query in SQL terms
sql_statement <- "select * from mydf where locus_type='region' and drug_binary=1"
# create a new data.frame as output of a select statement
# please notice how the "mydf" data.frame automagically becomes a valid sqlite table
output.dataframe <- sqldf(sql_statement)
# the output of a sqldf 'select' statement is a data.frame, too
class(output.dataframe)
# print your output df
output.dataframe
id locus_type drug_binary
3 region 1
6 region 1

Data handling: 2 independent factors, which decide the position of a numeric value in a new data frame

I am new to Stackoverflow and to R, so I hope you can be a bit patient and excuse any formatting mistakes.
I am trying to write an R-script, which allows me to automatically analyze the raw data of a qPCR machine.
I was quite successful in cleaning up the data, but at some point I run into trouble. My goal is to consolidate the data into a comprehensive table.
The initial data frame (DF) looks something like this:
Sample Detector Value
1 A 1
1 B 2
2 A 3
3 A 2
3 B 3
3 C 1
My goal is to have a dataframe with the Sample-names as row names and Detector as column names.
A B C
1 1 2 NA
2 3 NA NA
3 2 3 1
My approach
First I took out the names of samples and detectors and saved them in vectors as factors.
detectors = summary(DF$Detector)
detectors = names(detectors)
samples = summary(DF$Sample)
samples = names(samples)
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
Then I subsetted the detectors into a new dataframe based on the name of the detector in the dataframe.
for (i in 1:length(detectors)){
assign(detectors[i], DF[which(DF$Detector == detectors[i]),])
}
Then I initialize an empty dataframe with the right column and row names:
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
So now the Problem. I have to get the values from the detector subsets into the result dataframe. Here it is important that each values finds the way to the right position in the dataframe. The issue is that there are not equally many values since some samples lack some detectors.
I tried to do the following: Iterate through the detector subsets, compare the rowname (=samplename) with each other and if it's the same write the value into the new dataframe. In case it it is not the same, it should write an NA.
for (i in 1:length(detectors)){
for (j in 1:length(get(detectors[i])$Sample)){
result[j,i] = ifelse(get(detectors[i])$Sample[j] == rownames(result[j,]), get(detectors[i])$Ct.Mean[j], NA)
}
}
The trouble is, that this stops the iteration through the detector$Sample column and it switches to the next detector. My understanding is that the comparing samples get out of sync, yielding the all following ifelse yield a NA.
I tried to circumvent it somehow by editing the ifelse(test, yes, no) NO with j=j+1 to get it back in sync, but this unfortunately didn't work.
I hope I could make my problem understandable to you!
Looking forward to hear any suggestions, or comments (also how to general improve my code ;)

We can use acast from library(reshape2) to convert from 'long' to 'wide' format.
acast(DF, Sample~Detector, value.var='Value') #returns a matrix output
# A B C
#1 1 2 NA
#2 3 NA NA
#3 2 3 1
If we need a data.frame output, use dcast.
Or use spread from library(tidyr), which will also have the 'Sample' as an additional column.
library(tidyr)
spread(DF, Detector, Value)

merge and plot multiple text files

I have sixty text files, each with two columns as shown below, each representing a unique sample, and headed 'Coverage' and 'counts'. The length of each file differs by a few rows, because for some values of Coverage, the Count is zero, therefore not printed. Each file is about 1000 rows long. Each file is named in the format "B001.BaseCovDist.txt" to "B060.BaseCovDist.txt", and in R I have them as "B001" to "B060".
How can I combine the data frames by Coverage? This is complicated by missing rows. I've tried various approaches in bash, base R, reshape(2), and dplyr.
How can I make a single graph of the Counts(y-axis) against Coverage (x-axis) with each unique sample as a different series. Ggplot2 seems ideal but I seem to need a loop or a list to add the series without having to type out all of the names in full (which would be ridiculous).
One approach that seemed good was to add a third column that contains the unique sample name because this creates a molten dataset. However this didn't work in bash (awk) because the number of whitespace delimiters varies by row.
Any help would be very welcome.
Coverage Count
1 0 7089359
2 1 983611
3 2 658253
4 3 520767
5 4 448916
6 5 400904

A good starting point is to consider a long-format for the data vice a wide-format. Since you mentioned reshape2, this should make sense, but check out tidyr as well, as the docs for both document the differences between long/wide.
Going with a long format, try the following:
allfiles <- lapply(list.files(pattern='foo.csv'),
function(fname) cbind(fname=fname, read.csv(fname)))
dat <- rbind_all(allfiles)
dat
## fname Coverage Count
## 1 B001.BaseCovDist.txt 0 7089359
## 2 B001.BaseCovDist.txt 1 983611
## 3 B001.BaseCovDist.txt 2 658253
## 4 B001.BaseCovDist.txt 3 520767
## 5 B001.BaseCovDist.txt 4 448916
## 6 B001.BaseCovDist.txt 5 400904
ggplot(data=dat, aes(x=Coverage, y=Count, group=fname)) + geom_line()

Just to add to your answer, r2evans I added a gsub command so that the filename suffix is removed from the added column (and also some boring import modifers).
allfiles <- lapply(list.files(pattern='.BasCovDis.txt'), function(sample) cbind(sample=gsub("[.]BasCovDis.txt","", sample), read.table(sample, header=T, skip=3)))

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.

This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
# ensure character data
LOG_MESSAGE <- as.character(LOG_MESSAGE)
CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
rep(replace(values,
nchar(values)==0,
values[nchar(values) != 0]),
lengths))
})
# LOG_MESSAGE CURRENT_EVENT
# 1 FIRST_EVENT FIRST_EVENT
# 2 FIRST_EVENT
# 3 SECOND_EVENT SECOND_EVENT
# 4 SECOND_EVENT
# 5 SECOND_EVENT

The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <-
transform(dat,
Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
"_"),
`[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code,
na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the last one carried forward part).
The result of 1. is then converted to a character string
strplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first elements of these vectors,
So I use sapply() to run the subsetting function '['() and extract the 1st element from each list component.
The whole thing is wrapped in transform() so i) I don;t need to refer to dat$ and so I can add the result as a new variable directly into the data dat.

efficient string value count in large data.frame

I have a large dataframe (~ 600K rows) with a string-value column (link)
doc_id,link
1,http://example.com
1,http://example.com
2,http://test1.net
2,http://test2.net
2,http://test5.net
3,http://test1.net
3,http://example.com
4,http://test5.net
and I would like to count the number of times a certain string value occurs in the frame. The result should look like this:
link, count
http://example.com, 3
http://test1.net, 2
http://test2.net, 1
http://test5.net, 2
Is there an efficient way to do this in R? Converting the frame into a matrix doesn't work because of the frame size. Currently I am using the plyr package, but this is too slow.

The table function counts occurrences - and it's very fast compared to ddply. So, something like this perhaps:
# some sample data
set.seed(42)
df <- data.frame(doc_id=1:10, link=sample(letters[1:3], 10, replace=TRUE))
cnt <- as.data.frame(table(df$link))
# Assign appropriate names (optional)
names(cnt) <- c("link", "count")
cnt
Which gives the following output:
link count
1 a 2
2 b 3
3 c 5

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Summarizing R corpus with doc ID - r

Related

Select all the rows in R that have some attributes from other columns R [duplicate]

Data handling: 2 independent factors, which decide the position of a numeric value in a new data frame

merge and plot multiple text files

Filling Gaps in Time Series Data in R

efficient string value count in large data.frame

Categories

Resources