R Corrplot Variable Names with Integer in Leading Position - r

In my Excel file, the names of two of my variables are 2B and 3B, which stand for doubles and triples in baseball. However, when I use corrplot, they show up as X2B and X3B. I assume this is because it thinks I want to do multiplication. How would I go about fixing this?
I tried changing the cell format in Excel from General to Text.
Any help would be much appreciated.
EDIT:
I got this part figured out. So now I have:
baseball = read.csv(file="MultComp3.csv",row.names=1)
library(corrplot)
M <- cor(baseball)[1:16,1:16]
colnames(M) <- c("Age","Runs\nPer\nGame","Hits","Doubles","Triples",
"Home Runs","RBI","Stolen\nBases","Walks","Strike\nOuts",
"Batting\nAverage","On-Base\nPercentage","Slugging\nPercentage","OPS","OPS+","Total\nBases")
rownames(M) <- c("Age","Runs\nPer\nGame","Hits","Doubles","Triples",
"Home Runs","RBI","Stolen\nBases","Walks","Strike\nOuts",
"Batting\nAverage","On-Base\nPercentage","Slugging\nPercentage","OPS","OPS+","Total\nBases")
corrplot.mixed(M)
EDIT 2:
But now, I need to make the text smaller, because it comes out of the boxes.
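A minimal sketch covering both points, under some assumptions: the X prefix most likely comes from read.csv() passing the column names through make.names() (check.names = TRUE is the default), not from multiplication, and corrplot.mixed() should forward tl.cex (label size) and number.cex (coefficient size) to corrplot(). The CSV text and the mtcars matrix below are hypothetical stand-ins for MultComp3.csv, and the 0.6 values are only a starting point to tune.

```r
# Hypothetical stand-in for MultComp3.csv: two columns whose names start with a digit
csv_text <- "Player,2B,3B\nA,30,2\nB,25,5\nC,41,1"

# Default behaviour: names are made syntactically valid, so "2B" becomes "X2B"
default_names <- names(read.csv(text = csv_text, row.names = 1))

# check.names = FALSE keeps the header exactly as written in the file
kept_names <- names(read.csv(text = csv_text, row.names = 1, check.names = FALSE))

print(default_names)
print(kept_names)

# Shrinking the text in the plot (runs only if corrplot is installed)
if (requireNamespace("corrplot", quietly = TRUE)) {
  M <- cor(mtcars)  # stand-in correlation matrix
  corrplot::corrplot.mixed(M, tl.cex = 0.6, number.cex = 0.6)
}
```

If the labels still overflow, lowering tl.cex further or shortening the multi-line names are the obvious next things to try.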

Related

Looping over a text collection to extract subchapters

As a continuation of my example here, I'm now confronted with the problem that I want to extract subchapters for all documents in my document collection in R for further text mining. This is my sample data:
doc_title <- c("Example.docx", "AnotherExample.docx")
text <- c("One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
1 Introduction
He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.
1.1 Futher
The bedding was hardly able to cover it and seemed ready to slide off any moment.", "2.2 Futher Fuhter
'What's happened to me?' he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls.")
doc_corpus <- data.frame(doc_title, text)
This is the function to divide the text into subchapters:
divideInto_subchapters <- function(doc_corpus){
corpus_text <- doc_corpus$text
# Replace lines starting with N.N.N+ with space
corpus_text <- gsub("\\R\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", corpus_text, perl=TRUE)
# Split into IDs and Texts
data <- str_match_all(corpus_text, "(?sm)^(\\d+(?:\\.\\d+)?\\s+[A-Z][^\r\n]*)\\R(.*?)(?=\\R\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)")
# Get the chapter ID column
chapter_id <- trimws(data[[1]][,2])
# Get the text ID column
text <- trimws(data[[1]][,3])
# Create the target DF
corpus <- data.frame(doc_title, chapter_id, text)
return(corpus)
}
Now I want to loop over all elements in my doc_corpus and divide all plain text into subchapters. This is what I tried out so far:
subchapter_corpus <- data.frame()
for (i in 1:nrow(doc_corpus)) {
temp_corpus <- divideInto_subchapters(doc_corpus[i])
subchapter_corpus <- rbind(subchapter_corpus, temp_corpus)
}
Unfortunately, this returns an empty data frame. What am I getting wrong here? Any help is highly appreciated.
My expected output for the first df row looks like this:
doc_title <- c("Example.docx")
chapter_id <- (c("1 Introduction"))
text <- (c("He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections."))
chapter_one_df <- data.frame(doc_title, chapter_id, text)
So, for me the loop gave me "subscript out of bounds" until I changed doc_corpus[i] to doc_corpus[i, ]. With that change, I do get one row in the resulting data frame.
However, it's only chapter_id "2.2 Futher Fuhter"; "1.1 Futher" seems to be missing.
If it's a matter of the regex, then man it would sure help if you commented what you were doing with it! :)
Feel free to comment and I'll amend my answer as needed till it's helpful. Not sure if that's how it works, but this is only my 3rd day of answering questions on SO.
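To illustrate why doc_corpus[i] and doc_corpus[i, ] behave so differently, here is a small sketch with a cut-down, hypothetical version of the sample data. With a single index, a data frame is subset by column, so doc_corpus[1] is a one-column data frame without a text column, which would explain why the function has nothing to split.

```r
# Cut-down, hypothetical version of the sample corpus
doc_corpus <- data.frame(
  doc_title = c("Example.docx", "AnotherExample.docx"),
  text = c("first text", "second text"),
  stringsAsFactors = FALSE
)

by_column <- doc_corpus[1]    # single index: selects the first COLUMN
by_row    <- doc_corpus[1, ]  # row index plus comma: selects the first ROW

print(names(by_column))       # only doc_title is left
print(is.null(by_column$text))
print(by_row$text)
```

Inside the function it may also be safer to use doc_corpus$doc_title instead of the global doc_title, so that each chunk carries the title of its own row.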

R extract() doesn't accept my coordinates

First of all, I'm new to programming, so this might be a simple question, but I can't find the solution anywhere.
I've been using this code to extract values from a set of stacked rasters:
raster.files <- list.files()
raster.list <- list()
raster.files <-list.files(".",pattern ="asc")
for(i in 1: length(raster.files)){
raster.list[i] <- raster(raster.files[i])}
stacking <- stack(raster.list)
coord <- read.csv2("...")
extract.data <- extract(stacking,coord,method="simple")
I already used this code several times without any problem, until now. Every time I run the extract line I get this error:
Error in .doCellFromXY(object@ncols, object@nrows, object@extent@xmin, :
Not compatible with requested type: [type=character; target=double].
The coord file consists of a data.frame with 2 columns (X and Y, respectively).
I've managed to find a way to bypass this error. It's not technically a solution, because I still don't understand why R was treating my data as text in the first place.
Basically, I separated the X and Y columns, converted each to numeric individually, and then bound them together again in a new data.frame:
coord_matrix_x<-as.numeric(as.matrix(coord[1]))
coord_matrix_y<-as.numeric(as.matrix(coord[2]))
coord2 <- cbind(coord_matrix_x, coord_matrix_y)
coord2<-as.data.frame(coord2)
coordinates(coord2)<-c("coord_matrix_x","coord_matrix_y")
It's far from the most elegant way to do it, but it works.
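One possible cause, offered as an assumption since the coordinate file isn't shown: read.csv2() expects a comma as the decimal mark (dec = ","), so values written with a point, like -8.5, would be read as text. The sketch below uses a made-up two-row file to show the effect, plus a shorter way to convert every column to numeric.

```r
# Hypothetical semicolon-separated coordinates that use "." as the decimal mark
csv_text <- "X;Y\n-8.5;40.2\n-7.9;41.0"

# read.csv2() defaults to dec = ",", so these values are not recognised as numbers
coord_chr <- read.csv2(text = csv_text, stringsAsFactors = FALSE)
print(sapply(coord_chr, class))

# Option 1: tell read.csv2() which decimal mark the file really uses
coord_num <- read.csv2(text = csv_text, dec = ".")
print(sapply(coord_num, class))

# Option 2: convert the already-loaded columns in one step
coord_fixed <- data.frame(lapply(coord_chr, as.numeric))
print(coord_fixed)
```

Running str(coord) right after reading the real file would show whether the columns came in as character or factor, which is worth checking before calling extract().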

use ape to phase a fasta file and create a DNAbin file as output, then test tajima's D using pegas

I'm trying to complete the very simple task of reading in an unphased fasta file and phasing it using ape, and then calculating Tajima's D using pegas, but my data doesn't seem to be reading in correctly. Input and output are as follows:
library("ape")
library("adegenet")
library("ade4")
library("pegas")
DNAbin8c18 <- read.dna(file="fasta8c18.fa", format="f")
I shouldn't need to attach any data since I've just generated the file, but since the data() command was in the manual, I executed
data(DNAbin8c18)
and got
Warning message: In data(DNAbin8c18) : data set ‘DNAbin8c18’ not found
I know that data() only works in certain contexts, so maybe this isn't a big deal. I looked at what had been loaded
DNAbin8c18
817452 DNA sequences in binary format stored in a matrix.
All sequences of same length: 96
Labels:
CLocus_12706_Sample_1_Locus_34105_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_2_Locus_31118_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_3_Locus_30313_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_5_Locus_33345_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_7_Locus_37388_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_8_Locus_29451_Allele_0 [BayOfIslands_s09... ...
More than 10 million nucleotides: not printing base composition
so it looks like the data should be fine. Because of this, I tried what I want to do
tajima.test(DNAbin8c18)
and got
Error: cannot allocate vector of size 2489.3 Gb
Many people have completed this same test using as many or more SNPs than I have, and also using FASTA files, but is it possible that mine is too big, or can you see another issue?
The data file can be downloaded at the following link
https://drive.google.com/open?id=0B6qb8IlaQGFZLVRYeXMwRnpMTUU
I have also sent an earlier version of this question, with the data, to the r-sig-genetics mailing list, but I have not heard back.
Any thoughts would be much appreciated.
Ella
Thank you for the comment. Indeed, you are correct. The developer just emailed me with the following very helpful comments.
The problem is that your data are too big (too many sequences) and tajima.test() needs to compute the matrix of all pairwise distances. You could check this by trying:
dist.dna(DNAbin8c18, "N")
One possibility for you is to sample randomly some observations, and repeat this many times, e.g.:
n <- nrow(DNAbin8c18) # number of sequences
tajima.test(DNAbin8c18[sample(n, size = 1000), ])
This could be:
N <- 1000 # number of repeats
RES <- matrix(NA_real_, nrow = 3, ncol = N) # one column per repeat
for (i in 1:N)
RES[, i] <- unlist(tajima.test(DNAbin8c18[sample(n, size = 10000), ]))
You may adjust N and 'size =' to have something not too long to run. Then you may look at the distribution of each row of RES across the repeats (tajima.test() returns D and two P-values).
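As a quick sanity check on the developer's explanation, the size in the error message can be reproduced with plain arithmetic. The sketch assumes one 8-byte double is stored per pair of sequences, which is how a distance object is usually held in memory.

```r
n_seq <- 817452                     # number of sequences reported when printing DNAbin8c18

n_pairs <- n_seq * (n_seq - 1) / 2  # number of pairwise distances
bytes   <- n_pairs * 8              # assumption: one 8-byte double per distance
gib     <- bytes / 1024^3           # R reports this unit as "Gb"

print(round(gib, 1))                # 2489.3, the size in the error message
```

With size = 10000 the same formula gives roughly 0.4 Gb per repeat, which is why subsampling makes the test feasible.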

Does R produce warnings when it runs out of space from read.csv command

This question is pretty simple and maybe even dumb, but I can't find an answer on google. I'm trying to read a .txt file into R using this command:
data <- read.csv("perm2test.txt", sep="\t", header=FALSE, row.names=1, col.names=paste("V", seq_len(max(count.fields("perm2test.txt", sep="\t"))), sep=""), fill=TRUE)
The reason I have the col.names command is because every line in my .txt file has a different number of observations. I've tested this on a much smaller file and it works. However, when I run it on my actual dataset (which is only 48MB), I'm not sure if it is working... The reason I'm not sure is because I haven't received an error message, yet it has been "running" for over 24 hours at this point (just the read.csv command above). Is it possible that it has run out of memory and it just doesn't output a warning?
I've looked around and I know people say there are functions out there to reduce the size and remove lines that aren't needed, etc. but to be honest I don't think this file is THAT big, and unfortunately I do need every line in the file... (it's actually only 70 lines, but some lines contain as much as 100k entries, while others may only have say 100). Any ideas what is happening?
Obviously untested but should give you some code to modify:
datL <- readLines("perm2test.txt") # one line per group
# may want to exclude some lines but question is unclear
listL <- lapply(datL, function(L) read.delim(text=L, header=FALSE, colClasses="numeric") )
# This is a list of one-row data frames, one per group
dfL <- data.frame( vals = unlist(listL),
# Now build a grouping vector that is associated with each bundle of values
# (numbered labels, because LETTERS only covers 26 groups and the file has 70 lines)
groups= rep( paste0("G", seq_along(listL)) ,
sapply(listL, length) ) )
# Might have been able to do that last maneuver with `stack`.
library(lattice)
bwplot( vals ~ groups, data=dfL)
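Another way to avoid padding every line out to 100k columns is to keep the data as a list of vectors. This is only a sketch on a tiny made-up file; it assumes the first field of each line is a label (as row.names=1 in the question suggests) and that the remaining fields are numeric.

```r
# Hypothetical ragged tab-separated file standing in for perm2test.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("g1\t1\t2\t3", "g2\t4\t5", "g3\t6"), tmp)

# Split each line on tabs; no rectangular table is ever built
fields <- strsplit(readLines(tmp), "\t", fixed = TRUE)

# Drop the label (first field) and convert the rest to numbers
values <- lapply(fields, function(f) as.numeric(f[-1]))
names(values) <- vapply(fields, `[`, character(1), 1)

print(lengths(values))  # number of entries per line
```

A list like this can go straight into boxplot(values), or be stacked into a long data frame if a grouping column is needed.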

word frequency scatterplot in R (words as labels)

I'm currently working on a paper comparing British MPs' roles in Parliament and their roles on twitter. I have collected twitter data (most importantly, the raw text) and speeches in Parliament from one MP and wish to do a scatterplot showing which words are common in both twitter and Parliament (top right hand corner) and which ones are not (bottom left hand corner). So, x-axis is word frequency in parliament, y-axis is word frequency on twitter.
So far, I have done all the work on this paper with R. I have ZERO experience with R, up until now I've only worked with STATA.
I tried adapting this code (http://is-r.tumblr.com/post/37975717466/text-analysis-made-too-easy-with-the-tm-package), but I just can't work it out. The main problem is that the person who wrote this code uses one text document and regular expressions to demarcate which text belongs on which axis. I, however, have two separate documents (I have saved them as .txt, corpora, or term-document matrices) which should correspond to the separate axes.
I'm sorry that a novice such as myself is bothering you with this, and I will devote more time this year to learning the basics of R so that I could solve this problem by myself. However, this paper is due next Monday and I simply can't do so much backtracking right now to solve the problem.
I would be really grateful if you could help me,
thanks very much,
Nik
EDIT: I'll put in the code that I've made, even though it's not quite in the right direction, but that way I can offer a proper example of what I'm dealing with.
I have tried implementing is.R()'s approach by putting the text in question in a csv file, with a dummy variable to classify whether it is twitter text or speech text. I follow the approach, and at the end I even get a scatterplot. However, it plots a number (I think it is the position of the word in the dataset?) rather than the word. I think the problem might be that R is handling every line in the csv file as a separate text document.
# in excel i built a csv dataset that contains all the text, each instance (single tweet / speech) in one line, with an added dummy variable that clarifies whether the text is a tweet or a speech ("istweet", 1=twitter).
comparison_watson.df <- read.csv(file="data/watson_combo.csv", stringsAsFactors = FALSE)
# now to make a text corpus out of the data frame
comparison_watson_corpus <- Corpus(DataframeSource(comparison_watson.df))
inspect(comparison_watson_corpus)
# now to make a term-document-matrix
comparison_watson_tdm <-TermDocumentMatrix(comparison_watson_corpus)
inspect(comparison_watson_tdm)
comparison_watson_tdm <- inspect(comparison_watson_tdm)
sort(colSums(comparison_watson_tdm))
table(colSums(comparison_watson_tdm))
termCountFrame_watson <- data.frame(Term = rownames(comparison_watson_tdm))
termCountFrame_watson$twitter <- colSums(comparison_watson_tdm[comparison_watson.df$istwitter == 1, ])
termCountFrame_watson$speech <- colSums(comparison_watson_tdm[comparison_watson.df$istwitter == 0, ])
head(termCountFrame_watson)
zp1 <- ggplot(termCountFrame_watson)
zp1 <- zp1 + geom_text(aes(x = twitter, y = speech, label = Term))
print(zp1)
library(tm)
txts <- c(twitter="bla bla bla blah blah blub",
speech="bla bla bla bla bla bla blub blub")
corp <- Corpus(VectorSource(txts))
term.matrix <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(term.matrix)
colnames(term.matrix) <- names(txts)
term.matrix <- as.data.frame(term.matrix)
library(ggplot2)
ggplot(term.matrix,
aes_string(x=names(txts)[1],
y=names(txts)[2],
label="rownames(term.matrix)")) +
geom_text()
You might also want to try out these two buddies:
library(wordcloud)
comparison.cloud(term.matrix)
commonality.cloud(term.matrix)
You are not posting a reproducible example, so I cannot give you code, only point you to resources. Text scraping and processing is a bit difficult with R, but there are many guides. Check this and this. In the last steps you can get word counts.
In the example from One R Tip A Day you get the word list at d$word and the word frequency at d$freq
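For completeness, the same kind of plot can be sketched in base R without tm or ggplot2. It reuses the toy texts from the answer above; count_words() and lookup() are hypothetical helpers written for this example, and the tokenisation (lower-case, split on anything that is not a letter) is a deliberate simplification.

```r
txts <- c(twitter = "bla bla bla blah blah blub",
          speech  = "bla bla bla bla bla bla blub blub")

# Hypothetical helper: word counts as a plain named integer vector
count_words <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")[[1]]
  counts <- table(tokens[nzchar(tokens)])
  setNames(as.integer(counts), names(counts))
}

# Hypothetical helper: counts for a fixed word list, 0 where a word is absent
lookup <- function(counts, words) {
  out <- counts[words]
  out[is.na(out)] <- 0L
  unname(out)
}

tw <- count_words(txts[["twitter"]])
sp <- count_words(txts[["speech"]])
words <- union(names(tw), names(sp))

freq <- data.frame(word = words,
                   twitter = lookup(tw, words),
                   speech = lookup(sp, words),
                   stringsAsFactors = FALSE)
print(freq)

# Empty plot frame first, then the words themselves as labels
plot(freq$speech, freq$twitter, type = "n",
     xlab = "Frequency in Parliament", ylab = "Frequency on twitter")
text(freq$speech, freq$twitter, labels = freq$word)
```

For real data, each of the two strings would be the full text of one source collapsed into a single string, for example with paste(readLines(...), collapse = " ").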

Resources