I'm trying to format my xls data so that the first row is visible in R but is not analysed, as in this example: http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData
When you open the expression data there with exprs(bm), the first row gives you the gene names, but these aren't, for example, log-transformed.
I formatted my own data into a similar table, but I cannot figure out how to keep the first row from showing up in R as data and, more importantly, from being used in calculations, which of course results in errors all the way.
Hope that makes sense?
Cheers
I need code that automatically extracts tables from PDF files in R.
Searching online, I found the tabulizer package,
and I used
extract_tables(f2, pages = 25, guess = TRUE, encoding = 'UTF-8', method = "stream")  # f2 is the PDF file name
I tried every method type, but the output is not tidy.
Some columns are mixed together and there are a lot of blanks, as you can see in the image.
I could fix the data by hand, but the goal is to automate this, so a general method is needed. Not every PDF file is well organized: some tables are very tidy, with every related line matched perfectly, but others are not.
As you can see in my output image, in column 4 the numbers are mixed together in the same column. In the other columns, the numbers are matched one by one. What I mean is that I want the columns to come out tidy automatically, like the table in the PDF.
Is there any package or method to make the extracted table tidy?
my Code result
table in PDF
I am trying to build a separate document-term matrix for each individual row of a CSV file. I have successfully read the CSV file into RStudio using read.csv. As far as I could figure out, the first step to creating a document-term matrix with the tm package is to create an individual corpus for each row of the file, so I wrote the following code.
for(i in 1:no_row)
{
data$TextCorpus[i]<-Corpus(VectorSource(data$Text[i]))
#print(data$TextCorpus[i])
}
Here no_row is the number of rows in the data (set with no_row <- nrow(data)), data$TextCorpus is a column I created to store the corpus's created by the loop, and data$Text is the column whose data is used to create the individual corpus's.
I expected that this would produce a corpus for each individual row. However, when I apply the class() function to the data$TextCorpus column, it reports that the column is a list, which prevents me from applying tm_map functions to individual rows of the column. Furthermore, applying as.Corpus or any similar function to the column has no effect: data$TextCorpus is still classified as a list. Does anybody know how to fix this problem? Any help is greatly appreciated.
P.S. If corpus's isn't the plural of corpus, please feel free to correct me in your response.
I have a data frame loaded from a CSV file in R with read.csv, like
mySheet <- read.csv("Table.csv", sep=";")
I now can print a summary on that mySheet object
summary(mySheet)
and it will show me a summary for each column. For example, one column named Diagnose has the unique values RCM, UCM and HCM, and the summary shows the number of occurrences of each of these values.
I now filter by one diagnosis, like
subSheet <- mySheet[mySheet$Diagnose=='UCM',]
which seems to be working, when I just type subSheet in the console it will print only the rows where the value has been matched with 'UCM'
However, if I do a summary on that subSheet, like
summary(subSheet)
it still 'knows' about the other two possibilities, RCM and HCM, and prints them with a count of 0. However, I expected that the newly created object would NOT know about the possible values of the original mySheet I initially loaded.
Is there any way to get rid of those other possible values after filtering? I also tried subset but this one just seems to be some kind of shortcut to '[' for the interactive mode... I also tried DROP=TRUE as option, but this one didn't change the game.
Totally mind squeezing :D Any help is highly appreciated!
What you are dealing with here are factors created when reading the csv file. You can get subSheet to forget the unused factor levels with
subSheet$Diagnose <- droplevels(subSheet$Diagnose)
or
subSheet$Diagnose <- subSheet$Diagnose[ , drop=TRUE]
just before you do summary(subSheet).
Personally I dislike factors, as they cause me too many problems, and I only convert strings to factors when I really need to. So I would have started with something like
mySheet <- read.csv("Table.csv", sep=";", stringsAsFactors=FALSE)
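To see the effect without the original file, here is a minimal sketch. The data frame is a made-up stand-in for Table.csv (only the Diagnose column name comes from the question; the Age column is hypothetical). The column is built with factor() explicitly, since recent R versions no longer convert strings to factors by default.

```r
# Hypothetical stand-in for Table.csv; only the Diagnose column mirrors the question
mySheet <- data.frame(
  Diagnose = factor(c("RCM", "UCM", "HCM", "UCM")),
  Age = c(51, 47, 63, 58)
)

# Filtering keeps all three levels attached to the factor
subSheet <- mySheet[mySheet$Diagnose == "UCM", ]
levels_before <- levels(subSheet$Diagnose)

# droplevels() on a data frame drops unused levels from every factor column
subSheet <- droplevels(subSheet)
levels_after <- levels(subSheet$Diagnose)

print(levels_before)
print(levels_after)
summary(subSheet)
```

Calling droplevels() on the whole data frame rather than on one column is handy when several factor columns were filtered at once.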
I'm a total newbie with R, and I'm trying to create a histogram (with value and frequency as the axes) from a csv file (just one row of values). Any idea how I can do this?
I'm also an R newbie, and I ran into the same thing. I made two separate mistakes, actually, so I'll describe them both here.
Mistake 1: Passing a frequency table to hist(). Originally I was trying to pass a frequency table to hist() instead of passing in the raw data. One way to fix this is to use the rep() ("replicate") function to explode your frequency table back into a raw dataset, as described here:
Creating a histogram using aggregated data
Simple R (histogram) from counted csv file
Instead of that, though, I just decided to read in my original dataset instead of the frequency table.
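For reference, the rep() approach mentioned under Mistake 1 might look roughly like the sketch below. The frequency table and its column names (value, count) are invented for illustration, and the breaks are chosen so that each value gets its own bin.

```r
# Hypothetical frequency table: each value and how many times it was observed
freq <- data.frame(value = c(1, 2, 3, 4), count = c(2, 5, 3, 1))

# Repeat each value 'count' times to rebuild the raw observations
raw <- rep(freq$value, times = freq$count)

# plot = FALSE only computes the histogram; drop it to draw the plot
h <- hist(raw, breaks = seq(0.5, 4.5, by = 1), plot = FALSE)
print(h$counts)
```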
Mistake 2: Wrong data type. My raw data CSV file contains two columns: hostname and bookings (idea is to count the number of bookings each host generated during some given time period). I read it into a table.
> tbl <- read.csv('bookingsdata.csv')
Then when I tried to generate a histogram off the second column, I did this:
> hist(tbl[2])
This gave me the "'x' must be numeric" error you mention in a comment. (tbl[2] is a one-column data frame, not a numeric vector, so hist() rejects it.)
This fixed it:
> hist(tbl$bookings)
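The difference between the two calls can be checked directly. This is a small sketch with an invented table in place of bookingsdata.csv; the values are hypothetical, only the column names come from the description above.

```r
# Hypothetical stand-in for bookingsdata.csv
tbl <- data.frame(hostname = c("a", "b", "c"), bookings = c(3, 7, 5))

# Single brackets keep the data frame wrapper, which hist() cannot use
single <- tbl[2]
print(class(single))

# $ and double brackets both extract the underlying numeric vector
by_name <- tbl$bookings
by_position <- tbl[[2]]
print(class(by_name))

# Either extracted vector is valid input for hist()
h <- hist(by_name, plot = FALSE)
```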
You should really start by reading one of the basic R manuals...
CRAN offers a lot of them (look into the Manuals and Contributed sections)
In any case:
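setwd("path/to/csv/file")
myvalues <- read.csv("filename.csv", header = FALSE)  # a single row of values, so no header line
hist(as.numeric(unlist(myvalues)), breaks = 100)  # read.csv returns a data frame, so flatten it first; 100 breaks is just an example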
See the manual pages for those functions for more help (accessible through ?read.table, ?read.csv and ?hist).
To plot a histogram, the values must be numeric. Here x seems to be of some other class.
Run the following command and see:
sapply(myvalues[1,],class)
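As an illustration of what that check can reveal, here is a sketch with a made-up one-row CSV in which one field is not a number. The inline text and the resulting column names (V1 to V4) are assumptions for the example, not the asker's data.

```r
# Hypothetical one-row CSV in which one field is not a number
myvalues <- read.csv(text = "3,5,x,8", header = FALSE, stringsAsFactors = FALSE)

# Class of every column: the "x" field makes V3 a character column
classes <- sapply(myvalues[1, ], class)
print(classes)

# Keep only the numeric columns and flatten them into a plain vector
vals <- unlist(myvalues[1, sapply(myvalues, is.numeric)])

# plot = FALSE only computes the histogram; drop it to draw the plot
h <- hist(vals, plot = FALSE)
```

Silently dropping non-numeric fields may hide a data problem, so it is worth looking at the offending columns before discarding them.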
I am currently analysing a rather large dataset (22k+ records) and am having some trouble getting the data into a wide format (with one row corresponding to each observation, and columns representing variables).
The data came in two CSV files, one giving demographics and the other giving participants' probability ratings for a number of questions. Both of these CSV files were in long format.
I have used the reshape (and reshape2, for speed) packages to attempt to solve my problem. The specific issue I am having is the following.
I have the participants' probability ratings in the following form (after one successful reshape).
dtf <- read.csv("http://dl.dropbox.com/u/8566396/foobar.csv")
Now, the format I would like my data to be in is as follows:
User ID, Qid1, ..., Qid255, Time, with the probabilities for each question in that question's corresponding column.
I have tried a loop and apply to put the values into a new data frame, and many variations of melt and cast. I have also tried the base reshape function, but all to no avail.
In the past, I've always edited my CSV files directly, but this is not an option with the size of this file (my laziness when it comes to data manipulation within R has come back to haunt me).
Any advice or solution you can give to avoid me having to do this by hand would be greatly appreciated.
Your dataset has 6 rows, 3 of which have the column "variable" equal to "probability" and 3 of which have that column equal to "time". You want to have probability be the value of each, and time be added onto the right.
I think there's a difficulty in making this work for you because what you want to do isn't clear. You have values for each UID-Time-X### cell, and values for each UID-Prob-X### cell. Therefore, you have to discard information to get it into your preferred format (UID-Time-X### with probabilities as the values). It seems to me like you're treating time as an ID variable, but it's storing values like a content variable.
To avoid discarding any data, your output would have to look something like:
UID Time1 Time2 Time3 Prob1 Prob2 Prob3
which is simply the data reshaped wide.
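A minimal sketch of that wide layout with base reshape() follows. The data frame is invented (two users, three questions) and its column names (UID, Qid, Prob, Time) are assumptions, since the layout of the linked file is not reproduced here.

```r
# Hypothetical long data: one row per user and question
long <- data.frame(
  UID  = rep(1:2, each = 3),
  Qid  = rep(1:3, times = 2),
  Prob = c(0.2, 0.5, 0.9, 0.1, 0.4, 0.7),
  Time = c(12, 30, 45, 10, 25, 50)
)

# One row per UID; Prob and Time each get one column per question
wide <- reshape(long, idvar = "UID", timevar = "Qid",
                direction = "wide", sep = "")
print(names(wide))
print(wide)
```

Because both Prob and Time are spread across the question columns, no values are discarded.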