How to modify the number of characters in a vector?

I have a dataset in which the age of the respondent was an open-ended question, so the responses sometimes look as follows:
Age:
23
45
36 years
27
33yo
...
I would like to keep the numeric data without introducing (and then having to filter out) NAs, and I wondered whether there is a way to turn it into this:
Age:
23
45
36
27
33
...
...by restricting the number of characters in each element of the vector and then converting the result to "numeric".
I believe there is a simple one-liner for this. I just couldn't find it.
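A minimal sketch of two ways this could be done in base R. The vector name age_raw is made up for illustration and stands in for the real column. The first option strips every non-digit character, the second truncates each string to two characters as the question suggests (which only works while every age has exactly two digits).

```r
# Hypothetical vector standing in for the open-ended age responses
age_raw <- c("23", "45", "36 years", "27", "33yo")

# Option 1: remove everything that is not a digit, then convert
age_num <- as.numeric(gsub("[^0-9]", "", age_raw))

# Option 2: keep only the first two characters, then convert
# (assumes every age is exactly two digits long)
age_num2 <- as.numeric(substr(age_raw, 1, 2))

print(age_num)
print(age_num2)
```

Option 1 is probably the safer of the two, since it does not depend on the number of digits, but it would also glue together digits from answers such as "36 or 37".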

Related

R: Getting a value from a table based on a loop

I have a loop in which I am trying to build a table by grabbing information from a driver table that I import. What I'm stuck on is that I want to pick a different column on each pass through the loop, something like this:
In the first loop through I want it to function like
df$a <- Driver$M1[i]
and then in the second loop through function like
df$a <- Driver$M2[i] and so on
Through searching I thought I had come across the solution of
df$a <- get(paste0("Driver$M",j,"[i]")) but I get the error
object 'Driver$M1[i]' not found
so I don't think "get" functions like I thought it did.
Could someone help me find out how to make this work?
Thanks
Iterating over the columns of a table "smoke"
> smoke
        High Low Middle
current   51  43     22
former    92  28     21
never     68  22      9
is as simple as
> for (i in colnames(smoke)) {t = smoke[,i]; print(i); print(t)}
[1] "High"
current  former   never
     51      92      68
[1] "Low"
current  former   never
     43      28      22
[1] "Middle"
current  former   never
     22      21       9
Thanks to everyone who looked at this. I kept searching and came across a different way of writing it that seems to do what I was looking for: Driver[i,paste0("M",j)]
I'm not very experienced, so I don't want to share incorrect information, but it seems that the $ operator can't accept variables, whereas in Driver[row, column] the column argument expects a string anyway, so paste0() now works the way I want it to.
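For illustration, here is a small sketch of that indexing on a made-up Driver table (the values and the column count are assumptions). get() looks up an object by its name and does not evaluate an expression such as Driver$M1[i], which is why the original attempt reported that the object was not found.

```r
# Hypothetical driver table with columns M1 and M2
Driver <- data.frame(M1 = c(10, 20, 30), M2 = c(1, 2, 3))

i <- 2                       # row to pick
j <- 1                       # column number to pick
col_name <- paste0("M", j)   # builds the string "M1"

# Matrix-style indexing: row i, column named by a string
v1 <- Driver[i, col_name]

# Equivalent list-style indexing: pull the column, then the element
v2 <- Driver[[col_name]][i]

print(v1)
print(v2)
```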

Create a new column in R spreadsheet with a specific calculation

I have looked for an answer to this on stackexchange but the questions being asked are way more complicated than what I need.
I have a table in R
Teacher Name   Usage_in_MINS   Usage Rate
Kelper         78
Kelper         85
Smith          85
Kelper         45
Smith          65
7th Grade      45
4th Grade      34
How do I get R to create a new column called Usage Rate?
How do I get this new column to take the values in Usage_in_MINS and divide them by 60 for only those classes that are either Kelper or Smith? What if I want it to calculate usage rates for Kelper and Smith and everyone else as well?
There are a lot of really good basic tutorials on R out there, and you would really help yourself by checking out one or two of them, because your question indicates some significant naivete. :-) Here's one way to do what you'd like, assuming that your data.frame is called "data" and that its columns are named TeacherName and Usage_in_MINS:
data$UsageRate[data$TeacherName %in% c("Kelper", "Smith")] <-
  data$Usage_in_MINS[data$TeacherName %in% c("Kelper", "Smith")] / 60
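As a self-contained sketch of both variants asked about, the block below rebuilds the table from the question (the column names TeacherName, UsageRate and UsageRateAll are assumptions) and computes the rate once for Kelper and Smith only and once for every row.

```r
# Rebuild the example table; column names are assumed for illustration
data <- data.frame(
  TeacherName   = c("Kelper", "Kelper", "Smith", "Kelper", "Smith",
                    "7th Grade", "4th Grade"),
  Usage_in_MINS = c(78, 85, 85, 45, 65, 45, 34),
  stringsAsFactors = FALSE
)

# TRUE for the rows that belong to Kelper or Smith
is_teacher <- data$TeacherName %in% c("Kelper", "Smith")

# Rate for Kelper and Smith only; every other row stays NA
data$UsageRate <- ifelse(is_teacher, data$Usage_in_MINS / 60, NA)

# Rate for everyone: the division is vectorised over the whole column
data$UsageRateAll <- data$Usage_in_MINS / 60

print(data)
```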

Sorting data in R

I have a dataset that I need to sort by participant (RECORDING_SESSION_LABEL) and by trial_number. However, when I sort the data using R none of the sort functions I have tried put the variables in the correct numeric order that I want. The participant variable comes out ok but the trial ID variable comes out in the wrong order for what I need.
using:
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(trial_number)),]
Participant ID comes out as:
118 118 118 etc. 211 211 211 etc. 306 306 306 etc.(which is fine)
trial_number comes out as:
1 1 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 2 2 20 20 .... (which is not what I want - it seems to be sorting lexically rather than numerically)
What I would like is for trial_number to be ordered like this within each participant number:
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 ....
I have checked that these variables are numeric and not factors, and I have also tried without the 'as.numeric', but with no joy. Looking around, I saw suggestions that sort() and mixedsort() might do the trick in place of 'order', but both come up with errors. I am slowly pulling my hair out over what I think should be a simple thing. Can anybody help shed some light on how to get what I need?
Even though you claim it is not a factor, it does behave exactly as if it were a factor. Testing if something is a factor can be tricky since a factor is just an integer vector with a levels attribute and a class label. If it is a factor, your code needs to have a call to as.character() nested inside the as.numeric():
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(as.character(trial_number))),]
To be really sure if it's a factor, I recommend the str() function:
str(trial_number)
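To illustrate why the nested as.character() matters, here is a small sketch with a made-up factor. Calling as.numeric() directly on a factor returns the internal level codes, and because the levels are sorted as text, ordering by those codes reproduces the lexical order from the question.

```r
# Hypothetical trial numbers stored as a factor
trial_fac <- factor(c("1", "10", "11", "2", "20", "3"))

# Internal integer codes of the levels, not the numbers themselves
codes <- as.numeric(trial_fac)

# The actual numeric values, recovered via the character labels
values <- as.numeric(as.character(trial_fac))

# Ordering by codes keeps the lexical order; ordering by values is numeric
print(as.character(trial_fac)[order(codes)])
print(as.character(trial_fac)[order(values)])
```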
I think it may be worthwhile for you to design your own function in this case. It wouldn't be too hard: basically you could design a bubble-sort algorithm with a few alterations. These alterations could change each number to a string and begin by sorting the strings into bins by their number of digits (easily done by checking how many characters each string has). Then, in a similar fashion, the numbers within each bin could be sorted by converting their digits back to a numeric type and comparing them. If you're interested, I could come up with some code for this. However, it looks like the two above me have beaten me to the punch with some of the built-in functions. I've never used those functions, so I'm not sure if they'll work as you intend, but there's no use in reinventing the wheel.

read.csv appends/modifies column headings with date values

I'm trying to read a csv file into R that has date values in some of the column headings.
As an example, the data file looks something like this:
ID  Type    1/1/2001  2/1/2001  3/1/2001  4/1/2011
A   Supply  25        35        45        55
B   Demand  26        35        41        22
C   Supply  25        35        44        85
D   Supply  24        39        45        75
D   Demand  26        35        41        22
...and my read.csv logic looks like this:
dat10 <- read.csv("c:/data.csv", header=TRUE, sep=",", as.is=TRUE)
The read.csv works fine except that it modifies the names of the columns with dates as follows:
X1.1.2001 X2.1.2001 X3.1.2001 X4.1.2001
Is there a way to prevent this, or an easy way to correct it afterwards?
Set check.names=FALSE. But be aware that 1/1/2001 et al are syntactically invalid names, therefore they may cause you some headaches.
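A short sketch of the difference, using an inline string instead of a file so that it runs anywhere (the two-row sample is made up from the question's data). With check.names = FALSE the headings are kept verbatim, but such columns then have to be accessed with [[ ]] or backticks rather than plain $.

```r
# Inline stand-in for the CSV file from the question
csv_text <- "ID,Type,1/1/2001,2/1/2001
A,Supply,25,35
B,Demand,26,35"

# Default behaviour: headings are passed through make.names()
dat_default <- read.csv(text = csv_text, as.is = TRUE)

# Headings kept exactly as they appear in the file
dat_raw <- read.csv(text = csv_text, as.is = TRUE, check.names = FALSE)

print(names(dat_default))
print(names(dat_raw))

# Syntactically invalid names need [[ ]] or backticks
print(dat_raw[["1/1/2001"]])
print(dat_raw$`2/1/2001`)
```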
You can always change the column names using the colnames function. For example,
colnames(dat10) = gsub("\\.", "/", colnames(dat10))
However, having slashes in your column names isn't a particularly good idea. You can always change them just before you print out the table or when you create a graph.
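Note that replacing the dots alone leaves the "X" that read.csv puts in front of names starting with a digit. A possible sketch that removes the prefix as well, applied only to names that look like mangled dates (the vector mangled is a made-up stand-in for colnames(dat10)):

```r
# Stand-in for colnames(dat10) after a default read.csv()
mangled <- c("ID", "Type", "X1.1.2001", "X2.1.2001")

# Only touch names of the form X<digits>.<digits>.<digits>
is_date_col <- grepl("^X[0-9]+\\.[0-9]+\\.[0-9]+$", mangled)

restored <- mangled
# Drop the leading X, then turn the dots back into slashes
restored[is_date_col] <- gsub("\\.", "/", sub("^X", "", mangled[is_date_col]))

print(restored)
```

Restricting the change to date-like names avoids altering other columns that happen to contain a dot or start with an X.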

How can I select human miRNA from affy chip while analyzing data using R?

I am new to R and want to analyze miRNA expression from a data set of 3 groups. Can anyone help me out?
In this case I got other miRNAs (on Affy chips) as the top expressed genes. Now I want to select only human miRNAs. Please help me.
Thanks in advance
Summary
I'm not entirely sure what your data frame looks like, given that I haven't worked with Affy chips before. Let me try to summarize what I think you have told us. You have a data frame with a list of all of the microRNAs on the Affy chip, along with their expression data. You want to select a subset of these microRNAs that are unique to humans.
Possible solution 1
You do not state whether or not your data frame contains a variable that identifies whether or not these microRNAs are indeed from humans. If it does have this information, all you would need to do is subset your data based on this identifier. Type help(subset) or help(Extract) for more information on how to do this.
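As a rough sketch of such a subset: miRBase identifiers carry a species prefix, with "hsa" standing for Homo sapiens, so if the probe names on the chip follow that scheme the prefix itself can serve as the identifier. The data frame affy, its columns and its values below are invented for illustration and will differ from the real chip annotation.

```r
# Hypothetical expression table; names and values are made up
affy <- data.frame(
  probe_id   = c("hsa-miR-21_st", "mmu-miR-21_st",
                 "hsa-let-7a_st", "cel-miR-1_st"),
  expression = c(12.1, 11.8, 10.4, 3.2),
  stringsAsFactors = FALSE
)

# Keep only the rows whose identifier starts with the human prefix
human <- subset(affy, grepl("^hsa-", probe_id))

print(human)
```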
Possible solution 2
If your data frame does not contain such an identifier, you will first need to make a list of all known human microRNAs. You could retrieve these manually from the online miRBase website (and then import them into R), or you could download them from Ensembl using the R package biomaRt. To do the latter, after loading biomaRt, you might type this command:
miRNA <- getBM(c("mirbase_id", "ensembl_gene_id", "start_position", "chromosome_name"), filters = c("with_mirbase"), values = list(TRUE), mart = ensembl)
The above code requests that R download the mirbase identifier, gene ID, start position, and chromosome name for all microRNAs in the miRBase catalog. (Note that you would have to specify the human Ensembl mart in an earlier command, which I have not shown).
Once you have downloaded this information, you could use a merge command or perhaps a which command to pull the appropriate microRNAs from your Affy chip data.
Recommendations
This all might sound a bit complicated. If you haven't already, I recommend that you spend some time working through exercises on biomaRt and Bioconductor. Information about these packages, and how to install them, is available at the links below:
Bioconductor, http://www.bioconductor.org/install/
Database mining with biomaRt, http://www.stat.berkeley.edu/~sandrine/Teaching/PH292.S10/Durinck.pdf
You might consider asking for this question to be migrated to Biostar. I think you would get better responses there. Also, consider editing your question to provide more information about your data. Good luck.
Edit to my original answer
In reference to your comment made at 2012-02-26 22:08:02, try the following:
## Load biomaRt package
library(biomaRt)
## Specify which "mart" (i.e., source of genetic data) that you want to use
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
## You can then ask the system what attributes are available for download
listAttributes(ensembl)
   name                    description
58 mirbase_accession       miRBase Accession(s)
59 mirbase_id              miRBase ID(s)
60 mirbase_gene_name       miRBase gene name
61 mirbase_transcript_name miRBase transcript
Above I have pasted part of the output from the listAttributes() command, which shows the relevant miRBase options. Now you can try the following code:
## Download microRNA data
miRNA <- getBM(c("mirbase_id", "ensembl_gene_id", "start_position", "chromosome_name"), filters = c("with_mirbase"), values = list(TRUE), mart = ensembl)
## Check how much we downloaded
> dim(miRNA)
[1] 715   4
## Peek at the head of our data
> head(miRNA)
      mirbase_id ensembl_gene_id start_position chromosome_name
1 hsa-mir-320c-1 ENSG00000221493       19263471              18
2 hsa-mir-133a-1 ENSG00000207786       19405659              18
3    hsa-mir-1-2 ENSG00000207694       19408965              18
4 hsa-mir-320c-2 ENSG00000212051       21901650              18
5    hsa-mir-187 ENSG00000207797       33484781              18
6   hsa-mir-1539 ENSG00000222690       47013743              18
## Check which chromosomes are contributing to our data
> table(miRNA$chromosome_name)
 1 10 11 12 13 14 15 16 17 18 19  2 20 21 22  3  4  5  6  7  8  9  X
50 27 26 25 15 59 26 15 35  7 85 23 32  5 16 31 23 30 17 33 27 28 80
Now your challenge will be to use this downloaded data to parse your original Affy data frame. Again, read the help files for the merge, Extract, and which functions to give it a try yourself first.
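If you get stuck, the following sketch shows the general shape of that step on two tiny made-up data frames (no download needed). It assumes that your Affy data has a column that can be matched against mirbase_id; here that column is simply given the same name, which will probably not be true of the real annotation.

```r
# Tiny stand-in for the table downloaded with getBM()
miRNA <- data.frame(
  mirbase_id      = c("hsa-mir-187", "hsa-mir-1539"),
  ensembl_gene_id = c("ENSG00000207797", "ENSG00000222690"),
  stringsAsFactors = FALSE
)

# Hypothetical Affy expression data; values are made up
affy <- data.frame(
  mirbase_id = c("hsa-mir-187", "mmu-mir-187", "hsa-mir-1539"),
  expression = c(8.2, 7.9, 5.1),
  stringsAsFactors = FALSE
)

# Inner join: keeps only the rows present in both tables
human_only <- merge(affy, miRNA, by = "mirbase_id")

# Alternative without adding columns: filter by membership
keep <- affy[affy$mirbase_id %in% miRNA$mirbase_id, ]

print(human_only)
print(keep)
```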
