simple question about r language in naming - r

I am a beginner in R and have a simple question about the language.
Thanks to many experts in this site, I am improving a lot.
I am always grateful for that, and anyone who's giving hand with this question, thank you in advance.
This is the code.
Data=sample(1:5,size=25,replace=T)
names(Data)=c("a","b","c","d","e")
I want to map each of 1, 2, 3, 4, 5 to a, b, c, d, e,
so I thought I could accomplish this with the code above.
I know that the right code is
Data=c("a","b","c","d","e")[Data]
But I can't understand why this is the right code and why I need the last [Data].
Any help would be really appreciated!! Thank you so much in advance!!:)

The last Data provides an index to subset values from c("a","b","c","d","e").
Let's take a simple example :
Consider,
a <- 1:10
Now to get the first value in a you can do
a[1]
#[1] 1
To get 3rd value in a you can do
a[3]
#[1] 3
To get 6th and 8th value in a you can do
a[c(6, 8)]
#[1] 6 8
What will happen if you repeat a certain index? Say you select 1 twice and 3 once.
a[c(1, 1, 3)]
#[1] 1 1 3
As you can see the first value is selected two times and third one time.
Now, the Data that you have serves as that index, and the vector being subset is c("a","b","c","d","e") instead of 1:10:
a <- c("a","b","c","d","e")
set.seed(123)
Data=sample(1:5,size=25,replace=T)
Data
#[1] 3 3 2 2 3 5 4 1 2 3 5 3 3 1 4 1 1 5 3 2 2 1 3 4 1
Now you use these Data values to subset a, giving
a[Data]
#[1] "c" "c" "b" "b" "c" "e" "d" "a" "b" "c" "e" "c" "c" "a" "d" "a" "a" "e" "c" "b" "b" "a" "c" "d" "a"
A side note: R has the built-in constants letters and LETTERS, which hold the 26 lower-case and upper-case letters of the alphabet.
letters
#[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
LETTERS
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

Here is a way that takes advantage of how objects of class "factor" are coded internally in R. In R, factors are coded as consecutive integers starting at 1, and what the user sees is their labels and levels, not the integer values. But the integer values do not go away; they are still there.
First, create a vector of integers like in the question but setting the RNG seed in order to make the results reproducible. This vector is saved for later.
set.seed(123)
Data <- sample(1:5, size = 25, replace = TRUE)
Saved <- Data
Now create the factor. Note that the labels argument is set to the letters "a" to "e".
Data <- factor(Data, labels = c("a","b","c","d","e"))
Data
# [1] c c b b c e d a b c e c c a d a a e c b b a c d a
#Levels: a b c d e
See the internal representation.
as.integer(Data)
# [1] 3 3 2 2 3 5 4 1 2 3 5 3 3 1 4 1 1 5 3 2 2 1 3 4 1
And compare with the initial values.
identical(Saved, as.integer(Data))
# [1] TRUE
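As a small illustrative sketch (not part of the answer above), the factor route and the plain indexing route from the question can be compared directly. The names labs, by_index and by_factor are introduced only for this example, and levels = 1:5 is passed explicitly on the assumption that some value might be missing from a given sample.

```r
set.seed(123)
Saved <- sample(1:5, size = 25, replace = TRUE)
labs <- c("a", "b", "c", "d", "e")

# Route 1: plain indexing, as in the question
by_index <- labs[Saved]

# Route 2: build a factor, then convert it back to character
by_factor <- as.character(factor(Saved, levels = 1:5, labels = labs))

# Both routes give the same character vector
identical(by_index, by_factor)
```

Either form works; the factor keeps the integer codes around, while indexing gives a plain character vector straight away.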

This is because Data contains the numbers you want to name in the order you want to name them. By adding [Data] to the end you are selecting the letters in the order of Data. To understand this, try what c("a","b","c","d","e")[c(1, 2)] does; it selects just the first two letters. If you instead type c("a","b","c","d","e")[c(5, 4)] it will select the last two letters, but in reverse order. Then if you print just Data, you'll see that it contains only the numbers from 1 to 5, which is the number of unique letters. So it will select the letters according to that order. You can see that all the numbers correspond to the letters in order by printing the correctly named Data.
Using names(Data)=c("a","b","c","d","e") does not work correctly because names are attached to elements by position and do not change the values: you are labelling only the first five of the 25 numbers (the rest get NA as their name), not translating each number into a letter.
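A minimal sketch of the difference between the two approaches, using a short made-up vector x and a three-letter labs vector (both are assumptions for illustration, not the data from the question):

```r
x <- c(3L, 1L, 2L, 3L, 1L, 2L, 3L)
labs <- c("a", "b", "c")

# names<- attaches labels by position and leaves the values unchanged;
# a names vector that is too short is padded with NA
named <- x
names(named) <- labs
names(named)      # "a" "b" "c" NA NA NA NA
unname(named)     # still 3 1 2 3 1 2 3

# Indexing looks each value up in labs instead
labs[x]           # "c" "a" "b" "c" "a" "b" "c"
```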

Related

Function to create a new variable not working in R

I am creating a function to help me quickly recode variables into numerical values, as a form of practice. The idea behind creating the function is to quickly recode several values into numerical form, for any length. If a dataset is really long for instance, the function in theory should recode all of these values without having to manually type out each condition in which to recode it into a specific value.
For instance:
levels(d$letters)
[1] a b c d
The general form of the function is to:
d$letters.recode[d$letters == "a"] <- 1
d$letters.recode[d$letters == "b"] <- 2
d$letters.recode[d$letters == "c"] <- 3
And using this function:
rc.f <- function(a, b){
x <- levels(a)
y <- length(a)
b <- NA
for (i in 1:y){
z <- b[a==x[i]] <- i
}
}
In theory, the idea is that this function should create another variable, where a is recoded as 1, b is recoded as 2 and so on.
However when I run rc.f(d$letters, d$letters.recode), no new variables are created in the dataset, and the function does not return an error.
Any ideas?
Thanks.
Another example dataset d:
Say for a list of respondents they are assigned a category depending on their region:
Respondent Region
1 d
2 b
3 g
4 c
5 e
6 c
7 f
8 a
I am looking for a way to recode d$Region into a numerical value, to d$Region.R.
Using the same function as above, I am wondering whether I can use the function to create another variable in the dataframe, by inputting d$Region and d$Region.R into the function. So recoding a,b,c,[...],g into 1,2,3,[...],7.
If you want a, b, f, d coded as 1, 2, 4, 3 (numbered by their position among the sorted levels), use the following.
I have updated your code for the function rc.f a little:
Removed the second argument b; since b is set to NA inside the function, the argument is not needed.
Removed z; we do not need another variable to store the value of b.
Since not every argument is a factor, we coerce it into a factor.
We do not need y; the loop only has to run over the levels, not over every element.
Last but not least, the last line of a function is its return value unless we use return, so I put b there.
The code is
rc.f <- function(a)
{
  a <- as.factor(a)                  # coerce so that levels() works on any input
  x <- levels(a)
  b <- rep(NA_integer_, length(a))   # result has the same length as the input
  for (i in seq_along(x))            # one pass per level
  {
    b[a == x[i]] <- i
  }
  b                                  # last expression is the return value
}
let us take an example
> l<-c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")
> l
[1] "a" "b" "b" "a" "a" "g" "h" "y" "f" "v" "h" "j" "f"
[14] "d" "a" "s" "s" "s"
> rc.f(l)
[1] 1 2 2 1 1 5 6 10 4 9 6 7 4 3 1 8 8 8
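If you want a, b, f, d coded as 1, 2, 6, 4 (their positions in the alphabet), use the following
rc.f <- function(a)
{
  a <- as.factor(a)
  b <- rep(NA_integer_, length(a))   # same length as the input, NA where nothing matches
  for (i in 1:26)
  {
    b[a == letters[i]] <- i          # letters is R's built-in a-z vector
  }
  b
}
Let us take an example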
> l<-c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")
> l
[1] "a" "b" "b" "a" "a" "g" "h" "y" "f" "v" "h" "j" "f" "d"
[15] "a" "s" "s" "s"
> rc.f(l)
[1] 1 2 2 1 1 7 8 25 6 22 8 10 6 4 1 19 19 19
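On the original question of why no new variable appeared: R passes arguments by value, so assigning to b inside the function never touches d; the function has to return the recoded vector and the caller has to assign it. A minimal sketch of that pattern, together with two vectorized base R equivalents of the loops above. The helper name recode_levels and the toy data frame are assumptions for illustration.

```r
d <- data.frame(Region = c("d", "b", "g", "c", "e", "c", "f", "a"))

# The function returns the codes; the caller stores them in the new column
recode_levels <- function(a) as.integer(as.factor(a))  # hypothetical helper

d$Region.R <- recode_levels(d$Region)
d$Region.R                               # 4 2 7 3 5 3 6 1

# Numbering by sorted level versus numbering by alphabet position
recode_levels(c("a", "b", "f", "d"))     # 1 2 4 3
match(c("a", "b", "f", "d"), letters)    # 1 2 6 4
```

as.integer(as.factor(x)) corresponds to the first version of rc.f, and match(x, letters) to the second.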

Grouping rows from an R dataframe together when randomly assigning to training/testing datasets

I have a dataframe that consists of blocks of X rows, each corresponding to a single individual (where X can be different for each individual). I'd like to randomly distribute these individuals into train, test and validation samples but so far I haven't been able to get the syntax correct to ensure that each of a user's X rows are always collected into the same subsample.
For example, the data can be simplified to look like:
user feature1 feature2
1 "A" "B"
1 "L" "L"
1 "Q" "B"
1 "D" "M"
1 "D" "M"
1 "P" "E"
2 "A" "B"
2 "R" "P"
2 "A" "F"
3 "X" "U"
... ... ...
and then if I ended up randomly assigning the users to a train, test or validation set, all of the rows for that user (the user number is unique) would be in the same set, and grouped together so that if user 1 was in the training set, for example, then the format would still be:
user feature1 feature2
1 "A" "B"
1 "L" "L"
1 "Q" "B"
1 "D" "M"
1 "D" "M"
1 "P" "E"
As a bonus I'd love to know if the solution to this could be extended to do k-folds cross validation, but so far I haven't even figured out this more simple first step.
Thanks in advance.
You can use sample():
# 60% for training, 20% each for testing and validation
indeces <- sample(1:nrow(df),nrow(df)*0.6)
df.train <- df[indeces,]
df <- df[-indeces,]
indeces <- sample(1:nrow(df),nrow(df)*0.5)
df.test <- df[indeces,]
df.validate <- df[-indeces,]
For k-fold cross-validation:
library(caret)
library(mlbench)
fld <- createFolds(df$your_dependent_variable, k= 10,list = TRUE, returnTrain = FALSE)
The above code splits the data into 10 folds. Run your model on each fold and validate it.
Edited:
user.df <- split( df , f = df$user )
This produces a list with one data frame per user. Use user.df[[1]] to access them individually.
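Note that the sample() code above draws rows, so one user's rows can land in different sets. A minimal sketch of sampling at the user level instead, in base R. The toy data frame, the 6/2/2 split of ten users, and the names assignment, parts and fold_of_user are assumptions made for illustration.

```r
set.seed(42)
# Toy data: 10 hypothetical users with a varying number of rows each
df <- data.frame(user = rep(1:10, times = c(6, 3, 1, 4, 2, 5, 3, 2, 4, 1)))
df$feature1 <- sample(LETTERS, nrow(df), replace = TRUE)

# Assign each user (not each row) to a set: 60% / 20% / 20%
users <- unique(df$user)
sets <- c("train", "test", "validate")
assignment <- sample(rep(sets, times = c(6, 2, 2)))  # one label per user (10 users)

# Map the per-user label back onto the rows
df$set <- assignment[match(df$user, users)]
parts <- split(df, df$set)   # parts$train, parts$test, parts$validate

# k-fold variant: give each user a fold number, then map it onto the rows
k <- 5
fold_of_user <- sample(rep_len(1:k, length(users)))
df$fold <- fold_of_user[match(df$user, users)]
```

Because split() keeps the original row order within each piece, each user's rows stay together and in order inside parts$train, parts$test and parts$validate.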

R igraph shortest distance

My code snippet is as below:
df = as.data.frame(rbind(
c("a","b",2),
c("b","d",2),
c("d","g",2),
c("g","j",8),
c("j","i",2),
c("i","f",6),
c("f","c",2),
c("c","a",4),
c("c","e",4),
c("e","h",2),
c("h","j",4),
c("e","g",1),
c("e","i",3),
c("e","b",7)
))
names(df) = c("start_node","end_node","dist")
# Convert this to "igraph" class
gdf <- graph.data.frame(df, directed=FALSE)
# Compute the min distances from 'a' to all other vertices
dst_a <- shortest.paths(gdf,v='a',weights=E(gdf)$dist)
# Compute the min distances from 'a' to 'j'
dst_a[1, which(V(gdf)$name == 'j')]
While it returns the result 12, I need to get the shortest path which in this case should be a - b - d - g - e - i - j. I have tried to use get.shortest.paths(), but in vain.
Try using get.all.shortest.paths(). Take into account that there may be more than one shortest path (e.g. try the same between 'a' and 'e').
sp=get.all.shortest.paths(gdf, "a", "j",weights=E(gdf)$dist)
sp
$res
$res[[1]]
[1] 1 2 3 4 9 6 5
$nrgeo
[1] 1 1 1 1 1 1 1 1 1 1
V(gdf)[sp$res[[1]]]$name
[1] "a" "b" "d" "g" "e" "i" "j"
What did you try with get.shortest.paths? Because this works:
> V(gdf)[get.shortest.paths(gdf,"a","j",weights=E(gdf)$dist)[[1]]]
Vertex sequence:
[1] "a" "b" "d" "g" "e" "i" "j"
get.shortest.paths returns a list of length 1 because I'm only asking it to calculate the shortest path from "a" to "j", so I take the first element of it.
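For reference, a sketch of the same lookup using the underscore-style function names found in more recent igraph releases (graph_from_data_frame, distances, shortest_paths, as_ids). It assumes such a version is installed, and it stores dist as a numeric column, which differs from the rbind() construction in the question.

```r
library(igraph)

# Same edge list as in the question, but with dist stored as a number
df <- data.frame(
  start_node = c("a", "b", "d", "g", "j", "i", "f", "c", "c", "e", "h", "e", "e", "e"),
  end_node   = c("b", "d", "g", "j", "i", "f", "c", "a", "e", "h", "j", "g", "i", "b"),
  dist       = c(2, 2, 2, 8, 2, 6, 2, 4, 4, 2, 4, 1, 3, 7)
)
g <- graph_from_data_frame(df, directed = FALSE)

# Total weighted distance from a to j
d_aj <- distances(g, v = "a", to = "j", weights = E(g)$dist)

# The vertices along one shortest path, as names
p <- shortest_paths(g, from = "a", to = "j", weights = E(g)$dist)
path_names <- as_ids(p$vpath[[1]])
path_names
```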

Analyze table in R to count nucleotide frequencies

I am quite new to R and I have a table of strings, I believe, that I extracted from a text file that contains a list of nucleotides (ex. "AGCTGTCATGCT.....").
Here are the first two rows of the text file to help as an example:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC
I need to count every "A" in the sequence by incrementing its variable, a. The same applies for G, C, and T (variables to increment are g, c, t respectively).
At the end of the "for" loop I want the number of times "A" "G" "C" and "T" nucleotides occurred so I can calculate the dinucleotide frequencies, and hopefully the transition matrix. My code is below. It doesn't work: it just returns each variable being equal to 0, which is wrong. Please help, thanks!
#I saved the newest version to a text file of the nucleotides
dnaseq <- read.table("/My path file/ecoli.txt")
g=0
c=0
a=0
t=0
for(i in dnaseq[[1]]){
if(i=="A") (inc(a)<-1)
if(i=="G") (inc(g)<-1)
if(i=="C") (inc(c)<-1)
if(i=="T") (inc(t)<-1)
}
a
g
c
t
The simplest way to get the counts of each nucleotide (or any kind of letter) is to use the table and strsplit functions. For example:
myseq = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
# split it into a vector of individual characters:
strsplit(myseq, "")[[1]]
# [1] "A" "G" "C" "T" "T" "T" "T" "C" "A" "T" "T" "C" "T" "G" "A" "C" "T" "G" "C" "A" "A" "C" "G" "G" "G" "C" "A" "A" "T" "A" "T" "G" "T" "C" "T" "C" "T" "G" "T"
# [40] "G" "T" "G" "G" "A" "T" "T" "A" "A" "A" "A" "A" "A" "A" "G" "A" "G" "T" "G" "T" "C" "T" "G" "A" "T" "A" "G" "C" "A" "G" "C"
# count the frequencies of each
table(strsplit(myseq, "")[[1]])
# A C G T
# 20 12 17 21
Now, if you don't care about the difference between one line and the next (if this is just one long sequence in ecoli.txt) then you want to combine the file into one long string first:
table(strsplit(paste(dnaseq[[1]], collapse = ""), "")[[1]])
That's the one line solution, but it might be clearer to see it in three lines:
combined.seq = paste(dnaseq[[1]], collapse = "")
combined.seq.vector = strsplit(combined.seq, "")[[1]]   # [[1]] takes the character vector out of the list
frequencies = table(combined.seq.vector)
If you're wondering what was wrong with your original code: first, I don't know where the inc function comes from (or why it wasn't throwing an error; are you sure dnaseq[[1]] has length greater than 0?), but in any case you weren't iterating over the sequence, you were iterating over the lines. i was never going to be a single character like A or T; it was always going to be a full line.
In any case, the solution with collapse, table and strsplit is both more concise and computationally efficient than a for loop (or a pair of nested for loops, which is what you would need).
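For comparison, a sketch of what the pair of nested for loops would look like, using two short made-up lines in place of the rows of ecoli.txt. The names dna_lines and counts are assumptions for illustration.

```r
# Two short made-up lines standing in for the rows of the file
dna_lines <- c("AGCTTTTC", "ATTCTGAC")

counts <- c(A = 0L, C = 0L, G = 0L, T = 0L)
for (line in dna_lines) {                  # outer loop walks the lines
  for (ch in strsplit(line, "")[[1]]) {    # inner loop walks the characters
    counts[ch] <- counts[ch] + 1L
  }
}
counts

# Same counts without explicit loops
loop_free <- table(strsplit(paste(dna_lines, collapse = ""), "")[[1]])
loop_free
```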
You may use the following code which calls the str_count function (that counts the number of occurrences of a fixed text pattern) from the stringr package. It should work faster than the other solution which splits the character string into one-letter substrings.
require('stringr') # call install.packages('stringr') to download the package first
# read the text file (each text line will be a separate string):
dnaseq <- readLines("path_to_file.txt")
# merge text lines into one string:
dnaseq <- str_c(dnaseq, collapse="")
# count the number of occurrences of each nucleotide:
sapply(c("A", "G", "C", "T"), function(nuc)
str_count(dnaseq, fixed(nuc)))
Note that this solution may easily be extended to finding subsequences of length > 1 (just change the search pattern in sapply(), e.g. to as.character(outer(c("A", "G", "C", "T"), c("A", "G", "C", "T"), str_c)), which generates all pairs of nucleotides).
However, note that detecting AGA in AGAGA will report only 1 occurrence as str_count() does not take overlapping patterns into account.
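A small base R sketch of the overlap issue, using gregexpr() rather than stringr so that it runs without extra packages. The zero-width lookahead pattern is one common way to count overlapping occurrences; it is offered here as an illustration, not as what str_count() does internally.

```r
x <- "AGAGA"

# Plain matching consumes each match, so the second AGA is missed
m_plain <- gregexpr("AGA", x, fixed = TRUE)[[1]]
n_plain <- sum(m_plain > 0)      # 1

# A zero-width lookahead matches at every start position instead
m_overlap <- gregexpr("(?=AGA)", x, perl = TRUE)[[1]]
n_overlap <- sum(m_overlap > 0)  # 2
```

Counting with sum(m > 0) rather than length(m) also handles the no-match case, where gregexpr() returns -1.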
I am assuming that your nucleotide sequence is in a character vector of length one. If you are looking for the dinucleotide frequencies and a transition matrix, here is one solution:
dnaseq <- "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAG
CTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC"
## list of nucleotides
nuc <- c("A","T","G","C")
## all distinct dinucleotides
nuc_comb <- expand.grid(nuc,nuc)
nuc_comb$two <- paste(nuc_comb$Var1, nuc_comb$Var2, sep = "")
# Var1 Var2 two
# 1 A A AA
# 2 T A TA
# 3 G A GA
# 4 C A CA
# 5 A T AT
# 6 T T TT
# 7 G T GT
# 8 C T CT
# 9 A G AG
# 10 T G TG
# 11 G G GG
# 12 C G CG
# 13 A C AC
# 14 T C TC
# 15 G C GC
# 16 C C CC
## Using `vapply` and fixed-string matching to count dinucleotide sequences
## (gregexpr finds non-overlapping matches only):
nuc_comb$freq <- vapply(nuc_comb$two,
function(x) length(gregexpr(x, dnaseq, fixed = TRUE)[[1]]),
integer(1))
# AA TA GA CA AT TT GT CT AG TG GG CG AC TC GC CC
# 11 11 7 5 9 12 9 13 7 13 4 2 8 7 5 2
## label and reshape to matrix/table
dinuc_df <- reshape(nuc_comb, direction = "wide",
idvar = "Var1", timevar = "Var2", drop = "two")
dinuc_mat <- as.matrix(dinuc_df[-1])
rownames(dinuc_mat) <- colnames(dinuc_mat) <- nuc
# A T G C
# A 11 9 7 8
# T 11 12 13 7
# G 7 9 4 5
# C 5 13 2 2
## get margin proportions for transition matrix
## (probability of moving from nucleotide in row to nucleotide in column)
dinuc_tab <- prop.table(dinuc_mat, 1)
# A T G C
# A 0.3142857 0.2571429 0.20000000 0.22857143
# T 0.2558140 0.2790698 0.30232558 0.16279070
# G 0.2800000 0.3600000 0.16000000 0.20000000
# C 0.2272727 0.5909091 0.09090909 0.09090909
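If overlapping dinucleotides should be counted (so that AAA contributes two AA pairs), one alternative is to pair every position with its successor and tabulate the pairs. This is a sketch on a short made-up sequence, not a rerun of the numbers above; the names seq_chars, from, to, dinuc_counts and trans_mat are assumptions for illustration.

```r
seq_chars <- strsplit("AGCTTTTCATTCTGACTGCA", "")[[1]]
nuc <- c("A", "T", "G", "C")

# Pair every position with its successor; fixing the levels keeps
# combinations that never occur in the table as zeros
from <- factor(head(seq_chars, -1), levels = nuc)
to   <- factor(tail(seq_chars, -1), levels = nuc)

dinuc_counts <- table(from, to)            # overlapping dinucleotide counts
trans_mat <- prop.table(dinuc_counts, 1)   # each row sums to 1
trans_mat
```

A sequence of n characters always yields n - 1 pairs this way, which makes the total easy to check. A nucleotide that never appears in the from position would give a row of NaN in trans_mat.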

Taking samples from summarized data

I have data in a form much like the output from aggregate, except that I do not have the original non-aggregated data.
Example:
data <- data.frame(grade=letters[1:4], count=c(3,9,4,1))
grade count
1 a 3
2 b 9
3 c 4
4 d 1
I would like to sample from this population of grades, e.g. using sample. What is the easiest way of taking a sample (without replacement) from summarized counts like this?
Do you expect something like this?
> sample(with(data, rep(as.character(grade), count)), 10)
[1] "b" "b" "d" "a" "c" "c" "b" "b" "c" "b"
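The one-liner works by rebuilding the individual observations from the counts and then sampling from them. A step-by-step sketch of the same idea, with a seed added only so the run is repeatable; the names pool and draw are assumptions for illustration.

```r
data <- data.frame(grade = letters[1:4], count = c(3, 9, 4, 1))

# Expand the summary back into one entry per individual: 3 a's, 9 b's, ...
pool <- rep(as.character(data$grade), times = data$count)

set.seed(1)
draw <- sample(pool, 10)   # without replacement by default

# No grade can be drawn more often than it occurs in the population
drawn_counts <- table(factor(draw, levels = data$grade))
drawn_counts
```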
