I am creating a function, as a form of practice, to help me quickly recode variables into numerical values. The idea is to recode any number of distinct values into numerical form. If a dataset is really long, for instance, the function should in theory recode all of these values without my having to manually type out each condition under which a value is recoded.
For instance:
levels(d$letters)
[1] a b c d
The general form of the function is to:
d$letters.recode[d$letters == "a"] <- 1
d$letters.recode[d$letters == "b"] <- 2
d$letters.recode[d$letters == "c"] <- 3
And using this function:
rc.f <- function(a, b){
x <- levels(a)
y <- length(a)
b <- NA
for (i in 1:y){
z <- b[a==x[i]] <- i
}
}
In theory, the idea is that this function should create another variable, where a is recoded as 1, b is recoded as 2 and so on.
However when I run rc.f(d$letters, d$letters.recode), no new variables are created in the dataset, and the function does not return an error.
Any ideas?
Thanks.
Another example dataset d:
Say for a list of respondents they are assigned a category depending on their region:
Respondent Region
1 d
2 b
3 g
4 c
5 e
6 c
7 f
8 a
I am looking for a way to recode d$Region into a numerical value, to d$Region.R.
Using the same function as above, I am wondering whether I can use the function to create another variable in the dataframe, by inputting d$Region and d$Region.R into the function. So recoding a,b,c,[...],g into 1,2,3,[...],7.
If you want a, b, f, d recoded as 1, 2, 4, 3 (numbered by their order among the levels that are present), use the following.
I have updated the code of your function rc.f a little:
I removed the second argument b; since b is set to NA inside the function, it is not needed.
I removed z, because no extra variable is needed to hold the value of b.
Not every argument will be a factor, so the input is coerced to a factor first.
We do not need y; length(a) can be used directly in the for loop.
Last but not least, a function returns the value of its last expression unless return() is used, so b is placed on the last line.
The code is:
rc.f <- function(a)
{
a<-as.factor(a)
x <- levels(a)
b <- NA
for (i in 1:length(a))
{
b[a==x[i]] <- i
}
b
}
let us take an example
> l<-c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")
> l
[1] "a" "b" "b" "a" "a" "g" "h" "y" "f" "v" "h" "j" "f"
[14] "d" "a" "s" "s" "s"
> rc.f(l)
[1] 1 2 2 1 1 5 6 10 4 9 6 7 4 3 1 8 8 8
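The same recoding can also be sketched without a loop, because a factor already stores its levels as integer codes. The helper name recode_levels and the small data frame d below are illustrative assumptions built from the Region example in the question. Since R passes arguments by value, the result has to be assigned back to a new column; the original function changed only a local copy of b and returned nothing useful.

```r
# Hypothetical helper: same result as the loop-based rc.f above
recode_levels <- function(x) {
  # factor() sorts the distinct values; as.integer() exposes the level codes
  as.integer(factor(x))
}

l <- c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")
recode_levels(l)

# Assumed example data, typed in from the Region table in the question
d <- data.frame(Respondent = 1:8,
                Region = c("d","b","g","c","e","c","f","a"))

# Assign the returned vector to create the new variable in the data frame
d$Region.R <- recode_levels(d$Region)
d$Region.R
```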
If you want a, b, f, d recoded as 1, 2, 6, 4 (their positions in the alphabet), use the following:
rc.f <- function(a)
{
a<-as.factor(a)
b <- NA
for (i in 1:26)
{
b[a==letters[i]] <- i
}
b
}
Let's take an example:
> l<-c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")
> l
[1] "a" "b" "b" "a" "a" "g" "h" "y" "f" "v" "h" "j" "f" "d"
[15] "a" "s" "s" "s"
> rc.f(l)
[1] 1 2 2 1 1 7 8 25 6 22 8 10 6 4 1 19 19 19
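For the alphabet-position variant, a loop-free sketch is to look each value up in the built-in constant letters with match(). The variable name pos is illustrative; values that are not lowercase letters would come back as NA.

```r
l <- c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")

# match() returns, for each element of l, its position in letters
pos <- match(l, letters)
pos
```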
I am a beginner in R and have a simple question about the language.
Thanks to many experts in this site, I am improving a lot.
I am always grateful for that, and anyone who's giving hand with this question, thank you in advance.
This is the code.
Data=sample(1:5,size=25,replace=T)
names(Data)=c("a","b","c","d","e")
I want to relabel each of 1, 2, 3, 4, 5 as a, b, c, d, e,
so I thought I could accomplish this with the code above.
I know that the right code is
Data=c("a","b","c","d","e")[Data]
But I can't understand why this is the right code and why I need the last [Data].
Any help would be really appreciated!! Thank you so much in advance!!:)
The last Data provides an index to subset values from c("a","b","c","d","e").
Let's take a simple example :
Consider,
a <- 1:10
Now to get the first value in a you can do
a[1]
#[1] 1
To get 3rd value in a you can do
a[3]
#[1] 3
To get 6th and 8th value in a you can do
a[c(6, 8)]
#[1] 6 8
What will happen if you repeat a certain index? Say you select 1 twice and 3 once.
a[c(1, 1, 3)]
#[1] 1 1 3
As you can see the first value is selected two times and third one time.
Now, the Data that you have serves as that index, and the vector being subset is c("a","b","c","d","e"):
a <- c("a","b","c","d","e")
set.seed(123)
Data=sample(1:5,size=25,replace=T)
Data
#[1] 3 3 2 2 3 5 4 1 2 3 5 3 3 1 4 1 1 5 3 2 2 1 3 4 1
Now use these Data values to subset a, giving
a[Data]
#[1] "c" "c" "b" "b" "c" "e" "d" "a" "b" "c" "e" "c" "c" "a" "d" "a" "a" "e" "c" "b" "b" "a" "c" "d" "a"
As a side note, R has the built-in constants letters and LETTERS, which hold the 26 lowercase and uppercase letters of the alphabet.
letters
#[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
LETTERS
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
Here is a way that takes advantage of how objects of class "factor" are coded internally in R. In R, factors are coded as consecutive integers starting at 1, and what the user sees is their labels and levels, not the integer values. But the integer values do not go away, they are still there.
First, create a vector of integers like in the question but setting the RNG seed in order to make the results reproducible. This vector is saved for later.
set.seed(123)
Data <- sample(1:5, size = 25, replace = TRUE)
Saved <- Data
Now create the factor. Note that the labels argument is set to the letters "a" to "e".
Data <- factor(Data, labels = c("a","b","c","d","e"))
Data
# [1] c c b b c e d a b c e c c a d a a e c b b a c d a
#Levels: a b c d e
See the internal representation.
as.integer(Data)
# [1] 3 3 2 2 3 5 4 1 2 3 5 3 3 1 4 1 1 5 3 2 2 1 3 4 1
And compare with the initial values.
identical(Saved, as.integer(Data))
# [1] TRUE
This is because Data contains the numbers you want to name, in the order you want to name them. By adding [Data] to the end you are selecting the letters in the order given by Data. To understand this, try c("a","b","c","d","e")[c(1, 2)]; it selects just the first two letters. If you instead type c("a","b","c","d","e")[c(5, 4)], it selects the last two letters, but in reverse order. If you print just Data, you'll see that it contains only the numbers from 1 to 5, which is the number of unique letters, so each number selects the letter at that position. You can check that every number corresponds to the right letter by printing the relabelled Data.
Using names(Data)=c("a","b","c","d","e") does not work correctly since you aren't naming all 25 of the numbers, but rather just the first five of them.
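To see the difference between the two approaches side by side, here is a small sketch. The seven-element Data vector is a hand-typed stand-in for the sampled one, and the names named and relabelled are illustrative.

```r
Data <- c(3L, 3L, 2L, 5L, 1L, 4L, 2L)   # small stand-in for the sampled vector

# names<- only attaches labels by position; it does not touch the values
named <- Data
names(named) <- c("a", "b", "c", "d", "e")
names(named)     # the first five elements get names, the rest get NA
unname(named)    # the values themselves are unchanged

# Indexing the letters by Data replaces each number with its letter
relabelled <- c("a", "b", "c", "d", "e")[Data]
relabelled
```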
Using the code below, I import a dataset, explore it and remove a row.
After removing the row the output of my length and levels command is unchanged. Why?
MT <- read_csv("Q:/PhD/PhD courses/Data Doc and Man/day3-day4/bromraw.txt",
col_names = FALSE)
names(MT) <- c("id","pnr","age","sex", "runtime")
MT$sex <- as.factor(MT$sex)
length(levels(MT$sex))
levels(MT$sex)
This is the output:
[1] 3
[1] "33529" "K" "M"
Something is wrong. I investigate the row where sex has the value 33529
filter(MT, sex == 33529)
After examining the row I decide to drop it, and recheck the sex variable again.
MT <- subset(MT, sex !=33529)
length(levels(MT$sex))
levels(MT$sex)
[1] 3
[1] "33529" "K" "M"
The row is not there when I browse the data, but the output of the length and levels command is the same as before. What am I doing wrong?
I feel the question deserves a better explanation than just a piece of code.
Factor levels can exist independent of the data, e.g.
x <- factor(character(0), levels = LETTERS[1:3])
creates a vector of length 0 which has 3 factor levels
x
factor(0)
Levels: A B C
The length of the vector length(x) is zero but x has 3 levels
levels(x)
[1] "A" "B" "C"
(and length(levels(x)) is 3, accordingly).
The benefit is that we can add data later on which is checked if it is compatible with the defined factor levels:
x[1:4] <- LETTERS[1:4]
Warning message:
In `[<-.factor`(`*tmp*`, 1:4, value = c("A", "B", "C", "D")) :
invalid factor level, NA generated
x
[1] A B C <NA>
Levels: A B C
Now, the vector consists of 4 elements (length(x)) but there are still only 3 factor levels. Note that "D" has not become an additional factor level automatically but was replaced by NA instead.
If elements of the vector are removed, e.g.
y <- x[-c(1L, 4L)]
y
[1] B C
Levels: A B C
the factor levels remain unchanged while length(y) is 2 now.
However, if you want to remove unused factor levels, you can do so explicitly with the droplevels() function, as pointed out by akrun:
y <- droplevels(y)
y
[1] B C
Levels: B C
Now, factor level "A" has been dropped as it is unused.
While the levels() function shows which factor levels are defined, it does not tell which of the boxes (credit to Acccumulation for the picture) are filled. The unique() function returns a vector of the distinct values, while the table() function counts the number of occurrences of each:
set.seed(1L)
z <- sample(LETTERS[1:8], 10, replace = TRUE)
z
[1] "C" "B" "E" "H" "A" "B" "D" "A" "D" "C"
unique(z)
[1] "C" "B" "E" "H" "A" "D"
table(z)
z
A B C D E H
2 2 2 2 1 1
This could be a case of unused levels. We can resolve it by dropping the levels
MT <- droplevels(subset(MT, sex != 33529))
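The effect can be reproduced on a tiny data frame. MT_demo and kept below are made-up stand-ins for the real file, which is not available here; the sketch only shows that subsetting keeps the level, while droplevels() or a fresh call to factor() removes it.

```r
# Made-up stand-in for the imported data, with one bad value in sex
MT_demo <- data.frame(id = 1:4,
                      sex = factor(c("K", "M", "33529", "M")))

kept <- subset(MT_demo, sex != "33529")

nrow(kept)                      # the row is gone ...
levels(kept$sex)                # ... but the unused level is still defined

levels(droplevels(kept)$sex)    # drops unused levels in every factor column
levels(factor(kept$sex))        # rebuilding one factor has the same effect
```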
I am quite new to R and I have a table of strings, I believe, that I extracted from a text file that contains a list of nucleotides (ex. "AGCTGTCATGCT.....").
Here are the first two rows of the text file to help as an example:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC
I need to count every "A" in the sequence by incrementing its variable, a. The same applies for G, C, and T (variables to increment are g, c ,t respectively).
At the end of the "for" loop I want the number of times the "A", "G", "C" and "T" nucleotides occurred, so I can calculate the dinucleotide frequencies and, hopefully, the transition matrix. My code is below. It doesn't work: it just returns each variable as equal to 0, which is wrong. Please help, thanks!
#I saved the newest version to a text file of the nucleotides
dnaseq <- read.table("/My path file/ecoli.txt")
g=0
c=0
a=0
t=0
for(i in dnaseq[[1]]){
if(i=="A") (inc(a)<-1)
if(i=="G") (inc(g)<-1)
if(i=="C") (inc(c)<-1)
if(i=="T") (inc(t)<-1)
}
a
g
c
t
The simplest way to get the counts of each nucleotide (or any kind of letter) is to use the table and strsplit functions. For example:
myseq = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
# split it into a vector of individual characters:
strsplit(myseq, "")[[1]]
# [1] "A" "G" "C" "T" "T" "T" "T" "C" "A" "T" "T" "C" "T" "G" "A" "C" "T" "G" "C" "A" "A" "C" "G" "G" "G" "C" "A" "A" "T" "A" "T" "G" "T" "C" "T" "C" "T" "G" "T"
# [40] "G" "T" "G" "G" "A" "T" "T" "A" "A" "A" "A" "A" "A" "A" "G" "A" "G" "T" "G" "T" "C" "T" "G" "A" "T" "A" "G" "C" "A" "G" "C"
# count the frequencies of each
table(strsplit(myseq, "")[[1]])
# A C G T
# 20 12 17 21
Now, if you don't care about the difference between one line and the next (if this is just one long sequence in ecoli.txt) then you want to combine the file into one long string first:
table(strsplit(paste(dnaseq[[1]], collapse = ""), "")[[1]])
That's the one line solution, but it might be clearer to see it in three lines:
combined.seq = paste(dnaseq[[1]], collapse = "")
combined.seq.vector = strsplit(combined.seq, "")
frequencies = table(combined.seq.vector)
If you're wondering what was wrong with your original code: first, I don't know where the inc function comes from (or why it wasn't throwing an error: are you sure dnaseq[[1]] has length greater than 0?). In any case, you weren't iterating over the sequence, you were iterating over the lines. i was never going to be a single character like A or T; it was always going to be a full line.
In any case, the solution with collapse, table and strsplit is both more concise and computationally efficient than a for loop (or a pair of nested for loops, which is what you would need).
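For comparison, this is roughly what the pair of nested loops would look like. It is only a sketch: lines_demo is a made-up two-line input standing in for dnaseq[[1]], and counts is an illustrative named vector used instead of four separate variables.

```r
lines_demo <- c("AGCT", "TTAC")   # stand-in for the lines read from the file

counts <- c(A = 0L, C = 0L, G = 0L, T = 0L)

for (line in lines_demo) {                  # outer loop: one line at a time
  for (ch in strsplit(line, "")[[1]]) {     # inner loop: one character at a time
    if (ch %in% names(counts)) {
      counts[ch] <- counts[ch] + 1L
    }
  }
}
counts

# The vectorised version gives the same numbers
table(strsplit(paste(lines_demo, collapse = ""), "")[[1]])
```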
You may use the following code which calls the str_count function (that counts the number of occurrences of a fixed text pattern) from the stringr package. It should work faster than the other solution which splits the character string into one-letter substrings.
require('stringr') # call install.packages('stringr') to download the package first
# read the text file (each text line will be a separate string):
dnaseq <- readLines("path_to_file.txt")
# merge text lines into one string:
dnaseq <- str_c(dnaseq, collapse="")
# count the number of occurrences of each nucleotide:
sapply(c("A", "G", "C", "T"), function(nuc)
str_count(dnaseq, fixed(nuc)))
Note that this solution can easily be extended to finding subsequences of length > 1 (just change the search pattern in sapply(), e.g. to as.character(outer(c("A", "G", "C", "T"), c("A", "G", "C", "T"), str_c)), which generates all pairs of nucleotides).
However, note that detecting AGA in AGAGA will report only 1 occurrence as str_count() does not take overlapping patterns into account.
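If overlapping occurrences should be counted, one possible base-R sketch is to compare the pattern against every window of the same width. The function name count_overlapping is hypothetical, and this is not part of stringr.

```r
# Hypothetical helper: counts occurrences of pattern in text, overlaps included
count_overlapping <- function(pattern, text) {
  k <- nchar(pattern)
  n <- nchar(text)
  if (n < k) return(0L)
  starts <- seq_len(n - k + 1L)               # every possible start position
  windows <- substring(text, starts, starts + k - 1L)
  sum(windows == pattern)
}

count_overlapping("AGA", "AGAGA")   # both overlapping occurrences are counted
count_overlapping("AA", "AAAA")
```

This compares n - k + 1 substrings, so for very long sequences it may be slower than str_count().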
I am assuming that your nucleotide sequence is in a character vector of length one. If you are looking for the dinucleotide frequencies and a transition matrix, here is one solution:
dnaseq <- "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAG
CTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC"
## list of nucleotides
nuc <- c("A","T","G","C")
## all distinct dinucleotides
nuc_comb <- expand.grid(nuc,nuc)
nuc_comb$two <- paste(nuc_comb$Var1, nuc_comb$Var2, sep = "")
# Var1 Var2 two
# 1 A A AA
# 2 T A TA
# 3 G A GA
# 4 C A CA
# 5 A T AT
# 6 T T TT
# 7 G T GT
# 8 C T CT
# 9 A G AG
# 10 T G TG
# 11 G G GG
# 12 C G CG
# 13 A C AC
# 14 T C TC
# 15 G C GC
# 16 C C CC
## Using `vapply` and regular expressions to count dinucleotide sequences:
nuc_comb$freq <- vapply(nuc_comb$two,
function(x) length(gregexpr(x, dnaseq, fixed = TRUE)[[1]]),
integer(1))
# AA TA GA CA AT TT GT CT AG TG GG CG AC TC GC CC
# 11 11 7 5 9 12 9 13 7 13 4 2 8 7 5 2
## label and reshape to matrix/table
dinuc_df <- reshape(nuc_comb, direction = "wide",
idvar = "Var1", timevar = "Var2", drop = "two")
dinuc_mat <- as.matrix(dinuc_df[-1])
rownames(dinuc_mat) <- colnames(dinuc_mat) <- nuc
# A T G C
# A 11 9 7 8
# T 11 12 13 7
# G 7 9 4 5
# C 5 13 2 2
## get margin proportions for transition matrix
## (probability of moving from nucleotide in row to nucleotide in column)
dinuc_tab <- prop.table(dinuc_mat, 1)
# A T G C
# A 0.3142857 0.2571429 0.20000000 0.22857143
# T 0.2558140 0.2790698 0.30232558 0.16279070
# G 0.2800000 0.3600000 0.16000000 0.20000000
# C 0.2272727 0.5909091 0.09090909 0.09090909
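One caveat with counting via gregexpr(): it reports non-overlapping matches only, and it returns -1 (a vector of length 1) when there is no match, so length() of its result can miscount. An alternative sketch, under the assumption that every adjacent pair of positions should count as one transition, is to cross-tabulate each nucleotide against its successor. The names seq_demo, from and to are illustrative, and the short sequence is just the start of the example above.

```r
seq_demo <- "AGCTTTTCATTC"
nuc <- c("A", "T", "G", "C")

bases <- strsplit(seq_demo, "")[[1]]

# Pair each position with the one after it; fixed levels keep empty cells
from <- factor(head(bases, -1), levels = nuc)
to   <- factor(tail(bases, -1), levels = nuc)

dinuc_counts <- table(from, to)          # rows: first base, columns: second base
dinuc_counts

transition <- prop.table(dinuc_counts, 1)   # each row sums to 1
transition
```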
I have a vector with five items.
my_vec <- c("a","b","a","c","d")
If I want to re-arrange those values into a new vector (shuffle), I could use sample():
shuffled_vec <- sample(my_vec)
Easy - but the sample() function only gives me one possible shuffle. What if I want to know all possible shuffling combinations? The various "combn" functions don't seem to help, and expand.grid() gives me every possible combination with replacement, when I need it without replacement. What's the most efficient way to do this?
Note that in my vector, I have the value "a" twice - therefore, in the set of shuffled vectors returned, they all should each have "a" twice in the set.
I think permn from the combinat package does what you want
library(combinat)
permn(my_vec)
A smaller example
> x
[1] "a" "a" "b"
> permn(x)
[[1]]
[1] "a" "a" "b"
[[2]]
[1] "a" "b" "a"
[[3]]
[1] "b" "a" "a"
[[4]]
[1] "b" "a" "a"
[[5]]
[1] "a" "b" "a"
[[6]]
[1] "a" "a" "b"
If the duplicates are a problem you could do something similar to this to get rid of duplicates
strsplit(unique(sapply(permn(my_vec), paste, collapse = ",")), ",")
Or probably a better approach to removing duplicates...
dat <- do.call(rbind, permn(my_vec))
dat[!duplicated(dat), ]
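To check the de-duplication idea on the smaller example, here is a base-R sketch. The list perms is typed in by hand from the permn(x) output shown above, so the combinat package is not needed to run it.

```r
# The six permutations of c("a", "a", "b") as listed above
perms <- list(c("a","a","b"), c("a","b","a"), c("b","a","a"),
              c("b","a","a"), c("a","b","a"), c("a","a","b"))

# As a matrix: unique() keeps the first occurrence of each distinct row
dat <- do.call(rbind, perms)
unique(dat)

# As a list: unique() also works element-wise on lists
unique(perms)
```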
Noting that your data is effectively 5 levels from 1-5, encoded as "a", "b", "a", "c", and "d", I went looking for ways to get the permutations of the numbers 1-5 and then remap those to the levels you use.
Let's start with the input data:
my_vec <- c("a","b","a","c","d") # the character
my_vec_ind <- seq(1,length(my_vec),1) # their identifier
To get the permutations, I applied the function given at Generating all distinct permutations of a list in R:
permutations <- function(n){
if(n==1){
return(matrix(1))
} else {
sp <- permutations(n-1)
p <- nrow(sp)
A <- matrix(nrow=n*p,ncol=n)
for(i in 1:n){
A[(i-1)*p+1:p,] <- cbind(i,sp+(sp>=i))
}
return(A)
}
}
First, create a data.frame with the permutations:
tmp <- data.frame(permutations(length(my_vec)))
You now have a data frame tmp of 120 rows, where each row is a unique permutation of the numbers, 1-5:
>tmp
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 1 2 3 5 4
3 1 2 4 3 5
...
119 5 4 3 1 2
120 5 4 3 2 1
Now you need to remap them to the strings you had. You can remap them using a variation on the theme of gsub(), proposed here: R: replace characters using gsub, how to create a function?
gsub2 <- function(pattern, replacement, x, ...) {
for(i in 1:length(pattern))
x <- gsub(pattern[i], replacement[i], x, ...)
x
}
gsub() won't work because you have more than one value in the replacement array.
You also need a function you can call using lapply() to use the gsub2() function on every element of your tmp data.frame.
remap <- function(x,
old,
new){
return(gsub2(pattern = old,
replacement = new,
fixed = TRUE,
x = as.character(x)))
}
Almost there. We do the mapping like this:
shuffled_vec <- as.data.frame(lapply(tmp,
remap,
old = as.character(my_vec_ind),
new = my_vec))
which can be simplified to...
shuffled_vec <- as.data.frame(lapply(data.frame(permutations(length(my_vec))),
remap,
old = as.character(my_vec_ind),
new = my_vec))
.. should you feel the need.
That gives you your required answer:
> shuffled_vec
X1 X2 X3 X4 X5
1 a b a c d
2 a b a d c
3 a b c a d
...
119 d c a a b
120 d c a b a
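The remapping step can also be sketched with plain indexing instead of text replacement, which avoids a pitfall of the gsub-based approach: with ten or more elements, the pattern "1" would also match inside "10". The matrix perm_idx below is a hand-written stand-in for the output of the permutations() function, shown for a three-element vector to keep it short.

```r
my_vec <- c("a", "b", "a")

# Hand-written index permutations of 1:3 (stand-in for permutations(3))
perm_idx <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
                  c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

# Look up every index in my_vec, then restore the matrix shape
shuffled <- matrix(my_vec[as.vector(perm_idx)], nrow = nrow(perm_idx))
shuffled

# Because "a" occurs twice, some rows coincide; unique() removes them
unique(shuffled)
```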
Looking at a previous question (R: generate all permutations of vector without duplicated elements), I can see that the gtools package has a function for this. I couldn't however get this to work directly on your vector as such:
permutations(n = 5, r = 5, v = my_vec)
#Error in permutations(n = 5, r = 5, v = my_vec) :
# too few different elements
You can adapt it however like so:
apply(permutations(n = 5, r = 5), 1, function(x) my_vec[x])
# [,1] [,2] [,3] [,4]
#[1,] "a" "a" "a" "a" ...
#[2,] "b" "b" "b" "b" ...
#[3,] "a" "a" "c" "c" ...
#[4,] "c" "d" "a" "d" ...
#[5,] "d" "c" "d" "a" ...
I am using matching operators to grab values that appear in a matrix from a separate data frame. However, the resulting matrix has the values in the order they appear in the data frame, not in the original matrix. Is there any way to preserve the order of the original matrix using the matching operator?
Here is a quick example:
vec=c("b","a","c"); vec
df=data.frame(row.names=letters[1:5],values=1:5); df
df[rownames(df) %in% vec,1]
This produces > [1] 1 2 3 which is the order "a" "b" "c" appears in the data frame. However, I would like to generate >[1] 2 1 3 which is the order they appear in the original vector.
Thanks!
Use match.
df[match(vec, rownames(df)), ]
# [1] 2 1 3
Be aware that if you have duplicate values in either vec or rownames(df), match may not behave as expected.
Edit:
I just realized that row name indexing will solve your issue a bit more simply and elegantly:
df[vec, ]
# [1] 2 1 3
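To make the caveat about duplicates and missing values concrete, here is a small sketch. The vector vec is changed from the question on purpose so that it contains a repeated name and one name ("z") that is not a row name; idx is an illustrative variable name.

```r
df <- data.frame(row.names = letters[1:5], values = 1:5)
vec <- c("b", "z", "a", "b")      # "b" repeated, "z" not present in df

# match() gives the first matching position, or NA when there is none
idx <- match(vec, rownames(df))
idx

# Drop the NAs before indexing; the repeated "b" is returned twice
df$values[idx[!is.na(idx)]]
```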
Use match (and get rid of the NA values for elements of vec that have no matching row name):
Filter(function(x) !is.na(x), match(vec, rownames(df)))
Since row name indexing also works on vectors, we can take this one step further and define:
'%ino%' <- function(x, table) {
xSeq <- seq(along = x)
names(xSeq) <- x
Out <- xSeq[as.character(table)]
Out[!is.na(Out)]
}
We now have the desired result:
df[rownames(df) %ino% vec, 1]
[1] 2 1 3
Inside the function, names() does an auto convert to character and table is changed with as.character(), so this also works correctly when the inputs to %ino% are numbers:
LETTERS[1:26 %in% 4:1]
[1] "A" "B" "C" "D"
LETTERS[1:26 %ino% 4:1]
[1] "D" "C" "B" "A"
Following %in%, missing values are removed:
LETTERS[1:26 %in% 3:-5]
[1] "A" "B" "C"
LETTERS[1:26 %ino% 3:-5]
[1] "C" "B" "A"
With %in% the logical sequence is repeated along the dimension of the object being subsetted, this is not the case with %ino%:
data.frame(letters, LETTERS)[1:5 %in% 3:-5,]
letters LETTERS
1 a A
2 b B
3 c C
6 f F
7 g G
8 h H
11 k K
12 l L
13 m M
16 p P
17 q Q
18 r R
21 u U
22 v V
23 w W
26 z Z
data.frame(letters, LETTERS)[1:5 %ino% 3:-5,]
letters LETTERS
3 c C
2 b B
1 a A