Abstract.
I am having trouble understanding a unit of code regarding
the sub-setting of lists. I am applying an index to a list.
The problem is that when I apply the index to a list inside
a custom function, the list behaves like a table, returning
only the first column, but for every row (4 rows in total).
If I apply the same index to the same list outside of that
custom function, the output is only the first element of the
list, displaying both elements of the character vector contained
in the first element of the list. I need to know why there is a
difference in outputs.
How have I tried to resolve my issue by myself?
I performed a Google search on the following search term:
Indexing Lists in R (https://stackoverflow.com/questions/tagged/r).
The closest article was this one: How to correctly use lists in R.
But, it failed to answer my question.
Introduction.
I am citing the code that I am using before stating my question
because the matter is too confusing to explain in the abstract.
Below are four instructions that students are told to follow,
numbered in order.
# Instruction 1:
# Create a character vector containing the names of the top four
# mathematicians that contributed to the field of statistics and
# list their birth years, with the name and year separated by a
# colon.
mathematicians <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
# The above code creates a character vector with four elements.
# Instruction 2: Next, use the strsplit() function to split the person's
# last name from his birth year.
split_name_and_year_born <- strsplit(mathematicians, split = ":")
# The variable split_name_and_year_born must be a list because
# strsplit only returns lists (according to the documentation).
# Instruction 3: Write a function that accepts a list or vector
# object and returns only the first element of that object.
first <- function(x) {
x[1]
}
# This is a fairly straightforward function. If x is a list then
# x[1] should be the first element of that list. The same is true
# for vectors.
# Instruction 4: apply the first function to the list split_name_and_year_born
lapply(split_name_and_year_born, first)
# [[1]]
# [1] "GAUSS"
#
# [[2]]
# [1] "BAYES"
#
# [[3]]
# [1] "PASCAL"
#
# [[4]]
# [1] "PEARSON"
My commentary: If you consider split_name_and_year_born as a list of vectors, each of length 2, you could imagine the list behaving somewhat like a table in which the first element of each vector forms the first column. That interpretation fits the output above. However, if I enter the following line of code, I get only the first element of the list.
split_name_and_year_born[1]
[[1]]
[1] "GAUSS" "1777"
My question is, why is there a difference in the output? I am using the same data structure, with the same data. I am only applying the indexing operator in different places. Why is there a difference in outputs? The function must be doing something implicit. I just do not know what.
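A minimal sketch of what seems to be happening, using only the objects defined above: lapply() hands the function one element at a time, extracted with [[ ]], so inside first() the argument x is a character vector such as c("GAUSS", "1777") and not the whole list. The loop and the names by_hand and via_lapply are illustrative, not part of the exercise.

```r
mathematicians <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
split_name_and_year_born <- strsplit(mathematicians, split = ":")
first <- function(x) x[1]

# Roughly what lapply(split_name_and_year_born, first) does, written as a loop:
# each call receives split_name_and_year_born[[i]], a character vector of length 2.
by_hand <- vector("list", length(split_name_and_year_born))
for (i in seq_along(split_name_and_year_born)) {
  by_hand[[i]] <- first(split_name_and_year_born[[i]])
}

via_lapply <- lapply(split_name_and_year_born, first)
identical(by_hand, via_lapply)

# Calling first() on the list itself indexes the list, not its elements,
# so it returns a list of length 1 holding the whole first vector.
first(split_name_and_year_born)
```

So the indexing operator is the same in both cases; what differs is the object it is applied to (one character vector at a time versus the list as a whole).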
I am trying to match DNA sequences within a single column. For each sequence, I want to find the rows that contain it, including longer sequences in which it is embedded; the column also contains the sequence itself.
I am trying to use str_which, which I know works: if I enter the search pattern manually, it finds the rows that include the sequence.
As a preview of the data I have:
SNID type seqs2
9584818 seqs TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA
9584818 reversed TTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGA
9562505 seqs GTCTTCAGCATCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAACTTTGTGAAT
9562505 reversed ATTCACAAAGTTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGATGCTGAAGAC
Using a simple search with row one as x:
x <- "TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA"
str_which(df$seqs2, x)
I get the answer I expect:
> str_which(df$seqs2, x)
[1] 1 3
But when I pass the whole column as the pattern, each row only finds itself, not the other rows that also contain it.
> str_which(df$seqs2, df$seqs2)
[1] 1 2 3 4
Since my data set is quite large, I do not want to do this manually; I would rather use the column itself as the input than define "x" for each sequence first.
Does anybody have an idea how to solve this? I have tried most stringr functions by now, but I may have used them incorrectly or skipped some important ones.
Thanks in advance
You may need lapply:
lapply(df$seqs2, function(x) stringr::str_which(df$seqs2, x))
You can also use grep to keep this in base R:
lapply(df$seqs2, function(x) grep(x, df$seqs2))
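A small sketch of the base R variant on made-up data. The short sequences below are hypothetical stand-ins for the real ones, and fixed = TRUE is an assumption added here so that each sequence is matched literally rather than as a regular expression.

```r
# Hypothetical toy data: rows 3 and 4 embed the sequences from rows 1 and 2
df <- data.frame(
  SNID  = c(1, 1, 2, 2),
  seqs2 = c("ACGT", "TGCA", "GGACGTCC", "GGTGCACC"),
  stringsAsFactors = FALSE
)

# For every sequence, the indices of all rows that contain it
hits <- lapply(df$seqs2, function(x) grep(x, df$seqs2, fixed = TRUE))

# Drop the self-match to keep only the "other" rows
others <- Map(setdiff, hits, seq_along(hits))
hits[[1]]
others[[1]]
```

Each list element corresponds to one row of the column, so the result can be inspected row by row or attached to the data frame as a list column.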
Task
I am attempting to use better functionality (a loop or vectorized code) to split a larger list into 26 (maybe 27) smaller lists based on each letter of the alphabet (i.e. the first list contains all entries of the larger list that start with the letter A, the second those that start with the letter B ... and the possible 27th list contains all remaining entries, which start with numbers or other characters).
I am then attempting to identify which names on the list are similar by using the adist function (for instance, I need to correct company names that are misspelled, e.g. Companyy A needs to be corrected to Company A).
Code thus far
#creates a vector for all uniqueID/stakeholders whose name starts with "a" or "A"
stakeA <- grep("^[aA].*", uniqueID, value=TRUE)
#creates a distance matrix for all stakeholders whose name starts with "a" or "A"
stakeAdist <- adist(stakeA, ignore.case=TRUE)
write.table(stakeAdist, "test.csv", quote=TRUE, sep = ",", row.names=stakeA, col.names=stakeA)
Explanation
I was able to complete the first step of my task using the above code; I have created a list of all the entries that begin with the letter A and then calculated the "distance" between each entry (appears in a matrix).
Ask One
I can copy and paste this code 26 times and work my way through the alphabet, but I figure there is likely a more elegant way to do this, and I would like to learn it!
Ask Two
To "correct" the entries, thus far I have resorted to writing a table and moving to Excel. In Excel I have to insert a row entry to have the matrix properly align (I suppose this is a small flaw in my code). To correct the entries, I use conditional formatting to highlight all instances where adist is between say 1 and 10 and then have to manually go through the highlights and correct the lists.
Any help on functions / methods to further automate this / better strategies using R would be great.
It would help to have an example of your data, but this might work.
EDIT: I am assuming your data is in a data.frame named df
for(i in 1:26) {
  # keep the rows whose uniqueID starts with the i-th letter, in either case
  stake <- subset(df, grepl(paste0('^[',letters[i],LETTERS[i],']'), uniqueID))
  # distance matrix between this letter's IDs
  stakeDist <- adist(stake$uniqueID, ignore.case=TRUE)
  write.table(stakeDist, paste0("stake_",LETTERS[i],".csv"), quote=TRUE, sep=',')
}
Using a combination of paste0 and the built-in letters and LETTERS, this creates your grep expression.
Using subset, the correct IDs are extracted.
paste0 also creates a unique filename for write.table().
And it is all tied together using a for() loop.
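For the second ask (finding likely misspellings without going through Excel), one possible sketch is below. The example names, the grouping key "other" for the 27th group, the helper near_pairs, and the threshold max_dist are all illustrative assumptions, not part of the original code.

```r
# Hypothetical example names standing in for the real uniqueID vector
uniqueID <- c("Company A", "Companyy A", "Acme", "Beta Ltd", "Beta Ltd.", "3M")

# Group key: upper-cased first letter, or "other" for digits and symbols
first_char <- toupper(substr(uniqueID, 1, 1))
key <- ifelse(first_char %in% LETTERS, first_char, "other")
groups <- split(uniqueID, key)

# One labelled distance matrix per group
dists <- lapply(groups, function(g) {
  d <- adist(g, ignore.case = TRUE)
  dimnames(d) <- list(g, g)
  d
})

# Pairs within one matrix whose edit distance lies between 1 and max_dist;
# upper.tri() keeps each pair once and skips the diagonal
near_pairs <- function(d, max_dist = 2) {
  idx <- which(d >= 1 & d <= max_dist & upper.tri(d), arr.ind = TRUE)
  data.frame(a = rownames(d)[idx[, "row"]],
             b = colnames(d)[idx[, "col"]],
             dist = d[idx],
             stringsAsFactors = FALSE)
}

candidates <- do.call(rbind, lapply(dists, near_pairs))
candidates
```

The resulting data frame lists candidate pairs to review, which replaces the conditional-formatting step; deciding which spelling is the correct one still needs a human or a reference list.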
I'm not sure that I understand the different outputs in these two scenarios:
(1)
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
split <- strsplit(pioneers, split = ":")
split
(2)
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")
split <- lapply(pioneers, strsplit, split = ":")
split
In both cases, the output is a list but I'm not sure when I'd use the one notation (simply applying a function to a vector) or the other (using lapply to loop the function over the vector).
Thanks for the help.
Greg
To me it has to do with how the output is returned. The l in lapply stands for list, i.e. the output is returned as a list. strsplit already returns a list because, if there were multiple :s in an element of your pioneers vector, it is the only data structure that makes sense: one list element for each of the 4 elements of the vector, each containing a vector of the pieces of the split string.
So using lapply(x, strsplit, ...) will always return a list inside a list, which you probably don't want in this case.
Using lapply is useful when you expect the result of the function you're applying to be a vector of undefined or variable length. Since strsplit already handles this, the use of lapply here is redundant. You should know what form you expect or want your answer to be in, and use the appropriate functions to coerce the output into the right data structure.
To be clear, the outputs of the two examples you gave are not the same. One is a list of character vectors; the other is a list of lists. To get the identical result with lapply you would write
lapply(pioneers, function(x, split) strsplit(x, split)[[1]], split = ":")
i.e. taking the first list element of the inner list (which is only 1 element anyway) in each case.
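The difference can be checked directly. This is a short sketch using the pioneers vector from the question; the names direct, nested, and flattened are just labels chosen here.

```r
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")

direct <- strsplit(pioneers, split = ":")            # list of character vectors
nested <- lapply(pioneers, strsplit, split = ":")    # list of one-element lists

is.character(direct[[1]])   # the element is c("GAUSS", "1777")
is.list(nested[[1]])        # the element is itself a list
nested[[1]][[1]]            # one level deeper to reach the vector

# Taking [[1]] inside the function removes the extra level
flattened <- lapply(pioneers, function(x, split) strsplit(x, split)[[1]], split = ":")
identical(direct, flattened)
```

str(direct) and str(nested) show the same thing at a glance if you prefer to inspect the structures interactively.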
I have been working with R for about 2 months and have had a little trouble getting a handle on the $ and %% operators.
I understand I can use $ to pull a certain value from a function's result (e.g. t.test(x)$p.value), but I'm not sure if this is a universal definition. I also know it can be used to pull specific pieces of data.
I'm also curious about %%, in particular when a value is placed between the percent signs (e.g. %x%). I am aware of its use as a modulo or remainder operator, e.g. 7 %% 5 returns 2. Perhaps I am being ignorant and this is not real?
Any help or links to literature would be greatly appreciated.
Note: I have been searching for this for a couple hours so excuse me if I couldn't find it!
You are not really pulling a value from a function but rather from the list object that the function returns. $ is actually an infix operator that takes two arguments, the values preceding and following it. It is a convenience function that uses non-standard evaluation of its second argument. It's called non-standard because the unquoted characters following $ are first quoted before being used to extract a named element from the first argument.
t.test # is the function
t.test(x) # is a named list with one of the names being "p.value"
The value can be pulled in one of three ways:
t.test(x)$p.value
t.test(x)[['p.value']] # numeric vector
t.test(x)['p.value'] # a list with one item
my.name.for.p.val <- 'p.value'
t.test(x)[[ my.name.for.p.val ]]
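A runnable version of the extraction forms above, as a sketch. The sample x is made-up data generated here only so that t.test() has something to work on.

```r
set.seed(1)
x <- rnorm(20)          # hypothetical sample, not from the question
res <- t.test(x)        # a list-like object with a component named "p.value"

p_dollar  <- res$p.value          # numeric vector of length 1
p_double  <- res[["p.value"]]     # same value
p_single  <- res["p.value"]       # still a list, holding that one component

my.name.for.p.val <- "p.value"
p_by_name <- res[[my.name.for.p.val]]   # [[ ]] accepts a variable; $ does not

identical(p_dollar, p_double)
is.list(p_single)
```

The practical rule of thumb is that $ is convenient for interactive use with a literal name, while [[ ]] is the form to use when the name is stored in a variable.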
When you surround a set of characters with flanking "%" signs you can create your own vectorized infix function. If you wanted a pmax for which the default was na.rm=TRUE, do this:
'%mypmax%' <- function(x,y) pmax(x,y, na.rm=TRUE)
And then use it without quotes:
> c(1:10, NA) %mypmax% c(NA,10:1)
[1] 1 10 9 8 7 6 7 8 9 10 1
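Base R also ships several operators written in the same %...% form, which may be why they look like one family; in particular %x% is already taken by the Kronecker product. A brief sketch follows, where the operator %+% is a made-up example of a user-defined one.

```r
7 %% 5                  # remainder
7 %/% 5                 # integer division
3 %in% c(1, 2, 3)       # membership test
outer_prod <- 1:2 %o% 1:3   # outer product, a 2 x 3 matrix

# A user-defined infix operator (hypothetical name); backticks or quotes
# are needed when defining it, but not when using it
`%+%` <- function(a, b) paste0(a, b)
"foo" %+% "bar"
```

Typing ?"%%" or ?"%in%" at the console opens the corresponding help pages.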
First, the $ operator is for selecting an element of a list. See help('$').
The %% operator is the modulo operator. See help('%%').
The $ operator is used to select a particular element from a list, or from any other data structure that contains named components.
For example, suppose data is a list that contains a matrix named MATRIX, among other things.
To get the matrix we write:
print(data$MATRIX)
The %% operator is the modulus operator, which gives the remainder.
For example, print(7 %% 3)
will print 1 as output.
I have figured out how to create a new column on my data frame that = TRUE if the character string in "Column 5" is contained within the longer string in "Column 6" - can I do this by referring to the names of my columns rather than using [r,c] locational references?
rows = NULL
for(i in 1:length(excptn1[,1]))
{
rows[i] <- grepl(excptn1[i,5],excptn1[i,6], perl=TRUE)
}
As a programmer I'm nervous about referring to things as "Column 5 and Column 6"...I want to refer to the names of the variables captured in those columns so that I'm not reliant on my source file always having the columns in the identical order. Furthermore I might forget about that locational reference and add something earlier in the code that causes the locational reference to fail later...when you can think in terms of the names of the columns in general (rather than their particular ordering at a point in time) it's a lot easier to build robust production strength code.
I found a related question on this site and it uses the same kind of locational references I want to avoid...
How do I perform a function on each row of a data frame and have just one element of the output inserted as a new column in that row
While R does seem very flexible, it seems to lack a lot of features that you'd want in scalable, production-strength code...but I'm hoping I'm wrong and can learn otherwise.
Thanks!
You could refer to the columns by name rather than by index in two ways:
rows[i] <- grepl(excptn1[i,"colname"],excptn1[i,"othercolname"], perl=TRUE)
or
rows[i] <- grepl(excptn1$colname[i],excptn1$othercolname[i], perl=TRUE)
Finally, note that most R programmers would do this as:
rows = sapply(1:nrow(excptn1), function(i) grepl(excptn1$colname[i],excptn1$othercolname[i], perl=TRUE))
One thing this avoids is the overhead of increasing the size of the vector in each iteration.
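A self-contained sketch of the name-based approach on made-up data. The data frame contents and the new column names match and match2 are illustrative; colname and othercolname are the placeholder column names used above. mapply() is shown as one alternative that pairs the two columns element by element.

```r
# Hypothetical data: is the pattern in colname found in othercolname?
excptn1 <- data.frame(
  colname      = c("ab", "x.z", "q"),
  othercolname = c("cabd", "xyz", "rst"),
  stringsAsFactors = FALSE
)

# Row-wise grepl(), referring to columns by name only
excptn1$match <- sapply(seq_len(nrow(excptn1)), function(i)
  grepl(excptn1$colname[i], excptn1$othercolname[i], perl = TRUE))

# Same result with mapply(), which walks both columns in parallel
excptn1$match2 <- mapply(grepl, excptn1$colname, excptn1$othercolname,
                         MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)
excptn1
```

Because the columns are addressed by name, reordering the source file's columns does not affect the result.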
If you want to do this faster, use the stri_match_first_regex function from the stringi package.
Example:
require(stringi)
ramka <- data.frame(foo=letters[1:3],bar=c("ala","ma","koteczka"))
> ramka
foo bar
1 a ala
2 b ma
3 c koteczka
> stri_match_first_regex(str=ramka$bar, pattern=ramka$foo)
[,1]
[1,] "a"
[2,] NA
[3,] "c"