I am writing a function to access elements of a formal R class (Dada2). Each element needs a unique identifier to be accessed. I currently parse a string to get that identifier, and then need to use the resulting string to access the information in the data class. I want to automate this script, which is why I am parsing the unique identifier. I can easily access the data manually; however, with the sheer number of samples, that is not ideal.
Variables: Dada_Object (Large list with multiple items)
sample (Character string name)
Goal:
Unique_Identifier = Parsing_Function(sample)
Desired = Dada_Object[Unique_Identifier]$sequences
Problem: Using the unique identifier does not allow access to the sequences information. The unique identifier is currently a string object. Any direction to this problem would be greatly appreciated.
I have solved my problem with an ad hoc method.
I began by subsetting the class list and then unlisting the elements:
Step_1 = Dada_Object[Unique_Identifier]
Step_2 = unlist(Step_1)
From here I was able to subset the named list elements
Desired_Output = names(Step_2)[1:Desired_Output_Length]
This solution is a workaround; however, I am still curious whether anyone has a better way to access class items using strings.
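For what it's worth, here is a minimal sketch of the usual string-based lookup, assuming the Dada2 result behaves like an ordinary named list of per-sample lists. The object below is a hypothetical stand-in with made-up sequences, not real Dada2 output. Single brackets return a list of length one (so `$sequences` finds nothing), while double brackets return the element itself.

```r
# Hypothetical stand-in for Dada_Object: a named list of per-sample lists.
Dada_Object <- list(
  sampleA = list(sequences = c("ACGT", "TTGA")),
  sampleB = list(sequences = c("GGCA"))
)

# In the question this would come from Parsing_Function(sample).
Unique_Identifier <- "sampleB"

# `[` keeps the list wrapper, so there is no $sequences at this level.
wrapped <- Dada_Object[Unique_Identifier]

# `[[` extracts the element itself, so $sequences is reachable.
Desired <- Dada_Object[[Unique_Identifier]]$sequences
```

If the real object is not a plain named list, the same idea may still apply, but that is an assumption to check with `str(Dada_Object)`.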
Related
I am working in R.
I have a large set of 20-nucleotide DNA sequence strings (~60 million). Currently I just keep them in a matrix of strings.
I need to be able to store them as efficiently as possible in memory, be able to match sequences and count number of times each string sequence appears, and importantly, be able to associate and store multiple strings to one or more.
I was wondering if anyone can suggest a formal object class that will be suitable for some (/all) of that functionality?
Is it possible to feed R a character string and for it to know I'm looking for the data frame with that name?
Example:
TestData <- matrix(1:100,nrow=10, ncol=10)
Then, if I want to reference it later, can I do something similar to this to have R pull dataset?
paste("TestData$",x[1,],sep="")
When entered this way, it comes up as a character string and obviously returns no data. For context, I'm trying to do it this way because I'm creating a loop that goes through several data sets (and columns within those datasets), but does similar operations, so I would like to be able to dynamically change the referenced dataset.
Any help is appreciated. Thanks!
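A minimal sketch of two ways this is commonly handled, assuming the data sets can either be collected in a named list or looked up by name with `get()`. The names `datasets`, `dataset_name`, and `col_index` are made up for illustration.

```r
TestData <- matrix(1:100, nrow = 10, ncol = 10)

# Option 1: keep the data sets in a named list and index it with a string.
datasets <- list(TestData = TestData)
dataset_name <- "TestData"   # could change on each loop iteration
col_index <- 3               # hypothetical column of interest

by_list <- datasets[[dataset_name]][, col_index]

# Option 2: look the object up by its name with get().
by_get <- get(dataset_name)[, col_index]
```

The list version tends to be easier to loop over, since `names(datasets)` gives every available name without searching the workspace.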
I am using R to prepare some data for a D3 visualization. The visualization was created using the following structure (this is a single row from a .csv file that is subsequently converted to JSON in javascript).
Joe.Schmoe, joe.schmoe#email.com, Sao Paulo, ["Community01", "Community02", "Community03"], ["workgroup01","workgroup02"]
This is a single row. The headers would be:
Person, Email, Location, Communities, Workgroups
You'll notice that the Communities and Workgroups columns contain lists. Furthermore, these lists will vary in length depending on which Communities and Workgroups each individual is associated with. I recognize that this is probably not best practice with regard to data "tidiness," but it is what this viz is expecting.
So ... in R (which I'm learning), I'm finding it impossible to recreate this structure because, when I try to populate the "communities" or "workgroups" variables, R seems to expect that each variable will be of equal length.
The code that I have reads from a data.frame that lists the members of a particular community, and adds the name of that community to a column in a master data.frame of all employees. I'm indexing by email address because it is unique. So this particular loop looks at each individual email address in a data.frame called "commTD" and finds it in a master data.frame called "testr". If it finds it, it looks at the communities variable and either replaces an NA value with the name of the community (in this case "Technical Design") or, if the vector already exists, appends Technical Design to it:
for(i in commTD$email){
  if(i %in% testr$email){
    tmpList <- testr[which(testr$email == i), 'communities']
    if(is.na(tmpList)){
      tmpList <- list(c("Technical Design"))
    } else {
      tmpList <- append(tmpList[[1]][1], 'Technical Design')
    }
    testr[which(testr$email == i), 'communities'] <- list(tmpList)
  }
}
This works fine for the initial replacement, but if I append a new community to the list, and then try to pass it back into the testr data.frame, I get an error:
Error in `[<-.data.frame`(`*tmp*`, which(testr$email == i), "communities",
: replacement has 2 rows, data has 1
You'll note that I'm trying to create a list of vectors, which is just one way I've tried to figure this out. I thought maybe I could force R to see the list as a single object, even though it contains multiple items -- or in this case a vector of multiple items.
Is this just impossible in R, to have varied length vectors or lists as a single variable in a data frame?
Data frames are by definition a list of vectors of equal length, so an ordinary column cannot hold a different number of values in each row; in that sense, no, it's not possible with a plain data.frame.
You could either use another type of object, such as data.table, as suggested, or think of your desired output as a list of vectors of unequal length to pass to your JS.
That object would look something like:
dataList <- list(name = c("Joe.Schmoe", "Joe.Bloe"),
                 email = c("joe.schmoe#email.com", "joe.bloe#email.com"),
                 location = c("Sao Paulo", "London"),
                 Communities = list(c("Community01", "Community02", "Community03"),
                                    c("Community02", "Community05", "Community03")),
                 Workgroups = list(c("workgroup01", "workgroup02"),
                                   c("workgroup01", "workgroup03")))
Then access each field like a dataframe, for output to your js:
dataList$name
dataList$Communities
etc...
As per Frank's suggestion, you may want to access each entry via the email address, like this:
data_list[["joe.schmoe#email.com"]]
...then build the list with the names of the email as the index, like so:
data_list = list(`joe.schmoe#email.com`=list(name="Joe",
location="Sao Paulo",
Communities=....),
`joe.bloe#email.com`=list(name="Joe", ...))
Then, you can avoid the non-R style of using for() loops, and start the fun of the lapply() family of functions to work on all the entries in a vectorised manner. (See ?lapply for details)
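As a rough sketch of what that could look like, assuming a list keyed by email as described above: the records, the helper `add_community()`, and the address `nobody#email.com` are all hypothetical, made up for illustration.

```r
# Hypothetical records keyed by email address.
data_list <- list(
  `joe.schmoe#email.com` = list(
    name = "Joe.Schmoe", location = "Sao Paulo",
    Communities = c("Community01", "Community02", "Community03")
  ),
  `joe.bloe#email.com` = list(
    name = "Joe.Bloe", location = "London",
    Communities = c("Community02", "Community05")
  )
)

# Hypothetical helper: add a community to every listed email that exists.
add_community <- function(records, emails, community) {
  hits <- intersect(emails, names(records))   # ignore unknown emails
  records[hits] <- lapply(records[hits], function(person) {
    person$Communities <- union(person$Communities, community)
    person
  })
  records
}

data_list <- add_community(
  data_list,
  c("joe.bloe#email.com", "nobody#email.com"),
  "Technical Design"
)

# Number of communities per person, computed without an explicit loop.
community_counts <- sapply(data_list, function(person) length(person$Communities))
```

Using `union()` rather than `append()` is a design choice here so that running the helper twice does not duplicate the community name.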
Hope it helps.
I have created a list of SpatialPolygons objects in R using the code below and wish to run each polygon through a for loop. I would like to access the original name that I assigned to each object so that it can be used within the for loop. This should be really easy, but I can't figure out how to do it with a SpatialPolygons object, as no information linking the object to its original name appears to be stored in it once it is loaded within the for loop. Any help would be great. Thanks!
oblist = c(p1,p2,p3,p4)
for(i in 1:length(oblist)){
  obs = oblist[[i]]
  obj.nm = #some way to obtain the original object name i.e. p1 for oblist[[1]]
  …#etc#
}
Use a list with named components, rather than a vector:
> oblist = list(p1=p1, p2=p2, p3=p3, p4=p4)
> for(i in 1:length(oblist)){
+ print(names(oblist)[i])
+ print(oblist[[i]])
+ }
Note that the name of a variable should rarely be of interest to code. This kind of introspection is discouraged, and very few languages allow it. A variable should not be able to ask what its name is. It's only on rare occasions, like when you do plot(foo,bar) and you want the axes to be labelled foo and bar, that you should do it.
Better to have another variable that stores the names of the elements of the objects (and this is how the above code sort of works, by storing their names in the names attribute of a list). This also lets you have names that aren't valid variable names.
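A small sketch of iterating by name, using plain strings as hypothetical stand-ins for the SpatialPolygons objects so that it runs without any spatial package. The last element shows a name that would not be a valid variable name.

```r
# Stand-ins for the polygon objects; real code would store the polygons here.
oblist <- list(
  p1 = "polygon one",
  p2 = "polygon two",
  `site 3` = "polygon three"   # not a valid variable name, fine as a list name
)

labels <- character(0)
for (nm in names(oblist)) {
  obs <- oblist[[nm]]                       # the object itself
  labels <- c(labels, paste(nm, obs, sep = ": "))  # the name travels with it
}
```

Looping over `names(oblist)` instead of `1:length(oblist)` gives the name directly and also behaves sensibly when the list is empty.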
How can I select the second column of a dynamically named variable?
I create variables of the form "population.USA", "population.Mexico", "population.Canada". Each variable has a column for the year, and another column for the population value. I would like to select the second column from each of these variables during a loop.
I use this syntax:
sprintf("population.%s", country)[, 2]
R returns the error: Error in sprintf("population.%s", country)[, 2] : incorrect number of dimensions
Based on your sequence of questions over the last few minutes, I have two general recommendations for you as you get familiar with R:
Don't use sprintf.
Don't use assign.
Now, obviously, those functions are both useful at times. But you've learned about them too early, before you've mastered some basic stuff about R's data structures. Try to write code without those crutches (for the time being!), as they're just causing you problems.
Rather than creating separate individual variables for each nation's population, place them in a list.
population <- vector("list",3)
names(population) <- c('USA','Mexico','Russia')
Then you can access each using the string representation of the name of each country:
population[['USA']] <- 10000
Or,
region <- 'USA'
population[[region]]
In this example, I've assigned a single value to a list element, but lists will hold any other data type, including matrices or data frames. It will be a lot less typing than using sprintf and assign, and a lot safer and more efficient as well.
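To connect this back to the original question, here is a minimal sketch with one data frame per country stored in a named list, and the second column pulled out inside a loop. The years and population values are made up purely for illustration.

```r
countries <- c("USA", "Mexico", "Canada")

# One list element per country, named by country.
population <- setNames(vector("list", length(countries)), countries)
for (country in countries) {
  # Made-up numbers: column 1 is the year, column 2 the population value.
  population[[country]] <- data.frame(
    year = 2000:2002,
    population = c(100, 110, 120)
  )
}

# Select the second column for each country, indexing the list by string.
second_columns <- list()
for (country in names(population)) {
  second_columns[[country]] <- population[[country]][, 2]
}
```

The second loop could equally be written as `lapply(population, function(df) df[, 2])`.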
See ?get. Here is an example:
> country <- "FOO"
> assign(sprintf("population.%s", country), data.frame(runif(5), runif(5)))
>
> get(sprintf("population.%s", country))[,2]
[1] 0.2241105 0.5640709 0.5945869 0.1830719 0.1895938
It is critically important to look at the object returned by a function if you get an error. It is immediately clear why your example fails if you just look at what it returns:
> sprintf("population.%s", country)
[1] "population.FOO"
At that point it would be immediately clear, if you didn't already know or hadn't thought to read ?sprintf, that sprintf() returns a string, not the object of that name. Armed with that knowledge, you would have narrowed the problem down to: how do I retrieve an object from a computed name?
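The same inspection can be sketched in a script. This version keeps the object in its own environment (an assumption made here so the example does not touch the global workspace) and uses made-up values instead of random ones.

```r
env <- new.env()   # hypothetical scratch environment for the example
country <- "FOO"

# sprintf() only builds the name; it is a character string.
object_name <- sprintf("population.%s", country)
name_class <- class(object_name)

# Store a made-up data frame under that computed name.
assign(
  object_name,
  data.frame(year = 2001:2003, population = c(10, 20, 30)),
  envir = env
)

# get() turns the computed name back into the object, which can be indexed.
retrieved <- get(object_name, envir = env)
second_column <- retrieved[, 2]
```

Checking `class()` on each intermediate result is usually enough to see whether you are holding a name or the object it refers to.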