R varied length vector or list in variable - r

I am using R to prepare some data for a D3 visualization. The visualization was created using the following structure (this is a single row from a .csv file that is subsequently converted to JSON in javascript).
Joe.Schmoe, joe.schmoe#email.com, Sao Paulo, ["Community01", "Community02", "Community03"],
["workgroup01","workgroup02"]
This is a single row. The headers would be:
Person, Email, Location, Communities, Workgroups
You'll notice that the Communities and Workgroups columns contain lists. Furthermore, these lists will vary in length depending on what Communities and Workgroups each individual is associated with. I recognize that this is probably not best practice with regard to data "tidiness," but it is what this viz is expecting.
So ... in R (which I'm learning), I'm finding it impossible to recreate this structure because, when I try to populate the "communities" or "workgroups" variables, R seems to expect that each variable will be of equal length.
The code that I have reads from a data.frame that lists the members of a particular community, and adds the name of that community to a column in a master data.frame of all employees. I'm indexing by email address because it is unique. So this particular loop looks at each individual email address in a data.frame called "commTD" and finds it in a master data.frame called "testr". If it finds it, it looks at the communities variable and either replaces an NA value with the name of the community (in this case "Technical Design"), or, if the vector already exists, appends Technical Design to it:
for(i in commTD$email){
  if(i %in% testr$email){
    tmpList <- testr[which(testr$email ==i) , 'communities']
    if(is.na(tmpList)){
      tmpList <- list(c("Technical Design"))
    }
    else{
      tmpList <- append(tmpList[[1]][1], 'Technical Design')
    }
    testr[which(testr$email ==i) , 'communities'] <- list(tmpList)
  }
}
This works fine for the initial replacement, but if I append a new community to the list, and then try to pass it back into the testr data.frame, I get an error:
Error in `[<-.data.frame`(`*tmp*`, which(testr$email == i), "communities",
: replacement has 2 rows, data has 1
You'll note that I'm trying to create a list of vectors, which is just one way I've tried to figure this out. I thought maybe I could force R to see the list as a single object, even though it contains multiple items -- or in this case a vector of multiple items.
Is this just impossible in R, to have varied length vectors or lists as a single variable in a data frame?

A data frame is by definition a list of columns of equal length, so if you are asking whether each row of an ordinary (atomic) column in a data.frame can hold a vector of a different length, then no, it's not possible.
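The equal-length rule applies to the columns themselves, not to what each cell holds: a column may itself be a list, and the elements of a list column can differ in length. A minimal base R sketch, assuming a made-up two-row testr (the email values are placeholders, not real data):

```r
# Made-up stand-in for the master data frame
testr <- data.frame(email = c("joe.schmoe#email.com", "joe.bloe#email.com"),
                    stringsAsFactors = FALSE)

# A list column: one list element per row, each initially NULL
testr$communities <- vector("list", nrow(testr))

# Append to one row's element with [[ ]] on the column, not with [ , ]
i <- which(testr$email == "joe.schmoe#email.com")
testr$communities[[i]] <- c(testr$communities[[i]], "Technical Design")
testr$communities[[i]] <- c(testr$communities[[i]], "Community02")

# Number of communities per row
lengths(testr$communities)
```

Indexing the column with [[ ]] replaces a single element of the list, which is why the replacement length no longer has to match the number of rows.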
As suggested, you could use another type of object such as a data.table. Alternatively, think of your desired output as a list of vectors of unequal length to pass to your JavaScript.
That object would look something like:
dataList <- list(name = c("Joe.Schmoe", "Joe.Bloe"),
                 email = c("joe.schmoe#email.com", "joe.bloe#email.com"),
                 location = c("Sao Paulo", "London"),
                 Communities = list(c("Community01", "Community02", "Community03"),
                                    c("Community02", "Community05", "Community03")),
                 Workgroups = list(c("workgroup01", "workgroup02"),
                                   c("workgroup01", "workgroup03")))
Then access each field like a dataframe, for output to your js:
dataList$name
dataList$Communities
etc...
As per Frank's suggestion, you may want to access each entry via its email address, like this:
data_list[["joe.schmoe#email.com"]]
To do that, build the list with the email addresses as the element names, like so:
data_list = list(`joe.schmoe#email.com` = list(name = "Joe",
                                               location = "Sao Paulo",
                                               Communities = ....),
                 `joe.bloe#email.com` = list(name = "Joe", ...))
Then, you can avoid the non-R style of using for() loops, and start the fun of the lapply() family of functions to work on all the entries in a vectorised manner. (See ?lapply for details)
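As a rough sketch of what that could look like for the "Technical Design" update from the question: the entries, the td_members vector (standing in for commTD$email) and all values below are made up for illustration.

```r
# Hypothetical list keyed by email address
data_list <- list(
  `joe.schmoe#email.com` = list(name = "Joe", location = "Sao Paulo",
                                Communities = c("Community01", "Community02")),
  `joe.bloe#email.com`   = list(name = "Joe", location = "London",
                                Communities = character(0))
)

# Hypothetical members of one community (stand-in for commTD$email)
td_members <- c("joe.bloe#email.com", "someone.else#email.com")

# Only touch people who appear in both the master list and the community
hits <- intersect(names(data_list), td_members)

# Add the community to each matching entry; union() avoids duplicates
data_list[hits] <- lapply(data_list[hits], function(person) {
  person$Communities <- union(person$Communities, "Technical Design")
  person
})

data_list[["joe.bloe#email.com"]]$Communities
```

Because each person is a separate list element, the Communities vectors are free to have different lengths.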
Hope it helps.

Related

Efficient way of extracting names of a large number of variables in R

It could be a very easy question, given that I am very unfamiliar with R. I know normally one can use deparse(substitute(.)) to extract the name of a variable. However, if I have a long list of variables (let's say it's built without names), how can I extract the name of each variable efficiently? I was thinking about using loops, but the deparse(substitute(.)) method would obviously generate the 'general' variable name we used to denote every item.
Sample code:
countries<-
list(austria,belgium,czech,denmark,france,germany,italy,luxemberg,netherlands,poland,swiss)
Suppose I want countryNames to equal list("austria","belgium",...,"swiss"); how should I code that? I tried generating the list using countries <- list(countryA = countryA, countryB = countryB, ...), but it was extremely tedious, and in some cases I might only have an unnamed input list from elsewhere.
countries would hold only the values of the individual objects (austria, belgium, etc.), not their names. To access the names, you need to create a named list when building countries, which can be done like this:
countries <- list(austria = austria,belgium = belgium....)
However, if this is very tedious you can use tibble::lst which creates the names automatically without explicitly mentioning them.
countries <- tibble::lst(austria,belgium....)
In both cases you can access the names using names(countries).
If the country objects are the only ones loaded in the global environment, we can do this easily with ls and mget, which return a named list of values:
countries <- mget(ls())
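If other objects are also present in the environment, mget(ls()) would pick them up too. A hedged variant is to pass the wanted names explicitly; the three objects and their values here are made up for illustration.

```r
# Made-up country objects
austria <- 1
belgium <- 2
czech   <- 3

# Names of the objects to collect, typed once as strings
wanted <- c("austria", "belgium", "czech")

# mget() looks each name up and returns a named list of the values
countries <- mget(wanted, envir = environment())

# The names as a list of strings, as asked for in the question
countryNames <- as.list(names(countries))
```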

How can I access data in a nested R list?

I want to learn how to access data from a nested list in R. I am relatively new to the R programming language, so I am unsure how to proceed.
The data is a 'Large list' (947 elements, 654.9 Mb) and takes the form:
The numbers within the datalist refer to station numbers and when I click on one (in Rstudio) it looks like this:
I want to know how I can access the data within 'doy', for example. I have tried:
data[[1]]
which returns all the data for the first element of the list (site, location, doy,ltm etc). So clearly the number used within the square brackets is interpreted as an index for the list, as opposed to an identifier for the elements/station in the list.
Then I tried:
data$1
but it returned the error:
Error: unexpected numeric constant in "data$1"
Then I tried:
data[data$1==doy]
But was returned this:
Error: unexpected numeric constant in "data[data$1"
So at this point, I realise that it is not construing the number of the station as a category/factor within the list. It's just reading it as a number. So I thought I'd put some quotes around it to see if that changed what happened:
data[data$"1"=="doy"]
This returned
named list()
But when I looked at it in the environment, it was a list of 0.
I looked at some of the similar question here on Stack (like: accessing nested lists in R) and tried:
data[data$"1"=="doy",][[1]]
But just got:
Error in data[data$"1" == "doy", ] : incorrect number of dimensions
How can I access this data? It reminds me of a structure in Matlab, but it doesn't seem to be indexed in a similar fashion in R.
Let's look at some ways to do what you want:
data[[1]]
This returns the first element of the list, which is itself a list. You can use the $ subsetting shorthand, but the name of the first element is nonstandard. R prefers names that start with letters and include only alphanumeric characters, periods and underscores. You can escape this behavior with backticks:
data$`1`
If you want to access one of the elements of list 1 in your list of lists, you need to subset further. To get to doy, which is the third element of 1, you can use any of these four forms:
data[[1]][[3]]
data$`1`[[3]]
data[[1]]$doy
data$`1`$doy
One way (in addition to what Ben Norris has shown):
our_list[[c("1", "doy")]]
Reproducible example data (please provide next time)
our_list <- list(`1` = list(site = "x", doy = 3))
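The accessors above can be compared side by side on a slightly larger made-up list. The station names, fields and values below are assumptions for illustration only; the last line sketches one way to pull doy out of every station at once.

```r
# Made-up nested list: two stations, each with site, location and doy
stations <- list(
  `1` = list(site = "x", location = "a", doy = c(1, 2, 3)),
  `2` = list(site = "y", location = "b", doy = c(10, 20))
)

by_position <- stations[[1]][[3]]          # first station, third element
by_name     <- stations$`1`$doy            # backticks for the nonstandard name
by_path     <- stations[[c("1", "doy")]]   # recursive indexing by name

# doy for every station, returned as a list named by station
all_doy <- lapply(stations, function(st) st$doy)
```

All three single-station forms return the same vector; lapply keeps the station names on the result.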

How do I change column names in list of data frames inside a function?

I know that the answer to "how to change names in a list of data frames" has been answered multiple times. However, I'm stuck trying to generate a function that can take any list as an argument and change all of the column names of all of the data frames in the list. I am working with a large number of .csv files, all of which will have the same 3 column names. I'm importing the files in groups as follows:
# Get a group of drying data data files, remove 1st column
files <- list.files('Mang_Run1', pattern = '\\.csv$', full.names = TRUE)
mr1 <- lapply(files, read.csv, skip = 1, header = TRUE, colClasses = c("NULL", NA, NA, NA))
I will have 6 such file groups. If I run the following code on a single list, the names of the columns in each data frame within the specified list will be changed correctly.
for (i in seq_along(mr1)) {
  names(mr1[[i]]) <- c('Date_Time', 'Temp_F', 'RH')
}
However, if I try to generalize the function (see code below) to take any list as an argument, it does not work correctly.
nameChange <- function(ls) {
  for (i in seq_along(ls)) {
    names(ls[[i]]) <- c('Date_Time', 'Temp_F', 'RH')
  }
  return(ls)
}
When I call nameChange on mr1 (list generated from above), it prints the entire contents of the list to the console and does not change the names of the columns in the data frames within the list. I'm clearly missing something fundamental about the inner workings of R here. I've tried the above function with and without return, and have made several modifications to the code, none of which have proven successful. I'd greatly appreciate any help, and would really like to understand the 'why' behind the problem as well. I've had considerable trouble in the past handling functions that take lists as arguments.
Thanks very much in advance for any constructive input.
I think this might be a very simple fix:
First, generalize the function you are using to rename the columns. This only needs to work on one dataframe at a time.
renameFunction <- function(x, someNames) {
  names(x) <- someNames
  return(x)
}
Now we need to define the names we want to change each column name to.
someNames <- c('Date_Time', 'Temp_F', 'RH')
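Then we call the new function on every element of the "mr1" list and assign the result back to mr1. The assignment matters: lapply returns a new list and leaves the original untouched.
mr1 <- lapply(mr1, renameFunction, someNames)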
I may have gotten some of the details wrong with regard to your exact situation, but I've used this method before to solve similar issues. Since you were able to get it to work on the specific case, I'm pretty sure this will generalize readily using lapply.
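As for the "why": a function in R works on its own copy of its arguments, so the caller only sees the renamed data frames if the returned list is assigned to something. A small sketch using the nameChange function from the question, with a made-up two-element stand-in for mr1:

```r
# nameChange as written in the question
nameChange <- function(ls) {
  for (i in seq_along(ls)) {
    names(ls[[i]]) <- c('Date_Time', 'Temp_F', 'RH')
  }
  return(ls)
}

# Made-up stand-in for the imported list of data frames
mr1 <- list(data.frame(a = 1, b = 2, c = 3),
            data.frame(a = 4, b = 5, c = 6))

# Calling without assignment only prints the result; mr1 is unchanged.
# Assigning the returned list keeps the renamed copies.
renamed <- nameChange(mr1)

names(mr1[[1]])      # still the original names
names(renamed[[1]])  # the new names
```

Writing mr1 <- nameChange(mr1) instead would overwrite the original list with the renamed one.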

Select a column from a dynamic variable

How can I select the second column of a dynamically named variable?
I create variables of the form "population.USA", "population.Mexico", "population.Canada". Each variable has a column for the year, and another column for the population value. I would like to select the second column from each of these variables during a loop.
I use this syntax:
sprintf("population.%s", country)[, 2]
R returns the error: Error in sprintf("population.%s", country)[, 2] : incorrect number of dimensions
Based on your sequence of questions over the last few minutes, I have two general recommendations for you as you get familiar with R:
Don't use sprintf.
Don't use assign.
Now, obviously, those functions are both useful at times. But you've learned about them too early, before you've mastered some basic stuff about R's data structures. Try to write code without those crutches (for the time being!), as they're just causing you problems.
Rather than creating separate individual variables for each nation's population, place them in a list.
population <- vector("list",3)
names(population) <- c('USA','Mexico','Russia')
Then you can access each using the string representation of the name of each country:
population[['USA']] <- 10000
Or,
region <- 'USA'
population[[region]]
In this example, I've assigned a single value to a list element, but lists will hold any other data type, including matrices or data frames. It will be a lot less typing than using sprintf and assign, and a lot safer and more efficient as well.
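Applied to the situation in the question, each list element could hold one country's data frame, and the second column can then be selected inside a loop. A minimal sketch; the column names and the numbers are placeholders, not real population figures.

```r
# One data frame per country, stored in a named list (placeholder values)
population <- list(
  USA    = data.frame(year = c(2000, 2010), value = c(1, 2)),
  Mexico = data.frame(year = c(2000, 2010), value = c(3, 4)),
  Canada = data.frame(year = c(2000, 2010), value = c(5, 6))
)

# Select the second column of each country's data frame in a loop
for (country in names(population)) {
  second_col <- population[[country]][, 2]
  cat(country, ":", second_col, "\n")
}

# The same selection without an explicit loop
second_cols <- lapply(population, function(df) df[[2]])
```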
See ?get. Here is an example:
> country <- "FOO"
> assign(sprintf("population.%s", country), data.frame(runif(5), runif(5)))
>
> get(sprintf("population.%s", country))[,2]
[1] 0.2241105 0.5640709 0.5945869 0.1830719 0.1895938
It is critically important to look at the object returned by a function if you get an error. It is immediately clear why your example fails if you just look at what it returns:
> sprintf("population.%s", country)
[1] "population.FOO"
At that point it would be clear, even if you didn't already know it or hadn't thought to read ?sprintf, that sprintf() returns a string, not the object of that name. Armed with that knowledge, you would have narrowed the problem down to this: how do you retrieve an object from a computed name?

How to use a value that is specified in a function call as a "variable"

I am wondering if it is possible in R to use a value that is declared in a function call as a "variable" part of the function itself, similar to the functionality that is available in SAS IML.
Given something like this:
put.together <- function(suffix, numbers) {
new.suffix <<- as.data.frame(numbers)
return(new.suffix)
}
x <- c(seq(1000,1012, 1))
put.together(part.a, x)
new.part.a ##### does not exist!!
new.suffix ##### does exist
As it is written, the function returns a dataframe called new.suffix, as it should because that is what I'm asking it to do.
I would like to get a dataframe returned that is called new.part.a.
EDIT: Additional information was requested regarding the purpose of the analysis
The purpose of the question is to produce dataframes that will be sent to another function for analysis.
There exists a data bank where elements are organized into groups by number, and other people organize the groups into a meaningful set.
Each group has an id number. I use the information supplied by others to put the groups together as they are specified.
For example, I would be given a set of id numbers like: part-1 = 102263, 102338, 202236, 302342, 902273, 102337, 402233.
So, part-1 has seven groups, each group having several elements.
I use the id numbers in a merge so that only the groups of interest are extracted from the large data bank.
The following is what I have for one set:
### all.possible.elements.bank <- .csv file from large database ###
id.part.1 <- as.data.frame(c(102263, 102338, 202236, 302342, 902273, 102337, 402233))
bank.names <- c("bank.id")
colnames(id.part.1) <- bank.names
part.sort <- matrix(seq(1,nrow(id.part.1),1))
sort.part.1 <- cbind(id.part.1, part.sort)
final.part.1 <- as.data.frame(merge(sort.part.1, all.possible.elements.bank,
by="bank.id", all.x=TRUE))
The process above is repeated many, many times.
I know that I could do this for all of the collections that I would pull together, but I thought I would be able to wrap the selection process into a function. The only things that would change would be the part numbers (part-1, part-2, etc..) and the groups that are selected out.
It is possible using the assign function (and possibly deparse and substitute), but doing things like this is strongly discouraged. Why can't you just return the data frame (dropping the suffix argument from the function) and call the function like:
new.part.a <- put.together(x)
That is generally the better approach.
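For the repeated merge described in the question, returning the data frame also makes it easy to keep all the parts together in one named list. A sketch under assumptions: build_part is a hypothetical helper, and the bank below is a tiny made-up stand-in for all.possible.elements.bank.

```r
# Hypothetical helper: build one part from a vector of group ids
build_part <- function(ids, bank) {
  part <- data.frame(bank.id = ids, part.sort = seq_along(ids))
  # Keep every requested id, even if the bank has no match for it
  merge(part, bank, by = "bank.id", all.x = TRUE)
}

# Made-up stand-in for the large data bank
bank <- data.frame(bank.id = c(402233, 102263, 999999),
                   element = c("e1", "e2", "e3"))

# One entry per part; the names replace new.part.a, new.part.b, ...
part_ids <- list(part.1 = c(102263, 102338, 402233))

final_parts <- lapply(part_ids, build_part, bank = bank)
final_parts$part.1
```

Each result is then reached by name, for example final_parts$part.1, without creating variables in the global environment from inside the function.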
If you really want to change things in the global environment then you may want a macro; see the defmacro function in the gtools package and, most importantly, read the document in the references section of its help page.
This is rarely something you should want to do... assigning to things out of the function environment can get you into all sorts of trouble.
However, you can do it using assign:
put.together <- function(suffix, numbers) {
  assign(paste('new',
               deparse(substitute(suffix)),
               sep='.'),
         as.data.frame(numbers),
         envir=parent.env(environment()))
}
put.together(part.a, 1:20)
But like Greg said, it's usually not necessary, and always dangerous if used incorrectly.
