R Include lists of Strings in Dataframe - r

I am trying to create an artificial dataframe of words contributed and deleted by users of Wikipedia for each edit that they make, the end result should look like this:
I created some artificial data to build such a frame but I'm having problems with the variables "Tokens Added" and "Tokens deleted".
I thought creating them as lists of lists would allow me to include them in dataframes even if the elements do not always have equal length. But apparently that's not the case. Instead, R creates a variable for each individual token. That's not feasible because it would create millions of variables. Here is some code to exemplify:
a <- c(1,2,3)
e <- list(b = as.list(c("a","b")),c = as.list(c(1L,3L,5L,4L)),d = as.list(c(TRUE,FALSE,TRUE)))
DF <- cbind(a,e)
U <- data.frame(a,e)
I would like to have it like this:
Is this possible at all in R with dataframes (I tried searching for answers already but they were either for different questions or too technical for me)? Any help is much appreciated!

You can do exactly what you want if you are willing to use library(tibble):
library(tibble)
a <- c(1,2,3)
e <- list(b = as.list(c("a","b")),c = as.list(c(1L,3L,5L,4L)),d = as.list(c(TRUE,FALSE,TRUE)))
tibble(a,e)
# A tibble: 3 × 2
a e
<dbl> <list>
1 1 <list [2]>
2 2 <list [4]>
3 3 <list [3]>
A tibble or tbl_df will behave just like you are used to with a traditional data.frame but allow you some nice extra functionality like storing lists of various lengths in a column.

I don't think what you want is possible using a vector of lists (as you suggest in your question). This is mainly because you can't create a vector of lists in R (see: How to create a vector of lists in R?)
However, one option (if you really want a data.frame) would be to coerce everything to a character (the most flexible type in R). Something like this might work for you:
e <- c(paste0(c("a","b"),collapse=","), paste0(c(1L,3L,5L,4L), collapse = ","), paste0(c(TRUE,FALSE,TRUE), collapse = ","))
U <- data.frame(a,e, stringsAsFactors = FALSE)
U
# a e
#1 1 a,b
#2 2 1,3,5,4
#3 3 TRUE,FALSE,TRUE
Then you can back out the value of each cell with a split. Something like:
strsplit(U$e, ",")
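Note that strsplit() hands back character pieces, so the original types have to be restored by hand. A minimal sketch, assuming you know which row held which type (the names parts, ints and flags are made up for illustration):

```r
a <- c(1, 2, 3)
e <- c("a,b", "1,3,5,4", "TRUE,FALSE,TRUE")
U <- data.frame(a, e, stringsAsFactors = FALSE)

# strsplit() returns a list with one character vector per row
parts <- strsplit(U$e, ",")

# convert back to the original type row by row
ints  <- as.integer(parts[[2]])
flags <- as.logical(parts[[3]])
```

This is one reason a real list column tends to be more convenient than pasted strings: the element types survive without a manual conversion step.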

Thanks for all the suggestions everyone! I think I found a simpler solution though. Just in case anyone else has a similar problem in the future, this is what I did:
a <- c(1,2,3)
b <- c("a","b")
c <- c(1L,3L,5L,4L)
d <- c(TRUE,FALSE,TRUE)
e <- list(b,c,d);e
DF <- data.frame(a,I(e));DF
The I() inhibit function apparently prevents the lists from being converted and the column behaves just like a list of lists as far as I can tell so far. The class of the e column is however not "list" but "AsIs". I don't know whether this might cause problems further down the line, if so, I will update this answer!
EDIT
So it turns out that some functions do not take the AsIs class as input. To convert it back to a useful character string, you can simply use unlist() on every row.
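A rough sketch of working with such a column in base R, for anyone who wants to try it. The names second, n_tokens, plain, joined and DF2 are only illustrative, and whether unclass() is enough for a particular function that rejects AsIs is an assumption to check case by case:

```r
a <- c(1, 2, 3)
e <- list(c("a", "b"), c(1L, 3L, 5L, 4L), c(TRUE, FALSE, TRUE))
DF <- data.frame(a, e = I(e))

# the AsIs column can still be indexed like a list
second   <- DF$e[[2]]
n_tokens <- lengths(DF$e)

# dropping the AsIs class gives back a plain list
plain <- unclass(DF$e)

# one comma-separated string per row
joined <- sapply(DF$e, paste, collapse = ",")

# alternative: assign the list after the data frame exists,
# which stores it as an ordinary list column without I()
DF2 <- data.frame(a)
DF2$e <- e
```

Assigning the list after construction avoids the AsIs class altogether, which may be simpler if later functions are picky about the column class.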

Try this:
cbind(a,lapply(e,function(x) paste(unlist(x),collapse=",")))

Related

R Subsetting nested lists, select multiple entries

I frequently work with large datasets, resulting in me creating nested lists sometimes to reduce the objects in the environment.
When subsetting such a list and wanting to go to the first entry along all steps, it would look like this:
llra[[1]][[1]][[1]]
In some of my current scripts the data are arranged so that the entries at the last level of the list are comparable with each other. If I would like to compare these or make a calculation it would look something like this:
mean(llra[[1]][[1]][[1]], llra[[1]][[2]][[1]], llra[[1]][[3]][[1]])
Is there a way to subset them differently so I could write it something like this:
mean(llra[[1]][[c(1:3)]][[1]])
Thanks for your help!
You can use purrr::map.
library(purrr)
mean(map_dbl(1:3, ~llra[[1]][[.x]][[1]]))
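If you would rather not load a package, the same idea can be sketched in base R with sapply(). The llra object below is a made-up stand-in with one number at each leaf, not the asker's real data:

```r
# hypothetical nested list: llra[[1]][[i]][[1]] is a single number
llra <- list(list(list(10, 1), list(20, 2), list(30, 3)))

# pull the first leaf from each of the three middle entries
firsts <- sapply(1:3, function(i) llra[[1]][[i]][[1]])
avg <- mean(firsts)
```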
Create a small helper function. This creates a grid of indexes and extracts each one. Finally it unlists the result. No packages are used.
unravel <- function(L, ...) {
  if (...length()) L <-
    apply(expand.grid(...), 1, function(ix) L[[ix]], simplify = FALSE)
  unlist(L)
}
# test
L <- list(a = list(b = list(1:3, 4:5), c = list(11:12, 20:25)))
# Example 1
mean(unravel(L, 1, 1:2, 1))
## [1] 5.8
# check
mean(c(L[[1]][[1]][[1]], L[[1]][[2]][[1]]))
## [1] 5.8
# Example 2
mean(unravel(L, 1, 1, 1:2))
## [1] 3
# check
mean(c(L[[1]][[1]][[1]], L[[1]][[1]][[2]]))
## [1] 3
Update
Generalized unravel so that it does not assume three levels or which level(s) are specified as scalar or vector indices.
Thank you for your answers, both Grothendieck's and Novotny's approaches work.
I simplified my example and since I am using raster layers in the last step of the list I made it work like this:
unravel method:
mean(stack(unravel(llra, 1, 1:3,1)))
map method:
mean(stack(map(1:3, ~llra[[1]][[.x]][[1]])))
It seems like a basic thing, strange that this isn't implemented in R yet.

Is there a way to change variable assignment names

Using R. Is there a way that I can give R any text string and it will treat it like a formula?
An example says it all.
a <- 1
b <- 2
c <- 3
d <- 4
What if I had to do this all the way up to z?
In R we can write:
letters[1]
This gives us an "a"
So what about something like this:
(It doesn't work but I'd like to do something like this)
for (i in 1:4) {
letters[i] <- i
}
There's the as.formula function but that's only good for formulas like a ~ b + c.
Thanks.
If you want to evaluate a text :
eval(parse(text="a<-1"))
But if you want to initialize many variables, you can create a named list and convert it to separate variables (attaching each component to the global environment) using list2env. That said, I would highly recommend that you keep your variables in the same list.
xx <- letters[1:5]
list2env(setNames(as.list(seq_along(xx)), xx), .GlobalEnv)
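To illustrate the recommendation to keep everything in one list, here is a small sketch. The names vals and env are made up, and a fresh environment is used instead of .GlobalEnv so the example does not clutter your workspace:

```r
# named list: one entry per "variable"
vals <- setNames(as.list(1:4), letters[1:4])

# access by name, no separate variables needed
third <- vals$c
second <- vals[["b"]]

# if separate variables are really wanted, copy them into an environment
env <- list2env(vals, envir = new.env())
fourth <- get("d", envir = env)

# the loop from the question can be written with assign()
for (i in 1:4) {
  assign(letters[i], i, envir = env)
}
```

Keeping the values in vals means they can be looped over with lapply() or indexed by a computed name, which is usually what the "a to z" situation calls for.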

r llply lapply usage without losing dataframes names

Overall situation:
The interface of my measuring devices can't save any information other than the name of the CSV file it generates while measuring. So I used a systematic set of abbreviations to encode the changing parameters, such as concentrations, enzymes, feedstocks, buffers, etc. Combined, these abbreviations form the titles of my CSV files, which become the names of the data.frames. I am now trying to read those names and combine them with the rest of the data, to build tables that I can use for regressions.
The Issue:
I just noticed that I lose the names of my data.frames inside the list,
I could rename them after each call of lapply, but this doesn't seem to be a proper solution.
I found a suggestion to use llply, but I can't get it to keep the names either.
# loads plyr package
library(plyr)
# generates a showcase list of dataframes,
data <- list(data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)))
# assigns names to dataframe
names(data) <- list("one","two", "tree", "four")
# uses the data frame's name to pass "o" to a column; this part works fine,
# but after running it the names are lost
# lapply() has no USE.NAMES argument (that belongs to sapply()), so it is omitted here
data <- lapply(X = seq_along(data),
               FUN = function(i){
                 x <- data[[i]]
                 if (gsub("([(a-z)]).*","\\1", names(data)[i]) == "o") {x$enz <- "o"}
                 return(x)})
Same thing with llply, operates as expected but doesn’t keep the name either although I thought I could solve that particular problem (quote: “llply is equivalent to lapply except that it will preserve labels and can display a progress bar.”)
data <- llply(seq_along(data), function(i){
x <- data[[i]]
if (gsub("([(a-z)]).*","\\1", names(data)[i]) == "o") {x$enz <- "o"}
return(x)})
I would very much appreciate a hint on how to solve this without something like
names(data) <- list.with.the.names
after each llply or lapply call.
Do something like this:
for (i in seq_along(data)) data[[i]]$name <- names(data)[i]
do.call(rbind, data)
# c.1..2. c.3..3. name
#one.1 1 3 one
#one.2 2 3 one
#two.1 1 3 two
#two.2 2 3 two
#tree.1 1 3 tree
#tree.2 2 3 tree
#four.1 1 3 four
#four.2 2 3 four
And continue from there.
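If the goal is to keep a named list rather than bind everything together, two base R idioms may help. The names are lost in the question because lapply() iterates over seq_along(data), which is an unnamed integer vector. A sketch, with by_name and in_place as illustrative names and a simplified first-letter test standing in for the gsub() call:

```r
data <- list(
  one  = data.frame(x = c(1, 2), y = c(3, 3)),
  two  = data.frame(x = c(1, 2), y = c(3, 3)),
  tree = data.frame(x = c(1, 2), y = c(3, 3)),
  four = data.frame(x = c(1, 2), y = c(3, 3))
)

# option 1: iterate over the names; sapply() labels the result with them
by_name <- sapply(names(data), function(nm) {
  x <- data[[nm]]
  if (substr(nm, 1, 1) == "o") x$enz <- "o"
  x
}, simplify = FALSE, USE.NAMES = TRUE)

# option 2: assign into data[] so the existing names are kept
in_place <- data
in_place[] <- lapply(seq_along(data), function(i) {
  x <- data[[i]]
  if (substr(names(data)[i], 1, 1) == "o") x$enz <- "o"
  x
})
```

Both results carry the names one, two, tree and four, so no renaming step is needed afterwards.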

Changing columns positions in a data frame without total reassignment

I want to swap two columns in a data.frame. I know I could do something like:
dd <- dd[c(1:4, 6:5, 7:10)]
But I find it inelegant, potentially slow (?) and not program-friendly (you need to know length(dd), and even have some cases if the swapped columns are close or not to that value...)
Is there an easy way to do it without reassigning the whole data frame?
dd[2:3] <- dd[3:2]
Turns out to be very "lossy" because the [ <- only concerns the values, and not the attributes. So for instance:
(dd <- data.frame( A = 1:4, Does = 'really', SO = 'rock' ) )
dd[3:2]
dd[2:3] <- dd[3:2]
print(dd)
The column names are obviously not flipped...
Any idea? I could also add a small custom function to my very long list, but grrr... should be a way. ;-)
It's not a single function, but relatively simple:
dd[replace(seq(dd), 2:3, 3:2)]
A SO Does
1 1 rock really
2 2 rock really
3 3 rock really
4 4 rock really
This:
dd[,2:3] <- dd[,3:2]
works, but you have to update the names as well:
names(dd)[2:3] <- names(dd)[3:2]
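If you do end up adding a small custom function, it could look something like this. swap_cols is a hypothetical helper name, and it assumes i and j are column positions:

```r
# swap two columns by position; values and names move together
swap_cols <- function(df, i, j) {
  idx <- seq_along(df)
  idx[c(i, j)] <- idx[c(j, i)]
  df[idx]
}

dd <- data.frame(A = 1:4, Does = "really", SO = "rock")
swapped <- swap_cols(dd, 2, 3)
```

Because it reorders by index rather than assigning values, there is no need to know length(dd) or to fix the names afterwards.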

Assigning output of a function to two variables in R [duplicate]

This seems like an easy question, but I can't figure it out and I haven't had luck in the R manuals I've looked at. I want to find dim(x), but I want to assign dim(x)[1] to a and dim(x)[2] to b in a single line.
I've tried [a b] <- dim(x) and c(a, b) <- dim(x), but neither has worked. Is there a one-line way to do this? It seems like a very basic thing that should be easy to handle.
This may not be as simple of a solution as you had wanted, but this gets the job done. It's also a very handy tool in the future, should you need to assign multiple variables at once (and you don't know how many values you have).
Output <- SomeFunction(x)
VariablesList <- letters[seq_along(Output)]
for (i in seq_along(Output)) {
  assign(VariablesList[i], Output[i])
}
Loops aren't the most efficient things in R, but I've used this multiple times. I personally find it especially useful when gathering information from a folder with an unknown number of entries.
EDIT: And in this case, Output could be any length (as long as VariablesList is longer).
EDIT #2: Changed up the VariablesList vector to allow for more values, as Liz suggested.
You can also write your own function that will always make a global a and b. But this isn't advisable:
mydim <- function(x) {
  out <- dim(x)
  a <<- out[1]
  b <<- out[2]
}
The "R" way to do this is to output the results as a list or vector just like the built in function does and access them as needed:
out <- dim(x)
out[1]
out[2]
R has excellent list and vector comprehension that many other languages lack and thus doesn't have this multiple assignment feature. Instead it has a rich set of functions to reach into complex data structures without looping constructs.
Doesn't look like there is a way to do this. Really the only way to deal with it is to add a couple of extra lines:
temp <- dim(x)
a <- temp[1]
b <- temp[2]
It depends what is in a and b. If they are just numbers try to return a vector like this:
dim <- function(x,y)
  return(c(x,y))
dim(1,2)[1]
# [1] 1
dim(1,2)[2]
# [1] 2
If a and b are something else, you might want to return a list
dim <- function(x,y)
  return(list(item1=x:y,item2=(2*x):(2*y)))
dim(1,2)[[1]]
# [1] 1 2
dim(1,2)[[2]]
# [1] 2 3 4
EDIT:
try this: x <- c(1,2); names(x) <- c("a","b")
