How to create a dataframe with dynamic column length in R

I am creating dataframes in R like below.
len5<-data.frame("C1"=character(0),"C2"=character(0),"C3"=character(0),"C4"=character(0),"C5"=character(0), stringsAsFactors=FALSE)
len6<-data.frame("C1"=character(0),"C2"=character(0),"C3"=character(0),"C4"=character(0),"C5"=character(0),"C6"=character(0),stringsAsFactors=FALSE)
len7<-data.frame("C1"=character(0),"C2"=character(0),"C3"=character(0),"C4"=character(0),"C5"=character(0),"C6"=character(0),"C7"=character(0),stringsAsFactors=FALSE)
and so on.
However, I need to create them dynamically in a loop, for data frames with 5 to 15 columns.
Is there any way of doing that? All the columns will be character only.
Thanks
Tanmay

We can do this with lapply to create a list of data.frames. It is better to keep that in the list and not create multiple objects in the global environment.
i1 <- 5:15
lst <- lapply(i1, function(x) data.frame(setNames(replicate(x,character(0)),
paste0("C", seq_len(x))), stringsAsFactors = FALSE))
names(lst) <- paste0("len", i1)
If the program needs to take the objects from the global environment, list2env can create them there:
list2env(lst, .GlobalEnv)
str(len5)
#'data.frame': 0 obs. of 5 variables:
# $ C1: chr
# $ C2: chr
# $ C3: chr
# $ C4: chr
# $ C5: chr
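Since the question asks for a loop, the same result can also be sketched with an explicit for loop. This is only an illustration in base R; the names n, cols and lst below are chosen for the example and still keep everything in one list rather than in separate global objects.

```r
# Build one empty, all-character data frame per column count (5 to 15).
lst <- list()
for (n in 5:15) {
  # A named list of n zero-length character vectors: C1, C2, ..., Cn
  cols <- setNames(rep(list(character(0)), n), paste0("C", seq_len(n)))
  lst[[paste0("len", n)]] <- data.frame(cols, stringsAsFactors = FALSE)
}
str(lst$len5)
```

Individual data frames are then available as lst$len5, lst[["len12"]], and so on.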

Related

Change class of dataframe based on other dataframe

I have a dataframe in R that I import from Excel and a dataframe that I create with a script. These dataframes contain the same columns, but since one is imported from Excel, the classes of its columns are not identical to those of the columns in the dataframe created with the script.
The dataframes contain 500+ columns so to do it individually would take a lot of time. Is there any way to change the class of all columns of the excel imported dataframe to the class of the columns from the script created dataframe?
Many thanks!
df1 <- data.frame(a=1,b="2")
df2 <- data.frame(a=1L,b=2,d=3)
nms <- intersect(names(df1), names(df2))
df2[nms] <- Map(function(ref, tgt) { class(tgt) <- class(ref); tgt; }, df1[nms], df2[nms])
str(df2)
# 'data.frame': 1 obs. of 3 variables:
# $ a: int 1
# $ b: chr "2"
# $ d: num 3
Granted, $a remains integer instead of being cast to numeric; if that's not a concern, then that may suffice. If not, then this more-verbose and more-flexible option might be preferred:
cls <- sapply(df1[nms], function(z) class(z)[1])
df2[nms] <- Map(function(tgt, cl) {
if (cl == "numeric") {
tgt <- as.numeric(tgt)
} else if (cl == "integer") {
tgt <- as.integer(tgt)
} else if (cl == "character") {
tgt <- as.character(tgt)
}
tgt
}, df2[nms], cls)
str(df2)
# 'data.frame': 1 obs. of 3 variables:
# $ a: num 1
# $ b: chr "2"
# $ d: num 3
The rationale behind sapply(.., class(z)[1]) is that some objects have a class vector of length greater than 1 (e.g., tbl_df, POSIXct), which would break the cl == "..." comparisons above.
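A more compact variant of the same idea, as a sketch only: look up the matching as.<class>() function by name instead of writing one branch per class. This assumes such a function exists for every class found in df1 (true for numeric, integer, character and logical, but not for every class). stringsAsFactors = FALSE is set explicitly here so that b is character regardless of the R version.

```r
df1 <- data.frame(a = 1, b = "2", stringsAsFactors = FALSE)
df2 <- data.frame(a = 1L, b = 2, d = 3)

nms <- intersect(names(df1), names(df2))
# First class of each reference column, as a plain character vector
cls <- vapply(df1[nms], function(z) class(z)[1], character(1))

df2[nms] <- Map(function(tgt, cl) {
  # e.g. cl == "numeric" resolves to as.numeric
  convert <- match.fun(paste0("as.", cl))
  convert(tgt)
}, df2[nms], cls)

str(df2)
```

Columns that exist only in df2 (here d) are left untouched.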

List assignment for list with greater than three nesting

I have not been able to find a fix for this error. I have implemented workarounds before, but I wonder if anyone here knows why it occurs.
The following returns no error, as expected:
q <- list()
q[["a"]][["b"]] <- 3
q[["a"]][["c"]] <- 4
However, when I add another level of nesting I get:
q <- list()
q[["a"]][["b"]][["c"]]<- 3
q[["a"]][["b"]][["d"]] <- 4
Error in q[["a"]][["b"]][["d"]] <- 4 : more elements supplied than there are to replace
To make this even more confusing, if I add a fourth nested list I get:
q <- list()
q[["a"]][["b"]][["c"]][["d"]] <- 3
q[["a"]][["b"]][["c"]][["e"]] <- 4
Error in *tmp*[["c"]] : subscript out of bounds
I would have expected R to return the same error message for the triple nested list as for the quadruple nested list.
I first came across this a few months ago. I am running R 3.4.3.
If we check str(q) after the first example, q is a list with a single element 'a'. That element, however, is a named numeric vector rather than a list.
q <- list()
q[["a"]][["b"]] <- 3
q[["a"]][["c"]] <- 4
str(q)
#List of 1
# $ a: Named num [1:2] 3 4
# ..- attr(*, "names")= chr [1:2] "b" "c"
is.vector(q$a)
#[1] TRUE
The same thing happens one level deeper: the intermediate elements end up as atomic vectors rather than lists, so a later assignment by name into 'b' fails. One option is to create a list element by wrapping the value with list():
q <- list()
q[["a"]][["b"]][["c"]]<- list(3)
q[["a"]][["b"]][["d"]] <- list(4)
This returns a structure in which 'q' is a list of 1 element ('a'), which is again a list of length 1 ('b'). Because we assign the two values 3 and 4 to 'c' and 'd', 'b' is a list of 2 elements:
str(q)
#List of 1
# $ a:List of 1
# ..$ b:List of 2
# .. ..$ c:List of 1
# .. .. ..$ : num 3
# .. ..$ d:List of 1
# .. .. ..$ : num 4
In this way, we can nest any number of lists:
q <- list()
q[["a"]][["b"]][["c"]][["d"]] <- list(3)
q[["a"]][["b"]][["c"]][["e"]] <- list(4)
Note: the expected output structure is not clear from the question.
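If the goal is simply to have the values 3 and 4 stored under 'c' and 'd', another option is to create the intermediate levels explicitly as lists before assigning. The following is a minimal sketch under that assumption, not necessarily the structure the asker had in mind:

```r
# Create the intermediate levels as lists up front
q <- list(a = list(b = list()))

# Both assignments now go into a real list, so no error occurs
q[["a"]][["b"]][["c"]] <- 3
q[["a"]][["b"]][["d"]] <- 4

str(q)
```

Here the values are stored directly as numbers inside the list 'b', without the extra list(3) wrapper.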

split a matrix and save in separate variables in a loop

I have data in the form of an m*n matrix. I would like to split the matrix by column and save each column separately in a different vector.
E.g.
data<-matrix(1:9, ncol=3)
I would like to have vec1 containing the first column so
vec1 will be transpose of [1,2,3], a column matrix with dimension 3*1 which is basically the first column of data. Similarly, vec2 represents the 2nd column and vec3 represents the last column.
I understand that I can do this manually by repeating
vec1<-data[,1],
vec2<-data[,2]
...
vecn<-data[,n].
However, this is not feasible when n is large.
So I would like to know whether it is feasible to use a loop to do this.
As has been pointed out, this is probably not a good idea. However, if you still felt the need to proceed down this path, rather than using a list or just using the source matrix itself, the easiest approach would probably be to use a combination of list2env and data.frame.
Here's a demo, step-by-step:
data <- matrix(1:9, ncol=3)
ls() # Only one object in my workspace
# [1] "data"
data_list <- unclass(data.frame(data))
str(data_list)
# List of 3
# $ X1: int [1:3] 1 2 3
# $ X2: int [1:3] 4 5 6
# $ X3: int [1:3] 7 8 9
# - attr(*, "row.names")= int [1:3] 1 2 3
ls() # Two objects now
# [1] "data" "data_list"
list2env(data_list, envir = .GlobalEnv)
ls() # Five objects now
# [1] "data" "data_list" "X1" "X2" "X3"
X1
# [1] 1 2 3
If you want single-column data.frames, you can use split.default:
list2env(setNames(split.default(data.frame(data), seq(ncol(data))),
paste0("var", seq(ncol(data)))), envir = .GlobalEnv)
var1
# X1
# 1 1
# 2 2
# 3 3
Putting this together, you can actually do this all at once (without first having to create "data_list") like this:
list2env(data.frame(data), envir = .GlobalEnv)
But again, you should have a good reason to do so!
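For comparison, the list-based alternative mentioned at the start of this answer can be sketched as follows. The names cols and vec1, vec2, ... are chosen for the example; nothing is written to the global environment.

```r
data <- matrix(1:9, ncol = 3)

# One list element per column, built by looping over the column indices
cols <- lapply(seq_len(ncol(data)), function(j) data[, j])
names(cols) <- paste0("vec", seq_len(ncol(data)))

cols$vec1
# [1] 1 2 3
```

Each column can then be reached by name (cols$vec2) or by position (cols[[2]]), which scales to any n without creating n separate variables.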

Formatting a df column of vectors while maintaining the structure. (R)

I have a 2 column data frame (DF) of which one column contains vectors and the other is characters.
Orig. Matched
AbcD c("ab.d","Acbd","AA.D","")
jKdf c("JJf.","K.dF","JkD.","")
My aim is to strip all the punctuation marks (commas and periods) as well as make everything lowercase. This is easy enough for the character column, but the vector column is more challenging.
Some lower case methods I tried using are
lapply(DF, tolower). This causes the data frame to convert to a matrix. In doing so I lose the column of vectors structure.
In regards to the punctuation, I tried
gsub("\\.", "", DF) and
gsub("\\,", "", DF) to remove the periods and commas respectively.
This causes the data frame to convert to a character list.
I guess my questions are as follows:
Is there another way to remove punctuation and convert to lowercase that preserves the data frame structure?
If not, how can I convert the above outputs back into the original format, that is, a column of vectors?
I'm sure there are other ways to get this done but here's an example that works pretty well:
DF = data.frame(a = c("JJf.","K.dF","JkD.",""), b = c("ab.d","Acbd","AA.D",""))
DF2 = as.data.frame(lapply(X = DF, FUN = tolower))
DF2$a = gsub(pattern = "\\.",replacement = "", x = DF2$a)
Data frames are just special cases of lists where all the elements have the same length, so coercion back and forth isn't usually a problem.
From your description, it sounds like you have some data that looks like:
mydf <- data.frame(Orig = c("AbcD", "jKdf"),
Matched = I(list(c("ab.d","Ac,bd","AA.D",""),
c("JJf.","K.dF","JkD.",""))))
mydf
# Orig Matched
# 1 AbcD ab.d, Ac....
# 2 jKdf JJf., K.....
str(mydf)
# 'data.frame': 2 obs. of 2 variables:
# $ Orig : Factor w/ 2 levels "AbcD","jKdf": 1 2
# $ Matched:List of 2
# ..$ : chr "ab.d" "Ac,bd" "AA.D" ""
# ..$ : chr "JJf." "K.dF" "JkD." ""
# ..- attr(*, "class")= chr "AsIs"
Usually, if you want to replace data while maintaining the same structure, you replace with [], like this:
mydf[] <- lapply(mydf, function(x) {
if (is.list(x)) {
lapply(x, function(y) {
tolower(gsub("[.,]", "", y))
})
} else {
tolower(gsub("[.,]", "", x))
}
})
Here's the result:
mydf
# Orig Matched
# 1 abcd abd, acbd, aad,
# 2 jkdf jjf, kdf, jkd,
str(mydf)
# 'data.frame': 2 obs. of 2 variables:
# $ Orig : chr "abcd" "jkdf"
# $ Matched:List of 2
# ..$ : chr "abd" "acbd" "aad" ""
# ..$ : chr "jjf" "kdf" "jkd" ""
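If only specific columns need cleaning, the same transformation can also be applied column by column. The sketch below assumes the same example data as above; the helper name clean is made up for the illustration.

```r
# Hypothetical helper: strip periods and commas, then lowercase
clean <- function(x) tolower(gsub("[.,]", "", x))

mydf <- data.frame(Orig = c("AbcD", "jKdf"), stringsAsFactors = FALSE)
# List column: one character vector per row
mydf$Matched <- list(c("ab.d", "Ac,bd", "AA.D", ""),
                     c("JJf.", "K.dF", "JkD.", ""))

mydf$Orig <- clean(mydf$Orig)                 # plain character column
mydf$Matched <- lapply(mydf$Matched, clean)   # keeps one vector per row

str(mydf)
```

Because lapply returns a list with one element per row, the column-of-vectors structure is preserved.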

Converting an R list with NULL sub-elements to a data frame

Say I have a list below
> str(lll)
List of 2
$ :List of 3
..$ Name : chr "Sghokbt"
..$ Title: NULL
..$ Value: int 7
$ :List of 3
..$ Name : chr "Sgnglio"
..$ Title: chr "Mr"
..$ Value: num 5
How can I convert this list to a data frame as below?
> df
Name Title Value
1 Sghokbt <NA> 7
2 Sgnglio Mr 5
as.data.frame doesn't work, I suspect due to the NULL in the first list element. EDIT: I have also tried do.call(rbind, list) as suggested in another question, but the result is a matrix of lists, not a data frame.
To reproduce the list:
list(structure(list(Name = "Sghokbt", Title = NULL, Value = 7L), .Names = c("Name",
"Title", "Value")), structure(list(Name = "Sgnglio", Title = "Mr",
Value = 5), .Names = c("Name", "Title", "Value")))
I think I've found a solution myself.
My approach is to first convert all the sub-lists into dataframes, so I have a list of dataframes instead of list of lists. These dataframes will just drop the NULL variables.
ldf <- lapply(lll, function(x) {
nonnull <- sapply(x, typeof)!="NULL" # find all NULLs to omit
do.call(data.frame, c(x[nonnull], stringsAsFactors=FALSE))
})
The resultant list of dataframes:
> str(ldf)
List of 2
$ :'data.frame': 1 obs. of 2 variables:
..$ Name : chr "Sghokbt"
..$ Value: int 7
$ :'data.frame': 1 obs. of 3 variables:
..$ Name : chr "Sgnglio"
..$ Title: chr "Mr"
..$ Value: num 5
From here I get a little help from plyr.
require(plyr)
df <- ldply(ldf)
The result has the columns out of order, but I'm happy enough with it.
> str(df)
'data.frame': 2 obs. of 3 variables:
$ Name : chr "Sghokbt" "Sgnglio"
$ Value: num 7 5
$ Title: chr NA "Mr"
I won't accept this as an answer for now, in case there is a better solution.
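A base R alternative that keeps the original column order, given here as a sketch only: build each column separately and substitute NA wherever an entry is NULL. It assumes every sub-list has the same names as the first one and that each non-NULL entry has length 1; the variable nms is introduced for the example.

```r
lll <- list(
  list(Name = "Sghokbt", Title = NULL, Value = 7L),
  list(Name = "Sgnglio", Title = "Mr", Value = 5)
)

# Column names are taken from the first sub-list (assumption: all match)
nms <- names(lll[[1]])

df <- as.data.frame(
  lapply(setNames(nms, nms), function(nm) {
    # Collect this field from every sub-list, using NA in place of NULL
    unlist(lapply(lll, function(x) if (is.null(x[[nm]])) NA else x[[nm]]))
  }),
  stringsAsFactors = FALSE
)

str(df)
```

Since unlist() coerces each column to a common type, Title becomes character with an NA in the first row and Value becomes numeric.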
Tidyverse solution
Here's a solution with the tidyverse which might be more readable or at least more intuitive to read for those who are familiar with dplyr and purrr.
lll %>%
# apply to the whole list, and then convert into a tibble
map_df(~
# convert every list element to a char vector
as.character(.x) %>%
# convert the char vector to a tibble row
as_tibble_row(.name_repair = "unique")) %>%
# convert all "NULL" entries to NA
na_if("NULL") %>%
# set tibble names assuming all list entries contain the same names
set_names(lll[[1]] %>% names())
There are several tricks to note:
map_df cannot merge bare character vectors into a dataframe, so each one is first converted into a tibble row with as_tibble_row(). In principle you could name these vectors beforehand, but as.character() does not keep the names, so the names are set at the end instead.
For as_tibble_row(), you need to specify a .name_repair argument because the vectors are unnamed; with it, map_df can merge the tibble rows.
dplyr::na_if() does the work of turning the "NULL" strings into NA.
lll[[1]] %>% names() is just one way to get the names of the first list entry. It assumes the other list entries are named the same and in the same order, so you should check that first.
Details:
Using na_if() replaces this code by Ricky (which is totally fine but harder to remember):
ldf <- lapply(lll, function(x) {
nonnull <- sapply(x, typeof)!="NULL" # find all NULLs to omit
do.call(data.frame, c(x[nonnull], stringsAsFactors=FALSE))
})
data.frame(do.call(rbind, lll))
Name Title Value
1 Sghokbt NULL 7
2 Sgnglio Mr 5
do.call is useful here because it passes the elements of a list as arguments to a function. It calls rbind, which combines the lists observation by observation, and data.frame structures the output as needed. The weakness is that, because data frames also accept lists, the columns of the new object are themselves lists, which makes it difficult to perform calculations on the elements. Below is another option, which is also potentially problematic.
By removing the NULL value first:
null.remove <- function(lst) {
lapply(lst, function(x) {x <- paste(x, ""); x})
}
newlist <- lapply(lll, null.remove)
asvec <- unlist(newlist)
col.length <- length(newlist[[1]])
data.frame(rbind(asvec[1:col.length],
asvec[(col.length+1):length(asvec)]))
Name Title Value
1 Sghokbt 7
2 Sgnglio Mr 5
'data.frame': 2 obs. of 3 variables:
$ Name : Factor w/ 2 levels "Sghokbt ","Sgnglio ": 1 2
$ Title: Factor w/ 2 levels " ","Mr ": 1 2
$ Value: Factor w/ 2 levels "5 ","7 ": 2 1
This method coerces a value onto the NULL elements in the list by pasting a space onto the existing object. Next unlist allows the list elements to be treated as a named vector. col.length takes note of how many variables there are for use in the new data frame. The last function call creates the data frame by using the col.length value to split the vector.
This is still an intermediate result. Before regular data frame operations can be done, the extra space will have to be trimmed off of the factors. The digits must also be coerced to the class numeric.
I can continue the process when I have another chance to update.
