Weird behaviour when ordering a data frame - r

I have the following data frame that I want to order by the fifth column ("Distance").
When I try
df.order <- df[order(df[, 5]), ]
I always get the following error message:
Error in order(df[, 5]) : unimplemented type 'list' in 'orderVector1'
I don't know why R considers my data frame a list. Running is.data.frame(df) returns TRUE, although I have to admit that is.list(df) also returns TRUE. Is it possible to force my data frame to be only a data frame and not a list?
Thanks for your help.
structure(list(ID = list(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
Latitude = list(50.7368, 50.7368, 50.7368, 50.7369, 50.7369, 50.737, 50.737, 50.7371, 50.7371, 50.7371),
Longitude = list(6.0873, 6.0873, 6.0873, 6.0872, 6.0872, 6.0872, 6.0872, 6.0872, 6.0872, 6.0872),
Elevation = list(269.26, 268.99, 268.73, 268.69, 268.14, 267.87, 267.61, 267.31, 267.21, 267.02),
Distance = list(119.4396, 119.4396, 119.4396, 121.199, 121.199, 117.5658, 117.5658, 114.9003, 114.9003, 114.9003),
RxPower = list(-52.6695443922406, -52.269130891243, -52.9735258244422, -52.2116571930007, -51.7784534281727, -52.7703448813654, -51.6558862949081, -52.2892907635308, -51.8322993596551, -52.4971436682333)),
.Names = c("ID", "Latitude", "Longitude", "Elevation", "Distance", "RxPower"),
row.names = c(NA, 10L), class = "data.frame")

The columns of your data frame are lists, not atomic vectors. You can convert this data frame to the "classical" format using as.data.frame and unlist:
df2 <- as.data.frame(lapply(df, unlist))
Now, the new data frame could be sorted in the intended way:
df2[order(df2[, 5]), ]
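A minimal sketch of how one might confirm the diagnosis and the fix. The three-row data frame below is made up for illustration (it is not the asker's data) but is built the same way as the dput() output in the question.

```r
# Hypothetical toy data frame whose columns are lists.
df <- structure(list(ID = list(1, 2, 3),
                     Distance = list(121.199, 114.9003, 119.4396)),
                row.names = c(NA, 3L), class = "data.frame")

# Every column reports class "list", which is what order() cannot handle.
col_classes <- sapply(df, class)
print(col_classes)

# Flatten each list column into an atomic vector, then rebuild the data frame.
df2 <- as.data.frame(lapply(df, unlist))
print(sapply(df2, class))

# Ordering by the Distance column now works.
df2_sorted <- df2[order(df2$Distance), ]
print(df2_sorted)
```

Note that unlist would shorten a column containing NULL entries, so this assumes every cell holds exactly one value.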

Here is a small example that illustrates the problem:
df <- structure(list(ID = c(1, 2, 3, 4),
Latitude = c(50.7368, 50.7368, 50.7368, 50.7369),
Longitude = c(6.0873, 6.0873, 6.0873, 6.0872),
Elevation = c(269.26, 268.99, 268.73, 268.69),
Distance = c(119.4396, 119.4396, 119.4396, 121.199),
RxPower = c(-52.6695443922406, -52.269130891243, -52.9735258244422,
-52.2116571930007)),
.Names = c("ID", "Latitude", "Longitude", "Elevation", "Distance", "RxPower"),
row.names = c(NA, 4L), class = "data.frame")
Notice that list occurs only once, and all the values are wrapped in c(...) rather than list(...). In your data each column is itself wrapped in list(...), which is why sapply(df, class) on your data reports class list for every column.
With the data frame above:
> sapply(df, class)
# ID Latitude Longitude Elevation Distance RxPower
# "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
Now order works (here sorting by the fourth column, Elevation):
> df[order(df[,4]), ]
# ID Latitude Longitude Elevation Distance RxPower
# 4 4 50.7369 6.0872 268.69 121.1990 -52.21166
# 3 3 50.7368 6.0873 268.73 119.4396 -52.97353
# 2 2 50.7368 6.0873 268.99 119.4396 -52.26913
# 1 1 50.7368 6.0873 269.26 119.4396 -52.66954

This turns your data.frame of lists into a matrix:
mat <- sapply(df,unlist)
Now you can order it.
mat[order(mat[,5]),]
If all columns are of one type, e.g., numeric, a matrix often is preferable, because operations on matrices are faster than on data.frames. However, you can transform to a data.frame using as.data.frame(mat).
Btw, a data.frame is a special kind of list and thus is.list returns TRUE for every data.frame.
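The two remarks above can be sketched as follows. The data are hypothetical toy values; the example only illustrates that sapply yields a matrix, that a matrix holds a single type, and that is.list is TRUE for any data frame.

```r
# Hypothetical list-column data frame, in the same shape as the question's.
df <- structure(list(ID = list(1, 2, 3),
                     Distance = list(121.199, 114.9003, 119.4396)),
                row.names = c(NA, 3L), class = "data.frame")

# sapply() simplifies the unlisted columns into a numeric matrix.
mat <- sapply(df, unlist)
print(is.matrix(mat))
print(mat[order(mat[, "Distance"]), ])

# A matrix holds a single type: a data frame with one character column
# is coerced entirely to character by as.matrix().
mixed <- data.frame(ID = c(1, 2), Label = c("a", "b"))
print(typeof(as.matrix(mixed)))

# Any data frame is a list under the hood, so is.list() is always TRUE.
print(is.list(mixed))
```

So the matrix route is only a good fit when all columns really share one type.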

Ran across this same problem. This worked for me (maybe it might help someone else who is having the same problem and stumbled on this page).
I had a structure like:
lst <- list(row1 = list(col1="A",col2=1,col3="!"), row2 = list(col1="B",col2=2,col3="#"))
> lst
$row1
$row1$col1
[1] "A"
$row1$col2
[1] 1
$row1$col3
[1] "!"
$row2
$row2$col1
[1] "B"
$row2$col2
[1] 2
$row2$col3
[1] "#"
I was doing:
df <- as.data.frame(do.call(rbind, lst))
And I kept getting the same error you were getting when I tried df[order(df$col1), ]. It turns out I had to do:
df <- do.call(rbind.data.frame, lst)
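A short sketch of why the first attempt fails, using the lst structure from this answer. It assumes a reasonably recent R version; the variable names m, bad and good are only for illustration.

```r
lst <- list(row1 = list(col1 = "A", col2 = 1, col3 = "!"),
            row2 = list(col1 = "B", col2 = 2, col3 = "#"))

# rbind() on plain lists builds a matrix whose cells are list elements.
m <- do.call(rbind, lst)
print(typeof(m))

# Converting that matrix keeps every column as a list, hence the error.
bad <- as.data.frame(m)
print(sapply(bad, class))

# rbind.data.frame() builds ordinary atomic columns instead.
good <- do.call(rbind.data.frame, lst)
print(sapply(good, is.list))
print(good[order(good$col1, decreasing = TRUE), ])
```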

Related

getting the names of data frames from list in R

I have a list which contains 36 data frames. I want to create a list containing all the names of those data frames :
dput(myfiles[1:2])
list(structure(list(X.Treatment.1.Treatment.10.Treatment.2.Treatment.3.Treatment.4.Treatment.5.Treatment.6.Treatment.7.Treatment.8.Treatment.9 = c("Treatment.1,1,0.779269898976048,0.987582177817029,0.999865208543176,0.999637376053903,0.969316946773183,0.992798203986959,0.424960684181985,0.804869101320034,0.934784678841289",
"Treatment.10,0.779269898976048,1,0.671138248567996,0.789454098761072,0.762111859396959,0.909408486972833,0.848734212632234,-0.236126723371631,0.255300504533133,0.505840502482398",
"Treatment.2,0.987582177817029,0.671138248567996,1,0.984869671366683,0.991454531822078,0.918661911614817,0.961649044703906,0.561895346303209,0.888107698459535,0.978982111839266",
"Treatment.3,0.999865208543176,0.789454098761072,0.984869671366683,1,0.99906051831384,0.973222174821046,0.994631289318653,0.410041249133801,0.795017057233326,0.9288266084351",
"Treatment.4,0.999637376053903,0.762111859396959,0.991454531822078,0.99906051831384,1,0.962346166096083,0.989212254209048,0.449182113577399,0.820557713571369,0.944010924367408",
"Treatment.5,0.969316946773183,0.909408486972833,0.918661911614817,0.973222174821046,0.962346166096083,1,0.991784351747349,0.189407610662142,0.634294194129571,0.81878574572229",
"Treatment.6,0.992798203986959,0.848734212632234,0.961649044703906,0.994631289318653,0.989212254209048,0.991784351747349,1,0.31345701514879,0.72797778020465,0.885498274066011",
"Treatment.7,0.424960684181985,-0.236126723371631,0.561895346303209,0.410041249133801,0.449182113577399,0.189407610662142,0.31345701514879,1,0.879237827530393,0.718791431723663",
"Treatment.8,0.804869101320034,0.255300504533133,0.888107698459535,0.795017057233326,0.820557713571369,0.634294194129571,0.72797778020465,0.879237827530393,1,0.963182415401058",
"Treatment.9,0.934784678841289,0.505840502482398,0.978982111839266,0.9288266084351,0.944010924367408,0.81878574572229,0.885498274066011,0.718791431723663,0.963182415401058,1"
)), class = "data.frame", row.names = c(NA, -10L)), structure(list(
X.Treatment.1.Treatment.10.Treatment.2.Treatment.3.Treatment.4.Treatment.5.Treatment.6.Treatment.7.Treatment.8.Treatment.9 = c("Treatment.1,1,NA,NA,NA,NA,NA,NA,NA,NA,NA",
"Treatment.10,NA,1,NA,NA,NA,NA,NA,NA,NA,NA", "Treatment.2,NA,NA,1,NA,NA,NA,NA,NA,NA,NA",
"Treatment.3,NA,NA,NA,1,NA,NA,NA,NA,NA,NA", "Treatment.4,NA,NA,NA,NA,1,NA,NA,NA,NA,NA",
"Treatment.5,NA,NA,NA,NA,NA,1,NA,NA,NA,NA", "Treatment.6,NA,NA,NA,NA,NA,NA,1,NA,NA,NA",
"Treatment.7,NA,NA,NA,NA,NA,NA,NA,1,NA,NA", "Treatment.8,NA,NA,NA,NA,NA,NA,NA,NA,1,NA",
"Treatment.9,NA,NA,NA,NA,NA,NA,NA,NA,NA,1")), class = "data.frame", row.names = c(NA,
-10L)))
I want a list containing all the names of the data frames. The problem is that when I write:
names(list_median)[i]
It just returns NULL. Each data frame in the list is a correlation matrix that looks like this.
I'm not sure whether this is what you are after:
mat_names <- lapply(list_median, \(x) do.call(cbind, dimnames(x)))
mat_names <- lapply(mat_names, \(x) {colnames(x) <- c("Rows", "Cols"); x})
Here is a possible explanation why you are running into issues. The code is commented:
# Extract each data frame into the global environment
for (i in seq_along(list_median)) {
  assign(paste0("df", i), list_median[[i]])
}
# You should now see df1, df2, etc. in the environment.
# Now build a list from a few of them, e.g. df1 and df2:
my_list <- list(df1, df2)
# Now try to get the names
names(my_list)
# NULL
# Now name the data frames when building the list and ask for the names again:
my_list <- list(df1nownamed = df1, df2nownamed = df2)
names(my_list)
# [1] "df1nownamed" "df2nownamed"
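If the list already exists without names, one option is to assign names afterwards. A minimal sketch, where the two small data frames and the labels "df1", "df2" are made up for illustration:

```r
# Two hypothetical stand-ins for the data frames in the question's list.
list_median <- list(data.frame(a = 1:2), data.frame(a = 3:4))

# An unnamed list has no names attribute at all.
print(names(list_median))

# Assign names after the fact; "df1", "df2", ... are made-up labels.
names(list_median) <- paste0("df", seq_along(list_median))
print(names(list_median))

# Elements can now be retrieved by name.
print(list_median[["df2"]])
```

If the data frames were read from files, the file names would be a natural choice of labels, assuming the vector of paths is still available.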

Pass multiple arguments to function from dataframe with unknown number of columns

I have a function similar to this:
testfun = function(jID,kID,d){
g=paste0(jID,kID)
date = d
bb=data.frame(g,date)
return(bb)
}
Data frame:
x=data.frame(jID = c("a","b"),kID=c("c","d"),date="20170206",stringsAsFactors = FALSE)
I want to pass each row as inputs into the function. The solutions provided in "Passing multiple arguments to a function taken from dataframe" are great, but in that case the number of columns was known. How would a solution like this:
vtestfun <- (Vectorize(testfun, SIMPLIFY=FALSE))
vtestfun(x[,1],x[,2],x[,3])
be applied if the number of columns in the dataframe is not known or keeps changing?
If you can match the argument names to the column names like so:
testfun <- function(jID, kID, date){ # 'date', not 'd'
g <- paste0(jID, kID)
bb <- data.frame(g, date)
return(bb)
}
You could do:
purrr::pmap(x, testfun)
Returning:
[[1]]
g date
1 ac 20170206
[[2]]
g date
1 bd 20170206
# Data used:
x <- structure(list(jID = c("a", "b"), kID = c("c", "d"), date = c("20170206", "20170206")), class = "data.frame", row.names = c(NA, -2L))
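If adding purrr is not an option, a base R sketch along the same lines is possible. It relies on the same assumption as above, namely that the column names match the function's argument names.

```r
testfun <- function(jID, kID, date) {
  data.frame(g = paste0(jID, kID), date = date)
}

x <- data.frame(jID = c("a", "b"), kID = c("c", "d"),
                date = "20170206", stringsAsFactors = FALSE)

# A data frame is a list of columns, so its columns can be spliced in as
# named arguments to Map(); each call receives one row's values.
res <- do.call(Map, c(list(f = testfun), x))
print(res)
```

Because the columns are passed by name, the number of columns does not need to be spelled out in the call, but every column must correspond to an argument the function accepts.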

Returning specific values within a row

I have 1 row of data and 50 columns in the row from a csv which I've put into a dataframe. The data is arranged across the spreadsheet like this:
"FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"...
How would I select only the middle part of each element (e.g. "DFGS" in the first one, "SGRE" in the second, etc.), count their occurrences and display the results?
I have tried using the strsplit function but I couldn't get it to work for the entire row of data. I'm thinking a loop of some kind might be what I need.
You can do unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)] (assuming your data is consistently of the form A-B-C).
# E.g.
fun <- function(x) unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)]
fun(c("FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"))
# [1] "DFGS" "SGRE" "DFGS"
Edit
# Data frame
df <- structure(list(a = "FSEG-DFGS-THDG", b = "SGDG-SGRE-JJDF", c = "DIDC-DFGS-LEMS"),
class = "data.frame", row.names = c(NA, -1L))
fun(t(df[1,]))
# [1] "DFGS" "SGRE" "DFGS"
First we create a function strng() and then we apply() it on every column of df. strsplit() splits a string by "-" and strng() returns the second part.
df = data.frame(a = "ab-bc-ca", b = "gn-bc-ca", c = "kj-ll-mn")
strng = function(x) {
strsplit(x,"-")[[1]][2]
}
# table() outputs frequency of elements in the input
table(apply(df, MARGIN = 2, FUN = strng))
# output:
# bc ll
#  2  1
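Putting extraction and counting together, a minimal sketch using the three sample strings from the question. It assumes every element has the form A-B-C.

```r
x <- c("FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS")

# Split each string on a literal "-" and keep the second piece of each.
middle <- vapply(strsplit(x, "-", fixed = TRUE), function(p) p[2], character(1))
print(middle)

# table() counts occurrences; sort() puts the most frequent first.
counts <- sort(table(middle), decreasing = TRUE)
print(counts)
```

For a one-row data frame with character columns, the row can first be flattened to a character vector, for example with unlist(df[1, ]).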

In R what's the difference between [[X]] and [, X] when selecting vectors

library(tidyverse)
df0 <- data.frame(col1 = c(5, 2), col2 = c(6, 4))
df1 <- data.frame(col1 = c(5, 2),
col2 = c(6, 4),
col3 = ifelse(apply(df0[, 1:2], 1, sum) > 10 &
df0[, 2] > 5,
"True",
"False"))
df2 <- as_tibble(df1)
I've got my data frame df1 above. I've basically "copied" it as a tibble df2. Let's mimic an analysis for this df1 data frame and df2 tibble.
identical(df1[[2]], df1[, 2])
# [1] TRUE
identical(df2[[2]], df2[, 2])
# [1] FALSE
Since df1 and df2 are essentially the "same", why do I get the TRUE/FALSE dichotomy in my code block above? What is the tibble() property that has changed?
The same question asked another way - what is the difference between [[X]] and [, X], when applied to base R, and also when used in the tidyverse?
Since a data frame is a list, we can think of this in terms of list subsetting. Take for instance:
L <- list(A = c(1, 2), B = c(1, 4))
L[[2]]
This extracts the second element of the list. Extrapolate this to:
df1[[2]]
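We get the same output as df1[, 2], because selecting a single column of a base data frame with [, X] drops the result to a plain vector by default. Hence identical(df1[[2]], df1[, 2]) returns TRUE.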
The second part has to do with tibble structure, i.e.:
typeof(as_tibble(df1)[[2]])
# [1] "double"
typeof(as_tibble(df1)[, 2])
# [1] "list"
The second is a list (a one-column tibble) while the first is a vector, hence identical returns FALSE.
From the docs, objects of class tbl_df have:
A class attribute of c("tbl_df", "tbl", "data.frame").
A base type of "list", where each element of the list has the same NROW().
A names attribute that is a character vector the same length as the underlying list.
A row.names attribute, included for compatibility with the base data.frame class. This attribute is only consulted to query the number of rows, any row names that might be stored there are ignored by most tibble methods.
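The tibble behaviour can be mimicked in base R with drop = FALSE, which avoids needing the tibble package to see the difference. A minimal sketch; the name kept is only for illustration.

```r
df1 <- data.frame(col1 = c(5, 2), col2 = c(6, 4))

# Base R: single-column `[` drops to a plain vector by default,
# so it matches `[[`.
print(identical(df1[[2]], df1[, 2]))

# With drop = FALSE the result stays a one-column data frame, which is
# roughly what `[` on a tibble always does.
kept <- df1[, 2, drop = FALSE]
print(class(kept))
print(identical(df1[[2]], kept))
```

So [[X]] always extracts the column itself, while [, X] returns a vector or a one-column table depending on whether dropping applies.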

extract command from data.frame to create same df

Is there any package/command in R that reads a data.frame and then creates a command that can be used to create exactly the same data.frame without loading data, i.e., all data of the data.frame would have to be stored within the command?
e.g. if one has a data.frame like this:
mydata <- data.frame(col1=c(1,2),col2=c(3,4))
I just want to get the command such that reading "mydata" results in the command on the right hand side.
BR
Fabian
The dput function "Writes an ASCII text representation of an R object to a file or connection" and is as close to the right hand side as you would get. It actually contains more details about the structure of the object, as is seen below:
> dput(mydata)
structure(list(col1 = c(1, 2), col2 = c(3, 4)), .Names = c("col1",
"col2"), row.names = c(NA, -2L), class = "data.frame")
You could also use enquote, which turns mydata back into an unevaluated call. It can then be evaluated with eval.
> ( e <- enquote(mydata) )
# quote(list(col1 = c(1, 2), col2 = c(3, 4)))
> eval(e)
# col1 col2
# 1 1 3
# 2 2 4
> identical(eval(e), mydata)
# [1] TRUE
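The dput route can also be checked end to end in a script. The sketch below uses deparse, which produces essentially the same text representation as dput but returns it as a character vector; cmd and rebuilt are illustrative names.

```r
mydata <- data.frame(col1 = c(1, 2), col2 = c(3, 4))

# Capture the text representation of the data frame as a single string.
cmd <- paste(deparse(mydata), collapse = "\n")
cat(cmd, "\n")

# Parsing and evaluating that text rebuilds the data frame.
rebuilt <- eval(parse(text = cmd))
print(all.equal(rebuilt, mydata))
```

Doubles with many significant digits may not round-trip exactly with the default settings; the control argument of dput and deparse has options for that.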
