Could anyone explain the split and list functions in R? I am quite confused about how to use them together. For example:
x <- rnorm(10)
a <- gl(2,5)
b <- gl(5,2)
str(split(x, list(a, b)))
The result I get is
List of 10
$ 1.1: num [1:2] 0.1326 -0.0578
$ 2.1: num(0)
$ 1.2: num [1:2] 0.151 0.907
$ 2.2: num(0)
$ 1.3: num -0.393
$ 2.3: num 1.83
$ 1.4: num(0)
$ 2.4: num [1:2] 0.4266 -0.0116
$ 1.5: num(0)
$ 2.5: num [1:2] 0.62 1.64
How are values in x assigned to a level in list(a,b)? Why are there some levels without any values and some with many values? I do not see any relation between the values in x and the levels of list(a,b). Are they randomly assigned?
I would really appreciate it if someone could help me with this.
When you call split(x, list(a, b)), you are basically saying that two x values are in the same group if they have the same a and b value and are in different groups otherwise.
list(a, b)
# [[1]]
# [1] 1 1 1 1 1 2 2 2 2 2
# Levels: 1 2
#
# [[2]]
# [1] 1 1 2 2 3 3 4 4 5 5
# Levels: 1 2 3 4 5
We can see that the first two elements in x are going to be in group "1.1" (the group where a=1 and b=1), the next two will be in group 1.2, the next one will be in group 1.3, the next one will be in group 2.3, the next two will be in group 2.4, and the last two will be in group 2.5. This is exactly what we see when we call split(x, list(a, b)):
split(x, list(a, b))
# $`1.1`
# [1] -0.2431983 -1.5747339
# $`2.1`
# numeric(0)
# $`1.2`
# [1] -0.1058044 -0.8053585
# $`2.2`
# numeric(0)
# $`1.3`
# [1] -1.538958
# $`2.3`
# [1] 0.8363667
# $`1.4`
# numeric(0)
# $`2.4`
# [1] 0.8391658 -1.0488495
# $`1.5`
# numeric(0)
# $`2.5`
# [1] 0.3141165 -1.1813052
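As a rough sketch of what happens internally (assuming the usual base R behaviour, where split() combines a list of factors into a single factor via interaction()), you can print the group label that each element of x receives. The name grp and the seed are only for illustration:

```r
set.seed(1)            # arbitrary seed, only so the example is reproducible
x <- rnorm(10)
a <- gl(2, 5)
b <- gl(5, 2)

# One label per element of x, built from the matching elements of a and b
grp <- interaction(a, b)
as.character(grp)
# [1] "1.1" "1.1" "1.2" "1.2" "1.3" "2.3" "2.4" "2.4" "2.5" "2.5"

# Splitting by the combined factor gives the same result as splitting by the list
identical(split(x, list(a, b)), split(x, grp))
# [1] TRUE
```

So the assignment is positional, not random: the i-th value of x goes to the group named by the i-th values of a and b.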
The reason you have extra empty groups (e.g. group 2.1) is that a and b have some pairs of values where there are no x values. From ?split, you can read that the way to not include these in the output is with the drop=TRUE option:
split(x, list(a, b), drop=TRUE)
# $`1.1`
# [1] -0.2431983 -1.5747339
# $`1.2`
# [1] -0.1058044 -0.8053585
# $`1.3`
# [1] -1.538958
# $`2.3`
# [1] 0.8363667
# $`2.4`
# [1] 0.8391658 -1.0488495
# $`2.5`
# [1] 0.3141165 -1.1813052
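If it helps, a cross-tabulation of a and b shows which combinations are empty before you split. This is a small sketch using the same a and b as in the question; groups is just an illustrative name:

```r
x <- rnorm(10)
a <- gl(2, 5)
b <- gl(5, 2)

# Counts of x values per (a, b) combination; zeros are the empty groups
table(a, b)

groups <- split(x, list(a, b))
lengths(groups)                         # number of values in each group
names(groups)[lengths(groups) == 0]     # the combinations that drop = TRUE removes
# [1] "2.1" "2.2" "1.4" "1.5"
```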
Suppose I have the following dataframe with data (v) and a lookup dataframe (l):
v <- data.frame(d = c(as.Date('2019-01-01'), as.Date('2019-01-05'), as.Date('2019-01-30'), as.Date('2019-02-02')), kind=c('a', 'b', 'c', 'a'), v1=c(1,2,3,4))
v
d kind v1
1 2019-01-01 a 1
2 2019-01-05 b 2
3 2019-01-30 c 3
4 2019-02-02 a 4
l <- data.frame(d = c(as.Date('2019-01-01'), as.Date('2019-01-04'), as.Date('2019-02-01')), kind=c('a','b','a'), l1=c(10,20,30))
l
d kind l1
1 2019-01-01 a 10
2 2019-01-04 b 20
3 2019-02-01 a 30
I would like to find the closest row in the l dataframe corresponding to each row in v using the columns: c("d", "kind"). Column kind needs to match exactly and maybe use findInterval(...) on d?
I would like my result to be:
d kind v1 l1
1 2019-01-01 a 1 10
2 2019-01-05 b 2 20
3 2019-01-30 c 3 NA
4 2019-02-02 a 4 30
NOTE: I would prefer a base-R implementation, but it would be interesting to see others.
I tried findInterval(...) but I don't know how to get it to work with multiple columns.
Here's a shot in base-R only. (I do believe that data.table will do this much more elegantly, but I appreciate your aversion to bringing in other packages.)
Split each frame into a list of frames, by kind:
v_spl <- split(v, v$kind)
l_spl <- split(l, l$kind)
str(v_spl)
# List of 3
# $ a:'data.frame': 2 obs. of 3 variables:
# ..$ d : Date[1:2], format: "2019-01-01" "2019-02-02"
# ..$ kind: Factor w/ 3 levels "a","b","c": 1 1
# ..$ v1 : num [1:2] 1 4
# $ b:'data.frame': 1 obs. of 3 variables:
# ..$ d : Date[1:1], format: "2019-01-05"
# ..$ kind: Factor w/ 3 levels "a","b","c": 2
# ..$ v1 : num 2
# $ c:'data.frame': 1 obs. of 3 variables:
# ..$ d : Date[1:1], format: "2019-01-30"
# ..$ kind: Factor w/ 3 levels "a","b","c": 3
# ..$ v1 : num 3
Now we determine which kind values the two have in common, since there is no need to try to join everything:
### this has the 'kind' in common
(nms <- intersect(names(v_spl), names(l_spl)))
# [1] "a" "b"
### this has the 'kind' we have to bring back in later
(miss_nms <- setdiff(names(v_spl), nms))
# [1] "c"
For the in-common kind, do an interval join:
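joined <- Map(
  v_spl[nms], l_spl[nms],
  f = function(v0, l0) {
    # index of the last l0$d at or before each v0$d (l0$d must be sorted)
    ind <- findInterval(v0$d, l0$d)
    # 0 means "before the first lookup date": no match
    ind[ ind < 1 ] <- NA
    v0$l1 <- l0$l1[ind]
    v0
  })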
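To see what findInterval is doing inside that function, here is a minimal standalone sketch. The vectors lookup_dates, lookup_vals and query_dates are made up for illustration. Note that this picks the most recent lookup date at or before each query date, which is not necessarily the closest one in absolute terms:

```r
lookup_dates <- as.Date(c("2019-01-01", "2019-02-01"))  # must be sorted ascending
lookup_vals  <- c(10, 30)
query_dates  <- as.Date(c("2018-12-31", "2019-01-01", "2019-01-15", "2019-02-02"))

# For each query date, the position of the last lookup date that is <= it
idx <- findInterval(query_dates, lookup_dates)
idx
# [1] 0 1 1 2

# 0 means the query date precedes every lookup date, so treat it as "no match"
idx[idx < 1] <- NA
matched <- lookup_vals[idx]
matched
# [1] NA 10 10 30
```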
Ultimately we will rbind things back together, but those in miss_nms will not have the new column(s). This is a generic way to capture exactly one row of the new columns with an appropriate NA value:
emptycols <- joined[[1]][, setdiff(colnames(joined[[1]]), colnames(v)),drop=FALSE][1,,drop=FALSE][NA,,drop=FALSE]
emptycols
# l1
# NA NA
And add that column(s) to the not-yet-found frames:
unjoined <- lapply(v_spl[miss_nms], cbind, emptycols)
unjoined
# $c
# d kind v1 l1
# 3 2019-01-30 c 3 NA
And finally bring everything back into a single frame:
do.call(rbind, c(joined, unjoined))
# d kind v1 l1
# a.1 2019-01-01 a 1 10
# a.4 2019-02-02 a 4 30
# b 2019-01-05 b 2 20
# c 2019-01-30 c 3 NA
If you want an exact match you would go:
vl <- merge(v, l, by = c("d","kind"))
For your purposes, you can transform d into additional variables for year, month or day and merge on those.
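As a hedged illustration of the exact-match route with the data from the question: adding all.x = TRUE (a standard merge argument, not something the answer above used) keeps the rows of v that have no exact partner in l and fills l1 with NA for them.

```r
v <- data.frame(d = as.Date(c("2019-01-01", "2019-01-05", "2019-01-30", "2019-02-02")),
                kind = c("a", "b", "c", "a"), v1 = c(1, 2, 3, 4))
l <- data.frame(d = as.Date(c("2019-01-01", "2019-01-04", "2019-02-01")),
                kind = c("a", "b", "a"), l1 = c(10, 20, 30))

# Left join on both keys; only rows with identical d AND kind pick up an l1 value
vl <- merge(v, l, by = c("d", "kind"), all.x = TRUE)
vl
```

With these data only the 2019-01-01 / a row matches exactly, which is why an interval-based approach is needed for the result the question asks for.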
I have the following code
n <- list(1,2,3)
str(n)
which outputs
> str(n)
List of 3
$ : num 1
$ : num 2
$ : num 3
I would like 100 of these, but when I do
n <- list(1:100)
str(n)
I get
List of 1
$ : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
The difference is one list vs three lists. How do I solve this in R? Also, how do you solve it with the purrr package?
See ?as.list vs. ?list. The short explanation is that in your first example, you are storing three objects, each as its own length-one vector within a holding list. For example,
if each number were named:
> list(a = 1, b = 2, c = 3)
$a
[1] 1
$b
[1] 2
$c
[1] 3
vs:
> list(1:3)
[[1]]
[1] 1 2 3
BUT...
> as.list(1:3)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
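1:3 in R is a single integer vector of length three, so list(1:3) stores that whole vector as one list element, whereas list(1, 2, 3) is a list of three elements, where the first holds 1, the second 2, and so forth. as.list(1:3) splits the vector so that each value becomes its own list element.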
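Applied to the 100-element case from the question, a minimal base R sketch would be the following (n_num and n2 are illustrative names):

```r
# One list element per value
n <- as.list(1:100)
length(n)
# [1] 100
str(n[1:3])
# List of 3
#  $ : int 1
#  $ : int 2
#  $ : int 3

# 1:100 is integer; convert first if doubles ("num") are wanted, as in list(1, 2, 3)
n_num <- as.list(as.numeric(1:100))

# lapply over the sequence builds the same kind of list
n2 <- lapply(1:100, function(i) i)
identical(n, n2)
# [1] TRUE

# With purrr installed, purrr::map(1:100, identity) should build the same structure
```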
I used the RNCEP package to get reanalysis data for temperature. My data looks something like this:
(DD1 <- array(1:12, dim = c(2, 3, 2),
dimnames = list(c("A", "B"),
c("a", "b", "c"),
c("First", "Second"))))
# , , First
#
# a b c
# A 1 3 5
# B 2 4 6
#
# , , Second
#
# a b c
# A 7 9 11
# B 8 10 12
str(DD1)
# int [1:2, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, "dimnames")=List of 3
# ..$ : chr [1:2] "A" "B"
# ..$ : chr [1:3] "a" "b" "c"
# ..$ : chr [1:2] "First" "Second"
I think this is tabular data?
I need to write the data as csv file where I have something like this:
y a a b b c c
x A B A B A B
1 2 3 4 5 6
7 8 9 10 11 12
But when I used write.csv I got this:
write.csv(DD1)
# "","a.First","b.First","c.First","a.Second","b.Second","c.Second"
# "A",1,3,5,7,9,11
# "B",2,4,6,8,10,12
I thought I had to transpose the data first. So I used this:
DD2 <- as.data.frame.table(DD1)
I also used t() but that also did not work.
The transpose function in R is t(), so hopefully this will work on the data frame you are trying to transpose.
DD3 <- t(DD2)
You were on the right track with as.data.frame.table(DD1). That would give you a "long" dataset, that can then be converted to a "wide" form that you can use write.csv on.
Note, however, that R only allows one row of headers, so you will have to combine what you show as "x" and "y" into a single header row.
Here's the approach I would suggest:
library(data.table)
(DD2 <- dcast(data.table(as.data.frame.table(DD1)),
Var3 ~ Var1 + Var2, value.var = "Freq"))
# Var3 A_a A_b A_c B_a B_b B_c
# 1: First 1 3 5 2 4 6
# 2: Second 7 9 11 8 10 12
You can then easily use write.csv on the "DD2" object.
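If you would rather avoid an extra package, something along these lines should give a similar wide table in base R. This is only a sketch: wide is an illustrative name, and the column order here follows the layout in the question (A_a, B_a, A_b, ...) rather than the dcast output above.

```r
DD1 <- array(1:12, dim = c(2, 3, 2),
             dimnames = list(c("A", "B"),
                             c("a", "b", "c"),
                             c("First", "Second")))

# Flatten each slice of the third dimension into one row
wide <- t(apply(DD1, 3, c))

# Combine the first two sets of dimnames into a single header row,
# in the same column-major order that c() used to flatten each slice
colnames(wide) <- as.vector(outer(dimnames(DD1)[[1]], dimnames(DD1)[[2]],
                                  paste, sep = "_"))
wide

# Prints the CSV to the console; pass file = "..." to write it to disk
write.csv(wide)
```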
I'm trying to figure out how to add a data.frame or data.table to the first position in a list.
Ideally, I want a list structured as follows:
List of 4
$ :'data.frame': 1 obs. of 3 variables:
..$ a: num 2
..$ b: num 1
..$ c: num 3
$ d: num 4
$ e: num 5
$ f: num 6
Note the data.frame is an object within the structure of the list.
The problem is that I need to add the data frame to the list after the list has been created, and the data frame has to be the first element in the list. I'd like to do this using something simple like append, but when I try:
append(list(1,2,3),data.frame(a=2,b=1,c=3),after=0)
I get a list structured:
str(append(list(1,2,3),data.frame(a=2,b=1,c=3),after=0))
List of 6
$ a: num 2
$ b: num 1
$ c: num 3
$ : num 1
$ : num 2
$ : num 3
It appears that R is coercing the data.frame into a list when I try to append it. How do I prevent it from doing so? Or what alternative method might there be for constructing this list, inserting the data.frame into position 1 after the list's initial creation?
The issue you are having is that to put a data frame anywhere into a list as a single list element, it must be wrapped with list(). Let's have a look.
df <- data.frame(1, 2, 3)
x <- as.list(1:3)
If we just wrap with c(), which is what append() is doing under the hood, we get
c(df)
# $X1
# [1] 1
#
# $X2
# [1] 2
#
# $X3
# [1] 3
But if we wrap it in list() we get the desired list element containing the data frame.
list(df)
# [[1]]
# X1 X2 X3
# 1 1 2 3
Therefore, since x is already a list, we will need to use the following construct.
c(list(df), x) ## or append(x, list(df), 0)
# [[1]]
# X1 X2 X3
# 1 1 2 3
#
# [[2]]
# [1] 1
#
# [[3]]
# [1] 2
#
# [[4]]
# [1] 3
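Here is the same idea with the names from the question, as a quick check that the data frame really ends up as a single first element. The list x with elements d, e and f is reconstructed from the desired str() output in the question, so treat it as an assumption:

```r
df <- data.frame(a = 2, b = 1, c = 3)
x <- list(d = 4, e = 5, f = 6)

# Wrapping df in list() keeps it intact as one element at position 1
res <- append(x, list(df), after = 0)
str(res)

length(res)
# [1] 4
is.data.frame(res[[1]])
# [1] TRUE
```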
Given a data.table object, I would like to collapse the values of some grouped columns into a single object and insert the resulting objects into a new column.
library(data.table)
dt <- data.table(
  c('A|A', 'B|A', 'A|A', 'B|A', 'A|B'),
  c(0, 0, 1, 1, 0),
  c(22.7, 1.2, 0.3, 0.4, 0.0)
)
setnames(dt, names(dt), c('GROUPING', 'NAME', 'VALUE'))
dt
# GROUPING NAME VALUE
# 1: A|A 0 22.7
# 2: B|A 0 1.2
# 3: A|A 1 0.3
# 4: B|A 1 0.4
# 5: A|B 0 0.0
I think that to do this, it is first necessary to specify the column by which you want to group, so I should start with something like dt[, OBJECTS := <expr>, by = GROUPING].
Unfortunately, I don't know the expression <expr> to use so that the result is as follows:
# GROUPING OBJECTS
# 1: A|A <vector>
# 2: B|A <vector>
# 3: A|B <vector>
Each <vector> must contain the values of the other columns. For example, the first <vector> has to be a named vector equivalent to:
eg <- c(22.7, 0.3)
names(eg) <- c('0', '1')
# 0 1
# 22.7 0.3
Working inside of j: If you want to have the values of a column be a vector, you need to wrap the output in list(.).
j itself requires a call to list, so your final expression will resemble a nested list, eg:
dt[, list(allNames=list(NAME), allValues=list(VALUE)), by=GROUPING]
# GROUPING allNames allValues
# 1: A|A 0,1 22.7,0.3
# 2: B|A 0,1 1.2,0.4
# 3: A|B 0 0
As @Mnel points out, equivalently:
dt[, lapply(.SD, list), by=GROUPING]
If you want it in long form, then the structure of your <expr> will be:
list( c( list(), list(), ..., list() ) ) eg:
dt[, list(c(list(NAME), list(VALUE))), by=GROUPING]
# GROUPING V1
# 1: A|A 0,1
# 2: A|A 22.7,0.3
# 3: B|A 0,1
# 4: B|A 1.2,0.4
# 5: A|B 0
# 6: A|B 0
Or equivalently:
dt[, list(lapply(.SD, c)), by=GROUPING]
I think that this is what you are looking for:
dt1 <- dt[, list(list(setNames(VALUE, NAME))), by = GROUPING]
dt1
# GROUPING V1
# 1: A|A 22.7,0.3
# 2: B|A 1.2,0.4
# 3: A|B 0
str(dt1)
# Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ GROUPING: chr "A|A" "B|A" "A|B"
# $ V1 :List of 3
# ..$ : Named num 22.7 0.3
# .. ..- attr(*, "names")= chr "0" "1"
# ..$ : Named num 1.2 0.4
# .. ..- attr(*, "names")= chr "0" "1"
# ..$ : Named num 0
# .. ..- attr(*, "names")= chr "0"
# - attr(*, ".internal.selfref")=<externalptr>
dt1$V1
# [[1]]
# 0 1
# 22.7 0.3
#
# [[2]]
# 0 1
# 1.2 0.4
#
# [[3]]
# 0
# 0
As @Arun points out in the comments, the "data.table" alternative to setNames in this case is setattr(VALUE, 'names', NAME), making another solution:
dt1 <- dt[, list(list(setattr(VALUE, 'names', NAME))), by = GROUPING]
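For intuition only, here is roughly what the grouped setNames call computes, written in base R without data.table. This is a sketch, not a replacement for the answers above: df and objects are illustrative names, and the result is a plain named list rather than a list column.

```r
df <- data.frame(
  GROUPING = c("A|A", "B|A", "A|A", "B|A", "A|B"),
  NAME     = c(0, 0, 1, 1, 0),
  VALUE    = c(22.7, 1.2, 0.3, 0.4, 0.0),
  stringsAsFactors = FALSE
)

# One named vector per group: values come from VALUE, names from NAME
objects <- lapply(split(df, df$GROUPING),
                  function(g) setNames(g$VALUE, g$NAME))

objects[["A|A"]]
#    0    1 
# 22.7  0.3 
```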