I have a dataframe:
data.frame(x=c(1,2,3), y=c(4,5,6))
x y
1 1 4
2 2 5
3 3 6
For each row, I want to repeat x and y for each element within a given sequence, where the sequence is:
E=seq(0,0.2,by=0.1)
So when combined this would give:
x y E
1 1 4 0
2 1 4 0.1
3 1 4 0.2
4 2 5 0
5 2 5 0.1
6 2 5 0.2
7 3 6 0
8 3 6 0.1
9 3 6 0.2
I cannot seem to achieve this with expand.grid; it seems to give me all possible combinations. Am I after a Cartesian product?
library(data.table)
dt <- data.table(x=c(1,2,3), y=c(4,5,6))
dt[,.(E=seq(0,0.2,by=0.1)),by=.(x,y)]
#> x y E
#> 1: 1 4 0.0
#> 2: 1 4 0.1
#> 3: 1 4 0.2
#> 4: 2 5 0.0
#> 5: 2 5 0.1
#> 6: 2 5 0.2
#> 7: 3 6 0.0
#> 8: 3 6 0.1
#> 9: 3 6 0.2
Yes, you are looking for a Cartesian product, but base expand.grid cannot handle data frames.
You can use tidyr functions here (assuming your data frame is stored as df):
tidyr::expand_grid(df, E)
# A tibble: 9 x 3
# x y E
# <dbl> <dbl> <dbl>
#1 1 4 0
#2 1 4 0.1
#3 1 4 0.2
#4 2 5 0
#5 2 5 0.1
#6 2 5 0.2
#7 3 6 0
#8 3 6 0.1
#9 3 6 0.2
Or with crossing:
tidyr::crossing(df, E)
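For completeness, a plain base R cross join can also be sketched with merge, which returns every combination of rows when the two inputs share no column names (this assumes the question's data frame has been stored as df):
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
E <- seq(0, 0.2, by = 0.1)
# merge() with no common columns performs a Cartesian product of the rows
out <- merge(df, data.frame(E = E))
out[order(out$x, out$E), ]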
Related
Assume that I have a df similar to this:
# A tibble: 5 x 6
x1 x2 x3 y1 y2 y3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 3 2 4 3 2
2 4 3 2 4 3 2
3 4 3 2 4 3 2
4 4 3 2 4 3 2
5 4 3 2 4 3 2
Is there a way to create a tibble with the columns x, y, and z in a single pivot_longer command? Right now I'm using a separate pivot_longer call for each group of columns, but I'm sure there is an easier way to do it.
x y z
<dbl> <dbl> <dbl>
1 4 4 1
2 4 4 1
3 4 4 1
4 4 4 1
5 4 4 1
6 3 3 2
7 3 3 2
8 3 3 2
9 3 3 2
10 3 3 2
11 2 2 3
12 2 2 3
13 2 2 3
14 2 2 3
15 2 2 3
If you are using pivot_longer:
df %>% pivot_longer(everything(), names_to = c(".value", "z"), names_pattern = "(.)(.)")
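As a rough illustration of how the pattern splits the names, here is a one-row toy version of the question's data (the construction below is my own; it assumes every column name is one letter followed by one digit):
library(tidyr)
# Toy data: each column name is a letter (x/y) plus a digit (1/2/3)
toy <- tibble::tibble(x1 = 4, x2 = 3, x3 = 2, y1 = 4, y2 = 3, y3 = 2)
# "(.)(.)" captures the letter into the value columns (via .value)
# and the digit into the new key column z
pivot_longer(toy, everything(),
             names_to = c(".value", "z"),
             names_pattern = "(.)(.)")
Note that z comes back as character; add names_transform = list(z = as.integer) if you want it numeric.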
You could use reshape from base R:
reshape(inp, direction="long", varying=list(1:3, 4:6), sep="")
time x1 y1 id
1.1 1 4 4 1
2.1 1 4 4 2
3.1 1 4 4 3
4.1 1 4 4 4
5.1 1 4 4 5
1.2 2 3 3 1
2.2 2 3 3 2
3.2 2 3 3 3
4.2 2 3 3 4
5.2 2 3 3 5
1.3 3 2 2 1
2.3 3 2 2 2
3.3 3 2 2 3
4.3 3 2 2 4
5.3 3 2 2 5
Two "tricks". One is to use sep="" which will split those column names between alpha and numeric. Second is the use of a list argument to varying. If you wanted to drop the first column which identifies the row of origin, then use [-1]. You could also use the v.names vector to name columns, viz:
reshape(inp, direction="long", varying=list(1:3, 4:6), sep="", v.names=c("X","Y"))[-1]
X Y id
1.1 4 4 1
2.1 4 4 2
3.1 4 4 3
4.1 4 4 4
5.1 4 4 5
1.2 3 3 1
2.2 3 3 2
3.2 3 3 3
4.2 3 3 4
5.2 3 3 5
1.3 2 2 1
2.3 2 2 2
3.3 2 2 3
4.3 2 2 4
5.3 2 2 5
I have a data frame with 3 columns, each containing a small number of values:
> df
# A tibble: 364 x 3
A B C
<dbl> <dbl> <dbl>
0. 1. 0.100
0. 1. 0.200
0. 1. 0.300
0. 1. 0.500
0. 2. 0.100
0. 2. 0.200
0. 2. 0.300
0. 2. 0.600
0. 3. 0.100
0. 3. 0.200
# ... with 354 more rows
> apply(df, 2, table)
$`A`
0 1 2 3 4 5 6 7 8 9 10
34 37 31 32 27 39 29 28 37 39 31
$B
1 2 3 4 5 6 7 8 9 10 11
38 28 38 37 32 34 29 33 30 35 30
$C
0.1 0.2 0.3 0.4 0.5 0.6
62 65 65 56 60 56
I would like to create a fourth column, which will contain for each row the product of the frequencies of each value within each column. So, for example, the first value of the column "Freq" would be the product of the frequency of 0 within column A, the frequency of 1 within column B, and the frequency of 0.1 within column C.
How can I do this efficiently with dplyr/base R?
To emphasize: this is not the combined frequency of each whole row, but the product of the single-column frequencies.
An efficient approach using a combination of lapply, Map & Reduce from base R:
l <- lapply(df, table)
m <- Map(function(x,y) unname(y[match(x, names(y))]), df, l)
df$D <- Reduce(`*`, m)
which gives:
> head(df, 15)
A B C D
1 3 5 0.4 57344
2 5 6 0.5 79560
3 0 4 0.1 77996
4 2 6 0.1 65348
5 5 11 0.6 65520
6 3 8 0.5 63360
7 6 6 0.2 64090
8 1 9 0.4 62160
9 10 2 0.2 56420
10 5 2 0.2 70980
11 4 11 0.3 52650
12 7 6 0.5 57120
13 10 1 0.2 76570
14 7 10 0.5 58800
15 8 10 0.3 84175
What this does:
lapply(df, table) creates a list of frequency tables, one for each column.
With Map, a list is created using match, where each list item has the same length as the number of rows of df. Each list item is a vector of frequencies corresponding to the values in df.
With Reduce, the product of the vectors in the list m is calculated element-wise: the first values of the vectors in m are multiplied with each other, then the second values, and so on (a toy example follows below).
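To make the intermediate objects concrete, here is a tiny sketch on a made-up two-column data frame:
toy <- data.frame(A = c(1, 1, 2), B = c(5, 6, 5))
l <- lapply(toy, table)   # l$A has counts 2, 1; l$B has counts 2, 1
m <- Map(function(x, y) unname(y[match(x, names(y))]), toy, l)
# m$A = c(2, 2, 1): frequency of each row's A value
# m$B = c(2, 1, 2): frequency of each row's B value
Reduce(`*`, m)            # 4 2 2: the element-wise product across columns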
The same approach in tidyverse:
library(dplyr)
library(purrr)
df %>%
mutate(D = map(df, table) %>%
map2(df, ., function(x,y) unname(y[match(x, names(y))])) %>%
reduce(`*`))
Used data:
set.seed(2018)
df <- data.frame(A = sample(rep(0:10, c(34,37,31,32,27,39,29,28,37,39,31)), 364),
B = sample(rep(1:11, c(38,28,38,37,32,34,29,33,30,35,30)), 364),
C = sample(rep(seq(0.1,0.6,0.1), c(62,65,65,56,60,56)), 364))
I will use the following small example:
df
A B C
1 3 5 0.4
2 5 6 0.5
3 0 4 0.1
4 2 6 0.1
5 5 11 0.6
6 3 8 0.5
7 6 6 0.2
8 1 9 0.4
9 10 2 0.2
10 5 2 0.2
sapply(df, table)
$A
0 1 2 3 5 6 10
1 1 1 2 3 1 1
$B
2 4 5 6 8 9 11
2 1 1 3 1 1 1
$C
0.1 0.2 0.4 0.5 0.6
2 3 2 2 1
library(tidyverse)
df%>%
group_by(A)%>%
mutate(An=n())%>%
group_by(B)%>%
mutate(Bn=n())%>%
group_by(C)%>%
mutate(Cn=n(),prod=An*Bn*Cn)
A B C An Bn Cn prod
<int> <int> <dbl> <int> <int> <int> <int>
1 3 5 0.400 2 1 2 4
2 5 6 0.500 3 3 2 18
3 0 4 0.100 1 1 2 2
4 2 6 0.100 1 3 2 6
5 5 11 0.600 3 1 1 3
6 3 8 0.500 2 1 2 4
7 6 6 0.200 1 3 3 9
8 1 9 0.400 1 1 2 2
9 10 2 0.200 1 2 3 6
10 5 2 0.200 3 2 3 18
I'm trying to emulate a random walk / Markov chain using R. As you can see, I've set up a transition matrix, and then I'm trying to run a random walker on it. The problem is that when the random walker reaches an absorbing state (like 2 and 5), it does not stop but keeps running.
Am I using the wrong functions for something like this, or is the problem somewhere else? What I actually want to achieve is to print out all the vertices that the walker visited.
library(igraph)
# Produce a transition matrix.
tm <- read.table(row.names=1, header=FALSE, text="
1 0.2 0.3 0.1 0.2 0.1 0.1
6 0.3 0.2 0.4 0.1 0 0
3 0 0.2 0.4 0.1 0.2 0.1
4 0.2 0.1 0.2 0.3 0.1 0.1
5 0 0 0 0 1 0
2 0 0 0 0 0 1")
tm<-as.matrix(tm)
row.names(tm) <- c(1,6,3,4,5,2)
colnames(tm) <- c(1,6,3,4,5,2)
g1 <- graph.adjacency(tm, mode="undirected", weighted=TRUE)
random_walk( graph = g1, start = '4', steps = 100, stuck = "error" )
An output example:
[1] 4 5 3 4 4 4 3 3 2 2 4 4 2 2 3 4 4 5 5 3 5 5 3 6 3 3 3 4 4 4 6 4 4 4 4 2 2 4 3 6 3 2 2 2 4 1
[47] 4 3 4 1 1 4 2 3 6 6 6 6 4 3 6 6 6 3 5 5 3 5 5 5 3 1 1 3 2 4 4 2 1 1 1 2 3 1 2 1 1 2 2 1 5 3
[93] 5 4 4 2 4 3 4 4
And as you can see, it doesn't stop at either 2 or 5.
If you want these states to be absorbing, then you are implicitly interpreting your adjacency matrix as directed, not undirected. Use mode="directed". Then, when you plot the graph, the self-loops will better indicate where you can move (notice that there are no longer any lines leading "out" of node 2).
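A minimal sketch of that change, reusing the tm matrix from the question (only the mode argument differs from the original code; the plotting and inspection lines are just one way to check the result):
library(igraph)
# Interpreting the matrix as directed keeps the self-loops on 2 and 5
# as their only outgoing edges, so the walk cannot leave those vertices.
g1 <- graph.adjacency(tm, mode = "directed", weighted = TRUE)
plot(g1, edge.label = round(E(g1)$weight, 2))
walk <- random_walk(graph = g1, start = "4", steps = 100)
as_ids(walk)  # once 2 or 5 is reached, the walk keeps repeating that vertex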
Please see this example and look at the y axis. The data there has only two levels, 1 and 2, but six tick marks are drawn on that axis. How can I fix that? The x axis has the same problem.
The data
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
4 -1.2 1 4
5 -0.1 1 5
6 3.4 1 6
7 3.7 1 7
8 0.8 1 8
9 0.0 1 9
10 2.0 1 10
11 1.9 2 1
12 0.8 2 2
13 1.1 2 3
14 0.1 2 4
15 -0.1 2 5
16 4.4 2 6
17 5.5 2 7
18 1.6 2 8
19 4.6 2 9
20 3.4 2 10
The script
require('mise')
require('scatterplot3d')
mise() # clear the workspace
# example data
print(sleep)
# plot it
scatterplot3d(x=sleep$ID,
x.ticklabs=levels(sleep$ID),
y=sleep$group,
y.ticklabs=levels(sleep$group),
z=sleep$extra)
The result (plot not shown) has the extra tick marks on both axes.
How about this:
scatterplot3d(x=sleep$ID, y=sleep$extra, z=sleep$group, lab.z = c(1, 2))
I need to combine two data frames that have different lengths and keep all the "missing values". The problem is that the values are not really missing; there are simply fewer of one value than another.
Example:
df1 looks like this:
Shrub value period
1 0.5 1
2 0.6 1
3 0.7 1
4 0.8 1
5 0.9 1
10 0.9 1
1 0.4 2
5 0.4 2
6 0.5 2
7 0.3 2
2 0.4 3
3 0.1 3
8 0.5 3
9 0.2 3
df2 looks like this:
Shrub x y
1 5 8
2 6 7
3 3 2
4 1 2
5 4 6
6 5 9
7 9 4
8 2 1
9 4 3
10 3 6
and I want the combined data frame to look like:
Shrub x y value period
1 5 8 0.5 1
2 6 7 0.6 1
3 3 2 0.7 1
4 1 2 0.8 1
5 4 6 0.9 1
6 5 9 NA 1
7 9 4 NA 1
8 2 1 NA 1
9 4 3 NA 1
10 3 6 0.9 1
1 5 8 0.4 2
2 6 7 NA 2
3 3 2 NA 2
4 1 2 NA 2
5 4 6 0.4 2
6 5 9 0.5 2
7 9 4 0.3 2
8 2 1 NA 2
9 4 3 NA 2
10 3 6 NA 2
1 5 8 NA 3
2 6 7 0.4 3
3 3 2 0.1 3
4 1 2 NA 3
5 4 6 NA 3
6 5 9 NA 3
7 9 4 NA 3
8 2 1 0.5 3
9 4 3 0.2 3
10 3 6 NA 3
I have tried the merge command with all = TRUE, but this does not give me what I want. I haven't been able to find this anywhere, so any help is appreciated!
This is a situation where complete from package tidyr is useful (it is in tidyr_0.3.0, which is currently available on GitHub). You can use this function to expand df1 to include all period/Shrub combinations, filling the other variables with NA by default. Once you do that, you can simply join the two datasets together; I'll use inner_join from dplyr.
library(dplyr)
library(tidyr)
First, using complete on df1, showing the first 10 lines of output:
complete(df1, period, Shrub)
Source: local data frame [30 x 3]
period Shrub value
1 1 1 0.5
2 1 2 0.6
3 1 3 0.7
4 1 4 0.8
5 1 5 0.9
6 1 6 NA
7 1 7 NA
8 1 8 NA
9 1 9 NA
10 1 10 0.9
.. ... ... ...
Then all you need to do is join this expanded dataset with df2:
complete(df1, period, Shrub) %>%
inner_join(., df2)
Source: local data frame [30 x 5]
period Shrub value x y
1 1 1 0.5 5 8
2 1 2 0.6 6 7
3 1 3 0.7 3 2
4 1 4 0.8 1 2
5 1 5 0.9 4 6
6 1 6 NA 5 9
7 1 7 NA 9 4
8 1 8 NA 2 1
9 1 9 NA 4 3
10 1 10 0.9 3 6
.. ... ... ... . .
Start by repeating the rows of df2 to create a "full" dataset (i.e., 30 rows, one for each shrub-period observation), then merge:
tmp <- df2[rep(seq_len(nrow(df2)), times=3),]
tmp$period <- rep(1:3, each = nrow(df2))
out <- merge(tmp, df1, all = TRUE)
rm(tmp) # remove `tmp` data.frame
The result:
> head(out)
Shrub period x y value
1 1 1 5 8 0.5
2 1 2 5 8 0.4
3 1 3 5 8 NA
4 2 1 6 7 0.6
5 2 2 6 7 NA
6 2 3 6 7 0.4
> str(out)
'data.frame': 30 obs. of 5 variables:
$ Shrub : int 1 1 1 2 2 2 3 3 3 4 ...
$ period: int 1 2 3 1 2 3 1 2 3 1 ...
$ x : int 5 5 5 6 6 6 3 3 3 1 ...
$ y : int 8 8 8 7 7 7 2 2 2 2 ...
$ value : num 0.5 0.4 NA 0.6 NA 0.4 0.7 NA 0.1 0.8 ...
You can use dplyr. This works by taking each period as a separate frame, merging with all=TRUE to force all values, and then putting it all back together. The cbind(df2, ...) part attaches the group's period to df2 so the unmatched rows don't come back with extra NAs:
library(dplyr)
df1 %>% group_by(period) %>%
do(merge(., cbind(df2, period = .[["period"]][1]), by = c("Shrub", "period"), all = TRUE))
Shrub period value x y
1 1 1 0.5 5 8
2 2 1 0.6 6 7
3 3 1 0.7 3 2
4 4 1 0.8 1 2
5 5 1 0.9 4 6
6 6 1 NA 5 9
7 7 1 NA 9 4
8 8 1 NA 2 1
9 9 1 NA 4 3
10 10 1 0.9 3 6
11 1 2 0.4 5 8
12 2 2 NA 6 7
13 3 2 NA 3 2
14 4 2 NA 1 2
15 5 2 0.4 4 6
16 6 2 0.5 5 9
17 7 2 0.3 9 4
18 8 2 NA 2 1
19 9 2 NA 4 3
20 10 2 NA 3 6
21 1 3 NA 5 8
22 2 3 0.4 6 7
23 3 3 0.1 3 2
24 4 3 NA 1 2
25 5 3 NA 4 6
26 6 3 NA 5 9
27 7 3 NA 9 4
28 8 3 0.5 2 1
29 9 3 0.2 4 3
30 10 3 NA 3 6