using melt() does not sort by set ID - r

I have the data set:
Time a b
[1,] 0 5.06 9.60
[2,] 4 9.57 4.20
[3,] 8 1.78 3.90
[4,] 12 2.21 3.90
[5,] 16 4.10 5.84
[6,] 20 2.81 8.10
[7,] 24 2.70 1.18
[8,] 36 52.00 5.68
[9,] 48 NA 6.66
And I would like to reshape it to:
Time variable value
0 a 5.06
4 a 9.57
8 a 1.78
...
0 b 9.60
4 b 4.20
8 b 3.90
...
The code I am using is:
library(reshape2)
Time <- c(0,4,8,12,16,20,24,36,48)
a <- c(5.06,9.57,1.78,2.21,4.1,2.81,2.7,52,NA)
b <- c(9.6,4.2,3.9,3.9,5.84,8.1,1.18,5.68,6.66)
Mono <- cbind(Time,a,b)
mono <- melt(Mono,id="Time",na.rm=F)
Which produces:
Var1 Var2 value
1 1 Time 0.00
2 2 Time 4.00
3 3 Time 8.00
4 4 Time 12.00
5 5 Time 16.00
6 6 Time 20.00
7 7 Time 24.00
8 8 Time 36.00
9 9 Time 48.00
10 1 a 5.06
11 2 a 9.57
12 3 a 1.78
13 4 a 2.21
14 5 a 4.10
15 6 a 2.81
16 7 a 2.70
17 8 a 52.00
18 9 a NA
19 1 b 9.60
20 2 b 4.20
21 3 b 3.90
22 4 b 3.90
23 5 b 5.84
24 6 b 8.10
25 7 b 1.18
26 8 b 5.68
27 9 b 6.66
I'm sure it's a small error, but I can't figure it out. It's especially frustrating because I've used melt() without problems many times before. How can I fix the code to produce the table I'm looking for?
Thanks for your help!

Use tidyr::gather() to move from wide to long format.
> df <- data.frame(time = seq(0, 20, 5),
                   a = rnorm(5, 0, 1),
                   b = rnorm(5, 0, 1))
> library(tidyr)
> gather(df, variable, value, -time)
time variable value
1 0 a 1.5406529
2 5 a 1.5048055
3 10 a -1.1138529
4 15 a -0.1199039
5 20 a -1.7052608
6 0 b -1.1976938
7 5 b 0.7997127
8 10 b 1.1940454
9 15 b 0.5177981
10 20 b 0.6725264
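As to why melt() misbehaved in the first place: cbind(Time, a, b) creates a matrix, not a data frame, so melt() dispatches to its matrix method, which ignores the id argument and simply returns row/column indices (Var1, Var2). If you prefer to stay with reshape2, a minimal sketch of a fix is to build (or convert to) a data frame first:
library(reshape2)

# Build Mono as a data frame (or use as.data.frame(Mono) on the matrix)
# so melt() dispatches to the data.frame method and honours `id`.
Mono <- data.frame(Time, a, b)
mono <- melt(Mono, id = "Time", na.rm = FALSE)
(In current tidyr, pivot_longer() supersedes gather(), but both produce the long format shown above.)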

Method in R to find difference between rows with varying row spacing

I want to add an extra column in a dataframe which displays the difference between certain rows, where the distance between the rows also depends on values in the table.
I found out that:
mutate(Col_new = Col_1 - lead(Col_1, n = x))
can find the difference for a fixed n, but only an integer can be used as input. How would you find the difference between rows when the distance between the rows varies?
I am trying to get the output in Col_new, which is the difference between row i and row i+n, where n takes the value in the Count column. (The data is rounded, so there might be 0.01 discrepancies in Col_new.)
col_1 count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.25 2 -0.99
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
Data:
df <- data.frame(Col_1 = c(0.90, 1.58, 1.89, 1.84, 1.57, 1.30, 1.35,
1.56, 2.24, 3.14, 4.04, 4.72, 5.04, 4.99,
4.71, 4.44, 4.39, 4.70, 5.38, 6.28),
Count = sort(rep(1:4, 5)))
Some code that generates the intended output, but which can undoubtedly be made more efficient:
library(dplyr)
df %>%
  # for each possible count 1..4, compute the value that many rows ahead
  mutate(col_2 = sapply(1:4, function(s) lead(Col_1, n = s))) %>%
  rowwise() %>%
  # pick the lead that matches this row's Count
  mutate(Col_new = Col_1 - col_2[Count]) %>%
  select(-col_2)
Output:
# A tibble: 20 × 3
# Rowwise:
Col_1 Count Col_new
<dbl> <int> <dbl>
1 0.9 1 -0.68
2 1.58 1 -0.310
3 1.89 1 0.0500
4 1.84 1 0.27
5 1.57 1 0.27
6 1.3 2 -0.26
7 1.35 2 -0.89
8 1.56 2 -1.58
9 2.24 2 -1.8
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.0100
13 5.04 3 0.600
14 4.99 3 0.600
15 4.71 3 0.0100
16 4.44 4 -1.84
17 4.39 4 NA
18 4.7 4 NA
19 5.38 4 NA
20 6.28 4 NA
df %>% mutate(Col_new = case_when(
  Count == 1 ~ Col_1 - lead(Col_1, n = 1),
  Count == 2 ~ Col_1 - lead(Col_1, n = 2),
  Count == 3 ~ Col_1 - lead(Col_1, n = 3),
  Count == 4 ~ Col_1 - lead(Col_1, n = 4),
  Count == 5 ~ Col_1 - lead(Col_1, n = 5)
))
col_1 count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.25 2 -0.99
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
This gives the desired result, but it is not a good solution for the general case: with 10 or more distinct counts, another approach would be required.
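One count-agnostic sketch (my own suggestion, not taken from the answers above) is to compute the target row index directly; indexing past the end of the column yields NA, which matches the expected output for the last rows:
library(dplyr)

# Sketch: for row i, subtract the value found Count[i] rows further down.
df %>%
  mutate(Col_new = Col_1 - Col_1[row_number() + Count])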

Extract every 11 rows from data frame [duplicate]

This question already has answers here:
Split a vector into chunks
(22 answers)
Closed 1 year ago.
So I have a data frame and I want to get every 11 rows. Not just every 11th row, but a chunk of 11 rows each time, e.g.:
Subject Wt Dose Time conc
1 1 79.6 4.02 0.00 0.74
2 1 79.6 4.02 0.25 2.84
3 1 79.6 4.02 0.57 6.57
4 1 79.6 4.02 1.12 10.50
5 1 79.6 4.02 2.02 9.66
6 1 79.6 4.02 3.82 8.58
7 1 79.6 4.02 5.10 8.36
8 1 79.6 4.02 7.03 7.47
9 1 79.6 4.02 9.05 6.89
10 1 79.6 4.02 12.12 5.94
11 1 79.6 4.02 24.37 3.28
and then the next 11 rows, and then the following 11 rows again, and so on.
I tried this
for (i in 1:nrow(Theoph)) {
  everyEleven = Theoph[11, i]
  everyEl
}
But it just gives me the first 11 rows and not the second chunk of 11 rows, and so on.
Maybe you can try split like below
everyEleven <- split(Theoph,ceiling(seq(nrow(Theoph))/11))
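The result is a list with one data frame per 11-row chunk (named "1", "2", ...), so, for example:
everyEleven[[2]]          # the second chunk of 11 rows
sapply(everyEleven, nrow) # size of each chunk (the last one may be shorter)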
Try this, as adapted from "split into multiple subset of dataframes with dplyr:group_by?".
library(tibble)
library(dplyr)
library(tidyr)
Make an indicative data frame, as your data in the question is only 11 rows.
tib <- tibble(sub = rep(1:3, each = 11),
              var = runif(33))

tib1 <-
  tib %>%
  # create a grouping variable every 11 rows, unless there is a variable in your data which does the same
  mutate(grp = rep(1:3, each = 11)) %>%
  group_by(grp) %>%
  nest() %>%
  select(data) %>%
  unlist(recursive = FALSE)
Gives you:
$data1
# A tibble: 11 x 2
sub var
<int> <dbl>
1 1 0.258
2 1 0.337
3 1 0.463
4 1 0.856
5 1 0.466
6 1 0.701
7 1 0.548
8 1 0.999
9 1 0.454
10 1 0.292
11 1 0.173
$data2
# A tibble: 11 x 2
sub var
<int> <dbl>
1 2 0.148
2 2 0.487
3 2 0.246
4 2 0.279
5 2 0.130
6 2 0.730
7 2 0.312
8 2 0.935
9 2 0.968
10 2 0.745
11 2 0.485
$data3
# A tibble: 11 x 2
sub var
<int> <dbl>
1 3 0.141
2 3 0.200
3 3 0.00000392
4 3 0.993
5 3 0.644
6 3 0.334
7 3 0.567
8 3 0.817
9 3 0.0342
10 3 0.718
11 3 0.527
Since the sample data you provided has a Subject column, which I assume holds the subject IDs, and there are exactly 11 rows per value of Subject, you can use
split(Theoph, Theoph$Subject)
I will assume your data frame is 11 * N rows long; then
N <- nrow(Theoph) / 11   # assuming an exact multiple of 11
everyEleven <- vector(mode = "list", length = N)
for (i in 1:N) {
  start <- (i - 1) * 11 + 1
  end <- i * 11
  everyEleven[[i]] <- Theoph[start:end, ]
}
We can use gl to create the grouping index
split(Theoph, as.integer(gl(nrow(Theoph), 11, nrow(Theoph))))
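For a tidyverse flavour, a similar sketch (assuming a dplyr version that provides group_split()) would be:
library(dplyr)

# Sketch: build an 11-row group index, then split into a list of data frames.
Theoph %>%
  group_by(grp = (row_number() - 1) %/% 11) %>%
  group_split(.keep = FALSE)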

How to create a variable from a condition on other indexed variables in an efficient way?

Let's look at an example. In it, I have two variables, each observed 4 times:
> data(anscombe)
> anscombe
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4 9 9 9 8 8.81 8.77 7.11 8.84
5 11 11 11 8 8.33 9.26 7.81 8.47
6 14 14 14 8 9.96 8.10 8.84 7.04
7 6 6 6 8 7.24 6.13 6.08 5.25
8 4 4 4 19 4.26 3.10 5.39 12.50
9 12 12 12 8 10.84 9.13 8.15 5.56
10 7 7 7 8 4.82 7.26 6.42 7.91
11 5 5 5 8 5.68 4.74 5.73 6.89
If I want to count, for each row, how many of the four times the first variable is greater than 10 and the second is greater than 9, I have at least two options:
First, reshape the table to long format, sum by group (in this example it is as if I had only an id) and reshape back to wide. I can do that, but it does not seem very efficient, and if I have many columns, some indexed and some not, the reshaping code can get a bit cumbersome.
Second, I can do the following:
library(dplyr)
library(purrr)
anscombe %>%
  mutate(new_var = rowSums(map_dfc(
    1:4,
    ~ anscombe[[paste0("x", .)]] > 10 & anscombe[[paste0("y", .)]] > 9
  ), na.rm = T))
x1 x2 x3 x4 y1 y2 y3 y4 new_var
1 10 10 10 8 8.04 9.14 7.46 6.58 0
2 8 8 8 8 6.95 8.14 6.77 5.76 0
3 13 13 13 8 7.58 8.74 12.74 7.71 1
4 9 9 9 8 8.81 8.77 7.11 8.84 0
5 11 11 11 8 8.33 9.26 7.81 8.47 1
6 14 14 14 8 9.96 8.10 8.84 7.04 1
7 6 6 6 8 7.24 6.13 6.08 5.25 0
8 4 4 4 19 4.26 3.10 5.39 12.50 1
9 12 12 12 8 10.84 9.13 8.15 5.56 2
10 7 7 7 8 4.82 7.26 6.42 7.91 0
11 5 5 5 8 5.68 4.74 5.73 6.89 0
Great! It works. But since in my real data I have many more observations and conditions each time, I would like to make the line anscombe[[paste0("x",.)]] > 10 & anscombe[[paste0("y",.)]] > 9 shorter.
For example, with dplyr functions the data frame name can often be omitted. Maybe I would have to use the rlang function sym, as follows:
!!sym(paste0("x",.)) > 10 & !!sym(paste0("y",.)) > 9
I tried, but it didn't work. Maybe there is some function other than map_dfc in dplyr, purrr or some other package which allows doing this in an easier and more efficient way. Do you have any ideas?
Thank you very much.
1) split/map2: Here is an option with split based on the names of the dataset. We remove the trailing digits from the names, split the dataset into a list of data frames with split.default, use map2 to compare each chunk against its threshold (10 for the x columns, 9 for the y columns), reduce with & and take the rowSums.
library(dplyr)
library(purrr)
library(stringr)
anscombe %>%
  split.default(str_remove(names(.), "\\d+$")) %>%
  map2(., c(10, 9), `>`) %>%
  reduce(`&`) %>%
  rowSums %>%
  bind_cols(anscombe, new_var = .)
# x1 x2 x3 x4 y1 y2 y3 y4 new_var
#1 10 10 10 8 8.04 9.14 7.46 6.58 0
#2 8 8 8 8 6.95 8.14 6.77 5.76 0
#3 13 13 13 8 7.58 8.74 12.74 7.71 1
#4 9 9 9 8 8.81 8.77 7.11 8.84 0
#5 11 11 11 8 8.33 9.26 7.81 8.47 1
#6 14 14 14 8 9.96 8.10 8.84 7.04 1
#7 6 6 6 8 7.24 6.13 6.08 5.25 0
#8 4 4 4 19 4.26 3.10 5.39 12.50 1
#9 12 12 12 8 10.84 9.13 8.15 5.56 2
#10 7 7 7 8 4.82 7.26 6.42 7.91 0
#11 5 5 5 8 5.68 4.74 5.73 6.89 0
2) pivot_longer: Another option is pivot_longer from tidyr, which can take multiple sets of columns and reshape them to 'long' format.
library(dplyr)
library(tidyr) #1.0.0
library(tibble)
anscombe %>%
  rownames_to_column('rn') %>%
  pivot_longer(-rn, names_to = c(".value", "repl"),
               names_pattern = '(\\D+)(\\d+)') %>%
  group_by(rn) %>%
  summarise(new_var = sum(x > 10 & y > 9, na.rm = TRUE)) %>%
  arrange(as.integer(rn)) %>%
  select(-rn) %>%
  bind_cols(anscombe, .)
# x1 x2 x3 x4 y1 y2 y3 y4 new_var
#1 10 10 10 8 8.04 9.14 7.46 6.58 0
#2 8 8 8 8 6.95 8.14 6.77 5.76 0
#3 13 13 13 8 7.58 8.74 12.74 7.71 1
#4 9 9 9 8 8.81 8.77 7.11 8.84 0
#5 11 11 11 8 8.33 9.26 7.81 8.47 1
#6 14 14 14 8 9.96 8.10 8.84 7.04 1
#7 6 6 6 8 7.24 6.13 6.08 5.25 0
#8 4 4 4 19 4.26 3.10 5.39 12.50 1
#9 12 12 12 8 10.84 9.13 8.15 5.56 2
#10 7 7 7 8 4.82 7.26 6.42 7.91 0
#11 5 5 5 8 5.68 4.74 5.73 6.89 0
3) base R (similar to the logic used for the first method): this makes it automatic, as we can split the data into chunks based on the shared prefix of the column names.
anscombe$new_var <- rowSums(Reduce(`&`, Map(`>`,
split.default(anscombe, sub("\\d+$", "", names(anscombe))), c(10, 9))))
4) unique substring prefix: Another option that makes use of prefix matching is to loop through the unique substring prefixes (this would be slower than split) and then apply the comparison:
rowSums(Reduce(`&`, Map(`>`,
  lapply(unique(sub("\\d+$", "", names(anscombe))),
         function(nm) anscombe[grep(nm, names(anscombe))]),
  c(10, 9))))
#[1] 0 0 1 0 1 1 0 1 2 0 0
You could try pmap from purrr to iterate over the data frame row-wise:
library(dplyr)
library(purrr)
library(stringr)
new_var <- pmap_dbl(anscombe, function(...) {
  row <- unlist(list(...))
  x <- row[str_subset(names(row), "^x")]
  y <- row[str_subset(names(row), "^y")]
  sum((x > 10) & (y > 9))
})
anscombe[,"new_var"] <- new_var
> anscombe
x1 x2 x3 x4 y1 y2 y3 y4 new_var
1 10 10 10 8 8.04 9.14 7.46 6.58 0
2 8 8 8 8 6.95 8.14 6.77 5.76 0
3 13 13 13 8 7.58 8.74 12.74 7.71 1
4 9 9 9 8 8.81 8.77 7.11 8.84 0
5 11 11 11 8 8.33 9.26 7.81 8.47 1
6 14 14 14 8 9.96 8.10 8.84 7.04 1
7 6 6 6 8 7.24 6.13 6.08 5.25 0
8 4 4 4 19 4.26 3.10 5.39 12.50 1
9 12 12 12 8 10.84 9.13 8.15 5.56 2
10 7 7 7 8 4.82 7.26 6.42 7.91 0
11 5 5 5 8 5.68 4.74 5.73 6.89 0
Why not just
rowSums(anscombe[1:4] > 10 & anscombe[5:8] > 9)
# [1] 0 0 1 0 1 1 0 1 2 0 0
or
rowSums(anscombe[grep("^x", names(anscombe))] > 10 &
anscombe[grep("^y", names(anscombe))] > 9)
# [1] 0 0 1 0 1 1 0 1 2 0 0
Another option, similar to my attempts, was suggested to me on GitHub:
library(dplyr)
library(purrr)
anscombe %>%
  mutate(new_var = rowSums(map_dfc(
    1:4,
    ~ get(paste0("x", .)) > 10 & get(paste0("y", .)) > 9
  ), na.rm = T))
which gives me no problems regardless of the data format (whether it is Date or anything else), allows flexibility in writing the condition, shortens the script a bit, and is intuitive.

Error : missing value where TRUE/FALSE needed

WEEK PRICE QUANTITY SALE_PRICE TYPE
1 4992 5.99 2847.50 0.00 3
2 4995 3.33 36759.00 3.33 3
3 4996 5.99 2517.00 0.00 3
4 4997 5.49 2858.50 0.00 3
5 5001 3.33 32425.00 3.33 3
6 5002 5.49 4205.50 0.00 3
7 5004 5.99 4329.50 0.00 3
8 5006 2.74 55811.00 2.74 3
9 5007 5.49 4133.00 0.00 3
10 5008 5.99 4074.00 0.00 3
11 5009 3.99 12125.25 3.99 3
12 5017 2.74 77645.00 2.74 3
13 5018 5.49 5315.50 0.00 3
14 5020 2.74 78699.00 2.74 3
15 5021 5.49 5158.50 0.00 3
16 5023 5.99 5315.00 0.00 3
17 5024 5.49 6545.00 0.00 3
18 5025 3.33 63418.00 3.33 3
If there are consecutive entries with a sale price of 0, then I want to keep only the last such entry. For example, I want to remove week 4996 and keep week 4997, keep week 5004 and remove week 5002, and similarly delete weeks 5021 and 5023 while keeping week 5024.
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)) and create a grouping variable with rleid based on a logical vector of the presence of 0 in 'SALE_PRICE' (!SALE_PRICE). Using 'grp' as the grouping variable, we take the last row of the Subset of Data.table (.SD[.N]) if the 'SALE_PRICE' elements are all 0, or else keep the full .SD, i.e. all rows for that group.
library(data.table)
setDT(df1)[, grp := rleid(!SALE_PRICE)
  ][, if (all(!SALE_PRICE)) .SD[.N] else .SD, grp
  ][, grp := NULL][]
# WEEK PRICE QUANTITY SALE_PRICE TYPE
# 1: 4992 5.99 2847.50 0.00 3
# 2: 4995 3.33 36759.00 3.33 3
# 3: 4997 5.49 2858.50 0.00 3
# 4: 5001 3.33 32425.00 3.33 3
# 5: 5004 5.99 4329.50 0.00 3
# 6: 5006 2.74 55811.00 2.74 3
# 7: 5008 5.99 4074.00 0.00 3
# 8: 5009 3.99 12125.25 3.99 3
# 9: 5017 2.74 77645.00 2.74 3
#10: 5018 5.49 5315.50 0.00 3
#11: 5020 2.74 78699.00 2.74 3
#12: 5024 5.49 6545.00 0.00 3
#13: 5025 3.33 63418.00 3.33 3
Or, an option using dplyr: create a grouping variable with diff and cumsum, then filter the rows to keep only the last row of each run where 'SALE_PRICE' is 0, or (|) the rows where 'SALE_PRICE' is not 0.
library(dplyr)
df1 %>%
  group_by(grp = cumsum(c(TRUE, diff(!SALE_PRICE) != 0))) %>%
  filter(!duplicated(!SALE_PRICE, fromLast = TRUE) | SALE_PRICE != 0) %>%
  select(-grp)
# grp WEEK PRICE QUANTITY SALE_PRICE TYPE
# (int) (int) (dbl) (dbl) (dbl) (int)
#1 1 4992 5.99 2847.50 0.00 3
#2 2 4995 3.33 36759.00 3.33 3
#3 3 4997 5.49 2858.50 0.00 3
#4 4 5001 3.33 32425.00 3.33 3
#5 5 5004 5.99 4329.50 0.00 3
#6 6 5006 2.74 55811.00 2.74 3
#7 7 5008 5.99 4074.00 0.00 3
#8 8 5009 3.99 12125.25 3.99 3
#9 8 5017 2.74 77645.00 2.74 3
#10 9 5018 5.49 5315.50 0.00 3
#11 10 5020 2.74 78699.00 2.74 3
#12 11 5024 5.49 6545.00 0.00 3
#13 12 5025 3.33 63418.00 3.33 3
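A simpler dplyr sketch of the same rule (my own suggestion, not from the answers above): keep a row whenever its sale price is non-zero, or the next row's sale price is non-zero, i.e. it is the last zero in a run:
library(dplyr)

# Sketch: a row survives if SALE_PRICE != 0, or if it is the last 0 in a run
# (the following row's SALE_PRICE is non-zero). The default for lead() is set
# to a non-zero value so a trailing zero row would also be kept.
df1 %>%
  filter(SALE_PRICE != 0 | lead(SALE_PRICE, default = 1) != 0)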

Column of static mean for n rows

Given a data frame with two columns, I'm looking to calculate a third column that contains the mean of every n rows, while keeping the data frame intact.
Given the data frame
index<-1:20
V<-c(2,5,7,4,8,9,4,6,8,NA,3,4,5,6,0,4,5,7,5,3)
DF<-data.frame(index,V)
How could I create DF$mean, which would be the non-rolling mean of every 5 rows?
index V mean
1 2 5.2
2 5 5.2
3 7 5.2
4 4 5.2
5 8 5.2
6 9 6.75
7 4 6.75
8 6 6.75
9 8 6.75
10 NA 6.75
11 3 3.6
12 4 3.6
13 5 3.6
14 6 3.6
15 0 3.6
16 4 4.8
17 5 4.8
18 7 4.8
19 5 4.8
20 3 4.8
You can use colMeans and rep:
# matrix() fills column-wise, so each column holds one 5-row chunk;
# colMeans() then gives one mean per chunk and rep(each = 5) expands it back.
DF$mean <- rep(colMeans(matrix(DF$V, nrow = 5), na.rm = TRUE), each = 5)
Or use ave with a grouping vector:
DF$mean <- ave(DF$V,
               rep(1:(nrow(DF)/5), each = 5),
               FUN = function(x) mean(x, na.rm = TRUE))
which gives
> DF
index V mean
1 1 2 5.20
2 2 5 5.20
3 3 7 5.20
4 4 4 5.20
5 5 8 5.20
6 6 9 6.75
7 7 4 6.75
8 8 6 6.75
9 9 8 6.75
10 10 NA 6.75
11 11 3 3.60
12 12 4 3.60
13 13 5 3.60
14 14 6 3.60
15 15 0 3.60
16 16 4 4.80
17 17 5 4.80
18 18 7 4.80
19 19 5 4.80
20 20 3 4.80
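For completeness, a dplyr sketch of the same idea (assuming nrow(DF) is a multiple of 5):
library(dplyr)

# Sketch: put every 5 consecutive rows in a group and take the group mean.
DF %>%
  group_by(grp = (row_number() - 1) %/% 5) %>%
  mutate(mean = mean(V, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-grp)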
