Stacking multiple columns using pivot longer in R - r

I am trying to reshape my data from wide to long in R using pivot_longer. A few people on here appear to have had similar issues, but I have been unable to adapt their solutions to my data. Example wide data and the long format I am trying to achieve are shown below.
In summary, I have a time column and a reference column that apply to every row, plus repeated sets of group, subject, ID, x-coordinate and y-coordinate columns in the form:
group1, subject1, ID1, xcoord1, ycoord1, group2, subject2, ID2, xcoord2, ycoord2 and so on. What I want is a long table with the columns:
time, reference, group, subject, ID, xcoord, ycoord.
Each of the 5 columns should stack its respective numbered columns, with the time and reference values repeating for each stack.
df %>%
pivot_longer(cols = -c(time, reference),
names_to = c("group", "subject", "ID", "xcoord", "ycoord"))
My understanding is that I need to use the names_pattern argument, although I can't seem to get it to work, and I can't find anything that clearly describes how I should be using it. I should say my data is much wider than the example data, so I can't really rely on numbering of columns.
I'd appreciate any help.
wide data example
time reference group1 subject1 ID1 xcoord1 ycoord1 group2 subject2 ID2 xcoord2 ycoord2 group3 subject3 ID3 xcoord3 ycoord3
1 00:01 4097365 1 4 1 7.44 38.16 0 21 2 33.90 47.26 1 15 3 21.53 2.67
2 00:02 4097366 1 4 1 9.84 37.03 0 21 2 32.98 48.47 1 15 3 21.82 2.95
3 00:03 4097367 1 4 1 12.01 35.83 0 21 2 30.17 50.33 1 15 3 22.06 4.45
4 00:04 4097368 1 4 1 12.15 34.17 0 21 2 29.85 50.52 1 15 3 23.50 4.75
5 00:05 4097369 1 4 1 15.27 32.94 0 21 2 28.39 51.30 1 15 3 24.25 4.76
6 00:06 4097370 1 4 1 18.96 31.98 0 21 2 28.39 52.36 1 15 3 25.31 6.57
7 00:07 4097371 1 4 1 22.50 31.13 0 21 2 26.59 53.14 1 15 3 26.05 7.04
8 00:08 4097372 1 4 1 27.47 30.15 0 21 2 25.89 53.94 1 15 3 27.29 7.91
9 00:09 4097373 1 4 1 32.17 29.92 0 21 2 24.64 54.42 1 15 3 27.47 8.44
10 00:10 4097374 1 4 1 33.77 27.49 0 21 2 24.61 55.23 1 15 3 28.59 8.71
Long data example
time reference group subject ID xcoord ycoord
1 00:01 4097365 1 4 1 7.44 38.16
2 00:01 4097365 0 21 2 33.90 47.26
3 00:01 4097365 1 15 3 21.53 2.67
4 00:02 4097366 1 4 1 9.84 37.03
5 00:02 4097366 0 21 2 32.98 48.47
6 00:02 4097366 1 15 3 21.82 2.95
7 00:03 4097367 1 4 1 12.01 35.83
8 00:03 4097367 0 21 2 30.17 50.33
9 00:03 4097367 1 15 3 22.06 4.45
10 00:04 4097368 1 4 1 12.15 34.17
edit: playing about a bit with the data I have managed to achieve this odd solution which is a mixture of long and wide data.
dput(head(df1))
structure(list(time = c(0, 0, 0, 0, 0, 0), state = structure(c(2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Alive", "Alive;:", "Dead", "Dead;:"
), class = "factor"), reference = c("1880439", "1880439", "1880439",
"1880439", "1880439", "1880439"), num = c("1", NA, "2", "3",
"4", "5"), group = c("1", NA, "1", "4", "0", "0"), X = c(NA,
NA, NA, NA, NA, NA), ID = c(1L, NA, 2L, 4L, 5L, 6L), subect = c(21L,
NA, 7L, -1L, 2L, 6L), x = c(3514L, NA, 2807L, 5550L, 3956L, 3686L
), y = c(-1644L, NA, -510L, 4400L, 1297L, -55L), speed = c("5.23",
NA, "3.24", "0.00", "2.31", "3.57"), group1 = c("0", NA, "4",
"1", "1", "0"), ID1 = c(13L, NA, 14L, 15L, 16L, 17L), subect1 = c(9L,
NA, -1L, 13L, 14L, 11L), x1 = c(882L, NA, 5550L, 3004L, 761L,
3317L), y1 = c(-1468L, NA, 4400L, 1633L, 559L, 1443L), speed1 = c("1.70",
NA, "0.00", "3.06", "2.92", "3.30"), group2 = c("4", NA, "0",
"1", "0", "0"), ID2 = c(24L, NA, 25L, 26L, 27L, 28L), subect2 = c(-1L,
NA, 1L, 18L, 5L, 10L), x2 = c(5550L, NA, 5031L, 3936L, 3972L,
3623L), y2 = c(4400L, NA, -74L, 190L, 686L, 356L), speed2 = c("0.00",
NA, "0.54", "1.06", "0.95", "2.49"), speed.group2 = c(NA, NA,
NA, NA, NA, NA)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
the dataframe the code results in looks like this
> head(df1)
# A tibble: 6 x 24
time state reference num group X ID subect x y speed group1 ID1 subect1 x1 y1 speed1 group2 ID2 subect2 x2 y2 speed2
<dbl> <fct> <chr> <chr> <chr> <lgl> <int> <int> <int> <int> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <int> <int> <int> <int> <chr>
1 0 Aliv~ 1880439 1 1 NA 1 21 3514 -1644 5.23 0 13 9 882 -1468 1.70 4 24 -1 5550 4400 0.00
2 0 Aliv~ 1880439 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 0 Aliv~ 1880439 2 1 NA 2 7 2807 -510 3.24 4 14 -1 5550 4400 0.00 0 25 1 5031 -74 0.54
4 0 Aliv~ 1880439 3 4 NA 4 -1 5550 4400 0.00 1 15 13 3004 1633 3.06 1 26 18 3936 190 1.06
5 0 Aliv~ 1880439 4 0 NA 5 2 3956 1297 2.31 1 16 14 761 559 2.92 0 27 5 3972 686 0.95
6 0 Aliv~ 1880439 5 0 NA 6 6 3686 -55 3.57 0 17 11 3317 1443 3.30 0 28 10 3623 356 2.49
# ... with 1 more variable: speed.group2 <lgl>

I would first rename the columns to insert an underscore right before the trailing number, then use that underscore as the separator in pivot_longer.
library(tidyverse)
df %>%
# anchor on the trailing digits so multi-digit suffixes (e.g. group10) split as group_10
rename_at(-c(1:2), ~ str_replace(., "(\\d+)$", "_\\1")) %>%
pivot_longer(cols = -c(1:2), names_to = c(".value", "num"), names_sep = "_")
Edit (2/7/20):
With your updated dataset, it appears that some of the variable column names don't have a number at the end. We can add 0 for those.
Also, I assume you want: group, ID, subect, x, y, speed that are repeated (with the first group in column 5 separated from its related variables in columns 7-11).
df1 %>%
rename_at(c(5,7:11), ~ paste0(., "0")) %>%
rename_at(-c(1:4, 6, 24), ~ str_replace(., "(\\d+)$", "_\\1")) %>%
pivot_longer(cols = -c(1:4, 6, 24), names_to = c(".value", "val"), names_sep = "_")
Output (Revised):
# A tibble: 18 x 13
time state reference num X speed.group2 val group ID subect x y speed
<dbl> <fct> <chr> <chr> <lgl> <lgl> <chr> <chr> <int> <int> <int> <int> <chr>
1 0 Alive;: 1880439 1 NA NA 0 1 1 21 3514 -1644 5.23
2 0 Alive;: 1880439 1 NA NA 1 0 13 9 882 -1468 1.70
3 0 Alive;: 1880439 1 NA NA 2 4 24 -1 5550 4400 0.00
4 0 Alive;: 1880439 NA NA NA 0 NA NA NA NA NA NA
5 0 Alive;: 1880439 NA NA NA 1 NA NA NA NA NA NA
6 0 Alive;: 1880439 NA NA NA 2 NA NA NA NA NA NA
7 0 Alive;: 1880439 2 NA NA 0 1 2 7 2807 -510 3.24
8 0 Alive;: 1880439 2 NA NA 1 4 14 -1 5550 4400 0.00
9 0 Alive;: 1880439 2 NA NA 2 0 25 1 5031 -74 0.54
10 0 Alive;: 1880439 3 NA NA 0 4 4 -1 5550 4400 0.00
11 0 Alive;: 1880439 3 NA NA 1 1 15 13 3004 1633 3.06
12 0 Alive;: 1880439 3 NA NA 2 1 26 18 3936 190 1.06
13 0 Alive;: 1880439 4 NA NA 0 0 5 2 3956 1297 2.31
14 0 Alive;: 1880439 4 NA NA 1 1 16 14 761 559 2.92
15 0 Alive;: 1880439 4 NA NA 2 0 27 5 3972 686 0.95
16 0 Alive;: 1880439 5 NA NA 0 0 6 6 3686 -55 3.57
17 0 Alive;: 1880439 5 NA NA 1 0 17 11 3317 1443 3.30
18 0 Alive;: 1880439 5 NA NA 2 0 28 10 3623 356 2.49
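Since the question asks about names_pattern, here is a rough sketch of how the same reshaping could probably be done without renaming first, by letting names_pattern split each column name into its stem and its trailing number. The toy data frame `wide` and its values are made up for illustration; only the column naming scheme comes from the question.

```r
library(tidyr)

# Toy wide data (hypothetical values) using the question's naming scheme
wide <- data.frame(
  time      = c("00:01", "00:02"),
  reference = c(4097365, 4097366),
  group1 = c(1, 1), xcoord1 = c(7.44, 9.84),
  group2 = c(0, 0), xcoord2 = c(33.90, 32.98)
)

# "^(\\D+)(\\d+)$" captures the non-digit stem and the trailing number.
# ".value" turns each stem into its own output column; "num" keeps the number.
long <- pivot_longer(
  wide,
  cols = -c(time, reference),
  names_to = c(".value", "num"),
  names_pattern = "^(\\D+)(\\d+)$"
)

print(long)
```

This only applies when every pivoted column ends in a number; columns without a suffix (as in the updated dataset) would still need a suffix added or to be excluded from cols.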

Related

Binned physiological time series data in R: calculate duration spent in each bin

I have a dataset containing changes in mean arterial blood pressure (MAP) over time from multiple participants. Here is an example dataframe:
df=structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), Time = structure(1:14, .Label = c("11:02:00",
"11:03:00", "11:04:00", "11:05:00", "11:06:00", "11:07:00", "11:08:00",
"13:30:00", "13:31:00", "13:32:00", "13:33:00", "13:34:00", "13:35:00",
"13:36:00"), class = "factor"), MAP = c(90.27999878, 84.25, 74.81999969,
80.87000275, 99.38999939, 81.51000214, 71.51000214, 90.08999634,
88.75, 84.72000122, 83.86000061, 94.18000031, 98.54000092, 51
)), class = "data.frame", row.names = c(NA, -14L))
I have binned the data into groups: e.g. MAP 40-60, 60-80, 80-100 and added a unique flag (1, 2 or 3) in an additional column map_bin. This is my code so far:
library(dplyr)
#Mean Arterial Pressure
#Bin 1=40-60; Bin 2=60-80; Bin 3=80-100
map_bin=c("1","2","3")
output <- as_tibble(df) %>%
mutate(map_bin = case_when(
MAP >= 40 & MAP < 60 ~ map_bin[1],
MAP >= 60 & MAP < 80 ~ map_bin[2],
MAP >= 80 & MAP < 100 ~ map_bin[3]
))
For each ID I wish to calculate, in an additional column, the total time MAP is in each bin. I expect the following output:
ID Time MAP map_bin map_bin_dur
1 11:02:00 90.27999878 3 5
1 11:03:00 84.25 3 5
1 11:04:00 74.81999969 2 2
1 11:05:00 80.87000275 3 5
1 11:06:00 99.38999939 3 5
1 11:07:00 81.51000214 3 5
1 11:08:00 71.51000214 2 2
2 13:30:00 90.08999634 3 6
2 13:31:00 88.75 3 6
2 13:32:00 84.72000122 3 6
2 13:33:00 83.86000061 3 6
2 13:34:00 94.18000031 3 6
2 13:35:00 98.54000092 3 6
2 13:36:00 51 1 1
Where map_bin_dur is the time in minutes that MAP for each individual resided in each bin. e.g. ID 1 had a MAP in Bin 3 for 5 minutes in total.
If the Time column always advances in 1-minute steps, each row represents one minute, so you can count rows per ID and bin with add_count:
library(dplyr)
output <- output %>% add_count(ID, map_bin, name = 'map_bin_dur')
output
# ID Time MAP map_bin map_bin_dur
# <int> <fct> <dbl> <chr> <int>
# 1 1 11:02:00 90.3 3 5
# 2 1 11:03:00 84.2 3 5
# 3 1 11:04:00 74.8 2 2
# 4 1 11:05:00 80.9 3 5
# 5 1 11:06:00 99.4 3 5
# 6 1 11:07:00 81.5 3 5
# 7 1 11:08:00 71.5 2 2
# 8 2 13:30:00 90.1 3 6
# 9 2 13:31:00 88.8 3 6
#10 2 13:32:00 84.7 3 6
#11 2 13:33:00 83.9 3 6
#12 2 13:34:00 94.2 3 6
#13 2 13:35:00 98.5 3 6
#14 2 13:36:00 51 1 1
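For comparison, a minimal base R sketch of the same idea: bin with cut(), count rows per ID and bin with ave(), and multiply by the sampling interval. The data is a small subset of the question's values (rounded), and `interval_min` is a made-up name for the assumed one-minute spacing.

```r
# Subset of the question's data (MAP rounded), assumed one reading per minute
df <- data.frame(
  ID  = c(1L, 1L, 1L, 2L, 2L),
  MAP = c(90.3, 84.2, 74.8, 90.1, 51)
)

# Left-closed bins [40,60), [60,80), [80,100), matching the case_when() rules
df$map_bin <- cut(df$MAP, breaks = c(40, 60, 80, 100),
                  labels = c("1", "2", "3"), right = FALSE)

# Number of rows in each ID/bin combination, repeated on every row of that group
n_rows <- ave(seq_along(df$MAP), df$ID, df$map_bin, FUN = length)

# Duration = row count times the sampling interval (assumption: 1 minute)
interval_min <- 1
df$map_bin_dur <- n_rows * interval_min

print(df)
```

If readings were not evenly spaced, counting rows would no longer equal elapsed time, and the durations would have to come from differences between the actual timestamps instead.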

How to fill a dataframe with values from another matching columns names and rows names in R?

I have the following dataframes DF1 and DF2. I am trying to fill DF2 with values from the Close column in DF1. But it turns out that the output is zero when running the loop. Don't know what is wrong, but it seems it doesn't read the Asset column values.
DF1:
Data Asset Close
1 1986-11-27 ABC 6 5.95
2 1986-12-01 ABC 6 5.90
3 1986-12-03 ABC 6 5.90
4 1986-12-04 ABC 6 5.90
5 1986-12-05 ABC 6 5.00
6 1986-12-08 ABC 6 5.00
7 1986-12-09 ABC 6 4.78
8 1986-10-31 ABC 8 3.90
9 1986-11-03 ABC 8 3.70
10 1986-11-04 ABC 8 3.70
. . . .
. . . .
DF2:
ABC 6 ABC 8
1986-11-27 NA NA
1986-12-01 NA NA
1986-12-03 NA NA
1986-12-04 NA NA
1986-12-05 NA NA
1986-12-08 NA NA
1986-12-09 NA NA
1986-12-10 NA NA
1986-12-11 NA NA
1986-12-12 NA NA
. . .
. . .
for (i in 1:length(DF2))
{
for (m in 1:nrow(DF2))
{
for (n in 1:nrow(DF1))
{
if ((names(DF2[i]) == DF1[n,2]) & (row.names(DF2[m,0])==as.character(DF1[n,1])))
{
DF2[m,i] <- DF1[n,3]
} else{DF2[m,i] <- 0}
}
}
}
Output:
ABC 6 ABC 8
1986-11-27 0 0
1986-12-01 0 0
1986-12-03 0 0
1986-12-04 0 0
1986-12-05 0 0
1986-12-08 0 0
1986-12-09 0 0
1986-12-10 0 0
1986-12-11 0 0
1986-12-12 0 0
. . .
. . .
If I understand correctly, the OP wants to reshape the data from long to wide format.
This is a very common operation and questions have been asked and answered many times before, e.g., here.
However, just for demonstration, here is a collection of solutions which provide the expected result:
Base R
reshape(DF1, idvar = "Data", timevar = "Asset", v.names = "Close", direction = "wide")
Data Close.ABC 6 Close.ABC 8
1 1986-11-27 5.95 NA
2 1986-12-01 5.90 NA
3 1986-12-03 5.90 NA
4 1986-12-04 5.90 NA
5 1986-12-05 5.00 NA
6 1986-12-08 5.00 NA
7 1986-12-09 4.78 NA
8 1986-10-31 NA 3.9
9 1986-11-03 NA 3.7
10 1986-11-04 NA 3.7
reshape2 package
reshape2::dcast(DF1, Data ~ Asset)
Using Close as value column: use value.var to override.
Data ABC 6 ABC 8
1 1986-10-31 NA 3.9
2 1986-11-03 NA 3.7
3 1986-11-04 NA 3.7
4 1986-11-27 5.95 NA
5 1986-12-01 5.90 NA
6 1986-12-03 5.90 NA
7 1986-12-04 5.90 NA
8 1986-12-05 5.00 NA
9 1986-12-08 5.00 NA
10 1986-12-09 4.78 NA
data.table package
library(data.table)
dcast(setDT(DF1), Data ~ Asset)
Using 'Close' as value column. Use 'value.var' to override
Data ABC 6 ABC 8
1: 1986-10-31 NA 3.9
2: 1986-11-03 NA 3.7
3: 1986-11-04 NA 3.7
4: 1986-11-27 5.95 NA
5: 1986-12-01 5.90 NA
6: 1986-12-03 5.90 NA
7: 1986-12-04 5.90 NA
8: 1986-12-05 5.00 NA
9: 1986-12-08 5.00 NA
10: 1986-12-09 4.78 NA
tidyr package
library(tidyr)
DF1 %>%
pivot_wider(names_from = Asset, values_from = Close)
# A tibble: 10 x 3
Data `ABC 6` `ABC 8`
<chr> <dbl> <dbl>
1 1986-11-27 5.95 NA
2 1986-12-01 5.9 NA
3 1986-12-03 5.9 NA
4 1986-12-04 5.9 NA
5 1986-12-05 5 NA
6 1986-12-08 5 NA
7 1986-12-09 4.78 NA
8 1986-10-31 NA 3.9
9 1986-11-03 NA 3.7
10 1986-11-04 NA 3.7
Data
DF1 <- as.data.frame(readr::read_table("rn Data Asset Close
1 1986-11-27 ABC 6 5.95
2 1986-12-01 ABC 6 5.90
3 1986-12-03 ABC 6 5.90
4 1986-12-04 ABC 6 5.90
5 1986-12-05 ABC 6 5.00
6 1986-12-08 ABC 6 5.00
7 1986-12-09 ABC 6 4.78
8 1986-10-31 ABC 8 3.90
9 1986-11-03 ABC 8 3.70
10 1986-11-04 ABC 8 3.70", col_types = "_ccd"))
I am not sure if this is what you want to achieve. A solution with base R is given as below:
DF2 <- cbind(DF2[1], sapply(names(DF2)[-1], function(v) {
sub <- subset(DF1, Asset == v) # rows for this asset only
sub$Close[match(DF2$Date, sub$Date)] # index within the subset, not the full DF1
}))
such that
> DF2
Date ABC 6 ABC 8
1 1986-11-27 5.95 NA
2 1986-12-01 5.90 NA
3 1986-12-03 5.90 NA
4 1986-12-04 5.90 NA
5 1986-12-05 5.00 NA
6 1986-12-08 5.00 NA
7 1986-12-09 4.78 NA
8 1986-12-10 NA NA
9 1986-12-11 NA NA
10 1986-12-12 NA NA
DATA
DF1 <- structure(list(Date = structure(c(4L, 5L, 6L, 7L, 8L, 9L, 10L,
1L, 2L, 3L), .Label = c("1986-10-31", "1986-11-03", "1986-11-04",
"1986-11-27", "1986-12-01", "1986-12-03", "1986-12-04", "1986-12-05",
"1986-12-08", "1986-12-09"), class = "factor"), Asset = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("ABC 6", "ABC 8"
), class = "factor"), Close = c(5.95, 5.9, 5.9, 5.9, 5, 5, 4.78,
3.9, 3.7, 3.7)), row.names = c(NA, -10L), class = "data.frame")
DF2 <- structure(list(Date = structure(1:10, .Label = c("1986-11-27",
"1986-12-01", "1986-12-03", "1986-12-04", "1986-12-05", "1986-12-08",
"1986-12-09", "1986-12-10", "1986-12-11", "1986-12-12"), class = "factor"),
"ABC 6" = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), "ABC 8" = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-10L))
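As for why the loop in the question ends up with zeros: the else branch writes 0 into the cell for every row of DF1 that does not match, so an earlier match is overwritten and only the last row compared decides the value. A small sketch of a loop-free fill, assuming the dates are the row names and the assets are the column names of the target; `dates`, `assets` and `keep` are names made up here, and the data is a toy subset.

```r
# Toy subset of the question's DF1 (date column called Date here)
DF1 <- data.frame(
  Date  = c("1986-11-27", "1986-12-01", "1986-10-31"),
  Asset = c("ABC 6", "ABC 6", "ABC 8"),
  Close = c(5.95, 5.90, 3.90),
  stringsAsFactors = FALSE
)

# Empty target grid: dates as row names, assets as column names
dates  <- c("1986-11-27", "1986-12-01", "1986-12-03")
assets <- c("ABC 6", "ABC 8")
DF2 <- matrix(NA_real_, nrow = length(dates), ncol = length(assets),
              dimnames = list(dates, assets))

# Keep only DF1 rows whose date and asset both exist in the grid
keep <- DF1$Date %in% dates & DF1$Asset %in% assets

# A two-column character matrix addresses cells by (row name, column name)
DF2[cbind(DF1$Date[keep], DF1$Asset[keep])] <- DF1$Close[keep]

print(DF2)
```

Cells with no matching row in DF1 stay NA rather than 0, which keeps "no data" distinguishable from a real closing price of zero.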

Add rows to a data-frame based on values in one of the columns

Currently the data-frame looks something like this:
Scenario Month A B C
1 1 -0.593186301 1.045550808 -0.593816304
1 2 0.178626141 2.043084432 0.111370583
1 3 1.205779717 -0.324083723 -1.397716949
2 1 0.933615199 0.052647056 -0.656486153
2 2 1.647291688 -1.065793671 0.799040546
2 3 1.613663101 -1.955567231 -1.817457972
3 1 -0.621991775 1.634069402 -1.404981646
3 2 -1.899326887 -0.836322394 -1.826351541
3 3 0.164235141 -1.160701812 1.238246459
I'd like to add a row above each row where Month = 1, as below. I know tibble has an add_row function, but I'd like to add rows based on a condition. Any help is hugely appreciated.
Scenario Month A B C
0
1 1 -0.593186301 1.045550808 -0.593816304
1 2 0.178626141 2.043084432 0.111370583
1 3 1.205779717 -0.324083723 -1.397716949
0
2 1 0.933615199 0.052647056 -0.656486153
2 2 1.647291688 -1.065793671 0.799040546
2 3 1.613663101 -1.955567231 -1.817457972
0
3 1 -0.621991775 1.634069402 -1.404981646
3 2 -1.899326887 -0.836322394 -1.826351541
3 3 0.164235141 -1.160701812 1.238246459
A solution using tidyverse.
library(tidyverse)
dat2 <- dat %>%
split(f = .$Scenario) %>%
map_dfr(~bind_rows(tibble(Scenario = 0), .x))
dat2
# # A tibble: 12 x 5
# Scenario Month A B C
# <dbl> <int> <dbl> <dbl> <dbl>
# 1 0 NA NA NA NA
# 2 1 1 -0.593 1.05 -0.594
# 3 1 2 0.179 2.04 0.111
# 4 1 3 1.21 -0.324 -1.40
# 5 0 NA NA NA NA
# 6 2 1 0.934 0.0526 -0.656
# 7 2 2 1.65 -1.07 0.799
# 8 2 3 1.61 -1.96 -1.82
# 9 0 NA NA NA NA
# 10 3 1 -0.622 1.63 -1.40
# 11 3 2 -1.90 -0.836 -1.83
# 12 3 3 0.164 -1.16 1.24
DATA
dat <- read.table(text = "Scenario Month A B C
1 1 -0.593186301 1.045550808 -0.593816304
1 2 0.178626141 2.043084432 0.111370583
1 3 1.205779717 -0.324083723 -1.397716949
2 1 0.933615199 0.052647056 -0.656486153
2 2 1.647291688 -1.065793671 0.799040546
2 3 1.613663101 -1.955567231 -1.817457972
3 1 -0.621991775 1.634069402 -1.404981646
3 2 -1.899326887 -0.836322394 -1.826351541
3 3 0.164235141 -1.160701812 1.238246459 ",
header = TRUE)
Somehow add_row doesn't take multiple values to its .before parameter.
One way is to split the dataframe wherever Month = 1 and then for each dataframe add a row using add_row above Month = 1.
library(tidyverse)
map_df(split(df, cumsum(df$Month == 1)),
~ add_row(., Scenario = 0, .before = which(.$Month == 1)))
# Scenario Month A B C
#1 0 NA NA NA NA
#2 1 1 -0.5931863 1.04555081 -0.5938163
#3 1 2 0.1786261 2.04308443 0.1113706
#4 1 3 1.2057797 -0.32408372 -1.3977169
#5 0 NA NA NA NA
#6 2 1 0.9336152 0.05264706 -0.6564862
#7 2 2 1.6472917 -1.06579367 0.7990405
#8 2 3 1.6136631 -1.95556723 -1.8174580
#9 0 NA NA NA NA
#10 3 1 -0.6219918 1.63406940 -1.4049816
#11 3 2 -1.8993269 -0.83632239 -1.8263515
#12 3 3 0.1642351 -1.16070181 1.2382465
Here is one option with data.table
library(data.table)
setDT(df1)[, .SD[c(.N+1, seq_len(.N))], Scenario][
!duplicated(Scenario), Scenario := 0][]
# Scenario Month A B C
# 1: 0 NA NA NA NA
# 2: 1 1 -0.5931863 1.04555081 -0.5938163
# 3: 1 2 0.1786261 2.04308443 0.1113706
# 4: 1 3 1.2057797 -0.32408372 -1.3977169
# 5: 0 NA NA NA NA
# 6: 2 1 0.9336152 0.05264706 -0.6564862
# 7: 2 2 1.6472917 -1.06579367 0.7990405
# 8: 2 3 1.6136631 -1.95556723 -1.8174580
# 9: 0 NA NA NA NA
#10: 3 1 -0.6219918 1.63406940 -1.4049816
#11: 3 2 -1.8993269 -0.83632239 -1.8263515
#12: 3 3 0.1642351 -1.16070181 1.2382465
Or as @chinsoon12 mentioned in the comments
setDT(df1)[, rbindlist(.(.(Scenario=0L), c(.(Scenario=rep(Scenario, .N)),
.SD)), use.names=TRUE, fill=TRUE), by=.(Scenario)][, -1L]
data
df1 <- structure(list(Scenario = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L
), Month = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), A = c(-0.593186301,
0.178626141, 1.205779717, 0.933615199, 1.647291688, 1.613663101,
-0.621991775, -1.899326887, 0.164235141), B = c(1.045550808,
2.043084432, -0.324083723, 0.052647056, -1.065793671, -1.955567231,
1.634069402, -0.836322394, -1.160701812), C = c(-0.593816304,
0.111370583, -1.397716949, -0.656486153, 0.799040546, -1.817457972,
-1.404981646, -1.826351541, 1.238246459)), class = "data.frame",
row.names = c(NA,
-9L))
Here's a simple way (without loops) using base R -
df1 <- df[rep(1:nrow(df), (df$Month == 1)+1), ]
df1[duplicated(df1, fromLast = TRUE), ] <- NA
df1$Scenario[is.na(df1$Scenario)] <- 0
df1
Scenario Month A B C
1 0 NA NA NA NA
1.1 1 1 -0.5931863 1.04555081 -0.5938163
2 1 2 0.1786261 2.04308443 0.1113706
3 1 3 1.2057797 -0.32408372 -1.3977169
4 0 NA NA NA NA
4.1 2 1 0.9336152 0.05264706 -0.6564862
5 2 2 1.6472917 -1.06579367 0.7990405
6 2 3 1.6136631 -1.95556723 -1.8174580
7 0 NA NA NA NA
7.1 3 1 -0.6219918 1.63406940 -1.4049816
8 3 2 -1.8993269 -0.83632239 -1.8263515
9 3 3 0.1642351 -1.16070181 1.2382465
Data -
df <- structure(list(Scenario = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L
), Month = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), A = c(-0.593186301,
0.178626141, 1.205779717, 0.933615199, 1.647291688, 1.613663101,
-0.621991775, -1.899326887, 0.164235141), B = c(1.045550808,
2.043084432, -0.324083723, 0.052647056, -1.065793671, -1.955567231,
1.634069402, -0.836322394, -1.160701812), C = c(-0.593816304,
0.111370583, -1.397716949, -0.656486153, 0.799040546, -1.817457972,
-1.404981646, -1.826351541, 1.238246459)), class = "data.frame", row.names = c(NA,
-9L))

Set value of a column to NA based on conditions in R

I have a data frame, a reproducible example is as follows:
structure(list(subscriberid = c(1177460837L, 1177460837L, 1177460837L,
1146526049L, 1146526049L, 1146526049L), variable = c("3134",
"4550", "4550", "5160", "2530", "2530"), value = c(1, 2, 2, 1,
2, 2), gender = c(2, 2, 2, 1, 2, 2), cwe = c(NA, 50L, 50L, NA,
30L, 30L), hw = c(NA, 48L, 48L, NA, 26L, 26L), resp = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), cna = c(3L, 1L, 1L, 3L, 1L, 1L)), .Names = c("subscriberid",
"variable", "value", "gender", "cwe", "hw", "resp", "cna"), row.names = c(4L,
5L, 6L, 9L, 10L, 11L), class = "data.frame")
The actual data frame looks like this:
subscriberid variable value gender cwe hw resp cna
4 1177460837 3134 1 2 NA NA NA 3
5 1177460837 4550 2 2 50 48 NA 1
6 1177460837 4550 2 2 50 48 NA 1
9 1146526049 5160 1 1 NA NA NA 3
10 1146526049 2530 2 2 30 26 NA 1
11 1146526049 2530 2 2 30 26 NA 1
In the above df, rows 5 and 6 are exactly the same. From row 5 I want to remove 48, and from row 6 I want to remove 50. Essentially, I want to retain only one age per row and set the other to NA. I tried using a for loop, but that sets the column I refer to to NA in both rows.
for (i in 1:nrow(test)) {
test$hw[i] <- ifelse(!is.na(test$cwe[i]) & !is.na(test$hw[i]), NA, test$hw[i])
}
I am trying to set an if condition to identify if both the rows are same, then I want to iteratively remove one of the values from the first row and remove the other from the second.
The desired output is as follows:
subscriberid variable value gender cwe hw resp cna
4 1177460837 3134 1 2 NA NA NA 3
5 1177460837 4550 2 2 50 NA NA 1
6 1177460837 4550 2 2 NA 48 NA 1
9 1146526049 5160 1 1 NA NA NA 3
10 1146526049 2530 2 2 30 NA NA 1
11 1146526049 2530 2 2 NA 26 NA 1
You can combine which() and duplicated() to locate the duplicated rows.
Because both rows of each pair have to change, compute duplicated() on the unmodified original and write the changes to a copy; otherwise the first change would stop the rows from being identical. Note that this only works if identical rows are always consecutive.
dfNA <- df
dfNA$hw[which(duplicated(df))-1] <- NA
dfNA$cwe[which(duplicated(df))] <- NA
dfNA
# subscriberid variable value gender cwe hw resp cna
#4 1177460837 3134 1 2 NA NA NA 3
#5 1177460837 4550 2 2 50 NA NA 1
#6 1177460837 4550 2 2 NA 48 NA 1
#9 1146526049 5160 1 1 NA NA NA 3
#10 1146526049 2530 2 2 30 NA NA 1
#11 1146526049 2530 2 2 NA 26 NA 1
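If the identical rows are not guaranteed to be consecutive, a possible variation is to call duplicated() from both ends instead of subtracting 1 from the index. This is only a sketch, assuming each row occurs at most twice; the data frame is a cut-down version of the question's example with the pairs deliberately separated, and `later`, `first` and `out` are illustrative names.

```r
# Cut-down example: rows 1 and 3 are identical, as are rows 2 and 4
df <- data.frame(
  subscriberid = c(1177460837, 1146526049, 1177460837, 1146526049),
  variable     = c("4550", "2530", "4550", "2530"),
  cwe          = c(50L, 30L, 50L, 30L),
  hw           = c(48L, 26L, 48L, 26L),
  stringsAsFactors = FALSE
)

# Second copy of each pair: an identical row has already been seen above it
later <- duplicated(df)
# First copy of each pair: an identical row follows somewhere below it
first <- duplicated(df, fromLast = TRUE) & !later

# Work on a copy so both masks refer to the unmodified data
out <- df
out$hw[first]  <- NA  # first copy keeps cwe only
out$cwe[later] <- NA  # second copy keeps hw only

print(out)
```

With three or more identical rows the two masks would need a different rule, so that case is left out here.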
A possible solution :
# create a logical vector indicating if current row is identical to previous one
# N.B.: do.call("paste",c(DF,sep="\r")) is used internally by "duplicated.data.frame" function
rowStrings <- do.call("paste", c(DF, sep = "\r"))
currRowIsEqualToPrev <- rowStrings[-1] == rowStrings[-length(rowStrings)]
# set first row hw = NA and second identical row cwe = NA
DF[c(currRowIsEqualToPrev,FALSE),'hw'] <- NA
DF[c(FALSE,currRowIsEqualToPrev),'cwe'] <- NA
> DF
subscriberid variable value gender cwe hw resp cna
4 1177460837 3134 1 2 NA NA NA 3
5 1177460837 4550 2 2 50 NA NA 1
6 1177460837 4550 2 2 NA 48 NA 1
9 1146526049 5160 1 1 NA NA NA 3
10 1146526049 2530 2 2 30 NA NA 1
11 1146526049 2530 2 2 NA 26 NA 1
Using lead and lag from dplyr package:
library(dplyr)
df1 %>%
group_by(subscriberid, variable) %>%
mutate(cwe = if_else(lead(cwe) == cwe, cwe, NA_integer_),
hw = if_else(lag(hw) == hw, hw, NA_integer_)) %>%
ungroup()
# # A tibble: 6 x 8
# subscriberid variable value gender cwe hw resp cna
# <int> <int> <int> <int> <int> <int> <lgl> <int>
# 1 1177460837 3134 1 2 NA NA NA 3
# 2 1177460837 4550 2 2 50 NA NA 1
# 3 1177460837 4550 2 2 NA 48 NA 1
# 4 1146526049 5160 1 1 NA NA NA 3
# 5 1146526049 2530 2 2 30 NA NA 1
# 6 1146526049 2530 2 2 NA 26 NA 1
I took a shot at it. This relies on group_by from dplyr to find duplicate rows. The method assumes that rows can reliably be identified as identical using the subscriberid, variable, value, gender, resp, and cna columns alone.
Because it is operating within groups only, it will work even if a preceding non-identical row contains the same value for cwe (I did check this, but I would also confirm it for yourself if I were you).
library(dplyr)
ndf <- df %>%
group_by(subscriberid, variable, value, gender, resp, cna) %>%
mutate(cwe = na_if(cwe, lag(cwe)),
hw = na_if(hw, lead(hw))) %>%
ungroup()
Output:
# A tibble: 6 x 8
subscriberid variable value gender cwe hw resp cna
<int> <chr> <dbl> <dbl> <int> <int> <int> <int>
1 1177460837 3134 1. 2. NA NA NA 3
2 1177460837 4550 2. 2. 50 NA NA 1
3 1177460837 4550 2. 2. NA 48 NA 1
4 1146526049 5160 1. 1. NA NA NA 3
5 1146526049 2530 2. 2. 30 NA NA 1
6 1146526049 2530 2. 2. NA 26 NA 1

lapply alternative to for loop to append to data frame

I have a data frame:
df<-structure(list(chrom = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 20L, 33L, 40L,
45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", "pos"), row.names = c(NA, -14L), class = "data.frame")
> head(df)
chrom pos
1 1 10
2 1 200
3 1 134
4 1 400
5 1 600
6 1 1000
And I want to calculate pos[i+1] - pos[i] within the same chromosome (chrom).
By using a for loop over each chrom level, and another over each row, I get the expected results:
for (c in levels(df$chrom)){
df_chrom<-filter(df, chrom == c)
df_chrom<-arrange(df_chrom, df_chrom$pos)
for (i in 1:nrow(df_chrom)){
dist<-(df_chrom$pos[i+1] - df_chrom$pos[i])
logdist<-log10(dist)
cat(c, i, df_chrom$pos[i], dist, logdist, "\n")
}
}
However, I want to save this to a data frame, and think that lapply or apply is the right way to go about this. I can't work out how to make the pos[i+1] - pos[i] calculation though (seeing as lapply works on each row/column).
Any pointers would be appreciated
Here's the output from my solution:
chrom index pos dist log10dist
1 1 10 124 2.093422
1 2 134 66 1.819544
1 3 200 200 2.30103
1 4 400 200 2.30103
1 5 600 400 2.60206
1 6 1000 NA NA
2 1 20 13 1.113943
2 2 33 NA NA
3 1 40 5 0.69897
3 2 45 NA NA
4 1 50 5 0.69897
4 2 55 45 1.653213
4 3 100 23 1.361728
4 4 123 NA NA
We can do this with a grouped difference. Convert the 'data.frame' to a 'data.table' (setDT(df)), order by 'pos', and then, grouped by 'chrom', take the difference of 'pos' (diff) along with the log10 of that difference. Because diff returns one element fewer than its input, an NA is appended to each.
library(data.table)
setDT(df)[order(pos), {v1 <- diff(pos)
.(index = seq_len(.N), pos = pos,
dist = c(v1, NA), logdiff = c(log10(v1), NA))}
, by = chrom]
# chrom index pos dist logdiff
# 1: 1 1 10 124 2.093422
# 2: 1 2 134 66 1.819544
# 3: 1 3 200 200 2.301030
# 4: 1 4 400 200 2.301030
# 5: 1 5 600 400 2.602060
# 6: 1 6 1000 NA NA
# 7: 2 1 20 13 1.113943
# 8: 2 2 33 NA NA
# 9: 3 1 40 5 0.698970
#10: 3 2 45 NA NA
#11: 4 1 50 5 0.698970
#12: 4 2 55 45 1.653213
#13: 4 3 100 23 1.361728
#14: 4 4 123 NA NA
Running the OP's code prints the following output:
#1 1 10 124 2.093422
#1 2 134 66 1.819544
#1 3 200 200 2.30103
#1 4 400 200 2.30103
#1 5 600 400 2.60206
#1 6 1000 NA NA
#2 1 20 13 1.113943
#2 2 33 NA NA
#3 1 40 5 0.69897
#3 2 45 NA NA
#4 1 50 5 0.69897
#4 2 55 45 1.653213
#4 3 100 23 1.361728
#4 4 123 NA NA
We split df by df$chrom (Note that we reorder both df and df$chrom before splitting). Then we go through each of the subgroups (the subgroups are called a in this example) using lapply. On the pos column of each subgroup, we calculate difference (diff) of consecutive elements and take log10. Since diff decreases the number of elements by 1, we add a NA to the end. Finally, we rbind all the subgroups together using do.call.
do.call(rbind, lapply(split(df[order(df$chrom, df$pos),], df$chrom[order(df$chrom, df$pos)]),
function(a) data.frame(a, dist = c(log10(diff(a$pos)), NA))))
# chrom pos dist
#1.1 1 10 2.093422
#1.3 1 134 1.819544
#1.2 1 200 2.301030
#1.4 1 400 2.301030
#1.5 1 600 2.602060
#1.6 1 1000 NA
#2.7 2 20 1.113943
#2.8 2 33 NA
#3.9 3 40 0.698970
#3.10 3 45 NA
#4.11 4 50 0.698970
#4.12 4 55 1.653213
#4.13 4 100 1.361728
#4.14 4 123 NA
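A shorter base R variant of the same split-and-diff idea, sketched with ave(), which applies a function within each group and puts the results back in the original row positions. The data is the question's df rebuilt by hand; the column names dist and log10dist follow the question's printed header.

```r
# The question's data, rebuilt by hand
df <- data.frame(
  chrom = factor(c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4)),
  pos   = c(10L, 200L, 134L, 400L, 600L, 1000L, 20L, 33L,
            40L, 45L, 50L, 55L, 100L, 123L)
)

# Sort so that consecutive rows within a chromosome are neighbours
df <- df[order(df$chrom, df$pos), ]

# Within each chrom: pos[i+1] - pos[i], padded with NA for the last position
df$dist <- ave(df$pos, df$chrom, FUN = function(p) c(diff(p), NA))
df$log10dist <- log10(df$dist)

print(df)
```

Sorting first matters, because ave() keeps rows where they are and diff() works on whatever order it is given.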
