How to test whether a sequence of sections has gaps in it? - r

I have several (ice-)core section samples in a dataset (ID in the example below). Some cores have missing sections (i.e. gaps), but I do not know which ones. How to find this out using R?
Example:
dt <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), Sec.start = c(0,
5, 10, 20, 50, 100, 200, 0, 5, 10, 30), Sec.end = c(5, 10, 20,
30, 100, 200, 400, 5, 10, 20, 50), Section = c("0-5", "5-10",
"10-20", "20-30", "50-100", "100-200", "200-400", "0-5", "5-10",
"10-20", "30-50")), .Names = c("ID", "Sec.start", "Sec.end",
"Section"), row.names = c(NA, -11L), class = "data.frame")
dt
ID Sec.start Sec.end Section
1 a 0 5 0-5
2 a 5 10 5-10
3 a 10 20 10-20
4 a 20 30 20-30
5 b 50 100 50-100
6 b 100 200 100-200
7 b 200 400 200-400
8 c 0 5 0-5
9 c 5 10 5-10
10 c 10 20 10-20
11 c 30 50 30-50
"a" and "b" do not have gaps, whereas "c" does (missing piece between 20 and 30), so I am after a following result:
$a
[1] TRUE
$b
[1] TRUE
$c
[1] FALSE

You can try:
lapply(split(dt,dt$ID),function(x) all(x[-1,2]==x[-nrow(x),3]))
#$a
#[1] TRUE
#$b
#[1] TRUE
#$c
#[1] FALSE

Here's a dplyr approach:
library(dplyr)
dt %>%
group_by(ID) %>%
summarise(check = all(Sec.end == lead(Sec.start, default = last(Sec.end))))
#Source: local data table [3 x 2]
#
# ID check
# (fctr) (lgl)
#1 a TRUE
#2 b TRUE
#3 c FALSE
Or the same using data.table:
library(data.table)
setDT(dt)[, .(check = all(Sec.end == shift(Sec.start, 1L, 'lead', fill = last(Sec.end)))),
by=ID]
# ID check
#1: a TRUE
#2: b TRUE
#3: c FALSE
Both approaches make use of lag/lead functions (in data.table called shift) to compare each Sec.end value to the next row's Sec.start value. In the last row, where there's no leading Sec.start value, we supply a default value which is the last row's Sec.end - this means the last row (per ID) is always TRUE. We use all to check if all of the comparisons are TRUE per ID.
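If you also need to know where the gaps are, not just whether a core has any, the same "compare each end with the next start" idea can be sketched in base R. This is only a sketch: `find_gaps` is a made-up helper name, the data frame is re-typed from the question without the Section column, and rows are assumed to be sorted by Sec.start within each ID.

```r
# Example data, re-created from the question (Section column omitted).
dt <- data.frame(
  ID        = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c"),
  Sec.start = c(0, 5, 10, 20, 50, 100, 200, 0, 5, 10, 30),
  Sec.end   = c(5, 10, 20, 30, 100, 200, 400, 5, 10, 20, 50)
)

# Hypothetical helper: for one core, return the (from, to) limits of every
# break between consecutive sections. Zero rows means no gaps.
find_gaps <- function(x) {
  prev_end   <- x$Sec.end[-nrow(x)]   # end of each section except the last
  next_start <- x$Sec.start[-1]       # start of each section except the first
  is_gap     <- next_start != prev_end
  data.frame(from = prev_end[is_gap], to = next_start[is_gap])
}

gaps <- lapply(split(dt, dt$ID), find_gaps)
gaps$c
```

Note that the comparison is `!=`, so overlapping sections (next start smaller than previous end) are reported as well as true gaps.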

Related

check if numbers in a column are ascending by a certain value (R dataframe)

I have a column of numbers (index) in a dataframe like the one below. I am attempting to check whether these numbers ascend in steps of 1 within each group. For example, groups B and C do not ascend by 1. While I can check by sight, my dataframe is thousands of rows long, so I'd prefer to automate this. Does anyone have advice? Thank you!
group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2
...
I think this works. diff calculates the difference between each pair of consecutive numbers, and then we can use all to see if all the differences are 1. dat2 is the final output.
library(dplyr)
dat2 <- dat %>%
group_by(group) %>%
summarize(Result = all(diff(index) == 1)) %>%
ungroup()
dat2
# # A tibble: 3 x 2
# group Result
# <chr> <lgl>
# 1 A TRUE
# 2 B FALSE
# 3 C FALSE
DATA
dat <- read.table(text = "group index
A 0
A 1
A 2
A 3
A 4
B 0
B 1
B 2
B 2
C 0
C 3
C 1
C 2",
header = TRUE, stringsAsFactors = FALSE)
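With thousands of rows it may also help to see which rows break the rule. A base R sketch, assuming the same `dat` layout as above (the `step` column and the `bad_rows` name are made up for illustration):

```r
dat <- data.frame(
  group = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"),
  index = c(0, 1, 2, 3, 4, 0, 1, 2, 2, 0, 3, 1, 2)
)

# Step from the previous row within each group; the first row of a group
# has no predecessor, so it gets NA.
dat$step <- ave(dat$index, dat$group, FUN = function(v) c(NA, diff(v)))

# Rows whose step is not exactly 1.
bad_rows <- dat[!is.na(dat$step) & dat$step != 1, ]
bad_rows
```

Groups that never appear in `bad_rows` ascend by 1 throughout.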
Maybe aggregate could help
> aggregate(.~group,df1,function(v) all(diff(v)==1))
group index
1 A TRUE
2 B FALSE
3 C FALSE
We can do a group by group, get the difference between the current and previous value (shift) and check if all the differences are equal to 1.
library(data.table)
setDT(df1)[, .(Result = all((index - shift(index))[-1] == 1)), group]
# group Result
#1: A TRUE
#2: B FALSE
#3: C FALSE
data
df1 <- structure(list(group = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "C", "C", "C", "C"), index = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 2L, 0L, 3L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-13L))

Find multiple "switching points" by comparing the answers in columns

I have a data set in which subjects have made choices between A and B for 13 different B's. Below is a simplified example of what the data looks like with 54 subjects and 5 choices. (1 is A, 2 is B).
subject choice1 choice2 choice3 choice4 choice5
1 1 1 1 1 2 2
2 2 1 1 2 2 2
3 3 1 2 1 2 2
4 4 1 2 2 2 2
I would like to find the questions in which subjects switch from option A to B, i.e. for subject 1 this would be choice4.
In a previous study we did this by computing the number of times the subject would choose option A and then selecting the corresponding option B from a separate matrix. See code below.
However, the difference now is that instead of choosing 1 switching point, subjects were asked the questions in a randomized order, and thus there is the possibility of having multiple switching points. For example in the table above, subject 3 switches to B at choice2 and again at choice4.
I would like to find both the first time the subject switches to option B, and the last time (before sticking with B for the rest of the choices).
sure_amounts <- matrix(nrow = 4, ncol = 13) # 4 treatments, 13 questions
sure_amounts[1, ] <- c(0, 2, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 8, 10) # Option B's
sure_amounts[2, ] <- seq(2, 14, 1)
sure_amounts[3, ] <- seq(2, 14, 1)
sure_amounts[4, ] <- seq(2, 14, 1)
b_choice <- matrix(nrow = 201, ncol = 4)
switch_choice <- matrix(nrow = 201, ncol = 4) # switching point from A to B
for(j in 1:4){ # number of treatments
for(i in 1:201){ # number of subjects
choice = NULL
fl = data$ID == i
k = 1 + 36*(j-1) # 36 before going to the next treatment (due to other questions)
choice = c(data[fl,k:(k+12)])
b_choice[i,j] = length(choice[choice==1])
temp = b_choice[i,j]
switch_choice[i,j] <- ifelse(temp==0, 0, sure_amounts[j, temp])
}
}
Does anyone have any tips on how to approach this? Thanks in advance!
I am not sure what you want your expected output to look like, but you can try to get the data in long format and, for each subject, select the rows where they switch from 1 -> 2.
library(dplyr)
df %>%
tidyr::pivot_longer(cols = -subject) %>%
group_by(subject) %>%
filter(value == 2 & lag(value) == 1 |
value == 1 & lead(value) == 2)
# subject name value
# <int> <chr> <int>
# 1 1 choice3 1
# 2 1 choice4 2
# 3 2 choice2 1
# 4 2 choice3 2
# 5 3 choice1 1
# 6 3 choice2 2
# 7 3 choice3 1
# 8 3 choice4 2
# 9 4 choice1 1
#10 4 choice2 2
Here we can see that subject 1 moves from 1 -> 2 from choice3 -> choice4 and so on.
data
df <- structure(list(subject = 1:4, choice1 = c(1L, 1L, 1L, 1L), choice2 = c(1L,
1L, 2L, 2L), choice3 = c(1L, 2L, 1L, 2L), choice4 = c(2L, 2L,
2L, 2L), choice5 = c(2L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -4L))
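If only the first and the last switch to B per subject are needed, as the question asks, something along these lines might work in base R. It is a sketch, not a tested solution for the full data: `first_last` is a made-up helper, and choices are assumed to be coded 1 (A) and 2 (B) with no missing values.

```r
df <- data.frame(
  subject = 1:4,
  choice1 = c(1, 1, 1, 1),
  choice2 = c(1, 1, 2, 2),
  choice3 = c(1, 2, 1, 2),
  choice4 = c(2, 2, 2, 2),
  choice5 = c(2, 2, 2, 2)
)

choices <- as.matrix(df[, -1])

# Hypothetical helper: diff() equals 1 exactly where a 1 is followed by a 2;
# adding 1 points at the column that holds the B after the switch.
first_last <- function(x) {
  s <- which(diff(x) == 1) + 1
  if (length(s) == 0) return(c(NA_character_, NA_character_))  # never switches
  c(colnames(choices)[s[1]], colnames(choices)[s[length(s)]])
}

res <- t(apply(choices, 1, first_last))  # one row per subject
result <- data.frame(
  subject      = df$subject,
  first_switch = res[, 1],
  last_switch  = res[, 2],
  stringsAsFactors = FALSE
)
result
```

For subject 3 this gives choice2 as the first switch and choice4 as the last one, matching the description in the question.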
A Base R solution:
Essentially, this code subtracts a shifted copy of the decisions and detects where the difference is not equal to zero.
Code:
lapply(as.data.frame(t(df_1)[-1,]), function(x){
t <- x - c(x[-1], 0) # row subtracted by shortened row
z <- which(t[-length(t)] != 0) # values not equal to zero and rm last value
z + 1 # remove lag
})
# $`1`
# [1] 4
# $`2`
# [1] 3
# $`3`
# [1] 2 3 4
# $`4`
# [1] 2
Data:
df_1 <- read.table(text = " subject choice1 choice2 choice3 choice4 choice5
1 1 1 1 1 2 2
2 2 1 1 2 2 2
3 3 1 2 1 2 2
4 4 1 2 2 2 2 ", header = T)
An alternative approach:
library(dplyr)
library(stringr)
library(purrr)
df %>%
mutate(g = paste0(choice1, choice2, choice3, choice4, choice5),
switches = as.character(map(g, ~pluck(str_locate_all(.x, "12"), 1)))) %>%
select(-g)
#> subject choice1 choice2 choice3 choice4 choice5 switches
#> 1 1 1 1 1 2 2 3:4
#> 2 2 1 1 2 2 2 2:3
#> 3 3 1 2 1 2 2 c(1, 3, 2, 4)
#> 4 4 1 2 2 2 2 1:2
data
df <- structure(list(subject = 1:4, choice1 = c(1L, 1L, 1L, 1L), choice2 = c(1L,
1L, 2L, 2L), choice3 = c(1L, 2L, 1L, 2L), choice4 = c(2L, 2L,
2L, 2L), choice5 = c(2L, 2L, 2L, 2L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Created on 2020-07-10 by the reprex package (v0.3.0)

Normalize multiple values using values of one factor in R

We have some tidy data with treatments (multiple samples and control), time points, and measured values. I want to normalize all the samples by dividing by the corresponding time point in the control variable.
I know how I would do this with each value in its own column, but can't figure out how to use a combination of gather, mutate, summarise, etc. from tidyr or dplyr to do this in a straightforward way.
Here is a sample data frame definition:
structure(list(time = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
value = c(10, 20, 15, 100, 210, 180, 110, 180, 140),
as.factor.treat. = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
.Label = c("c", "t1", "t2"), class = "factor")),
.Names = c("time", "value", "treat"),
row.names = c(NA, -9L), class = "data.frame")
Data frame looks like this:
time value treat
1 10 c
2 20 c
3 15 c
1 100 t1
2 210 t1
3 180 t1
1 110 t2
2 180 t2
3 140 t2
Expected output: the same, but with a normvalue column containing c(1,1,1,10,10.5,12,11,9,9.333333)
I'd like to get out columns of normalized value for each treatment and time point using tidyverse procedures...
If you group by time (assuming that, as in the example, it is the grouping variable for time-point) then we can use bracket notation in a mutate statement to search only within the group. We can use that to access the control value for each group and then divide the un-normalized value by that:
df %>%
group_by(time) %>%
mutate(value.norm = value / value[treat == 'c'])
# A tibble: 9 x 4
# Groups: time [3]
time value treat value.norm
<dbl> <dbl> <fct> <dbl>
1 1 10 c 1
2 2 20 c 1
3 3 15 c 1
4 1 100 t1 10
5 2 210 t1 10.5
6 3 180 t1 12
7 1 110 t2 11
8 2 180 t2 9
9 3 140 t2 9.33
All this does is take the value column of each row and divide it by the value for the control sample with the same time value. As you can see, it doesn't care if sample t1 is missing an observation for time == 1:
df <- structure(list(time = c(1, 2, 3, 2, 3, 1, 2, 3),
value = c(10, 20, 15, 210, 180, 110, 180, 140),
as.factor.treat. = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
.Label = c("c", "t1", "t2"), class = "factor")),
.Names = c("time", "value", "treat"),
row.names = c(NA, -8L), class = "data.frame")
df %>%
group_by(time) %>%
mutate(value.norm = value / value[treat == 'c'])
# A tibble: 8 x 4
# Groups: time [3]
time value treat value.norm
<dbl> <dbl> <fct> <dbl>
1 1 10 c 1
2 2 20 c 1
3 3 15 c 1
4 2 210 t1 10.5
5 3 180 t1 12
6 1 110 t2 11
7 2 180 t2 9
8 3 140 t2 9.33
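For comparison, the same division can be sketched in base R with a lookup via match. This assumes there is at most one control row per time point; `ctrl` is just an illustrative name. A time point without any control row would get NA here.

```r
df <- data.frame(
  time  = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  value = c(10, 20, 15, 100, 210, 180, 110, 180, 140),
  treat = c("c", "c", "c", "t1", "t1", "t1", "t2", "t2", "t2")
)

# One control value per time point.
ctrl <- df[df$treat == "c", c("time", "value")]

# match() finds, for every row, the position of its time point in ctrl;
# time points with no control yield NA in the lookup.
df$value.norm <- df$value / ctrl$value[match(df$time, ctrl$time)]
df
```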

How to check previous row value with present row value in dataframe

How can I compare each row's value with the previous row's value, dynamically for all columns of a data frame, grouped by a specific ID?
my data frame:
ID ITEM1 ITEM2 ITEM3
1 A A A
2 C B C
1 A B C
1 B A C
2 NA B F
3 A A D
4 R G J
4 H T J
For Ex:
ID ITEM1 ITEM2 ITEM3 ITEM1change ITEM2change ITEM3change
1 A A A 0 0 0
1 A B C 0 1 1
1 B A C 1 1 0
2 C B C 0 0 0
2 NA B F 1 0 1
3 A A D 0 0 0
4 R G J 0 0 0
4 H T J 1 1 0
My final output will be:
Field modifiedcount unmodifiedcount Total
ITEM1change 3 5 8
ITEM2change 3 5 8
ITEM3change 2 6 8
my data:
structure(list(ID = c(1, 2, 1, 1, 2, 3, 4, 4), ITEM1 = structure(c(1L,
3L, 1L, 2L, NA, 1L, 5L, 4L), .Label = c("A", "B", "C", "H", "R"
), class = "factor"), ITEM2 = structure(c(1L, 2L, 2L, 1L, 2L,
1L, 3L, 4L), .Label = c("A", "B", "G", "T"), class = "factor"),
ITEM3 = structure(c(1L, 2L, 2L, 2L, 4L, 3L, 5L, 5L), .Label = c("A",
"C", "D", "F", "J"), class = "factor")), .Names = c("ID",
"ITEM1", "ITEM2", "ITEM3"), row.names = c(NA, -8L), class = "data.frame")
A possible solution:
library(dplyr)
library(tidyr)
df %>%
gather(item, value, -1) %>%
group_by(ID, item) %>%
mutate(change = lag(value, default = first(value)) != value,
change = replace(change, is.na(change), TRUE)) %>%
group_by(item) %>%
summarise(modified = sum(change, na.rm = TRUE),
unmodified = sum(!change, na.rm = TRUE)) %>%
mutate(total = modified + unmodified)
which gives:
# A tibble: 3 x 4
item modified unmodified total
<chr> <int> <int> <int>
1 ITEM1 3 5 8
2 ITEM2 3 5 8
3 ITEM3 2 6 8
Here is another idea using rollapply from zoo. By using rollapply with width = 2, we are testing whether each value differs from the previous one. Wrapping it in as.integer gives 1s (TRUE) and 0s (FALSE). We then replace all NAs with 1 since you consider them as being modified, and use colSums to sum the modified/unmodified elements. The total is just the number of rows of the original data frame.
library(zoo)
m1 <- do.call(rbind, lapply(split(df, df$ID), function(i)
sapply(i[-1], function(j)
as.integer(c(FALSE, rollapply(j, 2, function(k) k[1] != k[2]))))))
m1 <- replace(m1, is.na(m1), 1)
#giving
# ITEM1 ITEM2 ITEM3
# 0 0 0
# 0 1 1
# 1 1 0
# 0 0 0
# 1 0 1
#3 0 0 0
# 0 0 0
# 1 1 0
To get your expected data frame,
final_df <- data.frame(modified = colSums(m1 == 1),
unmodified = colSums(m1 != 1),
Total = nrow(df), stringsAsFactors = FALSE)
which gives,
modified unmodified Total
ITEM1 3 5 8
ITEM2 3 5 8
ITEM3 2 6 8
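The same per-ID comparison can also be sketched without zoo. The version below is only an illustration: `changed` and `summary_df` are made-up names, the data is re-typed from the question as character columns rather than factors, and any comparison involving NA is counted as a change, as in the expected output.

```r
df <- data.frame(
  ID    = c(1, 2, 1, 1, 2, 3, 4, 4),
  ITEM1 = c("A", "C", "A", "B", NA, "A", "R", "H"),
  ITEM2 = c("A", "B", "B", "A", "B", "A", "G", "T"),
  ITEM3 = c("A", "C", "C", "C", "F", "D", "J", "J"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a value differs from the previous row of
# the same ID. The first row of each ID is never counted as a change.
changed <- function(x, id) {
  out <- logical(length(x))
  for (g in unique(id)) {
    idx <- which(id == g)
    if (length(idx) > 1) {
      differs <- x[idx[-1]] != x[idx[-length(idx)]]
      differs[is.na(differs)] <- TRUE   # NA on either side counts as modified
      out[idx[-1]] <- differs
    }
  }
  out
}

items    <- c("ITEM1", "ITEM2", "ITEM3")
modified <- sapply(df[items], changed, id = df$ID)  # 8 x 3 logical matrix

summary_df <- data.frame(
  item       = items,
  modified   = colSums(modified),
  unmodified = colSums(!modified),
  total      = nrow(df)
)
summary_df
```

Because rows are looked up by ID inside the helper, the data does not need to be sorted by ID first.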
The given data has many columns of the same type. This strongly suggests that the data would be better stored in long format rather than in wide format.
Jaap's solution is reshaping the data using tidyr / dplyr.
However, I would like to suggest a data.table solution which does not reshape the data. In addition, it avoids handling NA values separately.
library(data.table)
# coerce to data.table, loop over columns and determine changes to previous row by ID
tmp <- setDT(DF)[, lapply(.SD, function(x) x == shift(x, fill = x[1])), by = ID]
tmp
ID ITEM1 ITEM2 ITEM3
1: 1 TRUE TRUE TRUE
2: 1 TRUE FALSE FALSE
3: 1 FALSE FALSE TRUE
4: 2 TRUE TRUE TRUE
5: 2 NA TRUE FALSE
6: 3 TRUE TRUE TRUE
7: 4 TRUE TRUE TRUE
8: 4 FALSE FALSE TRUE
Now, we can count the unchanged rows:
tmp[, lapply(.SD, sum, na.rm = TRUE), .SDcols = -"ID"]
ITEM1 ITEM2 ITEM3
1: 5 5 6
From here, OP's expected result can be achieved in two different ways
using melt()
melt(tmp[, lapply(.SD, sum, na.rm = TRUE), .SDcols = -"ID"]
, measure.vars = patterns("^ITEM"),
variable.name = "item",
value.name = "unmodified")[
, c("modified", "Total") := .(nrow(DF) - unmodified, nrow(DF))][]
or by transposing:
as.data.table(
t(tmp[, lapply(.SD, sum, na.rm = TRUE), .SDcols = -"ID"])
, keep.rownames = "item")[, setnames(.SD, "V1", "unmodified")][
, c("modified", "Total") := .(nrow(DF) - unmodified, nrow(DF))][]
Both return the same result:
item unmodified modified Total
1: ITEM1 5 3 8
2: ITEM2 5 3 8
3: ITEM3 6 2 8
For the sake of completeness, here is also a data.table implementation of the reshape approach. As above, NA are handled by counting the unmodified rows first excluding any NA.
melt(setDT(DF), id.vars = "ID", variable.name = "item")[
, value == shift(value, fill = value[1L]), by = .(ID, item)][
, .(unmodified = sum(V1, na.rm = TRUE)), by = item][
, c("modified", "Total") := .(nrow(DF) - unmodified, nrow(DF))][]
If dat is your data, then try:
Create ITEMCHANGE variables
dat["ITEM1Change"] <- c(NA, head(dat["ITEM1"], dim(dat)[1] - 1)[[1]])
dat["ITEM2Change"] <- c(NA, head(dat["ITEM2"], dim(dat)[1] - 1)[[1]])
dat["ITEM3Change"] <- c(NA, head(dat["ITEM3"], dim(dat)[1] - 1)[[1]])
Then compare if there are changes
dat$ITEM1Change <- ifelse(dat$ITEM1Change == dat$ITEM1, 0, 1)
dat$ITEM2Change <- ifelse(dat$ITEM2Change == dat$ITEM2, 0, 1)
dat$ITEM3Change <- ifelse(dat$ITEM3Change == dat$ITEM3, 0, 1)
Then group and summarize
library(dplyr)
dat %>%
group_by("ITEM1") %>%
summarise_at(.funs = sum, .vars = "ITEM1Change") -> ITEM1Change
etc.
Is this what you need?

Transform a dataframe to use first column values as column names

I have a dataframe with 2 columns:
.id vals
1 A 10
2 B 20
3 C 30
4 A 100
5 B 200
6 C 300
dput(tst_df)
structure(list(.id = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor"), vals = c(10, 20, 30, 100, 200,
300)), .Names = c(".id", "vals"), row.names = c(NA, -6L), class = "data.frame")
Now I want the .id column to become my column names, and the vals to become 2 rows.
Like this:
A B C
10 20 30
100 200 300
Basically, .id is my grouping variable and I want to have all values belonging to one group together. I expected something simple like melt and transform, but after many tries I have still not succeeded. Is anyone familiar with a function that will accomplish this?
You can do this in base R with unstack:
unstack(df, form=vals~.id)
A B C
1 10 20 30
2 100 200 300
The first argument is the name of the data.frame and the second is a formula which determines the unstacked structure.
You can also use tapply,
do.call(cbind, tapply(df$vals, df$.id, I))
# A B C
#[1,] 10 20 30
#[2,] 100 200 300
or wrap it in data frame, i.e.
as.data.frame(do.call(cbind, tapply(df$vals, df$.id, I)))
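Another base R variant, sketched here under the assumption that every group has the same number of values (otherwise the columns cannot be lined up), is to split the values by group and turn the resulting list into a data frame. The name `wide` is only illustrative.

```r
df <- data.frame(
  .id  = c("A", "B", "C", "A", "B", "C"),
  vals = c(10, 20, 30, 100, 200, 300)
)

# split() returns a named list with one vector per group;
# as.data.frame() turns each list element into a column.
wide <- as.data.frame(split(df$vals, df$.id))
wide
```

The values within each column keep the order in which they appear in the original data frame.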

Resources