How to cope with multi-index data in R?

I have a multi-index data set with 100 cases, and each case has 5 questions. Each question was scored by 2 raters.
case question rater1 rater2
1 1 1 1
1 2 1 0
1 3 1 1
1 4 1 1
1 5 0 0
2 1 0 1
2 2 1 1
2 3 1 1
2 4 1 0
2 5 0 0
3 1 0 0
3 2 1 0
3 3 1 1
3 4 1 1
3 5 0 1
...
I want to sum questions 1, 2 and 3 in each case as A, and questions 4 and 5 in each case as B, then insert those sums as extra rows at the end of each case, like this:
case question rater1 rater2
1 1 1 1
1 2 1 0
1 3 1 1
1 4 1 1
1 5 0 0
1 A 3 2
1 B 1 1
2 1 0 1
2 2 1 1
2 3 1 1
2 4 1 0
2 5 0 0
2 A 2 3
2 B 1 0
3 1 0 0
3 2 1 0
3 3 1 1
3 4 1 1
3 5 0 1
3 A 2 1
3 B 1 2
...
I am unsure how to achieve it.

You could summarize the data, bind the summary rows back onto the original data, and then re-sort it. For example:
library(dplyr)
dd %>%
group_by(case, grp = case_when(question %in% 1:3~"A", question %in% 4:5 ~ "B")) %>%
summarize(across(-question, sum)) %>%
ungroup() %>%
rename(question = grp) %>%
bind_rows(mutate(dd, question = as.character(question))) %>%
arrange(case, question)
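If you want to check the idea without any packages, here is a minimal base R sketch of the same summarize-bind-sort steps. It assumes the data frame is called dd (as above) and only uses the three cases shown in the question; the names grp, sums and out are just made up for the illustration.

```r
# First three cases from the question
dd <- data.frame(
  case     = rep(1:3, each = 5),
  question = rep(1:5, times = 3),
  rater1   = c(1, 1, 1, 1, 0,  0, 1, 1, 1, 0,  0, 1, 1, 1, 0),
  rater2   = c(1, 0, 1, 1, 0,  1, 1, 1, 0, 0,  0, 0, 1, 1, 1)
)

# Label questions 1-3 as "A" and questions 4-5 as "B"
grp <- ifelse(dd$question %in% 1:3, "A", "B")

# Sum both rater columns within each case / label combination
sums <- aggregate(dd[c("rater1", "rater2")],
                  by = list(case = dd$case, question = grp),
                  FUN = sum)

# question must be character so that "A" and "B" can sit in the same column
dd$question <- as.character(dd$question)
out <- rbind(dd, sums[names(dd)])

# Radix ordering sorts in the C locale, so "1".."5" come before "A" and "B"
out <- out[order(out$case, out$question, method = "radix"), ]
rownames(out) <- NULL
out
```

Each case then ends with its A row followed by its B row, as in the desired output.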

With data.table
library(data.table)
dt[
,.(
question = c(question, "A", "B"),
rater1 = c(rater1, sum(rater1[1:3]), sum(rater1[4:5])),
rater2 = c(rater2, sum(rater2[1:3]), sum(rater2[4:5]))
), case
][1:15]
#> case question rater1 rater2
#> 1: 1 1 1 0
#> 2: 1 2 1 1
#> 3: 1 3 0 0
#> 4: 1 4 0 0
#> 5: 1 5 0 1
#> 6: 1 A 2 1
#> 7: 1 B 0 1
#> 8: 2 1 0 0
#> 9: 2 2 0 1
#> 10: 2 3 0 1
#> 11: 2 4 1 1
#> 12: 2 5 0 0
#> 13: 2 A 0 2
#> 14: 2 B 1 1
#> 15: 3 1 0 0
Data
dt <- data.table(
case = rep(1:100, each = 5),
question = rep(1:5, 100),
rater1 = sample(0:1, 500, replace = TRUE),
rater2 = sample(0:1, 500, replace = TRUE)
)

Related

Recoding by an order in r

I have a data recoding puzzle. Here is what my sample data looks like:
df <- data.frame(
id = c(1,1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
scores = c(0,1,1,0,0,-1,-1, 0,0,1,-1,-1,-1, 0,1,0,1,1,0,1),
position = c(1,2,3,4,5,6,7, 1,2,3,4,5,6, 1,2,3,4,5,6,7),
cat = c(1,1,1,1,1,0,0, 1,1,1,0,0,0, 1,1,1,1,1,1,1))
id scores position cat
1 1 0 1 1
2 1 1 2 1
3 1 1 3 1
4 1 0 4 1
5 1 0 5 1
6 1 -1 6 0
7 1 -1 7 0
8 2 0 1 1
9 2 0 2 1
10 2 1 3 1
11 2 -1 4 0
12 2 -1 5 0
13 2 -1 6 0
14 3 0 1 1
15 3 1 2 1
16 3 0 3 1
17 3 1 4 1
18 3 1 5 1
19 3 0 6 1
20 3 1 7 1
There are three ids in the dataset, and rows are ordered by the position variable. For each id, in the first row where scores is -1, scores needs to become 0 and cat needs to become 1. For example, for id=1 that row is at position 6, so in that row scores should be 0 and cat should be 1. Ids that have no scores == -1 should stay as they are.
The desired output should look like below:
id scores position cat
1 1 0 1 1
2 1 1 2 1
3 1 1 3 1
4 1 0 4 1
5 1 0 5 1
6 1 0 6 1
7 1 -1 7 0
8 2 0 1 1
9 2 0 2 1
10 2 1 3 1
11 2 0 4 1
12 2 -1 5 0
13 2 -1 6 0
14 3 0 1 1
15 3 1 2 1
16 3 0 3 1
17 3 1 4 1
18 3 1 5 1
19 3 0 6 1
20 3 1 7 1
Any recommendations??
Thanks
This may be what you are after:
library(dplyr)
df %>%
group_by(id) %>%
mutate(i = which(scores == -1)[1]) %>% # within-id index of the first row with scores == -1 (NA if none)
mutate(scores = case_when(position == i & scores != 0 ~ 0, TRUE ~ scores), # set that row's score to 0 (position equals the within-id row index here)
cat = ifelse(scores == -1, 0, 1)) %>% # then recompute cat from the updated scores
select(-i) # remove the helper column i
After trying a few things and getting ideas from @Ricky and @e.matt, I came up with a solution.
df %>%
filter(scores == -1) %>% # keep rows where scores == -1
distinct(id, .keep_all = TRUE) %>% # keep only the first such row for each id
mutate(first = 1) %>% # create first column
right_join(df, by=c("id","scores","position","cat")) %>% # join back original dataset
mutate(first = coalesce(first, 0)) %>% # replace NAs with 0
mutate(scores = case_when(
first == 1 ~ 0,
TRUE~scores)) %>%
mutate(cat = case_when(
first == 1 ~ 1,
TRUE~cat))
This provides my desired output.
id scores position cat first
1 1 0 1 1 0
2 1 1 2 1 0
3 1 1 3 1 0
4 1 0 4 1 0
5 1 0 5 1 0
6 1 0 6 1 1
7 1 -1 7 0 0
8 2 0 1 1 0
9 2 0 2 1 0
10 2 1 3 1 0
11 2 0 4 1 1
12 2 -1 5 0 0
13 2 -1 6 0 0
14 3 0 1 1 0
15 3 1 2 1 0
16 3 0 3 1 0
17 3 1 4 1 0
18 3 1 5 1 0
19 3 0 6 1 0
20 3 1 7 1 0
Here is a data.table one-liner:
library( data.table )
setDT(df)
df[ df[, .(cumsum( scores == -1 ) == 1), by = .(id)]$V1, `:=`( scores = 0, cat = 1) ]
# id scores position cat
# 1: 1 0 1 1
# 2: 1 1 2 1
# 3: 1 1 3 1
# 4: 1 0 4 1
# 5: 1 0 5 1
# 6: 1 0 6 1
# 7: 1 -1 7 0
# 8: 2 0 1 1
# 9: 2 0 2 1
# 10: 2 1 3 1
# 11: 2 0 4 1
# 12: 2 -1 5 0
# 13: 2 -1 6 0
# 14: 3 0 1 1
# 15: 3 1 2 1
# 16: 3 0 3 1
# 17: 3 1 4 1
# 18: 3 1 5 1
# 19: 3 0 6 1
# 20: 3 1 7 1
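For comparison, here is a minimal base R sketch of the same "first -1 per id" idea, using the sample data from the question. The helper names is_neg and first_neg are made up for this example. The extra `is_neg &` is an assumption about what you would want on other data: it makes sure a later row that is not -1 is never flagged, which a bare `cumsum(...) == 1` test would do if such a row followed the first -1.

```r
df <- data.frame(
  id       = c(1,1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
  scores   = c(0,1,1,0,0,-1,-1, 0,0,1,-1,-1,-1, 0,1,0,1,1,0,1),
  position = c(1,2,3,4,5,6,7, 1,2,3,4,5,6, 1,2,3,4,5,6,7),
  cat      = c(1,1,1,1,1,0,0, 1,1,1,0,0,0, 1,1,1,1,1,1,1)
)

# Rows where the score is -1
is_neg <- df$scores == -1

# Running count of -1 rows within each id; the first -1 row is the one
# where that count is exactly 1 and the row itself is a -1
first_neg <- is_neg & ave(as.integer(is_neg), df$id, FUN = cumsum) == 1

# Recode only those rows; ids without any -1 are left untouched
df$scores[first_neg] <- 0
df$cat[first_neg] <- 1
df
```

On the sample data this flags rows 6 and 11 only, so id 3 is unchanged.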
You could do something along these lines using the dplyr package:
library(dplyr)
df = mutate(df, cat = ifelse(scores == -1, 1, cat),
scores = ifelse(scores == -1, 0, scores))
Using the mutate() function, I am re-assigning the values for the scores and cat fields according to ifelse() conditional statements. For scores, if the score is -1, the value is replaced by 0, otherwise it keeps the score as is. For cat, it also checks if scores is equal to -1, but would assign a value of 1 when the condition is met, or the already existing value of cat when the condition is not met.
EDIT
After our discussion in the comments, I think something along these lines should be helpful (you may have to modify the logic since I don't exactly follow what the desired output is here):
# Stop one row early so that i+1 never runs past the last row
for(i in seq_len(nrow(df) - 1)){
# Check if score is -1
if(df[i, 'scores'] == -1){
# Update values for the next row
df[i+1, 'scores'] <- 0
df[i+1, 'cat'] <- 1
}
}
Sorry that I don't really follow the desired output, hopefully this is helpful in getting you to your answer!

How to get the number of consecutive zeroes from a column in a dataframe

I'm trying to work out how to get the number of consecutive zeroes for a given column for a dataframe.
Here is a dataframe:
data <- data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,2), value = c(1,0,0,1,0,0,0,0,0,0,4,3))
This would be the desired output:
id value consec
1 1 0
1 0 2
1 0 2
1 1 0
1 0 2
1 0 2
2 0 4
2 0 4
2 0 4
2 0 4
2 4 0
2 3 0
Any ideas on how to achieve this output?
Many thanks
You can do:
data$consec <- with(data, ave(value, value, cumsum(value != 0), id, FUN = length) - (value != 0))
data
id value consec
1 1 1 0
2 1 0 2
3 1 0 2
4 1 1 0
5 1 0 2
6 1 0 2
7 2 0 4
8 2 0 4
9 2 0 4
10 2 0 4
11 2 4 0
12 2 3 0
Here's a base R solution using interaction and rle (run-length encoding):
rlid <- rle(as.numeric(interaction(data$id, data$value)))$lengths
data$consec <- replace(rep(rlid, rlid), data$value != 0, 0)
data
#> id value consec
#> 1 1 1 0
#> 2 1 0 2
#> 3 1 0 2
#> 4 1 1 0
#> 5 1 0 2
#> 6 1 0 2
#> 7 2 0 4
#> 8 2 0 4
#> 9 2 0 4
#> 10 2 0 4
#> 11 2 4 0
#> 12 2 3 0
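In case rle is unfamiliar, here is a small sketch of what it returns, using only the values for id 1 from the question. This deliberately ignores id (the answer above handles that with interaction), so treat it as an illustration of the run-length idea rather than a full solution.

```r
# Values for id 1 only
value <- c(1, 0, 0, 1, 0, 0)

# rle() collapses the vector into runs of identical values
r <- rle(value)
r$lengths  # length of each run
r$values   # the value that each run repeats

# Keep the run length for runs of zeros, use 0 for every other run,
# then expand back to one entry per original row
consec <- rep(ifelse(r$values == 0, r$lengths, 0), r$lengths)
consec
```

Here the runs are 1, 0 0, 1, 0 0, so consec comes out as 0 2 2 0 2 2.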
This dplyr solution will work. Using cumulative sum we keep track of every time a new non-zero entry occurs, and for each of these groups we count the number of zeros:
data %>%
group_by(id) %>%
mutate(flag_0 = cumsum(value != 0)) %>%
group_by(id, flag_0) %>%
mutate(conseq = ifelse(value == 0, sum(value == 0), 0)) %>%
ungroup()
# A tibble: 12 x 4
id value flag_0 conseq
<dbl> <dbl> <int> <dbl>
1 1 1 1 0
2 1 0 1 2
3 1 0 1 2
4 1 1 2 0
5 1 0 2 2
6 1 0 2 2
7 2 0 0 4
8 2 0 0 4
9 2 0 0 4
10 2 0 0 4
11 2 4 1 0
12 2 3 2 0
This tidyverse approach can also do the job
library(tidyverse)
data %>% group_by(id) %>%
mutate(value2 =cumsum(value)) %>% group_by(id, value, value2) %>%
mutate(consec = ifelse(value == 0, n(), 0)) %>%
ungroup() %>% select(-value2)
# A tibble: 12 x 3
id value consec
<dbl> <dbl> <dbl>
1 1 1 0
2 1 0 2
3 1 0 2
4 1 1 0
5 1 0 2
6 1 0 2
7 2 0 4
8 2 0 4
9 2 0 4
10 2 0 4
11 2 4 0
12 2 3 0

freq table for multiple variables in r

I would like to crosstab the items variable vs cat as a frequency table.
df1 <- data.frame(cat = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4),
item1 = c(0,0,1,0,1,1,0,0,0,1,0,1,0,0,1,0,0,1),
item2 = c(1,1,0,1,0,1,1,0,0,0,1,0,1,1,0,0,1,0),
item3 = c(0,0,1,0,1,0,0,0,1,0,1,1,1,0,0,1,0,1))
> table(df1$cat, df1$item1)
0 1
1 3 1
2 3 2
3 3 2
4 2 2
Is there a way to print all the items variables freq table by cat together?
Thanks
Here is a quick solution in base-R
aggregate(.~ cat, df1, table)
cat item1.0 item1.1 item2.0 item2.1 item3.0 item3.1
1 1 3 1 1 3 3 1
2 2 3 2 3 2 3 2
3 3 3 2 2 3 2 3
4 4 2 2 3 1 2 2
You can use tally() to get the frequency for every combination of groups.
library(tidyverse)
df1 <- data.frame(cat = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4),
item1 = c(0,0,1,0,1,1,0,0,0,1,0,1,0,0,1,0,0,1),
item2 = c(1,1,0,1,0,1,1,0,0,0,1,0,1,1,0,0,1,0),
item3 = c(0,0,1,0,1,0,0,0,1,0,1,1,1,0,0,1,0,1))
df1 %>% mutate_if(is.numeric, as.factor) %>%
group_by(cat, item1, item2, item3, .drop=F) %>%
tally()
First convert your variables to factors; you can then use group_by(, .drop=F) %>% tally() to tally all of your variables, including all groupings with zero frequencies. Remove .drop=F to drop the groupings with zero frequencies.
cat item1 item2 item3 n
1 1 0 0 0 0
2 1 0 0 1 0
3 1 0 1 0 3
4 1 0 1 1 0
5 1 1 0 0 0
6 1 1 0 1 1
7 1 1 1 0 0
8 1 1 1 1 0
9 2 0 0 0 1
10 2 0 0 1 1
11 2 0 1 0 1
12 2 0 1 1 0
13 2 1 0 0 0
14 2 1 0 1 1
15 2 1 1 0 1
16 2 1 1 1 0
17 3 0 0 0 0
18 3 0 0 1 0
19 3 0 1 0 1
20 3 0 1 1 2
21 3 1 0 0 1
22 3 1 0 1 1
23 3 1 1 0 0
24 3 1 1 1 0
25 4 0 0 0 0
26 4 0 0 1 1
27 4 0 1 0 1
28 4 0 1 1 0
29 4 1 0 0 1
30 4 1 0 1 1
31 4 1 1 0 0
32 4 1 1 1 0
Alternatively, if that is too unwieldy, you can also try table1() from library(table1).
library(tidyverse)
library(table1)
df1 <- data.frame(cat = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4),
item1 = c(0,0,1,0,1,1,0,0,0,1,0,1,0,0,1,0,0,1),
item2 = c(1,1,0,1,0,1,1,0,0,0,1,0,1,1,0,0,1,0),
item3 = c(0,0,1,0,1,0,0,0,1,0,1,1,1,0,0,1,0,1))
df1 <- df1 %>% mutate_if(is.numeric, as.factor)
table1(~ item1 + item2 + item3 | cat, data=df1)
This gives a table of the frequencies and percentages. The top row is your cat variable.
table1() is really great for generating HTML frequency tables. Highly recommend. You can do lots of formatting and labels to make tables presentable. Here is a tutorial
Here's another approach using ftable and stack from base R:
x <- ftable(cbind(cat = df1[, 1], stack(df1[-1])), row.vars = 1, col.vars = c(3, 2))
x
# ind item1 item2 item3
# values 0 1 0 1 0 1
# cat
# 1 3 1 1 3 3 1
# 2 3 2 3 2 3 2
# 3 3 2 2 3 2 3
# 4 2 2 3 1 2 2
One (debatable) downside of this approach is that the default data.table or data.frame methods for converting ftables to more usable objects will convert the output to a long format. But, you can grab SOfun and use ftable2dt if you want to keep the wide format.
library(SOfun)
ftable2dt(x)
# cat item1_0 item1_1 item2_0 item2_1 item3_0 item3_1
# 1: 1 3 1 1 3 3 1
# 2: 2 3 2 3 2 3 2
# 3: 3 3 2 2 3 2 3
# 4: 4 2 2 3 1 2 2
You can try this:
List <- list()
for(i in 2:dim(df1)[2])
{
List[[i-1]] <- table(df1$cat, df1[,i])
}
[[1]]
0 1
1 3 1
2 3 2
3 3 2
4 2 2
[[2]]
0 1
1 1 3
2 3 2
3 2 3
4 3 1
[[3]]
0 1
1 3 1
2 3 2
3 2 3
4 2 2
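The same loop can also be sketched with lapply, which has the side benefit that the list elements are named after the item columns. This is only a variation on the loop above; the name tabs is made up for the example.

```r
df1 <- data.frame(cat = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4),
                  item1 = c(0,0,1,0,1,1,0,0,0,1,0,1,0,0,1,0,0,1),
                  item2 = c(1,1,0,1,0,1,1,0,0,0,1,0,1,1,0,0,1,0),
                  item3 = c(0,0,1,0,1,0,0,0,1,0,1,1,1,0,0,1,0,1))

# One cat-by-value table per item column (every column except cat)
tabs <- lapply(df1[-1], function(item) table(cat = df1$cat, value = item))

names(tabs)  # "item1" "item2" "item3"
tabs$item2   # the crosstab of cat against item2
```

You can then pull out a single table by name instead of by position.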

Creating a new variable while using subsequent values in r

I have the following data frame:
df1 <- data.frame(id = rep(1:3, each = 5),
time = rep(1:5),
y = c(rep(1, 4), 0, 1, 0, 1, 1, 0, 0, 1, rep(0,3)))
df1
## id time y
## 1 1 1 1
## 2 1 2 1
## 3 1 3 1
## 4 1 4 1
## 5 1 5 0
## 6 2 1 1
## 7 2 2 0
## 8 2 3 1
## 9 2 4 1
## 10 2 5 0
## 11 3 1 0
## 12 3 2 1
## 13 3 3 0
## 14 3 4 0
## 15 3 5 0
I'd like to create a new indicator variable that tells me, for each of the three ids, at what point y = 0 for all subsequent responses. In the example above, for ids 1 and 2 this occurs at the 5th time point, and for id 3 this occurs at the 3rd time point.
I'm getting tripped up on id 2, where y = 0 at time point 2, but then goes back to one -- I'd like the indicator variable to take subsequent time points into account.
Essentially, I'm looking for the following output:
df1
## id time y new_col
## 1 1 1 1 0
## 2 1 2 1 0
## 3 1 3 1 0
## 4 1 4 1 0
## 5 1 5 0 1
## 6 2 1 1 0
## 7 2 2 0 0
## 8 2 3 1 0
## 9 2 4 1 0
## 10 2 5 0 1
## 11 3 1 0 0
## 12 3 2 1 0
## 13 3 3 0 1
## 14 3 4 0 1
## 15 3 5 0 1
The new_col variable is indicating whether or not y = 0 at that time point and for all subsequent time points.
I would use a little helper function for that.
foo <- function(x, val) {
pos <- max(which(x != val)) +1
as.integer(seq_along(x) >= pos)
}
df1 %>%
group_by(id) %>%
mutate(indicator = foo(y, 0))
# # A tibble: 15 x 4
# # Groups: id [3]
# id time y indicator
# <int> <int> <dbl> <int>
# 1 1 1 1 0
# 2 1 2 1 0
# 3 1 3 1 0
# 4 1 4 1 0
# 5 1 5 0 1
# 6 2 1 1 0
# 7 2 2 0 0
# 8 2 3 1 0
# 9 2 4 1 0
# 10 2 5 0 1
# 11 3 1 0 0
# 12 3 2 1 0
# 13 3 3 0 1
# 14 3 4 0 1
# 15 3 5 0 1
In case you want to consider NA-values in y, you can adjust foo to:
foo <- function(x, val) {
pos <- max(which(x != val | is.na(x))) +1
as.integer(seq_along(x) >= pos)
}
That way, if there's an NA after the last y=0, the indicator will remain 0.
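A base R variation on the same idea, in case you want to avoid the helper's max(which(...)) step: reverse y, take a running product of "is zero", and reverse back. The function name trailing_zero is made up for this sketch, and it assumes y has no NA values (an NA would propagate through cumprod).

```r
df1 <- data.frame(id = rep(1:3, each = 5),
                  time = rep(1:5),
                  y = c(rep(1, 4), 0, 1, 0, 1, 1, 0, 0, 1, rep(0, 3)))

# 1 where y is 0 at this point and at every later point, otherwise 0.
# Reading from the end, the running product stays 1 only while y == 0.
trailing_zero <- function(y) rev(cumprod(rev(y == 0)))

# Apply the helper separately within each id
df1$new_col <- ave(df1$y, df1$id, FUN = trailing_zero)
df1
```

For an id where y is 0 at every time point, this returns 1 throughout without a warning.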
Here is an option using data.table
library(data.table)
setDT(df1)[, indicator := cumsum(.I %in% .I[which.max(rleid(y)*!y)]), id]
df1
# id time y indicator
# 1: 1 1 1 0
# 2: 1 2 1 0
# 3: 1 3 1 0
# 4: 1 4 1 0
# 5: 1 5 0 1
# 6: 2 1 1 0
# 7: 2 2 0 0
# 8: 2 3 1 0
# 9: 2 4 1 0
#10: 2 5 0 1
#11: 3 1 0 0
#12: 3 2 1 0
#13: 3 3 0 1
#14: 3 4 0 1
#15: 3 5 0 1
Based on the comments from @docendodiscimus, if the values are not 0 for 'y' at the end of each 'id', then we can do
setDT(df1)[, indicator := {
i1 <- rleid(y) * !y
if(i1[.N]!= max(i1) & !is.na(i1[.N])) 0L else cumsum(.I %in% .I[which.max(i1)]) }, id]

Changing the ID value based on another column

I have a large data set that looks something like this:
Conv. Rev. ID Order path_no
0 0 1 1 1
1 50 1 2 1
0 0 1 3 2
1 100 1 4 2
0 0 2 1 1
0 0 2 2 1
1 150 2 3 1
1 100 2 4 2
I want to make a new ID column based on when there is a new path_no, then the ID will change. So I am hoping it will look something like this:
Conv. Rev. ID Order path_no
0 0 1 1 1
1 50 1 2 1
0 0 2 3 2
1 100 2 4 2
0 0 3 1 1
0 0 3 2 1
1 150 3 3 1
1 100 4 4 2
I think rleid from data.table should do the trick. Here's one solution that uses data.table and dplyr:
dplyr::mutate(df, ID = data.table::rleid(path_no))
Conv. Rev. ID Order path_no
1 0 0 1 1 1
2 1 50 1 2 1
3 0 0 2 3 2
4 1 100 2 4 2
5 0 0 3 1 1
6 0 0 3 2 1
7 1 150 3 3 1
8 1 100 4 4 2
Or with data.table only:
dt <- setDT(df)
dt[, ID := rleid(path_no)][]
Conv. Rev. ID Order path_no
1: 0 0 1 1 1
2: 1 50 1 2 1
3: 0 0 2 3 2
4: 1 100 2 4 2
5: 0 0 3 1 1
6: 0 0 3 2 1
7: 1 150 3 3 1
8: 1 100 4 4 2
Data:
text <- "Conv. Rev. ID Order path_no
0 0 1 1 1
1 50 1 2 1
0 0 1 3 2
1 100 1 4 2
0 0 2 1 1
0 0 2 2 1
1 150 2 3 1
1 100 2 4 2"
df <- read.table(text = text, stringsAsFactors = FALSE, header = TRUE)
You can also go for a simple for loop:
vals <- c(1, 1, 1, 2, 2, 2, 1, 1, 2)
nobs <- length(vals)
idx <- rep(1, nobs)
for (i in 2:nobs) {
if (vals[i] != vals[i-1]) {
idx[i] <- idx[i-1] + 1
} else {
idx[i] <- idx[i-1]
}
}
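The loop above can also be written without a loop, as a rough sketch: compare each value with the one before it and take a cumulative sum of the changes. Here path_no is just the column from the question typed in as a vector, and new_id is a made-up name.

```r
# path_no column from the question
path_no <- c(1, 1, 2, 2, 1, 1, 1, 2)

# TRUE for the first row and for every row that differs from the previous one;
# the running total of those TRUEs is the new ID
changed <- c(TRUE, path_no[-1] != path_no[-length(path_no)])
new_id <- cumsum(changed)
new_id
```

For this vector new_id is 1 1 2 2 3 3 3 4, which matches the ID column in the desired output.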
