Counting values in data frame subject to conditions

I have been searching around and I cannot figure out how to summarise the data in my data frame (subject to some ranges). I know it can be done with some combination of ddply/tapply or table, but I haven't been able to get the exact result I was expecting.
Basically I want to turn this:
part_no val1 val2 val3  val4
      2    1    2    3  45.3
      2    1    3    4 -12.3
      3    1    3    4  99.3
      3    1    5    2  -3.2
      3    1    4    3 -55.3
Into this:
part_no val4_bw0_50 val4_bw50_100 val4_bw-50_0 val4_bw-100_-50
      2           1             0            1               0
      3           0             1            1               1
This is dummy data; I have a lot more rows, but the idea is the same. I just want to count the number of values for a participant that meet a certain condition.
If anyone could explain it step by step, I would really appreciate it. I saw lots of different posts around, but none do exactly this, and my attempts (using table, etc.) only got me halfway there.

Better solution than the one below (it does not need the extra row used below, although if you wanted, you could move the renaming code to this matrix result):
xtabs(~ part_no + cut(val4, breaks = c(-100, -50, 0, 50, 100)), data = dat)
#-------------
        cut(val4, breaks = c(-100, -50, 0, 50, 100))
part_no  (-100,-50]  (-50,0]  (0,50]  (50,100]
      2           0        1       1         0
      3           1        1       0         1
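If you also want friendlier column names on this xtabs result, here is a small sketch of one way to move the renaming over (same breaks as above; this part is my addition, not from the original answer):
# sketch: turn the xtabs result into a data frame with custom names
tab <- xtabs(~ part_no + cut(val4, breaks = c(-100, -50, 0, 50, 100)), data = dat)
res <- as.data.frame.matrix(tab)                 # drop the xtabs/table classes
colnames(res) <- paste("val_btwn",
                       c(-100, -50, 0, 50), c(-50, 0, 50, 100), sep = "_")
res <- cbind(part_no = as.numeric(rownames(res)), res)
res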
First try: this was a solution to a slightly different problem, but it would be easy to adapt to your situation. The difficulty I ran into is that my solution requires part_no to start at 1. You could assign row labels later, I suppose. Or make 'part_no' a factor and use its numeric-mode value.
dat <- read.table(text = "part_no val1 val2 val3 val4
1 1 2 3 -32
2 1 2 3 45.3
2 1 3 4 -12.3
3 1 3 4 99.3
3 1 5 2 -3.2
3 1 4 3 -55.3
", header = TRUE)
levs <- 4    # number of intervals
recs <- matrix(c(unique(dat$part_no),
                 rep(0, levs * length(unique(dat$part_no)))),
               nrow = length(unique(dat$part_no)))
# column 1 holds part_no; columns 2-5 flag which interval each val4 fell in
recs[cbind(dat$part_no,
           1 + findInterval(dat$val4, c(-100, -50, 0, 50, 100)))] <- 1
recs
#------------------------------------
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    1    0    0
[2,]    2    0    1    1    0
[3,]    3    1    1    0    1
#------------------------------------
colnames(recs) <- c(names(dat)[1],
                    paste("val_btwn",
                          c(-100, -50, 0, 50, 100)[1:4],
                          c(-100, -50, 0, 50, 100)[2:5],
                          sep = "_"))
recs
#------------------------------------
     part_no val_btwn_-100_-50 val_btwn_-50_0 val_btwn_0_50 val_btwn_50_100
[1,]       1                 0              1             0               0
[2,]       2                 0              1             1               0
[3,]       3                 1              1             0               1
And now that I think about it further, I might use cut and xtabs next time. In fact it worked so well that I am posting it on top.

Related

Convert Categorical Data into binary (1&0) in R Studio

I have a dataset comparing cases to categories of mental illness. In the data set, mental illness is coded as 0 for none, 1 for mood disorders, 2 for behavioral disorders, 3 for other, and 4 for disorder-like symptoms. I am trying to convert my dataset (mentallIllness) so that any value from 1 to 4 (i.e., any symptoms or any disorder) counts as a 1 (yes, you show signs/have a disorder), and 0 stays 0 for no mental illness.
How can I go about that?
Thanks!
Suppose you have a vector with numbers from 0 to 4:
my_data <- c(0:4, 2, 3, 0)
my_data
#[1] 0 1 2 3 4 2 3 0
Here are a few ways to convert all the non-zeros to 1:
1*(my_data>0)
#[1] 0 1 1 1 1 1 1 0
as.numeric(my_data>0)
#[1] 0 1 1 1 1 1 1 0
In both of these cases, the term (my_data>0) tests each value in my_data to evaluate whether it is greater than 0: if so, the result is TRUE, otherwise FALSE. Multiplying TRUE/FALSE by 1, or converting to numeric, changes those to 1/0.
As Ben Bolker suggested, we could use ifelse to get the same results:
ifelse(my_data == 0, 0, 1)
#[1] 0 1 1 1 1 1 1 0
Your vector might live in a data frame, like:
my_df <- data.frame(my_data = c(0:4, 2, 3, 0))
We could use the same code to make a new variable, or overwrite the existing one:
my_df$recoded = ifelse(my_df$my_data == 0, 0, 1)
my_df
# my_data recoded
#1 0 0
#2 1 1
#3 2 1
#4 3 1
#5 4 1
#6 2 1
#7 3 1
#8 0 0
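Applied to the question's own data, the same recode might look like the sketch below. Note that the question names the data frame (mentallIllness) but not the column holding the 0-4 codes, so illness_code here is a hypothetical placeholder:
# `illness_code` is a placeholder name; substitute your real 0-4 column
mentallIllness$has_illness <- as.numeric(mentallIllness$illness_code > 0)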

Reshaping data with no time var

Problem:
How to generate a new dataset from an existing one; basically it is a reshape from long to wide, but a bit more complicated.
I have a non-trivial amount of data, of which I offer a simplified version below:
id <- c(1,2,3,4,5)
job <- c(11,12,11,12,13)
sex <- c(0,1,0,1,0)
country <- c(1,2,3,2,1)
data <- data.frame(id, job, sex, country)
Desired data:
I'd like to have a dataset of the jobs and their occupants, like this:
in job==11, there are 2 people of sex==0: one born in country==1 and one born in country==3
So, the new dataset would be like this:
  jobs jobs_sex0 jobs_sex1 jobs_country1 jobs_country2 jobs_country3
1   11         2         0             1             0             1
2   12         0         2             0             2             0
3   13         1         0             1             0             0
I have an intuition that this can be achieved with tapply, but I am not sure how.
I have tried this, and it does not work:
tapply(occupation[sex==1],sex[sex==1], sum)
aggregate(occupation, list(sex), fun=sum)
Edit:
I think this question is not a duplicate of Transpose / reshape dataframe without "timevar" from long to wide format, as the problem I have is that I need to reshape several factor variables with different numbers of levels. Applying the answer from the supposedly duplicated question does not work.
I wonder if the tableone package might help you here. Consider:
data$sex <- factor(data$sex) # note that you will have to ensure these are factors
data$country <- factor(data$country)
library(tableone)
tab1 <- CreateTableOne(vars=c("sex", "country"), strata="job", data=data)
print(tab1, showAllLevels=TRUE, test=FALSE, explain=FALSE)
#            Stratified by job
#             level 11          12          13
#   n                2           2           1
#   sex          0   2 (100.0)   0 (  0.0)   1 (100.0)
#                1   0 (  0.0)   2 (100.0)   0 (  0.0)
#   country      1   1 ( 50.0)   0 (  0.0)   1 (100.0)
#                2   0 (  0.0)   2 (100.0)   0 (  0.0)
#                3   1 ( 50.0)   0 (  0.0)   0 (  0.0)
If you want to do subsequent processing, the above solution will be less workable. Here is a coded solution, but you will have to adapt it for each situation:
out.data <- t(sapply(split(data, data$job), function(df) {
  with(df, c(table(sex), table(country)))
}))
out.data <- data.frame(job = rownames(out.data), out.data)
rownames(out.data) <- NULL
colnames(out.data)[2:6] <- c(paste("sex", levels(data$sex), sep = "_"),
                             paste("country", levels(data$country), sep = "_"))
out.data
#   job sex_0 sex_1 country_1 country_2 country_3
# 1  11     2     0         1         0         1
# 2  12     0     2         0         2         0
# 3  13     1     0         1         0         0
I think I have found another very simple solution, with the help of some friends :)
data
  id job sex country
1  1  11   2       1
2  2  12   1       2
3  3  11   2       3
4  4  12   1       2
5  5  13   2       1
data$sex <- as.factor(data$sex)
data$country <- as.factor(data$country)
agg_data <- aggregate(model.matrix(~ . - 1, data[, -(1:2)]),
                      by = list(unique.jobs = data$job), FUN = sum)
agg_data
  unique.jobs sex1 sex2 country1 country2 country3
1          11    0    2        1        0        1
2          12    2    0        0        2        0
3          13    0    1        1        0        0
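For what it's worth, here is a compact base-R sketch of the same kind of count table, simply binding two contingency tables column-wise. This is my own addition; it assumes the factor conversions from the first answer and uses the original 0/1 sex coding from the question's data:
# rows are jobs, columns are the factor levels of sex and country
tab <- cbind(table(data$job, data$sex), table(data$job, data$country))
colnames(tab) <- c(paste0("sex_", levels(data$sex)),
                   paste0("country_", levels(data$country)))
tab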

Group function with basic calculation

I have a data.table with two columns (date and status); now I want to add new columns computed from the original table.
data rules:
the Status column contains only 0 and 1
the Date column always increases, one row per second :)
new variables:
group: numbers each group (cycle) of the status. The order within a cycle is (0, 1): a cycle starts with status 0, and when the status becomes 0 again, one cycle is complete and the next begins.
cycle_time: the cycle time of each group
group_0: the time spent in status 0 within a given group
group_1: the time spent in status 1 within a given group
For example, a simple input. The code to generate the data:
library(data.table)
dd <- data.table(
  date = c("2015-07-01 00:00:12", "2015-07-01 00:00:13", "2015-07-01 00:00:14",
           "2015-07-01 00:00:15", "2015-07-01 00:00:16", "2015-07-01 00:00:17",
           "2015-07-01 00:00:18", "2015-07-01 00:00:19", "2015-07-01 00:00:20",
           "2015-07-01 00:00:21", "2015-07-01 00:00:22", "2015-07-01 00:00:23",
           "2015-07-01 00:00:24", "2015-07-01 00:00:25"),
  status = c(0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0))
The desired output, including the new columns, was shown as an image (omitted here).
Actually, I have already done this with some basic methods. The main idea is: if the current status is 0 and the next status is 1, then mark it as one group. The idea works, but the calculation takes too long, since there are so many loops. I suspect there is an easier solution for this case.
So a transition from 1 to 0 marks the boundary of a group. You can use cumsum and diff to get this working. For the x example in the answer of @zx8754:
data.frame(x, group_id = c(1, cumsum(diff(x) == -1) + 1))
  x group_id
1 0        1
2 0        1
3 0        1
4 1        1
5 1        1
6 0        2
7 0        2
8 1        2
9 0        3
For a more realistically sized example:
res = data.frame(status = sample(c(0,1), 10e7, replace = TRUE))
system.time(res$group_id <- c(1, cumsum(diff(res$status) == -1) + 1))
   user  system elapsed
  2.770   1.680   4.449
> head(res, 20)
   status group_id
1       0        1
2       0        1
3       1        1
4       0        2
5       0        2
6       0        2
7       1        2
8       1        2
9       0        3
10      1        3
11      1        3
12      0        4
13      1        4
14      0        5
15      0        5
16      1        5
17      0        6
18      0        6
19      1        6
20      0        7
Five seconds for 100 million records (the 10e7 above) is quite fast (although that depends on your definition of fast :)).
Benchmarking
set.seed(1)
res = data.frame(status = sample(c(0,1), 10e4, replace = TRUE))
microbenchmark::microbenchmark(
  rleid = {
    gr <- data.table::rleid(res$status)
    x1 <- as.numeric(as.factor(ifelse(gr %% 2 == 0, gr - 1, gr)))
    # removing "as.numeric(as.factor" helps, but it is still not as fast as cumsum
    # x1 <- ifelse(gr %% 2 == 0, gr - 1, gr)
  },
  cumsum = { x2 <- c(1, cumsum(diff(res$status) == -1) + 1) }
)
# Unit: milliseconds
#    expr        min         lq       mean     median         uq       max neval cld
#   rleid 118.161287 120.149619 122.673747 121.736122 123.271881 168.88777   100   b
#  cumsum   1.511811   1.559563   2.221273   1.826404   2.475402   6.88169   100  a
identical(x1, x2)
# [1] TRUE
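To get from the group ids to the other columns the question asks for, here is a minimal data.table sketch (my own addition, not part of either answer). It assumes, as the question's rules state, that rows arrive one per second, so the time spent in a status is simply a row count:
library(data.table)
# group id: a new group starts after each 1 -> 0 transition
dd[, group := c(1, cumsum(diff(status) == -1) + 1)]
dd[, `:=`(cycle_time = .N,                # rows per cycle = seconds per cycle
          group_0    = sum(status == 0),  # seconds spent in status 0
          group_1    = sum(status == 1)), # seconds spent in status 1
   by = group]
dd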
Try this:
#dummy data
x <- c(0,0,0,1,1,0,0,1,0)
#get group id using rleid from data.table
gr <- data.table::rleid(x)
#merge separated 0,1 groups
gr <- ifelse(gr %% 2 == 0, gr - 1, gr)
#result
cbind(x, gr)
#      x gr
# [1,] 0  1
# [2,] 0  1
# [3,] 0  1
# [4,] 1  1
# [5,] 1  1
# [6,] 0  3
# [7,] 0  3
# [8,] 1  3
# [9,] 0  5
# if we need the group names to be sequential, then
cbind(x, gr = as.numeric(as.factor(gr)))
#      x gr
# [1,] 0  1
# [2,] 0  1
# [3,] 0  1
# [4,] 1  1
# [5,] 1  1
# [6,] 0  2
# [7,] 0  2
# [8,] 1  2
# [9,] 0  3
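As a side note (my own variant, not part of the original answer): because each cycle is guaranteed to start with status 0, the merge and relabel steps above can be collapsed into a single expression:
ceiling(data.table::rleid(x) / 2)
# [1] 1 1 1 1 1 2 2 2 3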

Apply confusionMatrix() to Elements of a Split List in R

I'm searching for a solution to apply the confusionMatrix() function from {caret} to specific elements of a split list. I have 3 groups, each with 10 observations of an Actual column and 3 Preds columns.
library(caret)
set.seed(10)
dat <- data.frame(Group = c(rep(1, 10), rep(2, 10), rep(3, 10)),
                  Actual = round(runif(30, 0, 1)),
                  Preds1 = round(runif(30, 0, 1)),
                  Preds2 = round(runif(30, 0, 1)),
                  Preds3 = round(runif(30, 0, 1)))
> dat
   Group Actual Preds1 Preds2 Preds3
1      1      1      1      0      0
2      1      0      0      0      1
3      1      0      0      0      1
4      1      1      1      0      1
...........
27     3      1      0      1      0
28     3      0      0      0      1
29     3      1      0      0      1
30     3      0      1      0      1
The final solution should create confusion matrices by Group, for each Preds column. I will need the actual confusion matrix tables, but will eventually need to extract the $overall and $byClass elements, ending up with something like below.
> conf_matrix
$Preds1
      Accuracy Sensitivity Specificity
 [1,]     0.73        0.8         0.6
 [2,]     0.93        0.91        1
 [3,]     0.87        0.83        1
 [4,]     0.8         0.82        0.75
...............
[27,]     0.8         0.82        0.75
[28,]     0.58        0.67        0.5
[29,]     1           0.67        1
[30,]     1           0           1
$Preds2
      Accuracy Sensitivity Specificity
 [1,]     0.73        0.8         0.6
 [2,]     0.93        0.91        1
 [3,]     0.87        0.83        1
 [4,]     0.8         0.82        0.75
...............
[27,]     0.8         0.82        0.75
[28,]     0.58        0.67        0.5
[29,]     1           0.67        1
[30,]     1           0           1
$Preds3
...............
I have tried the script below, but keep running into issues with the secondary indexing by the Preds column within each group. I believe it has something to do with my nested lapply calls and how I am indexing, since this works when I decompose the code and step through it one piece at a time.
I have also tried to do this manually using table(), but I abandoned that method because it does not give me results as consistent as confusionMatrix() does.
lapply(seq_along(split(dat[3:5], list(dat$Group))), function(x) {
  x_temp <- split(dat[3:5], list(dat$Group))[[x]]
  lapply(seq_along(x_temp), function(x2) {
    x_temp <- x_temp[[x2]]
    lapply(seq_along(split(dat[2], list(dat$Group))), function(y) {
      y_temp <- split(dat[2], list(dat$Group))[[y]]
      lapply(seq_along(y_temp), function(y2) {
        y_temp <- y_temp[[y2]]
        confusionMatrix(x_temp, y_temp)
      })
    })
  })
})
I may be way off base so I'm open to all suggestions and comments.
I don't understand the final outcome, but the confusion matrices themselves can be obtained as follows.
library(caret)
set.seed(10)
dat <- data.frame(Group = c(rep(1, 10), rep(2, 10), rep(3, 10)),
                  Actual = round(runif(30, 0, 1)),
                  Preds1 = round(runif(30, 0, 1)),
                  Preds2 = round(runif(30, 0, 1)),
                  Preds3 = round(runif(30, 0, 1)))
dat[] <- lapply(dat, as.factor)
# split by group
dats <- split(dat[, -1], dat$Group)
cm <- do.call(c, lapply(dats, function(x) {
  actual <- x[, 1]
  lapply(x[, 2:4], function(y) {
    confusionMatrix(actual, unlist(y))$table
  })
}))
cm[1:3]
$`1.Preds1`
          Reference
Prediction 0 1
         0 3 4
         1 0 3

$`1.Preds2`
          Reference
Prediction 0 1
         0 4 3
         1 3 0

$`1.Preds3`
          Reference
Prediction 0 1
         0 3 4
         1 1 2
@Brian: In the link (What's the difference between lapply and do.call in R?), I find Paul Hiemstra's answer quite straightforward:
"lapply is similar to map; do.call is not. lapply applies a function to all elements of a list, while do.call calls a function where all the function arguments are in a list. So for an n-element list, lapply makes n function calls, while do.call makes just one. So do.call is quite different from lapply."
In the example, dats has three elements: 1, 2 and 3.
dats <- split(dat[,-1], dat$Group)
dats[1]
$`1`
   Actual Preds1 Preds2 Preds3
1       1      1      0      0
2       0      0      0      1
3       0      0      0      1
4       1      1      0      1
5       0      0      1      0
6       0      1      1      1
7       0      1      1      0
8       0      1      0      1
9       1      1      0      1
10      0      1      0      0
Below is a double loop: the first loop runs over the groups 1, 2 and 3, and the second over Preds1, Preds2 and Preds3. Therefore lapply() alone produces a nested list, as shown below.
lapply(dats, function(x) {
  actual <- x[, 1]
  lapply(x[, 2:4], function(y) {
    confusionMatrix(actual, unlist(y))$table
  })
})[1]
$`1`
$`1`$Preds1
          Reference
Prediction 0 1
         0 3 4
         1 0 3

$`1`$Preds2
          Reference
Prediction 0 1
         0 4 3
         1 3 0

$`1`$Preds3
          Reference
Prediction 0 1
         0 3 4
         1 1 2
However, the above is not easy to use later, as another double loop is necessary to access each confusion matrix. It is simplified with do.call(). The first argument, c, is a function; the call effectively does c(dats$`1`$Preds1, dats$`1`$Preds2, ...), so the structure is flattened and each matrix becomes accessible in a single loop. Normally I tend to use do.call() when I need to change the structure of a list.
do.call(c, lapply(dats, function(x) {
  actual <- x[, 1]
  lapply(x[, 2:4], function(y) {
    confusionMatrix(actual, unlist(y))$table
  })
}))[1:3]
$`1.Preds1`
          Reference
Prediction 0 1
         0 3 4
         1 0 3

$`1.Preds2`
          Reference
Prediction 0 1
         0 4 3
         1 3 0

$`1.Preds3`
          Reference
Prediction 0 1
         0 3 4
         1 1 2
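Since the question eventually wants the $overall and $byClass elements too, here is a sketch of one way to pull those out (my own extension, not part of the answer above, and the row layout may differ from the asker's desired output): keep the full confusionMatrix objects instead of just $table, then extract the statistics.
# keep the full confusionMatrix objects this time
cm_full <- do.call(c, lapply(dats, function(x) {
  actual <- x[, 1]
  lapply(x[, 2:4], function(y) confusionMatrix(actual, unlist(y)))
}))
# one row of statistics per Group/Preds combination
stats <- t(sapply(cm_full, function(m) {
  c(Accuracy    = unname(m$overall["Accuracy"]),
    Sensitivity = unname(m$byClass["Sensitivity"]),
    Specificity = unname(m$byClass["Specificity"]))
}))
stats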

Rescore Items from Scoring Key

I have a set of data in which respondents were given a series of questions, each with five response options (e.g., 1:5). For those five options, I have a scoring key for each question, where some responses are worth full points (e.g., 2), others half points (1), and others no points (0). So the data frame is n (people) x k (questions), and the scoring key is a k (questions) x m (responses) matrix.
What I am trying to do is programmatically create a new dataset of the rescored items. A trivial dataset:
x <- sample(c(1:5), 50, replace = TRUE)
y <- sample(c(1:5), 50, replace = TRUE)
z <- sample(c(1:5), 50, replace = TRUE)
dat <- data.frame(cbind(x,y,z)) # 3 items, 50 observations (5 options per item)
head(dat)
  x y z
1 3 1 2
2 2 1 3
3 5 3 4
4 1 4 5
5 1 3 4
6 4 5 4
# Each option is scored 0, 1, or 2:
key <- matrix(sample(c(0,0,1,1,2), size = 15, replace = TRUE), ncol=5)
key
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    1    2
[2,]    2    1    1    1    2
[3,]    2    2    1    1    2
Some other options, firstly using Map:
data.frame(Map( function(x,y) key[y,x], dat, seq_along(dat) ))
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
Secondly using matrix indexing on key:
newdat <- dat
newdat[] <- key[cbind( as.vector(col(dat)), unlist(dat) )]
newdat
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
Things would be even simpler if you specified key as a list:
key <- list(x=c(0,0,0,1,2),y=c(2,1,1,1,2),z=c(2,2,1,1,2))
data.frame(Map("[",key,dat))
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
For posterity: I was discussing this issue with a friend, who suggested another approach. The benefit of this one is that it still uses mapvalues() to do the rescoring but does not require a for loop; instead, sapply does the indexing, with the key supplying "from" and "to".
library(plyr)
scored <- sapply(1:ncol(dat), function(x, dat, key) {
  mapvalues(dat[, x], from = 1:ncol(key), to = key[x, ])
}, dat = dat, key = key)
My current working approach uses 1) mapvalues(), which lives in package plyr, to do the heavy lifting: it takes a vector of data to modify and two additional parameters, "from" (the original values, here 1:5) and "to" (what we want to convert them to); and 2) a for loop with index notation, in which we cycle through the available questions, extract the vector pertaining to each using the current loop value, and use it to select the proper row from the scoring key.
library(plyr)
newdat <- matrix(data = NA, nrow = nrow(dat), ncol = ncol(dat))
for (i in 1:ncol(dat)) {   # one pass per question (column)
  newdat[, i] <- mapvalues(dat[, i], from = c(1, 2, 3, 4, 5),
                           to = c(key[i, 1], key[i, 2], key[i, 3],
                                  key[i, 4], key[i, 5]))
}
head(newdat)
     [,1] [,2] [,3]
[1,]    0    2    2
[2,]    0    2    1
[3,]    2    1    1
[4,]    0    1    2
[5,]    0    1    1
[6,]    1    2    1
I am pretty happy with this solution, but if anyone has any better approaches, I would love to see them!
