R Loop To New Data Frame Summary Weighted

I have a tall data frame as such:
data = data.frame("id" = c(1,2,3,4,5,6,7,8,9,10),
                  "group" = c(1,1,2,1,2,2,2,2,1,2),
                  "type" = c(1,1,2,3,2,2,3,3,3,1),
                  "score1" = sample(1:4, 10, replace = TRUE),
                  "score2" = sample(1:4, 10, replace = TRUE),
                  "score3" = sample(1:4, 10, replace = TRUE),
                  "score4" = sample(1:4, 10, replace = TRUE),
                  "score5" = sample(1:4, 10, replace = TRUE),
                  "weight1" = c(173,109,136,189,186,146,173,102,178,174),
                  "weight2" = c(147,187,125,126,120,165,142,129,144,197),
                  "weight3" = c(103,192,102,159,128,179,195,193,135,145),
                  "weight4" = c(114,182,199,101,111,116,198,123,119,181),
                  "weight5" = c(159,125,104,171,166,154,197,124,180,154))
library(reshape2)
library(plyr)
data1 <- reshape(data, direction = "long",
                 varying = list(paste0("score", 1:5), paste0("weight", 1:5)),
                 v.names = c("score", "weight"),
                 idvar = "id", timevar = "count", times = 1:5)
data1 <- data1[order(data1$id), ]
And what I want to create is a new data frame like so:
want = data.frame("score" = rep(1:4, 6),
                  "group" = rep(1:2, 12),
                  "type" = rep(1:3, 8),
                  "weightedCOUNT" = NA) # how to calculate this? count(data1, score, wt = weight)
I am just not sure how to calculate weightedCOUNT: it should apply the weights to the score variable, giving a weighted count aggregated by score, group, and type.

An option would be to melt (from data.table, which can take multiple measure patterns) and then, grouped by 'group' and 'type', get the weighted count:
library(data.table)
library(dplyr)
melt(setDT(data), measure = patterns("^score", "^weight"),
     value.name = c("score", "weight")) %>%
  group_by(group, type) %>%
  count(score, wt = weight)
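Here count() with wt = weight sums the weight column within each score (and group/type) combination instead of counting rows, which is exactly the weighted count requested.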
If we need to have a complete set of combinations of the grouping variables:
library(tidyr)
melt(setDT(data), measure = patterns("^score", "^weight"),
     value.name = c("score", "weight")) %>%
  group_by(group, type) %>%
  count(score, wt = weight) %>%
  ungroup() %>%
  complete(group, type, score, fill = list(n = 0))

If I understand correctly, weightedCOUNT is the sum of weights grouped by score, group, and type.
For the sake of completeness, I would like to show how the accepted solution would look when implemented in pure base R and pure data.table syntax, respectively.
Base R
The OP was almost there, having already reshaped the data from wide to long format with multiple value variables. Only the final aggregation step was missing:
data1 <- reshape(data, direction = "long",
                 varying = list(paste0("score", 1:5), paste0("weight", 1:5)),
                 v.names = c("score", "weight"),
                 idvar = "id", timevar = "count", times = 1:5)
result <- aggregate(weight ~ score + group + type, data1, FUN = sum)
result
score group type weight
1 1 1 1 479
2 3 1 1 558
3 4 1 1 454
4 1 2 1 378
5 2 2 1 154
6 3 2 1 174
7 4 2 1 145
8 1 2 2 535
9 2 2 2 855
10 3 2 2 248
11 4 2 2 499
12 1 1 3 189
13 2 1 3 351
14 3 1 3 600
15 4 1 3 362
16 1 2 3 596
17 2 2 3 265
18 3 2 3 193
19 4 2 3 522
The result can be reordered by
with(result, result[order(score, group, type), ])
score group type weight
1 1 1 1 479
12 1 1 3 189
4 1 2 1 378
8 1 2 2 535
16 1 2 3 596
13 2 1 3 351
5 2 2 1 154
9 2 2 2 855
17 2 2 3 265
2 3 1 1 558
14 3 1 3 600
6 3 2 1 174
10 3 2 2 248
18 3 2 3 193
3 4 1 1 454
15 4 1 3 362
7 4 2 1 145
11 4 2 2 499
19 4 2 3 522
data.table
As shown by akrun, melt() from the data.table package can be combined with dplyr. Alternatively, we can stay with the data.table syntax for aggregation:
library(data.table)
cols <- c("score", "weight") # to save typing
melt(setDT(data), measure = patterns(cols), value.name = cols)[
  , .(weightedCOUNT = sum(weight)), keyby = .(score, group, type)]
score group type weightedCOUNT
1: 1 1 1 479
2: 1 1 3 189
3: 1 2 1 378
4: 1 2 2 535
5: 1 2 3 596
6: 2 1 3 351
7: 2 2 1 154
8: 2 2 2 855
9: 2 2 3 265
10: 3 1 1 558
11: 3 1 3 600
12: 3 2 1 174
13: 3 2 2 248
14: 3 2 3 193
15: 4 1 1 454
16: 4 1 3 362
17: 4 2 1 145
18: 4 2 2 499
19: 4 2 3 522
The keyby parameter is used for grouping and ordering the output in one step.
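To see the difference, here is a minimal toy sketch (dt_demo is a made-up table, not part of the question):
dt_demo <- data.table(g = c(2, 1, 2, 1), x = 1:4)
dt_demo[, sum(x), by = g]    # groups appear in order of first appearance: 2, then 1
dt_demo[, sum(x), keyby = g] # result is sorted by g (and keyed): 1, then 2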
Completion of missing combinations of the grouping variables is also possible in data.table syntax using the cross join function CJ():
melt(setDT(data), measure = patterns(cols), value.name = cols)[
  , .(weightedCOUNT = sum(weight)), keyby = .(score, group, type)][
  CJ(score, group, type, unique = TRUE), on = .(score, group, type)][
  is.na(weightedCOUNT), weightedCOUNT := 0][]
score group type weightedCOUNT
1: 1 1 1 479
2: 1 1 2 0
3: 1 1 3 189
4: 1 2 1 378
5: 1 2 2 535
6: 1 2 3 596
7: 2 1 1 0
8: 2 1 2 0
9: 2 1 3 351
10: 2 2 1 154
11: 2 2 2 855
12: 2 2 3 265
13: 3 1 1 558
14: 3 1 2 0
15: 3 1 3 600
16: 3 2 1 174
17: 3 2 2 248
18: 3 2 3 193
19: 4 1 1 454
20: 4 1 2 0
21: 4 1 3 362
22: 4 2 1 145
23: 4 2 2 499
24: 4 2 3 522
score group type weightedCOUNT


R purrr row-wise lookups from two lists

Here’s a simplified version of a problem that involves larger, more complex inputs. First, I create data:
library(tibble)
input <- tibble(
  person = rep(101:103, each = 12),
  item = rep(1:12, 3),
  response = sample(1:4, 36, replace = TRUE)
)
These data are responses from three persons on a 12-item test. input is a multilevel table in which the test items are nested within each person. The columns of input are:
person: ID numbers for persons 101, 102, and 103 (12 rows for each person)
item: test items 1-12 for each person. Note how the items are nested within each person
response: score for each item
The test is divided into four subscales consisting of three items each.
scale_assign <- list(1:3, 4:6, 7:9, 10:12)
scale_num <- 1:4
scale_assign is a four-element list containing four item sets (expressed as four numeric ranges): items 1-3 (subscale 1), items 4-6 (subscale 2), items 7-9 (subscale 3), and items 10-12 (subscale 4). scale_num is a four-element numeric vector containing the numbers (1-4) that label the four subscales.
What I want R to do is process input row-wise, creating a new column scale, and filling it with the correct value of scale_num for each item (that is, each item's subscale assignment). In each row, R needs to check the value of item against the ranges in scale_assign and fill in scale with the value of scale_num that corresponds to the scale_assign range for that item.
The desired output looks like this:
# A tibble: 36 x 4
# person item response scale
# 1 101 1 4 1
# 2 101 2 2 1
# 3 101 3 4 1
# 4 101 4 4 2
# 5 101 5 4 2
# 6 101 6 4 2
# 7 101 7 3 3
# 8 101 8 2 3
# 9 101 9 4 3
# 10 101 10 1 4
# 11 101 11 1 4
# 12 101 12 4 4
# 13 102 1 1 1
# 14 102 2 3 1
# 15 102 3 1 1
# 16 102 4 1 2
# 17 102 5 3 2
# 18 102 6 3 2
# 19 102 7 4 3
# 20 102 8 1 3
# 21 102 9 3 3
# 22 102 10 4 4
# 23 102 11 3 4
# 24 102 12 3 4
# 25 103 1 4 1
# 26 103 2 1 1
# 27 103 3 2 1
# 28 103 4 2 2
# 29 103 5 4 2
# 30 103 6 1 2
# 31 103 7 4 3
# 32 103 8 4 3
# 33 103 9 1 3
# 34 103 10 4 4
# 35 103 11 1 4
# 36 103 12 2 4
Preferring a tidyverse solution, I thought this might be a job for purrr::map2(), because it seems to involve simultaneous iteration over a four-element list scale_assign and a four-element vector scale_num. I tried to implement the coding of scale within a map2() call, using mutate() and case_when() to do the coding, but could not get it to work.
Thanks in advance for any help!
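One note before the join-based answers: because the item ranges are fixed, a plain mutate() with case_when() (the tools mentioned above) is enough on its own, with no map2() iteration. A minimal sketch, assuming the input tibble above:
library(dplyr)
input %>%
  mutate(scale = case_when(
    item %in% 1:3   ~ 1,  # subscale 1
    item %in% 4:6   ~ 2,  # subscale 2
    item %in% 7:9   ~ 3,  # subscale 3
    item %in% 10:12 ~ 4   # subscale 4
  ))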
Instead of performing this operation row-wise and checking each value, it would be easier to perform a join: change scale_assign to a named list, convert it into a data frame, and do a right_join with the input data frame.
scale_assign <- list(1:3, 4:6, 7:9, 10:12)
names(scale_assign) <- 1:4
library(tidyverse)
enframe(scale_assign) %>%
  unnest(cols = value) %>%
  mutate_all(as.integer) %>%
  right_join(input, by = c("value" = "item"))
# A tibble: 36 x 4
# name value person response
# <int> <int> <int> <int>
# 1 1 1 101 4
# 2 1 2 101 4
# 3 1 3 101 2
# 4 2 4 101 2
# 5 2 5 101 1
# 6 2 6 101 4
# 7 3 7 101 3
# 8 3 8 101 1
# 9 3 9 101 1
#10 4 10 101 2
# … with 26 more rows
In base R, that can be done using stack and merge
merge(input, stack(scale_assign), all.x = TRUE, by.x = "item", by.y = "values")
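Here stack() turns the named list into a two-column data frame with columns values (the item numbers) and ind (the list names, i.e. the subscale labels), which merge() then matches against item.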
data
set.seed(1234)
input <- tibble(
  person = rep(101:103, each = 12),
  item = rep(1:12, 3),
  response = sample(1:4, 36, replace = TRUE))
Here is a data.table solution, using an update-join.
Basically this is @Ronak Shah's base R answer, but using the data.table package (i.e., fast performance on large datasets).
library(data.table)
# 1. convert input to a data.table
# 2. create a lookup table using stack(scale_assign),
#    and make that also a data.table (using setDT())
# 3. left update join on item
setDT(input)[setDT(stack(scale_assign)),
             scale := i.ind,
             on = .(item = values)][]
output
# person item response scale
# 1: 101 1 3 1
# 2: 101 2 4 1
# 3: 101 3 3 1
# 4: 101 4 2 2
# 5: 101 5 3 2
# 6: 101 6 4 2
# 7: 101 7 1 3
# 8: 101 8 3 3
# 9: 101 9 4 3
# 10: 101 10 2 4
# 11: 101 11 3 4
# 12: 101 12 4 4
# 13: 102 1 4 1
# 14: 102 2 2 1
# 15: 102 3 3 1
# 16: 102 4 2 2
# 17: 102 5 1 2
# 18: 102 6 4 2
# 19: 102 7 1 3
# 20: 102 8 3 3
# 21: 102 9 2 3
# 22: 102 10 1 4
# 23: 102 11 4 4
# 24: 102 12 3 4
# 25: 103 1 1 1
# 26: 103 2 1 1
# 27: 103 3 2 1
# 28: 103 4 1 2
# 29: 103 5 2 2
# 30: 103 6 4 2
# 31: 103 7 4 3
# 32: 103 8 2 3
# 33: 103 9 3 3
# 34: 103 10 2 4
# 35: 103 11 2 4
# 36: 103 12 2 4
# person item response scale
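Note that setDT() converts input to a data.table by reference and := adds the scale column in place, so input itself is modified without any explicit assignment.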

How to rank a column with a condition

I have a data frame:
dt <- read.table(text = "
1 390
1 366
1 276
1 112
2 97
2 198
2 400
2 402
3 110
3 625
4 137
4 49
4 9
4 578 ")
The first column is an index and the second is a distance. I want to add a column that ranks the distances within each index in descending order (the highest distance ranked first).
The result will be:
dt <- read.table(text = "
1 390 1
1 366 2
1 276 3
1 112 4
2 97 4
2 198 3
2 400 2
2 402 1
3 110 2
3 625 1
4 137 2
4 49 3
4 9 4
4 578 1")
A base R approach:
> dt$Rank <- unlist(tapply(-dt$V2, dt$V1, rank))
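Note this relies on dt already being sorted by V1: tapply() returns the groups in factor-level order, and unlist() simply concatenates them, so the ranks are assigned back positionally.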
A tidyverse solution
dt %>%
  group_by(V1) %>%
  mutate(Rank = rank(-V2))
Or, in base R with ave():
transform(dt, s = ave(-V2, V1, FUN = rank))
V1 V2 s
1 1 390 1
2 1 366 2
3 1 276 3
4 1 112 4
5 2 97 4
6 2 198 3
7 2 400 2
8 2 402 1
9 3 110 2
10 3 625 1
11 4 137 2
12 4 49 3
13 4 9 4
14 4 578 1
You could group, arrange, and use row_number(). The result is a bit easier on the eyes than a simple rank, I think, and so worth an extra step.
dt %>%
  group_by(V1) %>%
  arrange(V1, desc(V2)) %>%
  mutate(rank = row_number())
# A tibble: 14 x 3
# Groups: V1 [4]
V1 V2 rank
<int> <int> <int>
1 1 390 1
2 1 366 2
3 1 276 3
4 1 112 4
5 2 402 1
6 2 400 2
7 2 198 3
8 2 97 4
9 3 625 1
10 3 110 2
11 4 578 1
12 4 137 2
13 4 49 3
14 4 9 4
An alternative that keeps the rows in their original (unsorted) order is min_rank:
dt %>%
  group_by(V1) %>%
  mutate(rank = min_rank(desc(V2)))
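A side note on ties, which this data happens not to contain: base rank() averages tied ranks by default, while min_rank() assigns the minimum rank, as in SQL's RANK(). A quick sketch with made-up values:
x <- c(10, 20, 20, 30)
rank(-x)                        # 4.0 2.5 2.5 1.0 (ties averaged)
dplyr::min_rank(dplyr::desc(x)) # 4 2 2 1 (ties get the minimum rank)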

Conditional Unique Counting in R data.table

I would like to count the number of conflicts in my dataset by group. I feel like there has to be an easy way to do this in data.table, but can't seem to figure it out. I've created a dummy variable to tell me if there is a conflict for each row of the data.table:
testDT <- data.table(Name = c(rep('A', 6), rep('B', 5)),
                     Division = c(rep(11, 6), rep(12, 5)),
                     ID = c(205, 205, NA, 201, 201, 201, 203, 203, 203, 204, NA),
                     Conflict = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0))
I need to count the unique number of non-NA IDs that have a conflict flag of 1 and apply that count in a new column to each Name-Division grouping. This is what the answer should be:
testDT[, Count := c(rep(1,6),rep(2,5))]
Name Division ID Conflict Count
1: A 11 205 0 1
2: A 11 205 0 1
3: A 11 NA 0 1
4: A 11 201 1 1
5: A 11 201 1 1
6: A 11 201 1 1
7: B 12 203 1 2
8: B 12 203 1 2
9: B 12 203 1 2
10: B 12 204 1 2
11: B 12 NA 0 2
I've been thinking about some usage of sum(!is.na(unique(ID))), but I'm not sure how to conditionally count the unique values without creating criteria in the i section of the data.table (Conflict == 1).
You can subset the ID variable by conditions within the data.table [] and then count the unique values:
library(data.table)
testDT[, Count := uniqueN(ID[!is.na(ID) & Conflict == 1]), by=.(Name, Division)]
testDT
# Name Division ID Conflict Count
# 1: A 11 205 0 1
# 2: A 11 205 0 1
# 3: A 11 NA 0 1
# 4: A 11 201 1 1
# 5: A 11 201 1 1
# 6: A 11 201 1 1
# 7: B 12 203 1 2
# 8: B 12 203 1 2
# 9: B 12 203 1 2
# 10: B 12 204 1 2
# 11: B 12 NA 0 2
Or following your logic:
testDT[, Count := sum(!is.na(unique(ID[Conflict == 1]))), by=.(Name, Division)]
Here is an option with dplyr
library(dplyr)
testDT %>%
  group_by(Name, Division) %>%
  mutate(Count = n_distinct(ID[!is.na(ID) & Conflict == 1]))
# Name Division ID Conflict Count
# <chr> <dbl> <dbl> <dbl> <int>
#1 A 11 205 0 1
#2 A 11 205 0 1
#3 A 11 NA 0 1
#4 A 11 201 1 1
#5 A 11 201 1 1
#6 A 11 201 1 1
#7 B 12 203 1 2
#8 B 12 203 1 2
#9 B 12 203 1 2
#10 B 12 204 1 2
#11 B 12 NA 0 2
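For completeness, the same logic in base R with ave() might look like this sketch (not from the original thread): keep only the IDs with a conflict, then count distinct non-NA values per group.
testDT$Count <- ave(ifelse(testDT$Conflict == 1, testDT$ID, NA),
                    testDT$Name, testDT$Division,
                    FUN = function(x) length(unique(x[!is.na(x)])))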

R: column binding with unequal number of rows

I have two data sets. Each of them has the variables ID, Block, and RT (reaction time). I want to merge/column bind the two sets so that I have one data set with the variables ID, Block, RT1, RT2. The problem is that there is an unequal number of rows in the two sets. Also, it is important that the ID and block numbers match. Missing values should be replaced with NA. So what I have:
head(blok1, 10)
ID Blok RT1
1 1 1 592
2 1 1 468
3 1 1 530
4 1 1 546
5 1 1 452
6 1 1 483
7 1 2 499
8 1 2 452
9 1 2 608
10 1 2 530
head(blok2, 10)
ID Blok RT2
1 1 1 592
2 1 1 920
3 1 1 686
4 1 1 561
5 1 1 561
6 1 2 327
7 1 2 686
8 1 2 670
9 1 2 702
10 1 3 920
What I want to have:
ID Blok RT1 RT2
1 1 1 592 592
2 1 1 468 920
3 1 1 530 686
4 1 1 546 561
5 1 1 452 561
6 1 1 483 NA
7 1 2 499 327
8 1 2 452 686
9 1 2 608 670
10 1 2 530 702
etc.
Here's a solution using dplyr, also utilizing an index or unique ID:
blok1 <- data.frame(ID = c(1, 1, 2), RT1 = c(11, 12, 13))
blok2 <- data.frame(ID = c(1, 2, 2), RT2 = c(21, 22, 23))
library(dplyr)
## if you want NAs for RT2 only
blok1 %>%
  mutate(uID = row_number()) %>%
  left_join(blok2 %>% mutate(uID = row_number()), by = c("uID", "ID"))
# uID ID RT1 RT2
# 1 1 1 11 21
# 2 2 1 12 NA
# 3 3 2 13 23
## if you want NAs for both RT1 and RT2
blok1 %>%
  mutate(uID = row_number()) %>%
  full_join(blok2 %>% mutate(uID = row_number()), by = c("uID", "ID"))
# uID ID RT1 RT2
# 1 1 1 11 21
# 2 2 1 12 NA
# 3 3 2 13 23
# 4 2 2 NA 22
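To apply the same idea to the original blok1/blok2, which also carry the Blok column, the unique ID would be a row number within each ID-Blok group; a sketch under that assumption:
blok1 %>%
  group_by(ID, Blok) %>%
  mutate(uID = row_number()) %>%   # position within each ID-Blok pair
  ungroup() %>%
  full_join(blok2 %>%
              group_by(ID, Blok) %>%
              mutate(uID = row_number()) %>%
              ungroup(),
            by = c("ID", "Blok", "uID"))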

R: sum at different levels [duplicate]

This question already has answers here:
generating sums of data according to values of a variable
(3 answers)
Closed 9 years ago.
I have a dataset X as:
customer_id event_type tot_count
931 1 5
231 2 6
231 1 3
333 3 9
444 1 1
931 3 3
333 1 21
444 2 43
I need the sum of tot_count at the customer_id and event_type level.
This is a one-line query in SQL:
select customer_id, event_type, sum(tot_count) from X group by 1,2
I need the same operation in R.
You can use the aggregate function:
aggregate(tot_count ~ customer_id + event_type, X, sum)
customer_id event_type tot_count
1 231 1 3
2 333 1 21
3 444 1 1
4 931 1 5
5 231 2 6
6 444 2 43
7 333 3 9
8 931 3 3
For fun, here are a few more options:
Since you know SQL, sqldf
> sqldf("select customer_id, event_type, sum(tot_count) from mydf group by 1,2")
customer_id event_type sum(tot_count)
1 231 1 3
2 231 2 6
3 333 1 21
4 333 3 9
5 444 1 1
6 444 2 43
7 931 1 5
8 931 3 3
If you have a lot of data, data.table
> library(data.table)
> DT <- data.table(mydf, key = c("customer_id", "event_type"))
> DT[, sum(tot_count), by = key(DT)]
customer_id event_type V1
1: 231 1 3
2: 231 2 6
3: 333 1 21
4: 333 3 9
5: 444 1 1
6: 444 2 43
7: 931 1 5
8: 931 3 3
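And, as one more option, a dplyr sketch of the same aggregation (assuming the data is in mydf as above; .groups = "drop" needs dplyr >= 1.0):
library(dplyr)
mydf %>%
  group_by(customer_id, event_type) %>%
  summarise(tot_count = sum(tot_count), .groups = "drop")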
