Create a variable conditionally by group in R (write function)

I want to create a group-level variable conditioned on an existing individual-level variable. Each individual has an outlier variable taking the values 1, 2, or 3. I want to create a new variable by group so that the new variable = 2 whenever there is at least one individual in that group whose outlier variable = 2, and the new variable = 3 whenever there is at least one individual in that group whose outlier variable = 3.
The data looks like this
grpid id outlier
    1  1       1
    1  2       1
    1  3       2
    2  4       1
    2  5       3
    2  6       1
    3  7       1
    3  8       1
    3  9       1
Ideal output like this
grpid id outlier goutlier
    1  1       1        2
    1  2       1        2
    1  3       2        2
    2  4       1        3
    2  5       3        3
    2  6       1        3
    3  7       1        1
    3  8       1        1
    3  9       1        1
Any suggestions?
Thanks!

It is easy with dplyr: since the outlier codes are ordered (1 < 2 < 3), the group-level value is simply the group maximum of outlier.
library(dplyr)
df <- read.table(header = TRUE, sep = ",",
                 text = "grpid,id,outlier
1,1,1
1,2,1
1,3,2
2,4,1
2,5,3
2,6,1
3,7,1
3,8,1
3,9,1")
df %>% group_by(grpid) %>% mutate(goutlier = max(outlier))
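For reference, the same result is available in base R; a minimal sketch using ave(), assuming outlier stays numeric so that max() respects the 1 < 2 < 3 ordering:
# Group-wise maximum without dplyr; ave() returns a vector aligned
# with the original rows, so it can be assigned directly.
df$goutlier <- ave(df$outlier, df$grpid, FUN = max)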

Count the amount of times value A occurs without value B and vice versa

I'm having trouble figuring out how to do the opposite of the answer to this question (and in R, not Python).
Count the amount of times value A occurs with value B
Basically I have a dataframe with a lot of combinations of pairs of columns like so:
df <- data.frame(id1 = c("1","1","1","1","2","2","2","3","3","4","4"),
                 id2 = c("2","2","3","4","1","3","4","1","4","2","1"))
I want to count how often each value in column A occurs in the whole dataframe without the value from column B. So the results for this small example would be the output of:
df_result <- data.frame(id1 = c("1","1","1","2","2","2","3","3","4","4"),
                        id2 = c("2","3","4","1","3","4","1","4","2","1"),
                        count = c("4","5","5","3","5","4","2","3","3","3"))
The important criterion is that the final results dataframe is collapsed by pairs (so in my example rows 1 and 2 are duplicates; they are collapsed, and the count is the total frequency with which 1 is observed without 2). For tallying the occurrences, it's important that both columns are examined, i.e. the order of the columns doesn't matter: if column A has 1 and B has 2, this counts the same as if column A has 2 and B has 1.
I can do this very slowly by filtering for each pair, but it's not feasible for my real data, where I have many different pairs.
Any guidance is greatly appreciated.
First paste the two id columns together into id12 for later matching. Then use sapply to go through all rows and find the records where id1 appears in id12 but id2 does not. sum that logical vector, keep only the distinct records, and finally remove the id12 column.
library(dplyr)
df %>%
  mutate(id12 = paste0(id1, id2),
         count = sapply(1:nrow(.),
                        function(x)
                          sum(grepl(id1[x], id12) & !grepl(id2[x], id12)))) %>%
  distinct() %>%
  select(-id12)
Or in base R completely:
id12 <- paste0(df$id1, df$id2)
df$count <- sapply(1:nrow(df), function(x)
  sum(grepl(df$id1[x], id12) & !grepl(df$id2[x], id12)))
df <- df[!duplicated(df), ]
Output
id1 id2 count
1 1 2 4
2 1 3 5
3 1 4 5
4 2 1 3
5 2 3 5
6 2 4 4
7 3 1 2
8 3 4 3
9 4 2 3
10 4 1 3
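One caveat: grepl does substring matching on the pasted ids, which works here because every id is a single character; with multi-character ids, a value like "1" would falsely match inside "11". A sketch of an exact-match variant of the same logic, starting again from the original 11-row df:
# For each row x, count the rows whose id pair contains id1[x] but
# not id2[x], comparing whole values rather than substrings.
pairs <- Map(c, df$id1, df$id2)
counts <- sapply(seq_len(nrow(df)), function(x)
  sum(sapply(pairs, function(p)
    df$id1[x] %in% p && !(df$id2[x] %in% p))))
cbind(df, count = counts)[!duplicated(df), ]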
A full tidyverse version:
library(tidyverse)
df %>%
  mutate(id = paste(id1, id2),
         count = map_int(cur_group_rows(),
                         ~ sum(str_detect(id, id1[.x]) &
                                 str_detect(id, id2[.x], negate = TRUE))))
A more efficient approach would be to work on a tabulation format:
tab = crossprod(table(rep(seq_len(nrow(df)), ncol(df)), c(df$id1, df$id2)))
#tab
#
# 1 2 3 4
# 1 7 3 2 2
# 2 3 6 1 2
# 3 2 1 4 1
# 4 2 2 1 5
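To unpack that one-liner (a restatement of the step above, not new logic): the inner table() builds a row-by-id incidence matrix, and crossprod() then counts, for every pair of ids, the rows in which both appear.
# Incidence matrix: entry (r, v) is how often id v appears in row r of df.
inc <- table(rep(seq_len(nrow(df)), ncol(df)), c(df$id1, df$id2))
# t(inc) %*% inc: co-occurrence counts; the diagonal is each id's total.
tab <- crossprod(inc)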
So now we have the number of times each value appears with another (irrespective of their order in the two columns); the diagonal holds each id's total number of appearances. From here, we need a way to subset the above table for each pair and subtract the pair's co-occurrence count from the first id's total number of appearances.
Make a grid of all combinations:
gr = expand.grid(id1 = colnames(tab), id2 = rownames(tab), stringsAsFactors = FALSE)
Create 2-column matrices to subset the table:
id1.ij = cbind(match(gr$id1, colnames(tab)),
               match(gr$id1, rownames(tab)))
id2.ij = cbind(match(gr$id1, colnames(tab)),
               match(gr$id2, rownames(tab)))
Subtract the respective values:
cbind(gr, count = tab[id1.ij] - tab[id2.ij])
# id1 id2 count
#1 1 1 0
#2 2 1 3
#3 3 1 2
#4 4 1 3
#5 1 2 4
#6 2 2 0
#7 3 2 3
#8 4 2 3
#9 1 3 5
#10 2 3 5
#11 3 3 0
#12 4 3 4
#13 1 4 5
#14 2 4 4
#15 3 4 3
#16 4 4 0
Of course, if we do not need the full grid of values, we can set:
gr = unique(df)
which results in:
# id1 id2 count
#1 1 2 4
#3 1 3 5
#4 1 4 5
#5 2 1 3
#6 2 3 5
#7 2 4 4
#8 3 1 2
#9 3 4 3
#10 4 2 3
#11 4 1 3

Collapsing multiple columns with repeating values

I am currently working with a previously collected dataframe. Participant race is currently split among several categories (Race_White, Race_Black, etc.) where each participant has a value of 1 for Yes or 2 for No. For example, a White participant who does not identify with any other race would have a 1 in the Race_White column and 2's in all other Race_X columns.
I would like to merge these into one "Race" column, where 1 = White, 2 = Black, etc. Does anyone know of a nice piece of code/function/package to do this efficiently?
This is what I have been trying:
Race <- mutate(mydata,
               Race = case_when(
                 Race_White == 1 & Race_Black == 2 & Race_Asian == 2 & Race_NoReply == 2 ~ 1,
                 Race_White == 2 & Race_Black == 1 & Race_Asian == 2 & Race_NoReply == 2 ~ 2,
                 Race_White == 2 & Race_Black == 2 & Race_Asian == 1 & Race_NoReply == 2 ~ 3,
                 Race_White == 2 & Race_Black == 2 & Race_Asian == 2 & Race_NoReply == 1 ~ 4,
                 TRUE ~ NA_real_))
I would use pivot_longer and str_remove like this:
tib <- tibble::tibble( # example data
  individual = 1:10,
  race_white = sample(c(0, 1), 10, replace = TRUE),
  race_black = 1 - race_white
)
tib %>%
  tidyr::pivot_longer(dplyr::contains('race')) %>%
  dplyr::filter(value == 1) %>%
  dplyr::mutate(
    name = stringr::str_remove(name, 'race_')
  ) %>%
  dplyr::select(-value, race = name)
If you want them integer coded you could use case_when on the character column; see the sketch after the output below. But it is hard to know exactly what you want without example data.
Here is the output:
# A tibble: 10 x 2
individual race
<int> <chr>
1 1 white
2 2 white
3 3 white
4 4 white
5 5 white
6 6 white
7 7 black
8 8 white
9 9 white
10 10 black
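As for the integer coding mentioned above, a minimal sketch, assuming the result of the pipe is stored as tib_long and assuming a white = 1, black = 2 mapping (neither the name nor the mapping comes from the question):
# Hypothetical: tib_long holds the result of the pipe above.
tib_long %>%
  dplyr::mutate(race = dplyr::case_when(
    race == "white" ~ 1L,
    race == "black" ~ 2L,
    TRUE ~ NA_integer_
  ))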
Edit:
I used 0 = No, and 1 = Yes. But that does not change anything. I also added package notation to all functions.
You could do:
names(df)[max.col(df==1)]
[1] "Race_yellow" "Race_red" "Race_green" "Race_red" "Race_green" "Race_yellow"
[7] "Race_red" "Race_purple" "Race_purple" "Race_yellow" "Race_yellow" "Race_blue"
[13] "Race_purple" "Race_red" "Race_purple"
The data:
df <- read.table(text =
"Race_yellow Race_green Race_purple Race_blue Race_red
1 1 2 2 2 2
2 2 2 2 2 1
3 2 1 2 2 2
4 2 2 2 2 1
5 2 1 2 2 2
6 1 2 2 2 2
7 2 2 2 2 1
8 2 2 1 2 2
9 2 2 1 2 2
10 1 2 2 2 2
11 1 2 2 2 2
12 2 2 2 1 2
13 2 2 1 2 2
14 2 2 2 2 1
15 2 2 1 2 2")
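A small extension of this answer (my addition, not part of it): attach the winning category as a column and strip the Race_ prefix in one step.
# names(df)[max.col(df == 1)] picks the column holding the 1 per row;
# sub() then drops the "Race_" prefix.
df$Race <- sub("Race_", "", names(df)[max.col(df == 1)])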

R, dplyr: Is there a way to add order of groups when there are multiple rows per group without creating a new data frame? [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 2 years ago.
I have data from an experiment with multiple rows per item (each row has the reading time for one word of a sentence of n words) and multiple items per subject. Items can span varying numbers of rows. Items were presented in a random order, and their order in the data as initially read in reflects the sequence in which each subject saw them. What I'd like to do is add a column that contains the order in which the subject saw that item (i.e., 1 for the first item, 2 for the second, etc.).
Here's an example of some input data that has the relevant properties:
d <- data.frame(Subject = c(1,1,1,1,1,2,2,2,2,2),
                Item    = c(2,2,2,1,1,1,1,2,2,2))
Subject Item
1 2
1 2
1 2
1 1
1 1
2 1
2 1
2 2
2 2
2 2
And here's the output I want:
Subject Item order
1 2 1
1 2 1
1 2 1
1 1 2
1 1 2
2 1 1
2 1 1
2 2 2
2 2 2
2 2 2
I know I can do this by setting up a temp data frame that filters d to unique combinations of Subject and Item, adding order to that as something like 1:n() or row_number(), and then joining it back onto the main data frame. What I'd like to know is whether there's a way to do this without having to create a new data frame just to store the order. Can this be done inside dplyr's mutate somehow if I group by Subject and Item, for instance?
Here's one way:
d %>%
group_by(Subject) %>%
mutate(order = match(Item, unique(Item))) %>%
ungroup()
# # A tibble: 10 x 3
# Subject Item order
# <dbl> <dbl> <int>
# 1 1 2 1
# 2 1 2 1
# 3 1 2 1
# 4 1 1 2
# 5 1 1 2
# 6 2 1 1
# 7 2 1 1
# 8 2 2 2
# 9 2 2 2
# 10 2 2 2
Here is a base R option
transform(d,
          order = ave(Item, Subject,
                      FUN = function(x) as.integer(factor(x, levels = unique(x)))))
or
transform(d,
          order = ave(Item, Subject,
                      FUN = function(x) match(x, unique(x))))
both giving
Subject Item order
1 1 2 1
2 1 2 1
3 1 2 1
4 1 1 2
5 1 1 2
6 2 1 1
7 2 1 1
8 2 2 2
9 2 2 2
10 2 2 2
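If data.table is an option, rleid() gives the same result; a sketch assuming, as in the example data, that each item's rows are contiguous within a subject:
library(data.table)
# rleid() numbers runs of identical Item values; grouping by Subject
# restarts the numbering for each subject.
setDT(d)[, order := rleid(Item), by = Subject]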

Calculate rowwise maximum from columns that have changing names

I have the following objects:
s1 = "1_1_1_1_1"
s2 = "2_1_1_1_1"
s3 = "3_1_1_1_1"
Please note that the value of s1, s2, s3 can change in another example.
I then have the following data frame:
set.seed(666)
df = data.frame(draw = c(1,2,3,4,1,2,3,4,1,2,3,4),
                resp = c(1,1,1,1,2,2,2,2,3,3,3,3),
                "1_1_1_1_1" = runif(12),
                "2_1_1_1_1" = runif(12),
                "3_1_1_1_1" = runif(12))
Please note that the column names of my data frame will change based on the values of s1, s2, s3.
I now want to achieve the following:
I want to find out which of the last three columns in df has the highest value per row and store that as a new column (values should be 1, 2, or 3, depending on whether the highest value is in the first, second, or third of these columns).
Now that I know which value is the highest per row, I want to group/summarize the result by the column resp and count how often my max value is 1, 2 or 3.
So the outcome from 1. should be:
draw resp 1_1_1_1_1 2_1_1_1_1 3_1_1_1_1 max
1 1 0.774 0.095 0.806 3
2 1 0.197 0.142 0.266 3
...
And the outcome from 2. is supposed to be:
resp first_max second_max third_max
1 1 1 2
2 2 1 1
3 1 2 1
My problem is that tidyverse's rowwise function is deprecated and that I don't know how to dynamically address columns in a tidyverse pipe by column names that are stored externally (here in s1, s2, s3). One last note: I might be overcomplicating things by trying to go by the column names when, in fact, the columns I'm interested in are always at positions 3:5.
Here is one way to get what you want. For a slightly different format, you can use count rather than table, but this matches your expected output. Hope this helps!!
library(dplyr)
df %>%
  mutate(max_val = max.col(select(., starts_with("X")))) %>%
  select(resp, max_val) %>%
  table()
max_val
resp 1 2 3
1 1 1 2
2 2 1 1
3 1 2 1
Or, you could do this:
library(tidyr)
df %>%
  mutate(max_val = max.col(.[3:5])) %>%
  count(resp, max_val) %>%
  mutate(max_val = paste0("max_", max_val)) %>%
  spread(value = n, key = max_val)
resp max_1 max_2 max_3
<dbl> <int> <int> <int>
1 1 1 1 2
2 2 2 1 1
3 3 1 2 1
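To address the dynamic-name worry directly, here is a sketch using the externally stored names; the cols variable is mine. Note that data.frame() with the default check.names = TRUE ran the names through make.names(), which prepends an X to names starting with a digit:
# Resolve the stored names (s1, s2, s3) to the actual column names,
# then take the row-wise position of the maximum among those columns.
cols <- make.names(c(s1, s2, s3))
df$max <- max.col(df[cols])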
Calculate the max using pmap (row-wise iteration):
library(purrr)
library(tibble)
# unname() drops the column names so the first two columns map to
# x (draw) and y (resp) and the value columns are collected in ...
max_cols <- pmap_dbl(unname(df), function(x, y, ...) {
  vals <- unlist(list(...))
  which(vals == max(vals))
})
result <- df %>% add_column(max = max_cols)
> result
draw resp X1_1_1_1_1 X2_1_1_1_1 X3_1_1_1_1 max
1 1 1 0.4551478 0.70061232 0.618439890 2
2 2 1 0.3667764 0.26670969 0.024742605 1
3 3 1 0.6806912 0.03233215 0.004014758 1
4 4 1 0.9117449 0.42926492 0.885247456 1
5 1 2 0.1886954 0.34189707 0.985054492 3
6 2 2 0.5569398 0.78043504 0.100714130 2
7 3 2 0.9791164 0.92823982 0.676584495 1
8 4 2 0.9174654 0.74627116 0.485582287 1
9 1 3 0.3681890 0.69622331 0.672346875 2
10 2 3 0.5510356 0.99651637 0.482430518 2
11 3 3 0.4283281 0.12832611 0.018095649 1
12 4 3 0.6168436 0.64381995 0.655178701 3
Reshape the data frame.
reshape2::dcast(result, resp ~ max, fun.aggregate = length, value.var = "max")
resp 1 2 3
1 1 1 1 2
2 2 2 1 1
3 3 1 2 1
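For reference, a base R equivalent of the reshape step:
# Cross-tabulate resp by the row-wise max position.
table(result$resp, result$max)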

Conditionally dropping duplicates from a data.frame

I am trying to figure out how to subset my dataset according to repeated values of the variable s, taking into account also the id associated with each row.
Suppose my dataset is:
dat <- read.table(text = "
id s
1 2
1 2
1 1
1 3
1 3
1 3
2 3
2 3
3 2
3 2",
header=TRUE)
What I would like to do is, for each id, to keep only the first row for which s = 3. The result with dat would be:
id s
1 2
1 2
1 1
1 3
2 3
3 2
3 2
I have tried using both duplicated() and which() so that I could then call subset(), but I am not getting anywhere. The main problem is that it is not sufficient to isolate the first row of each s = 3 "block", because in some cases (as here between id = 1 and id = 2) the 3's overlap between one id and the next. Which strategy would you adopt?
Like this:
subset(dat, s != 3 | s == 3 & !duplicated(dat))
# id s
# 1 1 2
# 2 1 2
# 3 1 1
# 4 1 3
# 7 2 3
# 9 3 2
# 10 3 2
Note that & has higher precedence than |, so the condition reads s != 3 | (s == 3 & !duplicated(dat)). Also note that subset can be dangerous to work with (see Why is `[` better than `subset`?), so the longer but safer version would be:
dat[dat$s != 3 | dat$s == 3 & !duplicated(dat), ]
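For completeness, a grouped dplyr sketch of the same idea (an alternative phrasing, not the answer's method): within each id, keep every non-3 row plus only the first s == 3 row.
library(dplyr)
dat %>%
  group_by(id) %>%
  # !duplicated(s) is TRUE for the first occurrence of each s per id
  filter(s != 3 | !duplicated(s)) %>%
  ungroup()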
