In this dataset DF, we have 4 names and 4 professions.
DF<-tribble(
~names, ~princess, ~singer, ~astronaut, ~painter,
"diana", 4, 1, 2, 3,
"shakira", 2, 1, 3, 4,
"armstrong", 3, 4, 1, 2,
"picasso", 1, 3, 1, 4
)
Assume that the cell values are some measure of their their profession. So, for instance, Diana has highest cell value for princess (correctly) but Shakira has highest cell value for painter (incorrectly).
I want to create two columns called "Compatible" and "Incompatible" where the program will pick value of 4 for Diana as it is under the correct profession Princess and assign it to column "Compatible" and in the "Incompatible" put an average of the other 3 values. For Shakira, it will pick the value 1 from the correct profession of singer, and assign it to Compatible; for Incompatible it average the other values. Similarly for other names
So the output will be like this:
DF1<-tribble(
~names, ~princess, ~singer, ~astronaut, ~painter,~Compatible,~Incompatible,
"diana", 4, 1, 2, 3, 4, 2,
"shakira", 2, 1, 3, 4, 1, 3,
"armstrong", 3, 4, 1, 2, 1, 3,
"picasso", 1, 3, 1, 4, 4, 1.66
)
Here is the dataset which shows the correct names and professions:
DF3<- tribble(
~names, ~professions,
"diana", "princess",
"shakira", "singer",
"armstrong", "astronaut",
"picasso", "painter"
)
DF1[1:5] %>%
pivot_longer(-names) %>%
left_join(DF3, 'names') %>%
group_by(names, name = if_else(name == professions, 'compatible', 'incompatible')) %>%
summarise(profession = first(professions), value = mean(value), .groups = 'drop') %>%
pivot_wider()
# A tibble: 4 x 4
names profession compatible incompatible
<chr> <chr> <dbl> <dbl>
1 armstrong astronaut 1 3
2 diana princess 4 2
3 picasso painter 4 1.67
4 shakira singer 1 3
Related
HAVE = data.frame(INSTRUCTOR = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
STUDENT = c(1, 2, 2, 2, 1, 3, 1, 1, 1, 1, 2, 1),
SCORE = c(10, 1, 0, 0, 7, 3, 5, 2, 2, 4, 10, 2),
TIME = c(1,1,2,3,2,1,1,2,3,1,1,2))
WANT = data.frame(INSTRUCTOR = c(1, 2, 3),
SCORE.DIF = c(-9, NA, 6))
For each INSTRUCTOR, I wish to find the SCORE of the first and second STUDENT, and subtract their scores. The STUDENT code varies so I wish not to use '==1' vs '==2'
I try:
HAVE[, .SD[1:2], by = 'INSTRUCTOR']
but do not know how to subtract vertically and obtain 'WANT' data frame from 'HAVE'
library(data.table)
setDT(HAVE)
unique(HAVE, by = c("INSTRUCTOR", "STUDENT")
)[, .(SCORE.DIF = diff(SCORE[1:2])), by = INSTRUCTOR]
# INSTRUCTOR SCORE.DIF
# <num> <num>
# 1: 1 -9
# 2: 2 NA
# 3: 3 6
To use your new TIME variable, we can do
HAVE[, .SD[which.min(TIME),], by = .(INSTRUCTOR, STUDENT)
][, .(SCORE.DIF = diff(SCORE[1:2])), by = INSTRUCTOR]
# INSTRUCTOR SCORE.DIF
# <num> <num>
# 1: 1 -9
# 2: 2 NA
# 3: 3 6
One might be tempted to replace SCORE[1:2] with head(SCORE,2), but that won't work: head(SCORE,2) will return length-1 if the input is length-2, as it is with instructor 2 (who only has one student albeit multiple times). When you run diff on length-1 (e.g., diff(1)), it returns a 0-length vector, which in the above data.table code reduces to zero rows for instructor 2. However, when there is only one student, SCORE[1:2] resolves to c(SCORE[1], NA), for which the diff is length-1 (as needed) and NA (as needed).
I have data table whose look like this
I want this output
I would use:
library(dplyr)
data %>%
group_by(SubjectID) %>%
summarise_at(vars(everything()), ~paste0(unique(.), collapse = ",")) %>%
mutate_if(is.character, ~paste0("[", ., "]"))
Output is:
# A tibble: 1 × 4
SubjectID PunchLocation NumOfPunch PunchType
<dbl> <chr> <chr> <chr>
1 102 [1,2,3,4,5] [1,2] [5,6,9]
I used this data:
data <- tibble(
SubjectID = rep(102, 12),
PunchLocation = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 4, 4),
NumOfPunch = c(rep(1, 10), 2, 2),
PunchType = c(5, 6, 5, 6, 9, 5, 6, 5, 6, 9, 5, 6)
)
Using data.table (with example from #Stephan).
library(data.table)
setDT(data)
fun.collapse <- function(x) sprintf('[%s]', paste(unique(x), collapse=','))
dcast(data, SubjectID ~ .,
value.var = c('PunchLocation', 'NumOfPunch', 'PunchType'),
fun.aggregate = fun.collapse)
## SubjectID PunchLocation NumOfPunch PunchType
## 1: 102 [1,2,3,4,5] [1,2] [5,6,9]
I want to count how often each pairwise combination of unique elements in column c in data frame df co-occurs on the elements of column a, but with the addition that co-occurrences are only counted if the respective values in column b are unequal, i.e., conditional on a non-match in column b
a <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4)
b <- c(1,1,2,2,2,1,1,2,2,3,3,3,3,1,1,1,2,2,2,4)
c <- c(1,2,1,2,3,2,3,1,2,1,1,2,3,1,2,1,1,2,4,1)
df <- as.data.frame(cbind(a,b,c))
Without considering column b I could do the following to retain for each pair of elements of column c, on how many elements of a they co-occur
df <- unique(df[,c(1,3)])
df <- merge(df, df, by = "a")
df$count <- 1
df <- aggregate(count ~ ., df[, c(2:4)], sum)
df <- df[df$c.x != df$c.y,]
With the additional condition of a non-match in b, there is only one difference: elements 2 and 4 of column c both co-occur on element 4 of column a, but have the same value in b and should therefore not be counted to end up with:
c.x <- c(2,3,4,1,3,1,2,1)
c.y <- c(1,1,1,2,2,3,3,4)
count <- c(4,3,1,4,3,3,3,1)
result <- as.data.frame(cbind(c.x,c.y,count))
As the original data set is large (> 1,000,000 observations), I welcome fast solutions, i.e., without using loops or merges. Usually, I create co-occurrence matrices from three-column data frames using sparseMatrix()
I'm not sure from your description if this is what you had in mind, nor how fast this would turn out to be, but here is an approach with purrr:
library(purrr)
split(df, c) %>%
combn(2, simplify = F) %>%
set_names(map(., ~ paste(names(.x), collapse = "_"))) %>%
map_int(~ merge(.x[[1]], .x[[2]], by = NULL) %>%
dplyr::filter(a.x == a.y && b.x != b.y) %>%
nrow())
Returns:
1_2 1_3 1_4 2_3 2_4 3_4
0 27 0 21 0 0
# Data used:
df <- structure(list(a = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4), b = c(1, 1, 2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3, 1, 1, 1, 2, 2, 2, 4), c = c(1, 2, 1, 2, 3, 2, 3, 1, 2, 1, 1, 2, 3, 1, 2, 1, 1, 2, 4, 1)), class = "data.frame", row.names = c(NA, -20L))
To begin with, let's suppose we have a dataset like this:
data <- data.frame(
id = 1:5,
time = c(0.1, 0.2, 0.1, 0.1, 0.2),
obj_a_size = c(1, 3, 8, 4, 2),
obj_a_cuteness = c(3, 6, 4, 1, 2),
obj_b_size = c(5, 4, 4, 2, 5),
obj_b_cuteness = c(6, 2, 10, 9, 6),
obj_c_size = c(3, 6, 7, 1, 6),
obj_c_cuteness = c(10, 1, 6, 8, 8)
)
It has columns concerning whole experiment (like time) and object-specific columns (like X_size and X_cuteness). These objects are ordered randomly, though, so I'd like to mutate these column to order the objects by size for each experiment separately. The result I expect to be like that:
data <- data.frame(
id = 1:5,
time = c(0.1, 0.2, 0.1, 0.1, 0.2),
obj_max_size = c(5, 6, 8, 4, 6),
obj_max_cuteness = c(6, 1, 4, 1, 8),
obj_2nd_size = c(3, 4, 7, 2, 5),
obj_2nd_cuteness = c(10, 2, 6, 9, 6),
obj_min_size = c(1, 3, 3, 1, 2),
obj_min_cuteness = c(3, 6, 10, 8, 2)
)
Notice that cuteness isn't ordered descending or ascending, but I want cuteness to be considered part of an object and set obj_max_cuteness = obj_2_cuteness wherever obj_max_size = obj_2_size, and so on.
Number of objects is known in advance (there are four of them), columns are known as well, and there are four columns describing each object. There is no missing data. I'm willing to use any package, if necessary. Also, original dataset is about 500k by 30, so bonus points for quick or memory-friendly code.
EDIT: Some noticed that the description is not very clear. What I'm after is a bit object-oriented thing: in the case above each object within experiment could be described as such (X in obj_X_ means that it belongs to experiment no. X):
obj_1_a = {"size": 1, "cuteness": 3}
obj_1_b = {"size": 5, "cuteness": 6}
obj_1_c = {"size": 3, "cuteness": 10}
obj_2_a = {"size": 3, "cuteness": 6}
...
I want to reorder them by size so that (in the resulting data frame):
obj_1_max = {"size": 5, "cuteness": 6}
obj_1_2nd = {"size": 3, "cuteness": 10}
obj_1_min = {"size": 1, "cuteness": 3}
obj_2_max = {"size": 6, "cuteness": 1}
...
Is this what you are after?
The min and max value calculations are straightforward. To find the 2nd max you need to do a bit more work. My interpretation of the 2nd values is that it is the 2nd value of the sorted and unique values. My output differs from yours but that may be due to a different interpretation of what you mean by the 2nd value. My reading: you are looking for the first value down from the max value; from the groups of 3 columns (size, cuteness).
library(dplyr)
data <- data.frame(
id = 1:5,
time = c(0.1, 0.2, 0.1, 0.1, 0.2),
obj_a_size = c(1, 3, 8, 4, 2),
obj_a_cuteness = c(3, 6, 4, 1, 2),
obj_b_size = c(5, 4, 4, 2, 5),
obj_b_cuteness = c(6, 2, 10, 9, 6),
obj_c_size = c(3, 6, 7, 1, 6),
obj_c_cuteness = c(10, 1, 6, 8, 8)
)
obj_max_size <- data %>%
pivot_longer(cols = contains('size')) %>%
group_by(id) %>%
summarise(obj_max_size = max(value)) %>%
ungroup() %>%
select(obj_max_size)
obj_min_size <- data %>%
pivot_longer(cols = contains('size')) %>%
group_by(id) %>%
summarise(obj_min_size = min(value)) %>%
ungroup() %>%
select(obj_min_size)
obj_2nd_size <- data %>%
pivot_longer(cols = contains('size')) %>%
group_by(id) %>%
distinct(value) %>%
arrange(desc(value)) %>%
slice(2) %>%
ungroup() %>%
select(obj_2nd_size = value)
obj_max_cuteness <- data %>%
pivot_longer(cols = contains('cuteness')) %>%
group_by(id) %>%
summarise(obj_max_cuteness = max(value)) %>%
ungroup() %>%
select(obj_max_cuteness)
obj_min_cuteness <- data %>%
pivot_longer(cols = contains('cuteness')) %>%
group_by(id) %>%
summarise(obj_min_cuteness = min(value)) %>%
ungroup() %>%
select(obj_min_cuteness)
obj_2nd_cuteness <- data %>%
pivot_longer(cols = contains('cuteness')) %>%
group_by(id) %>%
distinct(value) %>%
arrange(desc(value)) %>%
slice(2) %>%
ungroup() %>%
select(obj_2nd_cuteness = value)
output <- bind_cols(id = data$id, obj_max_size, obj_min_size, obj_2nd_size, obj_max_cuteness, obj_min_cuteness, obj_2nd_cuteness)
With output looking like this:
> output
# A tibble: 5 x 7
id obj_max_size obj_min_size obj_2nd_size obj_max_cuteness obj_min_cuteness obj_2nd_cuteness
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 1 3 10 3 6
2 2 6 3 4 6 1 2
3 3 8 4 7 10 4 6
4 4 4 1 2 9 1 8
5 5 6 2 5 8 2 6
I have a list of dataframes that I want to consolidate these dataframes into one data frame. I am looking to solve two problems:
How to add together the columns
How to only include common dates across all the dfs withing the list
This is what I have:
library(tidyverse)
library(lubridate)
df1 <- data.frame(
date = ymd(c("2019-02-01", "2019-02-02", "2019-02-03", "2019-02-04",
"2019-02-05")),
x = c(1, 2, 3, 4, 5),
y = c(2, 3, 4, 5, 6),
z = c(3, 4, 5, 6, 7)
)
df2 <- data.frame(
date = ymd(c("2019-02-01", "2019-02-02", "2019-02-04", "2019-02-05")),
x = c(1, 2, 3, 4),
y = c(2, 3, 4, 5),
z = c(3, 4, 5, 6)
)
df3 <- data.frame(
date = ymd(c("2019-02-01", "2019-02-02", "2019-02-03", "2019-02-04")),
x = c(1, 2, 3, 4),
y = c(2, 3, 4, 5),
z = c(3, 4, 5, 6)
)
dfl <- list(df1, df2, df3)
This is the output I am looking for:
data.frame(
date = ymd(c("2019-02-01", "2019-02-02", "2019-02-04")),
x = c(3, 6, 11),
y = c(6, 9, 14),
z = c(9, 12, 17)
)
I have tried inner_join and tried looping through the list but it got too complicated and I still didn't manage to land on the answer.
Is there a more cleaner way to get to the final answer
How about this?
bind_rows(dfl) %>%
group_by(date) %>%
mutate(n = 1) %>%
summarise_all(sum) %>%
filter(n == length(dfl)) %>%
select(-n)
## A tibble: 3 x 4
# date x y z
# <date> <dbl> <dbl> <dbl>
#1 2019-02-01 3 6 9
#2 2019-02-02 6 9 12
#3 2019-02-04 11 14 17
This assumes that there are no duplicate dates in a single data.frame of dfl.