I have the following data:
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
I want to "extend" this data frame to make name pairs for every possible combination of names without repetition like so:
names_1 <- c("a", "a", "a", "b", "b", "c")
names_2 <- c("b", "c", "d", "c", "d", "d")
scores_1 <- c(95, 95, 95, 55, 55, 100)
scores_2 <- c(55, 100, 60, 100, 60, 60)
df_extended <- cbind.data.frame(names_1, names_2, scores_1, scores_2)
In the extended data, scores_1 are the scores for the corresponding name in names_1, and scores_2 are for names_2.
The following bit of code makes the appropriate name pairs. But I do not know how to get the scores in the right place after that.
t(combn(df$names,2))
The final goal is to get the row-wise difference between scores_1 and scores_2.
df_extended$score_diff <- abs(df_extended$scores_1 - df_extended$scores_2)
df_ext <- data.frame(t(combn(df$names, 2, \(x) c(x, df$scores[df$names %in% x]))))
df_ext <- setNames(type.convert(df_ext, as.is = TRUE), c('name_1', 'name_2', 'type_1', 'type_2'))
df_ext
name_1 name_2 type_1 type_2
1 a b 95 55
2 a c 95 100
3 a d 95 60
4 b c 55 100
5 b d 55 60
6 c d 100 60
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
library(tidyverse)
map(df, ~combn(x = .x, m = 2)%>% t %>% as_tibble) %>%
imap_dfc(~set_names(x = .x, nm = paste(.y, seq(ncol(.x)), sep = "_"))) %>%
mutate(score_diff = scores_1 - scores_2)
#> # A tibble: 6 × 5
#> names_1 names_2 scores_1 scores_2 score_diff
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 a b 95 55 40
#> 2 a c 95 100 -5
#> 3 a d 95 60 35
#> 4 b c 55 100 -45
#> 5 b d 55 60 -5
#> 6 c d 100 60 40
Created on 2022-06-06 by the reprex package (v2.0.1)
First, we can create a new data frame with the unique combinations of names. Then, we can merge on the scores to match the names for both names_1 and names_2 to get the final data.frame.
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
new_df <- data.frame(t(combn(df$names,2)))
names(new_df)[1] <- "names_1"; names(new_df)[2] <- "names_2"
new_df <- merge(new_df, df, by.x = 'names_1', by.y = 'names')
new_df <- merge(new_df, df, by.x = 'names_2', by.y = 'names')
names(new_df)[3] <- "scores_1"; names(new_df)[4] <- "scores_2"
> new_df
names_2 names_1 scores_1 scores_2
1 b a 95 55
2 c a 95 100
3 c b 55 100
4 d a 95 60
5 d b 55 60
6 d c 100 60
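The score lookup can also be done without merging, which keeps the rows in the original `combn` order and leads directly to the row-wise difference the question asks for. The following is a minimal sketch in base R, assuming the names are unique; `pairs` and `df_pairs` are illustrative names that do not appear in the answers above.

```r
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- data.frame(names, scores)

# One row per unordered pair of names (as.character guards against factors)
pairs <- t(combn(as.character(df$names), 2))

# match() returns the row position of each name in df, which indexes the scores
df_pairs <- data.frame(
  names_1  = pairs[, 1],
  names_2  = pairs[, 2],
  scores_1 = df$scores[match(pairs[, 1], df$names)],
  scores_2 = df$scores[match(pairs[, 2], df$names)]
)

# Row-wise absolute difference between the two scores
df_pairs$score_diff <- abs(df_pairs$scores_1 - df_pairs$scores_2)
df_pairs
```

Because `match()` looks each name up explicitly, this version does not rely on the order of the rows in `df`.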
I’m having trouble working with 3 different sets of data (df1, df2, vec1) to output a third dataframe df3. I have 2 dataframes df1 and df2. In df1, each letter in X1 corresponds to a value in X2. In df2, X3 represents a numerical value found in vec1 and X4 represents a letter or multiple letters from df1$X1. I’m looking to scan the letters found in df2$X4 and see if there is a sequential order of N values determined from df2$X3 in vec1, and then remove any letters that do not fit this criterion.
For example, in df2[1, ], the letters are “A, B, D” and the value is 3. Looking at vec1, the max sequential order that includes the value 3 is “2, 3, 4, 5”, meaning df2[1, 2] should be replaced with “A, D” instead of “A, B, D”. The final output should look like df3. Any ideas would be greatly appreciated.
df1 <- data.frame(c("A", "B", "C", "D"), c(4, 8, 1, 3))
colnames(df1) <- c("X1", "X2")
df2 <- data.frame(c(3, 21, 27, 34, 35, 46), c("A, B, D", "A, C", NA, "B", "B, D", "C"))
colnames(df2) <- c("X3", "X4")
vec1 <- c(2, 3, 4, 5, 21, 22, 23, 27, 33, 34, 35, 36, 37, 38, 39, 46)
df3 <- data.frame(c(3, 21, 27, 34, 35, 46), c("A, D", "C", NA, NA, "D", NA))
This is not elegant but it may do what you need it to do.
First, create a list that contains consecutive integers:
vec1_seq <- split(vec1, cumsum(c(0, diff(vec1) > 1)))
$`0`
[1] 2 3 4 5
$`1`
[1] 21 22 23
$`2`
[1] 27
$`3`
[1] 33 34 35 36 37 38 39
$`4`
[1] 46
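To see why this split groups consecutive integers, here is a small sketch that spells out the intermediate steps. The names `gaps`, `run_id`, `run_len`, `x3`, and `len_for_x3` are illustrative and not part of the original answer.

```r
vec1 <- c(2, 3, 4, 5, 21, 22, 23, 27, 33, 34, 35, 36, 37, 38, 39, 46)

# TRUE wherever the step to the next value is larger than 1, i.e. a new run starts
gaps <- diff(vec1) > 1

# The cumulative sum turns those breaks into a run identifier (0, 1, 2, ...)
run_id <- cumsum(c(0, gaps))

# Length of each run of consecutive integers
run_len <- lengths(split(vec1, run_id))

# Run length that applies to each value of df2$X3 from the question
x3 <- c(3, 21, 27, 34, 35, 46)
len_for_x3 <- unname(run_len[as.character(run_id[match(x3, vec1)])])
len_for_x3
```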
Then, do the following. Check for X3 in each element of the list, and determine the length if contained in that element. Then, keep only those letters that meet the length requirement:
cbind(df2,
X5 = apply(df2, 1, function(x) {
l <- length(unlist(vec1_seq[sapply(seq_along(vec1_seq), function(i) {
as.numeric(x[["X3"]]) %in% vec1_seq[[i]]
})]))
toString(na.omit(as.vector(sapply(trimws(unlist(strsplit(x[["X4"]], ","))), function(i) {
ifelse(i == df1[["X1"]] & df1[["X2"]] <= l, i, NA)
}))))
}))
It seems that "C" should remain for row 6; if that is incorrect let me know.
Output
X3 X4 X5
1 3 A, B, D A, D
2 21 A, C C
3 27 <NA>
4 34 B
5 35 B, D D
6 46 C C
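The same filtering can be written with a small helper, which may be easier to read than the nested `sapply` calls. This is only a sketch under the same interpretation as the answer above: a letter is kept when its value in `df1$X2` is at most the length of the run of consecutive integers that contains `X3`. The helper name `keep_letters` is hypothetical, and rows with no surviving letters are returned as `NA` rather than an empty string.

```r
df1 <- data.frame(X1 = c("A", "B", "C", "D"), X2 = c(4, 8, 1, 3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X3 = c(3, 21, 27, 34, 35, 46),
                  X4 = c("A, B, D", "A, C", NA, "B", "B, D", "C"),
                  stringsAsFactors = FALSE)
vec1 <- c(2, 3, 4, 5, 21, 22, 23, 27, 33, 34, 35, 36, 37, 38, 39, 46)

# For every element of vec1, the length of the consecutive run it belongs to
run_id  <- cumsum(c(0, diff(vec1) > 1))
run_len <- ave(vec1, run_id, FUN = length)

# Hypothetical helper: keep only the letters whose required length fits the run
keep_letters <- function(x3, x4) {
  len <- run_len[match(x3, vec1)]
  if (is.na(x4) || is.na(len)) return(NA_character_)
  letters_in <- trimws(strsplit(x4, ",")[[1]])
  needed <- df1$X2[match(letters_in, df1$X1)]
  kept <- letters_in[!is.na(needed) & needed <= len]
  if (length(kept) == 0) NA_character_ else toString(kept)
}

df2$X5 <- mapply(keep_letters, df2$X3, df2$X4, USE.NAMES = FALSE)
df2
```

As in the answer above, "C" is kept for row 6 under this interpretation, which differs from the `df3` shown in the question.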
I am trying to add two rows to the data frame.
For the first row, the value in the MODEL column should be X, total_value should be the sum of total_value over the rows where MODEL is A or C, and total_frequency should be the sum of total_frequency over those same rows.
For the second row, the value in the MODEL column should be Z, total_value should be the sum of total_value over the rows where MODEL is D, E, or F, and total_frequency should be the sum of total_frequency over those same rows.
I am stuck, as I do not know how to select specific values of MODEL and then sum these two other columns.
Here is my data
data.frame(MODEL=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), total_value= c(62, 54, 78, 38, 16, 75, 39, 13, 58, 37),
total_frequency = c(78, 83, 24, 13, 22, 52, 16, 16, 20, 72))
You can try dplyr: compute each new row separately, then bind them to the original data frame df:
library(dplyr)
first <- df %>%
# select the models you need
filter(MODEL %in% c("A","C")) %>%
# call them x
mutate(MODEL = 'X') %>%
# grouping
group_by(MODEL) %>%
# calculate the sums
summarise_all(sum)
# same with the second
second <- df %>%
filter(MODEL %in% c("D","F","E")) %>%
mutate(MODEL = 'Z') %>%
group_by(MODEL) %>% summarise_all(sum)
# put together
rbind(df, first, second)
# A tibble: 12 x 3
MODEL total_value total_frequency
1 A 62 78
2 B 54 83
3 C 78 24
4 D 38 13
5 E 16 22
6 F 75 52
7 G 39 16
8 H 13 16
9 I 58 20
10 J 37 72
11 X 140 102
12 Z 129 87
The following code is a straightforward solution to the problem.
i1 <- df1$MODEL %in% c("A", "C")
total_value <- sum(df1$total_value[i1])
total_frequency <- sum(df1$total_frequency[i1])
df1 <- rbind(df1, data.frame(MODEL = "X", total_value, total_frequency))
i2 <- df1$MODEL %in% c("D", "E", "F")
total_value <- sum(df1$total_value[i2])
total_frequency <- sum(df1$total_frequency[i2])
df1 <- rbind(df1, data.frame(MODEL = "Z", total_value, total_frequency))
df1
# MODEL total_value total_frequency
#1 A 62 78
#2 B 54 83
#3 C 78 24
#4 D 38 13
#5 E 16 22
#6 F 75 52
#7 G 39 16
#8 H 13 16
#9 I 58 20
#10 J 37 72
#11 X 140 102
#12 Z 129 87
It is also possible to write a function to avoid repeating the same code.
fun <- function(X, M, vals){
i1 <- X$MODEL %in% vals
total_value <- sum(X$total_value[i1])
total_frequency <- sum(X$total_frequency[i1])
rbind(X, data.frame(MODEL = M, total_value, total_frequency))
}
df1 <- fun(df1, M = "X", vals = c("A", "C"))
df1 <- fun(df1, M = "Z", vals = c("D", "E", "F"))
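If more summary rows are needed, the groups can be kept in a named list and processed in one pass. The sketch below is one possible base R generalisation, not part of the original answer: `groups`, `num_cols`, `new_rows`, and `result` are illustrative names, and `df1` is rebuilt here from the question's data so the block runs on its own.

```r
df1 <- data.frame(
  MODEL = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
  total_value = c(62, 54, 78, 38, 16, 75, 39, 13, 58, 37),
  total_frequency = c(78, 83, 24, 13, 22, 52, 16, 16, 20, 72)
)

# Each list name is the new MODEL label, each element the models to sum
groups <- list(X = c("A", "C"), Z = c("D", "E", "F"))
num_cols <- c("total_value", "total_frequency")

# Build one summary row per group by summing the numeric columns
new_rows <- do.call(rbind, lapply(names(groups), function(m) {
  rows <- df1$MODEL %in% groups[[m]]
  data.frame(MODEL = m, as.list(colSums(df1[rows, num_cols])))
}))

# All sums are computed from the original rows, then appended once
result <- rbind(df1, new_rows)
result
```

Computing every summary from the original `df1` before binding means a later group can never accidentally include an earlier summary row.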
Suppose that we have the following data frame:
ID <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5)
age <- c(25, 25, 25, 22, 22, 56, 56, 56, 80, 33, 33, 90, 90, 90)
gender <- c("m", "m", "m", "f", "f", "m", "m", "m", "m", "m", "m", "f", "f", "m")
company <- c("c1", "c2", "c2", "c3", "c3", "c1", "c1", "c1", "c1", "c5", "c5", "c3", "c4", "c5")
income <- c(1000, 1000, 1000, 500, 1700, 200, 200, 250, 500, 700, 700, 300, 350, 300)
df <- data.frame(ID, age, gender, company, income)
I need to find the rows whose values for age, gender, or income differ within an ID. I don't care whether the company values are the same or different.
So after processing, here is the output:
BONUS:
Can we create another data frame that lists the variables that differ by ID? For example:
An option would be to group by 'ID', check whether the number of distinct elements in 'age', 'gender', 'income' is equal to 1 and then negate (!)
library(dplyr)
out <- df %>%
group_by(ID) %>%
filter(!(n_distinct(age) == 1 &
n_distinct(gender) == 1 &
n_distinct(income) == 1))
out
# A tibble: 9 x 5
# Groups: ID [3]
# ID age gender company income
# <dbl> <dbl> <fct> <fct> <dbl>
#1 2 22 f c3 500
#2 2 22 f c3 1700
#3 3 56 m c1 200
#4 3 56 m c1 200
#5 3 56 m c1 250
#6 3 80 m c1 500
#7 5 90 f c3 300
#8 5 90 f c4 350
#9 5 90 m c5 300
If there are many variables, another option is filter_at:
df %>%
group_by(ID) %>%
filter_at(vars(age, gender, income), any_vars(!(n_distinct(.) == 1)))
From the above, we can get the second output with
library(tidyr)
out %>%
select(-company) %>%
gather(key, val, - ID) %>%
group_by(key, .add = TRUE) %>%
filter(n_distinct(val) > 1) %>%
group_by(ID) %>%
summarise(Different = toString(unique(key)))
# A tibble: 3 x 2
# ID Different
# <dbl> <chr>
#1 2 income
#2 3 age, income
#3 5 gender, income
In base R, we can split the c("age", "gender", "income") columns by ID, find the IDs that have more than one unique row, and subset them.
df[df$ID %in% unique(df$ID)[sapply(split(df[c("age", "gender", "income")], df$ID),
function(x) nrow(unique(x)) > 1)], ]
# ID age gender company income
#4 2 22 f c3 500
#5 2 22 f c3 1700
#6 3 56 m c1 200
#7 3 56 m c1 200
#8 3 56 m c1 250
#9 3 80 m c1 500
#12 5 90 f c3 300
#13 5 90 f c4 350
#14 5 90 m c5 300
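The bonus part (listing which variables differ within each ID) can also be approached in base R. This is a rough sketch rather than a definitive solution; `vars`, `differs`, and `out_vars` are illustrative names that do not appear in the answers above.

```r
ID <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5)
age <- c(25, 25, 25, 22, 22, 56, 56, 56, 80, 33, 33, 90, 90, 90)
gender <- c("m", "m", "m", "f", "f", "m", "m", "m", "m", "m", "m", "f", "f", "m")
company <- c("c1", "c2", "c2", "c3", "c3", "c1", "c1", "c1", "c1", "c5", "c5", "c3", "c4", "c5")
income <- c(1000, 1000, 1000, 500, 1700, 200, 200, 250, 500, 700, 700, 300, 350, 300)
df <- data.frame(ID, age, gender, company, income, stringsAsFactors = FALSE)

vars <- c("age", "gender", "income")

# Logical matrix: one row per ID, one column per variable,
# TRUE when the variable takes more than one distinct value within that ID
differs <- sapply(vars, function(v) {
  tapply(df[[v]], df$ID, function(x) length(unique(x)) > 1)
})

# Keep only the IDs where at least one variable differs
differs <- differs[rowSums(differs) > 0, , drop = FALSE]

out_vars <- data.frame(
  ID = as.numeric(rownames(differs)),
  Different = unname(apply(differs, 1, function(r) toString(vars[r]))),
  stringsAsFactors = FALSE
)
out_vars
```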
Suppose that we have the following data frame:
ID <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6)
age <- c(25, 25, 25, 22, 22, 56, 56, 56, 80, 33, 33, 90, 90, 90, 5, 5, 5)
gender <- c("m", "m", NA, "f", "f", "m", NA, "m", "m", "m", NA, NA, NA, "m", NA, NA, NA)
company <- c("c1", "c2", "c2", "c3", "c3", "c1", "c1", "c1", "c1", "c5", "c5", "c3", "c4", "c5", "c3", "c1", "c1")
income <- c(1000, 1000, 1000, 500, 1700, 200, 200, 250, 500, 700, 700, 300, 350, 300, 500, 1700, 200)
df <- data.frame(ID, age, gender, company, income)
In this data we have 6 unique IDs, and if you look at the gender variable, it sometimes includes NA.
I want to replace the NAs with the correct gender category. Also, in case an ID has all NA's for gender, then leave it as is.
The expected outcome would be:
Here's a way in base R using ave:
df$gender <- with(df, ave(gender, ID, FUN = function(x) na.omit(x)[1]))
ID age gender company income
1 1 25 m c1 1000
2 1 25 m c2 1000
3 1 25 m c2 1000
4 2 22 f c3 500
5 2 22 f c3 1700
6 3 56 m c1 200
7 3 56 m c1 200
8 3 56 m c1 250
9 3 80 m c1 500
10 4 33 m c5 700
11 4 33 m c5 700
12 5 90 m c3 300
13 5 90 m c4 350
14 5 90 m c5 300
15 6 5 <NA> c3 500
16 6 5 <NA> c1 1700
17 6 5 <NA> c1 200
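Taking the first non-missing value silently assumes that each ID has at most one gender recorded. The sketch below adds that check before filling; it is an optional safeguard under that assumption, and `by_id`, `n_gender`, and `first_gender` are illustrative names not used in the original answer.

```r
ID <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6)
gender <- c("m", "m", NA, "f", "f", "m", NA, "m", "m", "m", NA, NA, NA, "m", NA, NA, NA)
df <- data.frame(ID, gender, stringsAsFactors = FALSE)

by_id <- split(df$gender, df$ID)

# Number of distinct non-missing genders per ID (0 means all NA)
n_gender <- vapply(by_id, function(g) length(unique(na.omit(g))), integer(1))

# Stop early if any ID has conflicting gender values
stopifnot(all(n_gender <= 1))

# First non-missing gender per ID; stays NA when the whole ID is missing
first_gender <- vapply(by_id, function(g) as.character(na.omit(g))[1], character(1))

# Look the value up for every row by its ID
df$gender <- unname(first_gender[as.character(df$ID)])
df
```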
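Some ways with dplyr and tidyr:
library(dplyr)
library(tidyr)
# Option 1: take the first non-missing gender within each ID
df %>%
group_by(ID) %>%
mutate(gender = na.omit(gender)[1])
# Option 2: fill missing values up, then down, within each ID
df %>%
group_by(ID) %>%
fill(gender, .direction = "up") %>%
fill(gender, .direction = "down")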
Using the tidyverse library you can do this:
library(tidyverse)
# for each ID, get the non-missing gender
df_gender_ref <- df %>% filter(!is.na(gender)) %>% select(ID, gender) %>% unique()
# join the gender lookup back onto the original data frame
df %>% select(-gender) %>% left_join(df_gender_ref, by = "ID")