r aggregate and collapse several cells into one - r

I have a data frame:
x <- data.frame(id = 1:18,
super = c(rep("A", 12), rep("B", 6)),
category = c(rep("one", 6), rep("two", 6), rep("three", 6)),
root = sort(rep(letters[1:6], 3)),
coldefs = letters[1:18], stringsAsFactors = F)
x
I am creating a new column by concatenating 3 columns:
myvars <- c("super", "category", "root")
library(tidyverse)
x <- x %>% unite(col = concat, myvars, sep = "_", remove = F)
x
Now, for each unique value of column 'concat' the values of column 'super' are the same, the values of column 'category' are the same, and the values of column "root" are the same. However, for each unique value of column 'concat' the values of column 'id' are different. The same is true for column 'coldefs'.
I would like to collapse (aggregate) x so that it has only as many rows as there are unique values in column 'concat' (i.e., 6 rows). In each row, I want one value from column 'super', one value from column 'category', one value from column 'root'; and then 3 values of column 'id' (concatenated like this: 1;2;3) and 3 values of column 'coldefs' (concatenated like this: a;b;c).
What's the best way of doing it?
I am trying the following, but it's not working:
x %>% group_by(concat) %>% summarize(id = paste(id, collapse = ";"),
super = unique(super), category = unique(category), root = unique(root),
coldefs = paste(coldefs, collapse = ";"))
I am clearly doing something wrong.
Thanks a lot for your help!

I must say this is a bit (or completely) crazy! I tried my code (the one at the bottom) piece by piece and it worked. I merged it all together - and it worked. I don't understand why was I getting an error before. Here is the correct code that works (at least now):
x %>% group_by(concat) %>% summarize(id = paste(id, collapse = ";"), super = unique(super),
category = unique(category), root = unique(root),
coldefs = paste(coldefs, collapse = ";"))

Related

How to summarize rows of a data frame into one while removing Duplicates in R?

so, I have a data frame with 2 or more rows and different columns (ID, Location, Task, Skill, ...). I want to summarize these rows into (a) one row (dataframe) where different column entries should be joined together (but only if different! i.e. if for two rows the IDs are the same, the final dataframe row should show only one ID not the same twice i.e. "ID1", but if they are different, both should be shown i.e. 'ID1, ID2") and some numerical values should be added (+) together.
df = data.frame("ID" = c(PA1, PA1), "Occupation" = c("PO - react to DCS, initiate corrective measures, react to changes
", "PO - data based operations"), "Field" = c("PA","PA"), "Work" = c(0.5, 0.1), "Skill1" = c(CRO, CRO), "Skill2" = c(0, PPto), "ds" = c(5, 5))
print(df)
and the output should look like this
df_final = data.frame("ID" = c(PA1), "Occupation" = c("PO - react to DCS, initiate corrective measures, react to changes, data based operations"), "Field" = c("PA"), "Work" = c(0.6), "Skill1" = c(CRO), "Skill2" = c(PPto), "ds" = c(5))
print(df_final)
Thank you!
Let's ignore Skill2 for now:
How close is the following code to what you want to do?
df2 %>%
group_by(ID)%>%
summarise(work = sum(Work),
skill1 = unique(Skill1),
ds = unique(ds),
occupation = paste0(Occupation, collapse = " "),
field = unique(Field))
You can also mutate(occupation = str_replace_all(occupation, "PO - ")) to get rid of the duplicate "PO - "'s.
You're going to run into problems if the variables like Skill1/Skill2/ds are not unique to each ID, as in they have cardinality > 1.
df2 %>%
group_by(ID)%>%
summarise(work = sum(Work),
skill1 = unique(Skill1),
skill2 = unique(Skill2),
ds = unique(ds),
occupation = paste0(Occupation, collapse = " "),
field = unique(Field))
If it's a simple data-entry issue, you could do a bit of wrangling to filter for only Skill2 entries with letters contained, and then join this frame back to your original frame.
You could also use the past0() collapse = trick, but then you'll end up with Skill2 = c(NA, "PPto"), which I'm pretty sure you don't want.

Mutate multiple new columns based on references to themselves

A data frame:
df <- data.frame(
date = seq(ymd('2021-01-01'), ymd('2021-01-31'), by = 1),
ims_x = rnorm(31, mean = 0),
ims_y = rnorm(31, mean = 1),
ims_z = rnorm(31, mean = 2),
blah = 1:31
)
I'd like to mutate 3 new fields (not overwrite), 'ims_x_lagged', 'ims_y_lagged' and 'ims_z_lagged' where each new field corresponds to the original but lagged by one day/row. The names of the new fields would just have '_lagged' appended onto the name of the original and the value would change to be that of it's original in the preceding row.
I could do this manually for each field, but that would be a lot of typing and my real data has many more than 3 fields that need to be lagged.
Something kind of like this, if it's possible to tell what I'm trying to do:
df <- df %>%
mutate_at(vars(contains('ims_')) := lag(vars(contains('ims_')))) # but append '_lagged' to the name
With the new version of dplyr, _at or _all are getting deprecated and in its place, can use the more flexible across. If we don't specify the .names, it will replace the modified column values with the same column name. By specifying the .names, the {.col} - returns the original column name and can add either prefix or suffix as a string.
library(dplyr)
df <- df %>%
mutate(across(starts_with('ims'), lag, .names = "{.col}_lagged"))

Comparing each row of one dataframe with a row in another dataframe using R

I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- (as.data.frame(combinations(n = 10, r = 4, v = v1),
stringsAsFactors = FALSE))
combinations_from_4_letters$group <- rep(1:15, each = 14)
combinations_from_2_letters <- (as.data.frame(combinations(n = 10, r = 2, v = v1),
stringsAsFactors = FALSE))
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters without repetitions and permutations. The combinations are binned into groups from 1-15. I want to find out how often pairs of the 10 letters (saved in dataframe 'combinations_from_2_letters') are found in each group (basically a frequency table). I started doing a complicated loop looping through both dataframes but I think there must be a more 'R' solution to it, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i,])
Thank you in advance for your help!
I recommend an approach like the following:
# adding dummy column for a complete cross-join
combinations_from_4_letters = combinations_from_4_letters %>%
mutate(ones = 1)
combinations_from_2_letters = combinations_from_2_letters %>%
mutate(ones = 1)
joined = combinations_from_2_letters %>%
inner_join(combinations_from_4_letters, by = "ones") %>%
# comparison goes here
mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
group_by(comb2) %>%
summarise(freq = sum(within))
You'll probably need to modify to ensure it matches the exact column names and your comparison condition.
Key ideas:
adding filler column so we have a complete cross-join
mutate a new indicator column for whether the two letter pair is within the four letter pair
sum indicators on the two letter pair

Erase lines on a dataframe based on a specific calculation

I have a dataframe like this :
mydata
and I would like to erase a specific amount of lines based on the IDNumber and #Paymts.
Basically I want to keep only the 2 last lines for each IDNumber. For example, for IDNumber = 230, i have 5 lines (indicated in the column #Paymts), I want to erase all the first lines and just keep the two last.
Any idea ?
Thanks in advance!
We can use slice
library(dplyr)
df1 %>%
group_by(IDNumber) %>%
slice(tail(row_number(), 2)) #or
#slice((n()-1):n())
data
set.seed(24)
df1 <- data.frame(IDNumber = rep(LETTERS[1:3], each = 5), Date = Sys.Date(),
`#Paymts` = sample(1:9, 15, replace = TRUE), check.names = FALSE)
You can do it this way:
trimmed_df = df[unlist(tapply(rownames(df), df$id_number, tail, 2)), ]

Matching across two data frames with certain observations having multiple entries to match against

I working with two data frames corresponding to the sample below:
# Data sets
set.seed(1)
dta_a <- data.frame(some_value = runif(n = 10),
identifier=c("A0001","A0002","A0003","A0004","A0005",
"A0006","B0001","B0002","B0003","B0004"),
other_val = runif(n = 10))
dta_b <- data.frame(variable_abc = runif(n = 6),
identifier=c("A0001","A0002","A0003,A0004,A0005,C0001",
"B0001,B0002","B0003","B0004"),
variable_df = runif(n = 6))
I would like to merge those two data frames and obtain a data frame similar to the one presented below:
The resulting data frame would have the following qualities:
For the observations where only one identifier is present the merge command performs with all.y = TRUE and all.x = FALSE assuming that y is dta_b.
For the observations where multiple identifiers are provided only the first matched value from the dta_a is taken with the remaining values ignored. If there is no match on the first identifier (A0003) I would like for the command to attempt to match the next one (A0004).
I made a reference to the merge command but, naturally, dplyr and other solutions are fine.
you can 'melt' the dta_b so to have one row per identifier with a preference order and then join all the identifiers:
library(dplyr)
library(tidyr)
melt_dta_b = lapply(1:nrow(dta_b), function(i){
split_identifier = strsplit(as.character(dta_b$identifier[i]), split = ",", fixed = TRUE)[[1]]
data_frame(identifier = split_identifier,
original_identifier = dta_b$identifier[i], original_row = i, preference = 1:length(identifier),
variable_abc = dta_b$variable_abc[i], variable_df = dta_b$variable_df[i])
})
melt_dta_b = rbind_all(melt_dta_b)
At that point you can select only the one with the highest preference score:
joined_df = left_join(melt_dta_b, dta_a) %>%
filter(!is.na(some_value)) %>%
group_by(original_row) %>%
filter(preference == min(preference)) %>%
ungroup()
UPDATE
in order to not explicitly call the variables by name you can use the following code that binds all the 'unused' columns of the orginal df:
melt_dta_b = lapply(1:nrow(dta_b), function(i){
tmp = dta_b[i,]
split_identifier = strsplit(as.character(tmp$identifier), split = ",", fixed = TRUE)[[1]]
colnames(tmp)[2] = "original_identifier"
data_frame(identifier = split_identifier, original_row = i, preference = 1:length(identifier)) %>%
cbind(tmp)
})
melt_dta_b = rbind_all(melt_dta_b)
Just one way of doing it, but not best way I guess. Just made a try.
Split the identifiers and merge according to the first one.
dta_a$identifier = as.vector(dta_a$identifier)
dta_a1 = data.frame(dta_a, identifier_split = do.call(rbind, strsplit(dta_a$identifier, split = ",", fixed = T)))
dta_b$identifier = as.vector(dta_b$identifier)
dta_b1 = data.frame(dta_b, identifier_split = do.call(rbind, strsplit(dta_b$identifier, split = ",", fixed = T)))
dta_join = merge(dta_a1, dta_b1, by = "identifier_split.1", all.x = F, all.y = T)
In cases you don't have a match for the first one, you'll see NAs and you can subset them and merge with second ones ("identifier_split.2")

Resources