Subsetting data frame based on frequency - r

I've been trying to figure out this problem for some time now. I have the following data frame with repeated observations per ID:
ID color
1 blue
1 red
1 blue
2 red
2 blue
2 red
.
.
.
I want to create a new data frame by choosing the color with the highest frequency for each ID so that I have only 1 row for each ID. That is, I'd like to get the following data frame:
ID color
1 blue
2 red
3
.
.
.
I attempted using transform but that didn't work as it only summed the number of times each ID appeared in the data.
transform(df, freq.ID = ave(seq(nrow(df)), ID, FUN=length))
Is there a way I can do this?

We count the frequency of each 'ID', 'color' combination with count, which adds a summarised column 'n', order the rows by 'ID' and by descending 'n', drop 'n', and then use distinct to keep the first row for each 'ID'.
library(dplyr)
df1 %>%
count(ID, color) %>%
arrange(ID, desc(n)) %>%
select(-n) %>%
distinct(ID, .keep_all = TRUE)
-output
# ID color
#1 1 blue
#2 2 red
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), color = c("blue",
"red", "blue", "red", "blue", "red")), class = "data.frame", row.names = c(NA,
-6L))

Base R method using aggregate and ave -
subset(aggregate(length~color + ID, transform(df1, length = ID), length),
ave(length, ID, FUN = function(x) x == max(x)) == 1)
# color ID length
#1 blue 1 2
#4 red 2 2
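The "most frequent color per ID" step can also be sketched with table and which.max inside tapply. This is only an illustration on the sample data, not a replacement for the answers above; the names top_color and result are made up for the example. Because which.max returns the first maximum, a tie within an ID resolves to the alphabetically first color.

```r
# Sample data from the question
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L),
  color = c("blue", "red", "blue", "red", "blue", "red"),
  stringsAsFactors = FALSE
)

# For each ID, tabulate the colors and keep the name of the largest count
top_color <- tapply(df1$color, df1$ID,
                    function(x) names(which.max(table(x))))

# One row per ID (hypothetical result name)
result <- data.frame(ID = as.integer(names(top_color)),
                     color = as.vector(top_color),
                     stringsAsFactors = FALSE)
result
```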

Related

Filtering a large data frame based on column values using R

I have a very large data frame with almost 502493 rows and 261 columns. I want to filter it and keep the IDs with specific codes (codes starting with 'E'). This is what my data looks like:
IDs code1 code2
1 C443 E109
2 AX31 M223
1 E341 QWE1
3 E131 M223
My required output is IDs with codes starting with 'E' only.
IDs code
1 E109
1 E341
3 E131
I am trying to use the 'filter' of dplyr package but not getting the required output.
Thanks in advance
We can reshape to 'long' format with pivot_longer and filter by creating a logical vector from the first character extracted (with substr)
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = starts_with("code"),
values_to = 'code', names_to = NULL) %>%
filter(substr(code, 1, 1) == "E")
-output
# A tibble: 3 × 2
IDs code
<int> <chr>
1 1 E109
2 1 E341
3 3 E131
If the data is really big, we can filter before the pivot_longer to keep only the rows that have at least one code starting with 'E' in any of the code columns
df1 %>%
filter(if_any(starts_with('code'), ~ substr(., 1, 1) == 'E')) %>%
pivot_longer(cols = starts_with("code"),
values_to = 'code', names_to = NULL) %>%
filter(substr(code, 1, 1) == "E")
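If the data is very large, another option is data.table. Convert the data.frame to a data.table with setDT, loop over the columns of interest (.SDcols) with lapply, replace the elements that do not start with "E" with NA, then combine the columns with fcoalesce (via do.call) to get the first non-NA element for each row, and drop the rows that are still NA with na.omit. Note that this keeps at most one code per row, so it assumes a row never has more than one code starting with "E".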
library(data.table)
na.omit(setDT(df1)[, .(IDs, code = do.call(fcoalesce,
lapply(.SD, function(x) replace(x, substr(x, 1, 1) != "E",
NA)))), .SDcols = patterns("code")])
-output
IDs code
1: 1 E109
2: 1 E341
3: 3 E131
data
df1 <- structure(list(IDs = c(1L, 2L, 1L, 3L), code1 = c("C443", "AX31",
"E341", "E131"), code2 = c("E109", "M223", "QWE1", "M223")),
class = "data.frame", row.names = c(NA,
-4L))
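For comparison, the same reshape-then-filter idea can be sketched in base R without extra packages. This is a minimal illustration on the sample data; the names code_cols, long and res are made up for the example, and it assumes the code columns are character vectors whose names start with "code".

```r
# Sample data from the question
df1 <- data.frame(IDs = c(1L, 2L, 1L, 3L),
                  code1 = c("C443", "AX31", "E341", "E131"),
                  code2 = c("E109", "M223", "QWE1", "M223"),
                  stringsAsFactors = FALSE)

# Pick up all code columns by name, however many there are
code_cols <- grep("^code", names(df1), value = TRUE)

# Stack the code columns into one long column, repeating the IDs per column
long <- data.frame(IDs = rep(df1$IDs, times = length(code_cols)),
                   code = unlist(df1[code_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)

# Keep only codes starting with "E", then sort by ID
res <- long[startsWith(long$code, "E"), ]
res <- res[order(res$IDs), ]
res
```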

How to find the clusters that produce the maximum colMeans in R?

I have a data frame like
V1 V2 V3
1 1 1 2
2 0 1 0
3 3 0 3
....
and I have a vector of the same length as the number of rows in the data frame (it's the cluster from kmeans, if that matters)
[1] 2 2 1...
From those I can get the colMeans for each cluster, like
cm1 <- colMeans(df[fit$cluster==1,])
cm2 <- colMeans(df[fit$cluster==2,])
(I don't think I should do that part explicitly, but that's how I'm thinking about the problem.)
What I want is to get, for each column of the data frame, the value from the vector for which the colMeans is the maximum. Also I'd like to do (separately is fine) the second-highest, third, etc. So in the example I would want the output to be a vector with one element for each column of the data frame:
1 2 1...
because for the first column of the data frame, the column mean for the first cluster is 3, while the column mean for the second cluster is 0.5.
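If the cluster vector has the same length as the number of rows of 'df', split the data by cluster into a list, compute colMeans for each piece and stack the result into long format, bind the pieces together with a 'cluster' column, and then apply which.max to the means of each column. Note that which.max returns the position of the maximum, which matches the cluster label when the clusters are numbered 1, 2, ... as in kmeans output.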
lst1 <- lapply(split(df, fit$cluster), function(x) stack(colMeans(x)))
dat <- do.call(rbind, Map(cbind, cluster = names(lst1), lst1))
aggregate(values ~ ind, dat, FUN = which.max)
If we need more than one cluster per column (the highest, second-highest, and so on), create a 'cluster' column in the data, reshape to 'long' format with pivot_longer (or use summarise/across), group by 'cluster' and 'name', take the mean of 'value', arrange by 'name' and by 'value' in descending order, then return the top n rows for each 'name' with slice_head
library(dplyr)
library(tidyr)
df %>%
mutate(cluster = fit$cluster) %>%
pivot_longer(cols = -cluster) %>%
group_by(cluster, name) %>%
summarise(value = mean(value), .groups = 'drop') %>%
arrange(name, desc(value)) %>%
group_by(name) %>%
slice_head(n = 2)
data
df <- structure(list(V1 = c(1L, 0L, 3L), V2 = c(1L, 1L, 0L), V3 = c(2L,
0L, 3L)), class = "data.frame", row.names = c("1", "2", "3"))
fit <- structure(list(cluster = c(2, 2, 1)), class = "data.frame",
row.names = c(NA,
-3L))
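The same idea can be sketched by first building a cluster-by-column matrix of means and then looking up the row label of the maximum in each column. This is an illustration on the sample data only; cluster is assumed to be the plain vector fit$cluster, and cm, best and second are names made up for the example.

```r
# Sample data from the question
df <- data.frame(V1 = c(1, 0, 3), V2 = c(1, 1, 0), V3 = c(2, 0, 3))
cluster <- c(2, 2, 1)  # stands in for fit$cluster

# Rows are clusters, columns are variables, cells are per-cluster column means
cm <- sapply(df, function(col) tapply(col, cluster, mean))

# Label of the cluster with the largest mean in each column
best <- rownames(cm)[apply(cm, 2, which.max)]
names(best) <- colnames(cm)

# Label of the cluster with the second-largest mean in each column
second <- apply(cm, 2, function(m) rownames(cm)[order(m, decreasing = TRUE)[2]])

best
second
```

Looking the label up through rownames(cm) keeps the result correct even if the cluster labels are not simply 1, 2, ..., k.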

separate a column into multiple variables with unique column names in R

Here is how I want my dataframe to look:
record color size height weight
1 blue large heavy
1 red
2 green small tall thin
However, the data (df) appears as follows:
record vars
1 color = "blue", size = "large"
2 color = "green", size = "small"
2 height = "tall", weight = "thin"
1 color = "red", weight = "heavy"
The code for df
df <- structure(list(record = c(1L, 2L, 2L, 1L), vars = structure(c(1L, 2L, 4L, 3L),
.Label = c("color = \"blue\", size = \"large\"",
"color = \"green\", size = \"small\"",
"color = \"red\", weight = \"heavy\"",
"height = \"tall\", weight = \"thin\""), class = "factor")),
class = "data.frame", row.names = c(NA, -4L))
For each record, I would like to separate the vars column by the "," delimiter, and create a new column with the indicated variable name...The record should be repeated if there are multiple values for a particular variable
I know that to do this with tidyverse I will need to use dplyr::group_by and dplyr::separate, however I'm not clear how to incorporate the new variable names in the "into" parameter for separate. Do I need some type of regular expression to identify any text prior to an equal sign "=" as the new variable name in "into"?? Any suggestions much welcome!
df %>%
group_by(record) %>%
separate(col = vars, into = c(regex expression?? / character vector?), sep = ",")
Since the columns are already almost written as R code defining a list, you could parse/eval them and then unnest_wider
library(tidyverse)
df %>%
# parse_expr() comes from rlang, which library(tidyverse) does not attach
mutate(vars = map(as.character(vars), ~ eval(rlang::parse_expr(paste('list(', .x, ')'))))) %>%
unnest_wider(vars)
# record color size height weight
# <int> <chr> <chr> <chr> <chr>
# 1 1 blue large NA NA
# 2 2 green small NA NA
# 3 2 NA NA tall thin
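If evaluating text as code is a concern, the key = "value" pairs can also be parsed with plain string splitting. The helper below is a hypothetical sketch (parse_pairs is not part of any package) and assumes the values contain no commas or equal signs.

```r
# Hypothetical helper: turn 'key = "value", key = "value"' into a named vector
parse_pairs <- function(s) {
  pairs <- strsplit(s, ",\\s*")[[1]]          # split into key = "value" chunks
  kv <- strsplit(pairs, "\\s*=\\s*")          # split each chunk at the equal sign
  keys <- vapply(kv, `[`, character(1), 1)
  vals <- vapply(kv, `[`, character(1), 2)
  vals <- gsub('"', "", vals, fixed = TRUE)   # drop the double quotes
  setNames(vals, keys)
}

parse_pairs('color = "blue", size = "large"')
#   color    size
#  "blue" "large"
```

The named vectors returned for each row could then be passed to unnest_wider or bound together in the same way as the evaluated lists above.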
Here is one option with tidyverse. Create a row-number column 'rn', split the 'vars' column into separate rows at each comma with separate_rows, remove the quotes with str_remove_all, separate the column into two at the "=" sign, and reshape from 'long' to 'wide' with pivot_wider
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(rn = row_number()) %>%
separate_rows(vars, sep=",\\s*\\n*") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars, into = c("vars1", "vars2"), sep="\\s*=\\s*") %>%
pivot_wider(names_from = vars1, values_from = vars2,
values_fill = list(vars2 = '')) %>%
select(-rn)
# A tibble: 3 x 5
# record color size height weight
# <int> <chr> <chr> <chr> <chr>
#1 1 blue large "" ""
#2 2 green small "" ""
#3 2 "" "" tall thin
Another way is to convert to 2-column-matrices and merge. We'll need a helper FUNction that converts a vector into a matrix with first row as the header.
FUN <- function(x) {m <- matrix(x, 2);as.data.frame(rbind(`colnames<-`(m, m[1, ])[-1, ]))}
Then just get rid of non-character stuff and merge.
l <- lapply(strsplit(trimws(gsub("\\W+", " ", as.character(df$vars))), " "), FUN)
l <- Map(`[<-`, l, 1, "record", df$record) # cbind record column
Reduce(function(...) merge(..., all=TRUE), l) # merge
# record color weight size height
# 1 1 blue <NA> large <NA>
# 2 1 red heavy <NA> <NA>
# 3 2 green thin small tall
I just noticed that all answers posted so far (including the accepted answer) do not exactly reproduce OP's expected result:
record color size height weight
1 blue large heavy
1 red
2 green small tall thin
which shows 3 rows although the input data has 4 rows.
If I understand correctly, the key-value-pairs for record 2 can be arranged in one row because there are no duplicate values for the same variable. For record 1, variable color has two values which appear in rows 1 and 2, resp., as the OP has requested:
"The record should be repeated if there are multiple values for a particular variable"
All other variables of record 1 have only one value (or none) and are arranged in row 1.
So, for each record a sub-table with a ragged bottom is created where the columns are filled from top to bottom (separately for each column).
I have tried to reproduce this in two different ways: First with data.table which I am more fluent with and then with dplyr/tidyr. Finally, I will propose an alternative presentation of duplicate values using toString().
data.table
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
, V1 := str_remove_all(V1, '"')][
, tstrsplit(V1, " = "), by = .(rn, record)][
, dcast(.SD, record + rowid(record, V1) ~ fct_inorder(V1), value.var = "V2")][
, record_1 := NULL][]
record color size height weight
1: 1 blue large <NA> heavy
2: 1 red <NA> <NA> <NA>
3: 2 green small tall thin
This works in 5 steps:
1. Split multiple key-value-pairs in each row and arrange in separate rows.
2. Remove double quotes.
3. Split key-value-pairs and arrange in separate columns.
4. Reshape from long to wide format where the rows are given by record and by a count of each individual key within record using rowid() and the columns are given by the keys (variables). Using fct_inorder() ensures the columns are arranged in order of appearance of the variables (just to reproduce exactly OP's expected result).
5. Drop the helper column from the final result.
To be even more consistent with OP's expected result, the NAs can be turned into blanks by adding the parameter fill = "" to the dcast() call.
dplyr / tidyr
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate_rows(vars, sep = ", ") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars,c("key", "val")) %>%
group_by(record, key) %>%
mutate(keyid = row_number(key)) %>%
pivot_wider(id_cols = c(record, keyid), names_from = key, values_from = val) %>%
arrange(record, keyid) %>%
select(-keyid)
# A tibble: 3 x 5
# Groups: record [2]
record color size height weight
<int> <chr> <chr> <chr> <chr>
1 1 blue large NA heavy
2 1 red NA NA NA
3 2 green small tall thin
The steps are essentially the same as for the data.table approach. The statements
group_by(record, key) %>%
mutate(keyid = row_number(key))
are a replacement for data.table::rowid().
Add the parameter values_fill = list(val = "") to replace the NAs with blanks.
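The per-key counter that both approaches rely on can be illustrated in base R with ave. This is a sketch on hand-written long-format vectors that mirror the sample data after splitting; record, key and keyid are names chosen for the example.

```r
# Long-format keys as they appear after splitting the sample data
record <- c(1, 1, 2, 2, 2, 2, 1, 1)
key <- c("color", "size", "color", "size", "height", "weight", "color", "weight")

# Running count of each key within each record
keyid <- ave(seq_along(key), record, key, FUN = seq_along)
keyid
# [1] 1 1 1 1 1 1 2 1
```

Only the second color of record 1 gets keyid 2, which is what pushes "red" into a separate row when reshaping to wide format.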
Alternative representation
The following does not aim at reproducing OP's expected result as closely as possible but proposes an alternative, more concise representation of the result with one row per record.
During reshaping, a function can be used to aggregate the data in each cell. The toString() function concatenates character strings.
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
, V1 := str_remove_all(V1, '"')][
, tstrsplit(V1, " = "), by = .(rn, record)][
, dcast(.SD, record ~ fct_inorder(V1), toString, value.var = "V2")]
record color size height weight
1: 1 blue, red large heavy
2: 2 green small tall thin
or
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate_rows(vars, sep = ", ") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars,c("key", "val")) %>%
pivot_wider(names_from = key, values_from = val, values_fn = list(val = toString))
# A tibble: 2 x 5
record color size height weight
<int> <chr> <chr> <chr> <chr>
1 1 blue, red large NA heavy
2 2 green small tall thin

find duplicates with grouped variables

I guess this can be done with dplyr and some kind of duplicate check, but I don't know how to address multiple columns while distinguishing between the values of a grouping variable. I have a df that looks like this:
from to group
1 2 metro
2 4 metro
3 4 metro
4 5 train
6 1 train
8 7 train
I want to find the ids which exist in more than one group variable.
The expected result for the sample df is: 1 and 4. Because they exist in the metro and the train group.
Thank you in advance!
Using base R we can split the first two columns based on group and find the intersecting value between the groups using intersect
Reduce(intersect, split(unlist(df[1:2]), df$group))
#[1] 1 4
We gather the 'from', 'to' columns to 'long' format, grouped by 'val', filter the groups having more than one unique elements, then pull the unique 'val' elements
library(dplyr)
library(tidyr)
df1 %>%
gather(key, val, from:to) %>%
group_by(val) %>%
filter(n_distinct(group) > 1) %>%
distinct(val) %>%
pull(val)
#[1] 1 4
Or using base R we can just use table to find the frequency, and get the ids out of it
out <- with(df1, colSums(table(rep(group, 2), unlist(df1[1:2])) > 0)) > 1
names(which(out))
#[1] "1" "4"
data
df1 <- structure(list(from = c(1L, 2L, 3L, 4L, 6L, 8L), to = c(2L, 4L,
4L, 5L, 1L, 7L), group = c("metro", "metro", "metro", "train",
"train", "train")), class = "data.frame", row.names = c(NA, -6L
))
Convert the data to long format and count the unique groups per value, using data.table. melt converts to long format, and data.table's DT[i, j, by] form allows filtering in i, computing or extracting in j, and grouping in by.
library(data.table)
library(magrittr)
setDT(df1)
melt(df1, 'group') %>%
.[, .(n = uniqueN(group)), value] %>%
.[n > 1, unique(value)]
# [1] 1 4
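The counting step shared by these answers can be sketched in base R with tapply: stack the two id columns, repeat the group labels to match, and count distinct groups per id. The names ids, grp, n_groups and shared are made up for the example.

```r
# Sample data from the question
df1 <- data.frame(from = c(1L, 2L, 3L, 4L, 6L, 8L),
                  to = c(2L, 4L, 4L, 5L, 1L, 7L),
                  group = c("metro", "metro", "metro", "train", "train", "train"),
                  stringsAsFactors = FALSE)

# Stack both id columns and repeat the group labels alongside them
ids <- c(df1$from, df1$to)
grp <- rep(df1$group, 2)

# Number of distinct groups each id appears in
n_groups <- tapply(grp, ids, function(g) length(unique(g)))

# Ids that occur in more than one group
shared <- as.integer(names(n_groups)[n_groups > 1])
shared
# [1] 1 4
```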

Split column into intervals based on row content

I am trying to convert a single-column data frame into separate columns — the main descriptor in the data is the "item number" and then includes information on the price, date, color, etc. I would just split the column depending on row number, but since each item has a different amount of information, that doesn't really work.
I've been playing around with this a bit but haven't found anything at all to come close, as I can't use regex to create a separate column (using str_which, for example) since the information differs so much item to item. How can I use regex to create intervals that I can then split the column into (so I need the information between each row containing "item" in a separate column). Sample data is below.
data
item 1
$600
red
item 2
$70
item 3
$430
orange
10/11/2017
Thank you!
Here is a function that reformats your data into one of two layouts. You supply the data frame DF, the column var, a vector of column names colnames in the order in which the details appear, and byitem to choose the output format (the default TRUE returns one row per item, FALSE returns one column per item):
library(tidyverse)
df_transform = function(DF, var, colnames, byitem = TRUE){
if(byitem){
ID = sym("rowid")
}else{
ID = sym("id")
}
DF %>%
group_by(id = paste0("item", cumsum(grepl("item", var)))) %>%
mutate(rowid = replace(2:n(), 2:n(), setNames(colnames[1:(n()-1)], 2:n()))) %>%
filter(!grepl("item", var)) %>%
spread(!!ID, var)
}
Output:
> df_transform(df, var, c("price", "color", "date"))
# A tibble: 3 x 4
# Groups: id [3]
id color date price
<chr> <fct> <fct> <fct>
1 item1 red <NA> $600
2 item2 <NA> <NA> $70
3 item3 orange 10/11/2017 $430
> df_transform(df, var, c("price", "color", "date"), byitem = FALSE)
# A tibble: 3 x 4
rowid item1 item2 item3
<chr> <fct> <fct> <fct>
1 color red <NA> orange
2 date <NA> <NA> 10/11/2017
3 price $600 $70 $430
Note that this would not work if you have missing values in the middle, since the column names are assigned by position.
Data:
df <- structure(list(var = structure(c(5L, 2L, 9L, 6L, 3L, 7L, 1L,
8L, 4L), .Label = c("$430", "$600", "$70", "10/11/2017", "item_1",
"item_2", "item_3", "orange", "red"), class = "factor")), .Names = "var", class = "data.frame", row.names = c(NA,
-9L))
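The grouping trick inside the function, cumsum over the rows that match "item", can be seen in isolation in this base R sketch. It works on a plain character vector typed from the question's sample; is_header, grp and parts are names made up for the example.

```r
# Sample column from the question as a plain character vector
x <- c("item 1", "$600", "red", "item 2", "$70",
       "item 3", "$430", "orange", "10/11/2017")

# Rows that start a new item, and a running item number for every row
is_header <- grepl("^item", x)
grp <- cumsum(is_header)

# Split the detail rows by item; the factor levels keep items that have no details
parts <- split(x[!is_header],
               factor(grp[!is_header], levels = seq_len(sum(is_header))))
names(parts) <- x[is_header]
parts
```

Each list element holds however many detail rows that item has, which is the ragged structure the function then spreads into columns by position.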
