How do I add row values to colnames in R

I have a dataframe and I would like to add the first row to the names of the columns.
What I have:
col1   col2    col3
city   state   country
...    ...     ...
What I want:
col1_city   col2_state   col3_country
city        state        country
...         ...          ...
I can't do it manually because there are many cols in the df
I think of something like
df %>% rename_with(~ names(.) %>%
map_chr(~glue('{.x}_.[1,])))
Thanks!!

With rename_with
df %>%
rename_with(.cols = everything(),
.fn = ~paste0(colnames(df), '_', df[1,]))
Update: Here's a solution where you can pass the current data as it is created/altered within a pipe:
df |>
(\(x) (x <- x |>
rename_with(.cols = everything(),
.fn = ~paste0(colnames(x), '_', x[1,]))))()
So here you could, for example, do some filtering before the renaming or some mutating or whatever you want.

In base R, just do
names(df) <- paste0(names(df), "_", unlist(df[1,]))
-output
> df
col1_city col2_state col3_country
1 city state country
Or with dplyr, purrr, and stringr
library(dplyr)
library(purrr) # set_names() comes from purrr, not dplyr
library(stringr)
df %>%
set_names(str_c(names(.), '_', unlist(slice(., 1))))
-output
col1_city col2_state col3_country
1 city state country
data
df <- structure(list(col1 = "city", col2 = "state",
col3 = "country"), class = "data.frame", row.names = c(NA,
-1L))
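If some columns are factors, pasting the first row directly can pick up integer codes instead of labels. A minimal base R sketch that converts each first-row value with as.character() first; the helper name add_first_row_to_names and the small example data are made up for illustration:

```r
# Hypothetical helper: append the first row's values to the column names.
add_first_row_to_names <- function(df, sep = "_") {
  # Convert column by column so factor columns yield labels, not integer codes
  first_row <- vapply(df[1, , drop = FALSE], as.character, character(1))
  names(df) <- paste(names(df), first_row, sep = sep)
  df
}

# Made-up example data; col2 is deliberately a factor
df <- data.frame(
  col1 = c("city", "x"),
  col2 = factor(c("state", "y")),
  col3 = c("country", "z"),
  stringsAsFactors = FALSE
)

renamed <- add_first_row_to_names(df)
names(renamed)
```

The data itself is left untouched, so the first row is still present afterwards; drop it with renamed[-1, ] if it was only a header row.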

Related

Return only groups that have correct combination of values in another column

I would like to filter out any group that does not have some desired values in a column. Here I want to return the data.frame where a grp has both "var1" and "var2" in x. I can do it with summarise and paste, but that feels a bit clunky.
library(dplyr)
library(tibble)
dat <- tibble(
x = c("var1", "var2", "var1"),
grp = c("grp1", "grp1", "grp2")
)
dat_summary <- dat %>%
group_by(grp) %>%
summarise(both_vars = paste(x, collapse = ", ")) %>%
filter(both_vars == "var1, var2")
dat_summary$grp
#> [1] "grp1"
We may use filter to keep only the groups that have both 'var1' and 'var2' in the 'x' column, and then pull the distinct elements of 'grp'
library(dplyr)
dat %>%
group_by(grp) %>%
filter(all(c("var1", "var2") %in% x)) %>%
distinct(grp) %>%
pull(grp)
[1] "grp1"
Or in base R
Reduce(intersect, with(subset(dat, x %in% c("var1", "var2")), split(grp, x)))
[1] "grp1"
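If you also want the rows of the qualifying groups rather than just the group names, here is a small base R sketch of the same "all required values present" check. The names required, has_all and keep are only illustrative:

```r
dat <- data.frame(
  x   = c("var1", "var2", "var1"),
  grp = c("grp1", "grp1", "grp2"),
  stringsAsFactors = FALSE
)

# Values every group must contain (illustrative name)
required <- c("var1", "var2")

# One TRUE/FALSE per group: does the group contain all required values?
has_all <- tapply(dat$x, dat$grp, function(v) all(required %in% v))

# Names of the groups that pass, then the rows belonging to them
keep <- names(has_all)[has_all]
result <- dat[dat$grp %in% keep, ]
```

Changing required to a longer vector extends the check to any number of values without touching the rest.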

Pivot wider in R with multiple columns

I am having trouble converting a particular dataset from long to wide.
col1    col2
ID      55.
animal  dog
animal  bear
animal  rabbit
shape   circle
ID      67.
animal  cat
shape   square
As you can see, some IDs have multiple observations for "animal", so I want to make multiple columns like this:
ID   animal  animal2  animal3  shape
55.  dog     bear     rabbit   circle
67.  cat     NA       NA       square
Any help is appreciated!
Try this solution.
Most of the work was creating a separate ID column and then creating the unique names for the columns.
library(tidyr)
library(dplyr)
library(vctrs)
df<- structure(list(col1 = c("ID", "animal", "animal", "animal", "shape", "ID", "animal", "shape"),
col2 = c("55.", "dog", "bear", "rabbit", "circle", "67.", "cat", "square")),
class = "data.frame", row.names = c(NA, -8L))
#create the ID column
df$ID <- NA
#find the ID rows
idrows <- which(df$col1 == "ID")
#fill column and delete rows
df$ID[idrows] <- df$col2[idrows]
df <- fill(df, ID, .direction = "down")
df <- df[-idrows, ]
#create unique names in each grouping and then pivot wider
df %>% group_by(ID) %>%
mutate(col1=vec_as_names(col1, repair = "unique")) %>%
#strip only a trailing "...1" so that suffixes such as "...10" stay intact
mutate(col1=stringr::str_replace( col1, "\\.+1$", "")) %>%
ungroup() %>%
pivot_wider(id_cols = "ID", names_from = "col1", values_from = "col2")
ID animal animal...2 animal...3 shape
<chr> <chr> <chr> <chr> <chr>
1 55. dog bear rabbit circle
2 67. cat NA NA square
Other alternatives, based on one of your previous questions:
df %>% group_by(ID) %>%
mutate(col1 = paste0(col1, data.table::rowid(col1))) %>%
ungroup() %>%
pivot_wider(id_cols = "ID", names_from = "col1", values_from = "col2")
or
df %>%
pivot_wider(id_cols = "ID", names_from = "col1", values_from = "col2") %>%
unnest_wider( "shape", names_sep = "_") %>% unnest_wider( "animal", names_sep = "_")
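For comparison, the same steps (carry the ID down, number repeated keys within each ID, then spread) can be sketched in base R only. This is just one possible way to do it; the intermediate names is_id, long and n are made up here:

```r
df <- data.frame(
  col1 = c("ID", "animal", "animal", "animal", "shape", "ID", "animal", "shape"),
  col2 = c("55.", "dog", "bear", "rabbit", "circle", "67.", "cat", "square"),
  stringsAsFactors = FALSE
)

# Carry each ID value down to the rows that follow it
is_id <- df$col1 == "ID"
id <- df$col2[is_id][cumsum(is_id)]

# Long table without the ID marker rows
long <- data.frame(ID = id, key = df$col1, value = df$col2,
                   stringsAsFactors = FALSE)[!is_id, ]

# Number repeated keys within each ID: animal, animal2, animal3, ...
n <- ave(seq_along(long$key), long$ID, long$key, FUN = seq_along)
long$key <- ifelse(n == 1, long$key, paste0(long$key, n))

# Spread to wide; reshape() prefixes the new columns with "value."
wide <- reshape(long, idvar = "ID", timevar = "key", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))
wide
```

IDs with fewer animals get NA in the extra columns, as in the tidyr versions.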

Is there a more efficient way to handle facts which are duplicating in an R dataframe?

I have a dataframe which looks like this:
ID <- c(1,1,1,2,2,2,2,3,3,3,3)
Fact <- c(233,233,233,50,50,50,50,15,15,15,15)
Overall_Category <- c("Purchaser","Purchaser","Purchaser","Car","Car","Car","Car","Car","Car","Car","Car")
Descriptor <- c("Country", "Gender", "Eyes", "Color", "Financed", "Type", "Transmission", "Color", "Financed", "Type", "Transmission")
Members <- c("America", "Male", "Brown", "Red", "Yes", "Sedan", "Manual", "Blue","No", "Van", "Automatic")
df <- data.frame(ID, Fact, Overall_Category, Descriptor, Members)
The dataframe's dimensions work like this:
There will always be an ID/key which uniquely identifies a submitted fact.
There will always be a dimension defining the Overall_Category to which a submitted fact belongs.
Most of the time, but not always, there will be a "Descriptor" dimension.
If there is a "Descriptor" dimension for a given fact, there will be another "Members" dimension to show possible members within "Descriptor".
The problem is that a single submitted fact is duplicated for a given ID based on how many dimensions apply to the given fact. What I'd like is a way to show the fact only once, based on its ID, and have the applicable dimensions stored against that single ID.
I've achieved it by doing this:
df1 <- pivot_wider(df,
id_cols = ID,
names_from = c(Overall_Category, Descriptor, Members),
names_prefix = "zzzz",
values_from = Fact,
names_sep = "-",
names_repair = "unique")
ColumnNames <- df1 %>% select(matches("zzzz")) %>% colnames()
df2 <- df1 %>% mutate(mean_sel = rowMeans(select(., ColumnNames), na.rm = T))
df3 <- df2 %>% mutate_at(ColumnNames, function(x) ifelse(!is.na(x), deparse(substitute(x)), NA))
df3 <- df3 %>% unite('Descriptor', ColumnNames, na.rm = T, sep = "_")
df3 <- df3 %>% mutate_at("Descriptor", str_replace_all, "zzzz", "")
But it seems like it wouldn't scale well for facts with many dimensions because of the pivot_wider step, and in general it doesn't seem like a very efficient approach.
Is there a better way to do this?
You can unite the three dimension columns into one, then for each ID paste the combined values together and take the mean of the Fact values.
library(dplyr)
library(tidyr)
df %>%
unite(Descriptor, Overall_Category:Members, sep = '-', na.rm = TRUE) %>%
group_by(ID) %>%
summarise(Descriptor = paste0(Descriptor, collapse = '_'),
mean_sel = mean(Fact, na.rm = TRUE))
# ID Descriptor mean_sel
# <dbl> <chr> <dbl>
#1 1 Purchaser-Country-America_Purchaser-Gender-Male_Purchas… 233
#2 2 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Trans… 50
#3 3 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmi… 15
I think you just need paste with the sep and collapse arguments
library(dplyr, warn.conflicts = FALSE)
df %>% group_by(ID, Fact) %>%
summarise(Descriptor = paste(paste(Overall_Category, Descriptor, Members, sep = '-'), collapse = '_'), .groups = 'drop')
# A tibble: 3 x 3
ID Fact Descriptor
<dbl> <dbl> <chr>
1 1 233 Purchaser-Country-America_Purchaser-Gender-Male_Purchaser-Eyes-Brown
2 2 50 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Transmission-Manual
3 3 15 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmission-Automatic
An option with str_c
library(dplyr)
library(stringr)
df %>%
group_by(ID, Fact) %>%
summarise(Descriptor = str_c(Overall_Category, Descriptor, Members, sep= "-", collapse="_"), .groups = 'drop')
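The same idea also works without any packages. A rough base R sketch, using a shortened subset of the question's data and a made-up helper column called label; it assumes no dimension value is NA, since paste() would turn those into the string "NA":

```r
# Shortened subset of the question's data
df <- data.frame(
  ID = c(1, 1, 1, 2, 2),
  Fact = c(233, 233, 233, 50, 50),
  Overall_Category = c("Purchaser", "Purchaser", "Purchaser", "Car", "Car"),
  Descriptor = c("Country", "Gender", "Eyes", "Color", "Financed"),
  Members = c("America", "Male", "Brown", "Red", "Yes"),
  stringsAsFactors = FALSE
)

# One label per row: category-descriptor-member
df$label <- paste(df$Overall_Category, df$Descriptor, df$Members, sep = "-")

# One row per fact, with all labels of that fact collapsed into a single string
out <- aggregate(label ~ ID + Fact, data = df, FUN = paste, collapse = "_")
out <- out[order(out$ID), ]
out
```

Because Fact is part of the grouping, it is kept as its own column and nothing has to be averaged.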

removing groups with a certain NA number

Sorry to bother with a relatively simple question perhaps.
I have this type of dataframe:
A long list of names in the column "NAME" c(a, b, c, d, e ...) , two potential classes in the column "SURNAME" c(A, B) and a third column containing values.
I want to remove all NAMES for which at least in one of the SURNAME classes I have more than 2 "NA" in the VALUE column.
I wanted to post an example dataset but I am struggling to format it properly
I was trying to use
df <- df %>%
group_by(NAME) %>%
group_by(SURNAME) %>%
filter(!is.na(VALUE)) %>%
filter(length(VALUE)>=3)
it does not throw an error but I have the impression that something is wrong. Any suggestion? Many thanks
Let's create a dataset to work with:
set.seed(1234)
df <- data.frame(
name = sample(x=letters, size=1e3, replace=TRUE),
surname = sample(x=c("A", "B"), size=1e3, replace=TRUE),
value = sample(x=c(1:10*10,NA), size=1e3, replace=TRUE),
stringsAsFactors = FALSE
)
Here's how to do it with Base R:
# count NAs by name-surname combos (na.action arg is important!)
agg <- aggregate(value ~ name + surname, data=df, FUN=function(x) sum(is.na(x)), na.action=NULL)
# rename the count-of-NAs column
names(agg)[3] <- "number_of_na"
#add count of NAs back to original data
df <- merge(df, agg, by=c("name", "surname"))
# subset the original data
result <- df[df$number_of_na < 3, ]
Here's how to do it with data.table:
library(data.table)
dt <- as.data.table(df)
dt[ , number_of_na := sum(is.na(value)), by=.(name, surname)]
result <- dt[number_of_na < 3]
Here's how to do it with dplyr/tidyverse:
library(dplyr) # or library(tidyverse)
result <- df %>%
group_by(name, surname) %>%
summarize(number_of_na = sum(is.na(value))) %>%
right_join(df, by=c("name", "surname")) %>%
filter(number_of_na < 3)
After grouping by 'NAME' and 'SURNAME', create a column 'ind' with the number of NA elements in that group, and then remove every 'NAME' that has an 'ind' greater than or equal to 3 in any of its 'SURNAME' classes
df %>%
group_by(NAME, SURNAME) %>%
mutate(ind = sum(is.na(VALUE))) %>%
group_by(NAME) %>%
filter(!any(ind >=3)) %>%
select(-ind)
Or do an anti_join after doing the filtering by 'NAME', 'SURNAME' based on the condition
df %>%
group_by(NAME, SURNAME) %>%
filter(sum(is.na(VALUE))>=3) %>%
ungroup %>%
distinct(NAME) %>%
anti_join(df, .)
data
set.seed(24)
df <- data.frame(NAME = rep(letters[1:5], each = 20),
SURNAME = sample(LETTERS[1:4], 5 * 20, replace = TRUE),
VALUE = sample(c(NA, 1:3), 5 *20, replace = TRUE),
stringsAsFactors = FALSE)
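A base R sketch of the same logic on a tiny hand-made dataset, so the effect is easy to check by eye. The data and the names na_count and bad_names are made up for illustration:

```r
# Hand-made data: "b" has 3 NAs in class "A", "a" has at most 2 in any class
df <- data.frame(
  NAME    = rep(c("a", "b"), each = 6),
  SURNAME = rep(rep(c("A", "B"), each = 3), times = 2),
  VALUE   = c(1, NA, NA, 2, 3, 4, NA, NA, NA, 5, 6, 7),
  stringsAsFactors = FALSE
)

# Number of NAs within each NAME/SURNAME combination, repeated on every row
na_count <- ave(as.integer(is.na(df$VALUE)), df$NAME, df$SURNAME, FUN = sum)

# A NAME is dropped entirely if any of its SURNAME classes has more than 2 NAs
bad_names <- unique(df$NAME[na_count >= 3])
result <- df[!df$NAME %in% bad_names, ]
result
```

Here all rows of "b" are removed, including its complete "B" class, while "a" is kept together with its two NAs.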

Sample groups and preserve row order

I have a dataframe such as:
df <- data.frame(id = factor(c(12321,12321,12321,4445,4445,4445,4445,787,787,787)),
word = c("please", "stop", "that", "the", "fox", "jumps", "that", "please", "eat", "noodles"),
word_id = c(12,5,28,99,214,800,28,12,78,912))
And I am attempting to take a sample of the data frame while preserving the id group and the word and word_id order.
I tried newDF <- df %>% group_by(id) %>% sample_frac(0.33) but this takes a sample of each group.
I would like to end up with a dataframe that contains a sample of whole id groups from the original dataframe and preserves the order of the rows. So if I take a 33% sample of df, I will end up with 33% of the id groups and their rows remain in order.
newDF <- data.frame(id = factor(c(12321,12321,12321,4445,4445,4445,4445)),
word = c("please", "stop", "that", "the", "fox", "jumps", "that"),
word_id = c(12,5,28,99,214,800,28))
Adding to alistaire's comment:
library(dplyr)
library(tidyr)
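newDF1 <- df %>%
group_by(id) %>%
nest() %>%
ungroup() %>% # nest() keeps the grouping in tidyr >= 1.0; drop it so whole groups are sampled
sample_frac(1/3) %>%
unnest(cols = data)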
newDF2 <- anti_join(df, newDF1, by = "id")
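Another way to think about it, sketched in base R: sample the ids themselves and then subset the original data, which leaves the rows in their original order. The names ids, n_keep and keep are only illustrative, and rounding to at least one group is an assumption on my side:

```r
df <- data.frame(
  id = factor(c(12321, 12321, 12321, 4445, 4445, 4445, 4445, 787, 787, 787)),
  word = c("please", "stop", "that", "the", "fox", "jumps", "that",
           "please", "eat", "noodles"),
  word_id = c(12, 5, 28, 99, 214, 800, 28, 12, 78, 912),
  stringsAsFactors = FALSE
)

set.seed(1)

# Sample whole groups: draw ids, not rows
ids <- as.character(unique(df$id))
n_keep <- max(1, round(length(ids) / 3))  # assumption: keep at least one group
keep <- sample(ids, n_keep)

# Logical subsetting keeps the rows in their original order
newDF <- df[df$id %in% keep, ]
rest  <- df[!df$id %in% keep, ]
```

rest holds the groups that were not sampled, like newDF2 above.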
