Pivot wider in R with multiple columns - r

I am having trouble converting a particular dataset from long to wide.
col1    col2
ID      55.
animal  dog
animal  bear
animal  rabbit
shape   circle
ID      67.
animal  cat
shape   square
As you can see, some IDs have multiple observations for "animal" and so I want to make multiple columns like this:
ID   animal  animal2  animal3  shape
55.  dog     bear     rabbit   circle
67.  cat     NA       NA       square
Any help is appreciated!

Try this solution.
Most of the work is creating a separate ID column and then generating unique names for the columns.
library(tidyr)
library(dplyr)
library(vctrs)
df<- structure(list(col1 = c("ID", "animal", "animal", "animal", "shape", "ID", "animal", "shape"),
col2 = c("55.", "dog", "bear", "rabbit", "circle", "67.", "cat", "square")),
class = "data.frame", row.names = c(NA, -8L))
#create the ID column
df$ID <- NA
#find the ID rows
idrows <- which(df$col1 == "ID")
#fill column and delete rows
df$ID[idrows] <- df$col2[idrows]
df <- fill(df, ID, .direction = "down")
df <- df[-idrows, ]
#create unique names within each ID group and then pivot wider
df %>% group_by(ID) %>%
  mutate(col1=vec_as_names(col1, repair = "unique")) %>%
  #strip only a trailing "...1" so that later suffixes such as "...10" stay intact
  mutate(col1=stringr::str_replace( col1, "\\.{3}1$", "")) %>%
  ungroup() %>%
  pivot_wider(id_cols = "ID", names_from = "col1", values_from = "col2")
ID animal animal...2 animal...3 shape
<chr> <chr> <chr> <chr> <chr>
1 55. dog bear rabbit circle
2 67. cat NA NA square
Another alternative, based on one of your previous questions:
df %>% group_by(ID) %>%
mutate(col1 = paste0(col1, data.table::rowid(col1))) %>%
ungroup() %>%
pivot_wider(id_cols = "ID", names_from = "col1", values_from = "col2")
or
df %>%
pivot_wider(id_cols = "ID", names_from = "col1", values_from = "col2") %>%
unnest_wider( "shape", names_sep = "_") %>% unnest_wider( "animal", names_sep = "_")
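If you want exactly the column names from the question (animal, animal2, animal3, shape), here is a minimal sketch that numbers repeated names within each ID and leaves the first occurrence unsuffixed. The helper columns idx and name are made up for illustration, and the data is the same sample as above.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  col1 = c("ID", "animal", "animal", "animal", "shape", "ID", "animal", "shape"),
  col2 = c("55.", "dog", "bear", "rabbit", "circle", "67.", "cat", "square")
)

wide <- df %>%
  # copy each ID value into its own column, then carry it down to the rows below
  mutate(ID = if_else(col1 == "ID", col2, NA_character_)) %>%
  fill(ID) %>%
  filter(col1 != "ID") %>%
  # number repeated names within each ID; the first occurrence keeps the bare name
  group_by(ID, col1) %>%
  mutate(idx = row_number(),
         name = if_else(idx == 1L, col1, paste0(col1, idx))) %>%
  ungroup() %>%
  select(ID, name, col2) %>%
  pivot_wider(names_from = name, values_from = col2)

wide
```

IDs with fewer animals than the maximum get NA in the extra columns, as in the desired output.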

Related

How do I add row values to colnames in R

I have a dataframe and I would like to add the first row to the names of the columns
What I have:
col1  col2   col3
city  state  country
...   ...    ...
What I want:
col1_city  col2_state  col3_country
city       state       country
...        ...         ...
I can't do it manually because the df has many columns.
I was thinking of something like:
df %>% rename_with(~ names(.) %>%
map_chr(~glue('{.x}_.[1,])))
Thanks!!
With rename_with
df %>%
rename_with(.cols = everything(),
.fn = ~paste0(colnames(df), '_', df[1,]))
Update: Here's a solution where you can pass the current data as it is created/altered within a pipe:
df |>
(\(x) (x <- x |>
rename_with(.cols = everything(),
.fn = ~paste0(colnames(x), '_', x[1,]))))()
So here you could, for example, do some filtering before the renaming or some mutating or whatever you want.
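A minimal sketch of that idea, assuming a made-up df whose first row is junk and has to be filtered out before the renaming (the data and the name renamed are illustrative only). Inside the anonymous function, x is the data as it exists at that point in the pipe, so the new names come from the filtered first row.

```r
library(dplyr)

# hypothetical data: the first row is junk, the second holds the labels
df <- data.frame(
  col1 = c("junk", "city", "x"),
  col2 = c("junk", "state", "y"),
  col3 = c("junk", "country", "z")
)

renamed <- df |>
  filter(col1 != "junk") |>
  # x is the filtered data, so x[1, ] is now the city/state/country row
  (\(x) rename_with(x, .fn = ~ paste0(.x, "_", unlist(x[1, ]))))()

names(renamed)
```

This needs R 4.1 or later for the native pipe and the \(x) shorthand.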
In base R, just do
names(df) <- paste0(names(df), "_", unlist(df[1,]))
-output
> df
col1_city col2_state col3_country
1 city state country
Or with dplyr
library(dplyr)
library(purrr)   # set_names() comes from purrr, not dplyr
library(stringr)
df %>%
  set_names(str_c(names(.), '_', unlist(slice(., 1))))
-output
col1_city col2_state col3_country
1 city state country
data
df <- structure(list(col1 = "city", col2 = "state",
col3 = "country"), class = "data.frame", row.names = c(NA,
-1L))

How do I pivot columns?

I found this very disorganized dataframe in an Excel file. This is just a sample of a bigger dataset, with many jobs.
df <- data.frame(
Job = c("Frequency", "Driver", "Operator"),
Gloves = c("Daily", 1,2),
Aprons = c("Weekly", 2,0)
)
I need it to be in this format, something that I can work with in a database:
df <- data.frame(
Job = c("Driver", "Driver", "Operator", "Operator"),
Frequency= c("Daily", "Weekly", "Daily", "Weekly"),
Item= c("Gloves", "Aprons", "Gloves", "Aprons"),
Quantity= c(1,2,2,0)
)
Any thoughts on how to manipulate the data? I have tried without any luck.
We could use tidyverse methods, doing this in three steps:
1. Remove the first row with slice(-1) and reshape to 'long' format (pivot_longer)
2. Keep only the first row with slice(1) and reshape to 'long' format (pivot_longer)
3. Join the two reshaped datasets
library(dplyr)
library(tidyr)
df %>%
slice(-1) %>%
pivot_longer(cols = -Job, names_to = 'Item',
values_to = 'Quantity') %>%
left_join(df %>%
slice(1) %>%
pivot_longer(cols= -Job, values_to = 'Frequency',
names_to = 'Item') %>%
select(-Job), by = 'Item')
-output
# A tibble: 4 x 4
Job Item Quantity Frequency
<chr> <chr> <chr> <chr>
1 Driver Gloves 1 Daily
2 Driver Aprons 2 Weekly
3 Operator Gloves 2 Daily
4 Operator Aprons 0 Weekly
data
df <- data.frame(
Job = c("Frequency", "Driver", "Operator"),
Gloves = c("Daily", 1,2),
Aprons = c("Weekly", 2,0))

Is there a more efficient way to handle facts which are duplicating in an R dataframe?

I have a dataframe which looks like this:
ID <- c(1,1,1,2,2,2,2,3,3,3,3)
Fact <- c(233,233,233,50,50,50,50,15,15,15,15)
Overall_Category <- c("Purchaser","Purchaser","Purchaser","Car","Car","Car","Car","Car","Car","Car","Car")
Descriptor <- c("Country", "Gender", "Eyes", "Color", "Financed", "Type", "Transmission", "Color", "Financed", "Type", "Transmission")
Members <- c("America", "Male", "Brown", "Red", "Yes", "Sedan", "Manual", "Blue","No", "Van", "Automatic")
df <- data.frame(ID, Fact, Overall_Category, Descriptor, Members)
The dataframe's dimensions work like this:
There will always be an ID/key which singularly and uniquely identifies a submitted fact.
There will always be a dimension for a given fact defining the Overall_Category to which a submitted fact belongs.
Most of the time, but not always, there will be a dimension for a "Descriptor".
If there is a "Descriptor" dimension for a given fact, there will be another "Members" dimension to show possible members within "Descriptor".
The problem is that a single submitted fact is duplicated for a given ID based on how many dimensions apply to the given fact. What I'd like is a way to show the fact only once, based on its ID, and have the applicable dimensions stored against that single ID.
I've achieved it by doing this:
df1 <- pivot_wider(df,
id_cols = ID,
names_from = c(Overall_Category, Descriptor, Members),
names_prefix = "zzzz",
values_from = Fact,
names_sep = "-",
names_repair = "unique")
ColumnNames <- df1 %>% select(matches("zzzz")) %>% colnames()
df2 <- df1 %>% mutate(mean_sel = rowMeans(select(., ColumnNames), na.rm = T))
df3 <- df2 %>% mutate_at(ColumnNames, function(x) ifelse(!is.na(x), deparse(substitute(x)), NA))
df3 <- df3 %>% unite('Descriptor', ColumnNames, na.rm = T, sep = "_")
df3 <- df3 %>% mutate_at("Descriptor", str_replace_all, "zzzz", "")
But it seems like it wouldn't scale well for facts with many dimensions because of the pivot_wider step, and in general it doesn't seem like a very efficient approach.
Is there a better way to do this?
You can unite the columns, then for each ID paste them together and take the average of the Fact values.
library(dplyr)
library(tidyr)
df %>%
unite(Descriptor, Overall_Category:Members, sep = '-', na.rm = TRUE) %>%
group_by(ID) %>%
summarise(Descriptor = paste0(Descriptor, collapse = '_'),
mean_sel = mean(Fact, na.rm = TRUE))
# ID Descriptor mean_sel
# <dbl> <chr> <dbl>
#1 1 Purchaser-Country-America_Purchaser-Gender-Male_Purchas… 233
#2 2 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Trans… 50
#3 3 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmi… 15
I think you just want paste with the sep and collapse arguments
library(dplyr, warn.conflicts = FALSE)
df %>% group_by(ID, Fact) %>%
summarise(Descriptor = paste(paste(Overall_Category, Descriptor, Members, sep = '-'), collapse = '_'), .groups = 'drop')
# A tibble: 3 x 3
ID Fact Descriptor
<dbl> <dbl> <chr>
1 1 233 Purchaser-Country-America_Purchaser-Gender-Male_Purchaser-Eyes-Brown
2 2 50 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Transmission-Manual
3 3 15 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmission-Automatic
An option with str_c
library(dplyr)
library(stringr)
df %>%
group_by(ID, Fact) %>%
summarise(Descriptor = str_c(Overall_Category, Descriptor, Members, sep= "-", collapse="_"), .groups = 'drop')
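If the dimensions need to stay queryable instead of being pasted into one string, another option is a list-column. This is only a sketch, assuming tidyr's nest() suits your storage needs; the column name dimensions and the object nested are made up for illustration. Each ID/Fact pair appears once, and its dimensions are kept as a small data frame next to it.

```r
library(dplyr)
library(tidyr)

ID <- c(1,1,1,2,2,2,2,3,3,3,3)
Fact <- c(233,233,233,50,50,50,50,15,15,15,15)
Overall_Category <- c("Purchaser","Purchaser","Purchaser","Car","Car","Car","Car","Car","Car","Car","Car")
Descriptor <- c("Country", "Gender", "Eyes", "Color", "Financed", "Type", "Transmission", "Color", "Financed", "Type", "Transmission")
Members <- c("America", "Male", "Brown", "Red", "Yes", "Sedan", "Manual", "Blue", "No", "Van", "Automatic")
df <- data.frame(ID, Fact, Overall_Category, Descriptor, Members)

# one row per ID/Fact; the three dimension columns move into a nested data frame
nested <- df %>%
  nest(dimensions = c(Overall_Category, Descriptor, Members))

# look up the dimensions that belong to the second fact
nested$dimensions[[2]]
```

unnest(nested, dimensions) gets you back to the original long layout when you need it.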

R: Get column names where value is not null

I have a table of 7 columns, the first column is id, then 3 columns of vegetable types and the last 3 columns are fruit types. The values indicate whether a person has this vegetable/ fruit. Is there a way to group the vegetables and the fruits, and output the column names if the person has that vegetable/ fruit?
Input data frame:
id1 <- c("id_1", 1, NA, NA, NA, 1, NA)
id2 <- c("id_2", NA, 1, 1, NA, NA, NA)
input <- data.frame(rbind(id1, id2))
colnames(input) = c("id", "lettuce", "tomato", "bellpeper", "pineapple", "apple", "banana")
Expected output data frame:
output_id1 <- c("id_1", "lettuce", "apple")
output_id2 <- c("id_2", "tomato, bellpeper", NA)
output <- data.frame(rbind(output_id1, output_id2))
colnames(output) <- c("id", "veg", "fruit")
Using the original input data you posted (also shown below in Data) you could do this with the tidyr package:
library(tidyr)
input %>%
tidyr::pivot_longer(cols = matches("^veg|^fruit"),
names_sep = "_",
names_to = c("type", "val"),
values_drop_na = TRUE) %>%
tidyr::pivot_wider(id_cols = id,
names_from = type,
values_from = val,
values_fn = function(x) paste0(x, collapse = ","))
Output
id veg fruit
<chr> <chr> <chr>
1 id_1 lettuce apple
2 id_2 tomato,bellpeper NA
Data
input <- structure(list(id = c("id_1", "id_2"), veg_lettuce = c("1", NA
), veg_tomato = c(NA, "1"), veg_bellpeper = c(NA, "1"), fruit_pineapple = c(NA_character_,
NA_character_), fruit_apple = c("1", NA), fruit_banana = c(NA_character_,
NA_character_)), class = "data.frame", row.names = c("id1", "id2"
))
This should do the trick!
id1 <- c("id_1", 1, NA, NA, NA, 1, NA)
id2 <- c("id_2", NA, 1, 1, 1, NA, NA)
input <- data.frame(rbind(id1, id2))
colnames(input) = c("id", "lettuce", "tomato", "bellpeper", "pineapple", "apple", "banana")
# Remove the id column, it's not necessary
input_without_id <- dplyr::select(input, -c("id"))
# For each row (MARGIN = 1) of the input, return the column names
# but only in the positions where the row (x) is not NA
result <- apply(input_without_id, MARGIN = 1, function(x) {
return(names(input_without_id)[which(!is.na(x))])
})
# Rename the result with the corresponding ids originally found in input.
names(result) <- input$id
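The apply() result above lists all non-NA column names per id but does not split them into veg and fruit. Here is a minimal base R sketch of that grouping step, assuming you list the columns of each group by hand (the groups list and the helper collapse_names are made up for illustration). It uses the data from the question.

```r
input <- data.frame(
  id = c("id_1", "id_2"),
  lettuce = c(1, NA), tomato = c(NA, 1), bellpeper = c(NA, 1),
  pineapple = c(NA, NA), apple = c(1, NA), banana = c(NA, NA)
)

# hypothetical mapping from output column to the input columns it covers
groups <- list(
  veg   = c("lettuce", "tomato", "bellpeper"),
  fruit = c("pineapple", "apple", "banana")
)

# for one group of columns, paste the names of the non-NA columns in each row
collapse_names <- function(cols) {
  unname(apply(input[cols], 1, function(x) {
    present <- cols[!is.na(x)]
    if (length(present) == 0) NA_character_ else paste(present, collapse = ", ")
  }))
}

output <- data.frame(id = input$id, lapply(groups, collapse_names))
output
```

Rows with nothing present in a group get NA there, matching the expected output in the question.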
Here is a tidyverse solution:
library(tidyverse)
input %>%
pivot_longer(-id) %>%
group_by(id) %>%
separate(name, into = c('type', 'class'), sep = "_") %>%
na.omit() %>%
select(-value) %>%
group_by(id, type) %>%
summarise(class = toString(class), .groups = "drop") %>%
pivot_wider(names_from = type, values_from = class) %>%
select(id, veg, fruit)
This gives us:
# A tibble: 2 x 3
id veg fruit
<chr> <chr> <chr>
1 id_1 lettuce apple
2 id_2 tomato, bellpeper NA

Sample groups and preserve row order

I have a dataframe such as:
df <- data.frame(id = factor(c(12321,12321,12321,4445,4445,4445,4445,787,787,787)),
word = c("please", "stop", "that", "the", "fox", "jumps", "that", "please", "eat", "noodles"),
word_id = c(12,5,28,99,214,800,28,12,78,912))
And I am attempting to take a sample of the data frame while preserving the id group and the word and word_id order.
I tried newDF <- df %>% group_by(id) %>% sample_frac(0.33) but this takes a sample of rows within each group.
I would like to end up with a dataframe that contains a sample of whole id groups from the original dataframe and preserves the order within the columns. So if I take a 33% sample of df, I will end up with 33% of the id groups, and the rows of each group remain in order.
newDF <- data.frame(id = factor(c(12321,12321,12321,4445,4445,4445,4445)),
word = c("please", "stop", "that", "the", "fox", "jumps", "that"),
word_id = c(12,5,28,99,214,800,28))
Adding to alistaire's comment:
library(dplyr)
library(tidyr)
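newDF1 <- df %>%
  group_by(id) %>%
  nest() %>%
  ungroup() %>%   # current tidyr keeps the grouping after nest(); drop it so whole groups are sampled
  sample_frac(1/3) %>%
  unnest(cols = data)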
newDF2 <- anti_join(df, newDF1, by = "id")
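The same idea can be sketched in base R without nesting: sample the ids first, then keep every row whose id was drawn. Subsetting with a logical index never reorders rows, so the word and word_id order is preserved. The names ids, n_keep and keep are made up here, and set.seed() is only there to make the sketch reproducible.

```r
df <- data.frame(id = factor(c(12321,12321,12321,4445,4445,4445,4445,787,787,787)),
                 word = c("please", "stop", "that", "the", "fox", "jumps", "that", "please", "eat", "noodles"),
                 word_id = c(12,5,28,99,214,800,28,12,78,912))

set.seed(42)  # assumption: any seed works, this one is arbitrary

ids <- unique(as.character(df$id))
n_keep <- max(1, round(length(ids) / 3))      # about a third of the groups, at least one
keep <- ids[sample.int(length(ids), n_keep)]

# keep all rows of the sampled ids, in their original order
newDF <- df[df$id %in% keep, ]
newDF
```

With three ids this keeps one whole group; the remaining groups are df[!df$id %in% keep, ].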
