In my dataframe there are multiple rows for a single observation (each referenced by rif). I would like to collapse the rows and create new columns from the keyword column. The result should have as many keyword columns as there are rows for an observation (e.g. keyword_1, keyword_2, etc.). Do you have any idea? Thanks a lot.
This is my MWE:
df1 <- structure(list(rif = c("text10", "text10", "text10", "text11", "text11"),
date = c("20180329", "20180329", "20180329", "20180329", "20180329"),
keyword = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova")),
row.names = c(NA, 5L), class = "data.frame")
Does this work:
library(dplyr)
library(tidyr)
df1 %>%
  group_by(rif, date) %>%
  mutate(n = row_number()) %>%
  pivot_wider(id_cols = c(rif, date), values_from = keyword,
              names_from = n, names_prefix = 'keyword')
# A tibble: 2 x 5
# Groups: rif, date [2]
rif date keyword1 keyword2 keyword3
<chr> <chr> <chr> <chr> <chr>
1 text10 20180329 Lucca Piacenza Milano
2 text11 20180329 Cascina Padova NA
Related
I have a data frame that looks like this:
date        var
2022-01-01  a,b,...,h
2022-01-02  a,b,...,z
Now I want to use the separate() function from tidyr (or any other function) to split each string at the commas, but I don't know how many elements each cell contains, so the number of columns to create is unknown. The resulting columns will be unbalanced, with the shorter rows filled with NA.
Ideally I want the resulting data frame to look like this:
date        var1  var2  var...  var...  var_Inf
2022-01-01  a     b     ...     h       NA
2022-01-02  a     b     ...     ...     z
How can I do this in R?
library(tibble)
date <- seq(as.Date("2022/1/1"), as.Date("2022/1/2"), by = "day")
var <- c("a,b,...,h", "a,b,...,z")
df <- tibble(date, var)
A more reproducible example is this:
var <- c("a,b,c,d,e,h", "a,b,c,d,e,f,i,z")
df <- tibble(date, var)
But note that I don't know the number of letters in each element of the column.
How can I do this in R?
We could do it this way: bringing the data into long format with separate_rows() from the tidyr package makes such tasks easier to handle:
library(dplyr)
library(tidyr)
df %>%
separate_rows(var) %>%
group_by(date) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = var, names_glue = "var{id}")
date var1 var2 var3 var4
<chr> <chr> <chr> <chr> <chr>
1 2022-01-01 a b x h
2 2022-01-02 a b x z
data:
df <- structure(list(date = c("2022-01-01", "2022-01-02"), var = c("a,b,x,h",
"a,b,x,z")), class = "data.frame", row.names = c(NA, -2L))
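The output above happens to have four elements in every row. As a minimal sketch of the unbalanced case from the question, the same pipeline can be run on the asker's second example, where the rows have 6 and 8 elements. The explicit sep = "," and the ungroup() call are additions of mine rather than part of the answer above, and wide is just an illustrative name; pivot_wider() fills the missing cells of the shorter row with NA.

```r
library(dplyr)
library(tidyr)

# The asker's "more reproducible" example: rows of unequal length
df <- tibble::tibble(
  date = as.Date(c("2022-01-01", "2022-01-02")),
  var  = c("a,b,c,d,e,h", "a,b,c,d,e,f,i,z")
)

wide <- df %>%
  separate_rows(var, sep = ",") %>%   # one row per element
  group_by(date) %>%
  mutate(id = row_number()) %>%       # position of the element within its date
  ungroup() %>%
  pivot_wider(names_from = id, values_from = var, names_prefix = "var")

wide
```

The number of var columns equals the length of the longest row, so nothing has to be known in advance.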
In my dataframe there are multiple rows for a single observation (each referenced by rif). I would like to collapse the rows and create new columns for both the keyword and the reg_birth columns.
df1 <- structure(list(rif = c("text10", "text10", "text10", "text11", "text11"),
date = c("20180329", "20180329", "20180329", "20180329", "20180329"),
keyword = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova"),
reg_birth = c("Tuscany", "Emilia", "Lombardy", "Veneto", "Veneto")),
row.names = c(NA, 5L), class = "data.frame")
To create the new columns for the 'keyword' column alone, I used this code, as shown in this answer.
df1 %>% group_by(rif,date) %>%
mutate(n = row_number()) %>%
pivot_wider(id_cols = c(rif,date), values_from = keyword, names_from = n, names_prefix = 'keyword')
However, I don't know how to do the same for an additional column (reg_birth in this case).
This is my expected output:
     rif keyword reg_birth keyword2 reg_birth2 keyword3 reg_birth3
1 text10   Lucca   Tuscany Piacenza     Emilia   Milano   Lombardy
2 text11 Cascina    Veneto   Padova     Veneto     <NA>       <NA>
Thank you.
You may try with pivot_wider from tidyr.
library(dplyr)
library(tidyr)
df1 %>%
mutate(id = data.table::rowid(rif, date)) %>%
pivot_wider(names_from = id, values_from = c(keyword, reg_birth))
# rif date keyword_1 keyword_2 keyword_3 reg_birth_1 reg_birth_2 reg_birth_3
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 text10 20180329 Lucca Piacenza Milano Tuscany Emilia Lombardy
#2 text11 20180329 Cascina Padova NA Veneto Veneto NA
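The output above groups all keyword columns before all reg_birth columns, while the expected output in the question alternates them. A hedged sketch of one way to get the alternating order, assuming tidyr 1.2.0 or later (where pivot_wider() has a names_vary argument): here the index is built with row_number() instead of data.table::rowid(), which is my substitution and not part of the answer above.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  rif       = c("text10", "text10", "text10", "text11", "text11"),
  date      = rep("20180329", 5),
  keyword   = c("Lucca", "Piacenza", "Milano", "Cascina", "Padova"),
  reg_birth = c("Tuscany", "Emilia", "Lombardy", "Veneto", "Veneto")
)

wide <- df1 %>%
  group_by(rif, date) %>%
  mutate(id = row_number()) %>%   # 1, 2, 3 within each observation
  ungroup() %>%
  pivot_wider(names_from = id,
              values_from = c(keyword, reg_birth),
              names_vary = "slowest")  # keyword_1, reg_birth_1, keyword_2, ...

names(wide)
```

With the default names_vary = "fastest" the column order matches the answer's output instead.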
I would like to build a string based on IDs from other columns, where the real value for each ID sits in a dictionary.
Ideally, this would look like:
library(tidyverse)
region_dict <- tibble(
id = c("reg_id1", "reg_id2", "reg_id3"),
name = c("reg_1", "reg_2", "reg_3")
)
color_dict <- tibble(
id = c("col_id1", "col_id2", "col_id3"),
name = c("col_1", "col_2", "col_3")
)
tibble(
region = c("reg_id1", "reg_id2", "reg_id3"),
color = c("col_id1", "col_id2", "col_id3"),
my_string = str_c(
"xxx_",
region_name,
"_",
color_name
))
#> # A tibble: 3 x 3
#> region color my_string
#> <chr> <chr> <chr>
#> 1 reg_id1 col_id1 xxx_reg_1_col_1
#> 2 reg_id2 col_id2 xxx_reg_2_col_2
#> 3 reg_id3 col_id3 xxx_reg_3_col_3
Created on 2021-03-01 by the reprex package (v0.3.0)
I know of dplyr's recode() function but I can't think of a way to use it the way I want.
I also thought about first using left_join() and then concatenating the string from the new columns. That would work, but it doesn't seem pretty to me, as I would end up with columns that I'd need to remove later. In the real dataset I have 5 variables.
I'll be glad to read your ideas.
This may also be solved with a fuzzyjoin. However, since the IDs share a common suffix, it makes sense to remove the prefix substring from the 'id' column of each dictionary, do a left_join on what remains, and then create 'my_string' by pasting the columns together:
library(stringr)
library(dplyr)
region_dict %>%
mutate(id1 = str_remove(id, '.*_')) %>%
left_join(color_dict %>%
mutate(id1 = str_remove(id, '.*_')), by = 'id1') %>%
transmute(region = id.x, color = id.y,
my_string = str_c('xxx_', name.x, '_', name.y))
Output:
# A tibble: 3 x 3
# region color my_string
# <chr> <chr> <chr>
#1 reg_id1 col_id1 xxx_reg_1_col_1
#2 reg_id2 col_id2 xxx_reg_2_col_2
#3 reg_id3 col_id3 xxx_reg_3_col_3
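The question also asks for a way to avoid the helper columns that a join leaves behind. One possible sketch, which is not the approach in the answer above, is to turn each dictionary into a named character vector with tibble::deframe() and index it by ID inside mutate(). The names dat, region_lookup and color_lookup are illustrative, and the sketch assumes every ID appears exactly once in its dictionary.

```r
library(dplyr)
library(stringr)
library(tibble)

region_dict <- tibble(id = c("reg_id1", "reg_id2", "reg_id3"),
                      name = c("reg_1", "reg_2", "reg_3"))
color_dict  <- tibble(id = c("col_id1", "col_id2", "col_id3"),
                      name = c("col_1", "col_2", "col_3"))

# Named vectors: names are the ids, values are the real names
region_lookup <- deframe(region_dict)
color_lookup  <- deframe(color_dict)

dat <- tibble(region = c("reg_id1", "reg_id2", "reg_id3"),
              color  = c("col_id1", "col_id2", "col_id3"))

out <- dat %>%
  mutate(my_string = str_c("xxx_",
                           unname(region_lookup[region]),  # look up by id
                           "_",
                           unname(color_lookup[color])))

out
```

An ID that is missing from its dictionary yields NA for that row, so with five variables it may be worth checking each lookup for NA before concatenating.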
I have the following data.table object called x:
month.option som.month
all.year 56.6%
diff -0.9%
and when I perform the following operation :
x %>% pivot_wider(names_from = month.option, values_from = som.month) %>%
select(diff, everything()) %>%
set_names(c("Dif vs MA", "SOM YTD", "SOM AA"))
I get the following error: Error in data.frame(row = row_id, col = col_id) : arguments imply differing number of rows: 0, 2. However, I don't understand the reason, since x is a 2x2 data.table. If anyone can spot an issue that I am not seeing, I would appreciate the correction.
As a side note, all the columns are of type character, in case that is useful.
If we want to use pivot_wider, we could do this without creating a new column by specifying values_fn as I:
library(dplyr)
library(tidyr)
x %>%
pivot_wider(names_from = month.option, values_from = som.month, values_fn = I)
# A tibble: 1 x 2
# all.year diff
# <I<chr>> <I<chr>>
#1 56.6% -0.9%
Or values_fn can be a function that returns the first element:
x %>%
pivot_wider(names_from = month.option,
values_from = som.month, values_fn = first)
# A tibble: 1 x 2
# all.year diff
# <chr> <chr>
#1 56.6% -0.9%
However, this kind of problem can be tackled easily with transpose from data.table:
data.table::transpose(x, make.names = 'month.option')
# all.year diff
#1 56.6% -0.9%
Or use deframe with as_tibble_row, which is more direct:
library(tibble)
deframe(x) %>%
as_tibble_row
# A tibble: 1 x 2
# all.year diff
# <chr> <chr>
#1 56.6% -0.9%
Another option is to convert the first column to row names, transpose with t, and convert the result to a tibble (or data.frame):
x %>%
column_to_rownames('month.option') %>%
t %>%
as_tibble
# A tibble: 1 x 2
# all.year diff
# <chr> <chr>
#1 56.6% -0.9%
data
x <- structure(list(month.option = c("all.year", "diff"), som.month = c("56.6%",
"-0.9%")), class = "data.frame", row.names = c(NA, -2L))
Try this tidyverse solution, which uses the same pivot_wider(). You are having issues because the function cannot identify the rows properly. Creating an id is the solution:
#Code
df %>% mutate(id=1) %>%
pivot_wider(names_from = month.option,values_from=som.month) %>%
select(-1)
Output:
# A tibble: 1 x 2
all.year diff
<chr> <chr>
1 56.6% -0.9%
Some data used:
#Data
df <- structure(list(month.option = c("all.year", "diff"), som.month = c("56.6%",
"-0.9%")), class = "data.frame", row.names = c(NA, -2L))
If x is a data.table, we can also use dcast:
library(data.table)
dcast(x, rowid(month.option)~month.option, value.var = 'som.month')
# month.option all.year diff
#1: 1 56.6% -0.9%
I have a very simple dataset, with one column for ID numbers and one column for DOB of that individual.
Example:
x_df <- data.frame(stringsAsFactors=FALSE,
ID = c("ID-1", "ID-2", "ID-2", "ID-3", "ID-4", "ID-5"),
DOB = c("4/16/1955", "9/4/1976", "9/4/1976", "4/16/1955", "2/10/1995",
"11/29/1980")
)
I am trying to write code in R that will compare all the DOBs and print the IDs and DOBs when the DOB is the same but the ID is different.
Any suggestions?
Let's group the data by DOB, so that the IDs sharing a date can be compared side by side:
library(tidyverse)
x_df %>%
group_by(DOB) %>%
mutate(idord = paste0("x", 1:n()) ) %>%
spread(idord, ID) %>%
filter(x1 != x2)
The result is:
DOB x1 x2
<chr> <chr> <chr>
1 4/16/1955 ID-1 ID-3
If a DOB may be shared by more than two IDs, you can use this instead:
x_df %>%
group_by(DOB) %>%
summarise(idcount = n_distinct(ID), IDall = paste(unique(ID), collapse = "|")) %>%
filter(idcount > 1)
This gives the number of distinct IDs and all of those IDs in one cell:
DOB idcount IDall
<chr> <int> <chr>
1 4/16/1955 2 ID-1|ID-3
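Since the question asks to print the IDs and DOBs themselves, a long-format variant may be closer to what is wanted. The following is a minimal sketch rather than part of the answer above: it first drops exact duplicate rows with distinct(), then keeps only the DOBs that belong to more than one ID. The name shared_dob is illustrative.

```r
library(dplyr)

x_df <- data.frame(
  ID  = c("ID-1", "ID-2", "ID-2", "ID-3", "ID-4", "ID-5"),
  DOB = c("4/16/1955", "9/4/1976", "9/4/1976", "4/16/1955",
          "2/10/1995", "11/29/1980"),
  stringsAsFactors = FALSE
)

shared_dob <- x_df %>%
  distinct(ID, DOB) %>%            # the repeated ID-2 row counts once
  group_by(DOB) %>%
  filter(n_distinct(ID) > 1) %>%   # keep DOBs shared by different IDs
  ungroup()

shared_dob
```

Each matching ID stays on its own row, which avoids both the wide x1/x2 columns and the pasted IDall string.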