How to widen a 2-row dataset into a single row? [duplicate] - r

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 1 year ago.
I have this small 2-row dataset
df=structure(list(V2 = c("Primera", "Segunda"), Lote = c("EN1195",
"EN1195"), V7 = c("No registra", "No registra"), fecha_app = structure(c(18690,
18711), class = "Date")), class = "data.frame", row.names = c(NA,
-2L))
I need to widen it so that the second row becomes part of the first row, like this:
df=structure(list(Lote.1 = "EN1195", V7.1 = "No registra", fecha_app.1 = structure(18690, class = "Date"), Lote.2 = "EN1195", V7.2 = "No registra",
fecha_app.2 = structure(18711, class = "Date")), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
I have researched this, but I'm unsure how to implement it in my case.

You can use rowid from data.table to create a unique id and use pivot_wider.
library(dplyr)
library(tidyr)
df %>%
  mutate(id = data.table::rowid(Lote)) %>%
  pivot_wider(names_from = id, values_from = V2:fecha_app)
# V2_1 V2_2 Lote_1 Lote_2 V7_1 V7_2 fecha_app_1 fecha_app_2
# <chr> <chr> <chr> <chr> <chr> <chr> <date> <date>
#1 Primera Segunda EN1195 EN1195 No registra No registra 2021-03-04 2021-03-25
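If you want the names to look more like the desired output (Lote.1, V7.1, ...), names_sep controls the separator used between the value name and the id; a small sketch of the same pipeline:
df %>%
  mutate(id = data.table::rowid(Lote)) %>%
  pivot_wider(names_from = id, values_from = V2:fecha_app, names_sep = ".")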
Or using only data.table.
library(data.table)
setDT(df)
dcast(df, Lote~rowid(Lote), value.var = c('V2', 'V7', 'Lote', 'fecha_app'))
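For completeness, a base R sketch of the same wide reshape (assuming you start again from the original data.frame; column names and order differ slightly):
# create a within-group row id, then reshape long -> wide
df$id <- ave(seq_along(df$Lote), df$Lote, FUN = seq_along)
reshape(df, direction = "wide", idvar = "Lote", timevar = "id", sep = "_")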

Related

How to cbind a list of tables by one column, and suffix headings with the list item name

I've got a list of dataframes. I'd like to cbind them by the index column, sample_id. Each table has the same column headings, so I can't simply cbind them, otherwise I won't know which list item the columns came from. The name of each list item gives the measure used to generate the table, so I'd like to suffix the column headings with the list item name.
Here's a simplified demo list of dataframes:
list_of_tables <- list(number = structure(list(sample_id = structure(1:3, levels = c("CSF_1",
"CSF_2", "CSF_4"), class = "factor"), total = c(655, 331, 271
), max = c(12, 5, 7)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame")), concentration_cm_3 = structure(list(sample_id = structure(1:3, levels = c("CSF_1",
"CSF_2", "CSF_4"), class = "factor"), total = c(121454697, 90959097,
43080697), max = c(2050000, 2140000, 915500)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame")), volume_nm_3 = structure(list(
sample_id = structure(1:3, levels = c("CSF_1", "CSF_2", "CSF_4"
), class = "factor"), total = c(2412783009, 1293649395, 438426087
), max = c(103500000, 117400000, 23920000)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame")), area_nm_2 = structure(list(
sample_id = structure(1:3, levels = c("CSF_1", "CSF_2", "CSF_4"
), class = "factor"), total = c(15259297.4, 7655352.2, 3775922
), max = c(266500, 289900, 100400)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame")))
You'll see it's a list of 4 tables, and the list item names are "number", "concentration_cm_3", "volume_nm_3", and "area_nm_2".
Using join_all from plyr I can merge them all by sample_id. However, how do I suffix the column headings with the list item name?
merged_tables <- plyr::join_all(list_of_tables, by = "sample_id", type = "left")
We could do it this way: the trick is to use .id = 'id' in bind_rows, which adds the list item name as a column. Then we can pivot:
library(dplyr)
library(tidyr)
bind_rows(list_of_tables, .id = 'id') %>%
  pivot_wider(names_from = id,
              values_from = c(total, max))
sample_id total_number total_concentration_cm_3 total_volume_nm_3 total_area_nm_2 max_number max_concentration_cm_3 max_volume_nm_3 max_area_nm_2
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CSF_1 655 121454697 2412783009 15259297. 12 2050000 103500000 266500
2 CSF_2 331 90959097 1293649395 7655352. 5 2140000 117400000 289900
3 CSF_4 271 43080697 438426087 3775922 7 915500 23920000 100400
We may also use reduce2 here with the suffix option from left_join:
library(dplyr)
library(purrr)
nm <- names(list_of_tables)[1]
reduce2(list_of_tables, names(list_of_tables)[-1],
        function(x, y, z) left_join(x, y, by = 'sample_id', suffix = c(nm, z)))
Or, if we want to use join_all, we can rename the columns before doing the join:
library(stringr)
imap(list_of_tables, ~ {
  nm <- .y
  .x %>% rename_with(~ str_c(.x, nm), -1)
}) %>%
  plyr::join_all(by = "sample_id", type = "left")
Or use a for loop
tmp <- list_of_tables[[1]]
names(tmp)[-1] <- paste0(names(tmp)[-1], names(list_of_tables)[1])

for (nm in names(list_of_tables)[-1]) {
  tmp2 <- list_of_tables[[nm]]
  names(tmp2)[-1] <- paste0(names(tmp2)[-1], nm)
  tmp <- left_join(tmp, tmp2, by = "sample_id")
}
tmp

How to add single quotes to dataset?

A test dataset
structure(list(numero_certificado = c("1234", "5678"
), sitio_defuncion = c("HOSPITAL/CLINICA", "HOSPITAL/CLINICA"
), tipo_defuncion = c("NO FETAL", "NO FETAL"), fecha_defuncion = structure(c(1635861000,
1635874800), tzone = "", class = c("POSIXct", "POSIXt")), tipo_documento_fallecido = c("REGISTRO CIVIL",
"CEDULA DE CIUDADANIA"), documento_fallecido = c("1111",
"2222")), row.names = c(NA, -2L), class = c("tbl_df", "tbl",
"data.frame"))
I would like to be able to:
1. Add single quotes (') to each element of the entire dataset
2. Add single quotes (') to all elements in specific columns, selected by index, as some data will be numeric or date and not string
structure(list(numero_certificado = c("'1234'", "'5678'"
), sitio_defuncion = c("'HOSPITAL/CLINICA'", "'HOSPITAL/CLINICA'"
), tipo_defuncion = c("'NO FETAL'", "'NO FETAL'"), fecha_defuncion = structure(c(1635861000,
1635874800), tzone = "", class = c("POSIXct", "POSIXt")), tipo_documento_fallecido = c("'REGISTRO CIVIL'",
"'CEDULA DE CIUDADANIA'"), documento_fallecido = c("'1111'",
"'2222'")), row.names = c(NA, -2L), class = c("tbl_df", "tbl",
"data.frame"))
We may use sQuote, looping across the columns of the data: use everything() if all the columns need to be changed, or one of the select helpers for specific columns (here, to exclude fecha_defuncion, prefix it with -).
library(dplyr)
df1 <- df1 %>%
  mutate(across(-fecha_defuncion, ~ sQuote(.x, q = FALSE)))
-output
df1
# A tibble: 2 × 6
numero_certificado sitio_defuncion tipo_defuncion fecha_defuncion tipo_documento_fallecido documento_fallecido
<chr> <chr> <chr> <dttm> <chr> <chr>
1 '1234' 'HOSPITAL/CLINICA' 'NO FETAL' 2021-11-02 08:50:00 'REGISTRO CIVIL' '1111'
2 '5678' 'HOSPITAL/CLINICA' 'NO FETAL' 2021-11-02 12:40:00 'CEDULA DE CIUDADANIA' '2222'
Also, as #KonradRudolph mentioned in the comments, since the output of sQuote depends on the locale/fancy-quote settings, another option is glue, paste, or sprintf:
df1 <- df1 %>%
  mutate(across(-fecha_defuncion, ~ sprintf("'%s'", as.character(.x))))
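For completeness, a paste0() sketch of the same idea (same column selection; glue::glue() would work similarly if wrapped in as.character() so the result stays a plain character column):
df1 <- df1 %>%
  mutate(across(-fecha_defuncion, ~ paste0("'", .x, "'")))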

Converting empty values to NULL in R - Handling date column

I have a simple data frame, shown here as dput(emp):
structure(list(name = structure(1L, .Label = "Alex", class = "factor"),
job = structure(1L, .Label = "", class = "factor"), Mgr = structure(1L, .Label = "", class = "factor"),
update = structure(18498, class = "Date")), class = "data.frame", row.names = c(NA,
-1L))
I want to convert all empty values to NA.
The simplest way to achieve this is:
emp[emp == ""] <- NA
which of course would have worked, but I get the following error because of the date column:
Error in charToDate(x) :
character string is not in a standard unambiguous format
How can I convert all the other empty values to NA without having to deal with the date column? Please note that the actual data frame has 30,000+ rows.
Try formatting the date variable as character, make the change, and transform it back to Date:
#Format date
emp$update <- as.character(emp$update)
#Replace
emp[emp=='']<-NA
#Reformat date
emp$update <- as.Date(emp$update)
Output:
name job Mgr update
1 Alex <NA> <NA> 2020-08-24
You can try type.convert as below:
type.convert(emp, as.is = TRUE)
such that
name job Mgr update
1 Alex NA NA 2020-08-24
You may try this using dplyr:
library(dplyr)
emp %>%
  mutate_at(vars(update), as.character) %>%
  na_if(., "")
As mentioned by #Duck, you have to format the date variable as character; afterwards you can transform it back to Date if you need it:
library(dplyr)
emp %>%
  mutate_at(vars(update), as.character) %>%
  na_if(., "") %>%
  mutate_at(vars(update), as.Date)
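A sketch of the same idea with the newer across()/where() helpers, which avoids touching the Date column at all by only looking at factor and character columns (note that it converts the factors to character):
library(dplyr)

emp %>%
  mutate(across(where(is.factor), as.character)) %>%    # factors -> character
  mutate(across(where(is.character), ~ na_if(.x, "")))  # "" -> NA; Date column untouched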
See if this works:
> library(dplyr)
> library(purrr)
> emp <- structure(list(name = structure(1L, .Label = "Alex", class = "factor"),
+ job = structure(1L, .Label = "", class = "factor"), Mgr = structure(1L, .Label = "", class = "factor"),
+ update = structure(18498, class = "Date")), class = "data.frame", row.names = c(NA,
+ -1L))
> emp
name job Mgr update
1 Alex 2020-08-24
> emp %>% mutate(update = as.character(update)) %>% map_df(~gsub('^$',NA, .x)) %>% mutate(update = as.Date(update)) %>% mutate(across(1:3, as.factor))
# A tibble: 1 x 4
name job Mgr update
<fct> <fct> <fct> <date>
1 Alex NA NA 2020-08-24
>

How do I unnest a nested df and use the column name as part of the new column name?

I realize my title is probably a little confusing. I have some JSON that is tricky to unnest. I am trying to use the tidyverse.
Sample Data
df <- structure(list(long_abbr = c("Team11", "BBS"), short_name = c("Ac ",
"BK"), division = c("", ""), name = c("AC Slaters Muscles", "Broken Bats"
), abbr = c("T1", "T1"), owners = list(structure(list(commissioner = 0L,
name = "Chris Liss", id = "300144F8-79F4-11EA-8F25-9AE405472731"), class = "data.frame", row.names = 1L),
structure(list(commissioner = 1L, name = "Mark Ortin", id = "90849EF6-7427-11EA-95AA-4EEEAC7F8CD2"), class = "data.frame", row.names = 1L)),
id = c("1", "2"), logged_in_team = c(NA_integer_, NA_integer_
)), row.names = 1:2, class = "data.frame")
library(tidyverse)

# Unnest owners information
df <- df %>%
  unnest(owners)
I get the following error, since the nested data frames contain columns (name, id) that duplicate existing column names:
Error: Column names `name` and `id` must not be duplicated.
Is there an easy way to unnest the columns with a naming convention that puts the prefix owners (or, in my case, whatever the name of the column that holds the nested df is) before the nested columns, i.e. owners.commissioner, owners.name, owners.id? I'd also be interested in solutions that use camel case or an underscore, i.e. ownersName or owners_name.
Set the names_sep argument:
df <- structure(
list(long_abbr = c("Team11", "BBS"),
short_name = c("Ac ", "BK"),
division = c("", ""),
name = c("AC Slaters Muscles", "Broken Bats"),
abbr = c("T1", "T1"),
owners = list(
structure(list(commissioner = 0L, name = "Chris Liss",
id = "300144F8-79F4-11EA-8F25-9AE405472731"),
class = "data.frame", row.names = 1L),
structure(list(commissioner = 1L, name = "Mark Ortin",
id = "90849EF6-7427-11EA-95AA-4EEEAC7F8CD2"),
class = "data.frame", row.names = 1L)),
id = c("1", "2"),
logged_in_team = c(NA_integer_, NA_integer_)),
row.names = 1:2, class = "data.frame"
)
tidyr::unnest(df, owners, names_sep = "_")
#> # A tibble: 2 x 10
#> long_abbr short_name division name abbr owners_commissi… owners_name
#> <chr> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 Team11 "Ac " "" AC S… T1 0 Chris Liss
#> 2 BBS "BK" "" Brok… T1 1 Mark Ortin
#> # … with 3 more variables: owners_id <chr>, id <chr>, logged_in_team <int>
Created on 2020-04-26 by the reprex package (v0.3.0)
Does this solve your problem?
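For the dotted style mentioned in the question (owners.commissioner, owners.name, owners.id), the same call with a different separator should do it; a quick sketch:
tidyr::unnest(df, owners, names_sep = ".")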

Find value of a row by comparing two columns and a value with a range of a different dataset

I have two different datasets. The first contains objects that come from a StationX, go to a StationY, and arrive at a specific date and time, as follows:
df1<-structure(list(From = c("Station1", "Station5", "Station6", "Station10"), To = c("Station15", "Station2", "Station2", "Station7"),
Arrival = structure(c(971169720, 971172720, 971178120, 971179620), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -4L),class = c("tbl_df","tbl", "data.frame"))
Dataset 2 contains, for example, trucks that wait for the specific object at StationY between "Arrival" and "Departure" and then leave at "Departure" for a specific region "TOID", as in the following:
df2<-structure(list(TOID = c(2, 4, 7, 20), Station = c("Station15",
"Station2", "Station2","Station7"), Arrival = structure(c(971169600, 971172000, 971177700, 971179500), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Departure1 = structure(c(971170200, 971173200, 971178600, 971179800), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
I want to look up the TOID in dataset 2 and add it to dataset 1 if To (dataset 1) = Station (dataset 2) and Arrival (dataset 2) <= Arrival (dataset 1) <= Departure (dataset 2), giving the following outcome:
df1outcome<-structure(list(From = c("Station1", "Station5", "Station6", "Station10"
), To = c("Station15", "Station2", "Station2", "Station7"), `TO_ID` = c(2, 4, 7, 20), Arrival = structure(c(971169720, 971172720, 971178120, 971179620), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
I need a solution that looks in dataset 2 for the ID matching these conditions, regardless of the row order.
It would be awesome if you could help me code this in R.
Best,
J
Perhaps you could use the tidyverse: a left_join based on the station, then filter based on the dates:
library(tidyverse)
df1 %>%
  left_join(df2, by = c("To" = "Station"), suffix = c("1", "2")) %>%
  filter(Arrival1 >= Arrival2 & Arrival1 <= Departure1) %>%
  select(-c(Arrival2, Departure1))
# A tibble: 4 x 4
From To Arrival1 TOID
<chr> <chr> <dttm> <dbl>
1 Station1 Station15 2000-10-10 09:22:00 2
2 Station5 Station2 2000-10-10 10:12:00 4
3 Station6 Station2 2000-10-10 11:42:00 7
4 Station10 Station7 2000-10-10 12:07:00 20
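If you are on dplyr >= 1.1.0, the interval condition can also go directly into the join via join_by() instead of a separate filter. A sketch, assuming the duplicated Arrival columns come back with the default .x/.y suffixes:
library(dplyr)  # >= 1.1.0 for join_by()

df1 %>%
  left_join(df2,
            by = join_by(To == Station,
                         between(x$Arrival, y$Arrival, y$Departure1))) %>%
  select(From, To, TOID, Arrival = Arrival.x)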
I'm pretty new to R, so this code is probably longer than it should be. But does this work?
library(dplyr)

# Rename variables so it's easier to merge the objects and compare them
df1 <- df1 %>% rename(Arrival_Package = Arrival)
df2 <- df2 %>% rename(Arrival_Truck = Arrival)

# Merge objects
df1outcome <- merge(df1, df2, by.x = "To", by.y = "Station")

# Keep the rows that satisfy the conditions and select the relevant columns
df1outcome <- subset(df1outcome, Arrival_Package <= Departure1)
df1outcome <- subset(df1outcome, Arrival_Truck <= Arrival_Package)
df1outcome <- df1outcome %>% select(From, To, TOID, Arrival_Package)
