How to add single quotes to dataset? - r

A test dataset
structure(list(numero_certificado = c("1234", "5678"
), sitio_defuncion = c("HOSPITAL/CLINICA", "HOSPITAL/CLINICA"
), tipo_defuncion = c("NO FETAL", "NO FETAL"), fecha_defuncion = structure(c(1635861000,
1635874800), tzone = "", class = c("POSIXct", "POSIXt")), tipo_documento_fallecido = c("REGISTRO CIVIL",
"CEDULA DE CIUDADANIA"), documento_fallecido = c("1111",
"2222")), row.names = c(NA, -2L), class = c("tbl_df", "tbl",
"data.frame"))
I would like to be able to:
Add single quotes (') to each element of the entire dataset
Add single quotes (') to all elements in specific columns based on index, as some data will be numeric or date and not string
Expected output:
structure(list(numero_certificado = c("'1234'", "'5678'"
), sitio_defuncion = c("'HOSPITAL/CLINICA'", "'HOSPITAL/CLINICA'"
), tipo_defuncion = c("'NO FETAL'", "'NO FETAL'"), fecha_defuncion = structure(c(1635861000,
1635874800), tzone = "", class = c("POSIXct", "POSIXt")), tipo_documento_fallecido = c("'REGISTRO CIVIL'",
"'CEDULA DE CIUDADANIA'"), documento_fallecido = c("'1111'",
"'2222'")), row.names = c(NA, -2L), class = c("tbl_df", "tbl",
"data.frame"))

We can use sQuote and loop across the columns with across(). Use everything() if all the columns need to be changed, or one of the tidyselect helpers for a subset; here fecha_defuncion is excluded by prefixing it with -. Passing q = FALSE makes sQuote use plain ASCII quotes.
library(dplyr)
df1 <- df1 %>%
  mutate(across(-fecha_defuncion, ~ sQuote(.x, q = FALSE)))
Output:
df1
# A tibble: 2 × 6
numero_certificado sitio_defuncion tipo_defuncion fecha_defuncion tipo_documento_fallecido documento_fallecido
<chr> <chr> <chr> <dttm> <chr> <chr>
1 '1234' 'HOSPITAL/CLINICA' 'NO FETAL' 2021-11-02 08:50:00 'REGISTRO CIVIL' '1111'
2 '5678' 'HOSPITAL/CLINICA' 'NO FETAL' 2021-11-02 12:40:00 'CEDULA DE CIUDADANIA' '2222'
Also, as @KonradRudolph mentioned in the comments, the default output of sQuote can depend on the locale, so another option is glue, paste or sprintf:
df1 <- df1 %>%
mutate(across(-fecha_defuncion, ~sprintf("'%s'", as.character(.))))
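For the second part of the question (quoting only specific columns chosen by index), a base R sketch might look like the following. The data frame `df_demo`, its `edad` column and the index vector `idx` are made up for illustration, and `paste0` is used instead of `sQuote` so the result does not depend on the locale.

```r
# Hypothetical demo data: two character columns and one numeric column
df_demo <- data.frame(
  numero_certificado = c("1234", "5678"),
  tipo_defuncion     = c("NO FETAL", "NO FETAL"),
  edad               = c(70, 82),
  stringsAsFactors   = FALSE
)

# Column positions to quote (an assumption for this example)
idx <- c(1, 2)

# Wrap every element of the selected columns in plain ASCII single quotes;
# columns outside idx (here the numeric one) are left untouched
df_demo[idx] <- lapply(df_demo[idx], function(x) paste0("'", x, "'"))

df_demo
```

Selecting by position keeps numeric and date columns out of the conversion, which matters because quoting turns every touched column into character.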

Related

How to cbind a list of tables by one column, and suffix headings with the list item name

I've got a list of dataframes. I'd like to cbind them by the index column, sample_id. Each table has the same column headings, so I can't just cbind them otherwise I won't know which list item the columns came from. The name of the list item gives the measure used to generate them, so I'd like to suffix the column headings with the list item name.
Here's a simplified demo list of dataframes:
list_of_tables <- list(number = structure(list(sample_id = structure(1:3, levels = c("CSF_1",
"CSF_2", "CSF_4"), class = "factor"), total = c(655, 331, 271
), max = c(12, 5, 7)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame")), concentration_cm_3 = structure(list(sample_id = structure(1:3, levels = c("CSF_1",
"CSF_2", "CSF_4"), class = "factor"), total = c(121454697, 90959097,
43080697), max = c(2050000, 2140000, 915500)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame")), volume_nm_3 = structure(list(
sample_id = structure(1:3, levels = c("CSF_1", "CSF_2", "CSF_4"
), class = "factor"), total = c(2412783009, 1293649395, 438426087
), max = c(103500000, 117400000, 23920000)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame")), area_nm_2 = structure(list(
sample_id = structure(1:3, levels = c("CSF_1", "CSF_2", "CSF_4"
), class = "factor"), total = c(15259297.4, 7655352.2, 3775922
), max = c(266500, 289900, 100400)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame")))
You'll see it's a list of 4 tables, and the list item names are "number", "concentration_cm_3", "volume_nm_3", and "area_nm_2".
Using join_all from plyr I can merge them all by sample_id. However, how do I suffix with the list item name?
merged_tables <- plyr::join_all(list_of_tables, by = "sample_id", type = "left")
We could do it this way.
The trick is to use .id = 'id' in bind_rows, which adds the list item name as a column. Then we can pivot:
library(dplyr)
library(tidyr)
bind_rows(list_of_tables, .id = 'id') %>%
pivot_wider(names_from = id,
values_from = c(total, max))
sample_id total_number total_concentration_cm_3 total_volume_nm_3 total_area_nm_2 max_number max_concentration_cm_3 max_volume_nm_3 max_area_nm_2
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CSF_1 655 121454697 2412783009 15259297. 12 2050000 103500000 266500
2 CSF_2 331 90959097 1293649395 7655352. 5 2140000 117400000 289900
3 CSF_4 271 43080697 438426087 3775922 7 915500 23920000 100400
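Alternatively, we may use reduce2 with the suffix option of left_join. Note that suffix is only applied to column names that clash between the two tables being joined, so with more than two tables check that every column ends up with the intended suffix: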
library(dplyr)
library(purrr)
nm <- names(list_of_tables)[1]
reduce2(list_of_tables, names(list_of_tables)[-1],
function(x, y, z) left_join(x, y, by = 'sample_id', suffix = c(nm, z)))
Or, if we want to use join_all, we can rename the columns before doing the join:
library(stringr)
imap(list_of_tables, ~ {
nm <- .y
.x %>% rename_with(~str_c(.x, nm), -1)
}) %>%
plyr::join_all( by = "sample_id", type = "left")
Or use a for loop
tmp <- list_of_tables[[1]]
names(tmp)[-1] <- paste0(names(tmp)[-1], names(list_of_tables)[1])
for(nm in names(list_of_tables)[-1]) {
tmp2 <- list_of_tables[[nm]]
names(tmp2)[-1] <- paste0(names(tmp2)[-1], nm)
tmp <- left_join(tmp, tmp2, by = "sample_id")
}
tmp
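The rename-then-join idea can also be sketched in base R alone. The list `tables` below is a cut-down, made-up stand-in for `list_of_tables`, and the underscore separator is a choice made for this example rather than something the answers above use.

```r
# Hypothetical small list of tables sharing the key column sample_id
tables <- list(
  number = data.frame(sample_id = c("CSF_1", "CSF_2"), total = c(655, 331), max = c(12, 5)),
  volume = data.frame(sample_id = c("CSF_1", "CSF_2"), total = c(100, 200), max = c(30, 40)),
  area   = data.frame(sample_id = c("CSF_1", "CSF_2"), total = c(10, 20), max = c(1, 2))
)

# Suffix every non-key column with the name of its list item
renamed <- Map(function(tbl, nm) {
  is_key <- names(tbl) == "sample_id"
  names(tbl)[!is_key] <- paste(names(tbl)[!is_key], nm, sep = "_")
  tbl
}, tables, names(tables))

# Left-join all tables on the key, one after another
merged <- Reduce(function(x, y) merge(x, y, by = "sample_id", all.x = TRUE), renamed)
merged
```

Because the columns are renamed before any join happens, no two tables share a non-key column name, so the result does not rely on join suffixes at all.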

Populate a new column in one table based on start and end dates in another table

I have a larger data table (called raw.data) and a smaller one (called balldrop.times) listing the start and end times of an event.
I want to create a new column in the larger data table that will fill up the times between the event start and end date that are located in the smaller table. The times that aren't between the event start/end time can be labeled something else, it doesn't really matter.
#the dput of the smaller table
> dput(balldrop.times)
structure(list(Stage = 6:14,
BallStart = structure(c(1635837081, 1635847841, 1635856675, 1635866152, 1635878326, 1635886132, 1635895547, 1635902934, 1635911136), tzone = "", class = c("POSIXct", "POSIXt")),
BallEnd = structure(c(1635837364, 1635848243, 1635857005, 1635866475, 1635878704, 1635886465, 1635895905, 1635903786, 1635911457), tzone = "", class = c("POSIXct", "POSIXt"))),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L))
#here is part of the larger table just in case
> dput(head(raw.data, 5))
structure(list(DateTime = structure(c(1635825603.6576, 1635825604.608, 1635825605.6448, 1635825606.6816, 1635825607.632), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
Press.Well = c(1154.2561461, 1154.0308849, 1149.7247783, 1152.0544566, 1155.7363779),
row.names = c(NA, -5L),
class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000020725b51ef0>)
My desired output is something like the following, with "Event Active" only for the times between the listed DateTime values in the balldrop.times table:
DateTime              Press.Well   Event Status
2021-11-02 02:11:20   10           Event Not Active
2021-11-02 02:11:21   10           Event Active
2021-11-02 02:11:22   15           Event Active
...                   ...          ...
2021-11-02 02:16:04   25           Event Active
2021-11-02 02:16:05   30           Event Not Active
I am thinking I can use mutate() to create a new column in the raw.data table and set conditions for the DateTime, but I am not sure how to do this for multiple separate start/end DateTimes.
Any help would be appreciated. Thank you.
The code in your question doesn't run as posted, and the times in your example table don't correspond to the ones in your expected output. I used the following data instead:
tmp <- structure(list(Stage = 6:14,
BallStart = structure(c(1635837081, 1635847841, 1635856675, 1635866152, 1635878326, 1635886132, 1635895547, 1635902934, 1635911136), tzone = "", class = c("POSIXct", "POSIXt")),
BallEnd = structure(c(1635837364, 1635848243, 1635857005, 1635866475, 1635878704, 1635886465, 1635895905, 1635903786, 1635911457), tzone = "", class = c("POSIXct", "POSIXt"))
),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L))
tmp1 <- structure(list(DateTime = structure(c(1635825603.6576, 1635825604.608, 1635825605.6448, 1635825606.6816, 1635825607.632), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
Press.Well = c(1154.2561461, 1154.0308849, 1149.7247783, 1152.0544566, 1155.7363779) ), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
Note that this isn't a clean solution.
library(dplyr)
tmp1 %>%
mutate(`Event Status` = case_when(
DateTime >= (tmp[1,] %>% pull(BallStart)) & DateTime <= (tmp[1,] %>% pull(BallEnd)) ~ "Event Active",
DateTime >= (tmp[2,] %>% pull(BallStart)) & DateTime <= (tmp[2,] %>% pull(BallEnd)) ~ "Event Active",
DateTime >= (tmp[3,] %>% pull(BallStart)) & DateTime <= (tmp[3,] %>% pull(BallEnd)) ~ "Event Active",
DateTime >= (tmp[4,] %>% pull(BallStart)) & DateTime <= (tmp[4,] %>% pull(BallEnd)) ~ "Event Active",
DateTime >= (tmp[5,] %>% pull(BallStart)) & DateTime <= (tmp[5,] %>% pull(BallEnd)) ~ "Event Active",
TRUE ~ "Event Not Active"
))
Because you want to test multiple conditions, case_when is preferable to nested ifelse calls. Each condition compares DateTime against one row of your reference table (only the first five rows are written out here).
As said, it isn't a clean solution, because every interval needs its own line, so the code grows by one line for each row of the reference table. But you can clean it up by moving the check into a function.
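One way such a function could look is sketched below in base R. The helper name `in_any_interval` and the example data are hypothetical; the intervals are treated as closed on both ends, matching the >= and <= comparisons above.

```r
# Return TRUE for each time that falls inside at least one [start, end] interval
in_any_interval <- function(times, starts, ends) {
  vapply(
    seq_along(times),
    function(i) any(times[i] >= starts & times[i] <= ends),
    logical(1)
  )
}

# Hypothetical example data (all times in UTC)
starts <- as.POSIXct(c("2021-11-02 02:11:21", "2021-11-02 05:00:00"), tz = "UTC")
ends   <- as.POSIXct(c("2021-11-02 02:16:04", "2021-11-02 05:05:00"), tz = "UTC")
obs <- data.frame(
  DateTime = as.POSIXct(c("2021-11-02 02:11:20", "2021-11-02 02:11:21",
                          "2021-11-02 02:16:05", "2021-11-02 05:01:00"), tz = "UTC")
)

# Label each observation according to whether any interval contains it
obs$Event.Status <- ifelse(in_any_interval(obs$DateTime, starts, ends),
                           "Event Active", "Event Not Active")
obs
```

This checks every observation against every interval, so the number of intervals no longer affects the amount of code, only the run time.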

How to widen a 2-row dataset into a single row? [duplicate]

I have this small 2-row dataset
df=structure(list(V2 = c("Primera", "Segunda"), Lote = c("EN1195",
"EN1195"), V7 = c("No registra", "No registra"), fecha_app = structure(c(18690,
18711), class = "Date")), class = "data.frame", row.names = c(NA,
-2L))
I need to widen it so the second row becomes part of the first row, like this:
df=structure(list(Lote.1 = "EN1195", V7.1 = "No registra", fecha_app.1 = structure(18690, class = "Date"), Lote.2 = "EN1195", V7.2 = "No registra",
fecha_app.2 = structure(18711, class = "Date")), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
I have researched this, but I'm unsure how to implement it in my case.
You can use rowid from data.table to create a unique id and use pivot_wider.
library(dplyr)
library(tidyr)
df %>%
mutate(id = data.table::rowid(Lote)) %>%
pivot_wider(names_from = id, values_from = V2:fecha_app)
# V2_1 V2_2 Lote_1 Lote_2 V7_1 V7_2 fecha_app_1 fecha_app_2
# <chr> <chr> <chr> <chr> <chr> <chr> <date> <date>
#1 Primera Segunda EN1195 EN1195 No registra No registra 2021-03-04 2021-03-25
Or using only data.table.
library(data.table)
setDT(df)
dcast(df, Lote~rowid(Lote), value.var = c('V2', 'V7', 'Lote', 'fecha_app'))
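For comparison, a base R sketch of the same reshaping with `reshape()` is shown below. The data frame `df_demo` mirrors the question's data, and the helper column `id` is an assumption introduced here to number the rows within each Lote.

```r
# Demo data mirroring the question's two-row data frame
df_demo <- data.frame(
  V2        = c("Primera", "Segunda"),
  Lote      = c("EN1195", "EN1195"),
  V7        = c("No registra", "No registra"),
  fecha_app = as.Date(c("2021-03-04", "2021-03-25")),
  stringsAsFactors = FALSE
)

# Number the rows within each Lote (1, 2, ...)
df_demo$id <- ave(seq_along(df_demo$Lote), df_demo$Lote, FUN = seq_along)

# Spread the remaining columns into one set of columns per id value
wide <- reshape(df_demo, idvar = "Lote", timevar = "id", direction = "wide")
wide
```

Here Lote is kept once as the identifier instead of being repeated per row, and the new column names use a dot separator (for example V2.1 and V2.2).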

Identify and strip characters from columns

I have a large dataset in which I want to identify and remove letters and symbols so that only the numeric value is kept.
For example I want -£1125.91m to be -1125.91
dataset
Event var1 var2
<fct> <chr> <chr>
1 Labour Costs YoY 13.34m 0.026
2 Unemployment Change (000's) $16.91b -0.449
3 Unemployment Rate -£1125.91m 0.89k
4 Jobseekers Net Change ¥1012.74b 9.56m
At the moment I know how to remove a single character from the column. Like this:
dataset$var1 <- gsub("k", "", dataset$var1)
Doing this manually will be a lot of work because the dataset is really big. I was wondering if you can identify and remove all the characters, so also the currency symbols and the m's and b's all at once?
To replicate the dataset:
dataset <- structure(list(Event = structure(2:5, .Label = c("Event", "Labour Costs YoY",
"Unemployment Change (000's)", "Unemployment Rate", "Jobseekers Net Change"),
.Names = c("", "", "", ""), class = "factor"), var1 = c("13.34m", "$16.91b", "-£1125.91m", "¥1012.74b"), var2 = c(0.026, -0.449, "0.89k", "9.56m")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
To remove all but a hyphen, digit or a dot, you can use
dataset$var1 <- gsub("[^-0-9.]", "", dataset$var1)
The [^-0-9.] pattern is a negated character class that matches any char but the ones defined in the class.
A reproducible example:
dataset <- structure(list(Event = structure(2:5, .Label = c("Event", "Labour Costs YoY",
"Unemployment Change (000's)", "Unemployment Rate", "Jobseekers Net Change"),
.Names = c("", "", "", ""), class = "factor"), var1 = c("13.34m", "$16.91b", "-£1125.91m", "¥1012.74b"), var2 = c(0.026, -0.449, "0.89k", "9.56m")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
gsub("[^-0-9.]", "", dataset$var1)
## => [1] "13.34" "16.91" "-1125.91" "1012.74"
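To clean several columns "all at once", the same pattern can be applied to a set of columns and the result converted to numeric. This is a sketch on a made-up copy of the data (`dataset_demo`); the currency symbols are written as Unicode escapes only to keep the snippet encoding-independent.

```r
# Hypothetical copy of the question's character columns
# (\u00a3 is the pound sign, \u00a5 is the yen sign)
dataset_demo <- data.frame(
  var1 = c("13.34m", "$16.91b", "-\u00a31125.91m", "\u00a51012.74b"),
  var2 = c("0.026", "-0.449", "0.89k", "9.56m"),
  stringsAsFactors = FALSE
)

# Columns to clean (an assumption for this example)
cols <- c("var1", "var2")

# Drop everything except hyphens, digits and dots, then convert to numeric
dataset_demo[cols] <- lapply(dataset_demo[cols], function(x) {
  as.numeric(gsub("[^-0-9.]", "", x))
})

dataset_demo
```

Keep in mind that stripping the k, m and b suffixes also discards the scale they express, so 0.89k and 9.56m both end up as plain numbers.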

Find value of a row by comparing two columns and a value with a range of a different dataset

I have 2 different datasets. One with an object that comes from a StationX and goes to StationY and arrives at a specific date and time as the following.
df1<-structure(list(From = c("Station1", "Station5", "Station6", "Station10"), To = c("Station15", "Station2", "Station2", "Station7"),
Arrival = structure(c(971169720, 971172720, 971178120, 971179620), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -4L),class = c("tbl_df","tbl", "data.frame"))
The second dataset (df2) contains, for example, trucks that wait for the specific object at a station between the date-times "Arrival" and "Departure1" and then leave at "Departure1" for a specific region, "TOID", as in the following:
df2<-structure(list(TOID = c(2, 4, 7, 20), Station = c("Station15",
"Station2", "Station2","Station7"), Arrival = structure(c(971169600, 971172000, 971177700, 971179500), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Departure1 = structure(c(971170200, 971173200, 971178600, 971179800), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
I want to look up the TOID in df2 and add it to df1 where "To" (df1) equals "Station" (df2) and "Arrival" (df2) <= "Arrival" (df1) <= "Departure1" (df2), giving the following outcome:
df1outcome<-structure(list(From = c("Station1", "Station5", "Station6", "Station10"
), To = c("Station15", "Station2", "Station2", "Station7"), `TO_ID` = c(2, 4, 7, 20), Arrival = structure(c(971169720, 971172720, 971178120, 971179620), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
I need a solution that finds the ID in df2 matching these conditions regardless of row order.
It would be great if you could help me code this in R.
Best,
J
Perhaps you could use tidyverse, use a left_join based on the station, and then filter based on dates:
library(tidyverse)
df1 %>%
left_join(df2, by = c("To" = "Station"), suffix = c("1","2")) %>%
filter(Arrival1 >= Arrival2 & Arrival1 <= Departure1) %>%
select(-c(Arrival2, Departure1))
# A tibble: 4 x 4
From To Arrival1 TOID
<chr> <chr> <dttm> <dbl>
1 Station1 Station15 2000-10-10 09:22:00 2
2 Station5 Station2 2000-10-10 10:12:00 4
3 Station6 Station2 2000-10-10 11:42:00 7
4 Station10 Station7 2000-10-10 12:07:00 20
I'm pretty new to R, so this code is probably longer than it should be, but does this work?
library(dplyr)
# rename variables so the objects are easier to merge and compare
df1 <- df1 %>% rename(Arrival_Package = Arrival)
df2 <- df2 %>% rename(Arrival_Truck = Arrival)
#merge objects
df1outcome <- merge(df1, df2, by.x = "To", by.y = "Station")
#subset from object and select relevant columns
df1outcome <- subset(df1outcome, Arrival_Package <= Departure1)
df1outcome <- subset(df1outcome, Arrival_Truck <= Arrival_Package)
df1outcome <- df1outcome %>% select(From, To, TOID, Arrival_Package)

Resources