Numbers as column names in tibbles: problem when using select() in R

I am trying to select some columns by name, and the names are numbers. This is the code:
df2 <- df1 %>% select(`Year`, all_of(append(list1, list2))) %>%
I get this error:
Error: Can't subset columns that don't exist.
x Locations 61927, 169014, 75671, 27059, 225963, etc. don't exist.
i There are only 5312 columns.
I think the error is due to the column names being numbers. How do I solve it? (I want to keep the column names as numbers.)

We may use any_of() with paste() so that numeric column names still work, and if some of them are missing it will not throw an error.
library(dplyr)
df1 %>%
  select(Year, any_of(paste(c(list1, list2))))
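As a quick illustration (the list1 and list2 below are only stand-ins for the objects in the question, holding year names stored as numbers), any_of() simply skips names it cannot find:
library(dplyr)
library(tibble)

df1 <- tibble(Year = 1:2, `2020` = 3:4, `2021` = 5:6)
list1 <- c(2020)        # stand-in for the question's list1
list2 <- c(2021, 2050)  # 2050 does not exist in df1

df1 %>%
  select(Year, any_of(paste(c(list1, list2))))
#> returns Year, `2020` and `2021`; the missing `2050` is skipped without an error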

If you pass a bare number to select() it will be used as a position, but you can pass the number as a character string.
Example
library(dplyr)
df <- tibble(`2020` = NA,`2021` = NA, "var" = NA)
df
# A tibble: 1 x 3
`2020` `2021` var
<lgl> <lgl> <lgl>
1 NA NA NA
Using a number inside select
This will give an error, since there are only 3 variables; if you use 2020, select() will look for the 2020th column.
df %>%
  select(2020)
Error: Can't subset columns that don't exist.
x Location 2020 doesn't exist.
i There are only 3 columns.
Using the number as a string inside select
df %>%
  select("2020")
# A tibble: 1 x 1
`2020`
<lgl>
1 NA
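A related option (my addition, not part of the answer above): wrapping the name in backticks also makes select() treat it as a column name rather than a position.
df %>%
  select(`2020`)
#> same 1 x 1 result as select("2020") on the tibble above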

You can clean the column names using the janitor package.
df1 <- janitor::clean_names(df1)
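Note that this renames the columns instead of keeping them numeric, which may clash with the requirement stated in the question. A small sketch of what clean_names() does with numeric names (as far as I know it prefixes them to make them syntactically valid):
library(janitor)
library(tibble)

df1 <- tibble(`2020` = 1, `2021` = 2)
janitor::clean_names(df1)
#> the column names become x2020 and x2021, so the purely numeric names are not preserved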

Related

Combining multiple rows for one ID into one row with multiple columns based on 2 different variables in R

I am working with a dataframe in R that looks like this:
id <- c(1,1,1,2,2,3,3,3,3)
dx_code <- c("MI","HF","PNA","PNA","Cellulitis","MI","Flu","Sepsis","HF")
dx_date <- c("7/11/22","7/11/22","8/1/22","8/4/22","8/7/22","8/4/22","7/11/22","7/11/22","9/10/22")
df <- data.frame(id, dx_code, dx_date)
df
I want to be able to group it so that each patient ID has each date they were seen and each diagnosis they received on each specific date. So it would look something like:
id2 <- c(1,2,3)
dx_date1 <- c("7/11/22","8/4/22","8/4/22")
dx_date1code1 <- c("MI","PNA","MI")
dx_date1code2 <- c("HF",NA,NA)
dx_date2 <- c("8/1/22","8/7/22","7/11/22")
dx_date2code1 <- c("PNA","Cellulitis","Flu")
dx_date2code2 <- c(NA,NA,"Sepsis")
dx_date3 <- c(NA,NA,"9/10/22")
dx_date3code1 <- c(NA,NA,"HF")
df2 <- data.frame(id2, dx_date1, dx_date1code1,dx_date1code2,dx_date2,dx_date2code1,dx_date2code2,dx_date3,dx_date3code1)
df2
I am not sure how to reformat it in this way - is there a function in R, or should I try to use for loops? I would appreciate any help - thanks so much!
I believe you can use pivot_wider for this. The output is not the same as in the original post, but it is similar to what you provided in your comment.
You can enumerate dates and codes after grouping by id using row_number().
After using pivot_wider, you can select column names based on the numeric value they contain, which reorders them so that the date and code columns sit next to each other.
library(tidyverse)
df %>%
  group_by(id) %>%
  mutate(code_num = row_number()) %>%
  pivot_wider(id_cols = id,
              names_from = code_num,
              values_from = c(dx_date, dx_code)) %>%
  select(id, names(.)[-1][order(readr::parse_number(names(.)[-1]))])
Output
id dx_date_1 dx_code_1 dx_date_2 dx_code_2 dx_date_3 dx_code_3 dx_date_4 dx_code_4
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 7/11/22 MI 7/11/22 HF 8/1/22 PNA NA NA
2 2 8/4/22 PNA 8/7/22 Cellulitis NA NA NA NA
3 3 8/4/22 MI 7/11/22 Flu 7/11/22 Sepsis 9/10/22 HF
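As a side note (my addition, assuming tidyr 1.2.0 or newer): pivot_wider() has a names_vary argument which, if I remember correctly, can produce the interleaved date/code order directly, so the select()/parse_number() step becomes unnecessary.
library(dplyr)
library(tidyr)

df %>%
  group_by(id) %>%
  mutate(code_num = row_number()) %>%
  pivot_wider(id_cols = id,
              names_from = code_num,
              values_from = c(dx_date, dx_code),
              names_vary = "slowest")  # interleaves dx_date_1, dx_code_1, dx_date_2, ...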

Creating new columns in R using parts of an existing column

I am trying to create new columns using the information in an existing column:
e.g. the column 'name' contains the value 0112200015-1_R2_001.fastq.gz. From this I would like to generate a column 'sample_id' containing 0112200015 (the first 10 digits), a column 'timepoint' containing 1 (from "-1") and a column 'paired_end' containing 2 (from "R2").
What would the correct code for this be?
tidyr::extract
You can use extract from tidyr package.
library(tidyr)
df %>%
  extract(name, c("sample_id", "timepoint", "paired_end"),
          regex = "^(\\d{10})-(\\d)_R(\\d)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2
where df is:
df <- data.frame(name = "0112200015-1_R2_001.fastq.gz")
To make the solution more tailored to your needs, you should provide more examples, so that rare cases and exceptions can be handled.
Several regexes could work here. This one, for example, extracts the first 3 numbers it finds between non-numeric separators:
df %>%
  extract(name, c("sample_id", "timepoint", "paired_end"),
          regex = "^(\\d+)\\D+(\\d+)\\D+(\\d+)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2
I assume you want to create a new data frame with this information.
I created a vector with values similar to your column names, but you should be using the colnames output.
vector <- c("1234-1_R2_001.fastq.gz", "5678-1_R2_001.fastq.gz", "1928-1_R2_001.fastq.gz")
library(stringr)
df <- data.frame(sample_id = str_replace(vector, "-.*$", ""),
                 timepoint = str_extract(vector, "(?<=-)."),
                 paired_end = str_extract(vector, "(?<=R)."))
all the str functions are from the stringr package.
This should give you the correct answer using dplyr and stringr in a tidy way. It is based on the assumption that timepoint and paired_end always consist of one digit. If that is not the case, the small adjustment of replacing "\\d{1}" with "\\d+" matches one or more digits, depending on the actual value.
library(dplyr)
library(stringr)
df <- tibble(name = "0112200015-1_R2_001.fastq.gz")

df %>%
  # Extract the 10-digit sample id
  mutate(sample_id = str_extract(name, pattern = "\\d{10}"),
         # Extract the 1-digit timepoint, which comes after "-" and before the first "_"
         timepoint = str_extract(name, pattern = "(?<=-)\\d{1}(?=_)"),
         # Extract the 1-digit paired_end, which comes after "_R"
         paired_end = str_extract(name, pattern = "(?<=_R)\\d{1}"))
# A tibble: 1 x 4
name sample_id timepoint paired_end
<chr> <chr> <chr> <chr>
1 0112200015-1_R2_001.fastq.gz 0112200015 1 2

Unpack json columns into a dataframe

I have JSON strings inside a dataframe column. I want to expand them into new columns in the dataframe.
# Input
JsonID <- as.factor(c(1,2,3))
JsonString1 = "{\"device\":{\"site\":\"Location1\"},\"tags\":{\"Engine Pressure\":\"150\",\"timestamp\":\"2608411982\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"2608411982\"}"
JsonString2 = "{\"device\":{\"site\":\"Location2\"},\"tags\":{\"Engine Pressure\":\"160\",\"timestamp\":\"3608411983\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"3608411983\"}"
JsonString3 = "{\"device\":{\"site\":\"Location3\"},\"tags\":{\"Brake Fluid\":\"100\",\"timestamp\":\"4608411984\",\"historic\":false,\"adhoc\":false},\"online\":true,\"time\":\"4608411984\"}"
JsonStrings = c(JsonString1, JsonString2, JsonString3)
Example <- data.frame(JsonID, JsonStrings)
Using the jsonlite library I can make each json string into a 1 row dataframe.
library(jsonlite)
# One row dataframes
DF1 <- data.frame(fromJSON(JsonString1))
DF2 <- data.frame(fromJSON(JsonString2))
DF3 <- data.frame(fromJSON(JsonString3))
Unfortunately the JsonID column is lost. All the JSON strings share common column names such as "time", but there are also column names they don't share. By pivoting the data longer I could rbind all the dataframes together.
library(dplyr)
library(tidyr)
# Row bindable one row dataframes
DF1_RowBindable <- DF1 %>%
  rename_all(~ gsub("tags.", "", .x)) %>%
  tidyr::pivot_longer(cols = c(colnames(.)[2]))
Is there a better way to do this?
I have never worked with json strings before. The solution must be computationally scalable.
We can store the output of fromJSON as a list-column in the dataframe itself, so we don't lose any information that we already have in the data. We can then use unnest_wider to create new columns from the named lists.
library(dplyr)
library(tidyr)
library(jsonlite)
Example %>%
  rowwise() %>%
  mutate(data = list(fromJSON(JsonStrings))) %>%
  unnest_wider(data) %>%
  select(-JsonStrings) %>%
  unnest_wider(tags) %>%
  unnest_wider(device)
# JsonID site `Engine Pressure` timestamp historic adhoc `Brake Fluid` online time
# <fct> <chr> <chr> <chr> <lgl> <lgl> <chr> <lgl> <chr>
#1 1 Location1 150 2608411982 FALSE FALSE NA TRUE 2608411982
#2 2 Location2 160 3608411983 FALSE FALSE NA TRUE 3608411983
#3 3 Location3 NA 4608411984 FALSE FALSE 100 TRUE 4608411984
Since the columns (data, tags, device) have different lengths, we need to use unnest_wider separately on each of them.
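For larger data, here is a sketch of an alternative (my addition, not part of the answer above): parse each string once with purrr::map() and let bind_rows() fill the columns a given record lacks with NA. The resulting names keep the device./tags. prefixes, which can be cleaned up afterwards if needed.
library(dplyr)
library(purrr)
library(jsonlite)

parsed <- Example$JsonStrings %>%
  as.character() %>%                    # in case the column was read in as a factor
  map(~ data.frame(fromJSON(.x))) %>%   # one one-row data frame per JSON string
  bind_rows()                           # columns a record lacks are filled with NA

bind_cols(Example["JsonID"], parsed)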

Obtain a Count of all the combinations created in a column when grouping by another column in df with different length combinations in R

Sample data frame
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest,State)
Desired output
I have tried about a dozen different ideas but I'm not getting close. The closest was setting up a crosstab, but I didn't know how to get counts from that. Long/wide reshaping got me nowhere. I'm still too new to think outside the box, I guess.
Try this approach. You can arrange your values and then use group_by() and summarise() to reach a structure similar to the one expected:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
  arrange(Guest, State) %>%
  group_by(Guest) %>%
  summarise(Chain = paste0(State, collapse = '-')) %>%
  group_by(Chain, .drop = T) %>%
  summarise(N = n())
Output:
# A tibble: 4 x 2
Chain N
<chr> <int>
1 AL-MA-TX 1
2 AL-TX 2
3 IA-MA 2
4 IA-TX 1
We can use base R with aggregate and table
table(aggregate(State~ Guest, df[do.call(order, df),], paste, collapse='-')$State)
-output
# AL-MA-TX AL-TX IA-MA IA-TX
# 1 2 2 1

Long to Wide with Non-Unique Key Combinations in R

I am trying to convert a dataset from long to wide format. I need to do this to feed the data into another program for analysis. My input data is below:
sdata <- data.frame(c(1,1,1,1,1,1,1,1,1,1,1,1,1),c(1,1,1,1,1,1,1,1,1,2,2,2,2),c("X1","A","B","C","D","X2","A","B","C","X1","A","B","C"),c(81,31,40,5,5,100,8,90,2,50,20,24,6))
col_headings <- c("Orig","Dest","Desc","Estimate")
names(sdata) <- col_headings
Input Data
Depending on the unique Orig-Dest-X1 / Orig-Dest-X2 combination above, the subcategories vary from only A, B, C to A, B, C, D to A, B, etc. I am trying to get the desired output (code to recreate it in R is below).
sdata_spread <- data.frame(c(1,1),c(1,2),c(81,50),c(31,20),c(40,24),c(5,6),c(5,NA),c(100,NA),c(8,NA),c(90,NA),c(2,NA))
col_headings <- c("Orig","Dest","X1", "X1_A", "X1_B", "X1_C", "X1_D","X2", "X2_A", "X2_B", "X2_C")
names(sdata_spread) <- col_headings
Desired Output
I tried the following:
sdata_spread <- sdata %>% spread(Desc,Estimate)
The error I got was:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 6 rows
I also tried the accepted answers given here: "Long to wide with no unique key" and here: "Long to wide format with several duplicates. Circumvent with unique combo of columns", but they did not get me the desired output.
Any insights would be much appreciated.
Thanks,
Krishnan
One option is to create a grouping variable based on the occurrence of 'X' as the first character of 'Desc', use that to modify 'Desc' by pasting the first element of 'Desc' onto each of the other elements with a condition in case_when, and then reshape to wide format with pivot_wider. (As of tidyr 1.0.0, spread/gather are being deprecated and pivot_wider/pivot_longer are used in their place.)
library(dplyr)
library(tidyr)
library(stringr)
sdata %>%
  group_by(grp = cumsum(str_detect(Desc, '^X'))) %>%
  mutate(Desc = case_when(row_number() > 1 ~ str_c(first(Desc), Desc, sep = "_"),
                          TRUE ~ as.character(Desc))) %>%
  ungroup %>%
  select(-grp) %>%
  pivot_wider(names_from = Desc, values_from = Estimate)
# A tibble: 2 x 11
# Orig Dest X1 X1_A X1_B X1_C X1_D X2 X2_A X2_B X2_C
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 81 31 40 5 5 100 8 90 2
#2 1 2 50 20 24 6 NA NA NA NA NA
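If it helps to see what the labelling step produces before the pivot (my own trace of the code above, so please verify it on your data), pull out the modified Desc column on its own:
library(dplyr)
library(stringr)

sdata %>%
  group_by(grp = cumsum(str_detect(Desc, '^X'))) %>%
  mutate(Desc = case_when(row_number() > 1 ~ str_c(first(Desc), Desc, sep = "_"),
                          TRUE ~ as.character(Desc))) %>%
  ungroup() %>%
  pull(Desc)
#> expected: "X1" "X1_A" "X1_B" "X1_C" "X1_D" "X2" "X2_A" "X2_B" "X2_C" "X1" "X1_A" "X1_B" "X1_C"
#> i.e. every Estimate now maps to a unique wide-format column name within its Orig/Dest pair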
