I have a data frame with multiple instances of each column. I'd like to "concatenate" the values in each row across each group of columns, with dplyr if possible.
Example:
|IDs|p001_i1|p001_i2|p501_i1|p501_i2|p501_i3|
|-|-|-|-|-|-|
|AA|NA|NA|1|NA|NA|
|AB|5|NA|NA|NA|NA|
|AC|NA|10|NA|NA|2|
Here for example I'd like to group the columns with similar starting name (here "p001" and "p501" but there can be many more, and different numbers of instances for each "pxxx" (p001 has 2 columns, p501 has 3)) and report their non-missing value for each row.
The final result would be:
|IDs|p001|p501|
|-|-|-|
|AA|NA|1|
|AB|5|NA|
|AC|10|2|
If there are multiple values in a row, it could, for example, take the mean or the value of the latest field (i.e. priority i3 > i2 > i1).
I've looked at across(), c_across() and pmap(), but I can't see how to implement that on each "group" of columns. Thanks!
Here is a tidyverse option
library(tidyverse)
df %>%
    pivot_longer(
        -IDs, names_pattern = "(.+)_(.+)", names_to = c("name", NA)) %>%
    group_by(IDs, name) %>%
    summarise(value = mean(value, na.rm = TRUE), .groups = "drop") %>%
    pivot_wider()
## A tibble: 3 × 3
# IDs p001 p501
# <chr> <dbl> <dbl>
#1 AA NaN 1
#2 AB 5 NaN
#3 AC 10 2
Explanation: Reshape from wide to long (retaining only the first part of the wide column names, "p001" and "p501", via names_pattern and names_to), group by IDs and name, calculate the mean (ignoring NAs), and reshape again from long to wide. Groups with no non-missing values come out as NaN, because the mean of an empty vector is NaN.
Sample data
df <- read.table(text = "IDs p001_i1 p001_i2 p501_i1 p501_i2 p501_i3
AA NA NA 1 NA NA
AB 5 NA NA NA NA
AC NA 10 NA NA 2", header = TRUE)
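The question also mentions keeping the value of the latest field (priority i3 > i2 > i1) instead of the mean. A minimal sketch of that variation follows, assuming the instance suffixes sort correctly as text (i1 to i9). `last_non_na` is a hypothetical helper written for this example, not a dplyr function, and the `instance` column name is likewise made up for illustration. It returns NA rather than NaN when a group has no values.

```r
library(dplyr)
library(tidyr)

df <- read.table(text = "IDs p001_i1 p001_i2 p501_i1 p501_i2 p501_i3
AA NA NA 1 NA NA
AB 5 NA NA NA NA
AC NA 10 NA NA 2", header = TRUE)

# Hypothetical helper: last non-missing element, or a typed NA if none exist
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) x[NA_integer_] else x[length(x)]
}

result <- df %>%
  # Split "p001_i1" into the group ("p001") and the instance ("i1")
  pivot_longer(-IDs, names_pattern = "(.+)_(.+)",
               names_to = c("name", "instance")) %>%
  # Order instances so the last element in each group is the latest one
  arrange(IDs, name, instance) %>%
  group_by(IDs, name) %>%
  summarise(value = last_non_na(value), .groups = "drop") %>%
  pivot_wider(names_from = name, values_from = value)

print(result)
```

If instance numbers can go past i9, the suffix would need to be converted to a number before sorting.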
Another tidyverse option that keeps the latest non-missing value in each group. Note that it leaves two separate rows for AC, because the name column is still present when pivot_wider() runs and is therefore treated as an identifier:
df %>%
    pivot_longer(-1) %>%
    group_by(IDs, name_prefix = str_extract(name, "^.+?(?=_)")) %>%
    filter(!is.na(value)) %>%
    slice_tail(n = 1) %>%
    ungroup() %>%
    pivot_wider(names_from = name_prefix, values_from = value, values_fill = NA) %>%
    select(-name)
# A tibble: 4 × 3
IDs p501 p001
<chr> <int> <int>
1 AA 1 NA
2 AB NA 5
3 AC NA 10
4 AC 2 NA
Also, you can check this out:
How to get value of last non-NA column
I used the gather function from tidyr to reshape my data from wide to long, and then removed the NA values. How can I reverse this operation?
first.df <- data.frame(a = c(1,NA,NA),b=c(2,3,NA),c=c(4,5,6))
second.df <- first.df %>% gather("x","y") %>% na.omit %>% `rownames<-`( NULL )
second.df:
|x|y|
|-|-|
|a|1|
|b|2|
|b|3|
|c|4|
|c|5|
|c|6|
How can I reverse this with spread or another function to get back first.df, shown below?
|a|b|c|
|-|-|-|
|1|2|4|
|NA|3|5|
|NA|NA|6|
This requires a sequence column; after that, dcast, pivot_wider, or spread will work:
library(dplyr)
library(tidyr)
library(data.table)
second.df %>%
    mutate(rn = rowid(x)) %>%
    pivot_wider(names_from = x, values_from = y) %>%
    select(-rn)
Output:
# A tibble: 3 × 3
a b c
<dbl> <dbl> <dbl>
1 1 2 4
2 NA 3 5
3 NA NA 6
Or with dcast
library(data.table)
dcast(second.df, rowid(x) ~ x, value.var = 'y')[,-1]
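The same sequence-column idea can be sketched without data.table, using a grouped row_number() from dplyr in place of rowid(). This is only an illustration under the assumption that the values were stacked from the top of each column. Once the NAs have been dropped, their original positions are lost, so a column such as c(NA, 1, NA) would come back as c(1, NA, NA).

```r
library(dplyr)
library(tidyr)

# The long data from the question, after gather() and na.omit()
second.df <- data.frame(
  x = c("a", "b", "b", "c", "c", "c"),
  y = c(1, 2, 3, 4, 5, 6)
)

restored <- second.df %>%
  group_by(x) %>%
  mutate(rn = row_number()) %>%  # 1, 2, 3, ... within each original column
  ungroup() %>%
  pivot_wider(names_from = x, values_from = y) %>%
  select(-rn)

print(restored)
```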
I have a data frame that has about 200,000 rows with columns like:
|ID|dictionary column 1|dictionary column 2|
|-|-|-|
|1|{""1720100"":4,""1720101"":3}|{""1720100"":5,""1720101"":1,""1720102"":2}|
|2|{""1720100"":4}|{""1720100"":4,""1720101"":2}|
|...|...|...|
The output table I would like to get is:
|ID|col_a|col_b|col_c|col_d|
|-|-|-|-|-|
|1|1720100|4|1720101|5|
|1|1720101|3|1720102|1|
|1|NA|NA|1720103|2|
|2|1720100|4|1720101|4|
|2|NA|NA|1720102|2|
|...|...|...|...|...|
I also suspect it would be faster to divide the data frame into several chunks before splitting the columns above. Could anyone help me with this?
It looks like you want to extract JSON from the columns, which the jsonlite package can do. Because there is JSON in two columns, first put the data into long form, parse it, and then pivot again to get the desired final format. The final select() just reorders the columns by the number contained in the column name.
library(tidyverse)
library(jsonlite)
df %>%
    pivot_longer(cols = -ID) %>%
    mutate(json_parsed = map(value, ~fromJSON(sprintf("[%s]", .), flatten = TRUE))) %>%
    unnest(json_parsed) %>%
    pivot_longer(cols = -c(ID, name, value), names_to = "n", values_to = "v") %>%
    pivot_wider(id_cols = ID, values_from = c(n, v), values_fn = list) %>%
    unnest(cols = -ID) %>%
    select(ID, order(parse_number(names(.)[-1])) + 1)
Output
ID n_dictionary_column_1 v_dictionary_column_1 n_dictionary_column_2 v_dictionary_column_2
<dbl> <chr> <int> <chr> <int>
1 1 1720100 4 1720100 5
2 1 1720101 3 1720101 1
3 1 1720102 NA 1720102 2
4 2 1720100 4 1720100 4
5 2 1720101 NA 1720101 2
6 2 1720102 NA 1720102 NA
I have a dataframe exported from the web with this format
id vals
1 {7,12,58,1}
2 {1,2,5,7}
3 {15,12}
I would like to extract ONLY the numbers (ignoring the curly braces and commas) into multiple columns like this:
id val_1 val_2 val_3 val_4 val_5
1 7 12 58 1
2 1 2 5 7
3 15 12
Even though the maximum number of values here is 4, I always want the columns to go up to val_5.
Thanks!
We could use str_extract_all for this:
library(dplyr)
library(stringr)
df %>%
mutate(vals = str_extract_all(vals, '\\d+', ''))
or, as @akrun suggests in the comments:
df %>%
mutate(vals = str_extract_all(vals, '\\d+', '')) %>%
do.call(data.frame, .)
id vals.1 vals.2 vals.3 vals.4
1 1 7 12 58 1
2 2 1 2 5 7
3 3 15 12 <NA> <NA>
data:
df <- structure(list(id = 1:3, vals = c("{7,12,58,1}", "{1,2,5,7}",
"{15,12}")), class = "data.frame", row.names = c(NA, -3L))
Another possible tidyverse option: remove the curly brackets, separate the rows on the commas, then pivot to wide form. We can then add one more column (using add_column from tibble) whose number is one more than the number in the last column name (4 in this case), which creates val_5.
library(tidyverse)
df %>%
    mutate(vals = str_replace_all(vals, "\\{|\\}", "")) %>%
    separate_rows(vals, sep = ",") %>%
    group_by(id) %>%
    mutate(ind = row_number()) %>%
    pivot_wider(names_from = ind, values_from = vals, names_prefix = "val_") %>%
    add_column(!!(paste0("val_", parse_number(names(.)[ncol(.)]) + 1)) := NA)
Output
id val_1 val_2 val_3 val_4 val_5
1 1 7 12 58 1 NA
2 2 1 2 5 7 NA
3 3 15 12 <NA> <NA> NA
Data
df <- read.table(text = "id vals
1 {7,12,58,1}
2 {1,2,5,7}
3 {15,12} ", header = T)
Using data.table
library(data.table)
library(stringi)
result <- setDT(df)[, stri_match_all_regex(vals, '\\d+')[[1]], by=.(id)]
result[, item:=paste('val', 1:.N, sep='_'), by=.(id)] # defines column names
dcast(result, id~item, value.var = 'V1') # convert from long to wide
## id val_1 val_2 val_3 val_4
## 1: 1 7 12 58 1
## 2: 2 1 2 5 7
## 3: 3 15 12 <NA> <NA>
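The question asks for the columns to always run to val_5, whatever the data contains. One way to guarantee a fixed width is to pad every row to a chosen length. Below is a minimal base R sketch of that idea; `n_cols` is a made-up name for the target width, and the code assumes no row holds more than `n_cols` numbers (any extra values would be silently dropped).

```r
df <- data.frame(id = 1:3,
                 vals = c("{7,12,58,1}", "{1,2,5,7}", "{15,12}"))

n_cols <- 5  # assumed target number of value columns

# Pull every run of digits out of each string (one character vector per row)
nums <- regmatches(df$vals, gregexpr("[0-9]+", df$vals))

# Indexing past the end of a vector yields NA, which pads each row to n_cols
padded <- t(vapply(nums,
                   function(x) as.integer(x)[seq_len(n_cols)],
                   integer(n_cols)))
colnames(padded) <- paste0("val_", seq_len(n_cols))

result <- cbind(df["id"], as.data.frame(padded))
print(result)
```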
I have a dataframe like this :
data.frame(id = c(1,2,3),
first_value=c("A","B","NA"),second_value=c("A","NA","D"),
first_date=c("2001",2010,2003),second_date=c("2003",2014,"2007"))
id first_value second_values first_date second_date
1 1 A A 2001 2003
2 2 B NA 2010 2014
3 3 NA D 2003 2007
I'm looking to transform it into a longer data frame like this, in the simplest way:
id timing value date
1 1 first A 2001
2 1 second A 2003
3 2 first B 2010
4 2 second NA 2014
5 3 first NA 2003
6 3 second D 2007
I wasn't successful with tidyr's pivot_longer.
You can do:
library(tidyverse)
df %>%
    pivot_longer(cols = -id,
                 names_pattern = '(.*)_(.*)',
                 names_to = c('timing', '.value'))
Which gives:
# A tibble: 6 x 4
id timing value date
<dbl> <chr> <chr> <chr>
1 1 first A 2001
2 1 second A 2003
3 2 first B 2010
4 2 second NA 2014
5 3 first NA 2003
6 3 second D 2007
NOTE: this only works if you rename your second_values column to second_value. I assume "values" was just a typo?
An alternative that is a bit simpler, based on @Maël's suggestion:
df %>%
    pivot_longer(cols = -id,
                 names_sep = '_',
                 names_to = c('timing', '.value'))
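For reference, here is a self-contained sketch that applies the rename mentioned in the note before pivoting. The `.value` entry in names_to tells pivot_longer that the second part of each column name ("value", "date") should become an output column, while the first part goes into timing. The data frame below is an assumption based on the printed table in the question, with real NA values rather than the string "NA".

```r
library(dplyr)
library(tidyr)

# Data as printed in the question (assumed: true NAs, misspelled second_values)
df <- data.frame(id = c(1, 2, 3),
                 first_value = c("A", "B", NA),
                 second_values = c("A", NA, "D"),
                 first_date = c("2001", "2010", "2003"),
                 second_date = c("2003", "2014", "2007"))

long <- df %>%
  rename(second_value = second_values) %>%  # make the suffixes consistent
  pivot_longer(cols = -id,
               names_sep = "_",
               names_to = c("timing", ".value"))

print(long)
```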
Could try something like this?
library(tidyverse)
df <- data.frame(id = c(1,2,3),
first_value=c("A","B","NA"),
second_values=c("A","NA","D"),
first_date=c("2001",2010,2003),
second_date=c("2003",2014,"2007"))
bind_rows(
df %>%
select(id, first_value, date = first_date) %>%
pivot_longer(cols = "first_value", names_to = "timing"),
df %>%
select(id, second_values, date = second_date) %>%
pivot_longer(cols = "second_values", names_to = "timing")
) %>%
relocate(id, timing, value, date) %>%
arrange(id)
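The last two lines are only there to match the column and row order you posted, so they could probably be omitted. Note that timing will hold the original column names (first_value, second_values) rather than first and second.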
I am trying to create a table from a data set that takes two factors from a variable, pivots them wider, and lines them up in a single row. Unfortunately, I either keep producing two separate lists, or I get this:
dput(head(test1, 5))
# Edited section:
test1 <- df %>% # Code used to create the table below
select(`Incident ID`,`Device Time`, Description, `Elapsed Time`) %>%
filter(Description == "CPR Stopped" | Description == "CPR Started") %>%
mutate(Index = c(1:16)) %>%
pivot_wider(names_from = Description,
values_from = c(`Elapsed Time`, `Device Time`)) %>%
filter(!is.na(test1))
dput(head(test1, 5))
`Incident ID` Index `Elapsed Time_CPR Started` `Elapsed Time_CPR Stopped` `Device Time_CPR Started` `Device Time_CPR Stopped`
<chr> <int> <time> <time> <time> <time>
1 F190158585 1 01'03" NA 18'37" NA
2 F190158585 2 NA 01'08" NA 18'42"
3 F190158585 3 01'34" NA 19'08" NA
4 F190158585 4 NA 03'47" NA 21'22"
5 F190158585 5 04'00" NA 21'35" NA
I am trying to get a table that looks like this:
df <- data.frame(Index = c(1:3),
CPR_Started = c("00:01:00", "00:02:03", "00:05:46"),
CPR_Stopped = c("00:01:53", "00:04:30", "00:08:00"))
print(df)
Index CPR_Started CPR_Stopped
1 1 00:01:00 00:01:53
2 2 00:02:03 00:04:30
3 3 00:05:46 00:08:00
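One possible way to get a table of that shape is to number the events within each Description, so that the n-th "CPR Started" and the n-th "CPR Stopped" share an Index and land in the same row after pivoting. The sketch below is only an illustration. The `events` data frame and its column names are hypothetical, the times are taken from the desired output above, and it assumes the start and stop events strictly alternate within an incident.

```r
library(dplyr)
library(tidyr)

# Hypothetical event log; the column names are illustrative only
events <- data.frame(
  incident_id  = "F190158585",
  description  = rep(c("CPR Started", "CPR Stopped"), times = 3),
  elapsed_time = c("00:01:00", "00:01:53", "00:02:03",
                   "00:04:30", "00:05:46", "00:08:00")
)

paired <- events %>%
  group_by(incident_id, description) %>%
  mutate(Index = row_number()) %>%  # n-th start and n-th stop get the same Index
  ungroup() %>%
  pivot_wider(names_from = description, values_from = elapsed_time) %>%
  rename(CPR_Started = `CPR Started`, CPR_Stopped = `CPR Stopped`)

print(paired)
```

A single running index over all rows (such as 1:16) gives every event its own row, which is why the NAs appear in the table above.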