A Tricky Pivot Longer [duplicate] - r

I have a data frame like this:
df <- data.frame(id = c(1,2,3),
first_value=c("A","B","NA"),second_value=c("A","NA","D"),
first_date=c("2001",2010,2003),second_date=c("2003",2014,"2007"))
id first_value second_values first_date second_date
1 1 A A 2001 2003
2 2 B NA 2010 2014
3 3 NA D 2003 2007
I'm looking to transform it into a longer data frame like this, in the simplest way possible:
id timing value date
1 1 first A 2001
2 1 second A 2003
3 2 first B 2010
4 2 second NA 2014
5 3 first NA 2003
6 3 second D 2007
I wasn't successful with tidyr's pivot_longer().

You can do:
library(tidyverse)
df %>%
pivot_longer(cols = -id,
names_pattern = '(.*)_(.*)',
names_to = c('timing', '.value'))
Which gives:
# A tibble: 6 x 4
id timing value date
<dbl> <chr> <chr> <chr>
1 1 first A 2001
2 1 second A 2003
3 2 first B 2010
4 2 second NA 2014
5 3 first NA 2003
6 3 second D 2007
NOTE: this only works if you rename your second_values column to second_value. I assume the "values" was just a typo?
Alternatively, and a bit simpler, based on @Maël's suggestion:
df %>%
pivot_longer(cols = -id,
names_sep = '_',
names_to = c('timing', '.value'))
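For anyone who wants to run this end to end, here is a minimal sketch that builds the data, fixes the column name, and then pivots. It assumes the column really is called second_values (as in the printed output) and that the "NA" strings were meant as real missing values; both are guesses about the asker's data, not something the question confirms.

```r
library(dplyr)
library(tidyr)

# Example data; real NA values are used here instead of the string "NA" (assumption)
df <- data.frame(id = c(1, 2, 3),
                 first_value = c("A", "B", NA),
                 second_values = c("A", NA, "D"),
                 first_date = c("2001", "2010", "2003"),
                 second_date = c("2003", "2014", "2007"))

long <- df %>%
  # make the suffix consistent so that ".value" yields a single "value" column
  rename(second_value = second_values) %>%
  # the part before "_" goes to "timing"; the part after it names the output columns
  pivot_longer(cols = -id,
               names_sep = "_",
               names_to = c("timing", ".value"))

long
```

Without the rename step, ".value" would produce separate value and values columns, which is probably not what is wanted.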

Could try something like this?
library(tidyverse)
df <- data.frame(id = c(1,2,3),
first_value=c("A","B","NA"),
second_values=c("A","NA","D"),
first_date=c("2001",2010,2003),
second_date=c("2003",2014,"2007"))
bind_rows(
df %>%
select(id, first_value, date = first_date) %>%
pivot_longer(cols = "first_value", names_to = "timing"),
df %>%
select(id, second_values, date = second_date) %>%
pivot_longer(cols = "second_values", names_to = "timing")
) %>%
relocate(id, timing, value, date) %>%
arrange(id)
The last two lines are only there to get the same column order and row order as you posted, so they could probably be omitted.


R Rowise operation on partially similar columns

I have a df with multiple instances of each column. I'd like to "concatenate" all rows for each group of column. Possibly with dplyr if that's possible.
Example:
IDs p001_i1 p001_i2 p501_i1 p501_i2 p501_i3
AA  NA      NA      1       NA      NA
AB  5       NA      NA      NA      NA
AC  NA      10      NA      NA      2
Here for example I'd like to group the columns with similar starting name (here "p001" and "p501" but there can be many more, and different numbers of instances for each "pxxx" (p001 has 2 columns, p501 has 3)) and report their non-missing value for each row.
The final result would be:
IDs p001 p501
AA  NA   1
AB  5    NA
AC  10   2
If there are multiple values in a row, it could for example take the mean or the value of the latest field (i.e. priority i3 > i2 > i1).
I've looked at across(), c_across() and pmap(), but I can't see how to implement that on each "group" of columns. Thanks!
Here is a tidyverse option
library(tidyverse)
df %>%
pivot_longer(
-IDs, names_pattern = c("(.+)_(.+)"), names_to = c("name", NA)) %>%
group_by(IDs, name) %>%
summarise(value = mean(value, na.rm = TRUE), .groups = "drop") %>%
pivot_wider()
## A tibble: 3 × 3
# IDs p001 p501
# <chr> <dbl> <dbl>
#1 AA NaN 1
#2 AB 5 NaN
#3 AC 10 2
Explanation: Reshape from wide to long (and only retain the first part of the wide column names "p001", "p501" through using names_pattern and names_to), group by IDs and name, then calculate the mean (ignoring NAs) and reshape again from long to wide.
Sample data
df <-read.table(text = "IDs p001_i1 p001_i2 p501_i1 p501_i2 p501_i3
AA NA NA 1 NA NA
AB 5 NA NA NA NA
AC NA 10 NA NA 2", header =T)
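The question also mentions taking the value of the latest field (i3 before i2 before i1) instead of the mean. A possible base R sketch for that variant, assuming the column names always follow the prefix_suffix pattern and that the suffixes sort in priority order:

```r
df <- read.table(text = "IDs p001_i1 p001_i2 p501_i1 p501_i2 p501_i3
AA NA NA 1 NA NA
AB 5 NA NA NA NA
AC NA 10 NA NA 2", header = TRUE)

# Work on the value columns only, as numeric (all-NA columns are read as logical)
value_cols <- df[-1]
value_cols[] <- lapply(value_cols, as.numeric)

# Split the columns into groups by the part of the name before the underscore
prefixes <- sub("_.*$", "", names(value_cols))
groups <- split.default(value_cols, prefixes)

# Within each group, go from the last column to the first and keep
# the first non-missing value found in each row
latest <- lapply(groups, function(g) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), rev(as.list(g)))
})

result <- data.frame(IDs = df$IDs, latest, stringsAsFactors = FALSE)
result
```

Unlike the mean-based version, this leaves NA (not NaN) where a group has no value at all.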
Another tidyverse option, which keeps the latest non-missing value per group; note that it leaves two separate rows for AC:
df %>%
pivot_longer(-1) %>%
group_by(IDs, name_prefix = str_extract(name, "^.+?(?=_)")) %>%
filter(!is.na(value)) %>%
slice_tail(n = 1) %>%
ungroup() %>%
pivot_wider(names_from = name_prefix, values_from = value, values_fill = NA) %>%
select(-name)
# A tibble: 4 × 3
IDs p501 p001
<chr> <int> <int>
1 AA 1 NA
2 AB NA 5
3 AC NA 10
4 AC 2 NA
Also, you can check this out:
How to get value of last non-NA column

R Extract specific text from column into multiple columns [duplicate]

I have a dataframe exported from the web with this format
id vals
1 {7,12,58,1}
2 {1,2,5,7}
3 {15,12}
I would like to extract ONLY the numbers (ignoring the curly braces and commas) into multiple columns like this:
id val_1 val_2 val_3 val_4 val_5
1 7 12 58 1
2 1 2 5 7
3 15 12
Even though the maximum number of values here is 4, I always want the columns to go up to val_5.
Thanks!
We could use str_extract_all for this:
library(dplyr)
library(stringr)
df %>%
mutate(vals = str_extract_all(vals, '\\d+', ''))
or, as @akrun suggests in the comments:
df %>%
mutate(vals = str_extract_all(vals, '\\d+', '')) %>%
do.call(data.frame, .)
id vals.1 vals.2 vals.3 vals.4
1 1 7 12 58 1
2 2 1 2 5 7
3 3 15 12 <NA> <NA>
data:
df <- structure(list(id = 1:3, vals = c("{7,12,58,1}", "{1,2,5,7}",
"{15,12}")), class = "data.frame", row.names = c(NA, -3L))
Another possible tidyverse option: remove the curly braces, separate the rows on the commas, then pivot to wide form. We can then create the additional column (using add_column from tibble) based on the highest index in the column names (4 in this case), which gives val_5.
library(tidyverse)
df %>%
mutate(vals = str_replace_all(vals, "\\{|\\}", "")) %>%
separate_rows(vals, sep=",") %>%
group_by(id) %>%
mutate(ind = row_number()) %>%
pivot_wider(names_from = ind, values_from = vals, names_prefix = "val_") %>%
add_column(!!(paste0("val_", parse_number(names(.)[ncol(.)])+1)) := NA)
Output
id val_1 val_2 val_3 val_4 val_5
1 1 7 12 58 1 NA
2 2 1 2 5 7 NA
3 3 15 12 <NA> <NA> NA
Data
df <- read.table(text = "id vals
1 {7,12,58,1}
2 {1,2,5,7}
3 {15,12} ", header = T)
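If the number of columns should always be 5 regardless of how many values appear, one possible base R sketch is to pad every row to a fixed width. The n_cols variable is a name made up for this example, and the code assumes no row has more than n_cols numbers (extra values would be silently dropped).

```r
df <- data.frame(id = 1:3,
                 vals = c("{7,12,58,1}", "{1,2,5,7}", "{15,12}"),
                 stringsAsFactors = FALSE)

n_cols <- 5  # fixed number of val_ columns requested in the question

# Pull out every run of digits from each string (one character vector per row)
nums <- regmatches(df$vals, gregexpr("[0-9]+", df$vals))

# Indexing past the end of a vector pads with NA, so every row gets n_cols entries
padded <- t(sapply(nums, function(x) as.integer(x)[seq_len(n_cols)]))
colnames(padded) <- paste0("val_", seq_len(n_cols))

result <- cbind(df["id"], padded)
result
```

Here val_5 is all NA because no row has a fifth number.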
Using data.table
library(data.table)
library(stringi)
result <- setDT(df)[, stri_match_all_regex(vals, '\\d+')[[1]], by=.(id)]
result[, item:=paste('val', 1:.N, sep='_'), by=.(id)] # defines column names
dcast(result, id~item, value.var = 'V1') # convert from long to wide
## id val_1 val_2 val_3 val_4
## 1: 1 7 12 58 1
## 2: 2 1 2 5 7
## 3: 3 15 12 <NA> <NA>

Identifying values from one database to use in another database

I am working on a project in which I need to work with 2 databases, identify values from one database to use in another.
I have a dataframe 1,
df1<-data.frame("ID"=c(1,2,3),"Condition A"=c("B","B","A"),"Condition B"=c("1","1","2"),"Year"=c(2002,1988,1995))
and a dataframe 2,
df2 <- data.frame("Condition A"=c("A","A","B","B"),"Condiction B"=c("1","2","1","2"),"<1990"=c(20,30,50,80),"1990-2000"=c(100,90,80,30),">2000"=c(300,200,800,400))
I would like to add a new column to df1 called "Value", in which, for each ID (from df1), collects the values from column 3,4 or 5 from df2 (depending on the year), and following conditions A and B available in both databases. The end result would be something like this:
df1<-data.frame("ID"=c(1,2,3),"Condition A"=c("B","B","A"),"Condition B"=c("1","1","2"),"Year"=c(2002,1988,1995),"Value"=c(800,50,90))
thanks!
I think we can simply left_join, then mutate with case_when, then drop the undesired columns with select:
library(dplyr)
left_join(df1, df2, by=c("Condition.A", "Condition.B"))%>%
mutate(Value=case_when(Year<1990 ~ X.1990,
Year<2000 ~ X1990.2000,
Year>=2000 ~ X.2000))%>%
select(-starts_with("X"))
ID Condition.A Condition.B Year Value
1 1 B 1 2002 800
2 2 B 1 1988 50
3 3 A 2 1995 90
EDIT: I edited your code, removing the "Condiction" typo
You could use
library(dplyr)
library(tidyr)
df2 %>%
rename(Condition.B = Condiction.B) %>%
pivot_longer(matches("\\d{4}")) %>%
right_join(df1, by = c("Condition.A", "Condition.B")) %>%
filter(name == case_when(
Year < 1990 ~ "X.1990",
Year > 2000 ~ "X.2000",
TRUE ~ "X1990.2000")) %>%
select(ID, Condition.A, Condition.B, Year, Value = value) %>%
arrange(ID)
This returns
# A tibble: 3 x 5
ID Condition.A Condition.B Year Value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 B 1 2002 800
2 2 B 1 1988 50
3 3 A 2 1995 90
At first we rename the misspelled column Condiction.B of df2 and bring it into a "long format" based on the "<1990", "1990-2000", ">2000" columns. Note that those columns can't be named like this, they are automatically renamed to X.1990, X1990.2000 and X.2000.
Next we use a right join with df1 on the two Condition columns.
Finally we filter just the matching years based on a hard coded case_when function and do some clean up (selecting and arranging).
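To see where the X.1990, X1990.2000 and X.2000 names come from, here is a small sketch. data.frame() passes column names through make.names() unless check.names = FALSE is set; the raw and fixed variables are just names chosen for this illustration.

```r
# The headers as typed in the question
raw <- c("<1990", "1990-2000", ">2000")

# make.names() replaces invalid characters with "." and prepends "X"
# when the result would not otherwise be a syntactically valid name
fixed <- make.names(raw)
fixed

# With check.names = FALSE the original headers are kept,
# but they then need backticks when referred to in code
kept <- data.frame("<1990" = 20, "1990-2000" = 100, ">2000" = 300,
                   check.names = FALSE)
names(kept)
```

Keeping the original headers is possible, but the automatically repaired names are easier to type inside case_when().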
We could do it this way:
Condiction must be a typo so I changed it to Condition
in df1 create a helper column that assigns each year to its group, which is a column name in df2
bring df2 in long format
finally apply left_join by by=c("Condition.A", "Condition.B", "helper"="name")
library(dplyr)
library(tidyr)
df1 <- df1 %>%
mutate(helper = case_when(Year >=1990 & Year <=2000 ~"X1990.2000",
Year <1990 ~ "X.1990",
Year >2000 ~ "X.2000"))
df2 <- df2 %>%
pivot_longer(
cols=starts_with("X")
)
df3 <- left_join(df1, df2, by=c("Condition.A", "Condition.B", "helper"="name")) %>%
select(-helper)
ID Condition.A Condition.B Year value
1 1 B 1 2002 800
2 2 B 1 1988 50
3 3 A 2 1995 90
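As a possible alternative to case_when for the helper column, cut() can bin the years directly. This is only a sketch and assumes the years are whole numbers, so that "before 1990" can be written as "up to 1989"; the years and helper variables are illustrative.

```r
years <- c(2002, 1988, 1995)

# Intervals are (-Inf, 1989], (1989, 2000], (2000, Inf]
helper <- as.character(cut(years,
                           breaks = c(-Inf, 1989, 2000, Inf),
                           labels = c("X.1990", "X1990.2000", "X.2000")))
helper
```

The resulting character vector can be used as the helper column in the same left_join as above.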

Separate rows with conditions

I have this dataframe separate_on_condition with two columns:
separate_on_condition <- data.frame(first = 'a3,b1,c2', second = '1,2,3,4,5,6')
# first second
# 1 a3,b1,c2 1,2,3,4,5,6
How can I turn it to:
# A tibble: 6 x 2
first second
<chr> <chr>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
where:
a3 will be separated into 3 rows
b1 into 1 row
c2 into 2 rows
Is there a better way of achieving this than using rep() on the first column and separate_rows() on the second column?
Any help would be much appreciated!
Create a row number column to account for multiple rows.
Split second column on , in separate rows.
For each row extract the data to be repeated along with number of times it needs to be repeated.
library(dplyr)
library(tidyr)
library(stringr)
separate_on_condition %>%
mutate(row = row_number()) %>%
separate_rows(second, sep = ',') %>%
group_by(row) %>%
mutate(first = rep(str_extract_all(first(first), '[a-zA-Z]+')[[1]],
str_extract_all(first(first), '\\d+')[[1]])) %>%
ungroup %>%
select(-row)
# first second
# <chr> <chr>
#1 a 1
#2 a 2
#3 a 3
#4 b 4
#5 c 5
#6 c 6
You can try the following base R option
with(
separate_on_condition,
data.frame(
first = unlist(sapply(
unlist(strsplit(first, ",")),
function(x) rep(gsub("\\d", "", x), as.numeric(gsub("\\D", "", x)))
), use.names = FALSE),
second = eval(str2lang(sprintf("c(%s)", second)))
)
)
which gives
first second
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
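A variant of the same base R idea that avoids eval(str2lang(...)), in case evaluating text from the data is a concern. It is a sketch for the single-row example only and assumes every element of first is letters followed by a count; parts, labels and counts are names made up here.

```r
separate_on_condition <- data.frame(first = "a3,b1,c2",
                                    second = "1,2,3,4,5,6",
                                    stringsAsFactors = FALSE)

# Split "a3,b1,c2" into its elements
parts <- strsplit(separate_on_condition$first, ",")[[1]]

# Letters give the label, digits give the number of repetitions
labels <- gsub("[0-9]", "", parts)
counts <- as.integer(gsub("[^0-9]", "", parts))

result <- data.frame(
  first = rep(labels, times = counts),
  second = strsplit(separate_on_condition$second, ",")[[1]],
  stringsAsFactors = FALSE
)
result
```

If the counts do not add up to the number of values in second, data.frame() will either recycle or throw an error, so it may be worth checking sum(counts) first.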
Here is an alternative approach:
add NA to first to get same length
use separate_rows to bring each element to a row
use extract by regex digit to split first into first and helper
group and slice by values in helper
do some tweaking
library(tidyr)
library(dplyr)
library(stringr) # needed for str_c()
separate_on_condition %>%
mutate(first = str_c(first, ",NA,NA,NA")) %>%
separate_rows(first, second, sep = "[^[:alnum:].]+", convert = TRUE) %>%
extract(first, into = c("first", "helper"), "(.{1})(.{1})", remove=FALSE) %>%
group_by(second) %>%
slice(rep(1:n(), each = helper)) %>%
ungroup() %>%
drop_na() %>%
mutate(second = row_number()) %>%
select(first, second)
first second
<chr> <int>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6

R - Return column name for row where first given value is found

I am trying to find the first occurrence of a FALSE in a dataframe for each row value. My rows are specific occurrences and the columns are dates. I would like to be able to find the date of first FALSE so that I can use that value to find a return date.
An example structure of my dataframe:
df <- data.frame(ID = c(1,2,3), '2001' = c(TRUE, TRUE, TRUE),
'2002' = c(FALSE, TRUE, FALSE), '2003' = c(TRUE, FALSE, TRUE),
check.names = FALSE) # keep the year names as-is instead of X2001, X2002, ...
I want to end up with a second dataframe or list that contains the ID and the column name that identifies the first instance of a FALSE.
For example :
ID | Date
1 | 2002
2 | 2003
3 | 2002
I do not know the mechanism to find such a result.
The actual dataframe contains a couple thousand rows so I unfortunately can't do it by hand.
I am a new R user so please don't refrain from suggesting things you might expect a more experienced R user to have already thought about.
Thanks in advance
Try this using tidyverse functions. You can reshape the data to long format and then filter for the FALSE values. If an ID has more than one FALSE, the second filter keeps only the first one. Here is the code:
library(dplyr)
library(tidyr)
#Code
newdf <- df %>% pivot_longer(-ID) %>%
group_by(ID) %>%
filter(value==F) %>%
filter(!duplicated(value)) %>% select(-value) %>%
rename(Myname=name)
Output:
# A tibble: 3 x 2
# Groups: ID [3]
ID Myname
<dbl> <chr>
1 1 2002
2 2 2003
3 3 2002
Another option, which avoids duplicated(), is to use row_number() to keep only the first row per group (row_number()==1):
library(dplyr)
library(tidyr)
#Code 2
newdf <- df %>% pivot_longer(-ID) %>%
group_by(ID) %>%
filter(value==F) %>%
mutate(V=ifelse(row_number()==1,1,0)) %>%
filter(V==1) %>%
select(-c(value,V)) %>% rename(Myname=name)
Output:
# A tibble: 3 x 2
# Groups: ID [3]
ID Myname
<dbl> <chr>
1 1 2002
2 2 2003
3 3 2002
Or using base R with apply() and a generic function:
#Code 3
out <- data.frame(df[,1,drop=F],Res=apply(df[,-1],1,function(x) names(x)[min(which(x==F))]))
Output:
ID Res
1 1 2002
2 2 2003
3 3 2002
We can use max.col with ties.method = 'first' after inverting the logical values.
cbind(df[1], Date = names(df[-1])[max.col(!df[-1], ties.method = 'first')])
# ID Date
#1 1 2002
#2 2 2003
#3 3 2002
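One thing to watch with max.col: for a row that contains no FALSE at all, every entry of the inverted matrix ties, so ties.method = 'first' returns the first column, which would be misleading. A sketch with a guard for that case follows; the fourth row is hypothetical data added only to show it, and check.names = FALSE is assumed so the year names stay intact.

```r
df <- data.frame(ID = c(1, 2, 3, 4),
                 "2001" = c(TRUE, TRUE, TRUE, TRUE),
                 "2002" = c(FALSE, TRUE, FALSE, TRUE),
                 "2003" = c(TRUE, FALSE, TRUE, TRUE),
                 check.names = FALSE)

# TRUE wherever the original value is FALSE
is_false <- !as.matrix(df[-1])

# Position of the first FALSE in each row
first_col <- max.col(is_false, ties.method = "first")

# Rows without any FALSE get NA instead of a misleading column 1
first_col[rowSums(is_false) == 0] <- NA

result <- data.frame(ID = df$ID,
                     Date = names(df)[-1][first_col],
                     stringsAsFactors = FALSE)
result
```

For the three rows from the question this gives the same dates as above; the extra ID 4 gets NA.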
