Rebuild tibble under condition - r

My Tibble:
df1 <- tibble(a = c("123*", "123", "124", "678*", "678", "679", "677"))
# A tibble: 7 x 1
  a
  <chr>
1 123*
2 123
3 124
4 678*
5 678
6 679
7 677
What it should become:
# A tibble: 3 x 2
  a     b
  <chr> <chr>
1 123   124
2 678   679
3 678   677
The values with the stars refer to the following values without stars, until a new starred value comes along, and so on.
Each starred value should go to the first column; the other values (except the ones that are identical to the starred value once the star is removed) should go to the second column. If one starred value is followed by several values, they should still be linked to each other, so the value in the first column is duplicated to keep the connection.
I know how to filter and bring the values into each column, but I am not sure how I would keep the connection.
Regards

We can use tidyverse. Create a grouping column based on the occurrence of * in 'a', extract the numeric part with parse_number, keep the distinct rows, then, grouped by 'grp', create a new column 'a' with the first value of 'b' and drop the first row of each group.
library(dplyr)
library(stringr)
df1 %>%
  transmute(grp = cumsum(str_detect(a, fixed("*"))),
            b = readr::parse_number(a)) %>%
  distinct(b, .keep_all = TRUE) %>%
  group_by(grp) %>%
  mutate(a = first(b)) %>%
  slice(-1) %>%
  ungroup() %>%
  select(a, b)
-output
# A tibble: 3 × 2
      a     b
  <dbl> <dbl>
1   123   124
2   678   679
3   678   677
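Note that this returns dbl columns, while the expected output in the question keeps chr columns. A minimal variation of the same pipeline that stays in character type, using str_remove() instead of parse_number(), would be:
df1 %>%
  transmute(grp = cumsum(str_detect(a, fixed("*"))),
            b = str_remove(a, fixed("*"))) %>%
  distinct(b, .keep_all = TRUE) %>%
  group_by(grp) %>%
  mutate(a = first(b)) %>%
  slice(-1) %>%
  ungroup() %>%
  select(a, b)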

Here is one base R option -
Using cumsum and grepl we split the data on each occurrence of *.
In each group, we drop the value that is identical to the starred value (without the star) and create a data frame with two columns.
Finally, we combine the list of data frames into one data frame.
result <- do.call(rbind, lapply(split(df1,
  cumsum(grepl('*', df1$a, fixed = TRUE))), function(x) {
    a <- x[[1]]
    a[1] <- sub('*', '', a[1], fixed = TRUE)
    data.frame(a = a[1], b = a[a != a[1]])
}))
rownames(result) <- NULL
result
#    a   b
#1 123 124
#2 678 679
#3 678 677
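Since the question starts from a tibble, the base R result can be converted back if desired:
# optional - return a tibble, as in the question
tibble::as_tibble(result)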

Related

Keeping one row and discarding others in R using specific criteria?

I'm working with the data frame below, which is just part of the full data, and I need to condense the duplicate numbers in the id column into one row. I want to preserve the row that has the highest sbp number, unless it's 300 or over, in which case I want to discard that too.
So for example, for the first three rows that have id as 13480, I want to keep the row that has 124 and discard the other two.
id,sex,visits,sbp
13480,M,2,124
13480,M,3,306
13480,M,4,116
13520,M,2,124
13520,M,3,116
13520,M,4,120
13580,M,2,NA
13580,M,3,124
This is the farthest I got, been trying to tweak this but not sure I'm on the right track:
maxsbp <- split(sbp, sbp$sbp)
r <- data.frame()
for (i in 1:length(maxsbp)) {
  one <- maxsbp[[i]]
  index <- which(one$sbp == max(one$sbp))
  select <- one[index, ]
  r <- rbind(r, select)
}
r1 <- r[!(sbp$sbp >= 300), ]
r1
I think a tidy solution would work quite well here. I would first filter out all values of 300 or above, since you do not want to keep any value at or beyond that threshold. Then group_by id, arrange in descending order, and keep the top row.
my.df <- data.frame("id"  = c(13480,13480,13480,13520,13520,13520,13580,13580),
                    "sex" = c("M","M","M","M","M","M","M","M"),
                    "sbp" = c(124,306,116,124,116,120,NA,124))
my.df %>%
  filter(sbp < 300) %>% # keep only values below 300
  group_by(id) %>%      # group by id
  arrange(-sbp) %>%     # arrange sbp in descending order
  top_n(1, sbp)         # retain the top value, i.e. the largest
# A tibble: 3 x 3
# Groups: id [3]
#      id sex     sbp
#   <dbl> <chr> <dbl>
#1  13480 M       124
#2  13520 M       124
#3  13580 M       124
In R, you will very rarely need explicit for loops for tasks like this.
There are functions available that perform such grouped operations for you.
For example, in base R you can use subset and ave:
subset(df, sbp == ave(sbp, id, FUN = function(x) max(x[x < 300], na.rm = TRUE)))
#     id sex visits sbp
#1 13480   M      2 124
#4 13520   M      2 124
#8 13580   M      3 124
The same can be done using dplyr, whose syntax is a little bit easier to understand.
library(dplyr)
df %>%
  group_by(id) %>%
  filter(sbp == max(sbp[sbp < 300], na.rm = TRUE))
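With dplyr 1.0.0 or later, slice_max() expresses the same grouped "keep the largest value" step; a sketch along those lines:
df %>%
  filter(sbp < 300) %>%                        # drop values of 300 or over (rows with NA sbp are dropped here too)
  group_by(id) %>%
  slice_max(sbp, n = 1, with_ties = FALSE) %>% # keep one row with the largest remaining sbp per id
  ungroup()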
slice_head can also be used
my.df <- data.frame("id"  = c(13480,13480,13480,13520,13520,13520,13580,13580),
                    "sex" = c("M","M","M","M","M","M","M","M"),
                    "sbp" = c(124,306,116,124,116,120,NA,124))
> my.df
     id sex sbp
1 13480   M 124
2 13480   M 306
3 13480   M 116
4 13520   M 124
5 13520   M 116
6 13520   M 120
7 13580   M  NA
8 13580   M 124
Proceed simply like this, filtering out values of 300 or more before slicing so that an id whose largest reading is above 300 (here 13480) is not dropped entirely:
my.df %>%
  filter(sbp < 300) %>%
  group_by(id, sex) %>%
  arrange(desc(sbp)) %>%
  slice_head()
# A tibble: 3 x 3
# Groups:   id, sex [3]
     id sex     sbp
  <dbl> <chr> <dbl>
1 13480 M       124
2 13520 M       124
3 13580 M       124

How to separate rows into columns based on variable number of pattern matches per row

I have a dataframe like this:
df <- data.frame(
  id = c("A", "B"),
  date = c("31/07/2019", "31/07/2020"),
  x = c('random stuff "A":88876, more stuff',
        'something, "A":1234, more "A":456, random "A":32078, more'),
  stringsAsFactors = FALSE
)
I'd like to create as many new columns as there are matches to a pattern; the pattern is (?<="A":)\\d+(?=,), i.e., "match the number if you see the string "A": on the left and the comma , on the right".
The problems: (i) the number of matches may vary from row to row and (ii) the maximum number of new columns is not known in advance.
What I've done so far is this:
df[paste("A", 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))), sep = "")] <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')
While 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))) may solve the problem of unknown number of new columns, I get a warning:
Warning message:
In `[<-.data.frame`(`*tmp*`, paste("A", 1:max(lengths(str_extract_all(df$x, :
  replacement element 2 has 3 rows to replace 2 rows
and the assignment of the values is clearly incorrect:
df
id date x A1 A2 A3
1 A 31/07/2019 random stuff "A":88876, more stuff 88876 1234 88876
2 B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more 88876 456 88876
The correct output would be this:
df
id date x A1 A2 A3
1 A 31/07/2019 random stuff "A":88876, more stuff 88876 NA NA
2 B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more 1234 456 32078
Any idea?
Here's a somewhat pedestrian stringr solution:
library(stringr)
library(dplyr)
matches <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')
ncols <- max(sapply(matches, length))
matches %>%
  lapply(function(y) c(y, rep(NA, ncols - length(y)))) %>%
  do.call(rbind, .) %>%
  data.frame() %>%
  setNames(paste0("A", seq(ncols))) %>%
  cbind(df, .) %>%
  tibble()
#> # A tibble: 2 x 6
#> id date x A1 A2 A3
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 A 31/07/20~ "random stuff \"A\":88876, more stuff" 88876 <NA> <NA>
#> 2 B 31/07/20~ "something, \"A\":1234, more \"A\":456, ran~ 1234 456 32078
Created on 2020-07-06 by the reprex package (v0.3.0)
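If tidyr (>= 1.0.0) is available, unnest_wider() can do the padding and the column naming in one step; a sketch of that alternative, reusing the same regex:
library(tidyr)
df %>%
  mutate(A = str_extract_all(x, '(?<="A":)\\d+(?=,)')) %>% # list-column with the matches of each row
  unnest_wider(A, names_sep = "")                          # spreads into A1, A2, A3, padded with NA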

Select variables with different names

I have some DF’s with different variable names, but they have the same content. Unfortunately, my files have no pattern, but I am now trying to standardize them. For example, I have these 4 DF’s and I would like to select only one variable:
KEY_WIN <- c(123, 456, 789)
COUNTRY <- c("USA", "FRANCE", "MEXICO")
DF1 <- data.frame(KEY_WIN, COUNTRY)

KEY_WINN <- c(12, 55, 889)
FOOD <- c("RICE", "TOMATO", "MANGO")
CAR <- c("BMW", "FERRARI", "TOYOTA")
DF2 <- data.frame(KEY_WINN, FOOD, CAR)

ID <- c(555, 698, 33)
CITY <- c("NYC", "LONDON", "PARIS")
DF3 <- data.frame(ID, CITY)

NUMBER <- c(3, 436, 1000)
OCEAN <- c("PACIFIC", "ATLANTIC", "INDIAN")
DF4 <- data.frame(NUMBER, OCEAN)
I would like to create a routine to select only the variables KEY_WIN, KEY_WINN, ID, NUMBER. My expected result would be:
DF_FINAL<- data.frame(KEY=c(123,456,789, 12,55,889, 555,698,33, 3,436,1000))
How would I select only those variables?
There are multiple ways I would imagine you could approach this.
First, you could put your data frames in a list:
listofDF <- list(DF1, DF2, DF3, DF4)
Then, you could bind_rows to add the data frames together, and then coalesce to merge into one column.
library(tidyverse)
bind_rows(listofDF) %>%
  mutate(KEY = coalesce(KEY_WIN, KEY_WINN, ID, NUMBER)) %>%
  select(KEY)
KEY
1 123
2 456
3 789
4 12
5 55
6 889
7 555
8 698
9 33
10 3
11 436
12 1000
If you knew that the first column was always your KEY column, you could simply do:
KEY = unlist(lapply(listofDF, "[[", 1))
This would extract the first column from all of your data frames:
[1] 123 456 789 12 55 889 555 698 33 3 436 1000
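If the key column is not always in the first position but its possible names are known, another option is to look it up by name. A sketch, where key_names is just an assumed helper vector of the candidate column names:
key_names <- c("KEY_WIN", "KEY_WINN", "ID", "NUMBER")
KEY <- unlist(lapply(listofDF, function(d) d[[intersect(names(d), key_names)[1]]]))
KEY  # same vector as above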

Splitting a column into multiple columns based on 2 conditions

I have a large dataframe and I would like to split a column into many columns based on two conditions: the caret character ^ and the letter following IMM-. Based on the data below, Column1 would be split into columns named IMM-A, IMM-B, IMM-C, and IMM-W. I tried the separate function, but it only works if you specify the column names, and because my data is not uniform I don't always know what the column names should be.
SampleId Column1
1 IMM-A*010306+IMM-A*0209^IMM-B*6900+IMM-B*779999^IMM-C*1212+IMM-C*3333
2 IMM-A*010306+IMM-A*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333
3 IMM-B*010306+IMM-B*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333
The expected output would be;
SampleId IMM-A                   IMM-B                   IMM-C                   IMM-W
1        IMM-A*010306+IMM-A*0209 IMM-B*6900+IMM-B*779999 IMM-C*1212+IMM-C*3333
2        IMM-A*010306+IMM-A*0209                         IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
3                                IMM-B*010306+IMM-B*0209 IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
Not clear about the expected output. Based on the description, we may need
library(tidyverse)
map(strsplit(df$Column1, "[*+^]"), ~
      stack(setNames(as.list(.x[c(FALSE, TRUE)]), .x[c(TRUE, FALSE)])) %>%
        group_by(ind) %>%
        mutate(rn = row_number()) %>%
        spread(ind, values)) %>%
  set_names(df$SampleId) %>%
  bind_rows(.id = 'SampleId') %>%
  select(-rn)
# A tibble: 6 x 5
#  SampleId `IMM-A` `IMM-B` `IMM-C` `IMM-W`
#  <chr>    <chr>   <chr>   <chr>   <chr>
#1 1        010306  6900    1212    <NA>
#2 1        0209    779999  3333    <NA>
#3 2        010306  <NA>    6900    1212
#4 2        0209    <NA>    779999  3333
#5 3        <NA>    010306  6900    1212
#6 3        <NA>    0209    779999  3333
Update
Based on the OP's expected output, we expand the data by splitting 'Column1' at the ^ delimiter, then separate 'Column1' into 'colA' and 'colB' at the * delimiter, remove 'colB', and spread to 'wide' format.
df %>%
  separate_rows(Column1, sep = "\\^") %>%
  separate(Column1, into = c("colA", "colB"), remove = FALSE, sep = "[*]") %>%
  select(-colB) %>%
  spread(colA, Column1, fill = "")
#  SampleId IMM-A                   IMM-B                   IMM-C                   IMM-W
#1 1        IMM-A*010306+IMM-A*0209 IMM-B*6900+IMM-B*779999 IMM-C*1212+IMM-C*3333
#2 2        IMM-A*010306+IMM-A*0209                         IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
#3 3                                IMM-B*010306+IMM-B*0209 IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
data
df <- structure(list(SampleId = 1:3, Column1 =
c("IMM-A*010306+IMM-A*0209^IMM-B*6900+IMM-B*779999^IMM-C*1212+IMM-C*3333",
"IMM-A*010306+IMM-A*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333",
"IMM-B*010306+IMM-B*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333"
)), class = "data.frame", row.names = c(NA, -3L))
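With newer tidyr, pivot_wider() (the successor of spread()) can be used in much the same way; a sketch that builds the column names by extracting the IMM-<letter> prefix with stringr:
library(tidyverse)
df %>%
  separate_rows(Column1, sep = "\\^") %>%
  mutate(name = str_extract(Column1, "^IMM-[A-Z]")) %>% # e.g. "IMM-A", used as the column name
  pivot_wider(names_from = name, values_from = Column1, values_fill = "")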

use lapply within ifelse and maintain column names

I have df1 sorted by date like this:
Date <- c("12/17/17","12/19/17","12/20/17","12/30/17","12/31/17","1/1/18")
Jon <- c(388,299,412,NA,NA,353)
Eric <- c(121,NA,321,473,832,NA)
Scott <- c(NA,122,NA,NA,NA,424)
df1 <- data.frame(Date,Jon,Eric,Scott)
df1$Date <- as.Date(df1$Date,format='%m/%d/%y')
#df1
Date Jon Eric Scott
1 12/17/17 388 121 NA
2 12/19/17 299 NA 122
3 12/20/17 412 321 NA
4 12/30/17 NA 473 NA
5 12/31/17 NA 832 NA
6 1/1/18 353 NA 424
I'm trying to create a new list that includes only the data that is within the last 12 days of each person's most recent date with a non-NA value. If there is only one non-NA value within 12 days of the person's most recent non-NA value, then I want to take the 2 most recent non-NA values for that person, even if one falls outside of the 12 day date range.
The code below successfully puts data within the last 12 days of each person's most recent non-NA value in a new list:
df2 <- lapply(df1[-1],function(x) x[which((m=tail(df1$Date[!is.na(x)],1)-df1$Date)>=0&m<=12)])
This code successfully takes the 2 most recent non-NA entries, regardless of whether or not it's within the 12 day range:
df3 <- lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2))
This code comes very close to doing what I want it to do, except it loses the column names. Notice that the column names are replaced with numbers, unlike the lapply statements above, which both keep the column names.
withinRange <-lapply(df1[-1],function(x)x[which((m=tail(df1$Date[!is.na(x)],1)-df1$Date)>=0&m<=12)]) %>%
lapply(function(x)length(x[!is.na(x)])) %>%
as.data.frame()
df4 <- ifelse(withinRange[colnames(df1[-1])]>1,lapply(df1[-1],function(x) x[which((m=tail(df1$Date[!is.na(x)],1)-df1$Date)>=0&m<=12)]),lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2)))
How can I maintain the column names?
I would approach this problem using the tidyverse packages.
Data
library(tidyr)
library(dplyr)
library(lubridate)
df <- tibble(
  my_date = as.Date(
    c("12/17/17", "12/19/17", "12/20/17", "12/30/17", "12/31/17", "1/1/18"),
    "%m/%d/%y"
  ),
  jon = c(388, 299, 412, NA, NA, 353),
  eric = c(121, NA, 321, 473, 832, NA),
  scott = c(NA, 122, NA, NA, NA, 424)
)
Long format data frame
This output feels more natural.
df_long <- df %>%
  gather(key, value, -my_date) %>%
  drop_na() %>%
  group_by(key) %>%
  mutate(
    in_date = if_else(my_date >= max(my_date) - days(12), TRUE, FALSE),
    count = sum(in_date)
  ) %>%
  filter(in_date | count < 2) %>%
  top_n(2, my_date) %>%
  ungroup() %>%
  select(-c(in_date, count))
df_long
# # A tibble: 6 x 3
# my_date key value
# <date> <chr> <dbl>
# 1 2017-12-20 jon 412
# 2 2018-01-01 jon 353
# 3 2017-12-30 eric 473
# 4 2017-12-31 eric 832
# 5 2017-12-19 scott 122
# 6 2018-01-01 scott 424
Wide format
Thankfully, it is only one additional step to spread to your original columns.
df_long %>% spread(key, value)
# # A tibble: 5 x 4
# my_date eric jon scott
# * <date> <dbl> <dbl> <dbl>
# 1 2017-12-19 NA NA 122
# 2 2017-12-20 NA 412 NA
# 3 2017-12-30 473 NA NA
# 4 2017-12-31 832 NA NA
# 5 2018-01-01 NA 353 424
Seems like the easiest thing to do for me is to store the column headers in a variable and then reattach them:
myHeaders <- names(df1[-1])
withinRange <-lapply(df1[-1],function(x)x[which((m=tail(df1$Date[!is.na(x)],1)-df1$Date)>=0&m<=12)]) %>%
lapply(function(x)length(x[!is.na(x)])) %>%
as.data.frame()
df4 <- ifelse(withinRange[colnames(df1[-1])]>1,lapply(df1[-1],function(x) x[which((m=tail(df1$Date[!is.na(x)],1)-df1$Date)>=0&m<=12)]),lapply(df1[-1], function(x) tail(x[!is.na(x)], n = 2)))
names(df4) <- myHeaders
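A related sketch that avoids the ifelse() step altogether: doing the count and the fallback inside a single lapply() keeps the names of df1[-1] automatically (same selection logic as above, just restated):
df4 <- lapply(df1[-1], function(x) {
  m <- tail(df1$Date[!is.na(x)], 1) - df1$Date # days between each date and the column's most recent non-NA date
  recent <- x[which(m >= 0 & m <= 12)]         # values within the 12-day window
  if (sum(!is.na(recent)) > 1) recent else tail(x[!is.na(x)], n = 2)
})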
