Using fill() with a conditional

library(tidyverse)
df <- tibble(X = c("A1", "A2", "A3", "A4", "A5", "A5", "A6", "A7", "A8", "A8", "A9", "A9"),
             Y = c(31, 52, 45, 86, NA, 50, 93, 85, 59, NA, 85, NA),
             Z = c(70, 64, 51, 38, 18, NA, 76, 54, NA, 69, NA, 96),
             D = c(1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2))
> df
# A tibble: 12 x 4
X Y Z D
<chr> <dbl> <dbl> <dbl>
1 A1 31 70 1
2 A2 52 64 1
3 A3 45 51 1
4 A4 86 38 1
5 A5 NA 18 2
6 A5 50 NA 2
7 A6 93 76 1
8 A7 85 54 1
9 A8 59 NA 2
10 A8 NA 69 2
11 A9 85 NA 2
12 A9 NA 96 2
The column X has duplicate values that sometimes repeat twice, and column D records those occurrences. Columns Y and Z hold scores. I want those scores to repeat within the duplicated observations in column X. I tried the fill() function; my output is below:
df %>%
  filter(D == 1) %>%
  bind_rows(df %>%
              filter(D != 1) %>%
              fill(c("Y", "Z"), .direction = "downup"))
# A tibble: 12 x 4
X Y Z D
<chr> <dbl> <dbl> <dbl>
1 A1 31 70 1
2 A2 52 64 1
3 A3 45 51 1
4 A4 86 38 1
5 A6 93 76 1
6 A7 85 54 1
7 A5 50 18 2
8 A5 50 18 2
9 A8 59 18 2
10 A8 59 69 2
11 A9 85 69 2
12 A9 85 96 2
However, whichever .direction option I use, I cannot get the correct numbers. For example, in the output above, Z for A9 should repeat 96 twice; the same issue occurs with A8.
My desired output is below:
X Y Z D
<chr> <dbl> <dbl> <dbl>
1 A1 31 70 1
2 A2 52 64 1
3 A3 45 51 1
4 A4 86 38 1
5 A6 93 76 1
6 A7 85 54 1
7 A5 50 18 2
8 A5 50 18 2
9 A8 59 69 2
10 A8 59 69 2
11 A9 85 96 2
12 A9 85 96 2

You could do:
library(tidyverse)
df %>%
  group_by(X) %>%
  mutate(across(Y:Z, ~ first(na.omit(.))))
Output:
# A tibble: 12 x 4
# Groups: X [9]
X Y Z D
<chr> <dbl> <dbl> <dbl>
1 A1 31 70 1
2 A2 52 64 1
3 A3 45 51 1
4 A4 86 38 1
5 A5 50 18 2
6 A5 50 18 2
7 A6 93 76 1
8 A7 85 54 1
9 A8 59 69 2
10 A8 59 69 2
11 A9 85 96 2
12 A9 85 96 2
You could also use fill like below, but in my experience this can be quite slow:
df %>%
  group_by(X) %>%
  fill(Y, Z, .direction = 'downup')

You can use group_by and mutate to replace each NA with the non-NA value from the other row in its group:
df %>%
  dplyr::group_by(X) %>%
  dplyr::mutate(
    Y = dplyr::case_when(
      is.na(Y) ~ Y[!is.na(Y)],
      TRUE ~ Y),
    Z = dplyr::case_when(
      is.na(Z) ~ Z[!is.na(Z)],
      TRUE ~ Z))
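A closely related variant, if you only want to fill the NAs and leave every existing value untouched: a minimal sketch using dplyr::coalesce (it assumes, as in the example, that each X group has at most one distinct non-NA score per column):
library(dplyr)
df %>%
  group_by(X) %>%
  # coalesce() keeps the existing value and falls back to the group's
  # first non-NA value only where the entry is NA
  mutate(across(Y:Z, ~ coalesce(.x, first(na.omit(.x))))) %>%
  ungroup()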


How to concatenate two pairs of columns by name with shifting rows, in a dataframe with multiple column pairs [duplicate]

I have this dataframe:
id a1 a2 b1 b2 c1 c2
<int> <int> <int> <int> <int> <int> <int>
1 1 83 33 55 33 85 86
2 2 37 0 60 98 51 0
3 3 97 71 85 8 44 40
4 4 51 6 43 15 55 57
5 5 28 53 62 73 70 9
df <- structure(list(id = 1:5, a1 = c(83L, 37L, 97L, 51L, 28L), a2 = c(33L,
0L, 71L, 6L, 53L), b1 = c(55L, 60L, 85L, 43L, 62L), b2 = c(33L,
98L, 8L, 15L, 73L), c1 = c(85L, 51L, 44L, 55L, 70L), c2 = c(86L,
0L, 40L, 57L, 9L)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
I want to combine columns that share the same starting character into a single column, interleaving the two columns row by row (each value from the second column goes one row below the matching value from the first), and name the new column after the shared character.
My desired output:
id a b c
<dbl> <dbl> <dbl> <dbl>
1 1 83 55 85
2 1 33 33 86
3 2 37 60 51
4 2 0 98 0
5 3 97 85 44
6 3 71 8 40
7 4 51 43 55
8 4 6 15 57
9 5 28 62 70
10 5 53 73 9
I have tried using the lag function, but I don't know how to combine and shift columns at the same time!
You can use the following solution (I have also modified your data set and added an id column):
library(tidyr)
df %>%
  pivot_longer(!id, names_to = c(".value", NA), names_pattern = "([[:alpha:]])(\\d)")
# A tibble: 10 x 4
id a b c
<int> <int> <int> <int>
1 1 83 55 85
2 1 33 33 86
3 2 37 60 51
4 2 0 98 0
5 3 97 85 44
6 3 71 8 40
7 4 51 43 55
8 4 6 15 57
9 5 28 62 70
10 5 53 73 9
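If the names_pattern regex is opaque: the first capture group becomes the output column name via ".value", and the NA entry in names_to discards the captured digit. Here is the same call, spelled out with comments:
library(tidyr)
df %>%
  pivot_longer(
    cols = !id,                           # reshape everything except id
    names_to = c(".value", NA),           # letter -> output column, digit -> dropped
    names_pattern = "([[:alpha:]])(\\d)"  # e.g. "a1" -> captures ("a", "1")
  )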
We can pivot_longer(), remove the digits from the names, then pivot_wider() and unnest():
library(stringr)
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -id) %>%
  mutate(name = str_remove(name, '[0-9]')) %>%
  pivot_wider(names_from = name) %>%
  unnest(everything())
# A tibble: 10 x 4
id a b c
<int> <int> <int> <int>
1 1 83 55 85
2 1 33 33 86
3 2 37 60 51
4 2 0 98 0
5 3 97 85 44
6 3 71 8 40
7 4 51 43 55
8 4 6 15 57
9 5 28 62 70
10 5 53 73 9
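Note that the pivot_wider() call above warns that values are not uniquely identified and returns list-columns, which unnest() then flattens. Passing values_fn = list makes that explicit and silences the warning; a minor variant of the same pipeline:
library(dplyr)
library(stringr)
library(tidyr)
df %>%
  pivot_longer(cols = -id) %>%
  mutate(name = str_remove(name, '[0-9]')) %>%
  pivot_wider(names_from = name, values_fn = list) %>%  # list-columns, no warning
  unnest(everything())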
Doing it as a pivot_longer() then a pivot_wider() is easier to read, but @Anoushiravan R's answer is more direct:
library(tidyverse)
df %>%
  # df already contains an id column, so there is no need to create one from row names
  pivot_longer(-id) %>%                                              # Make long
  mutate(order = str_sub(name, -1), name = str_sub(name, 1, 1)) %>%  # Break out the name column
  pivot_wider(names_from = name) %>%                                 # Make wide again
  select(-order)                                                     # Drop the ordering column
I think Anoushiravan's solution is the tidiest way to do it. We could also use {dplyover} (disclaimer) for this:
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
df %>%
  group_by(id) %>%
  summarise(across2(ends_with("1"),
                    ends_with("2"),
                    ~ c(.x, .y),
                    .names = "{pre}"))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 10 x 4
#> # Groups: id [5]
#> id a b c
#> <int> <int> <int> <int>
#> 1 1 83 55 85
#> 2 1 33 33 86
#> 3 2 37 60 51
#> 4 2 0 98 0
#> 5 3 97 85 44
#> 6 3 71 8 40
#> 7 4 51 43 55
#> 8 4 6 15 57
#> 9 5 28 62 70
#> 10 5 53 73 9
Created on 2021-07-28 by the reprex package (v0.3.0)

How to delete missing observations for a subset of columns: the R equivalent of dropna(subset) from python pandas

Consider a dataframe in R where I want to drop row 6 because it has missing observations for the variables var1:var3. But the dataframe has valid observations for id and year. See code below.
In Python, this can be done in two ways:
use df.dropna(subset = ['var1', 'var2', 'var3'], inplace=True)
use df.set_index(['id', 'year']).dropna()
How can this be done in R with the tidyverse?
library(tidyverse)
df <- tibble(id = seq(1, 10), year = seq(2001, 2010),
             var1 = sample(1:100, 10, replace = TRUE),
             var2 = sample(1:100, 10, replace = TRUE),
             var3 = sample(1:100, 10, replace = TRUE))
df[3, 4] = NA
df[6, 3:5] = NA
df[8, 3:4] = NA
df[10, 4:5] = NA
We may use complete.cases
library(dplyr)
df %>%
  filter(if_any(var1:var3, complete.cases))
Output:
# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 48 55 82
2 2 2002 22 83 67
3 3 2003 89 NA 19
4 4 2004 56 1 38
5 5 2005 17 58 35
6 7 2007 4 30 94
7 8 2008 NA NA 36
8 9 2009 97 100 80
9 10 2010 37 NA NA
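Here complete.cases() is applied to each of var1:var3 in turn, so the filter keeps any row where at least one of the three is non-NA. An equivalent, arguably more explicit spelling uses if_all:
library(dplyr)
# Keep rows where NOT all of var1:var3 are NA
df %>%
  filter(!if_all(var1:var3, is.na))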
We can also use pmap for this:
library(dplyr)
library(purrr)
df %>%
  filter(!pmap_lgl(., ~ {
    x <- c(...)[-c(1, 2)]  # drop id and year, keep var1:var3
    all(is.na(x))
  }))
# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 90 55 77
2 2 2002 77 5 18
3 3 2003 17 NA 70
4 4 2004 72 33 33
5 5 2005 10 55 77
6 7 2007 22 81 17
7 8 2008 NA NA 46
8 9 2009 93 28 100
9 10 2010 50 NA NA
Or we could use the complete.cases function inside pmap, as suggested by @akrun:
df %>%
  filter(pmap_lgl(select(., 3:5), ~ any(complete.cases(c(...)))))
You can use if_any in filter:
library(dplyr)
df %>% filter(if_any(var1:var3, Negate(is.na)))
# id year var1 var2 var3
# <int> <int> <int> <int> <int>
#1 1 2001 14 99 43
#2 2 2002 25 72 76
#3 3 2003 90 NA 15
#4 4 2004 91 7 32
#5 5 2005 69 42 7
#6 7 2007 57 83 41
#7 8 2008 NA NA 74
#8 9 2009 9 78 23
#9 10 2010 93 NA NA
In base R, we can use rowSums to select rows which have at least one non-NA value.
cols <- grep('var', names(df))
df[rowSums(!is.na(df[cols])) > 0, ]
If looking for complete cases, use the following (kernel of this is based on other answers):
library(tidyverse)
df <- tibble(id = seq(1, 10), year = seq(2001, 2010),
             var1 = sample(1:100, 10, replace = TRUE),
             var2 = sample(1:100, 10, replace = TRUE),
             var3 = sample(1:100, 10, replace = TRUE))
df[3, 4] = NA
df[6, 3:5] = NA
df[8, 3:4] = NA
df[10, 4:5] = NA
df %>% filter(!if_any(var1:var3, is.na))
#> # A tibble: 6 x 5
#> id year var1 var2 var3
#> <int> <int> <int> <int> <int>
#> 1 1 2001 13 28 26
#> 2 2 2002 61 77 58
#> 3 4 2004 95 38 58
#> 4 5 2005 38 34 91
#> 5 7 2007 85 46 14
#> 6 9 2009 45 60 40
Created on 2021-06-24 by the reprex package (v2.0.0)
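For completeness: if, like pandas' df.dropna(subset=...), you want to drop rows with an NA in any of the selected columns (rather than only rows where all of them are NA), tidyr has a dedicated verb for exactly that:
library(tidyr)
# Drops every row with an NA in any of var1:var3 (rows 3, 6, 8 and 10 here)
df %>%
  drop_na(var1:var3)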

R create new column based on data range at a certain time point

I have a large data frame (>50 columns). A sample of the relevant columns is below:
tb <- data.frame(RowID = c("A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10", "A11", "A12", "A13", "A14", "A15"),
                 Patient = c("001", "001", "001", "002", "002", "035", "035", "035", "035", "035", "100", "100", "105", "105", "105"),
                 Time = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3),
                 Value = c(NA, 10, 23, 100, 30, 10, 15, NA, 60, 56.7, 30, 51, 3, 13, 77))
I am trying to create a new column (Value_status) that classifies each patient's initial value as either low or high (Value < 50 or Value >= 50). That Value_status should then be carried through to the patient's other rows.
Here's what I have:
tb %>%
  group_by(Patient) %>%
  mutate(Value_status = if_else(Time == 1 & Value < 50, "low", "high"))
I thought I had solved it by adding group_by, but it doesn't give the same value for each individual patient as I hoped. I think I need to nest the if_else with more conditions, something like this:
Note: If a patient is missing Value at a time point other than 1, then they can still be grouped according to high/low.
tb %>%
  group_by(Patient) %>%
  mutate(Value_status = if_else(Time == 1 & Value < 50, "low",
                        if_else(Time == 1 & Value >= 50, "high",
                        if_else(#Apply the value from time point 1#))))
The output I am trying to get should look like this; patients are grouped by whether or not their baseline value is high:
RowID Patient Time Value Value_status
1 A1 001 1 NA <NA>
2 A2 001 2 10.0 <NA>
3 A3 001 3 23.0 <NA>
4 A4 002 1 100.0 high
5 A5 002 2 30.0 high
6 A6 035 1 10.0 low
7 A7 035 2 15.0 low
8 A8 035 3 NA low
9 A9 035 4 60.0 low
10 A10 035 5 56.7 low
11 A11 100 1 30.0 low
12 A12 100 2 51.0 low
13 A13 105 1 3.0 low
14 A14 105 2 13.0 low
15 A15 105 3 77.0 low
Instead of nested if_else calls, we could use case_when, which handles multiple conditions cleanly; then group_by 'Patient' and fill the NA elements of 'Value_status' with the previous non-NA value:
library(dplyr)
library(tidyr)
tb %>%
  mutate(Value_status = case_when(Time == 1 & Value < 50 ~ "low",
                                  Time == 1 & Value >= 50 ~ "high")) %>%
  group_by(Patient) %>%
  fill(Value_status) %>%
  ungroup()
Output:
# A tibble: 15 x 5
RowID Patient Time Value Value_status
<chr> <chr> <dbl> <dbl> <chr>
1 A1 001 1 NA <NA>
2 A2 001 2 10 <NA>
3 A3 001 3 23 <NA>
4 A4 002 1 100 high
5 A5 002 2 30 high
6 A6 035 1 10 low
7 A7 035 2 15 low
8 A8 035 3 NA low
9 A9 035 4 60 low
10 A10 035 5 56.7 low
11 A11 100 1 30 low
12 A12 100 2 51 low
13 A13 105 1 3 low
14 A14 105 2 13 low
15 A15 105 3 77 low
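A more compact grouped variant is also possible; a minimal sketch that computes the baseline directly (it assumes at most one Time == 1 row per patient):
library(dplyr)
tb %>%
  group_by(Patient) %>%
  # An NA baseline (e.g. patient 001, whose Time 1 Value is NA) yields an NA
  # status; first() also guards against a patient with no Time == 1 row
  mutate(Value_status = if_else(first(Value[Time == 1]) < 50, "low", "high")) %>%
  ungroup()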
Here is a solution with a nested ifelse:
tb %>%
  mutate(Value_status = ifelse(Time != 1 & Value == 10, "medium",
                        ifelse(Time == 1 & Value < 50, "low",
                        ifelse(Time == 1 & Value >= 50, "high", NA))))
Output:
RowID Patient Time Value Value_status
1 A1 001 1 NA <NA>
2 A2 001 2 10 medium
3 A3 001 3 23 <NA>
4 A4 002 1 100 high
5 A5 002 2 30 <NA>
6 A6 035 1 10 low
7 A7 035 2 15 <NA>
8 A8 035 3 NA <NA>
9 A9 035 4 60 <NA>
10 A10 035 5 57 <NA>
11 A11 100 1 30 low
12 A12 100 2 51 <NA>
13 A13 105 1 3 low
14 A14 105 2 13 <NA>
15 A15 105 3 77 <NA>

R: Dataframe Manipulation

I have the following dataframe, shown below:
ID  COUNT OF STOCK  YEAR
A1  10              2000
A1  20              2000
A1  18              2000
A1  15              2001
A1  30              2001
A2  35              2002
A2  50              2001
A2  10              2002
A2  22              2002
A3  11              2001
A3  15              2001
A3  28              2000
I would like to change the dataframe to the one shown below by grouping on ID and YEAR (the year is then used to compute the number of years from 2020) and summing COUNT OF STOCK:
ID  Sum of COUNT OF STOCK  number of years from 2020 (2020 - YEAR)
A1  48                     20
A1  45                     19
A2  67                     18
A2  50                     19
A3  26                     19
A3  28                     20
Thanks in advance!!
This is pretty straightforward. To work with those verbose column names you will have to backtick-quote them, though, which can be a nuisance.
dat %>%
  group_by(ID, YEAR) %>%
  summarise(
    `Sum of COUNT OF STOCK` = sum(`COUNT OF STOCK`),
    `number of years from 2020 (2020-year)` = 2020 - first(YEAR)
  ) %>%
  select(-YEAR)
Output:
ID `Sum of COUNT OF STOCK` `number of years from 2020 (2020-year)`
<chr> <int> <dbl>
1 A1 48 20
2 A1 45 19
3 A2 50 19
4 A2 67 18
5 A3 28 20
6 A3 26 19
Simply do this.
df %>%
  group_by(D, number_of_years = 2020 - YEAR) %>%
  summarise(Sum_of_stock = sum(COUNT_OF_STOCK))
# A tibble: 6 x 3
# Groups: D [3]
D number_of_years Sum_of_stock
<chr> <dbl> <int>
1 A1 19 45
2 A1 20 48
3 A2 18 67
4 A2 19 50
5 A3 19 26
6 A3 20 28
data
df <- read.table(text = "D COUNT_OF_STOCK YEAR
A1 10 2000
A1 20 2000
A1 18 2000
A1 15 2001
A1 30 2001
A2 35 2002
A2 50 2001
A2 10 2002
A2 22 2002
A3 11 2001
A3 15 2001
A3 28 2000", header = T)
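For reference, the same aggregation is a few lines of base R as well; a minimal sketch using the df from the data block above:
# Grouped sum via aggregate(), then the year offset as plain arithmetic
res <- aggregate(COUNT_OF_STOCK ~ D + YEAR, data = df, FUN = sum)
res$number_of_years <- 2020 - res$YEAR
res[order(res$D, res$number_of_years), c("D", "number_of_years", "COUNT_OF_STOCK")]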

Summarize Table based on a Threshold

It might be a very simple problem, but I failed to solve it with the dplyr functions I know. Here's the data:
tab1 <- read.table(header=TRUE, text="
Col1 A1 A2 A3 A4 A5
ID1 43 52 33 25 59
ID2 27 41 20 71 22
ID3 37 76 36 27 44
ID4 23 71 62 25 63
")
tab1
Col1 A1 A2 A3 A4 A5
1 ID1 43 52 33 25 59
2 ID2 27 41 20 71 22
3 ID3 37 76 36 27 44
4 ID4 23 71 62 25 63
I intend to get a long-format table like the following, keeping only values lower than 30:
Col1 Col2 Val
ID1 A4 25
ID2 A1 27
ID2 A3 20
ID2 A5 22
ID3 A4 27
ID4 A1 23
ID4 A4 25
Or, if you insist on dplyr-ness, you can gather the data first and then filter as desired:
library(dplyr)
library(tidyr)
tab1 %>%
  gather(Col2, Val, -Col1) %>%
  filter(Val < 30)
# Col1 Col2 Val
# 1 ID2 A1 27
# 2 ID4 A1 23
# 3 ID2 A3 20
# 4 ID1 A4 25
# 5 ID3 A4 27
# 6 ID4 A4 25
# 7 ID2 A5 22
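Note that gather() is superseded in current tidyr; the same result with its replacement, pivot_longer():
library(dplyr)
library(tidyr)
tab1 %>%
  pivot_longer(-Col1, names_to = "Col2", values_to = "Val") %>%
  filter(Val < 30)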
Use the reshape2 package with melt:
library(reshape2)
tab2 <- melt(tab1)
tab2[tab2$value < 30, ]
Output:
Col1 variable value
2 ID2 A1 27
4 ID4 A1 23
10 ID2 A3 20
13 ID1 A4 25
15 ID3 A4 27
16 ID4 A4 25
18 ID2 A5 22
Using base R:
# Exclude the character column Col1 before apply(); otherwise the data
# frame is coerced to a character matrix and y < 30 compares strings
x <- apply(tab1[-1], 1, function(y) y[y < 30])
data.frame(Col1 = rep(tab1$Col1, sapply(x, length)),
           Col2 = names(unlist(x)),
           Val = unlist(x))
Col1 Col2 Val
1 ID1 A4 25
2 ID2 A1 27
3 ID2 A3 20
4 ID2 A5 22
5 ID3 A4 27
6 ID4 A1 23
7 ID4 A4 25
