Summarize Table based on a Threshold - r

This might be a very simple problem, but I failed to solve it with the dplyr functions I know. Here's the data:
tab1 <- read.table(header=TRUE, text="
Col1 A1 A2 A3 A4 A5
ID1 43 52 33 25 59
ID2 27 41 20 71 22
ID3 37 76 36 27 44
ID4 23 71 62 25 63
")
tab1
Col1 A1 A2 A3 A4 A5
1 ID1 43 52 33 25 59
2 ID2 27 41 20 71 22
3 ID3 37 76 36 27 44
4 ID4 23 71 62 25 63
I intend to get a contingency table like the following by keeping values lower than 30.
Col1 Col2 Val
ID1 A4 25
ID2 A1 27
ID2 A3 20
ID2 A5 22
ID3 A4 27
ID4 A1 23
ID4 A4 25

Or if you insist on dplyrness, you can gather the data first and then filter as desired
library(dplyr)
library(tidyr)
tab1 %>%
gather(Col2, Val, -Col1) %>%
filter(Val < 30)
# Col1 Col2 Val
# 1 ID2 A1 27
# 2 ID4 A1 23
# 3 ID2 A3 20
# 4 ID1 A4 25
# 5 ID3 A4 27
# 6 ID4 A4 25
# 7 ID2 A5 22
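As a side note, gather() is superseded in current tidyr releases in favour of pivot_longer(). A minimal sketch of the same idea, assuming tidyr 1.0.0 or later; the name long_below_30 is only illustrative. Because pivot_longer() walks the data row by row, the result comes out in the ID order shown in the question.

```r
library(dplyr)
library(tidyr)

tab1 <- read.table(header = TRUE, text = "
Col1 A1 A2 A3 A4 A5
ID1 43 52 33 25 59
ID2 27 41 20 71 22
ID3 37 76 36 27 44
ID4 23 71 62 25 63
")

# Reshape every column except Col1 into (Col2, Val) pairs,
# then keep only the values below the threshold.
long_below_30 <- tab1 %>%
  pivot_longer(-Col1, names_to = "Col2", values_to = "Val") %>%
  filter(Val < 30)

print(long_below_30)
```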

Use the reshape2 package with melt:
library(reshape2)
tab2 = melt(tab1)
tab2[tab2$value < 30,]
output:
Col1 variable value
2 ID2 A1 27
4 ID4 A1 23
10 ID2 A3 20
13 ID1 A4 25
15 ID3 A4 27
16 ID4 A4 25
18 ID2 A5 22

Using base R:
# drop the ID column first so that y < 30 is a numeric comparison, not a string comparison
x<-apply(tab1[-1], 1, function(y)y[y<30])
data.frame(Col1 = rep(tab1$Col1, sapply(x, length)),
Col2 = names(unlist(x)),
Val = unlist(x))
Col1 Col2 Val
1 ID1 A4 25
2 ID2 A1 27
3 ID2 A3 20
4 ID2 A5 22
5 ID3 A4 27
6 ID4 A1 23
7 ID4 A4 25
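Another base R possibility, sketched here under the assumption that every column except Col1 is numeric, is to locate the matching cells with which(arr.ind = TRUE) rather than looping over rows. The names m, hits and below_30 are just illustrative.

```r
tab1 <- read.table(header = TRUE, text = "
Col1 A1 A2 A3 A4 A5
ID1 43 52 33 25 59
ID2 27 41 20 71 22
ID3 37 76 36 27 44
ID4 23 71 62 25 63
")

m <- as.matrix(tab1[-1])                # numeric columns only
hits <- which(m < 30, arr.ind = TRUE)   # one (row, col) pair per matching cell
# which() reports matches column by column; reorder them row by row
hits <- hits[order(hits[, "row"], hits[, "col"]), , drop = FALSE]

below_30 <- data.frame(
  Col1 = tab1$Col1[hits[, "row"]],
  Col2 = colnames(m)[hits[, "col"]],
  Val  = m[hits]                        # matrix indexing by (row, col) pairs
)
print(below_30)
```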

Related

R: Dataframe Manipulation

I have the following dataframe:
ID  COUNT OF STOCK  YEAR
A1  10              2000
A1  20              2000
A1  18              2000
A1  15              2001
A1  30              2001
A2  35              2002
A2  50              2001
A2  10              2002
A2  22              2002
A3  11              2001
A3  15              2001
A3  28              2000
I would like to change the dataframe into the one shown below by grouping on ID and YEAR and summing the count of stock within each group. YEAR is then used to compute the number of years from 2020.
ID  Sum of COUNT OF STOCK  number of years from 2020 (2020-year)
A1  48                     20
A1  45                     19
A2  67                     18
A2  50                     19
A3  26                     19
A3  28                     20
Thanks in advance!!
This is pretty straightforward. To work with those verbose column names you will have to quote them in backticks, though, which might be a challenge.
dat %>% group_by( ID, YEAR ) %>%
summarise(
`Sum of COUNT OF STOCK` = sum( `COUNT OF STOCK` ),
`number of years from 2020 (2020-year)` = 2020 - first(YEAR)
) %>% select( -YEAR )
Output:
ID `Sum of COUNT OF STOCK` `number of years from 2020 (2020-year)`
<chr> <int> <dbl>
1 A1 48 20
2 A1 45 19
3 A2 50 19
4 A2 67 18
5 A3 28 20
6 A3 26 19
Simply do this.
df %>% group_by(D, number_of_years = 2020 - YEAR) %>%
summarise(Sum_of_stock = sum(COUNT_OF_STOCK))
# A tibble: 6 x 3
# Groups: D [3]
D number_of_years Sum_of_stock
<chr> <dbl> <int>
1 A1 19 45
2 A1 20 48
3 A2 18 67
4 A2 19 50
5 A3 19 26
6 A3 20 28
data
df <- read.table(text = "D COUNT_OF_STOCK YEAR
A1 10 2000
A1 20 2000
A1 18 2000
A1 15 2001
A1 30 2001
A2 35 2002
A2 50 2001
A2 10 2002
A2 22 2002
A3 11 2001
A3 15 2001
A3 28 2000", header = T)
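If you would rather avoid dplyr, the same grouping can be sketched in base R with aggregate(). This assumes the df defined just above; the name agg is only illustrative.

```r
df <- read.table(text = "D COUNT_OF_STOCK YEAR
A1 10 2000
A1 20 2000
A1 18 2000
A1 15 2001
A1 30 2001
A2 35 2002
A2 50 2001
A2 10 2002
A2 22 2002
A3 11 2001
A3 15 2001
A3 28 2000", header = TRUE)

# Sum the stock counts within each ID/year pair.
agg <- aggregate(COUNT_OF_STOCK ~ D + YEAR, data = df, FUN = sum)

# Derive the distance from 2020, then order by ID and year and drop YEAR.
agg$number_of_years <- 2020 - agg$YEAR
agg <- agg[order(agg$D, agg$YEAR), c("D", "COUNT_OF_STOCK", "number_of_years")]
print(agg)
```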

Split up a dataframe by number of NAs in each row

Consider a dataframe made up of thousands of rows and columns that includes several NAs. I'd like to split this dataframe into smaller ones based on the number of NAs in each row. All rows that contain the same number of NAs, if there are any, should be in the same group. The new data frames are then saved separately.
> DF
ID C1 C2 C3 C4 C5
aa 12 13 10 NA 12
ff 12 NA NA 23 13
ee 67 23 NA NA 21
jj 31 14 NA 41 11
ss NA 15 11 12 11
The desired output will be:
> DF_chunk_1
ID C1 C2 C3 C4 C5
aa 12 13 10 NA 12
jj 31 14 NA 41 11
ss NA 15 11 12 11
> DF_chunk_2
ID C1 C2 C3 C4 C5
ff 12 NA NA 23 13
ee 67 23 NA NA 21
I appreciate any suggestion.
Try this, which follows some useful comments. You can use split() with apply() to build the grouping variable:
#Code
new <- split(DF,apply(DF[,-1],1,function(x)sum(is.na(x))))
Output:
$`1`
ID C1 C2 C3 C4 C5
1 aa 12 13 10 NA 12
4 jj 31 14 NA 41 11
5 ss NA 15 11 12 11
$`2`
ID C1 C2 C3 C4 C5
2 ff 12 NA NA 23 13
3 ee 67 23 NA NA 21
A more practical way (Many thanks and credits to @RuiBarradas):
#Code2
new <- split(DF, rowSums(is.na(DF[-1])))
Same output.
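The question also asks for the pieces to be saved separately. A rough sketch of that step, assuming CSV output is acceptable; the output directory (tempdir() here) and the DF_chunk_<n>.csv naming are assumptions for illustration, not something from the question.

```r
DF <- read.table(header = TRUE, text = "
ID C1 C2 C3 C4 C5
aa 12 13 10 NA 12
ff 12 NA NA 23 13
ee 67 23 NA NA 21
jj 31 14 NA 41 11
ss NA 15 11 12 11
")

# One data frame per distinct NA count; list names are the counts ("1", "2", ...).
chunks <- split(DF, rowSums(is.na(DF[-1])))

# Hypothetical output location and file naming.
out_dir <- tempdir()
paths <- file.path(out_dir, paste0("DF_chunk_", names(chunks), ".csv"))

# Write each chunk to its own file.
for (i in seq_along(chunks)) {
  write.csv(chunks[[i]], paths[i], row.names = FALSE)
}
```

If you want the chunks as separate objects in the workspace instead, list2env() on a suitably named list would be one option, though keeping them in the list is usually easier to work with.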

r - Using fill() with a conditional

library(tidyverse)
df <- tibble(X = c("A1", "A2", "A3", "A4", "A5", "A5", "A6", "A7", "A8", "A8", "A9", "A9"),
Y = c(31, 52, 45, 86, NA, 50, 93, 85, 59, NA, 85, NA),
Z = c(70, 64, 51, 38, 18, NA, 76, 54, NA, 69, NA, 96),
D = c(1,1,1,1,2,2,1,1,2,2,2,2))
> df
# A tibble: 12 x 4
X Y Z D
<chr> <dbl> <dbl> <dbl>
1 A1 31 70 1
2 A2 52 64 1
3 A3 45 51 1
4 A4 86 38 1
5 A5 NA 18 2
6 A5 50 NA 2
7 A6 93 76 1
8 A7 85 54 1
9 A8 59 NA 2
10 A8 NA 69 2
11 A9 85 NA 2
12 A9 NA 96 2
Column X has duplicate values that sometimes appear twice. Column D counts those occurrences. Columns Y and Z hold some scores. I want those scores to repeat across the duplicated observations in column X. I tried the fill() method, and my output is below
df %>%
filter(D == 1) %>%
bind_rows(df %>%
filter(D != 1) %>%
fill(c("Y", "Z"), .direction = "downup")
)
# A tibble: 12 x 4
X Y Z D
<chr> <dbl> <dbl> <dbl>
1 A1 31 70 1
2 A2 52 64 1
3 A3 45 51 1
4 A4 86 38 1
5 A6 93 76 1
6 A7 85 54 1
7 A5 50 18 2
8 A5 50 18 2
9 A8 59 18 2
10 A8 59 69 2
11 A9 85 69 2
12 A9 85 96 2
However, whatever .direction option I use, I cannot seem to get correct numbers. For example in the above output, for A9, Z should be repeating 96 twice. Same issue is with A8.
My desired output is below
X Y Z D
<chr> <dbl> <dbl> <dbl>
1 A1 31 70 1
2 A2 52 64 1
3 A3 45 51 1
4 A4 86 38 1
5 A6 93 76 1
6 A7 85 54 1
7 A5 50 18 2
8 A5 50 18 2
9 A8 59 69 2
10 A8 59 69 2
11 A9 85 96 2
12 A9 85 96 2
You could do:
library(tidyverse)
df %>%
group_by(X) %>%
mutate(across(Y:Z, ~ first(na.omit(.))))
Output:
# A tibble: 12 x 4
# Groups: X [9]
X Y Z D
<chr> <dbl> <dbl> <dbl>
1 A1 31 70 1
2 A2 52 64 1
3 A3 45 51 1
4 A4 86 38 1
5 A5 50 18 2
6 A5 50 18 2
7 A6 93 76 1
8 A7 85 54 1
9 A8 59 69 2
10 A8 59 69 2
11 A9 85 96 2
12 A9 85 96 2
You could also use fill like below, but in my experience this can be quite slow:
df %>%
group_by(X) %>%
fill(Y, Z, .direction = 'downup')
You can use group_by and mutate to change the values of the NAs to the other one in the group
df %>%
dplyr::group_by(X) %>%
dplyr::mutate(
Y = dplyr::case_when(
is.na(Y) ~ Y[!is.na(Y)],
TRUE ~ Y),
Z = dplyr::case_when(
is.na(Z) ~ Z[!is.na(Z)],
TRUE ~ Z))
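For completeness, a base R sketch of the same per-group fill using ave(). The helper name first_non_na is hypothetical, and the sketch assumes that taking the first non-missing value in each X group is what you want.

```r
df <- data.frame(
  X = c("A1", "A2", "A3", "A4", "A5", "A5", "A6", "A7", "A8", "A8", "A9", "A9"),
  Y = c(31, 52, 45, 86, NA, 50, 93, 85, 59, NA, 85, NA),
  Z = c(70, 64, 51, 38, 18, NA, 76, 54, NA, 69, NA, 96),
  D = c(1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2)
)

# Hypothetical helper: repeat the first non-NA value across the whole group.
first_non_na <- function(v) rep(v[!is.na(v)][1], length(v))

# ave() applies the helper within each level of X and keeps the row order.
df$Y <- ave(df$Y, df$X, FUN = first_non_na)
df$Z <- ave(df$Z, df$X, FUN = first_non_na)
print(df)
```

Unlike the bind_rows() attempt in the question, this leaves the rows in their original order.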

Ranking data that have the same values [duplicate]

I have a large data set that includes a column of counts for different genetic markers. I want to generate an overall ranking based on the count, regardless of the genetic marker. For instance, if 2 or more genetic markers all have a count of 5, they should all get the same rank number. I want the rank numbers displayed in a separate column. I have this dataframe:
SNP count
a1 26
a2 18
a3 16
a4 15
a5 14
a6 14
a7 14
a8 15
a9 13
a10 12
a11 12
a12 11
a13 10
a14 9
a15 8
I want the output to be:
SNP count rank
a1 26 1
a2 18 2
a3 16 3
a4 15 4
a8 15 4
a5 14 5
a6 14 5
a7 14 5
a9 13 7
a10 12 8
a11 12 8
a12 11 9
a13 10 10
a14 9 11
a15 8 12
Note that SNPs a4 and a8 have the same count, as do a5, a6 and a7, and a10 and a11. I've tried
transform(df, x= ave(count,FUN=function(x) order(x,decreasing=T)))
but it's not what I want
What you are looking for is the rleid function from the data.table package.
data.table::rleid(df$count)
[1] 1 2 3 4 5 5 5 6 7 8 8 9 10 11 12
df is obtained like so:
df <- read.table(text ="SNP count
a1 26
a2 18
a3 16
a4 15
a5 14
a6 14
a7 14
a8 15
a9 13
a10 12
a11 12
a12 11
a13 10
a14 9
a15 8",
stringsAsFactors =FALSE,
header = TRUE)
And for thoroughness:
df$rank <- data.table::rleid(df$count)
df
SNP count rank
1 a1 26 1
2 a2 18 2
3 a3 16 3
4 a4 15 4
5 a5 14 5
6 a6 14 5
7 a7 14 5
8 a8 15 6
9 a9 13 7
10 a10 12 8
11 a11 12 8
12 a12 11 9
13 a13 10 10
14 a14 9 11
15 a15 8 12
Edit:
Thanks to @Frank, a better solution would be to sort the data frame by count before applying rleid:
library(data.table)
setDT(df)[order(-count), rank := rleid(count)]
Which gives:
df
SNP count rank
1: a1 26 1
2: a2 18 2
3: a3 16 3
4: a4 15 4
5: a5 14 5
6: a6 14 5
7: a7 14 5
8: a8 15 4
9: a9 13 6
10: a10 12 7
11: a11 12 7
12: a12 11 8
13: a13 10 9
14: a14 9 10
15: a15 8 11
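If you would rather not depend on data.table, the same dense ranking can be sketched in base R by matching each count against the distinct counts sorted from largest to smallest. This does not need the data to be sorted first. (dplyr::dense_rank(dplyr::desc(df$count)) should give the same numbers if dplyr is already loaded.)

```r
df <- read.table(text = "SNP count
a1 26
a2 18
a3 16
a4 15
a5 14
a6 14
a7 14
a8 15
a9 13
a10 12
a11 12
a12 11
a13 10
a14 9
a15 8",
stringsAsFactors = FALSE,
header = TRUE)

# Rank = position of each count among the distinct counts, largest first,
# so tied counts share a rank and no rank numbers are skipped.
df$rank <- match(df$count, sort(unique(df$count), decreasing = TRUE))
print(df)
```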

R: Calculating New Variable R Code

I have
id_1 id_2 name count total
1 001 111 a 15
2 001 111 b 3
3 001 111 sum 28 28
4 002 111 a 7
5 002 111 b 33
6 002 111 sum 48 48
I want the rows that share the same id_1 and id_2 to share the total, like
id_1 id_2 name count total
1 001 111 a 15 28
2 001 111 b 3 28
3 001 111 sum 28 28
4 002 111 a 7 48
5 002 111 b 33 48
6 002 111 sum 48 48
We can use fill from tidyr.
library(tidyr)
dat2 <- dat %>% fill(total, .direction = "up")
dat2
# id_1 id_2 name count total
# 1 1 111 a 15 28
# 2 1 111 b 3 28
# 3 1 111 sum 28 28
# 4 2 111 a 7 48
# 5 2 111 b 33 48
# 6 2 111 sum 48 48
DATA
dat <- read.table(text = " id_1 id_2 name count total
1 001 111 a 15 NA
2 001 111 b 3 NA
3 001 111 sum 28 28
4 002 111 a 7 NA
5 002 111 b 33 NA
6 002 111 sum 48 48",
header = TRUE, stringsAsFactors = FALSE)
Consider base R's ave calculating group max (na.rm to handle NA):
df$total <- ave(df$total, df$id_1, df$id_2, FUN=function(i) max(i, na.rm=TRUE))
df
# id_1 id_2 name count total
# 1 1 111 a 15 28
# 2 1 111 b 3 28
# 3 1 111 sum 28 28
# 4 2 111 a 7 48
# 5 2 111 b 33 48
# 6 2 111 sum 48 48
Using zoo and data.table:
df <- read.table(text = "id_1 id_2 name count total
001 111 a 15 NA
001 111 b 3 NA
001 111 sum 28 28
002 111 a 7 NA
002 111 b 33 NA
002 111 sum 48 48",
header = TRUE, stringsAsFactors = FALSE) # create data
# load packages
library(zoo)
library(data.table)
# convert df to data.table and carry total forward and backward within each id pair
setDT(df)[, total := na.locf(na.locf(total, na.rm=FALSE), na.rm=FALSE, fromLast=TRUE), by = c("id_1", "id_2")]
Output:
id_1 id_2 name count total
1: 1 111 a 15 28
2: 1 111 b 3 28
3: 1 111 sum 28 28
4: 2 111 a 7 48
5: 2 111 b 33 48
6: 2 111 sum 48 48
Simple approach using the normal dplyr way:
dat %>% group_by(id_1, id_2) %>% mutate(total=count[name == "sum"])
Alternatively:
dat %>% group_by(id_1, id_2) %>% mutate(total=na.omit(total)[1])
id_1 id_2 name count total
<int> <int> <chr> <int> <int>
1 1 111 a 15 28
2 1 111 b 3 28
3 1 111 sum 28 28
4 2 111 a 7 48
5 2 111 b 33 48
6 2 111 sum 48 48
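One caveat about the plain fill(total, .direction = "up") answer: it relies on the "sum" row being the last row of each id pair. A hedged sketch of a grouped variant that should not depend on row order; the rows below are deliberately shuffled for illustration, which is my assumption and not the layout in the question.

```r
library(dplyr)
library(tidyr)

# Same values as in the question, but with the "sum" rows moved around.
dat <- read.table(text = "id_1 id_2 name count total
001 111 sum 28 28
001 111 a 15 NA
001 111 b 3 NA
002 111 a 7 NA
002 111 sum 48 48
002 111 b 33 NA",
header = TRUE, stringsAsFactors = FALSE)

# Fill within each (id_1, id_2) group, first downwards and then upwards,
# so the single known total reaches every row of its group.
filled <- dat %>%
  group_by(id_1, id_2) %>%
  fill(total, .direction = "downup") %>%
  ungroup()

print(filled)
```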

Resources