Frequency count, grouped by two columns in R

How can I count the frequencies of values occurring in two columns?
Sample data:
> sample <- dput(df)
structure(list(Nom_xp = c("A1FAA", "A1FAJ", "A1FBB", "A1FJA",
"A1FJR", "A1FRJ"), GB05.x = c(100L, 98L, NA, 100L, 102L, 98L),
GB05.1.x = c(100L, 106L, NA, 100L, 102L, 98L), GB18.x = c(175L,
173L, 177L, 177L, 173L, 177L), GB18.1.x = c(177L, 175L, 177L,
177L, 177L, 177L)), .Names = c("Nom_xp", "GB05.x", "GB05.1.x",
"GB18.x", "GB18.1.x"), row.names = c(NA, 6L), class = "data.frame")
Counting frequencies:
apply(sample[,2:5],2,table)
Now, how can I combine the counts by column-name prefix, or for every two columns? The expected output for the first four columns would be a list:
$GB05
98 100 102 106
3 4 2 1
$GB18
173 175 177
2 2 8
One way to get the counts for the first two columns:
table(c(apply(sample[,2:3],2,rbind)))
98 100 102 106
3 4 2 1
But how can I apply this to the whole data.frame?

If you want to apply table to your whole data frame, you can use:
table(unlist(sample[,-1]))
Which gives:
98 100 102 106 173 175 177
3 4 2 1 2 2 8
If you want to group by column name prefix, for example the first 4 characters, you can do something like this:
cols <- names(sample)[-1]
groups <- unique(substr(cols, 1, 4))
sapply(groups, function(name) table(unlist(sample[,grepl(paste0("^",name),names(sample))])))
Which gives:
$GB05
98 100 102 106
3 4 2 1
$GB18
173 175 177
2 2 8

I would've said juba's answer was correct, but given that you're looking for something else, perhaps it's this?
library(reshape2)
x <- melt( sample[ , 2:5 ] )
table( x[ , c( 'variable' , 'value' ) ] )
which gives
          value
variable    98 100 102 106 173 175 177
  GB05.x     2   2   1   0   0   0   0
  GB05.1.x   1   2   1   1   0   0   0
  GB18.x     0   0   0   0   2   1   3
  GB18.1.x   0   0   0   0   0   1   5
Please provide an example of your desired output structure :)

Here is another answer that is sort of a hybrid between Anthony's answer and juba's answer.
The first step is to convert the data.frame into a "long" data.frame. I generally use stack when I can, but you can also do library(reshape2); df2 <- melt(df) to get output similar to my df2 object.
df2 <- data.frame(df[1], stack(df[-1]))
head(df2)
# Nom_xp values ind
# 1 A1FAA 100 GB05.x
# 2 A1FAJ 98 GB05.x
# 3 A1FBB NA GB05.x
# 4 A1FJA 100 GB05.x
# 5 A1FJR 102 GB05.x
# 6 A1FRJ 98 GB05.x
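For completeness, here is a minimal sketch of the reshape2 alternative mentioned above (df2_melt is just an illustrative name); melt labels the stacked columns variable/value instead of ind/values, but the content is the same:
library(reshape2)
# keep Nom_xp as the id column; the measured columns are stacked into variable/value
df2_melt <- melt(df, id.vars = "Nom_xp")
head(df2_melt)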
Next, we need to know the unique values of ind. juba did that with substr, but I've done it here with gsub and a regular expression. We don't need to add that into our data.frame; we can call it directly in our other functions. The two functions which immediately come to mind are by and tapply, and both give you the output you are looking for.
by(df2$values,
list(ind = gsub("([A-Z0-9]+)\\..*", "\\1", df2$ind)),
FUN=table)
# ind: GB05
#
# 98 100 102 106
# 3 4 2 1
# ------------------------------------------------------------------------------
# ind: GB18
#
# 173 175 177
# 2 2 8
tapply(df2$values, gsub("([A-Z0-9]+)\\..*", "\\1", df2$ind), FUN = table)
# $GB05
#
# 98 100 102 106
# 3 4 2 1
#
# $GB18
#
# 173 175 177
# 2 2 8

Related

Calculate Ratio of multiple pairs of columns

I have a table in R like this one:
id v1 v2 v3
1 115 116 150
2 47 50 55
3 70 77 77
I would like to calculate the ratio between v2 and v1 as (v2/v1)-1, between v3 and v2 as (v3/v2)-1, and so on (I have around 55 variables), and need to get values like this:
id v1 v2 v3 rat1 rat2
1 115 116 150 0.01 0.29
2 47 50 55 0.06 0.10
3 70 77 77 0.10 0.00
Is there a workaround so I don't have to code each pair independently?
Thanks!
It's essentially a loop over column i and column i+1, which you could write a for loop for. Or, in R speak, use a vectorised function like Map/mapply:
vars <- paste0("v",1:3)
outs <- paste0("rat",1:2)
dat[outs] <- mapply(`/`, dat[vars[-1]], dat[vars[-length(vars)]]) - 1
dat
# id v1 v2 v3 rat1 rat2
#1 1 115 116 150 0.008695652 0.2931034
#2 2 47 50 55 0.063829787 0.1000000
#3 3 70 77 77 0.100000000 0.0000000
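For reference, a plain for-loop version of the same column i / column i+1 idea (just a sketch, reusing the vars vector defined above):
# loop over adjacent pairs of columns and create rat1, rat2, ...
for (i in seq_len(length(vars) - 1)) {
  dat[[paste0("rat", i)]] <- dat[[vars[i + 1]]] / dat[[vars[i]]] - 1
}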
As we remove an equal number of columns from the beginning and the end (with 'id' in common), the two subsets have the same dimensions, so we can do the division directly with /:
dat[paste0("rat", 1:2)] <- dat[-(1:2)]/dat[-c(1, ncol(dat))] - 1
data
dat <- structure(list(id = 1:3, v1 = c(115L, 47L, 70L), v2 = c(116L,
50L, 77L), v3 = c(150L, 55L, 77L)), class = "data.frame", row.names = c(NA,
-3L))
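As a quick sanity check (a sketch, run on the freshly created four-column dat above and reusing vars from the first answer), both approaches should produce the same ratios:
m1 <- mapply(`/`, dat[vars[-1]], dat[vars[-length(vars)]]) - 1
m2 <- as.matrix(dat[-(1:2)] / dat[-c(1, ncol(dat))] - 1)
all.equal(m1, m2, check.attributes = FALSE)  # should be TRUE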

R: Create a new column that identifies whether a row is the last entry of a given type for a user

I'm trying to create a new column, presumably using mutate, that will identify whether the row meets a few criteria. Basically, for each user, I want to identify the final row (by Time) for a certain DataCode. Only some DataCodes are applicable (1000 and 2000 in the example below), and others should return NA (3000 here). I've been trying to work this through in my head, and all I can think of is a really long mutate call with a number of if statements. Is there a more elegant way?
The IsFinal column below demonstrates what the result would be.
User Time DataCode Data IsFinal
101 10 1000 50 0
101 20 2000 300 1
101 30 3000 150 NA
101 40 1000 250 1
101 50 3000 300 NA
102 10 2000 50 0
102 20 1000 150 0
102 30 1000 150 0
102 40 2000 350 1
102 50 3000 150 NA
102 60 1000 50 1
This does what you need using merge and the dplyr package:
library(dplyr)
new.tab <- query.tab %>%
  group_by(User, DataCode) %>%
  arrange(Time) %>%
  filter(DataCode != 3000) %>%
  mutate(IsFinal = ifelse(row_number() == n(), 1, 0))
fin.tab <- merge(new.tab, query.tab, all.x = FALSE, all.y = TRUE)
If you want to do everything inside dplyr then this is your answer:
fin.tab <- query.tab %>%
  group_by(User, DataCode) %>%
  arrange(User, Time) %>%
  mutate(IsFinal = ifelse(DataCode == 3000, NA,
                          ifelse(row_number() == n(), 1, 0)))
Both of these solutions will give:
> fin.tab
# User Time DataCode Data IsFinal
# 1 101 10 1000 50 0
# 2 101 20 2000 300 1
# 3 101 30 3000 150 NA
# 4 101 40 1000 250 1
# 5 101 50 3000 300 NA
# 6 102 10 2000 50 0
# 7 102 20 1000 150 0
# 8 102 30 1000 150 0
# 9 102 40 2000 350 1
# 10 102 50 3000 150 NA
# 11 102 60 1000 50 1
Data:
query.tab <- structure(list(User = c(101L, 101L, 101L, 101L, 101L, 102L, 102L,
102L, 102L, 102L, 102L), Time = c(10L, 20L, 30L, 40L, 50L, 10L,
20L, 30L, 40L, 50L, 60L), DataCode = c(1000L, 2000L, 3000L, 1000L,
3000L, 2000L, 1000L, 1000L, 2000L, 3000L, 1000L), Data = c(50L,
300L, 150L, 250L, 300L, 50L, 150L, 150L, 350L, 150L, 50L)), .Names = c("User",
"Time", "DataCode", "Data"), row.names = c(NA, -11L), class = "data.frame")
Note: Reading the edit history of this answer may give you some insight into how to handle similar problems.
Is it feasible for you to make an array of the approved codes? That would make the if statement much simpler.
# Can you obtain list of viable codes?
codes <- c("2000", "1000")
# Can you put them in order?
goodcodes <- codes[order(codes)]
# last item in ordered goodcodes should be the end code
endcode <- goodcodes[length(goodcodes)]
testcodes <- c("0500", "1000", "2000", "3000")
n <- length(testcodes)
IsFinal <- rep(0, n)
for (i in 1:n) {
  if (testcodes[i] %in% goodcodes) {
    if (testcodes[i] == endcode) IsFinal[i] <- 1
  } else {
    IsFinal[i] <- NA
  }
}
> IsFinal
[1] NA 0 1 NA
>
In base R, we can use ave along with duplicated and its fromLast argument to get the binary values. Then replace the desired values with NA. Using the data in @masoud's answer.
# get binary values for final DataCode by user
query.tab$IsFinal <- with(query.tab,
ave(DataCode, User, FUN=function(x) !duplicated(x, fromLast=TRUE)))
# Fill in NA values
is.na(query.tab$IsFinal) <- query.tab$DataCode %in% c(3000)
This returns
query.tab
User Time DataCode Data IsFinal
1 101 10 1000 50 0
2 101 20 2000 300 1
3 101 30 3000 150 NA
4 101 40 1000 250 1
5 101 50 3000 300 NA
6 102 10 2000 50 0
7 102 20 1000 150 0
8 102 30 1000 150 0
9 102 40 2000 350 1
10 102 50 3000 150 NA
11 102 60 1000 50 1
Note that this assumes that the data is ordered by user-time. This can be achieved with a call to order prior to using the code above.
query.tab <- query.tab[order(query.tab$User, query.tab$Time),]

Sorting elements by column in R

I have a simple piece of code for a matrix:
ind1=which(macierz==1,arr.ind = TRUE)
A fragment of the result is:
> ind1
row col
TCGA.CH.5737.01 53 1
TCGA.CH.5791.01 66 1
P03.1334.Tumor 322 1
P04.1790.Tumor 327 1
CPCG0340.F1 425 1
TCGA.CH.5737.01 53 2
TCGA.CH.5791.01 66 2
P03.1334.Tumor 322 2
P04.1790.Tumor 327 2
CPCG0340.F1 425 2
I would like to sort it alphabetically by the first column. How can I do this in R?
It looks as if ind1 is a matrix and what is displayed as the first column is actually its rownames, so you probably need something like ind1 <- ind1[order(rownames(ind1)),]
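A minimal reproducible sketch of that idea, with a small hypothetical 0/1 matrix standing in for macierz (the sample IDs are copied from the output above):
macierz <- matrix(c(1, 0, 1,
                    1, 1, 0), nrow = 3,
                  dimnames = list(c("TCGA.CH.5737.01", "P03.1334.Tumor", "CPCG0340.F1"), NULL))
ind1 <- which(macierz == 1, arr.ind = TRUE)
ind1[order(rownames(ind1)), ]  # rows sorted alphabetically by rowname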
You need (assuming your first column is called "label" and those are not rownames)
ind1[order(ind1$label),]
order() returns a vector of row indexes that sorts the data frame alphabetically. Just to make the example reproducible, I created your data frame like so:
ind1 <- data.frame ( label = c("TCGA.CH.5737.01", "TCGA.CH.5791.01",
"P03.1334.Tumor","P04.1790.Tumor", "CPCG0340.F1" , "TCGA.CH.5737.01",
"TCGA.CH.5791.01","P03.1334.Tumor", "P04.1790.Tumor", "CPCG0340.F1"),
row = c(53,66,322,327,425,53,66,322,327,425), col =
c(1,1,1,1,1,2,2,2,2,2),
stringsAsFactors = FALSE)
and the output is
> ind1[order(ind1$label),]
label row col
5 CPCG0340.F1 425 1
10 CPCG0340.F1 425 2
3 P03.1334.Tumor 322 1
8 P03.1334.Tumor 322 2
4 P04.1790.Tumor 327 1
9 P04.1790.Tumor 327 2
1 TCGA.CH.5737.01 53 1
6 TCGA.CH.5737.01 53 2
2 TCGA.CH.5791.01 66 1
7 TCGA.CH.5791.01 66 2
Hope that helps.
Regards, Umberto

Apply a rule to calculate sum of specific

Hi, I have a data set like this:
Num C Pr Value Volume
111 aa Alen 111 222
111 aa Paul 100 200
222 vv Iva 444 555
222 vv John 333 444
I would like to filter the data according to Num and add a new row that takes the sum of the Value and Volume columns, keeping the information from the Num and C columns but putting Total in the Pr column. It should look like this:
Num C Pr Value Volume
222 vv Total 777 999
Could you suggest how to do it? I would like it only for Num 222.
When I try to use the res command I end up with this result:
# Num C Pr Value Volume
1: 111 aa Alen 111 222
2: 111 aa Paul 100 200
3: 111 aa Total NA NA
4: 222 vv Iva 444 555
5: 222 vv John 333 444
6: 222 vv Total NA NA
What causes this?
The structure of my data is the following:
'data.frame': 4 obs. of 5 variables:
$ Num : Factor w/ 2 levels "111","222": 1 1 2 2
$ C : Factor w/ 2 levels "aa","vv": 1 1 2 2
$ Pr : Factor w/ 4 levels "Alen","Iva","John",..: 1 4 2 3
$ Value : Factor w/ 4 levels "100","111","333",..: 2 1 4 3
$ Volume: Factor w/ 4 levels "200","222","444",..: 2 1 4 3
We could use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)); then, grouped by the 'Num' and 'C' columns and specifying the columns to sum in .SDcols, we loop over those columns using lapply, get the sum, and create the 'Pr' column. We can rbind the original dataset with the new summarised output ('DT1') and order the result based on 'Num'.
library(data.table)#v1.9.5+
DT1 <- setDT(df1)[,lapply(.SD, sum) , by = .(Num,C),
.SDcols=Value:Volume][,Pr:='Total'][]
rbind(df1, DT1)[order(Num)]
# Num C Pr Value Volume
#1: 111 aa Alen 111 222
#2: 111 aa Paul 100 200
#3: 111 aa Total 211 422
#4: 222 vv Iva 444 555
#5: 222 vv John 333 444
#6: 222 vv Total 777 999
This can be done using base R methods as well. We get the sums of the 'Value' and 'Volume' columns grouped by 'Num' and 'C' using the formula method of aggregate, transform the output by creating the 'Pr' column, rbind with the original dataset, and order the output ('res') based on 'Num'.
res <- rbind(df1,transform(aggregate(.~Num+C, df1[-3], FUN=sum), Pr='Total'))
res[order(res$Num),]
# Num C Pr Value Volume
#1 111 aa Alen 111 222
#2 111 aa Paul 100 200
#5 111 aa Total 211 422
#3 222 vv Iva 444 555
#4 222 vv John 333 444
#6 222 vv Total 777 999
EDIT: I noticed that the OP mentioned filtering. If this is for a single 'Num', we subset the data and then do the aggregate and transform steps.
transform(aggregate(.~Num+C, subset(df1, Num==222)[-3], FUN=sum), Pr='Total')
# Num C Value Volume Pr
#1 222 vv 777 999 Total
Or we may not need aggregate. After subsetting the data, we convert 'Num' to a factor, loop through the columns of the output dataset ('df2'), take the sum if the column is of numeric class or else take the first element, and wrap the result with data.frame.
df2 <- transform(subset(df1, Num==222), Num=factor(Num))
data.frame(c(lapply(df2[-3], function(x) if(is.numeric(x))
sum(x) else x[1]), Pr='Total'))
# Num C Value Volume Pr
#1 222 vv 777 999 Total
data
df1 <- structure(list(Num = c(111L, 111L, 222L, 222L), C = c("aa", "aa",
"vv", "vv"), Pr = c("Alen", "Paul", "Iva", "John"), Value = c(111L,
100L, 444L, 333L), Volume = c(222L, 200L, 555L, 444L)), .Names = c("Num",
"C", "Pr", "Value", "Volume"), class = "data.frame",
row.names = c(NA, -4L))
Or using dplyr:
library(dplyr)
df1 %>%
filter(Num == 222) %>%
summarise(Value = sum(Value),
Volume = sum(Volume),
Pr = 'Total',
Num = Num[1],
C = C[1])
# Value Volume Pr Num C
# 1 777 999 Total 222 vv
where we first filter to keep only Num == 222, and then use summarise to obtain the sums and the values for Num and C. This assumes that:
You do not want to get the result for each unique Num (I select one here, you could select multiple). If you need this, use group_by.
There is only ever one C for every unique Num.
You can also use the dplyr package:
df %>%
filter(Num == 222) %>%
group_by(Num, C) %>%
summarise(
Pr = "Total"
, Value = sum(Value)
, Volume = sum(Volume)
) %>%
rbind(df, .)
# Num C Pr Value Volume
# 1 111 aa Alen 111 222
# 2 111 aa Paul 100 200
# 3 222 vv Iva 444 555
# 4 222 vv John 333 444
# 5 222 vv Total 777 999
If you want the total for each Num value, just comment out the filter line.
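As a sketch of that suggestion (using df1 from the data block above in place of df), dropping the filter line yields a Total row for every Num:
library(dplyr)
df1 %>%
  group_by(Num, C) %>%
  summarise(Pr = "Total",
            Value = sum(Value),
            Volume = sum(Volume)) %>%
  rbind(df1, .)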

Custom sorting of a dataframe in R

I have a binomial dataset that looks like this:
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
df <-df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..),]
The data is currently sorted so that the instances belonging to the 0 group come first, followed by the ones belonging to the 1 group. Is there a way I can sort the data in a 0-1-0-1-0... fashion? I mean, show a row that belongs to the 0 group, then a row belonging to the 1 group, then the 0 group again, and so on...
All I can think about is complex functions. I hope there's a simple way around it.
Thank you,
Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4,sample(1:200,10,rep=TRUE)),
addme=sample(0:1,10,rep=TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme==0),which(addme==1)))),])
# X1 X2 X3 X4 addme
#2 141 48 78 33 0
#1 37 111 133 3 1
#3 115 153 168 163 0
#5 189 82 70 103 1
#4 34 37 31 174 0
#6 189 171 98 126 1
#8 167 46 72 57 0
#7 26 196 30 169 1
#9 94 89 193 134 1
#10 110 15 27 31 1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)
Here's another way using dplyr, which would make it suitable for within-group ordering. It's also probably pretty quick. If there are unbalanced numbers of 0's and 1's, it will leave the extras at the end.
library(dplyr)
df %>%
arrange(addme) %>%
mutate(n0 = sum(addme == 0),
orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
arrange(orderme) %>%
select(-n0, -orderme)
