Count the frequency: how to include zero frequencies? - R

I wish to see the combinations that occurred 0 times. Is there a simple way to do that?
mydf <- data.frame(v1 = c("a","b","b","a","b"), v2 = c("l1","l2","l1","l1","l2"))
I use the following to see the frequencies of the v1/v2 combinations:
library(plyr)
count(mydf, c('v1','v2'))
It gives me the following outcome.
v1 v2 freq
1 a l1 2
2 b l1 1
3 b l2 2
I wish to have zeros in my output. For instance, the combination of a and l2 never occurs. How can I get the following output?
v1 v2 freq
1 a l1 2
2 b l1 1
3 b l2 2
4 a l2 0

A base-R option is to cross-tabulate with table(), which keeps the zero-count combinations:
table(mydf$v1, mydf$v2)
l1 l2
a 2 0
b 1 2
Converting the table to a data frame gives the long format you asked for:
as.data.frame(table(mydf$v1, mydf$v2))
Var1 Var2 Freq
1 a l1 2
2 b l1 1
3 a l2 0
4 b l2 2
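A tidyverse sketch of the same idea, assuming dplyr and tidyr are installed (note the count column is named n rather than freq):
library(dplyr)
library(tidyr)

mydf %>%
  count(v1, v2) %>%                      # tally the observed combinations (column n)
  complete(v1, v2, fill = list(n = 0))   # add the missing a/l2 combination with n = 0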

Related

R function to replace tricky merge in Excel (vlookup + hlookup)

I have a tricky merge that I usually do in Excel via various formulas, and I want to automate it with R.
I have two data frames. One, called inputs, looks like this:
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
And another, called df:
id v
1 1
1 2
1 3
2 2
3 1
I would like to combine them based on the id and v values so that I get:
id v key
1 1 A
1 2 A
1 3 C
2 2 D
3 1 T
So I'm matching on id and then on the columns v1 through v3: in the first row you can see that I match id = 1 against v1 because the value of v is 1. In Excel I do this by creatively combining VLOOKUP and HLOOKUP, but I want to make this simpler in R. The data frames shown are simplified; my real data has many more records, and the value columns go from v1 up to v50.
Thanks!
You could use pivot_longer:
library(tidyr)
library(dplyr)
key %>%
  pivot_longer(!id, names_prefix = 'v', names_to = 'v') %>%
  mutate(v = as.numeric(v)) %>%
  inner_join(df)
Joining, by = c("id", "v")
# A tibble: 5 × 3
id v value
<int> <dbl> <chr>
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
Data:
key <- read.table(text="
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F",header=T)
df <- read.table(text="
id v
1 1
1 2
1 3
2 2
3 1 ",header=T)
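If you want to avoid the Joining, by = ... message, the join key can be given explicitly; a minor variation on the pipeline above, using the same packages and the same key and df objects:
key %>%
  pivot_longer(!id, names_prefix = 'v', names_to = 'v') %>%
  mutate(v = as.numeric(v)) %>%
  inner_join(df, by = c('id', 'v'))   # spelling out the key silences the message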
You can use two-column matrices as index arguments to "[", so this is a one-liner. (Note that the data objects here are named d1 and d2; I'm opposed to using df as a data object name.)
d1[-1][data.matrix(d2)]  # returns [1] "A" "A" "C" "D" "T"
So the full solution is:
cbind(d2, key = d1[-1][data.matrix(d2)])
id v key
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
Try this:
x <- "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
"
y <- "
id v
1 1
1 2
1 3
2 2
3 1
"
df <- read.table(textConnection(x) , header = TRUE)
df2 <- read.table(textConnection(y) , header = TRUE)
key <- c()
for (i in 1:nrow(df2)) {
  # pick row df2$id[i] of df (the id values equal the row numbers here) and
  # column df2$v[i] + 1 (the +1 skips the id column); append() puts the new
  # value in front of key, so the vector is built in reverse order
  key <- append(df[df2$id[i], (df2$v[i] + 1L)], key)
}
df2$key <- rev(key)  # undo the reversal
df2
># id v key
># 1 1 1 A
># 2 1 2 A
># 3 1 3 C
># 4 2 2 D
># 5 3 1 T
Created on 2022-06-06 by the reprex package (v2.0.1)

R ifelse: find if any column meets the condition

I'm trying to apply the same condition to multiple columns of a data frame and then create a new column indicating whether any of the columns meet the condition.
I can do it manually with an OR statement, but I was wondering if there is an easy way to apply it for more columns.
An example:
data <- data.frame(V1=c("A","B"),V2=c("A","A","A","B","B","B"),V3=c("A","A","B","B","A","A"))
data[4] <- ifelse((data[1]=="A"|data[2]=="A"|data[3]=="A"),1,0)
So the 4th row is the only one where no column meets the condition:
V1 V2 V3 V1
1 A A A 1
2 B A A 1
3 A A B 1
4 B B B 0
5 A B A 1
6 B B A 1
Do you know a shorter way to apply the condition?
I tried something like
data[4] <- ifelse(any(data[,c(1:3)]=="A"),1,0)
but it evaluates the condition over the whole dataset instead of row by row, so every row gets a 1.
We can use Reduce with lapply
data$NewCol <- +( Reduce(`|`, lapply(data, `==`, 'A')))
We can use apply row-wise:
data$ans <- +(apply(data[1:3] == "A", 1, any))
data
# V1 V2 V3 ans
#1 A A A 1
#2 B A A 1
#3 A A B 1
#4 B B B 0
#5 A B A 1
#6 B B A 1
Try:
data$V4 <- +(rowSums(data == 'A') > 0)
Output:
V1 V2 V3 V4
1 A A A 1
2 B A A 1
3 A A B 1
4 B B B 0
5 A B A 1
6 B B A 1
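If you use dplyr, recent versions (1.0.4 and later) provide if_any(), which reads nicely for this kind of row-wise check. A sketch, starting from the original three-column data (the output name V4 is just illustrative):
library(dplyr)

data %>%
  mutate(V4 = +if_any(V1:V3, ~ .x == "A"))   # 1 if any of V1-V3 equals "A", else 0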

Frequency table with ddply function

ID<-c("R1","R2","R2","R3","R3","R4","R4","R4","R4","R3","R3","R3","R3","R2","R2","R2","R5","R6")
event<-c("a","b","b","M","s","f","y","b","a","a","a","a","s","c","c","b","m","a")
df<-data.frame(ID,event)
1. How can I modify the code below to get this table? 2. How can I get the average frequency for each event level? For example, the average frequency for a would be (1 + 3 + 1 + 1)/4.
ddply(df, .(ID), summarise, N = sum(!is.na(ID)), frequency = length(event))
ID N Number-event-level levels frequency
R1 1 1 a a=1
R2 5 2 b,c b=3,c=2
R3 6 3 M,a,s M=1,a=3,s=2
R4 4 4 f,y,b,a f=1,y=1,b=1,a=1
R5 1 1 m m=1
R6 1 1 a a=1
Here's an answer for the first question:
ddply(df, .(ID), summarise,
      N = length(event),
      Number.event.level = length(unique(event)),
      levels = paste(sort(unique(event)), collapse = ","),
      frequency = paste(paste(sort(unique(event)), table(event)[table(event) > 0], sep = "="), collapse = ","))
# ID N Number.event.level levels frequency
# 1 R1 1 1 a a=1
# 2 R2 5 2 b,c b=3,c=2
# 3 R3 6 3 a,M,s a=3,M=1,s=2
# 4 R4 4 4 a,b,f,y a=1,b=1,f=1,y=1
# 5 R5 1 1 m m=1
# 6 R6 1 1 a a=1
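For completeness, a dplyr sketch of the same per-ID summary (assuming dplyr is installed; the column names mirror the ddply() call above):
library(dplyr)

df %>%
  group_by(ID) %>%
  summarise(
    N = n(),
    Number.event.level = n_distinct(event),
    levels = paste(sort(unique(event)), collapse = ","),
    frequency = {
      tab <- table(as.character(event))  # as.character() drops unused factor levels
      paste(paste(names(tab), tab, sep = "="), collapse = ",")
    })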
For your second question, it seems like you want to get the average frequency when the frequency is greater than 0. If that's the case, you can do this:
apply(table(df),2,function(x) mean(x[x>0]))
# a b c f m M s y
# 1.5 2.0 2.0 1.0 1.0 1.0 2.0 1.0
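An equivalent base-R sketch with aggregate(), in case you prefer to avoid apply(); it expresses the same computation (keep only non-zero counts, then average per event level):
aggregate(Freq ~ event, data = subset(as.data.frame(table(df)), Freq > 0), FUN = mean)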
Update
If you want to do that last part for each level of a third variable and you still want to use ddply(), you could do the following:
df1 <- rbind(df,df)
df1$cat <- rep(c("a","b"),each=nrow(df))
ddply(df1,.(cat),function(y) apply(table(y),2,function(x) mean(x[x>0])))
# cat a b c f m M s y
# 1 a 1.5 2 2 1 1 1 2 1
# 2 b 1.5 2 2 1 1 1 2 1

R: counting string variables in each row of a dataframe

I have a dataframe that looks something like this, where each row represents a sample and contains repeats of the same strings:
> df
V1 V2 V3 V4 V5
1 a a d d b
2 c a b d a
3 d b a a b
4 d d a b c
5 c a d c c
I want to create a new dataframe where the headers are the string values from the previous dataframe (a, b, c, d) and each row holds the number of occurrences of the respective value in the original dataframe. Using the example above, this would look like:
> df2
a b c d
1 2 1 0 2
2 2 1 1 1
3 2 2 0 1
4 1 1 1 2
5 1 0 3 1
In my actual dataset, there are hundreds of variables, and thousands of samples, so it'd be ideal if I could automatically pull out the names from the original dataframe, and alphabetize them into the headers for the new dataframe.
You may try
library(qdapTools)
mtabulate(as.data.frame(t(df)))
Or
mtabulate(split(as.matrix(df), row(df)))
Or using base R
Un1 <- sort(unique(unlist(df)))
t(apply(df ,1, function(x) table(factor(x, levels=Un1))))
You can stack the columns and then use table:
table(cbind(id = 1:nrow(df),
            stack(lapply(df, as.character)))[c("id", "values")])
# values
# id a b c d
# 1 2 1 0 2
# 2 2 1 1 1
# 3 2 2 0 1
# 4 1 1 1 2
# 5 1 0 3 1
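A tidyr-based sketch of the same idea (assuming dplyr and tidyr 1.1+ are installed; the helper column row is only there to keep one output row per input row):
library(dplyr)
library(tidyr)

df %>%
  mutate(across(everything(), as.character)) %>%  # make sure cells are character, not factor
  mutate(row = row_number()) %>%                  # remember the original row
  pivot_longer(-row, values_to = "letter") %>%    # one row per cell
  count(row, letter) %>%                          # tally letters within each row
  pivot_wider(names_from = letter, values_from = n,
              values_fill = 0, names_sort = TRUE) %>%  # absent letters become 0, columns alphabetized
  select(-row)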

How do I stack only some columns in a data frame?

I have some data in a data frame in the following form:
A B C V1 V2 V3
1 1 1 x y z
1 1 2 a b c
...
Where A, B, C are factors, and the combination of A, B, C is unique for each row.
I need to convert some of the columns into factors, to achieve a form like:
A B C V val
1 1 1 V1 x
1 1 1 V2 y
1 1 1 V3 z
1 1 2 V1 a
1 1 2 V2 b
1 1 2 V3 c
...
This seems to relate to both stack and the inverse of xtabs, but I don't see how to specify that only certain columns should be "stacked".
And before @AnandaMahto gets here and offers his base reshape solution, here's my attempt:
dat <- read.table(text = 'A B C V1 V2 V3
1 1 1 x y z
1 1 2 a b c',header= T)
expandvars <- c("V1","V2","V3")
datreshape <- reshape(dat,
idvar=c("A","B","C"),
varying=list(expandvars),
v.names=c("val"),
times=expandvars,
direction="long")
> datreshape
A B C time val
1.1.1.V1 1 1 1 V1 x
1.1.2.V1 1 1 2 V1 a
1.1.1.V2 1 1 1 V2 y
1.1.2.V2 1 1 2 V2 b
1.1.1.V3 1 1 1 V3 z
1.1.2.V3 1 1 2 V3 c
Using reshape2 package
dat <- read.table(text = 'A B C V1 V2 V3
1 1 1 x y z
1 1 2 a b c',header= T)
library(reshape2)
melt(dat,id.vars = c('A','B','C'))
A B C variable value
1 1 1 1 V1 x
2 1 1 2 V1 a
3 1 1 1 V2 y
4 1 1 2 V2 b
5 1 1 1 V3 z
6 1 1 2 V3 c
stack
You are right that stack is a possibility, but you perhaps missed a key line in the documentation for stack:
Note that stack applies to vectors (as determined by is.vector): non-vector columns (e.g., factors) will be ignored (with a warning as from R 2.15.0).
So, how do we proceed?
Here's your data:
dat <- read.table(text = 'A B C V1 V2 V3
1 1 1 x y z
1 1 2 a b c',header= T)
Here, we convert the factor columns to character:
dat[sapply(dat, is.factor)] <- lapply(dat[sapply(dat, is.factor)], as.character)
Here's how we specify which columns to stack:
stack(dat[4:6])
# values ind
# 1 x V1
# 2 a V1
# 3 y V2
# 4 b V2
# 5 z V3
# 6 c V3
But, we still need to "expand" your rows for columns 1-3. See here for how to do that.
With this information, we can use cbind to get the desired result.
cbind(dat[rep(row.names(dat), 3), 1:3], stack(dat[4:6]))
# A B C values ind
# 1 1 1 1 x V1
# 2 1 1 2 a V1
# 1.1 1 1 1 y V2
# 2.1 1 1 2 b V2
# 1.2 1 1 1 z V3
# 2.2 1 1 2 c V3
xtabs
You are also right that xtabs seems like it could be a likely possibility, but xtabs actually expects the opposite of what you've provided. That is to say, when you specify a formula, it expects the items on the left-hand side to be numbers and the items on the right-hand side to be factors. Thus, if your data were swapped, you could certainly use xtabs.
Here's a demonstration (which only works because you are using a simple example where we can easily match "letters" to "numbers").
dat2 <- dat # Make a copy of "dat"
# Swap out dat 4-6 with numbers
dat2[4:6] <- lapply(dat2[4:6], function(x) match(x, letters))
# Swap out dat 1-3 with letters
dat2[1:3] <- lapply(dat2[1:3], function(x) letters[x])
# Our new "dat"
dat2
# A B C V1 V2 V3
# 1 a a a 24 25 26
# 2 a a b 1 2 3
data.frame(xtabs(cbind(V1, V2, V3) ~ A + B + C, dat2))
# A B C Var4 Freq
# 1 a a a V1 24
# 2 a a b V1 1
# 3 a a a V2 25
# 4 a a b V2 2
# 5 a a a V3 26
# 6 a a b V3 3
In other words, your choice of tools could potentially be right, but your data also needs to be in the form that the tools expect.
But, I'm not sure why you'd want to do all the work I've shown when better solutions exist with reshape and friends ;)
Very late update...
You can also look at merged.stack from my "splitstackshape" package:
library(splitstackshape)
merged.stack(dat, var.stubs = "V", sep = "NoSep")
# A B C .time_1 V
# 1: 1 1 1 V1 x
# 2: 1 1 1 V2 y
# 3: 1 1 1 V3 z
# 4: 1 1 2 V1 a
# 5: 1 1 2 V2 b
# 6: 1 1 2 V3 c
Or gather from "tidyr":
library(dplyr)
library(tidyr)
# gather(dat, var, val, V1:V3)
dat %>% gather(var, val, V1:V3)
# A B C var val
# 1 1 1 1 V1 x
# 2 1 1 2 V1 a
# 3 1 1 1 V2 y
# 4 1 1 2 V2 b
# 5 1 1 1 V3 z
# 6 1 1 2 V3 c
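gather() has since been superseded in tidyr, so with tidyr 1.0 or later the same reshape can be sketched with pivot_longer() (the rows come out grouped by A/B/C rather than by variable, but the content is the same):
library(tidyr)

pivot_longer(dat, cols = V1:V3, names_to = "var", values_to = "val")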
