R table function - r

If I have a vector numbers <- c(1,1,2,4,2,2,2,2,5,4,4,4), and I use 'table(numbers)', I get
names 1 2 4 5
counts 2 5 4 1
What if I want it to include 3 as well, or more generally all numbers from 1:max(numbers), even if they do not occur in numbers? That is, how would I generate output like this:
names 1 2 3 4 5
counts 2 5 0 4 1

If you want R to count values that don't occur in the data, create a factor and set its levels explicitly. table returns a count for every level, including levels with no observations.
table(factor(numbers, levels=1:max(numbers)))
# 1 2 3 4 5
# 2 5 0 4 1

For this particular example (positive integers), tabulate would also work:
numbers <- c(1,1,2,4,2,2,2,2,5,4,4,4)
tabulate(numbers)
# [1] 2 5 0 4 1
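As a small sketch of the tabulate route, the counts can be given names so the result reads like the table output. The nbins argument is part of base R's tabulate; the variable names counts and named_counts are only illustrative. Note that tabulate only counts positive integers, so the factor approach is the safer choice for other kinds of values.

```r
numbers <- c(1, 1, 2, 4, 2, 2, 2, 2, 5, 4, 4, 4)

# Count occurrences of 1, 2, ..., max(numbers); absent values get 0
counts <- tabulate(numbers, nbins = max(numbers))

# Attach the values as names so the result resembles table() output
named_counts <- setNames(counts, seq_len(max(numbers)))
named_counts

# Same counts as the factor-based approach
all(named_counts == as.vector(table(factor(numbers, levels = 1:max(numbers)))))
```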


runner::streak_run shows unexpected result when k remains unchanged

I'm using runner::streak_run to count sequences of 0 and 1 in a column called "inactive_indicator".
The column is: 0,0,0,1,1,1,0,1,1,0,0,0,0,0,0,0,0,1,1,1,1
For runner::streak_run(inactive_indicator)
I get the following:
1,2,3,1,2,3,1,1,2,1,2,3,4,5,5,5,5,1,2,3,4
Why is it stuck on 5 when it should go up to 8?
The documentation describes k as: "running window size. By default window size equals length(x). Allow varying window size specified by vector of length(x)".
As I understand, the default definition should be enough.
Problem resolves and I get expected results when running:
runner::streak_run(inactive_indicator, k = length(inactive_indicator))
Why doesn't it work in the first place?
This can be solved with rle from base R
sequence(rle(inactive_indicator)$lengths)
#[1] 1 2 3 1 2 3 1 1 2 1 2 3 4 5 6 7 8 1 2 3 4
Checked with runner
runner::streak_run(inactive_indicator)
#[1] 1 2 3 1 2 3 1 1 2 1 2 3 4 5 6 7 8 1 2 3 4
It is possible that the column is not numeric and its values contain leading or trailing spaces. In that case, use trimws:
runner::streak_run(trimws(inactive_indicator))
data
inactive_indicator <- c(0,0,0,1,1,1,0,1,1,0,0,0,0,0,0,0,0,1,1,1,1)
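To see why the rle approach gives the streak counts, here is a short sketch that breaks the one-liner into its two steps. It uses only base R; the names runs and streak are illustrative.

```r
inactive_indicator <- c(0,0,0,1,1,1,0,1,1,0,0,0,0,0,0,0,0,1,1,1,1)

# Step 1: run-length encoding gives the length of each run of equal values
runs <- rle(inactive_indicator)
runs$lengths
# [1] 3 3 1 2 8 4

# Step 2: sequence() expands each length n into 1, 2, ..., n,
# which is the position of each element within its own run
streak <- sequence(runs$lengths)
streak
```

The longest run here has 8 zeros, so the streak counter should reach 8 rather than stall at 5.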

I am trying to take a vector of numbers 5:0 and repeat it 3 times, every other time reversing its order

I'd think this would be simple using the rev() and seq() functions, but am struggling to get the reverse order part correct.
I'm trying to get 5432101234543210 from 5:0.
Not too hard to set as a function...
try_it <- function(x) {
  # x[-1] drops the first element so the turning points are not duplicated
  c(rev(x), x[-1], rev(x)[-1])
}
try_it(0:5)
# [1] 5 4 3 2 1 0 1 2 3 4 5 4 3 2 1 0
Edit
Extend function to have variable repeats
try_it <- function(x, reps) {
  # one descending pass, then (reps - 1) / 2 up-and-down pairs
  c(rev(x), rep(c(x[-1], rev(x)[-1]), (reps - 1) / 2))
}
try_it(0:5, 5)
# [1] 5 4 3 2 1 0 1 2 3 4 5 4 3 2 1 0 1 2 3 4 5 4 3 2 1 0
Note: I've not worked hard to generalise this extension. It will not return the correct length for an even number of repetitions, because (reps - 1) / 2 is then fractional and rep truncates it. I'm sure you could modify it to suit your requirements.
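One possible way to handle both odd and even repetition counts is to build the passes one at a time. This is only a sketch: the function name zigzag is hypothetical, and unlike try_it it takes the vector in the order of the first pass (5:0 rather than 0:5).

```r
zigzag <- function(x, reps) {
  stopifnot(reps >= 1)
  pieces <- lapply(seq_len(reps), function(i) {
    # odd passes run in the original order, even passes in reverse
    piece <- if (i %% 2 == 1) x else rev(x)
    # every pass after the first drops its first element,
    # so the turning point is not repeated
    if (i > 1) piece[-1] else piece
  })
  unlist(pieces)
}

zigzag(5:0, 3)
# [1] 5 4 3 2 1 0 1 2 3 4 5 4 3 2 1 0
zigzag(5:0, 2)
# [1] 5 4 3 2 1 0 1 2 3 4 5
```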

How to tidy up a character column?

What I have:
test_df <- data.frame(isolate=c(1,2,3,4,1,2,3,4,5),label=c(1,1,1,1,2,2,2,2,2),alignment=c("--at","at--","--at","--at","a--","acg","a--","a--", "agg"))
> test_df
isolate label alignment
1 1 1 --at
2 2 1 at--
3 3 1 --at
4 4 1 --at
5 1 2 a--
6 2 2 acg
7 3 2 a--
8 4 2 a--
9 5 2 agg
What I want:
I'd like to explode the alignment field into two columns, position and character:
> test_df
isolate label aln_pos aln_char
1 1 1 1 -
2 1 1 2 -
3 1 1 3 a
4 1 1 4 t
...
Not all alignments are the same length, but all alignments with the same label have the same length.
What I've tried:
I was thinking I could use separate to first give each position its own column, then use gather to turn those columns into key-value pairs. However, I haven't been able to get the separate part right.
Since you mentioned tidyr::gather, you could try this:
test_df <- data.frame(isolate=c(1,2,3,4,1,2,3,4,5),
label=c(1,1,1,1,2,2,2,2,2),
alignment=c("--at","at--","--at","--at","a--","acg","a--","a--", "agg"),
stringsAsFactors = FALSE)
library(tidyverse)
test_df %>%
mutate(alignment = strsplit(alignment,"")) %>%
unnest(alignment)
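The pipeline above yields one row per character but no position column. A possible extension, sketched here under the assumption that each isolate and label pair appears only once in the input (as in the example data) and that dplyr and tidyr are installed, numbers the characters within each original row. The name long_df is illustrative.

```r
library(dplyr)
library(tidyr)

test_df <- data.frame(isolate = c(1,2,3,4,1,2,3,4,5),
                      label = c(1,1,1,1,2,2,2,2,2),
                      alignment = c("--at","at--","--at","--at","a--","acg","a--","a--","agg"),
                      stringsAsFactors = FALSE)

long_df <- test_df %>%
  mutate(aln_char = strsplit(alignment, "")) %>%  # list column of single characters
  unnest(aln_char) %>%                            # one row per character
  group_by(isolate, label) %>%                    # assumes this pair identifies a row
  mutate(aln_pos = row_number()) %>%              # position within the alignment
  ungroup() %>%
  select(isolate, label, aln_pos, aln_char)

head(long_df, 4)
```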
In base R, you can use indexing along with creation of a list with strsplit like this.
# make variable a character vector
test_df$alignment <- as.character(test_df$alignment)
# get list of individual characters
myList <- strsplit(test_df$alignment, split="")
then build the data.frame
# construct data.frame
final_df <- cbind(test_df[rep(seq_len(nrow(test_df)), lengths(myList)),
c("isolate", "label")],
aln_pos=sequence(lengths(myList)),
aln_char=unlist(myList))
Here, we take the first two columns of the original data.frame and repeat the rows using rep, with a vector in its second argument telling it how many times to repeat the corresponding value in its first argument. The number of times is calculated with lengths. The second argument of cbind is a call to sequence on the same lengths output, which produces counts from 1 to the corresponding length. The third argument is the unlisted character values.
This returns
head(final_df, 10)
isolate label aln_pos aln_char
1 1 1 1 -
1.1 1 1 2 -
1.2 1 1 3 a
1.3 1 1 4 t
2 2 1 1 a
2.1 2 1 2 t
2.2 2 1 3 -
2.3 2 1 4 -
3 3 1 1 -
3.1 3 1 2 -

Determining congruence between rows in R, based on key variable

I have a few large data sets with many variables. There is a "key" variable that is the ID for the research participant. In these data sets, there are some IDs that are duplicated. I have written code to extract all data for duplicated IDs, but I would like a way to check if the remainder of the variables for those IDs are equal or not. Below is a simplistic example:
ID X Y Z
1 2 3 4
1 2 3 5
2 5 5 4
2 5 5 4
3 1 2 3
3 2 2 3
3 1 2 3
In this example, I would like to be able to identify that the rows for ID 1 and ID 3 are NOT all equal. Is there any way to do this in R?
You can use duplicated for this:
d <- read.table(text='ID X Y Z
1 2 3 4
1 2 3 5
2 5 5 4
2 5 5 4
3 1 2 3
3 2 2 3
3 1 2 3
4 1 1 1', header=TRUE)
tapply(duplicated(d), d[, 1], function(x) all(x[-1]))
## 1 2 3 4
## FALSE TRUE FALSE TRUE
duplicated returns a logical vector indicating, for each row of a data frame, whether an identical row has been encountered earlier in the data frame. We use tapply over this logical vector, splitting it into groups based on ID and applying a function to each group. The function we apply is all(x[-1]), i.e. we ask whether all rows in the group, other than the first, are duplicates.
Note that I added a group with a single record to ensure that the solution works in these cases as well.
Alternatively, you can reduce the dataframe to unique records with unique, and then split by ID and check whether each split has only a single row:
sapply(split(unique(d), unique(d)[, 1]), nrow) == 1
## 1 2 3 4
## FALSE TRUE FALSE TRUE
(If it's a big dataframe it's worth calculating unique(d) in advance rather than calling it twice.)
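If it also matters which variables disagree, a possible follow-up is to check each column within each ID. This is a sketch in base R; the name differs is illustrative.

```r
d <- read.table(text = 'ID X Y Z
1 2 3 4
1 2 3 5
2 5 5 4
2 5 5 4
3 1 2 3
3 2 2 3
3 1 2 3
4 1 1 1', header = TRUE)

# For each ID, flag every non-key column that takes more than one value
differs <- sapply(split(d[-1], d$ID), function(g) {
  sapply(g, function(col) length(unique(col)) > 1)
})
differs
# rows are the variables (X, Y, Z), columns are the IDs

# IDs with at least one inconsistent variable
names(which(colSums(differs) > 0))
```

With the example data this flags Z for ID 1 and X for ID 3.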

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R similar to the one below. My real 'df' data frame is much bigger than this, but I don't want to confuse anybody, so I've simplified things as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column 'a' (considering only the records that have '1' in column 'id'), the number '1' occurred 3 times and the number '3' occurred 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain again: in column 'a' (considering only the observations that have '2' in column 'id'), the number '1' occurred 4 times, the number '2' occurred 3 times and the number '3' occurred 3 times.
So this is what I would like to do. Calculating the occurrences of numbers for each custom-defined subsets (and then collecting these values into a data frame). I know it is not a difficult task but the PROBLEM is that I’m gonna have to change the input ’df’ dataframe on a regular basis and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
The dimensions are ordered id, variable, value. So to get the number of '3's in column 'a' and group '1'
you could just do
> dftab['1','a','3']
[1] 7
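If the reshape package is not available, the same three-way table can be approximated in base R by stacking the columns by hand. This is a sketch; the names value_cols and long are illustrative, and the data frame is the one from the question.

```r
df <- data.frame(
  id = rep(1:3, each = 10),
  a = c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3),
  b = c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2),
  c = c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2),
  d = c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2),
  e = c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
)

# Stack all non-id columns into long format: one row per (id, variable, value)
value_cols <- names(df)[-1]
long <- data.frame(
  id       = rep(df$id, times = length(value_cols)),
  variable = rep(value_cols, each = nrow(df)),
  value    = unlist(df[value_cols], use.names = FALSE)
)

# Cross-tabulate; dimensions follow the column order: id, variable, value
dftab <- table(long)

# Number of 3s in column a for id 1
dftab["1", "a", "3"]
```

Indexing by dimension name rather than position avoids mixing up the id and value dimensions.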
A combination of tapply and apply can create the data you want:
tapply(df$id, df$id, function(x) apply(df[df$id == x[1], -1], 2, table))
However, when a group doesn't contain every value in some column, as with column a of id 1 (which has no 2s), the result for that id is a list rather than a tidy matrix.
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
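One way to avoid the ragged list is to tabulate each column as a factor with a fixed set of levels, so that every group yields a matrix of the same shape. This is a sketch in base R using the question's data; the names lev and counts are illustrative.

```r
df <- data.frame(
  id = rep(1:3, each = 10),
  a = c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3),
  b = c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2),
  c = c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2),
  d = c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2),
  e = c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
)

# All values that occur anywhere in the non-id columns
lev <- sort(unique(unlist(df[-1])))

# For each id, count every level in every column; absent values count as 0
counts <- lapply(split(df[-1], df$id), function(g) {
  sapply(g, function(col) table(factor(col, levels = lev)))
})

counts[["1"]]  # a matrix: rows are values, columns are a to e
```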
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
library(plyr)
# Tabulate every column except "id" and return the tables as a named list
ColTables <- function(df) {
  counts <- list()
  for(a in names(df)[names(df) != "id"]) {
    counts[[a]] <- table(df[a])
  }
  return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
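Note that the aggregate approach omits combinations that never occur (for example a = 2 in id 1). If zero counts should appear as rows, one alternative is to convert a table to a data frame. This sketch uses only the id and a columns of the question's data; the name freq_df is illustrative.

```r
df <- data.frame(
  id = rep(1:3, each = 10),
  a = c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
)

# Cross-tabulate a by id, then flatten to one row per combination;
# combinations that never occur are kept with Freq = 0
freq_df <- as.data.frame(table(a = df$a, id = df$id))
freq_df
```

No helper column is needed here, and the original data frame is left unchanged.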
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
