Obtaining the index of elements that belong to a common group in R

I have the following data frame,
>df
Label
0 control1
1 control1
2 control2
3 control2
4 control1
To get the index of the elements with label control1 and control2, I do the following
Index1 <- grep("control1",df[,1])
Index2 <- grep("control2",df[,1])
In the above syntax, the labels control1 and control2 are explicitly mentioned in the command.
Is there a way to find the labels automatically? The reason is that the contents of the data frame df are parsed from different input files.
For instance, I could have another data frame that reads
>df2
Label
0 trol1
1 trol1
2 trol2
3 trol3
4 trol2
Is there a way to create a list of unique labels present in the column of df?

We can use split to get a list of indices for each unique Label:
split(1:nrow(df), df$Label)
#$control1
#[1] 1 2 5
#$control2
#[1] 3 4
With df2
split(1:nrow(df2), df2$Label)
#$trol1
#[1] 1 2
#$trol2
#[1] 3 5
#$trol3
#[1] 4
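To answer the last part of the question directly, unique() returns the labels themselves. The following is a minimal sketch, assuming df is rebuilt by hand here (in the question it is parsed from input files); seq_len() is used in place of 1:nrow(df) because it also behaves sensibly for a data frame with zero rows.

```r
# Hand-built copy of the example data frame (assumption: one character column)
df <- data.frame(Label = c("control1", "control1", "control2",
                           "control2", "control1"),
                 stringsAsFactors = FALSE)

# The distinct labels, in order of first appearance
labels <- unique(df$Label)

# Row indices grouped by label, as a named list
idx <- split(seq_len(nrow(df)), df$Label)

labels
idx
```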

Using unique and which you can do:
df <- data.frame(Label = c("trol1", "trol1", "trol2", "trol3", "trol2"), stringsAsFactors=FALSE)
label_idx = list()
for(lbl in unique(df$Label)){
label_idx[[lbl]] = which(df$Label == lbl)
}
label_idx
$`trol1`
[1] 1 2
$trol2
[1] 3 5
$trol3
[1] 4
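One reason to prefer == (or split) over the grep calls in the question: grep matches substrings, so a label such as control10 would also be picked up by the pattern control1. A small sketch, using a hypothetical label vector that is not part of the original data:

```r
# Hypothetical labels, chosen so that one label is a prefix of another
labels <- c("control1", "control10", "control1")

# Substring match: also hits "control10"
by_grep <- grep("control1", labels)

# Exact comparison: only the true "control1" entries
by_equal <- which(labels == "control1")

# grep can be made exact by anchoring the pattern
by_anchored <- grep("^control1$", labels)

by_grep
by_equal
by_anchored
```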

You can try also
lapply(unique(df$Label), function(x) which(df$Label%in% x))
#with df
[[1]]
[1] 1 2 5
[[2]]
[1] 3 4
lapply(unique(df2$Label), function(x) which(df2$Label%in% x))
#with df2
[[1]]
[1] 1 2
[[2]]
[1] 3 5
[[3]]
[1] 4
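The list returned by lapply here is unnamed, so it is not obvious which element belongs to which label. One possible fix, sketched below with a hand-built df2, is to attach the labels with setNames():

```r
# Hand-built copy of df2 from the question
df2 <- data.frame(Label = c("trol1", "trol1", "trol2", "trol3", "trol2"),
                  stringsAsFactors = FALSE)

labs <- unique(df2$Label)

# Same lookup as above, but the result carries the labels as names
idx <- setNames(lapply(labs, function(x) which(df2$Label == x)), labs)

idx
```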

Related

Storing frequencies returned from table function in R

I have a vector of size 5 which stores random digits 0-9 so that there can be multiple occurrences of the same digit. Here is an example vector:
nums <- c(5,2,5,9,2)
If I print the results of running the table function on this vector, I get the following output:
nums
2 5 9
2 2 1
I would like to know what the highest and second highest frequencies are that are returned from table(nums). How can I store all of the frequencies that are returned from an iteration of the table function?
table returns an array that can be saved to a variable. If you convert it to a data.frame using as.data.frame you get an easier to work with object:
nums <- c(5,2,5,9,2)
tab <- as.data.frame(table(nums))
tab
nums Freq
1 2 2
2 5 2
3 9 1
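If only the frequencies are needed, rather than the full table, they can be pulled out as a plain vector and sorted. A minimal sketch for the highest and second highest counts (the variable names freqs and top2 are just illustrative):

```r
nums <- c(5, 2, 5, 9, 2)

# Drop the table structure and keep the counts only
freqs <- as.vector(table(nums))

# Two largest counts, highest first
top2 <- sort(freqs, decreasing = TRUE)[1:2]

top2
```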
You can use plyr; it's lightning fast.
library(plyr)
nums <- c(5,2,5,9,2)
count(nums)
Result
x freq
1 2 2
2 5 2
3 9 1
To shrink the table only to the two most frequent options you would want
sort(table(nums), dec = TRUE)[1:2]
# nums
# 2 5
# 2 2
Just to get their names you could do
names(sort(table(nums), dec = TRUE))[1:2]
# [1] "2" "5"
If it may happen that there are not that many unique values, you could use na.omit, as in
names(sort(table(nums), dec = TRUE))[1:4]
# [1] "2" "5" "9" NA
na.omit(names(sort(table(nums), dec = TRUE))[1:4])
# [1] "2" "5" "9"
# attr(,"na.action")
# [1] 4
# attr(,"class")
# [1] "omit"
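As an alternative to indexing with [1:4] and then calling na.omit, head() simply returns fewer elements when fewer are available, so no NA is created and no extra attributes are attached. A short sketch:

```r
nums <- c(5, 2, 5, 9, 2)

# Ask for up to four names; only three distinct values exist
top <- head(names(sort(table(nums), decreasing = TRUE)), 4)

top
length(top)
```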
As for storing the results, using a list should be pretty convenient:
tabs <- list()
tabs[[1]] <- sort(table(nums), dec = TRUE)[1:2]
tabs[[2]] <- sort(table(c(1, 1, 2, 3, 3)), dec = TRUE)[1:2]
tabs
# [[1]]
# nums
# 2 5
# 2 2
#
# [[2]]
#
# 1 3
# 2 2
In particular, a list still works when the number of distinct values varies from one table to the next.

Combining lists which have the same column names, but NOT joining them

I have 3 lists I want to combine. These lists contain different data points, but the same types of data. I want to place these into 1 list for export side-by-side. For example,
A$a <- c(1,2,3)
A$b <- c(2,3,4)
B$a <- c(1,3,5)
B$b <- c(2,4,6)
I want to have a new list, C, which has columns A$a, A$b, B$a, and B$b as separate columns (and in that order). How can I do this?
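C <- c(A, B)
# Prefix each name with the list it came from: "A" for A's elements, "B" for B's
names(C) <- paste0(rep(c("A", "B"), c(length(A), length(B))), names(C))
C
$Aa
[1] 1 2 3
$Ab
[1] 2 3 4
$Ba
[1] 1 3 5
$Bb
[1] 2 4 6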
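Another option, sketched here under the assumption that A and B are plain named lists, is to pass them to c() as named arguments. c() then prefixes every element name with the argument name, and as.data.frame() lays the result out side by side for export.

```r
# Assumption: A and B are created as lists first (A$a <- ... fails if A does not exist)
A <- list(a = c(1, 2, 3), b = c(2, 3, 4))
B <- list(a = c(1, 3, 5), b = c(2, 4, 6))

# Named arguments become name prefixes: A.a, A.b, B.a, B.b
C <- c(A = A, B = B)
names(C)

# All vectors have the same length, so they fit into one data frame
C_df <- as.data.frame(C)
C_df
```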

Call a specific column from every dataframe from list of dataframes

I would like to extract a specific column from every data frame in a list of data frames. Any ideas? This is my code:
# Create dissimilarity matrix
df.diss<-dist(t(df[,6:11]))
mds.table<-list() # empty list of values
for(i in 1:6){ # For Loop to iterate over a command in a function
a<-mds(pk.diss,ndim=i, type="ratio", verbose=TRUE,itmax=1000)
mds.table[[i]]<-a # Store output in empty list
}
Now here is where I'm having trouble. After storing the values, I'm unable to call a specific column from every dataframe from the list.
# This function should call every $stress column from each data frame.
lapply(mds.table, function(x){
mds.table[[x]]$stress
})
Thanks again!
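You are very close. Inside the function, x is already one element of the list, so index x directly instead of mds.table[[x]]: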
set.seed(1)
l_df <- lapply(1:5, function(x){
data.frame(a = sample(1:5,5), b = sample(1:5,5))
})
lapply(l_df, function(x){
x[['a']]
})
[[1]]
[1] 2 5 4 3 1
[[2]]
[1] 2 1 3 4 5
[[3]]
[1] 5 1 2 4 3
[[4]]
[1] 3 5 2 1 4
[[5]]
[1] 5 3 4 2 1
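The anonymous function can also be replaced by the extraction operator itself. The sketch below uses a small hand-built list (not the asker's data); for the list in the question the same pattern would presumably be lapply(mds.table, function(fit) fit$stress), assuming each stored result has a stress component.

```r
# Small illustrative list of data frames
l_df <- list(
  data.frame(a = 1:3, b = 4:6),
  data.frame(a = 7:9, b = 10:12)
)

# `[[` is an ordinary function, so it can be passed to lapply directly
a_cols <- lapply(l_df, `[[`, "a")

# sapply simplifies equal-length results into a matrix, one column per data frame
a_mat <- sapply(l_df, `[[`, "a")

a_cols
a_mat
```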

R: change data frame structure using values from one variable as new variable

df1 <- data.frame(
name = c("a", "b", "b", "c"),
score = c(1, 1, 2, 1)
)
How can I get a new data frame with variables/columns from df1$name, each holding the corresponding df1$score values? I figure that it's actually a two-step problem:
First I would need to make a list of (in this example) unequal length vectors like this:
$a
[1] 1
$b
[1] 1 2
$c
[1] 1
Second, the vectors need to be padded with NAs so that they have equal length before making the desired data frame,
which would look like:
a b c
1 1 1 1
2 NA 2 NA
I cannot find any simple means to do this - I'm sure there must be!
If the solution can be delivered using dplyr it would be fantastic! Thanks!
To split the data:
(s <- split(df1$score, df1$name))
# $a
# [1] 1
#
# $b
# [1] 1 2
#
# $c
# [1] 1
To create the new data frame:
as.data.frame(sapply(s, `length<-`, max(vapply(s, length, 1L))))
# a b c
# 1 1 1 1
# 2 NA 2 NA
Slightly more efficient would be to use vapply in place of sapply
len <- max(vapply(s, length, 1L))
as.data.frame(vapply(s, `length<-`, double(len), len))
# a b c
# 1 1 1 1
# 2 NA 2 NA
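For readers who find `length<-` hard to read, the padding step can be written out explicitly. This is only an equivalent base R sketch of the two steps described in the question, not a dplyr solution:

```r
df1 <- data.frame(
  name = c("a", "b", "b", "c"),
  score = c(1, 1, 2, 1)
)

# Step 1: one vector of scores per name
s <- split(df1$score, df1$name)

# Step 2: pad every vector with NA up to the longest length
len <- max(lengths(s))
padded <- lapply(s, function(v) c(v, rep(NA, len - length(v))))

# Equal-length vectors can now be bound into a data frame
wide <- as.data.frame(padded)
wide
```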

Difference between as.data.frame(x) and data.frame(x)

What is the difference between as.data.frame(x) and data.frame(x)
In the following example, the result is the same with the exception of the column names.
x <- matrix(data=rep(1,9),nrow=3,ncol=3)
> x
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
> data.frame(x)
X1 X2 X3
1 1 1 1
2 1 1 1
3 1 1 1
> as.data.frame(x)
V1 V2 V3
1 1 1 1
2 1 1 1
3 1 1 1
As mentioned by Jaap, data.frame() calls as.data.frame() but there's a reason for it:
as.data.frame() is a method to coerce other objects to class data.frame. If you're writing your own package, you would store your method to convert an object of your_class under as.data.frame.your_class(). Here are just a few examples.
methods(as.data.frame)
[1] as.data.frame.AsIs as.data.frame.Date
[3] as.data.frame.POSIXct as.data.frame.POSIXlt
[5] as.data.frame.aovproj* as.data.frame.array
[7] as.data.frame.character as.data.frame.complex
[9] as.data.frame.data.frame as.data.frame.default
[11] as.data.frame.difftime as.data.frame.factor
[13] as.data.frame.ftable* as.data.frame.integer
[15] as.data.frame.list as.data.frame.logLik*
[17] as.data.frame.logical as.data.frame.matrix
[19] as.data.frame.model.matrix as.data.frame.numeric
[21] as.data.frame.numeric_version as.data.frame.ordered
[23] as.data.frame.raw as.data.frame.table
[25] as.data.frame.ts as.data.frame.vector
Non-visible functions are asterisked
data.frame() can be used to build a data frame, while as.data.frame() can only be used to coerce another object to a data frame.
For example:
# data.frame()
df1 <- data.frame(matrix(1:12,3,4),1:3)
# as.data.frame()
df2 <- as.data.frame(matrix(1:12,3,4),1:3)
df1
# X1 X2 X3 X4 X1.3
# 1 1 4 7 10 1
# 2 2 5 8 11 2
# 3 3 6 9 12 3
df2
# V1 V2 V3 V4
# 1 1 4 7 10
# 2 2 5 8 11
# 3 3 6 9 12
As you noted, the result does differ slightly, and this means that they are not exactly equal:
identical(data.frame(x),as.data.frame(x))
[1] FALSE
So you might need to take care to be consistent in which one you use.
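A quick way to see where the two results differ is to compare the column names and the cell values separately. A minimal sketch on the same 3 x 3 matrix of ones:

```r
x <- matrix(1, nrow = 3, ncol = 3)

d1 <- data.frame(x)
d2 <- as.data.frame(x)

# The automatically generated column names differ
names(d1)
names(d2)

# The cell values themselves agree
values_match <- all(as.matrix(d1) == as.matrix(d2))
values_match
```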
But it is also worth noting that as.data.frame is faster:
library(microbenchmark)
microbenchmark(data.frame(x),as.data.frame(x))
Unit: microseconds
expr min lq median uq max neval
data.frame(x) 71.446 73.616 74.80 78.9445 146.442 100
as.data.frame(x) 25.657 27.631 28.42 29.2100 93.155 100
y <- matrix(1:1e6,1000,1000)
microbenchmark(data.frame(y),as.data.frame(y))
Unit: milliseconds
expr min lq median uq max neval
data.frame(y) 17.23943 19.63163 23.60193 41.07898 130.66005 100
as.data.frame(y) 10.83469 12.56357 14.04929 34.68608 38.37435 100
The difference becomes clearer when you look at their main arguments:
as.data.frame(x, ...): check if object is a data frame, or coerce if possible. Here, "x" can be any R object.
data.frame(...): build a data frame. Here, "..." allows specifying all the components (i.e. the variables of the data frame).
So, the results by Ophelia are similar since both functions received a single matrix as argument: however, when these functions receive 2 (or more) vectors, the distinction becomes clearer:
> # Set seed for reproducibility
> set.seed(3)
> # Create one int vector
> IDs <- seq(1:10)
> IDs
[1] 1 2 3 4 5 6 7 8 9 10
> # Create one char vector
> types <- sample(c("A", "B"), 10, replace = TRUE)
> types
[1] "A" "B" "A" "A" "B" "B" "A" "A" "B" "B"
> # Try to use "as.data.frame" to coerce components into a dataframe
> dataframe_1 <- as.data.frame(IDs, types)
> # Look at the result
> dataframe_1
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
duplicate row.names: A, B
> # Inspect result with head
> head(dataframe_1, n = 10)
IDs
A 1
B 2
A.1 3
A.2 4
B.1 5
B.2 6
A.3 7
A.4 8
B.3 9
B.4 10
> # Check the structure
> str(dataframe_1)
'data.frame': 10 obs. of 1 variable:
$ IDs: int 1 2 3 4 5 6 7 8 9 10
> # Use instead "data.frame" to build a data frame starting from two components
> dataframe_2 <- data.frame(IDs, types)
> # Look at the result
> dataframe_2
IDs types
1 1 A
2 2 B
3 3 A
4 4 A
5 5 B
6 6 B
7 7 A
8 8 A
9 9 B
10 10 B
> # Inspect result with head
> head(dataframe_2, n = 10)
IDs types
1 1 A
2 2 B
3 3 A
4 4 A
5 5 B
6 6 B
7 7 A
8 8 A
9 9 B
10 10 B
> # Check the structure
> str(dataframe_2)
'data.frame': 10 obs. of 2 variables:
$ IDs : int 1 2 3 4 5 6 7 8 9 10
$ types: Factor w/ 2 levels "A","B": 1 2 1 1 2 2 1 1 2 2
As you see "data.frame()" works fine, while "as.data.frame()" produces an error as it recognises the first argument as the object to be checked and coerced.
To sum up, "as.data.frame()" should be used to convert/coerce one single R object into a data frame (as you correctly did using a matrix), while "data.frame()" to build a data frame from scratch.
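If several vectors do need to go through as.data.frame(), one approach is to wrap them in a single object first, for example a named list, so that the function still receives one object to coerce. A sketch with a deterministic types vector (an assumption, used instead of sample() so the result does not depend on the random seed):

```r
IDs <- 1:10
types <- rep(c("A", "B"), 5)

# Build from separate components
built <- data.frame(IDs, types)

# Coerce one object (a named list) that holds both components
coerced <- as.data.frame(list(IDs = IDs, types = types))

names(built)
dim(coerced)
```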
Try
colnames(x) <- c("C1","C2","C3")
and then both will give the same result
identical(data.frame(x), as.data.frame(x))
What is more startling are things like the following:
list(x)
provides a one-element list, the element being the matrix x; whereas
as.list(x)
gives a list with 9 elements, one for each matrix entry
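The difference is easy to confirm by checking the lengths of the two results:

```r
x <- matrix(1, nrow = 3, ncol = 3)

# list() wraps the matrix as a single element
n_wrapped <- length(list(x))

# as.list() splits the matrix into one element per entry
n_split <- length(as.list(x))

n_wrapped
n_split
```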
MM
Looking at the code, as.data.frame fails faster. data.frame will issue warnings, and do things like remove rownames if there are duplicates:
> x <- matrix(data=rep(1,9),nrow=3,ncol=3)
> rownames(x) <- c("a", "b", "b")
> data.frame(x)
X1 X2 X3
1 1 1 1
2 1 1 1
3 1 1 1
Warning message:
In data.row.names(row.names, rowsi, i) :
some row.names duplicated: 3 --> row.names NOT used
> as.data.frame(x)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names =
TRUE, :
duplicate row.names: b
