I have data on every interaction that could and did happen at a university club's weekly social hour.
A sample of my data is as follows:
structure(list(from = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), to = structure(c(2L, 3L, 2L, 3L,
2L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B", "C"), class = "factor"), timestalked = c(0L, 1L, 0L, 4L,
1L, 2L, 0L, 1L, 0L, 2L, 1L, 0L, 1L, 2L, 1L, 0L, 0L, 0L), week = structure(c(1L,
1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L,
2L), .Label = c("1/1/2010", "1/15/2010", "1/8/2010"), class = "factor")), .Names = c("from",
"to", "timestalked", "week"), class = "data.frame", row.names = c(NA,
-18L))
I am trying to calculate network statistics such as centrality for A, B, and C for each individual week, for the last two weeks, and for the year to date. The only way I have gotten this to work is by manually breaking the file up into the time units I want to analyze, but I hope there is a less laborious way.
When timestalked is 0, it should be treated as no edge.
The output should be a .csv with the following layout:
actor cent_week1 cent_week2 cent_week3 cent_last2weeks cent_yeartodate
A
B
C
with cent_week1 being the 1/1/2010 centrality; cent_last2weeks considering only 1/8/2010 and 1/15/2010; and cent_yeartodate considering all of the data at once. This will be applied to a MUCH larger dataset with millions of observations.
You can do this by setting up your windows in another table, then doing by-group operations on each of the windows:
Data Preparation:
# Load Data
DT <- structure(list(from = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), to = structure(c(2L, 3L, 2L, 3L,
2L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B", "C"), class = "factor"), timestalked = c(0L, 1L, 0L, 4L,
1L, 2L, 0L, 1L, 0L, 2L, 1L, 0L, 1L, 2L, 1L, 0L, 0L, 0L), week = structure(c(1L,
1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L,
2L), .Label = c("1/1/2010", "1/15/2010", "1/8/2010"), class = "factor")), .Names = c("from",
"to", "timestalked", "week"), class = "data.frame", row.names = c(NA,
-18L))
# Code
library(igraph)
library(data.table)
setDT(DT)
# setup events
DT <- DT[timestalked > 0]
DT[, week := as.Date(week, format = "%m/%d/%Y")]
# setup windows, edit as needed
date_ranges <- data.table(label = c("cent_week_1","cent_week_2","cent_last2weeks","cent_yeartodate"),
week_from = as.Date(c("2010-01-01","2010-01-08","2010-01-08","2010-01-01")),
week_to = as.Date(c("2010-01-01","2010-01-08","2010-01-15","2010-01-15"))
)
# find all events within windows
DT[, JA := 1]
date_ranges[, JA := 1]
graph_base <- merge(DT, date_ranges, by = "JA", allow.cartesian = TRUE)[week >= week_from & week <= week_to]
Here is the by-group code. The second line is a bit gross; I'm open to ideas about how to avoid the double call to eigen_centrality() (one possible workaround is sketched after these two lines).
graph_base <- graph_base[, .(graphs = list(graph_from_data_frame(.SD))), by = label, .SDcols = c("from", "to", "timestalked")] # create graphs
graph_base <- graph_base[, .(vertex = names(eigen_centrality(graphs[[1]])$vector), ec = eigen_centrality(graphs[[1]])$vector), by = label] # calculate centrality
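One possible way to avoid that double call (a sketch, not tested at scale): replace the second line with a braced j expression that computes the centrality vector once per group and returns both columns from it.
graph_base <- graph_base[, {
  ec <- eigen_centrality(graphs[[1]])$vector
  list(vertex = names(ec), ec = ec)
}, by = label]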
Then dcast for the final formatting:
dcast(graph_base, vertex ~ label, value.var = "ec")
vertex cent_last2weeks cent_week_1 cent_week_2 cent_yeartodate
1: A 1.0000000 0.7071068 0.8944272 0.9397362
2: B 0.7052723 0.7071068 0.4472136 0.7134685
3: C 0.9008487 1.0000000 1.0000000 1.0000000
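Since the question asks for a .csv, the wide result can be written out directly with data.table's fwrite (the file name here is just illustrative):
fwrite(dcast(graph_base, vertex ~ label, value.var = "ec"),
       "centrality_by_window.csv")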
I can't comment, so I'm writing an "answer". If you want to perform some mathematical operation on timestalked and get values by from (I didn't find any variable called actor in your example), here's a data.table approach that can be helpful:
library(data.table)
dat <- as.data.table(dat) # or add 'data.table' to the class argument
dat$week <- as.Date(dat$week, format = "%m/%d/%Y")
dat[, .(cent = mean(timestalked)), by = list(from, weeknum = week(week))]
This gives the following output:
dat[, .(cent = mean(timestalked)), by = list(from, weeknum = week(week))]
from weeknum cent
1: A 1 0.5
2: A 2 2.0
3: A 3 1.5
4: B 1 0.5
5: B 2 1.0
6: B 3 0.5
7: C 1 1.5
8: C 2 0.5
9: C 3 0.0
Assign this to new_dat. You can subset by week simply with new_dat[weeknum %in% 2:3] (or whatever other variation you want), or sum over the year, as shown below. You can also sort/order as desired.
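For example (a sketch, assuming the summary above has been assigned to new_dat):
new_dat <- dat[, .(cent = mean(timestalked)), by = list(from, weeknum = week(week))]
new_dat[weeknum %in% 2:3]                        # just the last two weeks
new_dat[, .(cent_year = sum(cent)), by = from]   # sum over the whole year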
Hope this helps!
How about:
library(dplyr)
centralities <- tmp %>%
  group_by(week) %>%
  filter(timestalked > 0) %>%
  do(
    week_graph = igraph::graph_from_edgelist(as.matrix(cbind(.$from, .$to)))
  ) %>%
  do(
    ecs = igraph::eigen_centrality(.$week_graph)$vector
  ) %>%
  summarise(ecs_A = ecs[[1]], ecs_B = ecs[[2]], ecs_C = ecs[[3]])
You can use summarise_all if you have a lot of actors. Putting it into long format is left as an exercise (one possible approach is sketched below).
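A hedged sketch of one way to get the long format, assuming tidyr is available and that the result above is named centralities with one column per actor (ecs_A, ecs_B, ecs_C):
library(tidyr)
centralities %>%
  pivot_longer(starts_with("ecs_"), names_to = "actor",
               names_prefix = "ecs_", values_to = "centrality")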
This analysis follows the general split-apply-combine approach, where the data are split by week, graph functions are applied, and then the results are combined. There are several tools for this, but below I use base R and data.table.
Base R
First set the date class for your data, so that the term "last two weeks" has meaning.
# Set date class and order
d$week <- as.Date(d$week, format="%m/%d/%Y")
d <- d[order(d$week), ]
d <- d[d$timestalked > 0, ] # remove zero edges // not needed if using weights
Then split and apply graph functions
# split data and form a graph for each week
g1 <- lapply(split(seq(nrow(d)), d$week), function(i)
  graph_from_data_frame(d[i,]))
# you can then run graph functions to extract specific measures
(grps <- sapply(g1, function(x) eigen_centrality(x,
weights = E(x)$timestalked)$vector))
# 2010-01-01 2010-01-08 2010-01-15
# A 0.5547002 0.9284767 1.0000000
# B 0.8320503 0.3713907 0.7071068
# C 1.0000000 1.0000000 0.7071068
# Aside: If you only have one function to run on the graphs,
# you could do this in one step
#
# sapply(split(seq(nrow(d)), d$week), function(i) {
# x = graph_from_data_frame(d[i,])
# eigen_centrality(x, weights = E(x)$timestalked)$vector
# })
You then need to combine this with the analysis on all the data; as you only have to build two further graphs, this is not the time-consuming part.
fun1 <- function(i, name) {
x = graph_from_data_frame(i)
d = data.frame(eigen_centrality(x, weights = E(x)$timestalked)$vector)
setNames(d, name)
}
a = fun1(d, "alldata")
lt = fun1(d[d$week %in% tail(unique(d$week), 2), ], "lasttwo")
# Combine: could use `cbind` in this example, but perhaps `merge` is
# safer if there are different levels between dates
data.frame(grps, lt, a) # or
Reduce(merge, lapply(list(grps, a, lt), function(x) data.frame(x, nms = row.names(x))))
# nms X2010.01.01 X2010.01.08 X2010.01.15 alldata lasttwo
# 1 A 0.5547002 0.9284767 1.0000000 0.909899 1.0
# 2 B 0.8320503 0.3713907 0.7071068 0.607475 0.5
# 3 C 1.0000000 1.0000000 0.7071068 1.000000 1.0
data.table
It is likely that the time-consuming step will be explicitly split-applying the function over the data. data.table should offer some benefit here, especially when the data becomes large, and/or there are more groups.
# function to apply to graph
fun <- function(d) {
x = graph_from_data_frame(d)
e = eigen_centrality(x, weights = E(x)$timestalked)$vector
list(e, names(e))
}
library(data.table)
dcast(
setDT(d)[, fun(.SD), by=week], # apply function - returns data in long format
V2 ~ week, value.var = "V1") # convert to wide format
# V2 2010-01-01 2010-01-08 2010-01-15
# 1: A 0.5547002 0.9284767 1.0000000
# 2: B 0.8320503 0.3713907 0.7071068
# 3: C 1.0000000 1.0000000 0.7071068
Then just run the function over the full data and the last two weeks, as before.
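For example (a sketch; d has already been converted with setDT above, so the same j expression can be reused with a row filter):
d[, fun(.SD)]                                        # year to date
d[week %in% tail(sort(unique(week)), 2), fun(.SD)]   # last two weeks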
There are differences between the answers, which comes down to how the weights argument is used when calculating the centralities: this answer uses the weights, whereas the others don't.
d=structure(list(from = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), to = structure(c(2L, 3L, 2L, 3L,
2L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B", "C"), class = "factor"), timestalked = c(0L, 1L, 0L, 4L,
1L, 2L, 0L, 1L, 0L, 2L, 1L, 0L, 1L, 2L, 1L, 0L, 0L, 0L), week = structure(c(1L,
1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L,
2L), .Label = c("1/1/2010", "1/15/2010", "1/8/2010"), class = "factor")), .Names = c("from",
"to", "timestalked", "week"), class = "data.frame", row.names = c(NA,
-18L))
I am given a big data set with several columns. As an example:
set.seed(1)
x <- 1:15
y <- letters[1:3][sample(1:3, 15, replace = T)]
z <- letters[10:13][sample(1:3, 15, replace = T)]
r <- letters[20:24][sample(1:3, 15, replace = T)]
df <- data.frame("Number"=x, "Section"=y,"Chapter"=z,"Rating"=r)
dput(df)
structure(list(Number = 1:15, Area = structure(c(1L, 2L, 2L, 3L, 1L, 3L, 3L, 2L, 2L, 1L, 1L, 1L, 3L, 2L, 3L), .Label = c("a", "b", "c"), class = "factor"), Section = structure(c(2L, 3L, 3L, 2L, 3L, 3L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 3L, 2L), .Label = c("j", "k", "l"), class = "factor"), Rating = structure(c(2L, 2L, 2L, 1L, 3L, 3L, 3L, 1L, 3L, 2L, 3L, 2L, 3L, 2L, 2L), .Label = c("A", "B", "C"), class = "factor")), class = "data.frame", row.names = c(NA,-15L))
I would now like to create frequency tables and graphs split by Rating and a chosen category, e.g. specified via a string:
library(plyr)
Category <- "Section"
data_count <- ddply(df, .(get(Category), Rating), 'count')
data_rel_freq <- ddply(data_count, .(Rating), transform, rel_freq = freq/sum(freq))
dput(data_rel_freq)
structure(list(get.Category. = structure(c(2L, 2L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("j", "k", "l"), class = "factor"), Number = c(4L, 8L, 10L, 12L, 1L, 15L, 2L, 3L, 14L, 7L, 9L, 11L, 13L, 5L, 6L), Area = structure(c(3L, 2L, 1L, 1L, 1L, 3L, 2L, 2L, 2L, 3L, 2L, 1L, 3L, 1L, 3L), .Label = c("a", "b", "c"), class = "factor"), Section = structure(c(2L, 2L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("j", "k", "l"), class = "factor"), Rating = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), freq = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), rel_freq = c(0.5, 0.5, 0.142857142857143, 0.142857142857143, 0.142857142857143, 0.142857142857143, 0.142857142857143, 0.142857142857143, 0.142857142857143, 0.166666666666667, 0.166666666666667, 0.166666666666667, 0.166666666666667, 0.166666666666667, 0.166666666666667)), class = "data.frame", row.names = c(NA, -15L))
Using ggplot
ggplot(data_rel_freq,aes(x = Rating, y = rel_freq,fill = get(Category)))+
geom_bar(position = "fill",stat = "identity",color="black") +
scale_y_continuous(labels = percent_format())+
labs(x = "Rating", y="Relative Frequency")
The issue is that get(Category) is treated as a new column:
get.Category. Number Area Section Rating freq rel_freq
1 k 4 c k A 1 0.5000000
2 k 8 b k A 1 0.5000000
3 j 10 a j B 1 0.1428571
4 j 12 a j B 1 0.1428571
5 k 1 a k B 1 0.1428571
6 k 15 c k B 1 0.1428571
7 l 2 b l B 1 0.1428571
Moreover, the Number column should be summed, i.e. the other categories (here: Area) should be dropped, and we should end up with just one line for Section "k" with Rating "A".
We can use count to get the frequency of the column 'Section' by evaluating the object identifier 'Category' after converting it to a symbol (sym) and evaluating (!!) it. Within the ggplot syntax, aes can also take a symbol and evaluate (!!) it in the same way.
library(tidyverse)
library(scales)
library(ggplot2)
df %>%
count(!! rlang::sym(Category), Rating) %>%
group_by(Rating) %>%
mutate(rel_freq = n/sum(n)) %>%
ggplot(., aes(x =Rating, y = rel_freq, fill = !! rlang::sym(Category))) +
geom_bar(position = "fill",stat = "identity",color="black") +
scale_y_continuous(labels = percent_format())+
labs(x = "Rating", y="Relative Frequency")
Output: a stacked bar chart of relative frequencies by Rating, filled by the chosen category.
Here is a sample dataset.
test_data <- structure(list(ID = structure(c(4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("P39190",
"U93491", "X28348", "Z93930"), class = "factor"), Sex = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L), .Label = c("F", "M"), class = "factor"), Group = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("C83Z", "CAP_1", "P000"), class = "factor"),
Category = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L), .Label = c("A",
"B", "C"), class = "factor")), .Names = c("ID", "Sex", "Group",
"Category"), class = "data.frame", row.names = c(NA, -36L))
head(test_data, n = 10)
ID Sex Group Category
1 Z93930 M CAP_1 A
2 Z93930 M CAP_1 A
3 Z93930 M C83Z A
4 Z93930 M C83Z A
5 Z93930 M C83Z A
6 Z93930 M C83Z A
7 X28348 F C83Z B
8 X28348 F C83Z B
9 X28348 F CAP_1 B
10 X28348 F CAP_1 B
I want to count the number of unique elements at three levels:
Count of unique elements per "Category"
Count of unique elements in each "Category" grouped by "Group"
Count of unique elements in each "Group" grouped by "Sex"
I can of course use base R and a bit of dplyr to achieve this:
library(dplyr)
for(i in 1:length(unique(test_data$Category))){
  temp <- test_data %>% dplyr::filter(Category == unique(test_data$Category)[i])
  message(paste0(unique(test_data$Category)[i]), ": ", length(unique(temp$ID)))
  for(k in 1:length(unique(temp$Group))){
    temp_grp <- temp %>% dplyr::filter(Group == unique(temp$Group)[k])
    message(paste0("\n ├──", unique(temp$Group)[k],
                   ": ", length(unique(temp_grp$ID))))
    message(paste0("\n\t"), "F: ", length(unique(temp_grp[which(temp_grp$Sex == "F"),])$ID))
    message(paste0("\n\t"), "M: ", length(unique(temp_grp[which(temp_grp$Sex == "M"),])$ID))
  }
}
But this is messy and inelegant.
Is there a function in R that can achieve this in a cleaner and more efficient manner and preferably produce the output in the form of a dataframe?
I was under the impression that dplyr::group_by was made for such tasks. But I cannot quite figure out how it works for sub-groupings.
The code below:
test_data %>% dplyr::group_by(Category) %>% summarise(n = n_distinct(ID))
achieves the first task (point 1. above). But I cannot achieve points 2 and 3 in the same way.
SOLUTION:
test_data %>% dplyr::group_by(Category, Group, Sex) %>% summarise(n = n_distinct(ID))
If I understand your question correctly, you were not very far from it at all. The idea is just to group by two columns at a time this way: group_by(col1, col2).
For point 2:
test_data %>% dplyr::group_by(Category, Group) %>% summarise(n = n_distinct(ID))
Source: local data frame [9 x 3]
Groups: Category [?]
Category Group n
<fctr> <fctr> <int>
1 A C83Z 1
2 A CAP_1 1
3 A P000 2
4 B C83Z 1
5 B CAP_1 1
6 B P000 1
7 C C83Z 1
8 C CAP_1 1
9 C P000 2
And for point 3:
test_data %>% dplyr::group_by(Group, Sex) %>% summarise(n = n_distinct(ID))
If I understand correctly, you can use dplyr::count for all three cases
test_data %>% dplyr::count(Category)
test_data %>% dplyr::count(Group, Category)
test_data %>% dplyr::count(Sex, Group)
For an example dataframe:
df <- structure(list(id = 1:18, region = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), age.cat = structure(c(1L, 1L, 2L, 2L,
2L, 3L, 3L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L), .Label = c("0-18",
"19-35", "36-50", "50+"), class = "factor")), .Names = c("id",
"region", "age.cat"), class = "data.frame", row.names = c(NA,
-18L))
I want to reshape the data, as detailed below:
region 0-18 19-35 36-50 50+
a 2 3 2 1
b 4 2 1 3
Do I simply aggregate or reshape the data? Any help would be much appreciated.
You can do it just using table:
table(df$region, df$age.cat)
0-18 19-35 36-50 50+
a 2 3 2 1
b 4 2 1 3
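If you need the result as a data frame in the layout shown in the question rather than a table object, one possible follow-up (a sketch; region ends up as row names rather than a column) is:
as.data.frame.matrix(table(df$region, df$age.cat))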
Using reshape2:
install.packages('reshape2')
library(reshape2)
df1 <- melt(df, measure.vars = 'age.cat')
df1 <- dcast(df1, region ~ value)
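An alternative sketch that skips the melt step and lets dcast count rows directly; value.var = "id" here just names the column whose entries are counted:
dcast(df, region ~ age.cat, fun.aggregate = length, value.var = "id")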
So, I have a data frame with several continuous variables and several dummy variables. The survey that this data frame comes from uses 6, 7, 8, and 9 to denote different types of non-response, so I would like to replace 6, 7, 8, and 9 with NA whenever they show up in a dummy-variable column, but leave them alone in the continuous-variable columns.
Is there a concise way to go about doing this?
Here's my data:
> dput(head(sfsuse[c(4:16)]))
structure(list(famsize = c(3L, 1L, 2L, 5L, 3L, 5L), famtype = c(2L,
1L, 2L, 3L, 2L, 3L), cc = c(1L, 1L, 1L, 1L, 1L, 1L), nocc = c(1L,
1L, 1L, 3L, 1L, 1L), pdloan = c(2L, 2L, 2L, 2L, 2L, 2L), help = c(2L,
2L, 2L, 2L, 2L, 2L), budget = c(1L, 1L, 1L, 1L, 2L, 2L), income = c(340000L,
20500L, 0L, 165000L, 95000L, -320000L), govtrans = c(7500L, 15500L,
22000L, 350L, 0L, 9250L), childexp = c(0L, 0L, 0L, 0L, 0L, 0L
), homeown = c(1L, 1L, 1L, 1L, 1L, 2L), bank = c(2000L, 80000L,
25000L, 20000L, 57500L, 120000L), vehval = c(33000L, 7500L, 5250L,
48000L, 8500L, 50000L)), .Names = c("famsize", "famtype", "cc",
"nocc", "pdloan", "help", "budget", "income", "govtrans", "childexp",
"homeown", "bank", "vehval"), row.names = c(NA, 6L), class = "data.frame")
I'm trying to sub in NA for 6, 7, 8, and 9 in columns 3:7 and column 11. I know how to do this one column at a time by column name:
df$name[df$name %in% 6:9] <- NA
but I would have to do this for each column by name. Is there a concise way to do it by column index?
Thanks
This function should work
f <- function(data, k) {
  data[data[, k] %in% 6:9, k] <- NA  # replace non-response codes 6-9 with NA in column k
  data
}
Now at the console:
> for (k in c(3:7,11)) { data <- f(data,k) }
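A more vectorized sketch of the same idea, assuming your data frame is called df (substitute your actual object name) and the target columns are 3:7 and 11 as in the question:
cols <- c(3:7, 11)
df[cols] <- lapply(df[cols], function(x) replace(x, x %in% 6:9, NA))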
Here is a simple question. I have a data frame with values ranging from 0 to 3, and I want to get the number of distinct elements in the dataset, which should be 4 in this case. Here is an example of the data:
structure(list(X1 = c(2L, 2L, 2L, 2L, 2L, 1L, 3L, 2L, 2L), X2 = c(1L,
1L, 1L, 2L, 1L, 0L, 2L, 3L, 1L), X3 = c(2L, 1L, 2L, 2L, 0L, 0L,
2L, 3L, 1L), X4 = c(1L, 2L, 2L, 2L, 1L, 2L, 0L, 2L, 2L), X5 = c(1L,
2L, 1L, 2L, 1L, 0L, 1L, 2L, 1L), X6 = c(1L, 2L, 1L, 1L, 1L, 2L,
1L, 2L, 1L)), .Names = c("X1", "X2", "X3", "X4", "X5", "X6"), class = "data.frame", row.names = c(NA,
-9L))
I've tried diff(range(d)) but it doesn't count 0. Thanks in advance.
length(unique(...)) does some possibly unexpected (although thoroughly documented) things when applied to a matrix or data frame.
s <- structure(list(X1 = c(2L, 2L, 2L, 2L, 2L, 1L, 3L, 2L, 2L), X2 = c(1L,
1L, 1L, 2L, 1L, 0L, 2L, 3L, 1L), X3 = c(2L, 1L, 2L, 2L, 0L, 0L,
2L, 3L, 1L), X4 = c(1L, 2L, 2L, 2L, 1L, 2L, 0L, 2L, 2L), X5 = c(1L,
2L, 1L, 2L, 1L, 0L, 1L, 2L, 1L), X6 = c(1L, 2L, 1L, 1L, 1L, 2L,
1L, 2L, 1L)), .Names = c("X1", "X2", "X3", "X4", "X5", "X6"), class = "data.frame", row.names = c(NA,
-9L))
When applied to a data frame, unique returns the unique rows in the data frame. length() then counts the number of columns in the data frame. So in general (I can't think of a counterexample), this will always be equal to ncol(s).
length(unique(s)) ## 6
unique applied to a matrix also returns the unique rows, but now length() counts the total number of elements: for your data this will usually be equivalent to ncol(s)*nrow(s).
length(unique(as.matrix(s))) ## 54
If you want to apply unique to the elements in this situation, you probably want one of the following, all of which collapse the original data frame down to a single vector:
length(unique(as.vector(as.matrix(s)))) ## 4
length(unique(unlist(s))) ## 4
length(unique(c(as.matrix(s)))) ## 4
Whether you want diff(range(x))+1 or length(unique(...)) depends on how you would want to count a data frame composed (for example) entirely of {0,1,2,4} -- should that return 4 or 5? (As #Brian Diggs points out in his answer, diff(range(...))+1 will work on a matrix, without needing to flatten the structure further -- it will also work on an unlist()ed data frame.)
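A quick illustration of that distinction, using a hypothetical vector that contains only the values 0, 1, 2, and 4:
x <- c(0, 1, 2, 4, 4, 2)
length(unique(x))    # 4: counts the distinct values actually present
diff(range(x)) + 1   # 5: assumes every integer between min and max is a possible value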
diff(range(d)) returns the difference between the minimum and maximum, which are 0 and 3 respectively.
What you want to do is count how many elements there are in a set. Try length(d)
d <- 0:3
length(d)
Incorporating the comments on this answer... let the code speak:
Example data:
dataset = 1:136
dataset = dataset %% 4
dim(dataset) <- c(4,34)  # now we have a table
diff(range(dataset))+1
It returns 4 like you wanted
Given the structure of d you have now provided, you can do a column-by-column calculation of this.
> diff(range(d$X1))+1
[1] 3
> diff(range(d$X2))+1
[1] 4
> diff(range(d$X3))+1
[1] 4
> diff(range(d$X4))+1
[1] 3
> diff(range(d$X5))+1
[1] 3
> diff(range(d$X6))+1
[1] 2
Or you can loop over all the columns
> lapply(d, function(dp) {diff(range(dp))+1})
$X1
[1] 3
$X2
[1] 4
$X3
[1] 4
$X4
[1] 3
$X5
[1] 3
$X6
[1] 2
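A more compact variant of the same loop (a sketch) that returns a named vector rather than a list:
> sapply(d, function(dp) diff(range(dp)) + 1)
X1 X2 X3 X4 X5 X6 
 3  4  4  3  3  2 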
Or if you want the range for all the columns collectively, treat it as a matrix:
> diff(range(as.matrix(d)))+1
[1] 4