Getting the range of a data frame including zero - r

Here is a simple question. I have a data frame with values ranging from 0 to 3, and I want to get the number of distinct values in the data, which should be 4 in this case (0, 1, 2 and 3). Here is an example of the data:
structure(list(X1 = c(2L, 2L, 2L, 2L, 2L, 1L, 3L, 2L, 2L), X2 = c(1L,
1L, 1L, 2L, 1L, 0L, 2L, 3L, 1L), X3 = c(2L, 1L, 2L, 2L, 0L, 0L,
2L, 3L, 1L), X4 = c(1L, 2L, 2L, 2L, 1L, 2L, 0L, 2L, 2L), X5 = c(1L,
2L, 1L, 2L, 1L, 0L, 1L, 2L, 1L), X6 = c(1L, 2L, 1L, 1L, 1L, 2L,
1L, 2L, 1L)), .Names = c("X1", "X2", "X3", "X4", "X5", "X6"), class = "data.frame", row.names = c(NA,
-9L))
I've tried diff(range(d)), but it returns 3 because it doesn't count 0. Thanks in advance.

length(unique(...)) does some possibly unexpected (although thoroughly documented) things when applied to a matrix or data frame.
s <- structure(list(X1 = c(2L, 2L, 2L, 2L, 2L, 1L, 3L, 2L, 2L), X2 = c(1L,
1L, 1L, 2L, 1L, 0L, 2L, 3L, 1L), X3 = c(2L, 1L, 2L, 2L, 0L, 0L,
2L, 3L, 1L), X4 = c(1L, 2L, 2L, 2L, 1L, 2L, 0L, 2L, 2L), X5 = c(1L,
2L, 1L, 2L, 1L, 0L, 1L, 2L, 1L), X6 = c(1L, 2L, 1L, 1L, 1L, 2L,
1L, 2L, 1L)), .Names = c("X1", "X2", "X3", "X4", "X5", "X6"), class = "data.frame", row.names = c(NA,
-9L))
When applied to a data frame, unique returns the unique rows in the data frame. length() then counts the number of columns in the data frame. So in general (I can't think of a counterexample), this will always be equal to ncol(s).
length(unique(s)) ## 6
unique applied to a matrix also returns the unique rows, but now length() counts the total number of elements: for your data this will usually be equivalent to ncol(s)*nrow(s).
length(unique(as.matrix(s))) ## 54
If you want to apply unique to the elements in this situation, you probably want one of the following, all of which collapse the original data frame down to a single vector:
length(unique(as.vector(as.matrix(s)))) ## 4
length(unique(unlist(s))) ## 4
length(unique(c(as.matrix(s)))) ## 4
Whether you want diff(range(x))+1 or length(unique(...)) depends on how you would want to count a data frame composed (for example) entirely of {0,1,2,4} -- should that return 4 or 5? (As @Brian Diggs points out in his answer, diff(range(...))+1 will work on a matrix, without needing to flatten the structure further -- it will also work on an unlist()ed data frame.)
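A quick illustration of that distinction, using a made-up vector drawn from {0,1,2,4}:
x <- c(0, 1, 2, 4)
diff(range(x)) + 1   ## 5 -- counts every integer between the min and the max
length(unique(x))    ## 4 -- counts only the values actually present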

diff(range(d)) returns the difference between the minimum and the maximum, which are 0 and 3 respectively, so it gives 3.
What you want to do is count how many elements there are in the set of possible values. Try length(d):
d <- 0:3
length(d)
Incorporating the comments on this answer... let the code speak.
Example data:
dataset = 1:136
dataset = dataset %% 4
dim(dataset) <- c(4,34)  # now we have a table
diff(range(dataset))+1
It returns 4 like you wanted

Given the structure of d you have now provided, you can do a column-by-column calculation of this.
> diff(range(d$X1))+1
[1] 3
> diff(range(d$X2))+1
[1] 4
> diff(range(d$X3))+1
[1] 4
> diff(range(d$X4))+1
[1] 3
> diff(range(d$X5))+1
[1] 3
> diff(range(d$X6))+1
[1] 2
Or you can loop over all the columns
> lapply(d, function(dp) {diff(range(dp))+1})
$X1
[1] 3
$X2
[1] 4
$X3
[1] 4
$X4
[1] 3
$X5
[1] 3
$X6
[1] 2
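Or, if you prefer a named vector instead of a list, the same loop with sapply (a minor variation on the code above):
> sapply(d, function(dp) {diff(range(dp))+1})
X1 X2 X3 X4 X5 X6
 3  4  4  3  3  2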
Or if you want the range for all the columns collectively, treat it as a matrix:
> diff(range(as.matrix(d)))+1
[1] 4

Related

Time varying network in r

I have data on every interaction that could and did happen at a university club weekly social hour
A sample of my data is as follows
structure(list(from = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), to = structure(c(2L, 3L, 2L, 3L,
2L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B", "C"), class = "factor"), timestalked = c(0L, 1L, 0L, 4L,
1L, 2L, 0L, 1L, 0L, 2L, 1L, 0L, 1L, 2L, 1L, 0L, 0L, 0L), week = structure(c(1L,
1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L,
2L), .Label = c("1/1/2010", "1/15/2010", "1/8/2010"), class = "factor")), .Names = c("from",
"to", "timestalked", "week"), class = "data.frame", row.names = c(NA,
-18L))
I am trying to calculate network statistics such as centrality for A, B, C for each individual week, the last two weeks, and year to date. The only way I have gotten this to work is by manually breaking the file up into the time units I want to analyze, but there has to be a less laborious way, I hope.
When timestalked is 0, it should be treated as no edge.
The output would produce a .csv with the following:
actor cent_week1 cent_week2 cent_week3 cent_last2weeks cent_yeartodate
A
B
C
with cent_week1 being the 1/1/2010 centrality; cent_last2weeks considering only 1/8/2010 and 1/15/2010; and cent_yeartodate considering all of the data at once. This is being applied to a MUCH larger dataset of millions of observations.
You can do this by setting your windows up in another table, then doing by-group operations on each of the windows:
Data Preparation:
# Load Data
DT <- structure(list(from = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), to = structure(c(2L, 3L, 2L, 3L,
2L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B", "C"), class = "factor"), timestalked = c(0L, 1L, 0L, 4L,
1L, 2L, 0L, 1L, 0L, 2L, 1L, 0L, 1L, 2L, 1L, 0L, 0L, 0L), week = structure(c(1L,
1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L,
2L), .Label = c("1/1/2010", "1/15/2010", "1/8/2010"), class = "factor")), .Names = c("from",
"to", "timestalked", "week"), class = "data.frame", row.names = c(NA,
-18L))
# Code
library(igraph)
library(data.table)
setDT(DT)
# setup events
DT <- DT[timestalked > 0]
DT[, week := as.Date(week, format = "%m/%d/%Y")]
# setup windows, edit as needed
date_ranges <- data.table(label = c("cent_week_1","cent_week_2","cent_last2weeks","cent_yeartodate"),
week_from = as.Date(c("2010-01-01","2010-01-08","2010-01-08","2010-01-01")),
week_to = as.Date(c("2010-01-01","2010-01-08","2010-01-15","2010-01-15"))
)
# find all events within windows
DT[, JA := 1]
date_ranges[, JA := 1]
graph_base <- merge(DT, date_ranges, by = "JA", allow.cartesian = TRUE)[week >= week_from & week <= week_to]
Here is now the by-group code. The second line is a bit gross; open to ideas about how to avoid calling eigen_centrality twice.
graph_base <- graph_base[, .(graphs = list(graph_from_data_frame(.SD))), by = label, .SDcols = c("from", "to", "timestalked")] # create graphs
graph_base <- graph_base[, .(vertex = names(eigen_centrality(graphs[[1]])$vector), ec = eigen_centrality(graphs[[1]])$vector), by = label] # calculate centrality
dcast for final formatting:
dcast(graph_base, vertex ~ label, value.var = "ec")
vertex cent_last2weeks cent_week_1 cent_week_2 cent_yeartodate
1: A 1.0000000 0.7071068 0.8944272 0.9397362
2: B 0.7052723 0.7071068 0.4472136 0.7134685
3: C 0.9008487 1.0000000 1.0000000 1.0000000
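If you want the .csv from the question, one possible final step (a sketch only; the file name is a placeholder) is to write that dcast result out with data.table's fwrite:
fwrite(dcast(graph_base, vertex ~ label, value.var = "ec"), "centrality_by_window.csv")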
I can't comment, so I'm writing an "answer". If you want to perform some mathematical operation on timestalked and get values grouped by from (I didn't find any variable called actor in your example), here's a data.table approach that can be helpful:
dat <- as.data.table(dat) # or add 'data.table' to the class parameter
dat$week <- as.Date(dat$week, format = "%m/%d/%Y")
dat[, .(cent = mean(timestalked)), by = list(from, weeknum = week(week))]
This gives the below output:
dat[, .(cent = mean(timestalked)), by = list(from, weeknum = week(week))]
from weeknum cent
1: A 1 0.5
2: A 2 2.0
3: A 3 1.5
4: B 1 0.5
5: B 2 1.0
6: B 3 0.5
7: C 1 1.5
8: C 2 0.5
9: C 3 0.0
Assign this to new_dat. You can subset by week simply with new_dat[weeknum %in% 2:3], or whatever other variation you want, or sum over the year (see the sketch below). You can also sort/order as desired.
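For instance, a rough sketch of the last-two-weeks and year-to-date summaries, going back to dat so that you don't average the per-week averages (this follows the mean-of-timestalked interpretation used above):
dat[week(week) %in% 2:3, .(cent_last2weeks = mean(timestalked)), by = from]
dat[, .(cent_yeartodate = mean(timestalked)), by = from]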
Hope this helps!
How about:
library(dplyr)
centralities <- tmp %>%
group_by(week) %>%
filter(timestalked > 0) %>%
do(
week_graph=igraph::graph_from_edgelist(as.matrix(cbind(.$from, .$to)))
) %>%
do(
ecs = igraph::eigen_centrality(.$week_graph)$vector
) %>%
summarise(ecs_A = ecs[[1]], ecs_B = ecs[[2]], ecs_C = ecs[[3]])
You can use summarise_all if you have a lot of actors. Putting it in long format is left as an exercise.
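One possible sketch of that long-format step, assuming the ecs_ columns above and that tidyr is available:
library(tidyr)
centralities %>%
pivot_longer(starts_with("ecs_"), names_to = "actor", values_to = "eigen_centrality", names_prefix = "ecs_")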
This analysis follows the general split-apply-combine approach, where the data are split by week, graph functions are applied, and then the results are combined together. There are several tools for this, but the answer below uses base R and data.table.
Base R
First set the date class on your data, so that the term "last two weeks" has meaning.
# Set date class and order
d$week <- as.Date(d$week, format="%m/%d/%Y")
d <- d[order(d$week), ]
d <- d[d$timestalked > 0, ] # remove zero-weight edges; not needed if using weights
Then split and apply graph functions
# split data and form graph for each week
g1 <- lapply(split(seq(nrow(d)), d$week), function(i)
graph_from_data_frame(d[i,]))
# you can then run graph functions to extract specific measures
(grps <- sapply(g1, function(x) eigen_centrality(x,
weights = E(x)$timestalked)$vector))
# 2010-01-01 2010-01-08 2010-01-15
# A 0.5547002 0.9284767 1.0000000
# B 0.8320503 0.3713907 0.7071068
# C 1.0000000 1.0000000 0.7071068
# Aside: If you only have one function to run on the graphs,
# you could do this in one step
#
# sapply(split(seq(nrow(d)), d$week), function(i) {
# x = graph_from_data_frame(d[i,])
# eigen_centrality(x, weights = E(x)$timestalked)$vector
# })
You then need to add in the analysis on all the data and on the last two weeks; as you only have to build two further graphs, this is not the time-consuming part.
fun1 <- function(i, name) {
x = graph_from_data_frame(i)
d = data.frame(eigen_centrality(x, weights = E(x)$timestalked)$vector)
setNames(d, name)
}
a = fun1(d, "alldata")
lt = fun1(d[d$week %in% tail(unique(d$week), 2), ], "lasttwo")
# Combine: could use `cbind` in this example, but perhaps `merge` is
# safer if there are different levels between dates
data.frame(grps, lt, a) # or
Reduce(merge, lapply(list(grps, a, lt), function(x) data.frame(x, nms = row.names(x))))
# nms X2010.01.01 X2010.01.08 X2010.01.15 alldata lasttwo
# 1 A 0.5547002 0.9284767 1.0000000 0.909899 1.0
# 2 B 0.8320503 0.3713907 0.7071068 0.607475 0.5
# 3 C 1.0000000 1.0000000 0.7071068 1.000000 1.0
data.table
It is likely that the time-consuming step will be explicitly split-applying the function over the data. data.table should offer some benefit here, especially when the data becomes large, and/or there are more groups.
# function to apply to graph
fun <- function(d) {
x = graph_from_data_frame(d)
e = eigen_centrality(x, weights = E(x)$timestalked)$vector
list(e, names(e))
}
library(data.table)
dcast(
setDT(d)[, fun(.SD), by=week], # apply function - returns data in long format
V2 ~ week, value.var = "V1") # convert to wide format
# V2 2010-01-01 2010-01-08 2010-01-15
# 1: A 0.5547002 0.9284767 1.0000000
# 2: B 0.8320503 0.3713907 0.7071068
# 3: C 1.0000000 1.0000000 0.7071068
Then just run the function over the full data / last two weeks as before.
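A sketch of that last step, reusing fun from above (the two extra graphs are cheap to build):
fun(d)                                   # year to date
fun(d[week %in% tail(unique(week), 2)])  # last two weeks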
There are differences between the answers, which come down to the weights argument: here timestalked is passed as weights when calculating the centralities, whereas the other answers don't use the weights.
d=structure(list(from = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), to = structure(c(2L, 3L, 2L, 3L,
2L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B", "C"), class = "factor"), timestalked = c(0L, 1L, 0L, 4L,
1L, 2L, 0L, 1L, 0L, 2L, 1L, 0L, 1L, 2L, 1L, 0L, 0L, 0L), week = structure(c(1L,
1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 2L,
2L), .Label = c("1/1/2010", "1/15/2010", "1/8/2010"), class = "factor")), .Names = c("from",
"to", "timestalked", "week"), class = "data.frame", row.names = c(NA,
-18L))

Pairwise analysis at once in r

I have data as follows. For each site I have a certain number of different measurements (value1, value2, value3). My goal is to perform, for example, a Bartlett test for all possible pairs of sites on all the value variables (like site_id == 1 vs site_id == 2 (and all the values), site_id == 1 vs site_id == 3, and so on).
Could you please show me how to do this in an automated way? Picking the pairs with subset or %in% is quite time-consuming and seems to be the wrong approach:
pair1 = subset(mydata, site_id == 1 | site_id == 2)
pair2 = subset(mydata, site_id == 1 | site_id == 3)
etc...
DATA
dput(el)
structure(list(nr = 1:62, site_id = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), value1 = c(0.135956723, 0.244470396,
0.986831591, 0.272748803, 0.089672362, 0.087918874, 0.29432428,
0.281550906, 0.491512301, 0.202822283, 0.636965524, 0.439072133,
0.512626669, 0.076218623, 0.537676093, 0.410301432, 0.704414491,
0.028086268, 0.934842257, 0.319693894, 0.038503085, 0.724755387,
0.933940599, 0.293119698, 0.206668204, 0.931947832, 0.570267962,
0.153459278, 0.761549617, 0.168553595, 0.125666771, 0.072239583,
0.585168488, 0.434769948, 0.693265848, 0.507971072, 0.784221012,
0.625158967, 0.734257194, 0.745229936, 0.40953356, 0.070758169,
0.468803818, 0.482476343, 0.329618097, 0.690907203, 0.043867132,
0.335846451, 0.910523185, 0.337186798, 0.94565722, 0.468518602,
0.269354849, 0.357422627, 0.660574954, 0.636926103, 0.558315665,
0.489907305, 0.47082103, 0.808036842, 0.80682936, 0.486316865
), value2 = c(0.072786841, 0.53838031, 0.41372062, 0.927891345,
0.681514932, 0.099571511, 0.356290822, 0.22791718, 0.222255425,
0.274876628, 0.215780917, 0.679079775, 0.557144492, 0.768317182,
0.209794907, 0.756651704, 0.950439091, 0.394732921, 0.477008544,
0.248762115, 0.452692267, 0.479918885, 0.617401621, 0.107246095,
0.968902896, 0.581772822, 0.654269288, 0.2403724, 0.309798716,
0.305768959, 0.184387495, 0.035095852, 0.513505392, 0.976717695,
0.713275402, 0.948746684, 0.44320735, 0.222039163, 0.440820346,
0.914348945, 0.824638633, 0.392305879, 0.711367921, 0.013197053,
0.990004958, 0.46783633, 0.368384378, 0.105245106, 0.01894147,
0.351691108, 0.689240176, 0.281890828, 0.643299941, 0.295450072,
0.929042677, 0.451298968, 0.087512416, 0.367461399, 0.101109718,
0.388519279, 0.886552629, 0.371934921), value3 = c(0.862942279,
0.306199206, 0.815403468, 0.120029065, 0.120468166, 0.97214058,
0.605333252, 0.381385396, 0.501217425, 0.159266606, 0.712387132,
0.532604745, 0.581300843, 0.764953483, 0.833804202, 0.576785884,
0.739833632, 0.894288301, 0.533339352, 0.454653122, 0.141139261,
0.820376994, 0.804809068, 0.097680334, 0.286965944, 0.610407569,
0.084827216, 0.428986455, 0.080766377, 0.435308821, 0.93199262,
0.453242669, 0.106639551, 0.191650525, 0.807339195, 0.53331683,
0.101494804, 0.952323476, 0.243649472, 0.903883695, 0.265602323,
0.364928386, 0.239852295, 0.388701845, 0.964790214, 0.031507745,
0.922879901, 0.419279331, 0.923975616, 0.370413352, 0.159053801,
0.450200201, 0.262717668, 0.258232936, 0.604593393, 0.625352584,
0.086596067, 0.876201214, 0.95281149, 0.728431032, 0.232121342,
0.53337486)), .Names = c("nr", "site_id", "value1", "value2",
"value3"), row.names = c(NA, -62L), class = "data.frame")
This is probably not very efficient, but it does what you need.
First we create a matrix with all possible pairs of site_id values. We then create a list with all the subsetted data frames. Finally, we apply the test to each column of each subset in the list.
m1 <- combn(1:length(unique(el$site_id)), 2)  # all pairs of site ids (one pair per column)
l2 <- lapply(1:ncol(m1), function(i) el[el$site_id %in% m1[,i],])  # data subset for each pair
final.list <- lapply(l2, function(i) sapply(i, function(j) bartlett.test(j, i$site_id)))  # run the test on every column of each subset
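As a small optional follow-up (a sketch), you can label each element of the result with the pair of sites it compares, which makes the list easier to inspect:
names(final.list) <- apply(m1, 2, paste, collapse = "-")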

Store maximum value for a factor variable and matching observations for that maximum value [duplicate]

This question already has answers here:
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Closed 6 years ago.
I am curious how to create another dataset in R that stores the maximum value for each level of a factor variable and the matching observation for that maximum value.
Here is a fragment of the dataset with just 4 subjects and the code:
library(data.table)
my.data <- structure(list(Subject = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), Supervisor = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Emmi", "Pauli"), class = "factor"), Time = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 2L, 3L, 3L, 3L), .Label = c("01.02.2016 09:45", "01.02.2016 09:48", "01.03.2016 09:55"), class = "factor"), Trials = c(1L, 2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 4L), Force = c(403.8, 464.6, 567.6, 572.9, 572.4, NA, 533.1, 547, 532.6, 503.8,464.6, 367.6, 372.9), ForceProduction = c(1073, 1149.6, 1944.7, 1906.4, 2260.9, NA, 2634.5, 2471.6, 1187.9, 1073, 1149.6,1944.7, 1906.4)), .Names = c("Subject", "Supervisor", "Time", "Trials", "Force", "ForceProduction"), class = "data.frame", row.names = c(NA, -13L))
DT=as.data.table(my.data)
new.data <- DT[,.SD[which.max(Force)],by=Trials]
Each subject did 2-4 trials. I need to select the max value among all trials for a given subject based on Force, so I am interested in the max value of the Force column. All other observations related to this max Force should be preserved; those that are not in line with the max Force should be abandoned.
The result of this code is strange: it returns just 3 subjects, ignoring the rest, and not the best trial. But I think that I am totally wrong somewhere.
Can you please direct me to a better solution?
Here's a simple dplyr chain that should give you what you want. Grouping by each subject, filter only the rows where Force is at its maximum for that subject.
library(dplyr)
my.data %>%
group_by(Subject) %>%
filter(Force == max(Force, na.rm = TRUE))
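If you would rather stay with data.table, the only change your original line needs is to group by Subject instead of Trials:
DT[, .SD[which.max(Force)], by = Subject]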

Summarise/reshape data in R

For an example dataframe:
df <- structure(list(id = 1:18, region = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), age.cat = structure(c(1L, 1L, 2L, 2L,
2L, 3L, 3L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 4L), .Label = c("0-18",
"19-35", "36-50", "50+"), class = "factor")), .Names = c("id",
"region", "age.cat"), class = "data.frame", row.names = c(NA,
-18L))
I want to reshape the data, as detailed below:
region 0-18 19-35 36-50 50+
a 2 3 2 1
b 4 2 1 3
Do I simply aggregate or reshape the data? Any help would be much appreciated.
You can do it just using table:
table(df$region, df$age.cat)
0-18 19-35 36-50 50+
a 2 3 2 1
b 4 2 1 3
Using reshape2:
install.packages('reshape2')
library(reshape2)
df1 <- melt(df, measure.vars = 'age.cat')
df1 <- dcast(df1, region ~ value)
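Alternatively, a one-step sketch with dcast alone, counting the rows in each cell via fun.aggregate:
dcast(df, region ~ age.cat, fun.aggregate = length, value.var = 'id')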

Removing Survey non-response in R

So, I have a data frame with several continuous variables and several dummy variables. The survey that this data frame comes from uses 6,7,8 and 9 to denote different types of non-response. So, I would like to replace 6,7,8 and 9 with NA whenever they show up in a dummy variable column but leave them be in the continuous variable column.
Is there a concise way to go about doing this?
Here's my data:
> dput(head(sfsuse[c(4:16)]))
structure(list(famsize = c(3L, 1L, 2L, 5L, 3L, 5L), famtype = c(2L,
1L, 2L, 3L, 2L, 3L), cc = c(1L, 1L, 1L, 1L, 1L, 1L), nocc = c(1L,
1L, 1L, 3L, 1L, 1L), pdloan = c(2L, 2L, 2L, 2L, 2L, 2L), help = c(2L,
2L, 2L, 2L, 2L, 2L), budget = c(1L, 1L, 1L, 1L, 2L, 2L), income = c(340000L,
20500L, 0L, 165000L, 95000L, -320000L), govtrans = c(7500L, 15500L,
22000L, 350L, 0L, 9250L), childexp = c(0L, 0L, 0L, 0L, 0L, 0L
), homeown = c(1L, 1L, 1L, 1L, 1L, 2L), bank = c(2000L, 80000L,
25000L, 20000L, 57500L, 120000L), vehval = c(33000L, 7500L, 5250L,
48000L, 8500L, 50000L)), .Names = c("famsize", "famtype", "cc",
"nocc", "pdloan", "help", "budget", "income", "govtrans", "childexp",
"homeown", "bank", "vehval"), row.names = c(NA, 6L), class = "data.frame")
I'm trying to sub in NA for 6, 7, 8 and 9 in columns 3:7 and column 11. I know how to do this one column at a time by column name:
df$name[df$name %in% 6:9]<-NA
but I would have to do this for each column by name, is there a concise way to do it by column index?
Thanks
This function should work:
f <- function(data,k) {
data[data[,k] %in% 6:9,k] <- NA  # replace 6-9 with NA in column k
data
}
Now at the console:
> for (k in c(3:7,11)) { data <- f(data,k) }
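A more compact alternative sketch, replacing the loop with lapply over the chosen columns (same effect as f, assuming the data frame is called data as above):
cols <- c(3:7, 11)
data[cols] <- lapply(data[cols], function(x) replace(x, x %in% 6:9, NA))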
