I am trying to build a network graph and some statistics over rolling three-year periods, but I don't know how to set up the rolling function. Below is my code without rolling.
> library(igraph)
> em <- read.csv(file.choose(), header = TRUE )
> network <- graph.data.frame(em, directed = TRUE )
> em
sender receiver weights year
1 a d 2 2001
2 b a 3 2001
3 c a 1 2001
4 d h 1 2001
5 e d 3 2001
6 a b 1 2002
7 b c 2 2002
8 c d 1 2002
9 d a 1 2002
10 e h 2 2002
11 a d 1 2003
12 b d 1 2003
13 c a 2 2003
14 d h 2 2003
15 e a 1 2003
16 a d 1 2004
17 b c 1 2004
18 c d 2 2004
19 d c 1 2004
20 e a 2 2004
> E(network)$weight <- as.numeric(network[,3])
Warning message:
In eattrs[[name]][index] <- value :
number of items to replace is not a multiple of replacement length
> p <- plot (network,edge.width=E(network)$weight)
So in this example it eventually produces one weighted network graph. I would like to build the graphs using the 2001-2003 and 2002-2004 subsamples, and further compute some SNA statistics. Some online resources suggest that rollapply() or the functions in the tidyquant package could do this, but I didn't manage to figure out how to make them recognise the 4th column as the year or how to set the rolling period. I would much appreciate it if anyone can help out, as I am a newbie to R.
Many thanks!!
Thanks @emilliman5 for the further questions.
My real dataset is an unbalanced panel with a 15-year time span. I intended to build network graphs using parts of the full data. The subsetting rule is the 3-year rolling period (in fact with some other conditions, but I only asked about rolling here). So I planned to extract the rolling subsamples first and then build the graphs. I hope it is a bit clearer now.
The data above was just a mock sample. A 4-year range should generate two graphs (2001-2003, 2002-2004), and 15 years should come up with 13 graphs. The real weighting variable is not called weight, but I agree the line "as.numeric(network[,3])" is redundant. (I realise now the example I made wasn't good... sorry...)
Someone has helped me with this now, so I'll just post some of the code. I hope it might help other people.
Method 1: call sub-samples with a function. This saves me from constructing nested loops together with the graphing.
# Function: conditions for subsetting; here the rolling 3-year window only
f <- function(j){
  subset(em, year >= j & year <= j + 2)
}
print(f(2001))  # test the function output
# Rolling by year, plotting each subsample's graph
for (j in 2001:2002){
  sdf <- f(j)
  nw <- graph.data.frame(sdf, directed = TRUE)
  plot(nw, edge.width = E(nw)$weights)  # the edge attribute takes the name of the data column
}
Method 2: use loops -- fine for one or two subsetting conditions, but a bit clumsy for more.
k <- 1         # counter for the list of subsets
sdf <- list()  # an empty list to store the subsets
for (j in 2001:2002){
  df_temp <- subset(em, year >= j & year <= j + 2)
  print(nrow(df_temp))  # to check the subset, not necessary
  sdf[[k]] <- df_temp
  nw <- graph.data.frame(sdf[[k]], directed = TRUE)
  plot(nw, edge.width = E(nw)$weights)
  k <- k + 1
}
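For completeness, here is a more compact sketch of the same rolling idea using lapply instead of an explicit loop, assuming the data frame is still called em with year and weights columns; it also shows how an SNA statistic per window (graph density, as one example) could be collected alongside the plots. graph.density is the older igraph name; newer versions also provide edge_density.
library(igraph)

# start years of the 3-year windows: 2001:2002 for the mock data, 13 starts for 15 years
starts <- min(em$year):(max(em$year) - 2)

windows <- lapply(starts, function(j) subset(em, year >= j & year <= j + 2))
graphs  <- lapply(windows, graph.data.frame, directed = TRUE)

# one SNA statistic per rolling window, e.g. density
data.frame(start = starts, density = sapply(graphs, graph.density))

# weighted plot for each window
for (g in graphs) plot(g, edge.width = E(g)$weights)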
I have panel data with county-level values for 15 years of different economic measures (for which I have created an index). There are missing values that I would like to interpolate. However, because the values are randomly missing by year, linear interpolation doesn't fully work; it only gives me interpolated values between the first and last non-missing data points. This is a problem because I need interpolated values for the entire series.
Since all of the series have more than 5 data points, is there any code out there that would interpolate the series based on data that already exists within the specific series?
I first thought about indexing my data to try to run a loop, but then I found code for linear interpolation by groups. While the latter solved some of the NAs, it did not interpolate all of them. Here is an example of my data where some of the values get interpolated but not all.
library(dplyr)
data <- read.csv(text="
index,year,value
1,2001,20864.135
1,2002,20753.867
1,2003,NA
1,2004,17708.224
1,2005,12483.767
1,2006,12896.251
1,2007,NA
1,2008,NA
1,2009,9021.556
1,2010,NA
1,2011,NA
1,2012,13795.752
1,2013,16663.741
1,2014,19349.992
1,2015,NA
2,2001,NA
2,2002,NA
2,2003,NA
2,2004,NA
2,2005,NA
2,2006,NA
2,2007,NA
2,2008,151.108
2,2009,107.205
2,2010,90.869
2,2011,104.142
2,2012,NA
2,2013,128.646
2,2014,NA
2,2015,NA")
Using
interpolation<-data %>%
group_by(index) %>%
mutate(valueIpol = approx(year, value, year,
method = "linear", rule = 1, f = 0, ties = mean)$y)
I get the following interpolated values.
1,2001,20864.135
1,2002,20753.867
1,2003,19231.046
1,2004,17708.224
1,2005,12483.767
1,2006,12896.251
1,2007,11604.686
1,2008,10313.121
1,2009,9021.556
1,2010,10612.955
1,2011,12204.353
1,2012,13795.752
1,2013,16663.741
1,2014,19349.992
1,2015,NA
2,2001,NA
2,2002,NA
2,2003,NA
2,2004,NA
2,2005,NA
2,2006,NA
2,2007,NA
2,2008,151.108
2,2009,107.205
2,2010,90.869
2,2011,104.142
2,2012,116.394
2,2013,128.646
2,2014,NA
2,2015,NA
Any help would be appreciated. I'm pretty new to R and have never worked with loops but I have looked up other "interpolation by groups" help. Nothing seems to solve the issue of filling in data when the first and last points are NA's as well.
Maybe this could help:
library(imputeTS)
for(i in unique(data$index)) {
data[data$index == i,] <- na.interpolation(data[data$index == i,])
}
This only works when the groups themselves are already ordered by year (which is the case in your example).
Output would look like this:
> data
index year value
1 1 2001 20864.135
2 1 2002 20753.867
3 1 2003 19231.046
4 1 2004 17708.224
5 1 2005 12483.767
6 1 2006 12896.251
7 1 2007 11604.686
8 1 2008 10313.121
9 1 2009 9021.556
10 1 2010 10612.955
11 1 2011 12204.353
12 1 2012 13795.752
13 1 2013 16663.741
14 1 2014 19349.992
15 1 2015 19349.992
16 2 2001 151.108
17 2 2002 151.108
18 2 2003 151.108
19 2 2004 151.108
20 2 2005 151.108
21 2 2006 151.108
22 2 2007 151.108
23 2 2008 151.108
24 2 2009 107.205
25 2 2010 90.869
26 2 2011 104.142
27 2 2012 116.394
28 2 2013 128.646
29 2 2014 128.646
30 2 2015 128.646
Since the na.interpolation function uses approx internally, you can pass parameters of approx through to adjust the behavior.
The parameters you used in your example (method = "linear", rule = 1, f = 0, ties = mean) are the defaults, so if you want to use these you don't have to add anything.
Otherwise you would change the line inside the loop to, for example, this:
data[data$index == i,] <- na.interpolation(data[data$index == i,], ties ="ordered", f = 1, rule = 2)
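For reference, the same group-wise interpolation can also be written inside the dplyr pipeline from the question; a sketch, assuming imputeTS is installed and the data are ordered by year within each index (newer imputeTS versions rename the function to na_interpolation):
library(dplyr)
library(imputeTS)

interpolation <- data %>%
  arrange(index, year) %>%
  group_by(index) %>%
  mutate(valueIpol = na.interpolation(value)) %>%
  ungroup()
Extra approx arguments such as rule = 2 can be passed through na.interpolation here in the same way as in the loop above.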
This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times that represent possible 'real' detections:
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between, I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID. Instead I want to run through the data frame entirely. The output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of the start or end times would also do the trick (also with 1/0 outputs).
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with, the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
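If you want to avoid the explicit loop, both flag columns can also be built with a fully vectorised base-R comparison; a sketch, assuming the column names from the question (Start and Stop in truth, Times in possible). The tol value is a hypothetical parameter: set it to 2 for the "within 2 seconds" variant mentioned at the end of the question.
# widen each interval by tol seconds on both sides (0 = exact containment)
tol <- 0
lo  <- truth$Start - tol
hi  <- truth$Stop  + tol

# inside[i, j] is TRUE when possible$Times[i] lies inside truth interval j
inside <- outer(possible$Times, lo, ">=") & outer(possible$Times, hi, "<=")

possible$Between <- as.integer(rowSums(inside) > 0)  # 1 if the time hits any interval
truth$match      <- as.integer(colSums(inside) > 0)  # 1 if any time hits this interval
This builds an nrow(possible) by nrow(truth) matrix, so it is fine for moderately sized detection lists; for very large ones the rolling-join answer above is the better fit.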
I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame where each alphabetical identifier is associated with a number
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need is to replace ZYO with the number 64 in my main data frame (g3), and likewise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now, on a small scale, I can write code to change it, like I did with ATR:
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time-consuming and increases the chance of human error.
I know there are ways to do this on a broad scale with NAs
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use this approach, which I feel may work, but I am not sure how to build the argument logically. It was posted on the questions board here:
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x < 0, 0)}))
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x <0,0)})
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with their fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
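If you would rather stay with plain data frames, the same lookup can also be done with base R's match applied to every column of g3 at once; a sketch, assuming the ref table from this answer (letter codes in num.individuals, numbers in g4):
# replace every letter code in g3 by its number; cells that are not codes become NA,
# as in the data.table output above
g3_num <- as.data.frame(
  lapply(g3, function(x) ref$g4[match(as.character(x), ref$num.individuals)])
)
If the non-code cells (the empty strings or the 2s in the question's g3) should be kept rather than turned into NA, wrap the lookup in an ifelse(is.na(...), ...) that falls back to the original value.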
I'm attempting to calculate two summary statistics (mean and standard error) from the following data set, where both Location and Adult should be factors.
Location Adult OverComp
F 1 7
P 1 8
P 0 10
F 1 3
F 0 11
I would like the output to appear as follows:
Location Adult OverComp.m OverComp.se
F 1 (mean) (standard error)
F 0 (mean) (standard error)
P 1 (mean) (standard error)
P 0 (mean) (standard error)
Where OverComp.m is the calculated mean for each combination of Location x Adult, and OverComp.se is the standard error for each of those combinations. I want this format because I then want to use it with ggplot2 to make a bar plot of the four means and SEs, color-coded by Location.
I've gotten this far:
summary.OverComp <- data.frame(
  Location = levels(as.factor(data$FLocation)),
  MeanOverComp = tapply(data$OverComp, list(data$FLocation, data$Adult), mean),
  se = tapply(data$OverComp, list(data$FLocation, data$Adult), std.error))
Which produces the statistics I want, but not the format that I need for plotting in ggplot2 (as far as I can tell):
summary.OverComp
Location MeanOverComp.0 MeanOverComp.1 se.0 se.1
F Fiji 7.238095 8.454545 0.3792062 0.3023071
P Peru 6.893617 5.395833 0.4544304 0.3076155
I am now a bit clueless - not sure whether to pursue a different method for plotting, or a transformation to the above output, or to figure out how to incorporate Adult as a factor in my summary coding. I have an inkling that reshape2 may be involved, but not sure how to approach that. Your help would be much appreciated!
You could try data.table (if dat is the dataset)
library(plotrix)
library(data.table)
setDT(dat)[,list(OverComp.m=mean(OverComp),
Overcomp.se=std.error(OverComp)), by=list(Location, Adult)]
Location Adult OverComp.m Overcomp.se
#1: F 1 5 2
#2: P 1 8 NA
#3: P 0 10 NA
#4: F 0 11 NA
This is a typical use for aggregate, a base (actually stats-pkg) function:
> aggregate(dat$OverComp, # the values being aggregated
dat[-3], # the grouping factors
function(Ov) c(mean=mean(Ov), sd=sd(Ov) ) #aggregation function(s)
)
Location Adult x.mean x.sd
1 F 0 11.000000 NA
2 P 0 10.000000 NA
3 F 1 5.000000 2.828427
4 P 1 8.000000 NA
If you had more than one item in the three categories where you see NAs, then a more attractive output would have occurred.
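Once the summary is in that long, one-row-per-combination shape (from either answer above), the ggplot2 step the question is aiming for is straightforward; a sketch, assuming the summary table is called sumOC with columns Location, Adult, OverComp.m and OverComp.se (these names come from the desired output in the question, not from any fixed API):
library(ggplot2)

ggplot(sumOC, aes(x = factor(Adult), y = OverComp.m, fill = Location)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = OverComp.m - OverComp.se,
                    ymax = OverComp.m + OverComp.se),
                position = position_dodge(width = 0.9), width = 0.2) +
  labs(x = "Adult", y = "Mean OverComp")
geom_col needs ggplot2 >= 2.2; on older versions geom_bar(stat = "identity") does the same thing.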
I have a data formatted as
PERSON_A PERSON_B MEET LEAVE
That describes basically when a PERSON_A met a PERSON_B at time MEET and they said "bye" to each other at moment LEAVE. The time is expressed in seconds, and there is a small part of the data on http://pastie.org/2825794 (simple.dat).
What I need is to count the number of meetings, grouping them by day. At the moment I have code that works, but it is not pretty. I'd like help transforming it into code that reflects the grouping I'm trying to do, e.g., using ddply. My main aim is to learn from this case. There are probably many mistakes in this code regarding good practices in R.
library(plyr)
data = read.table("simple.dat", stringsAsFactors=FALSE)
names(data)=c('PERSON_A','PERSON_B','MEET','LEAVE')
attach(data)
min_interval = min(MEET)
max_interval = max(LEAVE)
interval = max_interval - min_interval
day = 86400
number_of_days = floor(interval/day)
g = data.frame(MEETINGS=c(0:number_of_days)) # just to store the result
g[,1] = 0
start_offset = min_interval # start of the first day
for (interval in c(0:number_of_days)) {
end_offset = start_offset + day
meetings = (length(data[data$MEET >= start_offset & data$LEAVE <= end_offset, ]$PERSON_A) + length(data[data$MEET >= start_offset & data$LEAVE <= end_offset, ]$PERSON_B))
g[interval+1, ] = meetings
start_offset = end_offset # start next day
}
g
This code iterates over the days (intervals of 86400 seconds) and stores the number of meetings in the data frame g. The correct output (shown below) of this code when executed on the linked dataset gives, for each line (day), the number of meetings.
MEETINGS
1 38
2 10
3 16
4 18
5 24
6 6
7 4
8 10
9 28
10 14
11 22
12 2
13 .. 44 0 # I simplified the output here
45 2
Anyway, I know that I could use ddply to get the number of meetings for each pair of nodes:
contacts <- ddply(data, .(PERSON_A, PERSON_B), summarise
, CONTACTS = length(c(PERSON_A, PERSON_B)) /2
)
but there is a huge hill for me between this and the result I need.
As an end note, I read How to make a great R reproducible example? and tried my best :)
Thanks,
try this:
> d2 <- transform(data, m = floor(MEET/86400) + 1, l = floor(LEAVE/86400) + 1)
> d3 <- subset(d2, m == l)
> table(d3$m) * 2
1 2 3 4 5 6 7 8 9 10 11 12 45
38 10 16 18 24 6 4 10 28 14 22 2 2
floor(x/(60*60*24)) is a quick way to convert seconds into days.
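Since the question specifically asked how to express the grouping with ddply, the same day-bucketing idea can be written that way; a sketch, assuming (as in the answer above) that only meetings starting and ending within the same day are counted and that each meeting contributes two person-contacts:
library(plyr)

d2 <- transform(data,
                day      = floor(MEET / 86400) + 1,
                same_day = floor(MEET / 86400) == floor(LEAVE / 86400))

ddply(subset(d2, same_day), .(day), summarise, MEETINGS = 2 * length(day))
Days with no meetings simply do not appear in this output, whereas the loop in the question fills them with 0.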