Reduce a large dataset to a smaller dataset using R

I want to reduce a very large dataset with two variables to a smaller file. I need to find runs of consecutive data points with the same value, keep only the first and last point of each run, and remove all the points in between. For example,
the sample dataset looks like the following:
363.54167 23.3699
363.58333 23.3699
363.625 0
363.66667 0
363.70833 126.16542
363.75 126.16542
363.79167 126.16542
363.83333 126.16542
363.875 126.16542
363.91667 0
363.95833 0
364 0
364.04167 0
364.08333 0
364.125 0
364.16667 0
364.20833 0
364.25 127.79872
364.29167 127.79872
364.33333 127.79872
364.375 127.79872
364.41667 127.79872
364.45833 127.79872
364.5 0
364.54167 0
364.58333 0
364.625 0
364.66667 0
364.70833 127.43202
364.75 135.44052
364.79167 135.25522
364.83333 135.12892
364.875 20.32986
364.91667 0
364.95833 0
Here, the first two points have the same value, i.e. 23.3699, so I will keep them as they are. The condition I need is: if two or more consecutive data points have the same value, keep only the first and last of them. The next two points also share a value (0), so I keep both. After that there are 5 data points with the same value; here I want to keep just two data points, i.e. 363.70833 and 363.875, and remove the data points in between. After that I keep only the first and last of the zero-valued points, i.e. 363.91667 and 364.20833.
The sample output I am looking for is as follows:
363.54167 23.3699
363.58333 23.3699
363.625 0
363.66667 0
363.70833 126.16542
363.875 126.16542
363.91667 0
364.20833 0
364.25 127.79872
364.45833 127.79872
364.5 0
364.66667 0
364.70833 127.43202
364.75 135.44052
364.79167 135.25522
364.83333 135.12892
364.875 20.32986
364.91667 0
364.95833 0

If your data is in a dataframe DF with column names a and b, then
runs <- rle(DF$b)
firsts <- cumsum(c(0,runs$lengths[-length(runs$lengths)]))+1
lasts <- cumsum(runs$length)
edges <- unique(sort(c(firsts, lasts)))
DF[edges,]
gives
> DF[edges,]
a b
1 363.5417 23.36990
2 363.5833 23.36990
3 363.6250 0.00000
4 363.6667 0.00000
5 363.7083 126.16542
9 363.8750 126.16542
10 363.9167 0.00000
17 364.2083 0.00000
18 364.2500 127.79872
23 364.4583 127.79872
24 364.5000 0.00000
28 364.6667 0.00000
29 364.7083 127.43202
30 364.7500 135.44052
31 364.7917 135.25522
32 364.8333 135.12892
33 364.8750 20.32986
34 364.9167 0.00000
35 364.9583 0.00000
rle gives the lengths of the groups that have the same value (floating point precision may be an issue if you have more decimal places). firsts and lasts give the row index of the first row of a group and the last row of a group, respectively. Put the indexes together, sort them, and get rid of duplicates (since a group of size one will list the same row as the first and last) and then index DF by the row numbers.
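The floating-point caveat can be handled by comparing neighbouring values with a tolerance instead of relying on exact equality. The following is only a sketch under stated assumptions: the helper name keep_run_edges, its col argument, and the default tol are made up for illustration, and the right tolerance depends on your data.

```r
# Hypothetical helper: keep the first and last row of every run of
# (approximately) equal consecutive values in column `col`.
keep_run_edges <- function(DF, col = "b", tol = 1e-8) {
  v <- DF[[col]]
  if (length(v) == 0) return(DF)
  # TRUE where a row starts a new run (value differs from the previous row)
  starts <- c(TRUE, abs(diff(v)) > tol)
  # A row ends a run if the next row starts one, or if it is the last row
  ends <- c(starts[-1], TRUE)
  DF[starts | ends, , drop = FALSE]
}

# A few rows taken from the sample data in the question
DF <- data.frame(
  a = c(363.70833, 363.75, 363.79167, 363.83333, 363.875, 363.91667, 363.95833),
  b = c(126.16542, 126.16542, 126.16542, 126.16542, 126.16542, 0, 0)
)
res <- keep_run_edges(DF)

# Values that differ only by rounding error are treated as one run
fp <- data.frame(a = 1:3, b = c(0.3, 0.1 + 0.2, 0.3))
res_fp <- keep_run_edges(fp)
```

Because the comparison is made between neighbours, a slow drift of many tiny steps would still be merged into one run; if that matters, compare against the first value of the current run instead.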

I'd use rle here (no surprise to those who know me :-) . Keeping in mind that you will want to check for approximate equality to avoid floating-point rounding problems, here's the concept. rle will return two sequences, one of which tells you how many times a value is repeated and the other tells you the value itself. Since you want to keep only single or double values, we'll essentially "shrink" all sequence values which are longer.
Edit: I recognize that this is relatively clunky code and a gentle touch with melt/cast should be far more efficient. I just liked doing this.
df<-cbind(1:20, sample(1:3,replace=TRUE,20))
rdf<-rle(df[,2])
lenfoo<-rdf$lengths
cfoo<-cumsum(lenfoo) # last row of each run
firstfoo<-cfoo-lenfoo+1 # first row of each run
outfoo<-matrix(nrow=0,ncol=2)
# for each run keep its first and last row (a single row if the run has length 1)
for(j in seq_along(cfoo)) outfoo <- rbind( outfoo, df[unique(c(firstfoo[j],cfoo[j])),,drop=FALSE] )
Rgames> df
[,1] [,2]
[1,] 1 2
[2,] 2 2
[3,] 3 3
[4,] 4 3
[5,] 5 3
[6,] 6 3
[7,] 7 3
[8,] 8 2
[9,] 9 2
[10,] 10 3
[11,] 11 1
[12,] 12 2
[13,] 13 2
[14,] 14 3
[15,] 15 1
[16,] 16 2
[17,] 17 1
[18,] 18 2
[19,] 19 3
[20,] 20 1
Rgames> outfoo
[,1] [,2]
[1,] 1 2
[2,] 2 2
[3,] 3 3
[4,] 7 3
[5,] 8 2
[6,] 9 2
[7,] 10 3
[8,] 11 1
[9,] 12 2
[10,] 13 2
[11,] 14 3
[12,] 15 1
[13,] 16 2
[14,] 17 1
[15,] 18 2
[16,] 19 3
[17,] 20 1

x = tapply(df[[1]], df[[2]], range)
gives the values
cbind(unlist(x, use.names=FALSE), as.numeric(rep(names(x), each=2)))
gets a matrix. More explicitly, and avoiding coercion to / from character vectors
u = unique(df[[2]])
rng = sapply(split(df[[1]], match(df[[2]], u)), range)
cbind(as.vector(rng), rep(u, each=2))
If the data is very large then sort by df[[1]] and find the first (min) and last (max) values of each element of df[[2]]; combine these
df = df[order(df[[1]]),]
res = rbind(df[!duplicated(df[[2]]),], df[!duplicated(df[[2]], fromLast=TRUE),])
res[order(res[[2]]),]
perhaps setting the row names of the subset to NULL.

Related

Output of igraph clustering functions

I constructed a graph from a data frame using the igraph graph_from_data_frame function. The first two columns represent the edge list, and I have another column named "weight". There are several other attribute columns.
I then tried to find a community structure within my graph using cluster_fast_greedy.
library(igraph)
library(dplyr) # for %>% and rename()
data <- data %>% rename(weight = TH_LIEN_2)
graph <- graph_from_data_frame(data,directed=FALSE)
is_weighted(graph)
cluster_1 <- cluster_fast_greedy(graph, weights = NULL)
The output is a list of three (merges, modularity, membership), each containing some of my vertices.
However, the following returns "NULL":
cluster_1[["merges"]]
cluster_1[["modularity"]]
cluster_1[["membership"]]
(I believe cluster_1[["membership"]] is supposed to be a vector of integers indicating the cluster each vertex belongs to?)
I have tried different clustering methods (cluster_fast_greedy, cluster_label_prop, cluster_leading_eigen, cluster_spinglass, cluster_walktrap), with both a weighted and an unweighted graph, and the output looks the same every time. (The number of elements in the list varies from 1 to 4.)
Does anyone have an idea why it does that?
Thank you and have a nice day!
Cassandra
You should use the dollar sign $ to access the cluster object. For example
g <- make_full_graph(5) %du% make_full_graph(5) %du% make_full_graph(5)
g <- add_edges(g, c(1, 6, 1, 11, 6, 11))
fc <- cluster_fast_greedy(g)
and you will see
> str(fc)
Class 'communities' hidden list of 5
$ merges : num [1:14, 1:2] 3 4 5 1 12 13 15 11 7 8 ...
$ modularity: num [1:15] -6.89e-02 -4.59e-02 6.94e-18 6.89e-02 1.46e-01 ...
$ membership: num [1:15] 3 3 3 3 3 1 1 1 1 1 ...
$ algorithm : chr "fast greedy"
$ vcount : int 15
> fc$merges
[,1] [,2]
[1,] 3 2
[2,] 4 16
[3,] 5 17
[4,] 1 18
[5,] 12 14
[6,] 13 20
[7,] 15 21
[8,] 11 22
[9,] 7 9
[10,] 8 24
[11,] 10 25
[12,] 6 26
[13,] 27 19
[14,] 23 28
> fc$modularity
[1] -6.887052e-02 -4.591368e-02 6.938894e-18 6.887052e-02 1.460055e-01
[6] 1.689624e-01 2.148760e-01 2.837466e-01 3.608815e-01 3.838384e-01
[11] 4.297521e-01 4.986226e-01 5.757576e-01 3.838384e-01 -1.110223e-16
> fc$membership
[1] 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2
> fc$algorithm
[1] "fast greedy"
> fc$vcount
[1] 15
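As an alternative to reading the fields with $, igraph also provides accessor functions for communities objects. The sketch below rebuilds the same toy graph and uses membership(), sizes(), modularity() and length(); it assumes a reasonably recent igraph and is meant as an illustration rather than the only way to inspect the result.

```r
library(igraph)

# Same toy graph as above: three 5-cliques joined by three bridging edges
g <- make_full_graph(5) %du% make_full_graph(5) %du% make_full_graph(5)
g <- add_edges(g, c(1, 6, 1, 11, 6, 11))
fc <- cluster_fast_greedy(g)

memb <- membership(fc)    # community id of each vertex
comm_sizes <- sizes(fc)   # number of vertices in each community
q <- modularity(fc)       # modularity of the selected partition
n_comm <- length(fc)      # number of communities found
```

The membership vector can then be attached to the vertices, for example with V(g)$community <- as.integer(memb).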

R: calling a matrix value of column 2 dependent on the value of column 1

I admit that I am totally new to R and have a few beginner's problems;
my problem is the following:
I have quite a long matrix TEST of length 5000 with 2 columns (column 1 = time; column 2 = concentration of a species).
I want to use the right concentration values for calculation of propensities in stochastic simulations.
I already have an algorithm that gives me the simulation time t_sim; what I need is a line of code that gives the respective concentration value at t = t_sim;
also: the time vector might have a big step size, so that t_sim would have to be rounded up to the next available time value in order to look up the respective concentration value.
I know this is probably quite an easy problem, but I really do not see the solution in R.
Best wishes and many thanks,
Arne
Without sample data this answer is kind of a shot in the dark, but I think that this might work:
t_conc <- TEST[which.min(abs(t_sim-TEST[,1])),2]
where TEST is the matrix with two columns as described in the OP and the output t_conc is the concentration that corresponds to the value of time in the matrix that is closest to the input value t_sim.
Here's another shot in the dark:
set.seed(1);
N <- 20; test <- matrix(c(sort(sample(100,N)),rnorm(N,0.5,0.2)),N,dimnames=list(NULL,c('time','concentration')));
test;
## time concentration
## [1,] 6 0.80235623
## [2,] 16 0.57796865
## [3,] 19 0.37575188
## [4,] 20 0.05706002
## [5,] 27 0.72498618
## [6,] 32 0.49101328
## [7,] 34 0.49676195
## [8,] 37 0.68876724
## [9,] 43 0.66424424
## [10,] 57 0.61878026
## [11,] 58 0.68379547
## [12,] 61 0.65642726
## [13,] 62 0.51491300
## [14,] 63 0.10212966
## [15,] 67 0.62396515
## [16,] 83 0.48877425
## [17,] 86 0.46884090
## [18,] 88 0.20584952
## [19,] 89 0.40436999
## [20,] 97 0.58358831
t_sim <- 39;
test[findInterval(t_sim,test[,'time']),'concentration'];
## concentration
## 0.6887672
Note that findInterval() returns the index of the lesser time value if t_sim falls between two time values, as my example shows. If you want the greater, you need a bit more work:
i <- findInterval(t_sim,test[,'time']);
if (test[i,'time'] != t_sim && i < nrow(test)) i <- i+1;
test[i,'concentration'];
## concentration
## 0.6642442
If you want the nearest, see R: find nearest index.
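The three lookup rules discussed in these answers (previous time point, next time point, nearest time point) can be put side by side in one small base R helper. This is a sketch only: the function name lookup_conc and its rule argument are hypothetical, the matrix is a shortened made-up example, and the time column is assumed to be sorted in increasing order.

```r
# Small example matrix with the same column names as above (values made up)
test <- cbind(time = c(6, 16, 19, 20, 27),
              concentration = c(0.80, 0.58, 0.38, 0.06, 0.72))

# Hypothetical helper: concentration at t_sim under one of three rules
lookup_conc <- function(m, t_sim, rule = c("floor", "ceiling", "nearest")) {
  rule <- match.arg(rule)
  tm <- m[, "time"]
  i <- switch(rule,
    # last time <= t_sim (clamped to row 1 if t_sim precedes all times)
    floor = max(findInterval(t_sim, tm), 1L),
    # first time >= t_sim (clamped to the last row if t_sim exceeds all times)
    ceiling = min(which(tm >= t_sim), length(tm)),
    # time with the smallest absolute distance to t_sim
    nearest = which.min(abs(tm - t_sim))
  )
  unname(m[i, "concentration"])
}

c_floor   <- lookup_conc(test, 18, "floor")    # uses time 16
c_ceiling <- lookup_conc(test, 18, "ceiling")  # uses time 19
c_nearest <- lookup_conc(test, 18, "nearest")  # uses time 19
```

The "ceiling" rule matches the rounding-up behaviour asked for in the question; which rule is appropriate for the propensity calculation is a modelling decision.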

Using rollapply() to find modal value

I've got panel data and have been playing around with k-means clustering. So now I've got a panel of factor values that are mostly stable but I'd like to smooth that out a bit more so that (for example) the data says "Wyoming was in group 1 in earlier years, moved into group 2, then moved into group 5" rather than "Wyoming was in group 1,1,1,2,3,2,2,5,5,5".
So the approach I'm taking is to use rollapply() to calculate the modal value. Below is code that works to calculate the mode ("Mode()"), and a wrapper for that ("ModeR()") that (perhaps clumsily) resolves the problem of multi-modal windows by randomly picking a mode. All that is fine, but when I put it into rollapply() I'm getting problems.
library(dplyr) # arrange(), desc(), filter()
library(zoo) # rollapply()
Mode <- function(vect){ # take a vector as input
temp <- as.data.frame(table(vect))
temp <- arrange(temp,desc(Freq)) # from dplyr
max.f <- temp[1,2]
temp <- filter(temp,Freq==max.f) # cut out anything that isn't modal
return(temp[,1])
}
ModeR <- function(vect){
out <- Mode(vect)
return(out[round(runif(1,min=0.5000001,max=length(out)+0.499999999))])
}
temp <- round(runif(20,min=1,max=10)) # A vector to test this out on.
cbind(temp,rollapply(data=temp,width=5,FUN=ModeR,fill=NA,align="right"))
which returned:
temp
[1,] 5 NA
[2,] 6 NA
[3,] 5 NA
[4,] 5 NA
[5,] 7 1
[6,] 6 1
[7,] 5 1
[8,] 5 1
[9,] 3 2
[10,] 1 3
[11,] 5 3
[12,] 7 3
[13,] 5 3
[14,] 4 3
[15,] 3 3
[16,] 4 2
[17,] 8 2
[18,] 5 2
[19,] 6 3
[20,] 6 3
Compare that with:
> ModeR(temp[1:5])
[1] 5
Levels: 5 6 7
> ModeR(temp[2:6])
[1] 6
Levels: 5 6 7
So it seems like the problem is in how ModeR is being applied in rollapply(). Any ideas?
Thanks!
Rick
Thanks to /u/murgs! His comment pointed me in the right direction (in addition to helping me streamline ModeR() using sample()).
ModeR() as written above returns a factor (as does Mode()). I need it to be a number. I can fix this by updating my code as follows:
Mode <- function(vect){ # take a vector as input
temp <- as.data.frame(table(vect))
temp <- arrange(temp,desc(Freq))
max.f <- temp[1,2]
temp <- filter(temp,Freq==max.f) # cut out anything that isn't modal
return(as.numeric(as.character(temp[,1]))) #HERE'S THE BIG CHANGE
}
ModeR <- function(vect){
out <- Mode(vect)
return(out[sample(1:length(out),1)]) #HERE'S SOME IMPROVED CODE!
}
Now rollapply() does what I expected it to do! The as.character() step is needed because as.numeric() applied directly to a factor returns the internal integer level codes (1, 2, 3, ...) rather than the numbers shown as labels, which is why the earlier output contained small values that did not match the data.
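The factor behaviour behind this fix can be seen in isolation with base R. The snippet below is a small illustration; mode_num is a hypothetical dplyr-free variant of Mode(), added only to show that taking names() of the table avoids the factor altogether.

```r
# table() turns the values into factor levels, so the modal values come back
# as a factor when taken from as.data.frame(table(...))
f <- factor(c(5, 6, 7, 5))

codes  <- as.numeric(f)                 # internal codes: 1 2 3 1
labels <- as.numeric(as.character(f))   # original numbers: 5 6 7 5
same   <- as.numeric(levels(f))[f]      # equivalent, converts each level once

# Hypothetical base R version of Mode(): all modal values as numbers
mode_num <- function(vect) {
  tab <- table(vect)
  as.numeric(names(tab)[tab == max(tab)])
}
modes <- mode_num(c(5, 6, 5, 5, 7))     # single mode
ties  <- mode_num(c(1, 1, 2, 2, 3))     # two modes
```

Either form of the conversion works inside rollapply(), since the function then returns a plain numeric value for every window.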

Modified rollapply mean

I have a data file which consists of daily xy locations and a logical vector denoting whether or not the location is an outlier. Here is some (poorly created, I know) sample data:
x=seq(3,10,length.out=30)
y=seq(42,45,length.out=30)
outlier=c(F,F,F,F,F,F,F,F,T,T,T,F,F,F,F,F,F,F,F,F,F,T,F,T,F,F,F,F,F,F)
data=cbind(x,y,outlier)
> data
x y outlier
[1,] 3.000000000 42.00000000 0
[2,] 3.241379310 42.10344828 0
[3,] 3.482758621 42.20689655 0
[4,] 3.724137931 42.31034483 0
[5,] 3.965517241 42.41379310 0
[6,] 4.206896552 42.51724138 0
[7,] 4.448275862 42.62068966 0
[8,] 4.689655172 42.72413793 0
[9,] 4.931034483 42.82758621 1
[10,] 5.172413793 42.93103448 1
[11,] 5.413793103 43.03448276 1
[12,] 5.655172414 43.13793103 0
[13,] 5.896551724 43.24137931 0
[14,] 6.137931034 43.34482759 0
[15,] 6.379310345 43.44827586 0
[16,] 6.620689655 43.55172414 0
[17,] 6.862068966 43.65517241 0
[18,] 7.103448276 43.75862069 0
[19,] 7.344827586 43.86206897 0
[20,] 7.586206897 43.96551724 0
[21,] 7.827586207 44.06896552 0
[22,] 8.068965517 44.17241379 1
[23,] 8.310344828 44.27586207 0
[24,] 8.551724138 44.37931034 1
[25,] 8.793103448 44.48275862 0
[26,] 9.034482759 44.58620690 0
[27,] 9.275862069 44.68965517 0
[28,] 9.517241379 44.79310345 0
[29,] 9.758620690 44.89655172 0
[30,] 10.000000000 45.00000000 0
What I need is to take a non-overlapping 6-day mean of the x and y columns. This is easy enough with rollapply(). However, I do not want outlier=1 values to be included in the 6-day mean; nor do I want the 6-day window to 'span' the gap left behind by removing all rows where outlier=T. Instead, I want to make an exception to the 'non-overlapping rule'.
I think this is best explained using the sample data above: the first value should be the mean of rows 1:6, but rather than the second value being the mean of rows 7:12 (including outlier=1 values) or of rows c(7:8,12:15) (skipping over outlier=1 values) I want it to overlap with the first window and take the mean of rows 3:8.
So for the length 30 sample data above, the end result should be of length 5, showing the mean values of rows 1:6, 3:8, 12:17, 16:21 & 25:30 (ideally all values which result from overlapping windows should be labelled as such; i.e. values 1:4 overlap, whereas the final value is unique)
Here is a function that will give you the indices of the endpoints of the averages that you want:
findIndices<-function(outlier,window=6){
r<-rle(outlier)
rends<-cumsum(r$lengths)
segs<-cbind(rends-r$lengths+1,rends)
segs<-segs[with(r,lengths>=window & values==0),,drop=FALSE] # drop=FALSE keeps segs a matrix when only one segment qualifies
indices<-unlist(apply(segs,1,function(x) seq(x[1]+window-1,x[2],by=window)))
sort(unique(c(indices,segs[,2])))
}
findIndices(data[,3])
## [1] 6 8 17 21 30
You can then get the averages you want like this:
id<-findIndices(data[,3])
require(zoo)
cbind(index=id,rollmean(data[,1:2],6)[id-5,])
## index x y
## [1,] 6 3.603448 42.25862
## [2,] 8 4.086207 42.46552
## [3,] 17 6.258621 43.39655
## [4,] 21 7.224138 43.81034
## [5,] 30 9.396552 44.74138
You can put it all together in a single function like this:
maWithOutliers<-function(x,outlier,window){
id<-findIndices(outlier,window)
cbind(index=id,rollmean(x,window)[id-window+1,])
}
> maWithOutliers(data[,1:2],data[,3],6)
index x y
[1,] 6 3.603448 42.25862
[2,] 8 4.086207 42.46552
[3,] 17 6.258621 43.39655
[4,] 21 7.224138 43.81034
[5,] 30 9.396552 44.74138
> maWithOutliers(data[,1:2],data[,3],4)
index x y
[1,] 4 3.362069 42.15517
[2,] 8 4.327586 42.56897
[3,] 15 6.017241 43.29310
[4,] 19 6.982759 43.70690
[5,] 21 7.465517 43.91379
[6,] 28 9.155172 44.63793
[7,] 30 9.637931 44.84483
>
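The question also asks for the overlapping windows to be labelled. One possible sketch, in base R and without zoo, takes the window end indices (hard-coded here from the findIndices() output shown above for window = 6) and flags a window when it shares rows with the previous or the next window. The function name window_means and the overlaps column are assumptions made for illustration.

```r
# Sample coordinates from the question
x <- seq(3, 10, length.out = 30)
y <- seq(42, 45, length.out = 30)
xy <- cbind(x, y)

# Hypothetical helper: mean of each window plus an overlap flag
window_means <- function(m, ends, window) {
  starts <- ends - window + 1
  # One row of column means per window
  means <- do.call(rbind, lapply(seq_along(ends), function(i) {
    colMeans(m[starts[i]:ends[i], , drop = FALSE])
  }))
  n <- length(ends)
  # A window overlaps the previous one if it starts at or before that one's end
  overlap_prev <- c(FALSE, starts[-1] <= ends[-n])
  overlap_next <- c(overlap_prev[-1], FALSE)
  data.frame(start = starts, end = ends, means,
             overlaps = overlap_prev | overlap_next)
}

ends <- c(6, 8, 17, 21, 30)   # end indices reported by findIndices() above
res <- window_means(xy, ends, 6)
```

For the sample data this flags the first four windows as overlapping and the last one as unique, which is the labelling described in the question.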

fill gaps in a timeseries with averages

I have a dataframe like so:
day sum_flux samples mean
2005-10-26 0.02 48 0.02
2005-10-27 0.12 12 0.50
It's a series of daily readings spanning 5 years, however some of the days are missing. I want to fill these days with the average of that month from other years.
i.e if 26-10-2005 was missing I'd want to use the average of all Octobers in the data set.
if all of October was missing I'd want to apply this average to each missing day.
I think I need to build a function (possibly using plyr) to evaluate the days. However I'm very inexperienced with using the various timeseries objects in R, and conditionally subsetting data and would like some advice. Especially regarding which type of timeseries I should be using.
Many Thanks
Some sample data. I'm assuming that sum_flux is the column that has missing values, and that you want to calculate values for.
library(lubridate)
days <- seq(ymd("2005-10-26"), ymd("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.frame(
day = days,
sum_flux = runif(n_days),
samples = sample(100, n_days, replace = TRUE),
mean = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA
Add a month column.
readings$month <- month(readings$day, label = TRUE)
Use tapply to get the monthly mean flux.
monthly_avg_flux <- with(readings, tapply(sum_flux, month, mean, na.rm = TRUE))
Use this value whenever the flux is missing, or keep the flux if not.
readings$sum_flux2 <- with(readings, ifelse(
is.na(sum_flux),
monthly_avg_flux[month],
sum_flux
))
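The steps above assume the missing days are already present as rows with NA. If the days are absent from the data frame altogether, as the question seems to describe, the rows can be created first by merging against a complete sequence of dates. The following base R sketch uses a tiny made-up data frame; the names all_days, full and monthly_avg are illustrative only.

```r
# Toy readings with gaps: several days are simply not present as rows
readings <- data.frame(
  day = as.Date(c("2005-10-26", "2005-10-27", "2005-10-30", "2006-10-28")),
  sum_flux = c(0.2, 0.4, 0.6, 0.8)
)

# Every calendar day between the first and last reading
all_days <- data.frame(day = seq(min(readings$day), max(readings$day), by = "day"))

# Left join: days without a reading get NA in sum_flux
full <- merge(all_days, readings, by = "day", all.x = TRUE)

# Average per calendar month across all years ("01" to "12")
full$month <- format(full$day, "%m")
monthly_avg <- tapply(full$sum_flux, full$month, mean, na.rm = TRUE)

# Fill the gaps with the average of the matching month.
# Months with no reading in any year have no average and stay NaN.
missing <- is.na(full$sum_flux)
full$sum_flux[missing] <- monthly_avg[full$month[missing]]
```

With real data spanning several years, each month normally has at least some readings, so the NaN case only arises for months that were never observed.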
This is one (very fast) way in data.table.
Using the nice example data from Richie :
require(data.table)
days <- seq(as.IDate("2005-10-26"), as.IDate("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.table(
day = days,
sum_flux = runif(n_days),
samples = sample(100, n_days, replace = TRUE),
mean = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA
readings
day sum_flux samples mean
[1,] 2005-10-26 0.32838686 94 0.09647325
[2,] 2005-10-27 0.14686591 88 0.48728321
[3,] 2005-10-28 0.25800913 51 0.72776002
[4,] 2005-10-29 0.09628937 81 0.80954124
[5,] 2005-10-30 0.70721591 23 0.60165240
[6,] 2005-10-31 0.59555079 2 0.96849533
[7,] 2005-11-01 NA 42 0.37566491
[8,] 2005-11-02 0.01649860 89 0.48866220
[9,] 2005-11-03 0.46802818 49 0.28920807
[10,] 2005-11-04 0.13024856 30 0.29051080
First 10 rows of 1827 printed.
Create the average for each month, in appearance order of each group :
> avg = readings[,mean(sum_flux,na.rm=TRUE),by=list(mnth = month(day))]
> avg
mnth V1
[1,] 10 0.4915999
[2,] 11 0.5107873
[3,] 12 0.4451787
[4,] 1 0.4966040
[5,] 2 0.4972244
[6,] 3 0.4952821
[7,] 4 0.5106539
[8,] 5 0.4717122
[9,] 6 0.5110490
[10,] 7 0.4507383
[11,] 8 0.4680827
[12,] 9 0.5150618
Next reorder avg to start in January :
avg = avg[order(mnth)]
avg
mnth V1
[1,] 1 0.4966040
[2,] 2 0.4972244
[3,] 3 0.4952821
[4,] 4 0.5106539
[5,] 5 0.4717122
[6,] 6 0.5110490
[7,] 7 0.4507383
[8,] 8 0.4680827
[9,] 9 0.5150618
[10,] 10 0.4915999
[11,] 11 0.5107873
[12,] 12 0.4451787
Now update by reference (:=) the sum_flux column, where sum_flux is NA, with the value from avg for that month.
readings[is.na(sum_flux), sum_flux:=avg$V1[month(day)]]
day sum_flux samples mean
[1,] 2005-10-26 0.32838686 94 0.09647325
[2,] 2005-10-27 0.14686591 88 0.48728321
[3,] 2005-10-28 0.25800913 51 0.72776002
[4,] 2005-10-29 0.09628937 81 0.80954124
[5,] 2005-10-30 0.70721591 23 0.60165240
[6,] 2005-10-31 0.59555079 2 0.96849533
[7,] 2005-11-01 0.51078729** 42 0.37566491 # ** updated with the Nov avg
[8,] 2005-11-02 0.01649860 89 0.48866220
[9,] 2005-11-03 0.46802818 49 0.28920807
[10,] 2005-11-04 0.13024856 30 0.29051080
First 10 rows of 1827 printed.
Done.
