I have two single-column data frames of unequal length:
aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))
For each observation in aa, I want to count the number of values in bb that are less than it.
My result:
  bb<aa
1     1
2     7
3     9
I have been able to do it two ways, by creating a function and using apply(), but my datasets are large and I let one run all night without finishing.
What I have:
fun1 <- function(a, b) {
  k <- colSums(b < a)
  k <- k * 0.000058242
}
system.time(replicate(5000,data.frame(apply(aa,1,fun1,b=bb))))
user system elapsed
3.813 0.011 3.883
Secondly,
fun2 <- function(a, b) {
  k <- length(which(b < a))
  k <- k * 0.000058242
}
system.time(replicate(5000,data.frame(apply(aa,1,fun2,b=bb))))
user system elapsed
3.648 0.006 3.664
The second function is slightly faster in all my tests, but I let the first run all night on a dataset where bb has more than 1.7 million rows and aa more than 160,000.
I found this post and have tried using with(), but I cannot seem to get it to work; I also tried a for loop without success.
Any help or direction is appreciated.
Thank you!
aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))
sapply(aa[[1]],function(x)sum(bb[[1]]<x))
# [1] 1 7 9
Some more realistic examples:
n <- 1.6e3
bb <- sample(1:n,1.7e6,replace=T)
aa <- 1:n
system.time(sapply(aa,function(x)sum(bb<x)))
# user system elapsed
# 14.63 2.23 16.87
n <- 1.6e4
bb <- sample(1:n,1.7e6,replace=T)
aa <- 1:n
system.time(sapply(aa,function(x)sum(bb<x)))
# user system elapsed
# 148.77 18.11 167.26
So with length(aa) = 1.6e4 this takes about 2.5 min (on my system), and the process scales as O(length(aa)) - no surprise there. Therefore, with your full dataset, it should run in about 25 min. Still kind of slow. Maybe someone else will come up with a better way.
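A much faster alternative (just a sketch, not from the original answers) is to sort bb once and let findInterval() do a binary search for each value of aa; with left.open = TRUE it counts the elements of bb that are strictly smaller. Using the small example values:
aa <- c(2, 12, 35)
bb <- c(1, 2, 3, 4, 5, 6, 7, 15, 22, 36)
findInterval(aa, sort(bb), left.open = TRUE)
# [1] 1 7 9
# Sorting is O(m log m) and each lookup is O(log m), so bb with ~1.7m values and
# aa with ~160k values should take seconds rather than minutes.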
In my original post I had been looking for the number of times bb was less than aa.
So in my example:
aa<-data.frame(c(2,12,35))
bb<-data.frame(c(1,2,3,4,5,6,7,15,22,36))
x<-ecdf(bb[,1])
x(2)
[1] 0.2
x(12)
[1] 0.7
x(35)
[1] 0.9
To get the answers in my original post I would need to multiply by the number of data points in bb, in this instance 10. The first value does not match, though, because in my original post I had asked for bb < aa (strictly less than), whereas ecdf() counts values less than or equal to the given value.
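For example, with the small aa/bb example above:
x <- ecdf(bb[,1])
x(c(2, 12, 35)) * nrow(bb)
# [1] 2 7 9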
I am dealing with large datasets of land elevation and water elevation, over 1 million data points each, and in the end I am creating an inundation curve: I want to know how much land will be inundated at a water level with a given exceedance probability.
Using the above ecdf() function on all 1 million data points would still be time consuming, but I realized I do not need all of them, just enough to create my curve.
So I applied the ecdf() function to the entire land dataset, but then created a water-elevation sequence large enough to draw the curve I needed, yet small enough to compute rapidly.
land_elevation <- data.frame(rnorm(1e6))
water_elevation<- data.frame(rnorm(1e6))
cdf_land<- ecdf(land_elevation[,1])
elevation_seq <- seq(from = min(water_elevation[,1]), to = max(water_elevation[,1]), length.out = 1000)
land <- sapply(elevation_seq, cdf_land)
The results are the same, but they are computed much faster.
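The curve itself can then be drawn directly from those 1,000 evaluated points; a minimal illustration using the objects defined above:
plot(elevation_seq, land, type = "l",
     xlab = "water elevation", ylab = "fraction of land at or below this elevation")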
Related
Let's assume that we have a data frame x which contains the columns job and income. Referring to the data in the frame normally requires the commands x$job for the data in the job column and x$income for the data in the income column.
However, using the command attach(x) permits to do away with the name of the data frame and the $ symbol when referring to the same data. Consequently, x$job becomes job and x$income becomes income in the R code.
The problem is that many experts in R advise NOT to use the attach() command when coding in R.
What is the main reason for that? What should be used instead?
When to use it:
I use attach() when I want the environment you get in most stats packages (e.g. Stata, SPSS) of working with one rectangular dataset at a time.
When not to use it:
However, it gets very messy and code quickly becomes unreadable when you have several different datasets, particularly if you are in effect using R as a crude relational database, where different rectangles of data (all relevant to the problem at hand, and perhaps matched against each other in various ways) have variables with the same name.
The with() function, or the data= argument to many functions, are excellent alternatives to many instances where attach() is tempting.
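For example, with the hypothetical x data frame from the question:
with(x, mean(income))              # evaluate an expression with x's columns in scope
fit <- lm(income ~ job, data = x)  # many modelling functions accept data= directly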
Another reason not to use attach(): it allows read-only access to the values of a data frame's columns, and only as they were when attached. It is not a shorthand for the current value of a column. Two examples:
> head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
> attach(cars)
> # convert stopping distance to meters
> dist <- 0.3048 * dist
> # convert speed to meters per second
> speed <- 0.44707 * speed
> # compute a meaningless time
> time <- dist / speed
> # check our work
> head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
No changes were made to the cars data set even though dist and speed were assigned to.
If explicitly assigned back to the data set...
> head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
> attach(cars)
> # convert stopping distance to meters
> cars$dist <- 0.3048 * dist
> # convert speed to meters per second
> cars$speed <- 0.44707 * speed
> # compute a meaningless time
> cars$time <- dist / speed
> # compute meaningless time being explicit about using values in cars
> cars$time2 <- cars$dist / cars$speed
> # check our work
> head(cars)
    speed   dist      time     time2
1 1.78828 0.6096 0.5000000 0.3408862
2 1.78828 3.0480 2.5000000 1.7044311
3 3.12949 1.2192 0.5714286 0.3895842
4 3.12949 6.7056 3.1428571 2.1427133
5 3.57656 4.8768 2.0000000 1.3635449
6 4.02363 3.0480 1.1111111 0.7575249
The dist and speed referenced in computing time are the original (untransformed) values: the values of cars$dist and cars$speed as they were when cars was attached.
I think there's nothing wrong with using attach. I myself don't use it (then again, I love animals, but don't keep any, either). When I think of attach, I think long term. Sure, while I'm working on a script I know it inside and out. But in a week's time, a month, or a year, when I go back to the script, I find the overhead of figuring out where a certain variable comes from just too expensive. A lot of methods have a data argument, which makes calling variables pretty easy (sensu lm(x ~ y + z, data = mydata)). If not, I find the usage of with() satisfactory.
In short, in my book, attach is fine for quick data exploration, but for developing scripts that I or others might want to use, I try to keep my code as readable (and transferable) as possible.
If you execute attach(data) multiple times, e.g. 5 times, then you can see (with the help of search()) that your data has been attached 5 times to the search path. If you then detach(data) once, the data will still be attached 4 times. Hence, with()/within() are better options: they create a local environment containing the object, so you can use it without creating any confusion.
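A small sketch of both points (in a fresh session), using the built-in cars data:
attach(cars)
attach(cars)               # attached a second time (R warns about masking)
search()                   # "cars" now appears twice on the search path
detach(cars)               # removes only one copy; "cars" is still attached once
detach(cars)               # clean up the remaining copy
# with()/within() never touch the search path:
mean_dist <- with(cars, mean(dist))
cars_m <- within(cars, dist_m <- 0.3048 * dist)  # returns a modified copy of cars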
I have a very large file that simply contains wave heights for different tidal scenarios at different locations. My file is organized into 13 wave heights x 9941 events, for 5153 locations.
What I want to do is read in this very long data file, which looks like this:
0.0
0.1
0.2
0.4
1.2
1.5
2.1
.....
Then I want to split it into segments of length 129,233 (13 tidal scenarios for 9941 events, i.e. one segment per location). On each subset I'd like to perform some statistical functions to calculate exceedance probability, among other things. I will then join the results to the file containing location information, and print some output files.
My code so far is not working, although I've tried many things. It seems to read the data just fine; however, it has trouble with the split. I suspect it may have something to do with the format of the input data from the file.
# read files with return period wave heights at defense points
#Read wave heights for 13 tides per 9941 events, for 5143 points
WaveRP.file <- paste('waveheight_test.out')
WaveRPtable <- read.csv(WaveRP.file, head=FALSE)
WaveRP <- c(WaveRPtable)
#colnames(WaveRP) <- c("WaveHeight")
print(paste(WaveRP))
#Read X,Y information for defense points
DefPT.file <- paste('DefXYevery10thpt.out')
DefPT <- read.table(DefPT.file, head=FALSE)
colnames(DefPT) <- c("X_UTM", "Y_UTM")
#Split wave height data frame by defense point
WaveByDefPt <- split(WaveRP, 129233)
print(paste(length(WaveByDefPt[[1]])))
for (i in 1:length(WaveByDefPt)/129233){
print(paste("i",i))
}
I have also tried
#Split wave height data frame by defense point
WaveByDefPt <- split(WaveRP, ceiling(seq_along(WaveRP)/129233))
No matter how I seem to perform the split, I am simply getting the original data as one long subset. Any help would be appreciated!
Thanks :)
Kimberly
Try cut to build groups:
v <- as.numeric(readLines(n = 7))
0.0
0.1
0.2
0.4
1.2
1.5
2.1
groups <- cut(v, breaks = 3) # you want breaks = 129233
aggregate(x = v, by = list(groups), FUN = mean) # e.g. means per group
#           Group.1     x
# 1 (-0.0021,0.699] 0.175
# 2     (0.699,1.4] 1.200
# 3       (1.4,2.1] 1.800
You are kind of shuffling the data into various data types here.
When the file is originally read, it is a data frame with 1 column (V1). Then you pass it to c(), which results in a list with a single vector in it. This means that if you try to do anything with WaveRP you will probably fail, because that is the name of the list; the numeric vector is WaveRP[[1]].
Instead, just extract the numeric vector using the $ operator and then you can work with it. Or just work with it inside the data frame. The fun part will be thinking of a way to create the grouping vector. I'll give an example.
Something like this:
WaveRP.file <- paste('waveheight_test.out')
WaveRPtable <- read.csv(WaveRP.file, head=FALSE)
WaveRPtable$group <- ceiling(seq_along(WaveRPtable$V1)/129233)
SplitWave <- split(WaveRPtable, WaveRPtable$group)
Now you will have a list of data frames, one per 129,233-row chunk (i.e. one per location). Look at each one using double-bracket indexing, e.g. SplitWave[[2]] for the second group. You can merge the location information file with these data frames individually.
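From there you can compute whatever statistic you need per location, for example (the 2 m threshold is made up, and V1 is read.csv's default column name):
# fraction of events per location whose wave height exceeds a hypothetical 2 m threshold
exceedance <- sapply(SplitWave, function(d) mean(d$V1 > 2.0))
head(exceedance)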
I want to calculate the pooled (actually weighted) standard deviation for all the unique sites in my data frame.
The values for these sites come from single-species forest stands, and I want to pool the mean and the sd so that I can compare broadleaved stands with conifer stands.
This is the data frame (df) with values for the broadleaved stands:
keybl            n    mean      sd
Vest02DenmDesp   3   58.16    6.16
Vest02DenmDesp   5   54.45    7.85
Vest02DenmDesp   3   51.34    1.71
Vest02DenmDesp   3   59.57    5.11
Vest02DenmDesp   5   62.89   10.26
Vest02DenmDesp   3   77.33    2.14
Mato10GermDesp   4   41.89   12.6
Mato10GermDesp   4   11.92    1.8
Wawa07ChinDesp  18    0.097   0.004
Chen12ChinDesp   3   41.93    1.12
Hans11SwedDesp   2  1406.2  679.46
Hans11SwedDesp   2  1156.2  464.07
Hans11SwedDesp   2  4945.3  364.58
Keybl is the code for the site. The formula for the pooled SD is:
s = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1+n2-2))
(Sorry I can't post pictures and did not find a link that would directly go to the formula)
Here 2 is the number of groups, which will change depending on the site. I know this formula is used for a t-test comparing two groups; in this case I'm not planning to compare the groups. My professor suggested this formula to get a weighted sd. I didn't find an R function that incorporates this formula in the way I need it, so I tried to build my own. I am, however, new to R and not very good at writing functions and loops, so I hope for your help.
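As a quick numeric check of the formula above (not part of the original question), pooling just the first two Vest02DenmDesp rows:
n1 <- 3; s1 <- 6.16
n2 <- 5; s2 <- 7.85
sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
# [1] 7.330089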
This is what I got so far:
sd=function (data) {
nc1=data[z,"nc"]
sc1=data[z, "sc"]
nc2=data[z+1, "nc"]
sc2=data[z+1, "sc"]
sd1=(nc1-1)*sc1^2 + (nc2-1)*sc2^2
sd2=sd1/(nc1+nc2-length(nc1))
sqrt(sd2)
}
splitdf=split(df, with(df, df$keybl), drop = TRUE)
for (c in 1:length(splitdf)) {
for (i in 1:length(splitdf[[i]])) {
a = (splitdf[[i]])
b =sd(a)
}
}
1) The function itself is not correct as it gives slightly lower values than it should and I don't understand why. Could it be that it does not stop when z+1 has reached the last row? If so, how can that be corrected?
2) The loop is totally wrong but it is what I could come up with after several hours of no success.
Can anybody help me?
Thanks,
Antra
What you're trying to do would benefit from a more general formula, which will make it easier. If you didn't need to break it into pieces by the keybl variable, you'd be done:
dd <- df #df is not a good name for a data.frame variable since df has a meaning in statistics
dd$df <- dd$n-1
pooledSD <- sqrt( sum(dd$sd^2 * dd$df) / sum(dd$df) )
# note, in this case I only pre-calculated df because I'll need it more than once. The sum of squares, variance, etc. are only used once.
An important general principle in R is to use vector math as much as possible. In this trivial case it won't matter much, but to see how to do this on large data.frame objects, where compute speed matters more, read on.
# First use R's vector facilities to define the variables you need for pooling.
dd$df <- dd$n-1
dd$s2 <- dd$sd^2 # sd isn't a good name for standard deviation variable even in a data.frame just because it's a bad habit to have... it's already a function and standard deviations have a standard name
dd$ss <- dd$s2 * dd$df
And now just use convenience functions for splitting and calculating the necessary sums. Note only one function is executed here in each implicit loop (*apply, aggregate, etc. are all implicit loops executing functions many times).
ds <- aggregate(ss ~ keybl, data = dd, sum)
ds$df <- tapply(dd$df, dd$keybl, sum) #two different built in methods for split apply, we could use aggregate for both if we wanted
# divide your ss by your df and voila
ds$s2 <- ds$ss / ds$df
# and also you can easly get your sd
ds$s <- sqrt(ds$s2)
And the correct answer is:
           keybl           ss df           s2          s
1 Chen12ChinDesp 2.508800e+00  2 1.254400e+00   1.120000
2 Hans11SwedDesp 8.099454e+05  3 2.699818e+05 519.597740
3 Mato10GermDesp 4.860000e+02  6 8.100000e+01   9.000000
4 Vest02DenmDesp 8.106832e+02 16 5.066770e+01   7.118125
5 Wawa07ChinDesp 2.720000e-04 17 1.600000e-05   0.004000
This looks much less concise than other methods (like 42-'s answer) but if you unroll those in terms of how many R commands are actually being executed this is much more concise. For a short problem like this either way is fine but I thought I'd show you the method that uses the most vector math. It also highlights why those convenient implicit loop functions are available, for expressiveness. If you used for loops to accomplish the same then the temptation would be stronger to put everything in the loop. This can be a bad idea in R.
The pooled SD under the assumption of independence (so the covariance terms can be assumed to be zero) will be: sqrt( sum_over_groups[ (n_i - 1) * var_i ] / ( sum_over_groups(n_i - 1) - N_groups ) )
lapply( split(dat, dat$keybl),
function(dd) sqrt( sum( dd$sd^2 * (dd$n-1) )/(sum(dd$n-1)-nrow(dd)) ) )
#-------------------------
$Chen12ChinDesp
[1] 1.583919
$Hans11SwedDesp
[1] Inf
$Mato10GermDesp
[1] 11.0227
$Vest02DenmDesp
[1] 9.003795
$Wawa07ChinDesp
[1] 0.004123106
This is admittedly a very simple question that I just can't find an answer to.
In R, I have a file that has 2 columns: the first with categorical data names, and the second a count column (a count for each of the categories). With a small dataset, I would use the 'reshape' package and the function 'untable' to expand it into 1 column and do the analysis that way. The question is, how do I handle this with a large data set?
In this case, my data is humongous and that just isn't going to work.
My question is, how do I tell R to use something like the following as distribution data:
Cat  Count
A        5
B        7
C        1
That is, I give it a histogram as an input and have R figure out that it means there are 5 of A, 7 of B and 1 of C when calculating other information about the data.
That is, rather than being given the expanded data, R should understand that the input above is equivalent to the following:
A
A
A
A
A
B
B
B
B
B
B
B
C
With reasonably sized data I can expand this myself, but what do you do when the data is very large?
Edit
The total sum of all the counts is 262,916,849.
In terms of what it would be used for:
This is new data; I am trying to understand the correlation between this new data and other pieces of data. I need to work on linear regressions and mixed models.
I think what you're asking is to reshape a data frame of categories and counts into a single vector of observations, where categories are repeated. Here's one way:
dat <- data.frame(Cat=LETTERS[1:3],Count=c(5,7,1))
#  Cat Count
#1   A     5
#2   B     7
#3   C     1
rep.int(dat$Cat,times=dat$Count)
# [1] A A A A A B B B B B B B C
#Levels: A B C
To follow up on @Blue Magister's excellent answer, here's a 100,000 row histogram with a total count of 551,245,193:
set.seed(42)
Cat <- sapply(rep(10, 100000), function(x) {
paste(sample(LETTERS, x, replace=TRUE), collapse='')
})
dat <- data.frame(Cat, Count=sample(1000:10000, length(Cat), replace=TRUE))
> head(dat)
         Cat Count
1 XYHVQNTDRS  5154
2 LSYGMYZDMO  4724
3 XDZYCNKXLV  8691
4 TVKRAVAFXP  2429
5 JLAZLYXQZQ  5704
6 IJKUBTREGN  4635
This is a pretty big dataset by my standards, and the operation Blue Magister describes is very quick:
> system.time(x <- rep(dat$Cat,times=dat$Count))
user system elapsed
4.48 1.95 6.42
It uses about 6GB of RAM to complete the operation.
This really depends on what statistics you are trying to calculate. The xtabs function will create tables for you where you can specify the counts. The Hmisc package has functions like wtd.mean that will take a vector of weights for computing a mean (and related functions for standard deviation, quantiles, etc.). The biglm package could be used to expand parts of the dataset at a time and analyze. There are probably other packages as well that would handle the frequency data, but which is best depends on what question(s) you are trying to answer.
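A small sketch of the first two suggestions (the numeric category values in the second part are made up for illustration):
dat <- data.frame(Cat = LETTERS[1:3], Count = c(5, 7, 1))
xtabs(Count ~ Cat, data = dat)   # a contingency table built directly from the counts
# Cat
# A B C
# 5 7 1
library(Hmisc)
# if the categories were numeric, the counts could serve as weights:
wtd.mean(c(1, 2, 3), weights = dat$Count)
# [1] 1.692308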
The existing answers all expand the pre-binned dataset into a full distribution and then use R's histogram function, which is memory-inefficient and will not scale to very large datasets like the one the original poster asked about. The HistogramTools CRAN package includes a PreBinnedHistogram function which takes arguments for breaks and counts and creates a histogram object in R without massively expanding the dataset.
For example, if the data set has 3 buckets with 5, 7, and 1 elements, all of the other solutions posted here so far expand that into a 13-element vector first and then create the histogram. PreBinnedHistogram, in contrast, creates the histogram directly from the 3-element input without creating a much larger intermediate vector in memory.
big.histogram <- PreBinnedHistogram(my.data$breaks, my.data$counts)
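A hypothetical end-to-end sketch (the bin boundaries are made up; breaks must have one more element than counts, and the result is assumed to behave like a standard histogram object, as described above):
library(HistogramTools)
h <- PreBinnedHistogram(c(0, 1, 2, 3), c(5, 7, 1))  # 3 bins with boundaries 0, 1, 2, 3
plot(h)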