I would like to find the differences between my samples, but when I use diff() the first sample goes missing.
input:
data
XX.3.22 XX.1.2 XX.5.19 XX.2.21 XX.2.16 XX.5.27 XX.3.5 XX.2.12 XX.4.15
0.00 0.12 0.17 0.20 0.21 0.26 0.27 0.27 0.32
diff(data)
output:
XX.1.2 XX.5.19 XX.2.21 XX.2.16 XX.5.27 XX.3.5 XX.2.12 XX.4.15
0.05 0.05 0.03 0.01 0.05 0.01 0.00 0.05
I do not want to lose the first (XX.3.22) sample.
I expect:
XX.3.22 = 0.12
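diff() returns n-1 values, so one label has to go; judging by the expected output, the goal is to label each difference with the left-hand sample, so that XX.3.22 = 0.12. A minimal sketch, assuming data is the named numeric vector shown above:
# diff() names each difference after the second operand of the subtraction;
# relabel with the left-hand (leading) sample names so the first name survives.
d <- diff(data)
names(d) <- head(names(data), -1)
d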
Sorry, possibly a very silly question, but I couldn't find the answer. How do I load this kind of .dat file in R and stack the values into one column? I have been trying
NerveData<-as.vector(read.table("D:/Dropbox/nerve.dat", sep=" ")$value)
The data set looks like
0.21 0.03 0.05 0.11 0.59 0.06
0.18 0.55 0.37 0.09 0.14 0.19
0.02 0.14 0.09 0.05 0.15 0.23
0.15 0.08 0.24 0.16 0.06 0.11
0.15 0.09 0.03 0.21 0.02 0.14
0.24 0.29 0.16 0.07 0.07 0.04
0.02 0.15 0.12 0.26 0.15 0.33
If you want to read all the data in as a single vector, use
src <- "http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/nerve.dat"
NerveData <- scan(src, numeric())
Actually I found an easier solution, thanks for the initial help:
Nervedata <- read.table("nerve.dat", sep = "\t")  # one line of the file per data.frame row
Nervedata2 <- c(t(Nervedata))                     # transpose, then flatten row by row into one vector
Simply use read.table with the correct separator, which in your case is probably \t, a tab character.
So try:
NerveData = read.table("D:/Dropbox/nerve.dat", sep="\t")
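If you also want the values stacked into a single column afterwards, the same row-wise flattening shown above should work; a hedged sketch, reusing NerveData from this answer:
# Transpose, then flatten: values are read row by row into one long vector.
stacked <- c(t(as.matrix(NerveData)))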
I am working with a relatively large data.table dataset and trying to profile/optimize the code. I am using Rprof, but I'm noticing that the majority of time spent within a setkey operation is not included in the Rprof summary. Is there a way to include this time spent?
Here is a small test to show how time spent setting the key for a data table is not represented in the Rprof summary:
Create a test function that runs a profiled setkey operation on a data table:
testFun <- function(testTbl) {
    Rprof()
    setkey(testTbl, x, y, z)
    Rprof(NULL)
    print(summaryRprof())
}
Then create a test data table that is large enough to feel the weight of the setkey operation:
testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
Then run the code, and wrap it within a system.time operation to show the difference between the system.time total time and the Rprof total time:
> system.time(testFun(testTbl))
$by.self
self.time self.pct total.time total.pct
"sort.list" 0.88 75.86 0.88 75.86
"<Anonymous>" 0.08 6.90 1.00 86.21
"regularorder1" 0.08 6.90 0.92 79.31
"radixorder1" 0.08 6.90 0.12 10.34
"is.na" 0.02 1.72 0.02 1.72
"structure" 0.02 1.72 0.02 1.72
$by.total
total.time total.pct self.time self.pct
"setkey" 1.16 100.00 0.00 0.00
"setkeyv" 1.16 100.00 0.00 0.00
"system.time" 1.16 100.00 0.00 0.00
"testFun" 1.16 100.00 0.00 0.00
"fastorder" 1.14 98.28 0.00 0.00
"tryCatch" 1.14 98.28 0.00 0.00
"tryCatchList" 1.14 98.28 0.00 0.00
"tryCatchOne" 1.14 98.28 0.00 0.00
"<Anonymous>" 1.00 86.21 0.08 6.90
"regularorder1" 0.92 79.31 0.08 6.90
"sort.list" 0.88 75.86 0.88 75.86
"radixorder1" 0.12 10.34 0.08 6.90
"doTryCatch" 0.12 10.34 0.00 0.00
"is.na" 0.02 1.72 0.02 1.72
"structure" 0.02 1.72 0.02 1.72
"is.unsorted" 0.02 1.72 0.00 0.00
"simpleError" 0.02 1.72 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 1.16
user system elapsed
31.112 0.211 31.101
Note the difference between the two totals: 1.16 seconds of Rprof sampling time versus 31.101 seconds elapsed.
Reading ?Rprof, I see why this difference might have occurred:
Functions will only be recorded in the profile log if they put a
context on the call stack (see sys.calls). Some primitive functions do
not do so: specifically those which are of type "special" (see the ‘R
Internals’ manual for more details).
So is this the reason why time spent within the setkey operation isn't represented in Rprof? Is there a workaround to have Rprof watch all of data.table's operations (including setkey, and maybe others that I haven't noticed)? I essentially want the system.time and Rprof times to match up.
Here is the most-likely relevant sessionInfo():
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
data.table_1.8.11
I still observe this issue when Rprof() isn't within a function call:
> testFun <- function(testTbl) {
+ setkey(testTbl, x, y, z)
+ }
> Rprof()
> system.time(testFun(testTbl))
user system elapsed
28.855 0.191 28.854
> Rprof(NULL)
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"sort.list" 0.86 71.67 0.88 73.33
"regularorder1" 0.08 6.67 0.92 76.67
"<Anonymous>" 0.06 5.00 0.98 81.67
"radixorder1" 0.06 5.00 0.10 8.33
"gc" 0.06 5.00 0.06 5.00
"proc.time" 0.04 3.33 0.04 3.33
"is.na" 0.02 1.67 0.02 1.67
"sys.function" 0.02 1.67 0.02 1.67
$by.total
total.time total.pct self.time self.pct
"system.time" 1.20 100.00 0.00 0.00
"setkey" 1.10 91.67 0.00 0.00
"setkeyv" 1.10 91.67 0.00 0.00
"testFun" 1.10 91.67 0.00 0.00
"fastorder" 1.08 90.00 0.00 0.00
"tryCatch" 1.08 90.00 0.00 0.00
"tryCatchList" 1.08 90.00 0.00 0.00
"tryCatchOne" 1.08 90.00 0.00 0.00
"<Anonymous>" 0.98 81.67 0.06 5.00
"regularorder1" 0.92 76.67 0.08 6.67
"sort.list" 0.88 73.33 0.86 71.67
"radixorder1" 0.10 8.33 0.06 5.00
"doTryCatch" 0.10 8.33 0.00 0.00
"gc" 0.06 5.00 0.06 5.00
"proc.time" 0.04 3.33 0.04 3.33
"is.na" 0.02 1.67 0.02 1.67
"sys.function" 0.02 1.67 0.02 1.67
"formals" 0.02 1.67 0.00 0.00
"is.unsorted" 0.02 1.67 0.00 0.00
"match.arg" 0.02 1.67 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 1.2
EDIT2: Same issue with 1.8.10 on my machine with only the data.table package loaded. Times are not equal even when the Rprof() call is not within a function:
> library(data.table)
data.table 1.8.10 For help type: help("data.table")
> testFun <- function(testTbl) {
+ setkey(testTbl, x, y, z)
+ }
> testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
> Rprof()
> system.time(testFun(testTbl))
user system elapsed
29.516 0.281 29.760
> Rprof(NULL)
> summaryRprof()
EDIT3: Doesn't work even if setkey is not within a function:
> library(data.table)
data.table 1.8.10 For help type: help("data.table")
> testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
> Rprof()
> setkey(testTbl, x, y, z)
> Rprof(NULL)
> summaryRprof()
EDIT4: Doesn't work even when R is called from a --vanilla bare-bones terminal prompt.
EDIT5: Does work when tested on a Linux VM. But still does not work on darwin machine for me.
EDIT6: Doesn't work after watching the Rprof.out file get created, so it isn't a write access issue.
EDIT7: Doesn't work after compiling data.table from source and creating a new temp user and running on that account.
EDIT8: Doesn't work when compiling R 3.0.2 from source for darwin via MacPorts.
EDIT9: Does work on a different darwin machine, a Macbook Pro laptop running the same OS version (10.6.8). Still doesn't work on a MacPro desktop machine running same OS version, R version, data.table version, etc.
I'm thinking it's because the desktop machine is running in 64-bit kernel mode (not the default), and the laptop is running the 32-bit kernel (the default). Confirmed.
Great question. Given the edits, I'm not sure then; I can't reproduce it. Leaving the remainder of my answer here for now.
I've tested on my (very slow) netbook and it works fine, see output below.
I can tell you right now why setkey is so slow in that test case. When the number of levels is large (greater than 100,000, as here) it reverts to comparison sort rather than counting sort. Yes, pretty poor if you have data like that in practice. Typically we have under 100,000 unique values in the first column and then, say, dates in the 2nd column. Both columns can be sorted using counting sort and performance is ok.
It's a known issue and we've been working hard on it. Arun has implemented radix sort for integers with range > 100,000 to solve this problem and that's in the next release. But we are still tidying up v1.8.11. See our presentation in Cologne which goes into more detail and gives some idea of speedups.
Introduction to data.table and news from v1.8.11
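To see the cardinality effect described above in isolation, one can compare setkey on low-range columns against the high-range case from the question. A hedged sketch (the cutover at a range of 100,000 is as described above for v1.8.10; exact timings will vary by machine):
library(data.table)
# Same row count, but values confined to a range where counting sort applies:
lowCard <- data.table(x = sample(1:1e5, 1e7, replace = TRUE),
                      y = sample(1:1e5, 1e7, replace = TRUE))
# The question's situation: ~1e7 distinct values, which falls back to comparison sort:
highCard <- data.table(x = sample(1:1e7, 1e7),
                       y = sample(1:1e7, 1e7))
system.time(setkey(lowCard, x, y))    # expected: fast (counting sort)
system.time(setkey(highCard, x, y))   # expected: much slower (comparison sort)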
Here is the output with v1.8.10, along with R version and lscpu info (for your entertainment). I like to test on a very poor machine with a small cache so that in development I can see what's likely to bite when the data is scaled up on larger machines with larger caches.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 20
Model: 2
Stepping: 0
CPU MHz: 800.000
BogoMIPS: 1995.01
Virtualisation: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
NUMA node0 CPU(s): 0,1
$ R
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> require(data.table)
Loading required package: data.table
data.table 1.8.10 For help type: help("data.table")
> testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
> testTbl
x y z
1: 1748920 6694402 7501082
2: 4571252 565976 5695727
3: 1284455 8282944 7706392
4: 8452994 8765774 6541097
5: 6429283 329475 5271154
---
9999996: 2019750 5956558 1735214
9999997: 1096888 1657401 3519573
9999998: 1310171 9002746 350394
9999999: 5393125 5888350 7657290
10000000: 2210918 7577598 5002307
> Rprof()
> setkey(testTbl, x, y, z)
> Rprof(NULL)
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"sort.list" 195.44 91.34 195.44 91.34
".Call" 5.38 2.51 5.38 2.51
"<Anonymous>" 4.32 2.02 203.62 95.17
"radixorder1" 4.32 2.02 4.74 2.22
"regularorder1" 4.28 2.00 199.30 93.15
"is.na" 0.12 0.06 0.12 0.06
"any" 0.10 0.05 0.10 0.05
$by.total
total.time total.pct self.time self.pct
"setkey" 213.96 100.00 0.00 0.00
"setkeyv" 213.96 100.00 0.00 0.00
"fastorder" 208.36 97.38 0.00 0.00
"tryCatch" 208.36 97.38 0.00 0.00
"tryCatchList" 208.36 97.38 0.00 0.00
"tryCatchOne" 208.36 97.38 0.00 0.00
"<Anonymous>" 203.62 95.17 4.32 2.02
"regularorder1" 199.30 93.15 4.28 2.00
"sort.list" 195.44 91.34 195.44 91.34
".Call" 5.38 2.51 5.38 2.51
"radixorder1" 4.74 2.22 4.32 2.02
"doTryCatch" 4.74 2.22 0.00 0.00
"is.unsorted" 0.22 0.10 0.00 0.00
"is.na" 0.12 0.06 0.12 0.06
"any" 0.10 0.05 0.10 0.05
$sample.interval
[1] 0.02
$sampling.time
[1] 213.96
>
The problem was that the darwin machine was running Snow Leopard with a 64-bit kernel, which is not the default for that OS X version.
I also verified that this is not a problem for another darwin machine running Mountain Lion which uses a 64-bit kernel by default. So it's an interaction between Snow Leopard and running a 64-bit kernel specifically.
As a side note, the official OS X binary installer for R is still built with Snow Leopard, so I do think that this issue is still relevant, as Snow Leopard is still a widely-used OS X version.
When the 64-bit kernel in Snow Leopard is enabled, no kernel extensions that are compatible only with the 32-bit kernel are loaded. After booting into the default 32-bit kernel for Snow Leopard, kextfind shows that these 32-bit only kernel extensions are on the machine and (most likely) loaded:
$ kextfind -not -arch x86_64
/System/Library/Extensions/ACard6280ATA.kext
/System/Library/Extensions/ACard62xxM.kext
/System/Library/Extensions/ACard67162.kext
/System/Library/Extensions/ACard671xSCSI.kext
/System/Library/Extensions/ACard6885M.kext
/System/Library/Extensions/ACard68xxM.kext
/System/Library/Extensions/AppleIntelGMA950.kext
/System/Library/Extensions/AppleIntelGMAX3100.kext
/System/Library/Extensions/AppleIntelGMAX3100FB.kext
/System/Library/Extensions/AppleIntelIntegratedFramebuffer.kext
/System/Library/Extensions/AppleProfileFamily.kext/Contents/PlugIns/AppleIntelYonahProfile.kext
/System/Library/Extensions/IO80211Family.kext/Contents/PlugIns/AirPortAtheros.kext
/System/Library/Extensions/IONetworkingFamily.kext/Contents/PlugIns/AppleRTL8139Ethernet.kext
/System/Library/Extensions/IOSerialFamily.kext/Contents/PlugIns/InternalModemSupport.kext
/System/Library/Extensions/IOSerialFamily.kext/Contents/PlugIns/MotorolaSM56KUSB.kext
/System/Library/Extensions/JMicronATA.kext
/System/Library/Extensions/System.kext/PlugIns/BSDKernel6.0.kext
/System/Library/Extensions/System.kext/PlugIns/IOKit6.0.kext
/System/Library/Extensions/System.kext/PlugIns/Libkern6.0.kext
/System/Library/Extensions/System.kext/PlugIns/Mach6.0.kext
/System/Library/Extensions/System.kext/PlugIns/System6.0.kext
/System/Library/Extensions/ufs.kext
So it could be any one of those loaded extensions that is enabling something for Rprof to use, so that the setkey operation in data.table is profiled correctly.
If anyone wants to investigate this further, dig a bit deeper, and get to the root cause of the problem, please post an answer and I'll happily accept that one.
The product of one simulation is a large data.frame with fixed columns and rows. I ran several hundred simulations, with each result stored in a separate RData file (for efficient reading).
Now I want to gather all those files together and compute statistics for each field of this data.frame in the "cells" structure, which is basically a list of lists of vectors. This is how I do it:
#colscount, rowscount - number of columns and rows from each simulation
#simcount - number of simulation.
#colnames - names of columns of simulation's data frame.
#simfilenames - vector with filenames with each simulation
cells <- as.list(rep(NA, colscount))
for (i in 1:colscount)
{
    cells[[i]] <- as.list(rep(NA, rowscount))
    for (j in 1:rowscount)
    {
        cells[[i]][[j]] <- rep(NA, simcount)
    }
}
names(cells)<-colnames
addcells <- function(simnr)
# This function reads simdata and appends it at the "simnr" position in each cell of the "cells" structure
{
    simdata <- readRDS(simfilenames[[simnr]])
    for (i in 1:colscount)
    {
        for (j in 1:rowscount)
        {
            if (!is.na(simdata[j, i]))
            {
                cells[[i]][[j]][simnr] <- simdata[j, i]
            }
        }
    }
}
library(plyr)
a_ply(1:simcount,1,addcells)
The problem is the timing. Reading a single simulation file is fast:
> system.time(dane<-readRDS(path.cat(args$rdatapath,pliki[[simnr]]))$dane)
user system elapsed
0.088 0.004 0.093
While
> system.time(addcells(1))
user system elapsed
147.328 0.296 147.644
I would expect both commands to have comparable execution times (or at least the latter to be at most 10x slower). I guess I am doing something very inefficient there, but what? The whole cells data structure is rather big; it takes around 1 GB of memory.
I need to transpose the data in this way, because later I compute many descriptive statistics on the results (like means, standard deviations, quantiles, and maybe histograms), so it is important that the data for each cell is stored as a (one-dimensional) vector.
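For reference, those later per-cell statistics could look something like this hedged sketch (cellStats is a hypothetical name; it assumes the cells in question hold numeric values):
# Summarise each cell's vector of per-simulation values.
cellStats <- lapply(cells, function(col)
    lapply(col, function(v) c(mean = mean(v, na.rm = TRUE),
                              sd = sd(v, na.rm = TRUE))))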
Here is profiling output:
> summaryRprof('/tmp/temp/rprof.out')
$by.self
self.time self.pct total.time total.pct
"[.data.frame" 71.98 47.20 129.52 84.93
"names" 11.98 7.86 11.98 7.86
"length" 10.84 7.11 10.84 7.11
"addcells" 10.66 6.99 151.52 99.36
".subset" 10.62 6.96 10.62 6.96
"[" 9.68 6.35 139.20 91.28
"match" 6.06 3.97 11.36 7.45
"sys.call" 4.68 3.07 4.68 3.07
"%in%" 4.50 2.95 15.86 10.40
"all" 4.28 2.81 4.28 2.81
"==" 2.34 1.53 2.34 1.53
".subset2" 1.28 0.84 1.28 0.84
"is.na" 1.06 0.70 1.06 0.70
"nargs" 0.62 0.41 0.62 0.41
"gc" 0.54 0.35 0.54 0.35
"!" 0.42 0.28 0.42 0.28
"dim" 0.34 0.22 0.34 0.22
".Call" 0.12 0.08 0.12 0.08
"readRDS" 0.10 0.07 0.12 0.08
"cat" 0.10 0.07 0.10 0.07
"readLines" 0.04 0.03 0.04 0.03
"strsplit" 0.04 0.03 0.04 0.03
"addParaBreaks" 0.02 0.01 0.04 0.03
It looks like indexing the list structure takes a lot of time. But I can't make it an array, because not all cells are numeric, and R doesn't easily support hash maps...
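The profile points at [.data.frame: every simdata[j,i] call dispatches that relatively expensive method, colscount * rowscount times per file. A hedged sketch of one way around it, extracting each column once as a plain vector (names as in the question; note that the original addcells assigns into a local copy of cells, so this version uses <<- to update the global one):
addcellsFast <- function(simnr)
{
    simdata <- readRDS(simfilenames[[simnr]])
    for (i in 1:colscount)
    {
        colvec <- simdata[[i]]  # extract the column once as a plain vector
        for (j in 1:rowscount)
        {
            if (!is.na(colvec[j]))
            {
                cells[[i]][[j]][simnr] <<- colvec[j]  # assign into the global cells
            }
        }
    }
}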
I always transpose by using the t(file) command in R.
But it is not running properly (not running at all) on a big data file (250,000 rows and 200 columns). Any ideas?
I need to calculate the correlation between the 2nd row (PTBP1) and all other rows (except 8 rows, including the header). In order to do this I transpose rows to columns and then use the cor function.
But I am stuck at the transpose step. Any help would be really appreciated!
I copied the example from one of the posts on Stack Overflow (they are discussing almost the same problem, but there seems to be no answer yet!)
ID A B C D E F G H I [200 columns]
Row0$-1 0.08 0.47 0.94 0.33 0.08 0.93 0.72 0.51 0.55
Row02$1 0.37 0.87 0.72 0.96 0.20 0.55 0.35 0.73 0.44
Row03$ 0.19 0.71 0.52 0.73 0.03 0.18 0.13 0.13 0.30
Row04$- 0.08 0.77 0.89 0.12 0.39 0.18 0.74 0.61 0.57
Row05$- 0.09 0.60 0.73 0.65 0.43 0.21 0.27 0.52 0.60
Row06-$ 0.60 0.54 0.70 0.56 0.49 0.94 0.23 0.80 0.63
Row07$- 0.02 0.33 0.05 0.90 0.48 0.47 0.51 0.36 0.26
Row08$_ 0.34 0.96 0.37 0.06 0.20 0.14 0.84 0.28 0.47
........
250,000 rows
Use a matrix instead. The only advantage of a data.frame over a matrix is the capacity to have different classes in the columns, and you clearly do not have that situation, since a transposed data.frame could not support such a result.
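A hedged sketch of that approach (the file name and the assumption that the first column holds the row IDs are mine; the point is that t() on an all-numeric matrix avoids the expensive data.frame machinery):
# Read with the ID column as row names, so the remaining 200 columns are all numeric.
dat <- read.table("expression.txt", header = TRUE, row.names = 1)
m <- as.matrix(dat)  # plain numeric matrix
tm <- t(m)           # cheap transpose: the 250,000 rows become columns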
I don't get why you want to transpose the data.frame. If you just use cor, it doesn't matter whether your data is in rows or columns.
Actually, it is one of the major advantages of R that it doesn't matter whether your data fits the classical row-column pattern that SPSS and other programs require.
There are numerous ways to correlate the first row with all other rows (I don't get which rows you want to exclude). One is using a loop (here the loop is implicit in the call to one of the *apply family of functions):
lapply(2:nrow(fn), function(x) cor(as.numeric(fn[1, ]), as.numeric(fn[x, ])))
Note that I expect your data.frame to be called fn and its columns to be numeric (the as.numeric calls turn each one-row data.frame into a plain vector for cor). To skip some rows, change the 2 to the number you want. Furthermore, I would probably use vapply here.
I hope this answer points you in the correct direction, which is to not use t() if you absolutely don't need it.