I'm trying to pull down a subset of rows in a sqlite database using dplyr. Since slice doesn't work with tbl_sql objects, I'm using the window function row_number. But I get the following error:
Source: sqlite 3.8.6
[/Library/Frameworks/R.framework/Versions/3.2/Resources/library/dplyr/db/nycflights13.sqlite]
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: no such function: ROW_NUMBER
dplyr version 0.4.3.9000, RSQLite version 1.0.0. Reproducible example:
library(dplyr)
library(nycflights13)
flights_sqlite <- tbl(nycflights13_sqlite(), "flights")
filter(flights_sqlite, row_number(month) == 1L) %>% collect()
Probably there's a more efficient and faster way, but head seems to do the job.
To extract first n rows, for instance first 10 records:
head(flights_sqlite, 10) %>% collect()
Output:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
3 2013 1 1 542 2 923 33 AA N619AA 1141 JFK MIA 160 1089 5 42
4 2013 1 1 544 -1 1004 -18 B6 N804JB 725 JFK BQN 183 1576 5 44
5 2013 1 1 554 -6 812 -25 DL N668DN 461 LGA ATL 116 762 5 54
6 2013 1 1 554 -4 740 12 UA N39463 1696 EWR ORD 150 719 5 54
7 2013 1 1 555 -5 913 19 B6 N516JB 507 EWR FLL 158 1065 5 55
8 2013 1 1 557 -3 709 -14 EV N829AS 5708 LGA IAD 53 229 5 57
9 2013 1 1 557 -3 838 -8 B6 N593JB 79 JFK MCO 140 944 5 57
10 2013 1 1 558 -2 753 8 AA N3ALAA 301 LGA ORD 138 733 5 58
To extract a percentage of the first rows, for instance the first 10%:
head(flights_sqlite, nrow(flights_sqlite)*0.1) %>% collect()
To subset specific rows, for instance rows 578 and 579:
head(flights_sqlite, nrow(flights_sqlite))[578:579, ] %>% collect()
Output:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
578 2013 1 1 1701 -9 2026 11 AA N3FUAA 695 JFK AUS 247 1521 17 1
579 2013 1 1 1701 1 1856 16 UA N418UA 689 LGA ORD 144 733 17 1
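As a side note, SQLite only gained window functions such as ROW_NUMBER in version 3.25.0, which is newer than the 3.8.6 reported in the error above. Another way to pick rows by position is to send a LIMIT/OFFSET query through DBI. The sketch below is an illustration only: it uses a small in-memory table named demo (hypothetical, not the flights table), and it assumes an ORDER BY column exists, because SQLite does not guarantee row order without one.

```r
library(DBI)

# Hypothetical in-memory database standing in for nycflights13_sqlite()
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "demo", data.frame(id = 1:10, value = letters[1:10]))

# Rows 5 and 6 by position: skip 4 rows, then take 2.
# ORDER BY makes "position" well defined.
rows <- dbGetQuery(con, "SELECT * FROM demo ORDER BY id LIMIT 2 OFFSET 4")
print(rows)

dbDisconnect(con)
```

Only the requested rows are transferred, so the whole table never has to be collected into R.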
I have trouble reading dates and times in correctly, and I wonder why. The problem occurs only on my Windows installation of R. Running the exact same script on my UNIX installation works fine.
Basically, I want to read in a file with date and time as the second column, like this:
TrainData[[i]] = read.csv(TrainFiles[i],header=F, colClasses=c(NA,"POSIXct",rep(NA,8)))
colnames(TrainData[[i]])=c("comp","time","s1","s2","s3","s4","r1","r2","r3","r4")
However, only the dates are read, not the times, and my data looks like this:
comp time s1 s2 s3 s4 r1 r2 r3 r4
1 1 2009-08-18 711 630 69 600 689 20 40 1
2 5 2009-08-18 725 460 101 705 689 20 40 1
3 6 2009-08-18 711 505 69 678 689 20 40 1
4 1 2009-08-18 705 630 69 600 689 20 40 1
5 2 2009-08-18 734 516 101 671 689 20 40 1
6 3 2009-08-18 743 637 69 595 689 20 40 1
7 4 2009-08-18 730 577 101 633 689 20 40 1
8 2 2009-08-18 721 511 101 674 689 20 40 1
9 3 2009-08-18 747 563 101 642 689 20 40 1
10 4 2009-08-18 716 572 101 636 689 20 40 1
Running the exact same code on UNIX returned both times and dates.
When I read in another file in the same script, with dates and times in the two first columns, I get the correct format of the date/time:
TrainData[[i]]=read.csv(TrainFiles[i],header=F, colClasses=c("POSIXct","POSIXct",NA))
colnames(TrainData[[i]])=c("start","end","fault")
returns
start end fault
1 2010-10-24 04:25:53 2010-10-24 11:22:33 6
2 2010-10-30 12:57:16 2010-11-02 12:29:54 6
3 2010-11-05 10:40:17 2010-11-05 11:59:51 6
4 2010-11-05 17:07:37 2010-11-06 14:30:01 6
5 2010-11-06 23:59:59 2010-11-07 00:14:49 6
6 2010-11-06 23:59:59 2010-11-07 00:14:49 6
7 2010-11-06 23:59:59 2010-11-07 00:14:49 6
8 2010-11-06 23:59:59 2010-11-07 00:14:49 6
9 2010-11-06 23:59:59 2010-11-07 00:14:50 6
10 2010-11-06 23:59:47 2010-11-07 00:14:51 6
Actually, I found a solution that works, eventually, but I wonder why I get these problems.
It appears that my Sys.timezone is set to "Europe/Berlin". If I set this to NA, the times will be read in as well, i.e. using Sys.setenv(tz=NA). If I then run the same code, my data looks like this:
comp time s1 s2 s3 s4 r1 r2 r3 r4
1 1 2009-08-18 18:12:00 711 630 69 600 689 20 40 1
2 5 2009-08-18 18:14:27 725 460 101 705 689 20 40 1
3 6 2009-08-18 18:14:31 711 505 69 678 689 20 40 1
4 1 2009-08-18 18:14:43 705 630 69 600 689 20 40 1
5 2 2009-08-18 18:14:47 734 516 101 671 689 20 40 1
6 3 2009-08-18 18:14:51 743 637 69 595 689 20 40 1
7 4 2009-08-18 18:15:00 730 577 101 633 689 20 40 1
8 2 2009-08-18 18:29:33 721 511 101 674 689 20 40 1
9 3 2009-08-18 18:29:37 747 563 101 642 689 20 40 1
10 4 2009-08-18 18:29:45 716 572 101 636 689 20 40 1
The other file still gets times, but now they are consistently two hours off.
This is what the csv files look like (basically, text separated by commas):
1,2009-08-18 18:12:00,711,630,69,600,689,20,40,1
5,2009-08-18 8:14:27,725,460,101,705,689,20,40,1
6,2009-08-18 18:14:31,711,505,69,678,689,20,40,1
1,2009-08-18 18:14:43,705,630,69,600,689,20,40,1
2,2009-08-18 8:14:47,734,516,101,671,689,20,40,1
3,2009-08-18 18:14:51,743,637,69,595,689,20,40,1
4,2009-08-18 8:15:00,730,577,101,633,689,20,40,1
2,2009-08-18 8:29:33,721,511,101,674,689,20,40,1
3,2009-08-18 8:29:37,747,563,101,642,689,20,40,1
4,2009-08-18 8:29:45,716,572,101,636,689,20,40,1
Why am I having these problems with reading in the times? I suspect that using tz=NA is not correct, but it is the only approach I found that works. Can anyone help me figure out why the times are ignored when tz = "Europe/Berlin"?
Is it generally advised to set tz=NA when reading files like this? Even though this seems to work for reading in the times, tz="NA" results in warning messages when I later want to work with the data:
Warning message:
In as.POSIXlt.POSIXct(x, tz) : unknown timezone 'NA'
Can anyone help me explain the differences I get?
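One way to sidestep the platform difference, sketched here without claiming to explain its cause, is to read the column as character and convert it explicitly. The sketch assumes every timestamp follows the "%Y-%m-%d %H:%M:%S" layout and that "Europe/Berlin" is the intended zone. The csv_text variable is a hypothetical stand-in for the real file.

```r
# Two sample rows copied from the file shown above
csv_text <- "1,2009-08-18 18:12:00,711,630,69,600,689,20,40,1
6,2009-08-18 18:14:31,711,505,69,678,689,20,40,1"

# Read the timestamp column as plain text instead of POSIXct
TrainData <- read.csv(text = csv_text, header = FALSE,
                      colClasses = c(NA, "character", rep(NA, 8)))
colnames(TrainData) <- c("comp", "time", "s1", "s2", "s3", "s4",
                         "r1", "r2", "r3", "r4")

# Convert with an explicit format and time zone, so the result does not
# depend on the session's TZ setting
TrainData$time <- as.POSIXct(TrainData$time,
                             format = "%Y-%m-%d %H:%M:%S",
                             tz = "Europe/Berlin")
print(TrainData[, c("comp", "time")])
```

Because the format and zone are spelled out, the same call should behave the same way on Windows and UNIX, and no tz=NA workaround is needed.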
My data follows this sequence:
deptime .count
1 4.5 6285
2 14.5 5901
3 24.5 6002
4 34.5 5401
5 44.5 5080
6 54.5 4567
7 104.5 3162
8 114.5 2784
9 124.5 1950
10 134.5 1800
11 144.5 1630
12 154.5 1076
13 204.5 738
14 214.5 556
15 224.5 544
16 234.5 650
17 244.5 392
18 254.5 309
19 304.5 356
20 314.5 364
My ggplot code:
ggplot(pplot, aes(x=deptime, y=.count)) + geom_bar(stat="identity",fill='#FF9966',width = 5) + labs(x="time", y="count")
In the output figure there is a gap after each multiple of 100 on the x-axis. Does anyone know how to fix it?
Thank You
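A possible reading of the data, stated as an assumption: deptime encodes clock time as hours * 100 + minutes, so values between 60 and 99 within each hundred never occur, and that produces the gaps. Under that assumption, converting to minutes gives an evenly spaced axis. The minutes column below is a name introduced for illustration.

```r
# First rows of the data from the question
pplot <- data.frame(
  deptime = c(4.5, 14.5, 24.5, 34.5, 44.5, 54.5, 104.5, 114.5),
  .count  = c(6285, 5901, 6002, 5401, 5080, 4567, 3162, 2784)
)

# Assumption: deptime is hours * 100 + minutes, so convert to minutes
pplot$minutes <- (pplot$deptime %/% 100) * 60 + pplot$deptime %% 100

# Same plot as in the question, with the converted x variable
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(pplot, aes(x = minutes, y = .count)) +
    geom_bar(stat = "identity", fill = "#FF9966", width = 5) +
    labs(x = "time (minutes)", y = "count")
  print(p)
}
```

If deptime is not a clock time, this conversion does not apply and the gaps would need a different explanation.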
Apologies if a similar query has been posted - couldn't find it.
I have GPS locations (UTM) for multiple individuals.
X Y AnimalID DATE
1 550466 4789843 10 1/25/2008
2 550820 4790544 10 1/26/2008
3 551071 4791230 10 1/26/2008
4 550462 4789292 10 1/26/2008
5 550390 4789934 10 1/27/2008
6 550543 4790085 10 1/27/2008
I am attempting to calculate Net Squared Displacement and once NSD has reached at least 800m, I'd like to repeat the formula starting at 0 at the next row.
Desired output is this:
XLOC YLOC ANIMALID DATETIME Xdist Ydist NSD GROUP
1 550466 4789843 10 1/25/2008 17:00 354 701 785 1
2 550820 4790544 10 1/26/2008 1:00 605 1387 1513 1
3 551071 4791230 10 1/26/2008 9:00 609 1938 2031 2
4 550462 4789292 10 1/26/2008 17:00 72 642 646 3
5 550390 4789934 10 1/27/2008 1:00 81 793 797 3
6 550543 4790085 10 1/27/2008 9:00 82 149 170 3
7 550380 4789441 10 1/27/2008 17:00 178 192 262 3
8 550284 4789484 10 1/28/2008 1:00 559 426 703 3
9 549903 4789718 10 1/28/2008 9:00 0 35 35 3
10 550462 4789327 10 1/28/2008 17:00 574 275 636 3
11 549888 4789567 10 1/29/2008 1:00 532 263 593 3
12 549930 4789555 10 1/29/2008 9:00 65 4 65 3
13 550397 4789288 10 1/29/2008 17:00 124 140 187 3
14 550338 4789432 10 1/30/2008 1:00 554 339 649 3
15 549908 4789631 10 1/30/2008 9:00 84 75 113 3
16 550378 4789367 10 1/30/2008 17:00 657 1876 1988 3
17 550414 4789354 10 1/31/2008 1:00 531 91 539 4
18 549883 4789445 10 1/31/2008 9:00 188 136 232 4
19 550226 4789490 10 1/31/2008 17:00 126 141 189 4
20 550288 4789495 10 2/1/2008 1:00 176 187 257 4
I added the 'Group' column to indicate when 800 NSD was attained.
I'm really struggling with how exactly to code for this particular approach mainly because the first UTM has to be identical until 800m has been reached.
In other words, I can't do this:
xdist<-abs(diff(X))
ydist<-abs(diff(Y))
nsd<-sqrt(xdist^2+ydist^2)
I need to do this until the target of 800m was reached:
xdist <- abs(X[2] - 550466)   # X in row 2 minus the X of the starting row
ydist <- abs(Y[2] - 4789843)  # Y in row 2 minus the Y of the starting row
Then the unique UTMs will need to be from rows 3, 4, 17 and so on.
I hope this makes sense and I'd appreciate any help!
I think this is what you are looking for:
# Row 1 starts the first group; distances are measured to the next row
data$GROUP[1] <- 1
data$Xdist[1] <- data$XLOC[2] - data$XLOC[1]
data$Ydist[1] <- data$YLOC[2] - data$YLOC[1]
data$NSD[1] <- as.integer(sqrt(data$Xdist[1]^2 + data$Ydist[1]^2))
for (i in 2:(nrow(data) - 1)) {
  if (data$NSD[i - 1] > 800) {
    # Threshold passed on the previous row: restart from this row, new group
    data$Xdist[i] <- data$XLOC[i + 1] - data$XLOC[i]
    data$Ydist[i] <- data$YLOC[i + 1] - data$YLOC[i]
    data$NSD[i] <- as.integer(sqrt(data$Xdist[i]^2 + data$Ydist[i]^2))
    data$GROUP[i] <- data$GROUP[i - 1] + 1
  } else {
    # Still below the threshold: accumulate the step onto the running offset
    data$Xdist[i] <- data$XLOC[i + 1] - data$XLOC[i] + data$Xdist[i - 1]
    data$Ydist[i] <- data$YLOC[i + 1] - data$YLOC[i] + data$Ydist[i - 1]
    data$NSD[i] <- as.integer(sqrt(data$Xdist[i]^2 + data$Ydist[i]^2))
    data$GROUP[i] <- data$GROUP[i - 1]
  }
}
output:
> data
XLOC YLOC ANIMALID DATE TIME Xdist Ydist NSD GROUP
1 550466 4789843 10 1/25/2008 17:00 354 701 785 1
2 550820 4790544 10 1/26/2008 1:00 605 1387 1513 1
3 551071 4791230 10 1/26/2008 9:00 -609 -1938 2031 2
4 550462 4789292 10 1/26/2008 17:00 -72 642 646 3
5 550390 4789934 10 1/27/2008 1:00 81 793 797 3
6 550543 4790085 10 1/27/2008 9:00 -82 149 170 3
7 550380 4789441 10 1/27/2008 17:00 -178 192 261 3
8 550284 4789484 10 1/28/2008 1:00 -559 426 702 3
9 549903 4789718 10 1/28/2008 9:00 0 35 35 3
10 550462 4789327 10 1/28/2008 17:00 -574 275 636 3
11 549888 4789567 10 1/29/2008 1:00 -532 263 593 3
12 549930 4789555 10 1/29/2008 9:00 -65 -4 65 3
13 550397 4789288 10 1/29/2008 17:00 -124 140 187 3
14 550338 4789432 10 1/30/2008 1:00 -554 339 649 3
15 549908 4789631 10 1/30/2008 9:00 -84 75 112 3
16 550378 4789367 10 1/30/2008 17:00 -48 62 78 3
17 550414 4789354 10 1/31/2008 1:00 -579 153 598 3
18 549883 4789445 10 1/31/2008 9:00 -236 198 308 3
19 550226 4789490 10 1/31/2008 17:00 -174 203 267 3
20 550288 4789495 10 2/1/2008 1:00 NA NA NA NA
Also, I think there is a mistake in the Xdist value of row 16 in your desired output: XLOC[17] - XLOC[16] + Xdist[15] = 550414 - 550378 + (-84) = -48, not 657 as you specified. Unless I missed something in your formula.
Hope this helps!
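The same idea can also be written by tracking the reference row explicitly rather than accumulating differences. This is a minimal sketch, not the code above: the function name nsd_groups and its threshold argument are made up for illustration, and row i here holds the distance from the current reference point to row i itself, so the groups are shifted by one row compared with the output above.

```r
# Hypothetical helper: distance from a moving reference point, which resets
# to the current row once the distance reaches the threshold
nsd_groups <- function(x, y, threshold = 800) {
  n <- length(x)
  dist <- numeric(n)
  group <- integer(n)
  ref <- 1L  # index of the current reference location
  g <- 1L    # current group number
  for (i in seq_len(n)) {
    dist[i] <- sqrt((x[i] - x[ref])^2 + (y[i] - y[ref])^2)
    group[i] <- g
    if (dist[i] >= threshold) {
      ref <- i      # this row becomes the new reference
      g <- g + 1L   # later rows belong to the next group
    }
  }
  data.frame(x = x, y = y, dist = dist, group = group)
}

# First six locations from the question
res <- nsd_groups(
  x = c(550466, 550820, 551071, 550462, 550390, 550543),
  y = c(4789843, 4790544, 4791230, 4789292, 4789934, 4790085)
)
print(res)
```

For several animals, the function could be applied separately to each AnimalID.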
The data in the table is given below:
Year NSW Vic. Qld SA WA Tas. NT ACT Aust.
1 1917 1904 1409 683 440 306 193 5 3 4941
2 1927 2402 1727 873 565 392 211 4 8 6182
3 1937 2693 1853 993 589 457 233 6 11 6836
4 1947 2985 2055 1106 646 502 257 11 17 7579
5 1957 3625 2656 1413 873 688 326 21 38 9640
6 1967 4295 3274 1700 1110 879 375 62 103 11799
7 1977 5002 3837 2130 1286 1204 415 104 214 14192
8 1987 5617 4210 2675 1393 1496 449 158 265 16264
9 1997 6274 4605 3401 1480 1798 474 187 310 18532
I want to plot a graph with Year on the x-axis and the total value on the y-axis. The barplot should depict the ACT and NT values for the respective years.
I tried the following command:
barplot(as.matrix(r_data$ACT, r_data$NT), main="r_data", ylab="Total", beside=TRUE)
The above command showed the bars for the ACT column per year but not the bars for the NT column.
as.matrix converts only its first argument, so r_data$NT is ignored. You have to create the matrix in a different way:
barplot(as.matrix(r_data[c("ACT", "NT")]),
main="r_data", ylab="Total", beside=TRUE)
You can also use cbind instead of as.matrix and keep the rest of your original approach:
barplot(cbind(r_data$ACT, r_data$NT),
main="r_data", ylab="Total", beside=TRUE)
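If the goal is one pair of bars per year, as the question describes, one option is to transpose the matrix so that each column is a year. The sketch below rebuilds only the Year, NT and ACT columns for the first three rows of the table; the heights variable is a name introduced here.

```r
# First three rows of the table, restricted to the columns of interest
r_data <- data.frame(Year = c(1917, 1927, 1937),
                     NT   = c(5, 4, 6),
                     ACT  = c(3, 8, 11))

# Transpose so that rows are the series (ACT, NT) and columns are the years
heights <- t(as.matrix(r_data[c("ACT", "NT")]))

# beside = TRUE draws ACT and NT next to each other within each year;
# legend.text = TRUE labels the bars with the row names of heights
barplot(heights, beside = TRUE, names.arg = r_data$Year,
        legend.text = TRUE, main = "r_data",
        xlab = "Year", ylab = "Total")
```

With the untransposed matrix, barplot groups the bars by column (ACT, NT) instead of by year.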