I have a dataset v1 from which I want to extract the data for certain grid boxes.
Here's an extract from v1:
"V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" "V12" "V13" "V14" "V15" "V16" "V17" "V18"
43 1 0 69 60 9 19501201 1080 0 1 641 30 0 291 272 136 29 3650
43 1 1 69 60 9 19501201 884 0 1 705 30 3 290 293 136 29 3650
43 1 2 70 61 9 19501201 553 293 1 1090 30 6 264 468 138 31 3650
43 1 3 71 62 9 19501201 416 290 1 1240 30 9 303 503 140 33 3650
43 1 4 72 63 9 19501201 396 287 1 1160 30 12 334 444 142 35 3650
43 1 5 73 64 9 19501201 163 285 1 1440 30 15 377 687 144 37 3650
43 1 6 74 66 9 19501201 29 475 1 1490 30 18 386 674 146 41 3650
43 1 7 74 67 9 19501201 -257 222 1 1960 30 21 444 875 146 43 3650
43 1 8 74 68 9 19501202 -216 222 1 1850 30 0 438 806 146 45 3650
43 1 9 74 69 9 19501202 -393 222 1 1950 30 3 444 847 146 47 3650
43 1 10 74 70 9 19501202 -500 222 1 2130 30 6 457 901 146 49 3650
The dataset v1 has longitude (V16) and latitude (V17) columns, which correspond to the boundary conditions you see below.
I need to filter between 80°W-30°E (V16) and 25°N-75°N (V17) in boxes of 5° each.
I want to keep all the other columns for each filtered box.
These are my boundary conditions:
lon1_i <- seq(-80,25, by=5)
lon2_i <- seq(-75,30, by=5)
lat1_i <- seq(25,70, by=5)
lat2_i <- seq(30,75, by=5)
So the first grid box has all the info in -80° to -75° and 25°-30°, then the second box contains the data from -75° to -70° and 30°-35°. And so on until the last box of 25°-30°E and 70°-75°N.
I tried to use a for loop with two indices:
for (i in 1:22) {
  for (k in 1:10) {
    test[[i]][[k]] <- v1 %>%
      filter(between(V16, lon1_i[[i]], lon2_i[[i]]), between(V17, lat1_i[[k]], lat2_i[[k]])) %>%
      group_by(group = cumsum(V3 == 0))
  }
}
And with outer:
test <- outer(seq(lon1_i),seq(lon2_i),seq(lat1_i),seq(lat2_i),
function(i,j) v1 %>%
filter(between(V16, lon1_i[i], lon2_i[i]),
between(V17, lat1_i[j], lat2_i[j])) %>%
group_by(group = cumsum(V3 == 0)))
Also lapply:
test <- lapply(seq(22,10),function(x) v1 %>%
filter(between(V16, lon1_i[x], lon2_i[x]), between(V17, lat1_i[x], lat2_i[x])) %>%
group_by(group = cumsum(V3 == 0)))
The output should be in the form of new data tables/lists, so I guess 22x10 given my chosen coordinates.
Is it possible with these functions/types of loops? I would much appreciate some help on this. Thanks!
It looks like you have a table of points and a list of areas describing the boundaries. I would use a spatial join to filter a table of points, e.g. with the function st_within from the R package sf.
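If you prefer to stay closer to the nested-loop idea in the question, here is a minimal base-R sketch (the v1 data frame below is a toy stand-in, not the poster's data): cut() bins each coordinate into 5° intervals, and a single split() then produces one data frame per non-empty box, so no nested loops are needed.

```r
# Toy stand-in for v1: four points, one per distinct grid box
v1 <- data.frame(V16 = c(-78, -72, 10, 28),   # longitudes
                 V17 = c(27, 33, 50, 72),     # latitudes
                 V3  = 1:4)
lon_breaks <- seq(-80, 30, by = 5)
lat_breaks <- seq(25, 75, by = 5)
# Bin both coordinates into 5-degree intervals and split once;
# drop = TRUE removes empty lon/lat combinations
boxes <- split(v1,
               list(lon = cut(v1$V16, lon_breaks, right = FALSE),
                    lat = cut(v1$V17, lat_breaks, right = FALSE)),
               drop = TRUE)
length(boxes)  # 4 non-empty boxes, one per point
```

Each element of `boxes` keeps all the original columns, and the list names encode the lon/lat interval of each box.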
Assume a data.frame as follows:
df <- data.frame(name = paste0("Person",rep(1:30)),
number = sample(1:100, 30, replace=TRUE),
focus = sample(1:500, 30, replace=TRUE))
I want to split the above data.frame into 9 groups, each with 9 observations. Each person can be assigned to multiple groups (sampling with replacement), so that all 9 groups have their full 9 observations (9 groups x 9 observations require 81 rows, while the df has only 30).
The output will ideally be a large list of 1000 data.frames.
Are there any efficient ways of doing this? This is just a sample data.frame. The actual df has ~10k rows and will require 1000 groups each with 30 rows.
Many thanks.
Is this what you are looking for?
res <- replicate(1000, df[sample.int(nrow(df), 30, replace = TRUE), ], simplify = FALSE)
The df I used:
df <- data.frame(name = paste0("Person",rep(1:1e4)),
number = sample(1:100, 1e4, replace=TRUE),
focus = sample(1:500, 1e4, replace=TRUE))
Output
> res[1:3]
[[1]]
name number focus
529 Person529 5 351
9327 Person9327 4 320
1289 Person1289 78 164
8157 Person8157 46 183
6939 Person6939 38 61
4066 Person4066 26 103
132 Person132 34 39
6576 Person6576 36 397
5376 Person5376 47 456
6123 Person6123 10 18
5318 Person5318 39 42
6355 Person6355 62 212
340 Person340 90 256
7050 Person7050 19 198
1500 Person1500 42 208
175 Person175 34 30
3751 Person3751 99 441
3813 Person3813 93 492
7428 Person7428 72 142
6840 Person6840 58 45
6501 Person6501 95 499
5124 Person5124 16 159
3373 Person3373 38 36
5622 Person5622 40 203
8761 Person8761 9 225
6252 Person6252 75 444
4502 Person4502 58 337
5344 Person5344 24 233
4036 Person4036 59 265
8764 Person8764 45 1
[[2]]
name number focus
8568 Person8568 87 360
3968 Person3968 67 468
4481 Person4481 46 140
8055 Person8055 73 286
7794 Person7794 92 336
1110 Person1110 6 434
6736 Person6736 4 58
9758 Person9758 60 49
9356 Person9356 89 300
9719 Person9719 100 366
4183 Person4183 5 124
1394 Person1394 87 346
2642 Person2642 81 449
3592 Person3592 65 358
579 Person579 21 395
9551 Person9551 39 495
4946 Person4946 73 32
4081 Person4081 98 270
4062 Person4062 27 150
7698 Person7698 52 436
5388 Person5388 89 177
9598 Person9598 91 474
8624 Person8624 3 464
392 Person392 82 483
5710 Person5710 43 293
4942 Person4942 99 350
3333 Person3333 89 91
6789 Person6789 99 259
7115 Person7115 100 320
1431 Person1431 77 263
[[3]]
name number focus
201 Person201 100 272
4674 Person4674 27 410
9728 Person9728 18 275
9422 Person9422 2 396
9783 Person9783 45 37
5552 Person5552 76 109
3871 Person3871 49 277
3411 Person3411 64 24
5799 Person5799 29 131
626 Person626 31 122
3103 Person3103 2 76
8043 Person8043 90 384
3157 Person3157 90 392
7093 Person7093 11 169
2779 Person2779 83 2
2601 Person2601 77 122
9003 Person9003 50 163
9653 Person9653 4 235
9361 Person9361 100 391
4273 Person4273 83 383
4725 Person4725 35 436
2157 Person2157 71 486
3995 Person3995 25 258
3735 Person3735 24 221
303 Person303 81 407
4838 Person4838 64 198
6926 Person6926 90 417
6267 Person6267 82 284
8570 Person8570 67 317
2670 Person2670 21 342
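As a quick sanity check, the same replicate() pattern on the small 30-row example (9 groups of 9, sampled with replacement):

```r
set.seed(1)  # for reproducibility
df <- data.frame(name = paste0("Person", 1:30),
                 number = sample(1:100, 30, replace = TRUE),
                 focus = sample(1:500, 30, replace = TRUE))
# 9 groups of 9 rows each, sampled with replacement
res <- replicate(9, df[sample.int(nrow(df), 9, replace = TRUE), ], simplify = FALSE)
length(res)                  # 9 groups
all(sapply(res, nrow) == 9)  # every group has 9 rows
```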
I am trying to run a time series analysis on the following data set:
Year 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780
Number 101 82 66 35 31 7 20 92 154 125
Year 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790
Number 85 68 38 23 10 24 83 132 131 118
Year 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800
Number 90 67 60 47 41 21 16 6 4 7
Year 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810
Number 14 34 45 43 48 42 28 10 8 2
Year 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820
Number 0 1 5 12 14 35 46 41 30 24
Year 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830
Number 16 7 4 2 8 17 36 50 62 67
Year 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840
Number 71 48 28 8 13 57 122 138 103 86
Year 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850
Number 63 37 24 11 15 40 62 98 124 96
Year 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860
Number 66 64 54 39 21 7 4 23 55 94
Year 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870
Number 96 77 59 44 47 30 16 7 37 74
My problem is that the data is spread across multiple rows. I am trying to make two columns from the data, one for Year and one for Number, so that it is easily readable in R. I have tried
> library(tidyverse)
> sun.df = data.frame(sunspots)
> Year = filter(sun.df, sunspots == "Year")
to isolate the Year data, and it works, but I am unsure of how to then place it in a column.
Any suggestions?
Try this:
library(tidyverse)
df <- read_csv("test.csv",col_names = FALSE)
df
# A tibble: 6 x 4
# X1 X2 X3 X4
# <chr> <dbl> <dbl> <dbl>
# 1 Year 123 124 125
# 2 Number 1 2 3
# 3 Year 126 127 128
# 4 Number 4 5 6
# 5 Year 129 130 131
# 6 Number 7 8 9
# Removing first column and transpose it to get a dataframe of numbers
df_number <- as.data.frame(as.matrix(t(df[,-1])),row.names = FALSE)
df_number
# V1 V2 V3 V4 V5 V6
# 1 123 1 126 4 129 7
# 2 124 2 127 5 130 8
# 3 125 3 128 6 131 9
# Keep the first two column (V1,V2) and assign column names
df_new <- df_number[1:2]
colnames(df_new) <- c("Year","Number")
# Iterate and rbind with subsequent columns (2 by 2) to df_new
for(i in 1:((ncol(df_number) - 2) / 2)) {
  df_mini <- df_number[(i*2+1):(i*2+2)]
  colnames(df_mini) <- c("Year","Number")
  df_new <- rbind(df_new, df_mini)
}
df_new
# Year Number
# 1 123 1
# 2 124 2
# 3 125 3
# 4 126 4
# 5 127 5
# 6 128 6
# 7 129 7
# 8 130 8
# 9 131 9
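Since every odd row holds years and every even row holds the matching counts, the same result can also be had without the loop. A sketch on a hard-coded copy of the toy test.csv layout above (alternating Year/Number rows, label in the first column):

```r
# Hard-coded stand-in for read_csv("test.csv", col_names = FALSE)
df <- data.frame(X1 = c("Year", "Number", "Year", "Number"),
                 X2 = c(123, 1, 126, 4),
                 X3 = c(124, 2, 127, 5),
                 X4 = c(125, 3, 128, 6))
# Transpose each label's rows so values read off in the original order
df_new <- data.frame(
  Year   = as.vector(t(df[df$X1 == "Year",   -1])),
  Number = as.vector(t(df[df$X1 == "Number", -1]))
)
df_new
#   Year Number
# 1  123      1
# 2  124      2
# ...
# 6  128      6
```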
I have this data from an R package, where X is the dataset with all the data:
library(ISLR)
data("Hitters")
X=Hitters
head(X)
here is one part of the data:
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
-Andy Allanson 293 66 1 30 29 14 1 293 66 1 30 29 14 A E 446 33 20 NA A
-Alan Ashby 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
-Alvin Davis 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
-Andres Galarraga 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N
-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19 501 336 194 A W 282 421 25 750.0 A
I want to convert all the columns and rows with non-numeric values to zero. Is there any simple way to do this?
I found an example of how to remove such rows for a single column, but for more columns I would have to do it manually, one column at a time.
Is there any function in R that does this for all columns and rows?
To remove non-numeric columns, perhaps something like this?
df %>%
select(which(sapply(., is.numeric)))
# AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
#-Andy Allanson 293 66 1 30 29 14 1 293 66 1
#-Alan Ashby 315 81 7 24 38 39 14 3449 835 69
#-Alvin Davis 479 130 18 66 72 76 3 1624 457 63
#-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
#-Andres Galarraga 321 87 10 39 42 30 2 396 101 12
#-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
# CRuns CRBI CWalks PutOuts Assists Errors Salary
#-Andy Allanson 30 29 14 446 33 20 NA
#-Alan Ashby 321 414 375 632 43 10 475.0
#-Alvin Davis 224 266 263 880 82 14 480.0
#-Andre Dawson 828 838 354 200 11 3 500.0
#-Andres Galarraga 48 46 33 805 40 4 91.5
#-Alfredo Griffin 501 336 194 282 421 25 750.0
or
df %>%
select(-which(sapply(., function(x) is.character(x) | is.factor(x))))
Or much neater (thanks to #AntoniosK):
df %>% select_if(is.numeric)
Update
To additionally replace NAs with 0, you can do
df %>% select_if(is.numeric) %>% replace(is.na(.), 0)
# AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
#-Andy Allanson 293 66 1 30 29 14 1 293 66 1
#-Alan Ashby 315 81 7 24 38 39 14 3449 835 69
#-Alvin Davis 479 130 18 66 72 76 3 1624 457 63
#-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
#-Andres Galarraga 321 87 10 39 42 30 2 396 101 12
#-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
# CRuns CRBI CWalks PutOuts Assists Errors Salary
#-Andy Allanson 30 29 14 446 33 20 0.0
#-Alan Ashby 321 414 375 632 43 10 475.0
#-Alvin Davis 224 266 263 880 82 14 480.0
#-Andre Dawson 828 838 354 200 11 3 500.0
#-Andres Galarraga 48 46 33 805 40 4 91.5
#-Alfredo Griffin 501 336 194 282 421 25 750.0
library(ISLR)
data("Hitters")
d = head(Hitters)
library(dplyr)
d %>%
mutate_if(function(x) !is.numeric(x), function(x) 0) %>% # if column is non numeric add zeros
mutate_all(function(x) ifelse(is.na(x), 0, x)) # if there is an NA element replace it with 0
# AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
# 1 293 66 1 30 29 14 1 293 66 1 30 29 14 0 0 446 33 20 0.0 0
# 2 315 81 7 24 38 39 14 3449 835 69 321 414 375 0 0 632 43 10 475.0 0
# 3 479 130 18 66 72 76 3 1624 457 63 224 266 263 0 0 880 82 14 480.0 0
# 4 496 141 20 65 78 37 11 5628 1575 225 828 838 354 0 0 200 11 3 500.0 0
# 5 321 87 10 39 42 30 2 396 101 12 48 46 33 0 0 805 40 4 91.5 0
# 6 594 169 4 74 51 35 11 4408 1133 19 501 336 194 0 0 282 421 25 750.0 0
If you want to avoid function(x), you can use this:
d %>%
mutate_if(Negate(is.numeric), ~0) %>%
mutate_all(~ifelse(is.na(.), 0, .))
You can get the numeric columns with sapply/inherits.
X <- Hitters
inx <- sapply(X, inherits, c("integer", "numeric"))
Y <- X[inx]
Then it wouldn't make much sense to remove the rows with non-numeric entries, since they were already removed; but you could do
inx <- apply(Y, 1, function(y) all(inherits(y, c("integer", "numeric"))))
Y[inx, ]
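Both steps the question asks for (non-numeric columns to zero, NAs to zero) can also be combined in one pass over the columns in base R. A sketch on a made-up data frame standing in for Hitters:

```r
# Toy data: one numeric column with an NA, one character column, one more NA
X <- data.frame(a = c(1, NA, 3),
                b = c("x", "y", "z"),
                c = c(10, 20, NA))
# X[] <- keeps X a data.frame while replacing every column
X[] <- lapply(X, function(col) {
  if (!is.numeric(col)) col <- rep(0, length(col))  # non-numeric column -> zeros
  col[is.na(col)] <- 0                              # NA -> 0
  col
})
X
#   a b  c
# 1 1 0 10
# 2 0 0 20
# 3 3 0  0
```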
I have a list of numbers that is monotonically increasing.
alist <- c(1:20, 50:70, 210:235, 240:250)
The difference from one number to the next, is n.
I'd like to automatically split the list wherever the difference between consecutive items exceeds a threshold n.
For example, with a threshold of n = 20, the particular list above should split itself into 3 datasets.
Calling which(diff(alist) > 20) tells me where I should "cut" the data up, but for the life of me I cannot figure out the next step... I might be missing something very simple here.
The result should ideally become a list of lists, or a table (I don't mind either):
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
[[2]]
[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65...
[[3]]
[1] 210 211 212 213...
We can use cumsum on a logical vector to create a group for splitting
unname(split(alist, cumsum(c(TRUE, diff(alist) > 20))))
#[[1]]
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#[[2]]
# [1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
#[[3]]
# [1] 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 240 241 242 243 244 245 246 247 248
# [36] 249 250
If we need to use the which approach,
i1 <- which(diff(alist) > 20)
Map(function(i,j) alist[i:j], c(1, i1 +1), c(i1, length(alist)))
I would like to merge and sum the values of each row that contains duplicated IDs.
For example, the data frame below contains a duplicated symbol 'LOC102723897'. I would like to merge these two rows and sum the value within each column, so that one row appears for the duplicated symbol.
> head(y$genes)
SM01 SM02 SM03 SM04 SM05 SM06 SM07 SM08 SM09 SM10 SM11 SM12 SM13 SM14 SM15 SM16 SM17 SM18 SM19 SM20 SM21 SM22
1 32 29 23 20 27 105 80 64 83 80 94 58 122 76 78 70 34 32 45 42 138 30
2 246 568 437 343 304 291 542 457 608 433 218 329 483 376 410 296 550 533 537 473 296 382
3 30 23 30 13 20 18 23 13 31 11 15 27 36 21 23 25 26 27 37 27 31 16
4 1450 2716 2670 2919 2444 1668 2923 2318 3867 2084 1121 2175 3022 2308 2541 1613 2196 1851 2843 2078 2180 1902
5 288 366 327 334 314 267 550 410 642 475 219 414 679 420 425 308 359 406 550 398 399 268
6 34 59 62 68 42 31 49 45 62 51 40 32 30 39 41 75 54 59 83 99 37 37
SM23 SM24 SM25 SM26 SM27 SM28 SM29 SM30 Symbol
1 41 23 57 160 84 67 87 113 LOC102723897
2 423 535 624 304 568 495 584 603 LINC01128
3 31 21 49 13 33 31 14 31 LINC00115
4 2453 3041 3590 2343 3450 3725 3336 3850 NOC2L
5 403 347 468 478 502 563 611 577 LOC102723897
6 45 51 56 107 79 105 92 131 PLEKHN1
> dim(y)
[1] 12928 30
I attempted using plyr to merge rows based on the 'Symbol' column, but it's not working.
> ddply(y$genes,"Symbol",numcolwise(sum))
> dim(y)
[1] 12928 30
> length(y$genes$Symbol)
[1] 12928
> length(unique(y$genes$Symbol))
[1] 12896
Group by Symbol and sum all the columns:
library(dplyr)
df %>% group_by(Symbol) %>% summarise_all(sum)
Using data.table:
library(data.table)
setDT(df)[, lapply(.SD, sum), by = "Symbol"]
We can just use aggregate from base R
aggregate(.~ Symbol, df, FUN = sum)
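A quick check of the aggregate() approach on a made-up data frame with one duplicated symbol (not the poster's y$genes):

```r
# Two sample columns plus a Symbol column; "LOC1" appears twice
df <- data.frame(SM01 = c(1, 2, 3),
                 SM02 = c(10, 20, 30),
                 Symbol = c("LOC1", "LINC1", "LOC1"))
# Sum every other column within each Symbol; groups come out sorted
res <- aggregate(. ~ Symbol, df, FUN = sum)
res
#   Symbol SM01 SM02
# 1  LINC1    2   20
# 2   LOC1    4   40
```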