Putting several rows into one column in R

I am trying to run a time series analysis on the following data set:
Year 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780
Number 101 82 66 35 31 7 20 92 154 125
Year 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790
Number 85 68 38 23 10 24 83 132 131 118
Year 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800
Number 90 67 60 47 41 21 16 6 4 7
Year 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810
Number 14 34 45 43 48 42 28 10 8 2
Year 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820
Number 0 1 5 12 14 35 46 41 30 24
Year 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830
Number 16 7 4 2 8 17 36 50 62 67
Year 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840
Number 71 48 28 8 13 57 122 138 103 86
Year 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850
Number 63 37 24 11 15 40 62 98 124 96
Year 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860
Number 66 64 54 39 21 7 4 23 55 94
Year 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870
Number 96 77 59 44 47 30 16 7 37 74
My problem is that the data is placed in multiple rows. I am trying to make two columns from the data: one for Year and one for Number, so that it is easily readable in R. I have tried
> library(tidyverse)
> sun.df = data.frame(sunspots)
> Year = filter(sun.df, sunspots == "Year")
to isolate the Year data, and it works, but I am unsure of how to then place it in a column.
Any suggestions?

Try this:
library(tidyverse)
df <- read_csv("test.csv", col_names = FALSE)
df
# A tibble: 6 x 4
# X1 X2 X3 X4
# <chr> <dbl> <dbl> <dbl>
# 1 Year 123 124 125
# 2 Number 1 2 3
# 3 Year 126 127 128
# 4 Number 4 5 6
# 5 Year 129 130 131
# 6 Number 7 8 9
# Remove the first column and transpose to get a data frame of numbers
df_number <- as.data.frame(as.matrix(t(df[, -1])), row.names = FALSE)
df_number
# V1 V2 V3 V4 V5 V6
# 1 123 1 126 4 129 7
# 2 124 2 127 5 130 8
# 3 125 3 128 6 131 9
# Keep the first two columns (V1, V2) and assign column names
df_new <- df_number[1:2]
colnames(df_new) <- c("Year","Number")
# Iterate and rbind with subsequent columns (2 by 2) to df_new
for (i in 1:((ncol(df_number) - 2) / 2)) {
  df_mini <- df_number[(i*2+1):(i*2+2)]
  colnames(df_mini) <- c("Year", "Number")
  df_new <- rbind(df_new, df_mini)
}
df_new
# Year Number
# 1 123 1
# 2 124 2
# 3 125 3
# 4 126 4
# 5 127 5
# 6 128 6
# 7 129 7
# 8 130 8
# 9 131 9
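As a hedged aside, the pairwise rbind loop can be replaced by a single matrix reshape; this sketch assumes the same test.csv layout as above (rows alternating Year, Number):
library(tidyverse)
df <- read_csv("test.csv", col_names = FALSE)
vals <- as.matrix(df[, -1])  # drop the Year/Number label column
# Odd rows hold years, even rows hold counts; read each set row by row
df_new <- data.frame(
  Year   = as.vector(t(vals[seq(1, nrow(vals), by = 2), ])),
  Number = as.vector(t(vals[seq(2, nrow(vals), by = 2), ]))
)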

Related

Is there an R function that turns a frequency table into a prop table?

What is the simplest way of turning a frequency data table into a prop table in R?
This is the data:
Time Total Blog News Social.Network Microblog Other Forums Pictures Video
1 15.KW 2022 1816 23 326 39 678 99 27 523 0
2 16.KW 2022 2535 32 690 42 815 135 26 644 1
3 17.KW 2022 2181 20 362 79 805 110 14 634 1
4 18.KW 2022 2583 19 895 25 692 127 6 658 0
5 19.KW 2022 2337 21 555 22 908 148 8 599 0
6 20.KW 2022 2091 23 392 18 851 119 5 554 0
7 21.KW 2022 1658 17 344 16 650 129 1 417 0
8 22.KW 2022 2476 24 798 24 937 150 7 443 0
9 23.KW 2022 1687 14 341 17 691 102 9 400 0
10 24.KW 2022 2476 21 521 29 984 110 19 509 0
11 25.KW 2022 2412 22 696 31 845 115 29 561 0
12 26.KW 2022 2197 22 715 13 709 128 59 445 0
13 27.KW 2022 2111 20 429 10 937 86 28 474 1
14 28.KW 2022 752 5 121 4 373 42 3 172 0
Your data frame df has a second column called Total, and it seems you want to divide the subsequent columns by it.
df[-1] <- df[-1] / df$Total
After this, the first column Time is unchanged, the Total column becomes 1, and the remaining columns become proportions.
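To make the effect concrete, here is a minimal sketch on a cut-down version of the data (first two rows, first four columns, names taken from the question):
df <- data.frame(Time  = c("15.KW 2022", "16.KW 2022"),
                 Total = c(1816, 2535),
                 Blog  = c(23, 32),
                 News  = c(326, 690))
df[-1] <- df[-1] / df$Total  # divide everything except Time by Total
round(df[-1], 4)
#   Total   Blog   News
# 1     1 0.0127 0.1795
# 2     1 0.0126 0.2722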

r - lapply/loop or outer with two indices

I have a dataset v1 where I want to get data of certain grid boxes.
Here's an extract from v1:
"V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" "V12" "V13" "V14" "V15" "V16" "V17" "V18"
43 1 0 69 60 9 19501201 1080 0 1 641 30 0 291 272 136 29 3650
43 1 1 69 60 9 19501201 884 0 1 705 30 3 290 293 136 29 3650
43 1 2 70 61 9 19501201 553 293 1 1090 30 6 264 468 138 31 3650
43 1 3 71 62 9 19501201 416 290 1 1240 30 9 303 503 140 33 3650
43 1 4 72 63 9 19501201 396 287 1 1160 30 12 334 444 142 35 3650
43 1 5 73 64 9 19501201 163 285 1 1440 30 15 377 687 144 37 3650
43 1 6 74 66 9 19501201 29 475 1 1490 30 18 386 674 146 41 3650
43 1 7 74 67 9 19501201 -257 222 1 1960 30 21 444 875 146 43 3650
43 1 8 74 68 9 19501202 -216 222 1 1850 30 0 438 806 146 45 3650
43 1 9 74 69 9 19501202 -393 222 1 1950 30 3 444 847 146 47 3650
43 1 10 74 70 9 19501202 -500 222 1 2130 30 6 457 901 146 49 3650
The list "v1" has the columns longitudes (V16) and the latitudes (V17) of the boundary conditions you see below.
For example, I need to filter between 80°W-30°E (V16) and 25°N-75°N (V17) by boxes of 5° each.
I want to keep all the other columns for each filtered box.
These are my boundary conditions:
lon1_i <- seq(-80,25, by=5)
lon2_i <- seq(-75,30, by=5)
lat1_i <- seq(25,70, by=5)
lat2_i <- seq(30,75, by=5)
So the first grid box has all the info in -80° to -75° and 25°-30°, then the second box contains the data from -75° to -70° and 30°-35°. And so on until the last box of 25°-30°E and 70°-75°N.
I tried to use a for loop with two indices:
for (i in 1:22) {
  for (k in 1:10) {
    test[[i]][[k]] <- v1 %>%
      filter(between(V16, lon1_i[[i]], lon2_i[[i]]),
             between(V17, lat1_i[[k]], lat2_i[[k]])) %>%
      group_by(group = cumsum(V3 == 0))
  }
}
And with outer:
test <- outer(seq(lon1_i), seq(lon2_i), seq(lat1_i), seq(lat2_i),
              function(i, j) v1 %>%
                filter(between(V16, lon1_i[i], lon2_i[i]),
                       between(V17, lat1_i[j], lat2_i[j])) %>%
                group_by(group = cumsum(V3 == 0)))
Also lapply:
test <- lapply(seq(22, 10), function(x) v1 %>%
                 filter(between(V16, lon1_i[x], lon2_i[x]), between(V17, lat1_i[x], lat2_i[x])) %>%
                 group_by(group = cumsum(V3 == 0)))
The output should be in the form of new data tables/lists, so I guess 22x10 of them from my chosen coordinates.
Is it possible with these functions/types of loops? I would much appreciate some help on this. Thanks!
It looks like you have a table of points (v1) and a set of areas describing the box boundaries. I would use a spatial join to filter the table of points, e.g. using the function st_within from the R package sf.
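A hedged sketch of that route for a single 5° box (V16 = longitude, V17 = latitude as in the question; the CRS is assumed to be plain WGS84 lon/lat):
library(sf)
# Turn the table of points into an sf object, keeping the original columns
pts <- st_as_sf(v1, coords = c("V16", "V17"), crs = 4326, remove = FALSE)
# One grid box: 80°W to 75°W, 25°N to 30°N
box <- st_as_sfc(st_bbox(c(xmin = -80, ymin = 25, xmax = -75, ymax = 30),
                         crs = st_crs(4326)))
inside <- st_within(pts, box, sparse = FALSE)[, 1]
v1_box <- v1[inside, ]  # all original columns, only the points inside this box
Looping this over the 22x10 combinations of the lon1_i/lat1_i indices would build the nested list the question describes.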

Pivot / Reshape data [closed]

My sample data looks like this:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurment4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time (days) Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked up some things about wide-to-long format, but it doesn't seem to do the trick.
Kind regards, and thank you in advance.
Here is a base R option
u <- cbind(
  data[1],
  do.call(
    rbind,
    lapply(
      split.default(data[-1], ceiling(seq_along(data[-1]) / 2)),
      setNames,
      c("Value", "Time")
    )
  )
)
out <- `row.names<-`(
  subset(
    x <- u[order(u$pid), ],
    complete.cases(x)
  ),
  NULL
)
such that
> out
pid Value Time
1 1 1356 1435
2 1 1483 1405
3 1 1563 1374
4 2 943 1848
5 2 1173 1818
6 2 1300 1785
7 3 1590 185
8 3 1585 294
9 4 130 72
10 4 140 82
11 4 220 126
12 4 166 159
13 4 380 189
14 4 353 231
15 4 180 268
16 4 571 334
17 4 443 70
18 4 266 124
19 4 213 156
20 4 583 173
21 4 510 222
22 4 596 303
23 4 476 145
24 4 656 217
25 4 816 289
26 4 136 79
27 4 756 89
28 4 703 128
29 4 776 166
30 4 586 203
31 4 526 240
32 4 580 278
33 4 483 371
An option with pivot_longer
library(dplyr)
library(tidyr)
names(data)[8] <- "measurement4"
data %>%
  pivot_longer(cols = -pid, names_to = c('.value', 'grp'),
               names_sep = "(?<=[a-z])(?=[0-9])", values_drop_na = TRUE) %>%
  select(-grp)
# A tibble: 33 x 3
# pid measurement Tdays
# <int> <int> <int>
# 1 1 1356 1435
# 2 1 1483 1405
# 3 1 1563 1374
# 4 2 943 1848
# 5 2 1173 1818
# 6 2 1300 1785
# 7 3 1590 185
# 8 3 1585 294
# 9 4 130 72
#10 4 443 70
# … with 23 more rows
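If the exact headers from the question are wanted, a small rename on top of that result gets there; this sketch assumes the pivoted tibble above has been saved under a name of my choosing, long:
long %>%
  select(PID = pid, `Time (days)` = Tdays, Value = measurement)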

Convert non-numeric rows and columns to zero

I have this data from an R package, where X is the dataset with all the data:
library(ISLR)
data("Hitters")
X=Hitters
head(X)
Here is one part of the data:
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
-Andy Allanson 293 66 1 30 29 14 1 293 66 1 30 29 14 A E 446 33 20 NA A
-Alan Ashby 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
-Alvin Davis 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
-Andres Galarraga 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N
-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19 501 336 194 A W 282 421 25 750.0 A
I want to convert all the columns and rows with non-numeric values to zero. Is there any simple way to do this?
I found an example here of how to remove such rows for a single column, but for more columns I would have to do it manually for each one.
Is there any function in R that does this for all columns and rows?
To remove non-numeric columns, perhaps something like this?
df %>%
  select(which(sapply(., is.numeric)))
# AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
#-Andy Allanson 293 66 1 30 29 14 1 293 66 1
#-Alan Ashby 315 81 7 24 38 39 14 3449 835 69
#-Alvin Davis 479 130 18 66 72 76 3 1624 457 63
#-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
#-Andres Galarraga 321 87 10 39 42 30 2 396 101 12
#-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
# CRuns CRBI CWalks PutOuts Assists Errors Salary
#-Andy Allanson 30 29 14 446 33 20 NA
#-Alan Ashby 321 414 375 632 43 10 475.0
#-Alvin Davis 224 266 263 880 82 14 480.0
#-Andre Dawson 828 838 354 200 11 3 500.0
#-Andres Galarraga 48 46 33 805 40 4 91.5
#-Alfredo Griffin 501 336 194 282 421 25 750.0
or
df %>%
  select(-which(sapply(., function(x) is.character(x) | is.factor(x))))
Or much neater (thanks to #AntoniosK):
df %>% select_if(is.numeric)
Update
To additionally replace NAs with 0, you can do
df %>% select_if(is.numeric) %>% replace(is.na(.), 0)
# AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
#-Andy Allanson 293 66 1 30 29 14 1 293 66 1
#-Alan Ashby 315 81 7 24 38 39 14 3449 835 69
#-Alvin Davis 479 130 18 66 72 76 3 1624 457 63
#-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
#-Andres Galarraga 321 87 10 39 42 30 2 396 101 12
#-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
# CRuns CRBI CWalks PutOuts Assists Errors Salary
#-Andy Allanson 30 29 14 446 33 20 0.0
#-Alan Ashby 321 414 375 632 43 10 475.0
#-Alvin Davis 224 266 263 880 82 14 480.0
#-Andre Dawson 828 838 354 200 11 3 500.0
#-Andres Galarraga 48 46 33 805 40 4 91.5
#-Alfredo Griffin 501 336 194 282 421 25 750.0
library(ISLR)
data("Hitters")
d = head(Hitters)
library(dplyr)
d %>%
  mutate_if(function(x) !is.numeric(x), function(x) 0) %>%  # if column is non numeric add zeros
  mutate_all(function(x) ifelse(is.na(x), 0, x))            # if there is an NA element replace it with 0
# AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
# 1 293 66 1 30 29 14 1 293 66 1 30 29 14 0 0 446 33 20 0.0 0
# 2 315 81 7 24 38 39 14 3449 835 69 321 414 375 0 0 632 43 10 475.0 0
# 3 479 130 18 66 72 76 3 1624 457 63 224 266 263 0 0 880 82 14 480.0 0
# 4 496 141 20 65 78 37 11 5628 1575 225 828 838 354 0 0 200 11 3 500.0 0
# 5 321 87 10 39 42 30 2 396 101 12 48 46 33 0 0 805 40 4 91.5 0
# 6 594 169 4 74 51 35 11 4408 1133 19 501 336 194 0 0 282 421 25 750.0 0
If you want to avoid function(x) you can use this
d %>%
  mutate_if(Negate(is.numeric), ~0) %>%
  mutate_all(~ifelse(is.na(.), 0, .))
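In newer dplyr (1.0 and up) mutate_if()/mutate_all() are superseded by across(); as a hedged aside, the same two steps would look like this:
d %>%
  mutate(across(where(~ !is.numeric(.x)), ~ 0)) %>%          # non-numeric columns become 0
  mutate(across(everything(), ~ replace(.x, is.na(.x), 0)))  # remaining NAs become 0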
You can get the numeric columns with sapply/inherits.
X <- Hitters
inx <- sapply(X, inherits, c("integer", "numeric"))
Y <- X[inx]
Then it wouldn't make much sense to remove the rows with non-numeric entries, since they are already gone, but you could do
inx <- apply(Y, 1, function(y) all(inherits(y, c("integer", "numeric"))))
Y[inx, ]
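A hedged base R companion that does the conversion the question literally asks for (non-numeric columns to zero, then NAs to zero), rather than dropping columns:
X <- Hitters
X[!sapply(X, is.numeric)] <- 0  # overwrite factor/character columns with 0
X[is.na(X)] <- 0                # replace remaining NAs (e.g. Salary) with 0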

dplyr Error: cannot modify grouping variable even when first applying ungroup

I'm getting this error, but the fixes in related posts don't seem to apply. I'm using ungroup, though it's no longer needed (can I switch the grouping variable in a single dplyr statement? but see Format column within dplyr chain). Also, I have no quotes in my group_by call, and I'm not applying any functions that act on the grouped-by columns (R dplyr summarize_each --> "Error: cannot modify grouping variable"), but I'm still getting this error:
> games2 = baseball %>%
+ ungroup %>%
+ group_by(id, year) %>%
+ summarize(total=g+ab, a = ab+1, id = id)%>%
+ arrange(desc(total)) %>%
+ head(10)
Error: cannot modify grouping variable
This is the baseball set that comes with plyr:
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
44 forceda01 1871 1 WS3 32 162 45 45 9 4 0 29 8 0 4 0 NA NA NA NA NA
68 mathebo01 1871 1 FW1 19 89 15 24 3 1 0 10 2 1 2 0 NA NA NA NA NA
99 startjo01 1871 1 NY2 33 161 35 58 5 1 1 34 4 2 3 0 NA NA NA NA NA
102 suttoez01 1871 1 CL1 29 128 35 45 3 7 3 23 3 1 1 0 NA NA NA NA NA
106 whitede01 1871 1 CL1 29 146 40 47 6 5 1 21 2 2 4 1 NA NA NA NA NA
I loaded plyr before dplyr. Other bugs to check for? Thanks for any corrections/suggestions.
It's not clear what you are doing. I think the following is what you are looking for:
games2 = baseball %>%
  group_by(id, year) %>%
  mutate(total = g + ab, a = ab + 1) %>%
  arrange(desc(total)) %>%
  head(10)
> games2
Source: local data frame [10 x 24]
Groups: id, year
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp total a
1 aaronha01 1954 1 ML1 NL 122 468 58 131 27 6 13 69 2 2 28 39 NA 3 6 4 13 590 469
2 aaronha01 1955 1 ML1 NL 153 602 105 189 37 9 27 106 3 1 49 61 5 3 7 4 20 755 603
3 aaronha01 1956 1 ML1 NL 153 609 106 200 34 14 26 92 2 4 37 54 6 2 5 7 21 762 610
4 aaronha01 1957 1 ML1 NL 151 615 118 198 27 6 44 132 1 1 57 58 15 0 0 3 13 766 616
5 aaronha01 1958 1 ML1 NL 153 601 109 196 34 4 30 95 4 1 59 49 16 1 0 3 21 754 602
6 aaronha01 1959 1 ML1 NL 154 629 116 223 46 7 39 123 8 0 51 54 17 4 0 9 19 783 630
7 aaronha01 1960 1 ML1 NL 153 590 102 172 20 11 40 126 16 7 60 63 13 2 0 12 8 743 591
8 aaronha01 1961 1 ML1 NL 155 603 115 197 39 10 34 120 21 9 56 64 20 2 1 9 16 758 604
9 aaronha01 1962 1 ML1 NL 156 592 127 191 28 6 45 128 15 7 66 73 14 3 0 6 14 748 593
10 aaronha01 1963 1 ML1 NL 161 631 121 201 29 4 44 130 31 5 78 94 18 0 0 5 11 792 632
The problem is that you are trying to edit id in the summarize call, but you have grouped on id.
From your example, it looks like you want mutate anyway. You would use summarize if you were looking to apply a function that would return a single value like sum or mean.
games2 = baseball %>%
  dplyr::group_by(id, year) %>%
  dplyr::mutate(
    total = g + ab,
    a = ab + 1
  ) %>%
  dplyr::select(id, year, total, a) %>%
  dplyr::arrange(desc(total)) %>%
  head(10)
Source: local data frame [10 x 4]
Groups: id, year
id year total a
1 aaronha01 1954 590 469
2 aaronha01 1955 755 603
3 aaronha01 1956 762 610
4 aaronha01 1957 766 616
5 aaronha01 1958 754 602
6 aaronha01 1959 783 630
7 aaronha01 1960 743 591
8 aaronha01 1961 758 604
9 aaronha01 1962 748 593
10 aaronha01 1963 792 632
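As a hedged contrast, a summarize() that would work here aggregates to one row per id/year and leaves the grouping columns alone; the sum below is only an illustration, not necessarily the output the asker was after:
games_sum <- baseball %>%
  dplyr::group_by(id, year) %>%
  dplyr::summarize(total = sum(g + ab), a = max(ab) + 1) %>%
  dplyr::arrange(desc(total)) %>%
  head(10)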
