I am having trouble importing a .tsv file in R. The data file is from Eurostat, and is publicly available: http://ec.europa.eu/eurostat/en/web/products-datasets/-/MIGR_IMM10CTB
I use the code below to import it:
immig <- read.table(file="immig.tsv", sep="\t", header=TRUE)
However, the code does not seem to work. I do not receive any error messages, but the output looks like this:
> immig[1:3, 1:3]
age.agedef.c_birth.unit.sex.geo.time X2015 X2014
1 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,AT 4723 4093
2 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BE 1017 953
3 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BG 559 577
What am I doing wrong? I tried to use sep="," instead, but it seems to solve some problems while creating others.
Is the problem that you are missing the 2013 data?
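To illustrate that point: `immig[1:3, 1:3]` only prints the first three columns, so a fourth column (2013) would not appear in that output even if the import itself worked. Below is a minimal sketch using a made-up excerpt in place of the real file; the `tsv_text` string and its shortened key column are assumptions for illustration only.

```r
# Hypothetical excerpt standing in for immig.tsv (tab-separated, 4 columns)
tsv_text <- paste(
  "age,geo\\time\t2015\t2014\t2013",
  "TOTAL,AT\t4723\t4093\t4085",
  "TOTAL,BE\t1017\t953\t1035",
  "TOTAL,BG\t559\t577\t743 p",
  sep = "\n"
)

immig <- read.table(text = tsv_text, sep = "\t", header = TRUE,
                    stringsAsFactors = FALSE)

dim(immig)       # number of rows and columns actually imported
names(immig)     # read.table() rewrites the headers into syntactic names
immig[1:3, 1:3]  # shows only the key column, X2015 and X2014
immig[1:3, ]     # shows every column, including X2013
```

If `dim()` reports the expected number of columns, the tab separator is doing its job and the remaining work is cleaning the values.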
I downloaded the file at that link, unzipped it using a command line tool, and then it can be imported just fine using the readr library:
library(readr)
immigration <- read_tsv("~/Downloads/migr_imm10ctb.tsv", na = ":")
#> Parsed with column specification:
#> cols(
#> `age,agedef,c_birth,unit,sex,geo\time` = col_character(),
#> `2015` = col_character(),
#> `2014` = col_character(),
#> `2013` = col_character()
#> )
immigration
#> # A tibble: 45,558 x 4
#> `age,agedef,c_birth,unit,sex,geo\\time` `2015` `2014` `2013`
#> <chr> <chr> <chr> <chr>
#> 1 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,AT 4723 4093 4085
#> 2 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BE 1017 953 1035
#> 3 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BG 559 577 743 p
#> 4 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CH 2876 2766 2758
#> 5 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CY <NA> <NA> 54
#> 6 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CZ 120 106 155
#> 7 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,DE <NA> <NA> 14984
#> 8 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,DK 372 365 405
#> 9 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,EE 23 7 16
#> 10 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,EL <NA> <NA> 234
#> # ... with 45,548 more rows
Looks like there are some spare characters floating around (743 p) where there should only be numbers, so you'll need to do more cleaning and then convert to numeric.
library(dplyr)
library(stringr)
immigration %>%
mutate_at(vars(`2015`:`2013`), str_extract, pattern = "[0-9]+") %>%
mutate_at(vars(`2015`:`2013`), as.numeric)
#> # A tibble: 45,558 x 4
#> `age,agedef,c_birth,unit,sex,geo\\time` `2015` `2014` `2013`
#> <chr> <dbl> <dbl> <dbl>
#> 1 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,AT 4723 4093 4085
#> 2 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BE 1017 953 1035
#> 3 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BG 559 577 743
#> 4 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CH 2876 2766 2758
#> 5 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CY NA NA 54
#> 6 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CZ 120 106 155
#> 7 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,DE NA NA 14984
#> 8 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,DK 372 365 405
#> 9 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,EE 23 7 16
#> 10 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,EL NA NA 234
#> # ... with 45,548 more rows
It's a tab-delimited file, but that first column is all put together with commas, so if what you are wanting is that information separated out, you could do that with tidyr::separate().
library(tidyr)
immigration %>%
separate(`age,agedef,c_birth,unit,sex,geo\\time`,
c("age", "agedef", "c_birth", "unit", "sex", "geo"),
sep = ",")
#> # A tibble: 45,558 x 9
#> age agedef c_birth unit sex geo `2015` `2014` `2013`
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 TOTAL COMPLET CC5_13_FOR_X_IS NR F AT 4723 4093 4085
#> 2 TOTAL COMPLET CC5_13_FOR_X_IS NR F BE 1017 953 1035
#> 3 TOTAL COMPLET CC5_13_FOR_X_IS NR F BG 559 577 743 p
#> 4 TOTAL COMPLET CC5_13_FOR_X_IS NR F CH 2876 2766 2758
#> 5 TOTAL COMPLET CC5_13_FOR_X_IS NR F CY <NA> <NA> 54
#> 6 TOTAL COMPLET CC5_13_FOR_X_IS NR F CZ 120 106 155
#> 7 TOTAL COMPLET CC5_13_FOR_X_IS NR F DE <NA> <NA> 14984
#> 8 TOTAL COMPLET CC5_13_FOR_X_IS NR F DK 372 365 405
#> 9 TOTAL COMPLET CC5_13_FOR_X_IS NR F EE 23 7 16
#> 10 TOTAL COMPLET CC5_13_FOR_X_IS NR F EL <NA> <NA> 234
#> # ... with 45,548 more rows
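The two steps above (cleaning the year columns and splitting the key column) can also be chained. The following is only a sketch on a three-row toy tibble: the object `toy` and the short column name `key` are assumptions standing in for the imported data and its long first column name.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy stand-in for the imported data; `key` replaces the long header name
toy <- tibble(
  key = c("TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,AT",
          "TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BG",
          "TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CY"),
  `2015` = c("4723", "559", NA),
  `2014` = c("4093", "577", NA),
  `2013` = c("4085", "743 p", "54")
)

tidy <- toy %>%
  # split the comma-separated key into its six components
  separate(key,
           into = c("age", "agedef", "c_birth", "unit", "sex", "geo"),
           sep = ",") %>%
  # keep only the digits of each value, then convert to numeric
  mutate(across(`2015`:`2013`, ~ as.numeric(str_extract(.x, "[0-9]+"))))

tidy
```

Note that extracting only the digits discards the flag letters (such as the `p` in `743 p`); if those flags matter for your analysis, you may want to keep them in a separate column instead.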
Something like this could be a starting point:
library(tidyr) # provides separate() and re-exports %>%
link <- "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/migr_imm10ctb.tsv.gz"
data <- readr::read_csv(link) %>%
separate("geo\\time\t2015 \t2014 \t2013", into = c("geo", "2015", "2014", "2013"), sep = "\t")
I need to divide columns despesatotal and despesamonetaria by the row named Total:
Let's suppose your data set is df.
# 1) Delete the last row
df <- df[-nrow(df),]
# 2) Build the desired data.frame (combining the CNAE names and the proportion columns)
# prop.table() expects a matrix/array, so convert the numeric columns first
new.df <- data.frame(grup_CNAE = df$grup_CNAE,
100 * prop.table(as.matrix(df[, -1]), margin = 2))
Finally, rename your columns. Be careful with the matrix and data.frame formats, because mathematical operations sometimes behave differently on each. If you use the dput function to give us a reproducible example, the answer can be more accurate.
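As a rough end-to-end sketch of this base R approach, here is a version that divides by the Total row explicitly and then renames the result. The three-row `df` and the `_pct` suffix are made up for illustration, since the original data was not posted.

```r
# Hypothetical data in the shape described in the question
df <- data.frame(
  grup_CNAE        = c("A", "B", "Total"),
  despesatotal     = c(30, 70, 100),
  despesamonetaria = c(10, 40, 50),
  stringsAsFactors = FALSE
)

# Separate the ordinary rows from the Total row
body   <- df[df$grup_CNAE != "Total", ]
totals <- unlist(df[df$grup_CNAE == "Total", -1])

# Divide each numeric column by its total (column-wise) and scale to percent
pct <- 100 * sweep(as.matrix(body[, -1]), 2, totals, "/")

# Rebuild a data.frame and rename the percentage columns
new.df <- data.frame(grup_CNAE = body$grup_CNAE, pct)
names(new.df)[-1] <- paste0(names(df)[-1], "_pct")
new.df
```

Working on `as.matrix()` of the numeric columns keeps the arithmetic in one type, and wrapping the result in `data.frame()` afterwards avoids the character matrix that `cbind()` would produce when mixing names and numbers.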
Here is a way to get it done. This is not the best way, but I think it is very readable.
Suppose this is your data frame:
mydf = structure(list(grup_CNAE = c("A", "B", "C", "D", "E", "Total"
), despesatotal = c(71, 93, 81, 27, 39, 311), despesamonetaria = c(7,
72, 36, 22, 73, 210)), row.names = c(NA, -6L), class = "data.frame")
mydf
# grup_CNAE despesatotal despesamonetaria
#1 A 71 7
#2 B 93 72
#3 C 81 36
#4 D 27 22
#5 E 39 73
#6 Total 311 210
To divide the despesatotal values by their total, you need to use the total value (311 in this example) as the denominator. Note that the total value is located in the last row. You can pick it out by indexing the despesatotal column with nrow() as the index value.
library(dplyr) # for mutate()
mydf |> mutate(percentage1 = despesatotal / despesatotal[nrow(mydf)],
percentage2 = despesamonetaria / despesamonetaria[nrow(mydf)])
# grup_CNAE despesatotal despesamonetaria percentage1 percentage2
#1 A 71 7 0.22829582 0.03333333
#2 B 93 72 0.29903537 0.34285714
#3 C 81 36 0.26045016 0.17142857
#4 D 27 22 0.08681672 0.10476190
#5 E 39 73 0.12540193 0.34761905
#6 Total 311 210 1.00000000 1.00000000
library(tidyverse)
Sample data
# A tibble: 11 x 3
group despesatotal despesamonetaria
<chr> <int> <int>
1 1 198 586
2 2 186 525
3 3 202 563
4 4 300 562
5 5 126 545
6 6 215 529
7 7 183 524
8 8 163 597
9 9 213 592
10 10 175 530
11 Total 1961 5553
df %>%
mutate(percentage_total = despesatotal / last(despesatotal),
percentage_monetaria = despesamonetaria/ last(despesamonetaria)) %>%
slice(-nrow(.))
# A tibble: 10 x 5
group despesatotal despesamonetaria percentage_total percentage_monetaria
<chr> <int> <int> <dbl> <dbl>
1 1 198 586 0.101 0.106
2 2 186 525 0.0948 0.0945
3 3 202 563 0.103 0.101
4 4 300 562 0.153 0.101
5 5 126 545 0.0643 0.0981
6 6 215 529 0.110 0.0953
7 7 183 524 0.0933 0.0944
8 8 163 597 0.0831 0.108
9 9 213 592 0.109 0.107
10 10 175 530 0.0892 0.0954
This is a good place to use dplyr::mutate(across()) to divide all relevant columns by the Total row. Note this is not sensitive to the order of the rows and will apply the manipulation to all numeric columns. You can supply any tidyselect semantics to across() instead if needed in your case.
library(tidyverse)
# make sample data
d <- tibble(grup_CNAE = paste0("Group", 1:12),
despesatotal = sample(1e6:5e7, 12),
despesamonetaria = sample(1e6:5e7, 12)) %>%
add_row(grup_CNAE = "Total", summarize(., across(where(is.numeric), sum)))
# divide numeric columns by value in "Total" row
d %>%
mutate(across(where(is.numeric), ~./.[grup_CNAE == "Total"]))
#> # A tibble: 13 × 3
#> grup_CNAE despesatotal despesamonetaria
#> <chr> <dbl> <dbl>
#> 1 Group1 0.117 0.0204
#> 2 Group2 0.170 0.103
#> 3 Group3 0.0451 0.0837
#> 4 Group4 0.0823 0.114
#> 5 Group5 0.0170 0.0838
#> 6 Group6 0.0174 0.0612
#> 7 Group7 0.163 0.155
#> 8 Group8 0.0352 0.0816
#> 9 Group9 0.0874 0.135
#> 10 Group10 0.113 0.0877
#> 11 Group11 0.0499 0.0495
#> 12 Group12 0.104 0.0251
#> 13 Total 1 1
Created on 2022-11-08 with reprex v2.0.2
I have a data.frame (df) with two columns (Date & Count) which looks something like shown below:
Date Count
1/1/2022 5
1/2/2022 13
1/3/2022 21
1/4/2022 29
1/5/2022 37
1/6/2022 45
1/7/2022 53
1/8/2022 61
1/9/2022 69
1/10/2022 77
1/11/2022 85
1/12/2022 93
1/13/2022 101
1/14/2022 109
1/15/2022 117
Since I have a single variable (Count), the idea is to identify whether the mean has changed every three days. I therefore want to apply a rolling t.test with a window of 3 days and save the resulting p-value next to the Count column so that I can plot it later. I have usually seen people do these sorts of tests with two variables, so I can't figure out how to do it with a single one.
For example, I saw this relevant answer here:
ttestFun <- function(dat) {
myTtest = t.test(x = dat[, 1], y = dat[, 2])
return(myTtest$p.value)
}
rollapply(df_ts, 7, FUN = ttestFun, fill = NA, by.column = FALSE)
But again, this is with two columns. Any guidance please?
Irrespective of any discussion about the usefulness of the approach, given a fixed number of measurements of 3, you could just shift the counts by 3 and perform t-test between two columns as in your example, such as:
library(data.table)
set.seed(123)
dates <- seq(as.POSIXct("2022-01-01"), as.POSIXct("2022-02-01"), by = "1 day")
dt <- data.table(Date=dates, count = sample(1:200, length(dates), replace=TRUE), key="Date")
dt[, nxt:=shift(count, 3, type = "lead")]
dt[, group:=rep(1:ceiling(length(dates)/3), each=3)[seq_along(dates)]]
dt[, p:= tryCatch(t.test(count, nxt)$p.value, error=function(e) NA), by="group"][]
#> Date count nxt group p
#> 1: 2022-01-01 159 195 1 0.7750944
#> 2: 2022-01-02 179 170 1 0.7750944
#> 3: 2022-01-03 14 50 1 0.7750944
#> 4: 2022-01-04 195 118 2 0.2240362
#> 5: 2022-01-05 170 43 2 0.2240362
#> 6: 2022-01-06 50 14 2 0.2240362
#> 7: 2022-01-07 118 118 3 0.1763296
#> 8: 2022-01-08 43 153 3 0.1763296
#> 9: 2022-01-09 14 90 3 0.1763296
#> 10: 2022-01-10 118 91 4 0.8896343
#> 11: 2022-01-11 153 197 4 0.8896343
#> 12: 2022-01-12 90 91 4 0.8896343
#> 13: 2022-01-13 91 185 5 0.8065021
#> 14: 2022-01-14 197 92 5 0.8065021
#> 15: 2022-01-15 91 137 5 0.8065021
#> 16: 2022-01-16 185 99 6 0.1060465
#> 17: 2022-01-17 92 72 6 0.1060465
#> 18: 2022-01-18 137 26 6 0.1060465
#> 19: 2022-01-19 99 7 7 0.5283156
#> 20: 2022-01-20 72 170 7 0.5283156
#> 21: 2022-01-21 26 137 7 0.5283156
#> 22: 2022-01-22 7 164 8 0.9612965
#> 23: 2022-01-23 170 78 8 0.9612965
#> 24: 2022-01-24 137 81 8 0.9612965
#> 25: 2022-01-25 164 43 9 0.6111337
#> 26: 2022-01-26 78 103 9 0.6111337
#> 27: 2022-01-27 81 117 9 0.6111337
#> 28: 2022-01-28 43 76 10 0.6453494
#> 29: 2022-01-29 103 143 10 0.6453494
#> 30: 2022-01-30 117 NA 10 0.6453494
#> 31: 2022-01-31 76 NA 11 NA
#> 32: 2022-02-01 143 NA 11 NA
#> Date count nxt group p
Created on 2022-04-07 by the reprex package (v2.0.1)
You could further clean that up, e.g. by taking the first date per group:
dt[, .(Date=Date[1], count=round(mean(count), 2), p=p[1]), by="group"]
#> group Date count p
#> 1: 1 2022-01-01 117.33 0.7750944
#> 2: 2 2022-01-04 138.33 0.2240362
#> 3: 3 2022-01-07 58.33 0.1763296
#> 4: 4 2022-01-10 120.33 0.8896343
#> 5: 5 2022-01-13 126.33 0.8065021
#> 6: 6 2022-01-16 138.00 0.1060465
#> 7: 7 2022-01-19 65.67 0.5283156
#> 8: 8 2022-01-22 104.67 0.9612965
#> 9: 9 2022-01-25 107.67 0.6111337
#> 10: 10 2022-01-28 87.67 0.6453494
#> 11: 11 2022-01-31 109.50 NA
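If an overlapping (truly rolling) comparison is wanted rather than non-overlapping groups, the same shift idea can be sketched in base R. The helper name `rolling_p` is hypothetical, and the input vector simply reuses the first 15 counts printed above.

```r
# Compare each window of k values with the k values that follow it.
# Returns one p-value per starting position, NA where no full pair of windows fits.
rolling_p <- function(x, k = 3) {
  n   <- length(x)
  out <- rep(NA_real_, n)
  for (i in seq_len(max(n - 2 * k + 1, 0))) {
    a <- x[i:(i + k - 1)]            # current window
    b <- x[(i + k):(i + 2 * k - 1)]  # next window
    out[i] <- tryCatch(t.test(a, b)$p.value, error = function(e) NA_real_)
  }
  out
}

# First 15 counts from the printed output above
count <- c(159, 179, 14, 195, 170, 50, 118, 43, 14, 118, 153, 90, 91, 197, 91)
p <- rolling_p(count, k = 3)
data.frame(count = count, p = p)
```

The windows starting at rows 1, 4, 7, ... are the same comparisons as the non-overlapping groups above; the other rows fill in the positions between them. With only three observations per window, each test has very little power, so treat the p-values as descriptive.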
You can create a grp, and then simply apply a t.test to each consecutive pair of groups:
d <- d %>% mutate(grp=rep(1:(n()/3), each=3))
d %>% left_join(
tibble(grp = 2:max(d$grp),
pval = sapply(2:max(d$grp), function(x) {
t.test(d %>% filter(grp==x) %>% pull(Count),
d %>% filter(grp==x-1) %>% pull(Count))$p.value
})
)) %>% group_by(grp) %>% slice_min(Date)
Output: (p-value is constant only because of the example data you provided)
Date Count grp pval
<date> <dbl> <int> <dbl>
1 2022-01-01 5 1 NA
2 2022-01-04 29 2 0.0213
3 2022-01-07 53 3 0.0213
4 2022-01-10 77 4 0.0213
5 2022-01-13 101 5 0.0213
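The constant p-value can be checked directly: the example Count is an arithmetic sequence (steps of 8), so every block of three has the same spread and every pair of neighbouring blocks differs by the same amount. A small base R sketch of that check, rebuilding the example counts with `seq()`:

```r
# The example data: 5, 13, 21, ..., 117 (15 values, step 8)
count <- seq(5, by = 8, length.out = 15)
grp   <- rep(1:5, each = 3)

# t-test of each block of three against the previous block
pvals <- sapply(2:5, function(g) {
  t.test(count[grp == g], count[grp == g - 1])$p.value
})
pvals
```

All four values come out identical, which is why the pval column repeats; real data with varying differences and variances would give different p-values per group.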
Or a data.table approach:
setDT(d)[, `:=`(grp=rep(1:(nrow(d)/3), each=3),cy=shift(Count,3))] %>%
.[!is.na(cy), pval:=t.test(Count,cy)$p.value, by=grp] %>%
.[,.SD[1], by=grp, .SDcols=!c("cy")]
Output:
grp Date Count pval
<int> <Date> <num> <num>
1: 1 2022-01-01 5 NA
2: 2 2022-01-04 29 0.02131164
3: 3 2022-01-07 53 0.02131164
4: 4 2022-01-10 77 0.02131164
5: 5 2022-01-13 101 0.02131164
Let's assume we somehow ended up with a data frame object (T2 in the example below) and we want to subset our original data with it. Is there a way to do this without using | in subset()?
Here is a dataset I was playing with, but my attempts failed:
education = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) = c("X", "State", "Region", "Urban.Population", "Per.Capita.Income", "Minor.Population", "Education.Expenditures")
head(education)
T1 = c(1,4,13,15,17,23,33,38)
T2 = education[T1,]$State
subset(education, State=="ME"| State=="MA" | State=="MI" | State=="MN" | State=="MO" | State=="MD" | State=="MS" | State=="MT")
subset(education, State==T2[3])
subset(education, State==T2)
PS: I created T2 as the states starting with M, but I don't want to use string matching or anything like that. Just assume we somehow ended up with T2, which contains some states.
I'm not quite sure what would be an acceptable answer but subset(education, State %in% T2) uses T2 as is and does not use |. Does this solve your problem? It's almost the same approach as Jon Spring points out in the comments, but instead of specifying a vector we can just use T2 with %in%. You say T2 is a data.frame object, but in the data you provided it turns out to be a character vector.
education = read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv", stringsAsFactors = FALSE)
colnames(education) = c("X", "State", "Region", "Urban.Population", "Per.Capita.Income", "Minor.Population", "Education.Expenditures")
T1 = c(1,4,13,15,17,23,33,38)
T2 = education[T1,]$State
T2 # T2 is not a data.frame object (R 4.0)
#> [1] "ME" "MA" "MI" "MN" "MO" "MD" "MS" "MT"
subset(education, State %in% T2)
#> X State Region Urban.Population Per.Capita.Income Minor.Population
#> 1 1 ME 1 508 3944 325
#> 4 4 MA 1 846 5233 305
#> 13 13 MI 2 738 5439 337
#> 15 15 MN 2 664 4921 330
#> 17 17 MO 2 701 4672 309
#> 23 23 MD 3 766 5331 323
#> 33 33 MS 3 445 3448 358
#> 38 38 MT 4 534 4418 335
#> Education.Expenditures
#> 1 235
#> 4 261
#> 13 379
#> 15 378
#> 17 231
#> 23 330
#> 33 215
#> 38 302
But let's say T2 were an actual data.frame:
T2 = education[T1,]["State"]
T2 #check
#> State
#> 1 ME
#> 4 MA
#> 13 MI
#> 15 MN
#> 17 MO
#> 23 MD
#> 33 MS
#> 38 MT
Then we could coerce it into a vector by subsetting it with drop = TRUE.
subset(education, State %in% T2[, , drop = TRUE])
#> X State Region Urban.Population Per.Capita.Income Minor.Population
#> 1 1 ME 1 508 3944 325
#> 4 4 MA 1 846 5233 305
#> 13 13 MI 2 738 5439 337
#> 15 15 MN 2 664 4921 330
#> 17 17 MO 2 701 4672 309
#> 23 23 MD 3 766 5331 323
#> 33 33 MS 3 445 3448 358
#> 38 38 MT 4 534 4418 335
#> Education.Expenditures
#> 1 235
#> 4 261
#> 13 379
#> 15 378
#> 17 231
#> 23 330
#> 33 215
#> 38 302
Created on 2021-06-12 by the reprex package (v0.3.0)
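To see why `State == T2` from the question does not give the expected rows while `%in%` does, here is a small sketch on a made-up four-row table (the toy `education` data below is an assumption, not the real dataset). `==` compares element by element and recycles the shorter vector, whereas `%in%` tests membership.

```r
# Hypothetical miniature version of the education data
education <- data.frame(
  State  = c("ME", "NH", "MA", "NY"),
  Region = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# T2 as a one-column data.frame holding the states of interest
T2 <- education[c(1, 3), "State", drop = FALSE]

# `==` recycles c("ME", "MA") along State, so only positional matches survive
with_eq <- subset(education, State == T2$State)

# `%in%` asks, for each State, whether it occurs anywhere in T2$State
with_in <- subset(education, State %in% T2$State)

# Extracting the column by position works as well
via_pos <- subset(education, State %in% T2[[1]])

with_eq
with_in
```

Here `with_eq` keeps only ME, because MA sits in a position where the recycled vector holds a different value, while `with_in` keeps both ME and MA.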
I am splitting a data.frame into a list on the basis of its column names. What I want is to include an id column (id) not just in one item but in all elements of the resulting list.
Presently I am doing it by subsequently binding the id column to all items of the list through map and bind_cols (alternatives through Map/do.call/mapply etc. I can do similarly myself). What I want to know is whether there is a canonical way of doing it directly, maybe with a function argument of split.default or through some other function, thus saving two or three extra steps.
Reproducible example
df <- data.frame(
stringsAsFactors = FALSE,
id = c("A", "B", "C"),
nm1_a = c(928L, 476L, 928L),
nm1_b = c(61L, 362L, 398L),
nm2_a = c(965L, 466L, 369L),
nm2_b = c(240L, 375L, 904L),
nm3_a = c(429L, 730L, 788L),
nm3_b = c(99L, 896L, 540L),
nm3_c = c(463L, 143L, 870L)
)
df
#> id nm1_a nm1_b nm2_a nm2_b nm3_a nm3_b nm3_c
#> 1 A 928 61 965 240 429 99 463
#> 2 B 476 362 466 375 730 896 143
#> 3 C 928 398 369 904 788 540 870
What I am doing presently
library(tidyverse)
split.default(df[-1], gsub('^(nm\\d+).*', '\\1', names(df)[-1])) %>%
map(~ .x %>% bind_cols('id' = df$id, .))
#> $nm1
#> id nm1_a nm1_b
#> 1 A 928 61
#> 2 B 476 362
#> 3 C 928 398
#>
#> $nm2
#> id nm2_a nm2_b
#> 1 A 965 240
#> 2 B 466 375
#> 3 C 369 904
#>
#> $nm3
#> id nm3_a nm3_b nm3_c
#> 1 A 429 99 463
#> 2 B 730 896 143
#> 3 C 788 540 870
What I want is exactly the same output, but is there any way to do it directly or a more canonical way?
Just for a diversity of options, here's what you said you didn't want to do. The pivot / split / pivot method can help scale better and adapt beyond keeping an ID based just on column position. It also makes use of the ID in order to do the reshaping, so it might also be more flexible if you have other operations to do in the intermediate steps and don't know for sure that your row order will stay the same—that's one of the reasons I sometimes avoid binding columns. It also (at least for me) makes sense to split data based on some variable rather than by groups of columns.
library(tidyr)
df %>%
pivot_longer(-id) %>%
split(stringr::str_extract(.$name, "^nm\\d+")) %>%
purrr::map(pivot_wider, id_cols = id, names_from = name)
#> $nm1
#> # A tibble: 3 x 3
#> id nm1_a nm1_b
#> <chr> <int> <int>
#> 1 A 928 61
#> 2 B 476 362
#> 3 C 928 398
#>
#> $nm2
#> # A tibble: 3 x 3
#> id nm2_a nm2_b
#> <chr> <int> <int>
#> 1 A 965 240
#> 2 B 466 375
#> 3 C 369 904
#>
#> $nm3
#> # A tibble: 3 x 4
#> id nm3_a nm3_b nm3_c
#> <chr> <int> <int> <int>
#> 1 A 429 99 463
#> 2 B 730 896 143
#> 3 C 788 540 870
You can make use of a temporary variable so that the code is cleaner and easy to understand.
common_cols <- 1
tmp <- df[-common_cols]
lapply(split.default(tmp, sub('^(nm\\d+).*', '\\1', names(tmp))),
function(x) cbind(df[common_cols], x))
#$nm1
# id nm1_a nm1_b
#1 A 928 61
#2 B 476 362
#3 C 928 398
#$nm2
# id nm2_a nm2_b
#1 A 965 240
#2 B 466 375
#3 C 369 904
#$nm3
# id nm3_a nm3_b nm3_c
#1 A 429 99 463
#2 B 730 896 143
#3 C 788 540 870
This one should be just two steps, split and replace.
Map(`[<-`, split.default(df[-1], substr(names(df)[-1], 1, 3)), 'id', value=df[1])
# $nm1
# nm1_a nm1_b id
# 1 928 61 A
# 2 476 362 B
# 3 928 398 C
#
# $nm2
# nm2_a nm2_b id
# 1 965 240 A
# 2 466 375 B
# 3 369 904 C
#
# $nm3
# nm3_a nm3_b nm3_c id
# 1 429 99 463 A
# 2 730 896 143 B
# 3 788 540 870 C
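Note that this approach appends id as the last column. If id should come first, as in the desired output, one extra reordering step can be added. A sketch, reusing the example df from the question:

```r
df <- data.frame(
  stringsAsFactors = FALSE,
  id = c("A", "B", "C"),
  nm1_a = c(928L, 476L, 928L),
  nm1_b = c(61L, 362L, 398L),
  nm2_a = c(965L, 466L, 369L),
  nm2_b = c(240L, 375L, 904L),
  nm3_a = c(429L, 730L, 788L),
  nm3_b = c(99L, 896L, 540L),
  nm3_c = c(463L, 143L, 870L)
)

# Split by the first three characters of the column names and add id to each piece
parts <- Map(`[<-`, split.default(df[-1], substr(names(df)[-1], 1, 3)),
             "id", value = df[1])

# Move id to the front of every list element
parts <- lapply(parts, function(x) x[c("id", setdiff(names(x), "id"))])
parts
```

The `substr(..., 1, 3)` grouping relies on every prefix being exactly three characters long (nm1, nm2, ...); with longer prefixes such as nm10, the regular-expression grouping from the other answers is the safer choice.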
I have built a matrix whose column names are those of a regressor subset that I want to insert into a regression model formula in R.
For example:
data$age is the response variable
X is the design matrix whose column names are, for example, data$education and data$wage.
The problem is that the column names of X are not fixed (i.e. I don't know in advance which ones they are), so I tried to code this:
best_model <- lm(data$age ~ paste(colnames(x[, GA@solution == 1]), sep = "+"))
But it doesn't work.
Rather than writing the formula yourself, using the pipe (%>%) and dplyr::select() appropriately might be helpful. (Here, convert your matrix to a data frame first.)
library(tidyverse)
mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
#> 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
#> 3 audi a4 2 2008 4 manu… f 20 31 p comp…
#> 4 audi a4 2 2008 4 auto… f 21 30 p comp…
#> 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
#> 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
#> 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
#> 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
#> 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
#> 10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
#> # ... with 224 more rows
Select
dplyr::select() subsets columns.
mpg %>%
select(hwy, manufacturer, displ, cyl, cty) %>% # subsetting
lm(hwy ~ ., data = .)
#>
#> Call:
#> lm(formula = hwy ~ ., data = .)
#>
#> Coefficients:
#> (Intercept) manufacturerchevrolet manufacturerdodge
#> 2.65526 -1.08632 -2.55442
#> manufacturerford manufacturerhonda manufacturerhyundai
#> -2.29897 -2.98863 -0.94980
#> manufacturerjeep manufacturerland rover manufacturerlincoln
#> -3.36654 -1.87179 -1.10739
#> manufacturermercury manufacturernissan manufacturerpontiac
#> -2.64828 -2.44447 0.75427
#> manufacturersubaru manufacturertoyota manufacturervolkswagen
#> -3.04204 -2.73963 -1.62987
#> displ cyl cty
#> -0.03763 0.06134 1.33805
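Note that -col.name excludes that column. In the formula, the . on the right-hand side stands for all remaining columns of the data, and data = . passes the piped data frame to lm().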
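As a small illustration of the exclusion form, here is a sketch on the built-in mtcars data (chosen only because it ships with R; the dropped columns are arbitrary):

```r
library(dplyr)

# Drop two columns, then regress mpg on everything that is left
fit <- mtcars %>%
  select(-qsec, -vs) %>%
  lm(mpg ~ ., data = .)

names(coef(fit))  # qsec and vs no longer appear among the terms
```

This is handy when it is easier to say which columns should be left out than to list every regressor that should stay in.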
Tidyselect
Lots of data sets group their columns using underscore.
nycflights13::flights
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 1 1 533 529 4 850
#> 3 2013 1 1 542 540 2 923
#> 4 2013 1 1 544 545 -1 1004
#> 5 2013 1 1 554 600 -6 812
#> 6 2013 1 1 554 558 -4 740
#> 7 2013 1 1 555 600 -5 913
#> 8 2013 1 1 557 600 -3 709
#> 9 2013 1 1 557 600 -3 838
#> 10 2013 1 1 558 600 -2 753
#> # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
For instance, both dep_delay and arr_delay are about delay time. Select helpers such as starts_with(), ends_with(), and contains() can handle these kinds of columns.
nycflights13::flights %>%
select(starts_with("sched"),
ends_with("delay"),
distance)
#> # A tibble: 336,776 x 5
#> sched_dep_time sched_arr_time dep_delay arr_delay distance
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 515 819 2 11 1400
#> 2 529 830 4 20 1416
#> 3 540 850 2 33 1089
#> 4 545 1022 -1 -18 1576
#> 5 600 837 -6 -25 762
#> 6 558 728 -4 12 719
#> 7 600 854 -5 19 1065
#> 8 600 723 -3 -14 229
#> 9 600 846 -3 -8 944
#> 10 600 745 -2 8 733
#> # ... with 336,766 more rows
After that, just %>% lm().
nycflights13::flights %>%
select(starts_with("sched"),
ends_with("delay"),
distance) %>%
lm(dep_delay ~ ., data = .)
#>
#> Call:
#> lm(formula = dep_delay ~ ., data = .)
#>
#> Coefficients:
#> (Intercept) sched_dep_time sched_arr_time arr_delay
#> -0.151408 0.002737 0.000951 0.816684
#> distance
#> 0.001859
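If you would still rather build the formula from the column names, as in the question, base R can do that too. The attempt in the question likely fails because paste(..., sep = "+") leaves a character vector unchanged (collapse is the argument that joins its elements) and because lm() needs a formula rather than a string. Below is a sketch on mtcars, where the matrix `X` and the 0/1 vector `solution` are hypothetical stand-ins for the design matrix and the selection result:

```r
# Hypothetical design matrix and 0/1 selection vector
X        <- as.matrix(mtcars[, c("wt", "hp", "qsec")])
solution <- c(1, 1, 0)

# Names of the selected regressors
selected <- colnames(X)[solution == 1]

# Option 1: reformulate() builds the formula from character vectors
f   <- reformulate(selected, response = "mpg")
fit <- lm(f, data = mtcars)

# Option 2: paste with collapse, then convert the string to a formula
f2   <- as.formula(paste("mpg ~", paste(selected, collapse = " + ")))
fit2 <- lm(f2, data = mtcars)

f
coef(fit)
```

Both options refer to columns by name and pass the data through the data argument, which avoids the data$age style inside the formula and makes predict() on new data easier later on.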