How to separate each column name of a matrix by the + - r

I have built a matrix whose column names are those of a subset of regressors that I want to insert into a regression model formula in R.
For example:
data$age is the response variable.
X is the design matrix whose column names are, for example, data$education and data$wage.
The problem is that the column names of X are not fixed (i.e. I don't know in advance which ones they are), so I tried this:
best_model <- lm(data$age ~ paste(colnames(X[, GA@solution == 1]), sep = "+"))
But it doesn't work.

Rather than writing the formula yourself, it may be easier to use the pipe (%>%) together with dplyr::select(). (This requires converting your matrix to a data frame first.)
library(tidyverse)
mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
#> 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
#> 3 audi a4 2 2008 4 manu… f 20 31 p comp…
#> 4 audi a4 2 2008 4 auto… f 21 30 p comp…
#> 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
#> 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
#> 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
#> 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
#> 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
#> 10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
#> # ... with 224 more rows
Select
dplyr::select() subsets columns.
mpg %>%
  select(hwy, manufacturer, displ, cyl, cty) %>% # subsetting
  lm(hwy ~ ., data = .)
#>
#> Call:
#> lm(formula = hwy ~ ., data = .)
#>
#> Coefficients:
#> (Intercept) manufacturerchevrolet manufacturerdodge
#> 2.65526 -1.08632 -2.55442
#> manufacturerford manufacturerhonda manufacturerhyundai
#> -2.29897 -2.98863 -0.94980
#> manufacturerjeep manufacturerland rover manufacturerlincoln
#> -3.36654 -1.87179 -1.10739
#> manufacturermercury manufacturernissan manufacturerpontiac
#> -2.64828 -2.44447 0.75427
#> manufacturersubaru manufacturertoyota manufacturervolkswagen
#> -3.04204 -2.73963 -1.62987
#> displ cyl cty
#> -0.03763 0.06134 1.33805
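Note that -col.name excludes that column. Because %>% passes the selected data frame to lm() as data = ., the formula can use the . notation, which stands for all remaining columns.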
Tidyselect
Many data sets group related columns with a shared prefix or suffix separated by an underscore.
nycflights13::flights
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 1 1 533 529 4 850
#> 3 2013 1 1 542 540 2 923
#> 4 2013 1 1 544 545 -1 1004
#> 5 2013 1 1 554 600 -6 812
#> 6 2013 1 1 554 558 -4 740
#> 7 2013 1 1 555 600 -5 913
#> 8 2013 1 1 557 600 -3 709
#> 9 2013 1 1 557 600 -3 838
#> 10 2013 1 1 558 600 -2 753
#> # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
For instance, both dep_delay and arr_delay are about delay time. Select helpers such as starts_with(), ends_with(), and contains() can handle this kind of column.
nycflights13::flights %>%
  select(starts_with("sched"),
         ends_with("delay"),
         distance)
#> # A tibble: 336,776 x 5
#> sched_dep_time sched_arr_time dep_delay arr_delay distance
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 515 819 2 11 1400
#> 2 529 830 4 20 1416
#> 3 540 850 2 33 1089
#> 4 545 1022 -1 -18 1576
#> 5 600 837 -6 -25 762
#> 6 558 728 -4 12 719
#> 7 600 854 -5 19 1065
#> 8 600 723 -3 -14 229
#> 9 600 846 -3 -8 944
#> 10 600 745 -2 8 733
#> # ... with 336,766 more rows
After that, just pipe the result into lm().
nycflights13::flights %>%
  select(starts_with("sched"),
         ends_with("delay"),
         distance) %>%
  lm(dep_delay ~ ., data = .)
#>
#> Call:
#> lm(formula = dep_delay ~ ., data = .)
#>
#> Coefficients:
#> (Intercept) sched_dep_time sched_arr_time arr_delay
#> -0.151408 0.002737 0.000951 0.816684
#> distance
#> 0.001859
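If the selected column names are only known at run time, as in the question, another option is to build the formula from the names. The sketch below uses stats::reformulate(); the data frame, the matrix X and the 0/1 vector keep are made-up stand-ins for the question's data, design matrix and GA solution, not the asker's actual objects.

```r
# Made-up stand-ins for the objects in the question
set.seed(1)
data <- data.frame(age = rnorm(20, 40, 10),
                   education = rnorm(20, 12, 2),
                   wage = rnorm(20, 30, 5),
                   tenure = rnorm(20, 5, 2))
X <- as.matrix(data[, c("education", "wage", "tenure")])
keep <- c(1, 1, 0)  # hypothetical 0/1 selection, one entry per column of X

# Names of the selected regressors
predictors <- colnames(X)[keep == 1]

# Build age ~ education + wage from the names and fit the model
f <- reformulate(predictors, response = "age")
best_model <- lm(f, data = data)
best_model
```

Pasting the names with paste(predictors, collapse = "+") and passing the result to as.formula() would give the same formula; note that it is collapse, not sep, that joins the elements of a single character vector.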

Related

How to add two data frames together in R?

I have a data frame broken down by ownership: private (50) and state (30). I am looking to create 5 new rows that are the sum of ownership 50 and ownership 30 wherever they have a matching area value. The desired result is below.
naics <- c(611,611,611,611,611,611,611,611,611,611)
ownership <- c(50,50,50,50,50,30,30,30,30,10)
area <- c(001,003,005,009,011,001,003,005,011,001)
d200201 <- c(14,17,20,23,26,3,5,7,9,100)
d200202 <- c(15,18,21,24,28,9,11,13,15,105)
private <- data.frame(naics,ownership,area,d200201,d200202)
naics ownership area d200201 d200202
611 50 001 17 24
611 50 003 22 29
611 50 005 27 34
611 50 009 23 24 (no sum because no 30 value)
611 50 011 35 43
Is this what you are looking for?
library(dplyr)
private %>%
  group_by(naics, area) %>%
  summarize(
    across(c(d200201, d200202), ~sum(.x[ownership %in% c(30, 50)])),
    ownership = 50, .groups = "drop"
  )
Output
# A tibble: 5 x 5
naics area d200201 d200202 ownership
<dbl> <dbl> <dbl> <dbl> <dbl>
1 611 1 17 24 50
2 611 3 22 29 50
3 611 5 27 34 50
4 611 9 23 24 50
5 611 11 35 43 50
library(tidyverse)
private %>%
  filter(ownership %in% c(50, 30)) %>%
  group_by(area) %>%
  summarize(across(starts_with("d200"), sum))
#> # A tibble: 5 × 3
#> area d200201 d200202
#> <dbl> <dbl> <dbl>
#> 1 1 17 24
#> 2 3 22 29
#> 3 5 27 34
#> 4 9 23 24
#> 5 11 35 43
Created on 2022-01-08 by the reprex package (v2.0.1)
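For comparison, the same filter-then-sum idea can be sketched in base R with aggregate(). This is only an alternative illustration built on the question's own data; the name res is introduced here for the result.

```r
# Data from the question
naics <- c(611, 611, 611, 611, 611, 611, 611, 611, 611, 611)
ownership <- c(50, 50, 50, 50, 50, 30, 30, 30, 30, 10)
area <- c(1, 3, 5, 9, 11, 1, 3, 5, 11, 1)
d200201 <- c(14, 17, 20, 23, 26, 3, 5, 7, 9, 100)
d200202 <- c(15, 18, 21, 24, 28, 9, 11, 13, 15, 105)
private <- data.frame(naics, ownership, area, d200201, d200202)

# Keep only ownership 30 and 50, then sum both value columns per naics/area
res <- aggregate(cbind(d200201, d200202) ~ naics + area,
                 data = subset(private, ownership %in% c(30, 50)),
                 FUN = sum)
res
```

Area 9 has no ownership-30 row, so its "sum" is just the ownership-50 value, as in the desired result.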

finding minimum for a column based on another column and keep result as a data frame

I have a data frame with four columns:
year <- c(2000,2000,2000,2001,2001,2001,2002,2002,2002)
k <- c(12.5,11.5,10.5,-8.5,-9.5,-10.5,13.9,14.9,15.9)
pop <- c(143,147,154,445,429,430,178,181,211)
pop_obs <- c(150,150,150,440,440,440,185,185,185)
df <- tibble::tibble(year, k, pop, pop_obs)
df
year k pop pop_obs
<dbl> <dbl> <dbl> <dbl>
1 2000 12.5 143 150
2 2000 11.5 147 150
3 2000 10.5 154 150
4 2001 -8.5 445 440
5 2001 -9.5 429 440
6 2001 -10.5 430 440
7 2002 13.9 178 185
8 2002 14.9 181 185
9 2002 15.9 211 185
What I want is to find, for each year, the value of k whose pop has the smallest difference from pop_obs. Finally, I want to keep the result as a data frame with one row per year.
My expected output would look like this:
year k
<dbl> <dbl>
1 2000 11.5
2 2001 -8.5
3 2002 14.9
You could try with dplyr
df<- data.frame(year,k,pop,pop_obs)
library(dplyr)
df %>%
  mutate(diff = abs(pop_obs - pop)) %>%
  group_by(year) %>%
  filter(diff == min(diff)) %>%
  select(year, k)
#> # A tibble: 3 x 2
#> # Groups: year [3]
#> year k
#> <dbl> <dbl>
#> 1 2000 11.5
#> 2 2001 -8.5
#> 3 2002 14.9
Created on 2021-12-11 by the reprex package (v2.0.1)
Try the tidyverse way:
library(tidyverse)
data_you_want <- df %>%
  group_by(year) %>%
  # keep the row(s) with the smallest absolute difference within each year
  slice_min(abs(pop - pop_obs), n = 1) %>%
  ungroup() %>%
  select(year, k)
Using base R
subset(df, as.logical(ave(abs(pop_obs - pop), year,
FUN = function(x) x == min(x))), select = c('year', 'k'))
# A tibble: 3 × 2
year k
<dbl> <dbl>
1 2000 11.5
2 2001 -8.5
3 2002 14.9

Plotting continuous versus categorical variable in a bar chart using ggplot

I am a newbie to base R. I have gone through similar issues here but didn't get it resolved. I am using the code:
ggplot(combined_Attributes, aes(x = factor(CatAge), y = Total_Expenditure,
fill = "#0073C2FF" +
geom_bar(stat = "identity", position = "dodge"))) + geom_text(aes(label = CatAge))
I do not want text written on the plot, only the categories shown as a reference. I am struggling with this.
I am not sure whether this is what you are looking for, but here I used the mpg demo data frame from the tidyverse, counted the models for each manufacturer, and plotted the counts as a bar plot.
library(tidyverse)
data(mpg)
mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a4 1.8 1999 4 auto(l~ f 18 29 p comp~
#> 2 audi a4 1.8 1999 4 manual~ f 21 29 p comp~
#> 3 audi a4 2 2008 4 manual~ f 20 31 p comp~
#> 4 audi a4 2 2008 4 auto(a~ f 21 30 p comp~
#> 5 audi a4 2.8 1999 6 auto(l~ f 16 26 p comp~
#> 6 audi a4 2.8 1999 6 manual~ f 18 26 p comp~
#> 7 audi a4 3.1 2008 6 auto(a~ f 18 27 p comp~
#> 8 audi a4 quat~ 1.8 1999 4 manual~ 4 18 26 p comp~
#> 9 audi a4 quat~ 1.8 1999 4 auto(l~ 4 16 25 p comp~
#> 10 audi a4 quat~ 2 2008 4 manual~ 4 20 28 p comp~
#> # ... with 224 more rows
ggplot(mpg, aes(x = manufacturer, y = frequency(model), fill = model)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45))
Created on 2021-09-03 by the reprex package (v2.0.1)
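For the variables named in the question, a minimal sketch might look like the following. combined_Attributes here is a made-up data frame that only borrows the question's two column names, and the values are invented. The categories appear on the x axis, and leaving out geom_text() keeps labels off the bars.

```r
library(ggplot2)

# Made-up stand-in for the asker's data
combined_Attributes <- data.frame(
  CatAge = c("18-30", "31-45", "46-60", "60+"),
  Total_Expenditure = c(1200, 1850, 1600, 900)
)

# fill is a constant colour, so it goes outside aes();
# geom_col() draws bars whose heights are the y values
p <- ggplot(combined_Attributes,
            aes(x = factor(CatAge), y = Total_Expenditure)) +
  geom_col(fill = "#0073C2FF") +
  labs(x = "Age category", y = "Total expenditure")
p
```

geom_col() is equivalent to geom_bar(stat = "identity"), so either spelling works here.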

Visualisation of principal components (PC)

I am using ggplot2 in R to produce the plot from the code:
final_flights <- augment(flights_model, flights_tbl) %>% collect()
final_flights
# A tibble: 327,346 x 22
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ... with 327,336 more rows, and 14 more variables: arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
# PC1 <dbl>, PC2 <dbl>, PC3 <dbl>
I have tried already:
ggplot(final_flights, aes(PC1, PC2)) +
geom_point(aes(colour=air_time))
ggplot(final_flights, aes(PC1, PC2, PC3))+
geom_point(aes(colour=air_time, distance, dep_time))
ml_predict(kmeans_model) %>%
collect() %>%
ggplot(aes(air_time, distance, dep_time)) +
geom_point(aes(air_time, distance, dep_time, col = factor(prediction+1)),
size=2, alpha=0.5)+
geom_point(data=kmeans_model$k, aes(air_time, distance, dep_time),
pch='x', size=12)+
scale_color_discrete(name="Predicted cluster")
> Warning: Ignoring unknown aesthetics: Warning: Ignoring unknown
> aesthetics: Error: Column 3 must be named.
I want to produce a ggplot of the first two principal components that shows which variable explains the clustering in the data.

Importing a .tsv file

I am having trouble importing a .tsv file in R. The data file is from Eurostat and is publicly available: http://ec.europa.eu/eurostat/en/web/products-datasets/-/MIGR_IMM10CTB
I use the code below to import it:
immig <- read.table(file="immig.tsv", sep="\t", header=TRUE)
However, the code does not seem to work. I do not receive any error messages, but the output looks like this:
> immig[1:3, 1:3]
age.agedef.c_birth.unit.sex.geo.time X2015 X2014
1 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,AT 4723 4093
2 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BE 1017 953
3 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BG 559 577
What am I doing wrong? I tried to use sep="," instead, but it seems to solve some problems while creating others.
Is the problem that you are missing the 2013 data?
I downloaded the file at that link, unzipped it using a command line tool, and then it can be imported just fine using the readr library:
library(readr)
immigration <- read_tsv("~/Downloads/migr_imm10ctb.tsv", na = ":")
#> Parsed with column specification:
#> cols(
#> `age,agedef,c_birth,unit,sex,geo\time` = col_character(),
#> `2015` = col_character(),
#> `2014` = col_character(),
#> `2013` = col_character()
#> )
immigration
#> # A tibble: 45,558 x 4
#> `age,agedef,c_birth,unit,sex,geo\\time` `2015` `2014` `2013`
#> <chr> <chr> <chr> <chr>
#> 1 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,AT 4723 4093 4085
#> 2 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BE 1017 953 1035
#> 3 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BG 559 577 743 p
#> 4 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CH 2876 2766 2758
#> 5 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CY <NA> <NA> 54
#> 6 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CZ 120 106 155
#> 7 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,DE <NA> <NA> 14984
#> 8 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,DK 372 365 405
#> 9 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,EE 23 7 16
#> 10 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,EL <NA> <NA> 234
#> # ... with 45,548 more rows
Looks like there are some spare characters floating around (743 p) where there should only be numbers, so you'll need to do more cleaning and then convert to numeric.
library(dplyr)
library(stringr)
immigration %>%
  mutate_at(vars(`2015`:`2013`), str_extract, pattern = "[0-9]+") %>%
  mutate_at(vars(`2015`:`2013`), as.numeric)
#> # A tibble: 45,558 x 4
#> `age,agedef,c_birth,unit,sex,geo\\time` `2015` `2014` `2013`
#> <chr> <dbl> <dbl> <dbl>
#> 1 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,AT 4723 4093 4085
#> 2 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BE 1017 953 1035
#> 3 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BG 559 577 743
#> 4 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CH 2876 2766 2758
#> 5 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CY NA NA 54
#> 6 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CZ 120 106 155
#> 7 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,DE NA NA 14984
#> 8 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,DK 372 365 405
#> 9 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,EE 23 7 16
#> 10 TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,EL NA NA 234
#> # ... with 45,548 more rows
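In newer dplyr versions the same cleaning step can be written with across() in a single mutate(). The sketch below runs on a two-row made-up tibble (immigration_small, with a shortened key column name) rather than the downloaded file, so the names are illustrative only.

```r
library(dplyr)
library(stringr)

# Tiny made-up sample mimicking the structure of the imported data
immigration_small <- tibble(
  key = c("TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,BG",
          "TOTAL,COMPLET,CC5_13_FOR_X_IS,NR,F,CY"),
  `2015` = c("559", NA),
  `2014` = c("577", NA),
  `2013` = c("743 p", "54")
)

# Extract the leading run of digits from each year column, then convert
cleaned <- immigration_small %>%
  mutate(across(`2015`:`2013`, ~ as.numeric(str_extract(.x, "[0-9]+"))))
cleaned
```

Missing values stay NA, and a flagged value such as "743 p" becomes 743.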
It's a tab-delimited file, but that first column is all put together with commas, so if what you are wanting is that information separated out, you could do that with tidyr::separate().
library(tidyr)
immigration %>%
  separate(`age,agedef,c_birth,unit,sex,geo\\time`,
           c("age", "agedef", "c_birth", "unit", "sex", "geo"),
           sep = ",")
#> # A tibble: 45,558 x 9
#> age agedef c_birth unit sex geo `2015` `2014` `2013`
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 TOTAL COMPLET CC5_13_FOR_X_IS NR F AT 4723 4093 4085
#> 2 TOTAL COMPLET CC5_13_FOR_X_IS NR F BE 1017 953 1035
#> 3 TOTAL COMPLET CC5_13_FOR_X_IS NR F BG 559 577 743 p
#> 4 TOTAL COMPLET CC5_13_FOR_X_IS NR F CH 2876 2766 2758
#> 5 TOTAL COMPLET CC5_13_FOR_X_IS NR F CY <NA> <NA> 54
#> 6 TOTAL COMPLET CC5_13_FOR_X_IS NR F CZ 120 106 155
#> 7 TOTAL COMPLET CC5_13_FOR_X_IS NR F DE <NA> <NA> 14984
#> 8 TOTAL COMPLET CC5_13_FOR_X_IS NR F DK 372 365 405
#> 9 TOTAL COMPLET CC5_13_FOR_X_IS NR F EE 23 7 16
#> 10 TOTAL COMPLET CC5_13_FOR_X_IS NR F EL <NA> <NA> 234
#> # ... with 45,548 more rows
Something like this could be a starting point:
library(tidyr) # provides separate() and re-exports the %>% pipe
link <- "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/migr_imm10ctb.tsv.gz"
data <- readr::read_csv(link) %>%
  separate("geo\\time\t2015 \t2014 \t2013", into = c("geo", "2015", "2014", "2013"), sep = "\t")

Resources