pivot_longer: transform first column's row values to column names - r

Using this data:
https://www.health.govt.nz/system/files/documents/publications/suicide-2015-tables.xlsx
The first column has rows:
"Number of suicides
Male
Female
Total
Age-standardised rate (deaths per 100,000)
Male
Female
Total"
However, I need these to be the column headers instead. Is this done via pivot_longer()?
Thanks

This is a pretty ugly data file, but hopefully this should get you to what you want:
library(dplyr)
library(tidyr)
# After downloading the file in your project folder
dat <- readxl::read_excel("suicide-2015-tables.xlsx", skip = 2)
dat %>%
  select(variables = ...1, `2006`:`2015`) %>% # Remove unneeded/blank columns
  mutate(headers = if_else(is.na(`2006`), variables, NA_character_)) %>% # Create a headers variable
  fill(headers, .direction = "down") %>% # Fill the headers down
  pivot_longer(`2006`:`2015`, names_to = "year", values_to = "counts") %>% # Reshape data from wide to long
  drop_na() %>%
  unite("headers_vars", headers, variables, sep = " - ") %>% # Create a new variable that combines the headers and the subgroup breakdown
  pivot_wider(names_from = headers_vars, values_from = counts) # Reshape back from long to wide
# A tibble: 10 x 15
year `Number of suic~ `Number of suic~ `Number of suic~ `Age-standardis~ `Age-standardis~ `Age-standardis~ `Age-specific r~
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2006 388 138 526 18.6 6.25 12.2 19.7
2 2007 371 116 487 17.4 5.01 11.0 15.2
3 2008 381 139 520 17.6 6.24 11.8 19.5
4 2009 393 117 510 17.9 5.03 11.3 18.1
5 2010 386 149 535 17.3 6.63 11.8 17.7
6 2011 377 116 493 17.0 5.06 10.9 20.1
7 2012 404 146 550 18.1 6.39 12.1 23.0
8 2013 367 147 514 16.0 6.41 11.0 17.8
9 2014 380 130 510 16.5 5.4 10.8 14.1
10 2015 384 143 527 16.4 6.1 11.1 16.9
# ... with 7 more variables: `Age-specific rates by life-stage age group (deaths per 100,000) - 25–44` <dbl>, `Age-specific rates by
# life-stage age group (deaths per 100,000) - 45–64` <dbl>, `Age-specific rates by life-stage age group (deaths per 100,000) - 65+` <dbl>,
# `Age-standardised suicide rates for Māori (deaths per 100,000) - Male` <dbl>, `Age-standardised suicide rates for Māori (deaths per
# 100,000) - Female` <dbl>, `Age-standardised suicide rates for non-Māori (deaths per 100,000) - Male` <dbl>, `Age-standardised suicide
# rates for non-Māori (deaths per 100,000) - Female` <dbl>


Reshaping Long data frame to wide using R [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 11 months ago.
I have the following data set containing the quantity of phytosanitary products purchased per zip code in France between 2015 and 2019, with its classification (other, toxic, mineral, organic).
The data frame looks like this; with the zip_code, the year, and the classification you can see the quantity that was purchased:
zip_code   year   classification   total_quantity
01000      2015   other            44.305436
01000      2015   toxic            212.783330
01000      2015   mineral          value
01000      2015   organic          value
01000      2016   other            value
01000      2016   toxic            value
01000      2016   mineral          value
it follows the same pattern .....
zip_code   year   classification   total_quantity
01000      2019   organic          value
01090      2015   other            value
but I would like something with only one entry per zip code, like this (of course going to 2019 and not stopping at 2016 like I did in my example):
zip_code   other_total-quantity-2015   Toxic_total-quantity-2015   Mineral_total-quantity-2015   organic_total-quantity-2015   other_total-quantity-2016   Toxic_total-quantity-2016
01000      value                       value                       value                         value                         value                       value
01090      value                       value                       value                         value                         value
I tried to do this using the reshape function, but the closest I got to what I want is a table where the zip_code is repeated 4 times (once for every classification).
Thank you
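Since the question mentions trying reshape(), here is a minimal base R sketch on made-up data (the zip codes and quantities below are invented). The column names come out as total_quantity.other_2015 and so on, so a final rename would be needed to match the requested other_total-quantity-2015 style:

```r
# Made-up miniature of the long-format data
df <- data.frame(
  zip_code = rep(c("01000", "01090"), each = 4),
  year = 2015,
  classification = rep(c("other", "toxic", "mineral", "organic"), 2),
  total_quantity = c(44.3, 212.8, 10, 20, 1, 2, 3, 4)
)
# Combine classification and year into a single key so each
# zip code collapses to one row instead of four
df$key <- paste(df$classification, df$year, sep = "_")
df$year <- df$classification <- NULL
# idvar stays as the row identifier, timevar spreads into columns
wide <- reshape(df, idvar = "zip_code", timevar = "key", direction = "wide")
# Columns are now total_quantity.other_2015, total_quantity.toxic_2015, ...
```

This is the base R counterpart of the pivot_wider answers below; the single combined key is what avoids the "zip_code repeated 4 times" result.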
The following uses pivot_wider from package tidyr to do the reshape. I'm aware that's a personal preference, but maybe it's helpful though.
library(tidyr)
library(dplyr)
## or install and load these and related packages
## in bulk through the `tidyverse` package
df %>%
  pivot_wider(
    id_cols = zip_code,
    names_from = c(year, classification),
    values_from = total_quantity,
    names_prefix = 'total-quantity' ## redundant, actually
  )
I created a sample dataset that looks like this:
# A tibble: 40 × 4
zip_code year classification total_quantity
<dbl> <dbl> <chr> <dbl>
1 1000 2015 other 61.1
2 1000 2015 toxic 32.8
3 1000 2015 mineral 11.4
4 1000 2015 organic 38.9
5 1000 2016 other 18.8
6 1000 2016 toxic 65.0
7 1000 2016 mineral 0.382
8 1000 2016 organic 18.8
9 1000 2017 other 96.0
10 1000 2017 toxic 60.4
# … with 30 more rows
If you run the following code you will get your requested table:
library(tidyr)
library(dplyr)
df %>%
  pivot_wider(
    id_cols = zip_code,
    names_from = c(year, classification),
    values_from = total_quantity,
    names_glue = "{classification}_total-quantity-{year}"
  )
Output:
# A tibble: 2 × 21
zip_code `other_total-quantity…` `toxic_total-q…` `mineral_total…` `organic_total…` `other_total-q…` `toxic_total-q…` `mineral_total…` `organic_total…` `other_total-q…` `toxic_total-q…` `mineral_total…` `organic_total…` `other_total-q…` `toxic_total-q…` `mineral_total…` `organic_total…` `other_total-q…` `toxic_total-q…` `mineral_total…` `organic_total…`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1000 61.1 32.8 11.4 38.9 18.8 65.0 0.382 18.8 96.0 60.4 80.4 81.2 47.4 87.4 52.2 9.65 19.7 11.3 45.7 12.8
2 1090 75.2 40.1 47.9 10.3 86.2 97.9 11.2 93.3 55.0 88.5 63.5 46.6 5.30 13.1 20.4 83.9 58.6 61.3 6.56 46.7
As you can see, the year and classification are added to the column names using names_glue in the pivot_wider call.

r pdf_text() split into lines and words

I can't upload the file into stackoverflow but I have a PDF containing a table spanning 3 pages. After using library(pdftools) and pdf_text(), it creates a 3 element character list where each element is a long string of all text from each page.
library(pdftools)
df <- pdf_text("file.pdf")
The data I need is on the 2nd page. I get the output:
df[2]
All Households 19,015 10,030 8,985 3,635 585 3,055 19.1 5.8 34.0\n\nHousing above standards 12,365 8,225 4,145 0 0 0 0.0 0.0 0.0\n\nBelow one or more housing standards 6,650 1,805 4,845 3,640 585 3,055 54.7 32.4 63.1\n\nBelow affordability standard12 4,885 1,230 3,660 3,125 535 2,590 64.0 43.5 70.8\n\nBelow adequacy standard13 1,360 555 810 425 75 350 31.2 13.5 43.2\n\n\n\n\n
I want to isolate the row "Below one or more housing standards" and the 8th column which contains the value "54.7".
I believe the next steps are to split the long string into lines by the line break character "\n", identify the applicable line, split the line into words, and select the 8th word.
I've tried splitting into lines using:
library(stringr)
lines <- df[2] %>% str_split("\n")
It returns a "List of 1" and I'm not sure how to work with it. Any suggestions on the syntax?
It's a bit convoluted to get to the original file.
https://www03.cmhc-schl.gc.ca/hmip-pimh/en/#Profile/126504/5/Alta%20Vista
Core Housing Need -> Full Report -> Export.
Oddly there isn't a way to just download a CSV.
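The plan described in the question (split on "\n", find the matching line, split on whitespace, pick the column) can be sketched in base R; the page string below is a shortened, hypothetical stand-in for df[2]:

```r
# Hypothetical stand-in for the pdf_text() page string
page <- "Housing above standards 12,365 8,225 4,145\n\nBelow one or more housing standards 6,650 1,805 4,845 3,640 585 3,055 54.7 32.4 63.1\n"

# strsplit() (like str_split()) returns a list with one element per
# input string; [[1]] extracts the character vector of lines
lines <- strsplit(page, "\n")[[1]]

# Find the line of interest
target <- grep("Below one or more housing standards", lines, value = TRUE)

# Split the line on runs of whitespace and keep only numeric fields
fields <- strsplit(trimws(target), "\\s+")[[1]]
nums <- grep("^[0-9.,]+$", fields, value = TRUE)
nums[7]  # the 7th number on the line is the 54.7 value
```

The "List of 1" the asker saw is just this list wrapper; indexing with [[1]] (or [[2]] per page) gets to the character vector.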
Use readLines, which (unlike scan(text = ...)) doesn't take the text directly and therefore needs a textConnection.
library(pdftools)
#Using poppler version 0.62.0
df <- pdf_text("Downloads/TableExport.pdf")
str(df)
# chr [1:3] "Core Housing Need (2016 Statistics Canada's Census) — Alta Vista\n H "| __truncated__ ...
# for each page read in with readLines to make character vectors
# separated by \n
lines <- lapply(df, function(t) readLines( textConnection(t)) )
Then search for the line with the target:
lines[[2]][grep("Below one or more housing standards", lines[[2]])]
[1] "Below one or more housing standards 6,650 1,805 4,845 3,640 585 3,055 54.7 32.4 63.1"
If you assigned that value to the name target you could get the 8th column with this rather baroque regex:
sub("(Below one or more housing standards)([ ]*\\d*[,]*\\d*){6}[ ]*(\\d*[.]*\\d*)(.*)", "\\3", target)
#[1] "54.7"
Notice the need to allow commas and decimal points in the numeric specifications. As written it may not be totally general since the first six of the numeric columns are only allowed to have commas and not decimals. I guess you could allow a character class like "[.,]" to be more general. Or even: "([ ]*\\d*[,]*\\d+[.]*\\d*){6}" (lightly tested). I suspect there are packages that will handle tabular pdf formatting in a more principled manner.
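Both the baroque regex and the simpler whitespace-split alternative can be checked on a hard-coded copy of the target line:

```r
# Hard-coded copy of the extracted line
target <- "Below one or more housing standards 6,650 1,805 4,845 3,640 585 3,055 54.7 32.4 63.1"

# The column-counting regex: skip the label, skip six comma-formatted
# integers, capture the first decimal column
col8 <- sub("(Below one or more housing standards)([ ]*\\d*[,]*\\d*){6}[ ]*(\\d*[.]*\\d*)(.*)", "\\3", target)

# Simpler alternative: strip the leading non-digit label, then split
# the remainder on whitespace and index the wanted column
nums <- strsplit(sub("^\\D+", "", target), "\\s+")[[1]]
col8_alt <- nums[7]
```

The split-and-index version sidesteps the comma/decimal distinctions the regex has to encode, at the cost of assuming the label contains no digits.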
This does not use pdftools, but I hope it is helpful to you. First, use the rvest package to read the URL of this table, then use html_table to extract it into a table. After that, some manual manipulation is needed.
library(tidyverse)
library(rvest)
url = "https://www03.cmhc-schl.gc.ca/hmip-pimh/en/Profile/DetailsCoreHousingNeed?geographyId=126504&t=5"
# Read the url
doc = rvest::read_html(url)
# Extract the table, and provide anonymous V<x> names
table = rvest::html_table(doc)[[1]]
names(table) = paste0("V",1:ncol(table))
# drop first three rows
table <- table %>% filter(row_number()>2)
# Manually, identify the split rows (i.e. subheadings)
split_rows = c(1,9,24,32,36,40,44,48,55,62)
# Extract the subheadings
sub_table_names = table %>% filter(row_number() %in% split_rows) %>% pull(V1)
# Now, use lapply to filter the rows that are between the splits, and use as.numeric and str_remove_all to convert to numeric values
tables = lapply(seq_along(split_rows), function(x) {
  # Append one past the last row so the final section isn't dropped
  # (split_rows[x + 1] would be NA for the last split otherwise)
  bounds <- c(split_rows, nrow(table) + 1)
  table %>%
    filter(between(row_number(), bounds[x] + 1, bounds[x + 1] - 1)) %>%
    mutate(across(V2:V10, ~ as.numeric(str_remove_all(.x, ","))))
})
# Name the list of tables
names(tables) <- sub_table_names
Output:
$`Age of primary household maintainer3`
# A tibble: 7 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 15 to 24 years 1030 45 980 220 0 220 21.4 0 22.4
3 25 to 34 years 2700 715 1990 555 40 515 20.6 5.6 25.9
4 35 to 44 years 2795 1360 1440 545 25 520 19.5 1.8 36.1
5 45 to 54 years 3565 2005 1565 740 135 610 20.8 6.7 39
6 55 to 64 years 3535 2225 1315 615 155 455 17.4 7 34.6
7 65 years and over 5380 3685 1700 960 220 735 17.8 6 43.2
$`Household Type4`
# A tibble: 14 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Couple with children 4360 3145 1220 585 100 485 13.4 3.2 39.8
3 Couple without children 4755 3195 1555 390 70 315 8.2 2.2 20.3
4 Senior-led (65+) couple without children 2030 1695 335 140 50 90 6.9 2.9 26.9
5 Lone-parent household 2220 810 1405 845 135 710 38.1 16.7 50.5
6 Female lone-parent household 1845 660 1190 730 105 625 39.6 15.9 52.5
7 Male lone-parent household 370 155 220 115 30 85 31.1 19.4 38.6
8 Multiple-family household 265 165 100 70 20 45 26.4 12.1 45
9 One-person household 6075 2385 3685 1525 235 1290 25.1 9.9 35
10 Female one-person households 3615 1590 2025 920 135 795 25.4 8.5 39.3
11 Senior (65+) female living alone 1810 980 830 525 90 435 29 9.2 52.4
12 Male one-person household 2455 800 1660 605 105 500 24.6 13.1 30.1
13 Senior (65+) male living alone 600 350 250 170 50 120 28.3 14.3 48
14 Other non-family household 1345 330 1015 230 25 205 17.1 7.6 20.2
$`Immigrant households5`
# A tibble: 7 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Non-immigrant 12500 7115 5395 1665 230 1440 13.3 3.2 26.7
3 Non-permanent resident6 430 25 400 140 10 130 32.6 40 32.5
4 Immigrant 6085 2890 3190 1825 345 1485 30 11.9 46.6
5 Landed before 2001 4105 2480 1620 1065 275 790 25.9 11.1 48.8
6 Landed 2001 to 2010 1340 340 1000 460 55 400 34.3 16.2 40
7 Recent immigrants (landed 2011-2016)7 640 70 575 310 10 295 48.4 14.3 51.3
$`Households with seniors`
# A tibble: 3 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Household has at least one senior (65 or older) 5910 4085 1825 1015 245 770 17.2 6 42.2
3 Other household type 13105 5945 7155 2625 340 2285 20 5.7 31.9
$`Households with children under 18`
# A tibble: 3 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Household has at least one child less than 18 years old 4465 2455 2005 1140 170 975 25.5 6.9 48.6
3 Other household type 14550 7575 6980 2500 420 2080 17.2 5.5 29.8
$`Activity limitations8`
# A tibble: 3 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Household has at least one person with activity limitations 10955 5830 5120 2285 385 1895 20.9 6.6 37
3 All other households 8060 4195 3865 1360 200 1160 16.9 4.8 30
$`Aboriginal households9`
# A tibble: 3 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Aboriginal households 655 215 440 120 20 105 18.3 9.3 23.9
3 Non-Aboriginal households 18355 9815 8540 3515 565 2955 19.2 5.8 34.6
$`Incomes, shelter costs10, and STIRs11`
# A tibble: 6 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Average household income before taxes ($) 96464 134172 54357 29101 31212 28696 NA NA NA
2 Average monthly shelter costs ($) 1256 1408 1085 1039 1243 1000 NA NA NA
3 Average STIR before taxes (%) 24 17.2 31.5 46.8 49.7 46.2 NA NA NA
4 Median household income before taxes ($) 72502 107762 44596 27711 28437 27568 NA NA NA
5 Median monthly shelter costs ($) 1097 1193 1076 1013 1115 1006 NA NA NA
6 Median STIR before taxes (%) 19.3 14 26 43.8 45.8 43.3 NA NA NA
$`Housing standards`
# A tibble: 6 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Housing above standards 12365 8225 4145 0 0 0 0 0 0
3 Below one or more housing standards 6650 1805 4845 3640 585 3055 54.7 32.4 63.1
4 Below affordability standard12 4885 1230 3660 3125 535 2590 64 43.5 70.8
5 Below adequacy standard13 1360 555 810 425 75 350 31.2 13.5 43.2
6 Below suitability standard14 1480 210 1270 800 55 745 54.1 26.2 58.7
You could check whether there is more up-to-date 2018 data by following the crumbs to https://www150.statcan.gc.ca/n1/pub/46-25-0001/462500012021001-eng.htm .
However, if you only want one row, it is easy to save the page source with a right click:
<tr>
<th scope="row">Below one or more housing standards</th>
<td>6,650</td>
<td>1,805</td>
<td>4,845</td>
<td>3,640</td>
<td>585</td>
<td>3,055</td>
<td>54.7</td>
<td>32.4</td>
<td>63.1</td>
</tr>
For the headings you need:
HOUSEHOLDS TESTED FOR CORE HOUSING NEED 1 HOUSEHOLDS IN CORE HOUSING NEED 2 % OF HOUSEHOLDS IN CORE HOUSING NEED
TOTAL OWNERS RENTERS TOTAL OWNERS RENTERS TOTAL OWNERS RENTERS
and for the footnotes:
1 Data include all non-farm, non-band, non-reserve private households reporting positive incomes and shelter cost-to-income ratios less than 100 per cent.
2 A household is in core housing need if its housing does not meet one or more standards for housing adequacy (repair), suitability (crowding), or affordability and if it would have to spend 30 per cent or more of its before-tax income to pay the median rent (including utilities) of appropriately sized alternative local market housing. Adequate housing does not require any major repairs, according to residents. Suitable housing has enough bedrooms for the size and make-up of resident households. Affordable housing costs less than 30 per cent of before-tax household income.
You have a PDF and want to work with the raw text, but it's clear there is some issue with the generated searchable text; we can see that in the headings and with copy and paste ("Belowone ormore housing standards"). So here is the expected extraction from the bottom of page 2:
pdftotext -f 2 -l 2 -nopgbrk -simple -margint 650 tableexport.pdf -

finding minimum for a column based on another column and keep result as a data frame

I have a data frame with four columns:
year <- c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002)
k <- c(12.5, 11.5, 10.5, -8.5, -9.5, -10.5, 13.9, 14.9, 15.9)
pop <- c(143, 147, 154, 445, 429, 430, 178, 181, 211)
pop_obs <- c(150, 150, 150, 440, 440, 440, 185, 185, 185)
df <- data.frame(year, k, pop, pop_obs) # data_frame() is deprecated; use data.frame() or tibble()
df
year k pop pop_obs
<dbl> <dbl> <dbl> <dbl>
1 2000 12.5 143 150
2 2000 11.5 147 150
3 2000 10.5 154 150
4 2001 -8.5 445 440
5 2001 -9.5 429 440
6 2001 -10.5 430 440
7 2002 13.9 178 185
8 2002 14.9 181 185
9 2002 15.9 211 185
What I want is, for each year, to find the k whose pop value has the minimum absolute difference from pop_obs, and to keep the result as a data frame with one row per year.
my expected output would be like this:
year k
<dbl> <dbl>
1 2000 11.5
2 2001 -8.5
3 2002 14.9
You could try with dplyr
df<- data.frame(year,k,pop,pop_obs)
library(dplyr)
df %>%
  mutate(diff = abs(pop_obs - pop)) %>%
  group_by(year) %>%
  filter(diff == min(diff)) %>%
  select(year, k)
#> # A tibble: 3 x 2
#> # Groups: year [3]
#> year k
#> <dbl> <dbl>
#> 1 2000 11.5
#> 2 2001 -8.5
#> 3 2002 14.9
Created on 2021-12-11 by the reprex package (v2.0.1)
Another tidyverse way, using slice_min() to keep the row with the smallest absolute difference within each year:
library(tidyverse)
data_you_want <- df %>%
  mutate(dif = abs(pop - pop_obs)) %>%
  group_by(year) %>%
  slice_min(dif, n = 1) %>%
  ungroup() %>%
  select(year, k)
Using base R
subset(df, as.logical(ave(abs(pop_obs - pop), year,
FUN = function(x) x == min(x))), select = c('year', 'k'))
# A tibble: 3 × 2
year k
<dbl> <dbl>
1 2000 11.5
2 2001 -8.5
3 2002 14.9

mutate arithmetic using dplyr::lag

I want to automate calculating the difference in means from a grouped mean_se table, but using lag() inside mutate() produces NAs.
iris %>%
  group_by(Species) %>%
  group_modify(~ mean_se(.x$Sepal.Length)) %>%
  mutate(difference = y - lag(y))
What I would like is a difference column that says NA, 0.93, 0.65
A harder case would be to specify the particular category against which to calculate the operation, for example filter(marital == "No answer") so that the mean differences in each raceXmarital status are calculated against the values of "No answer" in the marital column (34, 64 and 56)
gss_cat %>% group_by(race, marital) %>%
group_modify(~ mean_se(.x$age))
The group attribute is still present after the mutate unless we ungroup. There is no equivalent option in mutate to summarise's .groups = 'drop'.
library(dplyr)
iris %>%
  group_by(Species) %>%
  group_modify(~ mean_se(.x$Sepal.Length)) %>%
  ungroup %>%
  mutate(difference = y - lag(y))
-output
# A tibble: 3 x 5
Species y ymin ymax difference
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 4.96 5.06 NA
2 versicolor 5.94 5.86 6.01 0.93
3 virginica 6.59 6.50 6.68 0.652
Essentially, when we have a single row per group (and here the output after group_modify is a single row per group) and we take the lag, it is just NA, because the default for lag() is NA:
lag(5)
[1] NA
and any value subtracted from NA returns NA
6 - NA
[1] NA
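For comparison, the lagged-difference step can be reproduced in base R on a plain vector; the values below are the rounded group means from the output above:

```r
# Group means for the three species (rounded, from the output above)
y <- c(5.01, 5.94, 6.59)

# dplyr::lag(y) is equivalent to shifting the vector right and
# padding with NA: c(NA, head(y, -1))
difference <- y - c(NA, head(y, -1))
# NA for the first row, then the successive differences 0.93 and 0.65
```

This makes it clear that the NA in the first row is expected; it is only the per-group NAs (from lagging within one-row groups) that the ungroup() fixes.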
For the second case, the data is grouped by two columns; therefore, we can change the grouping to 'race' and do the subtraction by subsetting:
library(forcats)
data(gss_cat)
gss_cat %>%
  group_by(race, marital) %>%
  group_modify(~ mean_se(.x$age)) %>%
  group_by(race) %>%
  mutate(diff = y - y[marital == 'No answer']) %>%
  ungroup
-output
# A tibble: 18 x 6
race marital y ymin ymax diff
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Other No answer 34 28 40 0
2 Other Never married 30.2 29.8 30.6 -3.79
3 Other Separated 42.5 41.3 43.8 8.54
4 Other Divorced 45.5 44.7 46.3 11.5
5 Other Widowed 64.5 62.7 66.3 30.5
6 Other Married 42.2 41.8 42.7 8.24
7 Black No answer 64 NA NA 0
8 Black Never married 34.5 34.2 34.8 -29.5
9 Black Separated 46.2 45.2 47.1 -17.8
10 Black Divorced 51.0 50.4 51.5 -13.0
11 Black Widowed 67.5 66.7 68.4 3.53
12 Black Married 46.4 46.0 46.9 -17.6
13 White No answer 56 50.1 61.9 0
14 White Never married 34.4 34.2 34.7 -21.6
15 White Separated 45.6 44.9 46.2 -10.4
16 White Divorced 51.6 51.3 51.8 -4.44
17 White Widowed 72.8 72.5 73.1 16.8
18 White Married 49.7 49.5 49.8 -6.32

How can I find the matching values from a column within 3 or more grouped dataframes?

I have grouped data frames (in my case three data frames grouped together ). I want to find the intersection between all three data frames based on a value in a column.
I have been playing around with the dplyr intersect function but don't see how I can use this with my grouped data frames. I want to find all rows within all three data frames that have the same Start.Coord value.
Here is one failed attempt with the resulting error message:
SameWithinTreatment <= SorbitolGroup %>% group_by(Sample) %>% intersect(Start.Coord)
Error in intersect_data_frame(x, y) : object 'Start.Coord' not found
Obviously I need to give intersect() another argument. intersect() doesn't seem to be the function I need, but it seems there must be a way to do what I want.
I have done a lot of searching but everything I find only works with 2 data frames.
Here is some example data from my grouped data frames. There is one row with a common Start.Coord value between these three: the row with 8805 as the Start.Coord.
Start.Coord Stop.Coord Sample Coverage normalized.coverage Average.Normalized.Covera~ SD.of.Normalized.Covera~ TwoSD
<int> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 1019 1023 X1.combined 19 18 9.91 3.98 7.95
2 1510 1514 X1.combined 19 18 9.91 3.98 7.95
3 1514 1518 X1.combined 19 18 9.91 3.98 7.95
4 1520 1524 X1.combined 19 18 9.91 3.98 7.95
5 8805 8809 X1.combined 19 18 9.91 3.98 7.95
6 48185 48189 X1.combined 19 18 9.91 3.98 7.95
Start.Coord Stop.Coord Sample Coverage normalized.coverage Average.Normalized.Coverage SD.of.Normalized.Coverage TwoSD
<int> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 8805 8809 X2 167 166 122. 21.7 43.4
2 11874 11878 X2 169 168 122. 21.7 43.4
3 12042 12046 X2 169 168 122. 21.7 43.4
4 18321 18325 X2 175 174 122. 21.7 43.4
5 25187 25191 X2 167 166 122. 21.7 43.4
6 25308 25312 X2 194 193 122. 21.7 43.4
Start.Coord Stop.Coord Sample Coverage normalized.coverage Average.Normalized.Coverage SD.of.Normalized.Coverage TwoSD
<int> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 8805 8809 X3 132 131 94.4 16.7 33.5
2 10340 10344 X3 135 134 94.4 16.7 33.5
3 11874 11878 X3 141 140 94.4 16.7 33.5
4 12042 12046 X3 137 136 94.4 16.7 33.5
5 18209 18213 X3 133 132 94.4 16.7 33.5
6 18218 18222 X3 143 142 94.4 16.7 33.5
So I would like to get back a new data frame that looks like this:
Start.Coord Stop.Coord Sample Coverage normalized.coverage Average.Normalized.Coverage SD.of.Normalized.Coverage TwoSD
8805 8809 X1.combined 19 18 9.91 3.98 7.95
8805 8809 X2 167 166 122. 21.7 43.4
8805 8809 X3 132 131 94.4 16.7 33.5
Is there a way to accomplish this?
If your 3 data frames have the same column names, use rbind to combine them:
SorbitolGroup <- rbind(df1, df2, df3)
then add Start.Coord to group_by:
SorbitolGroup %>% group_by(Sample, Start.Coord)
If you want to count the number of observations in both groups
SorbitolGroup %>% group_by(Sample,Start.Coord) %>% tally()
It sounds like you need to use filter(), in addition to what @W148SMH suggested.
a <- data.frame(sample = 'a', value = sample(1:10, 10, replace = TRUE))
b <- data.frame(sample = 'b', value = sample(1:10, 10, replace = TRUE))
c <- data.frame(sample = 'c', value = sample(1:10, 10, replace = TRUE))
df <- rbind(a, b, c)
summary(df)
df %>% filter(value==9)
df_new <- df %>% filter(value==9) # new data frame including all cases with value==9
df %>% count(sample,value)
df %>% group_by(sample,value) %>%
summarise(...) # to summarise other variables at each level of sample and value
