I'm trying to produce a table with summary totals and means across the whole dataset, and then by sub-category (f_grp), and show this by site.
I can use the group_by function to group by, which works well for reporting total_count and Mean_per_litre, but I would then like the same values for each category, as shown in f_grp.
|Site
|total_count
|Mean_per_litre
|1 |66 |3.33333333
|2 |77 |4.27777778
|3 |65 |3.38541667
|4 |154 |8.85057471
etc
I've tried group_by for both site and f_grp but this isn't quite right
|site
|f_grp
|total_count
|mean_per_litre
|1 |1c |3 |1.666667
|1 |1d |15 |4.166667
|1 |2a |1 |1.666667
|1 |2b |47 |11.190476
This isn't quite right as its not easy to read and I've now lost the original total columns I had in the first table (sorry about the tables, cant get them to work here).
dat$site=as.factor(dat$site)
dat$count=as.numeric(dat$count)
dat$f_grp=as.factor(dat$f_grp)
# totals across all f_grp
tabl1 <- dat %>%
group_by(site) %>%
summarise (total_count = sum(count), Mean_per_litre = mean(count_l_site))
tabl1
# totals FG 1b
tabl2 <- dat %>%
group_by(site) %>%
filter(f_grp== '1b') %>%
summarise ('1b_total_count' = sum(count))
tabl2
### BUT - this doesnt give a correct mean, as it only shows the mean of '1b' when only '1b' is present. I need a mean over the entire dataset at that site.
# table showing totals across whole dataset
tabl7 <- dat %>%
summarise (total_count = sum(count, na.rm = TRUE), Total_mean_per_litre = mean(count_l_site, na.rm = TRUE))
tabl7
# table with means for each site by fg
table6 <- dat %>%
group_by(site, f_grp) %>%
summarise (total_count = sum(count), mean_per_litre = mean(count_l_site, na.rm = TRUE))
table6
Ideally I need a way to extract the f-grp categories, put them as column headings, and then summarise means by site for those categories. But filtering the data and then joining multiple tables, gives incorrect means (as not mean of whole dataset, but a subset of that category, ie: when f_grp value is present only).
Many thanks to all who have read this far :)
> dput(head(dat))
structure(list(X = 1:6, site = structure(c(1L, 10L, 11L, 12L,
13L, 14L), levels = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18"), class = "factor"),
count = c(0, 0, 0, 0, 0, 0), f_grp = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), levels = c("1b", "1c", "1d", "2a", "2b"), class = "factor"),
count_l_site = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 6L
), class = "data.frame")
Updated:
Following advice here from Jon, and using the mtcars data (which worked as expected) I've tried the same method using my own data.
I can almost produce what's needed, but the totals are coming through as a row when they are needed as a column.
tabl1 <- dat %>%
group_by(site) %>%
summarise (total_count = sum(count), Mean_per_litre = mean(count_l_site)) %>%
mutate(fg = "total")
tabl1
tabl2_fg <- dat %>%
group_by(site, f_grp = as.character(f_grp)) %>%
summarize(total_count = sum(count), Mean_per_litre = mean(count_l_site))
tabl2_fg
tabl4 <-
bind_rows(tabl1, tabl2_fg) %>%
arrange(site, f_grp) %>%
tidyr::pivot_wider(names_from = f_grp, values_from = c(Mean_per_litre, total_count), names_vary = "slowest")
tabl4
Output as follows
Next steps:
move the circled outputs and put them at the beginning of the table
remove every other line
result - left with a simple table rows = sites; columns: total count; total mean; then columns for each fg count & mean: eg 1c count; 1c mean; 1d count; 1d mean.
Something like this?
library(dplyr)
avg_gear <- mtcars %>%
group_by(gear) %>%
summarize(avg_mpg = mean(mpg), n = n()) %>%
mutate(cyl = "total")
avg_gear_cyl <- mtcars %>%
group_by(gear,cyl = as.character(cyl)) %>%
summarize(avg_mpg = mean(mpg), n = n())
bind_rows(avg_gear, avg_gear_cyl) %>%
arrange(gear, cyl)
# A tibble: 11 × 4
gear avg_mpg n cyl
<dbl> <dbl> <int> <chr>
1 3 21.5 1 4
2 3 19.8 2 6
3 3 15.0 12 8
4 3 16.1 15 total
5 4 26.9 8 4
6 4 19.8 4 6
7 4 24.5 12 total
8 5 28.2 2 4
9 5 19.7 1 6
10 5 15.4 2 8
11 5 21.4 5 total
Or if you want categories as columns:
bind_rows(avg_gear, avg_gear_cyl) %>%
arrange(gear, desc(cyl)) %>%
tidyr::pivot_wider(names_from = cyl, values_from = c(avg_mpg, n), names_vary = "slowest")
# A tibble: 3 × 9
gear avg_mpg_total n_total avg_mpg_8 n_8 avg_mpg_6 n_6 avg_mpg_4 n_4
<dbl> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int>
1 3 16.1 15 15.0 12 19.8 2 21.5 1
2 4 24.5 12 NA NA 19.8 4 26.9 8
3 5 21.4 5 15.4 2 19.7 1 28.2 2
Related
I'm unsure how to structure my pivot longer command when I have both annual and monthly data. For example I have:
wide <- data.frame(region_name = character(), # Create empty data frame
total_population_2019 = numeric(),
total_population_2020 = numeric(),
mean_temperature_2019_1 = numeric(),
mean_temperature_2019_2 = numeric(),
mean_temperature_2020_1 = numeric(),
mean_temperature_2020_2 = numeric(),
stringsAsFactors = FALSE)
wide[1, ] <- list("funville", 50000, 51250, 26.3, 24.6, 25.7, 24.9)
region_name total_population_2019 total_population_2020 mean_temperature_2019_1 mean_temperature_2019_2 mean_temperature_2020_1 mean_temperature_2020_2
funville 50000 51250 26.3 24.6 25.7 24.9
I'm able to pivot on the monthly columns using spread:
long <- pivot_longer(wide, cols = 4:7, names_to = c("layer" ,"year", "month"),
names_pattern = "(.*)_(.*)_?_(.*)") %>%
group_by(layer) %>%
mutate(n = 1:n()) %>%
spread(layer, value) %>%
select(-n)
which gives
region_name total_population_2019 total_population_2020 year month mean_temperature
1 funville 50000 51250 2019 1 26.3
2 funville 50000 51250 2019 2 24.6
3 funville 50000 51250 2020 1 25.7
4 funville 50000 51250 2020 2 24.9
I'd like to now have a population column where the values are attributed for each row/month that falls in that year, ideally would look like:
desired.df <- data.frame(region_name = c("funville", "funville", "funville", "funville"),
year = c("2019", "2019", "2020", "2020"),
month = c("1", "2", "1", "2"),
population = c("50000", "50000", "51250", "51250"),
mean_temperature = c("26.3", "24.6", "25.7", "24.9"))
which gives
region_name year month population mean_temperature
1 funville 2019 1 50000 26.3
2 funville 2019 2 50000 24.6
3 funville 2020 1 51250 25.7
4 funville 2020 2 51250 24.9
Does anyone have a solution? Thanks in advance
One option would be to use the names_pattern argument and the special .value. To make this work I first add a helper month to your population columns. Additionally I use tidyr::fill to fill up the population column:
library(dplyr)
library(tidyr)
wide |>
rename_with(~ paste(.x, 1, sep = "_"), starts_with("total")) |>
pivot_longer(-region_name,
names_to = c(".value", "year", "month"),
names_pattern = "^(.*?)_(\\d+)_(\\d+)$") |>
group_by(year) |>
fill(total_population) |>
arrange(year)
#> # A tibble: 4 × 5
#> # Groups: year [2]
#> region_name year month total_population mean_temperature
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 funville 2019 1 50000 26.3
#> 2 funville 2019 2 50000 24.6
#> 3 funville 2020 1 51250 25.7
#> 4 funville 2020 2 51250 24.9
I have a column of species abundance
Pct_Cov Species Site Plot
1 2.25 AMLA AC 1
2 4.75 BECA4 AC 1
3 9.50 BEPA AC 1
4 7.00 BEPO AC 1
5 9.25 PIRU AC 1
6 2.25 PIRI AC 1
tail
tail(st.ov)
Pct_Cov Species Site Plot
612207 8.0 QUGA ZI 527
612208 1.0 RHAR4 ZI 527
612209 0.5 ARTR2 ZI 527
612210 1.0 POFE ZI 527
612211 3.0 VICIA ZI 527
612212 0.5 ARLU ZI 527
There are a LOT of plots here, 12438 to be exact. Each plot has a variety of different species, etc. I'm trying to write a function that creates a new column to calculate the ratio of the abundance of the dominant species / abundance of the subordinate species.
"Dominant" would be the sum of the top 1/4 of the species per each plot. So if a plot had 20 species, it would be the sum of the abundance of the 4 most abundant species.
I'm having a hard time going about this and was wondering if anyone had any tips. It would also be helpful to know what those species are, but that seems to be tricky.
Thanks!
Here's another tidyverse option. Since your data only has 6 rows for each of two Plots, I'll go with the "top 2" and "all but top 2", instead of your "4". It's easily modified.
dat %>%
group_by(Plot) %>%
mutate(R = dense_rank(Pct_Cov)) %>%
summarize(Ratio = sum(Pct_Cov[R %in% 1:2]) / sum(Pct_Cov[! R %in% 1:2]))
# # A tibble: 2 x 2
# Plot Ratio
# <int> <dbl>
# 1 1 0.359
# 2 527 0.273
This does not protect against plots with few unique species. For that, one might add some row-counting logic:
dat %>%
group_by(Plot) %>%
mutate(R = dense_rank(Pct_Cov)) %>%
summarize(Ratio = if (n() > (2+2)) sum(Pct_Cov[R %in% 1:2]) / sum(Pct_Cov[! R %in% 1:2]) else NA_real_)
If you get an NA, that means that that Plot had too few unique species.
Also, it doesn't acknowledge the possibility of 3 (my "2" plus one) having the same Pct_Cov, which sounds unlikely but would be a corner-case that will skew the math.
Data
dat <- structure(list(Pct_Cov = c(2.25, 4.75, 9.5, 7, 9.25, 2.25, 8, 1, 0.5, 1, 3, 0.5), Species = c("AMLA", "BECA4", "BEPA", "BEPO", "PIRU", "PIRI", "QUGA", "RHAR4", "ARTR2", "POFE", "VICIA", "ARLU"), Site = c("AC", "AC", "AC", "AC", "AC", "AC", "ZI", "ZI", "ZI", "ZI", "ZI", "ZI"), Plot = c(1L, 1L, 1L, 1L, 1L, 1L, 527L, 527L, 527L, 527L, 527L, 527L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "612207", "612208", "612209", "612210", "612211", "612212"))
We could use count to get the frequency count of 'plot', 'Species', arrange by 'plot' and descending order of 'n', then grouped by 'plot', create the ratio by taking the sum of first 3 'n' values divided by the sum of the rest and join with the original data
library(dplyr)
out <- df1 %>%
count(plot, Species) %>%
arrange(plot, desc(n)) %>%
group_by(plot) %>%
mutate(ratio = sum(n[1:3])/sum(n[-(1:3)])) %>%
right_join(df1)
I am trying to replace some text in my dataframe (a few rows given below)
> dput(Henry.longer[1:4,])
structure(list(N_l = c(4, 4, 4, 4), UG = c("100", "100", "100",
"100"), S = c(12, 12, 12, 12), Sample = c(NA, NA, NA, NA), EQ = c("Henry",
"Henry", "Henry", "Henry"), DF = c(0.798545454545455, 0.798545454545455,
0.798545454545455, 0.798545454545455), meow = c("Henry.Exterior.single",
"Multi", "Henry.Exterior.multi", "Henry.Interior.single"), Girder = c("Henry.Exterior.single",
"Henry.Interior.multi", "Henry.Exterior.multi", "Interior")), row.names = c(NA,
-4L), groups = structure(list(UG = "100", S = 12, .rows = list(
1:4)), row.names = c(NA, -1L), class = c("tbl_df", "tbl",
"data.frame"), .drop = FALSE), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"))
I try to mutate the dataframe as:
Henry.longer <- Henry.longer %>%
mutate(Loading = str_replace(meow, "Henry.Exterior.single", "Single")) %>%
mutate(Loading = str_replace(meow, "Henry.Exterior.multi", "Multi")) %>%
mutate(Loading = str_replace(meow, "Henry.Interior.single", "Single")) %>%
mutate(Loading = str_replace(meow, "Henry.Interior.multi", "Multi")) %>%
mutate(Girder = str_replace(meow, "Henry.Exterior.multi", "Exterior")) %>%
mutate(Girder = str_replace(meow, "Henry.Exterior.single", "Exterior")) %>%
mutate(Girder = str_replace(meow, "Henry.Interior.multi", "Interior")) %>%
mutate(Girder = str_replace(meow, "Henry.Interior.single", "Interior")) %>%
select(-meow)
But for some reason the results does not get applied to all the rows and only:
N_l UG S Sample EQ DF Loading Girder
1 4 100 12 NA Henry 0.799 Henry.Exterior.single Henry.Exterior.single
2 4 100 12 NA Henry 0.799 Multi Henry.Interior.multi
3 4 100 12 NA Henry 0.799 Henry.Exterior.multi Henry.Exterior.multi
4 4 100 12 NA Henry 0.799 Henry.Interior.single Interior
I think we can use lookup vectors for this, if it's easy or safer to use static string lookups:
tr_vec <- c(Henry.Exterior.single = "Single", Henry.Exterior.multi = "Multi", Henry.Interior.single = "Single", Henry.Interior.multi = "Multi")
tr_vec2 <- c(Henry.Exterior.multi = "Exterior", Henry.Exterior.single = "Exterior", Henry.Interior.multi = "Interior", Henry.Interior.single = "Interior")
Henry.longer %>%
mutate(
Loading = coalesce(tr_vec[Loading], Loading),
Girder = coalesce(tr_vec2[Girder], Girder)
)
# # A tibble: 4 x 8
# # Groups: UG, S [1]
# N_l UG S Sample EQ DF Loading Girder
# <dbl> <chr> <dbl> <lgl> <chr> <dbl> <chr> <chr>
# 1 4 100 12 NA Henry 0.799 Single Exterior
# 2 4 100 12 NA Henry 0.799 Multi Interior
# 3 4 100 12 NA Henry 0.799 Multi Exterior
# 4 4 100 12 NA Henry 0.799 Single Interior
The advantage of RonakShah's regex solution is that it can very easily handle many of the types of substrings you appear to need. Regexes do carry a little risk, though, in that they may (unlikely in that answer, but) miss match.
Instead of using str_replace I guess it would be easier to extract what you want using regex.
library(dplyr)
Henry.longer %>%
mutate(Loading = sub('.*\\.', '', meow),
Girder = sub('.*\\.(\\w+)\\..*', '\\1', meow))
where
Loading - removes everything until last dot
Girder - extracts a word between two dots.
Oh boy, looks like you've got some answers here already but here's a super-simple one that uses stringr::str_extract:
Henry.longer <- Henry.longer %>%
mutate(Loading = str_extract(meow, "single|multi")) %>%
mutate(Girder = str_extract(meow, "Interior|Exterior"))
It's worth noting that the demo data has a weird entry for meow in one column, so it didn't run perfectly on my machine:
I am trying to use bind_rows and summarize_if to add a total bottom row in a sample data set. There are different postings that are related to this type of question, but not exactly my issue. Additionally, some posted questions have so much other code and data that I wind up spending more time trying to figure out the code and example instead of how the answer applies more generally.
With that in mind, I have a simple sample data set.
Reproducible example:
library(tidyverse)
library(readxl)
sample_pivot_data <- structure(list(Group = c("A", "B", "A", "A", "A", "B", "B", "B",
"C", "C", "C"), Season = c("Winter", "Summer", "Winter", "Fall",
"Spring", "Winter", "Fall", "Spring", "Winter", "Summer", "Summer"
), Expense = c("Insurance", "Rent", "Utilities", "Misc", "Insurance",
"Rent", "Utilities", "Insurance", "Rent", "Utilities", "Misc"
), Fixed_Variable = c("Fixed", "Fixed", "Variable", "Variable",
"Fixed", "Fixed", "Variable", "Variable", "Fixed", "Variable",
"Variable"), Amount = c(300, 200, 400, 300, 800, 400, 200, 300,
450, 230, 120)), row.names = c(NA, -11L), class = c("tbl_df",
"tbl", "data.frame"))
# A look at the data:
> sample_pivot_data
# A tibble: 11 x 5
Group Season Expense Fixed_Variable Amount
<chr> <chr> <chr> <chr> <dbl>
1 A Winter Insurance Fixed 300
2 B Summer Rent Fixed 200
3 A Winter Utilities Variable 400
4 A Fall Misc Variable 300
5 A Spring Insurance Fixed 800
6 B Winter Rent Fixed 400
7 B Fall Utilities Variable 200
8 B Spring Insurance Variable 300
9 C Winter Rent Fixed 450
10 C Summer Utilities Variable 230
11 C Summer Misc Variable 120
I found a similar problem which was addressed in this post here which gave me this solution which works:
# This works, no syntax issues
my_pivot <- sample_pivot_data %>%
group_by(Group, Fixed_Variable) %>%
summarize(category_total = sum(Amount)) %>%
pivot_wider(names_from = Fixed_Variable, values_from = category_total) %>%
ungroup() %>%
mutate(GrandTotal = rowSums(.[-1])) %>%
bind_rows(summarize_all(.,
funs(if (is.numeric(.))
sum(.)
else
"Grand_Total"))
) %>%
print()
# A tibble: 4 x 4
Group Fixed Variable GrandTotal
<chr> <dbl> <dbl> <dbl>
1 A 1100 700 1800
2 B 600 500 1100
3 C 450 350 800
4 Grand_Total 2150 1550 3700
When I tried to do the same thing, but use summarize_if with the below code, I get an error:
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "function" I looked here as a possible solution to the error, but I didn't follow how this applied in this case.
# This does not work
my_pivot2 <- sample_pivot_data %>%
group_by(Group, Fixed_Variable) %>%
summarize(category_total = sum(Amount)) %>%
pivot_wider(names_from = Fixed_Variable, values_from = category_total) %>%
ungroup() %>%
mutate(GrandTotal = rowSums(.[-1])) %>%
bind_rows(summarize_if(is.numeric, sum, na.rm = TRUE)) %>%
print()
If someone could explain why the above doesn't work I would appreciate it. On a related note, I also tried bind_rows(summarize_all(., list(~if(is.numeric(.)) sum(.) else "Grand_Total" ))) which worked but RStudio kept giving me an indication that parenthesis were unmatched...perhaps a different question, but I figured I would mention rather than posting an entirely separate question.
There is a missing . in summarize_if(). This works fine:
my_pivot2 <- sample_pivot_data %>%
group_by(Group, Fixed_Variable) %>%
summarize(category_total = sum(Amount)) %>%
pivot_wider(names_from = Fixed_Variable, values_from = category_total) %>%
ungroup() %>%
mutate(GrandTotal = rowSums(.[-1])) %>%
bind_rows(summarize_if(., is.numeric, sum, na.rm = TRUE)) %>%
print()
giving:
# A tibble: 4 x 4
Group Fixed Variable GrandTotal
<chr> <dbl> <dbl> <dbl>
1 A 1100 700 1800
2 B 600 500 1100
3 C 450 350 800
4 NA 2150 1550 3700
So I've seen many pages on the generalized version of this issue but here specifically I would like to sum all values in a row after a specific column.
Let's say we have this df:
id city identity q1 q2 q3
0110 detroit ella 2 4 3
0111 boston fitz 0 0 0
0112 philly gerald 3 1 0
0113 new_york doowop 8 11 2
0114 ontario wazaaa NA 11 NA
Now the df's I work with aren't usually with 3 "q" variables, they vary. Hence, I would like to rowSum every row but only sum the rows that are after the column identity.
Rows with NA are to be ignored.
Eventually I would like to take the rows which sum to 0 to be removed and end with a df that looks like this:
id city identity q1 q2 q3
0110 detroit ella 2 4 3
0112 philly gerald 3 1 0
0113 new_york doowop 8 11 2
Doing this in dplyr is the preference but not required.
EDIT:
I have added below the data of which this solution is not working for, apologies for the confusion.
df <- structure(list(Program = c("3002", "111", "2455", "2929", "NA",
"NA", NA), Project_ID = c("299", "11", "271", "780", "207", "222",
NA), Advance_Identifier = c(14, 24, 12, 15, NA, 11, NA), Sequence = c(6,
4, 4, 5, 2, 3, 79), Item = c("payment", "hero", "prepayment_2",
"UPS", "period", "prepayment", "yeet"), q1 = c("500", "12", "-1",
"0", NA, "0", "0"), q2 = c("500", "12", "-1", "0", NA, "0", "1"
), q3 = c("500", "12", "2", "0", NA, "0", "2"), q4 = c("500",
"13", "0", "0", NA, "0", "3")), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
Base R version with zero extra dependencies:
[Edit: I always forget rowSums exists]
> df1$new = rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)
> df1
id city identity q1 q2 q3 new
1 110 detroit ella 2 4 3 9
2 111 boston fitz 0 0 0 0
3 112 philly gerald 3 1 0 4
4 113 new_york doowop 8 11 2 21
If you need to convert chars to numbers, use apply with as.numeric:
df$new = apply(df[,(1+which(names(df)=="Item")):ncol(df),drop=FALSE], 1, function(col){sum(as.numeric(col))})
BUT look out if they are really factors because this will fail, which is why converting things that look like numbers to numbers before you do anything else is a Good Thing.
Benchmark
In case you are worried about speed here's a benchmark test of my function against the currently accepted solution:
akrun = function(df1){df1 %>%
mutate(new = rowSums(select(., ((match('identity', names(.)) +
1):ncol(.))), na.rm = TRUE))}
baz = function(df1){rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)}
sample data
df = data.frame(id=sample(100,100), city=sample(LETTERS,100,TRUE), identity=sample(letters,100,TRUE), q1=runif(100), q2=runif(100),q3=runif(100))
Test - note I remove the new column from the source data frame each time otherwise the code keeps adding one of those into it (although akrun doesn't modify df in place it can get run after baz has modified it by assigning it the new column in the benchmark code).
> microbenchmark({df$new=NULL;df2 = akrun(df)},{df$new=NULL;df$new=baz(df)})
Unit: microseconds
expr min lq mean
{ df$new = NULL df2 = akrun(df) } 1300.682 1328.941 1396.63477
{ df$new = NULL df$new = baz(df) } 63.102 72.721 87.78668
median uq max neval
1376.9425 1398.5880 2075.894 100
84.3655 86.7005 685.594 100
The tidyverse version takes 16 times as long as the base R version.
We can use
out <- df1 %>%
mutate(new = rowSums(select(., ((match('identity', names(.)) +
1):ncol(.))), na.rm = TRUE))
out
# id city identity q1 q2 q3 new
#1 110 detroit ella 2 4 3 9
#2 111 boston fitz 0 0 0 0
#3 112 philly gerald 3 1 0 4
#4 113 new_york doowop 8 11 2 21
and then filter out the rows that have 0 in 'new'
out %>%
filter(new >0)
In the OP's updated dataset, the type of columns are character. We can automatically convert the types to respective types with
df %>%
#type.convert %>% # base R
# or with `readr::type_convert
type_convert %>%
...
NOTE: The OP mentioned in the title and in the description about a tidyverse option. It is not a question about efficiency.
Also, rowSums is a base R option. Here, we showed how to use that in tidyverse chain. I could have written an answer in base R way too earlier with the same option.
If we remove the select, it becomes just a base R i.e
df1$new < rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
Benchmarks
df = data.frame(id=sample(100,100), city=sample(LETTERS,100,TRUE),
identity=sample(letters,100,TRUE), q1=runif(100), q2=runif(100),q3=runif(100))
akrun = function(df1){
rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
}
baz = function(df1){rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)}
microbenchmark({df$new=NULL;df2 = akrun(df)},{df$new=NULL;df$new=baz(df)})
#Unit: microseconds
# expr min lq mean median uq max neval
# { df$new = NULL df2 = akrun(df) } 69.926 73.244 112.2078 75.4335 78.7625 3539.921 100
# { df$new = NULL df$new = baz(df) } 73.670 77.945 118.3875 80.5045 83.5100 3767.812 100
data
df1 <- structure(list(id = 110:113, city = c("detroit", "boston", "philly",
"new_york"), identity = c("ella", "fitz", "gerald", "doowop"),
q1 = c(2L, 0L, 3L, 8L), q2 = c(4L, 0L, 1L, 11L), q3 = c(3L,
0L, 0L, 2L)), class = "data.frame", row.names = c(NA, -4L
))
Similar to akrun you can try
df %>%
mutate_at(vars(starts_with("q")),funs(as.numeric)) %>%
mutate(sum_new = rowSums(select(., starts_with("q")), na.rm = TRUE)) %>%
filter(sum_new>0)
Here i use reduce in purrr to sum rows, it's the fastest way.
library(tidyverse)
data %>% filter_at(vars(starts_with('q')),~!is.na(.)) %>%
mutate( Sum = reduce(select(., starts_with("q")), `+`)) %>%
filter(Sum > 0)