Related
I have a table that has for each survey site and survey date, a total of the number organisms counted, and measurements for each organism found. I would like to make sure that the data is correct by making sure the total organism counted match the total number of measurements taken.
I initially tried to gather the table, changed the values to 1 or 0 if a measurement was taken, and then group_by and summarise. This method didnt work, and I am sure there is a nicer method so any help would be appreciated.
Ideally I would like a table that has site, survey data, total counts and a count column derived from summing the number of measurements taken. The idea would be that the two count columns should have the same values, and hence not be missing data.
Sample data -
structure(list(Date.of.Survey = c("12/04/2022", "16/04/2022",
"12/04/2022", "13/04/2022", "14/04/2022", "15/04/2022"), Location = c("Wandle - Merton Abbey Mills",
"Wandle - Merton Abbey Mills", "Medway - Allington Weir", "Medway - Allington Weir",
"Medway - Allington Weir", "Medway - Allington Weir"), Was.the.trap.working.when.you.checked.it. = c("Yes",
"Yes", "Yes", "Yes", "Yes", "Yes"), Number.of.eels = c(0L, 1L,
0L, 0L, 0L, 20L), X1..Length..mm. = c("", "180", "", "", "",
"72"), X2..Length..mm. = c("", "", "", "", "", "69"), X3..Length..mm. = c("",
"", "", "", "", "76"), X4..Length..mm. = c("", "", "", "", "",
"72"), X5..Length..mm. = c("", "", "", "", "", "72"), X6..Length..mm. = c("",
"", "", "", "", "73"), X7..Length..mm. = c(NA, NA, NA, NA, NA,
77L), X8..Length..mm. = c(NA, NA, NA, NA, NA, 78L), X9..Length..mm. = c(NA,
NA, NA, NA, NA, 75L), X10..Length..mm. = c(NA, NA, NA, NA, NA,
72L), X11..Length..mm. = c(NA, NA, NA, NA, NA, 75L), X12..Length..mm. = c(NA,
NA, NA, NA, NA, 78L), X13..Length..mm. = c(NA, NA, NA, NA, NA,
74L), X14..Length..mm. = c(NA, NA, NA, NA, NA, 70L), X15..Length..mm. = c(NA,
NA, NA, NA, NA, 75L), X16..Length..mm. = c(NA, NA, NA, NA, NA,
75L), X17..Length..mm. = c(NA, NA, NA, NA, NA, 73L), X18..Length..mm. = c(NA,
NA, NA, NA, NA, 72L), X19..Length..mm. = c(NA, NA, NA, NA, NA,
75L), X20..Length..mm. = c(NA, NA, NA, NA, NA, 71L), X21..Length..mm. = c(NA,
NA, NA, NA, NA, NA), X22..Length..mm. = c(NA, NA, NA, NA, NA,
NA), X23..Length..mm. = c(NA, NA, NA, NA, NA, NA), X24..Length..mm. = c(NA,
NA, NA, NA, NA, NA), X25..Length..mm. = c(NA, NA, NA, NA, NA,
NA), X26..Length..mm. = c(NA, NA, NA, NA, NA, NA), X27..Length..mm. = c(NA,
NA, NA, NA, NA, NA), X28..Length..mm. = c(NA, NA, NA, NA, NA,
NA), X29..Length..mm. = c(NA, NA, NA, NA, NA, NA), X30..Length..mm. = c(NA,
NA, NA, NA, NA, NA), X31..Length..mm. = c(NA, NA, NA, NA, NA,
NA), X32..Length..mm. = c(NA, NA, NA, NA, NA, NA), X33..Length..mm. = c(NA,
NA, NA, NA, NA, NA), X34..Length..mm. = c(NA, NA, NA, NA, NA,
NA), X35..Length..mm. = c(NA, NA, NA, NA, NA, NA), X36..Length..mm. = c(NA,
NA, NA, NA, NA, NA), X37..Length..mm. = c(NA, NA, NA, NA, NA,
NA), X38..Length..mm. = c(NA, NA, NA, NA, NA, NA), X39..Length..mm. = c(NA,
NA, NA, NA, NA, NA), X40..Length..mm. = c(NA, NA, NA, NA, NA,
NA), X41..Length..mm. = c(NA, NA, NA, NA, NA, NA), X42..Length..mm. = c(NA,
NA, NA, NA, NA, NA), X43..Length..mm. = c(NA, NA, NA, NA, NA,
NA), X44..Length..mm. = c(NA, NA, NA, NA, NA, NA), X45..Length..mm. = c(NA,
NA, NA, NA, NA, NA), X46..Length..mm. = c(NA, NA, NA, NA, NA,
NA), X47..Length..mm. = c(NA, NA, NA, NA, NA, NA), X48..Length..mm. = c(NA,
NA, NA, NA, NA, NA), X49..Length..mm. = c(NA, NA, NA, NA, NA,
NA), X50..Length..mm. = c(NA, NA, NA, NA, NA, NA)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))```
Thanks in advance
You want to first make sure that your blanks are NAs. Then you can use rowSums to count the number of non-NA columns, and finally use case_when to create a variable to identify whether the count matches the number of measurements. I also recommend using janitor's clean_names function to make it a little easier to work with your variable names.
library(dplyr)
library(janitor)
df <- df %>%
mutate_all(na_if,"") %>%
mutate(count = rowSums(!is.na(select(., 5:50)))) %>%
mutate(count_match = case_when(number_of_eels == count ~1,
TRUE ~0))
I am attempting to output a fairly long gt table using gtsave. It keeps getting truncated, where the last few columns are missing. The scrollbar also shows up in the outputted image.
Here is the outputted table:
I want it to look like my 5% variation table, which is below:
Here is my code for both tables (including R Markdown headers):
3% Table
{r hotspot 3% 100 coverage reads table, fig.dim = c(12, 6)}
setwd(output_hotspot)
hotspot_3pct_table <- hotspot_3pct_longer %>%
ungroup() %>%
gt() %>%
tab_header(title = "Variation Across Lots at Select Base Pairs", subtitle = paste0("At least 100 coverage reads with at least 3% variation")) %>%
cols_label(
lot = "Lot"
) %>%
fmt_number(columns = 2:24, decimals = 2) %>%
fmt_missing(columns = everything(), missing_text = "--") %>%
tab_source_note(source_note = paste0("Dash indicates no variation present in lot")) %>%
tab_options(
table.width = pct(100)
)
gtsave(hotspot_3pct_table, "hotspot_3pct.png")
5% Table
{r hotspot 5% 100 coverage reads table, fig.dim = c(12, 6)}
setwd(output_hotspot)
hotspot_5pct_table <- hotspot_5pct_longer %>%
ungroup() %>%
gt() %>%
tab_header(title = "Variation Across Lots at Select Base Pairs", subtitle = paste0("At least 100 coverage reads with at least 5% variation")) %>%
cols_label(
lot = "Lot"
) %>%
fmt_number(columns = 2:18, decimals = 2) %>%
fmt_missing(columns = everything(), missing_text = "--") %>%
tab_source_note(source_note = paste0("Dash indicates no variation present in lot"))
gtsave(hotspot_5pct_table, "hotspot_5pct.png", expand = 10)
I have tried different fig.dim settings and different expand settings. I haven't encountered this issue before, so I am not sure how to approach this.
Reprex of each dataset:
3% Dataset
hotspot_3pct_longer = structure(list(lot = c("ABL GMP1", "MVS", "Tox Lot", "MVB1",
"MVB2", "CTM2", "CTM1", "Fuji 30k", "Fuji 7.5k"), `12` = c(0.0382775119617225,
0.0390625, 0.034883720930233, NA, NA, NA, NA, NA, NA), `13` = c(0.0588235294117647,
NA, NA, 0.048076923076924, 0.0714285714285714, 0.0417789757412399,
NA, NA, NA), `253` = c(NA, NA, 0.03360709902766, NA, NA, NA,
NA, NA, NA), `1266` = c(0.0646451454923886, NA, NA, NA, NA, NA,
NA, NA, NA), `1820` = c(1, NA, NA, NA, NA, NA, NA, NA, NA), `1821` = c(1,
NA, NA, NA, NA, NA, NA, NA, NA), `2861` = c(0.0434994715017482,
NA, NA, NA, NA, NA, NA, NA, NA), `3031` = c(0.183159188690842,
NA, NA, NA, NA, NA, NA, NA, NA), `3252` = c(0.0521527362955475,
NA, NA, NA, NA, NA, NA, NA, NA), `3368` = c(0.107515576323988,
NA, NA, NA, NA, NA, NA, NA, NA), `3512` = c(0.345980014097939,
NA, 0.064333937531195, NA, NA, 0.0822086320821032, 0.078818748712571,
0.089279658964298, NA), `3527` = c(0.17209788747124, NA, 0.0377838832455329,
NA, NA, 0.0471288691223414, 0.044333490343853, 0.059236465044716,
NA), `3554` = c(0.250983372072233, NA, 0.05112660944206, NA,
NA, 0.0639663737103554, 0.055374526495866, 0.0861535232698471,
0.031875819851334), `4752` = c(NA, NA, 0.04827943749595, NA,
NA, 0.0498005129666572, 0.049766115231033, 0.052495800335974,
0.0519236625723281), `4761` = c(NA, NA, 0.038136808232708, NA,
NA, 0.0317014863319821, 0.036080058906219, 0.034794423440454,
0.033717392388648), `7078` = c(NA, NA, 0.032269021739131, NA,
NA, NA, NA, NA, NA), `7299` = c(0.0830269157229004, 0.083128195417535,
0.361278273500727, 0.375946173254836, 0.216166788588149, 0.110078513058805,
0.10393717387867, 0.355137204850032, 0.310679611650486), `7300` = c(0.0525369400359628,
0.050222762251924, 0.245149911816579, 0.232037691401649, 0.148067737733391,
0.0629358437935844, 0.063008245663919, 0.236435818262021, 0.20123839009288
), `7301` = c(NA, NA, 0.0519736842105269, 0.038054968287527,
NA, NA, NA, 0.037803780378038, 0.034240150093809), `7315` = c(NA,
NA, NA, 0.037735849056604, 0.0406386066763426, NA, NA, NA, 0.036363636363637
), `7318` = c(0.0474754244861484, 0.07482430756511, NA, NA, 0.0369206598586017,
0.0493811726465808, 0.046463780540078, NA, NA), `7319` = c(0.0623240852432649,
0.083063994828701, NA, 0.0326086956521739, 0.058765915768854,
0.0560072267389341, 0.0604447228311939, 0.053601340033501, 0.039495798319328
), `7320` = c(0.0808298755186722, 0.10897808803568, NA, 0.045643153526971,
0.0581113801452785, 0.0764283011729096, 0.081006685017696, 0.04,
0.031458531935177)), row.names = c(NA, -9L), groups = structure(list(
lot = c("ABL GMP1", "CTM1", "CTM2", "Fuji 30k", "Fuji 7.5k",
"MVB1", "MVB2", "MVS", "Tox Lot"), .rows = structure(list(
1L, 7L, 6L, 8L, 9L, 4L, 5L, 2L, 3L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
5% Dataset
hotspot_5pct_longer = structure(list(lot = c("MVB2", "ABL GMP1", "CTM1", "CTM2", "Tox Lot",
"Fuji 30k", "Fuji 7.5k", "MVB1", "MVS"), `13` = c(0.0714285714285714,
0.0588235294117647, NA, NA, NA, NA, NA, NA, NA), `1266` = c(NA,
0.0646451454923886, NA, NA, NA, NA, NA, NA, NA), `1820` = c(NA,
1, NA, NA, NA, NA, NA, NA, NA), `1821` = c(NA, 1, NA, NA, NA,
NA, NA, NA, NA), `3031` = c(NA, 0.183159188690842, NA, NA, NA,
NA, NA, NA, NA), `3252` = c(NA, 0.0521527362955475, NA, NA, NA,
NA, NA, NA, NA), `3368` = c(NA, 0.107515576323988, NA, NA, NA,
NA, NA, NA, NA), `3512` = c(NA, 0.345980014097939, 0.078818748712571,
0.0822086320821032, 0.064333937531195, 0.089279658964298, NA,
NA, NA), `3527` = c(NA, 0.17209788747124, NA, NA, NA, 0.059236465044716,
NA, NA, NA), `3554` = c(NA, 0.250983372072233, 0.055374526495866,
0.0639663737103554, 0.05112660944206, 0.0861535232698471, NA,
NA, NA), `4752` = c(NA, NA, NA, NA, NA, 0.052495800335974, 0.0519236625723281,
NA, NA), `7299` = c(0.216166788588149, 0.0830269157229004, 0.10393717387867,
0.110078513058805, 0.361278273500727, 0.355137204850032, 0.310679611650486,
0.375946173254836, 0.083128195417535), `7300` = c(0.148067737733391,
0.0525369400359628, 0.063008245663919, 0.0629358437935844, 0.245149911816579,
0.236435818262021, 0.20123839009288, 0.232037691401649, 0.050222762251924
), `7301` = c(NA, NA, NA, NA, 0.0519736842105269, NA, NA, NA,
NA), `7318` = c(NA, NA, NA, NA, NA, NA, NA, NA, 0.07482430756511
), `7319` = c(0.058765915768854, 0.0623240852432649, 0.0604447228311939,
0.0560072267389341, NA, 0.053601340033501, NA, NA, 0.083063994828701
), `7320` = c(0.0581113801452785, 0.0808298755186722, 0.081006685017696,
0.0764283011729096, NA, NA, NA, NA, 0.10897808803568)), row.names = c(NA,
-9L), groups = structure(list(lot = c("ABL GMP1", "CTM1", "CTM2",
"Fuji 30k", "Fuji 7.5k", "MVB1", "MVB2", "MVS", "Tox Lot"), .rows = structure(list(
2L, 3L, 4L, 6L, 7L, 8L, 1L, 9L, 5L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -9L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Pass the options to webshot() by using vwidth and vheight. Refer the documentation,
https://www.rdocumentation.org/packages/webshot/versions/0.5.2/topics/webshot
gtsave(hotspot_3pct_table, "hotspot_3pct.png", vwidth = 1500, vheight = 1000)
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 4 years ago.
Improve this question
I have a data frame like this:
df <- data.frame(stringsAsFactors=FALSE,
member = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 3L, 5L),
q_c3_1 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A"),
q_c4_1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
q_c5_1 = c(1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L,
1900L),
q_c6_1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
q_c7_1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
q_c3_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c4_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c5_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c6_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c7_2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c3_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c4_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c5_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c6_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c7_3 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c3_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c4_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c5_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c6_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c7_4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c3_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c4_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c5_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c6_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
q_c7_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
)
base on member variable, I need to fill corresponding variables with dummy data.For example if member = 2 then q_c3_2:q_c7_2 should have dummy values --> q_c3 = some character like "Arne", q_c4 with 1 and q_c5 with 1900 and q_c6 and q_c7 with 0 , if member == 3 then q_c3_2:q_c7_2 and q_c3_3:q_c7_3 should have dummy values (same as dummy values as above) and so on. How may i do this and efficiently with tidyverse? Thanks
My desire output shall be like this data frame
df2 <- data.frame(stringsAsFactors=FALSE,
member = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 4L, 3L, 5L),
q_c3_1 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A"),
q_c4_1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
q_c5_1 = c(1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L, 1900L,
1900L),
q_c6_1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
q_c7_1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
q_c3_2 = c(NA, NA, "Arne", NA, NA, NA, NA, "Arne", "Arne", "Arne"),
q_c4_2 = c(NA, NA, 1L, NA, NA, NA, NA, 1L, 1L, 1L),
q_c5_2 = c(NA, NA, 1900L, NA, NA, NA, NA, 1900L, 1900L, 1900L),
q_c6_2 = c(NA, NA, 0L, NA, NA, NA, NA, 0L, 0L, 0L),
q_c7_2 = c(NA, NA, 0L, NA, NA, NA, NA, 0L, 0L, 0L),
q_c3_3 = c(NA, NA, NA, NA, NA, NA, NA, "Arne", "Arne", "Arne"),
q_c4_3 = c(NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L),
q_c5_3 = c(NA, NA, NA, NA, NA, NA, NA, 1900L, 1900L, 1900L),
q_c6_3 = c(NA, NA, NA, NA, NA, NA, NA, 0L, 0L, 0L),
q_c7_3 = c(NA, NA, NA, NA, NA, NA, NA, 0L, 0L, 0L),
q_c3_4 = c(NA, NA, NA, NA, NA, NA, NA, "Arne", NA, "Arne"),
q_c4_4 = c(NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L),
q_c5_4 = c(NA, NA, NA, NA, NA, NA, NA, 1900L, NA, 1900L),
q_c6_4 = c(NA, NA, NA, NA, NA, NA, NA, 0L, NA, 0L),
q_c7_4 = c(NA, NA, NA, NA, NA, NA, NA, 0L, NA, 0L),
q_c3_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, "Arne"),
q_c4_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L),
q_c5_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1900L),
q_c6_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L),
q_c7_5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L)
)
With the assumption that it does not matter what the dummy variables are and using dplyr:
library(dplyr)
temp <- df %>%
melt(id.vars = "member") %>%
mutate(compare = as.numeric(gsub("q_c\\d_(\\d)", "\\1", variable))) %>%
filter(compare <= member) %>%
mutate(value = "dummy",
compare = NULL) %>%
unique() %>%
spread(variable, value)
df <- df %>%
select(member) %>%
left_join(., temp, by = "member")
Edit: With dummy variables as requested.
library(dplyr)
temp <- df %>%
melt(id.vars = "member") %>%
mutate(compare = as.numeric(gsub("q_c\\d_(\\d)", "\\1", variable)),
dummy_match = as.numeric(gsub("q_c(\\d)_\\d", "\\1", variable))) %>%
filter(compare <= member) %>%
mutate(value = case_when(dummy_match == 4 ~ 1,
dummy_match == 5 ~ 1900,
dummy_match >= 6 ~ 0,
T ~ 9999),
compare = NULL,
dummy_match = NULL) %>%
unique() %>%
spread(variable, value)
df <- df %>%
select(member) %>%
left_join(., temp, by = "member")
df[df == 9999] <- "Arne"
I've encountered a rather puzzling tidyr::spread() error. When I tried to run code (example below) in the full dataframe, I got the "Duplicate identifiers for rows" error.
I subset the (very large) dataframe to investigate and re-ran the code. This time it worked (see subset1df dput below). Then I tried it again with a different subset (subset2df), and I got the error message again. I honestly have no idea how to make sense of this and will greatly appreciate any help.
Reproducible code below:
subset1df:
structure(list(v1 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 1L, NA, NA, NA, NA), .Label = "2", class = "factor"),
v2 = structure(c(1L, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 1L, NA, NA), .Label = "2", class = "factor"),
v3 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), .Label = character(0), class = "factor"),
v4 = structure(c(NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v5 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA,
NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v6 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 1L), .Label = "2", class = "factor"),
v7 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), .Label = character(0), class = "factor"),
v8 = structure(c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), .Label = character(0), class = "factor"),
v9 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
1L, NA, NA, NA, NA, NA), .Label = "1", class = "factor"),
v10 = structure(c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_), .Label = character(0), class = "factor"),
v11 = structure(c(NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v12 = structure(c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v13 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L,
NA, NA, NA, NA, NA, NA), .Label = "1", class = "factor"),
v14 = structure(c(NA, NA, NA, NA, NA, NA, 2L, NA, NA, NA,
NA, NA, 1L, NA, NA, NA), .Label = c("1", "2"), class = "factor"),
v15 = structure(c(NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v16 = structure(c(NA, NA, 2L, NA, NA, 2L, NA, NA, NA, NA,
NA, NA, NA, NA, 1L, NA), .Label = c("1", "2"), class = "factor"),
respondentID = structure(c(7L, 7L, 7L, 5L, 6L, 6L, 4L, 4L,
4L, 3L, 3L, 3L, 2L, 2L, 2L, 1L), .Label = c("EO15", "EO17",
"EO19", "EO21", "Eo23", "EO23", "EO24"), class = "factor")), .Names = c("v1",
"v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10", "v11",
"v12", "v13", "v14", "v15", "v16", "respondentID"), row.names = c(NA,
-16L), class = "data.frame")
subset2df:
structure(list(v2 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 1L, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v4 = structure(c(NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v5 = structure(c(NA, NA, NA, 2L, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA), .Label = c("1",
"2"), class = "factor"), v6 = structure(c(NA, 1L, NA, NA,
2L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA), .Label = c("1", "2"), class = "factor"),
v9 = structure(c(NA, NA, NA, NA, NA, 1L, NA, NA, NA, 1L,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v11 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA,
NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "1", class = "factor"),
v12 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L, NA), .Label = c("1",
"2"), class = "factor"), v13 = structure(c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA,
NA, NA, 2L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 2L, NA, NA, NA), .Label = c("1", "2"), class = "factor"),
v14 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v15 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
1L, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
v16 = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Label = "2", class = "factor"),
respondentID = structure(c(21L, 20L, 20L, 19L, 18L, 18L,
1L, 1L, 16L, 16L, 16L, 10L, 10L, 17L, 15L, 15L, 15L, 14L,
14L, 14L, 13L, 12L, 12L, 11L, 11L, 11L, 8L, 9L, 9L, 6L, 7L,
7L, 3L, 2L, 3L, 4L, 4L, 4L, 5L), .Label = c("EO11", "Eo14",
"EO14", "EO16", "EO18", "EO26", "EO27", "Eo28", "EO28", "EO3",
"Eo30", "EO32", "EO331", "EO35", "EO37", "EO4", "EO41", "EO6",
"EO6 ", "EO7", "EO7 "), class = "factor")), .Names = c("v2",
"v4", "v5", "v6", "v9", "v11", "v12", "v13", "v14", "v15", "v16",
"respondentID"), row.names = c(NA, -39L), class = "data.frame")
combodf_id (needed to execute code):
structure(list(color1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("red", "ruby"
), class = "factor"), color2 = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("blue",
"violet"), class = "factor"), color3 = structure(c(1L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("green",
"turqoise"), class = "factor"), color4 = structure(c(2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("black",
"yellow"), class = "factor"), combo = c("v1", "v2", "v3", "v4",
"v5", "v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14",
"v15", "v16")), class = "data.frame", .Names = c("color1", "color2",
"color3", "color4", "combo"), row.names = c(NA, -16L))
code:
result_df <- subset1df %>%
gather(key = combo, value = val, -respondentID) %>%
filter(!is.na(val)) %>%
left_join(combodf_id, by = "combo") %>%
arrange(respondentID) %>%
rename(rose_color1 = color1, rose_color2 = color2,
tulip_color1 = color3, tulip_color2 = color4) %>%
gather(color, value, rose_color1:tulip_color2) %>%
separate(color, into = c('flower', 'color')) %>%
spread(color, value) %>%
mutate(val = if_else(val == 1, 'rose', 'tulip')) %>%
mutate(val = if_else(val == flower, 1, 0)) %>%
select(respondentID, flower, color1, color2, choice = val)
The solution by #Tung below is very similar to the one in Spread with duplicate identifiers (using tidyverse and %>%)
, but neither of them quite solves the problem.
In your subset2df...
you have removed some of the response columns (e.g. v1, v3, v7, etc.)... that's probably why you have a bunch of rows/respondants without any response (all NA)
your respondentID column has levels with whitespace in some values that will probably mess things up later on (eg. "EO7" and "EO7 ")
there are duplicated rows, e.g. subset2df[c(15, 17), ]
all of your columns are factors... particularly with the response columns with integer values, I find that strange. tidyr functions gather and spread will take into account the levels of a factor, which can seem especially strange when you have subset a factor variable and not all of the levels are represented in the data.
You should probably fix those problems first, because they are what likely lead to your problems later, however... the reason you get the error "Duplicate identifiers for rows" is because the data frame you're passing to spread(color, value) has duplicate rows. You can force this to work by adding distinct() %>% one line before it, but be aware that the only reason you have to do that is because of the other problems before.
subset2df %>%
as_tibble() %>%
mutate_at(vars(v2:v16), as.integer) %>%
gather(key = combo, value = val, -respondentID, na.rm = T) %>%
filter(!is.na(val)) %>%
left_join(combodf_id, by = "combo") %>%
arrange(respondentID) %>%
rename(rose_color1 = color1, rose_color2 = color2,
tulip_color1 = color3, tulip_color2 = color4) %>%
gather(color, value, rose_color1:tulip_color2) %>%
separate(color, into = c('flower', 'color')) %>%
distinct() %>%
spread(color, value) %>%
mutate(val = if_else(val == 1, 'rose', 'tulip')) %>%
mutate(val = if_else(val == flower, 1, 0)) %>%
select(respondentID, flower, color1, color2, choice = val)
I would strongly suggest fixing all the above problems first though, like so (notice you won't need the distinct command further down in the chain because you will have already applied that to the original data)...
subset2df %>%
as_tibble() %>% # tibble has better printing methods
mutate_at(vars(-respondentID), as.integer) %>% # convert response to numeric
mutate(respondentID = as.character(respondentID)) %>% # convert to char
mutate(respondentID = trimws(respondentID)) %>% # remove whitespace
distinct() %>% # remove duplicate rows
gather(key = combo, value = val, -respondentID, na.rm = T) %>%
left_join(combodf_id, by = "combo") %>%
mutate_at(vars(color1:color4), as.character) %>% # convert colors to char
rename(rose_color1 = color1, rose_color2 = color2,
tulip_color1 = color3, tulip_color2 = color4) %>%
gather(color, value, rose_color1:tulip_color2) %>%
separate(color, into = c('flower', 'color')) %>%
spread(color, value) %>%
mutate(val = if_else(val == 1, 'rose', 'tulip')) %>%
mutate(val = if_else(val == flower, 1L, 0L)) %>%
select(respondentID, flower, color1, color2, choice = val)
For the following dataframe:
eu <- structure(list(land = structure(c(1L, 4L, 5L, 12L, 9L, 13L, 16L, 18L, 27L, 10L, 25L, 21L, 28L, 19L, 8L, 26L, 6L, 3L, 15L, 14L, 11L, 17L, 20L, 23L, 24L, 2L, 22L, 7L), .Label = c("Belgie", "Bulgarije", "Cyprus", "Denemarken", "Duitsland", "Estland", "Europese Unie", "Finland", "Frankrijk", "Griekenland", "Hongarije", "Ierland", "Italie", "Letland", "Litouwen", "Luxemburg", "Malta", "Nederland", "Oostenrijk", "Polen", "Portugal", "Roemenie", "Slovenie", "Slowakije", "Spanje", "Tsjechie", "Verenigd Koninkrijk", "Zweden"), class = "factor"), `1979` = c(91.36, 47.82, 65.73, 63.61, 60.71, 85.65, 88.91, 58.12, 32.35, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 61.99), `1981` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 81.48, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), `1984` = c(92.09, 52.38, 56.76, 47.56, 56.72, 82.47, 88.79, 50.88, 32.57, 80.59, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 58.98), `1987` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 68.52, 72.42, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), `1989` = c(90.73, 46.17, 62.28, 68.28, 48.8, 81.07, 87.39, 47.48, 36.37, 80.03, 54.71, 51.1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 58.41), `1994` = c(90.66, 52.92, 60.02, 43.98, 52.71, 73.6, 88.55, 35.69, 36.43, 73.18, 59.14, 35.54, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 56.67), `1995` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 41.63, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), `1996` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 67.73, 57.6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), `1999` = c(91.05, 50.46, 45.19, 50.21, 46.76, 69.76, 87.27, 30.02, 24, 70.25, 63.05, 39.93, 38.84, 49.4, 30.14, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 49.51), `2004` = c(90.81, 47.89, 43, 58.58, 42.76, 71.72, 91.35, 39.26, 38.52, 63.22, 45.14, 38.6, 37.85, 42.43, 39.43, 28.3, 26.83, 72.5, 48.38, 41.34, 38.5, 82.39, 20.87, 28.35, 16.97, NA, NA, 45.47), `2007` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 29.22, 29.47, NA), `2009` = c(90.39, 59.54, 43.3, 58.64, 40.63, 65.05, 90.75, 36.75, 34.7, 52.61, 44.9, 36.78, 45.53, 45.97, 40.3, 28.2, 43.9, 59.4, 20.98, 53.7, 36.31, 78.79, 24.53, 28.33, 19.64, 38.99, 27.67, 43), inwoners = c(11161642, 5602628, 80523746, 4591087, 65578819, 59685227, 537039, 16779575, 63896071, 11062508, 46727890, 10487289, 9555893, 8451860, 5426674, 10516125, 1320174, 865878, 2971905, 2023825, 9908798, 421364, 38533299, 2058821, 5410836, 7284552, 20020074, 501403599), plicht = structure(c(1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("ja", "nee"), class = "factor")), .Names = c("land", "1979", "1981", "1984", "1987", "1989", "1994", "1995", "1996", "1999", "2004", "2007", "2009", "inwoners", "plicht"), row.names = c(NA, -28L), class = "data.frame")
I need conditional column means. I can do that with:
verplicht <- c("Europese Unie (stemplicht)", colMeans(eu[eu$plicht=="ja",c(2:13)], na.rm=TRUE), NA)
vrij <- c("Europese Unie (geen stemplicht)", colMeans(eu[eu$plicht=="nee",c(2:13)], na.rm=TRUE), NA)
eu2 <- rbind(eu, verplicht, vrij)
However, I need weighted column means with country population (the inwoners column) as the weights. I tried to that with:
verplicht <- c("Europese Unie (stemplicht)", lapply(eu[eu$plicht=="ja",c(2:13)], weighted.mean(x, eu[eu$plicht=="ja",14], na.rm=TRUE)), NA)
but that resulted in the following error:
Error in weighted.mean.default(x, eu[eu$plicht == "ja", 14], na.rm = TRUE) :
'x' and 'w' must have the same length
I understand what the error-message is saying, but don't know how to solve this. Any suggestions?
The problem is with how you're using lapply. Here's the correct code:
lapply(eu[eu$plicht=='ja',2:13], weighted.mean, eu[eu$plicht=='ja','inwoners'], na.rm=TRUE)
lapply(eu[eu$plicht=='nee',2:13], weighted.mean, eu[eu$plicht=='nee','inwoners'], na.rm=TRUE)
Notice how weighted.mean is used as an argument, rather than inside an anonymous function with x as an argument. You could equivalently do:
lapply(eu[eu$plicht=='ja',2:13], function(x) weighted.mean(x, eu[eu$plicht=='ja','inwoners'], na.rm=TRUE))
lapply(eu[eu$plicht=='nee',2:13], function(x) weighted.mean(x, eu[eu$plicht=='nee','inwoners'], na.rm=TRUE))
But you're currently kind of mixing the two different ways of using lapply.
If inwoners is the population, then
> (weights <- with(eu, inwoners/sum(inwoners)))
# [1] 0.0111303968 0.0055869443 0.0802983327 0.0045782350 0.0653952416
# [6] 0.0595181478 0.0005355356 0.0167326033 0.0637172042 0.0110315403
# [11] 0.0465970828 0.0104579315 0.0095291428 0.0084282004 0.0054114829
# [16] 0.0104866868 0.0013164784 0.0008634541 0.0029635856 0.0020181596
# [21] 0.0098810599 0.0004201845 0.0384254312 0.0020530577 0.0053956892
# [26] 0.0072641601 0.0199640310 0.5000000000
and the weighted mean of the 2004 column, for example, is
> weighted.mean(eu$`2004`, w = weights, na.rm = TRUE)
# [1] 45.31782
To get the weighted mean of each of the year columns for when plicht == 'ja',
> s <- subset(eu, plicht == "ja")
> w2 <- weights[as.numeric(rownames(s))]
> newDF <- do.call(rbind, lapply(2:13, function(i){
data.frame(wtMean.ja = weighted.mean(s[,i], w = w2, na.rm = TRUE))
}))
> rownames(newDF) <- names(s)[2:13]
> newDF
# wtMean.ja
# 1979 86.56735
# 1981 81.48000
# 1984 83.56127
# 1987 68.52000
# 1989 72.30636
# 1994 69.86950
# 1995 NaN
# 1996 NaN
# 1999 69.28708
# 2004 63.17060
# 2007 NaN
# 2009 58.99465