Weighted mean calculation in R with missing values

Does anyone know if it is possible to calculate a weighted mean in R when some values are missing, such that the weights for the values that are present are scaled up proportionately?
To convey this clearly, I created a hypothetical scenario. It illustrates the root of the question: the scaling factor needs to be adjusted for each row, depending on which values are missing.
[Image: weighted mean calculation worked out in Excel]

Using weighted.mean from the base stats package with the argument na.rm = TRUE should get you the result you need. Here is a tidyverse way to do it:
library(tidyverse)

scores <- tribble(
  ~student, ~test1, ~test2, ~test3,
  "Mark",       90,     91,     92,
  "Mike",       NA,     79,     98,
  "Nick",       81,     NA,     83
)

weights <- tribble(
  ~test,   ~weight,
  "test1",     0.2,
  "test2",     0.4,
  "test3",     0.4
)

scores %>%
  gather(test, score, -student) %>%
  left_join(weights, by = "test") %>%
  group_by(student) %>%
  summarise(result = weighted.mean(score, weight, na.rm = TRUE))
#> # A tibble: 3 x 2
#>   student   result
#>   <chr>      <dbl>
#> 1 Mark    91.20000
#> 2 Mike    88.50000
#> 3 Nick    82.33333
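As a quick check that na.rm = TRUE really rescales the remaining weights proportionately, here is Mike's row done by hand (a minimal sketch using the same scores and weights):

weighted.mean(c(NA, 79, 98), c(0.2, 0.4, 0.4), na.rm = TRUE)
#> [1] 88.5
# identical to (0.4 * 79 + 0.4 * 98) / (0.4 + 0.4)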

The best way to post an example dataset is to use dput(head(dat, 20)), where dat is the name of a dataset. Graphic images are a really bad choice for that.
DATA.
dat <-
structure(list(Test1 = c(90, NA, 81), Test2 = c(91, 79, NA),
Test3 = c(92, 98, 83)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
w <-
structure(list(Test1 = c(18, NA, 27), Test2 = c(36.4, 39.5, NA
), Test3 = c(36.8, 49, 55.3)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
CODE.
You can use the function weighted.mean from the base stats package together with sapply for this. Note that if your datasets of scores and weights are R objects of class matrix, you will not need unlist.
sapply(seq_len(nrow(dat)), function(i) {
  weighted.mean(unlist(dat[i, ]), unlist(w[i, ]), na.rm = TRUE)
})
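As noted above, if the scores and weights are stored as matrices rather than data frames, the unlist() calls can be dropped. A minimal sketch, assuming dat and w are simply converted with as.matrix():

dat_m <- as.matrix(dat)
w_m   <- as.matrix(w)
sapply(seq_len(nrow(dat_m)), function(i) {
  weighted.mean(dat_m[i, ], w_m[i, ], na.rm = TRUE)  # matrix rows are already numeric vectors
})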

Related

Running my multivariable function for several vectors?

I have the following dataframe with 6 columns and several thousand rows.
Example: [screenshot of example data]
Each column represents a different timepoint: 0, 1, 3, 6, 9, 12. I want to calculate the area under the curve for each row of values.
For example, for row 1 I would use the following function from the DescTools package:
x <- c(0, 1, 3, 6, 9, 12)
y <- c(130, 125, 120, 115, 108, 115)
AUC(x, y, method = "linear", na.rm = FALSE)
Is there a way to create a new variable which is the AUC for each row from my dataframe?
Thanks!
We can use apply with MARGIN = 1 to compute the AUC row by row:
library(DescTools)

i1 <- rowSums(!is.na(df1)) > 2                 # only rows with more than 2 non-missing values
df1$AUC[i1] <- apply(df1[i1, ], 1, FUN = function(y)
  AUC(x, y, method = "linear", na.rm = FALSE))
df1$AUC[i1]
[1] 1394 1518
data
df1 <- structure(list(col1 = c(130, 140), col2 = c(125, 137), col3 = c(120,
125), col4 = c(115, 120), col5 = c(108, 125), col6 = c(115, 130
)), class = "data.frame", row.names = c(NA, -2L))
x <- c(0,1,3,6,9,12)
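For completeness, a dplyr alternative is to work rowwise() with c_across(). This is a sketch under the same assumption as the apply() answer above, i.e. an AUC is only computed for rows with more than 2 non-missing values:

library(dplyr)
library(DescTools)

df1 %>%
  rowwise() %>%
  mutate(AUC = if (sum(!is.na(c_across(col1:col6))) > 2) {
    DescTools::AUC(x, c_across(col1:col6), method = "linear", na.rm = FALSE)
  } else {
    NA_real_
  }) %>%
  ungroup()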

Removing duplicate rows from one column without losing data from another column

I apologize if this has already been asked, but the solutions I came across didn't seem to work for me.
I have a data set that was initially multiple Excel sheets containing different variables for the same subjects. I was able to import the data into R and combine it into a single data frame using:
library(readxl)
library(plyr)    # for rbind.fill()
library(tibble)

x1_data <- "/data.xlsx"
excel_sheets(path = x1_data)
tab_names <- excel_sheets(path = x1_data)
list_all <- lapply(tab_names, function(x)
  read_excel(path = x1_data, sheet = x))
str(list_all)
df <- rbind.fill(list_all)
df <- as_tibble(df)
However, I now have many duplicate rows for each subject, as each sheet was essentially added beneath the preceding sheet. Something like this:
Sheet 1
ID: 1,2
Age: 32, 29
Sex: M, F
Sheet 2
ID: 1, 2
Weight: 75, 89
Height: 157, 146
Combined
ID: 1, 2, 1, 2
Age: 32, 29, NA, NA
Sex: M, F, NA, NA
Weight: NA, NA, 75, 89
Height: NA, NA, 157, 146
I can't seem to figure out how to delete the duplicate ID rows without losing the data in the columns that belong to those rows. I tried aggregate and group_by without success. What I am after is this:
Combined
ID: 1, 2
Age: 32, 29
Sex: M, F
Weight: 75, 89
Height: 157, 146
Any help would be appreciated. Thanks.
Here's a possible solution:
library(tidyverse)

df <- tibble(ID     = c(1, 2, 1, 2),
             Age    = c(32, 29, NA, NA),
             Sex    = c("M", "F", NA, NA),
             Weight = c(NA, NA, 75, 89),
             Height = c(NA, NA, 157, 146))

df1 <- df %>% filter(is.na(Age))  %>% select(ID, Weight, Height)
df2 <- df %>% filter(!is.na(Age)) %>% select(ID, Age, Sex)
df.merged <- df2 %>% left_join(df1, by = "ID")
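If there are more than two sheets (and therefore more than one pattern of missing columns), a more general sketch is to collapse each ID to its first non-missing value in every column. This assumes each subject has at most one non-NA value per column:

library(dplyr)

df %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ first(na.omit(.x))), .groups = "drop")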
For future questions, please provide an already formatted sample of your data; it makes it much easier to work with.

Reshape from long to wide according to the number of occurrence of one variable

I have a dataframe that looks like this:
df1<-structure(list(person = c("a", "a", "a", "a", "b", "b", "b",
"c"), visitID = c(123, 123, 256, 816, 237, 828, 828, 911), v1 = c(10,
5, 15, 8, 95, 41, 31, 16), v2 = c(8, 72, 29, 12, 70, 23, 28,
66), v3 = c(0, 1, 0, 0, 1, 1, 0, 1)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
Here person is the name/ID of the person and visitID is a number generated for each visit. Each visit may have one or multiple entries for the variables (v1, v2, v3). My problem is that I'm trying to transform this structure so that each person is aggregated into a single row, with the visits and the variables spread wide, to look like:
df2<-structure(list(person = c("a", "b", "c"), visit1 = c(123, 237,
911), visit2 = c(256, 828, NA), visit3 = c(816, NA, NA), v1.visit1 = c("10,5",
"95", "16"), v1.visit2 = c("15", "41,31", NA), v1.visit3 = c("8",
NA, NA), v2.visit1 = c("8,72", "70", "66"), v2.visit2 = c("29",
"23,28", NA), v1.visit3 = c("12", NA, NA), v3.visit1 = c("0,1",
"1", "1"), v3.visit2 = c("0", "1,0", NA), v3.visit3 = c("0",
NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
Methods I have tried so far:
Method 1:
1. Aggregate according to "person", with all other variables collapsed into comma-separated strings.
2. Split those variables into multiple columns.
The problem with this method is that I would then not know which value corresponds to which visit, especially since some visits may have multiple entries and some may not.
Method 2:
1. Count the number of rows for each visitID and take the maximum number of visits per unique person (3 in the case above).
2. Create 3 columns for each variable.
I didn't know how to proceed from here.
I found a great answer in the thread Reshape three column data frame to matrix ("long" to "wide" format),
so I tried working with reshape and pivot_wider but couldn't get it to work.
Any ideas are appreciated, even if they do not lead to exactly the same output.
Thank you
You can try something like this:
library(tidyverse)

df1 %>%
  group_by(person, visitID) %>%
  summarise(across(matches("v[0-9]+"), list)) %>%
  group_by(person) %>%
  mutate(visit = seq_len(n()) %>% str_c("visit.", .)) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = person,
    names_from = visit,
    values_from = c("visitID", matches("v[0-9]+"))
  )
Replace list with ~str_c(.x, collapse = ",") if you want the values as comma-separated character strings (as in the target df2) rather than list-columns.
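Another sketch that reaches a similar wide layout in a single pivot_wider() call is to number the visits per person first and then let values_fn collapse repeated cells. The resulting column names (e.g. visitID_visit1) differ slightly from the target df2, but the structure is the same:

# uses the tidyverse packages loaded above
df1 %>%
  group_by(person) %>%
  mutate(visit = paste0("visit", match(visitID, unique(visitID)))) %>%
  ungroup() %>%
  pivot_wider(
    id_cols     = person,
    names_from  = visit,
    values_from = c(visitID, v1, v2, v3),
    values_fn   = list(
      visitID = function(x) unique(x),            # one ID per visit
      v1 = function(x) paste(x, collapse = ","),  # collapse repeated rows
      v2 = function(x) paste(x, collapse = ","),
      v3 = function(x) paste(x, collapse = ",")
    )
  )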

Conditionally replace values across multiple columns based on string match in a separate column

I'm trying to conditionally replace values in multiple columns based on a string match in a different column, and I'd like to do so in a single line of code using the across() function, but I keep getting errors that don't quite make sense to me. I feel like this probably has a simple solution, so if anyone could point me in the right direction, that would be fantastic!
df <- data.frame("type" = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
"total" = c(34, 56, 75, 89, 21, 56),
"group_a" = c(30, 26, 45, 60, 3, 46),
"group_b" = c(4, 30, 30, 29, 18, 10))
# working but not concise
df %>%
  mutate(total   = ifelse(str_detect(type, "Park"), NA, total),
         group_a = ifelse(str_detect(type, "Park"), NA, group_a),
         group_b = ifelse(str_detect(type, "Park"), NA, group_b))
# concise but not working
df %>% mutate(across(total, group_a, group_b), ifelse(str_detect(type, "Park"), NA, .))
Update
We got a solution that works with my dummy dataset but is not working with my real data, so I am going to share a small snippet of my real data frame with the numbers changed and organization names hidden. When I run this line of code (df %>% mutate(across(c(Attempts, Canvasses, Completes)), ~ifelse(str_detect(long_name, "park-cemetery"), NA, .))) on these data, I get the following error message:
Error: Problem with `mutate()` input `..2`.
x Input `..2` must be a vector, not a `formula` object.
i Input `..2` is `~ifelse(str_detect(long_name, "park-cemetery"), NA, .)`.
This is a small sample of the data that produces this error:
df <- structure(list(Org = c("OrgName", "OrgName", "OrgName", "OrgName",
"OrgName", "OrgName", "OrgName", "OrgName", "OrgName", "OrgName"
), nCode = c("M34", "R36", "R46", "X29", "M31", "K39", "Q12",
"Q39", "X41", "K27"), Attempts = c(100, 100, 100, 100, 100, 100,
100, 100, 100, 100), Canvasses = c(80, 80, 80, 80, 80, 80, 80,
80, 80, 80), Completes = c(50, 50, 50, 50, 50, 50, 50, 50, 50,
50), van_nocc_id = c(999, 999, 999, 999, 999, 999, 999, 999,
999, 999), van_name = c("M-Upper West Side", "SI-Rosebank", "SI-Tottenville",
"BX-park-cemetery-etc-Bronx", "M-Stuyvesant Town-Cooper Village",
"BK-Kensington", "Q-Broad Channel", "Q-Lindenwood", "BX-Wakefield",
"BK-East New York"), boro_short = c("M", "SI", "SI", "BX", "M",
"BK", "Q", "Q", "BX", "BK"), long_name = c("Upper West Side",
"Rosebank", "Tottenville", "park-cemetery-etc-Bronx", "Stuyvesant Town-Cooper Village",
"Kensington", "Broad Channel", "Lindenwood", "Wakefield", "East New York"
)), row.names = c(NA, -10L), class = "data.frame")
Final update
The curse of the misplaced closing bracket! Thanks to everyone for your help... the correct solution was df %>% mutate(across(c(Attempts, Canvasses, Completes), ~ifelse(str_detect(long_name, "park-cemetery"), NA, .)))
If you use the newly introduced across() (which is the correct way to approach this task), you have to specify the function you want to apply inside across() itself. In this case the ifelse(...) call has to be a purrr-style lambda (so starting with ~). Check out the across() documentation and look for the arguments .cols and .fns.
df %>%
mutate(across(c(total, group_a, group_b), ~ifelse(str_detect(type, "Park"), NA, .)))
Output
# type total group_a group_b
# 1 Park NA NA NA
# 2 Neighborhood 56 26 30
# 3 Airport 75 45 30
# 4 Park NA NA NA
# 5 Neighborhood 21 3 18
# 6 Neighborhood 56 46 10
Here is a data.table solution.
library(data.table)

df <- data.frame(type    = c("Park", "Neighborhood", "Airport", "Park", "Neighborhood", "Neighborhood"),
                 total   = c(34, 56, 75, 89, 21, 56),
                 group_a = c(30, 26, 45, 60, 3, 46),
                 group_b = c(4, 30, 30, 29, 18, 10))

setDT(df)
df[type == "Park", c("total", "group_a", "group_b") := NA]
Update: that didn't take long to figure out! Just needed to place the columns in a vector:
# concise AND working!
df %>% mutate(across(c(total, group_a, group_b)), ifelse(str_detect(type, "Park"), NA, .))
I had tried this initially but placed the columns in quotes... don't do that :)
(As the final update above explains, the closing bracket here is still misplaced; the ifelse() lambda has to sit inside across(), as in the answer above.)

R: wilcoxon test Error: Grouping Factor must have exactly 2 levels

I read a couple of related SO questions but still could not solve my problem. I have the following data.table, which includes only a few of the columns and rows of my full data.table.
library(data.table)
structure(list(Patient = c("MB108", "MB108", "MB108", "MB108",
"MB108", "MB108", "MB108", "MB108", "MB108", "MB108"), Visit = c(1,
1, 1, 1, 9, 9, 9, 9, 12, 12), Stimulation = c("NC", "SEB", "PPD",
"E6C10", "NC", "SEB", "PPD", "E6C10", "NC", "SEB"), `CD38 ` = c(83.3,
63.4, 83.2, 91.5, 90.9, 70.9, 71, 88.4, 41.7, 47.9)), .Names = c("Patient",
"Visit", "Stimulation", "CD38 "), class = c("data.table", "data.frame"
), row.names = c(NA, -10L), .internal.selfref = <pointer: 0x102806578>)
I would like to run a Wilcoxon test on column 4 (CD38), comparing Visit 1 and Visit 9.
I checked for NAs as well as the length of both columns.
Thanks for any help!
#na.omit(boolean_dt3)
#print(length(unlist(boolean_dt3[Visit== 1,4, with = FALSE])))
#print(length(unlist(boolean_dt3[Visit== 9,4, with = FALSE])))
wilcox.test( unlist(boolean_dt3[Visit== 1,4, with = FALSE])~ unlist(boolean_dt3[Visit== 9,4, with = FALSE]) , paired = T, correct=FALSE)
I just figured out that using a comma (,) instead of the tilde (~) works for my problem.
Here's how to perform a Wilcoxon test on column 4 (CD38), comparing Visit 1 and Visit 9:
library(dplyr)
wilcox.test(filter(boolean_dt3, Visit == 1)$CD38, filter(boolean_dt3, Visit == 9)$CD38, paired = TRUE)
Try this formulation:
wilcox.test(numeric_var ~ two_level_group_var)
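Applied to the data above, that could look something like this (a sketch: the formula interface needs the grouping variable restricted to exactly two levels, the backticks are needed because the column name ends in a space, and this form is unpaired; keep the two-vector call above if you need paired = TRUE):

wilcox.test(`CD38 ` ~ Visit, data = boolean_dt3[Visit %in% c(1, 9)])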
