Using mean imputation for different variables - r

I have a data set where data are missing. Here is a sample of what my data looks like:
df<-read.csv(id, test1, test2, test3
1, 9, 1, 3
2, 8, 2, NA
3, NA, 3, NA
4, 1, 3, 4
5, 2, 44, NA
6, 4, 4, 1
7, NA, NA, NA)
How would I input the respective mean of each test into the corresponding column for each NA?
Output should look like
id test1 test2 test3
1, 9, 1, 3
2, 8, 2, 2.66
3, 4.8, 3, 2.66
4, 1, 3, 4
5, 2, 44, 2.66
6, 4, 4, 1
7, 4.8, 9.5, 2.66

An option would be na.aggregate
library(zoo)
df[-1] <- na.aggregate(df[-1])

Related

Removing a row based on a condition

my_df <- tibble(
b1 = c(2, 1, 1, 2, 2, 2, 1, 1, 2),
b2 = c(NA, 4, 6, 2, 6, 6, 1, 1, 7),
b3 = c(5, 9, 8, NA, 2, 3, 9, 5, NA),
b4 = c(NA, 6, NA, 10, 12, 8, 3, 6, 2),
b5 = c(2, 12, 1, 7, 8, 5, 5, 6, NA),
b6 = c(9, 2, 4, 6, 7, 6, 6, 7, 9),
b7 = c(1, 3, 7, 7, 4, 2, 2, 9, 5),
b8 = c(NA, 8, 4, 5, 1, 4, 1, 3, 6),
b9 = c(4, 5, NA, 9, 5, 1, 1, 2, NA),
b10 = c(14, 3, NA, 2, 2, 2, 3, NA, 5))
I have a df like this, and would like to tell R to remove all '3' or 'NA' in b10 if b1 = 1. I have tried this with this, but it seems to keep the '3' and 'NA' instead of removing them;
new_df <- my_df %>% filter(is.na(b10) | b10 == 3 | b1==1 & b10 ==NA)
This should do the trick:
library(dplyr)
my_df %>%
filter(!(b1 == 1 & (is.na(b10) | b10 == 3)))
Edit: assuming you want to remove the rows where the conditions meet

Paired T-Test over multiple paired columns (wide data format)

I've converted a data frame into wide format and now want to compute paired t-tests to obtain p-values. I have managed to do this for each pair of columns individually, but it's a lot more code than I feel is necessary. I'm still very new to R, data and coding generally, and couldn't easily see a solution here on Stack Overflow.
My wide data frame is:
> head(df_wide)
# A tibble: 6 x 21
Assessor `Appearance1 `Appearance2 `Aroma_1 `Aroma_2 `Flavour_1 `Flavour_2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10 10 10 10 10 10
2 6 7 7 5 8 4
# ... with 14 more variables
I want to perform a paired T-Test over the attributes, i.e. Appearance1 and Appearance2, Aroma1 and Aroma2, etc. The 14 other variables are all <dbl> and are also attributes to be included as paired columns for the T-Test.
Ideally, the output would be a vector of just the p-values, rather than having all the information. I've managed to do that coding for individual pairs, but I wanted to know if this would be possible to do as part of performing the T-Test over multiple pairs of columns.
Here is the code I have for the first two attributes:
p_values <- c(t.test(df_wide$`Appearance1`, df_wide$`Appearance2`, paired = T)[["p.value"]],
t.test(df_wide$`Aroma1`, df_wide$`Aroma2`, paired = T)[["p.value"]])
This creates the vector I want, but is cumbersome and error-prone. Ideally, I'd be able to perform it over all the pairs at once without needing to use column names.
I do have the original data frame in long format, if it would be easier to do it using that (EDIT: used dput() for first 20 rows instead of head():
> dput(df_test[1:20,])
structure(list(Assessor = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
Product = c("MC", "MV", "MC", "MV", "MV", "MC", "MC", "MV", "MV", "MC", "MC", "MV", "MC", "MV", "MC", "MV", "MV", "MC", "MV", "MC"),
Appearance = c(10, 10, 6, 7, 9, 6, 7, 8, 9, 8, 10, 8, 6, 6, 9, 8, 8, 8, 9, 9),
Aroma = c(10, 10, 7, 5, 9, 8, 6, 7, 5, 7, 9, 8, 6, 6, 5, 3, 6, 7, 9, 6),
Flavour = c(10, 10, 8, 4, 10, 7, 7, 6, 8, 8, 9, 10, 8, 8, 6, 8, 7, 9, 9, 8),
Texture = c(10, 10, 8, 8, 9, 6, 7, 8, 8, 8, 9, 10, 8, 8, 9, 8, 8, 9, 9, 8),
`JAR Colour` = c(3, 2, 2, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3),
`JAR Strength Chocolate` = c(2, 2, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3, 2),
`JAR Strength Vanilla` = c(3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3),
`JAR Sweetness` = c(2, 3, 3, 1, 3, 2, 2, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3),
`JAR Creaminess` = c(3, 3, 3, 3, 3, 1, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3),
`Overall Acceptance` = c(9, 10, 8, 4, 10, 5, 7, 7, 8, 8, 9, 10, 8, 8, 8, 8, 8, 9, 8, 8)),
row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))
The Product variable is the one which was used to make the paired columns in the wide format data frame. Thanks in advance.
if I understand correctly
df <- structure(list(Assessor = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
Product = c("MC", "MV", "MC", "MV", "MV", "MC", "MC", "MV", "MV", "MC", "MC", "MV", "MC", "MV", "MC", "MV", "MV", "MC", "MV", "MC"),
Appearance = c(10, 10, 6, 7, 9, 6, 7, 8, 9, 8, 10, 8, 6, 6, 9, 8, 8, 8, 9, 9),
Aroma = c(10, 10, 7, 5, 9, 8, 6, 7, 5, 7, 9, 8, 6, 6, 5, 3, 6, 7, 9, 6),
Flavour = c(10, 10, 8, 4, 10, 7, 7, 6, 8, 8, 9, 10, 8, 8, 6, 8, 7, 9, 9, 8),
Texture = c(10, 10, 8, 8, 9, 6, 7, 8, 8, 8, 9, 10, 8, 8, 9, 8, 8, 9, 9, 8),
`JAR Colour` = c(3, 2, 2, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3),
`JAR Strength Chocolate` = c(2, 2, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3, 2),
`JAR Strength Vanilla` = c(3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3),
`JAR Sweetness` = c(2, 3, 3, 1, 3, 2, 2, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3),
`JAR Creaminess` = c(3, 3, 3, 3, 3, 1, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3),
`Overall Acceptance` = c(9, 10, 8, 4, 10, 5, 7, 7, 8, 8, 9, 10, 8, 8, 8, 8, 8, 9, 8, 8)),
row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))
head(df)
#> # A tibble: 6 x 12
#> Assessor Product Appearance Aroma Flavour Texture `JAR Colour`
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 MC 10 10 10 10 3
#> 2 1 MV 10 10 10 10 2
#> 3 2 MC 6 7 8 8 2
#> 4 2 MV 7 5 4 8 3
#> 5 3 MV 9 9 10 9 3
#> 6 3 MC 6 8 7 6 3
#> # ... with 5 more variables: JAR Strength Chocolate <dbl>,
#> # JAR Strength Vanilla <dbl>, JAR Sweetness <dbl>, JAR Creaminess <dbl>,
#> # Overall Acceptance <dbl>
library(tidyverse)
map_df(df[-c(1:2)], ~t.test(.x ~ df$Product, paired = TRUE)$p.value)
#> # A tibble: 1 x 10
#> Appearance Aroma Flavour Texture `JAR Colour` `JAR Strength Chocolate`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.496 0.576 1 0.309 0.678 1
#> # ... with 4 more variables: JAR Strength Vanilla <dbl>, JAR Sweetness <dbl>,
#> # JAR Creaminess <dbl>, Overall Acceptance <dbl>
sapply(df[-c(1:2)], function(x) t.test(x ~ df$Product, paired = TRUE)$p.value)
#> Appearance Aroma Flavour
#> 0.4961016 0.5763122 1.0000000
#> Texture JAR Colour JAR Strength Chocolate
#> 0.3092332 0.6783097 1.0000000
#> JAR Strength Vanilla JAR Sweetness JAR Creaminess
#> 0.6783097 1.0000000 0.4433319
#> Overall Acceptance
#> 0.7803523
Created on 2021-06-22 by the reprex package (v2.0.0)

R Tidyverse expand a dataframe with all combinations of two variables (edgelist)

This code generates a data frame just so:
library(tidyverse)
A = c(7, 4, 3, 12, 6)
B = c(1, 10, 9, 8, 5)
C = c(5, 3, 1, 7, 6)
df <- data_frame(A, B, C) %>% gather(letter1, rank)
nested <- df %>% group_by(letter1) %>% nest(ranks = c(rank))
nested
A grouped_df: 3 × 2
letter1 ranks
<chr> <list>
A 7, 4, 3, 12, 6
B 1, 10, 9, 8, 5
C 5, 3, 1, 7, 6
This is the desired data frame:
A tibble: 9 × 4
letter1 letter2 data1 data2
<chr> <chr> <list> <list>
A A 7, 4, 3, 12, 6 7, 4, 3, 12, 6
B A 1, 10, 9, 8, 5 7, 4, 3, 12, 6
C A 5, 3, 1, 7, 6 7, 4, 3, 12, 6
A B 7, 4, 3, 12, 6 1, 10, 9, 8, 5
B B 1, 10, 9, 8, 5 1, 10, 9, 8, 5
C B 5, 3, 1, 7, 6 1, 10, 9, 8, 5
A C 7, 4, 3, 12, 6 5, 3, 1, 7, 6
B C 1, 10, 9, 8, 5 5, 3, 1, 7, 6
C C 5, 3, 1, 7, 6 5, 3, 1, 7, 6
Once this step is solved, I'll run a mutate using data1 and data2 to get value, and then selecting letter1, letter2 and value will give an edgelist. I'm working with about 700 letters and the ranks lists will all be the same size and contain about 20 elements.
I'd expected to be able to use expand or expand.grid, but to no avail. Any tidyverse suggestions will be greatly appreciated.
crossing can be used
library(tidyr)
library(purrr)
library(dplyr)
crossing(ind1 = seq_len(nrow(nested)),
ind2 = seq_len(nrow(nested))) %>%
pmap_dfr(~ bind_cols(nested[..1,], nested[..2,]) )
We can use crossing after renaming the second dataframe.
tidyr::crossing(nested, setNames(nested, c('letter2', 'rank2')))
# letter1 ranks letter2 rank2
#1 A 7, 4, 3, 12, 6 A 7, 4, 3, 12, 6
#2 A 7, 4, 3, 12, 6 B 1, 10, 9, 8, 5
#3 A 7, 4, 3, 12, 6 C 5, 3, 1, 7, 6
#4 B 1, 10, 9, 8, 5 A 7, 4, 3, 12, 6
#5 B 1, 10, 9, 8, 5 B 1, 10, 9, 8, 5
#6 B 1, 10, 9, 8, 5 C 5, 3, 1, 7, 6
#7 C 5, 3, 1, 7, 6 A 7, 4, 3, 12, 6
#8 C 5, 3, 1, 7, 6 B 1, 10, 9, 8, 5
#9 C 5, 3, 1, 7, 6 C 5, 3, 1, 7, 6
The same is also valid for expand_grid.
tidyr::expand_grid(nested, setNames(nested, c('letter2', 'rank2')))

Check row by row and highlight mismatches in row/column when it occurred

I have a data frame with 3 months of data with individual information. Individual information must be fixed during the whole period, however, in my real data set it is not the case. I would like to check row by row and highlight the dates that something went wrong during data entry.
Here is sample of my dataset ( real dataset has more variables):
input <- data.frame(stringsAsFactors=FALSE,
date = c(20190218, 20190219, 20190220, 20190221, 20190222,
20190223, 20190101, 20190103, 20190105, 20190110,
20190112, 20190218, 20190219, 20190220, 20190221, 20190222,
20190223),
id = c("18105265-ab", "18105265-ab", "18105265-ab",
"18105265-ab", "18105265-ab", "18105265-ab",
"18161665-aa", "18161665-aa", "18161665-aa", "18161665-aa",
"18161665-aa", "18502020-aa", "18502020-aa", "18502020-aa",
"18502020-aa", "18502020-aa", "18502020-aa"),
size = c(3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1),
type = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 2, 2),
county = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5),
member_p10 = c(3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1),
youngest_age = c(5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7),
sex = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1),
position = c(5, 5, 5, 5, 5, 5, 4, 4, 4, 0, 0, 3, 3, 3, 3, 0, 0))
Is there any way for this type of operation? I would like to have this output at the end:
date id size type county member_p10 youngest_age sex position
1 20190221 18105265-ab 3 4 1 3 5 1 5
2 20190222 18105265-ab 2 4 1 2 7 1 5
3 20190105 18161665-aa 2 4 1 2 7 2 4
4 20190110 18161665-aa 1 2 1 1 7 2 0
5 20190221 18502020-aa 2 4 5 2 7 1 3
6 20190222 18502020-aa 1 2 5 1 7 1 0

How to identify fully connected node clusters with igraph?

I'm trying to calculate the clusters of a network using igraph in R, where all nodes are connected. The plot seems to work OK, but then I'm not able to return the correct groupings from my clusters.
In this example, the plot shows 4 main clusters, but in the largest cluster, not all nodes are connected:
I would like to be able to return the following list of clusters from this graph object:
[[1]]
[1] 8 9
[[2]]
[1] 7 10
[[3]]
[1] 4 6 11
[[4]]
[1] 2 3 5
[[5]]
[1] 1 3 5 12
Example code:
library(igraph)
topology <- structure(list(N1 = c(1, 3, 5, 12, 2, 3, 5, 1, 2, 3, 5, 12, 4,
6, 11, 1, 2, 3, 5, 12, 4, 6, 11, 7, 10, 8, 9, 8, 9, 7, 10, 4,
6, 11, 1, 3, 5, 12), N2 = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3,
3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10,
11, 11, 11, 12, 12, 12, 12)), .Names = c("N1", "N2"), row.names = c(NA,
-38L), class = "data.frame")
g2 <- graph.data.frame(topology, directed=FALSE)
g3 <- simplify(g2)
plot(g3)
The cliques function gets me part of the way there:
tmp <- cliques(g3)
tmp
but, this list also gives groupings where not all nodes connect. For example, this clique includes the nodes 1,2,3,5 but 1 only connects to 3, and 2 only connects to 3 and 5, and 5 only connects to 2 :
topology[tmp[[31]],]
# N1 N2
#6 3 2
#7 5 2
#8 1 3
Thanks in advance for any help.
You could use maximal.cliques in the igraph package. See below.
# Load package
library(igraph)
# Load data
topology <- structure(list(N1 = c(1, 3, 5, 12, 2, 3, 5, 1, 2, 3, 5, 12, 4,
6, 11, 1, 2, 3, 5, 12, 4, 6, 11, 7, 10, 8, 9, 8, 9, 7, 10, 4,
6, 11, 1, 3, 5, 12), N2 = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3,
3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10,
11, 11, 11, 12, 12, 12, 12)), .Names = c("N1", "N2"), row.names = c(NA,
-38L), class = "data.frame")
# Get rid of loops and ensure right naming of vertices
g3 <- simplify(graph.data.frame(topology[order(topology[[1]]),],directed = FALSE))
# Plot graph
plot(g3)
# Calcuate the maximal cliques
maximal.cliques(g3)
# > maximal.cliques(g3)
# [[1]]
# [1] 9 8
#
# [[2]]
# [1] 10 7
#
# [[3]]
# [1] 2 3 5
#
# [[4]]
# [1] 6 4 11
#
# [[5]]
# [1] 12 1 5 3

Resources