Trouble creating frequency column from character column - r

I'm trying to add a column to a dataframe that gives the frequency of unique values in a character column. This is what I have so far:
term estimate std.error statistic p.value
1 (Intercept) 6.0888310 1.3601938 4.4764437 8.318542e-06
2 factor(age76)25 0.6884056 0.8861507 0.7768494 4.374021e-01
3 factor(age76)26 0.2177806 0.9997128 0.2178431 8.275887e-01
4 factor(age76)27 0.5539639 0.9255542 0.5985213 5.496061e-01
5 factor(age76)28 0.8705031 0.5343690 1.6290300 1.035716e-01
6 factor(age76)29 1.2249185 0.7557118 1.6208804 1.053084e-01
7 factor(age76)30 0.6254308 0.8861507 0.7057838 4.804608e-01
8 factor(age76)31 1.2295179 0.5343690 2.3008782 2.157089e-02
9 factor(age76)32 0.3032523 0.8449115 0.3589161 7.197216e-01
10 factor(age76)33 1.1344686 0.7557118 1.5011921 1.335714e-01
sapply(df.b, class)
term estimate std.error statistic p.value
"character" "numeric" "numeric" "numeric" "numeric"
library(dplyr)
df.b$n <- group_by(df.b$term) %>%
summarise(df.b$term, freq = n())
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "character"
There seems to be a problem with the character type of my column. When I change it to numeric I am under the impression that it will change to NA.
dput(head(df.b))
structure(list(term = c("(Intercept)", "factor(age76)25", "factor(age76)26",
"factor(age76)27", "factor(age76)28", "factor(age76)29"), estimate = c(6.08883100125014,
0.688405615000334, 0.21778058000053, 0.553963930000528, 0.870503050000005,
1.22491850000015), std.error = c(1.36019381570938, 0.886150663575717,
0.999712776013908, 0.925554182033106, 0.534368956146369, 0.75571182509336
), statistic = c(4.47644367363531, 0.776849404166263, 0.217843149778352,
0.598521340785982, 1.62902998010529, 1.6208804193964), p.value = c(8.31854214736379e-06,
0.437402143453174, 0.827588701982869, 0.549606122411782, 0.103571567056818,
0.105308432290008)), .Names = c("term", "estimate", "std.error",
"statistic", "p.value"), row.names = c(NA, 6L), class = "data.frame")
I have also tried this, but it gives a warning:
df.b$n <- group_by(df.b, term)%>%
summarise(freq = n())
head(df.b)
term estimate std.error statistic p.value n
1 (Intercept) 6.0888310 1.3601938 4.4764437 8.318542e-06 # A tibble: 6 x 2
2 factor(age76)25 0.6884056 0.8861507 0.7768494 4.374021e-01 term freq
3 factor(age76)26 0.2177806 0.9997128 0.2178431 8.275887e-01 <chr> <int>
4 factor(age76)27 0.5539639 0.9255542 0.5985213 5.496061e-01 1 (Intercept) 1
5 factor(age76)28 0.8705031 0.5343690 1.6290300 1.035716e-01 2 factor(age76)25 1
6 factor(age76)29 1.2249185 0.7557118 1.6208804 1.053084e-01 3 factor(age76)25:factor(black)1 1
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs

I think you misunderstand the use of the key functions (group_by and summarise) in dplyr.
First of all, the output of these key functions is a data frame, not a vector. So you should not assign the output to df.b$n, a new column in the data frame.
Secondly, if you want to create a new column, use mutate. summarise is for summarising group statistics, not for creating a new column.
Thirdly, you may want to review how the pipe operator works (http://seananderson.ca/2014/09/13/dplyr-intro.html). The first argument of each of these key functions is a data frame. You should begin with df.b2 <- df.b %>% group_by(...) or df.b2 <- group_by(df.b, ...), where ... should be column names. In your original code, you use group_by(df.b$term) %>% summarise(df.b$term, freq = n()), which leads to the error. This makes sense because group_by expects a data frame as its first argument, but you provided a character vector.
One final note, you may not show your entire data frame, but it seems like the elements in the term column are all unique, so the frequency count based on that column is probably all 1. Make sure this is what you want.
I modified your code a little bit as follows. Hopefully, the output df.b2 makes sense.
library(dplyr)
df.b2 <- df.b %>%
group_by(term) %>%
mutate(freq = n()) %>%
ungroup()
df.b2
# # A tibble: 6 x 6
# term estimate std.error statistic p.value freq
# <chr> <dbl> <dbl> <dbl> <dbl> <int>
# 1 (Intercept) 6.0888310 1.3601938 4.4764437 8.318542e-06 1
# 2 factor(age76)25 0.6884056 0.8861507 0.7768494 4.374021e-01 1
# 3 factor(age76)26 0.2177806 0.9997128 0.2178431 8.275887e-01 1
# 4 factor(age76)27 0.5539639 0.9255542 0.5985213 5.496061e-01 1
# 5 factor(age76)28 0.8705031 0.5343690 1.6290300 1.035716e-01 1
# 6 factor(age76)29 1.2249185 0.7557118 1.6208804 1.053084e-01 1
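As a side note, the same per-row count can be sketched in base R with ave(). The toy data frame below is made up (it is not the asker's data) and repeats one term on purpose, so that the counts are not all 1.

```r
# Hypothetical toy data with a repeated term
toy <- data.frame(term = c("a", "b", "a", "c", "a"), stringsAsFactors = FALSE)

# ave() applies length() within each group of `term` and returns a
# vector as long as the input, so it can be assigned as a column
toy$freq <- ave(seq_along(toy$term), toy$term, FUN = length)

toy
```

Because ave() returns a plain vector rather than a data frame, assigning it to a new column works, which is exactly what the summarise attempt could not do.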

Related

R - Multiple chi square test on dataframe per row - get results same dataframe

I'm interested in performing a chi-square test of group 1 x group 2 (per gene/row) and group 1 x group 3 (per gene/row), and getting the p-value and residuals in the same data frame, with new columns for each comparison.
Genes<-c("GENE_A", "GENE_B","GENE_C")
Group1_Mut<-c(20,10,5)
Group1_WT<-c(40,50,55)
Group2_Mut<-c(10, 30, 10)
Group2_WT<-c(80, 60, 80)
Group3_Mut <- c(10,15,30)
Group3_WT <- c(30,40,45)
main<-data.frame(Genes,Group1_Mut,Group1_WT,Group2_Mut,Group2_WT, Group3_Mut,Group3_WT)
First I tried the example I found here on Stack Overflow (but only for a two-group comparison):
library(dplyr)
main %>%
rowwise() %>%
mutate(
chisq.statistic = chisq.test(matrix(c(Group1_Mut, Group1_WT, Group2_Mut, Group2_WT),
nrow = 2))$statistic
)
The chi-square value doesn't match the one from another statistics program.
I tried again for just two groups:
main2 <- select(main,c(Group1_Mut, Group1_WT, Group2_Mut, Group2_WT))
main2 %>%
rowwise() %>%
mutate (statistics = chisq.test(main2))
but got this error:
Error: Problem with `mutate()` column `statistics`.
i `statistics = chisq.test(main2)`.
x `statistics` must be a vector, not a `htest` object.
i Did you mean: `statistics = list(chisq.test(main2))` ?
i The error occurred in row 1.
Then tried this:
main2 %>%
rowwise() %>%
mutate (statistics = list(chisq.test(main2)))
got:
Group1_Mut Group1_WT Group2_Mut Group2_WT statistics
<dbl> <dbl> <dbl> <dbl> <list>
1 20 40 10 80 <htest>
2 10 50 30 60 <htest>
3 5 55 10 80 <htest>
Any ideas on how I can do this test? Are there any functions that perform chi-square tests for multiple comparisons?
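One way the per-row tests might be approached in base R is sketched below. The helper row_chisq and the new column names are hypothetical, chosen for illustration. Note that chisq.test applies Yates' continuity correction to 2 x 2 tables by default (correct = TRUE); that is one possible reason for a mismatch with another program, though this is only a guess since the other program's settings are not shown.

```r
main <- data.frame(
  Genes      = c("GENE_A", "GENE_B", "GENE_C"),
  Group1_Mut = c(20, 10, 5),  Group1_WT = c(40, 50, 55),
  Group2_Mut = c(10, 30, 10), Group2_WT = c(80, 60, 80),
  Group3_Mut = c(10, 15, 30), Group3_WT = c(30, 40, 45)
)

# Hypothetical helper: 2 x 2 test for one row (one gene), two groups
row_chisq <- function(mut_a, wt_a, mut_b, wt_b, correct = TRUE) {
  tab <- matrix(c(mut_a, wt_a, mut_b, wt_b), nrow = 2)  # columns = groups
  res <- chisq.test(tab, correct = correct)
  c(statistic = unname(res$statistic), p.value = res$p.value)
}

# mapply() walks the rows; t() gives one row per gene
g1_g2 <- t(mapply(row_chisq, main$Group1_Mut, main$Group1_WT,
                  main$Group2_Mut, main$Group2_WT))
g1_g3 <- t(mapply(row_chisq, main$Group1_Mut, main$Group1_WT,
                  main$Group3_Mut, main$Group3_WT))

main$chisq_g1_g2 <- g1_g2[, "statistic"]
main$p_g1_g2     <- g1_g2[, "p.value"]
main$chisq_g1_g3 <- g1_g3[, "statistic"]
main$p_g1_g3     <- g1_g3[, "p.value"]

main
```

The residuals of each test are a 2 x 2 matrix (the residuals element of the result), so they would need either a list column or four separate columns per comparison. Passing correct = FALSE to the helper turns off the continuity correction.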

Iteratively create global environment objects from tibble

I'm trying to make objects directly from information listed in a tibble that can be called on by later functions/tibbles in my environment. I can make the objects manually but I'm working to do this iteratively.
library(tidyverse)
##determine mean from 2x OD Negatives in experimental plates, then save summary for use in appending table
ELISA_negatives = "my_file.csv"
neg_tibble <- as_tibble(read_csv(ELISA_negatives, col_names = TRUE)) %>%
group_by(Species_ab, Antibody, Protein) %>%
filter(str_detect(Animal_ID, "2x.*")) %>%
summarize(ave_neg_U_mL = mean(U_mL, na.rm = TRUE), n=sum(!is.na(U_mL)))
neg_tibble
# A tibble: 4 x 5
# Groups: Species_ab, Antibody [2]
Species_ab Antibody Protein ave_neg_U_mL n
<chr> <chr> <chr> <dbl> <int>
1 Mouse IgG GP 28.2 6
2 Mouse IgG NP 45.9 6
3 Rat IgG GP 5.24 4
4 Rat IgG NP 1.41 1
I can write the object manually based off the above tibble:
Mouse_IgG_GP_cutoff <- as.numeric(neg_tibble[1,4])
Mouse_IgG_GP_cutoff
[1] 28.20336
In my attempt to do this iteratively, I can make a new tibble neg_tibble_string with the information I need. All I would need to do now is make a global object from the Name in the first column Test_Name, and assign it to the numeric value in the second column ave_neg_U_mL (which is where I'm getting stuck).
neg_tibble_string <- neg_tibble %>%
select(Species_ab:Protein) %>%
unite(col='Test_Name', c('Species_ab', 'Antibody', 'Protein'), sep = "_") %>%
mutate(Test_Name = str_c(Test_Name, "_cutoff")) %>%
bind_cols(neg_tibble[4])
neg_tibble_string
# A tibble: 4 x 2
Test_Name ave_neg_U_mL
<chr> <dbl>
1 Mouse_IgG_GP_cutoff 28.2
2 Mouse_IgG_NP_cutoff 45.9
3 Rat_IgG_GP_cutoff 5.24
4 Rat_IgG_NP_cutoff 1.41
I feel like there has to be a way to do this to get this from the above tibble neg_tibble_string, and make this for all four of the rows. I've tried a variant of this and this, but can't get anywhere.
> list_df <- mget(ls(pattern = "neg_tibble_string"))
> list_output <- map(list_df, ~neg_tibble_string$ave_neg_U_mL)
Warning message:
Unknown or uninitialised column: `ave_neg_U_mL`.
> list_output
$neg_tibble_string
NULL
As always, any insight is appreciated! I'm making progress on my R journey but I know I am missing large gaps in knowledge.
mget already returns the object in a named list, so the lambda only needs to refer to .x, which is the current list element (here a tibble), and extract the column from it:
library(purrr)
list_output <- map(list_df, ~.x$ave_neg_U_mL)
If the intention is to create global objects, use deframe to turn the two-column tibble into a named vector, convert it to a list, and then call list2env:
library(tibble)
list2env(as.list(deframe(neg_tibble_string)), .GlobalEnv)
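A base R sketch of the same idea follows, for illustration only. It uses a made-up two-row data frame in place of neg_tibble_string and writes into a fresh environment (cutoff_env, a hypothetical name) instead of .GlobalEnv, so nothing in the workspace is overwritten.

```r
# Stand-in for neg_tibble_string (values copied from the printed tibble)
cutoffs <- data.frame(
  Test_Name    = c("Mouse_IgG_GP_cutoff", "Rat_IgG_NP_cutoff"),
  ave_neg_U_mL = c(28.2, 1.41),
  stringsAsFactors = FALSE
)

# setNames() + as.list() plays the role of deframe() + as.list()
cutoff_list <- as.list(setNames(cutoffs$ave_neg_U_mL, cutoffs$Test_Name))

# One object per list element, created in a dedicated environment
cutoff_env <- list2env(cutoff_list, envir = new.env())

cutoff_env$Mouse_IgG_GP_cutoff
get("Rat_IgG_NP_cutoff", envir = cutoff_env)
```

Replacing new.env() with .GlobalEnv gives the global objects asked for. Keeping the values in the named list (cutoff_list[["Mouse_IgG_GP_cutoff"]]) is often enough and avoids creating many loose objects.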

Adding 'List' Objects to Word document using the Officer package

First time posting here.
I'm trying to get some statistical results to output onto a Word doc using the Officer package. I understand that the body_add_* functions seem to only work on data frames. However, functions and tests like gvlma and ncvTest return lists with unconventional dimensions, so I'm unable to use the tidyr package to tidy the lists before turning them into a data frame using data.frame(). So I need help adding these blocks of text, which are lists, to a Word document.
So far I have this, since the ADF test outputs a tidy list that converts easily to a data frame:
# ADF test into dataframe
adf_df = data.frame(adf)
adf_df
ft <- flextable(data = adf_df) %>%
theme_booktabs() %>%
autofit()
# Output table into Word doc
doc <- read_docx() %>%
body_add_flextable(value = ft) %>%
body_add_par(gvlma)
fileout <- "test.docx"
print(doc, target = fileout)
The body_add_par(gvlma) line gives the error:
Warning messages:
1: In if (grepl("<|>", x)) { :
the condition has length > 1 and only the first element will be used
2: In charToRaw(enc2utf8(x)) :
argument should be a character vector of length 1
all but the first element will be ignored
gvlma outputs as a list and here is the output:
Call:
lm(formula = PD ~ ., data = dataset)
Coefficients:
(Intercept) WorldBank_Oil
1.282 -1.449
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = model)
Value p-value Decision
Global Stat 4.6172 0.3289 Assumptions acceptable.
Skewness 0.1858 0.6664 Assumptions acceptable.
Kurtosis 0.1812 0.6703 Assumptions acceptable.
Link Function 1.7823 0.1819 Assumptions acceptable.
Heteroscedasticity 2.4678 0.1162 Assumptions acceptable.
Replicating the error with iris data-set:
library(officer); library(flextable)
adf_df <- iris
ft <- flextable(data = adf_df) %>%
theme_booktabs() %>%
autofit()
gvlma <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data=iris)
# Output table into Word doc
doc <- read_docx() %>%
body_add_flextable(value = ft) %>%
body_add_par(gvlma)
Warning messages:
1: In if (grepl("<|>", x)) { :
the condition has length > 1 and only the first element will be used
2: In charToRaw(enc2utf8(x)) :
argument should be a character vector of length 1
all but the first element will be ignored
The issue here is that a linear model is stored as a list, which is efficient for extracting test parameters or model statistics but not great as static output.
One way to work around this is to use the functions from the broom package:
library(broom)
gvlma2 <- tidy(gvlma)
gvlma3 <- glance(gvlma)
doc <- read_docx() %>%
body_add_flextable(value = ft) %>%
body_add_flextable(value = flextable(gvlma2)) %>%
body_add_flextable(value = flextable(gvlma3))
fileout <- "test.docx"
print(doc, target = fileout)
gvlma2:
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -2.52 0.563 -4.48 1.48e- 5
2 Sepal.Length 1.78 0.0644 27.6 5.85e-60
3 Sepal.Width -1.34 0.122 -10.9 9.43e-21
gvlma3:
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 0.868 0.866 0.646 482. 2.74e-65 3 -146. 300. 312. 61.4 147
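If the goal is to keep the printed text itself rather than a tidied table, another option is capture.output(), which returns what would be printed to the console as a character vector, one element per line. The sketch below assumes the lm model from the iris example; the officer part only runs if the package is installed, and the temporary output file is chosen purely for illustration.

```r
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)

# Printed output as a character vector, one element per console line
printed <- capture.output(print(model))
printed <- printed[nzchar(printed)]   # drop empty lines

if (requireNamespace("officer", quietly = TRUE)) {
  doc <- officer::read_docx()
  # body_add_par() takes a single string, so add one paragraph per line
  for (line in printed) {
    doc <- officer::body_add_par(doc, value = line)
  }
  print(doc, target = tempfile(fileext = ".docx"))
}
```

This avoids the warnings above because each call to body_add_par receives a character vector of length 1. Column alignment in the printed text depends on a monospaced font, so a paragraph style with such a font may be needed in the Word template.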

Using custom function to apply across multiple groups and subsets

I am having trouble applying a custom function to multiple groups within a data frame and adding the result to the original data with mutate. I am trying to calculate the percent inhibition for each row of data (each observation in the experiment has a value). The challenging issue is that the function needs the mean of two different groups of values (positive and negative controls) and then uses those means in each calculation.
In other words, the experimental value is subtracted from the mean of the negative control, and the result is divided by the mean of the negative control minus the mean of the positive control.
Each observation, including the + and - controls, should have a calculated percent inhibition, and as a double check, for each experiment (grouping) the mean of the pct inhib of the - controls should be around 0 and that of the + controls around 100.
The function:
percent_inhibition <- function(uninhibited, inhibited, unknown){
uninhibited <- as.vector(uninhibited)
inhibited <- as.vector(inhibited)
unknown <- as.vector(unknown)
mu_u <- mean(uninhibited, na.rm = TRUE)
mu_i <- mean(inhibited, na.rm = TRUE)
percent_inhibition <- (mu_u - unknown)/(mu_u - mu_i)*100
return(percent_inhibition)
}
I have a data frame with multiple variables: target, box, replicate, and sample type. I am able to do the calculation by subsetting the data (below), (1 target, box, and replicate) but have not been able to figure out the right way to apply it to all of the data.
subset <- data %>%
filter(target == "A", box == "1", replicate == 1)
uninhib <-
subset$value[subset$sample == "uninhib"]
inhib <-
subset$value[subset$sample == "inhib"]
pct <- subset %>%
mutate(pct = percent_inhibition(uninhib, inhib, .$value))
I have tried group_by and do, and nest functions, but my knowledge is lacking in how to apply these functions to my subsetting problem. I'm stuck when it comes to the subset of the subset (calculating the means) and then applying that to the individual values. I am hoping there is an elegant way to do this without all of the subsetting, but I am at a loss on how.
I have tried:
inhibition <- data %>%
group_by(target, box, replicate) %>%
mutate(pct = (percent_inhibition(.$value[.$sample == "uninhib"], .$value[.$sample == "inhib"], .$value)))
But get the error that columns are not the right length, because of the group_by function.
library(tidyr)
library(purrr)
library(dplyr)
data %>%
group_by(target, box, replicate) %>%
mutate(pct = {
x <- split(value, sample)
percent_inhibition(x$uninhib, x$inhib, value)
})
#> # A tibble: 10,000 x 6
#> # Groups: target, box, replicate [27]
#> target box replicate sample value pct
#> <chr> <chr> <int> <chr> <dbl> <dbl>
#> 1 A 1 3 inhib -0.836 1941.
#> 2 C 1 1 uninhib -0.221 -281.
#> 3 B 3 2 inhib -2.10 1547.
#> 4 C 1 1 uninhib -1.67 -3081.
#> 5 C 1 3 inhib -1.10 -1017.
#> 6 A 2 1 inhib -1.67 906.
#> 7 B 3 1 uninhib -0.0495 -57.3
#> 8 C 3 2 inhib 1.56 5469.
#> 9 B 3 2 uninhib -0.405 321.
#> 10 B 1 2 inhib 0.786 -3471.
#> # … with 9,990 more rows
Created on 2019-03-25 by the reprex package (v0.2.1)
Or:
data %>%
group_by(target, box, replicate) %>%
mutate(pct = percent_inhibition(value[sample == "uninhib"],
value[sample == "inhib"], value))
With data as:
n <- 10000L
set.seed(123) ; data <-
tibble(
target = sample(LETTERS[1:3], n, replace = TRUE),
box = sample(as.character(1:3), n, replace = TRUE),
replicate = sample(1:3, n, replace = TRUE),
sample = sample(c("inhib", "uninhib"), n, replace = TRUE),
value = rnorm(n)
)
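The double check described in the question (controls averaging about 0 and 100 within each group) can be sketched in base R on a small made-up data set. The toy values and the extra "test" sample type are assumptions for illustration. The question's attempt failed because, inside a pipe, . refers to the whole data frame rather than the current group, whereas bare column names inside mutate are evaluated per group.

```r
percent_inhibition <- function(uninhibited, inhibited, unknown) {
  mu_u <- mean(uninhibited, na.rm = TRUE)
  mu_i <- mean(inhibited, na.rm = TRUE)
  (mu_u - unknown) / (mu_u - mu_i) * 100
}

# Hypothetical toy data: two groups, each with both controls and two unknowns
toy <- data.frame(
  target = rep(c("A", "B"), each = 6),
  sample = rep(c("uninhib", "uninhib", "inhib", "inhib", "test", "test"), times = 2),
  value  = c(10, 12, 1, 3, 6, 7, 20, 22, 2, 4, 11, 15)
)

# Compute pct within each group, then put the pieces back in row order
toy$pct <- unsplit(
  lapply(split(toy, toy$target), function(d) {
    percent_inhibition(d$value[d$sample == "uninhib"],
                       d$value[d$sample == "inhib"],
                       d$value)
  }),
  toy$target
)

# Mean pct per group and sample type: uninhib near 0, inhib near 100
check <- aggregate(pct ~ target + sample, data = toy, FUN = mean)
check
```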

Collecting p-values within pipe (dplyr)

how are you?
So, I have a dataset that looks like this:
dirtax_trev indtax_trev lag2_majority pub_exp
<dbl> <dbl> <dbl> <dbl>
0.1542 0.5186 0 9754
0.1603 0.4935 0 9260
0.1511 0.5222 1 8926
0.2016 0.5501 0 9682
0.6555 0.2862 1 10447
I'm having the following problem. I want to run a series of t-tests across a dummy variable (lag2_majority), collect the p-values of these tests, and store them in a vector, using a pipe.
All the variables I want to run these t-tests on are selected below; then I omit NA values for my grouping variable (lag2_majority), and then I try to summarize it with this code:
test <- g %>%
select(dirtax_trev, indtax_trev, gdpc_ppp, pub_exp,
SOC_tot, balance, fdi, debt, polity2, chga_demo, b_gov, social_dem,
iaep_ufs, gini, pov4, informal, lab, al_ethnic, al_language, al_religion,
lag_left, lag2_left, majority, lag2_majority, left, system, b_system,
execrlc, allhouse, numvote, legelec, exelec, pr) %>%
na.omit(lag2_majority) %>%
summarise_all(funs(t.test(.[lag2_majority], .[lag2_majority == 1])$p.value))
However, once I run this, the response I get is: Error in summarise_impl(.data, dots): Evaluation error: data are essentially constant., which is confusing since there is a clear difference on means along the dummy variable. The same error appears when I replace the last line of the code indicated above with: summarise_all(funs(t.test(.~lag2_majority)$p.value)).
Alternatively, since all I want to do is: t.test(dirtax_trev~lag2_majority, g)$p.value, for instance, I thought I could do a loop, like this:
for (i in vars){
t.test(i~lag2_majority, g)$p.value
},
Where vars is an object that contains all variables selected in the code indicated above. But once again I get an error message. Specifically, this one: Error in model.frame.default(formula = i ~ lag2_majority, data = g): variable lengths differ (found for 'lag2_majority')
What am I doing wrong?
Best Regards!
Your question is not reproducible, please read this for how you could improve its quality.
My answer has been generalised to be reproducible because I don't have your data and cannot therefore adapt your code directly.
Using a tidy approach I'll produce a data frame of p-values for each variable.
library(tidyr)
library(dplyr)
library(purrr)
mtcars %>%
select_if(is.numeric) %>%
map(t.test) %>%
lapply(`[[`, "p.value") %>%
as_tibble %>%
gather(key, p.value)
# # A tibble: 11 x 2
# key p.value
# <chr> <dbl>
# 1 mpg 1.526151e-18
# 2 cyl 5.048147e-19
# 3 disp 9.189065e-12
# 4 hp 2.794134e-13
# 5 drat 1.377586e-27
# 6 wt 2.257406e-18
# 7 qsec 7.790282e-33
# 8 vs 2.776961e-05
# 9 am 6.632258e-05
# 10 gear 1.066949e-23
# 11 carb 4.590930e-11
update
Thank you for updating your question, note that the value you included in your earlier comment is likely from your original dataset and is still not reproducible here. When I run the code, this is the output.
t.test(dirtax_trev ~ lag2_majority, g)$p.value
# [1] 0.5272474
Please frame your questions in a way that anyone can see the problem in the same way that you do.
To build up the formula you are running through the t.test, I have taken a slightly different approach.
library(magrittr)
library(dplyr)
library(purrr)
g <- tribble(
~dirtax_trev, ~indtax_trev, ~lag2_majority, ~pub_exp,
0.1542, 0.5186, 0, 9754,
0.1603, 0.4935, 0, 9260,
0.1511, 0.5222, 1, 8926,
0.2016, 0.5501, 0, 9682,
0.6555, 0.2862, 1, 10447
)
dummy <- "lag2_majority"
colnames(g) %>%
.[. != dummy] %>% # vector of variables to send through t.test
paste(., "~", dummy) %>% # build formula as character
map(as.formula) %>% # convert to formula class
map(t.test, data = g) %$% # run t.test for each, note the special operator
tibble(
data.name = unlist(lapply(., `[[`, "data.name")),
p.value = unlist(lapply(., `[[`, "p.value"))
)
# # A tibble: 3 x 2
# data.name p.value
# <chr> <dbl>
# 1 dirtax_trev by lag2_majority 0.5272474
# 2 indtax_trev by lag2_majority 0.5021217
# 3 pub_exp by lag2_majority 0.8998690
If you prefer to drop the dummy variable name from data.name, you could modify its assignment in the tibble with:
data.name = unlist(strsplit(unlist(lapply(., `[[`, "data.name")), paste(" by", dummy)))
N.B. I used the special %$% from magrittr to expose the names from the list of tests to build a data frame. I'm sure there are other ways that may be more elegant, however, I find this form quite easy to reason about.
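For comparison, here is a base R sketch of the loop idea from the question, using the five sample rows shown above. In the original loop, i ~ lag2_majority contains the literal symbol i, which is a single character string rather than the column it names, hence the "variable lengths differ" error. Building the formula with reformulate() avoids that.

```r
g <- data.frame(
  dirtax_trev   = c(0.1542, 0.1603, 0.1511, 0.2016, 0.6555),
  indtax_trev   = c(0.5186, 0.4935, 0.5222, 0.5501, 0.2862),
  lag2_majority = c(0, 0, 1, 0, 1),
  pub_exp       = c(9754, 9260, 8926, 9682, 10447)
)

dummy <- "lag2_majority"
vars  <- setdiff(names(g), dummy)   # every column except the dummy

# One t-test per variable; reformulate() builds e.g. dirtax_trev ~ lag2_majority
p_values <- vapply(vars, function(v) {
  f <- reformulate(dummy, response = v)
  t.test(f, data = g)$p.value
}, numeric(1))

p_values
```

The result is a named numeric vector, one p-value per variable, which matches the "collect into a vector" goal stated in the question.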
