R: ggplot to visualize all variables in each cluster after cluster analysis

Sorry in advance if the post isn't clear.
So I have my dataframe, 74 observations and 43 columns. I performed cluster analysis on them.
I then got 5 clusters, and assigned the cluster number to each respective row.
Now my df has 74 rows (observations) and 44 variables. For every variable, I would like to plot and see in which clusters it is enriched and in which it is not.
I want to achieve this with ggplot.
The output I have in mind is a panel with 5 boxplots per row and 42 rows of plots, where each row describes one variable measured in the dataset.
Example of the dataset (sorry, it's very big, so I made an example; the actual values are different):
df
ID EGF FGF_2 Eotaxin TGF G_CSF Flt3L GMSF Frac IFNa2 .... Cluster
4300 4.21 139.32 3.10 0 1.81 3.48 1.86 9.51 9.41 .... 1
2345 7.19 233.10 0 1.81 3.48 1.86 9.41 0 11.4 .... 1
4300 4.21 139.32 4.59 0 1.81 3.48 1.86 9.51 9.41 .... 1
....
3457 0.19 233.10 0 1.99 3.48 1.86 9.41 0 20.4 .... 3
5420 4.21 139.32 3.10 0.56 1.81 3.48 1.86 9.51 29.8 .... 1
2334 7.19 233.10 2.68 2.22 3.48 1.86 9.41 0 28.8 .... 5
str(df)
$ ID : Factor w/ 45 levels "4300"..... : 44 8 24 ....
$ EGF : num ....
$ FGF_2 : num ....
$ Eotaxin : num ....
....
$ Cluster : Factor w/ 5 levels "1" , "2"...: 1 1 1.....3 1 5
# now plotting
# I thought I would pivot the data frame
new_df <- pivot_longer(df[,2:44],df$cluster, names_to = "Cytokine measured", values_to = "count")
#ggplot
ggplot(new_df,aes(x = new_df$cluster, y = new_df$count))+
geom_boxplot(width=0.2,alpha=0.1)+
geom_jitter(width=0.15)+
facet_grid(new_df$`Cytokine measured`~new_df$cluster, scales = 'free')
So the code did generate a small panel of graphs that fits the output I had in mind. But I can see only 5 rows instead of 42.
So going back to new_df, the last 3 columns draw my attention:
Cluster Cytokine measured count
1 EGF 2.66
1 FGF_2 390.1
1 Eotaxin 6.75
1 TGF 0
1 G_CSF 520
3 EGF 45
5 FGF_2 4
4 Eotaxin 0
1 TGF 0
1 G_CSF 43
....
So it seems the cluster number and count columns are correct, whereas the cytokine measured column just keeps repeating the same 5 variable names instead of the 42 variables I want to see.
I think the table conversion step is wrong, but I don't quite know what went wrong or how to fix it.
Please enlighten me.

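The likely problem in your call is that the second argument of pivot_longer() is cols, so the cluster values 1 to 5 were read as column positions and only the first five columns were pivoted. We can try this instead. I simulate something that looks like your data frame: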
df = data.frame(
ID=1:74,matrix(rnorm(74*43),ncol=43)
)
colnames(df)[-1] = paste0("Measurement",1:43)
df$cluster = cutree(hclust(dist(scale(df[,-1]))),5)
df$cluster = factor(df$cluster)
Then melt:
library(ggplot2)
library(tidyr)
library(dplyr)
melted_df = df %>% pivot_longer(-c(cluster,ID),values_to = "count")
g = ggplot(melted_df,aes(x=cluster,y=count,col=cluster)) + geom_boxplot() + facet_wrap(~name,ncol=5,scales="free_y")
You can save it as a bigger plot to look at:
ggsave("plot.pdf",plot=g,width=15,height=15)
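Before plotting, it can help to confirm that the long table really contains every measured variable, since the symptom in the question was only five names repeating. A minimal sketch on a small made-up data frame (`toy`, with four hypothetical measurement columns) might look like this:

```r
library(tidyr)

# Small made-up data: 6 observations, 4 measurements, 3 clusters (all hypothetical)
set.seed(1)
toy <- data.frame(ID = 1:6, matrix(rnorm(6 * 4), ncol = 4))
colnames(toy)[-1] <- paste0("Measurement", 1:4)
toy$cluster <- factor(rep(1:3, each = 2))

# Pivot every column except the identifiers; names go to the default "name" column
long <- pivot_longer(toy, -c(cluster, ID), values_to = "count")

# Each variable should appear once per observation
n_per_variable <- table(long$name)
print(n_per_variable)
```

If a variable is missing from this table, the problem is in the pivot step rather than in the ggplot call.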


Using lists and user-generated functions to run specific commands on multiple variables and datasets

I want to use lists and user-generated functions to run specific commands on multiple variables and datasets.
For example, I want to turn the table, cut, and color variables into factors using the as.factor(as.character()) command in R on 3 different datasets, diamonds, diamonds_bottom300, and diamonds_top300, with the results being put into 3 new, user-specified datasets called diamonds_post, diamonds_bottom300_post, and diamonds_top300_post.
I can do this the long way:
## long way to turn data into factors
### individually
#### for diamonds dataset
diamonds_post$table <- as.factor(as.character(diamonds$table))
diamonds_post$cut <- as.factor(as.character(diamonds$cut))
diamonds_post$color <- as.factor(as.character(diamonds$color))
#### for diamonds_bottom300 dataset
diamonds_bottom300_post$table <- as.factor(as.character(diamonds_bottom300$table))
diamonds_bottom300_post$cut <- as.factor(as.character(diamonds_bottom300$cut))
diamonds_bottom300_post$color <- as.factor(as.character(diamonds_bottom300$color))
#### for diamonds_top300 dataset
diamonds_top300_post$table <- as.factor(as.character(diamonds_top300$table))
diamonds_top300_post$cut <- as.factor(as.character(diamonds_top300$cut))
diamonds_top300_post$color <- as.factor(as.character(diamonds_top300$color))
## gives str of datasets
str(diamonds_post)
str(diamonds_top300_post)
str(diamonds_top300_post)
> ## gives str of datasets
> str(diamonds_post)
'data.frame': 53940 obs. of 10 variables:
$ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
$ color : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : Factor w/ 127 levels "43","44","49",..: 31 91 116 61 61 51 51 31 91 91 ...
$ price : int 326 326 327 334 335 336 336 337 337 338 ...
$ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
> str(diamonds_top300_post)
'data.frame': 327 obs. of 10 variables:
$ carat : num 0.23 0.86 0.84 0.7 0.76 0.57 0.74 0.91 0.98 0.71 ...
$ cut : Factor w/ 3 levels "Fair","Good",..: 2 1 1 1 1 1 1 1 1 1 ...
$ color : Factor w/ 7 levels "D","E","F","G",..: 2 2 4 4 4 2 3 5 2 1 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 5 2 3 7 5 7 4 2 2 4 ...
$ depth : num 56.9 55.1 55.1 58.8 59 58.7 61.1 61.3 53.3 56.9 ...
$ table : Factor w/ 12 levels "65","65.4","66",..: 1 6 4 3 7 3 5 4 4 1 ...
$ price : int 327 2757 2782 2797 2800 2805 2805 2825 2855 2858 ...
$ x : num 4.05 6.45 6.39 5.81 5.89 5.34 5.82 6.24 6.82 5.89 ...
$ y : num 4.07 6.33 6.2 5.9 5.8 5.43 5.75 6.19 6.74 5.84 ...
$ z : num 2.31 3.52 3.47 3.44 3.46 3.16 3.53 3.81 3.61 3.34 ...
> str(diamonds_top300_post)
'data.frame': 327 obs. of 10 variables:
$ carat : num 0.23 0.86 0.84 0.7 0.76 0.57 0.74 0.91 0.98 0.71 ...
$ cut : Factor w/ 3 levels "Fair","Good",..: 2 1 1 1 1 1 1 1 1 1 ...
$ color : Factor w/ 7 levels "D","E","F","G",..: 2 2 4 4 4 2 3 5 2 1 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 5 2 3 7 5 7 4 2 2 4 ...
$ depth : num 56.9 55.1 55.1 58.8 59 58.7 61.1 61.3 53.3 56.9 ...
$ table : Factor w/ 12 levels "65","65.4","66",..: 1 6 4 3 7 3 5 4 4 1 ...
$ price : int 327 2757 2782 2797 2800 2805 2805 2825 2855 2858 ...
$ x : num 4.05 6.45 6.39 5.81 5.89 5.34 5.82 6.24 6.82 5.89 ...
$ y : num 4.07 6.33 6.2 5.9 5.8 5.43 5.75 6.19 6.74 5.84 ...
$ z : num 2.31 3.52 3.47 3.44 3.46 3.16 3.53 3.81 3.61 3.34 ...
I tried to create a user-generated function to do this task, and also a corresponding list:
### creates function to turn a variable into factor form
function_turn_dataset_variable_into_factor_form <-
# ---- NOTE: turns variable into a factor via as.factor(as.character())
# ---- NOTE: variable_name == name of the variable to be turned into a factor
# ---- NOTE: dataset_name == name of the dataset that contains the variable
# ---- NOTE: returns a copy of the whole dataset with that one variable converted
function(variable_name, dataset_name)
{
# ---- NOTE: # changes variable_name and dataset_name to object
colmn1 <- variable_name
nm1 <- dataset_name
# ---- NOTE: inserts dataset into function
dataset_funct_object_A <-
data.frame(
get(nm1)
)
# ---- NOTE: transforms data into factor form
dataset_funct_object_A[[colmn1]] <- as.factor(as.character(dataset_funct_object_A[[colmn1]]))
# ---- NOTE: returns appropriate object
return(dataset_funct_object_A)
}
# ---- NOTE: dataset with lists of corresponding variables/dfs
variable_to_become_factors_in_specific_datasets
# A tibble: 9 x 3
variable_to_become_factors datasets_to_become_factors datasets_post
<chr> <chr> <chr>
1 color diamonds diamonds_post
2 color diamonds_bottom300 diamonds_bottom300_post
3 color diamonds_top300 diamonds_top300_post
4 cut diamonds diamonds_post
5 cut diamonds_bottom300 diamonds_bottom300_post
6 cut diamonds_top300 diamonds_top300_post
7 table diamonds diamonds_post
8 table diamonds_bottom300 diamonds_bottom300_post
9 table diamonds_top300 diamonds_top300_post
It does work when I use it individually, although it's not really faster to use the function than it is when I use the long way.
### runs user generated function on 1 variable/dataset
# ---- NOTE: gives structure of data
str(diamonds_post$color)
# ---- NOTE: runs function
diamonds_post <- function_turn_dataset_variable_into_factor_form(variable_to_become_factors_in_specific_datasets$variable_to_become_factors[1],variable_to_become_factors_in_specific_datasets$datasets_to_become_factors[1])
# ---- NOTE: gives structure of data
str(diamonds_post$color)
# ---- NOTE: works
# ---- NOTE: not really much faster than the long way
I can't really get it to work the way I want when I apply it to lists using mapply(). Is there any way to do this task with a user-generated function that returns the transformed variables to corresponding user-specified datasets that are different from the start dataset?
Thanks ahead of time for any help.
Here is the code used for the example:
# Loads packages
# ---- NOTE: making plots and diamonds dataset
if(!require(ggplot2)){install.packages("ggplot2")}
# ---- NOTE: run mixed effects models
if(!require(lme4)){install.packages("lme4")}
# ---- NOTE: for data wrangling
if(!require(dplyr)){install.packages("dplyr")}
# dataset creation
## for dataset with top 300 rows
# ---- NOTE: selects only the top 300 rows of the dataset
diamonds_top300 <- data.frame(dplyr::top_n(diamonds, 300, table))
# ---- NOTE: gives dataset info
head(diamonds_top300)
str(diamonds_top300)
colnames(diamonds_top300)
nrow(diamonds_top300)
# ---- NOTE: gives unique values of Fixed and Random effects, and dvs
unique(diamonds_top300$price)
unique(diamonds_top300$y)
unique(diamonds_top300$cut)
unique(diamonds_top300$color)
unique(diamonds_top300$carat)
unique(diamonds_top300$clarity)
unique(diamonds_top300$depth)
unique(diamonds_top300$table)
## for dataset with bottom 300 rows
### dataset
# ---- NOTE: selects only the bottom 300 rows of the dataset
diamonds_bottom300 <- data.frame(dplyr::top_n(diamonds, -300, table))
# ---- NOTE: gives dataset info
head(diamonds_bottom300)
str(diamonds_bottom300)
colnames(diamonds_bottom300)
nrow(diamonds_bottom300)
# ---- NOTE: gives unique values of Fixed and Random effects, and dvs
unique(diamonds_bottom300$price)
unique(diamonds_bottom300$y)
unique(diamonds_bottom300$cut)
unique(diamonds_bottom300$color)
unique(diamonds_bottom300$carat)
unique(diamonds_bottom300$clarity)
unique(diamonds_bottom300$depth)
unique(diamonds_bottom300$table)
### creates end result variables
diamonds_post <- data.frame(diamonds)
diamonds_top300_post <- data.frame(diamonds_top300)
diamonds_bottom300_post <- data.frame(diamonds_bottom300)
# turns variables into factor for using as.factor(as.character()) command
## data frame with transformation info
### creates list of variable names to turn into factors
variable_to_become_factors <-
data.frame(
variable_to_become_factors = c("table", "cut", "color")
)
### creates list of data frames for transformation
datasets_to_become_factors <-
data.frame(
datasets_to_become_factors = c("diamonds", "diamonds_bottom300", "diamonds_top300"),
datasets_post = c("diamonds_post", "diamonds_bottom300_post", "diamonds_top300_post")
)
### creates dataframe with all possible combinations of data
variable_to_become_factors_in_specific_datasets <-
tidyr::crossing(variable_to_become_factors, datasets_to_become_factors)
### splits variable_to_become_factors_in_specific_datasets data frame by data frame name
# ---- NOTE: creates list
variable_to_become_factors_in_specific_datasets_list <- split(variable_to_become_factors_in_specific_datasets, variable_to_become_factors_in_specific_datasets$datasets_to_become_factors)
# ---- NOTE: changes list object name
variable_to_become_factors_in_specific_datasets_list <-
setNames(variable_to_become_factors_in_specific_datasets_list, paste("variable_to_become_factors_in_specific_dataset",
datasets_to_become_factors$datasets_to_become_factors,
sep = "__")
)
# ---- NOTE: creates unique objects for each part list object
list2env(variable_to_become_factors_in_specific_datasets_list, .GlobalEnv)
# ---- NOTE: gathers objects with prefix
apropos("variable_to_become_factors_in_specific_dataset")
## long way to turn data into factors
### individually
#### for diamonds dataset
diamonds_post$table <- as.factor(as.character(diamonds$table))
diamonds_post$cut <- as.factor(as.character(diamonds$cut))
diamonds_post$color <- as.factor(as.character(diamonds$color))
#### for diamonds_bottom300 dataset
diamonds_bottom300_post$table <- as.factor(as.character(diamonds_bottom300$table))
diamonds_bottom300_post$cut <- as.factor(as.character(diamonds_bottom300$cut))
diamonds_bottom300_post$color <- as.factor(as.character(diamonds_bottom300$color))
#### for diamonds_top300 dataset
diamonds_top300_post$table <- as.factor(as.character(diamonds_top300$table))
diamonds_top300_post$cut <- as.factor(as.character(diamonds_top300$cut))
diamonds_top300_post$color <- as.factor(as.character(diamonds_top300$color))
## gives str of datasets
str(diamonds_post)
str(diamonds_top300_post)
str(diamonds_top300_post)
## medium way
### creates function to turn a variable into factor form
function_turn_dataset_variable_into_factor_form <-
# ---- NOTE: turns variable into a factor via as.factor(as.character())
# ---- NOTE: variable_name == name of the variable to be turned into a factor
# ---- NOTE: dataset_name == name of the dataset that contains the variable
# ---- NOTE: returns a copy of the whole dataset with that one variable converted
function(variable_name, dataset_name)
{
# ---- NOTE: # changes variable_name and dataset_name to object
colmn1 <- variable_name
nm1 <- dataset_name
# ---- NOTE: inserts dataset into function
dataset_funct_object_A <-
data.frame(
get(nm1)
)
# ---- NOTE: transforms data into factor form
dataset_funct_object_A[[colmn1]] <- as.factor(as.character(dataset_funct_object_A[[colmn1]]))
# ---- NOTE: returns appropriate object
return(dataset_funct_object_A)
}
### runs user generated function on 1 variable/dataset
# ---- NOTE: gives structure of data
str(diamonds_post$color)
# ---- NOTE: runs function
diamonds_post <- function_turn_dataset_variable_into_factor_form(variable_to_become_factors_in_specific_datasets$variable_to_become_factors[1],variable_to_become_factors_in_specific_datasets$datasets_to_become_factors[1])
# ---- NOTE: gives structure of data
str(diamonds_post$color)
# ---- NOTE: works
# ---- NOTE: not really much faster than the long way
### use mapply on individual lists
# ---- NOTE: applies functions to appropriate variables
function_test_object <-
mapply(function_turn_dataset_variable_into_factor_form,
variable_to_become_factors_in_specific_datasets$variable_to_become_factors, variable_to_become_factors_in_specific_datasets$datasets_to_become_factors, SIMPLIFY = FALSE)
# ---- NOTE: does not work as desired: returns a list of 9 data frames, each with only one variable converted
EDIT 1:
Results from commenter "Ronak Shah":
This didn't seem to work; it's probably because of my own ignorance with R.
Here were the steps:
Run all of the code associated with the "Here is the code used for the example:" portion of the original post (not displayed).
Run commenter's script (didn't work for me):
> #Define the columns to change
> cols <- c('table', 'cut', 'color')
> cols
[1] "table" "cut" "color"
> #Define the names of the dataframe to change
> original_names <- c('diamonds', 'diamonds_bottom300', 'diamonds_top300')
> original_names
[1] "diamonds" "diamonds_bottom300" "diamonds_top300"
> #New names of the changed dataframe
> new_names <- paste0(original_names, '_post')
> new_names
[1] "diamonds_post" "diamonds_bottom300_post" "diamonds_top300_post"
> #apply function to each column in each dataframe
> lapply(mget(original), function(x) {
+ x[cols] <- lapply(x[cols], function(y) as.factor(as.character(y)))
+ x
+ }) -> result
Error in mget(original) : object 'original' not found
> result
Error: object 'result' not found
> #Write to global environment.
> names(result) <- new_names
Error in names(result) <- new_names : object 'result' not found
> list2env(result, .GlobalEnv)
Error in list2env(result, .GlobalEnv) : object 'result' not found
On closer inspection, it may have failed because one of the calls was written as "original", not "original_names". Here are the results after this change:
> #Define the columns to change
> cols <- c('table', 'cut', 'color')
> cols
[1] "table" "cut" "color"
> #Define the names of the dataframe to change
> original_names <- c('diamonds', 'diamonds_bottom300', 'diamonds_top300')
> original_names
[1] "diamonds" "diamonds_bottom300" "diamonds_top300"
> #New names of the changed dataframe
> new_names <- paste0(original_names, '_post')
> new_names
[1] "diamonds_post" "diamonds_bottom300_post" "diamonds_top300_post"
> #apply function to each column in each dataframe
> lapply(mget(original_names), function(x) {
+ x[cols] <- lapply(x[cols], function(y) as.factor(as.character(y)))
+ x
+ }) -> result
Error: value for ‘diamonds’ not found
> result
Error: object 'result' not found
> #Write to global environment.
> names(result) <- new_names
Error in names(result) <- new_names : object 'result' not found
> list2env(result, .GlobalEnv)
Error in list2env(result, .GlobalEnv) : object 'result' not found
Not sure what to do. Any advice for a fix could help. It could be my own fault, and I'm just not seeing the error.
#Define the columns to change
cols <- c('table', 'cut', 'color')
#Define the names of the dataframe to change
original_names <- c('diamonds', 'diamonds_bottom300', 'diamonds_top300')
#New names of the changed dataframe
new_names <- paste0(original_names, '_post')
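#apply function to each column in each dataframe
#inherits = TRUE lets mget() find diamonds, which lives in the ggplot2 package rather than in the global environment
lapply(mget(original_names, inherits = TRUE), function(x) {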
x[cols] <- lapply(x[cols], function(y) as.factor(as.character(y)))
x
}) -> result
#Write to global environment.
names(result) <- new_names
list2env(result, .GlobalEnv)
Check the output for one dataframe -
str(diamonds_post)
#tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
# $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
# $ cut : Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
# $ color : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
# $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
# $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
# $ table : Factor w/ 127 levels "43","44","49",..: 31 91 116 61 61 51 51 31 91 91 ...
# $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
# $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
# $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
# $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
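The same idea can also be driven by a lookup table like the one in the question, so that each target name gets its own converted copy. The following is only a sketch under assumptions: the data frames `d_all` and `d_top`, the `lookup` table, and the helper `to_factor` are hypothetical stand-ins, and the sources are kept in a named list (which `mget(names, inherits = TRUE)` could build for real objects).

```r
# Toy stand-ins for the source data frames (hypothetical data)
d_all <- data.frame(table = c(55, 61, 61, 58),
                    cut = c("Good", "Ideal", "Fair", "Good"),
                    stringsAsFactors = FALSE)
d_top <- d_all[1:2, ]
sources <- list(d_all = d_all, d_top = d_top)

# Lookup table: one row per (column, source, target) combination
lookup <- data.frame(
  column = c("table", "cut", "table", "cut"),
  source = c("d_all", "d_all", "d_top", "d_top"),
  target = c("d_all_post", "d_all_post", "d_top_post", "d_top_post"),
  stringsAsFactors = FALSE
)

to_factor <- function(x) as.factor(as.character(x))

# One group of lookup rows per target; convert all listed columns in one pass
result <- lapply(split(lookup, lookup$target), function(rows) {
  dat <- sources[[rows$source[1]]]
  dat[rows$column] <- lapply(dat[rows$column], to_factor)
  dat
})

str(result)
```

`result` is a named list keyed by target name, so `list2env(result, .GlobalEnv)` would create the individual objects if they are really needed; keeping them in the list is usually easier to work with.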

How can I use dplyr to turn one column into 3 based on the characters in the original column?

Hopefully this makes sense. One column in my dataset (read in as character) holds one of three size categories, "(0,1.88]", "(1.88,4]", and "(4,10]", with multiple entries per category. I would like to combine all of my entries by plot (another column in the dataset), totaling the response for each size category in its own column.
Ideally, I'm trying to take data which has multiple responses in each Plot and end up with one total response for each plot, divided by size category. I'm hoping to get something like this:
Plot Total Response for (0,1.88] Total Response for (1.88,4] Total Response for (4,10]
Here is the head of my data. Not all of it is needed, only Plot, ounces, and tuber.diam. tuber.diam has the entries grouped into size categories.
head(newChippers)
Plot ounces Height Shape Area plot variety rate block width length tuber.oz.bin tuber.diam
1 2422 1.31 1.22 26122 3237 242 Lamoka 3 4 1.65 1.70 (0,4] (0,1.88]
2 2422 2.76 1.56 27853 5740 242 Lamoka 3 4 2.20 2.24 (0,4] (1.88,4]
3 2422 1.62 1.31 24125 3721 242 Lamoka 3 4 1.53 1.95 (0,4] (0,1.88]
4 2422 3.37 1.70 27147 6498 242 Lamoka 3 4 2.17 2.48 (0,4] (1.88,4]
5 2422 3.19 1.70 27683 6126 242 Lamoka 3 4 2.22 2.34 (0,4] (1.88,4]
6 2422 2.83 1.53 27356 6009 242 Lamoka 3 4 2.00 2.53 (0,4] (1.88,4]
Here is what I currently have for making the new dataset:
YieldSizeProfileDiameter <- newChippers %>%
group_by(Plot) %>%
summarize(totalOz = sum(Weight),
Diameter.0.1.88 = (tuber.diam("(0,1.88]")),
Diameter.1.88.4 = (tuber.diam(" (1.88,4]")),
Diameter.4.10 = (tuber.diam(" (4,10]")))
I get the following error code:
Error in x[[n]] : object of type 'closure' is not subsettable
Any help would be very much appreciated! Again, I'm very sorry if I've explained it poorly or made it too complicated. If any additional information is needed, I can try to provide it. Thank you!
I have revised your code. There is no Weight variable in the newChippers data you posted, so I assume it is the same as ounces. I keep Weight here, as in your code:
YieldSizeProfileDiameter <- newChippers %>%
group_by(Plot, tuber.diam) %>%
summarize(totalOz = sum(Weight)) %>%
pivot_wider(names_from = tuber.diam, values_from = totalOz)
YieldSizeProfileDiameter
I have not tested the code on my side as I do not have the data.
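For a quick check of the group-then-widen approach, here is a minimal sketch on a made-up data frame. The `toy` rows are hypothetical (loosely modeled on the head of newChippers), and it sums `ounces` on the assumption that this is the intended weight column; `values_fill = 0` is an extra choice so that plots with no tubers in a size class show 0 rather than NA.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows in the shape of newChippers (only the three needed columns)
toy <- data.frame(
  Plot = c(2422, 2422, 2422, 2423, 2423),
  ounces = c(1.31, 2.76, 1.62, 5.00, 3.37),
  tuber.diam = c("(0,1.88]", "(1.88,4]", "(0,1.88]", "(4,10]", "(1.88,4]"),
  stringsAsFactors = FALSE
)

# Total ounces per plot and size class, then one column per size class
wide <- toy %>%
  group_by(Plot, tuber.diam) %>%
  summarise(totalOz = sum(ounces), .groups = "drop") %>%
  pivot_wider(names_from = tuber.diam, values_from = totalOz, values_fill = 0)

print(wide)
```

Because the new column names contain parentheses and commas, they have to be referred to with backticks or `[[ ]]`, for example `wide[["(4,10]"]]`.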

How to apply a function from a package to a dataframe

How can I apply a package function to a data frame ?
I have a data set (df) with two columns (total and n). I would like to apply the pois.exact function (pois.exact(x, pt = 1, conf.level = 0.95)) from the epitools package with x = df$n and pt = df$total, and get a "new" data frame (new_df) with 3 more columns holding the corresponding rounded rates and their lower and upper CIs.
df <- data.frame("total" = c(35725302,35627717,34565295,36170648,38957933,36579643,29628394,18212075,39562754,1265055), "n" = c(24,66,166,461,898,1416,1781,1284,329,12))
> df
total n
1 35725302 24
2 35627717 66
3 34565295 166
4 36170648 461
5 38957933 898
6 36579643 1416
7 29628394 1781
8 18212075 1284
9 9562754 329
In fact, the data frame is much longer.
For example, for the first row the desired results are:
require (epitools)
round (pois.exact (24, pt = 35725302, conf.level = 0.95)* 100000, 2)[3:5]
rate lower upper
1 0.07 0.04 0.1
The new dataframe with the added results by applying the pois.exact function should look like that.
> new_df
total n incidence lower_95IC upper_95IC
1 35725302 24 0.07 0.04 0.10
2 35627717 66 0.19 0.14 0.24
3 34565295 166 0.48 0.41 0.56
4 36170648 461 1.27 1.16 1.40
5 38957933 898 2.31 2.16 2.46
6 36579643 1416 3.87 3.67 4.08
7 29628394 1781 6.01 5.74 6.03
8 18212075 1284 7.05 6.67 7.45
9 9562754 329 3.44 3.08 3.83
Thanks.
library(epitools)
library(dplyr)
# n is the event count (x) and total the person-time (pt); rates are scaled to per 100,000 and rounded
df %>%
cbind( pois.exact(df$n, pt = df$total, conf.level = 0.95) ) %>%
dplyr::mutate( across(c(rate, lower, upper), ~ round(.x * 100000, 2)) ) %>%
dplyr::select( total, n, incidence = rate, lower_95IC = lower, upper_95IC = upper )
# The first row should then read 0.07, 0.04 and 0.10, as in the example in the question.
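If epitools is not available, a similar table can be sketched in base R with `poisson.test()` from the stats package, which also returns an exact Poisson interval. This swaps in a different function from the one asked about, so agreement with `pois.exact` is an assumption to verify on your own data; the two rows below are copied from the question.

```r
# Two rows copied from the question's data frame
df <- data.frame(total = c(35725302, 35627717), n = c(24, 66))

# Exact Poisson interval for each row; T is the person-time denominator
ci <- t(mapply(function(n, total) {
  poisson.test(n, T = total, conf.level = 0.95)$conf.int
}, df$n, df$total))

# Scale to rates per 100,000 and round to two decimals
new_df <- cbind(
  df,
  incidence  = round(df$n / df$total * 1e5, 2),
  lower_95IC = round(ci[, 1] * 1e5, 2),
  upper_95IC = round(ci[, 2] * 1e5, 2)
)

print(new_df)
```

`mapply` walks the two columns in parallel, so each row gets its own interval without an explicit loop.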

Unable to apply ddply-summarise in R correctly

new here and new to R, so bear with me, please.
I have a data.frame similar to this:
time. variable TEER
1 0.07 cntrl 234.2795
2 1.07 cntrl 602.8245
3 2.07 cntrl 703.6844
4 3.07 cntrl 699.4538
...
48 0.07 cntrl 234.2795
49 1.07 cntrl 602.8245
50 2.07 cntrl 703.6844
51 3.07 cntrl 699.4538
...
471 0.07 agr1111 251.9119
472 1.07 agr1111 480.1573
473 2.07 agr1111 629.3744
474 3.07 agr1111 676.6782
...
518 0.07 agr1111 251.9119
519 1.07 agr1111 480.1573
520 2.07 agr1111 629.3744
521 3.07 agr1111 676.6782
...
753 0.07 agr2222 350.1049
754 1.07 agr2222 306.6072
755 2.07 agr2222 346.0387
756 3.07 agr2222 447.0137
757 4.07 agr2222 530.2433
...
802 2.07 agr2222 346.0387
803 3.07 agr2222 447.0137
804 4.07 agr2222 530.2433
805 5.07 agr2222 591.2122
I'm trying to apply ddply() to this data frame to get a new data frame with means and standard error (to plot later) like so:
> ddply(data_melt, c("time.", "variable"), summarise,
mean = mean(TEER), sd = sd(TEER),
sem = sd(TEER)/sqrt(length(TEER)))
What I get as an output data frame are the same TEER values in the mean column as in the first rows of the original data frame, and zeroes in the sd and sem columns. I also get a warning:
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
It looks like the function only goes through the first part of the data frame and doesn't bother looking at the duplicates of time. and variable group?
I already tried looking at the solutions to similar problems here but nothing seems to work. Am I missing something or is this a legitimate problem?
Any help / tips appreciated.
P.S Let me know if I'm not explaining the problem coherently enough and I'll try to go into more detail.
I think I've found a way around my problem.
Initially, when I load the data frame, each of the variables ("cntrl", "agr1111", "agr2222") has a unique letter and number attached ("A1", "A2", "B1", "B2"), so they look like this: "cntrl.A1", "agr1111.B2". Instead of stripping the letter-number suffix from each of them using gsub, I tried using filter with grepl to isolate the rows that I need and then summarise them.
Here's the code:
library(dplyr)
dt_11 <- dt %>%
group_by(time.) %>%
filter(grepl("agr1111", variable)) %>%
summarise(avg_11 = mean(teer),
sd_11 = sd(teer),
sem_11 = sd(teer)/sqrt(length(teer)))
This only gives me a data frame with one group of variables ("agr1111") and I'll have to do this two more times, for "cntrl" and "agr2222", hence resulting in 3 data frames. But I'm sure, I'll be able to either merge the data frames or plot them on the same graph separately.
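An alternative that avoids three separate data frames is to strip the well suffix first and then summarise by time and treatment together. This is a sketch on made-up values: the `dt` rows and the `group` column are hypothetical, and the regular expression assumes suffixes of the form ".A1" or ".B2" as described above.

```r
# Made-up rows with the ".A1"-style suffix described above (values are hypothetical)
dt <- data.frame(
  time. = c(0.07, 0.07, 0.07, 0.07),
  variable = c("cntrl.A1", "cntrl.A2", "agr1111.B1", "agr1111.B2"),
  TEER = c(230, 240, 250, 260),
  stringsAsFactors = FALSE
)

# Drop the trailing ".<letter><digits>" so replicates share one group label
dt$group <- sub("\\.[A-Z][0-9]+$", "", dt$variable)

# Mean, sd and standard error per time point and treatment group
stats <- aggregate(TEER ~ time. + group, data = dt, FUN = function(x) {
  c(mean = mean(x), sd = sd(x), sem = sd(x) / sqrt(length(x)))
})

# aggregate() stores the three statistics as a matrix column; flatten it
stats <- do.call(data.frame, stats)
print(stats)
```

After flattening, the statistics appear as columns named `TEER.mean`, `TEER.sd` and `TEER.sem`, one row per time and group.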
This doesn't fit as an answer, but it is too long for a comment:
I ran your exact code and everything works fine!
> ddply(dt, c("time.", "variable"), summarise,
+ mean = mean(TEER), sd = sd(TEER),
+ sem = sd(TEER)/sqrt(length(TEER)), count = length(TEER))
#time. variable mean sd sem count
# 0.07 agr1111 251.9119 0 0 2
# 0.07 agr2222 350.1049 NA NA 1
# 0.07 cntrl 234.2795 0 0 2
# 1.07 agr1111 480.1573 0 0 2
# 1.07 agr2222 306.6072 NA NA 1
# 1.07 cntrl 602.8245 0 0 2
# 2.07 agr1111 629.3744 0 0 2
# 2.07 agr2222 346.0387 0 0 2
# 2.07 cntrl 703.6844 0 0 2
# 3.07 agr1111 676.6782 0 0 2
# 3.07 agr2222 447.0137 0 0 2
# 3.07 cntrl 699.4538 0 0 2
# 4.07 agr2222 530.2433 0 0 2
# 5.07 agr2222 591.2122 NA NA 1
> sessionInfo()
#other attached packages:
#[1] plyr_1.8.4
Could you update to the latest version of the packages? I am not sure of the cause of your problem. I hope you understand how sd is actually calculated and why NA appears. (HINT: look at the count column.)
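The hint about the count column can be checked directly in base R. This small sketch uses two values taken from the example data; the variable names are only for illustration.

```r
# Values taken from the example data in the question
one_value  <- 350.1049                # a group with a single observation
duplicates <- c(234.2795, 234.2795)   # a group holding the same value twice

sd_single <- sd(one_value)    # NA: the sample sd needs at least two observations
sd_dupes  <- sd(duplicates)   # 0: identical values have no spread
sem_dupes <- sd_dupes / sqrt(length(duplicates))

print(c(sd_single = sd_single, sd_dupes = sd_dupes, sem_dupes = sem_dupes))
```

So zeroes in the sd and sem columns point to exact duplicates within a group, and NA points to a group with only one row.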

Get monthly means from dataframe of several years of daily temps

I have daily temperature values for several years, 1949-2010. I would like to calculate monthly means. Here is an example of the data:
head(tmeasmax)
TIMESTEP MEAN.C. MINIMUM.C. MAXIMUM.C. VARIANCE.C.2. STD_DEV.C. SUM COUNT
1949-01-01 6.836547 6.65 7.33 0.02850574 0.1688364 1.426652 6
1949-01-02 10.533371 10.24 10.74 0.06140426 0.2477988 1.426652 6
1949-01-03 18.746729 18.02 19.78 0.18507860 0.4302076 1.426652 6
1949-01-04 21.244562 20.09 22.40 0.76106980 0.8723931 1.426652 6
1949-01-05 3.826716 3.11 5.37 0.52706647 0.7259935 1.426652 6
1949-01-06 9.127782 8.46 10.26 0.20236358 0.4498484 1.426652 6
str(tmeasmax)
'data.frame': 22645 obs. of 8 variables:
$ TIMESTEP : Date, format: "1949-01-01" "1949-01-02" ...
$ MEAN.C. : num 6.84 10.53 18.75 21.24 3.83 ...
$ MINIMUM.C. : num 6.65 10.24 18.02 20.09 3.11 ...
$ MAXIMUM.C. : num 7.33 10.74 19.78 22.4 5.37 ...
$ VARIANCE.C.2.: num 0.0285 0.0614 0.1851 0.7611 0.5271 ...
$ STD_DEV.C. : num 0.169 0.248 0.43 0.872 0.726 ...
$ SUM : num 1.43 1.43 1.43 1.43 1.43 ...
$ COUNT : int 6 6 6 6 6 6 6 6 6 6 ...
There is a previous question that I couldn't make heads or tails of. I imagine I can probably use aggregate, but I don't know how to break up the dates into the years and months and then approach the nesting of the months inside the years. I tried a loop inside of a loop, but I can never get nested loops to work.
EDIT to reply to comments/questions:
I was looking for the mean of "MEAN.C."
Here's a quick data.table solution. I'm assuming you want the means of MEAN.C. (?)
library(data.table)
setDT(tmeasmax)[, .(MonthlyMeans = mean(MEAN.C.)), by = .(year(TIMESTEP), month(TIMESTEP))]
# year month MonthlyMeans
# 1: 1949 1 11.71928
You can also do this for all the columns at once if you want
tmeasmax[, lapply(.SD, mean), by = .(year(TIMESTEP), month(TIMESTEP))]
# year month MEAN.C. MINIMUM.C. MAXIMUM.C. VARIANCE.C.2. STD_DEV.C. SUM COUNT
# 1: 1949 1 11.71928 11.095 12.64667 0.2942481 0.482513 1.426652 6
Here's a way to do it with the dplyr package:
library(dplyr)
library(lubridate)
tmeasmax$TIMESTEP = ymd(tmeasmax$TIMESTEP)
tmeasmax %>%
group_by(Year=year(TIMESTEP), Month=month(TIMESTEP)) %>%
summarise(meanDailyMin=mean(MINIMUM.C.),
meanDailyMean=mean(MEAN.C.))
Year Month meanDailyMin meanDailyMean
1 1949 1 11.095 11.71928
You can summarise any other column by month in a similar way.
You can use the lubridate package to create a new grouping variable consisting of the year-month combinations, then use aggregate.
library('lubridate')
tmeasmax2 <- within(tmeasmax, {
monthlies <- paste(year(TIMESTEP),
month(TIMESTEP))
})
# monthlies is a column of tmeasmax2, so refer to it through the formula interface
aggregate(MEAN.C. ~ monthlies, data = tmeasmax2, FUN = mean, na.rm = TRUE)
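The year-month key can also be built without any extra package by formatting the Date column. A minimal base R sketch, using a made-up two-month series (`toy`, with hypothetical temperatures) in place of tmeasmax:

```r
# Made-up daily series covering January and February 1949 (values are hypothetical)
toy <- data.frame(
  TIMESTEP = seq(as.Date("1949-01-01"), by = "day", length.out = 59),
  MEAN.C.  = c(rep(10, 31), rep(20, 28))
)

# "YYYY-MM" key: months are nested inside years and sort chronologically
toy$ym <- format(toy$TIMESTEP, "%Y-%m")

# One mean per year-month combination
monthly <- aggregate(MEAN.C. ~ ym, data = toy, FUN = mean)
print(monthly)
```

Zero-padded "YYYY-MM" strings sort in calendar order, which `paste(year, month)` does not guarantee (for example "1949 10" sorts before "1949 2").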
