Unable to apply ddply-summarise in R correctly

new here and new to R, so bear with me, please.
I have a data.frame similar to this:
time. variable TEER
1 0.07 cntrl 234.2795
2 1.07 cntrl 602.8245
3 2.07 cntrl 703.6844
4 3.07 cntrl 699.4538
...
48 0.07 cntrl 234.2795
49 1.07 cntrl 602.8245
50 2.07 cntrl 703.6844
51 3.07 cntrl 699.4538
...
471 0.07 agr1111 251.9119
472 1.07 agr1111 480.1573
473 2.07 agr1111 629.3744
474 3.07 agr1111 676.6782
...
518 0.07 agr1111 251.9119
519 1.07 agr1111 480.1573
520 2.07 agr1111 629.3744
521 3.07 agr1111 676.6782
...
753 0.07 agr2222 350.1049
754 1.07 agr2222 306.6072
755 2.07 agr2222 346.0387
756 3.07 agr2222 447.0137
757 4.07 agr2222 530.2433
...
802 2.07 agr2222 346.0387
803 3.07 agr2222 447.0137
804 4.07 agr2222 530.2433
805 5.07 agr2222 591.2122
I'm trying to apply ddply() to this data frame to get a new data frame with means and standard error (to plot later) like so:
> ddply(data_melt, c("time.", "variable"), summarise,
mean = mean(TEER), sd = sd(TEER),
sem = sd(TEER)/sqrt(length(TEER)))
In the output data frame, the mean column contains the same TEER values as the first rows of the original data frame, and the sd and sem columns are all zeroes. I also get a warning:
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else
paste0(labels, : duplicated levels in factors are deprecated
It looks as though the function only works through the first part of the data frame and never reaches the later rows that share the same time. and variable groups.
I already tried looking at the solutions to similar problems here but nothing seems to work. Am I missing something or is this a legitimate problem?
Any help / tips appreciated.
P.S Let me know if I'm not explaining the problem coherently enough and I'll try to go into more detail.

I think I've found a way around my problem.
Initially, when I load the data frame, each of the variables ("cntrl", "agr1111", "agr2222") carries a unique letter-number suffix ("A1", "A2", "B1", "B2"), so they look like "cntrl.A1" or "agr1111.B2". Instead of stripping the suffix from each of them with gsub, I tried using filter with grepl to isolate the rows I need and then summarise them.
Here's the code:
library(dplyr)
dt_11 <- dt %>%
  group_by(time.) %>%
  filter(grepl("agr1111", variable)) %>%
  summarise(avg_11 = mean(teer),
            sd_11 = sd(teer),
            sem_11 = sd(teer)/sqrt(length(teer)))
This only gives me a data frame for one group of variables ("agr1111"), and I'll have to do it two more times, for "cntrl" and "agr2222", resulting in three data frames. But I'm sure I'll be able to either merge the data frames or plot them separately on the same graph.
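For reference, the three groups can usually be summarised in one go by stripping the suffix and grouping on both columns. This is only a sketch: it assumes the suffix always follows a dot (as in "cntrl.A1") and that the measurement column is called teer, as in the code above.
library(dplyr)

dt_all <- dt %>%
  mutate(treatment = gsub("\\..*$", "", variable)) %>%   # drop the ".A1"/".B2" suffix (assumed pattern)
  group_by(time., treatment) %>%
  summarise(avg = mean(teer),
            sd  = sd(teer),
            sem = sd(teer) / sqrt(length(teer)))
This avoids having to merge three separate data frames before plotting.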

This doesn't quite fit as an answer, but it's too long for a comment:
I ran your exact code and everything works fine!
> ddply(dt, c("time.", "variable"), summarise,
+ mean = mean(TEER), sd = sd(TEER),
+ sem = sd(TEER)/sqrt(length(TEER)), count = length(TEER))
#time. variable mean sd sem count
# 0.07 agr1111 251.9119 0 0 2
# 0.07 agr2222 350.1049 NA NA 1
# 0.07 cntrl 234.2795 0 0 2
# 1.07 agr1111 480.1573 0 0 2
# 1.07 agr2222 306.6072 NA NA 1
# 1.07 cntrl 602.8245 0 0 2
# 2.07 agr1111 629.3744 0 0 2
# 2.07 agr2222 346.0387 0 0 2
# 2.07 cntrl 703.6844 0 0 2
# 3.07 agr1111 676.6782 0 0 2
# 3.07 agr2222 447.0137 0 0 2
# 3.07 cntrl 699.4538 0 0 2
# 4.07 agr2222 530.2433 0 0 2
# 5.07 agr2222 591.2122 NA NA 1
> sessionInfo()
#other attached packages:
#[1] plyr_1.8.4
Could you update to the latest version of the packages? I am not sure of the cause of your problem. I hope you understand how sd is actually calculated and why NA appears. (Hint: look at the count column.)
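For context, sd() needs at least two observations: with a single value it returns NA, which is why the groups with count 1 above show NA, while the groups with count 2 contain two identical values and therefore have sd 0. A quick illustration in plain R:
sd(c(234.2795, 234.2795))   # two identical values -> 0
sd(350.1049)                # a single value -> NA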

Related

Saving function output to specific place in dataframe

I'm working on a function that predicts using a gbm model, one row at a time. Then, I want to save the predicted value in a specific place in my DF so that the next value can be predicted with that output included. Basically, a prediction with a lagged dependent variable. Below is a snippet of my data
DEC AAA_CCC BBBB LLLLL DDD_SHR ST_DSC WKG.P WKG.P.1T _CHNG XXXX_pr XXXX_pr_r XXXX_vol XXXX_.T.1.
38 0 0.99 0 0 0.51 8.28 0 0 6.04 2.84 2.84 10.49 9.83
39 0 0.99 0 0 0.51 8.27 0 0 5.97 2.75 2.75 10.33 10.49
40 0 1.04 0 0 0.51 8.27 0 0 6.01 2.81 2.81 10.58 10.33
41 0 0.98 0 0 0.51 8.28 0 0 5.99 2.87 2.87 9.49 10.58
42 0 0.98 0 1 0.52 8.27 0 0 6.10 2.81 2.81 10.35 9.49
43 0 0.95 0 1 0.51 8.27 0 0 6.01 2.72 2.71 10.67 10.35
XXXX_wd XXXX_ICP_A XXXX_ICP_A_.T.1.
38 4.41 0 1
39 4.33 1 0
40 4.36 0 1
41 4.32 1 0
42 4.19 0 1
43 4.25 1 0
This function needs to: find columns with specific names within a DF, check whether they contain 0s, and, if so, predict a value based on the row holding the 0. It should then save the predicted value in place of that 0, and in the corresponding 0 of a different column, repeating until there are no more 0s in the 'vol' column.
I've come up with this:
PREDICTION <- function(a, model)
{
  vol <- select(a, ends_with("vol"))
  vol_1 <- select(a, ends_with("vol_.T.1."))
  while (min(which(a[, colnames(vol)] == 0)) != 0) {
    PRED <- predict(model, a[min(which(a[, colnames(vol)] == 0)), ])
    a[[min(which(a[, colnames(vol_1)] == 0)), colnames(vol_1)]] <<- print(PRED)
    a[[min(which(a[, colnames(vol)] == 0)), colnames(vol)]] <<- print(PRED)
  }
}
It prints the right values but doesn't save them the way I wanted, so the while part also fails: because the values are not saved properly, it loops over the same row forever. I've tried replacing the print with return, which didn't change anything. I don't really know where to go from here, so I appreciate any help.
I have found a solution - definitely not elegant, and I will work on improving it. Posting it in case anyone is interested.
I simplified the loop within my function - for instead of while. The rows I'm working on are always in the same range, so this also makes the code visually easier to follow. I will keep tinkering with it, so a more universal version may make a comeback.
PREDICTION <- function(a, model)
{
  vol <- select(a, ends_with("vol"))
  vol_1 <- select(a, ends_with("vol_.T.1."))
  for (i in 54:72) {
    a[[i, colnames(vol)]] <- print(predict(model, a[i, ]))
    a[[i + 1, colnames(vol_1)]] <- print(predict(model, a[i, ]))
  }
  return(a)
}
Then I use mapply to map this function over my list, which contains the data frames and the models to use:
LISTA$DF <- mapply(PREDICTION, a=LISTA$DF, model = LISTA$GBM)
Essentially this was mostly a syntax problem on my end. Just goes to show how much more I have to learn to be able to code more efficiently and functionally.
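For what it's worth, the earlier while version most likely failed because <<- assigns to a variable in the calling environment instead of modifying the local copy of a, and the function never returned a. Below is a sketch of the same idea with ordinary assignment and an explicit return; the column selection, the i + 1 lag and the bare predict(model, a[i, ]) call are carried over from the code above, so treat it as untested.
library(dplyr)

PREDICTION_while <- function(a, model) {
  vol   <- colnames(select(a, ends_with("vol")))
  vol_1 <- colnames(select(a, ends_with("vol_.T.1.")))
  while (any(a[[vol]] == 0, na.rm = TRUE)) {
    i <- min(which(a[[vol]] == 0))      # first row whose 'vol' value is still 0
    PRED <- predict(model, a[i, ])
    a[[vol]][i]       <- PRED           # fill the current row ...
    a[[vol_1]][i + 1] <- PRED           # ... and the lagged column of the next row
  }
  a                                     # return the modified data frame
}
One caveat with the mapply() call: mapply() simplifies its result by default, so SIMPLIFY = FALSE may be needed if the goal is to get a list of data frames back.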

R: ggplot to visualize all variables in each cluster after cluster analysis

Sorry in advance if the post isn't clear.
So I have my data frame, with 74 observations and 43 columns, and I performed a cluster analysis on it.
I then got 5 clusters and assigned the cluster number to each respective row.
Now my df has 74 rows (obs) and 44 variables, and I would like to plot, for all variables, which ones are enriched in each cluster and which are not.
I want to achieve this by ggplot.
My imagined output panel would have 5 boxplots per row and 42 rows of plots, with each row describing one variable measured in the dataset.
Example of the dataset (sorry, it's very big, so I made an example; actual values are different):
df
ID EGF FGF_2 Eotaxin TGF G_CSF Flt3L GMSF Frac IFNa2 .... Cluster
4300 4.21 139.32 3.10 0 1.81 3.48 1.86 9.51 9.41 .... 1
2345 7.19 233.10 0 1.81 3.48 1.86 9.41 0 11.4 .... 1
4300 4.21 139.32 4.59 0 1.81 3.48 1.86 9.51 9.41 .... 1
....
3457 0.19 233.10 0 1.99 3.48 1.86 9.41 0 20.4 .... 3
5420 4.21 139.32 3.10 0.56 1.81 3.48 1.86 9.51 29.8 .... 1
2334 7.19 233.10 2.68 2.22 3.48 1.86 9.41 0 28.8 .... 5
str(df)
$ ID : Factor w/ 45 levels "4300"..... : 44 8 24 ....
$ EGF : num ....
$ FGF_2 : num ....
$ Eotaxin : num ....
....
$ Cluster : Factor w/ 5 levels "1" , "2"...: 1 1 1.....3 1 5
#now plotting
#thought I pivot the datafram
new_df <- pivot_longer(df[,2:44],df$cluster, names_to = "Cytokine measured", values_to = "count")
#ggplot
ggplot(new_df,aes(x = new_df$cluster, y = new_df$count))+
geom_boxplot(width=0.2,alpha=0.1)+
geom_jitter(width=0.15)+
facet_grid(new_df$`Cytokine measured`~new_df$cluster, scales = 'free')
So the code did generate a small panel of graphs that fits my imagined output, but I can see only 5 rows instead of 42.
So going back to new_df, the last 3 columns draw my attention:
Cluster Cytokine measured count
1 EGF 2.66
1 FGF_2 390.1
1 Eotaxin 6.75
1 TGF 0
1 G_CSF 520
3 EGF 45
5 FGF_2 4
4 Eotaxin 0
1 TGF 0
1 G_CSF 43
....
So it seems the cluster number and count columns are correct, whereas the cytokine-measured column just keeps repeating the same 5 variable names instead of the 42 variables I want to see.
I think the table conversion step is wrong, but I don't quite know what went wrong or how to fix it.
Please enlighten me.
We can try this; I'll simulate something that looks like your data frame:
df = data.frame(
ID=1:74,matrix(rnorm(74*43),ncol=43)
)
colnames(df)[-1] = paste0("Measurement",1:43)
df$cluster = cutree(hclust(dist(scale(df[,-1]))),5)
df$cluster = factor(df$cluster)
Then melt:
library(ggplot2)
library(tidyr)
library(dplyr)
melted_df = df %>% pivot_longer(-c(cluster, ID), values_to = "count")
g = ggplot(melted_df, aes(x = cluster, y = count, col = cluster)) +
  geom_boxplot() +
  facet_wrap(~name, ncol = 5, scales = "free_y")
You can save it as a bigger plot to look at:
ggsave(g,file="plot.pdf",width=15,height=15)
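Applied to the data frame from the question (column names ID and Cluster as in the example shown there), the reshape would look roughly like the sketch below. The likely culprit in the original attempt is the second argument of pivot_longer(): passing df$cluster there is treated as a column selection, so only a handful of columns get pivoted.
library(tidyr)
library(ggplot2)

new_df <- pivot_longer(df, cols = -c(ID, Cluster),
                       names_to = "Cytokine", values_to = "count")

ggplot(new_df, aes(x = Cluster, y = count)) +
  geom_boxplot(width = 0.2, alpha = 0.1) +
  geom_jitter(width = 0.15) +
  facet_wrap(~Cytokine, ncol = 5, scales = "free_y")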

How can I use dplyr to turn one column into 3 based on the characters in the original column?

Hopefully this makes sense. I have one column in my dataset with multiple entries, each falling into one of three size categories (read in as characters): "(0,1.88]", "(1.88,4]", and "(4,10]". I would like to combine all of my entries together by plot (another column in the dataset), totaling the response for each size category in its own column.
Ideally, I'm trying to take data which has multiple responses in each Plot and end up with one total response for each plot, divided by size category. I'm hoping to get something like this:
Plot Total Response for (0,1.88] Total Response for (1.88,4] Total Response for (4,10]
Here is the head of my data. Not all of it is needed, only Plot, ounces, and tuber.diam. tuber.diam has the entries grouped into size categories.
head(newChippers)
Plot ounces Height Shape Area plot variety rate block width length tuber.oz.bin tuber.diam
1 2422 1.31 1.22 26122 3237 242 Lamoka 3 4 1.65 1.70 (0,4] (0,1.88]
2 2422 2.76 1.56 27853 5740 242 Lamoka 3 4 2.20 2.24 (0,4] (1.88,4]
3 2422 1.62 1.31 24125 3721 242 Lamoka 3 4 1.53 1.95 (0,4] (0,1.88]
4 2422 3.37 1.70 27147 6498 242 Lamoka 3 4 2.17 2.48 (0,4] (1.88,4]
5 2422 3.19 1.70 27683 6126 242 Lamoka 3 4 2.22 2.34 (0,4] (1.88,4]
6 2422 2.83 1.53 27356 6009 242 Lamoka 3 4 2.00 2.53 (0,4] (1.88,4]
Here is what I currently have for making the new dataset:
YieldSizeProfileDiameter <- newChippers %>%
  group_by(Plot) %>%
  summarize(totalOz = sum(Weight),
            Diameter.0.1.88 = (tuber.diam("(0,1.88]")),
            Diameter.1.88.4 = (tuber.diam(" (1.88,4]")),
            Diameter.4.10 = (tuber.diam(" (4,10]")))
I get the following error message:
Error in x[[n]] : object of type 'closure' is not subsettable
Any help would be very much appreciated! Again, I'm very sorry if I've explained it poorly or made it too complicated. If any additional information is needed, I can try to provide it. Thank you!
I have revised your code. I assume your variable Weight is the same as the ounces variable, as there is no Weight variable in your newChippers data. I use Weight here as in your code:
library(dplyr)
library(tidyr)

YieldSizeProfileDiameter <- newChippers %>%
  group_by(Plot, tuber.diam) %>%
  summarize(totalOz = sum(Weight)) %>%
  pivot_wider(names_from = tuber.diam, values_from = totalOz)
YieldSizeProfileDiameter
I have not tested the code on my side as I do not have the data.

How to apply a function from a package to a dataframe

How can I apply a package function to a data frame?
I have a data set (df) with two columns (total and n) to which I would like to apply the pois.exact function (pois.exact(x, pt = 1, conf.level = 0.95)) from the epitools package, with x = df$n and pt = df$total, and get a "new" data frame (new_df) with 3 more columns holding the corresponding rounded rates and the lower and upper CIs.
df <- data.frame("total" = c(35725302,35627717,34565295,36170648,38957933,36579643,29628394,18212075,39562754,1265055), "n" = c(24,66,166,461,898,1416,1781,1284,329,12))
> df
total n
1 35725302 24
2 35627717 66
3 34565295 166
4 36170648 461
5 38957933 898
6 36579643 1416
7 29628394 1781
8 18212075 1284
9 9562754 329
In fact, the actual data frame is much longer.
For example, for the first row the desired results are:
require (epitools)
round (pois.exact (24, pt = 35725302, conf.level = 0.95)* 100000, 2)[3:5]
rate lower upper
1 0.07 0.04 0.1
The new data frame with the results added by applying the pois.exact function should look like this:
> new_df
total n incidence lower_95IC uppper_95IC
1 35725302 24 0.07 0.04 0.10
2 35627717 66 0.19 0.14 0.24
3 34565295 166 0.48 0.41 0.56
4 36170648 461 1.27 1.16 1.40
5 38957933 898 2.31 2.16 2.46
6 36579643 1416 3.87 3.67 4.08
7 29628394 1781 6.01 5.74 6.03
8 18212075 1284 7.05 6.67 7.45
9 9562754 329 3.44 3.08 3.83
Thanks.
library(epitools)
library(dplyr)

df %>%
  cbind(pois.exact(df$n, df$total)) %>%
  dplyr::select(total, n, rate, lower, upper)
# total n rate lower upper
# 1 35725302 24 1488554.25 1488066.17 1489042.45
# 2 35627717 66 539813.89 539636.65 539991.18
# 3 34565295 166 208224.67 208155.26 208294.10
# 4 36170648 461 78461.28 78435.71 78486.85
# 5 38957933 898 43383.00 43369.38 43396.62
# 6 36579643 1416 25833.08 25824.71 25841.45
# 7 29628394 1781 16635.82 16629.83 16641.81
# 8 18212075 1284 14183.86 14177.35 14190.37
# 9 39562754 329 120251.53 120214.06 120289.01
# 10 1265055 12 105421.25 105237.62 105605.12
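To reproduce the rounded per-100,000 figures shown in the desired new_df, the same idea can be extended along the lines of the single-row example in the question. This is a sketch; it assumes pois.exact() returns columns named rate, lower and upper, which is what the [3:5] indexing above relies on.
library(epitools)
library(dplyr)

# per-100,000 rates with 95% CIs, rounded to 2 decimals as in the question
ci <- round(pois.exact(df$n, pt = df$total, conf.level = 0.95)[, c("rate", "lower", "upper")] * 1e5, 2)

new_df <- df %>%
  cbind(ci) %>%
  rename(incidence = rate, lower_95IC = lower, upper_95IC = upper)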

Transformation of missing values by taking log(x+1)

I am trying to learn R, and I have a data frame which contains 68 continuous and categorical variables. There are two variables, x and lnx, on which I need help. Corresponding to a large number of 0's and NA's in x, lnx shows NA. Now I want to write code that takes log(x + 1) in order to replace the NA's in lnx with 0 wherever the corresponding x is 0 (if x == 0, I want lnx == 0; if x is NA, I want lnx to stay NA). The data frame looks something like this -
a b c d e f x lnx
AB1001 1.00 3.00 67.00 13.90 2.63 1776.7 7.48
AB1002 0.00 2.00 72.00 38.70 3.66 0.00 NA
AB1003 1.00 3.00 48.00 4.15 1.42 1917 7.56
AB1004 0.00 1.00 70.00 34.80 3.55 NA NA
AB1005 1.00 1.00 34.00 3.45 1.24 3165.45 8.06
AB1006 1.00 1.00 14.00 7.30 1.99 NA NA
AB1007 0.00 3.00 53.00 11.20 2.42 0.00 NA
I tried writing the following code -
data.frame$lnx[is.na(data.frame$lnx)] <- log(data.frame$x +1)
but I get the following warning message and the output is wrong:
number of items to replace is not a multiple of replacement length
Can someone guide me, please?
Thanks.
In R you can select rows using conditionals and assign values directly. In your example you could do this:
df[is.na(df$lnx) & df$x == 0,'lnx'] <- 0
Here's what this does:
is.na(df$lnx) returns a logical vector the length of df$lnx telling, for each row, whether lnx is NA. df$x == 0 does the same thing, checking whether, for each row, x == 0. By using the & operator, we combine those vectors into one that contains TRUE only for rows where both conditions are TRUE.
We then use the bracket notation to select the lnx column of those rows where both conditions are TRUE in df and then insert the value 0 into those cells using <-
The specific error you're getting is because log(data.frame$x + 1) and df$lnx[is.na(df$lnx)] are different lengths: log(data.frame$x + 1) produces a vector whose length is the number of rows of your data frame, while the length of df$lnx[is.na(df$lnx)] is the number of rows that have NA in lnx.
Using a dplyr solution:
library(dplyr)
df %>%
  mutate(lnx = case_when(
    x == 0.0 ~ 0,
    is.na(x) ~ NA_real_))
This yields for your example:
# A tibble: 7 x 8
a b c d e f x lnx
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AB1001 1. 3. 67. 13.9 2.63 1777. NA
2 AB1002 0. 2. 72. 38.7 3.66 0. 0.
3 AB1003 1. 3. 48. 4.15 1.42 1917. NA
4 AB1004 0. 1. 70. 34.8 3.55 NA NA
5 AB1005 1. 1. 34. 3.45 1.24 3165. NA
6 AB1006 1. 1. 14. 7.30 1.99 NA NA
7 AB1007 0. 3. 53. 11.2 2.42 0. 0.
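Note that, as the tibble above shows, this case_when() leaves lnx as NA whenever x is neither 0 nor NA, because case_when() returns NA for rows that match no condition. If the existing lnx values should be preserved for those rows, a fall-through clause can be added; a minimal sketch:
library(dplyr)

df %>%
  mutate(lnx = case_when(
    x == 0   ~ 0,          # x is exactly zero -> lnx becomes zero
    is.na(x) ~ NA_real_,   # x missing -> lnx stays missing
    TRUE     ~ lnx         # otherwise keep the existing lnx value
  ))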
