Defining the function to select the data - r

Let's start with my data.
> dput(head(tbl_ready)) ## To make it clear I didn't put all of the row names
structure(list(Gene_name = structure(1:6, .Label = c("AT1G01050",
"AT1G01080", "AT1G01090", "AT1G01220", "AT1G01320", "AT1G01420",
"AT1G01470", "AT1G01800", "AT1G01910", "AT1G01920", "AT1G01960",
"AT5G66570", "AT5G66720", "AT5G66760", "AT5G67150", "AT5G67360",
"ATCG00120", "ATCG00160", "ATCG00170", "ATCG00190", "ATCG00380",
"ATCG00470", "ATCG00480", "ATCG00490", "ATCG00500", "ATCG00650",
"ATCG00660", "ATCG00670", "ATCG00750", "ATCG00770", "ATCG00780",
"ATCG00800", "ATCG00810", "ATCG00820", "ATCG01090", "ATCG01110",
"ATCG01120", "ATCG01240", "ATCG01300", "ATCG01310", "ATMG01190"
), class = "factor"), `10` = c(0, 0, 0, 0, 0, 0), `20` = c(0,
0, 0, 0, 0, 0), `52.5` = c(0, 1, 0, 0, 0, 0), `81` = c(0, 0.660693687777888,
0, 0, 0, 0), `110` = c(0, 0.521435654491704, 0, 0, 0, 1), `140.5` = c(0,
0.437291194705566, 0, 0, 0, 1), `189` = c(0, 0.52204783488213,
0, 0, 0, 0), `222.5` = c(0, 0.524298383907171, 0, 0, 0, 0), `278` = c(1,
0.376865096972469, 0, 1, 0, 0), `340` = c(0, 0, 0, 0, 0, 0),
`397` = c(0, 0, 0, 0, 0, 0), `453.5` = c(0, 0, 0, 0, 0, 0
), `529` = c(0, 0, 0, 0, 0, 0), `580` = c(0, 0, 0, 0, 0,
0), `630.5` = c(0, 0, 0, 0, 0, 0), `683.5` = c(0, 0, 0, 0,
0, 0), `735.5` = c(0, 0, 0, 0, 0, 0), `784` = c(0, 0, 0.476101907006443,
0, 0, 0), `832` = c(0, 0, 1, 0, 0, 0), `882.5` = c(0, 0,
0, 0, 0, 0), `926.5` = c(0, 0, 0, 0, 1, 0), `973` = c(0,
0, 0, 0, 0, 0), `1108` = c(0, 0, 0, 0, 0, 0), `1200` = c(0,
0, 0, 0, 0, 0)), .Names = c("Gene_name", "10", "20", "52.5",
"81", "110", "140.5", "189", "222.5", "278", "340", "397", "453.5",
"529", "580", "630.5", "683.5", "735.5", "784", "832", "882.5",
"926.5", "973", "1108", "1200"), row.names = c(NA, 6L), class = "data.frame")
Take a look at the column names (I've picked just the first six):
10
20
52.5
81
110
140.5
These names give the size range. The first column covers sizes starting at 10 and ending where the second column begins, at 20, so genes with a size between 10 and 20 belong to the first column.
I have another table that gives the size of every gene (it contains many more genes than appear in my first table):
>dput(head(tbl_size))
structure(list(Gene_name = structure(1:6, .Label = c("ATMG01290", "ATMG01300", "ATMG01310", "ATMG01320", "ATMG01330",
"ATMG01350", "ATMG01360", "ATMG01370", "ATMG01400", "ATMG01410"
), class = "factor"), tp = c(26L, 17L, 22L, 142L, 12L, 45L),
size = c(49.4255, 28.0913, 40.2872, 213.572, 24.4838, 70.4375
)), .Names = c("locus", "tp", "size"), row.names = c(NA,
6L), class = "data.frame")
And now the main part: what do I want to achieve with my code?
I'm trying to keep only those genes that are found in fractions (columns) whose size range is at least two times higher than the real size of the gene. Let me use an example in case that isn't clear.
Let's say we have these genes:
Names Size
AT1G01080 40
AT1G01090 30
AT1G01220 50
Let's multiply the size by 2:
Names Size
AT1G01080 80
AT1G01090 60
AT1G01220 100
In the first table (tbl_ready) we can find the list of genes and the specific fractions (columns) defined by size, as I explained at the beginning of this thread. I would like to put 0 in place of any value where a gene is found in a fraction (column) that is not at least two times higher than the gene size.
To find the size of a gene you have to look in the second table (tbl_size).
To sum up: I'm trying to determine which of those genes occur at least as a complex of 2, so only fractions with a size at least two times higher than the size of the gene are important to me.
IF SOMEONE KNOWS WHAT I AM TRYING TO DO PLEASE EDIT MY QUESTION TO MAKE IT READABLE. I FEEL LIKE MY BRAIN IS DEAD.

Firstly, convert the column names to their numerical values:
# "Gene_name" becomes NA (with a coercion warning), which keeps frac aligned with the columns of tbl_ready
frac <- as.numeric(colnames(tbl_ready))
and then, for each gene, get the index of the last column whose fraction size does not exceed twice the gene's size:
# sapply (not lapply) so that ind is a numeric vector usable in the arithmetic below
ind <- sapply(tbl_size$size, function(x) which(frac > x*2)[1]-1)
Then you can create an array index of the values that you need to set to zero:
rowI = rep(match(tbl_size$locus, tbl_ready$Gene_name), times=ind-1)
colI = unlist(mapply(seq, from=2, length=ind-1))
tbl_ready[cbind(rowI, colI)] <- 0
You'll have to be careful if Gene_name doesn't map 1:1 to locus, and in cases where none of the columns exceeds twice the gene size, as there will be NAs that need dealing with. I'm assuming you're stuck with these representations of your data; otherwise it would probably be better to store tbl_ready in a longer, narrower form than you have here (only three columns: name, size, and value, with the zero values omitted).

I'm going to change my original answer, this time using the data you've provided. The only real differences are that you've changed the column names (I'm assuming column tp in tbl_size is the thing we need to match to the column headings in tbl_ready), and that some of the rows in tbl_size don't correspond to tbl_ready.
Firstly, convert the column names to their numerical values:
frac <- as.numeric(colnames(tbl_ready))
and then, for each gene, get the index of the last column whose fraction size does not exceed twice the gene's tp:
mapToReady <- tbl_size$locus %in% tbl_ready[[1]]
ind <- sapply(tbl_size$tp[mapToReady], function(x) which(frac > x*2)[1]-1)
Then you can create an array index of the values that you need to set to zero:
rowI = rep(match(tbl_size$locus[mapToReady], tbl_ready[[1]]), times=ind-1)
colI = unlist(mapply(seq, from=2, length=ind-1))
tbl_ready[cbind(rowI, colI)] <- 0
So, for instance, AT1G01050 is the 5th row of tbl_size (none of the previous entries has an entry in your tbl_ready) and the first row of tbl_ready. The first 'iteration' of the sapply line therefore hits 'tbl_size$tp[mapToReady][1]', which is the tp of AT1G01050, namely 12. 2*12 is 24, which lies between 20.0 and 52.5, so for AT1G01050 we need to set the columns corresponding to '10' and '20' to zero, but not the columns from '52.5' onwards. This corresponds to columns 2 and 3 of row 1 of tbl_ready, which is what the cbind portion of the last three lines does.
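To make the walkthrough above concrete, here is a minimal sketch of the same zeroing rule written as an explicit loop. The two small tables are made-up stand-ins for tbl_ready and tbl_size (the gene names and values are hypothetical), and the loop is an alternative formulation for illustration, not the code from the answer.

```r
# Toy stand-ins for tbl_ready and tbl_size (values are made up for illustration)
tbl_ready <- data.frame(
  Gene_name = c("geneA", "geneB"),
  `10` = c(1, 1), `20` = c(1, 1), `52.5` = c(1, 1), `81` = c(1, 1),
  check.names = FALSE
)
tbl_size <- data.frame(locus = c("geneB", "geneX", "geneA"),
                       tp = c(30, 5, 12))

# Fraction sizes from the column names (the Gene_name column is skipped)
frac <- as.numeric(colnames(tbl_ready)[-1])

for (i in seq_len(nrow(tbl_ready))) {
  size <- tbl_size$tp[match(tbl_ready$Gene_name[i], tbl_size$locus)]
  if (is.na(size)) next                      # gene has no entry in tbl_size
  too_small <- which(frac <= 2 * size) + 1   # +1 offsets the Gene_name column
  if (length(too_small) > 0) tbl_ready[i, too_small] <- 0
}
tbl_ready
```

With these toy values, geneA (12, doubled to 24) loses columns '10' and '20', and geneB (30, doubled to 60) loses '10', '20' and '52.5'.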

Related

How to calculate mean value of all columns of dataframe [duplicate]

This question already has answers here:
calculate the mean for each column of a matrix in R
(10 answers)
Closed last year.
I have a data frame and I want to calculate the mean of all columns and save it into a new data frame. I found this solution, "calculate the mean for each column of a matrix in R"; however, it is only for a matrix and not a data frame.
structure(list(TotFlArea = c(1232, 596, 708, 1052, 716), logg_weighted_assess = c(13.7765298160156,
13.1822275291412, 13.328376420438, 13.3076293132057, 13.5164823091252
), TypeDwel1.2.Duplex = c(0, 0, 0, 0, 0), TypeDwelApartment.Condo = c(0,
1, 1, 1, 1), TypeDwelTownhouse = c(1, 0, 0, 0, 0), Age_new.70 = c(0,
0, 0, 0, 0), Age_new0.1 = c(0, 0, 0, 0, 0), Age_new16.40 = c(1,
1, 0, 1, 0), Age_new2.5 = c(0, 0, 0, 0, 0), Age_new41.70 = c(0,
0, 0, 0, 0), Age_new6.15 = c(0, 0, 1, 0, 1), LandFreehold = c(1,
1, 1, 0, 1), LandLeasehold.prepaid = c(0, 0, 0, 1, 0), LandOthers = c(0,
0, 0, 0, 0), cluster_K_mean.1 = c(0, 0, 0, 0, 0)), row.names = c("1",
"2", "3", "4", "5"), class = "data.frame")
Can you please advise how I can do this?
Note: my data frame can have NA values, which should be excluded from the mean calculation.
As @akrun pointed out. Another alternative is:
apply(df, 2, mean)
where 2 means by column and 1 means by row.
However, despite its flexibility (e.g. swapping mean for another summary function, or applying it to selected columns only with apply(df[,c('a', 'b')], 2, mean)), the benchmark below shows the disadvantage of using apply in terms of speed:
library(data.table)
library(microbenchmark)
# dummy data
x <- 1e7
df <- data.table(a = 1:x )
y <- letters[2:10]
df[, (y) := lapply(2:10, \(i) a+i)]
# benchmark
z <- microbenchmark(
  colMeans = colMeans(df),
  apply = apply(df, 2, mean),
  times = 30
)
plot(z)
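Since the question mentions NA values and asks for the result as a new data frame, here is a minimal sketch of both points using colMeans on a small made-up data frame (the column names a and b are placeholders). The same na.rm = TRUE argument can also be passed through apply, as in apply(df, 2, mean, na.rm = TRUE).

```r
# Small made-up data frame with a missing value
df <- data.frame(a = c(1, 2, NA), b = c(4, 5, 6))

# Column means with NAs excluded, returned as a named numeric vector
means <- colMeans(df, na.rm = TRUE)

# Store the result as a one-row data frame
means_df <- as.data.frame(t(means))
means_df
```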

Creating a function to find precision by group

I have the following dataframe for which I am trying to calculate the precision of observations by group.
df<- structure(list(BLG = c(77.634011090573, 119.341563786008, 12.0603015075377,
0, 155.275381552754, 117.391304347826, 81.1332904056665, 3.96563119629874,
91.566265060241), GSF = c(11.090573012939, 4.11522633744856,
0, 0, 0, 0, 0, 0, 0), LMB = c(73.9371534195933, 28.8065843621399,
24.1206030150754, 20.2360876897133, 59.721300597213, 13.0434782608696,
38.6349001931745, 31.7250495703899, 28.9156626506024), YLB = c(14.7874306839187,
4.11522633744856, 0, 0, 0, 0, 0, 0, 0), BLC = c(7.39371534195933,
0, 0, 20.2360876897133, 3.9814200398142, 0, 0, 7.93126239259749,
9.63855421686747), WHC = c(0, 0, 0, 0, 3.9814200398142, 0, 0,
0, 0), RSF = c(0, 0, 0, 0, 11.9442601194426, 0, 0, 0, 4.81927710843374
), CCF = c(0, 0, 0, 0, 0, 0, 0, 0, 0), BLB = c(0, 0, 0, 0, 0,
0, 0, 0, 0), group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)), row.names = c(NA,
-9L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x00000270a7061ef0>)
I am trying to find the precision with this formula:
Y_estimated = the value in each cell of df
Y_true = the true value for each column in df: y_true <- c(83, 10, 47, 8, 9, 6, 12, 5, 8)
R = the number of observations in each group (in this case 3)
After applying the formula, I should have 3 measures of precision for each column (one per group). But I am unsure how to turn this formula into a function that does this, specifically how to apply the epsilon by group and how to define R.
I've been working on the following:
estimate = function(df, y_true) {
R = 3
y_estimated = (df, .SD)
(sum((sqrt( (y_estimated - y_true)^2 / 3))) / y_true) * 100
}
But apart from this throwing errors (I think from the .SD in y_estimated), I have to put in the value of R manually, which I hope to avoid, given that this will be applied to data frames with multiple group sizes.
Any help would be greatly appreciated.
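One possible shape for such a function is sketched below. The formula itself is not reproduced in the text of the question, so this sketch simply mirrors the expression in the attempt above and may not match the intended formula. The data are made up, precision_one is a hypothetical helper name, and R is taken from the number of rows in each group rather than hard-coded.

```r
# Made-up data: two value columns, two groups of different sizes
df <- data.frame(BLG = c(80, 90, 85, 70, 100),
                 LMB = c(50, 45, 40, 47, 47),
                 group = c(1, 1, 1, 2, 2))
y_true <- c(BLG = 83, LMB = 47)

# Hypothetical helper: same expression as in the attempt above,
# but R is the number of observations passed in
precision_one <- function(y_estimated, y_true) {
  R <- length(y_estimated)
  sum(sqrt((y_estimated - y_true)^2 / R)) / y_true * 100
}

# Split the value columns by group, then apply the helper column by column
value_cols <- setdiff(names(df), "group")
precision <- t(sapply(split(df[value_cols], df$group), function(g) {
  mapply(precision_one, g, y_true[value_cols])
}))
precision
```

The result is a matrix with one row per group and one column per value column, so groups of different sizes are handled without setting R by hand.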

Create bar plot for every level of a factor in a wide format data frame

I'm trying to create a bar plot using ggplot2 and my data is in this format:
dput here:
structure(list(clade = structure(c(1L, 3L, 2L, 3L, 2L, 2L), .Label = c("19A",
"20A", "20B", "20E (EU1)", "20I (Alpha, V1)", "20J (Gamma, V3)",
"21J (Delta)"), class = "factor"), C.T = c(0, 4, 4, 4, 4, 4),
A.G = c(0, 1, 1, 1, 1, 1), G.A = c(0, 2, 0, 2, 0, 0), G.C = c(0,
1, 0, 1, 0, 0), T.C = c(0, 0, 0, 0, 0, 0), C.A = c(0, 0,
0, 0, 0, 0), G.T = c(0, 0, 0, 0, 0, 0), A.T = c(0, 0, 0,
0, 0, 0), T.A = c(0, 0, 0, 0, 0, 0), T.G = c(0, 0, 0, 0,
0, 0), A.C = c(0, 0, 0, 0, 0, 0), C.G = c(0, 0, 0, 0, 0,
0), A.del = c(0, 0, 0, 0, 0, 0), TAT.del = c(0, 0, 0, 0,
0, 0), TCTGGTTTT.del = c(0, 0, 0, 0, 0, 0), TACATG.del = c(0,
0, 0, 0, 0, 0), AGTTCA.del = c(0, 0, 0, 0, 0, 0), GATTTC.del = c(0,
0, 0, 0, 0, 0)), row.names = c(NA, -6L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000014b25a51ef0>)
I'd like to create 7 bar plots (one for each "clade") where the X axis has the columns of the data frame (C.T would be one bar, A.G another bar, etc.) and the Y axis is the count. Essentially, for each clade, print a bar plot with the counts of each column.
For example, for the bar plot of the clade "20B" and the bar named "C.T", the count would be the sum of the values from the data frame. Can I do that in this wide format? Do I need to transform the data to a long format instead?
I was trying to apply this SO answer, "Plotting error bar on bar chart for a data frame in wide format using ggplot", but I keep getting "choose another strategy with names_repair".
Thank you in advance, any help is very welcome!
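One way to approach this, sketched under assumptions: sum the counts per clade while the data are still wide, reshape to long with base R, and let ggplot2 draw one panel per clade. The small data frame below is made up in the same shape as the table above, and the column names count and mutation are labels chosen here for illustration.

```r
library(ggplot2)

# Made-up wide data in the same shape as the question's table
df <- data.frame(clade = c("19A", "20B", "20A", "20B"),
                 C.T = c(0, 4, 4, 4),
                 A.G = c(0, 1, 1, 1))

# Sum every count column within each clade (still wide)
sums <- aggregate(. ~ clade, data = df, FUN = sum)

# Wide to long: one row per clade/column pair
long <- data.frame(clade = rep(sums$clade, times = ncol(sums) - 1),
                   stack(sums[-1]))
names(long) <- c("clade", "count", "mutation")

# One panel per clade, one bar per original column
p <- ggplot(long, aes(x = mutation, y = count)) +
  geom_col() +
  facet_wrap(~ clade)
p
```

facet_wrap puts all clades in one figure; separate plots could be produced instead by looping over split(long, long$clade).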

Find the second largest value rowwise with dplyr R

Problem:
I am working with wage data and I want to flag outliers as possible measurement errors. To do so, I am combining two criteria:
Receiving more than twice the value of the 99th percentile of wages within a given year, relative to the whole distribution of wages in my dataset (a comparison between persons, within year).
Receiving more than twice the value of the second-highest wage for the same person across years. This is an intra-individual criterion (a comparison within person, between years).
I managed to code the first criterion, but I am having some trouble coding the second one.
My data is in wide format. Perhaps the problem is easier to solve by reshaping the data to long format, but as I am working on a multi-author project, if I use that solution I need to reshape it back to wide format afterwards.
Data example:
Below, I provide some rows of my data with cases that already meet the first criterion:
df <- structure(list(
wage_2010 = c(120408.54, 11234.67, 19918.64, NA, 66006.32, 40581.36, 344587.84, 331970.28, NA, 161351.45, NA, 115310.68, 323336.27, 9681.69, NA, 682324.53, 43764.76, 134023.61, 78195.16, 141231.5, 48163.23, 71259.66, 73858.65, 57737.6, NA, 182837.23), wage_2011 = c(413419.86, 24343.04, 36349.02, NA, 99238.53, 18890.34, 129921.58, 108714.29, NA, 169289.89, 36158.73, 129543.51, 130791.99, 13872.76, 4479.58, 222327.52, 826239.14, 48892.78, 78506.06, 111569.8, 653239.41, 813158.54, 72960.17, 80193.15, NA, 209796.19), wage_2012 = c(136750.86, 77386.62, 177528.17, 86512.48, 375958.76, 20302.29, 145373.42, 91071.64, 95612.23, 176866.72, 85244.44, 225698.7, 181093.52, 162585.23, 147918.83, 254057.11, 72845.46, 86001.31, 80958.22, 105629.12, 77723.77, 115217.74, 68959.04, 111843.87, 85180.26, 261942.95 ),
wage_2013 = c(137993.48, 104584.84, 239822.37, 95688.8, 251573.14, 21361.93, 142771.58, 92244.51, 111058.93, 208013.94, 111326.07, 254276.36, 193663.33, 225404.84, 84135.55, 259772.16, 100031.38, 100231.81, 824271.38, 107336.19, 95292.2, 217071.19, 125665.58, 74513.66, 116227.01, 245161.73), wage_2014 = c(134914.8, 527180.87, 284218.4, 112332.41, 189337.74, 23246.46, 144070.09, 92805.77, 114123.3, 251389.07, 235863.98, 285511.12, 192950.23, 205364.45, 292988.3, 318408.56, 86255.91, 497960.18, 85467.13, 152987.99, 145663.31, 242682.93, 184123.01, 107423.03, 132046.43, 248928.89), wage_2015 = c(168812.65, 145961.09, 280556.86, 256268.69, 144549.45, 23997.1, 130253.75, NA, 115522.88, 241031.91, 243697.87, 424135.76, 15927.33, 213203.96, 225118.19, 298042.59, 77749.09, 151336.85, 88596.38, 121741.45, 34054.26, 206284.71, 335127.7, 201891.17, 189409.04, 246440.69),
wage_2016 = c(160742.14, 129892.09, 251333.29, 137192.73, 166127.1, 537611.12, 139350.84, NA, 115395.21, 243154.02, 234685.36, 903334.7, NA, 205664.08, 695079.91, 33771.37, 100938.19, 138864.28, 58658.4, 98576.95, NA, 144613.53, 430393.04, 217989.1, 229369.56, 600079.86), wage_2017 = c(175932.3, 138128.41, 584536.47, 143506.22, 61674.63, 1442.8, 126084.46, NA, 575771.83, 586909.69, 372954.89, 701815.37, NA, 402347.33, 93873.2, NA, 96792.96, 172908.08, 89006.92, 631645.41, NA, 72183.55, 579455.71, 294539.56, 353615.43, 151327.43), wage_2018 = c(146111.42, 149313.9, 627679.77, 850182.4, 72654.62, 9129.35, 41544.24, NA, 248020.12, 334280.68, 611781.99, 597465.2, NA, 535628.5, 63369.44, NA, 93710.71, 146769.63, 100736.71, 108022.87, NA, 79019.43, 772012.47, 549097.81, 504183.59, 99129.6),
outlier_2010 = c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), outlier_2011 = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0), outlier_2012 = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), outlier_2013 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0), outlier_2014 = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0), outlier_2015 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), outlier_2016 = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1), outlier_2017 = c(0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), outlier_2018 = c(0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0)),
groups = structure(list(.rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", "list"))), row.names = c(NA, -26L), class = c("tbl_df", "tbl", "data.frame")), row.names = c(NA, -26L), class = c("rowwise_df", "tbl_df", "tbl", "data.frame"))
I have average annual wages from 2010 to 2018, that is, 9 points in time. However, it seems hard to use a solution based on the quantile function, because of possible missing values for some individuals in some years.
What I have tried:
So far I am using a median function within the dplyr approach. I flag a wage as an outlier (possible error) if, in a given year, the individual receives more than twice the median of what they received across the years:
library(dplyr)
df1 <- df %>%
  rowwise() %>%
  mutate(
    median_wage = median(
      c(wage_2010, wage_2011, wage_2012, wage_2013, wage_2014, wage_2015, wage_2016, wage_2017, wage_2018),
      na.rm = TRUE
    )
  ) %>%
  mutate(individual_threshold = median_wage * 2) %>%
  mutate(
    outlier_2010 = case_when(wage_2010 > individual_threshold ~ 1, TRUE ~ 0),
    outlier_2011 = case_when(wage_2011 > individual_threshold ~ 1, TRUE ~ 0),
    outlier_2012 = case_when(wage_2012 > individual_threshold ~ 1, TRUE ~ 0),
    outlier_2013 = case_when(wage_2013 > individual_threshold ~ 1, TRUE ~ 0),
    outlier_2014 = case_when(wage_2014 > individual_threshold ~ 1, TRUE ~ 0),
    outlier_2015 = case_when(wage_2015 > individual_threshold ~ 1, TRUE ~ 0),
    outlier_2016 = case_when(wage_2016 > individual_threshold ~ 1, TRUE ~ 0),
    outlier_2017 = case_when(wage_2017 > individual_threshold ~ 1, TRUE ~ 0),
    outlier_2018 = case_when(wage_2018 > individual_threshold ~ 1, TRUE ~ 0)
  )
However, when I inspect the data, I see that I am coding possibly legitimate wages as outliers. For example, in the third row/person of my data, I am flagging the wages in 2017 and 2018 as outliers. However, as we can see, there is a pattern of increase in this person's wage. Although they receive more than twice their median wage in these years, that is probably not a mistake, as the increase was recorded in two years in a row.
In the fourth row, however, the 2018 wage is more likely to be wrongly reported, since there is no similar wage for the same person. In 2018, that person's wage grew to more than 4 times what it had been before (and also became more than twice the 99th percentile of the whole distribution).
Summing up:
I want to write code that analyses 9 variables for every individual (i.e. rowwise), wage_2010 to wage_2018, and compares the highest value to the second-highest value. If the highest value is more than twice the second-highest value, I flag it as a possible measurement error. Preferably within dplyr.
Here's a way to do this with a helper function.
library(dplyr)
compare_2nd_highest <- function(x) {
  # Sort the wages in descending order (sort() drops NA values by default)
  x1 <- sort(x, decreasing = TRUE)
  # Is the highest value more than double the second-highest value?
  x1[1] > (x1[2] * 2)
}
df %>%
  rowwise() %>%
  mutate(is_outlier = compare_2nd_highest(c_across(starts_with('wage')))) %>%
  ungroup()
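To see how the helper behaves on its own, here is a small sketch that calls it on made-up wage vectors, including the case where a person has only one non-missing wage. The _safe variant is a hypothetical addition, not part of the answer above; it returns FALSE instead of NA when there is no second-highest wage to compare against.

```r
compare_2nd_highest <- function(x) {
  x1 <- sort(x, decreasing = TRUE)   # sort() drops NAs by default
  x1[1] > (x1[2] * 2)
}

two_fold   <- compare_2nd_highest(c(100, 250, NA, 90))  # 250 > 2 * 100
no_outlier <- compare_2nd_highest(c(100, 150, NA, 90))  # 150 is not > 2 * 100
one_wage   <- compare_2nd_highest(c(100, NA, NA))       # x1[2] is NA, so the result is NA

# Hypothetical variant: FALSE when fewer than two wages are available
compare_2nd_highest_safe <- function(x) {
  x1 <- sort(x, decreasing = TRUE)
  length(x1) >= 2 && x1[1] > (x1[2] * 2)
}
one_wage_safe <- compare_2nd_highest_safe(c(100, NA, NA))
```

Whether a person with a single recorded wage should count as "not an outlier" or as "unknown" is a judgment call for the analysis; the two versions differ only in that case.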

How can I summarize several timesteps in R?

I want to draw a heat map showing the operating period of a ventilation system over the year. Since my dataset has 1-minute timesteps, I need to summarize the values into hourly timesteps (1440 values on the y-axis give too fine a resolution). So I am looking for a command to average the first 60 values, the next 60, and so on.
dput(head(mydate,20))
structure(list(date = structure(c(1498373340, 1498373400, 1498373460,
1498373520, 1498373580, 1498373640, 1498373700, 1498373760, 1498373820,
1498373880, 1498373940, 1498374000, 1498374060, 1498374120, 1498374180,
1498374240, 1498374300, 1498374360, 1498374420, 1498374480), class = c("POSIXct",
"POSIXt"), tzone = ""), DS.ZV_SB = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 20L), class = "data.frame")
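A minimal base R sketch of one way to do this: label each timestamp with the hour it falls in using cut(), then average per hour with aggregate(). The data below are made up (three hours of one-minute values), the data frame name mydata is a placeholder, and UTC is chosen only so that the hour boundaries are unambiguous.

```r
# Made-up minute data: three hours, one value per minute
start <- as.POSIXct("2017-06-25 00:00:00", tz = "UTC")
mydata <- data.frame(date = start + 60 * (0:179),
                     DS.ZV_SB = rep(c(0, 1, 0.5), each = 60))

# Label each timestamp with the clock hour it belongs to
mydata$hour <- cut(mydata$date, breaks = "hour")

# Average the one-minute values within each hour
hourly <- aggregate(DS.ZV_SB ~ hour, data = mydata, FUN = mean)
hourly
```

Grouping by clock hour differs slightly from averaging strict blocks of 60 rows when the series does not start on the hour or has gaps; for strict blocks, a grouping index such as (seq_len(nrow(mydata)) - 1) %/% 60 could be used in place of the hour label.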

Resources