How can I summarize several timesteps in R?

How can I summarize several timesteps in R? - r

I want to draw a heat map showing the operating period of a ventilation system over the year. Since my dataset has 1minute-timesteps, I need to summarize the values to hourly-timesteps (1440 values on the y-axis result in a too small resolution). So I am looking for a command to average the first 60 values, the next 60 and so on...
dput(head(mydate,20))
structure(list(date = structure(c(1498373340, 1498373400, 1498373460,
1498373520, 1498373580, 1498373640, 1498373700, 1498373760, 1498373820,
1498373880, 1498373940, 1498374000, 1498374060, 1498374120, 1498374180,
1498374240, 1498374300, 1498374360, 1498374420, 1498374480), class = c("POSIXct",
"POSIXt"), tzone = ""), DS.ZV_SB = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 20L), class = "data.frame")

Related

Creating a function to find precision by group

I have the following dataframe for which I am trying to calculate the precision of observations by group.
df<- structure(list(BLG = c(77.634011090573, 119.341563786008, 12.0603015075377,
0, 155.275381552754, 117.391304347826, 81.1332904056665, 3.96563119629874,
91.566265060241), GSF = c(11.090573012939, 4.11522633744856,
0, 0, 0, 0, 0, 0, 0), LMB = c(73.9371534195933, 28.8065843621399,
24.1206030150754, 20.2360876897133, 59.721300597213, 13.0434782608696,
38.6349001931745, 31.7250495703899, 28.9156626506024), YLB = c(14.7874306839187,
4.11522633744856, 0, 0, 0, 0, 0, 0, 0), BLC = c(7.39371534195933,
0, 0, 20.2360876897133, 3.9814200398142, 0, 0, 7.93126239259749,
9.63855421686747), WHC = c(0, 0, 0, 0, 3.9814200398142, 0, 0,
0, 0), RSF = c(0, 0, 0, 0, 11.9442601194426, 0, 0, 0, 4.81927710843374
), CCF = c(0, 0, 0, 0, 0, 0, 0, 0, 0), BLB = c(0, 0, 0, 0, 0,
0, 0, 0, 0), group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L)), row.names = c(NA,
-9L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x00000270a7061ef0>)
I am trying to find the precision with this formula:
Y_estimated= the value of in each cell of df
Y_true= y_true<- c(83, 10, 47, 8, 9, 6, 12, 5, 8) #the true value for each column in df
R= number of observations in each group (in this case=3)
After applying the formula, I should have 3 measures of precision for each column. But I am unsure of how to make this formula into a function that will do this. Specifically the applying the epsilon by group and defining R.
I've been working on the following:
estimate = function(df, y_true) {
R = 3
y_estimated = (df, .SD)
(sum((sqrt( (y_estimated - y_true)^2 / 3))) / y_true) * 100
}
But apart from this throwing errors (I think from the .SD in the y_estimated), I have to manually put in the value of R which I hope to not have to do given that this will be applied on data frames with multiple group sizes.
Any help would be greatly appreciated.

Improving for loop inception speed when comparing Date-time values

I am looking to speed up some code of mine, and am looking for some advice. I have two dataframes, and want to take data from one and input it into the other, but their formats are quite different. I have written some code that works, but it runs fairly slow, especially when I have large datasets. Example dataframes are below:
dat = structure(list(Date = structure(c(1508641200, 1508643000, 1508644800,
1508646600, 1508648400, 1508650200, 1508652000, 1508653800, 1508655600,
1508657400, 1508659200, 1508661000, 1508662800, 1508664600, 1508666400,
1508668200, 1508670000, 1508671800, 1508673600, 1508675400, 1508677200,
1508679000, 1508680800, 1508682600, 1508684400, 1508686200, 1508688000,
1508689800, 1508691600, 1508693400), class = c("POSIXct", "POSIXt"
), tzone = "EST"), X = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), tru = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Date", "X", "tru"), row.names = c(NA,
30L), class = "data.frame")
and,
truth = structure(list(startdate = structure(c(1509937620, 1510705200,
1510722240, 1512245160, 1512250560, 1512251760, 1512271140, 1512274440,
1512984360, 1512986760, 1513002600, 1513004700, 1513752000, 1513753800,
1513775940, 1513776840, 1514268900, 1514289000, 1514307900, 1517842620,
1518056460, 1520031660, 1520050560, 1520075580, 1520218620), class =c("POSIXct",
"POSIXt"), tzone = ""), enddate = structure(c(1509952320, 1510718040,
1510737240, 1512250260, 1512269640, 1512269940, 1512290280, 1512289380,
1513001100, 1513001400, 1513023840, 1513021440, 1513772640, 1513771440,
1513797180, 1513796280, 1514288400, 1514307600, 1514311140, 1517857320,
1518067560, 1520049060, 1520073240, 1520078880, 1520222820), class =c("POSIXct",
"POSIXt"), tzone = ""), Y = c(42340.1667145732, 49765.2381579195,
48687.3848496384, 31170.9693454847, 50435.3541955455, 49757.5112973802,
44031.8550803252, 45912.1378875664, 47193.1529894274, 49826.4304479959,
45840.7120690243, 42483.44259103, 52188.4048476908, 52783.8164119854,
51769.0550080142, 49866.1301140174, 52250.5531316799, 49754.6933212176,
45256.185763228, 47742.0544890968, 53414.0366523465, 51881.9495162963,
48632.3656223053, 44476.3677890439, 61922.0098972212)), .Names =
c("startdate",
"enddate", "Y"), row.names = c(NA, 25L), class = "data.frame")
Now, what I want to do is take values from truth$Y and input them into dat$tru at their appropriate date-time values given as these actions happen between truth$startdate and truth$enddate.
What I have currently is the following code to do so:
for(i in 1:length(truth$startdate)){
for(j in 1:length(dat$Date)){
if(dat$Date[j] >= truth$startdate[i] && dat$Date[j] <= truth$enddate[i]){
dat$tru[j] = truth$Y[i]
}
}
}
Is there a more efficient way to do this avoiding a for loop inside of a for loop? Thanks in advance!

Using the between function from the dplyr package, you can eliminate the inner loop:
library(dplyr)
for(i in 1:length(truth$startdate)){
dat$tru[between(dat$Date, truth$startdate[i], truth$enddate[i])] <-truth$Y[i]
}
This should result in a significant performance improvement.
I sure this could be further refined but this is a quick and easy fix. I suggest looking at the purrr package, it could offer some additional performance improvements.

charts.PerformanceSummary() not working for intraday data

I am trying to plot intraday data using PerformanceAnalytics::charts.PerformanceSummary()
but I get the following error message:
charts.PerformanceSummary(e[,1:10])
Error in as.POSIXlt.POSIXct(.POSIXct(.index(x)), tz = indexTZ(x)) :
invalid 'tz' value
It seems that charts.PerformanceSummary only takes daily data but not intraday?
Can someone please come up with a solution for this?
e <- structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0.000369303493611195, 0,
0, 0.000590667454223315, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000150829562594268,
0.000150806816467952, -0.00015078407720126, -0.000150806816468174,
0.000301659125188536, 0, 0, -0.000617817867292869, 0, 0, 0, 0,
0.000107944732297138, 0.000323799244468459, -0.000215796288303927,
0, 0.000215842866393423, 0, 0, 0, 0, 0, 0), class = c("xts",
"zoo"), .indexCLASS = c("POSIXlt", "POSIXt"), tclass = c("POSIXlt",
"POSIXt"), .indexTZ = c("America/New_York", "EST", "EDT"), tzone = c("America/New_York",
"EST", "EDT"), index = structure(c(1496755860, 1496755920, 1496755980,
1496756040, 1496756100, 1496756160), tzone = c("America/New_York",
"EST", "EDT"), tclass = c("POSIXlt", "POSIXt")), .Dim = c(6L,
10L), .Dimnames = list(NULL, c("AADR", "AAXJ", "ACIM", "ACSI",
"ACTX", "ACWF", "ACWI", "ACWV", "ACWX", "ADRA")))

The problem is the POSIXlt index. You should convert it to POSIXct, which is most easily accomplished by creating a new xts object. Try the commands below.
e <- xts(coredata(e), as.POSIXct(index(e)))
PerformanceAnalytics::charts.PerformanceSummary(e[,1:10])
The chart throws an error with your example data, after drawing the first plot:
Error in segments(xlim[1], y_grid_lines(ylim), xlim[2], y_grid_lines(ylim), :
cannot mix zero-length and non-zero-length coordinates
But that may be because there aren't enough observations. Let me know if it doesn't work on your actual data and I'll investigate further.

R apply funciton on each cell in data frame

I have a data frame that look something like this
> dput(tes)
structure(list(path = structure(1:6, .Label = c("1893-chicago-fair",
"1960s-afghanistan", "1970s-iran", "1980s-new-york", "20-bizarre-vintage-ads",
"20-bizarre-vintage-ads?utm_campaign=6678&utm_medium=rpages&utm_source=Facebook&utm_term=1e8e704f7b587515c72e6cf7895d55fd110b652c480d98c1440f0a7acba5fb0e",
"20-photos-segregation-america-show-far-weve-come-much-farther-go",
"7-bizarre-cultural-practices", "7-creepy-abandoned-cities?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=4015a7368b588ff09694c96ba720c58f4e7f41a05b4181908b582bae682bef5e",
"a-brief-history-of-hippies", "abandoned-photographs", "albert-kahn",
"amazing-facts", "american-bison-extinction-1800s", "american-english-vs-british-english",
"andre-the-giant-photos", "andre-the-giant-photos??utm_source=facebook&sr_source=lift_facebook&utm_campaign=simplereach_andre-the-giant-photos&utm_medium=social",
"andre-the-giant-photos?grvVariant=d27feef0bfad84d60f335d3a8d241d9e",
"andre-the-giant-photos?grvVariant=d27feef0bfad84d60f335d3a8d241d9e&utm_campaign=gravityus2_142deb68f67fb1a99e7b80250fecc932&utm_medium=referral&utm_source=gravity",
"andre-the-giant-photos?grvVariant=d27feef0bfad84d60f335d3a8d241d9e&utm_campaign=gravityus2_16d63cf07ecf656f602b2d6b209344f7&utm_medium=referral&utm_source=gravity",
"andre-the-giant-photos?grvVariant=d27feef0bfad84d60f335d3a8d241d9e&utm_campaign=gravityus2_713050ecffc51540af02b2246ddf57dd&utm_medium=referral&utm_source=gravity",
"andre-the-giant-photos?grvVariant=d27feef0bfad84d60f335d3a8d241d9e&utm_campaign=gravityus2_c5bb3bc5e9408e0ad52ec9e787bd8654&utm_medium=referral&utm_source=gravity",
"andre-the-giant-photos?sr_source=lift_facebook&utm_campaign=simplereach_andre-the-giant-photos&utm_medium=social&utm_source=facebook",
"astounding-aerial-photography", "astounding-aerial-photography?utm_campaign=7002&utm_medium=rpages&utm_source=Facebook&utm_term=38e9e903d9ba59106d8b4d19be593f3de7ff8b91b12eafa03f2e382228f7b0d1",
"august-landmesser", "ben-franklin", "best-all-that-is-interesting-articles",
"bigfoot-facts", "celebrity-school-photos?grvVariant=82c0ce57a33dfd0209bdefc878665de0&utm_campaign=gravityus2_bc8646aefd6d0a16af03d7caf248f226&utm_medium=referral&utm_source=gravity",
"coolest-mushrooms?utm_campaign=taboolaINTL&utm_medium=referral&utm_source=taboola",
"craziest-ways-drugs-smuggled", "creepy-halloween-costumes",
"danakil-depression", "dark-john-lennon-quotes", "david-bowie-quotes",
"days-in-groundhog-day", "death-photos", "death-photos?utm_campaign=taboolaINTL&utm_medium=referral&utm_source=taboola",
"dr-seuss-quotes", "dream-chaser-spacecraft", "dust-bowl", "earth-two-planets",
"eixample-barcelona", "email-to-space", "evil-science-experiments",
"famous-incest", "famous-spies", "fun-facts-trivia", "golden-age-air-travel?utm_campaign=taboolaINTL&utm_medium=referral&utm_source=taboola",
"gross-foods", "gross-foods?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=106965c54919c24bf37356500ec50f0709b1de621d6950bb4c5d48759ea3677e",
"gross-foods?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=184e0ee39e66af82f9b124b904f6e07964b211e902cb0dc00c28771ff46163a2",
"gross-foods?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=1a0ddea7bed770d5473c45e9f8d81dfd0c4fdd232f207c6b88b53c41ff220c59",
"gross-foods?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=538659f1fc53f28d2c87b93ac73973681c1a46a04954964ab6c52ed1ab09b33a",
"gross-foods?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=87caf0acb91ae2b202f1b00ad9eaad3fef20bbfb23405b9047fb2b5a5462ab9c",
"gross-foods?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=91eae42c8fc9568103d46e0b6b6ec08fc34fd68b2e1918ffe2333ec73035c95a",
"gross-foods?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=a72946874b2003a8e40635c6cf10c851d4e1c0ed45e645d69663214239550602",
"gross-foods?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=ab594f0a1be002c8c3db297e8d33b04678af40e6a6469ac815884ae0a014b3a3",
"gross-foods?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=fb1e333dd58cb7bb9251ec52290aae21771149f73e083440047068a69aaeae09",
"hilarious-insults", "hippie-communes", "hippie-communes?grvVariant=fda07538efb1c25617f7cc3d09c37c79",
"hippie-communes?grvVariant=fda07538efb1c25617f7cc3d09c37c79&utm_campaign=gravityus2_e3cd42d4745768460dab4694a972fd82&utm_medium=referral&utm_source=gravity",
"hippie-communes?pp=0", "history-of-the-vibrator", "history-of-the-vibrator?utm_campaign=whfbpd&utm_medium=social&utm_source=facebook",
"homosexuality-norm", "hunger-games-facts?utm_campaign=6905&utm_medium=rpages&utm_source=Facebook&utm_term=1a9e42ac8abb6ffa90bf0542206505e74d3df12114a2c4445527fb2b88ef8880",
"influential-photographs", "ingeniously-creative-ads", "insane-cults",
"insane-rulers", "inspirational-quotes", "inspirational-quotes?utm_medium=referral&utm_source=taboolainternal",
"interesting-facts-about-the-world", "interesting-quotes", "krokodil",
"making-a-murderer-theories", "maya-angelou-greatest-quotes",
"medieval-torture-devices", "milky-way-colorado", "montreal-metro",
"most-popular-female-names-in-america", "neil-degrasse-tyson-tweets",
"new-york-city-cinemagraphs", "new-york-subways-1980s", "north-korea-photographs",
"north-korea-photographs?utm_campaign=taboolaINTL&utm_medium=referral&utm_source=taboola",
"north-korea-photographs?utm_medium=referral&utm_source=taboolainternal",
"obama-aging", "pablo-escobar", "pablo-escobar??utm_source=facebook",
"pablo-escobar??utm_source=facebook&sr_source=lift_facebook&utm_campaign=simplereach_pablo-escobar&utm_medium=social",
"pablo-escobar?utm_campaign=whfbpd&utm_medium=social&utm_source=facebook",
"panda-facts", "photo-of-the-day-nasa-releases-crystal-clear-image-of-pluto",
"pollution-in-china-photographs", "pollution-in-china-photographs?utm_campaign=3434&utm_medium=rpages&utm_source=Facebook&utm_term=1a0ddea7bed770d5473c45e9f8d81dfd0c4fdd232f207c6b88b53c41ff220c59",
"pollution-in-china-photographs?utm_campaign=3434&utm_medium=rpages&utm_source=Facebook&utm_term=e28a76c1572c36c3a13965e52b4b2ea10518eb9f9c79c4bc84cfb85db16be81e",
"pollution-in-china-photographs?utm_campaign=6806&utm_medium=rpages&utm_source=Facebook&utm_term=1a0ddea7bed770d5473c45e9f8d81dfd0c4fdd232f207c6b88b53c41ff220c59",
"pollution-in-china-photographs?utm_campaign=7048&utm_medium=rpages&utm_source=Facebook&utm_term=2ef4bd7b6cd587601d6eeb35925282a1ed095ebbd4e9e4c0337ef868c7de7a0b",
"pollution-in-china-photographs?utm_campaign=7458&utm_medium=rpages&utm_source=Facebook&utm_term=b9e79a51cd4daf4c3ec02accce75b3e1fc9a22cb3133460c9c32a4f2f9cdb68c",
"powerful-photos-of-2014", "real-x-files", "romanovs-last-days",
"science-of-human-decay", "scientific-discoveries-2015", "scully-effect",
"serial-killer-quotes", "shah-iran", "six-of-the-craziest-gods-in-mythology",
"space-facts", "sun-facts", "sunken-cities", "sunken-ships",
"super-bowl-i-facts", "superhero-movies", "surreal-places", "syrian-civil-war-photographs",
"the-five-greatest-mysteries-of-human-history", "the-four-most-important-battles-of-ancient-greece",
"the-most-colorful-cities-in-the-world", "titanic-facts", "titanic-facts?utm_campaign=6385&utm_medium=rpages&utm_source=Facebook&utm_term=f5905e878216d14e20457ee3265caf6c10022d9545609edfb9a3cb0642c1a310",
"titanic-facts?utm_campaign=6899&utm_medium=rpages&utm_source=Facebook&utm_term=b9e79a51cd4daf4c3ec02accce75b3e1fc9a22cb3133460c9c32a4f2f9cdb68c",
"titanic-facts?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=106965c54919c24bf37356500ec50f0709b1de621d6950bb4c5d48759ea3677e",
"titanic-facts?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=538659f1fc53f28d2c87b93ac73973681c1a46a04954964ab6c52ed1ab09b33a",
"titanic-facts?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=91eae42c8fc9568103d46e0b6b6ec08fc34fd68b2e1918ffe2333ec73035c95a",
"titanic-facts?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=ab594f0a1be002c8c3db297e8d33b04678af40e6a6469ac815884ae0a014b3a3",
"titanic-facts?utm_campaign=6928&utm_medium=rpages&utm_source=Facebook&utm_term=d1864657a05e5b716bb5cb16a29f068a55652eb39fb669ea9c22a6486198f227",
"titanic-facts?utm_campaign=7292&utm_medium=rpages&utm_source=Facebook&utm_term=f5905e878216d14e20457ee3265caf6c10022d9545609edfb9a3cb0642c1a310",
"us-veterans-portraits", "vintage-disneyland", "wall-street-early-20th-century",
"what-we-love-this-week-the-incredible-last-words-of-famous-historical-figures",
"woodstock-photos", "zombie-proof-house"), class = "factor"),
`0089` = c(0, 0, 0, 0, 0, 1), `0096` = c(0, 0, 0, 0, 0, 0
), `02` = c(0, 0, 0, 0, 0, 0), `0215` = c(0, 0, 0, 0, 0,
0), `0225` = c(0, 0, 0, 0, 0, 0), `0252` = c(0, 0, 0, 0,
0, 0), `0271` = c(0, 0, 0, 0, 0, 0), `0272` = c(0, 0, 0,
0, 0, 0), `03` = c(0, 0, 0, 0, 1, 1)), .Names = c("path",
"0089", "0096", "02", "0215", "0225", "0252", "0271", "0272",
"03"), row.names = c(NA, 6L), class = "data.frame")
and I need to apply the min(x,1) function such that this function scan each value in the dataframe (except first column which is not numeric) and return the min(x,1). that way I have only zero's and one's.
I have tried:
f <- function(x) min(1,x)
res1<-do.call(f,tes[,2:ncol(tes)])
but that does not output the right result.
Any help aapreciated

We can use pmin
tes[,-1] <- pmin(1, as.matrix(tes[,-1]))
Or if we need only binary values
tes[,-1] <- +(!!tes[,-1])

Defining the function to select the data

Let's start with my data.
> dput(head(tbl_ready)) ## To make it clear I didn't put all of the row names
structure(list(Gene_name = structure(1:6, .Label = c("AT1G01050",
"AT1G01080", "AT1G01090", "AT1G01220", "AT1G01320", "AT1G01420",
"AT1G01470", "AT1G01800", "AT1G01910", "AT1G01920", "AT1G01960",
"AT5G66570", "AT5G66720", "AT5G66760", "AT5G67150", "AT5G67360",
"ATCG00120", "ATCG00160", "ATCG00170", "ATCG00190", "ATCG00380",
"ATCG00470", "ATCG00480", "ATCG00490", "ATCG00500", "ATCG00650",
"ATCG00660", "ATCG00670", "ATCG00750", "ATCG00770", "ATCG00780",
"ATCG00800", "ATCG00810", "ATCG00820", "ATCG01090", "ATCG01110",
"ATCG01120", "ATCG01240", "ATCG01300", "ATCG01310", "ATMG01190"
), class = "factor"), `10` = c(0, 0, 0, 0, 0, 0), `20` = c(0,
0, 0, 0, 0, 0), `52.5` = c(0, 1, 0, 0, 0, 0), `81` = c(0, 0.660693687777888,
0, 0, 0, 0), `110` = c(0, 0.521435654491704, 0, 0, 0, 1), `140.5` = c(0,
0.437291194705566, 0, 0, 0, 1), `189` = c(0, 0.52204783488213,
0, 0, 0, 0), `222.5` = c(0, 0.524298383907171, 0, 0, 0, 0), `278` = c(1,
0.376865096972469, 0, 1, 0, 0), `340` = c(0, 0, 0, 0, 0, 0),
`397` = c(0, 0, 0, 0, 0, 0), `453.5` = c(0, 0, 0, 0, 0, 0
), `529` = c(0, 0, 0, 0, 0, 0), `580` = c(0, 0, 0, 0, 0,
0), `630.5` = c(0, 0, 0, 0, 0, 0), `683.5` = c(0, 0, 0, 0,
0, 0), `735.5` = c(0, 0, 0, 0, 0, 0), `784` = c(0, 0, 0.476101907006443,
0, 0, 0), `832` = c(0, 0, 1, 0, 0, 0), `882.5` = c(0, 0,
0, 0, 0, 0), `926.5` = c(0, 0, 0, 0, 1, 0), `973` = c(0,
0, 0, 0, 0, 0), `1108` = c(0, 0, 0, 0, 0, 0), `1200` = c(0,
0, 0, 0, 0, 0)), .Names = c("Gene_name", "10", "20", "52.5",
"81", "110", "140.5", "189", "222.5", "278", "340", "397", "453.5",
"529", "580", "630.5", "683.5", "735.5", "784", "832", "882.5",
"926.5", "973", "1108", "1200"), row.names = c(NA, 6L), class = "data.frame")
Take a look on the names of the columns (just picked the 6 of them):
10
20
52.5
81
110
140.5
Those names tell me the size range. The size of the genes in the first column starts from 10 and ends on the begining of the second column = 20. That means that to the first column should belong genes with the size between 10-20.
I have another table which tells me what's the size of all genes (there are much more than can be finded in my first table):
>dput(head(tbl_size))
structure(list(Gene_name = structure(1:6, .Label = c("ATMG01290", "ATMG01300", "ATMG01310", "ATMG01320", "ATMG01330",
"ATMG01350", "ATMG01360", "ATMG01370", "ATMG01400", "ATMG01410"
), class = "factor"), tp = c(26L, 17L, 22L, 142L, 12L, 45L),
size = c(49.4255, 28.0913, 40.2872, 213.572, 24.4838, 70.4375
)), .Names = c("locus", "tp", "size"), row.names = c(NA,
6L), class = "data.frame")
and now the main part. What I want to achieve with my code ?
So, I'm trying to find only those genes which are found in the fractions (columns) with the size range two times higher than a real size of the gene. No idea if you understand what I am trying to do so let me use an example.
so let's say that we have a genes:
Names Size
AT1G01080 40
AT1G01090 30
AT1G01220 50
Let's multiply the size by 2:
Names Size
AT1G01080 80
AT1G01090 60
AT1G01220 100
In first table (tbl_ready) we can find the list of the genes and specific fractions (columns) defined by size which I explained in the begining of this thread. I would like to put the 0 instead of any values if any gene can be found in the fraction (column) which is not atleast two times higher than the gene size.
To find the size of the gene you have to look in the second table (tbl_size).
Just to sum it up. I'm trying to define which of those genes come atleast as a complex of 2. So only fractions with size two times higher than the size of the gene are important for me.
IF SOMEONE KNOWS WHAT I AM TRYING TO DO PLEASE EDIT MY QUESTION TO MAKE IT READABLE. I FEEL LIKE MY BRAIN IS DEAD.

Firstly, convert the columns to their numerical value:
frac <- as.numeric(colnames(tbl_ready))
and then get the index per gene of the column that doesn't exceed it's frac by two-fold:
ind <- lapply(tbl_size$size, function(x) which(frac > x*2)[1]-1)
Then you can create an array index of the values that you need to set to zero:
rowI = rep(match(tbl_size$locus, tbl_ready$Gene_name), times=ind-1)
colI = unlist(mapply(seq, from=2, length=ind-1))
tbl_ready[cbind(rowI, colI)] <- 0
You'll have to be careful if gene_names don't have a 1:1 mapping with locus, and cases where none of the columns exceed the gene size two fold, as there'll be NAs that need dealing with. I'm assuming you're stuck using these representations of your data, as it would probably be better to store tbl_ready in a longer narrower form than you have it here (containing only three columns name, size, and value - and omitted the zero values).

I'm going to change my original answer, this time using the data you've provided - the only real differences are that you've changed the column names (I'm assuming column tp in tbl_size is the thing we need to match to the column headings in tbl_ready), and that some of the rows in table_size don't correspond to tbl_ready.
Firstly, convert the columns to their numerical value:
frac <- as.numeric(colnames(tbl_ready))
and then get the index per gene of the column that doesn't exceed it's frac by two-fold:
mapToReady <- tbl_size$locus %in% tbl_ready[[1]]
ind <- sapply(tbl_size$tp[mapToReady], function(x) which(frac > x*2)[1]-1)
Then you can create an array index of the values that you need to set to zero:
rowI = rep(match(tbl_size$locus[mapToReady], tbl_ready[[1]]), times=ind-1)
colI = unlist(mapply(seq, from=2, length=ind-1))
tbl_ready[cbind(rowI, colI)] <- 0
So, for instance, AT1G01050 is the 5th row of tbl_size (none of the previous entries have an entry in your tbl_size), and the first row of tbl_ready. So the first 'iteration' of the sapply line hits 'tbl_size$tp[mapToReady][1]' which is the tp of AT1G01050 which is 12. 2*12 is 24, so is between 20.0 and 52.5, so we're going to need to set columns corresponding to '10', and '20' to zero, but not columns '52.5' onwards, for the AT1G01050. This corresponds to columns 2 and 3 of row 1 of tbl_ready, which is what the cbind portion of the last three lines is doing.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How can I summarize several timesteps in R? - r

Related

Creating a function to find precision by group

Improving for loop inception speed when comparing Date-time values

charts.PerformanceSummary() not working for intraday data

R apply funciton on each cell in data frame

Defining the function to select the data

Categories

Resources