I'm trying to build quite a complex loop in R.
I have a data set stored in an object called p_int (p_int stands for peak intensity).
For this example the structure of p_int i.e. str(p_int) is:
num [1:1599]
The size of p_int can vary i.e. [1:688], [1:1200] etc.
What I'm trying to do with p_int is to construct a loop that extracts the monoisotopic peaks (peaks with certain characteristics) into a second object, mono_iso:
Search the first eight results in p_int. Of these eight, find the result with the greatest value (this value also needs to be above 50).
Once this result has been found, record it into mono_iso.
The loop then fixes on the position of this result within the larger dataset. From that position it skips the next result along the dataset before doing the same for the next set of 8 results.
So something similar to this:
16 Results: 100 120 90 66 220 90 70 30 70 100 54 85 310 200 33 41
** So, to begin with, the loop would take the first 8 results:
100 120 90 66 220 90 70 30
**It would then decide which peak is the greatest:
220
**It would determine whether 220 was greater than 50
IF YES: It would record 220 into "mono_iso"
IF NO: It would move on to the next set of 8 results
**220 is greater than 50... so records into mono_iso
The loop would then place its position at 220; it would skip the "90" and begin the same thing again for the next set of 8 results, beginning at the next data point in line, in this case the 70:
70 30 70 100 54 85 310 200
It would then record the "310" value (the highest value) and repeat the same process until the end of the data set.
I hope this makes sense. If anyone could help me make such a loop work in R, I'd very much appreciate it.
Use this:
mono_iso <- aggregate(p_int, by=list(group=((seq_along(p_int)-1)%/%8)+1), function(x)ifelse(max(x)>50,max(x),NA))$x
This will put NA for groups such that max(...)<=50. If you want to filter those out, use this:
mono_iso <- mono_iso[!is.na(mono_iso)]
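For completeness: if the skip-one behavior from the question matters (jump to the peak, skip the next value, then read the next 8), the aggregate() one-liner above won't reproduce it, since it uses fixed blocks of 8. A plain while-loop sketch of the described walk (assuming mono_iso should hold only the recorded peak values):

```r
# Sketch of the skip-one walk described in the question, using its 16-value example.
p_int <- c(100, 120, 90, 66, 220, 90, 70, 30, 70, 100, 54, 85, 310, 200, 33, 41)
mono_iso <- numeric(0)
pos <- 1
while (pos + 7 <= length(p_int)) {
  window <- p_int[pos:(pos + 7)]
  peak <- max(window)
  if (peak > 50) {
    mono_iso <- c(mono_iso, peak)
    pos <- pos + which.max(window) + 1  # move past the peak, then skip one value
  } else {
    pos <- pos + 8                      # no qualifying peak: next block of 8
  }
}
mono_iso
# 220 310  (matches the worked example in the question)
```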
I am plotting my benchmark tests with plotly, and the results look as expected.
This is just a preliminary view, as the rest of the data is still being calculated. Yet it's already obvious that the current plot doesn't make much sense, since there will be far more plotted data in different segments (10-100, 100-1000, 1000+), and the smaller values simply can't be seen any more (unless zoomed in).
Is there a proper way to group the displayed bars (by defining groups)?
There is apparently a solution with Dash (https://community.plotly.com/t/how-can-i-select-multiple-options-to-pick-from-dropdown-as-a-group/60482) which seems to be what I am looking for but it's not an independent HTML-File which can be sent and/or exported.
Alternatively, I thought about displaying it on a log scale, but the result is confusing, as it doesn't really show what I'd like to display.
This code here works and gives the result shown:
import os

import pandas as pd
import plotly.express as px

if __name__ == '__main__':
    filep = "Tests/10k-node-test/"
    data = []
    quantities = []
    # one sub-directory per benchmark run
    for p1 in next(os.walk(filep))[1]:
        quantities.append(p1)
        df = pd.read_csv(filep + p1 + '/' + "timing.csv")
        for index, row in df.iterrows():
            if index >= 2:
                if index % 2 == 0:
                    val = row[2]          # keep value from the even row
                else:
                    val = row[2] - val    # difference to the previous row
                data.append([p1, row[1], val])

    df = pd.DataFrame(data, columns=["Records", "Iteration", "Insertion Time"])
    fig = px.bar(df, x="Records", y="Insertion Time",
                 hover_data=["Records", "Iteration", "Insertion Time"],
                 color="Insertion Time", height=400, log_y=True)
    fig.update_layout(barmode='stack', xaxis={'categoryorder': 'total ascending'})
    fig.write_html("plotlye.html")
The data-frame looks like this:
Records Iteration Insertion Time
0 250 3 1.137531
1 250 4 1.137239
2 250 5 1.146533
3 250 6 1.131248
4 250 7 1.123308
.. ... ... ...
189 10 95 0.123577
190 10 96 0.131645
191 10 97 0.122587
192 10 98 0.124850
193 10 99 0.126864
I am not tied to plotly, but so far it has returned what I wanted; it's just the fine-tuning that I'm lacking. If there are alternatives I'd be open to those too; it should just convey my benchmarking results in a proper way.
I know there are many questions asked about removing duplicates in SQL. However in my case it is slightly more complicated.
These are data with a Barcode which repeats over a month, so entries sharing the same Barcode are expected. However, it turns out that, due to what is possibly a machine bug, the same data is sometimes recorded 2 to 3 times within a 4-5 minute timeframe. It does not happen for every entry, but it happens rather frequently.
Allow me to demonstrate with a sample table which contains the same Barcode "A00000"
Barcode No Date A B C D
A00000 1499456 10/10/2019 3:28 607 94 1743 72D
A00000 1803564 10/20/2019 22:09 589 75 1677 14D
A00000 1803666 10/20/2019 22:13 589 75 1677 14D
A00000 1803751 10/20/2019 22:17 589 75 1677 14D
A00000 2084561 10/30/2019 12:22 583 86 1677 14D
A00000 2383742 11/9/2019 23:18 594 81 1650 07D
As you can see, the entries on 10/20 contain identical data; these are duplicates and should be removed so that only one of the entries remains (any of them is fine, and the exact time is not the main concern). The "No" column is a purely arbitrary number which can be safely disregarded. The other entries should remain as they are.
I know this should be done using "GROUP BY", but I am struggling with how to write the conditions. I have also tried INNER JOINing the table with itself and then removing the selected results:
T2.A = T2.B AND
T2.[Date] > T1.[Date] AND
strftime('%s',T2.[Date]) - strftime('%s',T1.[Date]) < 600
The results still seem a bit off, as some of the entries are selected twice and some are not selected at all. I am still not used to the SQL style of thinking. Any help is appreciated.
The format of the Date column complicates things a bit, but otherwise the solution basically is to use GROUP BY in the normal way, grouping on the day part of Date (the substring before the space). In the following, I've assumed the name of the table is test:
WITH sane AS
     (SELECT *,
             substr(Date, 1, instr(Date, ' ') - 1) AS day
        FROM test)
SELECT Barcode, max(No), Date, A, B, C, D
  FROM sane
 GROUP BY Barcode, day;
The use of max() is perhaps unneeded but it gives some determinacy, which might be helpful.
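To sanity-check the grouping, here is the sample data from the question pushed through the query using Python's built-in sqlite3 module (the table name test is the same assumption as in the answer):

```python
# Demo: run the GROUP BY dedup query against the sample rows in an
# in-memory SQLite database. The alias "day" holds the date part of Date.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (Barcode TEXT, No INTEGER, Date TEXT, "
            "A INTEGER, B INTEGER, C INTEGER, D TEXT)")
rows = [
    ("A00000", 1499456, "10/10/2019 3:28", 607, 94, 1743, "72D"),
    ("A00000", 1803564, "10/20/2019 22:09", 589, 75, 1677, "14D"),
    ("A00000", 1803666, "10/20/2019 22:13", 589, 75, 1677, "14D"),
    ("A00000", 1803751, "10/20/2019 22:17", 589, 75, 1677, "14D"),
    ("A00000", 2084561, "10/30/2019 12:22", 583, 86, 1677, "14D"),
    ("A00000", 2383742, "11/9/2019 23:18", 594, 81, 1650, "07D"),
]
con.executemany("INSERT INTO test VALUES (?,?,?,?,?,?,?)", rows)

dedup = con.execute("""
    WITH sane AS
      (SELECT *, substr(Date, 1, instr(Date, ' ') - 1) AS day FROM test)
    SELECT Barcode, max(No), Date, A, B, C, D
    FROM sane
    GROUP BY Barcode, day
""").fetchall()
for r in dedup:
    print(r)
# The three 10/20 rows collapse into one; four rows remain.
```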
I am trying to synchronize data from two clocks. Each clock is drifting at different rates. I'd like to synchronize simultaneous events detected on both instruments to the clock of one instrument. Here are some simple data where the times are numeric representing minutes from some starting point:
MasterClock <- c(100, 150, 200, 250, 300)
clock2 <- c(101, 153, 206, 258, 310)
df <- data.frame(MasterClock,clock2)
The first step is simple: identify a simultaneous event, calculate the difference in time between the two clocks, and adjust one clock accordingly. I want to keep a record of the difference between the clocks, so I will create a new variable holding that difference, then add the difference to the original time to get a corrected time, like this:
df$CF[1] <- df$MasterClock[1] - df$clock2[1] #calculate CorrectionFactor
df$clock2Corrected <- df$clock2 + df$CF #calculate corrected time
giving this:
> df
MasterClock clock2 CF clock2Corrected
1 100 101 -1 100
2 150 153 -1 152
3 200 206 -1 205
4 250 258 -1 257
5 300 310 -1 309
In this simplified example it is easy to see that each row represents one simultaneous event. However, if you continue the trend in clock2, it will eventually drift so far that an event begins to look simultaneous with the previous event on the master clock. This is why I want to apply the correction factor from the first record to all of the data first, re-syncing the clocks at every opportunity to keep them as tight as possible (the real data set is much larger and more complex, which increases the likelihood of an assignment error).
From here I need to repeat this process using df$clock2Corrected[2] as the new "original time" for clock 2. The CorrectionFactor (CF) for the second record would be df$MasterClock[2] - df$clock2Corrected[2] = -2. I would then want to apply this correction factor (-2) to records 2-n to get the new updated synced clock for records 2-n. Doing this stepwise will keep the clocks tight so that the CF should remain small over time.
Is there a way to repeat this process step-wise for each record without creating a new column for every row? I suspect it may need to be a for loop nested inside another for loop, but I can't wrap my head around the logic. Here's what I'd like the finished product to look like for this example:
> df
MasterClock clock2 CF clock2Corrected
1 100 101 -1 100
2 150 153 -2 150
3 200 206 -3 200
4 250 258 -2 250
5 300 310 -2 300
OK, it took a while but I think I've got it.
correct_clock <- function(DF){
  DF$CF <- DF$MasterClock[1] - DF$clock2[1]
  DF$clock2Corrected <- DF$clock2 + DF$CF
  n <- nrow(DF)
  for(i in seq_len(n)[-1]){
    DF$CF[i] <- DF$MasterClock[i] - DF$clock2Corrected[i]
    DF$clock2Corrected[i:n] <- DF$clock2[i:n] + sum(DF$CF[1:i])
  }
  DF
}
correct_clock(df)
MasterClock clock2 CF clock2Corrected
1 100 101 -1 100
2 150 153 -2 150
3 200 206 -3 200
4 250 258 -2 250
5 300 310 -2 300
You must then assign the return value of correct_clock to a data frame, either the same one or a new one:
result <- correct_clock(df)
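A side observation (not in the original answer): the cumulative CF through row i telescopes to the instantaneous offset MasterClock[i] - clock2[i], so the same columns can be computed without a loop. A vectorized sketch using the question's sample data:

```r
# Vectorized equivalent: CF is the step-to-step change in the clock offset,
# and the corrected clock ends up equal to MasterClock.
df <- data.frame(MasterClock = c(100, 150, 200, 250, 300),
                 clock2      = c(101, 153, 206, 258, 310))
offset <- df$MasterClock - df$clock2              # -1 -3 -6 -8 -10
df$CF <- c(offset[1], diff(offset))               # -1 -2 -3 -2 -2
df$clock2Corrected <- df$clock2 + cumsum(df$CF)   # 100 150 200 250 300
```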
Big picture: I am trying to do a sliding window analysis on environmental data in R. I have PAR (photosynthetically active radiation) data for a select number of sequential dates (pre-determined based on other biological factors) for two years (2014 and 2015), with one value of PAR per day. See below the first few lines of the data frame (the data frame is named "rollingpar").
par14 par15
1356.3242 1306.7725
NaN 1232.5637
1349.3519 505.4832
NaN 1350.4282
1344.9306 1344.6508
NaN 1277.9051
989.5620 NaN
I would like to create a loop (or any other way possible) to subset the data frame (both columns!) into two-week windows (14 rows) from start to finish, sliding from one window to the next by a week (7 rows). So the first window would include rows 1 to 14, the second window rows 8 to 21, and so forth.
After subsetting, the data needs to be flipped in structure (currently using the melt function in the reshape2 package) so that the PAR values are in one column and the variable (par14 or par15) is in the other. Then I need to get rid of the NaN data and finally perform a Wilcoxon rank sum test on each window, comparing PAR by the year variable (par14 or par15).
Below is the code I wrote to prove the concept of what I wanted, and for the first subsetted window it gives me exactly what I want.
library(reshape2)
par.sub=rollingpar[1:14, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
wilcox.test(value~variable, par.sub)
#when melt flips a data frame the columns become value and variable...
#for this case value holds the PAR data and variable holds the year
#information
When I tried to write a for loop to iterate the process through the whole data frame (139 rows in total) I got errors every which way I ran it. Additionally, this loop doesn't even take the sliding-by-one-week aspect into account; I figured if I could first work out how to get windows and run the analysis via a loop, I could then tackle the sliding part.
Basically, I realize that what I described and what this for loop does are slightly different: the code below slides row by row, i.e. on a one-day basis. I would greatly appreciate it if the solution encompassed the sliding-by-a-week aspect. I am fairly new to R and do not have extensive experience with for loops, so I feel like there is probably an easy fix to make this work.
wilcoxvalues=data.frame(p.values=numeric(0))
Upar=rollingpar$par14
for (i in 1:length(Upar)){
  par.sub=rollingpar[[i]:[i]+13, ]
  par.sub=melt(par.sub)
  par.sub=na.omit(par.sub)
  par.sub$variable=as.factor(par.sub$variable)
  save.sub=wilcox.test(value~variable, par.sub)
  for (j in 1:length(save.sub)){
    wilcoxvalues$p.value[j]=save.sub$p.value
  }
}
If anyone has a better way to do this with a different package or function that I am unaware of, I would love to be enlightened. I did try rollapply, but ran into problems finding a way to apply it to an entire data frame rather than just one column. I have searched through the many other questions regarding subsetting, for loops, and rolling analysis, but can't quite find exactly what I need. Any help would be appreciated by a frustrated grad student :) and if I did not provide enough information please let me know.
Consider an lapply over a sequence of every 7th value through the 365 days of the year (the last day is not included, to avoid a single-day final grouping), returning a list of data frames of Wilcoxon-test p-values with a week indicator. Then row-bind the list items into a final, single data frame:
library(reshape2)
slidingWindow <- seq(1,364,by=7)
slidingWindow
# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120 127
# [20] 134 141 148 155 162 169 176 183 190 197 204 211 218 225 232 239 246 253 260
# [39] 267 274 281 288 295 302 309 316 323 330 337 344 351 358
# LIST OF WILCOX P VALUES DFs FOR EACH SLIDING WINDOW (TWO-WEEK PERIODS)
wilcoxvalues <- lapply(slidingWindow, function(i) {
  par.sub=rollingpar[i:(i+13), ]
  par.sub=melt(par.sub)
  par.sub=na.omit(par.sub)
  par.sub$variable=as.factor(par.sub$variable)
  data.frame(week=paste0("Week: ", i%/%7+1, "-", i%/%7+2),
             p.values=wilcox.test(value~variable, par.sub)$p.value)
})
# SINGLE DF OF ALL P-VALUES
wilcoxdf <- do.call(rbind, wilcoxvalues)
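A note on the hard-coded 365: the question's frame has 139 rows, so the window starts could instead be derived from the row count, keeping only starts whose full two-week window fits. A sketch (139 stands in for nrow(rollingpar)):

```r
# Derive window starts from the actual row count so the last full
# two-week window still fits within the data.
n <- 139                                  # nrow(rollingpar) in the question
slidingWindow <- seq(1, n - 13, by = 7)   # 1 8 15 ... 120
tail(slidingWindow, 2)                    # 113 120; window 120:133 fits in 139 rows
```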
I have two datasets: one called domain (d), which has general information about a gene, and a table called mutation (m). Both tables have a similar column called Gene.name, which I'll use for matching. The two datasets do not have the same number of columns or rows.
I want to go through all the data in the mutation file and check whether each value in column Gene.name also exists in the domain file. If it does, I want to check whether the value in the Mutation column lies between the Start and End columns (it can be equal to Start or End). If it is, I want to print it to a new table with the merged columns: Gene.name, Mutation, and the domain information. If it doesn't exist, ignore it.
So this is what I have so far:
d<-read.table("domains.txt")
d
Gene.name Domain Start End
ABCF1 low_complexity_region 2 13
DKK1 low_complexity_region 25 39
ABCF1 AAA 328 532
F2 coiled_coil_region 499 558
m<-read.table("mutations.tx")
m
Gene.name Mutation
ABCF1 10
DKK1 21
ABCF1 335
xyz 15
F2 499
newfile<-m[, list(new=findInterval(d(c(d$Start,
d$End)),by'=Gene.Name']
My code isn't working, and after reading a lot of different questions and answers I'm even more confused. Any help would be great.
I'd like my final data to look like this:
Gene.name Mutation Domain
DKK1 21 low_complexity_region
ABCF1 335 AAA
F2 499 coiled_coil_region
A merge and subset should get you there (though I think your intended result doesn't match your description of what you want):
result <- merge(d,m,by="Gene.name")
result[with(result,Mutation >= Start & Mutation <= End),]
# Gene.name Domain Start End Mutation
#1 ABCF1 low_complexity_region 2 13 10
#4 ABCF1 AAA 328 532 335
#6 F2 coiled_coil_region 499 558 499
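If only the three requested columns are wanted, the same merge-and-subset can be trimmed down. A self-contained sketch rebuilding the sample tables from the question:

```r
# Rebuild the sample data, merge on Gene.name, keep in-range rows,
# and select only Gene.name, Mutation, and Domain.
d <- data.frame(Gene.name = c("ABCF1", "DKK1", "ABCF1", "F2"),
                Domain = c("low_complexity_region", "low_complexity_region",
                           "AAA", "coiled_coil_region"),
                Start = c(2, 25, 328, 499),
                End   = c(13, 39, 532, 558))
m <- data.frame(Gene.name = c("ABCF1", "DKK1", "ABCF1", "xyz", "F2"),
                Mutation  = c(10, 21, 335, 15, 499))
result <- merge(d, m, by = "Gene.name")
hits <- result[with(result, Mutation >= Start & Mutation <= End),
               c("Gene.name", "Mutation", "Domain")]
hits
# three rows: ABCF1/10, ABCF1/335, F2/499 (DKK1/21 falls outside 25-39)
```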