Parsing out all repeat and consecutive numbers in R

Suppose I have a dataframe like this:
1360 C 0 403
1361 A 0 403
1362 G 0 403
1402 0 A 444
2019 T 0 1060
2020 T 0 1060
2021 G 0 1060
2022 T 0 1060
2057 T 0 1085
2062 0 A 1093
2062 0 C 1094
2062 0 C 1095
Desired Output
1402 0 A 444
2057 T 0 1085
I am trying to filter out all rows whose value in column 1 is a repeat of, or consecutive with, another value in that column. In other words, I want to keep only the rows whose column 1 value is neither repeated nor consecutive with any other value in the dataset. Any help will be much appreciated.

You can use diff to find the difference between adjacent elements in a vector. Assuming the vector is sorted, diff will return zero for repeat numbers and one for consecutive numbers.
keep1 <- diff(df[,1]) > 1
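On its own, this still flags values that come right after a jump but start a new sequence (and values at the end of a sequence, right before a jump). So each value has to be checked against both of its neighbours: we combine the vector with its lag-1 version, padding each with TRUE so that the result is as long as the original.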
keep <- c(keep1, TRUE) & c(TRUE, keep1)
df[keep,]
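For reference, here is a self-contained sketch of the same idea run on the data from the question. The column names V1 to V4 are assumptions, since the question gives none, and the data frame is sorted first because diff only behaves this way on sorted input.

```r
# Data from the question; column names V1..V4 are assumed for illustration.
df <- data.frame(
  V1 = c(1360, 1361, 1362, 1402, 2019, 2020, 2021, 2022, 2057, 2062, 2062, 2062),
  V2 = c("C", "A", "G", "0", "T", "T", "G", "T", "T", "0", "0", "0"),
  V3 = c("0", "0", "0", "A", "0", "0", "0", "0", "0", "A", "C", "C"),
  V4 = c(403, 403, 403, 444, 1060, 1060, 1060, 1060, 1085, 1093, 1094, 1095)
)

# diff() only detects repeats and consecutive values when the column is sorted.
df <- df[order(df$V1), ]

# TRUE where the step to the next value is larger than 1.
gap <- diff(df$V1) > 1

# Keep a row only if there is a gap on both sides of it.
# The first and last rows have one neighbour, so the missing side is padded with TRUE.
keep <- c(TRUE, gap) & c(gap, TRUE)

result <- df[keep, ]
print(result)
```

Only the rows for 1402 and 2057 survive, because every other value has a neighbour that is equal to it or differs by exactly one.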

Related

Looking for code in R to summarize by ____H or ____D?

I have a table of ASVs per sample. Each sample is labelled with a number (the sample) and a letter that corresponds to human or dog. I am trying to see which ASVs occur only in humans or only in dogs. My idea is to sum all rows by dog or human, ignoring individual samples, and then check which sums are 0 and which are greater than zero.
I am unsure of the code; I have tried a few things but none have worked. I am mainly working with phyloseq and DESeq2. This is the table I'm working with, 11,000 ASV samples.
I'm a little confused about what the row names and column names represent, but I gave it a go. Correct me if this is not exactly what you meant.
The data.table package has a neat function, melt(), that transforms data from wide to long format. This will make it easier for you to analyze and sum your values.
library(data.table)
data <- data.table(
`ASV_ID` = c(3,5,6,7,10,11,12,14,15,16,20),
`2104H` = c(0,353,483,305,289,200,0,0,0,284,406),
`2104D` = c(470,39,43,427,48,488,356,390,482,0,0),
`2105H` = c(0,784,816,0,704,100,0,0,0,158,141),
`2105D` = c(0,0,0,0,0,0,0,0,0,0,0))
data
ASV_ID 2104H 2104D 2105H 2105D
1: 3 0 470 0 0
2: 5 353 39 784 0
3: 6 483 43 816 0
4: 7 305 427 0 0
5: 10 289 48 704 0
6: 11 200 488 100 0
7: 12 0 356 0 0
8: 14 0 390 0 0
9: 15 0 482 0 0
10: 16 284 0 158 0
11: 20 406 0 141 0
data2 <- melt(
data = data,
id.vars = c("ASV_ID"),
measure.vars = c("2104H","2104D","2105H","2105D"),
variable.name = "sample",
value.name = "value")
data2[,.(Sum = sum(value)),by=.(sample)]
sample Sum
1: 2104H 2320
2: 2104D 2743
3: 2105H 2703
4: 2105D 0
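To get at the human-only and dog-only ASVs directly, the sums can also be grouped per ASV and per host rather than per sample. This is a rough sketch and not part of the answer above: it assumes the last character of each sample name is the host letter (H or D), and the names host, by_host, human_only and dog_only are made up for illustration.

```r
library(data.table)

# Same example table as above.
data <- data.table(
  ASV_ID  = c(3, 5, 6, 7, 10, 11, 12, 14, 15, 16, 20),
  `2104H` = c(0, 353, 483, 305, 289, 200, 0, 0, 0, 284, 406),
  `2104D` = c(470, 39, 43, 427, 48, 488, 356, 390, 482, 0, 0),
  `2105H` = c(0, 784, 816, 0, 704, 100, 0, 0, 0, 158, 141),
  `2105D` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
)

# Wide to long: one row per ASV and sample.
long <- melt(data, id.vars = "ASV_ID",
             variable.name = "sample", value.name = "value")

# Assumption: the host letter is the last character of the sample name.
long[, host := substring(as.character(sample), nchar(as.character(sample)))]

# One row per ASV, one summed column per host (columns D and H).
by_host <- dcast(long, ASV_ID ~ host, value.var = "value", fun.aggregate = sum)

# ASVs seen in one host but never in the other.
human_only <- by_host[H > 0 & D == 0, ASV_ID]
dog_only   <- by_host[D > 0 & H == 0, ASV_ID]

print(human_only)
print(dog_only)
```

On the example table this reports ASVs 16 and 20 as human-only and ASVs 3, 12, 14 and 15 as dog-only.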

Execute a condition after leaving a certain number of values in a column

I have a data frame as shown below which has around 130k data values.
Eng_RPM Veh_Spd
340 56
450 65
670 0
800 0
890 0
870 0
... ..
800 0
790 0
940 0
... ...
1490 67
1540 78
1880 81
I need another variable called Idling_Count that increments whenever Eng_RPM >= 400 and Veh_Spd == 0. The counter must only start 960 data points after the data point that first satisfied the condition. The condition should also not be applied to the first 960 data points, as shown below.
Expected Output
Eng_RPM Veh_Spd Idling_Count
340 56 0
450 65 0
670 0 0
... ... 0 (Upto first 960 values)
600 0 0(The Idling time starts but counter should wait for another 960 values to increment the counter value)
... ... 0
800 0 1(This is the 961st Values after start of Idling time i.e Eng_RPM>400 and Veh_Spd==0)
890 0 2
870 0 3
... .. ..
800 1 0
790 2 0
940 3 0
450 0 0(Data point which satisfies the condition but counter should not increment for another 960 values)
1490 0 4(961st Value from the above data point)
1540 0 5
1880 81 0
.... ... ... (This cycle should continue for rest of the data points)
Here is how to do it with data.table (avoiding a for loop, which is known to be slow in R).
library(data.table)
setDT(df)
# create a serial number for observation
df[, serial := seq_len(nrow(df))]
# find series of consecutive observations matching the condition
# then create internal serial id within each series
df[Eng_RPM > 400 & Veh_Spd == 0, group_serial := seq_len(.N),
   by = cumsum((serial - shift(serial, type = "lag", fill = 1)) != 1)]
df[is.na(group_serial), group_serial := 0]
# identify observations with group_serial larger than 960, add id
df[group_serial > 960, Idling_Count := seq_len(.N)]
df[is.na(Idling_Count), Idling_Count := 0]
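The same run-based logic can be tried on a small example in base R. This is only a sketch under stated assumptions: it swaps in rle() for the data.table grouping, uses >= 400 as written in the question, and the function name idling_count and its wait argument are made up for illustration. The toy call uses wait = 2 instead of 960 so the result is easy to check by eye.

```r
# Hypothetical helper: counts idling rows, but only from the (wait + 1)-th
# row of each uninterrupted idling run onwards.
idling_count <- function(rpm, spd, wait = 960) {
  idle <- rpm >= 400 & spd == 0

  # Position of every row inside its run of identical idle/non-idle values.
  runs <- rle(idle)
  pos_in_run <- sequence(runs$lengths)

  # A row is counted once its run has already lasted `wait` rows.
  counted <- idle & pos_in_run > wait

  # Running total on counted rows, 0 everywhere else.
  ifelse(counted, cumsum(counted), 0L)
}

rpm <- c(340, 450, 670, 800, 890, 870, 800, 1490, 450, 500, 600, 700, 1880)
spd <- c(56, 65, 0, 0, 0, 0, 0, 67, 0, 0, 0, 0, 81)

result <- idling_count(rpm, spd, wait = 2)
print(result)
```

A row can only be counted if at least wait idling rows precede it in the same run, so the first wait rows of the data are never counted either.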
You can also do this with a for loop, like this.
First, create sample data and an empty column Indling_Cnt:
End_RMP <- round(runif(1800,340,1880),0)
Veh_Spd <- round(runif(1800,0,2),0)
dta <- data.frame(End_RMP,Veh_Spd)
dta$Indling_Cnt <- rep(0,1800)
To fill Indling_Cnt you can use a for loop with a few if conditions. This is probably not the most efficient way to do it, but it should work. There are better, though more complex, solutions, for example using packages such as data.table, as mentioned in the other answers.
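# First row after the initial 960 that meets the idling condition, plus the 960-row wait.
# n is at least 1921, so the data needs more rows than that for the counter to move.
n <- which(dta$End_RMP[-(1:960)] >= 400 & dta$Veh_Spd[-(1:960)] == 0)[1] + 960 + 960
for (i in 2:nrow(dta)) {
  if (i >= n) {
    if (dta$End_RMP[i] >= 400 & dta$Veh_Spd[i] == 0) {
      # idling: increment the running count
      dta$Indling_Cnt[i] <- dta$Indling_Cnt[i - 1] + 1
    } else {
      # not idling: carry the previous count forward
      dta$Indling_Cnt[i] <- dta$Indling_Cnt[i - 1]
    }
  }
}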

adding and subtracting values in multiple data frames of different lengths - flow analysis

Thank you jakub and Hack-R!
Yes, these are my actual data. The data I am starting from are the following:
[A] #first, longer dataset
CODE_t2 VALUE_t2
111 3641
112 1691
121 1271
122 185
123 522
124 0
131 0
132 0
133 0
141 626
142 170
211 0
212 0
213 0
221 0
222 0
223 0
231 95
241 0
242 0
243 0
244 0
311 129
312 1214
313 0
321 0
322 0
323 565
324 0
331 0
332 0
333 0
334 0
335 0
411 0
412 0
421 0
422 0
423 0
511 6
512 0
521 0
522 0
523 87
In the above table, we can see the 44 land use CODES (which I inappropriately named "class" in my first entry) for a certain city. Some values are just 0, meaning that there are no land uses of that type in that city.
Starting from this table, which displays all the land use types for t2 and their corresponding values ("VALUE_t2"), I have to reconstruct the previous amount of each land use type ("VALUE_t1").
To do so, I have to add to and subtract from the value of each land use (if not 0) using the "change land use table" from t2 to t1, which is the following:
[B] #second, shorter dataset
CODE_t2 CODE_t1 VALUE_CHANGE1
121 112 2
121 133 12
121 323 0
121 511 3
121 523 2
123 523 4
133 123 3
133 523 4
141 231 12
141 511 37
So, in order to get VALUE_t1 from VALUE_t2, I have, for instance, to subtract 2 + 12 + 0 + 3 + 2 hectares (first 5 values of the second, shorter table) from the value of land use type/code 121 of the first, longer table (1271 ha), and add 2 hectares to land type 112, 12 hectares to land type 133, 3 hectares to land type 511 and 2 hectares to land type 523. And I have to do that for all the land use types different than 0, and later also from t1 to t0.
What I need is a sort of loop that both adds and subtracts, for each land use type/code, the changes that turn VALUE_t2 into VALUE_t1, and then VALUE_t1 into VALUE_t0.
Once I have estimated VALUE_t1 and VALUE_t0, I will put the values in a simple table showing the relative variation (the values here are not real):
CODE VALUE_t0 VALUE_t2 % VAR t2-t0
code1 50 100 ((100-50)/50)*100
code2 70 80 ((80-70)/70)*100
code3 45 34 ((34-45)/45)*100
What I could do so far is:
land_code <- names(A)[-1]
land_code
A$VALUE_t1 <- for(code in land_code{
cbind(A[1], A[land_code] - B[match(A$CODE_t2, B$CODE_t2), land_code])
}
If I use the loop I get an error, while if I take it away:
A$VALUE_t1 <- cbind(A[1], A[land_code] - B[match(A$CODE_t2, B$CODE_t2), land_code])
it works but I don't really get what I want to get... so far I was working on how to get a new column which would contain the new "add & subtract" values, but haven't succeeded yet. So I worked on how to get a new column which would at least match the land use types first, to then include the "add and subtract" formula.
Another problem is that, by using "match", I get a shorter A$VALUE_t1 table (13 rows instead of 44), while I would like to keep all the land use types in dataset A, because I will then have to match it with the table containing VALUE_t0 (which I haven't shown here).
Sorry that I cannot do better than this at the moment... and I hope to have explained better what I have to do. I am extremely grateful for any help you can provide to me.
thanks a lot
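A minimal sketch of the add-and-subtract step described above, not a definitive solution. It uses only the nine codes of table A that appear in table B to stay short, but the same code keeps all 44 rows if the full table is used. The names flow_out and flow_in are made up for illustration. The reading assumed here is that every VALUE_CHANGE1 is subtracted from its CODE_t2 and added to its CODE_t1.

```r
# Subset of table A (codes that occur in table B), values from the question.
A <- data.frame(
  CODE_t2  = c(112, 121, 123, 133, 141, 231, 323, 511, 523),
  VALUE_t2 = c(1691, 1271, 522, 0, 626, 95, 565, 6, 87)
)

# Table B from the question.
B <- data.frame(
  CODE_t2       = c(121, 121, 121, 121, 121, 123, 133, 133, 141, 141),
  CODE_t1       = c(112, 133, 323, 511, 523, 523, 123, 523, 231, 511),
  VALUE_CHANGE1 = c(2, 12, 0, 3, 2, 4, 3, 4, 12, 37)
)

# For every code in A: total change leaving it (it is the CODE_t2 of a row in B)
# and total change arriving at it (it is the CODE_t1 of a row in B).
# Codes that never occur in B get a sum over nothing, which is 0.
flow_out <- vapply(A$CODE_t2,
                   function(code) sum(B$VALUE_CHANGE1[B$CODE_t2 == code]),
                   numeric(1))
flow_in <- vapply(A$CODE_t2,
                  function(code) sum(B$VALUE_CHANGE1[B$CODE_t1 == code]),
                  numeric(1))

# A keeps all of its rows; only the new column is added.
A$VALUE_t1 <- A$VALUE_t2 - flow_out + flow_in
print(A)
```

For code 121 this gives 1271 - (2 + 12 + 0 + 3 + 2) = 1252, and for code 112 it gives 1691 + 2 = 1693, matching the worked example in the question. The same step could be repeated with a t1-to-t0 change table to obtain VALUE_t0.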

Product between two data.frames columns

I have two data.frames:
The first one is the coefficients of my regressions for each day:
> parametrosBase
beta0 beta1 beta4
2015-12-15 0.1622824 -0.012956819 -0.04637442
2015-12-16 0.1641884 -0.007914548 -0.06170213
2015-12-17 0.1623660 -0.005618474 -0.05914809
2015-12-18 0.1643263 0.005380472 -0.08533237
2015-12-21 0.1667710 0.003824588 -0.09040071
The second one is: the independent (x) variables:
> head(ir_dfSTORED)
ind m h0x h1x h4x beta0_h0x beta1_h1x beta4_h4x
1 2015-12-15 21 1 0.5642792 0.2859359 0 0 0
2 2015-12-15 42 1 0.3606713 0.2831963 0 0 0
3 2015-12-15 63 1 0.2550200 0.2334554 0 0 0
4 2015-12-15 84 1 0.1943071 0.1883048 0 0 0
5 2015-12-15 105 1 0.1561231 0.1544524 0 0 0
6 2015-12-15 126 1 0.1302597 0.1297947 0 0 0
> tail(ir_dfSTORED)
ind m h0x h1x h4x beta0_h0x beta1_h1x beta4_h4x
835 2015-12-21 2415 1 0.006799321 0.006799321 0 0 0
836 2015-12-21 2436 1 0.006740707 0.006740707 0 0 0
837 2015-12-21 2457 1 0.006683094 0.006683094 0 0 0
838 2015-12-21 2478 1 0.006626457 0.006626457 0 0 0
839 2015-12-21 2499 1 0.006570773 0.006570773 0 0 0
840 2015-12-21 2520 1 0.006516016 0.006516016 0 0 0
What I want is to multiply the beta0 column of "parametrosBase" by the h0x column of "ir_dfSTORED" and store the result in the beta0_h0x column, and likewise for beta1 (with h1x) and beta4 (with h4x).
The problem I'm facing is with the dates in the "ind" column: the multiplication has to respect the dates.
So whenever the day changes in "ir_dfSTORED", I have to switch to the same day in "parametrosBase".
For example:
The first row of "parametrosBase"
2015-12-15 0.1622824 -0.012956819 -0.04637442
applies to 2015-12-15, so I use it for every product on that day. Once I reach 2015-12-16, I have to use the second row of "parametrosBase".
How can I do this?
Thanks a lot. :)
Maybe you should merge the two datasets first:
parametrosBase$ind <- rownames(parametrosBase)
df <- merge(ir_dfSTORED,parametrosBase)
df <- within(df,{
beta0_h0x <- beta0*h0x
beta1_h1x <- beta1*h1x
beta4_h4x <- beta4*h4x
})
Since I don't know the structure of the data, you may have to convert the dates from rownames to a date format in order for the merge to work. Using ind as the name of the date in parametrosBase is key to making merge work, otherwise you'll have to specify the variables to merge by.
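If the row order of ir_dfSTORED matters, a lookup with match() is a possible alternative to merging. The sketch below is only an illustration: it uses a few rows copied from the question, assumes the dates in ind and in the row names are stored as matching text, and the name idx is made up.

```r
# Two days of coefficients, copied from the question; dates are the row names.
parametrosBase <- data.frame(
  beta0 = c(0.1622824, 0.1667710),
  beta1 = c(-0.012956819, 0.003824588),
  beta4 = c(-0.04637442, -0.09040071),
  row.names = c("2015-12-15", "2015-12-21")
)

# Two rows per day, copied from head() and tail() in the question.
ir_dfSTORED <- data.frame(
  ind = c("2015-12-15", "2015-12-15", "2015-12-21", "2015-12-21"),
  m   = c(21, 42, 2415, 2436),
  h0x = c(1, 1, 1, 1),
  h1x = c(0.5642792, 0.3606713, 0.006799321, 0.006740707),
  h4x = c(0.2859359, 0.2831963, 0.006799321, 0.006740707)
)

# For each row of ir_dfSTORED, the row of parametrosBase with the same date.
idx <- match(as.character(ir_dfSTORED$ind), rownames(parametrosBase))

# Multiply each x column by the coefficient of its own day.
ir_dfSTORED$beta0_h0x <- parametrosBase$beta0[idx] * ir_dfSTORED$h0x
ir_dfSTORED$beta1_h1x <- parametrosBase$beta1[idx] * ir_dfSTORED$h1x
ir_dfSTORED$beta4_h4x <- parametrosBase$beta4[idx] * ir_dfSTORED$h4x

print(ir_dfSTORED)
```

Dates in ir_dfSTORED that have no row in parametrosBase produce NA in idx, and therefore NA products, which makes missing days easy to spot.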

creating vector from 'if' function using apply in R

I'm trying to create a new vector in R using an 'if' function to pull out only certain values for the new array. Basically, I want to segregate data by day of the week for each of several cities. How do I use the apply function to get only, say, Tuesdays in a new array for each city? Thanks
It sounds as though you don't want if or apply at all. The solution is simpler:
Suppose that your data frame is data. Then subset(data, Weekday == 3) should work.
You don't want to use R's if here. Instead, use the subsetting function [:
dat <- read.table(text=" Date Weekday Holiday Atlanta Chicago Houston Tulsa
1 1/1/2008 3 1 313 313 361 123
2 1/2/2008 4 0 735 979 986 310
3 1/3/2008 5 0 690 904 950 286
4 1/4/2008 6 0 610 734 822 281
5 1/5/2008 7 0 482 633 622 211
6 1/6/2008 1 0 349 421 402 109", header=TRUE)
dat[ dat$Weekday==3, ]
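Building on the same subsetting idea, here is a small sketch that keeps only the city columns and also separates every weekday in one step with split(). It assumes, as the answers above do, that Weekday == 3 stands for Tuesday. The names cities, tuesdays and by_weekday are made up for illustration.

```r
dat <- read.table(text = "Date Weekday Holiday Atlanta Chicago Houston Tulsa
1 1/1/2008 3 1 313 313 361 123
2 1/2/2008 4 0 735 979 986 310
3 1/3/2008 5 0 690 904 950 286
4 1/4/2008 6 0 610 734 822 281
5 1/5/2008 7 0 482 633 622 211
6 1/6/2008 1 0 349 421 402 109", header = TRUE)

cities <- c("Atlanta", "Chicago", "Houston", "Tulsa")

# Rows for one weekday, city columns only (assumes 3 means Tuesday).
tuesdays <- dat[dat$Weekday == 3, cities]

# One data frame of city values per weekday, stored in a list keyed by weekday.
by_weekday <- split(dat[cities], dat$Weekday)

print(tuesdays)
print(names(by_weekday))
```

Each element of by_weekday can then be pulled out by its weekday number, for example by_weekday[["3"]].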
