I am trying to figure out how to get the time between consecutive events when events are stored as a column of dates in a dataframe.
sampledf=structure(list(cust = c(1L, 1L, 1L, 1L), date = structure(c(9862,
9879, 10075, 10207), class = "Date")), .Names = c("cust", "date"
), row.names = c(NA, -4L), class = "data.frame")
I can get an answer with
as.numeric(rev(rev(difftime(c(sampledf$date[-1],0),sampledf$date))[-1]))
# [1] 17 196 132
but it is really ugly. Among other things, I only know how to exclude the first item in a vector, but not the last, so I have to rev() twice to drop the last value.
Is there a better way?
By the way, I will use ddply to do this to a larger set of data for each cust id, so the solution would need to work with ddply.
library(plyr)
ddply(sampledf,
c("cust"),
summarize,
daysBetween = as.numeric(rev(rev(difftime(c(date[-1],0),date))[-1]))
)
Thank you!
Are you looking for this?
as.numeric(diff(sampledf$date))
# [1] 17 196 132
To remove the last element, use head:
head(as.numeric(diff(sampledf$date)), -1)
# [1] 17 196
require(plyr)
ddply(sampledf, .(cust), summarise, daysBetween = as.numeric(diff(date)))
# cust daysBetween
# 1 1 17
# 2 1 196
# 3 1 132
You can just use diff.
as.numeric(diff(sampledf$date))
To leave off the last element, you can do:
vec[-length(vec)] # where `vec` is your vector
In this case I don't think you need to leave anything off, though, because diff is already one element shorter. Note that inside ddply you should refer to the column as date, not sampledf$date, so the differences are computed per customer:
test <- ddply(sampledf,
c("cust"),
summarize,
daysBetween = as.numeric(diff(date))
)
test
# cust daysBetween
#1 1 17
#2 1 196
#3 1 132
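For larger data, the same per-customer computation also works as a data.table one-liner. A minimal sketch, assuming sampledf as defined in the question:
library(data.table)
# convert in place, then take day gaps within each cust group;
# diff() already returns one fewer element per group
setDT(sampledf)[, .(daysBetween = as.numeric(diff(date))), by = cust]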
I want to create a new column in a dataframe based on partial matches against another column. The problem is that my values only match partially: the suffix _3p or _5p at the end of the names exists in the original dataframe but not in the other column I am testing against.
The code I am using should work, but because of the partial-match issue it does not, and I am stuck.
> head(df)
# A tibble: 6 x 2
microRNAs `number of targets`
<chr> <int>
1 bantam|LQNS02278082.1_33125_3p 128
2 bantam|LQNS02278082.1_33125_5p 8
3 Dpu-Mir-10-P2_LQNS02277998.1_30984_3p 44
4 Dpu-Mir-10-P2_LQNS02277998.1_30984_5p 78
5 Dpu-Mir-10-P3_LQNS02277998.1_30988_3p 1076
6 Dpu-Mir-10-P3_LQNS02277998.1_30988_5p 309
> dput(head(df))
structure(list(microRNAs = c("bantam|LQNS02278082.1_33125_3p",
"bantam|LQNS02278082.1_33125_5p", "Dpu-Mir-10-P2_LQNS02277998.1_30984_3p",
"Dpu-Mir-10-P2_LQNS02277998.1_30984_5p", "Dpu-Mir-10-P3_LQNS02277998.1_30988_3p",
"Dpu-Mir-10-P3_LQNS02277998.1_30988_5p"), `number of targets` = c(128L,
8L, 44L, 78L, 1076L, 309L)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
#matches to look for
unique
1 miR-9|LQNS02278094.1_36129
2 LQNS02278139.1_39527
3 LQNS02278139.1_39523
4 LQNS02278075.1_32386
5 Dpu-Mir-10-P3_LQNS02277998.1_30988
> dput(head(unique))
structure(list(unique = c("miR-9|LQNS02278094.1_36129",
"LQNS02278139.1_39527", "LQNS02278139.1_39523", "LQNS02278075.1_32386",
"Dpu-Mir-10-P3_LQNS02277998.1_30988")), row.names = c(NA,
-5L), class = "data.frame")
#Create new column with Yes, No
df$new <- ifelse(df$microRNAs %in% unique$unique, 'Yes', 'No')
##But it all appears like No due to the partial match.
A fast solution using data.table:
library(data.table)
# convert data.frame to data.table
setDT(df)
# create temporary column dropping the last 3 characters
df[, microRNAs_short := substr(microRNAs, 1, nchar(microRNAs) - 3)]
# check values in common
df[, new := fifelse(microRNAs_short %in% unique$unique, 'Yes', 'No')]
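The same idea works in base R as a one-liner. A sketch, assuming df is the original tibble and the lookup table is the unique data frame from the question:
# base R sketch: drop the _3p/_5p suffix with a regex, then test membership
df$new <- ifelse(sub("_[35]p$", "", df$microRNAs) %in% unique$unique, "Yes", "No")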
We could use regex_left_join from fuzzyjoin
library(fuzzyjoin)
regex_left_join(df, unique, by = c("microRNAs" = "unique"))
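One caveat (my note, not part of the original answer): | and . in these names are regex metacharacters, so the patterns may match more than intended. If that matters, escape at least the | first, e.g.:
# sketch: escape the literal "|" so it is not treated as regex alternation
unique$unique <- gsub("|", "\\|", unique$unique, fixed = TRUE)
regex_left_join(df, unique, by = c("microRNAs" = "unique"))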
I would like to sum a single column of data that was output from an sqldf function in R.
I have a csv. file that contains groupings of sites with a uniqueID and their associated areas. For example:
occurrenceID sarea
{0255531B-904F-4E2D-B81D-797A21165A2F} 0.30626786
{0255531B-904F-4E2D-B81D-797A21165A2F} 0.49235953
{0255531B-904F-4E2D-B81D-797A21165A2F} 0.03490536
{0255531B-904F-4E2D-B81D-797A21165A2F} 0.00001389
{175A4B1C-CA8C-49F6-9CD6-CED9187579DC} 0.0302389
{175A4B1C-CA8C-49F6-9CD6-CED9187579DC} 0.01360811
{1EC60400-0AD0-4DB5-B815-221C4123AE7F} 0.08412911
{1EC60400-0AD0-4DB5-B815-221C4123AE7F} 0.01852466
I used the code below in R to pull out the largest area from each grouping of unique IDs.
> MyData <- read.csv(file="sacandaga2.csv", header=TRUE, sep=",")
> sqldf("select max(sarea),occurrenceID from MyData group by occurrenceID")
This produced the following output:
max(sarea) occurrenceID
1 0.49235953 {0255531B-904F-4E2D-B81D-797A21165A2F}
2 0.03023890 {175A4B1C-CA8C-49F6-9CD6-CED9187579DC}
3 0.08412911 {1EC60400-0AD0-4DB5-B815-221C4123AE7F}
4 0.00548259 {2412E244-2E9A-4477-ACC6-1EB02503BE75}
5 0.00295924 {40450574-ABEB-48E3-9BE5-09B5AB65B465}
6 0.01403846 {473FB631-D398-46B7-8E85-E63540BDFF92}
7 0.00257519 {4BABDE22-E8E0-435E-B60D-0BB9A84E1489}
8 0.02158115 {5F616A33-B028-46B1-AD92-89EAC1660C41}
9 0.00191211 {70067496-25B6-4337-8C70-782143909EF9}
10 0.03049355 {7F858EBB-132E-483F-BA36-80CE889373F5}
11 0.03947298 {9A579565-57EC-4E46-95ED-79724FA6F2AB}
12 0.02464722 {A9010BA3-0FE1-40B1-96A7-21122261A003}
13 0.00136672 {AAD710BF-1539-4235-87F1-34B66CF90781}
14 0.01139146 {AB1286C3-DBE3-467B-99E1-AEEF88A1B5B2}
15 0.07954269 {BED0433A-7167-4184-A25F-B9DBD358AFFB}
16 0.08401067 {C4EF0F45-5BF7-4F7C-BED8-D6B2DB718CB2}
17 0.04289261 {C58AC2C6-BDBE-4FE5-BD51-D70BBDFB4DB5}
18 0.03151558 {D4230F9C-80E4-454A-9D5D-0E373C6DCD9A}
19 0.00403585 {DD76A03A-CFBF-41E9-A571-03DA707BEBDA}
20 0.00007336 {E20DE254-8A0F-40BE-90D2-D6B71880E2A8}
21 9.81847859 {F382D5A6-F385-426B-A543-F5DE13F94564}
22 0.00815881 {F9032905-074A-468F-B60E-26371CF480BB}
23 0.24717113 {F9E5DC3C-4602-4C80-B00B-2AF1D605A265}
Now I would like to sum all the values in the max(sarea) column. What is the best way to accomplish this?
You can either do it all in sqldf, or assign your existing result and sum it in R:
# assign your original
grouped_sum = sqldf("select max(sarea),occurrenceID from MyData group by occurrenceID")
# and sum in R
sum(grouped_sum$`max(sarea)`)
# you might prefer to use a standard column name so you don't need backticks
grouped_sum = sqldf(
"select max(sarea) as max_sarea, occurrenceID
from MyData
group by occurrenceID"
)
sum(grouped_sum$max_sarea)
If the intention is to do this in a single sqldf call, use a WITH clause:
library(sqldf)
sqldf("with tmpdat AS (
select max(sarea) as mxarea, occurrenceID
from MyData group by occurrenceID
) select sum(mxarea)
as smxarea from tmpdat")
# smxarea
#1 0.6067275
Data:
MyData <-
structure(list(occurrenceID = c("{0255531B-904F-4E2D-B81D-797A21165A2F}",
"{0255531B-904F-4E2D-B81D-797A21165A2F}", "{0255531B-904F-4E2D-B81D-797A21165A2F}",
"{0255531B-904F-4E2D-B81D-797A21165A2F}", "{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}",
"{175A4B1C-CA8C-49F6-9CD6-CED9187579DC}", "{1EC60400-0AD0-4DB5-B815-221C4123AE7F}",
"{1EC60400-0AD0-4DB5-B815-221C4123AE7F}"), sarea = c(0.30626786,
0.49235953, 0.03490536, 1.389e-05, 0.0302389, 0.01360811, 0.08412911,
0.01852466)), class = "data.frame", row.names = c(NA, -8L))
You can do it by summing the per-group maximum values in a subquery:
sqldf("select sum(max_sarea) as sum_of_max_sarea
from (select max(sarea) as max_sarea,
occurrenceID from Mydata group by occurrenceID)")
# sum_of_max_sarea
# 1 0.6067275
Data:
Mydata <- structure(list(occurrenceID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L),
.Label = c("0255531B-904F-4E2D-B81D-797A21165A2F", "175A4B1C-CA8C-49F6-9CD6-CED9187579DC",
"1EC60400-0AD0-4DB5-B815-221C4123AE7F"), class = "factor"),
sarea = c(0.30626786, 0.49235953, 0.03490536, 1.389e-05, 0.0302389,
0.01360811, 0.08412911, 0.01852466)), class = "data.frame",
row.names = c(NA, -8L))
If DF is the last data frame shown in the question, this sums the numeric column; the square brackets quote the non-syntactic max(sarea) column name in SQLite:
sqldf("select sum([max(sarea)]) as sum from DF")
## sum
## 1 11.07853
Note
We assume this data frame, shown here in reproducible form:
Lines <- "max(sarea) occurrenceID
1 0.49235953 {0255531B-904F-4E2D-B81D-797A21165A2F}
2 0.03023890 {175A4B1C-CA8C-49F6-9CD6-CED9187579DC}
3 0.08412911 {1EC60400-0AD0-4DB5-B815-221C4123AE7F}
4 0.00548259 {2412E244-2E9A-4477-ACC6-1EB02503BE75}
5 0.00295924 {40450574-ABEB-48E3-9BE5-09B5AB65B465}
6 0.01403846 {473FB631-D398-46B7-8E85-E63540BDFF92}
7 0.00257519 {4BABDE22-E8E0-435E-B60D-0BB9A84E1489}
8 0.02158115 {5F616A33-B028-46B1-AD92-89EAC1660C41}
9 0.00191211 {70067496-25B6-4337-8C70-782143909EF9}
10 0.03049355 {7F858EBB-132E-483F-BA36-80CE889373F5}
11 0.03947298 {9A579565-57EC-4E46-95ED-79724FA6F2AB}
12 0.02464722 {A9010BA3-0FE1-40B1-96A7-21122261A003}
13 0.00136672 {AAD710BF-1539-4235-87F1-34B66CF90781}
14 0.01139146 {AB1286C3-DBE3-467B-99E1-AEEF88A1B5B2}
15 0.07954269 {BED0433A-7167-4184-A25F-B9DBD358AFFB}
16 0.08401067 {C4EF0F45-5BF7-4F7C-BED8-D6B2DB718CB2}
17 0.04289261 {C58AC2C6-BDBE-4FE5-BD51-D70BBDFB4DB5}
18 0.03151558 {D4230F9C-80E4-454A-9D5D-0E373C6DCD9A}
19 0.00403585 {DD76A03A-CFBF-41E9-A571-03DA707BEBDA}
20 0.00007336 {E20DE254-8A0F-40BE-90D2-D6B71880E2A8}
21 9.81847859 {F382D5A6-F385-426B-A543-F5DE13F94564}
22 0.00815881 {F9032905-074A-468F-B60E-26371CF480BB}
23 0.24717113 {F9E5DC3C-4602-4C80-B00B-2AF1D605A265}"
DF <- read.table(text = Lines, check.names = FALSE)
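(read.table detects the header automatically here because the header row has one fewer field than the data rows, and check.names = FALSE keeps the max(sarea) name from being converted to a syntactic one.)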
As part of a project, I am using R to analyze some data. I am stuck on deriving a few values from an existing dataset that I imported from a csv file.
The file looks like the structure shown below.
For my analysis, I want to create another column holding the difference between the current value of x and its previous value; for the first row of each unique i, the new column should simply keep the current value of x. I am new to R and have been trying various approaches for some time without being able to figure this out. Any suggestions on an approach I can follow would be appreciated.
Mydata structure
structure(list(t = 1:10, x = c(34450L, 34469L, 34470L, 34483L,
34488L, 34512L, 34530L, 34553L, 34575L, 34589L), y = c(268880.73342868,
268902.322359863, 268938.194698248, 268553.521856105, 269175.38273083,
268901.619719038, 268920.864512966, 269636.604121984, 270191.206593437,
269295.344751692), i = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("t", "x", "y", "i"), row.names = c(NA, 10L), class = "data.frame")
You can use the package data.table to obtain what you want:
library(data.table)
setDT(MyData)[, x_diff := c(x[1], diff(x)), by=i]
MyData
# t x i x_diff
# 1: 1 34287 1 34287
# 2: 2 34789 1 502
# 3: 3 34409 1 -380
# 4: 4 34883 1 474
# 5: 5 34941 1 58
# 6: 6 34045 2 34045
# 7: 7 34528 2 483
# 8: 8 34893 2 365
# 9: 9 34551 2 -342
# 10: 10 34457 2 -94
Data:
set.seed(123)
MyData <- data.frame(t=1:10, x=sample(34000:35000, 10, replace=T), i=rep(1:2, e=5))
You can use the diff() function. If you want to add a new column to your existing data frame, note that diff() returns a vector one element shorter than your current data frame, so in your case you can try this:
# if your data frame is called MyData
MyData$newX = c(NA,diff(MyData$x))
That should insert an NA value as the first entry in your new column, and the remaining values will be the differences between sequential values in your "x" column.
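If, as the question asks, the first entry should instead be the value of x itself rather than NA, seed the vector with the first element. A sketch for a single group:
# first entry keeps x itself; the rest are consecutive differences
MyData$newX = c(MyData$x[1], diff(MyData$x))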
UPDATE:
You can create a simple loop that subsets on every unique instance of "i" and then calculates the differences between your x values:
# initialize a new dataframe
newdf = NULL
values = unique(MyData$i)
for(i in 1:length(values)){
  data1 = MyData[MyData$i == values[i], ]
  data1$newX = c(NA, diff(data1$x))
  newdf = rbind(newdf, data1)
}
# and then, if you want, overwrite your original dataframe with newdf
MyData = newdf
# remove the temporary variables
rm(data1, newdf, values)
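A more compact base-R alternative to the loop is ave(), which applies a function within each group of i. A sketch:
# per group: keep the first value as-is, then take consecutive differences
MyData$newX = ave(MyData$x, MyData$i, FUN = function(v) c(v[1], diff(v)))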
I have been searching for this for a while, but haven't been able to find a clear answer so far. I have probably been looking for the wrong terms, but maybe somebody here can quickly help me. The question is fairly basic.
Sample data set:
set <- structure(list(VarName = structure(c(1L, 5L, 4L, 2L, 3L),
.Label = c("Apple/Blue/Nice",
"Apple/Blue/Ugly", "Apple/Pink/Ugly", "Kiwi/Blue/Ugly", "Pear/Blue/Ugly"
), class = "factor"), Color = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("Blue",
"Pink"), class = "factor"), Qty = c(45L, 34L, 46L, 21L, 38L)), .Names = c("VarName",
"Color", "Qty"), class = "data.frame", row.names = c(NA, -5L))
This gives a data set like:
set
VarName Color Qty
1 Apple/Blue/Nice Blue 45
2 Pear/Blue/Ugly Blue 34
3 Kiwi/Blue/Ugly Blue 46
4 Apple/Blue/Ugly Blue 21
5 Apple/Pink/Ugly Pink 38
What I would like to do is fairly straightforward: sum (or average, or take the standard deviation of) the Qty column. But I would also like to do the same operation under the following conditions:
VarName includes "Apple"
VarName includes "Ugly"
Color equals "Blue"
Can anybody give me a quick introduction on how to perform these kinds of calculations?
I am aware that some of it can be done by the aggregate() function, e.g.:
aggregate(set[3], FUN=sum, by=set[2])[1,2]
However, I believe there is a more straightforward way of doing this. Are there filters that can be added to functions like sum()?
The easiest way is to split up your VarName column; then subsetting becomes very easy. So, let's create an object where VarName has been separated:
##There must(?) be a better way than this. Anyone? (One alternative is sketched below.)
new_set = t(as.data.frame(sapply(as.character(set$VarName), strsplit, "/")))
Brief explanation:
We use as.character because set$VarName is a factor
sapply takes each value in turn and applies strsplit
The strsplit function splits up the elements
We convert to a data frame
Transpose to get the correct rotation
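The alternative referred to above: do.call plus rbind gives the same unnamed character matrix in one step (a sketch):
# strsplit returns a list of length-3 vectors; rbind them into a matrix
new_set = do.call(rbind, strsplit(as.character(set$VarName), "/"))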
Next,
##Convert to a data frame
new_set = as.data.frame(new_set)
##Make nice rownames - not actually needed
rownames(new_set) = 1:nrow(new_set)
##Add in the Qty column
new_set$Qty = set$Qty
This gives
R> new_set
V1 V2 V3 Qty
1 Apple Blue Nice 45
2 Pear Blue Ugly 34
3 Kiwi Blue Ugly 46
4 Apple Blue Ugly 21
5 Apple Pink Ugly 38
Now all the operations are standard. For example,
##Add up all blue Qtys
sum(new_set[new_set$V2 == "Blue",]$Qty)
[1] 146
##Average of Blue and Ugly Qtys
mean(new_set[new_set$V2 == "Blue" & new_set$V3 == "Ugly",]$Qty)
[1] 33.67
Once it's in the correct form, you can use ddply, which does everything you want (and more):
library(plyr)
##Split the data frame up by V1 and take the mean of Qty
ddply(new_set, .(V1), summarise, m = mean(Qty))
##Split the data frame up by V1 & V2 and take the mean of Qty
ddply(new_set, .(V1, V2), summarise, m = mean(Qty))
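The same pattern covers the other summaries mentioned in the question, for example a standard deviation per color (a sketch):
##Split the data frame up by V2 and take the standard deviation of Qty
ddply(new_set, .(V2), summarise, s = sd(Qty))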
Is this what you're looking for?
# sum for those including 'Apple'
apple <- set[grep('Apple', set[, 'VarName']), ]
aggregate(apple[3], FUN=sum, by=apple[2])
Color Qty
1 Blue 66
2 Pink 38
# sum for those including 'Ugly'
ugly <- set[grep('Ugly', set[, 'VarName']), ]
aggregate(ugly[3], FUN=sum, by=ugly[2])
Color Qty
1 Blue 101
2 Pink 38
# sum for Color==Blue
sum(set[set[, 'Color']=='Blue', 3])
[1] 146
The last sum could be done by using subset
sum(subset(set, Color=='Blue')[,3])
I've got a data frame that contains several interleaved values that occurred in a timeline. I'd like to create a new data frame that contains line numbers (row IDs, basically), a file descriptor, operation and a "size" value.
Example:
line fd syscall size
1 1 1 lseek 1289020416
2 2 1 lseek 1289021440
3 3 2 lseek 1289024512
4 4 1 lseek 1289025536
5 5 2 lseek 1289026560
6 6 1 lseek 1289027584
I'd like to compute a diff of the size values per fd and show the starting point of the diff. The diff function itself throws away a lot of data. Is there something similar that will help me have context (e.g. where the beginning of each line was)?
I'd like results that look like the following where I know how far each fd has moved since the previous line, and what the previous line was.
line fd diff
1 1 1 1024
2 2 1 4096
3 3 2 2048
4 4 1 2048
Is there something I can do that's easier than tearing it all apart and looping? I have to believe someone has a slightly better diff out there.
Example input:
structure(list(line = 1:6, fd = c(1, 1, 2, 1, 2, 1), syscall = structure(c(1L,
1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "lseek"), size = c(1289020416,
1289021440, 1289024512, 1289025536, 1289026560, 1289027584)), .Names = c("line",
"fd", "syscall", "size"), row.names = c(NA, -6L), class = "data.frame")
Use plyr to cut the data.frame into pieces and transform to attach the new vector:
library(plyr)
ddply(dtf, .(fd), function(x) transform(x, diff = c(x$size[1], diff(x$size))))
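If you would rather have NA for the first row of each fd (instead of repeating the first size), a data.table sketch with shift() gives the same per-fd differences while keeping the line numbers for context:
library(data.table)
# difference from the previous row within each fd; first row per fd is NA
setDT(dtf)[, diff := size - shift(size), by = fd]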