I've got a data frame that contains several interleaved values that occurred in a timeline. I'd like to create a new data frame that contains line numbers (row IDs, basically), a file descriptor, operation and a "size" value.
Example:
  line fd syscall       size
1    1  1   lseek 1289020416
2    2  1   lseek 1289021440
3    3  2   lseek 1289024512
4    4  1   lseek 1289025536
5    5  2   lseek 1289026560
6    6  1   lseek 1289027584
I'd like to compute a diff of the size values per fd and show the starting point of each diff. The diff function itself throws away a lot of data. Is there something similar that will let me keep context (e.g. which line each diff started from)?
I'd like results that look like the following, where I know how far each fd has moved since its previous line, and what that previous line was.
  line fd diff
1    1  1 1024
2    2  1 4096
3    3  2 2048
4    4  1 2048
Is there something I can do that's easier than tearing it all apart and looping? I have to believe someone has a slightly better diff out there.
Example input:
structure(list(line = 1:6, fd = c(1, 1, 2, 1, 2, 1), syscall = structure(c(1L,
1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "lseek"), size = c(1289020416,
1289021440, 1289024512, 1289025536, 1289026560, 1289027584)), .Names = c("line",
"fd", "syscall", "size"), row.names = c(NA, -6L), class = "data.frame")
Use plyr to cut the data.frame into pieces and transform() to attach the new vector:
library(plyr)
ddply(dtf, .(fd), function(x) transform(x, diff = c(x$size[1], diff(x$size))))
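If you prefer base R, ave() can do the same per-fd computation while preserving the original row order; a minimal sketch, assuming the dtf from the dput above:
# Base-R sketch: per-fd diff, with the first value of each fd kept as-is
# (the same convention as the ddply call above)
dtf$diff <- ave(dtf$size, dtf$fd, FUN = function(s) c(s[1], diff(s)))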
I am at the beginner stage of R programming; please help me with the issue below.
I have different desc values assigned to the same sol attribute in different rows. I want to combine all the desc values for each sol into a single row, as shown below.
My data is as follows:
sol desc
1 fry, toast
1 frt,grt,gty
1 ytr,uyt,ytr
6 hyt, ytr,oiu
4 hyg,hyu,loi
4 opu,yut,yut
I want the output as follows :
sol desc
1 fry, toast,frt,grt,gty,ytr,uyt,ytr
6 hyt, ytr,oiu
4 hyg,hyu,loi,opu,yut,yut
Note: you can input any values in desc as per your convenience.
aggregate() is what you are looking for. Try this:
aggregate(desc ~ sol, data = df, paste, collapse = ",")
sol desc
1 1 fry, toast,frt,grt,gty,ytr,uyt,ytr
2 4 hyg,hyu,loi,opu,yut,yut
3 6 hyt, ytr,oiu
Data
df <- structure(list(sol = c(1L, 1L, 1L, 6L, 4L, 4L), desc = c("fry, toast",
"frt,grt,gty", "ytr,uyt,ytr", "hyt, ytr,oiu", "hyg,hyu,loi",
"opu,yut,yut")), .Names = c("sol", "desc"), class = "data.frame", row.names = c(NA,
-6L))
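For comparison, the same aggregation written with dplyr (a sketch, assuming the df above):
library(dplyr)
df %>%
  group_by(sol) %>%
  summarise(desc = paste(desc, collapse = ","))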
I've been trying to solve this issue with mapply, but I believe I will have to use several nested applies to make it work, and it has gotten really confusing.
The problem is as follows:
Dataframe one contains around 400 keywords. These fall into roughly 15 categories.
Dataframe two contains a string description field, and 15 additional columns, each named to correspond to the categories mentioned in dataframe one. This has millions of rows.
If a keyword from dataframe 1 exists in the string field in dataframe 2, the category in which the keyword exists should be flagged in dataframe 2.
What I want should look something like this:
# Dataframe 1: df1
keyword category
cat     A
dog     A
pig     A
crow    B
pigeon  B
hawk    B
catfish C
carp    C
...

# Dataframe 2: df2
description   A B C ....
false cat     1 0 0 ....
smiling pig   1 0 0 ....
shady pigeon  0 1 0 ....
dogged dog    2 0 0 ....
sad catfish   0 0 1 ....
hawkward carp 0 1 1 ....
....
I tried to use mapply to get this to work, but it fails with the error "longer argument not a multiple of length of shorter", and it only computes matches against the first string in df2. I haven't gotten beyond this stage, i.e. actually setting the category flags.
> mapply(grepl, pattern = df1$keyword, x = df2$description)
Could anyone be of help? Thank you very much. I am new to R, so it would also help if someone could mention some rules of thumb for turning loops into apply functions. I cannot afford to use loops to solve this, as it would take far too much time.
There might be a more elegant way to do this but this is what I came up with:
## Your sample data:
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
## Load packages:
library(stringr)
library(dplyr)
library(tidyr)
## For each entry in df2$description count how many times each keyword
## is contained in it:
outList <- lapply(df2$description, function(description){
  data.frame(description = description,
             value = vapply(stringr::str_extract_all(description, df1$keyword),
                            length, numeric(1)),
             category = df1$category)
})
## Combine to one long data frame and aggregate by category:
outLongDf <- do.call('rbind', outList) %>%
group_by(description, category) %>%
dplyr::summarise(value = sum(value))
## Reshape from long to wide format:
outWideDf <- tidyr::spread(data = outLongDf, key = category,
value = value)
outWideDf
# Source: local data frame [6 x 4]
# Groups: description [6]
#
# description A B C
# * <fctr> <dbl> <dbl> <dbl>
# 1 dogged dog 2 0 0
# 2 false cat 1 0 0
# 3 hawkward carp 0 1 1
# 4 sad catfish 1 0 1
# 5 shady pigeon 1 1 0
# 6 smiling pig 1 0 0
This approach, however, also catches the "pig" in "pigeon" and the "cat" in "catfish". I don't know whether that is what you want.
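If only whole-word matches are wanted, one option (my own sketch, not part of the original answer) is to anchor each keyword with word boundaries in the counting step:
# Whole-word variant of the counting step: the \\b anchors stop "pig"
# from matching inside "pigeon"
library(stringr)
counts <- vapply(str_extract_all("shady pigeon",
                                 paste0("\\b", df1$keyword, "\\b")),
                 length, numeric(1))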
What you are looking for is a so-called document-term matrix (dtm for short), which comes from NLP (Natural Language Processing). There are many options available; I prefer text2vec. This package is blazingly fast (I wouldn't be surprised if it outperformed the other solutions here by a wide margin), especially in combination with tokenizers.
In your case the code would look something like this:
# Create the data
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
# load the libraries
library(text2vec) # to create the dtm
library(tokenizers) # to help creating the dtm
library(reshape2) # to reshape the data from wide to long
# 1. create the vocabulary from the keywords
vocabulary <- vocab_vectorizer(create_vocabulary(itoken(df1$keyword)))
# 2. create the dtm
dtm <- create_dtm(itoken(as.character(df2$description)), vocabulary)
# 3. convert the sparse-matrix to a data.frame
dtm_df <- as.data.frame(as.matrix(dtm))
dtm_df$description <- df2$description
# 4. melt to long format
df_result <- melt(dtm_df, id.vars = "description", variable.name = "keyword")
df_result <- df_result[df_result$value == 1, ]
# 5. combine the data, i.e., add category
df_final <- merge(df_result, df1, by = "keyword")
# keyword description value category
# 1 carp hawkward carp 1 C
# 2 cat false cat 1 A
# 3 catfish sad catfish 1 C
# 4 dog dogged dog 1 A
# 5 pig smiling pig 1 A
# 6 pigeon shady pigeon 1 B
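The result above is per keyword; to get the per-category count columns from the question, one possibility (a sketch, using the reshape2 package already loaded above) is to cast on category instead and sum the hits:
# Aggregate keyword hits up to category level and reshape to wide;
# missing description/category combinations are filled with sum() = 0
dcast(df_final, description ~ category, value.var = "value",
      fun.aggregate = sum)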
Whatever the implementation, counting the number of matches per category needs k x d comparisons, where k is the number of keywords and d the number of descriptions.
There are a few tricks to solve this problem quickly and without using a lot of memory:
Use vectorized operations. These are performed much faster than for loops. Note that lapply, mapply and vapply are just shorthand for for loops. I parallelize (see next) over the keywords, so that the vectorization can be over the descriptions, which is the largest dimension.
Use parallelization. Using your multiple cores optimally speeds up the process at the cost of increased memory (since every core needs its own copy of the data).
Example:
library(stringr)   # str_detect
library(parallel)  # mclapply; it forks, so on Windows use mc.cores = 1
                   # or switch to parLapply

# Simulated data: 400 random keywords in 15 categories, 3 million descriptions
keywords <- stringi::stri_rand_strings(400, 2)
categories <- letters[1:15]
keyword_categories <- sample(categories, 400, TRUE)
descriptions <- stringi::stri_rand_strings(3e6, 20)

# For one keyword: a logical vector saying which descriptions contain it
keyword_occurance <- function(word, list_of_descriptions) {
  str_detect(list_of_descriptions, word)
}

# For one category: sum the keyword-hit columns belonging to it
category_occurance <- function(category, mat) {
  rowSums(mat[, keyword_categories == category])
}

# Parallelize over keywords; str_detect vectorizes over the descriptions
list_keywords <- mclapply(keywords, keyword_occurance, descriptions, mc.cores = 8)
df_keywords <- do.call(cbind, list_keywords)
list_categories <- mclapply(categories, category_occurance, df_keywords, mc.cores = 8)
df_categories <- do.call(cbind, list_categories)
On my computer this takes 140 seconds and 14 GB of RAM to match 400 keywords in 15 categories against 3 million descriptions.
As part of a project, I am currently using R to analyze some data. I am stuck on retrieving a few values from an existing dataset that I imported from a CSV file; its structure is given by the dput below.
For my analysis, I want to create another column holding the difference between the current value of x and its previous value, except that for the first row of every unique i the new column should simply keep the current value of x. I am new to R and have been trying various approaches for some time now, but I am still unable to figure out how to do this. Any suggestions on how to achieve this would be appreciated.
MyData structure:
structure(list(t = 1:10, x = c(34450L, 34469L, 34470L, 34483L,
34488L, 34512L, 34530L, 34553L, 34575L, 34589L), y = c(268880.73342868,
268902.322359863, 268938.194698248, 268553.521856105, 269175.38273083,
268901.619719038, 268920.864512966, 269636.604121984, 270191.206593437,
269295.344751692), i = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("t", "x", "y", "i"), row.names = c(NA, 10L), class = "data.frame")
You can use the package data.table to obtain what you want:
library(data.table)
setDT(MyData)[, x_diff := c(x[1], diff(x)), by=i]
MyData
# t x i x_diff
# 1: 1 34287 1 34287
# 2: 2 34789 1 502
# 3: 3 34409 1 -380
# 4: 4 34883 1 474
# 5: 5 34941 1 58
# 6: 6 34045 2 34045
# 7: 7 34528 2 483
# 8: 8 34893 2 365
# 9: 9 34551 2 -342
# 10: 10 34457 2 -94
Data:
set.seed(123)
MyData <- data.frame(t=1:10, x=sample(34000:35000, 10, replace=T), i=rep(1:2, e=5))
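An equivalent data.table formulation (a sketch) uses shift(); fill = 0 reproduces the convention of keeping each group's first value:
# x - shift(x, fill = 0) gives c(x[1], diff(x)) within each group
setDT(MyData)[, x_diff := x - shift(x, fill = 0), by = i]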
You can use the diff() function. Note that if you want to add a new column to your existing data frame, diff() returns a vector one element shorter than the column it is applied to, so in your case you can try this:
# if your data frame is called MyData
MyData$newX = c(NA,diff(MyData$x))
That should insert an NA as the first entry in your new column; the remaining values will be the differences between sequential values in your "x" column.
UPDATE:
You can create a simple loop by subsetting on every unique value of "i" and then calculating the differences between your x values:
# initialize a new data frame
newdf = NULL
values = unique(MyData$i)
for(i in 1:length(values)){
  data1 = MyData[MyData$i == values[i], ]  # note ==, not =
  data1$newX = c(NA, diff(data1$x))
  newdf = rbind(newdf, data1)
}
# and then, if you want, overwrite your original data frame
MyData = newdf
# remove some temporary variables
rm(data1, newdf, values)
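The same grouped difference can also be computed without an explicit loop; a dplyr sketch, assuming MyData as in the question:
library(dplyr)
MyData <- MyData %>%
  group_by(i) %>%
  mutate(newX = c(NA, diff(x))) %>%
  ungroup()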
The goal is for each UniqueID, to select the earliest Date that occurs in the data frame for each Step. This can be done by creating a data frame for each Step, sorting by UniqueID and dates oldest to newest, and then removing duplicated UniqueID.
The difficult part is that each Step represents steps in a process which must occur in order. Step 2 can never occur before Step 1 and so on, and if records show that Step 2 has occurred before step 1, then those records indicate errors in the data that should be ignored.
So in the example data frame below, for UniqueID "A" we would ignore the earliest two instances of Step 2 because they occur before the earliest instance of Step 1, which is not permitted. We can then proceed to obtain the earliest instance of each step that does occur in the permitted order, giving the desired result of:
Step 1 = 9/07/2015
Step 2 = 20/07/2015
Step 3 = 24/07/2015
For UniqueIDs of "B", the earliest instance of Step 3 occurs before the earliest instance of Step 2 so it must be ignored. Once we ignore this we can go on to obtain the following desired values:
Step 1 = 1/06/2015
Step 2 = 22/06/2015
Step 3 = 8/07/2015
Example dataframe:
UniqueID Date Step
A 3/07/2015 2
A 7/07/2015 2
A 9/07/2015 1
A 14/07/2015 1
A 17/07/2015 1
A 20/07/2015 2
A 23/07/2015 2
A 24/07/2015 3
A 29/07/2015 3
B 1/06/2015 1
B 15/06/2015 1
B 22/06/2015 1
B 29/06/2015 1
B 13/07/2015 3
B 22/06/2015 2
B 8/07/2015 3
B 27/07/2015 3
The real data set is very large. What techniques might we use to efficiently attain the desired output? We would like a data frame with a row for each UniqueID and a column for each Step, showing the date on which each ID reached that Step.
Here is the dput for my example data frame:
structure(list(UniqueID = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), Date = structure(c(16619, 16623, 16625,
16630, 16633, 16636, 16639, 16640, 16645, 16587, 16601, 16608,
16615, 16629, 16608, 16624, 16643), class = "Date"), Step = c(2,
2, 1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 3, 3, 3, 2)), .Names = c("UniqueID",
"Date", "Step"), row.names = c(NA, -17L), class = "data.frame")
What you need is a quick function that could check the correctness of the order of steps.
Here is one suggestion:
library(dplyr)

df <- group_by(df, UniqueID) %>%
  mutate(position1 = sapply(Step, function(y) if(y == 1) {0} else {
           min(which(Step < y))  # position of the first smaller Step
         }),
         position2 = 1:length(Step)) %>%
  print

df <- filter(df, position1 <= position2) %>%
  select(-position1, -position2)
First it creates two extra columns: position1 is the minimal position of an element that is less than the current Step, and position2 is simply the position of this step within the current group. Obviously, if position1 is greater than position2, the step is "incorrect": for example, a Step 2 that comes before any Step 1.
After filtering, you will have a data frame with all correct steps, and then you can just use Khashaa's method.
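As a sketch of one way to finish (my own, not Khashaa's original code, assuming dplyr and tidyr are loaded): take the earliest Date per UniqueID and Step on the filtered data, then spread the Steps into columns:
library(tidyr)
result <- df %>%
  group_by(UniqueID, Step) %>%
  summarise(Date = min(Date)) %>%
  spread(Step, Date)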
I am trying to figure out how to get the time between consecutive events when events are stored as a column of dates in a dataframe.
sampledf=structure(list(cust = c(1L, 1L, 1L, 1L), date = structure(c(9862,
9879, 10075, 10207), class = "Date")), .Names = c("cust", "date"
), row.names = c(NA, -4L), class = "data.frame")
I can get an answer with
as.numeric(rev(rev(difftime(c(sampledf$date[-1],0),sampledf$date))[-1]))
# [1] 17 196 132
but it is really ugly. Among other things, I only know how to exclude the first item in a vector, not the last, so I have to rev() twice to drop the last value.
Is there a better way?
By the way, I will use ddply to do this to a larger set of data for each cust id, so the solution would need to work with ddply.
library(plyr)
ddply(sampledf,
c("cust"),
summarize,
daysBetween = as.numeric(rev(rev(difftime(c(date[-1],0),date))[-1]))
)
Thank you!
Are you looking for this?
as.numeric(diff(sampledf$date))
# [1] 17 196 132
To remove the last element, use head:
head(as.numeric(diff(sampledf$date)), -1)
# [1] 17 196
require(plyr)
ddply(sampledf, .(cust), summarise, daysBetween = as.numeric(diff(date)))
# cust daysBetween
# 1 1 17
# 2 1 196
# 3 1 132
You can just use diff.
as.numeric(diff(sampledf$date))
To leave off the last element, you can do:
[-length(vec)] #where `vec` is your vector
In this case I don't think you need to leave anything off though, because diff is already one element shorter:
test <- ddply(sampledf,
              c("cust"),
              summarize,
              daysBetween = as.numeric(diff(date)))
test
# cust daysBetween
#1 1 17
#2 1 196
#3 1 132