I have a data frame with multiple time series identified by unique IDs. I would like to remove any time series that contains only 0 values.
The data frame looks as follows:
id date value
AAA 2010/01/01 9
AAA 2010/01/02 10
AAA 2010/01/03 8
AAA 2010/01/04 4
AAA 2010/01/05 12
B 2010/01/01 0
B 2010/01/02 0
B 2010/01/03 0
B 2010/01/04 0
B 2010/01/05 0
CCC 2010/01/01 45
CCC 2010/01/02 46
CCC 2010/01/03 0
CCC 2010/01/04 0
CCC 2010/01/05 40
I want any time series with only 0 values to be removed, so that the data frame looks as follows:
id date value
AAA 2010/01/01 9
AAA 2010/01/02 10
AAA 2010/01/03 8
AAA 2010/01/04 4
AAA 2010/01/05 12
CCC 2010/01/01 45
CCC 2010/01/02 46
CCC 2010/01/03 0
CCC 2010/01/04 0
CCC 2010/01/05 40
This is a follow-up to a previous question that was answered with a really great solution using the data.table package.
R efficiently removing missing values from the start and end of multiple time series in 1 data frame
If dat is a data.table, then this is easy to write and read:
dat[, .SD[any(value != 0)], by = id]
.SD stands for Subset of Data. This answer explains .SD very well.
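For a runnable check, here is a minimal sketch that reconstructs the question's data as a data.table and applies the filter (the construction is mine, not from the question):
library(data.table)

dat <- data.table(
  id    = rep(c("AAA", "B", "CCC"), each = 5),
  date  = rep(seq(as.Date("2010-01-01"), by = "day", length.out = 5), 3),
  value = c(9, 10, 8, 4, 12,  0, 0, 0, 0, 0,  45, 46, 0, 0, 40)
)

# keep only the groups with at least one non-zero value; group B drops out
dat[, .SD[any(value != 0)], by = id]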
Picking up on Gabor's nice use of ave, but without repeating the same variable name (DF) three times, which can be a source of typo bugs if you have a lot of long or similar variable names, try:
dat[ave(value != 0, id, FUN = any)]
The difference in speed between those two may depend on several factors, including: i) the number of groups, ii) the size of each group, and iii) the number of columns in the real dat.
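A rough way to compare them on your own data (a sketch; using the microbenchmark package is my suggestion, any timing tool will do):
library(microbenchmark)

# both expressions assume dat is the data.table built above
microbenchmark(
  sd_filter = dat[, .SD[any(value != 0)], by = id],
  ave_index = dat[ave(value != 0, id, FUN = any)],
  times = 100
)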
Try this. No packages are used.
DF[ ave(DF$value != 0, DF$id, FUN = any), ]
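To see it work, a small sketch with just the id and value columns from the question (date omitted for brevity):
DF <- data.frame(
  id    = rep(c("AAA", "B", "CCC"), each = 5),
  value = c(9, 10, 8, 4, 12,  0, 0, 0, 0, 0,  45, 46, 0, 0, 40)
)

DF[ ave(DF$value != 0, DF$id, FUN = any), ]
# the all-zero group B disappears; CCC's partial zeros are kept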
An easy plyr solution would be:
library(plyr)
ddply(mydat, "id", function(x) if (all(x$value == 0)) NULL else x)
(seems to work OK) but there may be a faster solution with data.table ...
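For comparison, a dplyr equivalent of the same group-wise filter (my own sketch; the answers above use plyr, data.table, and base R):
library(dplyr)

dat %>%
  group_by(id) %>%
  filter(any(value != 0)) %>%
  ungroup()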
Basically, I have a very large data frame/data table, and I would like to search a column for the first, and closest, NA value at an index lower than my current position.
For example, let's say I have a data frame DF as follows:
INDEX | KEY | ITEM
----------------------
1 | 10 | AAA
2 | 12 | AAA
3 | NA | AAA
4 | 18 | AAA
5 | NA | AAA
6 | 24 | AAA
7 | 29 | AAA
8 | 31 | AAA
9 | 34 | AAA
From this data frame we have an NA value at index 3 and at index 5. Now, let's say we start at index 8 (which has a KEY of 31). I would like to search the column KEY backwards so that the moment the first instance of NA is found, the search stops and the index of that NA value is returned.
I know there are ways to find all NA values in a vector/column (for example, I can use which(is.na(x)) to return the indices that are NA), but due to the sheer size of the data frame I am working with, and the large number of iterations that need to be performed, this is very inefficient. One method I thought of is a kind of "do while" loop, and it does seem to work, but it repeats the calculation each time; given that I need to do over 100,000 iterations this does not look like a good idea.
Is there a fast way of searching a column backwards from a particular index such that I can find the index of the closest NA value?
Why not do a forward fill of the NA indexes once, so that you can then look up the most recent NA for any row in the future:
library(dplyr)
library(tidyr)
df = df %>%
  mutate(last_missing = if_else(is.na(KEY), INDEX, as.integer(NA))) %>%
  fill(last_missing)
Output:
> df
INDEX KEY ITEM last_missing
1 1 10 AAA NA
2 2 12 AAA NA
3 3 NA AAA 3
4 4 18 AAA 3
5 5 NA AAA 5
6 6 24 AAA 5
7 7 29 AAA 5
8 8 31 AAA 5
9 9 34 AAA 5
Now there's no need to recalculate every time you need the answer for a given row. There may be more efficient ways to do the forward fill, but I think exploring those is easier than figuring out how to optimise the backward search.
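For what it's worth, the same forward fill can be done in base R with a running maximum (a sketch of mine; it assumes INDEX is sorted and increasing, as in the example):
# use the row's INDEX where KEY is NA, 0 elsewhere
na_pos <- ifelse(is.na(df$KEY), df$INDEX, 0L)
# cummax() carries the most recent NA index forward
df$last_missing <- cummax(na_pos)
# rows before the first NA have no previous NA at all
df$last_missing[df$last_missing == 0L] <- NA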
I have the data table below and want to replace any usage other than Web or Mobile with another category, say Others. (OR)
Is there any way to group the usage as Web, Mobile, and everything else as Others without replacing the values, i.e. reporting Web used 1 time, Mobile 1, and Others 4? (OR)
Do we need to write a function to do so?
Id Name Usage
1 AAA Web
2 BBB Mobile
3 CCC Manual
4 DDD M1
5 EEE M2
6 FFF M3
Assuming that 'Usage' is of character class, we can use %chin% to create a logical index, negate it (!), and assign (:=) the matching values of 'Usage' to 'Others'. This is efficient because we assign in place without any copying.
library(data.table)
setDT(df1)[!Usage %chin% c("Web", "Mobile"), Usage := "Others"]
df1
# Id Name Usage
#1: 1 AAA Web
#2: 2 BBB Mobile
#3: 3 CCC Others
#4: 4 DDD Others
#5: 5 EEE Others
#6: 6 FFF Others
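For the '(OR)' part of the question, the counts can also be produced without overwriting 'Usage', by grouping on a derived column instead. A sketch using data.table's fifelse, run on the original (unmodified) data:
setDT(df1)[, .N, by = .(Usage = fifelse(Usage %chin% c("Web", "Mobile"), Usage, "Others"))]
#     Usage N
#1:     Web 1
#2:  Mobile 1
#3:  Others 4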
This is my first time posting to Stack Exchange; my apologies, as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between a start/stop pair I'd like a 1, otherwise a 0. For every row in "truth" that finds a match I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID; instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 1
2 32241.14 0
3 68138.72 1
4 111233.93 0
5 128395.28 0
6 146180.31 0
7 188433.35 0
8 198714.7 0
Alternatively, knowing whether any of my 'possible' time values fall within 2 seconds of the start or end times would also do the trick (again with 1/0 outputs).
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
How this works: the melt turns each interval into two boundary rows ('start' and 'stop'), and the rolling join matches every time to the most recent boundary at or before it, so landing on a 'start' boundary means the time lies inside an interval, while landing on a 'stop' means it has passed one. The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
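If you need the exact 1/0 columns from the original question, one hedged follow-up on the join result above (res is just my name for the joined table):
res <- melt(truth, measure.vars=c("start","stop"), value.name="times")[
  possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]

# 1 where the time fell inside an interval, 0 otherwise
possible[, between := as.integer(res$status == "in")]
# 1 for every truth row whose interval captured at least one time
truth[, match := as.integer(id %in% res[status == "in", truthid])]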
I'll post a solution that I'm pretty sure works the way you want, in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data. Next time, please provide this from your own data set in your post, e.g. using dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
                    c(sample(5:20, size = 100, replace = T)),
                    c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with, the following solution provides what I believe you are asking for. It should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 (false) by default
truth.match <- rep(0, times = nrow(truth))
possible.between <- rep(0, times = nrow(possible))
#iterate through the 'possible' dataframe
for (i in 1:nrow(possible)) {
  #get boolean vector showing which of the 'truth' rows are a 'match'
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) {possible$Times[i] %between% x})
  #if any are true then update the match and between vectors
  if (any(match.vec)) {
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
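If the loop turns out to be too slow on the real data, here is a vectorized sketch of the same idea (my own alternative; it assumes truth has named Start and Stop columns as in the original question, which the randomly generated sample above does not):
# compare every Time against every Start/Stop pair in one shot
# (trades memory for speed: builds an nrow(possible) x nrow(truth) matrix)
in_range <- outer(possible$Times, truth$Start, ">=") &
  outer(possible$Times, truth$Stop, "<=")
possible$betweenAny <- as.integer(rowSums(in_range) > 0)
truth$anyMatch <- as.integer(colSums(in_range) > 0)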
I am trying to do an if statement with a subset function in it.
I have a dataframe dat1, for example like this:
Unit Cost Date
1 40 Sep
1 50 Dec
2 55 Sep
2 30 Oct
And based on its row count, nrow(dat1), I want to subset another data frame (dat2):
unit model sales
1 AAA 100
1 BBB 110
1 CCC 130
4 ZZZ 120
5 YYY 128
I wrote an if statement like this:
Sales <- ifelse(nrow(dat1) >= 30,
                dat2[which(dat2$unit == 1 & dat2$model == "AAA"), ],
                dat2[which(dat2$unit == 1), ])
So if nrow(dat1) >= 30 I want to subset dat2 on two dimensions, else just on one of them.
However, this gives me a list with only the first column, not a data frame with all 3 columns of dat2.
What is the right command to do this?
Thanks in advance for your help.
This works. The reason your version fails is that ifelse() is vectorized: it returns a result the same length as its test, so with a length-one test you get back only the first element (here, the first column) of the chosen data frame. A plain if avoids that:
Sales <- dat2[which(dat2$unit==1),] # default
if (nrow(dat1) >= 30) {
  Sales <- dat2[which(dat2$unit == 1 & dat2$model == "AAA"), ]
}
Use if followed by else statements for data subsetting:
if (nrow(dat1) >= 30) {
  Sales <- dat2[dat2$unit == 1 & dat2$model == "AAA", ]
} else {
  Sales <- dat2[dat2$unit == 1, ]
}
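Since if/else is an expression in R and returns a value, the same logic can also be written with a single assignment:
Sales <- if (nrow(dat1) >= 30) {
  dat2[dat2$unit == 1 & dat2$model == "AAA", ]
} else {
  dat2[dat2$unit == 1, ]
}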
I'm trying to apply the kmodes clustering method (from the klaR package) in R to a text matrix of around 1000 strings and 6 columns.
Unfortunately, I get an error that I cannot comprehend:
kmodes(mat, 5, iter.max=10)
Error in cluster[j] <- which.min(dist) : replacement has length zero
Do you have any ideas about why this is happening?
EDIT: This is the head(mat):
1 aaa ccc iii <NA> 0
2 aaa ddd kkk <NA> 0
3 aaa eee -273 <NA> 0
4 aaa fff lll <NA> 0
5 bbb ggg 67 <NA> 0
6 bbb hhh mmm <NA> 0
You can very well use kmodes(na.omit(mat), 5, iter.max = 10), but that will lead to loss of data, which is undesirable.
You could instead check summary(mat) for missing data or NA values and then replace them using random imputation, for example with an impute() function (the call below looks like Hmisc's impute(); that is an assumption on my part):
dataframeName$variable <- with(dataframeName, impute(variable, 'random'))
And in case you have too many NA values in a column/variable, you can also drop that column.
Check out this link:
https://discuss.analyticsvidhya.com/t/how-to-handle-missing-values-of-categorical-variables/310
I had the same error. It seems to be due to missing data. Remove the missing values first, e.g. by using kmodes(na.omit(mat), 5, iter.max=10).
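Putting the two answers together, a minimal sketch (klaR's kmodes after dropping incomplete rows; 5 clusters and iter.max = 10 as in the question):
library(klaR)

mat_clean <- na.omit(mat)              # drop rows containing any NA
cl <- kmodes(mat_clean, 5, iter.max = 10)
head(cl$cluster)                       # cluster assignment for each remaining row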