I have a data frame df. It has several columns; two of them are dates and serial_day, corresponding to the date an observation was taken and MATLAB's serial day. I would like to restrict my time series so that the increment (in days) between two consecutive observations is 3 or 4, and to separate such blocks by a row of NAs.
It is known that consecutive daily observations never occur, and the case of a 2-day separation followed by another 2-day separation is rare, so it can be ignored.
In the example, increment is shown for convenience, but it is easily generated using the diff function (see the short snippet after the table). So, if the data frame is
serial_day increment
1 4 NA
2 7 3
3 10 3
4 12 2
5 17 5
6 19 2
7 22 3
8 25 3
9 29 4
10 34 5
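For reference, increment can be generated from serial_day like this (a minimal sketch using the data above):
df <- data.frame(serial_day = c(4, 7, 10, 12, 17, 19, 22, 25, 29, 34))
df$increment <- c(NA, diff(df$serial_day))  # NA for the first observation, day gaps afterwards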
I would hope to get a new data frame as:
serial_day increment
1 4 NA
2 7 3
3 10 3
4 NA NA ## Entire row of NAs
5 19 NA
6 22 3
7 25 3
8 29 4
9 NA NA ## Entire row of NAs
I can't figure out a way to do this without looping, which is a bad idea in R.
First, check which rows have an increment that is not equal to 3 or 4. Then replace those rows with a row of NAs:
inds <- which( df$increment > 4 | df$increment < 3 )  # which() drops NAs, so the first row (NA increment) is kept
df[inds, ] <- rep(NA, ncol(df))                       # overwrite those rows with NAs
# serial_day increment
# 1 4 NA
# 2 7 3
# 3 10 3
# 4 NA NA
# 5 NA NA
# 6 NA NA
# 7 22 3
# 8 25 3
# 9 29 4
# 10 NA NA
This may result in multiple consecutive rows of NAs. To reduce these to a single NA row, locate the NA rows with which(), check which of those locations are consecutive with diff(), and remove those rows from df:
NArows <- which(rowSums(is.na(df)) == ncol(df)) # c(4, 5, 6, 10)
inds2 <- NArows[c(FALSE, diff(NArows) == 1)] # c(5, 6)
df <- df[-inds2, ]
# serial_day increment
# 1 4 NA
# 2 7 3
# 3 10 3
# 4 NA NA
# 7 22 3
# 8 25 3
# 9 29 4
# 10 NA NA
I have a data frame A like below.
Notice that the first column is the row name, in random order.
ID
5 10
3 10
1 10
Then, I have another 5 x 1 data frame B filled with NAs. I am trying to copy A into B, matching on the row names of A. I want to get a data frame like below.
ID
1 10
2 NA
3 10
4 NA
5 10
What you are trying to do is potentially dangerous. If you are 100% sure that the rows contain identifiers that match between the two data frames, here's the code.
library(tidyverse)
# Generate a data frame that looks like yours (you don't need this step)
df <- data.frame(ID = c(10, NA, 10, NA, 10))
# Copy the row names into a proper column so they can be joined on
df$names <- row.names(df)
# Keep only the complete rows; df now looks like your data frame A
df <- df[complete.cases(df), ]
# Make a second data frame holding the row names you want in the result
df2 <- data.frame(names = as.character(1:20))
# Join by names (are there other columns you could join by?)
left_join(df2, df, by = "names")
This will produce
names ID
1 1 10
2 2 NA
3 3 10
4 4 NA
5 5 10
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
11 11 NA
12 12 NA
13 13 NA
14 14 NA
15 15 NA
16 16 NA
17 17 NA
18 18 NA
19 19 NA
20 20 NA
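For completeness, the same matching can be done in base R with match() on the row names (a sketch, not part of the answer above; A and B are named as in the question):
A <- data.frame(ID = c(10, 10, 10), row.names = c("5", "3", "1"))
B <- data.frame(ID = rep(NA_real_, 5), row.names = as.character(1:5))
B$ID <- A$ID[match(rownames(B), rownames(A))]  # NA wherever a row name of B is absent from A
B
#   ID
# 1 10
# 2 NA
# 3 10
# 4 NA
# 5 10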
I have a data frame of GPS locations with a column of seconds. How can I create a new column based on time gaps? i.e. for this data.frame:
df <- data.frame(secs=c(1,2,3,4,5,6,7,10,11,12,13,14,20,21,22,23,24,28,29,31))
I would like to cut the data frame wherever there is a time gap of 3 or more seconds between locations, and create a new column called 'bouts' which gives a running tally of the number of sections, to give a data frame looking like this:
id secs bouts
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 10 2
9 11 2
10 12 2
11 13 2
12 14 2
13 20 3
14 21 3
15 22 3
16 23 3
17 24 3
18 28 4
19 29 4
20 31 4
Use cumsum and diff:
df$bouts <- cumsum(c(1, diff(df$secs) >= 3))
Remember that logical values get coerced to numeric values 0/1 automatically and that diff output is always one element shorter than its input.
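To see the intermediate steps, here is the same computation unrolled on the example data (a small sketch):
df <- data.frame(secs = c(1,2,3,4,5,6,7,10,11,12,13,14,20,21,22,23,24,28,29,31))
gaps   <- diff(df$secs)           # 19 gaps for 20 observations
starts <- gaps >= 3               # TRUE wherever a new bout begins
df$bouts <- cumsum(c(1, starts))  # prepend 1 so row 1 opens bout 1; TRUE/FALSE count as 1/0
table(df$bouts)
# 1 2 3 4
# 7 5 5 3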
I have a large data frame in which some columns, at irregular positions, contain only NA values. It looks like this:
   2002-06-26   2002-06-27   2002-06-28   2002-07-01   2002-07-02   2002-07-03   2002-07-05
1  US1718711062 NA           BMG4388N1065 US0116591092 NA           AN8068571086 GB00BYMT0J19
2  US9837721045 NA           US0025671050 US03662Q1058 NA           BMG3223R1088 US0097281069
3               NA           US00847J1051 US06652V2088 NA           BMG4388N1065 US0305061097
4               NA           US04351G1013 US1046741062 NA           BMG7496G1033 US03836W1036
5               NA           US2925621052 US1431301027 NA           CA88157K1012 US06652V2088
6               NA           US34988V1061 US1897541041 NA           CH0044328745 US1547604090
7               NA           US3596941068 US2053631048 NA           GB00B5BT0K07 US1778351056
8               NA           US4180561072 US2567461080 NA           IE00B5LRLL25 US1999081045
9               NA           US4198791018 US2925621052 NA           IE00B8KQN827 US3498531017
10              NA           US45071R1095 US3989051095 NA           IE00BGH1M568 US42222N1037
I need code that identifies the NA columns and fills them with the contents of the previous column. So, for example, column "2002-06-27" should then contain "US1718711062" and "US9837721045". The NA columns occur at irregular intervals.
The columns are also of varying length, some containing only one element, so I think the best way to identify the columns with no values is to look at the first row, like so:
row.has.na <- which(is.na(data[1,]))
[1] 2 5
To complete my comment: as you have already computed row.has.na, the vector of indices of the NA columns, here is a way to use it and get what you need (note that this assumes no two NA columns are adjacent; the answer below handles that case):
data[, row.has.na] <- data[, row.has.na - 1]
This should work. Note that this also works if two (or more) NA columns are next to each other. Maybe there is a way around the while-loop (one possibility is sketched after the output below), but...
# Create some data
data <- data.frame(col1 = 1:10, col2 = NA, col3 = 10:1, col4 = NA, col5 = NA, col6 = NA)
# Find which columns contain NA in the first row
col_NA <- which(is.na(data[1,]))
# Select the previous columns
col_replace <- col_NA - 1
# Check if any NA columns are next to each other and fix it:
while (any(diff(col_replace) == 1)) {
  ind <- which(diff(col_replace) == 1) + 1
  col_replace[ind] <- col_replace[ind] - 1
}
# Replace the NA columns with the previous columns
data[,col_NA] <- data[,col_replace]
col1 col2 col3 col4 col5 col6
1 1 1 10 10 10 10
2 2 2 9 9 9 9
3 3 3 8 8 8 8
4 4 4 7 7 7 7
5 5 5 6 6 6 6
6 6 6 5 5 5 5
7 7 7 4 4 4 4
8 8 8 3 3 3 3
9 9 9 2 2 2 2
10 10 10 1 1 1 1
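As for a way around the while-loop: one possibility is to map every NA column to the nearest non-NA column on its left in a single vectorised pass (a sketch, assuming, as above, that the first column is not NA):
data <- data.frame(col1 = 1:10, col2 = NA, col3 = 10:1, col4 = NA, col5 = NA, col6 = NA)
idx <- seq_along(data)
idx[is.na(data[1, ])] <- 0   # mark the NA columns
src <- cummax(idx)           # index of the nearest non-NA column to the left
data[] <- data[src]          # reproduces the output shown above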
Raw Data   na.approx   desired result
1          1           1
NA         3           4
5          5           5
6          6           6
7          7           7
NA         8           4
NA         9           7
10         10          10
13         11          13
14         12          14
By default, I believe na.approx in R interpolates an NA between the two known values surrounding it, one before and one after the NA (the result is shown in the "na.approx" column above). Is there a way I can change this so that it interpolates based on the next two known values? E.g. the first NA should be filled using 5 and 6, but not 1 and 5.
I am not sure if there is an exact equivalent to what you want to do, but you can achieve similar results the following way:
> data <- c(1, NA, 5,6,7,NA,NA,10,13,14)
> ind <- which(is.na(data))
> sapply(rev(ind), function(i) data[i] <<- data[i + 1] - 1)
> data
[1] 1 4 5 6 7 8 9 10 13 14
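If the exact desired result is needed, i.e. linear extrapolation from the next two known values, here is a rough sketch (it assumes every run of NAs is followed by at least two known values):
x <- c(1, NA, 5, 6, 7, NA, NA, 10, 13, 14)
known <- which(!is.na(x))                 # positions of the known values
for (i in rev(which(is.na(x)))) {
  nxt <- known[known > i][1:2]            # the next two known positions
  slope <- (x[nxt[2]] - x[nxt[1]]) / (nxt[2] - nxt[1])
  x[i] <- x[nxt[1]] - slope * (nxt[1] - i)
}
x
# [1]  1  4  5  6  7  4  7 10 13 14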
This is my first post at StackOverflow. I am relatively new to programming and am trying to work with data.table in R, for its reputation for speed.
I have a very large data.table, named "Actions", with 5 columns and potentially several million rows. The column names are k1, k2, i, l1 and l2. I have another data.table, named "States", containing the unique values of Actions in columns k1 and k2.
For every row in Actions, I would like to find the indices of the rows in States that match columns 4 and 5 (l1 and l2). A reproducible example is as follows:
library(data.table)
S.disc <- c(2000,2000)
S.max <- c(6200,2300)
S.min <- c(700,100)
Traces.num <- 3
Class.str <- lapply(1:2,function(x) seq(S.min[x],S.max[x],S.disc[x]))
Class.inf <- seq_len(Traces.num)
Actions <- data.table(expand.grid(Class.inf, Class.str[[2]], Class.str[[1]], Class.str[[2]], Class.str[[1]])[,c(5,4,1,3,2)])
setnames(Actions,c("k1","k2","i","l1","l2"))
States <- unique(Actions[,list(k1,k2,i)])
So if I were using a data.frame, the equivalent line would be:
index <- apply(Actions,1,function(x) {which((States[,1]==x[4]) & (States[,2]==x[5]))})
How can I do the same with data.table efficiently?
This is relatively simple once you get the hang of keys and the special symbols which may be used in the j expression of a data.table. Try this...
# First make an ID for each row for use in the `dcast`
# because you are going to have multiple rows with the
# same key values and you need to know where they came from
Actions[ , ID := 1:.N ]
# Set the keys to join on
setkeyv( Actions , c("l1" , "l2" ) )
setkeyv( States , c("k1" , "k2" ) )
# Look up Actions in States, using '.I', which gives
# the row locations in States at which the keys of
# Actions are found, and, within each group, the
# row number ( 1:.N - a repeating 1,2,3 )
New <- States[ J(Actions) , list( ID , Ind = .I , Row = 1:.N ) ]
# k1 k2 ID Ind Row
#1: 700 100 1 1 1
#2: 700 100 1 2 2
#3: 700 100 1 3 3
#4: 700 100 2 1 1
#5: 700 100 2 2 2
#6: 700 100 2 3 3
# reshape using 'dcast.data.table'
dcast.data.table( Row ~ ID , data = New , value.var = "Ind" )
# Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27...
#1: 1 1 1 1 4 4 4 7 7 7 10 10 10 13 13 13 16 16 16 1 1 1 4 4 4 7 7 7...
#2: 2 2 2 2 5 5 5 8 8 8 11 11 11 14 14 14 17 17 17 2 2 2 5 5 5 8 8 8...
#3: 3 3 3 3 6 6 6 9 9 9 12 12 12 15 15 15 18 18 18 3 3 3 6 6 6 9 9 9...
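On more recent data.table versions (1.9.6+), a possibly simpler way to get, for each row of Actions, the row numbers of States sharing its (l1, l2) values is an on= join with which = TRUE (a sketch using the objects defined in the question, not the answer's method):
# Each Actions row matches Traces.num rows of States (one per value of i)
idx <- States[Actions, on = c(k1 = "l1", k2 = "l2"), which = TRUE, allow.cartesian = TRUE]
matrix(idx, nrow = Traces.num)   # one column per row of Actions, similar to the dcast result above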