I have a time series:
df <- data.frame(t=1:10, x= c(5,7,8,9,5,5,5,5,4,3))
I want to remove values that are identical to the previous value to obtain:
x = c(5,7,8,9,5,4,3)
I tried:
df[unique(df$x),]
But this gives an incorrect result.
You can do:
df[c(1, diff(df$x)) != 0, ]
   t x
1  1 5
2  2 7
3  3 8
4  4 9
5  5 5
9  9 4
10 10 3
With dplyr, you can do:
df %>%
filter(x != lag(x, default = first(x)-1))
t x
1 1 5
2 2 7
3 3 8
4 4 9
5 5 5
6 9 4
7 10 3
In base R, we can use head and tail
subset(df, c(TRUE, head(x, -1) != tail(x, -1)))
# t x
#1 1 5
#2 2 7
#3 3 8
#4 4 9
#5 5 5
#9 9 4
#10 10 3
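To see what the comparison does: head(x, -1) drops the last element and tail(x, -1) drops the first, so each value lines up against its predecessor; the leading TRUE keeps row 1. For this x:
head(df$x, -1)  # 5 7 8 9 5 5 5 5 4   (positions 1..9)
tail(df$x, -1)  # 7 8 9 5 5 5 5 4 3   (positions 2..10)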
Another base R solution uses rle. If you want to subset the data frame based on the criterion, you can work with the run lengths; if you only need the deduplicated x values, you can extract them directly from the values component of rle. See below:
df[cumsum(rle(df$x)$lengths), ] # dataframe subset
# t x
# 1 1 5
# 2 2 7
# 3 3 8
# 4 4 9
# 8 8 5
# 9 9 4
# 10 10 3
rle(df$x)$values # vector of values
# [1] 5 7 8 9 5 4 3
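For reference, the rle() object both chunks rely on; note that cumsum(lengths) indexes the last row of each run, which is why this subset keeps t = 8 for the run of 5s while the diff-based answers above keep t = 5:
rle(df$x)
# Run Length Encoding
#   lengths: int [1:7] 1 1 1 1 4 1 1
#   values : num [1:7] 5 7 8 9 5 4 3
cumsum(rle(df$x)$lengths)
# [1]  1  2  3  4  8  9 10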
Or using data.table:
library(data.table)
setDT(df)[, rn := 1:.N, by = rleid(x)][rn == 1, .(t, x)]
# t x
# 1: 1 5
# 2: 2 7
# 3: 3 8
# 4: 4 9
# 5: 5 5
# 6: 9 4
# 7: 10 3
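Here rleid() labels every run of identical values with a consecutive id, so rn == 1 keeps the first row of each run. A small illustration:
rleid(c(5, 7, 8, 9, 5, 5, 5, 5, 4, 3))
# [1] 1 2 3 4 5 5 5 5 6 7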
library(dplyr)
library(tidyr)  # replace_na() comes from tidyr, not dplyr

df <- data.frame(t = 1:10, x = c(5, 7, 8, 9, 5, 5, 5, 5, 4, 3))
subsetVec <- df$x - lag(df$x) != 0       # FALSE where a value repeats its predecessor, NA for the first row
subsetVec <- replace_na(subsetVec, TRUE) # keep the first row
df[subsetVec, ]
Hello, I have a data frame of 245 columns. I want to add some sets of columns together to generate new columns, and I tried to do it with a loop as follows:
cl1<-sample(1:4,10,replace=TRUE)
cl2<-sample(1:4,10,replace=TRUE)
cl3<-sample(1:4,10,replace=TRUE)
cl4<-sample(1:4,10,replace=TRUE)
cl5<-sample(1:4,10,replace=TRUE)
cl6<-sample(1:4,10,replace=TRUE)
dat<-data.frame(cl1,cl2,cl3,cl4,cl5,cl6)
My intention is to add column 1 together with columns 3 and 5, and likewise column 2 with columns 4 and 6, so that in the end I obtain a data frame with two columns, and it should give me something like that.
I have written the following code:
revisar <- function(a){
  todos = list()
  i = 1
  j = 3
  l = 5
  k = 1
  while(i <= 2){
    cl  <- a[, i]
    cl2 <- a[, j]
    cl3 <- a[, l]
    cl[is.na(cl)]   <- 0
    cl2[is.na(cl2)] <- 0
    cl3[is.na(cl3)] <- 0
    colu <- cl + cl2 + cl3
    col <- cbind(colu, colu)
    i <- i + 1
    j <- j + 1
    l <- l + 1
    k <- k + 1
  }
  return(col)
}
It turns out that it only returns the second sum repeated twice, and I would have to replicate the same thing to combine all 245 columns.
I would like to know what is failing in the example.
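As to what is failing in the loop: col is overwritten on every pass, and cbind(colu, colu) binds the same sum twice, so only the last iteration (columns 2, 4 and 6) survives. A minimal corrected sketch that keeps the loop structure (the s1/s2 result names are my own choice):
revisar <- function(a){
  todos <- list()
  i <- 1; j <- 3; l <- 5
  k <- 1
  while(i <= 2){
    colu <- rowSums(cbind(a[, i], a[, j], a[, l]), na.rm = TRUE)  # NAs count as 0
    todos[[paste0("s", k)]] <- colu  # keep every iteration's result instead of overwriting
    i <- i + 1; j <- j + 1; l <- l + 1
    k <- k + 1
  }
  as.data.frame(todos)
}
revisar(dat)
The answers below avoid the explicit loop altogether.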
base R
Writing the sums out literally:
with(dat, data.frame(s1 = cl1+cl3+cl5, s2 = cl2+cl4+cl6))
# s1 s2
# 1 7 11
# 2 7 7
# 3 4 11
# 4 4 10
# 5 9 8
# 6 12 5
# 7 7 6
# 8 7 10
# 9 4 9
# 10 6 5
Programmatically,
L <- list(s1 = c(1,3,5), s2 = c(2,4,6))
out <- data.frame(lapply(L, function(z) do.call(rowSums, list(as.matrix(dat[,z])))))
out
# s1 s2
# 1 7 11
# 2 7 7
# 3 4 11
# 4 4 10
# 5 9 8
# 6 12 5
# 7 7 6
# 8 7 10
# 9 4 9
# 10 6 5
dplyr
library(dplyr)
dat %>%
transmute(
s1 = rowSums(cbind(cl1, cl3, cl5)),
s2 = rowSums(cbind(cl2, cl4, cl6))
)
or programmatically using purrr:
purrr::map_dfc(L, ~ rowSums(dat[, .]))
Data
set.seed(42)
# your `dat` above
Here is an alternative, more general approach: we sum all odd-numbered columns into s1 and all even-numbered columns into s2:
library(dplyr)
dat %>%
rowwise() %>%
mutate(s1 = sum(c_across(seq(1,ncol(dat),2)), na.rm = TRUE),
s2 = sum(c_across(seq(2,ncol(dat),2)), na.rm = TRUE))
cl1 cl2 cl3 cl4 cl5 cl6 s1 s2
<int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 3 2 3 2 7 5
2 2 4 1 4 2 3 5 11
3 2 2 2 2 1 3 5 7
4 2 4 4 3 1 4 7 11
5 2 4 4 3 2 2 8 9
6 3 3 3 2 2 2 8 7
7 2 1 1 2 1 4 4 7
8 2 4 1 3 2 3 5 10
9 3 1 1 2 3 4 7 7
10 2 4 1 3 4 4 7 11
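For a genuinely wide data frame (the 245-column case), rowwise() + c_across() can be slow. A vectorized sketch of the same odd/even split, assuming all columns are plain numeric:
library(dplyr)

odd_cols  <- seq(1, ncol(dat), by = 2)   # positions of cl1, cl3, cl5, ...
even_cols <- seq(2, ncol(dat), by = 2)   # positions of cl2, cl4, cl6, ...

dat %>%
  mutate(s1 = rowSums(dat[odd_cols],  na.rm = TRUE),
         s2 = rowSums(dat[even_cols], na.rm = TRUE))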
I want to repeat the rows of a data.frame N times, where N is calculated from the difference between the values in the first and second columns of each row; the problem is that N may change from row to row. I also need to create a new column containing the sequence from the first value to the second value in each row, increasing in steps of K. K remains constant for all rows.
Ex: d1<-data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
In the above dataset there are 5 rows. The first row runs from 2 to 8, which gives 7 values, so I need to replicate the first row 7 times and create a new column with the sequence 2, 3, 4, 5, 6, 7 and 8.
I can create a dataset by using the following code.
dist <- 1
rec_len <- c()
seqe <- c()
for(i in 1:nrow(d1)){
  a <- seq(d1[i, "A"], d1[i, "B"], by = dist)
  rec_len <- c(rec_len, length(a))
  seqe <- c(seqe, a)
}
d1$C <- rec_len
d1 <- d1[rep(1:nrow(d1), d1$C), ]
d1$D <- seqe
row.names(d1) <- NULL
But it is taking a very long time. Is there any possibility of speeding up the process?
A data.table approach is to use 1:nrow(d1) as the grouping variable for a row-wise operation that creates a list with the sequence from A to B, and then unlist, i.e.
library(data.table)
setDT(d1)[, C := B - A + 1][,
D := list(list(seq(A, B))), by = 1:nrow(d1)][,
lapply(.SD, unlist), by = 1:nrow(d1)][,
nrow := NULL][]
Which gives,
A B C D
1: 2 8 7 2
2: 2 8 7 3
3: 2 8 7 4
4: 2 8 7 5
5: 2 8 7 6
6: 2 8 7 7
7: 2 8 7 8
8: 4 6 3 4
9: 4 6 3 5
10: 4 6 3 6
11: 6 7 2 6
12: 6 7 2 7
13: 8 8 1 8
14: 1 10 10 1
15: 1 10 10 2
16: 1 10 10 3
17: 1 10 10 4
18: 1 10 10 5
19: 1 10 10 6
20: 1 10 10 7
21: 1 10 10 8
22: 1 10 10 9
23: 1 10 10 10
A B C D
Note: you can easily change K within seq, i.e.
setDT(d1)[, C := B - A + 1][,
D := list(list(seq(A, B, by = 0.2))), by = 1:nrow(d1)][,
lapply(.SD, unlist), by = 1:nrow(d1)][,
nrow := NULL][]
You could use lists and the purrr package to process each row of your data frame:
library(dplyr)
library(purrr)

data.frame(A = c(2,4,6,8,1), B = c(8,6,7,8,10)) %>%      # take the original data frame
  setNames(c("from", "to")) %>% pmap(seq) %>%            # sequence from A to B
  map(as_data_frame) %>%                                 # convert each element to a data frame
  map(~ mutate(., A = min(value), B = max(value))) %>%   # add A and B columns
  bind_rows() %>% select(A, B, value)                    # combine and reorder columns
Here is a base R option. We get the number of times each row should be replicated by subtracting the 'A' column from the 'B' column ('i1') and create that as column 'C', then replicate the rows of the original dataset using 'i1'. Finally, the 'D' column is created by generating the sequence between the corresponding elements of 'A' and 'B' with Map. The output is a list, so we unlist it into a vector.
i1 <- with(d1, B - A + 1)
d1$C <- i1
d2 <- d1[rep(seq_len(nrow(d1)), i1),]
d2$D <- unlist(Map(`:`, d1$A, d1$B))
row.names(d2) <- NULL
d2
# A B C D
#1 2 8 7 2
#2 2 8 7 3
#3 2 8 7 4
#4 2 8 7 5
#5 2 8 7 6
#6 2 8 7 7
#7 2 8 7 8
#8 4 6 3 4
#9 4 6 3 5
#10 4 6 3 6
#11 6 7 2 6
#12 6 7 2 7
#13 8 8 1 8
#14 1 10 10 1
#15 1 10 10 2
#16 1 10 10 3
#17 1 10 10 4
#18 1 10 10 5
#19 1 10 10 6
#20 1 10 10 7
#21 1 10 10 8
#22 1 10 10 9
#23 1 10 10 10
Simple example using N (case where k = 1)
library(dplyr)
# example data frame
d1 <- data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
# function to use (must have same column names)
f = function(d) {
A = rep(d$A, d$diff)
B = rep(d$B, d$diff)
C = seq(d$A, d$B)
data.frame(A, B, C) }
d1 %>%
mutate(diff = B - A + 1) %>% # calculate difference
rowwise() %>% # for every row
do(f(.)) %>% # apply the function
ungroup() # forget the grouping
# # A tibble: 23 x 3
# A B C
# * <dbl> <dbl> <int>
# 1 2 8 2
# 2 2 8 3
# 3 2 8 4
# 4 2 8 5
# 5 2 8 6
# 6 2 8 7
# 7 2 8 8
# 8 4 6 4
# 9 4 6 5
# 10 4 6 6
# # ... with 13 more rows
Example where you have one k for all rows (I'm using 0.25 to demonstrate)
# example data frame
d1 <- data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
# function to use (must have same column names)
f = function(d, k) {
A = d$A
B = d$B
C = seq(d$A, d$B, k)
data.frame(A, B, C) }
d1 %>%
rowwise() %>% # for every row
do(f(., 0.25)) %>% # apply the function using your own k
ungroup()
# # A tibble: 77 x 3
# A B C
# * <dbl> <dbl> <dbl>
# 1 2 8 2.00
# 2 2 8 2.25
# 3 2 8 2.50
# 4 2 8 2.75
# 5 2 8 3.00
# 6 2 8 3.25
# 7 2 8 3.50
# 8 2 8 3.75
# 9 2 8 4.00
# 10 2 8 4.25
# # ... with 67 more rows
Example where you have different k for each row
# example data frame
# give manually different k for each row
d1 <- data.frame(A=c(2,4,6,8,1),B=c(8,6,7,8,10))
d1$k = c(0.5, 1, 2, 0.25, 1.5)
d1
# A B k
# 1 2 8 0.50
# 2 4 6 1.00
# 3 6 7 2.00
# 4 8 8 0.25
# 5 1 10 1.50
# function to use (must have same column names)
f = function(d) {
A = d$A
B = d$B
C = seq(d$A, d$B, d$k)
data.frame(A, B, C) }
d1 %>%
rowwise() %>% # for every row
do(f(.)) %>% # apply the function using different k for each row
ungroup()
# # A tibble: 25 x 3
# A B C
# * <dbl> <dbl> <dbl>
# 1 2 8 2.0
# 2 2 8 2.5
# 3 2 8 3.0
# 4 2 8 3.5
# 5 2 8 4.0
# 6 2 8 4.5
# 7 2 8 5.0
# 8 2 8 5.5
# 9 2 8 6.0
# 10 2 8 6.5
# # ... with 15 more rows
I have a data frame that looks like the following:
> df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
date = c(1,2,3,4,5,6,7,8,9),
value = c(3,4,3,4,5,6,6,4,9))
> df
group date value
1 1 1 3
2 1 2 4
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 6
8 3 8 4
9 3 9 9
I want to create a new column that contains the date value per group that is associated with the value "4" from the value column.
The following data frame shows what I hope to accomplish.
group date value newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
As we can see, group 1 has the newValue "2" because that is the date associated with the value "4". Similarly, group two has newValue 4 and group three has newValue 8.
I assume there is an easy way to do this using ave() or a range of dplyr/data.table functions, but I have been unsuccessful with my many attempts.
Here's a quick data.table one
library(data.table)
setDT(df)[, newValue := date[value == 4L], by = group]
df
# group date value newValue
# 1: 1 1 3 2
# 2: 1 2 4 2
# 3: 1 3 3 2
# 4: 2 4 4 4
# 5: 2 5 5 4
# 6: 2 6 6 4
# 7: 3 7 6 8
# 8: 3 8 4 8
# 9: 3 9 9 8
Here's a similar dplyr version
library(dplyr)
df %>%
group_by(group) %>%
mutate(newValue = date[value == 4L])
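Note that date[value == 4L] assumes exactly one 4 per group; if a group might contain none or several, a variant that always returns a single value (an assumption beyond the original question) is match(), which gives NA when there is no 4 and the first matching date when there are several:
df %>%
  group_by(group) %>%
  mutate(newValue = date[match(4, value)]) %>%
  ungroup()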
Or a possible base R solution using merge after filtering the data (will need some renaming afterwards)
merge(df, df[df$value == 4, c("group", "date")], by = "group")
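For the renaming step, one possibility (the merged columns come back as date.x and date.y; the final names here are my own choice) is:
out <- merge(df, df[df$value == 4, c("group", "date")], by = "group")
names(out) <- c("group", "date", "value", "newValue")
out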
Here is a base R option
df$newValue = rep(df$date[which(df$value == 4)], table(df$group))
Another alternative using lapply:
do.call(rbind, lapply(split(df, df$group), function(x){
  x$newValue <- rep(x$date[which(x$value == 4)], each = length(x$group))
  x
}))
# group date value newValue
#1.1 1 1 3 2
#1.2 1 2 4 2
#1.3 1 3 3 2
#2.4 2 4 4 4
#2.5 2 5 5 4
#2.6 2 6 6 4
#3.7 3 7 6 8
#3.8 3 8 4 8
#3.9 3 9 9 8
One more base R path:
df$newValue <- ave(`names<-`(df$value==4,df$date), df$group, FUN=function(x) as.numeric(names(x)[x]))
df
group date value newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
10 3 11 7 8
I tested this on variable-length groups. The idea is to assign the date column as the names of the logical vector value == 4; within each group, ave() then extracts the name (the date) at the position where the flag is TRUE.
Data
df = data.frame(group = c(1,1,1,2,2,2,3,3,3,3),
date = c(1,2,3,4,5,6,7,8,9,11),
value = c(3,4,3,4,5,6,6,4,9,7))
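To see the intermediate step with the 10-row df above: the logical flags value == 4 carry the dates as names, and names(x)[x] pulls out the date at the TRUE position within each group. A small illustration:
v <- `names<-`(df$value == 4, df$date)
v
#     1     2     3     4     5     6     7     8     9    11
# FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE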
Say we have the following data
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
How would one write a function so that, if A has the same value in position i+1 as in position i, the recurring row is removed?
Therefore the output should look like:
data.frame(c(1,2,3,4,8,6,1,2,3,4), c(1,2,5,1,2,3,5,1,2,3))
My best guess would be to use a for loop; however, I have no experience with these.
You can try
data[c(TRUE, data[-1,1]!= data[-nrow(data), 1]),]
Another option, dplyr-esque:
library(dplyr)
dat1 <- data.frame(A=c(1,2,2,2,3,4,8,6,6,1,2,3,4),
B=c(1,2,3,4,5,1,2,3,4,5,1,2,3))
dat1 %>% filter(A != lag(A, default=FALSE))
## A B
## 1 1 1
## 2 2 2
## 3 3 5
## 4 4 1
## 5 8 2
## 6 6 3
## 7 1 5
## 8 2 1
## 9 3 2
## 10 4 3
Using diff, which calculates the pairwise differences with a lag of 1:
data[c( TRUE, diff(data[,1]) != 0), ]
output:
A B
1 1 1
2 2 2
5 3 5
6 4 1
7 8 2
8 6 3
10 1 5
11 2 1
12 3 2
13 4 3
Using rle
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
X <- rle(data$A)
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
View(data[Y, ])
row.names A B
1 1 1 1
2 2 2 2
3 5 3 5
4 6 4 1
5 7 8 2
6 8 6 3
7 10 1 5
8 11 2 1
9 12 3 2
10 13 4 3
The following code separately produces the group means of x and y according to group. Suppose I have a number of variables for which I am repeating the same operation.
How would you suggest proceeding in order to obtain the same result with a single command? (I suppose it is necessary to use tapply, but I am not really sure about it.)
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- data.frame(cbind(group, x, y))
dat$m_x <- ave(dat$x, dat$group)
dat$m_y <- ave(dat$y, dat$group)
dat
Many thanks.
Alternative solutions using data.table and plyr packages:
1) Using data.table
require(data.table)
dt <- data.table(dat, key="group")
# Following #Matthew's comment, edited:
dt[, `:=`(m_x = mean(x), m_y = mean(y)), by=group]
Output:
group x y m_x m_y
1: 1 1 2 3 4
2: 1 3 4 3 4
3: 1 5 6 3 4
4: 2 7 8 9 10
5: 2 9 10 9 10
6: 2 11 12 9 10
2) using plyr and transform:
require(plyr)
ddply(dat, .(group), transform, m_x=mean(x), m_y=mean(y))
output:
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
3) using plyr and numcolwise (note the reduced output):
ddply(dat, .(group), numcolwise(mean))
Output:
group x y
1 1 3 4
2 2 9 10
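If you want that reduced summary expanded back onto the original rows, one option (a sketch; the _m suffixes are my own choice) is to merge it back:
merge(dat, ddply(dat, .(group), numcolwise(mean)),
      by = "group", suffixes = c("", "_m"))
# columns: group, x, y, x_m, y_m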
Assuming you have more than just two columns, you would want to use apply to apply ave to every column in the matrix.
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- cbind(x, y)
ave.dat <- apply(dat, 2, function(column) ave(column, group))
#      x  y
# [1,]  3  4
# [2,]  3  4
# [3,]  3  4
# [4,]  9 10
# [5,]  9 10
# [6,]  9 10
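To attach these group means back alongside the original columns with the m_ naming used earlier (a sketch; the column names are my own choice):
dat2 <- data.frame(group, dat, m_x = ave.dat[, "x"], m_y = ave.dat[, "y"])
dat2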
You can also use aggregate():
dat2 <- data.frame(dat, aggregate(dat[,-1], by=list(dat$group), mean)[group, -1])
dat2
group x y x.1 y.1
1 1 1 2 3 4
1.1 1 3 4 3 4
1.2 1 5 6 3 4
2 2 7 8 9 10
2.1 2 9 10 9 10
2.2 2 11 12 9 10
row.names(dat2) <- rownames(dat)
colnames(dat2) <- gsub("(.)\\.1", "m_\\1", colnames(dat2))
dat2
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
If the variable names are more than a single character, you would need to modify the gsub() call.
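For example, a pattern that captures the whole name before the trailing .1 would be one possibility (a sketch):
colnames(dat2) <- gsub("^(.+)\\.1$", "m_\\1", colnames(dat2))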