Suppose I have an R dataframe that looks like this, where end.group signifies the end of a unique group of observations:
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
I want to return the following, where group.count is a running count of the number of observations in a group, and group is a unique identifier for each group, in number order. Can anyone help me with a piece of R code to do this?
end.group group.count group
0 1 1
0 2 1
1 3 1
0 1 2
0 2 2
1 3 2
1 1 3
0 1 4
0 2 4
0 3 4
1 4 4
1 1 5
1 1 6
0 1 7
1 2 7
You can create group by using cumsum and rev. You need rev because end.group marks the end points of the groups rather than their starting points.
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
# create groups
x$group <- rev(cumsum(rev(x$end.group)))
# re-number groups from smallest to largest
x$group <- abs(x$group-max(x$group)-1)
Now you can use ave to create group.count.
x$group.count <- ave(x$end.group, x$group, FUN=seq_along)
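For comparison, here is a minimal sketch of a variant that avoids the double rev. It assumes the same x as in the question and relies on the observation that a row starts a new group exactly when the previous row ended one, so shifting end.group down by one row turns end markers into start markers. The name starts is introduced here for illustration only.

```r
x <- data.frame(end.group = c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))

# Shift end.group down one row: a 1 now marks the first row of a new group.
starts <- c(0, head(x$end.group, -1))

# The running total of starts gives 0-based group ids; add 1 to number from 1.
x$group <- cumsum(starts) + 1

# Running count of rows within each group.
x$group.count <- ave(seq_along(x$group), x$group, FUN = seq_along)
```

Because the groups come out numbered from smallest to largest directly, no re-numbering step is needed.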
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
ends <- which(as.logical(x$end.group))
ends2 <- c(ends[1],diff(ends))
transform(x, group.count=unlist(sapply(ends2,seq)), group=rep(seq(length(ends)),times=ends2))
end.group group.count group
1 0 1 1
2 0 2 1
3 1 3 1
4 0 1 2
5 0 2 2
6 1 3 2
7 1 1 3
8 0 1 4
9 0 2 4
10 0 3 4
11 1 4 4
12 1 1 5
13 1 1 6
14 0 1 7
15 1 2 7
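The same run-length idea can be written a little more directly with sequence() and rep(). This is only a sketch; like the code above, it assumes the last row is an end marker, otherwise the run lengths do not add up to the number of rows. The name run_lengths is introduced here for illustration.

```r
x <- data.frame(end.group = c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))

# Row positions where a group ends.
ends <- which(x$end.group == 1)

# Length of each group: distance from the previous end (or from row 0).
run_lengths <- diff(c(0, ends))

# sequence() concatenates 1..n for each n; rep() repeats each group id n times.
x$group.count <- sequence(run_lengths)
x$group <- rep(seq_along(run_lengths), times = run_lengths)
```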
I am new(ish) to R and I am still unsure about loops.
I have a large matrix object in R whose columns hold values from 0 to 4, and I would like to invert these values for specified columns.
I would use the code:
b[, "AX1"] <- 4 - b[, "AX1"]
Where b is a Matrix extracted from a larger list object and AX1 would be a column in the matrix.
I would then replace the changed Matrix back into its list using the code:
DF1$geno[[1]]$data <- b
How would I loop this code over a list of column names (AX1, AX10, AX42, ...), covering about 30 of the 5000 columns in the matrix, if I had a list with the 30 column names to be inverted?
The simplest way to do it (assuming the transformation is always x = 4 - x) is to extend your approach to the whole list of columns at once:
# Create an example dataset
set.seed(68859457)
(
dat <- matrix(
data = sample(x = 0:4, size = 100, replace = TRUE),
nrow = 10,
dimnames = list(1:10, paste('AX', 1:10, sep = ''))
)
)
# AX1 AX2 AX3 AX4 AX5 AX6 AX7 AX8 AX9 AX10
# 1 2 1 2 3 2 2 3 1 0 4
# 2 4 3 4 4 0 1 3 1 3 4
# 3 3 0 3 4 2 2 4 1 2 1
# 4 2 2 0 2 4 2 2 1 1 0
# 5 4 4 4 3 3 1 0 3 2 2
# 6 2 1 1 0 3 3 4 4 1 0
# 7 2 3 1 3 3 1 0 1 0 4
# 8 2 2 1 1 0 3 1 3 2 1
# 9 3 1 4 1 2 1 0 0 4 1
# 10 4 3 2 4 1 0 2 0 3 2
# Create a list of columns you want to modify
set.seed(68859457)
(
cols_to_invert <- sort(sample(x = colnames(dat), size = 5))
)
# [1] "AX3" "AX4" "AX5" "AX6" "AX9"
# Use the list of columns created above to modify matrix in place
dat[, cols_to_invert] <- 4 - dat[, cols_to_invert]
# See the result
dat
# AX1 AX2 AX3 AX4 AX5 AX6 AX7 AX8 AX9 AX10
# 1 2 1 2 1 2 2 3 1 4 4
# 2 4 3 0 0 4 3 3 1 1 4
# 3 3 0 1 0 2 2 4 1 2 1
# 4 2 2 4 2 0 2 2 1 3 0
# 5 4 4 0 1 1 3 0 3 2 2
# 6 2 1 3 4 1 1 4 4 3 0
# 7 2 3 3 1 1 3 0 1 4 4
# 8 2 2 3 3 4 1 1 3 2 1
# 9 3 1 0 3 2 3 0 0 0 1
# 10 4 3 2 0 3 4 2 0 1 2
Difficult to tell without knowing the exact structure of the data, but based on your explanation and attempt, maybe this will help.
cols <- c('AX1', 'AX10', 'AX42')
DF1$geno <- lapply(DF1$geno, function(x) {
  # assign back into the selected columns only, so the rest of the matrix is kept
  x$data[, cols] <- 4 - x$data[, cols]
  x
})
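Since the real DF1 is not shown, here is a self-contained sketch on a made-up object with the same shape: a list whose geno element holds entries that each carry a matrix in data. The helper make_data, the toy values, and the use of intersect() to skip names that are absent from a given matrix are all assumptions for illustration.

```r
# Hypothetical stand-in for the asker's object.
make_data <- function() {
  matrix(c(0, 1, 2, 3, 4, 0, 1, 2, 3), nrow = 3,
         dimnames = list(NULL, c("AX1", "AX2", "AX3")))
}
DF1 <- list(geno = list(list(data = make_data()), list(data = make_data())))

cols <- c("AX1", "AX3")

DF1$geno <- lapply(DF1$geno, function(g) {
  # Only touch the requested columns that actually exist in this matrix.
  present <- intersect(cols, colnames(g$data))
  g$data[, present] <- 4 - g$data[, present]
  g
})
```

No explicit loop over column names is needed: indexing with a character vector selects all listed columns at once, and the subtraction is applied elementwise.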
I am trying to create duplicate rows by group. The number of duplicate rows I want to create varies by group, and in the new rows I want to set one column to a fixed value, Attended = 0.
A minimal working example of the data set DF I am working with is:
ID Demo Attended t
1 3 1 1
1 3 1 3
1 3 0 4
1 3 1 5
2 5 1 2
2 5 1 4
3 7 0 1
For the example above, suppose I want every person (ID) to have 5 rows, with Demo the same across all rows for each individual. Thus, I have to create 1 row for ID = 1, 3 for ID = 2 and 4 for ID = 3 (I would like to calculate these dynamically for each subgroup). For the new rows I generate, I want Attended = 0 and t to take on the value of a missing index, so that the final output is:
ID Demo Attended t
1 3 1 1
1 3 1 3
1 3 0 4
1 3 1 5
1 3 0 2
2 5 1 2
2 5 1 4
2 5 0 1
2 5 0 3
2 5 0 5
3 7 0 1
3 7 0 2
3 7 0 3
3 7 0 4
3 7 0 5
I have been able to create duplicate rows by group, but haven't been able to figure out how to create a different number of duplicates per participant and correctly fill in the index column t.
Here is what I have working:
DF %>%
group_by(ID) %>%
rbind(., mutate(., t = row_number()))
I have been trying to create the right number of duplicates using slice() and trying to get the t value to be exactly what I want but to no avail.
Any help would be appreciated!
One tidyverse possibility could be:
library(dplyr)
library(tidyr)
df %>%
  complete(t, nesting(ID, Demo), fill = list(Attended = 0)) %>%
  arrange(ID)
t ID Demo Attended
<int> <int> <int> <dbl>
1 1 1 3 1
2 2 1 3 0
3 3 1 3 1
4 4 1 3 0
5 5 1 3 1
6 1 2 5 0
7 2 2 5 1
8 3 2 5 0
9 4 2 5 1
10 5 2 5 0
11 1 3 7 0
12 2 3 7 0
13 3 3 7 0
14 4 3 7 0
15 5 3 7 0
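If you would rather stay in base R, the same completion can be sketched with two merge() calls. This assumes the example DF from the question and that the full set of t values is known to be 1 to 5; the names ids, grid and out are introduced here for illustration.

```r
DF <- data.frame(
  ID       = c(1, 1, 1, 1, 2, 2, 3),
  Demo     = c(3, 3, 3, 3, 5, 5, 7),
  Attended = c(1, 1, 0, 1, 1, 1, 0),
  t        = c(1, 3, 4, 5, 2, 4, 1)
)

# One row per person, keeping Demo attached to its ID.
ids <- unique(DF[c("ID", "Demo")])

# merge() with no shared column names returns the Cartesian product,
# i.e. every person crossed with every t.
grid <- merge(ids, data.frame(t = as.numeric(1:5)))

# Left join the observed rows onto the full grid; new rows get NA in Attended.
out <- merge(grid, DF, by = c("ID", "Demo", "t"), all.x = TRUE)
out$Attended[is.na(out$Attended)] <- 0
out <- out[order(out$ID, out$t), ]
```

The number of rows to add per person never has to be computed explicitly; it falls out of the difference between the full grid and the observed rows.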
I'm just starting to learn R and I'm already facing the first bigger problem.
Let's take the following panel dataset as an example:
N=5
T=3
time<-rep(1:T, times=N)
id<- rep(1:N,each=T)
dummy<- c(0,0,1,1,0,0,0,1,0,0,0,1,0,1,0)
df<-as.data.frame(cbind(id, time,dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable for all rows of a cross section to take the value 1 after the 1 for this cross section appears for the first time. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new<-df %>%
group_by(id) %>%
???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax:
library(dplyr)
df %>%
group_by(id) %>%
mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages you could do
transform(df, dummy = ave(dummy, id, FUN = cummax))
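To see why this works: cummax returns the running maximum, so once a 1 has appeared every later value is at least 1. A small self-contained sketch, assuming the example data from the question and that dummy only takes the values 0 and 1; the explicit sort is a precaution in case rows are not already ordered by time within id.

```r
df <- data.frame(
  id    = rep(1:5, each = 3),
  time  = rep(1:3, times = 5),
  dummy = c(0,0,1, 1,0,0, 0,1,0, 0,0,1, 0,1,0)
)

# The running maximum of a 0/1 vector stays at 1 after the first 1.
step <- cummax(c(0, 1, 0))

# Make sure rows are in time order within each id, then apply per id.
df <- df[order(df$id, df$time), ]
df$dummy <- ave(df$dummy, df$id, FUN = cummax)
```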
I have the following dataframe (df):
   A  B  T  Required col (window = 3)
1  1  0  1
2  3  0  3
3  4  0  4
4  2  1  1  4
5  6  0  0  2
6  4  1  1  0
7  7  1  1  1
8  8  1  1  1
9  1  0  0  1
I would like to add the required column, as follows:
Insert into the current row the previous row's value of A or B.
If, in the last 3 (window) rows, the content of column A equals column T most of the time, choose A; otherwise choose B. (There can be more columns, in which case the column that most often equals T is chosen.)
What is the most efficient way to do this for a big data table?
I renamed the column T to TC to avoid confusion with T as an abbreviation for TRUE.
library(tidyverse)
library(data.table)
df[, newcol := {
equal <- A == TC
map(1:.N, ~ if(.x <= 3) NA
else if(sum(equal[.x - 1:3]) > 3/2) A[.x - 1]
else B[.x - 1])
}]
df
# N A B TC newcol
# 1: 1 1 0 1 NA
# 2: 2 3 0 3 NA
# 3: 3 4 0 4 NA
# 4: 4 2 1 1 4
# 5: 5 6 0 0 2
# 6: 6 4 1 1 0
# 7: 7 7 1 1 1
# 8: 8 8 1 1 1
# 9: 9 1 0 0 1
This works too, but it's less clear and likely less efficient:
df[, newcol := shift(A == TC, 1:3) %>%
pmap_lgl(~sum(...) > 3/2) %>%
ifelse(shift(A), shift(B))]
data:
df <- fread("
N A B TC
1 1 0 1
2 3 0 3
3 4 0 4
4 2 1 1
5 6 0 0
6 4 1 1
7 7 1 1
8 8 1 1
9 1 0 0
")
Probably much less efficient than the answer by Ryan, but without additional packages.
A<-c(1,3,4,2,6,4,7,8,1)
B<-c(0,0,0,1,0,1,1,1,0)
TC<-c(1,3,4,1,0,1,1,1,0)
req<-rep(NA,9)
df<-data.frame(A,B,TC,req)
window<-3
for(i in window:(length(req)-1)){
equal <- sum(df$A[(i-window+1):i]==df$TC[(i-window+1):i])
if(equal > window/2){
df$req[i+1]<-df$A[i]
}else{
df$req[i+1]<-df$B[i]
}
}
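The loop can also be vectorized in base R by getting the trailing match count from a cumulative sum. This is a sketch under the same assumptions as the loop above (two candidate columns A and B, window of 3, first window rows left as NA); the names cs, matches and prev are introduced here for illustration.

```r
A  <- c(1, 3, 4, 2, 6, 4, 7, 8, 1)
B  <- c(0, 0, 0, 1, 0, 1, 1, 1, 0)
TC <- c(1, 3, 4, 1, 0, 1, 1, 1, 0)
window <- 3

# 1 where A agrees with TC, 0 otherwise.
equal <- as.numeric(A == TC)

# Number of agreements in the window ending at each row:
# the cumulative sum minus the cumulative sum `window` rows earlier.
cs <- cumsum(equal)
matches <- cs - c(rep(0, window), head(cs, -window))

# Shift a vector down by one row so row i sees the value from row i - 1.
prev <- function(v) c(NA, head(v, -1))

# Choose the previous A if A matched in most of the previous window, else B.
req <- ifelse(prev(matches) > window / 2, prev(A), prev(B))

# The first `window` rows have no complete previous window.
req[seq_len(window)] <- NA
```

With more candidate columns, the same matches vector could be computed per column and the column with the largest count picked row by row.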
I have this data frame:
df <-
ID var TIME value method
1 3 0 2 1
1 3 2 2 1
1 3 3 0 1
1 4 0 10 1
1 4 2 10 1
1 4 4 5 1
1 4 6 5 1
2 3 0 2 1
2 3 2 2 1
2 3 3 0 1
2 4 0 10 1
2 4 2 10 1
2 4 4 5 1
2 4 6 5 1
I want to extract the rows that have a new event in the value column. For example, for ID=1, var=3 has a value of 2 at TIME=0. This value stays the same at TIME=2, so I would take only the first row at TIME=0 and discard the second row. In the third row, however, the value for var=3 has changed to zero, so I also have to extract this row. And so on for the rest of the variables. This has to be applied for every subject ID. For the above df, the result should be as follows:
dfevent <-
ID var TIME value method
1 3 0 2 1
1 3 3 0 1
1 4 0 10 1
1 4 4 5 1
2 3 0 2 1
2 3 3 0 1
2 4 0 10 1
2 4 4 5 1
Could anyone help me do this in R? I have a huge data set and I want to extract the rows at which a new event has occurred in the value of every var. I have 5 variables in the data frame, numbered 3, 4, 5, 6, and 7. The above is an example for 2 variables (variable numbers 3 and 4).
This does it using dplyr:
library(dplyr)
df %>%
group_by(ID, var) %>%
mutate(tf = ifelse(value==lag(value), 1, 0)) %>%
filter(is.na(tf) | tf==0) %>%
select(-tf)
# ID var TIME value method
#1 1 3 0 2 1
#2 1 3 3 0 1
#3 1 4 0 10 1
#4 1 4 4 5 1
#5 2 3 0 2 1
#6 2 3 3 0 1
#7 2 4 0 10 1
#8 2 4 4 5 1
Basically, I created an extra variable that is 1 when the value is the same as in the preceding row within each unique ID/var combination, and NA for the first row of each group, which has no preceding row. The filter keeps the rows where it is NA or 0, and the variable is dropped before returning the output.
Base solution (grouping by both ID and var, so a change of var always starts a new run):
df[with(df, abs(ave(value, ID, var, FUN=function(x) c(1,diff(x)) ))) > 0,]
# ID var TIME value method
#1 1 3 0 2 1
#3 1 3 3 0 1
#4 1 4 0 10 1
#6 1 4 4 5 1
#8 2 3 0 2 1
#10 2 3 3 0 1
#11 2 4 0 10 1
#13 2 4 4 5 1
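Here is a self-contained sketch of the same diff-based idea, with the example data built inline and the groups defined by both ID and var. The name is_new is introduced for illustration, and the sketch assumes rows are already sorted by TIME within each ID/var group.

```r
df <- data.frame(
  ID     = rep(1:2, each = 7),
  var    = rep(c(3, 3, 3, 4, 4, 4, 4), times = 2),
  TIME   = rep(c(0, 2, 3, 0, 2, 4, 6), times = 2),
  value  = rep(c(2, 2, 0, 10, 10, 5, 5), times = 2),
  method = 1
)

# Within each ID/var group, flag the first row and every row whose value
# differs from the row before it. ave() returns numbers, hence the == 1.
is_new <- ave(df$value, df$ID, df$var,
              FUN = function(v) c(TRUE, diff(v) != 0)) == 1

dfevent <- df[is_new, ]
```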
From the expected results, you may also try rleid from data.table
library(data.table)#data.table_1.9.5
setDT(df)[df[, .I[1L] , list(ID, var, rleid(value))]$V1]
# ID var TIME value method
#1: 1 3 0 2 1
#2: 1 3 3 0 1
#3: 1 4 0 10 1
#4: 1 4 4 5 1
#5: 2 3 0 2 1
#6: 2 3 3 0 1
#7: 2 4 0 10 1
#8: 2 4 4 5 1
Or a similar approach to @thelatemail's, grouped by ID and var:
setDT(df)[df[, .I[abs(c(1,diff(value)))>0] , list(ID, var)]$V1]
Or
unique(setDT(df)[, id:=rleid(value)], by=c('ID', 'var', 'id'))