I have a data frame as below,
dt <- data.frame(id = c("a","b","c","d","e","f","g","h","i","j"),
value = c(1,2,1,2,1,1,1,2,1,2))
> dt
id value
1 a 1
2 b 2
3 c 1
4 d 2
5 e 1
6 f 1
7 g 1
8 h 2
9 i 1
10 j 2
I want to create a column based on column value such that each 2 in value closes the current group and the next row starts a new group number. The output should look like this:
dtgroup <- data.frame(id = c("a","b","c","d","e","f","g","h","i","j"),
value = c(1,2,1,2,1,1,1,2,1,2),
group = c(1,1,2,2,3,3,3,3,4,4))
> dtgroup
id value group
1 a 1 1
2 b 2 1
3 c 1 2
4 d 2 2
5 e 1 3
6 f 1 3
7 g 1 3
8 h 2 3
9 i 1 4
10 j 2 4
Any ideas? Thanks!
We can use findInterval like below
> transform(dt, group = 1 + findInterval(seq_along(value), which(value == 2), left.open = TRUE))
id value group
1 a 1 1
2 b 2 1
3 c 1 2
4 d 2 2
5 e 1 3
6 f 1 3
7 g 1 3
8 h 2 3
9 i 1 4
10 j 2 4
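To see why the findInterval() call works, it can help to look at the intermediate pieces. The sketch below rebuilds the example data; the variable names breaks and idx are my own and do not come from the answer above.

```r
dt <- data.frame(id = letters[1:10],
                 value = c(1, 2, 1, 2, 1, 1, 1, 2, 1, 2))

# Row positions where a group ends (value == 2)
breaks <- which(dt$value == 2)

# With left.open = TRUE the intervals are (b1, b2], so a row sitting exactly
# on a break still belongs to the group that the break closes
idx <- findInterval(seq_along(dt$value), breaks, left.open = TRUE)

# findInterval() starts counting at 0, hence the + 1
dt$group <- idx + 1L
dt
```

Without left.open = TRUE, each row containing a 2 would already be counted in the next group.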
or cut
> transform(dt, group = as.integer(cut(seq_along(value), c(-Inf, which(value == 2)))))
id value group
1 a 1 1
2 b 2 1
3 c 1 2
4 d 2 2
5 e 1 3
6 f 1 3
7 g 1 3
8 h 2 3
9 i 1 4
10 j 2 4
Another possibility: increment the group number when value is 1 and the previous value (dplyr::lag) is not 1.
dt$group <- with(dt, cumsum(value == 1 & dplyr::lag(value != 1, default = TRUE)))
id value group
1 a 1 1
2 b 2 1
3 c 1 2
4 d 2 2
5 e 1 3
6 f 1 3
7 g 1 3
8 h 2 3
9 i 1 4
10 j 2 4
With cumsum, if value doesn't have NAs:
dt$group <- head(c(0,cumsum(dt$value==2))+1,-1)
dt
id value group
1 a 1 1
2 b 2 1
3 c 1 2
4 d 2 2
5 e 1 3
6 f 1 3
7 g 1 3
8 h 2 3
9 i 1 4
10 j 2 4
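If value can contain NAs, the cumsum() approach can be adapted by treating a missing value as "not a 2". This is only a sketch: the NA in the data below is a hypothetical modification of the example, and whether an NA should close a group depends on your data.

```r
# Hypothetical variant of the example data with one missing value
dt <- data.frame(id = letters[1:10],
                 value = c(1, 2, 1, NA, 1, 1, 1, 2, 1, 2))

# TRUE where a group ends; an NA is treated as "not a 2"
is_end <- !is.na(dt$value) & dt$value == 2

# A new group starts on the first row and on every row that follows a group end
dt$group <- cumsum(c(TRUE, head(is_end, -1)))
dt
```

Because is_end never contains NA, cumsum() cannot propagate a missing value into the group column.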
I have four lists, each of which contains 12 data frames. Something like this:
for (i in 1:10) {
assign(paste0("df", i), data.frame(c=c(1,2,3), d=c(1,2,3),
e=c(1,2,3), f=c(1,2,3),
g=c(1,2,3), h=c(1,2,3),
i=c(1,2,3), j=c(1,2,3),
k=c(1,2,3), l=c(1,2,3)))
}
for (i in 1:4) {
assign(paste0("list_", i), lapply(ls(pattern="df"), get))
}
rm(list=ls(pattern="df"))
Each list corresponds to a year and each of its elements (the data frames) correspond to a month. Conveniently, the position of each element of the list (of each data frame) is equivalent to its month. So, in the first list, the first data frame corresponds to January 2020, the second to February 2020, and so on. In the second list, the first data frame corresponds to January 2021, the second to February 2021, and so on.
What I need to do is to create a new variable that indicates the month of each data frame.
I have been trying different things, including this:
for(j in 20:21) {
for(i in 1:12) {
assign(get(paste0("df_20", j))[[i]],
get(paste0("df_20", j))[[i]] %>%
mutate(month=i)) ## 1 is added to the month because it starts from February
}
}
But nothing works. The problem seems to be the left-hand side of the assignment. When I use the get() function, R returns an error ("Error in assign(get(paste0("mies_20", j))[[1]], get(paste0("mies_20", :
invalid first argument"). If I leave get() out, paste0("df_20", j)[[i]] does not recognize the "[[i]]".
Any ideas?
get() each list, iterate over its dataframes using lapply(), then assign back to the environment. It’s also best to use something other than i for your iteration variable, since there’s also a column i in your dataframes.
library(dplyr)
for (j in 1:4) {
list_j <- get(paste0("list_", j))
list_j <- lapply(
seq_along(list_j),
\(mnth) mutate(list_j[[mnth]], month = mnth)
)
assign(paste0("list_", j), list_j)
}
list_1
[[1]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 1
3 3 3 3 3 3 3 3 3 3 3 1
[[2]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 2
2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 2
[[3]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 3
2 2 2 2 2 2 2 2 2 2 2 3
3 3 3 3 3 3 3 3 3 3 3 3
[[4]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 4
2 2 2 2 2 2 2 2 2 2 2 4
3 3 3 3 3 3 3 3 3 3 3 4
[[5]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 5
2 2 2 2 2 2 2 2 2 2 2 5
3 3 3 3 3 3 3 3 3 3 3 5
[[6]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 6
2 2 2 2 2 2 2 2 2 2 2 6
3 3 3 3 3 3 3 3 3 3 3 6
[[7]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 7
2 2 2 2 2 2 2 2 2 2 2 7
3 3 3 3 3 3 3 3 3 3 3 7
[[8]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 8
2 2 2 2 2 2 2 2 2 2 2 8
3 3 3 3 3 3 3 3 3 3 3 8
[[9]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 9
2 2 2 2 2 2 2 2 2 2 2 9
3 3 3 3 3 3 3 3 3 3 3 9
[[10]]
c d e f g h i j k l month
1 1 1 1 1 1 1 1 1 1 1 10
2 2 2 2 2 2 2 2 2 2 2 10
3 3 3 3 3 3 3 3 3 3 3 10
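An alternative that avoids get() and assign() altogether is to keep the yearly lists inside one named list. The sketch below is an assumption about how the data could be organised, not the asker's actual setup: make_year() and the 2 x 3 layout are hypothetical stand-ins for the four lists of 12 data frames.

```r
# Hypothetical stand-in data: 2 years x 3 months instead of 4 x 12
make_year <- function() {
  lapply(1:3, function(k) data.frame(c = 1:3, d = 1:3))
}
years <- list("2020" = make_year(), "2021" = make_year())

# Outer Map() walks over the years, inner Map() over the months.
# The month is the position in the inner list, the year is the outer name.
years <- Map(function(year_list, yr) {
  Map(function(df, mnth) {
    df$month <- mnth
    df$year <- as.integer(yr)
    df
  }, year_list, seq_along(year_list))
}, years, names(years))

years[["2021"]][[2]]
```

With this layout every data frame carries both its month and its year, so the whole structure can later be stacked with do.call(rbind, ...) if a single table is needed.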
I am trying to merge two data sets into one using id and column name as indices.
I have the following data
df <-
a b c d e f g id
1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4
panel_empty <-
id df_id df_data df1_data df2_data df3_data
1 a
1 b
1 c
1 d
1 e
1 f
1 g
2 a
2 b
2 c
2 d
2 e
2 f
2 g
3 a
3 b
3 c
3 d
3 e
3 f
3 g
4 a
4 b
4 c
4 d
4 e
4 f
4 g
I would like to merge these somehow to look like this
panel_full <-
id df_id df_data df2_data df3_data
1 a 1
1 b 1
1 c 1
1 d 1
1 e 1
1 f 1
1 g 1
2 a 2
2 b 2
2 c 2
2 d 2
2 e 2
2 f 2
2 g 2
3 a 3
3 b 3
3 c 3
3 d 3
3 e 3
3 f 3
3 g 3
4 a 4
4 b 4
4 c 4
4 d 4
4 e 4
4 f 4
4 g 4
I only know how to merge by id but have no idea how to merge by id and column name. For panel data this is quite important to do, and I was surprised not to find any similar problem on this site.
EDIT:
So far, I was able to convert from wide to long
long <- melt(df, id.vars = c("id"))
However, I do not know how to move on.
I tried
m1 <- merge(panel_empty, long, by.x = "id", by.y = "df_id")
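Here's a way with dplyr and tidyr::gather(): reshape df to long format, then join it onto the empty panel by both keys. If panel_empty already contains an empty df_data column, drop it before joining so the result does not end up with df_data.x and df_data.y.
library(dplyr)
library(tidyr)
panel_empty %>%
left_join(gather(df, df_id, df_data, -id), by = c("id", "df_id"))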
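The same idea can be sketched in base R, which may help if dplyr and tidyr are not available. The data below is a cut-down, hypothetical version of the example (three measure columns instead of seven), and measure_cols and long are names I introduce for illustration.

```r
# Cut-down version of the wide data from the question
df <- data.frame(a = 1:4, b = 1:4, c = 1:4, id = 1:4)
measure_cols <- c("a", "b", "c")

# Wide to long: one row per (id, column name) pair
long <- data.frame(
  id      = rep(df$id, times = length(measure_cols)),
  df_id   = rep(measure_cols, each = nrow(df)),
  df_data = unlist(df[measure_cols], use.names = FALSE)
)

# Empty panel skeleton with the two key columns only
panel_empty <- expand.grid(id = df$id, df_id = measure_cols,
                           stringsAsFactors = FALSE)

# Merging on both keys at once is just a matter of passing two names to `by`
panel_full <- merge(panel_empty, long, by = c("id", "df_id"), all.x = TRUE)
panel_full
```

The all.x = TRUE argument keeps every row of the skeleton even when no matching value exists, which is usually what a panel layout needs.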
I want to create a unique sequential numeric ID for each distinct group based on 3 columns, but for each group the IDs must start from 1 to n.
Using the solution at Creating a unique ID, I can create unique IDs, but they are sequential for the entire data frame.
k1 <- c(1,1,1,1,1,1,1,1,1,1)
k2 <- c(1,1,1,1,1,2,2,2,2,2)
k3 <- rep(letters[1:2],5)
df <- as.data.frame(cbind(k1,k2, k3))
d <- transform(df, id = as.numeric(interaction(k1,k2,k3, drop=TRUE)))
d <- d[with(d, order(k1,k2,k3)),]
the result is
> d
k1 k2 k3 id
1 1 1 a 1
3 1 1 a 1
5 1 1 a 1
2 1 1 b 3
4 1 1 b 3
7 1 2 a 2
9 1 2 a 2
6 1 2 b 4
8 1 2 b 4
10 1 2 b 4
and I'd like to have
> d
k1 k2 k3 id
1 1 1 a 1
3 1 1 a 1
5 1 1 a 1
2 1 1 b 2
4 1 1 b 2
7 1 2 a 1
9 1 2 a 1
6 1 2 b 2
8 1 2 b 2
10 1 2 b 2
Try using data.table as mentioned in the link:
library(data.table)
setDT(df)[,id:=.GRP,by=list(k1,k3)][]
# k1 k2 k3 id
# 1: 1 1 a 1
# 2: 1 1 b 2
# 3: 1 1 a 1
# 4: 1 1 b 2
# 5: 1 1 a 1
# 6: 1 2 b 2
# 7: 1 2 a 1
# 8: 1 2 b 2
# 9: 1 2 a 1
#10: 1 2 b 2
Try
d$id <- with(d, ave(id, k2, FUN=function(x) as.numeric(factor(x))))
d$id
#[1] 1 1 1 2 2 1 1 2 2 2
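A base R sketch that restarts the numbering inside every (k1, k2) pair, assuming that is the intended grouping. It rebuilds the example data with data.frame() instead of cbind() so that k1 and k2 stay numeric; k3_code is a helper name of my own.

```r
df <- data.frame(k1 = rep(1, 10),
                 k2 = rep(1:2, each = 5),
                 k3 = rep(letters[1:2], 5))

# Integer code for k3, then re-number those codes from 1 within each (k1, k2)
k3_code <- as.integer(factor(df$k3))
df$id <- ave(k3_code, df$k1, df$k2,
             FUN = function(x) match(x, sort(unique(x))))

df[order(df$k1, df$k2, df$k3), ]
```

Using match() against the sorted unique codes means that a group containing only "b" would still get id 1, which is what "start from 1 in each group" suggests.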
Given the following first two columns (id and time_diff), I want to generate the 'block' column
test
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
The data is already sorted by id and time. The time_diff was computed based on the difference of the previous time and the time value for the row, given the same id. I want to create a block id which is an auto-increment value and increases when a new ID or a time_diff of >10 with the same id is encountered.
How can I achieve this in R?
Importing your data as a data frame with something like:
df = read.table(text='
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5')
You can do a one-liner like this to get occurrences satisfying your two conditions:
> new_col = as.vector(cumsum(
na.exclude(
c(F,diff(as.numeric(as.factor(df$id)))) | # change of id OR
df$time_diff > 10 # time_diff greater than 10
)
))
> new_col
[1] 0 0 0 0 0 1 2 2 2 2 3 3 4 4 4
And finally append this new column to your dataframe with cbind:
> cbind(df, block = c(0,new_col))
id time_diff block block
1 a NA 1 0
2 a 1 1 0
3 a 1 1 0
4 a 1 1 0
5 a 3 1 0
6 a 3 1 0
7 b NA 2 1
8 b 11 3 2
9 b 1 3 2
10 b 1 3 2
11 b 1 3 2
12 b 12 4 3
13 b 1 4 3
14 c NA 5 4
15 c 4 5 4
16 c 7 5 4
You will notice an offset between your wanted block variable and mine: correcting it is easy and can be done at several different steps, I will leave it to you :)
Another variation of @Jealie's method would be:
with(test, cumsum(c(TRUE,id[-1]!=id[-nrow(test)])|time_diff>10))
#[1] 1 1 1 1 1 1 2 3 3 3 3 4 4 5 5 5
After learning from Jealie and akrun, I came up with this idea.
mydf %>%
mutate(group = cumsum(time_diff > 10 |!duplicated(id)))
# id time_diff block group
#1 a NA 1 1
#2 a 1 1 1
#3 a 1 1 1
#4 a 1 1 1
#5 a 3 1 1
#6 a 3 1 1
#7 b NA 2 2
#8 b 11 3 3
#9 b 1 3 3
#10 b 1 3 3
#11 b 1 3 3
#12 b 12 4 4
#13 b 1 4 4
#14 c NA 5 5
#15 c 4 5 5
#16 c 7 5 5
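The cumsum() one-liners above work on this data because every NA in time_diff falls on the first row of an id, where the other condition is TRUE, and NA | TRUE evaluates to TRUE. An NA anywhere else would give NA | FALSE, which is NA, and cumsum() would then return NA from that row on. The following sketch makes the NA handling explicit; new_id and big_gap are names I introduce for illustration.

```r
test <- data.frame(
  id = rep(c("a", "b", "c"), times = c(6, 7, 3)),
  time_diff = c(NA, 1, 1, 1, 3, 3, NA, 11, 1, 1, 1, 12, 1, NA, 4, 7)
)

# First row of each id (valid because the data is sorted by id)
new_id <- !duplicated(test$id)

# Gap larger than 10; a missing time_diff never counts as a gap
big_gap <- !is.na(test$time_diff) & test$time_diff > 10

# Every TRUE starts a new block
test$block <- cumsum(new_id | big_gap)
test
```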
Here is an approach using dplyr:
require(dplyr)
set.seed(999)
test <- data.frame(
id = rep(letters[1:4], each = 3),
time_diff = sample(4:15)
)
test %>%
mutate(
b = as.integer(factor(id)) - lag(as.integer(factor(id))), # id is character in R >= 4.0, so convert via factor
more10 = time_diff > 10,
increment = pmax(b, more10, na.rm = TRUE),
increment = ifelse(row_number() == 1, 1, increment),
block = cumsum(increment)
) %>%
select(id, time_diff, block)
Try:
> df
id time_diff
1 a NA
2 a 1
3 a 1
4 a 1
5 a 3
6 a 3
7 b NA
8 b 11
9 b 1
10 b 1
11 b 1
12 b 12
13 b 1
14 c NA
15 c 4
16 c 7
block= c(1)
for(i in 2:nrow(df))
block[i] = ifelse(df$time_diff[i]>10 || df$id[i]!=df$id[i-1],
block[i-1]+1,
block[i-1])
df$block = block
df
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
I know how to delete columns in R, but I am not sure how to delete them based on the following set of conditions.
Suppose a data frame such as:
DF <- data.frame(L = c(2,4,5,1,NA,4,5,6,4,3), J= c(3,4,5,6,NA,3,6,4,3,6), K= c(0,1,1,0,NA,1,1,1,1,1),D = c(1,1,1,1,NA,1,1,1,1,1))
DF
L J K D
1 2 3 0 1
2 4 4 1 1
3 5 5 1 1
4 1 6 0 1
5 NA NA NA NA
6 4 3 1 1
7 5 6 1 1
8 6 4 1 1
9 4 3 1 1
10 3 6 1 1
The data frame has to be set up in this fashion. Column K corresponds to column L, and column D corresponds to column J. Because the values in column D are all equal to one, I would like to delete column D and the corresponding column J, yielding a data frame that looks like:
DF
L K
1 2 0
2 4 1
3 5 1
4 1 0
5 NA NA
6 4 1
7 5 1
8 6 1
9 4 1
10 3 1
I know there has got to be a simple command to do so, I just can't think of one. And if it makes any difference, the NAs must be retained.
Additional information: my real data frame has 20 columns in total, 10 like L and J and another 10 like K and D. I need a function that can recognize the correspondence between these two groups and delete columns accordingly.
Thank you in advance!
Okay, assuming the column-number based correspondence, here is an example:
> n <- 10
>
> # sample data
> d <- data.frame(lapply(1:n, function(x)sample(n)), lapply(1:n, function(x)sample(2, n, T, c(0.1, 0.9))-1))
> names(d) <- c(LETTERS[1:n], letters[1:n])
> head(d)
A B C D E F G H I J a b c d e f g h i j
1 5 5 2 7 4 3 4 3 5 8 0 1 1 1 1 1 1 1 1 1
2 9 8 4 6 7 8 8 2 10 5 1 1 1 1 1 1 1 1 1 1
3 6 6 10 3 5 6 2 1 8 6 1 1 1 1 1 1 1 1 1 1
4 1 7 5 5 1 10 10 4 2 4 1 1 1 1 1 1 1 1 1 1
5 10 9 6 2 9 5 6 9 9 9 1 1 0 1 1 1 1 1 1 1
6 2 1 1 4 6 1 5 8 4 10 1 1 1 1 1 1 1 1 1 1
>
> # find the columns that should be kept.
> idx <- which(colMeans(d[(n+1):(2*n)], na.rm = TRUE) != 1)
>
> # filter the data
> d[, c(idx, idx+n)]
A B C D F a b c d f
1 5 5 2 7 3 0 1 1 1 1
2 9 8 4 6 8 1 1 1 1 1
3 6 6 10 3 6 1 1 1 1 1
4 1 7 5 5 10 1 1 1 1 1
5 10 9 6 2 5 1 1 0 1 1
6 2 1 1 4 1 1 1 1 1 1
7 8 4 7 10 2 1 1 1 1 0
8 7 3 9 9 4 1 0 1 0 1
9 3 10 3 1 9 1 1 0 1 1
10 4 2 8 8 7 1 0 1 1 1
I basically agree with koshke (whose SO work is excellent), but would suggest that the test to use is colSums(d[(n+1):(2*n)], na.rm=TRUE) == NROW(d), since a paired 0 and 2, or -1 and 3, could throw off the colMeans test.
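Applied to the DF from the question, which contains an all-NA row, a test that compares every non-missing value to 1 may be easier to reason about than either colMeans() or colSums(). The sketch below assumes the pairing is positional (the i-th flag column belongs to the i-th value column); value_cols, flag_cols and all_ones are names I introduce for illustration.

```r
DF <- data.frame(L = c(2, 4, 5, 1, NA, 4, 5, 6, 4, 3),
                 J = c(3, 4, 5, 6, NA, 3, 6, 4, 3, 6),
                 K = c(0, 1, 1, 0, NA, 1, 1, 1, 1, 1),
                 D = c(1, 1, 1, 1, NA, 1, 1, 1, 1, 1))

value_cols <- c("L", "J")
flag_cols  <- c("K", "D")  # assumed: flag_cols[i] pairs with value_cols[i]

# TRUE for flag columns whose non-missing entries are all equal to 1
all_ones <- vapply(DF[flag_cols],
                   function(x) all(x == 1, na.rm = TRUE),
                   logical(1))

# Keep only the pairs whose flag column is not constant 1; NA rows are untouched
DF_kept <- DF[c(value_cols[!all_ones], flag_cols[!all_ones])]
DF_kept
```

Note that a flag column consisting only of NAs would also be treated as "all ones" here, because all() of an empty set is TRUE; add an explicit check if that case can occur.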