I would like to reclassify the names of some individuals in a dataframe with consecutive letters, and the reclassification criterion has to change every X intervals since the first occurrence of an individual. Let me explain with an example.
ID <- c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 1, 2, 6, 8, 12, 7, 15, 16, 17, 18, 19, 20, 1, 21, 22, 19 )
Year <- c (1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6)
df <- data.frame (ID, Year)
df
I have a dataset with repeated measures of some individuals over 6 years. As you can see, some IDs are repeated: ID == 1 appears in Year == 1,2,3,4,5 and ID == 8 in Year == 2,4. However, different individuals may share the same ID if enough time has passed since the first occurrence of an individual, because we consider that an individual dies every 2 years and its ID may then be reused.
In this hypothetical case, we assume that the lifespan of an individual is 2 years and that we can perfectly recognise different individuals during sampling. ID == 1 in Year == 1 and Year == 2 represents the same individual, but ID == 1 in Year == 1-2, Year == 3-4 and Year == 5 represents three different individuals, because the individual with ID == 1 from Year == 1 couldn't live that long. The problem is that the first occurrence of an individual may happen in any year, and repeatedly, as in this case. So the code has to "forget" an ID 2 years after its first occurrence and classify a new occurrence as a new individual.
I would like to name each individual with a unique ID. The new name does not have to be arranged chronologically, as you can see with ID == 1 in Year == 5; I only want each individual to get a unique name.
Below I have put the expected result.
ID <- c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 1, 2, 6, 8, 12, 7, 15, 16, 17, 18, 19, 20, 1, 21, 22, 19 )
Year <- c (1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6)
new_ID <- c("A", "B", "C", "D", "E", "F", "G", "A", "B", "C", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "M", "N", "Q", "S", "L", "T", "U", "V", "W", "X", "Y", "Z", "CC", "AA", "BB", "Y")
new_df <- data.frame (ID, Year, new_ID)
new_df
As you can see, ID == 1 has different new_ID values in Year == 1, Year == 4 and Year == 5, because we assume that if an individual occurs for the first time in Year == 1, an individual with the same ID in Year == 3 is a different one, and the same for the individual that occurs in Year == 5.
Thanks in advance.
You can use dplyr and cut:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(x = as.numeric(cut(Year, seq(min(Year) - 1, max(Year) + 1, 2))),
         idout = paste0(ID, ".", x))
ID Year x idout
1 1 1 1 1.1
2 2 1 1 2.1
3 3 1 1 3.1
4 4 1 1 4.1
5 5 1 1 5.1
6 6 1 1 6.1
7 7 1 1 7.1
8 1 2 1 1.1
9 2 2 1 2.1
10 3 2 1 3.1
11 8 2 1 8.1
12 9 2 1 9.1
13 10 2 1 10.1
14 11 2 1 11.1
15 12 2 1 12.1
16 1 3 2 1.2
17 2 3 2 2.2
18 3 3 2 3.2
19 4 3 2 4.2
20 5 3 2 5.2
21 6 3 2 6.2
22 1 4 2 1.2
23 2 4 2 2.2
24 6 4 2 6.2
25 8 4 2 8.2
26 12 4 2 12.2
27 7 5 3 7.3
28 15 5 1 15.1
29 16 5 1 16.1
30 17 5 1 17.1
31 18 5 1 18.1
32 19 5 1 19.1
33 20 5 1 20.1
34 1 5 3 1.3
35 21 6 1 21.1
36 22 6 1 22.1
37 19 6 1 19.1
NB there are two mismatches with your desired output: row 34, and rows 15 and 26, where you have an L at Years 2 and 4 for the same ID. I think these are mistakes?
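If letter labels like the question's new_ID are wanted instead of the ID.x codes, one option (a sketch, assuming fewer than 26 + 26*26 distinct individuals; lab is a name introduced here) is to index a letter pool by the first-occurrence order of idout:

```r
library(dplyr)

# letter pool: A..Z, then AA..AZ, BA..BZ, ... (assumed large enough here)
lab <- c(LETTERS, c(sapply(LETTERS, function(l) paste0(l, LETTERS))))

df %>%
  group_by(ID) %>%
  mutate(x = as.numeric(cut(Year, seq(min(Year) - 1, max(Year) + 1, 2))),
         idout = paste0(ID, ".", x)) %>%
  ungroup() %>%
  # label each distinct idout by its first appearance in row order
  mutate(new_ID = lab[match(idout, unique(idout))])
```

match(idout, unique(idout)) numbers the distinct codes in order of first appearance, so labels come out A, B, C, ... down the rows, as in the expected output.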
ID <- c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 1, 2, 6, 8, 12, 7, 15, 16, 17, 18, 19, 20, 1, 21, 22, 19 )
Year <- c (1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6)
new_ID <- c("A", "B", "C", "D", "E", "F", "G", "A", "B", "C", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "M", "N", "Q", "S", "L", "T", "U", "V", "W", "X", "Y", "Z", "CC", "AA", "BB", "Y")
new_df <- data.frame (ID, Year, new_ID)
new_df
# if all IDs renew after the same interval, use:
newID <- sapply(unique(ID), function(x) c(0, cumsum(diff(Year[ID == x])) %% 2))
# if some IDs renew in a different year, use:
newID <- sapply(unique(ID), function(x) {
  mod <- 2
  if (x == 1) mod <- 3
  c(0, cumsum(diff(Year[ID == x])) %% mod)
})
names(newID) <- unique(ID)
new_df <- data.frame(ID, Year, IDcond = NA, new_ID = NA)
for (i in unique(ID)) {
  new_df[new_df[, 1] == i, 3] <- newID[[which(unique(ID) == i)]]
}
ltrs <- c(LETTERS, apply(combn(LETTERS, 2), 2, paste, collapse = ""))
ltrn <- 0
for (i in 1:nrow(new_df)) {
  if (new_df[i, 3] == 0) {   # IDcond == 0 marks a new individual
    ltrn <- ltrn + 1
    new_df[i, 4] <- ltrs[ltrn]
  } else {                   # otherwise reuse the last label given to this ID
    ind <- which(new_df[, 1] == new_df[i, 1])
    ind <- ind[ind < i]
    new_df[i, 4] <- tail(new_df[ind, 4], 1)
  }
}
new_df
> new_df
ID Year IDcond new_ID
1 1 1 0 A
2 2 1 0 B
3 3 1 0 C
4 4 1 0 D
5 5 1 0 E
6 6 1 0 F
7 7 1 0 G
8 1 2 1 A
9 2 2 1 B
10 3 2 1 C
11 8 2 0 H
12 9 2 0 I
13 10 2 0 J
14 11 2 0 K
15 12 2 0 L
16 1 3 0 M
17 2 3 0 N
18 3 3 0 O
19 4 3 0 P
20 5 3 0 Q
21 6 3 0 R
22 1 4 1 M
23 2 4 1 N
24 6 4 1 R
25 8 4 0 S
26 12 4 0 T
27 7 5 0 U
28 15 5 0 V
29 16 5 0 W
30 17 5 0 X
31 18 5 0 Y
32 19 5 0 Z
33 20 5 0 AB
34 1 5 0 AC
35 21 6 0 AD
36 22 6 0 AE
37 19 6 1 Z
I want to calculate sum of columns and rows 'by groups'.
For example,
a <- matrix(c(NA, NA, "a", "a", "a", "b", "b", NA, NA, 1, 2, 3, 4, 5, "a", 1, 0, 1, 2, 3, 1, "a", 2, 1, 2, 1, 1, 1, "a", 3, 3, 1, 2, 0, 1, "b", 4, 0, 0, 0, 0, 3, "b", 5, 1, 1, 2, 1, 0), ncol = 7, nrow = 7)
I have a square matrix like data 'a'.
      a  a  a  b  b
      1  2  3  4  5
a  1  0  1  2  3  1
a  2  1  2  1  1  1
a  3  3  1  2  0  1
b  4  0  0  0  0  3
b  5  1  1  2  1  0
The thing I need to do is to change the matrix like this
   a   b
a  13  7
b  4   4
The point is that I do not need the rows and columns of numbers, only the letter rows and columns. For example, the original matrix has the values
1-1, 1-2, 1-3, 2-1, 2-2, 2-3, 3-1, 3-2, 3-3
All of these 9 values need to be added up, so they become 13.
To solve this problem I tried the aggregate function, but I didn't succeed.
You can use rowsum twice, with t in between to transpose the matrix, to sum it up per group in both directions.
x <- matrix(as.numeric(a[3:7, 3:7]), 5, dimnames=list(a[3:7,1], a[1,3:7]))
y <- t(rowsum(x, rownames(x)))
rowsum(y, rownames(y))
# a b
#a 13 7
#b 4 4
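As a cross-check (not part of the answer's approach), the same 2x2 result can be reproduced with tapply on the long form of the x matrix built above; rg and cg are names introduced here:

```r
rg <- rownames(x)[row(x)]   # row group for every cell
cg <- colnames(x)[col(x)]   # column group for every cell
tapply(as.vector(x), list(rg, cg), sum)
#    a b
# a 13 7
# b  4 4
```

This sums every cell into its (row group, column group) bucket in one pass, which is handy for verifying the rowsum result.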
This is similar to a question I asked earlier, but I left out a couple of important pieces: an ID column and the XYZ variables.
I have a dataset with the following layout (strange column titles, I know):
ID <- c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)
XYZ1_a <- c(1, 2, 1, 2, 1, 2, 4, 2, 5, 1)
XYZ1_b <- c(2, 1, 1, 1, 2, 2, 4, 2, 1, 5)
ABC1a_1 <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1)
ABC1b_1 <- c(4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC1a_2 <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4)
ABC1b_2 <- c(2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC2a_1 <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3)
ABC2b_1 <- c(1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2a_2 <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4)
ABC2b_2 <- c(2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df <- data.frame(ID, XYZ1_a, XYZ1_b, ABC1a_1, ABC1b_1, ABC1a_2, ABC1b_2, ABC2a_1, ABC2b_1, ABC2a_2, ABC2b_2)
I want to collapse all of the ABC[N][x]_[n] variables into a single ABC[N]_[n] variable like the below, but I also need to do the same for the columns with the XYZ naming convention:
ID <- c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0,
1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)
XYZ1 <- c(1, 2, 1, 2, 1, 2, 4, 2, 5, 1, 2, 1, 1, 1, 2, 2, 4, 2, 1, 5)
ABC1_1 <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1, 4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC1_2 <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4, 2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC2_1 <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3, 1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2_2 <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4, 2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df2 <- data.frame(ID, XYZ1, ABC1_1, ABC1_2, ABC2_1, ABC2_2)
What's the best way to achieve this, ideally with a tidyverse solution?
We can rearrange the substrings in those column names that start with 'ABC' by capturing the letter as one group, followed by the underscore and one or more digits (\\d+) as a second group, and in the replacement specifying the backreferences in reverse order while adding a new _. In pivot_longer, specify sep to match the _ that precedes a letter.
library(dplyr)
library(stringr)
library(tidyr)
df %>%
rename_with(~ str_replace(., "([a-z])_(\\d+)$", "_\\2_\\1"),
starts_with('AB')) %>%
pivot_longer(cols = -ID, names_to = c(".value", "grp"),
names_sep = "_(?=[a-z])", values_drop_na = TRUE) %>%
select(-grp)
-output
# A tibble: 20 x 6
ID XYZ1 ABC1_1 ABC1_2 ABC2_1 ABC2_2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.1 1 1 4 2 2
2 1.1 2 4 2 1 2
3 1.2 2 5 5 5 5
4 1.2 1 2 3 2 3
5 1.3 1 3 5 3 5
6 1.3 1 1 3 2 3
7 1.4 2 4 4 5 1
8 1.4 1 1 2 4 2
9 1.5 1 3 2 3 2
10 1.5 2 5 2 5 1
11 1.6 2 4 5 4 1
12 1.6 2 3 3 3 3
13 1.7 4 5 5 5 5
14 1.7 4 2 2 2 1
15 1.8 2 2 1 3 1
16 1.8 2 1 1 4 1
17 1.9 5 2 2 2 3
18 1.9 1 1 4 1 2
19 2 1 1 4 3 4
20 2 5 5 2 4 2
In the older version use rename_at
df %>%
rename_at(vars(starts_with('AB')),
~ str_replace(., "([a-z])_(\\d+)$", "_\\2_\\1")) %>%
pivot_longer(cols = -ID, names_to = c(".value", "grp"),
names_sep = "_(?=[a-z])", values_drop_na = TRUE) %>%
select(-grp)
-output (identical to the one above)
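To see what the rename step in these answers actually does, here are a few of the original column names run through the same str_replace call:

```r
library(stringr)
old <- c("ABC1a_1", "ABC1b_1", "ABC2a_2")
str_replace(old, "([a-z])_(\\d+)$", "_\\2_\\1")
# [1] "ABC1_1_a" "ABC1_1_b" "ABC2_2_a"
```

The trailing letter now sits after a second underscore, so names_sep = "_(?=[a-z])" can split it off into the grp column while the "ABC1_1" part feeds .value.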
I want to expand a vector of integers into consecutive integers within each group in R. Can anyone offer hints on this problem?
Below is my original dataset:
x = c(1, 2, 3, 4, 5, 1, 3, 5, 6, 1, 2, 3, 6, 8)
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
data = data.frame(x, group)
and my desired dataset is as below:
desired_data = data.frame(
x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8),
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3))
Thanks for your help!
This can easily be done via expand from tidyr:
library(tidyverse)
data %>%
  group_by(group) %>%
  expand(x = full_seq(x, 1))
Which gives,
# A tibble: 19 x 2
# Groups: group [3]
group x
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 2 1
7 2 2
8 2 3
9 2 4
10 2 5
11 2 6
12 3 1
13 3 2
14 3 3
15 3 4
16 3 5
17 3 6
18 3 7
19 3 8
I'm sure someone will have a cleaner solution soon. In the meantime:
minVals <- aggregate(data$x, by = list(data$group), min)[, 2]
maxVals <- aggregate(data$x, by = list(data$group), max)[, 2]
ls <- apply(cbind(minVals, maxVals), 1, function(x) x[1]:x[2])
desired_data <- data.frame(
  x = unlist(ls),
  group = rep(unique(data$group), lengths(ls)))
x group
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 1 2
7 2 2
8 3 2
9 4 2
10 5 2
11 6 2
12 1 3
13 2 3
14 3 3
15 4 3
16 5 3
17 6 3
18 7 3
19 8 3
Here's a base R solution.
x = c(1, 2, 3, 4, 5, 1, 3, 5, 6, 1, 2, 3, 6, 8)
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
sl <- split(x, group)
expanded <- lapply(names(sl), function(g) {
  r <- range(sl[[g]])
  data.frame(x = seq(r[1], r[2], 1), group = g)
})
do.call(rbind, expanded)
split x by group, which results in a named list with one element per group.
Using lapply on the names, we can expand the integer range for each group.
Finally, use do.call to rbind the results together.
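The three steps above can also be collapsed into one expression (a sketch, functionally equivalent to the lapply version):

```r
sl <- split(x, group)
do.call(rbind, Map(function(v, g) data.frame(x = min(v):max(v), group = g),
                   sl, as.numeric(names(sl))))
```

Map pairs each group's values with its (numeric) group label, so the group column stays numeric as in the desired output.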
I have two data frames in R
df1 = data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4), ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 = data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4), ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))
> df1
Cust ItemId
1 1 1
2 1 2
3 1 3
4 1 4
5 2 2
6 2 3
7 2 2
8 2 5
9 3 1
10 3 2
11 3 5
12 4 6
13 4 2
> df2
ItemId1 ItemId2
1 1 3
2 3 1
3 2 2
4 3 3
5 2 4
6 1 1
7 2 6
8 3 4
9 4 2
10 6 4
11 5 3
12 3 1
13 2 3
14 4 5
All I need is the following output, computed in a way that is less costly than joins/merge (because in my real data I am dealing with billions of records).
> output
ItemId1 ItemId2 Cust
1 1 3 1
2 3 1 1
3 2 2 1, 2, 3, 4
4 3 3 1, 2
5 2 4 1
6 1 1 1, 3
7 2 6 4
8 3 4 1
9 4 2 1
10 6 4 NA
11 5 3 2
12 3 1 1
13 2 3 1, 2
14 4 5 NA
What happens is: if the ItemId1, ItemId2 combination of df2 is present among the ItemId values of df1 for a customer, we need to return the Cust values (even if there are multiple). If it is not present, we need to return NA.
I.e., take the first row as an example: ItemId1 = 1, ItemId2 = 3. Only Cust = 1 has ItemId = c(1, 3) in df1. Similarly for the next rows.
We could do this using joins/merge, which are costly operations, but they result in a memory error.
This may take more time but won't take much of your memory.
Please convert the for loops to apply if possible:
library(plyr)
df1 <- data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
                  ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 <- data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4),
                  ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))
# one row per customer, with all their items as a comma-separated string
temp2 <- ddply(df1[, c("Cust", "ItemId")], .(Cust), summarize,
               ItemId = toString(unique(ItemId)))
# one row per item, with all customers holding it
temp3 <- ddply(df1[, c("ItemId", "Cust")], .(ItemId), summarize,
               Cust = toString(unique(Cust)))
dfout <- cbind(df2[0, ], data.frame(Cust = df1[0, 1]))
for (i in 1:nrow(df2)) {
  a <- df2[i, 1]
  b <- df2[i, 2]
  if (a == b) {
    dfout <- rbind(dfout, data.frame(ItemId1 = a, ItemId2 = a,
                                     Cust = temp3$Cust[temp3$ItemId == a]))
  } else {
    cusli <- c()
    for (j in 1:nrow(temp2)) {
      if (length(grep(a, temp2$ItemId[j])) > 0 &&
          length(grep(b, temp2$ItemId[j])) > 0) {
        cusli <- c(cusli, temp2$Cust[j])
      }
    }
    dfout <- rbind(dfout, data.frame(ItemId1 = a, ItemId2 = b,
                                     Cust = paste(cusli, collapse = ", ")))
  }
}
dfout$Cust[dfout$Cust == ""] <- NA
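For reference, here is a sketch of an alternative that avoids both the nested loops and the string matching (grep on a comma-separated string can match "2" inside "12"; this version compares IDs exactly). items and hit are names introduced here, not from the answer above:

```r
# customer -> vector of items they own
items <- split(df1$ItemId, df1$Cust)

output <- df2
output$Cust <- apply(df2, 1, function(p) {
  # which customers own both items of this pair?
  hit <- vapply(items, function(it) all(p %in% it), logical(1))
  if (any(hit)) toString(names(items)[hit]) else NA_character_
})
```

The customer-to-items index is built once, and each df2 row only does set membership tests, so no intermediate join table is materialised.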
Here is my problem:
myvec <- c(1, 2, 2, 2, 3, 3,3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
I want to develop a function that can categorize this vector depending upon the number of categories I define.
If there is 1 category, all newvec elements will be 1.
If there are 2 categories, then over unique(myvec):
1 = 1, 2 = 2, 3 = 1, 4 = 2, 5 = 1, 6 = 2, 7 = 1, 8 = 2, 9 = 1, 10 = 2
(which is the odd/even situation).
If there are 3 categories, the first three numbers will be 1:3 and then the pattern repeats:
1 = 1, 2 = 2, 3 = 3, 4 = 1, 5 = 2, 6 = 3, 7 = 1, 8 = 2, 9 = 3, 10 = 1
If there are 4 categories, the first numbers will be 1:4 and then the pattern repeats:
1 = 1, 2 = 2, 3 = 3, 4 = 4, 5 = 1, 6 = 2, 7 = 3, 8 = 4, 9 = 1, 10 = 2
Similarly, for n categories: first 1:n, then the pattern repeated.
This should do what you need, if I correctly understood the question. You can vary variable n to choose the number of groups.
myvec <- c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
out <- vector(mode="integer", length=length(myvec))
uid <- sort(unique(myvec))
n <- 3
for (i in 1:n) {
  s <- uid[seq(i, length(uid), n)]  # the unique values falling in category i
  out[myvec %in% s] <- i
}
Using the recycling features of R (this gives a warning if the vector length is not divisible by n):
R> myvec <- c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
R> n <- 3
R> y <- cbind(x=sort(unique(myvec)), y=1:n)[, 2]
R> y
[1] 1 2 3 1 2 3 1 2 3 1
or using rep:
R> x <- sort(unique(myvec))
R> y <- rep(1:n, length.out=length(x))
R> y
[1] 1 2 3 1 2 3 1 2 3 1
Update: you could just use the modulo operator
R> myvec
[1] 1 2 2 2 3 3 3 4 4 5 6 6 6 6 7 8 8 9 10 10 10
R> n <- 4
R> ((myvec - 1) %% n) + 1
[1] 1 2 2 2 3 3 3 4 4 1 2 2 2 2 3 4 4 1 2 2 2
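One caveat (my note, not the answer's): the modulo trick works here because the values of myvec happen to be consecutive integers starting at 1. A rank-based sketch drops that assumption and works for arbitrary values:

```r
r <- match(myvec, sort(unique(myvec)))  # rank of each value among the uniques
((r - 1) %% n) + 1
```

For this particular myvec the ranks equal the values, so both versions give the same result.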