I am trying to get all combinations of values per group, while preventing combinations of values between different groups.
To create all combinations of values (regardless of which group a value belongs to) I can use:
expand.grid(value, value)
The desired result should be a subset of the result of the previous command.
Example:
#base data
value = c(1,3,5, 1,5,7,9, 2)
group = c("a", "a", "a","b","b","b","b", "c")
base <- data.frame(value, group)
#creating ALL combinations of value
allComb <- expand.grid(base$value, base$value)
#the desired result is a subset of allComb.
#Note: the first column shows the row number from allComb.
#Empty rows separate the combinations per group and are shown only for clarity.
Var1 Var2
1 1 1
2 3 1
3 5 1
11 1 3
12 3 3
13 5 3
21 1 5
22 3 5
23 5 5
34 1 1
35 5 1
36 7 1
37 9 1
44 1 5
45 5 5
46 7 5
47 9 5
54 1 7
55 5 7
56 7 7
57 9 7
64 1 9
65 5 9
66 7 9
67 9 9
78 2 2
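One way to get exactly those per-group pairs (a sketch, not part of the original post) is to skip the subsetting and build the combinations within each group directly, using split() and expand.grid(), then stack the pieces:

# expand.grid within each group, then bind the per-group results together
perGroup <- lapply(split(base$value, base$group),
                   function(v) expand.grid(Var1 = v, Var2 = v))
groupComb <- do.call(rbind, perGroup)
groupComb  # same pairs as the subset of allComb shown above, minus the allComb row numbers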
Related
I'm trying to create a dataset for each combination of rows from separate groups. Ideally, one row from each group would be selected and there would be a dataset for every combination. I have a dataset that looks similar in structure to the sample below:
Name Group Stat1 Stat2
1 1 a 63 38
2 2 a 33 62
3 3 b 3 66
4 4 b 57 67
5 5 c 42 69
6 6 c 47 14
7 7 c 16 10
8 8 d 21 46
9 9 d 72 1
I'm trying to get the first resulting dataset to look like this:
Name Group Stat1 Stat2
1 1 a 63 38
2 3 b 3 66
3 5 c 42 69
4 8 d 21 46
With the second dataset looking like this:
Name Group Stat1 Stat2
1 1 a 63 38
2 3 b 3 66
3 5 c 42 69
4 9 d 72 1
Until every combination has been exhausted. I've tried strategies using apply functions and combn but cannot seem to get the result I want. This does not seem too challenging to me conceptually, so I'm not sure what I'm missing.
Any help would be greatly appreciated! Thanks in advance!
Lots of ways to approach this. A simple solution is to just generate all 4-row combos, then subset to those with all distinct Group values. I named your data df and assumed Name is a unique row id. If that's not true, you could replace df$Name with 1:nrow(df).
# All 4 row combos of row ids
combs <- combn(df$Name, 4)
# Match group labels to row ids
g <- matrix(df$Group[combs], nrow = 4)
# 4 row combs filtered to all distinct group vals
combs <- combs[,apply(g, 2, function(i) all(!duplicated(i)))]
# For each 4 row combo, extract rows from the dataframe
final_list <- apply(combs, 2, function(i) df[i,])
final_list[1:3]
[[1]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
5 5 c 42 69
8 8 d 21 46
[[2]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
5 5 c 42 69
9 9 d 72 1
[[3]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
6 6 c 47 14
8 8 d 21 46
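An alternative sketch (using the same df and assumptions as above) that avoids generating combinations that will only be thrown away: build one vector of row indices per group and let expand.grid() produce every one-row-per-group combination.

# one vector of row positions per Group
idx <- split(seq_len(nrow(df)), df$Group)
# every combination of one row per group (2 * 2 * 3 * 2 = 24 here)
combos <- expand.grid(idx)
# extract the corresponding rows for each combination
final_list <- apply(combos, 1, function(i) df[i, ])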
I have the following dataframe containing a variable "group" and a variable "number of elements per group"
group elements
1 3
2 1
3 14
4 10
.. ..
.. ..
30 5
Then I have a bunch of numbers going from 1 to (let's say) 30.
Summing "elements" gives 900. What I want is to randomly select numbers from 1 to 30 and assign them to each group until that group's number of elements is filled. Each number should appear 30 times in total.
Thus, for group 1, I want to randomly select 3 numbers from 1 to 30;
for group 2, 1 number from 1 to 30, and so on until all of the groups are filled.
the final table should look like this:
group number(randomly selected)
1 7
1 20
1 7
2 4
3 21
3 20
...
any suggestions on how I can achieve this?
In base R, if you have df like this...
df
group elements
1 3
2 1
3 14
Then you can do this...
data.frame(group = rep(df$group,                 #repeat group no...
                       df$elements),             #...elements times
           number = unlist(sapply(df$elements,   #for each group's elements...
                                  sample.int,    #...sample <elements> numbers
                                  n = 30,        #from 1 to 30
                                  replace = FALSE)))  #without duplicates
group number
1 1 19
2 1 15
3 1 28
4 2 15
5 3 20
6 3 18
7 3 27
8 3 10
9 3 23
10 3 12
11 3 25
12 3 11
13 3 14
14 3 13
15 3 16
16 3 26
17 3 22
18 3 7
Give this a try:
df <- read.table(text = "group elements
1 3
2 1
3 14
4 10
30 5", header = TRUE)
# reproducibility
set.seed(1)
df_split2 <- do.call("rbind",
                     lapply(split(df, df$group),
                            function(m) cbind(m,
                                              `number(randomly selected)` =
                                                sample(1:30, replace = TRUE,
                                                       size = m$elements),
                                              row.names = NULL)))
# remove element column name
df_split2$elements <- NULL
head(df_split2)
#> group number(randomly selected)
#> 1.1 1 25
#> 1.2 1 4
#> 1.3 1 7
#> 2 2 1
#> 3.1 3 2
#> 3.2 3 29
The split function splits df into chunks based on the group column. We then take those smaller data frames and add a column to each by sampling 1:30 a total of elements times. Finally we do.call rbind on this list to bind it back together.
You have to generate a new dataframe repeating $group $elements times, and then with sample you can generate the exact number of random numbers:
data <- data.frame(group = c(1,2,3,4,5),
                   elements = c(2,5,2,1,3))
data.elements <- data.frame(group = rep(data$group, data$elements),
                            number = sample(1:30, sum(data$elements)))
The result:
group number
1 1 9
2 1 4
3 2 29
4 2 28
5 2 18
6 2 7
7 2 25
8 3 17
9 3 22
10 4 5
11 5 3
12 5 8
13 5 26
I solved it as follows:
random_sample <- rep(1:30, each=30)
random_sample <- sample(random_sample)
Then I create a df with this variable and a group variable in which each group is repeated as many times as its number of elements.
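A minimal sketch of how that could look (assuming the group/elements table is called df and sum(df$elements) equals 900, matching the length of random_sample):

# every number 1:30 appears exactly 30 times, in random order
random_sample <- sample(rep(1:30, each = 30))
# one row per element: groups repeated by their element counts, paired with the shuffled numbers
result <- data.frame(group  = rep(df$group, df$elements),
                     number = random_sample[seq_len(sum(df$elements))])
head(result)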
I would like to create groups from a base by matching values.
I have the following data table:
now <- c(1,2,3,4,24,25,26,5,6,21,22,23)
before <- c(0,1,2,3,23,24,25,4,5,0,21,22)
after <- c(2,3,4,5,25,26,0,6,0,22,23,24)
df <- as.data.frame(cbind(now, before, after))
which reproduces the following data:
now before after
1 1 0 2
2 2 1 3
3 3 2 4
4 4 3 5
5 24 23 25
6 25 24 26
7 26 25 0
8 5 4 6
9 6 5 0
10 21 0 22
11 22 21 23
12 23 22 24
I would like to get:
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
5 5 4 6 A
6 6 5 0 A
7 21 0 22 B
8 22 21 23 B
9 23 22 24 B
10 24 23 25 B
11 25 24 26 B
12 26 25 0 B
I would like to reach the answer without using a "for" loop because the real data is too large.
Any help you could provide will be appreciated.
Here is one way. It is hard to avoid a for-loop as this is quite a tricky algorithm. The objection to them is often on the grounds of elegance rather than speed, but sometimes they are entirely appropriate.
df$group <- seq_len(nrow(df))  #assign each row to its own group
stop <- FALSE                  #indicates convergence
while (!stop) {
  pre <- df$group              #group column at start of loop
  for (i in seq_len(nrow(df))) {
    matched <- which(df$before == df$now[i] | df$after == df$now[i]) #check matches in before and after columns
    group <- min(df$group[i], df$group[matched]) #identify smallest group no of matching rows
    df$group[i] <- group       #set to smallest group
    df$group[matched] <- group #set to smallest group
  }
  if (identical(df$group, pre)) stop <- TRUE #stop if no change
}
df$group <- LETTERS[match(df$group, sort(unique(df$group)))] #convert groups to letters
#(just use match(...) to keep them as integers - e.g. if you have more than 26 groups)
df <- df[order(df$group, df$now), ] #reorder as required
df
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
8 5 4 6 A
9 6 5 0 A
10 21 0 22 B
11 22 21 23 B
12 23 22 24 B
5 24 23 25 B
6 25 24 26 B
7 26 25 0 B
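If the loop really does become a bottleneck, here is a sketch of an alternative using the igraph package (not part of the answer above): treat each now value as a vertex, connect it to its non-zero before and after values, and label the connected components. Component numbering depends on vertex order, but rows in the same chain always share a label.

library(igraph)

# edges from each value to its non-zero neighbours
edges <- rbind(data.frame(from = df$now, to = df$before),
               data.frame(from = df$now, to = df$after))
edges <- edges[edges$to != 0, ]

g <- graph_from_data_frame(edges, directed = FALSE)
membership <- components(g)$membership                 # component id, named by vertex
df$group <- LETTERS[membership[as.character(df$now)]]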
I have a table with a column "Age" that has values from 1 to 10, and a column "Population" that has a value for each age. I want to generate a cumulative sum of population from the oldest age down, so that the result gives the population aged 1 and above, 2 and above, and so on. That is, the resulting vector should be (203, 180, ... and so on). Any help would be appreciated!
Age Population Withdrawn
1 23 3
2 12 2
3 32 2
4 33 3
5 15 4
6 10 1
7 19 2
8 18 3
9 19 1
10 22 5
You can use cumsum and rev: the first rev puts the oldest age first, cumsum accumulates the totals, and the second rev restores the original order:
df$sum_above <- rev(cumsum(rev(df$Population)))
The result:
> df
Age Population sum_above
1 1 23 203
2 2 12 180
3 3 32 168
4 4 33 136
5 5 15 103
6 6 10 88
7 7 19 78
8 8 18 59
9 9 19 41
10 10 22 22
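If the double rev reads as opaque, an equivalent formulation (a sketch on the same df, relying on the rows being sorted by Age) subtracts the population strictly younger than each age from the overall total:

# total population minus everyone strictly younger than each age
df$sum_above <- sum(df$Population) - cumsum(df$Population) + df$Population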
I am trying to remove duplicate observations from a data set based on my id variable, but I want the removal to follow certain rules. The variables below are id, the sex of the household head (1 = male, 2 = female), and the age of the household head. The rules are as follows: if a household has both male and female household heads, remove the female household head observation; if a household has either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))
You can do this by first ordering the data.frame so the desired entry for each id comes first, and then removing the rows with duplicate ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
With data.table, this is easy with "compound queries". Set the "key" to "id,sex" when you create the data.table so the rows are ordered (required in case any female values would come before male values for a given id).
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45
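For comparison, a sketch of the same logic with dplyr (not part of either answer): order so the preferred head comes first within each id, then keep the first row per id.

library(dplyr)

data %>%
  arrange(id, sex, desc(age)) %>%   # male (1) before female (2), older before younger
  distinct(id, .keep_all = TRUE)    # keep the first, i.e. preferred, row per id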