Generating combinations based on 2 columns in R

I know that this question has been asked multiple times, but I was not able to find exactly what I am looking for in the previous topics. Please feel free to close the topic if it is a duplicate.
I have a dataframe as follows:
> data %>% arrange(customer_id)
region market unit_key
1 2 98 320
2 2 98 321
3 4 184 287
4 4 4 7
5 4 4 287
6 66 521 899
7 66 521 900
8 66 3012 899
9 66 521 916
10 66 3011 900
I would like to add a fourth column, a unique identifier called combination_id, formed as follows:
Basically, for each unique pair of region and market I should get a unique identifier that allows me to retrieve the unit_keys linked to a given combination of markets within a specific region.
I tried to do it with a cross join and with tidyr::crossing(), but I didn't get the expected results.
Any hints on this topic?
BR
/Edgar

Unfortunately, the proposed solution:
df %>% group_by(region, market) %>% mutate(id = cur_group_id())
does not work, as I get the following result:
combination_id %>% arrange(region)
# A tibble: 373 x 4
# Groups: region, market [182]
region market unit_key id
<dbl> <dbl> <dbl> <int>
1 2 98 320 1
2 2 98 321 1
3 4 184 287 3
4 4 4 7 2
5 4 4 287 2
6 66 521 899 4
In this case, for region 4 we should have the following combinations:
id=2 where market is 184
id=3 where market is 4
id=4 where market is 4 and 184
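The intended rule is not fully specified above, so the following is only a sketch under one possible reading: each unit_key is described by the set of markets it appears in within its region, and every distinct (region, market set) pair gets its own id. The column names market_set and combination_id, and the "+" separator, are assumptions made for illustration. The ids it produces are arbitrary labels and will not necessarily match the numbering listed above.

```r
# Toy copy of the data shown in the question.
data <- data.frame(
  region   = c(2, 2, 4, 4, 4, 66, 66, 66, 66, 66),
  market   = c(98, 98, 184, 4, 4, 521, 521, 3012, 521, 3011),
  unit_key = c(320, 321, 287, 7, 287, 899, 900, 899, 916, 900)
)

# Collapse the markets of each (region, unit_key) into one sorted label,
# e.g. unit_key 287 in region 4 becomes "4+184".
sets <- aggregate(
  market ~ region + unit_key, data = data,
  FUN = function(m) paste(sort(unique(m)), collapse = "+")
)
names(sets)[names(sets) == "market"] <- "market_set"

# One id per distinct (region, market set), numbered by first appearance.
key <- paste(sets$region, sets$market_set, sep = ":")
sets$combination_id <- match(key, unique(key))

# Attach the id back to the original rows.
result <- merge(data, sets, by = c("region", "unit_key"))
```

With this reading, unit_keys 320 and 321 share an id (both only in market 98 of region 2), while unit_keys 7 and 287 in region 4 get different ids. The rows for one combination can then be retrieved with result[result$combination_id == some_id, ].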

Related

How to get rid of varying zeros in a column in R?

I have a df1:
Story Score
1 00678
2 0980
3 1120
4 00067
5 0091
6 123
7 234
8 0234
9 00412
and I would like to strip all leading zeros to get a df2:
Story Score
1 678
2 980
3 1120
4 67
5 91
6 123
7 234
8 234
9 412
Assuming the Score column is text, you could use sub here:
df$Score <- sub("^0+", "", df$Score)
If you intend for Score to be treated and used as numbers, you also might be able to just cast it to numeric:
df$Score <- as.numeric(df$Score)
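As a self-contained illustration of the sub() approach, here is a small sketch that rebuilds the question's data with Score stored as text. The helper strip_zeros is a hypothetical name introduced here; it covers the edge case where a value consists only of zeros, which the plain pattern would turn into an empty string.

```r
# Toy copy of the question's data; Score is kept as text on purpose.
df1 <- data.frame(
  Story = 1:9,
  Score = c("00678", "0980", "1120", "00067", "0091",
            "123", "234", "0234", "00412"),
  stringsAsFactors = FALSE
)

# "^0+" matches one or more zeros anchored at the start of the string.
df2 <- df1
df2$Score <- sub("^0+", "", df1$Score)

# Hypothetical helper: the lookahead (?=.) requires at least one character
# to remain, so "000" becomes "0" instead of "".
strip_zeros <- function(x) sub("^0+(?=.)", "", x, perl = TRUE)
```

Interior and trailing zeros are untouched, because the pattern is anchored with ^ and sub() replaces only the first match.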

How to get the IDs of a cluster of nodes in a network using igraph in R?

I have an edge list that is in the following format:
# A tibble: 162,157 x 4
id source target weight
<int> <int> <int> <int>
1 1 2 166 3777
2 2 2 204 17527
3 3 2 279 999
4 4 2 373 6826
5 5 2 552 1313
6 6 2 664 680
7 7 2 670 7624
8 8 2 791 167
9 9 2 1015 99
10 10 2 1182 18716
# … with 162,147 more rows
I have created a graph from this data using igraph::graph_from_data_frame(df, directed=TRUE) and have plotted the results, which can be seen in the following image.
The plot was generated with the following code snippet.
ggraph(g) +
geom_node_point(size=0.6) +
theme_graph()
What I would like to do is figure out which nodes are in the 6 tiny clusters surrounded by whitespace. I realize that I could assign labels, but in this instance that would be impossible to read. Is there a more mathematical or programmatic approach to identifying what those nodes are using igraph?
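One programmatic option, sketched here on made-up toy edges rather than the real data, is to ask igraph for the connected components and list the members of every component other than the largest. Two assumptions to flag: the isolated clusters in the plot are taken to be separate weakly connected components, and the edge list is reduced to source and target first, since graph_from_data_frame() reads the first two columns as the edge endpoints (in the tibble above the first column is id).

```r
library(igraph)

# Toy edge list (invented for illustration): one larger group and two small ones.
edges <- data.frame(
  source = c(1, 2, 3, 10, 20),
  target = c(2, 3, 1, 11, 21),
  weight = c(5, 3, 2, 7, 1)
)

# The first two columns are used as edge endpoints, extra columns become edge attributes.
g <- graph_from_data_frame(edges, directed = TRUE)

# Weakly connected components ignore edge direction.
comp <- components(g, mode = "weak")

# Ids of all components smaller than the largest one.
small_ids <- which(comp$csize < max(comp$csize))

# Vertex names grouped by component, restricted to the small components.
small_nodes <- split(V(g)$name, comp$membership)[as.character(small_ids)]
```

On real data, a size threshold such as comp$csize <= 20 may be a more useful filter than "smaller than the largest"; the right cutoff depends on the graph.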

How to create a data frame with all ordinal variables as columns and with frequencies of specific event

I have a data frame of ordinal variables holding survey answers. I want to turn each factor level into its own column, containing the frequency of that level for a specific event (q28).
I have tried lapply and dplyr to get the frequencies, without success:
as.data.frame(apply(mtfinal, 2, table))
and
mtfinalf<-mtfinal %>%
group_by(q28) %>%
summarise(freq=n())
Expected results, in the form of a data.frame (a frequency table with respect to q28's levels):
q28 sex1 sex2 race1 race2 race3 race4 race5 race6 race7 age1 age2
2 0
3 0
4 23
5 21
Actual Results
$age
1 2 3 4 5 6 7
6 2 184 520 507 393 170
$sex
1 2
1239 543
$grade
1 2 3 4
561 519 425 277
$race7
1 2 3 4 5 6
179 21 27 140 17 1307
7
91
$q8
1 2 3 4 5
127 259 356 501 539
$q9
1 2 3 4 5
993 224 279 86 200
$q28
2 3 4 5
1034 533 94 121
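The code below gives you the count for each unique combination of q28, age, race and sex. What you are asking for is impossible, since there would be overlaps between the levels of sex, race and age.
library(dplyr)
mtfinalf <- mtfinal %>%
  group_by(q28, age, race7, sex) %>%  # the race column is named race7 in the output above
  tally()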
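If the expected table is read as a set of side-by-side cross-tabulations (q28 against sex, q28 against age, and so on), it can be assembled in base R. This is only a sketch under that reading, using simulated data: the real mtfinal is not available, so the toy columns and level counts below are assumptions. Note that each block of columns counts the same respondents again, which is the overlap mentioned above.

```r
# Simulated stand-in for mtfinal (the values are random, not the survey data).
n <- 50
mtfinal <- data.frame(
  q28 = factor(sample(2:5, n, replace = TRUE), levels = 2:5),
  sex = factor(sample(1:2, n, replace = TRUE), levels = 1:2),
  age = factor(sample(1:3, n, replace = TRUE), levels = 1:3)
)

vars <- c("sex", "age")

# One q28-by-variable count table per variable, with columns like sex1, sex2.
tabs <- lapply(vars, function(v) {
  tab <- as.data.frame.matrix(table(mtfinal$q28, mtfinal[[v]]))
  names(tab) <- paste0(v, names(tab))
  tab
})

# Bind the blocks side by side and add the q28 level as the first column.
wide <- cbind(q28 = levels(mtfinal$q28), do.call(cbind, tabs))
```

Each row of wide is one level of q28, and within each block (sex1 and sex2, or age1 to age3) the counts in a row add up to the number of respondents with that q28 answer.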

Applying dplyr filter operation to specific rows without losing other data

Apologies if this has been asked before. I couldn't find any satisfactory answers, although it sounds like it should be a rather straightforward operation.
I have my data
transition_frame name state_number lifetime
<int> <chr> <dbl> <dbl>
1 38 //Traces_exp1_tif_pair10 1 NA
2 44 //Traces_exp1_tif_pair10 2 6
3 352 //Traces_exp1_tif_pair10 3 308
4 362 //Traces_exp1_tif_pair10 4 10
5 379 //Traces_exp1_tif_pair10 5 17
6 388 //Traces_exp1_tif_pair10 6 9
It was easy enough to calculate the row-wise differences between transition frames, but since there's no "transition" between state 0 and state 1, the first lifetime comes out as NA.
How can I set lifetime in the first row only to transition_frame - 1 (hint: it's 37), without touching any other data?
Imagine,
group_by(name) %>%
filter(state_number == 1) %>%
mutate(lifetime = transition_frame - 1) %>%
unfilter() # To retrieve dropped data
This would return the whole data set with the first row recomputed, NOT only the first row:
transition_frame name state_number lifetime
<int> <chr> <dbl> <dbl>
1 38 //Traces_exp1_tif_pair10 1 37
2 44 //Traces_exp1_tif_pair10 2 6
3 352 //Traces_exp1_tif_pair10 3 308
4 362 //Traces_exp1_tif_pair10 4 10
5 379 //Traces_exp1_tif_pair10 5 17
6 388 //Traces_exp1_tif_pair10 6 9
Does the following work for you?
library(dplyr)
df <- data.frame(transition_frame = c(38, 44, 352, 362, 379, 388),
                 name = rep("//Traces_exp1_tif_pair10", 6),
                 state_number = seq(1, 6))
# Prepend 1 as the frame of "state 0", so the first difference is 38 - 1 = 37
df %>% mutate(lifetime = diff(c(1, transition_frame)))
transition_frame name state_number lifetime
1 38 //Traces_exp1_tif_pair10 1 37
2 44 //Traces_exp1_tif_pair10 2 6
3 352 //Traces_exp1_tif_pair10 3 308
4 362 //Traces_exp1_tif_pair10 4 10
5 379 //Traces_exp1_tif_pair10 5 17
6 388 //Traces_exp1_tif_pair10 6 9
Replace 1 in diff() with other values if you want the transition frame in state 0 to take on different values.
An approach similar to the code below might help:
df <- data.frame(transition_frame=c(38,44,352),
name=c('//Traces_exp1_tif_pair10','//Traces_exp1_tif_pair10','//Traces_exp1_tif_pair10'),
state_number=c(1,2,3),
lifetime=c(NA,6,308))
df[df$state_number==1 & is.na(df$lifetime),"lifetime"] <-
df[df$state_number==1 & is.na(df$lifetime),"transition_frame"] - 1
df
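The "filter, mutate, then unfilter" idea from the question can also be expressed as a conditional update, which never drops rows in the first place. Below is a minimal base R sketch using ifelse(); the same ifelse() expression could be placed inside dplyr's mutate(). The data is the three-row toy frame from the answer above, and the name out is introduced here for illustration.

```r
# Three-row toy frame, as in the answer above.
df <- data.frame(
  transition_frame = c(38, 44, 352),
  name = rep("//Traces_exp1_tif_pair10", 3),
  state_number = c(1, 2, 3),
  lifetime = c(NA, 6, 308)
)

# Recompute lifetime only where state_number is 1; keep every other value.
out <- transform(
  df,
  lifetime = ifelse(state_number == 1, transition_frame - 1, lifetime)
)
```

Because ifelse() is vectorised, rows that fail the condition simply keep their existing lifetime, so there is nothing to restore afterwards.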

R: Modifying Subsets of Dataframe using Calculations on that Subset

I am going to ask my question through example, because I don't know what the best way to phrase it in general is. Using the ChickWeight dataset built into R:
> head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
> tail(ChickWeight)
weight Time Chick Diet
573 155 12 50 4
574 175 14 50 4
575 205 16 50 4
576 234 18 50 4
577 264 20 50 4
578 264 21 50 4
I can use ddply to calculate mean for each unique Diet, for example
> ddply(ChickWeight, .(Diet), summarise, mean_weight=mean(weight, na.rm=TRUE))
Diet mean_weight
1 1 102.6455
2 2 122.6167
3 3 142.9500
4 4 135.2627
What do I do if I want to easily create a data frame that modifies the 'weight' column in ChickWeight by dividing it by the mean_weight of its corresponding diet?
A solution with data.table that's short, fast and readable:
library(data.table)
cw <- data.table(ChickWeight)
# := adds the column by reference; by=Diet computes the mean within each diet
cw[, pct_mw_diet := weight / mean(weight, na.rm=TRUE), by=Diet]
Now you have a column with each weight as a fraction of the mean weight of its diet (multiply by 100 for a percentage).
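For comparison, the same group-wise division can be sketched in base R with ave(), which returns the group mean aligned to each row. The column name rel_weight is an assumption chosen for this example.

```r
# ChickWeight ships with R; convert to a plain data frame for simplicity.
cw <- as.data.frame(ChickWeight)

# ave() defaults to FUN = mean and returns one group mean per row,
# so the division is element-wise.
cw$rel_weight <- cw$weight / ave(cw$weight, cw$Diet)
```

A quick sanity check is that the mean of rel_weight within each diet equals 1, for example via tapply(cw$rel_weight, cw$Diet, mean).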
