R - DataFrames and operations with rows

Suppose I have the following data frame.
table<-data.frame(group=c(0,5,10,15,20,25,30,35,40,0,5,10,15,20,25,30,35,40,0,5,10,15,20,25,30,35,40),plan=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),price=c(1,4,5,6,8,9,12,12,12,3,5,6,7,10,12,20,20,20,5,6,8,12,15,20,22,28,28))
group plan price
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
8 35 1 12
9 40 1 12
10 0 2 3
11 5 2 5
12 10 2 6
13 15 2 7
14 20 2 10
15 25 2 12
16 30 2 20
17 35 2 20
18 40 2 20
19 0 3 5
20 5 3 6
21 10 3 8
22 15 3 12
23 20 3 15
24 25 3 20
25 30 3 22
26 35 3 28
27 40 3 28
So, for each "plan", I want to combine the rows with "group" greater than 20 two at a time (each record averaged with the next one), and when the largest price is repeated at the end, keep it only once, without duplicates.
The example below shows what the result should look like.
data.frame(group=c(0,5,10,15,20,30,0,5,10,15,20,30,0,5,10,15,20,30,40),plan=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3),price=c(1,4,5,6,8.5,12,3,5,6,7,11,20,5,6,8,12,17.5,25,28))
group plan price
1 0 1 1.0
2 5 1 4.0
3 10 1 5.0
4 15 1 6.0
5 20 1 8.5
6 30 1 12.0
7 0 2 3.0
8 5 2 5.0
9 10 2 6.0
10 15 2 7.0
11 20 2 11.0
12 30 2 20.0
13 0 3 5.0
14 5 3 6.0
15 10 3 8.0
16 15 3 12.0
17 20 3 17.5
18 30 3 25.0
19 40 3 28.0
Thanks!

You could try this using the dplyr package:
library(dplyr)
table %>%
  group_by(plan) %>%
  mutate(group=ifelse(group<20,group,10*floor(group/10))) %>%
  group_by(plan,group) %>%
  summarise(price=mean(price)) %>%
  ## Drop the last row of each plan when its price equals the previous average price
  group_by(plan) %>%
  filter(!(row_number()==n() & price==lag(price)))
This returns:
plan group price
<dbl> <dbl> <dbl>
1 1 0 1.0
2 1 5 4.0
3 1 10 5.0
4 1 15 6.0
5 1 20 8.5
6 1 30 12.0
7 2 0 3.0
8 2 5 5.0
9 2 10 6.0
10 2 15 7.0
11 2 20 11.0
12 2 30 20.0
13 3 0 5.0
14 3 5 6.0
15 3 10 8.0
16 3 15 12.0
17 3 20 17.5
18 3 30 25.0
19 3 40 28.0
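For readers without dplyr, the same three steps (bin the groups, average within each bin, drop a repeated final price) can be sketched in base R. This is only an illustration of the approach above, not the answerer's code; the names `dat`, `avg`, `is_last` and `same_as_prev` are made up for this sketch.

```r
# Same data as in the question, named `dat` to avoid masking base::table()
dat <- data.frame(
  group = rep(seq(0, 40, by = 5), times = 3),
  plan  = rep(1:3, each = 9),
  price = c(1, 4, 5, 6, 8, 9, 12, 12, 12,
            3, 5, 6, 7, 10, 12, 20, 20, 20,
            5, 6, 8, 12, 15, 20, 22, 28, 28)
)

# Groups below 20 stay as they are; 20/25 -> 20, 30/35 -> 30, 40 -> 40
dat$group <- ifelse(dat$group < 20, dat$group, 10 * floor(dat$group / 10))

# Average the price within each (plan, group) bin and sort explicitly
avg <- aggregate(price ~ group + plan, data = dat, FUN = mean)
avg <- avg[order(avg$plan, avg$group), ]

# Flag the last row of each plan, and rows whose price repeats the previous
# row of the same plan
n <- nrow(avg)
is_last <- !duplicated(avg$plan, fromLast = TRUE)
same_as_prev <- c(FALSE, avg$price[-1] == avg$price[-n] &
                         avg$plan[-1] == avg$plan[-n])

# Drop a plan's last row only when it repeats the previous average price
result <- avg[!(is_last & same_as_prev), c("group", "plan", "price")]
result
```

With the question's data this should leave 19 rows, keeping group 40 only for plan 3, where the final price differs from the previous average.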

How about:
dat<-data.frame(group=c(0,5,10,15,20,25,30,35,40,0,5,10,15,20,25,30,35,40,0,5,10,15,20,25,30,35,40),plan=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),price=c(1,4,5,6,8,9,12,12,12,3,5,6,7,10,12,20,20,20,5,6,8,12,15,20,22,28,28))
s <- split(dat, ifelse(dat$group>20, ">20", "<=20"))
s20 <- s[[">20"]] # easier to read
tens <- which(s20$group %% 10 == 0)
tens
# [1] 2 4 6 8 10 12
subgroup <- rep(1:length(tens), each = nrow(s20)/length(tens)) # can handle different freqs
subgroup
# [1] 1 1 2 2 3 3 4 4 5 5 6 6
ToAddBack <- s20[tens,]
ToAddBack[,"price"] <- aggregate(s20$price, by = list(subgroup), mean)[2]
newdat <- rbind(s[["<=20"]], ToAddBack)
finaldat <- newdat[order(newdat$plan, newdat$group),]
Note that finaldat is a little different from your example, as I think you left out some rows by accident:
finaldat
group plan price
1 0 1 1.0
2 5 1 4.0
3 10 1 5.0
4 15 1 6.0
5 20 1 8.0
7 30 1 10.5
9 40 1 12.0
10 0 2 3.0
11 5 2 5.0
12 10 2 6.0
13 15 2 7.0
14 20 2 10.0
16 30 2 16.0
18 40 2 20.0
19 0 3 5.0
20 5 3 6.0
21 10 3 8.0
22 15 3 12.0
23 20 3 15.0
25 30 3 21.0
27 40 3 28.0

Related

Create edgelist that contains mutual dyads

I have an edgelist where I want to keep dyads that mutually selected each other (e.g., 1 -> 4 and 4 -> 1). However, in the final edgelist I only want to keep one row instead of both rows of the mutual dyads (e.g., only row 1 -> 4 not both rows 1 -> 4 and 4 -> 1). How do I achieve that?
Here is the dataset:
library(igraph)
g <- sample_gnm(10, 50, directed=TRUE) # keep the graph object so it can be passed to which_mutual() later
ff <- as_data_frame(g)
ff
from to
1 1 10
2 1 3
3 1 4
4 1 5
5 1 6
6 1 7
7 1 8
8 2 1
9 2 3
10 2 8
11 2 9
12 3 1
13 3 2
14 3 10
15 3 4
16 3 5
17 3 6
18 3 8
19 3 9
20 4 3
21 4 10
22 5 1
23 5 2
24 5 3
25 5 4
26 6 2
27 6 3
28 6 4
29 6 5
30 7 3
31 7 5
32 7 6
33 7 10
34 7 8
35 8 1
36 8 2
37 8 4
38 8 5
39 8 10
40 9 1
41 9 2
42 9 3
43 9 4
44 9 5
45 9 7
46 10 1
47 10 3
48 10 4
49 10 8
50 10 9
cd <- which_mutual(g) #I know I can use `which_mutual` to identify the mutual dyads
ff[which(cd==1),] #but in the end this keeps both rows of the mutual dyads (e.g., 1 -> 4 and 4 -> 1)
from to
4 1 4
6 1 6
7 1 7
9 2 10
10 2 3
14 3 2
18 3 6
21 4 1
25 5 10
28 6 1
30 6 3
32 6 10
33 6 7
34 7 1
37 7 6
39 7 8
42 8 7
45 9 10
46 10 2
47 10 5
48 10 6
50 10 9
We may use duplicated to create a logical vector after sorting the elements of each row, so that 1 -> 4 and 4 -> 1 produce the same key:
ff1 <- ff[which(cd==1),]
subset(ff1, !duplicated(cbind(pmin(from, to), pmax(from, to))))
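To see why this keeps one row per mutual dyad, here is a minimal sketch on a small hand-made edgelist (the values in `ff1` below are made up for illustration and are not the random graph from the question). `pmin` and `pmax` put the smaller node first in every row, so both directions of a dyad map to the same key, and `duplicated` then flags the second occurrence.

```r
# Hypothetical edgelist that already contains only mutual dyads
ff1 <- data.frame(from = c(1, 4, 2, 3, 6, 1),
                  to   = c(4, 1, 3, 2, 1, 6))

# Order-independent key: (smaller node, larger node) for every row
key <- cbind(pmin(ff1$from, ff1$to), pmax(ff1$from, ff1$to))

# duplicated() on a matrix compares whole rows; keep the first of each pair
one_per_dyad <- ff1[!duplicated(key), ]
one_per_dyad
```

The row that survives is whichever direction appears first in the edgelist, so sort the edgelist beforehand if a particular direction should be kept.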

double for loops in R for histograms

I have the following table and I am trying to run a double for loop in R to get a histogram of the distribution of responses for every month of the survey (I will then fit a distribution to it). I am currently running the following code, but cannot seem to get anywhere. Any suggestions?
for (i in 2008:2021) {
  for (j in 1:12) {
    dfn <- df(df$Year=i, df$Month=j)
    hist(dfn)
  }
}
Month Year -3 0 2 4 5.5 8 12.5 15
1 2008 3 2 28 41 17 3 5 1
2 2008 5 3 26 40 15 4 6 1
3 2008 6 4 27 39 13 4 6 1
4 2008 9 4 18 28 28 5 7 1
5 2008 6 5 15 29 29 6 9 1
6 2008 8 3 17 28 26 6 10 2
7 2008 9 5 16 28 28 4 9 1
8 2008 5 5 19 29 26 5 9 2
9 2008 7 5 22 39 15 4 7 1
10 2008 8 6 20 40 15 4 7 0
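One possible way to approach this, as a rough sketch only: rows are selected with `==` inside `df[ , ]` (not `df(...)` with `=`), and because each row already holds the share of responses per value, `barplot` is used here instead of `hist`, which expects raw observations. It is an assumption that the numeric column headers are the response values and that the cells are percentages. The three-row `df` and the names `value_cols` and `plotted` are made up for illustration.

```r
# Hypothetical reconstruction of the first rows of the survey table:
# one row per month, one column per response value
df <- read.table(header = TRUE, check.names = FALSE, text = "
Month Year -3 0 2 4 5.5 8 12.5 15
1 2008 3 2 28 41 17 3 5 1
2 2008 5 3 26 40 15 4 6 1
3 2008 6 4 27 39 13 4 6 1
")
value_cols <- setdiff(names(df), c("Month", "Year"))

pdf(NULL)  # discard the plots in this sketch; remove this line to see them
plotted <- character(0)
for (i in unique(df$Year)) {
  # loop only over months that exist for this year, so no subset is empty
  for (j in unique(df$Month[df$Year == i])) {
    dfn <- df[df$Year == i & df$Month == j, value_cols]
    barplot(unlist(dfn), main = paste(i, j, sep = "-"),
            xlab = "Response", ylab = "Percent of responses")
    plotted <- c(plotted, paste(i, j, sep = "-"))
  }
}
invisible(dev.off())
plotted
```

Looping over the year/month combinations that actually occur in the data, rather than over 2008:2021 and 1:12, avoids trying to plot months that are missing from the table.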

Replacing a restarting sequence in a dataframe with the group number of the sequence

I have a sequence in df$V1 that starts at some number and increases. At some point, it drops, indicating that observations for a new group have started. I want to replace V1 (or create a new column) with the group number. What are some ways to do this? I've tried various dplyr tricks to no avail, and searched here and elsewhere and have not found a similar problem. Wondering if there's a slick dplyr way to do this. Thank you for any insights.
The data frame has about 350 rows. Here is a subset:
> df
V1 V2 V3 V4 V5 V6
1 1 5 9 1 2 14
2 2 5 10 1 3 9
3 3 5 11 1 4 4
4 4 5 15 1 5 7
5 5 5 18 1 6 14
6 6 5 22 1 7 6
27 1 5 9 1 2 14
28 21 9 10 2 3 4
29 22 9 11 2 4 6
30 23 9 15 2 5 1
31 24 9 18 2 6 7
32 25 9 22 2 7 14
33 26 9 24 2 8 6
34 27 9 25 2 9 7
35 28 9 26 2 10 6
And I want it to look like this (or with group as an added column in the new.df):
> new.df
group V2 V3 V4 V5 V6
1 1 5 9 1 2 14
2 1 5 10 1 3 9
3 1 5 11 1 4 4
4 1 5 15 1 5 7
5 1 5 18 1 6 14
6 1 5 22 1 7 6
27 2 5 9 1 2 14
28 2 9 10 2 3 4
29 2 9 11 2 4 6
30 2 9 15 2 5 1
31 2 9 18 2 6 7
32 2 9 22 2 7 14
33 2 9 24 2 8 6
34 2 9 25 2 9 7
35 2 9 26 2 10 6
Here's the initial data frame to load into your R session:
df <- read.table(header=TRUE, text="
V1 V2 V3 V4 V5 V6
1 5 9 1 2 14
2 5 10 1 3 9
3 5 11 1 4 4
4 5 15 1 5 7
5 5 18 1 6 14
6 5 22 1 7 6
1 5 9 1 2 14
21 9 10 2 3 4
22 9 11 2 4 6
23 9 15 2 5 1
24 9 18 2 6 7
25 9 22 2 7 14
26 9 24 2 8 6
27 9 25 2 9 7
28 9 26 2 10 6
")

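One base R idea that may work here, sketched under the assumption that a new group starts exactly where V1 is smaller than the value in the previous row: `diff` finds the drops and `cumsum` turns them into a running group number. Only the V1 column from the question is reproduced below to keep the sketch short.

```r
# V1 from the question's subset; the other columns are not needed here
df <- data.frame(V1 = c(1, 2, 3, 4, 5, 6, 1, 21, 22, 23, 24, 25, 26, 27, 28))

# TRUE for the first row and for every row where V1 drops below the
# previous value; the cumulative sum of those flags is the group number
df$group <- cumsum(c(TRUE, diff(df$V1) < 0))
df
```

On this subset the first six rows get group 1 and the remaining nine rows get group 2, matching the desired output. Inside a dplyr pipeline the same expression should work as `mutate(group = cumsum(c(TRUE, diff(V1) < 0)))`.
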
Data Frame Filter Values

Suppose I have the following data frame.
table<-data.frame(group=c(0,5,10,15,20,25,30,35,40,0,5,10,15,20,25,30,35,40,0,5,10,15,20,25,30,35,40),plan=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3),price=c(1,4,5,6,8,9,12,12,12,3,5,6,7,10,12,20,20,20,5,6,8,12,15,20,22,28,28))
group plan price
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
8 35 1 12
9 40 1 12
10 0 2 3
11 5 2 5
12 10 2 6
13 15 2 7
14 20 2 10
15 25 2 12
16 30 2 20
17 35 2 20
18 40 2 20
How can I get the values from the table up to the maximum price, without duplicates?
So the result would be:
group plan price
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
10 0 2 3
11 5 2 5
12 10 2 6
13 15 2 7
14 20 2 10
15 25 2 12
16 30 2 20
You can use slice in dplyr:
library(dplyr)
table %>%
group_by(plan) %>%
slice(1:which.max(price == max(price)))
which.max gives the index of the first occurrence of price == max(price). Using that, I can slice the data.frame to only keep rows for each plan up to the maximum price.
Result:
# A tibble: 22 x 3
# Groups: plan [3]
group plan price
<dbl> <dbl> <dbl>
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
8 0 2 3
9 5 2 5
10 10 2 6
# ... with 12 more rows
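The same "keep rows up to the first maximum" idea can also be sketched in base R with `ave`, in case dplyr is not available. This is an illustrative alternative, not part of the answer above; `dat` and `keep` are names chosen for the sketch. Note that `which.max(p)` already returns the position of the first maximum, which is equivalent to `which.max(p == max(p))`.

```r
dat <- data.frame(
  group = rep(seq(0, 40, by = 5), times = 3),
  plan  = rep(1:3, each = 9),
  price = c(1, 4, 5, 6, 8, 9, 12, 12, 12,
            3, 5, 6, 7, 10, 12, 20, 20, 20,
            5, 6, 8, 12, 15, 20, 22, 28, 28)
)

# Within each plan, mark rows up to and including the first maximum price.
# ave() returns numeric 0/1 here, hence the comparison with 1.
keep <- ave(dat$price, dat$plan,
            FUN = function(p) seq_along(p) <= which.max(p)) == 1

result <- dat[keep, ]
result
```

This should give the same 22 rows as the `slice` version, with the original row names preserved.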

Get the X value of multivariable box plot whiskers in R

I am trying to get the values of the whiskers in a boxplot.
Sample of my data is:
Company.ID ACTIVE Websource Company.Name Country Sector Ownership Activity.Status Update.Date MIN_MAX_REVENUE 16 Construction Private Number.of.Employees NOE splittedN splittedco splitted RR Range SECTORNUM
I want to find the whiskers when I box-plotted Number.of.Employees and Sector
boxplot(Data$Range ~ Data$Sector, ylab= "range", xlab= "Sector", las=2)
I got the outliers:
boxplot(Data$Range ~ Data$Sector, ylab= "range", xlab= "Sector", las=2)$out
[1] 18 16 12 35 15 65 45 25 50 40 30 32 30 50 45 65 80 35 35 40 90 25 60 30 40 25
[27] 50 25 40 65 25 35 60 27 130 30 100 25 30 40 30 35 25 23 150 60 29 23 30 56 30 25
[53] 22 23 40 80 30 32 22 30 28 7 25 8 10 7 8 11 30 10 10 32 10 10 40 20 8 2
[79] 3 4 2 15 10 3 4 2 2 6 2 4 2 3 3 2 2 2 2 2 13 2 3 5 3 5
[105] 3 2 4 7 2 6 2 2 2 5 3 3 2 2 2 3 4 9 4 15 2 2 2 10 2 2
[131] 4 19 2 9 2 6 2 2 2 4 4 2 15 2 2 4 2 2 2 27 4 2 3 2 2 2
[157] 3 12 7 2 11 2 3 2 2 3 2 2 8 14 5 3 4 170 3 2 4 3 5 3 2 2
[183] 5 2 2 3 2 6 2 2 2 2 2 3 3 2 17 4 2 2 2 3 4 3 4 2 7 2
[209] 4 2 5 2 2 10 3 30 12 23 15 14 30 200 12 45 16 20 16 12 12 19 12 60 18 18
[235] 30 15 12 20 12 30 21 25 40 22 30 70 32 50 40 32 47 50 30 21 16 20 25 18 12 14
[261] 30 10 14 15 30 11 8 10 15 8 18 7 20 13 15 17 25 10 17 8 20 17 45 7 15 7
[287] 17 9 8 8 8 20 10 20 10 19 10 20 10 9 16 7 16 20 15 8 15 10 12 10 9 10
[313] 7 10 10 12 9 22 10 8 10 9 14 8 7 10 10 15 20 8 15 15 14 8 50 20 50 10
[339] 10 10 50 3 18 4 15 5 2 4 11 7 16 15 2 2 2 2 2 2 3 2 2 2 6 7
[365] 2 8 2 3 2 2 2 2 2 7 2 2 2 4 5 2 5 3 2 3 4 2 2 44 2 2
[391] 8 3 2 10 10 7 10 10 11 20 18 11 3 20 5 2 5 2 2 6 30 6 2 2 43 13
[417] 30 10 10 35 16 16 11 10 15 10 9 8 16 7 21 5 50 30 4 4 14 15 2 2 5 8
[443] 5 40 2 2 2 2 2 2 25 2 4 3 2 6 2 10 5 4 5 2 2 3 3 4 2 2
[469] 14 8 5 2 7 2 2 3 42 20 10 10 15 13 11 40 10 15 30 20 2 8 3 8 3 4
[495] 2 4 2 3 2 4 4 2 3 35 5 2 3 8 2 8 2 3 40 35 2 2 2 2 7 2
[521] 3 3 2 30 15 4 60 2 28 4 2 2 5 10 2 2 3 4 18 2 6 2 4 4 2 2
[547] 30 9 2 3 12 5 2 2 5 3 4 2 11 2 2 2 8 2 2 3 6 3 7 2 2 2
[573] 2 40 14 2 2 3 2 3 3 18 14 9 10 25 12 19 35 10 10 15 25 15 17 20 35 10
I need the full info about these outliers (company.Name....)
You first need the interquartile range:
IQR = 75% quartile - 25% quartile
Then you find the
upper whisker at min(max(x), 75% quartile + 1.5*IQR)
lower whisker at max(min(x), 25% quartile - 1.5*IQR)
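In practice the whisker values can be read off the object that `boxplot` returns, and the full rows behind the outliers can be recovered with `boxplot.stats`. The sketch below uses a small made-up `Data` frame (the real data is not available), so the column values are assumptions for illustration only. Note that `boxplot` draws each whisker at the most extreme data point that lies within 1.5*IQR of the box, so the whisker is always an observed value.

```r
# Hypothetical stand-in for the real data
Data <- data.frame(
  Company.Name = paste0("C", 1:12),
  Sector = rep(c("A", "B"), each = 6),
  Range  = c(1, 2, 3, 4, 5, 50, 10, 11, 12, 13, 14, 100)
)

# plot = FALSE returns the statistics without drawing anything
bp <- boxplot(Range ~ Sector, data = Data, plot = FALSE)

# Rows 1 and 5 of $stats are the lower and upper whisker of each box
whiskers <- data.frame(Sector = bp$names,
                       lower  = bp$stats[1, ],
                       upper  = bp$stats[5, ])

# Flag, within each sector, the values that boxplot.stats() reports as outliers
is_out <- ave(Data$Range, Data$Sector,
              FUN = function(x) x %in% boxplot.stats(x)$out) == 1

# Full records (company name, sector, ...) of the outliers
outlier_rows <- Data[is_out, ]
whiskers
outlier_rows
```

Because `is_out` is aligned with the rows of `Data`, every other column of the original data frame comes along when the outlier rows are selected.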
