merge data tables in R - r

My apologies for this simple question. Basically, I want to make three separate cumsum() tables and merge them together by the first table. For example:
a <- cumsum(table(df$variable))
b <- cumsum(table(df$variable[c(TRUE, FALSE)]))
c <- cumsum(table(df$variable[c(FALSE, TRUE)]))
Where a is the cumsum of the entire vector of df$variable, b is the cumsum of the odd-numbered values of df$variable, c is the cumsum of the even-numbered values of df$variable. Another way of interpreting this is that combining b and c produces a.
This is the entire vector of numbers.
[1] 18 17 15 10 5 0 10 10 0 10 15 5 5 5 25 15 13 0 0 0 25 18 15 15 1 4 5
[28] 5 5 15 5 12 15 0 3 12 20 0 5 5 13 10 10 10 3 15 13 20 12 60 10 10 2 0
[55] 5 10 8 4 0 15 5 5 15 5 0 5 2 8 5 5 5 5 9 9 3 7 20 25 5 4 10
[82] 10 2 4 5 5 18 8 0 10 5 5 7 12 5 13 26 20 13 21 5 15 10 10 5 15 5 15
[109] 0 1 13 21 25 25 5 14 5 15 10 0 5 15 3 4 5 15 15 5 25 25 5 15 0 2 13
[136] 22 2 10 3 3 15 11 0 2 40 35 24 24 5 5 10 5 16 0 17 19 20 5 5 5 0 15
[163] 3 13 20 4 5 5 3 19 25 25 0 15 5 3 22 22 25 5 15 15 5 15 17 9 5 5 15
[190] 10
For a, I used cbind(cumsum(table(df$variable)))
0 18
1 20
2 26
3 35
4 41
5 88
7 90
8 93
9 96
10 115
11 116
12 120
13 128
14 129
15 154
16 155
17 158
18 161
19 163
20 169
21 171
22 174
24 176
25 186
26 187
35 188
40 189
60 190
For b, I used cbind(cumsum(table(df$variable[c(TRUE, FALSE)])))
0 10
1 11
2 15
3 22
5 50
7 51
8 52
9 53
10 60
12 61
13 67
15 76
16 77
17 79
18 81
20 85
22 86
24 87
25 93
26 94
40 95
For c, I used cbind(cumsum(table(df$variable[c(FALSE, TRUE)])))
0 8
1 9
2 11
3 13
4 19
5 38
7 39
8 41
9 43
10 55
11 56
12 59
13 61
14 62
15 78
17 79
18 80
19 82
20 84
21 86
22 88
24 89
25 93
35 94
60 95
In frequency form, the distributions should look something like this.
a b c
0 18 10 8
1 2 1 1
2 6 4 2
3 9 7 2
4 6 0 6
5 47 28 19
7 2 1 1
8 3 1 2
9 3 1 2
10 19 7 12
11 1 0 1
12 4 1 3
13 8 6 2
14 1 0 1
15 25 9 16
16 1 1 0
17 3 2 1
18 3 2 1
19 2 0 2
20 6 4 2
21 2 0 2
22 3 1 2
24 2 1 1
25 10 6 4
26 1 1 0
35 1 0 1
40 1 1 0
60 1 0 1
190 95 95
But I want it in cumsum() form, so that it looks something like this. I wrote out the first 7 rows as an illustration.
a b c
0 18 10 8
1 20 11 9
2 26 15 11
3 35 22 13
4 41 22 19
5 88 50 38
7 90 51 39
The problem I've been having is that the subsets b and c don't contain all the values (i.e. some values have 0 frequency), which shortens those vectors; as a result, I'm unable to properly merge or cbind() these values.
Any suggestion is greatly appreciated.

You could probably get there using match quite easily. Assuming your data is:
set.seed(1)
df <- data.frame(variable=rbinom(10,prob=0.5,size=3))
Something like this seems to work
out <- data.frame(a,b=b[match(names(a),names(b))],c=c[match(names(a),names(c))])
replace(out,is.na(out),0)
# a b c
#0 1 0 1
#1 4 2 2
#2 7 4 3
#3 10 5 5
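One caveat with filling the gaps with 0: for cumulative sums that is only right when the missing value comes before any observed value. In the question's data, b has no 4, so that row should carry forward 22 rather than become 0. A minimal sketch of one way around this, assuming a small made-up vector x in place of df$variable: give every subset the same factor levels before tabulating, so that table() reports zero counts and cumsum() carries the running total forward on its own.

```r
# Illustrative data only; x stands in for df$variable
x <- c(0, 4, 5, 5, 0, 1, 4, 5)

# Shared levels make table() report a count (possibly 0) for every value
lev <- sort(unique(x))
f <- factor(x, levels = lev)

out <- data.frame(
  a = as.vector(cumsum(table(f))),                  # all values
  b = as.vector(cumsum(table(f[c(TRUE, FALSE)]))),  # odd positions
  c = as.vector(cumsum(table(f[c(FALSE, TRUE)]))),  # even positions
  row.names = as.character(lev)
)
out
```

Because all three columns are built from the same levels, they always have the same length and b + c equals a in every row.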

Related

Recode column every nth element in R

I'm looking to recode a column, say the following:
df <- data.frame(col1 = rep(3, 100),
col2 = rep(NA, 100))
I want to recode col2 as 1 for rows 1:5, 2 for rows 6:10, 3 for 11:15, etc. So, every five rows I would add +1 to the assigned value. Any way to automate this process to avoid manually recoding 100 rows?
There are a lot of ways to do that. Here are a couple of them:
Using rep :
df$col2 <- rep(1:nrow(df), each = 5, length.out = nrow(df))
Using ceiling
df$col2 <- ceiling(seq(nrow(df))/5)
dplyr way
df %>% mutate(col2 = ((row_number()-1) %/% 5)+1)
OR
A simple for loop
for (i in 0:((nrow(df) / 5) - 1)) {
  # (row index - 1) %/% 5 gives the zero-based block number of each row
  df[(seq_len(nrow(df)) - 1) %/% 5 == i, 2] <- i + 1
}
> df
col1 col2
1 3 1
2 3 1
3 3 1
4 3 1
5 3 1
6 3 2
7 3 2
8 3 2
9 3 2
10 3 2
11 3 3
12 3 3
13 3 3
14 3 3
15 3 3
16 3 4
17 3 4
18 3 4
19 3 4
20 3 4
21 3 5
22 3 5
23 3 5
24 3 5
25 3 5
26 3 6
27 3 6
28 3 6
29 3 6
30 3 6
31 3 7
32 3 7
33 3 7
34 3 7
35 3 7
36 3 8
37 3 8
38 3 8
39 3 8
40 3 8
41 3 9
42 3 9
43 3 9
44 3 9
45 3 9
46 3 10
47 3 10
48 3 10
49 3 10
50 3 10
51 3 11
52 3 11
53 3 11
54 3 11
55 3 11
56 3 12
57 3 12
58 3 12
59 3 12
60 3 12
61 3 13
62 3 13
63 3 13
64 3 13
65 3 13
66 3 14
67 3 14
68 3 14
69 3 14
70 3 14
71 3 15
72 3 15
73 3 15
74 3 15
75 3 15
76 3 16
77 3 16
78 3 16
79 3 16
80 3 16
81 3 17
82 3 17
83 3 17
84 3 17
85 3 17
86 3 18
87 3 18
88 3 18
89 3 18
90 3 18
91 3 19
92 3 19
93 3 19
94 3 19
95 3 19
96 3 20
97 3 20
98 3 20
99 3 20
100 3 20
As there is a pattern (a new value every 5th row), you can use rep(row_number(), each = 5); length.out = n() trims the result to the length of the column.
I learned this from Ronak's answer to "dplyr: Mutate a new column with sequential repeated integers of n time in a dataframe". Thanks to Ronak!
df %>% mutate(col2 = rep(row_number(), each=5, length.out = n()))
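As a quick sanity check, here is a small sketch comparing the base R variants on a length that is deliberately not a multiple of 5 (n = 12 is an arbitrary choice for illustration). The last block is simply shorter.

```r
n <- 12  # illustrative row count, not a multiple of 5

by_rep     <- rep(seq_len(n), each = 5, length.out = n)
by_ceiling <- ceiling(seq_len(n) / 5)
by_intdiv  <- (seq_len(n) - 1) %/% 5 + 1

# All three give the same block numbers
all(by_rep == by_ceiling)
all(by_ceiling == by_intdiv)
by_rep
```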

I have a weights variable and I need to create cross tabulations for a chord diagram

I have a dataset with over 15,000 observations. I've dropped all variables but three (3).
One is the individual's origin (or), another is the individual's destination (dest), and the third is the weight of that individual (wgt).
Origin and destination are categorical variables.
The weights I have are used as analytic weights in Stata. However, Stata can't handle the number of columns I generate when making tables. R generates them with ease, but I can't figure out how to apply the weights to the generated table.
I tried using wtd.table(), but the following error appears.
wtd.table(NonHSGrad$b206reg, NonHSGrad$c305reg, weights=NonHSGrad$ind_wgts)
Error in proxy[, ..., drop = FALSE] : incorrect number of dimensions
When I use only the table(), this comes out:
table(NonHSGrad$b206reg, NonHSGrad$c305reg)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 285 38 20 8 6 3 1 2 0 1 0 10 38 46 0 2 14
2 32 312 26 3 1 0 2 1 1 0 1 1 22 51 0 0 8
3 17 35 325 12 12 2 3 7 0 2 3 5 52 13 1 1 25
4 3 5 27 224 19 5 2 10 1 1 1 2 51 4 0 3 35
5 4 9 44 81 778 6 7 22 1 4 5 5 155 5 0 5 47
6 4 5 22 21 10 547 24 12 32 21 32 81 86 5 3 15 58
7 5 4 12 17 20 21 558 20 31 99 93 33 59 1 3 67 15
8 8 9 41 49 17 11 24 919 5 8 37 10 151 2 0 52 19
9 0 1 7 9 1 4 26 5 466 66 19 17 17 2 24 24 7
10 1 2 3 4 2 3 27 8 41 528 21 17 13 2 11 36 2
11 3 0 3 10 1 5 11 5 6 17 519 59 7 1 2 49 1
12 0 1 1 2 0 1 5 2 2 10 39 318 10 0 14 17 1
13 15 9 26 34 25 21 12 42 2 5 3 5 187 2 1 6 15
14 14 47 7 5 0 0 0 1 1 0 0 0 9 475 0 0 0
15 0 0 3 1 2 2 4 2 22 9 3 60 9 2 342 2 3
16 0 2 6 10 3 2 11 21 3 33 29 4 34 0 3 404 5
17 1 1 7 15 2 6 1 2 0 1 1 0 34 0 0 2 463
99 0 0 1 1 0 0 0 1 0 1 0 0 0 1 2 0 1
I am also going to use the table for a chord diagram to show flows.
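One possible approach, sketched here with made-up data, is base R's xtabs(): with the weight on the left-hand side of the formula it sums the weights in each origin/destination cell instead of counting rows. The data frame flows and its column names origin, dest and wgt are assumptions for illustration, not the real NonHSGrad columns.

```r
# Hypothetical stand-in for the real data
flows <- data.frame(
  origin = c(1, 1, 2, 2, 2),
  dest   = c(1, 2, 1, 2, 2),
  wgt    = c(1.5, 2.0, 0.5, 1.0, 3.0)
)

# Weighted cross-tabulation: each cell is the sum of wgt for that pair
wtab <- xtabs(wgt ~ origin + dest, data = flows)
wtab

# Plain numeric matrix, in case a plotting function expects one
flow_matrix <- unclass(wtab)
```

The result has origins in rows and destinations in columns, the same layout as the unweighted table() output above.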

How can I upload this data on R?

I need to read the data in this link: http://www.stats.ox.ac.uk/~snijders/siena/ for my final assignment.
The problem is that I haven't seen this file extension before (the files have names like cov1.dat), and I don't know how to load them into R.
So I would appreciate your help.
When you come across a file type that you don't know, the first thing to do is open it in a text editor such as Notepad and see what it looks like. In this case, the file is plain text, so it is easy to read with the normal reading functions.
1 1 1 2 1 21 29 34 13 8 25 42 42 42 42 42 42 34 4 2 3
2 1 1 0 2 24 28 36 7 2 22 2 29 36 36 23 32 28 4 2 2
3 1 1 0 1 24 21 13 42 34 21 21 29 25 21 29 21 29 4 3 3
4 1 2 0 2 30 6 21 0 32 21 21 38 36 17 21 21 35 4 3 3
5 1 1 0 1 38 17 29 40 29 34 1 21 4 21 38 0 25 5 4 3
6 1 2 0 1 18 4 21 10 26 5 0 38 14 10 5 2 32 3 2 2
7 2 2 4 1 37 6 15 23 19 23 26 18 18 32 27 3 7 5 4 2
8 2 1 5 4 36 21 21 2 2 21 0 25 21 42 34 27 32 4 3 2
9 2 1 0 2 27 21 6 0 2 36 4 10 34 21 21 29 13 4 3 1
10 2 1 0 2 27 30 18 0 0 13 6 19 21 38 22 34 6 5 4 5
11 2 1 2 2 24 21 27 40 32 2 0 15 4 32 30 11 9 4 3 2
12 2 2 5 1 38 34 29 17 4 13 25 25 21 38 34 25 34 4 2 2
13 2 2 0 2 37 21 32 38 0 34 0 33 25 38 34 0 29 2 1 2
14 1 1 0 1 32 21 27 36 36 21 3 36 21 24 34 31 27 5 3 2
15 2 1 0 1 17 34 25 42 25 17 0 21 4 29 29 0 29 4 3 2
16 2 2 2 1 32 23 32 11 23 23 27 23 21 23 23 7 28 5 4 2
17 2 2 0 1 26 30 31 30 9 13 4 34 31 40 34 9 39 4 2 2
18 2 2 2 1 29 32 21 0 9 0 41 41 29 40 0 10 34 5 3 1
19 2 1 4 4 25 21 19 6 0 7 0 42 21 26 36 0 22 4 4 4
20 2 2 2 2 29 25 6 26 21 22 12 25 17 21 8 29 13 4 3 2
21 1 2 4 2 25 25 29 24 17 17 17 24 20 38 29 13 21 4 3 4
22 1 1 0 1 29 29 16 42 0 21 8 38 8 42 21 0 21 5 3 5
23 2 1 0 1 21 25 25 13 0 4 0 34 21 17 25 17 34 4 2 2
24 1 1 9 4 38 38 29 8 34 8 2 42 34 34 38 0 34 4 3 1
25 2 1 5 5 36 31 31 2 2 11 27 40 28 19 10 3 23 5 3 5
26 2 2 0 1 28 21 21 36 16 0 0 21 3 13 7 0 16 5 3 3
27 2 1 0 1 3 13 21 4 2 4 42 21 18 13 21 0 42 1 1 1
28 1 1 0 1 27 11 27 40 34 32 7 32 25 21 32 0 21 4 3 3
29 1 1 5 5 19 23 31 6 11 14 7 28 25 36 34 0 24 5 3 4
30 2 1 0 1 13 29 29 21 17 26 22 42 21 34 36 0 21 3 2 2
31 2 2 4 1 18 32 28 34 6 11 7 33 10 28 32 2 37 5 3 3
32 1 1 3 3 30 25 25 21 13 26 0 38 25 34 30 25 30 5 3 4
33 2 2 5 1 29 26 34 21 13 21 0 34 17 26 26 42 29 5 3 5
34 2 2 0 2 9 28 28 7 3 28 0 26 28 32 27 9 21 5 4 3
35 2 2 5 1 34 25 30 14 9 30 34 34 8 29 31 26 30 5 3 2
36 2 1 0 1 30 26 21 17 14 9 14 30 21 40 12 4 30 3 2 5
37 2 1 5 3 41 41 37 2 6 28 2 41 21 38 36 11 24 4 2 1
38 1 1 0 1 17 31 21 29 10 29 0 34 8 21 21 0 21 3 3 2
I have used the common read.table() operation and it worked.
dat <- read.table("DataMarijtje//cov1.dat")
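To illustrate why this works, here is a small self-contained sketch: it writes two whitespace-separated lines (taken from the start of the listing above and shortened) to a temporary file and reads them back. The temporary file is only a stand-in for cov1.dat.

```r
# Write a tiny whitespace-delimited file to a temporary location
tmp <- tempfile(fileext = ".dat")
writeLines(c("1 1 1 2 1 21",
             "2 1 1 0 2 24"), tmp)

# read.table() splits on whitespace by default; there is no header row here
dat <- read.table(tmp, header = FALSE)
dim(dat)
str(dat)
```

Columns are named V1, V2, ... automatically when the file has no header.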

Get the X value of multivariable box plot whiskers in R

I am trying to get the values of the whiskers in a boxplot.
Sample of my data is:
Company.ID ACTIVE Websource Company.Name Country Sector Ownership Activity.Status Update.Date MIN_MAX_REVENUE 16 Construction Private Number.of.Employees NOE splittedN splittedco splitted RR Range SECTORNUM
I want to find the whiskers when I box-plotted Number.of.Employees and Sector
boxplot(Data$Range ~ Data$Sector, ylab= "range", xlab= "Sector", las=2)
Got the outliers
boxplot(Data$Range ~ Data$Sector, ylab= "range", xlab= "Sector", las=2)$out
[1] 18 16 12 35 15 65 45 25 50 40 30 32 30 50 45 65 80 35 35 40 90 25 60 30 40 25
[27] 50 25 40 65 25 35 60 27 130 30 100 25 30 40 30 35 25 23 150 60 29 23 30 56 30 25
[53] 22 23 40 80 30 32 22 30 28 7 25 8 10 7 8 11 30 10 10 32 10 10 40 20 8 2
[79] 3 4 2 15 10 3 4 2 2 6 2 4 2 3 3 2 2 2 2 2 13 2 3 5 3 5
[105] 3 2 4 7 2 6 2 2 2 5 3 3 2 2 2 3 4 9 4 15 2 2 2 10 2 2
[131] 4 19 2 9 2 6 2 2 2 4 4 2 15 2 2 4 2 2 2 27 4 2 3 2 2 2
[157] 3 12 7 2 11 2 3 2 2 3 2 2 8 14 5 3 4 170 3 2 4 3 5 3 2 2
[183] 5 2 2 3 2 6 2 2 2 2 2 3 3 2 17 4 2 2 2 3 4 3 4 2 7 2
[209] 4 2 5 2 2 10 3 30 12 23 15 14 30 200 12 45 16 20 16 12 12 19 12 60 18 18
[235] 30 15 12 20 12 30 21 25 40 22 30 70 32 50 40 32 47 50 30 21 16 20 25 18 12 14
[261] 30 10 14 15 30 11 8 10 15 8 18 7 20 13 15 17 25 10 17 8 20 17 45 7 15 7
[287] 17 9 8 8 8 20 10 20 10 19 10 20 10 9 16 7 16 20 15 8 15 10 12 10 9 10
[313] 7 10 10 12 9 22 10 8 10 9 14 8 7 10 10 15 20 8 15 15 14 8 50 20 50 10
[339] 10 10 50 3 18 4 15 5 2 4 11 7 16 15 2 2 2 2 2 2 3 2 2 2 6 7
[365] 2 8 2 3 2 2 2 2 2 7 2 2 2 4 5 2 5 3 2 3 4 2 2 44 2 2
[391] 8 3 2 10 10 7 10 10 11 20 18 11 3 20 5 2 5 2 2 6 30 6 2 2 43 13
[417] 30 10 10 35 16 16 11 10 15 10 9 8 16 7 21 5 50 30 4 4 14 15 2 2 5 8
[443] 5 40 2 2 2 2 2 2 25 2 4 3 2 6 2 10 5 4 5 2 2 3 3 4 2 2
[469] 14 8 5 2 7 2 2 3 42 20 10 10 15 13 11 40 10 15 30 20 2 8 3 8 3 4
[495] 2 4 2 3 2 4 4 2 3 35 5 2 3 8 2 8 2 3 40 35 2 2 2 2 7 2
[521] 3 3 2 30 15 4 60 2 28 4 2 2 5 10 2 2 3 4 18 2 6 2 4 4 2 2
[547] 30 9 2 3 12 5 2 2 5 3 4 2 11 2 2 2 8 2 2 3 6 3 7 2 2 2
[573] 2 40 14 2 2 3 2 3 3 18 14 9 10 25 12 19 35 10 10 15 25 15 17 20 35 10
I need the full info about these outliers (company.Name....)
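You first need the interquartile range,
IQR = 75% quartile - 25% quartile,
then you find the
upper whisker at min(max(x), 75% quartile + 1.5*IQR)
lower whisker at max(min(x), 25% quartile - 1.5*IQR)
Note that boxplot() draws each whisker at the most extreme observation inside these limits, so the plotted whisker can sit slightly inside the computed bound.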
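Rather than computing the limits by hand, the object returned by boxplot() already holds them: rows 1 and 5 of $stats are the lower and upper whisker ends per group, and $out with $group identify the outliers. Below is a minimal sketch on made-up data; the toy Data frame only mimics the Company.Name, Sector and Range columns from the question.

```r
# Hypothetical data standing in for the real Data frame
Data <- data.frame(
  Company.Name = paste0("co", 1:12),
  Sector = rep(c("A", "B"), each = 6),
  Range = c(1, 2, 3, 4, 5, 50, 10, 11, 12, 13, 14, 100)
)

# plot = FALSE returns the statistics without drawing anything
bp <- boxplot(Range ~ Sector, data = Data, plot = FALSE)

# Whisker ends per sector
whiskers <- data.frame(Sector = bp$names,
                       lower = bp$stats[1, ],
                       upper = bp$stats[5, ])

# Full rows for the outliers: match on (sector, value) pairs
out_key <- paste(bp$names[bp$group], bp$out)
outliers <- Data[paste(Data$Sector, Data$Range) %in% out_key, ]
outliers
```

Matching on the sector/value pair matters because a value that is an outlier in one sector may be perfectly ordinary in another.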

Quartiles by group saved as new variable in data frame

I have data that look something like this:
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9)
yr <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
gr <- c(3,4,5,3,4,5,3,4,5,4,5,6,4,5,6,4,5,6,5,6,7,5,6,7,5,6,7)
x <- c(33,48,31,41,31,36,25,38,28,17,39,53,60,60,19,39,34,47,20,28,38,15,17,49,48,45,39)
df <- data.frame(id,yr,gr,x)
id yr gr x
1 1 1 3 33
2 1 2 4 48
3 1 3 5 31
4 2 1 3 41
5 2 2 4 31
6 2 3 5 36
7 3 1 3 25
8 3 2 4 38
9 3 3 5 28
10 4 1 4 17
11 4 2 5 39
12 4 3 6 53
13 5 1 4 60
14 5 2 5 60
15 5 3 6 19
16 6 1 4 39
17 6 2 5 34
18 6 3 6 47
19 7 1 5 20
20 7 2 6 28
21 7 3 7 38
22 8 1 5 15
23 8 2 6 17
24 8 3 7 49
25 9 1 5 48
26 9 2 6 45
27 9 3 7 39
I would like to create a new variable in the data frame that contains the quantiles of "x" computed within each unique combination of "yr" and "gr". That is, rather than finding the quantiles of "x" based on all 27 rows of data in the example, I would like to compute the quantiles by two grouping variables: yr and gr. For instance, the quantiles of "x" when yr = 1 and gr = 3, yr = 1 and gr = 4, etc.
Once these values are computed, I would like them to be appended to the data frame as a single column, say "x_quant".
I am able to split the data into the separate groups I need, and I know how to calculate quantiles, but I am having trouble combining the two steps in a way that is amenable to creating a new column in the existing data frame.
Any help y'all can provide would be greatly appreciated! Thank you much!
~kj
# turn "yr" and "gr" into sortable column
df$y <- paste(df$yr,"",df$gr)
df.ordered <- df[order(df$y),] #sort df based on group
grp <- split(df.ordered,df.ordered$y);grp
# get quantiles and turn results into string
q <- vector('list')
for (i in 1:length(grp)) {
a <- quantile(grp[[i]]$x)
q[i] <- paste(a[1],"",a[2],"",a[3],"",a[4],"",a[5])
}
x_quant <- unlist(sapply(q, `[`, 1))
x_quant <- rep(x_quant,each=3)
# append quantile back to data frame. Gave new column a more descriptive name
df.ordered$xq_0_25_50_75_100 <- x_quant
df.ordered$y <- NULL
df <- df.ordered;df
Output:
> # turn "yr" and "gr" into sortable column
> df$y <- paste(df$yr,"",df$gr)
> df.ordered <- df[order(df$y),] #sort df based on group
> grp <- split(df.ordered,df.ordered$y);grp
$`1 3`
id yr gr x y
1 1 1 3 33 1 3
4 2 1 3 41 1 3
7 3 1 3 25 1 3
$`1 4`
id yr gr x y
10 4 1 4 17 1 4
13 5 1 4 60 1 4
16 6 1 4 39 1 4
$`1 5`
id yr gr x y
19 7 1 5 20 1 5
22 8 1 5 15 1 5
25 9 1 5 48 1 5
$`2 4`
id yr gr x y
2 1 2 4 48 2 4
5 2 2 4 31 2 4
8 3 2 4 38 2 4
$`2 5`
id yr gr x y
11 4 2 5 39 2 5
14 5 2 5 60 2 5
17 6 2 5 34 2 5
$`2 6`
id yr gr x y
20 7 2 6 28 2 6
23 8 2 6 17 2 6
26 9 2 6 45 2 6
$`3 5`
id yr gr x y
3 1 3 5 31 3 5
6 2 3 5 36 3 5
9 3 3 5 28 3 5
$`3 6`
id yr gr x y
12 4 3 6 53 3 6
15 5 3 6 19 3 6
18 6 3 6 47 3 6
$`3 7`
id yr gr x y
21 7 3 7 38 3 7
24 8 3 7 49 3 7
27 9 3 7 39 3 7
> # get quantiles and turn results into string
> q <- vector('list')
> for (i in 1:length(grp)) {
+ a <- quantile(grp[[i]]$x)
+ q[i] <- paste(a[1],"",a[2],"",a[3],"",a[4],"",a[5])
+ }
> x_quant <- unlist(sapply(q, `[`, 1))
> x_quant <- rep(x_quant,each=3)
> # append quantile back to data frame
> df.ordered$xq_0_25_50_75_100 <- x_quant
> df.ordered$y <- NULL
> df <- df.ordered
> df
id yr gr x xq_0_25_50_75_100
1 1 1 3 33 25 29 33 37 41
4 2 1 3 41 25 29 33 37 41
7 3 1 3 25 25 29 33 37 41
10 4 1 4 17 17 28 39 49.5 60
13 5 1 4 60 17 28 39 49.5 60
16 6 1 4 39 17 28 39 49.5 60
19 7 1 5 20 15 17.5 20 34 48
22 8 1 5 15 15 17.5 20 34 48
25 9 1 5 48 15 17.5 20 34 48
2 1 2 4 48 31 34.5 38 43 48
5 2 2 4 31 31 34.5 38 43 48
8 3 2 4 38 31 34.5 38 43 48
11 4 2 5 39 34 36.5 39 49.5 60
14 5 2 5 60 34 36.5 39 49.5 60
17 6 2 5 34 34 36.5 39 49.5 60
20 7 2 6 28 17 22.5 28 36.5 45
23 8 2 6 17 17 22.5 28 36.5 45
26 9 2 6 45 17 22.5 28 36.5 45
3 1 3 5 31 28 29.5 31 33.5 36
6 2 3 5 36 28 29.5 31 33.5 36
9 3 3 5 28 28 29.5 31 33.5 36
12 4 3 6 53 19 33 47 50 53
15 5 3 6 19 19 33 47 50 53
18 6 3 6 47 19 33 47 50 53
21 7 3 7 38 38 38.5 39 44 49
24 8 3 7 49 38 38.5 39 44 49
27 9 3 7 39 38 38.5 39 44 49
>
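The approach above relies on every group having exactly 3 rows (rep(x_quant, each = 3)) and reorders the data frame. As an alternative sketch, not the answerer's method, tapply() can compute one quantile string per yr/gr combination and a lookup by group key can map it back to each row, which keeps the original row order and works for unequal group sizes. The names key and q_by_group are just illustrative.

```r
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9)
yr <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
gr <- c(3,4,5,3,4,5,3,4,5,4,5,6,4,5,6,4,5,6,5,6,7,5,6,7,5,6,7)
x  <- c(33,48,31,41,31,36,25,38,28,17,39,53,60,60,19,39,34,47,20,28,38,15,17,49,48,45,39)
df <- data.frame(id, yr, gr, x)

# One key per yr/gr combination, e.g. "1.3"
key <- interaction(df$yr, df$gr, drop = TRUE)

# One string of the five default quantiles (0, 25, 50, 75, 100%) per group
q_by_group <- tapply(df$x, key, function(v) paste(quantile(v), collapse = " "))

# Look up each row's group string by name
df$x_quant <- as.vector(q_by_group[as.character(key)])
head(df)
```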

Resources