I have a character vector (consisting of randomly arranged numbers or letters) that I want to use to order a dataframe:
vals = as.numeric(dict$keys)
## ONE
vals = order(vals)
## TWO
dict = dict[vals,]
At ONE:
> vals
[1] 1 1 1 1 1 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5
[26] 6 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 10 10
[51] 10 10 10 11 11 11 11 11 12 12 12 12 12 12 12 12 12 13 13 13 14 14 15 15 15
[76] 15 16 16 16 16 16 16 16 16 16 16 17 17 17 17 17 18 18 18 18 18 18 18 18 18
[101] 18 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 21 21 21 21 21 21 21 21
[126] 22 22 22 22 22 22 22 22 22 22 22 22 23
At TWO:
> vals
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138
When I execute this snippet in RStudio on Windows, it orders the data frame dict correctly. Numbers come first and letters are at the end (this is what I want).
However, on a Linux remote desktop, where I run the script with Rscript, this snippet does not work: the data frame stays exactly as it was before these lines ran.
I fixed this by setting stringsAsFactors = FALSE in every call to data.frame in the script, as Henrik suggested. The issue lay in the different versions of R I was using on the two systems.
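A minimal sketch of what was probably going on, assuming the keys column had been turned into a factor on the older R installation (the default for stringsAsFactors changed from TRUE to FALSE in R 4.0.0). The toy keys and the names as_chr and as_fct are made up for illustration:

```r
keys <- c("10", "2", "b", "1")

as_chr <- data.frame(keys = keys, stringsAsFactors = FALSE)
as_fct <- data.frame(keys = keys, stringsAsFactors = TRUE)

# Character column: numbers parse, letters become NA (with a warning)
num_chr <- suppressWarnings(as.numeric(as_chr$keys))
# Factor column: as.numeric() returns the internal level codes, not the values
num_fct <- as.numeric(as_fct$keys)

# order() puts NA last, so numbers come first and letters end up at the end
sorted <- as_chr[order(num_chr), , drop = FALSE]
print(sorted$keys)  # "1" "2" "10" "b"
print(num_fct)      # 2 3 4 1
```

With the factor column, the codes follow the alphabetical order of the levels ("1", "10", "2", "b"), which matches the runs of repeated small integers seen at ONE.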
Related
I have a data frame with a column Session. There are 215 unique values for Session, and I am trying to treat it as a categorical variable.
However, when I run table(df$Session), the sessions are not appearing in order and some appear to be missing:
table(df$Session)
1 10 100 101 102 103 104 105 106 107 108 109 11 110 111 113 114 115 116 117 118
6 11 20 14 17 8 14 11 8 14 15 17 12 16 15 17 19 26 24 31 28
12 120 121 122 123 124 125 126 127 128 13 130 131 132 133 134 135 136 137 138 139
13 36 27 20 23 18 12 12 40 52 19 91 78 88 78 8 7 74 5 8 6
14 140 141 142 143 144 145 146 147 148 149 15 150 151 152 153 154 155 156 157 158
14 7 6 7 5 3 75 3 70 75 68 16 68 67 67 68 58 69 70 68 26
159 16 160 161 162 163 164 165 166 167 168 169 17 170 171 172 173 174 175 176 177
75 17 65 70 63 76 57 43 45 32 31 18 18 20 17 22 13 15 12 7 7
178 179 18 180 181 182 183 184 185 186 187 188 189 19 190 191 192 193 194 195 196
6 7 17 9 9 13 12 18 19 22 15 3 10 3 21 32 43 54 66 77 84
197 198 199 2 20 200 201 202 203 204 205 206 207 208 209 21 210 211 212 213 215
77 85 79 6 17 89 87 93 85 85 98 80 78 68 54 17 34 24 50 50 65
22 23 24 25 26 27 28 29 3 30 31 32 33 34 35 36 37 38 39 4 40
11 12 12 10 11 7 7 10 4 7 8 7 6 9 11 10 23 27 14 3 21
41 42 43 44 45 46 47 48 49 5 50 51 52 53 54 55 56 57 58 59 6
27 16 16 18 10 12 19 7 6 4 5 13 21 17 25 31 32 30 15 10 3
60 61 62 63 64 65 66 67 68 69 7 70 71 73 74 75 76 77 78 79 8
18 17 11 14 14 15 18 11 13 9 7 13 12 7 8 8 9 12 8 9 6
80 81 82 83 84 85 86 87 88 89 9 90 91 92 93 94 95 97 98 99
1 11 8 17 20 13 14 18 19 19 9 14 16 12 15 17 19 13 7 16
If we only look at a couple of columns:
table(df$Session)
# 1 10 100 101 ... 197 198 199 2 20 200 201 202 ...
# 6 11 20 14 ... 77 85 79 6 17 89 87 93 ...
Why are they not ordered by number (1, 2, 3 instead of 1, 10, 100)? And how can I correct this?
Answer
The variable will be sorted correctly if you make it numeric first:
table(as.numeric(df$Session))
table(as.factor(as.numeric(df$Session)))
Explanation
Your variable is, or was, of class character, so it is ordered alphabetically, which is what happens when you sort a character vector. Try: sort(c("1", "11", "2")). When you apply factor or as.factor to a character vector, the levels are ordered the same way (see ?factor):
levels: an optional vector of the unique values (as character strings) that x might have taken. The default is the unique set of values taken by as.character(x), sorted into increasing order of x. Note that this set can be specified as smaller than sort(unique(x)).
Keep in mind that R reads in numbers as numeric by default. If you expected the column to be numeric from the start but R made it character, then you likely have values in there that are not strictly numbers. It is important to find out why the vector was character.
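One way to find out why the column is character is to list the values that fail to convert. This is only a sketch on a made-up vector; session stands in for df$Session:

```r
session <- c("1", "10", "2", "n/a", "4a", "7")

# Values that are not valid numbers become NA (the warning is expected here)
parsed <- suppressWarnings(as.numeric(session))

# Entries that were present in the input but could not be parsed
bad <- unique(session[is.na(parsed) & !is.na(session)])
print(bad)  # "n/a" "4a"
```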
Reproducible example
vec <- c(22, 11, 3, 2, 1)
table(vec) # correct: numeric
# 1 2 3 11 22
# 1 1 1 1 1
table(as.character(vec)) # incorrect: character
# 1 11 2 22 3
# 1 1 1 1 1
table(as.factor(as.character(vec))) # incorrect: character -> factor
# 1 11 2 22 3
# 1 1 1 1 1
table(as.factor(vec)) # correct: numeric -> factor
# 1 2 3 11 22
# 1 1 1 1 1
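If the column has to stay character (or become a factor) for other reasons, the levels can also be set explicitly in numeric order. A small sketch, assuming every value is a valid number; session and numeric_levels are illustrative names:

```r
session <- c("22", "11", "3", "2", "1", "11")

# Sort the unique values numerically, then turn them back into level labels
numeric_levels <- as.character(sort(unique(as.numeric(session))))
session_f <- factor(session, levels = numeric_levels)

table(session_f)
# session_f
#  1  2  3 11 22
#  1  1  1  2  1
```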
I want to split this data frame df by column using a list of vectors ind as the column indices.
> df
1 2 3 4 5 6 7 8 9 10
1 1 11 21 31 41 51 61 71 81 91
2 2 12 22 32 42 52 62 72 82 92
3 3 13 23 33 43 53 63 73 83 93
4 4 14 24 34 44 54 64 74 84 94
5 5 15 25 35 45 55 65 75 85 95
6 6 16 26 36 46 56 66 76 86 96
7 7 17 27 37 47 57 67 77 87 97
8 8 18 28 38 48 58 68 78 88 98
9 9 19 29 39 49 59 69 79 89 99
10 10 20 30 40 50 60 70 80 90 100
The combined length of the vectors is equal to the number of columns in the data frame.
> ind
[[1]]
[1] 1 4 9
[[2]]
[1] 2 5 10 7 3
[[3]]
[1] 8 6
The desired result should look like this:
$`1`
1 4 9
1 1 31 81
2 2 32 82
3 3 33 83
4 4 34 84
5 5 35 85
6 6 36 86
7 7 37 87
8 8 38 88
9 9 39 89
10 10 40 90
$`2`
2 5 10 7 3
1 11 41 91 61 21
2 12 42 92 62 22
3 13 43 93 63 23
4 14 44 94 64 24
5 15 45 95 65 25
6 16 46 96 66 26
7 17 47 97 67 27
8 18 48 98 68 28
9 19 49 99 69 29
10 20 50 100 70 30
$`3`
8 6
1 71 51
2 72 52
3 73 53
4 74 54
5 75 55
6 76 56
7 77 57
8 78 58
9 79 59
10 80 60
Effectively, the code should generate submatrices, as data frames, from the data frame df based on the vectors in the list ind.
I have tried using split.default without achieving the desired result.
split.default(V, rep(seq_along(ind), lengths(ind)))
One purrr option could be:
map(.x = ind, ~ df[, .x])
[[1]]
X1 X4 X9
1 1 31 81
2 2 32 82
3 3 33 83
[[2]]
X2 X5 X10 X7 X3
1 11 41 91 61 21
2 12 42 92 62 22
3 13 43 93 63 23
[[3]]
X8 X6
1 71 51
2 72 52
3 73 53
With ind defined as:
ind <- list(c(1, 4, 9),
c(2, 5, 10, 7, 3),
c(8, 6))
An option for a list of dfs:
map(ind, ~ map(df_list, `[`, .))
You can just do,
lapply(your_list, function(i) your_df[i])
You can try the following base R solution using subset + Map
r <- Map(function(k) subset(df,select = k),ind)
such that
> r
[[1]]
X1 X4 X9
1 1 31 81
2 2 32 82
3 3 33 83
4 4 34 84
5 5 35 85
6 6 36 86
7 7 37 87
8 8 38 88
9 9 39 89
10 10 40 90
[[2]]
X2 X5 X10 X7 X3
1 11 41 91 61 21
2 12 42 92 62 22
3 13 43 93 63 23
4 14 44 94 64 24
5 15 45 95 65 25
6 16 46 96 66 26
7 17 47 97 67 27
8 18 48 98 68 28
9 19 49 99 69 29
10 20 50 100 70 30
[[3]]
X8 X6
1 71 51
2 72 52
3 73 53
4 74 54
5 75 55
6 76 56
7 77 57
8 78 58
9 79 59
10 80 60
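The answers above can be checked end to end with a self-contained sketch. It rebuilds df from the question (the matrix construction is an assumption about how the data were made) and names the list elements "1", "2", "3" to match the desired output:

```r
# Rebuild the 10 x 10 example: column j holds (10 * (j - 1) + 1):(10 * j)
df <- as.data.frame(matrix(1:100, nrow = 10))
names(df) <- as.character(1:10)

ind <- list(c(1, 4, 9),
            c(2, 5, 10, 7, 3),
            c(8, 6))

# drop = FALSE keeps a data frame even if an index vector has length one
parts <- lapply(ind, function(i) df[, i, drop = FALSE])
names(parts) <- seq_along(parts)

str(parts, max.level = 1)
```

The columns come out in the order given by each index vector, so parts[["2"]] has columns 2, 5, 10, 7, 3.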
Good evening,
I need to solve a location problem in R and I'm stuck on one of the first steps.
From a .txt file, I need to create a distance matrix using the Euclidean distance.
datos <- file.choose()
servidores <- read.table(datos)
servidores
From this I obtain the following information:
X50 is the total number of servers.
X5 is the number of hubs required.
X120 is the total capacity.
The first column shows the x coordinate.
The second column shows the y coordinate.
The third column shows the requirements of the node.
X50 X5 X120
1 2 62 3
2 80 25 14
3 36 88 1
4 57 23 14
5 33 17 19
6 76 43 2
7 77 85 14
8 94 6 6
9 89 11 7
10 59 72 6
11 39 82 10
12 87 24 18
13 44 76 3
14 2 83 6
15 19 43 20
16 5 27 4
17 58 72 14
18 14 50 11
19 43 18 19
20 87 7 15
21 11 56 15
22 31 16 4
23 51 94 13
24 55 13 13
25 84 57 5
26 12 2 16
27 53 33 3
28 53 10 7
29 33 32 14
30 69 67 17
31 43 5 3
32 10 75 3
33 8 26 12
34 3 1 14
35 96 22 20
36 6 48 13
37 59 22 10
38 66 69 9
39 22 50 6
40 75 21 18
41 4 81 7
42 41 97 20
43 92 34 9
44 12 64 1
45 60 84 8
46 35 100 5
47 38 2 1
48 9 9 7
49 54 59 9
50 1 58 2
I tried to use the dist() function:
distance_matrix <- dist(servidores, method = "euclidean", diag = TRUE, upper = TRUE)
but since x and y are in different columns, I am not sure what to do to get a 50x50 matrix with all the distances.
Does anybody know how I could create such a matrix?
Many thanks in advance.
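A possible approach, sketched on the first four rows of the data: dist() treats every row as a point and every column as a dimension, so passing only the two coordinate columns should give plain x/y distances. The column names x, y and demand are made up here; in the real data they would be the first, second and third columns of servidores:

```r
servidores <- data.frame(x = c(2, 80, 36, 57),
                         y = c(62, 25, 88, 23),
                         demand = c(3, 14, 1, 14))

# Keep only the coordinates; the demand column must not enter the distance
coords <- servidores[, c("x", "y")]

# as.matrix() expands the compact "dist" object into a full square matrix
distance_matrix <- as.matrix(dist(coords, method = "euclidean"))

dim(distance_matrix)  # 4 4 here, 50 50 for the full file
```

The result is symmetric with zeros on the diagonal, so diag = TRUE and upper = TRUE are only needed when printing the dist object itself.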
Sample data
df <- data.frame(loc.id = rep(1:5, each = 6), day = sample(1:365,30),
ref.day1 = rep(c(20,30,50,80,90), each = 6),
ref.day2 = rep(c(10,28,33,49,67), each = 6),
ref.day3 = rep(c(31,49,65,55,42), each = 6))
For each loc.id, if I want to keep the days that are >= ref.day1, I do this:
df %>% group_by(loc.id) %>% dplyr::filter(day >= ref.day1)
I want to make 3 data frames, with rows filtered by ref.day1, ref.day2 and ref.day3 respectively.
I tried this:
col.names <- c("ref.day1","ref.day2","ref.day3")
temp.list <- list()
for(cl in seq_along(col.names)){
col.sub <- col.names[cl]
columns <- c("loc.id","day",col.sub)
df.sub <- df[,columns]
temp.dat <- df.sub %>% group_by(loc.id) %>% dplyr::filter(day >= paste0(col.sub)) # this line does not work
temp.list[[cl]] <- temp.dat
}
final.dat <- rbindlist(temp.list)
I was wondering how to refer to a column by a name built with paste0 inside dplyr so that I can filter on it.
Your original code doesn't work because col.names holds strings, while dplyr functions use non-standard evaluation: a string inside filter() is treated as a literal value, not as a column. You need to convert the string into a symbol, which rlang::sym() does, and then unquote it with !!.
Also, you can use the map function from the purrr package, which is much more compact:
library(dplyr)
library(purrr)
col_names <- c("ref.day1","ref.day2","ref.day3")
map(col_names, ~ df %>% dplyr::filter(day >= !!rlang::sym(.x)))
# returns a list of data frames, one per reference column
By the way, I removed group_by() because it doesn't seem to be needed: the comparison is made row by row.
Returned result:
[[1]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 362 20 10 31
2 1 69 20 10 31
3 1 65 20 10 31
4 1 88 20 10 31
5 1 142 20 10 31
6 2 355 30 28 49
7 2 255 30 28 49
8 2 136 30 28 49
9 2 156 30 28 49
10 2 194 30 28 49
11 2 204 30 28 49
12 3 129 50 33 65
13 3 254 50 33 65
14 3 279 50 33 65
15 3 201 50 33 65
16 3 282 50 33 65
17 4 351 80 49 55
18 4 114 80 49 55
19 4 338 80 49 55
20 4 283 80 49 55
21 5 199 90 67 42
22 5 141 90 67 42
23 5 241 90 67 42
24 5 187 90 67 42
[[2]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 16 20 10 31
2 1 362 20 10 31
3 1 69 20 10 31
4 1 65 20 10 31
5 1 88 20 10 31
6 1 142 20 10 31
7 2 355 30 28 49
8 2 255 30 28 49
9 2 136 30 28 49
10 2 156 30 28 49
11 2 194 30 28 49
12 2 204 30 28 49
13 3 129 50 33 65
14 3 254 50 33 65
15 3 279 50 33 65
16 3 201 50 33 65
17 3 282 50 33 65
18 4 351 80 49 55
19 4 114 80 49 55
20 4 338 80 49 55
21 4 283 80 49 55
22 4 79 80 49 55
23 5 199 90 67 42
24 5 67 90 67 42
25 5 141 90 67 42
26 5 241 90 67 42
27 5 187 90 67 42
[[3]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 362 20 10 31
2 1 69 20 10 31
3 1 65 20 10 31
4 1 88 20 10 31
5 1 142 20 10 31
6 2 355 30 28 49
7 2 255 30 28 49
8 2 136 30 28 49
9 2 156 30 28 49
10 2 194 30 28 49
11 2 204 30 28 49
12 3 129 50 33 65
13 3 254 50 33 65
14 3 279 50 33 65
15 3 201 50 33 65
16 3 282 50 33 65
17 4 351 80 49 55
18 4 114 80 49 55
19 4 338 80 49 55
20 4 283 80 49 55
21 4 79 80 49 55
22 5 199 90 67 42
23 5 67 90 67 42
24 5 141 90 67 42
25 5 241 90 67 42
26 5 187 90 67 42
You may also want to check these:
https://dplyr.tidyverse.org/articles/programming.html
Use variable names in functions of dplyr
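For comparison, the same filtering can be sketched in base R, where a column is looked up by its name as a string with [[ ]], so no tidy evaluation is needed. The data below are a small fixed stand-in for the sample data (fixed day values instead of sample()):

```r
df <- data.frame(loc.id = rep(1:2, each = 3),
                 day = c(5, 25, 40, 29, 31, 60),
                 ref.day1 = rep(c(20, 30), each = 3),
                 ref.day2 = rep(c(10, 28), each = 3))

col_names <- c("ref.day1", "ref.day2")

# One filtered data frame per reference column, compared row by row
filtered <- lapply(col_names, function(col) df[df$day >= df[[col]], ])
names(filtered) <- col_names

sapply(filtered, nrow)
# ref.day1 ref.day2
#        4        5
```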
I want to use barplot (or any other better options) to plot the following data:
action_number times
1 1 13408
2 2 5550
3 3 2757
4 4 1782
5 5 1114
6 6 847
7 7 582
8 8 410
9 9 306
10 10 278
11 11 212
12 12 165
13 13 139
14 14 112
15 15 106
16 16 82
17 17 64
18 18 61
19 19 69
20 20 47
21 21 31
22 22 40
23 23 34
24 24 31
25 25 28
26 26 26
27 27 21
28 28 16
29 29 14
30 30 16
31 31 11
32 32 10
33 33 11
34 34 10
35 35 4
36 36 6
37 37 5
38 38 8
39 39 6
40 40 3
41 41 6
42 42 8
43 43 3
44 44 3
45 45 7
46 46 8
47 47 4
48 48 4
49 49 1
50 50 4
51 51 2
52 52 4
53 53 3
54 54 1
55 55 2
56 56 1
57 58 2
58 59 4
59 60 1
60 62 2
61 63 1
62 66 1
63 67 4
64 68 2
65 69 1
66 70 1
67 71 1
68 73 1
69 74 1
70 77 1
71 79 1
72 80 1
73 82 1
74 92 2
75 97 1
76 98 1
77 103 1
78 106 1
79 114 1
80 118 1
81 128 1
82 142 1
83 148 1
84 153 1
85 155 1
86 166 1
87 183 1
88 218 1
89 224 1
90 298 1
91 536 1
I am using the following, but it does not match the data correctly:
mp <- barplot(data$times,axes=FALSE,ylim=c(0,13408))
axis(1,at=data$action_number,labels=data$action_number)
#??? Should I use at=data$action_number to at=data$times
axis(2,seq(0,91,3),c(0:30))
Problems:
- the x-axis does not have 536, it only goes to 224
- the Y axis only shows one number
Can you please advise, and tell me whether I should use a package?
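It is still unclear what you want, but maybe something like this:
# label every bar with its action_number
barplot(data$times, names.arg = data$action_number)
# or suppress the default axes and draw sparser ones by hand
mp <- barplot(data$times, axes = FALSE, ylim = c(0, 13408))
# mp holds the bar midpoints, so place the x ticks there rather than at 1..91
axis(1, at = mp[seq(1, 91, 10)], labels = data$action_number[seq(1, 91, 10)])
axis(2, at = seq(0, 13408, 500))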
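Because the counts range from 13408 down to 1, a logarithmic y axis may also be worth trying. The sketch below uses a six-row subset of the data and base graphics only, so no extra package is assumed; whether a log scale suits the purpose is a judgment call:

```r
data <- data.frame(action_number = c(1, 2, 3, 10, 98, 536),
                   times = c(13408, 5550, 2757, 278, 1, 1))

# barplot() returns the x positions of the bar midpoints,
# which are not 1, 2, 3, ... because of the gaps between bars
mp <- barplot(data$times, axes = FALSE, log = "y",
              xlab = "action_number", ylab = "times")

# Put each label under its own bar, so 536 appears as the last label
axis(1, at = mp, labels = data$action_number)
axis(2, las = 1)
```

Using mp for the tick positions is what keeps the labels aligned with the bars.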