Assign name in column based on values from another column in R - r

I am looking for some help with what seems like a very simple question. Any advice is greatly appreciated! I have created a data frame and want to assign names in one column (Region) based on the values in another column (Unit).
rdf<-as.data.frame(matrix(NA, nrow= 59, ncol=2))
colnames(rdf)<-c("Unit", "Region")
rdf$Unit<-c(1:35, 37:60)
rdf$Region<- ## (See below)
Here I want Units 1:13 to be labeled East,
Units 14:25 and 27 to be labeled Central,
Units 26, 28:38, 40:43, and 45:46 to be labeled West,
and Units 39, 44, and 47:60 to be labeled BC.
I've been trying case_when and nested if else statements, but I keep getting errors saying that the longer object length is not a multiple of the shorter object length.

We can use dplyr. case_when is the way to go here.
library(dplyr)
rdf %>%
mutate(Region = case_when(Unit %in% 1:13 ~ "East",
Unit %in% c(14:25, 27) ~ "Central",
Unit %in% c(26, 28:38, 40:43, 45:46) ~ "West",
Unit %in% c(44, 39, 47:60) ~ "BC"))
Unit Region
1 1 East
2 2 East
3 3 East
4 4 East
5 5 East
6 6 East
7 7 East
8 8 East
9 9 East
10 10 East
11 11 East
12 12 East
13 13 East
14 14 Central
15 15 Central
16 16 Central
17 17 Central
18 18 Central
19 19 Central
20 20 Central
21 21 Central
22 22 Central
23 23 Central
24 24 Central
25 25 Central
26 26 West
27 27 Central
28 28 West
29 29 West
30 30 West
31 31 West
32 32 West
33 33 West
34 34 West
35 35 West
36 37 West
37 38 West
38 39 BC
39 40 West
40 41 West
41 42 West
42 43 West
43 44 BC
44 45 West
45 46 West
46 47 BC
47 48 BC
48 49 BC
49 50 BC
50 51 BC
51 52 BC
52 53 BC
53 54 BC
54 55 BC
55 56 BC
56 57 BC
57 58 BC
58 59 BC
59 60 BC
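As a side note on the message in the question: it typically shows up when a column is compared with == against a vector of a different length, because == recycles the shorter vector instead of testing membership. A small sketch, assuming that is what the original attempt did (the question does not show it):

```r
# Same Unit values as in the question.
unit <- c(1:35, 37:60)

# `==` compares element by element and recycles the 13-value vector across
# the 59 units; 59 is not a multiple of 13, hence the warning (suppressed here).
eq <- suppressWarnings(unit == c(14:25, 27))

# `%in%` tests set membership, which is what the recoding needs.
in_set <- unit %in% c(14:25, 27)

unit[in_set]  # the units that should be labeled "Central"
```

This is why every condition in the case_when call above uses %in% rather than ==.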

Because this is a small data frame, using a translation table might be the most convenient way:
xlat <- list(East=c(1:13),
Central=c(14:25,27),
West=c(26,28:38,40:43,45:46),
BC=c(44,39,47:60))
rdf$Region <- NA
for (r in names(xlat)) rdf$Region[rdf$Unit %in% xlat[[r]]] <- r
This solution (a) clearly documents the recoding; (b) will indicate if you have overlooked any unit number (by setting Region to NA); and (c) is easy to alter and maintain.
For larger tables, look into join operations, which match the rows of two tables on a shared key. R offers several implementations, such as merge in base R and the *_join functions in dplyr.
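A minimal sketch of the join approach with base R's merge, assuming the mapping is kept as a separate lookup table (the name lookup is made up for this example):

```r
# Data frame from the question: units 1..60 without 36.
rdf <- data.frame(Unit = c(1:35, 37:60))

# Lookup table with one row per unit (hypothetical name `lookup`).
lookup <- rbind(
  data.frame(Unit = 1:13,                         Region = "East"),
  data.frame(Unit = c(14:25, 27),                 Region = "Central"),
  data.frame(Unit = c(26, 28:38, 40:43, 45:46),   Region = "West"),
  data.frame(Unit = c(44, 39, 47:60),             Region = "BC")
)

# Left join: keep every row of rdf; units missing from lookup would get NA.
rdf <- merge(rdf, lookup, by = "Unit", all.x = TRUE)
head(rdf)
```

With all.x = TRUE, any unit that was overlooked in the lookup table ends up with an NA Region, just as in the loop above.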


Am I able to get a specific P-value to see where the significance lies?

Below are my survey results. I have tried pairwise testing (pairwise.wilcox.test) on these results, which were collected in Spring and Autumn at these sites, but I can't get a specific p-value showing which site has the most influence.
This is the error message I keep getting. My dataset is unbalanced, i.e. some of the sites were not surveyed in Spring, which I think may be the issue.
Error in wilcox.test.default(xi, xj, paired = paired, ...) :
'x' must be numeric
So I'm not sure whether I have laid out the table incorrectly for seeing how much site influences the results between Spring and Autumn.
Site Autumn Spring
Stokes Bay 25 6
Stokes Bay 54 6
Stokes Bay 31 0
Gosport Wall 213 16
Gosport Wall 24 19
Gosport Wall 54 60
No Mans Land 76 25
No Mans Land 66 68
No Mans Land 229 103
Osbourne 1 77
Osbourne 1 92
Osbourne 1 92
Osbourne 2 114 33
Osbourne 2 217 114
Osbourne 2 117 64
Osbourne 3 204 131
Osbourne 3 165 85
Osbourne 3 150 81
Osbourne 4 124 15
Osbourne 4 79 64
Osbourne 4 176 65
Ryde Roads 217 165
Ryde Roads 182 63
Ryde Roads 112 53
Ryde Sands 386 44
Ryde Sands 375 25
Ryde Sands 147 45
Spit Bank 223 23
Spit Bank 78 29
Spit Bank 60 15
St Helen's 1 247 11
St Helen's 1 126 36
St Helen's 1 107 20
St Helen's 2 108 115
St Helen's 2 223 25
St Helen's 2 126 30
Sturbridge 58 43
Sturbridge 107 34
Sturbridge 156 0
Osbourne Deep 1 76 59
Osbourne Deep 1 64 52
Osbourne Deep 1 77 30
Osbourne Deep 2 153 60
Osbourne Deep 2 106 88
Osbourne Deep 2 74 35
Sturbridge Shoal 169 45
Sturbridge Shoal 19 84
Sturbridge Shoal 81 44
Mother's Bank 208
Mother's Bank 119
Mother's Bank 153
Ryde Middle 16
Ryde Middle 36
Ryde Middle 36
Stanswood 14 132
Stanswood 47 87
Stanswood 14 88
This is what I've done so far:
MWU <- read.csv(file.choose(), header = T)
#attach file to workspace
attach(MWU)
#Read column names of the data
colnames(MWU) # Site, Autumn, Spring
MWU.1 <- MWU[c(1,2,3)] #It included blank columns in the df
kruskal.test(MWU.1$Autumn ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Autumn by MWU.1$Site
#Kruskal-Wallis chi-squared = 36.706, df = 24, p-value = 0.0468
kruskal.test(MWU.1$Spring ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Spring by MWU.1$Site
#Kruskal-Wallis chi-squared = 35.134, df = 21, p-value = 0.02729
wilcox.test(MWU.1$Autumn, MWU.1$Spring, paired = T)
#Wilcoxon signed rank exact test
#data: MWU.1$Autumn and MWU.1$Spring
#V = 1066, p-value = 8.127e-08
#alternative hypothesis: true location shift is not equal to 0
#Tried this version too to see if it would give a summary of where the influence is.
pairwise.wilcox.test(MWU.1$Spring, MWU.1$Autumn)
#Error in wilcox.test.default(xi, xj, paired = paired, ...) : not enough (non-missing) 'x' observations
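A note on the call itself: pairwise.wilcox.test expects a numeric response vector followed by a grouping factor, so passing Spring and Autumn as the two arguments treats the Autumn counts as group labels. A minimal sketch of the intended shape, using made-up counts and hypothetical site names (not the survey data above):

```r
# Toy data: three hypothetical sites with four counts each (values invented).
counts <- c(25, 54, 31, 40,  213, 124, 154, 190,  76, 66, 229, 101)
site   <- factor(rep(c("A", "B", "C"), each = 4))

# First argument: the numeric response; second argument: the grouping factor.
res <- pairwise.wilcox.test(counts, site, p.adjust.method = "holm")

# Matrix of adjusted p-values, one entry per pair of sites.
res$p.value
```

With only three observations per site, as in the survey, each pairwise comparison has very little power, so this shows the mechanics rather than a recommended analysis.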

Sankey Diagram in R with networkD3 - row number issues

I'd like to focus on the flow highlighted above connecting the blue 'Thermal generation' block to the pink 'Electricity grid' block. You'll notice that the flow is 526 TWh, which is row #62 from Energy$links.
Energy$links
source target value
...
62 26 15 525.531
...
Now let's focus on the source and target values which refer to nodes in Energy$nodes.
Energy$nodes
name
...
15 Heating and cooling - homes
16 Electricity grid
...
26 Gas reserves
27 Thermal generation
...
The source value is '26' when it actually refers to row '27' of the nodes data. The target value is '15' when it actually refers to row '16' of the nodes data. Why do the source and target values in the links data actually refer to row x - 1 instead of x in the nodes data? Is there any way around this other than performing the x - 1 calculation in my head when building these Sankey Diagrams?
Here's the full Energy data:
> Energy
$`nodes`
name
1 Agricultural 'waste'
2 Bio-conversion
3 Liquid
4 Losses
5 Solid
6 Gas
7 Biofuel imports
8 Biomass imports
9 Coal imports
10 Coal
11 Coal reserves
12 District heating
13 Industry
14 Heating and cooling - commercial
15 Heating and cooling - homes
16 Electricity grid
17 Over generation / exports
18 H2 conversion
19 Road transport
20 Agriculture
21 Rail transport
22 Lighting & appliances - commercial
23 Lighting & appliances - homes
24 Gas imports
25 Ngas
26 Gas reserves
27 Thermal generation
28 Geothermal
29 H2
30 Hydro
31 International shipping
32 Domestic aviation
33 International aviation
34 National navigation
35 Marine algae
36 Nuclear
37 Oil imports
38 Oil
39 Oil reserves
40 Other waste
41 Pumped heat
42 Solar PV
43 Solar Thermal
44 Solar
45 Tidal
46 UK land based bioenergy
47 Wave
48 Wind
$links
source target value
1 0 1 124.729
2 1 2 0.597
3 1 3 26.862
4 1 4 280.322
5 1 5 81.144
6 6 2 35.000
7 7 4 35.000
8 8 9 11.606
9 10 9 63.965
10 9 4 75.571
11 11 12 10.639
12 11 13 22.505
13 11 14 46.184
14 15 16 104.453
15 15 14 113.726
16 15 17 27.140
17 15 12 342.165
18 15 18 37.797
19 15 19 4.412
20 15 13 40.858
21 15 3 56.691
22 15 20 7.863
23 15 21 90.008
24 15 22 93.494
25 23 24 40.719
26 25 24 82.233
27 5 13 0.129
28 5 3 1.401
29 5 26 151.891
30 5 19 2.096
31 5 12 48.580
32 27 15 7.013
33 17 28 20.897
34 17 3 6.242
35 28 18 20.897
36 29 15 6.995
37 2 12 121.066
38 2 30 128.690
39 2 18 135.835
40 2 31 14.458
41 2 32 206.267
42 2 19 3.640
43 2 33 33.218
44 2 20 4.413
45 34 1 4.375
46 24 5 122.952
47 35 26 839.978
48 36 37 504.287
49 38 37 107.703
50 37 2 611.990
51 39 4 56.587
52 39 1 77.810
53 40 14 193.026
54 40 13 70.672
55 41 15 59.901
56 42 14 19.263
57 43 42 19.263
58 43 41 59.901
59 4 19 0.882
60 4 26 400.120
61 4 12 46.477
62 26 15 525.531 # the highlighted 'flow'
63 26 3 787.129
64 26 11 79.329
65 44 15 9.452
66 45 1 182.010
67 46 15 19.013
68 47 15 289.366
The reason is that the data is ultimately sent to JavaScript/D3, which uses 0-based indexing: the index of the first element of an array is 0, unlike in R, where the index of the first element of a vector is 1.
As an example of easily converting R-style data:
source <- c("A", "A", "B", "C", "D", "D", "E", "E")
target <- c("D", "E", "E", "D", "H", "I", "I", "H")
nodes <- data.frame(name = unique(c(source, target)))
links <- data.frame(source = match(source, nodes$name) - 1,
target = match(target, nodes$name) - 1,
value = 1)
library(networkD3)
sankeyNetwork(links, nodes, "source", "target", "value", "name")
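If the links are already written with R-style 1-based row numbers, subtracting 1 from both index columns before plotting gives the same result. A small sketch using the two 'Thermal generation' flows from the question (the names links1 and links0 are made up here):

```r
# Links written with 1-based row numbers of the nodes table:
# row 27 = Thermal generation, row 16 = Electricity grid, row 4 = Losses.
links1 <- data.frame(source = c(27, 27),
                     target = c(16, 4),
                     value  = c(525.531, 787.129))

# Shift both index columns to the 0-based form that networkD3 expects.
links0 <- transform(links1, source = source - 1, target = target - 1)
links0
```

The shifted source and target values match rows 62 and 63 of Energy$links above.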

Error while plotting a tree with some squirrels using trees package

I am using the package trees found here, by @jbaums and explained in this post.
My data are the following:
the tree is composed of
the trunk
Trunk
[1] 13.60415
and the branches
Tree
TreeBranchLength TreeBranchID
1 10.004269 1
2 7.994269 2
3 9.028834 11
4 10.817401 12
5 8.551311 111
6 10.599798 112
7 11.073243 121
8 13.367392 122
9 9.625431 1111
10 10.793569 1112
11 9.896499 11121
12 8.687741 11122
13 7.791180 1211
14 12.506105 1212
15 6.768478 1221
16 10.441796 1222
17 10.751892 1121
18 9.458651 1122
19 10.768509 11221
20 10.150673 11222
21 12.377448 111211
22 12.235136 111212
23 9.074079 11211
24 9.996334 11212
25 9.807019 112221
26 10.895809 112222
27 6.741274 1122211
28 15.841272 1122212
29 5.753920 11222111
30 8.846389 11222112
31 11.925961 112111
32 9.780776 112112
33 8.207965 12221
34 10.079375 12222
the 50 squirrel populations -
Populations
PopulationPositionOnBranch PopulationBranchID ID
1 10.6321655 112111 1
2 1.0644897 1 2
3 3.9315473 1 3
4 1.0310244 0 4
5 9.1768846 0 5
6 13.4267181 0 6
7 7.9461528 0 7
8 6.0533401 121 8
9 2.1227425 121 9
10 1.8256787 121 10
11 4.7332588 11222112 11
12 4.4837432 11222112 12
13 4.6200834 11222112 13
14 2.5622276 1221 14
15 1.2446683 1221 15
16 7.0674052 111 16
17 1.3854674 111 17
18 4.8735635 111 18
19 9.5007998 1222 19
20 6.6373468 1222 20
21 12.6757728 122 21
22 4.2685465 122 22
23 3.9806540 2 23
24 3.1025403 2 24
25 3.9119065 11122 25
26 1.5527653 11122 26
27 1.6687957 11122 27
28 8.0697456 1122 28
29 6.7871391 1122 29
30 9.8050713 111212 30
31 8.5226920 111212 31
32 3.6113379 111212 32
33 7.3184965 111211 33
34 8.6142984 111211 34
35 1.3550870 1211 35
36 8.3650639 12 36
37 4.6411446 112112 37
38 3.2985541 112112 38
39 12.2344148 1212 39
40 9.0290776 1212 40
41 1.3900249 1121 41
42 0.9261425 1122212 42
43 15.2522199 1122212 43
44 4.0253771 12222 44
45 8.7507678 11222 45
46 4.6289841 1122211 46
47 9.1799522 112 47
48 5.1293838 12221 48
49 1.1543080 12221 49
50 10.1014837 112222 50
the code to produce the plot
g <- germinate(list(trunk.height=Trunk,
branches=Tree$TreeBranchID,
lengths=Tree$TreeBranchLength),
left='1', right='2', angle=30)
xy <- squirrels(g, Populations$PopulationBranchID, pos=Populations$PopulationPositionOnBranch,
left='1', right='2', pch=21, bg='white', cex=3, lwd=2)
text(xy$x, xy$y, labels=seq_len(nrow(xy)), font=1)
This produces a plot in which population 43 (blue arrow) is off the tree. It seems that the lengths of the branches on the plot do not correspond to the data. For example, the branch holding populations 38 and 37 (left green arrow) is longer than the one holding population 43 (right green arrow), which is not the case in the data. What am I doing wrong? Have I understood correctly how to use trees?
Having studied the germinate function, it seems to me that the Tree values you pass to it need to be sorted by the TreeBranchID field in ascending order.
The branch on which population 43 is drawn is not the actual 1122212 branch.
Because of the order in which the values from Tree are fed in, the function misplaces that branch.
I was curious whether increasing the length of branch 1122212 would change the branch on which 43 is placed. It didn't: the branch that actually grew was the one holding 37 and 38.
This suggested that something was going wrong inside germinate. After further debugging I was able to make it work with the code below.
Tree<-read.csv("treeBranch.csv")
Tree<-Tree[order(Tree$TreeBranchID),]
g <- germinate(list(trunk.height=15,
branches=Tree$TreeBranchID,
lengths=Tree$TreeBranchLength),
left='1', right='2', angle=30)
xy <- squirrels(g, Populations$PopulationBranchID,pos=Populations$PopulationPositionOnBranch,
left='1', right='2', pch=21, bg='white', cex=3, lwd=2)
text(xy$x, xy$y, labels=seq_len(nrow(xy)), font=1)
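To see why the reordering matters, here is a small check in base R on four of the rows from the question (rows 10 to 13 of Tree). It only assumes, as suggested above, that germinate wants the IDs in ascending numeric order:

```r
# Four branches in the order they appear in the question's Tree data frame.
Tree <- data.frame(TreeBranchLength = c(10.793569, 9.896499, 8.687741, 7.791180),
                   TreeBranchID     = c(1112, 11121, 11122, 1211))

# TRUE means the IDs are not in ascending order (1211 comes after 11122).
is.unsorted(Tree$TreeBranchID)

# Reorder whole rows so each length stays with its branch ID.
Tree <- Tree[order(Tree$TreeBranchID), ]
Tree
```

Sorting the whole data frame, rather than just the ID column, keeps each length attached to its own branch.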

Subbing random numbers for text

This should be fairly easy but I can't find a quick way to do it. All I want to do is replace certain levels of a factor with random numbers (I'm building a dataframe from scratch and want certain levels of the factors to have different ranges of values).
data <- data.frame(
animal = sample(c("lion","tiger","bear"),50,replace=TRUE),
region = sample(c("north","south","east","west"),50,replace=T),
reports = sample(50:100,50,replace=T))
Something like this doesn't work, because you have to specify the number of elements to be generated:
data$animal <- sub("lion",rnorm(15,10,2),data$animal)
Which gives the warning:
Warning: In sub("lion", rnorm(15, 10, 2), data$animal) :
argument 'replacement' has length > 1 and only the first element will be used
Does anybody have an easy way to do this, or is it not possible to use the "sub" expressions with numbers?
I don't understand why you'd want this, but here we go.
set.seed(42)
data <- data.frame(
animal = sample(c("lion","tiger","bear"),50,replace=TRUE),
region = sample(c("north","south","east","west"),50,replace=T),
reports = sample(50:100,50,replace=T))
data$animal <- as.character(data$animal)
to.change <- data$animal=="lion"
data$animal[to.change] <- rnorm(sum(to.change),10,2)
# animal region reports
# 1 bear south 81
# 2 bear south 61
# 3 11.1619929953634 south 61
# 4 bear west 69
# 5 tiger north 98
# 6 tiger east 99
# 7 bear east 87
# 8 11.5363574756692 north 87
# 9 tiger south 77
# 10 bear east 50
# 11 tiger east 81
# 12 bear west 92
# 13 bear west 88
# 14 10.9275351770803 east 73
# 15 tiger west 77
# 16 bear north 77
# 17 bear south 50
# 18 8.22844740518064 west 68
# 19 tiger east 81
# 20 tiger north 92
# 21 bear north 68
# 22 7.80043820270429 north 70
# 23 bear north 79
# 24 bear south 80
# 25 13.0254140196099 north 86
# 26 tiger east 70
# 27 tiger north 96
# 28 bear south 99
# 29 tiger east 61
# 30 bear north 86
# 31 bear east 96
# 32 bear north 80
# 33 tiger south 82
# 34 bear east 97
# 35 10.5158428750641 west 93
# 36 bear east 79
# 37 10.1768804583192 north 91
# 38 9.75820692492182 north 55
# 39 bear north 88
# 40 tiger south 81
# 41 tiger east 57
# 42 tiger north 54
# 43 7.61134220967894 north 73
# 44 bear west 89
# 45 tiger west 87
# 46 bear east 91
# 47 bear south 58
# 48 tiger east 98
# 49 bear east 64
# 50 tiger east 57
Edit:
From your comment it seems you actually want something like this:
offense <- data.frame(animal=c("lion","tiger","bear"),
mean=c(35,25,10),
sd=c(3,2,1))
library(plyr)
data <- ddply(merge(data, offense),
.(animal),
transform,
attacks=rnorm(length(mean), mean=mean, sd=sd),
mean=NULL,
sd=NULL)
# animal region reports attacks
# 1 bear south 81 10.580996
# 2 bear south 61 10.768179
# 3 bear north 77 10.463768
# 4 bear west 69 9.114224
# 5 bear east 96 8.900219
# 6 bear north 80 11.512707
# 7 bear east 87 10.257921
# 8 bear north 68 10.088440
# 9 bear west 88 9.879103
# 10 bear east 50 8.805671
# 11 bear south 80 10.611997
# 12 bear west 92 9.782860
# 13 bear south 50 9.817243
# 14 bear west 89 10.933346
# 15 bear south 99 10.821773
# 16 bear east 91 11.392116
# 17 bear east 97 9.523826
# 18 bear north 88 10.650349
# 19 bear north 79 11.391110
# 20 bear east 79 8.889211
# 21 bear east 64 9.139207
# 22 bear north 86 8.868261
# 23 bear south 58 8.540786
# 24 lion west 68 35.239948
# 25 lion south 61 36.959613
# 26 lion north 70 38.602896
# 27 lion north 73 38.134253
# 28 lion north 91 31.990374
# 29 lion north 86 40.545446
# 30 lion east 73 32.999680
# 31 lion north 87 35.316541
# 32 lion west 93 33.733232
# 33 lion north 55 34.632949
# 34 tiger west 77 25.376386
# 35 tiger east 61 25.238322
# 36 tiger east 99 24.949815
# 37 tiger east 81 25.216145
# 38 tiger north 92 24.029130
# 39 tiger north 96 23.991566
# 40 tiger south 81 21.677802
# 41 tiger east 81 24.235333
# 42 tiger north 54 23.974699
# 43 tiger south 77 30.403782
# 44 tiger north 98 22.275768
# 45 tiger east 57 25.274512
# 46 tiger south 82 22.012750
# 47 tiger east 70 22.059129
# 48 tiger east 98 25.249405
# 49 tiger west 87 23.006722
# 50 tiger east 57 24.996355
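If you'd rather not depend on plyr, the same idea can be sketched in base R: look up each row's mean and sd with match() and let rnorm() use them element-wise. The seed and the stripped-down data frame here are only for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Reduced version of the data frame from the question (animal column only).
data <- data.frame(animal = sample(c("lion", "tiger", "bear"), 50, replace = TRUE),
                   stringsAsFactors = FALSE)

# Per-animal parameters, as in the plyr example.
offense <- data.frame(animal = c("lion", "tiger", "bear"),
                      mean   = c(35, 25, 10),
                      sd     = c(3, 2, 1))

# Row of `offense` that belongs to each row of `data`.
idx <- match(data$animal, offense$animal)

# rnorm() is vectorised over mean and sd, so each row gets its own parameters.
data$attacks <- rnorm(nrow(data), mean = offense$mean[idx], sd = offense$sd[idx])
head(data)
```

Unlike the merge-based version, this keeps the rows of data in their original order.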

Exclude Row with State having less than some value on Aggregation

I'm currently learning R with the help of videos on Coursera. I'm trying to exclude from the table all hospitals in states that have fewer than 20 hospitals, but I couldn't find a correct solution because I lack R programming knowledge (I have programmed a lot in C, so the logic I tried to implement in R is also C-like).
The code I used is:
test <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
test[, 11] <- as.numeric(test[, 11])
test2 <- table(test$State)
From the table test2, I can get the value of a particular entry as test2[[2]], but I couldn't work out how to use conditional logic to get the states with fewer than 20 hospitals (if I get the state names, I can use subset() to address the actual problem). I also looked at the dimnames() function but couldn't find a way to solve my problem with it. So my question is: in R, how can I compare the table values against a threshold?
The values stored in test2 are:
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96 114 68 45 37
MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX
134 133 108 83 54 112 36 90 26 65 40 28 185 170 126 59 175 51 12 63 48 116 370
UT VA VI VT WA WI WV WY ##State Name
42 87 2 15 88 125 54 29 ##Count of Hospital
As Arun also pointed out in his comment, you can use names(test2[test2 >= 20]) to get the states with at least 20 hospitals. Here is a nice explanation of why you should avoid subset.
Or you can transform your table into a data.frame and use subset:
dat <- as.data.frame(test2)
subset(dat, Freq < 20)
nn Freq
1 AK 17
8 DC 8
9 DE 6
12 GU 1
13 HI 19
42 RI 12
49 VI 2
50 VT 15
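Putting the pieces together, the original rows can be filtered with the state names taken from the table. A minimal sketch on a made-up data frame (the name hospitals and its contents are invented; the three counts are borrowed from the table above):

```r
# Toy stand-in for the hospital data: one row per hospital, State column only.
hospitals <- data.frame(State = rep(c("AK", "AL", "DC"), times = c(17, 98, 8)))

# Number of hospitals per state.
counts <- table(hospitals$State)

# States with at least 20 hospitals.
keep <- names(counts[counts >= 20])

# Keep only the rows whose state is in that set.
filtered <- hospitals[hospitals$State %in% keep, , drop = FALSE]
table(filtered$State)
```

Logical indexing with %in% does the same job as subset() here and behaves predictably inside functions.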
