Subbing random numbers for text - r

This should be fairly easy but I can't find a quick way to do it. All I want to do is replace certain levels of a factor with random numbers (I'm building a dataframe from scratch and want certain levels of the factors to have different ranges of values).
data <- data.frame(
animal = sample(c("lion","tiger","bear"),50,replace=TRUE),
region = sample(c("north","south","east","west"),50,replace=T),
reports = sample(50:100,50,replace=T))
Something like this doesn't work, because you have to specify the number of elements to be generated
data$animal <- sub("lion",rnorm(15,10,2),data$animal)
Which gives the warning:
Warning: In sub("lion", rnorm(15, 10, 2), data$animal) :
argument 'replacement' has length > 1 and only the first element will be used
Does anybody have an easy way to do this, or is it not possible to use the "sub" expressions with numbers?

I don't understand why you'd want this, but here we go.
set.seed(42)
data <- data.frame(
animal = sample(c("lion","tiger","bear"),50,replace=TRUE),
region = sample(c("north","south","east","west"),50,replace=T),
reports = sample(50:100,50,replace=T))
data$animal <- as.character(data$animal)
to.change <- data$animal=="lion"
data$animal[to.change] <- rnorm(sum(to.change),10,2)
# animal region reports
# 1 bear south 81
# 2 bear south 61
# 3 11.1619929953634 south 61
# 4 bear west 69
# 5 tiger north 98
# 6 tiger east 99
# 7 bear east 87
# 8 11.5363574756692 north 87
# 9 tiger south 77
# 10 bear east 50
# 11 tiger east 81
# 12 bear west 92
# 13 bear west 88
# 14 10.9275351770803 east 73
# 15 tiger west 77
# 16 bear north 77
# 17 bear south 50
# 18 8.22844740518064 west 68
# 19 tiger east 81
# 20 tiger north 92
# 21 bear north 68
# 22 7.80043820270429 north 70
# 23 bear north 79
# 24 bear south 80
# 25 13.0254140196099 north 86
# 26 tiger east 70
# 27 tiger north 96
# 28 bear south 99
# 29 tiger east 61
# 30 bear north 86
# 31 bear east 96
# 32 bear north 80
# 33 tiger south 82
# 34 bear east 97
# 35 10.5158428750641 west 93
# 36 bear east 79
# 37 10.1768804583192 north 91
# 38 9.75820692492182 north 55
# 39 bear north 88
# 40 tiger south 81
# 41 tiger east 57
# 42 tiger north 54
# 43 7.61134220967894 north 73
# 44 bear west 89
# 45 tiger west 87
# 46 bear east 91
# 47 bear south 58
# 48 tiger east 98
# 49 bear east 64
# 50 tiger east 57
Edit:
From your comment it seems you actually want something like this:
offense <- data.frame(animal=c("lion","tiger","bear"),
mean=c(35,25,10),
sd=c(3,2,1))
library(plyr)
data <- ddply(merge(data, offense),
.(animal),
transform,
attacks=rnorm(length(mean), mean=mean, sd=sd),
mean=NULL,
sd=NULL)
# animal region reports attacks
# 1 bear south 81 10.580996
# 2 bear south 61 10.768179
# 3 bear north 77 10.463768
# 4 bear west 69 9.114224
# 5 bear east 96 8.900219
# 6 bear north 80 11.512707
# 7 bear east 87 10.257921
# 8 bear north 68 10.088440
# 9 bear west 88 9.879103
# 10 bear east 50 8.805671
# 11 bear south 80 10.611997
# 12 bear west 92 9.782860
# 13 bear south 50 9.817243
# 14 bear west 89 10.933346
# 15 bear south 99 10.821773
# 16 bear east 91 11.392116
# 17 bear east 97 9.523826
# 18 bear north 88 10.650349
# 19 bear north 79 11.391110
# 20 bear east 79 8.889211
# 21 bear east 64 9.139207
# 22 bear north 86 8.868261
# 23 bear south 58 8.540786
# 24 lion west 68 35.239948
# 25 lion south 61 36.959613
# 26 lion north 70 38.602896
# 27 lion north 73 38.134253
# 28 lion north 91 31.990374
# 29 lion north 86 40.545446
# 30 lion east 73 32.999680
# 31 lion north 87 35.316541
# 32 lion west 93 33.733232
# 33 lion north 55 34.632949
# 34 tiger west 77 25.376386
# 35 tiger east 61 25.238322
# 36 tiger east 99 24.949815
# 37 tiger east 81 25.216145
# 38 tiger north 92 24.029130
# 39 tiger north 96 23.991566
# 40 tiger south 81 21.677802
# 41 tiger east 81 24.235333
# 42 tiger north 54 23.974699
# 43 tiger south 77 30.403782
# 44 tiger north 98 22.275768
# 45 tiger east 57 25.274512
# 46 tiger south 82 22.012750
# 47 tiger east 70 22.059129
# 48 tiger east 98 25.249405
# 49 tiger west 87 23.006722
# 50 tiger east 57 24.996355
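The same per-animal draw can also be sketched in base R, without plyr. This is only an illustration that reuses the data and offense tables from above; the helper name idx is an assumption, not part of the original answer. The numbers go into a separate numeric column rather than overwriting the animal labels.

```r
set.seed(42)
data <- data.frame(
  animal  = sample(c("lion", "tiger", "bear"), 50, replace = TRUE),
  region  = sample(c("north", "south", "east", "west"), 50, replace = TRUE),
  reports = sample(50:100, 50, replace = TRUE)
)
offense <- data.frame(animal = c("lion", "tiger", "bear"),
                      mean   = c(35, 25, 10),
                      sd     = c(3, 2, 1))

# idx (hypothetical name): for each row of data, the matching row of offense
idx <- match(data$animal, offense$animal)

# rnorm() is vectorised over mean and sd, so each row gets its own parameters
data$attacks <- rnorm(nrow(data), mean = offense$mean[idx], sd = offense$sd[idx])

# Quick check: average attacks per animal
aggregate(attacks ~ animal, data = data, FUN = mean)
```

Because attacks stays numeric, it can be summarised or modelled directly, whereas writing numbers into the animal column turns them into text.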

Related

Am I able to get a specific P-value to see where the significance lies?

So these are the survey results. I have tried to do pairwise testing (pairwise.wilcox.test) on the results collected in Spring and Autumn for these sites, but I can't get a specific p-value showing which site has the most influence.
This is the error message I keep getting. My dataset isn't balanced, i.e. some of the sites were not surveyed in Spring, which I think may be the issue.
Error in wilcox.test.default(xi, xj, paired = paired, ...) :
'x' must be numeric
So I'm not sure if I have laid it out in the table wrong to see how much site influences the results between Spring and Autumn
Site Autumn Spring
Stokes Bay 25 6
Stokes Bay 54 6
Stokes Bay 31 0
Gosport Wall 213 16
Gosport Wall 24 19
Gosport Wall 54 60
No Mans Land 76 25
No Mans Land 66 68
No Mans Land 229 103
Osbourne 1 77
Osbourne 1 92
Osbourne 1 92
Osbourne 2 114 33
Osbourne 2 217 114
Osbourne 2 117 64
Osbourne 3 204 131
Osbourne 3 165 85
Osbourne 3 150 81
Osbourne 4 124 15
Osbourne 4 79 64
Osbourne 4 176 65
Ryde Roads 217 165
Ryde Roads 182 63
Ryde Roads 112 53
Ryde Sands 386 44
Ryde Sands 375 25
Ryde Sands 147 45
Spit Bank 223 23
Spit Bank 78 29
Spit Bank 60 15
St Helen's 1 247 11
St Helen's 1 126 36
St Helen's 1 107 20
St Helen's 2 108 115
St Helen's 2 223 25
St Helen's 2 126 30
Sturbridge 58 43
Sturbridge 107 34
Sturbridge 156 0
Osbourne Deep 1 76 59
Osbourne Deep 1 64 52
Osbourne Deep 1 77 30
Osbourne Deep 2 153 60
Osbourne Deep 2 106 88
Osbourne Deep 2 74 35
Sturbridge Shoal 169 45
Sturbridge Shoal 19 84
Sturbridge Shoal 81 44
Mother's Bank 208
Mother's Bank 119
Mother's Bank 153
Ryde Middle 16
Ryde Middle 36
Ryde Middle 36
Stanswood 14 132
Stanswood 47 87
Stanswood 14 88
This is what I've done so far:
MWU <- read.csv(file.choose(), header = T)
#attach file to workspace
attach(MWU)
#Read column names of the data
colnames(MWU) # Site, Autumn, Spring
MWU.1 <- MWU[c(1,2,3)] #It included blank columns in the df
kruskal.test(MWU.1$Autumn ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Autumn by MWU.1$Site
#Kruskal-Wallis chi-squared = 36.706, df = 24, p-value = 0.0468
kruskal.test(MWU.1$Spring ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Spring by MWU.1$Site
#Kruskal-Wallis chi-squared = 35.134, df = 21, p-value = 0.02729
wilcox.test(MWU.1$Autumn, MWU.1$Spring, paired = T)
#Wilcoxon signed rank exact test
#data: MWU.1$Autumn and MWU.1$Spring
#V = 1066, p-value = 8.127e-08
#alternative hypothesis: true location shift is not equal to 0
#Tried this version too to see if it would give a summary of where the influence is.
pairwise.wilcox.test(MWU.1$Spring, MWU.1$Autumn)
#Error in wilcox.test.default(xi, xj, paired = paired, ...) : not enough (non-missing) 'x' observations
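One thing that may be worth checking, sketched here with made-up counts rather than the survey data: pairwise.wilcox.test() takes a numeric response as its first argument and a grouping factor as its second. Passing two numeric columns, as in the call above, makes every distinct Autumn value its own group, which is probably not what was intended. The data frame counts and the site labels A, B, C are assumptions for illustration only.

```r
# Made-up counts for three hypothetical sites, three replicates each
counts <- data.frame(
  Site   = rep(c("A", "B", "C"), each = 3),
  Autumn = c(10, 12, 14, 20, 22, 24, 30, 35, 40)
)

# Response first, grouping factor second; Holm correction for multiple testing
res <- pairwise.wilcox.test(counts$Autumn, counts$Site,
                            p.adjust.method = "holm")

# Lower-triangular matrix with one adjusted p-value per pair of sites
res$p.value
```

With only three observations per site, the exact two-sided rank-sum test cannot return an unadjusted p-value below 0.1 for any pair, so very small samples limit what a pairwise comparison can show.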

Error when merging: Error in `vectbl_as_row_location()`: ! Must subset rows with a valid subscript vector. x Subscript `x` has the wrong type

I am trying to merge two data frames in R, and this error message keeps coming up even though the variable types should all be correct.
Here is my code:
team_info <- baseballr::mlb_teams(season = 2022)
team_info_mlb <- subset(team_info, sport_name == 'Major League Baseball')
tim2 <- team_info_mlb %>%
rename('home_team' = club_name)
tim3 <- subset(tim2, select = c('team_full_name', 'home_team'))
new_pf <- baseballr::fg_park(yr = 2022)
new_pf <- subset(new_pf, select = c('home_team', '1yr'))
info_pf <- merge(tim3, new_pf, by = 'home_team')
The final line is where the problems happen. Let me know if anyone has advice.
The problem is that the data have some fancy class attributes.
> class(tim3)
[1] "baseballr_data" "tbl_df" "tbl" "data.table" "data.frame"
> class(new_pf)
[1] "baseballr_data" "tbl_df" "tbl" "data.table" "data.frame"
Just wrap them in as.data.frame(). Since both data sets have the same by variable you may omit explicit specification.
info_pf <- merge(as.data.frame(tim3), as.data.frame(new_pf))
info_pf
# home_team team_full_name 1yr
# 1 Angels Los Angeles Angels 102
# 2 Astros Houston Astros 99
# 3 Athletics Oakland Athletics 94
# 4 Blue Jays Toronto Blue Jays 106
# 5 Braves Atlanta Braves 105
# 6 Brewers Milwaukee Brewers 102
# 7 Cardinals St. Louis Cardinals 92
# 8 Cubs Chicago Cubs 103
# 9 Diamondbacks Arizona Diamondbacks 103
# 10 Dodgers Los Angeles Dodgers 98
# 11 Giants San Francisco Giants 99
# 12 Guardians Cleveland Guardians 97
# 13 Mariners Seattle Mariners 94
# 14 Marlins Miami Marlins 97
# 15 Mets New York Mets 91
# 16 Nationals Washington Nationals 97
# 17 Orioles Baltimore Orioles 108
# 18 Padres San Diego Padres 96
# 19 Phillies Philadelphia Phillies 98
# 20 Pirates Pittsburgh Pirates 101
# 21 Rangers Texas Rangers 98
# 22 Rays Tampa Bay Rays 89
# 23 Red Sox Boston Red Sox 111
# 24 Reds Cincinnati Reds 112
# 25 Rockies Colorado Rockies 112
# 26 Royals Kansas City Royals 108
# 27 Tigers Detroit Tigers 94
# 28 Twins Minnesota Twins 99
# 29 White Sox Chicago White Sox 100
# 30 Yankees New York Yankees 99
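To illustrate what as.data.frame() does to the class attribute, here is a small base R sketch. The objects teams and parks, the column pf, and the extra class "my_data" are hypothetical stand-ins for the baseballr objects; the toy does not reproduce the original error, it only shows the leading classes being dropped.

```r
# Toy stand-ins for tim3 and new_pf; the extra class "my_data" is hypothetical
teams <- data.frame(home_team      = c("Angels", "Astros"),
                    team_full_name = c("Los Angeles Angels", "Houston Astros"))
parks <- data.frame(home_team = c("Astros", "Angels"),
                    pf        = c(99, 102))
class(teams) <- c("my_data", class(teams))
class(parks) <- c("my_data", class(parks))

# as.data.frame() drops the classes that sit in front of "data.frame"
plain_teams <- as.data.frame(teams)
plain_parks <- as.data.frame(parks)
class(plain_teams)

# merge() now dispatches to the plain data.frame method
info <- merge(plain_teams, plain_parks, by = "home_team")
info
```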

Assign name in column based on values from another column in R

I am looking for some help to what seems like a very simple question. Any advice is greatly appreciated! I have created a data frame and I am looking to assign names under one column (Region) based on the values in the other column (Unit).
rdf<-as.data.frame(matrix(NA, nrow= 59, ncol=2))
colnames(rdf)<-c("Unit", "Region")
rdf$Unit<-c(1:35, 37:60)
rdf$Region<- ## (See below)
Here I want for Units 1:13 <- the region to be East,
for Units 14:25 and 27 the region to be labeled Central,
units 26, 28:38, 40:43, 45:46, to be labeled West,
and then Units 44, 39, 47:60, to be labeled BC.
I've been trying case_when and nested ifelse statements, but I keep getting messages saying that "longer object length is not a multiple of shorter object length".
We can use dplyr. case_when is the way to go here.
library(dplyr)
rdf %>%
mutate(Region = case_when(Unit %in% 1:13 ~ "East",
Unit %in% c(14:25, 27) ~ "Central",
Unit %in% c(26, 28:38, 40:43, 45:46) ~ "West",
Unit %in% c(44, 39, 47:60) ~ "BC"))
Unit Region
1 1 East
2 2 East
3 3 East
4 4 East
5 5 East
6 6 East
7 7 East
8 8 East
9 9 East
10 10 East
11 11 East
12 12 East
13 13 East
14 14 Central
15 15 Central
16 16 Central
17 17 Central
18 18 Central
19 19 Central
20 20 Central
21 21 Central
22 22 Central
23 23 Central
24 24 Central
25 25 Central
26 26 West
27 27 Central
28 28 West
29 29 West
30 30 West
31 31 West
32 32 West
33 33 West
34 34 West
35 35 West
36 37 West
37 38 West
38 39 BC
39 40 West
40 41 West
41 42 West
42 43 West
43 44 BC
44 45 West
45 46 West
46 47 BC
47 48 BC
48 49 BC
49 50 BC
50 51 BC
51 52 BC
52 53 BC
53 54 BC
54 55 BC
55 56 BC
56 57 BC
57 58 BC
58 59 BC
59 60 BC
Because this is a small data frame, using a translation table might be the most convenient way:
xlat <- list(East=c(1:13),
Central=c(14:25,27),
West=c(26,28:38,40:43,45:46),
BC=c(44,39,47:60))
rdf$Region <- NA
for (r in names(xlat)) rdf$Region[rdf$Unit %in% xlat[[r]]] <- r
This solution (a) clearly documents the recoding; (b) will indicate if you have overlooked any unit number (by setting Region to NA); and (c) is easy to alter and maintain.
For larger tables, learn about the join operation among relations. It has many implementations in R.
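As a rough sketch of the join idea, the translation list can be turned into a two-column lookup table and merged onto the units. This uses base R only (stack() and merge()); the name lookup is an assumption, and other join implementations would work as well.

```r
xlat <- list(East    = c(1:13),
             Central = c(14:25, 27),
             West    = c(26, 28:38, 40:43, 45:46),
             BC      = c(44, 39, 47:60))

# stack() flattens the named list into columns "values" and "ind"
lookup <- stack(xlat)
names(lookup) <- c("Unit", "Region")
lookup$Unit   <- as.integer(lookup$Unit)
lookup$Region <- as.character(lookup$Region)

rdf <- data.frame(Unit = c(1:35, 37:60))

# Left join: keep every unit, leave Region as NA where no mapping exists
rdf <- merge(rdf, lookup, by = "Unit", all.x = TRUE)
head(rdf)
```

As with the loop version, a unit missing from the lookup table shows up as NA in Region, so gaps in the recoding stay visible.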

Using mutate() to efficiently create data frame

I have this local data frame:
Source: local data frame [792 x 3]
team player_name g
1 Anaheim PERRY_COREY 31
2 Anaheim GETZLAF_RYAN 22
3 Dallas BENN_JAMIE 25
4 Pittsburgh CROSBY_SIDNEY 20
5 Toronto KESSEL_PHIL 27
6 Edmonton HALL_TAYLOR 16
7 Dallas SEGUIN_TYLER 24
8 Montreal VANEK_THOMAS 19
9 Colorado LANDESKOG_GABRIEL 18
10 Chicago SHARP_PATRICK 22
.. ... ... ..
I want to be able to rank the teams based on their average number of goals (g) per player. Here is what I did (really feels suboptimal):
library(dplyr)
d1 <- select(df, team, g, player_name)
c1 <- count(d1, team, wt = g)
c2 <- count(d1, team, wt = n_distinct(player_name))
c3 <- cbind(c1, c2[,2])
c4 <- c3[,2] / c3[,3]
c5 <- cbind(c3, c4)
colnames(c5) <- c("team", "ttgpt", "ttnp", "agpp")
c6 <- mutate(c5, rank = row_number(desc(c4)))
c7 <- filter(c6, rank <=10)
c8 <- arrange(c7, rank)
And here is the result of c8:
team ttgpt ttnp agpp rank
1 Chicago 177 23 7.695652 1
2 Colorado 164 23 7.130435 2
3 Anaheim 180 26 6.923077 3
4 NY_Rangers 153 23 6.652174 4
5 Boston 179 27 6.629630 5
6 San_Jose 157 25 6.280000 6
7 Dallas 155 25 6.200000 7
8 St._Louis 148 24 6.166667 8
9 Ottawa 160 26 6.153846 9
10 Philadelphia 140 23 6.086957 10
I would like to recreate this table with consistent use of %>%
See CSV for reproducible example: playerstats.csv
Ok from what you said:
df <- read.csv("../Downloads/playerstats.csv", header = TRUE, sep = ",")
# Each %>% must end a line; a line that starts with %>% is a syntax error in R
df %>%
  group_by(Team) %>%
  summarise(ttgp = sum(G), ttnp = n_distinct(Player.Name), agp = sum(G) / n_distinct(Player.Name)) %>%
  mutate(rank = rank(desc(agp))) %>%
  filter(rank <= 10) %>%
  arrange(rank)
Source: local data frame [10 x 5]
Team ttgp ttnp agp rank
1 Chicago 177 23 7.695652 1
2 Colorado 164 23 7.130435 2
3 Anaheim 180 26 6.923077 3
4 NY Rangers 153 23 6.652174 4
5 Boston 179 27 6.629630 5
6 San Jose 157 25 6.280000 6
7 Dallas 155 25 6.200000 7
8 St. Louis 148 24 6.166667 8
9 Ottawa 160 26 6.153846 9
10 Philadelphia 140 23 6.086957 10
Note that I am not sure what you mean by ttgpt and ttnp, so I had to guess.
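One difference between the two versions is worth a quick look: the question ranks with row_number(), the answer with rank(). They agree when there are no ties but differ when two teams have the same average. The sketch below uses base R only, with made-up averages; dplyr documents row_number() as equivalent to rank(ties.method = "first").

```r
# Made-up averages for four hypothetical teams; a and c are tied
agp <- c(a = 7.5, b = 6.0, c = 7.5, d = 5.0)

# Default ties.method = "average": a and c both get rank 1.5
avg_rank <- rank(-agp)

# ties.method = "first": ties are broken by position, so a is 1 and c is 2
first_rank <- rank(-agp, ties.method = "first")

avg_rank
first_rank
```

With filter(rank <= 10), average ranks can return more or fewer than ten rows when ties straddle the cutoff, so the choice matters for a top-ten table.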

Select "europe" from df

my df2:
League freq
18 England 108
27 Italy 79
20 Germany 74
43 Spain 64
19 France 49
39 Russia 34
31 Mexico 27
47 Turkey 24
32 Netherlands 23
37 Portugal 21
49 United States 18
29 Japan 16
25 Iran 15
7 Brazil 13
22 Greece 13
14 Costa 11
45 Switzerland 11
5 Belgium 10
17 Ecuador 10
23 Honduras 10
42 South Korea 9
2 Argentina 8
48 Ukraine 7
3 Australia 6
11 Chile 6
12 China 6
15 Croatia 6
35 Norway 6
41 Scotland 6
34 Nigeria 5
I am trying to select the European leagues.
europe <- subset(df2, nrow(x=18, 27, 20) select=c(1, 2))
What is the most effective way to select Europe, Africa, Asia, etc. from df2?
You either need to identify which countries are on which continents by hand, or you might be able to scrape this information from somewhere:
(basic strategy from Scraping html tables into R data frames using the XML package)
library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_European_countries_by_area"
tables <- readHTMLTable(theurl)
library(stringr)
europe_names <- str_extract(as.character(tables[[1]]$Country),"[[:alpha:] ]+")
head(sort(europe_names))
## [1] "Albania" "Andorra" "Austria" "Azerbaijan" "Belarus"
## [6] "Belgium"
## there's also a 'Total' entry in here but it's probably harmless ...
subset(df2,League %in% europe_names)
Of course you'd have to figure this out again for Asia, America, etc.
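For the "by hand" route, one possible sketch is a named vector that maps each league to a continent, which can then be used to subset or split the data. Only a few rows of df2 are reproduced here, and the name continent_of is an assumption for illustration.

```r
# A few rows of df2 from the question
df2 <- data.frame(
  League = c("England", "Italy", "Mexico", "Japan", "Brazil", "Nigeria"),
  freq   = c(108, 79, 27, 16, 13, 5)
)

# Hand-written lookup (hypothetical name): league -> continent
continent_of <- c(England = "Europe", Italy = "Europe",
                  Mexico = "North America", Japan = "Asia",
                  Brazil = "South America", Nigeria = "Africa")

# Index the lookup by name; leagues missing from it would become NA
df2$continent <- unname(continent_of[as.character(df2$League)])

europe <- subset(df2, continent == "Europe")
europe

# One data frame per continent
by_continent <- split(df2, df2$continent)
names(by_continent)
```

The lookup only has to be written once, and any league left out of it is easy to spot because its continent is NA.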
So here's a slightly different approach from @BenBolker's, using the countrycode package.
library(countrycode)
cdb <- countrycode_data # database of countries
df2[toupper(df2$League) %in% cdb[cdb$continent=="Europe",]$country.name,]
# League freq
# 27 Italy 79
# 20 Germany 74
# 43 Spain 64
# 19 France 49
# 32 Netherlands 23
# 37 Portugal 21
# 22 Greece 13
# 45 Switzerland 11
# 5 Belgium 10
# 48 Ukraine 7
# 15 Croatia 6
# 35 Norway 6
One problem you're going to have is that "England" is not a country in any database (rather, "United Kingdom"), so you'll have to deal with that as a special case.
Also, this database considers the "Americas" as a continent.
df2[toupper(df2$League) %in% cdb[cdb$continent=="Americas",]$country.name,]
so to get just South America you have to use the region field:
df2[toupper(df2$League) %in% cdb[cdb$region=="South America",]$country.name,]
# League freq
# 7 Brazil 13
# 17 Ecuador 10
# 2 Argentina 8
# 11 Chile 6
