Suppose this table:
Browse[2]> tra_all_data
ID CITY COUNTRY PRODUCT CATEGORY YEAR INDICATOR COUNT
1 1 VAL ES Tomato Vegetables 1999 10 10
2 2 MAD ES Beer Alcohol 1999 20 20
3 3 LON UK Whisky Alcohol 1999 30 30
4 4 VAL ES Tomato Vegetables 2000 100 100
5 5 VAL ES Beer Alcohol 2000 121 121
6 6 LON UK Whisky Alcohol 2000 334 334
7 7 MAD ES Tomato Vegetables 2000 134 134
8 8 LON UK Tomato Vegetables 2000 451 451
17 17 BIL ES Pincho Meat 1999 180 180
18 18 VAL ES Orange Vegetables 1999 110 110
19 19 MAD ES Wine Alcohol 1999 120 120
20 20 LON UK Wine Alcohol 1999 230 230
21 21 VAL ES Orange Vegetables 2000 100 100
22 22 VAL ES Wine Alcohol 2000 122 122
23 23 LON UK JB Alcohol 2000 133 133
24 24 MAD ES Orange Vegetables 2000 113 113
25 25 MAD ES Orange Vegetables 2000 113 113
26 26 LON UK Orange Vegetables 2000 145 145
And this piece of code:
CURRENT_COLS<-c("PRODUCT", "YEAR", "CITY")
tra_dAGG <- tra_all_data %>%
  regroup(as.list(CURRENT_COLS)) %>%
  #group_by(PRODUCT, YEAR, CITY) %>%
  summarise(Percent = sum(COUNT)) %>%
  mutate(Percent = Percent / sum(Percent))
If I use this code as it is, I get the following warning:
Warning message:
'regroup' is deprecated.
Use 'group_by_' instead.
See help("Deprecated")
If I comment out the regroup line and use the group_by line instead, it works. The point, however, is that CURRENT_COLS changes in each iteration, so I need to group by this variable (I have defined CURRENT_COLS explicitly here only to better explain my question).
Can anyone help me with this issue? How can I use a variable in group_by?
Thank you so much in advance.
My R version: 3.1.2 (2014-10-31)
You need to use the newer standard evaluation versions of dplyr's functions. They are denoted by an additional _ at the end of the function name, for example select_().
In your case, you can change your code to:
CURRENT_COLS<-c("PRODUCT", "YEAR", "CITY")
tra_dAGG <- tra_all_data %>%
  group_by_(.dots = CURRENT_COLS) %>%
  summarise(Percent = sum(COUNT)) %>%
  mutate(Percent = Percent / sum(Percent))
Make sure you have the latest version of dplyr installed and loaded.
To learn more about standard and non-standard evaluation in dplyr, see the NSE vignette.
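A note for readers on current dplyr: the underscore verbs such as group_by_() were themselves deprecated in later releases. A minimal sketch of the same idea, assuming dplyr 1.0 or newer, selects the grouping columns with across(all_of(...)). The small data frame below is only a stand-in built from a few rows of the question's table.

```r
library(dplyr)

# Stand-in for tra_all_data: a few rows copied from the question
tra_all_data <- data.frame(
  CITY    = c("VAL", "MAD", "LON", "VAL", "MAD", "LON"),
  PRODUCT = c("Tomato", "Beer", "Whisky", "Tomato", "Tomato", "Tomato"),
  YEAR    = c(1999, 1999, 1999, 2000, 2000, 2000),
  COUNT   = c(10, 20, 30, 100, 134, 451),
  stringsAsFactors = FALSE
)

CURRENT_COLS <- c("PRODUCT", "YEAR", "CITY")

tra_dAGG <- tra_all_data %>%
  # all_of() turns the character vector into a column selection
  group_by(across(all_of(CURRENT_COLS))) %>%
  # drop only the last grouping level (CITY), as summarise() does by default
  summarise(Percent = sum(COUNT), .groups = "drop_last") %>%
  # still grouped by PRODUCT and YEAR, so shares sum to 1 within each pair
  mutate(Percent = Percent / sum(Percent)) %>%
  ungroup()

tra_dAGG
```

Because CURRENT_COLS is an ordinary character vector, it can change on each iteration of a loop without touching the pipeline.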
I have a data.frame where most, but not all, data are recorded over a 12-month period. This is specified in the months column.
I need to transform the revenue and cost variables only (since they are flow data, compared to total_assets which is stock data) so I get the 12-month values.
In this example, for Micheal and Ravi I need to replace the values in revenue and cost with (12/months)*revenue and (12/months)*cost, respectively.
What would be a possible way to do this?
df1 = data.frame(name = c('George','Andrea', 'Micheal','Maggie','Ravi'),
months=c(12,12,4,12,9),
revenue=c(45,78,13,89,48),
cost=c(56,52,15,88,24),
total_asset=c(100,121,145,103,119))
df1
name months revenue cost total_asset
1 George 12 45 56 100
2 Andrea 12 78 52 121
3 Micheal 4 13 15 145
4 Maggie 12 89 88 103
5 Ravi 9 48 24 119
Using dplyr:
library(dplyr)
df1 %>%
mutate(cost = (12/months)*cost,
revenue = (12/months)*revenue)
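If more flow variables need the same scaling, the two assignments can be collapsed. This is only a sketch, assuming dplyr 1.0 or newer for across(); df1_annual is a name introduced here for illustration, and the result has to be assigned because mutate() does not modify df1 in place.

```r
library(dplyr)

df1 <- data.frame(name = c('George', 'Andrea', 'Micheal', 'Maggie', 'Ravi'),
                  months = c(12, 12, 4, 12, 9),
                  revenue = c(45, 78, 13, 89, 48),
                  cost = c(56, 52, 15, 88, 24),
                  total_asset = c(100, 121, 145, 103, 119))

# Scale only the flow variables to 12-month values; total_asset is left alone
df1_annual <- df1 %>%
  mutate(across(c(revenue, cost), ~ .x * 12 / months))

df1_annual
```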
An alternative if for any reason you have to use base R is:
df1$revenue <- 12/df1$months * df1$revenue
df1$cost <- 12/df1$months * df1$cost
df1
#> name months revenue cost total_asset
#> 1 George 12 45 56 100
#> 2 Andrea 12 78 52 121
#> 3 Micheal 4 39 45 145
#> 4 Maggie 12 89 88 103
#> 5 Ravi 9 64 32 119
Created on 2022-06-01 by the reprex package (v2.0.1)
Slightly different base R approach with with():
df1 = data.frame(name = c('George','Andrea', 'Micheal','Maggie','Ravi'),
months=c(12,12,4,12,9),
revenue=c(45,78,13,89,48),
cost=c(56,52,15,88,24),
total_asset=c(100,121,145,103,119))
df1$revenue <- with(df1, 12/months * revenue)
df1$cost <- with(df1, 12/months * cost)
head(df1)
#> name months revenue cost total_asset
#> 1 George 12 45 56 100
#> 2 Andrea 12 78 52 121
#> 3 Micheal 4 39 45 145
#> 4 Maggie 12 89 88 103
#> 5 Ravi 9 64 32 119
Created on 2022-06-01 by the reprex package (v2.0.1)
I want to import data into R but I am getting a few errors. I downloaded my ".CSV" file to my computer and set the working directory with setwd("C:/Users/intellipaat/Desktop/BLOG/files"), then ran read.data <- read.csv("file1.csv"), but the console returns an error like this:
read.data<-read.csv(file1.csv)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file1.csv' object not found
What should I do about this? I also tried reading the data from a web link, but ran into another problem.
I wrote this:
install.packages("XML")
install.packages("RCurl")
Then, to load the packages:
library("XML")
library("RCurl")
url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
tabs <- getURL(url)
and the console gave me this error:
Error in function (type, msg, asError = TRUE) :
error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
I would be glad if you could help me with this.
The Ease of Doing Business rankings table on Wikipedia is an HTML table, not a comma separated values file.
Loading the HTML table into an R data frame can be handled in a relatively straightforward manner with the rvest package. Instead of downloading the HTML file we can read it directly into R with read_html(), and then use html_table() to extract the tabular data into a data frame.
library(rvest)
wiki_url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
aPage <- read_html(wiki_url)
aTable <- html_table(aPage)[[2]] # second node is table of rankings
head(aTable)
...and the first few rows of output:
> head(aTable)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
1 Very Easy New Zealand 1 1 1 1 2 2 3 3 3
2 Very Easy Singapore 2 2 2 2 1 1 1 1 1
3 Very Easy Hong Kong 3 4 5 4 5 3 2 2 2
4 Very Easy Denmark 4 3 3 3 3 4 5 5 5
5 Very Easy South Korea 5 5 4 5 4 5 7 8 8
6 Very Easy United States 6 8 6 8 7 7 4 4 4
2011 2010 2009 2008 2007 2006
1 3 2 2 2 2 1
2 1 1 1 1 1 2
3 2 3 4 4 5 7
4 6 6 5 5 7 8
5 16 19 23 30 23 27
6 5 4 3 3 3 3
>
Next, we confirm that the last countries were read correctly: Libya, Yemen, Venezuela, Eritrea, and Somalia.
> tail(aTable,n=5)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
186 Below Average Libya 186 186 185 188 188 188 187 N/A N/A
187 Below Average Yemen 187 187 186 179 170 137 133 118 99
188 Below Average Venezuela 188 188 188 187 186 182 181 180 177
189 Below Average Eritrea 189 189 189 189 189 189 184 182 180
190 Below Average Somalia 190 190 190 190 N/A N/A N/A N/A N/A
2011 2010 2009 2008 2007 2006
186 N/A N/A N/A N/A N/A N/A
187 105 99 98 113 98 90
188 172 177 174 172 164 120
189 180 175 173 171 170 137
190 N/A N/A N/A N/A N/A N/A
Finally, we use tidyr and dplyr to convert the data to narrow format tidy data for subsequent analysis.
library(dplyr)
library(tidyr)
aTable %>%
# convert years 2017 - 2020 to character because pivot_longer()
# requires all columns to be of same data type
mutate_at(3:6,as.character) %>%
pivot_longer(-c(Classification,Jurisdiction),
names_to="Year",values_to="Rank") %>%
# convert Rank and Year to numeric values (introducing NA values)
mutate_at(c("Rank","Year"),as.numeric) -> rankings
head(rankings)
...and the output:
> head(rankings)
# A tibble: 6 x 4
Classification Jurisdiction Year Rank
<chr> <chr> <dbl> <dbl>
1 Very Easy New Zealand 2020 1
2 Very Easy New Zealand 2019 1
3 Very Easy New Zealand 2018 1
4 Very Easy New Zealand 2017 1
5 Very Easy New Zealand 2016 2
6 Very Easy New Zealand 2015 2
>
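The first error in the question is separate from the web scraping: the console output shows read.csv(file1.csv) without quotes, so R looks for an object named file1.csv instead of a file. A small sketch of the quoted form follows; the temporary file and its contents are placeholders made up for the example, not the asker's data.

```r
# Write a tiny CSV to a temporary directory so the example is self-contained
csv_path <- file.path(tempdir(), "file1.csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), csv_path, row.names = FALSE)

# file.exists() is a quick check that the path points where you expect
file.exists(csv_path)

# The file name is passed as a character string (here via a variable holding it);
# an unquoted file1.csv would be treated as an object name
read.data <- read.csv(csv_path)
read.data
```

With setwd() already pointing at the folder, read.csv("file1.csv") with the quotes should behave the same way.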
I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return the maximum value of the batting average ('average') column where the at-bats ('AB') are greater than 100. There are also 'NaN' values in the average column.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
AB = sample(seq(50, 150, 10), 10),
avg = c(runif(9), NaN)
)
data %>%
filter(AB >= 100) %>%
filter(avg == max(avg, na.rm = TRUE))
Here the first filter keeps only rows where AB is greater than or equal to 100 (use AB > 100 if you need strictly greater, as in the question), and the second filter selects the entire row where avg equals the maximum. If you only want the maximum value, you can do something like this:
data %>%
filter(AB >= 100) %>%
summarise(max = max(avg, na.rm = TRUE))
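Since the example above uses random data, here is a deterministic base R sketch of the same logic with a strict AB > 100 cutoff. The values are invented for illustration; eligible, best_avg and best_row are names introduced here.

```r
# Toy data: one row sits exactly at AB == 100, one eligible row has NaN
data <- data.frame(
  AB  = c(50, 100, 120, 150, 90),
  avg = c(0.40, 0.35, 0.31, NaN, 0.38)
)

# Keep only rows with strictly more than 100 at-bats
eligible <- data[data$AB > 100, ]

# na.rm = TRUE drops NaN as well as NA before taking the maximum
best_avg <- max(eligible$avg, na.rm = TRUE)

# which.max() skips NaN, so this returns the whole row holding that maximum
best_row <- eligible[which.max(eligible$avg), ]

best_avg
best_row
```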
I'm trying to 're-count' a column in R and am having issues while cleaning up the data. I'm cleaning the data by location, and the problem appears once I change CA to California.
all_location <- read.csv("all_location.csv", stringsAsFactors = FALSE)
all_location <- count(all_location, location)
all_location <- all_location[with(all_location, order(-n)), ]
all_location
A tibble: 100 x 2
location n
<chr> <int>
1 CA 3216
2 Alaska 2985
3 Nevada 949
4 Washington 253
5 Hawaii 239
6 Montana 218
7 Puerto Rico 149
8 California 126
9 Utah 83
10 NA 72
In the output above, there are both CA and California. Below, I use grep and replace to change CA to California. However, my issue is that the result still shows two separate rows for California instead of one.
ca1 <- grep("CA",all_location$location)
all_location$location <- replace(all_location$location,ca1,"California")
all_location
A tibble: 100 x 2
location n
<chr> <int>
1 California 3216
2 Alaska 2985
3 Nevada 949
4 Washington 253
5 Hawaii 239
6 Montana 218
7 Puerto Rico 149
8 California 126
9 Utah 83
10 NA 72
My goal is to combine both rows into a single total under n.
all_location$location[substr(all_location$location, 1, 5) %in% "Calif" ] <- "California"
This makes sure that everything that starts with "Calif" becomes "California".
I am assuming that a variant such as "California " (with a trailing space) is already present in your data, which may be why this is happening.
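Note that relabelling alone leaves two rows, because the counts in n were computed before the labels were changed; the rows still have to be summed after the relabelling. A minimal sketch, assuming the table has the location and n columns shown in the question (the data below is a shortened stand-in and combined is a name introduced here):

```r
library(dplyr)

# Shortened stand-in for the counted table in the question
all_location <- data.frame(
  location = c("CA", "Alaska", "Nevada", "California", NA),
  n        = c(3216L, 2985L, 949L, 126L, 72L),
  stringsAsFactors = FALSE
)

combined <- all_location %>%
  # trimws() guards against stray spaces; the exact match on "CA" avoids
  # touching other values that merely contain those letters, as grep() would
  mutate(location = trimws(location),
         location = ifelse(location %in% "CA", "California", location)) %>%
  # re-aggregate so that rows sharing a label are summed into one
  group_by(location) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n))

combined
```

Doing the relabelling on the raw data before the first count() would avoid the second aggregation altogether.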
I have this local data frame:
Source: local data frame [792 x 3]
team player_name g
1 Anaheim PERRY_COREY 31
2 Anaheim GETZLAF_RYAN 22
3 Dallas BENN_JAMIE 25
4 Pittsburgh CROSBY_SIDNEY 20
5 Toronto KESSEL_PHIL 27
6 Edmonton HALL_TAYLOR 16
7 Dallas SEGUIN_TYLER 24
8 Montreal VANEK_THOMAS 19
9 Colorado LANDESKOG_GABRIEL 18
10 Chicago SHARP_PATRICK 22
.. ... ... ..
I want to be able to rank the teams based on their average number of goals (g) per player. Here is what I did (really feels suboptimal):
library(dplyr)
d1 <- select(df, team, g, player_name)
c1 <- count(d1, team, wt = g)
c2 <- count(d1, team, wt = n_distinct(player_name))
c3 <- cbind(c1, c2[,2])
c4 <- c3[,2] / c3[,3]
c5 <- cbind(c3, c4)
colnames(c5) <- c("team", "ttgpt", "ttnp", "agpp")
c6 <- mutate(c5, rank = row_number(desc(c4)))
c7 <- filter(c6, rank <=10)
c8 <- arrange(c7, rank)
And here is the result of c8:
team ttgpt ttnp agpp rank
1 Chicago 177 23 7.695652 1
2 Colorado 164 23 7.130435 2
3 Anaheim 180 26 6.923077 3
4 NY_Rangers 153 23 6.652174 4
5 Boston 179 27 6.629630 5
6 San_Jose 157 25 6.280000 6
7 Dallas 155 25 6.200000 7
8 St._Louis 148 24 6.166667 8
9 Ottawa 160 26 6.153846 9
10 Philadelphia 140 23 6.086957 10
I would like to recreate this table with consistent use of %>%
See the CSV for a reproducible example: playerstats.csv
Ok from what you said:
library(dplyr)
df <- read.csv("../Downloads/playerstats.csv", header = TRUE, sep = ",")
# %>% must end a line, not start one, or R stops parsing after the first line
df %>%
  group_by(Team) %>%
  summarise(ttgp = sum(G), ttnp = n_distinct(Player.Name), agp = sum(G) / n_distinct(Player.Name)) %>%
  mutate(rank = rank(desc(agp))) %>%
  filter(rank <= 10) %>%
  arrange(rank)
Source: local data frame [10 x 5]
Team ttgp ttnp agp rank
1 Chicago 177 23 7.695652 1
2 Colorado 164 23 7.130435 2
3 Anaheim 180 26 6.923077 3
4 NY Rangers 153 23 6.652174 4
5 Boston 179 27 6.629630 5
6 San Jose 157 25 6.280000 6
7 Dallas 155 25 6.200000 7
8 St. Louis 148 24 6.166667 8
9 Ottawa 160 26 6.153846 9
10 Philadelphia 140 23 6.086957 10
Note that I am not sure what you mean by ttgpt and ttnp, so I had to guess.
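Since playerstats.csv is not reproduced here, the same pipeline can be checked on a toy data frame. This sketch uses the lowercase column names from the question (team, player_name, g) and invented values; min_rank() is swapped in for rank() so that ties get integer ranks rather than averaged ones.

```r
library(dplyr)

# Invented toy data with the question's column names
df <- data.frame(
  team        = c("A", "A", "B", "B", "B", "C"),
  player_name = c("P1", "P2", "P3", "P4", "P5", "P6"),
  g           = c(10, 20, 5, 5, 5, 12),
  stringsAsFactors = FALSE
)

ranked <- df %>%
  group_by(team) %>%
  summarise(ttgpt = sum(g),                  # total goals per team
            ttnp  = n_distinct(player_name), # number of players per team
            agpp  = ttgpt / ttnp) %>%        # average goals per player
  mutate(rank = min_rank(desc(agpp))) %>%
  filter(rank <= 10) %>%
  arrange(rank)

ranked
```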