I have two columns and I am trying to merge them into one.
library(tibble)
a <- tribble(
~Life_Expectancy_At_Birth_1960, ~Life_Expectancy_At_Birth_2013,
65.5693658536586, 75.3286585365854,
32.328512195122, 60.0282682926829,
32.9848292682927, 51.8661707317073,
62.2543658536585, 77.537243902439,
52.2432195121951, 77.1956341463415,
)
The result I want is:
Life_Expectancy
65.5693658536586
75.3286585365854
32.328512195122
60.0282682926829
32.9848292682927
51.8661707317073
62.2543658536585
77.537243902439
52.2432195121951
77.1956341463415
and so on
Any help would be great. Thank you!
Here's one way with re-shaping via pivot_longer():
dat <- tibble::tribble(
~Life_Expectancy_At_Birth_1960, ~Life_Expectancy_At_Birth_2013,
65.5693658536586, 75.3286585365854,
32.328512195122, 60.0282682926829,
32.9848292682927, 51.8661707317073,
62.2543658536585, 77.537243902439,
52.2432195121951, 77.1956341463415)
library(tidyr)
library(dplyr)
dat %>% mutate(obs= 1:n()) %>%
pivot_longer(-obs, names_to="variable", values_to="var") %>%
arrange(obs, variable) %>%
select(-c(obs, variable))
# # A tibble: 10 x 1
# var
# <dbl>
# 1 65.6
# 2 75.3
# 3 32.3
# 4 60.0
# 5 33.0
# 6 51.9
# 7 62.3
# 8 77.5
# 9 52.2
# 10 77.2
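The same row-by-row interleaving can also be sketched in base R. This is a minimal illustration on the first three rows of the question's data, not a replacement for the pivot_longer() approach; the object name interleaved is made up for the example, and the column name Life_Expectancy is taken from the desired output.

```r
library(tibble)

a <- tribble(
  ~Life_Expectancy_At_Birth_1960, ~Life_Expectancy_At_Birth_2013,
  65.5693658536586, 75.3286585365854,
  32.328512195122,  60.0282682926829,
  32.9848292682927, 51.8661707317073
)

# t() turns each row into a column; as.vector() then reads the matrix
# column by column, which yields row 1 (1960, 2013), row 2 (1960, 2013), ...
interleaved <- tibble(Life_Expectancy = as.vector(t(as.matrix(a))))
interleaved
```

This relies on every column being numeric, since as.matrix() would otherwise coerce everything to character.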
You probably want melt from the data.table package. Without seeing more details of what your whole data looks like it's difficult to be more specific than that.
I want to take the average of each column (except the date) after every seven rows. I tried the approach below, but I was getting incorrect values. This method also seems really long. Is there a way to shorten it?
bankamerica = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/bankamerica.csv')
library(tidyverse)
GroupLabels <- 0:(nrow(bankamerica) - 1)%/% 7
bankamerica$Group <- GroupLabels
Avgs <- bankamerica %>%
group_by(bankamerica$Group) %>%
summarize(Avg = mean(bankamerica$tr))
EDITED: Just realized this code provides the incorrect values
I think you're on the right path.
bankamerica %>%
mutate(group = cumsum(row_number() %% 7 == 1)) %>%
group_by(group) %>%
summarise(caldate = first(caldate), across(-caldate, mean)) %>%
select(-group)
# A tibble: 144 × 3
# caldate tr var
# <chr> <dbl> <dbl>
# 1 1/2/01 28.9 -50.6
# 2 1/11/01 23.6 -45.4
# 3 1/23/01 20.9 -45
# 4 2/1/01 17.4 -48
# 5 2/12/01 14.4 -48
# 6 2/21/01 17 -48.9
# 7 3/2/01 19.1 -56
# 8 3/13/01 19.4 -56.9
# 9 3/22/01 23.3 -55.7
#10 4/2/01 7.71 -58.3
This averages every 7 rows, not every 7 days; because some days are missing from the data, the two are not the same.
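If the goal is 7 calendar days rather than 7 rows, one possible sketch is to parse the dates and bucket them by days elapsed since the first date. The toy data below is a made-up stand-in that only reuses the column names caldate, tr and var; the month/day/two-digit-year format is an assumption based on the printed output above, and window is a hypothetical helper column.

```r
library(dplyr)

# Hypothetical stand-in for the bankamerica data: same column names,
# made-up values, with a gap between 1/5/01 and 1/9/01.
toy <- tibble(
  caldate = c("1/2/01", "1/3/01", "1/4/01", "1/5/01", "1/9/01", "1/10/01", "1/11/01"),
  tr  = c(10, 20, 30, 40, 50, 60, 70),
  var = c(-1, -2, -3, -4, -5, -6, -7)
)

weekly <- toy %>%
  mutate(
    date = as.Date(caldate, format = "%m/%d/%y"),
    # 0 for the first 7 calendar days, 1 for the next 7, and so on
    window = as.integer(date - min(date)) %/% 7
  ) %>%
  group_by(window) %>%
  summarise(caldate = first(caldate), across(c(tr, var), mean))
weekly
```

Here the first four rows fall in one window and the last three in the next, even though a row-count grouping would have put all seven together.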
I am trying to efficiently scrape weekly tournament data from pgatour.com, and place the results in one encompassing table. Below, is an example link that I will use:
https://www.pgatour.com/stats/stat.02568.y2019.eon.t041.html
In the example link - 02568 is one of many stat_id's and t041 is one of many tournament_id's. I want the scrape to get every combo of stat_id and tournament_id in the following manner:
Currently, my lapply is cycling through both id's at the same time and I am only getting 3 of the possible 9 combinations. Is there a way to change my lapply call to cycle through both id's in the desired manner?
library(rvest)
library(dplyr)
library(stringr)
tournament_id <- c("t041", "t054", "t464")
stat_id <- c("02568", "02567", "02564")
url_g <- c(paste('https://www.pgatour.com/stats/stat.', stat_id, '.y2019.eon.', tournament_id,'.html', sep =""))
test_table_pga4 <- lapply(url_g, function(i){
page2 <- read_html(i)
test_table_pga5 <- page2 %>% html_nodes("#statsTable") %>% html_table() %>% .[[1]] %>%
mutate(tournament = i)
})
test_golf7 <- as_tibble(rbind.fill(test_table_pga4))
Use expand.grid() to create unique combinations of stat_id and tournament_id and then mutate a new column with those links.
library(tidyverse)
library(janitor)
library(rvest)
library(magrittr) # provides the %$% exposition pipe used below
df <- expand.grid(
tournament_id = c("t041", "t054", "t464"),
stat_id = c("02568", "02567", "02564")
) %>%
mutate(
links = paste0(
'https://www.pgatour.com/stats/stat.',
stat_id,
'.y2019.eon.',
tournament_id,
'.html'
)
) %>%
as_tibble()
# Function to get the table
get_info <- function(link, tournament) {
link %>%
read_html() %>%
html_table() %>%
.[[2]] %>%
clean_names() %>%
select(-rank_last_week ) %>%
mutate(rank_this_week = rank_this_week %>%
as.character,
tournament = tournament) %>%
relocate(tournament)
}
# Retrieve the tables and bind them
df %$%
map2_dfr(links, tournament_id, get_info)
# A tibble: 648 × 9
tournament rank_this_week player_name rounds average total_sg_app
<fct> <chr> <chr> <int> <dbl> <dbl>
1 t041 1 Corey Conners 4 2.89 11.6
2 t041 2 Matt Kuchar 4 2.16 8.62
3 t041 3 Byeong Hun An 4 1.90 7.60
4 t041 4 Charley Hoffman 4 1.72 6.88
5 t041 5 Ryan Moore 4 1.43 5.73
6 t041 6 Brian Stuard 4 1.42 5.69
7 t041 7 Danny Lee 4 1.30 5.18
8 t041 8 Cameron Tringale 4 1.22 4.88
9 t041 9 Si Woo Kim 4 1.22 4.87
10 t041 10 Scottie Scheffler 4 1.16 4.62
# … with 638 more rows, and 3 more variables: measured_rounds <int>,
# total_sg_ott <dbl>, total_sg_putting <dbl>
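The expand.grid() step can be checked on its own, without touching the website. The sketch below only builds the URLs; combos is a name made up for the example, and stringsAsFactors = FALSE is an optional choice here so that the ids stay character rather than factor.

```r
# Every pairing of tournament and stat id: 3 x 3 = 9 rows
combos <- expand.grid(
  tournament_id = c("t041", "t054", "t464"),
  stat_id = c("02568", "02567", "02564"),
  stringsAsFactors = FALSE
)

# paste0() is vectorised, so one call builds all nine links
combos$links <- paste0(
  "https://www.pgatour.com/stats/stat.", combos$stat_id,
  ".y2019.eon.", combos$tournament_id, ".html"
)

nrow(combos)
combos$links[1]
```

By contrast, calling paste() directly on the two id vectors pairs them element by element, which is why the original lapply() only saw 3 of the 9 combinations.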
I have a dataframe with variables from COMPUSTAT containing data on various accounting items, including SG&A expenses from different companies.
I want to create a new variable in the dataframe which accumulates the SG&A expenses for each company in chronological order. I use PERMNO codes as the unique ID for each company.
I have tried this code, however it does not seem to work:
crsp.comp2$cxsgaq <- crsp.comp2 %>%
group_by(permno) %>%
arrange(date) %>%
mutate_at(vars(xsgaq), cumsum(xsgaq))
(xsgaq is the COMPUSTAT variable for SG&A expenses)
Thank you very much for your help
Your example code is attempting to write the entire dataframe crsp.comp2 into a single column, crsp.comp2$cxsgaq.
vars() takes bare column names, but mutate_at() expects a function such as cumsum as its second argument, not a call like cumsum(xsgaq). In your situation, use the standard mutate() function and assign the cxsgaq variable there.
crsp.comp2 <- crsp.comp2 %>%
group_by(permno) %>%
arrange(date) %>%
mutate(cxsgaq = cumsum(xsgaq))
Reproducible example with iris dataset:
library(tidyverse)
iris %>%
group_by(Species) %>%
arrange(Sepal.Length) %>%
mutate(C.Sepal.Width = cumsum(Sepal.Width))
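One thing worth checking on real accounting data is missing values, because cumsum() returns NA from the first NA onward within a group. The sketch below uses a made-up toy panel that borrows the column names from the question; cxsgaq_fill is a hypothetical extra column, and treating a missing quarter as zero is an assumption that may not suit every analysis.

```r
library(dplyr)

# Hypothetical toy panel; column names follow the question, values are made up
toy <- tibble(
  permno = c(1, 1, 1, 2, 2),
  date   = as.Date(c("2020-06-30", "2020-03-31", "2020-09-30",
                     "2020-03-31", "2020-06-30")),
  xsgaq  = c(2, 1, NA, 10, 20)
)

result <- toy %>%
  arrange(permno, date) %>%   # chronological order within each company
  group_by(permno) %>%
  mutate(
    cxsgaq      = cumsum(xsgaq),              # NA from the first missing quarter onward
    cxsgaq_fill = cumsum(coalesce(xsgaq, 0))  # treats a missing quarter as zero
  ) %>%
  ungroup()
result
```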
Building on the answer from @m-viking, if using the WRDS PostgreSQL server, you would simply use window_order (from dbplyr) in place of arrange. (I use the Compustat firm identifier gvkey in place of permno so that this code works, but the idea is the same.)
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres(),
bigint = "integer", sslmode='allow')
fundq <- tbl(pg, sql("SELECT * FROM comp.fundq"))
comp2 <-
fundq %>%
filter(indfmt == "INDL", datafmt == "STD",
consol == "C", popsrc == "D")
comp2 <-
comp2 %>%
group_by(gvkey) %>%
dbplyr::window_order(datadate) %>%
mutate(cxsgaq = cumsum(xsgaq))
comp2 %>%
filter(!is.na(xsgaq)) %>%
select(gvkey, datadate, xsgaq, cxsgaq)
#> # Source: lazy query [?? x 4]
#> # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
#> # Groups: gvkey
#> # Ordered by: datadate
#> gvkey datadate xsgaq cxsgaq
#> <chr> <date> <dbl> <dbl>
#> 1 001000 1966-12-31 0.679 0.679
#> 2 001000 1967-12-31 1.02 1.70
#> 3 001000 1968-12-31 5.86 7.55
#> 4 001000 1969-12-31 7.18 14.7
#> 5 001000 1970-12-31 8.25 23.0
#> 6 001000 1971-12-31 7.96 30.9
#> 7 001000 1972-12-31 7.55 38.5
#> 8 001000 1973-12-31 8.53 47.0
#> 9 001000 1974-12-31 8.86 55.9
#> 10 001000 1975-12-31 9.59 65.5
#> # … with more rows
Created on 2021-04-05 by the reprex package (v1.0.0)
I have a data set with a column of countries and a column of the time it took each runner to complete a marathon. I want to find out which 5 countries completed it in the shortest time on average. I am new to R, so I only have basic knowledge. The time column is in hours. Example of the data: marathon$Countries is a column of the nationality of each runner, and marathon$OverallHrs is the overall time it took each runner to complete the marathon.
I have tried
tapply(marathon$OverallHrs, marathon$Country, mean)
It hasn't worked the way I want it to.
I am assuming that you are not referring to the trivial case where no country is repeated in your "Country" column. For a beginner in R, I would strongly encourage you to start learning with the "tidyverse" package.
Below is the solution, where you can have repeated countries for the column "Country"
library(tidyverse)
set.seed(123)
# Generate 10 Countries, each one 5 times
A = sample(rep(1:10,5))
# Generate 50 random timing from (5-20)
B = round(runif(50)*15 + 5)
#Create a dataframe with columns (Country, Timing), rows = 50
df = data.frame("Country" = paste0("Country",A),
"Timing" = B)
#Dataframe will look like this
# Country Timing
# 1 Country5 15
# 2 Country4 17
# 3 Country4 5
# 4 Country3 12
# 5 Country5 16
# Calculate average marathon timing
df_mean <- df %>%
group_by(Country) %>% #Group
summarise(Mean_Timing = mean(Timing), .groups = 'drop') %>% #Calculate Mean_Timing
arrange(Mean_Timing) # Arrange by fastest timing first
#Dataframe = df_mean
# A tibble: 10 x 2
# Country Mean_Timing
# <chr> <dbl>
# 1 Country9 10.6
# 2 Country1 11.4
# 3 Country3 11.4
# 4 Country4 11.4
# 5 Country2 12.2
# 6 Country10 12.6
# 7 Country8 13.2
# 8 Country7 13.6
# 9 Country5 15
# 10 Country6 15.2
# To get the 5 fastest countries:
df_mean$Country[1:5]
# "Country9" "Country1" "Country3" "Country4" "Country2"
There is always the aggregate function in base R for calculating the mean per group. It takes less code, but I still prefer the tidyverse method, as it becomes intuitive after a while and can be adapted to most data frame questions.
Anyway, here is the solution using aggregate.
df_mean2 <- aggregate(df[, 2], list(df$Country), mean) # Calculate Mean
df_mean2[order(df_mean2$x), ] # Sort by ascending
Group.1 x
10 Country9 10.6
1 Country1 11.4
4 Country3 11.4
5 Country4 11.4
3 Country2 12.2
2 Country10 12.6
9 Country8 13.2
8 Country7 13.6
6 Country5 15.0
7 Country6 15.2
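The tapply() call from the question already computes the mean per country; it only needs sorting and truncating. A minimal base R sketch, assuming the columns are named Country and OverallHrs as in that call, with made-up values standing in for the real data (avg and fastest5 are names invented for the example):

```r
# Hypothetical stand-in for the marathon data; values are made up
marathon <- data.frame(
  Country    = c("A", "A", "B", "B", "C", "C", "D", "E", "F"),
  OverallHrs = c(4, 5, 3, 3.5, 6, 5, 2.5, 7, 4.2)
)

# Named numeric vector: one mean per country
avg <- tapply(marathon$OverallHrs, marathon$Country, mean)

# sort() is ascending by default, so the fastest countries come first
fastest5 <- head(sort(avg), 5)
names(fastest5)
```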
I have written the following script to get the data into longer format. How can I get the data.frame arranged by variable rather than by Date? That is, I should first get the data for variable A for all the dates, followed by variable X.
library(lubridate)
library(tidyverse)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("1979-01-01"), to = as.Date("1979-12-31"), by = "day"),
A = runif(365,1,10), X = runif(365,5,15)) %>%
pivot_longer(-Date, names_to = "Variables", values_to = "Values")
Maybe I have not understood correctly, but you can arrange your data by the Variables column with the arrange() function.
library(tidyverse)
DF <- DF %>%
arrange(Variables)
This results in:
# A tibble: 730 x 3
Date Variables Values
<date> <chr> <dbl>
1 1979-01-01 A 3.59
2 1979-01-02 A 8.09
3 1979-01-03 A 4.68
4 1979-01-04 A 8.95
5 1979-01-05 A 9.46
6 1979-01-06 A 1.41
7 1979-01-07 A 5.75
8 1979-01-08 A 9.03
9 1979-01-09 A 5.96
10 1979-01-10 A 5.11
# ... with 720 more rows
In base R, we can use
DF1 <- DF[order(DF$Variables),]
Am I missing something? This is it.
arrange(DF, Variables, Date) %>% select(Variables, everything())