I have written the following script to get the data in long format. How can I get the data.frame arranged by variable rather than by Date? That is, I want the data for variable A for all dates first, followed by the data for variable X.
library(lubridate)
library(tidyverse)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("1979-01-01"), to = as.Date("1979-12-31"), by = "day"),
A = runif(365,1,10), X = runif(365,5,15)) %>%
pivot_longer(-Date, names_to = "Variables", values_to = "Values")
Maybe I have not understood correctly, but you can arrange your data by the Variables column with the arrange() function.
library(tidyverse)
DF <- DF %>%
arrange(Variables)
This results in:
# A tibble: 730 x 3
Date Variables Values
<date> <chr> <dbl>
1 1979-01-01 A 3.59
2 1979-01-02 A 8.09
3 1979-01-03 A 4.68
4 1979-01-04 A 8.95
5 1979-01-05 A 9.46
6 1979-01-06 A 1.41
7 1979-01-07 A 5.75
8 1979-01-08 A 9.03
9 1979-01-09 A 5.96
10 1979-01-10 A 5.11
# ... with 720 more rows
In base R, we can use
DF1 <- DF[order(DF$Variables),]
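If the dates should also stay in chronological order within each variable, order() accepts more than one key. A minimal sketch on a small made-up data frame (the values are illustrative, not the ones from the question):

```r
# Toy data in the same long layout: rows alternate between variables A and X
DF <- data.frame(
  Date = rep(as.Date("1979-01-01") + 0:2, each = 2),
  Variables = rep(c("A", "X"), times = 3),
  Values = 1:6
)

# Sort by variable first, then by date within each variable
DF_sorted <- DF[order(DF$Variables, DF$Date), ]
DF_sorted
```

Later keys passed to order() only break ties in the earlier ones, so all rows for A come first, each block sorted by Date.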
Am I missing something? This is it.
arrange (DF,Variables,Date) %>% select(Variables,everything())
I am looking for ideas on how to write the pivot_longer() call for this data.
I have searched the internet but cannot find any example similar to my data, where each Metric column is itself separated into 3 columns, one per month.
My desired final output has seven columns: regions, months, and the five metrics.
How should I formulate the pivot_longer and pivot_wider calls to clean my data so that I can visualize it?
The tricky part isn't pivot_longer. You first have to clean your Excel spreadsheet, i.e. get rid of empty rows and merge the two header rows containing the names of the variables and the dates.
One approach to achieve your desired result may look like so:
library(readxl)
library(tidyr)
library(janitor)
library(dplyr)
x <- read_excel("data/Employment.xlsx", skip = 3, col_names = FALSE) %>%
# Get rid of empty rows and cols
janitor::remove_empty()
# Make column names
col_names <- data.frame(t(x[1:2,])) %>%
fill(1) %>%
unite(name, 1:2, na.rm = TRUE) %>%
pull(name)
x <- x[-c(1:2),]
names(x) <- col_names
# Convert to long and values to numerics
x %>%
pivot_longer(-Region, names_to = c(".value", "months"), names_sep = "_") %>%
separate(months, into = c("month", "year")) %>%
mutate(across(!c(Region, month, year), as.numeric))
#> # A tibble: 6 × 8
#> Region month year `Total Population … `Labor Force Part… `Employment Rat…
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Philippin… April 2020f 73722. 55.7 82.4
#> 2 Philippin… Janu… 2021p 74733. 60.5 91.3
#> 3 Philippin… April 2021p 74971. 63.2 91.3
#> 4 National … April 2020f 9944. 54.2 87.7
#> 5 National … Janu… 2021p 10051. 57.2 91.2
#> 6 National … April 2021p 10084. 60.1 85.6
#> # … with 2 more variables: Unemployment Rate <dbl>, Underemployment Rate <dbl>
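The header-merging step can be tried without the Excel file. The sketch below uses a small made-up data frame (the names raw, Population and Rate and all values are hypothetical) whose first two rows play the role of the two header rows, and it assumes R 4.0 or later so that data.frame() keeps strings as character:

```r
library(tidyr)
library(dplyr)

# Hypothetical raw sheet: row 1 = metric names (NA where the cell was merged),
# row 2 = period labels, remaining rows = data read in as text
raw <- data.frame(
  V1 = c("Region", NA, "North", "South"),
  V2 = c("Population", "Jan 2021", "10", "20"),
  V3 = c(NA, "Apr 2021", "11", "21"),
  V4 = c("Rate", "Jan 2021", "0.5", "0.6"),
  V5 = c(NA, "Apr 2021", "0.7", "0.8")
)

# Transpose the two header rows, fill the metric name down, and glue metric and period
col_names <- data.frame(t(raw[1:2, ])) %>%
  fill(1) %>%
  unite(name, 1:2, na.rm = TRUE) %>%
  pull(name)

dat <- raw[-(1:2), ]
names(dat) <- col_names

# ".value" sends the part before "_" to column names and the part after it to `period`
long <- dat %>%
  pivot_longer(-Region, names_to = c(".value", "period"), names_sep = "_") %>%
  mutate(across(!c(Region, period), as.numeric))
long
```

Each region ends up with one row per period and one column per metric, which is the same shape as the output above.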
I have a dataframe with variables from COMPUSTAT containing data on various accounting items, including SG&A expenses from different companies.
I want to create a new variable in the dataframe which accumulates the SG&A expenses for each company in chronological order. I use PERMNO codes as the unique ID for each company.
I have tried this code, however it does not seem to work:
crsp.comp2$cxsgaq <- crsp.comp2 %>%
group_by(permno) %>%
arrange(date) %>%
mutate_at(vars(xsgaq), cumsum(xsgaq))
(xsgaq is the COMPUSTAT variable for SG&A expenses)
Thank you very much for your help
Your example code attempts to write the entire data frame crsp.comp2 into the single column crsp.comp2$cxsgaq.
vars() accepts bare column names, but mutate_at() expects a function (such as cumsum) as its second argument, not a call like cumsum(xsgaq). In your situation it is simpler to use the standard mutate() function and assign the cxsgaq variable there.
crsp.comp2 <- crsp.comp2 %>%
group_by(permno) %>%
arrange(date) %>%
mutate(cxsgaq = cumsum(xsgaq))
Reproducible example with iris dataset:
library(tidyverse)
iris %>%
group_by(Species) %>%
arrange(Sepal.Length) %>%
mutate(C.Sepal.Width = cumsum(Sepal.Width))
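One thing to watch with accounting data: cumsum() returns NA for every row after the first missing value in a group. A minimal sketch on made-up numbers, assuming it is acceptable to treat a missing quarter as zero (that choice is an assumption and may not suit every analysis):

```r
library(dplyr)

# Hypothetical toy panel: two firms, unsorted dates, one missing expense value
toy <- tibble::tibble(
  permno = c(1, 1, 1, 2, 2),
  date = as.Date(c("2020-06-30", "2020-03-31", "2020-09-30",
                   "2020-03-31", "2020-06-30")),
  xsgaq = c(2, 1, NA, 5, 4)
)

out <- toy %>%
  arrange(permno, date) %>%                             # chronological within firm
  group_by(permno) %>%
  mutate(cxsgaq = cumsum(coalesce(xsgaq, 0))) %>%       # NA counted as 0 (assumption)
  ungroup()
out
```

Dropping coalesce() restores the default behaviour, where the running total becomes NA from the first missing quarter onwards.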
Building on the answer from @m-viking, if using the WRDS PostgreSQL server, you would simply use window_order (from dbplyr) in place of arrange. (I use the Compustat firm identifier gvkey in place of permno so that this code works, but the idea is the same.)
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres(),
bigint = "integer", sslmode='allow')
fundq <- tbl(pg, sql("SELECT * FROM comp.fundq"))
comp2 <-
fundq %>%
filter(indfmt == "INDL", datafmt == "STD",
consol == "C", popsrc == "D")
comp2 <-
comp2 %>%
group_by(gvkey) %>%
dbplyr::window_order(datadate) %>%
mutate(cxsgaq = cumsum(xsgaq))
comp2 %>%
filter(!is.na(xsgaq)) %>%
select(gvkey, datadate, xsgaq, cxsgaq)
#> # Source: lazy query [?? x 4]
#> # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
#> # Groups: gvkey
#> # Ordered by: datadate
#> gvkey datadate xsgaq cxsgaq
#> <chr> <date> <dbl> <dbl>
#> 1 001000 1966-12-31 0.679 0.679
#> 2 001000 1967-12-31 1.02 1.70
#> 3 001000 1968-12-31 5.86 7.55
#> 4 001000 1969-12-31 7.18 14.7
#> 5 001000 1970-12-31 8.25 23.0
#> 6 001000 1971-12-31 7.96 30.9
#> 7 001000 1972-12-31 7.55 38.5
#> 8 001000 1973-12-31 8.53 47.0
#> 9 001000 1974-12-31 8.86 55.9
#> 10 001000 1975-12-31 9.59 65.5
#> # … with more rows
Created on 2021-04-05 by the reprex package (v1.0.0)
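To see what window_order() changes without a WRDS account, dbplyr can translate the pipeline against a simulated PostgreSQL backend. This is only a sketch: fundq_sim is a hypothetical one-row stand-in for comp.fundq, and no database connection is opened.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Hypothetical stand-in table; simulate_postgres() only selects the SQL dialect
fundq_sim <- lazy_frame(
  gvkey = "001000",
  datadate = as.Date("1966-12-31"),
  xsgaq = 0.679,
  con = simulate_postgres()
)

query <- fundq_sim %>%
  group_by(gvkey) %>%
  window_order(datadate) %>%
  mutate(cxsgaq = cumsum(xsgaq))

# Render the SQL that would be sent to the server
sql_text <- as.character(sql_render(query))
cat(sql_text, "\n")
```

The rendered query should contain a windowed sum partitioned by gvkey and ordered by datadate, which is why the ordering has to be declared with window_order() rather than arrange().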
I have two columns and I am trying to merge them into one.
library(tibble)
a <- tribble(
~Life_Expectancy_At_Birth_1960, ~Life_Expectancy_At_Birth_2013,
65.5693658536586, 75.3286585365854,
32.328512195122, 60.0282682926829,
32.9848292682927, 51.8661707317073,
62.2543658536585, 77.537243902439,
52.2432195121951, 77.1956341463415,
)
The result I want is:
Life_Expectancy
65.5693658536586
75.3286585365854
32.328512195122
60.0282682926829
32.9848292682927
51.8661707317073
62.2543658536585
77.537243902439
52.2432195121951
77.1956341463415
and so on
Any help would be great. Thank you!
Here's one way with re-shaping via pivot_longer():
dat <- tibble::tribble(
~Life_Expectancy_At_Birth_1960, ~Life_Expectancy_At_Birth_2013,
65.5693658536586, 75.3286585365854,
32.328512195122, 60.0282682926829,
32.9848292682927, 51.8661707317073,
62.2543658536585, 77.537243902439,
52.2432195121951, 77.1956341463415)
library(tidyr)
library(dplyr)
dat %>% mutate(obs= 1:n()) %>%
pivot_longer(-obs, names_to="variable", values_to="var") %>%
arrange(obs, variable) %>%
select(-c(obs, variable))
# # A tibble: 10 x 1
# var
# <dbl>
# 1 65.6
# 2 75.3
# 3 32.3
# 4 60.0
# 5 33.0
# 6 51.9
# 7 62.3
# 8 77.5
# 9 52.2
# 10 77.2
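If no extra packages are wanted, the same interleaving can be sketched in base R by transposing the data and flattening it. The example below uses the first three rows of the question's data, rounded to two decimals for brevity:

```r
# First three rows of the question's data, rounded (illustrative values)
a <- data.frame(
  Life_Expectancy_At_Birth_1960 = c(65.57, 32.33, 32.98),
  Life_Expectancy_At_Birth_2013 = c(75.33, 60.03, 51.87)
)

# t() puts each original row into a column; c() then reads the matrix column
# by column, which yields row 1 (1960, 2013), row 2 (1960, 2013), and so on
merged <- data.frame(Life_Expectancy = c(t(as.matrix(a))))
merged
```

This relies on both columns being numeric; with mixed types as.matrix() would convert everything to character.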
You probably want melt from the data.table package. Without seeing more details of what your whole data looks like it's difficult to be more specific than that.
I have the DF data.frame. I would like to add another column (call it Station_no) that extracts the number after the underscore from the Variables column.
library(lubridate)
library(tidyverse)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("1979-01-01"), to = as.Date("1979-12-31"), by = "day"),
Grid_2 = runif(365,1,10), Grid_20 = runif(365,5,15)) %>%
pivot_longer(-Date, names_to = "Variables", values_to = "Values")
Desired Output:
DF_out <- data.frame(Date = c("1979-01-01","1979-01-01"),Variables = c("Grid_2","Grid_20"),
Values = c(0.95,1.3), Station_no = c(2,20))
An easy option is parse_number(), which returns the value converted to numeric
library(dplyr)
DF %>%
mutate(Station_no = readr::parse_number(Variables))
Or using str_extract (in case we want to go by the pattern); note that this returns a character column
library(stringr)
DF %>%
mutate(Station_no = str_extract(Variables, "(?<=_)\\d+"))
Or using base R
DF$Station_no <- trimws(DF$Variables, whitespace = '\\D+')
A base R solution would be:
#Code
DF$Station_no <- sub("^[^_]*_", "", DF$Variables)
Output (some rows):
# A tibble: 730 x 4
Date Variables Values Station_no
<date> <chr> <dbl> <chr>
1 1979-01-01 Grid_2 3.59 2
2 1979-01-01 Grid_20 12.8 20
3 1979-01-02 Grid_2 8.09 2
4 1979-01-02 Grid_20 6.93 20
5 1979-01-03 Grid_2 4.68 2
6 1979-01-03 Grid_20 5.18 20
7 1979-01-04 Grid_2 8.95 2
8 1979-01-04 Grid_20 9.07 20
9 1979-01-05 Grid_2 9.46 2
10 1979-01-05 Grid_20 9.83 20
# ... with 720 more rows
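As the <chr> column type in the output shows, sub() leaves Station_no as text. If a numeric station number is wanted, as in the desired output, the result can be wrapped in as.integer(). A small sketch on a few made-up names (Grid_105 is not in the question's data):

```r
# Illustrative variable names; only Grid_2 and Grid_20 appear in the question
vars <- c("Grid_2", "Grid_20", "Grid_105")

# Remove everything up to and including the first underscore (still character)
station_chr <- sub("^[^_]*_", "", vars)

# Convert to integer so the column sorts and compares numerically
station_no <- as.integer(station_chr)
station_no
```

Sorting the character version would put "105" before "2", which is the usual reason to convert.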
I have daily rainfall data which I have converted to a year-wise cumulative value using the following code:
library(seas)
library(data.table)
library(ggplot2)
#Loading data
data(mscdata)
dat <- (mksub(mscdata, id=1108447))
dat$julian.date <- as.numeric(format(dat$date, "%j"))
DT <- data.table(dat)
DT[, Cum.Sum := cumsum(rain), by=list(year)]
df <- cbind.data.frame(day=dat$julian.date,cumulative=DT$Cum.Sum)
Then I want to apply segmented regression year-wise to get year-wise breakpoints. I was able to do it for a single year like this:
library("segmented")
x <- subset(dat,year=="1984")$julian.date
y <- subset(DT,year=="1984")$Cum.Sum
fit.lm<-lm(y~x)
segmented(fit.lm, seg.Z = ~ x, npsi=3)
I have used npsi = 3 to get 3 breakpoints. How can I apply the segmented regression dynamically to each year and get the estimated breakpoints?
Here's a short script that defines a custom function so that you can run the regression for each year.
## using tidyverse processes instead of mixing and matching with other data manipulation packages
library(tidyverse); library(segmented); library(seas)
## get mscdata from "seas" packages
data(mscdata)
dat <- (mksub(mscdata, id=1108447))
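## generate the julian day (as in the question) and the cumulative sum of rain by year
d2 <- dat %>%
  mutate(julian.date = as.numeric(format(date, "%j"))) %>%
  group_by(year) %>%
  mutate(rain_cs = cumsum(rain)) %>%
  ungroup()
## write a custom function; the argument is named yr because filter(year == year)
## would compare the year column with itself and keep every row
segmentedlm <- function(data, yr){
  subset.df <- data %>% filter(year == yr)
  fit.lm <- lm(rain_cs ~ julian.date, subset.df)
  segmented(fit.lm, seg.Z = ~ julian.date, npsi=3)
}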
# run the customised function for 1975 data
segmentedlm(d2, "1975") %>% plot(., main="1975")
segmentedlm(d2, "1984") %>% plot(., main = "1984")
To output the summary of segmented linear models of multiple years into a text file:
sink("output.txt")
lapply(c("1975", "1984"), function(x) segmentedlm(d2, x))
sink()
You can change the argument for lapply to input all the years.
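The "all the years" pattern can be sketched with split() and lapply() so that the list of years never has to be typed by hand. To keep the example runnable without the seas and segmented packages, it uses made-up data and a plain lm() as a stand-in for the fitting step; swapping in the segmented() call is left as an assumption about how you would adapt it.

```r
# Toy data standing in for d2: two years of made-up daily cumulative rainfall
set.seed(1)
toy <- data.frame(
  year = rep(c(1975L, 1976L), each = 100),
  julian.date = rep(1:100, times = 2)
)
toy$rain_cs <- ave(runif(200), toy$year, FUN = cumsum)

# One fit per year; replace the lm() call with the segmented step as needed
fit_by_year <- lapply(split(toy, toy$year), function(d) {
  lm(rain_cs ~ julian.date, data = d)
})

# Collect one row of coefficients per year, labelled by year
coefs <- do.call(rbind, lapply(fit_by_year, coef))
coefs
```

split() names each list element after its year, so the collected results stay labelled even when some years are missing from the data.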
You can store the lm object in a list and apply segmented for each year.
library(tidyverse)
data <- DT %>%
group_by(year) %>%
summarise(fit.lm = list(lm(Cum.Sum~julian.date)),
julian.date1 = list(julian.date)) %>%
mutate(out = map2(fit.lm, julian.date1, function(x, julian.date)
data.frame(segmented::segmented(x,
seg.Z = ~julian.date, npsi=3)$psi))) %>%
unnest_wider(out) %>%
unnest(cols = c(Initial, Est., St.Err)) %>%
dplyr::select(-fit.lm, -julian.date1)
# A tibble: 90 x 4
# year Initial Est. St.Err
# <int> <dbl> <dbl> <dbl>
# 1 1975 84.8 68.3 1.44
# 2 1975 168. 167. 9.31
# 3 1975 282. 281. 0.917
# 4 1976 84.8 68.3 1.44
# 5 1976 168. 167. 9.33
# 6 1976 282. 281. 0.913
# 7 1977 84.8 68.3 1.44
# 8 1977 168. 167. 9.32
# 9 1977 282. 281. 0.913
#10 1978 84.8 68.3 1.44
# … with 80 more rows