Classification dummy in R

In a large dataset of US stocks I have an integer variable containing SIC codes (https://www.sec.gov/info/edgar/siccodes.htm).
I would like to create a dummy variable indicating major group 50, i.e. a variable that takes the value 1 for durable goods and 0 otherwise.
I tried the code:
data$durable <- as.integer(grepl(pattern = "50", x = data$sic))
But this, of course, does not take the hierarchical structure of the SIC codes into account: I want to match "50" only in the first two digits.
(New to R)
/Alex

Use either integer division, or pad with zeros on the left and check the first two characters.
code <- c(100, 102, 501, 5010)
# approach 1: integer division by 100 keeps the leading digits
as.integer(as.integer(code / 100) == 50)
# approach 2: zero-pad to four digits, then compare the first two characters
as.integer(substring(sprintf("%04d", code), 1, 2) == "50")
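Applied to the question's data, either approach gives the dummy directly; a minimal sketch, assuming data$sic holds the integer codes as in the question:
# integer division by 100 keeps the two leading digits of a 4-digit code
data$durable <- as.integer(data$sic %/% 100 == 50)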

library(readxl)
library(dplyr)
library(stringi)
data_sic <- read_excel("./sic_example.xlsx")
data_sic$temp1 <- stri_sub(data_sic$SIC, 1, 2)  # first two digits (assumes 4-digit codes)
data_sic <- mutate(data_sic,
                   durable_indicator = ifelse(temp1 == "50", 1, 0))
str(data_sic)
Output:
str(data_sic)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 6 obs. of 4 variables:
$ SIC : num 4955 4961 4991 5000 5010 ...
$ Industry Title : chr "HAZARDOUS WASTE MANAGEMENT" "STEAM & AIR-CONDITIONING SUPPLY" "COGENERATION SERVICES & SMALL POWER PRODUCERS" "WHOLESALE-DURABLE GOODS" ...
$ temp1 : chr "49" "49" "49" "50" ...
$ durable_indicator: num 0 0 0 1 1 1
Addendum:
There are multiple ways to approach this problem.
I would suggest reviewing the stringi package documentation for string editing,
as well as the caret package documentation for dummifying variables and other statistical transformations.
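For instance, a minimal sketch of caret's dummyVars (toy data, hypothetical column name):
library(caret)
# toy data; dummyVars builds an encoder with one 0/1 column per factor level
df <- data.frame(division = c("durable", "nondurable", "durable"))
dv <- dummyVars(~ division, data = df)
predict(dv, newdata = df)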

Error in Surv("Time to LTFU", censored) : Time variable is not numeric

I am attempting to do a survival analysis, with "time to loss to follow up".
I have tried to fix the error by ensuring that the column is numeric (see the stringsAsFactors and colClasses arguments in the CSV read below), but it has not solved the error.
I have trawled Stack Overflow and other sites for answers, but I am stuck.
Can anyone help, please?
library(tidyverse)
library(gtsummary)
library(data.table)
library(tidyr)
library(dplyr)
library(survival)
survdat <- fread("221121_HBV_Followup_survivalanalysis.csv", stringsAsFactors = FALSE,
                 colClasses = c("Time to LTFU" = "numeric"))
# Create censoring variable (right censoring)
survdat$censored[survdat$`LTFU confirmed` == 'Yes'] <- 1
survdat$censored[survdat$`LTFU confirmed` == 'No']  <- 0
# Specify KM analysis model
km1 <- survfit(Surv('Time to LTFU', censored) ~ 1,
               data = survdat,
               type = "kaplan-meier")
#I get the following error
> km1 <- survfit(Surv('Time to LTFU', censored) ~ 1,
+ data=survdat,
+ type="kaplan-meier")
Error in Surv("Time to LTFU", censored) : Time variable is not numeric
str(survdat)
NB Have removed some of the variables for confidentiality
Classes ‘data.table’ and 'data.frame': 43 obs. of 10 variables:
$ Date screened : chr "19/10/2021" "07/07/2021" "18/01/2022" "07/05/2021" ...
$ Last date seen : chr "21/11/2022" "21/11/2022" "21/11/2022" "21/11/2022" ...
$ Time to LTFU : num 398 502 307 563 564 605 516 29 118 118 ...
$ LTFU confirmed : chr "No" "No" "No" "No" ...
$ censored : num 0 0 0 0 0 0 0 1 1 0 ...
As you can see, the "Time to LTFU" variable IS numeric!
Please help!
Thanks
Time to LTFU needs to be between backticks, not single quotes; otherwise you are supplying a string (a character value) to the function.
km1 <- survfit(Surv(`Time to LTFU`, censored) ~ 1,
               data = survdat,
               type = "kaplan-meier")
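To see the difference outside survfit(), here is a minimal sketch with a toy data frame whose column name contains spaces:
df <- data.frame(`Time to LTFU` = c(398, 502), check.names = FALSE)
df$`Time to LTFU`   # backticks refer to the column: 398 502
"Time to LTFU"      # quotes produce a character string, not the column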

Datasets built into R [duplicate]

Can someone please explain how to get the list of built-in data sets and the packages they come from?
There are several ways to find the included datasets in R:
1: Using data() will give you a list of the datasets of all loaded packages (and not only the ones from the datasets package); the datasets are ordered by package
2: Using data(package = .packages(all.available = TRUE)) will give you a list of all datasets in the available packages on your computer (i.e. also the not-loaded ones)
3: Using data(package = "packagename") will give you the datasets of that specific package, so data(package = "plyr") will give the datasets in the plyr package
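For quick reference, the three calls side by side (these are the same commands listed above):
data()                                           # datasets in all loaded packages
data(package = .packages(all.available = TRUE))  # datasets in all installed packages
data(package = "plyr")                           # datasets in one specific package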
If you want to know in which package a dataset is located (e.g. the acme dataset), you can do:
dat <- as.data.frame(data(package = .packages(all.available = TRUE))$results)
dat[dat$Item=="acme", c(1,3,4)]
which gives:
Package Item Title
107 boot acme Monthly Excess Returns
I often also need to know the structure of the available datasets, so I created dataStr in my misc package.
dataStr <- function(package = "datasets", ...) {
  d <- data(package = package, envir = new.env(), ...)$results[, "Item"]
  d <- sapply(strsplit(d, split = " ", fixed = TRUE), "[", 1)  # keep the object name only
  d <- d[order(tolower(d))]
  for (x in d) { message(x, ": ", class(get(x))); message(str(get(x))) }
}
dataStr()
Note that the output in the console is quite long.
This is the type of output:
[...]
warpbreaks: data.frame
'data.frame': 54 obs. of 3 variables:
$ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
$ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
WorldPhones: matrix
num [1:7, 1:7] 45939 60423 64721 68484 71799 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:7] "1951" "1956" "1957" "1958" ...
..$ : chr [1:7] "N.Amer" "Europe" "Asia" "S.Amer" ...
WWWusage: ts
Time-Series [1:100] from 1 to 100: 88 84 85 85 84 85 83 85 88 89 ...
Edit: To get more informative output and use it for unloaded packages or all the packages on the search path, please use the revised online version with
source("https://raw.githubusercontent.com/brry/berryFunctions/master/R/dataStr.R")
Here is a comprehensive list of datasets from R packages, maintained by Prof. Vincent Arel-Bundock:
https://vincentarelbundock.github.io/Rdatasets/
Rdatasets is a collection of 1892 datasets that were originally
distributed alongside the statistical software environment R and some
of its add-on packages. The goal is to make these data more broadly
accessible for teaching and statistical software development.
Run
help(package = "datasets")
in the RStudio console and you'll get all of that package's datasets listed in the Help tab on the right.

How to speed up code with loop in R

Problem:
I have two data frames.
DF with payment log:
str(moneyDB)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 53682 obs. of 7 variables:
$ user_id : num 59017170 57859746 58507536 59017667 59017795 ...
$ reg_date: Date, format: "2016-08-06" "2016-07-01" "2016-07-19" ...
$ date : Date, format: "2016-08-06" "2016-07-01" "2016-07-19" ...
$ money : num 0.293 0.05 0.03 0.03 7 ...
$ type : chr "1" "2" "2" "1" ...
$ quality : chr "VG" "no_quality" "no_quality" "VG" ...
$ geo : chr "Canada" "NO GEO" "NO GEO" "Canada" ...
Here is its structure. It's just a log of all transactions.
I also have a second data frame:
str(grPaysDB)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 335591 obs. of 9 variables:
$ reg_date : Date, format: "2016-05-01" "2016-05-01" "2016-05-01" ...
$ date : Date, format: "2016-05-01" "2016-05-01" "2016-05-01" ...
$ type : chr "1" "1" "1" "1" ...
$ quality : chr "VG" "VG" "VG" "VG" ...
$ geo : chr "Australia" "Canada" "Finland" "Canada" ...
$ uniqPayers : num 0 1 0 1 1 0 0 1 0 3 ...
It's the grouped data from the first data frame plus zero transactions. For example, there are a lot of rows in the second data frame with zero payers. That's why the second data frame is larger than the first.
I need to add a column weeklyPayers to the second data frame, where weekly payers is the number of unique payers over the last 7 days. I tried to do it with a loop, but it runs far too long. Are there any vectorized ideas for how to do this?
weeklyPayers <- vector()
for (i in 1:nrow(grPaysDB)) {
  temp <- moneyDB %>%
    filter(
      geo == grPaysDB$geo[i],
      reg_date == grPaysDB$reg_date[i],
      quality == grPaysDB$quality[i],
      type == grPaysDB$type[i],
      between(date, grPaysDB$date[i] - 6, grPaysDB$date[i])
    )
  weeklyPayers <- c(weeklyPayers, length(unique(temp$user_id)))
}
grPaysDB <- cbind(grPaysDB, weeklyPayers)
In this loop, for each row of the second data frame I find the rows of the first data frame with the matching geo, type, quality and reg_date and a date in the right range, and then I can calculate the number of unique payers.
I may be misunderstanding, but I think this should be fairly simple using filter and summarise in dplyr. However, as @Hack-R mentioned, it would be helpful to have your dataset. It would look something like:
library(dplyr)
weeklyPayers <- grPaysDB %>%
  filter(date > ADD DATE IN QUESTION) %>%
  summarise(sumWeeklyPayers = sum(uniqPayers))
Then again, I may well have misunderstood. If your question involves summing for each week, you may want to investigate daily2weekly in the timeSeries package and then use group_by on the resulting weekly variable.
I would try joining your datasets with merge on multiple columns (c('geo', 'reg_date', 'quality', 'type')) and filtering the result based on the dates. After that, aggregate using summarise.
But I am not completely sure why you want to add the weekly payers to every transaction. Isn't it more informative, or easier, to aggregate your data by week number (with dplyr)? Like so:
moneyDB %>%
  mutate(week = date - as.POSIXlt(date)$wday) %>%  # week start (previous Sunday)
  group_by(geo, reg_date, quality, type, week) %>%
  summarise(weeklyPayers = n_distinct(user_id))    # distinct payers, not row counts
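A hedged sketch of that join-then-aggregate idea with dplyr joins instead of merge (column names taken from the question, untested against the real data):
library(dplyr)
weekly <- grPaysDB %>%
  inner_join(moneyDB, by = c("geo", "reg_date", "quality", "type"),
             suffix = c("", ".log")) %>%          # the log's date becomes date.log
  filter(between(date.log, date - 6, date)) %>%   # keep the trailing 7-day window
  group_by(geo, reg_date, quality, type, date) %>%
  summarise(weeklyPayers = n_distinct(user_id), .groups = "drop")
# groups with no transactions in the window drop out of the inner join;
# left_join this result back onto grPaysDB and replace NA with 0 to keep them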

Reading durations

I have a CSV file containing times per competitor of each section of a triathlon. I am having trouble reading the data so that R can use it. Here is an example of how the data looks (I've removed some columns for clarity):
"Place","Division","Gender","Swim","T1","Bike","T2","Run","Finish"
1, "40-49","M","7:45","0:55","27:07","0:29","18:53","55:07"
2, "UNDER 18","M","5:41","0:28","30:41","0:28","18:38","55:55"
3, "40-49","M","6:27","0:26","29:24","0:40","20:16","57:11"
4, "40-49","M","7:57","0:35","29:19","0:23","19:20","57:32"
5, "40-49","M","6:28","0:32","31:00","0:34","19:19","57:51"
6, "40-49","M","7:42","0:30","30:02","0:37","19:11","58:02"
....
250 ,"18-29","F","13:20","3:23","1:06:40","1:19","38:00","2:02:40"
251 ,"30-39","F","13:01","2:42","1:02:12","1:20","43:45","2:02:58"
252 ,50 ,"F","20:45","1:33","58:09","3:17","40:14","2:03:56"
253 ,"30-39","M","13:14","1:14","DNF","1:11","25:10","DNF bike"
254 ,"40-49","M","10:04","1:41","56:36","2:32",,"D.N.F"
My first naive attempt to plot the data went like this.
> tri <- read.csv(file.choose(), header=TRUE, as.is=TRUE)
> pairs(~ Bike + Run + Swim, data=tri)
The times are not being imported in a sensible way so the charts don't make sense.
I have found the difftime type and have tried to use it to parse the times in the data file.
There are some rows with DNF or similar in place of times; I'm happy for rows with times that can't be parsed to be discarded. There are two formats for the times: "%M:%S" and "%H:%M:%S".
I think I need to create a new data frame from the data but I am having trouble parsing the times. This is what I have so far.
> tri <- read.csv(file.choose(), header=TRUE, as.is=TRUE)
> str(tri)
'data.frame': 254 obs. of 12 variables:
$ Place : num 1 2 3 4 5 6 7 8 9 10 ...
$ Race.. : num 237 274 268 226 267 247 264 257 273 272 ...
$ First.Name: chr ** removed names ** ...
$ Last.Name : chr ** removed names ** ...
$ Division : chr "40-49" "UNDER 18" "40-49" "40-49" ...
$ Gender : chr "M" "M" "M" "M" ...
$ Swim : chr "7:45" "5:41" "6:27" "7:57" ...
$ T1 : chr "0:55" "0:28" "0:26" "0:35" ...
$ Bike : chr "27:07" "30:41" "29:24" "29:19" ...
$ T2 : chr "0:29" "0:28" "0:40" "0:23" ...
$ Run : chr "18:53" "18:38" "20:16" "19:20" ...
$ Finish : chr "55:07" "55:55" "57:11" "57:32" ...
> as.numeric(as.difftime(tri$Bike, format="%M:%S"), units="secs")
This converts all the times that are under one hour, but the hours are interpreted as minutes for any times over an hour. Substituting "%H:%M:%S" for "%M:%S" parses times over an hour but produces NA otherwise. What is the best way to convert both types of times?
EDIT: Adding a simple example as requested.
> times <- c("27:07", "1:02:12", "DNF")
> as.numeric(as.difftime(times, format="%M:%S"), units="secs")
[1] 1627 62 NA
> as.numeric(as.difftime(times, format="%H:%M:%S"), units="secs")
[1] NA 3732 NA
The output I would like would be 1627 3732 NA
Here's a quick hack at a solution, although there may be a better one:
cdifftime <- function(x) {
  x2 <- gsub("^([0-9]+:[0-9]+)$", "00:\\1", x)  # prepend 00: to %M:%S elements
  res <- as.difftime(x2, format = "%H:%M:%S")
  units(res) <- "secs"
  as.numeric(res)
}
times <- c("27:07", "1:02:12", "DNF")
cdifftime(times)
## [1] 1627 3732 NA
You can apply this to the relevant columns:
tri[4:9] <- lapply(tri[4:9],cdifftime)
A couple of notes from trying to replicate your example:
you may want to use na.strings="DNF" to set "did not finish" values to NA automatically (see the example after these notes)
you need to make sure strings are not read in as factors, e.g. (1) set options(stringsAsFactors=FALSE); (2) pass stringsAsFactors=FALSE when calling read.csv; (3) use as.is=TRUE, ditto.
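Picking up the na.strings note, for example (the DNF spellings are taken from the data shown above):
tri <- read.csv(file.choose(), as.is = TRUE,
                na.strings = c("DNF", "DNF bike", "D.N.F", ""))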

R: Is it possible to turn combinations of vectors into data sets?

I am a beginner in R.
After watching a number of tutorials on regression analysis (on youtube), I decided to make up my own data set and apply what I learnt to it. This is what I did!
I wanted to randomly create a list of salaries, ages and marital status.
Salaries
salary = sample(2000:3000, 250, replace = T)
Ages
ages = sample(20:50, 250, replace = T)
MaritalStatus
marSt = sample(c("MARRIED", "SINGLE"), 250, repeat = T)
Then, I combined the three sets of data with:
dataset = cbind(salary, ages, marSt)
Finally, I tried to run a regression on what I thought was my new data set with this command:
data.reg = lm(salary~ages+marSt, data = dataset)
... only for me to be told that there was an error and that the object "dataset" was actually NOT a dataset.
My question is two fold:
(i) Is it possible to create data sets from combinations of vectors?
(ii) If no, is there any way in R to create data sets without importing them from other sources?
Thank you very much, and please remember I am a beginner, so do not be too sophisticated in your response.
You probably want a data.frame, not a matrix (as returned by cbind):
dataset <- data.frame(salary, ages, marSt)
Also, repeat is not an argument of sample(); you probably mean replace = TRUE. You would do well to read an introduction to R.
This may help:
salary = sample(2000:3000, 250, replace = T)
ages = sample(20:50, 250, replace = T)
marSt = sample(c("MARRIED", "SINGLE"), 250, replace = T)
# dataset = cbind(salary, ages, marSt) #WHAT YOU DID
dataset = data.frame(salary, ages, marSt) #WHAT YOU SHOULD HAVE DONE
data.reg = lm(salary~ages+marSt, data = dataset)
Also str() allows you to look at the structure of objects so you can see the difference between what you did and I did:
str(cbind(salary, ages, marSt))
str(data.frame(salary, ages, marSt))
Output:
> str(cbind(salary, ages, marSt))
chr [1:250, 1:3] "2388" "2530" "2518" "2450" "2008" "2502" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "salary" "ages" "marSt"
> str(data.frame(salary, ages, marSt))
'data.frame': 250 obs. of 3 variables:
$ salary: int 2388 2530 2518 2450 2008 2502 2264 2185 2207 2048 ...
$ ages : int 24 21 35 31 50 39 22 21 36 29 ...
$ marSt : Factor w/ 2 levels "MARRIED","SINGLE": 1 2 2 2 2 2 2 1 1 2 ...
EDIT:
baptiste beat me to this one but I'm leaving my answer up as it adds to the explanation given by baptiste
