This is a more general problem I run into in R. Say I want to create a subset of a dataset data that contains the first 10 days, i.e. days 1,...,10. For a single day I can easily make the subset like this:
data_new <- subset(data, data$time == as.Date("2016-01-01"))
But say I want the first 10 days in January 2016. I try to make a loop like this
data_new <- matrix(ncol=2,nrow=1)
for(j in 1:10) {
data_new[,j]= subset(data, data$time==as.Date(as.character(2016-01-j)))
}
but this code does not run in R because of the term as.character(2016-01-j).
How can I create such a subset?
You could do
data_new = subset(data, data$time %in% as.Date(paste0("2016-01-", 1:10)))
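A minimal sketch of the same idea on made-up data, assuming the time column is already of class Date (the toy data frame and the names first_ten and data_range are illustrative, not from the question). It builds the ten dates with seq() instead of pasting strings, and shows that a plain range comparison gives the same rows:

```r
# Toy data: one row per day around New Year 2016 (illustrative only)
data <- data.frame(time = seq(as.Date("2015-12-25"), as.Date("2016-01-20"), by = "day"))
data$value <- seq_len(nrow(data))

# The first 10 days of January 2016 as a Date vector
first_ten <- seq(as.Date("2016-01-01"), by = "day", length.out = 10)

# Keep the rows whose date is one of those ten days
data_new <- data[data$time %in% first_ten, ]

# Equivalent filter written as a date range
data_range <- data[data$time >= as.Date("2016-01-01") &
                   data$time <= as.Date("2016-01-10"), ]
```

The range version is probably the more convenient one when the days are consecutive; %in% also covers arbitrary sets of dates.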
I have the following code and I want to turn it into a for loop. Only the year changes from one run to the next (years 1996-2019). This is my code:
# loading health data
health_data_1996 <- read.csv("1996-Annual.csv")
#delete data which is not needed
health_data_1996 <- health_data_1996[!(health_data_1996$Measure.Name != "Unemployment Rate, Annual" &
health_data_1996$Measure.Name != "High School Graduation"),]
health_data_1996 <- health_data_1996[,-c(1,2,5,7:11)]
#rename value column
colnames(health_data_1996)[3] <- "1996"
Can somebody tell me how I could make a for loop out of this?
Thank you very much for your help.
Since you just want to read the datasets and not combine them I suggest the following. I'm assuming here that all your CSV files have the same name structure.
# create a vector with all the years
years <- 1996:2019
# apply the desired function on every value in years consecutively
all_data <- lapply(years, function(y) {
  df <- read.csv(paste0(y, "-Annual.csv"))
  # keep only the two measures of interest
  df <- df[df$Measure.Name == "Unemployment Rate, Annual" |
           df$Measure.Name == "High School Graduation", ]
  df <- df[, -c(1, 2, 5, 7:11)]
  colnames(df)[3] <- as.character(y)
  df
})
# lapply() returns an unnamed list here, so name each element by its year
names(all_data) <- years
This gives you a named list where every element is the dataset for a given year. So, for example, the data from 2019 can be retrieved with all_data[["2019"]].
I have a problem relating to subsetting.
Basically I have a dataset. This toy dataset is a good small example:
df<- data.frame(year = c(1980:2019), randnorm = rnorm(40, 0, 1), count1 = rpois(40, 18),
lograndnorm=(rlnorm(40, 3, 2)))
For each value of year between 2000 and 2019, I want to remove that year's observation and output a subset of the total df data excluding that year. I then want to use the remainder of the data to train a model and use the removed year as the input to predict from.
For example, subset_ex2010 would exclude 2010. All data except the rows where year == 2010 goes into subset_ex2010, and I can then use that data to predict 2010.
Once the model has run, the output is saved and the loop moves on to the next year, that is, it removes 2009 from the full df data frame and subsets the remainder.
I've tried:
for(i in 2000:2019){
subset_excl_[i] <- subset(df, year<i | year>i] )
subset_of_[i] <- subset(df, year==i] )
lmmod[i] <- lm(count1 ~ randnorm + lograndnorm, data=subset_excl_[i])
distPred[i] <- predict(lmmod[i], subset_of_[i])
}
and,
for(i in 2000:2019){
subset_excl_[i] <- [df$year-i]
subset_of_[i] <- subset(df, year==i] )
lmmod[i] <- lm(count1 ~ randnorm + lograndnorm, data=subset_excl_[i])
distPred[i] <- predict(lmmod[i], subset_of_[i])
}
but both fall over. Any assistance would be gratefully received.
I don't know linear programming, but in both of your code blocks:
lmmod[i] <- lm(count1 ~ randnorm + lograndnorm, data=subset_excl_[i])
distPred[i] <- predict(lmMod[i], subset_of_[i])
You're referring to both lmmod and lmMod. R is case-sensitive.
If that alone doesn't fix it, put a browser() call at the top of the loop body and single-step until you find where it's blowing up.
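Beyond the capitalisation, note that subset_excl_[i] does not create a variable per year; it indexes into an object called subset_excl_. A minimal sketch of the leave-one-year-out loop, assuming only the predictions need to be kept (the names years, train, test, fit and dist_pred are illustrative, and the seed is added only to make the toy data reproducible):

```r
set.seed(42)  # assumption: fixed seed for a reproducible toy dataset
df <- data.frame(year = 1980:2019,
                 randnorm = rnorm(40, 0, 1),
                 count1 = rpois(40, 18),
                 lograndnorm = rlnorm(40, 3, 2))

years <- 2000:2019
# One prediction per held-out year, stored in a vector named by year
dist_pred <- setNames(numeric(length(years)), years)

for (i in years) {
  train <- df[df$year != i, ]  # all rows except year i
  test  <- df[df$year == i, ]  # the single held-out row
  fit   <- lm(count1 ~ randnorm + lograndnorm, data = train)
  dist_pred[as.character(i)] <- predict(fit, newdata = test)
}
```

If the fitted models are needed as well, they could be collected in a list indexed by as.character(i) in the same way.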
I'm working with a large data frame that is pulled from a data lake which I need to subset according to multiple different columns and run an analysis on. The basic subsettings come from an external Excel file which I read in and generate all possible combinations of. I want something to loop through each of these columns and subset my data accordingly.
A few of the subsettings follow a similar form to:
data_settings <- data.frame(country = rep(c('DE','RU','US','CA','BR'),6),
transport=rep(c('road','air','sea')),
category = rep(c('A','B')))
And my data lake extract has a form like:
df <- data.frame(country = rep(unique(data_settings$country),6),
transport = rep(unique(data_settings$transport),10),
category = rep(c('A','B'),15),
values = round(runif(30) * 10))
I need to subset the data according to each of the rows in my data_settings data frame, so I built a loop which constructs the formula according to what is in my data_settings data frame.
for(i in 1:nrow(data_settings)){
sub_string <- paste0(names(data_settings[1]), '==', data_settings[i,1])
for(j in 2:ncol(data_settings)){
col <- names(data_settings)[j]
val <- as.character(data_settings[i,j])
sub_string <- paste0(sub_string, ' & ', col," == ","'",val,"'")
}
df_sub <- subset(df, formula(sub_string))
}
This successfully builds my strings which I try to pass to formula or as.formula, but I receive an error at that point. I've tried a few different formulations without any success. In my actual case, there are thousands of combinations with different columns and values to filter against.
Thanks in advance for your help!
Try this:
merge(data_settings, df)
I worked on my previous approach a bit more today, without using subset, filter, etc., and put this together. It seems to do what I want well enough: for each row of data_settings it filters df column by column, narrowing the result with each setting in turn.
for(i in 1:nrow(data_settings)){
df_sub <- df
for(j in 1:ncol(data_settings)){
col <- names(data_settings)[j]
val <- as.character(data_settings[i,j])
df_col <- grep(col, names(df))
df_sub <- df_sub[df_sub[,df_col] == val,]
}
# Run further analysis here...
}
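The merge() suggestion can also be applied one settings row at a time, which avoids building filter strings. This is only a sketch under the assumption that every column of data_settings also exists in df under the same name; the list name subsets is illustrative, and the vectors are written out to full length rather than relying on recycling:

```r
data_settings <- data.frame(country = rep(c("DE", "RU", "US", "CA", "BR"), 6),
                            transport = rep(c("road", "air", "sea"), 10),
                            category = rep(c("A", "B"), 15))

set.seed(1)  # assumption: fixed seed so the toy values are reproducible
df <- data.frame(country = rep(unique(data_settings$country), 6),
                 transport = rep(unique(data_settings$transport), 10),
                 category = rep(c("A", "B"), 15),
                 values = round(runif(30) * 10))

# For each settings row, merge() joins on the shared column names and
# therefore keeps only the rows of df matching that combination
subsets <- lapply(seq_len(nrow(data_settings)), function(i) {
  merge(data_settings[i, , drop = FALSE], df)
})
```

Each element of subsets can then be passed to the analysis step; a combination with no matching rows simply yields an empty data frame.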
I have a data.table with a few customers, a days value and a pay_days value in each row.
pay_days holds 5 day values per customer, stored as a comma-separated string.
For each row, I want to check whether the days value is one of the pay_days.
Here is some dummy data (pardon the messy way of creating it; I could not think of a better way at the moment):
customers <- c("179288" ,"146506" ,"202287","16207","152979","14421","41395","199103","183467","151902")
mdays <- 1:31
set.seed(1)
data <- sort(rep(customers,100))
days <- sample(mdays,1000,replace=T)
xyz <- cbind(data,days)
x <- vector(length=1000L)
j <- 1
for( i in 1:10){
set.seed(i) ## I wanted diff dates to be picked
m <- sample(mdays,5)
while(j <=100*i){
x[j] <- paste(m,collapse = ",")
j <- j+1
}
}
xyz <- cbind(xyz,x)
require(data.table)
my_data <- setDT(as.data.frame(xyz))
setnames(my_data, c("cust","days","pay_days"))
my_data[,pay:=runif(1000,min = 0,max=10000)]
Now, for each cust, I want the vector of pay values whose days value is one of the pay_days.
I have tried various ways but can't seem to figure it out. My initial thought is to create a flag based on whether days is one of the pay_days, and then take the pay values according to the flag:
my_data[,ifelse(grepl(days,pay_days),1,0),cust]
This does not work as I expect it to. I don't want to use a native loop, as the actual data is huge.
Using tidyr to split the pay_days column into long form and then checking if days is in pay_days:
library(tidyr)
library(dplyr)
# creating long-form data
tidier <- my_data %>%
mutate(pay_days = strsplit(as.character(pay_days), ",")) %>%
unnest(pay_days)
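# make sure the result is a data.table again (unnest() may return a tibble)
setDT(tidier)
# casting as numeric to make factor & character columns comparable;
# going through as.character() first converts a factor's values, not its level codes
tidier[, days := as.numeric(as.character(days))]
tidier[, pay_days := as.numeric(pay_days)]
tidier[days == pay_days, pay, by=cust]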
Not sure how this performs for large data, as you multiply your table length by the number of days in pay_days...
Side note: I can't comment yet, but to replicate your data one needs to add library(data.table) and initialize x with x <- vector(), which is otherwise not found, as Dee also points out.
Another one-liner approach using the data table:
my_data[,result:=sum(unlist(lapply(strsplit(as.character(pay_days),","),match,days)),na.rm=T)>0,by=1:nrow(my_data)]
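For reference, the grepl() attempt in the question has two problems: grepl() uses only the first element of a vector pattern, and it matches substrings, so day "1" would match pay day "11". A minimal base R sketch of an exact, row-wise membership test on a small made-up table (toy, pay_list, is_pay_day and pays_by_cust are illustrative names, not from the question):

```r
# Small illustrative table with the same column layout as my_data
toy <- data.frame(cust = c("a", "a", "b", "b"),
                  days = c("1", "11", "3", "7"),
                  pay_days = c("11,5,20", "11,5,20", "3,30,9", "3,30,9"),
                  pay = c(100, 200, 300, 400),
                  stringsAsFactors = FALSE)

# Split each pay_days string into its individual day values
pay_list <- strsplit(toy$pay_days, ",", fixed = TRUE)

# Row by row: is this row's day exactly one of its pay days?
toy$is_pay_day <- mapply(function(d, p) d %in% p, toy$days, pay_list,
                         USE.NAMES = FALSE)

# Pay values on pay days, grouped by customer
pays_by_cust <- split(toy$pay[toy$is_pay_day], toy$cust[toy$is_pay_day])
```

Here day "1" is correctly not flagged against "11,5,20". mapply() still iterates over rows internally, so how well this scales to very large data is untested.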
I am trying to create two vectors of the 20th and 80th percentiles of monthly return data for companies from 1927 to 2013. The issue I have encountered is that in my nested for loop I don't know how to reference both the month and the year (e.g. the returns across all companies in April 1945). Right now the code looks like this:
qunatile<-function(r){
vec20<-c(rep(0,1038))
vec80<-c(rep(0,1038))
for(i in 1927:2013){
for(j in 1:12){
vec20[j+12(i-1927)]<-quantile(r$(i, j),20)
vec80[j+12(i-1927)]<-quantile(r$(i, j),80)
}
}
data1decilest<-rbind(ps1NYSE,vec20,vec80)
}
But I know that the r$(i, j) notation is not correct. I was wondering if anyone knew how to do what I am attempting with that clearly incorrect code (i.e. reference all returns from a given month in a given year).
Thank you!
One option that would eliminate nesting loops is to create a column in your dataframe that contains a month/year combo (e.g. "Jan1955", "Apr1999", etc.) and then split your dataframe on that variable, and apply quantile functions. It's hard to say if this is solving your problem since there is not a reproducible example. I assume here your data is called df and contains a date and a value column.
library(lubridate)
library(plyr)
df$newtime <- paste0(month(df$date, label = T, abbr = T), year(df$date))
# quantile() expects probabilities between 0 and 1, so use 0.2 and 0.8
q20 <- function(df){ quantile(df$value, 0.2) }
q80 <- function(df){ quantile(df$value, 0.8) }
vec20 <- ddply(df, .(newtime), .fun = q20)
vec80 <- ddply(df, .(newtime), .fun = q80)
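The same grouping can be sketched in base R without extra packages. This assumes, as above, a data frame with a date column of class Date and a numeric value column; the simulated returns and the column name ym are made up for illustration:

```r
set.seed(1)  # assumption: simulated returns, 50 companies per month for 24 months
df <- data.frame(
  date  = rep(seq(as.Date("1927-01-15"), by = "month", length.out = 24), each = 50),
  value = rnorm(24 * 50)
)

# Year-month key such as "1927-01"; unlike "Jan1927" it sorts chronologically
df$ym <- format(df$date, "%Y-%m")

# One quantile per year-month group; probs must lie between 0 and 1
vec20 <- tapply(df$value, df$ym, quantile, probs = 0.2)
vec80 <- tapply(df$value, df$ym, quantile, probs = 0.8)
```

Both results are named by year-month, so vec20["1927-04"] gives the 20th percentile across all companies for April 1927.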