R: replace first n column values with NA

I have a large set of stock data over a two year period. The data frame is sorted by stock id and date, i.e. first I have all data for one stock and then all data for the second stock and so on. Now I want to replace the first 29 values (rows) in a column with NA for each stock. Is there a simple way to do that?
I have tried to use:
aggregate(column~stock_id, data = df, FUN = function(x){x[1:29] <- NA})
but it does not work.

aggregate is for summarizing - you end up with 1 row per group. You want the same number of rows, so aggregate won't work for you.
I'd use dplyr:
library(dplyr)
df %>%
  group_by(stock_id) %>%
  mutate(column = case_when(row_number() < 30 ~ NA_real_, TRUE ~ column))

In base R, we can use ave
i1 <- with(df, ave(seq_along(stock_id), stock_id, FUN = seq_along) < 30)
df[i1, setdiff(names(df), 'stock_id')] <- NA
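As a quick sanity check, here is a minimal sketch on toy data (my own example, not from the question): two fake stocks with three rows each, masking the first two rows per stock instead of 29 so the effect is visible.
library(dplyr)
# toy data: two stocks, three rows each (stand-in for the real df)
df_demo <- data.frame(stock_id = rep(c("A", "B"), each = 3),
                      column   = as.numeric(1:6))
df_demo %>%
  group_by(stock_id) %>%
  mutate(column = case_when(row_number() < 3 ~ NA_real_,  # first 2 rows per stock
                            TRUE ~ column)) %>%
  ungroup()
# column is now NA, NA, 3, NA, NA, 6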

Related

Creating a new column while using group_by, quantile and other functions takes a long time and doesn't give the desired outcome

I have a dataframe of 100 columns and 2 million rows. Among the columns, three are year, compound_id, and lt_rto. Here
length(unique(year))
# 30
length(unique(compound_id))
# 642
What I want to do is create a new column named avg_rto that is, for each year and each compound_id, the mean of the lowest 12% of lt_rto values. For example, suppose for year 2001 and compound_id xyz it will find all the values of lt_rto that are in the lower 12% and calculate their mean. This mean will be at the rows where year == 2001 & compound_id == "xyz".
The code I came up with is:
dt <- dt %>% group_by(year, compound_id) %>%
  mutate(avg_rto = mean(dt[['lt_rto']] < quantile(fun.zero.omit(dt[['lt_rto']]),
                                                  probs = .88, na.rm = TRUE)))
Note: I also intend to omit the zero values while calculating the lower 12 % value.
The above code gives me the same value for all observations, and it also takes a lot of time.
My problem is I cannot figure out what's wrong with the code and how I can reduce the run time.
Thank you for your help.
You can write a function which ignores 0 values and calculates the mean of the lowest 12%.
mean_of_lower_12_perc <- function(x) {
  val <- x[x != 0]                                       # drop zeros before ranking
  mean(sort(val)[1:(0.12 * length(val))], na.rm = TRUE)  # mean of the lowest 12% of values
}
Now apply this function by group.
library(dplyr)
dt %>%
  group_by(year, compound_id) %>%
  mutate(avg_rto = mean_of_lower_12_perc(lt_rto))
If your data is huge you can try data.table, again applying the function by group:
library(data.table)
setDT(dt)[, avg_rto := mean_of_lower_12_perc(lt_rto), by = .(year, compound_id)]
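Note that sort(val)[1:(0.12 * length(val))] truncates: small groups fall back to just the single smallest value, and an all-zero group produces 1:0 indexing. A quantile-based sketch (my variant, not part of the original answer) sidesteps both:
# mean of the values at or below the 12th percentile, ignoring zeros;
# returns NA for empty groups instead of misbehaving on tiny ones
mean_of_lower_12_perc_q <- function(x) {
  val <- x[x != 0 & !is.na(x)]
  if (length(val) == 0) return(NA_real_)
  mean(val[val <= quantile(val, probs = 0.12)])
}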

Rearrange dataframe to fit longitudinal model in R

I have a dataframe where each entry relates to a job posting in the NHS specifying the week the job was posted, and what NHS Trust (and region) the job is in.
At the moment my dataframe looks something like this:
set.seed(1)
df1 <- data.frame(
  NHS_Trust = sample(1:30, 20, T),
  Week = sample(1:10, 20, T),
  Region = sample(1:15, 20, T))
And I would like to count the number of jobs for each week across each NHS Trust and assign that value to a new column 'jobs' so my dataframe looks like this:
set.seed(1)
df2 <- data.frame(
  NHS_Trust = rep(1:30, each = 10),
  Week = rep(seq(1, 10), 30),
  Region = rep(as.integer(runif(30, 1, 15)), 1, each = 10),
  Jobs = rpois(10 * 30, lambda = 2))
The dataframe may then be used to create a Poisson longitudinal multilevel model where I may model the number of jobs.
Using the data.table package you can group by, count, and assign to a new column in a single expression. The syntax for data.tables is dt[i, j, by]. The i selects rows, i.e. the subset (or ordering) of the data to use; it is empty in this case, so all data is used in its original order. The j tells what is to be done: here, counting the number of occurrences using .N, which is then assigned to the new variable count using the assignment operator :=. The by takes a list of variables, and the j operation is performed within each group.
library(data.table)
setDT(df1)
df1[, count := .N, by = .(NHS_Trust, Week, Region)]
A tidyverse approach would be
library(tidyverse)
df1 <- df1 %>%
  group_by(NHS_Trust, Week, Region) %>%
  count()
You can use count to count number of jobs across each Region, NHS_Trust and Week and use complete to fill in missing combinations.
library(dplyr)
df1 %>%
  count(Region, NHS_Trust, Week, name = 'Jobs') %>%
  tidyr::complete(Region, Week = 1:10, fill = list(Jobs = 0))
I guess I'm moving my comment to an answer:
df2 <- df1 %>% group_by(Region, NHS_Trust, Week) %>% count()
colnames(df2)[4] <- "Jobs"
df2$combo <- paste0(df2$Region, "_", df2$NHS_Trust, "_", df2$Week)
for (i in 1:length(unique(df2$Region))){
  for (j in 1:length(unique(df2$NHS_Trust))){
    for (k in 1:length(unique(df2$Week))){
      curr_combo <- paste0(unique(df2$Region)[i], "_",
                           unique(df2$NHS_Trust)[j], "_",
                           unique(df2$Week)[k])
      if(!curr_combo %in% df2$combo){
        curdat <- data.frame(unique(df2$Region)[i],
                             unique(df2$NHS_Trust)[j],
                             unique(df2$Week)[k],
                             0,
                             curr_combo,
                             stringsAsFactors = FALSE)
        #cat(curdat)
        names(curdat) <- names(df2)
        df2 <- rbind(as.data.frame(df2), curdat)
      }
    }
  }
}
tail(df2)
# Region NHS_Trust Week Jobs combo
# 4495 15 1 4 0 15_1_4
# 4496 15 1 5 0 15_1_5
# 4497 15 1 8 0 15_1_8
# 4498 15 1 3 0 15_1_3
# 4499 15 1 6 0 15_1_6
# 4500 15 1 9 0 15_1_9
The for loop here checks which Region-NHS_Trust-Week combinations are missing from df2 and appends those to df2 with a corresponding Jobs value of 0. The checking is done with the help of the new variable combo, which is just a concatenation of the values in the fields mentioned earlier, separated by underscores.
Edit: I am plenty sure the people here can come up with something more elegant than this.
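For comparison, a more compact base R sketch of the same fill-in step (my alternative, not from the original answer): build every Region/NHS_Trust/Week combination with expand.grid, merge in the observed counts, and zero out the gaps.
# all combinations of the three grouping fields
all_combos <- expand.grid(Region    = unique(df2$Region),
                          NHS_Trust = unique(df2$NHS_Trust),
                          Week      = unique(df2$Week))
# left join: unmatched combinations get NA Jobs
df3 <- merge(all_combos, df2[, c("Region", "NHS_Trust", "Week", "Jobs")],
             all.x = TRUE)
df3$Jobs[is.na(df3$Jobs)] <- 0  # missing combinations had zero postings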

Extracting counts of a variable grouped at 2 levels

I have weather data tagged by year, month and day. Here is some of the data:
Date MinT Year Month
1976-01-01 1.1 1976 1
1976-01-02 0.3 1976 1
1976-01-03 1.3 1976 1
The data run is 1976:2016 for all months. Call this TestData.
I can group and subset as follows (it is very clunky but that is because I have been trying to test each step)
temp1 <- TestData %>%
  group_by(Year)
temp2 <- temp1 %>%
  subset(between(Month, 1, 3))
temp3 <- temp2
v1 <- replace(temp3$minT, temp3$minT > -2.0, 0) ### replaces data above the threshold
temp3["v1"] <- v1
index1 <- with(temp3, tapply(X = v1, INDEX = Year, FUN = sum)) ## sums the sub -2 degree values for months 1-3
index2 <- with(temp3, tapply(X = v1, INDEX = Year, FUN = length)) ## counts the number of items in each year for the selected period.
index2 gives me a count of the days in each year for the selected months. I can use index1 and index2 to create an index of 'weather for the month'.
What I would like is to be able to get a count of all of the days below -2 (or whatever) and so get an index of comparable severity for each month.
The v1 assignment is necessary because if I use rle to count instances, some months will have zero instances and drop from the final tally, meaning the compiled table of indices against minT, year and month has index vectors of different lengths, which R doesn't like. I have tried rle as the FUN in the index2 assignment, but that would not let me reach the day counts. The same was true for using a range value with length in that assignment (index3) as well.
Short of generating a mini table for each year, I am stuck. Does anyone have any suggestions?
I guess summarise is the function you are looking for. Something like this (different data, same principle):
library(dplyr)
library(latticeExtra) # provides the SeatacWeather example data
threshold <- 40
SeatacWeather %>%
  group_by(year, month) %>%
  filter(min.temp < threshold) %>%
  summarise(days_below_threshold = n())
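One caveat, since the question specifically mentions groups with zero instances dropping out: filtering before summarising removes the year/month groups in which no day falls below the threshold. A sketch that keeps those groups (with a count of 0) sums a logical condition instead of filtering:
# counts every year/month group, returning 0 where no day is below the threshold
SeatacWeather %>%
  group_by(year, month) %>%
  summarise(days_below_threshold = sum(min.temp < threshold, na.rm = TRUE))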

R: Find max value for column among a subset of a data frame

I have a dataframe df with columns ID, Year, Value1, Value2, Value3 and 21788928 rows. I need to subset the data by Year and ID, find the max Value1 in that subset, and keep the rest of the information in that row. I need to do that for all combinations of Year and ID (Year goes from 1982 to 2013, ID is from 1 to 28371).
I was trying to do that in a double for loop:
year <- seq(1982, 2013)
cnt <- 1
for (i in 1:32) {
  for (j in 1:28371) {
    A <- df[df$Year == year[i] & df$ID == j, ]
    maxVal[cnt, ] <- A[A$Value1 == max(A$Value1), ]
    cnt <- cnt + 1
  }
}
but it takes way too long. Is there a more efficient way to do that? Maybe using ddply or with?
A base R solution with aggregate:
prov <- aggregate(. ~ Year + ID, data = dat, FUN = max)
You can use dplyr
library(dplyr)
dat %>%
  group_by(ID, Year) %>%
  summarise(mval = max(Value1)) -> result
or plyr, keeping all the other columns (and repeating the max Value1 as mval):
library(plyr)
ddply(dat, .(ID, Year), function(x) {
  transform(x[which.max(x$Value1), ], mval = Value1)
}, .drop = FALSE)
Data
dat <- data.frame(ID = sample(1:10, 100, rep = TRUE),
                  Year = sample(1995:2000, 100, rep = TRUE),
                  Value1 = runif(100))
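If you need the entire row of the maximum rather than just the value, a dplyr sketch using which.max inside each group also works (and, unlike aggregate with max applied per column, keeps the columns of the winning row together):
# keep the full row holding the largest Value1 within each ID/Year group
dat %>%
  group_by(ID, Year) %>%
  slice(which.max(Value1)) %>%
  ungroup()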

Iteratively create columns based on grouped variables

I've got some data (below) where I want to iteratively add columns based on sums of current columns by some grouping variable, and I want to name the columns a pasted value of the current name + "_tot". I'm thinking a combination of dplyr and lapply is the way to go about it but I can't get the structure correct.
set.seed(1234)
data <- data.frame(
  biz = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
  region = sample(c("mideast", "americas"), 50, replace = TRUE),
  june = sample(1:50, 50, replace = TRUE),
  july = sample(100:150, 50, replace = TRUE)
)
So, what I want to do is group this data by "region", then add a new column for each of the months that is the sum of that month's values (in the real dataframe, there are many periods that follow).
Basically, I want to apply this function
library(dplyr)
data %>% group_by(region) %>% mutate(june_tot = sum(june))
across every month, without having to specify "june" or "july". My initial take:
testfun <- function(df, col) {
  name <- paste(col, "_tot", sep = "")
  data2 <- df %>% group_by(region) %>% summarise(name = sum(col))
  return(data2)
}
but lapplying this doesn't work, because I have to specify the columns to call into the initial function. Just removing the "col" argument from the initial function doesn't work either, of course.
Any ideas how to lapply this sort of argument?
Here are possible solutions to your problem using dplyr (first, since that is what you tried), followed by data.table and base R solutions:
dplyr:
cols <- lapply(names(data)[-(1:2)], as.name)
names(cols) <- paste0(names(data)[-(1:2)], "_tot")
data %>% group_by(region) %>% mutate_each_q(funs(sum), cols)
This assumes every column but the first two is monthly data. An explanation by line:
1. we use as.name and lapply to generate a list of the column names we want to mutate, as symbols
2. we give the new names we want (i.e. month_tot) to the list of symbols from 1.
3. we use mutate_each_q (known as mutate_each_ in dplyr 0.3.0.2) to apply sum to the list of expressions created in 1. and 2.
This is the (sample) result:
Source: local data frame [50 x 6]
Groups: region
biz region june july june_tot july_tot
1 shipping mideast 17 124 780 3339
2 telco americas 11 101 465 2901
3 telco mideast 27 131 780 3339
4 tech americas 24 135 465 2901
... rows omitted
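(mutate_each_q has since been removed from dplyr; in current versions, a sketch of the equivalent operation uses across, which excludes the grouping column automatically:)
library(dplyr)
data %>%
  group_by(region) %>%
  # sum every column except biz (and the grouping column), naming results month_tot
  mutate(across(-biz, sum, .names = "{.col}_tot")) %>%
  ungroup()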
data.table:
new.names <- paste0(tail(names(data), 2L), "_tot") # Make new names
data.table(data)[,
  (new.names) := lapply(.SD, sum), # apply sum to the selected columns (those in .SD) and assign to the new.names columns
  by = region, .SDcols = -1        # group by region and exclude the first column from .SD (region is excluded as well, by reason of being in by)
][]                                # extra [] just to force printing
Here, similar logic, except we use the special .SD object that represents every column in the data.table that we are not grouping by.
base:
do.call(
  cbind,
  list(
    data,
    setNames(
      lapply(data[-(1:2)], function(x) ave(x, data$region, FUN = sum)),
      paste0(names(data[-(1:2)]), "_tot")
    )
  )
)
Here we use ave to compute the per region sums, use lapply to apply ave to each column, and use do.call(cbind, ...) to reconstruct the final data frame.
Try:
for (i in 3:4) print(tapply(data[[i]], data$region, sum))
# americas  mideast
#      563      768
# americas  mideast
#     2538     3802
You can get all outputs in a list if you want.
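For example, a sketch that collects the per-region sums into a named list instead of printing them:
# one named vector of region sums per monthly column
sums_by_region <- lapply(data[3:4], function(x) tapply(x, data$region, sum))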
Restructuring the data works well for this.
require(tidyr)
# wide to long
d2 <- gather(data = data, key = month, value = monthval, -c(biz, region))
# get totals and rename month
month_tots <- aggregate(x = list(total = d2$monthval),
                        by = list(region = d2$region, month = d2$month), sum)
month_tots$month <- paste0(month_tots$month, '_tot')
# long to wide
month_tots <- spread(data = month_tots, key = month, value = total)
# recombine
merge(data, month_tots, by = 'region', all.x = TRUE)
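(gather and spread are superseded in current tidyr; a sketch of the same reshape with pivot_longer/pivot_wider:)
library(tidyr)
library(dplyr)
month_tots <- data %>%
  pivot_longer(-c(biz, region), names_to = "month", values_to = "monthval") %>%
  group_by(region, month) %>%
  summarise(total = sum(monthval), .groups = "drop") %>%
  mutate(month = paste0(month, "_tot")) %>%
  pivot_wider(names_from = month, values_from = total)
left_join(data, month_tots, by = "region")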
