I want to use R to sample my dataframe. My data is timestamped epidemiological data, and I want to randomly sample at least 1 and as many as 10 records for each year, preferably in a manner that is scaled to the number of records for each year. I would like to export the results as a csv.
Here are a few lines of my dataset, where I've left off the long genetic sequence field for each record.
year matrix USD clade
1958 W mG018U UP
1958 W mG018U UP
1958 W mG018U UP
1966 UN mG140L LL
1969 UN mG207L LL
1969 UN mG013L LL
1971 UN mG208L LL
1972 HA mG129M MN
1973 C1 mG018U UP
1973 NA mG001U UC
1973 NA mG001U UC
All I've learned to do is
sample(mydata, size = 600, replace = FALSE)
which, of course, doesn't take the year into account.
There are many ways to run sample per group (for example, sample_n in the dplyr package); here's an illustration using the data.table package.
You can take a fraction, say 0.1, of the records in each year so the sample size scales with that year's record count, wrap it in ceiling so you get at least one record when the fraction falls below 1, and cap it at 10 per group with min. For example:
library(data.table)
setDT(df)[, .SD[sample(.N, min(10, ceiling(.N*.1)))], year]
# year matrix USD clade
#1: 1958 W mG018U UP
#2: 1966 UN mG140L LL
#3: 1969 UN mG013L LL
#4: 1971 UN mG208L LL
#5: 1972 HA mG129M MN
#6: 1973 NA mG001U UC
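And since you also want a CSV: assign the sampled result and write it out, for example with data.table's fwrite (base write.csv works too); the filename here is just a placeholder.
sampled <- setDT(df)[, .SD[sample(.N, min(10, ceiling(.N*.1)))], year]
fwrite(sampled, "sampled_by_year.csv")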
I have to select the countries whose number of datapoints is in the top 25% of the distribution of datapoint counts, using subset and quantile with the %in% operator.
My dataset has this form
head(drugs1)
LOCATION TIME PC_HEALTHXP PC_GDP USD_CAP TOTAL_SPEND
1 AUS 1971 15.992 0.727 35.720 462.11
2 AUS 1972 15.091 0.686 36.056 475.11
3 AUS 1973 15.117 0.681 39.871 533.47
4 AUS 1974 14.771 0.755 47.559 652.65
5 AUS 1975 11.849 0.682 47.561 660.76
6 AUS 1976 10.920 0.630 46.908 658.26
where the first column gives the country and the second the year of each data point.
I tried to apply the command
a<-subset(drugs1, quantile(drugs1$TIME, 0.25),1)
but the result is NULL.
Can you help me with this?
Start by figuring out the number of datapoints for each country using table().
n <- table(drugs1$LOCATION)
Find the 75th percentile of the number of datapoints; the top 25% of countries lie above it.
q <- quantile(n, .75)
Find the countries that have more than q datapoints.
countries <- names(n)[n > q]
Subset the original data to only include countries in countries.
drugs2 <- subset(drugs1, LOCATION %in% countries)
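As a quick sanity check of the idea on made-up counts (hypothetical data, purely to illustrate the cutoff):
n <- table(rep(c("AUS", "BEL", "CAN", "DEU"), times = c(5, 20, 30, 45)))
q <- quantile(n, .75)   # 33.75 for these counts
names(n)[n > q]         # "DEU" -- the only country above the cutoff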
I have a long dataset in the following format:
Date Country Score
1995-01-01 Australia 100
1995-01-02 Australia 99
1995-01-03 Australia 85
: : :
: : :
2019-06-30 Australia 57
1995-01-01 Austria 67
1995-01-02 Austria 12
1995-01-03 Austria 10
: : :
: : :
2019-06-30 Austria 21
I want to calculate a 90-day rolling standard deviation of the Score for each country. I have tried the rollapply function (package zoo) and roll_sd (package RcppRoll), but they are not working for groupwise standard deviation. Can anyone please suggest a possible way to calculate the rolling standard deviation?
Thanks!
In general, grouping is done separately from the base operation in R, so it is not that those functions can't be used for grouped data; you just need to embed them within a grouping operation. Here we use ave to do the grouping and rollapplyr to perform the rolling sd.
Now, can we assume that at each point the last 90 days are the last 90 rows? Assuming yes, and using a window of 2 so that the rolling sd shows up on the few posted rows (reproduced in the Note at the end):
library(zoo)
roll <- function(x) rollapplyr(x, 2, sd, fill = NA)
transform(DF, roll = ave(Score, Country, FUN = roll))
giving:
Date Country Score roll
1 1995-01-01 Australia 100 NA
2 1995-01-02 Australia 99 0.7071068
3 1995-01-03 Australia 85 9.8994949
4 1995-01-01 Austria 67 NA
5 1995-01-02 Austria 12 38.8908730
6 1995-01-03 Austria 10 1.4142136
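With the full dataset the same pattern applies; assuming one row per day per country, a 90-day window is just width 90:
roll90 <- function(x) rollapplyr(x, 90, sd, fill = NA)
transform(DF, roll = ave(Score, Country, FUN = roll90))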
Wide form approach
Another approach is to convert the data to wide form and then perform the rolling operation:
library(zoo)
z <- read.zoo(DF, split = "Country")
zr <- rollapplyr(z, 2, sd, fill = NA)
zr
giving this zoo series:
Australia Austria
1995-01-01 NA NA
1995-01-02 0.7071068 38.890873
1995-01-03 9.8994949 1.414214
You can then just leave it as a zoo series in order to take advantage of the other time series functions in that package or can convert it back to a data frame using fortify.zoo(zr) or fortify.zoo(zr, melt = TRUE, names = names(DF)) depending on what you need.
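For example, melting back to a long data frame with the original column names:
fortify.zoo(zr, melt = TRUE, names = names(DF))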
Note
The input used in reproducible form.
Lines <- "Date Country Score
1995-01-01 Australia 100
1995-01-02 Australia 99
1995-01-03 Australia 85
1995-01-01 Austria 67
1995-01-02 Austria 12
1995-01-03 Austria 10"
DF <- read.table(text = Lines, header = TRUE)
DF$Date <- as.Date(DF$Date)
I have this dataset, which includes the sales of each company in a given year (company code = gvkey, year = fyearq, sales = saley). I want to examine the volatility of these sales, which is defined as a time series of standard deviations of ten-year rolling windows of the sales growth rates (x). I also need to calculate the growth rates in order to do this.
Mathematically, the volatility at time t would look like this:
x_vol[t] = sqrt( (x[t] - avg_x[t])^2 / 10 )
where avg_x[t], the "average x", is the average of x between t-4 and t+5.
How can I input this in R? And how can I calculate the growth rates I need?
An example of the data I am working with looks like this:
gvkey fyearq saley
1 1004 1978 26.669
2 1004 1979 32.563
3 1004 1980 30.454
4 1004 1981 41.766
5 1004 1982 40.465
6 1004 1983 40.475
7 1004 1984 52.723
8 1004 1985 53.386
9 1004 1986 66.376
10 1004 1987 74.543
11 1004 1988 90.007
12 1004 1989 108.635
13 1004 1990 116.092
Here is a function that does what you want:
# create data
n <- 100
saley <- rnorm(n, 100, 5)
gvkey <- factor(c(rep("A", n/2), rep("B", n/2)))
fyearq <- c(1980:(1980+n/2-1), 1992:(1992+n/2-1))
df <- data.frame(saley, gvkey, fyearq)
# function for growth rate (the ratio of consecutive values)
growth_rate <- function(x){
out <- c(NA, x[2:length(x)]/ x[1:(length(x)-1)])
return(out)
}
# function for volatility
volatility <- function(x){
out <- rep(NA, length(x))
for(i in (1+4):(length(x)-5)){
out[i] <- sqrt((x[i] - mean(x[(i-4):(i+5)]))**2 /10)
}
return(out)
}
# apply function for growth rates
df$growth_rate <- do.call("c", by(df$saley, df$gvkey, growth_rate))
# applying function for volatility
df$volatility <- do.call("c", by(df$growth_rate, df$gvkey, volatility))
df
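If what you want is instead the conventional rolling standard deviation of the growth rates over each ten-year window (rather than the single-point deviation coded above), zoo's rollapply accepts the window as a list of offsets; a sketch, assuming the t-4 to t+5 window from the question (note sd uses the n-1 denominator, so it differs slightly from dividing by 10):
library(zoo)
# sd over the window t-4 .. t+5, given as offsets relative to the current point
roll_sd10 <- function(x) rollapply(x, width = list(-4:5), FUN = sd, fill = NA)
df$volatility_sd <- do.call("c", by(df$growth_rate, df$gvkey, roll_sd10))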
I tried to answer the question, although to me this is not a good question, since it asks for a complete method rather than something specific about programming in R. The benefit to others reading this is small compared to programming questions that can be applied to many situations.
I have a data frame with GDP values for 12 South American countries over ~40 years. A snippet of the frame is as follows:
168 Chile 1244.1799 1972
169 Chile 4076.3207 1994
170 Chile 3474.7172 1992
171 Chile 2928.1562 1991
172 Chile 6143.7276 2004
173 Colombia 882.5687 1976
174 Colombia 1094.8795 1977
175 Colombia 5403.4557 2008
176 Colombia 2376.8022 2002
177 Colombia 2047.9784 1993
1) I want to order the data frame by country. The first ~40 values should pertain to Argentina, then next ~40 to Bolivia, etc.
2) Within each country grouping, I want to order by year. The first 3 rows should pertain to Argentina 2012, Argentina 2011, Argentina 2010, etc.
I can grab the data for each country individually using subset(), and then order it with order(). Surely I don't have to do this for every country and then use rbind()? How do I do it in one fell swoop?
3) Once I have the final product, I'd like to create 12 small, individual line graphs stacked vertically, each pertaining to a different country, showing the trend of that country's GDP over the ~40 years. How do I create such a plot?
I'm sure I could find info on the 3rd question myself, but, well, I don't even know what such a graph is called in the first place.
Here is a solution with ggplot2. Assuming your data is in df:
library(ggplot2)
df$year.as.date <- as.Date(paste0(df$year, "-01-01")) # convert year to date
ggplot(df, aes(x=year.as.date, y=gdp)) +
geom_line() + facet_grid(country ~ .)
You don't actually need to sort by year and country; ggplot will handle that for you. Here is the data (clearly only using 5 countries and 12 years, but this will work for your data). Also, the third line shows how to sort by two columns:
countries <- c("ARG", "BRA", "CHI", "PER", "URU")
df <- data.frame(country=rep(countries, 12), year=rep(2001:2012, each=5), gdp=runif(60))
df <- df[order(df$country, df$year),] # <- we sort here
df$gdp <- df$gdp + 1:12 / 2
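If GDP levels differ a lot between your countries (likely), facet_wrap with free y scales keeps each panel readable while still stacking the 12 plots in a single column:
ggplot(df, aes(x=year.as.date, y=gdp)) +
  geom_line() + facet_wrap(~ country, ncol=1, scales="free_y")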
I have a big dataframe in R that all looks about like this:
name amount date1 date2 days_out year
JEAN 318.5 1971-02-16 1972-11-27 650 days 1971
GREGORY 1518.5 <NA> <NA> NA days 1971
JOHN 318.5 <NA> <NA> NA days 1971
EDWARD 318.5 <NA> <NA> NA days 1971
WALTER 518.5 1971-07-06 1975-03-14 1347 days 1971
BARRY 1518.5 1971-11-09 1972-02-09 92 days 1971
LARRY 518.5 1971-09-08 1972-02-09 154 days 1971
HARRY 318.5 1971-09-16 1972-02-09 146 days 1971
GARRY 1018.5 1971-10-26 1972-02-09 106 days 1971
If someone's days_out is less than 60, they get a 90% discount; between 60 and 90, a 70% discount; and so on down the brackets in the code below. I need to find out the discounted sum of all the amounts for each year. My utterly embarrassing workaround is to write a Python script that writes an R script that reads like this for each relevant year:
tmp <- members[members$year==1971, ]
tmp90 <- tmp[tmp$days_out <= 60 & tmp$days_out > 0 & !is.na(tmp$days_out), ]
tmp70 <- tmp[tmp$days_out <= 90 & tmp$days_out > 60 & !is.na(tmp$days_out), ]
tmp50 <- tmp[tmp$days_out <= 120 & tmp$days_out > 90 & !is.na(tmp$days_out), ]
tmp30 <- tmp[tmp$days_out <= 180 & tmp$days_out >120 & !is.na(tmp$days_out), ]
tmp00 <- tmp[tmp$days_out > 180 | is.na(tmp$days_out), ]
details.1971 <- c(1971, nrow(tmp),
nrow(tmp90), sum(tmp90$amount), sum(tmp90$amount) * .9,
nrow(tmp70), sum(tmp70$amount), sum(tmp70$amount) * .7,
nrow(tmp50), sum(tmp50$amount), sum(tmp50$amount) * .5,
nrow(tmp30), sum(tmp30$amount), sum(tmp30$amount) * .3,
nrow(tmp00), sum(tmp00$amount))
membership.for.chart <- rbind(membership.for.chart,details.1971)
It works just fine; the tmp frames and vectors get overwritten, which is fine. But I know that I've utterly defeated everything that is elegant and efficient about R here. I launched R for the first time a month ago and I think I've come a long way, but I would really like to know how I should have gone about this.
Wow, you wrote a Python script that generates an R script? Consider my eyebrows raised...
Hopefully this will get you started:
#Import your data; add dummy column to separate 'days' suffix into its own column
dat <- read.table(text = " name amount date1 date2 days_out dummy year
JEAN 318.5 1971-02-16 1972-11-27 650 days 1971
GREGORY 1518.5 <NA> <NA> NA days 1971
JOHN 318.5 <NA> <NA> NA days 1971
EDWARD 318.5 <NA> <NA> NA days 1971
WALTER 518.5 1971-07-06 1975-03-14 1347 days 1971
BARRY 1518.5 1971-11-09 1972-02-09 92 days 1971
LARRY 518.5 1971-09-08 1972-02-09 154 days 1971
HARRY 318.5 1971-09-16 1972-02-09 146 days 1971
GARRY 1018.5 1971-10-26 1972-02-09 106 days 1971",header = TRUE,sep = "")
#Repeat 3 times
df <- rbind(dat,dat,dat)
#Create new year variable
df$year <- rep(1971:1973,each = nrow(dat))
#Breaks for discount levels
ct <- c(0,60,90,120,180,Inf)
#Cut into a factor
df$fac <- cut(df$days_out,ct)
#Map each bracket to its discount multiplier (1 = no discount)
df$discount <- c(0.9,0.7,0.5,0.3,1)[df$fac]
df$discount[is.na(df$discount)] <- 1
#Calc adj amount
df$amount_adj <- with(df,amount * discount)
#I use plyr a lot, but there are many, many
# alternatives
library(plyr)
ddply(df,.(year),summarise,
amt = sum(amount_adj),
total = length(year),
d60 = length(which(fac == "(0,60]")))
I only calculated a few of your summary values in the last ddply command. I'm assuming you can extend it yourself.
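For instance, a base-R equivalent of just the per-year adjusted totals, if you'd rather skip the extra package:
aggregate(amount_adj ~ year, data = df, FUN = sum)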
You can use either the cut function or the findInterval function. The exact code will depend on the internals of the object, which are not unambiguously communicated by console output. If days_out is a difftime object, then something like this might work:
disc_amt <- with(tmp, amount * c(.9, .7, .5, .3, 1)[
    findInterval(days_out, c(0, 60, 90, 120, 180, Inf))])
You should post the output of dput() on that tmp object, or perhaps dput(head(tmp, 20)) if it's really big, and testing can proceed.
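From there, per-year discounted totals could come from tapply; a sketch on the full members frame, assuming days_out is a difftime and that NA means no discount:
mult <- c(.9, .7, .5, .3, 1)[findInterval(as.numeric(members$days_out), c(0, 60, 90, 120, 180, Inf))]
mult[is.na(mult)] <- 1   # NA days_out: full amount, no discount
tapply(members$amount * mult, members$year, sum)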