I have a for loop I would like to run by group. I would like it to run through a set of data, create a time series for most rows, and then output a forecast for that row of data (based on that time point and the ones preceding it) in the group. The issue I am having is running that loop for every 'group' within my data. I want to avoid doing so manually, as that would take hours and surely there is a better way.
Allow me to explain in more detail.
I have a large dataset (1.6M rows), each row has a year, country A, country B, and a number of measures which concern the relationship between the two.
So far, I have been successful in extracting a single (country A, country B) relationship into a new table and using a for loop to output the necessary forecast data to a new variable in the dataset. I'd like to have that for loop run over every (country A, country B) grouping with more than 3 entries.
The data:
Here I will replicate a small slice of the data, and will include a missing value for realism.
set.seed(2000)
df <- data.frame(year = rep(c(1946:1970),length.out=50),
ccode1 = rep(c("2"), length.out = 50),
ccode2 = rep(c("20","31"), each=25),
kappavv = rnorm(50,mean = 0, sd=0.25),
output = NA)
df$kappavv[12] <- NA
What I've done:
NOTE: I start forecasting from the third data point of each group but based on all time points preceding the forecast.
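library(imputeTS)  # na_interpolation()
library(forecast)  # holt()
for(i in 3:nrow(df)){
  # series from the first year up to the year of row i
  dat_ts <- ts(df[, 4], start = c(min(df$year), 1), end = c(df$year[i], 1), frequency = 1)
  dat_ts_corr <- na_interpolation(dat_ts)
  trialseries <- holt(dat_ts_corr, h=1)
  df$output[i] <- trialseries$mean
}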
This part works and outputs what I want when I apply it to a single pairing of ccode1 and ccode2 when arranged correctly in ascending order of years.
What isn't working:
I am having some serious problems getting my head around applying this for loop by grouping of ccode2. Some of my data is uneven: groups are sometimes different sizes, have different start/end points, and contain missing data.
I have tried expressing the loop as a function, using group_by() and piping, and using various types of apply() functions.
Your help is appreciated. Thanks in advance. I am glad to answer any clarifying questions you have.
You can put the for loop code in a function.
library(dplyr)
library(purrr)
apply_func <- function(df) {
for(i in 3:nrow(df)){
dat_ts <- ts(df[, 4], start = c(min(df$year), 1),
end = c(df$year[i], 1), frequency = 1)
dat_ts_corr <- imputeTS::na_interpolation(dat_ts)
trialseries <- forecast::holt(dat_ts_corr, h=1)
df$output[i] <- trialseries$mean
}
return(df)
}
Split the data by ccode2 and apply apply_func.
df %>% group_split(ccode2) %>% map_df(apply_func)
# year ccode1 ccode2 kappavv output
# <int> <chr> <chr> <dbl> <dbl>
# 1 1946 2 20 -0.213 NA
# 2 1947 2 20 -0.0882 NA
# 3 1948 2 20 0.223 0.286
# 4 1949 2 20 0.435 0.413
# 5 1950 2 20 0.229 0.538
# 6 1951 2 20 -0.294 0.477
# 7 1952 2 20 -0.485 -0.675
# 8 1953 2 20 0.524 0.405
# 9 1954 2 20 0.0564 0.0418
#10 1955 2 20 0.294 0.161
# … with 40 more rows
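The question also asked to skip pairings with 3 or fewer rows, and 3:nrow(df) counts backwards when a group has fewer than 3 rows. A minimal base-R sketch of that filtering step follows. Here forecast_next() is a hypothetical stand-in (it only averages the values seen so far) that you would replace with the imputeTS/forecast calls from apply_func, and toy is made-up data.

```r
# Hypothetical stand-in for the holt() step: "forecast" from all values so far
forecast_next <- function(x) mean(x, na.rm = TRUE)

# Expanding-window loop for one group; seq_len() avoids 3:2 on tiny groups
apply_group <- function(g) {
  g <- g[order(g$year), ]
  for (i in seq_len(nrow(g))[-(1:2)]) {
    g$output[i] <- forecast_next(g$kappavv[1:i])
  }
  g
}

# Made-up data: pairing (2, 20) has 5 rows, pairing (2, 31) only 2
toy <- data.frame(year = c(1946:1950, 1946:1947),
                  ccode1 = "2",
                  ccode2 = rep(c("20", "31"), times = c(5, 2)),
                  kappavv = c(0.1, 0.2, NA, 0.4, 0.5, 0.3, 0.1),
                  output = NA_real_,
                  stringsAsFactors = FALSE)

groups <- split(toy, list(toy$ccode1, toy$ccode2), drop = TRUE)
groups <- Filter(function(g) nrow(g) > 3, groups)  # keep pairings with more than 3 rows
result <- do.call(rbind, lapply(groups, apply_group))
rownames(result) <- NULL
result
```

Splitting on both ccode1 and ccode2 matters once the full data contains more than one value of ccode1.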
I have a df called 'covs' with sites in rows and, in columns, 9 different environmental variables for each of these sites. I need to recalculate the value of each cell using the function (x - center_values(x)) / scale_values(x). However, 'center_values' and 'scale_values' are different for each environmental covariate, and they are located in another df called 'correction'.
I have found many solutions for applying a function for a whole df, but not for applying specific values according to the id of the value to transform.
covs <- read.table(text = "X elev builtup river grip pa npp treecov
384879-2009 1 24.379101 25188.572 1241.8348 1431.1082 5.705152e+03 16536.664 60.23175
385822-2009 2 29.533478 32821.770 2748.9053 1361.7772 2.358533e+03 15773.115 62.38455
385823-2009 3 30.097059 28358.244 2525.7627 1073.8772 4.340906e+03 14899.451 46.03269
386765-2009 4 33.877861 40557.891 927.4295 1049.4838 4.580944e+03 15362.518 53.08151
386766-2009 5 38.605156 36182.801 1479.6178 1056.2130 2.517869e+03 13389.958 35.71379",
header= TRUE)
correction <- read.table(text = "var_name center_values scale_values
1 X 196.5 113.304898393671
2 elev 200.217889868483 307.718211316278
3 builtup 31624.4888660664 23553.2438790344
4 river 1390.41023742909 1549.88661649406
5 grip 5972.67361738244 6996.57793554527
6 pa 2731.33431010861 4504.71055521749
7 npp 10205.2997576655 2913.19658598938
8 treecov 47.9080656134352 17.7101565911347
9 nonveg 7.96755640452006 4.56625351682905", header= TRUE)
Could someone help me write code to recalculate the environmental covariate values in 'covs' using the specific covariate values reported in 'correction'? E.g. for each value in the column 'elev' of the df 'covs', I need to subtract the 'center_values' entry reported for 'elev' in the 'correction' df, and then divide by the 'scale_values' entry for 'elev' reported in the 'correction' df. Thank you for your kind help.
You may assign var_name to row names, then loop over the names of covs to do the calculations in an sapply.
rownames(correction) <- correction$var_name
res <- as.data.frame(sapply(names(covs), function(x)
  (covs[, x] - correction[x, "center_values"])/correction[x, "scale_values"]))
res
# X elev builtup river grip pa npp treecov
# 1 -1.725433 -0.5714280 -0.27324970 -0.09586213 -0.6491124 0.66015733 2.173339 0.6958541
# 2 -1.716607 -0.5546776 0.05083296 0.87651254 -0.6590217 -0.08275811 1.911239 0.8174114
# 3 -1.707781 -0.5528462 -0.13867495 0.73253905 -0.7001703 0.35730857 1.611340 -0.1058927
# 4 -1.698956 -0.5405596 0.37928543 -0.29871910 -0.7036568 0.41059457 1.770295 0.2921174
# 5 -1.690130 -0.5251972 0.19353224 0.05755748 -0.7026950 -0.04738713 1.093183 -0.6885470
Check e.g. "elev":
(covs[,"elev"] - correction["elev", "center_values"]) / correction["elev", "scale_values"]
# [1] -0.5714280 -0.5546776 -0.5528462 -0.5405596 -0.5251972
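As an alternative sketch, base R's scale() accepts per-column center and scale vectors, so the same result can be had by lining the rows of correction up with the columns of covs via match(). The two-column data below is a cut-down copy of the example, used only to keep the snippet self-contained.

```r
# Cut-down copy of the example data (two covariates, three sites)
covs <- data.frame(elev = c(24.379101, 29.533478, 30.097059),
                   treecov = c(60.23175, 62.38455, 46.03269))
correction <- data.frame(var_name = c("X", "elev", "treecov"),
                         center_values = c(196.5, 200.217889868483, 47.9080656134352),
                         scale_values = c(113.304898393671, 307.718211316278, 17.7101565911347),
                         stringsAsFactors = FALSE)

# Row of 'correction' that belongs to each column of 'covs'
idx <- match(names(covs), correction$var_name)
stopifnot(!anyNA(idx))  # every covariate needs a correction row

res <- as.data.frame(scale(covs,
                           center = correction$center_values[idx],
                           scale = correction$scale_values[idx]))
res
```

The match() step also guards against the two data frames listing the covariates in a different order.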
I have two variables that I want to bootstrap, and I want to graph the resulting linear regression line from each new data set on the same xy plane.
I was thinking that I can hold each resulting intercept and slope from the lm() but I don't know how I could graph that information for each resulting pair in the same graph. I know that abline() can do one pair but not all of them. Feel free to throw anything at me.
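# NOTE: T must be set to the number of bootstrap replicates first;
# left undefined, T is just an alias for TRUE (i.e. 1).
N <- 1000
intercept_stuff <- rep(NA, T)
opp_stuff <- rep(NA, T)
for (t in 1:T) {
  idx <- sample(1:N, size = N, replace = TRUE)
  # fit once per resample and keep both coefficients
  fit <- lm(oppose_any ~ local_topic, data = facebook[idx, ])
  intercept_stuff[t] <- coef(fit)[1]
  opp_stuff[t] <- coef(fit)[2]
}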
Here is an example of how to do multiple pairs of lines on ggplot with some simulated data. Hopefully this will give you some useful clues:
library(reshape2)
library(tibble)
library(ggplot2)  # needed for the plot below
# simulate some data
obs <- c(1:90)
values1 <- rnorm(90,mean=0,sd=1)
values2 <- rnorm(90,mean=5,sd=2)
values3 <- rnorm(90,mean=10,sd=3)
df <- as_tibble(cbind(obs,values1,values2,values3))
It looks like this:
> df
# A tibble: 90 x 4
obs values1 values2 values3
<dbl> <dbl> <dbl> <dbl>
1 1. -0.162 7.47 10.7
2 2. 0.518 5.17 7.61
3 3. 1.52 7.66 4.42
# ... with 80 more rows
Then melt it into long form:
m.df <- melt(df, id.vars="obs", measure.vars=c("values1","values2","values3"))
to look like this:
> m.df
obs variable value
1 1 values1 -0.16228714
2 2 values1 0.51755370
3 3 values1 1.52433685
4 4 values1 -1.82067006
5 5 values1 -1.42180601
...
Then plot many lines (as long as there are groups like the color group here, they will be unique lines):
ggplot(m.df,aes(x=obs,y=value,color=variable)) + geom_line()
And here is the plot: [figure: values1, values2 and values3 drawn as separate coloured lines against obs]
Good luck!
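For the bootstrap question itself, abline() can simply be called once per stored (intercept, slope) pair. A rough base-R sketch, assuming simulated x and y in place of the facebook data (which is not shown) and an arbitrary 200 replicates:

```r
set.seed(1)
n <- 100
# Simulated stand-in for the real data: y depends linearly on x plus noise
dat <- data.frame(x = runif(n))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 0.5)

n_boot <- 200  # arbitrary number of bootstrap replicates
# One row per replicate: resample rows with replacement, refit, keep coefficients
coefs <- t(replicate(n_boot, {
  idx <- sample(n, size = n, replace = TRUE)
  coef(lm(y ~ x, data = dat[idx, ]))
}))
colnames(coefs) <- c("intercept", "slope")

# Scatter plot, then one faint line per bootstrap fit
plot(dat$x, dat$y, pch = 16, col = "grey60", xlab = "x", ylab = "y")
for (k in seq_len(n_boot)) {
  abline(a = coefs[k, "intercept"], b = coefs[k, "slope"],
         col = rgb(0, 0, 1, alpha = 0.1))
}
abline(lm(y ~ x, data = dat), lwd = 2)  # fit on the full sample, drawn on top
```

Semi-transparent lines make the spread of the bootstrap fits visible as a band around the full-sample line.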
I would like to process some GPS-Data rows, pairwise.
For now, I am doing it in a normal for-loop but I'm sure there is a better and faster way.
library(sp)  # provides spDists()
n <- 100
testdata <- as.data.frame(cbind(runif(n,1,10), runif(n,0,360), runif(n,14,16), runif(n, 46,49)))
colnames(testdata) <- c("speed", "heading", "long", "lat")
head(testdata)
diffmatrix <- as.data.frame(matrix(ncol = 3, nrow = dim(testdata)[1] - 1))
colnames(diffmatrix) <- c("distance","heading_diff","speed_diff")
for (i in 1:(dim(testdata)[1] - 1)) {
diffmatrix[i,1] <- spDists(as.matrix(testdata[i:(i+1),c('long','lat')]),
longlat = T, segments = T)*1000
diffmatrix[i,2] <- testdata[i+1,]$heading - testdata[i,]$heading
diffmatrix[i,3] <- testdata[i+1,]$speed - testdata[i,]$speed
}
head(diffmatrix)
How would I do that with an apply function?
Or is it even possible to do that calculation in parallel?
Thank you very much!
I'm not sure what you want to do with the end condition but with dplyr you can do all of this without using a for loop.
library(dplyr)
library(sp)  # provides spDists()
# each %>% must end a line; a line that starts with %>% is a syntax error
testdata %>%
  mutate(heading_diff = c(diff(heading), 0),
         speed_diff = c(diff(speed), 0),
         longdiff = c(diff(long), 0),
         latdiff = c(diff(lat), 0)) %>%
  rowwise() %>%
  mutate(spdist = spDists(cbind(c(long, long + longdiff), c(lat, lat + latdiff)),
                          longlat = TRUE, segments = TRUE) * 1000) %>%
  select(heading_diff, speed_diff, distance = spdist)
# heading_diff speed_diff distance
# <dbl> <dbl> <dbl>
# 1 15.9 0.107 326496
# 2 -345 -4.64 55184
# 3 124 -1.16 25256
# 4 85.6 5.24 221885
# 5 53.1 -2.23 17599
# 6 -184 2.33 225746
I will explain each part below:
The pipe operator %>% is essentially a chain that sends the results from one operation into the next. So we start with your test data and send it to the mutate function.
Use mutate to create 4 new columns that hold the differences from one row to the next. A 0 is appended for the last row because there is no measurement following the last data point (you could use NA instead).
Next, once you have the differences, use rowwise so that the spDists function is applied to each row separately.
Last, we create another column with mutate that computes the distance from long, lat and the difference columns created earlier.
To get only the 3 columns that you were concerned with I used a select statement at the end. You can leave this out if you want the entire dataframe.
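The same pairwise differences can also be written with plain vectors, which keeps the result at n - 1 rows like the original loop. In this sketch the distance is a haversine great-circle approximation on a sphere of radius 6371 km, written out by hand. That is a swapped-in formula rather than a call to spDists, so expect small discrepancies from the values above.

```r
set.seed(42)
n <- 100
testdata <- data.frame(speed = runif(n, 1, 10), heading = runif(n, 0, 360),
                       long = runif(n, 14, 16), lat = runif(n, 46, 49))

# Haversine distance in metres between consecutive points
# (assumption: spherical Earth with radius 6371 km)
haversine_m <- function(long, lat, radius = 6371000) {
  to_rad <- pi / 180
  phi1 <- head(lat, -1) * to_rad   # latitude of point i
  phi2 <- tail(lat, -1) * to_rad   # latitude of point i + 1
  dphi <- phi2 - phi1
  dlam <- diff(long) * to_rad
  a <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlam / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# diff() gives row i+1 minus row i for the whole column at once
diffmatrix <- data.frame(distance = haversine_m(testdata$long, testdata$lat),
                         heading_diff = diff(testdata$heading),
                         speed_diff = diff(testdata$speed))
head(diffmatrix)
```

Because everything here is vectorised, there is usually little to gain from running it in parallel.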
I'm moderately experienced using R, but I'm just starting to learn to write functions to automate tasks. I'm currently working on a project to run sentiment analysis and topic models of speeches from the five remaining presidential candidates and have run into a snag.
I wrote a function to do a sentence-by-sentence analysis of positive and negative sentiments, giving each sentence a score. Miraculously, it worked and gave me a dataframe with scores for each sentence.
score text
1 1 iowa, thank you.
2 2 thanks to all of you here tonight for your patriotism, for your love of country and for doing what too few americans today are doing.
3 0 you are not standing on the sidelines complaining.
4 1 you are not turning your backs on the political process.
5 2 you are standing up and fighting back.
So what I'm trying to do now is create a function that takes the scores and figures out what percentage of the total is represented by the count of each score and then plot it using plotly. So here is the function I've written:
scoreFun <- function(x){{
tbl <- table(x)
res <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(res) <- c('Score', 'Count','Percentage')
return(res)
}
percent = data.frame(Score=rownames, Count=Count, Percentage=Percentage)
return(percent)
}
Which returns this:
saPct <- scoreFun(sanders.scores$score)
saPct
Count Percentage
-6 1 0.44
-5 1 0.44
-4 6 2.64
-3 13 5.73
-2 20 8.81
-1 42 18.50
0 72 31.72
1 34 14.98
2 18 7.93
3 9 3.96
4 6 2.64
5 2 0.88
6 1 0.44
9 1 0.44
11 1 0.44
What I had hoped it would return is a dataframe with what has ended up being the rownames as a variable called Score and the next two columns called Count and Percentage, respectively. Then I want to plot the Score on the x-axis and Percentage on the y-axis using this code:
d <- subplot(
plot_ly(clPct, x = rownames, y=Percentage, xaxis="x1", yaxis="y1"),
plot_ly(saPct, x = rownames, y=Percentage, xaxis="x2", yaxis="y2"),
margin = 0.05,
nrows=2
) %>% layout(d, xaxis=list(title="", range=c(-15, 15)),
xaxis2=list(title="Score", range=c(-15,15)),
yaxis=list(title="Clinton", range=c(0,50)),
yaxis2=list(title="Sanders", range=c(0,50)),showlegend = FALSE)
d
I'm pretty certain I've made some obvious mistakes in my function and my plot_ly code, because clearly it's not returning the dataframe I want and is leading to the error Error in list2env(data) : first argument must be a named list when I run the plotly code. Again, though, I'm not very experienced writing functions and I haven't found a similar issue on Google, so I don't know how to fix this.
Any advice would be most welcome. Thanks!
@MLavoie, this code from the question I referenced in my comment did the trick. Many thanks!
scoreFun <- function(x){
tbl <- data.frame(table(x))
colnames(tbl) <- c("Score", "Count")
tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
return(tbl)
}
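A quick check of this version on a made-up score vector (the numbers are illustrative only). Note that Score comes back as a factor because it is built from table(), so it is converted to numeric here before being used on a numeric axis.

```r
scoreFun <- function(x){
  tbl <- data.frame(table(x))          # columns: x (factor), Freq
  colnames(tbl) <- c("Score", "Count")
  tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
  return(tbl)
}

scores <- c(-1, 0, 0, 1, 1, 1, 2, 2)   # made-up sentence scores
pct <- scoreFun(scores)
# factor -> character -> numeric keeps the actual score values
pct$Score <- as.numeric(as.character(pct$Score))
pct
```

Going through as.character() first matters: calling as.numeric() directly on the factor would return the level positions (1, 2, 3, ...) rather than the scores.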
I would like to randomly sample months according to a set of weights given by an index in a separate data frame, but the index changes according to some categorical variables.
Below is an example problem:
require(dplyr)
sim.size <- 1000
# Generating the weights for each month, and category combination
class_probs <- tibble(categoryA=rep(letters[1:3],24),
                      categoryB=rep(LETTERS[1:2],each=36),
                      Month=rep(month.name,6),
                      MonthIndex=runif(72))
# Generating some randomly simulated categories
sim.data <- tibble(categoryA=sample(letters[1:3],size=sim.size,replace=TRUE),
                   categoryB=sample(LETTERS[1:2],size=sim.size,replace=TRUE))
# This is where I need help
# I would like to add an extra column called Month on the end of sim.data
# that will be sampled using the class_probs data, taking into account
# both categoryA and categoryB to generate the weights in MonthIndex
sim.data %>%
group_by(categoryA,categoryB) %>%
do(sample_n(class_probs[class_probs$categoryA==categoryA &
class_probs$categoryB==categoryB, ],
size=nrow(sim.data[sim.data$categoryA==categoryA &
sim.data$categoryB==categoryB]),
replace=TRUE,
weight=MonthIndex)$Month)
So for each group I would like to sample as many times as that particular combination of categoryA and categoryB occurs, and for each occurrence I would like to sample a Month according to the MonthIndex given by the matching subset of the class_probs data frame.
The chosen Month is then bound onto the original dataset sim.data as an extra column.
Hopefully my code is already quite close; I just need a bit of help working out which bits need to change.
Here's an approach with a helper function to do the sampling, then a simple mutate call in dplyr to create the new column.
Helper function:
sampler <- function(x, y, df) {
tab <- sample_n(df %>% filter(categoryA==x,
categoryB==y),
size=1,
replace=TRUE,
weight=MonthIndex)
return(tab$Month)
}
Calling it to create a new variable:
sim.data %>%
rowwise() %>%
mutate(month = sampler(categoryA, categoryB, class_probs))
Result:
Source: local data frame [1,000 x 3]
Groups: <by row>
categoryA categoryB month
1 b B February
2 b A February
3 b B May
4 c B December
5 c B June
6 b A August
7 c A March
8 c A September
9 b A August
10 c A December
.. ... ... ...
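rowwise() filters class_probs once per row, which gets slow as sim.size grows. A possible alternative, sketched in base R only, draws all the months for one (categoryA, categoryB) combination in a single sample() call. The data below re-creates the example with plain data frames; the seed and the names key, rows, w and pick are illustrative choices.

```r
set.seed(1)
sim.size <- 1000
class_probs <- data.frame(categoryA = rep(letters[1:3], 24),
                          categoryB = rep(LETTERS[1:2], each = 36),
                          Month = rep(month.name, 6),
                          MonthIndex = runif(72),
                          stringsAsFactors = FALSE)
sim.data <- data.frame(categoryA = sample(letters[1:3], sim.size, replace = TRUE),
                       categoryB = sample(LETTERS[1:2], sim.size, replace = TRUE),
                       stringsAsFactors = FALSE)

sim.data$Month <- NA_character_
key <- paste(sim.data$categoryA, sim.data$categoryB)  # one label per combination
for (k in unique(key)) {
  rows <- which(key == k)
  # candidate months and weights for this combination
  w <- class_probs[paste(class_probs$categoryA, class_probs$categoryB) == k, ]
  # sample row positions of w, so a single candidate month is handled correctly
  pick <- sample(nrow(w), size = length(rows), replace = TRUE, prob = w$MonthIndex)
  sim.data$Month[rows] <- w$Month[pick]
}
head(sim.data)
```

sample() rescales the prob weights internally, so MonthIndex does not need to sum to 1 within a combination.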