subset dataframe variables through part of names - r

Suppose I have a data frame that contains these series and something else.
Where Ru and Uk are country codes.
Date CPI.Ru CPI.g.Ru CPI.s.Ru CPI.Uk CPI.g.Uk CPI.s.Uk
Q4-1990 61.4 66.4 67.5 72.2 68.2 32.4
Q1-1991 61.3 67.0 68.0 72.6 68.8 33.2
Q2-1991 61.4 67.5 68.1 73.2 69.5 35.1
Q3-1991 61.7 68.7 68.9 73.7 70.6 35.9
Q4-1991 62.3 68.4 69.3 74.3 71.9 38.2
Q1-1992 62.3 69.7 69.6 74.7 72.9 39.2
Q2-1992 62.1 70.3 70.0 75.3 73.7 40.6
Q3-1992 62.2 71.4 70.5 75.3 74.1 41.2
Q4-1992 62.5 71.1 70.9 75.7 74.3 44.0
I want to subset the data frame by country and then do something with these series.
For example, I want to divide the CPI index for each country by its first element.
How can I do this in a loop, or maybe with an apply function?
countries <- c("Ru","Uk")
for (i in countries)
{dfName$CPI.{i} <- dfName$CPI.{i}/dfName$CPI.{i}[1]}
What should I write instead of {i}?

$ only accepts fixed column names. To select columns based on an expression, you can use double brackets instead:
countries <- c("Ru", "Uk")
for (i in countries){
x <- paste0("CPI.", i)
dfName[[x]] <- dfName[[x]]/dfName[[x]][1]
}
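The same idea can be written without an explicit loop by building all the column names first and indexing the data frame with the resulting character vector. This is only a sketch: the three-row dfName below is a cut-down stand-in for the question's data, and cols is a name introduced here for illustration.

```r
# Small stand-in for the question's data (first three rows, CPI columns only).
dfName <- data.frame(
  Date   = c("Q4-1990", "Q1-1991", "Q2-1991"),
  CPI.Ru = c(61.4, 61.3, 61.4),
  CPI.Uk = c(72.2, 72.6, 73.2)
)

countries <- c("Ru", "Uk")
cols <- paste0("CPI.", countries)   # "CPI.Ru" "CPI.Uk"

# Rebase every selected column to its first observation in one step.
dfName[cols] <- lapply(dfName[cols], function(x) x / x[1])
```

Indexing with a character vector (dfName[cols]) selects several columns at once, while dfName[[x]] extracts a single column.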

This is not a loop, but if your data is always of the same form for each country, so that each country has 3 columns, and you always want to operate on the first column per country, you could try this:
sub <- df[,seq(2,ncol(df), 3)] #create a subsetted data.frame containing the CPI index per country
apply(sub, 2, function(x) x/x[1]) #then use apply to operate on each column
# CPI.Ru CPI.Uk
# [1,] 1.0000000 1.000000
# [2,] 0.9983713 1.005540
# [3,] 1.0000000 1.013850
# [4,] 1.0048860 1.020776
# [5,] 1.0146580 1.029086
# [6,] 1.0146580 1.034626
# [7,] 1.0114007 1.042936
# [8,] 1.0130293 1.042936
# [9,] 1.0179153 1.048476
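If the column order is not guaranteed, the positional seq() can be swapped for a match on the column names. The sketch below assumes the naming scheme from the question (CPI.<country>, CPI.g.<country>, CPI.s.<country>) and uses only the first three rows of the data; the regular expression is an illustrative choice, not part of the original answer.

```r
# First three rows of the question's data.
df <- data.frame(
  Date     = c("Q4-1990", "Q1-1991", "Q2-1991"),
  CPI.Ru   = c(61.4, 61.3, 61.4),
  CPI.g.Ru = c(66.4, 67.0, 67.5),
  CPI.s.Ru = c(67.5, 68.0, 68.1),
  CPI.Uk   = c(72.2, 72.6, 73.2),
  CPI.g.Uk = c(68.2, 68.8, 69.5),
  CPI.s.Uk = c(32.4, 33.2, 35.1)
)

# "CPI." followed by a single dot-free part: matches CPI.Ru and CPI.Uk,
# but not CPI.g.Ru or CPI.s.Uk.
idx <- grep("^CPI\\.[^.]+$", names(df))

# Divide each selected column by its first element.
rebased <- sapply(df[idx], function(x) x / x[1])
```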

Read in CSV in mixed English and French number format

I would like to read a CSV into R that is quoted and comma-separated (i.e. sep = "," not sep = ";" as read.csv2 defaults to) but that:
uses the comma inside fields as the decimal separator
contains periods to separate each group of three digits from the right
An example of a problematic entry is "3.051,00" in the final line of the excerpt from the CSV shown.
I tried
dat <- read.csv2("path_to_csv.csv", sep = ",", stringsAsFactors = FALSE)
and a variant using read.csv (both are identical except for their defaults, as noted in Difference between read.csv() and read.csv2() in R). Both return improperly formatted data frames (e.g. containing 3.051,00).
Can I read this comma-separated file in directly with read.table without having to perform text-preprocessing?
Excerpt of CSV
praf,pmek,plcg,PIP2,PIP3,p44/42,pakts473,PKA,PKC,P38,pjnk
"26,40","13,20","8,82","18,30","58,80","6,61","17,00","414,00","17,00","44,90","40,00"
"35,90","16,50","12,30","16,80","8,13","18,60","32,50","352,00","3,37","16,50","61,50"
"59,40","44,10","14,60","10,20","13,00","14,90","32,50","403,00","11,40","31,90","19,50"
"62,10","51,90","13,60","30,20","10,60","14,30","37,90","692,00","6,49","25,00","91,40"
"75,00","33,40","1,00","31,60","1,00","19,80","27,60","505,00","18,60","31,10","7,64"
"20,40","15,10","7,99","101,00","35,90","9,14","22,90","400,00","11,70","22,70","6,85"
"47,80","19,60","17,50","33,10","82,00","17,90","35,20","956,00","22,50","43,30","20,00"
"59,90","53,30","11,80","77,70","12,90","11,10","37,90","1.407,00","18,80","29,40","16,80"
"46,60","27,10","12,40","109,00","21,90","21,50","38,20","207,00","11,00","31,30","12,00"
"51,90","21,30","49,10","58,80","10,80","58,80","200,00","3.051,00","15,30","39,20","15,70"
Note: I am aware of the question European and American decimal format for thousands, which is not sufficient. This user preprocesses the file they want to read in whereas I would like a direct means of reading a CSV of the kind shown into R.
Most of it is resolved with dec=",":
# saved your data to 'file.csv'
out <- read.csv("file.csv", dec=",")
head(out)
# praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38 pjnk
# 1 26.4 13.2 8.82 18.3 58.80 6.61 17.0 414,00 17.00 44.9 40.00
# 2 35.9 16.5 12.30 16.8 8.13 18.60 32.5 352,00 3.37 16.5 61.50
# 3 59.4 44.1 14.60 10.2 13.00 14.90 32.5 403,00 11.40 31.9 19.50
# 4 62.1 51.9 13.60 30.2 10.60 14.30 37.9 692,00 6.49 25.0 91.40
# 5 75.0 33.4 1.00 31.6 1.00 19.80 27.6 505,00 18.60 31.1 7.64
# 6 20.4 15.1 7.99 101.0 35.90 9.14 22.9 400,00 11.70 22.7 6.85
Only one column is string:
sapply(out, class)
# praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38
# "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "character" "numeric" "numeric"
# pjnk
# "numeric"
This can be resolved post-read with:
ischr <- sapply(out, is.character)
out[ischr] <- lapply(out[ischr], function(z) as.numeric(gsub(" ", "", chartr(",.", ". ", z))))
out$PKA
# [1] 414 352 403 692 505 400 956 1407 207 3051
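A variation on the post-read conversion is to read every column as character and convert all of them with one helper, so the result does not depend on which columns read.csv happened to recognise as numeric. This is a sketch under the assumption that every column holds numbers in this format; parse_eu_number is a hypothetical helper name, and the two-column txt is a cut-down excerpt of the data.

```r
# Hypothetical helper: turn strings such as "3.051,00" into numeric values.
parse_eu_number <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)  # drop the thousands separators
  x <- sub(",", ".", x, fixed = TRUE)  # decimal comma -> decimal point
  as.numeric(x)
}

# Cut-down excerpt of the CSV (two columns, three rows).
txt <- 'PKA,PKC
"414,00","17,00"
"1.407,00","18,80"
"3.051,00","15,30"'

# Read everything as text first, then convert column by column.
raw <- read.csv(text = txt, colClasses = "character")
out <- as.data.frame(lapply(raw, parse_eu_number))
```

This still post-processes the data, but it keeps the file itself untouched and treats every column the same way.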
If you'd rather read it in without post-processing, you can pipe(.) it, assuming you have sed available (see Notes):
out <- read.csv(pipe("sed -E 's/([0-9])[.]([0-9])/\\1\\2/g;s/([0-9]),([0-9])/\\1.\\2/g' < file.csv"))
Notes:
sed is generally available on all Linux/macOS systems, and on Windows computers it is included with Rtools.
Like r2evans's comment says, dec = "," takes care of the cases without thousands separators. Then use lapply/gsub to process the other cases, which are still of class "character".
txt <- '
praf,pmek,plcg,PIP2,PIP3,p44/42,pakts473,PKA,PKC,P38,pjnk
"26,40","13,20","8,82","18,30","58,80","6,61","17,00","414,00","17,00","44,90","40,00"
"35,90","16,50","12,30","16,80","8,13","18,60","32,50","352,00","3,37","16,50","61,50"
"59,40","44,10","14,60","10,20","13,00","14,90","32,50","403,00","11,40","31,90","19,50"
"62,10","51,90","13,60","30,20","10,60","14,30","37,90","692,00","6,49","25,00","91,40"
"75,00","33,40","1,00","31,60","1,00","19,80","27,60","505,00","18,60","31,10","7,64"
"20,40","15,10","7,99","101,00","35,90","9,14","22,90","400,00","11,70","22,70","6,85"
"47,80","19,60","17,50","33,10","82,00","17,90","35,20","956,00","22,50","43,30","20,00"
"59,90","53,30","11,80","77,70","12,90","11,10","37,90","1.407,00","18,80","29,40","16,80"
"46,60","27,10","12,40","109,00","21,90","21,50","38,20","207,00","11,00","31,30","12,00"
"51,90","21,30","49,10","58,80","10,80","58,80","200,00","3.051,00","15,30","39,20","15,70"
'
df1 <- read.csv(textConnection(txt), dec = ",")
i <- sapply(df1, is.character)
df1[i] <- lapply(df1[i], \(x) gsub("\\.", "", x))
df1[i] <- lapply(df1[i], \(x) as.numeric(sub(",", ".", x)))
df1
#> praf pmek plcg PIP2 PIP3 p44.42 pakts473 PKA PKC P38 pjnk
#> 1 26.4 13.2 8.82 18.3 58.80 6.61 17.0 414 17.00 44.9 40.00
#> 2 35.9 16.5 12.30 16.8 8.13 18.60 32.5 352 3.37 16.5 61.50
#> 3 59.4 44.1 14.60 10.2 13.00 14.90 32.5 403 11.40 31.9 19.50
#> 4 62.1 51.9 13.60 30.2 10.60 14.30 37.9 692 6.49 25.0 91.40
#> 5 75.0 33.4 1.00 31.6 1.00 19.80 27.6 505 18.60 31.1 7.64
#> 6 20.4 15.1 7.99 101.0 35.90 9.14 22.9 400 11.70 22.7 6.85
#> 7 47.8 19.6 17.50 33.1 82.00 17.90 35.2 956 22.50 43.3 20.00
#> 8 59.9 53.3 11.80 77.7 12.90 11.10 37.9 1407 18.80 29.4 16.80
#> 9 46.6 27.1 12.40 109.0 21.90 21.50 38.2 207 11.00 31.3 12.00
#> 10 51.9 21.3 49.10 58.8 10.80 58.80 200.0 3051 15.30 39.2 15.70
Created on 2022-02-07 by the reprex package (v2.0.1)

Create an array between two numbers that can only be incremented a set amount of times?

Hopefully the title is clear enough. I'm trying to create an array of numbers between
40.56 and 100.00, but it can only be incremented 25 times, meaning there need to be exactly
25 increments between 40.56 and 100.00.
How can I go about automating this in R?
You could use seq with the length.out argument, which specifies the desired length of the sequence.
seq(40.56, 100, length.out = 25)
#[1] 40.6 43.0 45.5 48.0 50.5 52.9 55.4 57.9 60.4
#[10] 62.9 65.3 67.8 70.3 72.8 75.2 77.7 80.2 82.7
#[19] 85.1 87.6 90.1 92.6 95.0 97.5 100.0
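One thing worth checking is whether "25 increments" means 25 values or 25 steps, because length.out counts values: 25 values are separated by 24 equal steps. The sketch below shows both readings; lo, hi, pts and steps are names chosen here for illustration.

```r
lo <- 40.56
hi <- 100

# 25 values, i.e. 24 equal steps between lo and hi (as in the answer above).
pts <- seq(lo, hi, length.out = 25)

# 26 values, i.e. exactly 25 equal steps between lo and hi.
steps <- seq(lo, hi, length.out = 26)

# Size of one step in each case.
step_24 <- (hi - lo) / 24
step_25 <- (hi - lo) / 25
```

diff(pts) and diff(steps) can be used to confirm the spacing of each sequence.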

Time series forecasting by lm() using lapply

I am trying to forecast a time series using lm(), and my data looks like this:
Customer_key date sales
A35 2018-05-13 31
A35 2018-05-20 20
A35 2018-05-27 43
A35 2018-06-03 31
BH22 2018-05-13 60
BH22 2018-05-20 67
BH22 2018-05-27 78
BH22 2018-06-03 55
I converted my df to a list with:
df <- dcast(df, date ~ customer_key,value.var = c("sales"))
df <- subset(df, select = -c(dt))
demandWithKey <- as.list(df)
I'm trying to write a function that can be applied across all customers:
my_fun <- function(x) {
fit <- lm(ds_load ~ date, data=df) ## After changing to list ds_load and date column names
## are no longer available for formula
fit_b <- forecast(fit$fitted.values, h=20) ## forecast using lm()
return(data.frame(c(fit$fitted.values, fit_b[["mean"]])))
}
fcast <- lapply(df, my_fun)
I know the above function doesn't work, but basically I'm looking to get both the fitted values and the forecasted values for grouped data.
I've tried other methods, such as tslm() (after converting to time series data), but with no luck, although I can get lm() to work on just one customer. Also, many questions/posts are only about fitting the model, but I would like to forecast at the same time.
lm() fits a regression model,
but here you have a time series, so to forecast it you should use a time series model (ARMA, ARCH, GARCH, ...).
In R you can use the auto.arima() function in the "forecast" package.
I don't know what you're up to exactly, but you could make this less complicated.
Using by avoids the need to reshape your data, it splits your data e.g. by customer ID as in your case and applies a function on the subsets (i.e. it's a combination of split and lapply; see ?by).
Since you want to compare fitted and forecasted values in your result, you probably need predict rather than $fitted.values; otherwise the values won't be of the same length. Because your independent variable is a date in weekly intervals, you can use seq.Date and take the first date as the starting value; the length of the sequence is the number of actual values (nrow for each customer) plus the h= argument of the forecast.
For demonstration purposes I add the fitted values as first column in the following.
library(forecast) ## provides forecast()
res <- by(dat, dat$cus_key, function(x) {
H <- 20 ## forecast horizon 'h', defined once
fit <- lm(sales ~ date, x)
fitted <- fit$fitted.values
pred <- predict(fit, newdata=data.frame(
date=seq(x$date[1], length.out= nrow(x) + H, by="week")))
fcst <- c(fitted, forecast(fitted, h=H)$mean)
fit.na <- `length<-`(unname(fitted), length(pred)) ## for demonstration
return(cbind(fit.na, pred, fcst))
})
Result
res
# dat$cus_key: A28
# fit.na pred fcst
# 1 41.4 41.4 41.4
# 2 47.4 47.4 47.4
# 3 53.4 53.4 53.4
# 4 59.4 59.4 59.4
# 5 65.4 65.4 65.4
# 6 NA 71.4 71.4
# 7 NA 77.4 77.4
# 8 NA 83.4 83.4
# 9 NA 89.4 89.4
# 10 NA 95.4 95.4
# 11 NA 101.4 101.4
# 12 NA 107.4 107.4
# 13 NA 113.4 113.4
# 14 NA 119.4 119.4
# 15 NA 125.4 125.4
# 16 NA 131.4 131.4
# 17 NA 137.4 137.4
# 18 NA 143.4 143.4
# 19 NA 149.4 149.4
# 20 NA 155.4 155.4
# 21 NA 161.4 161.4
# 22 NA 167.4 167.4
# 23 NA 173.4 173.4
# 24 NA 179.4 179.4
# 25 NA 185.4 185.4
# ----------------------------------------------------------------
# dat$cus_key: B16
# fit.na pred fcst
# 1 49.0 49.0 49.0
# 2 47.7 47.7 47.7
# 3 46.4 46.4 46.4
# 4 45.1 45.1 45.1
# 5 43.8 43.8 43.8
# 6 NA 42.5 42.5
# 7 NA 41.2 41.2
# 8 NA 39.9 39.9
# 9 NA 38.6 38.6
# 10 NA 37.3 37.3
# 11 NA 36.0 36.0
# 12 NA 34.7 34.7
# 13 NA 33.4 33.4
# 14 NA 32.1 32.1
# 15 NA 30.8 30.8
# 16 NA 29.5 29.5
# 17 NA 28.2 28.2
# 18 NA 26.9 26.9
# 19 NA 25.6 25.6
# 20 NA 24.3 24.3
# 21 NA 23.0 23.0
# 22 NA 21.7 21.7
# 23 NA 20.4 20.4
# 24 NA 19.1 19.1
# 25 NA 17.8 17.8
# ----------------------------------------------------------------
# dat$cus_key: C12
# fit.na pred fcst
# 1 56.4 56.4 56.4
# 2 53.2 53.2 53.2
# 3 50.0 50.0 50.0
# 4 46.8 46.8 46.8
# 5 43.6 43.6 43.6
# 6 NA 40.4 40.4
# 7 NA 37.2 37.2
# 8 NA 34.0 34.0
# 9 NA 30.8 30.8
# 10 NA 27.6 27.6
# 11 NA 24.4 24.4
# 12 NA 21.2 21.2
# 13 NA 18.0 18.0
# 14 NA 14.8 14.8
# 15 NA 11.6 11.6
# 16 NA 8.4 8.4
# 17 NA 5.2 5.2
# 18 NA 2.0 2.0
# 19 NA -1.2 -1.2
# 20 NA -4.4 -4.4
# 21 NA -7.6 -7.6
# 22 NA -10.8 -10.8
# 23 NA -14.0 -14.0
# 24 NA -17.2 -17.2
# 25 NA -20.4 -20.4
As you can see, prediction and forecast yield the same values, since both methods are based on the same single explanatory variable date in this case.
Toy data:
set.seed(42)
dat <- transform(expand.grid(cus_key=paste0(LETTERS[1:3], sample(12:43, 3)),
date=seq.Date(as.Date("2018-05-13"), length.out=5, by="week")),
sales=sample(20:80, 15, replace=TRUE))
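The split-and-fit idea can also be sketched in base R alone, using split and lapply with predict and without the forecast package. The data below mirrors the sample rows from the question; fit_and_forecast and the horizon h = 3 are illustrative choices, and the sketch assumes each customer's rows are sorted by date with no missing weeks.

```r
# Sample rows from the question.
dat <- data.frame(
  customer_key = rep(c("A35", "BH22"), each = 4),
  date  = rep(seq(as.Date("2018-05-13"), by = "week", length.out = 4), times = 2),
  sales = c(31, 20, 43, 31, 60, 67, 78, 55)
)

h <- 3  # forecast horizon in weeks (arbitrary choice for the example)

# Hypothetical helper: fit a linear trend for one customer and extend it h weeks.
fit_and_forecast <- function(x) {
  fit <- lm(sales ~ date, data = x)
  # Dates covering the observed weeks plus h future weeks.
  all_dates <- seq(min(x$date), by = "week", length.out = nrow(x) + h)
  data.frame(
    date   = all_dates,
    sales  = c(x$sales, rep(NA, h)),  # observed values, NA for future weeks
    fitted = predict(fit, newdata = data.frame(date = all_dates))
  )
}

# One result data frame per customer.
fcast <- lapply(split(dat, dat$customer_key), fit_and_forecast)
```

For each customer, the first nrow(x) entries of fitted are the in-sample fitted values and the remaining h entries are the extrapolated linear trend.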

Gathering multiple data columns currently in factor form

I have a dataset of train carloads. It currently has a number (weekly carload) listed for each company (the row) for each week (the columns) over the course of a couple years (100+ columns). I want to gather this into just two columns: a date and loads.
It currently looks like this:
3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5
I'm looking for:
Date Load
3/29/2017 32.7
3/29/2017 20.5
3/29/2017 24.1
3/29/2017 24.9
4/5/2017 31.6
I've been doing various versions of the following:
rail3 <- rail2 %>%
gather(`3/29/2017`:`1/24/2018`, key = "date", value = "loads")
When I do this it makes a dataset called rail3, but it didn't make the new columns I wanted. It only made the dataset 44 times longer than it was. And it gave me the following message:
Warning message:
attributes are not identical across measure variables;
they will be dropped
I'm assuming this is because the date columns are currently coded as factors. But I'm also not sure how to convert 100+ columns from factors to numeric. I've tried the following and various other methods:
rail2["3/29/2017":"1/24/2018"] <- lapply(rail2["3/29/2017":"1/24/2018"], as.numeric)
None of this has worked. Let me know if you have any advice. Thanks!
If you want to avoid warnings when gathering and want date and numeric output in the final df, you can do:
library(tidyr)
library(hablar)
# Data from above but with factors
rail2<-read.table(header=TRUE, text="3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5", check.names=FALSE) %>%
as_tibble() %>%
convert(fct(everything()))
# Code
rail2 %>%
convert(num(everything())) %>%
gather("date", "load") %>%
convert(dte(date, .args = list(format = "%m/%d/%Y")))
Gives:
# A tibble: 16 x 2
date load
<date> <dbl>
1 2017-03-29 32.7
2 2017-03-29 20.5
3 2017-03-29 24.1
4 2017-03-29 24.9
5 2017-04-05 31.6
Here is a possible solution:
rail2<-read.table(header=TRUE, text="3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5", check.names=FALSE)
library(tidyr)
# gather the data from columns and convert to long format.
rail3 <- rail2 %>% gather(key="date", value="load")
rail3
# date load
#1 3/29/2017 32.7
#2 3/29/2017 20.5
#3 3/29/2017 24.1
#4 3/29/2017 24.9
#5 4/5/2017 31.6
#6 4/5/2017 21.8
#7 ...
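For comparison, here is a base R sketch of the same reshaping that also handles the factor columns mentioned in the question. Calling as.numeric() directly on a factor returns the internal level codes rather than the printed numbers, so the conversion goes through as.character() first. The two-column rail2 below is a cut-down stand-in for the real data, and rail_num and long are names introduced for illustration.

```r
# Cut-down stand-in: two weekly columns stored as factors.
rail2 <- data.frame(
  `3/29/2017` = factor(c("32.7", "20.5", "24.1", "24.9")),
  `4/5/2017`  = factor(c("31.6", "21.8", "24.1", "24.7")),
  check.names = FALSE
)

# Convert every factor column to numeric via character, keeping column names.
rail_num <- rail2
rail_num[] <- lapply(rail2, function(x) as.numeric(as.character(x)))

# Stack the columns into long format: one row per (date, load) pair.
long <- data.frame(
  date = as.Date(rep(names(rail_num), each = nrow(rail_num)), format = "%m/%d/%Y"),
  load = unlist(rail_num, use.names = FALSE)
)
```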

run function on consecutive vals with specific range in the vector with R

Suppose I have a vector tmp of size 100.
I want to know where the average of each run of 4 consecutive elements
is, for example, greater than 10,
i.e.
I want to know which of these: mean(tmp[c(1,2,3,4)]), mean(tmp[c(2,3,4,5)]), mean(tmp[c(3,4,5,6)]), and so on up to mean(tmp[c(97,98,99,100)])
are larger than 10.
How can I do this without a loop?
(A loop takes too long, since I have a table of 500000 rows by 60 columns.)
More generally, I need not only the average but also the difference, the sum and so on.
I have tried splitting the indices like this:
tmp<-seq(1,100,1)
one<-seq(1,97,1)
two<-seq(2,98,1)
tree<-seq(3,99,1)
four<-seq(4,100,1)
aa<-(tmp[one]+tmp[two]+tmp[tree]+tmp[four])/4
which(aa>10)
It works, but it's not practical if you want, for example, an average over 12 elements.
Here is an example of what I do, to be clear:
b12<-seq(1,988,1)
b11<-seq(2,989,1)
b10<-seq(3, 990,1)
b9<-seq(4,991,1)
b8<-seq(5,992,1)
b7<-seq(6,993,1)
b6<-seq(7,994,1)
b5<-seq(8, 995,1)
b4<-seq(9,996,1)
b3<-seq(10,997,1)
b2<-seq(11,998,1)
b1<-seq(12,999,1)
now<-seq(13, 1000,1)
po<-rpois(1000,4)
nor<-rnorm(1000,5,0.2)
uni<-runif(1000,10,75)
chis<-rchisq(1000,3,0)
which((po[now]/nor[now])>1 & (nor[b12]/nor[now])>1 &
((po[now]/po[b4])>1 | (uni[now]-uni[b4])>=0) &
((chis[now]+chis[b1]+chis[b2]+chis[b3])/4)>2 &
(uni[now]/max(uni[b1],uni[b2],uni[b3],uni[b4],
uni[b5],uni[b6],uni[b7],uni[b8]))>0.5)+12
This code gives me the exact indices in the real table
that match all the conditions.
I have 58 variables with 550000 rows.
Thank you.
The question is not very clear. Based on the wording, I guess this should help:
n <- 100
res <- sapply(1:(n-3), function(i) mean(tmp[i:(i+3)]))
which(res >10)
Also,
m1 <- matrix(tmp[1:4 + rep(0:96, each = 4)], ncol = 4, byrow = TRUE)
which(rowMeans(m1) >10)
Maybe you should look at the rollapply function from the "zoo" package. You would need to adjust the width argument according to your specific needs.
library(zoo)
tmp <- seq(1, 100, 1)
rollapply(tmp, width = 4, FUN = mean)
# [1] 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5
# [15] 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5 28.5 29.5
# [29] 30.5 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5 42.5 43.5
# [43] 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5 56.5 57.5
# [57] 58.5 59.5 60.5 61.5 62.5 63.5 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5
# [71] 72.5 73.5 74.5 75.5 76.5 77.5 78.5 79.5 80.5 81.5 82.5 83.5 84.5 85.5
# [85] 86.5 87.5 88.5 89.5 90.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5 98.5
So, to get the details you want:
aa <- rollapply(tmp, width = 4, FUN = mean)
which(aa > 10)
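If adding a package is not an option, a rolling mean can also be sketched in base R with cumsum, and other window functions (sum, max, and so on) can be applied to the window matrix returned by embed. roll_mean is a hypothetical helper written for this example; for very long vectors the cumulative sums may accumulate small floating-point error.

```r
# Hypothetical helper: mean of every run of k consecutive elements.
roll_mean <- function(x, k) {
  cs <- cumsum(c(0, x))  # cs[i + 1] is the sum of x[1:i]
  n <- length(cs)
  (cs[(k + 1):n] - cs[1:(n - k)]) / k
}

tmp <- seq(1, 100, 1)
aa <- roll_mean(tmp, 4)  # 97 window means
hits <- which(aa > 10)   # start indices of windows whose mean exceeds 10

# For other statistics, embed() puts each window of 4 values in one row.
win <- embed(tmp, 4)
roll_max <- apply(win, 1, max)
roll_sum <- rowSums(win)
```

Here hits holds the index of the first element of each qualifying window, so hits + 3 gives the index of its last element.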
