import txt file with desired data structure in R - r

The txt is like
#---*----1----*----2----*---
Name Time.Period Value
A Jan 2013 10
B Jan 2013 11
C Jan 2013 12
A Feb 2013 9
B Feb 2013 11
C Feb 2013 15
A Mar 2013 10
B Mar 2013 8
C Mar 2013 13
I tried to use read.table with readLines and count.field as shown belows:
> path <- list.files()
> data <- read.table(text=readLines(path)[count.fields(path, blank.lines.skip=FALSE) == 4])
Warning message:
In readLines(path) : incomplete final line found on 'data1.txt'
> data
V1 V2 V3 V4
1 A Jan 2013 10
2 B Jan 2013 11
3 C Jan 2013 12
4 A Feb 2013 9
5 B Feb 2013 11
6 C Feb 2013 15
7 A Mar 2013 10
8 B Mar 2013 8
9 C Mar 2013 13
The problem is that it give four attributes instead of three. Therefore i manipulate my data as below which seeking a alternative.
> library(zoo)
> data$Name <- as.character(data$V1)
> data$Time.Period <- as.yearmon(paste(data$V2, data$V3, sep=" "))
> data$Value <- as.numeric(data$V4)
> DATA <- data[, 5:7]
> DATA
Name Time.Period Value
1 A Jan 2013 10
2 B Jan 2013 11
3 C Jan 2013 12
4 A Feb 2013 9
5 B Feb 2013 11
6 C Feb 2013 15
7 A Mar 2013 10
8 B Mar 2013 8
9 C Mar 2013 13

You can use read.fwf to read fixed width files. You need to correctly specify the width of each column, in spaces.
data <- read.fwf(path, widths=c(-12, 8, -4, 2), header=T)
The key there is how you specify the width. Negative means skip that many places, positive means read that many. I am assuming entries in the last column have only 2 digits. Change widths accordingly if this is not the case. You will probably also have to fix the column names.
You will have to change the indices if the file format changes, or come up with some clever regexp to read it from the first few rows. A better solution would be to enclose your strings in " or, even better, avoid the format altogether.

?count.fields
As the R Documentation states count.fields counts the number of fields, as separated by sep, in each of the lines of file read, when you set count.fields(path, blank.lines.skip=FALSE) == 4 it will skip the header row which actually has three fields.

Related

How to insert a value in a table

I have aggregated a table from my datafile using this synthax:
sumtab <- as.data.frame(table(S$MONTH))
colnames(sumtab) <- c("Month", "Frq")
rownames(sumtab) <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug",
"Sep","Oct","Dec")
Resulting in this table sumtab:
Month Frq
Jan 1 3
Feb 2 5
Mar 3 16
Apr 4 45
May 5 11
Jun 6 16
Jul 7 99
Aug 8 101
Sep 9 45
Oct 10 456
Dec 12 112
And this script produces a ggplot:
ggplot(sumtab, aes(x=Month,y=Frq),width=1.5) +
scale_y_continuous(limit=c(0,17),expand=c(0, 0)) +
geom_bar(stat='identity',fill="lightgreen",colour="black") +
xlab("Month") + ylab("No of bears killed") +
theme_bw(base_size = 11) +
theme(axis.text.x=element_text(angle=0,size=9))
The problem is that there are no values for November in my data, and I need to somehow enter a zero for November in the table. Probably a simple thing for most of you, and I have tried to search in other questions , and I have googled and read the books, but been unable to find the correct synthax.Need a little help.
Adding rbind into the script:
sumtab <- as.data.frame(table(S$MONTH))
sumtab <- rbind(sumtab, c(11, 0))
produced this error message:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 11) :
invalid factor level, NA generated
ant this table:
Var1 Freq
1 1 3
2 2 5
3 3 6
4 4 14
5 5 7
6 6 2
7 7 13
8 8 12
9 9 3
10 10 1
11 12 4
12 <NA> 0
So thanks #PaulH for your help, but I've probably used your help in a wrong way.
You could use the rbind command to add the November row:
sumtab <- rbind(sumtab, Nov = c(11, 0))
Good luck!

R: Insert and fill missing periods in panel data

I'm trying to learn R coming from Stata, but have run into the following two problems which I cannot seem to find elegant solutions for in R:
1) I have a panel dataset with gaps in my time variable. I would like to expand my time variable to include the gaps despite having no observed data for these rows.
In Stata I would usually go about this by setting my ID and time variables with xtset and then expanding the dataset based on this with tsfill. Is there an equivalently elegant way in R?
2) I would like to fill some of the new, blank cells with data for constant variables.
In Stata I would do this by copying data from previous (relative to my time variable) observations using the l.-prefix; for example using replace Con = l.Con.
In other words I'm asking how to go from something like this:
ID Time Num Con
1 Jan 10 A
1 Feb 15 A
1 May 20 A
2 Feb 12 B
2 Mar 14 B
2 Jun 15 B
To something like this:
ID Time Num Con
1 Jan 10 A
1 Feb 15 A
1 Mar A
1 Apr A
1 May 20 A
2 Feb 12 B
2 Mar 14 B
2 Apr B
2 May B
2 Jun 15 B
Hopefully that makes sense. Thanks in advance.
You can try merge from base R or the data.table join
library(data.table)
DT2 <- setDT(df1)[, {tmp <- match(Time, month.abb)
list(Time=month.abb[min(tmp):max(tmp)])}, .(ID,Con)]
setkey(df1[, c(1,4,2,3), with=FALSE], ID, Con, Time)[DT2]
# ID Con Time Num
# 1: 1 A Jan 10
# 2: 1 A Feb 15
# 3: 1 A Mar NA
# 4: 1 A Apr NA
# 5: 1 A May 20
# 6: 2 B Feb 12
# 7: 2 B Mar 14
# 8: 2 B Apr NA
# 9: 2 B May NA
#10: 2 B Jun 15
NOTE: It may be better to keep missing value as NA

How to calculate the exponential in some columns of a dataframe in R?

I have a dataframe:
X Year Dependent.variable.1 Forecast.Dependent.variable.1
1 2009 12.42669703 12.41831191
2 2010 12.39309563 12.40043599
3 2011 12.36596964 12.38256006
4 2012 12.32067284 12.36468414
5 2013 12.303095 12.34680822
6 2014 NA 12.32893229
7 2015 NA 12.31105637
8 2016 NA 12.29318044
9 2017 NA 12.27530452
10 2018 NA 12.25742859
I want to calulate the exponential of the third and fourth columns. How can I do that?
In case your dataframe is called dfs, you can do the following:
dfs[c('Dependent.variable.1','Forecast.Dependent.variable.1')] <- exp(dfs[c('Dependent.variable.1','Forecast.Dependent.variable.1')])
which gives you:
X Year Dependent.variable.1 Forecast.Dependent.variable.1
1 1 2009 249371 247288.7
2 2 2010 241131 242907.5
3 3 2011 234678 238603.9
4 4 2012 224285 234376.5
5 5 2013 220377 230224.0
6 6 2014 NA 226145.1
7 7 2015 NA 222138.5
8 8 2016 NA 218202.9
9 9 2017 NA 214336.9
10 10 2018 NA 210539.5
In case you know the column numbers, this could then also simply be done by using:
dfs[,3:4] <- exp(dfs[,3:4])
which gives you the same result as above. I usually prefer to use the actual column names as the indices might change when the data frame is further processed (e.g. I delete columns, then the indices change).
Or you could do:
dfs$Dependent.variable.1 <- exp(dfs$Dependent.variable.1)
dfs$Forecast.Dependent.variable.1 <- exp(dfs$Forecast.Dependent.variable.1)
In case you want to store these columns in new variables (below they are called exp1 and exp2, respectively), you can do:
exp1 <- exp(dfs$Forecast.Dependent.variable.1)
exp2 <- exp(dfs$Dependent.variable.1)
In case you want to apply it to more than two columns and/or use more complicated functions, I highly recommend to look at apply/lappy.
Does that answer your question?

forecast throws error K must be not be greater than period/2

I issue the following commands:
ops <- read.csv("ops.csv")
ops.ts <- ts(ops, frequency=12, start=c(2014,1))
ops.fc <- forecast(ops.ts)
forecast() then throws the following error:
Error in ...fourier(x, K, 1:length(x)) :
K must be not be greater than period/2
The data from the csv looks like this according to summary(ops):
1 10
2 3
3 7
4 4
5 2
6 20
7 13
8 9
9 8
10 7
11 6
12 11
13 7
R is up to date, Forecast is installed via CRAN.
I appreciate any advice especially because I am quiet new to R.
The error message is self-explanatory.
You have 13 elements in your dataset so when you do:
ops.ts <- ts(ops, frequency = 12, start=c(2014, 1))
You get (notice the 2015 value here):
#> ops.ts
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#2014 10 3 7 4 2 20 13 9 8 7 6 11
#2015 7
I'm guessing you only want to use the first 12 months and then use forecast() ? If that is the case you can do either:
ops.ts <- ts(ops, frequency = 12, start = 2014, end = c(2015, 0))
ops.fc <- forecast(ops.ts)
or
ops <- ops[1:12, ]
ops.ts <- ts(ops, frequency = 12, start = 2014)
ops.fc <- forecast(ops.ts)

insert NA into a time series object in r

i want to sum the months for all the years in a time series that looks like
Jan Feb Mar Apr Jun Jul Aug Sep Oct Nov Dec
2006 4 4 3 4 4 5 5 3 3
2007 3 3 2 2 4 3 3 2 2 5 5
2008 3 3 3 2 2 4 4 3
by using
window(the time series object,start=c(2006,3),end=c(2008,3),frequency=1)
this line gives you a new ts object with just march of 2006-2007. However this does not work when the month does not have any values in it, is there any way to replace the gaps with NA? I have seen questions like this before but the dont answer i think for a ts object.
Assuming that
the_time_series_object <- ts(1:31, frequency = 12, start = c(2006, 3))
then:
window(the time series object, start = c(2006,3), end = c(2008,3), frequency = 12)
Your frequency should be 12 instead of 1. There's no NA problem it's just that one variable that you have wrong

Resources