Fill in missing years in Data frame [duplicate]

This question already has answers here:
Extend an irregular sequence and add zeros to missing values
(9 answers)
Closed 1 year ago.
I have some data in R that looks like this.
year freq
<int> <int>
1902 2
1903 2
1905 1
1906 4
1907 1
1908 1
1909 1
1912 1
1914 1
1915 1
The data was read in using the following code.
library(plyr)  # count() with a quoted column name, giving a freq column, is plyr::count()
data = read.csv("earthquakes.csv")
my_var <- c('year')
new_data <- data[my_var]
counts <- count(data, 'year')
This is 1 page of a 7 page table. I need to fill in the missing years with a count of 0 from 1900-1999. How would I go about this? I haven't been able to find an example online where year is the primary column.

We may use complete on the 'counts' data
library(tidyr)
complete(counts, year = 1900:1999, fill = list(freq = 0))

1) Convert the input, shown in the Note, to zoo class and then to ts class. The latter will fill in the missing years with NA. Replace the NAs with 0, convert back to a data frame and set the names to the original names.
If a ts series is ok as output then omit the last two lines. If in addition it is ok to use NA rather than 0 then omit the last three lines.
library(zoo)
DF |>
read.zoo() |>
as.ts() |>
na.fill(0) |>
fortify.zoo() |>
setNames(names(DF))
giving:
year freq
1 1902 2
2 1903 2
3 1904 0
4 1905 1
5 1906 4
6 1907 1
7 1908 1
8 1909 1
9 1910 0
10 1911 0
11 1912 1
12 1913 0
13 1914 1
14 1915 1
2) For a base R solution, use merge. Omit the last line if NA is ok instead of 0.
m <- merge(DF, data.frame(year = min(DF$year):max(DF$year)), all = TRUE)
transform(m, freq = replace(freq, is.na(freq), 0))
Note
Lines <- "year freq
1902 2
1903 2
1905 1
1906 4
1907 1
1908 1
1909 1
1912 1
1914 1
1915 1"
DF <- read.table(text = Lines, header = TRUE)
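Both solutions above fill only the gaps between the first and last year present in the data. If the full 1900-1999 range from the question is needed, one possible base R sketch uses match() against the complete sequence of years. The names counts, all_years, idx and filled are only illustrative, and the range 1900:1999 is an assumption taken from the question text.

```r
# Sample counts from the question (only the years that have data).
counts <- data.frame(
  year = c(1902L, 1903L, 1905L, 1906L, 1907L, 1908L, 1909L, 1912L, 1914L, 1915L),
  freq = c(2L, 2L, 1L, 4L, 1L, 1L, 1L, 1L, 1L, 1L)
)

# Full range requested in the question (assumption: 1900 to 1999 inclusive).
all_years <- 1900:1999

# match() gives the row of `counts` for each year, or NA when the year is absent.
idx <- match(all_years, counts$year)

# Use the stored frequency where a row exists, 0 otherwise.
filled <- data.frame(
  year = all_years,
  freq = ifelse(is.na(idx), 0L, counts$freq[idx])
)

head(filled, 4)
```

This avoids any extra packages, at the cost of being slightly less readable than complete() or merge().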

Related

Create a table out of a tibble

I do have the following dataframe with 45 million observations:
year month variable
1992 1 0
1992 1 1
1992 1 1
1992 2 0
1992 2 1
1992 2 0
My goal is to count the frequency of the variable for each month of a year.
I was already able to generate these sums with cps_data as my dataframe and SKILL_1 as my variable.
cps_data %>%
group_by(YEAR, MONTH) %>%
summarise_at(vars(SKILL_1),
list(name = sum))
Logically, I obtained 349 different rows as a tibble. Now I am struggling to create a new table from these values. My new table should look similar to my tibble. How can I do that? Is there even a way? I've already tried reading in an Excel file with a date range from 01/1992 to 01/2021 in order to obtain exactly 349 rows and then merging it with the rows of the tibble, but it did not work.
# A tibble: 349 x 3
# Groups: YEAR [30]
YEAR MONTH name
<dbl> <int+lbl> <dbl>
1 1992 1 [January] 499
2 1992 2 [February] 482
3 1992 3 [March] 485
4 1992 4 [April] 457
5 1992 5 [May] 434
6 1992 6 [June] 470
7 1992 7 [July] 450
8 1992 8 [August] 438
9 1992 9 [September] 442
10 1992 10 [October] 427
# ... with 339 more rows
many thanks in advance!!
library(zoo)
createmonthyear <- function(start_date,end_date){
ym <- seq(as.yearmon(start_date), as.yearmon(end_date), 1/12)
data.frame(start = pmax(start_date, as.Date(ym)),
end = pmin(end_date, as.Date(ym, frac = 1)),
month = month.name[cycle(ym)],
year = as.integer(ym),
stringsAsFactors = FALSE)}
Once you create the function, you can specify the start and end date you want:
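# Pass Date objects: unquoted 1991-01-01 would be evaluated as arithmetic (1989).
left_table <- data.frame(createmonthyear(as.Date("1991-01-01"), as.Date("2021-01-01")))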
then left join the output with what you have
library(dplyr)
right_table <- data.frame(cps_data %>%
group_by(YEAR, MONTH) %>%
summarise_at(vars(SKILL_1),
list(name = sum)))
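# left_table has lower-case year/month columns and right_table has YEAR/MONTH.
# month holds month names, so convert MONTH (assumed to be month numbers 1-12) to names first.
right_table$MONTH <- month.name[as.integer(right_table$MONTH)]
results <- left_join(left_table, right_table, by = c("year" = "YEAR", "month" = "MONTH"))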
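If zoo and dplyr are not needed otherwise, the same idea (build a complete year-month grid, then left join the summary onto it) can be sketched in base R. This is only a sketch: the data frame monthly is a hypothetical stand-in for the summarised tibble, with March 1992 left out on purpose, and the names month_starts, grid and full are illustrative.

```r
# Hypothetical stand-in for the summarised tibble (YEAR, MONTH, name);
# March 1992 is deliberately left out to show the gap being filled.
monthly <- data.frame(
  YEAR  = c(1992, 1992, 1992),
  MONTH = c(1L, 2L, 4L),
  name  = c(499, 482, 457)
)

# One row per calendar month from January 1992 to January 2021.
month_starts <- seq(as.Date("1992-01-01"), as.Date("2021-01-01"), by = "month")
grid <- data.frame(
  YEAR  = as.numeric(format(month_starts, "%Y")),
  MONTH = as.integer(format(month_starts, "%m"))
)

# Left join: keep every month of the grid, NA where `monthly` has no row.
full <- merge(grid, monthly, by = c("YEAR", "MONTH"), all.x = TRUE)
full <- full[order(full$YEAR, full$MONTH), ]

nrow(full)
```

Joining on numeric YEAR and MONTH columns sidesteps the mismatch between month names and month numbers.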

How to add a year to the existing date list without erasing any of the existing ones

I have a data frame with Date and Velocity as they are seen below. My issue is that some years are missing like 1945 and 1951.
I would like to add 1945 to Date only once, in the position where it belongs, between 1944 and 1946. I know some years are repeated. The day and month are not very important, as they are more of a placeholder. I plan to make the velocity equal to 0 for all the added years (e.g. mm-dd-1945).
What I have
Date Velocity
2/23/1944 1
12/26/1944 2
1/7/1946 5
3/25/1947 8
4/14/1948 10
6/18/1949 12
1/31/1950 13
12/7/1950 14
1/27/1952 15
I tried doing the following
NewYear <- complete(Data,Date = seq.Date(min(Data$Date),
max(Data$Date), by="year"))
but all of the existing dates get overwritten and I end up with this
Date Velocity
2/23/1944 NA
2/23/1945 NA
2/23/1946 NA
2/23/1947 NA
2/23/1948 NA
2/23/1949 NA
2/23/1950 NA
2/23/1951 NA
2/23/1952 NA
Desired Output
Date Velocity
2/23/1944 1
12/26/1944 2
1/01/1945 0
1/7/1946 5
3/25/1947 8
4/14/1948 10
6/18/1949 12
1/31/1950 13
12/7/1950 14
1/1/1951 0
1/27/1952 15
We first need to extract the year from the date then use complete to get missing years and replace the missing Date with first day of the Year.
library(dplyr)
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y"),
Year = as.integer(format(Date, "%Y"))) %>%
tidyr::complete(Year = seq(min(Year), max(Year)), fill = list(Velocity = 0)) %>%
mutate(Date = if_else(is.na(Date), as.Date(paste0(Year, "-01-01")), Date))
# Year Date Velocity
# <int> <date> <dbl>
# 1 1944 1944-02-23 1
# 2 1944 1944-12-26 2
# 3 1945 1945-01-01 0
# 4 1946 1946-01-07 5
# 5 1947 1947-03-25 8
# 6 1948 1948-04-14 10
# 7 1949 1949-06-18 12
# 8 1950 1950-01-31 13
# 9 1950 1950-12-07 14
#10 1951 1951-01-01 0
#11 1952 1952-01-27 15
Add select(-Year) if you don't want Year column in your final output.
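For comparison, here is a base R sketch of the same approach: find the years missing from the range, create one placeholder row per missing year, then append and sort. It assumes at least one year is actually missing, and the names years, missing, extra and out are only illustrative.

```r
# Sample data from the question.
df <- data.frame(
  Date = c("2/23/1944", "12/26/1944", "1/7/1946", "3/25/1947", "4/14/1948",
           "6/18/1949", "1/31/1950", "12/7/1950", "1/27/1952"),
  Velocity = c(1, 2, 5, 8, 10, 12, 13, 14, 15)
)
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")

# Years present in the data and years missing from the min-max range.
years   <- as.integer(format(df$Date, "%Y"))
missing <- setdiff(seq(min(years), max(years)), years)

# Placeholder rows: 1 January of each missing year, Velocity 0
# (assumes `missing` is not empty).
extra <- data.frame(
  Date = as.Date(paste0(missing, "-01-01")),
  Velocity = rep(0, length(missing))
)

# Append and sort by date so each new year lands in the right position.
out <- rbind(df, extra)
out <- out[order(out$Date), ]
rownames(out) <- NULL
out
```

Existing rows are never touched here, so repeated years such as 1944 and 1950 keep both of their dates.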

Sum daily values into monthly values

I am trying to sum daily rainfall values into monthly totals for a record over 100 years in length. My data takes the form:
Year Month Day Rain
1890 1 1 0
1890 1 2 3.1
1890 1 3 2.5
1890 1 4 15.2
In the example above I want R to sum all the days of rainfall in January 1890, then February 1890, March 1890.... through to December 2010. I guess what I'm trying to do is create a loop to sum values. My output file should look like:
Year Month Rain
1890 1 80.5
1890 2 72.4
1890 3 66.8
1890 4 77.2
Any easy way to do this?
Many thanks.
You can use dplyr for some pleasing syntax
library(dplyr)
df %>%
group_by(Year, Month) %>%
summarise(Rain = sum(Rain))
In some cases it can be beneficial to convert it to a time-series class like xts, then you can use functions like apply.monthly().
Data:
df <- data.frame(
Year = rep(1890,5),
Month = c(1,1,1,2,2),
Day = 1:5,
rain = rexp(5)
)
> head(df)
Year Month Day rain
1 1890 1 1 0.1528641
2 1890 1 2 0.1603080
3 1890 1 3 0.5363315
4 1890 2 4 0.6368029
5 1890 2 5 0.5632891
Convert it to xts and use apply.monthly():
library(xts)
dates <- with(df, as.Date(paste(Year, Month, Day), format = "%Y %m %d"))
myXts <- xts(df$rain, dates)
> head(apply.monthly(myXts, sum))
[,1]
1890-01-03 0.8495036
1890-02-05 1.2000919
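A base R alternative, sketched here without any packages, is aggregate() with the formula interface. The first four rows below are taken from the question; the two February rows are made up purely for illustration, and the names rain and monthly are illustrative too.

```r
# Small sample in the question's layout (the February values are made up).
rain <- data.frame(
  Year  = c(1890, 1890, 1890, 1890, 1890, 1890),
  Month = c(1, 1, 1, 1, 2, 2),
  Day   = c(1, 2, 3, 4, 1, 2),
  Rain  = c(0, 3.1, 2.5, 15.2, 1.0, 4.5)
)

# Sum Rain within each Year/Month combination.
monthly <- aggregate(Rain ~ Year + Month, data = rain, FUN = sum)
monthly <- monthly[order(monthly$Year, monthly$Month), ]
monthly
```

The result has one row per Year and Month, which matches the output layout asked for in the question.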

How to compute the daily average from hourly values?

I have a text file consisting of 6 columns, as shown below. The measurements are taken every 30 minutes over several years (2001-2013). I want to compute the daily average. For example, for 2001, take all values corresponding to the first day (1) and compute their average, then do the same for every day in that year and for every year available in the text file.
to read the file:
LR=read.table("C:\\Users\\dat.txt", sep ='', header =TRUE)
header:
head(LR)
Year day hour mint valu1 valu2
1 2001 1 5 30 0 0
2 2001 1 6 0 1 0
3 2001 1 6 30 2 0
4 2001 1 7 0 0 7
5 2001 1 7 30 5 8
6 2001 1 8 0 0 0
Try:
library(plyr)
ddply(LR, .(Year, day), summarize, val = mean(valu1))
And another less elegant option:
LR$n <- paste(LR$Year, LR$day, sep="-")
tapply(LR$valu1, LR$n, FUN=mean)
If you want to select a certain range of years use subset:
dat <- ddply(LR, .(Year, day), summarize, val = mean(valu1))
subset(dat, Year > 2003 & Year < 2005)
You can try aggregate:
res <- aggregate(LR, by = list(paste0(LR$Year, LR$day)), FUN = mean)
## You can remove the extra columns if you want
res[, -c(1,4,5)]
Or as Michael Lawrence suggests, using the formula interface:
aggregate(cbind(valu1, valu2) ~ Year + day, LR, mean)
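As a quick check of the formula interface, here is a small self-contained run on the six rows shown in the question. It is only a sketch: the data frame is rebuilt by hand rather than read from the file, and the name daily is illustrative.

```r
# The six rows shown in the question.
LR <- data.frame(
  Year  = rep(2001, 6),
  day   = rep(1, 6),
  hour  = c(5, 6, 6, 7, 7, 8),
  mint  = c(30, 0, 30, 0, 30, 0),
  valu1 = c(0, 1, 2, 0, 5, 0),
  valu2 = c(0, 0, 0, 7, 8, 0)
)

# Mean of valu1 and valu2 for each Year/day combination.
daily <- aggregate(cbind(valu1, valu2) ~ Year + day, data = LR, FUN = mean)
daily
```

All six rows belong to day 1 of 2001, so the result is a single row holding the two daily means.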

Combine lists of different lengths

I am new to R and started learning two weeks ago. I want to take a list of tropical cyclone counts for various years (where some years are absent, because there were no tropical cyclones) and create a list with a column of every year from 1907-2013 and a column of the number of tropical cyclones.
In the example I include the list of occurrences to 1973 (before 1912 there were none).
Year Count
1 1912 1
2 1913 1
3 1921 1
4 1940 1
5 1953 1
6 1958 1
7 1959 1
8 1960 1
9 1966 1
10 1969 1
11 1971 1
12 1973 2
I tried using a for loop and if/else statement, but it does not work. I get the message "longer object length is not a multiple of shorter object length" and "the condition has length > 1 and only the first element will be used."
tc.SP=matrix(0,len.tc.yr,2)
tc.SP[,1]=tc.year.list
for (i in 1:len.tc.yr) #107 yrs (1907-2013)
{
if (tc.SP5.count[,1] == tc.SP[,1]) #tc.SP5.count is various years of TC occ.
{tc.SP[,2]= tc.SP5.count[,2]}
else
{tc.SP[,2]= 0}
}
Thank you for any help in advance.
When you say list, I'm going to assume you want to create a data.frame. Let's say the data above is in a data.frame called cyclone. The easiest way to create a data.frame for every year is just to merge it with a complete list. For example
cyclone.full <- merge(cyclone, data.frame(Year=1907:2013), all=T)
Here the data.frames will automatically merge on the Year column because both sets have that column. This will put NA values in all the missing years. If you want the default to be 0, you can do
cyclone.full$Count[is.na(cyclone.full$Count)] <- 0
Then you get
head(cyclone.full)
# Year Count
# 1 1907 0
# 2 1908 0
# 3 1909 0
# 4 1910 0
# 5 1911 0
# 6 1912 1
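As for why the loop in the question fails: tc.SP5.count[,1] == tc.SP[,1] compares two whole vectors of different lengths, so the if condition is not a single TRUE/FALSE. If you want to keep the matrix and the loop, a possible sketch is to compare one year per iteration. The names all_years and hit are illustrative, and cyclone is the data.frame assumed above.

```r
# Counts from the question (years with at least one tropical cyclone).
cyclone <- data.frame(
  Year  = c(1912, 1913, 1921, 1940, 1953, 1958, 1959, 1960, 1966, 1969, 1971, 1973),
  Count = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2)
)

all_years <- 1907:2013

# Two-column matrix as in the question: year, then count (0 by default).
tc.SP <- cbind(Year = all_years, Count = 0)

# Compare one year per iteration, so the `if` condition has length 1.
for (i in seq_along(all_years)) {
  hit <- which(cyclone$Year == all_years[i])
  if (length(hit) == 1) {
    tc.SP[i, "Count"] <- cyclone$Count[hit]
  }
}

head(tc.SP)
```

The merge() approach is shorter, but this version shows the minimal change needed to make the original loop work.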
