Stacking a dataset using a for loop in R

I'm having trouble answering a homework question.
Create a larger dataset by stacking gm2
n=100 times over. That is, if nrg is the number
of rows of gm2 and ncg is the number of columns,
the larger dataset should have 100*nrg rows and
ncg columns.
Call your stacked dataset biggm2.
To create the stacked dataset,
initialize with biggm2 <- NULL and use
a for loop to build up biggm2 one layer
at a time. Time this code using the system.time() function.
An example use of system.time() to time an R
command, e.g., x <- rnorm(100000) is:
Code given:
system.time(
{
x <- rnorm(100000) # Could put multiple lines of R code here
})
I am new to R and finding the answer is taking me a long time, hence I'm asking this question here.
I'm using library(gapminder)
To add, 'gm2' is a dataset from an earlier part of the assignment:
Keep the columns year, lifeExp, pop and
gdpPercap and save this dataset as gm2.
Also coerce gm2 to a matrix and save it as gm3.
This is what I did to read the dataset. Is this correct?
gm2 <- c(CanUS2$year, CanUS2$lifeExp, CanUS2$pop, CanUS2$gdpPercap)
gm2
gm3 <- matrix(gm2, nrow=24, ncol=4)
gm3
Or is this correct?
gm2 <- subset(CanUS2, select = c("year","lifeExp","pop","gdpPercap"))
But here, there is no point in coercing to a matrix since gm2 is already a matrix.
gm3 <- matrix(gm2, nrow=24, ncol=4)
gm3
I am really confused.
Any advice will be really appreciated.

library(gapminder)
Tidyverse:
library(dplyr)
filter(gapminder, country %in% c("United States", "Canada")) %>%
select(year, lifeExp, pop, gdpPercap) -> tidyverse_can_us
head(tidyverse_can_us)
## # A tibble: 6 × 4
## year lifeExp pop gdpPercap
## <int> <dbl> <int> <dbl>
## 1 1952 68.75 14785584 11367.16
## 2 1957 69.96 17010154 12489.95
## 3 1962 71.30 18985849 13462.49
## 4 1967 72.13 20819767 16076.59
## 5 1972 72.88 22284500 18970.57
## 6 1977 74.21 23796400 22090.88
Base R:
gapminder[with(gapminder, country %in% c("United States", "Canada")),
c("year", "lifeExp", "pop", "gdpPercap")] -> base_r_can_us
head(base_r_can_us)
## # A tibble: 6 × 4
## year lifeExp pop gdpPercap
## <int> <dbl> <int> <dbl>
## 1 1952 68.75 14785584 11367.16
## 2 1957 69.96 17010154 12489.95
## 3 1962 71.30 18985849 13462.49
## 4 1967 72.13 20819767 16076.59
## 5 1972 72.88 22284500 18970.57
## 6 1977 74.21 23796400 22090.88
For either of those you've lost the factor variable that ties the values to the country. We also have no idea why you want/need a matrix. I assumed you were extracting Canada & U.S. data from your variable name, but your question is mostly incoherent, so it's hard to know what data you have and what outcome you want/need.
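For the stacking part of the assignment itself, here is a minimal sketch of the loop the homework describes, using a tiny stand-in data frame (since it's unclear what gm2 actually contains, swap in your own gapminder subset):

```r
# Stand-in for gm2; replace with your subset of
# year, lifeExp, pop, gdpPercap
gm2 <- data.frame(year = 1952, lifeExp = 68.8,
                  pop = 14785584, gdpPercap = 11367)

system.time({
  biggm2 <- NULL
  for (i in 1:100) {
    biggm2 <- rbind(biggm2, gm2)  # add one "layer" per iteration
  }
})
nrow(biggm2)  # 100 * nrow(gm2)
```

Growing an object with rbind() inside a loop is deliberately slow, which is presumably the point of timing it; `do.call(rbind, replicate(100, gm2, simplify = FALSE))` would be the faster non-loop version to compare against.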

Related

How to loop many factors into one function

I have a large data frame regarding Covid patients. I have included a very simplified version of what this frame looks like.
CovidFake <- data.frame(DateReporting=sample(seq(as.Date("2020-10-1"), as.Date("2020-11-01"), by="day"), 50, replace=TRUE),
Industry=sample(c("Minor or Student", "Educational Services", "Medical Services", "Food Production"),50, replace =TRUE))
I want to use ggplot to make a graph of daily cases by the patient's industry. I have this function to structure the frame so ggplot can graph it.
library(zoo)    # rollmean
library(dplyr)  # %>%
library(tidyr)  # complete, replace_na
MainFunction <- function(MainFrame, CatVal){
Frame <- subset(MainFrame, Industry==CatVal)
Frame <- as.data.frame(table(Frame$DateReporting))
colnames(Frame) <- c("Var1", "Freq")
Frame$Var1 <- as.Date(Frame$Var1, "%Y-%m-%d")
Frame <- Frame %>% complete(Var1 = seq.Date(as.Date("2020-10-01", "%Y-%m-%d"),
as.Date("2020-11-01", "%Y-%m-%d"), by="day"))
Frame$Freq <- replace_na(Frame$Freq, 0)
Frame$CumSum <- cumsum(Frame$Freq)
Frame$Cat <- CatVal
Frame$SevenDayAverage <- rollmean(Frame$Freq, 7, fill=NA, align = "right")
colnames(Frame) <- c("Date", "DailyCases", "CumSum", "Industry", "SevenDayAve")
Frame <- subset(Frame, Date >= "2020-03-13")
return(Frame)
}
I need to create a frame that has all of these industries, so I've been doing something like this.
IndGraph <- rbind(MainFunction(CovidFake, "Minor or Student"),
MainFunction(CovidFake, "Educational Services"),
MainFunction(CovidFake, "Medical Services"),
MainFunction(CovidFake, "Food Production"))
The true frame has about 15 industries, so the code gets pretty long and seemingly unnecessarily repetitive. Is there any way to loop all the factors into the function and do this in one go? Or is there a simpler way to structure the frame? I'm new to R, so any and all help is much appreciated.
Thanks!
Using a for loop (iterate over the unique industries; looping over the raw Industry column would call MainFunction once per row and stack duplicate copies):
IndGraph <- data.frame()
for(i in unique(CovidFake$Industry)){
  IndGraph <- rbind(IndGraph, MainFunction(CovidFake, i))
}
Output:
> IndGraph
# A tibble: 128 x 5
Date DailyCases CumSum Industry SevenDayAve
<date> <dbl> <dbl> <chr> <dbl>
1 2020-10-01 0 0 Minor or Student NA
2 2020-10-02 0 0 Minor or Student NA
3 2020-10-03 1 1 Minor or Student NA
4 2020-10-04 0 1 Minor or Student NA
5 2020-10-05 0 1 Minor or Student NA
6 2020-10-06 0 1 Minor or Student NA
7 2020-10-07 1 2 Minor or Student 0.286
8 2020-10-08 1 3 Minor or Student 0.429
9 2020-10-09 2 5 Minor or Student 0.714
10 2020-10-10 0 5 Minor or Student 0.571
# ... with 118 more rows
One option would be:
do.call("rbind", lapply(unique(CovidFake$Industry), FUN = function(x, y = CovidFake) MainFunction(y, x)))
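If you're already in the tidyverse, purrr::map_dfr collapses the lapply/do.call pair into one call. A sketch with a toy stand-in for MainFunction (the real one needs the full setup above):

```r
library(purrr)

# Toy stand-in for MainFunction(CovidFake, i):
# returns one summary row per category
toy_fun <- function(df, cat) {
  data.frame(Industry = cat, n = sum(df$Industry == cat))
}
toy <- data.frame(Industry = c("A", "A", "B"))

# map_dfr applies the function to each unique industry
# and row-binds the results into one data frame
result <- map_dfr(unique(toy$Industry), ~ toy_fun(toy, .x))
```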

Trying to group data by region and summarize by date in R Studio on COVID19 epidemic

I'm an old FORTRAN/C programmer trying to learn R. I started working with data on the COVID19 epidemic and have run aground.
The data I'm working with started out as wide data and I have converted it to long (row) data. It contains a daily count of cases by ProvinceState, Region/Country, Lat, Long, Date, Cases.
I want to filter the dataframe for Mainland China and summarize cases by date as a first step. The code below generates a NULL data set when I try to group the data.
Thanks for any help!
library(dplyr)
library(dygraphs)
library(lubridate)
library(tidyverse)
library(timeSeries)
# Set current working directory.
#
setwd("/Users/markmcleod/MarksRepository/Data")
# Read a Case csv files
#
Covid19ConfirmedWideData <- read.csv("Covid19Deaths.csv",header=TRUE,check.names = FALSE)
# count the number of days of data
#
Covid19ConfirmedDays = NCOL(Covid19ConfirmedWideData)
# Gather Wide Data columns starting at column 5 until NCOL() into RowData DataFrame
#
Covid19ConfirmedRowData <- gather(Covid19ConfirmedWideData, Date, Cases, 5:Covid19ConfirmedDays, na.rm = FALSE, convert = TRUE)
tibble(Covid19ConfirmedRowData)
# # A tibble: 2,204 x 1
# Covid19ConfirmedRowData$ProvinceState $CountryRegion $Lat $Long $Date $Cases
# <fct> <fct> <dbl> <dbl> <chr> <int>
# 1 Anhui Mainland China 31.8 117. 1/22/20 0
# 2 Beijing Mainland China 40.2 116. 1/22/20 0
# 3 Chongqing Mainland China 30.1 108. 1/22/20 0
# Transmute date from chr to date
#
Covid19ConfirmedFormatedData <- transmute(Covid19ConfirmedRowData,CountryRegion,Date=as.Date(Date,format="%m/%d/%Y"),Cases)
tibble(Covid19ConfirmedFormatedData)
# # A tibble: 2,204 x 1
# Covid19ConfirmedFormatedData$CountryRegion $Date $Cases
# <fct> <date> <int>
# 1 Mainland China 0020-01-22 0
# 2 Mainland China 0020-01-22 0
Covid19ConfirmedGroupedData <- Covid19ConfirmedFormatedData %>%
filter(Covid19ConfirmedFormatedData$CountryRegion=='Mainland China')
tibble(Covid19ConfirmedGroupedData)
# A tibble: 2,204 x 1
Covid19ConfirmedGroupedData[,1] [,2] [,3]
<dbl> <dbl> <dbl>
1 NA NA NA
It appears that I have a conflict in the libraries I am using.
I fell back to a previous version of the code and used only the following libraries.
library(dygraphs)
library(lubridate)
library(tidyverse)
The code seems to work again.
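Loading fewer packages works, but the underlying issue is name masking (several of those packages export their own filter/lag). A lighter-weight guard, sketched here on toy data, is to namespace the calls that matter, so they keep working no matter what else is attached:

```r
library(dplyr)

df <- data.frame(CountryRegion = c("Mainland China", "US", "Mainland China"),
                 Cases = c(10, 2, 5))

# Explicit dplyr:: prefixes cannot be shadowed by a later library() call
china <- dplyr::filter(df, CountryRegion == "Mainland China")
totals <- dplyr::summarise(dplyr::group_by(china, CountryRegion),
                           Cases = sum(Cases))
```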

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate differences in my monthly data? For example, I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sep, etc., for each well by year. Note that in some years some months will be missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
library(reshape2) # for dcast
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
merge(sites, wells) %>%
dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
merge(months) %>%
dplyr::rename(months = y) %>%
dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
full_sites_wells_months_set %>%
dplyr::sample_frac(data_availability) %>%
dplyr::mutate(values = runif(nrow(full_sites_wells_months_set)*data_availability)) # generate random groundwater values
# generate final result by joining full expected set of sites, wells, and months to actual data, then group by sites and wells and perform lag subtraction
final_tibble <-
full_sites_wells_months_set %>%
dplyr::left_join(initial_tibble) %>%
dplyr::group_by(sites, wells) %>%
dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
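An alternative sketch for the gap-filling step uses tidyr::complete instead of the merge chain, then the same lag subtraction (column names here are illustrative, not taken from the real data):

```r
library(dplyr)
library(tidyr)

df <- data.frame(Well = 222, month = c(1, 2, 8),
                 value = c(8.53, 8.69, 9.66))

filled <- df %>%
  complete(Well, month = 1:12) %>%       # insert missing months as NA rows
  arrange(Well, month) %>%
  group_by(Well) %>%
  mutate(diff6 = value - lag(value, 6))  # e.g. August minus February
```

A gap six months back simply propagates as NA, which is usually the right behaviour for missing months.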

loop to run model on subset dataframe

I am not very experienced with loops so I am not sure where I went wrong here...
I have a dataframe that looks like:
month year day mean.temp mean.temp.year.month
1 1961 1 4.85 4.090323
1 1961 2 4.90 4.090323
1 1961 3 2.95 4.090323
1 1961 4 3.40 4.090323
1 1961 5 2.90 4.090323
dataset showing 3 months for 2 years can be found here:
https://drive.google.com/file/d/1w7NVeoEh8b7cAkU3cu1sXx6yCh75Inqg/view?usp=sharing
and I want to subset this dataframe by year and month so that I can run one nls model per year and month. Since my dataset contains 56 years (and each year has 12 months), that will give 672 models. Then I want to store the parameter estimates in a separate table.
I've created this code, but I can't work out why it is only giving me the parameter estimates for month 12 (all 56 years, but just month 12):
table <- matrix(99999, nrow=672, ncol=4)
YEARMONTHsel <- unique(df_weather[c("year", "month")])
YEARsel <- unique(df_weather$year)
MONTHsel <- unique(df_weather$month)
for (i in 1:length(YEARsel)) {
for (j in 1:length(MONTHsel)) {
temp2 <- df_weather[df_weather$year==YEARsel[i] & df_weather$month==MONTHsel[j],]
mn <- nls(mean.temp~mean.temp.year.month+alpha*sin(day*pi*2/30+phi),
data = temp2, control=nlc,
start=list(alpha=-6.07043, phi = -10))
cr <- as.vector(coef(mn))
nv <-length(coef(mn))
table[i,1:nv] <- cr
table[i,nv+1]<- YEARsel[i]
table[i,nv+2]<- MONTHsel[j]
}
}
I've tried several options (i.e. without using nested loop) but I'm not getting anywhere.
Any help would be greatly appreciated! Thanks.
Based on your loop, it looks like you want to run the regression grouped by year and month and then extract the coefficients into a new dataframe (correct me if that's wrong).
library(readxl)
library(tidyverse)
df <- read_excel("~/Downloads/df_weather.xlsx")
df %>% nest(-month, -year) %>%
  mutate(model = map(data, ~nls(mean.temp ~ mean.temp.year.month + alpha*sin(day*pi*2/30 + phi),
                                data = .x, control = nlc,  # nlc is your nls.control() object
                                start = list(alpha = -6.07043, phi = -10))),
         coeff = map(model, ~coefficients(.x))) %>%
  unnest(coeff %>% map(broom::tidy)) %>%
  spread(names, x) %>%
  arrange(year)
#> # A tibble: 6 x 4
#> month year alpha phi
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1961 0.561 -10.8
#> 2 2 1961 -1.50 -10.5
#> 3 3 1961 -2.06 -9.77
#> 4 1 1962 -3.35 -5.48
#> 5 2 1962 -2.27 -9.97
#> 6 3 1962 0.959 -10.8
First we nest the data based on your groups (in this case year and month), then we map the model for each group, then we map the coefficients for each group, lastly we unnest the coefficients and spread the data from long to wide.
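As an aside, the reason the original loop only keeps month 12 is that `table[i, ...]` writes all twelve fits of year i into the same row, so each month overwrites the last. A minimal sketch of the indexing fix (with the nls call stubbed out, since nlc and the full data aren't shown):

```r
n_years <- 2; n_months <- 12
results <- matrix(NA_real_, nrow = n_years * n_months, ncol = 4)

for (i in 1:n_years) {
  for (j in 1:n_months) {
    row <- (i - 1) * n_months + j  # one row per (year, month) pair
    # results[row, 1:2] <- coef(mn) # would hold alpha and phi here
    results[row, 3] <- i           # year index
    results[row, 4] <- j           # month index
  }
}
```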

Looping through two dataframes and adding columns inside of the loop

I have a problem when specifying a loop with a data frame.
The general idea I have is the following:
I have an area which contains a certain number of raster quadrants. These raster quadrants have been visited irregularly over several years (e.g. from 1950 to 2015).
I have two data frames:
1) a data frame containing the IDs of the rasterquadrants (and one column for the year of first visit of this quadrant):
df1<- as.data.frame(cbind(c("12345","12346","12347","12348"),rep(NA,4)))
df1[,1]<- as.character(df1[,1])
df1[,2]<- as.numeric(df1[,2])
names(df1)<-c("Raster_Q","First_visit")
2) a data frame that contains the info on the visits; this one is ordered first by rasterquadrant and then by year. This dataframe records which rasterquadrant was visited and when.
df2<- as.data.frame(cbind(c(rep("12345",5),rep("12346",7),rep("12347",3),rep(12348,9)),
c(1950,1952,1955,1967,1951,1968,1970,
1998,2001,2014,2015,2017,1965,1986,2000,1952,1955,1957,1965,2003,2014,2015,2016,2017)))
df2[,1]<- as.character(df2[,1])
df2[,2]<- as.numeric(as.character(df2[,2]))
names(df2)<-c("Raster_Q","Year")
I want to know when and how often the full area was 'sampled'.
Scheme of what I want to do; different colors indicate different areas/regions
My rationale:
I sorted the complete data in df2 by quadrant and year. I then match the rasterquadrant in df1 with the name of the rasterquadrant in df2 and add the first value of Year from df2.
For this I wrote a loop (see below)
In order not to replicate a quadrant I created a vector "visited"
visited<-c()
Every entry of df2 that matches df1 will be written into this vector, so that the second entry of e.g. rasterquadrant "12345" in df2 is ignored in the loop.
Here comes the loop:
visited<- c()
for (i in 1:nrow(df2)){
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1$"First_visit"[index]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
This gives me the first full sampling period.
Raster_Q First_visit
1 12345 1950
2 12346 1968
3 12347 1965
4 12348 1952
However, I want to have all full sampling periods.
So I do:
df1$"Second_visit"<-NA
I reset the visited vector and specify the following loop:
visited <- c()
for (i in 1:nrow(df2)){
if(df2$Year[i]<=max(df1$"First_visit")){next()} else{
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1$"Second_visit"[index]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
}
Which is basically the same loop as before, however, only making sure that, if df2$"Year" in a certain raster quadrant has already been included in the first visit, then it is skipped.
That gives me the second full sampling period:
Raster_Q First_visit Second_visit
1 12345 1950 NA
2 12346 1968 1970
3 12347 1965 1986
4 12348 1952 2003
Okay, so far so good. I could do that all by hand. But I have loads and loads of rasterquadrants and several areas that can and should be screened in this way.
So doing all of this in a single loop would be really great! However, I realized that this creates a problem because the loop then becomes recursive:
The added column will not be included in the subsequent iteration of the loop, because df1 itself is not re-read on each pass; in consequence, the new column for the new sampling period will not be included in the following iterations:
visited<- c()
for (i in 1:nrow(df2)){
m<-ncol(df1)
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1[index,m]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
#finish "first_visit"
df1[,m+1]<-NA
# add column for "second visit"
if(df2$Year[i]<=max(df1$"First_visit")){next()} else{
# make sure that the first visit year are not included
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
if(df1$"Raster_Q"[index] %in% visited){next()} else{
df1[index,m+1]<- df2$"Year"[i]
visited[index]<- df1$"Raster_Q"[index]
}
}
}
This won't work. Another issue is that the vector visited is not emptied during this loop, so basically every Raster_Q has already been visited in the second sampling period.
I am stuck.... any ideas?
You can do this without a for loop by using the dplyr and tidyr packages. First, you take your df2 and use dplyr::arrange to order by raster and year. Then you can rank the years visited using the rank function inside of the dplyr::mutate function. Then using tidyr::spread you can put them all in their own columns. Here is the code:
library(dplyr)
df <- df2 %>%
  arrange(Raster_Q, Year) %>%
  group_by(Raster_Q) %>%
  mutate(visit = rank(Year),
         visit = paste0("visit_", as.character(visit))) %>%
  tidyr::spread(key = visit, value = Year)
Here is the output:
> df
# A tibble: 4 x 10
# Groups: Raster_Q [4]
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7 visit_8 visit_9
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 1951 1952 1955 1967 NA NA NA NA
2 12346 1968 1970 1998 2001 2014 2015 2017 NA NA
3 12347 1965 1986 2000 NA NA NA NA NA NA
4 12348 1952 1955 1957 1965 2003 2014 2015 2016 2017
EDIT: So I think I understand your problem a little better now. You are looking to remove all duplicate visits to each quadrant that happened before the maximum Year of each respective "round" of visits. So to accomplish this, I wrote a short function that in essence does what the code above does, but with a slight change. Here is the function:
filter_by_round <- function(data, round) {
output <- data %>%
arrange(Raster_Q, Year) %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year, ties.method = "first")) %>%
ungroup() %>%
mutate(in_round = ifelse(Year <= max(.$Year[.$visit == round]) & visit > round,
TRUE, FALSE)) %>%
filter(!in_round) %>%
select(-c(in_round, visit))
return(output)
}
What this function does, is look through the data and if a given year is less than the max year for the specified "visit round" then it is removed. To apply this only to the first round, you would do this:
df2 %>%
filter_by_round(1) %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year, ties.method = "first")) %>%
ungroup() %>%
mutate(visit = paste0("visit_", as.character(visit))) %>%
tidyr::spread(key = visit, value = Year)
which would give you this:
# A tibble: 4 x 8
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 NA NA NA NA NA NA
2 12346 1968 1970 1998 2001 2014 2015 2017
3 12347 1965 1986 2000 NA NA NA NA
4 12348 1952 2003 2014 2015 2016 2017 NA
However, while it does accomplish what your for loop would have, you now have other occurrences of the same problem. I have come up with a way to do this successfully but it requires you to know how many "visit rounds" you had or some trial and error. To accomplish this, you can use map and assign the change to a global variable.
# I do this so we do not lose the original dataset
df <- df2
# I chose 1:5 after some trial and error showed there are 5 unique
# "visit rounds" in your toy dataset.
# If you overshoot your number, it should still work, you will just
# get warnings about `max` not working correctly; this may cause
# issues, so figuring out your exact number is recommended.
purrr::map(1:5, function(x){
# this assigns the output of each iteration to the global variable df
df <<- df %>%
filter_by_round(x)
})
# now applying the original transformation to get the spread dataset
df %>%
group_by(Raster_Q) %>%
mutate(visit = rank(Year, ties.method = "first")) %>%
ungroup() %>%
mutate(visit = paste0("visit_", as.character(visit))) %>%
tidyr::spread(key = visit, value = Year)
This will give you the following output:
# A tibble: 4 x 6
Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 1950 NA NA NA NA
2 12346 1968 1970 2014 2015 2017
3 12347 1965 1986 NA NA NA
4 12348 1952 2003 2014 2015 2016
Granted, this is probably not the most elegant solution, but it works. Hopefully this solves the problem for you.
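One way to avoid guessing the number of rounds at all is to iterate until the data stops shrinking, i.e. run the filter to a fixed point. A sketch, with a hypothetical stand-in for filter_by_round (here it just drops duplicate rows, so the example is self-contained):

```r
# Stand-in for one filtering pass; swap in filter_by_round(df, round)
shrink_once <- function(df) unique(df)

df <- data.frame(Raster_Q = c("12345", "12345", "12346"),
                 Year = c(1950, 1950, 1968))

repeat {
  new_df <- shrink_once(df)
  if (nrow(new_df) == nrow(df)) break  # fixed point: nothing left to remove
  df <- new_df
}
```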
