Applying a function using elements within a list, take 2

I attempted this question yesterday (Applying a function using elements within a list), but my reprex produced the wrong data structure, and unfortunately the suggestions didn't work for my actual dataset.
I have what is hopefully a simple functional programming question. I have a list of locations with an average temperature and amplitude for each day (180 days in my actual dataset). I want to iterate through these locations and create a sine curve of 24 points for each day, using a custom-made function that takes that day's average temperature and amplitude. Below is my new reprex.
library(tibble)
library(REdaS)  # degrees to radians
library(tidyverse)

# Build a 24-point temperature curve from one day's amplitude and average
sinefunc <- function(Amplitude, Average){
  hour <- seq(0, 23, 1)
  temperature <- vector("double", length = 24)
  for(i in seq_along(hour)){
    temperature[i] <- (Amplitude * sin(deg2rad(180 * (hour[i] / 24))) + Average) +
      Amplitude * sin(deg2rad(180 * hour[i] / 12))
  }
  temperature
}
data <- tibble(Location = c(rep("London", 6), rep("Glasgow", 6), rep("Dublin", 6)),
               Day = rep(seq(1, 6, 1), 3),
               Average = runif(18, 0, 20),
               Amplitude = runif(18, 0, 15)) %>%
  nest_by(Location)
Using purrr and map_dfr I get the error Error in .x$Average : $ operator is invalid for atomic vectors:
df <- data %>%
  map_dfr(~sinefunc(.x$Average, .x$Amplitude))
Using lapply I get the error Error in x[, "Amplitude"] : incorrect number of dimensions:
data <- lapply(data, function(x){
  sinefunc(Amplitude = x[, "Amplitude"], Average = x[, "Average"])
})
My goal is to have 24 hourly data points for each day and location.
Any further help would be much appreciated.
Stuart

Maybe this is what you are looking for? You get a data frame back with 24 data points for each day and location, e.g. London-Day1, Dublin-Day1, etc.
library(dplyr)
library(purrr)
data <- tibble(Location = c(rep("London", 6), rep("Glasgow", 6), rep("Dublin", 6)),
               Day = rep(seq(1, 6, 1), 3),
               Average = runif(18, 0, 20),
               Amplitude = runif(18, 0, 15))
# get the group names
group_name <- data %>%
  group_by(Location, Day) %>%
  group_keys() %>%
  mutate(group_name = stringr::str_c(Location, "_", Day)) %>%
  pull(group_name)

data %>%
  # split into a list of data frames, one per Location/Day
  group_split(Location, Day) %>%
  # name the list elements
  setNames(group_name) %>%
  # apply your function (with named arguments, since the function
  # signature is sinefunc(Amplitude, Average)) and get a data frame back
  map_dfr(~sinefunc(Amplitude = .x$Amplitude, Average = .x$Average))
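If you also want the hour and the group label as explicit columns, one variant (a sketch along the same lines; hour and group are column names I've made up) is to wrap the function output in a tibble and use map_dfr's .id argument, which turns the list names into a column:

data %>%
  group_split(Location, Day) %>%
  setNames(group_name) %>%
  # each list element has a single row, so Amplitude and Average are scalars here
  map_dfr(~tibble(hour = 0:23,
                  temperature = sinefunc(Amplitude = .x$Amplitude,
                                         Average = .x$Average)),
          .id = "group")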

Related

How can I access matrix entries without using a for loop in R?

I have a distance matrix with all distances between all points in the data set.
How can I access the individual distances in the matrix without using a for loop?
This is a working example using a for loop:
# Create a distance matrix of all possible distances
DistMatrix <- Stations %>%
  st_sf(crs = 4326) %>%
  filter(!is.na(end.station.id)) %>%
  st_distance(Stations$geometry)
# Initialisation of the new distance data frame Dist containing start and end station ids
Dist <- Bike %>%
  select(start.station.id, end.station.id, dateS, tripduration) %>%
  filter(!is.na(end.station.id)) %>%
  mutate(Dist = NA)
# Iterate over all rows and allocate the corresponding distance to each pair of start and end station ids
for(i in 1:length(Dist$dateS)){
  Dist$Dist[i] <- DistMatrix[which(Stations$end.station.id == Dist$end.station.id[i]),
                             which(Stations$end.station.id == Dist$start.station.id[i])]
}
This is my best attempt at the problem using dplyr::mutate:
Dist2 <- Dist2 %>% mutate(Dist = DistMatrix[which(end.station.id == Stations[1]),
                                            which(start.station.id == Stations[1])])
The expected outcome would be that the data frame Dist is filled in with the column Dist (distances).
Is there a working solution to this problem?
Thx!
EDIT: The example code is more detailed now.
EDIT 2: Added expected outcome.
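For reference, R can index a matrix with a two-column matrix of row and column positions, which would avoid the loop entirely. A sketch, untested against the real data, assuming Stations$end.station.id lines up with both the rows and the columns of DistMatrix exactly as in the loop above:

Dist <- Dist %>%
  mutate(Dist = DistMatrix[cbind(match(end.station.id, Stations$end.station.id),
                                 match(start.station.id, Stations$end.station.id))])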

Want to write a loop to find reflectance values for each column

So I have this code in R that I'm using on a data frame df where each row is a wavelength (823 rows/wavelengths) and each column is a pixel (written as V1-V2554).
I have code to normalise each reflectance value per spectrum/pixel, shown here for one column (V6):
# Define function to find vector length (Euclidean norm)
veclen <- function(vec) {
  sqrt(sum(vec^2))
}

# Find vector length for the spectrum of one pixel
df_vecV6 <- df %>%
  group_by(Wavelength) %>%
  summarise(veclengthV6 = veclen(V6))

# Join new variable "veclength"
df <- df %>%
  left_join(df_vecV6, by = "Wavelength")

# Define function that returns a normalized vector
vecnorm <- function(vector) {
  vector / veclen(vector)
}

# Normalize by dividing each reflectance value by the vector's length
df$refl_normV6 <- vecnorm(df$V6)
but I want to create a loop to do this for all 2553 remaining columns. I started writing one but ran into problems. In this case df is finaldatat, and I wanted to create a list svec to store the vector lengths before the next steps:
for(i in (1:ncol(finaldatat))){
  svec[[i]] <- finaldatat %>%
    # group_by(Wavelength) %>%
    summarise(x = veclen(finaldatat[, i]))
}
That first step runs, but the vector lengths come out way above what they should be, so I already know there's a problem. Any help is appreciated!
Ideally, the final data frame would contain only the normalised results in the same 2554 x 824 format.
You can use dplyr's across function to apply vecnorm to all columns from V1 to V2554.
result <- df %>%
  group_by(Wavelength) %>%
  summarise(across(V1:V2554, vecnorm))

# In older versions of dplyr use summarise_at:
result <- df %>%
  group_by(Wavelength) %>%
  summarise_at(vars(V1:V2554), vecnorm)
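If the goal is to keep the original 823 x 2554 layout rather than collapsing rows, a mutate-based variant of the same idea may be closer (a sketch; since each column already holds one pixel's full spectrum, vecnorm can be applied to whole columns):

result <- df %>%
  mutate(across(V1:V2554, vecnorm))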

Summing across rows conditional on groups with dplyr using select, group_by, and mutate

Problem: I'm making an aggregate market-share variable in a car market with 286 distinct models sold and a total of 501 cars sold. The group share is based on two car characteristics, cat ("compact", "midsize", "large") and yr (77, 78, 79, 80, 81), summing the share s, a small double variable; a total of 15 groups in the market.
Closest answer I've found: by mishabalyasin on community.rstudio, "Calculating rowwise totals and proportions using tidyeval?".
Applying the principle of select-split-combine, this is the closest I've come to the correct answer: the 15 groups (15 x 3: cat, yr, s):
df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(s))
# In my actual data, this is what fills the group share to get what I want,
# but it isn't the desired pipeline-based answer
blp$group_share <- 0  # initializing the group_share, the 50th col
for(i in 1:501){
  for(j in 1:15){
    if((blp[i, 31] == df[j, 1]) && (blp[i, 3] == df[j, 2])){  # if(sameCat & sameYr){blpGS = dfGS}
      blp[i, 50] <- df[j, 3]
    }
  }
}
This works, but I know it can be done in one fell swoop. A simple fix would be a loop with conditions set on cat and yr, but I'm really trying to get better at data wrangling with dplyr, so any insight toward the pipelined answer would be wonderful.
Example for the site: the example below doesn't work with the code I provided, but this is the "look" of my data. (There is a problem with the share being read as a factor.)
# 45 obs, 3 cats, 5 yrs
cat <- rep(c("compact", "midsize", "large"), 15)
yr <- rep(c(77, 78, 79, 80, 81), 9)
s <- rep(c(.001, .0005, .002, .0001, .0002), 9)
blp <- as.data.frame(cbind(unlist(lapply(cat, as.character, stringsAsFactors = FALSE)),
                           as.numeric(yr), unlist(as.numeric(s))))
names(blp) <- c("cat", "yr", "s")
head(blp)
# Note: one example of a group share would be summing the share from
(group_share.blp.large.81.s <- blp[cat == "large" & yr == 81, ])
# Works thanks to akrun: applying the code I provided leads to the 15 groups
df <- blp %>%
  select(cat, yr, s) %>%
  group_by(cat, yr) %>%
  summarise(group_share = sum(as.numeric(as.character(s))))
# Manually filling, as in the larger loop above; this is what I'd want if I didn't want pipelining
blp$group_share <- 0
for(i in 1:45){
  for(j in 1:15){
    if((blp[i, 1] == df[j, 1]) && (as.numeric(blp[i, 2]) == as.numeric(df[j, 2]))){  # if(sameCat & sameYr){blpGS = dfGS}
      blp[i, 4] <- df[j, 3]
    }
  }
}
If I understood your problem correctly, this should help!
The only difference here is that instead of using summarize, which returns only the grouping columns and the summarized one, you can use mutate to keep the original columns and add the aggregate to them.
# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- rep(c("compact", "midsize", "large"), 15)
yr <- rep(c(77, 78, 79, 80, 81), 9)
s <- rep(c(.001, .0005, .002, .0001, .0002), 9)
# Calculation
blp <-
  data.frame(cat, yr, s, stringsAsFactors = FALSE) %>%  # create the data frame
  group_by(cat, yr) %>%                                 # group by category and year
  mutate(group_share = sum(s, na.rm = TRUE)) %>%        # sum the share per category/year
  ungroup()
Expected output
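Alternatively, if you'd rather keep the summarised 15-group table from the question, joining it back to the original data fills in the group share in one pipelined step as well (a sketch using the df built with summarise above):

blp %>%
  left_join(df, by = c("cat", "yr"))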

COUNTIF function in R to a new column

I am using a data set of public transit information in RStudio. One column in this huge data frame is Origin Station. I'd like to count the number of times each specific station appears as an origin station and then create a new column with that value. I'd do this in Excel, but the data file is way too big. I.e., for every record where "14 Street-Union Sq" is the value of Origin Station, a new column will hold the total number of times 14 St-Union Sq was the Origin Station.
Thanks.
Sounds like a job for the dplyr package and the n() function along with a group_by variable. Try something like this:
library(dplyr)

df <- data.frame(origin = sample(letters[1:5], 1000, replace = TRUE),
                 other_column = rnorm(1000))

df %>% group_by(origin) %>% mutate(n_appearances = n())
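In more recent versions of dplyr, add_count() is shorthand for the same group_by()/mutate(n()) pattern:

df %>% add_count(origin, name = "n_appearances")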
You can use the ave function:
test['count'] <- with(test, ave(variable, variable, FUN = length))

Moving averages

I have daily data for over 100 years that looks like this:
01.01.1856 12
02.01.1956 9
03.01.1956 -12
04.01.1956 7
etc.
I wish to calculate the 30-year running average for this huge dataset. I tried converting the data into a time series, but I still can't figure out how to go about it. I would prefer a simple method that works with a data.frame.
I guess the preparation is the difficult part, considering leap years. So I'll show one way of preparing the data before using the already mentioned runmean function from the caTools package.
First we create example data (not necessary for you, but useful for the understanding).
Second, I divide the data frame into a list of data frames, one for each year, and take the mean value of each year. These two steps could be done at once, but I think the separated way is easier to understand and to adapt.
# Example data
Days <- seq(as.Date("1958-01-01"), as.Date("2015-12-31"), by = "days")
Values <- runif(length(Days))
DF <- data.frame(Days = Days, Values = Values)

# Start of script
Years <- format(DF$Days, "%Y")
UniqueYears <- unique(format(DF$Days, "%Y"))

# Create subsets by year:
# for every unique year, look up which days fall into that year
YearlySubset <- lapply(UniqueYears, function(x){
  DF[which(Years == x), ]
})

YearlyMeanValues <- sapply(YearlySubset, function(x){
  mean(x$Values)
})
Now the running mean is applied:
# install.packages("caTools")
require(caTools)
RM <- data.frame(Years = UniqueYears, RunningMean30y = runmean(YearlyMeanValues, 30))
In case I didn't get you right at first and you want a running mean for every day within about 30 years, you could of course simply do:
RM <- cbind(DF, runmean(DF$Values, 365 * 30))
And considering your problems with creating a time series:
DF[ , 1] <- as.Date(DF[ , 1], format = "%Y.%m.%d")
I would also suggest exploring RcppRoll in combination with dplyr, which provides a fairly convenient way to calculate rolling averages, sums, etc.
Code
# Libs
library(RcppRoll)  # 'roll'-ing functions for R vectors and matrices
library(dplyr)     # data grammar (convenience)
library(zoo)       # time series (convenience)
library(magrittr)  # compound-assignment pipe operator (convenience)

# Data
data("UKgas")
## Convert to a data frame to make the example better
UKgas <- data.frame(Y = as.matrix(UKgas), date = time(UKgas))

# Calculations
UKgas %<>%
  # To make the example more illustrative, the data is converted to a quarterly format
  mutate(date = as.yearqtr(date)) %>%
  arrange(date) %>%
  # The window size can be changed to reflect any period
  mutate(roll_mean = roll_mean(Y, n = 4, align = "right", fill = NA))
Notes
As the data provided in the example was fairly modest, I used quarterly UK gas consumption data, available via the data function in the utils package.
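The same pattern should carry over to the daily data from the question: once the date column is parsed, something like the following would give an approximate 30-year daily window, ignoring leap days (a sketch, reusing the DF example data from the answer above):

DF %>%
  arrange(Days) %>%
  mutate(roll_mean_30y = roll_mean(Values, n = 365 * 30, align = "right", fill = NA))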