Why are undesired sheets being read into my list with lapply() - r

Full disclosure: I inherited this code and Frankensteined it enough to make it work. It isn't perfect.
I have a series of Excel workbooks that I iterate through to extract financial data for a group of medical practices. Each workbook has a tab for each month. I use lapply() to iterate over the sheets and pull only the months in the quarter being reported. One of the practices only has data for January and February of 2022, so I wouldn't expect that data to show up in the 4th-quarter update we just ran. However, it is there.
library(tidyverse)
library(readxl)
library(openxlsx)
df1 <- data.frame("Medication" = 1:50,
"Total WAC" = 51:100) # seq(51:100) would have returned 1:50
df2 <- data.frame("Medication" = 1:50,
"Total WAC" = 51:100)
list_of_datasets <- list("January" = df1, "February" = df2)
write.xlsx(list_of_datasets, file = "C:/MC_report.xlsx")
current_month <- lubridate::month(as.Date(Sys.Date(), format = "%Y/%m/%d"))
current_year <- lubridate::year(as.Date(Sys.Date(), format = "%Y/%m/%d"))
Q1 <- c("January", "February", "March")
Q2 <- c("April", "May", "June")
Q3 <- c("July", "August", "September")
Q4 <- c("October", "November", "December")
quarter <- switch(current_month,
"1" = Q4, "2" = Q4, "3" = Q4,
"4" = Q1, "5" = Q1, "6" = Q1,
"7" = Q2, "8" = Q2, "9" = Q2,
"10" = Q3, "11" = Q3, "12" = Q3)
year <- ifelse(current_month %in% c(1, 2, 3), current_year - 1, current_year)
names = c("Medication", "Total WAC")
MCPath22 = "C:/MC_report.xlsx"
MClist22 = lapply(quarter, function(x){ # this function is repeated for each practice; I won't paste it over and over
dat = read_excel(MCPath22, sheet = x, skip = 1)[c(1,2)] # column 1 is "Medication", column 2 is "Total WAC"
names(dat) = names
dat$Month = x
dat$Year = year
dat$Location = "Medical Center"
return(dat)
})
MC_newdata = do.call(rbind, MClist22) %>%
select( Medication, `Total WAC`, Month, Year, Location) %>%
mutate(Date.Added = Sys.time())
data = rbind(MC_newdata, DHP_newdata, Lex_newdata, Derm_newdata, Onc_newdata, oldvalues) %>%
filter(!is.na(Medication)) #includes all the practices
write_csv(data,"PAP Data.csv")
I just ran this again, and all facilities except the one with only January and February tabs run correctly. For that one, it throws an error that 'October' is not found, which is expected. I can stop that piece in RStudio and the script completes, but then January and February are in the output. Any idea why it's outputting the wrong data?
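One possible explanation, offered as a guess because the question does not show the state of the workspace: when read_excel() fails on the missing October sheet, the assignment to MClist22 never happens, so an MClist22 left over from an earlier run (one that did read January and February) is still in the environment and gets bound into the output. The base R sketch below reproduces that mechanism with a hypothetical read_sheet() stand-in and then guards against it by reading only the sheets that exist. In the real script the sheet names could come from readxl::excel_sheets(MCPath22).

```r
# Hypothetical stand-in for read_excel(): fails when the sheet is missing.
sheets_in_workbook <- c("January", "February")
read_sheet <- function(sheet) {
  if (!sheet %in% sheets_in_workbook) stop("Sheet '", sheet, "' not found")
  data.frame(Month = sheet)
}

# Result left over from an earlier run that read the first-quarter sheets.
MClist22 <- lapply(c("January", "February"), read_sheet)

# The Q4 run errors inside lapply(), so the assignment never completes
# and the old value of MClist22 survives.
quarter <- c("October", "November", "December")
try(MClist22 <- lapply(quarter, read_sheet), silent = TRUE)
stale_months <- vapply(MClist22, function(d) d$Month, character(1))
print(stale_months)

# Guard: only request sheets that exist, so the result is rebuilt every run.
months_to_read <- intersect(quarter, sheets_in_workbook)
MClist22 <- lapply(months_to_read, read_sheet)
print(length(MClist22))
```

With the guard in place, the practice that has no fourth-quarter tabs contributes an empty list instead of an error, and do.call(rbind, list()) returns NULL, which rbind() ignores when the practices are combined.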

Related

In R, what's the most efficient way to check if an object meets a criterion, and if it doesn't, modify it?

I have many cuts of my data that I eventually join together into one large dataset. However, sometimes the object is an error message because it didn't have enough sample size, causing the code to fail.
Before I do my full_joins, I want a simple way to say "If the length of any of these objects is 1, then make that object--and only those objects--have this set structure with just NAs". Is there a simple way to do that other than an if statement for each object? Or, alternatively, is there a way for R to 'skip' over the problematic rows if there's an error message (without relying on any specific characters)? I've used try(), but that doesn't always work and sometimes stops continuing to other joins.
#Here's an example of my data
library(dplyr)
object_1 <- tibble(name = c("Justin", "Corey"), month = c("Jan", "Jan"), score = c(1, 2))
object_2 <- tibble(name = c("Justin", "Corey"), month = c("Feb", "Feb"), score = c(100, 200))
object_3 <- "error message!"
object_4 <- tibble(name = c("Justin", "Corey"), month = c("Apr", "Apr"), score = c(95, 23))
object_5 <- "Another error!!"
#Here's me trying to join them, but it isn't working because of the errors
all_the_objects <- object_1 %>%
full_join(object_2) %>%
full_join(object_3) %>%
full_join(object_4) %>%
full_join(object_5)
#Here's a solution that works, but doesn't seem very elegant:
# Each object needs its own if(): in an else-if chain only the first match is replaced
if(length(object_1) == 1) object_1 <- tibble(name = NA_character_, month = NA_character_, score = NA_real_)
if(length(object_2) == 1) object_2 <- tibble(name = NA_character_, month = NA_character_, score = NA_real_)
if(length(object_3) == 1) object_3 <- tibble(name = NA_character_, month = NA_character_, score = NA_real_)
if(length(object_4) == 1) object_4 <- tibble(name = NA_character_, month = NA_character_, score = NA_real_)
if(length(object_5) == 1) object_5 <- tibble(name = NA_character_, month = NA_character_, score = NA_real_)
#Now it'll work
all_the_objects <- object_1 %>%
full_join(object_2) %>%
full_join(object_3) %>%
full_join(object_4) %>%
full_join(object_5)
We can place the objects in a list, replace every element that is a plain vector (the error messages) with an NA-filled tibble in one pass, and then join everything with reduce():
library(dplyr)
library(purrr)
map(mget(ls(pattern = '^object_\\d+$')),
~ if(is.vector(.x)) tibble(name = NA_character_, month = NA_character_,
score = NA_real_) else .x) %>%
reduce(full_join)
Output:
# A tibble: 7 × 3
name month score
<chr> <chr> <dbl>
1 Justin Jan 1
2 Corey Jan 2
3 Justin Feb 100
4 Corey Feb 200
5 <NA> <NA> NA
6 Justin Apr 95
7 Corey Apr 23
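For the second half of the question (letting R skip a cut that fails), a base R sketch along the following lines may help. It assumes each cut can be wrapped at the point where it is created. The helper name safe_cut() and the example cuts are hypothetical, and merge(all = TRUE) stands in for full_join().

```r
# Hypothetical helper: evaluate one cut and return NULL if it raises an error.
# The argument is evaluated lazily inside tryCatch(), so the error is caught.
safe_cut <- function(expr) tryCatch(expr, error = function(e) NULL)

cut_1 <- safe_cut(data.frame(name = c("Justin", "Corey"), month = "Jan", score = c(1, 2)))
cut_2 <- safe_cut(stop("not enough sample size"))
cut_3 <- safe_cut(data.frame(name = c("Justin", "Corey"), month = "Apr", score = c(95, 23)))

# Drop the failed cuts, then full-outer-join the rest on their shared columns.
cuts <- Filter(Negate(is.null), list(cut_1, cut_2, cut_3))
all_cuts <- Reduce(function(x, y) merge(x, y, all = TRUE), cuts)
print(all_cuts)
```

Unlike the NA-placeholder approach, this drops failed cuts entirely, so no all-NA row appears in the joined result.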

advanced reshaping / pivoting in a r dataframe

I am struggling to reshape a dataframe in R. My starting point is a dataframe, which has the following structure:
df_given <- data.frame(
first_column = c("NA", "NA", "NA", "Country1", "Country2", "Country3"),
second_column = c("Consumption", "real", "2021", 10, 11, 23),
third_column = c("Consumption", "real", "2022", 20, 22, 12),
fourth_column = c("Inflation", "expected", "2021", 1, 1.2, 2.5),
fifth_column = c("Inflation", "expected", "2022", 5, 3, 2)
)
Now my problem is the following: I would like to have the 2021 and 2022 only as two columns, instead of repeating the sequence two times. This, therefore, involves transforming the "description" of this time series (e.g. consumption real and inflation expected) from a row to a column. For this reason, my final target dataframe would look somehow like this:
df_target <- data.frame(
first_column = c("type", "Country1 Consumption real", "Country2 Consumption real",
"Country3 Consumption real", "Country1 Inflation expected",
"Country2 Inflation expected","Country3 Inflation expected"),
second_column = c(2021, 10, 11, 23, 1, 1.2, 2.5),
third_column = c(2022, 20, 22, 12, 5, 3, 2)
)
I assume that pivoting to wider or longer would do the trick. However, my problem is, that I can't really tell if my current dataframe is actually in long or wide format, because I think it is kind of both. Can anyone tell me how to approach this problem? Thanks in advance
You can use data.table, after dropping the first few rows, which hold header information rather than data. The patterns group the columns by year so that each year becomes one value column; variable 1 is then consumption and variable 2 is inflation.
names(df_given) <- c("country","Real C 2021", "Real C 2022", "Inf 2021", "Inf 2022")
df_given <- df_given[-c(1:3),]
library(data.table)
setDT(df_given)
melt(df_given, measure = patterns("2021$","2022$"), value.name = c("2021","2022"))
country variable 2021 2022
1: Country1 1 10 20
2: Country2 1 11 22
3: Country3 1 23 12
4: Country1 2 1 5
5: Country2 2 1.2 3
6: Country3 2 2.5 2
Easiest way is manual, using Base R:
# Transpose: ir => data.frame
ir <- data.frame(t(df_given))
# Derive metrics: ir2 => character vector
ir2 <- apply(ir[,1:3], 1, paste, collapse = " ")[-1]
# Derive countries: ir3 => character vector
ir3 <- unlist(ir[1,4:ncol(ir), drop = TRUE])
# Derive values: ir4 => data.frame
ir4 <- unlist(ir[2:nrow(ir), 4:ncol(ir)])
# Reshape into long df: ir5 => data.frame
ir5 <- within(
data.frame(
cbind(
stat = ir2,
country = rep(ir3, each = length(ir2)),
val = ir4
),
row.names = NULL
),
{
year <- substring(stat, nchar(stat)-3) # last four characters, e.g. "2021", without the leading space
stat <- trimws(gsub(paste0(year, collapse = "|"), "", stat))
}
)
# Pivot: data.frame => stdout(console)
reshape(
ir5,
idvar=c("country", "stat"),
timevar="year",
v.names="val",
direction="wide"
)
Thanks a lot for your input; your solutions were very helpful in finding my own. For me, the most important takeaway was to merge all "identifying" rows into the header, which makes all of the following operations a lot easier.
library(tidyverse) # provides pivot_longer(), pivot_wider() and str_split_fixed()
# merge row 1 with row 2
df_given[1,] <- paste(df_given[1,], df_given[2,])
df_given = df_given[-c(2),]
# merge the "merged row" with the date row into a header, using a unique separator
names(df_given) <- as.character(paste(df_given[1,], df_given[2,], sep ="-x-"))
df_given = df_given[-c(1,2),]
names(df_given)[1] <- 'country'
# pivot longer
df_given_new <- df_given %>%
pivot_longer(!country, names_to = "identifier", values_to = "obs")
# split the identifier into type and year columns
df_given_new[c('type', 'year')] <- str_split_fixed(df_given_new$identifier, '-x-', 2)
df_given_new <- subset(df_given_new, select=-c(identifier))
# back to wide, with one column per year
dfFinal <- df_given_new %>%
pivot_wider(names_from = year, values_from = obs)

Plotting Y (country) against X (years) showing GDP growth of 24 countries YoY (line graph)

I'm not sure what I'm doing wrong, but I keep running into errors. The aim of my project is to narrow down a selection of countries based on variables that make sense to me, so I cut the data down step by step in the code below (pardon the mess, I just started learning R). I have managed to get 24 candidate countries and will want to narrow this further, but first I'd like to plot the chosen countries against the years 2015:2019 and show their GDP growth.
I keep getting errors such as Error in dimnames(x) <- dnx : 'dimnames' applied to non-array and Error in dataframe.
I'm not sure what I'm supposed to do after library(tidyr). I tried many options I found online, but I suspect I'm making a mistake earlier in the code.
data1 <-WDI(indicator= c("IT.NET.USER.ZS", "BX.KLT.DINV.CD.WD", "IT.NET.SECR.P6" , "NY.GDP.MKTP.KD.ZG"), start = 2015, end = 2019, extra = FALSE)
#get column names
colnames(data1)
#View data
data1 %>%
view()
#Change column names
names(data1)[names(data1) == "IT.NET.USER.ZS"] <- "internet_users"
names(data1)[names(data1) == "BX.KLT.DINV.CD.WD"] <- "foreign_direct_investment"
names(data1)[names(data1) == "NY.GDP.MKTP.KD.ZG"] <- "gdp_growth"
names(data1)[names(data1) == "IT.NET.SECR.P6"] <- "secure_internet"
summary(data1)
data1 %>%
count(internet_users, sort = TRUE)
data1 %>%
count(year, sort = TRUE)
data1 %>%
count(country, sort = TRUE)
view(data1)
gdp <- summary(data1$gdp_growth)
users <- summary(data1$internet_users)
fdi <- summary(data1$foreign_direct_investment)
secure <- summary(data1$secure_internet)
country <- summary(data1$country)
names(data1)
#Selecting range from Mean to 3rd Quartile of data1 for internet_users
data2 <- data1[ c(data1$internet_users < 76.56 & data1$internet_users > 52.54 & data1$year == 2019), ]
summary(data2)
#Selecting gdp growth >= Mean of gdp growth 2.470
data3 <- data2[ c(data2$gdp_growth >= 2.4451),]
is.na(data3)
na.omit(data3)
view(data3)
#Removal of non country data from data3
data4 <- data3[-c(1,4,9,17,19,20,24,25,29,30,33,34,35,36,38,39,40,42,43,44),]
view(data4)
countries_1 <- c(data4$country)
head(countries_1)
summary(countries_1)
#trying to plot something that works with year/country and gdp
yearchoice <- c(year = 2015:2019)
str(data4)
datalinegraph <- data.frame(c(yearchoice,countries_1))
unique(data4$country)
#listing the countries I think I need to plot against
data5 <- data1 %>%
filter(country %in% countries_1)
library(tidyr)
data6 <-data.frame(data5)
data6.df$gdp = rownames(data5)
df.long = gather(data = data5,
key = yearchoice,
value = gdp)
ggplot(data = df.long, aes(x = yearchoice,
y = gdp,
group=data5,
color=variable)) +
geom_line() +
geom_point()
Here's one that plots the variables within each country:
library(WDI)
library(dplyr)
library(tidyr)
library(ggplot2)
data1 <-WDI(indicator= c("IT.NET.USER.ZS", "BX.KLT.DINV.CD.WD", "IT.NET.SECR.P6" , "NY.GDP.MKTP.KD.ZG"), start = 2015, end = 2019, extra = FALSE)
#Change column names
names(data1)[names(data1) == "IT.NET.USER.ZS"] <- "internet_users"
names(data1)[names(data1) == "BX.KLT.DINV.CD.WD"] <- "foreign_direct_investment"
names(data1)[names(data1) == "NY.GDP.MKTP.KD.ZG"] <- "gdp_growth"
names(data1)[names(data1) == "IT.NET.SECR.P6"] <- "secure_internet"
#Selecting range from Mean to 3rd Quartile of data1 for internet_users
data2 <- data1 %>% filter(internet_users < 76.56 &
internet_users > 52.54 &
year == 2019 &
gdp_growth >= 2.4451)
#Removal of non-country aggregates from data2
data4 <- data2 %>% filter(!iso2c %in% c("4E", "T4", "V3", "XT", "Z4"))
countries_1 <- c(data4$country)
#listing the countries I think I need to plot against
data5 <- data1 %>%
filter(country %in% countries_1)
r01 <- function(x){
x <- x-min(x, na.rm=TRUE)
x/max(x, na.rm=TRUE)
}
df.long = data5 %>%
mutate(across(internet_users:gdp_growth, r01)) %>%
pivot_longer(internet_users:gdp_growth,
names_to="variables",
values_to="vals")
ggplot(data = df.long, aes(x = year,
y = vals,
color=variables)) +
geom_line() +
facet_wrap(~country) +
theme(legend.position = "top")
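As a small illustration of what the r01() helper above does: it shifts a vector so its minimum is 0 and then divides by the new maximum, so every indicator ends up on a common 0 to 1 scale and can share one y-axis. The input vector here is made up for the example.

```r
# Same helper as in the answer: min-max rescaling to the range [0, 1].
r01 <- function(x) {
  x <- x - min(x, na.rm = TRUE)
  x / max(x, na.rm = TRUE)
}

# Made-up values; missing values stay missing.
scaled <- r01(c(2, 4, NA, 6))
print(scaled)
```

Note that inside mutate(across(...)) without a group_by(), the rescaling uses the minimum and maximum over all selected countries pooled together, not per country.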
And here's one that's countries within variables:
df.long2 = data5 %>%
pivot_longer(internet_users:gdp_growth,
names_to="variables",
values_to="vals")
ggplot(data = df.long2, aes(x = year,
y = vals,
color=country)) +
geom_line() +
facet_wrap(~variables, scales="free_y") +
theme(legend.position = "top")

Combine outputs of a function for each index in a for loop in R

I have created a function which completes a calculation based on data from two data frames for an individual.
I want to complete that function for each individual and combine all the outputs in a data frame and export to .csv
Currently the output .csv only has data for person 34 and none of the others.
I've noted that when I run this it creates an object ID, which is just the numeric 34. It seems to be only holding on to the last ID in data$ID.
How can I create an output with results for all persons?
library(dplyr)
library(lubridate)
library(date)
screen_function = function(x){
# Select each person and get necessary inputs
data = data%>%filter(ID == x)
demogs = demogs %>% filter (P_ID == x)
data$Age = demogs$Age
data$result = data$test * data$Age
data$Date = as.Date(data$Date,format='%d/%m/%Y') # ensures date column is in correct format
# only include tests from most recent 12-24 months and only proceed if test in last 12 months
Recent_12m = data %>% filter(between(Date, today() - years(1), today()))
Recent_24m = data %>% filter(between(Date, today() - years(2), today()))
if ((nrow(Recent_12m)) > 0) {
data = rbind(Recent_12m,Recent_24m)
Recent_12m$min_date = min(Recent_12m$Date)
Recent_12m$Date = as.Date(Recent_12m$Date,format='%d/%m/%Y')
Recent_12m$min_date = as.Date(Recent_12m$min_date,format='%d/%m/%Y')
Recent_24m$min_date = min(Recent_24m$Date)
Recent_24m$Date = as.Date(Recent_24m$Date,format='%d/%m/%Y')
Recent_24m$min_date = as.Date(Recent_24m$min_date,format='%d/%m/%Y')
Recent_12m$Period = interval(Recent_12m$min_date, Recent_12m$Date)
Recent_12m$Years = as.numeric(Recent_12m$Period, unit = "years")
Recent_24m$Period = interval(Recent_24m$min_date, Recent_24m$Date)
Recent_24m$Years = as.numeric(Recent_24m$Period, unit = "years")
# Latest result
Last = filter(Recent_12m, (Recent_12m$Date == max(Date)))
# linear regression model
Reg_12month <- lm(result ~ Years, data=Recent_12m)
Reg_24month <- lm(result ~ Years, data=Recent_24m)
info = c(x, round(Last$result, digits=1), round(Reg_12month$coefficients["Years"], digits = 1), round(Reg_24month$coefficients["Years"], digits = 1))
newdf = data.frame(matrix(0, ncol = 4))
colnames(newdf) = c("ID", "Latest result", "Trend 12month", "Trend 24 month")
newdf= rbind(newdf, info)
write.csv(newdf, "filepath.csv")
}
}
Date= sample(seq(as.Date('2019/11/01'), as.Date('2020/11/01'), by="day"), 12)
ID= c(12,12,12,450,450,450,1,1,1,34,34,34)
test= rnorm(12, mean=150, sd=60)
data= data.frame(ID, Date, test)
P_ID = c(1,12,34,450)
Age = c(50,45,60,72)
demogs = data.frame(P_ID, Age)
persons = unique(data$ID)
for(ID in persons){
screen_function(paste("", ID,"", sep=""))
}
Created on 2020-11-16 by the reprex package (v0.3.0)
So, I've got around this by using a pre-made .csv, instead of creating a new dataframe. The .csv just contains a single row of 4 columns, with random entries in each cell.
newdf= read.csv(file = "filepath.csv") # rows written so far, starting from the pre-made file
info = c(x, round(Last$result, digits=1), round(Reg_12month$coefficients["Years"], digits = 1),
round(Reg_24month$coefficients["Years"], digits = 1))
newdf= rbind(newdf, info) # append this person's row to the rows already on disk
colnames(newdf) = c("ID", "Latest result", "Trend 12month", "Trend 24 month")
newdf= distinct(newdf, ID, .keep_all = TRUE)
write.csv(newdf, "filepath.csv", row.names = FALSE)}}
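The original loop most likely kept only person 34 because write.csv() was called inside the function, so every call overwrote the file written by the previous one. A common alternative, sketched here in base R under simplifying assumptions, is to have the function return one row per person, bind the rows together, and write the file once. The helper summarise_person() is a hypothetical, much-reduced stand-in for screen_function() (no date filtering or regression), and the output path is a temporary file.

```r
# Hypothetical stand-in for screen_function(): returns a one-row data frame
# for one person instead of writing a file.
summarise_person <- function(id, data, demogs) {
  person <- data[data$ID == id, ]
  age <- demogs$Age[demogs$P_ID == id]
  person$result <- person$test * age
  latest <- person$result[which.max(person$Date)]
  data.frame(ID = id, latest_result = round(latest, 1))
}

set.seed(1)
data <- data.frame(
  ID = c(12, 12, 12, 450, 450, 450, 1, 1, 1, 34, 34, 34),
  Date = as.Date("2020-01-01") + 0:11,
  test = rnorm(12, mean = 150, sd = 60)
)
demogs <- data.frame(P_ID = c(1, 12, 34, 450), Age = c(50, 45, 60, 72))

# One row per person, combined into a single data frame and written once.
rows <- lapply(unique(data$ID), summarise_person, data = data, demogs = demogs)
summary_df <- do.call(rbind, rows)
out_file <- file.path(tempdir(), "summary.csv")
write.csv(summary_df, out_file, row.names = FALSE)
```

If the function returns NULL for people who have no recent tests, rbind() skips those entries, so the same pattern still works with the 12-month check in place.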

Calculate the percentage of missing values per month in a dataframe

I create the following dataframe:
df <- data.frame(seq(from = as.Date("2001-01-01"), to = as.Date("2001-12-31"), by = 1),
seq(1,365), seq(1, 365), seq(1, 365), seq(1, 365))
colnames(df) <- c("date", "C1", "C2", "C3", "C4")
df$C1[50:100] <- NA
df$C2[20:80] <- NA
df$C3[70:150] <- NA
df$C4[250:300] <- NA
I would like to calculate the percentage of missing values per month, not just per column but for the whole dataset.
Is there an efficient way to do this?
library(dplyr)
library(lubridate)
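#count NAs only in the value columns C1..C4 (columns 2 to 5); grepl("C", colnames(df)) would select them too
df %>% mutate(Month=month(date), Mis = rowSums(is.na(.[, 2:5]))) %>%
group_by(Month) %>%
#Percentage = missing cells / all cells in the month (rows x 4 value columns)
summarise(Sum=sum(Mis), Percentage=100 * sum(Mis) / (n() * 4))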
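As a cross-check of the arithmetic, here is a base R sketch of the same calculation on the example data. It takes "percentage of missing values per month" to mean missing cells divided by all value cells (rows times four columns) in that month. The names value_cols, missing_per_row and pct_missing are made up for this example.

```r
# Rebuild the example data from the question.
df <- data.frame(date = seq(as.Date("2001-01-01"), as.Date("2001-12-31"), by = 1),
                 C1 = 1:365, C2 = 1:365, C3 = 1:365, C4 = 1:365)
df$C1[50:100] <- NA
df$C2[20:80] <- NA
df$C3[70:150] <- NA
df$C4[250:300] <- NA

value_cols <- c("C1", "C2", "C3", "C4")
missing_per_row <- rowSums(is.na(df[value_cols]))  # NAs per day across C1..C4
month <- format(df$date, "%m")                      # "01" .. "12"

# For each month: missing cells / (days in month * number of value columns).
pct_missing <- vapply(
  split(missing_per_row, month),
  function(m) 100 * sum(m) / (length(m) * length(value_cols)),
  numeric(1)
)
print(round(pct_missing, 2))
```

For January, only C2 has gaps (days 20 to 31), so 12 of the 31 x 4 = 124 cells are missing, which is about 9.7%.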
