Find Date in Data Frame in R - r

I'm trying to use a list of dates to find the matching dates in a data frame, using an if/else inside a for loop to add a new column. Here is the code:
library(lubridate)
df <- tibble(DateCol = seq(ymd("2021-01-01"),ymd("2021-05-31"),"days"))
df2 <- c("2021-02-04","2021-02-07","2021-02-17","2021-02-25")
for (i in 1:length(df$DateCol)) {
if (df$DateCol[i] == df2) {
df$ValueCol[i] <- "1"
} else {
df$ValueCol[i] <- "0"
return(i)
}
}
In the final df, only the first date of df2 gets a 1 in df$ValueCol. I'm not sure how to make this loop work; the mistake seems to be in the df$DateCol[i] == df2 part.

You can do this without a loop.
df$ValueCol <- as.integer(df$DateCol %in% as.Date(df2))
df
as.integer() is faster than ifelse() when the output is just 1 and 0. The ifelse() equivalent is:
df$ValueCol <- ifelse(df$DateCol %in% as.Date(df2), 1, 0)
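Your loop can be corrected by using %in% instead of ==. The == operator compares the date with every element of df2 and returns several values, while if() needs a single TRUE or FALSE; %in% tests membership and returns one value per date. Also drop the return(i) call, which is not valid outside a function, and convert df2 with as.Date() so both sides have the same type.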
library(tibble)
library(lubridate)
df <- tibble(DateCol = seq(ymd("2021-01-01"),ymd("2021-05-31"),"days"))
df2 <- as.Date(c("2021-02-04","2021-02-07","2021-02-17","2021-02-25"))
df$ValueCol <- NA
for (i in 1:length(df$DateCol)) {
if (df$DateCol[i] %in% df2) {
df$ValueCol[i] <- 1
} else {
df$ValueCol[i] <- 0
}
}
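To see why the original condition fails, it helps to compare what == and %in% return for a single date. The following is a minimal sketch in base R; the names one_date, targets, dates and flag are illustrative and do not come from the question.

```r
one_date <- as.Date("2021-02-07")
targets  <- as.Date(c("2021-02-04", "2021-02-07", "2021-02-17", "2021-02-25"))

# == compares against every element and returns one logical per target
eq_result <- one_date == targets
print(eq_result)   # FALSE  TRUE FALSE FALSE

# %in% asks "is this value anywhere in the set?" and returns a single logical
in_result <- one_date %in% targets
print(in_result)   # TRUE

# Vectorised over a whole column, so no loop is needed
dates <- seq(as.Date("2021-02-01"), as.Date("2021-02-10"), by = "day")
flag  <- as.integer(dates %in% targets)
print(flag)        # 0 0 0 1 0 0 1 0 0 0
```

Because if() can only use one value, older R versions silently took the first element of the == result (with a warning), which matches the symptom in the question; R 4.2.0 and later raise an error instead.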

Or you can use dplyr (which also provides tibble() and %>%); lubridate is needed for ymd():
library(dplyr)
library(lubridate)
df <- tibble(DateCol = seq(ymd("2021-01-01"),ymd("2021-05-31"),"days"))
df2 <- ymd("2021-02-04","2021-02-07","2021-02-17","2021-02-25")
df3 <- df %>% mutate(ValueCol = if_else(DateCol %in% df2, 1, 0))

Related

For Loop to Reorder Columns in R

I am trying to re-order columns in R using a for loop since the column range needs to be dynamic. Does anyone know what is missing from my code?
Group <- c("A","B","C","D")
Attrib1 <- c("x","y","x","z")
Attrib2 <- c("q","w","u","i")
Day1A <- c(5,4,6,3)
Day2A <- c(6,5,7,4)
Day3A <- c(9,8,10,7)
Day1B <- c(4,3,5,2)
Day2B <- c(3,2,4,1)
Day3B <- c(2,1,3,0)
df <- data.frame(Group, Attrib1,Attrib2,Day1A,Day2A,Day3A,Day1B,Day2B,Day3B)
day_count <- 3
for(i in 4:ncol(df)) {
if (i == day_count+3) break
df[c(i,day_count+i)]
}
Here is my desired result:
df <- data.frame(Group, Attrib1,Attrib2,Day1A,Day1B,Day2A,Day2B,Day3A,Day3B)
In theory you can just use sort(colnames(df)[4:ncol(df)]) to get that, but it gets tricky when you have, say, Day1A..Day10A..Day20A, because an alphabetical sort puts Day10A before Day2A.
Below is a quick workaround that extracts the day numbers and the letters and orders by both:
COLS = colnames(df)[4:ncol(df)]
day_no = as.numeric(gsub("[^0-9]","",COLS))
day_letter = gsub("Day[0-9]*","",COLS)
o = order(day_no,day_letter)
To get your final dataframe:
df[,c(colnames(df)[1:3],COLS[o])]
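A short sketch of why the number/letter split matters once the day numbers reach two digits. The column names below are made up for illustration; method = "radix" is used only to make the alphabetical sort independent of the locale.

```r
cols <- c("Day1A", "Day2A", "Day10A", "Day1B", "Day2B", "Day10B")

# Plain alphabetical sort puts Day10 before Day1 and Day2
alpha <- sort(cols, method = "radix")
print(alpha)    # "Day10A" "Day10B" "Day1A" "Day1B" "Day2A" "Day2B"

# Split each name into its numeric and letter parts, then order by both
day_no     <- as.numeric(gsub("[^0-9]", "", cols))
day_letter <- gsub("Day[0-9]*", "", cols)
natural    <- cols[order(day_no, day_letter)]
print(natural)  # "Day1A" "Day1B" "Day2A" "Day2B" "Day10A" "Day10B"
```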
An option with select; the extracted day numbers are converted to integers so that Day10 sorts after Day2:
library(dplyr)
library(stringr)
df %>%
  select(Group, starts_with('Attrib'),
         names(.)[-(1:3)][order(as.integer(str_remove_all(names(.)[-(1:3)], '\\D+')))])

Trying to create a function to calculate over 6 months instead of a year

I have created a function in R which does some calculations by filtering each year. Now I want to filter by 6-month periods within each year. For example, 2014 would be divided into two data frames: 20140101-20140601 and 20140701-20141201.
I tried using logical operators in the filter command, and it keeps giving me "unexpected symbol" errors.
func2 <- function(years, data) {
years <- c(2014, 2015, 2016, 2017, 2018, 2019)
listofdfs <- list()
efor(i in 1:length(years)) {
#d <- data[data$Year == years[i]]
d <- filter(data, (data$Year==years[i]))
df <- data.frame(d)[, c(4,5)]
names(df) <- unlist(d[1, "headers"])
names(df)[1:2] <- c("Actual", "Estimated")
listofdfs[[i]] <- df
}
return(listofdfs)
}
I expected data frames with 6 rows each (one per month), but it returns the whole year instead.
Edit:
This is what I tried:
func2 <- function(years, data) {
years <- c(20140101, 20140601, 20140701,20141201, 20150101, 20150601, 20150701, 20151201, 20160101,20191201)
listofdfs <- list() for(i in 1:length(years)) { #d <- data[data$Year == years[i]]
d <- filter(data, (years[i]==data$Year || years[i]<data$Year) | data$Year==inc(years)[i] || data$Year<inc(years[i]))
df <- data.frame(d)[, c(4,5)]
names(df) <- unlist(d[1, "headers"])
}
This is what the data looks like
If you want to split the data into semesters, here is a base R approach with no loops at all.
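fun <- function(data){
# Numeric dates such as 20140101 must be converted to character before parsing
d <- as.Date(as.character(data[[1]]), format = "%Y%m%d")
m <- as.integer(format(d, "%m"))
# 1 for January-June, 2 for July-December
semester <- 2 - (m <= 6)
year <- format(d, "%Y")
split(data, list(semester, year))
}
Just pass the data set to this function; it returns a list of data frames, one per semester/year. It assumes the dates are in the first column.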
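A usage sketch of the same idea on made-up data. The data frame toy, its value column and the helper name split_by_semester are hypothetical; the dates are assumed to be stored as numbers such as 20140101, and drop = TRUE is added so that year/semester combinations with no rows are left out.

```r
split_by_semester <- function(data, date_col = 1) {
  # Assumes the date column holds values like 20140101 (numeric or character)
  d <- as.Date(as.character(data[[date_col]]), format = "%Y%m%d")
  m <- as.integer(format(d, "%m"))
  semester <- ifelse(m <= 6, 1L, 2L)   # 1 = Jan-Jun, 2 = Jul-Dec
  year <- format(d, "%Y")
  # drop = TRUE omits year/semester combinations that have no rows
  split(data, list(year, semester), drop = TRUE)
}

# Hypothetical sample data
toy <- data.frame(
  Year  = c(20140101, 20140601, 20140701, 20141201, 20150101),
  value = 1:5
)

parts <- split_by_semester(toy)
print(names(parts))       # one element per year.semester that occurs
print(parts[["2014.1"]])  # the two rows from the first half of 2014
```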
There are some errors in your code!
In line 2 there is a closing bracket ) missing after "2019". This alone could account for your error.
But without sample data it is hard to solve your original problem.
Edit 1:
It is really hard to say what your problem is, because the code you added in the post and in the comment has too many syntax errors to pinpoint it.
Maybe those syntax mistakes are the problem: there are a lot of missing closing brackets, commented-out lines, etc.
If I take your code example and just remove those mistakes, this is what it would look like. Does that solve your problem?
func2 <- function(years, data) {
years <- c(20140101, 20140601, 20140701,20141201, 20150101, 20150601, 20150701, 20151201, 20160101,20191201)
listofdfs <- list()
for(i in 1:length(years)) {
#d <- data[data$Year == years[i]]
d <- filter(data, (years[i]==data$Year || years[i]<data$Year) | data$Year==inc(years)[i] || data$Year<inc(years[i]))
df <- data.frame(d)[, c(4,5)]
names(df) <- unlist(d[1, "headers"])
}
}
Edit 2:
So here is a solution:
# Sample data that looks like yours
years <- c(20140101, 20140601, 20140701,20141201, 20150101, 20150601, 20150701, 20151201, 20160101)
test <- c("A", "A","A","B","B","B", "C", "C", "C")
df <- data.frame(years,test)
library(dplyr)
library(tidyr)  # separate() comes from tidyr
# This code creates a splitting variable, Halfyear
df <- df %>%
separate(years, into = c("Year","Month","Day"), sep = c(4,6)) %>%
# Month is a character string after separate(), so convert it before comparing
mutate(Halfyear = paste(Year, case_when(as.integer(Month) <= 6 ~ "First Half", TRUE ~ "Second Half"))) %>%
mutate(Halfyear = as.factor(Halfyear))
# Now we can use split to create the relevant data frames and access them with [[i]], where i is the index of the half year
split(df, f = df$Halfyear)[[1]] %>%
head()
Let me know if this works.
Consider split or by to subset the data frame into a list of data frames by one or more factors, here the year and a calculated year_half (Year is assumed to hold yyyymmdd values):
df$year <- substring(df$Year, 1, 4)
df$year_half <- ifelse(as.integer(substring(df$Year, 5, 6)) <= 6,
"first_half", "second_half")
# SIMILAR CALLS:
listofdfs <- split(df, list(df$year, df$year_half))
listofdfs <- by(df, df[,c("year", "year_half")], FUN=identity)

Removing rows from R data frame using the for loop with if statement

df <- data.frame(
V1 = c(1,3,3,5,5,6),
V2 = c(19,19,38,19,38,19),
V3 = c(1,3,1,7,2,10)
)
How can I remove the rows where V2 is an odd number using a for loop with an if statement?
You can use @Aleh's answer. If you do want a for loop with an if statement, try this; it keeps the rows where V2 is even:
dt = data.frame()
for(i in 1:nrow(df) ) {
if(df$V2[i] %% 2 == 0){
dt <- rbind(dt , df[i,])
}
}
The above answer is good, but you can do this in a single line:
df[df$V2 %% 2 == 0,]
Here is a dplyr solution:
library(dplyr)
df_filtered <- df %>% filter(V2 %% 2 == 0)
You don't actually need a for loop; you can do it like this:
idx <- df$V2 %% 2 == 0
df[idx, ]
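If a loop is still required, one alternative to growing a data frame with rbind() is to record a logical flag per row and subset once at the end. The sketch below uses the data from the question and checks that the loop and the vectorised version agree; the names keep, loop_result and vec_result are illustrative.

```r
df <- data.frame(
  V1 = c(1, 3, 3, 5, 5, 6),
  V2 = c(19, 19, 38, 19, 38, 19),
  V3 = c(1, 3, 1, 7, 2, 10)
)

# Loop version: mark the rows to keep, then subset once
keep <- logical(nrow(df))            # all FALSE to start with
for (i in seq_len(nrow(df))) {
  if (df$V2[i] %% 2 == 0) {          # even V2, so the row is kept
    keep[i] <- TRUE
  }
}
loop_result <- df[keep, ]

# Vectorised version: the same test applied to the whole column
vec_result <- df[df$V2 %% 2 == 0, ]

print(vec_result)                          # rows 3 and 5, where V2 is 38
print(identical(loop_result, vec_result))  # TRUE
```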

Loop over a subset, source a file and save results in a dataframe

Similar questions have been asked already, but none solved my specific problem. I have an .R file ("Mycalculus.R") containing many basic calculations that I need to apply to subsets of a data frame: one subset per year, where the values of "year" are factors (yearA, yearB, yearC), not numeric values. The file generates a new data frame that I need to save in an Rda file. Here is what I expect the code to look like with a for loop (this one obviously does not work):
id <- identif(unlist(df$year))
for (i in 1:length(id)){
data <- subset(df, year == id[i])
source ("Mycalculus.R", echo=TRUE)
save(content_df1,file="myresults.Rda")
}
Here is an extract of the main data.frame df:
obs year income gender ageclass weight
1 yearA 1000 F 1 10
2 yearA 1200 M 2 25
3 yearB 1400 M 2 5
4 yearB 1350 M 1 11
Here is what the sourced file "Mycalculus.R" does: it applies numerous basic calculations to columns of the data frame called "data", and creates two new data frames, df1 and then df2 based on df1. Here is an extract:
data <- data %>%
group_by(gender) %>%
mutate(Income_gender = weighted.mean(income, weight))
data <- data %>%
group_by(ageclass) %>%
mutate(Income_ageclass = weighted.mean(income, weight))
library(GiniWegNeg)
gini=c(Gini_RSV(data$Income_gender, weight), Gini_RSV(data$Income_ageclass,weight))
df1=data.frame(gini)
colnames(df1) <- c("Income_gender","Income_ageclass")
rownames(df1) <- c("content_df1")
df2=(1/5)*df1$Income_gender+df2$Income_ageclass
colnames(df2) <- c("myresult")
rownames(df2) <- c("content_df2")
So that in the end, I get two dataframes like this:
Income_Gender Income_Ageclass
content_df1 .... ....
And for df2:
myresult
content_df2 ....
But I need to save df1 and df2 as an Rda file where the row names content_df1 and content_df2 are given per subset, something like this:
Income_Gender Income_Ageclass
content_df1_yearA .... ....
content_df1_yearB .... ....
content_df1_yearC .... ....
and
myresult
content_df2_yearA ....
content_df2_yearB ....
content_df2_yearC ....
Currently, my program does not use any loop and does the job, but messily: the code is more than 2500 lines long (please don't throw tomatoes at me).
Could anyone help me with this specific request?
Thank you in advance.
Consider putting everything in one script, with functions that take the needed arguments and are called by lapply(). lapply() then returns a list of data frames that you can row-bind into one final data frame.
library(dplyr)
library(GiniWegNeg)
runIncomeCalc <- function(data, y){
data <- data %>%
group_by(gender) %>%
mutate(Income_gender = weighted.mean(income, weight))
data <- data %>%
group_by(ageclass) %>%
mutate(Income_ageclass = weighted.mean(income, weight))
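# weight must be taken from the data frame; unlist() keeps gini a plain vector in case Gini_RSV returns a list
gini <- unlist(c(Gini_RSV(data$Income_gender, data$weight), Gini_RSV(data$Income_ageclass, data$weight)))
# t() gives one row with two columns, matching the two column names set below
df1 <- data.frame(t(gini))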
colnames(df1) <- c("Income_gender","Income_ageclass")
rownames(df1) <- c(paste0("content_df1_", y))
return(df1)
}
runResultsCalc <- function(df, y){
# Build a one-column data frame so that a row name can be set
df2 <- data.frame(myresult = (1/5) * df$Income_gender + df$Income_ageclass)
rownames(df2) <- c(paste0("content_df2_", y))
return(df2)
}
dfIncList <- lapply(unique(df$year), function(i) {
yeardata <- subset(df, year == i)
runIncomeCalc(yeardata, i)
})
dfResList <- lapply(unique(df$year), function(i) {
yeardata <- subset(df, year == i)
df <- runIncomeCalc(yeardata, i)
runResultsCalc(df, i)
})
df1 <- do.call(rbind, dfIncList)
df2 <- do.call(rbind, dfResList)
If you need to source across scripts, define the same two functions, runIncomeCalc and runResultsCalc, in Mycalculus.R and then call each of them from the other script:
library(dplyr)
library(GiniWegNeg)
if(!exists("runIncomeCalc", mode="function")) source("Mycalculus.R")
dfIncList <- lapply(unique(df$year), function(i) {
yeardata <- subset(df, year == i)
runIncomeCalc(yeardata, i)
})
dfResList <- lapply(unique(df$year), function(i) {
yeardata <- subset(df, year == i)
df <- runIncomeCalc(yeardata, i)
runResultsCalc(df, i)
})
df1 <- do.call(rbind, dfIncList)
df2 <- do.call(rbind, dfResList)
If you functional-ize your steps you can create a workflow like the following:
calcFunc <- function(df) {
## Do something to the df, then return it
df
}
processFunc <- function(fname) {
## Read in your table
x <- read.table(fname)
## Do the calculation
x <- calcFunc(x)
## Make a new file name (remember to change the file extension)
new_fname <- sub("something", "else", fname)
## Write the .RData file
save(x, file = new_fname)
}
### Your workflow
## Generate a vector of files
my_files <- list.files()
## Do the work
res <- lapply(my_files, processFunc)
Alternatively, don't save the files. Omit the save call in processFunc and return the data.frame instead, so that lapply gives you a list of data.frame objects. Then use either data.table::rbindlist(res) or do.call(rbind, res) to make one large data.frame object.
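The "return a list, then row-bind" pattern can be sketched in base R as follows. This is only an illustration under assumptions: calc_one is a hypothetical stand-in for the real calculation, and the sample rows are the ones shown in the question.

```r
# Sample rows from the question
df <- data.frame(
  year   = c("yearA", "yearA", "yearB", "yearB"),
  income = c(1000, 1200, 1400, 1350),
  weight = c(10, 25, 5, 11)
)

# Hypothetical per-subset calculation: returns a one-row data frame
# whose row name carries the subset label
calc_one <- function(year_label, data) {
  sub <- data[data$year == year_label, ]
  data.frame(
    mean_income = weighted.mean(sub$income, sub$weight),
    row.names   = paste0("content_df1_", year_label)
  )
}

# One data frame per year, collected in a list
res <- lapply(unique(df$year), calc_one, data = df)

# Stack the list into a single data frame; the row names are kept
combined <- do.call(rbind, res)
print(combined)
```

The combined object can then be written once with save(combined, file = "myresults.Rda") instead of being overwritten on every iteration.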

Add missing value in column with value from row above

Every week I receive an incomplete dataset for an analysis. It looks like this:
df1 <- data.frame(var1 = c("a","","","b",""),
var2 = c("x","y","z","x","z"))
Some var1 values are missing. The dataset should end up looking like this:
df2 <- data.frame(var1 = c("a","a","a","b","b"),
var2 = c("x","y","z","x","z"))
Currently I use an Excel macro to do this. But this makes it harder to automate the analysis. From now on I would like to do this in R. But I have no idea how to do this.
Thanks for your help.
QUESTION UPDATE AFTER COMMENT
var2 is not relevant for my question. The only thing I am trying to do is get from df1 to df2.
df1 <- data.frame(var1 = c("a","","","b",""))
df2 <- data.frame(var1 = c("a","a","a","b","b"))
Here is one way of doing it, using run-length encoding (rle) and its inverse, inverse.rle:
fillTheBlanks <- function(x, missing=""){
rle <- rle(as.character(x))
empty <- which(rle$values==missing)
rle$values[empty] <- rle$values[empty-1]
inverse.rle(rle)
}
df1$var1 <- fillTheBlanks(df1$var1)
The results:
df1
var1 var2
1 a x
2 a y
3 a z
4 b x
5 b z
Here is a simpler way:
library(zoo)
df1$var1[df1$var1 == ""] <- NA
df1$var1 <- na.locf(df1$var1)
The tidyr package has the fill() function, which does the trick once the blanks are NA:
library(tidyr)
df1 <- data.frame(var1 = c("a",NA,NA,"b",NA), stringsAsFactors = FALSE)
df1 %>% fill(var1)
Here is another way which is slightly shorter and doesn't coerce to character:
Fill <- function(x,missing="")
{
Log <- x != missing
y <- x[Log]
y[cumsum(Log)]
}
Results:
# For factor:
Fill(df1$var1)
[1] a a a b b
Levels: a b
# For character:
Fill(as.character(df1$var1))
[1] "a" "a" "a" "b" "b"
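The cumsum() trick in Fill is easier to follow with the intermediate values written out. This is a step-by-step sketch on the question's data, with illustrative variable names:

```r
x <- c("a", "", "", "b", "")

# TRUE where a real value is present
is_filled <- x != ""
print(is_filled)      # TRUE FALSE FALSE  TRUE FALSE

# Only the real values, in order
filled_values <- x[is_filled]
print(filled_values)  # "a" "b"

# Running count of real values seen so far: the position in
# filled_values of the most recent non-missing entry
idx <- cumsum(is_filled)
print(idx)            # 1 1 1 2 2

# Indexing with that count repeats each value over its blanks
result <- filled_values[idx]
print(result)         # "a" "a" "a" "b" "b"
```

One caveat: if the vector starts with a missing value, the count is 0 for those leading positions, and indexing with 0 drops them, so the result would be shorter than the input.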
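Below is my unfill function, which does the reverse (it replaces repeated values with NA). I encountered the same problem and hope it helps. It needs dplyr and purrr:
library(dplyr)
library(purrr)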
unfill <- function(df,cols){
col_names <- names(df)
unchanged <- df[!(names(df) %in% cols)]
changed <- df[names(df) %in% cols] %>%
map_df(function(col){
col[col == col %>% lag()] <- NA
col
})
unchanged %>% bind_cols(changed) %>% select(one_of(col_names))
}
