How to fill dataframe in R with months and NA values - r

This is my dataframe:
df <- structure(list(month_date = structure(c(19117, 19149, 19180,
19212, 19244, 19275), class = "Date"), Values = c(9693, 10227,
10742, 11672, 10565, 10080)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I need to extend the column month_date until "2023-12-01" with "NA" values.
The output should be a dataframe with months until "2023-12-01", with the Values column filled with "NA" values starting on "2022-11-01".
How can I do this?

library(tidyr)
# one row per month up to the target date; missing Values are filled with NA
complete(df, month_date = seq(min(month_date), as.Date("2023-12-01"),
by = '1 month'))

You can also create a separate dataframe/tibble if for some reason you do not want to use tidyr:
add <- data.frame(month_date = seq.Date(as.Date("2022-11-01"), as.Date("2023-12-01"), by = "month"), Values = NA)
final <- rbind(df, add)
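A base R sketch of the same idea as complete(), for readers who want to avoid extra packages: build the full monthly sequence, then left-join the existing values onto it with merge(). The toy df below is hypothetical (first-of-month dates, not the question's data), and the names all_months and filled are illustrative only.

```r
# Toy data: two existing months with values (hypothetical)
df <- data.frame(
  month_date = as.Date(c("2022-09-01", "2022-10-01")),
  Values = c(10565, 10080)
)

# Every month from the first observation up to the target date
all_months <- data.frame(
  month_date = seq(min(df$month_date), as.Date("2023-12-01"), by = "month")
)

# Keep all months; months without a match in df get NA in Values
filled <- merge(all_months, df, by = "month_date", all.x = TRUE)

nrow(filled)               # 2 existing months + 14 added months
sum(is.na(filled$Values))  # number of added rows
```

This only lines up if the existing dates sit on the same monthly grid as the generated sequence; otherwise the original rows will not match any generated month.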

Related

How to cbind a list of tables by one column, and suffix headings with the list item name

I've got a list of dataframes. I'd like to cbind them by the index column, sample_id. Each table has the same column headings, so I can't just cbind them otherwise I won't know which list item the columns came from. The name of the list item gives the measure used to generate them, so I'd like to suffix the column headings with the list item name.
Here's a simplified demo list of dataframes:
list_of_tables <- list(number = structure(list(sample_id = structure(1:3, levels = c("CSF_1",
"CSF_2", "CSF_4"), class = "factor"), total = c(655, 331, 271
), max = c(12, 5, 7)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame")), concentration_cm_3 = structure(list(sample_id = structure(1:3, levels = c("CSF_1",
"CSF_2", "CSF_4"), class = "factor"), total = c(121454697, 90959097,
43080697), max = c(2050000, 2140000, 915500)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame")), volume_nm_3 = structure(list(
sample_id = structure(1:3, levels = c("CSF_1", "CSF_2", "CSF_4"
), class = "factor"), total = c(2412783009, 1293649395, 438426087
), max = c(103500000, 117400000, 23920000)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame")), area_nm_2 = structure(list(
sample_id = structure(1:3, levels = c("CSF_1", "CSF_2", "CSF_4"
), class = "factor"), total = c(15259297.4, 7655352.2, 3775922
), max = c(266500, 289900, 100400)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame")))
You'll see it's a list of 4 tables, and the list item names are "number", "concentration_cm_3", "volume_nm_3", and "area_nm_2".
Using join_all from plyr I can merge them all by sample_id. However, how do I suffix with the list item name?
merged_tables <- plyr::join_all(list_of_tables, by = "sample_id", type = "left")
We could do it this way:
The trick is to use .id = 'id' in bind_rows which adds the name as a column. Then we could pivot:
library(dplyr)
library(tidyr)
bind_rows(list_of_tables, .id = 'id') %>%
pivot_wider(names_from = id,
values_from = c(total, max))
sample_id total_number total_concentration_cm_3 total_volume_nm_3 total_area_nm_2 max_number max_concentration_cm_3 max_volume_nm_3 max_area_nm_2
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CSF_1 655 121454697 2412783009 15259297. 12 2050000 103500000 266500
2 CSF_2 331 90959097 1293649395 7655352. 5 2140000 117400000 289900
3 CSF_4 271 43080697 438426087 3775922 7 915500 23920000 100400
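We may also be able to use reduce2 with the suffix option of left_join. Note that suffix is applied only to column names that clash in a given join, so check the resulting names when joining more than two tables.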
library(dplyr)
library(purrr)
nm <- names(list_of_tables)[1]
reduce2(list_of_tables, names(list_of_tables)[-1],
function(x, y, z) left_join(x, y, by = 'sample_id', suffix = c(nm, z)))
Or if we want to use join_all, probably we can rename the columns before doing the join
library(stringr)
imap(list_of_tables, ~ {
nm <- .y
.x %>% rename_with(~str_c(.x, nm), -1)
}) %>%
plyr::join_all( by = "sample_id", type = "left")
Or use a for loop
tmp <- list_of_tables[[1]]
names(tmp)[-1] <- paste0(names(tmp)[-1], names(list_of_tables)[1])
for(nm in names(list_of_tables)[-1]) {
tmp2 <- list_of_tables[[nm]]
names(tmp2)[-1] <- paste0(names(tmp2)[-1], nm)
tmp <- left_join(tmp, tmp2, by = "sample_id")
}
tmp
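For comparison, here is a base R sketch of the rename-then-join idea, using Map() and Reduce() with merge() instead of dplyr or plyr. It is only an illustration under stated assumptions: the two small tables are a cut-down, hypothetical version of the question's list, sample_id is assumed to be the first column of every table, and an underscore separator is added for readability.

```r
# Cut-down, hypothetical version of the question's list
list_of_tables <- list(
  number = data.frame(sample_id = c("CSF_1", "CSF_2"),
                      total = c(655, 331), max = c(12, 5)),
  area_nm_2 = data.frame(sample_id = c("CSF_1", "CSF_2"),
                         total = c(15259297.4, 7655352.2),
                         max = c(266500, 289900))
)

# Suffix every column except the first (sample_id) with the list item name
renamed <- Map(function(tbl, nm) {
  names(tbl)[-1] <- paste(names(tbl)[-1], nm, sep = "_")
  tbl
}, list_of_tables, names(list_of_tables))

# Left-join the renamed tables one after another on sample_id
merged <- Reduce(function(x, y) merge(x, y, by = "sample_id", all.x = TRUE),
                 renamed)

names(merged)
```

Because the columns are renamed before joining, no two tables share a non-key column name, so the result does not depend on how the join function handles clashing names.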

Compare data frame and list to find highest priority value using R

Edited As per request
Team,
I need suggestions on the request below.
I have a static list df2 = c("Maths/Science", "Science/Engg", "Maths/Engg", "Maths","Science","Engg"). I need to compare each column of df1 with df2 and check whether these combinations are present. A value can appear on its own or in combination with other values.
Weightage is as follows
df2= c("Maths/Science", "Science/Engg", "Maths/Engg", "Maths","Science","Engg")
Maths/Science= 6
Science/Engg=5
Maths/Engg = 4
Maths=3
Science=2
Engg=1
A new dataframe df3 should be created that contains the df1 data plus a new column 'Weightage' holding the highest-weighted value available in each row.
Please find the data below,
Input df1
dput(input)
structure(list(Col_1 = c("Maths/Science", "Engg", "Commerce",
"Engg"), Col_2 = c("Science L", "Science/Maths", "English,",
"Science/Engg"), Col_3 = c("Commerce", "NA", "NA", "Science"),
Col_4 = c("CS/Engg", "NA", "NA", "NA")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
df2
structure(list(Col_1 = "(\"Maths/Science\", \"Science/Engg\", \"Maths/Engg\", \"Maths\",\"Science\",\"Engg\")"), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"))
df3 output
structure(list(Col_1 = c("Maths", "Engg", "Science", "Engg"),
Col_2 = c("Science L", "Science/Maths", "Engg", "Science/Engg"
), Col_3 = c("Commerce", "NA", "NA", "Science"), Col_4 = c("Maths/Science",
"NA", "NA", "NA"), Weightage = c("Maths/Science", "Science/Maths",
"Science/Engg", "Science/Engg")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
It is not entirely clear how you want to handle subjects, but this may be helpful for you.
First, create a vector of your subjects, ordered by weight (highest first):
vec <- c("Maths/Science", "Science/Engg", "Maths/Engg", "Maths", "Science", "Engg")
Then, you can convert your columns to factors, and use ordered levels based on your vector:
df_fac <- lapply(df1, factor, levels = vec, ordered = TRUE)
Finally, you can get the minimum factor level (in this case the highest weight, based on the order of the vector) for each row:
do.call(pmin, c(df_fac, na.rm = TRUE))
You can assign to df1$Weightage and compare with your example.
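A small worked example of the ordered-factor approach, as a sketch on hypothetical toy data (the data frame toy and the name best are not from the question). Values that are not in vec, such as "Commerce", become NA when converted to a factor and are skipped by na.rm = TRUE.

```r
# Subjects ordered by weight, highest first
vec <- c("Maths/Science", "Science/Engg", "Maths/Engg", "Maths", "Science", "Engg")

# Hypothetical two-row input
toy <- data.frame(
  Col_1 = c("Engg", "Commerce"),
  Col_2 = c("Maths", "Science")
)

# Ordered factors: a lower level means a higher weight
toy_fac <- lapply(toy, factor, levels = vec, ordered = TRUE)

# Row-wise minimum level, ignoring values that are not in vec
best <- do.call(pmin, c(toy_fac, na.rm = TRUE))

toy$Weightage <- as.character(best)
toy$Weightage  # "Maths" "Science"
```

Note that this matches whole cell values only; a cell such as "Science L" or "Science/Maths" is not in vec and is treated as missing.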

Convert dataframe column names to cell values of a new variable

I have the dataframe below
d<-structure(list(WaterYear = c(2014, 2015), Discharge = c(1783.939638,
1970.891674), EnvWater = c(6, 1)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
and I want to convert it so that it has 3 columns: WaterYear as it is, Category, which will contain Discharge and EnvWater, and Value with their respective values. Normally I want to apply this to more than 2 columns.
We can reshape to 'long' with pivot_longer to create the required data
library(tidyr)
pivot_longer(d, cols = -WaterYear, names_to = 'Category', values_to = 'Value')

How to plot layers of tuples on same plot in R?

I am trying to plot the time and NDVI for each region on the same plot. I think to do this I have to convert the date column from characters to time and then plot each layer. However I cannot figure out how to do this. Any thoughts?
list(structure(list(observation = 1L, HRpcode = NA_character_,
timeseries = NA_character_), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), structure(list(observation = 1:6, time = c("2014-01-01",
"2014-02-01", "2014-03-01", "2014-04-01", "2014-05-01", "2014-06-01"
), ` NDVI` = c("0.3793765496776215", "0.21686891782421552", "0.3785652933528299",
"0.41027240624704164", "0.4035578030242673", "0.341299793064468"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)), structure(list(observation = 1:6, time = c("2014-01-01",
"2014-02-01", "2014-03-01", "2014-04-01", "2014-05-01", "2014-06-01"
), ` NDVI` = c("0.4071076986818826", "0.09090719657570319", "0.35214166081795284",
"0.4444311032927228", "0.5220702877666005", "0.5732370503295022"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)), structure(list(observation = 1:6, time = c("2014-01-01",
"2014-02-01", "2014-03-01", "2014-04-01", "2014-05-01", "2014-06-01"
), ` NDVI` = c("0.3412131556625801", "0.18815996897460135", "0.5218904976415136",
"0.6970128777711452", "0.7229657162729096", "0.535967435470161"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)))
First we need to clean your data. The first element in this list is empty
df = df[-1]
Now we need to make a data.frame
df = do.call(rbind, df)
I am going to add a region variable, change the name of NDVI to remove the space,
change ndvi into a numeric vector, and change time into a Date object
library(dplyr)
df = df %>%
mutate(region = factor(rep(1:3, rep(6, 3)))) %>%
rename(ndvi = ' NDVI') %>%
mutate(ndvi = as.numeric(ndvi)) %>%
mutate(time = as.Date(time))
Now we can use ggplot2 to plot the data by region
library(ggplot2)
g = df %>%
ggplot(aes(x = time, y = ndvi, col = region)) +
geom_line()
g
This gives a line plot with one coloured line per region.
Here's an approach with lubridate to handle dates and dplyr to make the binding of the data.frames easier to understand.
Note that the group names are taken from the names of the list, and since those don't exist in the data you provided, we have to set them in advance.
library(lubridate)
library(ggplot2)
library(dplyr)
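# assumes `data` is the list from the question; its first element is empty, so drop it
data <- data[-1]
names(data) <- 1:3
data <- bind_rows(data, .id = "group")
data$time <- ymd(data$time)
# rename with dplyr, which is already loaded
data <- rename(data, NDVI = ` NDVI`)
data$NDVI <- as.numeric(data$NDVI)
# the id column created by bind_rows above is `group`
ggplot(data, aes(x = time, y = NDVI, color = group)) + geom_line()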

Convert days to calendar dates within a data frame in R

I have a dataframe like
ID |TRTSDT| TRTEDT
101|17952 | 18037
102|17956 | 18041
How can I convert the day counts into Date format? Thank you.
Try
df1[-1] <- lapply(df1[-1], as.Date, origin='1970-01-01')
data
df1 <- structure(list(ID = 101:102, TRTSDT = c(17952L, 17956L),
TRTEDT = c(18037L,
18041L)), .Names = c("ID", "TRTSDT", "TRTEDT"), class = "data.frame",
row.names = c(NA, -2L))
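To check the result, here is a short sketch that applies the conversion and prints the dates. It assumes, as the answer does, that the numbers are day counts since 1970-01-01 (R's Date origin); the data frame is rebuilt with data.frame() so the block is self-contained.

```r
# Same values as the question's table
df1 <- data.frame(
  ID = 101:102,
  TRTSDT = c(17952L, 17956L),
  TRTEDT = c(18037L, 18041L)
)

# Convert every column except ID from day counts to Date
df1[-1] <- lapply(df1[-1], as.Date, origin = "1970-01-01")

format(df1$TRTSDT)  # "2019-02-25" "2019-03-01"
format(df1$TRTEDT)  # "2019-05-21" "2019-05-25"
```

If the day counts came from another system (for example SAS, which counts from 1960-01-01), the origin argument would need to change accordingly.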
