Shifting Values in R in rows - r

I have a problem that sounds easy, however, I could not find a solution in R. I would like to shift values according to the first year of the release. I mean the first column represents the years of the release and the columns are years when the device is broken (values are numbers of broken devices).
This is a solution in Python:
def f(x):
shifted = np.argmin((x.index.astype(int)< x.name[0]))
return x.shift(-shifted)
df = df.set_index(['Delivery Year', 'Freq']).apply(f, axis=1)
df.columns = [f'Year.{i + 1}' for i in range(len(df.columns))]
df = df.reset_index()
df
I would like to have it in R too.
# TEST
data <- data.frame(
`Delivery Year` = c('1976','1977','1978','1979'),
`Freq` = c(120,100,80,60),
`Year.1976` = c(10,NA,NA,NA),
`Year.1977` = c(5,3,NA,NA),
`Year.1978` = c(10,NA,8,NA),
`Year.1979` = c(13,10,5,14)
)
data
# DESIRED
data <- data.frame(
`Delivery Year` = c('1976','1977','1978','1979'),
`Freq` = c(120,100,80,60),
`Year.1` = c(10,3,8,14),
`Year.2` = c(5,NA,5,NA),
`Year.3` = c(10,10,NA,NA),
`Year.4` = c(13,NA,NA,NA)
)
data
In addition, would it be also possible to transform the number of broken devices into the percentage of Freq column?
Thank you

Using tidyverse
data %>%
pivot_longer(!c(Delivery.Year, Freq)) %>%
separate(name, c("Lab", "Year")) %>%
select(-Lab) %>%
mutate_all(as.numeric) %>%
filter(Year >= Delivery.Year) %>%
group_by(Delivery.Year, Freq) %>%
mutate(ind = paste0("Year.", row_number()),
per = value/Freq) %>%
ungroup() %>%
pivot_wider(id_cols = c(Delivery.Year, Freq), names_from = ind, values_from = c(value, per))
I pivoted it into long form to begin with and separated the original column names Year.1976, Year.1977, etc. to just get the years from the columns and dropped the Year piece of it. Then I converted all columns to numeric to allow for mathematical operations like filtering for when Year >= Delivery.Year. I then created a column to get the titles you did request Year.1, Year.2, etc. and calculated the percent. Then I pivot_wider to get it in the format you requested. One thing to note is that I was unclear if you wanted both the original values and the percent or just the percent. If you only want the percent then values_from = per should do it for you.

library(dplyr)
f <- function(df) {
years <- paste0("Year.",sort(as.vector(na.omit(as.integer(stringr::str_extract(colnames(df), "\\d+"))))))
df1 <- df %>% select(years)
df2 <- df %>% select(-years)
val <- c()
firstyear <- years[1]
for (k in 1:nrow(df1) ) {
vec <- as.numeric(as.vector(df1[k,]))
val[k] <- (as.numeric(suppressWarnings(na.omit(vec))))[1]
}
df1[firstyear] <- val
colnames(df1) <- c(paste0("Year.",seq(1:ncol(df1))))
df <- cbind(df2,df1)
print(df)
}
> f(data)
Delivery.Year Freq Year.1 Year.2 Year.3 Year.4
1 1976 120 10 5 10 13
2 1977 100 3 3 NA 10
3 1978 80 8 NA 8 5
4 1979 60 14 NA NA 14

Related

R - recursively create dataframe columns via index inside function

I have a dataframe with a large amount of annual data. For example consider the following toy example like so:
dat <- data.frame(id = 1:2, quantity = 3:4, agg_2002 = 5:6, agg_2003 = 7:8, agg_2020 = 9:10)
What I would like to do is the following:
Look for columns named "agg_",in the set of column names, names(df)
Substitute the "agg_" in names(df) for "change_"
Calculate the relative change from year to year, so for example,
df$change_2002 <- df$agg_2002/df$agg_2002 (since 2002 is first year)
df$change_2003 <- df$agg_2003/df$agg_2002
df$change_2004 <- df$agg_2004/df$agg_2003...all the way up to 2020 or the latest value with "agg_" in the column name.
What I have so far is the following function:
func <- function(dat, overwrite = FALSE) {
nms <- grep("agg_[0-9]+$", names(dat), value = TRUE)
revnms <- gsub("agg_", "chg_", nms)
for i = 1:ncol(df) %in% revnms{
dat[, rvnms][i] <- lapply(dat[, rvnms][i], `/`, dat[, rvnms][i-1])
}
dat
}
What I am struggling with is the indexing. How do I get R to make the above calculations recursively without having to do it manually? The desired result is the "chg_" columns appended to the original dataframe:
id quantity agg_2002 agg_2003 agg_2020 chg_2002 chg_2003 chg_2020
1 1 3 5 7 9 1 1.40 1.28
2 2 4 6 8 10 1 1.33 1.25
I would like to modify the specified function above to produce the desired result via lapply if possible. All ideas are welcome. Thank you.
UPDATE: I would much prefer something using lapply or something that can accomodate differing data types
You can make table to long form, change name (can use gsub), then spread back
library(tidyverse)
library(stringr)
df <- dat %>% pivot_longer(-c(id,quantity), names_to = "agg", values_to = "year") %>%
mutate(agg = str_replace(agg, "agg", "change")) %>%
group_by(id) %>%
mutate(year = ifelse(is.na(lag(year)), year/year, year/lag(year))) %>% # Divide itself if there is no lag(year)
pivot_wider(names_from = "agg", values_from = "year")
inner_join(dat, df, by = c("id","quantity"))
id quantity agg_2002 agg_2003 agg_2020 change_2002 change_2003 change_2020
1 1 3 5 7 9 1 1.400000 1.285714
2 2 4 6 8 10 1 1.333333 1.250000
Here is a solution with dplyr and tidyr:
library(tidyr)
library(dplyr)
dat %>%
pivot_longer(cols = starts_with("agg"),
names_to = "year",
names_prefix = "agg_",
values_to = "agg") %>%
group_by(id) %>%
arrange(year) %>%
mutate(change = agg / lag(agg, 1)) %>%
pivot_wider(names_from = year, values_from = c("agg", "change"))

Addressing another dataframe with dynamic variables in R (ideally dplyr, using mutate)

I have already read a variety of threads on dynamically named variables, but I couldn't quite find an answer.
I have two dataframes.
df <- data.frame(qno=c(1,2,3,4))
ref <- data.frame(Q1 = c(1:20),Q2 = c(21:40),Q3=c(41:60),Q4 = c(61:80))
Now I want to create another column 'average' in the df dataframe which gives me the average of each column in ref.
Intended output:
df <- data.frame(qno=c(1,2,3,4), average = c(10.5,30.5,50.5,70.5))
Here is what I have tried:
df <- df %>%
mutate(average := mean(!!as.name(paste0("ref$Q",qno)))
I have also tried a version with a for loop, but that didn't work either.
for (i in 1:length(df$qno)){
df$average[i] <- mean(as.name(paste0("ref$Q",df$qno[i])))
}
df <- df %>%
mutate(average = mean(as.name(paste0("ref$Q",qno))))```.
Here it is with mutate:
df %>% mutate(average = t(ref %>% summarise(across(everything(), ~mean(.x, na.rm = TRUE)))))
qno average
1 1 10.5
2 2 30.5
3 3 50.5
4 4 70.5
But you can use it without mutate entirely if you want the names from ref:
t(ref %>% summarise(across(everything(), list(mean), .names = "{.col}"))) %>%
data.frame() %>%
rename(average = 1)
average
Q1 10.5
Q2 30.5
Q3 50.5
Q4 70.5
Does this solve your problem?
ref <- data.frame(Q1 = c(1:20),Q2 = c(21:40),Q3=c(41:60),Q4 = c(61:80))
out <- data.frame(qno=c(1,2,3,4), average = c(10.5,30.5,50.5,70.5))
df <- data.frame(qno=c(1:length(ref)))
for (i in seq_along(ref)) {
df$average[i] <- mean(ref[[i]], na.rm = T)
}
I was not really sure if you want to name the rows like the variables, so you could just add this when you create the df object:
df <- data.frame(qno = paste0("Q", c(1:length(ref))))

Calculate cumulative sum after a set period of time

I have a data frame with COVID data and I'm trying to make a column calculating the number of recovered people based off of the number of positive tests.
My data has a location, a date, and the number of tests administered/positive results/negative results each day. Here's a few lines using one location as an example (the real data has several months worth of dates):
loc date tests pos neg active
spot1 2020-04-10 1 1 0 5
spot1 2020-04-11 2 1 1 6
spot1 2020-04-12 0 0 0 6
spot1 2020-04-13 11 1 10 7
I want to make a new column that cumulatively counts each positive test in each location 14 days after it is recorded. On 2020-04-24, the 5 active classes are not active anymore, so I want a recovered column with 5. For each date I want the newly nonactive cases to be added.
My first thought was to try it in a loop:
df1 <- df %>%
mutate(date = as.Date(date)) %>%
group_by(loc) %>%
mutate(rec = for (i in 1:nrow(df)) {
#getting number of new cases
x <- df$pos[i]
#add 14 days to the date
d <- df$date + 14
df$rec <- sum(x)
})
As you can see, I'm not the best at writing for loops. That gives me a bunch of numbers, but bear very little meaningful relationship to the data.
Also tried it with map_dbl:
df1 <- df %>%
mutate(date = as.Date(date)) %>%
group_by(loc) %>%
mutate(rec = map_dbl(date, ~sum(pos[(date <= . + 14) & date >= .])))
Which resulted in the same number printed on the entire rec column.
Any suggestions? (Sorry for the lengthy explanation, just want to make sure this all makes sense)
Your sample data shows that -
you have all continuous dates despite 0 tests (12 April)
Active column seems like already a cumsum
Therefore I think you can simply use lag function with argument 14
example code
df %>% group_by(loc) %>% mutate(recovered = lag(active, 14)) %>% ungroup()
You could use aggregate to sum the specific column and then applying
cut in order to set a 14 day time frame for each sum:
df <- data.frame(loc = rep("spot1", 30),
date = seq(as.Date('2020-04-01'), as.Date('2020-04-30'),by = 1),
test = seq(1:30),
positive = seq(1:30),
active = seq(1:30))
output <- aggregate(positive ~ cut(date, "14 days"), df, sum)
output
Console output:
cut(date, "14 days") positive
1 2020-04-01 105
2 2020-04-15 301
3 2020-04-29 59
my solution:
library(dplyr)
date_seq <- seq(as.Date("2020/04/01"), by = "day", length.out = 30)
pos <- rpois(n = 60, lambda = 10)
mydf <-
data.frame(loc = c(rep('loc1', 30), rep('loc2', 30)),
date = date_seq,
pos = pos)
head(mydf)
getPosSum <- function(max, tbl, myloc, daysBack = 14) {
max.Date <- as.Date(max)
sum(tbl %>%
filter(date >= max.Date - (daysBack - 1) &
date <= max.Date & loc == myloc) %>%
select(pos))
}
result <-
mydf %>%
group_by(date, loc) %>%
mutate(rec = getPosSum(max = date, tbl = mydf, myloc = loc))
library(tidyverse)
library(lubridate)
data %>%
mutate(date = as_date(date),
cut = cut(date, '14 days') %>%
group_by(loc) %>%
arrange(cut) %>%
mutate(cum_pos = accumulate(pos, `+`)) # accumulate(pos, sum) should also work
As a general rule of thumb, avoid loops, especially within mutate - that won't work. Instead of map_dbl you should check out purrr::accumulate. There's specialized functions for this in R's base library such as cumsum and cummin but their behavior is a lot less predictable in relation to purrr's.

Re-Format data with multiple headers to long format with one header row becoming data in a new column

I regularly receive data that is formatted with multiple headers and merged cells (yes..excel). Typically these data come in the form of 2+ merged cells representing sample sites, over the top of a number of observations in columns representing parameters of interest for that site. I am using the "openxlsx" package to read in the data with the read.xlsx function shown below (won't run just for reference):
read.xlsx('Mussels.xlsx',
detectDates = T,
sheet = 2,
fillMergedCells = T,
startRow = 2)
An example: I am currently working with invasive mussel survey data where I have 25 lengths for two species for each of 14 sites, which I've abbreviated for ease here for ease:
lendat <- data.frame(site.a = c("species.1",1,1,1,1),
site.a = c("species.2",2,2,2,2),
site.b = c("species.1",3,3,3,3),
site.b = c("species.2",4,4,4,4),
check.names = F)
I would like to be able to write some code that will re-format these data into long form where the column names become values under a new column named "site", and the first row of data becomes the other column names representing the lengths for each species like this:
data_form <- data.frame(site = c(rep("site.a", 4), rep("site.b",4)),
species.1 = c(1,1,1,1,3,3,3,3),
species.2 = c(2,2,2,2,4,4,4,4))
Update based on #Ronak Shah answer
Using code from the accepted answer below with the actual data results in a tibble with no data. I discovered that the issue arises with the filter step when decimal values are introduced in the data (actual data contains decimal values). I thought this was a data format issue (example data are all factors) but even when this is true the decimal data are changed into NA's. See example:
lendat <- data.frame(site.a = c("species.1", 1.1,2.2,3,4),
site.a = c("species.2",5,6,7,8),
site.b = c("species.1", 9,10,11,12),
site.b = c("species.2",13,14,15,16),
check.names = F)
str(lendat)
'data.frame': 5 obs. of 4 variables:
$ site.a: Factor w/ 5 levels "1.1","2.2","3",..: 5 1 2 3 4
$ site.a: Factor w/ 5 levels "5","6","7","8",..: 5 1 2 3 4
$ site.b: Factor w/ 5 levels "10","11","12",..: 5 4 1 2 3
$ site.b: Factor w/ 5 levels "13","14","15",..: 5 1 2 3 4
I split the piped code out to go line by line
#Get data in long format
pivot_longer(junk, cols = everything(), names_to = 'site') %>%
#Create a new column with column names
mutate(col = paste0('species', .copy)) %>%
#Remove the values from the first row
filter(!grepl('\\D', value)) %>%
#Remove .copy column which was created
select(-.copy) %>%
#Group by the new column
group_by(col) %>%
#Add a row index
mutate(row = row_number()) %>%
#Get data in wide format
pivot_wider(names_from = col, values_from = value) %>%
#Remove row index
select(-row) %>%
#Arrange data according to site information
arrange(site)
x <- pivot_longer(junk, cols = everything(), names_to = 'site')
x
x <- mutate(x, col = paste0('species', .copy))
x
x <- filter(x, !grepl('\\D', value))
x
x <- select(.data = x, -.copy)
x
x <- group_by(x, col)
x
x <- mutate(x, row = row_number())
x
x <- pivot_wider(x, names_from = col, values_from = value)
x
x <- select(x, -row)
x
x <- arrange(x, site)
x
The code executes but leaves NA's in the final tibble.
Using dplyr and tidyr :
library(dplyr)
library(tidyr)
#Get data in long format
pivot_longer(lendat, cols = everything(), names_to = 'site') %>%
#Create a new column with column names
mutate(col = paste0('species', .copy)) %>%
#Remove the values from the first row
filter(!grepl('[A-Za-z]', value)) %>%
#Remove .copy column which was created
select(-.copy) %>%
#Group by the new column
group_by(col) %>%
#Add a row index
mutate(row = row_number()) %>%
#Get data in wide format
pivot_wider(names_from = col, values_from = value) %>%
#Remove row index
select(-row) %>%
#Arrange data according to site information
arrange(site)
# site species1 species2
# <chr> <chr> <chr>
#1 site.a 1.1 5
#2 site.a 2.2 6
#3 site.a 3 7
#4 site.a 4 8
#5 site.b 9 13
#6 site.b 10 14
#7 site.b 11 15
#8 site.b 12 16

Filter from a dataframe and create another dataframe using r

I have two data frames as following:
df <- data.frame(month = c("mazda", "yamaha"),
april = c(11,12),
may = c(14,15))
df_whole <- data.frame(month = c("car", "bikes"),
april = c(.1,.2),
may = c(.5,.2))
What I want to do is create a row called total_car which would be a product of car and mazda. However if df_whole doesnt have row car I would still like to create a row total_car which would mazda*0.The output I want to get to is as below. How could I calculate something like this in R
df_car <- data.frame(month = c("mazda", "yamaha","total"),
april = c(11,12,1.1),
may = c(14,15,7))
It sounds like a sql query would work better to join your tables and get the summary data.
Here's a tidyverse solution (note the inner_join(), which is sql):
library(dplyr)
library(tidyr)
df <- data_frame(month = c("mazda", "yamaha"),
april = c(11,12),
may = c(14,15))
df_whole <- data_frame(month = c("car", "bikes"),
april = c(.1,.2),
may = c(.5,.2))
# adds car row if missing - there is probably a better way to handle the missing data
if (!"car" %in% df_whole$month) df_whole <- bind_rows(df_whole, data_frame(month = "car"))
# converts NAs to 0
df_whole[is.na(df_whole)] <- 0
# convert to long format
df_long <- df %>%
rename("brand" = "month") %>%
gather(key = "month", value = "val", april:may)
df_whole_long <- df_whole %>%
rename("type" = "month") %>%
gather(key = "month", value = "val", april:may)
# calcualte the multiplication
dat <- inner_join(df_long, df_whole_long, by = "month") %>% # combine dfs
filter(brand == "mazda", type == "car") %>% # filter out key rows
mutate(total = val.x * val.y) # do multiplication and add into new column
# reformat and append
df_car <- bind_rows(df,
bind_cols(data_frame(month = "total"),
dat %>%
select(month, total) %>%
spread(month, total)))
df_car is:
month april may
1 mazda 11.0 14
2 yamaha 12.0 15
3 total 1.1 7
I used tidyverse data frames (data_frame() rather than data.frame()) so strings weren't converted to factors (the same can be achieved by adding stringsAsFactors = F to data.frame()) and I converted your data to long format, which would probably be a better structure for your database anyway.

Resources