Manipulating cells within a group based on value in another cell - r

I have a data.frame in the following format:
CowId Bacillus Week
1234 1 Week1
1234 0 Week2
1234 0 Week3
1234 0 Week4
If a cow is bacillus-positive (yes=1, no=0) in Week1, then I want to change the remaining values in this column for that cow to 1, like so:
CowId Bacillus Week
1234 1 Week1
1234 1 Week2
1234 1 Week3
1234 1 Week4
I tried the following, but I am unsure how to proceed after determining the infection status of Week1 cows:
dt %>%
group_by(CowId) %>%
mutate(Bacillus = ifelse(Week == "Week1" & Bacillus, 1,
ifelse(Week != "Week1" do something)
Appreciate any comments/feedback.

Try this approach using any() and testing for the week. I have created dummy data to show the example:
library(dplyr)
#Code
# Within each cow, set every week to 1 if Week1 is positive, otherwise to 0
df %>%
  group_by(CowId) %>%
  mutate(Bacillus = ifelse(any(Bacillus[Week == 'Week1'] == 1), 1, 0))
Output:
# A tibble: 8 x 3
# Groups: CowId [2]
CowId Bacillus Week
<dbl> <dbl> <chr>
1 1234 1 Week1
2 1234 1 Week2
3 1234 1 Week3
4 1234 1 Week4
5 1235 0 Week1
6 1235 0 Week2
7 1235 0 Week3
8 1235 0 Week4
Some data used:
#Data
df <- structure(list(CowId = c(1234, 1234, 1234, 1234, 1235, 1235,
1235, 1235), Bacillus = c(1, 0, 0, 0, 0, 0, 0, 0), Week = c("Week1",
"Week2", "Week3", "Week4", "Week1", "Week2", "Week3", "Week4"
)), row.names = c(NA, -8L), class = "data.frame")

In base R, we can build a logical vector that is TRUE where 'Bacillus' is positive and 'Week' is 'Week1', use it to subset 'CowId', test every row's 'CowId' for membership in that subset with %in%, and coerce the resulting logical to binary with unary +:
df$Bacillus <- with(df, +(CowId %in% unique(CowId[as.logical(Bacillus) &
Week == 'Week1'])))
df$Bacillus
#[1] 1 1 1 1 0 0 0 0
data
df <- structure(list(CowId = c(1234, 1234, 1234, 1234, 1235, 1235,
1235, 1235), Bacillus = c(1, 0, 0, 0, 0, 0, 0, 0), Week = c("Week1",
"Week2", "Week3", "Week4", "Week1", "Week2", "Week3", "Week4"
)), row.names = c(NA, -8L), class = "data.frame")
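Both answers above overwrite the whole column, so a cow that is negative in Week1 but positive in a later week is reset to 0. If those later positives should be kept, a minimal base R sketch could look like the following. Cow 1236 is a hypothetical extra animal added only to illustrate that case, and positive_week1 is a helper name introduced here.

```r
# Example data; cow 1236 is hypothetical (negative in Week1, positive in Week3)
df <- data.frame(
  CowId    = rep(c(1234, 1235, 1236), each = 4),
  Bacillus = c(1, 0, 0, 0,  0, 0, 0, 0,  0, 0, 1, 0),
  Week     = rep(paste0("Week", 1:4), times = 3)
)

# IDs of cows that are positive in Week1
positive_week1 <- unique(df$CowId[df$Week == "Week1" & df$Bacillus == 1])

# Set every week of those cows to 1; all other rows are left untouched
df$Bacillus[df$CowId %in% positive_week1] <- 1
```

Because only the rows of Week1-positive cows are assigned, the Week3 positive of cow 1236 is preserved.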

Related

Column/Line in R

In R, I have a table whose column names are dates. How do I turn those columns into rows so that I can store the data in a database?
Example Table:
   estab  codigo descricao               2021-02-01 2021-02-02
 1   103 4390160 ANM 2003                         0          0
 2   103 4390161 ANM 2004 MF                      0          0
 3   103 4390162 ANM 2008 MF                      0          0
 4   103 4390193 ANM 3004 ST                      0          0
 5   103 4390189 ANM 3008 ST                      0          0
 6   103 4543512 ANM 24 NET                       0          0
 7   103 4390163 AMT 2008 RF                      0          0
 8   103 4543520 ANM 2003 COM BATERIA             0          0
 9   103 4543521 ANM 2004 MF COM BATERIA          0          0
10   103 4543522 ANM 2008 MF COM BATERIA          0          0
11   103 4543523 ANM 3004 ST COM BATERIA          0          0
12   103 4543524 ANM 3008 ST COM BATERIA          0          0
13   103 4543516 AMT 8000                         0          0
14   103 4390165 AMT 2018                         0          0
15   103 4390164 AMT 2010                         0          0
I tried to use melt, but it didn't work very well:
xxx <- reshape2::melt(xxx[[1]], id.vars = 'codigo')
If I understood your question correctly, this code should work for you:
# Recreate a data frame similar to yours; data.frame() turns the
# date names into syntactic ones (X2021.02.01, X2021.02.02)
dt <- data.frame(estab = 103,
                 codigo = 4390160:4390174,
                 descricao = c("ANM 2003", "ANM 2004", "ANM BATERIA"),
                 "2021-02-01" = 0,
                 "2021-02-02" = 0)
# Reshape to long format: one row per codigo and date
dt <- reshape2::melt(dt, id.vars = c("estab", "codigo", "descricao"), variable.name = "Date", value.name = "Value")
# Strip the leading X and convert the column to Date
dt$Date <- sub("^X", "", dt$Date)
dt$Date <- as.Date(dt$Date, format = "%Y.%m.%d")
head(dt)
In base R, we can use reshape
out <- reshape(df1, direction = "long", varying = 4:5, sep = "")
row.names(out) <- NULL
data
df1 <- structure(list(estab = c(103, 103, 103, 103, 103, 103, 103, 103,
103, 103, 103, 103, 103, 103, 103), codigo = 4390160:4390174,
descricao = c("ANM 2003", "ANM 2004", "ANM BATERIA", "ANM 2003",
"ANM 2004", "ANM BATERIA", "ANM 2003", "ANM 2004", "ANM BATERIA",
"ANM 2003", "ANM 2004", "ANM BATERIA", "ANM 2003", "ANM 2004",
"ANM BATERIA"), X2021.02.01 = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), X2021.02.02 = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA,
-15L))
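The base R reshape call above lets reshape guess the value and time names from the column names. As a hedged alternative, the sketch below spells those arguments out and converts the result to Date. The output column names "date" and "value" are choices made here, not something the question prescribes.

```r
# Same shape as df1 above: two date columns with syntactic names
df1 <- data.frame(estab = 103,
                  codigo = 4390160:4390174,
                  descricao = rep(c("ANM 2003", "ANM 2004", "ANM BATERIA"), 5),
                  X2021.02.01 = 0,
                  X2021.02.02 = 0)

out <- reshape(df1,
               direction = "long",
               varying   = list(c("X2021.02.01", "X2021.02.02")),
               v.names   = "value",                        # name of the value column
               timevar   = "date",                         # name of the new date column
               times     = c("2021-02-01", "2021-02-02"),  # one label per varying column
               idvar     = "codigo")
row.names(out) <- NULL

# The labels are character strings at this point; convert them to Date
out$date <- as.Date(out$date)
```

Each of the 15 codigo values now appears once per date, which is usually the shape a database table expects.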
I think you might be after something like the following.
library(tidyr)
library(lubridate)
pivot_longer(df, matches('\\d{4}-\\d{2}-\\d{2}'),
names_to = 'date',
names_transform = list(date = ymd))
# # A tibble: 6 x 5
# estab codigo descricao date value
# <dbl> <dbl> <chr> <date> <dbl>
# 1 103 4390160 ANM 2003 2021-02-01 0
# 2 103 4390160 ANM 2003 2021-02-02 0
# 3 103 4390161 ANM 2004 MF 2021-02-01 0
# 4 103 4390161 ANM 2004 MF 2021-02-02 0
# 5 103 4390162 ANM 2008 MF 2021-02-01 0
# 6 103 4390162 ANM 2008 MF 2021-02-02 0
Data
df <- structure(list(estab = c(103, 103, 103), codigo = c(4390160,
4390161, 4390162), descricao = c("ANM 2003", "ANM 2004 MF", "ANM 2008 MF"
), `2021-02-01` = c(0, 0, 0), `2021-02-02` = c(0, 0, 0)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))

Why is group_by and mutate giving me the unexpected result?

This is an excerpt of my dataset:
check = structure(list(currency = c("AED", "ATS", "AUD", "BEF", "BND",
"CAD"), year = c(2005, 2005, 2005, 2005, 2005, 2005), value = c(0,
0, 14628, 0, 27, 1604), month = c("1", "1", "1", "1", "1", "1"
), quarter = c(1, 1, 1, 1, 1, 1)), row.names = c(NA, 6L), class = "data.frame")
Running this code:
check2 = check %>% group_by(currency) %>% mutate(sum = sum(value))
gives me
currency year value month quarter sum
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 AED 2005 0 1 1 16259
2 ATS 2005 0 1 1 16259
3 AUD 2005 14628 1 1 16259
4 BEF 2005 0 1 1 16259
5 BND 2005 27 1 1 16259
6 CAD 2005 1604 1 1 16259
Shouldn't it give me a different value for each currency? When I group by different combinations of variables, I still get the same value, 16259, on every row. Could someone point out where I went wrong? Thank you.
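The excerpt has one row per currency, so a grouped sum should equal value on every row. 16259 is the total over all six rows, which suggests the grouping is being ignored. One possible cause, which the question does not confirm, is that a package attached after dplyr (plyr is a frequently reported example) masks mutate. Calling dplyr::mutate explicitly would rule that out. As a package-independent cross-check, here is a base R sketch of the per-currency sum; ungrouped_total is a name introduced here for comparison.

```r
# Relevant columns of the excerpt from the question
check <- data.frame(
  currency = c("AED", "ATS", "AUD", "BEF", "BND", "CAD"),
  value    = c(0, 0, 14628, 0, 27, 1604)
)

# Per-currency sum, repeated on every row of the group
check$sum <- ave(check$value, check$currency, FUN = sum)

# Sum over all rows, i.e. what an ungrouped mutate would return
ungrouped_total <- sum(check$value)
```

If the per-currency column from this sketch differs from what the pipeline returns, the pipeline is most likely not using dplyr's mutate.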

Need to separate out variable names from a column in r

So I have a pretty bad dataset I am not allowed to change. I would like to take the column "Draw_CashFlow" and make only certain values into their own columns. Additionally I need to make the variables all one column (period) (wide to Tidy if you will).
In the dataset below we have a column (Draw_CashFlow) which begins with the variable in question followed by a list of IDs, then repeats for the next variable. Some variables may have NA entries.
structure(list(Draw_CashFlow = c("Principal", "R01",
"R02", "R03", "Workout Recovery Principal",
"Prepaid Principal", "R01", "R02", "R03",
"Interest", "R01", "R02"), `PERIOD 1` = c(NA,
834659.51, 85800.18, 27540.31, NA, NA, 366627.74, 0, 0, NA, 317521.73,
29175.1), `PERIOD 2` = c(NA, 834659.51, 85800.18, 27540.31, NA,
NA, 306125.98, 0, 0, NA, 302810.49, 28067.8), `PERIOD 3` = c(NA,
834659.51, 85800.18, 27540.31, NA, NA, 269970.12, 0, 0, NA, 298529.92,
27901.36), `PERIOD 4` = c(NA, 834659.51, 85800.18, 27540.31,
NA, NA, 307049.06, 0, 0, NA, 293821.89, 27724.4)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
Only a finite list of variables is needed (Principal, Workout Recovery Principal, Prepaid Principal, and Interest), so I tried writing a loop that checks whether each one exists and then gathers it, but that did not work.
After the variables are set apart from Draw_CashFlow I hope it looks something like this (First four rows, ignore variable abbreviations).
ID Period Principal Wrk_Reco_Principal Prepaid_Principal Interest
R01 1 834659.51 NA 366627.74 317521.73
R02 1 85800.18 NA 0.00 29175.10
R03 1 27540.31 NA 0.00 NA
R01 2 834659.51 NA 306125.98 302810.49
Notes: Wrk_Reco_Principal is NA because no IDs are listed under this variable in Draw_CashFlow. Keep in mind that this needs to handle any number of IDs, but the variable names in the Draw_CashFlow column will always be the same.
Here's an approach which assumes the Draw_CashFlow values that start with an R are ID numbers. You might need a different method (e.g. !Draw_CashFlow %in% LIST_OF_VARIABLES) if that doesn't hold up.
library(tidyverse) # attaches dplyr, tidyr, stringr and readr, all used below
df %>%
# create separate columns for ID and Variable
mutate(ID = if_else(Draw_CashFlow %>% str_starts("R"),
Draw_CashFlow, NA_character_),
Variable = if_else(!Draw_CashFlow %>% str_starts("R"),
Draw_CashFlow, NA_character_)) %>%
fill(Variable) %>% # Fill down Variable in NA rows from above
select(-Draw_CashFlow) %>%
gather(Period, value, -c(ID, Variable)) %>% # Gather into long form
drop_na() %>%
spread(Variable, value, fill = 0) %>% # Spread based on Variable
mutate(Period = parse_number(Period))
# A tibble: 12 x 5
ID Period Interest `Prepaid Principal` Principal
<chr> <dbl> <dbl> <dbl> <dbl>
1 R01 1 317522. 366628. 834660.
2 R01 2 302810. 306126. 834660.
3 R01 3 298530. 269970. 834660.
4 R01 4 293822. 307049. 834660.
5 R02 1 29175. 0 85800.
6 R02 2 28068. 0 85800.
7 R02 3 27901. 0 85800.
8 R02 4 27724. 0 85800.
9 R03 1 0 0 27540.
10 R03 2 0 0 27540.
11 R03 3 0 0 27540.
12 R03 4 0 0 27540.
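The answer mentions matching against a list of variable names as an alternative to testing for a leading "R". A minimal base R sketch of that labelling step follows. It assumes the first row is always a variable name, and known_vars, is_var and labels are names introduced here for illustration.

```r
# The fixed set of variable names given in the question
known_vars <- c("Principal", "Workout Recovery Principal",
                "Prepaid Principal", "Interest")

# The Draw_CashFlow column from the question's data
draw <- c("Principal", "R01", "R02", "R03", "Workout Recovery Principal",
          "Prepaid Principal", "R01", "R02", "R03",
          "Interest", "R01", "R02")

# TRUE on rows that hold a variable name rather than an ID
is_var <- draw %in% known_vars

# Fill each variable name down over the rows that follow it;
# cumsum(is_var) counts how many variable rows have been seen so far
variable <- draw[is_var][cumsum(is_var)]

# Keep only the ID rows, each labelled with its variable
labels <- data.frame(ID = draw[!is_var], Variable = variable[!is_var])
```

Workout Recovery Principal has no ID rows under it, so it does not appear in labels. Any later reshaping step would have to add that column back explicitly if it is required.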

How can I make conditional selections using dplyr in R?

I have the following situation. Given the table
df <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
type = c("MC", "MC", "MK", "MC", "MK", "MC"),
value1 = c(512, 261, 4523, 1004, 1221, 2556),
value2 = c(726, 4000, 280, 998, 113, 6789))
I am trying to find a way to implement the following logic: If for an ID, both types (MC and MK) occur, use value1 from MK and value2 from MC. Otherwise (only the type MC occurs), use MC.
Hence, the final result is supposed to be:
data.frame(ID = c(1, 2, 3, 4),
type = c("MC", "MC", "MC", "MC"),
value1 = c(512, 4523, 1221, 2556),
value2 = c(726, 4000, 998, 6789))
The MK rows are dropped once value1 has been extracted from them.
Another version with dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(value1 = ifelse(any(type == "MK"), value1[type=="MK"],value1[type=="MC"]),
value2 = value2[type == "MC"]) %>%
filter(type == "MC")
# ID type value1 value2
# <dbl> <fct> <dbl> <dbl>
#1 1 MC 512 726
#2 2 MC 4523 4000
#3 3 MC 1221 998
#4 4 MC 2556 6789
Here, value1 takes the "MK" value if the group has one and the corresponding "MC" value otherwise, value2 always takes the "MC" value, and only the rows with type "MC" are kept. This assumes every group (ID) has an "MC" row.
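The same rule can be sketched in base R with a lookup instead of grouping. This is only an illustration: it assumes at most one MC row and one MK row per ID, and mc, mk, idx and result are names introduced here.

```r
df <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
                 type = c("MC", "MC", "MK", "MC", "MK", "MC"),
                 value1 = c(512, 261, 4523, 1004, 1221, 2556),
                 value2 = c(726, 4000, 280, 998, 113, 6789))

mc <- df[df$type == "MC", ]   # rows to keep
mk <- df[df$type == "MK", ]   # rows that only donate value1

# Position of each MC row's ID among the MK rows; NA if the ID has no MK row
idx <- match(mc$ID, mk$ID)

# Use the MK value1 where one exists, otherwise keep the MC value1
mc$value1 <- ifelse(is.na(idx), mc$value1, mk$value1[idx])

rownames(mc) <- NULL
result <- mc
```

value2 is never touched, so it keeps the MC value as the question asks.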
For efficiency I would definitely prefer @Andre Elrico's answer, but here is a dplyr option. Try:
df <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
type = c("MC", "MC", "MK", "MC", "MK", "MC"),
value1 = c(512, 261, 4523, 1004, 1221, 2556),
value2 = c(726, 4000, 280, 998, 113, 6789))
library(dplyr)
df %>%
reshape(., idvar = "ID", timevar = "type", direction = "wide") %>%
group_by(ID) %>%
mutate(value1 = ifelse(is.na(value1.MK), value1.MC, value1.MK),
value2 = ifelse(is.na(value2.MC), value2.MK, value2.MC),
type = "MC") %>%
select(ID, type, value1, value2)
# output
# A tibble: 4 x 4
# Groups: ID [4]
ID type value1 value2
<dbl> <chr> <dbl> <dbl>
1 1 MC 512 726
2 2 MC 4523 4000
3 3 MC 1221 998
4 4 MC 2556 6789
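data.table solution (df1 is the data frame from the question; this relies on the MC row coming first and the MK row last within each ID):
library(data.table)
setDT(df1)[,{x=.SD;if(all(c("MC","MK") %in% type)){x$value1[] = last(value1)};first(x)},by=ID]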
result:
# ID type value1 value2
#1 1 MC 512 726
#2 2 MC 4523 4000
#3 3 MC 1221 998
#4 4 MC 2556 6789
dplyr:
df1 %>% group_by(ID) %>% do(.,(function(x){if(all(c("MC","MK") %in% x$type)){x$value1[] = x$value1[x$type=="MK"]};x[1,]})(.))
# A tibble: 4 x 4
# Groups: ID [4]
# ID type value1 value2
# <dbl> <fct> <dbl> <dbl>
#1 1 MC 512 726
#2 2 MC 4523 4000
#3 3 MC 1221 998
#4 4 MC 2556 6789

Row by row application in R [duplicate]

I have my data in the form of a data.table given below
structure(list(atp = c(1, 0, 1, 0, 0, 1), len = c(2, NA, 3, NA,
NA, 1), inv = c(593, 823, 668, 640, 593, 745), GU = c(36, 94,
57, 105, 48, 67), RUTL = c(100, NA, 173, NA, NA, 7)), .Names = c("atp",
"len", "inv", "GU", "RUTL"), row.names = c(NA, -6L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000000000320788>)
I need to create four new columns: csi_begin, csi_end, IRQ and csi_order. When atp == 1, csi_begin and csi_end depend directly on the inv and GU values.
When atp is not equal to 1, csi_begin and csi_end depend on the inv and GU values and on the IRQ value of the previous row.
IRQ depends on the csi_order of the same row if atp == 1, otherwise it is 0. csi_order depends on the csi_begin value from two rows earlier.
I have written these conditions using a for loop.
The code is given below:
lostsales<-function(transit)
{
if (transit$atp==1)
{
transit$csi_begin[i]<-(transit$inv)[i]
transit$csi_end[i]<-transit$csi_begin[i]-transit$GU[i]
}
else
{
transit$csi_begin[i]<-(transit$inv)[i]+transit$IRQ[i-1]
transit$csi_end[i]<-transit$csi_begin[i]-transit$GU[i]
}
if (transit$csi_begin[i-2]!= NA)
{
transit$csi_order[i]<-transit$csi_begin[i-2]
}
else
{ transit$csi_order[i]<-0}
if (transit$atp==1)
{
transit$IRQ[i]<-transit$csi_order[i]-transit$RUTL[i]
}
else
{
transit$IRQ[i]<-0
}
}
Can anyone help me do this efficiently with data.table, perhaps using setkey? My data set is very large, so a for loop over its rows would take far too long.
Adding the desired outcome to your example would be very helpful, as I'm having trouble following the if/then logic. But I took a stab at it anyway:
library(data.table)
# Example data:
dt <- data.table(atp = c(1, 0, 1, 0, 0, 1), len = c(2, NA, 3, NA, NA, 1), inv = c(593, 823, 668, 640, 593, 745), GU = c(36, 94, 57, 105, 48, 67), RUTL = c(100, NA, 173, NA, NA, 7))
# Add a row number:
dt[,rn:=.I]
# Use this function to get the value from a previous (shiftLen is negative) or future (shiftLen is positive) row:
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
# My attempt to follow the seemingly circular if/then rules:
lostsales2 <- function(transit) {
# If atp==1, set csi_begin to inv and csi_end to csi_begin - GU:
transit[atp==1, `:=`(csi_begin=inv, csi_end=inv-GU)]
# Set csi_order to the value of csi_begin from two rows prior:
transit[, csi_order:=rowShift(csi_begin,-2)]
# Set csi_order to 0 if csi_begin from two rows prior was NA
transit[is.na(csi_order), csi_order:=0]
# Initialize IRQ to 0
transit[, IRQ:=0]
# If ATP==1, set IRQ to csi_order - RUTL
transit[atp==1, IRQ:=csi_order-RUTL]
# If ATP!=1, set csi_begin to inv + IRQ value from previous row, and csi_end to csi_begin - GU
transit[atp!=1, `:=`(csi_begin=inv+rowShift(IRQ,-1), csi_end=inv+rowShift(IRQ,-1)-GU)]
return(transit)
}
lostsales2(dt)
## atp len inv GU RUTL rn csi_begin csi_end csi_order IRQ
## 1: 1 2 593 36 100 1 593 557 0 -100
## 2: 0 NA 823 94 NA 2 NA NA 0 0
## 3: 1 3 668 57 173 3 668 611 593 420
## 4: 0 NA 640 105 NA 4 640 535 0 0
## 5: 0 NA 593 48 NA 5 593 545 668 0
## 6: 1 1 745 67 7 6 745 678 640 633
Is this output close to what you were expecting?
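Read row by row, the rules in the question form a recurrence: csi_begin in row i can depend on IRQ in row i-1, which in turn depends on csi_begin two rows before that. Here is a hedged sketch of that sequential reading, looping over plain vectors rather than data.table rows. It is one interpretation of the question, not a confirmed specification. Treating the missing "previous" values in the first rows as 0 is an assumption, and prev_irq is a name introduced here.

```r
# Columns from the question's example data (len is not used by the rules)
atp  <- c(1, 0, 1, 0, 0, 1)
inv  <- c(593, 823, 668, 640, 593, 745)
GU   <- c(36, 94, 57, 105, 48, 67)
RUTL <- c(100, NA, 173, NA, NA, 7)

n <- length(atp)
csi_begin <- csi_end <- csi_order <- IRQ <- numeric(n)

for (i in seq_len(n)) {
  # Assumption: there is no IRQ before the first row, so use 0
  prev_irq <- if (i > 1) IRQ[i - 1] else 0

  # csi_begin is inv when atp == 1, otherwise inv plus the previous row's IRQ
  csi_begin[i] <- if (atp[i] == 1) inv[i] else inv[i] + prev_irq
  csi_end[i]   <- csi_begin[i] - GU[i]

  # csi_order is csi_begin from two rows earlier, or 0 if there is none
  csi_order[i] <- if (i > 2 && !is.na(csi_begin[i - 2])) csi_begin[i - 2] else 0

  # IRQ is csi_order - RUTL when atp == 1, otherwise 0
  IRQ[i] <- if (atp[i] == 1) csi_order[i] - RUTL[i] else 0
}

result <- data.frame(atp, inv, GU, RUTL, csi_begin, csi_end, csi_order, IRQ)
```

Under this reading the values differ from the vectorised attempt above. For example, row 2 gets csi_begin = 823 + (-100) = 723 rather than NA, because the previous row's IRQ is taken from the full column and not only from the rows where atp != 1. A loop over preallocated vectors like this avoids the per-row data.table overhead, though whether it is fast enough for the real data is untested.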
