In R, generate a new variable for each value of a group

I have an id variable and a date variable, with multiple dates for a given id (a panel). I would like to generate a new variable based on whether ANY of the years for a given id meets a logical condition. I am not sure how to code it, so please don't take the following as R code, just as logical pseudocode. Something like
foreach(i in min(id):max(id)) {
if(var1[yearvar[1:max(yearvar)]=="A") then { newvar==1}
}
As an example:
ID Year Letter
1 1999 A
1 2000 B
2 2000 C
3 1999 A
Should return newvar
1
1
0
1
Since data[ID==1] contains A in some year, it should also ==1 in 2000 despite Letter==B that year.

Here's a way of approaching it with base R:
#Find which ID meet first criteria
withA <- unique(dat$ID[dat$Letter == "A"])
#add new column based on whether ID is in withA
dat$newvar <- as.numeric(dat$ID %in% withA)
# ID Year Letter newvar
# 1 1 1999 A 1
# 2 1 2000 B 1
# 3 2 2000 C 0
# 4 3 1999 A 1

Here's a solution using plyr:
library(plyr)
a <- ddply(dat, .(ID), summarise, newvar = as.numeric(any(Letter == "A")))
# merge the per-ID flag back onto the original rows
merge(dat, a, by = "ID")

Without using a package:
dat <- data.frame(
ID = c(1,1,2,3),
Year = c(1999,2000,2000,1999),
Letter = c("A","B","C","A")
)
tableData <- table(dat[,c("ID","Letter")])
# index by ID label (not position) and flag IDs with at least one "A"
newvar <- as.numeric(tableData[as.character(dat$ID), "A"] > 0)
dat <- cbind(dat,newvar)
# ID Year Letter newvar
#1 1 1999 A 1
#2 1 2000 B 1
#3 2 2000 C 0
#4 3 1999 A 1
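Another base R variant, offered as a minimal sketch rather than as one of the original answers: ave() can apply any() within each ID and recycle the result to every row of that ID. The data frame below repeats the example from the question; the name has_A is made up for illustration.

```r
# Example data from the question
dat <- data.frame(
  ID = c(1, 1, 2, 3),
  Year = c(1999, 2000, 2000, 1999),
  Letter = c("A", "B", "C", "A")
)

# ave() applies any() within each ID and recycles the result to every row
has_A <- ave(dat$Letter == "A", dat$ID, FUN = any)
dat$newvar <- as.integer(has_A)
dat$newvar
# 1 1 0 1
```

This keeps the rows in their original order and does not assume the IDs are consecutive integers.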

Related

Multiply Rows Conditionally in data.table

I have the following data.table
# Load Library
library(data.table)
# Generate Data
test_data <- data.table(
year = c(2000,2000,2001,2001),
grp = rep(c("A","B"),2),
value = 1:4
)
I want to multiply the values of group A by a parameter that varies over the years. My attempt used sapply with fifelse and a fixed value of 2, but it seems that this solution is going to get messy if I want to vary this value over time.
multiply_effect <- sapply(
1:nrow(test_data), function(i){
fifelse(
test = test_data$grp[i] == "A", test_data$value[i] * 2, test_data$value[i]
)
}
)
Let's say that I want to multiply the value of grp A by 2 in 2000 and by 3 in 2001, keeping grp B as it is. Then my desired output would be
year grp value new_value
1: 2000 A 1 2
2: 2000 B 2 2
3: 2001 A 3 9
4: 2001 B 4 4
I'm looking for a data.table solution only.
You could define the factors for the years / groups you want to modify in another lookup data.table and update the main table with a join:
test_data <- data.table(
year = c(2000,2000,2001,2001),
grp = rep(c("A","B"),2),
value = 1:4
)
factor_lookup <- data.table(
year = c(2000,2001),
grp = rep("A",2),
factor = c(2,3)
)
test_data[factor_lookup ,value:=value*factor,on=.(year,grp)][]
year grp value
<num> <char> <int>
1: 2000 A 2
2: 2000 B 2
3: 2001 A 9
4: 2001 B 4
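The join above overwrites value in place, whereas the question's desired output keeps value and adds a new_value column. A minimal sketch of that variant, assuming the same lookup table; the i. prefix refers to a column of the table supplied in the first argument of the join:

```r
library(data.table)

test_data <- data.table(
  year  = c(2000, 2000, 2001, 2001),
  grp   = rep(c("A", "B"), 2),
  value = 1:4
)
factor_lookup <- data.table(
  year   = c(2000, 2001),
  grp    = "A",
  factor = c(2, 3)
)

# Start from the unmodified values (as double, since the factors are double)
test_data[, new_value := as.numeric(value)]
# Update join: only rows matching (year, grp) in the lookup are overwritten
test_data[factor_lookup, new_value := value * i.factor, on = .(year, grp)]
test_data$new_value
# 2 2 9 4
```

Rows without a match in factor_lookup (here, everything in grp B) keep their original value.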

R: Vlookup for a 'for' loop

I'm new to R and trying to use it in place of Excel (where I have more experience). I'm still working out the full 'for' logic, but not having the values to check whether it works the way I think it should is stopping me in my tracks. The goal is to generate what will be used as a factor with 3 levels: 0 = no duplicates, 1 = duplicate, oldest, 2 = duplicate, newest.
I have a dataframe that looks like this
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- cbind(Person, Date, ID, DuplicateStatus, IdealResult)
I am trying to use a for loop to evaluate whether a person is duplicated. If a person is not duplicated, the value is 0; if they are, they should have a 1 for the oldest row and a 2 for the newest (see IdealResult). NOTE: I have already sorted the data by person and then date, so if duplicated, the first appearance is the oldest.
Previous Vlookup-in-R answers here are aimed at merging datasets based on identical values in multiple datasets. Here, I am attempting to modify a column based on the relationship between columns within a single dataset.
currentID = 0
nextID =0
for(i in mydata$ID){
currentID = i
nextID = currentID++1
CurrentPerson ##Vlookup function that does - find currentID in ID, return associated value in Person column in same position.
NextPerson ##Vlookup function that does - find nextID in ID, return associated value in Person column in same position.
if CurrentPerson = NextPerson, then DuplicateStatus at ID associated with current person should be 1, and DuplicateStatus at ID associated with NextPerson = 2.
**This should end when current person = total number of people
Thanks!
You really need to spend some time with a simple tutorial on R. Your cbind() function converts all of your data to a character matrix which is probably not what you want. Look at the results of str(mydata). Instead of looping, this creates an index number within each Person group and then zeros out the groups with a single observation:
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
IR <- ave(mydata$ID, mydata$Person, FUN=seq_along)
IR
# [1] 1 1 1 2 1 1 2
tbl <- table(mydata$Person)
tozero <- mydata$Person %in% names(tbl[tbl == 1])
IR[tozero] <- 0
IR
# [1] 0 0 1 2 0 1 2
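The cbind() remark at the start of this answer can be checked directly. A small sketch with a shortened version of the question's vectors; the names as_matrix and as_frame are made up for illustration:

```r
Person <- c("A", "B", "C")
ID <- c(1, 2, 3)

as_matrix <- cbind(Person, ID)       # everything coerced to character
as_frame  <- data.frame(Person, ID)  # each column keeps its own type

typeof(as_matrix)      # "character"
class(as_frame$ID)     # "numeric"
```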
Is what you are looking for just to count the number of observations for a person, in one column (like a column ID)? If so, this will work using tidyverse:
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = seq_along(Person))
mydata
# A tibble: 7 x 6
# Groups: Person [5]
Person Date ID DuplicateStatus IdealResult Duplicate
<fct> <dbl> <dbl> <dbl> <dbl> <int>
1 A 0.05 1 0 0 1
2 B 0.05 2 0 0 1
3 C 0.0253 3 0 1 1
4 C 0.05 4 0 2 2
5 D 0.05 5 0 0 1
6 E 0.0253 6 0 1 1
7 E 0.05 7 0 2 2
You could assign the row number within each group, provided there is more than one row in the group.
This can be implemented in base R, dplyr, or data.table.
In base R :
mydata$ans <- with(mydata, ave(ID, Person, FUN = function(x)
seq_along(x) * (length(x) > 1)))
# Person Date ID IdealResult ans
#1 A 0.0500000 1 0 0
#2 B 0.0500000 2 0 0
#3 C 0.0252632 3 1 1
#4 C 0.0500000 4 2 2
#5 D 0.0500000 5 0 0
#6 E 0.0252632 6 1 1
#7 E 0.0500000 7 2 2
Using dplyr:
library(dplyr)
mydata %>% group_by(Person) %>% mutate(ans = row_number() * (n() > 1))
and with data.table
library(data.table)
setDT(mydata)[, ans := seq_along(ID) * (.N > 1), Person]
data
mydata <- data.frame(Person, Date, ID, IdealResult)
I would argue that n() is the ideal function for your problem
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = n())

How to generate a dummy treatment variable based on values from two different variables

I would like to generate a dummy treatment variable "treatment" based on country variable "iso" and earthquakes dummy variable "quake" (for dataset "data").
I would basically like to get a dummy variable "treatment" such that, if quake==1 at least once in my entire timeframe (let's say 2000-2018) for a given "iso", all observations for that "iso" have "treatment"==1, and all other countries have "treatment"==0. So countries that are affected by earthquakes have all observations 1, others 0.
I have tried using dplyr but since I'm still very green at R, it has taken me multiple tries and I haven't found a solution yet. I've looked on this website and google.
I suspect the solution should be something along the lines of but I can't finish it myself:
data %>%
filter(quake==1) %>%
group_by(iso) %>%
mutate(treatment)
Welcome to StackOverflow! You should really consider Sotos's links for your next questions on SO :)
Here is a dplyr solution (following what you started) :
## data
set.seed(123)
data <- data.frame(year = rep(2000:2002, each = 26),
iso = rep(LETTERS, times = 3),
quake = sample(0:1, 26*3, replace = TRUE))
## solution (dplyr option)
library(dplyr)
data2 <- data %>% arrange(iso) %>%
group_by(iso) %>%
mutate(treatment = if_else(sum(quake) == 0, 0, 1))
data2
# A tibble: 78 x 4
# Groups: iso [26]
year iso quake treatment
<int> <fct> <int> <dbl>
1 2000 A 0 1
2 2001 A 1 1
3 2002 A 1 1
4 2000 B 1 1
5 2001 B 1 1
6 2002 B 0 1
7 2000 C 0 1
8 2001 C 0 1
9 2002 C 1 1
10 2000 D 1 1
# ... with 68 more rows
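Because the random sample above happens to give every country shown at least one quake, the treatment == 0 case is not visible in that output. A smaller sketch with made-up data in which only country "A" is ever hit, using any() inside the grouped mutate(). Note that there is no filter(quake == 1) step, since filtering first would drop the unaffected countries entirely:

```r
library(dplyr)

# Small made-up panel: only country "A" ever has a quake (in 2001)
data <- data.frame(
  year  = rep(2000:2002, each = 3),
  iso   = rep(c("A", "B", "C"), times = 3),
  quake = c(0, 0, 0,  1, 0, 0,  0, 0, 0)
)

data2 <- data %>%
  group_by(iso) %>%
  mutate(treatment = as.integer(any(quake == 1))) %>%
  ungroup()

data2$treatment
# 1 0 0 1 0 0 1 0 0
```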

Assign values of a new column based on the frequency of a special pattern in dataframe

I would like to create another column of a data frame that groups each member in the first column based on the order.
Here is a reproducible demo:
df1=c("Alex","23","ID #:123", "John","26","ID #:564")
df1=data.frame(df1)
library(dplyr)
library(data.table)
df1 %>% mutate(group= ifelse(df1 %like% "ID #:",1,NA ) )
This was the output from the demo:
df1 group
1 Alex NA
2 23 NA
3 ID #:123 1
4 John NA
5 26 NA
6 ID #:564 1
This is what I want:
df1 group
1 Alex 1
2 23 1
3 ID #:123 1
4 John 2
5 26 2
6 ID #:564 2
So I want a group column that numbers each member in order of appearance.
Thanks in advance for any replies or thoughts!
Shift the condition with lag first and then do a cumsum:
df1 %>%
mutate(group= cumsum(lag(df1 %like% "ID #:", default = 1)))
# df1 group
#1 Alex 1
#2 23 1
#3 ID #:123 1
#4 John 2
#5 26 2
#6 ID #:564 2
Details:
df1 %>%
mutate(
# calculate the condition
cond = df1 %like% "ID #:",
# shift the condition down and fill the first value with 1
lag_cond = lag(cond, default = 1),
# increase the group when the condition is TRUE (ID encountered)
group= cumsum(lag_cond))
# df1 cond lag_cond group
#1 Alex FALSE TRUE 1
#2 23 FALSE FALSE 1
#3 ID #:123 TRUE FALSE 1
#4 John FALSE TRUE 2
#5 26 FALSE FALSE 2
#6 ID #:564 TRUE FALSE 2
You don't mention whether you're always expecting 3 rows per member. This code will allow you to toggle the number of rows per member (in case there's not always 3):
# Your code:
df1=c("Alex","23","ID #:123", "John","26","ID #:564")
df1=data.frame(df1)
library(dplyr)
library(data.table)
df1 %>% mutate(group= ifelse(df1 %like% "ID #:",1,NA ) )
number_of_rows_per_member <- 3 # Change if necessary
positions <- 1:(nrow(df1)/number_of_rows_per_member)
group <- c()
for (i in 1:length(positions)) {
group[(i*number_of_rows_per_member):((i*number_of_rows_per_member)-(number_of_rows_per_member-1))] <- i
}
group # This is the group column
df1$group <- group # Now just move the group column into your original dataframe
df1 # Done!
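Both ideas above can also be written without a loop or extra packages. This is a sketch under two assumptions: for the fixed-size version, every member has the same number of rows; for the marker version, every member's block ends with an "ID #:" row. The names group_fixed and group_by_marker are made up for illustration.

```r
df1 <- data.frame(df1 = c("Alex", "23", "ID #:123", "John", "26", "ID #:564"))

# Fixed block size: repeat each group number three times
rows_per_member <- 3
group_fixed <- rep(seq_len(nrow(df1) / rows_per_member), each = rows_per_member)

# Variable block size: a new group starts on the row after each "ID #:" line
is_id <- grepl("ID #:", df1$df1, fixed = TRUE)
group_by_marker <- cumsum(c(TRUE, head(is_id, -1)))

group_fixed       # 1 1 1 2 2 2
group_by_marker   # 1 1 1 2 2 2
```

The marker version is the same lag-then-cumsum logic as the dplyr answer, expressed with grepl() and head().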

Setting column value of a subset of rows in a dataframe in R [duplicate]

I have a dataframe df with a column called ID.
Multiple rows may have the same ID and I want to set a column value "occurrence" to indicate how many times the ID has been seen before.
for (i in unique(df$ID)) {
rows = df[df$ID==i, ]
for (idx in 1:nrow(rows)) {
rows[idx,'occurrence'] = idx
}
}
Unfortunately, this adds the occurrence column to rows, but it does not update the original data frame. How do I get the occurrence column added to df?
Update: The row_number() function pointed out by neilfws works great. Actually, I have a follow-up question: the dataframe also has a year column, and what I need to do is add a new column (say Prev.Year.For.This.ID) for the year of the previous occurrence of the ID. E.g., if the input is
Year = c(1991,1992,1993,1994,1995)
ID = c(1,2,1,2,1)
df <- data.frame (Year, ID)
I'd like the output to look like this:
ID Year occurrence Prev.Year.For.This.Id
1 1991 1 <NA>
2 1992 1 <NA>
1 1993 2 1991
2 1994 2 1992
1 1995 3 1993
You can use dplyr to group_by ID, then row_number gives the running total of occurrences.
library(dplyr)
df1 <- data.frame(ID = c(1,2,3,1,4,5,6,2,7,8,2))
df1 %>%
group_by(ID) %>%
mutate(cnt = row_number()) %>%
ungroup()
ID cnt
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 1 2
5 4 1
6 5 1
7 6 1
8 2 2
9 7 1
10 8 1
11 2 3
Are you after something like the following (I made up sample data for you):
library(dplyr)
df = data.frame(ID = c(1,1,1,2,2,3))
# row_number() - 1 counts earlier rows with the same ID (cumsum(ID / ID) breaks when an ID is 0)
answer = df %>% group_by(ID) %>% mutate(occurrence = row_number() - 1) %>% as.data.frame
This will give something which looks like this:
ID occurrence
1 0
1 1
1 2
2 0
2 1
3 0
The dplyr package is a great tool for grouping and summarising data. I also find the code very readable when I use the pipe %>% (though, admittedly, it does take some getting used to).
> library(data.table)
> df = data.frame(ID = c(1,1,1,2,2,3))
> df <- data.table(df)
> df[, occurrence := sequence(.N), by = c("ID")]
> df
ID occurrence
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 3 1
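None of the answers above address why the original loop had no effect, or the follow-up about the previous year. A base R sketch of both, using the years shown in the question's expected output and assuming the rows are already sorted by year within each ID. The loop in the question modifies rows, which is a copy of part of df, so the fix is to assign into df itself:

```r
df <- data.frame(Year = c(1991, 1992, 1993, 1994, 1995),
                 ID   = c(1, 2, 1, 2, 1))

# Assign into df directly instead of into a copied subset
df$occurrence <- NA_integer_
for (i in unique(df$ID)) {
  idx <- df$ID == i
  df$occurrence[idx] <- seq_len(sum(idx))
}

# Year of the previous row with the same ID (NA for the first occurrence)
df$Prev.Year.For.This.ID <- ave(df$Year, df$ID,
                                FUN = function(y) c(NA, head(y, -1)))
df
```

The same shift could presumably be done with lag(Year) inside a grouped mutate() in dplyr; the ave() form is used here only to keep the sketch free of package dependencies.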

Resources