I'm new to R and am trying to use it in place of Excel (where I have more experience). I'm still working out the full 'for' logic, but not having the values to check whether it's working the way I think it should is stopping me in my tracks. The goal is to generate what will be used as a factor with 3 levels: 0 = not duplicated; 1 = duplicated, oldest; 2 = duplicated, newest.
I have a data frame that looks like this:
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- cbind(Person, Date, ID, DuplicateStatus, IdealResult)
I am trying to use a for loop to evaluate whether a person is duplicated. If a person is not duplicated, the value is 0; if they are duplicated, they should get a 1 for the oldest record and a 2 for the newest record (see IdealResult). NOTE: I have already sorted the data by person and then date, so if a person is duplicated, the first appearance is the oldest.
Previous "VLOOKUP in R" answers here are aimed at merging datasets based on identical values across multiple datasets. Here, I am attempting to modify a column based on the relationship between columns within a single dataset.
currentID <- 0
nextID <- 0
for (i in mydata$ID) {
  currentID <- i
  nextID <- currentID + 1
  CurrentPerson ## VLOOKUP-style step: find currentID in ID, return the Person value in the same position.
  NextPerson    ## VLOOKUP-style step: find nextID in ID, return the Person value in the same position.
  ## If CurrentPerson equals NextPerson, then DuplicateStatus at the ID associated with CurrentPerson
  ## should be 1, and DuplicateStatus at the ID associated with NextPerson should be 2.
  ## This should end when currentID reaches the total number of people.
}
Thanks!
You really need to spend some time with a simple tutorial on R. Your cbind() call converts all of your data to a character matrix, which is probably not what you want; look at the results of str(mydata). Instead of looping, the code below creates an index number within each Person group and then zeros out the groups with a single observation:
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
IR <- ave(mydata$ID, mydata$Person, FUN=seq_along)
IR
# [1] 1 1 1 2 1 1 2
tbl <- table(mydata$Person)
tozero <- mydata$Person %in% names(tbl[tbl == 1])
IR[tozero] <- 0
IR
# [1] 0 0 1 2 0 1 2
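To put the result back into the data frame, you can overwrite the placeholder column and compare against the ideal result (a small usage sketch, assuming the data.frame version of mydata created above):
mydata$DuplicateStatus <- IR
all(mydata$DuplicateStatus == mydata$IdealResult)
# [1] TRUE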
Is what you are looking for just a count of the observations for each person, in one column (like the ID column)? If so, this will work using the tidyverse:
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
library(tidyverse)
mydata <- mydata %>%
  group_by(Person) %>%
  mutate(Duplicate = seq_along(Person))
mydata
# A tibble: 7 x 6
# Groups: Person [5]
Person Date ID DuplicateStatus IdealResult Duplicate
<fct> <dbl> <dbl> <dbl> <dbl> <int>
1 A 0.05 1 0 0 1
2 B 0.05 2 0 0 1
3 C 0.0253 3 0 1 1
4 C 0.05 4 0 2 2
5 D 0.05 5 0 0 1
6 E 0.0253 6 0 1 1
7 E 0.05 7 0 2 2
You could assign the row number within each group, provided there is more than one row in the group.
This can be implemented in base R, in dplyr, and in data.table.
In base R:
mydata$ans <- with(mydata, ave(ID, Person, FUN = function(x)
  seq_along(x) * (length(x) > 1)))
# Person Date ID IdealResult ans
#1 A 0.0500000 1 0 0
#2 B 0.0500000 2 0 0
#3 C 0.0252632 3 1 1
#4 C 0.0500000 4 2 2
#5 D 0.0500000 5 0 0
#6 E 0.0252632 6 1 1
#7 E 0.0500000 7 2 2
Using dplyr:
library(dplyr)
mydata %>% group_by(Person) %>% mutate(ans = row_number() * (n() > 1))
And with data.table:
library(data.table)
setDT(mydata)[, ans := seq_along(ID) * (.N > 1), Person]
data
mydata <- data.frame(Person, Date, ID, IdealResult)
I would argue that n() is the ideal function for your problem:
library(tidyverse)
mydata <- mydata %>%
  group_by(Person) %>%
  mutate(Duplicate = n())
Related
I have a data frame in R from which I want to remove certain rows provided they match certain conditions. How can I do it?
I have tried using dplyr and ifelse, but my code does not give the right answer:
check8 <- distinct(df5, prod, .keep_all = TRUE)
This does not work; it gives back the entire data set.
Input is:
check1 <- data.frame(ID = c(1,1,2,2,2,3,4),
                     prod = c("R","T","R","T",NA,"T","R"),
                     bad = c(0,0,0,1,0,1,0))
# ID prod bad
# 1 1 R 0
# 2 1 T 0
# 3 2 R 0
# 4 2 T 1
# 5 2 <NA> 0
# 6 3 T 1
# 7 4 R 0
Output expected:
data.frame(ID = c(1,2,3,4),
           prod = c("R","R","T","R"),
           bad = c(0,0,1,0))
# ID prod bad
# 1 1 R 0
# 2 2 R 0
# 3 3 T 1
# 4 4 R 0
I want the output such that, for IDs that have both prod values (or NA), only the rows with prod R are kept, but if an ID has only one row, that row is kept regardless of its prod.
Using dplyr, we can use filter to select rows where prod == "R", or, if there is only one row in the group, to select that row.
library(dplyr)
check1 %>%
  group_by(ID) %>%
  filter(prod == "R" | n() == 1)
# ID prod bad
# <dbl> <fct> <dbl>
#1 1 R 0
#2 2 R 0
#3 3 T 1
#4 4 R 0
Here is a solution using an anti_join:
library(dplyr)
check1 <- data.frame(ID = c(1,1,2,2,2,3,4), prod = c("R","T","R","T",NA,"T","R"), bad = c(0,0,0,1,0,1,0))
# First part: select all the IDs which contain 'R' as prod
p1 <- check1 %>%
  group_by(ID) %>%
  filter(prod == 'R')
# Second part: using anti_join get all the rows from check1 where there are not
# matching values in p1
p2 <- anti_join(check1, p1, by = 'ID')
solution <- bind_rows(
  p1,
  p2
) %>%
  arrange(ID)
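As a quick sanity check (a small usage sketch), printing solution should reproduce the four rows of the expected output from the question:
solution
#   ID prod bad
#    1    R   0
#    2    R   0
#    3    T   1
#    4    R   0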
I have a data frame df1 with information on acquisitions by ID. Each acquirer A and target B has its four-digit SIC codes on one line, separated by "/".
df1 <- data.frame(ID = c(1,2,3,4),
                  A = c("1230/1344/2334/2334","3322/3344/3443", "1112/9099", "3332/4483"),
                  B = c("1333/2334","3344/8840", "4454", "9988/2221/4483"))
ID A B
1 1230/1344/2334/2334 1333/2334
2 3322/3344/3443 3344/8840
3 1112/9099 4454
4 3332/4483 9988/2221/4483
I would need to classify each transaction ID as follows:
If the primary code (i.e. the first four digits) of either A or B matches any code of B or A other than the primary code, then the Primary.other.match column takes a value of 1, and 0 otherwise.
If any code other than the primary code of A or B matches any code other than the primary code of B or A, then the Other.other.match column takes a value of 1, and 0 otherwise.
The desired output is shown below in the updated df1.
df1 <- data.frame(ID = c(1,2,3,4),
                  A = c("1230/1344/2334/2334","3322/3344/3443", "1112/9099", "3332/4483"),
                  B = c("1333/2334","3344/8840", "4454", "9988/2221/4483"),
                  Primary.other.match = c(0,1,0,0), # only if the primary code of A or B matches any other code of B or A
                  Other.other.match = c(1,0,0,1))   # only if the primary codes do not match primary or any other codes, but other codes match
ID A B Primary.other.match Other.other.match
1 1230/1344/2334/2334 1333/2334 0 1
2 3322/3344/3443 3344/8840 1 0
3 1112/9099 4454 0 0
4 3332/4483 9988/2221/4483 0 1
Thank you for your help!
Here is a solution within the tidyverse.
You first create a function which checks whether there is a primary match or an other match, and then apply this function to the A and B columns element-wise with purrr::map2:
library(tidyverse)
fun1 <- function(str1, str2){
  str1 <- str1 %>% str_split("/") %>% unlist()
  str2 <- str2 %>% str_split("/") %>% unlist()
  str1p <- str1[1]
  str2p <- str2[1]
  pom <- ifelse(str1p %in% str2 | str2p %in% str1, 1, 0)
  oom <- ifelse(pom == 0 & length(intersect(str1, str2)) > 0, 1, 0)
  tibble(pom = pom, oom = oom)
}
df1 %>% as_tibble() %>%
  mutate(result = map2(A, B, fun1)) %>%
  unnest(result)
# A tibble: 4 x 5
ID A B pom oom
<dbl> <fct> <fct> <dbl> <dbl>
1 1 1230/1344/2334/2334 1333/2334 0 1
2 2 3322/3344/3443 3344/8840 1 0
3 3 1112/9099 4454 0 0
4 4 3332/4483 9988/2221/4483 0 1
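If you want the new columns to carry the names used in the question, one small follow-up (a sketch, not part of the answer above) is to rename pom and oom at the end of the same pipe:
df1 %>% as_tibble() %>%
  mutate(result = map2(A, B, fun1)) %>%
  unnest(result) %>%
  rename(Primary.other.match = pom, Other.other.match = oom)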
I have a data frame with an id, an ordering time value, and a value. For each group of ids, I would like to remove rows having a smaller value than rows with a smaller time value.
data <- data.frame(id = c(rep(c("a", "b"), each = 3L), "b"),
                   time = c(0, 1, 2, 0, 1, 2, 3),
                   value = c(1, 1, 2, 3, 1, 2, 4))
> data
id time value
1 a 0 1
2 a 1 1
3 a 2 2
4 b 0 3
5 b 1 1
6 b 2 2
7 b 3 4
So the result would be :
> data
id time value
1 a 0 1
2 a 2 2
3 b 0 3
4 b 3 4
(For id == "b", the rows where time %in% c(1, 2) are removed because their value is smaller than the value at a lower time.)
I was thinking about lag:
data %>%
  group_by(id) %>%
  filter(time == 0 | lag(value, order_by = time) < value)
Source: local data frame [5 x 3]
Groups: id [2]
id time value
<fctr> <dbl> <dbl>
1 a 0 1
2 a 2 2
3 b 0 3
4 b 2 2
5 b 3 4
But it doesn't work as expected, since lag is a vectorized function; the idea would instead be to use a "recursive lag", or to check against the last maximal value. I can do it recursively with a loop, but I'm sure there is a more straightforward and higher-level way to do it.
Any help would be appreciated, thank you!
Here is a data.table solution:
library(data.table)
setDT(data)
data[, myVal := cummax(c(0, shift(value)[-1])), by=id][value > myVal][, myVal := NULL][]
id time value
1: a 0 1
2: a 2 2
3: b 0 3
4: b 3 4
The first part of the chain uses shift and cummax to create the cumulative maximum of the lagged value variable. In c(0, shift(value)[-1]), 0 is prepended to supply a value lower than any in the variable; more generally, you could use min(value) - 1. The [-1] subsetting removes the first element of shift, which is NA. The second part of the chain selects observations where value is greater than the cumulative maximum. The final two parts of the chain remove the cumulative-maximum variable and print out the result.
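For example, the more general form of the same chain mentioned above (a sketch, simply swapping the hard-coded 0 for min(value) - 1) would be:
data[, myVal := cummax(c(min(value) - 1, shift(value)[-1])), by = id][value > myVal][, myVal := NULL][]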
Another option is to perform a self anti/non-equi join using data.table
library(data.table) # v1.10.0
setDT(data)[!data, on = .(id, time > time, value <= value)]
# id time value
# 1: a 0 1
# 2: a 2 2
# 3: b 0 3
# 4: b 3 4
Which is basically saying: "If time is larger but value is less-equal, then I don't want these rows (! sign)"
Here is an option with dplyr. After grouping by 'id', we filter the rows where 'value' is greater than the cumulative maximum of the lag of the 'value' column:
library(dplyr)
data %>%
  group_by(id) %>%
  filter(value > cummax(lag(value, default = 0)))
# id time value
# <fctr> <dbl> <dbl>
#1 a 0 1
#2 a 2 2
#3 b 0 3
#4 b 3 4
Or another option is slice after arranging by 'id' and 'time' (as the OP mentioned the data are ordered):
data %>%
  group_by(id) %>%
  arrange(id, time) %>%
  slice(which(value > cummax(lag(value, default = 0))))
Say I have a dataset like this:
id <- c(1, 1, 2, 2, 3, 3)
code <- c("a", "b", "a", "a", "b", "b")
dat <- data.frame(id, code)
I.e.,
id code
1 1 a
2 1 b
3 2 a
4 2 a
5 3 b
6 3 b
Using dplyr, how would I get a count of how many a's there are for each id?
i.e.,
id countA
1 1 1
2 2 2
3 3 0
I'm trying stuff like this, which isn't working:
countA <- dat %>%
  group_by(id) %>%
  summarise(cip.completed = count(code == "a"))
The above gives me an error, "Error: no applicable method for 'group_by_' applied to an object of class "logical""
Thanks for your help!
Try the following instead:
library(dplyr)
dat %>% group_by(id) %>%
  summarise(cip.completed = sum(code == "a"))
Source: local data frame [3 x 2]
id cip.completed
(dbl) (int)
1 1 1
2 2 2
3 3 0
This works because the logical condition code == "a" is just a series of zeros and ones, and the sum of this series is the number of occurrences.
Note that you would not necessarily use dplyr::count inside summarise anyway, as it is a wrapper around summarise that calls either n() or sum() itself; see ?dplyr::count. If you really want to use count, you could do so by first filtering the dataset to retain only the rows in which code == "a"; count would then give you all strictly positive (i.e. non-zero) counts. For instance,
dat %>% filter(code == "a") %>% count(id)
Source: local data frame [2 x 2]
id n
(dbl) (int)
1 1 1
2 2 2
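If you do need the zero counts back after filtering, one possible follow-up (a sketch using tidyr::complete, not part of the answer above) is to expand the result to all ids and fill the missing counts with 0:
library(tidyr)
dat %>%
  filter(code == "a") %>%
  count(id) %>%
  complete(id = unique(dat$id), fill = list(n = 0))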
I am working with a data frame corresponding to the example below:
set.seed(1)
dta <- data.frame("CatA" = rep(c("A","B","C"), 4), "CatNum" = rep(1:2,6),
                  "SomeVal" = runif(12))
I would like to quickly build a data frame that has sum values for all combinations of the categories derived from CatA and CatNum, as well as for the categories derived from each column separately. On the primitive example above, for the first couple of combinations, this can be achieved with simple code:
df_sums <- data.frame(
  "Category" = c("Total for A",
                 "Total for A and 1",
                 "Total for A and 2"),
  "Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
            sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
            sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)
This produces an informative data frame of sums:
Category Sum
1 Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941
This solution would be grossly inefficient when applied to a data frame with multiple categories. I would like to achieve the following:
Cycle through all the categories, including categories derived from each column separately as well as from both columns at the same time
Achieve some flexibility with respect to which function is applied; for instance, I may want to apply mean instead of sum
Save the "Total for" string in a separate object that I can easily edit when applying a function other than sum
I was initially thinking of using dplyr, along the lines of:
require(dplyr)
df_sums_experiment <- dta %>%
  group_by(CatA, CatNum) %>%
  summarise(TotVal = sum(SomeVal))
But it's not clear to me how I could apply multiple groupings simultaneously. As stated, I'm interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column that would indicate what is combined and in what order.
You could use tidyr to unite the columns and gather the data. Then use dplyr to summarise:
library(dplyr)
library(tidyr)
dta %>% unite(measurevar, CatA, CatNum, remove=FALSE) %>%
  gather(key, val, -SomeVal) %>%
  group_by(val) %>%
  summarise(sum(SomeVal))
val sum(SomeVal)
(chr) (dbl)
1 1 2.8198078
2 2 3.0778622
3 A 2.1801780
4 A_1 1.2101839
5 A_2 0.9699941
6 B 1.4405782
7 B_1 0.4076565
8 B_2 1.0329217
9 C 2.2769138
10 C_1 1.2019674
11 C_2 1.0749464
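To address the flexibility points from the question (swapping sum for another function and keeping the "Total for" label in a separate, editable object), one possible extension of the same pipe is shown below. This is a sketch; label and agg_fun are names introduced here purely for illustration:
library(dplyr)
library(tidyr)
label <- "Total for "   # kept in a separate object so it is easy to edit
agg_fun <- mean         # or sum, median, etc.
dta %>%
  unite(measurevar, CatA, CatNum, remove = FALSE) %>%
  gather(key, val, -SomeVal) %>%
  group_by(val) %>%
  summarise(Value = agg_fun(SomeVal)) %>%
  mutate(Category = paste0(label, val))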
Just loop over the column combinations, compute the quantities you want and then rbind them together:
library(data.table)
dt = as.data.table(dta) # or setDT to convert in place
cols = c('CatA', 'CatNum')
rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
                function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = T)
# CatA CatNum V1
# 1: A 1 1.2101839
# 2: B 2 1.0329217
# 3: C 1 1.2019674
# 4: A 2 0.9699941
# 5: B 1 0.4076565
# 6: C 2 1.0749464
# 7: A NA 2.1801780
# 8: B NA 1.4405782
# 9: C NA 2.2769138
#10: NA 1 2.8198078
#11: NA 2 3.0778622
Split, then use apply:
#result
res <- do.call(rbind,
               lapply(
                 c(split(dta, dta$CatA),
                   split(dta, dta$CatNum),
                   split(dta, dta[, 1:2])),
                 function(i) sum(i[, "SomeVal"])))
#prettify the result
res1 <- data.frame(Category = paste0("Total for ", rownames(res)),
                   Sum = res[, 1])
res1$Category <- sub(".", " and ", res1$Category, fixed = TRUE)
row.names(res1) <- seq_along(row.names(res1))
res1
# Category Sum
# 1 Total for A 2.1801780
# 2 Total for B 1.4405782
# 3 Total for C 2.2769138
# 4 Total for 1 2.8198078
# 5 Total for 2 3.0778622
# 6 Total for A and 1 1.2101839
# 7 Total for B and 1 0.4076565
# 8 Total for C and 1 1.2019674
# 9 Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464