Collapse rows with same identifier and columns and retain all values in R

I have a dataframe that contains several fields related to an identifier but some are disjointed:
id store manager fruit vegetable
1 Grocery1 Joe apple NA
1 Grocery1 Joe lemon NA
1 Grocery1 Joe NA zucchini
2 Grocery2 Amy orange NA
2 Grocery2 Amy NA asparagus
2 Grocery2 Amy NA spinach
3 Grocery3 Bill NA NA
I want the dataframe to look like:
id store manager fruit vegetable
1 Grocery1 Joe apple zucchini
1 Grocery1 Joe lemon zucchini
2 Grocery2 Amy orange asparagus
2 Grocery2 Amy orange spinach
3 Grocery3 Bill NA NA
Is there a way to easily do this?

You can use tidyr::fill to fill the NA, and only keep the non-duplicated rows using distinct.
library(dplyr)
library(tidyr)
df %>%
group_by(store, manager) %>%
fill(fruit, vegetable, .direction = "updown") %>%
distinct()
# A tibble: 5 × 5
# Groups: store, manager [3]
id store manager fruit vegetable
<int> <chr> <chr> <chr> <chr>
1 1 Grocery1 Joe apple zucchini
2 1 Grocery1 Joe lemon zucchini
3 2 Grocery2 Amy orange asparagus
4 2 Grocery2 Amy orange spinach
5 3 Grocery3 Bill NA NA
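To run the answer end to end, here is a minimal sketch. The data.frame() call is a hand-made reconstruction of the table in the question, not code from the asker, and the ungroup() step is an addition so that the result is a plain, ungrouped tibble. Keep in mind that fill() only copies the nearest non-missing value within a group, so this depends on the rows being laid out as in the example; it is not a general cross join of fruits and vegetables.

```r
library(dplyr)
library(tidyr)

# Reconstruction of the example table from the question (assumed layout)
df <- data.frame(
  id        = c(1L, 1L, 1L, 2L, 2L, 2L, 3L),
  store     = c("Grocery1", "Grocery1", "Grocery1",
                "Grocery2", "Grocery2", "Grocery2", "Grocery3"),
  manager   = c("Joe", "Joe", "Joe", "Amy", "Amy", "Amy", "Bill"),
  fruit     = c("apple", "lemon", NA, "orange", NA, NA, NA),
  vegetable = c(NA, NA, "zucchini", NA, "asparagus", "spinach", NA),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(store, manager) %>%
  # "updown" fills upwards first, then downwards, within each group
  fill(fruit, vegetable, .direction = "updown") %>%
  ungroup() %>%
  # filling creates identical rows; keep one copy of each
  distinct()

print(result)
```

If a store can have several fruits and several vegetables in arbitrary row order, it is worth checking the result by hand, because fill() will not pair every fruit with every vegetable.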

Related

Combining two dataframes based on presence of values of various different columns

I have a question about creating new columns in my dataset by checking whether a value is present in one of the columns of my dataframe, and assigning the columns of a different dataframe based on that presence. As this description is quite vague, see the example dataset below:
newDf <- data.frame(c("Juice 1", "Juice 2", "Juice 3", "Juice 4","Juice 5"),
c("Banana", "Banana", "Orange", "Pear", "Apple"),
c("Blueberry", "Mango", "Rasberry", "Spinach", "Pear"),
c("Kale", NA, "Cherry", NA, "Peach"))
colnames(newDf) <- c("Juice", "Fruit 1", "Fruit 2", "Fruit 3")
dfChecklist <- data.frame(c("Banana", "Cherry"),
c("100", "80"),
c("5", "3"),
c("4", "5"))
colnames(dfChecklist) <- c("FruitCheck", "NutritionalValue", "Deliciousness", "Difficulty")
This gives the following dataframes:
Juice Fruit 1 Fruit 2 Fruit 3
1 Juice 1 Banana Blueberry Kale
2 Juice 2 Banana Mango <NA>
3 Juice 3 Orange Rasberry Cherry
4 Juice 4 Pear Spinach <NA>
5 Juice 5 Apple Pear Peach
FruitCheck NutritionalValue Deliciousness Difficulty
1 Banana 100 5 4
2 Cherry 80 3 5
I want to combine the two and make the result to be like this:
Juice Fruit 1 Fruit 2 Fruit 3 FruitCheck NutritionalValue Deliciousness Difficulty
1 Juice 1 Banana Blueberry Kale Banana 100 5 4
2 Juice 2 Banana Mango <NA> Banana 100 5 4
3 Juice 3 Orange Rasberry Cherry Cherry 80 3 5
4 Juice 4 Pear Spinach <NA> <NA> <NA> <NA> <NA>
5 Juice 5 Apple Pear Peach <NA> <NA> <NA> <NA>
The dataset above is an example; my own dataset is much larger and more complex.
Thanks so much in advance for your help!
First, for each row, find the fruit that matches the checklist (if several match, the one listed earliest in dfChecklist is kept):
tmp=unlist(
apply(
newDf[,grepl("Fruit",colnames(newDf))],
1,
function(x){
y=as.vector(x)
y=y[which.min(match(y,dfChecklist$FruitCheck))]
ifelse(length(y)==0,NA,y)
}
)
)
Add this to your original data frame and then do a simple merge:
newDf$FruitCheck=tmp
merge(
newDf,
dfChecklist,
by="FruitCheck",
all.x=TRUE
)
resulting in
FruitCheck Juice Fruit 1 Fruit 2 Fruit 3 NutritionalValue Deliciousness
1 Banana Juice 1 Banana Blueberry Kale 100 5
2 Banana Juice 2 Banana Mango <NA> 100 5
3 Cherry Juice 3 Orange Rasberry Cherry 80 3
4 <NA> Juice 4 Pear Spinach <NA> <NA> <NA>
5 <NA> Juice 5 Apple Pear Peach <NA> <NA>
Difficulty
1 4
2 4
3 5
4 <NA>
5 <NA>
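The matching step inside apply() is compact, so here is a small sketch of what it does for a single row. The helper name pick_match is hypothetical and only used for illustration; the logic is the same match() plus which.min() combination used in the answer.

```r
# Hypothetical helper (not part of the answer): return the element of `row`
# that appears earliest in `checklist`, or NA if none of them appears.
pick_match <- function(row, checklist) {
  pos <- match(row, checklist)   # position in checklist, NA where absent
  hit <- row[which.min(pos)]     # which.min() skips NA; integer(0) if all NA
  if (length(hit) == 0) NA_character_ else hit
}

checklist <- c("Banana", "Cherry")

a <- pick_match(c("Banana", "Blueberry", "Kale"), checklist)   # a match in column 1
b <- pick_match(c("Orange", "Rasberry", "Cherry"), checklist)  # a match in column 3
c0 <- pick_match(c("Pear", "Spinach", NA), checklist)          # no match at all
d <- pick_match(c("Cherry", "Banana", "Kale"), checklist)      # two matches

print(c(a, b, c0, d))
```

When a row contains two checklist fruits, the one that comes first in the checklist wins, not the one that comes first in the row. Also note that merge() sorts its result by the key column by default, so the row order of the merged data can differ from the original.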

My question is about R: How to number each repetition in a table in R?

In my data set there is a column of full names (example below). Using R, I want to add another column next to it that numbers each occurrence of a name (1, 2, 3, 4, ...) when the name appears more than once. My output should look like the "Number of repetition" column below.
Eg: Data set name: People
**Full name** **Number of repetition**
Peter 1
Peter 2
Alison
Warren
Jack 1
Jack 2
Jack 3
Jack 4
Susan 1
Susan 2
Henry 1
Walison
Tinder 1
Peter 3
Henry 2
Tinder 2
Thanks
Teena
Here is an alternative, worked out with help from akrun (see "sum() condition in ifelse statement"):
library(dplyr)
df1 %>%
group_by(Fullname) %>%
mutate(newcol = row_number(),
newcol = if(sum(newcol)> 1) newcol else NA) %>%
ungroup
Fullname newcol
<chr> <int>
1 Peter 1
2 Peter 2
3 Alison NA
4 Warren NA
5 Jack 1
6 Jack 2
7 Jack 3
8 Jack 4
9 Susan 1
10 Susan 2
11 Henry 1
12 Walison NA
13 Tinder 1
14 Peter 3
15 Henry 2
16 Tinder 2
Here is one way. Do a group by 'Fullname', and create the sequence with row_number() if the number of rows is greater than 1. By default, case_when returns the other case as NA
library(dplyr)
df1 <- df1 %>%
group_by(Fullname) %>%
mutate(Number_of_repetition = case_when(n() > 1 ~ row_number())) %>%
ungroup
-output
df1
# A tibble: 16 × 2
Fullname Number_of_repetition
<chr> <int>
1 Peter 1
2 Peter 2
3 Alison NA
4 Warren NA
5 Jack 1
6 Jack 2
7 Jack 3
8 Jack 4
9 Susan 1
10 Susan 2
11 Henry 1
12 Walison NA
13 Tinder 1
14 Peter 3
15 Henry 2
16 Tinder 2
If we need to add a third column, use unite on the updated data from previous step
library(tidyr)
df1 %>%
unite(FullNameRep, Fullname, Number_of_repetition, sep="", na.rm = TRUE, remove = FALSE)
-output
# A tibble: 16 × 3
FullNameRep Fullname Number_of_repetition
<chr> <chr> <int>
1 Peter1 Peter 1
2 Peter2 Peter 2
3 Alison Alison NA
4 Warren Warren NA
5 Jack1 Jack 1
6 Jack2 Jack 2
7 Jack3 Jack 3
8 Jack4 Jack 4
9 Susan1 Susan 1
10 Susan2 Susan 2
11 Henry1 Henry 1
12 Walison Walison NA
13 Tinder1 Tinder 1
14 Peter3 Peter 3
15 Henry2 Henry 2
16 Tinder2 Tinder 2
data
df1 <- structure(list(Fullname = c("Peter", "Peter", "Alison", "Warren",
"Jack", "Jack", "Jack", "Jack", "Susan", "Susan", "Henry", "Walison",
"Tinder", "Peter", "Henry", "Tinder")), row.names = c(NA, -16L
), class = "data.frame")
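For comparison, the same numbering can be sketched in base R with ave(), without dplyr. This is an alternative to the answers above, not what they use; the vector names n_in_group and rep_no are made up for this example.

```r
Fullname <- c("Peter", "Peter", "Alison", "Warren",
              "Jack", "Jack", "Jack", "Jack", "Susan", "Susan", "Henry",
              "Walison", "Tinder", "Peter", "Henry", "Tinder")

# How many times each name occurs, repeated for every row of that name
n_in_group <- ave(seq_along(Fullname), Fullname, FUN = length)

# Running counter within each name, in order of appearance
rep_no <- ave(seq_along(Fullname), Fullname, FUN = seq_along)

# Names that occur only once get NA instead of 1
rep_no[n_in_group == 1] <- NA

print(data.frame(Fullname, Number_of_repetition = rep_no))
```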

Join dataframes in dplyr by characters

So I have two dataframes:
DF1
X Y ID
banana 14 1
orange 20 2
pineapple 1 3
guava 300 4
grapes 1 5
DF2
Store State ID
Walmart NY 1
Sears AL 1;2
Target DC 3
Old Navy PA 3
Popeye's HA 5
Footlocker NJ 4;5
I join with the following and get:
DF1 %>%
inner_join(DF2, by = "ID")
X Y ID Store State
banana 14 1 Walmart NY
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
grapes 1 5 Popeye's HA
But due to the semi-colons I'm not capturing those data points on the join, the end result should look like this:
X Y ID Store State
banana 14 1 Walmart NY
banana 14 1 Sears AL
orange 20 2 Sears AL
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
guava 300 4 Footlocker NJ
grapes 1 5 Popeye's HA
grapes 1 5 Footlocker NJ
Using separate_rows from tidyr in combination with dplyr will get you there.
First table I called fruit, the other stores.
library(dplyr)
library(tidyr)
fruit %>%
inner_join(separate_rows(stores, ID) %>% mutate(ID = as.integer(ID)))
Joining, by = "ID"
X Y ID Store State
1 banana 14 1 Walmart NY
2 banana 14 1 Sears AL
3 orange 20 2 Sears AL
4 pineapple 1 3 Target DC
5 pineapple 1 3 Old Navy PA
6 guava 300 4 Footlocker NJ
7 grapes 1 5 Popeye's HA
8 grapes 1 5 Footlocker NJ
With base R, we can use strsplit with merge
lst1 <- strsplit(DF2$ID, ";")
merge(DF1, transform(DF2[rep(seq_len(nrow(DF2)),
lengths(lst1)), 1:2], ID = unlist(lst1)))
# ID X Y Store State
#1 1 banana 14 Walmart NY
#2 1 banana 14 Sears AL
#3 2 orange 20 Sears AL
#4 3 pineapple 1 Target DC
#5 3 pineapple 1 Old Navy PA
#6 4 guava 300 Footlocker NJ
#7 5 grapes 1 Popeye's HA
#8 5 grapes 1 Footlocker NJ
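Since neither answer includes the input data, here is a self-contained sketch of the tidyr approach. The two data.frame() calls are reconstructions of the tables in the question. Passing sep = ";" and convert = TRUE to separate_rows() is a variation on the answer: it makes the delimiter explicit and converts the split IDs to integers instead of calling as.integer() afterwards.

```r
library(dplyr)
library(tidyr)

# Reconstructed from the question; ID is character in stores because of "1;2"
fruit <- data.frame(
  X  = c("banana", "orange", "pineapple", "guava", "grapes"),
  Y  = c(14, 20, 1, 300, 1),
  ID = 1:5,
  stringsAsFactors = FALSE
)
stores <- data.frame(
  Store = c("Walmart", "Sears", "Target", "Old Navy", "Popeye's", "Footlocker"),
  State = c("NY", "AL", "DC", "PA", "HA", "NJ"),
  ID    = c("1", "1;2", "3", "3", "5", "4;5"),
  stringsAsFactors = FALSE
)

# One row per store and ID: "1;2" becomes two rows, with ID 1 and ID 2
stores_long <- separate_rows(stores, ID, sep = ";", convert = TRUE)

joined <- inner_join(fruit, stores_long, by = "ID")
print(joined)
```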

Collapsing group of strings into one string using an if statement within a for loop in R

I have a dataframe with a column "Food."
dataframe <- data.frame(Color = c("red","red","red","red","red","blue","blue","blue","blue","blue","green","green","green","green","green","orange","orange","orange","orange","orange"),
Food = c("banana","apple","potato","orange","egg","strawberry","cheese","yogurt","kiwi","butter","kale","sugar","carrot","celery","radish","cereal","milk","blueberry","squash","lemon"), Count = c(2,5,4,8,10,7,5,6,9,11,1,8,5,3,7,9,2,3,6,4))
Every time a fruit appears I want to replace the name of the fruit with "fruit."
I've tried making a vector of the fruit names. Then I go through each row in the dataframe and where the string matches the fruit, I want to replace the fruit name with "fruit."
fruit_list <- c("banana","apple","orange","strawberry","kiwi","blueberry","lemon")
for (r in 1:nrow(dataframe)) {
for (i in 1:length(fruit_list)){
if (length(grep(fruit_list[i], dataframe$Food[r])) != 0) {
dataframe$Food[r] <- paste("fruit")
}
}
}
How do I use this general format so that dataframe$Food doesn't just end up filled with NA?
With dplyr:
library(dplyr)
dataframe %>%
mutate(Food=as.character(Food),
Food=ifelse(Food%in%fruit_list,"Fruit",Food))#can change to fruit
Result:
Color Food Count
1 red Fruit 2
2 red Fruit 5
3 red potato 4
4 red Fruit 8
5 red egg 10
6 blue Fruit 7
7 blue cheese 5
8 blue yogurt 6
9 blue Fruit 9
10 blue butter 11
11 green kale 1
12 green sugar 8
13 green carrot 5
14 green celery 3
15 green radish 7
16 orange cereal 9
17 orange milk 2
18 orange Fruit 3
19 orange squash 6
20 orange Fruit 4
Only R base:
dataframe$Food <-
sapply(dataframe$Food,
function(x,fruit_list) ifelse(x %in% fruit_list, "fruit", as.character(x) ),
fruit_list = fruit_list )
You don't necessarily need dplyr for this.
Just use:
dataframe$Food <- ifelse(dataframe$Food %in% fruit_list, "Fruit", as.character(dataframe$Food))
You can do this in one line using the data.table package:
> setDT(dataframe)[,Food:=ifelse(Food %in% fruit_list,"fruit",as.character(Food))]
Color Food Count
1: red fruit 2
2: red fruit 5
3: red potato 4
4: red fruit 8
5: red egg 10
6: blue fruit 7
7: blue cheese 5
8: blue yogurt 6
9: blue fruit 9
10: blue butter 11
11: green kale 1
12: green sugar 8
13: green carrot 5
14: green celery 3
15: green radish 7
16: orange cereal 9
17: orange milk 2
18: orange fruit 3
19: orange squash 6
20: orange fruit 4
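None of the answers says why the original loop filled Food with NA. A likely cause, assuming an R version older than 4.0.0 where data.frame() turned strings into factors by default, is that "fruit" is not an existing level of the Food factor, so assigning it produces NA with a warning. The as.character() calls in the answers above sidestep this. A small sketch of the effect, using a made-up three-element vector:

```r
fruit_list <- c("banana", "apple", "orange", "strawberry", "kiwi", "blueberry", "lemon")

# Food as a factor, as it would be with stringsAsFactors = TRUE
food <- factor(c("banana", "potato", "kiwi"))

# Assigning a value that is not a level gives NA (R also emits a warning)
broken <- food
suppressWarnings(broken[1] <- "fruit")
print(broken)

# Converting to character first makes the replacement work as intended
fixed <- as.character(food)
fixed[fixed %in% fruit_list] <- "fruit"
print(fixed)
```

Using %in% instead of grep() also avoids partial matches, for example "apple" matching inside "pineapple".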

"new drug user" design R

I want to establish a cohort of new users of drugs (Ray 2003). My original dataset is huge approx 19 million rows, so a loop is proving inefficient. Here is a dummy dataset (done with fruits instead of drugs):
df2
names dates age sex fruit
1 tom 2010-02-01 60 m apple
2 mary 2010-05-01 55 f orange
3 tom 2010-03-01 60 m banana
4 john 2010-07-01 57 m kiwi
5 mary 2010-07-01 55 f apple
6 tom 2010-06-01 60 m apple
7 john 2010-09-01 57 m apple
8 mary 2010-07-01 55 f orange
9 john 2010-11-01 57 m banana
10 mary 2010-09-01 55 f apple
11 tom 2010-08-01 60 m kiwi
12 mary 2010-11-01 55 f apple
13 john 2010-12-01 57 m orange
14 john 2011-01-01 57 m apple
I have identified people who were prescribed an apple between 04-2010 and 10-2010:
temp2
names dates age sex fruit
6 tom 2010-06-01 60 m apple
5 mary 2010-07-01 55 f apple
7 john 2010-09-01 57 m apple
I would like to make a new column in the original DF called "index", which is the first date that a person was prescribed a drug in the defined date range. This is what I have tried to get the dates from temp2 into df2$index:
df2$index<-temp2$dates
df2$index<-df2$dates == temp2$dates
df2$index<-df2$dates %in% temp2$dates
df2$index<-ifelse(as.Date(df$dates)==as.Date(temp2$dates), as.Date(temp2$dates),NA)
I'm not doing this right - as none of these work. This is the desired output.
df2
names dates age sex fruit index
1 tom 2010-02-01 60 m apple <NA>
2 mary 2010-05-01 55 f orange <NA>
3 tom 2010-03-01 60 m banana <NA>
4 john 2010-07-01 57 m kiwi <NA>
5 mary 2010-07-01 55 f apple 2010-07-01
6 tom 2010-06-01 60 m apple 2010-06-01
7 john 2010-09-01 57 m apple 2010-09-01
8 mary 2010-07-01 55 f orange <NA>
9 john 2010-11-01 57 m banana <NA>
10 mary 2010-09-01 55 f apple <NA>
11 tom 2010-08-01 60 m kiwi <NA>
12 mary 2010-11-01 55 f apple <NA>
13 john 2010-12-01 57 m orange <NA>
14 john 2011-01-01 57 m apple <NA>
Once I have the desired output, I want to trace back from the index date to see if any person had an apple in the previous 180 days. If they did not have an apple, I want to keep them. If they did have an apple (e.g., tom), I want to discard them. This is the code I have tried on the desired output:
df4<-df2[df2$fruit!='apple' & df2$index-180,]
df4<-df2[df2$fruit!='apple' & df2$dates<=df2$index-180,] ##neither work for me
I would appreciate any guidance at all on these questions, even a pointer to what I should read to learn how to do this. Perhaps my logic is flawed and my method won't work; please tell me if that's the case! Thank you in advance.
Here is my df:
names<-c("tom", "mary", "tom", "john", "mary",
"tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01",
"2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01",
"2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01",
"2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi",
"apple", "apple", "apple", "orange", "banana", "apple",
"kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m",
"f","m","f","m","f","m", "m"))
df2<-data.frame(names,dates, age, sex, fruit)
df2
Here is temp2:
data1<-df2[df2$fruit=="apple"& (df2$dates >= "2010-04-01" & df2$dates< "2010-10-01"), ]
index <- with(data1, order(dates))
temp<-data1[index, ]
dup<-duplicated(temp$names)
temp1<-cbind(temp,dup)
temp2<-temp1[temp1$dup!=TRUE,]
temp2$dup<-NULL
SOLUTION
df2 <- df2[with(df2, order(names, dates)), ]
df2$first.date <- ave(df2$dates, df2$names, df2$fruit,
FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1]) ##DWin's code: index date for each person and fruit within the date window
df2$x<-df2$fruit=='apple' & df2$dates>df2$first.date-180 & df2$dates<df2$first.date ##TRUE for an apple in the 180 days before the index date (e.g. tom, who is not a new user)
ids <- with(df2, unique(names[which(x)])) ##ids that have at least one TRUE row
new_users<-subset(df2, !names %in% ids) ##drops every id that has at least one TRUE row
First order by name and date:
df <- df[with(df, order(names, dates)), ]
Then just pick the first date within each name:
df$first.date <- ave(df$dates, df$names, FUN=function(dt) dt[1])
Now you will see "the power of the fully operational Death Star", er, the ave function. You are ready to pick out the first date within each combination of 'names' and 'fruit' that falls in the date range:
> df$first.date <- ave(df$date, df$name, df$fruit,
FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1] )
> df
names dates age sex fruit first.date
4 john 2010-07-01 57 m kiwi 2010-07-01
7 john 2010-09-01 57 m apple 2010-09-01
9 john 2010-11-01 57 m banana <NA>
13 john 2010-12-01 57 m orange <NA>
14 john 2011-01-01 57 m apple 2010-09-01
2 mary 2010-05-01 55 f orange 2010-05-01
5 mary 2010-07-01 55 f apple 2010-07-01
8 mary 2010-07-01 55 f orange 2010-05-01
10 mary 2010-09-01 55 f apple 2010-07-01
12 mary 2010-11-01 55 f apple 2010-07-01
1 tom 2010-02-01 60 m apple 2010-06-01
3 tom 2010-03-01 60 m banana <NA>
6 tom 2010-06-01 60 m apple 2010-06-01
11 tom 2010-08-01 60 m kiwi 2010-08-01
Since you have 19 million rows, I think you should try a data.table solution. Here is my attempt. The result is slightly different from @DWin's result, since I first filter the data to the range (begin, end) and then create a new index variable, which is the minimum date occurring in that range for each (names, fruit) pair. Rows outside the range are left as NA.
library(data.table)
DT <- data.table(df2,key=c('names','dates'))
DT[,dates := as.Date(dates)]
DT[between(dates,as.Date("2010-04-01"),as.Date("2010-10-31")),
index := as.character(min(dates))
, by=c('names','fruit')]
## names dates age sex fruit index
## 1: john 2010-07-01 57 m kiwi 2010-07-01
## 2: john 2010-09-01 57 m apple 2010-09-01
## 3: john 2010-11-01 57 m banana NA
## 4: john 2010-12-01 57 m orange NA
## 5: john 2011-01-01 57 m apple NA
## 6: mary 2010-05-01 55 f orange 2010-05-01
## 7: mary 2010-07-01 55 f apple 2010-07-01
## 8: mary 2010-07-01 55 f orange 2010-05-01
## 9: mary 2010-09-01 55 f apple 2010-07-01
## 10: mary 2010-11-01 55 f apple NA
## 11: tom 2010-02-01 60 m apple NA
## 12: tom 2010-03-01 60 m banana NA
## 13: tom 2010-06-01 60 m apple 2010-06-01
## 14: tom 2010-08-01 60 m kiwi 2010-08-01
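The answers above stop at the index date. As a rough sketch of the whole new-user selection in base R, under some assumptions: the window bounds follow the asker's temp2 code (2010-04-01 inclusive to 2010-10-01 exclusive), the index date is attached to every row of a person rather than only to the index row, and the names win_start, win_end, in_window, prior_apple and prevalent are made up for this example.

```r
df2 <- data.frame(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)

# Assumed study window, taken from the temp2 code in the question
win_start <- as.Date("2010-04-01")
win_end   <- as.Date("2010-10-01")

# Apple prescriptions inside the window
in_window <- df2$fruit == "apple" & df2$dates >= win_start & df2$dates < win_end

# Index date = earliest in-window apple per person.
# tapply() drops the Date class, so work on numbers and convert back.
index_num <- tapply(as.numeric(df2$dates[in_window]), df2$names[in_window], min)
df2$index <- as.Date(as.vector(index_num[df2$names]), origin = "1970-01-01")

# An apple in the 180 days before the index date marks a prevalent user
prior_apple <- df2$fruit == "apple" &
  df2$dates < df2$index & df2$dates >= df2$index - 180
prevalent <- unique(df2$names[which(prior_apple)])

# New users: have an index date and no apple in the look-back period
new_users <- df2[!is.na(df2$index) & !(df2$names %in% prevalent), ]
print(unique(new_users$names))
```

On the example data this drops tom, who had an apple on 2010-02-01, and keeps mary and john. For 19 million rows the same steps would probably be faster as grouped data.table operations, but the logic stays the same.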
