R - Find rows based on group factors

I'm trying to figure out a way to find specific values based on each factor within R. In other words, how can I keep all rows that satisfy a certain condition for each factor level, even if a specific row fails the condition itself but another row with the same factor level passes it?
So I have something like this:
gender values fruit
1 M 20 apple
2 M 22 pear
3 F 24 mango
4 F 19 mango
5 F 9 mango
6 F 17 apple
7 M 18 banana
8 M 22 banana
9 M 12 banana
10 M 14 mango
11 F 7 apple
12 F 8 apple
I want every fruit that has at least one F gender (even if that fruit also has some M's). It's also possible to have more than two genders, such as neutral (not shown). So my ideal output would be this:
gender values fruit
1 M 20 apple
3 F 24 mango
4 F 19 mango
5 F 9 mango
6 F 17 apple
10 M 14 mango
11 F 7 apple
12 F 8 apple
Notice that banana and pear are missing; that's because those fruits ONLY have M's and no F's. Also, rows 1 and 10 are still there even though they are M's, because other apple and mango rows have F's, so those fruits still qualify. Please let me know if this is possible. Thank you!
Below is my code for replicating this data:
gender <- c("M","M","F","F","F","F","M","M","M","M","F","F")
values <- c(20,22,24,19,9,17,18,22,12,14,7,8)
fruit <- c("apple","pear","mango","mango","mango","apple","banana","banana","banana","mango","apple","apple")
df <- data.frame(gender, values, fruit)
Here's what I've tried so far:
df[duplicated(df[,c("fruit","gender")]),]
ave(df$gender, df$fruit, FUN=function(x) ifelse(x=='F','yes','no'))
Also, third-party libraries are welcome, but I prefer to stay within base R (the stats and plyr packages are fine, as I already have those on my system).
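For what it's worth, the ave() attempt above can be completed into a working base R filter by asking, per fruit, whether any row is F. This is only a minimal sketch, using the df defined above:

# logical flag per row: does this row's fruit group contain at least one "F"?
keep <- as.logical(ave(df$gender == "F", df$fruit, FUN = any))
df[keep, ]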

df[df$fruit %in% unique(df[df$gender =='F', ]$fruit),]
# gender values fruit
#1 M 20 apple
#3 F 24 mango
#4 F 19 mango
#5 F 9 mango
#6 F 17 apple
#10 M 14 mango
#11 F 7 apple
#12 F 8 apple
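For readers unfamiliar with %in%, the one-liner above can be spelled out in two steps (same logic, just broken apart):

# fruits that have at least one F row
f_fruits <- unique(df$fruit[df$gender == "F"])
# keep every row whose fruit is in that set
df[df$fruit %in% f_fruits, ]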

Possible data.table approach
library(data.table)
setDT(df)[, if(any(gender == "F")) .SD, by = fruit]
# fruit gender values
# 1: apple M 20
# 2: apple F 17
# 3: apple F 7
# 4: apple F 8
# 5: mango F 24
# 6: mango F 19
# 7: mango F 9
# 8: mango M 14
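Note that grouping by fruit moves the fruit column to the front. If the original column order matters, it can be restored with setcolorder; a small sketch, assuming the result is saved in res:

res <- setDT(df)[, if (any(gender == "F")) .SD, by = fruit]
setcolorder(res, c("gender", "values", "fruit"))  # reorder columns by reference
res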
I like the other approach, so here's a data.table equivalent using a binary join:
setkey(setDT(df), fruit)[.(unique(df[gender == "F", fruit], by = "fruit"))]
# gender values fruit
# 1: F 17 apple
# 2: F 7 apple
# 3: F 8 apple
# 4: M 20 apple
# 5: F 24 mango
# 6: F 19 mango
# 7: F 9 mango
# 8: M 14 mango

The base R and data.table solutions are shown above; here is the dplyr equivalent, even though the outputs differ slightly (at least in the order of the results).
library(dplyr)
df %>% group_by(fruit) %>% filter(any(gender == "F"))
Source: local data frame [8 x 3]
Groups: fruit
gender values fruit
1 M 20 apple
2 F 24 mango
3 F 19 mango
4 F 9 mango
5 F 17 apple
6 M 14 mango
7 F 7 apple
8 F 8 apple
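With newer dplyr (1.1.0 or later), the same idea can be written with per-operation grouping, which also avoids returning a grouped result. A sketch, assuming dplyr >= 1.1.0 is installed:

# filter within each fruit group without an explicit group_by()
df %>% filter(any(gender == "F"), .by = fruit)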

Related

Need help categorizing data into 2 groups according to multiple columns in R?

I have 6 columns of numeric and non-numeric data in R as follows:
V1 V2 V3 V4 V5 V6
1 N abc M 'apple' 2 60
2 C pqr R 'banana' 3 70
3 N pqr M 'tomato' 1 50
4 D abc A 'cheese' 2 300
5 D uio R 'potato' 1 60
6 C xyz A 'milk' 5 200
7 N gef M 'milk' 6 500
8 D wvy A 'yogurt' 1 300
9 C abc A 'apple' 7 100
10 C abc R 'potato' 8 100
I want to group the data into 2 groups according to some characteristics using the 6 columns.
For example, the Grocery category: V1 = 'N' or V1 = 'C', and V3 = 'M', and V4 is 'apple', 'banana', 'potato', or 'tomato'; additionally, if V1 = 'N', V6 must be <= 100 ($), etc.
Milk Category = whatever does not belong to grocery.
How would I do it?
I tried using the case_when but it didn't work.
Here is an approach using case_when. I'm not sure what you tried that didn't work; please let me know if I can clarify further.
You can use %in% to check whether a particular letter or word is contained in a vector, as an alternative to chaining multiple "OR" conditions.
The final TRUE case is used only when none of the earlier conditions in the case_when statement evaluate to TRUE.
Edit: Added additional logic that would consider V6 in the event that V1 is "N".
library(dplyr)

df %>%
  mutate(category = case_when(
    (V1 == "C" | (V1 == "N" & V6 <= 100)) &
      V3 == "M" &
      V4 %in% c("apple", "banana", "potato", "tomato") ~ "grocery",
    TRUE ~ "milk"
  ))
Output
V1 V2 V3 V4 V5 V6 category
1 N abc M apple 2 60 grocery
2 C pqr R banana 3 70 milk
3 N pqr M tomato 1 50 grocery
4 D abc A cheese 2 300 milk
5 D uio R potato 1 60 milk
6 C xyz A milk 5 200 milk
7 N gef M milk 6 500 milk
8 D wvy A yogurt 1 300 milk
9 C abc A apple 7 100 milk
10 C abc R potato 8 100 milk
Probably not the nicest way, but I would use ifelse():
Sample code:
category <- ifelse((df$V1 == "N" | df$V1 == "C") &
                     (df$V4 == "apple" | df$V4 == "banana" | df$V4 == "potato" | df$V4 == "tomato"),
                   "Grocery", "Milk")
category <- as.data.frame(category)
ex <- cbind(category, df)
ex
Output:
category V1 V2 V3 V4 V5 V6
1 Grocery N abc M apple 2 60
2 Grocery C pqr R banana 3 70
3 Grocery N pqr M tomato 1 50
4 Milk D abc A cheese 2 300
5 Milk D uio R potato 1 60
6 Milk C xyz A milk 5 200
7 Milk N gef M milk 6 500
8 Milk D wvy A yogurt 1 300
9 Grocery C abc A apple 7 100
10 Grocery C abc R potato 8 100
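As the first answer noted, %in% can shorten the condition. Here is the same ifelse() logic written more compactly (a sketch with the same behaviour as above):

category <- ifelse(df$V1 %in% c("N", "C") &
                     df$V4 %in% c("apple", "banana", "potato", "tomato"),
                   "Grocery", "Milk")
ex <- cbind(category, df)
ex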

Subsetting dataframe based on unique values and other column data

I have a dataframe that has a series of ID characters (trt, individual, and session):
> trt<-c(rep("A",3),rep("B",3),rep("C",3),rep("A",3),rep("B",3),rep("C",3),rep("A",3),rep("B",3),rep("C",3))
individual<-rep(c("Bob","Nancy","Tim"),9)
session<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9)
data<-rnorm(27,mean=4,sd=1)
df<-as.data.frame(cbind(trt,individual,session,data))
df
trt individual session data
1 A Bob 1 4.36604594311893
2 A Nancy 1 3.29568979189961
3 A Tim 1 3.55849387209243
4 B Bob 2 5.41661201729216
5 B Nancy 2 4.7158873476798
6 B Tim 2 5.34401708530548
7 C Bob 3 4.54277206331273
8 C Nancy 3 3.53976115781019
9 C Tim 3 3.7954788384957
10 A Bob 4 4.75145309337952
11 A Nancy 4 4.7995601464568
12 A Tim 4 3.17821205815185
13 B Bob 5 3.62379779744325
14 B Nancy 5 4.07387328854209
15 B Tim 5 5.60156909861945
16 C Bob 6 4.06727142161431
17 C Nancy 6 4.59940289933985
18 C Tim 6 3.07543217234973
19 A Bob 7 2.63468285023662
20 A Nancy 7 3.22650587327078
21 A Tim 7 6.31062631711196
22 B Bob 8 4.69047076193906
23 B Nancy 8 4.79190101388308
24 B Tim 8 1.61906440409175
25 C Bob 9 2.85180524036416
26 C Nancy 9 3.43304058627408
27 C Tim 9 4.89263600498695
I am looking to create a new dataframe where I randomly pull one row for each trt x individual combination, under the constraint that each unique session number is selected only once.
This is what I want my dataframe to look like:
trt individual session data
2 A Nancy 1 3.29568979189961
4 B Bob 2 5.41661201729216
9 C Tim 3 3.7954788384957
10 A Bob 4 4.75145309337952
15 B Tim 5 5.60156909861945
17 C Nancy 6 4.59940289933985
21 A Tim 7 6.31062631711196
23 B Nancy 8 4.79190101388308
25 C Bob 9 2.85180524036416
I know how to randomly select one row from each trt x individual combination:
> setDT(df)
newdf<-df[, .SD[sample(.N, 1)] , by=.(trt, individual)]
newdf
trt individual session data
1: A Bob 4 4.75145309337952
2: A Nancy 1 3.29568979189961
3: A Tim 7 6.31062631711196
4: B Bob 8 4.69047076193906
5: B Nancy **2** 4.7158873476798
6: B Tim **2** 5.34401708530548
7: C Bob 6 4.06727142161431
8: C Nancy 9 3.43304058627408
9: C Tim 3 3.7954788384957
But I don't know how to restrict the draws so that each session is selected only once (i.e., not allow duplicate sessions like those highlighted above).
Thanks in advance for your help!
This will need to iterate through the data.table and might not be quick, but it doesn't require setting any parameters for the fields of interest:
library(data.table)
set.seed(7)
setDT(df)
dt1 <- df[, .SD[sample(.N)]]
dt1[, i := .I]
dt1[, flag := NA]
setkey(dt1, flag)
lapply(dt1$i, function(x) {
  dt1[is.na(flag[x]) & (trt == trt[x] & individual == individual[x] | session == session[x]),
      flag := i == x]
})
dt1[flag == TRUE, ]
trt individual session data i flag
1: C Tim 9 3.63712332100071 1 TRUE
2: A Nancy 4 4.54908662150973 2 TRUE
3: A Tim 1 5.84217708521442 3 TRUE
4: B Tim 2 2.37343483362789 5 TRUE
5: C Nancy 3 2.87792051390258 7 TRUE
6: A Bob 7 3.45471592963754 12 TRUE
7: B Nancy 8 4.54792567807183 15 TRUE
8: C Bob 6 4.45667777212948 24 TRUE
9: B Bob 5 2.33285598638319 27 TRUE
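If the iterative flagging above feels heavy, a brute-force alternative is to simply redraw until the sampled sessions happen to be all distinct. This is only a sketch (it assumes a valid draw exists, which it does for this balanced design):

library(data.table)
set.seed(7)
setDT(df)
repeat {
  pick <- df[, .SD[sample(.N, 1)], by = .(trt, individual)]  # one row per trt x individual
  if (!anyDuplicated(pick$session)) break                    # stop once every session is distinct
}
pick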

Concatenate two datasets in R

I have two datasets, ANIMAL and PLANT:
ANIMAL
OBS Common Animal Number
1   a      Ant     5
2   b      Bird    .
3   c      Cat    17
4   d      Dog     9
5   e      Eagle   .
6   f      Frog   76

PLANT
OBS Common Plant    Number
1   g      Grape       69
2   h      Hazelnut    55
3   i      Indigo       .
4   j      Jicama      14
5   k      Kale         5
6   l      Lentil      77
I want to concatenate these two into a new dataset.
Below is the desired output
Obs Common Animal Plant Number
1 a Ant 5
2 b Bird .
3 c Cat 17
4 d Dog 9
5 e Eagle .
6 f Frog 76
7 g Grape 69
8 h Hazelnut 55
9 i Indigo .
10 j Jicama 14
11 k Kale 5
12 l Lentil 77
How can I do this kind of concatenation in R?
rbind() will not work directly because of the differing column names.
Something like this will work for the given example:
rbind_ <- function(data1, data2) {
  nms1 <- names(data1)
  nms2 <- names(data2)
  if (identical(nms1, nms2)) {
    out <- rbind(data1, data2)
  } else {
    # add the columns missing from each data frame, filled with NA
    data1[nms2[!nms2 %in% nms1]] <- NA
    data2[nms1[!nms1 %in% nms2]] <- NA
    out <- rbind(data1, data2)
  }
  return(out)
}
rbind_(animal, plant)
OBS Common Animal Number Plant
1 1 a Ant 5 <NA>
2 2 b Bird NA <NA>
3 3 c Cat 17 <NA>
4 4 d Dog 9 <NA>
5 5 e Eagle NA <NA>
6 6 f Frog 76 <NA>
7 1 g <NA> 69 Grape
8 2 h <NA> 55 Hazelnut
9 3 i <NA> NA Indigo
10 4 j <NA> 14 Jicama
11 5 k <NA> 5 Kale
12 6 l <NA> 77 Lentil
But it would require a bit of tweaking to work in all cases, I think.
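For ready-made alternatives, both data.table and dplyr can bind data frames with differing column names, filling the missing columns with NA. A sketch, assuming the two data frames are called animal and plant as above:

# data.table
library(data.table)
rbindlist(list(animal, plant), fill = TRUE)

# dplyr
library(dplyr)
bind_rows(animal, plant)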
This should give you the desired output:
PLANT$OBS <- PLANT$OBS + nrow(ANIMAL)
ANIMAL$Plant <- ''
PLANT$Animal <- ''
Final_DF <- rbind(ANIMAL, PLANT)

How to sum over diagonals of data frame

Say that I have this data frame:
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2
In this above data frame, the values indicate counts of how many observations take on (100, 1), (99, 1), etc.
In my context, the diagonals have the same meanings:
1 2 3 4
100 A B C D
99 B C D E
98 C D E F
97 D E F G
How would I sum across the diagonals (i.e., sum the counts of the like letters) in the first data frame?
This would produce:
group sum
A 8
B 13
C 13
D 28
E 10
F 18
G 2
For example, D is 5+5+4+14
You can use row() and col() to identify row/column relationships.
m <- read.table(text="
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2")
vals <- sapply(2:8,
function(j) sum(m[row(m)+col(m)==j]))
or (as suggested in comments by @thelatemail)
vals <- sapply(split(as.matrix(m), row(m) + col(m)), sum)
data.frame(group=LETTERS[seq_along(vals)],sum=vals)
or (@Frank)
data.frame(vals = tapply(as.matrix(m),
(LETTERS[row(m) + col(m)-1]), sum))
as.matrix() is required to make split() work correctly ...
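To see why this works, note that every cell on the same anti-diagonal shares the same value of row index plus column index. A quick illustration, using the m read in above:

row(m) + col(m)
#      [,1] [,2] [,3] [,4]
# [1,]    2    3    4    5
# [2,]    3    4    5    6
# [3,]    4    5    6    7
# [4,]    5    6    7    8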
Another aggregate variation, avoiding the formula interface, which actually complicates matters in this instance (here dat is the count data frame, i.e. the same data as m above or df1 below):
aggregate(list(Sum=unlist(dat)), list(Group=LETTERS[c(row(dat) + col(dat))-1]), FUN=sum)
# Group Sum
#1 A 8
#2 B 13
#3 C 13
#4 D 28
#5 E 10
#6 F 18
#7 G 2
Another solution using bgoldst's definitions of df1 and df2 (given in the next answer):
sapply(unique(c(as.matrix(df2))),
function(x) sum(df1[df2 == x]))
Gives
#A B C D E F G
#8 13 13 28 10 18 2
(Not quite the format that you wanted, but maybe it's ok...)
Here's a solution using stack() and aggregate(), although it requires that the second data.frame contain character vectors rather than factors (this could be forced with lapply(df2, as.character)):
df1 <- data.frame(a=c(8,1,2,5), b=c(12,6,5,3), c=c(5,4,4,7), d=c(14,3,11,2) );
df2 <- data.frame(a=c('A','B','C','D'), b=c('B','C','D','E'), c=c('C','D','E','F'), d=c('D','E','F','G'), stringsAsFactors=F );
aggregate(sum~group,data.frame(sum=stack(df1)[,1],group=stack(df2)[,1]),sum);
## group sum
## 1 A 8
## 2 B 13
## 3 C 13
## 4 D 28
## 5 E 10
## 6 F 18
## 7 G 2

Remove duplicates column combinations from a dataframe in R

I want to remove duplicate combinations of sessionid, qf and qn from the following data
sessionid qf qn city
1 9cf571c8faa67cad2aa9ff41f3a26e38 cat biddix fresno
2 e30f853d4e54604fd62858badb68113a caleb amos
3 2ad41134cc285bcc06892fd68a471cd7 daniel folkers
4 2ad41134cc285bcc06892fd68a471cd7 daniel folkers
5 63a5e839510a647c1ff3b8aed684c2a5 charles pierce flint
6 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
7 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
8 b3a1476aa37ae4b799495256324a8d3d carrie mascorro brea
9 bd9f1404b313415e7e7b8769376d2705 fred morales las+vegas
10 b50a610292803dc302f24ae507ea853a aurora lee
11 fb74940e6feb0dc61a1b4d09fcbbcb37 andrew price yorkville
I read in the data as a data.frame and call it mydata. Here is the code I have so far, but I need to know how to: first, sort the data.frame correctly; second, remove the duplicate combinations of sessionid, qf, and qn; and lastly, plot a histogram of the number of characters in the qf column.
sortDATA <- function(name)
{
  # sort the data by session Id, first name, then last name
  sort1.name <- name[order("sessionid", "qf", "qn"), ]
  # create a vector of the lengths of first names
  sname <- nchar(sort1.name$qf)
  hist(sname)
}
thanks!
duplicated() has a method for data.frames, which is designed for just this sort of task:
df <- data.frame(a = c(1:4, 1:4),
                 b = c(4:1, 4:1),
                 d = LETTERS[1:8])
df[!duplicated(df[c("a", "b")]), ]
# a b d
# 1 1 4 A
# 2 2 3 B
# 3 3 2 C
# 4 4 1 D
In your example the repeated rows were entirely repeated. unique works with data.frames.
udf <- unique( my.data.frame )
As for sorting... joran just posted the answer.
To address your sorting problems, first reading in your example data:
dat <- read.table(text = " sessionid qf qn city
1 9cf571c8faa67cad2aa9ff41f3a26e38 cat biddix fresno
2 e30f853d4e54604fd62858badb68113a caleb amos NA
3 2ad41134cc285bcc06892fd68a471cd7 daniel folkers NA
4 2ad41134cc285bcc06892fd68a471cd7 daniel folkers NA
5 63a5e839510a647c1ff3b8aed684c2a5 charles pierce flint
6 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
7 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
8 b3a1476aa37ae4b799495256324a8d3d carrie mascorro brea
9 bd9f1404b313415e7e7b8769376d2705 fred morales las+vegas
10 b50a610292803dc302f24ae507ea853a aurora lee NA
11 fb74940e6feb0dc61a1b4d09fcbbcb37 andrew price yorkville ",sep = "",header = TRUE)
and then you can use arrange from plyr,
library(plyr)
arrange(dat, sessionid, qf, qn)
or using base functions,
with(dat,dat[order(sessionid,qf,qn),])
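A data.table alternative for the sorting step, in case that package is preferred (a sketch, not from the original answers; setorder sorts the table by reference):

library(data.table)
setDT(dat)
setorder(dat, sessionid, qf, qn)
dat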
If you instead want to drop every row whose c/d combination occurs more than once (rather than keeping the first occurrence), it works if you use duplicated twice:
> df
a b c d
1 1 2 A 1001
2 2 4 B 1002
3 3 6 B 1002
4 4 8 C 1003
5 5 10 D 1004
6 6 12 D 1004
7 7 13 E 1005
8 8 14 E 1006
> df[!(duplicated(df[c("c","d")]) | duplicated(df[c("c","d")], fromLast = TRUE)), ]
a b c d
1 1 2 A 1001
4 4 8 C 1003
7 7 13 E 1005
8 8 14 E 1006
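Tying the original question together, here is a dplyr sketch (assuming the data frame is called mydata, as in the question) that sorts, keeps the first row of each sessionid/qf/qn combination, and plots a histogram of first-name lengths:

library(dplyr)
cleaned <- mydata %>%
  arrange(sessionid, qf, qn) %>%
  distinct(sessionid, qf, qn, .keep_all = TRUE)  # keep first row per combination
hist(nchar(cleaned$qf))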
