Remove duplicate column combinations from a data frame in R

I want to remove duplicate combinations of sessionid, qf and qn from the following data
sessionid qf qn city
1 9cf571c8faa67cad2aa9ff41f3a26e38 cat biddix fresno
2 e30f853d4e54604fd62858badb68113a caleb amos
3 2ad41134cc285bcc06892fd68a471cd7 daniel folkers
4 2ad41134cc285bcc06892fd68a471cd7 daniel folkers
5 63a5e839510a647c1ff3b8aed684c2a5 charles pierce flint
6 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
7 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
8 b3a1476aa37ae4b799495256324a8d3d carrie mascorro brea
9 bd9f1404b313415e7e7b8769376d2705 fred morales las+vegas
10 b50a610292803dc302f24ae507ea853a aurora lee
11 fb74940e6feb0dc61a1b4d09fcbbcb37 andrew price yorkville
I read in the data as a data.frame and call it mydata. Here is the code I have so far, but I need to know how to: first, sort the data.frame correctly; second, remove the duplicate combinations of sessionid, qf, and qn; and lastly, plot a histogram of the number of characters in the qf column.
sortDATA<-function(name)
{
#sort the code by session Id, first name, then last name
sort1.name <- name[order("sessionid","qf","qn") , ]
#create a vector of length of first names
sname<-nchar(sort1.name$qf)
hist(sname)
}
thanks!

duplicated() has a method for data.frames, which is designed for just this sort of task:
df <- data.frame(a = c(1:4, 1:4),
b = c(4:1, 4:1),
d = LETTERS[1:8])
df[!duplicated(df[c("a", "b")]),]
# a b d
# 1 1 4 A
# 2 2 3 B
# 3 3 2 C
# 4 4 1 D
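Applied to the question's columns, a minimal sketch (using a shortened, hypothetical stand-in for mydata, since the full data isn't reproduced here):

```r
# Hypothetical subset of the question's data, for illustration only
mydata <- data.frame(
  sessionid = c("9cf571", "2ad411", "2ad411", "691df4"),
  qf = c("cat", "daniel", "daniel", "j"),
  qn = c("biddix", "folkers", "folkers", "franz"),
  city = c("fresno", NA, NA, "prescott+valley")
)

# Keep only the first occurrence of each sessionid/qf/qn combination
deduped <- mydata[!duplicated(mydata[c("sessionid", "qf", "qn")]), ]
nrow(deduped)  # the duplicated row is dropped, leaving 3 rows
```

From there, `hist(nchar(deduped$qf))` gives the histogram of first-name lengths the question asks for.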

In your example the repeated rows are repeated in their entirety, and unique works directly on data.frames:
udf <- unique( my.data.frame )
As for sorting... joran just posted the answer.

To address your sorting problems, first reading in your example data:
dat <- read.table(text = " sessionid qf qn city
1 9cf571c8faa67cad2aa9ff41f3a26e38 cat biddix fresno
2 e30f853d4e54604fd62858badb68113a caleb amos NA
3 2ad41134cc285bcc06892fd68a471cd7 daniel folkers NA
4 2ad41134cc285bcc06892fd68a471cd7 daniel folkers NA
5 63a5e839510a647c1ff3b8aed684c2a5 charles pierce flint
6 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
7 691df47f2df12f14f000f9a17d1cc40e j franz prescott+valley
8 b3a1476aa37ae4b799495256324a8d3d carrie mascorro brea
9 bd9f1404b313415e7e7b8769376d2705 fred morales las+vegas
10 b50a610292803dc302f24ae507ea853a aurora lee NA
11 fb74940e6feb0dc61a1b4d09fcbbcb37 andrew price yorkville ",sep = "",header = TRUE)
and then you can use arrange from plyr,
library(plyr)
arrange(dat, sessionid, qf, qn)
or using base functions,
with(dat,dat[order(sessionid,qf,qn),])

It works if you use duplicated twice:
> df
a b c d
1 1 2 A 1001
2 2 4 B 1002
3 3 6 B 1002
4 4 8 C 1003
5 5 10 D 1004
6 6 12 D 1004
7 7 13 E 1005
8 8 14 E 1006
> df[!(duplicated(df[c("c","d")]) | duplicated(df[c("c","d")], fromLast = TRUE)), ]
a b c d
1 1 2 A 1001
4 4 8 C 1003
7 7 13 E 1005
8 8 14 E 1006

Related

Subsetting dataframe based on unique values and other column data

I have a dataframe that has a series of ID characters (trt, individual, and session):
> trt<-c(rep("A",3),rep("B",3),rep("C",3),rep("A",3),rep("B",3),rep("C",3),rep("A",3),rep("B",3),rep("C",3))
individual<-rep(c("Bob","Nancy","Tim"),9)
session<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9)
data<-rnorm(27,mean=4,sd=1)
df<-as.data.frame(cbind(trt,individual,session,data))
df
trt individual session data
1 A Bob 1 4.36604594311893
2 A Nancy 1 3.29568979189961
3 A Tim 1 3.55849387209243
4 B Bob 2 5.41661201729216
5 B Nancy 2 4.7158873476798
6 B Tim 2 5.34401708530548
7 C Bob 3 4.54277206331273
8 C Nancy 3 3.53976115781019
9 C Tim 3 3.7954788384957
10 A Bob 4 4.75145309337952
11 A Nancy 4 4.7995601464568
12 A Tim 4 3.17821205815185
13 B Bob 5 3.62379779744325
14 B Nancy 5 4.07387328854209
15 B Tim 5 5.60156909861945
16 C Bob 6 4.06727142161431
17 C Nancy 6 4.59940289933985
18 C Tim 6 3.07543217234973
19 A Bob 7 2.63468285023662
20 A Nancy 7 3.22650587327078
21 A Tim 7 6.31062631711196
22 B Bob 8 4.69047076193906
23 B Nancy 8 4.79190101388308
24 B Tim 8 1.61906440409175
25 C Bob 9 2.85180524036416
26 C Nancy 9 3.43304058627408
27 C Tim 9 4.89263600498695
I am looking to create a new dataframe where I have randomly pulled each trt x individual combination, under the constraint that each unique session number is selected only once.
This is what I want my dataframe to look like:
trt individual session data
2 A Nancy 1 3.29568979189961
4 B Bob 2 5.41661201729216
9 C Tim 3 3.7954788384957
10 A Bob 4 4.75145309337952
15 B Tim 5 5.60156909861945
17 C Nancy 6 4.59940289933985
21 A Tim 7 6.31062631711196
23 B Nancy 8 4.79190101388308
25 C Bob 9 2.85180524036416
I know how to randomly select one row for each trt x individual combination:
> setDT(df)
newdf<-df[, .SD[sample(.N, 1)] , by=.(trt, individual)]
newdf
trt individual session data
1: A Bob 4 4.75145309337952
2: A Nancy 1 3.29568979189961
3: A Tim 7 6.31062631711196
4: B Bob 8 4.69047076193906
5: B Nancy **2** 4.7158873476798
6: B Tim **2** 5.34401708530548
7: C Bob 6 4.06727142161431
8: C Nancy 9 3.43304058627408
9: C Tim 3 3.7954788384957
But I don't know how to restrict the pulls so that each session is selected only once (i.e., not allow duplicates like those above).
Thanks in advance for your help!
This will need to iterate through the data.table and might not be quick, but it doesn't require setting any parameters for the fields of interest
library(data.table)
set.seed(7)
setDT(df)
dt1 <- df[, .SD[sample(.N)]]
dt1[, i := .I]
dt1[, flag := NA]
setkey(dt1, flag)
lapply(dt1$i, function(x) {
dt1[is.na(flag[x]) & (trt == trt[x] & individual == individual[x] | session == session[x]), flag := i == x]
})
dt1[flag == TRUE, ]
trt individual session data i flag
1: C Tim 9 3.63712332100071 1 TRUE
2: A Nancy 4 4.54908662150973 2 TRUE
3: A Tim 1 5.84217708521442 3 TRUE
4: B Tim 2 2.37343483362789 5 TRUE
5: C Nancy 3 2.87792051390258 7 TRUE
6: A Bob 7 3.45471592963754 12 TRUE
7: B Nancy 8 4.54792567807183 15 TRUE
8: C Bob 6 4.45667777212948 24 TRUE
9: B Bob 5 2.33285598638319 27 TRUE
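An alternative sketch (my own, not from the original answer): since a valid assignment exists for this structure, one can simply redraw until the sampled sessions are all distinct. This rejection-sampling approach assumes the number of trt x individual combinations equals the number of distinct sessions, and it may need many redraws:

```r
library(data.table)

# Rebuild the question's data (values are random; the structure is what matters)
trt <- rep(rep(c("A", "B", "C"), each = 3), 3)
individual <- rep(c("Bob", "Nancy", "Tim"), 9)
session <- rep(1:9, each = 3)
set.seed(7)
df <- data.frame(trt, individual, session, data = rnorm(27, mean = 4))
setDT(df)

# Redraw one random row per trt x individual until no session repeats
repeat {
  cand <- df[, .SD[sample(.N, 1)], by = .(trt, individual)]
  if (!anyDuplicated(cand$session)) break
}
cand
```

This is simple but not guaranteed fast; the accepted iterate-and-flag approach above avoids the repeated draws.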

Create new group based on cumulative sum and group

I am looking to create a new group based on two conditions. I want all of the cases, within each person, to be grouped together until the cumulative sum of Value reaches 10. I have managed to get it to work for each of the conditions separately using for loops and dplyr, but not together, and I need both conditions applied. Below is what I would like the data to look like (I don't need a RunningSum_Value column, but I kept it in for clarification). Ideally I would like a dplyr solution, but I'm not picky. Thank you in advance!
ID Value RunningSum_Value Group
PersonA 1 1 1
PersonA 3 4 1
PersonA 10 14 1
PersonA 3 3 2
PersonB 11 11 3
PersonB 12 12 4
PersonC 3 3 5
PersonD 4 4 6
PersonD 9 13 6
PersonD 5 5 7
PersonD 11 16 7
PersonD 6 6 8
PersonD 1 7 8
Here is my data:
df <- read.table(text="ID Value
PersonA 1
PersonA 3
PersonA 10
PersonA 3
PersonB 11
PersonB 12
PersonC 3
PersonD 4
PersonD 9
PersonD 5
PersonD 11
PersonD 6
PersonD 1", header=TRUE,stringsAsFactors=FALSE)
Define a function sum0 that accumulates a running sum, except that each time the sum reaches 10 or more it resets to 0. Define a function is_start that returns TRUE at the start position of a group and FALSE otherwise. Finally, apply is_start to each ID group using ave, then take the cumsum of the result to get the group numbers.
sum0 <- function(x, y) { if (x + y >= 10) 0 else x + y }
is_start <- function(x) head(c(TRUE, Reduce(sum0, init=0, x, acc = TRUE)[-1] == 0), -1)
cumsum(ave(df$Value, df$ID, FUN = is_start))
## [1] 1 1 1 2 3 4 5 6 6 7 7 8 8
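Since the question asks for a dplyr solution, here is one possible sketch (my own, not from the answer above): compute a within-person group index with a small helper, then number the groups globally by counting the boundaries.

```r
library(dplyr)

df <- read.table(text = "ID Value
PersonA 1
PersonA 3
PersonA 10
PersonA 3
PersonB 11
PersonB 12
PersonC 3
PersonD 4
PersonD 9
PersonD 5
PersonD 11
PersonD 6
PersonD 1", header = TRUE, stringsAsFactors = FALSE)

# Within-person group index that advances once the running sum hits 10 or more
assign_groups <- function(x) {
  g <- integer(length(x)); run <- 0; grp <- 1L
  for (i in seq_along(x)) {
    g[i] <- grp
    run <- run + x[i]
    if (run >= 10) { run <- 0; grp <- grp + 1L }
  }
  g
}

result <- df %>%
  group_by(ID) %>%
  mutate(local = assign_groups(Value)) %>%
  ungroup() %>%
  mutate(Group = 1 + cumsum(local != lag(local, default = first(local)) |
                            ID != lag(ID, default = first(ID))))
result$Group
# [1] 1 1 1 2 3 4 5 6 6 7 7 8 8
```

The final cumsum starts a new global group whenever either the person changes or the within-person index changes, matching the ave/Reduce answer's output.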

Concatenate two datasets in r

I have two datasets animal and plants
ANIMAL
OBS Common Animal Number
1   a      Ant         5
2   b      Bird        .
3   c      Cat        17
4   d      Dog         9
5   e      Eagle       .
6   f      Frog       76

PLANT
OBS Common Plant    Number
1   g      Grape        69
2   h      Hazelnut     55
3   i      Indigo        .
4   j      Jicama       14
5   k      Kale          5
6   l      Lentil       77
I want to concatenate these two into a new dataset.
Below is the desired output
Obs Common Animal Plant    Number
1   a      Ant    .             5
2   b      Bird   .             .
3   c      Cat    .            17
4   d      Dog    .             9
5   e      Eagle  .             .
6   f      Frog   .            76
7   g      .      Grape        69
8   h      .      Hazelnut     55
9   i      .      Indigo        .
10  j      .      Jicama       14
11  k      .      Kale          5
12  l      .      Lentil       77
How can I do this kind of concatenation in R?
rbind() will not work because of the differing column names.
Something like this will work for the given example:
rbind_ <- function(data1, data2) {
nms1 <- names(data1)
nms2 <- names(data2)
if(identical(nms1, nms2)) {
out <- rbind(data1, data2)
} else {
data1[nms2[!nms2%in%nms1]] <- NA
data2[nms1[!nms1%in%nms2]] <- NA
out <- rbind(data1, data2)
}
return(out)
}
rbind_(animal, plant)
OBS Common Animal Number Plant
1 1 a Ant 5 <NA>
2 2 b Bird NA <NA>
3 3 c Cat 17 <NA>
4 4 d Dog 9 <NA>
5 5 e Eagle NA <NA>
6 6 f Frog 76 <NA>
7 1 g <NA> 69 Grape
8 2 h <NA> 55 Hazelnut
9 3 i <NA> NA Indigo
10 4 j <NA> 14 Jicama
11 5 k <NA> 5 Kale
12 6 l <NA> 77 Lentil
But it would require a bit of tweaking to work in all cases, I think.
This should give you the desired output:
PLANT$OBS <- PLANT$OBS + nrow(ANIMAL)
ANIMAL$Plant <- ''
PLANT$Animal <- ''
Final_DF <- rbind(ANIMAL, PLANT)
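For reference, dplyr's bind_rows (and similarly data.table's rbindlist with fill = TRUE) handles mismatched column sets automatically, filling the absent columns with NA. A sketch with tiny stand-in versions of the two datasets:

```r
library(dplyr)

animal <- data.frame(OBS = 1:2, Common = c("a", "b"),
                     Animal = c("Ant", "Bird"), Number = c(5, NA))
plant  <- data.frame(OBS = 1:2, Common = c("g", "h"),
                     Plant = c("Grape", "Hazelnut"), Number = c(69, 55))

# Columns absent from one input are added and filled with NA
combined <- bind_rows(animal, plant)
names(combined)  # OBS, Common, Animal, Number, Plant
```

This sidesteps the manual column-padding entirely, at the cost of a package dependency.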

Append sequence number to data frame based on grouping field and date field

I am attempting to append a sequence number to a data frame grouped by individuals and date. For example, to turn this:
x y
1 A 2012-01-02
2 A 2012-02-03
3 A 2012-02-25
4 A 2012-03-04
5 B 2012-01-02
6 B 2012-02-03
7 C 2013-01-02
8 C 2012-02-03
9 C 2012-03-04
10 C 2012-04-05
in to this:
x y v
1 A 2012-01-02 1
2 A 2012-02-03 2
3 A 2012-02-25 3
4 A 2012-03-04 4
5 B 2012-01-02 1
6 B 2012-02-03 2
7 C 2013-01-02 1
8 C 2012-02-03 2
9 C 2012-03-04 3
10 C 2012-04-05 4
where "x" is the individual, "y" is the date, and "v" is the appended sequence number
I have had success on a small data frame using a for loop in this code:
x=c("A","A","A","A","B","B","C","C","C","C")
y=as.Date(c("1/2/2012","2/3/2012","2/25/2012","3/4/2012","1/2/2012","2/3/2012",
"1/2/2013","2/3/2012","3/4/2012","4/5/2012"),"%m/%d/%Y")
x
y
z=data.frame(x,y)
z$v=rep(1,nrow(z))
for(i in 2:nrow(z)){
if(z$x[i]==z$x[i-1]){
z$v[i]=(z$v[i-1]+1)
} else {
z$v[i]=1
}
}
but when I expand this to a much larger data frame (250K+ rows) the process takes forever.
Any thoughts on how I can make this more efficient?
This seems to work. May be overkill though.
## code needed revision - this is old code
## > d$v <- unlist(sapply(sapply(split(d, d$x), nrow), seq))
EDIT
I can't believe I got away with that ugly mess for so long. Here's a revision. Much simpler.
## revised 04/24/2014
> d$v <- unlist(sapply(table(d$x), seq))
> d
## x y v
## 1 A 2012-01-02 1
## 2 A 2012-02-03 2
## 3 A 2012-02-25 3
## 4 A 2012-03-04 4
## 5 B 2012-01-02 1
## 6 B 2012-02-03 2
## 7 C 2013-01-02 1
## 8 C 2012-02-03 2
## 9 C 2012-03-04 3
## 10 C 2012-04-05 4
Also, an interesting one is stack. Take a look.
> stack(sapply(table(d$x), seq))
## values ind
## 1 1 A
## 2 2 A
## 3 3 A
## 4 4 A
## 5 1 B
## 6 2 B
## 7 1 C
## 8 2 C
## 9 3 C
## 10 4 C
I'm removing my previous post and replacing it with this solution. Extremely efficient for my purposes.
# order the data
z <- z[order(z$x, z$y), ]
# convert to a data.table
library(data.table)
dt.z <- data.table(z)
# obtain the vector of sequence numbers (row index within each x group)
z$seq <- dt.z[, 1:.N, by = "x"]$V1
The above can be accomplished in fewer steps but I wanted to illustrate what I did. This is appending sequence numbers to my data sets of over 250k records in under a second. Thanks again to Henrik and Richard.
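For completeness, base R's ave can do the same within-group numbering in one line, with no data.table dependency (a sketch assuming z is sorted by x and y, as in the question):

```r
# z as built in the question: individual x and date y
x <- c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C")
y <- as.Date(c("1/2/2012", "2/3/2012", "2/25/2012", "3/4/2012", "1/2/2012",
               "2/3/2012", "1/2/2013", "2/3/2012", "3/4/2012", "4/5/2012"),
             "%m/%d/%Y")
z <- data.frame(x, y)

# per-group running index: seq_along restarts at 1 for each level of x
z$v <- ave(seq_len(nrow(z)), z$x, FUN = seq_along)
z$v  # 1 2 3 4 1 2 1 2 1 2 -> actually 1 2 3 4 1 2 1 2 3 4
```

This is vectorized per group, so it should also scale well past 250K rows.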

Create rows in dataframe for variable that was observed but not explicitly recorded by factor

I created a variable describing a species' group as domestic, wild, or exotic, based on a dataframe where each row represents a species found at a given site (siteID). I want to insert rows into my dataframe, for each siteID, to report 0 for any group that was not observed at that site. In other words, this is the dataframe I have:
df.start <- data.frame(species = c("dog","deer","toucan","dog","deer","toucan"),
siteID = c("a","b","b","c","c","c"),
group = c("domestic", "wild", "exotic", "domestic", "wild", "exotic"),
value = c(2:7))
df.start
# species siteID group value
# 1 dog a domestic 2
# 2 deer b wild 3
# 3 toucan b exotic 4
# 4 dog c domestic 5
# 5 deer c wild 6
# 6 toucan c exotic 7
This is the data frame I want:
df.end <-data.frame(species=c("dog","NA","NA","NA","deer",
"toucan","dog","deer","toucan"),
siteID = c("a","a","a","b","b","b","c","c","c"),
group = rep(c("domestic", "wild", "exotic"),3),
value = c(2,0,0,0,3,4,5,6,7))
df.end
# species siteID group value
# 1 dog a domestic 2
# 2 NA a wild 0
# 3 NA a exotic 0
# 4 NA b domestic 0
# 5 deer b wild 3
# 6 toucan b exotic 4
# 7 dog c domestic 5
# 8 deer c wild 6
# 9 toucan c exotic 7
This came up because I wanted to use a plyr function to summarize the mean values by group, and I realized the zeros were missing for some group-site combinations, inflating my estimate. Maybe I'm missing a more obvious workaround?
Using base R functions:
result <- merge(
with(df.start, expand.grid(siteID=unique(siteID),group=unique(group))),
df.start,
by=c("siteID","group"),
all.x=TRUE
)
result$value[is.na(result$value)] <- 0
> result
siteID group species value
1 a domestic dog 2
2 a exotic <NA> 0
3 a wild <NA> 0
4 b domestic <NA> 0
5 b exotic toucan 4
6 b wild deer 3
7 c domestic dog 5
8 c exotic toucan 7
9 c wild deer 6
df.sg <- data.frame(xtabs(value~siteID+group, data=df.start))
merge(df.start[-4], df.sg, by=c("siteID", "group"), all.y=TRUE)
#-------------
siteID group species Freq
1 a domestic dog 2
2 a exotic <NA> 0
3 a wild <NA> 0
4 b domestic <NA> 0
5 b exotic toucan 4
6 b wild deer 3
7 c domestic dog 5
8 c exotic toucan 7
9 c wild deer 6
The xtabs function returns a table, which lets the as.data.frame.table method work on it. Very handy.
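As a side note (not from the answers above), tidyr's complete expresses the same idea directly: it expands the missing siteID/group combinations and lets you choose the fill value:

```r
library(tidyr)

df.start <- data.frame(species = c("dog", "deer", "toucan", "dog", "deer", "toucan"),
                       siteID = c("a", "b", "b", "c", "c", "c"),
                       group = c("domestic", "wild", "exotic", "domestic", "wild", "exotic"),
                       value = 2:7)

# One row per siteID x group; unobserved combinations get value = 0
df.full <- complete(df.start, siteID, group, fill = list(value = 0))
nrow(df.full)  # 3 sites x 3 groups = 9 rows
```

Columns not mentioned in fill (here, species) are left as NA for the inserted rows, matching the desired output.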
