Use unique rows from data.frame to subset another data.frame - r

I have a data.frame v that I would like to use the unique rows from
#v
DAY MONTH YEAR
1 1 1 2000
2 1 1 2000
3 2 2 2000
4 2 2 2000
5 2 3 2001
to subset a data.frame w.
# w
DAY MONTH YEAR V1 V2 V3
1 1 1 2000 1 2 3
2 1 1 2000 3 2 1
3 2 2 2000 2 3 1
4 2 2 2001 1 2 3
5 3 4 2001 3 2 1
The result is data.frame vw, in which only the rows of w that match the unique rows of v (on DAY, MONTH, YEAR) remain.
# vw
DAY MONTH YEAR V1 V2 V3
1 1 1 2000 1 2 3
2 2 2 2000 2 3 1
Right now I am using the code below, where I merge the data.frames and then use ddply to pick only the unique/first instance of a row. This works, but will become cumbersome if I have to include V1=x$V1[1], etc. for all of my variables in the ddply part of the code. Is there a way to use the first instance of (DAY, MONTH, YEAR) and the rest of the columns on that row?
Or, is there another way to approach the problem of using unique rows from one data.frame to subset another data.frame?
v <- structure(list(DAY = c(1L, 1L, 2L, 2L, 2L), MONTH = c(1L, 1L,
2L, 2L, 3L), YEAR = c(2000L, 2000L, 2000L, 2000L, 2001L)), .Names = c("DAY",
"MONTH", "YEAR"), class = "data.frame", row.names = c(NA, -5L
))
w <- structure(list(DAY = c(1L, 1L, 2L, 2L, 3L), MONTH = c(1L, 1L,
2L, 2L, 4L), YEAR = c(2000L, 2000L, 2000L, 2001L, 2001L), V1 = c(1L,
3L, 2L, 1L, 3L), V2 = c(2L, 2L, 3L, 2L, 2L), V3 = c(3L, 1L, 1L,
3L, 1L)), .Names = c("DAY", "MONTH", "YEAR", "V1", "V2", "V3"
), class = "data.frame", row.names = c(NA, -5L))
vw_example <- structure(list(DAY = 1:2, MONTH = 1:2, YEAR = c(2000L, 2000L),
V1 = 1:2, V2 = 2:3, V3 = c(3L, 1L)), .Names = c("DAY", "MONTH",
"YEAR", "V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-2L))
library(plyr)
wv_inter <- merge(v, w, by=c("DAY","MONTH","YEAR"))
vw <- ddply(wv_inter,.(DAY, MONTH, YEAR),function(x) data.frame(DAY=x$DAY[1],MONTH=x$MONTH[1],YEAR=x$YEAR[1], V1=x$V1[1], V2=x$V2[1], V3=x$V3[1]))

In base R, I would take unique of v first before merging. The merge command will by default merge on common column names, so by is unnecessary here.
vw <- merge(unique(v), w)
With your approach (take the first row from each combination), I think you could do (untested):
vw <- ddply(wv_inter,.(DAY, MONTH, YEAR),function(x) x[1,])

library(data.table)
v <- data.table(v)
w <- data.table(w)
setkey(v)
setkeyv(w, names(v))
# if you want to capture ALL unique values of `v`, use:
w[unique(v, by=NULL)]
# if you want only values that mutually exist in `v` and `w` use:
w[unique(v, by=NULL), nomatch=0L]

EDITED:
Rather than merge a unique v with w, to get a unique vw first merge v and w and then select values unique on the DAY MONTH YEAR columns.
vw <- merge(v, w, by=c("DAY","MONTH","YEAR"))
vw <- vw[which( ! duplicated(vw[,c("DAY","MONTH","YEAR")]) ), ]
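One caveat with merging on unique(v) alone: w itself has two rows for (1, 1, 2000), so that merge returns three rows rather than the two shown in vw_example. Below is a minimal base R sketch of an alternative, assuming the first row of w per key is the one to keep. The names keys, n_plain and w_first are illustrative and do not come from the original posts.

```r
# Data from the question, rebuilt with data.frame() for brevity
v <- data.frame(DAY   = c(1L, 1L, 2L, 2L, 2L),
                MONTH = c(1L, 1L, 2L, 2L, 3L),
                YEAR  = c(2000L, 2000L, 2000L, 2000L, 2001L))
w <- data.frame(DAY   = c(1L, 1L, 2L, 2L, 3L),
                MONTH = c(1L, 1L, 2L, 2L, 4L),
                YEAR  = c(2000L, 2000L, 2000L, 2001L, 2001L),
                V1    = c(1L, 3L, 2L, 1L, 3L),
                V2    = c(2L, 2L, 3L, 2L, 2L),
                V3    = c(3L, 1L, 1L, 3L, 1L))

keys <- c("DAY", "MONTH", "YEAR")

# merge(unique(v), w) keeps both w rows for (1, 1, 2000), giving 3 rows
n_plain <- nrow(merge(unique(v), w))

# Keep the first row of w for each key, then keep only keys present in v
w_first <- w[!duplicated(w[keys]), ]
vw <- merge(unique(v), w_first, by = keys)
```

De-duplicating w before the merge gives the same two rows as de-duplicating after it, but avoids building the larger intermediate table.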

Related

Degrouping from one row per group to one row per subject

I have data where each row represents a household, and I would like to have one row per individual in the different households.
The data looks similar to this:
df <- data.frame(village = rep("aaa",5),household_ID = c(1,2,3,4,5),name_1 = c("Aldo","Giovanni","Giacomo","Pippo","Pippa"),outcome_1 = c("yes","no","yes","no","no"),name_2 = c("John","Mary","Cindy","Eva","Doron"),outcome_2 = c("yes","no","no","no","no"))
I would still like to keep the wide format of the data, just with one individual (and related outcome variables) per row. I could find examples that tell how to do the opposite, going from individual to grouped data using dcast, but I could not find examples of this problem I am facing now.
I have tried with melt
reshape2::melt(df, id.vars = "household_ID")
but I get a long format data.
Any suggestions welcome...
Thank you
Use pivot_longer() in tidyr, and set ".value" in names_to to indicate new column names from the pattern of the original column names.
library(tidyr)
df %>%
pivot_longer(-c(village, household_ID),
names_to = c(".value", "n"),
names_sep = "_")
# # A tibble: 10 x 5
# village household_ID n name outcome
# <fct> <dbl> <chr> <fct> <fct>
# 1 aaa 1 1 Aldo yes
# 2 aaa 1 2 John yes
# 3 aaa 2 1 Giovanni no
# 4 aaa 2 2 Mary no
# 5 aaa 3 1 Giacomo yes
# 6 aaa 3 2 Cindy no
# 7 aaa 4 1 Pippo no
# 8 aaa 4 2 Eva no
# 9 aaa 5 1 Pippa no
# 10 aaa 5 2 Doron no
Data
df <- structure(list(village = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "aaa", class = "factor"),
household_ID = c(1, 2, 3, 4, 5), name_1 = structure(c(1L,
3L, 2L, 5L, 4L), .Label = c("Aldo", "Giacomo", "Giovanni",
"Pippa", "Pippo"), class = "factor"), outcome_1 = structure(c(2L,
1L, 2L, 1L, 1L), .Label = c("no", "yes"), class = "factor"),
name_2 = structure(c(4L, 5L, 1L, 3L, 2L), .Label = c("Cindy",
"Doron", "Eva", "John", "Mary"), class = "factor"), outcome_2 = structure(c(2L,
1L, 1L, 1L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))
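If adding tidyr is not an option, the same reshaping can probably be done with base R's reshape(). This is only a sketch under the assumption that the columns follow the name_<n> / outcome_<n> pattern from the question; the object name long is illustrative.

```r
df <- data.frame(village = rep("aaa", 5),
                 household_ID = c(1, 2, 3, 4, 5),
                 name_1 = c("Aldo", "Giovanni", "Giacomo", "Pippo", "Pippa"),
                 outcome_1 = c("yes", "no", "yes", "no", "no"),
                 name_2 = c("John", "Mary", "Cindy", "Eva", "Doron"),
                 outcome_2 = c("yes", "no", "no", "no", "no"),
                 stringsAsFactors = FALSE)

# Each element of `varying` lists the wide columns that collapse into
# the corresponding entry of `v.names`
long <- reshape(df,
                direction = "long",
                varying = list(c("name_1", "name_2"),
                               c("outcome_1", "outcome_2")),
                v.names = c("name", "outcome"),
                timevar = "n",
                idvar = "household_ID")

# Sort so that members of the same household sit next to each other
long <- long[order(long$household_ID, long$n), ]
rownames(long) <- NULL
```

The result has one row per individual with the columns village, household_ID, n, name and outcome, like the pivot_longer() output above.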

Counting where the string appears in row with caveat - [R]

I have a dataset where the first three columns (G1P1, G1P2, G1P3) indicate one grouping of three individuals (e.g. Sidney, Blake, Max on Row 1), the second three columns (G2P1, G2P2, G2P3) indicate another grouping of three individuals (e.g. David, Steve, Daniel on Row 1), etc. There are a total of 12 individuals, and the dataset covers pretty much all the possible groupings of these 12 people (approximately 300,000 rows). Each group's cumulative test scores are in the far-right columns (G1.Sum, G2.Sum, G3.Sum, G4.Sum).
#### The dput(data) of the first five rows ####
data <- structure(list(X = 1:5, G1P1 = structure(c(4L, 4L, 4L, 4L, 4L), .Label = c("CHRIS", "DAVID", "EVA", "SIDNEY"), class = "factor"), G1P2 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("BLAKE", "NICK", "PATRIC", "STEVE"), class = "factor"), G1P3 = structure(c(4L, 4L, 4L, 4L, 4L), .Label = c("BEAU", "BRANDON", "DANIEL", "MAX"), class = "factor"), G2P1 = structure(c(2L, 2L, 1L, 1L, 3L), .Label = c("CHRIS", "DAVID", "EVA", "SIDNEY"), class = "factor"), G2P2 = structure(c(4L, 4L, 3L, 3L, 2L), .Label = c("BLAKE", "NICK", "PATRIC", "STEVE"), class = "factor"), G2P3 = structure(c(3L, 3L, 2L, 2L, 1L), .Label = c("BEAU", "BRANDON", "DANIEL", "MAX"), class = "factor"), G3P1 = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("CHRIS", "DAVID", "EVA", "SIDNEY"), class = "factor"), G3P2 = structure(c(3L, 2L, 4L, 2L, 4L), .Label = c("BLAKE", "NICK", "PATRIC", "STEVE"), class = "factor"), G3P3 = structure(c(2L, 1L, 3L, 1L, 3L), .Label = c("BEAU", "BRANDON", "DANIEL", "MAX"), class = "factor"), G4P1 = structure(c(3L, 1L, 3L, 2L, 1L), .Label = c("CHRIS", "DAVID", "EVA", "SIDNEY"), class = "factor"), G4P2 = structure(c(2L, 3L, 2L, 4L, 3L), .Label = c("BLAKE", "NICK", "PATRIC", "STEVE"), class = "factor"), G4P3 = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("BEAU", "BRANDON", "DANIEL", "MAX"), class = "factor"), G1.Sum = c(63.33333333, 63.33333333, 63.33333333, 63.33333333, 63.33333333), G2.Sum = c(58.78333333, 58.78333333, 54.62333333, 54.62333333, 58.69), G3.Sum = c(54.62333333, 58.69, 58.78333333, 58.69, 58.78333333), G4.Sum = c(58.69, 54.62333333, 58.69, 58.78333333, 54.62333333)), .Names = c("X", "G1P1", "G1P2", "G1P3", "G2P1", "G2P2", "G2P3", "G3P1", "G3P2", "G3P3", "G4P1", "G4P2", "G4P3", "G1.Sum", "G2.Sum", "G3.Sum", "G4.Sum"), row.names = c(NA, 5L), class = "data.frame")
I was wondering how you would write an R function that records, for each row, where each person's group score ranked. For example, on Row 1, SIDNEY was in the group with the highest score (63.3333), so his rank would be 1. BRANDON's group scored last (54.62333), so his rank would be 4. I would like my final data.frame output to be something like this:
ranks <- t(apply(data[grep("Sum", names(data))], 1, function(x) rep(match(x, sort(x, decreasing=T)),each=3)))
just.names <- data[grep("P", names(data))] #Subset without sums
names <- as.character(unlist(just.names[1,])) #create name vector
sapply(names, function(x) ranks[just.names == x])
# SIDNEY BLAKE MAX DAVID STEVE DANIEL CHRIS PATRIC BRANDON EVA NICK BEAU
# [1,] 1 1 1 2 2 2 4 4 4 3 3 3
# [2,] 1 1 1 2 2 2 4 4 4 3 3 3
# [3,] 1 1 1 2 2 2 4 4 4 3 3 3
# [4,] 1 1 1 2 2 2 4 4 4 3 3 3
# [5,] 1 1 1 2 2 2 4 4 4 3 3 3
We first rank the sums and replicate them 3 times each. Next we subset the larger data frame with the names only (take out the sum columns). We make a vector with the individual names. And lastly, we subset the ranks matrix that we created first by seeing where in the data frame the name appears.
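To make the ranking step concrete, here is a small sketch of what the function inside apply() does for a single row. The values are Row 1's sums from the question, rounded to two decimals for readability; the variable names are illustrative.

```r
# Group sums for one row (Row 1 of the question's data, rounded)
sums <- c(G1.Sum = 63.33, G2.Sum = 58.78, G3.Sum = 54.62, G4.Sum = 58.69)

# Position of each sum in the decreasing sort order = rank of the group
group_rank <- match(sums, sort(sums, decreasing = TRUE))

# Every group has three members, so repeat each group's rank three times
per_person <- rep(group_rank, each = 3)
per_person
# [1] 1 1 1 2 2 2 4 4 4 3 3 3
```

Note that match() returns the first matching position, so groups with tied sums would share the better rank.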
Using dplyr and tidyr: first rank the sums, then unite each group's columns with their rank, convert to long data, separate out the variables, and finally spread.
It got really long, and can probably be simplified:
library(dplyr)
library(tidyr)
data[ ,14:17] <- t(apply(-data[ ,14:17], 1 , rank))
data %>% unite("g1", starts_with("G1")) %>%
unite("g2", starts_with("G2")) %>%
unite("g3", starts_with("G3")) %>%
unite("g4", starts_with("G4")) %>%
gather(Row, val, -X) %>%
select(-Row) %>%
separate(val, c("1", "2", "3", "rank")) %>%
gather(zzz, name, -X, -rank) %>%
select(-zzz) %>%
spread(name, rank)
X BEAU BLAKE BRANDON CHRIS DANIEL DAVID EVA MAX NICK PATRIC SIDNEY STEVE
1 1 3 1 4 4 2 2 3 1 3 4 1 2
2 2 3 1 4 4 2 2 3 1 3 4 1 2
3 3 3 1 4 4 2 2 3 1 3 4 1 2
4 4 3 1 4 4 2 2 3 1 3 4 1 2
5 5 3 1 4 4 2 2 3 1 3 4 1 2
Using the previous answer's 'rank' matrix and library(reshape2) to convert the wide data.frame to a long data.frame:
library(reshape2)
test <- data # the question's data frame
ranks <- t(apply(test[grep("Sum",names(test))], 1, function (x)
rep(match(x, sort(x, decreasing=TRUE)),each=3)))
colnames(ranks) <- names(test)[grep("P", names(test))]
# data subset: drop the sum columns
test_L <- test[,-grep("Sum", names(test))]
df_player <- data.frame(position= names(test)[grep("P", names(test))],
t(test_L[,-1]), row.names = NULL)
df_ranks <- data.frame(position=names(test)[grep("P", names(test))],
t(ranks), row.names=NULL)
# Melt the two temporary data.frames, then combine them
df_player_melted <- melt(df_player, id=1,
variable.name = "rowNumber", value.name = "player")
df_ranks_melted <- melt(df_ranks, id=1,
variable.name = "rowNumber", value.name = "rank")
df <- cbind(df_player_melted, rank= df_ranks_melted$rank)
# cast into the output format you want
df <- dcast(df, rowNumber ~ player + rank)[1,]

R: Unique count by first occurrence of grouping variable

I would like to create a new variable "Count" that is a count of the unique values of a factor "Period", by grouping variable "ID". The following data includes a column with the values I would want in "Count":
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("a", "b"), class = "factor"), Period = c(1.1, 1.1,
1.2, 1.3, 1.2, 1.3, 1.5, 1.5), Count = c(1L, 1L, 2L, 3L, 1L,
2L, 3L, 3L)), .Names = c("ID", "Period", "Count"), class = "data.frame", row.names = c(NA,
-8L))
I tried to use mutate with Count = 1:length(Period) but it creates a cumulative count of each value of "Period", whereas I want a cumulative count of only unique values. This is what I tried:
library(plyr)
samp1<-ddply(samp, .(ID, Period), mutate, Count = 1:length(Period))
Could anyone provide the correct function to use?
Edit- New answer
Now that I come to think of it some more, my initial approach won't return correct results if each group's elements aren't adjacent. For example, for
v <- c(1, 3, 2, 2, 1, 2)
my function will put the non-consecutive 1s and 2s in different groups
myrleid(v)
## [1] 1 2 3 3 4 5
Thus, the best approach seems to be
match(v, unique(v))
## [1] 1 2 3 3 1 3
which both preserves the order of appearance and keeps non-adjacent equal values in the same group.
Thus, I would recommend just doing
library(data.table)
setDT(df)[, Count2 := match(Period, unique(Period)), by = ID]
or (with base R)
with(df, ave(Period, ID, FUN = function(x) match(x, unique(x))))
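As a quick check of the match()/unique() idea, the sketch below applies it to the question's data and to an unsorted vector, where it differs from a value-based ranking. The helper name first_seen and the vector x are illustrative, not part of the original answer.

```r
samp <- data.frame(ID = rep(c("a", "b"), each = 4),
                   Period = c(1.1, 1.1, 1.2, 1.3, 1.2, 1.3, 1.5, 1.5))

# Index of each value among the distinct values, in order of first appearance
first_seen <- function(x) match(x, unique(x))

samp$Count2 <- with(samp, ave(Period, ID, FUN = first_seen))
samp$Count2
# [1] 1 1 2 3 1 2 3 3

# On unsorted input, order of appearance and order of value disagree
x <- c(1.3, 1.1, 1.3, 1.2)
by_appearance <- first_seen(x)         # 1 2 1 3
by_value <- as.integer(factor(x))      # 3 1 3 2
```

Which of the two numberings is wanted depends on whether "first occurrence" or "smallest value" should get count 1; for sorted periods they coincide.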
Old answer
Looks like a good candidate for the rleid function from the data.table devel version on GH (rleid has since been released on CRAN, as of data.table 1.9.6)
### Devel version installation instructions
# library(devtools)
# install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table) # v 1.9.5+
setDT(df)[, Count2 := rleid(Period), by = ID]
df
# ID Period Count Count2
# 1: a 1.1 1 1
# 2: a 1.1 1 1
# 3: a 1.2 2 2
# 4: a 1.3 3 3
# 5: b 1.2 1 1
# 6: b 1.3 2 2
# 7: b 1.5 3 3
# 8: b 1.5 3 3
Or, if you don't want to load external packages, we could define this function on our own
myrleid <- function(x) {
temp <- rle(x)$lengths
rep.int(seq_along(temp), temp)
}
with(df, ave(Period, ID, FUN = myrleid))
## [1] 1 1 2 3 1 2 3 3
Or if the groups are in increasing order, you could try ranking them too
library(data.table) ## V1.9.5+
setDT(df)[, Count2 := frank(Period, ties.method = "dense"), by = ID]
Or
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Count2 = dense_rank(Period))
samp <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("a", "b"), class = "factor"), Period = c(1.1, 1.1,
1.2, 1.3, 1.2, 1.3, 1.5, 1.5), Count = c(1L, 1L, 2L, 3L, 1L,
2L, 3L, 3L)), .Names = c("ID", "Period", "Count"), class = "data.frame", row.names = c(NA,
-8L))
library(dplyr)
select(samp, -Count) %>%
arrange(ID, Period) %>%
group_by(ID) %>%
mutate(dup = !duplicated(Period),
Count = cumsum(dup))
The key steps are to arrange by ID and Period, and then to identify that first new representation of Period as "not duplicated".
A solution in base R with transform:
transform(df, Count2 = unlist(
tapply(df$Period, df$ID, function(x)
as.numeric(factor(x)))
))
ID Period Count Count2
a1 a 1.1 1 1
a2 a 1.1 1 1
a3 a 1.2 2 2
a4 a 1.3 3 3
b1 b 1.2 1 1
b2 b 1.3 2 2
b3 b 1.5 3 3
b4 b 1.5 3 3
As David suggested, this solution does not work well if the Period values are not monotonically increasing.

Removing duplicate rows with ddply

I have a dataframe df containing two factor variables (Var and Year) as well as one (in reality several) column with values.
df <- structure(list(Var = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Year = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 3L, 1L, 2L, 3L), .Label = c("2000", "2001",
"2002"), class = "factor"), Val = structure(c(1L, 2L, 2L, 4L,
1L, 3L, 3L, 5L, 6L, 6L), .Label = c("2", "3", "4", "5", "8",
"9"), class = "factor")), .Names = c("Var", "Year", "Val"), row.names = c(NA,
-10L), class = "data.frame")
> df
Var Year Val
1 A 2000 2
2 A 2001 3
3 A 2002 3
4 B 2000 5
5 B 2001 2
6 B 2002 4
7 B 2002 4
8 C 2000 8
9 C 2001 9
10 C 2002 9
Now I'd like to find rows with the same value for Val for each Var and Year and only keep one of those. So in this example I would like row 7 to be removed.
I've tried to find a solution with plyr using something like
df_new <- ddply(df, .(Var, Year), summarise, !duplicate(Val))
but obviously that is not a function accepted by ddply.
I found this similar question but the plyr solution by Arun only gives me a dataframe with 0 rows and 0 columns and I do not understand the answer well enough to modify it according to my needs.
Any hints on how to go about that?
Non-duplicates of Val by Var and Year are the same as non-duplicates of Val, Var, and Year. You can specify several columns for duplicated (or the whole data frame).
I think this does what you'd like.
df[!duplicated(df), ]
Or.
df[!duplicated(df[, c("Var", "Year", "Val")]), ]
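A small sketch of how duplicated() behaves on whole rows versus a subset of columns may help here. The toy data frame below is made up for illustration and is not the question's data.

```r
df <- data.frame(Var  = c("A", "B", "B", "B"),
                 Year = c("2000", "2002", "2002", "2001"),
                 Val  = c("2", "4", "4", "4"))

# Whole-row comparison: only row 3 repeats an earlier row on every column
dup_all <- duplicated(df)
dedup_all <- df[!dup_all, ]

# Comparing on Var and Val only: rows 3 and 4 both repeat row 2
dedup_sub <- df[!duplicated(df[, c("Var", "Val")]), ]
```

The columns passed to duplicated() therefore define what counts as "the same row", and the first occurrence is always the one kept.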
You can just use the unique() function instead of !duplicate(Val):
df_new <- ddply(df, .(Var, Year), summarise, Val=unique(Val))
# or
df_new <- ddply(df, .(Var, Year), function(x) x[!duplicated(x$Val),])
# or if you only have these 3 columns:
df_new <- ddply(df, .(Var, Year), unique)
# with dplyr
df %>% group_by(Var, Year) %>% filter(!duplicated(Val))
hth
You don't need the plyr package here. If your whole dataset consists of only these 3 columns and you need to remove the duplicates, then you can use,
df_new <- unique(df)
Else, if you need to just pick up the first observation for a group by variable list, then you can use the method suggested by Richard. That's usually how I have been doing it.

average between duplicated rows in R

I have a data frame df with rows that are duplicates for the names column but not for the values column:
name value etc1 etc2
A 9 1 X
A 10 1 X
A 11 1 X
B 2 1 Y
C 40 1 Y
C 50 1 Y
I need to aggregate the duplicate names into one row, while calculating the mean over the values column. The expected output is as follows:
name value etc1 etc2
A 10 1 X
B 2 1 Y
C 45 1 Y
I have tried to use df[duplicated(df$name),] but of course this does not give me the mean over the duplicates. I would like to use aggregate(), but the problem is that the FUN part of this function will apply to all the other columns as well, and among other problems, it will not be able to compute char content. Since all the other columns have the same content over the "duplicates", I need them to be aggregated as is just like the name column. Any hints...?
Here is a data.table solution. It is general in the sense that it will work even for a data.frame with 60 columns, since I group the data by all variables other than value (see how I create keys below).
library(data.table)
dat <- read.table(text='name value etc1 etc2
A 9 1 X
A 10 1 X
A 11 1 X
B 2 1 Y
C 40 1 Y
C 50 1 Y',header=TRUE)
keys <- colnames(dat)[!grepl('value',colnames(dat))]
X <- as.data.table(dat)
X[,list(mm= mean(value)),keys]
name etc1 etc2 mm
1: A 1 X 10
2: B 1 Y 2
3: C 1 Y 45
EDIT extend to more than one value variable
In case you have more than one numeric variable on which you want to compute the mean, for example if your data look like this:
name value etc1 etc2 value1
1 A 9 1 X 2.1763485
2 A 10 1 X -0.7954326
3 A 11 1 X -0.5839844
4 B 2 1 Y -0.5188709
5 C 40 1 Y -0.8300233
6 C 50 1 Y -0.7787496
The above solution can be extended like this :
X[,lapply(.SD,mean),keys]
name etc1 etc2 value value1
1: A 1 X 10 0.2656438
2: B 1 Y 2 -0.5188709
3: C 1 Y 45 -0.8043865
This will compute the mean for all variables that don't exist in keys list.
You can use the aggregate() function as below:
aggregate(df$value,by=list(name=df$name,etc1=df$etc1,etc2=df$etc2),FUN=mean)
The code (written by Metrics) is almost working except in one place (.name). I slightly modified it:
sample<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L,
50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L,
1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name",
"value", "etc1", "etc2"), class = "data.frame", row.names = c(NA,
-6L))
sample.m <- ddply(sample, 'name', summarize, value =mean(value), etc1=head(etc1,1), etc2=head(etc2,1))
sample.m
name value etc1 etc2
1 A 10 1 X
2 B 2 1 Y
3 C 45 1 Y
Assuming your dataframe is df.
install.packages("plyr")
library(plyr)
df<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L,
50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L,
1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name",
"value", "etc1", "etc2"), class = "data.frame", row.names = c(NA,
-6L))
df.m<-ddply(df,.(name),summarize, value=mean(value),etc1=head(etc1,1),etc2=head(etc2,1))
df.m
name value etc1 etc2
1 A 10 1 X
2 B 2 1 Y
3 C 45 1 Y
This simple one worked for me:
avg_data <- aggregate( . ~ name, df, mean)
Using the "aggregate" function: apply the formula method ( x ~ y ) for all variables (.) based on the naming variable ("name"), within the data.frame "df", to perform the "mean" function.
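One thing to watch with `. ~ name`: every column other than name is passed to mean, including non-numeric ones such as etc2, which typically yields NA with a warning. A hedged alternative, assuming the etc columns are constant within each name as the question states, is to list them on the grouping side of the formula:

```r
df <- data.frame(name  = c("A", "A", "A", "B", "C", "C"),
                 value = c(9, 10, 11, 2, 40, 50),
                 etc1  = rep(1L, 6),
                 etc2  = c("X", "X", "X", "Y", "Y", "Y"))

# Only `value` is averaged; name, etc1 and etc2 act as grouping keys
avg_data <- aggregate(value ~ name + etc1 + etc2, data = df, FUN = mean)
avg_data
#   name etc1 etc2 value
# 1    A    1    X    10
# 2    B    1    Y     2
# 3    C    1    Y    45
```

If the etc columns were not constant within a name, this would produce more than one row per name, which can also serve as a sanity check on that assumption.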

Resources