I have a data frame df with rows that are duplicates in the name column but not in the value column:
name value etc1 etc2
A 9 1 X
A 10 1 X
A 11 1 X
B 2 1 Y
C 40 1 Y
C 50 1 Y
I need to aggregate the duplicate names into one row, while calculating the mean over the values column. The expected output is as follows:
name value etc1 etc2
A 10 1 X
B 2 1 Y
C 45 1 Y
I have tried df[duplicated(df$name),], but of course this does not give me the mean over the duplicates. I would like to use aggregate(), but the problem is that the FUN argument is applied to all the other columns as well and, among other problems, it cannot compute the mean of character content. Since all the other columns have the same content across the "duplicates", I need them to be carried over as is, just like the name column. Any hints?
Here is a data.table solution. It is general in the sense that it works even for a data.frame with 60 columns, because I group the data by every variable other than value (see how I create keys below).
library(data.table)
dat <- read.table(text='name value etc1 etc2
A 9 1 X
A 10 1 X
A 11 1 X
B 2 1 Y
C 40 1 Y
C 50 1 Y',header=TRUE)
keys <- colnames(dat)[!grepl('value',colnames(dat))]
X <- as.data.table(dat)
X[,list(mm= mean(value)),keys]
name etc1 etc2 mm
1: A 1 X 10
2: B 1 Y 2
3: C 1 Y 45
EDIT: extending to more than one value variable
If you have more than one numeric variable for which you want to compute the mean, for example if your data look like this:
name value etc1 etc2 value1
1 A 9 1 X 2.1763485
2 A 10 1 X -0.7954326
3 A 11 1 X -0.5839844
4 B 2 1 Y -0.5188709
5 C 40 1 Y -0.8300233
6 C 50 1 Y -0.7787496
The above solution can be extended like this:
X[,lapply(.SD,mean),keys]
name etc1 etc2 value value1
1: A 1 X 10 0.2656438
2: B 1 Y 2 -0.5188709
3: C 1 Y 45 -0.8043865
This computes the mean for every variable that is not in the keys list.
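If only some of the non-key columns should be averaged, one option is to name them explicitly with .SDcols. The following is a minimal sketch rather than part of the original answer; the value1 numbers and the value_cols variable are made up for illustration.

```r
library(data.table)

# Toy data mirroring the example above; value1 holds made-up numbers.
X <- data.table(
  name   = c("A", "A", "A", "B", "C", "C"),
  value  = c(9, 10, 11, 2, 40, 50),
  etc1   = 1L,
  etc2   = c("X", "X", "X", "Y", "Y", "Y"),
  value1 = c(1, 2, 3, 4, 5, 7)
)

# Columns to average (assumed to be known up front); group by everything else.
value_cols <- c("value", "value1")
keys <- setdiff(names(X), value_cols)

# .SDcols restricts .SD to the listed columns, so only those are averaged.
res <- X[, lapply(.SD, mean), by = keys, .SDcols = value_cols]
print(res)
```

Using setdiff() on the column names instead of grepl() avoids accidentally matching unrelated columns whose names merely contain "value".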
You can use the aggregate() function as below. The averaged column is returned under the name x:
aggregate(df$value, by=list(name=df$name, etc1=df$etc1, etc2=df$etc2), FUN=mean)
The code (written by Metrics) almost works, except in one place (.name). I modified it slightly:
sample<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L,
50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L,
1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name",
"value", "etc1", "etc2"), class = "data.frame", row.names = c(NA,
-6L))
library(plyr)  # provides ddply() and summarize()
sample.m <- ddply(sample, 'name', summarize, value = mean(value), etc1 = head(etc1, 1), etc2 = head(etc2, 1))
sample.m
name value etc1 etc2
1 A 10 1 X
2 B 2 1 Y
3 C 45 1 Y
Assuming your dataframe is df.
install.packages("plyr")
library(plyr)
df<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L,
50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L,
1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name",
"value", "etc1", "etc2"), class = "data.frame", row.names = c(NA,
-6L))
df.m<-ddply(df,.(name),summarize, value=mean(value),etc1=head(etc1,1),etc2=head(etc2,1))
df.m
name value etc1 etc2
1 A 10 1 X
2 B 2 1 Y
3 C 45 1 Y
This simple one worked for me:
avg_data <- aggregate( . ~ name, df, mean)
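This uses the formula method of aggregate(): the dot on the left-hand side stands for all variables other than the grouping variable "name", df is the data frame, and mean is the function applied to each of those variables within each group. Because mean is applied to every other column, this is only appropriate when those columns are numeric; non-numeric columns such as etc2 will not be carried over as is.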
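When the other columns are not numeric, the dot can go on the right-hand side instead, so that value is averaged and every remaining column is used as a grouping variable. This is a small sketch in base R, assuming (as in the question) that the other columns are constant within each name:

```r
# Data from the question, with character columns kept as characters.
df <- data.frame(
  name  = c("A", "A", "A", "B", "C", "C"),
  value = c(9, 10, 11, 2, 40, 50),
  etc1  = 1,
  etc2  = c("X", "X", "X", "Y", "Y", "Y"),
  stringsAsFactors = FALSE
)

# value ~ . means: average value, grouped by all other columns.
res <- aggregate(value ~ ., data = df, FUN = mean)
print(res)
#   name etc1 etc2 value
# 1    A    1    X    10
# 2    B    1    Y     2
# 3    C    1    Y    45
```

Note that the grouping columns come first in the result and value comes last.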
I have data where each row represents a household, and I would like to have one row per individual in the different households.
The data looks similar to this:
df <- data.frame(village = rep("aaa",5),household_ID = c(1,2,3,4,5),name_1 = c("Aldo","Giovanni","Giacomo","Pippo","Pippa"),outcome_1 = c("yes","no","yes","no","no"),name_2 = c("John","Mary","Cindy","Eva","Doron"),outcome_2 = c("yes","no","no","no","no"))
I would still like to keep the wide format of the data, just with one individual (and related outcome variables) per row. I could find examples that tell how to do the opposite, going from individual to grouped data using dcast, but I could not find examples of this problem I am facing now.
I have tried with melt
reshape2::melt(df, id.vars = "household_ID")
but that gives me data in long format.
Any suggestions welcome...
Thank you
Use pivot_longer() in tidyr, and set ".value" in names_to to indicate new column names from the pattern of the original column names.
library(tidyr)
df %>%
pivot_longer(-c(village, household_ID),
names_to = c(".value", "n"),
names_sep = "_")
# # A tibble: 10 x 5
# village household_ID n name outcome
# <fct> <dbl> <chr> <fct> <fct>
# 1 aaa 1 1 Aldo yes
# 2 aaa 1 2 John yes
# 3 aaa 2 1 Giovanni no
# 4 aaa 2 2 Mary no
# 5 aaa 3 1 Giacomo yes
# 6 aaa 3 2 Cindy no
# 7 aaa 4 1 Pippo no
# 8 aaa 4 2 Eva no
# 9 aaa 5 1 Pippa no
# 10 aaa 5 2 Doron no
Data
df <- structure(list(village = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "aaa", class = "factor"),
household_ID = c(1, 2, 3, 4, 5), name_1 = structure(c(1L,
3L, 2L, 5L, 4L), .Label = c("Aldo", "Giacomo", "Giovanni",
"Pippa", "Pippo"), class = "factor"), outcome_1 = structure(c(2L,
1L, 2L, 1L, 1L), .Label = c("no", "yes"), class = "factor"),
name_2 = structure(c(4L, 5L, 1L, 3L, 2L), .Label = c("Cindy",
"Doron", "Eva", "John", "Mary"), class = "factor"), outcome_2 = structure(c(2L,
1L, 1L, 1L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))
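For comparison, the same reshaping can be sketched in base R with reshape(), without tidyr. This is an alternative technique, not what the answer above uses; the column name n is chosen only to match the tidyr output.

```r
# Data from the question, with strings kept as characters.
df <- data.frame(
  village = rep("aaa", 5),
  household_ID = c(1, 2, 3, 4, 5),
  name_1 = c("Aldo", "Giovanni", "Giacomo", "Pippo", "Pippa"),
  outcome_1 = c("yes", "no", "yes", "no", "no"),
  name_2 = c("John", "Mary", "Cindy", "Eva", "Doron"),
  outcome_2 = c("yes", "no", "no", "no", "no"),
  stringsAsFactors = FALSE
)

# Each element of `varying` lists the wide columns that are stacked
# into one of the new columns named in `v.names`.
long <- reshape(
  df,
  direction = "long",
  varying = list(c("name_1", "name_2"), c("outcome_1", "outcome_2")),
  v.names = c("name", "outcome"),
  timevar = "n",
  times = 1:2,
  idvar = "household_ID"
)

# Sort so that the individuals of one household sit next to each other.
long <- long[order(long$household_ID, long$n), ]
rownames(long) <- NULL
print(long)
```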
Given
Group ss
B male
B male
B female
A male
A female
X male
Then
tab <- table(res$Group, res$ss)
I want the Group column to be in the order B, A, X, as it is in the data. Currently it is in alphabetical order, which is not what I want. This is what I want:
MALE FEMALE
B 5 5
A 5 10
X 10 12
If you arrange the factor levels based on the order you want, you'll get the desired result.
res$Group <- factor(res$Group, levels = c('B', 'A', 'X'))
#If it is based on occurrence in Group column we can use
#res$Group <- factor(res$Group, levels = unique(res$Group))
table(res$Group, res$ss)
#Or just
#table(res)
# female male
# B 1 2
# A 1 1
# X 0 1
data
res <- structure(list(Group = structure(c(2L, 2L, 2L, 1L, 1L, 3L),
.Label = c("A", "B", "X"), class = "factor"), ss = structure(c(2L, 2L, 1L, 2L,
1L, 2L), .Label = c("female", "male"), class = "factor")),
class = "data.frame", row.names = c(NA, -6L))
unique returns the unique elements of a vector in the order in which they occur. A table can be reordered like any other structure by extracting its elements in the order you want. So if you pass the output of unique as the row index in [ , ], you get the table sorted in the order of occurrence in the vector.
tab <- table(res$Group, res$ss)[unique(res$Group),]
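The desired output in the question also lists male before female. A minimal sketch, assuming the columns are plain character vectors, that fixes both the row and the column order through factor levels:

```r
# Data from the question as character columns.
res <- data.frame(
  Group = c("B", "B", "B", "A", "A", "X"),
  ss    = c("male", "male", "female", "male", "female", "male"),
  stringsAsFactors = FALSE
)

# Rows follow the order of first appearance; columns follow an explicit order.
res$Group <- factor(res$Group, levels = unique(res$Group))
res$ss    <- factor(res$ss, levels = c("male", "female"))

tab <- table(res$Group, res$ss)
print(tab)
```

Because table() follows the level order of its factor arguments, no reordering of the finished table is needed.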
I have 2 dataframes, testx and testy
testx
testx <- structure(list(group = 1:2), .Names = "group", class = "data.frame", row.names = c(NA,
-2L))
testy
testy <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(1L, 3L, 4L, 1L, 4L, 5L, 1L, 5L, 7L), value = c(50L,
52L, 10L, 4L, 84L, 2L, 25L, 67L, 37L)), .Names = c("group",
"time", "value"), class = "data.frame", row.names = c(NA, -9L
))
Based on this topic, I add missing time values using the following code, which works perfectly.
data <- setDT(testy, key='time')[, .SD[J(min(time):max(time))], by = group]
Now I would like to only add these missing time values IF the value for group appears in testx. In this example, I thus only want to add missing time values for groups matching the values for group in the file testx.
data <- setDT(testy, key='time')[,if(testy[group %in% testx[, group]]) .SD[J(min(time):max(time))], by = group]
The error I get is "undefined columns selected". I looked here, here and here, but I don't see why my code isn't working. I am working with large datasets, which is why I prefer data.table.
You don't need to refer to testy when you are already inside testy[] and grouping with by; using group directly as a variable gives the correct result. You also need an else branch that returns the rows unchanged for groups not in testx if you want to keep all records in testy:
testy[, {if(group %in% testx$group) .SD[J(min(time):max(time))] else .SD}, by = group]
# group time value
# 1: 1 1 50
# 2: 1 2 NA
# 3: 1 3 52
# 4: 1 4 10
# 5: 2 1 4
# 6: 2 2 NA
# 7: 2 3 NA
# 8: 2 4 84
# 9: 2 5 2
# 10: 3 1 25
# 11: 3 5 67
# 12: 3 7 37
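The line above relies on testy being keyed by time, as set up in the question. A variant that does not depend on a key is to join each group to its full time range with on=. This is a hedged sketch under that assumption, not the answer's own code; full_range and filled are illustrative names.

```r
library(data.table)

testx <- data.table(group = 1:2)
testy <- data.table(
  group = rep(1:3, each = 3),
  time  = c(1L, 3L, 4L, 1L, 4L, 5L, 1L, 5L, 7L),
  value = c(50L, 52L, 10L, 4L, 84L, 2L, 25L, 67L, 37L)
)

filled <- testy[, {
  if (.BY$group %in% testx$group) {
    # Full time range for this group; the join adds NA rows for the gaps.
    full_range <- data.table(time = seq(min(time), max(time)))
    .SD[full_range, on = "time"]
  } else {
    .SD  # group not listed in testx: keep its rows unchanged
  }
}, by = group]
print(filled)
```

Groups 1 and 2 gain rows for their missing time values, while group 3 keeps its three original rows.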
I have a dataframe which looks like -
Id Result
A 1
B 2
C 1
B 1
C 1
A 2
B 1
B 2
C 1
A 1
B 2
Now I need to calculate how many 1's and 2's are there for each Id and then select the number whose frequency of occurrence is the greatest.
Id Result
A 1
B 2
C 1
How can I do that? I have tried using the table function in various ways but have not been able to use it effectively. Any help would be appreciated.
Here you can use aggregate in one step:
df <- structure(list(Id = structure(c(1L, 2L, 3L, 2L, 3L, 1L, 2L, 2L,
3L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor"),
Result = c(1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L)),
.Names = c("Id", "Result"), class = "data.frame", row.names = c(NA, -11L)
)
res <- aggregate(Result ~ Id, df, FUN=function(x){which.max(c(sum(x==1), sum(x==2)))})
res
Result:
Id Result
1 A 1
2 B 2
3 C 1
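The which.max() call above returns a position, which equals the value only because the results happen to be 1 and 2. A sketch that works for arbitrary integer codes, using a hypothetical helper most_frequent() built on table():

```r
df <- data.frame(
  Id     = c("A", "B", "C", "B", "C", "A", "B", "B", "C", "A", "B"),
  Result = c(1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the most frequent value of x.
# On ties, which.max() keeps the first (smallest) value.
most_frequent <- function(x) {
  counts <- table(x)
  as.integer(names(counts)[which.max(counts)])
}

res <- aggregate(Result ~ Id, data = df, FUN = most_frequent)
print(res)
```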
With data.table you can try (df is your data.frame):
require(data.table)
dt<-as.data.table(df)
dt[,list(times=.N),by=list(Id,Result)][,list(Result=Result[which.max(times)]),by=Id]
# Id Result
#1: A 1
#2: B 2
#3: C 1
Using dplyr, you can try
library(dplyr)
df %>% group_by(Id, Result) %>% summarize(n = n()) %>% group_by(Id) %>%
  filter(n == max(n)) %>% select(Id, Result)
Id Result
1 A 1
2 B 2
3 C 1
An option using table and ave (df is the data frame from the question):
subset(as.data.frame(table(df)), ave(Freq, Id, FUN=max)==Freq, select=-3)
# Id Result
# 1 A 1
# 3 C 1
# 5 B 2
I have the following data and code:
mydf
grp categ condition value
1 A X P 2
2 B X P 5
3 A Y P 9
4 B Y P 6
5 A X Q 4
6 B X Q 5
7 A Y Q 8
8 B Y Q 2
mydf = structure(list(grp = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("A", "B"), class = "factor"), categ = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("X", "Y"), class = "factor"),
condition = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("P",
"Q"), class = "factor"), value = c(2, 5, 9, 6, 4, 5, 8, 2
)), .Names = c("grp", "categ", "condition", "value"), out.attrs = structure(list(
dim = structure(c(2L, 2L, 2L), .Names = c("grp", "categ",
"condition")), dimnames = structure(list(grp = c("grp=A",
"grp=B"), categ = c("categ=X", "categ=Y"), condition = c("condition=P",
"condition=Q")), .Names = c("grp", "categ", "condition"))), .Names = c("dim",
"dimnames")), row.names = c(NA, -8L), class = "data.frame")
However, the following works for data.frame but not for data.table:
> data.frame(with(mydf, table(grp, categ, condition)))
grp categ condition Freq
1 A X P 1
2 B X P 1
3 A Y P 1
4 B Y P 1
5 A X Q 1
6 B X Q 1
7 A Y Q 1
8 B Y Q 1
>
> data.table(with(mydf, table(grp, categ, condition)))
V1
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
7: 1
8: 1
>
Am I making some mistake here or do I need to correct the data.table command to get other variables? It is highly unlikely that there is a bug here. With 2 variables it works all right:
> data.table(with(mydf, table(grp, categ)))
categ grp N
1: A X 2
2: B X 2
3: A Y 2
4: B Y 2
>
>
> data.frame(with(mydf, table(grp, categ)))
grp categ Freq
1 A X 2
2 B X 2
3 A Y 2
4 B Y 2
Thanks for your help.
Fixed in commit #1760 of data.table v1.9.5. Closes #1043.
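For versions without the fix, two routes that avoid passing a three-way table directly to data.table() are sketched below. This is an illustration only: the data is rebuilt with expand.grid(), and the names via_df and counts are made up.

```r
library(data.table)

# Same rows as mydf in the question, rebuilt with expand.grid().
mydf <- data.frame(
  expand.grid(grp = c("A", "B"), categ = c("X", "Y"), condition = c("P", "Q")),
  value = c(2, 5, 9, 6, 4, 5, 8, 2)
)

# Route 1: go through data.frame first, which handles the table correctly.
via_df <- as.data.table(as.data.frame(with(mydf, table(grp, categ, condition))))

# Route 2: count rows per group with .N. Unlike table(), this lists
# only the combinations that actually occur in the data.
counts <- as.data.table(mydf)[, .N, by = .(grp, categ, condition)]

print(via_df)
print(counts)
```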