Aggregate a value by 2 variables - R

I have a dataframe that looks something like this
AgeBracket  No of People  No of Jobs
18-25       2             5
18-25       2             2
26-34       4             6
35-44       4             0
26-34       2             3
35-44       1             7
45-54       3             2
From this I want to aggregate the data so it looks like the following:
AgeBracket  1Person  2People  3People  4People
18-25       0        3.5      0        0
26-34       0        3        0        6
35-44       7        0        0        0
45-54       0        0        2        0
So the Y axis (rows) is the age bracket, the X axis (top row) is the number of people, and each cell shows the average number of jobs for that age bracket and number of people.
I assume it's something to do with aggregation but can't find anything similar to this on any site.

Here is a data.table method using dcast (this assumes the columns have been renamed to AgeBracket, People, and Jobs):
library(data.table)
setnames(dcast(df, AgeBracket ~ People, value.var = "Jobs", fun.aggregate = mean, fill = 0),
         c("AgeBracket", paste0(sort(unique(df$People)), "Person")))[]
Here, dcast reshapes the data wide, spreading each distinct number of people into its own column. fun.aggregate calculates the mean number of jobs within each AgeBracket-People cell, and fill=0 fills the empty cells.
setnames renames the variables, since the defaults are the integer People values, and the [] at the end prints out the result.
AgeBracket 1Person 2Person 3Person 4Person
1: 18-25 0 3.5 0 0
2: 26-34 0 3.0 0 6
3: 35-44 7 0.0 0 0
4: 45-54 0 0.0 2 0
This can be stretched out into two lines, which is probably more readable.
# reshape wide and calculate means
df.wide <- dcast(df, AgeBracket ~ People, value.var="Jobs", fun.aggregate=mean, fill=0)
# rename variables
setnames(df.wide, c("AgeBracket", paste0(names(df.wide)[-1], "Person")))
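For reference, a minimal reproducible sketch of the two-line version above, with toy data matching the question (this assumes the original columns have been renamed to AgeBracket, People, and Jobs):

```r
library(data.table)

# Toy data matching the question
df <- data.frame(
  AgeBracket = c("18-25", "18-25", "26-34", "35-44", "26-34", "35-44", "45-54"),
  People     = c(2, 2, 4, 4, 2, 1, 3),
  Jobs       = c(5, 2, 6, 0, 3, 7, 2)
)

# reshape wide and calculate means
df.wide <- dcast(setDT(df), AgeBracket ~ People, value.var = "Jobs",
                 fun.aggregate = mean, fill = 0)
# rename variables (the spread columns default to the integer People values)
setnames(df.wide, c("AgeBracket", paste0(names(df.wide)[-1], "Person")))
df.wide
```

This reproduces the table in the question; for example, the 18-25 / 2Person cell is mean(c(5, 2)) = 3.5.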

Assuming df is your data.frame, you can use aggregate with the mean function in base R, though the data.table way is probably faster, as suggested by @Imo:
agg <- aggregate(No.of.Jobs ~ AgeBracket + No.of.People,data=df,mean)
fin <- reshape2::dcast(agg,AgeBracket ~ No.of.People)
fin[is.na(fin)] <- 0
names(fin) <- c("AgeBracket",paste0("People",1:4))
As suggested by @Imo, a one-liner could be this:
reshape2::dcast(df, AgeBracket ~ No.of.People, value.var="No.of.Jobs", fun.aggregate=mean, fill=0)
We then just need to rename the columns.
Output:
AgeBracket People1 People2 People3 People4
1 18-25 0 3.5 0 0
2 26-34 0 3.0 0 6
3 35-44 7 0.0 0 0
4 45-54 0 0.0 2 0
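A hedged variation on that renaming step that avoids hardcoding 1:4 by reusing whatever column values dcast produced (my addition, not part of the original answer; assumes the reshape2 package is installed and uses toy data matching the question):

```r
# Toy data with the question's original column names
df <- data.frame(
  AgeBracket   = c("18-25", "18-25", "26-34", "35-44", "26-34", "35-44", "45-54"),
  No.of.People = c(2, 2, 4, 4, 2, 1, 3),
  No.of.Jobs   = c(5, 2, 6, 0, 3, 7, 2)
)

fin <- reshape2::dcast(df, AgeBracket ~ No.of.People, value.var = "No.of.Jobs",
                       fun.aggregate = mean, fill = 0)
# Prefix the spread columns, whatever People values actually occurred
names(fin)[-1] <- paste0("People", names(fin)[-1])
fin
```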


Problems to separate data

I have the table FreqAnual:
           Fêmea  Macho
Abril          3      0
Agosto         1      0
Dezembro       7      0
Fevereiro      6      4
Janeiro        6      4
Julho          1      0
Junho          5      0
Maio           3      0
Março         20      2
Novembro       4      1
Outubro        3      0
It comes from an Excel dataset that has a column "Mes", with one row per record, and another column for sex, whose values are Fêmea and Macho.
I built the table with FreqAnual <- table(Dados_procesados$Mes, Dados_procesados$Sexo).
So I tried FreqJan <- Dados_Procesados[Mes == Janeiro, ], and also the version with $ before Mes, and only got:
FreqJan <- Dados_Procesados [Mes = Janeiro, ]
Error: object 'Dados_Procesados' not found
What can I do? subtable didn't work either.
I was expecting something like
Fêmea Macho
Janeiro 6 4
I need it that way so I can run a G-test month by month to test the sex ratio and see whether there are significant differences.
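A hedged sketch of what seems to be needed (the object and column names follow the question; note that R is case-sensitive, so Dados_procesados and Dados_Procesados are different names, which explains the "object not found" error). Since FreqAnual is a table with months as row names, one month can be pulled out by name:

```r
# Toy version of the processed data (an assumption for illustration)
Dados_procesados <- data.frame(
  Mes  = c("Janeiro", "Janeiro", "Fevereiro", "Janeiro", "Fevereiro"),
  Sexo = c("Fêmea", "Macho", "Fêmea", "Fêmea", "Macho")
)
FreqAnual <- table(Dados_procesados$Mes, Dados_procesados$Sexo)

# Row names must be quoted strings; an unquoted Janeiro would be looked
# up as a variable and fail. drop = FALSE keeps the one-row table shape.
FreqJan <- FreqAnual["Janeiro", , drop = FALSE]
FreqJan
```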

Row data to binary columns while preserving the number of rows

This is similar to this question R Convert row data to binary columns but I want to preserve the number of rows.
How can I convert the row data to binary columns while preserving the number of rows?
Example
Input
myData <- data.frame(gender = c("man","women","child","women","women","women","man"),
                     age = c(22, 22, 0.33, 22, 22, 22, 111))
myData
gender age
1 man 22.00
2 women 22.00
3 child 0.33
4 women 22.00
5 women 22.00
6 women 22.00
7 man 111.00
How to get to this intended output?
gender age man women child
1 man 22.00 1 0 0
2 women 22.00 0 1 0
3 child 0.33 0 0 1
4 women 22.00 0 1 0
5 women 22.00 0 1 0
6 women 22.00 0 1 0
7 man 111.00 1 0 0
Perhaps a slightly easier solution without reliance on another package:
data.frame(myData, model.matrix(~gender+0, myData))
We can use dcast to do this
library(data.table)
dcast(setDT(myData), gender + age + seq_len(nrow(myData)) ~
gender, length)[, myData := NULL][]
Or use table from base R and cbind with the original dataset
cbind(myData, as.data.frame.matrix(table(1:nrow(myData), myData$gender)))
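A minimal sketch connecting the model.matrix answer to the intended output, with the default "gender" prefix stripped from the indicator column names (the prefix-stripping is my addition, not part of the original answer):

```r
myData <- data.frame(gender = c("man", "women", "child", "women", "women", "women", "man"),
                     age    = c(22, 22, 0.33, 22, 22, 22, 111))

# One indicator column per factor level; "+ 0" drops the intercept so
# no level is treated as a baseline
ind <- model.matrix(~ gender + 0, myData)
colnames(ind) <- sub("^gender", "", colnames(ind))  # "genderman" -> "man", etc.
out <- data.frame(myData, ind)
out
```

Note that model.matrix orders the indicator columns alphabetically by level (child, man, women), so the column order differs from the intended output but the content is the same.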

counting occurrence of strings across multiple columns in R [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 5 years ago.
I have a dataset in R which looks like the following (only relevant columns shown). It has sex-disaggregated data on what crops respondents wanted more information about and how much of a priority this crop is for them.
sex  wantcropinfo1  priority1  wantcropinfo2  priority2
m    wheat          high       eggplant       medium
m    rice           low        cabbage        high
m    rice           high
f    eggplant       medium
f    cotton         low
...
I want to be able to (a) count the total occurrences of each crop across all the wantcropinfoX columns; and (b) get the same count but sort them by priority; and (c) do the same thing but disaggregated by sex.
(a) output should look like this:
crop count
wheat 1
eggplant 2
rice 2
...
(b) output should look like this:
crop countm countf
wheat 1 0
eggplant 1 1
rice 2 0
...
(c) should look like this:
crop high_m med_m low_m high_f med_f low_f
wheat 1 0 0 0 0 0
eggplant 0 1 0 0 1 0
rice 1 0 1 0 0 0
...
I'm a bit of an R newbie and the manuals are slightly bewildering. I've googled a lot but couldn't find anything that was quite like this even though it seems like a fairly common thing one might want to do. Similar questions on stackoverflow seemed to be asking something a bit different.
We can use melt from data.table to convert from 'wide' to 'long' format. It can take multiple measure columns.
library(data.table)
dM <- melt(setDT(df1), measure = patterns("^want", "priority"),
value.name = c("crop", "priority"))[crop!='']
In the 'long' format, we get the 3 expected results by either grouping by 'crop' and get the number of rows or convert to 'wide' with dcast specifying the fun.aggregate as length.
dM[,.(count= .N) , crop]
# crop count
#1: wheat 1
#2: rice 2
#3: eggplant 2
#4: cotton 1
#5: cabbage 1
dcast(dM, crop~sex, value.var='sex', length)
# crop f m
#1: cabbage 0 1
#2: cotton 1 0
#3: eggplant 1 1
#4: rice 0 2
#5: wheat 0 1
dcast(dM, crop~priority+sex, value.var='priority', length)
# crop high_m low_f low_m medium_f medium_m
#1: cabbage 1 0 0 0 0
#2: cotton 0 1 0 0 0
#3: eggplant 0 0 0 1 1
#4: rice 1 0 1 0 0
#5: wheat 1 0 0 0 0
Use ddply function in the plyr package.
The structure of how you use this function is the following:
ddply(dataframe,.(var1,var2,...), summarize, function)
In this case you might want to do the following:
a) ddply(df, .(wantcropinfo1), summarize, count = length(wantcropinfo1))
b) ddply(df, .(wantcropinfo1, priority), summarize, count = length(wantcropinfo1))
c) ddply(df, .(wantcropinfo1, priority, sex), summarize, count = length(wantcropinfo1))
Note that the output will not have the same structure mentioned in the question, but the information will be the same. For the mentioned structure, use the table function.
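A sketch of that last suggestion, using base R's table on long-format data (my own illustration; the toy data below is an assumption, with one row per crop mention):

```r
# Toy long-format data: one row per (crop, sex, priority) mention
longdf <- data.frame(
  crop     = c("wheat", "rice", "rice", "eggplant", "eggplant", "cotton", "cabbage"),
  sex      = c("m", "m", "m", "m", "f", "f", "m"),
  priority = c("high", "low", "high", "medium", "medium", "low", "high")
)

table(longdf$crop)                      # (a): total count per crop
table(longdf$crop, longdf$sex)          # (b): counts per crop, split by sex
# (c): cross-tabulate crop against combined priority_sex categories
table(longdf$crop, paste(longdf$priority, longdf$sex, sep = "_"))
```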

How to clean and re-code check-all-that-apply responses in R survey data?

I've got survey data with some multiple-response questions like this:
HS18 Why is it difficult to get medical care in South Africa? (Select all that apply)
1 Too expensive
2 No transportation to the hospital/clinic
3 Hospital/clinic is too far away
4 Hospital/clinic staff do not speak my language
5 Hospital/clinic staff do not like foreigners
6 Wait time too long
7 Cannot take time off of work
8 None of these. I have no problem accessing medical care
where multiple responses were entered with commas and are recorded as different levels i.e.:
unique(HS18)
[1] 888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3
[13] 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4
[25] 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5
30 Levels: 0 1 1,2,3 1,4 1,4,5 1,4,6,7 1,6 2 2,3 2,6 3 3,4 3,5 3,6 4 4,5 4,5,6 4,5,6,7 4,6 4,8 ... 999
This is as much a data-cleaning protocol question as an R question. I'm doing the cleaning but not the analysis, so everything needs to be transparent and user-friendly when I pass it back, and the PI doesn't use R. Basically, I'd like to split the multiples into levels and rename them while keeping them together as a single observation. I'm not sure how to do this, or even if it's the right approach.
How do you generally deal with this issue? Is there an elegant way to process this for analysis in STATA (simple descriptives, regressions, odds ratios)?
Thanks everyone!!!
My best thought for analyzing multi-select questions like this is to convert the possible answers into indicator variables: take all of your possible answers (1 to 8 in this example) and create data columns named HS18.1, HS18.2, etc. (You can optionally include something more in the column name, but that's completely between you and the PI.)
Your sample data here looks like it includes data that is not legal: 0, 888, and 999 are not listed in the options. It's possible/likely that these include DK/NR responses, but I can't be certain. As such:
Your data cleaning should be taking care of these anomalies before this step of converting 0+ length lists into indicator variables.
My code below arbitrarily ignores this fact and you will lose data. This is obviously not "A Good Thing™" in the long run. More robust checks are warranted (and not difficult). (I've added an other column to indicate something was lost.)
The code:
ss <- '888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5'
dat <- lapply(strsplit(ss, ' '), strsplit, ',')[[1]]
lvls <- as.character(1:8)
## lvls <- sort(unique(unlist(dat))) # alternative method
ret <- structure(lapply(lvls, function(lvl) sapply(dat, function(xx) lvl %in% xx)),
.Names = paste0('HS18.', lvls),
row.names = c(NA, -length(dat)), class = 'data.frame')
ret$HS18.other <- sapply(dat, function(xx) !all(xx %in% lvls))
ret <- 1 * ret ## convert from TRUE/FALSE to 1/0
head(ret)
## HS18.1 HS18.2 HS18.3 HS18.4 HS18.5 HS18.6 HS18.7 HS18.8 HS18.other
## 1 0 0 0 0 0 0 0 0 1
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0
## 4 0 0 0 1 0 0 0 0 0
## 5 0 0 0 0 1 0 0 0 0
## 6 0 0 0 0 0 0 0 1 0
The resulting data.frame can be cbinded (or even matrixized) to whatever other data you have.
(I use 1 and 0 instead of TRUE and FALSE because you said the PI will not be using R; this can easily be changed to a character string or something that makes more sense to them.)

Aggregating big data in R

I have a dataset (dat) that looks like this:
Team  Person    Performance1  Performance2
1     36465930  1             101
1     37236856  1             101
1     34940210  1             101
1     29135524  1             101
2     10318268  1             541
2     641793    1             541
2     32352593  1             541
2     2139024   1             541
3     35193922  2             790
3     32645504  2             890
3     32304024  2             790
3     22696491  2             790
I am trying to identify and remove all teams that have variance on Performance1 or Performance2. So, for example, team 3 in the example has variance on Performance 2, so I would want to remove that team from the dataset. Here is the code as I've written it:
tda <- aggregate(dat, by=list(dat$Team), FUN=sd)
tda1 <- tda[ which(tda$Performance1 != 0 | tda$Performance2 != 0), ]
The problem is that there are over 100,000 teams in my dataset, so my first line of code is taking an extremely long time, and I'm not sure if it will ever finish aggregating the dataset. What would be a more efficient way to solve this problem?
Thanks in advance! :)
Sincerely,
Amy
The dplyr package is generally very fast. Here's a way to select only those teams with standard deviation equal to zero for both Performance1 and Performance2:
library(dplyr)
datAggregated = dat %>%
group_by(Team) %>%
summarise(sdP1 = sd(Performance1),
sdP2 = sd(Performance2)) %>%
filter(sdP1==0 & sdP2==0)
datAggregated
Team sdP1 sdP2
1 1 0 0
2 2 0 0
Using data.table for big datasets
library(data.table)
setDT(dat)[, setNames(lapply(.SD,sd), paste0("sdP", 1:2)),
.SDcols=3:4, by=Team][,.SD[!sdP1& !sdP2]]
# Team sdP1 sdP2
#1: 1 0 0
#2: 2 0 0
If you have a larger number of Performance columns, you could use summarise_each from dplyr:
datNew <- dat %>%
group_by(Team) %>%
summarise_each(funs(sd), starts_with("Performance"))
colnames(datNew)[-1] <- paste0("sdP", head(seq_along(datNew),-1))
datNew[!rowSums(datNew[-1]),]
which gives the output
# Team sdP1 sdP2
#1 1 0 0
#2 2 0 0
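The answers above identify the zero-variance teams; to actually drop the offending teams from the original data, as the question asks, one option is a semi-join against the filtered summary (my addition, not part of the original answers; the toy data mirrors the question):

```r
library(dplyr)

# Toy data matching the question: team 3 varies on Performance2
dat <- data.frame(
  Team         = rep(1:3, each = 4),
  Person       = 1:12,
  Performance1 = c(rep(1, 8), rep(2, 4)),
  Performance2 = c(rep(101, 4), rep(541, 4), 790, 890, 790, 790)
)

# Teams with zero variance on both performance measures
keep <- dat %>%
  group_by(Team) %>%
  summarise(sdP1 = sd(Performance1), sdP2 = sd(Performance2)) %>%
  filter(sdP1 == 0 & sdP2 == 0)

# Keep only rows belonging to those teams
cleaned <- semi_join(dat, keep, by = "Team")
```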
