Row data to binary columns while preserving the number of rows - r

This is similar to the question R Convert row data to binary columns, but I want to preserve the number of rows.
How can I convert the row data to binary columns while preserving the number of rows?
Example
Input
myData <- data.frame(gender = c("man", "women", "child", "women", "women", "women", "man"),
                     age = c(22, 22, 0.33, 22, 22, 22, 111))
myData
gender age
1 man 22.00
2 women 22.00
3 child 0.33
4 women 22.00
5 women 22.00
6 women 22.00
7 man 111.00
How to get to this intended output?
gender age man women child
1 man 22.00 1 0 0
2 women 22.00 0 1 0
3 child 0.33 0 0 1
4 women 22.00 0 1 0
5 women 22.00 0 1 0
6 women 22.00 0 1 0
7 man 111.00 1 0 0

Perhaps a slightly easier solution without reliance on another package:
data.frame(myData, model.matrix(~gender+0, myData))
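model.matrix() prefixes each indicator column with the variable name ("genderchild", "genderman", "genderwomen"). If the bare level names are wanted, as in the intended output, a small sketch to strip the prefix (note the columns come out in alphabetical level order):
mm <- model.matrix(~ gender + 0, myData)
colnames(mm) <- sub("^gender", "", colnames(mm))
data.frame(myData, mm)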

We can use dcast to do this:
library(data.table)
dcast(setDT(myData), gender + age + seq_len(nrow(myData)) ~ gender,
      length)[, myData := NULL][]
Or use table from base R and cbind the result with the original dataset:
cbind(myData, as.data.frame.matrix(table(1:nrow(myData), myData$gender)))

big dataframe: "repeated" t-test between groups for thousand of factors

I have read a lot of posts related to data wrangling and "repeated" t-tests, but I can't figure out how to achieve it in my case.
You can get my example dataset for StackOverflow here: https://www.dropbox.com/s/0b618fs1jjnuzbg/dataset.example.stckovflw.txt?dl=0
I have a big dataframe of gene expression like:
> b<-read.delim("dataset.example.stckovflw.txt")
> head(b)
animal gen condition tissue LogFC
1 animalcontrol1 kjhss1 control brain 7.129283
2 animalcontrol1 sdth2 control brain 7.179909
3 animalcontrol1 sgdhstjh20 control brain 9.353147
4 animalcontrol1 jdygfjgdkydg21 control brain 6.459432
5 animalcontrol1 shfjdfyjydg22 control brain 9.372865
6 animalcontrol1 jdyjkdg23 control brain 9.541097
> str(b)
'data.frame': 21507 obs. of 5 variables:
$ animal : Factor w/ 25 levels "animalcontrol1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ gen : Factor w/ 1131 levels "dghwg1041","dghwg1086",..: 480 761 787 360 863 385 133 888 563 738 ...
$ condition: Factor w/ 5 levels "control","treatmentA",..: 1 1 1 1 1 1 1 1 1 1 ...
$ tissue : Factor w/ 2 levels "brain","heart": 1 1 1 1 1 1 1 1 1 1 ...
$ LogFC : num 7.13 7.18 9.35 6.46 9.37 ...
Each group has 5 animals, and each animal has many gens quantified. (Each animal may have a different set of quantified gens, but many of the gens are shared between animals and groups.)
I would like to perform a t-test for each gen between my treated groups (A, B, C or D) and the controls. The results should be presented as a table containing the p-value for each gen in each group.
Because I have so many gens (thousands), I cannot subset each gen by hand.
Do you know how I could automate the procedure?
I was thinking about a loop, but I am not at all sure it could achieve what I want or how to proceed.
Also, I was looking at these posts using the apply function: Apply t-test on many columns in a dataframe split by factor and Looping through t.tests for data frame subsets in r.
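For reference, a minimal base R sketch of the kind of per-gen loop being asked about might look like this (assuming a two-condition subset; the guard against tiny groups is illustrative):
ba <- subset(b, condition %in% c("control", "treatmentA"))
ba$condition <- droplevels(ba$condition)  # t.test() needs exactly 2 factor levels
pvals <- sapply(split(ba, ba$gen), function(d) {
  # skip gens that lack both conditions or have fewer than 2 values per group
  if (length(unique(d$condition)) < 2 || any(table(d$condition) < 2)) return(NA)
  t.test(LogFC ~ condition, data = d)$p.value
})
head(pvals)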
################ additional information after reading first comments and answers ################
@andrew_reece: Thank you very much for this. It is almost exactly what I was looking for. However, I can't find a way to do it with a t-test. The ANOVA is interesting information, but then I will need to know which of the treated groups is/are significantly different from my controls. I would also need to know which treated groups are significantly different from each other, "two by two".
I have been trying to use your code, changing the aov(..) to t.test(..). For that, I first take subset(b, condition == "control" | condition == "treatmentA") in order to compare only two groups. However, when I export the result table to a csv file, the table is not understandable (no gen names, no p-values, only numbers). I will keep searching for a way to do it properly, but for now I'm stuck.
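A hedged sketch of that t.test variant (tidying each test with broom::tidy() keeps the gen names and p-values together, so the exported csv stays readable; the filtering guards and file name are illustrative):
library(tidyverse)
ba <- subset(b, condition %in% c("control", "treatmentA"))
ba$condition <- droplevels(ba$condition)  # t.test() needs exactly 2 factor levels
res <- ba %>%
  group_by(gen) %>%
  filter(length(unique(condition)) == 2, all(table(condition) >= 2)) %>%
  nest() %>%
  mutate(test = map(data, ~ broom::tidy(t.test(LogFC ~ condition, data = .x)))) %>%
  unnest(test) %>%
  select(gen, estimate, statistic, p.value)
write.csv(res, "ttest_results.csv", row.names = FALSE)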
@42:
Thank you very much for these tips. This is just a dataset example, let’s assume we do have to use individual t-tests.
This is a very useful start for exploring my data. For example, I have been trying to represent my data with Venn diagrams. I can write my code, but it is somewhat outside the initial topic. Also, I don't know a less tedious way to summarize the shared "genes" detected in each combination of conditions, so I have simplified to only 3 conditions.
# Visualisation of shared genes by VennDiagrams :
# let's simplify and consider only 3 conditions :
b<-read.delim("dataset.example.stckovflw.txt")
b<- subset(b, condition == "control" | condition == "treatmentA" | condition == "treatmentB")
b1<-table(b$gen, b$condition)
b1
b2 <- data.frame(b1) # long format: Var1 = gene, Var2 = condition, Freq = count of animals
b3 <- subset(b2, Freq > 2) # keep only genes quantified in more than 2 animals per group
b3
b3
b4 <- within(b3, {
  Freq <- ifelse(Freq > 1, 1, 0)
}) # for these observations the gene is considered detected, so recode the value to 1 regardless of the frequency of occurrence (> 2)
b4
b5<-table(b4$Var1, b4$Var2)
write.csv(b5, file = "b5.csv")
# make an intermediate file bb5.txt (just add the first column title manually)
# so now we have the per-condition detection info
bb5<-read.delim("bb5.txt")
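# alternative: skip the manual .txt round-trip by converting the table directly
# (the row names carry the gene names)
bb5 <- as.data.frame.matrix(b5)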
nrow(subset(bb5, control == 1))
nrow(subset(bb5, treatmentA == 1))
nrow(subset(bb5, treatmentB == 1))
nrow(subset(bb5, control == 1 & treatmentA == 1))
nrow(subset(bb5, control == 1 & treatmentB == 1))
nrow(subset(bb5, treatmentA == 1 & treatmentB == 1))
nrow(subset(bb5, control == 1 & treatmentA == 1 & treatmentB == 1))
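# these seven counts presumably supply the area1..area3, n12, n13, n23 and n123
# arguments of draw.triple.venn() below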
library(grid)
library(futile.logger)
library(VennDiagram)
venn.plot <- draw.triple.venn(area1 = 1005,
                              area2 = 927,
                              area3 = 943,
                              n12 = 843,
                              n23 = 861,
                              n13 = 866,
                              n123 = 794,
                              category = c("controls", "treatmentA", "treatmentB"),
                              fill = c("red", "yellow", "blue"),
                              cex = 2,
                              cat.cex = 2,
                              lwd = 6,
                              lty = 'dashed',
                              fontface = "bold",
                              fontfamily = "sans",
                              cat.fontface = "bold",
                              cat.default.pos = "outer",
                              cat.pos = c(-27, 27, 135),
                              cat.dist = c(0.055, 0.055, 0.085),
                              cat.fontfamily = "sans",
                              rotation = 1)
Update (per OP comments):
Pairwise comparisons across condition can be managed with an ANOVA post-hoc test, such as Tukey's Honest Significant Difference (stats::TukeyHSD()). (There are others; this is just one way to demonstrate proof of concept.)
results <- b %>%
  mutate(condition = factor(condition)) %>%
  group_by(gen) %>%
  filter(length(unique(condition)) >= 2) %>%
  nest() %>%
  mutate(
    model = map(data, ~ TukeyHSD(aov(LogFC ~ condition, data = .x))),
    coef = map(model, ~ broom::tidy(.x))
  ) %>%
  unnest(coef) %>%
  select(-term)
results
# A tibble: 7,118 x 6
gen comparison estimate conf.low conf.high adj.p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 kjhss1 treatmentA-control 1.58 -20.3 23.5 0.997
2 kjhss1 treatmentC-control -3.71 -25.6 18.2 0.962
3 kjhss1 treatmentD-control 0.240 -21.7 22.2 1.000
4 kjhss1 treatmentC-treatmentA -5.29 -27.2 16.6 0.899
5 kjhss1 treatmentD-treatmentA -1.34 -23.3 20.6 0.998
6 kjhss1 treatmentD-treatmentC 3.95 -18.0 25.9 0.954
7 sdth2 treatmentC-control -1.02 -21.7 19.7 0.991
8 sdth2 treatmentD-control 3.25 -17.5 24.0 0.909
9 sdth2 treatmentD-treatmentC 4.27 -16.5 25.0 0.849
10 sgdhstjh20 treatmentC-control -7.48 -30.4 15.5 0.669
# ... with 7,108 more rows
Original answer
You can use tidyr::nest() and purrr::map() to accomplish the technical task of grouping by gen, and then conducting statistical tests comparing the effects of condition (presumably with LogFC as your DV).
But I agree with the other comments that there are issues with your statistical approach here that bear careful consideration - stats.stackexchange.com is a better forum for those questions.
For the purpose of demonstration, I've used an ANOVA instead of a t-test, since there are frequently more than two conditions per gen grouping. This shouldn't really change the intuition behind the implementation, however.
require(tidyverse)
results <- b %>%
  mutate(condition = factor(condition)) %>%
  group_by(gen) %>%
  filter(length(unique(condition)) >= 2) %>%
  nest() %>%
  mutate(
    model = map(data, ~ aov(LogFC ~ condition, data = .x)),
    coef = map(model, ~ broom::tidy(.x))
  ) %>%
  unnest(coef)
A few cosmetic trimmings get closer to your original vision of just a table with gen and p-values, although note that this leaves a lot of important information out, and I'm not advising you to actually limit your results in this way.
results %>%
  filter(term != "Residuals") %>%
  select(gen, df, statistic, p.value)
# A tibble: 1,111 x 4
gen df statistic p.value
<chr> <dbl> <dbl> <dbl>
1 kjhss1 3. 0.175 0.912
2 sdth2 2. 0.165 0.850
3 sgdhstjh20 2. 0.440 0.654
4 jdygfjgdkydg21 2. 0.267 0.770
5 shfjdfyjydg22 2. 0.632 0.548
6 jdyjkdg23 2. 0.792 0.477
7 fckjhghw24 2. 0.790 0.478
8 shsnv25 2. 1.15 0.354
9 qeifyvj26 2. 0.588 0.573
10 qsiubx27 2. 1.14 0.359
# ... with 1,101 more rows
Note: I can't take much credit for this approach - it's taken almost verbatim from an example I saw Hadley give at a talk last night on purrr. Here's a link to the public repo of the demo code he used, which covers a similar use case.
You have 25 animals in 5 different treatment groups with a varying number of gen-values (presumably activities of genetic probes) in two different tissues:
table(b$animal, b$condition)
control treatmentA treatmentB treatmentC treatmentD
animalcontrol1 1005 0 0 0 0
animalcontrol2 857 0 0 0 0
animalcontrol3 959 0 0 0 0
animalcontrol4 928 0 0 0 0
animalcontrol5 1005 0 0 0 0
animaltreatmentA1 0 927 0 0 0
animaltreatmentA2 0 883 0 0 0
animaltreatmentA3 0 908 0 0 0
animaltreatmentA4 0 861 0 0 0
animaltreatmentA5 0 927 0 0 0
animaltreatmentB1 0 0 943 0 0
animaltreatmentB2 0 0 841 0 0
animaltreatmentB3 0 0 943 0 0
animaltreatmentB4 0 0 910 0 0
animaltreatmentB5 0 0 943 0 0
animaltreatmentC1 0 0 0 742 0
animaltreatmentC2 0 0 0 724 0
animaltreatmentC3 0 0 0 702 0
animaltreatmentC4 0 0 0 698 0
animaltreatmentC5 0 0 0 742 0
animaltreatmentD1 0 0 0 0 844
animaltreatmentD2 0 0 0 0 776
animaltreatmentD3 0 0 0 0 812
animaltreatmentD4 0 0 0 0 783
animaltreatmentD5 0 0 0 0 844
I agree you need to "automate" this in some fashion, but I think you need a more general strategy for statistical inference rather than trying to pick out relationships by applying individual t-tests. You might consider either mixed models or one of the random forest variants (a minimal mixed-model sketch appears at the end of this answer). I think you should discuss this with a statistician. As an example of where your hopes are not going to be met, take a look at the information you have about the first "gen" among the 1131 values:
str( b[ b$gen == "dghwg1041", ])
'data.frame': 13 obs. of 5 variables:
$ animal : Factor w/ 25 levels "animalcontrol1",..: 1 6 11 2 7 12 3 8 13 14 ...
$ gen : Factor w/ 1131 levels "dghwg1041","dghwg1086",..: 1 1 1 1 1 1 1 1 1 1 ...
$ condition: Factor w/ 5 levels "control","treatmentA",..: 1 2 3 1 2 3 1 2 3 3 ...
$ tissue : Factor w/ 2 levels "brain","heart": 1 1 1 1 1 1 1 1 1 1 ...
$ LogFC : num 4.34 2.98 4.44 3.87 2.65 ...
You do have a fair number with "complete representation":
gen_length <- ave(b$LogFC, b$gen, FUN=length)
Hmisc::describe(gen_length)
#--------------
gen_length
n missing distinct Info Mean Gmd .05 .10
21507 0 18 0.976 20.32 4.802 13 14
.25 .50 .75 .90 .95
18 20 24 25 25
Value 5 8 9 10 12 13 14 15 16 17
Frequency 100 48 288 270 84 624 924 2220 64 527
Proportion 0.005 0.002 0.013 0.013 0.004 0.029 0.043 0.103 0.003 0.025
Value 18 19 20 21 22 23 24 25
Frequency 666 2223 3840 42 220 1058 3384 4925
Proportion 0.031 0.103 0.179 0.002 0.010 0.049 0.157 0.229
You might start by looking at all the "gen"s that have complete data:
gen_tbl <- table(b$gen) # number of observations per gen (needed for the next step)
head(gen_tbl[gen_tbl == 25], 25)
#------------------
dghwg1131 dghwg546 dghwg591 dghwg636 dghwg681
25 25 25 25 25
dghwg726 dgkuck196 dgkuck286 dgkuck421 dgkuck691
25 25 25 25 25
dgkuck736 dgkukdgse197 dgkukdgse287 dgkukdgse422 dgkukdgse692
25 25 25 25 25
dgkukdgse737 djh592 djh637 djh682 djh727
25 25 25 25 25
dkgkjd327 dkgkjd642 dkgkjd687 dkgkjd732 fckjhghw204
25 25 25 25 25
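As a minimal illustration of the mixed-model strategy mentioned above (a sketch assuming the lme4 package; an illustration of the idea, not a recommended analysis):
library(lme4)
# random intercept per gen; condition as the fixed effect of interest
fit <- lmer(LogFC ~ condition + (1 | gen), data = b)
summary(fit)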

R average over ages when some ages missing

I have a data.table with columns for age, food category, and the kcal consumed. I'm trying to get the average kcal for each category, but for some ages there is no consumption in a given category. So I can't take a simple average, because the implicit zeroes are not present in the data.table.
So for the example data:
dtp2 <- data.table(age = c(4, 4, 4, 5, 6, 18),
                   category = c("chips", "vegetables", "pizza", "chips", "pizza", "beer"),
                   kcal = c(100, 5, 100, 120, 100, 150))
Just doing dtp2[, mean(kcal), by = category] gives the wrong answer, because only the 18-year-olds are consuming beer and the 4-to-17-year-olds aren't.
The actual data set covers 4- to 18-year-olds with many, many categories. I've tried populating the data.table with zeroes for the omitted ages using a nested for loop, which is very slow, then taking the means as above.
Is there a sensible R way of taking the mean kcal where missing values are assumed to be zero, without nested for loops putting in the zeroes?
I take it you want to include missing or 0 kcal values in the calculation. Instead of taking the average, you could just sum by category and divide by the total n for each category.
The suggestion by Mr. Bugle is rather generic and doesn't show any code. Picking this up, the code of the OP needs to be modified as follows:
library(data.table)
dtp2[, sum(kcal) / uniqueN(dtp2$age), by = category]
which returns
category V1
1: chips 55.00
2: vegetables 1.25
3: pizza 50.00
4: beer 37.50
Note that uniqueN(dtp2$age) is used, not just uniqueN(age): grouped by category, the latter would count only the ages actually present in that category, while the denominator has to be the number of distinct ages in the whole table. (In this small example the number of distinct ages and the number of categories both happen to be 4, so either denominator reproduces the output above, but only the number of ages is correct in general.)
However, there are situations where the missing values do need to be filled in explicitly as zeroes, without nested for loops putting in the zeroes, as the OP has asked.
One situation could arise when data is reshaped from long to wide format for presentation of the data:
reshape2::dcast(dtp2, age ~ category, fun = mean, value.var = "kcal", margins = TRUE)
age beer chips pizza vegetables (all)
1 4 NaN 100 100 5 68.33333
2 5 NaN 120 NaN NaN 120.00000
3 6 NaN NaN 100 NaN 100.00000
4 18 150 NaN NaN NaN 150.00000
5 (all) 150 110 100 5 95.83333
Here, the margin means are computed only from the available data, which is not what the OP asked for. (Note that the parameter fill = 0 has no effect on the computation of the margins.)
So, the missing values need to be filled up before reshaping. In base R, expand.grid() can be used for this purpose, in data.table it's the cross join function CJ():
expanded <- dtp2[CJ(age = age, category = category, unique = TRUE),
                 on = .(age, category)][is.na(kcal), kcal := 0][]
expanded
age category kcal
1: 4 beer 0
2: 4 chips 100
3: 4 pizza 100
4: 4 vegetables 5
5: 5 beer 0
6: 5 chips 120
7: 5 pizza 0
8: 5 vegetables 0
9: 6 beer 0
10: 6 chips 0
11: 6 pizza 100
12: 6 vegetables 0
13: 18 beer 150
14: 18 chips 0
15: 18 pizza 0
16: 18 vegetables 0
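For comparison, the base R route with expand.grid() mentioned above might be sketched like this (merge() fills the gaps with NA, which are then set to zero):
grid <- expand.grid(age = unique(dtp2$age), category = unique(dtp2$category))
expanded_base <- merge(grid, dtp2, by = c("age", "category"), all.x = TRUE)
expanded_base$kcal[is.na(expanded_base$kcal)] <- 0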
Now, reshaping from long to wide returns the expected results:
reshape2::dcast(expanded, age ~ category, fun = mean, value.var = "kcal", margins = TRUE)
age beer chips pizza vegetables (all)
1 4 0.0 100 100 5.00 51.2500
2 5 0.0 120 0 0.00 30.0000
3 6 0.0 0 100 0.00 25.0000
4 18 150.0 0 0 0.00 37.5000
5 (all) 37.5 55 50 1.25 35.9375

Aggregate a value by 2 variables

I have a dataframe that looks something like this
AgeBracket No of People No of Jobs
18-25 2 5
18-25 2 2
26-34 4 6
35-44 4 0
26-34 2 3
35-44 1 7
45-54 3 2
From this I want to aggregate the data so it looks like the following:
AgeBracket 1Person 2People 3People 4People
18-25 0 3.5 0 0
26-34 0 3 0 6
35-44 7 0 0 0
45-54 0 0 2 0
So along the Y axis is the age bracket and along the X axis (top row) is the number of people, while the cells show the average number of jobs for that age bracket and number of people.
I assume it's something to do with aggregation, but I can't find anything similar to this on any site.
Here is a data.table method using dcast.
library(data.table)
setnames(dcast(df, AgeBracket ~ People, value.var = "Jobs", fun.aggregate = mean, fill = 0),
         c("AgeBracket", paste0(sort(unique(df$People)), "Person")))[]
Here, dcast reshapes wide, putting persons into separate variables. fun.aggregate calculates the mean number of jobs across AgeBracket-person cells, and fill is set to 0.
setnames renames the variables, since the defaults are the bare integer values, and [] at the end prints the result.
AgeBracket 1Person 2Person 3Person 4Person
1: 18-25 0 3.5 0 0
2: 26-34 0 3.0 0 6
3: 35-44 7 0.0 0 0
4: 45-54 0 0.0 2 0
This can be stretched out into two lines, which is probably more readable.
# reshape wide and calculate means
df.wide <- dcast(df, AgeBracket ~ People, value.var="Jobs", fun.aggregate=mean, fill=0)
# rename variables
setnames(df.wide, c("AgeBracket", paste0(names(df.wide)[-1], "Person")))
Assuming df is your data.frame, you can use aggregate with the mean function in base R, though I think the data.table way is faster, as suggested by @Imo:
agg <- aggregate(No.of.Jobs ~ AgeBracket + No.of.People,data=df,mean)
fin <- reshape2::dcast(agg,AgeBracket ~ No.of.People)
fin[is.na(fin)] <- 0
names(fin) <- c("AgeBracket",paste0("People",1:4))
As suggested by @Imo, a one-liner could be this:
reshape2::dcast(df, AgeBracket ~ No.of.People, value.var="No.of.Jobs", fun.aggregate=mean, fill=0)
We just need to rename the columns after that; a minimal sketch follows.
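For instance, assuming the one-liner's result is assigned first (fin2 is just an illustrative name):
fin2 <- reshape2::dcast(df, AgeBracket ~ No.of.People,
                        value.var = "No.of.Jobs", fun.aggregate = mean, fill = 0)
# dcast names the new columns "1".."4"; prefix them to match the output below
names(fin2)[-1] <- paste0("People", names(fin2)[-1])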
Output:
AgeBracket People1 People2 People3 People4
1 18-25 0 3.5 0 0
2 26-34 0 3.0 0 6
3 35-44 7 0.0 0 0
4 45-54 0 0.0 2 0

Find a function to return value based on condition using R

I have a table with values
KId sales_month quantity_sold
100 1 0
100 2 0
100 3 0
496 2 6
511 2 10
846 1 4
846 2 6
846 3 1
338 1 6
338 2 0
Now I require the output as:
KId sales_month quantity_sold result
100 1 0 1
100 2 0 1
100 3 0 1
496 2 6 1
511 2 10 1
846 1 4 1
846 2 6 1
846 3 1 0
338 1 6 1
338 2 0 1
The calculation should go as follows: if the quantity sold in March (month 3) is less than 60% of the combined quantity sold in January (month 1) and February (month 2), then the result should be 1; otherwise it should display 0. I need a solution that performs this.
Thanks in advance.
If I understand correctly, your requirement is to compare the quantity sold in month t with the sum of the quantities sold in months t-1 and t-2. If so, I suggest the dplyr package, which offers a convenient way of grouping rows and mutating columns in your data frame.
library(dplyr)
resultData <- group_by(data, KId) %>%
  arrange(sales_month) %>%
  mutate(monthMinus1Qty = lag(quantity_sold, 1),
         monthMinus2Qty = lag(quantity_sold, 2)) %>%
  group_by(KId, sales_month) %>%
  mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
  mutate(result = ifelse(quantity_sold / previous2MonthsQty >= 0.6, 0, 1)) %>%
  select(KId, sales_month, quantity_sold, result)
The resulting table contains the desired result column. Adding
select(KId, sales_month, quantity_sold, result)
at the end lets us display only the columns we care about (and not all the intermediate steps).
I believe this should satisfy your requirement. NAs in the result column are due to 0/0 division or to there being no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add a year column and adjust the group_by() arguments appropriately.
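A minimal sketch of that extension, assuming the data gains a year column (the .by_group argument needs dplyr >= 0.7):
resultData <- data %>%
  group_by(KId, year) %>% # lags now stay within each KId-year pair
  arrange(sales_month, .by_group = TRUE) %>%
  mutate(monthMinus1Qty = lag(quantity_sold, 1),
         monthMinus2Qty = lag(quantity_sold, 2))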
For more information on the dplyr package, follow this link

Aggregating big data in R

I have a dataset (dat) that looks like this:
Team Person Performance1 Performance2
1 36465930 1 101
1 37236856 1 101
1 34940210 1 101
1 29135524 1 101
2 10318268 1 541
2 641793 1 541
2 32352593 1 541
2 2139024 1 541
3 35193922 2 790
3 32645504 2 890
3 32304024 2 790
3 22696491 2 790
I am trying to identify and remove all teams that have variance on Performance1 or Performance2. So, for example, team 3 above has variance on Performance2, so I would want to remove that team from the dataset. Here is the code as I've written it:
tda <- aggregate(dat, by = list(dat$Team), FUN = sd)
tda1 <- tda[which(tda$Performance1 != 0 | tda$Performance2 != 0), ]
The problem is that there are over 100,000 teams in my dataset, so my first line of code is taking an extremely long time, and I'm not sure if it will ever finish aggregating the dataset. What would be a more efficient way to solve this problem?
Thanks in advance! :)
Sincerely,
Amy
The dplyr package is generally very fast. Here's a way to select only those teams with standard deviation equal to zero for both Performance1 and Performance2:
library(dplyr)
datAggregated <- dat %>%
  group_by(Team) %>%
  summarise(sdP1 = sd(Performance1),
            sdP2 = sd(Performance2)) %>%
  filter(sdP1 == 0 & sdP2 == 0)
datAggregated
Team sdP1 sdP2
1 1 0 0
2 2 0 0
Using data.table for big datasets
library(data.table)
setDT(dat)[, setNames(lapply(.SD, sd), paste0("sdP", 1:2)),
           .SDcols = 3:4, by = Team][, .SD[!sdP1 & !sdP2]]
# Team sdP1 sdP2
#1: 1 0 0
#2: 2 0 0
If you have more Performance columns, you could use summarise_each from dplyr:
datNew <- dat %>%
  group_by(Team) %>%
  summarise_each(funs(sd), starts_with("Performance"))
colnames(datNew)[-1] <- paste0("sdP", head(seq_along(datNew), -1))
datNew[!rowSums(datNew[-1]), ]
which gives the output
# Team sdP1 sdP2
#1 1 0 0
#2 2 0 0
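Aside: in current dplyr releases summarise_each() is superseded; a sketch of the equivalent using across():
library(dplyr)
datNew <- dat %>%
  group_by(Team) %>%
  summarise(across(starts_with("Performance"), sd))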
