Calculating group median from frequency table in R

I am trying to implement the data.table method described here: Calculating grouped variance from a frequency table in R.
I can successfully replicate their example, but when I apply it to my own data, nothing seems to happen. In particular, the output is this:
table <- data.frame(districts, proportions, populations)
table <- setDT(table)
districts proportions populations
1: 24 0.8270270 1269
2: 26 0.8867925 1679
3: 12 0.9136691 510
4: 27 0.4220532 3274
5: 20 0.5457650 3644
---
8937: 1 0.7798072 3444
8938: 1 0.6080247 6128
8939: 1 0.4655172 4335
8940: 1 0.4813200 4297
8941: 1 0.7690167 3906
setDT(table)[, list(GroupMedian = as.double(median(rep(proportions, populations))),
                    TotalCount = sum(populations)), by = districts]
print(table)
##Same output as above###
After much time spent on this, I have no idea what's going on.
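The likely reason nothing seems to happen: a data.table grouped query returns a *new* table rather than modifying `table` in place, so `print(table)` still shows the original columns. A minimal sketch with made-up numbers, assuming the same column names as in the question:

```r
library(data.table)

# toy data mimicking the question's structure (values are made up)
table <- data.table(districts   = c(1, 1, 2),
                    proportions = c(0.5, 0.7, 0.9),
                    populations = c(10, 20, 30))

# the grouped query returns a NEW data.table; assign it to keep the result
result <- table[, .(GroupMedian = as.double(median(rep(proportions, populations))),
                    TotalCount  = sum(populations)),
                by = districts]
print(result)
```

Assigning the query's result (here to `result`) is the step the original code was missing; printing `table` afterwards will still show the ungrouped data.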

Related

Grouping and building intervals of data in R and useful visualization

I have some data extracted via Hive. In the end we are talking about a CSV with around 500,000 rows. I want to plot the data after grouping it into intervals.
Besides the grouping, it's not clear how to visualize the data. Since we are talking about low spends and sometimes a high frequency, I'm not sure how to handle this problem.
Here is just an overview via head(data)
userid64 spend freq
575033023245123 0.00924205 489
12588968125440467 0.00037 2
13830962861053825 0.00168 1
18983461971805285 0.001500366 333
25159368164208149 0.00215 1
32284253673482883 0.001721303 222
33221593608613197 0.00298 709
39590145306822865 0.001785281 11
45831636009567401 0.00397 654
71526649454205197 0.000949978 1
78782620614743930 0.00552 5
I want to group the data into intervals. So I want an extra column indicating the group. The first group should contain all rows with a frequency (called freq) between 1 and 100, the second group all rows with a frequency between 101 and 200... and so on.
The result should look like
userid64 spend freq group
575033023245123 0.00924205 489 5
12588968125440467 0.00037 2 1
13830962861053825 0.00168 1 1
18983461971805285 0.001500366 333 3
25159368164208149 0.00215 1 1
32284253673482883 0.001721303 222 2
33221593608613197 0.00298 709 8
39590145306822865 0.001785281 11 1
45831636009567401 0.00397 654 7
71526649454205197 0.000949978 1 1
78782620614743930 0.00552 5 1
Is there a nice and gentle way to get this? I need this grouping for upcoming plots. I want to do a visualization for all intervals to get an overview of the spend. If you have any ideas for the visualization, please let me know. I thought I should work with boxplots.
If you want to group freq in steps of 100 units, you can try the ceiling function in base R:
ceiling(df$freq / 100)
#[1] 5 1 1 4 1 3 8 1 7 1 1
where df is your data frame.
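If labelled bins are wanted for the plots rather than bare group numbers, `cut()` gives the same grouping as the `ceiling()` approach; a sketch using the `freq` values from the question:

```r
# toy data with the freq values from the question
df <- data.frame(freq = c(489, 2, 1, 333, 1, 222, 709, 11, 654, 1, 5))

# same grouping as cut() below: group 1 is 1-100, group 2 is 101-200, ...
df$group <- ceiling(df$freq / 100)

# cut() with 100-wide breaks; labels = FALSE returns the integer group index,
# while the default labels give interval factors handy for boxplots per group
df$interval <- cut(df$freq,
                   breaks = seq(0, max(df$freq) + 100, by = 100),
                   labels = FALSE)
```

With the default `labels`, the new column is a factor of intervals like `(400,500]`, which can go straight onto a boxplot's x-axis.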

Why can't I see my full rbindlist result?

I used the rbindlist() function to try to merge two melted data frames (means_melt and means_melt_50). I'm wondering why it comes up with the break in the data, and whether I can use the whole list, as I ultimately intend to create two graphs, each with 5 sets of data (grouped by variable), using facet_grid(). I want the two graphs separated based on "Accuracy".
> compiled_means <- list(means_melt, means_melt_50)
> rbindlist(compiled_means, use.names = TRUE, fill=FALSE, idcol = NULL)
Divisions Accuracy variable value
1: 1 0 mean20 16
2: 2 0 mean20 20
3: 3 0 mean20 21
4: 4 0 mean20 17
5: 5 0 mean20 20
---
196: 16 50 mean_2 2
197: 17 50 mean_2 2
198: 18 50 mean_2 2
199: 19 50 mean_2 4
200: 20 50 mean_2 3
If anyone has a more efficient way for me to format the data so that it can be put in the graphs I want, I'm happy to hear suggestions. I'm not sure if the route I'm taking is effective or long-winded...
Simply a matter of preference and options: by default, the print method shows a summary of data.tables that have more than 100 rows. The following direct print gives the full data table.
print(your.data.table, nrows = Inf)
https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-faq.html#only-the-first-10-rows-are-printed-how-do-i-print-more
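The threshold can also be changed globally: the `datatable.print.nrows` option (default 100) controls when data.table switches to the head/tail summary. A short sketch:

```r
library(data.table)

dt <- data.table(x = 1:200)

# per-call: print every row of this one table
print(dt, nrows = Inf)

# global: raise the threshold at which printing switches to the summary,
# so tables up to 500 rows print in full from now on
options(datatable.print.nrows = 500)
```

The per-call form is safer in scripts, since the global option affects every subsequent print.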

Common dispersion in R

I am trying to use the edgeR package to analyse my data, as follows.
This is my data frame, called subdata:
A B C D E
1 13707 13866 12193 12671 10178
2 0 0 0 0 1
3 7165 5002 1256 1341 2087
6 8537 16679 9042 9620 19168
10 19438 25234 15563 16419 16582
16 3 3 11 3 5
genotype <- factor(c("LMS1", "LRS1", "MS3", "MS4", "RS5"))
y <- DGEList(counts = data.matrix(subdata), group = genotype)
design <- model.matrix(~ 0 + group, data = subdata)
y <- estimateGLMCommonDisp(y, design, verbose = TRUE)
I am trying to calculate the common dispersion in order to find up- and down-regulated genes, but I get this error message:
Warning message:
In estimateGLMCommonDisp.default(y = y$counts, design = design, :
No residual df: setting dispersion to NA
Please, could anyone help me resolve this problem? I would really appreciate it.
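For what it's worth, the warning points at the cause: with one library per genotype, the design matrix uses up every degree of freedom, leaving none to estimate the dispersion from. A base-R sketch of the arithmetic (the `genotype` factor is taken from the code above):

```r
# one library per group: five samples, five coefficients
genotype <- factor(c("LMS1", "LRS1", "MS3", "MS4", "RS5"))
design <- model.matrix(~ 0 + genotype)

# residual degrees of freedom = samples minus estimated coefficients
residual_df <- nrow(design) - ncol(design)
residual_df  # 0, so the dispersion cannot be estimated and is set to NA
```

With no replicates, the edgeR user's guide suggests either adding replicate libraries or supplying a typical biological coefficient of variation by hand (e.g. `exactTest(y, dispersion = bcv^2)`); which value is defensible depends on the experiment.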

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame where each alphabetical identifier is associated with a number:
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need is to replace ZYO with the number 64 in my main data frame (g3), and likewise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now, on a small scale, I can write code to change it, like I did with ATR:
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time-consuming and increases the chance of human error.
I know there are ways to do this on a broad scale with NAs.
I think maybe we could write a for loop for this, but I am not good enough to write one myself.
I have also been trying to use this function, which I feel may work, but I am not sure how to logically build the argument; it was posted on the questions board here:
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x < 0, 0)}))
I have tried to adapt this to my data by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x < 0, 0)}))
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals = 1:4,
                  num.individuals = c("ZYO", "KAO", "MKU", "SAG"),
                  g4 = c(64, 24, 32, 42))
g3 <- data.table(SAG = c("", "SAG", "", "SAG"),
                 KAO = c("KAO", "KAO", "", ""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find-and-replace:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with its fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
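The same lookup also works in plain base R with `match()` (which is essentially what `chmatch()` accelerates for character vectors); a sketch using the same reproducible `ref`/`g3` tables as above, built as plain data frames:

```r
ref <- data.frame(num.individuals = c("ZYO", "KAO", "MKU", "SAG"),
                  g4 = c(64, 24, 32, 42),
                  stringsAsFactors = FALSE)
g3 <- data.frame(SAG = c("", "SAG", "", "SAG"),
                 KAO = c("KAO", "KAO", "", ""),
                 stringsAsFactors = FALSE)

# replace every cell by its numeric code; unmatched cells (e.g. "") become NA
g3[] <- lapply(g3, function(x) ref$g4[match(x, ref$num.individuals)])
```

`g3[] <- lapply(...)` keeps the data.frame shape while replacing every column, so no data.table dependency is needed for small tables.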

R script using consecutive rows with allowance frame

I am trying to create an R script that says, "make a new variable and, based on a previous variable 'scores,' put a 1 for ten consecutive 'scores' in which at least 8 of those 10 'scores' are at or above 1952"
How about this, with zoo::rollapply()?
# make a data frame with scores
df <- data.frame(score = sample(1000:3000, 2000))
require(zoo)  # for rollapply()
df$newvar <- c(rep(0, 9),
               rollapply(df$score, width = 10,
                         FUN = function(x) ifelse(sum(x >= 1952) >= 8, 1, 0)))
head(df[df$newvar == 1, ])
score newvar
25 2695 1
26 2750 1
30 2468 1
140 2525 1
141 2515 1
275 1989 1
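`rollapply()`'s own `fill` and `align` arguments can replace the manual `c(rep(0, 9), ...)` padding; a sketch with made-up scores (`set.seed()` here only makes the toy data reproducible):

```r
library(zoo)

set.seed(1)  # reproducible toy scores (made up)
df <- data.frame(score = sample(1000:3000, 200, replace = TRUE))

# align = "right" places each window's result at its right edge, and
# fill = 0 replaces the first 9 incomplete-window positions with 0,
# which is exactly what c(rep(0, 9), ...) did by hand
df$newvar <- rollapply(df$score, width = 10,
                       FUN = function(x) as.integer(sum(x >= 1952) >= 8),
                       fill = 0, align = "right")
```

This keeps the output the same length as the input without any manual bookkeeping about the window width.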
