Frequency Distribution Plot of Document Term Matrix - r

I have created a document term matrix that looks something like this:
inspect(dtm[1:4,1:6])
allowed allowing almost alone companyunder companywide
Doc1.txt 1 1 1 0 1 0
Doc2.txt 0 1 1 0 1 1
Doc3.txt 0 0 0 1 0 1
Doc4.txt 1 0 1 0 1 1
Taking its column sums gives me:
colSums(dtm)
allowed 2
allowing 2
almost 3
alone 1
companyunder 3
companywide 3
This tells me how many documents each word appears in (e.g. allowed 2 means that allowed is found in two documents).
I'm having difficulty creating a frequency distribution plot with the document number on the x-axis and the number of words the document contains on the y-axis.

Is this what you're looking for?
dtm = array(c(1,0,0,1,1,1,0,0,1,1,0,1,0,0,1,0,1,1,0,1,0,1,1,1),dim=c(4,6))
dimnames(dtm) = list(c("Doc1","Doc2","Doc3","Doc4"),c("allowed","allowing","almost","alone","companyunder","companywide"))
print(dtm)
plot(rowSums(dtm))
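If a bar per document reads better than the default scatter of points, a minimal sketch along the same lines could look like this. It reuses the toy matrix from above. With a real DocumentTermMatrix from the tm package you would presumably convert it first with as.matrix(dtm); that is an assumption about your setup, not something shown in the question.

```r
# Toy document-term matrix copied from the question (rows = documents).
dtm <- matrix(
  c(1, 1, 1, 0, 1, 0,
    0, 1, 1, 0, 1, 1,
    0, 0, 0, 1, 0, 1,
    1, 0, 1, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Doc1", "Doc2", "Doc3", "Doc4"),
    c("allowed", "allowing", "almost", "alone", "companyunder", "companywide")
  )
)

# Number of distinct terms present in each document (the matrix is 0/1 here;
# with raw counts, rowSums() would give total tokens instead).
words_per_doc <- rowSums(dtm)

# One bar per document, labelled with the document names.
barplot(words_per_doc, xlab = "Document", ylab = "Number of words",
        main = "Words per document")
```

rowSums() works per document and colSums() per term, so swapping one for the other switches between "words per document" and "documents per word".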

Regression with before and after

I have a dataset with four variables (df)
household group income post
1 0 20'000 0
1 0 22'000 1
2 1 10'000 0
2 1 20'000 1
3 0 20'000 0
3 0 21'000 1
4 1 9'000 0
4 1 16'000 1
5 1 8'000 0
5 1 18'000 1
6 0 22'000 0
6 0 26'000 1
7 1 12'000 0
7 1 24'000 1
8 0 24'000 0
8 0 27'000 1
group is a binary variable that equals 1 when the household received support from the state. post is also binary and equals 1 for the period after some households received support.
Now I would like to run a before-vs-after regression that estimates the effect of support by comparing the post period with the pre period for the supported group. I would like to put the dependent variable in logs, so that the impact of state support on income can be read as a percentage.
I used the code below, but I don't know whether it gives the right answer:
library("fixest")
feols(log(income) ~ group + post,data=df) %>% etable()
Is there another way?
If you are looking for the classic 2x2 (difference-in-differences) design, your code was almost correct: replace '+' with '*'. The coefficient on the group:post interaction then tells us that income in the supported group increased by 7,250 more than in the group that did not receive support.
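library(fixest)
# group * post expands to group + post + group:post; the interaction is the effect of interest
comparing = feols(income ~ group * post, data = df)
comparing_log = feols(log(income) ~ group * post, data = df)
etable(comparing, comparing_log)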
PS: Reading a log-model coefficient directly as a percentage change is a good approximation only for small coefficients. The exact percentage change is exp(beta) - 1. For the data above the interaction coefficient is about 0.5859, so exp(0.5859) - 1 = 0.7966.
So the change here is about 79.7%.
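To see where the interaction coefficient comes from, here is a small base-R sketch that rebuilds the data from the question and computes the difference-in-differences by hand from the four cell means. It uses lm() as a stand-in for feols() (an assumption made so that it runs without fixest); the point estimates should be the same.

```r
# Data from the question, with income as plain numbers.
df <- data.frame(
  household = rep(1:8, each = 2),
  group     = rep(c(0, 1, 0, 1, 1, 0, 1, 0), each = 2),
  post      = rep(c(0, 1), times = 8),
  income    = c(20000, 22000, 10000, 20000, 20000, 21000, 9000, 16000,
                8000, 18000, 22000, 26000, 12000, 24000, 24000, 27000)
)

# Cell means: rows are group (0/1), columns are post (0/1).
cell_means <- tapply(df$income, list(group = df$group, post = df$post), mean)

# Difference-in-differences by hand:
# (treated after - treated before) - (control after - control before)
did_manual <- (cell_means["1", "1"] - cell_means["1", "0"]) -
              (cell_means["0", "1"] - cell_means["0", "0"])

# The same quantity as the interaction coefficient of an OLS fit.
fit_level <- lm(income ~ group * post, data = df)
fit_log   <- lm(log(income) ~ group * post, data = df)

did_level <- unname(coef(fit_level)["group:post"])
did_log   <- unname(coef(fit_log)["group:post"])

# Exact percentage effect implied by the log model.
pct_effect <- exp(did_log) - 1

print(c(manual = did_manual, level = did_level, log = did_log, pct = pct_effect))
```

The hand-computed value and the level-model interaction agree, which is a quick sanity check that group * post is estimating what you intend.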

How to correctly merge two files and count values before Fisher's test in R?

I am very new to R, so I apologise if this looks simple to someone.
I am trying to join two files and then perform a one-sided Fisher's exact test to determine whether there is a greater burden of qualifying variants in casefile than in controlfile.
casefile:
GENE CASE_COUNT_HET CASE_COUNT_CH CASE_COUNT_HOM CASE_TOTAL_AC
ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
controlfile:
GENE CASE_COUNT_HET CASE_COUNT_CH CASE_COUNT_HOM CASE_TOTAL_AC
ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
ENSG00000174776 1 1 0 2
ENSG00000076864 1 0 1 13
ENSG00000086015 1 0 1 25
I have this script:
#!/usr/bin/env Rscript
library("argparse")
suppressPackageStartupMessages(library("argparse"))
parser <- ArgumentParser()
parser$add_argument("--casefile", action="store")
parser$add_argument("--casesize", action="store", type="integer")
parser$add_argument("--controlfile", action="store")
parser$add_argument("--controlsize", action="store", type="integer")
parser$add_argument("--outfile", action="store")
args <- parser$parse_args()
case.dat<-read.delim(args$casefile, header=T, stringsAsFactors=F, sep="\t")
names(case.dat)[1]<-"GENE"
control.dat<-read.delim(args$controlfile, header=T, stringsAsFactors=F, sep="\t")
names(control.dat)[1]<-"GENE"
dat<-merge(case.dat, control.dat, by="GENE", all.x=T, all.y=T)
dat[is.na(dat)]<-0
dat$P_DOM<-0
dat$P_REC<-0
for(i in 1:nrow(dat)){
  #Dominant model
  case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
  control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM
  if(case_count>args$casesize){
    case_count<-args$casesize
  }else if(case_count<0){
    case_count<-0
  }
  if(control_count>args$controlsize){
    control_count<-args$controlsize
  }else if(control_count<0){
    control_count<-0
  }
  mat<-cbind(c(case_count, (args$casesize-case_count)), c(control_count, (args$controlsize-control_count)))
  dat[i,]$P_DOM<-fisher.test(mat, alternative="greater")$p.value
The problem starts here:
case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM
Both case_count and control_count come out as NULL, even though the corresponding columns in both input files are NOT empty.
When I ran the script above with fixed numbers (1000 and 2000) assigned to case_count and control_count, it worked without issues.
The main purpose of the code:
https://github.com/mhguo1/TRAPD
Run burden testing
This script will run the actual burden testing. It performs a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in cases as compared to controls for each gene. It will perform this burden testing under a dominant and a recessive model.
It requires R; the script was tested using R v3.1, but any version of R should work. The script should be run as:
Rscript burden.R --casefile casecounts.txt --casesize 100 --controlfile controlcounts.txt --controlsize 60000 --output burden.out.txt
The script has 5 required options:
--casefile: Path to the counts file for the cases, as generated in Step 2A
--casesize: Number of cases that were tested in Step 2A
--controlfile: Path to the counts file for the controls, as generated in Step 2B
--controlsize: Number of controls that were tested in Step 2B. If using ExAC or gnomAD, please refer to the respective documentation for total sample size
--output: Output file path/name
Output: A tab delimited file with 10 columns:
#GENE: Gene name
CASE_COUNT_HET: Number of cases carrying heterozygous qualifying variants in a given gene
CASE_COUNT_CH: Number of cases carrying potentially compound heterozygous qualifying variants in a given gene
CASE_COUNT_HOM: Number of cases carrying homozygous qualifying variants in a given gene
CASE_TOTAL_AC: Total AC for a given gene
CONTROL_COUNT_HET: Approximate number of controls carrying heterozygous qualifying variants in a given gene
CONTROL_COUNT_HOM: Number of controls carrying homozygous qualifying variants in a given gene
CONTROL_TOTAL_AC: Total AC for a given gene
P_DOM: p-value under the dominant model
P_REC: p-value under the recessive model
I am trying to run a genetic variant burden test with VCF files and external gnomAD controls. I found this repo suitable and am now trying to fix the bugs in it.
As a newbie in R statistics, I will be happy about any suggestion. Thank you!
If you want all rows from both files, you can use full_join with by = "GENE" and whatever suffix you wish:
library(dplyr)
z <- full_join(case_file, control_file, by = "GENE", suffix = c(".CASE", ".CONTROL"))
GENE CASE_COUNT_HET.CASE CASE_COUNT_CH.CASE CASE_COUNT_HOM.CASE CASE_TOTAL_AC.CASE
1 ENSG00000124209 1 0 0 1
2 ENSG00000064703 1 1 0 9
3 ENSG00000171408 1 0 0 1
4 ENSG00000110514 1 1 1 12
5 ENSG00000247077 1 1 1 7
6 ENSG00000174776 NA NA NA NA
7 ENSG00000076864 NA NA NA NA
8 ENSG00000086015 NA NA NA NA
CASE_COUNT_HET.CONTROL CASE_COUNT_CH.CONTROL CASE_COUNT_HOM.CONTROL CASE_TOTAL_AC.CONTROL
1 1 0 0 1
2 1 1 0 9
3 1 0 0 1
4 1 1 1 12
5 1 1 1 7
6 1 1 0 2
7 1 0 1 13
8 1 0 1 25
If you want only the GENE values that are present in both files, use inner_join:
z <- inner_join(case_file, control_file, by = "GENE", suffix = c(".CASE", ".CONTROL"))
GENE CASE_COUNT_HET.CASE CASE_COUNT_CH.CASE CASE_COUNT_HOM.CASE CASE_TOTAL_AC.CASE
1 ENSG00000124209 1 0 0 1
2 ENSG00000064703 1 1 0 9
3 ENSG00000171408 1 0 0 1
4 ENSG00000110514 1 1 1 12
5 ENSG00000247077 1 1 1 7
CASE_COUNT_HET.CONTROL CASE_COUNT_CH.CONTROL CASE_COUNT_HOM.CONTROL CASE_TOTAL_AC.CONTROL
1 1 0 0 1
2 1 1 0 9
3 1 0 0 1
4 1 1 1 12
5 1 1 1 7
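As for the NULL values: one likely cause, judging only from the sample files shown (so treat it as an assumption), is that both files carry CASE_* headers. After merging, the duplicated columns get suffixes and no column called CONTROL_COUNT_HET exists, so dat$CONTROL_COUNT_HET is NULL. A minimal base-R sketch that renames the control columns before merging and then runs the dominant-model test could look like this. The cohort sizes are hypothetical and taken from the README example command.

```r
# Toy versions of the two files from the question (both use CASE_* headers).
case.dat <- data.frame(
  GENE = c("ENSG00000124209", "ENSG00000064703", "ENSG00000171408",
           "ENSG00000110514", "ENSG00000247077"),
  CASE_COUNT_HET = c(1, 1, 1, 1, 1),
  CASE_COUNT_CH  = c(0, 1, 0, 1, 1),
  CASE_COUNT_HOM = c(0, 0, 0, 1, 1),
  CASE_TOTAL_AC  = c(1, 9, 1, 12, 7),
  stringsAsFactors = FALSE
)
control.dat <- data.frame(
  GENE = c(case.dat$GENE, "ENSG00000174776", "ENSG00000076864", "ENSG00000086015"),
  CASE_COUNT_HET = rep(1, 8),
  CASE_COUNT_CH  = c(0, 1, 0, 1, 1, 1, 0, 0),
  CASE_COUNT_HOM = c(0, 0, 0, 1, 1, 0, 1, 1),
  CASE_TOTAL_AC  = c(1, 9, 1, 12, 7, 2, 13, 25),
  stringsAsFactors = FALSE
)

# Rename the control columns so the two sets no longer collide.
names(control.dat) <- sub("^CASE_", "CONTROL_", names(control.dat))

# Full outer merge; genes missing from one file get zero counts.
dat <- merge(case.dat, control.dat, by = "GENE", all = TRUE)
dat[is.na(dat)] <- 0

# Hypothetical cohort sizes (from the README example command).
casesize <- 100
controlsize <- 60000

# Clamp a count into [0, size], as the original loop does.
clamp <- function(x, size) pmin(pmax(x, 0), size)

# One-sided Fisher p-value for a single gene.
fisher_p <- function(case_count, control_count) {
  mat <- cbind(c(case_count, casesize - case_count),
               c(control_count, controlsize - control_count))
  fisher.test(mat, alternative = "greater")$p.value
}

# Dominant model: carriers of heterozygous or homozygous variants.
case_dom    <- clamp(dat$CASE_COUNT_HET + dat$CASE_COUNT_HOM, casesize)
control_dom <- clamp(dat$CONTROL_COUNT_HET + dat$CONTROL_COUNT_HOM, controlsize)
dat$P_DOM   <- mapply(fisher_p, case_dom, control_dom)

print(dat[, c("GENE", "CASE_COUNT_HET", "CONTROL_COUNT_HET", "P_DOM")])
```

If your real control file already has CONTROL_* headers, the rename step is a no-op and the NULL must come from somewhere else, so it is worth printing names(dat) right after the merge.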

How do I make a selected table confined to a matrix, rather than a running list?

My earlier code for building tables from column names produced short, dense matrices that let me readily process data from two survey questions (2nd example below).
However, when I use the same kind of call on a new column, I don't get that sleek matrix. I end up with a list of unlinked tables, which I do not want. Perhaps it's because the new column only contains 0's and 1's, whereas the others have more than 2 distinct values (1st example below).
[Please forgive my formatting issues (StackOverflow Status: Newbie). Also, many thanks in advance to those checking in on and answering my question!]
>table(select(data_final, `Relationship 2Affected Individual`, Satisfied_Treatments))
Relationship 2Affected Individual 1
1 0
2 0
3 0
6 0
Other (please specify) 0
, , 1 = 1, Response = 10679308122
0
Relationship 2Affected Individual 1
1 0
2 0
3 0
6 0
Other (please specify) 0
, ,
...
> table(select(data_final, `Relationship 2Affected Individual`, Indirect_Benefits))
Indirect_Benefits
Relationship 2Affected Individual 0 1 2 3
1 4 1 0 0
2 42 17 9 3
3 12 1 1 0
6 5 2 2 0
Other (please specify) 1 0 0 0
>#rstudioapi::versionInfo()
>#packageVersion("dplyr")
table(data_final$`Relationship 2Affected Individual`, data_final$Satisfied_Treatments)
Problem solved ^
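For anyone hitting the same thing: table() builds one dimension per column it is given, so if the selected data frame ends up with more than two columns (for example because an extra column tags along, which is only a guess about what happened here), the result is printed as a stack of 2-D slices. A toy sketch in base R, with made-up data and hypothetical column names:

```r
# Made-up survey data; the column names are hypothetical.
toy <- data.frame(
  relationship = c("1", "2", "2", "3", "2", "1"),
  satisfied    = c(0, 1, 0, 1, 1, 0),
  response_id  = c(101, 102, 103, 104, 105, 106)
)

# Three columns in, three-dimensional table out (printed slice by slice).
t3 <- table(toy)

# Two vectors in, one compact two-way table out.
t2 <- table(toy$relationship, toy$satisfied)

print(dim(t3))
print(t2)
```

Checking ncol() or names() of whatever is handed to table() is a quick way to spot an unexpected extra column.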

Multiple responses in SPSS

I have multiple response questions with 5 categories (values). I want to find the respondents who answered only one category.
For example,
respondents who answered a category but not 2, 3, 4, or 5.
I want only the "A" mentions, i.e. the respondents who checked category A alone. I need a count of these.
Help, please.
The following solution assumes the data has 5 dichotomous variables, one for each of the multiple response categories.
* creating some sample data to demonstrate on.
data list list/cat1 to cat5.
begin data
1 0 0 0 1
0 1 1 0 0
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 0 1
1 0 0 0 0
1 1 1 0 0
end data.
* now checking in which cases only category 1 was chosen.
compute NumCats=sum(cat1 to cat5).
if cat1=1 and NumCats=1 onlyCat1=1.
execute.
* if instead you wish to do the same check for each of the 5 categories,
use `do repeat` this way.
do repeat cat=cat1 to cat5/only=only1 to only5.
compute only=(cat=1 and NumCats=1).
end repeat.
execute.
But ditch the EXECUTE commands. They just cause a useless data pass in this case except for immediately updating the Data Editor (instead of updating on the next data pass).

T test to find differentially expressed genes in R

I have a matrix which contains the genes and the mRNA values.
ID_REF GSM362168 GSM362169 GSM362170 GSM362171 GSM362172 GSM362173 GSM362174
244901_at 5.171072 5.207896 5.191145 5.067809 5.010239 5.556884 4.879528
244902_at 5.296012 5.460796 5.419633 5.440318 5.234789 7.567894 6.908795
I wanted to find the differentially expressed genes in the matrix using a t-test, so I ran the following:
stat=mt.teststat(control,classlabel,test="t",na=.mt.naNUM,nonpara="n")
and I get the following error
Error in is.factor(classlabel) : object 'classlabel' not found.
I am not sure how to assign the class labels. Is this the right way to find the differentially expressed genes?
The documentation says classlabel should be a vector of integers corresponding to observation (column) class labels, but I do not understand what that means.
If you open the documentation for mt.teststat:
?mt.teststat
and scroll down to the end, you'll see an example using the "Golub data":
data(golub)
teststat <- mt.teststat(golub, golub.cl)
If you look at golub.cl, it will become clear what the classlabel vector should look like:
golub.cl
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
In this case, 0 or 1 are labels for two classes of sample. There should be as many values in the vector as you have samples, in the same order that the samples appear in the data matrix. You can also look at:
?golub
golub.cl: numeric vector indicating the tumor class, 27 acute
lymphoblastic leukemia (ALL) cases (code 0) and 11 acute
myeloid leukemia (AML) cases (code 1).
So you need to create a similar vector, with labels (0, 1, ...) for however many classes you have for your own data.
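As a rough sketch for the matrix in the question: the class assignment below (first four samples are class 0, last three are class 1) is purely an assumption for illustration, so substitute the real grouping of your GSM samples. The t statistic is computed with base R's t.test() as a stand-in so the example runs without multtest.

```r
# The two probes shown in the question; columns are samples.
X <- matrix(
  c(5.171072, 5.207896, 5.191145, 5.067809, 5.010239, 5.556884, 4.879528,
    5.296012, 5.460796, 5.419633, 5.440318, 5.234789, 7.567894, 6.908795),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("244901_at", "244902_at"),
    c("GSM362168", "GSM362169", "GSM362170", "GSM362171",
      "GSM362172", "GSM362173", "GSM362174")
  )
)

# ASSUMPTION: first four samples are class 0, last three are class 1.
# One label per column, in the same order as the columns of X.
classlabel <- c(0, 0, 0, 0, 1, 1, 1)
stopifnot(length(classlabel) == ncol(X))

# Base-R stand-in: Welch two-sample t statistic for each gene (row).
tstat <- apply(X, 1, function(x) {
  unname(t.test(x[classlabel == 1], x[classlabel == 0])$statistic)
})
print(tstat)

# With multtest installed, the corresponding call would presumably be:
# library(multtest)
# stat <- mt.teststat(X, classlabel, test = "t")
```

Note that the data must be a numeric matrix without the ID_REF column; the probe IDs belong in the row names.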
