Problems separating data - R

I have this table, FreqAnual:
          Fêmea Macho
Abril         3     0
Agosto        1     0
Dezembro      7     0
Fevereiro     6     4
Janeiro       6     4
Julho         1     0
Junho         5     0
Maio          3     0
Março        20     2
Novembro      4     1
Outubro       3     0
It comes from a dataset imported from Excel that has a column "Mes", one row per record, and a column "Sexo" whose values are Fêmea and Macho.
I built it with FreqAnual <- table(Dados_procesados$Mes, Dados_procesados$Sexo).
So I tried FreqJan <- Dados_Procesados[Mes == Janeiro, ], and also the version with $ before Mes, but I just get this result:
FreqJan <- Dados_Procesados [Mes = Janeiro, ]
Error: object 'Dados_Procesados' not found
What can I do? Subsetting the table didn't work either.
I was expecting something like:
        Fêmea Macho
Janeiro     6     4
I need it that way so I can run a G-test for each month, to estimate the sex ratio and test whether there are significant differences.
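Two things seem to be going on here: the error comes from a spelling mismatch (the table was built from Dados_procesados with a lower-case p, so Dados_Procesados doesn't exist), and the month has to be a quoted string. A minimal sketch, assuming the lower-case spelling is the real object and, for the G-test, that the DescTools package is available:
# One month's row of the contingency table, kept two-dimensional with drop = FALSE
FreqJan <- FreqAnual["Janeiro", , drop = FALSE]
FreqJan
#         Fêmea Macho
# Janeiro     6     4
# Subsetting the raw data frame instead: quote the month and qualify the column
Jan_rows <- Dados_procesados[Dados_procesados$Mes == "Janeiro", ]
# Hypothetical monthly G-test of a 1:1 sex ratio (assumes DescTools is installed)
DescTools::GTest(FreqAnual["Janeiro", ])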

How to correctly merge two files and count values before Fisher's test in R?

I am very new to R, so I apologise if this looks simple to someone.
I am trying to join two files and then perform a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in casefile compared to controlfile.
casefile:
GENE CASE_COUNT_HET CASE_COUNT_CH CASE_COUNT_HOM CASE_TOTAL_AC
ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
controlfile:
GENE CASE_COUNT_HET CASE_COUNT_CH CASE_COUNT_HOM CASE_TOTAL_AC
ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
ENSG00000174776 1 1 0 2
ENSG00000076864 1 0 1 13
ENSG00000086015 1 0 1 25
I have this script:
#!/usr/bin/env Rscript
library("argparse")
suppressPackageStartupMessages(library("argparse"))
parser <- ArgumentParser()
parser$add_argument("--casefile", action="store")
parser$add_argument("--casesize", action="store", type="integer")
parser$add_argument("--controlfile", action="store")
parser$add_argument("--controlsize", action="store", type="integer")
parser$add_argument("--outfile", action="store")
args <- parser$parse_args()
case.dat<-read.delim(args$casefile, header=T, stringsAsFactors=F, sep="\t")
names(case.dat)[1]<-"GENE"
control.dat<-read.delim(args$controlfile, header=T, stringsAsFactors=F, sep="\t")
names(control.dat)[1]<-"GENE"
dat<-merge(case.dat, control.dat, by="GENE", all.x=T, all.y=T)
dat[is.na(dat)]<-0
dat$P_DOM<-0
dat$P_REC<-0
for(i in 1:nrow(dat)){
  # Dominant model
  case_count <- dat[i,]$CASE_COUNT_HET + dat[i,]$CASE_COUNT_HOM
  control_count <- dat[i,]$CONTROL_COUNT_HET + dat[i,]$CONTROL_COUNT_HOM
  if(case_count > args$casesize){
    case_count <- args$casesize
  } else if(case_count < 0){
    case_count <- 0
  }
  if(control_count > args$controlsize){
    control_count <- args$controlsize
  } else if(control_count < 0){
    control_count <- 0
  }
  mat <- cbind(c(case_count, (args$casesize - case_count)), c(control_count, (args$controlsize - control_count)))
  dat[i,]$P_DOM <- fisher.test(mat, alternative="greater")$p.value
The problem starts here:
case_count <- dat[i,]$CASE_COUNT_HET + dat[i,]$CASE_COUNT_HOM
control_count <- dat[i,]$CONTROL_COUNT_HET + dat[i,]$CONTROL_COUNT_HOM
Both case_count and control_count come out as NULL, even though the corresponding columns in both input files are NOT empty.
When I ran the script with fixed numbers (1000 and 2000) assigned to case_count and control_count, it worked without issues.
The main purpose of the code:
https://github.com/mhguo1/TRAPD
Run burden testing: This script will run the actual burden testing. It performs a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in cases as compared to controls for each gene. It will perform this burden testing under a dominant and a recessive model.
It requires R; the script was tested using R v3.1, but any version of R should work. The script should be run as:
Rscript burden.R --casefile casecounts.txt --casesize 100 --controlfile controlcounts.txt --controlsize 60000 --output burden.out.txt
The script has 5 required options:
--casefile: Path to the counts file for the cases, as generated in Step 2A
--casesize: Number of cases that were tested in Step 2A
--controlfile: Path to the counts file for the controls, as generated in Step 2B
--controlsize: Number of controls that were tested in Step 2B. If using ExAC or gnomAD, please refer to the respective documentation for total sample size
--output: Output file path/name
Output: A tab-delimited file with 10 columns:
#GENE: Gene name
CASE_COUNT_HET: Number of cases carrying heterozygous qualifying variants in a given gene
CASE_COUNT_CH: Number of cases carrying potentially compound heterozygous qualifying variants in a given gene
CASE_COUNT_HOM: Number of cases carrying homozygous qualifying variants in a given gene
CASE_TOTAL_AC: Total AC for a given gene
CONTROL_COUNT_HET: Approximate number of controls carrying heterozygous qualifying variants in a given gene
CONTROL_COUNT_HOM: Number of controls carrying homozygous qualifying variants in a given gene
CONTROL_TOTAL_AC: Total AC for a given gene
P_DOM: p-value under the dominant model
P_REC: p-value under the recessive model
I am trying to run a genetic variant burden test with VCF files and external gnomAD controls. I found this repo suitable and am now trying to fix the bugs in it.
As a newbie in R statistics, I will be happy about any suggestion. Thank you!
If you want all rows from both files, you can use a full join with by = "GENE" and whatever suffixes you like:
library(dplyr)
z <- full_join(case_file, control_file, by = "GENE", suffix = c(".CASE", ".CONTROL"))
GENE CASE_COUNT_HET.CASE CASE_COUNT_CH.CASE CASE_COUNT_HOM.CASE CASE_TOTAL_AC.CASE
1 ENSG00000124209 1 0 0 1
2 ENSG00000064703 1 1 0 9
3 ENSG00000171408 1 0 0 1
4 ENSG00000110514 1 1 1 12
5 ENSG00000247077 1 1 1 7
6 ENSG00000174776 NA NA NA NA
7 ENSG00000076864 NA NA NA NA
8 ENSG00000086015 NA NA NA NA
CASE_COUNT_HET.CONTROL CASE_COUNT_CH.CONTROL CASE_COUNT_HOM.CONTROL CASE_TOTAL_AC.CONTROL
1 1 0 0 1
2 1 1 0 9
3 1 0 0 1
4 1 1 1 12
5 1 1 1 7
6 1 1 0 2
7 1 0 1 13
8 1 0 1 25
If you want only the GENEs that appear in both files, use inner_join:
z <- inner_join(case_file, control_file, by = "GENE", suffix = c(".CASE", ".CONTROL"))
GENE CASE_COUNT_HET.CASE CASE_COUNT_CH.CASE CASE_COUNT_HOM.CASE CASE_TOTAL_AC.CASE
1 ENSG00000124209 1 0 0 1
2 ENSG00000064703 1 1 0 9
3 ENSG00000171408 1 0 0 1
4 ENSG00000110514 1 1 1 12
5 ENSG00000247077 1 1 1 7
CASE_COUNT_HET.CONTROL CASE_COUNT_CH.CONTROL CASE_COUNT_HOM.CONTROL CASE_TOTAL_AC.CONTROL
1 1 0 0 1
2 1 1 0 9
3 1 0 0 1
4 1 1 1 12
5 1 1 1 7
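As for why case_count and control_count come out NULL in the original script: in the files as posted, the control file reuses the CASE_COUNT_* headers, so merge() disambiguates the duplicated names with .x/.y suffixes and no CONTROL_COUNT_* columns ever exist; $ on a missing column silently returns NULL. A minimal sketch of one possible fix, assuming the headers really are identical in both files:
# Rename the control columns before merging so the CONTROL_COUNT_* lookups succeed
names(control.dat) <- sub("^CASE_", "CONTROL_", names(control.dat))
dat <- merge(case.dat, control.dat, by = "GENE", all = TRUE)
dat[is.na(dat)] <- 0
# dat$CONTROL_COUNT_HET and dat$CONTROL_COUNT_HOM are now ordinary numeric columns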

How can I create a vector with values that are obtained by a function that returns different values for every row?

I have a function club_points(club) that returns the total points of a club. Now I want to make a data frame with the clubs in the rows and the club_points value of each respective club in a column. Is there a way to iterate my function so that the points are automatically assigned in the same row as the club?
After some research I believe I have to use the apply family... but since I am new I don't know how to do it.
teams total_points
1 Rio Ave 0
2 Moreirense 0
3 Sp Lisbon 0
4 Tondela 0
5 Boavista 0
6 Guimaraes 0
7 Setubal 0
8 Estoril 0
9 Belenenses 0
10 Chaves 0
11 Maritimo 0
12 Pacos Ferreira 0
13 Porto 0
14 Arouca 0
15 Benfica 0
16 Feirense 0
17 Sp Braga 0
18 Nacional 0
This is the current format of my data frame final_pos; I would like to fill the total_points column by iterating the club_points function over it.
Do you mean something like
final_pos$total_points <- Vectorize(club_points, "club")(final_pos$teams)
or
final_pos$total_points <- sapply(final_pos$teams,club_points)
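Both lines do the same job: they call club_points once per team and collect the results into a vector of the same length, which is assigned as the column. A self-contained sketch using a hypothetical stand-in for club_points, since the real function isn't shown in the question:
club_points <- function(club) nchar(club)  # hypothetical stand-in for the real function
final_pos <- data.frame(teams = c("Porto", "Benfica", "Sp Braga"), stringsAsFactors = FALSE)
final_pos$total_points <- sapply(final_pos$teams, club_points)
final_pos
#      teams total_points
# 1    Porto            5
# 2  Benfica            7
# 3 Sp Braga            8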

R 3.4.1 gsub on Windows 10 - find and replace all strings except for

I am trying to clean up data for a class project. The data deals with NOAA storm data from 1950 to 2011. The storm types (EVTYPE) are supposed to be only 48 different levels, but there are over 1000 unique entries. I am trying to find all the snow-related entries, which gives me:
table(grep("snow", temp$EVTYPE, ignore.case = TRUE, value = TRUE))
ACCUMULATED.SNOWFALL BLOWING.SNOW COLD.AND.SNOW DRIFTING.SNOW
4 5 1 1
EARLY.SNOWFALL EXCESSIVE.SNOW FALLING.SNOW.ICE FIRST.SNOW
7 25 2 2
HEAVY.SNOW HEAVY.SNOW.SHOWER HEAVY.SNOW.SQUALLS ICE.SNOW
13988 1 1 4
LAKE.EFFECT.SNOW LATE.SEASON.SNOW LATE.SEASON.SNOWFALL LATE.SNOW
656 1 3 2
LIGHT.SNOW LIGHT.SNOW.FLURRIES LIGHT.SNOW.FREEZING.PRECIP LIGHT.SNOWFALL
174 3 1 1
MODERATE.SNOW MODERATE.SNOWFALL MONTHLY.SNOWFALL MOUNTAIN.SNOWS
1 101 1 1
RECORD.MAY.SNOW RECORD.SNOW RECORD.SNOWFALL RECORD.WINTER.SNOW
1 2 2 3
SEASONAL.SNOWFALL SNOW SNOW.ACCUMULATION SNOW.ADVISORY
1 425 1 1
SNOW.AND.ICE SNOW.AND.SLEET SNOW.BLOWING.SNOW SNOW.DROUGHT
4 5 6 4
SNOW.ICE SNOW.SHOWERS SNOW.SLEET SNOW.SQUALL
1 5 5 5
SNOW.SQUALLS THUNDERSNOW.SHOWER UNUSUALLY.LATE.SNOW
14 1 1
There is a storm type called "Lake.Effect.Snow", which is one of the 48 storm types. How can I replace all of the other snow entries while excluding that particular storm type? I've tried:
table(grep("([^lake]?)snow", temp$EVTYPE, ignore.case = TRUE, value = TRUE))
to try to ignore the Lake.Effect.Snow entries, but no good.
Use stringr::str_detect with ifelse (there is no if.else; the base R function is ifelse). Note that [^lake] in your pattern is a character class matching any single character other than l, a, k, or e, so it cannot exclude the word "lake". Also, the entries in your table are upper-case, so compare against "LAKE.EFFECT.SNOW" or make the comparison case-insensitive:
library("stringr")
temp$EVTYPE <- ifelse(str_detect(temp$EVTYPE, regex("snow", ignore_case = TRUE)) & toupper(temp$EVTYPE) != "LAKE.EFFECT.SNOW", "Snow", temp$EVTYPE)
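The same logic works in base R without the stringr dependency, using logical indexing to overwrite only the matching rows:
snow_rows <- grepl("snow", temp$EVTYPE, ignore.case = TRUE) &
             toupper(temp$EVTYPE) != "LAKE.EFFECT.SNOW"
temp$EVTYPE[snow_rows] <- "Snow"  # every snow entry except lake-effect snow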

Imputation for longitudinal data using observation before and after missing data

I'm in the process of cleaning some longitudinal data and I have several missing cases. I am trying to use an imputation that incorporates the observations before and after each missing case. I'm wondering how I can address the issues detailed below.
I've been trying to break the problem apart into smaller, more manageable operations and objects; however, the solutions I keep coming to force me to use conditional logic based on the rows immediately above and below a missing value, and, quite frankly, I'm at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or of any good search terms for looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
The goal here is to find a way to take the mean of the values before (3) and after (0) the NA value for ID #1 (variable ss), so that the data look like this: 1,3,2,3,1.5,0,0
ID #2 (variable ss) should look like this: 2,4,0,0,0,0,0
ID #3 (variable ss) should use a last observation carried forward approach, so it would need to look like this: 4,1,2,4,2,3,3
ID #4 (variable ss) has two consecutive NA values and should not be changed. It will be flagged for a different analysis later in my project. So, it should look like this: 2,1,0,NA,NA,0,0 (no change).
I use the smwrBase package; the syntax for filling in only one missing value is below, but it doesn't account for id:
smwrBase::fillMissing(ss, max.fill=1)
The zoo package might be more standard, but it has the same issue:
zoo::na.approx(ss, maxgap=1)
Below is an approach that accounts for the id variable. Current interpolation approaches don't like to fill in the last value, so I added a manual if statement for that. It's a bit brute force; there might be a tapply approach out there.
> id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
> time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
> ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
> mydat <- data.frame(id, time, ss, ss2=NA_real_)
> for (i in unique(id)) {
+ # interpolate for gaps
+ mydat$ss2[mydat$id==i] <- zoo::na.approx(ss[mydat$id==i], maxgap=1, na.rm=FALSE)
+ # extension for gap as last value
+ if(is.na(mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])])) {
+ mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])] <-
+ mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])-1]
+ }
+ }
> mydat
id time ss ss2
1 1 0 1 1.0
2 1 1 3 3.0
3 1 2 2 2.0
4 1 3 3 3.0
5 1 4 NA 1.5
6 1 5 0 0.0
7 1 6 0 0.0
8 2 0 2 2.0
9 2 1 4 4.0
10 2 2 0 0.0
11 2 3 NA 0.0
12 2 4 0 0.0
13 2 5 0 0.0
14 2 6 0 0.0
15 3 0 4 4.0
16 3 1 1 1.0
17 3 2 2 2.0
18 3 3 4 4.0
19 3 4 2 2.0
20 3 5 3 3.0
21 3 6 NA 3.0
22 4 0 2 2.0
23 4 1 1 1.0
24 4 2 0 0.0
25 4 3 NA NA
26 4 4 NA NA
27 4 5 0 0.0
28 4 6 0 0.0
The interpolated value for id=1 is 1.5 (the average of 3 and 0), for id=2 it is 0 (the average of 0 and 0), and for id=3 it is 3 (the preceding value, since there is no following value).
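Picking up the tapply remark: the same per-id logic can go into a small helper applied with base R's ave(), which splits a vector by group and reassembles the results in the original row order. A minimal sketch, assuming zoo is installed and every id has at least two observations:
fill_ss <- function(x) {
  out <- zoo::na.approx(x, maxgap = 1, na.rm = FALSE)  # interpolate single interior gaps only
  n <- length(out)
  if (is.na(out[n]) && !is.na(out[n - 1])) out[n] <- out[n - 1]  # carry forward a lone trailing NA
  out
}
mydat$ss2 <- ave(mydat$ss, mydat$id, FUN = fill_ss)
Runs of two or more NAs, as in id 4, stay NA: maxgap = 1 refuses them, and the trailing branch only fires when the neighbouring value is observed.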

Optimization of an R loop taking 18 hours to run

I've got R code that works and does what I want, but it takes a huge amount of time to run. Here is an explanation of what the code does, followed by the code itself.
I've got a vector of 200,000 lines containing street addresses (strings): data.
Example :
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a 131x2 matrix of string elements, which are the 5-grams (word fragments) and the ids of the bags of n-grams (example of a 5-gram bag: ["stack", "tacko", "ackov", "ckove", "kover", "overf", ...]): list_ngrams.
Example of list_ngrams:
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I also have a 200000x31 numerical matrix initialized with 0: idv_x_bags.
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop over the string addresses and check whether each one contains any of the n-grams in my list. If it does, I put a 1 in the corresponding column, which represents the id of the bag containing that 5-gram.
Example:
Take the address "15 rue andre lalande residence marguerite yourcenar 91000 evry france". The word "residence" occurs in the bag ["resid","eside","dence",...], whose id is 5, so I put a 1 in the column named 5. The corresponding row of the idv_x_bags matrix then looks like this:
> idv_x_bags[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does this:
idv_x_bags <- matrix(rep(0, nrow(data)*31), nrow=nrow(data), ncol=31)
colnames(idv_x_bags) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for(i in 1:nrow(idv_x_bags))
{
  for(ngram in list_ngrams$ngram)
  {
    if(grepl(ngram, data[i,]) == TRUE)
    {
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'", ngram))[[1]]
      idv_x_bags[i, as.character(idSac)] <- 1
    }
  }
}
The code does exactly what I want, but it takes about 18 hours, which is huge. I tried to recode it in C++ using the Rcpp library but ran into many problems. I also tried to recode it using apply, but couldn't get it to work.
Here is what I did :
apply(cbind(data, 1:nrow(data)), 1, function(x){
  apply(list_ngrams, 1, function(y){
    if(grepl(y[2], x[1]) == TRUE){ idv_x_bags[x[2], str_trim(as.character(y[1]))] <- 1 }
  })
})
I need some help coding my loop with apply or some other method that runs faster than the current one. Thank you very much.
Check this one, and run the simple example step by step to see how it works.
My n-grams don't make much sense, but it will work with real n-grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
data.frame() %>%
rename(sen = Var1,
id_ngr = Var2) %>%
left_join(dt_ngr, by = "id_ngr") %>%
group_by(sen, id_ngr,ngr) %>%
do(data.frame(match = grepl(.$ngr,.$sen))) %>%
group_by(sen,id_ngr) %>%
summarise(sum_success = sum(match)) %>%
mutate(match = ifelse(sum_success > 0,1,0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1
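Coming back to the original 18-hour loop: most of the time goes into 200,000 x 131 scalar grepl() calls plus a sqldf query for every hit. A minimal base-R sketch of the usual fix is to loop over the 131 n-grams instead, since grepl() is already vectorized over a whole character vector; the data$address column name is an assumption based on the example output above:
addresses <- as.character(data$address)  # assumed column name; adjust to your data
bag_ids <- sort(unique(list_ngrams$idSac))
idv_x_bags <- matrix(0, nrow = length(addresses), ncol = length(bag_ids),
                     dimnames = list(NULL, bag_ids))
for (j in seq_len(nrow(list_ngrams))) {
  # fixed = TRUE because the 5-grams are literal substrings, not regular expressions
  hits <- grepl(list_ngrams$ngram[j], addresses, fixed = TRUE)
  idv_x_bags[hits, as.character(list_ngrams$idSac[j])] <- 1
}
This replaces roughly 26 million scalar grepl() calls (and all the sqldf lookups) with 131 vectorized passes over the address vector, which should bring the runtime down from hours to seconds or minutes.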
