I tried ps with different kinds of switches e.g. -A, aux, ef, and so forth but I cannot seem to find the right combination of switches that will tell me the Process ID (PID), Parent Process ID (PPID), Process Group ID (PGID), and the Session ID (SID) of a process in the same output.
Here you go:
$ ps xao pid,ppid,pgid,sid | head
PID PPID PGID SID
1 0 1 1
2 0 0 0
3 2 0 0
6 2 0 0
7 2 0 0
21 2 0 0
22 2 0 0
23 2 0 0
24 2 0 0
If you want to see the process name as well, use this:
$ ps xao pid,ppid,pgid,sid,comm | head
PID PPID PGID SID COMMAND
1 0 1 1 init
2 0 0 0 kthreadd
3 2 0 0 ksoftirqd/0
6 2 0 0 migration/0
7 2 0 0 watchdog/0
21 2 0 0 cpuset
22 2 0 0 khelper
23 2 0 0 kdevtmpfs
24 2 0 0 netns
I have a CSV file containing 600k lines and 3 columns: the first contains a disease name, the second a gene, and the third a number, something like the sample below. I have roughly 4k diseases and 16k genes, so the disease and gene names are often repeated.
cholera xx45 12
Cancer xx65 1
cholera xx65 0
I would like to make a DTM (document-term matrix) using R. I've been trying to use the Corpus command from the tm library, but Corpus doesn't reduce the number of diseases and the size stays at about 600k. I'd love to understand how to transform that file into a DTM.
I'm sorry for not being that precise; I'm just starting out with computer science as a bio guy :)
Cheers!
If you're not concerned with the number in the third column, then you can accomplish what I think you're trying to do using only the first two columns (gene and disease).
Example with some simulated data:
library(data.table)
# Create a table with 100k gene/disease pairs: 10k random gene names
# (recycled 10 times to fill the rows) and 40 different diseases
df <- data.frame(
  gene = sapply(1:10000, function(x) paste(c(sample(LETTERS, size = 2), sample(10, size = 1)), collapse = "")),
  disease = sample(40, size = 100000, replace = TRUE)
)
table(df) creates a large matrix, nGenes rows long and nDiseases columns wide. Here are just the first 6 rows (because it's so large and sparse):
head(table(df))
disease
gene 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
AB10 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
AB2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 1 0 1 0 1
AB3 0 1 0 0 2 1 1 0 0 1 0 0 0 0 0 2 1 0 0 1 0 0 1 0 3 0 1
AB4 0 0 1 0 0 1 0 2 1 1 0 1 0 0 1 1 1 1 0 1 0 2 0 0 0 1 1
AB5 0 1 0 1 0 0 2 2 0 1 1 1 0 1 0 0 2 0 0 0 0 0 0 1 1 1 0
AB6 0 0 2 0 2 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0
disease
gene 28 29 30 31 32 33 34 35 36 37 38 39 40
AB10 0 0 1 2 1 0 0 1 0 0 0 0 0
AB2 0 0 0 0 0 0 0 0 0 0 0 0 0
AB3 0 0 1 1 1 0 0 0 0 0 1 1 0
AB4 0 0 1 2 1 1 1 1 1 2 0 3 1
AB5 0 2 1 1 0 0 3 4 0 1 1 0 2
AB6 0 0 0 0 0 0 0 1 0 0 0 0 0
Alternatively, you can exclude the counts of 0 and only include combinations that actually exist. Easy aggregation can be done with data.table, e.g. (continuing from the above example)
library(data.table)
dt <- data.table(df)
dt[, .N, by=list(gene, disease)]
which gives a frequency table like the following:
gene disease N
1: HA5 20 2
2: RF9 10 3
3: SD8 40 2
4: JA7 35 4
5: MJ2 1 2
---
75872: FR10 26 1
75873: IC5 40 1
75874: IU2 20 1
75875: IG5 13 1
75876: DW7 21 1
I am a rookie in R, and I think my questions are basic ones. I want to know the frequency of a variable under a couple of conditions. I tried to use table() but it did not work. I have searched a lot, but I still cannot find the answers.
My data looks like this
ID AGE LEVEL End_month
1 14 1 201005
2 25 2 201006
3 17 2 201006
4 16 1 201008
5 19 3 201007
6 33 2 201008
7 17 2 201006
8 15 3 201005
9 23 1 201004
10 25 2 201007
I want to know two things.
First, I want to know the frequency of each age under the different levels. Ages within a certain range are shown individually and the rest are aggregated into one category. It looks like this.
level
1 2 3 sum
age 14 1 0 0 1
16 1 0 0 1
15 0 0 1 1
17 0 2 0 2
19 0 0 1 1
20+ 1 3 0 4
sum 3 5 2 10
Second, I want to know the frequency of each age in each End_month for level 2 and level 3 customers. I want to get a table like this.
For level 2 customer
End_month
201004 201005 201006 201007 201008 sum
age 15 0 0 0 0 0 0
19 0 0 0 0 0 0
17 0 0 2 0 0 2
19 0 0 0 0 0 0
25 0 0 0 1 0 1
33 0 0 0 1 1 2
sum 0 0 2 2 1 5
For level 3 customer
End_month
201004 201005 201006 201007 201008 sum
age 15 0 1 0 0 0 1
19 0 0 0 1 0 1
17 0 0 0 0 0 0
19 0 0 0 0 0 0
25 0 0 0 0 0 0
33 0 0 0 0 0 0
sum 0 1 0 1 0 2
Many thanks in advance.
You can still achieve this with table, because it can take more than one variable.
For example, use
table(AGE, LEVEL)
to get the first two-way table.
Now, when you want to produce such a table for each subset according to LEVEL, you can do it this way, assuming we are going for level 1 (use 2 or 3 for the other levels):
keep <- LEVEL == 1
table(AGE[keep], End_month[keep])
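As a rough, self-contained sketch of how this could look on the ten rows from the question: the data frame name df, the age_group helper and the "20+" cut-off at age 20 are assumptions made for illustration. addmargins() adds the row and column totals (labelled "Sum").

```r
# The ten sample rows from the question.
df <- data.frame(
  ID        = 1:10,
  AGE       = c(14, 25, 17, 16, 19, 33, 17, 15, 23, 25),
  LEVEL     = c(1, 2, 2, 1, 3, 2, 2, 3, 1, 2),
  End_month = c(201005, 201006, 201006, 201008, 201007,
                201008, 201006, 201005, 201004, 201007)
)

# First table: ages below 20 stay as they are, the rest become "20+".
age_group <- ifelse(df$AGE >= 20, "20+", as.character(df$AGE))
tab <- addmargins(table(age = age_group, level = df$LEVEL))
print(tab)

# Second table: one AGE x End_month table per level. Turning End_month
# into a factor first keeps every month as a column in every subset.
df$End_month <- factor(df$End_month)
by_level <- lapply(split(df, df$LEVEL), function(d) {
  addmargins(table(age = d$AGE, End_month = d$End_month))
})
print(by_level[["2"]])
```

Here by_level[["3"]] gives the corresponding table for level 3 customers. Only ages that occur within a level appear as rows; converting AGE to a factor as well would keep the same rows in every table.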
Given a dataset in the following form:
> Test
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0
23 39045 0 0 0
I can compress these data to remove zero rows with the following code:
a=subset(Test, Total!=0)
> a
Pos Watson Crick Total
4 39026 2 1 3
6 39028 0 4 4
8 39030 0 1 1
12 39034 1 0 1
15 39037 3 0 3
16 39038 2 0 2
18 39040 0 1 1
How would I code the reverse transformation? i.e. To convert dataframe a back into the original form of Test.
More specifically: without any access to the original data, how would I re-expand the data (to include all sequential "Pos" rows) for an arbitrary range of Pos?
Here, the ID column is irrelevant. In a real example, the ID numbers are just row numbers created by R. In a real example, the compressed dataset will have sequential ID numbers.
Here's another possibility, using base R. Unless you explicitly provide the initial and the final value of Pos, the first and the last index value in the restored dataframe will correspond to the values given in the "compressed" dataframe a:
restored <- data.frame(Pos=(a$Pos[1]:a$Pos[nrow(a)])) # change range if required
restored <- merge(restored,a, all=TRUE)
restored[is.na(restored)] <- 0
#> restored
# Pos Watson Crick Total
#1 39026 2 1 3
#2 39027 0 0 0
#3 39028 0 4 4
#4 39029 0 0 0
#5 39030 0 1 1
#6 39031 0 0 0
#7 39032 0 0 0
#8 39033 0 0 0
#9 39034 1 0 1
#10 39035 0 0 0
#11 39036 0 0 0
#12 39037 3 0 3
#13 39038 2 0 2
#14 39039 0 0 0
#15 39040 0 1 1
Possibly the last step can be combined with the merge function by using the na.action option correctly, but I didn't find out how.
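For an arbitrary range of Pos, the same three steps can be wrapped in a small helper. This is a sketch only: the function name expand_pos and its from/to arguments are made up for illustration, and a is rebuilt here from the compressed data shown in the question.

```r
# Compressed data from the question (rows where Total != 0).
a <- data.frame(
  Pos    = c(39026, 39028, 39030, 39034, 39037, 39038, 39040),
  Watson = c(2, 0, 0, 1, 3, 2, 0),
  Crick  = c(1, 4, 1, 0, 0, 0, 1),
  Total  = c(3, 4, 1, 1, 3, 2, 1)
)

# Hypothetical helper: re-expand `a` to every Pos in from:to.
expand_pos <- function(a, from = min(a$Pos), to = max(a$Pos)) {
  full <- data.frame(Pos = from:to)
  # all.x = TRUE keeps every requested position; rows of `a` that fall
  # outside from:to are dropped.
  out <- merge(full, a, by = "Pos", all.x = TRUE)
  out[is.na(out)] <- 0
  out
}

restored <- expand_pos(a, from = 39023, to = 39045)
```

Calling expand_pos(a) without a range reproduces the result above, from the first to the last Pos present in a.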
You need to know at least the Pos values you want to fill in. Then, it's a combination of join and mutate operations in dplyr.
Test <- read.table(text = "
Pos Watson Crick Total
1 39023 0 0 0
2 39024 0 0 0
3 39025 0 0 0
4 39026 2 1 3
5 39027 0 0 0
6 39028 0 4 4
7 39029 0 0 0
8 39030 0 1 1
9 39031 0 0 0
10 39032 0 0 0
11 39033 0 0 0
12 39034 1 0 1
13 39035 0 0 0
14 39036 0 0 0
15 39037 3 0 3
16 39038 2 0 2
17 39039 0 0 0
18 39040 0 1 1
19 39041 0 0 0
20 39042 0 0 0
21 39043 0 0 0
22 39044 0 0 0")
library(dplyr)
Nonzero <- Test %>% filter(Total > 0)
All_Pos <- Test %>% select(Pos)
Reconstruct <-
All_Pos %>%
left_join(Nonzero, by = "Pos") %>%
mutate(across(c(Watson, Crick, Total), ~ ifelse(is.na(.x), 0, .x)))
In my code, All_Pos contains all valid positions as a one-column data frame; the mutate() call converts NA values to zeros (across() needs dplyr 1.0.0 or later). If you only know the largest position, MaxPos, you can construct it using
All_Pos <- data.frame(Pos = seq_len(MaxPos))
I have data from a discrete choice experiment (DCE), looking at hiring preferences for individuals from different sectors, that I've formatted into long format. I want to model it using mlogit. I have exported the data and can successfully run the model in Stata using the asclogit command, but I'm having trouble getting it to run in R.
Here's a snapshot of the first 25 rows of data:
> data[1:25,]
userid chid item sector outcome cul fit ind led prj rel
1 11275 211275 2 1 1 0 1 0 1 1 1
2 11275 211275 2 2 0 1 0 0 0 0 0
3 11275 211275 2 0 0 0 0 1 1 0 1
4 11275 311275 3 0 1 1 1 0 0 0 1
5 11275 311275 3 2 0 0 1 0 0 0 1
6 11275 311275 3 1 0 0 1 0 0 0 0
7 11275 411275 4 0 0 1 0 1 1 0 0
8 11275 411275 4 2 1 0 1 1 1 1 0
9 11275 411275 4 1 0 0 1 0 1 0 0
10 11275 511275 5 1 1 1 0 1 0 1 1
11 11275 511275 5 2 0 0 0 1 1 0 0
12 11275 511275 5 0 0 0 0 1 1 1 0
13 11275 611275 6 0 0 0 1 1 0 0 1
14 11275 611275 6 1 1 1 1 1 0 0 1
15 11275 611275 6 2 0 1 1 1 0 1 0
16 11275 711275 7 1 0 0 0 0 0 1 0
17 11275 711275 7 0 0 1 0 0 1 1 0
18 11275 711275 7 2 1 1 0 0 1 1 1
19 11275 811275 8 0 1 0 1 0 0 1 1
20 11275 811275 8 1 0 1 0 1 1 1 1
21 11275 811275 8 2 0 0 0 0 0 1 1
22 11275 911275 9 0 0 1 1 0 0 1 0
23 11275 911275 9 2 1 1 1 1 1 0 1
24 11275 911275 9 1 0 1 0 1 1 0 0
25 11275 1011275 10 0 0 0 0 0 0 0 0
userid and chid are factor variables, the rest are numeric. The variables:
Userid is unique respondent ID
chid is unique choice set ID per respondent
item is choice set ID (they are repeated across respondents)
sector is alternatives (3 different sectors)
outcome is alternative selected by respondent in the given choice set
cul through rel are binary, alternative-specific factor variables that vary across alternatives according to the experimental design.
Here is my mlogit syntax:
mlogit(outcome~cul+fit+ind+led+prj+rel,shape="long",
data=data,id.var=userid,chid.var="chid",
choice=outcome,alt.var="sector")
Here is the error I get:
Error in if (abs(x - oldx) < ftol) { :
missing value where TRUE/FALSE needed
I've made sure there are no missing data, and that each choice set has exactly 1 selected alternative.
Any ideas about why I'm getting this error, when the model runs fine in Stata with the exact same dataset? I've probably misread the mlogit syntax somewhere. If it helps, my Stata syntax is:
asclogit outcome cul fit rel ind fit led prj, case(chid) alternatives(sector)
Answering my own question here as I figured it out.
R's mlogit can't handle a choice set in which none of the alternatives is selected. R also needs the data ordered properly: each alternative in a choice set must be in a row. I hadn't done that due to some data management. Interestingly, Stata can handle both of these conditions, so that's why my Stata commands worked.
As an aside, for those interested, Stata's asclogit and R's mlogit give the exact same results. Always nice when that happens.
You may need to use mlogit.data() to shape the data. There are examples at ?mlogit. Hope that helps.
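Both conditions from the self-answer can be checked in base R before the data go anywhere near mlogit. The following is a minimal sketch, not the asker's actual clean-up: the first two choice sets are taken from the question's data, while choice set 999 is a made-up example of a set with no selected alternative.

```r
# Toy long-format data: one row per alternative per choice set.
# Choice set 999 is hypothetical and has no alternative selected.
d <- data.frame(
  chid    = c(211275, 211275, 211275, 311275, 311275, 311275, 999, 999, 999),
  sector  = c(1, 2, 0, 0, 2, 1, 0, 1, 2),
  outcome = c(1, 0, 0, 1, 0, 0, 0, 0, 0)
)

# Number of selected alternatives per choice set; anything other than
# exactly 1 marks a problematic set.
n_chosen <- tapply(d$outcome, d$chid, sum)
bad_sets <- names(n_chosen)[n_chosen != 1]

# Drop the problematic sets, then keep each set's rows together with the
# alternatives in a consistent order.
clean <- d[!(d$chid %in% bad_sets), ]
clean <- clean[order(clean$chid, clean$sector), ]
```

Whether dropping such choice sets is appropriate depends on the study design; the point here is only to locate them.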
I have been using the textmatrix() function for a while to create DTMs which I can further use for LSI.
dirLSA<-function(dir){
dtm<-textmatrix(dir)
return(lsa(dtm))
}
textdir<-"C:/RProjects/docs"
dirLSA(textdir)
> tm
$matrix
D1 D2 D3 D4 D5 D6 D7 D8 D9
1. 000 2 0 0 0 0 0 0 0 0
2. 20 1 0 0 1 0 0 1 0 0
3. 200 1 0 0 0 0 0 0 0 0
4. 2014 1 0 0 0 0 0 0 0 0
5. 2015 1 0 0 0 0 0 0 0 0
6. 27 1 0 0 0 0 0 0 1 0
7. 30 1 0 0 0 1 0 1 0 0
8. 31 1 0 2 0 0 0 0 0 0
9. 40 1 0 0 0 0 0 0 0 0
10. 45 1 0 0 0 0 0 0 0 0
11. 500 1 0 0 0 0 0 1 0 0
12. 600 1 0 0 0 0 0 0 0 0
728. bias 0 0 0 2 0 0 0 0 0
729. biased 0 0 0 1 0 0 0 0 0
730. called 0 0 0 1 0 0 0 0 0
731. calm 0 0 0 1 0 0 0 0 0
732. cause 0 0 0 1 0 0 0 0 0
733. chauhan 0 0 0 2 0 0 0 0 0
734. chief 0 0 0 8 0 0 1 0 0
textmatrix() is a function that takes a directory (folder path) and returns document-wise term frequencies. These are used in further analysis such as Latent Semantic Indexing/Analysis (LSI/LSA).
However, I have now run into a new problem: I have tweet data in batch files (~500000 tweets/batch) and I want to carry out similar operations on this data.
I have code modules to clean up my data, and I want to pass the cleaned tweets directly to the LSI function. The problem is that textmatrix() does not support this.
I tried looking at other packages and code snippets, but that didn't get me any further. Is there any way I can create a line-term matrix of sorts?
I tried calling table(tokenize(cleanline[i])) in a loop, but it won't add new columns for words not already in the matrix. Any workaround?
Update: I just tried this:
a<-table(tokenize(cleanline[10]))
b<-table(tokenize(cleanline[12]))
df1<-data.frame(a)
df1
df2<-data.frame(b)
df2
merge(df1,df2, all=TRUE)
I got this:
> df1
Var1 Freq
1 6
2 " 2
3 and 1
4 home 1
5 mabe 1
6 School 1
7 then 1
8 xbox 1
> b<-table(tokenize(cleanline[12]))
> df2<-data.frame(b)
> df2
Var1 Freq
1 13
2 " 2
3 BillGates 1
4 Come 1
5 help 1
6 Mac 1
7 make 1
8 Microsoft 1
9 please 1
10 Project 1
11 really 1
12 version 1
13 wish 1
14 would 1
> merge(df1,df2)
Var1 Freq
1 " 2
> merge(df1,df2, all=TRUE)
Var1 Freq
1 6
2 13
3 " 2
4 and 1
5 home 1
6 mabe 1
7 School 1
8 then 1
9 xbox 1
10 BillGates 1
11 Come 1
12 help 1
13 Mac 1
14 make 1
15 Microsoft 1
16 please 1
17 Project 1
18 really 1
19 version 1
20 wish 1
21 would 1
I think I'm close.
Try something like this
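ll <- list(df1,df2)
# tag each table with its document index so the counts stay separate per document
ll <- Map(function(d, i) cbind(d, doc = i), ll, seq_along(ll))
dtm <- xtabs(Freq ~ Var1 + doc, data = do.call("rbind", ll))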
Something that works for me:
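textLSA<-function(text){
  # Term counts for the first line seed the merged table
  a<-data.frame(table(tokenize(text[1])))
  colnames(a)[2]<-paste(c("Line",1),collapse=' ')
  df<-a
  # Merge in the remaining lines; line 1 is already in df, so skip it
  for(i in seq_along(text)[-1]){
    a<-data.frame(table(tokenize(text[i])))
    colnames(a)[2]<-paste(c("Line",i),collapse=' ')
    df<-merge(df,a, all=TRUE)
  }
  # Terms missing from a line come back as NA after the merge; count them as 0
  df[is.na(df)]<-0
  # Rows are terms, columns are lines
  dtm<-as.matrix(df[,-1])
  rownames(dtm)<-df$Var1
  return(lsa(dtm))
}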
What do you think of this code?
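One thing that may matter at ~500000 tweets per batch is that merge() inside a loop rebuilds the whole table on every iteration. A possible alternative, sketched here in base R, is to tokenize everything once and cross-tabulate terms against a line index. The two example lines are made up, and strsplit() with a simple regular expression stands in for tokenize(), so the tokens will not match it exactly.

```r
# Two made-up cleaned lines standing in for cleanline.
tweets <- c("home and then xbox", "I wish Microsoft would make an xbox")

# Crude tokenizer (an assumption): lower-case, split on non-alphanumerics.
tokens <- strsplit(tolower(tweets), "[^a-z0-9]+")

# One entry per token, plus the index of the line it came from. The factor
# levels keep a column even for a line that yields no tokens.
terms <- unlist(tokens)
doc   <- factor(rep(seq_along(tokens), lengths(tokens)),
                levels = seq_along(tokens))

# Drop empty strings produced by leading separators, then cross-tabulate.
keep <- nzchar(terms)
dtm  <- unclass(table(term = terms[keep], doc = doc[keep]))
```

The result is a plain term-by-line count matrix, the same layout as the dtm built in textLSA() above.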