Create multiple subsets with for loops in R

I need to create several matrices based on two criteria: ideo and time.
Here is the part of my code that does not work. I think the problem is in how I use the list object to store the sums from each new subset, but I don't know how to index the list by i and j simultaneously. Should I use list(list())? Are there other ways to code this?
ideo.list<-list(f1,f2,f3,f4,f5,f6,f7,f8,f9)
time.list<-list(t1,t2,t3,t4)
dattime.list<-list()
for (j in 1:length(time.list)){
for (i in 1:length(ideo.list)){
dat.sub<-subset(dat,iyear %in% time.list[[j]] & Ideo %in% ideo.list[[i]])
dattime.list[[i*j]]<-apply(dat.sub[,5:13],2,sum)
}}
nn<-matrix(unlist(dattime.list), byrow=TRUE, ncol=9,nrow=length(dattime.list) )
The head of input data is below:
iyear Ideo Armed.Assault Assassination
1 1982 Separatist / New Regime Nationalist / Ethnic Nationalist 0 0
2 1994 Separatist / New Regime Nationalist / Ethnic Nationalist 0 0
3 1995 Left Wing Terrorist Groups (Anarchist) 0 0
4 2010 Racist Terrorist Groups 1 0
5 2013 Left Wing Terrorist Groups (Anarchist) 0 0
6 2014 Cell Strategy and Terrorist Groups 0 0
Bombing.Explosion Facility.Infrastructure.Attack Hijacking Hostage.Taking..Barricade.Incident.
1 1 0 0 0
2 0 1 0 0
3 0 1 0 0
4 0 0 0 0
5 0 1 0 0
6 0 1 0 0
Hostage.Taking..Kidnapping. Unarmed.Assault Unknown
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
Thank you for help!
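A note on the indexing in the loop above: i*j is not unique per combination (i=1, j=2 and i=2, j=1 both give 2), so some results overwrite each other and other slots stay NULL. A minimal sketch of one way around it, filling a preallocated matrix at row (j-1)*n.ideo+i. The toy dat, the two-element lists and the count.cols vector are made up for illustration; the real data would use columns 5:13.

```r
# Toy data standing in for the asker's `dat` (hypothetical values)
dat <- data.frame(
  iyear = c(1982, 1994, 1995, 2010),
  Ideo = c("A", "A", "B", "B"),
  Armed.Assault = c(0, 1, 0, 1),
  Bombing.Explosion = c(1, 0, 1, 1)
)
time.list <- list(1980:1994, 1995:2014)  # hypothetical year ranges
ideo.list <- list("A", "B")              # hypothetical ideology groups
count.cols <- c("Armed.Assault", "Bombing.Explosion")

n.ideo <- length(ideo.list)
# One row per (time, ideo) combination, one column per attack type
nn <- matrix(0, nrow = length(time.list) * n.ideo, ncol = length(count.cols),
             dimnames = list(NULL, count.cols))
for (j in seq_along(time.list)) {
  for (i in seq_along(ideo.list)) {
    dat.sub <- subset(dat, iyear %in% time.list[[j]] & Ideo %in% ideo.list[[i]])
    # (j - 1) * n.ideo + i is unique for every (i, j) pair, unlike i * j
    nn[(j - 1) * n.ideo + i, ] <- colSums(dat.sub[, count.cols, drop = FALSE])
  }
}
nn
```

A nested list (dattime.list[[j]][[i]]) would also work; the matrix just skips the unlist() step afterwards.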

Related

R function merging date data into one column

I'm still getting my head around R so please excuse my ignorance. I currently have a dataset with several columns of different dates:
Country.Region X2020.01.22 X2020.01.23 X2020.01.24
1 Afghanistan 0 0 0
2 Algeria 0 0 0
3 Andorra 0 0 0
4 Angola 0 0 0
5 Antigua and Barbuda 0 0 0
6 Argentina 0 0 0
X2020.01.25 X2020.01.26 X2020.01.27 X2020.01.28 X2020.01.29
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
I am trying to tidy the data by gathering all the date columns into one column and putting the "recovered" value in a separate column. Any suggestions on what function I should use? Ideally it would also let me convert the date column from character format to a date format.
I would really appreciate any help!
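For what it's worth, here is a rough base R sketch of the wide-to-long step with the date conversion. The small covid data frame and the output column names date and recovered are assumptions for illustration. In the tidyverse, tidyr::pivot_longer() does the same reshaping.

```r
# Small stand-in for the data shown above (values are made up)
covid <- data.frame(
  Country.Region = c("Afghanistan", "Algeria"),
  X2020.01.22 = c(0, 0),
  X2020.01.23 = c(0, 1),
  stringsAsFactors = FALSE
)

date.cols <- setdiff(names(covid), "Country.Region")

# Stack the date columns one after another: countries repeat once per date,
# each date repeats once per country, and unlist() walks column by column
long <- data.frame(
  Country.Region = rep(covid$Country.Region, times = length(date.cols)),
  date = rep(as.Date(date.cols, format = "X%Y.%m.%d"), each = nrow(covid)),
  recovered = unlist(covid[date.cols], use.names = FALSE)
)
long
```

The literal X in the format string matches the prefix R adds to column names that start with a digit.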

R design.matrix issue -- dropped column in design matrix?

I'm having an odd problem while trying to set up a design matrix to do downstream pairwise differential expression analysis on RNAseq data.
For the design matrix, I have both the donor information and each condition:
group<-factor(y$samples$group) #44 samples, 6 different conditions
sample<-factor(y$samples$samples) #44 samples, 11 different donors.
design<- model.matrix(~0+sample+group)
head(design)
Donor11.CD8 Donor12.CD8 Donor14.CD8 Donor15.CD8 Donor16.CD8
1 1 0 0 0 0
2 1 0 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0
Donor17.CD8 Donor18.CD8 Donor19.CD8 Donor20.CD8 Donor3.CD8
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
Donor4.CD8 Treatment2 Treatment3 Treatment4 Treatment5
1 0 0 0 0 0
2 0 0 0 0 1
3 0 0 0 1 0
4 0 0 0 0 0
5 0 0 1 0 0
6 0 1 0 0 0
Treatment6
1 1
2 0
3 0
4 0
5 0
6 0
The issue is that I seem to be losing a condition (treatment 1) when I form the design matrix, and I'm not sure why.
Many thanks, in advance, for your help!
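That's not a problem. Treatment 1 is indicated by zeros in all of the Treatment columns of the design matrix. Look at row 4: zero for Treatments 2 through 6, which means it is Treatment 1. This coding is called a "treatment contrast" because each Treatment coefficient in the model contrasts the named treatment against the "base" level, which in this case is Treatment1. Because the formula removes the intercept with ~0, the first factor (sample) keeps a column for every donor, and only the second factor (group) drops its base level.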
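A tiny example makes the coding easy to see. The two-level factors below are made up; only the formula matches the question.

```r
# Hypothetical miniature of the design: 2 donors, 2 treatments
sample <- factor(c("D1", "D1", "D2", "D2"))
group <- factor(c("T1", "T2", "T1", "T2"))

design <- model.matrix(~ 0 + sample + group)
colnames(design)
# "sampleD1" "sampleD2" "groupT2"

# Rows with the base level T1 have 0 in every group column
design[group == "T1", "groupT2"]

# A different base level can be chosen with relevel()
design2 <- model.matrix(~ 0 + sample + relevel(group, ref = "T2"))
```

With T2 as the reference, the dropped column is the T2 one instead, and the remaining group column marks T1.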

R - merge/combine columns with same name but some data values equal zero

First of all, I have a matrix of features and a data.frame of features from two separate text sources. On each of those, I have performed different text mining methods. Now, I want to combine them but I know some of them have columns with identical names like the following:
> dtm.matrix[1:10,66:70]
cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0
> dim(dtm.matrix)
[1] 14300 6543
And the second set looks like this:
> data1.sub[1:10,c(1,37:40)]
Data number cough coughing up blood dehydration dental abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0
> dim(data1.sub)
[1] 14300 168
I got this code from this topic but I'm new to R and I still need some help with it:
`data1.sub.merged <- dcast.data.table(merge(
## melt the first data.frame and set the key as ID and variable
setkey(melt(as.data.table(data1.sub), id.vars = "Data number"), "Data number", variable),
## melt the second data.frame
melt(as.data.table(dtm.matrix), id.vars = "Data number"),
## you'll have 2 value columns...
all = TRUE)[, value := ifelse(
## ... combine them into 1 with ifelse
(value.x == 0), value.y, value.x)],
## This is the reshaping formula
"Data number" ~ variable, value.var = "value")`
When I run this code, it returns a 1x6667 matrix and doesn't merge "cough" (or any other column) from the two data sets. I'm confused. Could you help me understand how this works?
There are many ways to do that, e.g. using base R, data.table or dplyr. The choice depends on the volume of your data. If you work with very large matrices (which is usually the case with natural language processing and bag-of-words representations), you may need to try several approaches and profile them to find the quickest one.
I did what you wanted with dplyr. It is a bit ugly but it works. I merge the two data frames, then loop over the variables that exist in both: sum the two copies (variable.x and variable.y) and then delete them. Note that I changed your column names slightly for reproducibility, but that shouldn't have any impact. Please let me know if that works for you.
df1 <- read.table(text =
' cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0')
df2 <- read.table(text =
' Data_number cough coughing_up_blood dehydration dental_abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0')
# Check what variables are common
common <- intersect(names(df1),names(df2))
# Set key IDs for data
df1$ID <- seq(1,nrow(df1))
df2$ID <- seq(1,nrow(df2))
# Merge dataframes
df <- merge(df1, df2,by = "ID")
# Sum and clean common variables left in merged dataframe
library(dplyr)
for (variable in common){
# Create a summed variable
df[[variable]] <- df %>% select(starts_with(paste0(variable,"."))) %>% rowSums()
# Delete columns with .x and .y suffixes
df <- df %>% select(-one_of(c(paste0(variable,".x"), paste0(variable,".y"))))
}
df
ID nasal sputum yellow intermitt Data_number coughing_up_blood dehydration dental_abscess cough
1 1 0 0 0 0 1 0 0 0 1
2 2 0 0 0 0 3 0 0 0 2
3 3 0 0 0 0 6 0 0 0 0
4 4 0 0 0 0 8 0 0 0 0
5 5 0 0 0 0 9 0 0 0 0
6 6 0 0 0 0 11 0 0 0 2
7 7 0 0 0 0 12 0 0 0 0
8 8 0 0 0 0 13 0 0 0 0
9 9 0 0 0 0 15 0 0 0 0
10 10 0 0 0 0 16 0 0 0 1
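Note that summing gives cough = 2 where both sources flagged the term. If a 0/1 indicator is wanted instead (closer to the ifelse() in the question), a possible base R variant takes the element-wise maximum. This is only a sketch on made-up three-row data, and like the code above it assumes the rows of the two sources are already in the same order.

```r
# Made-up miniature versions of the two feature sets, rows aligned
df1 <- data.frame(cough = c(1, 1, 0), nasal = c(0, 0, 1))
df2 <- data.frame(cough = c(0, 1, 0), dehydration = c(1, 0, 0))

common <- intersect(names(df1), names(df2))

# Start with the columns that appear in only one source
combined <- cbind(df1[setdiff(names(df1), common)],
                  df2[setdiff(names(df2), common)])

# For shared columns keep the larger value, so 1 wins over 0
for (v in common) combined[[v]] <- pmax(df1[[v]], df2[[v]])
combined
```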

sum or group specific columns based on clusters in r

So I have a data set of species and abundances, here is a sample of it:
aca.qua aca.bah aca.chi achi.lin alb.vul alu.mon ani.vir arc.rho asp.lun aux.roc bag.bag bag.mar bal.cap cal.cal cal.pen
1 0 0 0 0 5 0 57 0 0 0 0 0 0 0 16
2 0 0 1 0 2 0 3 0 0 0 0 8 0 0 0
3 0 0 0 0 1 0 3 0 0 0 0 0 0 0 3
4 0 0 0 0 5 0 0 0 22 0 0 94 0 0 0
5 0 0 0 0 1 0 0 0 0 2 3 2 0 0 1
6 0 0 0 0 0 0 0 1 0 0 2 2 0 0 0
I ran a cluster analysis on some of the species traits and came up with clusters in which each species should be included:
aca.qua aca.bah aca.chi achi.lin alb.vul alu.mon ani.vir arc.rho asp.lun aux.roc bag.bag bag.mar bal.cap cal.cal cal.pen
1 1 1 2 3 1 4 4 1 5 4 4 1 1 1
"aca.qua" should be in cluster 1, as well as "aca.bah", "aca.chi" and "alu.mon", etc. "achi.lin" in cluster two and so on.
I was trying to come up with code that uses the references in the second data frame to group the columns by cluster and sum them. I tried to do so with dplyr, mutate and some loops, but I never found a good way of doing it. I tried adding the clusters as a row, then using t() to transpose and select(), then transposing back, etc., but it was getting way too complicated.
Is there any way I can use the vector containing the names of the species and their clusters as a reference to sum the respective columns of each cluster?
The idea is to end up with something like this, but for all the clusters:
V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 cluster1
1 1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1
4 1 0 0 0 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 1 0 0 22
6 0 1 0 0 0 0 0 0 0 0 0
Here I used the following code:
teste4 <- teste3 %>%
filter(V1 == 1) %>%
select(-1)
teste5 <- teste4 %>%
mutate(cluster1 = rowSums(teste4[, 1:rowSums(teste4)]))
The point here is that I will also try several different cluster methods and models. Therefore, I need to make it more automatic when I come up with new cluster combinations, instead of manually selecting each column (the original dataset is much larger).
Try summing the columns that belong to each cluster with rowSums. We can wrap it in an lapply call to cycle through each unique cluster:
lst <- lapply(1:max(df2[1,]), function(x) rowSums(df1[,df2[1,] == x, drop=FALSE]))
setNames(data.frame(lst),paste0("clust",1:length(lst)))
# clust1 clust2 clust3 clust4 clust5
# 1 16 0 5 57 0
# 2 1 0 2 11 0
# 3 3 0 1 3 0
# 4 22 0 5 94 0
# 5 1 0 1 5 2
# 6 0 0 0 5 0
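If the cluster assignments are kept as a named vector, base R's rowsum() can do the grouping in one call and the columns are matched by species name rather than by position. A small sketch with a made-up subset of four species; abund and clusters are hypothetical names.

```r
# Made-up subset of the abundance table (3 sites, 4 species)
abund <- data.frame(
  aca.qua = c(0, 0, 1),
  alb.vul = c(5, 2, 1),
  ani.vir = c(57, 3, 3),
  cal.pen = c(16, 0, 3)
)
# Cluster membership as a named vector (species -> cluster)
clusters <- c(aca.qua = 1, alb.vul = 3, ani.vir = 4, cal.pen = 1)

# Align the cluster labels to the column order by name
grp <- clusters[names(abund)]

# rowsum() sums rows by group, so transpose, sum, and transpose back
by_cluster <- t(rowsum(t(as.matrix(abund)), group = grp))
colnames(by_cluster) <- paste0("clust", colnames(by_cluster))
by_cluster
```

Swapping in a new clustering then only means replacing the clusters vector.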

How do I create a DTM-like text matrix from a list of text blocks?

I have been using the textmatrix() function for a while to create DTMs which I can further use for LSI.
dirLSA<-function(dir){
dtm<-textmatrix(dir)
return(lsa(dtm))
}
textdir<-"C:/RProjects/docs"
dirLSA(textdir)
> tm
$matrix
D1 D2 D3 D4 D5 D6 D7 D8 D9
1. 000 2 0 0 0 0 0 0 0 0
2. 20 1 0 0 1 0 0 1 0 0
3. 200 1 0 0 0 0 0 0 0 0
4. 2014 1 0 0 0 0 0 0 0 0
5. 2015 1 0 0 0 0 0 0 0 0
6. 27 1 0 0 0 0 0 0 1 0
7. 30 1 0 0 0 1 0 1 0 0
8. 31 1 0 2 0 0 0 0 0 0
9. 40 1 0 0 0 0 0 0 0 0
10. 45 1 0 0 0 0 0 0 0 0
11. 500 1 0 0 0 0 0 1 0 0
12. 600 1 0 0 0 0 0 0 0 0
728. bias 0 0 0 2 0 0 0 0 0
729. biased 0 0 0 1 0 0 0 0 0
730. called 0 0 0 1 0 0 0 0 0
731. calm 0 0 0 1 0 0 0 0 0
732. cause 0 0 0 1 0 0 0 0 0
733. chauhan 0 0 0 2 0 0 0 0 0
734. chief 0 0 0 8 0 0 1 0 0
textmatrix() is a function which takes a directory (folder path) and returns document-wise term frequencies. These are used in further analysis such as Latent Semantic Indexing/Analysis (LSI/LSA).
However, I have now run into a new problem: I have tweet data in batch files (~500000 tweets/batch) and I want to carry out similar operations on this data.
I have code modules to clean up my data, and I want to pass the cleaned tweets directly to the LSI function. The problem is that textmatrix() does not support this.
I tried looking at other packages and code snippets, but that didn't get me any further. Is there any way I can create a line-term matrix of sorts?
I tried calling table(tokenize(cleanline[i])) in a loop, but it won't add new columns for words that are not already in the matrix. Any workaround?
Update: I just tried this:
a<-table(tokenize(cleanline[10]))
b<-table(tokenize(cleanline[12]))
df1<-data.frame(a)
df1
df2<-data.frame(b)
df2
merge(df1,df2, all=TRUE)
I got this:
> df1
Var1 Freq
1 6
2 " 2
3 and 1
4 home 1
5 mabe 1
6 School 1
7 then 1
8 xbox 1
> b<-table(tokenize(cleanline[12]))
> df2<-data.frame(b)
> df2
Var1 Freq
1 13
2 " 2
3 BillGates 1
4 Come 1
5 help 1
6 Mac 1
7 make 1
8 Microsoft 1
9 please 1
10 Project 1
11 really 1
12 version 1
13 wish 1
14 would 1
> merge(df1,df2)
Var1 Freq
1 " 2
> merge(df1,df2, all=TRUE)
Var1 Freq
1 6
2 13
3 " 2
4 and 1
5 home 1
6 mabe 1
7 School 1
8 then 1
9 xbox 1
10 BillGates 1
11 Come 1
12 help 1
13 Mac 1
14 make 1
15 Microsoft 1
16 please 1
17 Project 1
18 really 1
19 version 1
20 wish 1
21 would 1
I think I'm close.
Try something like this
ll <- list(df1,df2)
dtm <- xtabs(Freq ~ ., data = do.call("rbind", ll))
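One caveat, as far as I can tell: with only Var1 and Freq in the data, Freq ~ . cross-tabulates by term alone, so the result is total counts per term rather than a term-by-document matrix. Tagging each data frame with a document id before binding gives the second dimension. A sketch with made-up frequency tables; the doc column name is an assumption.

```r
# Made-up per-line frequency tables in the same shape as df1 and df2
df1 <- data.frame(Var1 = c("and", "home", "xbox"), Freq = c(1, 1, 1))
df2 <- data.frame(Var1 = c("help", "home"), Freq = c(1, 2))
ll <- list(df1, df2)

# Add a document id to every table, then stack them
long <- do.call(rbind, Map(function(d, id) cbind(doc = id, d),
                           ll, seq_along(ll)))

# Terms in rows, documents in columns, missing combinations become 0
dtm <- xtabs(Freq ~ Var1 + doc, data = long)
dtm
```

For half a million tweets per batch, xtabs(..., sparse = TRUE) returns a sparse matrix from the Matrix package, which may be easier on memory.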
Something that works for me:
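textLSA<-function(text){
# Start the term table from the first line
a<-data.frame(table(tokenize(text[1])))
colnames(a)[2]<-paste(c("Line",1),collapse=' ')
df<-a
# Merge in the remaining lines (line 1 is already in df)
for(i in seq_along(text)[-1]){
a<-data.frame(table(tokenize(text[i])))
colnames(a)[2]<-paste(c("Line",i),collapse=' ')
df<-merge(df,a, all=TRUE)
}
# Terms missing from a line come back as NA after the merge; count them as 0
df[is.na(df)]<-0
dtm<-as.matrix(df[,-1])
rownames(dtm)<-df$Var1
return(lsa(dtm))
}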
What do you think of this code?
