R function merging date data into one column - r

I'm still getting my head around R so please excuse my ignorance. I currently have a dataset with several columns of different dates:
Country.Region X2020.01.22 X2020.01.23 X2020.01.24
1 Afghanistan 0 0 0
2 Algeria 0 0 0
3 Andorra 0 0 0
4 Angola 0 0 0
5 Antigua and Barbuda 0 0 0
6 Argentina 0 0 0
X2020.01.25 X2020.01.26 X2020.01.27 X2020.01.28 X2020.01.29
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
I am trying to tidy the data by gathering all the date columns into a single column, with the "recovered" value in a separate column. Which function should I use? Ideally it would also let me convert the date column from character format to date format.
I would really appreciate any help!
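One possible approach, sketched on the assumption that the tidyr package (version 1.0.0 or later) is installed: pivot_longer() gathers the date columns into name/value pairs, and as.Date() with a format string matching the X2020.01.22 style of column name turns those names into dates. The small data frame below and its non-zero values are made up for illustration; only the column names come from the question.

```r
library(tidyr)

# Hypothetical stand-in for the real data (values invented for illustration)
df <- data.frame(
  Country.Region = c("Afghanistan", "Algeria", "Andorra"),
  X2020.01.22 = c(0, 0, 0),
  X2020.01.23 = c(0, 1, 0),
  X2020.01.24 = c(2, 1, 0)
)

# Gather every column whose name starts with "X" into date/recovered pairs
long <- pivot_longer(
  df,
  cols = starts_with("X"),
  names_to = "date",
  values_to = "recovered"
)

# The literal "X" and the dots in the format string match the column names
long$date <- as.Date(long$date, format = "X%Y.%m.%d")

head(long)
```

The result has one row per country and date, so it can be passed directly to functions that expect a Date column.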

Related

Splitting column of comma separated categories into binary matrix

I'm pretty new to R and I really need some help. I have a column cats in my data frame which I would like to spread into a binary matrix, where 1 means the respondent reported interest in that category and 0 means they did not.
I've found that my problem is very similar to the one here:
Split column of comma-separated numbers into multiple columns based on value
However I am unable to solve my problem using the said solution and keep receiving multiple different errors at different points. I suspect it's because my data frame contains strings and not integers or numbers.
Here is a sample data frame of what I am working with
df <- data.frame(c("sports", "business,IT,entertainment", "feature,entertainment", "business,politics,sports", "health", "politics", "reviews", "entertainment,health", "IT"))
colnames(df) <- "cats"
# cats
#1 sports
#2 business,IT,entertainment
#3 feature,entertainment
#4 business,politics,sports
#5 health
#6 politics
#7 reviews
#8 entertainment,health
#9 IT
And this is what I'm trying to make it look like
sports business IT entertainment politics review health feature
1 1 0 0 0 0 0 0 0
2 0 1 1 1 0 0 0 0
3 0 0 0 1 0 0 0 1
4 1 1 0 0 1 0 0 0
etc...
Examples of errors I have received are:
Error: row_number() should only be called in a data context
Error in eval_tidy(enquo(var), var_env) : object '' not found
Any help would be greatly appreciated!
+with(df, sapply(unique(unlist(strsplit(as.character(cats), ","))), grepl, cats))
# sports business IT entertainment feature politics health reviews
# [1,] 1 0 0 0 0 0 0 0
# [2,] 0 1 1 1 0 0 0 0
# [3,] 0 0 0 1 1 0 0 0
# [4,] 1 1 0 0 0 1 0 0
# [5,] 0 0 0 0 0 0 1 0
# [6,] 0 0 0 0 0 1 0 0
# [7,] 0 0 0 0 0 0 0 1
# [8,] 0 0 0 1 0 0 1 0
# [9,] 0 0 1 0 0 0 0 0
One option with mtabulate
library(qdapTools)
mtabulate(strsplit(as.character(df$cats), ","))
# business entertainment feature health IT politics reviews sports
#1 0 0 0 0 0 0 0 1
#2 1 1 0 0 1 0 0 0
#3 0 1 1 0 0 0 0 0
#4 1 0 0 0 0 1 0 1
#5 0 0 0 1 0 0 0 0
#6 0 0 0 0 0 1 0 0
#7 0 0 0 0 0 0 1 0
#8 0 1 0 1 0 0 0 0
#9 0 0 0 0 1 0 0 0
Or with table from base R
table(stack(setNames(strsplit(as.character(df$cats), ","), seq_len(nrow(df))))[2:1])
With the tidyverse you can do:
library(tidyverse)
df %>%
  rownames_to_column(var = "row") %>%
  separate_rows(cats, sep = ",") %>%
  count(row, cats) %>%
  spread(cats, n, fill = 0)
Edit thanks to @eipi10

R design.matrix issue -- dropped column in design matrix?

I'm having an odd problem while trying to set up a design matrix to do downstream pairwise differential expression analysis on RNAseq data.
For the design matrix, I have both the donor information and each condition:
group<-factor(y$samples$group) #44 samples, 6 different conditions
sample<-factor(y$samples$samples) #44 samples, 11 different donors.
design<- model.matrix(~0+sample+group)
head(design)
Donor11.CD8 Donor12.CD8 Donor14.CD8 Donor15.CD8 Donor16.CD8
1 1 0 0 0 0
2 1 0 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0
Donor17.CD8 Donor18.CD8 Donor19.CD8 Donor20.CD8 Donor3.CD8
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
Donor4.CD8 Treatment2 Treatment3 Treatment4 Treatment5
1 0 0 0 0 0
2 0 0 0 0 1
3 0 0 0 1 0
4 0 0 0 0 0
5 0 0 1 0 0
6 0 1 0 0 0
Treatment6
1 1
2 0
3 0
4 0
5 0
6 0
The issue is that I seem to be losing a condition (treatment 1) when I form the design matrix, and I'm not sure why.
Many thanks, in advance, for your help!
That's not a problem. Treatment 1 is indicated by all 0 for the columns in the design matrix. Look at row 4 - zero for Treatments 2 through 6. That means it is Treatment 1. This is called a "treatment contrast" because the coefficients in the model contrast the named treatment against the "base" level, in this case the base level is Treatment1.
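To see this on a small scale, here is a minimal sketch with made-up donors and treatments (the names DonorA, DonorB and the four-row layout are assumptions for illustration, not the asker's data). The first level of group gets no column of its own, and relevel() changes which level plays that role.

```r
# Hypothetical toy factors: 2 donors, 3 treatments
sample <- factor(c("DonorA", "DonorA", "DonorB", "DonorB"))
group  <- factor(c("Treatment1", "Treatment2", "Treatment3", "Treatment1"))

design <- model.matrix(~ 0 + sample + group)
colnames(design)
# Treatment1 has no column: it is the base level of `group`
levels(group)[1]

# Rows belonging to Treatment1 are all zero in the treatment columns
design[group == "Treatment1", c("groupTreatment2", "groupTreatment3")]

# Choosing a different base level drops that level's column instead
group2  <- relevel(group, ref = "Treatment2")
design2 <- model.matrix(~ 0 + sample + group2)
colnames(design2)
```

With the donor columns in the model, one treatment level has to be absorbed this way, otherwise the columns would be linearly dependent.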

Create multiple subset with for loops

I need to create several matrices based on two criteria: ideo and time.
Here is the part of my code that does not work. I think the problem lies in using a list object to store the results from the new subsets, but I don't know how to index the list by i and j simultaneously. Should I use list(list())? Are there other ways to code this?
ideo.list<-list(f1,f2,f3,f4,f5,f6,f7,f8,f9)
time.list<-list(t1,t2,t3,t4)
dattime.list<-list()
for (j in 1:length(time.list)) {
  for (i in 1:length(ideo.list)) {
    dat.sub <- subset(dat, iyear %in% time.list[[j]] & Ideo %in% ideo.list[[i]])
    dattime.list[[i*j]] <- apply(dat.sub[, 5:13], 2, sum)
  }
}
nn<-matrix(unlist(dattime.list), byrow=TRUE, ncol=9,nrow=length(dattime.list) )
The head of input data is below:
iyear Ideo Armed.Assault Assassination
1 1982 Separatist / New Regime Nationalist / Ethnic Nationalist 0 0
2 1994 Separatist / New Regime Nationalist / Ethnic Nationalist 0 0
3 1995 Left Wing Terrorist Groups (Anarchist) 0 0
4 2010 Racist Terrorist Groups 1 0
5 2013 Left Wing Terrorist Groups (Anarchist) 0 0
6 2014 Cell Strategy and Terrorist Groups 0 0
Bombing.Explosion Facility.Infrastructure.Attack Hijacking Hostage.Taking..Barricade.Incident.
1 1 0 0 0
2 0 1 0 0
3 0 1 0 0
4 0 0 0 0
5 0 1 0 0
6 0 1 0 0
Hostage.Taking..Kidnapping. Unarmed.Assault Unknown
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
Thank you for help!
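One likely cause is the index i*j: it is not unique, since (i = 1, j = 2) and (i = 2, j = 1) both give 2, so results overwrite each other and other slots stay empty. A minimal sketch of one fix is to map each (j, i) pair to its own slot with (j - 1) * n_ideo + i. The toy data, the column names Bombing and Hijacking, and the year ranges below are hypothetical stand-ins for dat, f1..f9 and t1..t4.

```r
# Hypothetical toy data standing in for `dat`
dat <- data.frame(
  iyear     = c(1982, 1994, 1995, 2010),
  Ideo      = c("A", "A", "B", "B"),
  Bombing   = c(1, 0, 2, 1),
  Hijacking = c(0, 1, 0, 3),
  stringsAsFactors = FALSE
)
time.list  <- list(1980:1989, 1990:1999, 2000:2019)
ideo.list  <- list("A", "B")
value_cols <- c("Bombing", "Hijacking")

n_ideo <- length(ideo.list)
dattime.list <- vector("list", length(time.list) * n_ideo)

for (j in seq_along(time.list)) {
  for (i in seq_along(ideo.list)) {
    keep <- dat$iyear %in% time.list[[j]] & dat$Ideo %in% ideo.list[[i]]
    # Unique slot for every (time, ideo) combination
    dattime.list[[(j - 1) * n_ideo + i]] <-
      colSums(dat[keep, value_cols, drop = FALSE])
  }
}

# One row per combination, one column per attack type
nn <- do.call(rbind, dattime.list)
nn
```

A nested list (dattime.list[[j]][[i]]) would also work; the flat index just keeps the later matrix construction simple.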

R - merge/combine columns with same name but some data values equal zero

First of all, I have a matrix of features and a data.frame of features from two separate text sources. On each of those, I have performed different text mining methods. Now, I want to combine them but I know some of them have columns with identical names like the following:
> dtm.matrix[1:10,66:70]
cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0
> dim(dtm.matrix)
[1] 14300 6543
And the second set looks like this:
> data1.sub[1:10,c(1,37:40)]
Data number cough coughing up blood dehydration dental abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0
> dim(data1.sub)
[1] 14300 168
I got this code from this topic but I'm new to R and I still need some help with it:
`data1.sub.merged <- dcast.data.table(merge(
## melt the first data.frame and set the key as ID and variable
setkey(melt(as.data.table(data1.sub), id.vars = "Data number"), "Data number", variable),
## melt the second data.frame
melt(as.data.table(dtm.matrix), id.vars = "Data number"),
## you'll have 2 value columns...
all = TRUE)[, value := ifelse(
## ... combine them into 1 with ifelse
(value.x == 0), value.y, value.x)],
## This is the reshaping formula
"Data number" ~ variable, value.var = "value")`
When I run this code, it returns a 1x6667 matrix and doesn't merge the "cough" column (or any other column) from the two data sets. I'm confused. Could you help me understand how this works?
There are many ways to do this, e.g. using base R, data.table or dplyr. The choice depends on the volume of your data. If you work with very large matrices (which is usually the case with natural language processing and bag-of-words representations), you may need to try several approaches and profile them to find the quickest one.
I did what you wanted with dplyr. It is a bit ugly but it works. I merge the two data frames, then loop over the variables that exist in both: sum each pair (variable.x and variable.y) and then delete the originals. Note that I changed your column names slightly for reproducibility, but that shouldn't have any impact. Please let me know if this works for you.
df1 <- read.table(text =
' cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0')
df2 <- read.table(text =
' Data_number cough coughing_up_blood dehydration dental_abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0')
# Check what variables are common
common <- intersect(names(df1),names(df2))
# Set key IDs for data
df1$ID <- seq(1,nrow(df1))
df2$ID <- seq(1,nrow(df2))
# Merge dataframes
df <- merge(df1, df2,by = "ID")
# Sum and clean common variables left in merged dataframe
library(dplyr)
for (variable in common) {
  # Create a summed variable
  df[[variable]] <- df %>% select(starts_with(paste0(variable, "."))) %>% rowSums()
  # Delete columns with .x and .y suffixes
  df <- df %>% select(-one_of(c(paste0(variable, ".x"), paste0(variable, ".y"))))
}
df
ID nasal sputum yellow intermitt Data_number coughing_up_blood dehydration dental_abscess cough
1 1 0 0 0 0 1 0 0 0 1
2 2 0 0 0 0 3 0 0 0 2
3 3 0 0 0 0 6 0 0 0 0
4 4 0 0 0 0 8 0 0 0 0
5 5 0 0 0 0 9 0 0 0 0
6 6 0 0 0 0 11 0 0 0 2
7 7 0 0 0 0 12 0 0 0 0
8 8 0 0 0 0 13 0 0 0 0
9 9 0 0 0 0 15 0 0 0 0
10 10 0 0 0 0 16 0 0 0 1
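Note that summing gives 2 where both sources flag the same feature. If the combined column should stay binary, a minimal base R sketch is to take the element-wise maximum with pmax() instead. It assumes, as the answer above does, that the rows of the two tables are already in the same order; the three-row data frames are invented for illustration.

```r
# Hypothetical small inputs with one shared column, `cough`
df1 <- data.frame(cough = c(1, 1, 0), nasal = c(0, 0, 1))
df2 <- data.frame(Data_number = c(1, 3, 6),
                  cough = c(0, 1, 0),
                  dehydration = c(0, 0, 1))

common <- intersect(names(df1), names(df2))

# Start from the columns that appear in only one of the two tables
combined <- cbind(df2[setdiff(names(df2), common)],
                  df1[setdiff(names(df1), common)])

# For shared columns keep 1 if either source has 1
for (v in common) {
  combined[[v]] <- pmax(df1[[v]], df2[[v]])
}

combined
```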

Vertex names by creating a network object via an edgelist (R package: network)

I want to create a network object representing a directed network on the basis of an edgelist. The first column contains the unique IDs of project leaders, the second those of project partners, let's say:
library("network")
x <- cbind(rbind(1,1,2,2,3), rbind(3,7,10,9,6))
y.nw <- network(x, matrix="edgelist", directed=TRUE, loops=FALSE)
Now my problem is: I need all vertices to have the right ID, since after creating the network object I have to convert it back to an adjacency matrix with the correct corresponding firm IDs. However, I am not sure in which order I should assign them, since I sorted the data frame by column 1 (project leaders), and the leaders do not always show up as project partners as well.
If your ids are sequential integers as in your example, you can produce the adjacency matrix corresponding to the edgelist in your example with:
> as.sociomatrix(y.nw)
1 2 3 4 5 6 7 8 9 10
1 0 0 1 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 1 1
3 0 0 0 0 0 1 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
But maybe you have a different type of id system in your real input?
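If the real IDs are not sequential integers, one way to avoid depending on vertex order is to build the adjacency matrix directly and label its rows and columns with the IDs. The sketch below uses base R only rather than the network package, and the firm IDs (F01, F03, ...) and column names leader and partner are hypothetical.

```r
# Hypothetical edgelist with non-sequential character IDs
edges <- data.frame(
  leader  = c("F01", "F01", "F02", "F02", "F03"),
  partner = c("F03", "F07", "F10", "F09", "F06"),
  stringsAsFactors = FALSE
)

# Every ID that appears as leader or partner, in a fixed order
ids <- sort(unique(c(edges$leader, edges$partner)))

# Empty matrix with the IDs as row and column names
adj <- matrix(0, nrow = length(ids), ncol = length(ids),
              dimnames = list(ids, ids))

# Set a 1 for each directed edge leader -> partner
adj[cbind(match(edges$leader, ids), match(edges$partner, ids))] <- 1

adj
```

Because rows and columns are named, adj["F01", "F07"] can be looked up by firm ID regardless of how the edgelist was sorted.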

Resources