How to cross-tabulate two variables in R? - r

This seems basic, but I can't get it to work. I am trying to compute a frequency table in R for the data below:
1 2
2 1
3 1
I want to export the two-way frequencies to CSV. The rows should be the unique entries in column A of the data, the columns should be the unique entries in column B, and each cell should hold the number of times that pair of values occurs. I have explored constructs like table, but I have not been able to write the values out correctly in CSV format.
Output of sample data:
"","1","2"
"1",0,1
"2",1,0
"3",1,0

The data:
df <- read.table(text = "1 2
2 1
3 1")
Calculate frequencies using table:
(If your object is a matrix, you could convert it to a data frame using as.data.frame before using table.)
tab <- table(df)
   V2
V1  1 2
  1 0 1
  2 1 0
  3 1 0
Write data with the function write.csv:
write.csv(tab, "tab.csv")
The resulting file:
"","1","2"
"1",0,1
"2",1,0
"3",1,0

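As a quick check on the approach above, the file can be read back and compared with the table. This is only a sketch: the temporary file path and the name back are illustrative and not part of the original answer.

```r
# Sample data from the question
df <- read.table(text = "1 2
2 1
3 1")

# Two-way frequency table: rows are V1 values, columns are V2 values
tab <- table(df)

# Write to a temporary file (hypothetical path, used only for this sketch)
out_file <- tempfile(fileext = ".csv")
write.csv(tab, out_file)

# Read it back; the first column holds the row labels,
# and check.names = FALSE keeps the column names "1" and "2" as they are
back <- read.csv(out_file, row.names = 1, check.names = FALSE)
back
```

write.csv keeps the wide layout here because a two-way table is treated as a matrix; a one-way table would be written in long form instead.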

Convert the rows of frequency "table" (NOT matrix or dataframe) to separate lists [duplicate]

I'm computing a frequency table of frequencies, and I want to convert the table to two lists of numbers.
numbers <- c(1,2,3,4,1,2,3,1,2,3,1,2,3,4,2,3,5,1,2,3,4)
freq_of_freq <- table(table(numbers))
> freq_of_freq
1 3 5 6
1 1 1 2
From the table freq_of_freq, I'd like to create two lists, x and y: one containing the numbers 1,3,5,6 and the other containing the frequency values 1,1,1,2.
I tried this x <- freq_of_freq[ 1 , ] and y <- freq_of_freq[ 2 , ], but this doesn't work.
Any help greatly appreciated. Thanks
One approach is to use stack(), which returns a data frame with one column for the counts (values) and one for the labels (ind).
numbers <- c(1,2,3,4,1,2,3,1,2,3,1,2,3,4,2,3,5,1,2,3,4)
freq_of_freq <- table(table(numbers))
stack(freq_of_freq)
#>   values ind
#> 1      1   1
#> 2      1   3
#> 3      1   5
#> 4      2   6
To exactly match your expected output, you could do:
x = as.integer(names(freq_of_freq))
y = unname(freq_of_freq)
Note that the OP's attempt of freq_of_freq[1, ] does not work because table() returns a one-dimensional table for this example dataset, which behaves like a named integer vector. That is, we can't subset using matrix or data.frame notation because there is only one dimension.
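The point about dimensions can be checked directly. The sketch below uses only base R; the names n_dims and long are illustrative, and as.vector() is swapped in for unname() so that y is a plain integer vector rather than a table.

```r
numbers <- c(1,2,3,4,1,2,3,1,2,3,1,2,3,4,2,3,5,1,2,3,4)
freq_of_freq <- table(table(numbers))

# A one-way table has a single dimension, so [1, ] has no second index to use
n_dims <- length(dim(freq_of_freq))

# The labels are stored as names (character), the counts as the values
x <- as.integer(names(freq_of_freq))
y <- as.vector(freq_of_freq)

# Long-format alternative: one row per distinct frequency
long <- as.data.frame(freq_of_freq, stringsAsFactors = FALSE)
```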

Procedural way to generate signal combinations and their output in r

I have been continuing to learn r to transition away from excel and I am wondering what the best way to approach the following problem is, or at least what tools are available to me:
I have a large data set (100K+ rows) and several columns that I could generate a signal off of and each value in the vectors can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals using the state of each cell in the four columns then see what each of the composite signals tell me about the returns in a time series. For this question the scope is only generating the combinations.
So for example, one composite signal would be when all four cells in the vectors = 0. I could generate a new column that reads TRUE when that case is true and FALSE in every other case, then go on to figure out how that affects the returns from the rest of the data frame.
The thing is I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and on and on, which is quite a few. With the extent of my knowledge of r, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think that you could paste the columns together to get unique combinations, then just turn this to dummy variables:
library(dplyr)
library(dummies)
# Create sample data
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
sig2 = c(1,1,0,0,0,1,1),
sig3 = c(2,2,0,1,1,2,1))
# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1,sig2,sig3))
# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))
# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
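In case the dummies package is not available, the same idea (paste the columns into a key, then build one indicator column per key) can be sketched in base R. The names combos, flags and result, and the "sig_" column prefix, are assumptions made for this illustration.

```r
# Toy data copied from the answer above
data <- data.frame(sig1 = c(1, 1, 1, 1, 0, 0, 0),
                   sig2 = c(1, 1, 0, 0, 0, 1, 1),
                   sig3 = c(2, 2, 0, 1, 1, 2, 1))

# One key per row, e.g. "112"
sig_tot <- paste0(data$sig1, data$sig2, data$sig3)

# Combinations that actually occur in the data
combos <- sort(unique(sig_tot))

# Logical matrix: one column per combination, TRUE where the row matches it
flags <- sapply(combos, function(k) sig_tot == k)
colnames(flags) <- paste0("sig_", combos)

result <- cbind(data, flags)
result
```

Only combinations present in the data get a column. If every possible combination is needed, expand.grid() over the allowed values could supply the full list of keys instead of unique().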

Replace semicolon-separated values to tab

I am trying to convert the data which I have in txt file:
4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;...
to a column (table) where the values are separated by tab.
4.0945725440979
4.07999897003174
4.0686674118042...
So far I tried
mydata <- read.table("1.txt", header = FALSE)
separate_data<- strsplit(as.character(mydata), ";")
But it does not work. separate_data in this case consist only of 1 element:
[[1]]
[1] "1"
Based on the OP, it's not directly stated whether the raw data file contains multiple observations of a single variable, or should be broken into n-tuples. Since the OP does state that read.table results in a single row where s/he expects it to contain multiple rows, we can conclude that the correct technique is to use scan(), not read.table().
If the data in the raw data file represents a single variable, then the solution posted in comments by @docendo works without additional effort. Otherwise, additional work is required to tidy the data.
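For the single-variable case, a minimal sketch might look like the following. It is not necessarily the solution from the comments; the shortened input string and the temporary output file are assumptions for illustration.

```r
# Shortened sample of the semicolon-separated input
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042"

# scan() splits on ";" and returns a numeric vector
value <- scan(text = rawData, sep = ";", quiet = TRUE)

# Write one value per line to a temporary file (hypothetical path)
out_file <- tempfile(fileext = ".txt")
write.table(value, out_file, row.names = FALSE, col.names = FALSE)

# Inspect the result
readLines(out_file)
```

With a real file, scan("1.txt", sep = ";") would replace the text argument.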
Here is an approach using scan() that reads the file into a vector, and breaks it into observations containing 5 variables.
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512"
value <- scan(textConnection(rawData),sep=";")
columns <- 5 # set desired # of columns
observations <- length(value) / columns
observation <- unlist(lapply(1:observations,function(x) rep(x,times=columns)))
variable <- rep(1:columns,times=observations)
data.frame(observation,variable,value)
...and the output:
> data.frame(observation,variable,value)
   observation variable    value
1            1        1 4.094573
2            1        2 4.079999
3            1        3 4.068667
4            1        4 4.059601
5            1        5 4.052183
6            2        1 4.094573
7            2        2 4.079999
8            2        3 4.068667
9            2        4 4.059601
10           2        5 4.052183
At this point the data can be converted into a wide format tidy data set with reshape2::dcast().
Note that this solution requires that the number of data values in the raw data file is evenly divisible by the number of variables.
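If reshape2 is not at hand, the wide layout can also be sketched in base R by filling a matrix row by row. This swaps matrix() in for dcast(); the name wide and the default column names V1 to V5 are artifacts of this sketch, not of the original answer.

```r
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512"
value <- scan(text = rawData, sep = ";", quiet = TRUE)

columns <- 5 # desired number of variables per observation

# Guard for the divisibility requirement noted above
stopifnot(length(value) %% columns == 0)

# byrow = TRUE places each run of 5 consecutive values in one row
wide <- as.data.frame(matrix(value, ncol = columns, byrow = TRUE))
wide
```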

R require cell counts for number of occurrences of regex pattern over entire data frame

I'm working in R and I have a data frame containing epigenetic information. I have 300,000 rows containing genomic locations and 15 columns each of which identifies a transcription factor motif that may or may not occur at each locus.
I'm trying to use regular expressions to count how many times each transcription factor occurs at each genomic locus. Individual motifs can occur > 15 times at any one locus, so I'd like the output to be a matrix/data frame containing motif counts for each individual cell of the data frame.
A typical single occurrence of a motif in a cell could be:
2212(AATTGCCCCACA,-,0.00)
Whereas if there were multiple occurrences of a motif, these would exist in the cell as a continuous string each entry separated by a comma, for example for two entries:
144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)
Here is some toy data:
df <-data.frame(NAMES = c('LOC_A', 'LOC_B', 'LOC_C', 'LOC_D'),
TFM1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)", "2(TGTGAGTCAC,+,0.00)", "0", "0"),
TFM2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),", "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "0"),
stringsAsFactors = F)
I'd be looking for the output in the following format:
NAMES TFM1 TFM2
LOC_A 2 2
LOC_B 1 1
LOC_C 0 1
LOC_D 0 0
If possible, I'd like to avoid for loops, but if loops are required so be it. To get row counts for this data frame I used the following code, kindly recommended by @akrun:
df$MotifCount <- Reduce(`+`, lapply(df[-1],
function(x) lengths(str_extract_all(x, "\\d+\\("))))
Notice that the unique identifier for the motifs used here is "\\d+\\(" to pick up the number and opening bracket at the start of each motif identification string. This would have to be included in any solution code. Something similar which worked across the whole data frame to provide individual cell counts would be ideal.
Many Thanks
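We don't need the Reduce part; applying lengths() to each column gives the per-cell counts directly:
library(stringr) # provides str_extract_all()
data.frame(c(df[1],lapply(df[-1], function(x) lengths(str_extract_all(x, "\\d+\\(")))) )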
# NAMES TFM1 TFM2
#1 LOC_A 2 2
#2 LOC_B 1 1
#3 LOC_C 0 1
#4 LOC_D 0 0
This will also work:
cbind.data.frame(df[1],sapply(lapply(df[-1], function(x) str_extract_all(x, "\\d+\\(")), function(x) lapply(x, length)))
# NAMES TFM1 TFM2
#1 LOC_A 2 2
#2 LOC_B 1 1
#3 LOC_C 0 1
#4 LOC_D 0 0
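The same per-cell counts can be sketched without stringr by using gregexpr() and regmatches() from base R. The helper name count_motifs and the result name counts are made up for this illustration; the pattern is the one from the question.

```r
# Toy data from the question
df <- data.frame(NAMES = c('LOC_A', 'LOC_B', 'LOC_C', 'LOC_D'),
                 TFM1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)", "2(TGTGAGTCAC,+,0.00)", "0", "0"),
                 TFM2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),", "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "0"),
                 stringsAsFactors = FALSE)

# Count matches of "digits followed by (" in every element of a character vector;
# cells with no match yield character(0), so their length is 0
count_motifs <- function(x) {
  lengths(regmatches(x, gregexpr("\\d+\\(", x, perl = TRUE)))
}

# Apply to every column except NAMES and keep the data frame shape
counts <- data.frame(df[1], lapply(df[-1], count_motifs))
counts
```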

Doubts about ddply function in R

I'm trying to do an equivalent group by summary in R through the plyr function named ddply. I have a data frame which have three columns (say id, period and event). Then, I'd like to count the times each id appears in the data frame (count(*)... group by id with SQL) and get the last element of each id corresponding to the column event.
Here an example of what I have and what I'm trying to obtain:
id period event #original data frame
1 1 1
2 1 0
2 2 1
3 1 1
4 1 1
4 1 0
id t x #what I want to obtain
1 1 1
2 2 1
3 1 1
4 2 0
This is the simple code I've been using for that:
teachers.pp<-read.table("http://www.ats.ucla.edu/stat/examples/alda/teachers_pp.csv", sep=",", header=T) # whole data frame
datos=ddply(teachers.pp,.(id),function(x) c(t=length(x$id), x=x[length(x$id),3])) #This is working fine.
Now, I've been reading The Split-Apply-Combine Strategy for Data Analysis, which gives an example using a syntax equivalent to the one below:
datos2=ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3]) #using summarise but the result is not what I want.
This is the data frame I get using datos2
id t x
1 1 1
2 2 0
3 1 1
4 1 1
So, my question is: why is this result different from the one I get using the first piece of code, I mean datos? What am I doing wrong?
It is not clear for me when I have to use summarise or transform. Could you tell me the correct syntax for the ddply function?
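When you use summarise, stop referencing the original data frame. Instead, just write expressions in terms of the column names. Inside summarise, length(id) is the size of the current group, so teachers.pp[length(id),3] picks that row number from the whole data frame rather than the last row of the group.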
You tried this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3])
when what you probably wanted was something more like this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=tail(event,1))
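The difference between the two expressions can be reproduced on the small example from the question without plyr. This is a base R sketch of the same split-and-summarise logic, not the ddply internals; the names toy, by_id, wrong and right are illustrative.

```r
# Example data frame from the question
toy <- data.frame(id     = c(1, 2, 2, 3, 4, 4),
                  period = c(1, 1, 2, 1, 1, 1),
                  event  = c(1, 0, 1, 1, 1, 0))

# Split into one piece per id, as ddply does with .(id)
by_id <- split(toy, toy$id)

# Group sizes, i.e. count(*) ... group by id
t <- sapply(by_id, nrow)

# Indexing the full data frame by the group size: row n of ALL the data
wrong <- sapply(by_id, function(g) toy[nrow(g), 3])

# Last event within each group, equivalent to tail(event, 1)
right <- sapply(by_id, function(g) tail(g$event, 1))

data.frame(id = as.numeric(names(by_id)), t = t, x = right, row.names = NULL)
```

Here wrong only ever reads rows 1 and 2 of the full data, because the group sizes are 1 or 2, while right returns the last event of each id.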
