Specific separator can separate the data frame - r

I am trying to run this command:
df2 <- as.data.frame.matrix(table(stack(setNames(strsplit(df$col1, "---", fixed = TRUE), df$id))[2:1]))
However I receive this error:
Error in table(stack(setNames(strsplit(df$col1, :
attempt to make a table with >= 2^31 elements
Any idea why this error happened? Unfortunately, I can't provide a reproducible example, because I can't find what triggers the error.
The command is meant to split col1 on the --- separator and create a 0/1 indicator column for each distinct value.
Example input:
data.frame(id = c(1,2), col1 = c("text---here","text---there"))
Expected output:
data.frame(id = c(1,2), text = c(1,1), here = c(1,0), there = c(0,1))

If the task in question is complex, it is worth splitting it into chunks. Try this:
x = data.frame(id = c(1,2), col1 = c("text---here","text---there")); x$col1 = as.vector(x$col1)
Split = strsplit(as.vector(x$col1), split = "---")
levels = unique(unlist(Split))
x = cbind(x, matrix(ncol = length(levels), nrow = nrow(x)))
for(i in 1:length(levels))
{
x[,ncol(x)-length(levels)+i] <- sapply(Split, function(x) max(x == levels[i]))
}
colnames(x) <- c("id", "col1", levels)
x
# id col1 text here there
# 1 1 text---here 1 1 0
# 2 2 text---there 1 0 1
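On the original error: table() builds a dense array with one cell for every id/value combination, so with many ids and many distinct values the cell count can pass 2^31 even when almost all cells are 0. A minimal sketch of one way around that, assuming the Matrix package is acceptable, stores only the non-zero entries. The variable names (parts, tokens, lev, pairs, ind) are illustrative and not from the question.

```r
library(Matrix)

df <- data.frame(id = c(1, 2), col1 = c("text---here", "text---there"))

# Split each string on the literal separator.
parts  <- strsplit(as.character(df$col1), "---", fixed = TRUE)
tokens <- unlist(parts)
lev    <- unique(tokens)

# One (row, column) pair per token; unique() keeps a repeated token at 1.
pairs <- unique(data.frame(i = rep(seq_along(parts), lengths(parts)),
                           j = match(tokens, lev)))

# Sparse 0/1 indicator matrix: only the 1s are stored.
ind <- sparseMatrix(i = pairs$i, j = pairs$j, x = 1,
                    dims = c(nrow(df), length(lev)),
                    dimnames = list(as.character(df$id), lev))
ind
```

For data small enough to fit in memory, as.matrix(ind) gives the dense 0/1 table.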

Related

how to enclose combined cells as vector

I have got a data table like
library(data.table)
library(lifecontingencies)
dt <- data.table(cash = c(100,120), Flows = c(110,130),time = c(1,1),
Ids = c(2,2), int = c(0.02,0.04), Rates = c(0.02,0.04),
proba = c(0.9,0.8), bilities = c(0.7,0.6))
dt
# cash Flows time Ids int Rates proba bilities
#1: 100 110 1 2 0.02 0.02 0.9 0.7
#2: 120 130 1 2 0.04 0.04 0.8 0.6
and want to calculate
#presentValue(cashFlows, timeIds, interestRates, probabilities)
row-wise. How can I do it automatically instead of manually like so:
pV1 <- presentValue(cashFlows = c(100,110),
timeIds = c(1,2),
interestRates = c(0.02,0.02),
probabilities = c(0.9,0.7))
pV2 <- presentValue(cashFlows = c(120,130),
timeIds = c(1,2),
interestRates = c(0.04,0.04),
probabilities = c(0.8,0.6))
result <- c(pV1,pV2)
result
#162.2453 164.4231
As we are using data.table, one approach is to group by sequence of rows and apply the function
dt[, .(presValue = presentValue(cashFlows = unlist(c(cash, Flows)),
timeIds = unlist(c(time, Ids)),
interestRates = unlist(c(int, Rates)),
probabilities = unlist(c(proba, bilities)))), by = .(Row = 1:nrow(dt))]
# Row presValue
#1: 1 162.2453
#2: 2 164.4231
Another approach is to combine the multiple columns into one by melting and then apply the presentValue
dM <- melt(dt, measure = patterns('cash|Flow', 'time|Ids', 'int|Rates', 'proba|bilities'),
value.name = c('cashFlows', 'timeIds', 'interestRates', 'probabilities'))[,
rn := rowid(variable)][]
dM[, .(presValue = do.call(presentValue, .SD)),
by = .(Row = rn), .SDcols = cashFlows:probabilities]
# Row presValue
#1: 1 162.2453
#2: 2 164.4231
I would do the row-wise calculation with apply (strictly speaking this loops over rows rather than vectorizing):
apply(dt, 1, function(x) presentValue(cashFlows = x[1:2],
timeIds = x[3:4],
interestRates = x[5:6],
probabilities = x[7:8]))
# [1] 162.2453 164.4231
Note that you can index the row any way you want. For example, x[1:2] here stands for the first and second cells of a row. You can pick the first and fourth cells with x[c(1,4)], or just the second cell with x[2].
It took me a minute to understand the presentValue and what it needed but I think this should do what you want.
apply(dt, 1, function(row) {
cashFlows <- c(row[1], row[2])
tIds <- c(row[3], row[4])
interestRates <- c(row[5], row[6])
probabilities <- c(row[7], row[8])
presentValue(cashFlows = cashFlows,
timeIds = tIds,
interestRates = interestRates,
probabilities = probabilities)
})
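To make the row-wise calculation easier to follow without the package, here is a minimal sketch using a hypothetical stand-in, pv_manual. It assumes presentValue is the sum of cash flows weighted by their probabilities and discounted at the given rates; that assumption is consistent with the two results shown in the question, but the stand-in is not the lifecontingencies implementation.

```r
# Hypothetical stand-in for lifecontingencies::presentValue (assumed formula).
pv_manual <- function(cashFlows, timeIds, interestRates, probabilities) {
  sum(cashFlows * probabilities / (1 + interestRates)^timeIds)
}

# Same data as in the question, as a plain data.frame.
d <- data.frame(cash = c(100, 120), Flows = c(110, 130), time = c(1, 1),
                Ids = c(2, 2), int = c(0.02, 0.04), Rates = c(0.02, 0.04),
                proba = c(0.9, 0.8), bilities = c(0.7, 0.6))

# Loop over the rows; each row supplies four pairs of arguments.
result <- vapply(seq_len(nrow(d)), function(k) {
  r <- unlist(d[k, ])
  pv_manual(r[1:2], r[3:4], r[5:6], r[7:8])
}, numeric(1))

round(result, 4)  # compare with the 162.2453 and 164.4231 shown in the question
```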

Building network in R

I have a csv file which looks like this:
"","people_id","commit_id"
"1",1,0
"2",1,117
"3",1,144
"4",1,278
…
Here's the csv file if you wanna look at it. It contains 11735 lines but 5923 unique people ids.
Does anyone know how to connect the people ids that share a common "commit_id", ignoring commit_id 0 (id 0 does not exist)?
For now I have done this:
# read the csv file
commitsNetwork <- read.csv("commits.csv", header=TRUE)
# use a subset for demo purpose
commitsNetwork <- commitsNetwork[c("people_id", "commit_id")]
#build edgelist(for commits)
C <- spMatrix(nrow = length(unique(commitsNetwork$people_id)),
ncol = length(unique(commitsNetwork$commit_id)),
i = as.numeric(factor(commitsNetwork$people_id)),
j = as.numeric(factor(commitsNetwork$commit_id)),
x = rep(1, length(as.numeric(commitsNetwork$people_id))) )
row.names(C) <- levels(factor(commitsNetwork$people_id))
colnames(C) <- levels(factor(commitsNetwork$commit_id))
adjC <- tcrossprod(C)
comG <- graph.adjacency(adjC, mode = "undirected", weighted = TRUE, diag = FALSE)
#write to pajek file
write.graph(comG, "comNetwork.net", format = "pajek")
The edges come from the "commit_id" column: two vertices (people) should be connected when they share a commit_id.
I'm not sure how to generate the network from this csv file in R.
The ideal output should look like:
*Vertices 5923
1
2
3
4
...
*Edges
1 4 1
1 25 1
1 39 1
1 41 1
1 48 1
until 5923...
Maybe you want something like this:
library(igraph)
library(Matrix)
download.file("https://www.dropbox.com/s/q7sxfwjec97qzcy/people.csv?dl=1",
tf <- tempfile(fileext = ".csv"), mode = "wb")
people <- read.csv(tf)
A <- spMatrix(nrow = length(unique(people$people)),
ncol = length(unique(people$repository_id)),
i = as.numeric(factor(people$people)),
j = as.numeric(factor(people$repository_id)),
x = rep(1, length(as.numeric(people$people))) )
row.names(A) <- levels(factor(people$people))
colnames(A) <- levels(factor(people$repository_id))
adj <- tcrossprod(A)
g <- graph.adjacency(adj, mode = "undirected", weighted = TRUE, diag = FALSE)
See also here.
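The answer above uses a different file and different column names. As a rough sketch of the same idea on the question's own columns, the following builds a weighted edge list in base R with a self-join instead of a sparse matrix product, and drops commit_id 0 first. The toy data is made up for illustration.

```r
# Made-up toy data with the question's column names.
commits <- data.frame(people_id = c(1, 1, 1, 2, 3, 3, 4),
                      commit_id = c(0, 117, 144, 117, 117, 144, 144))

# Ignore the non-existent commit 0 and any duplicated rows.
commits <- unique(commits[commits$commit_id != 0, ])

# Self-join on commit_id: every pair of people who share a commit.
pairs <- merge(commits, commits, by = "commit_id")
pairs <- pairs[pairs$people_id.x < pairs$people_id.y, ]  # keep each pair once

# Edge weight = number of shared commits.
edges <- aggregate(weight ~ from + to,
                   data = data.frame(from = pairs$people_id.x,
                                     to = pairs$people_id.y,
                                     weight = 1),
                   FUN = sum)
edges
```

If igraph is available, an edge list like this can be handed to graph_from_data_frame(edges, directed = FALSE) and then written out with write_graph in pajek format. On large data the self-join can grow quickly, so the sparse tcrossprod approach may scale better.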

R: Processing data within a loop

Below, I run a loop using the sample data provided, appending necessary items to geneexptest, and processing it further within the loop. However, in building dfs, I wish for the endpoint of each run to be data.frame(geneexptotal,...), as shown. The problem is, it seems to be stopping somehow at geneexptestapp, and outputting that instead into dfs each round. Please let me know how I might include the rest of the loop into the output.
gex <- data.frame("sample" = c("BIX","HEF","TUR","ZOP","VAG","JUF","FED","MEQ","YIF","HRB","LOP","LIX","COT","DRP","KFC","TUY","DOG","KEX","RAV","UEH"),
"TCGA-F4-6703-01" = runif(20, -1, 1),
"TCGA-DM-A28E-01" = runif(20, -1, 1),
"TCGA-AY-6197-01" = runif(20, -1, 1),
"TCGA-A6-5657-01" = runif(20, -1, 1))
colnames(gex) <- gsub("[.]", "_",colnames(gex))
listx <- c("TCGA_DM_A28E_01","TCGA_A6_5657_01")
mxy <- data.frame("TCGA-AD-6963-01" = runif(20, -1, 1),
"TCGA-AA-3663-11" = runif(20, -1, 1),
"TCGA-AD-6901-01" = runif(20, -1, 1),
"TCGA-AZ-2511-01" = runif(20, -1, 1),
"TCGA-A6-A567-01" = runif(20, -1, 1))
colnames(mxy) <- gsub("[.]", "_",colnames(mxy))
zScore <- function(x,y)((as.numeric(x) - as.numeric(rowMeans(y,na.rm=T)))/as.numeric(sd(y,na.rm=T)))
dfs <- lapply(listx, function(colName) {
do.call(rbind, lapply(seq(nrow(mxy)), function(i) {
zvalues <- zScore(gex[i,colName], mxy[i,])
geneexptest <- data.frame(gex$sample[i], zvalues, row.names = NULL, stringsAsFactors = TRUE)
geneexptest$zvalues <- as.numeric(as.character(geneexptest$zvalues))
is.na(geneexptest) <- sapply(geneexptest, is.infinite)
geneexptestapp <- na.omit(geneexptest)
geneexptestorder <- geneexptestapp[order(geneexptestapp$zvalues, decreasing = FALSE, na.last = NA), ]
geneexpa <- geneexptestorder[1:((0.05)*nrow(geneexptest)),]
geneexpz <- geneexptestorder[(nrow(geneexptestorder)-((0.05)*nrow(geneexptest))):nrow(geneexptestorder),]
geneexptotal <- rbind(geneexpa, geneexpz)
data.frame(geneexptotal$gex.sample, row.names = NULL, stringsAsFactors = TRUE)
}))
})
Your code, as it is, is working fine. You just have unexpected output because of some data management you're doing. I broke your code out a bit to help with readability. I made two new functions, fun1 and fun2 - fun2 is your inner function and fun1 is the outer function. fun2 takes colName as an argument to pass it along.
fun2 = function(i,colName) {
zvalues <- zScore(gex[i,colName], mxy[i,])
geneexptest <- data.frame(gex$sample[i], zvalues, row.names = NULL, stringsAsFactors = TRUE)
geneexptest$zvalues <- as.numeric(as.character(geneexptest$zvalues))
is.na(geneexptest) <- sapply(geneexptest, is.infinite)
geneexptestapp <- na.omit(geneexptest)
geneexptestorder <- geneexptestapp[order(geneexptestapp$zvalues, decreasing = FALSE, na.last = NA), ]
geneexpa <- geneexptestorder[1:((0.05)*nrow(geneexptest)),]
geneexpz <- geneexptestorder[(nrow(geneexptestorder)-((0.05)*nrow(geneexptest))):nrow(geneexptestorder),]
geneexptotal <- rbind(geneexpa, geneexpz)
data.frame(geneexptotal, row.names = NULL, stringsAsFactors = TRUE)
}
fun1 = function(colName) {
do.call(rbind, lapply(seq(nrow(mxy)), fun2, colName=colName))
}
dfs <- lapply(listx, fun1)
So, listx has two column names in it - let's start with the first one, TCGA_DM_A28E_01. So, we call fun1("TCGA_DM_A28E_01"). With the sample data above, this is going to rbind the results of 20 (nrow(mxy) = 20) iterations of fun2.
So, let's start with i=1 (we're now in fun2). You take one value from gex and one row from mxy, put them in zScore which gives a value of -0.6955057. You then make a one row data.frame and do some stuff to the zvalues column. So we have:
> geneexptest
gex.sample.i. zvalues
1 BIX -0.6955057
Do some checks and verify that nothing is infinite and remove it if it is. So now we have:
> geneexptestapp
gex.sample.i. zvalues
1 BIX -0.6955057
Now you do some ordering on a 1 row data frame. Nothing changes. This is where the problem is: nrow(geneexptest) = 1, so for geneexpa you're asking for rows 1:0.05, which evaluates to 1, and for geneexpz you're asking for rows 0.95:1, which evaluates to 0.95. There are no fractional rows; R truncates the index 0.95 to 0, which selects nothing. This results in:
> geneexpa;geneexpz
gex.sample.i. zvalues
1 BIX -0.6955057
[1] gex.sample.i. zvalues
<0 rows> (or 0-length row.names)
which you rbind and return as a data frame. In this case, it results in a single row data frame. fun1 produces 20 of these and rbinds them together, giving you a 20 row data.frame. dfs is a list of 2 of these.
Your code suggests you expect to get more than 1 row (your multiplication suggests at least 20) but there's only 1. Either the example here isn't complete or you need to rethink what's happening in fun2.
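The fractional-index behaviour described above can be checked in isolation. This is a small sketch on a made-up one-row data frame; the names one_row, low_idx and high_idx are illustrative.

```r
# A one-row data frame like the one built inside the inner function.
one_row <- data.frame(sample = "BIX", zvalues = -0.6955057)

# The same index expressions as in the question, with nrow = 1.
low_idx  <- 1:(0.05 * nrow(one_row))
high_idx <- (nrow(one_row) - 0.05 * nrow(one_row)):nrow(one_row)

low_idx    # a single index, 1
high_idx   # a single fractional index, 0.95

# Numeric indices are truncated toward zero, so 0.95 selects no rows.
nrow(one_row[low_idx, ])
nrow(one_row[high_idx, ])
```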
UPDATE
OK, given your updated requirements, consider this function:
getExtremeValues = function(x,p=0.05){
z = x[,2]
n = ceiling(nrow(x)*p)
r = x[order(z),1]
return(as.character(r[c(1:n,length(r):(length(r)-n+1))]))
}
You pass this function a dataframe (x) and the proportion you want from the top and bottom (p = 0.05 by default). It assumes the zvalues are in the second column. It then returns the entries of the first column (the sample names) for the rows whose zvalues fall in the bottom and top p proportion.
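As a quick check of what the function returns, here is a sketch that calls it on a made-up 20-row data frame. The function body is copied from above; the toy data and its values are purely illustrative.

```r
getExtremeValues <- function(x, p = 0.05) {
  z <- x[, 2]
  n <- ceiling(nrow(x) * p)
  r <- x[order(z), 1]
  as.character(r[c(1:n, length(r):(length(r) - n + 1))])
}

# Toy data: 20 samples named a..t with distinct, shuffled z-values.
toy <- data.frame(sample = letters[1:20],
                  zvalues = (1:20 * 7) %% 20 - 10)

# With 20 rows and p = 0.05, one sample is taken from each end.
extremes <- getExtremeValues(toy)
extremes

# The lowest and highest z-values, for comparison.
toy$sample[which.min(toy$zvalues)]
toy$sample[which.max(toy$zvalues)]
```

The lowest-ranked sample comes first in the result, followed by the highest-ranked one.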
How to make this all work:
fun2 = function(i,colName) {
zvalues <- zScore(gex[i,colName], mxy[i,])
geneexptest <- data.frame(gex$sample[i], zvalues, row.names = NULL, stringsAsFactors = TRUE)
geneexptest$zvalues <- as.numeric(as.character(geneexptest$zvalues))
is.na(geneexptest) <- sapply(geneexptest, is.infinite)
return(na.omit(geneexptest))
}
fun1 = function(colName) {
getExtremeValues(do.call(rbind, lapply(seq(nrow(mxy)), fun2, colName=colName)))
}
dfs <- lapply(listx, fun1)
This returns:
> dfs
[[1]]
[1] "BIX" "TUY"
[[2]]
[1] "BIX" "TUR"
Given there are 20 samples, 1 is in the top 5%, 1 is in the bottom 5%, there are two column names listed in listx, so 4 samples returned.

Error in adding rows to an empty data frame in R

CREATING AN EMPTY DATA FRAME:
data <- data.frame(ticks = numeric(0), identity = numeric(0), p_h = numeric(0), p_y = numeric(0), v_x = numeric(0), v_y = numeric(0), size = numeric(0), homo = numeric(0))
ADDING DATA TO DATA FRAME:
while (x < timeStepsToRun)
{
.....
data[i, ] <- c(ag$ticks, ag$who, ag$xcor, ag$ycor, ag$v-x, ag$v-y,"10","1")
i=i+1;
...
}
THOUGH I GET THE FOLLOWING ERROR WHEN ADDING DATA:
Error in value[[jvseq[[jjj]]]] : subscript out of bounds
In addition: Warning message:
In matrix(value, n, p) : data length exceeds size of matrix
Please suggest a better strategy or help me in correcting the above.
Thanks in advance!
If you know how large your data.frame needs to be, preallocate it at that size; then you won't run into this kind of error:
rows <- 1e4 # it's not clear how many you actually need from your example
data <- setNames(as.data.frame(matrix(nrow = rows, ncol = 8)),
c('ticks', 'identity', 'p_h', 'p_y', 'v_x', 'v_y', 'size', 'homo'))
Then you can fill it in the way that you describe. Even creating a dataframe larger than the one you need and cutting it down to size later is more efficient than growing it row by row.
If you know the classes of the columns you are going to create it can also be performance-enhancing to prespecify the column classes:
rows <- 1e4
data <- data.frame(ticks = integer(rows),
identity = character(rows),
p_h = numeric(rows),
p_y = numeric(rows),
v_x = numeric(rows),
v_y = numeric(rows),
size = numeric(rows),
homo = numeric(rows))
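The preallocate, fill, then trim pattern described above might look like the following sketch. It uses only two of the columns and a made-up loop to stay short. Each row is assigned as a list rather than with c(), since c() would coerce mixed values (such as the quoted "10" and "1" in the question) to a single type.

```r
rows <- 100                      # generous upper bound, assumed for illustration
data <- data.frame(ticks = integer(rows),
                   p_h   = numeric(rows))

n_used <- 0                      # number of rows actually filled
for (step in 1:7) {              # stand-in for the simulation loop
  n_used <- n_used + 1
  data[n_used, ] <- list(step, step * 0.5)
}

# Cut the data frame down to the rows that were filled.
data <- data[seq_len(n_used), ]
str(data)
```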
I gotchu:
data <- data.frame(ticks=NA, identity=NA, p_h=NA, p_y=NA, v_x=NA, v_y=NA,
size=NA, homo=NA)
timeStepstoRun <- 10
x <- 0 # placeholder start value
i <- 1
while (x < timeStepstoRun) {
data[i,] <- 1:8
i <- i + 1
x <- x + 1 # advance the counter so the loop terminates
}
Just replace timeStepstoRun, x, and data[i,] <- ... with whatever you actually have. This is never gonna be a good way to do what you're trying to do, but I thought I'd just throw it out there.
My personal favorite solution:
# Create data frame with 0 rows and 3 columns.
df <- data.frame(matrix(ncol = 3, nrow = 0))
# Provide column names.
colnames(df) <- c('var1', 'var2', 'var3')
# Add the row.
df[nrow(df) + 1,] = c("a", "b", "c")
Once you have this simplified version working, you can adapt it to your while loop.
Source: https://www.statology.org/create-empty-data-frame-in-r/

Combining frequencies and summary statistics in one table?

I just discovered the power of plyr (see "frequency table with several variables in R"),
and I am still struggling to understand how it works. I hope someone here can help me.
I would like to create a table (data frame) that combines frequencies and summary statistics without hard-coding the values.
Here is an example dataset:
require(datasets)
d1 <- sleep
# I classify the variable extra to calculate the frequencies
extraClassified <- cut(d1$extra, breaks = 3, labels = c('low', 'medium', 'high') )
d1 <- data.frame(d1, extraClassified)
The results I am looking for should look like this:
require(plyr)
ddply(d1, "group", summarise,
All = length(ID),
nLow = sum(extraClassified == "low"),
nMedium = sum(extraClassified == "medium"),
nHigh = sum(extraClassified == "high"),
PctLow = round(sum(extraClassified == "low")/ length(ID), digits = 1),
PctMedium = round(sum(extraClassified == "medium")/ length(ID), digits = 1),
PctHigh = round(sum(extraClassified == "high")/ length(ID), digits = 1),
xmean = round(mean(extra), digits = 1),
xsd = round(sd(extra), digits = 1))
My question: how can I do this without hard-coding the values?
For the record:
I tried this code, but it does not work:
ddply (d1, "group",
function(i) c(table(i$extraClassified),
prop.table(as.character(i$extraClassified))),
)
Thanks in advance
Here's an example to get you started:
foo <- function(x,colfac,colval){
tbl <- table(x[,colfac])
res <- cbind(n = nrow(x),t(tbl),t(prop.table(tbl)))
colnames(res)[5:7] <- paste(colnames(res)[5:7],"Pct",sep = "")
res <- as.data.frame(res)
res$mn <- mean(x[,colval])
res$sd <- sd(x[,colval])
res
}
ddply(d1,.(group),foo,colfac = "extraClassified",colval = "extra")
Don't take anything in that function foo as gospel. I just wrote that off the top of my head. Surely improvements/modifications are possible, but at least it's something to start with.
Thanks to Joran.
I slightly modified your function to make it more generic (without reference to the position of the variables).
require(plyr)
foo <- function(x,colfac,colval)
{
# table with frequencies
tbl <- table(x[,colfac])
# table with percentages
tblpct <- t(prop.table(tbl))
colnames( tblpct) <- paste(colnames(t(tbl)), 'Pct', sep = '')
# put the first part together
res <- cbind(n = nrow(x), t(tbl), tblpct)
res <- as.data.frame(res)
# add summary statistics
res$mn <- mean(x[,colval])
res$sd <- sd(x[,colval])
res
}
ddply(d1,.(group),foo,colfac = "extraClassified",colval = "extra")
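For comparison, the same split-apply-combine steps can be sketched in base R, which may also make the role of .(group) clearer: in plyr, .() quotes the variable name(s) used to split the data frame, much like split(d1, d1$group) below. The helper name summarise_group is illustrative and not from the thread.

```r
d1 <- datasets::sleep
d1$extraClassified <- cut(d1$extra, breaks = 3,
                          labels = c("low", "medium", "high"))

# Summarise one group: counts and proportions per class, then mean and sd.
summarise_group <- function(x) {
  tbl    <- table(x$extraClassified)
  counts <- setNames(as.vector(tbl), names(tbl))
  pcts   <- setNames(as.vector(prop.table(tbl)), paste0(names(tbl), "Pct"))
  as.data.frame(c(list(n = nrow(x)),
                  as.list(counts),
                  as.list(pcts),
                  list(mn = mean(x$extra), sd = sd(x$extra))))
}

# Split by group, summarise each piece, and stack the one-row results.
res <- do.call(rbind, lapply(split(d1, d1$group), summarise_group))
res
```

The class labels are taken from the factor levels, so nothing is hard-coded beyond the column names passed in.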
and it works !!!
P.S : I still do not understand what (group) stands for but
