I have the following code in R:
library(party)
dat = read.csv("data.csv", header = TRUE)
train <- dat[1:1000, ]
test <- dat[1001:1200, ]
output.tree <- cforest(t_class ~ var1 + var2,
data = train)
train_predict <- predict(output.tree, newdata = test, OOB=TRUE, type = "prob")
for (name in names(train_predict))
{
p <- (train_predict[[name]][1:3])
write.table(p, file = "result.csv",col.names = FALSE, append=TRUE)
}
I am trying to write the result of the random forest prediction to a csv file.
The result train_predict looks like the following:
When I run the above code, it only writes the first column of each row to the csv, not all three.
How can I write all three columns of the list to the file?
Also, is there a way in R to clear the csv before writing to it, in case it already contains something?
Rather than write serially, you can convert to a data.frame and just write all at once:
Generate fake data that looks similar to what you posted:
fakeVec <- function(dummy) t(setNames(rnorm(3), letters[1:3]))
my_list <- lapply(0:4, fakeVec)
names(my_list) <- 6000:6004
Here's the fake data:
$`6000`
              a          b         c
[1,] -0.2444195 -0.2189598 -1.442364
$`6001`
             a        b          c
[1,] 0.2742636 1.068294 -0.8335477
$`6002`
            a        b         c
[1,] -1.13298 1.927268 -2.123603
$`6003`
             a        b            c
[1,] 0.8260184 1.003259 -0.003590849
$`6004`
              a         b         c
[1,] -0.2025963 0.1192242 -1.121807
Then convert format:
# crush to flat matrix
my_mat <- do.call(rbind, my_list)
# add in list names as new column
my_df <- data.frame(id = names(my_list), my_mat)
Now you have a data.frame like this:
    id          a          b            c
1 6000 -0.2444195 -0.2189598 -1.442364429
2 6001  0.2742636  1.0682937 -0.833547659
3 6002 -1.1329796  1.9272681 -2.123603334
4 6003  0.8260184  1.0032591 -0.003590849
5 6004 -0.2025963  0.1192242 -1.121807439
Which you can just write straight to a file:
write.csv(my_df, 'my_file.csv', row.names = FALSE)
How about this?
temp = list(x = data.frame(a = "a", b = "b", c = "c"),
            y = data.frame(d = "d", e = "e", f = "f"),
            z = data.frame(g = "g", h = "h", i = "i"))
for (i in seq_along(temp)) {
  write.table(temp[[i]], "temp.csv", col.names = FALSE, append = TRUE)
}
Regarding clearing the csv: if I understood your question correctly, just drop append = TRUE. Without it, write.table overwrites whatever is already in the file, so inside a loop only the first write should omit it.
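To make the "clear first, then append" idea concrete, here is a minimal sketch. The file path and the two probability rows are made up for illustration; they are not the output of the cforest model. A plain vector is written by write.table as a single column, so the sketch transposes each vector with t() to get one row with all three values.

```r
# Hypothetical output path and made-up rows, for illustration only
out_file <- tempfile(fileext = ".csv")
writeLines("stale content from an earlier run", out_file)

rows <- list(`6000` = c(a = 0.1, b = 0.2, c = 0.7),
             `6001` = c(a = 0.3, b = 0.3, c = 0.4))

# Clear the file once, before the loop, so appending starts from empty
if (file.exists(out_file)) file.remove(out_file)

for (name in names(rows)) {
  # t() turns the named vector into a 1 x 3 matrix, i.e. one csv row
  write.table(t(rows[[name]]), file = out_file, sep = ",",
              row.names = name, col.names = FALSE, append = TRUE)
}

written <- readLines(out_file)
print(written)
```

Each list element ends up as one line holding its name and all three values, and nothing from the earlier run survives.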
I am trying to execute this command:
df2 <- as.data.frame.matrix(table(stack(setNames(strsplit(df$col1, "---", fixed = TRUE), df$id))[2:1]))
However I receive this error:
Error in table(stack(setNames(strsplit(df$col1, :
attempt to make a table with >= 2^31 elements
Any idea why this error happened? Unfortunately, I can't provide a reproducible example, because I can't find what caused the error.
The command creates 0/1 indicator columns from the values in col1, which are separated by ---.
Example input:
data.frame(id = c(1,2), col1 = c("text---here","text---there"))
Expected output:
data.frame(id = c(1,2), text = c(1,1), here = c(1,0), there = c(0,1))
If the task in question is complex, it is worth splitting it into chunks. Try this:
x = data.frame(id = c(1,2), col1 = c("text---here","text---there"))
x$col1 = as.vector(x$col1)
# split every string into its pieces
Split = strsplit(x$col1, split = "---", fixed = TRUE)
levels = unique(unlist(Split))
# add one (still empty) column per distinct piece
x = cbind(x, matrix(ncol = length(levels), nrow = nrow(x)))
for(i in seq_along(levels))
{
  # 1 if the i-th level occurs among the row's pieces, 0 otherwise
  x[,ncol(x)-length(levels)+i] <- sapply(Split, function(s) max(s == levels[i]))
}
colnames(x) <- c("id", "col1", levels)
x
# id col1 text here there
# 1 1 text---here 1 1 0
# 2 2 text---there 1 0 1
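As for the error itself, a likely cause is size: table() builds a dense array with one cell for every (value, id) combination, so many ids times many distinct values can exceed the 2^31 limit. A sparse matrix stores only the non-zero cells. The sketch below is one possible way to do that with the Matrix package, using the example input from the question; it is an assumption that a sparse result is acceptable for your use.

```r
library(Matrix)

df <- data.frame(id = c(1, 2), col1 = c("text---here", "text---there"),
                 stringsAsFactors = FALSE)

parts <- strsplit(df$col1, "---", fixed = TRUE)

# One row per (id, token) pair; unique() keeps the result a 0/1 indicator
pairs <- unique(data.frame(id = rep(df$id, lengths(parts)),
                           token = unlist(parts),
                           stringsAsFactors = FALSE))

row_f <- factor(pairs$id, levels = unique(pairs$id))
col_f <- factor(pairs$token, levels = unique(pairs$token))

# Only the non-zero cells are stored
m <- sparseMatrix(i = as.integer(row_f), j = as.integer(col_f), x = 1,
                  dims = c(nlevels(row_f), nlevels(col_f)),
                  dimnames = list(levels(row_f), levels(col_f)))
print(m)
```

Converting m back to a dense data.frame with as.matrix() would hit the same size problem on the full data, so that step only makes sense for small subsets.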
The quantum entanglement phrasing is obviously a little tongue-in-cheek, but you'll see what I mean.
If one creates multiple data.frame columns using chain assignment, those columns all behave as one later, even though they (and their parent df) appear normal. Observe, columns created independently:
library(data.table)
testdf <- data.frame(a = 1:10)
testdf$b <- as.numeric(rep(NA, nrow(testdf)))
testdf$c <- as.numeric(rep(NA, nrow(testdf)))
mergedf <- data.frame(a = 5:7, b = 1:3, c = 8:10)
setDT(testdf)
setDT(mergedf)
testdf[mergedf, on = "a", b := i.b]
testdf[mergedf, on = "a", c := i.c]
# works as expected
Columns created by chain assignment, which assignOps help and this suggest should be fine (or at least don't warn against it):
testdf2 <- data.frame(a = 1:10)
testdf2$c <- testdf2$b <- as.numeric(rep(NA, nrow(testdf2)))
mergedf2 <- data.frame(a = 5:7, b = 1:3, c = 8:10)
setDT(testdf2)
setDT(mergedf2)
testdf2[mergedf2, on = "a", b := i.b]
# testdf2$b and $c both get mergedf2's b values
testdf2[mergedf2, on = "a", c := i.c]
# testdf2$b and $c both get mergedf2's c values, overwriting the b values
This doesn't seem to be documented in assignOps, I've not seen it mentioned anywhere, and it seems highly unintuitive that columns/values created in this way - which appears to be a simple space-saving shortcut - become bound together by a secretive pact, potentially to turn against you when your guard is down. To further underline how secret this is, if you just do the creation lines:
testdf <- data.frame(a = 1:10)
testdf$b <- as.numeric(rep(NA, nrow(testdf)))
testdf$c <- as.numeric(rep(NA, nrow(testdf)))
testdf2 <- data.frame(a = 1:10)
testdf2$c <- testdf2$b <- as.numeric(rep(NA, nrow(testdf)))
all.equal(testdf, testdf2) #TRUE
It's also true if you setDT() both dfs.
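One way to look behind the curtain is to compare memory addresses. The sketch below assumes the cause is that chain assignment stores the very same vector object in both columns, which data.table's := then modifies by reference; the names df_chain, df_indep and dt_safe are made up for the illustration. all.equal() compares values only, which would explain why it cannot tell the two data frames apart.

```r
library(data.table)

# Chain assignment, as in the question: both columns receive the same object
df_chain <- data.frame(a = 1:10)
df_chain$c <- df_chain$b <- as.numeric(rep(NA, nrow(df_chain)))

# Independent assignment: each column gets its own freshly created vector
df_indep <- data.frame(a = 1:10)
df_indep$b <- as.numeric(rep(NA, nrow(df_indep)))
df_indep$c <- as.numeric(rep(NA, nrow(df_indep)))

# address() reports where an object lives in memory
same_chain <- address(df_chain$b) == address(df_chain$c)
same_indep <- address(df_indep$b) == address(df_indep$c)
print(c(chain = same_chain, independent = same_indep))

# A possible workaround (an assumption, not verified against every
# data.table version): copy() deep-copies the table, so the columns
# should no longer share memory afterwards
dt_safe <- copy(setDT(df_chain))
```

Creating the columns in separate statements, as in the first example, avoids the sharing in the first place.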
Consider these three dataframes in a nested list:
df1 <- data.frame(a = runif(10,1,10), b = runif(10,1,10), c = runif(10,1,10))
df2 <- data.frame(a = runif(10,1,10), b = runif(10,1,10), c = runif(10,1,10))
df3 <- data.frame(a = runif(10,1,10), b = runif(10,1,10), c = runif(10,1,10))
dflist1 <- list(df1,df2,df3)
dflist2 <- list(df1,df2,df3)
nest_list <- list(dflist1, dflist2)
I want to run 'cor.test' on column 'a' against column 'a', 'b' against 'b', and 'c' against 'c' across all the dfs in each dflist. I can do it individually if I assign each one to the global environment with the code below, thanks to this post:
for (i in seq_along(nest_list)) { # extract dataframes from the list into individual dfs
  for (j in seq_along(nest_list[[i]])) {
    temp_df <- nest_list[[i]][[j]]
    # the lists above are unnamed, so the object name is built from the indices
    ds <- paste0("list", i, "_df", j)
    assign(ds, temp_df)
  }
}
combn(paste0("df", 1:3), 2, FUN = function(x) { # actual cor.test
  x1 <- mget(x, envir = .GlobalEnv)
  Map(function(x, y) cor.test(x, y, method = "spearman")$p.value, x1[[1]], x1[[2]])})
I am not sure I understand exactly what you want to do, but could something like this help?
# vector of your column names
columns <- c("a", "b", "c")
# number of data frames in each inner list (outer() pairs them up below)
n <- length(nest_list[[1]])
# correlation function: p-value of a Spearman test between elements i and j of data
correl <- function(i, j, data) {
  cor.test(unlist(data[i]), unlist(data[j]), method = "spearman")$p.value
}
correlfun <- Vectorize(correl, vectorize.args = c("i", "j"))
# "loop" over the columns vector (u is each value in turn: "a", then "b", then "c")
res <- sapply(columns, function(u) {
  # second loop over the inner lists, keeping only column u of every data frame
  lapply(nest_list, function(dfs) {
    z <- lapply(dfs, function(d) d[[u]])
    # on those vectors, use outer() to apply correlfun to each pair of data frames
    outer(1:n, 1:n, correlfun, data = z)
  })
}, simplify = FALSE, USE.NAMES = TRUE)
Does this help? I'm not sure my explanation is really clear :)
I am trying to apply the cor function to a data set. Below is my code:
corr <- function(directory, threshold = 0) {
for (i in 1:332) {
data = read.csv(paste(directory, '/',
formatC(i, width = 3, flag = '0'), '.csv', sep = '')) # reading all files
}
cv = numeric() #initializing list
data = na.omit(data) #omitting NAs from read file
if (nrow(data) > threshold) {
cv = c(cv, cor(data[,2], data[,3])) #if number of rows more than threshold, get correlation of data
}
cv
}
On the command line, I can then call:
cr <- corr('specdata', 150)
head(cr)
My expected output is:
[1] -0.01896 -0.14051 -0.04390 -0.06816 -0.12351 -0.07589
but the return value I get is only:
[1] -0.01896
I don't fully understand cor and why I am getting this result, please help. All my CSV files contain normal tables. Thank you!
For two vectors x and y, cor(x,y) returns the correlation coefficient of x and y, which is just a single number. This is what your code is doing.
cor(1:10, 2:11) # returns 1.0
If you want more correlations, you need to pass in a dataframe that contains your variables. For a dataframe 'df' with (say) 3 columns, cor(df) will return a 3-by-3 matrix.
df <- data.frame(a=1:3, b=c(3,2,8), c=c(12,3,8))
cor(df)
           a         b          c
a  1.0000000 0.7777138 -0.4435328
b  0.7777138 1.0000000  0.2184630
c -0.4435328 0.2184630  1.0000000
You have added a for loop in your edit. It seems you're trying to return the correlation coefficient for every csv in the directory.
We can try something like this.
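df1 <- data.frame(x = rnorm(10), y = rnorm(10))
df2 <- data.frame(x = rnorm(10), y = rnorm(10))
df3 <- data.frame(x = rnorm(10), y = rnorm(10))
# row.names = FALSE so that columns 1 and 2 of each file are x and y
write.csv(df1, "1.csv", row.names = FALSE)
write.csv(df2, "2.csv", row.names = FALSE)
write.csv(df3, "3.csv", row.names = FALSE)
corr <- function(directory){
  # full.names = TRUE keeps the directory in each path, so read.csv finds the files
  temp = list.files(path = directory, pattern = "^[0-9]+\\.csv$", full.names = TRUE)
  # in your case
  # temp = list.files(path = directory, pattern = "^[0-9]{3}\\.csv$", full.names = TRUE)
  dat = lapply(temp, function(x){read.csv(x, header = TRUE)})
  # one correlation coefficient per file
  corlist <- lapply(dat, function(x){cor(x[,1], x[,2])})
  unlist(corlist)
}
corr(".")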
0.07766259 0.24449723 0.20367101
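To tie this back to the function in the question, here is a sketch that keeps the threshold argument and the na.omit step but moves them inside the loop, so that every file contributes its own correlation. The name corr_sketch and the demo files are made up; the column positions 2 and 3 are taken from the question's code, and it is an assumption that every csv in the directory has that layout.

```r
corr_sketch <- function(directory, threshold = 0) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  cv <- numeric()                      # collected correlations
  for (f in files) {
    d <- na.omit(read.csv(f))          # complete rows of this file only
    if (nrow(d) > threshold) {
      cv <- c(cv, cor(d[, 2], d[, 3])) # one value per qualifying file
    }
  }
  cv
}

# Demo on three made-up files in a fresh temporary directory
demo_dir <- tempfile("corr_demo")
dir.create(demo_dir)
set.seed(1)
for (k in 1:3) {
  d <- data.frame(id = k, x = rnorm(20), y = rnorm(20))
  if (k == 3) d$x[1:15] <- NA          # leaves only 5 complete rows
  write.csv(d, file.path(demo_dir, sprintf("%03d.csv", k)), row.names = FALSE)
}

res <- corr_sketch(demo_dir, threshold = 10)
print(length(res))  # 2: the third file has too few complete rows
```

In the original function the loop only reads the files, so data ends up holding the last file alone and a single correlation is returned.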
I have a csv file which looks like this:
"","people_id","commit_id"
"1",1,0
"2",1,117
"3",1,144
"4",1,278
…
Here's the csv file if you wanna look at it. It contains 11735 lines but 5923 unique people ids.
Does anyone know how to connect the people ids that share a common "commit_id", ignoring commit_id 0, since id 0 does not exist?
For now I have done this:
library(Matrix)  # spMatrix, tcrossprod
library(igraph)  # graph.adjacency, write.graph
# read the csv file
commitsNetwork <- read.csv("commits.csv", header=TRUE)
# use a subset for demo purpose
commitsNetwork <- commitsNetwork[c("people_id", "commit_id")]
#build edgelist(for commits)
C <- spMatrix(nrow = length(unique(commitsNetwork$people_id)),
ncol = length(unique(commitsNetwork$commit_id)),
i = as.numeric(factor(commitsNetwork$people_id)),
j = as.numeric(factor(commitsNetwork$commit_id)),
x = rep(1, length(as.numeric(commitsNetwork$people_id))) )
row.names(C) <- levels(factor(commitsNetwork$people_id))
colnames(C) <- levels(factor(commitsNetwork$commit_id))
adjC <- tcrossprod(C)
comG <- graph.adjacency(adjC, mode = "undirected", weighted = TRUE, diag = FALSE)
#write to pajek file
write.graph(comG, "comNetwork.net", format = "pajek")
Also, the edges come from the "commit_id" column: two vertices (people) should be connected when they share a common commit_id.
I'm not sure how to generate the network from this csv file in R.
The ideal output should look like:
*Vertices 5923
1
2
3
4
...
*Edges
1 4 1
1 25 1
1 39 1
1 41 1
1 48 1
until 5923...
Maybe you want something like this:
library(igraph)
library(Matrix)
download.file("https://www.dropbox.com/s/q7sxfwjec97qzcy/people.csv?dl=1",
tf <- tempfile(fileext = ".csv"), mode = "wb")
people <- read.csv(tf)
A <- spMatrix(nrow = length(unique(people$people)),
ncol = length(unique(people$repository_id)),
i = as.numeric(factor(people$people)),
j = as.numeric(factor(people$repository_id)),
x = rep(1, length(as.numeric(people$people))) )
row.names(A) <- levels(factor(people$people))
colnames(A) <- levels(factor(people$repository_id))
adj <- tcrossprod(A)
g <- graph.adjacency(adj, mode = "undirected", weighted = TRUE, diag = FALSE)
See also here.
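For the "ignore commit_id 0" part, a minimal sketch on made-up rows that use the column names from the question (people_id, commit_id) could look like this. It only builds the adjacency matrix with the Matrix package; the result can then be passed to graph.adjacency as above.

```r
library(Matrix)

# Made-up sample with the question's column names
commits <- data.frame(people_id = c(1, 1, 2, 3, 3, 4),
                      commit_id = c(0, 117, 117, 144, 0, 144))

# Drop the placeholder commit_id 0 before building anything
commits <- commits[commits$commit_id != 0, ]

p <- factor(commits$people_id)
k <- factor(commits$commit_id)

# People x commits incidence matrix
B <- sparseMatrix(i = as.integer(p), j = as.integer(k), x = 1,
                  dims = c(nlevels(p), nlevels(k)),
                  dimnames = list(levels(p), levels(k)))

# People x people: entry (i, j) counts the commits shared by i and j
adj <- tcrossprod(B)
print(adj)
```

Note that a person whose only commit_id is 0 disappears from the vertex set with this filter; if they should stay as isolated vertices, fix the factor levels of people_id before filtering.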