R - split dataset by row position and save in different files

I have a huge dataset in which several mini datasets were merged. I want to split them into different data frames and save them. The mini datasets are identified by a variable name (which always includes the string "-gram") on a given row.
I have been trying to construct a for loop, but with no luck.
grams <- read.delim("grams.tsv", header=FALSE) #read dataset
index <- which(grepl("-gram", grams$V1), arr.ind=TRUE) # identify the row positions where each mini dataset starts
index[10] <- nrow(grams) # add the total number of rows as last variable of the vector
start <- c() # initialize vector
end <- c() # initialize vector
for (i in 1:length(index)-1) for (k in 2:length(index)) {
  start[i] <- index[i] # add value to the vector start
  if (k != 10) { end[k-1] <- index[k]-1 } else { end[k-1] <- index[k] } # add value to the vector end
  gram <- grams[start[i]:end[i],] # subset grams so that the mini dataset runs from its start index to its end index
  write.csv(gram, file=paste0("grams_", i, ".csv"), row.names=FALSE) # save dataset
}
I get an error when I try to subset the dataset:
Error in start[i]:end[i] : argument of length 0
...and I do not understand why! Can anyone help me?
Thanks!

The error comes from operator precedence: 1:length(index)-1 is parsed as (1:length(index)) - 1, so the loop starts at i = 0, and start[0] is a zero-length vector, which is why start[i]:end[i] fails with "argument of length 0". You can skip the index bookkeeping entirely with cumsum and split:
dat <- data.frame(V1 = c("foo", "bar", "quux-gram", "bar-gram", "something", "nothing"),
                  V2 = 1:6, stringsAsFactors = FALSE)
dat
# V1 V2
# 1 foo 1
# 2 bar 2
# 3 quux-gram 3
# 4 bar-gram 4
# 5 something 5
# 6 nothing 6
grepl("-gram$", dat$V1)
# [1] FALSE FALSE TRUE TRUE FALSE FALSE
cumsum(grepl("-gram$", dat$V1))
# [1] 0 0 1 2 2 2
spl_dat <- split(dat, cumsum(grepl("-gram$", dat$V1)))
spl_dat
# $`0`
# V1 V2
# 1 foo 1
# 2 bar 2
# $`1`
# V1 V2
# 3 quux-gram 3
# $`2`
# V1 V2
# 4 bar-gram 4
# 5 something 5
# 6 nothing 6
With that, you can write them to files with:
ign <- Map(write.csv, spl_dat, sprintf("gram-%03d.csv", seq_along(spl_dat)),
           MoreArgs = list(row.names = FALSE))

An option is group_split() with the base-R endsWith() (stringr is not needed here):
library(dplyr)
dat %>%
  group_split(grp = cumsum(endsWith(V1, '-gram')), .keep = FALSE)
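To then write each piece to its own file, a minimal sketch (the file-name pattern is illustrative):
pieces <- dat %>%
  group_split(grp = cumsum(endsWith(V1, '-gram')), .keep = FALSE)
for (i in seq_along(pieces)) {
  write.csv(pieces[[i]], sprintf("grams_%d.csv", i), row.names = FALSE)
}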

Related

Return multiple results of column-to-matrix operations within a data.table

I have a data.table with multiple categorical variables for which I would like to create contrast (or "dummy") variables, along with many more numerical variables which I would like to simply pass through by reference.
Example dataset:
library('data.table')
d <- data.table(1:3,           # there are lots of numerics, so I want to avoid copying
                letters[1:3],  # convert these to factor then dummy variable
                10:12,
                LETTERS[24:26])
# >d
# V1 V2 V3 V4
# 1: 1 a 10 X
# 2: 2 b 11 Y
# 3: 3 c 12 Z
The desired result looks like:
>dummyDT(d)
V1 V3 V2.b V2.c V4.Y V4.Z
1: 1 10 0 0 0 0
2: 2 11 1 0 1 0
3: 3 12 0 1 0 1
which can be produced with:
# this does what I want but is slow and inelegant and not idiomatic data.table
categorToMatrix <- function(x, name_prefix='Var'){
  # set levels in order of appearance to avoid default re-sort by alpha
  m <- contrasts(factor(x, levels=unique(x)))
  dimnames(m) <- list(NULL, paste(name_prefix, colnames(m), sep='.'))
  m
}
dummyDT <- function(d){
  toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
  if (length(toDummy) > 0) {
    dummyComponent <-
      data.table(
        do.call(cbind, lapply(toDummy, function(j) {
          categorToMatrix(d[[j]], name_prefix = names(d)[j])
        }))
      )
    asIs <- (1:ncol(d))[-toDummy]
    if (length(asIs) > 0) {
      allCols <- cbind(d[, asIs, with=FALSE], dummyComponent)
    } else allCols <- dummyComponent
  } else allCols <- d
  return(allCols)
}
(I do not care about maintaining original column ordering.)
I have tried in addition to the above, the approach of splitting each matrix into a list of columns, as in:
# split a matrix into list of columns and keep track of column names
# expanded from #Tommy's answer at: https://stackoverflow.com/a/6821395/2573061
splitMatrix <- function(m){
setNames( lapply(seq_len(ncol(m)), function(j) m[,j]), colnames(m) )
}
# Example:
splitMatrix(categorToMatrix(d$V2, name_prefix='V2'))
# $V2.b
# [1] 0 1 0
#
# $V2.c
# [1] 0 0 1
which works for an individual column, but when I lapply it over multiple columns, the lists somehow get coerced into string rows and recycled, which baffles me:
dummyDT2 <- function(d){
  stopifnot(inherits(d, 'data.table'))
  toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
  if (length(toDummy) > 0) {
    dummyComponent <- d[, lapply(.SD, function(x) splitMatrix(categorToMatrix(x))),
                        .SDcols = toDummy]
    asIs <- (1:ncol(d))[-toDummy]
    if (length(asIs) > 0) {
      allCols <- cbind(d[, asIs, with=FALSE], dummyComponent)
    } else allCols <- dummyComponent
  } else allCols <- d
  return(allCols)
}
dummyDT2(d)
# V1 V3 V2
# 1: 1 10 0,1,0
# 2: 2 11 0,0,1
# 3: 3 12 0,1,0
# Warning message:
# In data.table::data.table(...) :
# Item 2 is of size 2 but maximum size is 3 (recycled leaving remainder of 1 items)
I then tried wrapping splitMatrix with data.table() and got an amusingly laconic error message.
I know that functions like caret::dummyVars exist for data.frame. I am trying to create a data.table optimized version.
Closely related question: How to one-hot-encode factor variables with data.table?
But there are two differences: I do not want full-rank dummy variables but rather contrast variables (n-1 of them for n levels, since I'm using this for regression), and I have multiple numeric variables that I do not want to one-hot-encode.
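For reference, a base-R sketch of the desired output (my own illustration, not an optimized data.table answer): model.matrix() with the default treatment contrasts already produces n-1 contrast columns per factor, so one can encode the character columns and cbind the untouched numerics. Column names come out as V2b rather than V2.b.
library(data.table)
d <- data.table(1:3, letters[1:3], 10:12, LETTERS[24:26])
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
# factor() with levels in order of appearance, then drop the intercept column
f <- as.data.frame(lapply(d[, ..toDummy], function(x) factor(x, levels = unique(x))))
contr <- model.matrix(~ ., data = f)[, -1, drop = FALSE]
cbind(d[, -toDummy, with = FALSE], as.data.table(contr))
#    V1 V3 V2b V2c V4Y V4Z
# 1:  1 10   0   0   0   0
# 2:  2 11   1   0   1   0
# 3:  3 12   0   1   0   1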

Dynamically generate columns in r using existing columns of the data set for multiple data sets

I have data sets named data2, data3, data4, data7, data11, data12 & data13. For each data set I need to calculate error_rate_Vk = (Vk.actual - Vk.predicted) / Vk.actual, where k is the number in the data set's name; for data2, for example, that is error_rate_V2 = (V2.actual - V2.predicted) / V2.actual. So for each data set the calculation changes based on the column we are making predictions for.
For example, data2 looks like:
V2.actual V2.predicted V3 V4 V5 V6 V7 V11 V12 V13
639 1294.704556 53817 13 1 5 39316 13 104 104
2767 2724.884429 5131 3 1 5 39311 2 22 22
673 683.8030988 11332 7 2 5 39315 14 140 80
3100 2556.14175 442 8 3 5 39317 1 0 6
3015 2115.371589 8143 3 1 5 39323 1 6 6
and data3 looks like:
V2 V3.predicted V3.actual V4 V5 V6 V7 V11 V12 V13
4338 3410.386101 1516 4 1 5 39315 3 18 18
726 2654.803413 442 8 3 5 39317 2 0 12
730 762.412623 12617 16 5 5 39314 2 0 7
755 1653.438693 6722 1 1 5 39322 2 12 12
673 701.7884088 15572 10 3 5 39315 8 0 16
and so on for the other data sets.
I can address the data sets with the following, but I am unable to calculate the error rates dynamically & separately for each data set:
Error_Rate=lapply(mget(paste0("data",c(2:4,7,11:13))), transform,
Error_Rate= ?
Can someone help out? Thanks in advance.
I created a possible solution. Because I don't know whether the datasets are provided as data frames or as matrices, I provide solutions for both possibilities.
First the matrix solution:
#### Example with matrices ####
V2 <- c(5,8,6,2,3,9)
V3 <- c(8,8,1,15,48,58)
V4 <- c(7,8,9,4,5,6)
V2.predicted <- c(5.5,6.4,8,9,3,4)
V2.actual <- c(8,8,1,15,48,58)
V3.predicted <- c(4,8,6,55,2,3)
V3.actual <- c(5,8,6,2,3,9)
# cbind as matrices
data2 <- cbind(V2.predicted, V2.actual, V3, V4)
data3 <- cbind(V2, V3.predicted, V3.actual, V4)
str(data2)
str(data3)
fun_calc_error <- function(data, name) {
  library(plyr)
  str(data) # debugging
  # Tests if name is supplied; if not, it tries to extract the name from the dataframe/matrix
  # (doesn't work in lapply and ldply, as they access list elements through X[[i]])
  if (missing(name)) {
    dataname <- deparse(substitute(data)) # extracts the name of the data object
    # http://stackoverflow.com/questions/10520772/in-r-how-to-get-an-objects-name-after-it-is-sent-to-a-function
  } else {
    dataname <- name
  }
  cat("dataname: ", dataname, "\n") # debugging
  # extract the number of the matrix
  df_num <- as.numeric(gsub("data", "", dataname))
  # create the column names
  col_pred <- paste0("V", df_num, ".predicted")
  col_act <- paste0("V", df_num, ".actual")
  # reduce the matrix to the two columns predicted and actual
  new_matrix <- data[, c(col_pred, col_act)]
  # split the matrix by row and compute the error rate per row
  error_rate <- aaply(.data = new_matrix,
                      .margins = 1,
                      .fun = function(row) (row[1] - row[2]) / row[2])
  # debugging
  cat("\n str Error rate: ", "\n")
  str(error_rate)
  return(error_rate)
}
# Test the function for one matrix
fun_calc_error(data3)
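With the example data above, this should return, row by row: (4-5)/5 = -0.2, (8-8)/8 = 0, (6-6)/6 = 0, (55-2)/2 = 26.5, (2-3)/3 ≈ -0.33 and (3-9)/9 ≈ -0.67, i.e. (predicted - actual)/actual computed on the V3 columns of data3.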
Then the dataframe solution:
#### Example with dataframes ####
V2 <- c(5,8,6,2,3,9)
V3 <- c(8,8,1,15,48,58)
V4 <- c(7,8,9,4,5,6)
V2.predicted <- c(5.5,6.4,8,9,3,4)
V2.actual <- c(8,8,1,15,48,58)
V3.predicted <- c(4,8,6,55,2,3)
V3.actual <- c(5,8,6,2,3,9)
# cbind as data frames
data2 <- cbind.data.frame(V2.predicted, V2.actual, V3, V4, stringsAsFactors=FALSE)
data3 <- cbind.data.frame(V2, V3.predicted, V3.actual, V4, stringsAsFactors=FALSE)
str(data2)
str(data3)
fun_calc_error_df <- function(data, name) {
  library(dplyr)
  str(data) # debugging
  # Tests if name is supplied; if not, it tries to extract the name from the dataframe/matrix
  # (doesn't work in lapply and ldply, as they access list elements through X[[i]])
  if (missing(name)) {
    dataname <- deparse(substitute(data)) # extracts the name of the data object
    # http://stackoverflow.com/questions/10520772/in-r-how-to-get-an-objects-name-after-it-is-sent-to-a-function
  } else {
    dataname <- name
  }
  cat("dataname: ", dataname, "\n") # debugging
  df_num <- as.numeric(gsub("data", "", dataname)) # extract number of dataframe
  # create the column names
  col_pred <- paste0("V", df_num, ".predicted")
  col_act <- paste0("V", df_num, ".actual")
  new_df <- select(data, all_of(c(col_pred, col_act))) # select_() is deprecated in current dplyr
  colnames(new_df) <- c("predicted", "actual")
  new_df %>%
    mutate(error_rate = (predicted - actual) / actual) %>%
    select(error_rate) -> error_rate
  # debugging
  cat("\n str Error rate: ", "\n")
  str(error_rate)
  return(error_rate)
}
# Test for one dataframe
fun_calc_error_df(data3, "data3")
When you use those functions on a single dataframe/matrix they work fine, even without providing the name, because with
dataname <- deparse(substitute(data))
I can extract it.
If you want to put a list of dataframes into lapply or ldply to apply the function to several dataframes at once, there is a problem: ldply addresses each list element as X[[i]] and does not provide the name of the dataframe.
To work around this issue I used a loop in the following code. Maybe you will find a cleaner solution; I hope the code helps.
##### Possible solution for more than one dataframe ####
# Create a *named* list of dataframes
df.list <- list(data2=data2, data3=data3)
# Create list of names
nameslist <- names(df.list)
# Create empty dataframe
df_error_rate <- as.data.frame(NULL)
# loop over list elements
i <- 1
while (i <= length(df.list)) {
  cat(i, "\n") # debugging
  # put the list element in a variable as a dataframe
  data <- as.data.frame(df.list[[i]], stringsAsFactors=FALSE)
  # put the name of the dataframe from the list in a variable
  name <- nameslist[i]
  # apply the function
  error_rate <- fun_calc_error_df(data, name)
  # create a vector with the dataframe's name
  dataframe <- rep.int(name, nrow(df.list[[i]]))
  # bind names and values to a data frame
  tmp_err_rate <- cbind.data.frame(dataframe, error_rate, stringsAsFactors=FALSE)
  # bind rows to the big data frame
  df_error_rate <- rbind.data.frame(df_error_rate, tmp_err_rate)
  # increment loop counter
  i <- i + 1
}
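For what it's worth, because the list is named, a loop-free sketch along the same lines is possible by handing the names to the function explicitly with Map:
# Map passes df.list[[i]] and names(df.list)[i] pairwise to the function
err_list <- Map(fun_calc_error_df, df.list, names(df.list))
# bind into one data frame with a column identifying the source
df_error_rate <- do.call(rbind,
                         Map(function(er, nm) cbind.data.frame(dataframe = nm, er),
                             err_list, names(err_list)))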

rename column in dataframe using variable name R

I have a number of data frames. Each with the same format.
Like this:
A B C
1 -0.02299388 0.71404158 0.8492423
2 -1.43027866 -1.96420767 -1.2886368
3 -1.01827712 -0.94141194 -2.0234436
I would like to change the name of the third column--C--so that it includes part of the name of the variable associated with the data frame.
For the variable df_elephant the data frame should look like this:
A B C.elephant
1 -0.02299388 0.71404158 0.8492423
2 -1.43027866 -1.96420767 -1.2886368
3 -1.01827712 -0.94141194 -2.0234436
I have a function which will change the column name:
rename_columns <- function(x) {
  colnames(x)[colnames(x) == 'C'] <-
    paste('C', strsplit(deparse(substitute(x)), '_')[[1]][2], sep='.')
  return(x)
}
This works with my data frames. However, I would like to provide a list of data frames so that I do not have to call the function multiple times by hand. If I use lapply like so:
lapply( list (df_elephant, df_horse), rename_columns )
The function renames the column with an NA rather than with the relevant portion of the variable name.
[[1]]
A B C.NA
1 -0.02299388 0.71404158 0.8492423
2 -1.43027866 -1.96420767 -1.2886368
3 -1.01827712 -0.94141194 -2.02344361
[[2]]
A B C.NA
1 0.45387054 0.02279488 1.6746280
2 -1.47271378 0.68660595 -0.2505752
3 1.26475917 -1.51739927 -1.3050531
Is there some way that I can provide a list of data frames to my function and produce the desired result?
You are trying to process the data frames themselves instead of the list's names, and this is why it's not working: inside lapply, deparse(substitute(x)) evaluates to "X[[i]]", which contains no underscore, so strsplit(...)[[1]][2] is NA.
# Generate random data
n = 3
item1 = data.frame(A = runif(n), B = runif(n), C = runif(n))
item2 = data.frame(A = runif(n), B = runif(n), C = runif(n))
myList = list(df_elephant = item1, df_horse = item2)
# 1- Why your code doesn't work: ---------------
names(myList)         # Returns the actual names that you want to use: [1] "df_elephant" "df_horse"
lapply(myList, names) # Returns the dataframes' column names - and that's why you are getting the NA
# 2- How to make it work: ---------------
lapply(seq_along(myList),            # Returns an array of indices
       function(i){
         dfName = names(myList)[i]   # Get the list name
         dfName.animal = unlist(strsplit(dfName, "_"))[2] # Split on underscore and take the second element
         df = myList[[i]]            # Copy the actual data frame
         colnames(df)[colnames(df) == "C"] = paste("C", dfName.animal, sep = ".") # Change the column name
         return(df)                  # Return the new df
       })
# [[1]]
# A B C.elephant
# 1 0.8289368 0.06589051 0.2929881
# 2 0.2362753 0.55689663 0.4854670
# 3 0.7264990 0.68069346 0.2940342
#
# [[2]]
# A B C.horse
# 1 0.08032856 0.4137106 0.6378605
# 2 0.35671556 0.8112511 0.4321704
# 3 0.07306260 0.6850093 0.2510791
You can also try the following, which is similar to akrun's answer in that it also uses Map in the end:
# Your data
d <- read.table("clipboard")
# create a list with names A and B
d_list <- list(A=d, B=d)
# function
foo <- function(x, y){
  gr <- which(colnames(x) == "C")       # get index of column C
  tmp <- colnames(x)                    # new colnames vector
  tmp[gr] <- paste(tmp[gr], y, sep=".") # replace the old colname with the new one
  setNames(x, tmp)                      # set the new names
}
# Result
Map(foo, d_list, names(d_list))
$A
A B C.A
1 -0.02299388 0.7140416 0.8492423
2 -1.43027866 -1.9642077 -1.2886368
3 -1.01827712 -0.9414119 -2.0234436
$B
A B C.B
1 -0.02299388 0.7140416 0.8492423
2 -1.43027866 -1.9642077 -1.2886368
3 -1.01827712 -0.9414119 -2.0234436
We can try with Map. We get the datasets in a list (here mget returns the values of the strings as a list); with Map, we then change the name of the third column using the corresponding vector of names.
Map(function(x, y) {names(x)[3] <- paste(names(x)[3], sub(".*_", "", y), sep="."); x},
    mget(c("df_elephant", "df_horse")), c("df_elephant", "df_horse"))
#$df_elephant
# A B C.elephant
#1 -0.02299388 0.7140416 0.8492423
#2 -1.43027866 -1.9642077 -1.2886368
#3 -1.01827712 -0.9414119 -2.0234436
#$df_horse
# A B C.horse
#1 0.4538705 0.02279488 1.6746280
#2 -1.4727138 0.68660595 -0.2505752
#3 1.2647592 -1.51739927 -1.3050531
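For completeness, the same idea works with purrr (a sketch under the same assumptions as the Map answer, i.e. the data frames exist in the calling environment): imap() passes each list element together with its name.
library(purrr)
imap(mget(c("df_elephant", "df_horse")), function(x, nm) {
  names(x)[names(x) == "C"] <- paste("C", sub(".*_", "", nm), sep = ".")
  x
})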

How to rank rows by two columns at once in R?

Here is the code to rank based on column v2:
x <- data.frame(v1 = c(2,1,1,2), v2 = c(1,1,3,2))
x$rank1 <- rank(x$v2, ties.method='first')
But I really want to rank based on v2 and then v1, since there are ties in v2. How can I do that without using RPostgreSQL?
How about:
within(x, rank2 <- rank(order(v2, v1), ties.method='first'))
# v1 v2 rank1 rank2
# 1 2 1 1 2
# 2 1 1 2 1
# 3 1 3 4 4
# 4 2 2 3 3
order works, but for manipulating data frames, also check out the plyr and dplyr packages.
> arranged_x <- arrange(x, v2, v1)
Here we create a sequence of numbers and then reorder it to match the sorted order of the data:
x$rank <- seq.int(nrow(x))[match(rownames(x),rownames(x[order(x$v2,x$v1),]))]
Or:
x$rank <- (1:nrow(x))[order(order(x$v2,x$v1))]
Or even:
x$rank <- rank(order(order(x$v2,x$v1)))
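For intuition about the order(order(...)) idiom (my own illustration): order(v) is the permutation that sorts v, and applying order() to that permutation inverts it, which is exactly the rank of each element. rank(order(...)) in the first answer works for the same reason, since the rank of a permutation is its inverse.
v <- c(30, 10, 20)
order(v)        # 2 3 1 -> positions of the values in sorted order
order(order(v)) # 3 1 2 -> rank of each element
rank(v)         # 3 1 2 -> same result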
Try this:
x <- data.frame(v1 = c(2,1,1,2), v2 = c(1,1,3,2))
# The order function returns the index (address) of the desired order
# of the examined object rows
orderlist <- order(x$v2, x$v1)
# So to get the position of each row in the index, you can do a grep
x$rank <- sapply(1:nrow(x), function(i) grep(paste0("^", i, "$"), orderlist))
x
# For a little bit more general case
# With one tie
x <- data.frame(v1 = c(2,1,1,2,2), v2 = c(1,1,3,2,2))
x$rankv2 <- rank(x$v2)
x$rankv1 <- rank(x$v1)
orderlist <- order(x$rankv2, x$rankv1)
orderlist
# This rank would not be appropriate
x$rank <- sapply(1:nrow(x), function(i) grep(paste0("^", i, "$"), orderlist))
# there are ties
dup <- grep(TRUE, duplicated(x$rankv2, x$rankv1))
dup
# Example for only one tie: average the rank over the tied rows
tied <- which(x[, "rankv2"] %in% x[dup, "rankv2"] &
              x[, "rankv1"] %in% x[dup, "rankv1"])
x[tied, ]$rank <- mean(x[tied, ]$rank)
x

R ignore missing data

I have two R data files, each with 100 columns; the number of rows with data varies from 220 to 360 across columns in data1 and data2. data1 and data2 represent the changes of two quantities during a set of experiments, so [i,j] of data1 and [i,j] of data2 represent the same event but will have different values. I want to print every value that is greater than 2.5 in either file, along with its row and column number.
for (i in 1:360) {
  for (j in 1:100) {
    if ((data1[i,j] > 2.5) | (data2[i,j] > 2.5)) {
      cat(i, j, data1[i,j], data2[i,j], "\n", file="extr-b2.5.txt", append=TRUE)
    }
  }
}
I get this error because of NAs.
Error in if ((data1[i, j] > 2.5) | (data2[i, j] > :
missing value where TRUE/FALSE needed
If I set i to 1:220 (every column has at least 220 rows), it works fine.
How can I modify the above code to ignore NA values?
I would do something like this; note that which() simply drops NAs, so the missing values are ignored (use | instead of & if, as in your loop, you want positions where either value exceeds 2.5):
idx <- which(dat1>2.5 & dat2>2.5,arr.ind=TRUE)
cbind(idx,v1=dat1[idx],v2=dat2[idx])
reproducible example:
set.seed(1)
dat1 <- as.data.frame(matrix(runif(12,1,5),ncol=3))
dat2 <- as.data.frame(matrix(runif(12,1,5),ncol=3))
idx <- which(dat1>2.5 & dat2>2.5,arr.ind=TRUE)
cbind(idx,v1=dat1[idx],v2=dat2[idx])
# row col v1 v2
# [1,] 3 1 3.291413 4.079366
# [2,] 4 1 4.632831 2.990797
# [3,] 2 2 4.593559 4.967624
# [4,] 3 2 4.778701 2.520141
# [5,] 4 2 3.643191 4.109781
# [6,] 1 3 3.516456 4.738821
where dat1 and dat2:
# dat1
# V1 V2 V3
# 1 2.062035 1.806728 3.516456
# 2 2.488496 4.593559 1.247145
# 3 3.291413 4.778701 1.823898
# 4 4.632831 3.643191 1.706227
# > dat2
# V1 V2 V3
# 1 3.748091 3.870474 4.738821
# 2 2.536415 4.967624 1.848570
# 3 4.079366 2.520141 3.606695
# 4 2.990797 4.109781 1.502220
Without the for loops you can use pmax to compare the two arrays:
bigger <- pmax(data1, data2, na.rm=TRUE) # na.rm=TRUE keeps the non-missing value when the other is NA
This gives an array of the elementwise maxima. Then you can check where the max is bigger than 2.5:
which(bigger > 2.5, arr.ind=TRUE)
will give the locations where the max exceeds your cutoff.
For completeness, if I were to do it in your double-loop framework, I would just set the missing values to something below the minimum of all the other data; this works as long as some value in your data is below 2.5.
lowest <- min(c(unlist(data1), unlist(data2)), na.rm=TRUE) # na.rm=TRUE, or min() would itself return NA
data1[is.na(data1)] <- lowest
data2[is.na(data2)] <- lowest
then run your double loop.
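Alternatively (my own sketch, not part of the answers above), you can keep the double loop and treat NA comparisons as FALSE with isTRUE():
for (i in 1:360) {
  for (j in 1:100) {
    # isTRUE() returns FALSE for NA, so missing values are skipped
    if (isTRUE(data1[i,j] > 2.5) | isTRUE(data2[i,j] > 2.5)) {
      cat(i, j, data1[i,j], data2[i,j], "\n", file="extr-b2.5.txt", append=TRUE)
    }
  }
}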
