R - ignore missing data

I have two R data files, data1 and data2, each with 100 columns, but the number of rows with data varies from 220 to 360 per column. data1 and data2 represent changes in two quantities during a set of experiments, so [i,j] of data1 and [i,j] of data2 represent the same event but will have different values. I want to print every value greater than 2.5 in either file, along with its row and column number:
for (i in 1:360) {
  for (j in 1:100) {
    if ((data1[i,j] > 2.5) | (data2[i,j] > 2.5)) {
      cat(i, j, data1[i,j], data2[i,j], "\n", file="extr-b2.5.txt", append=TRUE)
    }
  }
}
I get this error because of NAs.
Error in if ((data1[i, j] > 2.5) | (data2[i, j] > :
missing value where TRUE/FALSE needed
If I set i to 1:220 (every column has at least 220 rows), it works fine.
How can I modify the above code to ignore NA values?

I would use something like this (note: which() drops NA results automatically, and if you want positions where either value exceeds 2.5, as in your loop, use | instead of &):
idx <- which(dat1 > 2.5 & dat2 > 2.5, arr.ind = TRUE)
cbind(idx, v1 = dat1[idx], v2 = dat2[idx])
Reproducible example:
set.seed(1)
dat1 <- as.data.frame(matrix(runif(12, 1, 5), ncol = 3))
dat2 <- as.data.frame(matrix(runif(12, 1, 5), ncol = 3))
idx <- which(dat1 > 2.5 & dat2 > 2.5, arr.ind = TRUE)
cbind(idx, v1 = dat1[idx], v2 = dat2[idx])
# row col v1 v2
# [1,] 3 1 3.291413 4.079366
# [2,] 4 1 4.632831 2.990797
# [3,] 2 2 4.593559 4.967624
# [4,] 3 2 4.778701 2.520141
# [5,] 4 2 3.643191 4.109781
# [6,] 1 3 3.516456 4.738821
where dat1 and dat2 are:
# dat1
# V1 V2 V3
# 1 2.062035 1.806728 3.516456
# 2 2.488496 4.593559 1.247145
# 3 3.291413 4.778701 1.823898
# 4 4.632831 3.643191 1.706227
# > dat2
# V1 V2 V3
# 1 3.748091 3.870474 4.738821
# 2 2.536415 4.967624 1.848570
# 3 4.079366 2.520141 3.606695
# 4 2.990797 4.109781 1.502220
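To see why this sidesteps the NA problem: a comparison with NA yields NA, and which() treats NA as not-TRUE and simply drops that position. A minimal illustration:
x <- c(1, NA, 3)
x > 2.5        # FALSE NA TRUE
which(x > 2.5) # 3; the NA position is dropped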

Without the for loops, you can use pmax to compare the two arrays:
bigger <- pmax(data1, data2, na.rm = TRUE)
This gives an array of the elementwise maxima (na.rm = TRUE keeps the non-missing value when only one of a pair is NA). Then you can check where the max exceeds 2.5:
which(bigger > 2.5, arr.ind = TRUE)
will give the locations where the max is bigger than your cutoff.
For completeness, if I were to do it in your double-loop framework, I would just set the missing values to something below the minimum of all the other data; this will work so long as you have a value below 2.5 somewhere in your data.
lowest <- min(unlist(data1), unlist(data2), na.rm = TRUE) # na.rm is needed, or min() itself returns NA
data1[is.na(data1)] <- lowest
data2[is.na(data2)] <- lowest
Then run your double loop.
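If you would rather not overwrite the NAs, another option (a sketch, not part of the answers above) is to wrap each scalar comparison in isTRUE(), which treats NA as FALSE:
for (i in 1:360) {
  for (j in 1:100) {
    v1 <- data1[i, j]
    v2 <- data2[i, j]
    # isTRUE(NA > 2.5) is FALSE, so NA cells are skipped
    if (isTRUE(v1 > 2.5) || isTRUE(v2 > 2.5)) {
      cat(i, j, v1, v2, "\n", file = "extr-b2.5.txt", append = TRUE)
    }
  }
}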

Related

R - split dataset by row position and save in different files

I have a huge dataset in which several mini datasets were merged. I want to split them into different data frames and save them. The mini datasets are identified by a variable name (which always includes the string "-gram") on a given row.
I have been trying to construct a for loop, but with no luck.
grams <- read.delim("grams.tsv", header=FALSE) # read dataset
index <- which(grepl("-gram", grams$V1), arr.ind=TRUE) # row positions where each mini dataset starts
index[10] <- nrow(grams) # add the total number of rows as the last element of the vector
start <- c() # initialize vector
end <- c() # initialize vector
for (i in 1:length(index)-1) for (k in 2:length(index)) {
  start[i] <- index[i] # add value to the vector start
  if (k != 10) { end[k-1] <- index[k]-1 } else { end[k-1] <- index[k] } # add value to the vector end
  gram <- grams[start[i]:end[i],] # subset grams so the mini dataset runs from start to end
  write.csv(gram, file=paste0("grams_", i, ".csv"), row.names=FALSE) # save dataset
}
I get an error when I try to subset the dataset:
Error in start[i]:end[i] : argument of length 0
...and I do not understand why! Can anyone help me?
Thanks!
The error itself comes from operator precedence: 1:length(index)-1 parses as (1:length(index)) - 1, i.e. 0:9, so the first pass uses i = 0, start[0] <- ... assigns nothing, and start[i]:end[i] gets a zero-length argument. Rather than patch the loop, you can cumsum and split:
dat <- data.frame(V1 = c("foo", "bar", "quux-gram", "bar-gram", "something", "nothing"),
                  V2 = 1:6, stringsAsFactors = FALSE)
dat
# V1 V2
# 1 foo 1
# 2 bar 2
# 3 quux-gram 3
# 4 bar-gram 4
# 5 something 5
# 6 nothing 6
grepl("-gram$", dat$V1)
# [1] FALSE FALSE TRUE TRUE FALSE FALSE
cumsum(grepl("-gram$", dat$V1))
# [1] 0 0 1 2 2 2
spl_dat <- split(dat, cumsum(grepl("-gram$", dat$V1)))
spl_dat
# $`0`
# V1 V2
# 1 foo 1
# 2 bar 2
# $`1`
# V1 V2
# 3 quux-gram 3
# $`2`
# V1 V2
# 4 bar-gram 4
# 5 something 5
# 6 nothing 6
With that, you can write them to files with:
ign <- Map(write.csv, spl_dat, sprintf("gram-%03d.csv", seq_along(spl_dat)),
           list(row.names = FALSE))
An option with group_split and endsWith (stringr is not needed; endsWith is base R):
library(dplyr)
dat %>%
  group_split(grp = cumsum(endsWith(V1, '-gram')), keep = FALSE)
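To write those pieces to files as well, you can reuse the sprintf() pattern from the base answer (a sketch; note that recent dplyr versions spell the argument .keep rather than keep):
spl_tbl <- dat %>%
  group_split(grp = cumsum(endsWith(V1, '-gram')), .keep = FALSE)
for (i in seq_along(spl_tbl)) {
  write.csv(spl_tbl[[i]], sprintf("gram-%03d.csv", i), row.names = FALSE)
}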

Split integers based on a value in second column, assign new values, and, recombine into new dataset

In R, I have an n x 2 matrix of data containing all integers.
The first column indicates the size of an item. Some of these sizes are due to merging, so the second column indicates the number of items that went into that size (possibly just 1); I call it the 'index'. The sum of the indices indicates how many items were actually in the original data.
I now need to create a new dataset that splits any merged sizes back out according to the number in the index, resulting in a new two-column dataset (with a new length according to the total number of indices) whose second column is all 1's.
I need this split to happen in two ways.
"Homogeneously" where any merged sizes are assigned to the number of indices as homogeneously as possible. For instance, a size of 6 with index of 3 would now be c(2,2,2). Importantly, all number have to be integers, so it should be something like c(1,2) or c(2,1). It cant be c(1.5,1.5).
"Heterogeneously" where the number of sizes are skewed to assign 1 to all positions in the index except one, which would contain the reminder. For instance, of a size of 6 with index of 3, it would now be c(1,1,4) or any combination of 1, 1, and 4.
Below I am providing some sample data that gives an example of what I have, what I want, and what I have tried.
#Example data that I have
Y.have<-cbind(c(19,1,1,1,1,4,3,1,1,8),c(3,1,1,1,1,2,1,1,1,3))
The data show that three items went into the size of 19 in the first row, one item went into the size of 1 in the second row, and so on. Importantly, these data originally held 15 items (i.e. sum(Y.have[,2])), some of which got merged, so the final data will need to be of length 15.
What I want the data to look like is:
####Homogeneous separation - split values as evenly as possible
#' The value of 19 in row 1 is now a vector of c(6,6,7) (or any combination thereof, i.e. c(6,7,6) is fine) since the position in the second column is a 3
#' Rows 2-5 are unchanged since they have a 1 in the second column
#' The value of 4 in row 6 is now a vector of c(2,2) since the position in the second column is a 2
#' Rows 7-9 are unchanged since they have a 1 in the second column
#' The value of 8 in row 10 is now a vector of c(3,3,2) (or any combination thereof) since the position in the second column is a 3
Y.want.hom<-cbind(c(c(6,6,7),1,1,1,1,c(2,2),3,1,1,c(3,3,2)),c(rep(1,times=sum(Y.have[,2]))))
####Heterogeneous separation - split values with as many singles as possible
#' The value of 19 in row 1 is now a vector of c(1,1,17) (or any combination thereof, i.e. c(1,17,1) is fine) since the position in the second column is a 3
#' Rows 2-5 are unchanged since they have a 1 in the second column
#' The value of 4 in row 6 is now a vector of c(1,3) since the position in the second column is a 2
#' Rows 7-9 are unchanged since they have a 1 in the second column
#' The value of 8 in row 10 is now a vector of c(1,1,6) (or any combination thereof) since the position in the second column is a 3
Y.want.het<-cbind(c(c(1,1,17),1,1,1,1,c(1,3),3,1,1,c(1,1,6)),c(rep(1,times=sum(Y.have[,2]))))
Note that the positions of the integers in the final data don't matter since they will all have one index case.
I have tried splitting the data (split) according to the index. This creates a list whose length is the number of unique index values. I then iterated through the positions in that list and divided by the position.
a <- split(Y.have[,1], Y.have[,2]) # Split into a list according to the index
b <- list() # initiate new list
for (i in 1:length(a)){
  b[[i]] <- a[[i]]/i # get homogeneous values
  b[[i]] <- rep(b[[i]], times=i) # repeat the values based on the number of indices
}
Y.test <- cbind(unlist(b), rep(1, times=length(unlist(b)))) # create new dataset
This was a terrible approach. First, it produces decimals. Second, the position in the list does not necessarily equal the index number (i.e. if there were no index of 2, the second position would hold the next-lowest index but would still be divided by 2).
However, it at least allowed me to separate the data by index, manipulate it, and recombine it to the proper length. I now need help with the middle part: manipulating the data for both homogeneous and heterogeneous reassignment. I would prefer base R, but any approach would certainly be fine! Thank you in advance!
Here might be one approach.
Create two functions for homogeneous and heterogeneous splits:
get_hom_ints <- function(M, N) {
  vec <- rep(floor(M/N), N)          # N equal integer shares
  for (i in seq_len(M - sum(vec))) { # hand out the remainder one unit at a time
    vec[i] <- vec[i] + 1
  }
  vec
}
get_het_ints <- function(M, N) {
  vec <- rep(1, N)           # give every position a 1...
  vec[1] <- M - sum(vec) + 1 # ...and the first position the remainder
  vec
}
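For example, with row 1 of Y.have (a size of 19 spread over 3 items):
get_hom_ints(19, 3) # 7 6 6
get_het_ints(19, 3) # 17 1 1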
Then use apply to go through each row of the matrix:
het_vec <- unlist(apply(Y.have, 1, function(x) get_het_ints(x[1], x[2])))
unname(cbind(het_vec, rep(1, length(het_vec))))
hom_vec <- unlist(apply(Y.have, 1, function(x) get_hom_ints(x[1], x[2])))
unname(cbind(hom_vec, rep(1, length(hom_vec))))
Output
(heterogeneous)
[,1] [,2]
[1,] 17 1
[2,] 1 1
[3,] 1 1
[4,] 1 1
[5,] 1 1
[6,] 1 1
[7,] 1 1
[8,] 3 1
[9,] 1 1
[10,] 3 1
[11,] 1 1
[12,] 1 1
[13,] 6 1
[14,] 1 1
[15,] 1 1
(homogeneous)
[,1] [,2]
[1,] 7 1
[2,] 6 1
[3,] 6 1
[4,] 1 1
[5,] 1 1
[6,] 1 1
[7,] 1 1
[8,] 2 1
[9,] 2 1
[10,] 3 1
[11,] 1 1
[12,] 1 1
[13,] 3 1
[14,] 3 1
[15,] 2 1
The partitions package was created for exactly this type of requirement; check it out. Apply the logic below to your code and it should work.
Example:
library(partitions)
hom <- restrictedparts(19, 3) # 19 is Y.have[1,1] and 3 is Y.have[1,2] in your data
print(hom[, ncol(hom)])
# output: 7 6 6
het <- Reduce(intersect, list(which(hom[2, ] %in% 1), which(hom[3, ] %in% 1)))
hom[, het]
# output: 17 1 1
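To apply that to every row of Y.have, one could wrap it up like this (a sketch based on the behaviour shown above, where the last column of restrictedparts() holds the most even partition):
library(partitions)
# hypothetical wrapper: most even integer split of size M over N items
hom_split <- function(M, N) {
  p <- restrictedparts(M, N)
  p[, ncol(p)]
}
hom_split(8, 3) # 3 3 2
hom_all <- unlist(lapply(seq_len(nrow(Y.have)),
                         function(i) hom_split(Y.have[i, 1], Y.have[i, 2])))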
One option would be to use integer division (%/%) and the modulus (%%). It may not give the exact results you specified, i.e. 8 and 3 give c(2,2,4) rather than c(3,3,2), but it generally does what you described.
Y.have<-cbind(c(19,1,1,1,1,4,3,1,1,8),c(3,1,1,1,1,2,1,1,1,3))
homoVec <- c()
for (i in 1:length(Y.have[,1])) {
  if (Y.have[i,2] == 1) {
    a <- Y.have[i,1]
    homoVec <- append(homoVec, a)
  } else {
    quantNum <- Y.have[i,1]
    indexNum <- Y.have[i,2]
    b <- quantNum %/% indexNum # equal integer share
    c <- quantNum %% indexNum # remainder
    a <- c(rep(b, indexNum-1), b + c) # last position absorbs the remainder
    homoVec <- append(homoVec, a)
  }
}
homoOut <- data.frame(homoVec, 1)
heteroVec <- c()
for (i in 1:length(Y.have[,1])) {
  if (Y.have[i,2] == 1) {
    a <- Y.have[i,1] # unmerged rows keep their size
    heteroVec <- append(heteroVec, a)
  } else {
    quantNum <- Y.have[i,1]
    indexNum <- Y.have[i,2]
    firstNum <- quantNum - (indexNum - 1) # what is left after giving 1 to every other position
    a <- c(firstNum, rep(1, indexNum - 1))
    heteroVec <- append(heteroVec, a)
  }
}
heteroOut <- data.frame(heteroVec, 1)
If it is really important to have the math exactly as you described in your example, then this should work:
homoVec <- c()
for (i in 1:length(Y.have[,1])) {
  if (Y.have[i,2] == 1) {
    a <- Y.have[i,1]
    homoVec <- append(homoVec, a)
  } else {
    quantNum <- Y.have[i,1]
    indexNum <- Y.have[i,2]
    b <- round(quantNum/indexNum) # rounded share, e.g. round(8/3) = 3
    roundSum <- b * (indexNum - 1)
    c <- quantNum - roundSum # last position takes what is left, e.g. 8 - 6 = 2
    a <- c(rep(b, indexNum-1), c)
    homoVec <- append(homoVec, a)
  }
}
homoOut <- data.frame(homoVec, 1)
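In either variant, a quick sanity check is that the result has one entry per original item and conserves the total size:
length(homoVec) == sum(Y.have[, 2]) # TRUE: 15 items in total
sum(homoVec) == sum(Y.have[, 1])    # TRUE: sizes are conserved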

Subset of a data frame based on multiple conditions

I'm having trouble with a particular part of my code. I have a data frame:
n <- 6
set.seed(123)
df <- data.frame(x = paste0("x", 1:n),
                 A = sample(c(-2:2), n, replace = TRUE),
                 B = sample(c(-1:3), n, replace = TRUE))
#
# x A B
# 1 x1 -1 1
# 2 x2 1 3
# 3 x3 0 1
# 4 x4 2 1
# 5 x5 2 3
# 6 x6 -2 1
and a decision tree as
A>0;Y;Y;N;N
B>1;Y;N;Y;N
C;1;2;2;1
that I load by
dt <- read.csv2("tmp.csv", header=FALSE)
I'd like to create a loop over all the possible combinations of (A>0) and (B>1) and set the C value for the subset of rows that satisfies each condition. So, here's what I did:
nr <- 3
nc <- 5
cond <- dt[1:(nr-1),1,drop=FALSE]
rule <- dt[nr,1,drop=FALSE]
subdf <- vector(mode="list",2^(nr-1))
for (i in 2:nc) {
  check <- paste0("")
  for (j in 1:(nr-1)) {
    case <- paste0(dt[j,1])
    if (dt[j,i] == "N")
      case <- paste0("!", case)
    check <- paste0(check, "(", case, ")")
    if (j < (nr-1))
      check <- paste0(check, "&")
  }
  subdf[i] <- subset(df, check)
  subdf[i]$C <- dt[nr,i]
}
unlist(subdf)
Unfortunately, I got an error using subset, as it cannot parse the conditions from my string statements. What should I do?
Your issue is in how you create the subset: subset expects a logical vector and you gave it a string ('check'). So the simplest solution here is to add a parse. I feel there is a more elegant way to solve this problem, and I hope someone will come along and do it, but you can fix the final part of your code with the following:
mysubset <- subset(df, with(df, eval(parse(text = check))))
if (nrow(mysubset) > 0) {
  mysubset$C <- dt[nr, i]
}
subdf[[i]] <- mysubset
I have added the parse/eval part to generate a vector of booleans so that only the TRUE cases are kept, plus a check on whether C can be added (assigning it to a zero-row subset would give an error).
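Putting it together, the corrected loop might look like this (a sketch; here rep() replaces the nrow() guard so that empty subsets keep the same columns and the pieces can be recombined with rbind()):
subdf <- vector(mode = "list", nc - 1)
for (i in 2:nc) {
  check <- ""
  for (j in 1:(nr - 1)) {
    case <- paste0(dt[j, 1])
    if (dt[j, i] == "N") case <- paste0("!", case)
    check <- paste0(check, "(", case, ")")
    if (j < (nr - 1)) check <- paste0(check, "&")
  }
  mysubset <- subset(df, with(df, eval(parse(text = check))))
  mysubset$C <- rep(dt[nr, i], nrow(mysubset)) # length 0 when no rows match
  subdf[[i - 1]] <- mysubset
}
do.call(rbind, subdf)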
Based on the previous answer, I came up with a more elegant/practical way of generating a vector of combined rules, and then applying them all to the data, using apply/lapply.
## create list of formatted rules
# format each 'building block' separately, based on rows in 'dt'
part_conditions <- apply(dt[-nrow(dt), ], MARGIN = 1, FUN = function(x) {
  sprintf("(%s%s)", ifelse(x[-1] == "Y", "", "!"), x[1])
})
# > part_conditions
# 1 2
# [1,] "(A>0)" "(B>1)"
# [2,] "(A>0)" "(!B>1)"
# [3,] "(!A>0)" "(B>1)"
# [4,] "(!A>0)" "(!B>1)"
#combine to vector of conditions
conditions <- apply(part_conditions, MARGIN=1,FUN=paste, collapse="&")
# > conditions
# [1] "(A>0)&(B>1)" "(A>0)&(!B>1)" "(!A>0)&(B>1)" "(!A>0)&(!B>1)"
#for each condition, test in the data whether the condition holds
temp <- sapply(conditions, function(rule) {
  with(df, eval(parse(text = rule)))
})
rules <- as.numeric(t(dt[nrow(dt),-1]))
#then find which of the (in this case) four conditions is TRUE for each row, and
#put the appropriate rule in df; the conditions are mutually exclusive and
#exhaustive, so exactly one matches per row
df$C <- rules[apply(temp, 1, which)]
> df
x A B C
1 x1 -1 1 1
2 x2 1 3 1
3 x3 0 1 1
4 x4 2 1 2
5 x5 2 3 1
6 x6 -2 1 1

'Random' Sorting with a condition in R for Psychology Research

I have a Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row i.e. no more than 2x negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1 because I'm not allowed to have a pattern.
I tried dplyr and functions like sample, order, and sort, but nothing so far solves the problem.
I think there are a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
1. Take the vector of valences and sample from it.
2. If sequences are found that are longer than the desired length, then (for each such sequence) take the last value of that sequence and place it "somewhere else".
3. Check if the problem is solved. If so, return the reordered vector. If not, go back to 2.
# some vector of valences
val <- rep(1:3, each=50)
pseudoRandomize <- function(x, n){
  # take an initial sample
  out <- sample(x)
  # check if the sample is "bad" (containing sequences longer than n)
  bad.seq <- any(rle(out)$lengths > n)
  # length of the whole sample
  l0 <- length(out)
  while(bad.seq){
    # get lengths of all subsequences
    l1 <- rle(out)$lengths
    # find the bad ones
    ind <- l1 > n
    # take the last value of each bad sequence, and...
    for(i in cumsum(l1)[ind]){
      # take it out of the original sample
      tmp <- out[-i]
      # pick a new position at random
      pos <- sample(2:(l0-2), 1)
      # put the value back into the sample at the new position
      out <- c(tmp[1:(pos-1)], out[i], tmp[pos:(l0-1)])
    }
    # check if bad sequences (still) exist
    # if TRUE, then 'while' continues; if FALSE, then it doesn't
    bad.seq <- any(rle(out)$lengths > n)
  }
  # return the reordered sequence
  out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...
I believe this loop will set the Valence Categories appropriately. I've called the valence category treat.
#Generate example data
s1 <- data.frame(id=c(1:10), treat=NA)
#Set the first two rows
s1[1,"treat"] <- sample(1:3, 1)
s1[2,"treat"] <- sample(1:3, 1)
#Loop through the remainder of the rows
for (i in 3:length(s1$id)) {
  s1[i,"treat"] <- sample(1:3, 1)
  #Check if the treat value is equal to the previous two values
  if (s1[i,"treat"] == s1[i-1,"treat"] & s1[i-1,"treat"] == s1[i-2,"treat"]) {
    #If so, draw one of the values not equal to that value
    a <- 1:3
    remove <- s1[i,"treat"]
    a <- a[!a == remove]
    s1[i,"treat"] <- sample(a, 1)
  }
}
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.
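Whichever approach you use, you can verify the constraint afterwards with rle() (using the s1 data frame from the loop above):
# TRUE if no valence occurs more than twice in a row
max(rle(s1$treat)$lengths) <= 2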

Delete a row and recalculate R^2

I have coded the following in R:
The user chooses a file that contains 2 columns (V1 and V2) with numerous rows (the number of rows will vary depending on the input file).
The script calculates the rsq of the relationship between the 2 variables. There can be anything from 10 to 1000 rows of data depending on the input file.
I want to code the following:
The code should loop through all rows, removing/omitting/ignoring one row at a time and calculating the new rsq with this row missing. So, for example:
There are 10 rows of data and the total rsq = 0.97.
Step 1: The first row of data is removed from the equation. The rsq is calculated again, but this time for 9 rows, giving rsq = 0.98.
Step 2: The 1st row is re-added and the 2nd row is removed; rsq is calculated again.
Step 3: The 2nd row is re-added and the 3rd row is removed; rsq is calculated again.
After each loop the "new rsq" will be placed in a new column next to the row that was removed.
Can anyone advise how to do this? I have this coded in Excel and it works well, but it is cumbersome and therefore not ideal.
Do you want to do something like this?
# Make some sample data
set.seed(1095)
data <- data.frame(V1 = 1:10, V2 = sample.int(5, 10, replace = TRUE))
# Use sapply to get the r2 with each row removed in turn
r2 <- sapply(1:nrow(data), function(x) cor(data[-x, 1], data[-x, 2])^2)
# Combine into a data frame
newdata <- cbind( data , r2 )
newdata
# V1 V2 r2
# 1 1 5 0.2526316
# 2 2 3 0.4657601
# 3 3 5 0.3204721
# 4 4 5 0.3691612
# 5 5 1 0.5405405
# 6 6 3 0.3769480
# 7 7 3 0.3840426
# 8 8 2 0.3409425
# 9 9 1 0.2725806
# 10 10 3 0.4986702
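If you prefer to read the R^2 off an actual model fit, lm() gives the same numbers, since for a simple two-variable regression R^2 equals the squared correlation:
# leave-one-out R^2 via lm(); should match the cor()^2 version above
r2_lm <- sapply(seq_len(nrow(data)), function(x)
  summary(lm(V2 ~ V1, data = data[-x, ]))$r.squared)
all.equal(r2, r2_lm) # TRUE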
