Accessing high dimension tables - clearer way to index the different dimensions? - r

I was wondering if there is a clearer easier way of accessing different dimensions of a table.
I have this code
datasettable = addmargins(
table(dataset[c('observer','condition','stimulus1', 'stimulus2','response')]),
4, FUN = sum)
And I access it the following way:
datasettable[,'u',1,'sum',]
However, I find accessing it this way somewhat confusing. Because the indices for the different dimensions are separated by a comma it is easy to confuse the indices for the separate dimensions.
Is there a way to specify the indices for the different dimensions by name (especially important for numerical indices), such as with
datasettable ['observer'=='ALL','condition'=='u',
'stimulus1'==1, 'stimulus2'=='sum','response'=='ALL']

I'll make up some data (hint: including data of your own helps you get better answers; dput can be a great tool for that).
dataset <- expand.grid(observer=LETTERS[1:3], condition=c("u","v"),
stimulus1=1:2, stimulus2=1:2)
set.seed(5)
dataset$response <- sample(1:4, nrow(dataset), replace=TRUE)
datasettable <- addmargins(table(dataset), 4, FUN = sum)
What you suggest is this:
> datasettable[,'u',1,'sum',]
response
observer 1 2 3 4
A 1 1 0 0
B 0 0 2 0
C 0 1 0 1
I'd probably get the total without first converting to a table, perhaps using the reshape package, like this:
> library(reshape)
> dw <- cast(dataset, condition + stimulus1 + observer ~ response,
fun.aggregate=length, value="stimulus2")
> subset(dw, condition=="u" & stimulus1==1)
condition stimulus1 observer 1 2 3 4
1 u 1 A 1 1 0 0
2 u 1 B 0 0 2 0
3 u 1 C 0 1 0 1
But to answer your question, no, I don't think there's an alternate way built in to access parts of a table, but you could certainly build one, maybe like this:
tableaccess <- function(tabl, ...) {
d <- list(...)
vv <- c(list(tabl), as.list(rep(TRUE, length(dim(tabl)))))
vv[match(names(d), names(dimnames(tabl)))+1] <- d  # use the argument, not the global datasettable
do.call(`[`, vv)
}
with a result of
> tableaccess(datasettable, condition='u', stimulus1=1, stimulus2='sum')
response
observer 1 2 3 4
A 1 1 0 0
B 0 0 2 0
C 0 1 0 1

For what you described, you can use the subset function:
subset(datasettable, observer == 'ALL' & condition == 'u' &
stimulus1 == 1 & stimulus2 == 'sum' & response == 'ALL')
This, of course, assumes datasettable is a data.frame.
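Since the object in the question is a table rather than a data.frame, one way to make the subset approach applicable is to convert it first. The following is only a sketch, reusing the made-up data from the other answer; the names long and u1sum are hypothetical, and the counts land in a column that as.data.frame.table calls Freq.

```r
# Rebuild the made-up data from the answer above (the values are arbitrary).
dataset <- expand.grid(observer = LETTERS[1:3], condition = c("u", "v"),
                       stimulus1 = 1:2, stimulus2 = 1:2)
set.seed(5)
dataset$response <- sample(1:4, nrow(dataset), replace = TRUE)
datasettable <- addmargins(table(dataset), 4, FUN = sum)

# One row per cell of the table; the counts end up in a column called Freq.
long <- as.data.frame.table(datasettable)

# Every dimension is now a named factor column, so compare against strings.
u1sum <- subset(long, condition == "u" & stimulus1 == "1" & stimulus2 == "sum")
u1sum
```

Each condition names the dimension it applies to, so the order of the conditions no longer matters.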


Is there a best practice in producing a series of columns filled with values based on logical checks?

Consider the following data:
mydf<-data.frame(ID=c('a','b','c','d','e'),
home=c(0,2,1,0,0),
aroundtown=c(0,3,0,1,2),
outinspace=c(5,0,0,2,1))
My goal is to produce columns _any and _exclusive for each of home, aroundtown, and outinspace. The _any columns should be filled with the original values associated with the ID and the variable. The _exclusive columns should keep the original value only when no other column for that ID has a value other than zero, and should read 0 otherwise. (As shown below).
I am able to get this done with the following:
library(dplyr)
library(tidyr)  # pivot_longer, uncount, pivot_wider
md2<-mydf%>%
pivot_longer(cols=-ID)%>%
group_by(ID)%>%
mutate(tot=sum(value))%>%
mutate(Exclusive=ifelse(value==tot,"Exclusive",""))%>%
mutate(freq=ifelse(Exclusive=="",1,2))%>%
select(-Exclusive)%>%
uncount(freq)%>%
group_by(ID,name)%>%
mutate(exclusive=ifelse(cumsum(value)>value,"exclusive","Any"))%>%
select(-tot)%>%
pivot_wider(names_from = c(name,exclusive),values_from = value)%>%
replace(is.na(.),0)
I can't help but wonder if there is a better, or preferred-by-some, means of achieving this.
The ideal solution would
Avoid explicitly naming home,aroundtown, and outinspace
Have the flexibility to allow any number of variables beyond ID
Allow for multiple logical checks similar in nature (for instance _mostly comes to mind), resulting in columns for each check.
My preferred solution would be to use something not necessarily computationally optimized but more readable and adaptable as several of your requirements involve that aspect.
I decided to separate each of your logical categories (any, exclusive) into different functions. This way, you can add whatever criteria you want in the future by creating a new function (for instance is_Mostly).
is_Any <- function(dfrow){
return(dfrow)
}
is_Exclusive <- function(dfrow){
nonzeros <- which(dfrow > 0)
if (length(nonzeros) == 1){
dfrow[-nonzeros] <- 0  # zero every entry except the single non-zero one
return(dfrow)
}
return(rep(0, length(dfrow)))
}
is_Mostly <- function(dfrow){
dfrow[!dfrow == max(dfrow)] <- 0
return(dfrow)
}
The master function will be called using the numerical columns of your choice (without naming them) and with the selected conditions.
add_new_columns <- function(df, columns, conditions){
for (condition in conditions){
eval_func <- get(paste0("is_", condition), mode = "function")
new_cols <- paste0(columns, "_", condition)
df[, new_cols] <- t(apply(df[, columns], 1, eval_func))
}
return(df)
}
my_columns <- colnames(mydf)[-1]
my_conditions <- c("Any", "Exclusive")
mydf2 <- add_new_columns(mydf, my_columns, my_conditions)
mydf2
# ID home aroundtown outinspace home_Any aroundtown_Any outinspace_Any home_Exclusive aroundtown_Exclusive outinspace_Exclusive
#1 a 0 0 5 0 0 5 0 0 5
#2 b 2 3 0 2 3 0 0 0 0
#3 c 1 0 0 1 0 0 1 0 0
#4 d 0 1 2 0 1 2 0 0 0
#5 e 0 2 1 0 2 1 0 0 0
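For comparison, the exclusive check can also be written without a row-wise apply. This is only a sketch of an alternative, not the method used above: it assumes every column other than ID is numeric, and the names vars, vals, only_one and excl are made up for illustration.

```r
mydf <- data.frame(ID = c('a', 'b', 'c', 'd', 'e'),
                   home = c(0, 2, 1, 0, 0),
                   aroundtown = c(0, 3, 0, 1, 2),
                   outinspace = c(5, 0, 0, 2, 1))

vars <- setdiff(names(mydf), "ID")   # no variable is named explicitly
vals <- as.matrix(mydf[vars])

# TRUE for rows where exactly one variable is non-zero.
only_one <- rowSums(vals != 0) == 1

# only_one is recycled down the columns, so row i is multiplied by only_one[i].
excl <- vals * only_one
colnames(excl) <- paste0(vars, "_exclusive")

result <- cbind(mydf, excl)
result
```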

R: find difference if at least one of the two values are zero

I would like to calculate the abs(difference) of rows (current with preceding row), if at least one of the values is zero. If both values are non-zeros, it should return 0.
Example: the column 'A' shows the initial data, 'Diff' shows what I want to get:
df <- data.frame(A=c(0,0,1,2,3,4,0,0),Diff=c(0,0,1,0,0,0,4,0))
Thanks
Here's a way using the dplyr package and the lag function:
library(dplyr)
df1 <- data.frame(A=c(0,0,1,2,3,4,0,0),Diff=c(0,0,1,0,0,0,4,0))
df1 %>%
mutate(lag_A = lag(A),
Diff2 = abs(A - lag_A)) %>%
mutate(Diff2 = ifelse(is.na(lag_A), 0,
ifelse(A == 0 | lag_A == 0, Diff2, 0)))
A Diff lag_A Diff2
1 0 0 NA 0
2 0 0 0 0
3 1 1 0 1
4 2 0 1 0
5 3 0 2 0
6 4 0 3 0
7 0 4 4 4
8 0 0 0 0
You could then use the select function to grab the columns you want. I left all of the columns in the table to illustrate the solution.
Assumption
I'm going to assume the first value of Diff is not a result of the operation you want, because the first value of A has no preceding value. I therefore chose to repeat A[1] in the answer, but you can put anything else in its place.
Edit: I also assumed you wanted a solution using only base, otherwise, bouncyball's solution works like a charm.
Solution
Diff = rep(0,(length(A)-1))
Diff[which(mapply(xor,A[-1],A[-length(A)]))] = abs(A[-1][which(mapply(xor,A[-1],A[-length(A)]))]-A[-length(A)][which(mapply(xor,A[-1],A[-length(A)]))])
Diff = c(A[1],Diff)
You can improve upon this by making it a named function like this:
foo = function(A){
Diff = rep(0,(length(A)-1))
Diff[which(mapply(xor,A[-1],A[-length(A)]))] = abs(A[-1][which(mapply(xor,A[-1],A[-length(A)]))]-A[-length(A)][which(mapply(xor,A[-1],A[-length(A)]))])
Diff = c(A[1],Diff)
Diff
}
Explanation
Instead of operating on A directly, we use:
A[-1], which yields 0 1 2 3 4 0 0 (equal to A but without the first element)
and A[-length(A)], which yields: 0 0 1 2 3 4 0 (equal to A, but without the last element).
Subtracting one from the other element-wise gives the differences between consecutive elements, but we still need to know at which positions these subtractions should be kept.
By combining functions xor, mapply and which (all of them are members of base), like this:
which(mapply(xor,A[-1],A[-length(A)])) we find:
2 6 which are the indexes of the positions in A[-1] and A[-length(A)] where subtractions should happen.
Now we create a vector of 0s called Diff, the same size as our new vectors, using:
Diff = rep(0,length(A)-1), then we change only the indexes that should not contain 0s using:
Diff[which(mapply(xor,A[-1],A[-length(A)]))] = abs(A[-1][which(mapply(xor,A[-1],A[-length(A)]))]-A[-length(A)][which(mapply(xor,A[-1],A[-length(A)]))])
And finally we put back the first element again using Diff = c(A[1],Diff).
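The same idea can be sketched more compactly by computing the shifted vector once and letting ifelse pick the positions. This is a hedged variant, not the code above: prev and Diff2 are made-up names, and the first element is compared with itself, so it yields 0 rather than A[1].

```r
A <- c(0, 0, 1, 2, 3, 4, 0, 0)

# Previous value for each position; the first element is paired with itself.
prev <- c(A[1], A[-length(A)])

# Keep the absolute difference only where at least one of the pair is zero.
Diff2 <- ifelse(A == 0 | prev == 0, abs(A - prev), 0)
Diff2
```

When both values are zero the difference is 0 anyway, so using "at least one zero" here gives the same result as the xor test.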

How to get which() to return similar indices from two dataframes?

I have two dataframes (ma.sig, pricebreak) that look like this:
Date A B C
01/1 1 0 1
02/1 1 0 1
Date D E G
01/1 1 0 1
02/1 0 1 0
For starters, I just want to retrieve the column indices for all non-zero values in the first row. I tried doing this via the following methods:
sig <- which(!ma.sig[1,]==0&!pricebreak[1,]==0)
and
sig <- which(!ma.sig[1,]==0)&which(!pricebreak[1,]==0)
I would like it to return something like: 1, 3 (based on the above sample dataframes). However, I get this logical vector:
[1] TRUE FALSE TRUE
How do I get it to return the columns indices? I do not want to use merge to merge my dataframes because of the nature of the data.
EDIT: Just for background information, the above data frames are 'signals' that are on when the values are non-zero. I'm trying to use sig to collect indices that I can use for my main dataframe so that I can only calculate and print outputs when the signals are on.
@serhatCevikel has already given the answer:
I am just trying to explain it in more detail for your convenience.
ma.sig =
Date A B C
01/1 1 0 1
02/1 1 0 1
pricebreak =
Date D E G
01/1 1 0 1
02/1 0 1 0
Now as per your method:
sig <- which(!ma.sig[1,]==0)&which(!pricebreak[1,]==0)
print(sig)
gives:
TRUE TRUE TRUE
Now try:
which(sig)
it returns the positions of the TRUE values within sig:
1 2 3
Please let me know if you get this. I have checked it twice in my terminal. Hope you will get this too.
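Note that 1 2 3 are positions within sig, not the column indices the question asks for. A sketch of one way to get those instead, assuming the Date column is left out and that the two data frames line up column by column (a, b and both_on are made-up names):

```r
ma.sig <- data.frame(Date = c("01/1", "02/1"), A = c(1, 1), B = c(0, 0), C = c(1, 1))
pricebreak <- data.frame(Date = c("01/1", "02/1"), D = c(1, 0), E = c(0, 1), G = c(1, 0))

# Drop the Date column, then test the first row of each frame.
a <- unlist(ma.sig[1, -1]) != 0
b <- unlist(pricebreak[1, -1]) != 0

# Combine the logical vectors first, then ask which positions are TRUE.
both_on <- unname(which(a & b))
both_on
```

The key point is that which() wraps the combined condition; applying & to two which() results compares index numbers rather than signals.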

Converting Categorical Columns into Multiple Binary Columns in R

I am trying to convert a column that has categorical data ('A', 'B', or 'C') to 3 columns where 1,0,0 would be 'A'; 0,1,0 would represent 'B', etc.
I found this code online:
flags = data.frame(Reduce(cbind,
lapply(levels(d$purpose), function(x){(d$purpose == x)*1})
))
names(flags) = levels(d$purpose)
d = cbind(d, flags)
# Include the new columns as input variables
levelnames = paste(names(flags), collapse = " + ")
neuralnet(paste("output ~ ", levelnames), d)
Converting categorical variables in R for ANN (neuralnet)
But I'm very new to R. Can anyone break down what this complicated looking code is doing?
edit:
Implementing @nongkrong's recommendations I'm running into a problem:
CSV:
X1,X2,X3
A,D,Q
B,E,R
C,F,S
B,G,T
C,H,U
A,D,Q
R:
newData <- read.csv("new.csv")
newerData <- model.matrix(~ X1 + X2 + X3 -1, data=newData)
newerData
R Output:
X1A X1B X1C X2E X2F X2G X2H X3R X3S X3T X3U
1 1 0 0 0 0 0 0 0 0 0 0
2 0 1 0 1 0 0 0 1 0 0 0
3 0 0 1 0 1 0 0 0 1 0 0
4 0 1 0 0 0 1 0 0 0 1 0
5 0 0 1 0 0 0 1 0 0 0 1
6 1 0 0 0 0 0 0 0 0 0 0
It works great with 1 column, but is missing X2D and X3Q. Any ideas why?
@nongkrong is right--read ?formula and you'll see that most functions that accept formulas as input (e.g. lm, glm, etc.) will automatically convert categorical variables (stored as factors or characters) to dummies; you can force this on numeric variables by wrapping them in as.factor(var) in your formula.
That said, I've encountered situations where it's convenient to have created these indicators by hand anyway--e.g., a data set with an ethnicity variable where <1% of the data fit in one or several of the ethnicity codes. There are other ways to deal with this (simply delete the minority-minority observations, e.g.), but I find that varies by situation.
So, I've annotated the code for you:
flags = data.frame(Reduce(cbind,
lapply(levels(d$purpose), function(x){(d$purpose == x)*1})
))
Lots going on in this first line, so let's go bit-by-bit:
d$purpose==x checks each entry of d$purpose for equality to x; the result will be TRUE or FALSE (or NA if there are missing values). Multiplying by 1 (*1) forces the output to be numeric (so TRUE becomes 1 and FALSE becomes 0).
lapply applies the function in its second argument to each element of its first argument--so for each element of levels(d$purpose) (i.e., each level of d$purpose), we output a vector of 0s and 1s, where the 1s correspond to the elements of d$purpose matching the given level. The output of lapply is a list (hence l in front of apply), with one list element corresponding to each of the levels of d$purpose.
We want to get this into our data.frame, so a list isn't very useful; Reduce is what we use to back out the information from the list to a data.frame form. Reduce(cbind,LIST) is the same as cbind(LIST[[1]],LIST[[2]],LIST[[3]],...)--convenient shorthand, especially when we don't know the length of LIST.
Wrapping this in data.frame converts the result to class data.frame.
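To see those pieces in isolation, here is a small sketch on a toy factor. The vector purpose and the names per_level and flags_demo are invented for illustration and stand in for d$purpose and flags.

```r
purpose <- factor(c("A", "B", "C", "A"))

# One 0/1 vector per level, returned as a list.
per_level <- lapply(levels(purpose), function(x) (purpose == x) * 1)

# Reduce(cbind, ...) glues the list elements together column by column.
flags_demo <- data.frame(Reduce(cbind, per_level))
names(flags_demo) <- levels(purpose)
flags_demo
```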
#This line simply puts column names on each of the indicator variables
# Note that you can replace the RHS of this line with whatever
# naming convention you want for the levels--a common approach might
# be to specify paste0(levels(d$purpose),"_flag"), e.g.
names(flags) = levels(d$purpose)
#this line adds all the indicator variables to the original
# data.frame
d = cbind(d, flags)
#this creates a string of the form "level1 + level2 + ... + leveln"
levelnames = paste(names(flags), collapse = " + ")
#finally we create a formula of the form y~x+d1+d2+d3
# where each of the d* is a dummy for a level of the categorical variable
neuralnet(paste("output ~ ", levelnames), d)
Also note that something like this could have been done much more simply with the data.table package:
library(data.table)
setDT(d)
l = levels(d$purpose)
d[ , (l) := lapply(l, function(x) as.integer(purpose == x))]
neuralnet(as.formula(paste0("output ~ ", paste(l, collapse = " + "))), d)
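On the edit about the missing X2D and X3Q columns: with the intercept removed, model.matrix keeps every level of the first factor but still applies the default treatment contrasts to the remaining factors, which drops their first level. A sketch of one way to request all levels, assuming the columns are read in as factors (full is a made-up name, and the data frame below retypes the CSV from the question):

```r
newData <- data.frame(X1 = c("A", "B", "C", "B", "C", "A"),
                      X2 = c("D", "E", "F", "G", "H", "D"),
                      X3 = c("Q", "R", "S", "T", "U", "Q"),
                      stringsAsFactors = TRUE)

# contrasts = FALSE asks for a full indicator (identity) coding per factor.
full <- model.matrix(~ X1 + X2 + X3 - 1, data = newData,
                     contrasts.arg = lapply(newData, contrasts, contrasts = FALSE))
colnames(full)
```

The resulting matrix is rank-deficient on purpose, which is usually fine as input to a neural network but not for an unpenalised linear model.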

need to count number of specific transitions in a vector in R

I am programming a sampler in R, which is basically a big for loop, and in every iteration I have to count the number of transitions in a vector. I have a vector called k, which contains zeros and ones, with 1000 entries.
I have used the following, horribly slow, code:
#we determine the number of transitions n00,n01,n10,n11
n00=n01=n10=n11=0 #reset number of transitions between states from last time
for(j in 1:(1000-1)){
if(k[j+1]==1 && k[j]==0) {n01<-n01+1}
else { if(k[j+1]==1 && k[j]==1) {n11<-n11+1}
else { if(k[j+1]==0 && k[j]==1) {n10<-n10+1}
else{n00<-n00+1}
}
}
}
So every time the loop runs, the variables n00,n01,n10,n11 count the transitions in the vector. For example, n00 counts the number of times a 0 is followed by another 0. And so on...
This is very slow, and I am very new to R, so I am kind of desperate here. I do not understand how to use grep, if that even is possible.
Thank you for your help
Try something like this:
x <- sample(0:1,20,replace = TRUE)
> table(paste0(head(x,-1),tail(x,-1)))
00 01 10 11
4 3 4 8
The head and tail return portions of the vector x: all but the last element, and then all but the first element. This means that the corresponding elements are the consecutive pairs from x.
Then paste0 converts each element to a character string and pastes the first elements together, then the second elements, and so on. The result is a character vector with elements like "00", "01", etc. Then table just counts up how many of each there are.
You can assign the result to a new variable like so:
tab <- table(paste0(head(x,-1),tail(x,-1)))  # avoid the name T, which is shorthand for TRUE
Experiment yourself with each piece of the code to see how it works. Run just head(x,-1), etc. to see what each piece does.
To address the comment below, to ensure that all types appear with counts when you run table, convert it to a factor first:
x1 <- factor(paste0(head(x,-1),tail(x,-1)),levels = c('00','01','10','11'))
table(x1)
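A related sketch that maps the counts back onto the n00, n01, n10 and n11 variables from the question uses a two-way table of "from" and "to" states. The short vector k below is made up so the counts can be checked by hand, and trans is a hypothetical name.

```r
k <- c(0, 0, 1, 1, 0, 1, 0, 0)

# Rows are the current state, columns the next state; fixing the levels
# keeps all four cells even when a transition never occurs.
trans <- table(from = factor(head(k, -1), levels = 0:1),
               to   = factor(tail(k, -1), levels = 0:1))

n00 <- trans["0", "0"]
n01 <- trans["0", "1"]
n10 <- trans["1", "0"]
n11 <- trans["1", "1"]
trans
```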
If we don't care about distinguishing the n00 and n11 cases, then this becomes much simpler:
x <- sample(0:1,20,replace = TRUE)
# [1] 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 0
table(diff(x))
# -1 0 1
# 4 11 4
Since the question says that you're primarily interested in the transitions, this may be acceptable, otherwise one of the other answers would be preferable.
x <- sample(0:1, 10, replace = TRUE)
# my sample: [1] 0 0 0 0 0 1 0 1 1 0
rl <- rle(x)
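zero_to_zero <- sum(rl$lengths[rl$values == 0 & rl$lengths > 1] - 1)
one_to_one <- sum(rl$lengths[rl$values == 1 & rl$lengths > 1] - 1)
zero_to_one <- sum(diff(rl$values) == 1)   # a run of 0s followed by a run of 1s
one_to_zero <- sum(diff(rl$values) == -1)  # a run of 1s followed by a run of 0s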
x
# [1] 0 0 0 0 0 1 0 1 1 0
zero_to_zero
# [1] 4
one_to_one
# [1] 1
zero_to_one
# [1] 2
one_to_zero
# [1] 2
@joran's answer is far cleaner though... Still, I thought I might as well finish the stroll I had started down (the dirty) trail, and share the result.
