I have been trying to figure out why the standardized outputs from these three methods do not compare as equal, even though numerically they appear to be the same.
library(vegan)
# subset data
env.data <- mite.env[1:10, c("SubsDens", "WatrCont")]
# method 1
env.data.x <- env.data
env.data.x$SubsDens <- as.vector(scale(env.data.x$SubsDens))
env.data.x$WatrCont <- as.vector(scale(env.data.x$WatrCont))
# method 2
env.data.y <- env.data
env.data.y <- as.data.frame(decostand(as.matrix(env.data.y), method = "standardize"))
# method 3
env.data.z <- env.data
normalize <- function(x){
return((x - mean(x))/sd(x))
}
env.data.z$SubsDens <- normalize(env.data.z$SubsDens)
env.data.z$WatrCont <- normalize(env.data.z$WatrCont)
# comparison
env.data.x == env.data.y
env.data.x == env.data.z
env.data.y == env.data.z
Here is the output:
> env.data.x == env.data.y
SubsDens WatrCont
1 TRUE TRUE
2 TRUE TRUE
3 TRUE TRUE
4 TRUE TRUE
5 TRUE TRUE
6 TRUE TRUE
7 TRUE TRUE
8 TRUE TRUE
9 TRUE TRUE
10 TRUE TRUE
> env.data.x == env.data.z
SubsDens WatrCont
1 FALSE TRUE
2 FALSE TRUE
3 FALSE TRUE
4 FALSE TRUE
5 FALSE TRUE
6 FALSE TRUE
7 FALSE TRUE
8 FALSE TRUE
9 FALSE TRUE
10 FALSE TRUE
> env.data.y == env.data.z
SubsDens WatrCont
1 FALSE TRUE
2 FALSE TRUE
3 FALSE TRUE
4 FALSE TRUE
5 FALSE TRUE
6 FALSE TRUE
7 FALSE TRUE
8 FALSE TRUE
9 FALSE TRUE
10 FALSE TRUE
Method 3, standardizing using the formula as a function, seems to be doing something different...
Thank you in advance for your answers!
Thank you Jonny Phelps and r2evans for your comments.
I should've just checked the difference between the columns.
env.data.x - env.data.z
Output was on the order of 1e-16, so not at all significant for my purposes.
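For comparisons like this, a tolerance-based check is usually more informative than ==. A minimal sketch, using made-up values rather than the mite.env data (the vector x below is an assumption for illustration only):

```r
# Illustrative values only; not the mite.env data from the question
x <- c(1.2, 3.4, 5.6, 7.8, 9.1)

z1 <- as.vector(scale(x))      # same idea as method 1
z2 <- (x - mean(x)) / sd(x)    # same idea as method 3

# Exact comparison: individual elements may be FALSE due to rounding error
z1 == z2

# Largest absolute difference: tiny, if not exactly zero
max(abs(z1 - z2))

# Comparison within a numerical tolerance
isTRUE(all.equal(z1, z2))

# The classic illustration of the same effect
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
```

all.equal returns a description of the difference rather than FALSE when the values differ, which is why it is wrapped in isTRUE here.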
Related
Thanks in advance for your kind help. This is my dataframe:
df <- data.frame('a'=c(1,2,3,4,5), 'b'=c("A",NA,"B","C","A"))
df
I want to create a new column for each value of df$b, indicating whether that value is present or absent (TRUE/FALSE) in each row. I'm using grepl for this, but I'm not sure how to dynamically create the new columns.
I'm creating a vector with the unique values of df$b
list <- as.vector(unique(df$b))
I want to iterate over it with a for loop, in order to get a data frame like this:
a b A B C
1 1 A TRUE FALSE FALSE
2 2 NA FALSE FALSE FALSE
3 3 B FALSE TRUE FALSE
4 4 C FALSE FALSE TRUE
5 5 A TRUE FALSE FALSE
But I'm not sure how to generate the new column inside the for loop. I'm trying to do something like this:
for (i in list) {
logical <- grepl(df$b, i)
df$i <- logical
}
But it generates an error. Any help will be appreciated.
This may need table
df <- cbind(df, as.data.frame.matrix(table(df) > 0))
-output
df
a b A B C
1 1 A TRUE FALSE FALSE
2 2 <NA> FALSE FALSE FALSE
3 3 B FALSE TRUE FALSE
4 4 C FALSE FALSE TRUE
5 5 A TRUE FALSE FALSE
You can use this for loop
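# unique non-missing values of b (named vals to avoid masking base::list)
vals <- as.vector(unique(na.omit(df$b)))
for (v in vals) {
  # TRUE where b equals v; rows where b is NA become FALSE
  df[[v]] <- ifelse(!is.na(df$b), v == df$b, FALSE)
}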
output
a b A B C
1 1 A TRUE FALSE FALSE
2 2 <NA> FALSE FALSE FALSE
3 3 B FALSE TRUE FALSE
4 4 C FALSE FALSE TRUE
5 5 A TRUE FALSE FALSE
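The same columns can also be built without an explicit loop. This is only a sketch of one possible alternative, not the method of either answer above; the names vals and flags are made up for illustration, and stringsAsFactors = FALSE is set explicitly so that b is a character column regardless of R version:

```r
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c("A", NA, "B", "C", "A"),
                 stringsAsFactors = FALSE)

# Unique non-missing values of b, used as the new column names
vals <- sort(as.vector(unique(na.omit(df$b))))

# One logical column per value; FALSE & NA is FALSE, so NA rows become FALSE
flags <- sapply(vals, function(v) !is.na(df$b) & df$b == v)

# sapply names the matrix columns after vals, so cbind keeps A, B, C
df <- cbind(df, flags)
df
```

Using df[[v]] or a named matrix avoids the problem in the original loop, where df$i creates a column literally called i instead of using the value of i.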
a<-c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE)
b<-c(TRUE,FALSE,TRUE,FALSE,FALSE,FALSE)
c<-c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)
costumer<-c("one","two","three","four","five","six")
df<-data.frame(costumer,a,b,c)
That's some example code. Printed, it looks like this:
costumer a b c
1 one TRUE TRUE TRUE
2 two FALSE FALSE TRUE
3 three TRUE TRUE TRUE
4 four FALSE FALSE FALSE
5 five TRUE FALSE TRUE
6 six FALSE FALSE FALSE
I want to create a new column df$items that contains only the column names that are TRUE for each row in the data. Something like this:
costumer a b c items
1 one TRUE TRUE TRUE a,b,c
2 two FALSE FALSE TRUE c
3 three TRUE TRUE TRUE a,b,c
4 four FALSE FALSE FALSE
5 five TRUE FALSE TRUE a,c
6 six FALSE FALSE FALSE
I thought of using apply function or use which for selecting indexes, but couldn't figure it out. Can anyone help me?
df$items <- apply(df, 1, function(x) paste0(names(df)[x == TRUE], collapse = ","))
df
costumer a b c items
1 one TRUE TRUE TRUE a,b,c
2 two FALSE FALSE TRUE c
3 three TRUE TRUE TRUE a,b,c
4 four FALSE FALSE FALSE
5 five TRUE FALSE TRUE a,c
6 six FALSE FALSE FALSE
df$items = apply(df[2:4], 1, function(x) toString(names(df[2:4])[x]))
df
# costumer a b c items
# 1 one TRUE TRUE TRUE a, b, c
# 2 two FALSE FALSE TRUE c
# 3 three TRUE TRUE TRUE a, b, c
# 4 four FALSE FALSE FALSE
# 5 five TRUE FALSE TRUE a, c
# 6 six FALSE FALSE FALSE
You could use
df$items <- apply(df, 1, function(x) toString(names(df)[which(x == TRUE)]))
Output
# costumer a b c items
# 1 one TRUE TRUE TRUE a, b, c
# 2 two FALSE FALSE TRUE c
# 3 three TRUE TRUE TRUE a, b, c
# 4 four FALSE FALSE FALSE
# 5 five TRUE FALSE TRUE a, c
# 6 six FALSE FALSE FALSE
We can use pivot_longer to reshape to 'long' format and then do a group by paste
library(dplyr)
library(tidyr)
library(stringr)
df %>%
pivot_longer(cols = a:c) %>%
group_by(costumer) %>%
summarise(items = toString(name[value])) %>%
left_join(df)
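A variation on the apply answers above that does not depend on how apply coerces the character column: select the logical columns first, then index their names row by row. This is a sketch under the assumption that every logical column should be included; flag_cols and m are names made up for illustration.

```r
df <- data.frame(costumer = c("one", "two", "three", "four", "five", "six"),
                 a = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
                 b = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE),
                 c = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE))

# Names of the logical columns, detected from the data
flag_cols <- names(df)[vapply(df, is.logical, logical(1))]

# Logical matrix, so each row can be used directly as an index
m <- as.matrix(df[flag_cols])

# Paste the names of the TRUE columns; rows with no TRUE give ""
df$items <- unname(apply(m, 1, function(row) paste(flag_cols[row], collapse = ",")))
df
```

Because flag_cols is computed before items is added, rerunning the last two steps on the same data frame does not pick up the new column.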
I am looking for a function that takes one column of a data.frame as the reference and finds all subsets with respect to the levels of the other variables. For example, let z be a data frame with 4 columns a, b, c, d, each with 2 levels, and let a be the reference. Then z would look like
z$a : TRUE FALSE
z$b : TRUE FALSE
z$c : TRUE FALSE
z$d : TRUE FALSE
Then what I need is a LIST that the elements are combination names such as
aTRUEbTRUEcTRUEdTRUE : subset of the data frame
aTRUEbFALSEcTRUEdTRUE : subset
...
Here is an example,
set.seed(123)
z=matrix(sample(c(TRUE,FALSE),size = 100,replace = TRUE),ncol=4)
colnames(z) = letters[1:4]
z=as.data.frame(z)
output= list(
'bTRUEcTRUEdFALSE' = subset(z,b==TRUE & c==TRUE & d==FALSE),
'bTRUEcTRUEdTRUE' = subset(z,b==TRUE & c==TRUE & d==TRUE),
'bTRUEcFALSEdFALSE' = subset(z,b==TRUE & c==FALSE & d==FALSE),
'bTRUEcFALSEdTRUE' = subset(z,b==TRUE & c==FALSE & d==TRUE)
# and so on ...
)
output
$bTRUEcTRUEdFALSE
a b c d
13 FALSE TRUE TRUE FALSE
14 FALSE TRUE TRUE FALSE
$bTRUEcTRUEdTRUE
a b c d
4 FALSE TRUE TRUE TRUE
10 TRUE TRUE TRUE TRUE
16 FALSE TRUE TRUE TRUE
20 FALSE TRUE TRUE TRUE
24 FALSE TRUE TRUE TRUE
$bTRUEcFALSEdFALSE
a b c d
17 TRUE TRUE FALSE FALSE
19 TRUE TRUE FALSE FALSE
22 FALSE TRUE FALSE FALSE
$bTRUEcFALSEdTRUE
a b c d
5 FALSE TRUE FALSE TRUE
11 FALSE TRUE FALSE TRUE
15 TRUE TRUE FALSE TRUE
18 TRUE TRUE FALSE TRUE
21 FALSE TRUE FALSE TRUE
23 FALSE TRUE FALSE TRUE
However, there are two issues with the example. First, I do not know the number of variables in advance (in this case 4, a to d). Second, the variable names must be taken from the data; simply put, I cannot use subset, since I do not know the variable names to put in the condition (a== could be anything==).
What is the most efficient way of doing this in R?
You can use split and paste like so:
split(z, paste(z$b, z$c, z$d))
But the tricky part of your question is how to programmatically combine the variables in columns 2:end without knowing beforehand the number of columns, their names or values. We can use a function like below to paste the values by row in columns 2:end
apply(z, 1, function(i) paste(i[-1], collapse=""))
Now combine with split
split(z, apply(z, 1, function(i) paste(i[-1], collapse="")))
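The split above labels each group with the values only (for example TRUETRUEFALSE). If the labels should also carry the variable names, as in the question's bTRUEcTRUEdFALSE, the key can be built from names and values together. A sketch, assuming the first column is the reference and all other columns define the groups; others and key are hypothetical names:

```r
set.seed(123)
z <- matrix(sample(c(TRUE, FALSE), size = 100, replace = TRUE), ncol = 4)
colnames(z) <- letters[1:4]
z <- as.data.frame(z)

# All columns except the reference column, taken from the data
others <- names(z)[-1]

# For each column build "nameVALUE", then glue the pieces together row-wise
key <- do.call(paste0, lapply(others, function(nm) paste0(nm, z[[nm]])))

out <- split(z, key)
names(out)
```

Only combinations that actually occur in the data appear in the result; empty combinations are not created.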
I'm a relative newcomer to R so I'm sorry if there's an obvious answer to this. I've looked at other questions and I think 'apply' is the answer but I can't work out how to use it in this case.
I've got a longitudinal survey where participants are invited every year. In some years they fail to take part, and sometimes they die. I need to identify which participants have taken part in a consistent 'streak' from the start of the survey (i.e. if they stop, they stop for good).
I've done this with a 'for' loop, which works fine in the example below. But I have many years and many participants, and the loop is very slow. Is there a faster approach I could use?
In the example, TRUE means they participated in that year. The loop creates two vectors - 'finalyear' for the last year they took part, and 'streak' to show if they completed all years before the finalyear (i.e. cases 1, 3 and 5).
dat <- data.frame(ids = 1:5, "1999" = c(T, T, T, F, T), "2000" = c(T, F, T, F, T), "2001" = c(T, T, T, T, T), "2002" = c(F, T, T, T, T), "2003" = c(F, T, T, T, F))
finalyear <- NULL
streak <- NULL
for (i in 1:nrow(dat)) {
x <- as.numeric(dat[i,2:6])
y <- max(grep(1, x))
finalyear[i] <- y
streak[i] <- sum(x) == y
}
dat$finalyear <- finalyear
dat$streak <- streak
Thanks!
We could use max.col and rowSums as a vectorized approach.
dat$finalyear <- max.col(dat[-1], 'last')
If there are rows without TRUE values, we can make sure to return 0 for that row by multiplying with the double negation of rowSums. The FALSE will be coerced to 0 and multiplying with 0 returns 0 for that row.
dat$finalyear <- max.col(dat[2:6], 'last')*!!rowSums(dat[2:6])
Then, we create the 'streak' column by comparing the rowSums of columns 2:6 with that of 'finalyear'
dat$streak <- rowSums(dat[,2:6])==dat$finalyear
dat
# ids X1999 X2000 X2001 X2002 X2003 finalyear streak
#1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
#2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
#3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
#4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
#5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
Or, as a one-liner (it could fit on one line, but is split across two here for readability), as suggested by @ColonelBeauvel:
library(dplyr)
mutate(dat, finalyear=max.col(dat[-1], 'last'),
streak=rowSums(dat[-1])==finalyear)
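To see why the multiplication by the double negation matters, here is a small sketch on a made-up matrix (not the survey data) with one row that has no TRUE at all. The matrix is converted to numeric first, which is a cautious choice on my part rather than something the answer requires:

```r
# Toy matrix: row 1 has TRUE values, row 2 has none
m <- rbind(c(TRUE, TRUE, FALSE),
           c(FALSE, FALSE, FALSE)) * 1   # * 1 converts logical to numeric

# With ties.method = "last", an all-zero row is a tie across every column,
# so the last column index is returned even though nothing is TRUE
last_idx <- max.col(m, ties.method = "last")

# !!rowSums(m) is FALSE for the empty row, which zeroes its index
res <- last_idx * !!rowSums(m)
res
```

Without the correction, a participant who never took part would be reported as having the last survey year as their final year.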
For-loops are not inherently bad in R, but they are slow if you grow vectors iteratively (like you are doing). There are often better ways to do things. Example of a solution with only apply-functions:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:7],MARGIN=1,function(x){sum(x[1:5])==x[6]})
Or option 2, based on a comment by @Spacedman:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))==sum(x)})
> dat
ids X1999 X2000 X2001 X2002 X2003 finalyear streak
1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
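The remark about growing vectors can be illustrated by keeping the original loop but preallocating the results and extracting the matrix once, instead of converting a data frame row on every iteration. This is a sketch of that idea only; whether it is fast enough for the real data is not something the thread establishes.

```r
dat <- data.frame(ids = 1:5,
                  "1999" = c(TRUE, TRUE, TRUE, FALSE, TRUE),
                  "2000" = c(TRUE, FALSE, TRUE, FALSE, TRUE),
                  "2001" = c(TRUE, TRUE, TRUE, TRUE, TRUE),
                  "2002" = c(FALSE, TRUE, TRUE, TRUE, TRUE),
                  "2003" = c(FALSE, TRUE, TRUE, TRUE, FALSE))

n <- nrow(dat)
m <- as.matrix(dat[, 2:6])   # logical matrix, built once outside the loop

# Preallocate the result vectors at their final length
finalyear <- integer(n)
streak <- logical(n)

for (i in seq_len(n)) {
  x <- m[i, ]
  finalyear[i] <- max(which(x))        # position of the last TRUE
  streak[i] <- sum(x) == finalyear[i]  # no gaps before the last TRUE
}

dat$finalyear <- finalyear
dat$streak <- streak
dat
```

As in the original loop, a row with no TRUE values would make max(which(x)) return -Inf with a warning, so such rows need separate handling.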
Here is a solution with dplyr and tidyr.
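library(dplyr)
library(tidyr)
gather(data = dat, year, value, -ids) %>%
  mutate(year = as.integer(gsub("X", "", year))) %>%
  group_by(ids) %>%
  # finalyear: last year with participation
  # streak: TRUE if every year up to and including finalyear is TRUE
  summarize(finalyear = last(year[value]),
            streak = all(value[year <= finalyear]))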
output
ids finalyear streak
1 1 2001 TRUE
2 2 2003 FALSE
3 3 2003 TRUE
4 4 2003 FALSE
5 5 2002 TRUE
Here's a base version using apply to loop over rows and rle to see how often the state changes. Your condition seems to be equivalent to the state starting as TRUE and only ever changing to FALSE at most once, so I test the rle as being shorter than 3 and the first value being TRUE:
> dat$streak = apply(dat[,2:6],1,function(r){r[1] & length(rle(r)$lengths)<=2})
>
> dat
ids X1999 X2000 X2001 X2002 X2003 streak
1 1 TRUE TRUE TRUE FALSE FALSE TRUE
2 2 TRUE FALSE TRUE TRUE TRUE FALSE
3 3 TRUE TRUE TRUE TRUE TRUE TRUE
4 4 FALSE FALSE TRUE TRUE TRUE FALSE
5 5 TRUE TRUE TRUE TRUE FALSE TRUE
There are probably loads of ways of working out finalyear; this one just finds the last element of each row that is TRUE:
> dat$finalyear = apply(dat[,2:6], 1, function(r){max(which(r))})
> dat
ids X1999 X2000 X2001 X2002 X2003 streak finalyear
1 1 TRUE TRUE TRUE FALSE FALSE TRUE 3
2 2 TRUE FALSE TRUE TRUE TRUE FALSE 5
3 3 TRUE TRUE TRUE TRUE TRUE TRUE 5
4 4 FALSE FALSE TRUE TRUE TRUE FALSE 5
5 5 TRUE TRUE TRUE TRUE FALSE TRUE 4
I have a dataframe :
> s <- expand.grid(c(T,F),c(T,F))
> s
Var1 Var2
1 TRUE TRUE
2 FALSE TRUE
3 TRUE FALSE
4 FALSE FALSE
and would like to duplicate each row a given number of times, with the counts stored in a vector:
> r <- c(2,3,4,1)
Do you know how to do that?
In functional programming terms, it would just be a mapping over zipped list, duplicate, and collect.
I am not sure on how to do either the zip with plyr, or the map with mapply...
Much easier than all that:
s[rep(1:4,times = r),]
Var1 Var2
1 TRUE TRUE
1.1 TRUE TRUE
2 FALSE TRUE
2.1 FALSE TRUE
2.2 FALSE TRUE
3 TRUE FALSE
3.1 TRUE FALSE
3.2 TRUE FALSE
3.3 TRUE FALSE
4 FALSE FALSE
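The same idea, written so that it does not hard-code the number of rows and so that the repeated rows get plain sequential row names. A small sketch; the name out is made up for illustration:

```r
s <- expand.grid(c(TRUE, FALSE), c(TRUE, FALSE))
r <- c(2, 3, 4, 1)

# Repeat row index i exactly r[i] times, then index the data frame with it
out <- s[rep(seq_len(nrow(s)), times = r), ]

# Optional: replace row names such as 1.1 and 2.2 with 1, 2, 3, ...
rownames(out) <- NULL
out
```

rep with times given as a vector pairs each index with its own count, which plays the role of the zip-and-map step described in the question.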