Improve my coding "for loop" - r

The following is a simple loop to insert a new column in a data frame after checking a specific condition (if 2 consecutive rows have the same value).
The code works just fine but I would like to improve my coding skills so I ask for alternative solutions (faster, more elegant).
I checked previous threads on the topic and learned a lot but I am curious about my specific case.
Thanks for any input.
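vector <- 1
vector_tot <- NULL
for (i in seq_along(dat$Label1)) {
  vector_tot <- c(vector_tot, vector)
  # guard the last row: dat$Label1[i + 1] is NA there, and if (NA) is an error
  if (i < length(dat$Label1) && dat$Label1[i] == dat$Label1[i + 1]) {
    vector <- 0
  } else {
    vector <- 1
  }
}
dat$vector <- vector_tot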

For many things in R, you do not need a for loop, since functions are vectorized. So we can achieve what you want with:
# sample data
dat <- data.frame(Label1=c("A","B","B","C","C","C","D"),stringsAsFactors = F)
# first create a vector that contains the next value
dat$next_element <- c(dat$Label1[2:nrow(dat)],"")
# then check if they match
dat$vector <- as.numeric(dat$Label1==dat$next_element)
Output:
Label1 next_element vector
1 A B 0
2 B B 1
3 B C 0
4 C C 1
5 C C 1
6 C D 0
7 D 0
It can also be done in one line, but I think the above illustrates better how it works:
dat$vector <- dat$Label1==c(dat$Label1[2:nrow(dat)],"")
Or compare with the previous element:
dat$vector <- dat$Label1==c("",dat$Label1[1:(nrow(dat)-1)])
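Note that neither comparison above reproduces the loop in the question exactly: that loop writes 1 on the first row and on every row whose label differs from the previous row, and 0 otherwise. A minimal base R sketch of that variant, assuming Label1 is a character column (the helper name starts_new_run is made up for illustration):

```r
# Hypothetical helper: 1 where a value differs from the one before it
# (the first element always counts as a change), 0 otherwise.
starts_new_run <- function(x) {
  n <- length(x)
  if (n == 0) return(numeric(0))
  # compare elements 2..n with elements 1..(n-1); both pieces are empty
  # when n == 1, so a single row simply yields 1
  c(1, as.numeric(x[-1] != x[-n]))
}

dat <- data.frame(Label1 = c("A", "B", "B", "C", "C", "C", "D"),
                  stringsAsFactors = FALSE)
dat$vector <- starts_new_run(dat$Label1)
```

For the sample data this should give 1 on the first row of each run of equal labels and 0 on the repeats.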

You can do this in one line...
library(dplyr) #for the 'lead' function
dat = data.frame(Label1=c("A","B","B","C","C","C","D"),stringsAsFactors = F)
dat$vector <- as.numeric(dat$Label1!=lead(dat$Label1,default = ""))
dat
Label1 vector
1 A 1
2 B 0
3 B 1
4 C 0
5 C 0
6 C 1
7 D 1

Related

How to write a function to calculate the mean of some columns in a dataframe in r?

I need to add a new column with the results calculated from the mean value of other columns.
For example:
A B C D E
1 2 3 4 ?
The question mark should equal the mean of 2, 3 and 4, i.e. mean(c(2, 3, 4)).
I wrote my code like this
df_new <- df %>% mutate(new_column = rowMeans(dplyr::select(., B:D)))
Because I have a really big data frame, I have to repeat this process many times. Is it possible to write a function to make it easier? I really don't know where to start.
If your data.frame looks like this:
df <- data.frame(A=1,B=2,C=3,D=4)
df
A B C D
1 1 2 3 4
you can get the mean you asked for like this:
data.frame(df,E=mean(as.numeric(df[,2:4])))
A B C D E
1 1 2 3 4 3
or means for a data.frame with more rows like this:
data.frame(df,E=rowMeans(df[,2:4]))
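Since the question asks for a reusable function, the rowMeans() call can be wrapped in a small helper. This is only a sketch: the name add_row_mean and its arguments are invented here, and it assumes the selected columns are all numeric.

```r
# Hypothetical helper: append the row-wise mean of the selected columns.
# `cols` can be column names or positions; `new_name` names the new column.
add_row_mean <- function(data, cols, new_name = "E") {
  # drop = FALSE keeps a data frame even when only one column is selected
  data[[new_name]] <- rowMeans(data[, cols, drop = FALSE])
  data
}

df <- data.frame(A = c(1, 5), B = c(2, 6), C = c(3, 7), D = c(4, 8))
df <- add_row_mean(df, c("B", "C", "D"))
```

The same helper can then be called repeatedly with different column sets and a different new_name each time.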

How to change part of a data.frame using a dynamic variable name for the df?

I have a dataframe "Tester" like below
Tester <- data.frame(A=c(1,3,5,7), B=c(2,4,6,8), C=0)
#A B C
#1 2 0
#3 4 0
#5 6 0
#7 8 0
I'd like to change the first two elements in column C so it reads c(1,1,0,0) by using a dynamically-determined variable name (stored in a string).
Because I'm looping this over several similar variable names, I'm operating with strings as variable names, and I've been able to do everything but this using get() and assign().
Because the variable name is stored in a string,
Tester[1:2,3] <- 1
is not possible.
When I try to use get or assign, R throws up "incorrect number of dimensions" errors
assign(Tester[1:2,3], 1)
or
assign(get("Tester")[1:2,3], 1)
and when I try double square brackets it throws up "incorrect number of subscripts."
I'm at a loss here...any help?
Here is a hacky workaround
Tester <- data.frame(A=c(1,3,5,7), B=c(2,4,6,8), C=0)
dfname <- "Tester"
colname <- "C"
df <- get(dfname)
df[1:2, colname] <- 1
assign(dfname, df)
get(dfname)
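Because the question mentions looping over several similar variable names, the same get/modify/assign pattern can be put inside a loop. A sketch under the assumption that the data frames live in the current environment; Tester1 and Tester2 are placeholder names:

```r
# Two example data frames; the names are placeholders for illustration
Tester1 <- data.frame(A = c(1, 3, 5, 7), B = c(2, 4, 6, 8), C = 0)
Tester2 <- data.frame(A = c(2, 4, 6, 8), B = c(1, 3, 5, 7), C = 0)

for (dfname in c("Tester1", "Tester2")) {
  df <- get(dfname)     # fetch a copy of the object by its name
  df[1:2, "C"] <- 1     # modify the copy
  assign(dfname, df)    # write the modified copy back under the same name
}
```

If the data frames can be kept in a named list instead, indexing the list by name may avoid get() and assign() altogether.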
Do you mean something like this:
> dfName <- "mydf"
> mydf <- data.frame(A = c(1,3,5,7), B = c(2,4,6,8), C = c(0,0,0,0))
> mydf
A B C
1 1 2 0
2 3 4 0
3 5 6 0
4 7 8 0
>
> mydf <- get(dfName)
> mydf$C[1:2] <-1
> get(dfName)
A B C
1 1 2 1
2 3 4 1
3 5 6 0
4 7 8 0
Notice that once you find the df, you just save it into your variable of choice and then print.

Single element assignment inside same within as entire column of data frame

Why is it that I can't assign a value to an entire column of a data frame, and then a single element in the same "within" statement? The code:
foo <- data.frame( a=seq(1,10) )
foo <- within(foo, {
  b <- 1 # set all of b to 1
})
foo <- within(foo, {
  c <- 1 # set all of c to 1
  c[2] <- 20 # set one element to 20
  b[2] <- 20
})
foo
Gives:
a b c
1 1 1 1
2 2 20 20
3 3 1 1
4 4 1 20
5 5 1 1
6 6 1 20
7 7 1 1
8 8 1 20
9 9 1 1
10 10 1 20
The value of b is what I expected. The value of c is strange. It seems to do what I expect if the assignment to the entire column (ie b <- 1) is in a different "within" statement than the assignment to a single element (ie b[2] <- 20). But not if they're in the same "within".
Is this a bug, or something I just don't understand about R?
My guess is that the assignments to new columns are done as you "leave" the function. When doing
c <- 1
c[2] <- 20
all you have really created is a vector c <- c(1, 20). When R has to assign this to a new column, the vector is recycled, creating the 1,20,1,20,... pattern you are seeing.
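The recycling described here can be reproduced, and avoided, with a short sketch. One possible workaround (not the only one) is to create the full-length vector before assigning the single element; this relies on the existing column a being visible inside within():

```r
foo <- data.frame(a = seq(1, 10))

# Inside within(), `c <- 1` creates a length-1 vector and `c[2] <- 20`
# extends it to length 2; it is recycled when written back as a column.
recycled <- within(foo, {
  c <- 1
  c[2] <- 20
})

# Workaround: build a vector with one entry per row first
# (`a` is the existing column, so length(a) is the number of rows).
fixed <- within(foo, {
  c <- rep(1, length(a))
  c[2] <- 20
})
```

Here recycled$c alternates between 1 and 20, while fixed$c is 1 everywhere except the second row.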
That's an interesting one.
It has to do with the fact that c is defined only up to length 2, and after that the typical R "recycling rule" takes over and repeats c until it matches the length of the data frame. (And as an aside, this only works for whole multiples: you would not be able to replicate a vector of length 3 or 4 in a data frame of ten rows.)
Recycling has its critics. I think it is an asset for a dynamically-typed interpreted language like R, particularly when one wants to interactively explore data. "Expanding" data to fit a container and expression is generally a good thing -- even if it gives the odd puzzle as it does here.

Working with long data format in R

Good day,
d <- c(1,1,1,2,2,2,3,3,3)
e <- c(5,6,7,5,6,7,5,6,7)
f <- c(0,0,1,0,1,0,0,0,1)
df <- data.frame(d,e,f)
I have data that looks like the above. What I need to do is, for each unique element of d, find the first non-zero value in f and then the corresponding value in e. To be specific, I want another vector g so it looks like this:
d <- c(1,1,1,2,2,2,3,3,3)
e <- c(5,6,7,5,6,7,5,6,7)
f <- c(0,0,1,0,1,0,0,0,1)
g <- c(7,7,7,6,6,6,7,7,7)
df <- data.frame(d,e,f,g)
Suggestions to do this easily? I thought I could use split(), but I am having trouble using which() after the split. I can use ave like this:
foo <- function(x){which(x>0)[1]}
df$t <- ave(df$f,df$d,FUN=foo)
But I am having trouble finding the value of e. Any help is appreciated.
Someone else can provide a base R solution, but here's a way to do this using plyr:
> library(plyr)
> ddply(df,.(d),transform,g = head(e[f != 0],1))
d e f g
1 1 5 0 7
2 1 6 0 7
3 1 7 1 7
4 2 5 0 6
5 2 6 1 6
6 2 7 0 6
7 3 5 0 7
8 3 6 0 7
9 3 7 1 7
Note that I took your note about the "first nonzero element" literally, even though your example data only had a single unique nonzero element in the column (by group).
Here's a way in base R:
g <- inverse.rle(list(lengths=rle(d)$lengths, values=e[f != 0]))
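The inverse.rle() line relies on the rows being grouped by d and on exactly one nonzero f per group. A base R sketch that takes "first nonzero" literally and does not need sorted rows could build on the ave() idea from the question by passing row indices instead of values. This is an alternative suggestion, not what either answer does:

```r
d <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
e <- c(5, 6, 7, 5, 6, 7, 5, 6, 7)
f <- c(0, 0, 1, 0, 1, 0, 0, 0, 1)
df <- data.frame(d, e, f)

# ave() hands each group's row indices to FUN; the single value returned
# is recycled over all rows of that group.
df$g <- ave(seq_len(nrow(df)), df$d, FUN = function(idx) {
  first_hit <- idx[df$f[idx] != 0][1]  # NA if the group has no nonzero f
  df$e[first_hit]
})
```

A group without any nonzero f ends up with NA in g.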

Generating random number by length of blocks of data in R data frame

I am trying to simulate the measuring order n times and see how the measuring order affects my study subject. To do this I am trying to generate random integers in a new column of a data frame. I have a big data frame and I would like to add a column that contains random numbers according to the number of observations in each block.
Example of data(each row is an observation):
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5))
A B C
1 1 x 1
2 1 b 2
3 1 c 4
4 2 g 1
5 2 h 5
6 3 g 7
7 3 g 1
8 3 u 2
9 3 l 5
What I'd like to do is add a D column and generate random integer numbers according to the length of each block. Blocks are defined in column A.
Result should look something like this:
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5),
D=c(2,1,3,2,1,4,3,1,2))
> df
A B C D
1 1 x 1 2
2 1 b 2 1
3 1 c 4 3
4 2 g 1 2
5 2 h 5 1
6 3 g 7 4
7 3 g 1 3
8 3 u 2 1
9 3 l 5 2
I have tried to use R's sample() function to generate random numbers, but my problem is splitting the data according to block length and adding the new column. Any help is greatly appreciated.
It can be done easily with ave
df$D <- ave( df$A, df$A, FUN = function(x) sample(length(x)) )
(you could replace length() with max(), or whatever, but length will work even if A is not numbers matching the length of their blocks)
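Since the question talks about simulating the order n times, the ave() call can be repeated with replicate(). A rough sketch; the seed and the number of simulations (n_sim) are arbitrary values picked for illustration:

```r
df <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 B = c("x", "b", "c", "g", "h", "g", "g", "u", "l"),
                 C = c(1, 2, 4, 1, 5, 7, 1, 2, 5))

set.seed(1)   # arbitrary seed, only to make the draws repeatable
n_sim <- 5    # number of simulated orders (illustrative value)

# One column per simulation; within each block of A the values are a
# random permutation of 1..(block size)
orders <- replicate(n_sim,
                    ave(df$A, df$A, FUN = function(x) sample(length(x))))
```

orders is a matrix with one row per observation and one column per simulated measuring order, so a single run can be attached with something like df$D <- orders[, 1].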
This is really easy with ddply from plyr.
library(plyr)
ddply(df, .(A), transform, D = sample(length(A)))
The longer manual version is:
Use split to split the data frame by the first column.
split_df <- split(df, df$A)
Then call sample on each member of the list.
split_df <- lapply(split_df, function(df)
{
  df$D <- sample(nrow(df))
  df
})
Then recombine with
df <- do.call(rbind, split_df)
One simple way:
df$D = 0
counts = table(df$A)
for (i in 1:length(counts)){
  df$D[df$A == names(counts)[i]] = sample(counts[i])
}

Resources