Working with long data format in R - r

Good day,
d <- c(1,1,1,2,2,2,3,3,3)
e <- c(5,6,7,5,6,7,5,6,7)
f <- c(0,0,1,0,1,0,0,0,1)
df <- data.frame(d,e,f)
I have data that looks like the above. What I need to do is, for each unique element of d, find the first non-zero value in f and then find the corresponding value in e. To be specific, I want another vector g so it looks like this:
d <- c(1,1,1,2,2,2,3,3,3)
e <- c(5,6,7,5,6,7,5,6,7)
f <- c(0,0,1,0,1,0,0,0,1)
g <- c(7,7,7,6,6,6,7,7,7)
df <- data.frame(d,e,f,g)
Suggestions to do this easily? I thought I could use split(), but I am having trouble using which() after the split. I can use ave like this:
foo <- function(x){which(x>0)[1]}
df$t <- ave(df$f,df$d,FUN=foo)
But I am having trouble finding the value of e. Any help is appreciated.

Someone else can provide a base R solution, but here's a way to do this using plyr:
> ddply(df,.(d),transform,g = head(e[f != 0],1))
d e f g
1 1 5 0 7
2 1 6 0 7
3 1 7 1 7
4 2 5 0 6
5 2 6 1 6
6 2 7 0 6
7 3 5 0 7
8 3 6 0 7
9 3 7 1 7
Note that I took your note about the "first nonzero element" literally, even though your example data only had a single unique nonzero element in the column (by group).

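Here's a way in base R. It assumes d is sorted into contiguous runs and that each group has exactly one nonzero value in f: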
g <- inverse.rle(list(lengths=rle(d)$lengths, values=e[f != 0]))
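If a group could contain several nonzero values of f, or none at all, a variant built on ave (which the question already uses) may be more forgiving. This is only a sketch on the question's data; passing row indices through seq_along is one possible way to let the function see both e and f, not the only one.

```r
d <- c(1,1,1,2,2,2,3,3,3)
e <- c(5,6,7,5,6,7,5,6,7)
f <- c(0,0,1,0,1,0,0,0,1)
df <- data.frame(d, e, f)

# Group the row indices by d; within each group find the first row where f is
# nonzero and repeat the matching e value for every row of that group.
df$g <- ave(seq_along(df$d), df$d, FUN = function(i) {
  hit <- i[df$f[i] != 0][1]      # NA if the group has no nonzero f
  rep(df$e[hit], length(i))
})
df
```

A group without any nonzero f ends up with NA in g rather than shifting the values of later groups.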

Related

cbind named vectors in R by name

I have two named vectors similar to these ones:
x <- c(1:5)
names(x) <- c("a","b","c","d","e")
t <- c(6:10)
names(t) <- c("e","d","c","b","a")
I would like to combine them so to get the following outcome:
x t
a 1 10
b 2 9
c 3 8
d 4 7
e 5 6
Unfortunately, when I run cbind(x,t) the result just combines them in the order they are in, disregarding the names of t and keeping only those of x. This gives the following result:
x t
a 1 6
b 2 7
c 3 8
d 4 9
e 5 10
I'm pretty sure there must be an easy solution, but I cannot find it. As this step is part of a long and tedious loop (and the vectors I'm working with are much longer), it is important to have the simplest and fastest option.
We can use the names of 'x' to reorder the elements of 't' and then cbind with 'x'
cbind(x, t = t[names(x)])
# x t
#a 1 10
#b 2 9
#c 3 8
#d 4 7
#e 5 6
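Indexing by name also behaves predictably when the two vectors do not share exactly the same names. The small vectors below are hypothetical (not from the question) and only illustrate that a name missing from 't' yields NA, while extra names in 't' are dropped:

```r
x <- c(a = 1, b = 2, c = 3)
t <- c(c = 8, a = 10, d = 7)   # hypothetical: no "b", and an extra "d"

# Look up t by the names of x; "b" is not found, so that slot becomes NA
res <- cbind(x, t = t[names(x)])
res
```

If every name is expected to be present, a check such as stopifnot(all(names(x) %in% names(t))) before the cbind would catch mismatches early.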

Improve my coding "for loop"

The following is a simple loop to insert a new column in a data frame after checking a specific condition (if 2 consecutive rows have the same value).
The code works just fine but I would like to improve my coding skills so I ask for alternative solutions (faster, more elegant).
I checked previous threads on the topic and learned a lot but I am curious about my specific case.
Thanks for any input.
vector <- 1
vector_tot <- NULL
for (i in 1:length(dat$Label1))
{
  vector_tot <- c(vector_tot, vector)
  if (dat$Label1[i] == dat$Label1[i+1]) {
    vector <- 0
  }
  else {
    vector <- 1
  }
}
dat$vector <- vector_tot
For many things in R, you do not need a for loop, since functions are vectorized. So we can achieve what you want with:
# sample data
dat <- data.frame(Label1=c("A","B","B","C","C","C","D"),stringsAsFactors = FALSE)
# first create a vector that contains the next value
dat$next_element <- c(dat$Label1[2:nrow(dat)],"")
# then check if they match
dat$vector <- as.numeric(dat$Label1==dat$next_element)
Output:
Label1 next_element vector
1 A B 0
2 B B 1
3 B C 0
4 C C 1
5 C C 1
6 C D 0
7 D 0
It can also be done in one line, but I think the above illustrates better how it works:
dat$vector <- dat$Label1==c(dat$Label1[2:nrow(dat)],"")
Or compare with the previous element:
dat$vector <- dat$Label1==c("",dat$Label1[1:(nrow(dat)-1)])
You can do this in one line...
library(dplyr) #for the 'lead' function
dat = data.frame(Label1=c("A","B","B","C","C","C","D"),stringsAsFactors = F)
dat$vector <- as.numeric(dat$Label1!=lead(dat$Label1,default = ""))
dat
Label1 vector
1 A 1
2 B 0
3 B 1
4 C 0
5 C 0
6 C 1
7 D 1
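For comparison, here is a sketch of what the loop in the question appears to compute, read as "1 on the first row and wherever the label differs from the previous row, 0 otherwise". That reading is an assumption about the asker's intent; the sample data is the same as in the answers above.

```r
dat <- data.frame(Label1 = c("A","B","B","C","C","C","D"),
                  stringsAsFactors = FALSE)
n <- nrow(dat)

# Compare each element (from the 2nd on) with the one before it;
# the first row has no predecessor, so it is flagged with 1.
dat$vector <- c(1, as.numeric(dat$Label1[-1] != dat$Label1[-n]))
dat
```

This marks the start of each run of equal labels, whereas comparing with the next element marks the end of each run.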

Return first occurring 2nd largest value in data frame rows using colnames & apply

Consider I have a df
> editor
A B C D E F G H I J
User1 1 0 5 6 5 6 5 6 2 6
User2 0 5 4 6 4 5 5 1 7 5
I want to store the column name of the first occurring 2nd largest value in each of the above rows.
Expected results
> editor
A B C D E F G H I J 2nd_highest
User1 1 0 5 6 5 6 5 6 2 6 C
User2 0 5 4 6 4 5 5 1 7 5 D
I tried edited$2nd_highest <- colnames(edited)[apply(edited, 1, which.max)+1] but it didn't work well.
Any ideas ?
Here's an attempt to achieve this using algebra in order to keep it vectorized and avoid by-row operations (though it still does a matrix conversion similar to apply). The idea is to find the row maximum, subtract it from the data set, and multiply by -1 so that every value becomes its non-negative gap to the maximum. Taking the log turns the largest value (gap 0) into -Inf, and 1/result then gives the highest score to the smallest remaining gap, that is, to the second-largest value.
indx <- max.col(1/log((editor - editor[cbind(1:nrow(editor),
max.col(editor))]) * -1), ties.method = "first")
names(editor)[indx]
# [1] "C" "D"
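To make the transformation above easier to follow, here is the same idea traced on a single row (User1 from the question). It is only an illustration, and it relies on the values being integers: a gap strictly between 0 and 1 would get a negative log and break the ordering.

```r
x <- c(A=1, B=0, C=5, D=6, E=5, F=6, G=5, H=6, I=2, J=6)

gap   <- max(x) - x      # 0 at the maximum, >= 1 everywhere else (integer data)
score <- 1 / log(gap)    # gap 0 -> 1/-Inf = 0; gap 1 -> 1/0 = Inf; bigger gaps -> smaller scores

# which.max returns the first position of the highest score,
# i.e. the first column holding the second-largest value
second <- names(x)[which.max(score)]
second
```

max.col with ties.method = "first" plays the role of which.max here, applied to every row at once.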
Here is an idea. We first sort the unique values of each row and extract the second value. Since we specify decreasing = TRUE, then the second value will be the second highest. We then use the first value of each element of the new list as the index for the column names
ind_lst <- apply(df, 1, function(i) which(i == sort(unique(i), decreasing = TRUE)[2]))
df$highest.two <- names(df)[unlist(lapply(ind_lst, '[', 1))]
df
# A B C D E F G H I J highest.two
#User1 1 0 5 6 5 6 5 6 2 6 C
#User2 0 5 4 6 4 5 5 1 7 5 D
This can help you:
mat <- matrix(sample(1:8, 24, replace=TRUE), ncol=6)
mat
# for each row, take the first column holding the largest value below the row maximum
sec_highest <- apply(mat, 1, function(x) which(x == max(x[x != max(x)]))[1])
LETTERS[sec_highest] # letters display
Note that if two columns tie for the second-highest score, only the first one is returned.

Convert a full length column to one variable in a row in R

I was wondering if it is possible to convert 1 column into 1 variable with the values next to each other
i.e.:
d <- data.frame(y = 1:10)
> d
y
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
Convert this column into:
> d
1 2 3 4 5 6 7 8 9 10
We don't know how you are going to use the numbers, but I think it is unnecessary to make any transformation. You can use d$y to apply the numbers to any color map. For example:
d <- data.frame(y = 1:7)
library(RColorBrewer)
mypalette<-brewer.pal(4,"Greens")
mycol <-palette()#rainbow(7)
heatmap(matrix(1:28,ncol=4),col=mypalette[d$y[1:4]],xlab="Greens (sequential)",
ylab="",xaxt="n",yaxt="n",bty="n",RowSideColors=mycol[d$y])
Not sure what the purpose is of:
1 variable next to each other
But there are few ways to get the desired result (again, depends on the objective). You can do either:
d$y
unname(unlist(d)) #suggested by agstudy
or, better yet, to convert your dataframe's column into a vector, do this:
v <- as.vector(d[,1])
as a single string:
args <- paste(d$y, collapse=" ")
args<-noquote(args)
now you'll have
[1] 1 2 3 4 5 6 7 8 9 10
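If the goal really is a single row with the values side by side (one possible reading of the question), transposing the data frame may be enough. A minimal sketch; the names row_mat and row_df are made up for illustration:

```r
d <- data.frame(y = 1:10)

row_mat <- t(d)                     # 1 x 10 matrix with row name "y"
row_df  <- as.data.frame(row_mat)   # back to a data frame with one row, ten columns
row_mat
```

Note that t() returns a matrix, so all values are coerced to a common type; with a single numeric column that makes no difference.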

Generating random number by length of blocks of data in R data frame

I am trying to simulate the measuring order n times to see how the measuring order affects my study subject. To do this I am trying to generate random integers in a new column of a data frame. I have a big data frame and I would like to add a column that contains a random number according to the number of observations in a block.
Example of data(each row is an observation):
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5))
A B C
1 1 x 1
2 1 b 2
3 1 c 4
4 2 g 1
5 2 h 5
6 3 g 7
7 3 g 1
8 3 u 2
9 3 l 5
What I'd like to do is add a D column and generate random integer numbers according to the length of each block. Blocks are defined in column A.
Result should look something like this:
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5),
D=c(2,1,3,2,1,4,3,1,2))
> df
A B C D
1 1 x 1 2
2 1 b 2 1
3 1 c 4 3
4 2 g 1 2
5 2 h 5 1
6 3 g 7 4
7 3 g 1 3
8 3 u 2 1
9 3 l 5 2
I have tried to use R's sample() function to generate random numbers, but my problem is splitting the data according to block length and adding the new column. Any help is greatly appreciated.
It can be done easily with ave
df$D <- ave( df$A, df$A, FUN = function(x) sample(length(x)) )
(you could replace length() with max(), or whatever, but length will work even if A is not numbers matching the length of their blocks)
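A quick way to convince yourself that the ave call gives each block its own permutation of 1..k is to check the blocks after the fact. This is a sketch with a fixed seed; n_sim is a hypothetical name for the "n times" mentioned in the question, and replicate is just one way to repeat the draw.

```r
set.seed(42)  # only so the example is reproducible
df <- data.frame(A = c(1,1,1,2,2,3,3,3,3))

df$D <- ave(df$A, df$A, FUN = function(x) sample(length(x)))

# Every block should hold exactly the integers 1..(block size), in some order
ok <- tapply(df$D, df$A, function(v) all(sort(v) == seq_along(v)))
all(ok)

# Repeating the draw n_sim times gives one column of orderings per simulation
n_sim <- 3
orders <- replicate(n_sim, ave(df$A, df$A, FUN = function(x) sample(length(x))))
dim(orders)
```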
This is really easy with ddply from plyr.
ddply(df, .(A), transform, D = sample(length(A)))
The longer manual version is:
Use split to split the data frame by the first column.
split_df <- split(df, df$A)
Then call sample on each member of the list.
split_df <- lapply(split_df, function(df)
{
  df$D <- sample(nrow(df))
  df
})
Then recombine with
df <- do.call(rbind, split_df)
One simple way:
df$D = 0
counts = table(df$A)
for (i in 1:length(counts)){
  df$D[df$A == names(counts)[i]] = sample(counts[i])
}
