Find repeated data from index and string it together - r

Using R, if I have a 2 column data frame:
meta <- c(1,2,2,3,4,4,4,5)
value <- c("a","b","c","d","e","f","g","h")
df <- data.frame(meta,value)
df
meta value
1 1 a
2 2 b
3 2 c
4 3 d
5 4 e
6 4 f
7 4 g
8 5 h
How would I go about combining "value" with a delimiter (like ||) by repeated "meta" such that the resulting data frame would look like:
meta value
1 1 a
2 2 b||c
3 3 d
4 4 e||f||g
5 5 h
Thanks!

Slightly different, fairly lean, and in base:
y <- split(df$value, df$meta)
data.frame(meta=names(y), value=sapply(y, paste, collapse="||"))
or even simpler:
aggregate(value~meta, df, paste, collapse="||")
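For comparison, here is a self-contained sketch of the two base approaches above. It assumes stringsAsFactors = FALSE so that value is a character column (the default from R 4.0 on), and the names by_split and by_agg are only illustrative. One difference worth noting: names(y) is always character, so the split version returns meta as text (or a factor before R 4.0), while aggregate keeps the original numeric column.

```r
df <- data.frame(meta = c(1, 2, 2, 3, 4, 4, 4, 5),
                 value = c("a", "b", "c", "d", "e", "f", "g", "h"),
                 stringsAsFactors = FALSE)

# split() groups value by meta; sapply() collapses each group to one string
y <- split(df$value, df$meta)
by_split <- data.frame(meta = names(y),
                       value = sapply(y, paste, collapse = "||"),
                       row.names = NULL)

# aggregate() does the grouping and the collapsing in a single call
by_agg <- aggregate(value ~ meta, df, paste, collapse = "||")
```

If meta needs to stay numeric with the split version, wrapping names(y) in as.numeric() should be enough for data like this.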

Using the plyr package the following works
library(plyr)
> ldply(split(df,df$meta),function(x){paste(x$value,collapse="||")})
.id V1
1 1 a
2 2 b||c
3 3 d
4 4 e||f||g
5 5 h
or
> ddply(df,.(meta),function(x){c(value=paste(x$value,collapse="||"))})
meta value
1 1 a
2 2 b||c
3 3 d
4 4 e||f||g
5 5 h
if you want to preserve the column names.

I hope you don't dislike one-liners:
> data.frame(meta=unique(df$meta), value=sapply(unique(df$meta), function(m){ paste(df$value[which(df$meta==m)],collapse="||") }) )
meta value
1 1 a
2 2 b||c
3 3 d
4 4 e||f||g
5 5 h

Here is another way...
uni.meta <- unique(df$meta)
idx <- lapply(seq_along(uni.meta),function(x) which(df$meta==uni.meta[x]))
new.value <- unlist(lapply(seq_along(idx),function(x) paste(df$value[idx[[x]]],collapse="||")))
new.df <- data.frame(uni.meta,new.value)
new.df
uni.meta new.value
1 1 a
2 2 b||c
3 3 d
4 4 e||f||g
5 5 h

Related

How to add a row to a dataframe and avoid "non-numeric argument to binary operator"

I have a dataframe. I want to normalize columns 2 and 3 by dividing each of them by the row-wise maximum of columns 2 and 3.
> testdf<- data.frame("a"=c("b",2), "b"=2:3, "c"=3:4, "d"=4:5, stringsAsFactors = F)
> testdf
a b c d
1 b 2 3 4
2 2 3 4 5
> testdf[2:3]<-testdf[2:3] / do.call(pmax, testdf[2:3])
> testdf
a b c d
1 b 0.6666667 1 4
2 2 0.7500000 1 5
Notice how the df contains a mix of numerical and string values?
Now I want to add a row with more data. If the first element of the added row is a string, the code gives an error.
> testdf<- data.frame("a"=c("b",2), "b"=2:3, "c"=3:4, "d"=4:5, stringsAsFactors = F)
> testdf
a b c d
1 b 2 3 4
2 2 3 4 5
> testdf<- testdf %>% rbind(c("a",6,7,8))
> testdf
a b c d
1 b 2 3 4
2 2 3 4 5
3 a 6 7 8
> testdf[2:3]<-testdf[2:3] / do.call(pmax, testdf[2:3])
Error in FUN(left, right) : non-numeric argument to binary operator
If instead I use only numerical values, it works.
> testdf<- data.frame("a"=c("b",2), "b"=2:3, "c"=3:4, "d"=4:5, stringsAsFactors = F)
> testdf
a b c d
1 b 2 3 4
2 2 3 4 5
> testdf<- testdf %>% rbind(c(5,6,7,8))
> testdf
a b c d
1 b 2 3 4
2 2 3 4 5
3 5 6 7 8
> testdf[2:3]<-testdf[2:3] / do.call(pmax, testdf[2:3])
> testdf
a b c d
1 b 0.6666667 1 4
2 2 0.7500000 1 5
3 5 0.8571429 1 8
Any help explaining why this happens is greatly appreciated. I need to be able to add rows that contain both text and numbers while keeping the code working. My guess is that I'm messing up the types, but I couldn't figure out how.
When you do rbind(c("a",6,7,8)) you are effectively doing rbind(c("a","6","7","8")), which turns every column of testdf into character. This is because an atomic vector (c(...) or an individual column of testdf) can hold only one type of data, so R coerces all elements to the most general type that can represent them. Here that type is character: it can store every value, whereas numeric could not represent the letters.
Just use testdf %>% rbind(list("a",6,7,8)) instead of testdf %>% rbind(c("a",6,7,8)).
Compare the output of list("a",6,7,8) vs that of c("a",6,7,8).
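The comparison can be sketched as follows. This is a minimal illustration using the testdf from the question; the name fixed is only illustrative, and it relies on rbind matching an unnamed list to the columns by position.

```r
# c() builds an atomic vector, so every element is coerced to one type
class(c("a", 6, 7, 8))             # "character"

# list() keeps each element's own type
sapply(list("a", 6, 7, 8), class)  # "character" "numeric" "numeric" "numeric"

testdf <- data.frame(a = c("b", 2), b = 2:3, c = 3:4, d = 4:5,
                     stringsAsFactors = FALSE)

# Appending a list leaves b, c and d numeric, so the division still works
fixed <- rbind(testdf, list("a", 6, 7, 8))
fixed[2:3] <- fixed[2:3] / do.call(pmax, fixed[2:3])
```

Running str(fixed) afterwards is a quick way to confirm that only column a is character.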
We can use add_row
library(tibble)
testdf <- add_row(testdf, !!!setNames(list('a', 6, 7, 8), names(testdf)))
testdf
# a b c d
#1 b 2 3 4
#2 2 3 4 5
#3 a 6 7 8
Now, do the pmax on the numeric columns
testdf[2:3] / do.call(pmax, testdf[2:3])
# b c
#1 0.6666667 1
#2 0.7500000 1
#3 0.8571429 1

cumulative product in R across column

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns that hold a cumulative product of the columns a, b and c. However, I need a reverse cumulative product, i.e. the output should be
row 1:
result_d = 1*2*3 = 6 , result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
the column names do not matter this is just an example. Does anyone have any idea how to do this?
as per my comment, is it possible to do this on a subset of columns? e.g. only for columns b and c to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows, apply cumprod over the reversed elements, and then reverse the result again
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
Or a vectorized option is rowCumprods
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
Another base R option uses Reduce with accumulate = TRUE:
temp = data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))
setNames(cbind(x, temp[NCOL(temp):1]),
c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
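The follow-up about restricting the calculation to a subset of columns is not covered by the answers above. A minimal sketch, assuming the apply-based approach and a hypothetical cols vector that names the columns to include:

```r
x <- data.frame(a = c(1, 1), b = c(2, 2), c = c(3, 4))

# Columns that take part in the reverse cumulative product (a is ignored)
cols <- c("b", "c")

# Per row: reverse, take the cumulative product, reverse back
res <- t(apply(x[cols], 1, function(r) rev(cumprod(rev(r)))))

# Attach the results under new names derived from the source columns
x[paste0("result_", cols)] <- res
x
```

With fewer than two columns in cols, apply returns a plain vector and the t() step no longer produces one row per observation, so that case would need separate handling.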

How to write the remaining data frame in R after randomly subsetting the data

I took a random sample from a data frame. But I don't know how to get the remaining data frame.
df <- data.frame(x=rep(1:3,each=2),y=6:1,z=letters[1:6])
#select 3 random rows
df[sample(nrow(df),3)]
What I want is to get the remaining data frame with the other 3 rows.
sample draws a new random sample each time you run it, so if you want to reproduce its results you will need to either call set.seed first or save its result in a variable.
Addressing your question, you simply need to add - before your index in order to get the rest of the data set.
Also, don't forget to add a comma after the indx if you want to select rows (unlike in your question)
set.seed(1)
indx <- sample(nrow(df), 3)
Your subset
df[indx, ]
# x y z
# 2 1 5 b
# 6 3 1 f
# 3 2 4 c
Remaining data set
df[-indx, ]
# x y z
# 1 1 6 a
# 4 2 3 d
# 5 3 2 e
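One caveat with negative indexing: if indx is ever empty, df[-indx, ] returns zero rows rather than the whole data frame. A sketch of a variant that avoids this by selecting the complement explicitly with setdiff (the names picked and rest are only illustrative):

```r
df <- data.frame(x = rep(1:3, each = 2), y = 6:1, z = letters[1:6])

set.seed(1)
indx <- sample(nrow(df), 3)

# The sampled rows
picked <- df[indx, ]

# Every row position that was not sampled
rest <- df[setdiff(seq_len(nrow(df)), indx), ]
```

Together, picked and rest contain each row of df exactly once, whichever rows the random draw selects.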
Try:
> df
x y z
1 1 6 a
2 1 5 b
3 2 4 c
4 2 3 d
5 3 2 e
6 3 1 f
>
> df2 = df[sample(nrow(df),3),]
> df2
x y z
5 3 2 e
3 2 4 c
1 1 6 a
> df[!rownames(df) %in% rownames(df2),]
x y z
2 1 5 b
4 2 3 d
6 3 1 f

Subsetting data frame by another data frame

The data is as follows:
> x
a b
1 1 a
2 2 a
3 3 a
4 1 b
5 2 b
6 3 b
> y
a b
1 2 a
2 3 a
3 3 b
My goal is to compare both data frames and, for each row in x, indicate whether an equivalent row exists in y. All of the y rows are actually contained in x, so I would like to end up with something like this:
> x
a b intersect.x.y
1 1 a F
2 2 a T
3 3 a T
4 1 b F
5 2 b F
6 3 b T
How about this?
x$rn <- 1:nrow(x)
xyrows <- merge(x,y)$rn # maybe you just want to look at the merge ...?
x$iny <- FALSE
x$iny[xyrows] <- TRUE
I suspect there is a more standard approach, but this way is easy to understand.
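As one possible alternative, each row can be turned into a single key string and the keys compared with %in%. This is only a sketch: the data is rebuilt from the printout in the question, the names key_x and key_y are illustrative, and it assumes the separator (a space here) never occurs inside the values.

```r
x <- data.frame(a = rep(1:3, times = 2), b = rep(c("a", "b"), each = 3),
                stringsAsFactors = FALSE)
y <- data.frame(a = c(2, 3, 3), b = c("a", "a", "b"),
                stringsAsFactors = FALSE)

# One key per row, built from all columns that define a match
key_x <- paste(x$a, x$b)
key_y <- paste(y$a, y$b)

# TRUE where the row of x also appears in y
x$intersect.x.y <- key_x %in% key_y
x
```

Unlike the merge approach, this does not add a helper column to x, but it compares the values as text.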

Condensing Data Frame in R

I just have a simple question. I really appreciate everyone's input; you have been a great help to my project. I have an additional question about data frames in R.
I have a data frame that looks similar to this:
C <- c("","","","","","","","A","B","D","A","B","D","A","B","D")
D <- c(NA,NA,NA,2,NA,NA,1,1,4,2,2,5,2,1,4,2)
G <- list(C=C,D=D)
T <- as.data.frame(G)
T
C D
1 NA
2 NA
3 NA
4 2
5 NA
6 NA
7 1
8 A 1
9 B 4
10 D 2
11 A 2
12 B 5
13 D 2
14 A 1
15 B 4
16 D 2
I would like to be able to condense all the repeat characters into one, and look similar to this:
J B C E
1 2 1
2 A 1 2 1
3 B 4 5 4
4 D 2 2 2
So of course, the data is all the same, it is just that it is condensed and new columns are formed to hold the data. I am sure there is an easy way to do it, but from the books I have looked through, I haven't seen anything for this!
EDIT: I edited the example because it wasn't working with the answers so far. I wonder if the NAs, the blanks, and the unevenness caused by the blanks are contributing?
Here's a reshape solution:
require(reshape)
cast(T, C ~ ., function(x) x)
Changed T to my.df to avoid a bad habit (T is also the shorthand for TRUE). Returns a list, which may not be what you want, but you can convert from there.
C <- c("A","B","D","A","B","D","A","B","D")
D <- c(1,4,2,2,5,2,1,4,2)
my.df <- data.frame(id=C,val=D)
ret <- function(x) x
by.df <- by(my.df$val,INDICES=my.df$id,ret)
This seems to get the results you are looking for. I'm assuming it's OK to remove the NA values since that matches the desired output you show.
T <- na.omit(T)
T$ind <- ave(1:nrow(T), T$C, FUN = seq_along)
reshape(T, direction = "wide", idvar = "C", timevar = "ind")
# C D.1 D.2 D.3
# 4 2 1 NA
# 8 A 1 2 1
# 9 B 4 5 4
# 10 D 2 2 2
library(reshape2)
dcast(T, C ~ ind, value.var = "D", fill = "")
# C 1 2 3
# 1 2 1
# 2 A 1 2 1
# 3 B 4 5 4
# 4 D 2 2 2
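The same wide layout can also be sketched in base R with tapply, without any reshaping package. This is an illustration under a few assumptions: the data frame is called dat instead of T, the NA rows are dropped as in the answer above, and each group/occurrence pair holds exactly one value, so sum simply returns that value.

```r
dat <- data.frame(
  C = c("", "", "", "", "", "", "", "A", "B", "D", "A", "B", "D", "A", "B", "D"),
  D = c(NA, NA, NA, 2, NA, NA, 1, 1, 4, 2, 2, 5, 2, 1, 4, 2),
  stringsAsFactors = FALSE
)
dat <- na.omit(dat)

# Running occurrence number within each value of C
dat$ind <- ave(seq_len(nrow(dat)), dat$C, FUN = seq_along)

# Rows are the values of C, columns the occurrence number;
# combinations that never occur are filled with NA
wide <- tapply(dat$D, list(dat$C, dat$ind), sum)
wide
```

The result is a numeric matrix rather than a data frame. The row for the blank group has an empty row name, so it has to be selected by position (wide[1, ]) rather than by name.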

Resources