R: collapse rows and then convert row into a new column - r

So here is my challenge. I am trying to get rid of rows of data that are best organized as a column. The original data set looks like
1|1|a
2|3|b
2|5|c
1|4|d
1|2|e
10|10|f
And the end result desired is
1 |1,2,4 |a| e d
2 |3,5 |b| c
10|10 |f| NA
The reshaping is based on the minimum value of Col 2 within each group of Col 1: new column 3 holds the Col 3 value from the row with the group minimum, and new column 4 collapses the Col 3 values from the remaining rows. Some of the approaches tried include:
newTable[min(newTable[,(1%o%2)]),] ## returns the minimum of both COL 1 and 2 only
ddply(newTable,"V1", summarize, newCol = paste(V7,collapse = " ")) ## collapses all values by Col 1 and creates a new column nicely.
Variations that combine these lines of code into a single line have not worked, in part due to my limited knowledge. These modifications are not included here.

Try:
library(dplyr)
library(tidyr)
dat %>%
group_by(V1) %>%
summarise_each(funs(paste(sort(.), collapse=","))) %>%
extract(V3, c("V3", "V4"), "(.),?(.*)")
gives the output
# V1 V2 V3 V4
#1 1 1,2,4 a d,e
#2 2 3,5 b c
#3 10 10 f
Or using aggregate and str_split_fixed
res1 <- aggregate(.~ V1, data=dat, FUN=function(x) paste(sort(x), collapse=","))
library(stringr)
res1[, paste0("V", 3:4)] <- as.data.frame(str_split_fixed(res1$V3, ",", 2),
stringsAsFactors=FALSE)
If you need NA for missing values
res1[res1==''] <- NA
res1
# V1 V2 V3 V4
#1 1 1,2,4 a d,e
#2 2 3,5 b c
#3 10 10 f <NA>
data
dat <- structure(list(V1 = c(1L, 2L, 2L, 1L, 1L, 10L), V2 = c(1L, 3L,
5L, 4L, 2L, 10L), V3 = c("a", "b", "c", "d", "e", "f")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -6L))

Here's an approach using data.table, with data from @akrun's post:
It might be useful to store the columns as list instead of pasting them together.
require(data.table) ## 1.9.2+
setDT(dat)[order(V1, V2), list(V2=list(V2), V3=V3[1L], V4=list(V3[-1L])), by=V1]
# V1 V2 V3 V4
# 1: 1 1,2,4 a e,d
# 2: 2 3,5 b c
# 3: 10 10 f
setDT(dat) converts the data.frame to data.table, by reference (without copying it). Then, we sort it by columns V1,V2 and group by V1 column on the sorted data, and for each group, we create the columns V2, V3 and V4 as shown.
V2 and V4 will be of type list here. If you'd rather have a character column where all entries are pasted together, just replace list(.) with paste(., collapse=...).
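For reference, the same sort-then-split idea can be sketched in base R with the collapsed columns pasted into strings. This is only a sketch, assuming the data are in dat as defined above; the name res and the choice of separators (comma for V2, space for V4, as in the question's desired output) are my own assumptions.

```r
# Same data as in the answer above
dat <- data.frame(V1 = c(1L, 2L, 2L, 1L, 1L, 10L),
                  V2 = c(1L, 3L, 5L, 4L, 2L, 10L),
                  V3 = c("a", "b", "c", "d", "e", "f"),
                  stringsAsFactors = FALSE)

# Sort by group and by V2 so the first row of each group holds the minimum
ord <- dat[order(dat$V1, dat$V2), ]

res <- do.call(rbind, lapply(split(ord, ord$V1), function(g) {
  rest <- g$V3[-1L]  # V3 values from the non-minimum rows
  data.frame(V1 = g$V1[1L],
             V2 = paste(g$V2, collapse = ","),
             V3 = g$V3[1L],
             V4 = if (length(rest)) paste(rest, collapse = " ") else NA_character_,
             stringsAsFactors = FALSE)
}))
rownames(res) <- NULL
res
```

Groups with a single row get NA in V4, which matches the layout asked for in the question.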
HTH

Related

Rename columns of a dataframe based on another dataframe except columns not in that dataframe in R

Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename the columns of df1 using df2, except column G, since it is not in df2's Col1.
I use df2$Col2[match(names(df1), df2$Col1)] based on the answer from here, but it returns "E" "Q" "R" "Z" NA; as you can see, column G becomes NA. I would like it to keep its original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How could I deal with this issue? Thanks.
By using na.omit (it's a little bit messy):
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I was able to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is the order of df2$Col1 and names(df1) in match.
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7, but df1 has only 5 columns, so indices 6 and 7 do not exist.
For df1, we should swap the arguments of match: na.omit(match(df2$Col1, names(df1))) gives [1] 1 2 3 4
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
This will work.
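A variant that does not depend on the row order of df2, as a hedged sketch: compute the match once and rename only the columns that were actually found. The helper names idx and hit are just for illustration.

```r
df1 <- data.frame(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L)
df2 <- data.frame(Col1 = c("A", "B", "C", "D", "X"),
                  Col2 = c("E", "Q", "R", "Z", "Y"),
                  stringsAsFactors = FALSE)

# Position of each df1 column name in df2$Col1 (NA when absent, e.g. "G")
idx <- match(names(df1), df2$Col1)
hit <- !is.na(idx)

# Rename only the matched columns; unmatched ones keep their names
names(df1)[hit] <- df2$Col2[idx[hit]]
names(df1)
```

Because the same idx is used on both sides of the assignment, each column is paired with its own replacement even if df2 lists the names in a different order.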
A solution using the rename_with function from the dplyr package.
library(dplyr)
df3 <- df2 %>%
filter(Col1 %in% names(df1))
df4 <- df1 %>%
rename_with(.cols = df3$Col1, .fn = function(x) df3$Col2[df3$Col1 %in% x])
df4
# E Q R Z G
# 1 1 2 3 4 5

Replacing column values in R

Example df:
index name V1 V2 etc
1 x 2 1
2 y 1 2
3 z 3 4
4 w 4 3
I would like to replace values in columns V1 and V2 with related values in name column for particular index value. Output should look like this:
index name V1 V2 etc
1 x y x
2 y x y
3 z z w
4 w w z
I have tried multiple merge statements in loop but not sure how I can replace the values instead of creating new columns and also got a duplicate name error.
V<-2 # number of V columns
names<-c()
for (i in 1:V){names[[i]]<-paste0('V',i)}
lookup_table<-df[,c('index','name'),drop=FALSE] # it's at unique index level
for(col in names){
df<- merge(df,lookup_table,by.x=col,by.y="index",all.x = TRUE)
}
We can do
df[3:4] <- lapply(df[3:4], function(x) df$name[x])
Or without looping
df[3:4] <- df$name[as.matrix(df[3:4])]
df
# index name V1 V2
#1 1 x y x
#2 2 y x y
#3 3 z z w
#4 4 w w z
data
df <- structure(list(index = 1:4, name = c("x", "y", "z", "w"), V1 = c(2L,
1L, 3L, 4L), V2 = c(1L, 2L, 4L, 3L)), .Names = c("index", "name",
"V1", "V2"), class = "data.frame", row.names = c(NA, -4L))
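Note that df$name[x] treats the values in V1 and V2 as row positions, which works here because index is 1:4. If index held arbitrary keys, a lookup through match() should do the same job. A minimal sketch, with made-up index values 10, 20, 30, 40 purely for illustration:

```r
# Hypothetical variant of the data where index is not simply 1..n
df <- data.frame(index = c(10L, 20L, 30L, 40L),
                 name  = c("x", "y", "z", "w"),
                 V1    = c(20L, 10L, 30L, 40L),
                 V2    = c(10L, 20L, 40L, 30L),
                 stringsAsFactors = FALSE)

cols <- c("V1", "V2")
# Look up each value in the index column, then take the matching name
df[cols] <- lapply(df[cols], function(x) df$name[match(x, df$index)])
df
```

Values that do not occur in index would come back as NA rather than raising an error.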

Group by operation in R

I have a dataset with millions of rows and I need to apply a 'group by' operation to it using R.
The data is of the form
V1 V2 V3
a u 1
a v 2
b w 3
b x 4
c y 5
c z 6
Performing 'group by' using R, I want to add up the values in column 3 and concatenate the values in column 2, like
V1 V2 V3
a uv 3
b wx 7
c yz 11
I have tried doing the operation in Excel, but because of the large number of rows I can't use Excel. I am new to R so any help would be appreciated.
Many possible ways to solve, here are two
library(data.table)
setDT(df)[, .(V2 = paste(V2, collapse = ""), V3 = sum(V3)), by = V1]
# V1 V2 V3
# 1: a uv 3
# 2: b wx 7
# 3: c yz 11
Or
library(dplyr)
df %>%
group_by(V1) %>%
summarise(V2 = paste(V2, collapse = ""), V3 = sum(V3))
# Source: local data table [3 x 3]
#
# V1 V2 V3
# 1 a uv 3
# 2 b wx 7
# 3 c yz 11
Data
df <- structure(list(V1 = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), V2 = structure(1:6, .Label = c("u",
"v", "w", "x", "y", "z"), class = "factor"), V3 = 1:6), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -6L))
Another option, using aggregate
# Group column 2
ag.2 <- aggregate(df$V2, by=list(df$V1), FUN = paste0, collapse = "")
# Group column 3
ag.3 <- aggregate(df$V3, by=list(df$V1), FUN = sum)
# Merge the two
res <- cbind(ag.2, ag.3[,-1])
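If cleaner column names are wanted from aggregate, one possibility is to aggregate each column through the formula interface and merge the results on V1. This is a sketch, assuming the same data as above (rebuilt here with character columns so the block is self-contained); the extra argument collapse = "" is passed on to paste.

```r
df <- data.frame(V1 = rep(c("a", "b", "c"), each = 2),
                 V2 = c("u", "v", "w", "x", "y", "z"),
                 V3 = 1:6,
                 stringsAsFactors = FALSE)

# Concatenate V2 and sum V3 per group, then join the two results on V1
res <- merge(aggregate(V2 ~ V1, data = df, FUN = paste, collapse = ""),
             aggregate(V3 ~ V1, data = df, FUN = sum))
res
```

The result keeps the original column names V1, V2 and V3 instead of Group.1 and x.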
Another option with sqldf
library(sqldf)
sqldf('select V1,
group_concat(V2,"") as V2,
sum(V3) as V3
from df
group by V1')
# V1 V2 V3
#1 a uv 3
#2 b wx 7
#3 c yz 11
Or using base R
do.call(rbind,lapply(split(df, df$V1), function(x)
with(x, data.frame(V1=V1[1L], V2= paste(V2, collapse=''), V3= sum(V3)))))
using ddply
library(plyr)
ddply(df, .(V1), summarize, V2 = paste(V2, collapse=''), V3 = sum(V3))
# V1 V2 V3
#1 a uv 3
#2 b wx 7
#3 c yz 11
You could also just use the groupBy function in the 'caroline' package:
library(caroline)
x <- cbind.data.frame(V1=rep(letters[1:3],each=2), V2=letters[21:26], V3=1:6, stringsAsFactors=FALSE)
groupBy(df=x, clmns=c('V2','V3'),by='V1',aggregation=c('paste','sum'),collapse='')

New conditioned column in R data

I'm taking a data mining course and need to manipulate some data to do desired task using randomForest. V1, V2, and V3 are the column names. If V1=A and V2=2, I want R to output "Eureka" to the corresponding row of a new column V4. I want the other values in V4 to be set to "NOPE". The actual data set has 300000 rows and 6 columns. This may seem strange but if I can learn how to do this my problem will be solved. Thanks.
V1 V2 V3
A 1 4
A 1 8
A 2 4
A 2 8
C 1 10
C 1 9
C 2 10
C 2 9
Desired output:
V1 V2 V3 V4
A 1 4 NOPE
A 1 8 NOPE
A 2 5 Eureka
A 2 3 Eureka
C 1 10 NOPE
C 1 8 NOPE
C 2 10 NOPE
C 2 4 NOPE
The following code does NOT work.
for(g in 1:8){
if(data$V1[g]=="A"&data$V2[g]==2){
data$V4[g]=Eureka
}else{
data$V4[g]="NOPE"
}
}
We could use either a numeric index or ifelse to create the "V4" column. V1=='A' & V2==2 gives a logical index (TRUE/FALSE). Adding 1 coerces the logical vector to numeric (TRUE/FALSE become 1/0), giving 2/1 for TRUE/FALSE. These numeric values can then be used as an index into c('NOPE', 'Eureka').
df$V4 <- with(df, c('NOPE', 'Eureka')[(V1=='A' & V2==2)+1])
df
# V1 V2 V3 V4
#1 A 1 4 NOPE
#2 A 1 8 NOPE
#3 A 2 4 Eureka
#4 A 2 8 Eureka
#5 C 1 10 NOPE
#6 C 1 9 NOPE
#7 C 2 10 NOPE
#8 C 2 9 NOPE
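To see why the + 1 trick works, here is a tiny sketch on a made-up logical vector (cond and labels are hypothetical names, not part of the data above):

```r
cond <- c(FALSE, TRUE, FALSE)

# Arithmetic coerces FALSE/TRUE to 0/1, so adding 1 yields positions 1/2
pos <- cond + 1

# Position 1 picks 'NOPE', position 2 picks 'Eureka'
labels <- c("NOPE", "Eureka")[pos]
labels
```

The order of the two strings matters: the value for FALSE has to come first.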
Or using ifelse
df$V4 <- with(df, ifelse(V1=='A' & V2==2, 'Eureka', 'NOPE'))
Another option would be data.table. Convert the "data.frame" to a "data.table" (setDT) and create the column (V4) with the value NOPE. The rows of V4 that meet the condition (V1=='A' & V2==2) are then assigned Eureka.
library(data.table)
setDT(df)[,V4:='NOPE'][V1=='A' & V2==2, V4:='Eureka'][]
Regarding the error in your code, 'Eureka' should be quoted. It is better to use vectorized methods rather than loops.
for(g in 1:8){
if(df$V1[g]=='A' & df$V2[g]==2){
df$V4[g] <- 'Eureka'
}
else{
df$V4[g] <- 'NOPE'
}
}
df$V4
#[1] "NOPE" "NOPE" "Eureka" "Eureka" "NOPE" "NOPE" "NOPE" "NOPE"
data
df <- structure(list(V1 = c("A", "A", "A", "A", "C", "C", "C", "C"),
V2 = c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), V3 = c(4L, 8L, 4L,
8L, 10L, 9L, 10L, 9L)), .Names = c("V1", "V2", "V3"), class =
"data.frame", row.names = c(NA, -8L))
Set the whole column to the default value, then replace the matching rows with the new value using logical indexing.
data$V4 <- "NOPE"
data$V4[data$V1=="A" & data$V2==2] <- "Eureka"

Manipulating a Data frame in R

I am new to R. I have a data frame
A 5 8 9 6
B 8 2 3 6
C 1 8 9 5
I want to make
A 5
A 8
A 9
A 6
B 8
B 2
B 3
B 6
C 1
C 8
C 9
C 5
I have a big data file
Assuming you're starting with something like this:
mydf <- structure(list(V1 = c("A", "B", "C"), V2 = c(5L, 8L, 1L),
V3 = c(8L, 2L, 8L), V4 = c(9L, 3L, 9L),
V5 = c(6L, 6L, 5L)),
.Names = c("V1", "V2", "V3", "V4", "V5"),
class = "data.frame", row.names = c(NA, -3L))
mydf
# V1 V2 V3 V4 V5
# 1 A 5 8 9 6
# 2 B 8 2 3 6
# 3 C 1 8 9 5
Try one of the following:
library(reshape2)
melt(mydf, 1)
Or
cbind(mydf[1], stack(mydf[-1]))
Or
library(splitstackshape)
merged.stack(mydf, var.stubs = "V[2-5]", sep = "var.stubs")
The name pattern in the last example is unlikely to be applicable to your actual data though.
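Note that melt and stack return the values column by column (all of V2, then all of V3, and so on), while the desired output is grouped by row. A minimal sketch of one way to get the row-wise order in base R, assuming mydf as above (long is just an illustrative name):

```r
mydf <- data.frame(V1 = c("A", "B", "C"), V2 = c(5L, 8L, 1L),
                   V3 = c(8L, 2L, 8L), V4 = c(9L, 3L, 9L),
                   V5 = c(6L, 6L, 5L), stringsAsFactors = FALSE)

# stack() goes column by column, so repeat the id column once per stacked column
long <- data.frame(V1 = rep(mydf$V1, times = ncol(mydf) - 1),
                   stack(mydf[-1]))

# order() keeps ties in their original order, so within each id
# the values stay in column order V2, V3, V4, V5
long <- long[order(long$V1), c("V1", "values")]
rownames(long) <- NULL
long
```

This gives A 5, A 8, A 9, A 6, then the B rows, then the C rows, as in the question.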
Someone could probably do this in a better way but here I go...
I put your data into a data frame called data
#repeat the value in the first column (c - 1) times, where c is the number of columns (length(data[1,]))
rep(data[,1], each=length(data[1,])-1)
#turning your data frame into a matrix allows you then turn it into a vector...
#transpose the matrix because the vector concatenates columns rather than rows
as.vector(t(as.matrix(data[,2:5])))
#combining these ideas you get...
data.frame(col1=rep(data[,1], each=length(data[1,])-1),
col2=as.vector(t(as.matrix(data[,2:5]))))
If you could use a matrix you can just 'cast' it to a vector and add the row names. I have assumed that you really want 'a', 'b', 'c' as row names.
n <- 3
data <- matrix(1:9, ncol = n)
# transpose first: as.vector() reads a matrix column by column
data <- t(t(as.vector(t(data))))
rownames(data) <- rep(letters[1:3], each = n)
If you want to keep the row names from your first data frame, this is of course also possible without libraries.
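n <- 3
# give the matrix row names so there is something to keep
data <- matrix(1:9, ncol = n, dimnames = list(c("A", "B", "C"), NULL))
names <- rownames(data)
# transpose first: as.vector() reads a matrix column by column
data <- t(t(as.vector(t(data))))
rownames(data) <- rep(names, each = n)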
