Reshape R: split a column

A really simple question, but I couldn't find a solution:
I have a data.frame like
V1 <- c("A","A","B","B","C","C")
V2 <- c("D","D","E","E","F","F")
V3 <- c(10:15)
df <- data.frame(cbind(V1,V2,V3))
i.e.
V1 V2 V3
A D 10
A D 11
B E 12
B E 13
C F 14
C F 15
And I would like
V1 V2 V3.1 V3.2
A D 10 11
B E 12 13
C F 14 15
I tried reshape {stats} and reshape2.

As I had mentioned, all that you need is a "time" variable and you should be fine.
Mark Miller shows the base R approach, and creates the time variable manually.
Here's a way to automatically create the time variable, and the equivalent command for dcast from the "reshape2" package:
## Creating the "time" variable. This does not depend
## on the rows being in a particular order before
## assigning the variables
df <- within(df, {
A <- do.call(paste, df[1:2])
time <- ave(A, A, FUN = seq_along)
rm(A)
})
## This is the "reshaping" step
library(reshape2)
dcast(df, V1 + V2 ~ time, value.var = "V3")
# V1 V2 1 2
# 1 A D 10 11
# 2 B E 12 13
# 3 C F 14 15
Self-promotion alert
Since this type of question has cropped up several times, and since a lot of datasets don't always have a unique ID, I have implemented a variant of the above as a function called getanID in my "splitstackshape" package. In its present version, it hard-codes the name of the "time" variable as ".id". If you were using that, the steps would be:
library(splitstackshape)
library(reshape2)
df <- getanID(df, id.vars=c("V1", "V2"))
dcast(df, V1 + V2 ~ .id, value.var = "V3")

V1 <- c("A","A","B","B","C","C")
V2 <- c("D","D","E","E","F","F")
V3 <- c(10:15)
time <- rep(c(1,2), 3)
df <- data.frame(V1,V2,V3,time)
df
reshape(df, idvar = c('V1','V2'), timevar='time', direction = 'wide')
V1 V2 V3.1 V3.2
1 A D 10 11
3 B E 12 13
5 C F 14 15
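The two answers above can be combined in base R alone. The following is a minimal sketch, not taken from either answer: it builds the "time" variable with ave() and then calls reshape(). It constructs the data frame directly instead of going through cbind(), because cbind() on these vectors produces a character matrix and V3 would no longer be numeric. The name `wide` is only illustrative.

```r
# Build the example data without cbind(), so V3 stays an integer column
df <- data.frame(
  V1 = c("A", "A", "B", "B", "C", "C"),
  V2 = c("D", "D", "E", "E", "F", "F"),
  V3 = 10:15
)

# Running counter within each V1/V2 group; this is the "time" variable
df$time <- ave(seq_len(nrow(df)), df$V1, df$V2, FUN = seq_along)

# Spread V3 into one column per value of time (V3.1, V3.2)
wide <- reshape(df, idvar = c("V1", "V2"), timevar = "time", direction = "wide")
wide
```

Because the counter is computed per group, the rows do not need to be sorted before reshaping.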

Related

Searching a list across a dataframe in R?

Currently, I have a database built in R that looks like this:
df <- data.frame(c('ABC','DEF','HIJ'),
c(1,2,5),
c(2,5,9),
c(14,19,12))
And I have a function which searches for one value across the entire data frame and returns the entire row, the function for this is below:
df[which(df == 5,
arr.ind = TRUE)[,"row"],]
This function returns the following when executed:
HIJ 5 9 12
DEF 2 5 19
I would like to pass a vector of values and filter on all of them in one shot, returning every row that contains a match. However, I have been unable to wrap my search function above in a loop that does this. Below is an example of what I am trying to achieve: search for the values in vector v across data frame df and return all rows of df that contain any of those values in any column:
v <- c(1,2,13,19,16,120,2934,1087)
Searching this across the data frame I would like to return:
ABC 1 2 14
DEF 2 5 19
I am wondering what would be the best way to perform a loop to do this search?
We can use:
df[rowSums(sapply(df, `%in%`, v)) > 0, ]
Or using dplyr:
library(dplyr)
df %>% filter_all(any_vars(. %in% v))
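To see why the base R one-liner works, it may help to look at its intermediate steps. The sketch below is an illustration only; the column names V1 to V4 and the names `hits`, `keep` and `result` are assumptions added here for readability.

```r
df <- data.frame(
  V1 = c("ABC", "DEF", "HIJ"),
  V2 = c(1, 2, 5),
  V3 = c(2, 5, 9),
  V4 = c(14, 19, 12)
)
v <- c(1, 2, 13, 19, 16, 120, 2934, 1087)

# One logical column per data frame column: TRUE where the cell is in v
hits <- sapply(df, `%in%`, v)

# A row is kept if at least one of its cells matched
keep <- rowSums(hits) > 0

result <- df[keep, ]
result
```

No explicit loop is needed, because `%in%` is already vectorised over each column and sapply() applies it to every column in turn.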
It may be easier to reshape your data first. I'll use data.table::melt:
library(data.table)
df = data.frame(
V1 = c("ABC", "DEF", "HIJ"),
V2 = c(1, 2, 5),
V3 = c(2, 5, 9),
V4 = c(14, 19, 12)
)
setDT(df)
# reshape long
melt_df = melt(df, id.vars = 'V1')
melt_df
# V1 variable value
# 1: ABC V2 1
# 2: DEF V2 2
# 3: HIJ V2 5
# 4: ABC V3 2
# 5: DEF V3 5
# 6: HIJ V3 9
# 7: ABC V4 14
# 8: DEF V4 19
# 9: HIJ V4 12
Now we can look it all up at once:
melt_df[value %in% v]
# V1 variable value
# 1: ABC V2 1
# 2: DEF V2 2
# 3: ABC V3 2
# 4: DEF V4 19
That's the gist of it. To get back your original desired output, we need to do some other steps:
df[.(V1 = melt_df[value %in% v, unique(V1)]), on = 'V1']
# V1 V2 V3 V4
# 1: ABC 1 2 14
# 2: DEF 2 5 19
This pulls the associated values of V1 from melt_df (unique removes duplicates) and joins them back to df (hence on = 'V1') to get the associated rows from df.

Customise the aggregate function inside dcast based on the max value of a column in data.table?

I've got a data.table that I'd like to dcast based on three columns (V1, V2, V3). There are, however, some duplicates in V3, and I need an aggregate function that looks at a fourth column, V4, and picks the value of V3 with the maximum value of V4. I'd like to do this without having to aggregate DT separately before dcasting. Can this aggregation be done in the aggregate function of dcast, or do I need to aggregate the table separately first?
Here is my data.table DT:
> DT <- data.table(V1 = c('a','a','a','b','b','c')
, V2 = c(1,2,1,1,2,1)
, V3 = c('st', 'cc', 'B', 'st','st','cc')
, V4 = c(0,0,1,0,1,1))
> DT
V1 V2 V3 V4
1: a 1 st 0
2: a 2 cc 0
3: a 1 B 1 ## --> i want this row to be picked in dcast when V1 = a and V2 = 1 because V4 is largest
4: b 1 st 0
5: b 2 st 1
6: c 1 cc 1
and the dcast function could look something like this:
> dcast(DT
, V1 ~ V2
, value.var = "V3"
#, fun.aggregate = V3[which.max(V4)] ## ?!?!?!??!
)
My desired output is:
> desired
V1 1 2
1: a B cc
2: b st st
3: c cc <NA>
Please note that aggregating DT before dcasting to get rid of the duplicates will solve the issue. I'm just wondering if dcasting can be done with the duplicates.
Here is one option where you take the relevant subset before dcasting:
DT[order(V4, decreasing = TRUE)
][, dcast(unique(.SD, by = c("V1", "V2")), V1 ~ V2, value.var = "V3")]
# V1 1 2
# 1: a B cc
# 2: b st st
# 3: c cc <NA>
Alternatively order and use a custom function in dcast():
dcast(
DT[order(V4, decreasing = TRUE)],
V1 ~ V2,
value.var = "V3",
fun.aggregate = function(x) x[1]
)
A dplyr/tidyr option would be to group by V1 and V2, select the row with the maximum V4 in each group, and then spread to wide format.
library(dplyr)
library(tidyr)
DT %>%
group_by(V1, V2) %>%
slice(which.max(V4)) %>%
select(-V4) %>%
spread(V2, V3)
# V1 `1` `2`
# <chr> <chr> <chr>
#1 a B cc
#2 b st st
#3 c cc NA
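All of the answers above share one idea: decide which duplicate wins before (or while) going wide, because the aggregate function only sees the values of V3 and not V4. As a rough base R illustration of that idea, assuming a plain data frame instead of a data.table, the same result can be sketched like this (the names `ord`, `top` and `wide` are illustrative):

```r
DT <- data.frame(
  V1 = c("a", "a", "a", "b", "b", "c"),
  V2 = c(1, 2, 1, 1, 2, 1),
  V3 = c("st", "cc", "B", "st", "st", "cc"),
  V4 = c(0, 0, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Put the rows with the largest V4 first
ord <- DT[order(DT$V4, decreasing = TRUE), ]

# Keep only the first row of every V1/V2 pair, i.e. the one with the largest V4
top <- ord[!duplicated(ord[c("V1", "V2")]), c("V1", "V2", "V3")]
top <- top[order(top$V1, top$V2), ]

# One row per V1, one column per value of V2 (V3.1, V3.2)
wide <- reshape(top, idvar = "V1", timevar = "V2", direction = "wide")
wide
```

The column names differ from the dcast output (V3.1 and V3.2 instead of 1 and 2), but the cell values are the same as in the desired table.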

Pivoting data in R with duplicate rows

Trying to do a simple pivot in R, much like you would in SQL.
I understand this question has been asked before; however, I am having trouble with duplicate rows.
Pivoting data in R
Currently the data is in this format (characters are just placeholders for ease of viewing. The actual data is numerical):
V1 V2 V3 V4
A B C Sales
D E F Sales
G H I Technical
J K L Technical
And it needs to be transformed into this format:
Variable Sales Technical
V1 A G
V1 D J
V2 B H
V2 E K
V3 C I
V3 F L
I've tried both the reshape and tidyr packages, and they either aggregate the data (reshape) or throw errors for duplicate row identifiers (tidyr).
I don't care about duplicate row identifiers; in fact, it's necessary to identify them as factors for analysis.
Am I going about this the wrong way? Are these the correct packages to be using or can anyone suggest another method?
I hope this will work:
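library(dplyr)
library(tidyr)
# g numbers the rows within each V4/Variable group so that spread()
# sees unique identifiers and does not fail on duplicates
df %>% gather(Variable, Value, V1:V3) %>%
group_by(V4, Variable) %>%
mutate(g = row_number()) %>%
spread(V4, Value) %>% ungroup() %>%
select(-g)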
# # A tibble: 6 x 3
# Variable Sales Technical
# * <chr> <chr> <chr>
# 1 V1 A G
# 2 V1 D J
# 3 V2 B H
# 4 V2 E K
# 5 V3 C I
# 6 V3 F L
Another option is melt/dcast from data.table
library(data.table)
dcast(melt(setDT(df1), id.var = 'V4'), variable + rowid(V4) ~
V4, value.var = 'value')[, V4 := NULL][]
# variable Sales Technical
#1: V1 A G
#2: V1 D J
#3: V2 B H
#4: V2 E K
#5: V3 C I
#6: V3 F L
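The question does not include code to build the data, so here is a hedged, package-free sketch that reconstructs the example table and produces the requested layout. It assumes that every V4 group has the same number of rows (two in the example); the names `pieces`, `n_per_group` and `result` are illustrative only.

```r
df <- data.frame(
  V1 = c("A", "D", "G", "J"),
  V2 = c("B", "E", "H", "K"),
  V3 = c("C", "F", "I", "L"),
  V4 = c("Sales", "Sales", "Technical", "Technical"),
  stringsAsFactors = FALSE
)
value_cols <- c("V1", "V2", "V3")

# For each V4 group, stack V1, V2, V3 on top of each other into one vector
pieces <- lapply(split(df[value_cols], df$V4),
                 function(d) unlist(d, use.names = FALSE))

# Assumption: all groups have the same size
n_per_group <- nrow(df) / length(pieces)

result <- data.frame(
  Variable = rep(value_cols, each = n_per_group),
  pieces,
  stringsAsFactors = FALSE
)
result
```

Duplicate values of Variable are kept on purpose, since nothing is aggregated here; each group's values are simply laid side by side.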

for loop acting weird

I have two dataframes:
df_1 <- data.frame(c("a_b", "a_c", "a_d"))
df_2 <- data.frame(matrix(ncol = 2))
And I would like to loop over df_1 in order to fill df_2:
for (i in (1:(length(df_1[,1])))){
for (j in (1:2)) {
df_2[i*j,] <-str_split_fixed(df_1[i,1], "_", 2)
}
}
I would like df_2 to look like:
col1 col2
a b
a b
a c
a c
a d
a d
But instead I get:
col1 col2
a b
a c
a d
a c
NA NA
a d
I must be doing something wrong, but cannot figure it out.
I would also like to use apply (or something like it), but I am pretty new to R and not firm with the apply family.
Thanks for your help!
Another way would be
df_1 <- data.frame(col1 = c("a_b", "a_c", "a_d"))
df_2 <- as.data.frame(do.call(rbind, strsplit(as.character(df_1$col1), split = "_", fixed = TRUE)))
df_2[rep(1:nrow(df_2), each = 2), ]
V1 V2
1 a b
1.1 a b
2 a c
2.1 a c
3 a d
3.1 a d
We can use cSplit with data.table approach
library(splitstackshape)
cSplit(df_1, 'col1', '_')[rep(seq_len(.N), each =2)]
# col1_1 col1_2
#1: a b
#2: a b
#3: a c
#4: a c
#5: a d
#6: a d
Or another option is tidyverse
library(tidyverse)
separate(df_1, col1, into=c("col_1", "col_2")) %>%
map_df(~rep(., each = 2))
# A tibble: 6 × 2
# col_1 col_2
# <chr> <chr>
#1 a b
#2 a b
#3 a c
#4 a c
#5 a d
#6 a d
NOTE: Both the answers are one-liners.
data
df_1 <- data.frame(col1 = c("a_b", "a_c", "a_d"))
This would be a combination of two answers. With cSplit we split the column by _ and then repeat each row twice. This assumes your column is named V1.
library(splitstackshape)
df_2 <- cSplit(df_1, "V1", "_")
df_2[rep(seq_len(nrow(df_2)),each = 2), ]
# V1_1 V1_2
#1: a b
#2: a b
#3: a c
#4: a c
#5: a d
#6: a d
Or, as @Sotos mentioned in the comments, we can use expandRows to accommodate everything in one line.
expandRows(cSplit(df_1, "V1", "_"), 2, count.is.col = FALSE)
# V1_1 V1_2
#1: a b
#2: a b
#3: a c
#4: a c
#5: a d
#6: a d
data
df_1 <- data.frame(V1 = c("a_b", "a_c", "a_d"))
OK, I started learning R this week, but if you want the result you presented, you can use your code with this fix:
for (i in (1:(length(df_1[,1])))){
for (j in (1:2)) {
df_2[(i-1)*2+j,] <- str_split_fixed(df_1[i,1], "_", 2)
}
}
I changed the row index of df_2.
I guess there is a better way than two for loops, but that is all I can do for the moment.
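To make the index fix concrete, the following small sketch (added here as an illustration, not part of the answer) lists the row of df_2 that each loop iteration writes to, under the original index i*j and under the corrected index (i-1)*2+j. The data frame `idx` is a hypothetical helper.

```r
# j varies fastest, mirroring the inner loop of the original code
idx <- expand.grid(j = 1:2, i = 1:3)

idx$wrong <- idx$i * idx$j            # index used in the question
idx$right <- (idx$i - 1) * 2 + idx$j  # index used in the fix

idx
```

With i*j the targets are 1, 2, 2, 4, 3, 6: row 2 is overwritten, row 3 receives the last element, and row 5 is never written, which is exactly the NA row in the question's output. The corrected index visits rows 1 to 6 once each.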
I was trying to post a solution I found right after posting, but it was misunderstood and deleted:
"Sometimes posting a question helps:
I was asking for the right position in df_1, but I was saving the result in the wrong row of df_2.
The answer to my original question should be something like this:
n <- 1
for (i in (1:(length(df_1[,1])))){
for (j in (1:2)) {
df_2[n,] <-str_split_fixed(df_1[i,1], "_", 2)
n <- n+1
}
}"

How to collapse/join selected factor levels across two columns in R

Let's say I have the following data frame:
x <-c(rep (c ("s1", "s2", "s3"),each=5 ))
y <- c(rep(c("a", "b", "c", "d", "e"), 3) )
z<-c(1:15)
x_name <- "dimensions"
y_name <- "aspects"
z_name<-"value"
df <- data.frame(x,y,z)
names(df) <- c(x_name,y_name, z_name)
How can I collapse/join the factor levels 'a', 'c' and 'd' into one new factor level 'x' within each value of 'dimensions', so that their values are added up for the new level x?
I thought of using gsub to replace the names a, c and d with x and then summing their values using aggregate. But is there a simpler way to do this? Besides, I am not sure my solution would still work if other columns also contained a, c or d.
I reviewed several related answers on the forum, but none addressed this situation. Thanks.
First rename a, c, and d to x, and then sum by dimensions and aspects.
Reading the data:
df <- data.frame(dimensions = x, aspects = y, value = z, stringsAsFactors = FALSE)
Base R solution:
# if you read the data my way the following line is unnecessary
# df$aspects <- as.character(df$aspects)
df[df$aspects %in% c("a","c","d"),]$aspects <- "x"
aggregate(value ~., df, sum)
Result:
dimensions aspects value
1 s1 b 2
2 s2 b 7
3 s3 b 12
4 s1 e 5
5 s2 e 10
6 s3 e 15
7 s1 x 8
8 s2 x 23
9 s3 x 38
data.table solution
require(data.table)
DT <- setDT(df)
DT[aspects %in% c("a","c","d"), aspects := "x"]
DT[,sum(value), by=.(dimensions, aspects)]
Results in
dimensions aspects V1
1: s1 x 8
2: s1 b 2
3: s1 e 5
4: s2 x 23
5: s2 b 7
6: s2 e 10
7: s3 x 38
8: s3 b 12
9: s3 e 15
Here's a solution using plyr::revalue (see also plyr::mapvalues) and dplyr:
# install.packages("plyr")
library(dplyr)
df %>%
mutate(aspects = plyr::revalue(aspects, c("a" = "x", "c" = "x", "d" = "x"))) %>%
group_by(dimensions, aspects) %>%
summarise(sum_value = sum(value))
# dimensions aspects sum_value
# (fctr) (fctr) (int)
# 1 s1 x 8
# 2 s1 b 2
# 3 s1 e 5
# 4 s2 x 23
# 5 s2 b 7
# 6 s2 e 10
# 7 s3 x 38
# 8 s3 b 12
# 9 s3 e 15
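If aspects is kept as a factor, another base R possibility is to rename the levels themselves: assigning the same name to several levels merges them. The sketch below is offered as an assumption-labelled alternative to the answers above, not as their method; `out` is an illustrative name. Only the levels of the aspects column are touched, so other columns that happen to contain a, c or d are unaffected.

```r
df <- data.frame(
  dimensions = rep(c("s1", "s2", "s3"), each = 5),
  aspects = factor(rep(c("a", "b", "c", "d", "e"), 3)),
  value = 1:15
)

# Giving levels a, c and d the same name "x" collapses them into one level
levels(df$aspects)[levels(df$aspects) %in% c("a", "c", "d")] <- "x"

# Sum value within each dimensions/aspects combination
out <- aggregate(value ~ dimensions + aspects, data = df, FUN = sum)
out
```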
