In R, I'm trying to collapse multiple columns of a data frame into two columns, with the original column names copied into their own column in the resulting data frame. For instance, I have the following data frame df:
A B C D
1 2 3 4
5 6 7 8
And I'm trying to get this output, which is helpful when performing ANOVA tests:
DV IV
A 1
A 5
B 2
B 6
C 3
C 7
D 4
D 8
I've been going about this manually by declaring a new data frame like so:
df2 <- data.frame("DV" = c(rep("A", 2), rep("B", 2), rep("C", 2), rep("D", 2)),
"IV" = c(df$A, df$B, df$C, df$D))
I suspect aggregate() or melt() could do this more efficiently, but I'm lost in the syntax. Thanks in advance!
You can use melt from the reshape2 package:
library(reshape2)
melt(df, variable.name = "DV", value.name = "IV")
DV IV
1 A 1
2 A 5
3 B 2
4 B 6
5 C 3
6 C 7
7 D 4
8 D 8
A <- c(1,5)
B <- c(2,6)
C <- c(3,7)
D <- c(4,8)
df <- data.frame(A,B,C,D)
I like gather from the tidyr package, since the code is so short:
library(tidyr)
df2 <- gather(df, DV, IV)
You could just use stack(df) or, prettying up the result a bit:
setNames(rev(stack(df)), c("DV", "IV"))
DV IV
1 A 1
2 A 5
3 B 2
4 B 6
5 C 3
6 C 7
7 D 4
8 D 8
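For comparison, the manual approach from the question can be generalized in base R so it no longer hard-codes the column names. This is only a sketch, assuming every column of df is numeric and should be stacked; it rebuilds the example df from the thread and checks the result against the stack() one-liner above.

```r
# Example data from the question
df <- data.frame(A = c(1, 5), B = c(2, 6), C = c(3, 7), D = c(4, 8))

# Repeat each column name once per row, then flatten the columns in order
df2 <- data.frame(
  DV = rep(names(df), each = nrow(df)),
  IV = unlist(df, use.names = FALSE),
  stringsAsFactors = FALSE
)

# The stack() version for comparison (its label column is a factor)
stacked <- setNames(rev(stack(df)), c("DV", "IV"))

print(df2)
```

The only difference between the two results is the type of DV: stack() returns a factor, which is usually what an ANOVA call wants, while the manual version above keeps it as character.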
I have a pretty basic problem that I can't seem to solve:
Take 3 vectors:
V1 V2 V3
1 4 7
2 5 8
3 6 9
I want to merge the 3 columns into a single column and assign a group.
Desired result:
Group Value
A 1
A 2
A 3
B 4
B 5
B 6
C 7
C 8
C 9
My code:
V1 <- c(1,2,3)
V2 <- c(4,5,6)
V3 <- c(7,8,9)
data <- data.frame(group = c("A", "B", "C"),
values = c(V1, V2, V3))
Actual result:
Group Value
A 1
B 2
C 3
A 4
B 5
C 6
A 7
B 8
C 9
How can I reshape the data to get the desired result?
Thank you!
We can stack on a named list of vectors
stack(setNames(mget(paste0("V", 1:3)), LETTERS[1:3]))[2:1]
-output
ind values
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
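Regarding the issue in the OP's data creation: data.frame() recycles a column that is shorter than the longest one, so the three group labels are repeated in the order A, B, C, A, B, C, ... instead of in blocks. We need rep with a times argument so that each label is repeated once per element of its vector: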
data <- data.frame(group = rep(c("A", "B", "C"), c(length(V1),
length(V2), length(V3))),
values = c(V1, V2, V3))
-output
> data
group values
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
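To make the recycling point concrete, here is a small base R sketch. It assumes the same three vectors as in the question; rep_len() is used only to imitate what data.frame() effectively does with the short label vector, and the names recycled and grouped are made up for this illustration.

```r
V1 <- c(1, 2, 3)
V2 <- c(4, 5, 6)
V3 <- c(7, 8, 9)
values <- c(V1, V2, V3)

# What recycling produces: the labels cycle A, B, C, A, B, C, ...
recycled <- rep_len(c("A", "B", "C"), length(values))

# What is wanted: each label repeated once per element of its own vector
grouped <- rep(c("A", "B", "C"), times = lengths(list(V1, V2, V3)))

data <- data.frame(group = grouped, values = values, stringsAsFactors = FALSE)
print(data)
```

Using lengths() on a list of the vectors keeps the code correct even when the vectors do not all have the same length.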
Here is another option using pivot_longer(): strip the V prefix so the column names become 1, 2, 3, then convert them into a factor labeled A, B, C.
V1 <- c(1,2,3)
V2 <- c(4,5,6)
V3 <- c(7,8,9)
df <- data.frame(V1, V2, V3)
library(dplyr)
library(tidyr)
#basic answer:
answer<-pivot_longer(df, cols=starts_with("V"), names_to = "Group")
Or, to change the column names to a different label:
answer<-pivot_longer(df, cols=starts_with("V"), names_prefix = "V", names_to = "Group")
answer$Group <- factor(answer$Group, labels = c("A", "B", "C"))
answer %>% arrange(Group, value)
I have a data frame in the following format, with IDs and a type of A or B. The data frame is very long, with over 3000 IDs.
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B
... ...
I need to remove every run of two or more consecutive A's together with the B that follows it. This is not ordinary de-duplication: I don't want to keep one copy of the repeated A. Whenever two or more A's occur in a row, all of those A's and the following B (everything up to the next A) should be dropped. Expected result:
id type
1 A
2 B
6 A
7 B
8 A
9 B
... ...
Do I need a loop for this problem? Any help is appreciated, thank you!
This might be what you want:
First, define a function that returns the indices of the rows you want to remove (it calls lead() from dplyr, so dplyr must be loaded before the function is called):
row_sequence <- function(value) {
inds <- which(value == lead(value))
sort(unique(c(inds, inds + 1, inds +2)))
}
Apply the function to your dataframe by first extracting the rows that you want to remove into df1 and second anti_joining df1 with df to obtain the final dataframe:
library(dplyr)
df1 <- df %>% slice(row_sequence(type))
df2 <- df %>%
anti_join(., df1)
Result:
df2
id type
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data:
df <- data.frame(
id = 1:13,
type = c("A","B","A","A","B","A","B","A","B","A","A","A","B")
)
I assumed there is only one B after each series of duplicated A values; if that is not the case, let me know and I will modify my code:
library(dplyr)
library(tidyr)
library(data.table)
df %>%
mutate(rles = data.table::rleid(type)) %>%
group_by(rles) %>%
mutate(rles = ifelse(length(rles) > 1, NA, rles)) %>%
ungroup() %>%
mutate(rles = ifelse(!is.na(rles) & is.na(lag(rles)) & type == "B", NA, rles)) %>%
drop_na() %>%
select(-rles)
# A tibble: 6 x 2
id type
<int> <chr>
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data
df <- read.table(header = TRUE, text = "
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B")
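No explicit loop is needed in base R either. The following is a sketch built on rle(), assuming (as in the answers above) that a run of repeated A's is followed by a B run that should be removed in full. The helper names bad_run, after_bad and keep are made up for this illustration.

```r
df <- data.frame(
  id = 1:13,
  type = c("A", "B", "A", "A", "B", "A", "B", "A", "B", "A", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# Run-length encode the type column: one entry per run of identical values
runs <- rle(df$type)

# Runs of "A" that are longer than one element
bad_run <- runs$values == "A" & runs$lengths > 1

# The "B" run that immediately follows such an "A" run
after_bad <- c(FALSE, head(bad_run, -1)) & runs$values == "B"

# Expand the per-run flags back to one flag per row and keep the rest
keep <- !rep(bad_run | after_bad, times = runs$lengths)
result <- df[keep, ]
print(result)
```

Because the flags are computed per run rather than per row, this handles A runs of any length without hard-coding offsets.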
Input
(Say d is the data frame below.)
a b c
1 5 7
2 6 8
3 7 9
I want to shift the contents of column b one position down and put an arbitrary number in the first position in b. How do I do this? I would appreciate any help in this regard. Thank you.
I tried c(6,tail(d["b"],-1)) but it does not produce (6,5,6).
Output
a b c
1 6 7
2 5 8
3 6 9
Use head instead
df$b <- c(6, head(df$b, -1))
# a b c
#1 1 6 7
#2 2 5 8
#3 3 6 9
You could also use lag in dplyr
library(dplyr)
df %>% mutate(b = lag(b, default = 6))
Or shift in data.table
library(data.table)
setDT(df)[, b:= shift(b, fill = 6)]
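If you need the same shift for offsets other than one, the head() idea can be wrapped in a small base R helper. This is a sketch only: shift_down is a hypothetical name, not a function from any package, and it assumes 0 <= n <= length(x).

```r
# Shift a vector down by n positions, filling the top with `fill`
shift_down <- function(x, n = 1, fill = NA) {
  stopifnot(n >= 0, n <= length(x))
  c(rep(fill, n), head(x, length(x) - n))
}

d <- data.frame(a = 1:3, b = c(5, 6, 7), c = 7:9)
d$b <- shift_down(d$b, n = 1, fill = 6)
print(d)

# A larger offset on a plain vector
shifted <- shift_down(1:5, n = 2)
print(shifted)
```

The helper keeps the vector length unchanged, so the result can be assigned straight back into the data frame column.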
A dplyr solution uses lag with an explicit default argument, if you prefer:
library(dplyr)
d <- tibble(a = 1:3, b = 5:7, c = 7:9)
d %>% mutate(b = lag(b, default = 6))
#> # A tibble: 3 x 3
#> a b c
#> <int> <dbl> <int>
#> 1 1 6 7
#> 2 2 5 8
#> 3 3 6 9
Created on 2019-12-05 by the reprex package (v0.3.0)
Here is a solution similar to the head approach by @Ronak Shah:
df <- within(df, b <- c(runif(1), b[-length(b)]))
where a uniformly distributed random number is placed in the first position of column b, and b[-length(b)] drops the last value so the remaining values shift down by one:
> df
a b c
1 1 0.6644704 7
2 2 5.0000000 8
3 3 6.0000000 9
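The solution below works for any lag or lead offset (change n, or use dplyr::lead):
library(dplyr)
d <- data.frame(a=c(1,2,3),b=c(5,6,7),c=c(7,8,9))
# Do not group_by(b) here: every row would be its own group and lag() would return only NA
d1 <- d %>% arrange(b) %>%
mutate(b1= dplyr::lag(b, n = 1, default = NA))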
I'm looking for a way to calculate values in LONG format data frame without switching between long and wide formats. Data frame structure is basically like this:
index <- rep(seq(1:3),2)
category <- c("a","a","a","b","b","b")
value <- c(3,6,8,9,7,4)
df <- data.frame(index, category,value, stringsAsFactors = FALSE)
Say, I need to calculate a new category, c by adding up a and b. That is very easy to do by transforming the data frame to "wide" format with category as the key column, adding new c variable by the calculation and switching back to "long" format.
However, I have hundreds of new categories to be calculated from hundreds of source items and it would be a very time-consuming solution. I'm sure there must be a smarter way, but I haven't been able to find it. Any ideas? Thank you!
We can use data.table
library(data.table)
rbind(setDT(df), df[, .(category = 'c', value = sum(value)), index])
# index category value
#1: 1 a 3
#2: 2 a 6
#3: 3 a 8
#4: 1 b 9
#5: 2 b 7
#6: 3 b 4
#7: 1 c 12
#8: 2 c 13
#9: 3 c 12
With dplyr we can group_by index to match the values, sum values for each group and bind the rows to the original dataframe.
library(dplyr)
bind_rows(df, df %>%
group_by(index) %>%
summarise(category = 'c',
value = sum(value)))
# index category value
#1 1 a 3
#2 2 a 6
#3 3 a 8
#4 1 b 9
#5 2 b 7
#6 3 b 4
#7 1 c 12
#8 2 c 13
#9 3 c 12
The same with base R would be using aggregate and rbind
rbind(df, transform(aggregate(value~index, df, sum), category = 'c'))
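For many derived categories, the aggregate-and-rbind idea can be driven by a lookup list so the data never leaves long format. This is a sketch under assumptions: recipes is a hypothetical named list mapping each new category to the source categories it sums, and only sums are supported.

```r
df <- data.frame(
  index = rep(1:3, 2),
  category = rep(c("a", "b"), each = 3),
  value = c(3, 6, 8, 9, 7, 4),
  stringsAsFactors = FALSE
)

# Hypothetical definition table: new category name -> source categories to add up
recipes <- list(c = c("a", "b"))

# Build the rows for each new category from the matching source rows
new_rows <- lapply(names(recipes), function(nm) {
  src <- df[df$category %in% recipes[[nm]], ]
  out <- aggregate(value ~ index, data = src, FUN = sum)
  out$category <- nm
  out[, names(df)]  # restore the original column order
})

result <- rbind(df, do.call(rbind, new_rows))
print(result)
```

Adding another derived category then only means adding another entry to recipes, for example a hypothetical d = c("a", "b") entry.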
I am trying to convert a matrix to a data frame and use the matrix's row and column names as variables in the data frame.
Here is the sample:
sample = matrix(c(1,NA,NA,2,NA,3,NA,NA,5,NA,NA,6,NA,NA,NA,NA,8,NA,3,1),ncol = 4)
colnames(sample) = letters[1:4]
row.names(sample) = letters[22:26]
My dataset has a lot of NA values, so I am trying to remove all the NA rows from the data frame.
Here is my desired output:
data.frame(col = c("v","v","w","w","y","y","y","z"),
row = c("a","b","c","d","a","b","d","d"),
value = c(1,3,6,8,2,5,3,1))
Use melt from the reshape2 package for reshaping, then drop the NA rows. Finally, do some formatting to get your desired output (ordering, setting column names, ...).
> library(reshape2)
> df <- na.omit(melt(sample)) # reshaping
> df <- df[order(df$Var1), ] # ordering
> colnames(df) <- c("col", "row", "value") # setting colnames
> df # getting desired output
col row value
1 v a 1
6 v b 3
12 w c 6
17 w d 8
4 y a 2
9 y b 5
19 y d 3
20 z d 1
With dplyr and magrittr (melt still comes from reshape2, loaded above):
> library(magrittr)
> library(dplyr)
> sample %>% melt %>%
na.omit %>%
arrange(., Var1) %>%
setNames(c('col', 'row', 'value'))
col row value
1 v a 1
2 v b 3
3 w c 6
4 w d 8
5 y a 2
6 y b 5
7 y d 3
8 z d 1
Here is a base R method by replicating the row names and column names
out <- na.omit(data.frame(col = rownames(sample)[row(sample)],
row = colnames(sample)[col(sample)], value = c(sample)))
out <- out[order(out$col),]
row.names(out) <- NULL
out
# col row value
#1 v a 1
#2 v b 3
#3 w c 6
#4 w d 8
#5 y a 2
#6 y b 5
#7 y d 3
#8 z d 1
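Another base R variant avoids building the full data frame before dropping the NA cells. This is a sketch using which(..., arr.ind = TRUE) on the sample matrix from the question; the matrix is called m here only to avoid masking base::sample.

```r
m <- matrix(c(1, NA, NA, 2, NA, 3, NA, NA, 5, NA,
              NA, 6, NA, NA, NA, NA, 8, NA, 3, 1), ncol = 4)
colnames(m) <- letters[1:4]
rownames(m) <- letters[22:26]

# Row/column positions of the non-NA cells only
idx <- which(!is.na(m), arr.ind = TRUE)

out <- data.frame(
  col = rownames(m)[idx[, "row"]],   # the question labels row names as "col"
  row = colnames(m)[idx[, "col"]],   # and column names as "row"
  value = m[idx],
  stringsAsFactors = FALSE
)
out <- out[order(out$col), ]
rownames(out) <- NULL
print(out)
```

For a matrix that is mostly NA this only touches the cells that survive, which can matter when the matrix is large.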