I have a pretty basic problem that I can't seem to solve:
Take 3 vectors:
V1 V2 V3
1  4  7
2  5  8
3  6  9
I want to merge the 3 columns into a single column and assign a group.
Desired result:
Group Value
A     1
A     2
A     3
B     4
B     5
B     6
C     7
C     8
C     9
My code:
V1 <- c(1,2,3)
V2 <- c(4,5,6)
V3 <- c(7,8,9)
data <- data.frame(group = c("A", "B", "C"),
values = c(V1, V2, V3))
Actual result:
Group Value
A     1
B     2
C     3
A     4
B     5
C     6
A     7
B     8
C     9
How can I reshape the data to get the desired result?
Thank you!
We can use stack() on a named list of vectors:
stack(setNames(mget(paste0("V", 1:3)), LETTERS[1:3]))[2:1]
-output
ind values
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
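Regarding the data creation in the question: group has only 3 elements while values has 9, so data.frame() recycles the shorter vector (A, B, C, A, B, C, ...). To repeat each label in a block instead, use rep() with the length of each vector: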
data <- data.frame(group = rep(c("A", "B", "C"), c(length(V1),
length(V2), length(V3))),
values = c(V1, V2, V3))
-output
> data
group values
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
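To make the recycling point concrete, here is a minimal sketch comparing the two behaviours. The names labels, recycled, blocked and by_length are only illustrative and not from the question; rep_len() is used here to mimic what data.frame() does with the shorter vector.

```r
V1 <- c(1, 2, 3)
V2 <- c(4, 5, 6)
V3 <- c(7, 8, 9)
labels <- c("A", "B", "C")

# Recycling repeats the whole vector end to end: A B C A B C A B C
recycled <- rep_len(labels, 9)

# each = 3 repeats every label in a block: A A A B B B C C C
blocked <- rep(labels, each = 3)

# times = takes one count per label, so the vectors may differ in length
by_length <- rep(labels, times = lengths(list(V1, V2, V3)))

data <- data.frame(group = by_length, values = c(V1, V2, V3))
```

The times = form is the safer choice if the vectors are not guaranteed to have the same length.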
Here is another option using pivot_longer(), converting the 1, 2, 3 suffixes of the column names into a factor labelled A, B, C:
V1 <- c(1,2,3)
V2 <- c(4,5,6)
V3 <- c(7,8,9)
df <- data.frame(V1, V2, V3)
library(dplyr)
library(tidyr)
# basic answer:
answer <- pivot_longer(df, cols = starts_with("V"), names_to = "Group")
Or, to relabel the groups as A, B, C instead of keeping the column names:
answer <- pivot_longer(df, cols = starts_with("V"), names_prefix = "V", names_to = "Group")
answer$Group <- factor(answer$Group, labels = c("A", "B", "C"))
answer %>% arrange(Group, value)
I have a data frame in the following format, with IDs and a type that is either A or B. The data frame is very long, with over 3000 IDs.
id  type
1   A
2   B
3   A
4   A
5   B
6   A
7   B
8   A
9   B
10  A
11  A
12  A
13  B
... ...
I need to remove every block of rows in which two or more A's follow one another, together with the B that comes after them. This is not the same as removing duplicates: whenever there are 2 or more consecutive A's, I want to remove all of those A's and the B, up to the next A. Desired result:
id  type
1   A
2   B
6   A
7   B
8   A
9   B
... ...
Do I need a loop for this problem? Any help is appreciated, thank you!
This might be what you want:
First, define a function that notes the indices of what you want to remove:
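# lead() comes from dplyr, which must be loaded before this function is called
row_sequence <- function(value) {
  # positions where a value equals the next one, i.e. the start of a repeat
  inds <- which(value == lead(value))
  # drop the repeated pair and the row right after it
  sort(unique(c(inds, inds + 1, inds + 2)))
}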
Apply the function to your data frame: first extract the rows you want to remove into df1, then anti_join df1 against df to obtain the final data frame:
library(dplyr)
df1 <- df %>% slice(row_sequence(type))
df2 <- df %>%
  anti_join(df1, by = c("id", "type"))
Result:
df2
id type
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data:
df <- data.frame(
id = 1:13,
type = c("A","B","A","A","B","A","B","A","B","A","A","A","B")
)
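For comparison, here is a base R sketch of the same idea using rle(), with no packages needed. It assumes, as above, that each run of repeated A's is followed by B's that should be dropped with it; the names r, bad_a, bad_b, keep and result are only illustrative.

```r
df <- data.frame(
  id = 1:13,
  type = c("A","B","A","A","B","A","B","A","B","A","A","A","B")
)

# Run-length encode the type column (as.character() in case it is a factor)
r <- rle(as.character(df$type))

# A run is flagged when it consists of more than one consecutive A
bad_a <- r$values == "A" & r$lengths > 1

# The B run directly after a flagged A run is flagged as well
bad_b <- r$values == "B" & c(FALSE, head(bad_a, -1))

# Expand the run-level flags back to one flag per row and keep the rest
keep <- rep(!(bad_a | bad_b), r$lengths)
result <- df[keep, ]
```

Working on runs rather than row offsets means a run of any length is handled the same way.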
I assumed there is only one B after each series of duplicated A values; if that is not the case, just let me know and I will modify my code:
library(dplyr)
library(tidyr)
library(data.table)
df %>%
  mutate(rles = data.table::rleid(type)) %>%
  group_by(rles) %>%
  mutate(rles = ifelse(length(rles) > 1, NA, rles)) %>%
  ungroup() %>%
  mutate(rles = ifelse(!is.na(rles) & is.na(lag(rles)) & type == "B", NA, rles)) %>%
  drop_na() %>%
  select(-rles)
# A tibble: 6 x 2
id type
<int> <chr>
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data
df <- read.table(header = TRUE, text = "
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B")
I have a large data frame and want to create a new variable which depends on two other variables.
Here is a short example:
v1 <- rep(c(1:5),each=3)
v2 <- c('X','A','Y','X','Y','B','X','Y','C','X','Y','C','X','Y','A')
dat <- data.frame(v1,v2)
#create a new var which contains either A,B, or C depending on what is found in v2
#desired output
v3 <- rep(c('A','B','C','C','A'),each=3)
data.frame(v1,v2,v3)
Any ideas on how to do this with a short code?
I tried this, but it's far from the solution. Too many missings. :(
dat$v3[dat$v2 %in% c('A','B','C')] <- dat$v2[dat$v2 %in% c('A','B','C')]
library(tidyverse)
dat %>% group_by(v1) %>% mutate(v3 = intersect(v2, c("A", "B", "C")))
# A tibble: 15 x 3
# Groups: v1 [5]
# v1 v2 v3
# <int> <fct> <chr>
# 1 1 X A
# 2 1 A A
# 3 1 Y A
# 4 2 X B
# 5 2 Y B
# 6 2 B B
# 7 3 X C
# 8 3 Y C
# 9 3 C C
# 10 4 X C
# 11 4 Y C
# 12 4 C C
# 13 5 X A
# 14 5 Y A
# 15 5 A A
This is assuming that only one of A, B, C can appear in a group given by v1.
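A base R sketch of the same per-group lookup, under the same assumption that each v1 group contains exactly one of A, B, C. It uses ave(), which applies a function within each group and recycles a single result across the group's rows; the name targets is only illustrative.

```r
v1 <- rep(c(1:5), each = 3)
v2 <- c('X','A','Y','X','Y','B','X','Y','C','X','Y','C','X','Y','A')
dat <- data.frame(v1, v2)

targets <- c("A", "B", "C")

# Within each v1 group, pick the first value of v2 that is one of the targets;
# ave() spreads that single value over every row of the group
dat$v3 <- ave(as.character(dat$v2), dat$v1,
              FUN = function(x) x[x %in% targets][1])
```

If a group contains none of the targets, this gives NA for that group rather than an error.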
Here is my issue:
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df1
x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df2) <- LETTERS[3:7]
df2
x y z
C 1 2 3
D 2 3 4
E 3 4 5
F 4 5 6
G 5 6 7
What I want is:
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
where rows that share the same name are summed, column by column.
A solution with base R:
# create a new variable from the rownames
df1$rn <- rownames(df1)
df2$rn <- rownames(df2)
# bind the two dataframes together by row and aggregate
res <- aggregate(cbind(x,y,z) ~ rn, rbind(df1,df2), sum)
# or (thx to @alistaire for reminding me):
res <- aggregate(. ~ rn, rbind(df1,df2), sum)
# assign the rownames again
rownames(res) <- res$rn
# get rid of the 'rn' column
res <- res[, -1]
which gives:
> res
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
With dplyr,
library(dplyr)
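# add rownames as a column in each data.frame and bind rows
# (tibble::rownames_to_column() replaces the deprecated add_rownames())
bind_rows(tibble::rownames_to_column(df1, "rowname"),
          tibble::rownames_to_column(df2, "rowname")) %>%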
# evaluate following calls for each value in the rowname column
group_by(rowname) %>%
# add all non-grouping variables
summarise_all(sum)
## # A tibble: 7 x 4
## rowname x y z
## <chr> <int> <int> <int>
## 1 A 1 2 3
## 2 B 2 3 4
## 3 C 4 6 8
## 4 D 6 8 10
## 5 E 8 10 12
## 6 F 4 5 6
## 7 G 5 6 7
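If both data frames have exactly the same row names in the same order, you could also vectorize the operation by turning the data frames into matrices. Note that this adds values by position and ignores the row names, so on its own it does not give the desired result for the df1 and df2 in the question: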
result_df <- as.data.frame(as.matrix(df1) + as.matrix(df2))
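Since matrix addition works by position, the row names have to be aligned first when they differ. Here is a minimal sketch of one way to do that, filling a zero matrix indexed by the union of the row names; all_rows and m are illustrative names, and it assumes both data frames have the same numeric columns.

```r
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df2) <- LETTERS[3:7]

# One row per distinct row name, one column per shared column, all zeros
all_rows <- union(rownames(df1), rownames(df2))
m <- matrix(0, nrow = length(all_rows), ncol = ncol(df1),
            dimnames = list(all_rows, colnames(df1)))

# Add each data frame into the rows that carry its row names
m[rownames(df1), ] <- m[rownames(df1), ] + as.matrix(df1)
m[rownames(df2), ] <- m[rownames(df2), ] + as.matrix(df2)

result_df <- as.data.frame(m)
```

Rows that appear in only one data frame (A, B, F, G here) keep their original values, because the other data frame contributes zero.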
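This might need some tweaking to get the row-name logic working on a longer example. rbind() makes the duplicated row names unique by appending a suffix, so the first character is used here to recover the original name: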
dfr <-rbind(df1,df2)
do.call(rbind, lapply( split(dfr, sapply(rownames(dfr),substr,1,1)), colSums))
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
If the rownames could all be assumed to be alpha characters a gsub solution should be easy.
An alternative is to melt the data and cast it. First we store the row names in a new last column of both data frames (thanks to @Jaap):
df1$rn <- rownames(df1)
df2$rn <- rownames(df2)
Then we melt the data, using rn as the id variable (melt and dcast come from the reshape2 package):
library(reshape2)
melt(list(df1, df2), id.vars = "rn")
Then we use dcast, with mget() retrieving all objects named df1, df2, ... at once:
mydf <- dcast(melt(mget(ls(pattern = "df\\d+")), id.vars = "rn"),
              rn ~ variable, value.var = "value", fun.aggregate = sum)
rownames(mydf) <- mydf$rn
# get rid of the 'rn' column
mydf <- mydf[, -1]
> mydf
# x y z
#A 1 2 3
#B 2 3 4
#C 4 6 8
#D 6 8 10
#E 8 10 12
#F 4 5 6
#G 5 6 7
In R, I'm trying to collapse multiple columns of a data frame into two columns, with the column names from the first data frame being copied into their own column in the resulting data frame. For instance, I have the following data frame df :
A B C D
1 2 3 4
5 6 7 8
And I'm trying to get this output, which is helpful when performing ANOVA tests:
DV IV
A 1
A 5
B 2
B 6
C 3
C 7
D 4
D 8
I've been going about this manually by declaring a new data frame like so:
df2 <- data.frame("DV" = c(rep("A", 2), rep("B", 2), rep("C", 2), rep("D", 2)),
"IV" = c(df$A, df$B, df$C, df$D))
I suspect aggregate() or melt() could do this more efficiently, but I'm lost in the syntax. Thanks in advance!
You can use melt from the reshape2 package:
library(reshape2)
melt(df, variable.name = "DV", value.name = "IV")
DV IV
1 A 1
2 A 5
3 B 2
4 B 6
5 C 3
6 C 7
7 D 4
8 D 8
Data:
A <- c(1,5)
B <- c(2,6)
C <- c(3,7)
D <- c(4,8)
df <- data.frame(A,B,C,D)
I like gather() from the tidyr package, since the code is so short:
library(tidyr)
df2 <- gather(df, DV, IV)
You could just use stack(df) or, prettying up the result a bit:
setNames(rev(stack(df)), c("DV", "IV"))
DV IV
1 A 1
2 A 5
3 B 2
4 B 6
5 C 3
6 C 7
7 D 4
8 D 8
I am new to R and this site, but I searched and didn't find the answer I was looking for.
If I have the following data set "total":
names <- c("a", "b", "c", "d", "a", "b", "c", "d")
x <- cbind(x1 = 3, x2 = c(3:10))
total <- data.frame(names, x)
total
names x1 x2
1 a 3 3
2 b 3 4
3 c 3 5
4 d 3 6
5 a 3 7
6 b 3 8
7 c 3 9
8 d 3 10
How can I create a new data set that works like the SumIf Excel function with just unique rows?
The answer should be a new data set "summary" that is 4 x 3.
names <- unique(names)
summary <- data.frame(names)
summary$Sumx1 <- ?????
summary$Sumx2 <- ?????
summary
names Sumx1 Sumx2
1 a 6 10
2 b 6 12
3 c 6 14
4 d 6 16
In base R:
aggregate(. ~ names, data=total, sum)
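In the formula . ~ names, the dot stands for every column other than names, so each of them is summed within each value of names. A small sketch that also renames the result to match the Sumx1/Sumx2 headers asked for in the question; summary_df is an illustrative name, chosen to avoid masking base R's summary().

```r
names <- c("a", "b", "c", "d", "a", "b", "c", "d")
x <- cbind(x1 = 3, x2 = c(3:10))
total <- data.frame(names, x)

# Sum every non-grouping column within each value of names,
# then rename the columns to the headers used in the question
summary_df <- setNames(
  aggregate(. ~ names, data = total, FUN = sum),
  c("names", "Sumx1", "Sumx2")
)
```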
You can use ddply from the plyr package:
library(plyr)
ddply(total, .(names), summarise, Sumx1 = sum(x1), Sumx2 = sum(x2))
names Sumx1 Sumx2
1 a 6 10
2 b 6 12
3 c 6 14
4 d 6 16
You can also use data.table:
library(data.table)
DT <- as.data.table(total)
DT[ , lapply(.SD, sum), by = "names"]
names x1 x2
1: a 6 10
2: b 6 12
3: c 6 14
4: d 6 16
With the new dplyr package, you can do:
library(dplyr)
total %>%
group_by(names) %>%
summarise(Sumx1 = sum(x1), Sumx2 = sum(x2))
names Sumx1 Sumx2
1 d 6 16
2 c 6 14
3 b 6 12
4 a 6 10