Easiest way to reshape this dataframe in R? [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
Say I have the following wide/messy dataframe:
df1 <- data.frame(ID = c(1, 2), Gender = c("M","F"),
Q1 = c(1, 5), Q2 = c(2, 6),
Q3 = c(3, 7), Q4 = c(4, 8))
ID Gender Q1 Q2 Q3 Q4
1 M 1 2 3 4
2 F 5 6 7 8
How can I turn it into this dataframe:
df2 <- data.frame(ID = c(1, 1, 2, 2), Gender = c("M", "M", "F", "F"),
V1 = c(1, 3, 5, 7), V2 = c(2, 4, 6, 8))
ID Gender V1 V2
1 M 1 2
1 M 3 4
2 F 5 6
2 F 7 8
I know there are multiple packages and functions (e.g., tidyr, reshape2, reshape function) that can accomplish this. Which is the easiest way to do it and how? Really appreciate any help anyone can provide. Thanks!

You could try melt from data.table. At the time of writing this needed the devel version (v1.9.5); the feature has been in CRAN releases since v1.9.6. melt can take multiple sets of columns in measure.vars as a list, one list element per output column.
library(data.table)#v1.9.5+
melt(setDT(df1), measure.vars=list(c(3,5), c(4,6)),
value.name=c('V1', 'V2'))[,variable:=NULL][order(ID)]
# ID Gender V1 V2
#1: 1 M 1 2
#2: 1 M 3 4
#3: 2 F 5 6
#4: 2 F 7 8
Or use reshape from base R
res <- subset(reshape(df1, idvar=c('ID', 'Gender'),
varying=list(c(3,5), c(4,6)), v.names=c('V1', 'V2'),
direction='long'), select=-time)
res <- res[order(res$ID), ] # group rows by ID, as in df2
row.names(res) <- NULL
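For a frame this small, the same result can also be assembled by hand, which makes the reshaping logic explicit: stack the (Q1, Q2) pair on top of the (Q3, Q4) pair, then sort by ID. This is only a sketch. It assumes the measurement columns pair up exactly as in the question, and `long` is a name introduced here for illustration.

```r
df1 <- data.frame(ID = c(1, 2), Gender = c("M", "F"),
                  Q1 = c(1, 5), Q2 = c(2, 6),
                  Q3 = c(3, 7), Q4 = c(4, 8))

# Keep the identifier columns and stack the two column pairs on top of each other
ids <- df1[c("ID", "Gender")]
long <- rbind(
  data.frame(ids, V1 = df1$Q1, V2 = df1$Q2),  # first measurement pair
  data.frame(ids, V1 = df1$Q3, V2 = df1$Q4)   # second measurement pair
)

# Sort so that the rows of each ID sit together, then reset the row names
long <- long[order(long$ID), ]
row.names(long) <- NULL
long
```

With more than a handful of column pairs this gets tedious, which is where melt or reshape pay off.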
Update
To transform 'df2' back to 'df1', dcast from data.table can be used, since it accepts multiple value.var columns. First create a sequence column (N) within each ('ID', 'Gender') group, then call dcast.
dcast(setDT(df2)[, N:=1:.N, list(ID, Gender)], ID+Gender~N,
value.var=c('V1', 'V2'))
# ID Gender 1_V1 2_V1 1_V2 2_V2
#1: 1 M 1 3 2 4
#2: 2 F 5 7 6 8
Or we create a sequence by group with ave and then use reshape from base R.
df2 <- transform(df2, N= ave(seq_along(ID), ID, Gender, FUN=seq_along))
reshape(df2, idvar=c('ID', 'Gender'), timevar='N', direction='wide')
# ID Gender V1.1 V2.1 V1.2 V2.2
#1 1 M 1 2 3 4
#3 2 F 5 6 7 8
data
df1 <- data.frame(ID = c(1, 2), Gender = c("M","F"), Q1 = c(1, 5),
Q2 = c(2, 6), Q3 = c(3, 7), Q4 = c(4, 8))
df2 <- data.frame(ID = c(1, 1, 2, 2), Gender = c("M", "M", "F", "F"),
V1 = c(1, 3, 5, 7), V2 = c(2, 4, 6, 8))

Related

Match value from one dataframe to values from a second dataframe of different length

I have two dataframes like so
df_1 <- data.frame(Min = c(1, 4, 9, 25),
Max = c(3, 7, 14, 100))
df_2 <- data.frame(Value = c(5, 2, 33),
Symbol = c("B", "A", "D"))
I want to attach df_2$Symbol to df_1 based on whether or not df_2$Value falls between df_1$Min and df_1$Max. If there's no df_2$Value in the appropriate range I'd like NA instead:
df_target <- data.frame(
Min = c(1, 4, 9, 25),
Max = c(3, 7, 14, 100),
Symbol = c("A", "B", NA, "D")
)
If df_1 and df_2 were of equal lengths this would be simple with findInterval or something with cut but alas...
A solution in either base or tidyverse would be appreciated.
We could use a non-equi join
library(data.table)
setDT(df_1)[df_2, Symbol := Symbol, on = .(Min < Value, Max > Value)]
df_1
# Min Max Symbol
#1: 1 3 A
#2: 4 7 B
#3: 9 14 <NA>
#4: 25 100 D
Or use fuzzy_left_join from fuzzyjoin (dplyr is loaded for %>% and select):
library(fuzzyjoin)
library(dplyr)
fuzzy_left_join(df_1, df_2, by = c('Min' = 'Value',
'Max' = 'Value'), match_fun = list(`<`, `>`) ) %>%
dplyr::select(-Value)
# Min Max Symbol
#1 1 3 A
#2 4 7 B
#3 9 14 <NA>
#4 25 100 D
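Since the question also asks for a base R solution, here is a minimal sketch without extra packages. It assumes, like the joins above, that the bounds are exclusive and that at most one value per interval matters (the first hit is taken); `idx` is a name introduced here for illustration.

```r
df_1 <- data.frame(Min = c(1, 4, 9, 25),
                   Max = c(3, 7, 14, 100))
df_2 <- data.frame(Value = c(5, 2, 33),
                   Symbol = c("B", "A", "D"))

# For each interval in df_1, find the row of df_2 whose Value lies strictly
# between Min and Max; NA when no value falls inside the interval
idx <- vapply(seq_len(nrow(df_1)), function(i) {
  hit <- which(df_2$Value > df_1$Min[i] & df_2$Value < df_1$Max[i])
  if (length(hit)) hit[1] else NA_integer_
}, integer(1))

# Indexing with NA yields NA, so unmatched intervals get a missing Symbol
df_1$Symbol <- df_2$Symbol[idx]
df_1
```

This loops over the rows of df_1, so for large tables the non-equi join is likely the better choice.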

dplyr row sum on selected rows [duplicate]

This question already has an answer here:
Row sum using mutate and select [duplicate]
(1 answer)
Closed 4 years ago.
I have the following data:
library(dplyr)
library(purrr)
d <- data.frame(
Type= c("d", "e", "d", "e"),
"2000"= c(1, 5, 1, 5),
"2001" = c(2, 5 , 6, 4),
"2002" = c(8, 9, 6, 3))
I would like to use rowsum and mutate to generate a new row which is the sum of 'd' and another row which is the sum of 'e' so that the data looks like this:
d2 <- data.frame(
Type= c("d", "e", "d", "e", "sum_of_d", "sum_of_e"),
"2000"= c(1, 5, 1, 5, 2, 10),
"2001" = c(2, 5 , 6, 4, 8, 9),
"2002" = c(8, 9, 6, 3, 14, 12))
I think the code should look something like this:
d %>%
dplyr::mutate(sum_of_d = rowSums(d[1,3], na.rm = TRUE)) %>%
dplyr::mutate(sum_of_e = rowSums(d[2,4], na.rm = TRUE)) -> d2
However, this does not quite work. Any ideas?
Thanks
You're looking for the sum by Type across all other columns, so..
library(dplyr)
d %>%
group_by(Type) %>%
summarise_all(sum) %>%
mutate(Type = paste0("sum_of_", Type)) %>%
rbind(d, .)
Type X2000 X2001 X2002
1 d 1 2 8
2 e 5 5 9
3 d 1 6 6
4 e 5 4 3
5 sum_of_d 2 8 14
6 sum_of_e 10 9 12
Alternatively, bind_rows gives the same result:
d %>%
group_by(Type) %>%
summarize_all(sum) %>%
mutate(Type=paste0("sum_of_",Type)) %>%
bind_rows(d,.)
Type X2000 X2001 X2002
1 d 1 2 8
2 e 5 5 9
3 d 1 6 6
4 e 5 4 3
5 sum_of_d 2 8 14
6 sum_of_e 10 9 12
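The same grouped sum can be sketched in base R with aggregate, in case dplyr is not available. This assumes every column other than Type is numeric; `sums` is a name introduced here for illustration.

```r
d <- data.frame(
  Type = c("d", "e", "d", "e"),
  "2000" = c(1, 5, 1, 5),
  "2001" = c(2, 5, 6, 4),
  "2002" = c(8, 9, 6, 3))

# Sum every remaining column within each Type ("." means all other columns)
sums <- aggregate(. ~ Type, data = d, FUN = sum)

# Relabel the summary rows and append them below the original data
sums$Type <- paste0("sum_of_", sums$Type)
d2 <- rbind(d, sums)
d2
```

As in the dplyr output, data.frame() turns the column names into X2000, X2001 and X2002.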

Subset of dataframe for which 2 variables match another dataframe in R

I'm looking to obtain a subset of my first, larger, dataframe 'df1' by selecting rows which contain particular combinations in the first two variables, as specified in a smaller 'df2'. For example:
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
day = c(1, 2, 2, 1, 2, 3), value = seq(4,9))
df1 # my actual df has 20 variables
ID day value
A 1 4
A 2 5
A 2 6
B 1 7
B 2 8
B 3 9
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))
df2 # this df remains at 2 variables
ID day
A 2
B 1
Where the output would be:
ID day value
A 2 5
A 2 6
B 1 7
Any help would be much appreciated, thanks!
This is a good use of the merge function.
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
day = c(1, 2, 2, 1, 2, 3), value = seq(4,9))
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))
merge(df1,
df2,
by = c("ID", "day"))
Which gives output:
ID day value
1 A 2 5
2 A 2 6
3 B 1 7
Here is a dplyr solution:
library("dplyr")
semi_join(df1, df2, by = c("ID", "day"))
# ID day value
# 1 A 2 5
# 2 A 2 6
# 3 B 1 7
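A filtering join can also be sketched in base R by building a combined key from the two columns and keeping the rows of df1 whose key occurs in df2. This assumes the separator ("_" here) never appears inside an ID, and the `key1`/`key2` names are introduced only for illustration.

```r
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
                  day = c(1, 2, 2, 1, 2, 3), value = seq(4, 9))
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))

# One key per row, combining ID and day
key1 <- paste(df1$ID, df1$day, sep = "_")
key2 <- paste(df2$ID, df2$day, sep = "_")

# Keep only the rows of df1 whose (ID, day) combination appears in df2
res <- df1[key1 %in% key2, ]
res
```

Like semi_join, this only filters df1: it never duplicates rows when df2 contains repeated combinations, whereas merge would.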

Subset R data.frame by index and name in one line

Sample data.frame:
df <- structure(list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
Output:
df
# a b c
# 1 1 4 7
# 2 2 5 8
# 3 3 6 9
I'd like to get the first and third columns, but I want to subset by name and also by column index.
df[, "a"]
# [1] 1 2 3
df[, 3]
# [1] 7 8 9
df[, c("a", 3)]
# Error in `[.data.frame`(df, , c("a", 3)) : undefined columns selected
df[, c(match("a", names(df)), 3)]
# a c
# 1 1 7
# 2 2 8
# 3 3 9
Are there functions or packages that allow for clean/simple syntax, as in the third example, while also achieving the result of the fourth example?
Maybe use dplyr?
For interactive use - i.e., if you know ahead of time the name of the column you want to select
library(dplyr)
df %>% select(a, 3)
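If you do not know the name of the column in advance and want to pass it as a variable, wrap it in all_of() (select_() is deprecated in current dplyr; all_of() needs dplyr 1.0.0 or later):
x <- names(df)[1]
x
# [1] "a"
df %>% select(all_of(x), 3)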
Either way the output is
# a c
#1 1 7
#2 2 8
#3 3 9
In base R you can use subset with its select argument, which accepts unquoted column names and indices together.
df <- structure(list(a = c(1, 2, 3),
b = c(4, 5, 6), c = c(7, 8, 9)),
.Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
df <- subset(df, select = c(a, 3))
You can index names(df) without using dplyr:
df <- structure(list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
df[,c("a",names(df)[3]) ]
Output:
a c
1 1 7
2 2 8
3 3 9
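The third example in the question fails because c("a", 3) coerces 3 to the string "3", which is not a column name. A small wrapper can avoid that by taking names and positions as separate arguments. The sketch below uses a hypothetical helper called `pick`, which is not part of base R or any package.

```r
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# pick() is a hypothetical helper: each argument is either a column name
# (character) or a column position (numeric); names are converted to
# positions before subsetting, so the two kinds can be mixed freely
pick <- function(data, ...) {
  idx <- vapply(list(...), function(s) {
    if (is.character(s)) match(s, names(data)) else as.integer(s)
  }, integer(1))
  if (anyNA(idx)) stop("unknown column name")
  data[, idx, drop = FALSE]
}

res <- pick(df, "a", 3)
res
```

Passing the selectors through ... rather than c() is what keeps the numeric index from being turned into a string.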

match rows across two columns

Given a data frame
df=data.frame(
E=c(1,1,2,1,3,2,2),
N=c(4,4,10,4,3,2,2)
)
I would like to create a third column that labels repeated rows: whenever two or more rows share the same values in both E and N, they count as a match, and each distinct match gets its own letter. Rows without a match get NA.
dfx=data.frame(
E=c(1,1,2,1,3,2,2,3, 2),
N=c(4,4,10,4,3,2,2,6, 10),
matched=c("A", "A", "B","A", NA, "C", "C", NA, "B")
)
Thanks!
Here, df is:
df <- structure(list(E = c(1, 1, 2, 1, 3, 2, 2, 3, 2), N = c(4, 4,
10, 4, 3, 2, 2, 6, 10)), .Names = c("E", "N"), row.names = c(NA,
-9L), class = "data.frame")
You can do:
dfx <- transform(df, matched = {
i <- as.character(interaction(df[c("E", "N")]))
# count each E.N combination, keeping the order of first appearance
tab <- table(i)[unique(i)]
LETTERS[match(i, names(tab)[tab > 1])]
})
# E N matched
# 1 1 4 A
# 2 1 4 A
# 3 2 10 B
# 4 1 4 A
# 5 3 3 <NA>
# 6 2 2 C
# 7 2 2 C
# 8 3 6 <NA>
# 9 2 10 B
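A variation on the same idea uses duplicated() instead of table(). This is a sketch under the same assumptions as above: letters are assigned in order of first appearance, and there are at most 26 repeated combinations, since LETTERS runs out after that. The `key` and `groups` names are introduced here for illustration.

```r
df <- data.frame(E = c(1, 1, 2, 1, 3, 2, 2, 3, 2),
                 N = c(4, 4, 10, 4, 3, 2, 2, 6, 10))

# One key per row, combining both columns
key <- paste(df$E, df$N)

# Combinations that occur more than once, in order of first appearance
u <- unique(key)
groups <- u[u %in% key[duplicated(key)]]

# Rows whose key is not repeated get NA from match(), hence NA from LETTERS
df$matched <- LETTERS[match(key, groups)]
df
```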

Resources