Restructuring a dataframe [duplicate] - r

My data look like this:
cat1 <- c("A","A","B","B")
gender <- c("male","female","male","female")
mean <- c(1,2,3,4)
sd <-c(5,6,7,8)
data <- data.frame("cat1"=cat1,"gender"=gender, "mean"=mean, "sd"=sd)
> data
cat1 gender mean sd
1 A male 1 5
2 A female 2 6
3 B male 3 7
4 B female 4 8
I would like to reshape the table into the format below.
> data
cat1 score male female
1 A mean 1 2
2 A sd 5 6
3 B mean 3 4
4 B sd 7 8
Basically, I want the score names (mean, sd) to become rows and the gender values to become columns.
Any suggestions?

One option is to use gather and spread:
library(dplyr)
library(tidyr)
data %>%
gather(score, value, -cat1, -gender) %>%
spread(gender, value)
# cat1 score female male
#1 A mean 2 1
#2 A sd 6 5
#3 B mean 4 3
#4 B sd 8 7
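In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. A minimal sketch of the same reshape with the newer verbs follows; the intermediate names long and wide are only illustrative and are not part of the original answer.

```r
library(tidyr)

# Same example data as in the question
data <- data.frame(
  cat1   = c("A", "A", "B", "B"),
  gender = c("male", "female", "male", "female"),
  mean   = c(1, 2, 3, 4),
  sd     = c(5, 6, 7, 8),
  stringsAsFactors = FALSE
)

# Step 1: stack the mean and sd columns into a score/value pair (long format)
long <- pivot_longer(data, cols = c(mean, sd),
                     names_to = "score", values_to = "value")

# Step 2: spread the gender values out into their own columns (wide format)
wide <- pivot_wider(long, names_from = gender, values_from = value)
wide
```

Unlike spread, pivot_wider keeps the new columns in order of first appearance (male, then female) rather than sorting them alphabetically.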

We can also use melt and dcast from the data.table package:
library(data.table)
dcast(melt(data, id=c("cat1","gender"), variable.name = "score"), cat1 + score ~ gender)
#> cat1 score female male
#> 1 A mean 2 1
#> 2 A sd 6 5
#> 3 B mean 4 3
#> 4 B sd 8 7
Generally, any solution that converts the data to long format and then reshapes it back to wide, swapping the variable and value columns, works here.

It can also be done with recast from reshape2:
library(reshape2)
recast(data, id.var = 1:2, cat1 + variable ~ gender)
# cat1 variable female male
#1 A mean 2 1
#2 A sd 6 5
#3 B mean 4 3
#4 B sd 8 7

Related

Filling out missing information by grouping in R [duplicate]

I have a sample dataset below:
df <- data.frame(id = c(1,1,2,2,3,3),
gender = c("male",NA,"female","female",NA, "female"))
> df
id gender
1 1 male
2 1 <NA>
3 2 female
4 2 female
5 3 <NA>
6 3 female
Within each id group, some gender values are missing. I would like to fill those missing cells using the values that do exist for the same id.
So the desired output would be:
> df
id gender
1 1 male
2 1 male
3 2 female
4 2 female
5 3 female
6 3 female
Any thoughts?
Thanks!
You can use dplyr::group_by and tidyr::fill e.g.:
df |>
dplyr::group_by(id) |>
tidyr::fill(gender, .direction = "updown")
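For reference, here is a self-contained sketch of the same approach that can be run as-is. With .direction = "updown", fill first carries values downward within each group and then upward, so a leading NA (as in id 3) is filled from the row below. The ungroup() call and the name filled are additions for illustration, not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  id     = c(1, 1, 2, 2, 3, 3),
  gender = c("male", NA, "female", "female", NA, "female"),
  stringsAsFactors = FALSE
)

filled <- df %>%
  group_by(id) %>%                         # fill only within the same id
  fill(gender, .direction = "updown") %>%  # down first, then up
  ungroup()                                # drop grouping for later steps

filled
```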

Get the sum of all duplicate rows for each column without hard coding the column name? [duplicate]

Suppose I have a table with many columns (in the thousands) and some rows with duplicated names. I would like to sum the duplicated rows, for every column. I'm stuck because I don't want to hard code the column names or loop through each column and re-merge. Is there a better way to do this? Here is an example with only 3 columns for simplicity.
dat <- read.table(text='name etc4 etc1 etc2
A 9 0 3
A 10 10 4
A 11 9 4
B 2 7 5
C 40 6 0
C 50 6 1',header=TRUE)
# I could aggregate one column at a time
# but is there a way to do for each columns without prior hard coding?
aggregate( etc4 ~ name, data = dat, sum)
In the aggregate formula we can use ., which stands for all columns other than the grouping column 'name':
aggregate(. ~ name, data = dat, sum)
name etc4 etc1 etc2
1 A 30 19 11
2 B 2 7 5
3 C 90 12 1
Or, if we need finer control, i.e. there are other columns we want to leave out, either subset the data first or list the wanted columns with cbind:
aggregate(cbind(etc1, etc2, etc4) ~ name, data = dat, sum)
name etc1 etc2 etc4
1 A 19 11 30
2 B 7 5 2
3 C 12 1 90
If we need to store the column names in a vector and reuse them, subset the data with that vector and convert the result to a matrix:
cname <- c("etc4", "etc1" )
aggregate(as.matrix(dat[cname]) ~ name, data = dat, sum)
name etc4 etc1
1 A 30 19
2 B 2 7
3 C 90 12
Or this may be done faster with fsum from the collapse package:
library(collapse)
fsum(get_vars(dat, is.numeric), g = dat$name)
etc4 etc1 etc2
A 30 19 11
B 2 7 5
C 90 12 1
A tidyverse approach:
library(dplyr)
dat %>%
group_by(name) %>%
summarise(across(.cols = starts_with("etc"),.fns = sum))
# A tibble: 3 x 4
name etc4 etc1 etc2
<chr> <int> <int> <int>
1 A 30 19 11
2 B 2 7 5
3 C 90 12 1
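If the columns do not share a common prefix, a variation on the tidyverse answer is to select by type instead of by name. This is only a sketch; it assumes every numeric column should be summed and uses where(is.numeric), which is available in dplyr 1.0.0 and later. The name sums is illustrative.

```r
library(dplyr)

# Example data from the question
dat <- read.table(text = "name etc4 etc1 etc2
A 9 0 3
A 10 10 4
A 11 9 4
B 2 7 5
C 40 6 0
C 50 6 1", header = TRUE, stringsAsFactors = FALSE)

# Sum every numeric column within each name, without naming any column
sums <- dat %>%
  group_by(name) %>%
  summarise(across(where(is.numeric), sum))

sums
```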

Grouped pivot_longer dplyr

This is an example dataframe. My real dataframe is larger. I highly prefer a tidyverse solution.
#my data
age <- c(18,18,19)
A1 <- c(3,5,3)
A2 <- c(4,4,3)
B1 <- c(1,5,2)
B2 <- c(2,2,5)
df <- data.frame(age, A1, A2, B1, B2)
I want my data to look like this:
#what i want
new_age <- c(18,18,18,18,19,19)
A <- c(3,5,4,4,3,3)
B <- c(1,5,2,2,2,5)
new_df <- data.frame(new_age, A, B)
I want to pivot longer and stack columns A1:A2 into column A, and B1:B2 into B. I also want to have the responses to match the correct age. For example, the 19 year old person in this example has only responded with 3's in columns A1:A2.
tidyr::pivot_longer(df, cols = -age, names_to = c(".value",'groupid'),
#1+ non digits followed by 1+ digits
names_pattern = "(\\D+)(\\d+)")
# A tibble: 6 x 4
age groupid A B
<dbl> <chr> <dbl> <dbl>
1 18 1 3 1
2 18 2 4 2
3 18 1 5 5
4 18 2 4 2
5 19 1 3 2
6 19 2 3 5
In base R you can use reshape and then drop the columns you do not need (here the time and id columns). You can also change the row names afterwards.
reshape(df, 2:ncol(df), dir = "long", sep = "")[, -c(2, 5)]
age A B
1.1 18 3 1
2.1 18 5 5
3.1 19 3 2
1.2 18 4 2
2.2 18 4 2
3.2 19 3 5
As you have a larger dataframe, a data.table solution may be faster. Here, you can use the melt function from the data.table package as follows:
library(data.table)
colA = grep("A",colnames(df),value = TRUE)
colB = grep("B",colnames(df),value = TRUE)
setDT(df)
df <- melt(df, measure = list(colA,colB), value.name = c("A","B"))
df[,variable := NULL]
df <- df[order(age)]
df
age A B
1: 18 3 1
2: 18 5 5
3: 18 4 2
4: 18 4 2
5: 19 3 2
6: 19 3 5
Does it answer your question?
EDIT: Using patterns - suggestion from @Wimpel
As @Wimpel suggested in the comments, you can get the same result using patterns:
melt( setDT(df), measure.vars = patterns( A="^A[0-9]", B="^B[0-9]") )[, variable:=NULL][]
age A B
1: 18 3 1
2: 18 5 5
3: 19 3 2
4: 18 4 2
5: 18 4 2
6: 19 3 5

R: reshape dataframe with duplicated variable names labeled var.1, var.2 [duplicate]

I'm hoping to reshape a dataframe in R so that a set of columns read in with duplicated names, and then renamed as var, var.1, var.2, anothervar, anothervar.1, anothervar.2 etc. can be treated as independent observations. I would like the number appended to the variable name to be used as the observation so that I can melt my data.
For example,
dat <- data.frame(ID=1:3, var=c("A", "A", "B"),
anothervar=c(5,6,7),var.1=c("C","D","E"),
anothervar.1 = c(1,2,3))
> dat
ID var anothervar var.1 anothervar.1
1 1 A 5 C 1
2 2 A 6 D 2
3 3 B 7 E 3
How can I reshape the data so it looks like the following:
ID obs var anothervar
1 1 A 5
1 2 C 1
2 1 A 6
2 2 D 2
3 1 B 7
3 2 E 3
Thank you for your help!
We can use melt from data.table, which accepts multiple patterns in the measure argument:
library(data.table)
melt(setDT(dat), measure = patterns("^var", "anothervar"),
variable.name = "obs", value.name = c("var", "anothervar"))[order(ID)]
# ID obs var anothervar
#1: 1 1 A 5
#2: 1 2 C 1
#3: 2 1 A 6
#4: 2 2 D 2
#5: 3 1 B 7
#6: 3 2 E 3
As for a tidyverse solution, we can use unite with gather:
library(tidyr)
dat %>%
unite("1", var, anothervar) %>%
unite("2", var.1, anothervar.1) %>%
gather(obs, value, -ID) %>%
separate(value, into = c("var", "anothervar"))
# ID obs var anothervar
#1 1 1 A 5
#2 2 1 A 6
#3 3 1 B 7
#4 1 2 C 1
#5 2 2 D 2
#6 3 2 E 3
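With tidyr 1.0.0 or later, pivot_longer with the special ".value" name is another option. The sketch below is not from the answers above: it assumes the unsuffixed columns (var, anothervar) are the first observation, and it renames them with an explicit ".0" suffix so that every column name has the form name.index before splitting on the dot. The names no_suffix and long are illustrative.

```r
library(tidyr)

# Example data from the question (with the letters quoted)
dat <- data.frame(ID = 1:3, var = c("A", "A", "B"),
                  anothervar = c(5, 6, 7), var.1 = c("C", "D", "E"),
                  anothervar.1 = c(1, 2, 3), stringsAsFactors = FALSE)

# Add ".0" to measure columns that have no numeric suffix yet
no_suffix <- !grepl("\\.\\d+$", names(dat)) & names(dat) != "ID"
names(dat)[no_suffix] <- paste0(names(dat)[no_suffix], ".0")

# ".value" keeps the part before the dot as the output column name;
# the part after the dot goes into obs
long <- pivot_longer(dat, cols = -ID,
                     names_to = c(".value", "obs"), names_sep = "\\.")

# Suffixes start at 0, the desired observation index starts at 1
long$obs <- as.integer(long$obs) + 1L
long
```

Unlike the unite/separate route, this keeps anothervar numeric instead of turning it into character.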

Merging datasets by columns that have different names

I want to merge datasets by columns that have different names
For example, for the data frames, df and df1
df <- data.frame(ID = c(1,2,3), Day = c(1,2,3), mean = c(2,3,4))
df1 <- data.frame(ID = c(1,2,3), Day = c(1,2,3), median = c(5,6,7))
I want to merge df and df1 so that I get
ID Day Measure Value
1 1 Mean 2
2 2 Mean 3
3 3 Mean 4
1 1 Median 5
2 2 Median 6
3 3 Median 7
Any ideas how? I tried using
merge(df,df1, by=c("ID","Day")) and
rbind.fill(df,df1) from the plyr package
but they each only do half of what I want.
library(tidyr)
m <- merge(df, df1, c("ID", "Day"))
gather(m, measure, value, mean:median)
# ID Day measure value
#1 1 1 mean 2
#2 2 2 mean 3
#3 3 3 mean 4
#4 1 1 median 5
#5 2 2 median 6
#6 3 3 median 7
And with reshape2:
melt(m, id=c("ID", "Day"))
Or with data.table:
library(data.table)
setDT(df)   # setDT converts one object at a time
setDT(df1)
setkey(df, ID, Day)
melt(df[df1], c("ID", "Day"))
# 1: 1 1 mean 2
# 2: 2 2 mean 3
# 3: 3 3 mean 4
# 4: 1 1 median 5
# 5: 2 2 median 6
# 6: 3 3 median 7
In base R:
vars <- c("ID","Day")
m <- merge(df, df1, by=vars)
cbind(m[vars], stack(m[setdiff(names(m),vars)]) )
# ID Day values ind
#1 1 1 2 mean
#2 2 2 3 mean
#3 3 3 4 mean
#4 1 1 5 median
#5 2 2 6 median
#6 3 3 7 median
You could add a new column to your two original data.frames called "Measure", then set the entire column to "Mean" in your first data.frame and "Median" in your second data.frame. Then set the colname of mean and median in both of your data.frames to "Value". Then combine using rbind.
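The steps described in this last answer can be sketched in base R as follows. The helper to_long and its argument names are hypothetical, introduced only to avoid repeating the same three steps for each data frame.

```r
df  <- data.frame(ID = c(1, 2, 3), Day = c(1, 2, 3), mean = c(2, 3, 4))
df1 <- data.frame(ID = c(1, 2, 3), Day = c(1, 2, 3), median = c(5, 6, 7))

# Hypothetical helper: label the rows with a Measure name and
# rename the measured column to Value
to_long <- function(d, column, label) {
  data.frame(ID = d$ID, Day = d$Day,
             Measure = label, Value = d[[column]],
             stringsAsFactors = FALSE)
}

# Both pieces now share the same column names, so rbind can stack them
stacked <- rbind(to_long(df, "mean", "Mean"),
                 to_long(df1, "median", "Median"))
stacked
```

This avoids the merge step entirely, so it also works when the two data frames do not contain the same ID/Day combinations.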
