Merging datasets by columns that have different names

I want to merge datasets by columns that have different names.
For example, for the data frames df and df1:
df <- data.frame(ID = c(1,2,3), Day = c(1,2,3), mean = c(2,3,4))
df1 <- data.frame(ID = c(1,2,3), Day = c(1,2,3), median = c(5,6,7))
I want to merge df and df1 so that I get
ID Day Measure Value
1 1 Mean 2
2 2 Mean 3
3 3 Mean 4
1 1 Median 5
2 2 Median 6
3 3 Median 7
Any ideas how? I tried using
merge(df,df1, by=c("ID","Day")) and
rbind.fill(df,df1) from the plyr package
but they each only do half of what I want.

library(tidyr)
m <- merge(df, df1, c("ID", "Day"))
gather(m, measure, value, mean:median)
# ID Day measure value
#1 1 1 mean 2
#2 2 2 mean 3
#3 3 3 mean 4
#4 1 1 median 5
#5 2 2 median 6
#6 3 3 median 7
And with reshape2:
melt(m, id=c("ID", "Day"))
Or with data.table:
setDT(df); setDT(df1)
setkey(df, ID, Day)
melt(df[df1], c("ID", "Day"))
#    ID Day variable value
# 1:  1   1     mean     2
# 2:  2   2     mean     3
# 3:  3   3     mean     4
# 4:  1   1   median     5
# 5:  2   2   median     6
# 6:  3   3   median     7

In base R:
vars <- c("ID","Day")
m <- merge(df, df1, by=vars)
cbind(m[vars], stack(m[setdiff(names(m),vars)]) )
# ID Day values ind
#1 1 1 2 mean
#2 2 2 3 mean
#3 3 3 4 mean
#4 1 1 5 median
#5 2 2 6 median
#6 3 3 7 median

You could add a new column called "Measure" to each of your two original data frames, setting the entire column to "Mean" in the first and "Median" in the second. Then rename the mean and median columns in both data frames to "Value", and combine them using rbind.
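A minimal sketch of that approach, using the df and df1 from the question:

```r
df  <- data.frame(ID = c(1,2,3), Day = c(1,2,3), mean = c(2,3,4))
df1 <- data.frame(ID = c(1,2,3), Day = c(1,2,3), median = c(5,6,7))

# Label each data frame, then rename the statistic column to a common name
df$Measure  <- "Mean"
df1$Measure <- "Median"
names(df)[names(df) == "mean"]     <- "Value"
names(df1)[names(df1) == "median"] <- "Value"

# Both now have identical columns (ID, Day, Value, Measure), so rbind works
combined <- rbind(df, df1)
combined
```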

Average values from multiple data frames by position [duplicate]

This question already has an answer here:
Average Cells of Two or More DataFrames
(1 answer)
Closed 1 year ago.
I have two dataframes:
dataA <- data.frame(A = replicate(5, 1), B = replicate(5, 2))
dataB <- data.frame(A = replicate(5, 3), B = replicate(5, 4))
I would like to create a third data frame dataC that is the average of the other two. For example, row 1 column 1 in the third data frame would be the average of the same position in the first two data frames.
Desired output:
dataC <- data.frame(A = replicate(5, 2), B = replicate(5, 3))
dataC
A B
2 3
2 3
2 3
2 3
2 3
We can place the datasets in a list, do an elementwise sum with + and divide by the length of the list:
Reduce(`+`, list(dataA, dataB))/2
-output
A B
1 2 3
2 2 3
3 2 3
4 2 3
5 2 3
Or, another option is to bind the datasets row-wise while creating a grouping column based on the row sequence, and then take the group-wise mean:
library(dplyr)
library(data.table)
bind_rows(dataA, dataB, .id = 'grp') %>%
  group_by(grp = rowid(grp)) %>%
  summarise(across(everything(), mean)) %>%
  select(-grp)
-output
# A tibble: 5 x 2
A B
<dbl> <dbl>
1 2 3
2 2 3
3 2 3
4 2 3
5 2 3
Here are some solutions:
# method 1:
dataC <- (dataA + dataB) / 2
# method 2:
dataC <- dataA
dataC[] <- Map(function(x,y) (x+y)/2, dataA, dataB)
# A B
# 1 2 3
# 2 2 3
# 3 2 3
# 4 2 3
# 5 2 3

Restructuring a dataframe [duplicate]

This question already has answers here:
How can I transpose my dataframe so that the rows and columns switch in r?
(2 answers)
Closed 3 years ago.
I have a data look like below:
cat1 <- c("A","A","B","B")
gender <- c("male","female","male","female")
mean <- c(1,2,3,4)
sd <-c(5,6,7,8)
data <- data.frame("cat1"=cat1,"gender"=gender, "mean"=mean, "sd"=sd)
> data
cat1 gender mean sd
1 A male 1 5
2 A female 2 6
3 B male 3 7
4 B female 4 8
I would like to change the format of the table to this below.
> data
cat1 score male female
1 A mean 1 2
2 A sd 5 6
3 B mean 3 4
4 B sd 7 8
Basically, I am moving the score names (mean/sd) into rows and spreading the gender values into columns.
Any suggestions?
One option using gather and spread
library(dplyr)
library(tidyr)
data %>%
  gather(score, value, -cat1, -gender) %>%
  spread(gender, value)
# cat1 score female male
#1 A mean 2 1
#2 A sd 6 5
#3 B mean 4 3
#4 B sd 8 7
We can also use melt and dcast from data.table package:
library(data.table)
dcast(melt(data, id=c("cat1","gender"), variable.name = "score"), cat1 + score ~ gender)
#> cat1 score female male
#> 1 A mean 2 1
#> 2 A sd 6 5
#> 3 B mean 4 3
#> 4 B sd 8 7
Generally, any solution that converts the data to long format and then reshape it back to wide to swap variable and value columns works here.
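For example, the same long-then-wide swap can be written with tidyr's newer pivot verbs; a sketch equivalent to the gather/spread answer above, using the data from the question:

```r
library(tidyr)

data <- data.frame(cat1   = c("A", "A", "B", "B"),
                   gender = c("male", "female", "male", "female"),
                   mean   = c(1, 2, 3, 4),
                   sd     = c(5, 6, 7, 8))

# Long format: one row per (cat1, gender, score) combination
long <- pivot_longer(data, c(mean, sd), names_to = "score")

# Back to wide, but now spreading gender instead of score
wide <- pivot_wider(long, names_from = gender, values_from = value)
wide
```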
It can be done with recast
library(reshape2)
recast(data, id.var = 1:2, cat1 + variable ~ gender)
# cat1 variable female male
#1 A mean 2 1
#2 A sd 6 5
#3 B mean 4 3
#4 B sd 8 7

Count unique elements in columns of dataframe

For the dataframe below, there are 59 columns
circleid name birthday 56 more...
1 1 1
2 2 10
2 5 68
2 1 10
1 1 1
Result I want
circleid distinct_name distinct_birthday 56 more...
1 1 1
2 3 2
quiz <- read.csv("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/circles-removed-na.csv", header = T)
So far
ddply(quiz,~circleid,summarise,number_of_distinct_name=length(unique(name)))
This works for one column; how do I get it for the full dataframe?
columns <- colnames(quiz)
for (i in c(1:58)) {
  final <- ddply(quiz, ~circleid, summarise,
                 number_of_distinct_name = length(unique(columns[i])))
}
with data.table you can run:
library(data.table)
quiz <- fread("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/circles-removed-na.csv", header = T)
unique_vals <- quiz[, lapply(.SD, uniqueN), by = circleid]
You can use dplyr:
result <- quiz %>%
  group_by(circleid) %>%
  summarise_all(n_distinct)
microbenchmark for data.table and dplyr:
microbenchmark(x1 = quiz[, lapply(.SD, function(x) length(unique(x))), by = circleid],
               x2 = quiz %>%
                 group_by(circleid) %>%
                 summarise_all(n_distinct),
               times = 100)
Unit: milliseconds
expr min lq mean median uq max neval cld
x1 150.06392 155.02227 158.75775 156.49328 158.38887 224.22590 100 b
x2 41.07139 41.90953 42.95186 42.54135 43.97387 49.91495 100 a
With package dplyr this is simple. The original answer had length(unique(.)), but @akrun pointed me to n_distinct(.) in a comment.
library(dplyr)
quiz %>%
  group_by(circleid) %>%
  summarise_all(n_distinct)
## A tibble: 2 x 3
#  circleid  name birthday
#     <int> <int>    <int>
#1        1     1        1
#2        2     3        2
Data.
quiz <- read.table(text = "
circleid name birthday
1 1 1
2 2 10
2 5 68
2 1 10
1 1 1
", header = TRUE)

Min and Median of Multiple Columns of a DF by Row in R

Given a dataframe that looks like:
V1 V2 V3
5 8 12
4 9 5
7 3 9
...
How to add columns to the dataframe for min and median of these 3 columns, calculated for each row?
The resulting DF should look like:
V1 V2 V3 Min Median
5 8 12 5 8
4 9 5 4 5
7 3 9 3 7
...
I tried using dplyr::mutate:
mutate(df, Min = min(V1,V2,V3))
but that takes the min of the entire dataframe and puts that value in every row. How can I get the min and median of just each row?
For Mean, I can use rowMeans in mutate, but there are no similar functions for min and median.
Also tried,
lapply(df[1:3], median)
but it just produces the median of each column
df <- read.table(header = TRUE, text = 'V1 V2 V3
5 8 12
4 9 5
7 3 9')
With dplyr, using the function rowwise
library(dplyr)
mutate(rowwise(df), min = min(V1, V2, V3), median = median(c(V1, V2, V3)))
# Using the pipe operator %>%
df %>%
  rowwise() %>%
  mutate(min = min(V1, V2, V3), median = median(c(V1, V2, V3)))
Output:
Source: local data frame [3 x 5]
Groups: <by row>
V1 V2 V3 min median
(int) (int) (int) (int) (int)
1 5 8 12 5 8
2 4 9 5 4 5
3 7 3 9 3 7
You can use apply like this (the 1 means calculate by row, 2 would calculate by column):
the_min <- apply(df, 1, min)
the_median <- apply(df, 1, median)
df$Min <- the_min
df$Median <- the_median
You can do it with dplyr, but you need to group by a unique ID variable so evaluate separately for each row. If, say, V1 is definitely unique, this is pretty easy:
dat %>% group_by(V1) %>% mutate(min = min(V1, V2, V3), median = median(c(V1, V2, V3)))
If you don't have a unique ID, you can make (and delete, if you like) one pretty easily:
dat %>% mutate(id = seq_len(n())) %>% group_by(id) %>%
  mutate(min = min(V1, V2, V3), median = median(c(V1, V2, V3))) %>%
  ungroup() %>% select(-id)
Either way, you get
Source: local data frame [3 x 5]
V1 V2 V3 min median
(int) (int) (int) (int) (int)
1 5 8 12 5 8
2 4 9 5 4 5
3 7 3 9 3 7
data<- data.frame(a=1:3,b=4:6,c=7:9)
data
# a b c
# 1 1 4 7
# 2 2 5 8
# 3 3 6 9
data$Min <- apply(data,1,min)
data
# a b c Min
# 1 1 4 7 1
# 2 2 5 8 2
# 3 3 6 9 3
data$Median <-apply(data[,1:3],1,median)
data
#   a b c Min Median
# 1 1 4 7 1 4
# 2 2 5 8 2 5
# 3 3 6 9 3 6
Hope this helped.
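As an aside to the question's claim that there are no min/median equivalents of rowMeans: the row-wise minimum does have a vectorized base-R counterpart, pmin, which avoids apply entirely; the median has no base equivalent (matrixStats::rowMedians covers it if you need speed). A sketch on the question's data:

```r
df <- data.frame(V1 = c(5, 4, 7), V2 = c(8, 9, 3), V3 = c(12, 5, 9))

# pmin() takes the element-wise minimum across its vector arguments,
# so passing the columns via do.call() gives a row-wise minimum
df$Min <- do.call(pmin, df[c("V1", "V2", "V3")])

# base R has no pmedian(), so fall back to apply() for the median
df$Median <- apply(df[c("V1", "V2", "V3")], 1, median)
df
```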

Frequency of rows by ID

The data set contains three variables: id, sex, and grade (factor).
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4), sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q","q","q","q", "a", "a", "a", NA, "b"))
For each ID, I need to see how many unique grades we have and then create a new column (call it N) to record the grade frequency. For instance, for ID=1, we have five unique values for "grade", so N = 5; for ID=2, we have two unique values for "grade", so N = 2; for ID=4, we have two unique values for "grade" (ignoring NA), so N = 2.
The final data set is
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4), sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q","q","q","q", "a", "a", "a", NA, "b"))
mydata$N <- c(5,5,5,5,5,2,2,2,2,1,1,1,1,2,2,2,2,2)
New answer:
The uniqueN-function of data.table has a na.rm argument, which we can use as follows:
library(data.table)
setDT(mydata)[, n := uniqueN(grade, na.rm = TRUE), by = id]
which gives:
> mydata
id sex grade n
1: 1 1 a 5
2: 1 1 b 5
3: 1 1 c 5
4: 1 1 d 5
5: 1 1 e 5
6: 2 0 x 2
7: 2 0 y 2
8: 2 0 y 2
9: 2 0 x 2
10: 3 0 q 1
11: 3 0 q 1
12: 3 0 q 1
13: 3 0 q 1
14: 4 1 a 2
15: 4 1 a 2
16: 4 1 a 2
17: 4 1 NA 2
18: 4 1 b 2
Old answer:
With data.table you could do this as follows:
library(data.table)
setDT(mydata)[, n := uniqueN(grade[!is.na(grade)]), by = id]
or:
setDT(mydata)[, n := uniqueN(na.omit(grade)), by = id]
You could use the package data.table:
library(data.table)
setDT(mydata)
#I have removed NA's, up to you how to count them
mydata[,N_u:=length(unique(grade[!is.na(grade)])),by=id]
Very short, readable and fast. It can also be done in base-R:
#lapply(split(grade,id),...: splits data into subsets by id
#unlist: creates one vector out of multiple vectors
#rep: makes sure each ID is repeated enough times
mydata$N <- unlist(lapply(split(mydata$grade, mydata$id), function(x) {
  rep(length(unique(x[!is.na(x)])), length(x))
}))
Because there was discussion on what is faster, let's do some benchmarking.
Given dataset:
> test1
Unit: milliseconds
expr min lq mean median uq max neval cld
length_unique 3.043186 3.161732 3.422327 3.286436 3.477854 10.627030 100 b
uniqueN 2.481761 2.615190 2.763192 2.738354 2.872809 3.985393 100 a
Larger dataset: (10000 observations, 1000 id's)
> test2
Unit: milliseconds
expr min lq mean median uq max neval cld
length_unique 11.84123 24.47122 37.09234 30.34923 47.55632 97.63648 100 a
uniqueN 25.83680 50.70009 73.78757 62.33655 97.33934 210.97743 100 b
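The benchmark code itself isn't shown; a sketch of how such a comparison could be set up with microbenchmark (the data here is made up along the lines described above, 1000 ids with 10 rows each, so the numbers will not match exactly):

```r
library(data.table)
library(microbenchmark)

# Hypothetical larger dataset: 1000 ids, 10 rows each, with some NAs
set.seed(1)
dt <- data.table(id    = rep(seq_len(1000), each = 10),
                 grade = sample(c(letters, NA), 10000, replace = TRUE))

# Time the two grouped distinct-count approaches from the answers above
microbenchmark(
  length_unique = dt[, n := length(unique(grade[!is.na(grade)])), by = id],
  uniqueN       = dt[, n := uniqueN(grade, na.rm = TRUE), by = id],
  times = 100
)
```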
A dplyr option that makes use of dplyr::n_distinct and its na.rm-argument:
library(dplyr)
mydata %>% group_by(id) %>% mutate(N = n_distinct(grade, na.rm = TRUE))
#Source: local data frame [18 x 4]
#Groups: id [4]
#
# id sex grade N
# (dbl) (dbl) (fctr) (int)
#1 1 1 a 5
#2 1 1 b 5
#3 1 1 c 5
#4 1 1 d 5
#5 1 1 e 5
#6 2 0 x 2
#7 2 0 y 2
#8 2 0 y 2
#9 2 0 x 2
#10 3 0 q 1
#11 3 0 q 1
#12 3 0 q 1
#13 3 0 q 1
#14 4 1 a 2
#15 4 1 a 2
#16 4 1 a 2
#17 4 1 NA 2
#18 4 1 b 2
Looks like we have several votes for data.table, but you could also use the base R function ave():
mydata$N <- ave(as.character(mydata$grade),mydata$id,
FUN = function(x) length(unique(x[!is.na(x)])))
Use tapply and a lookup table:
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4),
sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q",
"q","q","q", "a", "a", "a", NA, "b"))
uniqN <- tapply(mydata$grade, mydata$id, function(x) sum(!is.na(unique(x))))
mydata$N <- uniqN[mydata$id]
Here is a dplyr method. I kept the summary table separate for tidy reasons.
library(dplyr)
summary =
  mydata %>%
  distinct(id, grade) %>%
  filter(grade %>% is.na %>% `!`) %>%
  count(id)
mydata %>%
  left_join(summary)
