I want to combine about 20 data frames, with different numbers of rows and columns, keeping only the row names they share. Any row that is not present in ALL data frames should be dropped. For example, with two data frames:
Patient1 Patient64 Patient472
ABC 28 38 0
XYZ 92 11 998
WWE 1 10 282
ICQ 0 76 56
SQL 22 1002 778
combine with
Pat_9 Pat_1 Pat_111
ABC 65 44 874
CBA 3 311 998
WWE 2 1110 282
vVv 2 760 56
GHG 12 1200 778
The result would be
Patient1 Patient64 Patient472 Pat_9 Pat_1 Pat_111
ABC 28 38 0 65 44 874
WWE 1 10 282 2 1110 282
I know how to use rbind and cbind but not for the purpose of joining according to shared rownames.
Try this, replacing the list arguments with your own data frames (df1, df2, df3, ..., df20):
l <- lapply(list(df1 , df2 ) , \(x) {x[["id"]] <- rownames(x) ; x})
Reduce(\(x,y) merge(x,y , by = "id") , l)
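As a minimal base R sketch of the same idea without merge: intersect the row names of all data frames first, then subset each one to the shared names and bind the columns. The data frames df1, df2, df3 and their values below are made up for illustration, and the sketch assumes row names are unique within each data frame and that column names do not clash.

```r
# Three small illustrative data frames with partly overlapping row names
df1 <- data.frame(Patient1 = c(28, 92, 1), Patient64 = c(38, 11, 10),
                  row.names = c("ABC", "XYZ", "WWE"))
df2 <- data.frame(Pat_9 = c(65, 3, 2),
                  row.names = c("ABC", "CBA", "WWE"))
df3 <- data.frame(P_7 = c(5, 6),
                  row.names = c("WWE", "ABC"))

dfs <- list(df1, df2, df3)

# Row names present in every data frame
shared <- Reduce(intersect, lapply(dfs, rownames))

# Subset each data frame to the shared rows (in the same order), then bind columns
combined <- do.call(cbind, lapply(dfs, function(d) d[shared, , drop = FALSE]))
combined
```

Because every piece is indexed by the same `shared` vector, the rows line up even when the inputs list them in a different order.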
you can try
merge(d1, d2, by = "row.names")
Row.names Patient1 Patient64 Patient472 Pat_9 Pat_1 Pat_111
1 ABC 28 38 0 65 44 874
2 WWE 1 10 282 2 1110 282
For more than two data frames you can use the tidyverse (d2 is passed twice below only to stand in for a third data frame):
library(tidyverse)
lst(d1, d2, d2) %>%
map(rownames_to_column) %>%
reduce(inner_join, by="rowname")
You can first move the row names into a column with rownames_to_column, join with inner_join, and finally convert the column back to row names with column_to_rownames, like this:
df1 <- read.table(text=" Patient1 Patient64 Patient472
ABC 28 38 0
XYZ 92 11 998
WWE 1 10 282
ICQ 0 76 56
SQL 22 1002 778", header = TRUE)
df2 <- read.table(text = " Pat_9 Pat_1 Pat_111
ABC 65 44 874
CBA 3 311 998
WWE 2 1110 282
vVv 2 760 56
GHG 12 1200 778", header = TRUE)
library(dplyr)
library(tibble)
df1 %>%
rownames_to_column() %>%
inner_join(df2 %>% rownames_to_column(), by = "rowname") %>%
column_to_rownames()
#> Patient1 Patient64 Patient472 Pat_9 Pat_1 Pat_111
#> ABC 28 38 0 65 44 874
#> WWE 1 10 282 2 1110 282
Created on 2022-07-20 by the reprex package (v2.0.1)
Option with list of dataframes:
dfs_list <- list(df1, df2)
transform(Reduce(merge, lapply(dfs_list, function(x) data.frame(x, rn = row.names(x)))), row.names=rn, rn=NULL)
#> Patient1 Patient64 Patient472 Pat_9 Pat_1 Pat_111
#> ABC 28 38 0 65 44 874
#> WWE 1 10 282 2 1110 282
I have a data frame like this:
ID <- c("A01","B20","C3","D4")
Nb_data <- c(2,2,2,3)
Weight_t1 <- c(70,44,98,65)
Weight_t2 <- c(75,78,105,68)
Weight_t3 <- c(72,52,107,NA)
year1 <- c(20,28,32,50)
year2 <- c(28,32,35,60)
year3 <- c(29,35,38,NA)
LENGTHt1 <- c(175,155,198,165)
LENGTHt2 <- c(175,155,198,163)
LENGTHt3 <- c(176,154,198,NA)
df <- data.frame(ID,Nb_data,Weight_t1,Weight_t2,Weight_t3,year1,year2,year3,LENGTHt1,LENGTHt2,LENGTHt3)
In the real data, the weight, year and length columns run from t1 to t28.
I want to tidy my data like this:
ID  Nb_data Weight Year Length
A01 3       70     20   175
A01 3       75     28   175
A01 3       72     29   176
B20 3       44     28   155
B20 3       78     32   155
B20 3       52     35   154
I tried
df1 <- df %>%
pivot_longer(cols = -c('ID','Nb_data'),
names_to = c('Weight','Year','Length' ),
names_pattern = '(Weight_t[0-9]*|year[0-9]*|LENGTHt[0-9]*)' ,
values_drop_na = TRUE)
or with names_pattern = '(.t[0-9])(.t[0-9])(.t[0-9])'
I am having trouble getting the regex right, or maybe pivot_longer is not suitable here...
You need to extract the common timepoint information from the variable names. Make this information consistent first, with a clear separator (_ in this case), then it becomes much easier.
I would do something like this
library(tidyr)
library(dplyr)
df1 <- df
names(df1) <- gsub("year", "Year_t", names(df1))
names(df1) <- gsub("LENGTH", "Length_", names(df1))
df1 %>%
pivot_longer(cols = -c('ID','Nb_data'),
names_to = c("name", "timepoint"),
names_sep = "_",
values_drop_na = TRUE) %>%
pivot_wider(names_from = name, values_from = value)
EDIT: or shorter, using ".value" in the names_to argument (as @onyambu showed in his answer):
df1 %>%
pivot_longer(cols = -c('ID','Nb_data'),
names_to = c(".value", "timepoint"),
names_sep = "_",
values_drop_na = TRUE)
Output:
ID Nb_data timepoint Weight Year Length
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 A01 2 t1 70 20 175
2 A01 2 t2 75 28 175
3 A01 2 t3 72 29 176
4 B20 2 t1 44 28 155
5 B20 2 t2 78 32 155
6 B20 2 t3 52 35 154
7 C3 2 t1 98 32 198
8 C3 2 t2 105 35 198
9 C3 2 t3 107 38 198
10 D4 3 t1 65 50 165
11 D4 3 t2 68 60 163
You could use pivot_longer directly, though it needs a slightly more complex regex:
df %>%
pivot_longer(matches("\\d+$"), names_to = c(".value", "grp"),
names_pattern = "(.*?)[_t]{0,2}(\\d+$)",
values_drop_na = TRUE)
# A tibble: 11 × 6
ID Nb_data grp Weight year LENGTH
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 A01 2 1 70 20 175
2 A01 2 2 75 28 175
3 A01 2 3 72 29 176
4 B20 2 1 44 28 155
5 B20 2 2 78 32 155
6 B20 2 3 52 35 154
7 C3 2 1 98 32 198
8 C3 2 2 105 35 198
9 C3 2 3 107 38 198
10 D4 3 1 65 50 165
11 D4 3 2 68 60 163
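To see what that names_pattern captures, here is a small base R check, offered as a sketch rather than a description of tidyr internals. It applies the same regex to three of the column names with sub() and perl = TRUE; the variable names nms, stem and idx are only for illustration.

```r
# Three of the column names from the question
nms <- c("Weight_t1", "year2", "LENGTHt3")

# Same pattern as in the pivot_longer call above:
#   group 1 = variable stem (lazy match), optional "_" and/or "t",
#   group 2 = trailing digits
pattern <- "(.*?)[_t]{0,2}(\\d+$)"

stem <- sub(pattern, "\\1", nms, perl = TRUE)  # part that becomes the ".value" column name
idx  <- sub(pattern, "\\2", nms, perl = TRUE)  # part that becomes the "grp" value

data.frame(name = nms, stem = stem, grp = idx)
```

The lazy `(.*?)` is what keeps the trailing "_t" or "t" out of the stem; a greedy `(.*)` would swallow it.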
I have a problem with the humans here; they're giving me Citizen Science data in spreadsheets formatted to be attractive and legible. I figured out the right sequence of pivots _longer and _wider to get it into an analyzable format but first I had to do a whole bunch of hand edits to make the column labels usable. I've just been given a corrected spreadsheet so now I have to do the same hand edits all over. Can I avoid this?
reprex <- read_csv("reprex.csv", col_names = FALSE)
gives:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA NA 2014 NA NA 2015 NA NA 2016 NA
2 NA Total F M Total F M Total F M
3 SiteA 180 92 88 134 40 94 34 20 14
4 SiteB NA NA NA 247 143 104 8 8 0
5 SiteC 237 194 43 220 95 125 62 45 17
I want column labels like "2014 Total", "2014 F", ... like so:
Location `2014 Total` `2014 F` `2014 M` `2015 Total` `2015 F` `2015 M` `2016 Total` `2016 F` `2016 M`
1 SiteA 180 92 88 134 40 94 34 20 14
2 SiteB NA NA NA 247 143 104 8 8 0
3 SiteC 237 194 43 220 95 125 62 45 17
...which would allow me to twist it up until I get to something like:
Location date Total F M
1 SiteA 2014 180 92 88
2 SiteB 2014 NA NA NA
3 SiteC 2014 237 194 43
4 SiteA 2015 134 40 94
5 SiteB 2015 247 143 104
6 SiteC 2015 220 95 125
7 SiteA 2016 34 20 14
8 SiteB 2016 8 8 0
9 SiteC 2016 62 45 17
The part from the second table to the third I've got; the problem is in how to get from the first table to the second. It would seem like you could pivot the first and then fill in the missing dates with fill(.direction="updown") except that the dates are the grouping value you need to be following.
For this example we could do it like this (df is the data frame read with read_csv in the question, where it is called reprex):
library(tidyverse)
df_helper <- df %>%
slice(1:2) %>%
pivot_longer(cols= everything()) %>%
fill(value, .direction = "up") %>%
mutate(x = lead(value, 11)) %>%
drop_na() %>%
unite("name", c(value, x), sep = " ", remove = FALSE) %>%
pivot_wider(names_from = name)
df %>%
setNames(names(df_helper)) %>%
rename(Location = x) %>%
slice(-c(1:2))
Location 2014 Total 2014 F 2014 M 2015 Total 2015 F 2015 M 2016 Total 2016 F 2016 M
3 SiteA 180 92 88 134 40 94 34 20 14
4 SiteB <NA> <NA> <NA> 247 143 104 8 8 0
5 SiteC 237 194 43 220 95 125 62 45 17
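A base R sketch of the same header repair, under the assumption that every year block starts with a "Total" column: count the "Total" labels cumulatively to get a block index, then look up the year for each block. The object names raw, block and new_names are hypothetical, and raw is a cut-down copy of the example (two years, two sites) with all columns read as character.

```r
# Cut-down version of the spreadsheet as read with col_names = FALSE
raw <- data.frame(
  X1 = c(NA, NA, "SiteA", "SiteB"),
  X2 = c(NA, "Total", "180", NA),
  X3 = c("2014", "F", "92", NA),
  X4 = c(NA, "M", "88", NA),
  X5 = c(NA, "Total", "134", "247"),
  X6 = c("2015", "F", "40", "143"),
  X7 = c(NA, "M", "94", "104"),
  stringsAsFactors = FALSE
)

# Years in header row 1 (dropping the blanks) and labels in header row 2
years  <- unlist(raw[1, -1], use.names = FALSE)
years  <- years[!is.na(years)]
labels <- unlist(raw[2, -1], use.names = FALSE)

# Assumption: each year block begins with a "Total" column
block     <- cumsum(labels == "Total")
new_names <- c("Location", paste(years[block], labels))

# Drop the two header rows, apply the names, convert counts to numeric
tidy <- raw[-(1:2), ]
names(tidy) <- new_names
tidy[-1] <- lapply(tidy[-1], as.numeric)
rownames(tidy) <- NULL
tidy
```

This does not depend on which cell of the block holds the year, only on the order of the years and on "Total" marking the start of each block.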
I have a dataframe with columns that have 'x1' and 'x1_fit' with the numbers going up to 5 in some cases.
date <- seq(as.Date('2019-11-04'), by = "days", length.out = 7)
x1 <- c(100,120,111,152,110,112,111)
x1_fit <- c(150,142,146,148,123,120,145)
x2 <- c(110,130,151,152,150,142,161)
x2_fit <- c(170,172,176,178,173,170,175)
df <- data.frame(date,x1,x1_fit,x2,x2_fit)
How can I compute x1_fit - x1, and so on for every pair? The number of x's will change every time.
You can select those columns with regular expressions (assuming the columns are in matching order):
> df[, grep('^x\\d+_fit$', colnames(df))] - df[, grep('^x\\d+$', colnames(df))]
x1_fit x2_fit
1 50 60
2 22 42
3 35 25
4 -4 26
5 13 23
6 8 28
7 34 14
If you want to assign the differences to the original df:
df[, paste0(grep('^x\\d+$', colnames(df), value = TRUE), '_diff')] <-
df[, grep('^x\\d+_fit$', colnames(df))] - df[, grep('^x\\d+$', colnames(df))]
# > df
# date x1 x1_fit x2 x2_fit x1_diff x2_diff
# 1 2019-11-04 100 150 110 170 50 60
# 2 2019-11-05 120 142 130 172 22 42
# 3 2019-11-06 111 146 151 176 35 25
# 4 2019-11-07 152 148 152 178 -4 26
# 5 2019-11-08 110 123 150 173 13 23
# 6 2019-11-09 112 120 142 170 8 28
# 7 2019-11-10 111 145 161 175 34 14
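If the column order cannot be relied on, one possible variant is to build the "_fit" names from the base names, so each pair is matched by name rather than by position. This is a sketch on a deliberately shuffled two-row data frame; the name fits and its values are illustrative.

```r
# Columns deliberately out of order
fits <- data.frame(
  x2_fit = c(170, 172),
  x1     = c(100, 120),
  x1_fit = c(150, 142),
  x2     = c(110, 130)
)

# Base columns (x1, x2, ...) and the fitted column expected for each
base <- grep("^x\\d+$", names(fits), value = TRUE)
fit  <- paste0(base, "_fit")
stopifnot(all(fit %in% names(fits)))  # fail early if a fitted column is missing

# Both sides are indexed by name, so position in the data frame does not matter
fits[paste0(base, "_diff")] <- fits[fit] - fits[base]
fits
```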
The solution from @mt1022 is straightforward. However, since you have tagged this as dplyr, here is one approach along those lines: convert the data to long format, subtract the corresponding values, and bring the data back to wide format.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -date) %>%
mutate(name = sub('_.*', '', name)) %>%
group_by(date, name) %>%
summarise(diff = diff(value)) %>%
pivot_wider(names_from = name, values_from = diff) %>%
rename_at(-1, ~paste0(., "_diff")) %>%
left_join(df, by = "date")
# date x1_diff x2_diff x1 x1_fit x2 x2_fit
# <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2019-11-04 50 60 100 150 110 170
#2 2019-11-05 22 42 120 142 130 172
#3 2019-11-06 35 25 111 146 151 176
#4 2019-11-07 -4 26 152 148 152 178
#5 2019-11-08 13 23 110 123 150 173
#6 2019-11-09 8 28 112 120 142 170
#7 2019-11-10 34 14 111 145 161 175
In base R, you could loop over the unique column-name stems and subtract each column from its fitted counterpart using
lapply(setNames(nm = unique(gsub("_.*", "", names(df)))), function(nm) {
fit <- paste0(nm, "_fit")
df[, fit] - df[, nm]
})
# $x1
# [1] 50 22 35 -4 13 8 34
#
# $x2
# [1] 60 42 25 26 23 28 14
Here, I set the Date column as the row names and removed the column using
df <- data.frame(date,x1,x1_fit,x2,x2_fit)
row.names(df) <- df$date
df$date <- NULL
but you could just as well loop over the column names excluding the date column.
We can also do this with a split in base R:
out <- sapply(split.default(df[-1], sub("_.*", "", names(df)[-1])),
function(x) x[, 2] - x[, 1])
df[paste0(colnames(out), "_diff")] <- out
df
# date x1 x1_fit x2 x2_fit x1_diff x2_diff
#1 2019-11-04 100 150 110 170 50 60
#2 2019-11-05 120 142 130 172 22 42
#3 2019-11-06 111 146 151 176 35 25
#4 2019-11-07 152 148 152 178 -4 26
#5 2019-11-08 110 123 150 173 13 23
#6 2019-11-09 112 120 142 170 8 28
#7 2019-11-10 111 145 161 175 34 14
I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return a maximum value of the batting average ('average') column where the at-bats ('AB') are greater than 100. There are also 'NaN' in the average column.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
AB = sample(seq(50, 150, 10), 10),
avg = c(runif(9), NaN)
)
data %>%
filter(AB > 100) %>%
filter(avg == max(avg, na.rm = TRUE))
The first filter keeps only the rows where AB is greater than 100, and the second filter selects the entire row holding the maximum. If you only want the maximum value itself, you can do something like this:
data %>%
filter(AB > 100) %>%
summarise(max = max(avg, na.rm = TRUE))
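For comparison, a base R sketch of the same two steps. The data frame batting and its values are made up for illustration; the column names AB and average follow the question. which.max() discards NA and NaN values, so no extra handling is needed for the NaN averages.

```r
# Small illustrative data set; one average is NaN, one player has too few at-bats
batting <- data.frame(
  playerID = c("a", "b", "c", "d"),
  AB       = c(716, 50, 704, 120),
  average  = c(0.296, 0.400, 0.372, NaN)
)

# Keep only players with more than 100 at-bats
eligible <- batting[batting$AB > 100, ]

# Entire row with the highest average (NaN is ignored by which.max)
best <- eligible[which.max(eligible$average), ]

# Just the maximum value
top_avg <- max(eligible$average, na.rm = TRUE)

best
top_avg
```

Player "b" has the highest raw average but is excluded by the at-bats filter, which is the behaviour the question asks for.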
I'm working in R and trying to add several columns to a data frame, each holding the running sum of the preceding columns. Imagine I have a data frame like this:
df=
sep-2016 oct-2016 nov-2016 dec-2016 jan-2017
1 70 153 NA 28 19
2 57 68 73 118 16
3 29 NA 19 32 36
4 177 36 3 54 53
and I want to append, for each month, the row-wise sum of that month and all the months before it. So for October you get the sum of sep and oct, for November the sum of sep, oct and nov, and you end up with something like this:
df=
sep-2016 oct-2016 nov-2016 dec-2016 jan-2017 status-oct-2016 status-nov-2016
1 70 153 NA 28 19 223 223
2 57 68 73 118 16 125 198
3 29 NA 19 32 36 29 48
4 177 36 3 54 53 213 216
I want to know an efficient way to do this instead of writing lots of rowSums() lines, and if the column label for each month could be generated in the same loop, that would be amazing!
Thanks!
We can use lapply to loop through the columns to apply the rowSums.
dat2 <- as.data.frame(lapply(2:ncol(dat), function(i){
rowSums(dat[, 1:i], na.rm = TRUE)
}))
names(dat2) <- paste0("status-", names(dat[, -1]))
dat3 <- cbind(dat, dat2)
dat3
# sep-2016 oct-2016 nov-2016 dec-2016 jan-2017 status-oct-2016 status-nov-2016 status-dec-2016 status-jan-2017
# 1 70 153 NA 28 19 223 223 251 270
# 2 57 68 73 118 16 125 198 316 332
# 3 29 NA 19 32 36 29 48 80 116
# 4 177 36 3 54 53 213 216 270 323
DATA
dat <- read.table(text = " 'sep-2016' 'oct-2016' 'nov-2016' 'dec-2016' 'jan-2017'
1 70 153 NA 28 19
2 57 68 73 118 16
3 29 NA 19 32 36
4 177 36 3 54 53",
header = TRUE, stringsAsFactors = FALSE)
names(dat) <- c("sep-2016", "oct-2016", "nov-2016", "dec-2016", "jan-2017")
Honestly I have no idea why you would want your data in this format, but here is a tidyverse method of accomplishing it. It involves transforming the data to a tidy format before spreading it back out into your wide format. The key thing to note is that in a tidy format, where month is a variable in a single column instead of spread across multiple columns, you can simply use group_by(rowid) and cumsum to calculate all the values you want. The last few lines are constructing the status- column names and spreading the data back out into a wide format.
library(tidyverse)
df <- read_table2(
"sep-2016 oct-2016 nov-2016 dec-2016 jan-2017
70 153 NA 28 19
57 68 73 118 16
29 NA 19 32 36
177 36 3 54 53"
)
df %>%
rowid_to_column() %>%
gather("month", "value", -rowid) %>%
arrange(rowid) %>%
group_by(rowid) %>%
mutate(
value = replace_na(value, 0),
status = cumsum(value)
) %>%
gather("vartype", "number", value, status) %>%
mutate(colname = ifelse(vartype == "value", month, str_c("status-", month))) %>%
select(rowid, number, colname) %>%
spread(colname, number)
#> # A tibble: 4 x 11
#> # Groups: rowid [4]
#> rowid `dec-2016` `jan-2017` `nov-2016` `oct-2016` `sep-2016`
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 28.0 19.0 0 153 70.0
#> 2 2 118 16.0 73.0 68.0 57.0
#> 3 3 32.0 36.0 19.0 0 29.0
#> 4 4 54.0 53.0 3.00 36.0 177
#> # ... with 5 more variables: `status-dec-2016` <dbl>,
#> # `status-jan-2017` <dbl>, `status-nov-2016` <dbl>,
#> # `status-oct-2016` <dbl>, `status-sep-2016` <dbl>
Created on 2018-02-16 by the reprex package (v0.2.0).
A clean way to do it is to convert your data to a long format.
library(tibble)
library(tidyr)
library(dplyr)
your_data <- tribble(~"sep_2016", ~"oct_2016", ~"nov_2016", ~"dec_2016", ~"jan_2017",
70, 153, NA, 28, 19,
57, 68, 73, 118, 16,
29, NA, 19, 32, 36,
177, 36, 3, 54, 53)
You can change the format of your data.frame with gather from the tidyr package.
your_data_long <- your_data %>%
rowid_to_column() %>%
gather(key = month_year, value = the_value, -rowid)
head(your_data_long)
#> # A tibble: 6 x 3
#> rowid month_year the_value
#> <int> <chr> <dbl>
#> 1 1 sep_2016 70
#> 2 2 sep_2016 57
#> 3 3 sep_2016 29
#> 4 4 sep_2016 177
#> 5 1 oct_2016 153
#> 6 2 oct_2016 68
Once your data.frame is in a long format, you can compute the cumulative sum with cumsum and the dplyr functions mutate and group_by.
result <- your_data_long %>%
group_by(rowid) %>%
mutate(cumulative_value = cumsum(the_value))
result
#> # A tibble: 20 x 4
#> # Groups: rowid [4]
#> rowid month_year the_value cumulative_value
#> <int> <chr> <dbl> <dbl>
#> 1 1 sep_2016 70 70
#> 2 2 sep_2016 57 57
#> 3 3 sep_2016 29 29
#> 4 4 sep_2016 177 177
#> 5 1 oct_2016 153 223
#> 6 2 oct_2016 68 125
#> 7 3 oct_2016 NA NA
#> 8 4 oct_2016 36 213
#> 9 1 nov_2016 NA NA
#> 10 2 nov_2016 73 198
#> 11 3 nov_2016 19 NA
#> 12 4 nov_2016 3 216
#> 13 1 dec_2016 28 NA
#> 14 2 dec_2016 118 316
#> 15 3 dec_2016 32 NA
#> 16 4 dec_2016 54 270
#> 17 1 jan_2017 19 NA
#> 18 2 jan_2017 16 332
#> 19 3 jan_2017 36 NA
#> 20 4 jan_2017 53 323
If you want to retrieve the starting form, you can do it with spread.
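Note that cumsum() propagates NA, which is why cumulative_value stays NA after the first missing month in the output above. If missing months should count as zero, as in the question's desired result, one option is to replace the NAs before summing. A minimal sketch on the first row of the example data (the names x, with_na and as_zero are illustrative):

```r
# First row of the example data: sep, oct, nov, dec, jan
x <- c(70, 153, NA, 28, 19)

# Plain cumsum: everything from the first NA onwards becomes NA
with_na <- cumsum(x)

# Treat missing values as zero before accumulating
as_zero <- cumsum(replace(x, is.na(x), 0))

with_na
as_zero
```

In the dplyr pipeline the same replacement would go inside the mutate() call, before cumsum is applied.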
My preferred solution would be:
# library(matrixStats)
DF <- as.matrix(df)
DF[is.na(DF)] <- 0
RES <- matrixStats::rowCumsums(DF)
colnames(RES) <- paste0("status-", colnames(DF))
cbind.data.frame(df, RES)
This is closest to what you are looking for with the rowSums.
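If installing matrixStats is not an option, a similar result can be sketched in base R with apply() and cumsum(). This is an alternative under the same assumption that NA counts as zero, not the matrixStats implementation; dat is rebuilt here from the question's data so the block runs on its own.

```r
# Example data from the question (check.names = FALSE keeps the hyphenated names)
dat <- data.frame(
  "sep-2016" = c(70, 57, 29, 177),
  "oct-2016" = c(153, 68, NA, 36),
  "nov-2016" = c(NA, 73, 19, 3),
  "dec-2016" = c(28, 118, 32, 54),
  "jan-2017" = c(19, 16, 36, 53),
  check.names = FALSE
)

m <- as.matrix(dat)
m[is.na(m)] <- 0                      # treat missing months as zero

# apply() over rows returns one column per row, so transpose back
res <- t(apply(m, 1, cumsum))
colnames(res) <- paste0("status-", colnames(m))

# Drop the first status column (it just repeats sep-2016) and bind to the data
out <- cbind(dat, res[, -1])
out
```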
One option could be using spread and gather function from tidyverse.
Note: The status column has been added even for the 1st month. And the status columns are not in order but values are correct.
The approach is:
# Data
df <- read.table(text = "sep-2016 oct-2016 nov-2016 dec-2016 jan-2017
70 153 NA 28 19
57 68 73 118 16
29 NA 19 32 36
177 36 3 54 53", header = T, stringsAsFactors = F)
library(tidyverse)
# Add a row number as sl
df <- df %>% mutate(sl = row_number())
#Calculate the cumulative sum after gathering and arranging by date
mod_df <- df %>%
gather(key, value, -sl) %>%
mutate(key = as.Date(paste("01",key, sep="."), format="%d.%b.%Y")) %>%
arrange(sl, key) %>%
group_by(sl) %>%
mutate(status = cumsum(ifelse(is.na(value),0L,value) )) %>%
select(-value) %>%
mutate(key = paste("status", format(key, "%b.%Y"))) %>%
spread(key, status)
# Finally join cumulative calculated sum columns with original df and then
# remove sl column
inner_join(df, mod_df, by = "sl") %>% select(-sl)
# sep.2016 oct.2016 nov.2016 dec.2016 jan.2017 status Dec.2016 status Jan.2017 status Nov.2016 status Oct.2016 status Sep.2016
#1 70 153 NA 28 19 251 270 223 223 70
#2 57 68 73 118 16 316 332 198 125 57
#3 29 NA 19 32 36 80 116 48 29 29
#4 177 36 3 54 53 270 323 216 213 177
Another base solution where we build a matrix accumulating the row sums :
status <- setNames(
as.data.frame(t(apply(dat,1,function(x) Reduce(sum,'[<-'(x,is.na(x),0),accumulate = TRUE)))),
paste0("status-",names(dat)))
status
# status-sep-2016 status-oct-2016 status-nov-2016 status-dec-2016 status-jan-2017
# 1 70 223 223 251 270
# 2 57 125 198 316 332
# 3 29 29 48 80 116
# 4 177 213 216 270 323
Then bind it to your original data if needed :
cbind(dat,status[-1])