How to combine rows in two dataframes into one dataframe in R?

This is what I want to do:
Combine
df1
Col1 Col2 Col3
7 1 8
6 2 9
3 6 3
and
df2
Col1 Col2 Col3
4 6 3
5 7 8
9 1 2
into this:
df3
Col1 Col2 Col3
7 1 8
6 2 9
3 6 3
4 6 3
5 7 8
9 1 2
And I'm sorry if someone has asked this already.
Thanks!

You can just do:
df3 <- rbind(df1, df2)
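A minimal self-contained sketch of the call above, assuming df1 and df2 are rebuilt from the tables in the question. For data frames, rbind() matches columns by name, so both inputs need the same column names.

```r
# Rebuild the two inputs from the question (values copied from its tables).
df1 <- data.frame(Col1 = c(7, 6, 3), Col2 = c(1, 2, 6), Col3 = c(8, 9, 3))
df2 <- data.frame(Col1 = c(4, 5, 9), Col2 = c(6, 7, 1), Col3 = c(3, 8, 2))

# Stack the rows of df2 underneath the rows of df1.
df3 <- rbind(df1, df2)
df3
```

The result has six rows, with the rows of df1 first and the rows of df2 after them.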

Related

Remove rows with all NA values after group_by in R

I want to remove rows that have all NAs after using group_by. Here is a sample dataset:
df <- data.frame(Col1 = c("B", "B", "C", "D", "P1", "P2", "P3"),
                 Col2 = c(NA, 8, NA, 9, 10, 8, 9),
                 Col3 = c(NA, 7, 6, 8, NA, 7, 8),
                 Col4 = c(NA, NA, 7, 7, NA, 7, 7))
I want to group by Col1 and remove rows where all the other column values are NA.
So the desired output is:
Col1 Col2 Col3 Col4
B 8 7 NA
C NA 6 7
D 9 8 7
P1 10 NA NA
P2 8 7 7
P3 9 8 7
Any help would be really appreciated.
You don't need group_by, you can use if_any.
library(dplyr)
filter(df, if_any(-Col1, ~ !is.na(.)))
# Col1 Col2 Col3 Col4
# 1 B 8 7 NA
# 2 C NA 6 7
# 3 D 9 8 7
# 4 P1 10 NA NA
# 5 P2 8 7 7
# 6 P3 9 8 7
There is no need for group_by; keep only the rows that have at least one non-NA value, excluding Col1:
df[ rowSums(!is.na(df[, -1])) > 0, ]
# Col1 Col2 Col3 Col4
# 2 B 8 7 NA
# 3 C NA 6 7
# 4 D 9 8 7
# 5 P1 10 NA NA
# 6 P2 8 7 7
# 7 P3 9 8 7
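The same base R idea can be sketched with the value columns picked by name instead of by position, which may be safer if Col1 is not the first column. The names value_cols and keep are introduced here only for illustration.

```r
# Sample data from the question.
df <- data.frame(Col1 = c("B", "B", "C", "D", "P1", "P2", "P3"),
                 Col2 = c(NA, 8, NA, 9, 10, 8, 9),
                 Col3 = c(NA, 7, 6, 8, NA, 7, 8),
                 Col4 = c(NA, NA, 7, 7, NA, 7, 7))

# Every column except the key column (illustrative helper name).
value_cols <- setdiff(names(df), "Col1")

# TRUE for rows with at least one non-NA value among the value columns.
keep <- rowSums(!is.na(df[value_cols])) > 0

result <- df[keep, ]
result
```

Only the first row (B with NA in every value column) is dropped; the original row names 2 to 7 are kept.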
With the help of r2evans's answer, I was able to remove the all-NA rows while still grouping with group_by. Thanks to everyone who answered.
library(dplyr)
df %>% group_by(Col1) %>% filter(if_any(everything(), ~ !is.na(.)))
Col1 Col2 Col3 Col4
B 8 7 NA
C NA 6 7
D 9 8 7
P1 10 NA NA
P2 8 7 7
P3 9 8 7

R way to select unique columns from column name?

My issue is that I have a big database of 283 columns, some of which have the same name (for example, "uncultured").
Is there a way to select columns avoiding those with repeated names? Those (bacteria) normally have a very small abundance, so I don't really care for their contribution, I'd just like to take the columns with unique names.
My database is something like
Samples col1 col2 col3 col4 col2 col1....
S1
S2
S3
...
and I'd like to select every column but the second col2 and col1.
Thanks!
Something like this should work:
df[, !duplicated(colnames(df))]
This way you automatically keep the first column for each unique name:
df[unique(colnames(df))]
#> col1 col2 col3 col4 S1 S2 S3
#> 1 1 2 3 4 7 8 9
#> 2 1 2 3 4 7 8 9
#> 3 1 2 3 4 7 8 9
#> 4 1 2 3 4 7 8 9
#> 5 1 2 3 4 7 8 9
Reproducible example
df is defined as:
df <- as.data.frame(matrix(rep(1:9, 5), ncol = 9, byrow = TRUE))
colnames(df) <- c("col1", "col2", "col3", "col4", "col2", "col1", "S1", "S2", "S3")
df
#> col1 col2 col3 col4 col2 col1 S1 S2 S3
#> 1 1 2 3 4 5 6 7 8 9
#> 2 1 2 3 4 5 6 7 8 9
#> 3 1 2 3 4 5 6 7 8 9
#> 4 1 2 3 4 5 6 7 8 9
#> 5 1 2 3 4 5 6 7 8 9
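As a quick check on the reproducible example, the sketch below (an illustration, not part of either answer) confirms that the two approaches keep the same columns. The names first_by_flag and first_by_name are hypothetical.

```r
# Same example data as above: col2 and col1 each appear twice.
df <- as.data.frame(matrix(rep(1:9, 5), ncol = 9, byrow = TRUE))
colnames(df) <- c("col1", "col2", "col3", "col4", "col2", "col1", "S1", "S2", "S3")

# duplicated() flags the second and later occurrences of each name.
duplicated(colnames(df))

# Approach 1: drop the flagged columns.
first_by_flag <- df[, !duplicated(colnames(df))]

# Approach 2: select by name; name indexing picks the first match.
first_by_name <- df[unique(colnames(df))]

# Both keep the same seven columns with the same values.
identical(names(first_by_flag), names(first_by_name))
all(first_by_flag == first_by_name)
```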

R Standard deviation across columns and rows by id

I have several data frames that look similar to the following data frame (with many more columns):
id col1 col2 col3 col4 col5
1 4 3 5 4 A
1 3 5 4 9 Z
1 5 8 3 4 H
2 6 9 2 1 B
2 4 9 5 4 K
3 2 1 7 5 J
3 5 8 4 3 B
3 6 4 3 9 C
I want to calculate the standard deviation across specific columns (let's say col2 to col4) grouped by the id. I do not know the column index in every data frame. I only know the names for the columns I want to calculate the standard deviation for.
Is there a way I could do that easily? My original data frames contain around 20 columns and I only want the standard deviation for 10 columns with specific column names grouped by the id.
On top of that, it would be nice if I could directly add the calculated standard deviations to my data frame as a new column according to the id, looking like this:
id col1 col2 col3 col4 col5 SD
1 4 3 5 4 A SD1
1 3 5 4 9 Z SD1
1 5 8 3 4 H SD1
2 6 9 2 1 B SD2
2 4 9 5 4 K SD2
3 2 1 7 5 J SD3
3 5 8 4 3 B SD3
3 6 4 3 9 C SD3
You can try:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(SD = sd(unlist(select(cur_data(), col2:col4))))
# id col1 col2 col3 col4 col5 SD
# <int> <int> <int> <int> <int> <chr> <dbl>
#1 1 4 3 5 4 A 2.12
#2 1 3 5 4 9 Z 2.12
#3 1 5 8 3 4 H 2.12
#4 2 6 9 2 1 B 3.41
#5 2 4 9 5 4 K 3.41
#6 3 2 1 7 5 J 2.62
#7 3 5 8 4 3 B 2.62
#8 3 6 4 3 9 C 2.62
Using data.table
library(data.table)
setDT(df)[, SD := sd(unlist(.SD)), id, .SDcols = col2:col4]
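Since the question says only the column names are known, here is a base R sketch (an alternative I am adding, not taken from either answer) that passes the names as a character vector. The names sd_cols and group_sd are introduced for illustration.

```r
# Example data from the question.
df <- data.frame(id   = c(1, 1, 1, 2, 2, 3, 3, 3),
                 col1 = c(4, 3, 5, 6, 4, 2, 5, 6),
                 col2 = c(3, 5, 8, 9, 9, 1, 8, 4),
                 col3 = c(5, 4, 3, 2, 5, 7, 4, 3),
                 col4 = c(4, 9, 4, 1, 4, 5, 3, 9),
                 col5 = c("A", "Z", "H", "B", "K", "J", "B", "C"))

# Columns to pool, given by name (illustrative helper name).
sd_cols <- c("col2", "col3", "col4")

# One standard deviation per id, over all values in the chosen columns.
group_sd <- sapply(split(df[sd_cols], df$id), function(g) sd(unlist(g)))

# Map each row's id back to its group value.
df$SD <- unname(group_sd[as.character(df$id)])
round(df$SD, 2)
# 2.12 2.12 2.12 3.41 3.41 2.62 2.62 2.62
```

The rounded values match the SD column shown in the dplyr output above.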

I need column sums: sum12 is the sum of columns 1 and 2, and sum34 is the sum of columns 3 and 4

col1 col2 col3 col4
1 4 1 4
2 4 2 5
4 5 3 6
5 6 5 7
I need column sums like this:
col1 col2 col3 col4 sum12 sum34
1 4 1 4 5 5
2 4 2 5 6 7
4 5 3 6 9 9
5 6 5 7 11 12
We can use transform
transform(df, sum12 = col1 + col2, sum34 = col3 + col4)
Or another option is
df[c("sum12", "sum34")] <- df[c(1,3)] + df[c(2,4)]
df
# col1 col2 col3 col4 sum12 sum34
#1 1 4 1 4 5 5
#2 2 4 2 5 6 7
#3 4 5 3 6 9 9
#4 5 6 5 7 11 12
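A further sketch, assuming the columns are better addressed by name than by position: rowSums() over a named subset gives the same totals and extends to more than two columns per sum. This variant is my addition, not part of the answer above.

```r
# Example data from the question.
df <- data.frame(col1 = c(1, 2, 4, 5),
                 col2 = c(4, 4, 5, 6),
                 col3 = c(1, 2, 3, 5),
                 col4 = c(4, 5, 6, 7))

# Row-wise totals over each named group of columns.
df$sum12 <- rowSums(df[c("col1", "col2")])
df$sum34 <- rowSums(df[c("col3", "col4")])
df
```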

Use a column in a data.frame to select another

How do I use one variable in a data.frame to refer to another?
say I have:
col col1 col2
"col1" 1 5
"col2" 2 6
"col1" 3 7
"col2" 4 8
and I want:
col col1 col2 answer
"col1" 1 5 1
"col2" 2 6 6
"col1" 3 7 3
"col2" 4 8 8
df$answer = df[,df$col]
isn't working, and a for loop is taking forever.
I know it's already answered, but I thought another approach might be useful:
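df <- read.table(text = 'col col1 col2
"col1" 1 5
"col2" 2 6
"col1" 3 7
"col2" 4 8', header = TRUE)
# Index one cell per row with a two-column (row, column) matrix.
# Matrix-indexing a mixed-type data frame returns character, hence as.integer().
df$answer <- as.integer(df[cbind(seq_len(nrow(df)), match(df$col, names(df)))])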
df
# col col1 col2 answer
# 1 col1 1 5 1
# 2 col2 2 6 6
# 3 col1 3 7 3
# 4 col2 4 8 8
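A variant of the matrix-indexing idea, sketched under the assumption that the candidate columns are all numeric: converting only those columns to a matrix keeps the values numeric, so no as.integer() round trip is needed. The names value_cols and m are introduced here for illustration.

```r
# Example data from the question.
df <- data.frame(col  = c("col1", "col2", "col1", "col2"),
                 col1 = c(1, 2, 3, 4),
                 col2 = c(5, 6, 7, 8),
                 stringsAsFactors = FALSE)

# Numeric columns that df$col can point to (illustrative helper name).
value_cols <- c("col1", "col2")
m <- as.matrix(df[value_cols])

# One (row, column) pair per row of df; the lookup is fully vectorised.
df$answer <- m[cbind(seq_len(nrow(df)), match(df$col, value_cols))]
df
```

Because there is no per-row function call, this should also scale better than a for loop or sapply on large data frames.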
This doesn't look very hard, but the solution I found isn't very elegant, and there are probably better ways. You can use match and then subset according to the match:
dat <- read.table(text="col col1 col2
col1 1 5
col2 2 6
col1 3 7
col2 4 8", header = T, stringsAsFactors = FALSE)
cols <- unique(dat$col)
matches <- match(dat$col, cols)
dat$answer <- sapply(seq_along(matches), function(i) {
  dat[i, cols[matches[i]]]
})
And the result:
> dat
col col1 col2 answer
1 col1 1 5 1
2 col2 2 6 6
3 col1 3 7 3
4 col2 4 8 8
Edit
Actually, here's a much better approach:
dat$answer <- sapply(1:nrow(dat), function(r) {
  dat[r, dat$col[r]]
})
This is apparently what you have tried, but using sapply instead of unlist(lapply(...)), so I'm not sure if this helps.
In this case, with only 2 columns, ifelse might be the fastest and most straightforward solution.
df$answer <- ifelse(df[, 1] == "col1", df[, "col1"], df[, "col2"])
col col1 col2 answer
1 col1 1 5 1
2 col2 2 6 6
3 col1 3 7 3
4 col2 4 8 8
Addition, since N8TRO asked in his comment for a more general solution:
A simple switch might be all that is needed:
for(i in 1:nrow(df)) df$ans[i] <- switch(df[i,1],df[i,df[i,1]])
or without a "for" loop:
df$ans <- sapply(1:nrow(df),function(i) switch(df[i,1],df[i,df[i,1]]))
example:
df <- data.frame(col = sample(paste0("col", 1:5), 10, replace = TRUE),
                 col1 = 1:10, col2 = 11:20, col3 = 21:30, col4 = 31:40, col5 = 41:50,
                 stringsAsFactors = FALSE)
select the elements:
df$ans <- sapply(1:nrow(df),function(i) switch(df[i,1],df[i,df[i,1]]))
df
col col1 col2 col3 col4 col5 ans
1 col1 1 11 21 31 41 1
2 col1 2 12 22 32 42 2
3 col5 3 13 23 33 43 43
4 col2 4 14 24 34 44 14
5 col3 5 15 25 35 45 25
6 col4 6 16 26 36 46 36
7 col5 7 17 27 37 47 47
8 col3 8 18 28 38 48 28
9 col1 9 19 29 39 49 9
10 col5 10 20 30 40 50 50
