rowSums adding new column to dataframe but not adding the values - r

I have a table in which I'm trying to add multiple columns together into a new column. I've tried rowSums in several forms; each creates the new column, but the values aren't summed.
col1 col2 col3 col4
10 25 15 25
99 42 20 35
I've tried
df$col5 <- rowSums(df[c("col1","col2","col3","col4")])
and
df <- df %>% mutate(col5 = rowSums(select(., "col1", "col2", "col3", "col4")))
and
df <- df %>% mutate(col5 = sum(c_across(col1:col4)))
But each time it returns
col1 col2 col3 col4 col5
10 25 15 25 NA
99 42 20 35 NA
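The post does not show the column types or any missing values, so the cause of the NAs cannot be determined from it. As a point of comparison, here is a minimal base R sketch of the first attempt on the sample values, assuming all four columns are plain numeric (that assumption, and the second data frame with an NA in it, are mine and not from the question):

```r
# Sample values from the question, assumed to be numeric columns
df <- data.frame(col1 = c(10, 99), col2 = c(25, 42),
                 col3 = c(15, 20), col4 = c(25, 35))
cols <- c("col1", "col2", "col3", "col4")

# Check the types first: rowSums() needs numeric input
str(df[cols])

# With clean numeric columns the row sums come out as expected
df$col5 <- rowSums(df[cols])

# Hypothetical variant: one missing value makes that row's sum NA
# unless na.rm = TRUE is passed
df_na <- df[cols]
df_na$col1[1] <- NA
sum_default <- rowSums(df_na)                # NA for row 1
sum_na_rm   <- rowSums(df_na, na.rm = TRUE)  # ignores the NA
```

If the real data behaves differently from this, comparing the output of str() on both is a reasonable first step.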

Related

R join data efficiently if one of the columns in the first dataset matches any of the columns in the second dataset

Given 2 dataframes:
df1
col1 col2 col3
43 21 "a"
32 31 "b"
NA 12 "c"
44 NA "d"
df2
cl4 cl5 cl6
43 1 "text"
12 0 "text2"
32 44 "text3"
How can I merge them with a left_join, if one value of the columns in c("col1", "col2") matches a value in the columns c("cl4", "cl5") ?
Additional information: every variable can have missing values except cl6, which is always filled in.
Expected result:
col1 col2 col3 cl4 cl5 cl6
43 21 "a" 43 1 "text"
32 31 "b" 32 44 "text3"
NA 12 "c" 12 0 "text2"
44 NA "d" 32 44 "text3"
I have some code that works, but I think there must be a better solution if there are a lot of joins to be done (in my real dataframes I have 24 joins to do...).
Here is my code:
library(dplyr)      # left_join
library(data.table) # setnames
list_vars = c("cl4", "cl5", "cl6")
list_vars_rename = c("col4", "col5", "col6")
#MERGE 1
df1_merged <- left_join(df1, df2, by=c("col1" = "cl4"), na_matches = "never") #ignore NAs
df1_merged$cl4 <- df1_merged$col1 #because cl4 disappears during the join
df1_merged[is.na(df1_merged$cl6), "cl4"] <- NA #cl4 equals NA if no match = if cl6 NA
setnames(df1_merged, old = list_vars, new = list_vars_rename, skip_absent = T) #rename cols
#MERGE 2
df1_merged <- left_join(df1_merged, df2, by=c("col1" = "cl5"), na_matches = "never")
df1_merged <- as.data.frame(df1_merged) #because was a tibble
df1_merged$cl5 <- df1_merged$col1 #because cl5 disappears during the join
df1_merged[is.na(df1_merged$cl6), "cl5"] <- NA #cl5 equals NA if no match = if cl6 NA
for (i in seq_along(list_vars_rename)){
  df1_merged[,list_vars_rename[i]] <- ifelse(is.na(df1_merged[,list_vars_rename[i]]), df1_merged[,list_vars[i]], df1_merged[,list_vars_rename[i]])
} #fill col4, col5 & col6 with the values of cl4, cl5 & cl6 we got in the join
df1_merged = df1_merged[, !(names(df1_merged) %in% list_vars)] #drop cl4 ,cl5 & cl6
#MERGE 3
df1_merged <- left_join(df1_merged, df2, by=c("col2" = "cl4"), na_matches = "never")
df1_merged <- as.data.frame(df1_merged)
df1_merged$cl4 <- df1_merged$col2
df1_merged[is.na(df1_merged$cl6), "cl4"] <- NA
for (i in seq_along(list_vars_rename)){
  df1_merged[,list_vars_rename[i]] <- ifelse(is.na(df1_merged[,list_vars_rename[i]]), df1_merged[,list_vars[i]], df1_merged[,list_vars_rename[i]])
}
df1_merged= df1_merged[, !(names(df1_merged) %in% list_vars)]
###etc. until the last merge.
I didn't quite get there, but maybe this code helps:
library(tidyverse)
df1 <- read_table("col1 col2 col3
43 21 a
32 31 b
NA 12 c
44 NA d")
df2 <- read_table("cl4 cl5 cl6
43 1 text
12 0 text2
32 44 text3")
cols_1 <- c("col1", "col2")
cols_2 <- c("cl4", "cl5")
df1 %>%
  pivot_longer(cols = all_of(cols_1)) %>%
  left_join(df2 %>% pivot_longer(cols = all_of(cols_2)), by = "value", suffix = c(".df1", ".df2")) %>%
  filter(!is.na(name.df1) & !is.na(name.df2))
#> # A tibble: 4 x 5
#> col3 name.df1 value cl6 name.df2
#> <chr> <chr> <dbl> <chr> <chr>
#> 1 a col1 43 text cl4
#> 2 b col1 32 text3 cl4
#> 3 c col2 12 text2 cl4
#> 4 d col1 44 text3 cl5
Created on 2021-07-27 by the reprex package (v2.0.0)
The output contains the columns containing the important info (col3 and cl6), and it tells you what columns matched (name.df1 and name.df2), and what the matching value is (value). But I couldn't figure out how to add back the other information to match your desired output. Also I didn't deal with NAs.
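One possible way to get from matched values back to full rows, sketched in base R rather than as a definitive solution. It stacks the df2 key columns into a lookup of (value, df2 row number), matches each df1 key column against it while ignoring NAs, and keeps the first hit per df1 row. The names `long2` and `row2` are made up for this sketch, and "first match wins" is an assumption about how ties should be resolved.

```r
df1 <- data.frame(col1 = c(43, 32, NA, 44),
                  col2 = c(21, 31, 12, NA),
                  col3 = c("a", "b", "c", "d"))
df2 <- data.frame(cl4 = c(43, 12, 32),
                  cl5 = c(1, 0, 44),
                  cl6 = c("text", "text2", "text3"))

keys1 <- c("col1", "col2")  # key columns in df1
keys2 <- c("cl4", "cl5")    # key columns in df2

# Stack the df2 key columns: one row per (key value, df2 row number)
long2 <- data.frame(
  value = unlist(df2[keys2], use.names = FALSE),
  row2  = rep(seq_len(nrow(df2)), times = length(keys2))
)

# For each df1 row, find the df2 row matched by any of its key columns.
# incomparables = NA keeps missing values from matching each other.
row2 <- rep(NA_integer_, nrow(df1))
for (k in keys1) {
  m <- match(df1[[k]], long2$value, incomparables = NA)
  row2 <- ifelse(is.na(row2), long2$row2[m], row2)
}

# Attach the matched df2 rows (all-NA where nothing matched)
result <- cbind(df1, df2[row2, ])
rownames(result) <- NULL
result
```

Adding more key columns only means extending `keys1` and `keys2`, so the number of explicit joins does not grow with the number of column pairs.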

How to divide a dataframe column by 3 but skip division based on another column contents

I want to divide both numeric columns by 3 but not the third character column.
current dataframe:
col1 col2 col3
100 10 cat
200 20 dog
300 30 NA
desired:
col1 col2 col3
10 1 cat
20 2 dog
300 30 NA
my current code that isn't based on col3:
DB <- BD %>% mutate(Col1=Col1/3) %>% mutate(Col2s=Col2/3)
Please help with a solution. Thank you
Here is an idea via dplyr,
library(dplyr)
df %>%
  mutate_at(vars(-3), list(~ifelse(!is.na(col3), ./10, .)))
# col1 col2 col3
#1 10 1 cat
#2 20 2 dog
#3 300 30 <NA>
Using base R.
not_na <- !is.na(dat$col3)        # rows where col3 is present
num <- sapply(dat, is.numeric)    # numeric columns only
dat[not_na, num] <- dat[not_na, num]/10
dat
# col1 col2 col3
# 1 10 1 cat
# 2 20 2 dog
# 3 300 30 <NA>
Data:
dat <- read.table(header=T, text="col1 col2 col3
100 10 cat
200 20 dog
300 30 NA")
try it this way
library(tidyverse)
df %>%
  pivot_longer(-col3) %>%
  mutate(value = ifelse(!is.na(col3), value / 3, value)) %>%
  pivot_wider(id_cols = col3, names_from = name, values_from = value)
In base R, we can directly do the assignment if we have a logical index
dat1[1:2][!is.na(dat1$col3), ] <- dat1[1:2][!is.na(dat1$col3), ]/10
Or using data.table
library(data.table)
setDT(dat1)[!is.na(col3), (1:2) := .SD/10, .SDcols = 1:2]

Union dataframes in some way that updates rows with same row.name

I want to take the union of two dataframes that share some row names. For rows with a common row name, I would like to keep the values from the second dataframe rather than the first. For example:
df1 <- data.frame(col1 = c(1,2), col2 = c(2,4), row.names = c("row_1", "row_2"))
df1
# col1 col2
# row_1 1 2
# row_2 2 4
df2 <- data.frame(col1 = c(3,10), col2 = c(6,99), row.names = c("row_3", "row_2"))
df2
# col1 col2
# row_3 3 6
# row_2 10 99
The result I would like to obtain is:
someSpecificRBind(df1,df2, takeIntoAccount=df2)
# col1 col2
# row_1 1 2
# row_2 10 99
# row_3 3 6
The function rbind doesn't do the job: instead of replacing the shared rows, it keeps both and makes their row names unique.
I would conceptualize this as only adding to df2 the rows in df1 that aren't already there:
rbind(df2, df1[setdiff(rownames(df1), rownames(df2)), ])
We get the index of duplicated elements and use that to filter
rbind(df2, df1)[!duplicated(c(row.names(df2), row.names(df1))),]
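Both answers return the rows in df2-first order. If the row order from the desired output matters, the setdiff idea can be wrapped in a small helper that sorts by row name afterwards. This is only a sketch: `update_rbind` is a hypothetical name, sorting by row name is an assumption about the wanted order, and the df2 values are taken from the printed output in the question.

```r
df1 <- data.frame(col1 = c(1, 2), col2 = c(2, 4),
                  row.names = c("row_1", "row_2"))
df2 <- data.frame(col1 = c(3, 10), col2 = c(6, 99),
                  row.names = c("row_3", "row_2"))

# Hypothetical helper: rows of `new` replace rows of `old` with the same name
update_rbind <- function(old, new) {
  # Keep only the old rows that are not overridden by `new`
  kept <- old[setdiff(rownames(old), rownames(new)), , drop = FALSE]
  out <- rbind(kept, new)
  # Sort by row name (an assumption; drop this line to keep insertion order)
  out[order(rownames(out)), , drop = FALSE]
}

res <- update_rbind(df1, df2)
res
```

The `drop = FALSE` arguments keep the result a data frame even when only one column or one row is left.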

resample each two columns together in a data frame in R

I have a very large data frame that contains 100 rows and 400000 columns.
To sample each column, I can simply do:
df <- apply(df, 2, sample)
But I want every pair of columns to be sampled together. For example, if col1 is originally c(1,2,3,4,5) and col2 is c(6,7,8,9,10), and after resampling col1 becomes c(1,3,2,4,5), I want col2 to become c(6,8,7,9,10), following the same permutation as col1. The same goes for col3 & col4, col5 & col6, etc.
I wrote a for loop to do this, but it takes forever. Is there a better way? Thanks!
You might try this: split the data frame every two columns with split.default, sample the rows of each sub data frame, and then bind them back together:
df <- data.frame(col1 = 1:5, col2 = 6:10, col3 = 11:15)
index <- seq_len(nrow(df))
cbind.data.frame(
  setNames(lapply(
    split.default(df, (seq_along(df) - 1) %/% 2),
    function(sdf) sdf[sample(index),,drop=F]),
    NULL)
)
# col1 col2 col3
#5 5 10 12
#4 4 9 11
#1 1 6 15
#2 2 7 14
#3 3 8 13
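With 400000 columns, splitting into 200000 small data frames may itself be slow. A possible alternative, sketched here under the assumptions that all columns share one type (so the data can be held as a matrix) and that the number of columns is even: draw one permutation per pair and apply all of them in a single matrix-indexing step. The small matrix `m` and the names `perms`, `row_idx`, and `col_idx` are illustrative only.

```r
set.seed(42)                       # only so the sketch is reproducible
m <- matrix(1:30, nrow = 5)        # toy data: 6 columns = 3 pairs
n_pairs <- ncol(m) / 2             # assumes an even number of columns

# One row permutation per pair of columns (nrow x n_pairs matrix)
perms <- replicate(n_pairs, sample(nrow(m)))

# Repeat each permutation for both columns of its pair
row_idx <- as.vector(perms[, rep(seq_len(n_pairs), each = 2)])
col_idx <- rep(seq_len(ncol(m)), each = nrow(m))

# Pick m[row, col] for every cell at once and reshape column by column
out <- matrix(m[cbind(row_idx, col_idx)], nrow = nrow(m))
out
```

For a data frame input, `as.matrix(df)` before and `as.data.frame(out)` after would be needed, which is only safe when every column has the same type.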

R - Combining duplicate rows within a dataframe

I have a dataframe as shown below. Note that COL1 has duplicate entries.
COL1 COL2 COL3
10 hai 2
10 hai 3
10 pal 1
I want the output shown below: COL1 should contain only the unique entry (10), COL2 should contain the merged entries without duplicates (hai pal), and COL3 should contain the sum of the entries (2+3+1=6).
OUTPUT:
COL1 COL2 COL3
10 hai pal 6
Perhaps we need to aggregate by group. Convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'COL1', paste the unique elements of 'COL2' together, and take the sum of 'COL3'.
library(data.table)
setDT(df1)[,.(COL2 = paste(unique(COL2), collapse=" "), COL3= sum(COL3)) , by = COL1]
# COL1 COL2 COL3
#1: 10 hai pal 6
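For comparison, roughly the same grouping can be sketched in base R without data.table. This is an alternative technique, not the answer's method: each column is aggregated separately with `aggregate()` and the two results are merged on COL1. The data frame below is rebuilt from the table in the question.

```r
df1 <- data.frame(COL1 = c(10, 10, 10),
                  COL2 = c("hai", "hai", "pal"),
                  COL3 = c(2, 3, 1))

# Paste the unique COL2 values per COL1 group
col2 <- aggregate(COL2 ~ COL1, data = df1,
                  FUN = function(x) paste(unique(x), collapse = " "))

# Sum COL3 per COL1 group
col3 <- aggregate(COL3 ~ COL1, data = df1, FUN = sum)

# Combine the two summaries on the grouping column
res <- merge(col2, col3, by = "COL1")
res
```

Note that the formula interface of `aggregate()` drops rows with missing values by default, which may or may not be what is wanted on real data.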
