How to combine values within a row with comma - r

I would like to know, How to combine columns in a dataframe/list in R with a comma separator. Below is the sample dataset.
Name Red Blue Green
Jack 4 5 3
John 5 6 4
Gen 3 7 1
Pra 4 6 2
Expected would be:
Name Colors
Jack 4,5,3
John 5,6,4
Gen 3,7,1
Pra 4,6,2
Immediate help would be appreciated.
Thanks in advance

We can use paste with do.call. Note that even if you have 100 columns to paste, the below code does it automatically without having to painfully mention paste(df1$Red, df1$blue, df1$Green, df1$Orange, etc..., sep=",") etc.
newdf1 <- cbind(df1[1], Colors=do.call(paste, c(df1[-1], sep=",")))
newdf1
# Name Colors
#1 Jack 4,5,3
#2 John 5,6,4
#3 Gen 3,7,1
#4 Pra 4,6,2
Or a similar option with sprintf
cbind(df1[1], Colors=do.call(sprintf, c(df1[-1], list(fmt="%d,%d,%d"))))
Or with unite from tidyr
library(dplyr)
library(tidyr)
df1 %>%
unite(Colors, Red:Green, sep=",")
# Name Colors
#1 Jack 4,5,3
#2 John 5,6,4
#3 Gen 3,7,1
#4 Pra 4,6,2

I'd suggest the paste function with a "," separator.
df$Colors<-paste(df$Red, df$Blue, df$Green, sep =",")

You can achieve this by using the unite function from the tidyr package:
tidyr::unite(df_test, Color, -Name, sep = ', ')
data:
structure(list(Name = c("Jack", "John", "Gen", "Pra"),
Red = c(4L, 5L, 3L, 4L),
Blue = c(5L, 6L, 7L, 6L),
Green = c(3L, 4L, 1L, 2L)),
class = "data.frame",
row.names = c(NA, -4L)) -> df_test

I would copy the data into excel, add a column with the value , and add a formulae =A1&D1&B1&D1&C1
A,B,C is the columns, D is the comma

Related

r transfer values from one dataset to another by ID

I have two datasets , the first dataset is like this
ID Weight State
1 12.34 NA
2 11.23 IA
2 13.12 IN
3 12.67 MA
4 10.89 NA
5 14.12 NA
The second dataset is a lookup table for state values by ID
ID State
1 WY
2 IA
3 MA
4 OR
4 CA
5 FL
As you can see there are two different state values for ID 4, which is normal.
What I want to do is replace the NAs in dataset1 State column with State values from dataset 2. Expected dataset
ID Weight State
1 12.34 WY
2 11.23 IA
2 13.12 IN
3 12.67 MA
4 10.89 OR,CA
5 14.12 FL
Since ID 4 has two state values in dataset2 , these two values are collapsed and separated by , and used to replace the NA in dataset1. Any suggestion on accomplishing this is much appreciated. Thanks in advance.
Collapse df2 value and join it with df1 by 'ID'. Use coalesce to use non-NA value from the two state columns.
library(dplyr)
df1 %>%
left_join(df2 %>%
group_by(ID) %>%
summarise(State = toString(State)), by = 'ID') %>%
mutate(State = coalesce(State.x, State.y)) %>%
select(-State.x, -State.y)
# ID Weight State
#1 1 12.3 WY
#2 2 11.2 IA
#3 2 13.1 IN
#4 3 12.7 MA
#5 4 10.9 OR, CA
#6 5 14.1 FL
In base R with merge and transform.
merge(df1, aggregate(State~ID, df2, toString), by = 'ID') |>
transform(State = ifelse(is.na(State.x), State.y, State.x))
Tidyverse way:
library(tidyverse)
df1 %>%
left_join(df2 %>%
group_by(ID) %>%
summarise(State = toString(State)) %>%
ungroup(), by = 'ID') %>%
transmute(ID, Weight, State = coalesce(State.x, State.y))
Base R alternative:
na_idx <- which(is.na(df1$State))
df1$State[na_idx] <- with(
aggregate(State ~ ID, df2, toString),
State[match(df1$ID, ID)]
)[na_idx]
Data:
df1 <- structure(list(ID = c(1L, 2L, 2L, 3L, 4L, 5L), Weight = c(12.34,
11.23, 13.12, 12.67, 10.89, 14.12), State = c("WY", "IA", "IN",
"MA", "OR, CA", "FL")), row.names = c(NA, -6L), class = "data.frame")
df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 4L, 5L), State = c("WY",
"IA", "MA", "OR", "CA", "FL")), class = "data.frame", row.names = c(NA,
-6L))

How to count the number of customer per month in R?

So I have a table of customers with the respective date as below:
ID
Date
1
2019-04-17
4
2019-05-12
1
2019-04-25
2
2019-05-19
I just want to count how many Customer is there for each month-year like below:
Month-Year
Count of Customer
Apr-19
2
May-19
2
EDIT:
Sorry but I think my Question should be clearer.
The same customer can appear more than once in a month and would be counted as 2 customer for the same month. I would basically like to find the number of transaction per month based on customer id.
My assumed approach would be to first change the date into a month-year format? And then I count each customer and grouped it for each month? but I am not sure how to do this in R. Thank you!
You can use count -
library(dplyr)
df %>% count(Month_Year = format(as.Date(Date), '%b-%y'))
# Month_Year n
#1 Apr-19 2
#2 May-19 2
Or table in base R -
table(format(as.Date(df$Date), '%b-%y'))
#Apr-19 May-19
# 2 2
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))
We can use zoo::as.yearmon
library(dplyr)
df %>%
count(Date = zoo::as.yearmon(Date))
Date n
1 Apr 2019 2
2 May 2019 2
data
df <- structure(list(ID = c(1L, 4L, 1L, 2L), Date = c("2019-04-17",
"2019-05-12", "2019-04-25", "2019-05-19")),
class = "data.frame", row.names = c(NA, -4L))

Finding difference based on interaction between two vectors

I have two data frames with two columns each I'd like to compare, and generate output that appears in the first dataframe only that is the difference of the interaction of the two columns when compared between the dataframes.
I've tried using merge, %in%, Interaction, match, and I can't seem to get the correct output. I've also searched extensively on SO and am not finding a similar problem.
The closest response I've found is:
newdat <- match(interaction(dfA$colA, dfA$colB), interaction(dfB$colA, dfB$colB))
But obviously, this code isn't correct as this would (if working) would give me something that is common between the dataframes, and I want the difference between them (erroring - it generates a numeric vector, when both colA and B are string).
Example data:
#Dataframe A
colA colB
Aspirin Smith, John
Aspirin Doe, Jane
Atorva Smith, John
Simva Doe, Jane
#Dataframe B
colA colB
Aspirin Smith, John
Aspirin Doe, Jane
Atorva Doe, Jane
## GOAL:
#Dataframe
colA colB
Atorva Smith, John
Simva Doe, Jane
Thanks!
We can use setdiff from the dplyr package.
library(dplyr)
setdiff(datA, datB)
# colA colB
# 1 Atorva Smith, John
# 2 Simva Doe, Jane
DATA
datA <- read.table(text = " colA colB
Aspirin 'Smith, John'
Aspirin 'Doe, Jane'
Atorva 'Smith, John'
Simva 'Doe, Jane'",
header = TRUE, stringsAsFactors = FALSE)
datB <- read.table(text = " colA colB
Aspirin 'Smith, John'
Aspirin 'Doe, Jane'
Atorva 'Doe, Jane'",
header = TRUE, stringsAsFactors = FALSE)
If you want a base R solution, it's easy to write a setdiffDF function.
setdiffDF <- function(x, y){
ix <- !duplicated(rbind(y, x))[nrow(y) + 1:nrow(x)]
x[ix, ]
}
setdiffDF(dfA, dfB)
# colA colB
#3 Atorva Smith, John
#4 Simva Doe, Jane
Data in dput format.
dfA <-
structure(list(colA = structure(c(1L, 1L, 2L, 3L),
.Label = c("Aspirin", "Atorva", "Simva"), class = "factor"),
colB = structure(c(2L, 1L, 2L, 1L), .Label = c("Doe, Jane",
"Smith, John"), class = "factor")), class = "data.frame",
row.names = c(NA, -4L))
dfB <-
structure(list(colA = structure(c(1L, 1L, 2L),
.Label = c("Aspirin", "Atorva"), class = "factor"),
colB = structure(c(2L, 1L, 1L), .Label = c("Doe, Jane",
"Smith, John"), class = "factor")), class = "data.frame",
row.names = c(NA, -3L))

How to look up values from a table and insert name of the lookup-list?

I have a (sample)table like this:
df <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Gene SYMBOL Values
TP53 2 3.55
XBP1 5 4.06
TP27 1 2.53
REDD1 4 3.99
ERO1L 6 5.02
STK11 9 3.64
HIF2A 8 2.96")
I want to look up the symbols from two different genelists, given here as genelist1 and genelist2:
genelist1 <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Gene SYMBOL
P4H 10
PLK 7
TP27 1
KTD 11
ERO1L 6")
genelist2 <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Gene SYMBOL
TP53 2
XBP1 5
BHLHB 12
STK11 9
TP27 1
UPK 18")
What I want to is to get a new column where I can see in which genelist(s) I can find each of the genes in my dataframe, but when I run the following code it is just the symbols that are repeated in the new columns.
df_geneinfo <- df %>%
join(genelist1,by="SYMBOL") %>%
join(genelist2, by="SYMBOL")
Any suggestions of how to solve this, either to make one new column with the name of the genelists, or to make one column for each of the genelists?
Thanks in advance! :)
For the sake of completeness (and performance with large tables, perhaps), here is a data.table approach:
library(data.table)
rbindlist(list(genelist1, genelist2), idcol = "glid")[, -"Gene"][
setDT(df), on = "SYMBOL"][, .(glid = toString(glid)), by = .(Gene, SYMBOL, Values)][]
Gene SYMBOL Values glid
1: TP53 2 3.55 2
2: XBP1 5 4.06 2
3: TP27 1 2.53 1, 2
4: REDD1 4 3.99 1
5: ERO1L 6 5.02 NA
6: STK11 9 3.64 2
7: HIF2A 8 2.96 NA
rbindlist() creates a data.table from all genelists and adds a column glid to identify the origin of each row. The Gene column is ignored as the subsequent join is only on SYMBOL. Before joining, df is coerced to class data.table using setDT(). The joined result is then aggregated by SYMBOL to exhibit cases where a symbol appears in both genelists which is the case for SYMBOL == 1.
Edit
In case there are many genelists or the full name of the genelist is required instead of just a number, we can try this:
rbindlist(mget(ls(pattern = "^genelist")), idcol = "glid")[, -"Gene"][
setDT(df), on = "SYMBOL"][, .(glid = toString(glid)), by = .(Gene, SYMBOL, Values)][]
Gene SYMBOL Values glid
1: TP53 2 3.55 genelist2
2: XBP1 5 4.06 genelist2
3: TP27 1 2.53 genelist1, genelist2
4: REDD1 4 3.99 NA
5: ERO1L 6 5.02 genelist1
6: STK11 9 3.64 genelist2
7: HIF2A 8 2.96 NA
ls()is looking for objects in the environment the name of which is starting with genelist.... mget() returns a named list of those objects which is passed to rbindlist().
Data
As provided by the OP
df <- structure(list(Gene = c("TP53", "XBP1", "TP27", "REDD1", "ERO1L",
"STK11", "HIF2A"), SYMBOL = c(2L, 5L, 1L, 4L, 6L, 9L, 8L), Values = c(3.55,
4.06, 2.53, 3.99, 5.02, 3.64, 2.96)), .Names = c("Gene", "SYMBOL",
"Values"), class = "data.frame", row.names = c(NA, -7L))
genelist1 <- structure(list(Gene = c("P4H", "PLK", "TP27", "KTD", "ERO1L"),
SYMBOL = c(10L, 7L, 1L, 11L, 4L)), .Names = c("Gene", "SYMBOL"
), class = "data.frame", row.names = c(NA, -5L))
genelist2 <- structure(list(Gene = c("TP53", "XBP1", "BHLHB", "STK11", "TP27",
"UPK"), SYMBOL = c(2L, 5L, 12L, 9L, 1L, 18L)), .Names = c("Gene",
"SYMBOL"), class = "data.frame", row.names = c(NA, -6L))
I just wrote my own function, which replaces the column values:
replace_by_lookuptable <- function(df, col, lookup) {
assertthat::assert_that(all(col %in% names(df))) # all cols exist in df
assertthat::assert_that(all(c("new", "old") %in% colnames(lookup)))
cond_na_exists <- is.na(unlist(lapply(df[, col], function(x) my_match(x, lookup$old))))
assertthat::assert_that(!any(cond_na_exists))
df[, col] <- unlist(lapply(df[, col], function(x) lookup$new[my_match(x, lookup$old)]))
return(df)
}
df is the data.frame, col is a vector of column names which should be replaced using lookup, a data.frame with column "old" and "new".
If you add a listid column to your genelists
genelist1$listid = 1
genelist2$listid = 2
you can then merge your df with the genelists:
merge(df,rbind(genelist1,genelist2),all.x=T, by = "SYMBOL")
Note that ERO1L is SYMBOL 6 in your df and 4 in genelist1, and HIF2A and REDD1 are missing from genelists but REDD1 is symbol 4 in your df (which is ERO1L in genlist1... so I'm a not sure of what output you're expecting in that case.
You could also merge only on Gene names:
merge(df,rbind(genelist1,genelist2),all.x=T, by.x = "Gene", by.y= "Gene")
You could put all of your genlists in a list:
gen_list <- list(genelist1 = genelist1,genelist2 = genelist2)
and compare them to your target data.frame:
cbind(df,do.call(cbind,lapply(seq_along(gen_list),function(x) ifelse( df$Gene %in% gen_list[[x]]$Gene,names(gen_list[x]),NA))))

How to divide the column values in to ranges

I would like to know how to divide the column values in to three different ranges based on scores.
Here's the following data I have
Name V1.1 V1.2 V2.1 V2.2 V3.1 V3.2
John French 86 Math 78 English 56
Sam Math 97 French 86 English 79
Viru English 93 Math 44 French 34
If I consider three ranges. First rangewith 0-60, Second rangewith 61-90 and third range with 91-100.
I would like to the subject name across all the skills.
Expected result would be
Name Level1 Level2 Level3
John English Math,French Null
Sam Null French,Eng Math
Viru Math,Fren Null English
First you need to convert the data to long form, one row per observation (where an observation is a single score. You need to do a melt, but it is complicated because your wide form consists of not only observations but observation classes. One way to do it is to use melt.data.table twice, but you may be more comfortable with dplyr, which has more accessible syntax.
# first you need to convert to long form
library("data.table")
setDT(df)
lhs <- melt.data.table(df, id = "Name", measure = patterns("\\.2"),
variable.name = "obs", value.name = "score")
lhs[, obs := gsub("(V\\d+)\\.\\d+","\\1",obs)]
lhs
rhs <- melt.data.table(df, id = "Name", measure = patterns("V\\d\\.1"),
variable.name = "obs", value.name = "subject")
rhs[, obs := gsub("(V\\d+)\\.\\d+","\\1",obs)]
rhs
df2 <- merge(lhs, rhs, by = c("Name","obs"))
# Name obs score1 subject1
# 1: John V1 86 French
# 2: John V2 78 Math
# 3: John V3 56 English
# 4: Sam V1 97 Math
# 5: Sam V2 86 French
# 6: Sam V3 79 English
# 7: Viru V1 93 English
# 8: Viru V2 44 Math
# 9: Viru V3 34 French
Then you need to use cut or some other function to create the three levels based on score1.
Then you should group by these levels and apply concatenation to the subjects, such as paste(..., collapse = ",").
Then you need to use cast or spread to return it to wide form.
Do give it some effort, and edit your question with what you've tried, and try to come up with a more specific question, not just "please do this for me".
Another option using splitstackshape and nested ifelse
library(splitstackshape)
library(tidyr)
# prepare data to convert in long format
data$subjects = do.call(paste, c(data[,grep("\\.1", colnames(data))], sep = ','))
data$marks = do.call(paste, c(data[,grep("\\.2", colnames(data))], sep = ','))
data = data[,-grep("V", colnames(data))]
# use cSplit to convert wide to long
out = cSplit(setDT(data), sep = ",", c("subjects", "marks"), "long")
# nested ifelse to assign level based on the score range
out[, level := ifelse(marks <= 60, "level1",
ifelse(marks <= 90, "level2", "level3"))]
req = out[, toString(subjects), by= c("Name","level")]
this will give
#> req
# Name level V1
#1: John level2 French, Math
#2: John level1 English
#3: Sam level3 Math
#4: Sam level2 French, English
#5: Viru level3 English
#6: Viru level1 Math, French
you can reshape either using dcast or spread from tidyr
spread(req, level, V1)
# Name level1 level2 level3
#1: John English French, Math NA
#2: Sam NA French, English Math
#3: Viru Math, French NA English
data
data = structure(list(Name = structure(1:3, .Label = c("John", "Sam",
"Viru"), class = "factor"), V1.1 = structure(c(2L, 3L, 1L), .Label = c("English",
"French", "Math"), class = "factor"), V1.2 = c(86L, 97L, 93L),
V2.1 = structure(c(2L, 1L, 2L), .Label = c("French", "Math"
), class = "factor"), V2.2 = c(78L, 86L, 44L), V3.1 = structure(c(1L,
1L, 2L), .Label = c("English", "French"), class = "factor"),
V3.2 = c(56L, 79L, 34L)), .Names = c("Name", "V1.1", "V1.2",
"V2.1", "V2.2", "V3.1", "V3.2"), class = "data.frame", row.names = c(NA,
-3L))
Not very intuitive, but leads to the requested output. Requires the package sjmisc!
library(sjmisc)
mydat <- data.frame(Name = c("John", "Sam", "Viru"),
V1.1 = c("French", "Math", "English"),
V1.2 = c(86, 97, 93),
V2.1 = c("Math", "French", "Math"),
V2.2 = c(78, 86, 44),
V3.1 = c("English", "English", "French"),
V3.2 = c(56, 79, 34))
# recode into groups
rec(mydat[, c(3,5,7)]) <- "min:60=1; 61:90=2; 91:max=3"
# convert to long format
newdf <- to_long(mydat, "no_use",
c("subject", "score"),
c("V1.1", "V2.1", "V3.1"),
c("V1.2", "V2.2", "V3.2")) %>%
select(-no_use) %>%
arrange(Name, score)
# at this point we are at a similar stage as described in the
# other answers, so we have our data in a long format
newdf
fdf <- list()
# iterate all unique names
for (i in unique(newdf$Name)) {
dummy <- c()
# iterare all three scores
for (s in 1:3) {
# find subjects related to the score
dat <- newdf$subject[newdf$Name == i & newdf$score == s]
if (length(dat) == 0) dat <- ""
dat <- paste0(dat, collapse = ",")
dummy <- c(dummy, dat)
}
# add character vector with sorted subjects to list
fdf[[length(fdf) + 1]] <- dummy
}
# list to data frame
finaldf <- as.data.frame(t(as.data.frame(fdf)))
finaldf <- cbind(unique(newdf$Name), finaldf)
# proper row/col names
colnames(finaldf) <- c("Names", "Level1", "Level2", "Level3")
rownames(finaldf) <- 1:nrow(finaldf)
finaldf

Resources