How to divide the column values into ranges - r

I would like to know how to divide the column values into three different ranges based on scores.
Here is the data I have:
Name V1.1 V1.2 V2.1 V2.2 V3.1 V3.2
John French 86 Math 78 English 56
Sam Math 97 French 86 English 79
Viru English 93 Math 44 French 34
Consider three ranges: the first is 0-60, the second is 61-90, and the third is 91-100.
For each person, I would like to list the subject names that fall into each range.
Expected result would be
Name Level1 Level2 Level3
John English Math,French Null
Sam Null French,Eng Math
Viru Math,Fren Null English

First you need to convert the data to long form, with one row per observation (where an observation is a single score). You need to do a melt, but it is complicated because your wide form consists of not only observations but also observation classes. One way to do it is to use melt.data.table twice, but you may be more comfortable with dplyr, which has more accessible syntax.
# first you need to convert to long form
library("data.table")
setDT(df)
lhs <- melt.data.table(df, id = "Name", measure = patterns("\\.2"),
variable.name = "obs", value.name = "score")
lhs[, obs := gsub("(V\\d+)\\.\\d+","\\1",obs)]
lhs
rhs <- melt.data.table(df, id = "Name", measure = patterns("V\\d\\.1"),
variable.name = "obs", value.name = "subject")
rhs[, obs := gsub("(V\\d+)\\.\\d+","\\1",obs)]
rhs
df2 <- merge(lhs, rhs, by = c("Name","obs"))
# Name obs score1 subject1
# 1: John V1 86 French
# 2: John V2 78 Math
# 3: John V3 56 English
# 4: Sam V1 97 Math
# 5: Sam V2 86 French
# 6: Sam V3 79 English
# 7: Viru V1 93 English
# 8: Viru V2 44 Math
# 9: Viru V3 34 French
Then you need to use cut or some other function to create the three levels based on score1.
Then you should group by these levels and apply concatenation to the subjects, such as paste(..., collapse = ",").
Then you need to use cast or spread to return it to wide form.
Do give it some effort, and edit your question with what you've tried, and try to come up with a more specific question, not just "please do this for me".
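For reference, the remaining steps described above (cut, grouped concatenation, return to wide form) could be sketched in base R roughly as follows. This is only a sketch under assumptions: the long-form data is typed in by hand as `long` rather than taken from the merge, tapply() stands in for the separate group and cast steps, and the names `long`, `wide`, and the Level labels are illustrative.

```r
# Long-form data as shown in the merged output above (typed in by hand)
long <- data.frame(
  Name    = rep(c("John", "Sam", "Viru"), each = 3),
  score   = c(86, 78, 56, 97, 86, 79, 93, 44, 34),
  subject = c("French", "Math", "English", "Math", "French", "English",
              "English", "Math", "French"),
  stringsAsFactors = FALSE
)

# cut() with these breaks gives the intervals [0,60], (60,90], (90,100]
long$level <- cut(long$score, breaks = c(0, 60, 90, 100),
                  labels = c("Level1", "Level2", "Level3"),
                  include.lowest = TRUE)

# Concatenate subjects per Name x level; empty combinations become NA
wide <- tapply(long$subject, list(long$Name, long$level), toString)
wide
```

The result is a character matrix with one row per name and one column per level, which can be wrapped in as.data.frame() if a data frame is needed.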

Another option using splitstackshape and nested ifelse
library(splitstackshape)
library(tidyr)
# prepare data to convert in long format
data$subjects = do.call(paste, c(data[,grep("\\.1", colnames(data))], sep = ','))
data$marks = do.call(paste, c(data[,grep("\\.2", colnames(data))], sep = ','))
data = data[,-grep("V", colnames(data))]
# use cSplit to convert wide to long
out = cSplit(setDT(data), sep = ",", c("subjects", "marks"), "long")
# nested ifelse to assign level based on the score range
out[, level := ifelse(marks <= 60, "level1",
ifelse(marks <= 90, "level2", "level3"))]
req = out[, toString(subjects), by= c("Name","level")]
This will give
#> req
# Name level V1
#1: John level2 French, Math
#2: John level1 English
#3: Sam level3 Math
#4: Sam level2 French, English
#5: Viru level3 English
#6: Viru level1 Math, French
You can then reshape to wide form using either dcast or spread from tidyr
spread(req, level, V1)
# Name level1 level2 level3
#1: John English French, Math NA
#2: Sam NA French, English Math
#3: Viru Math, French NA English
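As a side note on the nested ifelse above: the same level assignment can probably be written with cut(), which scales better when there are more than three ranges. The sketch below uses a made-up `marks` vector (including the boundary values 60 and 90) purely to check that the two approaches agree; it is not part of the original answer.

```r
# Illustrative marks, including the boundary values 60, 90 and 91
marks <- c(86, 78, 56, 97, 60, 90, 91)

# Nested ifelse, as in the answer above
nested <- ifelse(marks <= 60, "level1",
                 ifelse(marks <= 90, "level2", "level3"))

# cut() with right-closed intervals: (-Inf,60], (60,90], (90,Inf)
via_cut <- as.character(cut(marks, breaks = c(-Inf, 60, 90, Inf),
                            labels = c("level1", "level2", "level3")))

identical(nested, via_cut)
```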
data
data = structure(list(Name = structure(1:3, .Label = c("John", "Sam",
"Viru"), class = "factor"), V1.1 = structure(c(2L, 3L, 1L), .Label = c("English",
"French", "Math"), class = "factor"), V1.2 = c(86L, 97L, 93L),
V2.1 = structure(c(2L, 1L, 2L), .Label = c("French", "Math"
), class = "factor"), V2.2 = c(78L, 86L, 44L), V3.1 = structure(c(1L,
1L, 2L), .Label = c("English", "French"), class = "factor"),
V3.2 = c(56L, 79L, 34L)), .Names = c("Name", "V1.1", "V1.2",
"V2.1", "V2.2", "V3.1", "V3.2"), class = "data.frame", row.names = c(NA,
-3L))

Not very intuitive, but leads to the requested output. Requires the package sjmisc!
library(sjmisc)
library(dplyr) # provides select() and arrange() used below
mydat <- data.frame(Name = c("John", "Sam", "Viru"),
V1.1 = c("French", "Math", "English"),
V1.2 = c(86, 97, 93),
V2.1 = c("Math", "French", "Math"),
V2.2 = c(78, 86, 44),
V3.1 = c("English", "English", "French"),
V3.2 = c(56, 79, 34))
# recode into groups
rec(mydat[, c(3,5,7)]) <- "min:60=1; 61:90=2; 91:max=3"
# convert to long format
newdf <- to_long(mydat, "no_use",
c("subject", "score"),
c("V1.1", "V2.1", "V3.1"),
c("V1.2", "V2.2", "V3.2")) %>%
select(-no_use) %>%
arrange(Name, score)
# at this point we are at a similar stage as described in the
# other answers, so we have our data in a long format
newdf
fdf <- list()
# iterate all unique names
for (i in unique(newdf$Name)) {
dummy <- c()
# iterate over all three score levels
for (s in 1:3) {
# find subjects related to the score
dat <- newdf$subject[newdf$Name == i & newdf$score == s]
if (length(dat) == 0) dat <- ""
dat <- paste0(dat, collapse = ",")
dummy <- c(dummy, dat)
}
# add character vector with sorted subjects to list
fdf[[length(fdf) + 1]] <- dummy
}
# list to data frame
finaldf <- as.data.frame(t(as.data.frame(fdf)))
finaldf <- cbind(unique(newdf$Name), finaldf)
# proper row/col names
colnames(finaldf) <- c("Names", "Level1", "Level2", "Level3")
rownames(finaldf) <- 1:nrow(finaldf)
finaldf

Related

How to combine specific data across multiple rows in a dataframe in R

I am looking to alter my data frame (concatenate or reshape, I am not sure which word is right for this scenario) by combining the cells of one column across rows where all the other columns are identical.
Basically, I have something like this:
>df
>Person_id System_id Category Type Tag
>1A 134 1 Chr Question
>1A 134 1 Chr Answer
>1A 134 1 Chr Evaluation
>1A 134 1 Chr Overall
>1A 134 1 Chr Analysis
>Z4 002 1 Chr Question
>Z4 002 1 Chr Answer
And get it to look something like this:
>Person_id System_id Category Type Tag
>1A 134 1 Chr Question, Answer, Evaluation, Overall, Analysis
>Z4 002 1 Chr Question, Answer
The Tags don't have to be separated by a comma; a space is fine.
Any ideas where to look for a solution like this would be helpful.
Thank you.
We can group by the first four columns and paste the 'Tag' elements together
library(dplyr)
df %>%
group_by_at(1:4) %>%
summarise(Tag = toString(Tag))
# A tibble: 2 x 5
# Groups: Person_id, System_id, Category [2]
# Person_id System_id Category Type Tag
# <chr> <int> <int> <chr> <chr>
#1 1A 134 1 Chr Question, Answer, Evaluation, Overall, Analysis
#2 Z4 2 1 Chr Question, Answer
Or using base R
aggregate(Tag ~ ., df, toString)
NOTE: toString is a convenient wrapper for paste(., collapse=", ")
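A quick way to convince yourself of the note above, using a made-up `tags` vector for illustration:

```r
# Illustrative vector; both calls should produce the same single string
tags <- c("Question", "Answer", "Evaluation")

a <- toString(tags)
b <- paste(tags, collapse = ", ")

identical(a, b)
```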
data
df <- structure(list(Person_id = c("1A", "1A", "1A", "1A", "1A", "Z4",
"Z4"), System_id = c(134L, 134L, 134L, 134L, 134L, 2L, 2L), Category = c(1L,
1L, 1L, 1L, 1L, 1L, 1L), Type = c("Chr", "Chr", "Chr", "Chr",
"Chr", "Chr", "Chr"), Tag = c("Question", "Answer", "Evaluation",
"Overall", "Analysis", "Question", "Answer")),
class = "data.frame", row.names = c(NA,
-7L))
You can use paste0 with collapse = ", " to achieve this:
library(dplyr)
df %>%
group_by(Person_id, System_id, Category, Type) %>%
summarise(Tag = paste0(Tag, collapse = ", "))
#Person_id System_id Category Type Tag
# <chr> <int> <int> <chr> <chr>
#1 1A 134 1 Chr Question, Answer, Evaluation, Overall, Analysis
#2 Z4 2 1 Chr Question, Answer

How to append 2 data sets one below the other having slightly different column names?

Data set1:
ID Name Territory Sales
1 Richard NY 59
8 Sam California 44
Data set2:
Terr ID Name Comments
LA 5 Rick yes
MH 11 Oly no
I want the final data set to have only the columns of the first data set, treating Terr as the same column as Territory and dropping the Comments column.
Final data should look like:
ID Name Territory Sales
1 Richard NY 59
8 Sam California 44
5 Rick LA NA
11 Oly MH NA
Thanks in advance
A possible solution:
# create a named vector with names from 'set2'
# with the positions of the matching columns in 'set1'
nms2 <- sort(unlist(sapply(names(set2), agrep, x = names(set1))))
# only keep the columns in 'set2' for which a match is found
# and give them the same names as in 'set1'
set2 <- setNames(set2[names(nms2)], names(set1[nms2]))
# bind the two dataset together
# option 1:
library(dplyr)
bind_rows(set1, set2)
# option 2:
library(data.table)
rbindlist(list(set1, set2), fill = TRUE)
which gives (dplyr-output shown):
ID Name Territory Sales
1 1 Richard NY 59
2 8 Sam California 44
3 5 Rick LA NA
4 11 Oly MH NA
Used data:
set1 <- structure(list(ID = c(1L, 8L),
Name = c("Richard", "Sam"),
Territory = c("NY", "California"),
Sales = c(59L, 44L)),
.Names = c("ID", "Name", "Territory", "Sales"), class = "data.frame", row.names = c(NA, -2L))
set2 <- structure(list(Terr = c("LA", "MH"),
ID = c(5L, 11L),
Name = c("Rick", "Oly"),
Comments = c("yes", "no")),
.Names = c("Terr", "ID", "Name", "Comments"), class = "data.frame", row.names = c(NA, -2L))
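If fuzzy matching with agrep feels too implicit, an explicit rename map is another option. The following is a minimal base R sketch, not the answer's method: `rename_map`, `missing_cols`, and `combined` are hypothetical names, and the mapping from Terr to Territory is assumed to be known in advance.

```r
set1 <- data.frame(ID = c(1L, 8L), Name = c("Richard", "Sam"),
                   Territory = c("NY", "California"), Sales = c(59L, 44L),
                   stringsAsFactors = FALSE)
set2 <- data.frame(Terr = c("LA", "MH"), ID = c(5L, 11L),
                   Name = c("Rick", "Oly"), Comments = c("yes", "no"),
                   stringsAsFactors = FALSE)

# Hypothetical explicit mapping: column name in set2 -> column name in set1
rename_map <- c(Terr = "Territory")
hits <- names(set2) %in% names(rename_map)
names(set2)[hits] <- rename_map[names(set2)[hits]]

# Columns present only in set1 are added to set2 as NA
missing_cols <- setdiff(names(set1), names(set2))
set2[missing_cols] <- NA

# Keep only set1's columns, in set1's order, then stack the rows
combined <- rbind(set1, set2[names(set1)])
combined
```

The explicit map avoids accidental matches between similar column names, at the cost of having to maintain it by hand.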

How to look up values from a table and insert name of the lookup-list?

I have a (sample)table like this:
df <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Gene SYMBOL Values
TP53 2 3.55
XBP1 5 4.06
TP27 1 2.53
REDD1 4 3.99
ERO1L 6 5.02
STK11 9 3.64
HIF2A 8 2.96")
I want to look up the symbols from two different genelists, given here as genelist1 and genelist2:
genelist1 <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Gene SYMBOL
P4H 10
PLK 7
TP27 1
KTD 11
ERO1L 6")
genelist2 <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Gene SYMBOL
TP53 2
XBP1 5
BHLHB 12
STK11 9
TP27 1
UPK 18")
What I want is a new column that shows in which genelist(s) each of the genes in my dataframe can be found, but when I run the following code, only the symbols are repeated in the new columns.
df_geneinfo <- df %>%
join(genelist1,by="SYMBOL") %>%
join(genelist2, by="SYMBOL")
Any suggestions of how to solve this, either to make one new column with the name of the genelists, or to make one column for each of the genelists?
Thanks in advance! :)
For the sake of completeness (and performance with large tables, perhaps), here is a data.table approach:
library(data.table)
rbindlist(list(genelist1, genelist2), idcol = "glid")[, -"Gene"][
setDT(df), on = "SYMBOL"][, .(glid = toString(glid)), by = .(Gene, SYMBOL, Values)][]
Gene SYMBOL Values glid
1: TP53 2 3.55 2
2: XBP1 5 4.06 2
3: TP27 1 2.53 1, 2
4: REDD1 4 3.99 1
5: ERO1L 6 5.02 NA
6: STK11 9 3.64 2
7: HIF2A 8 2.96 NA
rbindlist() creates a data.table from all genelists and adds a column glid to identify the origin of each row. The Gene column is dropped because the subsequent join is only on SYMBOL. Before joining, df is coerced to class data.table using setDT(). The joined result is then aggregated by Gene, SYMBOL, and Values, so that a symbol appearing in both genelists collapses into a single row, which is the case for SYMBOL == 1.
Edit
In case there are many genelists or the full name of the genelist is required instead of just a number, we can try this:
rbindlist(mget(ls(pattern = "^genelist")), idcol = "glid")[, -"Gene"][
setDT(df), on = "SYMBOL"][, .(glid = toString(glid)), by = .(Gene, SYMBOL, Values)][]
Gene SYMBOL Values glid
1: TP53 2 3.55 genelist2
2: XBP1 5 4.06 genelist2
3: TP27 1 2.53 genelist1, genelist2
4: REDD1 4 3.99 NA
5: ERO1L 6 5.02 genelist1
6: STK11 9 3.64 genelist2
7: HIF2A 8 2.96 NA
ls() looks for objects in the environment whose names start with genelist. mget() returns a named list of those objects, which is passed to rbindlist().
Data
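As provided by the OP. Note that this version of genelist1 lists ERO1L with SYMBOL 4 (the question above shows 6), so a join on SYMBOL matches REDD1 instead of ERO1L, as in the first output above; the second output corresponds to ERO1L having SYMBOL 6.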
df <- structure(list(Gene = c("TP53", "XBP1", "TP27", "REDD1", "ERO1L",
"STK11", "HIF2A"), SYMBOL = c(2L, 5L, 1L, 4L, 6L, 9L, 8L), Values = c(3.55,
4.06, 2.53, 3.99, 5.02, 3.64, 2.96)), .Names = c("Gene", "SYMBOL",
"Values"), class = "data.frame", row.names = c(NA, -7L))
genelist1 <- structure(list(Gene = c("P4H", "PLK", "TP27", "KTD", "ERO1L"),
SYMBOL = c(10L, 7L, 1L, 11L, 4L)), .Names = c("Gene", "SYMBOL"
), class = "data.frame", row.names = c(NA, -5L))
genelist2 <- structure(list(Gene = c("TP53", "XBP1", "BHLHB", "STK11", "TP27",
"UPK"), SYMBOL = c(2L, 5L, 12L, 9L, 1L, 18L)), .Names = c("Gene",
"SYMBOL"), class = "data.frame", row.names = c(NA, -6L))
I just wrote my own function, which replaces the column values:
replace_by_lookuptable <- function(df, col, lookup) {
assertthat::assert_that(all(col %in% names(df))) # all cols exist in df
assertthat::assert_that(all(c("new", "old") %in% colnames(lookup)))
cond_na_exists <- is.na(unlist(lapply(df[, col], function(x) my_match(x, lookup$old))))
assertthat::assert_that(!any(cond_na_exists))
df[, col] <- unlist(lapply(df[, col], function(x) lookup$new[my_match(x, lookup$old)]))
return(df)
}
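df is the data.frame, col is a vector of column names whose values should be replaced using lookup, a data.frame with columns "old" and "new". The helper my_match is not defined in this answer; it presumably behaves like base match().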
If you add a listid column to your genelists
genelist1$listid = 1
genelist2$listid = 2
you can then merge your df with the genelists:
merge(df,rbind(genelist1,genelist2),all.x=T, by = "SYMBOL")
Note that ERO1L is SYMBOL 6 in your df and 4 in genelist1, and HIF2A and REDD1 are missing from the genelists, but REDD1 is SYMBOL 4 in your df (which is ERO1L in genelist1), so I'm not sure what output you're expecting in that case.
You could also merge only on Gene names:
merge(df,rbind(genelist1,genelist2),all.x=T, by.x = "Gene", by.y= "Gene")
You could put all of your genelists in a list:
gen_list <- list(genelist1 = genelist1,genelist2 = genelist2)
and compare them to your target data.frame:
cbind(df,do.call(cbind,lapply(seq_along(gen_list),function(x) ifelse( df$Gene %in% gen_list[[x]]$Gene,names(gen_list[x]),NA))))
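The list-of-genelists idea above can also be collapsed into a single column naming every list a row appears in. The following is a base R sketch under assumptions: it matches on SYMBOL rather than Gene, it uses the question's version of the data (ERO1L with SYMBOL 6), and `membership` and `genelists` are illustrative names.

```r
df <- data.frame(
  Gene   = c("TP53", "XBP1", "TP27", "REDD1", "ERO1L", "STK11", "HIF2A"),
  SYMBOL = c(2L, 5L, 1L, 4L, 6L, 9L, 8L),
  stringsAsFactors = FALSE
)
gen_list <- list(
  genelist1 = data.frame(Gene = c("P4H", "PLK", "TP27", "KTD", "ERO1L"),
                         SYMBOL = c(10L, 7L, 1L, 11L, 6L)),
  genelist2 = data.frame(Gene = c("TP53", "XBP1", "BHLHB", "STK11", "TP27", "UPK"),
                         SYMBOL = c(2L, 5L, 12L, 9L, 1L, 18L))
)

# One logical column per genelist: is the row's SYMBOL present in that list?
membership <- sapply(gen_list, function(gl) df$SYMBOL %in% gl$SYMBOL)

# Per row, join the names of the matching lists; no match gives NA
df$genelists <- apply(membership, 1, function(hit) {
  if (any(hit)) toString(colnames(membership)[hit]) else NA_character_
})
df
```

Keeping the intermediate `membership` matrix around is also handy if one logical column per genelist is preferred over a combined label.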

Can you merge your data without creating separate dataframe in R?

My data frame is something like the follows:
sex year country value
F 2010 AU 350
F 2011 GE 258
M 2010 AU 250
F 2012 GE 928
In order to create another data frame that is merged by year and country, with sex and value being what you want to compare, you must first create separate data frames, like:
f <- subset(df, sex=="F")
m <- subset(df, sex=="M")
df_new <- merge(f, m, by=c("country", "year"), suffixes=c("_f", "_m"))
In this way, you can obtain a new data frame with year, and country being matched and just the value being different.
However, I would rather not create separate data frames just to merge them. Is it possible to get this data frame with a single line of code?
Taking dput(dft) as:
structure(list(sex = structure(c(1L, 1L, 2L, 1L), .Label = c("F", "M"), class = "factor"),
year = c(2010, 2011, 2010, 2012),
country = structure(c(1L, 2L, 1L, 2L), .Label = c("AU", "GE"), class = "factor"),
value = c(350, 258, 250, 928)), .Names = c("sex", "year", "country", "value"),row.names = c(NA, -4L), class = "data.frame")
you can use the tidyverse (spread() comes from tidyr) and do:
library(tidyverse)
dft %>% spread(sex,value)
which gives:
# year country F M
#1 2010 AU 350 250
#2 2011 GE 258 NA
#3 2012 GE 928 NA
We can do a split and then with Reduce/merge can get the expected output
Reduce(function(...) merge(..., by = c("country", "year"),
suffixes = c("_f", "_m")), split(df, df$sex))
# country year sex_f value_f sex_m value_m
#1 AU 2010 F 350 M 250
NOTE: This should also work when the split-by column has 'n' unique elements (without the suffixes, or with them modified accordingly)
A reshaping option with data.table is
library(data.table)
na.omit(dcast(setDT(df), country + year ~ rowid(country, year),
value.var = c("sex", "value")))
# country year sex_1 sex_2 value_1 value_2
#1: AU 2010 F M 350 250
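For completeness, base R's reshape() can also do this in one call without any extra packages. This is a sketch on a hand-typed copy of the question's data; `wide` is an illustrative name, and the resulting column names (value.F, value.M) follow reshape()'s default naming rather than the _f/_m suffixes in the question.

```r
df <- data.frame(sex = c("F", "F", "M", "F"),
                 year = c(2010, 2011, 2010, 2012),
                 country = c("AU", "GE", "AU", "GE"),
                 value = c(350, 258, 250, 928),
                 stringsAsFactors = FALSE)

# One row per year/country, one value column per level of sex
wide <- reshape(df, idvar = c("year", "country"), timevar = "sex",
                direction = "wide")
wide

# Keep only the year/country pairs present for both sexes
matched <- na.omit(wide)
matched
```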

How to tag the Term with the highest percentage of the class in R

y app baby blackberry dear
Neg 20 33.33 100 100
Neutral 80 66.66 0 0
Pos 0 0 0 0
In the above data frame, "app" has its highest percentage for class Neutral, so I have to combine the term "app" with the Neutral sentiment. Likewise, the term "blackberry" has its highest percentage for class "Negative", so I have to combine the term "blackberry" with class 'Neg'.
Can anybody please help me with this?
The expected output is not clear. If we need to find the corresponding 'y' term for the maximum value in each of the other columns, use summarise_each with which.max to get the numeric index, and use that index to get the 'y' value for the columns from 'app' to 'dear'
library(dplyr)
res <- df1 %>%
summarise_each(funs(y[which.max(.)]), app:dear)
res
# app baby blackberry dear
# 1 Neutral Neutral Neg Neg
This can be converted to 'long' format with gather from tidyr
library(tidyr)
res %>%
gather()
# key value
#1 app Neutral
#2 baby Neutral
#3 blackberry Neg
#4 dear Neg
Or we can melt it to 'long' format (from data.table) and use which.max
library(data.table)
melt(setDT(df1), id.var = "y")[, .(value = y[which.max(value)]),.(key= variable)]
# key value
#1: app Neutral
#2: baby Neutral
#3: blackberry Neg
#4: dear Neg
Or using base R
df1$y[sapply(df1[-1], which.max)]
#[1] "Neutral" "Neutral" "Neg" "Neg"
data
df1 <- structure(list(y = c("Neg", "Neutral", "Pos"), app = c(20L, 80L,
0L), baby = c(33.33, 66.66, 0), blackberry = c(100L, 0L, 0L),
dear = c(100L, 0L, 0L)), .Names = c("y", "app", "baby", "blackberry",
"dear"), class = "data.frame", row.names = c(NA, -3L))
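Building on the base R approach above, the term, its winning class, and the winning percentage can be gathered into one small table. This is a sketch with illustrative names (`top`, `result`, `pct`); note that which.max() returns the first maximum, so ties would silently go to the earlier class.

```r
df1 <- data.frame(y = c("Neg", "Neutral", "Pos"),
                  app = c(20, 80, 0), baby = c(33.33, 66.66, 0),
                  blackberry = c(100, 0, 0), dear = c(100, 0, 0),
                  stringsAsFactors = FALSE)

# Row index of the maximum in each term column (first maximum on ties)
top <- sapply(df1[-1], which.max)

result <- data.frame(term  = names(top),
                     class = df1$y[top],
                     pct   = unname(sapply(df1[-1], max)),
                     stringsAsFactors = FALSE)
result
```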
