Extracting values with if and putting them in a new column - r

Maybe this is a very simple question, but I cannot figure out what is wrong with my short code.
This is my (very simple) data frame:
structure(list(sample = structure(c(1L, 2L, 1L, 1L, 1L, 2L, 3L,
3L, 3L), .Label = c("a", "b", "c"), class = "factor"), value = c(0.1446689595,
0.9151456018, 0.880888083, 0.005522657, 0.7079621046, 0.4770259836,
0.6960717649, 0.5892328324, 0.1134234308), new = c("red", "red",
"red", "red", "red", "red", "red", "red", "red")), .Names = c("sample",
"value", "new"), row.names = c(NA, -9L), class = "data.frame")
What I would like to do is add a new column whose values depend on the values of the first column. In other, simpler words:
if (df1$sample != "a") {
df1$new <- "green"
} else {
df1$new <- "red"
}
but R returns a warning:
In if (df1$sample != "a") { :
the condition has length > 1 and only the first element will be used
I also tried with an ifelse statement:
ifelse(df1$sample != "a", df1$new <- "green", df1$new <- "red")
but in this case the new column contains only "red" and no "green".
Am I missing something?
Thanks!

You could try
df1$new <- c('green', 'red')[(df1$sample=='a')+1L]
df1
# sample value new
#1 a 0.144668959 red
#2 b 0.915145602 green
#3 a 0.880888083 red
#4 a 0.005522657 red
#5 a 0.707962105 red
#6 b 0.477025984 green
#7 c 0.696071765 green
#8 c 0.589232832 green
#9 c 0.113423431 green
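For anyone puzzled by the one-liner above, here is a small sketch of what it does step by step. The names sample_col, is_a, idx and new_col are only for illustration and are not part of the original answer.

```r
# Stand-in for the sample column from the question
sample_col <- factor(c("a", "b", "a", "a", "a", "b", "c", "c", "c"))

is_a    <- sample_col == "a"        # logical vector: TRUE where the sample is "a"
idx     <- is_a + 1L                # FALSE becomes 1, TRUE becomes 2
new_col <- c("green", "red")[idx]   # position 1 is "green", position 2 is "red"

print(data.frame(sample_col, is_a, idx, new_col))
```

Adding 1L to a logical vector turns it into integer positions, which are then used to index the two-element vector of colours.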

ifelse should work fine; you just need to assign its result rather than assigning inside the call:
df1$new1 <- ifelse(df1$sample != "a", "green", "red")
sample value new new1
1 a 0.144668959 red red
2 b 0.915145602 red green
3 a 0.880888083 red red
4 a 0.005522657 red red
5 a 0.707962105 red red
6 b 0.477025984 red green
7 c 0.696071765 red green
8 c 0.589232832 red green
9 c 0.113423431 red green
I would avoid using new as a variable name - it is the name of a function and this may cause issues.
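To make the difference between if and ifelse concrete, here is a minimal sketch on a toy data frame. The column names colour and colour_loop are made up for this example.

```r
df1 <- data.frame(sample = c("a", "b", "a", "c"), stringsAsFactors = FALSE)

# ifelse() tests every element and returns a vector of the same length,
# so the result is assigned once, outside the call.
df1$colour <- ifelse(df1$sample != "a", "green", "red")

# if () expects a single TRUE/FALSE, so on a column it has to sit inside
# a loop over the rows (slower, shown only for comparison).
df1$colour_loop <- character(nrow(df1))
for (i in seq_len(nrow(df1))) {
  if (df1$sample[i] != "a") {
    df1$colour_loop[i] <- "green"
  } else {
    df1$colour_loop[i] <- "red"
  }
}

print(df1)
```

Both columns end up the same; the ifelse version is the one you would normally use.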

Related

Is there a way to "CountIF" in R based on two conditions

I know how to do this in excel, but am trying to translate into R and create a new column. In R I have a data frame called CleanData. I want to see how many times the value in each row of column A shows up in all of column B. In excel it would read like this:
=COUNTIF(B:B,A2)>0,C="Purple")
The second portion would be a next if / and statement. It would look like this in excel:
=IF(AND(COUNTIF(B:B,A2)>0,C="Purple"),"Yes", "No")
Anyone know where to start?
I have tried mutating and also this:
sum(CleanData$colA == CleanData$colB)
and am getting no values
You don't need any extra packages. Here is a solution with the base R function ifelse, which is very often useful and worth learning. An example:
set.seed(7*11*13)
DF <- data.frame(cond=rnorm(100), X= sample(c("Yes","No"), 100, replace=TRUE))
with(DF, sum(ifelse( (cond>0)&(X=="Yes"), 1, 0)))
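As a rough sketch of how the Excel formula itself might translate to base R: the toy data frame below is invented, and I am assuming the three columns are called colA, colB and C. Note that sum(CleanData$colA == CleanData$colB) compares the two columns row by row, which is not what COUNTIF(B:B, A2) does.

```r
# Hypothetical stand-in for CleanData (column names are assumptions)
CleanData <- data.frame(
  colA = c("x", "y", "z", "x"),
  colB = c("y", "y", "w", "x"),
  C    = c("Purple", "Purple", "Purple", "Blue"),
  stringsAsFactors = FALSE
)

# COUNTIF(B:B, A2): how many times each value of colA occurs anywhere in colB
CleanData$count <- vapply(
  CleanData$colA,
  function(a) sum(CleanData$colB == a, na.rm = TRUE),
  integer(1),
  USE.NAMES = FALSE
)

# IF(AND(COUNTIF(B:B, A2) > 0, C = "Purple"), "Yes", "No")
CleanData$flag <- ifelse(CleanData$count > 0 & CleanData$C == "Purple", "Yes", "No")

print(CleanData)
```

If only the "appears at least once" part matters, CleanData$colA %in% CleanData$colB gives the same TRUE/FALSE as count > 0.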
I think this will capture your if/countif scenario:
library(dplyr)
CleanData %>%
mutate(YesOrNo = case_when(Color != "Purple" ~ "No", is.na(LABEL1) | !nzchar(LABEL1) ~ "No", !LABEL1 %in% LABEL2 ~ "No", TRUE ~ "Yes"))
# LABEL1 LABEL2 Color YesOrNo
# 1 HELLO <NA> Purple Yes
# 2 <NA> HELLO!!! Blue No
# 3 HELLO$$ <NA> Purple Yes
# 4 <NA> HELLO Blue No
# 5 HELLOOO <NA> Purple Yes
# 6 <NA> <NA> Purple No
# 7 <NA> HELLOOO Blue No
# 8 <NA> HELLO$$ Blue No
# 9 <NA> HELLO Yellow No
Data
CleanData <- structure(list(LABEL1 = c("HELLO", NA, "HELLO$$", NA, "HELLOOO", NA, NA, NA, NA), LABEL2 = c(NA, "HELLO!!!", NA, "HELLO", NA, NA, "HELLOOO", "HELLO$$", "HELLO"), Color = c("Purple", "Blue", "Purple", "Blue", "Purple", "Purple", "Blue", "Blue", "Yellow")), class = "data.frame", row.names = c(NA, -9L))
or programmatically,
CleanData <- data.frame(LABEL1=c("HELLO",NA,"HELLO$$",NA,"HELLOOO",NA,NA,NA,NA), LABEL2=c(NA,"HELLO!!!",NA,"HELLO",NA,NA,"HELLOOO","HELLO$$","HELLO"),Color=c("Purple","Blue","Purple","Blue","Purple","Purple","Blue","Blue","Yellow"))

Average only duplicated rows and replacing value in a defined column

I have a dataframe D:
surname name salary
Red A 1000
Green B 900
Green A 1100
Blue C 1000
Blue B 1000
Blue F 800
Violet F 1200
Some rows have a unique surname, while others share a surname with other rows.
I need to aggregate only the rows where surname is duplicated, averaging the values of salary and changing the name to "X".
I tried something using duplicated(), but it leaves one duplicate as in the original and changes only the others.
D$name<-replace(D$name,duplicated(D$surname),"X")
And also I was unable to average the values of salary.
Thank you!
We can use
D$name <- replace(D$name,duplicated(D$surname)|duplicated(D$surname,
fromLast = TRUE),"X")
If we need to create an average column
library(dplyr)
D %>%
group_by(surname) %>%
mutate(average = mean(salary))
data
D <- structure(list(surname = c("Red", "Green", "Green", "Blue", "Blue",
"Blue", "Violet"), name = c("A", "B", "A", "C", "B", "F", "F"
), salary = c(1000L, 900L, 1100L, 1000L, 1000L, 800L, 1200L)), class = "data.frame", row.names = c(NA,
-7L))
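If the goal is to actually collapse the duplicated surnames into one row each, something along these lines might work in base R. This is only a sketch; dup, collapsed and result are names I made up.

```r
D <- data.frame(
  surname = c("Red", "Green", "Green", "Blue", "Blue", "Blue", "Violet"),
  name    = c("A", "B", "A", "C", "B", "F", "F"),
  salary  = c(1000, 900, 1100, 1000, 1000, 800, 1200),
  stringsAsFactors = FALSE
)

# TRUE for every row whose surname occurs more than once
dup <- duplicated(D$surname) | duplicated(D$surname, fromLast = TRUE)

# One row per duplicated surname with the mean salary, and name set to "X"
collapsed <- aggregate(salary ~ surname, data = D[dup, ], FUN = mean)
collapsed$name <- "X"

# Rows with a unique surname stay as they are; align column order before binding
result <- rbind(D[!dup, ], collapsed[, names(D)])
rownames(result) <- NULL
print(result)
```

The rows with unique surnames come first and the averaged rows follow, so sort the result afterwards if the order matters.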

how to adjust a heatmap color key and values everything

I have data as follows:
data<- structure(list(names = structure(c(5L, 1L, 10L, 2L, 6L, 4L, 9L,
7L, 11L, 3L, 8L), .Label = c("Bin", "Dari", "Down", "How", "India",
"Karachi", "Left", "middle", "Right", "Trash", "Up"), class = "factor"),
X1Huor = c(1.555555556, 5.2555556, 2.256544, 2.3654225, 1.2665545,
0, 1.889822365, 2.37232101, -1, -1.885618083, 1.128576187
), X2Hour = c(1.36558854, 2.254887, 2.3333333, 0.22255444,
2.256588, 5.66666, -0.377964473, 0.107211253, -1, 0, 0),
X3Hour = c(0, 1.222222222, 5.336666, 1.179323788, 0.832050294,
-0.397359707, 0.185695338, 1.393746295, -1, -2.121320344,
1.523019248), X4Hour = c(3.988620176, 3.544745039, -2.365555,
2.366666, 1.000000225, -0.662266179, -0.557086015, 0.862662186,
0, -1.305459824, 1.929157714), X5Hour = c(2.366666, 2.333365,
4.22222, 0.823333333, 0.980196059, -2.516611478, 2.267786838,
0.32163376, 0, -2.592724864, 0.816496581)), .Names = c("names",
"X1Huor", "X2Hour", "X3Hour", "X4Hour", "X5Hour"), class = "data.frame", row.names = c(NA,
-11L))
I tried to plot it like below
rnames <- data[,1] # assign labels in column 1 to "rnames"
mat_data <- data.matrix(data[,2:ncol(data)]) # transform columns 2-6 into a matrix
rownames(mat_data) <- rnames # assign row names
After converting it to a matrix, I draw the heatmap with heatmap.2 from gplots:
library(gplots)
myPalette <- colorRampPalette(c("green", "yellow","red" ))(n = 299)
col_breaks = c(seq(-2,0,length=100), # for green
seq(0.01,0.8,length=100), # for yellow
seq(0.81,2,length=100)) # for red
heatmap.2(mat_data,
cellnote = mat_data, # same data set for cell labels
main = "The heatmap title to be appeared ", # heat map title
notecol="black", # change font color of cell labels to black
density.info="none", # turns off density plot inside color legend
trace="none", # turns off trace lines inside the heat map
margins =c(15,20), # widens margins around plot
col=myPalette, # use the color palette defined earlier
breaks=col_breaks, # enable color transition at specified limits
dendrogram="row", # only draw a row dendrogram
Colv=FALSE) # turn off column clustering
But what I want is to:
1- make the color key smaller
2- show no cell values, or show them with only 2 digits
3- I don't mind if someone gives a solution with ggplot
This is how it looks now
I also want the x labels to be straight and without the X before them. Is this possible?
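One possible direction, as an untested sketch rather than a definitive answer: heatmap.2 has a keysize argument for the color key and an srtCol argument for the rotation of the column labels, and the cell labels can be rounded before they are passed to cellnote. The small matrix below is a stand-in for mat_data, and the value keysize = 1 is a guess that may need tuning.

```r
# Small stand-in matrix with the same kind of column names as in the question
mat_data <- matrix(c(1.555555556, 5.2555556, -1, 1.36558854, 2.254887, 0),
                   nrow = 3,
                   dimnames = list(c("India", "Bin", "Up"), c("X1Hour", "X2Hour")))

# Drop the leading "X" that R prepends to column names starting with a digit
colnames(mat_data) <- sub("^X", "", colnames(mat_data))

# Cell labels rounded to two decimals
notes <- round(mat_data, 2)

# Only draw the plot if gplots is installed
if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::heatmap.2(mat_data,
                    cellnote = notes,      # leave this argument out to show no values
                    notecol = "black",
                    density.info = "none",
                    trace = "none",
                    keysize = 1,           # default is 1.5; smaller shrinks the color key
                    srtCol = 0,            # horizontal column labels
                    dendrogram = "row",
                    Colv = FALSE)
}
```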

Color data points based on sample classification

I have created a pairwise scatterplot showing the relationship between genes (columns in the data frame) across multiple samples (rows in the data frame). The samples belong to two distinct groups, "A" and "B". Since one dot in the plot represents one sample, I need to color the data points (dots) according to group with two different colors, say group A "green" and group B "red". Is it possible to do that?
Any kind of help will be appreciated.
plot(DF[1:6], pch = 21) #command used for plotting, DF is data frame
Sample Data Frame Example:
CBX3 PSPH ATP2C1 SNX10 MMD ATP13A3
B 10.589844 6.842970 8.084550 8.475023 9.202490 10.403811
A 10.174385 5.517944 7.736994 9.094834 9.253766 10.133408
B 10.202084 5.669137 7.392141 7.522270 7.830969 9.123178
B 10.893231 6.630709 7.601690 7.894177 8.979142 9.791841
B 10.071038 5.091222 7.032585 8.305581 7.903737 8.994821
A 10.005002 4.708631 7.927246 7.292527 8.257853 10.054630
B 10.028055 5.080944 6.421961 7.616856 8.287496 9.642294
A 10.144115 6.626483 7.686203 7.970934 7.919615 9.475175
A 10.675386 6.874047 7.900560 7.605519 8.585158 8.858613
A 9.855063 5.164399 6.847923 8.072608 8.221344 9.077744
A 10.994228 6.545318 8.606128 8.426329 8.787876 9.857079
A 10.501266 6.677360 7.787168 8.444976 8.928174 9.542558
GGally has a good function for this as well.
library(GGally)
ggpairs(dd, mapping = ggplot2::aes(color = CLASS), columns = 2:ncol(dd))
It might not be that easy to do with base graphics. You could easily do this with lattice. With this sample data.frame
dd<-structure(list(CLASS = structure(c(2L, 1L, 2L, 2L, 2L, 1L, 2L,
1L, 1L, 1L, 1L, 1L), .Label = c("A", "B"), class = "factor"),
CBX3 = c(10.589844, 10.174385, 10.202084, 10.893231, 10.071038,
10.005002, 10.028055, 10.144115, 10.675386, 9.855063, 10.994228,
10.501266), PSPH = c(6.84297, 5.517944, 5.669137, 6.630709,
5.091222, 4.708631, 5.080944, 6.626483, 6.874047, 5.164399,
6.545318, 6.67736), ATP2C1 = c(8.08455, 7.736994, 7.392141,
7.60169, 7.032585, 7.927246, 6.421961, 7.686203, 7.90056,
6.847923, 8.606128, 7.787168), SNX10 = c(8.475023, 9.094834,
7.52227, 7.894177, 8.305581, 7.292527, 7.616856, 7.970934,
7.605519, 8.072608, 8.426329, 8.444976), MMD = c(9.20249,
9.253766, 7.830969, 8.979142, 7.903737, 8.257853, 8.287496,
7.919615, 8.585158, 8.221344, 8.787876, 8.928174), ATP13A3 = c(10.403811,
10.133408, 9.123178, 9.791841, 8.994821, 10.05463, 9.642294,
9.475175, 8.858613, 9.077744, 9.857079, 9.542558)), .Names = c("CLASS",
"CBX3", "PSPH", "ATP2C1", "SNX10", "MMD", "ATP13A3"), class = "data.frame", row.names = c(NA, -12L))
you can do
library(lattice)
splom(~dd[,-1], groups=dd$CLASS)
to get
You can add color to the points by specifying the argument col to plot
DF <- read.table(header = TRUE, textConnection(
"category CBX3 PSPH ATP2C1 SNX10 MMD ATP13A3
B 10.589844 6.842970 8.084550 8.475023 9.202490 10.403811
A 10.174385 5.517944 7.736994 9.094834 9.253766 10.133408
B 10.202084 5.669137 7.392141 7.522270 7.830969 9.123178
B 10.893231 6.630709 7.601690 7.894177 8.979142 9.791841
B 10.071038 5.091222 7.032585 8.305581 7.903737 8.994821
A 10.005002 4.708631 7.927246 7.292527 8.257853 10.054630
B 10.028055 5.080944 6.421961 7.616856 8.287496 9.642294
A 10.144115 6.626483 7.686203 7.970934 7.919615 9.475175
A 10.675386 6.874047 7.900560 7.605519 8.585158 8.858613
A 9.855063 5.164399 6.847923 8.072608 8.221344 9.077744
A 10.994228 6.545318 8.606128 8.426329 8.787876 9.857079
A 10.501266 6.677360 7.787168 8.444976 8.928174 9.542558"))
plot(DF[2:7],col = ifelse(DF$category == 'A','red','green'))
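With more than two groups, or to get the exact colours asked for in the question (A green, B red), a named lookup vector may be easier to read than nested ifelse calls. A minimal sketch with made-up data; g1, g2, g3, group_cols and point_cols are hypothetical names:

```r
# Toy data standing in for DF: one group column plus three numeric columns
DF <- data.frame(
  category = factor(c("B", "A", "B", "B", "A")),
  g1 = c(10.6, 10.2, 10.2, 10.9, 10.1),
  g2 = c(6.8, 5.5, 5.7, 6.6, 5.1),
  g3 = c(8.1, 7.7, 7.4, 7.6, 7.0)
)

# One colour per group, looked up by group name
group_cols <- c(A = "green", B = "red")
point_cols <- unname(group_cols[as.character(DF$category)])

# plot() on a data frame with several columns draws a scatterplot matrix
plot(DF[c("g1", "g2", "g3")], col = point_cols, pch = 19)
```

The as.character() call matters: indexing with the factor itself would use its integer codes rather than the group names.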
A list of valid color values can be obtained by calling colors(). Vectors with a gradient of colors can be created via rainbow(), and just for fun, I use this little function for choosing pretty colors when making a figure.
(Edited per suggestions from @MrFlick)
#' @param n The number of colors to be selected
colorchoose <- function (n = 1, alpha, term = F)
{
cols <- colors()
mod <- ceiling(sqrt(length(cols)))
plot(xlab = "", ylab = "", main = "click for color name",
c(0, mod), c(0, mod), type = "n", axes = F)
s<-seq_along(cols)
dev.hold()
points(s%%mod, s%/%mod, col = cols, pch = 15, cex = 2.4)
dev.flush()
p <- locator(n)
return(cols[round(p$y) * mod + round(p$x)])
}

Using apply with a user-defined function in R

I have defined the following function in R:
#A function that compares color and dates to determine if there is a match
getTagColor <- function(color, date){
for (i in (1:nrow(TwistTieFix))){
if ((color == TwistTieFix$color_match[i]) &
(date > TwistTieFix$color_match[i]) &
(date <= TwistTieFix$julian_cut_off_date[i])) {
Data$color_code <- TwistTieFix$color_code[i]
print(Data$color_code)
}
}
}
I then used apply() in an attempt to apply the function to each row.
#Apply the above function to the data set
testData <- apply(Data, 1, getTagColor(Data$tag_color,Data$julian_date))
The goal of the code is to use two variables in Data and find another value to put into a new column in Data (color_code) that will be based on the information in TwistTieFix. When I run the code, I get a list of warnings saying
In if ((color == TwistTieFix$color_match[i]) & (date > ... :
the condition has length > 1 and only the first element will be used
I cannot determine why the function does not use the date and color from each row and use it in the function (at least that is what I think is going wrong here). Thanks!
Here are examples of the data frames being used:
TwistTieFix
color_name date color_code cut_off_date color_match julian_start julian_cut_off_date
yellow 2013-08-12 y1 2001-07-02 yellow 75 389
blue 2000-09-28 b1 2001-08-12 blue 112 430
Data
coll_date julian_date tag_color
2013-08-13 76 yellow
2013-08-14 76 yellow
2000-09-29 112 blue
Data has a lot more columns of different variables, but I am not allowed to include all of the columns. However, I have included the columns in Data that I am referencing in function. The data sets are loaded into r using read.csv and are from Excel csv files.
To me, it seems like you want to join Data and TwistTieFix where tag_color=color_match and julian_start <= julian_date <= julian_cut_off_date. Here are your sample data sets in dput form
TwistTieFix <- structure(list(color_name = structure(c(2L, 1L), .Label = c("blue",
"yellow"), class = "factor"), date = structure(c(2L, 1L), .Label = c("2000-09-28",
"2013-08-12"), class = "factor"), color_code = structure(c(2L,
1L), .Label = c("b1", "y1"), class = "factor"), cut_off_date = structure(1:2, .Label = c("2001-07-02",
"2001-08-12"), class = "factor"), color_match = structure(c(2L,
1L), .Label = c("blue", "yellow"), class = "factor"), julian_start = c(75L,
112L), julian_cut_off_date = c(389L, 430L)), .Names = c("color_name",
"date", "color_code", "cut_off_date", "color_match", "julian_start",
"julian_cut_off_date"), class = "data.frame", row.names = c(NA,
-2L))
Data <- structure(list(coll_date = structure(c(2L, 3L, 1L), .Label = c("2000-09-29",
"2013-08-13", "2013-08-14"), class = "factor"), julian_date = c(76L,
76L, 112L), tag_color = structure(c(2L, 2L, 1L), .Label = c("blue",
"yellow"), class = "factor")), .Names = c("coll_date", "julian_date",
"tag_color"), class = "data.frame", row.names = c(NA, -3L))
An easy way to perform this merge would be using the data.table library. You can do
library(data.table)
#convert to data.table and set keys
ttf<-setDT(TwistTieFix)
setkey(ttf, color_match, julian_start)
dt<-setDT(Data)
setkey(dt, tag_color, julian_date)
#merge and extract columns
ttf[dt, roll=T][julian_start<julian_cut_off_date,list(coll_date,
julian_date=julian_start, tag_color=color_match, color_code)]
to get
coll_date julian_date tag_color color_code
1: 2000-09-29 112 blue b1
2: 2013-08-13 76 yellow y1
3: 2013-08-14 76 yellow y1
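If you would rather stay in base R, a similar result can probably be had with merge followed by a filter. This is a sketch under the assumption that a date matches when julian_start <= julian_date <= julian_cut_off_date; joined, in_window and result are names I made up. As for the warning in the question: getTagColor(Data$tag_color, Data$julian_date) is evaluated once with the whole columns before apply ever runs, so the if condition receives vectors instead of single values.

```r
TwistTieFix <- data.frame(
  color_match = c("yellow", "blue"),
  color_code = c("y1", "b1"),
  julian_start = c(75, 112),
  julian_cut_off_date = c(389, 430),
  stringsAsFactors = FALSE
)
Data <- data.frame(
  coll_date = c("2013-08-13", "2013-08-14", "2000-09-29"),
  julian_date = c(76, 76, 112),
  tag_color = c("yellow", "yellow", "blue"),
  stringsAsFactors = FALSE
)

# Pair every row of Data with the TwistTieFix rows of the same colour
joined <- merge(Data, TwistTieFix, by.x = "tag_color", by.y = "color_match")

# Keep only the pairs whose date falls inside the window for that colour
in_window <- joined$julian_date >= joined$julian_start &
  joined$julian_date <= joined$julian_cut_off_date
result <- joined[in_window, c("coll_date", "julian_date", "tag_color", "color_code")]
print(result)
```

Unlike the rolling join, this builds every colour match first and filters afterwards, which is fine for small tables but can get large if many rows share a colour.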
