Average only duplicated rows and replace the value in a given column - R

I have a dataframe D:
surname name salary
Red A 1000
Green B 900
Green A 1100
Blue C 1000
Blue B 1000
Blue F 800
Violet F 1200
Some surnames appear in only one row; others are repeated.
I need to aggregate the rows only where surname is duplicated: average the values of salary and change the name to "X".
I tried something using duplicated(), but it leaves the first occurrence unchanged and only changes the later duplicates.
D$name<-replace(D$name,duplicated(D$surname),"X")
And also I was unable to average the values of salary.
Thank you!

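duplicated() flags only the second and later occurrences, so we can combine it with a second call using fromLast = TRUE to flag every row of a repeated surname: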
D$name <- replace(D$name,duplicated(D$surname)|duplicated(D$surname,
fromLast = TRUE),"X")
If we need to add a column with the per-surname average:
library(dplyr)
D %>%
  group_by(surname) %>%
  mutate(average = mean(salary))
data
D <- structure(list(surname = c("Red", "Green", "Green", "Blue", "Blue",
"Blue", "Violet"), name = c("A", "B", "A", "C", "B", "F", "F"
), salary = c(1000L, 900L, 1100L, 1000L, 1000L, 800L, 1200L)), class = "data.frame", row.names = c(NA,
-7L))
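The snippets above relabel the duplicated rows and add an average column, but they keep one row per original record. If the goal is a single row per duplicated surname, a minimal sketch (assuming dplyr is available; the result name `res` is only illustrative) could summarise each group instead:

```r
library(dplyr)

D <- data.frame(
  surname = c("Red", "Green", "Green", "Blue", "Blue", "Blue", "Violet"),
  name    = c("A", "B", "A", "C", "B", "F", "F"),
  salary  = c(1000, 900, 1100, 1000, 1000, 800, 1200)
)

# One row per surname: groups with more than one row get name "X",
# single-row groups keep their original name; salary is the group mean.
res <- D %>%
  group_by(surname) %>%
  summarise(name   = if (n() > 1) "X" else first(name),
            salary = mean(salary),
            .groups = "drop")

print(res)
```

Note that summarise() returns the groups sorted by surname, so the original row order is not preserved.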

R dataframe with values in the wrong columns

I have a dataframe like this one:
Name Characteristic_1 Characteristic_2
Apple Yellow Italian
Pear British Yellow
Strawberries French Red
Blackberry Blue Austrian
As you can see, a given characteristic can end up in different columns depending on the row. I would like to obtain a dataframe where each column contains only the values of one specific characteristic.
Name Characteristic_1 Characteristic_2
Apple Yellow Italian
Pear Yellow British
Strawberries Red French
Blackberry Blue Austrian
My idea is to use the case_when function, but I would like to know if there are faster ways to achieve the same result.
Example data:
df <- structure(list(Name = c("Apple", "Pear", "Strawberries", "Blackberry"
), Characteristic_1 = c("Yellow", "British", "French", "Blue"
), Characteristic_2 = c("Italian", "Yellow", "Red", "Austrian"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
I suspect there is an easier way of solving the issue, but here is one potential solution:
# Load the libraries
library(tidyverse)
# Load the data
df <- structure(list(Name = c("Apple", "Pear", "Strawberries", "Blackberry"
), Characteristic_1 = c("Yellow", "British", "French", "Blue"
), Characteristic_2 = c("Italian", "Yellow", "Red", "Austrian"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
# R has 657 built in colour names. You can see them using the `colours()` function.
# Chances are your colours are contained in this list.
# The `str_to_title()` function capitalizes every colour in the list
list_of_colours <- str_to_title(colours())
# If your colours are not contained in the list, add them using e.g.
# `list_of_colours <- c(list_of_colours, "Octarine")`
# Create a new dataframe ("df2") by taking the original dataframe ("df")
df2 <- df %>%
  # Create two new columns called "Colour" and "Origin" using `mutate()` with
  # `ifelse` used to identify whether each word is in the list of colours.
  # If the word in Characteristic_1 matches a colour, it goes to the "Colour"
  # column; if it doesn't, it goes to the "Origin" column.
  mutate(Colour = ifelse(!is.na(str_extract(Characteristic_1, paste(list_of_colours, collapse = "|"))),
                         Characteristic_1, Characteristic_2),
         Origin = ifelse(is.na(str_extract(Characteristic_1, paste(list_of_colours, collapse = "|"))),
                         Characteristic_1, Characteristic_2)) %>%
  # Then select the columns you want
  select(Name, Colour, Origin)
df2
# A tibble: 4 x 3
# Name Colour Origin
# <chr> <chr> <chr>
#1 Apple Yellow Italian
#2 Pear Yellow British
#3 Strawberries Red French
#4 Blackberry Blue Austrian
I think there is also a better way of achieving this but for now this is the one solution that came to my mind:
library(dplyr)
library(stringr)
df <- structure(list(Name = c("Apple", "Pear", "Strawberries", "Blackberry"
), Characteristic_1 = c("Yellow", "British", "French", "Blue"
), Characteristic_2 = c("Italian", "Yellow", "Red", "Austrian"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
df %>%
  mutate(char_1 = if_else(str_to_lower(Characteristic_1) %in% colours(distinct = TRUE),
                          Characteristic_1, Characteristic_2),
         char_2 = if_else(Characteristic_1 == char_1, Characteristic_2, Characteristic_1)) %>%
  select(-c(Characteristic_1, Characteristic_2))
# A tibble: 4 x 3
Name char_1 char_2
<chr> <chr> <chr>
1 Apple Yellow Italian
2 Pear Yellow British
3 Strawberries Red French
4 Blackberry Blue Austrian
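Both answers test whether Characteristic_1 holds a colour and pick columns accordingly. The same idea can be sketched in base R by swapping the two columns only in the rows where the first one is not a colour. This assumes, as in the answers above, that every colour in the data appears in R's built-in colours() list; the variable name `swap` is only illustrative.

```r
df <- data.frame(
  Name = c("Apple", "Pear", "Strawberries", "Blackberry"),
  Characteristic_1 = c("Yellow", "British", "French", "Blue"),
  Characteristic_2 = c("Italian", "Yellow", "Red", "Austrian"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose first characteristic is NOT a known colour name
swap <- !(tolower(df$Characteristic_1) %in% colours())

# Swap the two columns in those rows only; data frame assignment is
# positional, so reversing the column order on the right-hand side swaps them.
df[swap, c("Characteristic_1", "Characteristic_2")] <-
  df[swap, c("Characteristic_2", "Characteristic_1")]

print(df)
```

Because this is an exact, whole-word lookup rather than a regular-expression match, a value that merely contains a colour name as a substring is not treated as a colour.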

How to render igraph in R

I have the dataset created by the code below, and I need to get a graph like the one in the picture without changing the data frame. I tried using rbind to add more hierarchy to the data frame in order to get a diagram like the one in the picture. The col0 and col1 values change depending on the data, while col2 always remains the same.
library(igraph)
df <- data.frame(col0 = c("Cat Dog Wolf", "Cat Dog Wolf", "Cat Dog Wolf"),
                 col1 = c("Cat", "Dog", "Wolf"),
                 col2 = c("Feline", "Canis", "Canis2"))
df <- rbind(df, data.frame(col0 = "Cat Dog Wolf", col1 = "Canis2", col2 = "Canis"))
df <- df[c('col1', 'col2')]
names(df) <- c('from', 'to')
abc <- union(df$to, df$from)
# graph_from_data_frame() and layout_as_tree() are the current names of
# graph.data.frame() and layout.reingold.tilford()
g <- graph_from_data_frame(df, directed = TRUE, vertices = abc)
plot(g, vertex.size = 20, vertex.label.dist = 0.5, vertex.color = "blue",
     edge.arrow.size = 0.5, layout = layout_as_tree(g))
You need three edges, each defined by only two columns ("from" and "to"). But you have three columns in df, so you have to choose among them. I created a new column with the names from col2 and col1 pasted together. Then I took the first two rows as edges from the top vertex and used rbind to add the third edge.
library(igraph)
df <- data.frame(col0 = "Cat Dog Wolf",
                 col1 = c("Cat", "Dog", "Wolf"),
                 col2 = c("Feline", "Canis", "Canis2"))
df$col1_2 <- paste(df$col2, df$col1)
df <- rbind(df[1:2, c(1, 4)], data.frame(col0 = df[2, 4], col1_2 = df[3, 4]))
names(df) <- c('from', 'to')
abc <- union(df$to, df$from)
# graph_from_data_frame() and layout_as_tree() are the current names of
# graph.data.frame() and layout.reingold.tilford()
g <- graph_from_data_frame(df, directed = TRUE, vertices = abc)
plot(g, vertex.size = 20, vertex.label.dist = 0.5, vertex.color = c("lightblue","red","green","white"),
     edge.arrow.size = 0.5, layout = layout_as_tree(g))
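To make the resulting edge list easier to see, here is a minimal sketch that writes the three edges out by hand instead of deriving them from df. It assumes the igraph package is installed; the names `edges` and `lay` are only illustrative, and the vertex labels are the pasted col2/col1 strings produced by the answer above.

```r
library(igraph)

# The three edges the answer ends up with, written explicitly
edges <- data.frame(
  from = c("Cat Dog Wolf", "Cat Dog Wolf", "Canis Dog"),
  to   = c("Feline Cat",   "Canis Dog",    "Canis2 Wolf")
)

g <- graph_from_data_frame(edges, directed = TRUE)

# Tree layout with the top-level vertex as the explicit root
lay <- layout_as_tree(g, root = which(V(g)$name == "Cat Dog Wolf"))

plot(g, layout = lay, vertex.size = 20, vertex.label.dist = 0.5,
     edge.arrow.size = 0.5)
```

Passing the root explicitly avoids depending on how the layout function picks a root by default.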

Keeping the label order on the y-axis when using seqpcplot in TraMineR

I'm using the R package TraMineR and would like to plot frequent event sequences with the seqpcplot command. I named the states in the alphabet so that they sort alphabetically in the order I want. When I build the sequences with the seqdef command without specifying the labels and states options, I obtain the following output:
[>] state coding:
[alphabet] [label] [long label]
1 a.sin a.sin a.sin
2 b.co0 b.co0 b.co0
3 c.co1 c.co1 c.co1
4 d.co2+ d.co2+ d.co2+
5 e.ma0 e.ma0 e.ma0
6 f.ma1 f.ma1 f.ma1
7 g.ma2+ g.ma2+ g.ma2+
8 h.sin0 h.sin0 h.sin0
9 i.lp1 i.lp1 i.lp1
10 l.lp2+ l.lp2+ l.lp2+
11 m.lp1_18 m.lp1_18 m.lp1_18
12 n.lp2_18 n.lp2_18 n.lp2_18
I then convert the state-sequence object into an event-sequence object by using seqecreate. When plotting the event sequences with seqpcplot I obtain a very nice graph where the states are ordered alphabetically on the y-axis according to the alphabet.
However, I would like to use longer labels in the graphs, so I specified the labels and states options in the seqdef command as
lab<-c("single", "cohabNOchildren","cohab1child","cohab2+children","marrNOchildren","marr1child","marr2+children","singleNOchildren","loneMother1child","loneMother2+children","loneMother1child_over18","loneMother2+children_over18")
obtaining:
[>] state coding:
[alphabet] [label] [long label]
1 a.sin single single
2 b.co0 cohabNOchildren cohabNOchildren
3 c.co1 cohab1child cohab1child
4 d.co2+ cohab2+children cohab2+children
5 e.ma0 marrNOchildren marrNOchildren
6 f.ma1 marr1child marr1child
7 g.ma2+ marr2+children marr2+children
8 h.sin0 singleNOchildren singleNOchildren
9 i.lp1 loneMother1child loneMother1child
10 l.lp2+ loneMother2+children loneMother2+children
11 m.lp1_18 loneMother1child_over18 loneMother1child_over18
12 n.lp2_18 loneMother2+children_over18 loneMother2+children_over18
As before, I then computed the event sequences and plotted them by using seqpcplot:
seqpcplot(example.seqe,
          filter = list(type = "function",
                        value = "cumfreq",
                        level = 0.8),
          order.align = "last",
          ltype = "non-embeddable",
          cex = 1.5, lwd = .9,
          lcourse = "downwards")
This time the states on the y-axis are still ordered alphabetically, but following the order of the names given in the labels and states options rather than the order of the alphabet, which is what I want.
Is there a way to keep the order given by the alphabet when plotting with seqpcplot if the labels and states options are specified and sort differently from the alphabet?
Thanks.
I agree with the other answer. As a supplement, here are a number of possible solutions:
Using seqecreate and the alphabet argument in seqpcplot:
dat <- data.frame(id = factor(1, 1, 1),
timestamp = c(0, 20, 22),
event = factor(c("A", "B", "C")))
dat.seqe <- seqecreate(dat)
seqpcplot(dat.seqe, alphabet = c("C", "A", "B"))
Using seqecreate only
dat <- data.frame(id = factor(1, 1, 1),
timestamp = c(0, 20, 22),
event = factor(c("A", "B", "C"),levels = c("C", "A", "B")))
dat.seqe <- seqecreate(dat)
seqpcplot(dat.seqe)
Using seqdef (here the original categories are different than the labels to be shown in the y-axis)
dat <- data.frame(id = factor(1),
ev.0 = factor("AA", levels = c("CC", "AA", "BB")),
ev.20 = factor("BB", levels = c("CC", "AA", "BB")),
ev.22 = factor("CC", levels = c("CC", "AA", "BB")))
dat.seq <- seqdef(dat, var = 2:4, alphabet = c("CC", "AA", "BB"),
states = c("C", "A", "B"))
seqpcplot(dat.seq)
The last solution may be the one you're looking for. Hope it helps.
The alphabet argument of the seqpcplot function is there to control that order. Something like
seqpcplot(example.seqe,
          alphabet = lab,
          filter = list(type = "function",
                        value = "cumfreq",
                        level = 0.8),
          order.align = "last",
          ltype = "non-embeddable",
          cex = 1.5, lwd = .9,
          lcourse = "downwards")
should give you the expected plot.

Extracting values with if and put them in a new column

Maybe this is a very simple question, but I cannot figure out what is wrong with my short code.
This is my (very simple) data frame:
structure(list(sample = structure(c(1L, 2L, 1L, 1L, 1L, 2L, 3L,
3L, 3L), .Label = c("a", "b", "c"), class = "factor"), value = c(0.1446689595,
0.9151456018, 0.880888083, 0.005522657, 0.7079621046, 0.4770259836,
0.6960717649, 0.5892328324, 0.1134234308), new = c("red", "red",
"red", "red", "red", "red", "red", "red", "red")), .Names = c("sample",
"value", "new"), row.names = c(NA, -9L), class = "data.frame")
What I would like to do is add a new column whose values depend on the values of the first column. In other, simpler words:
if (df1$sample != "a") {
  df1$new <- "green"
} else {
  df1$new <- "red"
}
but R returns a warning:
In if (df1$sample != "a") { :
the condition has length > 1 and only the first element will be used
I also tried with an ifelse statement:
ifelse(df1$sample != "a", df1$new <- "green", df1$new <- "red")
but in this case the new column contains only "red" and no "green".
Am I missing something?
Thanks!
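You could try indexing a two-element vector: the comparison gives FALSE/TRUE, so adding 1L turns it into positions 1 and 2.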
df1$new <- c('green', 'red')[(df1$sample=='a')+1L]
df1
# sample value new
#1 a 0.144668959 red
#2 b 0.915145602 green
#3 a 0.880888083 red
#4 a 0.005522657 red
#5 a 0.707962105 red
#6 b 0.477025984 green
#7 c 0.696071765 green
#8 c 0.589232832 green
#9 c 0.113423431 green
ifelse should work fine - you just need to assign its result, and pass the values themselves rather than assignments as the second and third arguments:
df1$new1 <- ifelse(df1$sample != "a", "green", "red")
sample value new new1
1 a 0.144668959 red red
2 b 0.915145602 red green
3 a 0.880888083 red red
4 a 0.005522657 red red
5 a 0.707962105 red red
6 b 0.477025984 red green
7 c 0.696071765 red green
8 c 0.589232832 red green
9 c 0.113423431 red green
I would avoid using new as a variable name - it is the name of a function and this may cause issues.
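To illustrate why the vectorised form works where `if` does not, here is a small sketch on a cut-down version of the data (only the sample column; the column names `colour` and `colour2` are made up for this example). `if` expects a single TRUE/FALSE, whereas `ifelse` and logical indexing evaluate the condition row by row.

```r
df1 <- data.frame(sample = c("a", "b", "a", "a", "a", "b", "c", "c", "c"))

# Vectorised: one result per row
df1$colour <- ifelse(df1$sample != "a", "green", "red")

# Equivalent with logical indexing: start with a default, then overwrite
df1$colour2 <- "red"
df1$colour2[df1$sample != "a"] <- "green"

print(df1)
```

Both columns end up identical; the indexing form can be handy when only a subset of rows needs to change.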

colored categories in r wordclouds

Using the wordcloud package in R I would like to color different words according to a categorical variable in the dataset. Say my data is as follows:
name weight group
1 Aba 10 x
2 Bcd 20 y
3 Cde 30 z
4 Def 5 x
And here as a dput:
dat <- structure(list(name = c("Aba", "Bcd", "Cde", "Def"), weight = c(10,
20, 30, 5), group= c("x", "y", "z", "x")), .Names = c("name",
"weight", "group"), row.names = c(NA, -4L), class = "data.frame")
Is there a way in wordcloud() to color the names by their group (x, y, z) or should I use different software/packages?
wordcloud() chooses colors from the colors vector based on frequency by default, or assigns them to the words in order if ordered.colors = TRUE is specified.
library(wordcloud)
name = c("Aba","Bcd","Cde","Def")
weight = c(10,20,30,5)
colorlist = c("red","blue","green","red")
wordcloud(name, weight, colors=colorlist, ordered.colors=TRUE)
The example above works for standalone vectors. In a data frame, your color specification may be stored as a factor (the default for strings before R 4.0), in which case it has to be converted to text by wrapping it in as.character like this:
wordcloud(df$name, df$weight, colors=as.character(df$color), ordered.colors=TRUE)
If you just have a grouping variable and not a list of colors, you can generate a parallel colorlist with a couple of lines.
group = c("x","y","z","x")
# general solution for any number of categories
basecolors = rainbow(length(unique(group)))
# solution for known categories (use instead of the line above)
basecolors = c("red","green","blue")
# find position of group in list of groups, and select that matching color...
colorlist = basecolors[ match(group,unique(group)) ]
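Applied to the `dat` data frame from the question, the same lookup could be sketched as follows. The `color` column and the `group_colors` vector are names invented for this example, and the wordcloud call is guarded so the snippet still runs when the package is not installed.

```r
dat <- data.frame(
  name   = c("Aba", "Bcd", "Cde", "Def"),
  weight = c(10, 20, 30, 5),
  group  = c("x", "y", "z", "x"),
  stringsAsFactors = FALSE
)

# One color per distinct group, looked up by group name
group_colors <- c(x = "red", y = "green", z = "blue")
dat$color <- unname(group_colors[dat$group])

# Draw the cloud only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(dat$name, dat$weight,
                       colors = dat$color, ordered.colors = TRUE)
}
```

A named vector keeps the group-to-color mapping explicit, so reordering the rows of dat does not change which color a group receives.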