Order a variable and separate by another variable R - r

I want to order() a variable by the number of characters it has and then separate that variable based on gender, f and m. How can I do that using R?

Calling your data d,
d[order(d$sex, nchar(as.character(d$name))), ]
# name sex
# 2 Alex Berniugn female
# 5 Aaron Poss male
# 1 Alex Josh Adams male
# 4 Potter Allbinore male
# 3 Abbigol Alexander male

Related

Create weight node and edges lists from a normal dataframe in R?

I'm trying to use visNetwork to create a node diagram. However, my data is not in the correct format and I haven't been able to find any help on this on the internet.
My current data frame looks similar to this:
name town car color age school
John Bringham Swift Red 22 Brighton
Sarah Bringham Corolla Red 33 Rustal
Beth Burb Swift Blue 43 Brighton
Joe Spring Polo Black 18 Riding
I'm wanting to change use this to create nodes and edges lists that can be used to create a vis network.
I know that the "nodes" list will be made from the unique values in the "name" column but I'm not sure how I would use the rest of the data to create the "edges" list?
I was thinking that it may be possible to group by each column and then read back the matches from this function but I am not sure how to implement this. The idea that I thought of is to weight the edges based on how many matches they detect in the various group by functions. I'm not sure how to actually implement this yet.
For example, Joe will not match with anyone because he shares no common columns with any of the others. John and Sarah will have a weight of 2 because they share two common columns.
Also open to solutions in python!
One option is to compar row by row, in order to calculate the number of commun values.
For instance for John (first row) and Sarah (second row):
sum(df[1,] == df[2,])
# 2
Then you use the function combn() from library utils to know in advance the number of pair-combinaison you have to calculate:
nodes <- matrix(combn(df$name, 2), ncol = 2, byrow = T) %>% as.data.frame()
nodes$V1 <- as.character(nodes$V1)
nodes$V2 <- as.character(nodes$V2)
nodes$weight <- NA
(nodes)
# V1 V2 weight
#1 John Sarah NA
#2 John Beth NA
#3 John Joe NA
#4 Sarah Beth NA
#5 Sarah Joe NA
#6 Beth Joe NA
Finally a loop to calculate weight for each node.
for(n in 1:nrow(nodes)){
name1 <- df[df$name == nodes$V1[n],]
name2 <- df[df$name == nodes$V2[n],]
nodes$weight[n] <- sum(name1 == name2)
}
# V1 V2 weight
#1 John Sarah 2
#2 John Beth 2
#3 John Joe 0
#4 Sarah Beth 0
#5 Sarah Joe 0
#6 Beth Joe 0
I think node will be the kind of dataframe that you can use in the function visNetwork().

Find the favorite and analyse sequence questions in R

We have a daily meeting when participants nominate each other to speak. The first person is chosen randomly.
I have a dataframe that consists of names and the order of speech every day.
I have a day1, a day2 ,a day3 , etc. in the columns.
The data in the rows are numbers, meaning the order of speech on that particular day.
NA means that the person did not participate on that day.
Name day1 day2 day3 day4 ...
Albert 1 3 1 ...
Josh 2 2 NA
Veronica 3 5 3
Tim 4 1 2
Stew 5 4 4
...
I want to create two analysis, first, I want to create a dataframe who has chosen who the most times. (I know that the result depends on if a participant was nominated before and therefore on that day that participant cannot be nominated again, I will handle it later, but for now this is enough)
It should look like this:
Name Favorite
Albert Stew
Josh Veronica
Veronica Tim
Tim Stew
...
My questions (feel free to answer only one if you can):
1. What code shall I use for it without having to manunally put the names in a different dataframe?
2. How shall I handle a tie, for example Josh chose Veronica and Tim first the same number of times? Later I want to visualise it and I have no idea how to handle ties.
I also would like to analyse the results to visualise strong connections.
Like to show that there are people who usually chose each other, etc.
Is there a good package that is specialised for these? Or how should I get to it?
I do not need DNA sequences, only this simple ones, but I have not found a suitable one yet.
Thanks for your help!
If I am not misunderstanding your problem, here is some code to get the number of occurences of who choose who as next speaker. I added a fourth day to have some count that is not 1. There are ties in the result, choosing the first couple of each group by speaker ('who') may be a solution :
df <- read.table(textConnection(
"Name,day1,day2,day3,day4
Albert,1,3,1,3
Josh,2,2,,2
Veronica,3,5,3,1
Tim,4,1,2,4
Stew,5,4,4,5"),header=TRUE,sep=",",stringsAsFactors=FALSE)
purrr::map(colnames(df)[-1],
function (x) {
who <- df$Name[order(df[x],na.last=NA)]
data.frame(who,lead(who),stringsAsFactors=FALSE)
}
) %>%
replyr::replyr_bind_rows() %>%
filter(!is.na(lead.who.)) %>%
group_by(who,lead.who.) %>% summarise(n=n()) %>%
arrange(who,desc(n))
Input:
Name day1 day2 day3 day4
1 Albert 1 3 1 3
2 Josh 2 2 NA 2
3 Veronica 3 5 3 1
4 Tim 4 1 2 4
5 Stew 5 4 4 5
Result:
# A tibble: 12 x 3
# Groups: who [5]
who lead.who. n
<chr> <chr> <int>
1 Albert Tim 2
2 Albert Josh 1
3 Albert Stew 1
4 Josh Albert 2
5 Josh Veronica 1
6 Stew Veronica 1
7 Tim Stew 2
8 Tim Josh 1
9 Tim Veronica 1
10 Veronica Josh 1
11 Veronica Stew 1
12 Veronica Tim 1

Ordering alphabetically after ordering once numerically [duplicate]

Supose I have a data frame with 3 columns (name, y, sex) where name is character, y is a numeric value and sex is a factor.
sex<-c("M","M","F","M","F","M","M","M","F")
x<-c("MARK","TOM","SUSAN","LARRY","EMMA","LEONARD","TIM","MATT","VIOLET")
name<-as.character(x)
y<-rnorm(9,8,1)
score<-data.frame(x,y,sex)
score
name y sex
1 MARK 6.767086 M
2 TOM 7.613928 M
3 SUSAN 7.447405 F
4 LARRY 8.040069 M
5 EMMA 8.306875 F
6 LEONARD 8.697268 M
7 TIM 10.385221 M
8 MATT 7.497702 M
9 VIOLET 10.177969 F
If I wanted to order it by y I would use:
score[order(score$y),]
x y sex
1 MARK 6.767086 M
3 SUSAN 7.447405 F
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
5 EMMA 8.306875 F
6 LEONARD 8.697268 M
9 VIOLET 10.177969 F
7 TIM 10.385221 M
So far, so good... The names keep the correct score BUT how could I reorder it to have M and F levels not mixed. I need to order and at the same time keep factor levels separated.
Finally I would like to take a step further to involve character, the example doesn't help, but what if there were tied y values and I would have to order again within factor (e.g. TIM and TOM got 8.4 and I have to assign alphabetical order).
I was thinking about by function but it creates a list and doesn't help really. I think there must be some function like it to apply on data frames and get data frames as return.
TO MAKE CLEAR THE POINT:
sep<-split(score,score$sex)
sep$M<-sep$M[order(sep$M[,2]),]
sep$M
x y sex
1 MARK 6.767086 M
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
6 LEONARD 8.697268 M
7 TIM 10.385221 M
sep$F<-sep$F[order(sep$F[,2]),]
sep$F
x y sex
3 SUSAN 7.447405 F
5 EMMA 8.306875 F
9 VIOLET 10.177969 F
merged<-rbind(sep$M,sep$F)
merged
x y sex
1 MARK 6.767086 M
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
6 LEONARD 8.697268 M
7 TIM 10.385221 M
3 SUSAN 7.447405 F
5 EMMA 8.306875 F
9 VIOLET 10.177969 F
I know how to do that if I have 2 or 3 factors. But what if I had serious levels of factors, say 20, should I write a for loop?
order takes multiple arguments, and it does just what you want:
with(score, score[order(sex, y, x),])
## x y sex
## 3 SUSAN 6.636370 F
## 5 EMMA 6.873445 F
## 9 VIOLET 8.539329 F
## 6 LEONARD 6.082038 M
## 2 TOM 7.812380 M
## 8 MATT 8.248374 M
## 4 LARRY 8.424665 M
## 7 TIM 8.754023 M
## 1 MARK 8.956372 M
Here is a summary of all methods mentioned in other answers/comments (to serve future searchers). I've added a data.table way of sorting.
# Base R
do.call(rbind, by(score, score$sex, function(x) x[order(x$y),]))
with(score, score[order(sex, y, x),])
score[order(score$sex,score$x),]
# Using plyr
arrange(score, sex,y)
ddply(score, c('sex', 'y'))
# Using `data.table`
library("data.table")
score_dt <- setDT(score)
# setting a key works sorts the data.table
setkey(score_dt,sex,x)
print(score_dt)
Here is Another question that deals with the same
I think there must be some function like it to apply on data frames
and get data frames as return
Yes there is:
library(plyr)
ddply(score, c('y', 'sex'))
It sounds to me like you're trying to order by score within the males and females and return a combined data frame of sorted males and sorted females.
You are right that by(score, score$sex, function(x) x[order(x$y),]) returns a list of sorted data frames, one for male and one for female. You can use do.call with the rbind function to combine these data frames into a single final data frame:
do.call(rbind, by(score, score$sex, function(x) x[order(x$y),]))
# x y sex
# F.5 EMMA 7.526866 F
# F.9 VIOLET 8.182407 F
# F.3 SUSAN 9.677511 F
# M.4 LARRY 6.929395 M
# M.8 MATT 7.970015 M
# M.7 TIM 8.297137 M
# M.6 LEONARD 8.845588 M
# M.2 TOM 9.035948 M
# M.1 MARK 10.082314 M
I believe that the person asked how to sort it by the orders in the case of say 20.
I know how to do that if I have 2 or 3 factors. But what if I had
serious levels of factors, say 20, should I write a for loop?
I have one where there are 9 orders with various counts.
stage_name count
<ord> <int>
1 Closed Lost 957
2 Closed Won 1413
3 Evaluation 1773
4 Meeting Scheduled 4104
5 Nurture 1222
6 Opportunity Disqualified 805
7 Order Submitted 1673
8 Qualifying 5138
9 Quoted 4976
In this case you can see that it is displayed using alphabetical order of stage_name, but stage_name is actually an ordered factor that has a very different order.
This code orders the factor is a much different order:
# Make categoricals ----
check_stage$stage_name = ordered(check_stage$stage_name, levels=c(
'Opportunity Disqualified',
'Qualifying',
'Evaluation',
'Meeting Scheduled',
'Quoted',
'Order Submitted',
'Closed Won',
'Closed Lost',
'Nurture'))
Now we can just apply the factor as the method of ordering this is a dplyr function, but you might need forcats too. I have both libraries installed:
check_stage <- check_stage %>%
arrange(factor(stage_name))
This now gives the output in the factor order as desired:
check_stage
# A tibble: 9 × 2
stage_name count
<ord> <int>
1 Opportunity Disqualified 805
2 Qualifying 5138
3 Evaluation 1773
4 Meeting Scheduled 4104
5 Quoted 4976
6 Order Submitted 1673
7 Closed Won 1413
8 Closed Lost 957
9 Nurture 1222

Sort data frame column by factor

Supose I have a data frame with 3 columns (name, y, sex) where name is character, y is a numeric value and sex is a factor.
sex<-c("M","M","F","M","F","M","M","M","F")
x<-c("MARK","TOM","SUSAN","LARRY","EMMA","LEONARD","TIM","MATT","VIOLET")
name<-as.character(x)
y<-rnorm(9,8,1)
score<-data.frame(x,y,sex)
score
name y sex
1 MARK 6.767086 M
2 TOM 7.613928 M
3 SUSAN 7.447405 F
4 LARRY 8.040069 M
5 EMMA 8.306875 F
6 LEONARD 8.697268 M
7 TIM 10.385221 M
8 MATT 7.497702 M
9 VIOLET 10.177969 F
If I wanted to order it by y I would use:
score[order(score$y),]
x y sex
1 MARK 6.767086 M
3 SUSAN 7.447405 F
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
5 EMMA 8.306875 F
6 LEONARD 8.697268 M
9 VIOLET 10.177969 F
7 TIM 10.385221 M
So far, so good... The names keep the correct score BUT how could I reorder it to have M and F levels not mixed. I need to order and at the same time keep factor levels separated.
Finally I would like to take a step further to involve character, the example doesn't help, but what if there were tied y values and I would have to order again within factor (e.g. TIM and TOM got 8.4 and I have to assign alphabetical order).
I was thinking about by function but it creates a list and doesn't help really. I think there must be some function like it to apply on data frames and get data frames as return.
TO MAKE CLEAR THE POINT:
sep<-split(score,score$sex)
sep$M<-sep$M[order(sep$M[,2]),]
sep$M
x y sex
1 MARK 6.767086 M
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
6 LEONARD 8.697268 M
7 TIM 10.385221 M
sep$F<-sep$F[order(sep$F[,2]),]
sep$F
x y sex
3 SUSAN 7.447405 F
5 EMMA 8.306875 F
9 VIOLET 10.177969 F
merged<-rbind(sep$M,sep$F)
merged
x y sex
1 MARK 6.767086 M
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
6 LEONARD 8.697268 M
7 TIM 10.385221 M
3 SUSAN 7.447405 F
5 EMMA 8.306875 F
9 VIOLET 10.177969 F
I know how to do that if I have 2 or 3 factors. But what if I had serious levels of factors, say 20, should I write a for loop?
order takes multiple arguments, and it does just what you want:
with(score, score[order(sex, y, x),])
## x y sex
## 3 SUSAN 6.636370 F
## 5 EMMA 6.873445 F
## 9 VIOLET 8.539329 F
## 6 LEONARD 6.082038 M
## 2 TOM 7.812380 M
## 8 MATT 8.248374 M
## 4 LARRY 8.424665 M
## 7 TIM 8.754023 M
## 1 MARK 8.956372 M
Here is a summary of all methods mentioned in other answers/comments (to serve future searchers). I've added a data.table way of sorting.
# Base R
do.call(rbind, by(score, score$sex, function(x) x[order(x$y),]))
with(score, score[order(sex, y, x),])
score[order(score$sex,score$x),]
# Using plyr
arrange(score, sex,y)
ddply(score, c('sex', 'y'))
# Using `data.table`
library("data.table")
score_dt <- setDT(score)
# setting a key works sorts the data.table
setkey(score_dt,sex,x)
print(score_dt)
Here is Another question that deals with the same
I think there must be some function like it to apply on data frames
and get data frames as return
Yes there is:
library(plyr)
ddply(score, c('y', 'sex'))
It sounds to me like you're trying to order by score within the males and females and return a combined data frame of sorted males and sorted females.
You are right that by(score, score$sex, function(x) x[order(x$y),]) returns a list of sorted data frames, one for male and one for female. You can use do.call with the rbind function to combine these data frames into a single final data frame:
do.call(rbind, by(score, score$sex, function(x) x[order(x$y),]))
# x y sex
# F.5 EMMA 7.526866 F
# F.9 VIOLET 8.182407 F
# F.3 SUSAN 9.677511 F
# M.4 LARRY 6.929395 M
# M.8 MATT 7.970015 M
# M.7 TIM 8.297137 M
# M.6 LEONARD 8.845588 M
# M.2 TOM 9.035948 M
# M.1 MARK 10.082314 M
I believe that the person asked how to sort it by the orders in the case of say 20.
I know how to do that if I have 2 or 3 factors. But what if I had
serious levels of factors, say 20, should I write a for loop?
I have one where there are 9 orders with various counts.
stage_name count
<ord> <int>
1 Closed Lost 957
2 Closed Won 1413
3 Evaluation 1773
4 Meeting Scheduled 4104
5 Nurture 1222
6 Opportunity Disqualified 805
7 Order Submitted 1673
8 Qualifying 5138
9 Quoted 4976
In this case you can see that it is displayed using alphabetical order of stage_name, but stage_name is actually an ordered factor that has a very different order.
This code orders the factor is a much different order:
# Make categoricals ----
check_stage$stage_name = ordered(check_stage$stage_name, levels=c(
'Opportunity Disqualified',
'Qualifying',
'Evaluation',
'Meeting Scheduled',
'Quoted',
'Order Submitted',
'Closed Won',
'Closed Lost',
'Nurture'))
Now we can just apply the factor as the method of ordering this is a dplyr function, but you might need forcats too. I have both libraries installed:
check_stage <- check_stage %>%
arrange(factor(stage_name))
This now gives the output in the factor order as desired:
check_stage
# A tibble: 9 × 2
stage_name count
<ord> <int>
1 Opportunity Disqualified 805
2 Qualifying 5138
3 Evaluation 1773
4 Meeting Scheduled 4104
5 Quoted 4976
6 Order Submitted 1673
7 Closed Won 1413
8 Closed Lost 957
9 Nurture 1222

Deduplicate dataframe based on criteria in R?

I've got this dataframe:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18
As you can see the value "Jane" appears 3 times. What I want to do is to deduplicate the list based on the variable "Name" but because the rest of the columns are important to me, I want to keep the rows that have the most information in them. For example if I was to deduplicate the above file in excel, it would keep the first value of "Jane" and delete all the other ones. But the first value of "Jane" (row no3) has got missing information in the other columns.
So in other words I want to deduplicate the list by "Name" but add a criteria to keep the rows that have any other value different from "0" in the column "Age". This way the result I would get would be this:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane US F 30
4 Kate GB F 18
I have tried this
file3 <- file1[!duplicated(file1$Name),]
But like excel it keeps the value of "Jane" that has no usable information in the other columns.
How do I sort the rows based on column "Age" in a Z-A order so that anything that has "0" will be on the bottom and will be removed when I deduplicate the list?
Cheers
David
Try this trick
ind <- with(DF,
Country !=0 &
Gender %in% c('F', 'M') &
Age !=0)
DF[ind, ]
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
4 Jane US F 30
6 Kate GB F 18
So far it works well and produces your desired output
EDIT
library(doBy)
orderBy(~ -Age+Name, DF) # Sort decreasingly by Age and Name
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Or simply using Base functions:
DF[order(DF$Age, DF$Name, decreasing = TRUE), ]
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Now you can select by indexing the correct rows meeting your conditions, I really think the first part is better than these two lasts.
If all duplicated rows have the value zero in column Age, it will work with subset:
# the data
file1 <- read.table(text="Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18", header = TRUE, stringsAsFactors = FALSE)
# create a subset of the data
subset(file1, Age > 0)
# Name Country Gender Age
# 1 John GB M 25
# 2 Mark US M 35
# 4 Jane US F 30
# 6 Kate GB F 18

Resources