I have a data frame with a few columns; one of them, rank, is an integer between 1 and 20. I want to create another column that contains a bin label such as "1-4", "5-10", "11-15", or "16-20".
What is the most effective way to do this?
The data frame looks like this (CSV format):
rank,name,info
1,steve,red
3,joe,blue
6,john,green
3,liz,yellow
15,jon,pink
I want to add another column to the data frame, so that it looks like this:
rank,name,info,binValue
1,steve,red,"1-4"
3,joe,blue,"1-4"
6,john,green,"5-10"
3,liz,yellow,"1-4"
15,jon,pink,"11-15"
My current approach is not working. I would like to keep the data frame intact and just add another column whose value depends on the range that df$rank falls in. Thank you.
See ?cut and specify breaks (and maybe labels). Intervals are closed on the right by default, so the breaks below give (0,4], (4,10], (10,15] and (15,20].
x$bins <- cut(x$rank, breaks=c(0,4,10,15,20), labels=c("1-4","5-10","11-15","16-20"))
x
# rank name info bins
# 1 1 steve red 1-4
# 2 3 joe blue 1-4
# 3 6 john green 5-10
# 4 3 liz yellow 1-4
# 5 15 jon pink 11-15
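To make the interval behaviour of cut concrete, here is a minimal sketch, assuming the default right = TRUE. The vector ranks is made-up illustration data chosen to sit on the bin edges; it is not from the question.

```r
# Illustrative ranks only (bin edges), not the question's data.
ranks <- c(1, 4, 5, 10, 11, 15, 16, 20)

# With right = TRUE (the default) the intervals are
# (0,4], (4,10], (10,15], (15,20], so the first break must be 0, not 1.
bins <- cut(ranks,
            breaks = c(0, 4, 10, 15, 20),
            labels = c("1-4", "5-10", "11-15", "16-20"))

# cut() returns a factor; convert it if plain strings are wanted.
bin_labels <- as.character(bins)
print(data.frame(ranks, bin_labels))
```

Values outside the outermost breaks would come back as NA, so it is worth checking that the breaks cover the full range of the column.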
dat <- "rank,name,info
1,steve,red
3,joe,blue
6,john,green
3,liz,yellow
15,jon,pink"
x <- read.table(textConnection(dat), header=TRUE, sep=",", stringsAsFactors=FALSE)
x$bins <- cut(x$rank, breaks=seq(0, 20, 5), labels=c("1-5", "6-10", "11-15", "16-20"))
x
rank name info bins
1 1 steve red 1-5
2 3 joe blue 1-5
3 6 john green 6-10
4 3 liz yellow 1-5
5 15 jon pink 11-15
We can use smart_cut from the cutr package:
# devtools::install_github("moodymudskipper/cutr")
library(cutr)
Using @Andrie's sample data:
x$bins <- smart_cut(x$rank,
c(1,5,11,16),
labels = ~paste0(.y[1],'-',.y[2]-1),
simplify = FALSE)
# rank name info bins
# 1 1 steve red 1-4
# 2 3 joe blue 1-4
# 3 6 john green 5-10
# 4 3 liz yellow 1-4
# 5 15 jon pink 11-15
more on cutr and smart_cut
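If installing a GitHub package is not an option, the same kind of labels can probably be generated from the lower bounds in base R. This is only a sketch: the names lower, upper_end and bin_labels are mine, and it assumes integer ranks with an overall maximum of 20.

```r
# Lower bound of each bin, plus the largest value the last bin may hold (assumed).
lower <- c(1, 5, 11, 16)
upper_end <- 20

# Upper bound of each bin: one less than the next lower bound, then the overall end.
upper <- c(lower[-1] - 1, upper_end)
bin_labels <- paste0(lower, "-", upper)

ranks <- c(1, 3, 6, 3, 15)
# right = FALSE gives left-closed intervals [1,5), [5,11), [11,16), [16,21).
bins <- cut(ranks, breaks = c(lower, upper_end + 1),
            labels = bin_labels, right = FALSE)
print(as.character(bins))
```

Deriving the labels from the breaks avoids the usual mismatch between hand-typed labels and the intervals they describe.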
I'm trying to use visNetwork to create a node diagram. However, my data is not in the correct format and I haven't been able to find any help on this on the internet.
My current data frame looks similar to this:
name town car color age school
John Bringham Swift Red 22 Brighton
Sarah Bringham Corolla Red 33 Rustal
Beth Burb Swift Blue 43 Brighton
Joe Spring Polo Black 18 Riding
I want to use this to create node and edge lists that can be used to build a visNetwork diagram.
I know that the "nodes" list will be made from the unique values in the "name" column but I'm not sure how I would use the rest of the data to create the "edges" list?
I was thinking it may be possible to group by each column and then read back the matches, weighting each edge by how many matches are found across the various group-by results, but I'm not sure how to implement this.
For example, Joe will not match with anyone because he shares no common columns with any of the others. John and Sarah will have a weight of 2 because they share two common columns.
Also open to solutions in python!
One option is to compare the rows pairwise and count the number of common values.
For instance for John (first row) and Sarah (second row):
sum(df[1,] == df[2,])
# 2
Then use combn() from the utils package to enumerate every pair of names up front (the %>% pipe needs magrittr or dplyr to be loaded):
nodes <- matrix(combn(df$name, 2), ncol = 2, byrow = TRUE) %>% as.data.frame()
nodes$V1 <- as.character(nodes$V1)
nodes$V2 <- as.character(nodes$V2)
nodes$weight <- NA
(nodes)
# V1 V2 weight
#1 John Sarah NA
#2 John Beth NA
#3 John Joe NA
#4 Sarah Beth NA
#5 Sarah Joe NA
#6 Beth Joe NA
Finally, a loop calculates the weight for each pair.
for(n in 1:nrow(nodes)){
name1 <- df[df$name == nodes$V1[n],]
name2 <- df[df$name == nodes$V2[n],]
nodes$weight[n] <- sum(name1 == name2)
}
# V1 V2 weight
#1 John Sarah 2
#2 John Beth 2
#3 John Joe 0
#4 Sarah Beth 0
#5 Sarah Joe 0
#6 Beth Joe 0
I think nodes is the kind of data frame you can use in visNetwork(), as the edge list once V1 and V2 are renamed to from and to.
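For completeness, here is a base R sketch that goes from the question's table to separate node and edge data frames. The column names id, label, from, to and value reflect my understanding of what visNetwork expects, so treat them as assumptions and check the package documentation. The call to visNetwork itself is left commented out.

```r
# Sample data from the question.
df <- data.frame(
  name   = c("John", "Sarah", "Beth", "Joe"),
  town   = c("Bringham", "Bringham", "Burb", "Spring"),
  car    = c("Swift", "Corolla", "Swift", "Polo"),
  color  = c("Red", "Red", "Blue", "Black"),
  age    = c(22, 33, 43, 18),
  school = c("Brighton", "Rustal", "Brighton", "Riding"),
  stringsAsFactors = FALSE
)

# One row per unordered pair of names.
pairs <- t(combn(df$name, 2))
attrs <- setdiff(names(df), "name")

# Weight = number of attribute columns on which the two people agree.
weight <- apply(pairs, 1, function(p) {
  a <- df[df$name == p[1], attrs]
  b <- df[df$name == p[2], attrs]
  sum(a == b)
})

# Column names below are assumptions about what visNetwork expects.
nodes <- data.frame(id = df$name, label = df$name, stringsAsFactors = FALSE)
edges <- data.frame(from = pairs[, 1], to = pairs[, 2], value = weight,
                    stringsAsFactors = FALSE)
edges <- edges[edges$value > 0, ]  # drop pairs with nothing in common
print(edges)
# visNetwork::visNetwork(nodes, edges)  # uncomment if visNetwork is installed
```

Joe still appears in nodes, so he would be drawn as an isolated node rather than disappearing from the diagram.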
We have a daily meeting when participants nominate each other to speak. The first person is chosen randomly.
I have a dataframe that consists of names and the order of speech every day.
The columns are day1, day2, day3, etc.
The data in the rows are numbers, meaning the order of speech on that particular day.
NA means that the person did not participate on that day.
Name day1 day2 day3 day4 ...
Albert 1 3 1 ...
Josh 2 2 NA
Veronica 3 5 3
Tim 4 1 2
Stew 5 4 4
...
I want to do two analyses. First, I want to create a data frame showing whom each person has chosen most often. (I know the result depends on whether a participant has already spoken that day and therefore cannot be nominated again. I will handle that later; for now this is enough.)
It should look like this:
Name Favorite
Albert Stew
Josh Veronica
Veronica Tim
Tim Stew
...
My questions (feel free to answer only one if you can):
1. What code should I use so that I do not have to manually put the names in a different data frame?
2. How should I handle a tie, for example if Josh chose Veronica and Tim the same number of times? Later I want to visualise it and I have no idea how to handle ties.
I also would like to analyse the results to visualise strong connections.
For example, to show that there are people who usually choose each other, etc.
Is there a good package that is specialised for these? Or how should I get to it?
I do not need DNA sequences, only simple ones like these, but I have not found a suitable package yet.
Thanks for your help!
If I understand your problem correctly, here is some code that counts how often each speaker chose each other participant as the next speaker. I added a fourth day so that some counts are greater than 1. There are ties in the result; taking the first row of each group by speaker ('who') may be one way to resolve them:
df <- read.table(textConnection(
"Name,day1,day2,day3,day4
Albert,1,3,1,3
Josh,2,2,,2
Veronica,3,5,3,1
Tim,4,1,2,4
Stew,5,4,4,5"),header=TRUE,sep=",",stringsAsFactors=FALSE)
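library(dplyr)  # provides %>%, lead(), filter(), group_by(), summarise(), arrange()
purrr::map(colnames(df)[-1],
  function (x) {
    # speaking order for day x; absent people (NA) are dropped
    who <- df$Name[order(df[[x]],na.last=NA)]
    data.frame(who,lead(who),stringsAsFactors=FALSE)
  }
) %>%
  replyr::replyr_bind_rows() %>%
  filter(!is.na(lead.who.)) %>%
  group_by(who,lead.who.) %>% summarise(n=n()) %>%
  arrange(who,desc(n))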
Input:
Name day1 day2 day3 day4
1 Albert 1 3 1 3
2 Josh 2 2 NA 2
3 Veronica 3 5 3 1
4 Tim 4 1 2 4
5 Stew 5 4 4 5
Result:
# A tibble: 12 x 3
# Groups: who [5]
who lead.who. n
<chr> <chr> <int>
1 Albert Tim 2
2 Albert Josh 1
3 Albert Stew 1
4 Josh Albert 2
5 Josh Veronica 1
6 Stew Veronica 1
7 Tim Stew 2
8 Tim Josh 1
9 Tim Veronica 1
10 Veronica Josh 1
11 Veronica Stew 1
12 Veronica Tim 1
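On the tie question, one possible approach is to keep every choice that reaches a speaker's maximum count, rather than picking one arbitrarily. Below is a base R sketch of that idea using the same four-day data. The names pairs, counts and favorites are mine, and keeping all tied rows is a design choice, not the only option.

```r
df <- read.csv(text = "Name,day1,day2,day3,day4
Albert,1,3,1,3
Josh,2,2,,2
Veronica,3,5,3,1
Tim,4,1,2,4
Stew,5,4,4,5", stringsAsFactors = FALSE)

# For each day, list (speaker, next speaker) pairs in speaking order.
pairs <- do.call(rbind, lapply(names(df)[-1], function(day) {
  who <- df$Name[order(df[[day]], na.last = NA)]  # absent people (NA) are dropped
  data.frame(who = head(who, -1), chose = tail(who, -1),
             stringsAsFactors = FALSE)
}))

# Count each pair, then keep every pair that reaches the speaker's maximum.
counts <- as.data.frame(table(who = pairs$who, chose = pairs$chose),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
is_top <- counts$Freq == ave(counts$Freq, counts$who, FUN = max)
favorites <- counts[is_top, ]
favorites <- favorites[order(favorites$who, favorites$chose), ]
print(favorites)
```

A speaker with a tie simply gets several rows, which maps naturally onto several edges of equal weight when the result is visualised as a graph.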
I know it is fundamental but I can't find the trick ...
Here is an example:
Species <- c("dark frog",rep(c("elephant","tiger","boa"),3),"black mamba")
Year <- c(rep(2011,4),rep(2012,3),rep(2013,4))
Abundance <- c(2,4,5,6,9,2,1,5,6,8,4)
df <- data.frame(Species, Year, Abundance)
I would like to obtain another data frame (3 rows × 5 columns) holding the abundance values, with the species as column names (each species appearing only once) and the years as row names (each also appearing once).
Could someone help me, please?
You mean something like this?
> xtabs(Abundance~Year+Species, data=df)
Species
Year black mamba boa dark frog elephant tiger
2011 0 6 2 4 5
2012 0 1 0 9 2
2013 4 8 0 5 6
The class for the above is a table, so if you prefer a data.frame instead, you can try:
library(tidyr)
new.df <- spread(df, key = Species, value = Abundance)
Year black mamba boa dark frog elephant tiger
1 2011 NA 6 2 4 5
2 2012 NA 1 NA 9 2
3 2013 4 8 NA 5 6
If you want 0s instead of NA, add the following line:
new.df[is.na(new.df)] <- 0
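If you would rather stay in base R and keep the years as row names, as the question asks, the xtabs result can be converted directly. This is a sketch of that route; the names tab and wide are mine.

```r
Species <- c("dark frog", rep(c("elephant", "tiger", "boa"), 3), "black mamba")
Year <- c(rep(2011, 4), rep(2012, 3), rep(2013, 4))
Abundance <- c(2, 4, 5, 6, 9, 2, 1, 5, 6, 8, 4)
df <- data.frame(Species, Year, Abundance)

# xtabs sums Abundance per Year/Species cell; missing combinations become 0.
tab <- xtabs(Abundance ~ Year + Species, data = df)

# as.data.frame.matrix keeps the wide layout:
# years as row names, one column per species.
wide <- as.data.frame.matrix(tab)
print(wide)
```

Note that plain as.data.frame(tab) would give the long format back, which is why the matrix method is called explicitly.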
I have a data frame created by this chunk of code:
df <- dplyr::data_frame(
id = c(1, 2, 3),
name = c('Jack', 'Peter', 'Sam'),
role = list(
c('Manager', 'Analyst'),
c('Analyst', 'Advisor'),
c('Analyst')
),
fav_color = list(
c('White', 'Blue'),
c('Black', 'Red'),
c('White', 'Red')
)
)
Each row in the role and fav_color columns contains a vector of characters instead of a single string. I want to spread the values into separate rows like this:
id name role fav_color
------------------------------
1 Jack Manager White
1 Jack Manager Blue
1 Jack Analyst White
1 Jack Analyst Blue
2 Peter Analyst Black
2 Peter Analyst Red
2 Peter Advisor Black
2 Peter Advisor Red
3 Sam Analyst White
3 Sam Analyst Red
I tried purrr and tidyjson but didn't get very far.
Can anyone give me some advice? Much appreciated.
You can do this with dplyr fairly easily:
df %>% rowwise %>% do(expand.grid(., stringsAsFactors=FALSE))
# id name role fav_color
# * <dbl> <chr> <chr> <chr>
# 1 1 Jack Manager White
# 2 1 Jack Analyst White
# 3 1 Jack Manager Blue
# 4 1 Jack Analyst Blue
# 5 2 Peter Analyst Black
# 6 2 Peter Advisor Black
# 7 2 Peter Analyst Red
# 8 2 Peter Advisor Red
# 9 3 Sam Analyst White
# 10 3 Sam Analyst Red
Here we use the base function expand.grid to find all combinations of the list values, applied to each row via rowwise and do.
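The same idea can be written in base R alone, which may help to see what happens per row. This is a sketch that stores the list columns in an ordinary data.frame instead of a tibble; the name expanded is mine.

```r
# Data copied from the question, with list-columns added to a plain data.frame.
df <- data.frame(id = c(1, 2, 3), name = c("Jack", "Peter", "Sam"),
                 stringsAsFactors = FALSE)
df$role <- list(c("Manager", "Analyst"), c("Analyst", "Advisor"), "Analyst")
df$fav_color <- list(c("White", "Blue"), c("Black", "Red"), c("White", "Red"))

# For each row, expand.grid builds every role x fav_color combination,
# repeating the scalar id and name alongside.
expanded <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  expand.grid(id = df$id[i], name = df$name[i],
              role = df$role[[i]], fav_color = df$fav_color[[i]],
              stringsAsFactors = FALSE)
}))
print(expanded)
```

expand.grid varies its first non-constant argument fastest, so within each person role changes before fav_color, matching the row order of the dplyr output above.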