I have a dataset called test.
I need to write code for the following:
Q1. What does Jim do the most? The answer will be runs, 10.
Q2. What are the three things Mike does least? The answer will be walks 6, runs 4, drives 4.
Q3. Who travels furthest? The answer will be Jim, 40.
This will give you an idea of how to put it into tidy format, and start to answer the questions.
library(tidyverse)
df <- data.frame(stringsAsFactors=FALSE,
name = c("paul", "john", "mike", "jim"),
walks = c(10L, 9L, 6L, 7L),
runs = c(6L, 5L, 4L, 10L),
cycles = c(2L, 5L, 8L, 9L),
drives = c(2L, 3L, 4L, 5L),
flys = c(2L, 6L, 8L, 9L)
)
df
df <- df %>% gather(key = transport, value = "freq", walks:flys)
df
df %>% filter(name == "jim") %>%
group_by(transport) %>%
arrange(desc(freq))
which gives you an output table like:
# A tibble: 5 x 3
# Groups:   transport [5]
  name  transport  freq
  <chr> <chr>     <int>
1 jim   runs         10
2 jim   cycles        9
3 jim   flys          9
4 jim   walks         7
5 jim   drives        5
which lets you answer Q1.
Notice that gather() is used to make the data in the tidy format, like:
  name transport freq
1 paul     walks   10
2 john     walks    9
3 mike     walks    6
4  jim     walks    7
5 paul      runs    6
This looks like your homework, so it's probably better for you to figure out the rest for yourself, but this will get you on the right track.
Look into the functions in dplyr that you need.
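For reference, gather() has since been superseded in tidyr by pivot_longer(). Below is a minimal sketch of the same reshaping, assuming tidyr 1.0 or later and dplyr 1.0 or later are installed. The object names long and jim_top are illustrative and not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Same example data as above
df <- data.frame(
  name   = c("paul", "john", "mike", "jim"),
  walks  = c(10L, 9L, 6L, 7L),
  runs   = c(6L, 5L, 4L, 10L),
  cycles = c(2L, 5L, 8L, 9L),
  drives = c(2L, 3L, 4L, 5L),
  flys   = c(2L, 6L, 8L, 9L),
  stringsAsFactors = FALSE
)

# Reshape to one row per (name, transport) pair
long <- df %>%
  pivot_longer(cols = walks:flys, names_to = "transport", values_to = "freq")

# Q1: keep only jim's row(s) with the highest frequency
jim_top <- long %>%
  filter(name == "jim") %>%
  slice_max(freq, n = 1)

print(jim_top)
```

Note that pivot_longer() orders the rows person by person, whereas gather() orders them transport by transport; the content is the same.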
I have a table (based on a .csv file named "bikes"), which contains around 10,000 rows and looks like this:
rentals season weatherCondition
12 1 4
6 4 1
21 1 4
4 3 1
5 3 2
19 1 4
13 1 3
10 2 4
8 2 3
.. .. ..
My task is to visualise the relationship between "rentals" and "season" based on "weatherCondition".
So far I managed to do the following:
library(tidyverse)
ggplot(data=bikes, mapping = aes(x = season, y = rentals)) +
geom_point(aes(color = weatherCondition)) +
theme_bw()
The result is a scatter plot of rentals against season, coloured by weather condition (image not included).
I assume that I'm heading in the right direction, but it's far from perfect.
It doesn't visualise the data perfectly.
I assume that providing more information regarding the correlation, the connection or the relationship between the variables would be beneficial.
How would you solve this problem?
I would summarize the data by weather condition and season, provided there are not thousands of distinct weather conditions. Ideally both can be named, like "rain", "summer", etc. In either case, the following works.
Data
dat <- structure(list(rentals = c(12L, 6L, 21L, 4L, 5L, 19L, 13L, 10L,
8L), season = c(1L, 4L, 1L, 3L, 3L, 1L, 1L, 2L, 2L), weatherCondition = c(4L,
1L, 4L, 1L, 2L, 4L, 3L, 4L, 3L)), class = "data.frame", row.names = c(NA,
-9L))
res <- data.frame( aggregate( rentals ~ weatherCondition + season, dat, sum ))
# for numerical data
  weatherCondition season rentals
1                3      1      13
2                4      1      52
3                3      2       8
4                4      2      10
5                1      3       4
6                2      3       5
7                1      4       6
# for named conditions and seasons
  weatherCondition season rentals
1            rainy   fall       6
2            foggy spring       8
3            snowy spring      10
4            rainy summer       4
5            sunny summer       5
6            foggy winter      13
7            snowy winter      52
The plotting
barplot( res[,"rentals"], names=apply( res[,1:2], 1, function(x) paste0(x[1],"_",x[2]) ), las=3 )
Obviously, the labels don't make much sense here in my toy example.
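If the numeric codes are replaced by labels before aggregating, the bar labels become readable. Here is a sketch, assuming the code-to-label mapping implied by the two outputs above (seasons 1 to 4 as winter, spring, summer, fall; weather conditions 1 to 4 as rainy, sunny, foggy, snowy). The real meaning of the codes in the bikes data may well differ.

```r
# Same toy data as above
dat <- data.frame(
  rentals          = c(12L, 6L, 21L, 4L, 5L, 19L, 13L, 10L, 8L),
  season           = c(1L, 4L, 1L, 3L, 3L, 1L, 1L, 2L, 2L),
  weatherCondition = c(4L, 1L, 4L, 1L, 2L, 4L, 3L, 4L, 3L)
)

# ASSUMPTION: this mapping is inferred from the example outputs, not documented
season_labels  <- c("winter", "spring", "summer", "fall")
weather_labels <- c("rainy", "sunny", "foggy", "snowy")

dat$season <- factor(dat$season, levels = 1:4, labels = season_labels)
dat$weatherCondition <- factor(dat$weatherCondition, levels = 1:4,
                               labels = weather_labels)

# Total rentals per observed (weather condition, season) combination
res <- aggregate(rentals ~ weatherCondition + season, dat, sum)

# One bar per combination, labelled e.g. "snowy_winter"
bar_labels <- paste(res$weatherCondition, res$season, sep = "_")
barplot(res$rentals, names.arg = bar_labels, las = 3)
```

Because the factor levels follow the numeric codes, the bars appear in season order rather than alphabetically.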
Seems like you probably want grouped box plots. You can add colours or fills to this if necessary
library(tidyverse)
bikes %>%
# group by season and weather condition so each combination gets its own box
ggplot(aes(x = factor(season), y = rentals, group = interaction(season, weatherCondition))) +
geom_boxplot() +
theme_bw()
I wonder if someone can help out. I have the following dataset, where each ID is a company that has hired a different number of people over time, so IDs are duplicated. We also have the address for each ID, but it was not collected for every row:
ID  Address   Number of hiring
1             5
2   Montreal  2
3             3
4   Helsinki  4
1   London    1
2             3
3   Dubai     5
and I'd like to group by ID and add a column that shows the total number of hires for each ID, as well as a column showing that ID's address. When I do it, because there are missing values in Address, R automatically selects the first row for each ID, which may have a missing value. So, the following should be the result:
ID  Address   Total Number of hiring
1   London    6
2   Montreal  5
3   Dubai     8
4   Helsinki  4
I am trying to use dplyr in R.
You can select the first non-empty Address for each ID :
library(dplyr)
df %>%
group_by(ID) %>%
summarise(Address = Address[Address != ''][1],
total_hiring = sum(Number_of_hiring, na.rm =TRUE))
# ID Address total_hiring
# <int> <chr> <int>
#1 1 London 6
#2 2 Montreal 5
#3 3 Dubai 8
#4 4 Helsinki 4
data
df <- structure(list(ID = c(1L, 2L, 3L, 4L, 1L, 2L, 3L), Address = c("",
"Montreal", "", "Helsinki", "London", "", "Dubai"), Number_of_hiring = c(5L,
2L, 3L, 4L, 1L, 3L, 5L)), class = "data.frame", row.names = c(NA, -7L))
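If some addresses are NA rather than empty strings, Address != '' evaluates to NA for those rows, and an NA can end up being picked first. Below is a hedged sketch that skips both cases. The data is a hypothetical variant of the example in which the first address is NA, and the name hires is illustrative.

```r
library(dplyr)

# Hypothetical variant of the example data: the first Address is NA, not ""
hires <- data.frame(
  ID = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
  Address = c(NA, "Montreal", "", "Helsinki", "London", "", "Dubai"),
  Number_of_hiring = c(5L, 2L, 3L, 4L, 1L, 3L, 5L),
  stringsAsFactors = FALSE
)

res <- hires %>%
  group_by(ID) %>%
  summarise(
    # first address that is neither NA nor empty; NA if the ID has none
    Address = Address[!is.na(Address) & Address != ""][1],
    total_hiring = sum(Number_of_hiring, na.rm = TRUE)
  )

print(res)
```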
Has anyone selected unique values from a dataframe based on a second value's highest value?
Example:
name value
cheese 15
pepperoni 12
cheese 9
tomato 4
cheese 3
tomato 2
The best I've come up with - which I am SURE there's a better way - is to sort df by value descending, extract df$name, run unique() on that, then do a left join back with dplyr.
The ideal outcome is this:
name value
cheese 15
pepperoni 12
tomato 4
Thanks in advance!
Seeing your expected result, for each name, you are looking for the row that has the largest number. One way to achieve this task is the following.
library(dplyr)
group_by(mydf, name) %>%
slice(which.max(value))
# A tibble: 3 x 2
# Groups: name [3]
# name value
# <fct> <int>
#1 cheese 15
#2 pepperoni 12
#3 tomato 4
DATA
mydf <- structure(list(name = structure(c(1L, 2L, 1L, 3L, 1L, 3L), .Label = c("cheese",
"pepperoni", "tomato"), class = "factor"), value = c(15L, 12L,
9L, 4L, 3L, 2L)), class = "data.frame", row.names = c(NA, -6L
))
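The sort-then-deduplicate idea from the question also works directly in dplyr, without the join. This is a minimal sketch, assuming dplyr is installed; top_per_name is an illustrative name.

```r
library(dplyr)

mydf <- data.frame(
  name  = c("cheese", "pepperoni", "cheese", "tomato", "cheese", "tomato"),
  value = c(15L, 12L, 9L, 4L, 3L, 2L),
  stringsAsFactors = FALSE
)

top_per_name <- mydf %>%
  arrange(desc(value)) %>%           # largest values first
  distinct(name, .keep_all = TRUE)   # keep the first row seen for each name

print(top_per_name)
```

distinct() keeps the first occurrence of each name, so after sorting in descending order that row is the one with the largest value.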
Suppose I have a matrix (or dataframe):
1 5 8
3 4 9
3 9 6
6 9 3
3 1 2
4 7 2
3 8 6
3 2 7
I would like to select only the first three rows that have "3" as their first entry, as follows:
3 4 9
3 9 6
3 1 2
It is clear to me how to pull out all rows that begin with "3" and it is clear how to pull out just the first row that begins with "3."
But in general, how can I extract the first n rows that begin with "3"?
Furthermore, how can I select just the 3rd and 4th appearances, as follows:
3 1 2
3 8 6
Without the need for an extra package:
mydf[mydf$V1==3,][1:3,]
results in:
V1 V2 V3
2 3 4 9
3 3 9 6
5 3 1 2
When you need the third and fourth row:
mydf[mydf$V1==3,][3:4,]
# or:
mydf[mydf$V1==3,][c(3,4),]
Used data:
mydf <- structure(list(V1 = c(1L, 3L, 3L, 6L, 3L, 4L, 3L, 3L),
V2 = c(5L, 4L, 9L, 9L, 1L, 7L, 8L, 2L),
V3 = c(8L, 9L, 6L, 3L, 2L, 2L, 6L, 7L)),
.Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA, -8L))
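One caveat with [1:3,]: if fewer than three rows match, the result is padded with NA rows. A small sketch that indexes the matching row positions instead, using which() and head(); the names idx, first_three and third_fourth are illustrative.

```r
mydf <- data.frame(
  V1 = c(1L, 3L, 3L, 6L, 3L, 4L, 3L, 3L),
  V2 = c(5L, 4L, 9L, 9L, 1L, 7L, 8L, 2L),
  V3 = c(8L, 9L, 6L, 3L, 2L, 2L, 6L, 7L)
)

# Positions of all rows whose first column equals 3
idx <- which(mydf$V1 == 3)

# First n matches; head() returns fewer rows if fewer than n exist
first_three <- mydf[head(idx, 3), ]

# Third and fourth matches (assumes at least four matches exist)
third_fourth <- mydf[idx[3:4], ]

print(first_three)
print(third_fourth)
```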
Bonus material: besides dplyr, you can do this also very efficiently with data.table (see this answer for speed comparisons on large datasets for the different data.table methods):
library(data.table)
setDT(mydf)[V1==3, head(.SD,3)]
# or:
setDT(mydf)[V1==3, .SD[1:3]]
You can do something like this with dplyr to extract first three rows of each unique value of that column:
library(dplyr)
df %>% arrange(columnName) %>% group_by(columnName) %>% slice(1:3)
If you want to extract only the first three rows where that column equals 3, you can try:
df %>% filter(columnName == 3) %>% slice(1:3)
If you want specific rows, you can supply to slice as c(3, 4), for example.
We could also use subset
head(subset(mydf, V1==3),3)
Update
If we also need to extract the row directly below each row where V1==3,
i1 <- with(mydf, V1==3)
mydf[sort(unique(c(which(i1),pmin(which(i1)+1L, nrow(mydf))))),]
I have a rather large data frame. Here is a simplified example:
Group Element Value Note
1 AAA 11 Good
1 ABA 12 Good
1 AVA 13 Good
2 CBA 14 Good
2 FDA 14 Good
3 JHA 16 Good
3 AHF 16 Good
3 AKF 17 Good
Here it is as a dput:
dat <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), Element = structure(c(1L,
2L, 5L, 6L, 7L, 8L, 3L, 4L), .Label = c("AAA", "ABA", "AHF",
"AKF", "AVA", "CBA", "FDA", "JHA"), class = "factor"), Value = c(11L,
12L, 13L, 14L, 14L, 16L, 16L, 17L), Note = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "Good", class = "factor")), .Names = c("Group",
"Element", "Value", "Note"), class = "data.frame", row.names = c(NA,
-8L))
I'm trying to separate it based on the group. so let's say
Group 1 will be a data frame:
Group Element Value Note
1 AAA 11 Good
1 ABA 12 Good
1 AVA 13 Good
Group 2:
2 CBA 14 Good
2 FDA 14 Good
and so on.
You can use split for this.
> dat
## Group Element Value Note
## 1 1 AAA 11 Good
## 2 1 ABA 12 Good
## 3 1 AVA 13 Good
## 4 2 CBA 14 Good
## 5 2 FDA 14 Good
## 6 3 JHA 16 Good
## 7 3 AHF 16 Good
## 8 3 AKF 17 Good
> x <- split(dat, dat$Group)
Then you can access each individual data frame by group number with x[[1]], x[[2]], etc.
For example, here is group 2:
> x[[2]] ## x[2] would return a one-element list instead
## Group Element Value Note
## 4 2 CBA 14 Good
## 5 2 FDA 14 Good
ADD: Since you asked about it in the comments, you can write each individual data frame to a file with write.csv and lapply. The invisible wrapper simply suppresses the output of lapply.
> invisible(lapply(seq_along(x), function(i){
write.csv(x[[i]], file = paste0(i, ".csv"), row.names = FALSE)
}))
We can see that the files were created by looking at list.files
> list.files(pattern = "^[0-9]+\\.csv$")
## [1] "1.csv" "2.csv" "3.csv"
And we can see the data frame of the third group with read.csv
> read.csv("3.csv")
## Group Element Value Note
## 1 3 JHA 16 Good
## 2 3 AHF 16 Good
## 3 3 AKF 17 Good
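The loop above names each file by its position in the list. If the group values are not simply 1, 2, 3, it may be safer to build the file names from names(x). Below is a sketch under that assumption; the "group_" prefix and the use of tempdir() as the output directory are illustrative choices, not part of the original answer.

```r
dat <- data.frame(
  Group   = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
  Element = c("AAA", "ABA", "AVA", "CBA", "FDA", "JHA", "AHF", "AKF"),
  Value   = c(11L, 12L, 13L, 14L, 14L, 16L, 16L, 17L),
  Note    = "Good",
  stringsAsFactors = FALSE
)

# One data frame per group; the list names are the group values
x <- split(dat, dat$Group)

# Hypothetical output location and file-name prefix
out_dir <- tempdir()
paths <- file.path(out_dir, paste0("group_", names(x), ".csv"))

# Write each data frame to its matching path
invisible(Map(function(d, p) write.csv(d, p, row.names = FALSE), x, paths))

print(basename(paths))
```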
Obligatory plyr version (pretty much equivalent to Richard's, but I'll bet it's slower, too):
library(plyr)
groups <- dlply(dat, .(Group), function(x) { return(x) })
length(groups)
## [1] 3
groups$`1` # can also do groups[[1]]
## Group Element Value Note
## 1 1 AAA 11 Good
## 2 1 ABA 12 Good
## 3 1 AVA 13 Good
groups[[2]]
## Group Element Value Note
## 1 2 CBA 14 Good
## 2 2 FDA 14 Good