Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I am new to R programming and have learnt many R functions, but I am not able to comprehend how to mutate a data frame. I am taking the course Introduction to Probability and Data on Coursera, and I recently came across an exercise that asks you to mutate a data frame, as follows:
Suppose you define a flight to be "on time" if it gets to the destination on time or earlier than expected, regardless of any departure delays. Mutate the data frame to create a new variable called arr_type with levels "on time" and "delayed" based on this definition. Then, determine the on-time arrival percentage based on whether the flight departed on time or not. What proportion of flights that were "delayed" departing arrive "on time"?
Please guide me and explain how to approach this.
Here's how it works:
(df <- data.frame(group=gl(2,2), value=1:4))
# group value
# 1 1 1
# 2 1 2
# 3 2 3
# 4 2 4
library(dplyr)
df %>% group_by(group) %>% mutate(avg=mean(value))
# Source: local data frame [4 x 3]
# Groups: group [2]
#
# group value avg
# (fctr) (int) (dbl)
# 1 1 1 1.5
# 2 1 2 1.5
# 3 2 3 3.5
# 4 2 4 3.5
You can also group by several variables, like group_by(plane, flight). So you should be able to get where you want easily.
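Applied to the flights exercise, the same group_by + mutate pattern looks roughly like this. This is only a sketch: the real course data set has nycflights13-style columns dep_delay and arr_delay (in minutes, negative meaning early), and the tiny flights data frame below is a made-up stand-in.

```r
library(dplyr)

# Hypothetical stand-in for the course data set:
# delays in minutes, negative = earlier than expected.
flights <- data.frame(arr_delay = c(-5, 0, 12, 30, -2),
                      dep_delay = c(10, -3, 25, 40, 0))

flights <- flights %>%
  mutate(arr_type = ifelse(arr_delay <= 0, "on time", "delayed"),
         dep_type = ifelse(dep_delay <= 0, "on time", "delayed"))

# On-time arrival proportion within each departure type
res <- flights %>%
  group_by(dep_type) %>%
  summarise(on_time_arrival = mean(arr_type == "on time"))
res
```

mean(arr_type == "on time") works because a logical vector averages to the proportion of TRUEs; the row of res where dep_type is "delayed" answers the exercise's final question.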
This is probably a very basic question, but I'm just starting out with R and hope someone can help.
I've imported some data into R and created an object containing just the data I'm working on first:
Each of the values is from a scale of 1 to 10.
What I want to produce is a chart showing the mean of each column, something like this (which I did in Excel):
I'm sure this is possible, but I'm going round in circles trying to figure it out! Ignore the vertical line (at the maximum value) and standard deviations for now, though ultimately I'd like to have them included. Thank you!
set.seed(42)
dat <- setNames(data.frame(replicate(4, sample(10, 50, replace=TRUE))), c("2000", "2400", "2800", "3200"))
head(dat)
# 2000 2400 2800 3200
# 1 1 6 5 1
# 2 5 6 9 1
# 3 1 2 10 5
# 4 9 4 8 3
# 5 10 3 7 10
# 6 4 6 6 1
library(dplyr)
library(tidyr) # pivot_longer
library(ggplot2)
dat %>%
  pivot_longer(everything()) %>%
  group_by(name) %>%
  summarize(value = mean(value), .groups = "drop") %>%
  mutate(name = as.integer(name)) %>%
  ggplot(aes(name, value)) +
  geom_line()
It seems that you have encoded a numerical value in the column names, which is not a good idea because it violates first normal form. I would thus suggest transposing the data and storing that value in a column of its own.
With your current data structure, you must first extract the numbers from the column names with
x <- as.numeric(names(dat))
Then you can compute all column means with
y <- colMeans(dat)
And then you can plot it
plot(x, y, type="l")
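Since the asker ultimately wants the standard deviations included, one option (a sketch, reusing the simulated dat from above) is to summarize both statistics per column and add geom_errorbar around the mean line:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Simulated data matching the example above
set.seed(42)
dat <- setNames(data.frame(replicate(4, sample(10, 50, replace = TRUE))),
                c("2000", "2400", "2800", "3200"))

# One row per column: its mean and standard deviation
stats <- dat %>%
  pivot_longer(everything()) %>%
  group_by(name) %>%
  summarize(mean = mean(value), sd = sd(value), .groups = "drop") %>%
  mutate(name = as.integer(name))

# Mean line with +/- 1 SD error bars
p <- ggplot(stats, aes(name, mean)) +
  geom_line() +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 50)
```

The width argument of geom_errorbar is in x-axis units here (the column names span 2000-3200), so it may need adjusting for other data.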
I am currently trying to decrease the values in a column randomly according to a given sum.
For example, if the main data is like this;
ID Value
1 4
2 10
3 16
after running the code, the sum of Value should be 10, and this needs to be done randomly (the decrease for each member should be chosen randomly):
ID Value
1 1
2 8
3 1
I tried several commands and libraries but could not manage it. I am still a novice, so any help would be appreciated!
Thanks
Edit: Sorry, I was not clear enough. I would like to assign each observation a new value smaller than the original (chosen randomly), so that at the end the new sum of Value equals 10.
Using the sample data
dd <- read.table(text="ID Value
1 4
2 10
3 16", header=TRUE)
and the dplyr + tidyr library, you can do
library(dplyr)
library(tidyr)
dd %>%
  mutate(ID = factor(ID)) %>%
  uncount(Value) %>%
  sample_n(10) %>%
  count(ID, name = "Value", .drop = FALSE)
Here we repeat each row Value times, then randomly sample 10 of those rows, then count them back up per ID. We turn ID into a factor to make sure IDs left with 0 observations are preserved in the result.
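The same idea can be sketched in base R, in case dplyr/tidyr are unavailable: expand each ID by its Value, sample 10 units, and tabulate (keeping ID as a factor so zero counts survive):

```r
dd <- read.table(text = "ID Value
1 4
2 10
3 16", header = TRUE)

set.seed(1)
expanded <- rep(factor(dd$ID), dd$Value)   # one element per unit of Value
kept     <- sample(expanded, 10)           # keep 10 units at random
result   <- as.data.frame(table(kept))     # count back up per ID (zeros kept)
names(result) <- c("ID", "Value")
sum(result$Value)  # always 10
```

Because the 10 units are sampled without replacement from the original 30, each new Value is at most its original Value, and the total is exactly 10 by construction.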
I am trying to divide all integers in a column by another integer. I have a database with a column containing integers that go above 1*10^20. Because of this my plots are way too big. I need to normalize the data to better understand what is going on. For example, the data that I have:
  x Day   Amount
1 1   1 23440100
2 2   2 41231020
3 3   3 32012010
I am using a data.frame for my own data, so here you have the data frame for the data above
x <- c(1,2,3)
day <- c(1,2,3)
Amount <- c(23440100, 41231020, 32012010)
my.data <- data.frame(x, day, Amount)
I tried using another answer, provided here, but that doesn't seem to work.
The code that I tried:
test <- my.data[, 3]/1000
Hope someone can help me out! Cheers, Chester
I guess you are looking for this?
my.data$Amount <- my.data$Amount/1000
such that
> my.data
x day Amount
1 1 1 23440.10
2 2 2 41231.02
3 3 3 32012.01
Use mutate from dplyr
Since you're using a data.frame, you can use this simple code:
library(dplyr)
mutated.data <- my.data %>%
  mutate(Amount = Amount / 1000)
> mutated.data
  x day   Amount
1 1   1 23440.10
2 2   2 41231.02
3 3   3 32012.01
Hope this helps.
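If more than one column ever needs the same scaling, dplyr's across() (available in dplyr >= 1.0) applies the division in a single mutate call. A sketch using the same my.data:

```r
library(dplyr)

my.data <- data.frame(x = c(1, 2, 3),
                      day = c(1, 2, 3),
                      Amount = c(23440100, 41231020, 32012010))

# Divide the selected column(s) by 1000 in one step;
# across() would accept several columns, e.g. across(c(Amount, Other), ...)
scaled <- my.data %>%
  mutate(across(Amount, ~ .x / 1000))
scaled$Amount
# [1] 23440.10 41231.02 32012.01
```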
I have a data frame like this
Name Value
A. -5
B. 100
F. 0
G. -5
I want to sort the data in an ascending order and add a rank column. So I want something like this:
Name. Value. Rank
A. -5. 1
G. -5. 1
F. 0. 2
B. 100. 3
A base R solution could be:
v1 <- order(df$Value)
data.frame(df[v1, ], rank = as.numeric(factor(df$Value[v1])))
# Name Value rank
#1 A. -5 1
#4 G. -5 1
#3 F. 0 2
#2 B. 100 3
We sort the data frame with order(), then convert the sorted Value to a factor and then to numeric, so that rows with the same Value get the same rank.
This can be achieved easily with the dplyr package.
#Recreate the data
df <- read.table(text = "Name Value
A. -5
B. 100
F. 0
G. -5", header = TRUE)
library(dplyr)
df %>% arrange(Value) %>% mutate(Rank = dense_rank(Value))
The dplyr pipeline reads: take the data frame df, arrange it by Value, then add a new column Rank equal to the dense rank of Value.
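For reference, dplyr offers several ranking functions that treat ties differently; dense_rank() is the one that matches the expected output here, because tied values share a rank and no ranks are skipped:

```r
library(dplyr)

v <- c(-5, 100, 0, -5)
dense_rank(v)  # ties share a rank, no gaps:    1 3 2 1
min_rank(v)    # ties share a rank, gaps after: 1 4 3 1
row_number(v)  # ties broken by position:       1 4 3 2
```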
I have a dataset that is a similar structure to this:
account_no <- c(1:5, 2, 2, 3)
interaction_date <- c("1/1/2016", "2/5/2016", "3/2/2016", "27/4/2016", "11/10/2015", "11/10/2015", "11/10/2015", "2/5/2016")
interaction_date <- as.Date(interaction_date, format = "%d/%m/%Y")
action <- c("a", "c", "b", "c", "c", "a", "a", "b")
df <- data.frame(account_no, interaction_date, action)
df
There are a couple of other attributes associated with each row, but this is the typical structure.
Essentially it is log data, describing interactions of a user (account_no), the time they interacted and the action they took.
I've been told to find underlying trends in the data.
Is there a way I can aggregate the data based on account_no that would give me an insight into the average length in days between interaction dates?
Or some sort of count to see what is the most common action taken on a specific day?
There are about 80,000 rows in the dataset, and there may be a number of actions on the same account on the same day. Is there a way in which I can break this down into something meaningful?
Here's how you can get a sense of the gap between interaction dates:
df$interaction_date <- as.Date(df$interaction_date, '%d/%m/%Y')  ## coerce to Date
df <- df[order(df$interaction_date), ]                           ## ensure ordered by interaction_date
aggregate(cbind(gap = interaction_date) ~ account_no, df, function(x) mean(diff(unique(x))))
## account_no gap
## 1 1 NaN
## 2 2 204
## 3 3 89
## 4 4 NaN
## 5 5 NaN
Only accounts 2 and 3 had 2 or more interactions, so the remainder get an invalid result. The gap unit is days between interaction dates.
I added the unique() call to exclude multiple interactions on the same date, since I assumed you wouldn't want those to lower the averages.
Or using data.table
library(data.table)
setDT(df)[, interaction_date := as.IDate(interaction_date, "%d/%m/%Y")]
df[order(account_no,interaction_date), .(Gap = mean(diff(interaction_date))) ,account_no]
# account_no Gap
#1: 1 NaN days
#2: 2 102 days
#3: 3 89 days
#4: 4 NaN days
#5: 5 NaN days
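For completeness, a dplyr sketch of the same aggregation, following the first answer's choice of deduplicating same-day interactions before taking the mean gap (which is why account 2 comes out as 204 days there rather than the 102 of the data.table version):

```r
library(dplyr)

account_no <- c(1:5, 2, 2, 3)
interaction_date <- as.Date(c("1/1/2016", "2/5/2016", "3/2/2016", "27/4/2016",
                              "11/10/2015", "11/10/2015", "11/10/2015", "2/5/2016"),
                            format = "%d/%m/%Y")
action <- c("a", "c", "b", "c", "c", "a", "a", "b")
df <- data.frame(account_no, interaction_date, action)

# Mean gap in days between distinct interaction dates per account;
# accounts with a single date yield NaN, as in the answers above.
gaps <- df %>%
  group_by(account_no) %>%
  summarise(gap = mean(diff(sort(unique(interaction_date)))))
```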