Order list depending on vector - r

To create some plots, I've already summarized my data using the following approach, which includes all the needed information.
# Load Data
RawDataSet <- read.csv("http://pastebin.com/raw/VP6cF31A", sep=";")
# Load packages
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(reshape2)
# summarising the data
new.df <- RawDataSet %>%
  group_by(UserEmail, location, context) %>%
  tally() %>%
  # c(1, -1)[...] flips the sign of the count for "NOT_WITHIN" rows
  mutate(n2 = n * c(1, -1)[(location == "NOT_WITHIN") + 1L]) %>%
  group_by(UserEmail, location) %>%
  mutate(p = c(1, -1)[(location == "NOT_WITHIN") + 1L] * n / sum(n))
Through some other analysis I've identified distinct user groups, and since I would like to plot my data, it would be great to have the plot present the data in the right order.
The order is based on the UserEmail and is defined by the following:
order <- c("28","27","25","23","22","21","20","16","12","10","9","8","5","4","2","1","29","19","17","15","14","13","7","3","30","26","24","18","11","6")
Asking for the type of my new.df with typeof(new.df), it says it is a list. I've already tried approaches like order_by and with_order, but so far I have not managed to order my new.df according to my order vector. Of course, the ordering could also be done in the summarising step.
Is there any way to do so?

I couldn't bring myself to create a vector named order, out of respect for the R function of that name. Here, match constructs an index that order then uses as the basis for sorting:
sorted.df <- new.df[order(match(new.df$UserEmail,
                                as.integer(c("28","27","25","23","22","21","20","16","12","10",
                                             "9","8","5","4","2","1","29","19","17","15","14",
                                             "13","7","3","30","26","24","18","11","6")))), ]
head(sorted.df)
#---------------
Source: local data frame [6 x 6]
Groups: UserEmail, location [4]
UserEmail location context n n2 p
(int) (fctr) (fctr) (int) (dbl) (dbl)
1 28 NOT_WITHIN Clicked A 16 -16 -0.8421053
2 28 NOT_WITHIN Clicked B 3 -3 -0.1578947
3 28 WITHIN Clicked A 2 2 1.0000000
4 27 NOT_WITHIN Clicked A 4 -4 -0.8000000
5 27 NOT_WITHIN Clicked B 1 -1 -0.2000000
6 27 WITHIN Clicked A 1 1 1.0000000
(I didn't load plyr or reshape2, since at least one of those packages has a nasty habit of interacting poorly with the dplyr functions.)
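If you prefer to stay inside the pipeline, the same reordering can be written with dplyr's arrange() wrapped around the match() index. A minimal sketch, assuming new.df from the question; the name user_order is mine, chosen so it doesn't mask base::order:
library(dplyr)
# the desired UserEmail ordering
user_order <- as.integer(c("28","27","25","23","22","21","20","16","12","10",
                           "9","8","5","4","2","1","29","19","17","15","14",
                           "13","7","3","30","26","24","18","11","6"))
# match() returns each row's position in user_order; arrange() sorts by it
sorted.df <- new.df %>%
  arrange(match(UserEmail, user_order))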

Related

Problems with a function to sum various elements in a nested data structure in R

I am trying to create a simple function that sums some variables in a nested data set.
Here is a much simpler example:
df <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,5,6,7,7,7,7,7,7,7,7),
                 var = c("A","B","C","D","B","A","D","A","C","D","D","D","A","D","A","A","A","B","B","B"),
                 N = c(50,50,50,50,298,156,156,85,85,85,278,301,98,98,98,98,98,98,98,98))
Think of this as a data frame containing the results of 7 different studies. Each study has investigated one or more variables (A, B, C, D). The columns mean:
ID = the ID of a respective study.
var = the respective variable measured in each study. Some studies have measured only one variable (e.g., ID=2, which only contained B), some several.
N = the sample size of each study. That is, each ID has one sample size.
I would like to create a function that summarizes three things:
k = how many studies measured each variable (e.g., "A")
m = how often each variable was measured (regardless of whether some studies measured a variable more than once)--a simple frequency.
N = the sample size per variable--but only once per study. That is, no duplications per study ID are allowed.
My current version (I am a real noob, so please forgive the form) results in exactly what I want:
model km N
1 A 4 (7) 389
2 B 3 (5) 446
3 C 2 (2) 135
4 D 6 (6) 968
For instance, variable A was measured 7 times, but only by 4 studies (i.e., study #7 measured it several times). The (non-redundant) sample size was N=389, not counting the repeated measures of study #7 more than once.
(Note: The parentheses in the table are helpful as I intend to copy the results into a document)
Here is the current version of the code. The problems begin with the part containing the pipes:
kmn <- function(data, x, ID, N) {
  m <- table(data[[x]])
  k <- apply(table(data[[x]], data[[ID]]), 1, function(x) length(x[x > 0]))
  model <- levels(data[[x]])
  km <- cbind(k, m)
  colnames(km) <- c("k", "m")
  km <- paste0(k, " (", m, ")")
  smpsize <- data %>%
    group_by(data[[x]]) %>%
    summarise(N = sum(N[!duplicated(ID)])) %>%
    select(N)
  cbind(model, km, smpsize)
}
kmn(data=df, x="var", ID = "ID", N="N")
The above code works, but only if the df data frame really contains a variable named N (it fails with a different variable name). I guess the data %>% prompts R to look inside the data frame, rather than treating the N in sum(N...) as a reference to the function argument.
I can guess that this looks horrible for someone with some idea :)
Thank you for any ideas
Holger
First, remove duplicates using the unique function and sum N by var.
Second, take df and group by var: n() gives the count and n_distinct(ID) the number of unique IDs. Then you join in the data frame stats_N:
library(dplyr)
stats_N <- df %>%
  select(ID, var, N) %>%
  unique() %>%
  group_by(var) %>%
  summarise(N = sum(N))

df %>%
  group_by(var) %>%
  summarise(n = n(), km = n_distinct(ID)) %>%
  left_join(stats_N)
# A tibble: 4 x 4
# var n km N
# <fct> <int> <int> <dbl>
#1 A 7 4 389
#2 B 5 3 446
#3 C 2 2 135
#4 D 6 6 968
In addition to @fmarm's answer, it can also be done without a join: group by 'var', get the number of distinct elements in 'ID' (n_distinct), the number of rows (n()), and the sum of the non-duplicated 'N's:
library(dplyr)
df %>%
  group_by(model = var) %>%
  summarise(km = sprintf("%d (%d)", n_distinct(ID), n()),
            N = sum(N[!duplicated(N)]))
# A tibble: 4 x 3
# model km N
# <fct> <chr> <dbl>
#1 A 4 (7) 389
#2 B 3 (5) 446
#3 C 2 (2) 135
#4 D 6 (6) 968
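Coming back to the asker's underlying problem, that the function body hard-codes the column names N and ID: dplyr's .data pronoun lets you pass column names as strings without that restriction. A minimal sketch under that assumption (the name kmn2 is mine, not from either answer; requires a recent dplyr/rlang):
library(dplyr)
# column names arrive as strings and are looked up via the .data pronoun,
# so nothing in the body is hard-coded to "N" or "ID"
kmn2 <- function(data, x, id_col, n_col) {
  data %>%
    group_by(model = .data[[x]]) %>%
    summarise(km = sprintf("%d (%d)", n_distinct(.data[[id_col]]), n()),
              N = sum(.data[[n_col]][!duplicated(.data[[id_col]])]))
}
kmn2(df, x = "var", id_col = "ID", n_col = "N")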

ACF by group in R

I would like to calculate the acf of a time series grouped by a grouping variable. Specifically, I have a data frame containing a single time series (variable a) and a grouping variable (e.g. weekday, variable b). Here is an example:
data <- data.frame(a=rnorm(1:150), b=rep(rep(1:3, each=5), 10))
Now, I would like to calculate the acf for the different values of the grouping variable. For example, for lag 2 and group 1 I would like to get the correlation between t and t-2, calculated only over time points t with b=1 (the value of b at t-2 does not matter). I know that the function acf can easily calculate the acf, but I can't find a way to include the grouping variable.
I could manually calculate the desired correlation but as I have a large data set and a lot of lags and values for the grouping variables, I would hope that there is a more elegant and faster way. Here is the manual calculation for the example above (lag 2, b=1):
sel <- which(data$b==1)
cor(data$a[sel[sel > 2]], data$a[sel[sel>2] - 2])
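To avoid repeating that calculation by hand for every group and lag, it can be wrapped in a small helper. A minimal sketch, assuming the data frame above; the name group_lag_cor is mine:
# correlation between a[t] and a[t - lag], restricted to time points t with b == grp
group_lag_cor <- function(data, grp, lag) {
  sel <- which(data$b == grp)  # positions where the grouping variable matches
  sel <- sel[sel > lag]        # keep only positions that have a lagged partner
  cor(data$a[sel], data$a[sel - lag])
}
# one value per combination of group (rows 1..3) and lag (columns 1..5)
outer(1:3, 1:5, Vectorize(function(g, l) group_lag_cor(data, g, l)))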
If the time series object is a tsibble, the following works for me (ACF() here comes from the feasts package). Assume the data frame is called df and the variable you are interested in is called var; you can additionally specify the maximum lag:
df %>% group_by(Region) %>% ACF(var, lag_max = 18) %>% autoplot()
I'm not sure I understand exactly what information you are looking for, but if you just want the acf values for multiple groups, this should accomplish that. Some people have mentioned creating a tidy solution; this one uses dplyr, tidyr, and purrr to do the grouped calculations.
library(dplyr)
library(tidyr)
library(purrr)
sample_data <- dplyr::data_frame(group = sample(c("a", "b", "c"), size = 100, replace = T),
                                 value = sample.int(30, size = 100, replace = T))
head(sample_data)
#> # A tibble: 6 × 2
#> group value
#> <chr> <int>
#> 1 c 28
#> 2 c 9
#> 3 c 13
#> 4 c 11
#> 5 a 9
#> 6 c 9
grouped_acf_values <- sample_data %>%
  tidyr::nest(-group) %>%
  dplyr::mutate(acf_results = purrr::map(data, ~ acf(.x$value, plot = F)),
                acf_values = purrr::map(acf_results, ~ drop(.x$acf))) %>%
  tidyr::unnest(acf_values) %>%
  dplyr::group_by(group) %>%
  dplyr::mutate(lag = seq(0, n() - 1))
head(grouped_acf_values)
#> Source: local data frame [6 x 3]
#> Groups: group [1]
#>
#> group acf_values lag
#> <chr> <dbl> <int>
#> 1 c 1.00000000 0
#> 2 c -0.20192774 1
#> 3 c 0.07191805 2
#> 4 c -0.18440489 3
#> 5 c -0.31817935 4
#> 6 c 0.06368096 5
You can have a look at split to separate your data.frame into buckets and then at lapply to apply your function to each group. Something like:
groups_data <- split(data, data$b)
groups_acf <- lapply(groups_data, acf,...)
Then you have to extract the required information from the output list, for instance with sapply(groups_acf, function(acfobject) acfobject$acf).
For group computations, I would also definitely go with the new ways "à la" Hadley Wickham, using the %>% operator and group_by; studying that is on my todo list...

Removing duplicated observations in R with restriction

I have a dataset which contains duplicates of the ident variable.
I need to select only 1 observation for each ident, and it needs to be the newest value, i.e. the resulting data should contain the observation for each ident where 'year' is the highest in the initial data set.
I believe a general case would look like this:
1. ident value year
2. A     1     19X1
3. A     2     19X2
4. B     4     19X2
5. B     2     19X1
6. B     1     19X3
7. C     1     19X4
8. C     2     19X1
(I could not render a proper table here, so please disregard the numbering on the left.)
Only, I have several hundred thousand observations.
Order of the resulting data set is not important to me.
Using the dplyr library you can do something like this:
library(dplyr)
df %>% group_by(ident) %>% arrange(desc(year)) %>% slice(1)
Output will be as follows:
Source: local data frame [3 x 4]
Groups: ident [3]
X1. ident value year
(dbl) (chr) (int) (chr)
1 3 A 2 19X2
2 6 B 1 19X3
3 7 C 1 19X4
This assumes year is in a format where sorting in descending order makes it go from latest to oldest.
NOTE: the X1. column is a result of your input data above; I just read it in as-is.
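A variant of the same idea that avoids the sort, assuming the same df, is to filter each group for its maximum year (note that this keeps all tied rows, should an ident have two observations in its newest year):
library(dplyr)
df %>%
  group_by(ident) %>%
  filter(year == max(year)) %>%  # max() works here as long as the year format sorts correctly
  ungroup()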
Try
# note: which.max assumes year is numeric (or otherwise valid as an ordering key)
df <- do.call(rbind, lapply(split(df, df$ident),
                            function(x) x[which.max(x$year), ]))

Plot many categories

I have data as follows: each experiment leads to the appearance of a composition, and each composition belongs to one or many categories. I want to plot the number of occurrences of each composition:
DF <- read.table(text = " Comp Category
Comp1 1
Comp2 1
Comp3 4,2
Comp4 1,3
Comp1 1,2
Comp3 3 ", header = TRUE)
barplot(table(DF$Comp))
So this worked perfectly for me.
After that: since a composition can belong to one or many categories, there are comma separations between the categories. I want to barplot the compositions on X and the number of occurrences on Y, and within each bar the % of each category.
My idea was to duplicate each line that contains a comma, i.e. to repeat it (number of commas + 1) times.
DF = table(DF$Category,DF$Comp)
cats <- strsplit(rownames(DF), ",", fixed = TRUE)
DF <- DF[rep(seq_len(nrow(DF)), sapply(cats, length)),]
DF <- as.data.frame(unclass(DF))
DF$cat <- unlist(cats)
DF <- aggregate(. ~ cat, DF, FUN = sum)
It will give me, for example, for Comp1:
      1 2 3 4
Comp1 2 1 0 0
But if I apply this method, the total across the categories (3) won't correspond to the total number of occurrences of the composition (Comp1 = 2).
How do I proceed in such a case? Is the solution to divide by the number of commas + 1? If yes, how do I do that in my code, and is there a simpler way?
Thanks a lot !
Producing your plot requires two steps, as you already noticed. First, one needs to prepare the data, then one can create the plot.
Preparing the data
You have already shown your efforts of bringing the data to a suitable form, but let me propose an alternative way.
First, I have to make sure that the Category column of the data frame is a character and not a factor. I also store a vector of all the categories that appear in the data frame:
DF$Category <- as.character(DF$Category)
cats <- unique(unlist(strsplit(DF$Category, ",")))
I then need to summarise the data. For this purpose, I need a function that gives, for each value in Comp, the percentage for each category, scaled such that the sum of values gives the number of rows in the original data with that Comp.
The following function returns this information for the entire data frame in the form of another data frame (the output needs to be a data frame, because I want to use the function with do() later).
cat_perc <- function(cats, vec) {
  # percentages
  nums <- sapply(cats, function(cat) sum(grepl(cat, vec)))
  perc <- nums / sum(nums)
  final <- perc * length(vec)
  df <- as.data.frame(as.list(final))
  names(df) <- cats
  return(df)
}
Running the function on the complete data frame gives:
cat_perc(cats, DF$Category)
## 1 4 2 3
## 1 2.666667 0.6666667 1.333333 1.333333
The values sum up to six, which is indeed the total number of rows in the original data frame.
Now we want to run that function for each value of Comp, which can be done using the dplyr package:
library(dplyr)
plot_data <-
  group_by(DF, Comp) %>%
  do(cat_perc(cats, .$Category))
plot_data
## plot_data
## Source: local data frame [4 x 5]
## Groups: Comp [4]
##
## Comp 1 4 2 3
## (fctr) (dbl) (dbl) (dbl) (dbl)
## 1 Comp1 1.333333 0.0000000 0.6666667 0.0000000
## 2 Comp2 1.000000 0.0000000 0.0000000 0.0000000
## 3 Comp3 0.000000 0.6666667 0.6666667 0.6666667
## 4 Comp4 0.500000 0.0000000 0.0000000 0.5000000
This first groups the data by Comp and then applies the function cat_perc to only the subset of the data frame with a given Comp.
I will plot the data with the ggplot2 package, which requires the data to be in the so-called long format. This means that each data point to be plotted should correspond to a row in the data frame. (As it is now, each row contains 4 data points.) This can be done with the tidyr package as follows:
library(tidyr)
plot_data <- gather(plot_data, Category, value, -Comp)
head(plot_data)
## Source: local data frame [6 x 3]
## Groups: Comp [4]
##
## Comp Category value
## (fctr) (chr) (dbl)
## 1 Comp1 1 1.333333
## 2 Comp2 1 1.000000
## 3 Comp3 1 0.000000
## 4 Comp4 1 0.500000
## 5 Comp1 4 0.000000
## 6 Comp2 4 0.000000
As you can see, there is now a single data point per row, characterised by Comp, Category and the corresponding value.
Plotting the data
Now that everything is ready, we can plot the data using ggplot:
library(ggplot2)
ggplot(plot_data, aes(x = Comp, y = value, fill = Category)) +
  geom_bar(stat = "identity")
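As a side note, the fractional bookkeeping can also be done by splitting the comma-separated categories into rows and weighting each row, which is exactly the divide-by-(number of commas + 1) idea from the question. A sketch assuming the original DF plus the tidyr and stringr packages; the column name w is mine:
library(dplyr)
library(tidyr)
library(stringr)
DF %>%
  mutate(Category = as.character(Category),
         # each original row carries total weight 1, split evenly across its categories
         w = 1 / (str_count(Category, ",") + 1)) %>%
  separate_rows(Category, sep = ",") %>%
  count(Comp, Category, wt = w)
The weighted counts differ slightly from cat_perc's proportional allocation, but they likewise sum to the number of occurrences of each Comp, so the result can be fed to the same ggplot call.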

Aggregate data in dataframe

I have the following data frame in R
DeptNumber EmployeeTypeId
1 10
1 11
1 11
2 23
2 23
2 30
2 40
3 45
3 46
I need to generate another dataframe with a new column MaxEmployeeType, which should contain the EmployeeTypeId that occurs most often for a given DeptNumber. The output should be as follows:
DeptNumber MaxEmployeeType
1 11
2 23
3 45
In the case of DeptNumber=3 there is a tie, but it is OK to present either of the options. I am not sure what the optimal way to do this is. Any help is appreciated.
A similar question has been posted already:
How to aggregate data in R with mode (most common) value for each row?
but it was limited to using only plyr & lubridate. If possible I want the best solution, not one limited to these two packages. That question was even downvoted, possibly because it could be homework.
You could try:
library(dplyr)
df %>%
  count(DeptNumber, EmployeeTypeId) %>%
  top_n(1) %>%
  slice(1)
Or, as suggested by @jazzuro:
count(df, DeptNumber, EmployeeTypeId) %>% slice(which(n == max(n))[1])
Which gives:
#Source: local data frame [3 x 3]
#Groups: DeptNumber [3]
#
# DeptNumber EmployeeTypeId n
# (int) (int) (int)
#1 1 11 2
#2 2 23 2
#3 3 45 1
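In recent dplyr versions top_n() is superseded by slice_max(); an equivalent sketch under that assumption:
library(dplyr)
df %>%
  count(DeptNumber, EmployeeTypeId) %>%
  group_by(DeptNumber) %>%
  slice_max(n, n = 1, with_ties = FALSE)  # one row per department, ties broken arbitrarily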
Try this.
# Mode function
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# new data-frame
new_df <- data.frame("DeptNumber" = numeric(0), "MaxEmployeeType" = numeric(0))
# distinct departments
depts <- unique(df$DeptNumber)
# calculate mode for every department
for (dept in depts) {
  dept_set <- subset(df, DeptNumber == dept)
  new_df <- rbind(new_df, c(dept, Mode(dept_set$EmployeeTypeId)))
}
R doesn't have any standard function for calculating the mode. The Mode function in the code above is taken from Ken Williams' post here.
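The explicit loop can also be collapsed into a single base R aggregate() call built on the same helper; a sketch assuming df and Mode as defined above:
# one mode per department; setNames() just renames the columns
new_df <- setNames(aggregate(EmployeeTypeId ~ DeptNumber, data = df, FUN = Mode),
                   c("DeptNumber", "MaxEmployeeType"))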
Here is another dplyr solution
library(dplyr)
data %>%
  count(DeptNumber, EmployeeTypeId) %>%
  slice(which.max(n))
