use dataframe column to change levels of factor - r

How can I use a column of a data frame to change the levels of a factor?
The example below is a simplification of the original dataset, but it shows what I am trying to accomplish.
The dataset structure
df <- data.frame(X=c("DFV","TUG","WQD","PRF","NJK"),Y=c(2000,5000,3000,1000,4000))
Arrange the order of X based on Y
ndf <- df %>% arrange(desc(Y))
Use the ndf$X to change the order of df$X (the original dataset).
df <- df %>% mutate(X = factor(X,levels=ndf$X))
desired result
df
X Y
1 TUG 5000
2 NJK 4000
3 WQD 3000
4 DFV 2000
5 PRF 1000
Please note that the problem is not how to arrange df in decreasing order, but how to use the ndf X column in the levels parameter. This is the problem I am trying to figure out. Thanks.

Your approach is correct, but since you want to order the rows, use it inside arrange:
library(dplyr)
df %>% arrange(factor(X, levels = ndf$X))
# X Y
#1 TUG 5000
#2 NJK 4000
#3 WQD 3000
#4 DFV 2000
#5 PRF 1000
You can also use match:
df %>% arrange(match(X, ndf$X))

There are two independent tasks: i) set the levels of X based on Y; ii) reorder the rows of df. Your question talks about changing levels, but your "desired output" seems to be about reordering the dataset. Please clarify whether you need the first, the second, or both.
Changing levels, using data.table:
require(data.table)
setDT(df)
df[,X:=factor(X, levels=X[order(-Y)])]
(Note that it won't work if you have any duplicated values of X.)
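To make the difference between the two tasks concrete, here is a minimal base R sketch using the data from the question. The names lev and df_sorted are introduced only for illustration. Setting the levels leaves the rows where they are; ordering by the factor is a separate step.

```r
df <- data.frame(X = c("DFV", "TUG", "WQD", "PRF", "NJK"),
                 Y = c(2000, 5000, 3000, 1000, 4000),
                 stringsAsFactors = FALSE)

# Task (i): level order follows decreasing Y; row order is untouched
lev <- df$X[order(-df$Y)]
df$X <- factor(df$X, levels = lev)
levels(df$X)

# Task (ii): reorder the rows; order() on a factor sorts by level position
df_sorted <- df[order(df$X), ]
df_sorted
```

After task (i) alone, the first row of df is still DFV; only the levels attribute has changed.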

In the end the solution needed was:
df %>% arrange(factor(X, levels = unique(ndf$X)))
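The unique() call presumably matters because the real dataset repeats values of X, and factor() does not accept duplicated levels. A small sketch with a made-up vector x (hypothetical, not from the question) shows the difference; dup_ok is just a flag recording whether the call without unique() went through cleanly.

```r
x <- c("a", "b", "a")  # hypothetical values with a repeat

# Passing the raw vector as levels fails (or at least complains) because "a" appears twice
dup_ok <- tryCatch({
  factor(x, levels = x)
  TRUE
}, error = function(e) FALSE, warning = function(w) FALSE)

# De-duplicating first keeps the order of first appearance
f <- factor(x, levels = unique(x))
levels(f)
```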

Related

Delete duplicate rows and sum corresponding values of last column in a dataframe

If we want to remove the duplicates from a dataframe df, we need just to write df[!duplicated(df),] and duplicates will be removed from it. I have the following dataframe:
df <- data.frame(from = c("z","y","z","w","y"), to=c("x","w","x","z","w"), weight=c(2,1,3,5,6))
I would like to obtain something different. In df[,1:2], the first and the third rows are equal, and I would like to: 1) delete one of them; 2) sum the corresponding values of weight. E.g. for this example, the expected result is:
from to weight
z x 5
y w 7
w z 5
Anyway, if I use:
df2=df[,1:2]
which(duplicated(df2) | duplicated(df2[nrow(df2):1, ])[nrow(df2):1])
I obtain
[1] 1 2 3 5
which does not allow me to obtain the desired result (e.g. rows 1 and 3 are equal to each other, and rows 2 and 5 are equal to each other, but this information is not contained in the latter result).
We can do a group-by sum operation instead of using duplicated:
aggregate(weight~ ., df, sum)
In dplyr, this can be done using
library(dplyr)
df %>%
group_by(from, to) %>%
summarise(weight = sum(weight))
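As a quick check of the group-by approach on the question's data, the sketch below spells out the grouping columns explicitly (equivalent to the `.` shorthand here) and looks up each pair. The name res is only for illustration, and the row order of the result is not assumed.

```r
df <- data.frame(from = c("z", "y", "z", "w", "y"),
                 to = c("x", "w", "x", "z", "w"),
                 weight = c(2, 1, 3, 5, 6))

# One row per distinct (from, to) pair, with the weights summed
res <- aggregate(weight ~ from + to, data = df, FUN = sum)

# Look up a pair by value rather than by row position
res$weight[res$from == "z" & res$to == "x"]
```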

dplyr::mutate changes row numbers, how to keep them?

I am using lme4::lmList on a tibble to obtain the coefficients of linear fit lines fitted for each subject (id) in my data. What I actually want is a nice long chain of pipes because I don't want to keep any of this output, just use it for a slope/intercept plot. However, I am running into a problem. lmList is creating a dataframe where the row numbers are the original subject ID numbers. I want to keep that information, but as soon as I use mutate on the output, the row numbers change to be sequential from 1. I tried rescuing them first by using rowid_to_column but that just gives me a column of sequential numbers from 1 too. What can I do, other than drop out of the pipe and put them in a column with base R? Is unique(a_df$id) really the best solution? I had a look around on here but didn't see a question like this one.
library(tibble)
library(dplyr)
library(Matrix)
library(lme4)
a_df <- tibble(id = c(rep(4, 3), rep(11, 3), rep(12, 3), rep(42, 3)),
age = c(rep(seq(1, 3), 4)),
hair = 1 + (age*2) + rnorm(12) + as.vector(sapply(rnorm(4), function(x) rep(x, 3))))
# as.data.frame to get around stupid RStudio diagnostics bug
int_slope <- coef(lmList(hair ~ age | id, as.data.frame(a_df))) %>%
setNames(., c("Intercept", "Slope"))
# Notice how the row numbers are the original subject ids?
print(int_slope)
Intercept Slope
4 2.9723596 1.387635
11 0.2824736 2.443538
12 -1.8912636 2.494236
42 0.8648395 1.680082
int_slope2 <- int_slope %>% mutate(ybar = Intercept + (mean(a_df$age) * Slope))
# Look! Mutate has changed them to be the numbers 1 to 4
print(int_slope2)
Intercept Slope ybar
1 2.9723596 1.387635 5.747630
2 0.2824736 2.443538 5.169550
3 -1.8912636 2.494236 3.097207
4 0.8648395 1.680082 4.225004
# Try to rescue them with rowid_to_column
int_slope3 <- int_slope %>% rowid_to_column(var = "id")
# Nope, 1 to 4 again
print(int_slope3)
id Intercept Slope
1 1 2.9723596 1.387635
2 2 0.2824736 2.443538
3 3 -1.8912636 2.494236
4 4 0.8648395 1.680082
Thanks,
SJ
The dplyr/tidyverse universe doesn't "believe in" row names. Any data that is important for an observation should be included in a column. The tibble package includes a function to move row names into a column. Try
int_slope %>% rownames_to_column()
before any mutates.
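A minimal sketch of that suggestion, assuming the tibble and dplyr packages are installed. The coefficients are copied from the printed int_slope in the question rather than refitted, and mean_age is a stand-in for mean(a_df$age), which is 2 for ages 1 to 3.

```r
library(tibble)
library(dplyr)

# Coefficients as printed in the question; the row names hold the subject ids
int_slope <- data.frame(
  Intercept = c(2.9723596, 0.2824736, -1.8912636, 0.8648395),
  Slope     = c(1.387635, 2.443538, 2.494236, 1.680082),
  row.names = c("4", "11", "12", "42")
)
mean_age <- 2  # stand-in for mean(a_df$age)

int_slope2 <- int_slope %>%
  rownames_to_column(var = "id") %>%        # row names become a character column
  mutate(id = as.integer(id),               # back to numeric ids
         ybar = Intercept + mean_age * Slope)
int_slope2
```

Because the ids now live in an ordinary column, later verbs cannot lose them.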
Nothing like asking for help to make you see the answer. Those aren't row numbers, they're numeric row names. Of course they are! Non-contiguous row numbers make no sense. rownames_to_column is my answer.
Why don't you just create another 'ybar' column on int_slope?
int_slope$ybar <- int_slope$Intercept + mean(a_df$age) * int_slope$Slope

Dplyr/Lubridate: How to summarise overlapping intervals after grouping

I would like to group agreements and then compare how much their periods overlap (or are apart).
My dataframe may look like:
library(tidyverse)
library(lubridate)
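dft <- tribble(
~ShipTo, ~Code, ~Start, ~End,
"xxxx", "AAA11", "2018-01-01", "2018-03-01",
"yyyy", "BBB23", "2018-02-01", "2018-05-11",
"yyyy", "BBB23", "2018-03-01", "2018-06-11",
"cccc", "AAA11", "2018-01-06", "2018-03-12",
"yyyy", "CCC04", "2018-01-16", "2018-03-31",
"xxxx", "DDD", "2018-01-21", "2018-03-25"
) %>%
# dates are quoted and parsed; a bare 2018-01-01 would be evaluated as the subtraction 2018 - 1 - 1
mutate(Start = ymd(Start), End = ymd(End))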
I would like to mutate a column to create lubridate periods and evaluate them after grouping by ShipTo and Code. What I tried was:
dft3<-dft %>% filter(concat1 %in% to_filter2) %>%
arrange(ShipTo,Code)%>%
group_by(ShipTo,Code)%>%
mutate(period=interval(Start,End),
nextperiod=interval(lead(Start),lead(End)),
interv=day(as.period(intersect(period, nextperiod), "days"))) %>%
group_by(ShipTo,Code)%>%
summarise(count=n(),
intervmax=max(interv),
intervmin=min(interv))
If I remove the line group_by(ShipTo,Code)%>% the intervals are created correctly and also the lead intervals are correctly calculated from the next line. But when I naively use group_by, the intervals are not calculated correctly.
I suspect that perhaps my database should be split into many tables by groups and then, after the operation of creating and comparing intervals it should be glued back together.
Is there a succinct way to do it? Or perhaps there is a simpler way I have not yet learned? Thank you in advance for the hint in the right direction.
EDIT: The desired output should be a column with the overlap of the intervals in days (or the distance between intervals if there is no overlap). Grouping destroys the calculation. I would like to have these values calculated within groups (not across them).
EDIT2: I am trying to solve the problem by splitting the dataframe into a list of dataframes and then combining them, but I am not sure of the syntax. The code below, which I was given on another portal, does not quite work (it produces tables with one column), but perhaps it can illustrate the issue. The idea is to split the data, create the new columns, and combine the tables into a single table.
fnOverlaps <- function(x) {
mutate(x,okres=interval(Start,End),
nastokres=interval(lead(Start),lead(End)),
interv=day(as.period(intersect(okres, nastokres), "days")))
}
dft3<-dft3 %>%
split(list(.$ShipTo, .$Code), drop = TRUE) %>%
map_df(fnOverlaps) %>%
flatten_dfr()
The result (for one group) that I expect would look like this.
tribble(
~ShipTo, ~Code, ~interv,
"yyyy", "BBB23", 70, # say there is a 70-day overlap
"yyyy", "BBB23", NA # there is no next row to compare
)
It looks like the issue is being caused by trying to combine vectors with the class "Interval." Specifically, they appear to be getting converted to numeric and losing their inherent information.
I think the only viable solution is to split the data.frame, run the analysis on each component separately with lapply, then bring them back together with bind_rows. Groups with only one entry present an issue, because max and min return -Inf and Inf when the argument is empty after removing NAs. But that is easy enough to correct for.
This code should work. Note that I am using group_by to ensure the ShipTo/Code columns are kept, though you could do that in other ways.
dft %>%
split(paste(.$ShipTo, "XXX", .$Code)) %>%
lapply(function(x){
x %>%
arrange(ShipTo,Code) %>%
mutate(period=interval(Start,End)
, nextperiod=interval(lead(Start),lead(End))
, interv=day(as.period(intersect(period, nextperiod), "days"))
) %>%
group_by(ShipTo,Code)%>%
summarise(count=n(),
intervmax=max(interv, na.rm = TRUE),
intervmin=min(interv, na.rm = TRUE)) %>%
ungroup()
}) %>%
bind_rows() %>%
mutate(intervmax = ifelse(is.infinite(intervmax)
, NA, intervmax)
, intervmin = ifelse(is.infinite(intervmin)
, NA, intervmin))
Returns
# A tibble: 5 x 5
ShipTo Code count intervmax intervmin
<chr> <chr> <int> <dbl> <dbl>
1 cccc AAA11 1 NA NA
2 xxxx AAA11 1 NA NA
3 xxxx DDD 1 NA NA
4 yyyy BBB23 2 71.0 71.0
5 yyyy CCC04 1 NA NA
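For comparison, the same split/apply/combine idea can be sketched in base R without Interval objects, which sidesteps the class-conversion problem. This swaps in plain day arithmetic on Dates instead of lubridate's intersect: the overlap with the next row is min(End, next End) minus max(Start, next Start), so a positive value is an overlap in days and a negative value is a gap. The helper overlap_days is a hypothetical name, not part of any package.

```r
dft <- data.frame(
  ShipTo = c("xxxx", "yyyy", "yyyy", "cccc", "yyyy", "xxxx"),
  Code   = c("AAA11", "BBB23", "BBB23", "AAA11", "CCC04", "DDD"),
  Start  = as.Date(c("2018-01-01", "2018-02-01", "2018-03-01",
                     "2018-01-06", "2018-01-16", "2018-01-21")),
  End    = as.Date(c("2018-03-01", "2018-05-11", "2018-06-11",
                     "2018-03-12", "2018-03-31", "2018-03-25")),
  stringsAsFactors = FALSE
)

# Overlap (in days) of each row with the next row of the same group
overlap_days <- function(g) {
  g <- g[order(g$Start), ]
  s <- as.numeric(g$Start)          # days since 1970-01-01
  e <- as.numeric(g$End)
  next_s <- c(s[-1], NA)            # "lead" by one row; last row has no successor
  next_e <- c(e[-1], NA)
  g$interv <- pmin(e, next_e) - pmax(s, next_s)
  g
}

groups <- split(dft, list(dft$ShipTo, dft$Code), drop = TRUE)
res <- do.call(rbind, lapply(groups, overlap_days))
res[, c("ShipTo", "Code", "interv")]
```

For the yyyy/BBB23 group this gives 71 for the first row and NA for the second, in line with the output above; single-row groups get NA.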
I am posting this just for the record. I received an answer from Jake Knaupp on the r4ds Slack group using the modern map_df() syntax. It calculates the overlap of the periods, but it converts the periods to numeric, with a bunch of warnings saying that it will do so.
myFun <- function(x) {
mutate(x,period=interval(Start,End),
nextperiod=interval(lead(Start),lead(End)),
interv=day(as.period(intersect(period, nextperiod), "days")))
}
df %>%
split(list(.$ShipTo, .$Code), drop = TRUE) %>%
map_df(myFun)

Pivoted data with a column describing some function of other columns

I am trying to pivot some data such that I retrieve (1) the total of some measurement for two+ groups, and then (2) that measurement divided by the # of observations in that group. I have achieved (1) but not (2). Below is an example output I desire:
grouping measurement_total group_size mean
1 1 301 60 5.0
2 2 215 40 5.4
Let some data be:
> grouping <- c(1,2,1,1,2)
> measurement <- sample(rnorm(1,10),100, replace=TRUE)
> dataframe <- cbind(grouping, measurement)
To create the pivot, I used aggregate. I then used a cbind to get the # of observations per group:
> aggregate(cbind(measurement,1) ~ grouping, data=dataframe, FUN=sum)
grouping measurement V2
1 1 301 60
2 2 215 40
I now need to create "V3" which would be { measurement / V2 } such that I achieve the result. NB I can get the mean only by using FUN=mean, but this means I cannot also get the group size.
> aggregate(cbind(measurement,1) ~ grouping, data=dataframe, FUN=mean)
grouping V2(# obs.) mean
1 1 1 5.0
2 2 1 5.4
What are some options for achieving this simply, ideally with a single line? I.e. I could obtain the two tables separately and merge the two, but it's a little long-winded.
Thanks
John
You can use dplyr to do this fairly easily
library(dplyr)
dataframe <- data.frame(dataframe) # Convert to dataframe
dataframe %>%
group_by(grouping) %>%
mutate(measurement_total = sum(measurement)) %>%
mutate(group_size = length(measurement)) %>%
mutate(mean = mean(measurement)) %>%
filter(row_number()==1) %>%
select(-measurement)
Of course, the easy way to do it in base R would be:
df <- aggregate(cbind(measurement,1) ~ grouping, data=dataframe, FUN=sum)
df$mean <- df$measurement/df$V2
But if you're going to be doing a lot of dataframe manipulation, it might be a good idea to get into dplyr.
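A shorter dplyr variant of the same idea, as a sketch: summarise() collapses each group to one row directly, so the filter(row_number()==1) and select() steps are not needed, and n() counts the rows in the group. The measurement values below are made up for illustration (the question's data are random), and pivot is just an illustrative name.

```r
library(dplyr)

# Hypothetical small dataset with the same two columns as in the question
dataframe <- data.frame(grouping = c(1, 2, 1, 1, 2),
                        measurement = c(4, 6, 5, 6, 4))

pivot <- dataframe %>%
  group_by(grouping) %>%
  summarise(measurement_total = sum(measurement),  # total per group
            group_size = n(),                      # number of observations
            mean = mean(measurement))              # total / group_size
pivot
```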

R: identify the factor associated with the highest sum of values for multiple groups

Consider this:
plot=c("A","A","A","A","B","B","B","B")
mean=c(3,5,40,0,3,5,3,0)
sp=c("ch","ch","ag",NA,"ch","ag","ch",NA)
df=data.frame(plot,mean,sp)
plot mean sp
1 A 3 ch
2 A 5 ch
3 A 40 ag
4 A 0 <NA>
5 B 3 ch
6 B 5 ag
7 B 3 ch
8 B 0 <NA>
I'd like to figure out some code that will return the "sp" from each "plot" with the highest cumulative "mean" value. For the example above, I'd like to return this:
plot=c("A","B")
sp=c("ag","ch")
df=data.frame(plot,sp)
plot sp
1 A ag
2 B ch
In case that wasn't clear, for plot A, the sp "ag" is returned because it has the highest cumulative mean value (40) for the plot. For plot B, "ch" is returned because it has the highest cumulative value (6). The values are not important to me; I want only the most dominant sp by cumulative mean value for each plot.
I've played around with aggregate and suspect that would be useful here, but am unsure about how to proceed.
Many thanks (this site is a huge resource for those of us new to R!)
Not sure how @jebyrnes would have done it with summarise and filter (edit: I figured it out and it's pretty simple too), but here's how I'd go about it with dplyr:
library(dplyr)
group_by(df, plot,sp) %>% summarise(sum=sum(mean)) %>% summarise(sp=sp[sum==max(sum)])
# plot sp
#1 A ag
#2 B ch
Here's an approach that uses the "data.table" package
library(data.table)
setDT(df)[, cumsum(mean), by=.(plot, sp)][, .(sp = sp[V1 == max(V1)]), by=plot]
# plot sp
# 1: A ag
# 2: B ch
After setting df to a data table with setDT(df), we are doing two things
[, cumsum(mean), by=.(plot, sp)] calculates the cumulative sum of the mean column, grouped by plot and sp
[, .(sp = sp[V1 == max(V1)]), by=plot] takes the sp value for which V1 (calculated in step 1) is equal to the maximum of V1 and renames that column sp, grouped by plot
You should be able to do this in two steps.
Step 1, aggregate the data frame by plot and sp and calculate the cumulative mean. You can use a package such as plyr with ddply or the dplyr package for this.
Step 2, once you've done this, for each plot output the sp with the highest cumulative mean. There are a lot of ways to do this. I'd again go with dplyr, but that's because I'm a bit besotted with it at the moment.
Actually...you can do this whole thing with 4 lines in dplyr, with one line per operation, piping your way through with magrittr. 5 if you want to get rid of the cumulative means column. You just need a group_by, summarise, and filter statement. I'll post the code if you want it, but it will be far more useful for you to go read, say, http://seananderson.ca/2014/09/13/dplyr-intro.html and try it yourself.
Or....
df %>%
group_by(plot, sp) %>%
summarise(cumMean = sum(mean, na.rm = TRUE)) %>%
filter(cumMean == max(cumMean)) %>%
select(plot, sp)
Aggregate twice: once to calculate the sums for each plot and sp, and a second time to get the maxima for each plot. The second aggregation only gives you the maximum value for each plot, not the sp, so merge it back in with the first aggregate.
df2 = aggregate(mean ~ plot + sp, FUN = sum, data = df)
df3a = aggregate(mean ~ plot, data = df2, FUN = max)
merge(df3a, df2)
I haven't tested what happens if you have equal sums coming up here, though. Also, this drops any NAs in the data frame. If you want to keep those, I'd make sure you bring the data frame in with strings rather than factors and then change the NAs to placeholders ("None" or even "NA") before you begin. The above code works fine with strings!
df = data.frame(plot,mean,sp, stringsAsFactors = FALSE)
df[is.na(df$sp), "sp"] = "None"
> df
plot mean sp
1 A 3 ch
2 A 5 ch
3 A 40 ag
4 A 0 None
5 B 3 ch
6 B 5 ag
7 B 3 ch
8 B 0 None
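If ties are a concern, one possible variation on the two-step aggregate is to pick the winner per plot with which.max(), which returns the first maximum and so always yields exactly one sp per plot. This is a sketch on the question's original data (NAs left in, so the formula interface drops those rows); sums and top are illustrative names.

```r
plot <- c("A", "A", "A", "A", "B", "B", "B", "B")
mean <- c(3, 5, 40, 0, 3, 5, 3, 0)
sp   <- c("ch", "ch", "ag", NA, "ch", "ag", "ch", NA)
df   <- data.frame(plot, mean, sp, stringsAsFactors = FALSE)

# Step 1: sum of mean for each plot/sp pair (rows with NA sp are dropped)
sums <- aggregate(mean ~ plot + sp, data = df, FUN = sum)

# Step 2: within each plot, keep the row with the largest sum
top <- do.call(rbind, lapply(split(sums, sums$plot), function(g) {
  g[which.max(g$mean), c("plot", "sp")]
}))
top
```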
