My problem is that I can't really put it into words, which makes it hard to google, so I'm forced to ask here. I hope you can shed light on my issue.
I have a data.frame like this:
6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1
As you can see, in the first column 0 appears twice, 1 appears twice, and so on. What I would like to do is get all the corresponding values in the second column for a given number, say 0 (in this example 7 and 2), preferably as a data.frame.
I know the approach df$V2[which(df$V1==0)]; however, since the first column might have over 100 rows, I can't really use this. Does anyone have a good solution?
Some words regarding the background of this question: I need to process this data, i.e. get the mean of the second column for all 0's in the first column, or get min/max values.
Regards
Here's a solution using dplyr:
library(dplyr)
df %>% group_by(V1) %>% summarize(ME = mean(V2))
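Since the question also asks for min/max values, several summaries can be computed in the same call (the column names ME, MIN, MAX are arbitrary):
df %>% group_by(V1) %>% summarize(ME = mean(V2), MIN = min(V2), MAX = max(V2))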
Using your data (with some temporary names attached)
txt <- "6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1"
df <- read.table(text = txt)
names(df) <- paste0("Var", seq_len(ncol(df)))
Coerce the first column to be a factor
df <- transform(df, Var1 = factor(Var1))
Then you can use aggregate() with a nice formula interface
aggregate(Var2 ~ Var1, data = df, mean)
aggregate(Var2 ~ Var1, data = df, max)
aggregate(Var2 ~ Var1, data = df, min)
For example:
> aggregate(Var2 ~ Var1, data = df, mean)
Var1 Var2
1 0 4.5
2 1 2.0
3 3 6.0
4 5 2.0
5 6 2.0
Or, using the default interface:
with(df, aggregate(Var2, list(Var1), FUN = mean))
> with(df, aggregate(Var2, list(Var1), FUN = mean))
Group.1 x
1 0 4.5
2 1 2.0
3 3 6.0
4 5 2.0
5 6 2.0
But the output is nicer from the formula interface.
Using data.table
library(data.table)
setDT(df)[, list(mean=mean(V2), max= max(V2), min=min(V2)), by = V1]
First, what exactly is the issue with the solution you suggest? Is it a question of efficiency? Frankly, the code you present is close to optimal [1].
For the general case, you're probably looking at a split-apply-combine action: applying a function to subsets of the data based on some differentiator. As @teucer points out, dplyr (and its ancestor, plyr) is designed for exactly this, as is data.table. In vanilla R, you would tend to use by or aggregate (or split and sapply for more advanced usage) for the same task. For example, to compute group means, you would do
by(df$V2, df$V1, mean)
or
aggregate(df, list(type=df$V1), mean)
Or even
sapply(split(df$V2, df$V1), mean)
[1] The code can be simplified to df$V2[df$V1 == 0] or df[df$V1 == 0,] as well.
Thanks all for your replies. I decided to go for the dplyr solution posted by teucer and eipi10. Since I have a third (and maybe even a fourth) column, this solution seems to be pretty easy to use (just adding V3 to group_by).
Since some are asking what's wrong with df$V2[which(df$V1==0)]: I may have been a bit unclear when saying "rows"; what I actually meant was "values". Assuming I had n distinct values in the first column, I would have to run the command n times, once per distinct value, and store the n resulting vectors.
Related
I have been struggling with this question for a couple of days.
I need to scan every row of a data frame and assign a unique identifier to each row based on values found in a second data frame. Here is a toy example.
df1<-data.frame(c(99443975,558,99009680,99044573,599,99172478))
names(df1)<-"Building"
V1<-c(558,134917,599,120384)
V2<-c(4400796,14400095,99044573,4500481)
V3<-c(NA,99009680,99340705,99132792)
V4<-c(NA,99156365,NA,99132794)
V5<-c(NA,99172478,NA, 99181273)
V6<-c(NA, NA, NA,99443975)
row_number<-1:4
df2<-data.frame(cbind(V1, V2,V3,V4,V5,V6, row_number))
The output I expect is what follows.
row_number_assigned<-c(4,1,2,3,3,2)
output<-data.frame(cbind(df1, row_number_assigned))
Any hints?
Here's an efficient method using the arr.ind feature of the which function:
sapply(df1$Building,                      # send Building entries one by one
       function(inp) which(inp == df2,    # find matching values
                           arr.ind = TRUE)[1])  # return only the row, not the column
[1] 4 1 2 3 3 2
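To get the expected output data.frame from the question, that result can be bound straight back onto df1 (a sketch reusing the same sapply() call):
output <- data.frame(df1,
                     row_number_assigned = sapply(df1$Building,
                                                  function(inp) which(inp == df2, arr.ind = TRUE)[1]))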
Incidentally, your use of the data.frame(cbind(.)) construction is very dangerous. A much less dangerous method of data frame construction, which also uses fewer keystrokes, would be:
df2<-data.frame( V1=c(558,134917,599,120384),
V2=c(4400796,14400095,99044573,4500481),
V3=c(NA,99009680,99340705,99132792),
V4=c(NA,99156365,NA,99132794),
V5=c(NA,99172478,NA, 99181273),
V6=c(NA, NA, NA,99443975) )
(It didn't cause coding errors this time, but if there were any character columns it would have changed all the numbers to character values.) If you learned this from a teacher, please approach them gently and do their future students a favor: let them know that cbind() coerces all of its arguments to the "lowest common denominator".
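A quick illustration of that coercion (in R >= 4.0 the columns end up as character; in older versions they would be factors):
str(data.frame(cbind(x = 1:2, y = c("a", "b"))))
# 'data.frame': 2 obs. of  2 variables:
#  $ x: chr  "1" "2"
#  $ y: chr  "a" "b"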
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
df1 %>%
  left_join(df2 %>%
              pivot_longer(-row_number) %>%
              select(-name),
            by = c("Building" = "value"))
This returns
Building row_number
1 99443975 4
2 558 1
3 99009680 2
4 99044573 3
5 599 3
6 99172478 2
I've got a question regarding the filter() function of dplyr and/or the base subset() function in R. Basically, when I use filter() or subset() I can extract observations based on two conditions, which is what I need.
As an example, this is what I've been using so far:
df %>% filter(Axis_1_1 == "Diagnostic of function on axis1 postponed") %>% filter(is.na(diagnostic_code9))
This gives me the right amount of observations that satisfy these two conditions at the same time, i.e. 92 out of the 23992 in total.
However, when I use the negation sign to exclude these observations from my current dataframe, R deletes roughly 8000 extra observations, leaving 15992 observations after filtering with the negation "!" sign. Example:
df %>% filter(Axis_1_1 != "Diagnostic of function on axis1 postponed") %>% filter(!is.na(diagnostic_code9))
Using simple subsetting from base R gives me the same wrong end result, even though it too finds the correct 92 observations that satisfy the condition, as stated in the first example.
subset(df, Axis_1_1 == "Diagnostic of function on axis1 postponed" & is.na(diagnostic_code9))
My dataframe consists of 112 variables and 23900+ observations in the current setting.
Thus, my questions are:
Could there be something curious going on with the dataframe I'm using? (Unfortunately I cannot share a subset of it.)
Second, is there something wrong here with my coding?
Lastly, what exactly is R doing in the background? It can filter out these observations based on the exact conditions (matching the string and is.na()), yet it does something completely different when the negation sign is used.
Your logic doesn't quite work in this case. Doing two subsequent filter statements is like doing an AND operation. Consider the following example:
df <- data.frame(a=c(1,1,1,1,2,2,2, 2),
b=c(NA,NA,5,5,5,5,5,NA))
df %>% filter(a==1) %>% filter(is.na(b))
# a b
# 1 1 NA
# 2 1 NA
df %>% filter(a!=1) %>% filter(!is.na(b))
# a b
# 1 2 5
# 2 2 5
# 3 2 5
Note the rows with a = 1, b = 5 are returned by neither pipeline: the first output excludes them because b is not NA, and the second excludes them because your first filter (filter(a != 1)) eliminates them.
So if you consider your two filters as A and B, in the first case you are doing A AND B. It is the same as:
df %>% filter(a==1 & is.na(b))
# a b
# 1 1 NA
# 2 1 NA
But in the second case you are doing NOT A AND NOT B, which is not the negation of A AND B. By De Morgan's laws, you need NOT A OR NOT B. So try:
df %>% filter(a!=1 | !is.na(b))
# a b
# 1 1 5
# 2 1 5
# 3 2 5
# 4 2 5
# 5 2 5
# 6 2 NA
or, equivalently (note the parentheses applying the NOT (!) to the whole expression):
df %>% filter(!(a==1 & is.na(b)))
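The same logic applies to the base subset() function you also asked about; negate the whole conjunction rather than each condition separately:
subset(df, !(a == 1 & is.na(b)))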
I have a very large data frame and a set of adjustment coefficients that I wish to apply to certain years, with each coefficient applied to one and only one year. The code below tries, for each row, to select the right coefficient and return a vector containing dat in the unaffected years and dat times that coefficient in the selected years; this vector is to replace dat.
year <- rep(1:5, times = c(2,2,2,2,2))
dat <- 1:10
df <- tibble(year, dat)
adjust = c(rep(0, 4), rep(c(1 + 0.1*1:3), c(2,2,2)))
df %>% mutate(dat = ifelse(year < 5, year, dat*adjust[[year - 2]]))
When I try to do this, I get the following error:
Evaluation error: attempt to select more than one element in vectorIndex.
I am pretty sure this is because the extraction operator [[ treats year as the entire vector year rather than the year of the current row, so there is then a vectorized subtraction, whereupon [[ chokes on the vector-valued index.
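A quick way to confirm that suspicion outside of mutate() is to hand [[ a vector-valued index directly:
(1:10)[[c(2, 3)]]
# Error: attempt to select more than one element in vectorIndex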
I know there are many ways to solve this problem. I have a particularly ugly way involving nested ifelse’s working now. My question is, is there any way to do what I was trying to do in an R- and dplyr- idiomatic way? In some ways this seems like a filter or group_by problem, since we want to treat rows or groups of rows as distinct entities, but I have not found a way of doing so that is any cleaner.
It seems like there are some functions which are easier to define or to think of as row-by-row rather than as the product of entire vectors. I could produce a single vector containing the correct adjustment for each year, but since the number of rows per year varies, I would still have to apply a multi-valued conditional test to construct that vector, so the same problem arises.
Or doesn’t it?
You need to use [ instead of [[ for vectorized indexing. Also, year - 2 gives negative indices for the early years, which causes further problems. If you want to map year to adjust by index position, you can use replace() with a mask indicating the rows to be modified (note this assumes one coefficient per adjusted year, i.e. adjust <- 1 + 0.1 * 1:3, rather than the row-wise adjust in the question):
adjust <- 1 + 0.1 * 1:3   # assumed: one coefficient per adjusted year (years 3-5)
df %>%
  mutate(dat = {
    mask <- year > 2
    replace(year, mask, dat[mask] * adjust[year[mask] - 2])
  })
# A tibble: 10 x 2
#     year   dat
# <int> <dbl>
# 1 1 1.0
# 2 1 1.0
# 3 2 2.0
# 4 2 2.0
# 5 3 5.5
# 6 3 6.6
# 7 4 8.4
# 8 4 9.6
# 9 5 11.7
#10 5 13.0
Whilst reviewing a colleague's Stata code I came across the command expand.
I would really love to be able to do the same thing simply in my own R code.
Essentially, expand duplicates a dataset n times but has the option to create a new variable which is 0 if the observation originally appeared in the dataset and 1 if the observation is a duplicate.
Does anyone know of a quick way of implementing this in R? Or is it a case of writing my own function?
rep_r <- function(x, n) {
  if (n <= 1) rep(x, times = 1) else rep(x, times = n)
}
expand_r <- function(x, n) {
  Reduce(function(x, y) c(x, y), mapply(rep_r, x, n))
}
expand_r(c(2,3,4,1,5),c(-1,0,1,2,3))
#[1] 2 3 4 1 1 5 5 5
EDIT: Thanks to the suggestion from @nicola, the above functionality can be achieved with the following one-liner.
expand_r <- function(x, n) rep(x, replace(n, n < 1, 1))
#>expand_r(c(2,3,4,1,5),c(-1,0,1,2,3))
#[1] 2 3 4 1 1 5 5 5
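The replace(n, n < 1, 1) is needed because rep() rejects negative counts outright (and times = 0 would drop the value instead of keeping it once):
rep(2, times = -1)
# Error in rep(2, times = -1) : invalid 'times' argument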
This function expands the rows of a data.frame like the Stata expand command does. I got the idea from the R mefa package.
expand_r <- function(df, ...) {
as.data.frame(lapply(df, rep, ...))
}
df <- data.frame(x = 1:2, y = c("a", "b"))
expand_r(df, times = 3)
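The question also mentions Stata's option to flag duplicates with a 0/1 variable. Here is a sketch of that (the duplicate column name is made up, and I haven't checked it against Stata's exact semantics):
expand_flag <- function(df, n) {
  idx <- rep(seq_len(nrow(df)), each = n)        # each original row n times
  out <- df[idx, , drop = FALSE]
  out$duplicate <- as.integer(duplicated(idx))   # 0 = original, 1 = duplicate
  rownames(out) <- NULL
  out
}
expand_flag(df, 2)
#   x y duplicate
# 1 1 a         0
# 2 1 a         1
# 3 2 b         0
# 4 2 b         1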
Consider this:
plot=c("A","A","A","A","B","B","B","B")
mean=c(3,5,40,0,3,5,3,0)
sp=c("ch","ch","ag",NA,"ch","ag","ch",NA)
df=data.frame(plot,mean,sp)
plot mean sp
1 A 3 ch
2 A 5 ch
3 A 40 ag
4 A 0 <NA>
5 B 3 ch
6 B 5 ag
7 B 3 ch
8 B 0 <NA>
I'd like to figure out some code that will return the "sp" from each "plot" with the highest cumulative "mean" value. For the example above, I'd like to return this:
plot=c("A","B")
sp=c("ag","ch")
df=data.frame(plot,sp)
plot sp
1 A ag
2 B ch
In case that wasn't clear: for plot A, the sp "ag" is returned because it has the highest cumulative mean value (40) for the plot. For plot B, "ch" is returned because it has the highest cumulative value (6). The values themselves are not important to me; I want only the dominant sp by cumulative mean value for each plot.
I've played around with aggregate and suspect that would be useful here, but am unsure about how to proceed.
Many thanks (this site is a huge resource for those of us new to R!)
Not sure how @jebyrnes would have done it with summarise and filter (edit: I figured it out, and it's pretty simple too), but here's how I'd go about it with dplyr:
library(dplyr)
group_by(df, plot, sp) %>% summarise(sum = sum(mean)) %>% summarise(sp = sp[sum == max(sum)])
# plot sp
#1 A ag
#2 B ch
Here's an approach that uses the "data.table" package
library(data.table)
setDT(df)[, cumsum(mean), by=.(plot, sp)][, .(sp = sp[V1 == max(V1)]), by=plot]
# plot sp
# 1: A ag
# 2: B ch
After setting df to a data table with setDT(df), we are doing two things
[, cumsum(mean), by=.(plot, sp)] calculates the cumulative sum of the mean column, grouped by plot and sp
[, .(sp = sp[V1 == max(V1)]), by=plot] takes the sp value for which V1 (calculated in step 1) is equal to the maximum of V1 and renames that column sp, grouped by plot
You should be able to do this in two steps.
Step 1: aggregate the data frame by plot and sp and calculate the cumulative mean. You can use a package such as plyr with ddply, or the dplyr package, for this.
Step 2: once you've done this, for each plot output the sp with the highest cumulative mean. There are a lot of ways to do this. I'd again go with dplyr, but that's because I'm a bit besotted with it at the moment.
Actually, you can do this whole thing in 4 lines of dplyr, one line per operation, piping your way through with magrittr (5 if you want to get rid of the cumulative means column). You just need a group_by, summarise, and filter statement. I'll post the code if you want it, but it will be far more useful for you to go read, say, http://seananderson.ca/2014/09/13/dplyr-intro.html and try it yourself.
Or....
df %>%
  group_by(plot, sp) %>%
  summarise(cumMean = sum(mean, na.rm = TRUE)) %>%
  filter(cumMean == max(cumMean)) %>%
  select(plot, sp)
Aggregate twice: once to calculate the sums for each plot and sp, and a second time to get the maxima for each plot. The second aggregation is only going to give you the mean, though, so merge it back in with the first aggregate.
df2 = aggregate(mean ~ plot + sp, FUN = sum, data = df)
df3a = aggregate(mean ~ plot, data = df2, FUN = max)
merge(df3a, df2)
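With the example data, that merge should recover the sp column alongside the winning sums:
#   plot mean sp
# 1    A   40 ag
# 2    B    6 ch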
I haven't tested what happens if you have equal sums here, though. Also, this drops any NAs in the data frame. If you want to keep those, make sure you bring the data frame in with strings rather than factors, and then change the NAs to placeholders ("None" or even "NA") before you begin. The above code works fine with strings!
df = data.frame(plot,mean,sp, stringsAsFactors = FALSE)
df[is.na(df$sp), "sp"] = "None"
> df
plot mean sp
1 A 3 ch
2 A 5 ch
3 A 40 ag
4 A 0 None
5 B 3 ch
6 B 5 ag
7 B 3 ch
8 B 0 None