Removing rows after a certain value in R

I have a data frame in R,
df <- data.frame(a=c(1,1,1,2,2,5,5,5,5,5,6,6), b=c(0,1,0,0,0,0,0,1,0,0,0,1))
I want to remove the rows where b equals 0 that occur after a row where b equals 1, within each group of duplicated a values.
So the output I am looking for is,
df.out <- data.frame(a=c(1,1,2,2,5,5,5,6,6), b=c(0,1,0,0,0,0,1,0,1))
Is there a way to do this in R?

This should do the trick?
ind = intersect(which(df$b==0), which(df$b==1)+1)
df.out = df[-ind,]
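which(df$b==1) returns the row indices of df where b==1. Adding one to these and intersecting with the indices where b==0 gives the zeros that directly follow a 1. Note that this only catches a 0 immediately after a 1 and ignores the grouping by a, so on the example data row 10 is left in. It also returns no rows at all when ind is empty, because df[-integer(0),] selects nothing.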

How about
df[ ave(df$b, df$a, FUN=function(x) x>=cummax(x))==1, ]
# a b
# 1 1 0
# 2 1 1
# 4 2 0
# 5 2 0
# 6 5 0
# 7 5 0
# 8 5 1
# 11 6 0
# 12 6 1
Here we use ave to look within each level of a and we test to see if we've seen a 1 yet with cummax.
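The ave/cummax idea above can be wrapped in a small function. This is only a sketch: the function name and its group/flag arguments are made up for illustration, and it assumes the flag column holds only 0 and 1.

```r
# Hypothetical helper: within each group, drop the rows where `flag` is 0
# once a 1 has already appeared in that group.
drop_zeros_after_one <- function(d, group = "a", flag = "b") {
  # cummax() switches to 1 as soon as a 1 has been seen in the group,
  # so a row is kept only while its value is >= the running maximum.
  keep <- ave(d[[flag]], d[[group]], FUN = function(x) x >= cummax(x)) == 1
  d[keep, , drop = FALSE]
}

df <- data.frame(a = c(1,1,1,2,2,5,5,5,5,5,6,6),
                 b = c(0,1,0,0,0,0,0,1,0,0,0,1))
df.out <- drop_zeros_after_one(df)
```

Because the test is x >= cummax(x), any later 1 in the same group is kept as well; only the zeros after the first 1 are dropped.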

Related

How to merge columns in R with different levels of values

I have been given a dataset that I am attempting to perform logistic regression on. However, to do so, I need to merge some columns in R.
For instance, in the carevaluations data set I am given the columns BuyingPrice_low, BuyingPrice_medium, BuyingPrice_high, BuyingPrice_vhigh, MaintenancePrice_low, MaintenancePrice_medium, MaintenancePrice_high and MaintenancePrice_vhigh.
How would I combine the BuyingPrice_low, BuyingPrice_medium, etc. columns into a single column called "BuyingPrice", keeping the level order and the data from each column, and do the same for the MaintenancePrice columns?
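library(dplyr)
library(tidyr)  # gather() comes from tidyr, not dplyr
df <- data.frame(Buy_low=rep(c(0,1), 10),
                 Buy_high=rep(c(0,1), 10))
# stack all columns into a (var, value) pair of columns
one_column <- df %>%
  gather(var, value)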
head(one_column)
var value
1 Buy_low 0
2 Buy_low 1
3 Buy_low 0
4 Buy_low 1
5 Buy_low 0
6 Buy_low 1
This can also be done with stack in base R:
df1 <- data.frame(a=1:3,b=4:6,c=7:9)
stack(df1)
# values ind
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c
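If the goal is instead to collapse the indicator columns into one categorical column per row (which is what a logistic regression would typically need), something like the following might work. It is a sketch under assumptions: the column names follow the question, the values are made up, the helper name is hypothetical, and it assumes exactly one 1 per row among the matching columns.

```r
# Made-up indicator data; column names follow the question, values are hypothetical.
cars <- data.frame(BuyingPrice_low    = c(1, 0, 0, 0),
                   BuyingPrice_medium = c(0, 1, 0, 0),
                   BuyingPrice_high   = c(0, 0, 0, 1),
                   BuyingPrice_vhigh  = c(0, 0, 1, 0))

# Hypothetical helper: collapse the indicator columns that share `prefix`
# into one factor. Assumes exactly one 1 per row among those columns.
collapse_dummies <- function(d, prefix) {
  cols <- grep(paste0("^", prefix, "_"), names(d), value = TRUE)
  lev  <- sub(paste0("^", prefix, "_"), "", cols)   # "low", "medium", ...
  # position of the largest entry (the 1) in each row
  idx  <- max.col(as.matrix(d[cols]), ties.method = "first")
  factor(lev[idx], levels = lev)                    # keep the column order as level order
}

cars$BuyingPrice <- collapse_dummies(cars, "BuyingPrice")
```

The same call with "MaintenancePrice" as the prefix would handle the second set of columns.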

R: Aggregate dataframe if column has less than 3 zeros, else return zero

I have ratings of images by several raters:
data <- as.data.frame(matrix(c(rep(1,6),rep(2,6),rep(1:6,2),
0,2,1,0,1,0,0,0,3,0,0,0),12,3))
colnames(data) <- c("image", "rater", "rating")
print(data)
# image rater rating
# 1 1 1 0
# 2 1 2 2
# 3 1 3 1
# 4 1 4 0
# 5 1 5 1
# 6 1 6 0
# 7 2 1 0
# 8 2 2 0
# 9 2 3 3
# 10 2 4 0
# 11 2 5 0
# 12 2 6 0
I want to aggregate (mean) the ratings by image, but only if there are fewer than 3 zero ratings for a given image. Otherwise (i.e. if there are 3 or more zeros), the aggregated rating should be zero. The zeros should be counted for raters 1-5 only.
So for the above data:
# image rating
# 1 1 0.8
# 2 2 0.0
For image 1 ratings are aggregated because the third zero belongs to rater 6. For image 2, the aggregated rating is zero because there are more than 2 zeros.
On top of that, I want the aggregation to take into account a) only the first 5 ratings for each image, and b) only positive ratings.
I can manage the last 2 conditions using aggregate:
aggregate(rating ~ image, data = data[data$rater <= 5 & data$rating != 0,], mean)
# Result:
# image rating
# 1 1 1.333333
# 2 2 3.000000
But I can't figure out the first condition.
Correct results should be:
# image rating
# 1 1 1.333333
# 2 2 0.000000
Can anyone please help? Thanks.
Here is a nice method using base R:
data$this <- ave(data$rating, data$image,
FUN=function(i) if(sum(i[1:5] > 0) > 2) mean(i[1:5]) else 0)
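I use i[1:5] to take the first five ratings of each image, so the rows must be ordered by rater within each image, and an image with fewer than 5 raters will cause an error. Note that mean(i[1:5]) includes the zero ratings, so image 1 gets 0.8 here. ave returns the group value repeated for every row of that image, if that is of interest. Of course, you can use the same function to produce the aggregation table you mentioned: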
aggregate(data$rating, data["image"],
FUN=function(i) if(sum(i[1:5] > 0) > 2) mean(i[1:5]) else 0)
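To reproduce the "correct results" from the question (mean of the positive ratings only), a variant along these lines could be used. It is a sketch, not the answer's own method: the helper name and its max_rater/max_zeros arguments are assumptions, and it filters on the rater column instead of relying on row position.

```r
data <- as.data.frame(matrix(c(rep(1,6), rep(2,6), rep(1:6,2),
                               0,2,1,0,1,0, 0,0,3,0,0,0), 12, 3))
colnames(data) <- c("image", "rater", "rating")

# Hypothetical helper: rating for one image.
# Only raters 1..max_rater are considered; more than max_zeros zeros gives 0,
# otherwise the mean of the positive ratings is returned.
rate_image <- function(d, max_rater = 5, max_zeros = 2) {
  r <- d$rating[d$rater <= max_rater]
  if (sum(r == 0) > max_zeros || !any(r > 0)) return(0)
  mean(r[r > 0])
}

by_image <- split(data, data$image)
result <- data.frame(image  = as.numeric(names(by_image)),
                     rating = unname(vapply(by_image, rate_image, numeric(1))))
```

Filtering on data$rater means the rows do not have to be sorted, and images with fewer than five raters do not raise an error.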

R Split data.frame using a column that represents an on/off switch

I have data that looks like the following:
a <- data.frame(cbind(x=seq(50),
y=rnorm(50),
z=c(rep(0,5),
rep(1,8),
rep(0,3),
rep(1,2),
rep(0,12),
rep(1,12),
rep(0,8))))
I would like to split the data.frame a on the column z, with each run of identical values becoming a separate data.frame in a list. In my example, the first 5 rows would be the first item in the list, the next 8 rows the second item, the next 3 rows the item after that, and so on.
Simple factors combine all the 1s together and all the 0s together...
I'm sure that there is a simple way to do this, but it has eluded me for the moment.
Thanks
Try the rleid function in data.table v > 1.9.5
library(data.table)
split(a, rleid(a$z))
# $`1`
# x y z
# 1 1 -0.03737561 0
# 2 2 -0.48663043 0
# 3 3 -0.98518106 0
# 4 4 0.09014355 0
# 5 5 -0.07703517 0
#
# $`2`
# x y z
# 6 6 0.3884339 1
# 7 7 1.5962833 1
# 8 8 -1.3750668 1
# 9 9 0.7987056 1
# 10 10 0.3483114 1
# 11 11 -0.1777759 1
# 12 12 1.1239553 1
# 13 13 0.4841117 1
....
Or, also with cumsum:
split(a, c(0, cumsum(diff(a$z) != 0)))
Here are some base R options.
Using rle, a variant of the rleid function given in the comments by @Spacedman:
split(a,inverse.rle(within.list(rle(a$z), values <- seq_along(values))))
By using cumsum after creating a logical index based on whether the adjacent elements are equal or not
split(a, cumsum(c(TRUE, a$z[-1]!=a$z[-nrow(a)])))
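For readability, the run-labelling step can also be pulled out into its own function. This is a minimal base R sketch; the name run_id is made up here, and y is replaced by a deterministic stand-in for rnorm() so the example is reproducible.

```r
# Hypothetical helper: label consecutive runs of equal values 1, 2, 3, ...
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}

z <- c(rep(0,5), rep(1,8), rep(0,3), rep(1,2), rep(0,12), rep(1,12), rep(0,8))
a <- data.frame(x = seq_along(z),
                y = seq_along(z) / 10,  # stand-in for rnorm(50)
                z = z)

# one data.frame per run of identical z values
pieces <- split(a, run_id(a$z))
```

Each element of pieces then holds the rows of one on or off stretch, in the original row order.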

How can I reshape my dataframe?

I have a huge data frame, that in a simple version it looks like this:
trials=c("1","2","3","4","5","6","7","8","9","10")
co =c(rep ("1",10))
stim=c("8","9","11","2","4","7","8","1","12","16")
ansbin=c("1","0","1","0","0","1","0","1","1","0")
stim.1=c("11","2","11","7","4","3","9","1","4","16")
ansbin.1=c("0","0","1","0","0","1","0","1","1","1")
trials.1=c("1","2","3","4","5","6","7","8","9","10")
co.1 =c(rep ("2",10))
stim1.1=c("11","2","11","2","5","7","8","15","17","10")
ansbin1.1=c("1","1","1","0","0","1","1","1","0","1")
stim2.1=c("11","2","14","1","4","8","9","10","4","12")
ansbin2.1=c("0","1","1","0","0","1","0","0","1","0")
ID<- data.frame(trials,co,stim,ansbin,stim.1,ansbin.1,trials.1,co.1,stim1.1,ansbin1.1,stim2.1,ansbin2.1)
View(ID)
Now I would like to form my new data.frame in the way that "stim", "stim.1","stim1.1" and "stim2.1" are under the same column called "stimulus", and the same thing for the answers: I would like all "ansbin", "ansbin.1", "ansbin1.1" and "ansbin2.1" under the same column called "answers".
"trials" and "trials.1" should likewise end up in the same column, with the "co" column distinguishing the two.
I tried to use "reshape" like this:
df<-reshape(ID, direction="long",
idvar=c("trials", "co"),
varying= c("stim","stim.1", "stim1.1","stim2.1","ansbin","ansbin.1","ansbin1.1","ansbin2.1"),
v.names=c("stimulus","answer"),
timevar="num"
)
but I get problems and warnings every time. I think the problem is linked to the column names.
Can you help me?
Thank you in advance! :)
Here's the approach I would take:
library(data.table)
melt(
rbindlist(split.default(ID, cumsum(grepl("^trials", names(ID))))),
measure.vars = patterns("^stim", "^ansbin"), value.name = c("stim", "ansbin"))
# trials co variable stim ansbin
# 1: 1 1 1 8 1
# 2: 2 1 1 9 0
# 3: 3 1 1 11 1
# 4: 4 1 1 2 0
# 5: 5 1 1 4 0
# ---
# 36: 6 2 2 8 1
# 37: 7 2 2 9 0
# 38: 8 2 2 10 0
# 39: 9 2 2 4 1
# 40: 10 2 2 12 0
Basically, it sounds like you're looking at two rounds of "reshaping":
1. Stacking the columns from "trials" to the second set of "ansbin" on top of each other. I've done that with the rbindlist(split.default(...)) part of my answer.
2. Stacking each resulting pair of "stim" and "ansbin" columns on top of each other. I've done that with the melt(...) part of my answer.
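The same two rounds of stacking can be spelled out in base R by listing which source columns belong together. The following is only a sketch under assumptions: ID is cut down to three trials, and the layout table and the output column names stimulus and answers are made up for illustration.

```r
# Cut-down, hypothetical version of ID (3 trials instead of 10).
ID <- data.frame(trials = c("1","2","3"), co = "1",
                 stim = c("8","9","11"),     ansbin = c("1","0","1"),
                 stim.1 = c("11","2","11"),  ansbin.1 = c("0","0","1"),
                 trials.1 = c("1","2","3"), co.1 = "2",
                 stim1.1 = c("11","2","11"), ansbin1.1 = c("1","1","1"),
                 stim2.1 = c("11","2","14"), ansbin2.1 = c("0","1","1"),
                 stringsAsFactors = FALSE)

# Each row names the four source columns that form one block of the long table.
layout <- data.frame(trials = c("trials", "trials", "trials.1", "trials.1"),
                     co     = c("co", "co", "co.1", "co.1"),
                     stim   = c("stim", "stim.1", "stim1.1", "stim2.1"),
                     ans    = c("ansbin", "ansbin.1", "ansbin1.1", "ansbin2.1"),
                     stringsAsFactors = FALSE)

# Build one block per layout row, then stack the blocks.
blocks <- lapply(seq_len(nrow(layout)), function(k) {
  data.frame(trials   = ID[[layout$trials[k]]],
             co       = ID[[layout$co[k]]],
             stimulus = ID[[layout$stim[k]]],
             answers  = ID[[layout$ans[k]]],
             stringsAsFactors = FALSE)
})
long <- do.call(rbind, blocks)
```

The rows come out in the order stim, stim.1, stim1.1, stim2.1, which differs from the row order of the melt() output but carries the same trials/co pairing.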
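Consider building a list of reshaped data frames, one for each set (co, trials, stimulus, and answers), then merging them together. However, because co and trials each span only two columns while the latter two span four, consider duplicating those columns before reshaping. Note that reshape numbers the time points in the order the matched columns appear in ID, so it is worth checking that each co and trials copy lines up with the stim column it belongs to: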
ID$co2 <- ID$co
ID$co3 <- ID$co.1
ID$trials.2 <- ID$trials
ID$trials.3 <- ID$trials.1
df_list <- lapply(c("co", "trials", "stim", "ans"), function(s)
reshape(ID, direction="long",
varying= grep(s, names(ID)),
v.names=c(s),
drop = grep(paste0("^", s), names(ID), invert=TRUE),
timevar="num",
new.row.names = 1:1000)
)
# CHAIN MERGE
finaldf <- Reduce(function(x, y) merge(x, y, by=c('id', 'num')), df_list)
finaldf <- with(finaldf, finaldf[order(num, id),]) # SORT DATAFRAME
rownames(finaldf) <- NULL # RESET ROWNAMES
head(finaldf)
# id num co trials stim ans
# 1 1 1 1 1 8 1
# 2 2 1 1 2 9 0
# 3 3 1 1 3 11 1
# 4 4 1 1 4 2 0
# 5 5 1 1 5 4 0
# 6 6 1 1 6 7 1

Count occurrences of value in a set of variables in R (per row)

Let's say I have a data frame with 10 numeric variables V1-V10 (columns) and multiple rows (cases).
What I would like R to do is: For each case, give me the number of occurrences of a certain value in a set of variables.
For example the number of occurrences of the numeric value 99 in that single row for V2, V3, V6, which obviously has a minimum of 0 (none of the three have the value 99) and a maximum of 3 (all of the three have the value 99).
I am really looking for an equivalent to the SPSS function COUNT: "COUNT creates a numeric variable that, for each case, counts the occurrences of the same value (or list of values) across a list of variables."
I thought about table() and library plyr's count(), but I cannot really figure it out. Vectorized computation preferred. Thanks a lot!
If you need to count a particular word or letter in each row:
#Let df be a data frame with four variables (V1-V4)
df <- data.frame(V1=c(1,1,2,1,"L"),V2=c(1,"L",2,2,"L"),
V3=c(1,2,2,1,"L"), V4=c("L", "L", 1,2, "L"))
To count the number of "L" values in each row, just use
#This is how to compute a new variable counting occurrences of "L" in V1-V4.
df$count.L <- apply(df, 1, function(x) length(which(x=="L")))
The result will look like this:
> df
V1 V2 V3 V4 count.L
1 1 1 1 L 1
2 1 L 2 L 2
3 2 2 2 1 0
4 1 2 1 2 0
5 L L L L 4
I think that there ought to be a simpler way to do this, but the best way that I can think of to get a table of counts is to loop (implicitly using sapply) over the unique values in the dataframe.
#Some example data
df <- data.frame(a=c(1,1,2,2,3,9),b=c(1,2,3,2,3,1))
df
# a b
#1 1 1
#2 1 2
#3 2 3
#4 2 2
#5 3 3
#6 9 1
levels=unique(do.call(c,df)) #all unique values in df
out <- sapply(levels,function(x)rowSums(df==x)) #count occurrences of x in each row
colnames(out) <- levels
out
# 1 2 3 9
#[1,] 2 0 0 0
#[2,] 1 1 0 0
#[3,] 0 1 1 0
#[4,] 0 2 0 0
#[5,] 0 0 2 0
#[6,] 1 0 0 1
Try
apply(df,MARGIN=1,table)
Here df is your data.frame. This returns a list with one element per row of your data.frame, in the same order (as long as the rows do not all have the same number of distinct values, in which case apply simplifies the result to a matrix). Each element is a table whose entries are the numbers of occurrences and whose names are the corresponding values.
For instance:
df=data.frame(V1=c(10,20,10,20),V2=c(20,30,20,30),V3=c(20,10,20,10))
#create a data.frame containing some data
df #show the data.frame
V1 V2 V3
1 10 20 20
2 20 30 10
3 10 20 20
4 20 30 10
apply(df,MARGIN=1,table) #apply the function table on each row (MARGIN=1)
[[1]]
10 20
1 2
[[2]]
10 20 30
1 1 1
[[3]]
10 20
1 2
[[4]]
10 20 30
1 1 1
#desired result
Here is another straightforward solution that comes closest to what the COUNT command in SPSS does: creating a new variable that, for each case (i.e., row), counts the occurrences of a given value or list of values across a list of variables.
#Let df be a data frame with four variables (V1-V4)
df <- data.frame(V1=c(1,1,2,1,NA),V2=c(1,NA,2,2,NA),
V3=c(1,2,2,1,NA), V4=c(NA, NA, 1,2, NA))
#This is how to compute a new variable counting occurrences of value "1" in V1-V4.
df$count.1 <- apply(df, 1, function(x) length(which(x==1)))
The updated data frame contains the new variable count.1 exactly as the SPSS COUNT command would do.
> df
V1 V2 V3 V4 count.1
1 1 1 1 NA 3
2 1 NA 2 NA 1
3 2 2 2 1 1
4 1 2 1 2 2
5 NA NA NA NA 0
You can do the same to count how many times the value "2" occurs per row in V1-V4. Note that you need to select the columns (variables) in df to which the function is applied.
df$count.2 <- apply(df[1:4], 1, function(x) length(which(x==2)))
You can also apply a similar logic to count the number of missing values in V1-V4.
df$count.na <- apply(df[1:4], 1, function(x) sum(is.na(x)))
The final result should be exactly what you wanted:
> df
V1 V2 V3 V4 count.1 count.2 count.na
1 1 1 1 NA 3 0 1
2 1 NA 2 NA 1 1 2
3 2 2 2 1 1 3 0
4 1 2 1 2 2 2 0
5 NA NA NA NA 0 0 4
This solution can easily be generalized to a range of values.
Suppose we want to count how many times a value of 1 or 2 occurs in V1-V4 per row:
df$count.1or2 <- apply(df[1:4], 1, function(x) sum(x %in% c(1,2)))
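Since the question prefers vectorized computation, the apply() calls above could also be replaced by a rowSums-based version. This is a sketch; the helper name count_values and its arguments are made up for illustration.

```r
# Hypothetical helper: per-row count of how often any of `values`
# occurs in the columns `cols` of data frame `d`.
count_values <- function(d, cols, values) {
  m <- as.matrix(d[cols])
  # %in% drops the matrix dimensions, so rebuild them (column-major order);
  # missing cells count as FALSE unless NA is itself among `values`.
  hits <- matrix(m %in% values, nrow = nrow(m))
  rowSums(hits)
}

df <- data.frame(V1 = c(1,1,2,1,NA), V2 = c(1,NA,2,2,NA),
                 V3 = c(1,2,2,1,NA), V4 = c(NA,NA,1,2,NA))
vars <- c("V1", "V2", "V3", "V4")
df$count.1    <- count_values(df, vars, 1)
df$count.1or2 <- count_values(df, vars, c(1, 2))
df$count.99   <- count_values(df, c("V2", "V3"), 99)  # a subset of variables, as in the question
```

Passing the column names explicitly keeps previously added count columns out of later counts.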
A solution with functions from the dplyr package would be the following, using the example data set from LechAttack's answer:
df <- data.frame(V1=c(1,1,2,1,NA),V2=c(1,NA,2,2,NA),
V3=c(1,2,2,1,NA), V4=c(NA, NA, 1,2, NA))
Count the appearances of "1" and "2" each and both combined:
library(dplyr)
df %>%
rowwise() %>%
mutate(count_1 = sum(c_across(V1:V4) == 1, na.rm = TRUE),
count_2 = sum(c_across(V1:V4) == 2, na.rm = TRUE),
count_12 = sum(c_across(V1:V4) %in% 1:2, na.rm = TRUE)) %>%
ungroup()
which gives the table:
V1 V2 V3 V4 count_1 count_2 count_12
1 1 1 1 NA 3 0 3
2 1 NA 2 NA 1 1 2
3 2 2 2 1 1 3 4
4 1 2 1 2 2 2 4
5 NA NA NA NA 0 0 0
My attempt at something similar to COUNT from SPSS in R is as follows. Note that rowSums here adds up the values in the selected columns; to count occurrences of a particular value instead, compare the selected columns with that value (e.g. == 1) before summing.
library(dplyr)
df <- data.frame(a=c(1,1,NA,2,3,9),b=c(1,2,3,2,NA,1)) #Dummy data with NAs
df <- df %>%
dplyr::mutate(count = rowSums( #sum across each row
dplyr::select(., #select columns from the piped data frame
dplyr::one_of( #name the columns you want inside select
c('a','b'))), na.rm = TRUE)) #na.rm = TRUE skips the NAs
This is what the output looks like:
>df
a b count
1 1 1 2
2 1 2 3
3 NA 3 3
4 2 2 4
5 3 NA 3
6 9 1 10
Hope it helps :-)
