Problems with using subset in R

I need to subset my data frame, but I do not know what condition to use.
df2<-subset(df, condition )
A part of the data frame `df`:
state value
    a     1
    b     2
    c     3
    a     1
    b     4
    c     5
I count the sum of the value column for each state using table(df$state).
I need to create a data frame that contains just the rows where the sum of the value column is bigger than a given value x.
If x is 3, the new data frame should contain just the rows where the "state" column equals b or c.
What should I replace "condition" with? How can I use table(df$state) in the condition?

It is not clear what you are trying to do.
table(df$state) counts the occurrences of each state in your data, not the sum of the variable "value" for each "state". You should instead use something like this:
vv <- tapply(dat$value,dat$state,sum)
vv
a b c
2 6 8
Now you can use the result within subset to get the rows where the sum of the value column is bigger than a given value x. For example, with x = 3:
subset(dat,state %in% names(vv)[vv>3])
or without using `subset` (more efficient):
dat[dat$state %in% names(vv)[vv>3],]
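On the example data above (loaded into dat), the sums are a = 2, b = 6 and c = 8, so both versions should keep the rows where state is b or c, roughly:
#   state value
# 2     b     2
# 3     c     3
# 5     b     4
# 6     c     5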

Related

R: Deleting rows from a data frame based on values of other vector

So I have a data frame with baskets of products from purchases by individuals. A row stands for the basket of products of one individual. I want to remove all the rows (baskets) that contain a product (expressed as an integer) that is listed in a vector named products.to.delete. Here is a small image of what the data set looks like.
Next to that I have a vector containing a large number of values that must be deleted. I would like to delete all the rows that contain a value from this vector.
Here is some code to make it reproducible:
dataframe <- as.data.frame( matrix(data = sample(10000,1000,replace = TRUE),20,50))
products.to.delete <- sample(10000,200,replace = FALSE)
Thank you in advance for helping me out!
If your data is data, and your vector of target values is vals, you could do this:
data[apply(data,1,\(r) !any(r %in% vals)),]
That is, within each row of data (i.e. apply(data, 1, ...)), you check whether any of the values are in vals. Reverse the boolean using ! to create a logical vector for selecting the remaining rows.
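Applied to the reproducible example from the question, a minimal sketch might look like this (set.seed is added here only so the random sample is repeatable; it is not part of the original code):
set.seed(1)  # assumption: fixed seed just to make the random example repeatable
dataframe <- as.data.frame(matrix(data = sample(10000, 1000, replace = TRUE), 20, 50))
products.to.delete <- sample(10000, 200, replace = FALSE)
# keep only the baskets (rows) that contain none of the products to delete
kept <- dataframe[apply(dataframe, 1, \(r) !any(r %in% products.to.delete)), ]
nrow(kept)  # number of baskets that survive the filter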
For your next questions, please create reproducible examples such as the one below.
What you're after is called filtering and can be done in base R by the following.
First, create an object (called, for example, myfilter) which is a logical vector with the same length as the number of rows in your data.frame.
mydat <- data.frame("col1"=1:5, "col2"=letters[1:5])
col1 col2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
myfilter <- mydat$col2 %in% c("a", "c")
[1] TRUE FALSE TRUE FALSE FALSE
mydat[myfilter,]
col1 col2
1 1 a
3 3 c
Then simply put this object inside the brackets []. R will keep the rows where the values are TRUE.
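The same filtering can of course be written in a single step, without the intermediate object:
mydat[mydat$col2 %in% c("a", "c"), ]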

Identifying, grouping unique entries in data frame (R)

I have a dataframe with two columns. One is an ID column (string), the second consists of strings several hundred characters long (DNA sequences). I want to identify the unique DNA sequences and group the unique groups together.
Using:
data$duplicates<-duplicated(data$seq, fromLast = TRUE)
I have successfully identified whether a specific row is a duplicate or not. This is not sufficient: I want to know how many duplicates I have (2, 3, etc.) and which IDs they correspond to (it is important that the ID always stays with its corresponding sequence).
Maybe something like:
for data$duplicates = TRUE... "add number in data$grouping
corresponding to the set of duplicates."
I don't know how to write the code for the last part.
I appreciate any and all help, thank you.
Edit: As an example:
df <- data.frame(ID = c("seq1","seq2","seq3","seq4","seq5"), seq = c("AAGTCA","AGTCA","AGCCTCA","AGTCA","AGTCAGG"))
I would like the output to be a new column (e.g.: df$grouping) where a numeric value is given to each unique group, so in this case:
("1","2","3","2","4")
Since df$seq is already a factor, we can just use the level number. This is given when a factor is coerced to an integer.
df$grouping = as.integer(df$seq)
df
# ID seq grouping
# 1 seq1 AAGTCA 1
# 2 seq2 AGTCA 3
# 3 seq3 AGCCTCA 2
# 4 seq4 AGTCA 3
# 5 seq5 AGTCAGG 4
If, in your real data, the seq column is not of class factor, you can still use df$grouping = as.integer(factor(df$seq)). By default the order of the groups will be alphabetical---you can modify this by giving the levels argument to factor in the order you want. For example, df$grouping = as.integer(factor(df$seq, levels = unique(df$seq))) will put the levels (and thus the grouping integers) in the order in which they first occur.
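For example, with the data above, the first-occurrence ordering reproduces exactly the grouping asked for in the question:
df$grouping = as.integer(factor(df$seq, levels = unique(df$seq)))
df$grouping
# [1] 1 2 3 2 4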
If you want to see the number of rows in each group, use table, e.g.
table(df$seq)
# AAGTCA AGCCTCA AGTCA AGTCAGG
# 1 1 2 1
table(df$grouping)
# 1 2 3 4
# 1 1 2 1
sort(table(df$seq), decreasing = T)
# AGTCA AAGTCA AGCCTCA AGTCAGG
# 2 1 1 1

create lists that contain the rownumbers for which column i contains the maximum value of that row

In a data frame of 4 columns, I'm looking for an elegant way to get 3 lists containing the names from column 1, depending on whether the maximum of the row that name is in lies in column 2, 3 or 4.
the first column contains parameter names,
column 2 a shapiro test outcome on the raw data of parameter x
column 3, shapiro test outcome of log10 transformed data for parameter x
column 4, shapiro test outcome of a custom transformation given by the user for parameter x
if this is the data:
Parameter xval xlog10val xcustomval
1 FWS.Range 0.62233371 0.9741614 0.9619065
2 FL.Red.Range 0.48195980 0.9855781 0.9643206
3 FL.Orange.Range 0.43338087 0.9727243 0.8239867
4 FL.Yellow.Range 0.53554943 0.9022795 0.9223407
5 FL.Red.Gradient 0.35194524 0.9905047 0.5718224
6 SWS.Range 0.46932823 0.9487955 0.9825318
7 SWS.Length 0.02927791 0.4565962 0.7309313
8 FWS.Fill.factor 0.93764311 0.8039806 0.0000000
9 FL.Red.Total 0.22437754 0.9655873 0.9923307
QUESTION: how do I get a list that tells me all parameter names where xlog10val is the highest of the three columns (xval, xlog10val, xcustomval)?
Detailed explanation (skip if you like):
list 1, the rows where xval is the highest value, should look like this: 'FWS.Fill.factor', since that is the only row where xval has the highest score
list 2 is the list of all rows where xlog10val is the maximum value, and thus should contain the names of parameters where xlog10val is the maximum of that row:
'FWS.Range', 'FL.Red.Range', 'FL.Orange.Range',
'FL.Red.Gradient', 'FWS.Fill.factor'
and list 3 the rest of the names
I tried something like
df$Parameter[which(df$xval == max(df[ ,2:4]))]
but this gives integer(0) results.
EDIT
to clarify:
Let's start by looking at column 2 (xval).
PER row I need to test whether xval is the maximum of the 3 columns: xval, xlog10val, xcustomval.
If this is the case, add the parameter in THAT row to the xval_is_the_max_of_3_columns list.
Then we do the same PER row for xlog10val. If xlog10val in row i is the max of columns 2:4, add the name of that ROW to the xlog10val_is_the_max_of_3_columns list.
To make the DF:
df <- data.frame(Parameter = c('FWS.Range', 'FL.Red.Range', 'FL.Orange.Range', 'FL.Yellow.Range', 'FL.Red.Gradient','SWS.Range','SWS.Length','FWS.Fill.factor','FL.Red.Total'),
                 xval = c(0.622333705577588,0.481959800402278,0.433380866119736,0.535549430820635,0.351945244290616,0.469328232931424,0.0292779051823701,0.93764311477813,0.224377540663707),
                 xlog10val = c(0.974161367853916,0.985578135386898,0.97272429360688,0.902279501804112,0.990504657326703,0.94879549470406,0.45659620937997,0.803980592920426,0.965587334461157),
                 xcustomval = c(0.961906534164457,0.964320569400919,0.823986745004031,0.922340716468745,0.571822393107348,0.982531798077881,0.73093132928955,0,0.992330722386105))
We can use max.col to get the index of the maximum value in each row, and with that we subset the 'Parameter' column:
i1 <- max.col(df[-1], 'first')
split(df$Parameter, i1)
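For reference, on the example data this should give i1 = 2 2 2 3 2 3 3 1 3, so the split should come out roughly as follows (1 = xval, 2 = xlog10val, 3 = xcustomval holds the row maximum; exact print formatting may differ):
# $`1`
# [1] FWS.Fill.factor
#
# $`2`
# [1] FWS.Range       FL.Red.Range    FL.Orange.Range FL.Red.Gradient
#
# $`3`
# [1] FL.Yellow.Range SWS.Range       SWS.Length      FL.Red.Total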
EDIT: Based on the discussion with @Mark
I'm not sure exactly how you're selecting the parameters for lists two and three; however, you can also try something like this:
df$Parameter <- as.character(df$Parameter)
par.xval.max <- df[which.max(df$xval), "Parameter"]
par.col3.gt.max <- df[df$xlog10val > max(df$xval), "Parameter"]
par.rem <- df$Parameter[! df$Parameter %in% c(par.xval.max, par.col3.gt.max)]
In this case, the values from column three are greater than the max(df$xval), and the remaining parameters are taken by negative selection using %in%
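On the example data this should come out roughly as:
par.xval.max
# [1] "FWS.Fill.factor"
par.col3.gt.max
# [1] "FWS.Range"       "FL.Red.Range"    "FL.Orange.Range" "FL.Red.Gradient" "SWS.Range"       "FL.Red.Total"
par.rem
# [1] "FL.Yellow.Range" "SWS.Length"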

R: replacing values in a df according to a different df

I am new to R and I am having some trouble.
I am trying to replace some specific values of a data frame according to different values from another df.
For example:
I have these two data frames:
a <- data.frame(c('a','b','c','d'), c('g','e','p','d'))
1 a g
2 b e
3 c p
4 d d
b <- data.frame(c('a','c'))
1 a
2 c
I want to find out which items that are in df a are also in df b and assign them the value from the next column, in this case 'g' and 'p'. I tried the match function, but it has a problem if there are many items with the same name that need to be changed. I would really like a way to do this without checking one by one in a loop.
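A minimal sketch of one vectorized approach, assuming the columns are given explicit names (say key and val in a, and key in b; the frames in the question are unnamed), uses match to look up each key of b in a and pull the value from the neighbouring column:
a <- data.frame(key = c('a','b','c','d'), val = c('g','e','p','d'))
b <- data.frame(key = c('a','c'))
# for every key in b, find its position in a and take the value next to it
b$val <- a$val[match(b$key, a$key)]
b
#   key val
# 1   a   g
# 2   c   p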

group data by tolerance via index list

I don't know how to explain this briefly, but I'll try my best:
I have the following example data:
Data<-data.frame(A=c(1,2,3,5,8,9,10),B=c(5.3,9.2,5,8,10,9.5,4),C=c(1:7))
and an index
Ind<-data.frame(I=c(5,6,2,4,1,3,7))
The value in Ind corresponds to the C column in the Data. Now I want to start with the first Ind value and find the corresponding row in the Data data.frame (column C). From that row I want to go up and down and find values in column A that are within a tolerance range of 1. I want to write these values into a result dataframe, add a group id column, and delete them from the dataframe Data (where I found them). Then I start with the next entry in the index dataframe Ind, and so on until the data.frame Data is empty. I know how to match my Ind with column C of my Data, and how to write and delete and the other stuff in a for loop, but I don't know the main point, which is my question here:
When I have found my row in the Data, how can I look up fitting values of column A in the tolerance range above and below that entry, to get my group id?
what I want to get is this result:
 A    B C Group
 1  5.3 1     2
 2  9.2 2     2
 3    5 3     2
 5    8 4     3
 8   10 5     1
 9  9.5 6     1
10    4 7     4
Maybe somebody could help me with the critical point in my question or even how to solve this issue in a fast way.
Many thanks!
Generally: avoid deleting or growing a data frame row by row inside a loop. R's memory management means that every time you add or delete a row, another copy of the data frame is made. Garbage collection will eventually discard the "old" copies of the data frame, but garbage can quickly accumulate and reduce performance. Instead, add a logical column to the Data data frame, and set "extracted" rows to TRUE. So like this:
Data$extracted <- rep(FALSE,nrow(Data))
As for your problem: I get a different set of grouping numbers, but the groups are identical.
There might be a more elegant way to do this, but this will get it done.
# store results in a separate list
res <- list()
group.counter <- 1
# loop until they're all done.
for(idx in Ind$I) {
  # skip this iteration if idx is NA.
  if(is.na(idx)) {
    next
  }
  # dat.rows is a logical vector which shows the rows where
  # "A" meets the tolerance requirement.
  # specify the tolerance here.
  mytol <- 1
  # the next only works for integer compare.
  # also not covered: what if multiple values of C
  # match idx? do we loop over each corresponding value of A,
  # i.e. loop over each value of 'target'?
  target <- Data$A[Data$C == idx]
  # use the magic of vectorized logical compare.
  dat.rows <-
    ( (Data$A - target) >= -mytol) &
    ( (Data$A - target) <= mytol) &
    ( ! Data$extracted)
  # if dat.rows is all false, then nothing met the criteria.
  # skip the rest of the loop
  if( ! any(dat.rows)) {
    next
  }
  # copy the rows to the result list.
  res[[length(res) + 1]] <- data.frame(
    A=Data[dat.rows,"A"],
    B=Data[dat.rows,"B"],
    C=Data[dat.rows,"C"],
    Group=group.counter # this value will be recycled to match length of A, B, C.
  )
  # flag the extraction.
  Data$extracted[dat.rows] <- TRUE
  # increment the group counter
  group.counter <- group.counter + 1
}
# now make a data.frame from the results.
# this is the last step in how we avoid
#"growing" a data.frame inside a loop.
resData <- do.call(rbind, res)
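For reference, on the example data this should produce the same four groups as in the question's desired result, just with different group numbers (the exact print may differ slightly):
resData
#    A    B C Group
# 1  8 10.0 5     1
# 2  9  9.5 6     1
# 3 10  4.0 7     2
# 4  1  5.3 1     3
# 5  2  9.2 2     3
# 6  3  5.0 3     3
# 7  5  8.0 4     4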
