Thresholding a data frame without removing values

I have a data frame consisting of a non-unique identifier (ID) and measures of some property of the objects within that ID, something like this:
ID Sph
A 1.0
A 1.2
A 1.1
B 0.5
B 1.8
C 2.2
C 1.1
D 2.1
D 3.0
First, I get the number of instances of each ID as X using table(df$ID), i.e. A=3, B=2, C=2 and D=2. Next, I would like to apply a threshold to the "Sph" column after getting the number of instances, keeping only rows where the Sph value exceeds the threshold. With threshold 2.0, for instance, I would use thold=df[df$Sph>2.0,]. Finally, I would like to replace the ID column with the X value that I computed using table above. For instance, with a threshold of 1.1 in the "Sph" column I would like the following output:
ID Sph
3 1.0
2 1.8
2 2.2
2 2.1
2 3.0
In other words, after using table() to get an x value corresponding to the number of times an ID has occurred, say 3, I would like to then assign that number to every value in that ID, Y, that is over some threshold.

There are some inconsistencies in your question and you didn't give a reproducible example, but here's my attempt.
I like to use the dplyr library. In this case I had to fall back on an sapply; maybe someone can improve on my answer.
Here's the short version:
library(dplyr)
#your data
x <- data.frame(ID=c(rep("A",3),rep("B",2),rep("C",2),rep("D",2)),Sph=c(1.0,1.2,1.1,0.5,1.8,2.2,1.1,2.1,3.0),stringsAsFactors = FALSE)
#lookup table
y <- summarise(group_by(x,ID), IDn=n())
#fill in original table
x$IDn <- sapply(x$ID,function(z) as.integer(y[y$ID==z,"IDn"]))
#filter for rows where Sph greater or equal to 1.1
x <- x %>% filter(Sph>=1.1)
#done
x
And here's the longer version with explanatory output:
> library(dplyr)
> #your data
> x <- data.frame(ID=c(rep("A",3),rep("B",2),rep("C",2),rep("D",2)),Sph=c(1.0,1.2,1.1,0.5,1.8,2.2,1.1,2.1,3.0),stringsAsFactors = FALSE)
> x
ID Sph
1 A 1.0
2 A 1.2
3 A 1.1
4 B 0.5
5 B 1.8
6 C 2.2
7 C 1.1
8 D 2.1
9 D 3.0
>
> #lookup table
> y <- summarise(group_by(x,ID), IDn=n())
> y
Source: local data frame [4 x 2]
ID IDn
1 A 3
2 B 2
3 C 2
4 D 2
>
> #fill in original table
> x$IDn <- sapply(x$ID,function(z) as.integer(y[y$ID==z,"IDn"]))
> x
ID Sph IDn
1 A 1.0 3
2 A 1.2 3
3 A 1.1 3
4 B 0.5 2
5 B 1.8 2
6 C 2.2 2
7 C 1.1 2
8 D 2.1 2
9 D 3.0 2
>
> #filter for rows where Sph greater or equal to 1.1
> x <- x %>% filter(Sph>=1.1)
>
> #done
> x
ID Sph IDn
1 A 1.2 3
2 A 1.1 3
3 B 1.8 2
4 C 2.2 2
5 C 1.1 2
6 D 2.1 2
7 D 3.0 2
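One possible way to avoid the sapply lookup is a grouped mutate, which attaches the per-ID count to every row before filtering. This is only a sketch under the assumption that a reasonably recent dplyr is installed; the name result is introduced here for illustration.

```r
library(dplyr)

# same data as in the answer above
x <- data.frame(
  ID  = c(rep("A", 3), rep("B", 2), rep("C", 2), rep("D", 2)),
  Sph = c(1.0, 1.2, 1.1, 0.5, 1.8, 2.2, 1.1, 2.1, 3.0),
  stringsAsFactors = FALSE
)

result <- x %>%
  group_by(ID) %>%        # one group per ID
  mutate(IDn = n()) %>%   # count rows per ID before any filtering
  ungroup() %>%
  filter(Sph >= 1.1)      # apply the threshold afterwards

result
```

Because the count is computed before the filter, each surviving row still carries the size of its original group.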

You can actually do this in one step after computing X and thold as you did in your question:
X <- table(df$ID)
thold <- df[df$Sph > 1.1,]
thold$ID <- X[as.character(thold$ID)]
thold
# ID Sph
# 2 3 1.2
# 5 2 1.8
# 6 2 2.2
# 8 2 2.1
# 9 2 3.0
Basically you look up the frequency of each ID value in the table X that you built.
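For completeness, here is a self-contained sketch of the same lookup with the example data from the question typed in. The as.vector() call is an optional addition, not part of the answer above: it drops the table class and names so that the ID column ends up as a plain integer vector.

```r
# example data from the question
df <- data.frame(
  ID  = c("A", "A", "A", "B", "B", "C", "C", "D", "D"),
  Sph = c(1.0, 1.2, 1.1, 0.5, 1.8, 2.2, 1.1, 2.1, 3.0),
  stringsAsFactors = FALSE
)

X <- table(df$ID)                 # counts per ID, computed before thresholding
thold <- df[df$Sph > 1.1, ]       # keep rows above the threshold
# index the table by ID name, then strip the table attributes
thold$ID <- as.vector(X[as.character(thold$ID)])
thold
```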

Related

Remove groups based on multiple conditions in dplyr R

I have a data that looks like this
gene=c("A","A","A","A","B","B","B","B")
frequency=c(1,1,0.8,0.6,0.3,0.2,1,1)
time=c(1,2,3,4,1,2,3,4)
df <- data.frame(gene,frequency,time)
gene frequency time
1 A 1.0 1
2 A 1.0 2
3 A 0.8 3
4 A 0.6 4
5 B 0.3 1
6 B 0.2 2
7 B 1.0 3
8 B 1.0 4
I want to remove every gene group (in this case A or B) that has frequency > 0.9 at time==1.
In this case I want to remove A, so that my data looks like this:
gene frequency time
1 B 0.3 1
2 B 0.2 2
3 B 1.0 3
4 B 1.0 4
Any hint or help is appreciated.

We can use subset from base R: build a logical condition from both expressions, extract the 'gene' values that satisfy it, test membership with %in%, and negate (!) the result to keep only the genes that do not match. Alternatively, when every gene has exactly one row at time == 1, change the > to <= and drop the !.
subset(df, !gene %in% gene[frequency > 0.9 & time == 1])
Output:
gene frequency time
5 B 0.3 1
6 B 0.2 2
7 B 1.0 3
8 B 1.0 4
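Since the question title mentions dplyr, the same idea can be sketched with a grouped filter: a group is kept only if none of its rows satisfies both conditions. This assumes dplyr is installed; the name kept is introduced here for illustration.

```r
library(dplyr)

# data from the question
df <- data.frame(
  gene      = c("A", "A", "A", "A", "B", "B", "B", "B"),
  frequency = c(1, 1, 0.8, 0.6, 0.3, 0.2, 1, 1),
  time      = c(1, 2, 3, 4, 1, 2, 3, 4)
)

kept <- df %>%
  group_by(gene) %>%
  # any() is evaluated once per gene; drop the whole group if it is TRUE
  filter(!any(frequency > 0.9 & time == 1)) %>%
  ungroup()

kept
```

A gene with no row at time == 1 is kept by this filter, which matches the behaviour of the subset call above.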

Changing duplicated coordinate values by adding a decimal place R

I have UTM coordinate values from GPS collared leopards, and my analysis gets messed up if there are any points that are identical. What I want to do is add a 1 to the end of the decimal string to make each value unique.
What I have:
> View(coords)
> coords
X Y
1 623190.9 4980021
2 618876.6 4980729
3 618522.7 4980896
4 618522.7 4980096
5 618522.7 4980096
6 622674.1 4976161
I want something like this, or something that will make each number unique (doesn't have to be a +1)
> coords
X Y
1 623190.9 4980021
2 618876.6 4980729
3 618522.7 4980896
4 618522.71 4980096.1
5 618522.72 4977148.2
6 622674.1 4976161
I've looked at existing questions and got this to work for a simulated data set, but not for values that are duplicated more than once.
DF <- data.frame(A=c(5,5,6,6,7,7), B=c(1, 1, 2, 2, 2, 3))
>View(DF)
A B
1 5 1
2 5 1
3 6 2
4 6 2
5 7 2
6 7 3
DF <- do.call(rbind, lapply(split(DF, list(DF$A, DF$B)),
function(x) {
x$A <- x$A + seq(0, by=0.1, length.out=nrow(x))
x$B <- x$B + seq(0, by=0.1, length.out=nrow(x))
x
}))
>View(DF)
A B
5.1.1 5.0 1.0
5.1.2 5.1 1.1
6.2.3 6.0 2.0
6.2.4 6.1 2.1
7.2 7.0 2.0
7.3 7.0 3.0
The '2's in column B don't continue to add a decimal place when there are more than two of them. I also had a problem accomplishing this when the number had more than 4 digits (i.e. XXXXX vs XX). There's probably a better way to do this, but I would love help with adding these decimals and possibly altering them in the original data frame, which has 12 columns of various data.

It is easier to use make.unique
DF[] <- lapply(DF, function(x) as.numeric(make.unique(as.character(x))))
DF
# A B
#1 5.0 1.0
#2 5.1 1.1
#3 6.0 2.0
#4 6.1 2.1
#5 7.0 2.2
#6 7.1 3.0
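One caveat: make.unique appends a suffix such as ".1" to the text, so a value that already has a decimal point (for example 618522.7) becomes "618522.7.1", which as.numeric turns into NA. For coordinates like those in the question, a numeric offset per repeated row may be safer. The following is only a sketch: the names dup_index and step are introduced here, and the offset size of 0.01 is an assumption that should be chosen to suit the data.

```r
# coordinates from the question
coords <- data.frame(
  X = c(623190.9, 618876.6, 618522.7, 618522.7, 618522.7, 622674.1),
  Y = c(4980021, 4980729, 4980896, 4980096, 4980096, 4976161)
)

# position of each row within its group of identical (X, Y) pairs:
# 0 for the first occurrence, 1 for the second, and so on
dup_index <- ave(seq_len(nrow(coords)), coords$X, coords$Y, FUN = seq_along) - 1

step <- 0.01  # assumed offset per repeat
coords$X <- coords$X + dup_index * step
coords$Y <- coords$Y + dup_index * step

coords
```

Only rows that repeat an earlier (X, Y) pair are shifted; the first occurrence and all other columns of the data frame are left untouched.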

How to find the measurements of same type in R

I am new to R. I have a question regarding my data set.
S.NO Type Measurements
1 1 2.1
2 2 3.3
3 2 3.1
4 3 2.7
5 3 2.6
6 3 4.5
7 2 1.1
8 3 2.2
Suppose we have measurements in column 3, with their types given in column 2. Each measurement is type 1, type 2 or type 3. If we are interested in only the measurements corresponding to type 2 (say), how can we do that in R?
I look forward to your response.

This is a basic subsetting question covered in most introductory R guides:
with(mydf, mydf[Type == 2, ])
# S.NO Type Measurements
# 2 2 2 3.3
# 3 3 2 3.1
# 7 7 2 1.1
with(mydf, mydf[Type == 2, "Measurements"])
# [1] 3.3 3.1 1.1
You can also look at the subset function:
subset(mydf, subset = Type == 2, select = "Measurements")
# Measurements
# 2 3.3
# 3 3.1
# 7 1.1

# make some data
testData <- data.frame(measurement = 1:10,
                       Type = sample(1:3, 10, replace = TRUE))
# fetch only the rows of type 2
testData[testData$Type == 2, ]
# now only the measurements
testData[testData$Type == 2, "measurement"]

Return a Data Frame with Only the Max Values

Say I have this data frame, lesson, with 3 columns (User, Course, Score), which looks something like:
User Course Score
A 1.1 9
A 1.1 8
B 1.2 7
Only it has a lot more data. If I want to get a data frame that only has the highest scores for each course by each user, how would I go about doing that?
I tried:
lesson<-lesson[order(lesson$User,lesson$Course,-lesson$User),]
and then
lesson[!duplicated(lesson$User && lesson$Course),]
but I got an error back.

DF <- read.table(text="User Course Score
A 1.1 9
A 1.1 8
B 1.1 1
B 1.2 7",header=TRUE)
aggregate(Score~Course*User,data=DF,FUN=max)
# Course User Score
#1 1.1 A 9
#2 1.1 B 1
#3 1.2 B 7

Or you might want to try the plyr package:
library(plyr)
ddply(DF,.(User,Course),transform,maxScore=max(Score,na.rm=TRUE))
User Course Score maxScore
A 1.1 9 9
A 1.1 8 9
B 1.1 1 1
B 1.2 7 7
or if you want to see the max score only
ddply(DF,.(User,Course),summarise,maxScore=max(Score,na.rm=TRUE))
User Course maxScore
A 1.1 9
B 1.1 1
B 1.2 7
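The approach attempted in the question can also be made to work in base R, with two changes: sort by descending Score rather than by User twice, and pass both key columns to duplicated instead of combining them with &&. A sketch, using the DF from the first answer in place of lesson (the name best is introduced here):

```r
lesson <- read.table(text = "User Course Score
A 1.1 9
A 1.1 8
B 1.1 1
B 1.2 7", header = TRUE)

# highest score first within each User/Course pair
lesson <- lesson[order(lesson$User, lesson$Course, -lesson$Score), ]
# keep the first row of each User/Course pair
best <- lesson[!duplicated(lesson[c("User", "Course")]), ]
best
```

Unlike aggregate, this keeps whole rows, which helps when the data frame has more columns than the three shown.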

for loop through data frame and looping with unique values

I'm trying to work on code to build a function for three stage cluster sampling, however, I am just working with dummy data right now so I can understand what is going into my function.
I am working on for loops and have a data frame with grouped values. The data frame looks like this:
Cluster group value value.K.bar value.M.bar N.bar
1 1 A 1 1.5 2.5 4
2 1 A 2 1.5 2.5 4
3 1 B 3 4.0 2.5 4
4 1 B 4 4.0 2.5 4
5 2 B 5 4.0 6.0 4
6 2 C 6 6.5 6.0 4
7 2 C 7 6.5 6.0 4
and I am trying to run the for loop
n <- dim(data)[1]
e <- 0
total <- 0
for(i in 1:n) {e = data.y$value.M.bar[i] - data$N.bar[i]
total = total + e^2}
My question is: Is there a way to run the same loop but for the unique values in the group? Say by:
Group 'A', 'B', 'C'
Any help would be greatly appreciated!
Edit: for correct language

You can use by, for example, to apply a function to your data per group. First I wrap your code in a function that takes data as input.
get.total <- function(data){
  n <- dim(data)[1]
  e <- 0
  total <- 0
  for(i in 1:n) {
    e <- data$value.M.bar[i] - data$N.bar[i] ## I corrected this line (data.y -> data)
    total <- total + e^2
  }
  total
}
Then, to compute the total for each group, you do this:
by(data,data$group,FUN=get.total)
data$group: A
[1] 4.5
----------------------------------------------------------------------------------------------------
data$group: B
[1] 8.5
----------------------------------------------------------------------------------------------------
data$group: C
[1] 8
Better still, here is a vectorized version of your function:
by(data,data$group,
function(dat)with(dat, sum((value.M.bar - N.bar)^2)))
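A self-contained sketch of the same per-group sum using tapply, with the data frame from the question typed in. tapply is a swapped-in alternative to by here, and the name totals is introduced for illustration.

```r
# data frame from the question
data <- data.frame(
  Cluster     = c(1, 1, 1, 1, 2, 2, 2),
  group       = c("A", "A", "B", "B", "B", "C", "C"),
  value       = 1:7,
  value.K.bar = c(1.5, 1.5, 4, 4, 4, 6.5, 6.5),
  value.M.bar = c(2.5, 2.5, 2.5, 2.5, 6, 6, 6),
  N.bar       = 4
)

# squared differences for every row, summed within each group
totals <- tapply((data$value.M.bar - data$N.bar)^2, data$group, sum)
totals
#   A   B   C
# 4.5 8.5 8.0
```

The result is a named vector, so a single group's total can be read with, for example, totals["B"].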
