Subsetting dataframe rows based on decimals in R?

I am quite new to R and have quite a challenging question. I have a large dataframe consisting of 110,000 rows representing high-resolution data from a sediment core. I would like to select multiple rows based on depth (which is recorded in mm to 3 decimal places). Of course, I do not have time to go through the entire dataframe and pick out the rows I need. I would like to be able to select rows based on the decimal part of the number and not the integer part, i.e. to subset to a dataframe where all the .035 values are returned. I have so far tried using the which() function but had no luck:
newdata <- Linescan_EN18218[which(Linescan_EN18218$Position.mm. == 0.035), ]
Can anyone offer any hints or suggestions on how I can solve this problem? Link to the first part of the dataframe csv.

Welcome to Stack Overflow!
Can you please describe further what you mean by "had no luck"? Did you get an error message or an empty data.frame?
In principle, your method should work. I have replicated it with simulated data:
n <- 100
test <- data.frame(
  a = 1:n,
  b = rnorm(n = n),
  c = sample(c(0.1, 0.035, 0.0001), size = n, replace = TRUE)
)
newdata <- test[which(test$c == 0.035), ]
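Note that you asked to match on the decimal part regardless of the digits before the point; == on the whole value will only find depths that are exactly 0.035. A minimal sketch of matching the fractional part instead (assuming positive depths recorded to three decimals; Position.mm. is the column name from your question):
# fractional part of each depth, rounded to 3 decimals to sidestep
# floating-point representation noise
frac <- round(Linescan_EN18218$Position.mm. %% 1, 3)
newdata <- Linescan_EN18218[frac == 0.035, ]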

Related

Creating a multidimensional list to replace subsetting - Is it worth it?

Basic idea:
As the title says: is it a good idea to substitute subsetting a data frame with a multidimensional list?
I have a function that needs to generate a subset from a quite big data frame close to 30 thousand times. Thus, creating a 4-dimensional list would give me instant access to each subset, without losing time generating it.
However, I don't know how R treats these objects, so I would like your opinion on it.
More concrete example, if needed:
What I was trying to do is to use the KNN imputation method. Basically, the algorithm says that values flagged as outliers have to be replaced with the K (K is a number; it could be 1, 2, 3, ...) closest neighbors. The neighbors in this example are the rows with the same values in the first 4 columns, and the closest neighbors are the ones with the smallest difference in the fifth column. If what I said is not clear, please still consider reading the code, because I found it hard to describe in words.
These are the objects:
# create a vector with random values
values <- floor(runif(5e7, 0, 50))
possible.outliers <- floor(runif(5e7, 0, 10000))
# use these values, shuffled, to create a data frame
df <- data.frame(sample(values), sample(values), sample(values),
                 sample(values), sample(values), sample(possible.outliers))
# all values greater than 800 will be marked as outliers
df$isOutlier <- df[, 6] > 800
This is the function that will be used to replace the outliers:
# with the generated data frame, apply this function
# Parameters:
#  *df: the entire data frame from above
#  *vector.row: the row that was flagged as containing an outlier;
#   the outlier will be replaced with the return value of this function
#  *numberK: the number of neighbors to take into account
#  !Very important: for the last column, the higher the difference
#   between the values, the less attractive they are for imputation.
foo <- function(df, vector.row, numberK){
  # find the neighbors
  subset = df[ vector.row[1] == df[,1] & vector.row[2] == df[,2] &
               vector.row[3] == df[,3] & vector.row[4] == df[,4] , ]
  # take the "distance" between the rows, so we can find the
  # closest neighbors
  subset$distance = subset[,5] - vector.row[5]
  # no need to implement this part:
  "function that finds the closest neighbors from the distance on subset"
  return(mean(ClosestNeighbors))
}
So, the function's runtime is quite long. For this reason, I am searching for alternatives, and I thought that, maybe, I could replace the subsetting with something like this:
list[[" Levels COl1 "]][[" Levels COl2 "]]
[[" Levels COl3 "]][[" Levels COl4 "]]
This should give instant access to the subset, instead of generating it every time inside the function.
Is it a reasonable idea? I'm a noob in R.
If you did not understand what is written, or would like something explained in more detail or in other words, please tell me, because I know this is not the most direct question.
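For what it's worth, one way to pre-build such a lookup in base R is split(), which returns a named list with one data frame per combination of the key columns; access is then plain list indexing. This is only a sketch, assuming the first four columns hold discrete values:
# build the lookup once; drop = TRUE skips empty combinations
lookup <- split(df, interaction(df[, 1], df[, 2], df[, 3], df[, 4],
                                drop = TRUE, sep = "-"))
# inside foo(), replace the repeated subsetting with a list access
key <- paste(vector.row[1:4], collapse = "-")
subset <- lookup[[key]]
Whether this pays off depends on how many distinct key combinations exist: the list trades memory (every subset is materialized up front) for near-constant-time lookups.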

Random generation of numbers using R

I have some data involving zebu (beef animals) that are labeled 1-40. I need to divide them into 4 groups of 10 each. I need to choose them randomly to remove any bias, and I need to use R and Excel. Thank you, please help.
There are ways of doing this that require less code, but here's a verbose example that lets me explain what's happening.
Here's the dataset I'll be using since I don't know exactly how your data look.
beef <- data.frame(number = 1:40,
                   weight = round(rnorm(40, mean = 2000, sd = 500)))
Because your animals are numbered from 1 to 40, you can create a new dataframe that pairs those numbers with a randomly assigned group number (1 to 4) as the second column. Note that sampling the group labels with replacement would not guarantee equal group sizes; shuffling rep(1:4, each = 10) gives exactly ten animals per group.
num_group <- data.frame(
  number = 1:40,
  group = sample(rep(1:4, each = 10))
)
Join the two dataframes together and you have your answer.
merge(beef, num_group)
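As a quick sanity check, you can confirm that every group ended up with exactly ten animals (guaranteed by the sampling scheme above):
table(num_group$group)
#  1  2  3  4
# 10 10 10 10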
To shuffle the data in Excel, follow this tip:
Create a new column in your data and fill it with RAND().
It will generate a random number in each row of that column; sort by the random-number column and your data will be shuffled.
Later, load the data into R, take 10 rows at a time, and assign a group to each block.
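Back in R, that last step could look like this (a sketch; it assumes the shuffled sheet was read into a data frame called beef):
# after shuffling in Excel, the first 10 rows become group 1,
# the next 10 group 2, and so on
beef$group <- rep(1:4, each = 10)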

Remove Duplicates, but Keep the Most Complete Iteration

I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled. If an equal number are filled, either can be removed.
For example,
Original <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 5, 5),
                       key = c(1, 2, 2, 3, 3, 4, 5, 5),
                       num = c(1, 1, 1, 1, 1, 1, 1, 1),
                       v4  = c(1, NA, 5, 5, NA, 5, NA, 7),
                       v5  = c(1, NA, 5, 5, NA, 5, NA, 7))
The output would be the following:
Finished <- data.frame(id  = c(1, 2, 3, 4, 5),
                       key = c(1, 2, 3, 4, 5),
                       num = c(1, 1, 1, 1, 1),
                       v4  = c(1, 5, 5, 5, 7),
                       v5  = c(1, 5, 5, 5, 7))
My real dataset is bigger, mostly numerical but with some character variables, and I couldn't determine the best way to go about doing this. I've previously used a program that would do something similar within its duplicates command, called check.all.
So far, my thoughts have been to use grepl and determine where "anything" is present:
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resultant dataframe, I take rowSums and cbind the result to the original:
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps. I have a variable which tells me how many columns are filled in each row (CompleteNess); however, I'm unsure how to use it to handle the duplicates.
Simply put: when id, key, and num are duplicated, keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!
Here is a solution. It is not very pretty but it should work for your application:
# order by the degree of completeness
Original <- Original[order(CompleteNess), ]
# starting from the bottom, select the non-duplicated rows
# based on the first 3 columns
Original[!duplicated(Original[, 1:3], fromLast = TRUE), ]
This does rearrange your original data frame so beware if there is additional processing later on.
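If the reordering is a problem, a small variant of the same idea (a sketch, assuming the default numeric row names) works on a sorted copy and then restores the original row order:
sorted  <- Original[order(CompleteNess), ]
deduped <- sorted[!duplicated(sorted[, 1:3], fromLast = TRUE), ]
# row names survive subsetting, so they can restore the input order
deduped[order(as.numeric(rownames(deduped))), ]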
You can aggregate your data and select the row with max score:
Original <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 5, 5),
                       key = c(1, 2, 2, 3, 3, 4, 5, 5),
                       num = c(1, 1, 1, 1, 1, 1, 1, 1),
                       v4  = c(1, NA, 5, 5, NA, 5, NA, 7),
                       v5  = c(1, NA, 5, 5, NA, 5, NA, 7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original,.(id.key.num),summarize,
Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present),
               v4 = v4[which.max(present)],
               v5 = v5[which.max(present)])
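If there are many value columns, listing each one gets tedious; a sketch of the same idea that keeps whole rows instead (still plyr, selecting the most complete row within each group):
Final <- ddply(Original, .(id.key.num),
               function(d) d[which.max(d$present), ])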

Using ddply() to Get Frequency of Certain IDs, by Appearance in Multiple Rows (in R)

Goal
If the following description is hard to follow, please see the "before" and "after" examples below for a straightforward illustration.
I have bartering data with unique trade IDs and two sides to each trade. Side1 and Side2 are baskets: lists of item IDs that represent the two sides of the barter transaction.
I'd like to count the frequency with which each ITEM appears in TRADES. E.g., if item "001" appeared in 3 trades, I'd have a count of 3 (ignoring how many times the item appeared within each trade).
Further, I'd like to do this with the plyr ddply function.
(If you're interested as to my motivation: I'm working over many hundreds of thousands of transactions and am already using ddply to calculate several other summary statistics. I'd like to add this to the ddply call I'm already using, rather than calculate it afterwards and merge it into the ddply output... sorry if that was difficult to follow.)
In terms of pseudo code I'm working off of:
merge each row of Side1 and Side2
by row, get unique() appearances of each item id
apply table() function
transpose and relabel output from table
Example of the structure of my data, and the output I desire.
Data Example (before):
df <- data.frame(TradeID = c("01", "02", "03", "04"))
df$Side1 <- list(c("001", "001", "002"),
                 c("002", "002", "003"),
                 c("001", "004"),
                 c("001", "002", "003", "004"))
df$Side2 <- list(c("001"), c("007"), c("009"), c())
Desired Output (after):
df.ItemRelFreq_byTradeID <- data.frame(
  ItemID = c("001", "002", "003", "004", "007", "009"),
  RelFreq_byTrade = c(3, 3, 2, 2, 1, 1))
One method to do this without ddply
I've worked out one way to do this below. My problem is that I can't quite seem to get ddply to do this for me.
temp <- table(unlist(sapply(mapply(c,df$Side1,df$Side2), unique)))
df.ItemRelFreq_byTradeID <- data.frame(ItemID = names(temp),
                                       RelFreq_byTrade = as.vector(temp))
Thanks for any help you can offer!
Curtis
I believe this will do what you're asking for. It uses ddply. Twice!
library(plyr)
res <- ddply(df, .(TradeID), function(d)
  data.frame(ItemID = c(d$Side1[[1]], d$Side2[[1]]), TradeID = d$TradeID))
ddply(res, .(ItemID), summarise, RelFreq_byTrade = length(unique(TradeID)))
Note that the ItemIDs are slightly out of order.
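If the ordering matters, sorting the result afterwards is straightforward (a sketch; freq here is a hypothetical name for the result of the second ddply call above):
freq <- freq[order(freq$ItemID), ]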

Mean of triplicate

I've just cleaned up a data frame that I scraped from an Excel spreadsheet by, amongst other things, removing percentage signs from some of the numbers; see Removing Percentages from a Data Frame.
The data has twenty-four rows representing the parameters and results from eight experiments done in triplicate, e.g. what one would get from,
DF1 <- data.frame(X = 1:24, Y = 2 * (1:24), Z = 3 * (1:24))
I want to find the mean of each of the triplicates (which, fortunately, are in sequential order) and create a new data frame with eight rows and the same number of columns.
I tried to do this using,
DF2 <- data.frame(replicate(3,sapply(DF1, mean)))
which just gave me the overall mean of each column, repeated three times as rows. I wanted to get a dataframe that would give me,
data.frame(X = c(2,5,8,11,14,17,20,23), Y = c(4,10,16,22,28,34,40,46), Z = c(6,15,24,33,42,51,60,69))
which I worked out by hand; it's supposed to be the reduced result.
Thanks, ...
Any help would be gratefully received.
Nice task for codegolf!
aggregate(DF1, list(rep(1:8, each=3)), mean)[,-1]
To be more general, you should replace 8 with nrow(DF1)/3.
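Spelled out, the generalized call might look like this (a sketch; k is the replicate count, assumed to divide the row count evenly):
# average every k consecutive rows
k <- 3
groups <- rep(seq_len(nrow(DF1) / k), each = k)
DF2 <- aggregate(DF1, list(groups), mean)[, -1]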
... or, my favorite, using matrix multiplication:
t(t(DF1) %*% diag(8)[rep(1:8,each=3),]/3)
This works:
foo <- matrix(unlist(by(data = DF1, INDICES = rep(1:8, each = 3), FUN = colMeans)),
              nrow = 8, byrow = TRUE)
colnames(foo) <- colnames(DF1)
Look at ?by.
