Partitioning Data in R based on data size

I'm currently working on a program that analyzes leaf area and compares it to the position of the leaf within its cluster (i.e., is it the first leaf, the third, the last, etc.), and I'm analyzing the relationship between position, area, mass, and more. I have a database of approximately 5,000 leaves and 1,000 clusters, and that's where the problem arises.
Clusters come in different sizes: most have 5 leaves, but some have 2, 8, or anywhere in between. I need a way to separate the clusters by the number of leaves in the cluster so that the program isn't treating clusters with 3 leaves the same as clusters with 7. My .csv has each leaf entered individually, so simply inputting the different sets manually isn't possible.
I'm rather new to R, so I might be missing an obvious skill here, but any help would be greatly appreciated. I also understand this is rather confusing, so please feel free to reply with clarifying questions.
Thanks in advance.

If I understand the question correctly, it sounds like you want to calculate things based on some defined group (in your case, clusterPosition?). One way to do this with dplyr is to use group_by with summarize or mutate. The latter keeps all the rows in your original data set and simply adds a new column to it; the former aggregates like rows and returns one summary statistic for each unique value of the grouping variable.
As an example, if your data looks something like this:
df <- data.frame(leafArea = c(2.0, 3.0, 4.0, 5.0, 6.0),
                 cluster = c(1, 2, 1, 2, 3),
                 clusterPosition = c(1, 1, 2, 2, 1))
To get the mean and standard deviation for each unique clusterPosition, you would do something like the following, which returns one row per unique clusterPosition:
library(dplyr)

df %>%
  group_by(clusterPosition) %>%
  summarize(meanArea = mean(leafArea), sdArea = sd(leafArea))
If you want to compare each individual leaf to some characteristic of its clusterPosition, i.e., you want to preserve all the individual rows in your original dataset, you can use mutate instead of summarize:
library(dplyr)

df %>%
  group_by(clusterPosition) %>%
  mutate(meanPositionArea = mean(leafArea),
         diffMean = leafArea - meanPositionArea)
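To address the original question of separating clusters by size, here is a minimal sketch under the same assumed column names (cluster, clusterPosition, leafArea): tag each leaf with the size of its cluster using n(), then group by that size so 3-leaf clusters are never pooled with 7-leaf clusters.

library(dplyr)

df %>%
  group_by(cluster) %>%
  mutate(clusterSize = n()) %>%                # number of leaves in this cluster
  group_by(clusterSize, clusterPosition) %>%   # analyze within each cluster size
  summarize(meanArea = mean(leafArea))

If you truly need a separate data set per cluster size, you can split() on the new clusterSize column after the mutate step instead.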

Related

Fuzzy matching in R - about 1 million rows

I have a list of about one million individuals, each identified by his/her name and surname. Individuals might be present in the list more than once. I would like to group observations by individual and count how many times they appear; this is usually fine, and I do it with dplyr::group_by.
However, there are spelling mistakes. To solve the issue, I thought of computing a measure of string distance within this list. I would then go ahead and assume that if the string distance is below a certain threshold, the records identify the same individual.
All the methods I tried so far are either too time-consuming or plain infeasible RAM-wise.
This is my attempt using dplyr and RecordLinkage:
library(dplyr)
library(RecordLinkage)

# Pair every name with every other name
list_matrix <- expand.grid(x = individual_list, pattern = individual_list,
                           stringsAsFactors = FALSE)
# The same is achieved using stringdistmatrix (stringdist package)
result <- list_matrix %>%
  group_by(x) %>%
  mutate(similarity = levenshteinSim(x, pattern)) %>%
  summarise(match = similarity[which.max(similarity)],
            matched_to = pattern[which.max(similarity)])
This method works well with small data sets. Intuitively, I always compare all elements with each other. Nevertheless, the resulting matrix has dimensions numberofrows x numberofrows, which in my case is a million by a million and way too heavy to be handled.
I also gave other functions a shot: adist, pmatch, agrep(l). The same logic applies. I think the problem is conceptual here. Any ideas?
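No answer is recorded here, but one direction worth sketching (my suggestion, not from the thread) is "blocking": only compare names that share a cheap key, such as their first two characters, so the full million-by-million matrix is never built. The toy names and the threshold of 2 below are made up for illustration; stringdistmatrix() is from the stringdist package.

library(stringdist)

# Toy stand-in for individual_list (hypothetical data)
individual_list <- c("john smith", "jon smith", "mary jones",
                     "marry jones", "peter brown")

# Block on the first two characters so comparisons stay within small groups
blocks <- split(individual_list, substr(individual_list, 1, 2))

# Within each block, keep pairs whose Levenshtein distance is at most 2
candidate_pairs <- lapply(blocks, function(nms) {
  if (length(nms) < 2) return(NULL)
  d <- stringdistmatrix(nms, nms, method = "lv")
  idx <- which(d <= 2 & upper.tri(d), arr.ind = TRUE)
  data.frame(a = nms[idx[, 1]], b = nms[idx[, 2]])
})
do.call(rbind, candidate_pairs)

The trade-off is that a typo in the first two characters escapes its block; phonetic keys (e.g., soundex, available in stringdist via phonetic()) are a common, more forgiving choice of blocking key.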

How to filter a column based on a condition from another column in R

I have a huge data table with millions of rows and dozens of columns, so performance is a crucial issue for me. The data describes visits to a content site. I want to compute the ContentID of the earliest (i.e., minimum hit time) hit of each visit. What I did is:
dt[, .(FirstContentOfVisit = ContentID[ContentID != ""][which.min(HitTime)]),
   by = VisitId, .SDcols = c("ContentID", "HitTime")]
The problem is that I don't know whether which.min computes the minimum over the whole HitTime vector (which I don't want!) or only over the filtered HitTime vector (the one corresponding to the non-empty ContentIDs).
In addition, after I compute it, how can I get the minimal HitTime of the ContentIDs that differ from the first one (i.e., the earliest hit time of the non-first content IDs)?
When I tried to do both with user-defined functions (first sort the sub data table, then extract the desired value) it took ages and actually never finished, although I have a very strong (virtual) machine with 180 GB RAM. So I'm looking for an inline solution.
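On the first point, a hedged sketch (my reading of the indexing issue, not a confirmed answer from the thread): in the expression above, which.min(HitTime) runs over the unfiltered HitTime, so its result indexes a longer vector than the filtered ContentID. Applying the same mask to both vectors keeps the indices aligned:

library(data.table)

dt[, .(FirstContentOfVisit = {
  ok <- ContentID != ""                     # one mask for both vectors
  ContentID[ok][which.min(HitTime[ok])]     # indices now refer to the same subset
}), by = VisitId]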
dplyr makes this much easier. You didn't share a sample of your data, but I assume the variables of interest look something like this.
library(dplyr)

web <- tibble(
  HitTime = sample(seq(as.Date('2010/01/01'), as.Date('2017/02/23'), by = "day"), 1000),
  ContentID = 1:1000,
  SessionID = sample(1:100, 1000, replace = TRUE)
)
Then you can just use group_by and summarise to find the earliest value of HitTime for each SessionID.
web %>%
  group_by(SessionID) %>%
  summarise(HitTime = min(HitTime))
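To get the content of that earliest hit rather than just its time (a hedged extension using the same made-up web data, since the question asks for the ContentID), index ContentID by the position of the minimum HitTime within each group:

web %>%
  group_by(SessionID) %>%
  summarise(FirstContent = ContentID[which.min(HitTime)])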

R dataframe - Collapse multiple columns into a single numeric vector, row-by-row

I'm sorry if this is elementary or a repeat question, but I have been looking for hours, to little avail.
I want to take multiple numeric columns in a dataframe (say 100), and combine them into a single numeric vector that I can store in a single column. I plan to use the dplyr::transmute() function to store the result and drop the original 100 columns. However, that is not the problem. My problem is getting the operation to iterate over each of the rows in the dataframe. To put it simply, imagine I was working with the mtcars dataframe:
as.numeric(mtcars[x,2:8])
would give me a single numeric vector of row x, columns 2 (cyl) through 8 (vs), which I could then store in a new column. But how would I apply this to all 32 rows of the data frame without typing it out 32 times, once for each row?
(yes, I have tried the "apply" functions, but I am still learning R, so may have been unable to nail down the correct syntax)
(also, I suppose I could do it in a for loop, but have gathered that these are generally frowned upon in R)
Any help would be greatly appreciated, and please try to be gentle!
I don't think davo1979 is looking to transform his data to a long format; therefore, I don't think tidyr::gather is the tool he wants.
The answer that I think most closely addresses his question is:
lapply(1:nrow(mtcars), function(x) as.numeric(mtcars[x,2:8]))
This will store each row as a numeric vector, contained within a list.
However, davo1979, it is important to know what your ultimate goal is with this data. I ask because your desired outcome, a bunch of numeric vectors, is a clumsy way to handle and store data in R.
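That said, if the goal really is one column holding a vector per row, a minimal sketch (rowVec is a made-up column name): store the lapply() result as a list-column, which keeps each vector attached to its row.

# Each element of the list-column is the numeric vector for that row
mtcars$rowVec <- lapply(seq_len(nrow(mtcars)),
                        function(x) as.numeric(mtcars[x, 2:8]))
mtcars$rowVec[[1]]   # the vector for the first row (Mazda RX4)

In base R, asplit(as.matrix(mtcars[, 2:8]), 1) produces a similar row-wise split in a single call.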
Have you tried tidyr's gather() function? Here's an example from the documentation:
library(tidyr)
library(tibble)   # provides data_frame() (these days spelled tibble())

stocks <- data_frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

gather(stocks, stock, price, -time)
stocks %>% gather(stock, price, -time)   # the same call, written with the pipe

How do I generate row-specific means in a data frame?

I'm looking to generate means of ratings as a new variable/column in a data frame. Every method I've tried so far either generates columns that show the mean of the entire dataset (for the chosen items) or doesn't generate means at all. The rowMeans function doesn't work as I'm not looking for a mean of every value in a row, just a mean that reflects the chosen values in a given row. So, for example, I'm looking for the mean of 10 ratings:
fun <- mean(T1.1, T2.1, T3.1, T4.1, T5.1, T6.1, T7.1, T8.1, T9.1, T10.1, trim = 0, na.rm = TRUE)
I want a different mean for every row, because each row represents a different set of observations (a different subject, in my case). The issues I'm looking to correct are twofold: 1) it generates only one overall mean rather than one per row, and 2) this value is not part of the dataframe. I tried to generate a new column in the dataframe by using exp$fun, but that just creates a column whose every value (for every row) is the grand mean. Could anyone advise how to program this sort of row-based mean? I'm sure it's simple enough, but I haven't been able to figure it out through Googling or trawling StackOverflow.
Thanks!
It's hard to figure out an answer without a reproducible example, but have you tried subsetting your dataset to only the 10 columns from which you'd like to derive your means, and then using an apply statement? Something along the lines of apply(df, 1, mean), where the first argument is your dataframe, the second specifies whether to apply the function by rows (1) or columns (2), and the third is the function you wish to apply.
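For instance, a minimal sketch, assuming the ten rating columns are named T1.1 through T10.1 in a data frame called exp, as in the question:

rating_cols <- paste0("T", 1:10, ".1")   # "T1.1", "T2.1", ..., "T10.1"

# rowMeans does work once the data frame is subset to just the rating columns
exp$ratingMean <- rowMeans(exp[, rating_cols], na.rm = TRUE)

# Equivalent apply() form: 1 = by rows
exp$ratingMean <- apply(exp[, rating_cols], 1, mean, na.rm = TRUE)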

computing a subset using a loop

I have a data frame with different variables, and I want to build different subsets out of this data frame using some conditions. I want to use a loop because there will be a lot of subsets, and this would save a lot of time.
These are the conditions:
Variable A has an ID for an area, variable B has different species (1, 2, 3, etc.), and I want to compute different subsets from these columns. The name of every subset should be the ID of a point, and the content should be all individuals of a certain species at this point.
For a better understanding:
This would be the code for one of the subsets, and I want to use a loop:
A_2_NGF_Abies_alba <- subset(A_2_NGF, subset = Baumart %in% c("Abies alba"))
Is this possible to do in R?
Thanks
Does this help you?
Baumdaten <- data.frame(
  pointID = sample(c("A_2_SEF", "A_2_LEF", "A_3_LEF"), 10, replace = TRUE),
  Baumart = sample(c("Abies alba", "Betula pendula", "Fagus sylvatica"), 10, replace = TRUE)
)
split(Baumdaten, Baumdaten[, 1:2])
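As a hedged follow-up (element names depend on the sampled data): split() returns a list keyed by the combination of the two columns; the names can be cleaned up and, if you really want standalone objects like the A_2_NGF_Abies_alba in the question, promoted into the workspace.

subsets <- split(Baumdaten, Baumdaten[, 1:2])
names(subsets) <- gsub("[ .]", "_", names(subsets))   # e.g. "A_2_SEF_Abies_alba"
subsets$A_2_SEF_Abies_alba                            # one subset, if that combination occurs
# list2env(subsets, envir = .GlobalEnv)               # optional: one object per subset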
