Extract elements 10x greater than the last values for multiple columns - r

I am a new R user.
I have a dataframe consisting of 50 columns and 300 rows. The first column holds the ID, while the 2nd through the last columns hold the standard deviation (sd) of traits. The pooled sd for each column is given in the last row. For each column, I want to remove all values that are more than ten times the pooled sd, and I want to do this in one run. So far, the script below is what I have come up with for checking whether a value is greater than the pooled sd. However, even the ID column (character) is being processed (resulting in all FALSE). If I use raw_sd_summary[-1], I have no way of knowing which ID on which trait meets the criterion I'm looking for.
logic_sd <- lapply(raw_sd_summary, function(x) x>tail(x,1) )
logic_sd_df <- as.data.frame(logic_sd)
What should I do? And how can I extract all the values that are more than ten times the pooled sd, along with their corresponding IDs?

lapply does run over a data.frame's columns, which is what you want here. The problems are that the character ID column is compared as well and that the factor of 10 is missing. Compare only the numeric columns instead:
logic_sd <- sapply(raw_sd_summary[-1], function(x) x > 10 * tail(x, 1))
This gives a logical matrix that is TRUE wherever a value is more than 10 times the last row. You can recover the IDs by binding the first column back on:
logic_sd_df <- data.frame(ID = raw_sd_summary[[1]], logic_sd)
You can remove or replace the unwanted values in the original table directly with
raw_sd_summary[-300, -1][logic_sd[-300, ]] <- NA # or new value
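To also list which ID and which trait exceed the cutoff, something like the following sketch might work. The four-row data frame and its column names (ID, height, weight) are made up for illustration; only the numeric columns are compared, and the last row is assumed to hold the pooled sd.

```r
# Hypothetical miniature of raw_sd_summary; the last row holds the pooled sd
raw_sd_summary <- data.frame(
  ID     = c("g1", "g2", "g3", "pooled"),
  height = c(1.2, 25, 0.8, 2),
  weight = c(40, 3.1, 2.9, 3),
  stringsAsFactors = FALSE
)

# TRUE where a value is more than 10 times the last value of its column
is_big <- sapply(raw_sd_summary[-1], function(x) x > 10 * tail(x, 1))

# Row/column positions of the TRUE cells
hits <- which(is_big, arr.ind = TRUE)

# One row per flagged value, with its ID and trait name
flagged <- data.frame(
  ID    = raw_sd_summary$ID[hits[, "row"]],
  trait = colnames(is_big)[hits[, "col"]],
  value = as.matrix(raw_sd_summary[-1])[hits],
  stringsAsFactors = FALSE
)
print(flagged)
```

Because the comparison never touches the ID column, the IDs stay intact and can be looked up by row position afterwards.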

Related

R - Create a loop to calculate the length and Sum of a column where Identifier equals value in list

I am attempting to create a loop to calculate the length and average value of a column where the identifier equals a value in a list. I basically have a dataframe with an identifier, number of occurrences, and additional data. I also have a list that contains the unique identifiers (50 string values). I want to summarize the number of rows and the average value for each of those 50 values.
So far I've tried creating two functions to calculate those values, and then integrating it into the loop but have been unsuccessful.
infoAvg = function(x){
average = mean(x)
return(average)}
infoLen = function(x){
length = length(x)
return(length)
}
Here x is the DF and y is the column I want to calculate on.
Does it make sense to take this approach, and if so how do I integrate it into a loop?
Thanks.
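One possible way to do this without an explicit loop is sketched below. The data frame, its column names (id, value) and the ids vector are assumptions made for illustration, since the question does not show the real data.

```r
# Hypothetical data; the real data frame and column names are not shown in the question
df <- data.frame(
  id    = c("a", "b", "a", "c", "b", "a"),
  value = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)
ids <- c("a", "b", "c")  # stands in for the list of 50 unique identifiers

# For each identifier, count the matching rows and average the matching values
summary_df <- data.frame(
  id  = ids,
  n   = vapply(ids, function(i) sum(df$id == i), integer(1), USE.NAMES = FALSE),
  avg = vapply(ids, function(i) mean(df$value[df$id == i]), numeric(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(summary_df)
```

Base R's aggregate() or tapply() could produce the same summary; vapply is used here only because it stays close to the "one value per identifier" loop described in the question.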

Find a Series of Rows That Maximizes a Condition

I want to find a series of consecutive rows in a dataset where a condition is met the most often.
I have two columns that I can use for this: either one with ones and zeros that alternate based on the presence or absence of a condition, or a column which increments for the duration across which the desirable condition is present. I envision that I will need to use subset(), filter(), and/or rle() in order to make this happen but am at a loss as to how to get it to work.
In the example, I want to find 6 sequential rows that maximize the instances in which happens occurs.
Given the input:
# data.frame() is part of base R, so no library() call is needed
df<-data.frame(time=c(1:10),happens=c(1,1,0,0,1,1,1,0,1,1),count=c(1,2,0,0,1,2,3,0,1,2))
I would like to see as the output the rows 5 through 10, inclusive, as the data subset output, using either the happens or count columns since this sequence of rows would yield the highest output of happens occurrences on 6 consecutive rows.
library(zoo)
which.max( rollapply( df$happens, 6, sum) )
#[1] 5
The fifth window of 6 rows holds the maximum sum of df$happens, so the answer is rows 5:10.
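If you would rather avoid the zoo dependency and also pull out the rows themselves, a base R sketch using cumulative sums could look like this. The window size k and the variable names are illustrative choices, not part of the original answer.

```r
df <- data.frame(time = 1:10, happens = c(1, 1, 0, 0, 1, 1, 1, 0, 1, 1))
k <- 6  # window length

# Sum of each k-row window, computed as a difference of cumulative sums
cs <- c(0, cumsum(df$happens))
window_sums <- cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)]

# First window with the largest sum, and the rows it covers
start <- which.max(window_sums)
best <- df[start:(start + k - 1), ]
print(best)
```

Note that which.max returns the first maximum, so ties are resolved in favour of the earliest window.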

Conditionally create new column in R

I would like to create a new column in my dataframe that assigns a categorical value based on a condition to the other observations.
In detail, I have a column that contains timestamps for all observations. The rows are ordered ascending according to the timestamp.
Now, I'd like to calculate the difference between each consecutive timestamp and if it exceeds a certain threshold the factor should be increased by 1 (see Desired Output).
Desired Output
I tried solving it with a for loop; however, that takes a lot of time because the dataset is huge.
After searching for a bit I found this approach and tried to adapt it: R - How can I check if a value in a row is different from the value in the previous row?
ind <- with(df, c(TRUE, timestamp[-1L] > (timestamp[-length(timestamp)]-7200)))
However, I can not make it work for my dataset.
Thanks for your help!
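A vectorised approach that may fit this description is to take diff() of the timestamps and accumulate the gaps that exceed the threshold with cumsum(). The sketch below uses made-up numeric timestamps in seconds and assumes the threshold is the 7200 that appears in the attempt above; the column name group is also an assumption.

```r
# Hypothetical timestamps in seconds, already sorted ascending
df <- data.frame(timestamp = c(0, 100, 200, 8000, 8100, 20000))
threshold <- 7200  # assumed gap threshold

# diff() gives the gap to the previous row; every gap above the
# threshold starts a new group, and cumsum() numbers the groups
df$group <- cumsum(c(TRUE, diff(df$timestamp) > threshold))
print(df)
```

With POSIXct timestamps, the differences would first need converting to plain seconds, for example with as.numeric(diff(timestamp), units = "secs").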

Identify categorical variables when importing dataset in R

I'm importing a large dataset in R and curious if there's a way to quickly go through the columns and identify whether the column has categorical values, numeric, date, etc. When I use str(df) or class(df), the columns mostly come back mislabeled.
For example, some columns are labeled as numeric, but there are only 10 unique values in the column (ranging from 1-10), indicating that it should really be a factor. There are other columns that only have 11 unique values representing a rating, from 0-5 in 0.5 increments. Another column has country codes (172 values), which range from 1-230.
Is there a way to quickly identify if a column should be a factor without going through each of the columns to understand the nature of variable? (there are many columns in the dataset)
Thanks!
At the moment, I've been using variations of the following code to catch the first two cases:
df[,51] <- as.numeric(df[,51]) #convert the column to numeric (the result must be assigned back)
len = length(unique(df[,51])) #find number of unique values
diff = max(df[,51]) - min(df[,51]) #calculate difference between min and max
ord = diff / (len - 1) # calculate the increment if equally spaced
#subtract the second-largest value from the largest to find the actual increment (only uses last two values)
step = sort(unique(df[,51]),partial=len)[len] -
sort(unique(df[,51]),partial=len-1)[len-1]
isTRUE(all.equal(ord, step)) #check if the last increment equals the implied increment, allowing for floating-point error
However, this approach assumes that each of the variables is equally spaced (for example, incremented by 0.5) and only tests the space between the last two values. This wouldn't catch a column that contains c(1,2,3.5,4.5,5,6), which has 6 unique values but uneven spacing in the middle (not that this is common in my dataset).
It is not obvious how many distinct values would indicate a factor vs a numeric variable, but you can examine all variables to see what is in your data with
table(sapply(df, function(x) { length(unique(x))} ))
and if you decide that the boundary between factor and numeric is k you can identify the factors with
which(sapply(df, function(x) {length(unique(x)) < k}))
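As a rough illustration of how those two lines could be put to use, the sketch below counts distinct values per column and converts the low-cardinality columns to factors. The toy data frame, its column names, and the cutoff k = 4 are all assumptions chosen for the example.

```r
# Hypothetical data: two low-cardinality columns and one continuous column
df <- data.frame(
  rating = c(1, 2, 2, 3, 1, 3),
  score  = c(0.11, 2.5, 3.7, 4.2, 5.9, 6.3),
  code   = c(10, 20, 10, 20, 10, 20)
)
k <- 4  # assumed boundary between "factor" and "numeric"

# Number of distinct values in each column
n_unique <- sapply(df, function(x) length(unique(x)))

# Columns below the boundary are treated as factors
factor_cols <- names(which(n_unique < k))
df[factor_cols] <- lapply(df[factor_cols], factor)
str(df)
```

The right value of k depends on the dataset; the country-code column in the question (172 distinct values) shows that a single cutoff will not catch every categorical variable.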

Setting a maximum limit for values in a data frame in R

In a data frame (in R), I have two columns - the first is a list of species names (species), the second is the number of occurrence records I have for that species (number). There is a large variation in the number column with most values being <100 but a few being very high values (>100,000), and there are many rows (~4000). Here is a simplified example:
x<-data.frame(species=c("a","b","c","d","e","f","g","h","i","j"),number=c(53,17,67,989,135,67,13,786,100400,28))
Basically what I want to do is reduce the maximum number of records (the value in the number column) until the mean of all the values in this column stabilises.
To do this, I need to set a maximum limit for values in the number column so that any value > this limit is reduced to this maximum limit, and record the mean. I want to repeat this multiple times, each time reducing the maximum limit by 100.
I've not been able to find any similar questions online and am not really sure where to start with this! Any help, even just a point in the right direction, would be much appreciated! Cheers
You can use the pmin function:
pmin(x$number, 1e3)
# to test multiple limits :
mns <- sapply(c(1e6, 1e4, 1e2), function(u) mean(pmin(x$number, u)))
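To follow the question's plan of lowering the limit by 100 each time, the same idea can be run over a decreasing sequence of limits. This is only a sketch: the lower bound of 100 and the names limits, means and result are arbitrary choices for illustration.

```r
x <- data.frame(
  species = letters[1:10],
  number  = c(53, 17, 67, 989, 135, 67, 13, 786, 100400, 28)
)

# Limits from the current maximum down to 100, in steps of 100
limits <- seq(max(x$number), 100, by = -100)

# Mean of the column after capping every value at each limit
means <- sapply(limits, function(u) mean(pmin(x$number, u)))

result <- data.frame(limit = limits, mean = means)
head(result)
tail(result)
```

Plotting result$mean against result$limit is one way to judge where the mean stops changing much.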
