calculating length using na.omit in R - r

here is my code:
data <-setNames(lapply(paste0("80-20 ", file.number,".csv"),read.csv,stringsAsFactors=FALSE),paste(file.number,"participant"))
# reads each CSV file into a named list of data frames
df <- data.frame(RT=1:100,rep.sw=sample(c("sw","rep"),100,replace=TRUE))
(error.sw.c <- lapply(data[control.data],function(df) with(df, na.omit(rep.sw == "sw" & accuracy == "wrong"))))
This code scans a set of Excel (CSV) files and assigns TRUE every time the accuracy is "wrong" for trials labeled "sw". What I want to do next is count the number of TRUE values for each file and put the counts in a data frame. This is what I tried:
(dataframe.c <- data.frame(switch.rt = sapply(sw.c,mean), repetition.rt = sapply(rep.c,mean), switch.error = sapply(error.sw.c,length), group = rep("control",each=length(control.data))))
However, when I do this, it gives me the length of all the values (TRUE & FALSE), not just the TRUE values.
If I do this:
length(error.sw.c)
I get the total of all the error values, not all the error values separately.
So my question is: Is there a way to get the length of each individual excel file so I can put it in a dataframe? Thank you StackOverflow community, you folks haven't let me down yet. Any help will be greatly appreciated. Let me know if any clarification is needed. :)

sum() can be used to count the number of TRUEs in a logical vector. Let's see why:
set.seed(555)
logicalVec <- rnorm(5) > 0 # create logical vector
logicalVec
[1] FALSE TRUE TRUE TRUE FALSE
Arithmetic functions coerce logical values to numeric values such that FALSE becomes 0 and TRUE becomes 1:
logicalVec*1
[1] 0 1 1 1 0
You can think of sum(logicalVec) as equivalent to sum(c(0,1,1,1,0)):
sum(c(0,1,1,1,0))
[1] 3
sum(logicalVec)
[1] 3
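Applied to the question, this suggests swapping length for sum inside sapply. The sketch below uses a small made-up list as a stand-in for error.sw.c (the participant names and values are hypothetical, not the asker's data):

```r
# Hypothetical stand-in for error.sw.c: one logical vector per participant
error.sw.c <- list(
  "1 participant" = c(TRUE, FALSE, TRUE, FALSE),
  "2 participant" = c(FALSE, FALSE, TRUE),
  "3 participant" = c(FALSE, FALSE)
)

# length() counts every element, TRUE and FALSE alike
all_counts <- sapply(error.sw.c, length)

# sum() counts only the TRUE values, one total per participant
true_counts <- sapply(error.sw.c, sum)

print(all_counts)
print(true_counts)
```

In the asker's data.frame() call, this would mean writing switch.error = sapply(error.sw.c, sum).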

Related

How to get total number of entries with specific condition in data frame in R

I have a data frame, data. I want to find the genes whose value satisfies a specific condition, such as being greater than 0.15, and subset the data to just those genes. I have tried this:
moran_deviation_data_multiple_correction_1january_BH_conclusion_spatially_clustered_threshold_0.15 = data
moran_deviation_data_multiple_correction_1january_BH_conclusion_spatially_clustered_threshold_0.15 = as.data.frame(moran_deviation_data_multiple_correction_1january_BH_conclusion_spatially_clustered[1,])
moran_deviation_data_multiple_correction_1january_BH_conclusion_spatially_clustered_threshold_0.15 = moran_deviation_data_multiple_correction_1january_BH_conclusion_spatially_clustered_threshold_0.15[,colSums(moran_deviation_data_multiple_correction_1january_BH_conclusion_spatially_clustered_threshold_0.15[1,] <= 0.15) == 0]
moran_deviation_data_multiple_correction_1january_BH_conclusion_spatially_clustered_threshold_0.15
For some reason the result includes values less than 0.15, so the output is wrong. How can I fix this and get the expected result?
Your values are character strings. Strings are compared lexicographically, so you cannot expect a comparison on them to agree with the numeric comparison. For example,
0.15 < 5.3e-3
# [1] FALSE
"0.15" < "5.3e-3"
# [1] TRUE
Convert your data to numeric and then rerun your logic. Perhaps
moran_deviation_data_multiple_correction_1january_BH_conclusion_spatially_clustered[] <-
  lapply(moran_deviation_data_multiple_correction_1january_BH_conclusion_spatially_clustered,
         type.convert, as.is = TRUE)
(The use of brackets in ...[] <- ... is intentional: without the [], you'll get a list instead of retaining the class data.frame.)
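A minimal sketch of the conversion followed by the threshold subset, using a made-up one-row data frame (the names md, GENE1, GENE2, GENE3 and the values are hypothetical, chosen only for illustration):

```r
# Hypothetical one-row data frame whose values were read in as character
md <- data.frame(GENE1 = "0.16", GENE2 = "5.3e-3", GENE3 = "0.19",
                 stringsAsFactors = FALSE)

# Convert every column in place; the [] keeps md a data.frame
md[] <- lapply(md, type.convert, as.is = TRUE)

# Named logical vector: which columns exceed the threshold in row 1
keep <- unlist(md[1, ]) > 0.15

# drop = FALSE keeps the result a data frame even if one column survives
above <- md[, keep, drop = FALSE]
print(above)
```

Once the columns are numeric, the comparison against 0.15 behaves as expected and GENE2 (0.0053) is excluded.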
I did some renaming of variables but this should work.
library(data.table)
md_015 = as.data.frame(fread("moran_deviation_data_multiple_correction_1january_BH_conclusion_spatially_clustered.csv"))
md_015_index = which(md_015 > .15)
refined_df <- md_015[,md_015_index]
refined_df
#            V1   METTL7B       DBI       SLN     COX5B     ITM2C     PTGDS     GLUL
# 1 Moran.index 0.1682925 0.1586386 0.1913133 0.1541711 0.1533256 0.1910979 0.157243

A problem during real data analysis with purrr

I am analyzing a real data set:
Data set: https://github.com/ThinkR-open/datasets/blob/master/README.md
tweets <- readRDS("#RStudioConf.RDS")
rstudioconf <- as.list(NULL)
for (i in 1:nrow(tweets)) {
rstudioconf[[i]] <- tweets[i,]
}
I want to answer a question from the data set: how many tweets contain a link to a GitHub-related URL?
Below is my code:
# Extract the "urls_url" elements, and flatten() the result
urls_clean <- map(rstudioconf, "urls_url") %>%
flatten()
# Remove NA from list
compact_urls <- urls_clean %>%
map(discard,is.na) %>%
compact()
# Create a mapper that detects the pattern "github"
has_github <- as_mapper(~ str_detect(.x, "github"))
# Look for the "github" pattern, and sum the result
map_lgl(compact_urls, has_github) %>% sum()
The last line of code
map_lgl(compact_urls, has_github) %>% sum()
gives me an error:
Error: Result 10 must be a single logical, not a logical vector of length 2
I am really confused. The code map_lgl(compact_urls, has_github) should give a logical vector of TRUE and FALSE values; that vector is then piped into sum(), which adds up the TRUE values and returns a number. I never expected it to give me an error. Could anyone help? Thank you!
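map_lgl returns the error because some of the list elements have a length greater than 1. str_detect is vectorized, so for an element holding several URLs, has_github returns a logical vector of that length, while map_lgl requires exactly one logical value per element. This is indicated in ?map: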
map_lgl(), map_int(), map_dbl() and map_chr() return an atomic vector of the indicated type (or die trying).
out <- map(compact_urls, has_github)
table(lengths(out))
#    1    2    3    6
# 1117   22    4    1
We can unlist the output from map and get the sum
sum(unlist(out))
It can be reproduced using a simple example
map_lgl(list(FALSE, TRUE), I) #each list element of length 1
#[1] FALSE TRUE
map_lgl(list(FALSE, c(TRUE, TRUE)), I) # one element of length 2
Error: Result 2 must be a single logical, not a vector of class AsIs and of length 2
If the objective is to return a single TRUE/FALSE per list element, wrap the str_detect call in any():
has_github <- as_mapper(~ any(str_detect(.x, "github")))
Now, try with map_lgl
map_lgl(compact_urls, has_github) %>%
sum()
#[1] 347
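The two approaches answer slightly different questions: summing the unlisted result counts matching URLs, while the any() version counts list elements (tweets) with at least one match. The sketch below illustrates the difference in base R, with vapply and grepl swapped in for map_lgl and str_detect so that it runs without extra packages; the urls list is made up for illustration:

```r
# Hypothetical data: each list element holds the URLs of one tweet
urls <- list(
  "github.com/tidyverse/purrr",
  c("github.com/rstudio/shiny", "github.com/r-lib/rlang"),
  "rstudio.com/conference"
)

# One TRUE/FALSE per tweet: does any of its URLs mention "github"?
per_tweet <- vapply(urls,
                    function(u) any(grepl("github", u, fixed = TRUE)),
                    logical(1))

# One TRUE/FALSE per URL, ignoring which tweet it came from
per_url <- grepl("github", unlist(urls), fixed = TRUE)

sum(per_tweet)  # tweets with at least one GitHub URL
sum(per_url)    # GitHub URLs overall
```

Here the second tweet contributes one to the per-tweet count but two to the per-URL count, so the totals differ.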

When subsetting in R is it necessary to include `which` or can I just put a logical test?

Say I have a data frame df and want to subset it based on the value of column a.
df <- data.frame(a = 1:4, b = 5:8)
df
Is it necessary to include a which function in the brackets or can I just include the logical test?
df[df$a == "2",]
# a b
#2 2 6
df[which(df$a == "2"),]
# a b
#2 2 6
It seems to work the same either way... I was getting some strange results in a large data frame (i.e., getting empty rows returned as well as the correct ones) but once I cleaned the environment and reran my script it worked fine.
df$a == "2" returns a logical vector, while which(df$a=="2") returns indices. If there are missing values in the vector, the first approach will include them in the returned value, but which will exclude them.
For example:
x <- c(1, NA, 2, 10)
x[x == 2]
# [1] NA  2
x[which(x == 2)]
# [1] 2
x == 2
# [1] FALSE    NA  TRUE FALSE
which(x == 2)
# [1] 3
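The same effect on a data frame would explain the empty rows mentioned in the question. A minimal sketch, assuming column a contains a missing value (the data here is made up):

```r
# Hypothetical data frame with one missing value in column a
df <- data.frame(a = c(1, NA, 2, 4), b = 5:8)

# Logical indexing: the NA in the test produces an all-NA row
with_na <- df[df$a == 2, ]

# which() drops the NA, so only the true match is returned
with_which <- df[which(df$a == 2), ]

print(with_na)
print(with_which)
```

The logical test yields two rows (one filled with NA), while the which() version yields only the matching row.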

rle(): Return average of lengths only if values == TRUE

I have the following rle object:
Run Length Encoding
lengths: int [1:189] 4 5 3 15 6 4 9 1 9 5 ...
values : logi [1:189] FALSE TRUE FALSE TRUE FALSE TRUE ...
I would like to find the average (mean) of the lengths if the corresponding item in the values == TRUE (I'm not interested in the lengths when values == FALSE)
df <- data.frame(values = NoOfTradesAndLength$values, lengths = NoOfTradesAndLength$lengths)
AveLength <- aggregate(lengths ~ values, data = df, FUN = function(x) mean(x))
Which returns this:
values lengths
1 FALSE 7.694737
2 TRUE 5.287234
I can now obtain the length where values == TRUE but is there a nicer way of doing this? Or perhaps, could I achieve a similar result without using rle at all? It feels a bit fiddly converting from lists to dataframe and I'm sure there is a one line clever way of doing this. I've seen that derivatives of this question have cycled through before but I wasn't able to come up with anything better from those so your help is much appreciated.
The rle returns a list of 'lengths' and 'values'. We can subset the 'lengths' using the 'values' as logical index and get the mean
with(NoOfTradesAndLength, mean(lengths[values]))
Using a reproducible example
set.seed(24)
NoOfTradesAndLength <- rle(sample(c(TRUE, FALSE), 25, replace=TRUE))
with(NoOfTradesAndLength, mean(lengths[values]))
#[1] 1.5
Using the OP's code
AveLength[2,]
# values lengths
#2 TRUE 1.5
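A small deterministic example may make the indexing step easier to follow. The vector x below is made up for illustration and does not come from the question:

```r
# Hypothetical logical vector with runs of lengths 1, 2, 1, 4
x <- c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
r <- rle(x)

r$lengths            # run lengths: 1 2 1 4
r$values             # run values:  FALSE TRUE FALSE TRUE

# Using values as a logical index keeps only the lengths of TRUE runs
true_lengths <- r$lengths[r$values]   # 2 4
avg_true <- mean(true_lengths)        # 3
print(avg_true)
```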

removing sequences of positive values between sequences of "0"

I would like to create a small function in a data frame, for detecting (and setting to 0) sequences of positive values which are located between sequences of values equal to 0, but only if these sequences of positive values are not more than 5 values long.
Here's just a small example for showing you how my data looks (initial_data column), and what I would like to obtain at the end (final_data column):
DF<-data.frame(initial_data=c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0),final_data=c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0))
This sentence also sums up the rule:
"If there's a sequence of positive values, not longer than 5 values, and located between at least two or three 0-values (before and after this sequence of positive values), then set also this sequence to 0"
Any advice for doing this easily?
Thanks a lot!!!
Here's a possible approach using the rle function:
DF<-data.frame(initial_data=c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0),
final_data=c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0))
# using rle create an object with the sequences of consecutive elements
# having the same sign (-1 means negative, 0 means zero, 1 means positive)
enc <- rle(sign(DF$initial_data))
# find the positive sequences having maximum 5 elements
posSequences <- which(enc$values == 1 & enc$lengths <= 5)
# remove index=1 or index=length(enc$values) if present because
# they can't be surrounded by 0
posSequences <- posSequences[posSequences != 1 &
                             posSequences != length(enc$values)]
# check if they're preceded and followed by at least 2 zeros
# (if not remove the index)
toForceToZero <- sapply(posSequences, FUN = function(idx) {
  enc$values[idx - 1] == 0 &&
    enc$lengths[idx - 1] >= 2 &&
    enc$values[idx + 1] == 0 &&
    enc$lengths[idx + 1] >= 2
})
posSequences <- posSequences[toForceToZero]
# reverse the run-length encoding, setting NA where we want to force to zero
v <- enc$values
v[posSequences] <- NA
# create the final data vector by forcing NAs to 0
final_data <- DF$initial_data
final_data[is.na(rep.int(v, enc$lengths))] <- 0
# check if is equal to your desired output
all(DF$final_data == final_data)
# > [1] TRUE
My best friend rle to the rescue (applied to the initial_data column only; unlist(DF) would concatenate both columns):
notzero <- rle(as.logical(DF$initial_data))
Run Length Encoding
  lengths: int [1:5] 4 3 6 8 7
  values : logi [1:5] FALSE TRUE FALSE TRUE FALSE
Now just find all locations where values is TRUE and lengths <= 5, and replace the values at those locations with FALSE. Then invoke inverse.rle to get a logical mask, and set initial_data to 0 wherever the mask is FALSE to get the desired output.
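The steps described above can be sketched as follows. This is a simplified version under two assumptions: it works on the initial_data column alone, and it uses <= 5 to match "not more than 5 values long" from the question. Unlike the longer answer, it does not check how many zeros surround each run, and it would also zero a short positive run at the very start or end of the vector.

```r
initial_data <- c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0)
expected     <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0)

# Encode runs of zero (FALSE) and non-zero (TRUE) values
notzero <- rle(as.logical(initial_data))

# Mark non-zero runs of at most 5 elements as FALSE
notzero$values[notzero$values & notzero$lengths <= 5] <- FALSE

# Expand back to a per-element mask and zero out everything not kept
keep <- inverse.rle(notzero)
final_data <- ifelse(keep, initial_data, 0)

all(final_data == expected)
```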
