R: Sum values that are in different parts of vector

I have a number of vectors consisting of 1s and 0s, such as:
x <- c(0, 1, 1, 0, 0, 1)
I would like to count the number of consecutive 1s at different parts of these sequences, and in this instance end up with:
[1] 2 1
I have considered using something like strsplit to split the sequence where there are zeros, but it is a numeric vector, so strsplit won't work, and ideally I don't want to convert back and forth between numeric and character formats.
Is there another, simpler, solution to this? Would appreciate any help.

You can split the value into a vector and use rle like this:
If the value is stored as a single number:
x <- 11001
temp <- rle(unlist(strsplit(as.character(x), split="")))
temp$lengths[temp$values == 1]
[1] 2 1
It's a bit simpler when starting with a vector, as you don't have to use strsplit and unlist.
x <- c(0, 1, 1, 0, 0, 1)
temp <- rle(x)
temp$lengths[temp$values == 1]
[1] 2 1
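For reference, rle returns a list with two components, lengths and values, so you can see exactly which run lengths are being selected:

```r
x <- c(0, 1, 1, 0, 0, 1)
r <- rle(x)
r$lengths                    # 1 2 2 1  (length of each run)
r$values                     # 0 1 0 1  (value of each run)
r$lengths[r$values == 1]     # keep only the runs of 1s
# [1] 2 1
```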

Related

Is there a way to determine how many rows in a dataset have the same categorical variable for multiple conditions (columns)?

For example, I have the dataset below, where 1 = yes and 0 = no, and I need to figure out how many calls were made by landline that lasted under 10 minutes.
[Image of example dataset]
You can also explicitly specify the values you're looking for in each column when computing the sum. (This will help if you need to count rows with values other than 1 in a column.)
sum(df$landline == 1 & df$`under 10 minutes` == 1)
We can use sum
sum(df1[, "under 10 minutes"])
If two columns are needed
colSums(df1[, c("landline", "under 10 minutes")])
If we are checking both columns, use rowSums
sum(rowSums(df1[, c("landline", "under 10 minutes")], na.rm = TRUE) == 2)
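A minimal runnable sketch of the rowSums approach, using a hypothetical df1 stand-in (the question's actual data is only shown as an image):

```r
# Hypothetical stand-in for the question's data, 1 = yes, 0 = no
df1 <- data.frame(
  landline           = c(1, 0, 1, 1),
  `under 10 minutes` = c(1, 1, 0, 1),
  check.names = FALSE   # keep the space in the column name
)
# rows where both indicators are 1
sum(rowSums(df1[, c("landline", "under 10 minutes")], na.rm = TRUE) == 2)
# [1] 2
```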
The grep function finds the rows where landline == 1. We then take those rows and sum the under 10 minutes column (column 4 in that dataset):
sum(df[grep(1, df[, 1]), 4])
Note that grep matches the character "1" anywhere in the value, so this is only safe for strict 0/1 columns.
R will conveniently treat 1 and 0 as if they mean TRUE and FALSE, so we can apply logical Boolean operations like AND (&) and OR (|) on them.
df <- data.frame(x = c(1, 0, 1, 0),
y = c(0, 0, 1, 1))
> sum(df$x & df$y)
[1] 1
> sum(df$x | df$y)
[1] 3
For future questions, you should look up how to use functions like dput or other ways to give an example data set instead of using an image.

R - Add columns with almost same name and save it using the correct column name

I have multiple large data tables in R. Some column names appear twice with nearly duplicate names: they are identical except for the last character.
For example:
[1] "Genre_Romance" (correct)
[2] "Genre_Sciencefiction" (correct)
[3] "Genre_Sciencefictio" (wrong)
[4] "Genre_Fables" (correct)
[5] "Genre_Fable" (wrong)
Genre_Romance <- c(1, 0, 1, 0, 1)
Genre_Sciencefiction <- c(0, 1, 0, 0, 0)
Genre_Sciencefictio <- c(1, 0, 1, 1, 0)
Genre_Fables <- c(0, 0, 1, 0, 0)
Genre_Fable <- c(0, 0, 0, 0, 1)
dt <- data.table(Genre_Romance, Genre_Sciencefiction, Genre_Sciencefictio, Genre_Fables, Genre_Fable)
Now I want to add the column values with nearly the same column name. I want to save this sum under the correct column name while removing the incorrect column. The solution here would be:
dt[,"Genre_Sciencefiction"] <- dt[,2] + dt[, 3]
dt[,"Genre_Fables"] <- dt[,4] + dt[, 5]
dt[,"Genre_Sciencefictio"] <- NULL
dt[,"Genre_Fable"] <- NULL
dt
Genre_Romance Genre_Sciencefiction Genre_Fables
1 1 0
0 1 0
1 1 1
0 1 0
1 0 1
As you can see, not every column name has a nearly duplicate one (such as "Genre_Romance"). So we just keep the first column like that.
I tried to solve this problem with a for loop to compare column names one by one and use substr() function to compare the longest column name with the shorter column name and take sum if they are the same. But it does not work correctly and is not very R-friendly.
The post below also helped me a bit further, but I cannot use 'duplicated' since the column names are not exactly the same.
how do I search for columns with same name, add the column values and replace these columns with same name by their sum? Using R
Thanks in advance.
Here is a more-or-less base R solution that relies on agrep to find similar names. agrep allows for close string matches, based on the "generalized Levenshtein edit distance."
# find groups of similar names
groups <- unique(lapply(names(dt), function(i) agrep(i, names(dt), fixed=TRUE, value=TRUE)))
# choose the final names as those that are longest
finalNames <- sapply(groups, function(i) i[which.max(nchar(i))])
I chose to keep the longest variable name in each group, which matches the example; you could easily switch to the shortest with which.min, or hard-code the names, depending on what you want.
Next, Reduce is given the "+" operator and fed the matching columns of each group via lapply. To calculate the element-wise maximum instead, use pmax in place of "+". The variables are selected with .SD and .SDcols from data.table; with a plain data.frame you could index with the group vectors directly.
# produce a new data frame
setNames(data.frame(lapply(groups, function(x) Reduce("+", dt[, .SD, .SDcols=x]))),
finalNames)
@Frank's comment notes that this can be simplified in newer (1.10+, I believe) versions of data.table to avoid .SD, .SDcols with
# produce a new data frame
setNames(data.frame(lapply(groups, function(x) Reduce("+", dt[, ..x]))), finalNames)
To make this a data.table, just replace data.frame with as.data.table or wrap the output in setDT.
To turn the final line into a data.table solution, you could use
dtFinal <- setnames(dt[, lapply(groups, function(x) Reduce("+", dt[, .SD, .SDcols=x]))],
finalNames)
or, following @Frank's comment
dtFinal <- setnames(dt[, lapply(groups, function(x) Reduce("+", dt[, ..x]))], finalNames)
which both return
dtFinal
Genre_Romance Genre_Sciencefiction Genre_Fables
1: 1 1 0
2: 0 1 0
3: 1 1 1
4: 0 1 0
5: 1 0 1
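The same grouping-and-summing idea also works without data.table; a minimal base R sketch using the question's data as a plain data.frame:

```r
df <- data.frame(
  Genre_Romance        = c(1, 0, 1, 0, 1),
  Genre_Sciencefiction = c(0, 1, 0, 0, 0),
  Genre_Sciencefictio  = c(1, 0, 1, 1, 0),
  Genre_Fables         = c(0, 0, 1, 0, 0),
  Genre_Fable          = c(0, 0, 0, 0, 1)
)
# cluster similar names with approximate matching, as in the answer above
groups <- unique(lapply(names(df), function(i)
  agrep(i, names(df), fixed = TRUE, value = TRUE)))
# keep the longest name in each group as the "correct" one
finalNames <- sapply(groups, function(g) g[which.max(nchar(g))])
# sum the columns of each group element-wise
out <- setNames(data.frame(lapply(groups, function(g)
  Reduce(`+`, df[g]))), finalNames)
out
#   Genre_Romance Genre_Sciencefiction Genre_Fables
# 1             1                    1            0
# 2             0                    1            0
# 3             1                    1            1
# 4             0                    1            0
# 5             1                    0            1
```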

Condition does not recognize integers as numbers, it is treating them as characters

I have a dataframe with 2 columns
The second column has one of the following values, recognized in the data frame as numbers: 0, 1, 2, or 3
I want to create a third column that has a character string values based on the values of the second column.
I tried:
df2 = data.frame(r = tssd2, cgmval = colcgm6)
df2$clrl[colcgm6 = 0] ="black"
df2$clrl[colcgm6 = 1] ="lightskyblue"
df2$clrl[colcgm6 = 2] ="blue"
df2$clrl[colcgm6 = 3] ="purple"
The error that I get is:
Error in `$<-.data.frame`(`*tmp*`, "clrl", value = character(0)) :
replacement has 0 rows, data has 4139
From the description of the error, my understanding is that the code is trying to compare the values of colcgm6, which are numbers (0 to 3), to the characters 0, 1, 2, and 3. So the result is that the conditions are never true and no values are ever put into a new third column.
Please help,
Edit:
For a reproducible example, please use tssd2 as a vector of numeric values (1, 1, 1) and colcgm6 as a vector of numeric values (0, 1, 2).
We can do this easily by using numeric indexing instead of comparing (==) 4 times
clrl <- c("black", "lightskyblue", "blue", "purple")
df2$clrl <- clrl[colcgm6+1]
head(df2)
# r cgmval clrl
#1 -0.7622144 1 lightskyblue
#2 -1.4290903 0 black
#3 0.3322444 2 blue
#4 -0.4690607 2 blue
#5 -0.3349868 2 blue
#6 1.5362522 3 purple
In the OP's code, the assignment operator (=) is used instead of the comparison operator (==). Changing it would fix the problem.
data
set.seed(24)
colcgm6 <- sample(0:3, 24, replace=TRUE)
tssd2 <- rnorm(24)
df2 <- data.frame(r = tssd2, cgmval = colcgm6)
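For completeness, the OP's element-wise assignment approach also works once == is used and the new column is initialised first; a sketch with the small reproducible values from the edit (plus a fourth value to cover all four colors):

```r
colcgm6 <- c(0, 1, 2, 3)
df2 <- data.frame(cgmval = colcgm6)
df2$clrl <- NA_character_                  # initialise the new column first
df2$clrl[df2$cgmval == 0] <- "black"       # == compares; = would be assignment
df2$clrl[df2$cgmval == 1] <- "lightskyblue"
df2$clrl[df2$cgmval == 2] <- "blue"
df2$clrl[df2$cgmval == 3] <- "purple"
df2$clrl
# [1] "black"        "lightskyblue" "blue"         "purple"
```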

Function/instruction to count number of times a value has already been seen

I'm trying to identify if MATLAB or R has a function that resembles the following.
Say I have an input vector v.
v = [1, 3, 1, 2, 4, 2, 1, 3]
I want to generate a vector, w of equivalent length to v. Each element w[i] should tell me the following: for the corresponding value v[i], how many times has this value been encountered so far in v, i.e. in all elements of v up to, but not including, position i. In this example
w = [0, 0, 1, 0, 0, 1, 2, 1]
I'm really looking to see if any statistical or domain-specific languages have a function/instruction like this and what it might be called.
In R, you can try this:
v <- c(1,3,1,2,4,2,1,3)
ave(v, v, FUN=seq_along)-1
#[1] 0 0 1 0 0 1 2 1
Explanation
ave(seq_along(v), v, FUN=seq_along) # using seq_along(v) as the first argument is safer for other classes, e.g. factor
#[1] 1 1 2 1 1 2 3 2
Here, we group the sequence of elements by v. Within each group, seq_along creates 1, 2, 3, etc. In the case of v, the elements of group 1 are at positions 1, 3, 7, so those positions get the values 1, 2, 3. Subtracting 1 makes the count start from 0.
To understand it better,
lst1 <- split(v,v)
lst2 <- lapply(lst1, seq_along)
unsplit(lst2, v)
#[1] 1 1 2 1 1 2 3 2
Using data.table
library(data.table)
DT <- data.table(v, ind=seq_along(v))
DT[, n:=(1:.N)-1, by=v][,n[ind]]
#[1] 0 0 1 0 0 1 2 1
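The same counting rule can also be written as an explicit loop in R, which makes the definition of w transparent (a sketch, not the fastest route):

```r
v <- c(1, 3, 1, 2, 4, 2, 1, 3)
w <- integer(length(v))
for (i in seq_along(v)) {
  # count occurrences of v[i] among the elements before position i
  w[i] <- sum(v[seq_len(i - 1)] == v[i])
}
w
# [1] 0 0 1 0 0 1 2 1
```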
In MATLAB there is no built-in function for that (as far as I know), but you can achieve it this way:
w = sum(triu(bsxfun(@eq, v, v.'), 1));
Explanation: bsxfun(...) compares each element with each other. Then triu(..., 1) keeps only matches of an element with previous elements (i.e. values above the diagonal). Finally sum(...) adds all coincidences with previous elements.
A more explicit, but slower, alternative (not recommended) is:
w = arrayfun(@(n) sum(v(1:n-1)==v(n)), 1:numel(v));
Explanation: for each index n (where n varies as 1:numel(v)), compare all previous elements v(1:n-1) to the current element v(n), and get the number of matches (sum(...)).
R has a function called make.unique that can be used to obtain the required result. First use it to make all elements unique:
(v.u <- make.unique(as.character(v))) # it only works on character vectors so you must convert first
[1] "1" "3" "1.1" "2" "4" "2.1" "1.2" "3.1"
You can then take this vector, remove the original data, convert the blanks to 0, and convert back to integer to get the counts:
as.integer(sub("^$","0",sub("[0-9]+\\.?","",v.u)))
[1] 0 0 1 0 0 1 2 1
If you want to use a for-loop in matlab you can get the result with:
res=v;
res(:)=0;
for c=1:length(v)
helper=find(v==v(c));
res(c)=find(helper==c);
end
Not sure about runtime compared to Luis Mendo's solution; I'll check that now.
Edit
Running the code 10,000 times results in:
My solution: Elapsed time is 0.303828 seconds.
Luis Mendo's solution (bsxfun): Elapsed time is 0.180215 seconds.
Luis Mendo's solution (arrayfun): Elapsed time is 3.868467 seconds.
So the bsxfun solution is fastest, then the for-loop, followed by the arrayfun solution. I'll generate longer v-arrays now and see if anything changes.
Edit 2
Changing v to
v = ceil(rand(100,1)*8);
resulted in more obvious runtime ranking:
My Solution: Elapsed time is 4.020916 seconds.
Luis Mendo's Solution (bsxfun): Elapsed time is 0.808152 seconds.
Luis Mendo's Solution (arrayfun): Elapsed time is 22.126661 seconds.

R subscript based on a vector

df <- data.frame(name=c('aa', 'bb', 'cc','dd'),
code=seq(1:4), value= seq(100, 400, by=100))
df
v <- c(1, 2, 2)
v
A <- df[df$code %in% v,]$value
A
str(A)
I tried to obtain the corresponding value based on the code. I was expecting A to be of length 3, but it actually returns a vector of length 2. What can I do if I want A to be a vector of length 3, that is c(100, 200, 200)?
%in% returns a logical vector, the same length as vector 1, that indicates whether each element of vector 1 occurs in vector 2.
In contrast, the match function returns, for each element of vector 1, the position in vector 2 where the element first appears (or NA if it doesn't exist in vector 2). Try the following:
df[match(v, df$code), 'value']
You could just use v as a row index if those were the rows whose "value"s you wanted:
> df[v,]$value
[1] 100 200 200
df[v, 3] # minimum characters :)
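To see the difference between the two subsetting routes side by side:

```r
df <- data.frame(name = c("aa", "bb", "cc", "dd"),
                 code = 1:4,
                 value = seq(100, 400, by = 100))
v <- c(1, 2, 2)
df$code %in% v            # logical, one per row of df: TRUE TRUE FALSE FALSE
match(v, df$code)         # integer positions, one per element of v: 1 2 2
df[match(v, df$code), "value"]
# [1] 100 200 200
```

Because %in% collapses duplicates in v to a single TRUE, match (or direct row indexing) is the route that preserves the length of v.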
