I am still an R beginner, so please be kind :). There are gaps that occur in my data at unknown times and for unknown intervals. I would like to pull these gaps out of my data by subsetting them. I don't want them removed from the data frame, I just want as many different subsets as there are data gaps so that I can make changes to them and eventually merge the changed subsets back into the original data frame. Also, eventually I will be running the greater part of this script on multiple .csv files so it cannot be hardcoded. A sample of my data is below with just the relevant column:
fixType (column 9)
fix
fix
fix
fix
fix
fix
lastvalidfix
0
0
0
0
0
firstfix
fix
fix
fix
fix
lastvalidfix
0
0
0
0
0
0
0
0
0
0
firstfix
The code I have now is not functional and not completely correct R, but I hope it expresses what I need to do. Essentially, every time lastvalidfix and firstfix are found in the rows of column 9, I would like to create a subset that includes those two rows and however many rows lie between them. Using my sample data above, I would create 2 subsets, the first with 7 rows and the second with 12 rows. The number of data gaps in each file varies, so the number of subsets and their lengths will likely differ each time. I realize that each subset will need a unique name, which is why I've done the subset + 1.
subset <- 0 # This is my attempt at creating unique names for the subsets
for (i in 2:nrow(dataMatrix)) { # Creating new subsets of data for each time the signal is lost
  if ((dataMatrix[i, 9] == "lastvalidfix") &
      (dataMatrix[i, 9] == "firstfix")) {
    subCreat <- subset(dataMatrix, dataMatrix["lastvalidfix":"firstfix", 9], subset + 1)
  }
}
Any help would be most appreciated.
Try this:
start.idx <- which(df$fixType == "lastvalidfix")
end.idx <- which(df$fixType == "firstfix")
mapply(function(i, j) df[i:j, , drop = FALSE],
       start.idx, end.idx, SIMPLIFY = FALSE)
It will return a list of sub-data.frames or sub-matrices.
(Note: my df$fixType is what you refer to as dataMatrix[, 9]. If it has a column name, I would highly recommend you use that.)
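For instance, rebuilding the sample column from the question as a data frame (the names df and fixType are placeholders here), the two gaps come back with 7 and 12 rows, as expected:

```r
# sample data from the question: two gaps, one of 7 rows and one of 12
df <- data.frame(fixType = c(rep("fix", 6), "lastvalidfix", rep("0", 5),
                             "firstfix", rep("fix", 4), "lastvalidfix",
                             rep("0", 10), "firstfix"),
                 stringsAsFactors = FALSE)

start.idx <- which(df$fixType == "lastvalidfix")
end.idx <- which(df$fixType == "firstfix")
gaps <- mapply(function(i, j) df[i:j, , drop = FALSE],
               start.idx, end.idx, SIMPLIFY = FALSE)

sapply(gaps, nrow)   # 7 12
```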
Related
Assume I have an original dataset containing a complete set of "texts" (a string variable), and a second dataset that contains only those texts for which a new variable "value" takes a certain value (0, 1, or NA).
Now I would like to merge them back together so that the resulting dataset contains the full range of texts from the first dataset, with "value" set to 0 both where it was coded 0 and where the text is present only in the original dataset.
dat1<-data.frame(text=c("a","b","c","d","e","f","g","h")) # original dataset
dat2<-data.frame(text=c("e","f","g","h"), value=c(0,NA,1,1)) # second version
The final dataset should look like this:
> dat3
text value
1 a 0
2 b 0
3 c 0
4 d 0
5 e 0
6 f NA
7 g 1
8 h 1
However, base R's merge() introduces NAs where I want 0s instead:
dat3<-merge(dat1, dat2, by=c("text"), all=T)
Is there a way to define a default input for when the variable by which datasets are merged is only present in one but not the other dataset? In other words, how can I define 0 as standard input value instead of NA?
I am aware of the fact that I could temporarily change the coded NAs in the second dataset to something else to distinguish later on between "real" NAs and NAs that just get introduced, but I would really like to refrain from doing so, if there's another, cleaner way. Ideally, I would like to use merge() or plyr::join() for that purpose but couldn't find anything in the manual(s).
I know this is not ideal either, but something to consider:
library(dplyr)
# a left join keeps every row of dat1; texts missing from dat2 get NA
dat3 <- dplyr::left_join(dat1, dat2, by = "text")
# only texts absent from dat2 get the default 0; f's genuine NA survives
dat3$value[!(dat3$text %in% dat2$text)] <- 0
Or wrapping in a function to call a one-liner:
merge_NA <- function(dat1, dat2){
  dat3 <- dplyr::left_join(dat1, dat2, by = "text")
  dat3$value[!(dat3$text %in% dat2$text)] <- 0
  return(dat3)
}
Now, you only call:
merge_NA(dat1,dat2)
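If you would rather stay with base R's merge(), as the question suggests, the same membership trick works. A sketch using the dat1/dat2 from the question:

```r
dat1 <- data.frame(text = c("a","b","c","d","e","f","g","h"))
dat2 <- data.frame(text = c("e","f","g","h"), value = c(0, NA, 1, 1))

dat3 <- merge(dat1, dat2, by = "text", all.x = TRUE)
# only texts absent from dat2 get the default 0; f's genuine NA is kept
dat3$value[!(dat3$text %in% dat2$text)] <- 0
```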
I'm new to R (started a few days ago), coming from Stata. I am trying to create a loop that creates dummy variables when a variable has the value -9. I want to use a loop as I have plenty of variables like this.
In the following, reflex_working is my dataframe and "A7LECTUR" etc are my variables. I am trying to create a dummy called "miss_varname" for each variable using the ifelse function.
varlist<-c("A7LECTUR", "A7GROASG", "A7RESPRJ", "A7WORPLC", "A7PRACTI",
"A7THEORI", "A7TEACHR", "A7PROBAL", "A7WRIASG", "A7ORALPR")
for (i in varlist){
  reflex_working$miss_[i]<-ifelse(reflex_working$i==-9,1,0)
}
I get the following warnings for each iteration:
1: Unknown or uninitialised column: 'miss_'.
2: Unknown or uninitialised column: 'i'.
And no variable is created. I assume this must be something very trivial for everyone, but I have been trying for the last hour to create this kind of loop and have zero results to show.
Edit:
I have something like:
A7LECTUR
1
2
1
4
-9
And would like, after the loop, to have a new column like:
reflex_working$miss_A7LECTUR
0
0
0
0
1
Hope this helps clarifying what I'm trying to achieve!
Any help would be seriously appreciated.
Gabriele
Let's break this down into why it doesn't work. For starters, in R
i
A7LECTUR
# and
"A7LECTUR"
are different. The first two are variable names; the latter is a value. I am emphasising this difference because it is an important distinction.
Working with lists (and data frames, since data frames are basically lists with some restrictions that make them rectangular): in the syntax reflex_working$i, reflex_working refers to the variable and i refers to the element named "i" within the list. The i after $ is literal; R doesn't care whether you have a variable named i.
With programming, we want to be a bit more dynamic, so you correctly assumed using a variable would do the trick. To do that, you have to use the [ or [[ subset methods ([ always returns a list, while [[ returns the element without the encapsulating list[1]).
To summarise:
reflex_working$i # gets the element named i, no matter what.
reflex_working[[i]] # gets the element whose name (or position) is stored in the variable i
reflex_working$i # equivalent to reflex_working[["i"]]
That should explain the right-hand-side of your line in the loop. The correct statement should read
ifelse(reflex_working[[i]]==-9,1,0)
For the left-hand-side, reflex_working$miss_[i], things are completely off. What you want can be decomposed into several steps:
Compose a value by concatenating "miss_" and the value of i.
Use that value as the element/column name.
We can combine these two into (as a commenter stated)
reflex_working[[paste0('miss_', i)]] <- ...
Good on you for realising that R is inherently vectorised: you are not writing a loop over each row in the column. Good one!
[1] but [[ can return a list, if the element itself is a list. R can be... full of surprises.
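Putting both sides together, the corrected loop from the question becomes (a sketch with a shortened varlist and made-up data):

```r
varlist <- c("A7LECTUR", "A7GROASG")   # shortened from the question's list
reflex_working <- data.frame(A7LECTUR = c(1, 2, 1, 4, -9),
                             A7GROASG = c(2, 3, -9, -9, 0))

for (i in varlist) {
  # paste0() builds the new column name; [[ ]] uses the value stored in i
  reflex_working[[paste0("miss_", i)]] <- ifelse(reflex_working[[i]] == -9, 1, 0)
}

reflex_working$miss_A7LECTUR   # 0 0 0 0 1
```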
Assuming you want this for the entire data frame.
tt <- read.table(text="
A7LECTUR A7GROASG
1 2
2 3
1 -9
4 -9
-9 0", header=TRUE)
tt.d <- (tt == -9)*1
colnames(tt.d) <- paste0("miss_", colnames(tt))
tt.d
# miss_A7LECTUR miss_A7GROASG
# [1,] 0 0
# [2,] 0 0
# [3,] 0 1
# [4,] 0 1
# [5,] 1 0
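If you want the indicator columns attached to the data frame itself rather than kept as a separate matrix, you can cbind() them back on (tt is rebuilt here so the snippet stands alone; tt.full is an assumed name):

```r
tt <- read.table(text = "
A7LECTUR A7GROASG
1 2
2 3
1 -9
4 -9
-9 0", header = TRUE)

tt.d <- (tt == -9) * 1
colnames(tt.d) <- paste0("miss_", colnames(tt))
tt.full <- cbind(tt, tt.d)   # original columns plus the miss_ indicators
```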
I have a dataset with 16 columns, in pairs: one with a species name and the second with a percentage cover (so 8 columns with species names and 8 columns with integers for the percentage cover). I want the rows with NA for species name to have a 0 in the adjacent cover percentage column for the same row. How can I write a loop that does this?
Note that I can't stack the vectors; they must remain in this structure for later analyses.
I have tried copying numerous loops from online sources and also countless examples on SE but I can't seem to get it right. I have tried this code for starters
for(i in 1:nrow(x)){
  if (x$species_1[i]==NA) {x$cover_1[i] <- 0}
  else {NULL}
}
but it throws this
Error in if (x$species_1[i] == NA) { : missing value where TRUE/FALSE needed
which I have read relates to the fact that there are NAs in the vector...so you see my conundrum...
In the end, I want all NAs for species to have a corresponding 0 in the adjacent column for percentage cover
Try this instead:
x[is.na(x$species_1), "cover_1"] <- 0
Change
if (x$species_1[i]==NA)
to
if ( is.na(x$species_1[i]) )
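Either form extends to all 8 pairs with a short loop. This sketch assumes the columns follow the species_k / cover_k naming from the question, shown with 2 pairs of toy data:

```r
x <- data.frame(species_1 = c("oak", NA, "ash"), cover_1 = c(10, 25, 40),
                species_2 = c(NA, "elm", NA),    cover_2 = c(5, 30, 15),
                stringsAsFactors = FALSE)

for (k in 1:2) {                  # use 1:8 for the full data
  sp <- paste0("species_", k)
  cv <- paste0("cover_", k)
  x[[cv]][is.na(x[[sp]])] <- 0    # NA species -> 0 cover in the paired column
}
```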
I just want to achieve something in R. Here is the explanation:
I have a dataset that contains repeated values; please find it below.
A B
1122513454 0
1122513460 0
1600041729 0
2100002632 147905
2840007103 0
2840064133 138142
3190300079 138040
3190301011 138120
3680024411 0
4000000263 4000000263
4100002263 4100002268
4880004352 138159
4880015611 138159
4900007044 0
7084781116 142967
7124925306 0
7225002523 7225001325
23012600000 0
80880593057 0
98880000045 0
I have two columns (A & B). In the B column, the same value (138159) appears twice.
I want to make a calculation where any repeated value counts as 1. That is, I get 138159 two times, but it should be treated as one, and then I want to count all the values in the B column except the 0s. Here 0 appears 10 times and the other values also appear 10 times, but since 138159 appears twice it counts as 1, leaving 9 distinct non-zero values, and that count is what I want.
So my expected output will be 9.
I have already done this in Excel, but want to achieve the same in R. Is there any way to do it in R with the dplyr package?
I have written the following formula in Excel:
=+SUMPRODUCT((I2:I14<>0)/COUNTIFS(I2:I14,I2:I14))
How can I count only the other values' records, leaving out the 0s?
Any help or suggestion is much appreciated.
Edit 1: I have done it in the following way,
abc <- hardy[hardy$couponid !=0,]
undertaker <- abc %>%
  group_by(TYC) %>%
  summarise(count_couponid = n_distinct(couponid))
Is there any smarter way to do this?
Thanks
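On the two-column sample above, the same idea reduces to a one-liner: drop the zeros, then count the distinct values (df and its columns A/B are assumed names; the dplyr line matches the n_distinct() used in the edit):

```r
df <- data.frame(A = c(1122513454, 1122513460, 1600041729, 2100002632,
                       2840007103, 2840064133, 3190300079, 3190301011,
                       3680024411, 4000000263, 4100002263, 4880004352,
                       4880015611, 4900007044, 7084781116, 7124925306,
                       7225002523, 23012600000, 80880593057, 98880000045),
                 B = c(0, 0, 0, 147905, 0, 138142, 138040, 138120, 0,
                       4000000263, 4100002268, 138159, 138159, 0, 142967,
                       0, 7225001325, 0, 0, 0))

length(unique(df$B[df$B != 0]))      # base R: 9
# dplyr::n_distinct(df$B[df$B != 0]) # same count with dplyr
```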
In my current project, I have around 8.2 million rows. I want to scan all rows and apply a certain function whenever the value of a specific column is not zero.
counter <- 1
for (i in 1:nrow(data)) {
  if (data[i, 8] != 0) {
    totalclicks <- sum(data$Clicks[counter:(i-1)])
    test$Clicks[i] <- totalclicks
    counter <- i
  }
}
In the above code, I scan the specific column over all 8.2 million rows, and wherever the value is not zero I calculate a sum over the preceding values. The problem is that the for loop with the if check is too slow: it takes 1 hour for 50K rows. I heard that the apply family is an alternative for this, but the following code also takes too long:
sapply(1:nrow(data), function(x)
  if (data[x, 8] != 0) {
    totalclicks <- sum(data$Clicks[counter:(x-1)])
    test$Clicks[x] <- totalclicks
    counter <- x
  })
[Updated]
Kindly consider the following as a sample dataset:
clicks  revenue  new_column (sum of previous clicks)
1       0
2       0
3       5        3
1       0
4       0
2       7        8
I want the above kind of solution, in which I go through all the rows; whenever a non-zero revenue value is encountered, it adds up all the previous values of clicks.
Am I missing something? Please correct me.
The aggregate() function can be used to split your long data frame into chunks and perform an operation on each chunk, so you could apply it to your example as:
data <- data.frame(Clicks = c(1, 2, 3, 1, 4, 2),
                   Revenue = c(0, 0, 5, 0, 0, 7),
                   new_column = NA)
sub_totals <- aggregate(data$Clicks, list(cumsum(data$Revenue)), sum)
data$new_column[data$Revenue != 0] <- head(sub_totals$x, -1)
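To see why this works, look at the intermediate values (reusing the sample data; the comments show the results):

```r
data <- data.frame(Clicks  = c(1, 2, 3, 1, 4, 2),
                   Revenue = c(0, 0, 5, 0, 0, 7))

# cumsum(Revenue) is constant across each run of zero-revenue rows, so it
# labels every stretch of clicks that ends at a sale:
cumsum(data$Revenue)                                          # 0 0 5 5 5 12

# aggregate() then sums Clicks within each label:
sub_totals <- aggregate(data$Clicks, list(cumsum(data$Revenue)), sum)
sub_totals$x                                                  # 3 8 2

# head(sub_totals$x, -1) drops the final group, which starts at the last
# sale and has no later sale to attach to, leaving the per-sale totals
# 3 and 8 from the question.
```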