I want to achieve something in R. Here is the explanation:
I have a data set that contains repeated values; please see the data below.
A B
1122513454 0
1122513460 0
1600041729 0
2100002632 147905
2840007103 0
2840064133 138142
3190300079 138040
3190301011 138120
3680024411 0
4000000263 4000000263
4100002263 4100002268
4880004352 138159
4880015611 138159
4900007044 0
7084781116 142967
7124925306 0
7225002523 7225001325
23012600000 0
80880593057 0
98880000045 0
I have two columns (A and B). In column B, one value (138159) appears twice.
I want to make a calculation in which duplicates are counted only once: I get 138159 two times, but it should be treated as one. Then I want to count all the values in column B except 0. Here, 0 appears 10 times and the non-zero values also appear 10 times, but since 138159 appears twice it counts as one, so there are 9 distinct non-zero values, and I want that count, i.e. 9.
So my expected output is 9.
I have already done this in Excel, but I want to achieve the same in R. Is there a way to do it with the dplyr package?
I have written the following formula in Excel:
=+SUMPRODUCT((I2:I14<>0)/COUNTIFS(I2:I14,I2:I14))
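For reference, a direct R counterpart of this distinct-non-zero count (a minimal sketch only, assuming the values sit in a column B of a data frame df; both names are placeholders, not from the real data) would be:
length(unique(df$B[df$B != 0]))   # drop the zeros, count each remaining value once -> 9 for the data above
# or, with dplyr
library(dplyr)
df %>% filter(B != 0) %>% summarise(n = n_distinct(B))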
How can I count only the non-zero values' records, excluding 0?
Can you help me with that?
Any suggestion is really appreciated.
Edit 1: I have done this in the following way:
abc <- hardy[hardy$couponid != 0, ]
undertaker <- abc %>%
  group_by(TYC) %>%
  summarise(count_couponid = n_distinct(couponid))
Is there a smarter way to do that?
Thanks
I'm currently struggling with shuffling a data frame in RStudio. Let's say my data frame looks as follows:
x y
0 a
0 a
1 a
1 a
0 b
0 b
1 b
1 b
Would it be possible to shuffle the rows but specify that the four different sequences of variable y (i.e. aa, ab, ba, bb) occur equally often? In total, I have 24 rows in my original data frame. I hope I could make my problem clear. Thanks a lot in advance for your help!
Ema
It is possible; however, there is no built-in solution, so you will have to code this yourself.
From what I can see in your data frame, the 0 a and 1 a rows have a 1:1 ratio, and the same goes for b.
In this case I would recommend grouping the letters into the pairs aa, ab, ba, bb and repeating these pairs three times.
Now shuffle them; this ensures that every pair occurs with the same frequency. (This only works if the pairs you wish to check are rows 1 and 2, 3 and 4, etc. If instead you wish to check rows 1 and 2, 2 and 3, etc., then I misunderstood and you can stop reading.)
Now take only the lines with a's and assign 6 ones and 6 zeroes (in your case). Shuffle the a's only.
Repeat for the b's.
You have your shuffle.
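A minimal R sketch of this pair-based idea (assuming 24 rows in total, so each of the four pairs appears three times, and an equal mix of 0s and 1s within the a-rows and within the b-rows; all object names are placeholders):
pairs <- sample(rep(c("aa", "ab", "ba", "bb"), times = 3))  # 12 pairs, equal frequency, shuffled
y <- unlist(strsplit(pairs, ""))                            # expand each pair into its two letters -> 24 values of y
x <- integer(length(y))
x[y == "a"] <- sample(rep(0:1, length.out = sum(y == "a"))) # 6 zeroes and 6 ones among the a-rows, shuffled
x[y == "b"] <- sample(rep(0:1, length.out = sum(y == "b"))) # same for the b-rows
shuffled <- data.frame(x = x, y = y)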
I have been continuing to learn R to transition away from Excel, and I am wondering what the best way to approach the following problem is, or at least what tools are available to me:
I have a large data set (100K+ rows) and several columns that I could generate a signal from; each value in these columns can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals using the state of each cell in the four columns, then see what each composite signal tells me about the returns in a time series. For this question the scope is only generating the combinations.
So for example, one composite signal would be when all four cells in a row equal 0. I could generate a new column that reads TRUE in that case and FALSE in every other case, then go on to figure out how that affects the returns from the rest of the data frame.
The thing is, I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and so on, which is quite a few. With the extent of my knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think that you could paste the columns together to get unique combinations, then just turn this into dummy variables:
library(dplyr)
library(dummies)
# Create sample data
data <- data.frame(sig1 = c(1, 1, 1, 1, 0, 0, 0),
                   sig2 = c(1, 1, 0, 0, 0, 1, 1),
                   sig3 = c(2, 2, 0, 1, 1, 2, 1))
# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1, sig2, sig3))
# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))
# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
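If the dummies package is not an option, a base-R sketch of the same paste-then-expand idea (using model.matrix() instead of dummy(); the resulting column names are only illustrative) could be:
data$sig_tot <- paste0(data$sig1, data$sig2, data$sig3)
ind <- model.matrix(~ sig_tot - 1, data = data) == 1   # logical indicator columns, one per combination
data <- cbind(data, ind)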
In my current project, I have around 8.2 million rows. I want to scan all the rows and apply a certain function whenever the value of a specific column is not zero.
counter <- 1
for (i in 1:nrow(data)) {
  if (data[i, 8] != 0) {
    totalclicks <- sum(data$Clicks[counter:(i - 1)])
    test$Clicks[i] <- totalclicks
    counter <- i
  }
}
In the above code, I search the specific column over 8.2 million rows, and if the value is not zero I calculate a sum over the preceding values. The problem is that the for and if loops are too slow: it takes 1 hour for 50K rows. I heard that the apply family is an alternative, but the following code also takes too long:
sapply(1:nrow(data), function(x)
  if (data[x, 8] != 0) {
    totalclicks <- sum(data$Clicks[counter:(x - 1)])
    test$Clicks[x] <- totalclicks
    counter <- x
  })
[Updated]
Kindly consider the following as a sample dataset:
clicks  revenue  new_column (sum of previous clicks)
1       0
2       0
3       5        3
1       0
4       0
2       7        8
I want the above kind of solution, in which I go through all the rows; whenever a non-zero revenue value is encountered, it adds up all the previous values of clicks.
Am I missing something? Please correct me.
The aggregate() function can be used to split your long data frame into chunks and perform operations on each chunk, so you could apply it to your example as:
data <- data.frame(Clicks = c(1, 2, 3, 1, 4, 2),
                   Revenue = c(0, 0, 5, 0, 0, 7),
                   new_column = NA)
sub_totals <- aggregate(data$Clicks, list(cumsum(data$Revenue)), sum)
data$new_column[data$Revenue != 0] <- head(sub_totals$x, -1)
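With 8.2 million rows, a fully vectorised base-R variant may also be worth trying; this is only a sketch, assuming the columns are named Clicks and Revenue as in the sample data:
nz    <- which(data$Revenue != 0)          # rows with non-zero revenue
start <- c(1, head(nz, -1))                # first row of each block: the previous non-zero row, or row 1
cs    <- c(0, cumsum(data$Clicks))         # cs[i + 1] = sum of the first i clicks
data$new_column <- NA
data$new_column[nz] <- cs[nz] - cs[start]  # gives 3 and 8 for the sample data, as expected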
I am trying to convert a data.frame to a table without packages. I basically used a cookbook as a reference and tried this from a data frame, with both named and unnamed vectors. The data set is the Stack Overflow survey from Kaggle.
moreThan1000 is a data.frame that stores the countries with more than 1000 Stack Overflow users, sorted by the Number column, as shown below:
moreThan1000 <- subset(users, users$Number >1000)
moreThan1000 <- moreThan1000[order(moreThan1000$Number),]
When I try to convert it to a table like:
tbl <- table(moreThan1000)
tbl <- table(moreThan1000$Country, moreThan1000$Number)
tbl <- table(moreThan1000$Country, moreThan1000$Number, dnn = c("Country","Number"))
after each attempt my conversion looks like this:
Why does the moreThan1000 data.frame not send just the related countries, but all countries, to the table? It seems to me the conversion looks like a matrix.
I believe this is because countries do not relate to each other: to each country corresponds a number, and to another country will correspond an unrelated number. So the best way to reflect this is the original data.frame, not a table that will have just a single 1 per row (unless two countries have the very same number of Stack Overflow users). I haven't downloaded the dataset you're using, but look at what happens with a fake dataset, ordered by number just like your moreThan1000.
dat <- data.frame(A = letters[1:5], X = 21:25)
table(dat$A, dat$X)
   21 22 23 24 25
a   1  0  0  0  0
b   0  1  0  0  0
c   0  0  1  0  0
d   0  0  0  1  0
e   0  0  0  0  1
Why would you expect anything different from your dataset?
The function "table" is used to tabulate your data.
So it will count how often every value occurs (in the "number"column!). In your case, every number only occurs once, so don't use this function here. It's working correctly, but it's not what you need.
Your data is already a tabulation, no need to count frequencies again.
You can check whether there is an object-conversion function; I guess you are looking for the function as.table rather than table.
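For example, a rough sketch of building a one-dimensional table keyed by country (assuming the columns are called Country and Number, as in the question) could be:
tbl <- xtabs(Number ~ Country, data = moreThan1000)   # one cell per country, holding its Number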
I am still an R beginner, so please be kind :). There are gaps that occur in my data at unknown times and for unknown intervals. I would like to pull these gaps out of my data by subsetting them. I don't want them removed from the data frame, I just want as many different subsets as there are data gaps so that I can make changes to them and eventually merge the changed subsets back into the original data frame. Also, eventually I will be running the greater part of this script on multiple .csv files so it cannot be hardcoded. A sample of my data is below with just the relevant column:
fixType (column 9)
fix
fix
fix
fix
fix
fix
lastvalidfix
0
0
0
0
0
firstfix
fix
fix
fix
fix
lastvalidfix
0
0
0
0
0
0
0
0
0
0
firstfix
The code I have now is not functional and not completely correct R, but I hope it expresses what I need to do. Essentially, every time lastvalidfix and firstfix are found in the rows of column 9, I would like to create a subset that includes those two rows and however many rows lie between them. Using my sample data above, I would be creating 2 subsets, the first with 7 rows and the second with 12 rows. The number of data gaps in each file varies, so the number of subsets and their lengths will likely be different each time. I realize that each subset will need a unique name, which is why I've done the subset + 1.
subset <- 0 # This is my attempt at creating unique names for the subsets
for (i in 2:nrow(dataMatrix)) { # Creating new subsets of data for each time the signal is lost
  if ((dataMatrix[i, 9] == "lastvalidfix") &
      (dataMatrix[i, 9] == "firstfix")) {
    subCreat <- subset(dataMatrix, dataMatrix["lastvalidfix":"firstfix", 9], subset + 1)
  }
}
Any help would be most appreciated.
Try this:
start.idx <- which(df$fixType == "lastvalidfix")
end.idx <- which(df$fixType == "firstfix")
mapply(function(i, j) df[i:j, , drop = FALSE],
start.idx, end.idx, SIMPLIFY = FALSE)
It will return a list of sub-data.frames or sub-matrices.
(Note: my df$fixType is what you refer to as dataMatrix[, 9]. If it has a column name, I would highly recommend you use that.)
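In the question's own terms, a hedged usage sketch (assuming dataMatrix is the data frame, column 9 holds the fix types, and every lastvalidfix is eventually followed by a firstfix) might look like:
start.idx <- which(dataMatrix[, 9] == "lastvalidfix")
end.idx   <- which(dataMatrix[, 9] == "firstfix")
gaps <- mapply(function(i, j) dataMatrix[i:j, , drop = FALSE],
               start.idx, end.idx, SIMPLIFY = FALSE)
names(gaps) <- paste0("gap_", seq_along(gaps))   # gap_1, gap_2, ...
# after editing a subset, it can be written back into the original rows, e.g.:
# dataMatrix[start.idx[1]:end.idx[1], ] <- gaps$gap_1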