table() function does not convert data frame correctly - r

I am trying to convert a data.frame to a table without using packages. I am basically using a cookbook as a reference, and I have tried starting from a data frame as well as from named and unnamed vectors. The data set is the Stack Overflow survey from Kaggle.
moreThan1000 is a data.frame that stores the countries with more than 1000 Stack Overflow users, sorted by the Number column as shown below:
moreThan1000 <- subset(users, users$Number >1000)
moreThan1000 <- moreThan1000[order(moreThan1000$Number),]
When I try to convert it to a table with any of
tbl <- table(moreThan1000)
tbl <- table(moreThan1000$Country, moreThan1000$Number)
tbl <- table(moreThan1000$Country, moreThan1000$Number, dnn = c("Country","Number"))
after each attempt my conversion looks like a matrix rather than a table. Why does the moreThan1000 data.frame send all countries to the table, and not just the related ones?

I believe this is because the countries do not relate to each other: to each country corresponds one number, and to another country corresponds an unrelated number. So the best way to reflect this is the original data.frame, not a table that will have just a single 1 per row (unless two countries have exactly the same number of Stack Overflow users). I haven't downloaded the dataset you're using, but look at what happens with a fake dataset ordered by number, just like your moreThan1000.
dat <- data.frame(A = letters[1:5], X = 21:25)
table(dat$A, dat$X)
    21 22 23 24 25
  a  1  0  0  0  0
  b  0  1  0  0  0
  c  0  0  1  0  0
  d  0  0  0  1  0
  e  0  0  0  0  1
Why would you expect anything different from your dataset?

The function "table" is used to tabulate your data.
So it will count how often every value occurs (in the "number"column!). In your case, every number only occurs once, so don't use this function here. It's working correctly, but it's not what you need.
Your data is already a tabulation, no need to count frequencies again.
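A tiny illustration of what table() actually counts (repeated values are what produce counts greater than 1):
> table(c(10, 10, 20))
# 10 20 
#  2  1 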
You could check whether an object conversion function exists; I guess you are looking for the function as.table rather than table.
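For instance, a minimal sketch with xtabs(), assuming the columns are named Country and Number as in the question. Unlike table(), xtabs() builds a table object from values that are already tabulated instead of counting rows:
# Build a table of Number per Country from already-tabulated data
tbl <- xtabs(Number ~ Country, data = moreThan1000)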

Related

How to apply a function in different ranges of a vector in R?

I have the following matrix:
x <- matrix(c(1, 2, 2, 1, 10, 10, 20, 21, 30, 31, 40,
              1, 3, 2, 3, 10, 11, 20, 20, 32, 31, 40,
              0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0), 11, 3)
For each unique value of the first column of x, I would like to find the maximum of the third column across all rows having that value in the first column.
I have created the following code:
v1 <- sequence(rle(x[, 1])$lengths)
A <- split(seq_along(v1), cumsum(v1 == 1))  # row indices of each run
A_diff <- rep(0, length(A))
for (i in seq_along(A)) {
  A_diff[i] <- max(x[A[[i]], 3])
}
However, the provided code only works when equal elements are consecutive in the first column (because I use rle), and it relies on a for loop.
So, how can I make it work in the general case and without the for loop, that is, using a function?
If I understand correctly
> tapply(x[,3],x[,1],max)
 1  2 10 20 21 30 31 40 
 1  1  1  0  1  1  0  0 
For grouping by more than one variable I would use aggregate. Note that matrices are cumbersome for this purpose, so I would suggest transforming it to a data frame; nonetheless:
> aggregate(x[,3],list(x[,1],x[,2]),max)
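As suggested, the same call is cleaner after converting to a data frame; by default as.data.frame() names the matrix columns V1, V2, V3:
> df <- as.data.frame(x)
> aggregate(V3 ~ V1 + V2, data = df, FUN = max)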

Procedural way to generate signal combinations and their output in r

I have been continuing to learn R to transition away from Excel, and I am wondering how best to approach the following problem, or at least what tools are available to me:
I have a large data set (100K+ rows) and several columns that I could generate a signal off of, where each value in those columns can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals from the state of each cell in the four columns, then see what each composite signal tells me about the returns in a time series. For this question the scope is only generating the combinations.
So, for example, one composite signal would be when all four cells in a row equal 0. I could generate a new column that reads TRUE when that is the case and FALSE otherwise, then go on to figure out how that affects the returns from the rest of the data frame.
The thing is, I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and so on, which is quite a few. With my current knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think you could paste the columns together to get the unique combinations, then turn them into dummy variables:
library(dplyr)
library(dummies)

# Create sample data
data <- data.frame(sig1 = c(1, 1, 1, 1, 0, 0, 0),
                   sig2 = c(1, 1, 0, 0, 0, 1, 1),
                   sig3 = c(2, 2, 0, 1, 1, 2, 1))

# Paste the signal columns together into one combination string
data <- data %>% mutate(sig_tot = paste0(sig1, sig2, sig3))

# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))

# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
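If the dummies package is not available, here is a base-R sketch of the same idea, using model.matrix() to build one indicator column per unique combination:
# Base-R alternative: one logical column per unique combination
data$sig_tot <- factor(paste0(data$sig1, data$sig2, data$sig3))
data <- cbind(data, model.matrix(~ sig_tot - 1, data = data) == 1)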

How to define default input value when merging two datasets on one column of different lengths?

Assume I have an original dataset containing a complete set of "texts" (a string variable), and a second dataset that only contains those "texts" for which the new variable "value" takes a certain value (0, 1, or NA).
Now I would like to merge them back together so that the resulting dataset contains the full range of "texts" from the first dataset but also includes "value", which should be 0 both where it is coded 0 and where the "text" is only present in the original dataset.
dat1<-data.frame(text=c("a","b","c","d","e","f","g","h")) # original dataset
dat2<-data.frame(text=c("e","f","g","h"), value=c(0,NA,1,1)) # second version
The final dataset should look like this:
> dat3
text value
1 a 0
2 b 0
3 c 0
4 d 0
5 e 0
6 f NA
7 g 1
8 h 1
However, what base R's merge() does is introduce NAs where I want 0s instead:
dat3<-merge(dat1, dat2, by=c("text"), all=T)
Is there a way to define a default input for when the variable by which the datasets are merged is only present in one of them? In other words, how can I define 0 as the standard input value instead of NA?
I am aware that I could temporarily change the coded NAs in the second dataset to something else, to distinguish later on between "real" NAs and the NAs that merge() introduces, but I would really like to refrain from doing so if there is a cleaner way. Ideally I would like to use merge() or plyr::join() for this purpose, but I couldn't find anything in the manual(s).
I know this is not ideal either, but it's something to consider:
library(dplyr)
dat3 <- dplyr::left_join(dat1, dat2, by = "text")
dat3$value[!(dat3$text %in% dat2$text)] <- 0
Or wrapping it in a function to call a one-liner:
merge_NA <- function(dat1, dat2) {
  dat3 <- dplyr::left_join(dat1, dat2, by = "text")
  dat3$value[!(dat3$text %in% dat2$text)] <- 0
  return(dat3)
}
Now you only need to call:
merge_NA(dat1,dat2)
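A base-R sketch of the same idea with merge(), filling in only the NAs that the merge itself introduced (i.e. for texts absent from dat2), so the "real" NA for "f" is preserved:
dat3 <- merge(dat1, dat2, by = "text", all.x = TRUE)
dat3$value[!(dat3$text %in% dat2$text)] <- 0  # only rows missing from dat2
dat3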

how to do sumproduct in R by dplyr package

I want to achieve something in R. Here is the explanation:
I have a data set that contains repeated values; please find it below,
A B
1122513454 0
1122513460 0
1600041729 0
2100002632 147905
2840007103 0
2840064133 138142
3190300079 138040
3190301011 138120
3680024411 0
4000000263 4000000263
4100002263 4100002268
4880004352 138159
4880015611 138159
4900007044 0
7084781116 142967
7124925306 0
7225002523 7225001325
23012600000 0
80880593057 0
98880000045 0
I have two columns (A and B). In column B, the value 138159 appears twice.
I want to count the distinct non-zero values in column B: when the same value occurs more than once, it should be counted as 1. Column B holds ten 0s and ten non-zero values, but since 138159 appears twice, the non-zero values amount to only 9 distinct values.
So my expected output is 9.
I have already done this in Excel, but I want to achieve the same in R. Is there a way to do it with the dplyr package?
I wrote the following formula in Excel:
=+SUMPRODUCT((I2:I14<>0)/COUNTIFS(I2:I14,I2:I14))
How can I count only the other values' records, leaving out the 0s?
Can you guys help me with that? Any suggestion is really appreciated.
Edit 1: I have done it the following way:
library(dplyr)
abc <- hardy[hardy$couponid != 0, ]
undertaker <- abc %>%
  group_by(TYC) %>%
  summarise(count_couponid = n_distinct(couponid))
Is there a smarter way to do that?
Thanks
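For reference, a minimal self-contained sketch of the same approach on the sample data above (dat is just an illustrative name for the two-column data frame):
library(dplyr)

dat <- data.frame(
  A = c(1122513454, 1122513460, 1600041729, 2100002632, 2840007103,
        2840064133, 3190300079, 3190301011, 3680024411, 4000000263,
        4100002263, 4880004352, 4880015611, 4900007044, 7084781116,
        7124925306, 7225002523, 23012600000, 80880593057, 98880000045),
  B = c(0, 0, 0, 147905, 0, 138142, 138040, 138120, 0, 4000000263,
        4100002268, 138159, 138159, 0, 142967, 0, 7225001325, 0, 0, 0)
)

# Count distinct non-zero values of B: duplicates count once, zeros excluded
dat %>%
  filter(B != 0) %>%
  summarise(count = n_distinct(B))
#   count
# 1     9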

R: How to turn a table of transaction-item pairs into a matrix of transactions x items?

I have a data frame in R like this:
user1,A
user1,B
user2,A
user2,C
user2,C
user3,A
user4,C
How can I transform it into a table like this?
user1,1,1,0
user2,1,0,2
user3,1,0,0
user4,0,0,1
Actually my data has a time interval as the second column, namely the time passed between the user's first and current purchase. I want to do a matrix plot in which each row of the matrix is a line in the image (and each element a pixel).
(I can (kinda) do it with the arules package, by exporting the original data frame and importing it as transactions, but I think there must be a direct way to do this without such a hack.)
Thanks!
You can just use table()
> table(df1)
# V2
#V1 A B C
# user1 1 1 0
# user2 1 0 2
# user3 1 0 0
# user4 0 0 1
If you want to store this output as a new dataframe df2, this is one possibility:
df2 <- as.data.frame.matrix(table(df1))
data
df1 <- read.table(text="user1,A
user1,B
user2,A
user2,C
user2,C
user3,A
user4,C", header=F, sep=",")
