Not sure why dcast() this data set results in dropping variables - r

I have a data frame that looks like:
id fromuserid touserid from_country to_country length
1 1 54525953 47195889 US US 2
2 2 54525953 54361607 US US 1
3 3 54525953 53571081 US US 2
4 4 41943048 55379244 US US 1
5 5 47185938 53140304 US PR 1
6 6 47185938 54121387 US US 1
7 7 54525974 50928645 GB GB 1
8 8 54525974 53495302 GB GB 1
9 9 51380247 45214216 SG SG 2
10 10 51380247 43972484 SG US 2
Each row describes a number of messages (length) sent from one user to another user.
What I would like to do is create a visualization (via a chord diagram in D3) of the messages sent between each country.
There are almost 200 countries. I use the function dcast as follows:
countries <- dcast(chats,from_country ~ to_country,drop=FALSE,fill=0)
This worked before for me when I had a smaller data set and fewer variables, but this data set is over 3M rows, and not easy to debug, so to speak.
At any rate, what I am getting now is a matrix that is not square, and I can't figure out why not. What I am expecting to get is essentially a matrix where the (i,j)th cell represents the messages sent from country i to country j. What I end up with is something very close to this, but with some rows and columns obviously missing, which is easy to spot because US->US messages show up shifted by one row or column.
So here's my question. Is there anything I'm doing that is obviously wrong? If not, is there something "strange" I should be looking for in the data set to sort this out?

Be sure that your "from_country" and "to_country" variables are factors, and that they share the same levels. Using the example data you shared:
chats$from_country <- factor(chats$from_country,
levels = unique(c(chats$from_country,
chats$to_country)))
chats$to_country <- factor(chats$to_country,
levels = levels(chats$from_country))
dcast(chats,from_country ~ to_country, drop = FALSE, fill = 0)
# Using length as value column: use value.var to override.
# Aggregation function missing: defaulting to length
# from_country US GB SG PR
# 1 US 5 0 0 1
# 2 GB 0 2 0 0
# 3 SG 1 0 1 0
# 4 PR 0 0 0 0
If your "from_country" and "to_country" variables are already factors, but not with the same levels, you can do something like this for the first step:
chats$from_country <- factor(chats$from_country,
levels = unique(c(levels(chats$from_country),
levels(chats$to_country))))
Why is this necessary? If they are already factors, then in versions of R before 4.1.0, c(chats$from_country, chats$to_country) drops the factor class and returns the underlying integer codes. Since those codes don't match any of the character labels of the factor levels, every value ends up as <NA>.
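A minimal sketch of the same idea in base R, assuming the ten example rows from the question (retyped here by hand). It first lists countries that appear on only one side, which is the usual reason the result is not square, and then builds the matrix with xtabs() rather than dcast(). Note that xtabs() here sums the length column, whereas the dcast() call above counts rows. The names only_from, only_to, all_countries and msg_matrix are made up for illustration.

```r
# Example rows from the question (retyped; only the columns needed here)
chats <- data.frame(
  from_country = c("US", "US", "US", "US", "US", "US", "GB", "GB", "SG", "SG"),
  to_country   = c("US", "US", "US", "US", "PR", "US", "GB", "GB", "SG", "US"),
  length       = c(2, 1, 2, 1, 1, 1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Diagnostic: countries that occur only as sender or only as receiver
only_from <- setdiff(chats$from_country, chats$to_country)
only_to   <- setdiff(chats$to_country, chats$from_country)

# Give both columns the same set of levels
all_countries <- union(chats$from_country, chats$to_country)
chats$from_country <- factor(chats$from_country, levels = all_countries)
chats$to_country   <- factor(chats$to_country,   levels = all_countries)

# Sum messages per (from, to) pair; unused levels are kept,
# so the result has one row and one column per country
msg_matrix <- xtabs(length ~ from_country + to_country, data = chats)
dim(msg_matrix)
```

If only_from or only_to is non-empty on the real data, those are the countries whose row or column goes missing when the two columns do not share levels.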

Related

Procedural way to generate signal combinations and their output in r

I have been continuing to learn R to transition away from Excel, and I am wondering what the best way to approach the following problem is, or at least what tools are available to me:
I have a large data set (100K+ rows) with several columns that I could generate a signal from. Each value in these columns can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals using the state of each cell in the four columns then see what each of the composite signals tell me about the returns in a time series. For this question the scope is only generating the combinations.
So for example, one composite signal would be when all four cells in the vectors = 0. I could generate a new column that reads TRUE when that case is true and FALSE in every other case, then go on to figure out how that affects the returns from the rest of the data frame.
The thing is I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and on and on, which is quite a few. With the extent of my knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think that you could paste the columns together to get unique combinations, then just turn this to dummy variables:
library(dplyr)
library(dummies)
# Create sample data
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
sig2 = c(1,1,0,0,0,1,1),
sig3 = c(2,2,0,1,1,2,1))
# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1,sig2,sig3))
# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))
# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
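The same paste-then-dummy idea can be sketched in base R alone, in case the extra packages are not available. This is only an illustration on the three-column sample data above: expand.grid() enumerates every possible combination of values 0 to 3, and one logical column is built per combination that actually occurs. The names combos, all_keys, observed and dummy_cols are hypothetical.

```r
# Sample data, as in the answer above
data <- data.frame(sig1 = c(1, 1, 1, 1, 0, 0, 0),
                   sig2 = c(1, 1, 0, 0, 0, 1, 1),
                   sig3 = c(2, 2, 0, 1, 1, 2, 1))

# Composite key per row, e.g. "112"
data$sig_tot <- paste0(data$sig1, data$sig2, data$sig3)

# Every possible combination of the three signals (4^3 = 64 keys)
combos <- expand.grid(sig1 = 0:3, sig2 = 0:3, sig3 = 0:3)
all_keys <- paste0(combos$sig1, combos$sig2, combos$sig3)

# Treat the key as a factor over all possible combinations,
# then keep only the combinations that occur in the data
observed <- droplevels(factor(data$sig_tot, levels = all_keys))

# One logical column per observed combination
dummy_cols <- sapply(levels(observed), function(k) observed == k)
dummy_cols
```

With four signal columns, adding sig4 to expand.grid() and paste0() would give the 256 possible keys; skipping droplevels() would keep a column for every combination, including those that never occur.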

Like dcast but without sum of data

I have data organized for the R survival package, but want to export it to work in Graphpad Prism, which uses a different structure.
#Example data
Treatment<-c("A","A","A","A","A","B","B","B","B","B")
Time<-c(3,4,5,5,5,1,2,2,3,5)
Status<-c(1,1,0,0,0,1,1,1,1,1)
df<-data.frame(Treatment,Time,Status)
The R survival package data structure looks like this
Treatment Time Status
A 3 1
A 4 1
A 5 0
A 5 0
A 5 0
B 1 1
B 2 1
B 2 1
B 3 1
B 5 1
The output I need organizes each treatment as one column, and then sorts by time. Each individual is then recorded as a 1 or 0 according to its Status. The output should look like this:
Time  A  B
1        1
2        1
2        1
3     1  1
4     1
5     0  1
5     0
5     0
dcast() does something similar to what I want, but it sums up the Status values and merges them into one cell for all individuals with matching Time values.
Thanks for any help!
I ran into a weird issue when trying to implement Sotos' code to my actual data. I got the error:
Error in Math.factor(var) : ‘abs’ not meaningful for factors
Which is weird, because Sotos' code works for the example. When I checked the example data frame using sapply() it gave me the result:
> sapply(df,class)
Treatment Time Status
"factor" "numeric" "numeric"
My issue, as far as I could tell, was that my Status variable was read as numeric in my example, but as an integer in my real data:
> sapply(df,class)
Treatment Time Status
"factor" "numeric" "integer"
I loaded my data from a .csv, so maybe that's what caused the change in the variable's class. I ended up converting my Status variable with as.numeric(), and then re-generating the data frame.
Status<-as.numeric(df$Status)
df<-data.frame(Treatment, Time, Status)
And was able to apply Sotos' code to the new dataframe.
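For reference, here is one way the requested layout could be produced in base R. This is a sketch written for this example and is not Sotos' code, which is not shown here. The idea is to number the individuals within each Treatment/Time pair and then reshape to wide, so that tied times stay on separate rows instead of being summed. The column name idx and the object wide are made up for illustration.

```r
# Example data from the question
Treatment <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B")
Time      <- c(3, 4, 5, 5, 5, 1, 2, 2, 3, 5)
Status    <- c(1, 1, 0, 0, 0, 1, 1, 1, 1, 1)
df <- data.frame(Treatment, Time, Status)

# Running index of individuals within each Treatment/Time pair
df$idx <- ave(df$Status, df$Treatment, df$Time, FUN = seq_along)

# One row per (Time, idx), one Status column per treatment
wide <- reshape(df, idvar = c("Time", "idx"), timevar = "Treatment",
                direction = "wide")
wide <- wide[order(wide$Time, wide$idx), ]

# Empty cells are NA; write.csv(..., na = "") would leave them blank on export
wide[, c("Time", "Status.A", "Status.B")]
```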

Summing up prior rows contingent on ID #'s in R, for loop vs apply

I have a dataframe of xyz coordinates of units in 5 different boxes, all 4x4x8 so 128 total possible locations. The units are all different lengths. So even though I know the coordinates of the unit (3 units in, 2 left, and 1 up) I don't know the exact location of the unit in the box (12' in, 14' left, 30' up?). The z dimension corresponds to length and is the dimension I am interested in.
My instinct is to run a for loop summing values, but that is generally not the most efficient in R. The key elements of the for loop would be something along the lines of:
master$unitstartpoint <- 0
for (i in seq_len(nrow(master))[-1]) {
  if (master$unitz[i] > 1) master$unitstartpoint[i] <- master$unitstartpoint[i-1] + master$length[i-1]
}
i.e. the unit start point is 0 if it is the first in the z dimension, otherwise it is the start point of the prior unit + the length of the prior unit. Here's the data:
# generate dataframe
master<-c(rep(1,128),rep(2,128),rep(3,128),rep(4,128),rep(5,128))
master<-as.data.frame(master)
# input basic data--what load number the unit was in, where it was located
# relative other units
master$boxNumber<-master$master
master$unitx<-rep(c(rep(1,32),rep(2,32),rep(3,32),rep(4,32)),5)
master$unity<-c(rep(1,8),rep(2,8),rep(3,8),rep(4,8))
master$unitz<-rep(1:8,80)
# create unique unit ID # based on load number and xyz coords.
master <- transform(master,ID=paste0(boxNumber,unitx,unity,unitz))
# generate how long the unit is. this length will be used to identify unit
# location in the box
master$length<-round(rnorm(640,13,2))
I'm guessing there is a relatively easy way to do this with apply or by but I am unfamiliar with those functions.
Extra info: the unit ID's are unique and the master dataframe is sorted by boxNumber, unitx, unity, and then unitz, respectively.
This is what I am shooting for:
length unitx unity unitz unitstartpoint
15 1 1 1 0
14 1 1 2 15
11 1 1 3 29
13 1 1 4 40
Any guidance would be appreciated. Thanks!
It sounds like you just want a cumulative sum along the z dimension for each box/x/y combination. I used a cumulative sum because if you reset to 0 when z=1, your definition would leave off the length at z=8. We can do this easily with ave:
ml <- with(master, ave(length, boxNumber, unitx, unity, FUN=cumsum))
I'm not exactly sure which values you want returned, but this column roughly translates to how you were redefining length above. If I combine it with the original data and look at the total length for the first box for x=1, y=1:4:
# head(subset(cbind(master, ml), unitz==8),4)
master boxNumber unitx unity unitz length ID ml
8 1 1 1 1 8 17 1118 111
16 1 1 1 2 8 14 1128 104
24 1 1 1 3 8 10 1138 98
32 1 1 1 4 8 10 1148 99
we see the total lengths for those positions. Since we are using cumsum, we are assuming that the z values are sorted, as you have indicated they are. If you just want one total overall length per box/x/y combo, you can replace cumsum with sum.
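If the start point from the question is what is needed (0 for the first unit in z, then the running total of the preceding lengths), a small variation on the ave() call seems to do it: subtract each unit's own length from the cumulative sum. The sketch below uses the four rows from the question's desired output rather than the random data, and the data frame name units is hypothetical.

```r
# The four example rows from the question's desired output
units <- data.frame(boxNumber = 1, unitx = 1, unity = 1,
                    unitz = 1:4, length = c(15, 14, 11, 13))

# Start point = sum of the lengths of all earlier units in the same box/x/y stack
units$unitstartpoint <- with(units,
  ave(length, boxNumber, unitx, unity, FUN = function(x) cumsum(x) - x))

units
```

As with cumsum above, this relies on the rows being sorted by unitz within each box/x/y combination.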

ChoiceModelR - Hierarchical Bayes Multinomial Logit Model

I hope that some of you are a bit experienced with the R package ChoiceModelR by Sermas and Colias, to estimate a Hierarchical Bayes Multinomial Logit Model. Actually, I am quite a newbie on both R and Hierarchical Bayes. However, I tried to get some estimates by using the script provided by Sermas and Colias in the help file. I have a data set in the same structure as they use (ID, choice set, alternative, independent variables, and choice variable). I have four independent variables all of them binary coded as categorical variables, none of them restricted. I have eight choice sets with three alternatives within each set as well as one no-choice-option as fourth alternative. I tried the following script:
library (ChoiceModelR)
data <- read.delim("Z:/KLU/CSR/CBC/mp3_vio.txt")
xcoding=c(0,0,0,0)
mcmc = list(R = 10, use = 10)
options = list(none=FALSE, save=TRUE, keep=1)
attlevels=c(2,2,2,2)
c1=matrix(c(0,0,0,0),2,2)
c2=matrix(c(0,0,0,0),2,2)
c3=matrix(c(0,0,0,0),2,2)
c4=matrix(c(0,0,0,0),2,2)
constraints = list(c1, c2, c3, c4)
out = choicemodelr(data, xcoding, mcmc = mcmc, options = options, constraints = constraints)
and have got the following error message:
Error in 1:nalts[i] : result would be too long a vector
In addition: There were 50 or more warnings (use warnings() to see the first 50). The mentioned warnings are of the following:
In max(temp[temp[, 2] == j, 3]) : no non-missing arguments to max; returning -Inf
In max(temp[temp[, 2] == j, 3]) : no non-missing arguments to max; returning -Inf
Actually, I have no idea what went wrong, since I used the same data structure, even though I have more independent variables, more choice sets, and more alternatives within a choice set. It would be fantastic if anybody could shed some light into the darkness.
I know that this may not be helpful since you posted so long ago, but if it comes up again in the future, this could prove useful.
One of the most common reasons for this error (in my experience) has been that either the scenario variable or the alternative variable is not in ascending order within your data.
id scenario alt x1 ... y
1 1 1 4 1
1 1 2 1 0
1 3 1 4 2
1 3 2 5 0
2 1 4 3 1
2 1 5 1 0
2 2 1 4 2
2 2 2 3 0
This dataset will give you errors since the scenario and alternative variables must be ascending, and they must not skip any values. Just to fully reiterate what I mean, the scenario and alt variables must be reordered as follows in order to work:
id scenario alt x1 ... y
1 1 1 4 1
1 1 2 1 0
1 2 1 4 2
1 2 2 5 0
2 1 1 3 1
2 1 2 1 0
2 2 1 4 2
2 2 2 3 0
I work with ChoiceModelR quite frequently, and this is what has caused these errors for me in the past. If you have a github account, you can also post your data (or modified data) there if you end up wanting to have other users take a look.
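The ordering rule described above can be checked before calling choicemodelr(). The following is only a sketch in base R using the two small tables from this answer (x1 and y omitted). The helpers is_consecutive(), check_order() and renumber() are hypothetical and not part of ChoiceModelR, and whether relabeling is appropriate depends on what the original codes mean in your design.

```r
# The problematic and the corrected layouts from the tables above
bad  <- data.frame(id       = c(1, 1, 1, 1, 2, 2, 2, 2),
                   scenario = c(1, 1, 3, 3, 1, 1, 2, 2),
                   alt      = c(1, 2, 1, 2, 4, 5, 1, 2))
good <- data.frame(id       = c(1, 1, 1, 1, 2, 2, 2, 2),
                   scenario = c(1, 1, 2, 2, 1, 1, 2, 2),
                   alt      = c(1, 2, 1, 2, 1, 2, 1, 2))

# TRUE if the distinct values, in order of appearance, are exactly 1, 2, 3, ...
is_consecutive <- function(x) {
  u <- unique(x)
  all(u == seq_along(u))
}

# Scenarios must be consecutive within each id,
# alternatives within each id/scenario pair
check_order <- function(d) {
  scen_ok <- all(tapply(d$scenario, d$id, is_consecutive))
  alt_ok  <- all(tapply(d$alt, interaction(d$id, d$scenario, drop = TRUE),
                        is_consecutive))
  scen_ok && alt_ok
}

check_order(bad)
check_order(good)

# Possible repair: relabel values as 1, 2, 3, ... in order of appearance
renumber <- function(x) match(x, unique(x))
fixed <- bad
fixed$scenario <- ave(fixed$scenario, fixed$id, FUN = renumber)
fixed$alt <- ave(fixed$alt, fixed$id, fixed$scenario, FUN = renumber)
check_order(fixed)
```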

Unit of Analysis Conversion

We are working on a social capital project so our data set has a list of an individual's organizational memberships. So each person gets a numeric ID and then a sub ID for each group they are in. The unit of analysis, therefore, is the group they are in. One of our variables is a three point scale for the type of group it is. Sounds simple enough?
We want to bring the unit of analysis to the individual level and condense the type of group it is into a variable signifying how many different types of groups they are in.
For instance, person one is in eight groups. Of those groups, three are (1s), three are (2s), and two are (3s). What the individual level variable would look like, ideally, is 3, because she is in all three types of groups.
Is this possible in the least?
##simulate data
##individuals
n <- 10
## groups
g <- 5
## group types
gt <- 3
## individuals*group membership
N <- 20
## individuals data frame
di <- data.frame(individual=sample(1:n,N,replace=TRUE),
group=sample(1:g,N, replace=TRUE))
## groups data frame
dg <- data.frame(group=1:g, type=sample(1:gt,g,replace=TRUE))
## merge
dm <- merge(di,dg)
## order - not necessary, but nice
dm <- dm[order(dm$individual),]
## group type per individual
library(plyr)
dr <- ddply(dm, "individual", function(x) length(unique(x$type)))
> head(dm)
group individual type
2 2 1 2
8 2 1 2
20 5 1 1
9 3 3 2
12 3 3 2
17 4 3 2
> head(dr)
individual V1
1 1 2
2 3 1
3 4 2
4 5 1
5 6 1
6 7 1
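The same per-individual count can also be sketched in base R without plyr. The example below uses a small hand-made data set instead of the random simulation, assuming person 1 is in eight groups of types 1, 1, 1, 2, 2, 2, 3, 3 (as in the question) and a made-up person 2 is in three groups of type 2. The names memberships, types_per_person and n_types are hypothetical.

```r
# One row per group membership: who, and what type of group
memberships <- data.frame(
  individual = c(rep(1, 8), rep(2, 3)),
  type       = c(1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2)
)

# Number of distinct group types per individual
types_per_person <- aggregate(type ~ individual, data = memberships,
                              FUN = function(x) length(unique(x)))
names(types_per_person)[2] <- "n_types"

types_per_person
```

The result has one row per individual, so it can be merged back onto an individual-level data set with merge().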
I think what you're asking is whether it is possible to count the number of unique types of group to which an individual belongs.
If so, then that is certainly possible.
I wouldn't be able to tell you how to do it in R since I don't know a lot of R, and I don't know what your data looks like. But there's no reason why it wouldn't be possible.
Is this data coming from a database? If so, then it might be easier to write a SQL query to compute the value you want, rather than to do it in R. If you describe your schema, there should be lots of people here who could give you the query you need.
