Error in levels for seqdef in R - r

I see this error every time I try to run seqdef on my data, which has already been converted to STS format using seqformat. A sample of my data frame looks like
head(df.new, 10)
user_id orderdate cart to
1 8 1 produce 30
2 8 31 produce 60
3 8 61 produce 70
4 8 71 produce 92
5 10 1 produce 30
6 10 31 produce 42
7 10 43 meat seafood 56
8 10 57 deli 77
9 17 1 beverages 3
10 17 4 beverages 8
It has a total of 14000 rows of orders, and some orders occur on the same day for a user (i.e. orderdate == to). Below is the code I used to create the STS data that is passed to seqdef.
df.form <- seqformat(df.new, id='user_id', begin='orderdate', end='to', status='cart', from='SPELL', to='STS', process=FALSE)
df.seq <- seqdef(df.form, left='DEL', right = 'unknown', xtstep=10, void = 'unknown')
The error message I get when running the seqdef is
[>] found missing values ('NA') in sequence data
[>] preparing 35000 sequences
[>] coding void elements with 'unknown' and missing values with '*'
[>] 21 distinct states appear in the data:
1 = alcohol
2 = babies
3 = bakery
4 = beverages
5 = breakfast
6 = bulk
7 = canned goods
8 = dairy eggs
9 = deli
10 = dry goods pasta
11 = frozen
12 = household
...
[>] adding special state(s) to the alphabet: unknown
Error in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
factor level [24] is duplicated
I tried removing those orders where orderdate == to and the same error still occurs. I would appreciate any help I can get to fix this problem. Thanks.

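The error occurs because you are using the same code ('unknown') for both right missings and voids, so that code ends up in the alphabet twice. Use a different symbol for one of the two, for example a separate code for void.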
When the sequences contain 'missings', these missings will be considered as a separate state when you set with.missing = TRUE in functions such as seqdist or seqdplot, while voids are used to adjust the row lengths and are simply ignored when plotting the sequences (seqplot) or computing dissimilarities (seqdist).
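The duplicated-level error itself can be reproduced outside TraMineR. The following base-R sketch only illustrates this class of error and does not show seqdef's internals: the state names are taken from the log in the question, and the `build_alphabet` helper is hypothetical.

```r
# Hypothetical helper: append the right-missing and void codes to the states
# and build a factor over the resulting alphabet.
build_alphabet <- function(states, right_code, void_code) {
  alphabet <- c(states, right_code, void_code)
  factor(alphabet, levels = alphabet)
}

states <- c("alcohol", "babies", "bakery")

# Same code for right missings and voids: the alphabet has a duplicated level,
# so factor() stops with an error, which is captured here as a message string.
clash <- tryCatch(
  build_alphabet(states, "unknown", "unknown"),
  error = function(e) conditionMessage(e)
)
print(clash)

# Distinct codes: the factor is built without error.
ok <- build_alphabet(states, "unknown", "%")
print(levels(ok))
```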

Related

Interpolation in R when there are 3 columns

I need to interpolate fuel consumption from speed and weather.
I have tried the approx function, but it only handles two variables and won't accept three or more.
Speed weather fuel
10 2 30
12 3 35
14 8 38
15 9 65
I need to find fuel for speed_new = 13 and weather = 7.
approx(x=Speed,y=Fuel,z=Weather,xout= speed_new,rule = 2)$y #need to also mention the weather

Mapping dataframe column values to a n by n matrix

I'm trying to map column values of a data.frame object (consisting of large number of bilateral trade data among 161 countries) to a 161 x 161 adjacency matrix (also of data.frame class) such that each cell represents the dyadic trade flows between any two countries.
The data looks like this
# load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
length(unique(example_data$rid))
[1] 139
length(unique(example_data$pid))
[1] 161
where rid is the reporter id and pid is the (trade) partner id; a country's rid and pid are the same. The same id in the rid column is matched with multiple rows in the pid column, each with its own TradeValue.
However, there are some problems with this data. First, because countries (usually developing countries) that did not report trade statistics have no data to be extracted, their id(s) are absent in the rid column (such as country 1). On the other hand, those country id(s) may enter into pid column through other countries' reporting (in which case, the reporters tend to be developed countries). Hence, the rid column only contains some of the country id (only 139 out of 161), while the pid column has all 161 country id.
What I'm attempting to do is to map this example_data data frame to a 161 x 161 adjacency matrix, using rid for rows and pid for columns, where each cell represents the TradeValue between two country ids. To this end, there are a couple of things I need to tackle:
Fill in the country ids that are missing from the rid column of example_data and, temporarily, set all cell values in their respective rows to 0.
After the previous step, impute those "0" cells using bilateral trade statistics reported by other countries; if the corresponding statistics are still unavailable, leave those "0" cells as they are.
For example, for a 5-country dataframe of the following form
rid pid TradeValue
2 1 50
2 3 45
2 4 7
2 5 18
3 1 24
3 2 45
3 4 88
3 5 12
5 1 27
5 2 18
5 3 12
5 4 92
The desired output should look like this
pid_1 pid_2 pid_3 pid_4 pid_5
rid_1 0 50 24 0 27
rid_2 50 0 45 7 18
rid_3 24 45 0 88 12
rid_4 0 7 88 0 92
rid_5 27 18 12 92 0
but off the top of my head I could not figure out how to do it. I would really appreciate it if someone could help me with this.
df1$rid = factor(df1$rid, levels = 1:5, labels = paste("rid",1:5,sep ="_"))
df1$pid = factor(df1$pid, levels = 1:5, labels = paste("pid",1:5,sep ="_"))
data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
# rid pid_1 pid_2 pid_3 pid_4 pid_5
#1 rid_1 0 0 0 0 0
#2 rid_2 50 0 45 7 18
#3 rid_3 24 45 0 88 12
#4 rid_4 0 0 0 0 0
#5 rid_5 27 18 12 92 0
The tricks:
use factor variables to tell R which values are possible, as well as their order.
in data.table's dcast, use fill = 0 (fill with zero where there is no data) and drop = FALSE (create entries for factor levels that aren't observed).
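The desired output in the question also fills the rows of non-reporting countries from what their partners reported. A minimal base-R sketch of that extra step, assuming trade values are symmetric and that a 0 cell means "not reported" (the names `ids`, `m` and `fill` are illustrative, and `df1` is rebuilt here from the 5-country example):

```r
# The 5-country example from the question
df1 <- data.frame(
  rid = c(2, 2, 2, 2, 3, 3, 3, 3, 5, 5, 5, 5),
  pid = c(1, 3, 4, 5, 1, 2, 4, 5, 1, 2, 3, 4),
  TradeValue = c(50, 45, 7, 18, 24, 45, 88, 12, 27, 18, 12, 92)
)

ids <- 1:5

# Start from an all-zero matrix so that absent reporters still get a row
m <- matrix(0, nrow = length(ids), ncol = length(ids),
            dimnames = list(paste0("rid_", ids), paste0("pid_", ids)))

# Place each reported value at its (rid, pid) position
m[cbind(df1$rid, df1$pid)] <- df1$TradeValue

# Impute: where a cell is still 0, take the value reported by the partner,
# i.e. the mirrored cell; cells unreported on both sides stay 0
fill <- m == 0
m[fill] <- t(m)[fill]

print(m)
```

Using a two-column index matrix avoids reshaping altogether, at the cost of assuming the ids are the integers 1 to n.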

Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

I have a three-column data frame recording bilateral trade between 161 countries. The data are in dyadic format with 19687 rows and three columns: reporter (rid), partner (pid), and their bilateral trade flow (TradeValue) in a given year. rid and pid take values from 1 to 161, and a country is assigned the same rid and pid. For any pair (rid, pid) in which rid =/= pid, TradeValue(rid, pid) = TradeValue(pid, rid).
The data (run in R) look like this:
#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
The data were sourced from the UN Comtrade database. Each rid is paired with multiple pid values to get their bilateral trade data. Not every pid has a numeric id, though, because I only assigned a rid or pid to a country if a list of relevant economic indicators was available for it; this is why there are NA values in pid even though a TradeValue exists between that country and the reporting country (rid). The same applies to reporters: if a country did not report any TradeValue with its partners, its id is absent from the rid column. (Hence the rid column begins with 2, because country 1, i.e. Afghanistan, did not report any bilateral trade data with partners.) A quick check with summary statistics confirms this:
length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
Most countries report bilateral trade data with partners, and those that don't tend to be small economies. I therefore want to preserve the complete list of 161 countries and transform this example_data data frame into a 161 x 161 adjacency matrix in which
for countries that are absent from the rid column (e.g., rid == 1), a row is created and the entire row (in the 161 x 161 matrix) is set to 0;
for countries (pid) that do not share TradeValue entries with a particular rid, those cells are set to 0.
For example, suppose that in a 5 x 5 adjacency matrix country 1 did not report any trade statistics with partners, while the other four reported their bilateral trade statistics with each other (but not with country 1). The original data frame looks like
rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82
which I want to convert to a 5 x 5 adjacency matrix (in data.frame format). The desired output should look like this
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 223 13 9
3 0 223 0 57 28
4 0 13 57 0 82
5 0 9 28 82 0
I then want to use the same method on example_data to create a 161 x 161 adjacency matrix. However, after some trial and error with reshape and other methods, I still could not manage the conversion, not even the first step.
I would really appreciate it if anyone could enlighten me on this.
I cannot read the dropbox file but have tried to work off of your 5-country example dataframe -
library(reshape2)  # provides dcast for data frames
country_num = 5
# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
pid_miss = setdiff(1:country_num, example_data$pid)
# if no pid is missing, fall back to 1 so the dummy row is still well-formed
if (length(pid_miss) == 0) pid_miss = 1
# create dummy dataframe with missing rid and pid
add_data = as.data.frame(do.call(cbind, list(rid_miss, pid_miss, NA)))
colnames(add_data) = colnames(example_data)
# add dummy dataframe to original
example_data = rbind(example_data, add_data)
# the dcast now takes missing rid and pid into account
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")
# can remove first column without setting colnames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])
# fill in upper triangular matrix with missing values of lower triangular matrix
# and vice-versa since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]
# change NAs to 0 according to preference - would keep as NA to differentiate
# from actual zeros
mat[is.na(mat)] = 0
Does this help?

Sort list on numeric values stored as factor

I have 4 data frames with data from different experiments, where each row represents a trial. The participant's id (SID) is stored as a factor. Each of the data frames looks like this:
Experiment 1:
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059
I want to make a new data frame with the id's of the participants in each of the experiments, for example:
Exp1 Exp2 Exp3 Exp4
5402 22081 22160 25434
25403 22069 22179 25439
25485 22115 22141 25408
25457 22120 22185 25445
28041 22448 22239 25473
29514 22492 22291 25489
I want each column to be ordered as numbers, that is, 2 comes before 10.
I used unique() to extract the participant id's (SID) in each data frame, but I am having problems ordering the columns.
I tried using:
data.frame(order(unique(df1$SID)),
order(unique(df2$SID)),
order(unique(df3$SID)),
order(unique(df4$SID)))
and I get (without the column names):
38 60 16 32 15
2 9 41 14 41
3 33 5 30 62
4 51 11 18 33
I'm sorry if I am missing something very basic, I am still very new to R.
Thank you for any help!
Edit:
I tried the solutions in the comments, and now I have:
x<-cbind(sort(as.numeric(unique(df1$SID)),decreasing = F),
sort(as.numeric(unique(df2$SID)),decreasing = F),
sort(as.numeric(unique(df3$SID)),decreasing = F),
sort(as.numeric(unique(df4$SID)),decreasing = F) )
Still does not work... I get:
V1 V2 V3 V4
8 6 5 2
2 9 35 11 3
3 10 37 17 184
4 13 38 91 185
5 15 39 103 186
The subject id's are 3 to 5 digit numbers...
If your data looks like this:
df <- read.table(text="
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059",
header=TRUE, colClasses = c("factor","integer","numeric"))
I would do something like this:
df <- df[order(as.numeric(as.character(df$SID)), df$trial),] # sort df on SID (numeric) & trial
split(df$SID, df$trial) # breaks the vector SID into a list of vectors of SID for each trial
If you were worried about unique values you could do:
lapply(split(df$SID, df$trial), unique) # breaks SID into list of unique SIDs for each trial
That will give you a list of participant IDs for each trial, sorted by numeric value but maintaining their factor property.
If you really wanted a data frame, and the number of participants in each experiment were equal, you could use data.frame() on the list, as in: data.frame(split(df$SID, df$trial))
Suppose x and y represent the Exp1 SID and Exp2 SID. You can create an ordered list of unique values as shown below:
x<-factor(x = c(2,5,4,3,6,1,4,5,6,3,2,3))
y<-factor(x = c(2,3,4,2,4,1,4,5,5,3,2,3))
list(exp1=sort(x = unique(x),decreasing = F),exp2=sort(x = unique(y),decreasing = F))
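Both attempts in the question go wrong for related reasons: order() returns positions rather than values, and as.numeric() applied directly to a factor returns its internal level codes. A small sketch of the difference, plus one way to line up experiments with different numbers of participants (the IDs are taken from the question's tables; the `pad` helper and the `Exp2` values chosen here are illustrative assumptions):

```r
# Illustrative participant IDs stored as a factor, as in the question
sid <- factor(c("5402", "25403", "22081", "5402"))

# as.numeric() on a factor returns the internal level codes, not the IDs
codes <- as.numeric(sid)
print(codes)

# Convert through character first to recover the actual numbers, then sort
ids <- sort(as.numeric(as.character(unique(sid))))
print(ids)

# Hypothetical helper: pad ID vectors with NA to a common length so that
# experiments with different numbers of participants fit in one data frame
pad <- function(v, n) c(v, rep(NA, n - length(v)))

exp_ids <- list(Exp1 = ids, Exp2 = c(22069, 22081))
n <- max(lengths(exp_ids))
result <- as.data.frame(lapply(exp_ids, pad, n = n))
print(result)
```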

Converting R data frame with RDS package: recruitment id error?

I am using the RDS package for respondent-driven sampling survey data. I want to convert a regular R data frame to an rds.data.frame. To do so, I have been trying to use the as.rds.data.frame function from RDS.
Here is an excerpted section of my data frame, where the first case (id=1) is the 'seed' respondent (who has no recruiter). It contains the variables: id (respondent id number), recruit.id(id number of respondent who recruited him/her), netsize (respondent's network size) and population (estimate of whole population size).
df<-data.frame(id=c(1,2,3,4,5,6,7,8,9,10),
recruit.id=c(-1,1,1,2,2,4,5,3,8,3),
netsize=c(6,6,6,5,5,4,4,3,4,6), population=rep(22,000, 10))
I then (try to) apply the relevant function:
new.df <-as.rds.data.frame(df,id=df$id,
recruiter.id=df$recruit.id,
network.size=df$netsize,
population.size=df$population,
max.coupons=2)
I get the error message:
Error in as.rds.data.frame(df, id = df$id, recruiter.id = df$recruit.id,: Invalid id
and the warning
In addition: Warning message:In if (!(id %in% names(x))) stop("Invalid id") :
the condition has length > 1 and only the first element will be used
I have tried assigning various 'recruiter id' values for seed participants, including -1,0 or their own id number but I still get the same message. I have also tried eliminating function arguments (coupon.max, population) or deleting seed respondents, but I still get the same message.
Package documentation says the function will fail if recruitment information is incomplete. As far as I can tell, this is not the case.
I am new to this, so if anyone can point me in the right direction I would be really grateful.
This seems to work:
colnames(df)[2:4] <- c("recruiter.id", "network.size.variable", "population.size")
as.rds.data.frame(df,max.coupons=2)
Passing the column names as strings, rather than the columns themselves, also gives a result, though with a warning:
as.rds.data.frame(df, id="id", recruiter.id="recruit.id",
network.size="netsize", population.size="population", max.coupons=2)
# An object of class "rds.data.frame"
#id: 1 2 3 4 5 6 7 8 9 10
#recruiter.id: -1 1 1 2 2 4 5 3 8 3
# id recruit.id netsize population
#1 1 -1 6 22
#2 2 1 6 22
#3 3 1 6 22
#4 4 2 5 22
#5 5 2 5 22
#6 6 4 4 22
#7 7 5 4 22
#8 8 3 3 22
#9 9 8 4 22
#10 10 3 6 22
# Warning message:
#In as.rds.data.frame(df, id = "id", recruiter.id = "recruit.id", :
#NAs introduced by coercion
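The check quoted in the question's warning, `id %in% names(x)`, hints at why the original call failed: the function appears to look its arguments up as column names, so a whole column such as `df$id` does not match. A base-R sketch of that check alone (it does not call the RDS package, and the small `df` is illustrative):

```r
df <- data.frame(id = 1:3, recruit.id = c(-1, 1, 1))

# A column name passes the membership test
by_name <- "id" %in% names(df)

# A vector of column values does not: none of 1, 2, 3 is a column name,
# and the result has one element per row instead of a single TRUE/FALSE
by_value <- df$id %in% names(df)

print(by_name)
print(by_value)
```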
