Treat variables as data_frame and other things - r

I have a problem in R. I have the data frame shown at the bottom. I imported it with Import Dataset > From Text as "Weevils" and then converted it via
as.data.frame(Weevils). is.data.frame(Weevils) returns [1] TRUE, which confirms it is a data frame, yet I cannot use the $ operator because the variables are all reported as "atomic vectors". I tried this instead:
pairs(x[Age_yrs]~x[Larvae_per_m²], col = x[Farmer], pch = 16)
but then this occurred:
Error in plot.xy(xy, type, ...) :
numerical color values must be >= 0, found -2
which basically means that a negative value (for the Farmer?) was found, so the colors cannot be assigned. It is supposed to look like this: https://stackoverflow.com/a/40829168/5987736 (thanks to Carles Mitjans!).
Yet what came out in my case with pairs(x[Age_yrs]~x[Larvae_per_m²], pch = 16) was a plot with negative values, so the colors cannot be assigned.
So my questions are: Why can't the variables in the Weevils data frame be treated as non-atomic vectors, i.e. why can't I use the $ operator? Why are the values negative, and what can I do to make them positive? Thanks for helping me!
   Farmer     Age_yrs Larvae_per_m²
1  Band             2          1315
2  Band             4           725
3  Band             6            90
4  Fechney          1           520
5  Fechney          3           285
6  Fechney          9            30
7  Mulholland       2           725
8  Mulholland       6            20
9  Adams            2           150
10 Adams            3           225
11 Forrester        1           455
12 Forrester        3            75
13 Bilborough       2           850
14 Bilborough       3           650
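One way to get valid plot colors from a text column such as Farmer is to convert it to a factor and use its integer codes, which start at 1. The following is only a sketch under assumptions: the data frame is rebuilt by hand from the rows above, and the third column is renamed Larvae_per_m2 (an ASCII stand-in chosen here, not the original name). It does not diagnose why the imported object behaved differently.

```r
# Rebuild the posted rows by hand; Larvae_per_m2 is an assumed ASCII column name
Weevils <- data.frame(
  Farmer = c("Band", "Band", "Band", "Fechney", "Fechney", "Fechney",
             "Mulholland", "Mulholland", "Adams", "Adams",
             "Forrester", "Forrester", "Bilborough", "Bilborough"),
  Age_yrs = c(2, 4, 6, 1, 3, 9, 2, 6, 2, 3, 1, 3, 2, 3),
  Larvae_per_m2 = c(1315, 725, 90, 520, 285, 30, 725, 20,
                    150, 225, 455, 75, 850, 650),
  stringsAsFactors = FALSE
)

# Factor codes are positive integers, one per farmer, so they are valid colors
farmer_factor <- factor(Weevils$Farmer)
farmer_col <- as.integer(farmer_factor)

# Scatter plot of larvae density against age, colored by farmer
plot(Larvae_per_m2 ~ Age_yrs, data = Weevils, col = farmer_col, pch = 16)
legend("topright", legend = levels(farmer_factor),
       col = seq_along(levels(farmer_factor)), pch = 16)
```

With a data frame built this way, Weevils$Farmer and the other columns are accessible through $.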

Related

Hierarchical clustering with specific number of data in each cluster

I'm clustering a set of words using "Hierarchical Clustering". I want each cluster to contain a certain number of words, for example 2 words, or 3 words.
I'm trying to modify existing code for this clustering.
I just set the entries at max(d) to Inf as well:
Lm[min(d),] <- sl
Lm[,min(d)] <- sl
if (length(cluster) > 2) {
  # if it is already clustered with more than 2 points,
  # don't cluster them again: set the values to Inf
  Lm[min(d), min(d)] <- Inf
  Lm[max(d), max(d)] <- Inf
  Lm[max(d),] <- Inf
  Lm[,max(d)] <- Inf
  Lm[min(d),] <- Inf
  Lm[,min(d)] <- Inf
}
However, it doesn't give me the expected results. Is this the correct approach? How can I do this type of constrained clustering in R?
Example of the results that I got:
row V1 V2
166 -194 -38
167 166 -1
……..
240 239 239
241 240 240
242 241 241
243 242 242
244 243 243
This will be tough to optimize, and it can produce arbitrarily bad results, because your size constraint goes against the principles of clustering.
Consider the one-dimensional data set -100, -1, 1, 100, and assume you want to limit the cluster size to 2 elements. Hierarchical clustering will first merge -1 and +1 because they are closest. Now they have reached the maximum size, so the only remaining option is to cluster -100 and +100, the worst possible result: this cluster spans the entire range of the data set.
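The four-point example can be checked with base R's hclust. This is just an illustrative sketch: single linkage is an arbitrary choice here, and the "cap of 2" is simulated by looking at what is left after the first merge rather than by a real constrained algorithm.

```r
x <- c(-100, -1, 1, 100)
d <- dist(x)

# Ordinary (unconstrained) hierarchical clustering
hc <- hclust(d, method = "single")

# Row 1 of the merge matrix is the first merge; negative entries are observations
first_merge <- hc$merge[1, ]
x[abs(first_merge)]   # the two values that are merged first

# With at most 2 elements per cluster, the two leftover points must be paired up
leftover <- setdiff(seq_along(x), abs(first_merge))
forced_distance <- as.matrix(d)[leftover[1], leftover[2]]

# The forced pair is as far apart as the whole data set is wide
forced_distance == diff(range(x))
```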
Just to give you an example of what I meant with partitional clustering:
library(cluster)  # provides the ruspini data set
data("ruspini")
desired_cluster_size <- 3L
# pick the number of clusters so that the average cluster size is about 3
corresponding_num_clusters <- round(nrow(ruspini) / desired_cluster_size)
km <- kmeans(ruspini, corresponding_num_clusters)
table(km$cluster)
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#  3  3  2  4  2  2  2  1  3  3  2  3  2  3  3  2  6  3  2  1  3  6  2  8  4
This definitely can't guarantee how many observations you'll have in each group,
and it's not deterministic,
but it at least gives you an approximation.
In the tabulated results you can see that many clusters (1 through 25) ended up with 2 or 3 elements.

Mapping dataframe column values to a n by n matrix

I'm trying to map column values of a data.frame object (consisting of a large number of bilateral trade records among 161 countries) to a 161 x 161 adjacency matrix (also of data.frame class) such that each cell represents the dyadic trade flow between two countries.
The data looks like this:
# load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
length(unique(example_data$rid))
[1] 139
length(unique(example_data$pid))
[1] 161
where rid is the reporter id and pid is the (trade) partner id; a given country has the same id in both columns. Each id in the rid column is matched with multiple rows in the pid column, one TradeValue per partner.
However, there are some problems with this data. First, because countries (usually developing countries) that did not report trade statistics have no data to be extracted, their ids are absent from the rid column (such as country 1). On the other hand, those country ids may enter the pid column through other countries' reporting (in which case the reporters tend to be developed countries). Hence, the rid column contains only some of the country ids (139 out of 161), while the pid column has all 161 country ids.
What I'm attempting to do is map this example_data data frame to a 161 x 161 adjacency matrix, using rid for rows and pid for columns, where each cell represents the TradeValue between two country ids. To this end, there are a couple of things I need to tackle:
1) Fill in the country ids that are missing from the rid column of example_data and, temporarily, set all cell values in their respective rows to 0.
2) After the previous step, impute those "0" cells using bilateral trade statistics reported by other countries; if the corresponding statistics are still unavailable, leave those "0" cells as they are.
For example, for a 5-country dataframe of the following form
rid pid TradeValue
2 1 50
2 3 45
2 4 7
2 5 18
3 1 24
3 2 45
3 4 88
3 5 12
5 1 27
5 2 18
5 3 12
5 4 92
The desired output should look like this
pid_1 pid_2 pid_3 pid_4 pid_5
rid_1 0 50 24 0 27
rid_2 50 0 45 7 18
rid_3 24 45 0 88 12
rid_4 0 7 88 0 92
rid_5 27 18 12 92 0
but off the top of my head I could not figure out how to do this. I would really appreciate it if someone could help me with this.
df1$rid = factor(df1$rid, levels = 1:5, labels = paste("rid",1:5,sep ="_"))
df1$pid = factor(df1$pid, levels = 1:5, labels = paste("pid",1:5,sep ="_"))
data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
# rid pid_1 pid_2 pid_3 pid_4 pid_5
#1 rid_1 0 0 0 0 0
#2 rid_2 50 0 45 7 18
#3 rid_3 24 45 0 88 12
#4 rid_4 0 0 0 0 0
#5 rid_5 27 18 12 92 0
The tricks:
1) Use factor variables to tell R which values are possible, as well as their order.
2) In data.table's dcast, use fill = 0 (fill in zero where you have nothing) and drop = FALSE (create entries for factor levels that aren't observed).
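The second step from the question (filling empty cells from the partner's report) could be sketched in base R as follows. This is a minimal sketch under assumptions, not a tested solution for the full 161-country data: it assumes the ids are the consecutive integers 1 to n so they can be used directly as matrix indices, it treats a 0 cell as "not reported", and df1 is rebuilt here from the 5-country example.

```r
# The 5-country example from the question
df1 <- data.frame(
  rid = c(2, 2, 2, 2, 3, 3, 3, 3, 5, 5, 5, 5),
  pid = c(1, 3, 4, 5, 1, 2, 4, 5, 1, 2, 3, 4),
  TradeValue = c(50, 45, 7, 18, 24, 45, 88, 12, 27, 18, 12, 92)
)
n <- 5  # number of countries (assumed known in advance)

# Step 1: an n x n matrix of zeros, filled with the reported values
m <- matrix(0, nrow = n, ncol = n,
            dimnames = list(paste0("rid_", 1:n), paste0("pid_", 1:n)))
m[cbind(df1$rid, df1$pid)] <- df1$TradeValue

# Step 2: where a cell is still 0, take the value reported by the partner,
# i.e. the mirrored cell of the transposed matrix
not_reported <- m == 0
m[not_reported] <- t(m)[not_reported]

m
```

For ids that are not consecutive, the row and column positions would need to be looked up first, for example with match() against a vector of all country ids.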

Converting R data frame with RDS package: recruitment id error?

I am using the RDS package for respondent-driven sampling survey data. I want to convert a regular R data frame to an rds.data.frame. To do so, I have been trying to use the as.rds.data.frame function from RDS.
Here is an excerpted section of my data frame, where the first case (id=1) is the 'seed' respondent (who has no recruiter). It contains the variables: id (respondent id number), recruit.id(id number of respondent who recruited him/her), netsize (respondent's network size) and population (estimate of whole population size).
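df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
                 recruit.id = c(-1,1,1,2,2,4,5,3,8,3),
                 netsize = c(6,6,6,5,5,4,4,3,4,6),
                 # rep(22,000, 10) yields ten values of 22 (the comma separates arguments); rep(22000, 10) would give 22,000
                 population = rep(22,000, 10))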
I then (try to) apply the relevant function:
new.df <-as.rds.data.frame(df,id=df$id,
recruiter.id=df$recruit.id,
network.size=df$netsize,
population.size=df$population,
max.coupons=2)
I get the error message:
Error in as.rds.data.frame(df, id = df$id, recruiter.id = df$recruit.id,: Invalid id
and the warning
In addition: Warning message:In if (!(id %in% names(x))) stop("Invalid id") :
the condition has length > 1 and only the first element will be used
I have tried assigning various 'recruiter id' values for seed participants, including -1, 0 or their own id number, but I still get the same message. I have also tried eliminating function arguments (max.coupons, population.size) or deleting seed respondents, but I still get the same message.
Package documentation says the function will fail if recruitment information is incomplete. As far as I can tell, this is not the case.
I am new to this, so if anyone can point me in the right direction I would be really grateful.
This seems to work:
colnames(df)[2:4] <- c("recruiter.id", "network.size.variable", "population.size")
as.rds.data.frame(df,max.coupons=2)
Passing the column names as strings, rather than the columns themselves, also gives a result, though with a warning:
as.rds.data.frame(df, id="id", recruiter.id="recruit.id",
network.size="netsize", population.size="population", max.coupons=2)
# An object of class "rds.data.frame"
#id: 1 2 3 4 5 6 7 8 9 10
#recruiter.id: -1 1 1 2 2 4 5 3 8 3
# id recruit.id netsize population
#1 1 -1 6 22
#2 2 1 6 22
#3 3 1 6 22
#4 4 2 5 22
#5 5 2 5 22
#6 6 4 4 22
#7 7 5 4 22
#8 8 3 3 22
#9 9 8 4 22
#10 10 3 6 22
# Warning message:
#In as.rds.data.frame(df, id = "id", recruiter.id = "recruit.id", :
#NAs introduced by coercion
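The check quoted in the warning, if (!(id %in% names(x))) stop("Invalid id"), suggests that the id argument is compared against the column names. A small base-R sketch of that comparison, using a shortened toy version of the data frame (no RDS functions are called here):

```r
# Shortened toy version of the data frame from the question
df <- data.frame(id = c(1, 2, 3), recruit.id = c(-1, 1, 1))

# A column name is found among names(df)
by_name <- "id" %in% names(df)

# The column's values are not column names, and the result has length > 1
by_value <- df$id %in% names(df)

by_name
by_value
```

This matches both messages: the condition has length greater than 1, and its first element is FALSE, so the id is rejected.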

ggplot2 plotting single data frame with 10 different levels

I am using RStudio.
I have a single data frame with three columns titled
colnames(result)
[1] "v" "v2" "Lambda"
I wish to use ggplot2 to create an overlay plot, assigning a different color to each of the 10 different values in the Lambda column.
summary(result$Lambda)
1 2 3 4 5 6 7 8 9 10
101 100 100 100 100 100 100 100 100 100
Now I created a factor for the Lambda values as follows:
result$Lambda<-factor(result$Lambda, levels=c(1,2,3,4,5,6,7,8,9,10), labels=c("1","2","3","4","5","6","7","8","9","10"))
qplot(x=v, y=v2, data=result, geom="line", fill=Lambda, main="Newton Revolved Plot", xlab="x=((L/2)*((1/t)+2*t+t^(3)))", ylab="y=(L/2)*(log(1/t)+t^(2)+(3/4)*t^(4))-(7*L/8)")
but this command does not work.
I am just asking how to plot 10 different plots, one for each Lambda value, or for resources for a novice.
Thank you in advance.
2.000000e+00 0.000000e+00 1
3 6.250000e+00 6.778426e+00 1
4 1.666667e+01 3.345069e+01 1
5 3.612500e+01 1.024319e+02 1
6 6.760000e+01 2.451953e+02 1
7 1.140833e+02 5.022291e+02 1
8 1.785714e+02 9.230270e+02 1
9 2.640625e+02 1.566085e+03 1
10 3.735556e+02 2.498901e+03 1
105 7.225000e+01 2.048637e+02 2
106 1.352000e+02 4.903906e+02 2
107 2.281667e+02 1.004458e+03 2
108 3.571429e+02 1.846054e+03 2
109 5.281250e+02 3.132171e+03 2
110 7.471111e+02 4.997803e+03 2
111 1.020100e+03 7.595947e+03 2
112 1.353091e+03 1.109760e+04 2
250 1.766205e+05 6.488994e+06 3
251 1.876500e+05 7.034992e+06 3
252 1.991295e+05 7.614744e+06 3
253 2.110680e+05 8.229615e+06 3
254 2.234745e+05 8.880996e+06 3
255 2.363580e+05 9.570303e+06 3
The rows above are some sample data: the first column is "v", the second column is "v2", and the third column is "Lambda" (not counting the leading index column).
Just to rephrase my problem: I have a single data frame with 10 levels, each with roughly 100 entries (see above for the exact counts), and I wish to use ggplot2 to plot each level in a different color.
There are two ways I can think of:
1) Find the correct ggplot2 option to distinguish each level.
2) Split this single data frame titled "result" into 10 subsets.
Thank you very much in advance.
For a line, use color= instead of fill=.
But without any data to reproduce the problem, it is hard to know if that is what you want.
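A minimal sketch of the color= suggestion with ggplot2. Since the original data is not available, the data frame is regenerated here from the formulas in the axis labels, with t = 1..10 and L = 1..10 as assumptions (this yields 10 rows per level rather than the roughly 100 in the question).

```r
library(ggplot2)

# Rebuild a small "result" data frame from the axis-label formulas (assumed t and L ranges)
t <- 1:10
result <- do.call(rbind, lapply(1:10, function(L) {
  data.frame(
    v = (L / 2) * ((1 / t) + 2 * t + t^3),
    v2 = (L / 2) * (log(1 / t) + t^2 + (3 / 4) * t^4) - (7 * L / 8),
    Lambda = L
  )
}))
result$Lambda <- factor(result$Lambda)

# Map Lambda to the line colour; each factor level gets its own colour and legend entry
p <- ggplot(result, aes(x = v, y = v2, colour = Lambda)) +
  geom_line() +
  ggtitle("Newton Revolved Plot")
print(p)
```

Adding + facet_wrap(~ Lambda) to p would instead draw one panel per Lambda value, which is closer to "10 different plots".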

R igraph: how to find the largest community?

I use fastgreedy.community to generate a community object, which contains 15 communities. But how can I extract the largest community among these 15 communities?
Community sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
1862 1708  763  974 2321 1164  649 1046    2    2    2    2    2    2    2
In this example, I want to extract the community 5 for further use.
Thanks!
Assuming that your community object is named community.object, which(membership(community.object) == x) extracts the indices of the vertices in community x. If you want the largest community, you can set x to which.max(sizes(community.object)). Finally, you can use induced.subgraph to extract that particular community into a separate graph:
x <- which.max(sizes(community.object))
subg <- induced.subgraph(graph, which(membership(community.object) == x))
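The same steps on a toy graph, as a self-contained sketch. The graph (two disjoint cliques) is made up for illustration, and the underscore-style function names cluster_fast_greedy and induced_subgraph are used here as the newer igraph spellings of the functions named above.

```r
library(igraph)

# Toy graph: two disjoint cliques with 5 and 3 vertices (made up for illustration)
g <- disjoint_union(make_full_graph(5), make_full_graph(3))

# Detect communities with the fast greedy modularity algorithm
comm <- cluster_fast_greedy(g)
sizes(comm)

# Id of the largest community, then the subgraph induced by its vertices
largest <- unname(which.max(sizes(comm)))
subg <- induced_subgraph(g, which(membership(comm) == largest))
vcount(subg)
```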
