Generating 'weight' column for network analysis - r

I'm very new to R, so forgive any omissions or errors.
I have a dataframe that contains a series of events (called 'incidents') represented by a column named 'INCIDENT_NUM'. These are strings (e.g. 201611111), and there are multiple rows per incident if multiple employees were involved in the incident. Employees are represented by their own string column ('EMPL_NO'), and an employee can appear multiple times if they're involved in multiple incidents.
So, the data I have looks like:

Incident Number  EMPL_NO
201611111        EID0012
201611111        EID0013
201611112        EID0012
201611112        EID0013
201611112        EID0011
What I am aiming to do is see which employees are connected to one another by how many incidents they're co-involved with. Looking at tutorials for network analysis, folks have data that looks like this, which is what I ultimately want:
From     To       Weight
EID0011  EID0012  2
EID0012  EID0013  1
Is there any easy process for this? My data has thousands of rows, so doing this by hand doesn't feel feasible.
Thanks in advance!!!
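One possible approach, sketched in base R on the sample data from the question: split the employees by incident, enumerate every pair of co-involved employees with combn(), and count how often each pair recurs across incidents.

```r
# Sample data as in the question (real data has thousands of rows)
df <- data.frame(
  INCIDENT_NUM = c("201611111", "201611111", "201611112", "201611112", "201611112"),
  EMPL_NO      = c("EID0012", "EID0013", "EID0012", "EID0013", "EID0011"),
  stringsAsFactors = FALSE
)

# For each incident, list every pair of co-involved employees
pairs <- do.call(rbind, lapply(split(df$EMPL_NO, df$INCIDENT_NUM), function(emps) {
  emps <- sort(unique(emps))
  if (length(emps) < 2) return(NULL)   # skip single-employee incidents
  t(combn(emps, 2))
}))

# Count how many incidents each pair shares: that count is the edge weight
edges <- aggregate(list(Weight = rep(1, nrow(pairs))),
                   by = list(From = pairs[, 1], To = pairs[, 2]),
                   FUN = sum)
edges
```

The resulting edges data frame is an edge list with a Weight column, which can be passed directly to igraph::graph_from_data_frame() for network analysis. Note that on the sample data above, EID0012 and EID0013 share two incidents, so their weight comes out as 2.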

Related

Grouping and transposing data in R

It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured, for example the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and in the example structure above).
Each metric has many years of data (while writing the code I restricted myself to just 3 years; the illustration of the structure is based on this test). The number of years captured will change over time; generally it will increase.
The number of policies will fluctuate. I've just labelled them policy 1, 2, etc. for sensitivity reasons and limited the number whilst testing the code, to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used rbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used cbind() to attach this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column and the grouped data became the row, then expand the data to get the metrics repeated for each year. A friend of mine who does coding (but has never used R) has suggested using loops might be a better way forward. Again, I am not sure of the best approach, so welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer, but these appear to be summarising tools, and I am not trying to summarise the data, rather transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks
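For what it's worth, pivot_longer()/pivot_wider() reshape a table's structure rather than summarise it, which sounds like what is wanted here. A sketch under assumptions (the real data was only shown as screenshots, so the policy, metric, and year names below are made up):

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the imported list of tibbles: one table per
# policy, a row per metric, and a column per year
policies <- list(
  `Policy 1` = data.frame(Metric = c("Permanent jobs", "Full-time jobs"),
                          `2024` = c(10, 5), `2030` = c(12, 6), `2035` = c(15, 8),
                          check.names = FALSE),
  `Policy 2` = data.frame(Metric = c("Permanent jobs", "Full-time jobs"),
                          `2024` = c(20, 9), `2030` = c(22, 11), `2035` = c(25, 13),
                          check.names = FALSE)
)

# Stack the per-policy tables, lengthen the year columns into rows, then
# widen the metrics into columns: one row per policy-year, one column per metric
tidy <- bind_rows(policies, .id = "Policy") |>
  pivot_longer(-c(Policy, Metric), names_to = "Year", values_to = "Value") |>
  pivot_wider(names_from = Metric, values_from = Value)
tidy
```

No summarising happens here; each value is simply moved to a new position. This also avoids the rbind()/cbind() header juggling, since the list of tibbles can be stacked directly with bind_rows().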

Dataset modification (large amount of data)

I am doing a research on the influence of extreme weather events on migration.
I am now building my dataset which looks like this:
dataset:
I need to create a new dataset which would give me the number of occurrences, total deaths and total affected for a period of 5 years (which would be the sum of the events that occurred during this period).
In other words, I would like to create a table that would look like this:
desired dataset:
I could do that by hand with Excel, but there is too much data (more than 200 countries/regions are included). Therefore, I need to use R.
Do you have an idea of how I could proceed?
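One way to proceed, sketched in base R under assumptions (the dataset was only shown as an image, so the column names country, year, deaths, and affected below are made up): derive a 5-year period label from the year, then aggregate within country and period.

```r
# Hypothetical stand-in for the events dataset
events <- data.frame(
  country  = c("A", "A", "A", "B"),
  year     = c(2001, 2003, 2007, 2002),
  deaths   = c(10, 5, 7, 3),
  affected = c(100, 50, 70, 30)
)

# Label each event with its 5-year period, e.g. 2001 -> "2000-2004"
start <- 2000 + 5 * ((events$year - 2000) %/% 5)
events$period <- paste(start, start + 4, sep = "-")

# Sum occurrences, deaths and affected within each country/period;
# the constant 1 in cbind() counts one occurrence per event row
summary5 <- aggregate(cbind(occurrences = 1, deaths, affected) ~ country + period,
                      data = events, FUN = sum)
summary5
```

The same aggregation can be written with dplyr's group_by() and summarise() if the tidyverse is preferred; either way, R does the per-country sums that would be error-prone by hand in Excel.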

ISSP data: calculating percentage of respondent answers on a particular item

Probably a pretty basic question, and hopefully one not repeated elsewhere. I’m looking at some ISSP survey data in R, and I made a separate data frame for respondents who answered “Government agencies” on one of the questions:
gov.child<-data[data$"V33"=="Government agencies",]
Then I used the table function to see how many total respondents answered that way in each country (C_ALPHAN is the variable name for country):
table(gov.child$C_ALPHAN)
Then I made a matrix of this table:
gov.child.matrix<-as.matrix(table(gov.child$C_ALPHAN))
So I now have a two-column matrix with just the two-letter country code (the C_ALPHAN code) and the number of people who answered “Government agencies.” But I want to know what percentage of respondents in those countries answered that way, so I need to divide this number by the total number of respondents for that country.
Is there some way (a function, maybe?) to tell R that, for each row of a new column, it should divide the number in column two by the total number of rows in the original data set that correspond to the country code in column one (i.e., the n for that country)? Or should I just manually make a vector with the n for each country, which is available on the ISSP website, and add it to the matrix? I'm loath to do that because of the possibility of making a data entry error, but maybe that's the best way.
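There is no need to type the per-country totals by hand: table() on the full data set gives the denominator, and the two tables line up by country code. A minimal sketch, assuming the column names from the question (the data values here are made up):

```r
# Hypothetical stand-in for the ISSP data frame
data <- data.frame(
  C_ALPHAN = c("US", "US", "US", "DE", "DE"),
  V33      = c("Government agencies", "Family", "Government agencies",
               "Government agencies", "Family")
)

# Numerator: respondents per country who answered "Government agencies"
gov.counts   <- table(data$C_ALPHAN[data$V33 == "Government agencies"])
# Denominator: total respondents per country
total.counts <- table(data$C_ALPHAN)

# Indexing by name aligns the countries, so division is element-wise
pct <- 100 * gov.counts / total.counts[names(gov.counts)]
pct
```

Alternatively, prop.table(table(data$C_ALPHAN, data$V33), margin = 1) returns the row-wise proportions for every answer category at once.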

Multiple random selection in R

I'm aware that similar questions have been asked before, but I haven't found an answer to exactly what I need. It seems like a simple solution I'm missing.
I have a sample of approximately 20,000 participants and would like to randomly select 2,500 from this sample to receive gift cards, and another unique 2,500 (who aren't in the first group) to receive a cash allowance. Participants shouldn't be repeated/duplicated in any way. Participants are identified by unique IDs.
I created indices for each row representing a participant (this step could be avoided, I believe).
Npool=1:dim(pool_20K)[[1]]
giftcards=sample(Npool,2500)
How do I create the cash allowance group so that it contains unique participants and does not include the ones selected for gift cards?
Afterwards, I would combine the indices with the data:
giftcards_ids=pool_20K[giftcards, ]
Any insight? I feel like I'm complicating a fairly simple problem.
Thanks in advance!!
Shuffle the entire thing and then select subsets:
shuffled.indices = sample(nrow(pool_20K))
giftcards = shuffled.indices[1:2500]
cash = shuffled.indices[2501:5000]
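To make the shuffle-and-slice answer above concrete and reproducible (the real pool_20K wasn't shown, so a made-up one-column data frame stands in for it):

```r
set.seed(42)  # for a reproducible draw
pool_20K <- data.frame(id = sprintf("P%05d", 1:20000))

# One shuffle of all row indices, then take disjoint slices
shuffled.indices <- sample(nrow(pool_20K))
giftcards <- shuffled.indices[1:2500]
cash      <- shuffled.indices[2501:5000]

# Map the index slices back to participant rows
giftcards_ids <- pool_20K[giftcards, , drop = FALSE]
cash_ids      <- pool_20K[cash, , drop = FALSE]

# The two groups are disjoint by construction
length(intersect(giftcards, cash))  # 0
```

Because both groups are slices of a single permutation, no participant can appear in both, so no explicit deduplication step is needed.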

Creating New Variables in R that relate to

I have 7 different variables in an Excel spreadsheet that I have imported into R. Each is a column of 3331 values. They are:
'Tribe' - there are 8 of them
'Month' - when the sampling was carried out
'Year' - the year when the sampling was carried out
'ID' - an identifier for each snail
'Weight' - weight of a snail in grams
'Length' - length of a snail shell in millimetres
'Width' - width of a snail shell in millimetres
This is a case where 8 different tribes have been asked to record data on a suspected endangered species of snail to see if they are getting rarer, or changing in size or weight.
This happened at different frequencies between 1993 and 1998.
I would like to know how to create new variables in the data so that if I entered names(Snails) it would list the 7 given variables plus any variables I have added.
The dataset is limited, which is why I would like to add new variables, such as the count of snails recorded in any given month.
This would rely on me using Tribe, Month, Year and ID: if the IDs were tallied per month, I could sum them to see whether the counts are changing. I have tried:
count=c(Tribe,Year,Month,ID)
count
But after doing things like that, R just returns one long vector that is 4x the size of the dataset. I would like to be able to create a new variable that is a column of length n = 3331.
Or maybe I would like to create a simpler variable so I can see whether a tribe collected in any given month. I don't know how I can do this.
I have looked at other forums and searched but, there is nothing that I can see that helps me in my case. I appreciate any help. Thanks
I'm guessing you need to organise your variables in a single structure, such as a data.frame.
See ?data.frame for the help file.
To get you started, you could do something like:
snails <- data.frame(Tribe,Year,Month,ID)
snails
# or for just the first few rows
head(snails)
Then this would have your data looking similar to your Excel file like:
Tribe Year Month ID
1 1 1 1 a
2 2 2 2 b
3 3 3 3 c
<<etc>>
Then if you do names(snails) it will list out your column names.
You could possibly avoid some of this mucking about by just importing your Excel file either directly from Excel, or saving as a csv (comma separated values) file first and then using read.csv("name_of_your_file.csv")
See http://www.statmethods.net/input/importingdata.html for some more specifics on this.
To tabulate your data, you can do things like...
table(snails$Tribe)
...to see the number of snail records collected by each tribe. Or...
table(snails$Tribe,snails$Year)
...to see the trends in each tribe by each year. The $ character will let you access the named variable (column) inside a data.frame in the same way you are currently using the free floating variables. This might seem like more work initially, but it will pay off greatly when you need to do some more involved analysis.
Take for example if you want to only analyse the weights from tribe "1", you could do:
snails$Weight[snails$Tribe==1]
# mean of these weights
mean(snails$Weight[snails$Tribe==1])
There are a lot more things I could explain but you would probably be better served by reading an excellent website like Quick-R here: http://www.statmethods.net/management/index.html to get you doing some more advanced analysis and plotting.
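On the specific request for a new column the same length as the data (the per-month count), base R's ave() does exactly that: it computes a statistic per group and repeats it on every row of the group. A minimal sketch on a toy data frame standing in for the real snail data:

```r
# Toy stand-in for the imported snail data
snails <- data.frame(
  Tribe = c(1, 1, 2, 2, 2),
  Year  = c(1993, 1993, 1993, 1994, 1994),
  Month = c("Jan", "Jan", "Jan", "Feb", "Feb"),
  ID    = c("a", "b", "c", "d", "e")
)

# Count of snails recorded in each Tribe/Year/Month group, repeated on
# every row of that group, so the new column matches nrow(snails)
snails$MonthlyCount <- ave(seq_len(nrow(snails)),
                           snails$Tribe, snails$Year, snails$Month,
                           FUN = length)
snails
```

Unlike table(), which returns a summary of a different shape, ave() keeps the result aligned with the original rows, so on the real data the new variable would have length 3331 as required, and names(snails) would then list it alongside the original columns.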
