R: Function: generate and save multiple matrices based on multiple conditions

I am a new R user and an inexperienced coder, and I have a data-handling problem. Hopefully someone can help:
I have a data.frame with 3 columns (firm, year, class) and about 50,000 rows. I want to generate and store, for every firm, a (class x year) matrix whose elements are class counts. Each matrix would be named automatically, something like firm.name, and stored so that I can use it in later computations. Ideally, I'd be able to replace the simple class counts with a function of the values in columns 4 and 5 (backward and forward citations).
I am looking at 40 firms, 30 years, and about 1500 classes (so many firm-year-class counts are zero).
I realise I can get most of what I need (for counts) by simply using table(class,year,firm) as these columns have the same length. However, I don't know how to either store or access the matrices this function generates...
Any help would be greatly appreciated!
Simon

So, your question is how to deal with a table object?
Example:
#note the assignment operator
mytable <- with(ChickWeight, table(cut(weight, c(0,100,200,Inf)), Diet, Chick))
#access the data for the first chick
mytable[,,1]
#turn the table object into a data.frame
as.data.frame(mytable)
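Applied to the firm/year/class case from the question, the same idea might look like the sketch below. The data frame patents, its values, and the fwd column (standing in for forward citations) are made up for illustration; only the column names firm, year and class come from the question.

```r
# Hypothetical toy data; the real data has about 50,000 rows
patents <- data.frame(
  firm  = c("A", "A", "A", "B", "B"),
  year  = c(2001, 2001, 2002, 2001, 2002),
  class = c("x", "y", "x", "y", "y"),
  fwd   = c(3, 0, 2, 5, 1)   # assumed citation column
)

# One 3-d table of counts: class x year x firm
counts <- with(patents, table(class, year, firm))

# Split it into a named list with one class x year matrix per firm
firm_names <- dimnames(counts)$firm
firm_mats <- lapply(firm_names, function(f) counts[, , f])
names(firm_mats) <- firm_names

# Access a single firm's matrix by name
firm_mats[["A"]]

# Sums of another column instead of counts: xtabs() adds up the
# left-hand side of the formula within each class/year/firm cell
cites <- xtabs(fwd ~ class + year + firm, data = patents)
cites[, , "B"]
```

Keeping the matrices in one named list, rather than as 40 separate objects, makes it easy to loop over firms later with lapply(). Note that counts[, , f] drops a dimension if there is only one class or one year.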

Related

How to compute questionnaire total score and subscores by summing all and a selection of columns in R?

I'm new to R and I'm having a little issue. I hope some of you can help me!
I have a data.frame including answers at a single questionnaire.
The rows indicate the participants.
The first column indicates the participant ID.
The following columns include the answers to each item of the questionnaire (item.1 up to item.20).
I need to create two new vectors:
total.score <- sum of all 20 values for each participant
subscore <- sum of some of the items
I would like to use a function, like a sum(A:T) in Excel.
Just to recap, I'm using R and not other software.
I already did it by summing each vector just with the symbol +
(data$item.1 + data$item.2 + data$item.3 etc...)
but it is a slow way to do it.
Answers range from 0 to 3 for each item, so I expect a total score ranging from 0 to 60.
Thank you in advance!!
As an example, let's use this data from a national survey with a questionnaire.
If you download the .csv file to your working directory, you can read it with:
data <- read.csv("2016-SpanishSurveyBreastfeedingKnowledge-AELAMA.csv", sep = "\t")
Item names are p01, p02, p03...
Imagine you want a subtotal of the first five questions (from p01 to p05)
You can give a name to the group:
FirstFive <- c("p01", "p02", "p03", "p04", "p05")
I think this is worthwhile because you will probably want to perform more tasks with this group (analyse it, add or delete a question from it...), and because it lets you give it a meaningful name (for instance "knowledge", "attitudes"...)
And then create the subtotal variable:
data$subtotal1 <- rowSums(data[ , FirstFive])
You can check that the new variable is the sum:
head(data[ , c(FirstFive, "subtotal1")])
(notice that FirstFive is not quoted, because it is an object outside data, but subtotal1 is quoted, because it is the name of a variable in data)
You can compute more subtotals and use them to compute a global score.
You could maybe save some keystrokes if you know that these variables are columns 20 to 24:
names(data)[20:24]
And then sum them as
rowSums(data[ , 20:24])
I think this is what you asked for, but I would avoid doing it this way, as it is easier to make mistakes, which can be hard to detect.
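For the item.1 to item.20 naming used in the question, a minimal sketch could build the column names with paste0() instead of typing them. The data frame quest and its answers are invented for illustration, and the choice of items 1, 5 and 9 for the subscore is arbitrary.

```r
# Hypothetical toy questionnaire: 3 participants, 20 items scored 0 to 3
quest <- data.frame(id = 1:3)
for (i in 1:20) {
  quest[[paste0("item.", i)]] <- c(0, 1, 3)
}

# Build the column names once and reuse them
all_items <- paste0("item.", 1:20)
sub_items <- paste0("item.", c(1, 5, 9))   # assumed subscale

# rowSums() adds across the selected columns for each participant
quest$total.score <- rowSums(quest[, all_items])
quest$subscore    <- rowSums(quest[, sub_items])

quest[, c("id", "total.score", "subscore")]
```

If some answers are missing, rowSums(..., na.rm = TRUE) ignores them, which may or may not be the right scoring rule for a given questionnaire.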

Removing data frames from a list that contains a certain value under a variable in R

I currently have a list of 27 correlation matrices with 7 variables each, for social science research.
Some correlations are "NA" due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In a particular instance, I would like to keep one of the variables conditionally: only if it contains at least some value other than "NA". Since there are 7 variables, that means keeping anything that does NOT contain 6 "NA"s plus the correlation with itself, 1. This is the tricky part, because 1 is a value, but it is meaningless to me in a correlation matrix.
I would appreciate it if anyone could enlighten me regarding the code.
I am rather new to R, and the only thought I have is to use an if statement to set the condition. I have been trying for hours, but to no avail, as this is my first real coding experience.
Thanks a lot.
Since you didn't provide sample data, I am first going to convert your matrix into a data frame, and then pretend that you want to see whether your data frame df has a variable var with at least one value other than NA or 1.
df <- as.data.frame(as.table(matrix)) should convert your matrix into a data frame
table(df$var) will show you the distribution of values in your data frame's variable (NA values are left out unless you pass useNA = "ifany"). From here you can make a judgement call on whether to keep the variable or not.
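If the goal is to drop whole matrices from the list automatically, one possible sketch is to test the off-diagonal entries of the variable's row and use Filter(). The matrices, the variable names a, b, c, and the helper has_values() are all hypothetical; this is one reading of the question, not necessarily what the asker needs.

```r
# Hypothetical 3-variable correlation matrices standing in for the real 7-variable ones
vars <- c("a", "b", "c")
full <- matrix(c(1, 0.5, 0.3,
                 0.5, 1, 0.2,
                 0.3, 0.2, 1),
               nrow = 3, dimnames = list(vars, vars))

# Same matrix, but every correlation involving "b" is missing
missing_b <- full
missing_b["b", c("a", "c")] <- NA
missing_b[c("a", "c"), "b"] <- NA

cor_list <- list(full = full, missing_b = missing_b)

# TRUE if `var` has at least one non-NA correlation with another variable;
# the diagonal (correlation with itself) is excluded from the check
has_values <- function(m, var) {
  off_diag <- m[var, colnames(m) != var]
  any(!is.na(off_diag))
}

# Keep only the matrices in which "b" carries some information
kept <- Filter(function(m) has_values(m, "b"), cor_list)
names(kept)
```

Excluding the diagonal by column name sidesteps the problem that the self-correlation of 1 is technically a non-NA value.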

Create new numeric columns from 1 string column

I'm a beginner. I have a dataset taken from here, which consists of people's profiles with different attributes; profession is one of them. There are 12 professions: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown.
I'd like to apply K-NN to that dataset, so I'd like to spread the profession column into 12 new columns, assigning 1 to the column for that person's profession and 0 to the other 11.
I tried the foreach package and for loops, unsuccessfully. I can't get foreach to work, and I don't know what to do next from the following code:
jobs <- data[,2]
jobs
for (job in jobs) {
print(job)
#No idea how to create the new columns here, based on if conditionals
}
What would be the best way to do this?
Thanks.
You can certainly solve the problem using a for loop, but may I suggest a solution that is more efficient in the long run: the reshape2 package (https://cran.r-project.org/web/packages/reshape2/).
I have the data from bank-full.csv read into R in the object bank. Next, the reshape2 package needs to be installed and loaded:
install.packages("reshape2")
library(reshape2)
The data can then be shaped into a format where observations are on rows and jobs on columns. A helper id column is first added to the data:
bank$id<-1:nrow(bank)
Then, taking the columns 2 and 18 (job and id) from the data frame bank and casting them into the aforementioned form can be done as:
tmp<-dcast(bank[,c(2, 18)], id~job, length)
That should give a new data frame tmp, where each job has its own column. Since every id is present in the data only once, the length function used in dcast to aggregate the data puts just zeros and ones in every column.
Last, these new columns can be added to the original data set:
bank<-cbind(bank[,-18], tmp[,-1])
Negative subscripts inside the square brackets delete the columns from the dataset, so this simultaneously lets you get rid of the id column.
Another, even more efficient way to do this is to use the function model.matrix:
bank2<-cbind(bank, model.matrix( ~ 0 + job, bank))
This should give you a data frame with each job as a new column. Note however that it changes the column names a bit (adds job to the beginning of the job columns).
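For comparison, the for-loop route the question started from can be sketched in a few lines of base R. The data frame people and the job_ prefix on the new columns are made up for illustration; the real data would use the job column of bank.

```r
# Hypothetical toy data with three people and two professions
people <- data.frame(job = c("student", "retired", "student"),
                     stringsAsFactors = FALSE)

# One new 0/1 column per distinct profession
for (job in unique(people$job)) {
  new_col <- paste0("job_", job)               # assumed naming scheme
  people[[new_col]] <- as.integer(people$job == job)
}

people
```

Each comparison people$job == job is vectorised over all rows, so the loop only runs once per profession (12 times for the full dataset), not once per person. Names such as blue-collar produce columns that need backticks or [[ ]] to access.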

Vector of vectors in R?

I have a large set of data and I'm trying to group different rows together. I will know how to group the rows by using an ID. In the dataset, these IDs are sequential.
For example,
So what I want to do is iterate through this set of data and then place the data contained in these rows into a vector of vectors for processing later. The data contained in these rows of identical ID are going to be compared with one another to categorize the groupings.
I would like my data structure to look something like this.
1 -> 1 -> 1
|
V
2 -> 2
So row 1 would contain only data from 1 type of ID, then the next row in the vector would be a vector of another type of ID. How would I go about doing this in R? In C++ it would just be a vector of vectors but I haven't been able to figure out how to do the same in R.
Is this even the right way to be approaching this problem? Is there a better way to do what I'm trying to do?
You would want to work with data frames rather than simple matrices. Have a look at the R-tutor documentation on data frames.
It is doable. Best!
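As a rough sketch of how this could look: the closest R analogue to a C++ vector of vectors is a list, and split() builds one from a data frame by ID. The data frame records and its columns id and value are assumptions, since the example data from the question is not shown.

```r
# Hypothetical data with sequential IDs
records <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  value = c(10, 11, 12, 20, 21)
)

# A list with one data frame per ID, named by the ID
groups <- split(records, records$id)
groups[["1"]]

# The same idea for a single column: a list of plain vectors
value_list <- split(records$value, records$id)
value_list[["2"]]

# Process each group later, for example by counting its rows
sapply(groups, nrow)
```

Unlike a matrix, the list elements do not need to have the same length, which matches the ragged layout drawn in the question.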

How to create a data.frame with 3 factors?

I hope you won't find my question too silly. I did a lot of research, but I can't figure out how to solve this really annoying issue.
I have data for 6 participants (P) in an experiment, with 50 trials (T) per participant and 10 conditions (C). I'd like to create a data frame in R to hold these data.
This data.frame should have 3 factors (P, T and C) and therefore a total of (P*T*C) rows. The difficulty for me is creating it, since I have the data for the 6 participants in 6 data.frames of 100 obs (T) by 10 variables (C).
I'd like first to create the empty dataset with these factors, and then copy in the values from the 6 data.frames according to the factors P, T and C.
Any help would be greatly appreciated; I'm a novice in R.
Thank you.
OK. First we create one big data frame for all participants:
result<-rbind(dfrforparticipant1, dfrforparticipant2,...dfrforparticipant6) #you'll have to fill out the proper names of the original data.frames
Next, we add a column for the participant ID:
numTrials<-50 #although 100 is also mentioned in your question
result$P<-as.factor(rep(1:6, each=numTrials))
Finally, we need to go from 'wide' format to 'long' format (I'm assuming your column names holding the results for each condition are called C1, C2 etc. ; I'm also assuming your original data.frames already held a column named T to denote the trial), like this (untested, since you did not provide example data):
orgcolnames<-paste("C", 1:10, sep="")
result2<-reshape(result, varying=list(orgcolnames), v.names="val", idvar=c("T","P"), timevar="C", times=seq_along(orgcolnames), direction="long")
What you want is now in result2.
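Since the code above is untested, here is a scaled-down sketch of the same steps on invented data: 2 participants, 3 trials and 2 conditions. The data frames p1 and p2 and their values are hypothetical, and the column names T, C1 and C2 follow the assumptions stated in the answer.

```r
# Hypothetical per-participant data in wide format
p1 <- data.frame(T = 1:3, C1 = c(0.1, 0.2, 0.3), C2 = c(1.1, 1.2, 1.3))
p2 <- data.frame(T = 1:3, C1 = c(0.4, 0.5, 0.6), C2 = c(1.4, 1.5, 1.6))

# Stack the participants and label each block of rows
wide <- rbind(p1, p2)
wide$P <- factor(rep(1:2, each = 3))

# Wide to long: one row per participant x trial x condition
cond_cols <- c("C1", "C2")
long <- reshape(wide,
                varying   = list(cond_cols),
                v.names   = "val",
                idvar     = c("T", "P"),
                timevar   = "C",
                times     = seq_along(cond_cols),
                direction = "long")

# Turn the condition index into a factor as well
long$C <- factor(long$C)
head(long)
```

The result has P*T*C rows (here 2*3*2 = 12), with the measured value in val and the condition number in C.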
