I'm a beginner. I have a dataset, taken from here, that consists of people's profiles with different attributes, profession being one of them. There are 12 professions: admin., blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown.
I'd like to apply K-NN to that dataset, so I'd like to distribute the profession column into 12 new columns, and attribute 1 to the corresponding profession, and 0 to all the other 11 professions that don't belong to that person.
I tried the foreach package and for loops, without success. I can't get foreach to work, and I don't know how to continue from the following code:
jobs <- data[,2]
jobs
for (job in jobs) {
print(job)
#No idea how to create the new columns here, based on if conditionals
}
What would be the best way to do this?
Thanks.
You can certainly solve the problem using a for loop, but may I suggest a solution that is more efficient in the long run: the reshape2 package (https://cran.r-project.org/web/packages/reshape2/).
I have read the data from bank-full.csv into R as the object bank. Next, the reshape2 package needs to be downloaded, installed, and loaded:
install.packages("reshape2")
library(reshape2)
The data can then be shaped into a format where observations are on rows and jobs on columns. An accessory id column is first added to the data:
bank$id<-1:nrow(bank)
Then, taking the columns 2 and 18 (job and id) from the data frame bank and casting them into the aforementioned form can be done as:
tmp<-dcast(bank[,c(2, 18)], id~job, length)
That should give a new data frame tmp, where each job has its own column. Since every id is present in the data only once, the length function that dcast uses to aggregate the data puts just zeros and ones in every column.
Last, these new columns can be added to the original data set:
bank<-cbind(bank[,-18], tmp[,-1])
Negative subscripts inside the square brackets drop the corresponding columns, so this simultaneously lets you get rid of the id column.
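For comparison, here is a minimal base R sketch of the same idea that does not need reshape2. It swaps dcast for table(), and bank_toy with its values is a made-up stand-in for the real bank data, so treat it as an illustration rather than the method above:

```r
# Toy stand-in for the bank data (hypothetical values)
bank_toy <- data.frame(
  age = c(30, 45, 52, 23),
  job = c("admin.", "student", "admin.", "retired")
)
bank_toy$id <- seq_len(nrow(bank_toy))

# Count each job per id; every id occurs once, so the counts are only 0 or 1
dummies <- as.data.frame.matrix(table(bank_toy$id, bank_toy$job))

# Attach the indicator columns and drop the helper id column
bank_toy <- cbind(bank_toy[, names(bank_toy) != "id"], dummies)
print(bank_toy)
```

Each row ends up with exactly one 1 among the job columns, which is the layout K-NN needs.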
Another, even more efficient way to do this is to use the function model.matrix:
bank2<-cbind(bank, model.matrix( ~ 0 + job, bank))
This should give you a data frame with each job as a new column. Note however that it changes the column names a bit (adds job to the beginning of the job columns).
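A small sketch of what model.matrix returns, on made-up data (the jobs data frame below is hypothetical, not the bank data). It also shows one possible way to strip the "job" prefix if you prefer the original level names:

```r
# Hypothetical job column with three distinct professions
jobs <- data.frame(job = factor(c("admin.", "student", "admin.", "retired")))

# "~ 0 + job" removes the intercept, so every level gets its own 0/1 column
onehot <- model.matrix(~ 0 + job, data = jobs)
print(colnames(onehot))   # names carry the "job" prefix

# Optional: remove the prefix that model.matrix added
colnames(onehot) <- sub("^job", "", colnames(onehot))
print(onehot)
```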
I'm just 2 days into R so I hope I can give enough Info on my problem.
I have an Excel Table on Endothelial Cell Angiogenesis with Technical Repeats on 4 different dates. (But those Dates are not in order and in different weeks)
My data looks like this (of course it's not only the 2nd of March):
I want to average the data on those 4 different days, so I can compare, e.g., the "Nb Nodes" from day 1 to day 4.
In the end I'd like to have a jitter plot containing the group, the investigated data point and the date.
I'm a medical student, so I don't really have any knowledge about this kind of stuff yet, but I'm trying to learn it. Hopefully I provided enough info!
Found the solution:
#Group by
library(dplyr)
DateGroup <- group_by(Exclude0, Exp.Date, Group)
#Summarizing the mean in every Group and Date
summarise(DateGroup, mymean = mean(`Nb meshes`))
I think the below code will work.
1. group_by the dimension you want to summarize by.
2a. across() is a helper verb that saves you from typing each column by hand. It accepts tidy-select language, so we can quickly reference all columns that contain "Nb" (a pattern that I noticed in your screenshot).
2b. The second argument of across() is the function you want to apply to each column selected by the first argument.
2c. The optional .names argument of across() gives the new columns a consistent naming convention.
Good luck on your R learning! It's a really great language and you made the right choice.
library(dplyr)
#df is your data frame
df %>% group_by(Exp.Date) %>%
summarize(across(contains("Nb"), list(mean = mean), .names = "{.fn}_{.col}"))
#if you just want a single column then do this
df %>% group_by(Exp.Date) %>%
summarize(mean_nb_nodes=mean(`Nb nodes`))
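To make this concrete, here is a hedged sketch on invented numbers. The dates, groups and values are hypothetical; only the column names Exp.Date, Group, "Nb nodes" and "Nb meshes" come from the question. It groups by both date and group, since that is what the jitter plot would need:

```r
library(dplyr)

# Hypothetical measurements: two dates, two groups, a few technical repeats
df <- data.frame(
  Exp.Date = c("02.03", "02.03", "02.03", "09.03", "09.03", "09.03"),
  Group    = c("Ctrl", "Ctrl", "Treat", "Ctrl", "Treat", "Treat"),
  nodes    = c(10, 14, 20, 8, 30, 34),
  meshes   = c(2, 4, 6, 1, 9, 11)
)
# Rename to the column names with spaces used in the question
names(df)[3:4] <- c("Nb nodes", "Nb meshes")

# One row per date/group combination, one mean per "Nb" column
means <- df %>%
  group_by(Exp.Date, Group) %>%
  summarise(across(contains("Nb"), list(mean = mean), .names = "{.fn}_{.col}"),
            .groups = "drop")
print(means)
```

The result has columns named like `mean_Nb nodes`, so they still need backticks when you refer to them.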
I'm new in R and I'm having a little issue. I hope some of you can help me!
I have a data.frame including answers at a single questionnaire.
The rows indicate the participants.
The first column indicates the participant ID.
The following columns include the answers to each item of the questionnaire (item.1 up to item.20).
I need to create two new vectors:
total.score <- sum of all 20 values for each participant
subscore <- sum of some of the items
I would like to use a function, like a sum(A:T) in Excel.
Just to recap, I'm using R and not other software.
I already did it by summing each vector just with the symbol +
(data$item.1 + data$item.2 + data$item.3 etc...)
but it is a slow way to do it.
Answers range from 0 to 3 for each item, so I expect a total score ranging from 0 to 60.
Thank you in advance!!
Let's use as an example this data from a national survey with a questionnaire.
If you download the .csv file to your working directory, you can read it with:
data <- read.csv("2016-SpanishSurveyBreastfeedingKnowledge-AELAMA.csv", sep = "\t")
Item names are p01, p02, p03...
Imagine you want a subtotal of the first five questions (from p01 to p05)
You can give a name to the group:
FirstFive <- c("p01", "p02", "p03", "p04", "p05")
I think this is worthwhile because you will probably want to perform more tasks with this group (analysis, adding or deleting a question from the group...), and because it lets you give it a meaningful name (for instance "knowledge", "attitudes"...)
And then create the subtotal variable:
data$subtotal1 <- rowSums(data[ , FirstFive])
You can check that the new variable is the sum
head(data[ , c(FirstFive, "subtotal1")])
(notice that FirstFive is not quoted, because it is an object outside data, but subtotal1 is quoted, because it is the name of a variable in data)
You can compute more subtotals and use them to compute a global score
You could maybe save some keystrokes if you know that these variables are the columns 20 to 24:
names(data)[20:24]
And then sum them as
rowSums(data[ , c(20:24)])
I think this is what you asked for, but I would avoid doing it this way, as it is easier to make mistakes, which can be hard to detect
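Applied to the layout described in the question (an ID column plus item.1 to item.20, answers from 0 to 3), a sketch could look like this. The data is randomly generated and the choice of items for the subscale is purely hypothetical:

```r
set.seed(42)

# Simulated questionnaire: 5 participants, 20 items scored 0 to 3
items <- paste0("item.", 1:20)
answers <- as.data.frame(matrix(sample(0:3, 5 * 20, replace = TRUE), nrow = 5))
names(answers) <- items
data <- cbind(ID = 1:5, answers)

# Total score: sum over all 20 items, selected by name
data$total.score <- rowSums(data[, items])

# Subscore: sum over a named subset (which items belong together is an assumption)
sub_items <- paste0("item.", c(1, 3, 5))
data$subscore <- rowSums(data[, sub_items])

print(data[, c("ID", "total.score", "subscore")])
```

Building the names with paste0 keeps the selection by name, which avoids the risks of selecting by column position.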
I am struggling to understand how to combine in R two tables when the common variables are not exactly similar.
To give the context, I have downloaded two sources of information about politicians, from Twitter and from the administration, and created two different data frames. In the first data frame (dataset 1), I have the names of the politicians present on Twitter. However, I don’t know whether these politicians are currently in office or not. To find that out, I could use the second data frame.
The second data frame (dataset 2) contains the names and other information about the politicians currently in office.
The first and last names are the only variables contained in both tables. The two tables do not have the same number of rows.
Problem:
The names in the first dataset were given as one variable (first name + last name), whereas in the second dataset the names were separated into two variables (last name and first name). I used separate to split the name column in the first table:
parliament_twitter_tempdata <- separate(parliament_twitter_tempdata, col = name, into = c("firstname", "lastname"), extra = "merge")
However I have problems with it as both datasets have:
compound first names and compound last names
first names and last names in the wrong order
I have included a picture of a part (from lastname "J" to "M") of both datasets to illustrate the difference between the similar values or the inversion of lastname, firstname.
How could I improve my code?
The names in both tables are not completely similar. Some people did not write the official name in Instagram. Is there any function which could compare the two tables, find the values that match at around 80%, and replace the name in data frame 1 (from Twitter) with the official name from data frame 2? Ex. Dataset 1: Marie Gabour; Dataset 2: Marie Gabour Jolliet -> Replace Marie Gabour in dataset 1 with Marie Gabour Jolliet
Could someone help me there? Many thanks !
[Image 1: part of dataset 1 after separate (lastname from "J" to "M")]
[Image 2: part of the names in dataset 2 (lastname from "J" to "M")]
Fuzzy matching might be a way to move forward:
https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf
Also, cleaning functions may help (e.g., using toupper or removing whitespace on the key).
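As a rough illustration of both ideas, here is a sketch that uses base R's adist() instead of fuzzyjoin, so it runs without extra packages. The names, the clean() helper and the 0.2 cutoff (roughly the "80%" from the question) are all assumptions for the example:

```r
# Hypothetical helper: collapse repeated spaces, trim, and upper-case the key
clean <- function(x) toupper(trimws(gsub("\\s+", " ", x)))

twitter  <- c("Marie Gabour", "jean  martin", "Luc Morell")
official <- c("Marie Gabour Jolliet", "Jean Martin", "Luc Morel")

# Edit distance between each Twitter name and the best-matching part of
# each official name (partial = TRUE allows substring matches)
d <- adist(clean(twitter), clean(official), partial = TRUE)

# Closest official name per Twitter name, with the distance scaled by name length
best  <- apply(d, 1, which.min)
score <- d[cbind(seq_along(best), best)] / nchar(clean(twitter))

# Accept a match only if at most about 20% of the characters differ
matched <- ifelse(score <= 0.2, official[best], NA)
print(data.frame(twitter, matched, score))
```

Swapped first and last names are not handled here; sorting the words of each name before comparing would be one way to approach that.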
I am a new R user and an inexperienced coder and I have a data handling problem. Hopefully someone can help:
I have a data.frame with 3 columns (firm, year, class) and about 50,000 rows. I want to generate and store for every firm a (class x year) matrix with class counts as the elements in the matrix. Every matrix would be automatically named something like firm.name and stored so that I can use them afterwards for computations. Ideally, I'd be able to change the simple class counts into a function of values in columns 4 and 5 (backward and forward citations).
I am looking at 40 firms, 30 years, and about 1500 classes (so many firm-year-class counts are zero).
I realise I can get most of what I need (for counts) by simply using table(class,year,firm) as these columns have the same length. However, I don't know how to either store or access the matrices this function generates...
Any help would be greatly appreciated!
Simon
So, your question is how to deal with a table object?
Example:
#note the assignment operator
mytable <- with(ChickWeight, table(cut(weight, c(0,100,200,Inf)), Diet, Chick))
#access the data for the first chick
mytable[,,1]
#turn the table object into a data.frame
as.data.frame(mytable)
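Carried over to the firm/year/class layout from the question, a sketch might look like the following. The records and the fwd column (standing in for forward citations) are invented, and using xtabs() to sum a column instead of counting rows is one possible reading of "a function of values in columns 4 and 5":

```r
# Hypothetical records: one row per firm/year/class observation
df <- data.frame(
  firm  = c("A", "A", "A", "B", "B"),
  year  = c(2001, 2001, 2002, 2001, 2002),
  class = c("c1", "c2", "c1", "c1", "c1"),
  fwd   = c(3, 0, 5, 2, 1)   # forward citations (made-up values)
)

# Three-way table of counts: class x year x firm
tab <- with(df, table(class, year, firm))

# Store one class x year matrix per firm in a list named by firm
by_firm <- sapply(dimnames(tab)$firm, function(f) tab[, , f], simplify = FALSE)
print(by_firm$A)

# Same layout, but summing forward citations instead of counting rows
cites <- xtabs(fwd ~ class + year + firm, data = df)
print(cites[, , "A"])
```

Keeping the matrices in a named list avoids creating 40 separate objects and lets you loop over the firms with lapply.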
I'm having some difficulty executing a conditional operation on two dataframes. For problem illustration, I have three variables: Price, State, and Item, which are stored in a data frame (data1) with those column names. I use ddply to generate a dataframe (data2) that includes columns State and Item, and the average price(or some other function) for that State/Item combination.
What I then want to do is fill in a column in the originating data frame (i.e., a simple prediction vector), where the column's value is the mean value for a given observation's combination of State and Item in data1. (E.g., if an observation in data1 has state="Arizona" and item="pen", I then want to retrieve the average price stored in data2 that corresponds to that state/item combination, and insert it into the column.)
Thank you for any help.
The plyr package comes with a great little function called join. You can use this to complete your task.
library(plyr)
join(data1, data2, by=c('State','Item'))
Review ?join to see the different types of joins possible. I'm pretty sure you want a left join.
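If plyr is not available, the same left join can be sketched in base R with aggregate() and merge(). This swaps in base functions for ddply and join; the prices and the MeanPrice column name are invented for the example:

```r
# Hypothetical price data
data1 <- data.frame(
  State = c("Arizona", "Arizona", "Arizona", "Texas"),
  Item  = c("pen", "pen", "ink", "pen"),
  Price = c(1, 3, 10, 2)
)

# Mean price per State/Item combination (the role ddply plays in the question)
data2 <- aggregate(Price ~ State + Item, data = data1, FUN = mean)
names(data2)[names(data2) == "Price"] <- "MeanPrice"

# Left join: keep every row of data1 and attach its group mean
result <- merge(data1, data2, by = c("State", "Item"), all.x = TRUE)
print(result)

# Shortcut that skips data2 and keeps the original row order
direct <- ave(data1$Price, data1$State, data1$Item)
```

Note that merge() may reorder the rows, whereas ave() returns the group means in the original row order.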