R101 with mutate() [duplicate] - r

This question already has answers here:
how to change gender factor into an numerical coding in r
(2 answers)
Closed 1 year ago.
Essentially I have a table with several columns; the one of interest in this case is gender:
Gender
Male
Female
I'd like to create a new column called gender_num that sets all Male to 0 and all Female to 1. I tried an if df['Gender'] == 'Male', 0, else 1 type of deal, but R doesn't like that with vectors that have more than one value. I know that you can use dplyr and the mutate function, but I'm very confused. How could I generate a new column so the df looks something like this?
Gender    Gender_num
Male      0
Female    1

From one new user to another: try adding the code that generates a sample table next time, so people can work off what you've already done. Also, you might get down-voted for lack of research, as this is a common sub-chapter in many intro texts. You can see an example chapter here.
Aside from that, let's say you have 10 observations of male/female.
library(dplyr)
df <- tibble::tibble(x = 1:10)
gendf <- df %>% mutate(gender = sample(c("male", "female"), 10, replace = TRUE))
You can then run a mutate to add your categorical numeric variable.
gendf <- gendf %>% mutate(gender_dummy = if_else(gender == "female", 1, 0))
Note: since the original character variable has only two values, if_else() is the simplest way.
But you can also use recode(), which lets you map as many values as you like.
gendf <- gendf %>% mutate(gender_dummy2 = recode(gender, "female" = 1, "male" = 0))
You should get a resulting table with both dummy variables added.
From there I would add value labels and call it a day.
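If you want those value labels in the data as well, one option (just a sketch; gender_lab is an illustrative name, not from the original answer) is to build a labelled factor from the dummy:
# Hedged example: attach labels by converting the 0/1 dummy to a factor
gendf <- gendf %>%
  mutate(gender_lab = factor(gender_dummy, levels = c(0, 1),
                             labels = c("male", "female")))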

Related

Is there an R function where I can get the names within a specific column in my dataset

Edit: using the aid from one of the users, I was able to use table(ArrestData$CHARGE), but since there are over 2400 entries, many of them are omitted from the printout. I am looking for the top 5 charges; is there code for this? Additionally, I am looking at a particular council district (another variable, titled "CITY_COUNCIL_DIST"). I want to see which are the top 5 charges given out within that specific council district. Is there code for this?
Thanks for the help!
Original post follows
Just like I can use names(MyData) to see the names of my variables, I am wondering if there is code to see the names/responses/data points within a specific column.
In other words, I am attempting to see the values in the rows of a specific column of data. I would like to see what names are cumulatively being used.
After I find this, I would like to know how many times each name is used, whether as a count or a percentage. After that, I would like to see how many times each name is used under the condition that it meets a certain value of another column/variable.
Apologies if this, in any way, is confusing.
To go further in depth, I am playing around with the Los Angeles Police Data that I got via the Office of the Mayor's website. For 2017-2018, I am attempting to see which charges, and how many of each charge, were given out in Council District 5. CHARGE and CITY_COUNCIL_DIST are the two variables I am looking at.
Any and all help will be appreciated.
To get all the distinct values, you can use the unique function, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> unique(x)
[1] 1 2 3 4 5 6
To count the occurrences of each distinct value, you can use table, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> table(x)
x
1 2 3 4 5 6
2 1 2 1 3 1
The first row gives you the distinct values and the second row the counts for each of them.
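If you only want those counts in descending order (most frequent value first), a quick base R variation on the same idea, just as a sketch:
# Same frequency table, sorted so the most common value comes first
sort(table(x), decreasing = TRUE)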
EDIT
This edit is aimed at answering your second question, following on from my previous example.
In order to look for the top five most repeated values of a variable we can use base R. To do so, I would first create a dataframe from your table of frequencies:
df <- as.data.frame(table(x))
Having this, you now just have to order the column Freq in descending order and keep the first five rows:
head(df[order(-df$Freq), ], 5)
In order to look for the top five most repeated values of a variable within a group, however, we need to go beyond base R. I would use dplyr to create an augmented dataframe with frequencies for each value of the variable of interest; let's call it count_variable:
library(dplyr)
x_or <- x %>%
  group_by(group_variable, count_variable) %>%
  summarise(freq = n())
where x is your original dataframe, group_variable is the variable defining your groups and count_variable is the variable you want to count. Now you just have to order the object so that, within each group_variable, the values of count_variable appear in descending order of frequency:
x_or %>%
  arrange(group_variable, desc(freq))
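Applied to the arrest data described in the question, a minimal sketch (assuming the data frame is called ArrestData and that district 5 is the one of interest; adjust the names and values to your data) could be:
library(dplyr)
ArrestData %>%
  filter(CITY_COUNCIL_DIST == 5) %>%  # keep only the council district of interest
  count(CHARGE, sort = TRUE) %>%      # frequency of each charge, largest first
  head(5)                             # top 5 charges in that district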

How to compute questionnaire total score and subscores by summing all and a selection of columns in R?

I'm new to R and I'm having a little issue. I hope some of you can help me!
I have a data.frame including answers at a single questionnaire.
The rows indicate the participants.
The first column indicates the participant ID.
The following columns include the answers to each item of the questionnaire (item.1 up to item.20).
I need to create two new vectors:
total.score <- sum of all 20 values for each participant
subscore <- sum of some of the items
I would like to use a function, like a sum(A:T) in Excel.
Just to recap, I'm using R and not other software.
I already did it by summing each vector with the + symbol
(data$item.1 + data$item.2 + data$item.3 etc...)
but it is a slow way to do it.
Answers range from 0 to 3 for each item, so I expect a total score ranging from 0 to 60.
Thank you in advance!!
Let's use as an example this data from a national survey with a questionnaire.
If you download the .csv file to your working directory
data <- read.csv("2016-SpanishSurveyBreastfeedingKnowledge-AELAMA.csv", sep = "\t")
Item names are p01, p02, p03...
Imagine you want a subtotal of the first five questions (from p01 to p05)
You can give a name to the group:
FirstFive <- c("p01", "p02", "p03", "p04", "p05")
I think this is worthwhile because you will probably want to perform more tasks with this group (analysis, adding or deleting a question from the group...), and because it helps you to use meaningful names (for instance "knowledge", "attitudes"...)
And then create the subtotal variable:
data$subtotal1 <- rowSums(data[ , FirstFive])
You can check that the new variable is the sum
head(data[ , c(FirstFive, "subtotal1")])
(notice that FirstFive is not quoted, because it is an object outside data, but "subtotal1" is quoted, because it is the name of a variable in data)
You can compute more subtotals and use them to compute a global score
You could maybe save some keystrokes if you know that these variables are the columns 20 to 24:
names(data)[20:24]
And then sum them as
rowSums(data[ , c(20:24)])
I think this is what you asked for, but I would avoid doing it this way, as it is easier to make mistakes, which can be hard to detect.
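Going back to the item.1 ... item.20 naming used in the question, a minimal sketch of both scores (the column names and the chosen subset are assumptions taken from the question, not from the survey file above) could be:
# Total score over all 20 items (each scored 0-3, so the total ranges 0-60)
item_cols <- paste0("item.", 1:20)
data$total.score <- rowSums(data[ , item_cols])
# Subscore over a chosen subset of items, e.g. items 1 to 5
data$subscore <- rowSums(data[ , paste0("item.", 1:5)])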

R count number of variables with value ="mq" per row [duplicate]

This question already has answers here:
How to count the frequency of a string for each row in R
(4 answers)
Closed 4 years ago.
I have a data frame with 70 variables, and I want to create a new variable that counts the number of occurrences where the 70 variables take the value "mq", on a per-row basis.
I am looking for something like this:
ID  Var1  Var2  Count_mq
1   mq    mq    2
2   1     mq    1
3   1     7     0
I have found this solution:
count_row_if("mq",DT)
But it gives me a vector with those values for the whole data frame and it is quite slow to compute.
I would like to find a solution using the function apply() but I don't know how to achieve this.
Best.
You can use the apply function to count a particular value in your existing dataframe df:
df$count.MQ <- apply(df, 1, function(x) length(which(x=="mq")))
Here the second argument is 1 since you want to count for each row. You can read more about it from https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/apply
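Since speed was a concern, a vectorized alternative worth trying (a sketch, assuming df contains only the 70 columns to be checked) compares the whole data frame at once and sums by row:
# Logical matrix of df == "mq", summed row-wise; usually faster than apply() on wide data
df$count.MQ <- rowSums(df == "mq", na.rm = TRUE)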
I assume the name of the dataset is DT. I'm a bit confused about what you really want to get, but this is how I understand it: the data frame consists of 70 columns and a number of rows, some of which contain the value 'mq'.
If I got it right, please see the code below.
DT$count.mq <- apply(DT, MARGIN = 1, function(x) sum(x == "mq"))

Include third variable in table [duplicate]

This question already has answers here:
Contingency table based on third variable (numeric)
(2 answers)
Closed 4 years ago.
I have made an edit after realising my code was insufficient to explain the problem - apologies.
I have a data frame including four columns
purchaseId <- c("abc","xyz","def","ghi")
product <- c("a","b","c","a")
quantity <- c(1,2,2,1)
revenue <- c(500,1000,300,500)
t <- data.frame(purchaseId,product, quantity, revenue)
table(t$product,t$quantity)
Running this query
table(t$product,t$quantity)
returns a table indicating how many times each combination occurs
   1  2
a  2  0
b  0  1
c  0  1
What I would like to do is plot both product and quantity as rows and columns (as shown above) but with the revenue as an actual value.
The result should look like this:
   1     2
a  1000  0
b  0     1000
c  0     300
This would allow me to create a table that I could export as a csv.
Could anyone help me any further?
edit - the code suggested below throws the following error on the actual data set of 140K rows:
Error: dims [product 21525] do not match the length of object [147805]
Other ideas?
Of course the example code above is a simplified version of the actual data I'm using, but the idea is the same.
Thank you in advance,
Kind regards.
table(t$product,t$quantity)*t$revenue
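That one-liner relies on recycling t$revenue over the cells of the table, which is what produces the "dims ... do not match the length of object" error on the full data set. A base R sketch that avoids this (not one of the original answers) is xtabs(), which sums a value over a cross-classification:
# Sum revenue within each product x quantity cell
xtabs(revenue ~ product + quantity, data = t)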
Using library(reshape2) or library(data.table)
dcast(t, product ~ quantity, value.var = "revenue", fun.aggregate = sum)
It is fairly simple syntax:
Set the data frame you are recasting.
Set the "formula" of the resulting data frame. The LHS of ~ is the row-wise pivot; the RHS is the column-wise pivot.
value.var tells dcast which column to place in the cells, and fun.aggregate says we want to aggregate those values with the sum function.
As you mentioned familiarity with Excel pivot tables in your comments, it's worth noting that dcast is a fairly comprehensive replacement, with additional flexibility.
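Since the end goal was a table that can be exported as a CSV, a short follow-up sketch (the output file name is just an example):
library(reshape2)
out <- dcast(t, product ~ quantity, value.var = "revenue", fun.aggregate = sum)
write.csv(out, "revenue_by_product_quantity.csv", row.names = FALSE)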

recording time a taxa first appears: nested loops and conditional statements in R

Here is some hypothetical data resembling my own. The environmental data describes the metadata of the community data, which is made up of taxa abundances over years in different treatments.
#Elements of Environmental (meta) data
nTrt<-2
Trt<-c("High","High","High","Low","Low","Low")
Year<-c(1,2,3,1,2,3)
EnvData<-cbind(Trt,Year)
#Elements of community data
nTaxa<-2
Taxa1<-c(0,0,2,50,3,4)
Taxa2<-c(0,34,0,0,0,23)
CommData<-cbind(Taxa1,Taxa2)
#Elements of ideal data produced
Ideal_YearIntroduced<-array(0,dim=c(nTrt,nTaxa))
Taxa1_i<-c(2,1)
Taxa2_i<-c(2,3)
IdealData<-cbind(Taxa1_i,Taxa2_i)
rownames(IdealData)<-c("High","Low")
I want to know what the Year is (in EnvData) when a given taxon first appears in a particular treatment, i.e. the "introduction year". That is, if the taxon is there in year 1, I want it to record "1" in an array of Treatment x Taxa, but if that taxon in that treatment does not arrive until year 3 (which means it meets the condition that it is absent in year 2), I want it to record year 3.
So I want these conditional statements to only loop within a treatment. In other words, I do not want it to record a taxon as being "introduced" if it is 0 in year 3 of one treatment and present in year 1 of the next.
I've approached this by doing several for loops, but the loops are getting out of hand with the conditional statements, and there is now an error that I can't figure out. I may not be thinking of the i's and j's correctly.
The data itself is more complicated than this: it has 6 years, 1102 taxa, and many treatments.
#Get the index number where each treatment starts
Index<-which(EnvData[,2]==1)
TaxaIntro <- array(0, dim = c(length(Index), ncol(CommData))) # Array to hold results (Treatment x Taxa)
for (i in 1:length(Index)) {        # Loop through treatments (start at year 1 each time)
  for (j in 1:3) {                  # Loop through years within a treatment
    for (k in 1:ncol(CommData)) {   # Loop through taxa
      if (CommData[Index[i], 1] > 0) { # If taxon is present in year 1...want to save that it was introduced at year 1
        TaxaIntro[i, k] <- EnvData[Index[i], 2]
      }
      if (CommData[Index[i + j]] > 0 && CommData[Index[((i + j) - j)]] == 0) { # Or if taxon is present in a year AND absent in the previous year
        TaxaIntro[i, k] <- EnvData[Index[i + j], 2]
      }
    }
  }
}
With this example, I get an error related to my second conditional statement...I may be going about this the wrong way.
Any help would be greatly appreciated. I am open to other (non-loop) approaches, but please explain thoroughly as I'm not so well-versed.
Current error:
Error in if (CommData[Index[i + j]] > 0 & CommData[Index[((i + j) - j)]] == :
missing value where TRUE/FALSE needed
Based on your example, I think you could combine your environmental and community data into a single data.frame. Then you might approach your problem using functions from the package dplyr.
# Make combined dataset
dat = data.frame(EnvData, CommData)
Since you want to do the work separately for each Trt, you'll want to group_by() that variable so everything is done separately by group.
Then the problem is to find the first time each one of your Taxa columns contains a value greater than 0 and record which Year that is. Because you want to do the same thing for many columns, you can use summarise_each. To get the desired summary, I used the function first to choose the first instance of Year where whatever Taxa column you are working with is greater than 0; the . refers to the Taxa column being processed. The last thing to do in summarise_each is to choose which columns to do this work on. In this case, you want all your Taxa columns, so I chose every column whose name starts_with the word Taxa.
With chaining, this looks like:
library(dplyr)
dat %>%
  group_by(Trt) %>%
  summarise_each(funs(first(Year[. > 0])), starts_with("Taxa"))
The result is slightly different than yours, but I think this is correct based on the data provided (Taxa1 in High first seen in year 3 not year 2).
Source: local data frame [2 x 3]
Trt Taxa1 Taxa2
1 High 3 2
2 Low 1 3
The above code assumes that your dataset is already in order by Year. If it isn't, you can use arrange to set the order before summarising.
If you aren't used to chaining, the following code is the equivalent to above.
groupdat = group_by(dat, Trt)
summarise_each(groupdat, funs(first(Year[. > 0])), starts_with("Taxa"))
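One note for newer dplyr versions: summarise_each() and funs() have since been deprecated, so a sketch of the same idea with across() (this assumes dplyr >= 1.0 and is not part of the original answer) would be:
library(dplyr)
dat %>%
  group_by(Trt) %>%
  summarise(across(starts_with("Taxa"), ~ first(Year[.x > 0])))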
