This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 4 years ago.
I would like to expand a sample survey to simulate a population. For example, suppose I have the following data sample (kept very small to explain my question):
control weight sex age race
1 2 F 23 W
2 3.1 M 21 B
3 5.3 F 19 W
Here, control identifies the people interviewed. For example, I would like to get a data frame where control 1 (a person who is female, 23 years old and white) is repeated 2 times (2 rows). The difficulty arises when I try to repeat control 2 3.1 times and control 3 5.3 times, preserving sex, age and race.
There is the "survey" package, but I don't know if there is some function for this situation.
How can I find a solution for this problem?
If you need to expand the rows of the dataset based on the value in the 'weight' column, one option is expandRows from splitstackshape. This is similar to df1[rep(seq_len(nrow(df1)), df1$weight), ].
library(splitstackshape)
expandRows(df1, 'weight')
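As a rough base R sketch of the same idea, the row indices can be repeated with rep(). The weights here are fractional, so some rule is needed to turn them into whole row counts. Rounding to the nearest integer is only an assumption made for this illustration and may not be the right choice for a real survey; the names n_copies and expanded are made up for the example.

```r
# Sample data from the question
df1 <- data.frame(
  control = 1:3,
  weight  = c(2, 3.1, 5.3),
  sex     = c("F", "M", "F"),
  age     = c(23, 21, 19),
  race    = c("W", "B", "W")
)

# Assumption: round each weight to the nearest whole number of copies
n_copies <- round(df1$weight)

# Repeat each row index n_copies times, then subset the data frame
expanded <- df1[rep(seq_len(nrow(df1)), times = n_copies), ]
rownames(expanded) <- NULL  # reset the repeated row names

print(expanded)
```

Without the explicit rounding, rep() coerces fractional counts to integers by truncation, so it is safer to state the rounding rule yourself.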
This question already has answers here:
Filtering a data frame by values in a column [duplicate]
(3 answers)
Closed 4 years ago.
Let's say I have the following data frame in R:
> patientData
  patientID age diabetes    status
1         1  25   Type 1      Poor
2         2  34   Type 2  Improved
3         3  28   Type 1 Excellent
4         4  52   Type 1      Poor
How can I reference a specific row or group of rows by using the specific value/level of a particular column rather than the row index? For instance, if I wanted to set a variable x to equal all of the rows which contain a patient with Type 1 diabetes or all of the rows that contain a patient in "Improved" status, how would I do that?
Try this one:
library(dplyr)
patientData %>%
  filter(diabetes == "Type 1")
Next time, please provide a minimal reproducible example.
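For comparison, here is a minimal base R sketch of the same kind of filtering, including the "Type 1 or Improved" case from the question. The data frame is rebuilt from the printed table, and the variable names type1, improved and either are made up for the illustration.

```r
# Data frame rebuilt from the table in the question
patientData <- data.frame(
  patientID = 1:4,
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type 1", "Type 2", "Type 1", "Type 1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

# Rows where diabetes is "Type 1", using logical indexing
type1 <- patientData[patientData$diabetes == "Type 1", ]

# Rows where status is "Improved", using subset()
improved <- subset(patientData, status == "Improved")

# Rows matching either condition, combined with the vectorised | operator
either <- patientData[patientData$diabetes == "Type 1" |
                        patientData$status == "Improved", ]

print(either)
```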
This question already has answers here:
Calculate the mean by group
(9 answers)
Closed 4 years ago.
I have a tidy dataframe with study data. "init_cont" and "family" represent the different conditions in this study. There are three possible options for init_cont (A, B, or C) and two possible options for family (D or E), yielding a 3x2 experimental design. In this example, there are two different questions that each participant must answer (specified in column "qnumber"). The "value" column indicates their response to the question asked.
id init_cont family qnumber value
1 A D 1 5
1 A D 2 3
2 B D 1 4
2 B D 2 2
3 C E 1 4
3 C E 2 3
4 A E 1 5
4 A E 2 2
I am trying to determine the best way (preferably within the tidyverse) to determine the average of the values for each question, separated by condition. There are 6 conditions, which come from the 6 combinations of the 3 options in init_cont combined with the 2 options in family. In this dataframe, there are only 2 questions, but the actual dataset has 14.
I know I could probably do this by making distinct dataframes for each of the 6 conditions and then breaking these down further to make distinct dataframes for each question, then finding the average values for each dataframe. There must be a better way to do this in fewer steps.
Using the tidyverse, you can compute the average value for each level of one condition, say family:
library(dplyr)
data %>%
  group_by(family) %>%
  summarize(avg_value = mean(value))
You can also average by family and a second (or further) variable. Here religion is only a placeholder column name; it does not exist in the sample data:
data %>%
  group_by(family, religion) %>%
  summarize(avg_value = mean(value))
EDIT 1: Based on feedback, here's the code to get the average value grouped by init_cont, family, and qnumber:
data %>%
  group_by(init_cont, family, qnumber) %>%
  summarize(avg_value = mean(value))
We can use aggregate from base R
aggregate(value ~ family, data, mean)
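The aggregate call above averages by family only. A sketch of how it might be extended to the grouping the question asks for (init_cont, family and qnumber together) is shown below, using the sample rows from the question. The result name avg_by_cell is made up for the illustration.

```r
# Sample data from the question
data <- data.frame(
  id        = c(1, 1, 2, 2, 3, 3, 4, 4),
  init_cont = c("A", "A", "B", "B", "C", "C", "A", "A"),
  family    = c("D", "D", "D", "D", "E", "E", "E", "E"),
  qnumber   = c(1, 2, 1, 2, 1, 2, 1, 2),
  value     = c(5, 3, 4, 2, 4, 3, 5, 2)
)

# Mean of value for every observed combination of the three grouping columns
avg_by_cell <- aggregate(value ~ init_cont + family + qnumber, data, mean)

print(avg_by_cell)
```

In this small sample every combination occurs only once, so each mean equals the single observed value. With the full dataset each cell would average over all participants in that condition.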
This question already has answers here:
How to filter a data frame
(2 answers)
Select rows from a data frame based on values in a vector
(3 answers)
Closed 5 years ago.
I have a data.frame listing locations that have been sampled over several years, with the number of species calculated at each location. Hence, I have a species count per location for each year.
However, not every location has been sampled each year.
It would look something like this:
Location Year Species
1 2007 3
1 2008 10
2 2008 4
2 2009 5
2 2010 6
3 2007 3
3 2008 10
3 2009 5
3 2010 6
I want to select only those stations that have been sampled each year, and get a data.frame showing only these locations, the relevant years and their species numbers.
In the above example that would obviously be only location 3.
I searched various sites thoroughly but could not find the answer. I guess the answer is quite simple using either aggregate or subset, and I tried various solutions, but to no avail.
Edit: the answers referred to as being duplicate do not answer my question: I want to select the various years, but only return the stations that contain all these years. Answers referred to only supply the lines with the various years, but not conditional on the stations.
Edit
The answer to my question appeared to be adding a column to my data.frame counting the frequency of the unique values in the column Location, using the following code:
transform(df, freq.loc = ave(seq(nrow(df)), Location, FUN = length))
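To go from a per-location count to the filtered data frame, one possible sketch is to count the distinct years per location with ave() and keep only the locations that cover every year in the data. It assumes "each year" means every distinct value in the Year column; the names n_years, years_per_loc and complete are made up for the illustration.

```r
# Sample data from the question
df <- data.frame(
  Location = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
  Year     = c(2007, 2008, 2008, 2009, 2010, 2007, 2008, 2009, 2010),
  Species  = c(3, 10, 4, 5, 6, 3, 10, 5, 6)
)

# Assumption: "sampled each year" means all distinct years present in the data
n_years <- length(unique(df$Year))

# For every row, the number of distinct years recorded at that row's location
years_per_loc <- ave(df$Year, df$Location,
                     FUN = function(y) length(unique(y)))

# Keep only rows from locations that were sampled in every year
complete <- df[years_per_loc == n_years, ]

print(complete)
```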
This question already has answers here:
How to do vlookup and fill down (like in Excel) in R?
(9 answers)
Closed 7 years ago.
I have a table of pending bills in the Scottish Parliament. One of the columns (BillTypeID) is populated with numbers that indicate what type of bill each one is (there are seven different types of bills).
I have another table that describes which number corresponds to which bill type (1 = "Executive", 2 = "Member's", etc.).
I want to replace the number in my main table with the corresponding string that describes the type for each bill.
Data:
bills <- jsonlite::fromJSON(url("https://data.parliament.scot/api/bills"))
bill_stages <- jsonlite::fromJSON(url("https://data.parliament.scot/api/billstages"))
This is probably a duplicate but I can't find the corresponding answer ...
The easiest way to do this is with merge().
d1 <- data.frame(billtype = c(1, 1, 3, 3),
                 bill = c("first", "second", "third", "fourth"))
d2 <- data.frame(billtype = c(1, 2, 3),
                 billtypename = c("foo", "bar", "bletch"))
d3 <- merge(d1, d2)
##
##   billtype   bill billtypename
## 1        1  first          foo
## 2        1 second          foo
## 3        3  third       bletch
## 4        3 fourth       bletch
... then drop the billtype column if you don't want it any more. You can probably do it slightly more efficiently with match() (see my answer to the linked question).
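A minimal sketch of the match() variant mentioned above, reusing the same toy data frames: match() returns, for each billtype in d1, the position of that value in d2$billtype, and those positions are used to index the name column.

```r
d1 <- data.frame(billtype = c(1, 1, 3, 3),
                 bill = c("first", "second", "third", "fourth"))
d2 <- data.frame(billtype = c(1, 2, 3),
                 billtypename = c("foo", "bar", "bletch"))

# Position of each d1$billtype within the lookup table's key column
idx <- match(d1$billtype, d2$billtype)

# Pull the matching names; unmatched keys would yield NA
d1$billtypename <- d2$billtypename[idx]

print(d1)
```

Unlike merge(), this keeps the rows of d1 in their original order and adds the column in place.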
I asked a very general version of this question a while ago. I thought I would have enough programming background to make the jump from the answer to create my function, but turns out I was wrong. This is my first time using R, and I'm having some trouble.
Given the following dataset:
Amount_Bought CustomerID
12 28
18 28
2 6
9 6
10 6
I want to create a column called "average spending" which holds the average spending of each customer based on their ID. There are about 1000 entries in the data, with a varying number of purchases per customer.
For example, for customerID 28, I would want average spending to be (12 + 18)/2 = 15
So, something like this:
Amount_Bought CustomerID Average_Spending
12            28
18            28         15
2             6
9             6
10            6          7
How would I go about doing this?
Thank you
How about:
library(plyr)
sumdat <- ddply(my_data, "CustomerID", summarise,
                avg_spending = mean(Amount_Bought))
merge(my_data,sumdat)
(There are a variety of ways to aggregate data like this in R: ave and aggregate in base R, the dplyr package, the data.table package ... There are lots of questions on SO comparing the efficiency etc. of these approaches, e.g. "Joining aggregated values back to the original data frame".)
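As one illustration of the base R route, ave() computes a group-wise statistic and returns it aligned with the original rows, so no separate merge step is needed. This sketch uses the sample rows from the question. Unlike the desired output shown there, it fills the average on every row of a customer rather than only the last one.

```r
# Sample data from the question
my_data <- data.frame(
  Amount_Bought = c(12, 18, 2, 9, 10),
  CustomerID    = c(28, 28, 6, 6, 6)
)

# Mean of Amount_Bought within each CustomerID, repeated on every row of that customer
my_data$Average_Spending <- ave(my_data$Amount_Bought,
                                my_data$CustomerID,
                                FUN = mean)

print(my_data)
```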