I have the following set of data:
Using R, and the tidyverse if possible, I would like to sum column S based on a condition on other columns. Suppose my variable is
condition_columns = c('A', 'B')
The output I am after is a data frame containing
Where the 490 is obtained by summing column S only when A=1 and the 250 comes from summing column S when B=1.
Could anyone suggest a tidyverse way of doing it?
Thank you,
Phil,
You can do this using summarize(across()), where .x stands for each column named in condition_columns:
summarize(df, across(all_of(condition_columns), ~sum(S[.x==1])))
Output:
A B
1 490 250
Input:
structure(list(ID = 1:10, A = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
B = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0), S = c(10, 20, 30, 40,
50, 60, 70, 80, 90, 100)), class = "data.frame", row.names = c(NA,
-10L))
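To make the idea behind the across() call more concrete, here is a minimal base R sketch of the same computation: for each condition column, keep the values of S where that column equals 1 and sum them. It reuses the input data shown above; the name `sums` is only illustrative and is not part of the original answer.

```r
# Same input as above, rebuilt so the sketch is self-contained
df <- data.frame(
  ID = 1:10,
  A  = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
  B  = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
  S  = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
)
condition_columns <- c("A", "B")

# For each condition column, sum S over the rows where that column is 1
sums <- sapply(df[condition_columns], function(col) sum(df$S[col == 1]))
sums
```

This returns a named numeric vector rather than a data frame, so the summarize(across()) version is still the better fit if a one-row data frame is what you need.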
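You may use the following (easy to understand) code. It works because A and B are 0/1 indicators, so multiplying by S zeroes out the rows where the condition does not hold: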
df %>%
summarise(A = sum(A*S),
B = sum(B*S))
Output:
A B
1 490 250
I'm trying to use ggplot, and am hoping to create a boxplot that has four categories on the x-axis for suspension data (low, lowish, highish, high) and farms on the y-axis.
I think I have broken the suspension column into four groups, but ggplot is upset with me. Here is the error:
```
Error in if (is.double(data$x) && !has_groups(data) && any(data$x != data$x[1L])) { : missing value where TRUE/FALSE needed
```
Here is my code:
```{r}
# To break suspension_rate_total_pct data into groups for clearer visualization, I found the min, and max
merged_data$suspension_rate_total_pct <-
as.numeric(merged_data$suspension_rate_total_pct)
max(merged_data$suspension_rate_total_pct, na.rm=TRUE)
min(merged_data$suspension_rate_total_pct, na.rm=TRUE)
low_suspension <- merged_data$suspension_rate_total_pct > 0 & merged_data$suspension_rate_total_pct < 0.5
low_ish_suspension <- merged_data$suspension_rate_total_pct > 0.5 & merged_data$suspension_rate_total_pct < 1
high_ish_suspension <- merged_data$suspension_rate_total_pct > 1 & merged_data$suspension_rate_total_pct < 1.5
high_suspension <- merged_data$suspension_rate_total_pct > 1.5 & merged_data$suspension_rate_total_pct < 2
ggplot(merged_data, aes(x = suspension_rate_total_pct , y = farms_pct)) +
geom_boxplot()
```
Here is the Data:
merged_data <- structure(list(schid = c("1030642", "1030766", "1030774", "1030840",
"1130103", "1230150"), enrollment = c(159, 333, 352, 430, 102,
193), farms = c(132, 116, 348, 406, 68, 130), foster = c(2, 0,
1, 8, 1, 4), homeless = c(14, 0, 8, 4, 1, 4), migrant = c(0,
0, 0, 0, 0, 0), ell = c(18, 12, 114, 45, 7, 4), suspension_rate_total = c(NA,
20, 0, 0, 95, 5), suspension_violent = c(NA, 9, 0, 0, 20, 2),
suspension_violent_no_injury = c(NA, 6, 0, 0, 47, 1), suspension_weapon = c(NA,
0, 0, 0, 8, 0), suspension_drug = c(NA, 0, 0, 0, 9, 1), suspension_defiance = c(NA,
1, 0, 0, 9, 1), suspension_other = c(NA, 4, 0, 0, 2, 0),
farms_pct = c(0.830188679245283, 0.348348348348348, 0.988636363636364,
0.944186046511628, 0.666666666666667, 0.673575129533679),
foster_pct = c(0.0125786163522013, 0, 0.00284090909090909,
0.0186046511627907, 0.00980392156862745, 0.0207253886010363
), migrant_pct = c(0, 0, 0, 0, 0, 0), ell_pct = c(0.113207547169811,
0.036036036036036, 0.323863636363636, 0.104651162790698,
0.0686274509803922, 0.0207253886010363), homeless_pct = c(0.0880503144654088,
0, 0.0227272727272727, 0.00930232558139535, 0.00980392156862745,
0.0207253886010363), suspension_rate_total_pct = c(NA, 2,
1, 1, 2, 2)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
If you can, please help me appease ggplot so that it will give me a beautiful visualization. Currently, this feels like a one-sided, emotional rollercoaster of a relationship.
Just a short answer; I am sure you can figure out the rest by yourself (otherwise, post a follow-up question).
Since the data you provided has NAs in the first row of several columns, I can only demonstrate the principle of how to get your desired result, using merged_data$homeless as the grouping input for the boxplots; the data (y-value) will still be farms.
# first we create our groups of low, middle & high amount of homeless
merged_data2<- merged_data %>% mutate(homelessgroup= ifelse(homeless < 4, "low",
ifelse(homeless <= 8, "middle",
ifelse(homeless > 8, "high",NA ))))
## then we plot the data using ggplot
ggplot(merged_data2,aes(y=farms,fill=homelessgroup))+geom_boxplot()
I think you can just use cut() to partition your data into 4 groups. Then you can use that variable in the plot:
merged_data <- transform(merged_data,
group = cut(
suspension_rate_total_pct,
c(0, .5, 1, 1.5, 2),
include.lowest = TRUE,
labels = c("low", "lowish", "highish", "high")))
ggplot(merged_data, aes(x = group , y = farms_pct)) +
geom_boxplot()
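As a quick check of how cut() bins the values, here is a small base R sketch using only the suspension_rate_total_pct column from the question's sample data. The names `rates` and `groups` are illustrative. Note that the NA in the first row stays NA after binning, so that row has no group.

```r
# suspension_rate_total_pct values copied from the sample data in the question
rates <- c(NA, 2, 1, 1, 2, 2)

# Same breaks and labels as in the answer above
groups <- cut(
  rates,
  breaks = c(0, 0.5, 1, 1.5, 2),
  include.lowest = TRUE,
  labels = c("low", "lowish", "highish", "high")
)

# Count how many schools land in each bin, including the NA row
table(groups, useNA = "ifany")
```

With this sample, a value of 1 falls in the (0.5, 1] interval ("lowish") and a value of 2 falls in (1.5, 2] ("high"), because the intervals are closed on the right by default.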
I would like to convert wide data to long data in R. My data set is for cross-classified models, exploring participants' responses to each target item, where the items have different characteristics.
condition is one of the two conditions to which participants were assigned.
The participants were tested twice: t1 and t2.
The item-level predictor variables, x1 and x2, are coded.
As for the response, whether each participant's response to the item was right or wrong was coded.
Two test formats were administered, test1 and test2.
Although there are many tutorials on wide-to-long conversion, I could not find one that specifically explains the conversion for cross-classified models.
I would like to use the tidyverse if possible for the sake of consistency.
My sample data is the following:
structure(list(item_name = c("x1", "x2", "participant_id", "1",
"2", "3", "4", "5", "6", "7"), participant_variable_1 = c(NA,
NA, NA, 20, 23, 21, 20, 19, 22, 30), condition = c(NA, NA, NA,
"A", "B", "A", "B", "A", "B", "A"), t1.item1.test1 = c(1, 3,
NA, 0, 1, 0, 1, 0, 0, 1), t1.item2.test1 = c(2, 2, NA, 0, 0,
0, 1, 1, 0, 1), t1.item3.test1 = c(1, 3, NA, 0, 0, 0, 1, 0, 0,
0), t1.item4.test1 = c(3, 1, NA, 1, 0, 0, 0, 1, 1, 0), t2.item1.test1 = c(1,
3, NA, 0, 1, 1, 0, 1, 1, 1), t2.item2.test1 = c(2, 2, NA, 1,
0, 1, 0, 1, 0, 1), t2.item3.test1 = c(1, 3, NA, 0, 0, 0, 1, 0,
0, 0), t2.item4.test1 = c(3, 1, NA, 1, 1, 0, 1, 1, 1, 0), t1.item1.test2 = c(1,
3, NA, 0, 1, 0, 1, 0, 0, 1), t1.item2.test2 = c(2, 2, NA, 0,
0, 0, 1, 1, 0, 1), t1.item3.test2 = c(1, 3, NA, 0, 0, 0, 1, 0,
0, 0), t1.item4.test2 = c(3, 1, NA, 1, 0, 0, 0, 1, 1, 0), t2.item1.test2 = c(1,
3, NA, 0, 1, 1, 0, 1, 1, 1), t2.item2.test2 = c(2, 2, NA, 1,
0, 1, 0, 1, 0, 1), t2.item3.test2 = c(1, 3, NA, 0, 0, 0, 1, 0,
0, 0), t2.item4.test2 = c(3, 1, NA, 1, 1, 0, 1, 1, 1, 0)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
I would like to have a long data set, which looks like the following:
Please and thank you for your guidance!
This answer requires heavy use of the new pivot_ functions in the dev version of tidyr. You can install that with devtools::install_github("tidyverse/tidyr") if you're willing to run the dev version.
First we split the data into item and participant info - you're not really getting any benefit from storing both in the same table:
item_info = dat[1:2, ]
participant_info = dat[4:nrow(dat), ] %>%
rename(participant_id = item_name)
Then it's time for a lot of pivoting:
# I have the dev version of tidyr so that is being loaded
library(tidyverse)
item_long = item_info %>%
select(-participant_variable_1, -condition) %>%
pivot_longer(
cols = t1.item1:t2.item4,
names_to = c("time", "item"),
names_pattern = "t(\\d)\\.(item\\d)"
) %>%
pivot_wider(names_from = item_name, values_from = value)
participant_long = participant_info %>%
pivot_longer(
cols = t1.item1:t2.item4,
names_to = c("time", "item"),
names_pattern = "t(\\d)\\.(item\\d)",
values_to = "response"
)
combined = participant_long %>%
left_join(item_long, by = c("item", "time"))
Result:
> combined
# A tibble: 56 x 8
participant_id participant_variable_1 condition time item response x1 x2
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 20 A 1 item1 0 1 3
2 1 20 A 1 item2 0 2 2
3 1 20 A 1 item3 0 1 3
4 1 20 A 1 item4 1 3 1
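The column names in the question's sample data have three parts (for example t1.item1.test1), while the code above selects two-part names. One possible way to handle the three-part names, assuming a tidyr version that provides pivot_longer(), is to split on the dots with names_sep. The sketch below uses a tiny made-up table (`wide`, with hypothetical response values), not the asker's data, purely to show the shape of the result.

```r
library(tidyr)

# Hypothetical toy data with the same naming scheme as the question
wide <- data.frame(
  participant_id = c("1", "2"),
  t1.item1.test1 = c(0, 1),
  t2.item1.test1 = c(0, 1),
  t1.item1.test2 = c(1, 1),
  t2.item1.test2 = c(0, 0)
)

# Split each column name at the dots into time, item, and test
long <- pivot_longer(
  wide,
  cols = -participant_id,
  names_to = c("time", "item", "test"),
  names_sep = "\\.",
  values_to = "response"
)
long
```

Here the time column holds "t1" and "t2" rather than bare digits; a names_pattern with capture groups, as in the answer above, could be used instead if only the digit is wanted.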
I am working with a data set that contains multiple observations for each prescription a patient is taking, with many different patients. Patients typically take one of several drugs, which are indicated as their own binary variables, Drug1, Drug2 and so on.
I am attempting to pull out only the individuals who have switched from one drug to the other, i.e., who have a 1 in both the Drug1 and Drug2 columns, but in different rows.
I have attempted to use newdata <- mydata[which(Drug1 == 1 & Drug2 == 1),]; however, this requires the 1s to be in the same row, which they are not.
Is there a way to select the patients that have received both drugs, but the indicator variables are in different rows?
Thank you
I believe this is a solution to what you are asking using dplyr.
data <- data.frame(id = rep(c(1, 2, 3, 4), each = 2),
drug1 = c(1, 0, 0, 0, 0, 1, 1, 1),
drug2 = c(0, 1, 1, 1, 1, 0, 0, 0)
)
library(dplyr)
data %>%
group_by(id) %>%
mutate(both_drugs = ifelse(any(drug1 == 1) & any(drug2 == 1), 1, 0)) %>%
filter(both_drugs == 1)
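The same per-patient logic can also be sketched in base R, which may help show what the grouped any() calls are doing: for each id, check whether drug1 was ever 1 and whether drug2 was ever 1, then keep the ids where both are true. The helper names (`took_drug1`, `took_drug2`, `switchers`) are illustrative only.

```r
# Same example data as in the answer above
data <- data.frame(
  id    = rep(c(1, 2, 3, 4), each = 2),
  drug1 = c(1, 0, 0, 0, 0, 1, 1, 1),
  drug2 = c(0, 1, 1, 1, 1, 0, 0, 0)
)

# Per patient: was each drug ever recorded?
took_drug1 <- tapply(data$drug1 == 1, data$id, any)
took_drug2 <- tapply(data$drug2 == 1, data$id, any)

# Patients with both drugs somewhere in their rows
switchers <- names(took_drug1)[took_drug1 & took_drug2]

# Keep all rows belonging to those patients
data[data$id %in% switchers, ]
```

In this example, patients 1 and 3 are kept, since each has a 1 in drug1 on one row and a 1 in drug2 on another.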
Try creating a variable for each drug that indicates whether or not it was the only drug taken at that time by that individual.
data <- data.frame(id = rep(c(1, 2, 3, 4), each = 3),
drug1 = c(1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0),
drug2 = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0))
library(dplyr)
data %>%
group_by(id) %>%
mutate(drug1only = ifelse(drug1==1 & drug2==0, 1, 0),
drug2only = ifelse(drug2==1 & drug1==0, 1, 0)) %>%
summarise(
drug_switch = ifelse(max(drug1only)+max(drug2only)==2,1,0))
df <- data.frame("Minute" = c(rep(27, 3), rep(28, 3)),
"ID" = c(1,2,3,1,2,3),
"dist1" = c(0, 1, 4, 2, 4, 1),
"dist2" = c(1, 0, 0, 0, 1, 0),
"dist3" = c(0, 0, 2, 1, 4, 0))
At minute 27 ID 1 has a value of 0 with dist3. At minute 27 ID 3 has a value of 4 with dist1. I want to know how to write an "if then" statement or something similar to identify if those values 1) are greater than zero or 2) match. If so, I want to replace the values in this data frame with ones. If not, they become zero.
Expected output for minute 27 only:
df2 <- data.frame("Minute" = c(rep(27, 3)),
"ID" = c(1,2,3),
"dist1" = c(0, 1, 0),
"dist2" = c(1, 0, 0),
"dist3" = c(0, 0, 0))
Another example:
df <- data.frame("Minute" = c(rep(28, 3)),
"ID" = c(1,2,3),
"dist1" = c(2, 4, 1),
"dist2" = c(2, 1, 0),
"dist3" = c(1, 4, 5))
Expected output:
df2 <- data.frame("Minute" = c(rep(28, 3)),
"ID" = c(1,2,3),
"dist1" = c(2, 1, 1),
"dist2" = c(1, 1, 0),
"dist3" = c(1, 0, 5))
Notice that ID 1 has a value greater than 0 for dist2 and ID 2 has a value greater than 0 for dist1. Those two correspond by being above 0, so I want both of them to become 1.