I have a data frame with counts of, for instance, males and females for certain groups, arranged in this way:
df <- data.frame(
  Round = c("R1", "R1", "R2", "R2"),
  N. = c(20, 10, 15, 15),
  Gender = c("M", "F", "M", "F"))
How can I create a table accounting for counts over, for instance, Round and Gender? I would like to show the distribution of gender for each round.
I have tried
table(df$Gender, df$Round)
but this is not what I need. I need to show the N. counts by group instead.
Something like this?
library(tidyr)
pivot_wider(df, names_from = Round, values_from = N.)
  Gender R1 R2
1 M      20 15
2 F      10 15
Or in base R with reshape:
reshape(df, direction = "wide", idvar = "Gender", timevar = "Round")
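If a genuine table object is preferred over a wide data frame, base R's xtabs can weight the cross-tabulation by the N. column rather than counting rows; a minimal sketch using the df above:

```r
# Recreate the example data
df <- data.frame(Round = c("R1", "R1", "R2", "R2"),
                 N. = c(20, 10, 15, 15),
                 Gender = c("M", "F", "M", "F"))

# xtabs sums the left-hand-side variable within each Gender x Round cell
xtabs(N. ~ Gender + Round, data = df)
```

Because the result is a real table, helpers like prop.table() and addmargins() can be applied to it directly.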
I have a data frame which is configured roughly like this:
df <- cbind(c('hello', 'yes', 'example'),c(7,8,5),c(0,0,0))
words   frequency count
hello   7         0
yes     8         0
example 5         0
What I'm trying to do is add values to the third column from a different data frame, which is similar but looks like this:
df2 <- cbind(c('example','hello') ,c(5,6))
words   frequency
example 5
hello   6
My goal is to find matching values for the first column in both data frames (they have the same column name) and add matching values from the second data frame to the third column of the first data frame.
The result should look like this:
df <- cbind(c('hello', 'yes', 'example'),c(7,8,5),c(6,0,5))
words   frequency count
hello   7         6
yes     8         0
example 5         5
What I've tried so far is:
df <- merge(df,df2, by = "words", all.x=TRUE)
However, it doesn't work.
I could use some help understanding how this could be done. Any help would be welcome.
This is an "update join". My favorite way to do it is in dplyr:
library(dplyr)
df %>% rows_update(rename(df2, count = frequency), by = "words")
In base R you could do the same thing like this:
names(df2)[2] <- "count2"
df <- merge(df, df2, by = "words", all.x = TRUE)
df$count <- ifelse(is.na(df$count2), df$count, df$count2)
df$count2 <- NULL
Here is an option with data.table:
library(data.table)
setDT(df)[setDT(df2), on = "words", count := i.frequency]
Output
words frequency count
<char> <num> <num>
1: hello 7 6
2: yes 8 0
3: example 5 5
Or using match in base R:
df$count[match(df2$words, df$words)] <- df2$frequency
Or another option with dplyr, using left_join and coalesce:
library(dplyr)
left_join(df, rename(df2, count.y = frequency), by = "words") %>%
  mutate(count = coalesce(count.y, count)) %>%
  select(-count.y)
Data
df <- structure(list(words = c("hello", "yes", "example"),
                     frequency = c(7, 8, 5), count = c(0, 0, 0)),
                class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(words = c("example", "hello"),
                      frequency = c(5, 6)),
                 class = "data.frame", row.names = c(NA, -2L))
I'm working in R, looking for a way to make a nice table. For each first name or last name, I want the percentage of all individuals counted at each site who have that name, with the sum of individuals with that name shown next to the percentage. My attempt below only calculates percentages based on the number of rows, rather than the actual count data. I'm not necessarily married to table1 if other functions/packages would do the trick!
Here is some example code:
set.seed(123)
dat <- data.frame(
  site = factor(sample(c("A", "B", "C", "D"), 100, replace = TRUE)),
  count = sample(1:150, 100, replace = TRUE),
  first.name = factor(sample(c("John", "Sue", "Bob", "Mary", "Cara"), 100, replace = TRUE)),
  last.name = factor(sample(c("Williams", "Smith", "Lee"), 100, replace = TRUE)))
library(table1)
tab <- table1(~ last.name + first.name | site, # only counts rows, doesn't sum the count column
              data = dat,
              render.continuous = ("Sum(Count) (PCTnoNA=100*Count/Sum(Count))"),
              digits = 1)
tab
You will have to trick table1 into giving you what you want by making a data frame that has sum(dat$count) == 6873 rows. That is not difficult:
idx <- rep(rownames(dat), dat$count)
length(idx)
# [1] 6873
dat.big <- dat[idx, ]
table1(~ last.name + first.name|site, data = dat.big)
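If table1 ends up being too rigid, the weighted counts and percentages can also be computed directly with dplyr and handed to whichever table-rendering package you like. A sketch on a small made-up frame with the same columns as dat (the values are invented for illustration):

```r
library(dplyr)

# Toy stand-in for `dat`
dat <- data.frame(site = c("A", "A", "B", "B"),
                  count = c(10, 30, 5, 15),
                  last.name = c("Smith", "Lee", "Smith", "Lee"))

dat %>%
  group_by(site, last.name) %>%
  summarise(n = sum(count), .groups = "drop_last") %>% # sum the count column, not rows
  mutate(pct = 100 * n / sum(n))                       # percentage within each site
```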
Update: with a hint from dcarlson (many thanks!), here is a gtsummary solution:
library(gtsummary)
library(dplyr)
dat.big %>%
  tbl_summary(by = site) %>%
  add_overall()
Are you looking for a solution like this:
table1(~ last.name + first.name|site, data=dat)
I have a question regarding combining columns based on two conditions.
I have two datasets from an experiment in which participants had to type in a code, answer a question about their gender, and had their eyetracking data recorded. The experiment was run twice (first: random1, second: random2).
eye <- c(1000,230,250,400)
gender <- c(1,2,1,2)
code <- c("ABC","DEF","GHI","JKL")
random1 <- data.frame(code,gender,eye)
eye2 <- c(100,250,230,450)
gender2 <- c(1,1,2,2)
code2 <- c("ABC","DEF","JKL","XYZ")
random2 <- data.frame(code2,gender2,eye2)
Now I want to combine the two data frames. For all rows where code and gender match, the rows should be combined side by side: the code and gender variables of the two matched rows should each collapse into one column (code3 and gender3), and the eyetracking data should be split into eye_first (from random1) and eye_second (from random2).
All rows for which no exact match on code and gender was found should go into a new dataset of their own.
#this is what the combined data looks like
gender3 <- c(1,2)
eye_first <- c(1000,400)
eye_second <- c(100, 230)
code3 <- c("ABC", "JKL")
random3 <- data.frame(code3,gender3,eye_first,eye_second)
#this is what the data without match should look like
gender4 <- c(2,1,2)
eye4 <- c(230,250,450)
code4 <- c("DEF","GHI","XYZ")
random4 <- data.frame(code4,gender4,eye4)
I would greatly appreciate your help! Thanks in advance.
Use the same column names in your two data frames and combine them with merge:
random1 <- data.frame(code = code, gender = gender, eye = eye)
random2 <- data.frame(code = code2, gender = gender2, eye = eye2)
df <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"))
For your second request, you can use anti_join from dplyr
df2 <- merge(random1, random2, by = c("code", "gender"),
             suffixes = c("_first", "_second"), all = TRUE) # all = TRUE keeps rows with ids found in only one of the 2 data frames
library(dplyr)
anti_join(df2, df, by = c("code", "gender"))
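For completeness, here is the whole thing run on the data from the question, with a base-R-only way to pull out the unmatched rows. Note that every unmatched row from either frame is returned, so DEF, which appears in both frames but with different gender values, contributes two rows:

```r
random1 <- data.frame(code = c("ABC", "DEF", "GHI", "JKL"),
                      gender = c(1, 2, 1, 2),
                      eye = c(1000, 230, 250, 400))
random2 <- data.frame(code = c("ABC", "DEF", "JKL", "XYZ"),
                      gender = c(1, 1, 2, 2),
                      eye = c(100, 250, 230, 450))

# Matched rows: the two eye columns end up side by side
random3 <- merge(random1, random2, by = c("code", "gender"),
                 suffixes = c("_first", "_second"))

# Unmatched rows: keep full-join rows where one side is missing
all_rows <- merge(random1, random2, by = c("code", "gender"),
                  suffixes = c("_first", "_second"), all = TRUE)
random4 <- all_rows[is.na(all_rows$eye_first) | is.na(all_rows$eye_second), ]
```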
I have a dataset with a little over 1.32 million observations. I am trying to add a "growth.factor" column that sets a specific value given the county and classification, looked up from another dataset called "cat.growth", which is 44x8.
I need to run the following code 352 times, changing the county and classification names each time, to get my desired result (44 counties, 8 classifications):
parcel.data.1$growth.factor <- ifelse(parcel.data.1$classification == "Ag" & parcel.data.1$county == "Ada", 1 + cat.growth["Ada","Ag"], parcel.data.1$growth.factor)
If I do so, it takes approximately 16.7 seconds to run, but it takes up 352 lines of code. I can achieve the same thing in 4 lines with this for loop:
for (x in parcel.data.1) {
  for (y in parcel.data.1$classification) {
    parcel.data.1$growth.factor <- ifelse(parcel.data.1$classification == y & parcel.data.1$county == x,
                                          1 + cat.growth[x, y], parcel.data.1$growth.factor)
  }
}
But when I run it, I can't even get it to complete (I gave up after 12 minutes). I've tried using all the cores on my Mac with:
library(foreach)
library(doSNOW)
c1 <- makeCluster(8, type = "SOCK")
registerDoSNOW(c1)
But that didn't help. I've looked at all the blogs and other posts regarding slow loops, but my code is only a single line so I didn't see anything that applied to making to faster in those other suggestions.
Any help getting this loop to run in less than a minute would be extremely appreciated.
As others have pointed out, you shouldn't use a loop. But your question seems to be "Why is this loop taking so long?"
The answer is that you seem to be looping over all 1.32 million elements of parcel.data.1$county and all 1.32 million elements of parcel.data.1$classification. This means that your loop is evaluating ifelse() on the order of 1320000^2 times, not 352 times.
If you are going to use a loop, then loop over the unique elements of each column, which are given by the row names and column names of cat.growth.
for (x in rownames(cat.growth)) { # loop over counties
for (y in colnames(cat.growth)) { # loop over classifications
...
}
}
This loop is equivalent to your original script with 352 lines of code, so it should have roughly the same run time of ~16 seconds.
Note that if you didn't already know the unique elements of those two vectors, then you could use unique() to find them.
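For what it's worth, the loop can be removed entirely rather than just tightened: a two-column character matrix built with cbind() can index cat.growth in one vectorized step, which should take well under a second even on 1.32 million rows. A sketch on toy data, assuming cat.growth is a matrix (wrap it in as.matrix() if it is a data frame) whose row names are counties and column names are classifications:

```r
# Toy stand-ins for cat.growth and parcel.data.1
cat.growth <- matrix(c(0.01, 0.02, 0.03, 0.04), nrow = 2,
                     dimnames = list(c("Ada", "Boise"), c("Ag", "Res")))
parcel.data.1 <- data.frame(county = c("Ada", "Boise", "Ada"),
                            classification = c("Ag", "Res", "Res"))

# Index the matrix with (row name, column name) pairs, one pair per parcel
parcel.data.1$growth.factor <-
  1 + cat.growth[cbind(parcel.data.1$county, parcel.data.1$classification)]
```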
This looks like the reason joins were created, and the dplyr package is what you want.
I don't have your data, but based on your code, I have assembled some simple fake data that looks like it may be structured like yours.
df1 <- data.frame(x = c("Ag", "Ag", "Be", "Be", "Mo", "Mo"),
                  y = c("A", "B", "A", "B", "A", "B"))
df2 <- data.frame(x = c("Ag", "Be", "Mo"),
                  A = c(1, 2, 3),
                  B = c(4, 5, 6))
library(dplyr)
library(tidyr)
df1 %>%
  inner_join(df2 %>% pivot_longer(cols = c(A, B), names_to = "y")) %>%
  mutate(value = value + 1)
Joining, by = c("x", "y")
x y value
1 Ag A 2
2 Ag B 5
3 Be A 3
4 Be B 6
5 Mo A 4
6 Mo B 7
The loop is probably not the best way of doing this. One alternative is to reshape the cat.growth data (44 x 8) into a data frame with variables for county, classification and growth factor (i.e. 352 x 3), and then merge this with the original data frame.
To illustrate what I mean (based on what I understand your data looks like):
cat.growth <- as.data.frame(matrix(nrow = 44, ncol = 8,
dimnames = list(1:44, letters[1:8]),
data = rnorm(44*8)))
parcel.data <- data.frame(county = sample(1:44, 1e06, replace = TRUE),
classification = sample(letters[1:8], 1e06, replace = TRUE))
cat.growthL = reshape(cat.growth, direction = "long",
idvar = "county",
ids = rownames(cat.growth),
varying = 1:8,
times = colnames(cat.growth),
timevar = "classification",
v.names = "growth.factor")
parcel.data2 = merge(parcel.data, cat.growthL)
This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 6 years ago.
I am trying to add a time-varying predictor to a long-form dataframe using reshape2::melt but I was wondering if there was a faster way to do it.
Here is the toy data in wide form. There are three measures of an outcome variable (session1, session2, and session3) taken at different visits/time points. The duration between these three visits is different for each participant and ultimately I would like to factor these differences into a model.
id <- 1:10
group <- rep(c("A", "B"), times = 5)
session1 <- rnorm(10, 5, 1)
session2 <- rnorm(10, 3, 1)
session3 <- rnorm(10, 7, 2)
time1 <- rep(0, 10)
time2 <- rnorm(10, 24, 0.5)
time3 <- rnorm(10, 48, 0.5)
df <- data.frame(id, group, session1, session2, session3, time1, time2, time3)
Now I want to convert it into a long-form dataframe using reshape2::melt. I can melt either around the scores, like so:
library(reshape2)
dfLong <- melt(df, measure.vars = c("session1", "session2", "session3"), var = "session", value.name = "score")
Or I can create it around the time values.
dfLong2 <- melt(df, measure.vars = c("time1", "time2", "time3"), var = "time", value.name = "timeOut")
But I can't do both without doing the melt twice and performing some sort of operation like this:
dfLong$timeOut <- dfLong2$timeOut
Ultimately I would like the dataframe to look something like this
dfLong$time <- rep(c("time1", "time2", "time3"), each = 10)
dfLong <- dfLong[,which(names(dfLong) %in% c("id", "group", "time", "session", "score", "timeOut"))]
dfLong
Is there any way to melt two sets of columns at once?
We can use data.table
library(data.table)
res = melt(setDT(df), measure = patterns("^session", "^time"),
           value.name = c("session", "time"))
You can setDF(res) to revert to a data.frame if you don't want to learn how to work with data.tables right now.
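The tidyr equivalent, in case you prefer to stay in that ecosystem, is pivot_longer with the special ".value" sentinel, which splits a name like session1 into a value column (session) and an id column (here called visit). A sketch on a small frame shaped like the df above:

```r
library(tidyr)

df <- data.frame(id = 1:2, group = c("A", "B"),
                 session1 = c(5, 6), session2 = c(3, 4), session3 = c(7, 8),
                 time1 = c(0, 0), time2 = c(24, 25), time3 = c(48, 47))

# Each row becomes three rows, one per visit, with session and time side by side
dfLong <- pivot_longer(df, cols = -c(id, group),
                       names_to = c(".value", "visit"),
                       names_pattern = "([a-z]+)([0-9])")
```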