I have a column with a lot of data of the form "Male25" indicating sex and age. I just want to separate the column into two, one with the sex and the other one with the age. What is the best way to do that in R?
We can use separate
library(tidyverse)
tibble(value = "Male25") %>%
separate(value, into = c("sex", "age"), sep = "(?<=[a-z])(?=[0-9])", convert = TRUE)
# A tibble: 1 x 2
# sex age
#* <chr> <int>
#1 Male 25
You can try either of these methods:
d <- data.frame(a=c("male25","female24","male36","female20"))
cbind(a1=gsub("\\d","",d$a),a2=gsub("\\D","",d$a))
c <- data.frame(a1=gsub("\\d","",d$a),a2=as.numeric(gsub("\\D","",d$a)))
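Another option, sketched here on the same sample values, is to capture both parts with a single regular expression using tidyr::extract. The column names sex and age are illustrative choices, and the pattern assumes every value is a run of non-digits followed by a run of digits.

```r
library(tidyr)

# Same sample values as above, kept as character strings.
d <- data.frame(a = c("male25", "female24", "male36", "female20"),
                stringsAsFactors = FALSE)

# One capture group per output column; convert = TRUE turns age into an integer.
# The function is namespaced because magrittr also exports an extract().
res <- tidyr::extract(d, a, into = c("sex", "age"),
                      regex = "^(\\D+)(\\d+)$", convert = TRUE)
res
```

Values that do not match the pattern come back as NA in both columns, which makes malformed entries easy to spot.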
I am using the tidycensus package to pull some census variables. I am making a list of desired variables with set variable names (dummy data below). I also want to create a codebook where, ideally, I'd use the list of variable names to pull the rest of the information from the variable list that you can access with load_variables. I'm not sure how to do that join, or how to pull out that information, using just a character list. Any suggestions?
library("tidycensus")
library("dplyr")
decvarlist <- load_variables(2000, "sf1")
desiredvars = c(var1 = "H001001",
var2 = "H002002",
var3 = "H002003"
)
#this bit doesnt work, but is sort of how I'm thinking of it
codebook <- left_join(desiredvars, decvarlist, by = ())
Perhaps we need to filter
library(dplyr)
decvarlist %>%
filter(name %in% desiredvars) %>%
mutate(id = names(desiredvars), .before = 1)
Output:
# A tibble: 3 × 4
id name label concept
<chr> <chr> <chr> <chr>
1 var1 H001001 Total HOUSING UNITS [1]
2 var2 H002002 Total!!Urban URBAN AND RURAL [6]
3 var3 H002003 Total!!Urban!!Inside urbanized areas URBAN AND RURAL [6]
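Note that filter returns rows in the order of decvarlist, so assigning names(desiredvars) by position only lines up when desiredvars is in that same order. A join avoids relying on order. The sketch below uses a small stand-in for decvarlist, built from the three rows shown in the output above, so that it runs without calling the Census API; the shuffled order of desiredvars is deliberate.

```r
library(dplyr)
library(tibble)

# Stand-in for load_variables(2000, "sf1"), limited to the rows shown above.
decvarlist <- tibble(
  name    = c("H001001", "H002002", "H002003"),
  label   = c("Total", "Total!!Urban", "Total!!Urban!!Inside urbanized areas"),
  concept = c("HOUSING UNITS [1]", "URBAN AND RURAL [6]", "URBAN AND RURAL [6]")
)

# Deliberately listed in a different order than decvarlist.
desiredvars <- c(var1 = "H002003", var2 = "H001001", var3 = "H002002")

# enframe() turns the named vector into a two-column tibble (id, name),
# which can then be joined to the variable list by the variable code.
codebook <- enframe(desiredvars, name = "id", value = "name") %>%
  left_join(decvarlist, by = "name")
codebook
```

Because left_join keeps the rows of the left table in order, the codebook follows the order of desiredvars, and any code missing from decvarlist shows up with NA in label and concept.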
Actually this is linked to my previous question: Replace values across time series columns based on another column
However I need to modify values across a time series data set but based on a condition from the same row but across another set of time series columns. The dataset looks like this:
#there are many more years (yrs) in the data set
product<-c("01","02")
yr1<-c("1","7")
yr2<-c("3","4")
#these follow the number of years
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
#this is a reference column to pull values from in case the type value is "mixed"
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
Where the value 1 should be replaced by "1+5GBP" and 4 should be "7+3GBP". I am thinking of something like the below -- could anyone please help?
df %>%
mutate(across(c(starts_with('yr'),starts_with('type'), ~ifelse(type.x=="mixed", mixed.rate.x, .x)))
The final result should be:
product<-c("01","02")
yr1<-c("1+5GBP","7")
yr2<-c("3","7+3GBP")
type.yr1<-c("mixed","number")
type.yr2<-c("number","mixed")
mixed.rate<-c("1+5GBP","7+3GBP")
df<-data.frame(product,yr1,yr2,type.yr1,type.yr2,mixed.rate)
If I understand you correctly, I think you might benefit from pivoting longer, replacing the values in a single if_else, and swinging back to wide.
df %>%
pivot_longer(cols = -c(product,mixed.rate), names_to=c(".value", "year"), names_pattern = "(\\D+)(\\d+)") %>%
mutate(yr=if_else(type.yr=="mixed",mixed.rate,yr)) %>%
pivot_wider(names_from=year, values_from=c(yr,type.yr),names_sep = "")
Output:
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
You can use pivot_longer to put all the yr values in one column and the type.yr values in another. Then recode 1 to 1+5GBP and 4 to 7+3GBP where the type.yr column is mixed, and pivot back with pivot_wider.
df %>%
pivot_longer(contains('yr'), names_to = c('.value','grp'),
names_pattern = '(\\D+)(\\d+)') %>%
mutate(yr = ifelse(type.yr == 'mixed', recode(yr, '1' = '1+5GBP', '4' = '7+3GBP'), yr)) %>%
pivot_wider(id_cols = c(product, mixed.rate), names_from = grp,
values_from = c(yr, type.yr), names_sep = '')
# A tibble: 2 x 6
product mixed.rate yr1 yr2 type.yr1 type.yr2
<chr> <chr> <chr> <chr> <chr> <chr>
1 01 1+5GBP 1+5GBP 3 mixed number
2 02 7+3GBP 7 7+3GBP number mixed
If you're happy to use base R instead of dplyr then the following will produce your required output:
for (i in 1:2) {
df[,paste0('yr',i)] <- ifelse(df[,paste0('type.yr',i)]=='mixed',df[,'mixed.rate'],df[,paste0('yr',i)])
}
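Since the real data has many more years, the loop can also discover the year columns by name instead of hard-coding 1:2. This is a minimal sketch that assumes every yrN column has a matching type.yrN column, as in the example data.

```r
# Example data from the question, kept as character columns.
df <- data.frame(
  product    = c("01", "02"),
  yr1        = c("1", "7"),
  yr2        = c("3", "4"),
  type.yr1   = c("mixed", "number"),
  type.yr2   = c("number", "mixed"),
  mixed.rate = c("1+5GBP", "7+3GBP"),
  stringsAsFactors = FALSE
)

# Find every column named yr<digits>, however many years there are.
yr_cols <- grep("^yr\\d+$", names(df), value = TRUE)

for (col in yr_cols) {
  type_col <- paste0("type.", col)        # matching type column, e.g. type.yr1
  is_mixed <- df[[type_col]] == "mixed"   # rows whose value should be replaced
  df[[col]][is_mixed] <- df$mixed.rate[is_mixed]
}
df
```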
I'm trying to generate some descriptive summary tables in R using the CreateTableOne function. I have a series of variables that all have the same response options/levels (Yes or No), and want to generate a wide table where the levels are column headings, like this:
Variable  Yes  No
Var1      1    7
Var2      5    2
But CreateTableOne generates nested long tables, with one column for Level where Yes and No are values in rows, like this:
Variable  Level  Value
Var1      Yes    1
Var1      No     7
Is there a way to pivot the table to get what I want while still using this function, or is there a different function I should be using instead?
Here is my current code:
vars <- c('var1', 'var2')
Table <- CreateTableOne(vars=vars, data=dataframe, factorVars=vars)
Table_exp <- print(Table, showAllLevels = T, varLabels = T, format="f", test=FALSE, noSpaces = TRUE, printToggle = FALSE)
write.csv(Table_exp, file = "Table.csv")
Thanks!
You could use pivot_wider on its own to make that table. Here is your data:
library(tidyverse)
dataframe = data.frame(Variable = c("Var1", "Var1", "Var2", "Var2"),
Level = c("Yes", "No", "Yes", "No"),
Value = c(1, 7, 5, 2))
Your data:
Variable Level Value
1 Var1 Yes 1
2 Var1 No 7
3 Var2 Yes 5
4 Var2 No 2
You can use this code to make the wider table:
dataframe %>%
pivot_wider(names_from = "Level", values_from = "Value")
Output:
# A tibble: 2 × 3
Variable Yes No
<chr> <dbl> <dbl>
1 Var1 1 7
2 Var2 5 2
So I got the answer to this question from a coworker, and it's very similar to what Quinten suggested but with some additional steps to account for the structure of my raw data.
The example tables I provided in my question were my desired outputs, not examples of my raw data. The number values weren't values in my dataset, but rather calculated counts of records, and the solution below includes steps for doing that calculation.
This is what my raw data looks like, and it's actually structured wide:
Participant_ID  Var1  Var2  Age
1               Yes   No    20
2               No    No    30
We started by creating a subset with just the relevant variables:
subset <- data |> select(Participant_ID, Var1, Var2)
Then pivoted the data longer first, in order to calculate the counts I wanted in my output table. In this code, we specify that we don't want to pivot Participant_ID and create columns called Vars and Response.
subsetlong <- subset |> pivot_longer(-c("Participant_ID"), names_to = "Vars", values_to="Response")
This is what subsetlong looks like:
Participant_ID  Vars  Response
1               Var1  Yes
1               Var2  No
2               Var1  No
2               Var2  No
Then we calculated the counts by Vars, putting that into a new dataframe called counts:
counts <- subsetlong |> group_by(Vars) |> count(Response)
And this is what counts looks like:
Vars  Response  n
Var1  Yes       1
Var1  No        7
Var2  Yes       5
Var2  No        2
Now that the calculation was done, we pivoted this back to wide again, specifying that any NAs should appear as 0s:
counts_wide <- counts |> pivot_wider(names_from="Response", values_from="n", values_fill = 0)
And finally got the desired structure:
Vars  Yes  No
Var1  1    7
Var2  5    2
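For reference, the steps can be chained into a single pipeline. The sketch below runs on just the two example participants shown earlier, so its counts are much smaller than those in the illustrative tables; the object name data and the column names follow the description above.

```r
library(dplyr)
library(tidyr)

# The two example participants from the raw-data table.
data <- data.frame(
  Participant_ID = 1:2,
  Var1 = c("Yes", "No"),
  Var2 = c("No", "No"),
  Age  = c(20, 30)
)

counts_wide <- data |>
  select(Participant_ID, Var1, Var2) |>
  pivot_longer(-Participant_ID, names_to = "Vars", values_to = "Response") |>
  count(Vars, Response) |>                      # one row per variable/response pair
  pivot_wider(names_from = Response, values_from = n,
              values_fill = 0)                  # responses never given become 0
counts_wide
```

Here count(Vars, Response) does the same job as group_by(Vars) followed by count(Response), but returns an ungrouped result. Var2 has no Yes answers in this tiny sample, so values_fill = 0 is what puts a 0 in that cell instead of NA.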
I have a dataframe with 4 variables: id, measurement, date_a, date_b.
A single id can contribute to the df more than once. I want to subset this dataframe to only include one measurement for each id. I want to select a single row for each id based on the minimum difference between date_b and date_a, however this minimum difference is required to be at least one year. Is there a way to do this using dplyr using one line of code rather than creating a new variable for the difference in dates?
Here's some fake data. (It's best practice to include something like this in your question to avoid ambiguity or misunderstandings about your particular situation.)
set.seed(8601)
df <- data.frame(
id = rep(1:3, each = 5),
measurement = "foo",
date_a = as.Date(sample(1:3000, 15), origin = "2010-01-01")
)
df$date_b <- df$date_a + sample(1:1000, 15)
Here's a slightly-longer-than-one-line approach with dplyr:
library(dplyr)
df %>% group_by(id) %>% filter(date_b-date_a >= 365) %>% filter(date_b-date_a == min(date_b-date_a))
Result:
# A tibble: 3 x 4
# Groups: id [3]
id measurement date_a date_b
<int> <fct> <date> <date>
1 1 foo 2013-06-13 2014-11-26
2 2 foo 2014-10-05 2017-04-14
3 3 foo 2012-01-07 2014-02-11
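If two rows in the same id tie for the smallest difference, the filter above keeps both. A variant using slice_min (available in dplyr 1.0.0 and later) can force exactly one row per id. The data below is a small deterministic example made up for illustration, not the random data above.

```r
library(dplyr)

# Small hand-made example: differences in days are 100, 400, 500, 366, 365.
df <- data.frame(
  id = c(1, 1, 1, 2, 2),
  measurement = c("a", "b", "c", "d", "e"),
  date_a = as.Date("2010-01-01") + c(0, 10, 20, 0, 5)
)
df$date_b <- df$date_a + c(100, 400, 500, 366, 365)

res <- df %>%
  filter(as.numeric(date_b - date_a) >= 365) %>%   # keep gaps of at least one year
  group_by(id) %>%
  slice_min(as.numeric(date_b - date_a), n = 1, with_ties = FALSE) %>%
  ungroup()
res
```

With with_ties = FALSE, ties are broken by row order, so each id contributes at most one row.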
Hi All,
Example: the above is the data I have. I want to group ages 1-2 and count the values; in this data the count for age group 1-2 is 4. Similarly, I want to group ages 3-4 and count the values; here the count for age group 3-4 is 6.
How can I group ages and aggregate the corresponding values?
I know this way: code-
data.frame(df %>% group_by(df$Age) %>% tally())
But this aggregates the values for each individual age.
I want the values aggregated over groups of several ages, as in the example above.
Any help on this will be greatly helpful.
Thanks a lot to All.
Here are two solutions, with base R and with package dplyr.
I will use the data posted by Shree.
First, base R.
I create a grouping variable grp and then aggregate on it.
grp <- with(df, c((age %in% 1:2) + 2*(age %in% 3:4)))
aggregate(age ~ grp, df, length)
# grp age
#1 1 4
#2 2 6
Second, a dplyr way.
Function case_when is used to create a grouping variable. This allows for meaningful names to be given to the groups in an easy way.
library(dplyr)
df %>%
mutate(grp = case_when(
age %in% 1:2 ~ "1:2",
age %in% 3:4 ~ "3:4",
TRUE ~ NA_character_
)) %>%
group_by(grp) %>%
tally()
## A tibble: 2 x 2
# grp n
# <chr> <int>
#1 1:2 4
#2 3:4 6
Here's one way using dplyr and ?cut from base R -
df <- data.frame(age = c(1,1,2,2,3,3,3,4,4,4),
Name = letters[1:10],
stringsAsFactors = F)
df %>%
count(grp = cut(age, breaks = c(0,2,4)))
# A tibble: 2 x 2
grp n
<fct> <int>
1 (0,2] 4
2 (2,4] 6
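If the interval labels (0,2] and (2,4] are hard to read, cut accepts a labels argument. The sketch below uses base R only; the label strings "1-2" and "3-4" are illustrative choices.

```r
# Same ages as in the example above.
df <- data.frame(age = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4))

# breaks = c(0, 2, 4) defines the intervals (0,2] and (2,4];
# labels gives each interval a readable name, in the same order.
df$grp <- cut(df$age, breaks = c(0, 2, 4), labels = c("1-2", "3-4"))

# Count the rows in each group.
counts <- as.data.frame(table(grp = df$grp))
counts
```

Ages outside the breaks become NA, so widen the breaks (for example with Inf as the last value) if the real data covers a larger range.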