I am quite new to R and have only basic skills so far. Even though I checked functions like melt() and gather(), they somehow do not work for me.
What I want to do is transform data like the example below. The values in Has Own House, Renting and Homeless are only 1 or 0, and at most one of them can be 1 per row (you cannot be Renting and Homeless at the same time).
E.g.:
Passenger ID /// Has Own House /// Renting /// Homeless /// Age /// Gender
1 1 0 0 21 Male
2 0 1 0 24 Female
I want this data to look like this:
Passenger ID /// Housing /// Age /// Gender
1 Has own house 21 Male
2 Renting 24 Female
As for forecasting: could you advise whether the layout above (with the binary columns) will be faster, or whether having everything in one column is the better solution?
Try this:
library(tidyverse)
# importing your data
df <- read_table("Passenger_ID Has_Own_House Renting Homeless Age Gender
1 1 0 0 21 Male
2 0 1 0 24 Female")
and run:
df %>%
gather(Housing, value, -Passenger_ID, -Age, -Gender) %>%
filter(value==1) %>%
select(-value)
the output is:
# A tibble: 2 x 4
# Passenger_ID Age Gender Housing
# <int> <int> <chr> <chr>
# 1 1 21 Male Has_Own_House
# 2 2 24 Female Renting
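In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). Here is a sketch of the same pipeline with the newer function, assuming the same column names as above (the object names passengers and housing are made up for this example):

```r
library(dplyr)
library(tidyr)

# Same toy data as above, built directly so the example is self-contained
passengers <- data.frame(
  Passenger_ID  = 1:2,
  Has_Own_House = c(1L, 0L),
  Renting       = c(0L, 1L),
  Homeless      = c(0L, 0L),
  Age           = c(21L, 24L),
  Gender        = c("Male", "Female"),
  stringsAsFactors = FALSE
)

housing <- passengers %>%
  # stack the three indicator columns into name/value pairs
  pivot_longer(c(Has_Own_House, Renting, Homeless),
               names_to = "Housing", values_to = "value") %>%
  # keep only the indicator that is switched on for each passenger
  filter(value == 1) %>%
  select(-value)

housing
```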
In base R with ifelse:
# Load Data
dat <- structure(list(Passenger_ID = 1:2, Has_Own_House = c(1L, 0L),
Renting = 0:1, Homeless = c(0L, 0L), Age = c(21L, 24L), Gender = structure(c(2L,
1L), .Label = c("Female", "Male"), class = "factor")), .Names = c("Passenger_ID",
"Has_Own_House", "Renting", "Homeless", "Age", "Gender"), class = "data.frame", row.names = c(NA,
-2L))
# Assign new column "Housing" based on testing nested ifelse statements:
dat2 <- within(dat, Housing <- ifelse(Has_Own_House==1, "Has_Own_House",
ifelse(Renting==1, "Renting",
ifelse(Homeless==1, "Homeless", NA))))
# Remove extra columns
dat2$Has_Own_House <- NULL
dat2$Renting <- NULL
dat2$Homeless <- NULL
Yielding
>dat2
Passenger_ID Age Gender Housing
1 21 Male Has_Own_House
2 24 Female Renting
In base R, you can assign the new column in a single line: apply over the rows of the data frame (MARGIN = 1) a function that returns the name of the column whose value is 1 (located with which):
df = data.frame('Passenger ID' = 1:5,
'Has Own House' = c(1,0,0,1,0),
'Renting' = c(0,1,0,0,0),
'Homeless' = c(0,0,1,0,1),
'Age'=21:25,
'Gender' = c('Male', 'Female', 'Male', 'Female', 'Male'))
df$HOUSING = apply(df[, 2:4], 1, function(x) names(df)[2:4][which(x==1)])
df
# Passenger.ID Has.Own.House Renting Homeless Age Gender HOUSING
# 1 1 1 0 0 21 Male Has.Own.House
# 2 2 0 1 0 22 Female Renting
# 3 3 0 0 1 23 Male Homeless
# 4 4 1 0 0 24 Female Has.Own.House
# 5 5 0 0 1 25 Male Homeless
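The row-wise apply() above can also be written without a per-row function. This is a minimal sketch using max.col(), assuming exactly one of the three columns is 1 in every row (for an all-zero row, max.col() with ties.method = "first" would silently return the first column). The name housing_cols is introduced here for illustration only:

```r
dat <- data.frame(
  Passenger_ID  = 1:2,
  Has_Own_House = c(1, 0),
  Renting       = c(0, 1),
  Homeless      = c(0, 0),
  Age           = c(21, 24),
  Gender        = c("Male", "Female"),
  stringsAsFactors = FALSE
)

housing_cols <- c("Has_Own_House", "Renting", "Homeless")

# max.col() returns, for each row, the position of the largest value,
# i.e. the position of the single 1 among the indicator columns
dat$Housing <- housing_cols[max.col(as.matrix(dat[housing_cols]),
                                    ties.method = "first")]

# drop the indicator columns once the label exists
dat[housing_cols] <- NULL
dat
```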
I am trying to obtain counts of each combination of levels of two variables, "week" and "id". I'd like the result to have "id" as rows, and "week" as columns, and the counts as the values.
Example of what I've tried so far (tried a bunch of other things, including adding a dummy variable = 1 and then fun.aggregate = sum over that):
library(plyr)
ddply(data, .(id), dcast, id ~ week, value_var = "id",
fun.aggregate = length, fill = 0, .parallel = TRUE)
However, I must be doing something wrong because this function is not finishing. Is there a better way to do this?
Input:
id week
1 1
1 2
1 3
1 1
2 3
Output:
1 2 3
1 2 1 1
2 0 0 1
You could just use the table command:
table(data$id,data$week)
1 2 3
1 2 1 1
2 0 0 1
If "id" and "week" are the only columns in your data frame, you can simply use:
table(data)
# week
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
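table() returns an object of class "table" rather than a data frame. If a data frame with "id" as row names and "week" as columns is needed, one option is as.data.frame.matrix(). This is a small sketch, assuming the input is called data as in the question (counts is a name made up here):

```r
data <- data.frame(id = c(1, 1, 1, 1, 2), week = c(1, 2, 3, 1, 3))

# cross-tabulate, then keep the wide layout while converting to a data frame
counts <- as.data.frame.matrix(table(data$id, data$week))
counts
```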
You don't need ddply for this. The dcast from reshape2 is sufficient:
dat <- data.frame(
id = c(rep(1, 4), 2),
week = c(1:3, 1, 3)
)
library(reshape2)
dcast(dat, id~week, fun.aggregate=length)
id 1 2 3
1 1 2 1 1
2 2 0 0 1
Edit: For a base R solution (other than table, as posted by Joshua Ulrich), try xtabs:
xtabs(~id+week, data=dat)
week
id 1 2 3
1 2 1 1
2 0 0 1
The reason ddply is taking so long is that the splitting by group is not run in parallel (only the computations on the 'splits' are), so with a large number of groups it will be slow, and .parallel = TRUE will not help.
An approach using data.table::dcast (data.table version >= 1.9.2) should be extremely efficient in time and memory. In this case, we can rely on default argument values and simply use:
library(data.table)
dcast(setDT(data), id ~ week)
# Using 'week' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
Or setting the arguments explicitly:
dcast(setDT(data), id ~ week, value.var = "week", fun.aggregate = length)
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
For pre-data.table 1.9.2 alternatives, see edits.
A tidyverse option could be :
library(dplyr)
library(tidyr)
df %>%
count(id, week) %>%
pivot_wider(names_from = week, values_from = n, values_fill = list(n = 0))
#spread(week, n, fill = 0) #In older version of tidyr
# id `1` `2` `3`
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 1 1
#2 2 0 0 1
Using only pivot_wider -
tidyr::pivot_wider(df, names_from = week,
values_from = week, values_fn = length, values_fill = 0)
Or using tabyl from janitor :
janitor::tabyl(df, id, week)
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L), week = c(1L, 2L, 3L,
1L, 3L)), class = "data.frame", row.names = c(NA, -5L))
I have a data frame where certain observations are separated by commas and I would like to separate them into different rows. I know there is a way to do this using the separate_rows function from tidyr, but I have an additional constraint.
Here is code to get my data frame:
dat <- structure(list(cit.num = c("29496, 37063", "29496, 37063", "36706, 36707",
"36706, 36707"), civ.race = c("Black", "White", "Hispanic", "Hispanic"
), civ.sex = c("Male", "Female", "Female", "Male"), count = c(2L,
2L, 2L, 2L)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-4L))
Here is what the data look like:
cit.num civ.race civ.sex count
1 29496, 37063 Black Male 2
2 29496, 37063 White Female 2
3 36706, 36707 Hispanic Female 2
4 36706, 36707 Hispanic Male 2
cit.num identifies an individual citizen. I know that 29496 refers to the black male, and 37063 refers to the white female. Is there a way to separate the rows such that the first value is matched with the correct civ.race and civ.sex? Here is my desired output:
cit.num civ.race civ.sex count
1 29496 Black Male 2
2 37063 White Female 2
3 36706 Hispanic Female 2
4 36707 Hispanic Male 2
If you already know the cit.num that should correspond to each combination of civ.race and civ.sex, I think it would be easier to do a join with the corresponding keys. Here is the code to do that using left_join.
library(tidyverse)
keys <- data.frame(civ.race = c("Black","Black","White","White","Hispanic","Hispanic"),
civ.sex = c("Male","Female","Male","Female","Male","Female"),
cit.num = c(29496,29495,37064,37063,36707,36706),
stringsAsFactors = FALSE)
dat %>%
# Drop your original cit.num column
select(-cit.num) %>%
# Join on civ.race and civ.sex to match the entries in dat and keys
left_join(keys,
by = c("civ.race","civ.sex"))
# A tibble: 4 x 4
# civ.race civ.sex count cit.num
# <chr> <chr> <int> <dbl>
# 1 Black Male 2 29496
# 2 White Female 2 37063
# 3 Hispanic Female 2 36706
# 4 Hispanic Male 2 36707
You could use a for loop.
The key is to define a sequence of odd row numbers:
seq(1, nrow(dat), by = 2)
Then have the for loop iterate over that sequence:
for(i in seq(1, nrow(dat), by = 2)){
dat$cit.num[i] <- gsub(", \\d+", "", dat$cit.num[i])
dat$cit.num[i+1] <- gsub("\\d+, ", "", dat$cit.num[i+1])
}
Output:
dat
cit.num civ.race civ.sex count
1 29496 Black Male 2
2 37063 White Female 2
3 36706 Hispanic Female 2
4 36707 Hispanic Male 2
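The loop above relies on the rows coming in consecutive pairs. Here is a sketch of a variant that does not depend on odd or even row positions. It assumes only that the k-th row sharing a given cit.num string corresponds to the k-th number in that string (parts and pos are names made up here):

```r
dat <- data.frame(
  cit.num  = c("29496, 37063", "29496, 37063", "36706, 36707", "36706, 36707"),
  civ.race = c("Black", "White", "Hispanic", "Hispanic"),
  civ.sex  = c("Male", "Female", "Female", "Male"),
  count    = c(2L, 2L, 2L, 2L),
  stringsAsFactors = FALSE
)

# split every "a, b" string into its individual numbers
parts <- strsplit(dat$cit.num, ", ", fixed = TRUE)

# position of each row within its group of identical cit.num strings
pos <- ave(seq_along(dat$cit.num), dat$cit.num, FUN = seq_along)

# pick the number whose position matches the row's position in the group
dat$cit.num <- vapply(seq_along(parts),
                      function(k) parts[[k]][pos[k]],
                      character(1))
dat
```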
Here is a tidyverse alternative. You can separate your cit.num column into 2 columns, first and second.
Then, grouping by this combination, you set cit.num to be either the first or second number (first if the first of the two rows, and second otherwise).
library(tidyverse)
dat %>%
separate(cit.num, into = c("first", "second")) %>%
group_by(first, second) %>%
mutate(cit.num = ifelse(row_number() == 1, first, second)) %>%
ungroup() %>%
select(c(-first, -second))
Output
# A tibble: 4 x 4
civ.race civ.sex count cit.num
<chr> <chr> <int> <chr>
1 Black Male 2 29496
2 White Female 2 37063
3 Hispanic Female 2 36706
4 Hispanic Male 2 36707
If we have only two numbers in cit.num we could use separate_rows to get data in different rows and select 1st and 4th row in each cit.num.
library(dplyr)
dat %>%
mutate(temp = cit.num) %>%
tidyr::separate_rows(cit.num) %>%
group_by(temp) %>%
slice(c(1, 4)) %>%
ungroup() %>%
select(-temp)
# cit.num civ.race civ.sex count
# <chr> <chr> <chr> <int>
#1 29496 Black Male 2
#2 37063 White Female 2
#3 36706 Hispanic Female 2
#4 36707 Hispanic Male 2
Thank you in advance to anyone who tries to help with this.
I'm using the Yelp data set and the question I want to answer is "which categories are positively correlated with higher stars for X category (Bars for example)?"
The issue I'm encountering is that, for each business, the categories are lumped together into one column, with one row per business_id. So I need a means to separate out each category, turn them into columns, and then check whether the original category column contains the category that the column was created for.
My current train of thought is to use group_by with business_id and then unnest_tokens the column, then model.matrix() that column into the split I want and then join it onto the df I'm using. But I can't get model.matrix to pass and keep business_id connected to each row.
# an example of what I am using #
df <-
data_frame(business_id = c("bus_1",
"bus_2",
"bus_3"),
categories=c("Pizza, Burgers, Caterers",
"Pizza, Restaurants, Bars",
"American, Barbeque, Restaurants"))
# what I want it to look like #
desired_df <-
data_frame(business_id = c("bus_1",
"bus_2",
"bus_3"),
categories=c("Pizza, Burgers, Caterers",
"Pizza, Restaurants, Bars",
"American, Barbeque, Restaurants"),
Pizza = c(1, 1, 0),
Burgers = c(1, 0, 0),
Caterers = c(1, 0, 0),
Restaurants = c(0, 1, 1),
Bars = c(0, 1, 0),
American = c(0, 0, 1),
Barbeque = c(0, 0, 1))
# where I am stuck #
df %>%
select(business_id, categories) %>%
group_by(business_id) %>%
unnest_tokens(categories, categories, token = 'regex', pattern=", ") %>%
model.matrix(business_id ~ categories, data = .) %>%
as_data_frame
Edit: After this post and the answers below, I encountered a duplicate identifiers error using spread(). That brought me to this thread, https://github.com/tidyverse/tidyr/issues/426, where the answer to my question was posted. I've repasted it below.
# duplicating the error with a smaller data.frame #
library(tidyverse)
df <- structure(list(age = c("21", "17", "32", "29", "15"),
gender = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor")),
row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("age", "gender"))
df
#> # A tibble: 5 x 2
#> age gender
#> <chr> <fct>
#> 1 21 Male
#> 2 17 Female
#> 3 32 Female
#> 4 29 Male
#> 5 15 Male
df %>%
spread(key=gender, value=age)
#> Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)
# fixing the problem #
df %>%
group_by_at(vars(-age)) %>% # group by everything other than the value column.
mutate(row_id=1:n()) %>% ungroup() %>% # build group index
spread(key=gender, value=age) %>% # spread
select(-row_id) # drop the index
#> # A tibble: 3 x 2
#> Female Male
#> <chr> <chr>
#> 1 17 21
#> 2 32 29
#> 3 NA 15
Building on your nice use of tidytext::unnest_tokens(), you can also use this alternative solution:
library(dplyr)
library(tidyr)
library(tidytext)
df %>%
select(business_id, categories) %>%
group_by(business_id) %>%
unnest_tokens(categories, categories, token = 'regex', pattern=", ") %>%
mutate(value = 1) %>%
spread(categories, value, fill = 0)
# business_id american barbeque bars burgers caterers pizza restaurants
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# bus_1 0 0 0 1 1 1 0
# bus_2 0 0 1 0 0 1 1
# bus_3 1 1 0 0 0 0 1
Here is a simple tidyverse solution:
library(tidyverse)
df %>%
mutate(
ind = 1,
tmp = strsplit(categories, ", ")
) %>%
unnest(tmp) %>%
spread(tmp, ind, fill = 0)
## A tibble: 3 x 9
# business_id categories American Barbeque Bars Burgers Caterers Pizza Restaurants
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 bus_1 Pizza, Burgers, Caterers 0 0 0 1 1 1 0
#2 bus_2 Pizza, Restaurants, Bars 0 0 1 0 0 1 1
#3 bus_3 American, Barbeque, Restaurants 1 1 0 0 0 0 1
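For newer tidyr versions, the same idea can be sketched with separate_rows() and pivot_wider() instead of spread(). This assumes tidyr 1.1.0 or later, where values_fill accepts a single value. The distinct() step is an addition of mine: it guards against a business listing the same category twice, which would otherwise produce duplicate cells. Note that this version does not keep the original categories column:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  business_id = c("bus_1", "bus_2", "bus_3"),
  categories  = c("Pizza, Burgers, Caterers",
                  "Pizza, Restaurants, Bars",
                  "American, Barbeque, Restaurants"),
  stringsAsFactors = FALSE
)

dummies <- df %>%
  # one row per (business, category) pair
  separate_rows(categories, sep = ", ") %>%
  # drop repeated pairs so each cell is filled at most once
  distinct(business_id, categories) %>%
  mutate(value = 1) %>%
  # one indicator column per category, 0 where the category is absent
  pivot_wider(names_from = categories, values_from = value, values_fill = 0)

dummies
```

If the original categories column is still needed, the result can be joined back onto df by business_id.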
I have a household and member dataset in one long flat format. There is a fixed number of members and each corresponds to a column. For simplicity, assume 2 members per household and assume 2 questions are asked for the members- age (Q1), gender(Q2).
The file format looks as given below:
HHID, MEM_ID_1, MEM_ID_2, AGE_1, AGE_2, GENDER_1, GENDER_2
1 1 2 50 45 M F
And I want to convert it to the following format:
HHID MEM_ID AGE GENDER
1 1 50 M
1 2 45 F
Let's say our data frame is test
dput(test)
structure(list(HHID = 1L, MEM_ID_1 = 1L, MEM_ID_2 = 2L, AGE_1 = 50L,
AGE_2 = 45L, GENDER_1 = structure(1L, .Label = "Male", class = "factor"),
GENDER_2 = structure(1L, .Label = "Female", class = "factor")), class = "data.frame", row.names = c(NA,
-1L))
You could try the reshape function on this data frame as below:
reshape(test, direction = "long",
varying = list(c("MEM_ID_1","MEM_ID_2"), c("AGE_1","AGE_2"), c( "GENDER_1","GENDER_2")),
v.names = c("MEM_ID","AGE","GENDER"),
idvar = 'HHID')
The reshape() function comes with base R (in the stats package). Broadly speaking, it can melt several sets of variables at once when you pass them through the varying parameter and set direction to "long".
In your case, we pass a list of three vectors of variable names to the varying argument:
varying = list(c("MEM_ID_1","MEM_ID_2"), c("AGE_1","AGE_2"), c( "GENDER_1","GENDER_2"))
The output is below:
HHID time MEM_ID AGE GENDER
1.1 1 1 1 50 Male
1.2 1 2 2 45 Female
You can use tidyr::gather(), tidyr::separate(), and tidyr::spread() in order. Here household is the name of your data frame.
library(tidyverse)
1. gather
First, apply tidyr::gather(), which gives the result below.
household %>%
gather(-HHID, key = domestic, value = value)
#> HHID domestic value
#> 1 1 MEM_ID_1 1
#> 2 1 MEM_ID_2 2
#> 3 1 AGE_1 50
#> 4 1 AGE_2 45
#> 5 1 GENDER_1 M
#> 6 1 GENDER_2 F
Now all you have to do is separate the domestic column at _[0-9] (as a regular expression with a lookahead: _(?=[0-9])) and then spread the data back into wide format to get the output you want.
2. Conclusion: entire code
household %>%
gather(-HHID, key = domestic, value = value) %>% # long data
separate(domestic, into = c("domestic", "vals"), sep = "_(?=[0-9])") %>% # separate the digit
spread(domestic, value) %>% # wide format
select(HHID, MEM_ID, AGE, GENDER, -vals) # just arranging columns, and excluding needless column
#> HHID MEM_ID AGE GENDER
#> 1 1 1 50 M
#> 2 1 2 45 F
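With tidyr 1.0.0 or later, the gather/separate/spread sequence can be collapsed into a single pivot_longer() call using the special ".value" name. Below is a sketch, assuming the column names all end in an underscore followed by the member number. The helper column name member is made up here. Unlike the gather() route, this keeps AGE as a number instead of converting everything to character:

```r
library(dplyr)
library(tidyr)

household <- data.frame(
  HHID = 1L,
  MEM_ID_1 = 1L, MEM_ID_2 = 2L,
  AGE_1 = 50L,   AGE_2 = 45L,
  GENDER_1 = "M", GENDER_2 = "F",
  stringsAsFactors = FALSE
)

members <- household %>%
  pivot_longer(
    -HHID,
    # ".value" turns the first capture group into output column names;
    # the second capture group (the member number) goes into "member"
    names_to = c(".value", "member"),
    names_pattern = "(.*)_(\\d+)$"
  ) %>%
  select(-member)

members
```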
I have a dataset where each row represents a new test taken by an individual. It has four variables.
1) IDs of the test takers:
id <- c(1, 1, 1, 2, 2)
2) Dates the individuals took the test:
dates <- as.Date(c("2007-06-22", "2008-06-21", "2009-06-22", "2008-06-21", "2009-06-22"))
3) Scores they received on that test:
scores <- c(0, 12, 12, 1, 3)
4) Whether or not that score was the best score of the individual up to that time point.
improvement <- c("No", "Yes", "No", "No", "Yes")
So the dataset is:
df <- data.frame(id, dates, scores, improvement)
id dates scores improvement
1 1 2007-06-22 0 No
2 1 2008-06-21 12 Yes
3 1 2009-06-22 12 No
4 2 2008-06-21 1 No
5 2 2009-06-22 3 Yes
I've got a problem though. A score of 12 is the highest. So if someone gets a 12, there would be no more room for improvement. Do you know how I could make it so that when someone gets a 12, on any subsequent rows they get NA on improvement?
i.e.,
id dates scores improvement
1 1 2007-06-22 0 No
2 1 2008-06-21 12 Yes
3 1 2009-06-22 12 NA
4 2 2008-06-21 1 No
5 2 2009-06-22 3 Yes
How about this: We use dplyr to group by id, then for each id we check whether any score is equal to 12. If so, then we replace every value of improvement with NA in subsequent rows after the first instance of a score of 12.
library(dplyr)
df %>% group_by(id) %>% arrange(id, dates) %>%
mutate(improvement = replace(improvement, if(any(scores==12)) (min(which(scores==12))+1):n(), NA))
id dates scores improvement
<dbl> <date> <dbl> <fctr>
1 1 2007-06-22 0 No
2 1 2008-06-21 12 Yes
3 1 2009-06-22 12 NA
4 2 2008-06-21 1 No
5 2 2009-06-22 3 Yes
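One edge case to be aware of with the index range (min(which(scores==12))+1):n(): if the first 12 falls on a person's last row, the range runs backwards and points past the end of the group. Here is a sketch of a variant that avoids building an index range, assuming improvement is stored as character rather than factor (result is a name made up here):

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 2, 2),
  dates = as.Date(c("2007-06-22", "2008-06-21", "2009-06-22",
                    "2008-06-21", "2009-06-22")),
  scores = c(0, 12, 12, 1, 3),
  improvement = c("No", "Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

result <- df %>%
  arrange(id, dates) %>%
  group_by(id) %>%
  mutate(
    # cumsum() counts the 12s seen so far; lag() shifts that by one row,
    # so the flag is TRUE only strictly after the first 12
    improvement = if_else(lag(cumsum(scores == 12), default = 0L) > 0,
                          NA_character_, improvement)
  ) %>%
  ungroup()

result
```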
Here is an option using data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'id', order the 'id' and 'dates', get the logical index for maximum scores (scores == max(scores)), find the cumulative sum (cumsum(...)), convert it to a logical vector (>1) and get the row index (.I). Specify the row index in i and assign (:=) the elements in 'improvement' that corresponds to that index to NA
library(data.table)
setDT(df)[df[order(id, dates), .I[cumsum(scores == max(scores))>1],
by = id]$V1, improvement := NA]
df
# id dates scores improvement
#1: 1 2007-06-22 0 No
#2: 1 2008-06-21 12 Yes
#3: 1 2009-06-22 12 NA
#4: 2 2008-06-21 1 No
#5: 2 2009-06-22 3 Yes
Update
If the max values are not adjacent, either we can order by 'scores' as well or another option is
setDT(df1)[df1[order(id, dates), .I[cumsum(scores == max(scores))>1 &
scores ==max(scores)], by = id]$V1, improvement := NA]
df1
# id dates scores improvement
#1: 1 2007-06-22 0 No
#2: 1 2008-06-21 12 Yes
#3: 1 2009-06-22 5 No
#4: 1 2010-06-21 12 NA
#5: 2 2008-06-21 1 No
#6: 2 2009-06-22 3 Yes
A slight improvement to the above is to call the scores==max(scores) one time by creating an object
setDT(df1)[df1[order(id, dates), {mx <- scores == max(scores)
.I[cumsum(mx)>1 & mx]},
by = id]$V1, improvement := NA]
data
df1 <- structure(list(id = c(1, 1, 1, 1, 2, 2), dates = structure(c(13686,
14051, 14417, 14781, 14051, 14417), class = "Date"), scores = c(0,
12, 5, 12, 1, 3), improvement = structure(c(1L, 2L, 1L, 1L, 1L,
2L), .Label = c("No", "Yes"), class = "factor")), .Names = c("id",
"dates", "scores", "improvement"), row.names = c(NA, -6L),
class = "data.frame")
The operation can also be done using only base R. Though it is a bit messy, it may help people who aren't familiar with dplyr and similar packages. Here is the code; I have explained my logic in the comments:
## Note: you cannot have factor levels in the `improvement` column
df$id <- as.character(df$id) ##IMPORTANT
df$improvement <- as.character(df$improvement) ##really important
new_df <- NULL #new data frame; placeholder for now
for(test_taker in unique(df$id)) {
## Sub-dataframe for each individual's record:
sub_df <- df[df$id == test_taker, ]
## For each individual's record, look for scores of 12.
## If such a score occurs more than once,
## change the second score of 12 and beyond to NA
indices <- which(sub_df$scores == 12)
if(length(indices) > 1) {
sub_df[indices[-1], "improvement"] <- NA
}
## Update the new data.frame
new_df <- rbind(new_df, sub_df)
}
new_df
## id dates scores improvement
## 1 1 2007-06-22 0 No
## 2 1 2008-06-21 12 Yes
## 3 1 2009-06-22 12 <NA>
## 4 2 2008-06-21 1 No
## 5 2 2009-06-22 3 Yes
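The same result can be sketched in base R without a loop by using ave(). It assumes the rows are sorted by date within each id. Note that it blanks every row after a person's first 12, not only later rows that also score 12 (after_first_12 is a name made up here):

```r
df <- data.frame(
  id = c(1, 1, 1, 2, 2),
  dates = as.Date(c("2007-06-22", "2008-06-21", "2009-06-22",
                    "2008-06-21", "2009-06-22")),
  scores = c(0, 12, 12, 1, 3),
  improvement = c("No", "Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# make sure each person's tests are in chronological order
df <- df[order(df$id, df$dates), ]

# per id: TRUE for rows that come strictly after the first score of 12
after_first_12 <- as.logical(ave(
  df$scores == 12, df$id,
  FUN = function(hit) c(FALSE, head(cumsum(hit) > 0, -1))
))

df$improvement[after_first_12] <- NA
df
```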