Reshape in R when "time" values differ between subjects

I've looked all over the web, including StackOverflow, and tested various things before asking this question, so pardon me if I missed an existing answer.
I see lots of help for the reshape function (and the related packages too), but I can't get any of them to do what I need. I have a "time" variable whose values differ by subject, i.e., it is not simply time1, time2, time3. I would like to make a wide data set that treats each unique time value within a subject ID as "time1", "time2", "time3", while still saving the dates. To make this concrete, here is some sample data:
Id <- c(1, 1, 1, 2, 2, 2, 3)
date <- c("Jan10", "Jun11", "Dec11", "Feb10", "May10", "Dec10", "Jan11")
Score <- c(52, 43, 67, 56, 33, 21, 20)
format2 <- data.frame(Id, date, Score)
format2
  Id  date Score
1  1 Jan10    52
2  1 Jun11    43
3  1 Dec11    67
4  2 Feb10    56
5  2 May10    33
6  2 Dec10    21
7  3 Jan11    20
I would like it to look like this:
Id date1 Score1 date2 Score2 date3 Score3
1  Jan10     52 Jun11     43 Dec11     67
2  Feb10     56 Dec10     21 May10     33
3  Jan11     20    NA     NA    NA     NA
Thank you for any help and my apologies if I have missed an obvious answer.

You need to generate a time variable, which can be done quickly using ave():
format2$time <- ave(format2$Id, format2$Id, FUN=seq_along)
reshape(format2, direction = "wide", idvar="Id", timevar="time")
# Id date.1 Score.1 date.2 Score.2 date.3 Score.3
# 1 1 Jan10 52 Jun11 43 Dec11 67
# 4 2 Feb10 56 May10 33 Dec10 21
# 7 3 Jan11 20 <NA> NA <NA> NA
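If you prefer dplyr, a rough equivalent for creating the same per-Id counter would be (just a hedged alternative; the ave() call above is all you need):
library(dplyr)
format2 %>%
  group_by(Id) %>%
  mutate(time = row_number()) %>%
  ungroup()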
Some people prefer the reshape2 package because of its syntax, but even there, you need to have a time variable before you can do anything interesting.
Continuing from above (where the time variable was created):
library(reshape2)
format2m <- melt(format2, id.vars=c("Id", "time"))
dcast(format2m, Id ~ variable + time)
# Id date_1 date_2 date_3 Score_1 Score_2 Score_3
# 1 1 Jan10 Jun11 Dec11 52 43 67
# 2 2 Feb10 May10 Dec10 56 33 21
# 3 3 Jan11 <NA> <NA> 20 <NA> <NA>
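For completeness, on a recent tidyr (a hedged sketch, assuming version >= 1.0 is installed) the wide reshape can also be done in one step once the time variable exists:
library(tidyr)
# one date_* and Score_* column is created per time value
pivot_wider(format2, id_cols = Id, names_from = time,
            values_from = c(date, Score))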

Related

Joining two data frames using range of values

I have two data sets I would like to join. The income_range data is the master dataset, and I would like to join data_occ to income_range based on which band the income falls inside. Where more than one observation (income) falls within a range, I would like to take the lowest income.
I was attempting to use data.table but was having trouble. I would also like to keep all columns from both data.frames if possible.
The output dataset should only have 7 observations.
library(data.table)
library(dplyr)
income_range <- data.frame(id = "France",
                           inc_lower = c(10, 21, 31, 41, 51, 61, 71),
                           inc_high = c(20, 30, 40, 50, 60, 70, 80),
                           perct = c(1, 2, 3, 4, 5, 6, 7))
data_occ <- data.frame(id = rep(c("France", "Belgium"), each = 50),
                       income = sample(10:80, 50),
                       occ = rep(c("manager", "clerk", "manual", "skilled", "office"), each = 20))
setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range[data_occ,
                    on = .(id, inc_lower <= income, inc_high >= income),
                    .(id, income, inc_lower, inc_high, perct, occ)]
Thank you in advance.
Since you tagged dplyr, here's one possible solution using that library:
library(fuzzyjoin)
# join data frames on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
  fuzzy_left_join(data_occ,
                  by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
                  match_fun = list(`==`, `<=`, `>=`)) %>%
  rename(id = id.x) %>%
  select(-id.y)
# sort by income, and keep only the first row of every unique perct
result <- joined %>%
  arrange(income) %>%
  group_by(perct) %>%
  slice(1)
And the (intermediate) results:
> head(joined)
id inc_lower inc_high perct income occ
1 France 10 20 1 10 manager
2 France 10 20 1 19 manager
3 France 10 20 1 14 manager
4 France 10 20 1 11 manager
5 France 10 20 1 17 manager
6 France 10 20 1 12 manager
> result
# A tibble: 7 x 6
# Groups: perct [7]
id inc_lower inc_high perct income occ
<chr> <dbl> <dbl> <dbl> <int> <chr>
1 France 10 20 1 10 manager
2 France 21 30 2 21 manual
3 France 31 40 3 31 manual
4 France 41 50 4 43 manager
5 France 51 60 5 51 clerk
6 France 61 70 6 61 manager
7 France 71 80 7 71 manager
I've added the intermediate data frame joined for ease of understanding. You can omit it and simply chain the two pipelines together with %>%, as sketched below.
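For reference, the fully chained version would look roughly like this (the same two steps joined into one pipeline):
result <- income_range %>%
  fuzzy_left_join(data_occ,
                  by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
                  match_fun = list(`==`, `<=`, `>=`)) %>%
  rename(id = id.x) %>%
  select(-id.y) %>%
  arrange(income) %>%
  group_by(perct) %>%
  slice(1)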
Here is one data.table approach:
# copy income into temporary bound columns so it can sit on both sides of the non-equi join
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]
# order by income first, so mult = "first" keeps the lowest matching income per band
result = data_occ[order(income)
                  ][income_range,
                    on = .(id, inc_lower >= inc_lower, inc_high <= inc_high),
                    mult = "first"]
data_occ[, (cols) := NULL]
result
# id income occ inc_lower inc_high perct
# 1: France 10 clerk 10 20 1
# 2: France 21 manager 21 30 2
# 3: France 31 clerk 31 40 3
# 4: France 41 clerk 41 50 4
# 5: France 51 clerk 51 60 5
# 6: France 62 manager 61 70 6
# 7: France 71 manager 71 80 7
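As a quick check against the stated requirement of exactly 7 observations: the join returns one row per income band, so
nrow(result)
# [1] 7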

Is there a way to filter that does not include duplicates/repeated entries by particular groups?

Some context first:
I'm working with a data set of health-related data. It includes questionnaire scores pre and post treatment. However, some clients reappear within the data for further treatment. I've provided a mock example of the data in the code section.
I have tried to come up with a solution in dplyr, as this is the package I'm most familiar with, but I haven't achieved what I wanted.
#Example/mock data
ClientNumber<-c("4355", "2231", "8894", "9002", "4355", "2231", "8894", "9002", "4355", "2231")
Pre_Post<-c(1,1,1,1,2,2,2,2,1,1)
QuestionnaireScore<-c(62,76,88,56,22,30, 35,40,70,71)
df<-data.frame(ClientNumber, Pre_Post, QuestionnaireScore)
df$ClientNumber<-as.character(df$ClientNumber)
df$Pre_Post<-as.factor(df$Pre_Post)
View(df)
# tried solution
df2 <- df %>%
  group_by(ClientNumber) %>%
  filter(Pre_Post == 1 | Pre_Post == 2)
# this doesn't work, or needs more code to it
As you can see, the first four client numbers both have a pre and post treatment score. This is good. However, client numbers 4355 and 2231 appear again at the end (you could say they have relapsed and started new treatment). These two clients do not have a post treatment score.
I only want to analyse clients that have both a pre and a post score, so I need to keep clients who have completed treatment while excluding repeat entries that do not have a post-treatment score. In relation to the example I've provided, I want to include the first eight rows for analysis and exclude the last two, as they do not have a post-treatment score.
If these cases are to be kept in order, you could try:
library(dplyr)
df %>%
  group_by(ClientNumber) %>%
  filter(!duplicated(Pre_Post) & n_distinct(Pre_Post) == 2)
ClientNumber Pre_Post QuestionnaireScore
<fct> <dbl> <dbl>
1 4355 1 62
2 2231 1 76
3 8894 1 88
4 9002 1 56
5 4355 2 22
6 2231 2 30
7 8894 2 35
8 9002 2 40
I don't know if you actually need to use n_distinct() but it won't hurt to keep it. This will remove cases who have a pre score but no post score if they exist in the data.
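As a quick hedged check of that claim, here is a made-up client ("9999") with only a pre score appended to df; the extra row is dropped and the eight complete rows remain:
extra <- data.frame(ClientNumber = "9999",
                    Pre_Post = factor(1, levels = levels(df$Pre_Post)),
                    QuestionnaireScore = 50)
df_extra <- rbind(df, extra)
df_extra %>%
  group_by(ClientNumber) %>%
  filter(!duplicated(Pre_Post) & n_distinct(Pre_Post) == 2)
# client 9999 has no Pre_Post == 2 row, so n_distinct() is 1 and it is filtered out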
First arrange by ClientNumber, then group_by it, and finally filter using dplyr::lead and dplyr::lag:
library(dplyr)
df %>%
  arrange(ClientNumber) %>%
  group_by(ClientNumber) %>%
  filter(Pre_Post == 1 & lead(Pre_Post) == 2 | Pre_Post == 2 & lag(Pre_Post) == 1)
# A tibble: 8 x 3
# Groups: ClientNumber [4]
ClientNumber Pre_Post QuestionnaireScore
<fct> <dbl> <dbl>
1 2231 1 76
2 2231 2 30
3 4355 1 62
4 4355 2 22
5 8894 1 88
6 8894 2 35
7 9002 1 56
8 9002 2 40
Another option is to create groups of 2 for every ClientNumber and select only those groups which have 2 rows in them.
library(dplyr)
df %>%
  arrange(ClientNumber) %>%
  group_by(ClientNumber, group = cumsum(Pre_Post == 1)) %>%
  filter(n() == 2) %>%
  ungroup() %>%
  select(-group)
# ClientNumber Pre_Post QuestionnaireScore
# <chr> <fct> <dbl>
#1 2231 1 76
#2 2231 2 30
#3 4355 1 62
#4 4355 2 22
#5 8894 1 88
#6 8894 2 35
#7 9002 1 56
#8 9002 2 40
The same can be translated into base R using ave():
new_df <- df[order(df$ClientNumber), ]
subset(new_df, ave(Pre_Post,ClientNumber,cumsum(Pre_Post == 1),FUN = length) == 2)
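To see why grouping on cumsum(Pre_Post == 1) works (a small illustration with toy values, not the original data): every pre score starts a new group, so a pre score that is never followed by a post score sits alone in a group of size 1 and is dropped by the size-2 check.
Pre_Post_example <- c(1, 2, 1, 1, 2)
cumsum(Pre_Post_example == 1)
# [1] 1 1 2 3 3   -> group sizes 2, 1, 2; the lone middle group is filtered out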

R how to avoid "for" when I want to go through dataframe

Let me give a brief example.
I have a data frame, data1:
name<-c("John","John","Mike","Amy".....)
nationality<-c("Canada","America","Spain","Japan".....)
data1<-data.frame(name,nationality....)
This means the people are from different countries.
Each person is identified by his name and country, with no repeats.
The second data frame is:
name2<-c("John","John","Mike","John",......)
nationality2<-c("Canada","Canada","Canada".....)
score<-c(87,67,98,78,56......)
data2<-data.frame(name2,nationality2,score)
Every person is guaranteed to have 5 rows in data2, which means they have 5 scores, but the rows are in random order.
What I want is every person's 5 scores, but I don't care what his name is or where he is from.
The final data frame I want is:
score1 score2 score3 score4 score5
1 89 89 87 78 90
2 ...
3 ...
Every row represents one person's 5 scores, but I don't care who he is.
My data set is so large that I cannot use a for loop.
What can I do?
Although there is an already accepted answer which uses base R, I would like to suggest a solution which uses the convenient dcast() function for reshaping from long to wide form instead of using tapply() and rbind():
library(data.table) # CRAN version 1.10.4 used
dcast(setDT(data2)[setDT(data1), on = c(name2 = "name", nationality2 = "nationality")],
      name2 + nationality2 ~ paste0("score", rowid(rleid(name2, nationality2))),
      value.var = "score")
returns
name2 nationality2 score1 score2 score3 score4 score5
1: Amy Canada 93 91 73 8 79
2: John America 3 77 69 89 31
3: Mike Canada 76 92 46 47 75
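In case the rowid(rleid(...)) part looks opaque, here is a small illustration with toy values (it assumes each person's rows are stored together): rleid() gives every run of identical values its own id, and rowid() then numbers the rows within each id, which is what yields the score1 ... score5 labels.
library(data.table)
x <- c("John", "John", "Mike", "Mike", "Mike")
rleid(x)
# [1] 1 1 2 2 2
rowid(rleid(x))
# [1] 1 2 1 2 3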
It seems to me that's what you're asking:
data1 <- data.frame(name = c("John", "Mike", "Amy"),
                    nationality = c("America", "Canada", "Canada"))
data2 <- data.frame(name2 = rep(c("John", "Mike", "Amy", "Jack", "John"), each = 5),
                    score = sample(100, 25),
                    nationality2 = rep(c("America", "Canada", "Canada", "Canada", "Canada"), each = 5))
data3 <- merge(data2, data1, by.x = c("name2", "nationality2"), by.y = c("name", "nationality"))
data3$name_country <- paste(data3$name2, data3$nationality2)
all_scores_list <- tapply(data3$score, data3$name_country, c)
as.data.frame(do.call(rbind, all_scores_list))
# V1 V2 V3 V4 V5
# Amy Canada 57 69 90 81 50
# John America 4 92 75 15 2
# Mike Canada 25 86 51 20 12

From panel data to cross-sectional data using averages

I am very new to R so I am not sure how basic my question is, but I am stuck at the following point.
I have data that has a panel structure, similar to this:
Country  Year  Outcome  Country-characteristic
A        1990  10       40
A        1991  12       40
A        1992  14       40
B        1991  10       60
B        1992  12       60
For some reason I need to put this into a cross-sectional structure such that I get averages over all years for each country. In the end, it should look like this:
Country  Outcome  Country-Characteristic
A        12       40
B        11       60
Has anybody faced a similar problem? I was playing with lapply(table$country, table$outcome, mean) but that did not work as I wanted it.
Two tips: 1- When you ask a question, you should provide a reproducible example for the data too (as I did with read.table below). 2- It's not a good idea to use "-" in column names. You should use "_" instead.
You can get a summary using the dplyr package:
df1 <- read.table(text="Country Year Outcome Countrycharacteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60", header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
  group_by(Country) %>%
  summarize(Outcome = mean(Outcome),
            Countrycharacteristic = mean(Countrycharacteristic))
# A tibble: 2 x 3
Country Outcome Countrycharacteristic
<chr> <dbl> <dbl>
1 A 12 40
2 B 11 60
We can do this in base R with aggregate, dropping the Year column (df1[-2]) before averaging:
aggregate(. ~ Country, df1[-2], mean)
# Country Outcome Countrycharacteristic
#1 A 12 40
#2 B 11 60

Prepare Time Series for Machine Learning - Long to Wide Format

I have a data frame of time series data in a 'long' format where there is 1 row/observation per day. I would like to transform this data into a 'wide' format. Each row/observation should have the time series value for the current date and the previous 2 days.
To provide a concrete example, I will use the Air Quality data available in R. This is what my input data frame looks like.
> input <- airquality[1:4,c("Month", "Day", "Ozone")]
> input
Month Day Ozone
1 5 1 41
2 5 2 36
3 5 3 12
4 5 4 18
I would like to transform this input so that it looks like the following.
output <- data.frame(Month = 5, Day = 1:4, Ozone=c(41,36,12,18), Ozone.Prev.1=c(NA,41,36,12), Ozone.Prev.2=c(NA,NA,41,36))
> output
Month Day Ozone Ozone.Prev.1 Ozone.Prev.2
1 5 1 41 NA NA
2 5 2 36 41 NA
3 5 3 12 36 41
4 5 4 18 12 36
Any suggestions on a nice, clean way to do this? Many thanks in advance.
You can use the lag function from zoo, but the following small function does the trick without using additional packages:
shift_vector = function(vec, n) c(rep(NA, n), head(vec, -n))
output = transform(input, prev_1 = shift_vector(Ozone, 1),
                          prev_2 = shift_vector(Ozone, 2))
output
Month Day Ozone prev_1 prev_2
1 5 1 41 NA NA
2 5 2 36 41 NA
3 5 3 12 36 41
4 5 4 18 12 36
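An alternative, if you already use dplyr, is its lag() helper; this is just a rough equivalent sketch of the same shifting idea:
library(dplyr)
input %>%
  mutate(Ozone.Prev.1 = lag(Ozone, 1),
         Ozone.Prev.2 = lag(Ozone, 2))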
