Reshape dataframe by ID [duplicate] - r

I have a data set like this:
id age edu blood
1 30-39 Primary 5.5
1 20-29 Secondary 8.7
1 30-39 Primary 10
2 30-39 Primary 11
2 20-29 Secondary 10
2 20-29 Secondary 9
I want id-wise output like this:
id age30_39count age20_29count edu_pri_count edu_sec_count blood_median
1 2 1 2 1 8.7
2 1 2 1 2 10
I have tried R code:
library(dplyr)
library(tidyr)
ddply(dat, "id", spread, age, age, edu, edu, blood, blood_median=median(blood))
But it does not show the desired result. Could anybody help?

You mean like this?
> library(dplyr)
> library(tidyr)
> group_by(df,id,age) %>% gather(variable,value,age,edu) %>%
unite(tag,variable,value) %>%
mutate(medblood=median(blood)) %>%
spread(tag,id) %>% select(-blood) %>%
select(-medblood,medblood)
# A tibble: 6 x 5
`age_20-29` `age_30-39` edu_Primary edu_Secondary medblood
<int> <int> <int> <int> <dbl>
1 NA 1 1 NA 8.70
2 1 NA NA 1 8.70
3 2 NA NA 2 10.0
4 NA 1 1 NA 8.70
5 2 NA NA 2 10.0
6 NA 2 2 NA 10.0
That last select(-medblood,medblood) moves the median blood column to the far right. However, you probably want this instead:
> group_by(df,id,age) %>% gather(variable,value,age,edu) %>%
unite(tag,variable,value) %>%
mutate(medblood=median(blood)) %>%
count(medblood,id,tag) %>% spread(tag,n)
# A tibble: 2 x 6
# Groups: id [2]
id medblood `age_20-29` `age_30-39` edu_Primary edu_Secondary
<int> <dbl> <int> <int> <int> <int>
1 1 8.70 1 2 2 1
2 2 10.0 2 1 1 2
Here is the dput of the data df used for this example:
> dput(df)
structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), age = structure(c(2L,
1L, 2L, 2L, 1L, 1L), .Label = c("20-29", "30-39"), class = "factor"),
edu = structure(c(1L, 2L, 1L, 1L, 2L, 2L), .Label = c("Primary",
"Secondary"), class = "factor"), blood = c(5.5, 8.7, 10,
11, 10, 9)), .Names = c("id", "age", "edu", "blood"), class = "data.frame", row.names = c(NA,
-6L))
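For comparison, the wide layout asked for in the question can also be sketched in base R, without tidyr. This is only an illustration under assumptions: the data frame is rebuilt by hand from the dput() above (with character rather than factor columns), and the names age_counts, edu_counts, and wide are made up for this example.

```r
# Data rebuilt by hand from the dput() above (assumption: plain character columns)
df <- data.frame(
  id    = c(1L, 1L, 1L, 2L, 2L, 2L),
  age   = c("30-39", "20-29", "30-39", "30-39", "20-29", "20-29"),
  edu   = c("Primary", "Secondary", "Primary", "Primary", "Secondary", "Secondary"),
  blood = c(5.5, 8.7, 10, 11, 10, 9)
)

# Cross-tabulate id against each categorical column to get per-id counts
age_counts <- as.data.frame.matrix(table(df$id, df$age))
names(age_counts) <- paste0("age_", names(age_counts))
edu_counts <- as.data.frame.matrix(table(df$id, df$edu))
names(edu_counts) <- paste0("edu_", names(edu_counts))

# Median of blood per id (tapply orders groups the same way table() does)
blood_median <- tapply(df$blood, df$id, median)

# Assemble one row per id; check.names = FALSE keeps names like "age_20-29"
wide <- data.frame(
  id = as.integer(rownames(age_counts)),
  age_counts,
  edu_counts,
  blood_median = as.vector(blood_median),
  check.names = FALSE
)
wide
```

Counting with table() avoids the intermediate long format entirely, which may be easier to follow when only counts and one summary statistic are needed.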

Related

Creating new columns in dataset as a lookup function in R?

Let's say I have a data table that consists of monthly stock returns:
Company Year return next years return
1 1 5
1 2 6
1 3 2
1 4 4
For a large dataset with multiple companies and years, how can I add a new column containing next year's return? For example, the first row would hold the second year's return of 6%, and so on. In Excel I could simply use INDEX/MATCH, but I have no idea how it's done in R. The reason for not using Excel is that it takes over 20 hours to compute all the functions, as INDEX/MATCH is extremely slow. The code needs to do this for all companies, so it has to find the correct company for the correct year and then put the value into the new column.
You could group by the company and use lead() to get the next value:
library(dplyr)
df <- data.frame(
company = c(1L, 1L, 1L, 1L, 2L, 2L),
year = c(1L, 2L, 3L, 4L, 1L, 2L),
return_ = c(5L, 6L, 2L, 4L, 2L, 4L))
df
#> company year return_
#> 1 1 1 5
#> 2 1 2 6
#> 3 1 3 2
#> 4 1 4 4
#> 5 2 1 2
#> 6 2 2 4
df %>% group_by(company) %>%
mutate(next.years.return = lead(return_, order_by = year))
#> # A tibble: 6 × 4
#> # Groups: company [2]
#> company year return_ next.years.return
#> <int> <int> <int> <int>
#> 1 1 1 5 6
#> 2 1 2 6 2
#> 3 1 3 2 4
#> 4 1 4 4 NA
#> 5 2 1 2 4
#> 6 2 2 4 NA
Created on 2023-02-10 with reprex v2.0.2
Getting the next year's return only if it really is the next year:
library(dplyr)
df %>%
group_by(Company) %>%
arrange(Company, Year) %>%
mutate("next years return" =
if_else(lead(Year) - Year == 1, lead(`return`), NA)) %>%
ungroup()
# A tibble: 8 × 4
Company Year return `next years return`
<dbl> <dbl> <int> <int>
1 1 1 5 NA
2 1 3 2 4
3 1 4 4 6
4 1 5 6 NA
5 2 1 5 6
6 2 2 6 2
7 2 3 2 4
8 2 4 4 NA
Data
df <- structure(list(Company = c(1, 1, 1, 1, 2, 2, 2, 2), Year = c(1,
5, 3, 4, 4, 3, 2, 1), return = c(5L, 6L, 2L, 4L, 4L, 2L, 6L,
5L)), row.names = c("1", "2", "3", "4", "41", "31", "21", "11"
), class = "data.frame")
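Since the question asks for an INDEX/MATCH-style lookup, the same result can also be sketched as a self-join in base R. This is an illustration under assumptions, not part of either answer: it reuses the small df from the first answer, and the names lookup, next_return, and out are hypothetical. Shifting the key by one year means a value is filled in only when the following year actually exists for that company.

```r
# Example data copied from the first answer
df <- data.frame(
  company = c(1L, 1L, 1L, 1L, 2L, 2L),
  year    = c(1L, 2L, 3L, 4L, 1L, 2L),
  return_ = c(5L, 6L, 2L, 4L, 2L, 4L)
)

# Lookup table: each return is re-keyed to the year before it
lookup <- data.frame(
  company     = df$company,
  year        = df$year - 1L,
  next_return = df$return_
)

# Left join on (company, year); rows without a following year get NA
out <- merge(df, lookup, by = c("company", "year"), all.x = TRUE)
out <- out[order(out$company, out$year), ]
out
```

The join does not depend on row order, so the data does not need to be sorted first.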

Add a mean column to a table dataframe in R

I have a dataframe such as:
COL1 VALUE1 VALUE2
1 A,A 1 5
2 A,A,B 1 3
3 C 1 1
4 D 1 2
5 D 1 2
6 A,A 1 10
7 A,B,A 1 2
and I managed to remove the duplicated letters within each COL1 entry and count how often each de-duplicated entry occurs by using:
as.data.frame(table(tab$COL1)) %>%
group_by(Var1 = sapply(strsplit(as.character(Var1), ","), function(x) toString(unique(x)))) %>%
summarise(Freq = sum(Freq))
And then I get:
# A tibble: 4 × 2
Var1 Freq
<chr> <int>
1 A 2
2 A, B 2
3 C 1
4 D 2
But I wondered if someone had an idea how to add a new column called Mean that holds, for each COL1 group, the mean of the VALUE2 values, to get:
Var1 Freq Mean
1 A 2 7.5 < because (5+10)/2 =7.5
2 A, B 2 2.5 < because (3+2)/2 =2.5
3 C 1 1 < because 1/1 = 1
4 D 2 2 < because (2+2)/2 = 2
Here is the dataframe, if it helps:
structure(list(COL1 = structure(c(1L, 2L, 4L, 5L, 5L, 1L, 3L), .Label = c("A,A",
"A,A,B", "A,B,A", "C", "D"), class = "factor"), VALUE1 = c(1L,
1L, 1L, 1L, 1L, 1L, 1L), VALUE2 = c(5L, 3L, 1L, 2L, 2L, 10L,
2L)), class = "data.frame", row.names = c(NA, -7L))
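You can calculate the frequency table directly in the dplyr chain and then add Mean = mean(VALUE2) to the summarise() call. Note that Freq = sum(VALUE1) gives a row count only because VALUE1 is 1 in every row of this data; n() would count rows regardless of the values.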
I.e.
tab %>%
group_by(Var1 = sapply(strsplit(as.character(COL1), ","), function(x) toString(unique(x)))) %>%
summarise(Freq = sum(VALUE1), Mean = mean(VALUE2))
# # A tibble: 4 x 3
# Var1 Freq Mean
# <chr> <int> <dbl>
# 1 A 2 7.5
# 2 A, B 2 2.5
# 3 C 1 1
# 4 D 2 2
Is this what you want?
library(dplyr)
tab %>%
mutate(COL1 = sapply(strsplit(as.character(COL1), ","), function(x) toString(unique(x)))) %>%
group_by(COL1) %>%
summarise(Freq = sum(VALUE1),
Mean = mean(VALUE2))
# A tibble: 4 x 3
COL1 Freq Mean
* <chr> <int> <dbl>
1 A 2 7.5
2 A, B 2 2.5
3 C 1 1
4 D 2 2
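The same grouping idea can be sketched in base R, which may help show what the dplyr chain is doing. This is an illustration only: the data is retyped from the structure() in the question (with COL1 as character), and the names key and res are made up for the example. It counts rows with table() instead of summing VALUE1.

```r
# Data retyped from the question (assumption: COL1 as plain character)
tab <- data.frame(
  COL1   = c("A,A", "A,A,B", "C", "D", "D", "A,A", "A,B,A"),
  VALUE1 = rep(1L, 7),
  VALUE2 = c(5L, 3L, 1L, 2L, 2L, 10L, 2L)
)

# Normalised group key: split on commas, drop repeated letters, rejoin
key <- sapply(strsplit(as.character(tab$COL1), ","),
              function(x) toString(unique(x)))

# table() and tapply() both order their groups by the sorted unique keys
res <- data.frame(
  Var1 = sort(unique(key)),
  Freq = as.vector(table(key)),
  Mean = as.vector(tapply(tab$VALUE2, key, mean))
)
res
```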

R: How to aggregate a dataframe using an index column?

I have a dataframe which looks as following:
head(test_df, n =15)
# print the first 15 rows of the dataframe
value frequency index
1 -2.90267705917358 1 1
2 -2.90254878997803 1 1
3 -2.90252590179443 1 1
4 -2.90219354629517 1 1
5 -2.90201354026794 1 1
6 -2.9016375541687 1 1
7 -2.90107154846191 1 1
8 -2.90089440345764 1 1
9 -2.89996957778931 1 1
10 -2.89970088005066 1 1
11 -2.89928865432739 1 2
12 -2.89920520782471 1 2
13 -2.89907360076904 1 2
14 -2.89888191223145 1 2
15 -2.8988630771637 1 2
The dataframe has 3 columns and 61819 rows. To aggregate the dataframe, I want to get the mean value for the columns 'value' and 'frequency' for all rows with the same 'index'.
I already found some useful links, see:
https://www.r-bloggers.com/2018/07/how-to-aggregate-data-in-r/
Which is the simplest way to aggregate rows (sum) by columns values the following type of data frame on R?
However, I could not solve the problem yet.
test_df_ag <- stats::aggregate(test_df[1:2], by = test_df[3], FUN = 'mean')
# aggregate the dataframe based on the 'index' column (build the mean)
index value frequency
1 1 NA 1
2 2 NA 1
3 3 NA 1
4 4 NA 1
5 5 NA 1
6 6 NA 1
7 7 NA 1
8 8 NA 1
9 9 NA 1
10 10 NA 1
11 11 NA 1
12 12 NA 1
13 13 NA 1
14 14 NA 1
15 15 NA 1
Since I only get NA values for the column 'value', I wonder whether it might just be a data type issue. However, my attempts to convert the data type also failed.
base::typeof(test_df$value)
# query the data type of the 'value' column
[1] "integer"
1. Here is a base R solution.
aggregate(cbind(value, frequency) ~ index, data = test_df, FUN = mean)
# index value frequency
#1 1 -2.901523 1
#2 2 -2.899062 1
2. And a simple dplyr solution.
library(dplyr)
test_df %>%
group_by(index) %>%
summarize(across(1:2, mean))
## A tibble: 2 x 3
# index value frequency
#* <int> <dbl> <dbl>
#1 1 -2.90 1
#2 2 -2.90 1
Data
test_df <-
structure(list(value = c(-2.90267705917358, -2.90254878997803,
-2.90252590179443, -2.90219354629517, -2.90201354026794, -2.9016375541687,
-2.90107154846191, -2.90089440345764, -2.89996957778931, -2.89970088005066,
-2.89928865432739, -2.89920520782471, -2.89907360076904, -2.89888191223145,
-2.8988630771637), frequency = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), index = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))
Using data.table
library(data.table)
setDT(test_df)[, lapply(.SD, mean), by = index, .SDcols = 1:2]
Try the tidyverse:
test_summary <- test_df %>%
group_by(index) %>%
summarise(n = n(), mean_value = mean(value, na.rm = TRUE), mean_frequency = mean(frequency, na.rm = TRUE))
Oh, and, of course, you should make sure you've checked the quality of your data and understand the ifs and whys of any NAs in your data set.
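One possible explanation for the NA column, offered as an assumption since the asker's full data is not shown: typeof() returns "integer" for a factor, and mean() of a factor is NA (with a warning). The sketch below uses made-up values to show the conversion through as.character() before aggregating.

```r
# Made-up data in which 'value' was read in as a factor (assumption)
test_df <- data.frame(
  value     = factor(c("-2.5", "-1.5", "-4.0", "-1.0")),
  frequency = c(1L, 1L, 1L, 1L),
  index     = c(1L, 1L, 2L, 2L)
)

typeof(test_df$value)  # "integer": factors are stored as integer codes
class(test_df$value)   # "factor"

# Convert via character so the labels, not the internal codes, become numbers
test_df$value <- as.numeric(as.character(test_df$value))

agg <- aggregate(cbind(value, frequency) ~ index, data = test_df, FUN = mean)
agg
```

Checking class() rather than typeof() is usually the quicker way to spot this, because typeof() reports only the underlying storage.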

Cumulative sums by month in R

I want to transform my data from this
Month Expenditures
1 1
1 2
2 3
2 6
3 2
3 5
to this:
Month Cumulative_expenditures
1 3
2 12
3 19
, but can't seem to figure out how to do it.
I tried using the cumsum() function, but it counts each observation - it doesn't distinguish between groups.
Any help would be much appreciated!
A two-step base R solution would be:
#Code
df1 <- aggregate(Expenditures~Month,data=mydf,sum)
#Create cum sum
df1$Expenditures <- cumsum(df1$Expenditures)
Output:
Month Expenditures
1 1 3
2 2 12
3 3 19
Some data used:
#Data
mydf <- structure(list(Month = c(1L, 1L, 2L, 2L, 3L, 3L), Expenditures = c(1L,
2L, 3L, 6L, 2L, 5L)), class = "data.frame", row.names = c(NA,
-6L))
Using dplyr:
library(dplyr)
df %>%
group_by(Month) %>%
summarise(Expenditures = sum(Expenditures), .groups = "drop") %>%
mutate(Expenditures = cumsum(Expenditures))
#> # A tibble: 3 x 2
#> Month Expenditures
#> <int> <int>
#> 1 1 3
#> 2 2 12
#> 3 3 19
Or in base R:
data.frame(Month = unique(df$Month),
Expenditure = cumsum(tapply(df$Expenditures, df$Month, sum)))
#> Month Expenditure
#> 1 1 3
#> 2 2 12
#> 3 3 19
Here is another base R option using subset + ave
subset(
transform(df, Expenditures = cumsum(Expenditures)),
ave(rep(FALSE, nrow(df)), Month, FUN = function(x) seq_along(x) == length(x))
)
which gives
Month Expenditures
2 1 3
4 2 12
6 3 19
We can use base R
out <- with(df1, rowsum(Expenditures, Month))
data.frame(Month = row.names(out), Expenditure = cumsum(out))
# Month Expenditure
#1 1 3
#2 2 12
#3 3 19
Or more compactly
with(df1, stack(cumsum(rowsum(Expenditures, Month)[,1])))[2:1]
data
df1 <- structure(list(Month = c(1L, 1L, 2L, 2L, 3L, 3L), Expenditures = c(1L,
2L, 3L, 6L, 2L, 5L)), class = "data.frame", row.names = c(NA,
-6L))

calculate descriptives for a nested variable

I want to calculate the M, min, and max of a variable. Data were collected at different visits. My data look like this:
id visit V1
1 1 18
1 2 24
2 2 NA
2 3 5
2 4 6
I want it to look like this, where I have columns for the M, SD, min, and max for V1 for each participant.
id visit V1 M MIN MAX
1 1 18 21 18 24
2 2 3 4.67 3 6
When calculating M, I want to take the number of visits into account (e.g., (18 + 24) / 2 visits). I tried this as a first step:
df %>%
group_by(id) %>%
mutate(M = mean(V1), MIN = min(V1), MAX = max(V1), na.rm = T)
When I try to handle the NAs by making sure they are not included, the na.rm = T results in a new column entitled "na.rm" with every value being true, which isn't what I want. Any thoughts on making this work?
The dplyr package makes this easy. You can group_by() a variable, and whatever you do after that only applies within the group. In dplyr notation, the %>% is a special operator that feeds the outcome of the function on the left into the first argument of the function on the right.
There are two ways to do it. The first way keeps all of the data, but your summary statistics are repeated in each row.
library(dplyr)
df %>%
group_by(id) %>%
mutate(M = mean(V1), MIN = min(V1), MAX = max(V1))
id visit V1 M MIN MAX
1 1 18 21 18 24
1 2 24 21 18 24
2 2 3 4.67 3 6
2 3 5 4.67 3 6
2 4 6 4.67 3 6
The second way provides only the summary statistics by the group.
library(dplyr)
df %>%
group_by(id) %>%
summarize(M = mean(V1), MIN = min(V1), MAX = max(V1))
id M MIN MAX
1 21 18 24
2 4.67 3 6
You can try this dplyr approach, similar to @ThomasIsCoding's, that produces something close to what you want:
library(dplyr)
#Data
df <- structure(list(id = c(1L, 1L, 2L, 2L, 2L), visit = c(1L, 2L,
2L, 3L, 4L), V1 = c(18L, 24L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
The code:
df %>% group_by(id) %>% mutate(M=mean(V1),Min=min(V1),Max=max(V1),SD=sd(V1))
Output:
# A tibble: 5 x 7
# Groups: id [2]
id visit V1 M Min Max SD
<int> <int> <int> <dbl> <int> <int> <dbl>
1 1 1 18 21 18 24 4.24
2 1 2 24 21 18 24 4.24
3 2 2 3 4.67 3 6 1.53
4 2 3 5 4.67 3 6 1.53
5 2 4 6 4.67 3 6 1.53
Maybe you want something like below
transform(df,
M = ave(V1, id, FUN = mean),
MIN = ave(V1, id, FUN = min),
MAX = ave(V1, id, FUN = max)
)
which gives
id visit V1 M MIN MAX
1 1 1 18 21.000000 18 24
2 1 2 24 21.000000 18 24
3 2 2 3 4.666667 3 6
4 2 3 5 4.666667 3 6
5 2 4 6 4.666667 3 6
Data
> dput(df)
structure(list(id = c(1L, 1L, 2L, 2L, 2L), visit = c(1L, 2L,
2L, 3L, 4L), V1 = c(18L, 24L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
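On the na.rm part of the question: na.rm is an argument of mean(), min(), and max() themselves, so it has to go inside each call (for example mean(V1, na.rm = TRUE)); passed to mutate() directly, it is treated as a new column. The base R sketch below illustrates this with the NA from the question's original data kept in place. The name per_id is made up for the example, and the resulting numbers differ from the answers above because those replaced the NA with 3.

```r
# Data as in the question, with the NA kept (the answers above use 3 instead)
df <- data.frame(
  id    = c(1L, 1L, 2L, 2L, 2L),
  visit = c(1L, 2L, 2L, 3L, 4L),
  V1    = c(18, 24, NA, 5, 6)
)

# One summary row per id; na.rm = TRUE is passed to each summary function
per_id <- do.call(rbind, lapply(split(df, df$id), function(d) {
  data.frame(
    id  = d$id[1],
    M   = mean(d$V1, na.rm = TRUE),
    MIN = min(d$V1, na.rm = TRUE),
    MAX = max(d$V1, na.rm = TRUE)
  )
}))
per_id
```

With the NA dropped, the mean for id 2 is taken over the two observed visits only.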
