How to transform data based on group and variable name [duplicate] - r

This question already has answers here:
Convert data from long format to wide format with multiple measure columns
(6 answers)
Reshape multiple value columns to wide format
(5 answers)
Closed 3 years ago.
How can I reshape the first data.frame below into the second one in R? I am new to dplyr/tidyr, so I do not know exactly which functions to use, but I guess it can be done with these packages.
NAME GROUP X1 X2
A G1 1 2
A G2 1 3
A G3 4 3
B G1 3 3
B G2 2 3
B G3 5 4
C G1 4 3
C G2 4 1
C G3 4 3
NAME X1_G1 X2_G1 X1_G2 X2_G2 X1_G3 X2_G3
A 1 2 1 3 4 3
B 3 3 2 3 5 4
C 4 3 4 1 4 3
Thank you for your help!

library(tidyr) #v1.0.0
pivot_wider(df, names_from = GROUP, values_from = c(X1, X2))
# A tibble: 3 x 7
NAME X1_G1 X1_G2 X1_G3 X2_G1 X2_G2 X2_G3
<chr> <int> <int> <int> <int> <int> <int>
1 A 1 1 4 2 3 3
2 B 3 2 5 3 3 4
3 C 4 4 4 3 1 3
PS1: Why the downvote?
I know this is a duplicate, but this release of the library has only been out for a month, so this can be considered a new way to solve the problem.
PS2: Dear Moderator, you should clarify before you delete someone's comment.
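If the column order requested in the question (X1_G1, X2_G1, X1_G2, ...) matters, newer tidyr releases offer a names_vary argument to pivot_wider. The following is a minimal sketch, assuming tidyr 1.2.0 or later; the data frame df is rebuilt here from the question's table purely for illustration.

```r
library(tidyr)

# The question's input, rebuilt for illustration
df <- read.table(header = TRUE, text = "
NAME GROUP X1 X2
A G1 1 2
A G2 1 3
A G3 4 3
B G1 3 3
B G2 2 3
B G3 5 4
C G1 4 3
C G2 4 1
C G3 4 3
")

# names_vary = "slowest" keeps X1 and X2 side by side for each GROUP value
wide <- pivot_wider(df, names_from = GROUP, values_from = c(X1, X2),
                    names_vary = "slowest")
names(wide)
```

With the default names_vary = "fastest", the columns come out grouped by X1 and then X2, as in the answer's output above.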

Related

How to subset multiple times from larger data frame in R

I have the following wide data in a .csv:
Subj Show1_judgment1 Show1_judgment2 Show1_judgment3 Show1_judgment4 Show2_judgment1 Show2_judgment2 Show2_judgment3 Show2_judgment4
1 3 2 5 6 3 2 5 7
2 1 3 5 4 3 4 1 6
3 1 2 6 2 3 7 2 6
The columns keep going for the same four judgments for a series of 130 different shows.
I want to change this data into long form so that it looks like this:
Subj show judgment1 judgment2 judgment3 judgment4
1 show1 2 5 6 1
1 show2 3 5 4 4
1 show3 2 6 2 5
Usually, I would use base R to subset the columns into their own data frames and then use rbind to combine them into one data frame.
But since there are so many different shows, doing it that way would be very inefficient. I am a relative novice at R, so I can only write very basic for loops, but I think a for loop that subsets the subject column (the first column in the data) together with each group of 4 sequential columns would do this much more efficiently.
Can anyone help me create a for loop for this?
Thank you in advance for your help!
No for loop is required; this is transforming or "pivoting" from wide to long format.
tidyr
tidyr::pivot_longer(dat, -Subj, names_pattern = "(.*)_(.*)", names_to = c("show", ".value"))
# # A tibble: 6 x 6
# Subj show judgment1 judgment2 judgment3 judgment4
# <int> <chr> <int> <int> <int> <int>
# 1 1 Show1 3 2 5 6
# 2 1 Show2 3 2 5 7
# 3 2 Show1 1 3 5 4
# 4 2 Show2 3 4 1 6
# 5 3 Show1 1 2 6 2
# 6 3 Show2 3 7 2 6
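The key piece of the call above is the special name ".value": the part of each column name matched to ".value" becomes an output column name instead of a value in a key column. A self-contained sketch of the same idea follows; dat is rebuilt from the question's table for illustration, and names_sep = "_" is used in place of names_pattern on the assumption that every column name contains exactly one underscore.

```r
library(tidyr)

# The question's wide data, rebuilt for illustration
dat <- read.table(header = TRUE, text = "
Subj Show1_judgment1 Show1_judgment2 Show1_judgment3 Show1_judgment4 Show2_judgment1 Show2_judgment2 Show2_judgment3 Show2_judgment4
1 3 2 5 6 3 2 5 7
2 1 3 5 4 3 4 1 6
3 1 2 6 2 3 7 2 6
")

# Text before "_" goes to the show column; text after "_" (".value")
# names the output columns judgment1 ... judgment4
long <- pivot_longer(dat, cols = -Subj,
                     names_to = c("show", ".value"), names_sep = "_")
long
```

The same call should scale to all 130 shows without changes, provided the column names keep the Show<N>_judgment<M> pattern.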
data.table
Requires data.table 1.14.3 or later, which is relatively new (or install it from GitHub).
# dat must be a data.table here, e.g. after data.table::setDT(dat)
data.table::melt(
  dat, id.vars = "Subj",
  measure.vars = measure(show, value.name, sep = "_"))

wide to long format R [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 2 years ago.
I've got a dataset like this:
dat1 <- read.table(header=TRUE, text="
ID A_1 B_1 C_1 A_2 B_2 C_2
1 1 2 1 5 5 5
2 1 3 3 4 4 1
3 1 3 1 3 2 2
4 2 5 5 3 2 2
5 1 4 1 2 1 3
")
I would like to convert this to a long format, with one column for the ID, one for the system (1 or 2), one for the variable (A, B, or C) and one with the value.
I can't figure out how to do that, and I would be very grateful if somebody could help me out.
I already tried the pivot_longer command, but it only gives me three columns: one for the ID, one for the variables, and one for the value.
Thanks!!
You can use pivot_longer in the following way:
tidyr::pivot_longer(dat1, cols = -ID,
                    names_to = c('variable', 'system'), names_sep = '_')
# ID variable system value
# <int> <chr> <chr> <int>
# 1 1 A 1 1
# 2 1 B 1 2
# 3 1 C 1 1
# 4 1 A 2 5
# 5 1 B 2 5
# 6 1 C 2 5
# 7 2 A 1 1
# 8 2 B 1 3
# 9 2 C 1 3
#10 2 A 2 4
# … with 20 more rows
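Note that system comes out as a character column in the output above. If an integer column is preferred, pivot_longer accepts a names_transform argument. This is a small sketch, assuming tidyr 1.1.0 or later and the dat1 defined in the question:

```r
library(tidyr)

dat1 <- read.table(header = TRUE, text = "
ID A_1 B_1 C_1 A_2 B_2 C_2
1 1 2 1 5 5 5
2 1 3 3 4 4 1
3 1 3 1 3 2 2
4 2 5 5 3 2 2
5 1 4 1 2 1 3
")

# names_transform converts the pieces split out of the column names;
# here the system part ("1"/"2") becomes an integer
long1 <- pivot_longer(dat1, cols = -ID,
                      names_to = c("variable", "system"), names_sep = "_",
                      names_transform = list(system = as.integer))
str(long1)
```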

Replacing NA with observed values? [duplicate]

This question already has answers here:
Filling missing value in group
(3 answers)
Replace NA with previous or next value, by group, using dplyr
(5 answers)
Closed 2 years ago.
I have a dataset that contains multiple observations per person. In some cases an individual will have their ethnicity recorded in some rows but missing in others. In R, how can I replace the NA's with the ethnicity stated in the other rows without having to manually change them?
Example:
PersonID Ethnicity
1 A
1 A
1 NA
1 NA
1 A
2 NA
2 B
2 NA
3 NA
3 NA
3 A
3 NA
Need:
PersonID Ethnicity
1 A
1 A
1 A
1 A
1 A
2 B
2 B
2 B
3 A
3 A
3 A
3 A
You could use fill from tidyr:
library(dplyr)
library(tidyr)
df %>%
  group_by(PersonID) %>%
  fill(Ethnicity, .direction = "downup")
# A tibble: 12 x 2
# Groups: PersonID [3]
PersonID Ethnicity
<int> <fct>
1 1 A
2 1 A
3 1 A
4 1 A
5 1 A
6 2 B
7 2 B
8 2 B
9 3 A
10 3 A
11 3 A
12 3 A
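For a self-contained run of the approach above, the example data can be rebuilt with read.table, which reads the literal text NA as a missing value by default. This is only a sketch under that assumption; if the real data stores missing ethnicities as empty strings or some other code, they would need converting to NA first.

```r
library(dplyr)
library(tidyr)

# Example data from the question; "NA" is parsed as a missing value
df <- read.table(header = TRUE, text = "
PersonID Ethnicity
1 A
1 A
1 NA
1 NA
1 A
2 NA
2 B
2 NA
3 NA
3 NA
3 A
3 NA
")

# Within each person, fill downwards first, then upwards for leading NAs
filled <- df %>%
  group_by(PersonID) %>%
  fill(Ethnicity, .direction = "downup") %>%
  ungroup()

anyNA(filled$Ethnicity)
```

A person whose ethnicity is missing in every row would stay NA, since there is no observed value to copy.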

Generate data frame with parameters [duplicate]

This question already has answers here:
Fill missing dates by group
(3 answers)
Fastest way to add rows for missing time steps?
(4 answers)
Closed 3 years ago.
I have a data frame of ids with a number column:
df <- read.table(text="
id nr
1 1
2 1
1 2
3 1
1 3
", header=TRUE)
I'd like to create a new data frame from it in which every id is paired with every unique nr that occurs in df. As you may notice, id 3, for example, has only nr 1, but not 2 and 3. So the result should be:
result <- read.table(text="
id nr
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
", header=TRUE)
You can use expand.grid:
library(dplyr)
result <- expand.grid(id = unique(df$id), nr = unique(df$nr)) %>%
  arrange(id)
result
id nr
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
We can do:
tidyr::expand(df,id,nr)
# A tibble: 9 x 2
id nr
<int> <int>
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
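If the real data frame carries further columns that should be kept, tidyr::complete is a related option: it adds the missing id/nr combinations and fills the other columns with NA. A brief sketch follows; the value column is hypothetical and is added only to show this behaviour.

```r
library(tidyr)

df <- read.table(text = "
id nr
1 1
2 1
1 2
3 1
1 3
", header = TRUE)

# Hypothetical extra column, not part of the question's data
df$value <- c(10, 20, 30, 40, 50)

# complete() adds the 4 missing id/nr pairs; their value is NA
full <- complete(df, id, nr)
full
```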

Removing duplicates in R

I have a large dataset (>37 m individuals) and I am using R. I am very much a beginner. Currently, I'm trying (and trying, and trying) to calculate the average household size per Province in the Country that I am analyzing. I have managed to create a separate data frame, with the required variables to give an individual number to each person and thus a household number under the variable called HH (for HouseHolds). Now I want R to remove the duplicates from this specific column in the new data frame that I created, i.e. the HH column.
I have tried numerous times using the duplicated() and unique() functions, but it does not work. I've also tried to isolate this "HH" column in a separate sheet, but these functions still do not remove the duplicates. I've also tried converting it into a vector and then applying the duplicated() and unique() functions (as you can see beneath).
When I use a smaller sample in Excel it works perfectly well (asking Excel to remove the duplicates).
This is how I created my dataset based on my initial dataset (i.e. PHCKCON):
HHvars<-c("eano", "county", "tif")
HHKE<-PHCKCON[HHvars]
as.numeric(HHKE$county)
HHKE$county<-as.numeric(HHKE$county)
Then I created a fourth column for my households:
HHKE$HH<-(paste(HHKE$eano, HHKE$county, HHKE$tif))
Here is an example of my dataset:
The values in the first three columns are numeric, whilst those in the last are classified as character.
Here is a small sample of the data (I invented these but same idea):
Enumeration.area County Household.members
1 a 4
1 a 4
1 a 6
1 a 6
1 a 8
1 a 8
1 a 8
2 a 4
2 a 4
2 a 6
1 b 6
1 b 6
1 b 8
1 b 8
1 b 12
1 b 12
1 b 12
1 b 12
And here is what I did to create my 4th column called HH:
mydata$HH<-paste(mydata$Enumeration.area, mydata$County, mydata$Household.members)
It then gives a fourth column.
HH
1 a 4
1 a 4
1 a 6
1 a 6
1 a 8
1 a 8
1 a 8
1 a 8
2 a 4
2 a 4
2 a 6
2 a 8
1 b 6
1 b 6
1 b 8
1 b 8
1 b 12
1 b 12
1 b 12
1 b 12
Then I created a separate dataset for my HH column (in order to remove the duplicates):
attach(mydata)
HHvars<-c("HH")
EX2<-mydata[HHvars]
I then tried to remove the duplicates from the HH column of EX2:
EX2[!duplicated(EX2$HH),]
But it is not working, and neither is the unique() function.
I hope that it is clearer! And still grateful for any help.
Cheers,
Madeleine
If what you're asking for is simply the mean and median for each county of each enumeration.area, you can do this rather quickly using dplyr. I made up some data below to somewhat match yours.
library(dplyr)
HH <- data.frame(
Enumeration.area=c(1,1,1,2,2,2,3,3,3),
County=c('a','a','b','a','a','a','b','a','b'),
Household.members=c(4,6,5,8,10,9,3,4,3)
)
HH %>%
  group_by(Enumeration.area, County) %>%
  summarise(mean = mean(Household.members),
            median = median(Household.members))
Which results in:
Enumeration.area County mean median
(dbl) (fctr) (dbl) (dbl)
1 1 a 5 5
2 1 b 5 5
3 2 a 9 9
4 3 a 4 4
5 3 b 3 3
Then each row of the resulting data set is a unique combination of Enumeration.area and County, and for each of those combinations you'll have your mean and median household numbers.
edit:
Since your desired output involves creating a concatenated identifier for each observation, this is how you could do that:
df <- HH %>%
  group_by(Enumeration.area, County) %>%
  mutate(id = paste(Enumeration.area, County, Household.members))
This will create a character string that is the combination of Enumeration.area, County, and Household.members. Then using distinct(id) will remove any duplicates, as shown below:
df
Enumeration.area County Household.members id
(dbl) (fctr) (dbl) (chr)
1 1 a 4 1 a 4
2 1 a 6 1 a 6
3 1 b 5 1 b 5
4 2 a 8 2 a 8
5 2 a 10 2 a 10
6 2 a 9 2 a 9
7 3 b 3 3 b 3
8 3 a 4 3 a 4
9 3 b 3 3 b 3
df %>% distinct(id, .keep_all = TRUE)  # .keep_all = TRUE keeps the other columns in current dplyr versions
Enumeration.area County Household.members id
(dbl) (fctr) (dbl) (chr)
1 1 a 4 1 a 4
2 1 a 6 1 a 6
3 1 b 5 1 b 5
4 2 a 8 2 a 8
5 2 a 10 2 a 10
6 2 a 9 2 a 9
7 3 b 3 3 b 3
8 3 a 4 3 a 4
As you can see, the duplicate row "3 b 3" has now just been reduced to one unique observation.
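One possible reason the base R attempt in the question appeared to do nothing (this is an assumption, since the question does not show how the result was used) is that duplicated() and unique() return a new object rather than modifying the data in place, so the result has to be assigned. A minimal sketch with made-up rows in the shape of the question's sample:

```r
# Made-up rows in the shape of the question's sample data
mydata <- data.frame(
  Enumeration.area  = c(1, 1, 1, 2, 2, 1, 1),
  County            = c("a", "a", "a", "a", "a", "b", "b"),
  Household.members = c(4, 4, 6, 4, 4, 6, 6)
)
mydata$HH <- paste(mydata$Enumeration.area, mydata$County,
                   mydata$Household.members)

# duplicated() flags repeats; the filtered rows must be stored in a new
# object, because mydata itself is left unchanged
unique_hh <- mydata[!duplicated(mydata$HH), ]

nrow(mydata)     # original row count, still includes the repeats
nrow(unique_hh)  # one row per distinct HH value
```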
