Can you merge your data without creating separate data frames in R? - r

My data frame looks something like the following:
sex year country value
F 2010 AU 350
F 2011 GE 258
M 2010 AU 250
F 2012 GE 928
To create another data frame that is merged by year and country, with sex and value being what you want to compare, I currently create separate data frames first, like:
f <- subset(df, sex=="F")
m <- subset(df, sex=="M")
df_new <- merge(f, m, by=c("country", "year"), suffixes=c("_f", "_m"))
This way, you obtain a new data frame in which year and country are matched and only the value differs.
However, I would rather not create separate data frames just to merge them. Is it possible to get this data frame with a single line of code?

Considering dput(dft) as :
structure(list(sex = structure(c(1L, 1L, 2L, 1L), .Label = c("F", "M"), class = "factor"),
year = c(2010, 2011, 2010, 2012),
country = structure(c(1L, 2L, 1L, 2L), .Label = c("AU", "GE"), class = "factor"),
value = c(350, 258, 250, 928)), .Names = c("sex", "year", "country", "value"),row.names = c(NA, -4L), class = "data.frame")
you can use tidyverse and do:
library(tidyverse)
dft %>% spread(sex, value)
which gives:
# year country F M
#1 2010 AU 350 250
#2 2011 GE 258 NA
#3 2012 GE 928 NA
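If the goal is to reproduce the inner-join result of merge (only the year/country pairs that have a value for both sexes), one possible base R sketch is to reshape to wide form and then drop the incomplete rows. The data frame name `df` and the combination of reshape() with na.omit() are assumptions made for illustration; they are not part of the answer above.

```r
# Sample data from the question (the name `df` is assumed)
df <- data.frame(
  sex = c("F", "F", "M", "F"),
  year = c(2010, 2011, 2010, 2012),
  country = c("AU", "GE", "AU", "GE"),
  value = c(350, 258, 250, 928)
)

# One value column per sex, one row per year/country pair
wide <- reshape(df, idvar = c("year", "country"), timevar = "sex",
                direction = "wide")

# Keep only the pairs that have a value for both sexes
matched <- na.omit(wide)
matched
```

Here `wide` still contains the unmatched GE rows with NA in `value.M`, while `matched` keeps only the 2010 AU row.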

We can split the data by sex and then use Reduce with merge to get the expected output
Reduce(function(...) merge(..., by = c("country", "year"),
suffixes = c("_f", "_m")), split(df, df$sex))
# country year sex_f value_f sex_m value_m
#1 AU 2010 F 350 M 250
NOTE: This should also work when the split column has any number ('n') of unique elements (without the suffixes argument, or with it modified accordingly)
A reshaping option with data.table is
library(data.table)
na.omit(dcast(setDT(df), country + year ~ rowid(country, year),
value.var = c("sex", "value")))
# country year sex_1 sex_2 value_1 value_2
#1: AU 2010 F M 350 250

Related

Finding sum of data frame column in rows that contain certain value in R

I'm working on a March Madness project. I have a data frame df.A with every team and season.
For example:
Season Team Name Code
2003 Creighton 2003-1166
2003 Notre Dame 2003-1323
2003 Arizona 2003-1112
And another data frame df.B with the results of every game in every season:
WTeamScore LTeamScore WTeamCode LTeamCode
15 10 2003-1166 2003-1323
20 15 2003-1323 2003-1112
10 5 2003-1112 2003-1166
I'm trying to get a column in df.A that totals the number of points in both wins and losses. Basically:
Season Team Name Code Points
2003 Creighton 2003-1166 20
2003 Notre Dame 2003-1323 30
2003 Arizona 2003-1112 25
There are obviously thousands more rows in each data frame, but this is the general idea. What would be the best way of going about this?
Here is another option using tidyverse, where we can pivot df.B to long form, then get the sum for each team, then join back to df.A.
library(tidyverse)
df.B %>%
pivot_longer(everything(),names_pattern = "(WTeam|LTeam)(.*)",
names_to = c("rep", ".value")) %>%
group_by(Code) %>%
summarise(Points = sum(Score)) %>%
left_join(df.A, ., by = "Code")
Output
Season Team.Name Code Points
1 2003 Creighton 2003-1166 20
2 2003 Notre Dame 2003-1323 30
3 2003 Arizona 2003-1112 25
Data
df.A <- structure(list(Season = c(2003L, 2003L, 2003L), Team.Name = c("Creighton",
"Notre Dame", "Arizona"), Code = c("2003-1166", "2003-1323",
"2003-1112")), class = "data.frame", row.names = c(NA, -3L))
df.B <- structure(list(WTeamScore = c(15L, 20L, 10L), LTeamScore = c(10L,
15L, 5L), WTeamCode = c("2003-1166", "2003-1323", "2003-1112"
), LTeamCode = c("2003-1323", "2003-1112", "2003-1166")), class = "data.frame", row.names = c(NA,
-3L))
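We may use match (from base R) between 'Code' in 'df.A' and 'WTeamCode', 'LTeamCode' in df.B to get the matching index, extract the corresponding 'Score' columns, and add them (+). Note that match returns only the first matching position, so this assumes each team appears at most once as a winner and once as a loser.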
df.A$Points <- with(df.A, df.B$WTeamScore[match(Code,
df.B$WTeamCode)] +
df.B$LTeamScore[match(Code, df.B$LTeamCode)])
-output
> df.A
Season TeamName Code Points
1 2003 Creighton 2003-1166 20
2 2003 Notre Dame 2003-1323 30
3 2003 Arizona 2003-1112 25
If there are nonmatches resulting in missing values (NA) from match, cbind the vectors to create a matrix and use rowSums with na.rm = TRUE
df.A$Points <- with(df.A, rowSums(cbind(df.B$WTeamScore[match(Code,
df.B$WTeamCode)],
df.B$LTeamScore[match(Code, df.B$LTeamCode)]), na.rm = TRUE))
data
df.A <- structure(list(Season = c(2003L, 2003L, 2003L), TeamName = c("Creighton",
"Notre Dame", "Arizona"), Code = c("2003-1166", "2003-1323",
"2003-1112")), class = "data.frame", row.names = c(NA, -3L))
df.B <- structure(list(WTeamScore = c(15L, 20L, 10L), LTeamScore = c(10L,
15L, 5L), WTeamCode = c("2003-1166", "2003-1323", "2003-1112"
), LTeamCode = c("2003-1323", "2003-1112", "2003-1166")),
class = "data.frame", row.names = c(NA,
-3L))
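With thousands of games, a team will usually appear many times in both the winner and the loser columns. A minimal base R sketch for that case, assuming the same column names as above, stacks the codes and scores into a long table, sums per code with aggregate(), and looks the totals up with match(). The helper names `long` and `totals` are hypothetical.

```r
df.A <- data.frame(
  Season = 2003L,
  TeamName = c("Creighton", "Notre Dame", "Arizona"),
  Code = c("2003-1166", "2003-1323", "2003-1112")
)
df.B <- data.frame(
  WTeamScore = c(15L, 20L, 10L),
  LTeamScore = c(10L, 15L, 5L),
  WTeamCode = c("2003-1166", "2003-1323", "2003-1112"),
  LTeamCode = c("2003-1323", "2003-1112", "2003-1166")
)

# Stack winners and losers so every row is one team in one game
long <- data.frame(
  Code = c(df.B$WTeamCode, df.B$LTeamCode),
  Score = c(df.B$WTeamScore, df.B$LTeamScore)
)

# Total points per team code, over all games
totals <- aggregate(Score ~ Code, data = long, FUN = sum)

# Attach the totals to df.A; teams without any game would get NA
df.A$Points <- totals$Score[match(df.A$Code, totals$Code)]
df.A
```

Because the sum is taken over every row of `long`, the result does not depend on how many games each team played.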

r transfer values from one dataset to another by ID

I have two datasets. The first dataset looks like this:
ID Weight State
1 12.34 NA
2 11.23 IA
2 13.12 IN
3 12.67 MA
4 10.89 NA
5 14.12 NA
The second dataset is a lookup table for state values by ID
ID State
1 WY
2 IA
3 MA
4 OR
4 CA
5 FL
As you can see there are two different state values for ID 4, which is normal.
What I want to do is replace the NAs in dataset1 State column with State values from dataset 2. Expected dataset
ID Weight State
1 12.34 WY
2 11.23 IA
2 13.12 IN
3 12.67 MA
4 10.89 OR,CA
5 14.12 FL
Since ID 4 has two state values in dataset2, these two values are collapsed, separated by a comma, and used to replace the NA in dataset1. Any suggestion on accomplishing this is much appreciated. Thanks in advance.
Collapse the State values in df2 and join the result with df1 by 'ID'. Then use coalesce to take the first non-NA value from the two State columns.
library(dplyr)
df1 %>%
left_join(df2 %>%
group_by(ID) %>%
summarise(State = toString(State)), by = 'ID') %>%
mutate(State = coalesce(State.x, State.y)) %>%
select(-State.x, -State.y)
# ID Weight State
#1 1 12.3 WY
#2 2 11.2 IA
#3 2 13.1 IN
#4 3 12.7 MA
#5 4 10.9 OR, CA
#6 5 14.1 FL
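Note that toString() separates values with ", " (comma plus space), whereas the expected output in the question shows "OR,CA" without a space. If the exact separator matters, a base R sketch along the following lines could be used. The input `df1` here is written out with NA values as in the question, and the helper names `collapsed` and `lookup` are hypothetical.

```r
df1 <- data.frame(
  ID = c(1L, 2L, 2L, 3L, 4L, 5L),
  Weight = c(12.34, 11.23, 13.12, 12.67, 10.89, 14.12),
  State = c(NA, "IA", "IN", "MA", NA, NA)
)
df2 <- data.frame(
  ID = c(1L, 2L, 3L, 4L, 4L, 5L),
  State = c("WY", "IA", "MA", "OR", "CA", "FL")
)

# Collapse the lookup table to one row per ID, joined by "," (no space)
collapsed <- aggregate(State ~ ID, data = df2, FUN = paste, collapse = ",")

# Look up the collapsed value for every row of df1
lookup <- as.character(collapsed$State)[match(df1$ID, collapsed$ID)]

# Fill only the missing states; existing values such as "IN" are kept
df1$State <- ifelse(is.na(df1$State), lookup, as.character(df1$State))
df1
```

The ifelse() call plays the same role as coalesce() in the dplyr version: it keeps the original value where one exists and falls back to the lookup otherwise.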
In base R with merge and transform.
merge(df1, aggregate(State~ID, df2, toString), by = 'ID') |>
transform(State = ifelse(is.na(State.x), State.y, State.x))
Tidyverse way:
library(tidyverse)
df1 %>%
left_join(df2 %>%
group_by(ID) %>%
summarise(State = toString(State)) %>%
ungroup(), by = 'ID') %>%
transmute(ID, Weight, State = coalesce(State.x, State.y))
Base R alternative:
na_idx <- which(is.na(df1$State))
df1$State[na_idx] <- with(
aggregate(State ~ ID, df2, toString),
State[match(df1$ID, ID)]
)[na_idx]
Data:
df1 <- structure(list(ID = c(1L, 2L, 2L, 3L, 4L, 5L), Weight = c(12.34,
11.23, 13.12, 12.67, 10.89, 14.12), State = c(NA, "IA", "IN",
"MA", NA, NA)), row.names = c(NA, -6L), class = "data.frame")
df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 4L, 5L), State = c("WY",
"IA", "MA", "OR", "CA", "FL")), class = "data.frame", row.names = c(NA,
-6L))

How to undummy a dataset with R

This is the library I am using for creating dummies
install.packages("fastDummies")
library(fastDummies)
This is the dataset
winners <- data.frame(
city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"),
year = c(1990, 2000, 1990),
crime = 1:3)
Let's then create dummies out of these cities:
dummy_cols(winners, select_columns = c("city"))
The results are
city year crime city_SaoPaulito city_NewAmsterdam city_BeatifulCow
1 SaoPaulito 1990 1 1 0 0
2 NewAmsterdam 2000 2 0 1 0
3 BeatifulCow 1990 3 0 0 1
So the question is: how can I return to the previous dataset? Any ideas?
Thanks in advance!
We can create the dummy columns with dcast
library(data.table)
dcast(setDT(winners), crime ~ city, length)
If we need to get the input, it would be
subset(df1, select = 1:3)
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
Or with melt
melt(setDT(df1), measure = patterns("_"))[value == 1, .(city, year, crime)]
# city year crime
#1: SaoPaulito 1990 1
#2: NewAmsterdam 2000 2
#3: BeatifulCow 1990 3
data
df1 <- structure(list(city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"
), year = c(1990L, 2000L, 1990L), crime = 1:3, city_SaoPaulito = c(1L,
0L, 0L), city_NewAmsterdam = c(0L, 1L, 0L), city_BeatifulCow = c(0L,
0L, 1L)), class = "data.frame", row.names = c("1", "2", "3"))
If you are going to have only one city as 1 in each row, you can just skip the dummy columns
df[, 1:3]
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
If a row can have multiple cities set to 1, one way using dplyr and tidyr::gather is
library(dplyr)
df %>%
tidyr::gather(key, value, starts_with("city_")) %>%
filter(value == 1) %>%
select(-value, -key)
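The answers above rely on the original city column still being present. If only the dummy columns were kept, a base R sketch like the following could rebuild the city column, assuming exactly one dummy is 1 in each row and that the dummy columns carry the `city_` prefix produced by dummy_cols(). The names `dummies`, `dummy_names`, and `restored` are hypothetical.

```r
# Dummy-coded data without the original `city` column (assumed input)
dummies <- data.frame(
  year = c(1990L, 2000L, 1990L),
  crime = 1:3,
  city_SaoPaulito = c(1L, 0L, 0L),
  city_NewAmsterdam = c(0L, 1L, 0L),
  city_BeatifulCow = c(0L, 0L, 1L)
)

# Names of the dummy columns
dummy_names <- grep("^city_", names(dummies), value = TRUE)

# For each row, the position of the column that holds the 1
idx <- max.col(as.matrix(dummies[dummy_names]), ties.method = "first")

# Strip the prefix to recover the original category label
dummies$city <- sub("^city_", "", dummy_names[idx])

restored <- dummies[c("city", "year", "crime")]
restored
```

With `ties.method = "first"`, a row that has several dummies set to 1 would silently get the first matching city, so this sketch is only appropriate for mutually exclusive dummies.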

In old data frame, order by two columns and store first of each row into new data frame

I have a data frame that contains 3 columns, and I'd like to use the columns date and location to obtain the most recent observation for each location and store it in a new data frame.
> old.data
date location amount
2014 NY 1
2015 NJ 2
2016 NY 3
2015 NM 4
2013 NY 5
2014 NJ 6
2016 NM 7
2016 NJ 8
2015 NY 9
> new.data
date location amount
2016 NJ 8
2016 NM 7
2016 NY 3
Using dplyr:
library(dplyr)
new.data <- old.data %>% arrange(desc(date), location) %>% group_by(location) %>% slice(1)
new.data
Source: local data frame [3 x 2]
Groups: location [3]
date location
<int> <fctr>
1 2016 NJ
2 2016 NM
3 2016 NY
Using data.table:
library(data.table)
# Code updated by Arun
setDT(old.data)[order(-date, location), .(date = date[1L]), by = location]
location date
1: NJ 2016
2: NM 2016
3: NY 2016
Data
old.data <- structure(list(date = c(2014L, 2015L, 2016L, 2015L, 2013L, 2014L,
2016L, 2016L, 2015L), location = structure(c(3L, 1L, 3L, 2L,
3L, 1L, 2L, 1L, 3L), .Label = c("NJ", "NM", "NY"), class = "factor")), .Names = c("date",
"location"), class = "data.frame", row.names = c(NA, -9L))
Update (as OP changed the original dataframe)
The dplyr solution is still valid.
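For data.table, this is the only way I could think of. Note that the final filter keeps only rows from the overall latest date, so it matches the per-location result here only because every location has a 2016 observation:
setDT(old.data)[order(-date, location), colnames(old.data), with = FALSE][date == max(date)]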
date location amount
1: 2016 NJ 8
2: 2016 NM 7
3: 2016 NY 3
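To illustrate the difference between filtering on the overall latest date and taking the latest row per location, here is a base R sketch. The extra "CT" row, whose last observation is in 2015, is a hypothetical addition to the question's data, and ordering followed by !duplicated() is a swapped-in technique rather than the data.table code above.

```r
old.data <- data.frame(
  date = c(2014L, 2015L, 2016L, 2015L, 2013L, 2014L, 2016L, 2016L, 2015L, 2015L),
  location = c("NY", "NJ", "NY", "NM", "NY", "NJ", "NM", "NJ", "NY", "CT"),
  amount = 1:10
)

# Filtering on the global maximum date drops CT entirely
global_latest <- old.data[old.data$date == max(old.data$date), ]

# Sort by location, newest first, then keep the first row of each location
sorted <- old.data[order(old.data$location, -old.data$date), ]
new.data <- sorted[!duplicated(sorted$location), ]
new.data
```

`new.data` keeps all columns and contains one row per location, including CT with its 2015 observation.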
Using .SD and .SDcols as suggested by Arun
# adding more data
old.data$amount <- 1:9
old.data$a <- 10:18
# Retain all columns
keep_cols <- colnames(old.data)[-2] # Remove the column which is mentioned in by
setDT(old.data)[order(-date, location), .SD[1L], by = location, .SDcols = keep_cols]
# or assigning colnames to .SDcols directly:
setDT(old.data)[order(-date, location), .SD[1L], by = location, .SDcols = (colnames(old.data)[-2])]
location date amount a
1: NJ 2016 8 17
2: NM 2016 7 16
3: NY 2016 3 12
What about this:
library(dplyr)
date <- c(2014, 2015, 2016, 2015, 2013, 2014, 2016, 2016, 2015)
location <- c("NY", "NJ", "NY", "NM", "NY", "NJ", "NM", "NJ", "NY")
old.data <- data.frame(date, location)
new.data <- group_by(old.data, location)
new.data <- summarise(new.data, year = max(date))
Using the data.table package:
library(data.table)
setDT(dat)[order(-date), .SD[1L], by = location]
# location date
# 1: NY 2016
# 2: NM 2016
# 3: NJ 2016

How to divide the column values into ranges

I would like to know how to divide the column values into three different ranges based on scores.
Here is the data I have:
Name V1.1 V1.2 V2.1 V2.2 V3.1 V3.2
John French 86 Math 78 English 56
Sam Math 97 French 86 English 79
Viru English 93 Math 44 French 34
Consider three ranges: the first is 0-60, the second is 61-90, and the third is 91-100.
I would like to list the subject names that fall into each range for every person.
The expected result would be
Name Level1 Level2 Level3
John English Math,French Null
Sam Null French,Eng Math
Viru Math,Fren Null English
First you need to convert the data to long form, with one row per observation (where an observation is a single score). You need to do a melt, but it is complicated because your wide form contains not only observations but also observation classes. One way to do it is to use melt.data.table twice, but you may be more comfortable with dplyr, which has more accessible syntax.
# first you need to convert to long form
library("data.table")
setDT(df)
lhs <- melt.data.table(df, id = "Name", measure = patterns("\\.2"),
variable.name = "obs", value.name = "score")
lhs[, obs := gsub("(V\\d+)\\.\\d+","\\1",obs)]
lhs
rhs <- melt.data.table(df, id = "Name", measure = patterns("V\\d\\.1"),
variable.name = "obs", value.name = "subject")
rhs[, obs := gsub("(V\\d+)\\.\\d+","\\1",obs)]
rhs
df2 <- merge(lhs, rhs, by = c("Name","obs"))
# Name obs score subject
# 1: John V1 86 French
# 2: John V2 78 Math
# 3: John V3 56 English
# 4: Sam V1 97 Math
# 5: Sam V2 86 French
# 6: Sam V3 79 English
# 7: Viru V1 93 English
# 8: Viru V2 44 Math
# 9: Viru V3 34 French
Then you need to use cut or some other function to create the three levels based on score.
Then you should group by these levels and apply concatenation to the subjects, such as paste(..., collapse = ",").
Then you need to use cast or spread to return it to wide form.
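The three remaining steps can be sketched in base R as follows. This is only one possible approach: it starts from a hand-written long table (the name `long` and its columns are assumptions matching the merged output above), and it uses aggregate() and reshape() in place of the cast or spread functions mentioned.

```r
# Long form: one row per person and subject
long <- data.frame(
  Name = rep(c("John", "Sam", "Viru"), each = 3),
  subject = c("French", "Math", "English",
              "Math", "French", "English",
              "English", "Math", "French"),
  score = c(86, 78, 56, 97, 86, 79, 93, 44, 34)
)

# Step 1: bin the scores into 0-60, 61-90 and 91-100
long$level <- as.character(cut(long$score,
                               breaks = c(0, 60, 90, 100),
                               labels = c("Level1", "Level2", "Level3"),
                               include.lowest = TRUE))

# Step 2: concatenate the subjects within each person and level
by_level <- aggregate(subject ~ Name + level, data = long,
                      FUN = paste, collapse = ",")

# Step 3: back to wide form, one column per level
wide <- reshape(by_level, idvar = "Name", timevar = "level",
                direction = "wide")
wide
```

Levels with no subject for a person come out as NA, which corresponds to the "Null" entries in the expected result.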
Do give it some effort, and edit your question with what you've tried, and try to come up with a more specific question, not just "please do this for me".
Another option using splitstackshape and nested ifelse
library(splitstackshape)
library(data.table) # for setDT() and := used below
library(tidyr)
# prepare data to convert in long format
data$subjects = do.call(paste, c(data[,grep("\\.1", colnames(data))], sep = ','))
data$marks = do.call(paste, c(data[,grep("\\.2", colnames(data))], sep = ','))
data = data[,-grep("V", colnames(data))]
# use cSplit to convert wide to long
out = cSplit(setDT(data), sep = ",", c("subjects", "marks"), "long")
# nested ifelse to assign level based on the score range
out[, level := ifelse(marks <= 60, "level1",
ifelse(marks <= 90, "level2", "level3"))]
req = out[, toString(subjects), by= c("Name","level")]
this will give
#> req
# Name level V1
#1: John level2 French, Math
#2: John level1 English
#3: Sam level3 Math
#4: Sam level2 French, English
#5: Viru level3 English
#6: Viru level1 Math, French
you can reshape either using dcast or spread from tidyr
spread(req, level, V1)
# Name level1 level2 level3
#1: John English French, Math NA
#2: Sam NA French, English Math
#3: Viru Math, French NA English
data
data = structure(list(Name = structure(1:3, .Label = c("John", "Sam",
"Viru"), class = "factor"), V1.1 = structure(c(2L, 3L, 1L), .Label = c("English",
"French", "Math"), class = "factor"), V1.2 = c(86L, 97L, 93L),
V2.1 = structure(c(2L, 1L, 2L), .Label = c("French", "Math"
), class = "factor"), V2.2 = c(78L, 86L, 44L), V3.1 = structure(c(1L,
1L, 2L), .Label = c("English", "French"), class = "factor"),
V3.2 = c(56L, 79L, 34L)), .Names = c("Name", "V1.1", "V1.2",
"V2.1", "V2.2", "V3.1", "V3.2"), class = "data.frame", row.names = c(NA,
-3L))
Not very intuitive, but leads to the requested output. Requires the package sjmisc!
library(sjmisc)
library(dplyr) # select() and arrange() below come from dplyr
mydat <- data.frame(Name = c("John", "Sam", "Viru"),
V1.1 = c("French", "Math", "English"),
V1.2 = c(86, 97, 93),
V2.1 = c("Math", "French", "Math"),
V2.2 = c(78, 86, 44),
V3.1 = c("English", "English", "French"),
V3.2 = c(56, 79, 34))
# recode into groups
rec(mydat[, c(3,5,7)]) <- "min:60=1; 61:90=2; 91:max=3"
# convert to long format
newdf <- to_long(mydat, "no_use",
c("subject", "score"),
c("V1.1", "V2.1", "V3.1"),
c("V1.2", "V2.2", "V3.2")) %>%
select(-no_use) %>%
arrange(Name, score)
# at this point we are at a similar stage as described in the
# other answers, so we have our data in a long format
newdf
fdf <- list()
# iterate all unique names
for (i in unique(newdf$Name)) {
dummy <- c()
# iterate over all three score levels
for (s in 1:3) {
# find subjects related to the score
dat <- newdf$subject[newdf$Name == i & newdf$score == s]
if (length(dat) == 0) dat <- ""
dat <- paste0(dat, collapse = ",")
dummy <- c(dummy, dat)
}
# add character vector with sorted subjects to list
fdf[[length(fdf) + 1]] <- dummy
}
# list to data frame
finaldf <- as.data.frame(t(as.data.frame(fdf)))
finaldf <- cbind(unique(newdf$Name), finaldf)
# proper row/col names
colnames(finaldf) <- c("Names", "Level1", "Level2", "Level3")
rownames(finaldf) <- 1:nrow(finaldf)
finaldf
