two datasets to satisfy one condition in R [duplicate] - r

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
I need help working with two different datasets for my research project.
I have two data frames with different numbers of columns and rows. I need to pull values from one dataset that satisfy a condition involving both datasets.
The condition: the combination of two values (in the same row but different columns) must be the same in both datasets.
For example, the values in 'data$Regular_response' should be the same as 'swow$response', and the values in 'data$Test_word' should be the same as 'swow$cue'.
data$Regular_response = swow$response
data$Test_word = swow$cue
In other words, I am looking for equal word pairs in both datasets.
When this condition is satisfied, I need the value of swow$R123.Strength to be written to a new column, data$strength.
How do I do that?
> head(swow)
# A tibble: 6 x 5
cue response R123 N R123.Strength
<chr> <chr> <chr> <chr> <chr>
1 a one 31 257 0.120622568093385
2 a the 26 257 0.101167315175097
3 a an 17 257 0.066147859922179
4 a b 14 257 0.0544747081712062
5 a single 9 257 0.0350194552529183
6 a article 6 257 0.0233463035019455
> head(data)
Regular_response Test_word Pyramids_and_Palms_Test
1: princess queen 92
2: shoes slippers 92
3: flowerpot vase 92
4: horse zebra 92
5: cup bowl 85
6: nun church 85
> filter(data, Test_word == 'queen', Regular_response == 'princess')
Regular_response Test_word Pyramids_and_Palms_Test
1 princess queen 92
2 princess queen 87
> filter(swow, cue == 'queen', response == 'princess')
# A tibble: 1 x 5
cue response R123 N R123.Strength
<chr> <chr> <chr> <chr> <chr>
1 queen princess 3 292 0.0102739726027397
I appreciate those who can help me with this code!

Try this solution as I told you earlier:
Merged <- merge(data, swow[, c("cue", "response", "R123.Strength")],
                by.x = c("Test_word", "Regular_response"),
                by.y = c("cue", "response"),
                all.x = TRUE)

Sounds like a job for a join. So something like:
library(dplyr)
data <- data %>%
  left_join(swow, by = c("Regular_response" = "response", "Test_word" = "cue")) %>%
  mutate(strength = R123.Strength)
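To see what the two-key join does, here is a minimal sketch on made-up data. The names swow_toy and data_toy and all values are illustrative assumptions, not the asker's data. It keeps only the strength column from the lookup table and converts it to numeric, since R123.Strength is stored as text in swow.

```r
library(dplyr)

# Toy stand-ins for the two tables in the question (values are illustrative).
swow_toy <- tibble(
  cue           = c("queen", "queen", "vase"),
  response      = c("princess", "king", "flowerpot"),
  R123.Strength = c("0.0103", "0.2500", "0.0400")  # text, as in swow
)
data_toy <- tibble(
  Regular_response = c("princess", "princess", "nun"),
  Test_word        = c("queen", "queen", "church")
)

joined <- data_toy %>%
  left_join(
    # keep only the two keys and the value we want, renamed to `strength`
    swow_toy %>% select(cue, response, strength = R123.Strength),
    by = c("Test_word" = "cue", "Regular_response" = "response")
  ) %>%
  mutate(strength = as.numeric(strength))

joined
```

Every row of data_toy is kept. Word pairs that have no counterpart in the lookup table get NA in strength.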


how to combine info from two different sized tibbles? [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Merge 2 data frame based on 2 columns with different column names
(1 answer)
How to specify names of columns for x and y when joining in dplyr?
(2 answers)
Closed last year.
New to R, just learning.
I have two tibbles. The first, statecodes, has 67 rows and maps each Census Bureau (CB) state code to a state name and abbreviation. The second, shapedata, has 3233 rows with information about the size of each county in the US, and uses the same state codes as the first tibble. I would like to add the name and abbreviation to the second tibble.
> head(statecodes)
# A tibble: 6 × 3
Name Abbrev Code
<chr> <chr> <chr>
1 Alabama AL 01
2 Alaska AK 02
3 Arizona AZ 04
4 Arkansas AR 05
5 California CA 06
6 Colorado CO 08
> head(shapedata)
# A tibble: 6 × 9
STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD ALAND AWATER
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 39 071 01074048 0500000US39071 39071 Highland 06 1432479992 12194983
2 06 003 01675840 0500000US06003 06003 Alpine 06 1912292630 12557304
3 12 033 00295737 0500000US12033 12033 Escambia 06 1701544502 563927612
4 17 101 00424252 0500000US17101 17101 Lawrence 06 963936864 5077783
5 28 153 00695797 0500000US28153 28153 Wayne 06 2099745573 7255476
6 28 141 00695791 0500000US28141 28141 Tishomingo 06 1098938845 52360190
> nrow(statecodes)
[1] 67
> nrow(shapedata)
[1] 3233
I can't use mutate because the input and output are differently sized. I've looked at purrr, but don't see an obvious way to use it. I was trying to do something like this:
shapedata <- shapedata %>% mutate(statename = statecodes[statecodes$Code == STATEFP,]$Name)
where STATEFP is the same state code as in `statecodes`.
I could write a for loop, but I'm wondering whether there's a more R-like method.
TIA
I guess you are looking for a join.
library(dplyr)
shapedata %>%
  left_join(statecodes,
            by = c("STATEFP" = "Code"))
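For what it's worth, the mutate() attempt in the question does not work because statecodes$Code == STATEFP compares the two vectors element by element (with recycling) instead of looking each code up. If you want a base-R lookup rather than a join, match() does that. A rough sketch on a few made-up rows, where statecodes_toy and shapedata_toy are hypothetical stand-ins:

```r
# Toy versions of the two tables (plain data frames, a few rows only).
statecodes_toy <- data.frame(
  Name   = c("Alabama", "California", "Ohio"),
  Abbrev = c("AL", "CA", "OH"),
  Code   = c("01", "06", "39"),
  stringsAsFactors = FALSE
)
shapedata_toy <- data.frame(
  STATEFP = c("39", "06", "06", "72"),   # "72" is absent from the toy lookup
  NAME    = c("Highland", "Alpine", "Kern", "Ponce"),
  stringsAsFactors = FALSE
)

# For each STATEFP, match() gives the row position of that code in the lookup.
idx <- match(shapedata_toy$STATEFP, statecodes_toy$Code)
shapedata_toy$statename <- statecodes_toy$Name[idx]
shapedata_toy$abbrev    <- statecodes_toy$Abbrev[idx]

shapedata_toy
```

Codes without an entry in the lookup table end up as NA, which is the same behaviour a left join gives.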

Transforming big dataframe in more sensible form [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Reshaping wide to long with multiple values columns [duplicate]
(5 answers)
Closed 1 year ago.
The data frame consists of 3 columns: wine_id, taste_group, and an evaluated matching score for each group:
wine_id taste_group    score
22      tree_fruit     87
22      citrus_fruit   98
22      tropical_fruit 17
22      earth          8
22      microbio       6
22      oak            7
22      vegetal        1
How can I make a separate column for each taste_group, with the scores listed in rows?
Like this:
wine_id tree_fruit citrus_fruit tropical_fruit earth microbio oak vegetal
22      87         98           17             8     6        7   1
There are 13 taste groups overall and more than 6000 wines.
If a wine doesn't have a score for a taste_group, that cell should take the value 0.
I used
length(unique(tastes$Group))
length(unique(tastes$Wine_Id))
in R to check these basic counts.
How do I get to the desired format?
Assuming your dataframe is named tastes, you'll want something like:
library(tidyr)
tastes %>%
  # Get into desired wide format
  pivot_wider(names_from = taste_group, values_from = score, values_fill = 0)
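To check that values_fill = 0 handles wines with missing taste groups, here is a small sketch with two made-up wines (tastes_toy and its values are assumptions for illustration only):

```r
library(tidyr)

# Two wines; neither has a score for every taste group.
tastes_toy <- data.frame(
  wine_id     = c(22, 22, 22, 35, 35),
  taste_group = c("tree_fruit", "citrus_fruit", "oak", "tree_fruit", "earth"),
  score       = c(87, 98, 7, 40, 12)
)

wide <- pivot_wider(
  tastes_toy,
  names_from  = taste_group,  # one new column per taste group
  values_from = score,
  values_fill = 0             # absent wine/group combinations become 0
)

wide
```

The result has one row per wine_id, and any group a wine was never scored on shows 0 rather than NA.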
In R, this is called long-to-wide reshaping. You can also use dcast to do it.
library(data.table)
dt <- fread("
wine_id taste_group score
22 tree_fruit 87
22 citrus_fruit 98
22 tropical_fruit 17
22 earth 8
22 microbio 6
22 oak 7
22 vegetal 1
")
dcast(dt, wine_id ~ taste_group, value.var = "score")
#wine_id citrus_fruit earth microbio oak tree_fruit tropical_fruit vegetal
# <int> <int> <int> <int> <int> <int> <int> <int>
# 22 98 8 6 7 87 17 1
Consider reshape:
wide_df <- reshape(
  my_data,
  timevar = "taste_group",
  v.names = "score",
  idvar = "wine_id",
  direction = "wide"
)
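Two things to be aware of with reshape(): the new columns come out named "score.<taste_group>", and missing combinations are NA rather than 0. A rough sketch of the clean-up, using a made-up tastes_toy data frame (an assumption, not the asker's data):

```r
tastes_toy <- data.frame(
  wine_id     = c(22, 22, 35),
  taste_group = c("tree_fruit", "oak", "tree_fruit"),
  score       = c(87, 7, 40)
)

wide_df <- reshape(
  tastes_toy,
  timevar   = "taste_group",
  v.names   = "score",
  idvar     = "wine_id",
  direction = "wide"
)

# reshape() prefixes the new columns with "score."; strip that prefix
names(wide_df) <- sub("^score\\.", "", names(wide_df))

# absent wine/group combinations are NA; the question asks for 0
wide_df[is.na(wide_df)] <- 0

wide_df
```

Note that reshape() expects a plain data frame, so a tibble may need as.data.frame() first.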

How do I convert a data frame where some rows are duplicates except for one column [duplicate]

This question already has answers here:
Changing Values from Wide to Long: 1) Group_By, 2) Spread/Dcast [duplicate]
(3 answers)
Convert long to wide with variable number of columns
(1 answer)
Closed 3 years ago.
I have data on taxpayers, some of which have multiple branches. Each branch produces its own row, so a taxpayer's rows are near-duplicates that differ in only one relevant column (enterprise activity). I want to change this so that each taxpayer has only one row, which involves creating columns 'enterprise_activity_1', 'enterprise_activity_2' and so on.
I realise this is similar to reshaping, but I can't think of a way to use tidyr::spread to achieve it.
For simplicity, say we just have a dataframe like:
df <- tibble::tibble(
  TAXPAYER_ID = c(100, 151, 250, 250, 267, 296, 296, 304),
  ENTERPRISE_ACTIVITY = rep(c("AGRICULTURE", "MANUFACTURING"), 4)
)
What I would like to achieve is this:
TAXPAYER_ID ENTERPRISE_ACTIVITY_1 ENTERPRISE_ACTIVITY_2
100 AGRICULTURE NA
151 MANUFACTURING NA
250 AGRICULTURE MANUFACTURING
267 AGRICULTURE NA
296 MANUFACTURING AGRICULTURE
304 MANUFACTURING NA
My actual data has varying numbers of branches per taxpayer so the number of columns should be the maximum number of branches that one taxpayer has.
Basically you need to group by the taxpayer ID, create a column to handle duplicated identifiers and spread, i.e.
library(tidyverse)
df %>%
  group_by(TAXPAYER_ID) %>%
  mutate(ENTERPRISE_ACT = row_number()) %>%
  spread(key = ENTERPRISE_ACT, ENTERPRISE_ACTIVITY, sep = '_')
# A tibble: 6 x 3
# Groups: TAXPAYER_ID [6]
# TAXPAYER_ID ENTERPRISE_ACT_1 ENTERPRISE_ACT_2
# <dbl> <chr> <chr>
#1 100 AGRICULTURE <NA>
#2 151 MANUFACTURING <NA>
#3 250 AGRICULTURE MANUFACTURING
#4 267 AGRICULTURE <NA>
#5 296 MANUFACTURING AGRICULTURE
#6 304 MANUFACTURING <NA>
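Since spread() has been superseded in tidyr, the same idea can also be written with pivot_wider(). This is only a sketch of an equivalent; the helper column name branch is my own choice, and names_prefix is used to get the column names the question asked for.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  TAXPAYER_ID = c(100, 151, 250, 250, 267, 296, 296, 304),
  ENTERPRISE_ACTIVITY = rep(c("AGRICULTURE", "MANUFACTURING"), 4)
)

wide <- df %>%
  group_by(TAXPAYER_ID) %>%
  mutate(branch = row_number()) %>%   # 1, 2, ... within each taxpayer
  ungroup() %>%
  pivot_wider(
    names_from   = branch,
    values_from  = ENTERPRISE_ACTIVITY,
    names_prefix = "ENTERPRISE_ACTIVITY_"
  )

wide
```

The number of ENTERPRISE_ACTIVITY_* columns follows the largest number of branches any single taxpayer has, and taxpayers with fewer branches get NA in the remaining columns.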

Select TailNumber with MaxAirTime by UniqueCarrier [duplicate]

This question already has answers here:
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Closed 5 years ago.
I have a similar problem with my own dataset and decided to practice on an example dataset. I'm trying to select the TailNumbers associated with the Max Air Time by Carrier.
Here's my solution thus far:
library(dplyr)
library(hflights)
hflights %>%
  group_by(UniqueCarrier, TailNum) %>%
  summarise(maxAT = max(AirTime)) %>%
  arrange(desc(maxAT))
This gives three columns, from which I can eyeball the maximum air time values and then narrow them down with filter() statements. However, I feel there is a more elegant way to do this.
You can use which.max to find out the row with the maximum AirTime and then slice the rows:
hflights %>%
  select(UniqueCarrier, TailNum, AirTime) %>%
  group_by(UniqueCarrier) %>%
  slice(which.max(AirTime))
# A tibble: 15 x 3
# Groups: UniqueCarrier [15]
# UniqueCarrier TailNum AirTime
# <chr> <chr> <int>
# 1 AA N3FNAA 161
# 2 AS N626AS 315
# 3 B6 N283JB 258
# 4 CO N77066 549
# 5 DL N358NB 188
# 6 EV N716EV 173
# 7 F9 N905FR 190
# 8 FL N176AT 186
# 9 MQ N526MQ 220
#10 OO N744SK 225
#11 UA N457UA 276
#12 US N950UW 212
#13 WN N256WN 288
#14 XE N11199 204
#15 YV N907FJ 150
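One caveat: which.max() returns only the first maximum, so if two tail numbers tie for the longest air time within a carrier, only one is kept. If you want the ties too, slice_max() (dplyr 1.0.0 or later) can return them. A minimal sketch on made-up rows, where flights_toy is a hypothetical stand-in for hflights:

```r
library(dplyr)

# Toy data: carrier AA has two tail numbers tied at the maximum air time.
flights_toy <- tibble(
  UniqueCarrier = c("AA", "AA", "AA", "WN", "WN"),
  TailNum       = c("N1", "N2", "N3", "N7", "N8"),
  AirTime       = c(120L, 161L, 161L, 288L, 200L)
)

top <- flights_toy %>%
  group_by(UniqueCarrier) %>%
  slice_max(AirTime, n = 1, with_ties = TRUE) %>%  # keep all tied rows
  ungroup()

top
```

Setting with_ties = FALSE gives one row per carrier, like the which.max() version.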

Calculate Percentage Column for List of Dataframes When Total Value is Hidden Within the Rows

library(tidyverse)
I feel like there is a simple solution for this but I'm stuck. The code below creates a simple list of two dataframes (they are the same for simplicity of the example, but the real data has different values)
Loc<-c("Montreal","Toronto","Vancouver","Quebec","Ottawa","Hamilton","Total")
Count<-c("2344","2322","122","45","4544","44","9421")
Data<-data_frame(Loc,Count)
Data2<-data_frame(Loc,Count)
Data3<-list(Data,Data2)
Each dataframe has "Total" within the "Loc" column with the corresponding overall total of the "Count" column. I would like to calculate percentages for each dataframe by dividing each value in the "Count" column by the total, which is the last number in the "Count" column.
I would like the percentages to be added as new columns for each dataframe.
For this example, the total is the last number in the column, but in reality, it may be mixed anywhere in the column and can be found by the corresponding "Total" value in the "Loc" column.
I would like to use purrr and Tidyverse:
Below is an example of the code, but I'm stuck on the percentage...
Data3%>%map(~mutate(.x,paste0(round(100* (MISSING PERCENTAGE),2),"%"))
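This solution uses only base R. It finds the total through the "Total" row in Loc rather than assuming it is the last row:
for (i in seq_along(Data3)) {
  Data3[[i]]$Count <- as.numeric(Data3[[i]]$Count)
  # locate the total by its label, since it may sit anywhere in the column
  total <- Data3[[i]]$Count[Data3[[i]]$Loc == "Total"]
  Data3[[i]]$perc <- Data3[[i]]$Count / total
}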
> Data3
[[1]]
# A tibble: 7 x 3
Loc Count perc
<chr> <dbl> <dbl>
1 Montreal 2344 0.248805859
2 Toronto 2322 0.246470651
3 Vancouver 122 0.012949793
4 Quebec 45 0.004776563
5 Ottawa 4544 0.482326717
6 Hamilton 44 0.004670417
7 Total 9421 1.000000000
[[2]]
# A tibble: 7 x 3
Loc Count perc
<chr> <dbl> <dbl>
1 Montreal 2344 0.248805859
2 Toronto 2322 0.246470651
3 Vancouver 122 0.012949793
4 Quebec 45 0.004776563
5 Ottawa 4544 0.482326717
6 Hamilton 44 0.004670417
7 Total 9421 1.000000000
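Since the question asked for purrr, the same calculation could be sketched with map() and mutate(). The small toy_list below is made up for illustration, with the "Total" row deliberately placed in the middle; it assumes each data frame has exactly one "Total" row.

```r
library(dplyr)
library(purrr)

Loc   <- c("Montreal", "Total", "Toronto")  # "Total" is deliberately not last
Count <- c("30", "50", "20")                # stored as text, as in the question
toy_list <- list(tibble(Loc, Count), tibble(Loc, Count))

result <- toy_list %>%
  map(~ .x %>%
        mutate(
          Count = as.numeric(Count),
          # divide by the Count found in the row labelled "Total"
          perc  = paste0(round(100 * Count / Count[Loc == "Total"], 2), "%")
        ))

result[[1]]
```

Drop the paste0() wrapper if you would rather keep perc as a number for further calculations.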
