Reshape long to wide on repeated rows - r

I have a data frame df that looks like the following:
Label Info
1 0-22 Records N/A
2 0-22 Records Poland
3 0-22 Records N/A
4 0-22 Records active
5 0-22 Records Hardcore
6 0-22 Records N/A
7 0-22 Records N/A
8 Nuclear Blast "Oeschstr. 40 73072 Donzdorf"
9 Nuclear Blast Germany
10 Nuclear Blast +49 7162 9280-0
11 Nuclear Blast active
12 Nuclear Blast Hardcore (early), Metal and subgenres
13 Nuclear Blast 1987
14 Nuclear Blast "Anstalt Records, Arctic Serenades, Cannibalised Serial Killer, Deathwish Office, Epica, Gore Records, Grind Syndicate Media, Mind Control Records, Nuclear Blast America, Nuclear Blast Brasil, Nuclear Blast Entertainment, Radiation Records, Revolution Entertainment"
15 Nuclear Blast Yes
I would like to reshape to wide where df will look like:
Label Address Country Phone Status Genre Year Sub Online
1 0-22 Records N/A Poland N/A active Hardcore N/A N/A N/A
2 Nuclear Blast "Oes.." Germany +49...
.
.
The number of repeated rows varies from 7 to 9, and I have tried reshape and reshape2 with the key assigned to "Label", to no avail.
EDIT: dput:
structure(list(label = c("0-22 Records", "0-22 Records", "0-22 Records",
"0-22 Records", "0-22 Records", "0-22 Records", "0-22 Records",
"Nuclear Blast", "Nuclear Blast", "Nuclear Blast", "Nuclear Blast",
"Nuclear Blast", "Nuclear Blast", "Nuclear Blast", "Nuclear Blast",
"Metal Blade Records", "Metal Blade Records", "Metal Blade Records",
"Metal Blade Records", "Metal Blade Records"), info = c(" N/A ",
"Poland", " N/A ", "active", " Hardcore ", " N/A ", "N/A", " Oeschstr.
40\r\n73072 Donzdorf ",
"Germany", " +49 7162 9280-0 ", "active", " Hardcore (early), Metal and
subgenres ", " 1987 ", "\n\t\t\t\t\t\t\t\t\tAnstalt
Records,\t\t\t\t\t\t\t\t\tArctic Serenades,\t\t\t\t\t\t\t\t\tCannibalised
Serial Killer,\t\t\t\t\t\t\t\t\tDeathwish
Office,\t\t\t\t\t\t\t\t\tEpica,\t\t\t\t\t\t\t\t\tGore
Records,\t\t\t\t\t\t\t\t\tGrind Syndicate Media,\t\t\t\t\t\t\t\t\tMind
Control Records,\t\t\t\t\t\t\t\t\tNuclear Blast
America,\t\t\t\t\t\t\t\t\tNuclear Blast Brasil,\t\t\t\t\t\t\t\t\tNuclear
Blast Entertainment,\t\t\t\t\t\t\t\t\tRadiation
Records,\t\t\t\t\t\t\t\t\tRevolution Entertainment\t\t\t\t\t ",
"Yes", " 5737 Kanan Road #143\r\nAgoura Hills, California 91301 ",
"United States", " N/A ", "active", " Heavy Metal, Extreme Metal "
)), .Names = c("label", "info"), class = c("data.table", "data.frame"
), row.names = c(NA, -20L))

The new column names for the wide data frame (e.g., Address, Country, etc.) don't appear in df. We need to add a column to df that maps info to the correct column names for the wide data frame in order to ensure that a given row's data ends up in the correct columns after reshaping.
The challenge is that we need to exploit regularities in the data to figure out which values of info represent Genre, Country, Year, etc. Based on the data sample you've provided, here are some initial ideas. In the code below, the case_when statement attempts to map info to the new column names. Going in order, the conditions inside case_when try to do the following:
Find Country by identifying strings containing country names
Find Status (assuming it can only be either "active" or "inactive")
Find Genre. Here you'll need to cover more possibilities.
Find Year. I've assumed any row with a four-digit number in the range 1950-2017 represents a year. Adjust as necessary.
Find Phone. I've assumed it always starts with +, so you may need something more complex here.
Find Online (assuming it can only be either "Yes" or "No", and that no row that would be mapped to a different column would ever contain only the word "Yes" or "No")
Find Sub. You'll likely need a more complex strategy here. For now I've assumed rows that contain the words "Records" or "Entertainment" or that have three or more commas are Sub rows.
If a row doesn't match any of the above statements, assume it's an address.
You'll need to play around with these and see what works in the context of your data.
library(stringr)
library(tidyverse)
library(countrycode)
data("countrycode_data")

df %>%
  filter(!grepl("N/A", info)) %>%
  mutate(info = str_trim(gsub("\r*\t*|\n*| {2,}", "", info)),
         NewCols = case_when(
           sapply(info, function(x) any(grepl(x, countrycode_data$country.name.en))) ~ "Country",
           grepl("active", info) ~ "Status",
           grepl("hardcore|metal|rock|classical", info, ignore.case = TRUE) ~ "Genre",
           info %in% 1950:2017 ~ "Year",
           grepl("^\\+", info) ~ "Phone",
           grepl("^Yes$|^No$", info) ~ "Online",
           grepl("Records|Entertainment|,{3,}", info) ~ "Sub",
           TRUE ~ "Address")) %>%
  group_by(label) %>%
  spread(NewCols, info)
Here's the output (where I've truncated the long value of Sub to save space):
label Address Country Genre Online Phone Status Sub Year
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 0-22 Records <NA> Poland Hardcore <NA> <NA> active NA <NA>
2 Metal Blade Records 5737 Kanan Road #143Agoura Hills, California 91301 United States Heavy Metal, Extreme Metal <NA> <NA> active NA <NA>
3 Nuclear Blast Oeschstr. 4073072 Donzdorf Germany Hardcore (early), Metal and subgenres Yes +49 7162 9280-0 active Anstalt Re... 1987
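One defensive check worth doing (my addition, not from the original answer): spread() fails with a "duplicate identifiers" error if two rows of the same label are mapped to the same new column, so after building NewCols as above it can be worth counting the combinations first. Here df_mapped is an assumed name for the data frame after the mutate() step:

```r
# any label/NewCols pair occurring more than once would break the spread()
df_mapped %>%
  count(label, NewCols) %>%
  filter(n > 1)
```

An empty result means every label maps each wide column at most once and the reshape is safe.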
Original answer (before data sample was available)
If you had all eight rows for each Label (one per new column), and the data type in each row always appears in the same order for each Label, then one solution would be:
library(tidyverse)

df.wide = df %>%
  group_by(Label) %>%
  mutate(NewCols = c("Address","Country","Phone","Status","Genre","Year","Sub","Online")) %>%
  spread(NewCols, Info)
You can implement this in your real data for any level of Label that has eight rows:
df.wide8 = df %>%
  group_by(Label) %>%
  filter(n() == 8) %>%
  mutate(NewCols = c("Address","Country","Phone","Status","Genre","Year","Sub","Online")) %>%
  spread(NewCols, Info)
For the levels of Label with seven rows, if the missing row always represents the same type of data (say the address row is always the one missing), then you could do, once again assuming the data types are in the same order for each Label:
df.wide7 = df %>%
  group_by(Label) %>%
  filter(n() == 7) %>%
  mutate(NewCols = c("Country","Phone","Status","Genre","Year","Sub","Online")) %>%
  spread(NewCols, Info)
Then you could put them together with df.wide = bind_rows(df.wide7, df.wide8).
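If the rows always appear in the same fixed order and only trailing fields are ever missing, a positional sketch (my addition, not from the original answer; the column order is an assumption) can handle all group sizes at once:

```r
library(tidyverse)

col_order <- c("Address","Country","Phone","Status","Genre","Year","Sub","Online")

df.wide <- df %>%
  group_by(Label) %>%
  # assumes the i-th row of each Label always holds the i-th field in col_order
  mutate(NewCols = col_order[row_number()]) %>%
  spread(NewCols, Info)
```

Groups shorter than eight rows simply end up with NA in the columns they never reach.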
If you provide more information, we might be able to come up with a solution that works for your actual data.

Related

Combine every two rows of data in R

I have a csv file that I have read in, but I now need to combine every two rows. There is a total of 2000 rows, which I need to reduce to 1000. Every two rows has the same account number in one column and the address split across two rows in another: two rows are taken up for each observation, and I want to combine the two address rows into one. For example, rows 1 and 2 are Acct# 1234 and have 123 Hollywood Blvd and LA California 90028 on their own lines, respectively.
Using the tidyverse, you can group_by the Acct number and summarise with str_c:
library(tidyverse)
df %>%
  group_by(Acct) %>%
  summarise(Address = str_c(Address, collapse = " "))
# A tibble: 2 × 2
Acct Address
<dbl> <chr>
1 1234 123 Hollywood Blvd LA California 90028
2 4321 55 Park Avenue NY New York State 6666
Data:
df <- data.frame(
Acct = c(1234, 1234, 4321, 4321),
Address = c("123 Hollywood Blvd", "LA California 90028",
"55 Park Avenue", "NY New York State 6666")
)
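For comparison (my addition, not part of the original answer), base R's aggregate() can do the same per-account collapse without extra packages:

```r
# paste the two address lines together for each account number
aggregate(Address ~ Acct, data = df, FUN = paste, collapse = " ")
```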
It can be fairly simple with data.table package:
# assuming `dataset` is the name of your dataset, the account-number column
# is called 'actN' and the address column is 'adr'
library(data.table)
dataset2 <- data.table(dataset)[, .(whole = paste0(adr, collapse = ", ")), by = .(actN)]

How to join the correct values according to multiple conditions?

Hello! I hope this is clear and consistent, because I think the process is a little complicated to understand. I will show an example of what I need after an explanation. I have a dataset (base 2) that contains two values I want to join to another table (base 1), but there are several complications. The first is that I need two values that live in the same variable, where each value is designated by another variable. The second is that I need to join the correct value according to the time period.
I show an example from one case.
base 1:
STORE CODE  PERIOD
60M4        1105

base 2:
VALUE   CODE        WEEK FROM  WEEK TO  STORE CODE  CHANEL
AREA I  BR AREA     945        1189     60M4        NA
AREA I  BR AREA     1190       NA       60M4        NA
BIG     STORE TYPE  1198       NA       60M4        5

Joined base:
STORE CODE  PERIOD  BR AREA  STORE TYPE  CHANEL
60M4        1105    AREA I   BIG         5
In base 2, the variable 'CODE' holds the two variables (BR AREA and STORE TYPE) I need in the join, but as values in rows, and the values I actually need are in 'VALUE' (AREA I and BIG). The AREA part of the join is tied to a time period: the store was in AREA I from period 945 to 1189, and then from 1190 to NA (up to today) it is still in AREA I. So I need to join on the correct period; as shown, my period in this case is 1105, which means I need to join the AREA for the period 945-1189, in addition to joining the store type and CHANEL.
First I tried to filter the information, but it hasn't worked for me. I have thousands of rows, and I don't know whether this could be done with a loop or just the right filter.
Thank you so much
library(dplyr)
library(tidyr)
library(tibble)

df1 <- tribble(~STORE_CODE, ~PERIOD,
               "60M4",      1105)

df2 <- tribble(~VALUE,   ~CODE,        ~WEEK_FROM, ~WEEK_TO, ~STORE_CODE, ~CHANEL,
               "AREA I", "BR AREA",    945,        1189,     "60M4",      NA,
               "AREA I", "BR AREA",    1190,       NA,       "60M4",      NA,
               "BIG",    "STORE TYPE", 1198,       NA,       "60M4",      5)

df2 |>
  pivot_wider(values_from = VALUE, names_from = CODE) |>
  group_by(STORE_CODE) |>
  fill(CHANEL, `BR AREA`, `STORE TYPE`, .direction = "downup") |>
  ungroup() |>
  right_join(df1, by = "STORE_CODE") |>
  filter(PERIOD >= WEEK_FROM, PERIOD <= WEEK_TO) |>
  select(STORE_CODE, PERIOD, `BR AREA`, `STORE TYPE`, CHANEL)
# A tibble: 1 × 5
STORE_CODE PERIOD `BR AREA` `STORE TYPE` CHANEL
<chr> <dbl> <chr> <chr> <dbl>
1 60M4 1105 AREA I BIG 5
This is assuming that variables like BR AREA, STORE TYPE, and CHANEL are the same for each STORE_CODE.
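One caveat (my addition, not from the original answer): rows whose WEEK_TO is NA represent open-ended periods, and a filter like PERIOD <= WEEK_TO drops them because comparisons with NA yield NA. If a PERIOD could fall inside an open-ended window, one hedge is to treat a missing WEEK_TO as infinity before filtering. Here df_joined is an assumed name for the wide table after the right_join():

```r
library(dplyr)

# replace a missing WEEK_TO with Inf so open-ended windows can still match
df_joined %>%
  mutate(WEEK_TO = coalesce(WEEK_TO, Inf)) %>%
  filter(PERIOD >= WEEK_FROM, PERIOD <= WEEK_TO)
```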

Is it possible to get R to identify countries in a dataframe?

This is what my dataset currently looks like. I'm hoping to add a column with the country names that correspond with the 'paragraph' column, but I don't even know how to start going about with that. Should I upload a list of all country names and then use the match function?
Any suggestions for a more optimal way would be appreciated! Thank you.
The output of dput(head(dataset, 20)) is as follows:
structure(list(category = c("State Ownership and Privatization;...row.names = c(NA, 20L), class = "data.frame")
Use the package "countrycode":
Toy data:
df <- data.frame(entry_number = 1:5,
text = c("a few paragraphs that might contain the country name congo or democratic republic of congo",
"More text that might contain myanmar or burma, as well as thailand",
"sentences that do not contain a country name can be returned as NA",
"some variant of U.S or the united states",
"something with an accent samóoa"))
This is how you can match the country names in a separate column:
library(tidyr)
library(dplyr)
library(stringr)
# install.packages("countrycode")
library(countrycode)

all_country <- countryname_dict %>%
  # keep only the alternative country names that contain ASCII letters:
  filter(grepl('[A-Za-z]', country.name.alt)) %>%
  # pull column `country.name.alt` out as an atomic vector:
  pull(country.name.alt) %>%
  # change to lower case:
  tolower()

# build one huge alternation pattern of all country names:
pattern <- str_c(all_country, collapse = '|')

df %>%
  # extract all country-name matches into a list-column:
  mutate(country = str_extract_all(tolower(text), pattern))
entry_number text
1 1 a few paragraphs that might contain the country name congo or democratic republic of congo
2 2 More text that might contain myanmar or burma, as well as thailand
3 3 sentences that do not contain a country name can be returned as NA
4 4 some variant of U.S or the united states
5 5 something with an accent samóoa
country
1 congo, democratic republic of congo
2 myanma, burma, thailand
3
4 united states
5 samóoa
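Because str_extract_all() returns a list-column, a small follow-up step (my addition, a sketch; df_matched is an assumed name for the result of the mutate() above) flattens it into a plain character column, which is easier to inspect or export:

```r
# collapse each list element into one comma-separated string;
# rows with no match become the empty string ""
df_matched %>%
  mutate(country = sapply(country, paste, collapse = ", "))
```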

R extract multiple variables from column

I'm new to R so my apologies if this is unclear.
My data contains 1,000 observations of 3 variable columns: (a) person, (b) vignette, (c) response. The vignette column contains demographic information presented in a paragraph, including age (20, 80), sex (male, female), employment (employed, not employed, retired), etc. Each person received a vignette that randomly presented one of the values for age (20 or 80), sex (male or female), employment (employed, not employed, retired), etc.
(e.x. Person #1 received: A(n) 20 year old male is unemployed. Person #2 received: A(n) 80 year old female is retired. Person #3 received: A(n) 20 year old male is unemployed... Person # 1,000 received: A(n) 20 year old female is employed.)
I'm trying to use tidyr:extract on (b) vignette to extract the rest of the demographic information and create several new variable columns labeled "age", "sex" "employment" etc. So far, I've only been able to extract "age" using this code:
tidyr::extract(data, vignette, c("age"), "([20:80]+)")
I want to extract all of the demographic information and create variable columns for (b) age, (c) sex, (d) employment, etc. My goal is to have 1,000 observation rows with several variable columns like this:
(a) person, (b) age, (c) sex, (d) employment (e) response
Person #1 20 Male unemployed Very Likely
Person #2 80 Female retired Somewhat Likely
Person #3 20 Male unemployed Very Unlikely
...
Person #1,000 20 Female employed Neither Likely nor Unlikely
Vignette Example:
structure(list(Response_ID = "R_86Tm81WUuyFBZhH", Vignette = "A(n) 18 year-old Hispanic woman uses heroin several times a week. This person is receiving welfare, is employed and has no previous criminal conviction for drug possession. - Based on this description, how likely or unlikely is it that this person has a drug addiction?", Response = "Very Likely"), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
I appreciate any guidance or help!
I made up some regexes to pull out your info. Experience shows that you're going to spend many hours tweaking them before you get anything reasonably satisfactory; e.g., you won't pull the employment status correctly out of a sentence like "Neither she nor her boyfriend are employed".
library(dplyr)
library(tidyr)

raw <- structure(list(Response_ID = "R_86Tm81WUuyFBZhH",
                      Vignette = "A(n) 18 year-old Hispanic woman uses heroin several times a week. This person is receiving welfare, is employed and has no previous criminal conviction for drug possession. - Based on this description, how likely or unlikely is it that this person has a drug addiction?",
                      Response = "Very Likely"),
                 row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

raw2 <- raw %>%
  add_row(Response_ID = "R_xesrew",
          Vignette = "A 22 year-old White boy drinks bleach. He is unemployed",
          Response = "Unlikely")

rzlt <- raw2 %>%
  tidyr::extract(Vignette, "Age",  "(?ix) (\\d+) \\s* year\\-old", remove = FALSE) %>%
  tidyr::extract(Vignette, "Race", "(?ix) (hispanic|white|asian|black|native \\s* american)", remove = FALSE) %>%
  tidyr::extract(Vignette, "Job",  "(?ix) (not \\s+ employed|unemployed|employed|jobless)", remove = FALSE) %>%
  tidyr::extract(Vignette, "Sex",  "(?ix) (female|male|woman|man|boy|girl)", remove = FALSE) %>%
  select(-Vignette)
Gives
# A tibble: 2 x 6
Response_ID Sex Job Race Age Response
<chr> <chr> <chr> <chr> <chr> <chr>
1 R_86Tm81WUuyFBZhH woman employed Hispanic 18 Very Likely
2 R_xesrew boy unemployed White 22 Unlikely
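As a follow-up (my addition, not part of the original answer; the category mappings are assumptions), the raw captures such as "woman" or "boy" can be recoded into consistent levels:

```r
library(dplyr)

rzlt_clean <- rzlt %>%
  mutate(Sex = case_when(Sex %in% c("female", "woman", "girl") ~ "Female",
                         Sex %in% c("male", "man", "boy")      ~ "Male"),
         Job = case_when(Job %in% c("unemployed", "jobless", "not employed") ~ "Unemployed",
                         Job == "employed"                                   ~ "Employed"),
         Age = as.integer(Age))  # Age was extracted as character
```

Anything that matches none of the patterns stays NA, which makes the remaining tweaking work easy to spot.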
Save your work
library(readr)
write_csv(rzlt, "myResponses.csv")
Alternatively
library(openxlsx)
openxlsx::write.xlsx(rzlt, "myResponses.xlsx", asTable = TRUE)

180 nested conditions in a separate file to create a new id variable for each row in the my dataframe

I need to identify 180 short sentences written by experiment participants and match each sentence to a serial number in a new column. I have 180 conditions in a separate file. All the texts are in Hebrew, but I attach examples in English that can be understood.
I'm adding an example of seven lines from the 180-line experiment data. There are 181 different condition IDs, each with its own serial number. I also add a small 6-condition example that matches this participant data:
data_participant <- data.frame(
  "text" = c("I put a binder on a high shelf",
             "My friend and me are eating chocolate",
             "I wake up with superhero powers",
             "Low wooden table with cubes",
             "The most handsome man in camopas invites me out",
             "My mother tells me she loves me and protects me",
             "My laptop drops and breaks"),
  "trial" = 1:7)

data_condition <- data.frame(
  "condition_a" = c("wooden table", "eating", "loves", "binder", "handsome", "superhero"),
  "condition_b" = c("cubes", "chocolate", "protects me", "shelf", "campos", "powers"),
  "condition_c" = c("0", "0", "0", "0", "me out", "0"),
  "i.d." = 1:6)
I decided to use the ifelse function with a nested-conditions strategy and write 181 lines of code, one line per condition. It's also cumbersome because it requires moving between English and Hebrew. But after about 30 lines I started getting an error message:
contextstack overflow
A screenshot of the error at line 147 means it appears after 33 conditions.
In the example there are at most 3 keywords per condition, but in the full data some conditions have 5 or 6 keywords (because of the diversity in the participants' verbal formulations). Therefore the original table of conditions has 7 columns: one for the i.d. number, and the rest for the keywords that identify the same condition, combined with an "or" operator.
data <- mutate(data, script_id =
  ifelse(grepl("wooden table", data$imagery) | grepl("cubes", data$imagery), "1",
  ifelse(grepl("eating", data$imagery) | grepl("chocolate", data$imagery), "2",
  ifelse(grepl("loves", data$imagery) | grepl("protect me", data$imagery), "3",
  ifelse(grepl("binder", data$imagery) | grepl("shelf", data$imagery), "4",
  ifelse(grepl("handsome", data$imagery) | grepl("campus", data$imagery) | grepl("me out", data$imagery), "5",
  ifelse(grepl("superhero", data$imagery) | grepl("powers", data$imagery), "6",
  "181")))))))
# I expect the output will be new column in the participant data frame
# with the corresponding ID number for each text.
# I managed to get it when I made 33 conditions rows. And then I started
# to get an error message contextstack overflow.
final_output <- data.frame(
  "text" = c("I put a binder on a high shelf",
             "My friend and me are eating chocolate",
             "I wake up with superhero powers",
             "Low wooden table with cubes",
             "The most handsome man in camopas invites me out",
             "My mother tells me she loves me and protects me",
             "My laptop drops and breaks"),
  "trial" = 1:7,
  "i.d." = c(4, 2, 6, 1, 5, 3, 181))
Here's an approach using fuzzyjoin::regex_left_join.
library(tidyverse)

data_condition_long <- data_condition %>%
  gather(col, text_match, -`i.d.`) %>%
  filter(text_match != 0) %>%
  arrange(`i.d.`)

data_participant %>%
  fuzzyjoin::regex_left_join(data_condition_long %>% select(-col),
                             by = c("text" = "text_match")) %>%
  mutate(`i.d.` = if_else(is.na(`i.d.`), 181L, `i.d.`)) %>%
  # if `i.d.` is doubles instead of integers, use this:
  # mutate(`i.d.` = if_else(is.na(`i.d.`), 181, `i.d.`)) %>%
  group_by(trial) %>%
  slice(1) %>%
  ungroup() %>%
  select(-text_match)
# A tibble: 7 x 3
text trial i.d.
<fct> <int> <int>
1 I put a binder on a high shelf 1 4
2 My friend and me are eating chocolate 2 2
3 I wake up with superhero powers 3 6
4 Low wooden table with cubes 4 1
5 The most handsome man in camopas invites me out 5 5
6 My mother tells me she loves me and protects me 6 3
7 My laptop drops and breaks 7 181
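To scale this to conditions with 5 or 6 keyword columns, one option (my addition, a sketch under the same matching assumptions, not from the original answer) is to collapse each condition's keywords into a single regex alternation before the join:

```r
library(tidyverse)

# collapse all keyword columns of a condition into one "kw1|kw2|..." pattern,
# dropping the "0" placeholders used for missing keywords
data_condition_pattern <- data_condition %>%
  gather(col, text_match, -`i.d.`) %>%
  filter(text_match != "0") %>%
  group_by(`i.d.`) %>%
  summarise(pattern = paste(text_match, collapse = "|"))

data_participant %>%
  fuzzyjoin::regex_left_join(data_condition_pattern, by = c("text" = "pattern")) %>%
  mutate(`i.d.` = coalesce(`i.d.`, 181L)) %>%
  select(-pattern)
```

This joins each text at most once per condition, though a sentence matching keywords from two different conditions would still produce two rows, so the group_by(trial)/slice(1) step from above may still be needed.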
