Listing all the different strings from a dataframe in R

I'm still a newbie with R and I can't figure this out. I have a dataframe that looks like this:
Age State      Diagnosis
12  Texas      Lung Cancer
67  California Colon Cancer
45  Wyoming    Lung Cancer
36  New Mex.   Leukemia
58  Arizona    Colon Cancer
35  Colorado   Leukemia
I need a program that prints, or adds to another dataframe, all the different strings found in each column, so that I know all the "types". For example, for the column "Diagnosis", the program should create a dataframe containing only "Lung Cancer", "Colon Cancer" and "Leukemia", since there are only those 3 types, even though they are repeated.

You can use unique.
Assuming you have a dataframe data with all the information, you can use the function unique() to list all the distinct values, removing repetitions (column names are case-sensitive, so the column is Diagnosis, not diagnosis):
types <- unique(data$Diagnosis)

You can do the following to get the distinct diagnoses:
AllDiagnosis <- unique(data$Diagnosis)

Here is another option with distinct() from dplyr:
library(dplyr)
data %>%
  distinct(Diagnosis) %>%
  pull(Diagnosis)
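Since the question asks about every column rather than just one, here is a minimal base R sketch of how that might look. The data frame below is retyped from the table in the question, and the names types_by_column and diagnosis_types are only illustrative.

```r
# Data retyped from the question's table
data <- data.frame(
  Age = c(12, 67, 45, 36, 58, 35),
  State = c("Texas", "California", "Wyoming", "New Mex.", "Arizona", "Colorado"),
  Diagnosis = c("Lung Cancer", "Colon Cancer", "Lung Cancer",
                "Leukemia", "Colon Cancer", "Leukemia"),
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so lapply() applies unique() to each one
types_by_column <- lapply(data, unique)
print(types_by_column$Diagnosis)

# The distinct diagnoses as their own one-column data frame
diagnosis_types <- data.frame(Diagnosis = unique(data$Diagnosis),
                              stringsAsFactors = FALSE)
print(diagnosis_types)
```

The result of lapply() is a list rather than a data frame, because the columns will usually have different numbers of distinct values.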

Related

Categorizing types of duplicates in R

Let's say I have the following data frame:
df <- data.frame(
  address = c('654 Peachtree St','890 River Rd','890 River Rd','890 River Rd','1234 Main St','1234 Main St','567 1st Ave','567 1st Ave'),
  city = c('Atlanta','Eugene','Eugene','Eugene','Portland','Portland','Pittsburgh','Etna'),
  state = c('GA','OR','OR','OR','OR','OR','PA','PA'),
  zip5 = c('30308','97404','97404','97404','97201','97201','15223','15223'),
  zip9 = c('30308-1929','97404-3253','97404-3253','97404-3253','97201-5717','97201-5000','15223-2105','15223-2105'),
  stringsAsFactors = FALSE)
           address       city state  zip5       zip9
1 654 Peachtree St    Atlanta    GA 30308 30308-1929
2     890 River Rd     Eugene    OR 97404 97404-3253
3     890 River Rd     Eugene    OR 97404 97404-3253
4     890 River Rd     Eugene    OR 97404 97404-3253
5     1234 Main St   Portland    OR 97201 97201-5717
6     1234 Main St   Portland    OR 97201 97201-5000
7      567 1st Ave Pittsburgh    PA 15223 15223-2105
8      567 1st Ave       Etna    PA 15223 15223-2105
I'm considering any rows with a matching address and zip5 to be duplicates.
Filtering out or keeping duplicates based on these two columns is simple enough in R. What I'm trying to do is create a new column with a conditional label for each set of duplicates, ending up with something similar to this:
       address       city state  zip5       zip9           type
1 890 River Rd     Eugene    OR 97404 97404-3253    Exact Match
2 890 River Rd     Eugene    OR 97404 97404-3253    Exact Match
3 890 River Rd     Eugene    OR 97404 97404-3253    Exact Match
4 1234 Main St   Portland    OR 97201 97201-5717 Different Zip9
5 1234 Main St   Portland    OR 97201 97201-5000 Different Zip9
6  567 1st Ave Pittsburgh    PA 15223 15223-2105 Different City
7  567 1st Ave       Etna    PA 15223 15223-2105 Different City
(I'd also be fine with a True/False column for each type of duplicate.)
I'm assuming the solution will be in some mutate+ifelse+boolean code, but I think it's the comparing within each duplicate subset that has me stuck...
Any advice?
Edit:
I don't believe this is a duplicate of Find duplicated rows (based on 2 columns) in Data Frame in R. I can use that solution to create a T/F column for each type of duplicate/group_by match, but I'm trying to create exclusive categories. How could my conditions also take differences into account? The exact match rows should show true only on the "exact match" column, and false for every other column. If I define my columns simply by feeding different combinations of columns to group_by, the exact match rows will never return a False.

I think the key is grouping by a "reference" variable (here address makes sense) and then counting the number of unique items in each of the other columns within the group. It's not a perfect solution, since case_when returns the first condition that matches: if one address has two different cities AND two different zip codes, you'll only see "Different City". If that matters, you will need to add conditions for the combined cases ahead of the single ones. Still, the number of unique items is a reasonable heuristic if you don't need a perfectly granular solution.
library(dplyr)
df %>%
  group_by(address) %>%
  mutate(
    # conditions are checked in order; the first TRUE one sets the label
    match_type = case_when(
      all(
        length(unique(city)) == 1,
        length(unique(state)) == 1,
        length(unique(zip5)) == 1,
        length(unique(zip9)) == 1) ~ "Exact Match",
      length(unique(city)) > 1 ~ "Different City",
      length(unique(state)) > 1 ~ "Different State",
      length(unique(zip5)) > 1 ~ "Different Zip5",
      length(unique(zip9)) > 1 ~ "Different Zip9"
    ))
Otherwise, you'll have to do iterative grouping (address + other variable) and mutate in a Boolean column as you alluded to.
Edit
One additional approach I just thought of, if you need a more granular solution: add an id column (df %>% rowid_to_column("ID"), from the tibble package), then do a full join of the table to itself by address with suffixes (e.g. suffix = c("a","b")). Filter out the rows where both IDs are the same, call distinct (since each comparison appears twice), and then make Boolean columns with mutate for the pairwise comparisons. It may be too computationally intensive, depending on the size of your dataset, but it should work on the scale of a few thousand rows if you have a reasonable amount of RAM.
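For the True/False columns the question mentions, a rough sketch in base R (rather than dplyr, so it runs without extra packages) could look like the following. It is not the self-join described above. It assumes duplicates are defined by address plus zip5, as in the question, drops the state column for brevity, and uses a hypothetical helper n_distinct_in_group.

```r
df <- data.frame(
  address = c("654 Peachtree St", "890 River Rd", "890 River Rd", "890 River Rd",
              "1234 Main St", "1234 Main St", "567 1st Ave", "567 1st Ave"),
  city = c("Atlanta", "Eugene", "Eugene", "Eugene",
           "Portland", "Portland", "Pittsburgh", "Etna"),
  zip5 = c("30308", "97404", "97404", "97404", "97201", "97201", "15223", "15223"),
  zip9 = c("30308-1929", "97404-3253", "97404-3253", "97404-3253",
           "97201-5717", "97201-5000", "15223-2105", "15223-2105"),
  stringsAsFactors = FALSE
)

# Rows count as duplicates when address and zip5 both match
key <- paste(df$address, df$zip5, sep = "|")

# Hypothetical helper: number of distinct values of x within each key group,
# repeated for every row of that group
n_distinct_in_group <- function(x) {
  ave(seq_along(x), key, FUN = function(i) length(unique(x[i])))
}

# A row is part of a duplicate set only if its group has more than one row
is_dup <- ave(seq_along(key), key, FUN = length) > 1

df$exact_match    <- is_dup & n_distinct_in_group(df$city) == 1 &
                     n_distinct_in_group(df$zip9) == 1
df$different_city <- is_dup & n_distinct_in_group(df$city) > 1
df$different_zip9 <- is_dup & n_distinct_in_group(df$zip9) > 1
print(df)
```

Because exact_match requires that nothing differs, an exact-match row is FALSE in every "different" column, which is the exclusivity the edit to the question asks for. A group that differs in both city and zip9 would be TRUE in both "different" columns.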

Subsetting to Output only certain CHAR names in a single column

I want to keep all rows where the "conm" column contains certain bank names. As you can tell from the code, I am trying to use subset to do this, but to no avail.
This is what I have tried:
CMPSTPRFT12 <- subset(CMPSPRFT11, conm = MORGUARD CORP | conm = LEHMAN BROTHERS HOLDINGS INC)
I expect the output in RStudio to show only the rows where the column containing the names of banks includes certain banks, not all banks. I want SunTrust, Lehman Brothers, Morgan Stanley, Goldman Sachs, PennyMac, Bank of America, and Fannie Mae.

Please see other posts on how to phrase your questions more helpfully for others. How to make a great R reproducible example
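You can use dplyr and filter.
library(dplyr)
df <- data.frame(bank = letters[1:10],
                 value = 10:19)
# keep rows matching either of two values
df %>% filter(bank == 'a' | bank == 'b')
  bank value
1    a    10
2    b    11
# keep rows whose bank appears in a vector of names
banks <- c('d', 'g', 'j')
df %>% filter(bank %in% banks)
  bank value
1    d    13
2    g    16
3    j    19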
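The same idea also works with subset() from the question, as long as the names are quoted strings and compared with == or %in% rather than =. A small sketch follows. The data frame is a made-up stand-in for CMPSPRFT11, and the exact spellings of the company names in the real data are an assumption.

```r
# Hypothetical stand-in for the asker's CMPSPRFT11 data frame
CMPSPRFT11 <- data.frame(
  conm = c("MORGUARD CORP", "LEHMAN BROTHERS HOLDINGS INC",
           "MORGAN STANLEY", "ACME CORP"),
  value = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Names to keep; they must match the values in conm exactly
keep <- c("LEHMAN BROTHERS HOLDINGS INC", "MORGAN STANLEY")

# %in% tests each conm value for membership in keep
CMPSTPRFT12 <- subset(CMPSPRFT11, conm %in% keep)
print(CMPSTPRFT12)
```

If the names in the data vary in case or spelling, an exact match like this will miss them, and a pattern match with grepl() may be a better fit.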

apply nested within lapply not working in R

Just earlier today I received a very helpful answer for a problem I was running into that allowed me to move onto the next step of one of my projects. However, I got stuck again later on in the project, and I'm wondering if any of you can help me move forward.
Context
Currently, I have a list of data frames that are full of soccer matches called wc_match_dataframes. Here is what one of the data frames looks like:
type_id tourn_id day month year team_A  score_A score_B team_B  win     loss
f       wc_1934  27  5     1934 Germany 5       2       Belgium Germany Belgium
I wasn't able to fit the data for the final three columns, draw, drawA, and drawB but basically the draw column is TRUE if the match is a draw, if not, it is FALSE. In the case of a draw, the win and loss columns are just filled by Draw. The drawA column is filled by team_A if the match was a draw, and likewise, the drawB column is filled by team_B.
The type_id is either f or q depending on if the match was a World Cup qualifier or a World Cup finals match. The tourn_id refers to the tournament the match was for, whether it was a qualifier or finals.
There are a total of 39 of these data frames, with a "finals" data frame for each of the 20 World Cup tournaments, and a "qualifiers" data frame for 19 tournaments (the first World Cup did not have qualifying).
What I Want To Do
I'm trying to populate a different list of data frames wc_dataframes with data for each of the 20 World Cups at the country level as opposed to the match level. Each of these twenty data frames will have the countries that made it to the finals of said tournament and their data like so:
Country
Wins in qualifying
Wins in finals
Losses in qualifying
Losses in finals
... and so on.
I have been able to populate the first country column for every World Cup no problem, but I'm running into issues for the rest of the columns.
Here is what I'm doing
This is the unlooped (only works for one World Cup) version of my code that works successfully:
wc_dataframes$wc_1930$fw <- apply(wc_dataframes$wc_1930, MARGIN = 1, function(country)
sum(wc_match_dataframes$`wc_1930 f`$w == country, na.rm = TRUE))
This is successfully populating the finals win column in the wc_dataframes$wc_1930 data frame by counting the number of wins.
Now, when I try and nest this under lapply to do it across all World Cup years like so:
lapply(names(wc_dataframes), function(year)
wc_dataframes$year$fw <- apply(wc_dataframes$year, MARGIN = 1, function(country)
sum(wc_match_dataframes$`year f`$w == country, na.rm = TRUE)))
It does not work for me. I suspect that the issue has to do with defining the year function and running into issues in the sum portion of my code. I come from a background in STATA so I am more used to running for loops and what not. I'm still getting used to R and lists and everything so I really appreciate the help.
Thank you!
Thank you so much in advance for the help, and happy holidays! :)

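Two things need to change. First, $year looks for an element literally named "year", so index with [[year]] and build the match name with paste(). Second, the assignment inside the function only modifies a local copy, so the function has to return the updated data frame and the result of lapply has to be assigned back:
wc_dataframes <- lapply(setNames(nm = names(wc_dataframes)), function(year) {
  # count the finals wins of each country for this tournament
  wc_dataframes[[year]]$fw <- apply(wc_dataframes[[year]], MARGIN = 1, function(country)
    sum(wc_match_dataframes[[paste(year, 'f')]]$w == country, na.rm = TRUE))
  wc_dataframes[[year]]  # return the updated data frame for this year
})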
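Since the question mentions being more used to for loops, here is a small sketch of the same update written as a loop, which modifies wc_dataframes in place. The two toy lists are invented for illustration, and the column names country and win are assumptions based on the description in the question.

```r
# Toy stand-ins for the asker's two lists of data frames
wc_match_dataframes <- list(
  "wc_1930 f" = data.frame(win = c("Uruguay", "Argentina", "Uruguay"),
                           stringsAsFactors = FALSE),
  "wc_1934 f" = data.frame(win = c("Italy", "Italy", "Germany"),
                           stringsAsFactors = FALSE)
)
wc_dataframes <- list(
  wc_1930 = data.frame(country = c("Uruguay", "Argentina"), stringsAsFactors = FALSE),
  wc_1934 = data.frame(country = c("Italy", "Germany"), stringsAsFactors = FALSE)
)

for (year in names(wc_dataframes)) {
  # [[ ]] evaluates the variable; $year would look for an element named "year"
  matches <- wc_match_dataframes[[paste(year, "f")]]
  # count how often each country appears in the win column
  wc_dataframes[[year]]$fw <- sapply(
    wc_dataframes[[year]]$country,
    function(country) sum(matches$win == country, na.rm = TRUE),
    USE.NAMES = FALSE
  )
}
print(wc_dataframes$wc_1930)
```

Unlike the body of an lapply function, the body of a for loop runs in the calling environment, so the assignment changes wc_dataframes directly.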

What's the smart way to aggregate data?

Suppose there is a dataset of different regions, each region a subset of a state, and some outcome variable:
regions <- c("Michigan, Eastern",
"Michigan, Western",
"Minnesota",
"Mississippi, Northern",
"Mississippi, Southern",
"Missouri, Eastern",
"Missouri, Western")
set.seed(123)
outcome <- rpois(7, 12)
testset <- data.frame(regions,outcome)
                regions outcome
1     Michigan, Eastern      10
2     Michigan, Western      11
3             Minnesota      17
4 Mississippi, Northern      12
5 Mississippi, Southern      12
6     Missouri, Eastern      17
7     Missouri, Western      13
A useful tool would aggregate each region and add, or take the mean or maximum, etc. of outcome by region and generate a new data frame for state. A sum, for example, would output this:
        state outcome
1    Michigan      21
3   Minnesota      17
4 Mississippi      24
6    Missouri      30
The aggregate() function won't solve this problem. Is there something else in R that is built for this? It seems like grep could be used to generate the new column "states" as part of an application specific program. Seems like this would already be out there somewhere though.

The reason this isn't straightforward is that the structure of your data is not consistent, so you couldn't build a library simply for it.
Your state, region column is basically an index column, and you want to index across part of it. tapply is designed for this, but there's no reason to build in a function to do it automatically for this specific scenario. You could do it without creating the column though:
tapply(testset$outcome, gsub(",.*$", "", testset$regions), sum)
The gsub() call just removes the comma and everything after it, leaving the state name to use as the index.
PS: you have a slight typo in your example, your data.frame should be
testset <- data.frame(regions,outcome)
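If you would rather get a data frame back, as in the desired output, one possible sketch is to create the state column first and then call aggregate() on it. The outcome values below are copied from the printed table in the question instead of being regenerated with rpois(), and by_state is just an illustrative name.

```r
regions <- c("Michigan, Eastern", "Michigan, Western", "Minnesota",
             "Mississippi, Northern", "Mississippi, Southern",
             "Missouri, Eastern", "Missouri, Western")
outcome <- c(10, 11, 17, 12, 12, 17, 13)  # copied from the question's table
testset <- data.frame(regions, outcome, stringsAsFactors = FALSE)

# Strip the comma and everything after it to get the state name
testset$state <- sub(",.*$", "", testset$regions)

# Sum outcome within each state; swap in mean or max as needed
by_state <- aggregate(outcome ~ state, data = testset, FUN = sum)
print(by_state)
```

This relies on every region name starting with the state followed by a comma, or being the state name alone.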

getting the max() of a data frame under certain conditions

I have a rather large dataframe with 13 variables. Here is the first line just to give an idea:
  prov_code nuts1 nuts1name nuts2 nuts2name prov_geoorder  prov_name NUTS_ID EDAD year ORDER graphs       value   prov_geo
1        15     1        NW    11   Galicia             1 La Corunna   ES111   11 1975     1      1 0.000000000 La Corunna
I would like to obtain the maximum for a certain set of variables according to a combination of variables year ORDER and prov_code (ie, f_all being my data.frame: f_all[(f_all$year==1975)&(f_all$ORDER==1)&(f_all$prov_code=="1"),] ). The goal is to repeat the operation in order to obtain a new data frame containing all the maximum values for each year, ORDER, prov_code.
Is there a simple and quick way to do this?
Thanks for any suggestion on the matter,

There are several ways of doing this, for example the one @James mentions. I want to suggest using plyr:
library(plyr)
ddply(f_all, .(year, ORDER, prov_code), summarise, mx_value = max(value))
Alternatively, if you have a lot of data, data.table provides similar functionality, but is much much faster in that case.
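For comparison, the same grouping can be sketched in base R with aggregate(), which needs no extra package. This is neither the plyr nor the data.table route. The f_all below is a tiny invented data frame with only the columns involved, and mx is an illustrative name.

```r
# Invented miniature version of f_all with only the relevant columns
f_all <- data.frame(
  year = c(1975, 1975, 1975, 1976, 1976),
  ORDER = c(1, 1, 1, 1, 1),
  prov_code = c("1", "1", "15", "1", "1"),
  value = c(0.2, 0.5, 0.1, 0.7, 0.3),
  stringsAsFactors = FALSE
)

# One row per (year, ORDER, prov_code) combination, holding the maximum value
mx <- aggregate(value ~ year + ORDER + prov_code, data = f_all, FUN = max)
print(mx)
```

To take the maximum of several variables at once, cbind(value, other_value) can be used on the left-hand side of the formula, where other_value stands for any further numeric column.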
