How to make multiple rows from a single row in R?

I currently have a data set that has all information within one single row (or column if I transpose). The very first items in the data are actually column names:
Country
Population
O+
A+
B+
AB+
O-
A-
B-
AB-
Albania
3,074,579
32.1%
31.2%
14.5%
5.2%
6.0%
5.5%
2.6%
0.9%
Algeria
43,576,691
5019
40.0%
30.0%
15.0%
4.25%
6.6%
2.3%
1.1%
Argentina
45,479,118
8017
48.9%
2.45%
4.9%
3.16%
0.8%
0.25%
...
Armenia
3,021,324
8870
29.0%
46.3%
12.0%
5.6%
2.0%
...
...
The problem is that right now, my table has all these values within ONE single column (or row if I transpose).
How can I make sure to have a new row for each country?
I'm really just trying to web-scrape the blood type distribution by country table found here, but after attempting to do so I ran into this problem. Help with either issue would be appreciated!
Thank you.

This should work:
library(rvest)
library(tidyverse)
baseurl <- "https://en.wikipedia.org/wiki/Blood_type_distribution_by_country"
fullurl <- URLencode(baseurl)
tables <- read_html(fullurl) %>%
  html_table(fill = TRUE) # parse every <table> on the page into a list of data frames
df <- tables[[2]] # the blood type distribution table
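If instead you already have all the values in a single character vector, a minimal base R sketch of the reshaping itself (assuming the vector is called vals -- a hypothetical name -- that its first 10 entries are the column names, and that each country contributes exactly 10 values, as in the preview above):
header <- vals[1:10] # the column names: Country, Population, O+, ..., AB-
body <- vals[-(1:10)] # the remaining values, 10 per country
df <- as.data.frame(matrix(body, ncol = 10, byrow = TRUE),
                    stringsAsFactors = FALSE)
names(df) <- header # one row per country, one column per field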

Related

Listing all the different strings from a dataframe in R

I'm still a newbie with R and I can't figure this out. I have a dataframe that looks like this:
Age State Diagnosis
12 Texas Lung Cancer
67 California Colon Cancer
45 Wyoming Lung Cancer
36 New Mex. Leukemia
58 Arizona Colon Cancer
35 Colorado Leukemia
I need a program that somehow prints, or adds into another dataframe, all the different strings located in each column, so I can know all the "types". For example, in the case of the column "Diagnosis", the program should create a dataframe with only "Lung Cancer", "Colon Cancer" and "Leukemia", since there are only those 3 types, even though they are repeated.
You can use unique.
Assuming you have a dataframe data with all the information, you can use the function unique() to list all the occurrences, removing repetitions:
types <- unique(data$Diagnosis)
You can do the following to get the data:
AllDiagnosis <- unique(data$Diagnosis)
Here is another option with distinct:
library(dplyr)
data %>%
  distinct(Diagnosis) %>%
  pull(Diagnosis)
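To get the distinct values of every column at once (not just Diagnosis), a minimal sketch: lapply() applies unique() to each column, and a list is the natural container since the columns will usually have different numbers of unique values.
all_types <- lapply(data, unique) # named list, one element per column
all_types$Diagnosis # "Lung Cancer" "Colon Cancer" "Leukemia"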

Conditional calculation based on other columns lagged values

Newbie: I have a dataset where I want to calculate the y-o-y growth of sales of a company. The dataset contains approx. 1,000 companies, each with a different number of years listed on a public stock exchange. The data looks like this:
# gvkey fyear at company name
#22 17436 2010 59393 BASF SE
#23 17436 2011 61175 BASF SE
#24 17436 2012 64327 BASF SE
...
#30 17436 2018 86556 BASF SE
#31 17828 1989 62737 DAIMLER AG
#32 17828 1990 67339 DAIMLER AG
#33 17828 1991 75714 DAIMLER AG
...
#60 17828 2018 281619 DAIMLER AG
I would like to create a new column growth where I calculate the percentage increase of at for e.g. BASF SE (gvkey 17436) from 2010 to 2011, to 2012, and so on. In row #31 the conditional statement should kick in so that the increase is not calculated from values that belong to BASF, but instead yields an NA value. The next value in the new growth column, in row #32, would then be the percentage increase of DAIMLER AG (gvkey 17828) from 62737 to 67339.
So far I tried:
if TA$gvkey == lag(TA$gvkey) {mutate(TA, growth = (at - lag(at))/lag(at))} else {NULL}
Basically I tried to condition the calculation on the change of the gvkey identifier as this makes the most sense to me. I believe there is a nicer way of maybe running a loop until the gvkey changes and the continue with the next set of values - but I simply don't know how to code that.
I am very new to R and quite lost. I would appreciate every support! Thank you, guys :)
I do not see a way to do this in one line. Assuming your data is called data, you may try:
results <- list()
for(i in unique(data$gvkey)){
  a <- subset(data, gvkey == i) # a now contains the data of one company
  # calculate pairwise relative difference of at (assumes sorted years!)
  rel_diff <- diff(a$at)/head(a$at, -1) # diff() computes pairwise differences; head(a$at, -1) drops the last element
  a$growth <- c(0, rel_diff) # extend data frame by result; first difference is 0
  results[[as.character(i)]] <- a # output: collect each company's rows
}
data <- do.call(rbind, results) # reassemble the full data set
This is a base R solution. There might be more efficient ways, but this is easy to understand.
In this case the group_by() function in dplyr is a good tool to use. By group_by()-ing your gvkey column, your mutate() call will apply separately to each distinct value of gvkey. Here is a quick example I made with some dummy data and your same column names:
library(dplyr)
dummyData = data.frame(gvkey = c(111,111,111,222,222,222),
                       fyear = c(2010,2012,2011,2010,2011,2013),
                       at = c(2,4,2,4,5,10))
dummyDataTransformed = dummyData %>%
  group_by(gvkey) %>%
  arrange(fyear) %>% # to make sure we are chronologically in order
  mutate(growth = at/lag(at,1) - 1) %>% # subtract 1 to get year-over-year change
  ungroup() # I like to ungroup just to make sure I'm not bugging out any calculations I might add further down the line
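With the dummy data above, the result should look roughly like this (rows end up interleaved because arrange(fyear) sorts the whole frame, but growth is still computed within each gvkey group):
# gvkey fyear at growth
# 111   2010   2 NA
# 222   2010   4 NA
# 111   2011   2 0.00
# 222   2011   5 0.25
# 111   2012   4 1.00
# 222   2013  10 1.00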

Categorizing types of duplicates in R

Let's say I have the following data frame:
df <- data.frame(address = c('654 Peachtree St','890 River Rd','890 River Rd','890 River Rd',
                             '1234 Main St','1234 Main St','567 1st Ave','567 1st Ave'),
                 city = c('Atlanta','Eugene','Eugene','Eugene','Portland','Portland','Pittsburgh','Etna'),
                 state = c('GA','OR','OR','OR','OR','OR','PA','PA'),
                 zip5 = c('30308','97404','97404','97404','97201','97201','15223','15223'),
                 zip9 = c('30308-1929','97404-3253','97404-3253','97404-3253','97201-5717','97201-5000','15223-2105','15223-2105'),
                 stringsAsFactors = FALSE)
  address city state zip5 zip9
1 654 Peachtree St Atlanta GA 30308 30308-1929
2 890 River Rd Eugene OR 97404 97404-3253
3 890 River Rd Eugene OR 97404 97404-3253
4 890 River Rd Eugene OR 97404 97404-3253
5 1234 Main St Portland OR 97201 97201-5717
6 1234 Main St Portland OR 97201 97201-5000
7 567 1st Ave Pittsburgh PA 15223 15223-2105
8 567 1st Ave Etna PA 15223 15223-2105
I'm considering any rows with a matching address and zip5 to be duplicates.
Filtering out or keeping duplicates based on these two columns is simple enough in R. What I'm trying to do is create a new column with a conditional label for each set of duplicates, ending up with something similar to this:
  address city state zip5 zip9 type
1 890 River Rd Eugene OR 97404 97404-3253 Exact Match
2 890 River Rd Eugene OR 97404 97404-3253 Exact Match
3 890 River Rd Eugene OR 97404 97404-3253 Exact Match
4 1234 Main St Portland OR 97201 97201-5717 Different Zip9
5 1234 Main St Portland OR 97201 97201-5000 Different Zip9
6 567 1st Ave Pittsburgh PA 15223 15223-2105 Different City
7 567 1st Ave Etna PA 15223 15223-2105 Different City
(I'd also be fine with a True/False column for each type of duplicate.)
I'm assuming the solution will be in some mutate+ifelse+boolean code, but I think it's the comparing within each duplicate subset that has me stuck...
Any advice?
Edit:
I don't believe this is a duplicate of Find duplicated rows (based on 2 columns) in Data Frame in R. I can use that solution to create a T/F column for each type of duplicate/group_by match, but I'm trying to create exclusive categories. How could my conditions also take differences into account? The exact match rows should show true only on the "exact match" column, and false for every other column. If I define my columns simply by feeding different combinations of columns to group_by, the exact match rows will never return a False.
I think the key is grouping by a "reference" variable--here address makes sense--and then counting the number of unique items in that vector. It's not a perfect solution, since my use of case_when will prioritize earlier options (i.e. if there are two different cities attributed to one address AND two different zip codes, you'll only see that there are two different cities; you will need to address this with additional case_when statements if it matters). However, getting the length of unique items is a reasonable heuristic in this case if you don't need a perfectly granular solution.
df %>%
  group_by(address) %>%
  mutate(
    match_type = case_when(
      all(
        length(unique(city)) == 1,
        length(unique(state)) == 1,
        length(unique(zip5)) == 1,
        length(unique(zip9)) == 1) ~ "Exact Match",
      length(unique(city)) > 1 ~ "Different City",
      length(unique(state)) > 1 ~ "Different State",
      length(unique(zip5)) > 1 ~ "Different Zip5",
      length(unique(zip9)) > 1 ~ "Different Zip9"
    ))
Otherwise, you'll have to do iterative grouping (address + other variable) and mutate in a Boolean column as you alluded to.
Edit
One additional approach I just thought of, if you need a more granular solution, is to add an id column (df %>% rowid_to_column("ID")) and then full join the table to itself by address with suffixes (e.g. suffix = c("a","b")), filtering out rows with the same ID and calling distinct() (since each comparison appears twice); you can then make Boolean columns with mutate() for the pairwise comparisons. It may be too computationally intensive, depending on the size of your dataset, but it should work on the scale of a few thousand rows if you have a reasonable amount of RAM.
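A minimal sketch of that self-join idea, assuming the df from above and that rowid_to_column() comes from the tibble package (the _a/_b suffixes stand in for the suggested a/b ones; filtering on ID_a < ID_b drops self-matches and keeps each pair once, playing the role of the distinct() step):
library(dplyr)
library(tibble)

df_id <- rowid_to_column(df, "ID")
pairs <- df_id %>%
  full_join(df_id, by = "address", suffix = c("_a", "_b")) %>%
  filter(ID_a < ID_b) %>% # drop self-matches; keep each comparison once
  mutate(same_city = city_a == city_b, # Boolean pairwise comparisons
         same_zip9 = zip9_a == zip9_b)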

Indexing with mutate

I have an unbalanced panel by country as the following:
cname year disability_PC family_PC ... allFunctions_PC
Denmark 1992 953.42 1143.25 ... 9672.43
Denmark 1995 1167.33 1361.62 ... 11002.45
Denmark 2000 1341 1470.54 ... 11200
Finland 1991 1095 955 ... 7164
Finland 1996 1067 1040 ... 7600
And so on for more years and countries. What I would like to do is compute the mobile index for each type of social expenditure (disability_PC, family_PC, ..., allFunctions_PC).
Therefore, I tried the following:
pdata %>%
  group_by(cname) %>%
  mutate_at(vars(disability_absPC, family_absPC, Health_absPC, oldage_absPC, unemp_absPC, housing_absPC, allFunctions_absPC),
            funs(chg = ((./lag(.))*100)))
The code seems to work: R reports the first 10 columns and correctly says "with 56 more rows, and 13 more variables". However, these variables are not added to the data frame. I mean, typing
view(pdata)
the variables do not exist, as if the mutate command had not created them.
What am I doing wrong?
Thank you for the support.
We can make this simpler with some of the select helpers; also, funs() is deprecated. In its place, we can use list():
library(dplyr)
pdata <- pdata %>%
  group_by(cname) %>%
  mutate_at(vars(ends_with('absPC')),
            list(chg = ~ ((./lag(.))*100)))
Regarding the issue of the variables not being created: in the OP's code, the output is not assigned to any object, nor does it update the original object (with <-). Once that is done, the columns will be created.
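On dplyr 1.0 or later, mutate_at() itself is superseded; a sketch of the same computation with across(), assuming the same pdata:
library(dplyr)
pdata <- pdata %>%
  group_by(cname) %>%
  mutate(across(ends_with('absPC'), ~ (.x/lag(.x))*100,
                .names = "{.col}_chg")) %>% # produces e.g. disability_absPC_chg
  ungroup()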

Using full name and maiden name strings (and birthdays) to match individuals across time

I've got a set of 20 or so consecutive individual-level cross-sectional data sets which I would like to link together.
Unfortunately, there's no time-stable ID number; there are, however, fields for first, last, and maiden names, as well as year of birth--this should allow for a pretty high (90-95%) match rate, I presume.
Ideally, I would create a time-independent ID for each unique individual.
For those whose marital status (maiden name) does not change, I can do this pretty easily in R--stack the data sets to get a long panel, then do something to the effect of:
unique(dt,by=c("first_name","last_name","birth_year"))[,id:=.I]
(I'm of course using R data.table), then merging back to the full data.
However, I'm stuck on how to incorporate the maiden name into this procedure. Any suggestions?
Here's a preview of the data:
first_name last_name nee birth_year year
1: eileen aaldxxxx dxxxx 1977 2002
2: eileen aaldxxxx dxxxx 1977 2003
3: sarah aaxxxx gexxxx 1974 2003
4: kelly aaxxxx nxxxx 1951 2008
5: linda aarxxxx-gxxxx aarxxxx 1967 2008
---
72008: stacey zwirxxxx kruxxxx 1982 2010
72009: stacey zwirxxxx kruxxxx 1982 2011
72010: stacey zwirxxxx kruxxxx 1982 2012
72011: stacey zwirxxxx kruxxxx 1982 2013
72012: jill zydoxxxx gundexxxx 1978 2002
UPDATE:
I've done a lot of chipping and hammering at the problem; here's what I've got so far. I would appreciate any comments on possible improvements to the code.
I'm still completely missing something like 3-5% of matches due to inexact matches ("tonya" vs. "tanya", "jenifer" vs. "jennifer"); I haven't come up with a clean way of doing fuzzy matching on the stragglers, so there's room for better matching in that direction if anyone's got a straightforward way to implement that.
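One possible direction for the stragglers, sketched here with hypothetical inputs: compare each unmatched first name against the pool of already-matched names with a small edit-distance cutoff (adist() is base R and computes Levenshtein distances).
stragglers <- c("tonya", "jenifer") # hypothetical unmatched first names
matched <- c("tanya", "jennifer", "sarah") # hypothetical already-matched first names
d <- adist(stragglers, matched) # matrix of edit distances
close <- which(d <= 1, arr.ind = TRUE) # allow at most one edit
data.frame(straggler = stragglers[close[, "row"]],
           candidate = matched[close[, "col"]])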
The basic approach is to build cumulatively--assign IDs in the first year, then look for matches in the second year and assign new IDs to the unmatched; then for year 3, look back at the first 2 years, and so on. As for how to match, the idea is to slowly expand the matching criteria: the more robust the match, the lower the chances of an accidental mismatch (I'm particularly worried about the John Smiths).
Without further ado, here's the main function for matching a pair of data sets:
get_id<-function(yr,key_from,key_to=key_from,
mdis,msch,mard,init,mexp,step){
#Want to exclude anyone who is matched
existing_ids<-full_data[.(yr),unique(na.omit(teacher_id))]
#Get the most recent prior observation of all
# unmatched teachers, excluding those teachers
# who cannot be uniquely identified by the
# current key setting
unmatched<-
full_data[.(1996:(yr-1))
][!teacher_id %in% existing_ids,
.SD[.N],by=teacher_id,
.SDcols=c(key_from,"teacher_id")
][,if (.N==1L) .SD,keyby=key_from
][,(flags):=list(mdis,msch,mard,init,mexp,step)]
#Merge, reset keys
setkey(setkeyv(
full_data,key_to)[year==yr&is.na(teacher_id),
(update_cols):=unmatched[.SD,update_cols,with=F]],
year)
full_data[.(yr),(update_cols):=lapply(.SD,function(x)na.omit(x)[1]),
by=id,.SDcols=update_cols]
}
Then I basically go through the 19 years yy in a for loop, running 12 progressively looser matches, e.g. step 3 is:
get_id(yy,c("first_name_clean","last_name_clean","birth_year"),
mdis=T,msch=T,mard=F,init=F,mexp=F,step=3L)
The final step is to assign new IDs:
current_max<-full_data[.(yy),max(teacher_id,na.rm=T)]
new_ids<-
setkey(full_data[year==yy&is.na(teacher_id),.(id=unique(id))
][,add_id:=.I+current_max],id)
setkey(setkey(full_data,id)[year==yy&is.na(teacher_id),
teacher_id:=new_ids[.SD,add_id]],year)
