Indexing with mutate - r

I have an unbalanced panel by country as the following:
cname year disability_PC family_PC ... allFunctions_PC
Denmark 1992 953.42 1143.25 ... 9672.43
Denmark 1995 1167.33 1361.62 ... 11002.45
Denmark 2000 1341 1470.54 ... 11200
Finland 1991 1095 955 ... 7164
Finland 1996 1067 1040 ... 7600
And so on for more years and countries. What I would like to do is compute a chain index (each value as a percentage of the previous observation) for each type of social expenditure (disability_PC, family_PC, ... allFunctions_PC).
Therefore, I tried the following:
pdata %>%
group_by(cname) %>%
mutate_at(vars(disability_absPC, family_absPC, Health_absPC, oldage_absPC, unemp_absPC, housing_absPC, allFunctions_absPC),
funs(chg = ((./lag(.))*100)))
The code seems to work: R prints the first 10 rows and correctly reports "with 56 more rows, and 13 more variables". However, the new variables are not added to the data frame. That is, after typing
view(pdata)
the variables do not exist, as if the mutate command had not created them.
What am I doing wrong?
Thank you for the support.

We can make this simpler with a select helper such as ends_with(). Also, funs() is deprecated; a list of lambda formulas can be used in its place:
library(dplyr)
pdata <- pdata %>%
group_by(cname) %>%
mutate_at(vars(ends_with('absPC')), list(chg = ~ ((./lag(.))*100)))
As for the variables not being created: in the OP's code, the output is neither assigned to a new object nor written back to the original object (with <-). Once it is assigned, the columns will be there.
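As a minimal sketch of the same idea in current dplyr (1.0.0 or later), mutate() with across() can replace mutate_at(). The tiny pdata below is an assumption built from a few values in the question, with columns ending in _PC, and the _chg suffix is only an illustrative naming choice. The key point is still the assignment with <-.

```r
library(dplyr)

# Toy panel built from values shown in the question (assumed layout)
pdata <- tibble(
  cname         = c("Denmark", "Denmark", "Denmark", "Finland", "Finland"),
  year          = c(1992, 1995, 2000, 1991, 1996),
  disability_PC = c(953.42, 1167.33, 1341, 1095, 1067),
  family_PC     = c(1143.25, 1361.62, 1470.54, 955, 1040)
)

# Assign the result back to pdata; otherwise it is only printed
pdata <- pdata %>%
  arrange(cname, year) %>%          # lag() depends on row order
  group_by(cname) %>%
  mutate(across(ends_with("_PC"),
                ~ .x / lag(.x) * 100,
                .names = "{.col}_chg")) %>%
  ungroup()

names(pdata)
```

The first observation of each country has no previous value, so its index is NA.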

Related

How to make multiple rows from a single row in R?

I currently have a data set that has all information within one single row (or column if I transpose). The very first items in the data are actually column names:
Country
Population
O+
A+
B+
AB+
O-
A-
B-
AB-
Albania
3,074,579
32.1%
31.2%
14.5%
5.2%
6.0%
5.5%
2.6%
0.9%
Algeria
43,576,691
5019
40.0%
30.0%
15.0%
4.25%
6.6%
2.3%
1.1%
Argentina
45,479,118
8017
48.9%
2.45%
4.9%
3.16%
0.8%
0.25%
...
Armenia
3,021,324
8870
29.0%
46.3%
12.0%
5.6%
2.0%
...
...
The problem is that right now, my table has all these values within ONE single column (or row if I transpose).
How can I make sure to have a new row at each country?
I'm truly just trying to web scrape the blood type distribution by country table found here but after attempting to do so, I have encountered this problem. Help on either would be appreciated!
Thank you.
This should work
library(rvest)
library(tidyverse)
baseurl=("https://en.wikipedia.org/wiki/Blood_type_distribution_by_country")
fullurl=URLencode(baseurl)
tables = read_html(fullurl) %>%
html_table(fill = TRUE)
df = tables[[2]]
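If the data is already in hand as one long vector, a base R sketch like the following might also do the reshaping. It assumes the first 10 entries are the column names and that every country contributes exactly 10 entries; the values are copied as they appear in the question, and the names flat and blood are made up for this illustration.

```r
# Flat vector: 10 header entries followed by 10 entries per country
flat <- c("Country", "Population", "O+", "A+", "B+", "AB+", "O-", "A-", "B-", "AB-",
          "Albania", "3,074,579", "32.1%", "31.2%", "14.5%", "5.2%", "6.0%", "5.5%", "2.6%", "0.9%",
          "Algeria", "43,576,691", "5019", "40.0%", "30.0%", "15.0%", "4.25%", "6.6%", "2.3%", "1.1%")

n_cols <- 10

# Fill a matrix row by row, so each country starts a new row
values <- matrix(flat[-seq_len(n_cols)], ncol = n_cols, byrow = TRUE)

blood <- as.data.frame(values, stringsAsFactors = FALSE)
names(blood) <- flat[seq_len(n_cols)]

# Optional cleanup: turn "3,074,579" into a number
blood$Population <- as.numeric(gsub(",", "", blood$Population))

blood
```

This only works if no entries are missing; a country with fewer than 10 values would shift everything after it.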

How to use group_by() to display earliest date

I have a tibble called master_table that is 488 rows by 9 variables. The two relevant variables to the problem are wrestler_name and reign_begin. There are multiple repeats of certain wrestler_name values. I want to reduce the rows to only be one instance of each unique wrestler_name value, decided by the earliest reign_begin date value. Sample tibble is linked below:
So, in this slice of the tibble, the end goal would be to have just five rows instead of eleven, and the single Ric Flair row, for example, would have a reign_begin date of 9/17/1981, the earliest of the four Ric Flair reign_begin values.
I thought that group_by would make the most sense, but the more I think about it and try to use it, the more I think it might not be the right path. Here are some things I tried:
group_by(wrestler_name) %>%
tibbletime::filter_time(reign_begin, 'start' ~ 'start')
#Trying to get it to just filter the first date it finds for each wrestler_name group, but did not work
master_table_2 <- master_table %>%
group_by(wrestler_name) %>%
filter(reign_begin)
#I know that this would not work, but it's the place where I'm basically stuck
edit: Per request, here is the head(master_table), which contains slightly different data, but it still expresses the issue:
1 Ric Flair NWA World Heavyweight Championship 40 8 69 1991-01-11 1991-03-21
2 Sting NWA World Heavyweight Championship 39 1 188 1990-07-07 1991-01-11
3 Ric Flair NWA World Heavyweight Championship 38 7 426 1989-05-07 1990-07-07
4 Ricky Steamboat NWA World Heavyweight Championship 37 1 76 1989-02-20 1989-05-07
5 Ric Flair NWA World Heavyweight Championship 36 6 452 1987-11-26 1989-02-20
6 Ronnie Garvin NWA World Heavyweight Championship 35 1 62 1987-09-25 1987-11-26
city_state country
1 East Rutherford, New Jersey USA
2 Baltimore, Maryland USA
3 Nashville, Tennessee USA
4 Chicago, Illinois USA
5 Chicago, Illinois USA
6 Detroit, Michigan USA
The common way to do this for databases involves a join:
earliest <- master_table %>%
group_by(wrestler_name) %>%
summarise(reign_begin = min(reign_begin))
master_table_2 <- master_table %>%
inner_join(earliest, by = c("wrestler_name", "reign_begin"))
No filter is required, as an inner join keeps only the rows that match in both tables.
The above approach is often required for databases because of how they calculate summaries. But as @Martin_Gal suggests, R can handle this a different way because it stores the data in memory.
master_table_2 <- master_table %>%
group_by(wrestler_name) %>%
filter(reign_begin == min(reign_begin))
You may also find the lubridate package helpful for working with dates.
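Another option, sketched here under the assumption that dplyr 1.0.0 or later is available, is slice_min(), which keeps exactly one row per group even when two reigns share the same earliest date. The small master_table below uses names and dates from the head() output in the question. Note that min() only finds the earliest date if reign_begin is a Date; strings like "9/17/1981" would need converting first.

```r
library(dplyr)

# Toy version of master_table, using values from the question
master_table <- tibble(
  wrestler_name = c("Ric Flair", "Sting", "Ric Flair",
                    "Ricky Steamboat", "Ric Flair", "Ronnie Garvin"),
  reign_begin   = as.Date(c("1991-01-11", "1990-07-07", "1989-05-07",
                            "1989-02-20", "1987-11-26", "1987-09-25"))
)

# One row per wrestler: the row with the earliest reign_begin
master_table_2 <- master_table %>%
  group_by(wrestler_name) %>%
  slice_min(reign_begin, n = 1, with_ties = FALSE) %>%
  ungroup()

# If dates are stored as text such as "9/17/1981", parse them first
parsed <- as.Date("9/17/1981", format = "%m/%d/%Y")

master_table_2
```

With with_ties = TRUE (the default), the result matches filter(reign_begin == min(reign_begin)) and can return more than one row per wrestler.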

Listing all the different strings from a dataframe in R

I'm still a newbie with R and I can't figure this out. I have a dataframe that looks like this:
Age State Diagnosis
12 Texas Lung Cancer
67 California Colon Cancer
45 Wyoming Lung Cancer
36 New Mex. Leukemia
58 Arizona Colon Cancer
35 Colorado Leukemia
I need a program that prints, or adds to another dataframe, all the different strings found in each column, so that I know all the "types". For example, for the column "Diagnosis", the program should create a dataframe with only "Lung Cancer", "Colon Cancer" and "Leukemia", since there are only those 3 types, even though they are repeated.
You can use unique.
Assuming you have a dataframe data with all the information, you can use the function unique() to list all the occurrences, removing repetitions (column names are case-sensitive, so it is Diagnosis, not diagnosis):
types <- unique(data$Diagnosis)
you can do the following to get the data
AllDiagnosis <- unique(data$Diagnosis)
Here is another option with distinct
library(dplyr)
data %>%
distinct(Diagnosis) %>%
pull(Diagnosis)
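Since the question asks about every column, a base R sketch along these lines could collect the distinct values of all columns at once. The data frame is rebuilt here from the sample in the question, and types_by_column is just an illustrative name.

```r
# Sample data from the question
data <- data.frame(
  Age       = c(12, 67, 45, 36, 58, 35),
  State     = c("Texas", "California", "Wyoming", "New Mex.", "Arizona", "Colorado"),
  Diagnosis = c("Lung Cancer", "Colon Cancer", "Lung Cancer",
                "Leukemia", "Colon Cancer", "Leukemia"),
  stringsAsFactors = FALSE
)

# A named list with the distinct values of each column
types_by_column <- lapply(data, unique)

# The distinct diagnoses as their own data frame
diagnosis_types <- data.frame(Diagnosis = unique(data$Diagnosis))

types_by_column$Diagnosis
```

A list is used because the columns generally have different numbers of distinct values, so they do not fit side by side in one data frame.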

R adding rows of data and summarize them by group

After looking at my notes from a recent R course and at the Q&As here, the functions I most likely need seem to be colSums and group_by, but I have no idea how to use them. Can you help me out?
(First I tried to look into summarize and group_by but did not get far.)
What I Have
player year team rbi
a 2001 NYY 56
b 2001 NYY 22
c 2001 BOS 55
d 2002 DET 77
Results wanted
year team rbi
2001 NYY 78
2001 BOS 55
2002 DET 77
The players name is lost, why ?
I want to add up the RBI for each team for each year using the individual players RBI's
So for each year there should be lets say 32 teams and for each of these teams there should be an RBI number which is the sum of all the players that batted for each of the teams that particular year.
Thank you
As per @bunk's comment you can use the aggregate function
aggregate(df$rbi, list(df$team, df$year), sum )
# Group.1 Group.2 x
#1 BOS 2001 55
#2 NYY 2001 78
#3 DET 2002 77
As per @akrun's comment, to keep the column names as they are, you can use
aggregate(rbi ~ team + year, data = df, sum)
A data.table approach would be to convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'year' and 'team', we get the sum of 'rbi'.
library(data.table)
setDT(df)[, .(rbi=sum(rbi)), by= .(year, team)]
# year team rbi
#1: 2001 NYY 78
#2: 2001 BOS 55
#3: 2002 DET 77
NOTE: The 'player' name is lost because we are not using that variable in the summarizing step.
Assume df contains your player data, then you can get the result you want by
library(dplyr)
df %>%
group_by(year, team) %>%
summarise(rbi = sum(rbi))
The players' names are lost because the column player is not included in the group_by clause, and so is not used by summarise to aggregate the data in the rbi column.
Thank you for your help resolving my issue. It could have been done more easily in a popular spreadsheet program, but I decided to do it in R. I love this program and its libraries, although they come with a learning curve.
There were 4 proposals to resolve my question and 3 of them worked fine. I evaluated each answer by the number of rows in the final result, because I know what the answer should be from a related dataframe.
1) Arun's proposal worked fine, and it uses a library that is new to me, library(data.table). I read a little more about this library and it looks interesting.
library(data.table)
setDT(df)[, .(rbi=sum(rbi)), by= .(year, team)]
2) Alex's proposal worked fine too. It was
library(dplyr)
df %>%
group_by(year, team) %>%
summarise(rbi = sum(rbi))
3) Akrun's solution was also good. This is the one I liked the most because the result came already sorted by year and team (with the team column in alphabetical order), while with the previous two solutions you need to specify that you want it sorted by year and then team.
aggregate(list(rbi=df$rbi), list(team=df$team, year=df$year), sum )
4) The solution by Ronak almost worked: out of the 2775 rows the result should have, it returned only 2761. The code was:
aggregate(rbi ~ team + year, data = df, sum)
Thanks again to everybody
Javier
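One plausible reason the formula version returned fewer rows (this is an assumption, since the full data is not shown): the formula method of aggregate uses na.action = na.omit by default, so rows with a missing rbi are dropped before grouping, and a team-year whose rbi values are all NA disappears from the result. The sketch below adds a made-up fifth player with a missing rbi to show the difference.

```r
# Question's sample plus one hypothetical row with a missing rbi
df <- data.frame(
  player = c("a", "b", "c", "d", "e"),
  year   = c(2001, 2001, 2001, 2002, 2002),
  team   = c("NYY", "NYY", "BOS", "DET", "CLE"),
  rbi    = c(56, 22, 55, 77, NA)
)

# Formula interface: the NA row is omitted, so CLE 2002 vanishes
by_formula <- aggregate(rbi ~ team + year, data = df, FUN = sum)

# List interface: CLE 2002 is kept, with rbi = NA
by_list <- aggregate(list(rbi = df$rbi),
                     list(team = df$team, year = df$year), sum)

# Formula interface that keeps NA rows
by_formula_keep <- aggregate(rbi ~ team + year, data = df,
                             FUN = sum, na.action = na.pass)

c(nrow(by_formula), nrow(by_list), nrow(by_formula_keep))
```

Comparing the group keys of the two results (for example with merge or setdiff on team and year) would confirm whether the 14 missing rows are all-NA groups.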

R treats the subset as factor variables instead of numeric variables

In a complex dataframe I have a column with the recalled net salary, including NAs that I want to exclude, plus a column with the year when the study was conducted, ranging from 1992 to 2010, more or less like this:
q32 pgssyear
2000 1992
1000 1992
NA 1992
3000 1994
etc.
If I try to draw a boxplot like:
boxplot(dataset$q32~pgssyear,data=dataset, main="Recalled Net Salary per Month (PLN)",
xlab="Year", ylab="Net Salary")
it seems to work, however NAs might distort the calculations, so I wanted to get rid of them:
boxplot(na.omit(dataset$q32)~pgssyear,data=dataset, main="Recalled Net Salary per Month (PLN)",
xlab="Year", ylab="Net Salary")
Then I get a warning message that the lengths of pgssyear and q32 do not match, most likely because I removed the NAs from q32. So I tried to shorten pgssyear so that it does not include the rows that correspond to NAs in the q32 column:
pgssyearprim <- subset(dataset$pgssyear, dataset$q32!= NA )
however, pgssyearprim then gets treated as a factor variable:
pgssyearprim
factor(0)
Levels: 1992 1993 1994 1995 1997 1999 2002 2005 2008 2010
and I get the same warning message if I introduce it to the boxplot formula...
Of course the lengths wouldn't match: with na.omit(dataset$q32) ~ pgssyear you removed data only from the left-hand side of the formula. Instead, use !is.na(dataset$q32) as the subset argument.
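A small sketch of that suggestion, using a toy dataset made up to resemble the one in the question. It also shows why q32 != NA selects nothing: any comparison with NA is itself NA, so subset() keeps no rows and returns an empty factor. plot = FALSE is used here only to inspect the computed statistics; drop it to draw the plot.

```r
# Toy data in the shape described in the question (values are illustrative)
dataset <- data.frame(
  q32      = c(2000, 1000, NA, 3000, 2500),
  pgssyear = factor(c(1992, 1992, 1992, 1994, 1994))
)

# Comparing with NA never yields TRUE or FALSE, only NA
na_comparison <- dataset$q32 != NA

# is.na() is the right test for missing values
clean <- dataset[!is.na(dataset$q32), ]

# Filter both sides of the formula at once through the subset argument
box_stats <- boxplot(q32 ~ pgssyear, data = dataset,
                     subset = !is.na(q32), plot = FALSE)

box_stats$n   # number of non-missing observations per year
```

Because the subset is applied to whole rows, q32 and pgssyear always stay the same length.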
