Data Wrangling Using Dplyr

Using Dplyr, I am trying to find which country has the largest increase in wealth between 2002 and 2006 from the following data.
Country wealth_2002 wealth_2004 wealth_2006
Country_A 1000 1600 2200
Country_B 1200 1300 1800
Country_C 1400 1100 1200
Country_D 1500 1000 1100
Country_E 1100 1800 1900
To get the country's name, I have used
largest_increase <- df %>%
group_by(Country) %>%
filter(max(wealth_2006 - wealth_2002)) %>%
And this gives me
Error in filter_impl(.data, quo) :
Argument 2 filter condition does not evaluate to a logical vector
I would be grateful if someone could explain what I am doing wrong and how to fix it. I am very new to R, so any help would be appreciated.

Using Base R you can use which.max to index your country column:
# This is my dummy data, you can ignore it
country <- c("Sweden", "Finland")
X1 <- c(1050, 1067)
X2 <- c(1045, 1069)
DF <- data.frame(country, X1, X2)
# Modify this to suit
DF$country[which.max(DF$X2- DF$X1)]
So for yours it would be:
df$Country[which.max(df$wealth_2006 - df$wealth_2002)]

Look at how filter works - you need to provide a logical "test" for each row, if it passes, it will keep the row. Also no real need to group_by country since each country is already its own row. Try something like this, where you calculate and store the wealth change for each country then keep the country/countries which have that max value:
library(dplyr)
df <- read.table(
text = "
Country wealth_2002 wealth_2004 wealth_2006
Country_A 1000 1600 2200
Country_B 1200 1300 1800
Country_C 1400 1100 1200
Country_D 1500 1000 1100
Country_E 1100 1800 1900
", header = TRUE, stringsAsFactors = FALSE
)
df %>%
mutate(wealth_change = wealth_2006 - wealth_2002) %>%
filter(wealth_change == max(wealth_change)) %>%
pull(Country) # gives us the Country column
Output:
[1] "Country_A"
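To see why the original filter() call fails, it can help to compare what the two expressions return. The sketch below uses base R only and rebuilds the question's data by hand; the names change, single_value and keep are introduced here purely for illustration.

```r
# Rebuild the question's data (values copied from the table in the question)
df <- data.frame(
  Country = c("Country_A", "Country_B", "Country_C", "Country_D", "Country_E"),
  wealth_2002 = c(1000, 1200, 1400, 1500, 1100),
  wealth_2006 = c(2200, 1800, 1200, 1100, 1900),
  stringsAsFactors = FALSE
)

change <- df$wealth_2006 - df$wealth_2002

# max() collapses the vector to a single number, which is not a TRUE/FALSE test
single_value <- max(change)

# Comparing every row with that maximum gives one logical value per row,
# which is the kind of condition filter() expects
keep <- change == max(change)

df$Country[keep]
```

The same logical vector is what filter(wealth_change == max(wealth_change)) builds internally in the answer above.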

Sharing your data with dput(data) makes it easier for others to answer. For example:
data <- structure(list(Country = structure(1:5, .Label = c("Country_A",
"Country_B", "Country_C", "Country_D", "Country_E"), class = "factor"),
wealth_2002 = c(1000L, 1200L, 1400L, 1500L, 1100L), wealth_2004 = c(1600L,
1300L, 1100L, 1000L, 1800L), wealth_2006 = c(2200L, 1800L,
1200L, 1100L, 1900L)), .Names = c("Country", "wealth_2002",
"wealth_2004", "wealth_2006"), class = "data.frame", row.names = c(NA,
-5L))
library(dplyr)
data %>%
mutate(delta = wealth_2006 - wealth_2002) %>% # create a new variable called delta with mutate
arrange(desc(delta)) %>% # sort descending by delta
head(1) # return the top row; pull out the specific value if needed
This returns the row with the greatest change.
Country_A has a change of 1200.

You can also use top_n :
library(dplyr)
df %>% top_n(1,wealth_2006 - wealth_2002)
# Country wealth_2002 wealth_2004 wealth_2006
# 1 Country_A 1000 1600 2200
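In dplyr 1.0.0 and later, top_n() is superseded by slice_max(). A minimal sketch of the same query, assuming a recent dplyr is installed and with the question's data rebuilt by hand (the name largest_increase is taken from the question's own attempt):

```r
library(dplyr)

# Rebuild the question's data (values copied from the table in the question)
df <- data.frame(
  Country = c("Country_A", "Country_B", "Country_C", "Country_D", "Country_E"),
  wealth_2002 = c(1000, 1200, 1400, 1500, 1100),
  wealth_2004 = c(1600, 1300, 1100, 1000, 1800),
  wealth_2006 = c(2200, 1800, 1200, 1100, 1900),
  stringsAsFactors = FALSE
)

# slice_max() keeps the row(s) with the largest value of the expression;
# tied rows are all kept by default (with_ties = TRUE)
largest_increase <- df %>%
  slice_max(wealth_2006 - wealth_2002, n = 1)

largest_increase$Country
```

Because ties are kept by default, this may return more than one row if several countries share the largest increase.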

Related

Should I be using unnest_wider and rowMeans to get the average of a list column?

I have a simple data set. The row names are a meaningful index and column 1 has a list of values. What I eventually want is the average of that list for each row name.
What it looks like now:
row name   years
108457     [1200, 1200, 1540, 1890]
237021     [1600, 1270, 1270]
What I eventually want it to look like:
row name   years
108457     mean of list
237021     mean of list
Currently, I'm trying to use unnest_wider(years). My plan is to then afterwards use rowMeans() to find the mean of the unnested row. I can then merge the row name and average value with my main data set, so I'm not too concerned with deleting the new columns.
However, this whole process is taking a while and I'm having some issues with unnest_wider. Currently, when I try:
unnest_wider(dataset, colname)
I get the following error:
Error in as_indices_impl():
! Must subset columns with a valid subscript vector.
✖ Subscript has the wrong type data.frame<years:list>.
ℹ It must be numeric or character.
When I try:
unnest_wider(colname)
My computer just runs endlessly and it looks like it's counting... it doesn't stop and I have to quit the application to terminate processing.
I had previously tried to directly apply rowMeans, use mean(df$ColName), and use apply(ColName, mean).
I wonder if there's a more efficient way?
It may be that I shouldn't have created the list in the first place. It looks like it does now because I converted it from this format:
Column A   Column B
108457     1200
108457     1200
108457     1540
237021     1600
108457     1890
237021     1270
I converted it using pivot_wider and then as.data.frame(t(dataset)).
Should I have tried to get the averages directly from this format? If so, how would I do that?

For your vectors in each row, you can use sapply to iterate over each row to calculate the mean, then just return the mean for each row name.
df$years <- sapply(df$years, mean, na.rm = TRUE)
Output
years
108457 1457.5
237021 1380.0
Data
df <- structure(list(years = structure(list(c(1200, 1200, 1540, 1890
), c(1600, 1270, 1270)), class = "AsIs")), row.names = c("108457",
"237021"), class = "data.frame")
Alternatively, if your data are in the long format shown last in the question, data.table can compute the mean by group.
library(data.table)
as.data.table(df2)[, list(ColumnB = mean(ColumnB)), by = ColumnA]
Output
ColumnA ColumnB
1: 108457 1457.5
2: 237021 1435.0
Data
df2 <- structure(list(ColumnA = c(108457L, 108457L, 108457L, 237021L,
108457L, 237021L), ColumnB = c(1200L, 1200L, 1540L, 1600L, 1890L,
1270L)), class = "data.frame", row.names = c(NA, -6L))
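As a variation on the sapply approach for the list column, vapply does the same job but checks that every element yields exactly one number. This is a sketch in base R, assuming the list column is built with I(list(...)) as in the data above; the names mean_years and result are illustrative.

```r
# List column with one numeric vector per row, as in the question
df <- data.frame(
  years = I(list(c(1200, 1200, 1540, 1890), c(1600, 1270, 1270))),
  row.names = c("108457", "237021")
)

# vapply() applies mean() to each list element and insists on a
# single numeric value per element (numeric(1))
mean_years <- vapply(df$years, mean, numeric(1), na.rm = TRUE)

# Pair the means with the original row names
result <- data.frame(mean_years = mean_years, row.names = rownames(df))
result
```

Neither version needs unnest_wider, since the mean is computed on each list element directly.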

If your original data look like they do in the latter table, you could simply find the mean by group based on ColumnA:
Data
df <- read.table(text = "ColumnA ColumnB
108457 1200
108457 1200
108457 1540
237021 1600
108457 1890
237021 1270", header = TRUE)
Base R
aggregate(df$ColumnB, list(df$ColumnA), FUN=mean)
# Group.1 x
# 1 108457 1457.5
# 2 237021 1435.0
Dplyr
library(dplyr)
df %>%
group_by(ColumnA) %>%
summarise(mean_years = mean(ColumnB))
# ColumnA mean_years
# <int> <dbl>
#1 108457 1458.
#2 237021 1435

Sort table rows by column values in R

I have a classic output of the BLAST tool that it is like the table below. To make the table easier to read, I reduced the number of columns.
query   subject   startinsubject   endinsubject
1       SRR       50               100
1       SRR       500              450
I need to create another column, called "strand". When the query is forward, as in the first row (startinsubject is less than endinsubject), the new column should contain F.
When the query is reversed, as in the second row (startinsubject is greater than endinsubject), the new column should contain R.
I would like to get a new table like the one below. Could anyone help me? Many thanks.
query   subject   startinsubject   endinsubject   strand
1       SRR       50               100            F
1       SRR       500              450            R

This is an ifelse option. You can use the following code:
df <- data.frame(query = c(1,1),
subject = c("SRR", "SRR"),
startinsubject = c(50, 500),
endinsubject = c(100, 450))
library(dplyr)
df %>%
mutate(strand = ifelse(startinsubject > endinsubject, "R", "F"))
Output:
query subject startinsubject endinsubject strand
1 1 SRR 50 100 F
2 1 SRR 500 450 R

We may either use ifelse/case_when or just convert the logical to numeric index for replacement
library(dplyr)
df1 <- df1 %>%
mutate(strand = c("R", "F")[1 + (startinsubject < endinsubject)])
-output
df1
query subject startinsubject endinsubject strand
1 1 SRR 50 100 F
2 1 SRR 500 450 R
data
df1 <- structure(list(query = c(1L, 1L), subject = c("SRR", "SRR"),
startinsubject = c(50L, 500L), endinsubject = c(100L, 450L
)), class = "data.frame", row.names = c(NA, -2L))
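The case_when option mentioned in this answer is not shown, so here is a minimal sketch of it. It assumes dplyr is installed; the third row, where start and end are equal, is a made-up edge case added only to show how an undecided row could be handled.

```r
library(dplyr)

df1 <- data.frame(
  query = c(1L, 1L, 1L),
  subject = c("SRR", "SRR", "SRR"),
  startinsubject = c(50L, 500L, 200L),
  endinsubject = c(100L, 450L, 200L),  # third row is hypothetical
  stringsAsFactors = FALSE
)

# Conditions are checked in order; the first one that is TRUE decides the label
df1 <- df1 %>%
  mutate(strand = case_when(
    startinsubject < endinsubject ~ "F",
    startinsubject > endinsubject ~ "R",
    TRUE ~ NA_character_  # equal coordinates: direction cannot be inferred
  ))

df1
```

Compared with ifelse, this makes the handling of ties explicit instead of silently labelling them F.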

pivot_longer with identical column names

My data looks like this:
Nr. Type 1 Type 2 Type 1 Type 2
1 400 600 100 800
2 500 400 900 300
3 200 200 400 700
4 300 600 800 300
I want to create boxplots of Type 1 and Type 2.
pivot_longer produces Type 1 and Type 1.1, which is not what I need.
Maybe someone can help me.

It turns out your issue was not with pivot_longer() but with subsetting your original data.frame using [. Neither [ nor base::subset() offers a way to keep duplicated column names in the output, so you need to subset your data another way to avoid changing the names. This is discussed in this question; borrowing from one of the answers, you can use:
library(tidyverse)
# data with extra column to be removed
d <- structure(list(Nr. = 1:4, `Type 1` = c(400L, 500L, 200L, 300L), x = 1:4, `Type 2` = c(600L,
400L, 200L, 600L), `Type 1` = c(100L, 900L, 400L, 800L), `Type 2` = c(800L,
300L, 700L, 300L)), row.names = c(NA, -4L), class = "data.frame")
# remove extra column without changing names then pivot
data.frame(as.list(d)[-3], check.names = FALSE) %>%
pivot_longer(-Nr.) %>%
ggplot(aes(name, value)) +
geom_boxplot()
Created on 2022-02-22 by the reprex package (v2.0.1)
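The renaming described in this answer can be checked without the tidyverse. The sketch below is a base R illustration rather than the answer's own method: it shows that [ makes duplicated names unique while as.list() leaves them alone, then stacks the columns by hand. The names renamed, cols and long are introduced here for illustration.

```r
# Data with duplicated column names, as in the question
d <- data.frame(
  Nr. = 1:4,
  "Type 1" = c(400L, 500L, 200L, 300L),
  "Type 2" = c(600L, 400L, 200L, 600L),
  "Type 1" = c(100L, 900L, 400L, 800L),
  "Type 2" = c(800L, 300L, 700L, 300L),
  check.names = FALSE
)

# Subsetting with `[` makes the duplicated names unique
renamed <- names(d[-1])

# as.list() keeps the names exactly as they are
cols <- as.list(d)[-1]

# Stack all value columns into long form, repeating each name once per value
long <- data.frame(
  name = rep(names(cols), lengths(cols)),
  value = unlist(cols, use.names = FALSE),
  stringsAsFactors = FALSE
)

# One box per type, pooling both columns that share a name
boxplot(value ~ name, data = long)
```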

Is there a way on R to combine rows to make a total/average?

I have a big df which looks like this:
Name Year Runs Average
J. Doe 2016 432 44.5
J. Doe 2017 325 37.4
J. Bloggs 2016 289 54.3
I want to concatenate rows so that I can make a total for each name, rather than split by year. Some columns e.g. Runs would need to be summed and others e.g. Average would need other formulae dependent on other columns. The df is too big to do it manually, so is there a function I can use to combine these rows whenever there is a repeated name?

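You can use dplyr. Sum the columns that should be totalled and apply whatever formula fits the others (here, a simple mean of Average):
library(dplyr)
df %>%
group_by(Name) %>%
summarise(sum_of_runs = sum(Runs),
mean_of_average = mean(Average, na.rm = TRUE))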

If you want to sum Runs column and take mean of Average column for each unique value in Name, using data.table you can do :
library(data.table)
setDT(df)[, .(Runs = sum(Runs), Avg = mean(Average)), Name]
# Name Runs Avg
#1: J.Doe 757 41.0
#2: J.Bloggs 289 54.3
Add na.rm = TRUE in sum and mean functions if you have NA values.
data
df <- structure(list(Name = c("J.Doe", "J.Doe", "J.Bloggs"), Year = c(2016L,
2017L, 2016L), Runs = c(432L, 325L, 289L), Average = c(44.5,
37.4, 54.3)), class = "data.frame", row.names = c(NA, -3L))
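The question notes that some columns need a formula that depends on other columns. One way to handle that is to total the underlying columns first and derive the ratio afterwards, rather than averaging the yearly averages. The sketch below assumes a hypothetical Outs column (number of dismissals) that is not in the original data; its values are invented purely for illustration.

```r
library(dplyr)

df <- data.frame(
  Name = c("J. Doe", "J. Doe", "J. Bloggs"),
  Year = c(2016L, 2017L, 2016L),
  Runs = c(432L, 325L, 289L),
  Outs = c(10L, 9L, 5L),  # hypothetical column, values made up for this example
  stringsAsFactors = FALSE
)

totals <- df %>%
  group_by(Name) %>%
  # Total the raw counts per player first
  summarise(Runs_total = sum(Runs), Outs_total = sum(Outs)) %>%
  # Then derive the overall average from the totals
  mutate(Average = Runs_total / Outs_total)

totals
```

A mean of yearly averages weights every year equally, so it generally differs from the ratio of the totals when the years have different numbers of dismissals.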

Splitting column with differing syntax in R

I am having some trouble cleaning up my data. It is a list of sold houses, with the sale price, number of rooms, m2 and address for each.
As seen below, the address is a single string.
head(DF, 4)
Address Price m2 Rooms
Petersvej 1772900 Hoersholm 10.000 210 5
Annasvej 2B2900 Hoersholm 15.000 230 4
Krænsvej 125800 Lyngby C 10.000 210 5
A Mivs Alle 119800 Hjoerring 1.300 70 3
The syntax of the address column is: road name, road number, followed by a 4-digit postal code and the city name (sometimes two words).
I also need to extract the postal code. I have been looking at the 'stringi' package but haven't been able to find any examples.
Any pointers are very much appreciated.

1) Use separate from tidyr to split Address into three fields (Road, Number, City), merging anything left over into the last field. Then use separate again to split the last 4 digits off the Number column created by the first call.
library(dplyr)
library(tidyr)
DF %>%
separate(Address, into = c("Road", "Number", "City"), extra = "merge") %>%
separate(Number, into = c("StreetNo", "Postal"), sep = -4)
giving:
Road StreetNo Postal City Price m2 Rooms CITY
1 Petersvej 77 2900 Hoersholm 10 210 5 Hoersholm
2 Annasvej 121B 2900 Hoersholm 15 230 4 Hoersholm
3 Krænsvej 12 5800 Lyngby C 10 210 5 C
2) Alternately, insert commas between the subfields of Address and then use separate to split the subfields out. It gives the same result as (1) on the input shown in the Note below.
DF %>%
mutate(Address = sub("(\\S.*) +(\\S+)(\\d{4}) +(.*)", "\\1,\\2,\\3,\\4", Address)) %>%
separate(Address, into = c("Road", "Number", "Postal", "City"), sep = ",")
Note
The input DF in reproducible form is:
DF <-
structure(list(Address = structure(c(3L, 1L, 2L), .Label = c("Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C", "Petersvej 772900 Hoersholm"), class = "factor"),
Price = c(10, 15, 10), m2 = c(210L, 230L, 210L), Rooms = c(5L,
4L, 5L), CITY = structure(c(2L, 2L, 1L), .Label = c("C",
"Hoersholm"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
Update
Added and fixed (2).

Check out the cSplit function from the splitstackshape package
library(splitstackshape)
df_new <- cSplit(df, splitCols = "Address", sep = " ")
#This will split your address column into 4 different columns split at the space
#you can then add an ifelse block to combine the last 2 columns to make up the city like
df_new$City <- ifelse(is.na(df_new$Address_4), as.character(df_new$Address_3), paste(df_new$Address_3, df_new$Address_4, sep = " "))

One way to do this is with regex.
In this instance you may use a simple regular expression which will match all alphabetical characters and space characters which lead to the end of the string, then trim the whitespace off.
library(stringr)
DF <- data.frame(Address=c("Petersvej 772900 Hoersholm",
"Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C"))
DF$CITY <- str_trim(str_extract(DF$Address, "[a-zA-Z ]+$"))
This will give you the following output:
Address CITY
1 Petersvej 772900 Hoersholm Hoersholm
2 Annasvej 121B2900 Hoersholm Hoersholm
3 Krænsvej 125800 Lyngby C Lyngby C
In R the stringr package is preferred for regex because it allows for multiple-group capture, which in this example could allow you to separate each component of the address with one expression.
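As a sketch of the multiple-group capture mentioned above, str_match() can pull all four address components out with one expression. The pattern below is an assumption based on the address syntax described in the question (road name, road number, 4-digit postal code, city) and has only been reasoned through against the three sample addresses; the names pattern, m and parts are illustrative.

```r
library(stringr)

addresses <- c("Petersvej 772900 Hoersholm",
               "Annasvej 121B2900 Hoersholm",
               "Krænsvej 125800 Lyngby C")

# Groups: (road name) (road number)(4-digit postal code) (city)
# The lazy quantifiers let the postal code claim the last 4 digits
# of the token that follows the road name
pattern <- "^(.*?)\\s+(\\S+?)(\\d{4})\\s+(.*)$"

# str_match() returns a character matrix: column 1 is the full match,
# the remaining columns are the capture groups in order
m <- str_match(addresses, pattern)

parts <- data.frame(
  Road = m[, 2],
  StreetNo = m[, 3],
  Postal = m[, 4],
  City = m[, 5],
  stringsAsFactors = FALSE
)
parts
```

Addresses that do not fit the assumed syntax come back as NA in every column, which makes them easy to spot and handle separately.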
