How do I compare two columns and delete the non-overlapping elements? - r

I have two columns in two data frames, where the longer one includes all elements of the other column. Now I want to delete elements in the longer column that do not overlap with the other, together with the corresponding row. I identified the "difference" using:
diff <- setdiff(gdp$country, tfpg$country)
and I tried to use two FOR loops to get this done:
for (i in 1:28) { for(j in 1:123) {if(diff[i] == gdp$country[j]) {gdp <- gdp[-c(j),]}}}
where 28 is the number of rows I want to delete (the length of diff) and 123 is the length of the longer column. This does not work; the error message is:
Error in if (diff[i] == gdp$country[j]) { :
missing value where TRUE/FALSE needed
So how do I fix this? Or is there a better way to do this?
Thank you very much.
I have a data frame called "gdp" here:
country wto y1990 y1991 y1992
Austria 1995 251540 260197 265644
Belgium 1995 322113 328017 333038
Cyprus 1995 14436 14537 15898
Denmark 1995 177089 179392 182936
Finland 1995 149584 140737 136058
France 1995 1804032 1822778 1851937
There are 123 rows.
I would like to delete rows with country names specified in another vector:
diff ["Austria","China",...,"Yemen"]

There is a better way! What you're describing is the equivalent of an inner join (or a left join starting from the shorter data frame): keep only the rows whose key appears in both data frames. In R, the way to achieve it is the merge command:
## S3 method for class 'data.frame'
merge(x, y, by = intersect(names(x), names(y)),
by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
sort = TRUE, suffixes = c(".x",".y"),
incomparables = NULL, ...)
In your case:
merge(gdp, tfpg, by = "country")
For example:
x = data.frame(foo = c(1,2,3,4,5), bar=c("A","B","C","D","E"))
y = data.frame(baz = c(6,7,8,9), bar=c("A","C","E","F"))
z = merge(x, y, by = "bar")
gives
bar foo baz
1 A 1 6
2 C 3 7
3 E 5 8
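If the goal is only to drop rows from the longer data frame, without pulling in the columns of the other one, a logical filter with %in% may be enough. The sketch below uses small made-up stand-ins for gdp and tfpg (the values and the name gdp_kept are assumptions for illustration). It also hints at why the nested loop fails: each deletion shortens gdp, so j eventually runs past the last row, gdp$country[j] becomes NA, and if() cannot evaluate the comparison.

```r
# Toy stand-ins for the real data frames (assumed for illustration)
gdp <- data.frame(
  country = c("Austria", "Belgium", "Cyprus", "Denmark"),
  y1990   = c(251540, 322113, 14436, 177089)
)
tfpg <- data.frame(country = c("Belgium", "Denmark"))

# Keep only the rows of gdp whose country also appears in tfpg
gdp_kept <- gdp[gdp$country %in% tfpg$country, ]

# Equivalent: drop the rows listed in the set difference
to_drop   <- setdiff(gdp$country, tfpg$country)
gdp_kept2 <- gdp[!(gdp$country %in% to_drop), ]
```

Both forms are vectorized, so no hard-coded row counts such as 28 or 123 are needed.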

Related

Select columns from a data frame

I have a data frame made up of several columns, each corresponding to a different industry per country. I have 56 industries and 43 countries, and I'd like to select only industries 5 to 22 for each country (18 industries). The big issue is that each industry per country is named as AUS1, AUS2, ..., AUS56. What I need to select is AUS5 to AUS22, AUT5 to AUT22, and so on.
A viable solution could be to select columns according to the following algorithm: the first column of interest, i.e., AUS5, corresponds to column 10, and I select up to AUS22 (corresponding to column 27). Then I should skip all the remaining columns for AUS (i.e., AUS23 to AUS56) and the first 4 columns for the next country (AUT1 to AUT4). Then I select, as before, industries 5 to 22 for AUT. Basically, the algorithm, starting from column 10, should select 18 columns (including column 10), skip the next 38 columns, and then select the next 18 columns. This process should be repeated for all 43 countries.
How can I code that?
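The select-18, skip-38 pattern described above can be sketched by computing the column positions directly. This assumes, as the question implies, that AUS1 sits in column 6 (so 5 leading columns come first) and that every country has exactly 56 consecutive columns; the variable names are made up for illustration.

```r
n_countries  <- 43     # number of countries
n_industries <- 56     # columns per country
n_lead       <- 5      # columns before AUS1 (assumed: AUS5 is column 10)
keep         <- 5:22   # industries to keep within each country

# Offset of each country's block, then add the within-block positions
block_start <- (seq_len(n_countries) - 1) * n_industries
cols <- n_lead + as.vector(outer(keep, block_start, `+`))

# df_selected <- df[, cols]   # apply to the real data frame

# Alternative by name rather than position: a 3-letter code followed by 5..22
nms <- c("industry", "country", "AUS4", "AUS5", "AUS22", "AUS23", "AUT5")
by_name <- grep("^[A-Z]{3}([5-9]|1[0-9]|2[0-2])$", nms, value = TRUE)
```

The name-based variant is less fragile if the column order ever changes, since it does not depend on positions.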
UPDATE, Example:
df=data.frame(industry = c("C10","C11","C12","C13"),
country = c("USA"),
AUS3 = runif(4),
AUS4 = runif(4),
AUS5 = runif(4),
AUS6 = runif(4),
DEU5 = runif(4),
DEU6 = runif(4),
DEU7 = runif(4),
DEU8 = runif(4))
library(dplyr)
# I'm interested only in C10-C11:
df_a <- df %>% filter(grepl('C10|C11', industry))
df_a
#Thus, how can I select columns AUS10,AUS11, DEU10,DEU11 efficiently, considering that I have a huge dataset?
Demonstrating the paste0 approach.
ctr <- unique(gsub('\\d', '', names(df[-(1:2)])))
# ctr <- c("AUS", "DEU") ## alternatively hard-coded
ind <- c(10, 11)
# %in% tests membership; == would silently recycle the shorter vector
subset(df, industry %in% paste0('C', ind),
select=c('industry', 'country', paste0(rep(ctr, each=length(ind)), ind)))
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
Or, since you appear to like grep, you could do:
df[grep('10|11', df$industry), grep('industry|country|[A-Z]{3}1[01]', names(df))]
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
If you have a big data set in memory, data.table could be ideal and much faster than alternatives. Something like the following could work, though you will need to play with select_ind and select_ctr as desired on the real dataset.
It might be worth giving us a slightly larger toy example, if possible.
library(data.table)
setDT(df)
select_ind <- paste0(c("C"), c("11","10"))
select_ctr <- paste0(rep(c("AUS", "DEU"), each = 2), c("10","11"))
df[grepl(paste0(select_ind, collapse = "|"), industry), # select rows
..select_ctr] # select columns
AUS10 AUS11 DEU10 DEU11
1: 0.9040223 0.2638725 0.9779399 0.1672789
2: 0.6162678 0.3095942 0.1527307 0.6270880
For more information, see Introduction to data.table.

A way to get Column Names as Row Names?

My goal is to plot a map with each point representing the year of the highest measured value. So for that I need the year as one value and the Station Name as Row Name.
I get to the point where I have the year of the maximum value for each station, but I don't know how to get the station name as the row name.
My example is the following:
library(dplyr)  # provides %>%
set.seed(123)
df1<-data.frame(replicate(6,sample(0:200,2500,rep=TRUE)))
date_df1<-seq(as.Date("1995-01-01"), by = "day", length.out = 2500)
test_sto<-cbind(date_df1, df1)
test_sto$date_df1<-as.Date(test_sto$date_df1)
test_sto<-test_sto%>% dplyr::mutate( year = lubridate::year(date_df1),
month = lubridate::month(date_df1),
day = lubridate::day(date_df1))
This is my data frame. I then applied the following steps.
To count all values above the threshold for each year and station:
test_year<-aggregate.data.frame(x=test_sto[2:7] > 120, by = list(test_sto$year), FUN = sum, na.rm=TRUE )
This works as it should. The next step is the following:
m <- ncol(test_year)
Value <- rep(NA,m)
for (j in 2:m) {
idx<- which.max(test_year[,j])
Value[j] <- test_year[,1][idx]
}
test_test<-Value[2:m]
At the end of this, I get the following table:
  x
1 1996
2 1996
3 1998
4 1996
5 1999
6 1999
But instead of 1, 2, 3, 4, 5, ... I need my column names there (X1, X2, X3, etc.):
   x
X1 1996
X2 1996
X3 1998
X4 1996
X5 1999
X6 1999
This is the point where I'm struggling.
I tried the following:
test_year$max<-apply(test_year[2:7], 1, FUN = max)
apply(test_year[2:7], 2, FUN = max)
test_year2<-subset(test_year, ncol(2:7) == max(ncol(2:7)))
But I'm just getting a message saying:
In max(ncol(2:7)) :
no non-missing arguments to max; returning -Inf
Maybe someone knows a workaround! Thanks in advance!
'test_test' is just a vector: a one-dimensional object characterized by its length, which doesn't have a row.names attribute. However, it can have a names attribute:
names(test_test) <- colnames(test_year)[-1]
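As a possible shortcut, the loop and the naming step can be combined: sapply over the station columns returns a vector that already carries the column names. The sketch below uses a small made-up stand-in for test_year (first column is the year, the others are per-station counts), so the numbers are illustrative only.

```r
# Hypothetical miniature of test_year
test_year <- data.frame(
  Group.1 = c(1995, 1996, 1997),
  X1      = c(10, 30, 20),
  X2      = c(5, 2, 9)
)

# For each station column, pick the year in which the count is largest
max_year <- sapply(test_year[-1],
                   function(counts) test_year$Group.1[which.max(counts)])

# One-column data frame with station names as row names
max_year_df <- data.frame(x = max_year)
```

Because sapply keeps the names of the list it iterates over, max_year is named X1, X2, and those names become the row names of max_year_df.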

Carrying out a simple dataframe subset with dplyr

Consider the following dataframe slice:
df = data.frame(locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
row.names = c("a091", "b231", "a234", "d154"))
df
locations score
a091 argentina 1
b231 brazil 2
a234 argentina 3
d154 denmark 4
sorted = c("a234","d154","a091") #in my real task these strings are provided from an exogenous function
df2 = df[sorted,] #quick and simple subset using rownames
EDIT: Here I'm trying to subset AND order the data according to sorted - sorry that was not clear before. So the output, importantly, is:
locations score
a234 argentina 3
d154 denmark 4
a091 argentina 1
And not as you would get from a simple subset operation:
locations score
a091 argentina 1
a234 argentina 3
d154 denmark 4
I'd like to do the exactly same thing in dplyr. Here is an inelegant hack:
require(dplyr)
dt = as_tibble(df)
rownames(dt) = rownames(df)
Warning message:
Setting row names on a tibble is deprecated.
dt2 = dt[sorted,]
I'd like to do it properly, where the rownames are an index in the data table:
dt_proper = as_tibble(x = df,rownames = "index")
dt_proper2 = dt_proper %>% ?some_function(index, sorted)? #what would this be?
dt_proper2
# A tibble: 3 x 3
index locations score
<chr> <fct> <int>
1 a234 argentina 3
2 d154 denmark 4
3 a091 argentina 1
But I can't for the life of me figure out how to do this using filter or some other dplyr function, and without some convoluted conversion to factor, re-order factor levels, etc.
Hi,
you can use mutate to put the row names of your data frame into an index column, filter on the vector "sorted", and then order the rows according to "sorted":
df2 <- df %>% mutate(index=row.names(.)) %>% filter(index %in% sorted)
df2 <- df2[order(match(df2$index, sorted)), ]  # note the comma: reorder rows, not columns
I think I've figured it out:
dt_proper2 = dt_proper[match(sorted,dt_proper$index),]
This seems to be the shortest implementation of what df[sorted,] does.
Functions in the tidyverse (dplyr, tibble, etc.) are built around the concept (as far as I know) that rows only contain attributes (columns) and no row names / labels / indexes. So in order to sort rows, you have to introduce a new column containing the rank of each row.
The way I would do it is to create another tibble containing your "sorting information" (sorting attribute, rank) and inner join it to your original tibble. Then I could order the rows by rank.
library(tidyverse)
# note that I've changed the third column's name to avoid confusion
df = tibble(
locations = c("argentina","brazil","argentina","denmark"),
score = 1:4,
custom_id = c("a091", "b231", "a234", "d154")
)
sorted_ids = c("a234","d154","a091")
sorting_info = tibble(
custom_id = sorted_ids,
rank = seq_along(sorted_ids)
)
ordered_ids = df %>%
inner_join(sorting_info, by = "custom_id") %>%
arrange(rank) %>%
select(-rank)
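A more compact dplyr variant, offered as a sketch rather than the canonical way: filter to the wanted ids, then arrange by their position in sorted using match(). It assumes the dplyr and tibble packages are installed and reuses the data from the question; the name result is made up.

```r
library(dplyr)

df <- data.frame(
  locations = c("argentina", "brazil", "argentina", "denmark"),
  score     = 1:4,
  row.names = c("a091", "b231", "a234", "d154")
)
sorted <- c("a234", "d154", "a091")

result <- df %>%
  tibble::rownames_to_column("index") %>%   # row names become a regular column
  filter(index %in% sorted) %>%             # subset
  arrange(match(index, sorted))             # order rows as given in `sorted`
```

match(index, sorted) returns each row's position within sorted, so arranging by it reproduces the order that df[sorted, ] would give.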

Merge 4 data objects with different columns (variables) in R

So initially I had the following object:
> head(gs)
year disturbance lek_id complex tot_male
1 2006 N 3T Diamond 3
2 2007 N 3T Diamond 17
3 1981 N bare 3corners 4
4 1982 N bare 3corners 7
5 1983 N bare 3corners 2
6 1985 N bare 3corners 5
With that I computed general statistics min, max, mean, and sd of tot_male for year within complex. I used R data splitting functions, and assigned logical column names where it seemed appropriate and ultimately made them different objects.
> tyc_min = aggregate(gs$tot_male, by=list(gs$year, gs$complex), FUN=min)
> names(tyc_min) = c("year", "complex", "tot_male_min")
> tyc_max = aggregate(gs$tot_male, by=list(gs$year, gs$complex), FUN=max)
> names(tyc_max) = c("year", "complex", "tot_male_max")
> tyc_mean = aggregate(gs$tot_male, by=list(gs$year, gs$complex), FUN=mean)
> names(tyc_mean) = c("year", "complex", "tot_male_mean")
> tyc_sd = aggregate(gs$tot_male, by=list(gs$year, gs$complex), FUN=sd)
> names(tyc_sd) = c("year", "complex", "tot_male_sd")
Example Output (2nd Object - Tyc_max):
year complex tot_male_max
1 2003 0
2 1970 3corners 26
3 1971 3corners 22
4 1972 3corners 26
5 1973 3corners 32
6 1974 3corners 18
Now I need to add the number of samples per year/complex combination as well. Then I need to merge these into a single data object and export it as a .csv file.
I know I need to use the merge() function along with all.y, but I have no idea how to handle this error:
Error in fix.by(by.x, x) :
'by' must specify one or more columns as numbers, names or logical
Or.. add the number of samples per year and complex. Any suggestions?
This might work (but hard to check without a reproducible example):
gsnew <- Reduce(function(...) merge(..., all = TRUE, by = c("year","complex")),
list(tyc_min, tyc_max, tyc_mean, tyc_sd))
But instead of aggregating for the separate statistics and then merging, you can also aggregate everything at once into a new dataframe / datatable with for example data.table, dplyr or base R. Then you don't have to merge afterwards (for a base R solution see the other answer):
library(data.table)
gsnew <- setDT(gs)[, .(male_min = min(tot_male),
male_max = max(tot_male),
male_mean = mean(tot_male),
male_sd = sd(tot_male)), by = .(year, complex)]
library(dplyr)
gsnew <- gs %>% group_by(year, complex) %>%
summarise(male_min = min(tot_male),
male_max = max(tot_male),
male_mean = mean(tot_male),
male_sd = sd(tot_male))
mystat <- function(x) c(mi=min(x), ma=max(x))
aggregate(Sepal.Length~Species, FUN=mystat, data=iris)
For your data:
mystat <- function(x) c(mi=min(x), ma=max(x), m=mean(x), s=sd(x), l=length(x))
aggregate(tot_male~year+complex, FUN=mystat, data=gs)
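One thing to watch for with this approach: when FUN returns several values, aggregate() stores them as a single matrix column, which can be awkward to export. A minimal sketch of flattening it before writing a CSV follows; the tiny gs data frame and the output file name are made up for illustration.

```r
# Toy stand-in for gs (assumed values)
gs <- data.frame(
  year     = c(2006, 2006, 2007, 2007, 2007),
  complex  = c("Diamond", "Diamond", "Diamond", "3corners", "3corners"),
  tot_male = c(3, 17, 4, 7, 2)
)

mystat <- function(x) c(min = min(x), max = max(x), mean = mean(x),
                        sd = sd(x), n = length(x))

agg <- aggregate(tot_male ~ year + complex, data = gs, FUN = mystat)

# agg$tot_male is a matrix; expand it into ordinary columns
flat <- do.call(data.frame, agg)

# write.csv(flat, "tyc_stats.csv", row.names = FALSE)  # hypothetical file name
```

The n column gives the number of samples per year/complex combination, and sd is NA for groups with a single observation.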

List objects from sub-elements of a list

What I want to do is make a list, then make a list from part of the elements of that list. I can do it in 2 steps using subset and then dlply, but I'm wondering if there's a faster way with any of the XXply methods.
So I have a dataframe:
data <- data.frame(
biz = sample(c("telco","shipping","tech"), 50, replace = TRUE),
region = sample(c("mideast","americas","asia"), 50, replace = TRUE),
date = rep(seq(as.Date("2010-02-01"), length=10, by = "1 day"),5),
revenue = sample(500:1000,50,replace=T),
orders = sample(0:2,50,replace=T)
)
Ultimately, what I'm looking for here is: For each region, a list of identity values organized by business.
The messy approach is to take a subset for each region then simply turn that into a list:
mideast <- subset(data, region == "mideast")
americas <- subset(data, region == "americas")
asia <- subset(data, region == "asia")
mideast.list <- dlply(mideast, .(biz), identity)
americas.list <- dlply(americas, .(biz), identity)
asia.list <- dlply(asia, .(biz), identity)
Easy enough but it gets unwieldy with bigger datasets.
If I use dlply on the original data, it gives me the values I'm looking for, but again, I want to have actual list objects for each region. So:
list2 <- dlply(data, .(region, biz), identity)
But then how do I access just the regions from list2 and create separate list objects out of them?
I'm not 100% sure I understand what you're trying to do, but maybe this is it?
lst <- lapply(
split(data, data$region),
function(df) lapply(split(df, df$biz), identity)
)
lst[["americas"]][["shipping"]]
# biz region date revenue orders
# 3 shipping americas 2010-02-03 621 2
# 23 shipping americas 2010-02-03 799 2
# 33 shipping americas 2010-02-03 920 0
# 34 shipping americas 2010-02-04 705 2
This matches the structure of americas.list, so I think this is what you're trying to do. Also, note that you can skip the inner lapply if identity is really the function you want to apply (split alone does what you need).
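To illustrate that last remark, here is a sketch with the inner lapply(..., identity) removed, so that split does all the work. The small data frame is made up so the result is deterministic.

```r
# Small deterministic stand-in for the random data in the question
data <- data.frame(
  biz     = c("telco", "tech", "telco", "shipping"),
  region  = c("asia", "asia", "americas", "americas"),
  revenue = c(500, 600, 700, 800)
)

# Outer split by region, inner split by business
lst <- lapply(split(data, data$region), function(df) split(df, df$biz))

lst[["americas"]][["shipping"]]   # the rows for one region/business pair
```

Each element of lst is itself a list of data frames keyed by business, so no plyr dependency is needed.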
