Removing duplicates in R based on condition - r

I need to embed a condition in a remove-duplicates function. I am working with a large student database from South Africa, a highly multilingual country. Last week you guys gave me the code to remove duplicates caused by retakes, but I now realise my language exam data shows some students offering more than 2 different languages.
The source data, simplified, looks like this:
STUDID MATSUBJ SCORE
101 AFRIKAANSB 1
101 AFRIKAANSB 4
102 ENGLISHB 2
102 ISIZULUB 7
102 ENGLISHB 5
The result file I need is
STUDID MATSUBJ SCORE flagextra
101 AFRIKAANS 4
102 ENGLISH 5
102 ISIZULUB 7 1
I need to flag the extra language so that I can see which languages they are and make a new category for them.

Two stage procedure works better for me as a newbie to R:
1- Remove the duplicates caused by subject retakes, keeping the highest score:
library(dplyr)
df <- LANGSEC %>%
  group_by(STUDID, MATRICSUBJ) %>%
  top_n(1, SUBJSCORE) %>%
  ungroup()
2- Then flag one of the two subjects causing the remaining duplicates:
df$flagextra <- as.integer(duplicated(df$STUDID))
Then filter for this third language and make a new file:
LANG3 <- df %>% filter(flagextra == 1)
Then remove these from the other file:
LANG2 <- df %>% filter(flagextra != 1)
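The two stages can be tried end to end on the simplified sample data from the question. This is only a minimal sketch: it assumes dplyr 1.0 or later, uses the simplified column names (STUDID, MATSUBJ, SCORE) rather than the real ones, and swaps in slice_max() with with_ties = FALSE in place of top_n() so that a tied top score cannot leave two rows for one subject. The object names langsec, best, lang2 and lang3 are made up for illustration.

```r
library(dplyr)

# Simplified sample data from the question
langsec <- data.frame(
  STUDID  = c(101L, 101L, 102L, 102L, 102L),
  MATSUBJ = c("AFRIKAANSB", "AFRIKAANSB", "ENGLISHB", "ISIZULUB", "ENGLISHB"),
  SCORE   = c(1L, 4L, 2L, 7L, 5L)
)

# Stage 1: keep only the best score per student and subject
best <- langsec %>%
  group_by(STUDID, MATSUBJ) %>%
  slice_max(SCORE, n = 1, with_ties = FALSE) %>%
  ungroup()

# Stage 2: any student still appearing more than once has an extra language;
# duplicated() flags the second and later rows of each student
best$flagextra <- as.integer(duplicated(best$STUDID))

# Split into the extra languages and the rest
lang3 <- best %>% filter(flagextra == 1)
lang2 <- best %>% filter(flagextra != 1)
```

Which of a student's languages gets flagged depends on row order (here the grouped result is ordered by subject name), so an explicit arrange() before the duplicated() call may be needed if a particular language should count as the main one.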

Maybe this helps:
library(tidyverse)
df1 %>%
group_by(STUDID, MATSUBJ) %>%
summarise(SCORE = max(SCORE),
flagextra = as.integer(!sum(duplicated(MATSUBJ))))
# A tibble: 3 x 4
# Groups: STUDID [?]
# STUDID MATSUBJ SCORE flagextra
# <int> <chr> <dbl> <int>
#1 101 AFRIKAANSB 4 0
#2 102 ENGLISHB 5 0
#3 102 ISIZULUB 7 1
Or with base R
i1 <- !(duplicated(df1[1:2])|duplicated(df1[1:2], fromLast = TRUE))
transform(aggregate(SCORE ~ ., df1, max),
flagextra = as.integer(MATSUBJ %in% df1$MATSUBJ[i1]))
data
df1 <- structure(list(STUDID = c(101L, 101L, 102L, 102L, 102L), MATSUBJ
= c("AFRIKAANSB",
"AFRIKAANSB", "ENGLISHB", "ISIZULUB", "ENGLISHB"), SCORE = c(1L,
4L, 2L, 7L, 5L)), class = "data.frame", row.names = c(NA, -5L
))

Related

Efficient way to repeat operations with columns with similar name in R

I am a beginner with R and have found myself repeatedly running into a problem of this kind. Say I have a dataframe with columns:
company, shares_2010, shares_2011, ... , shares_2020, share_price_2010, ... , share_price_2020
TeslaInc 1000 1200 2000 8 40
.
.
.
I then want to go ahead and calculate the market value in each year. Ordinarily I would do it this way:
dataframe <- dataframe %>%
mutate(value_2010 = shares_2010*share_price_2010,
value_2011 = shares_2011*share_price_2011,
.
:
value_2020 = shares_2020*share_price_2020)
Clearly, all of this is rather cumbersome to type out each time and it cannot be made dynamic with respect to the number of time periods included. Is there any clever way to do these operations in one line instead? I am suspecting something may be possible to do with a combination of starts_with() and some lambda function, but I just haven't been able to figure out how to make the correct things multiply yet. Surely the tidyverse must have a better way to do this?
Any help is much appreciated!
You're right, this is a very common situation in data management.
Let's make a minimal, reproducible example:
dat <- data.frame(
company = c("TeslaInc", "Merta"),
shares_2010 = c(1000L, 1500L),
shares_2011 = c(1200L, 1100L),
shareprice_2010 = 8:7,
shareprice_2011 = c(40L, 12L)
)
dat
#> company shares_2010 shares_2011 shareprice_2010 shareprice_2011
#> 1 TeslaInc 1000 1200 8 40
#> 2 Merta 1500 1100 7 12
This dataset has two issues:
1. It's in a wide format. This is relatively easy to visualise for humans, but it's not ideal for data analysis. We can fix this with pivot_longer() from tidyr.
2. Each column actually contains two variables: measure (share or share price) and year. We can fix this with separate() from the same package.
library(tidyr)
dat_reshaped <- dat |>
  pivot_longer(shares_2010:shareprice_2011) |>
  separate(name, into = c("name", "year")) |>
  pivot_wider(names_from = name, values_from = value)
dat_reshaped
#> # A tibble: 4 × 4
#> company year shares shareprice
#> <chr> <chr> <int> <int>
#> 1 TeslaInc 2010 1000 8
#> 2 TeslaInc 2011 1200 40
#> 3 Merta 2010 1500 7
#> 4 Merta 2011 1100 12
The last pivot_wider() is needed to have shares and shareprice as two separate columns, for ease of further calculations.
We can finally use mutate() to calculate in one go all the new values.
dat_reshaped |>
dplyr::mutate(value = shares * shareprice)
#> # A tibble: 4 × 5
#> company year shares shareprice value
#> <chr> <chr> <int> <int> <int>
#> 1 TeslaInc 2010 1000 8 8000
#> 2 TeslaInc 2011 1200 40 48000
#> 3 Merta 2010 1500 7 10500
#> 4 Merta 2011 1100 12 13200
I recommend you read this chapter of R4DS to better understand these concepts - it's worth the effort!
I think further analysis will be simpler if you reshape your data long.
Here, we can extract the shares, share_price, and year from the header names using pivot_longer. I specify that I want to split the headers into two pieces at the last _, and that the piece from the beginning of the header (that is, shares or share_price) becomes the column name (aka .value), while the piece from the end of the header becomes the year.
Then the calculation is a simple one-liner.
library(tidyr); library(dplyr)
data.frame(company = "Tesla",
shares_2010 = 5, shares_2011 = 6,
share_price_2010 = 100, share_price_2011 = 110) %>%
pivot_longer(-company,
names_to = c(".value", "year"),
names_pattern = "(.*)_(.*)") %>%
mutate(value = shares * share_price)
# A tibble: 2 × 5
company year shares share_price value
<chr> <chr> <dbl> <dbl> <dbl>
1 Tesla 2010 5 100 500
2 Tesla 2011 6 110 660
I agree with the other posts about pivoting this data into a longer format. Just to add a different approach that works well with this type of example: you can create a list of expressions and then use the splice operator !!! to evaluate these expressions within your context:
library(purrr)
library(dplyr)
library(rlang)
library(glue)
lexprs <- set_names(2010:2011, paste0("value_", 2010:2011)) %>%
map_chr(~ glue("shares_{.x} * share_price_{.x}")) %>%
parse_exprs()
df %>%
mutate(!!! lexprs)
Output
company shares_2010 shares_2011 share_price_2010 share_price_2011 value_2010
1 TeslaInc 1000 1200 8 40 8000
2 Merta 1500 1100 7 12 10500
value_2011
1 48000
2 13200
Data
Thanks to Andrea M
structure(list(company = c("TeslaInc", "Merta"), shares_2010 = c(1000L,
1500L), shares_2011 = c(1200L, 1100L), share_price_2010 = 8:7,
share_price_2011 = c(40L, 12L)), class = "data.frame", row.names = c(NA,
-2L))
How it works
With this usage, the splice operator takes a named list of expressions. The names of the list become the variable names and the expressions are evaluated in the context of your mutate statement.
> lexprs
$value_2010
shares_2010 * share_price_2010
$value_2011
shares_2011 * share_price_2011
To see how this injection will resolve, we can use rlang::qq_show:
> rlang::qq_show(df %>% mutate(!!! lexprs))
df %>% mutate(value_2010 = shares_2010 * share_price_2010, value_2011 = shares_2011 *
share_price_2011)
It is indeed likely you may need to have your data in a long format. But in case you don't, you can do this:
# thanks Andrea M!
df <- data.frame(
company=c("TeslaInc", "Merta"),
shares_2010=c(1000L, 1500L),
shares_2011=c(1200L, 1100L),
share_price_2010=8:7,
share_price_2011=c(40L, 12L)
)
years <- sub('shares_', '', grep('^shares_', names(df), value=T))
for (year in years) {
df[[paste0('value_', year)]] <-
df[[paste0('shares_', year)]] * df[[paste0('share_price_', year)]]
}
If you wanted to avoid the loop (for (...) {...}) you can use this instead:
sp <- df[, paste0('shares_', years)] * df[, paste0('share_price_', years)]
names(sp) <- paste0('value_', years)
df <- cbind(df, sp)

Select last observation of a date variable - SPSS or R

I'm relatively new to R, so I realise this type of question is asked often but I've read a lot of stack overflow posts and still can't quite get something to work on my data.
I have data on spss, in two datasets imported into R. Both of my datasets include an id (IDC), which I have been using to merge them. Before merging, I need to filter one of the datasets to select specifically the last observation of a date variable.
My dataset, d1, has a longitudinal measure in long format. There are multiple rows per IDC, representing different places of residence (neighborhood). Each row has its own "start_date", which is a variable that is NOT necessarily unique.
As it looks on spss :
IDC neighborhood start_date
1 22 08.07.2001
1 44 04.02.2005
1 13 21.06.2010
2 44 24.12.2014
2 3 06.03.2002
3 22 04.01.2006
4 13 08.07.2001
4 2 15.06.2011
In R, the start dates do not look the same, instead they are one long number like "13529462400". I do not understand this format but I assume it still would retain the date order...
Here are all my attempts so far to select the last date. All attempts ran without error. The output just didn't give me what I want. As far as I can tell, none of these changed the number of repetitions of IDC, so none of them actually selected only the last date.
##### attempt 1 --- not working
d1$start_date_filt <- d1$start_date
d1[order(d1$IDC,d1$start_date_filt),] # Sort by ID and week
d1[!duplicated(d1$IDC, fromLast=T),] # Keep last observation per ID)
###### attempt 2--- not working
myid.uni <- unique(d1$IDC)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
temp<-subset(d1, IDC==myid.uni[i])
if (dim(temp)[1] > 1) {
last.temp<-temp[dim(temp)[1],]
}
else {
last.temp<-temp
}
last<-rbind(last, last.temp)
}
last
##### atempt 3 -- doesn't work
do.call("rbind",
by(d1,INDICES = d1$IDC,
FUN=function(DF)
DF[which.max(DF$start_date),]))
#### attempt 4 -- doesnt work
library(plyr)
ddply(d1,.(IDC), function(X)
X[which.max(X$start_date),])
### merger code -- in case something has to change with that after only the last start_date is selected
merge(d1, d2, by = "IDC")
My goal dataset d1 would look like this:
IDC neighborhood start_date
1 13 21.06.2010
2 44 24.12.2014
3 22 04.01.2006
4 2 15.06.2011
I'm grateful for any help, many thanks <3
Most approaches run into a problem with this data: the dates are character strings in a day.month.year format that does not sort chronologically. It only happens to work here because, within each IDC, the maximum day-of-month also falls in the maximum year.
It is generally better to convert that column to a Date object in R, so that comparisons are chronological.
dat$start_date <- as.Date(dat$start_date, format = "%d.%m.%Y")
dat
# IDC neighborhood start_date
# 1 1 22 2001-07-08
# 2 1 44 2005-02-04
# 3 1 13 2010-06-21
# 4 2 44 2014-12-24
# 5 2 3 2002-03-06
# 6 3 22 2006-01-04
# 7 4 13 2001-07-08
# 8 4 2 2011-06-15
From here, things are a bit simpler:
Base R
do.call(rbind, by(dat, dat[,c("IDC"),drop=FALSE], function(z) z[which.max(z$start_date),]))
# IDC neighborhood start_date
# 1 1 13 2010-06-21
# 2 2 44 2014-12-24
# 3 3 22 2006-01-04
# 4 4 2 2011-06-15
dplyr
dat %>%
group_by(IDC) %>%
slice(which.max(start_date)) %>%
ungroup()
# # A tibble: 4 x 3
# IDC neighborhood start_date
# <int> <int> <date>
# 1 1 13 2010-06-21
# 2 2 44 2014-12-24
# 3 3 22 2006-01-04
# 4 4 2 2011-06-15
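Attempt 1 from the question is close to a working base R solution. A likely reason it changed nothing is that neither the sorted nor the filtered data frame was assigned to anything. A minimal sketch, assuming start_date has already been converted to Date (the names d1_sorted and d1_last are made up here):

```r
# Sample data from the question, with start_date parsed as Date
d1 <- data.frame(
  IDC = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L),
  neighborhood = c(22L, 44L, 13L, 44L, 3L, 22L, 13L, 2L),
  start_date = as.Date(
    c("08.07.2001", "04.02.2005", "21.06.2010", "24.12.2014",
      "06.03.2002", "04.01.2006", "08.07.2001", "15.06.2011"),
    format = "%d.%m.%Y"
  )
)

# Sort by ID and date, and keep the sorted result this time
d1_sorted <- d1[order(d1$IDC, d1$start_date), ]

# Keep the last (latest) row per ID, again assigning the result
d1_last <- d1_sorted[!duplicated(d1_sorted$IDC, fromLast = TRUE), ]
d1_last
```

If two rows of one IDC share the latest start_date, this keeps whichever of them sorts last, so a tie-breaking column in order() may be worth adding.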
Data
dat <- structure(list(IDC = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L), neighborhood = c(22L, 44L, 13L, 44L, 3L, 22L, 13L, 2L), start_date = c("08.07.2001", "04.02.2005", "21.06.2010", "24.12.2014", "06.03.2002", "04.01.2006", "08.07.2001", "15.06.2011")), class = "data.frame", row.names = c(NA, -8L))
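The question also mentions that in R the dates show up as long numbers such as 13529462400. The following sketch rests on an assumption that is not confirmed in the thread: that these are raw SPSS date values, which SPSS stores as seconds since 14 October 1582. If so, dividing by the number of seconds in a day and supplying that origin should give a proper Date:

```r
# Assumption: the value is an SPSS date, i.e. seconds since 1582-10-14
spss_seconds <- 13529462400

# 86400 seconds per day; as.Date() counts days from the given origin
start_date <- as.Date(spss_seconds / 86400, origin = "1582-10-14")
format(start_date, "%d.%m.%Y")
```

Since the raw numbers grow with time, which.max() on the unconverted column should still pick the latest row; converting mainly makes the result readable and easy to check against SPSS.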

How to pull string from character in one dataframe and place into a new table

I am working on a new shiny project and trying to re-use some of my colleague's work, which he has done in SQL, to speed up the time it takes to build the data for this app.
I don't exactly know how to describe this problem so I will do so by showing what I have and explaining what I want to get.
Essentially we have an SQL script which spits out a bunch of data into two columns:
1. An identifier column, which in the past we used with vlookup to split the string components and fill in cells in Excel.
2. The value for said identifier, whether it be counts, averages or percentages.
It would look like the following.
lookup output
1: dataAU20161 142
2: dataAU20171 246
3: dataAU20181 17
4: dataAU20191 3
5: dataAU20162 193
6: dataAU20172 203
7: dataAU20182 11
8: dataAU20192 9
So ideally I would like to transform this data into the following format where the 'data' string identifies that they will go into the same data frame. The years in the string will be implemented into columns and the number following the years (1 or 2) will be implemented as a column as a factor variable.
x 2016 2017 2018 2019
--------------------------------
1 142 246 17 3
2 193 203 11 9
Any help with this would be greatly appreciated!
One option is to separate the column 'lookup' into two:
library(dplyr)
library(tidyr)
df1 %>%
extract(lookup, into = c('lookup', 'rn'), 'dataAU(\\d{4})(\\d{1})') %>%
pivot_wider(names_from= lookup, values_from =output) %>%
dplyr::select(-rn)
# A tibble: 2 x 4
# `2016` `2017` `2018` `2019`
# <int> <int> <int> <int>
#1 142 246 17 3
#2 193 203 11 9
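The question asks for the trailing digit to be kept as a factor column x, which the pipeline above drops. A minimal variation that keeps it, assuming every lookup value follows the pattern dataAU + four-digit year + one digit (the name wide is made up here):

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  lookup = c("dataAU20161", "dataAU20171", "dataAU20181", "dataAU20191",
             "dataAU20162", "dataAU20172", "dataAU20182", "dataAU20192"),
  output = c(142L, 246L, 17L, 3L, 193L, 203L, 11L, 9L)
)

wide <- df1 %>%
  # Split the identifier into the year and the trailing digit
  tidyr::extract(lookup, into = c("year", "x"), regex = "dataAU(\\d{4})(\\d)") %>%
  # Keep the trailing digit as a factor, as asked in the question
  mutate(x = factor(x)) %>%
  # One column per year, one row per value of x
  pivot_wider(names_from = year, values_from = output)
wide
```

Rows whose lookup does not match the pattern would get NA in both new columns, so it may be worth checking for those before pivoting.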
data
df1 <- structure(list(lookup = c("dataAU20161", "dataAU20171", "dataAU20181",
"dataAU20191", "dataAU20162", "dataAU20172", "dataAU20182", "dataAU20192"
), output = c(142L, 246L, 17L, 3L, 193L, 203L, 11L, 9L)), class = "data.frame",
row.names = c("1:",
"2:", "3:", "4:", "5:", "6:", "7:", "8:"))

R separate lines into columns specified by start and end

I'd like to split a dataset made of character strings into columns specified by start and end.
My dataset looks something like this:
>head(templines,3)
[1] "201801 1  78"
[2] "201801 2  67"
[3] "201801 1  13"
and I'd like to split it by specifying my columns using the data dictionary:
>dictionary
col_name col_start col_end
year 1 4
week 5 6
gender 8 8
age 11 12
so it becomes:
year week gender age
2018 01 1 78
2018 01 2 67
2018 01 1 13
In reality the data comes from a long-running survey, and the white spaces between some columns represent variables that are no longer collected. It has many variables, so I need a solution that would scale.
In tidyr::separate it looks like you can only split by specifying the position to split at, rather than the start and end positions. Is there a way to use start / end?
I thought of doing this with read_fwf but I can't seem to be able to use it on my already loaded dataset. I only managed to get it to work by first exporting as a txt and then reading from this .txt:
write_lines(templines,"t1.txt")
read_fwf("t1.txt",
         fwf_positions(start = dictionary$col_start,
                       end = dictionary$col_end,
                       col_names = dictionary$col_name))
Is it possible to use read_fwf on an already loaded dataset?
Answering your question directly: yes, it is possible to use read_fwf with already loaded data. The relevant part of the docs is the part about the argument file:
Either a path to a file, a connection, or literal data (either a single string or a raw vector).
...
Literal data is most useful for examples and tests.
It must contain at least one new line to be recognised as data (instead of a path).
Thus, you can simply collapse your data and then use read_fwf:
templines %>%
paste(collapse = "\n") %>%
read_fwf(., fwf_positions(start = dictionary$col_start,
end = dictionary$col_end,
col_names = dictionary$col_name))
This should scale to multiple columns, and is fast for many rows (on my machine for 1 million rows and four columns about half a second).
There are a few warnings regarding parsing failures, but they stem from your dictionary. If you change the last line to age, 11, 12 it works as expected.
A solution with substring:
library(data.table)
x <- transpose(lapply(templines, substring, dictionary$col_start, dictionary$col_end))
setDT(x)
setnames(x, dictionary$col_name)
# > x
# year week gender age
# 1: 2018 01 1 78
# 2: 2018 01 2 67
# 3: 2018 01 1 13
How about this?
data.frame(year=substr(templines,1,4),
week=substr(templines,5,6),
gender=substr(templines,8,8),
age=substr(templines,11,12))
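Since the question asks for something that scales to many variables, the same substr() idea can be driven by the dictionary instead of hard-coded positions. This is a rough base R sketch: split_by_dictionary() is a hypothetical helper name, and the sample strings assume two spaces before the age so that it really sits in columns 11 to 12.

```r
# Sample inputs; the two spaces before the age are an assumption
templines <- c("201801 1  78", "201801 2  67", "201801 1  13")
dictionary <- data.frame(
  col_name  = c("year", "week", "gender", "age"),
  col_start = c(1L, 5L, 8L, 11L),
  col_end   = c(4L, 6L, 8L, 12L)
)

# Hypothetical helper: cut every line at the start/end positions in dict
split_by_dictionary <- function(lines, dict) {
  # One character vector per dictionary row
  cols <- Map(function(s, e) substr(lines, s, e), dict$col_start, dict$col_end)
  names(cols) <- dict$col_name
  as.data.frame(cols, stringsAsFactors = FALSE)
}

parsed <- split_by_dictionary(templines, dictionary)
parsed
```

All columns come back as character, so codes like week "01" keep their leading zero; type.convert(parsed, as.is = TRUE) could be applied afterwards if numeric columns are wanted.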
Using base R:
m = list(`attr<-`(dat$col_start,"match.length",dat$col_end-dat$col_start+1))
d = do.call(rbind,regmatches(x,rep(m,length(x))))
setNames(data.frame(d),dat$col_name)
year week gender age
1 2018 01 1 78
2 2018 01 2 67
3 2018 01 1 13
DATA USED:
x = c("201801 1  78", "201801 2  67", "201801 1  13")
dat=read.table(text="col_name col_start col_end
year 1 4
week 5 6
gender 8 8
age 11 13 ",h=T)
We could use separate from tidyverse
library(tidyverse)
data.frame(Col = templines) %>%
separate(Col, into = dictionary$col_name, sep= head(dictionary$col_end, -1))
# year week gender age
#1 2018 01 1 78
#2 2018 01 2 67
#3 2018 01 1 13
The convert = TRUE argument can also be used with separate to have numeric columns as output
tibble(Col = templines) %>%
separate(Col, into = dictionary$col_name,
sep= head(dictionary$col_end, -1), convert = TRUE)
# A tibble: 3 x 4
# year week gender age
# <int> <int> <int> <int>
#1 2018 1 1 78
#2 2018 1 2 67
#3 2018 1 1 13
data
dictionary <- structure(list(col_name = c("year", "week", "gender", "age"),
col_start = c(1L, 5L, 8L, 11L), col_end = c(4L, 6L, 8L, 13L
)), .Names = c("col_name", "col_start", "col_end"),
class = "data.frame", row.names = c(NA, -4L))
templines <- c("201801 1  78", "201801 2  67", "201801 1  13")
This is an explicit function which seems to be working the way you wanted.
library(data.table)
split_func <- function(char, ref, name, start, end) {
  res <- data.table("ID" = seq_along(char))
  for (i in seq_len(nrow(ref))) {
    res[, ref[[name]][i] := substr(x = char, start = ref[[start]][i], stop = ref[[end]][i])]
  }
  return(res)
}
I have created the same input files as you:
templines<-c("201801 1  78","201801 2  67","201801 1  13")
dictionary<-data.table("col_name" = c("year","week","gender","age"),"col_start" = c(1,5,8,11),
"col_end" = c(4,6,8,13))
# col_name col_start col_end
#1: year 1 4
#2: week 5 6
#3: gender 8 8
#4: age 11 13
As for the arguments,
char - The character vector with the values you want to split
ref - The reference table or dictionary
name - The column number in the reference table containing the column names you want
start - The column number in the reference table containing the start points
end - The column number in the reference table containing the stop points
If I use this function with these inputs, I get the following result:
out<-split_func(char = templines,ref = dictionary,name = 1,start = 2,end = 3)
#>out
# ID year week gender age
#1: 1 2018 01 1 78
#2: 2 2018 01 2 67
#3: 3 2018 01 1 13
I had to include an "ID" column to initiate the data table and make this easier. In case you want to drop it later you can just use:
out[,ID := NULL]
Hope this is closer to the solution you were looking for.

Compare values in data.frame from different rows [closed]

I have an R data.frame of college football data, with two entries for each game (one for each team, with stats and whatnot). I would like to compare points from these to create a binary Win/Loss variable, but I have no idea how (I'm not very experienced with R).
Is there a way I can iterate through the columns and try to match them up against another column (I have a game ID variable, so I'd match on that) and create aforementioned binary Win/Loss variable by comparing points values?
Excerpt of dataframe (many variables left out):
Team Code Name Game Code Date Site Points
5 Akron 5050320051201 12/1/2005 NEUTRAL 32
5 Akron 404000520051226 12/26/2005 NEUTRAL 23
8 Alabama 419000820050903 9/3/2005 TEAM 37
8 Alabama 664000820050910 9/10/2005 TEAM 43
What I want is to append a new column, a binary variable that's assigned 1 or 0 based on if the team won or lost. To figure this out, I need to take the game code, say 5050320051201, find the other row with that same game code (there's only one other row with that same game code, for the other team in that game), and compare the points value for the two, and use that to assign the 1 or 0 for the Win/Loss variable.
Assuming that your data has exactly two teams for each unique Game Code and there are no tie games as given by the following example:
df <- structure(list(`Team Code` = c(5L, 6L, 5L, 5L, 8L, 9L, 9L, 8L
), Name = c("Akron", "St. Joseph", "Akron", "Miami(Ohio)", "Alabama",
"Florida", "Tennessee", "Alabama"), `Game Code` = structure(c(1L,
1L, 2L, 2L, 3L, 3L, 4L, 4L), .Label = c("5050320051201", "404000520051226",
"419000820050903", "664000820050910"), class = "factor"), Date = structure(c(13118,
13118, 13143, 13143, 13029, 13029, 13036, 13036), class = "Date"),
Site = c("NEUTRAL", "NEUTRAL", "NEUTRAL", "NEUTRAL", "TEAM",
"AWAY", "AWAY", "TEAM"), Points = c(32L, 25L, 23L, 42L, 37L,
45L, 42L, 43L)), .Names = c("Team Code", "Name", "Game Code",
"Date", "Site", "Points"), row.names = c(NA, -8L), class = "data.frame")
print(df)
## Team Code Name Game Code Date Site Points
##1 5 Akron 5050320051201 2005-12-01 NEUTRAL 32
##2 6 St. Joseph 5050320051201 2005-12-01 NEUTRAL 25
##3 5 Akron 404000520051226 2005-12-26 NEUTRAL 23
##4 5 Miami(Ohio) 404000520051226 2005-12-26 NEUTRAL 42
##5 8 Alabama 419000820050903 2005-09-03 TEAM 37
##6 9 Florida 419000820050903 2005-09-03 AWAY 45
##7 9 Tennessee 664000820050910 2005-09-10 AWAY 42
##8 8 Alabama 664000820050910 2005-09-10 TEAM 43
You can use dplyr to generate what you want:
library(dplyr)
result <- df %>% group_by(`Game Code`) %>%
mutate(`Win/Loss`=if(first(Points) > last(Points)) as.integer(c(1,0)) else as.integer(c(0,1)))
print(result)
##Source: local data frame [8 x 7]
##Groups: Game Code [4]
##
## Team Code Name Game Code Date Site Points Win/Loss
## <int> <chr> <fctr> <date> <chr> <int> <int>
##1 5 Akron 5050320051201 2005-12-01 NEUTRAL 32 1
##2 6 St. Joseph 5050320051201 2005-12-01 NEUTRAL 25 0
##3 5 Akron 404000520051226 2005-12-26 NEUTRAL 23 0
##4 5 Miami(Ohio) 404000520051226 2005-12-26 NEUTRAL 42 1
##5 8 Alabama 419000820050903 2005-09-03 TEAM 37 0
##6 9 Florida 419000820050903 2005-09-03 AWAY 45 1
##7 9 Tennessee 664000820050910 2005-09-10 AWAY 42 0
##8 8 Alabama 664000820050910 2005-09-10 TEAM 43 1
Here, we first group_by the Game Code and then use mutate to create the Win/Loss column for each group. The logic here is simply that if the first Points is greater than the last (there are only two by assumption), then we set the column to c(1,0). Otherwise, we set it to c(0,1). Note that this logic does not handle ties, but can easily be extended to do so. Note also that we surround the column names with back-quotes because of special characters such as space and /.
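One possible way to extend the logic to ties is sketched below. It uses simplified column names (game_code, team, points) instead of the back-quoted originals, and the tied game "g3" is made-up data added only for illustration. Marking a tie as NA is one choice among several; a third code such as 2 would work just as well.

```r
library(dplyr)

# First two games taken from the example above; "g3" is a hypothetical tie
games <- data.frame(
  game_code = c("g1", "g1", "g2", "g2", "g3", "g3"),
  team      = c("Akron", "St. Joseph", "Akron", "Miami(Ohio)", "Team A", "Team B"),
  points    = c(32L, 25L, 23L, 42L, 21L, 21L)
)

result <- games %>%
  group_by(game_code) %>%
  mutate(
    # Tie: both rows get NA. Otherwise 1 for the higher score, 0 for the lower.
    win = if (n_distinct(points) == 1) NA_integer_ else as.integer(points == max(points))
  ) %>%
  ungroup()
result
```

Comparing each row against the group maximum also means the result no longer depends on which team's row comes first within a game.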
footballdata$SomeVariable[footballdata$Wins == "1"] = stuff
Code your wins as either 1 or 0, i.e. a binary variable.
R's data frames are nice in that you can subset them by a condition, for example only the rows where Wins is 1, and then assign to some variable for those rows as above. If you want to use another data frame to populate a data frame, make sure they have the same number of rows.
footballdata$SomeVariable[footballdata$Wins == "1" & footballdata$Team == "Browns"] = Hopeful
