Locating duplicated entries in a column of a dataframe?

In rows 11:13 and 14:16 there are duplicate entries in column 'C2_xsampa' for 'm:' and 'n:'. Each consonant in 'C2_xsampa' should appear once as Singleton and once as Geminate, but that is not the case for 'm:' and 'n:'. This yields wrong mean values for the numeric columns.
My question is: how do I find which rows are duplicated? I have manually checked the parent dataset from which the mean values are computed, and everything looks fine there.
Earlier, I was using subset() to rectify the 'real' errors in the entries.
Data:
C2_xsampa Consonant Speaker C1.dn C2.dn V1.dn V2.dn total.dn
1 "d_d" Singleton 8.5 11.9 7.82 13.0 7.65 40.3
2 "d_d:" Geminate 9 11.6 11.9 11.4 7.46 42.3
3 "dZ" Singleton 8.31 7.79 7.47 14.9 9.81 40.0
4 "dZ:" Geminate 8.08 7.72 13.4 12.8 9.61 43.6
5 "g" Singleton 9 12.1 11.3 11.9 8.56 43.9
6 "g:" Geminate 8.69 11.3 11.1 12.7 10.2 45.3
7 "k" Singleton 9.5 12.3 14.4 9.71 6.97 43.4
8 "k:" Geminate 9 14.7 16.1 10.1 7.37 48.2
9 "l" Singleton 8.69 11.9 6.33 11.5 10.2 40.0
10 "l:" Geminate 8.81 11.3 10.0 10.0 11.5 42.8
11 "m" Singleton 8.36 13.6 9.11 11.1 9.20 43.0
12 "m:" Geminate 8.85 13.7 10.9 9.95 8.42 43.0
13 "m: " Geminate 14 14.6 12.4 5.66 5.01 37.7
14 "n" Singleton 8 15.1 4.44 11.6 8.99 40.2
15 "n:" Geminate 8.21 21.4 10.1 10.2 9.32 51.0
16 "n: " Geminate 11.3 32.0 10.4 8.09 7.94 58.5
17 "p" Singleton 8.4 11.2 11.9 7.98 6.53 37.7
18 "p:" Geminate 8.81 13.2 12.7 8.57 11.3 45.8
19 "t`" Singleton 9 12.9 10.5 8.69 9.20 41.3
20 "t`:" Geminate 9 13.1 13.1 8.39 10.6 45.2
Thanks.

You could check that each combination of values in the two columns is unique throughout the dataset. In pandas:
df = df.drop_duplicates(subset=['C2_xsampa','Consonant'])
The inverse, df[df.duplicated(subset=['C2_xsampa','Consonant'], keep=False)], returns the rows that are duplicated.
Edit: I just saw the r language tag.
In R, I believe distinct(select(df, C2_xsampa, Consonant)) will do.

It seems there are stray trailing spaces in some of the values of C2_xsampa (for example "m: " versus "m:"). Here is a suggestion using {tidyverse}. It first trims the whitespace with str_trim() and then counts the rows in each combination of C2_xsampa and Consonant. Rows where dup is greater than 1 are duplicated, so you can filter on the dup column.
library(tidyverse)
dat1 <- dat %>%
mutate(C2_xsampa = str_trim(C2_xsampa)) %>%
group_by(C2_xsampa, Consonant) %>%
mutate(dup = n()) %>%
ungroup()
dat1
# # A tibble: 20 x 9
# C2_xsampa Consonant Speaker C1.dn C2.dn V1.dn V2.dn total.dn dup
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 d_d Singleton 8.5 11.9 7.82 13 7.65 40.3 1
# 2 d_d: Geminate 9 11.6 11.9 11.4 7.46 42.3 1
# 3 dZ Singleton 8.31 7.79 7.47 14.9 9.81 40 1
# 4 dZ: Geminate 8.08 7.72 13.4 12.8 9.61 43.6 1
# 5 g Singleton 9 12.1 11.3 11.9 8.56 43.9 1
# 6 g: Geminate 8.69 11.3 11.1 12.7 10.2 45.3 1
# 7 k Singleton 9.5 12.3 14.4 9.71 6.97 43.4 1
# 8 k: Geminate 9 14.7 16.1 10.1 7.37 48.2 1
# 9 l Singleton 8.69 11.9 6.33 11.5 10.2 40 1
# 10 l: Geminate 8.81 11.3 10 10 11.5 42.8 1
# 11 m Singleton 8.36 13.6 9.11 11.1 9.2 43 1
# 12 m: Geminate 8.85 13.7 10.9 9.95 8.42 43 2
# 13 m: Geminate 14 14.6 12.4 5.66 5.01 37.7 2
# 14 n Singleton 8 15.1 4.44 11.6 8.99 40.2 1
# 15 n: Geminate 8.21 21.4 10.1 10.2 9.32 51 2
# 16 n: Geminate 11.3 32 10.4 8.09 7.94 58.5 2
# 17 p Singleton 8.4 11.2 11.9 7.98 6.53 37.7 1
# 18 p: Geminate 8.81 13.2 12.7 8.57 11.3 45.8 1
# 19 t` Singleton 9 12.9 10.5 8.69 9.2 41.3 1
# 20 t`: Geminate 9 13.1 13.1 8.39 10.6 45.2 1
Here is the code for the dataset:
dat <- read.table(
text = '
C2_xsampa Consonant Speaker C1.dn C2.dn V1.dn V2.dn total.dn
1 "d_d" Singleton 8.5 11.9 7.82 13.0 7.65 40.3
2 "d_d:" Geminate 9 11.6 11.9 11.4 7.46 42.3
3 "dZ" Singleton 8.31 7.79 7.47 14.9 9.81 40.0
4 "dZ:" Geminate 8.08 7.72 13.4 12.8 9.61 43.6
5 "g" Singleton 9 12.1 11.3 11.9 8.56 43.9
6 "g:" Geminate 8.69 11.3 11.1 12.7 10.2 45.3
7 "k" Singleton 9.5 12.3 14.4 9.71 6.97 43.4
8 "k:" Geminate 9 14.7 16.1 10.1 7.37 48.2
9 "l" Singleton 8.69 11.9 6.33 11.5 10.2 40.0
10 "l:" Geminate 8.81 11.3 10.0 10.0 11.5 42.8
11 "m" Singleton 8.36 13.6 9.11 11.1 9.20 43.0
12 "m:" Geminate 8.85 13.7 10.9 9.95 8.42 43.0
13 "m: " Geminate 14 14.6 12.4 5.66 5.01 37.7
14 "n" Singleton 8 15.1 4.44 11.6 8.99 40.2
15 "n:" Geminate 8.21 21.4 10.1 10.2 9.32 51.0
16 "n: " Geminate 11.3 32.0 10.4 8.09 7.94 58.5
17 "p" Singleton 8.4 11.2 11.9 7.98 6.53 37.7
18 "p:" Geminate 8.81 13.2 12.7 8.57 11.3 45.8
19 "t`" Singleton 9 12.9 10.5 8.69 9.20 41.3
20 "t`:" Geminate 9 13.1 13.1 8.39 10.6 45.2',
header = TRUE
)

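My favorite approach for this is to flag duplicates from both ends, after trimming the stray whitespace:
subset(dat, duplicated(trimws(C2_xsampa)) | duplicated(trimws(C2_xsampa), fromLast = TRUE))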
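As a minimal base R sketch of this idea, the block below uses a six-row stand-in for the question's data (rows 11 to 16, with only one numeric column kept). The names key and is_dup are made up for illustration. It assumes the duplicates differ only by surrounding whitespace, and it uses duplicated(x, fromLast = TRUE) rather than duplicated(rev(x)), because the latter returns its flags in reversed order and would not line up with the rows.

```r
# Stand-in for rows 11-16 of the question's data (only one numeric column kept).
dat <- data.frame(
  C2_xsampa = c("m", "m:", "m: ", "n", "n:", "n: "),
  Consonant = c("Singleton", "Geminate", "Geminate",
                "Singleton", "Geminate", "Geminate"),
  total.dn  = c(43.0, 43.0, 37.7, 40.2, 51.0, 58.5),
  stringsAsFactors = FALSE
)

# Compare on a trimmed key so that "m: " and "m:" count as the same value.
key <- trimws(dat$C2_xsampa)

# Flag every member of a duplicated group: scan once from the top and
# once from the bottom, then combine the two logical vectors.
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
dat[is_dup, ]

# Rows whose raw value changes when trimmed are the likely data-entry errors.
dat[dat$C2_xsampa != key, ]

# Why not rev(): its flags come back in reversed order.
x <- c("a", "b", "a", "c")
duplicated(x) | duplicated(rev(x))             # flags positions 3 and 4 ("a", "c")
duplicated(x) | duplicated(x, fromLast = TRUE) # flags positions 1 and 3 (both "a")
```

Once the offending rows are identified, the values can be fixed in the parent dataset (or trimmed on import) before the means are recomputed.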

Related

dplyr - programming dynamic variable & function name - ascending & descending

I am trying to find a way to shorten my code by passing the variable name and the ranking function dynamically, for both ascending and descending order. I can do this for descending order with desc, but I couldn't find an equivalent for ascending order. Below is a reproducible example that demonstrates the problem.
Here is the sample dataset
library(dplyr)
set.seed(100)
data <- tibble(a = runif(20, min = 0, max = 100),
b = runif(20, min = 0, max = 100),
c = runif(20, min = 0, max = 100))
Dynamically passing variable with percent rank in ascending order
current_var <- "a" # dynamic variable name
data %>%
mutate("percent_rank_{current_var}" := percent_rank(!!sym(current_var)))
#> # A tibble: 20 × 4
#> a b c percent_rank_a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 30.8 53.6 33.1 0.263
#> 2 25.8 71.1 86.5 0.158
#> 3 55.2 53.8 77.8 0.684
#> 4 5.64 74.9 82.7 0
#> 5 46.9 42.0 60.3 0.526
#> 6 48.4 17.1 49.1 0.579
#> 7 81.2 77.0 78.0 0.947
#> 8 37.0 88.2 88.4 0.421
#> 9 54.7 54.9 20.8 0.632
#> 10 17.0 27.8 30.7 0.0526
#> 11 62.5 48.8 33.1 0.737
#> 12 88.2 92.9 19.9 1
#> 13 28.0 34.9 23.6 0.211
#> 14 39.8 95.4 27.5 0.474
#> 15 76.3 69.5 59.1 0.895
#> 16 66.9 88.9 25.3 0.789
#> 17 20.5 18.0 12.3 0.105
#> 18 35.8 62.9 23.0 0.316
#> 19 35.9 99.0 59.8 0.368
#> 20 69.0 13.0 21.1 0.842
Dynamically passing variable with percent rank in descending order
data %>%
mutate("percent_rank_{current_var}" := percent_rank(desc(!!sym(current_var))))
#> # A tibble: 20 × 4
#> a b c percent_rank_a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 30.8 53.6 33.1 0.737
#> 2 25.8 71.1 86.5 0.842
#> 3 55.2 53.8 77.8 0.316
#> 4 5.64 74.9 82.7 1
#> 5 46.9 42.0 60.3 0.474
#> 6 48.4 17.1 49.1 0.421
#> 7 81.2 77.0 78.0 0.0526
#> 8 37.0 88.2 88.4 0.579
#> 9 54.7 54.9 20.8 0.368
#> 10 17.0 27.8 30.7 0.947
#> 11 62.5 48.8 33.1 0.263
#> 12 88.2 92.9 19.9 0
#> 13 28.0 34.9 23.6 0.789
#> 14 39.8 95.4 27.5 0.526
#> 15 76.3 69.5 59.1 0.105
#> 16 66.9 88.9 25.3 0.211
#> 17 20.5 18.0 12.3 0.895
#> 18 35.8 62.9 23.0 0.684
#> 19 35.9 99.0 59.8 0.632
#> 20 69.0 13.0 21.1 0.158
How can I combine both into one statement? I can do it for desc, but I couldn't find an explicit function for ascending order.
rank_function <- desc # dynamic function for ranking
data %>%
mutate("percent_rank_{current_var}" := percent_rank(rank_function(!!sym(current_var))))
#> # A tibble: 20 × 4
#> a b c percent_rank_a
#> <dbl> <dbl> <dbl> <dbl>
#> 1 30.8 53.6 33.1 0.737
#> 2 25.8 71.1 86.5 0.842
#> 3 55.2 53.8 77.8 0.316
#> 4 5.64 74.9 82.7 1
#> 5 46.9 42.0 60.3 0.474
#> 6 48.4 17.1 49.1 0.421
#> 7 81.2 77.0 78.0 0.0526
#> 8 37.0 88.2 88.4 0.579
#> 9 54.7 54.9 20.8 0.368
#> 10 17.0 27.8 30.7 0.947
#> 11 62.5 48.8 33.1 0.263
#> 12 88.2 92.9 19.9 0
#> 13 28.0 34.9 23.6 0.789
#> 14 39.8 95.4 27.5 0.526
#> 15 76.3 69.5 59.1 0.105
#> 16 66.9 88.9 25.3 0.211
#> 17 20.5 18.0 12.3 0.895
#> 18 35.8 62.9 23.0 0.684
#> 19 35.9 99.0 59.8 0.632
#> 20 69.0 13.0 21.1 0.158
Created on 2022-08-17 by the reprex package (v2.0.1)

You could write a function that simply returns its input:
rank_function <- function(x) x
This function already exists in base R as identity:
rank_function <- identity
You can also look at the source code of desc:
desc
function (x) -xtfrm(x)
So desc just negates xtfrm(x), which means xtfrm itself can be used for ascending order:
rank_function <- xtfrm
From the help page of xtfrm(x):
A generic auxiliary function that produces a numeric vector which will sort in the same order as x.
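To show how the pieces could fit together, here is a small sketch that wraps the choice between identity and desc in a helper. The helper name add_percent_rank, its descending argument, and the three-row tibble are hypothetical and not from the question. It uses .data[[var]] in place of !!sym(var), which should behave the same here.

```r
library(dplyr)

# Tiny hypothetical dataset, just to make the result easy to check by eye.
data <- tibble(a = c(10, 30, 20), b = c(3, 1, 2))

# Hypothetical helper: picks the ranking function from a flag and builds
# the new column name from the variable name.
add_percent_rank <- function(data, var, descending = FALSE) {
  rank_function <- if (descending) desc else identity
  data %>%
    mutate("percent_rank_{var}" := percent_rank(rank_function(.data[[var]])))
}

asc <- add_percent_rank(data, "a")                     # ascending order
dsc <- add_percent_rank(data, "a", descending = TRUE)  # descending order

asc$percent_rank_a  # 0.0 1.0 0.5
dsc$percent_rank_a  # 1.0 0.0 0.5
```

Passing the function itself (identity, xtfrm, or desc) as an argument instead of a logical flag would work just as well; the flag is only one possible design.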

Taylor Diagrams by Group in R (openair)

I'm trying to create a Taylor diagram to show agreement between observations and model output. The openair package lets you differentiate by a group, which I would like to do for each site.
This is the code that I'm using:
TaylorDiagram(month_join, obs = "temp", mod = "temp_surf", group = "dataset_id", normalise = TRUE, cex = 1)
The observation variable is temp, the model variable is temp_surf, and dataset_id identifies the site that I want to group by.
When I do this, though there are 17 different datasets, they are binned into four groups. I can't find any help online about this. The function documentation says that for the group argument, "The total number of models compared will be equal to the number of unique values of group". I have 17 unique values in the group but they are automatically binned into 4.
Taylor diagram with 4 groups instead of 17
[Edit: first 20 rows of data from month_join]
# A tibble: 20 × 14
# Groups: dataset_id [2]
dataset_id month temp_surf temp_mid temp_bot ph_surf ph_mid ph_bot do_surf do_mid do_bot temp ph do
<dbl> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 13.4 13.3 13.2 8.01 7.99 7.97 244. 232. 220. 13.3 8.00 NaN
2 3 2 13.3 13.2 13.0 8.01 7.98 7.96 245. 232. 218. 12.5 7.99 NaN
3 3 3 12.9 12.7 12.5 7.97 7.94 7.91 233. 216. 199. 12.7 8.04 NaN
4 3 4 12.6 12.4 12.2 7.93 7.91 7.89 223. 207. 190. NaN NaN NaN
5 3 5 12.9 12.7 12.4 7.93 7.91 7.89 223. 208. 193. NaN NaN NaN
6 3 6 13.5 13.2 12.9 7.94 7.92 7.90 226. 212. 198. 15.1 8.04 NaN
7 3 7 14.3 13.9 13.5 7.97 7.95 7.94 236. 224. 212. 16.0 8.09 NaN
8 3 8 14.4 14.1 13.8 7.98 7.97 7.95 238. 228. 217. 16.6 8.06 NaN
9 3 9 14.8 14.5 14.1 8.00 7.99 7.97 244. 235. 227. 16.7 8.05 NaN
10 3 10 14.8 14.4 14.1 8.00 7.98 7.96 243. 233. 222. 16.2 8.05 NaN
11 3 11 14.3 14.0 13.7 7.99 7.96 7.94 237. 224. 211. 15.5 8.05 NaN
12 3 12 13.6 13.4 13.3 7.99 7.97 7.94 237. 225. 213. 14.4 8.05 NaN
13 6 1 14.3 9.48 4.70 8.07 7.84 7.62 261. 143. 24.7 13.6 NaN NaN
14 6 2 14.2 9.42 4.68 8.07 7.84 7.62 264. 144. 24.4 13.5 NaN NaN
15 6 3 14.5 9.61 4.67 8.07 7.84 7.61 266. 145. 24.2 14.0 NaN NaN
16 6 4 15.0 9.86 4.68 8.06 7.84 7.61 264. 144. 24.0 14.3 NaN NaN
17 6 5 16.0 10.4 4.68 8.05 7.83 7.61 262. 143. 24.0 16.4 NaN NaN
18 6 6 17.3 11.0 4.68 8.04 7.83 7.61 257. 141. 23.9 17.6 NaN NaN
19 6 7 18.8 11.7 4.71 8.03 7.82 7.61 251. 138. 24.2 19.3 NaN NaN
20 6 8 19.2 12.0 4.76 8.03 7.82 7.61 248. 136. 24.7 NA NA NA
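
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: dataset_id is stored as <dbl>, and openair may treat a numeric grouping column as a continuous variable and cut it into four quantile bins, which would match the four groups seen. A minimal sketch of converting the column to a factor first is below. The four-row data frame is a stand-in built from rows 1, 2, 13 and 14 of the printed data, and the TaylorDiagram call is left commented out because it needs openair installed.

```r
# Stand-in for month_join, using rows 1, 2, 13 and 14 of the printed data.
month_join <- data.frame(
  dataset_id = c(3, 3, 6, 6),
  temp_surf  = c(13.4, 13.3, 14.3, 14.2),
  temp       = c(13.3, 12.5, 13.6, 13.5)
)

# The printed tibble is grouped by dataset_id; converting to a plain
# data frame drops that grouping before plotting.
month_join <- as.data.frame(month_join)

# Make the grouping variable categorical so that each site is its own level.
month_join$dataset_id <- factor(month_join$dataset_id)
nlevels(month_join$dataset_id)  # one level per site

# With openair installed, the original call can then be retried unchanged:
# TaylorDiagram(month_join, obs = "temp", mod = "temp_surf",
#               group = "dataset_id", normalise = TRUE, cex = 1)
```

If the number of levels equals the number of sites but the diagram still shows four groups, the cause lies elsewhere and this assumption does not hold.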

Draw regression line per row in R

I have the following data.
HEIrank1
HEI.ID X2007 X2008 X2009 X2010 X2011 X2012
1 OP 41.8 147.6 90.3 82.9 106.8 63.0
2 MO 20.0 20.8 21.1 20.9 12.6 20.6
3 SD 21.2 32.3 25.7 23.9 25.0 40.1
4 UN 51.8 39.8 19.9 20.9 21.6 22.5
5 WS 18.0 19.9 15.3 13.6 15.7 15.2
6 BF 11.5 36.9 20.0 23.2 18.2 23.8
7 ME 34.2 30.3 28.4 30.1 31.5 25.6
8 IM 7.7 18.1 20.5 14.6 17.2 17.1
9 OM 11.4 11.2 12.2 11.1 13.4 19.2
10 DC 14.3 28.7 20.1 17.0 22.3 16.2
11 OC 28.6 44.0 24.9 27.9 34.0 30.7
12 TH 7.4 10.0 5.8 8.8 8.7 8.6
13 CC 12.1 11.0 12.2 12.1 14.9 15.0
14 MM 11.7 24.2 18.4 18.6 31.9 31.7
15 MC 19.0 13.7 17.0 20.4 20.5 12.1
16 SH 11.4 24.8 26.1 12.7 19.9 25.9
17 SB 13.0 22.8 15.9 17.6 17.2 9.6
18 SN 11.5 18.6 22.9 12.0 20.3 11.6
19 ER 10.8 13.2 20.0 11.0 14.9 14.2
20 SL 44.9 21.6 21.3 26.5 17.0 8.0
I tried the following commands to draw a regression line for one HEI.
year <- c(2007 , 2008 , 2009 , 2010 , 2011, 2012)
op <- as.numeric(HEIrank1[1, -1]) # drop the HEI.ID column so only the six yearly values remain
lm.r <- lm(op~year)
plot(year, op)
abline(lm.r)
I want to draw a regression line for each college in one graph, and I do not know how. Can you help me?

Here's my approach with ggplot2 but the graph is uninterpretable with that many lines.
library(ggplot2);library(reshape2)
mdat <- melt(HEIrank1, variable.name="year")
mdat$year <- as.numeric(substring(mdat$year, 2))
ggplot(mdat, aes(year, value, colour=HEI.ID, group=HEI.ID)) +
geom_point() + stat_smooth(se = FALSE, method="lm")
Faceting may be a better way to go:
ggplot(mdat, aes(year, value, group=HEI.ID)) +
geom_point() + stat_smooth(se = FALSE, method="lm") +
facet_wrap(~HEI.ID)
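
For readers who prefer to stay with base graphics, as in the question, the following is a minimal sketch of the same idea: fit one lm() per row, then overlay all fitted lines on a single plot. Only the first three rows of HEIrank1 are reproduced here, and the names values and fits are made up for illustration.

```r
year <- 2007:2012

# First three rows of HEIrank1 from the question.
HEIrank1 <- data.frame(
  HEI.ID = c("OP", "MO", "SD"),
  X2007  = c(41.8, 20.0, 21.2),
  X2008  = c(147.6, 20.8, 32.3),
  X2009  = c(90.3, 21.1, 25.7),
  X2010  = c(82.9, 20.9, 23.9),
  X2011  = c(106.8, 12.6, 25.0),
  X2012  = c(63.0, 20.6, 40.1)
)

# Keep only the numeric year columns, one row per HEI.
values <- as.matrix(HEIrank1[, -1])

# One regression per row; each column of `fits` holds (intercept, slope).
fits <- apply(values, 1, function(y) coef(lm(y ~ year)))
colnames(fits) <- HEIrank1$HEI.ID

# matplot() draws one series per column, so transpose the matrix.
cols <- seq_len(nrow(values))
matplot(year, t(values), type = "p", pch = 1, col = cols,
        xlab = "Year", ylab = "Value")
for (i in seq_len(ncol(fits))) {
  abline(a = fits[1, i], b = fits[2, i], col = cols[i])
}
legend("topright", legend = HEIrank1$HEI.ID, col = cols, lty = 1)
```

With all 20 HEIs the single panel gets crowded, which is the same readability problem noted above for the ggplot2 version.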

Draw histograms per row over multiple columns in R

I'm using R for the analysis in my master's thesis.
I have the following data frame, STOF (student-to-staff ratio):
HEI.ID X2007 X2008 X2009 X2010 X2011 X2012
1 OP 41.8 147.6 90.3 82.9 106.8 63.0
2 MO 20.0 20.8 21.1 20.9 12.6 20.6
3 SD 21.2 32.3 25.7 23.9 25.0 40.1
4 UN 51.8 39.8 19.9 20.9 21.6 22.5
5 WS 18.0 19.9 15.3 13.6 15.7 15.2
6 BF 11.5 36.9 20.0 23.2 18.2 23.8
7 ME 34.2 30.3 28.4 30.1 31.5 25.6
8 IM 7.7 18.1 20.5 14.6 17.2 17.1
9 OM 11.4 11.2 12.2 11.1 13.4 19.2
10 DC 14.3 28.7 20.1 17.0 22.3 16.2
11 OC 28.6 44.0 24.9 27.9 34.0 30.7
Then I rank the colleges using these commands:
HEIrank1<-(STOF[,-c(1)])
rank1 <- apply(HEIrank1,2,rank)
> HEIrank11
HEI.ID X2007 X2008 X2009 X2010 X2011 X2012
1 OP 18.0 20 20.0 20.0 20.0 20
2 MO 14.0 9 13.0 13.5 2.0 12
3 SD 15.0 16 17.0 16.0 16.0 19
4 UN 20.0 18 8.0 13.5 14.0 13
5 WS 12.0 8 4.0 7.0 6.0 8
6 BF 6.5 17 9.5 15.0 10.0 14
7 ME 17.0 15 19.0 19.0 17.0 15
8 IM 2.0 6 12.0 8.0 8.5 10
9 OM 4.5 3 2.5 3.0 3.0 11
10 DC 11.0 14 11.0 9.0 15.0 9
11 OC 16.0 19 16.0 18.0 19.0 17
I would like to draw a histogram for each HEI (that is, for each row). How can I do that?

If you use ggplot you won't need a loop; you can plot them all at once. You first need to reshape your data from wide format to long format, which you can do with the melt function from the reshape2 package.
library(reshape2)
new.df<-melt(HEIrank11,id.vars="HEI.ID")
names(new.df)=c("HEI.ID","Year","Rank")
In the plot call, substring just strips the leading X from each year. Because the heights are already computed, geom_col draws them as they are:
library(ggplot2)
ggplot(new.df, aes(x=HEI.ID,y=Rank,fill=substring(Year,2)))+
geom_col(position="dodge")
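
To make the wide-to-long step concrete, here is a base R sketch of what the reshaping produces, without reshape2. Only the first three rows and three year columns of the ranked data are reproduced, and year_cols is a made-up helper name. The X prefix is stripped during the reshape here, so substring would not be needed afterwards.

```r
# First three rows and three year columns of the ranked data from the question.
HEIrank11 <- data.frame(
  HEI.ID = c("OP", "MO", "SD"),
  X2007  = c(18, 14, 15),
  X2008  = c(20, 9, 16),
  X2009  = c(20, 13, 17)
)

year_cols <- names(HEIrank11)[-1]

# Long format: one row per (HEI, year) pair.
# unlist() stacks the year columns one after another, so the IDs repeat
# as a block for each year and each year label repeats once per HEI.
new.df <- data.frame(
  HEI.ID = rep(HEIrank11$HEI.ID, times = length(year_cols)),
  Year   = rep(sub("^X", "", year_cols), each = nrow(HEIrank11)),
  Rank   = unlist(HEIrank11[year_cols], use.names = FALSE)
)

new.df
```

The result has the same three columns that the melt call plus names() assignment produces, so it can be passed to the ggplot call in the same way.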

Here's a solution in lattice:
require(lattice)
barchart(X2007+X2008+X2009+X2010+X2011+X2012 ~ HEI.ID,
data=HEIrank11,
auto.key=list(space='right')
)

Inserting another column to a data frame and incrementing its value per row

I have this data frame:
head(df,10)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
3 36.4 13.1 13.9 36.6 9.26 57.9 28.0 34.96 26049 3492
4 31.1 11.2 12.6 45.1 7.81 48.8 25.9 37.85 17515 2754
5 33.2 13.4 13.2 40.3 8.69 54.3 26.9 35.67 23510 3265
6 34.0 12.8 13.7 39.4 8.77 54.8 26.5 35.19 25151 3305
7 32.7 12.4 13.6 41.3 8.49 53.0 25.9 35.97 25214 3201
8 33.4 13.7 12.5 40.3 8.76 54.7 27.1 36.50 23943 3391
9 35.2 13.8 13.5 37.5 9.20 57.5 27.8 33.08 25647 3385
10 34.6 14.9 14.9 35.6 9.35 58.4 27.8 35.81 27324 3790
11 30.4 13.3 13.0 43.3 8.29 51.8 24.9 38.31 25178 2881
12 32.0 13.3 14.0 40.7 8.58 53.6 26.1 35.97 25677 3162
I have a DateTime value like this:
DateTime<-Sys.time()
I would like to add another column to this df and increment the DateTime value by 30 seconds for each row.
I'm doing this:
for (i in 1:nrow(df)) {
df[1,]$DateTime<-DateTime
DateTime<-DateTime+30
}
This loop is not doing what I'm trying to do. Any help is greatly appreciated.

df$DateTime <- Sys.time() + 30 * (seq_len(nrow(df))-1)
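
The one-liner works because adding a number to a POSIXct value adds that many seconds, and the addition is vectorised. The loop in the question always writes to df[1,], so it never reaches the other rows. Below is a small sketch with a fixed start time so the result is reproducible. The start value and the three-row data frame are stand-ins, not part of the original.

```r
# First three rows and two columns of the question's df, as a stand-in.
df <- data.frame(V1 = c(36.4, 31.1, 33.2), V2 = c(13.1, 11.2, 13.4))

# Hypothetical fixed start time (the question uses Sys.time()).
start <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC")

# Offsets of 0, 30, 60, ... seconds, one per row.
df$DateTime <- start + 30 * (seq_len(nrow(df)) - 1)
df

# An equivalent spelling with seq(), where `by` is taken to be in seconds.
alt <- seq(from = start, by = 30, length.out = nrow(df))
```

Both forms avoid growing or indexing the data frame row by row.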
