I'm trying to do some quick data manipulation in R, and I am very new to it.
I am trying to use the unique function on some data. What I want to achieve is to keep unique rows based on some combination of columns. As I understand from the documentation, this should be possible using the 'by' argument of the unique method, but I cannot get this to work.
I have the dataTest:
name age
1 A 1
2 B 2
3 C 1
After using unique(dataTest,by="age"), the output does not change, while I would expect it to change to:
name age
1 A 1
2 B 2
See the attachment for the code in action.
Again, it's probably a beginner mistake, but I cannot seem to figure it out. Help much appreciated.
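I think you have a data.frame. The by argument belongs to the data.table method of unique; the data.frame method ignores it and compares whole rows. Convert your data frame to a data.table and it should work. See the difference in output: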
1) When it is a dataframe.
df <- structure(list(name = structure(1:3, .Label = c("A", "B", "C"
), class = "factor"), age = c(1L, 2L, 1L)), class = "data.frame",
row.names = c("1", "2", "3"))
unique(df, by = "age")
# name age
#1 A 1
#2 B 2
#3 C 1
2) After changing it to data.table
library(data.table)
setDT(df)
unique(df, by = "age")
# name age
#1: A 1
#2: B 2
Another option is to use duplicated
df[!duplicated(df$age), ]
We can use distinct; .keep_all = TRUE retains the other columns (without it only age is returned)
library(dplyr)
df %>%
  distinct(age, .keep_all = TRUE)
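Since the question asks about a combination of columns, here is a minimal base R sketch of the duplicated approach generalised to a set of key columns. The variable key_cols is a hypothetical name introduced for illustration; the data mirrors the dataTest example from the question.

```r
# Toy data matching the question.
dataTest <- data.frame(name = c("A", "B", "C"), age = c(1L, 2L, 1L))

# On a plain data.frame, unique() compares whole rows, so all 3 rows survive.
n_whole_row_unique <- nrow(unique(dataTest))

# Hypothetical set of key columns; use e.g. c("name", "age") for a combination.
key_cols <- "age"

# duplicated() on the key columns flags every repeat after the first occurrence.
dedup <- dataTest[!duplicated(dataTest[key_cols]), ]
print(dedup)
```

This keeps the first row seen for each key, which is also what the data.table method does by default.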
I'm quite new to R and using lapply. I have a large dataframe and I'm attempting to use lapply to output the sum of some subsets of this dataframe.
group_a group_b n_variants_a n_variants_b
1       NA      1            2
NA      2       5            4
1       2       2            0
I want to look at subsets based on multiple different groups (group_a, group_b) and sum each column of n_variants.
Running this over just one group and n_variant set works:
sum(subset(df, !is.na(group_a))$n_variants_a)
However I want to sum every n_variant column based on every grouping. My lapply script for this outputs values of 0 for each sum.
summed_variants <- lapply(list_of_groups, function(g) {
  lapply(list_of_variants, function(v) {
    sum(subset(df, !(is.na(g)))$v)
  })
})
I was wondering if I need to use paste0 to paste the list of variants in, but I couldn't get this to work.
Thanks for your help!
We may use Map/mapply for this: loop over the group names and their corresponding 'n_variants' names (assuming they are in the same order), extract the columns by name, apply the condition (!is.na) to the group column, subset the 'n_variants' column and get the sum
mapply(function(x, y) sum(df1[[y]][!is.na(df1[[x]])]),
names(df1)[1:2], names(df1)[3:4])
group_a group_b
3 4
Or another option can be done using tidyverse. Loop across the 'n_variants' columns, get the column name (cur_column()) replace the substring with 'group', get the value, create the condition to subset the column and get the sum
library(stringr)
library(dplyr)
df1 %>%
summarise(across(contains('variants'),
~ sum(.x[!is.na(get(str_replace(cur_column(), 'n_variants', 'group')))])))
-output
n_variants_a n_variants_b
1 3 4
data
df1 <- structure(list(group_a = c(1L, NA, 1L), group_b = c(NA, 2L, 2L
), n_variants_a = c(1L, 5L, 2L), n_variants_b = c(2L, 4L, 0L)),
class = "data.frame", row.names = c(NA,
-3L))
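The nested loop from the question returns 0 because is.na(g) tests the name string itself (never NA) and df$v looks for a column literally called v, which is NULL. A minimal sketch of the every-group-by-every-variant version, assuming list_of_groups and list_of_variants are character vectors of column names (the question does not show how they were defined):

```r
df1 <- data.frame(
  group_a      = c(1L, NA, 1L),
  group_b      = c(NA, 2L, 2L),
  n_variants_a = c(1L, 5L, 2L),
  n_variants_b = c(2L, 4L, 0L)
)

# Assumed definitions: plain character vectors of column names.
list_of_groups   <- c("group_a", "group_b")
list_of_variants <- c("n_variants_a", "n_variants_b")

# [[ ]] looks a column up by the string stored in g or v.
summed_variants <- sapply(list_of_groups, function(g) {
  sapply(list_of_variants, function(v) {
    keep <- !is.na(df1[[g]])   # rows where this group is present
    sum(df1[[v]][keep])
  })
})

# Rows are variant columns, columns are groups.
print(summed_variants)
```

The diagonal of this matrix corresponds to the paired sums from the mapply answer; the off-diagonal cells are the cross combinations.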
I have data that looks like this:
ID FACTOR_VAR INT_VAR
1 CAT 1
1 DOG 0
I want to aggregate by ID such that the resulting dataframe contains the entire row that satisfies my aggregate condition. So if I aggregate by the max of INT_VAR, I want to return the whole first row:
ID FACTOR_VAR INT_VAR
1 CAT 1
The following will not work because FACTOR_VAR is a factor:
new_data <- aggregate(data[,c("ID", "FACTOR_VAR", "INT_VAR")], by=list(data$ID), FUN=max)
How can I do this? I know dplyr has a group by function, but unfortunately I am working on a computer for which downloading packages takes a long time. So I'm looking for a way to do this with just vanilla R.
If you want to keep all the columns, use ave instead:
subset(df, as.logical(ave(INT_VAR, ID, FUN = function(x) x == max(x))))
You can use aggregate for this. If you want to retain all the columns, merge can be used with it.
merge(aggregate(INT_VAR ~ ID, data = df, max), df, all.x = TRUE)
# ID INT_VAR FACTOR_VAR
#1 1 1 CAT
data
df <- structure(list(ID = c(1L, 1L), FACTOR_VAR = structure(1:2, .Label = c("CAT", "DOG"), class = "factor"), INT_VAR = 1:0), class = "data.frame", row.names = c(NA,-2L))
We can do this in dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(INT_VAR == max(INT_VAR))
Or using data.table
library(data.table)
setDT(df)[, .SD[INT_VAR == max(INT_VAR)], by = ID]
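Since the question asks for vanilla R, another option is to sort and then keep the first row per ID. This is only a sketch on made-up data (the second ID and its values are invented for illustration), and unlike the ave and filter versions it returns exactly one row per ID even when the maximum is tied.

```r
# Illustrative data; ID 2 and its values are made up for this example.
df <- data.frame(
  ID         = c(1L, 1L, 2L, 2L),
  FACTOR_VAR = factor(c("CAT", "DOG", "DOG", "CAT")),
  INT_VAR    = c(1L, 0L, 3L, 7L)
)

# Sort so the largest INT_VAR comes first within each ID.
ordered_df <- df[order(df$ID, -df$INT_VAR), ]

# Keep the first (largest) row per ID; all columns come along unchanged.
top_rows <- ordered_df[!duplicated(ordered_df$ID), ]
print(top_rows)
```

Because whole rows are selected by index, the factor column never goes through max, which is what made the aggregate call in the question fail.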
I have two data.frame tables in R. Both have IDs for users who took particular actions. The users in the second table should all have done the actions in the first table, but I want to confirm. What would be the best way to determine whether all the IDs in table B are represented in table A, and if not, which IDs aren't?
Table A
**Unique ID** **Count**
abc123 1
zyx456 15
888aaaa 4
Table B
**Unique ID** **Count**
abc123 1
zyx456 1
zzzzz123 2
I'm trying to get a response that abc123 and zyx456 in Table B are in Table A and that zzzzz123 is not represented in Table A but is in B (which would be an error, since all B should be in A).
This is an efficient one-liner in base R:
setdiff(TableB$ID, TableA$ID)
It will return an empty vector if everything in TableB is in TableA, and return the missing IDs if there are any.
Other answers may be better choices with broader context, but this is a simple solution for a simple problem.
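A small worked example of the setdiff check on the data from the question. The column name ID follows the one-liner above and is an assumption (the question labels the column "Unique ID"); the variable names are for illustration only.

```r
TableA <- data.frame(ID = c("abc123", "zyx456", "888aaaa"),
                     Count = c(1, 15, 4), stringsAsFactors = FALSE)
TableB <- data.frame(ID = c("abc123", "zyx456", "zzzzz123"),
                     Count = c(1, 1, 2), stringsAsFactors = FALSE)

# IDs in B that never appear in A (should be empty if the data is consistent).
missing_ids <- setdiff(TableB$ID, TableA$ID)

# IDs that appear in both tables.
present_ids <- intersect(TableB$ID, TableA$ID)

# Single TRUE/FALSE answer to "is every B ID represented in A?"
all_present <- length(missing_ids) == 0

print(missing_ids)
print(all_present)
```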
We can do this easily with a join in the tidyverse:
library(tidyverse)
JoinedTable = full_join(
x = TableA %>% mutate(in.A = TRUE),
y = TableB %>% mutate(in.B = TRUE),
by = "UniqueID",
suffix = c(".A",".B")
)
### Use whichever of the following is applicable
## Is in both
JoinedTable %>%
filter(in.A, in.B)
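## In A only (in.B is NA, not FALSE, for rows without a match in B)
JoinedTable %>%
filter(in.A, is.na(in.B))
## In B only (in.A is NA for rows without a match in A)
JoinedTable %>%
filter(is.na(in.A), in.B)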
Use a full_join to combine the tables; set "by" to your ID column and add a suffix to differentiate other columns that share a name across the two tables. I've added mutates to make the filtering code clearer, but you could also just look for NAs in the respective Count columns (i.e. filter(!is.na(Count.A), is.na(Count.B)) to find ones in A but not B).
If you just want a vector of the ones that meet each condition, just tack on %>% pull(UniqueID) to grab that.
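For comparison, the same full-join idea can be sketched in base R with merge(all = TRUE). This is not the tidyverse code above, just an equivalent under the assumption that both tables share a UniqueID column; it also shows that unmatched rows carry NA rather than FALSE in the flag column coming from the other table.

```r
TableA <- data.frame(UniqueID = c("abc123", "zyx456", "888aaaa"),
                     Count = c(1, 15, 4), stringsAsFactors = FALSE)
TableB <- data.frame(UniqueID = c("abc123", "zyx456", "zzzzz123"),
                     Count = c(1, 1, 2), stringsAsFactors = FALSE)

# Flag columns, as in the mutate() calls above.
TableA$in.A <- TRUE
TableB$in.B <- TRUE

# all = TRUE keeps unmatched rows from both sides (a full outer join).
joined <- merge(TableA, TableB, by = "UniqueID", all = TRUE,
                suffixes = c(".A", ".B"))

# Unmatched rows have NA in the other table's flag, so test with is.na().
only_in_A <- joined$UniqueID[is.na(joined$in.B)]
only_in_B <- joined$UniqueID[is.na(joined$in.A)]
in_both   <- joined$UniqueID[!is.na(joined$in.A) & !is.na(joined$in.B)]

print(joined)
```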
You can add another column to table B that shows whether each ID is also in table A. Here is code that does it (assuming dfA and dfB denote tables A and B):
dfB <- within(dfB, in_dfA <- UniqueID %in% dfA$UniqueID)
gives
> dfB
UniqueID Count in_dfA
1 abc123 1 TRUE
2 zyx456 1 TRUE
3 zzzzz123 2 FALSE
DATA
dfA <- structure(list(UniqueID = structure(c(2L, 3L, 1L), .Label = c("888aaaa",
"abc123", "zyx456"), class = "factor"), Count = c(1L, 15L, 4L
)), class = "data.frame", row.names = c(NA, -3L))
dfB <- structure(list(UniqueID = structure(1:3, .Label = c("abc123",
"zyx456", "zzzzz123"), class = "factor"), Count = c(1L, 1L, 2L
), in_dfA = c(TRUE, TRUE, FALSE)), row.names = c(NA, -3L), class = "data.frame")
How about using the %in% operator to see which are in both versus those that are not:
library(tibble)
library(tidyverse)
df1 <- tribble(~ID, ~Count,
'abc', 1,
'zyx', 15,
'other', 3)
df2 <- tribble(~ID, ~Count,
'abc', 2,
'zyx', 33,
'another', 334)
match <- df2[which(df2$ID %in% df1$ID),'ID']
notmatch <- df2[which(!(df2$ID %in% df1$ID)),'ID']
This outputs two comparisons that you can use to check for values in a function and pass errors if need be:
match
# A tibble: 2 x 1
ID
<chr>
1 abc
2 zyx
notmatch
# A tibble: 1 x 1
ID
<chr>
1 another
You could do an update join to see which IDs are/aren't in the first table
tblb[tbla, on = 'UniqueID', in_tbla := i.UniqueID
][, in_tbla := !is.na(in_tbla)]
tblb
# UniqueID Count in_tbla
# 1: abc123 1 TRUE
# 2: zyx456 1 TRUE
# 3: zzzzz123 2 FALSE
Not sure if that's any better than @Onyambu's suggestion though (same output)
tblb[, in_tbla := UniqueID %in% tbla$UniqueID]
Data used:
tbla <- fread('
UniqueID Count
abc123 1
zyx456 15
888aaaa 4
')
tblb <- fread('
UniqueID Count
abc123 1
zyx456 1
zzzzz123 2
')
dataset=structure(list(goods = structure(1:6, .Label = c("a", "b", "c",
"d", "e", "f"), class = "factor")), .Names = "goods", class = "data.frame", row.names = c(NA,
-6L))
goods
1 a
2 b
3 c
4 d
5 e
6 f
I want to create a new data frame, so I simply do
df1=dataset$goods
but after that, df1 doesn't have the column name goods.
Why?
str(df1)
Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 5 6
As you can see, it doesn't have the name goods.
How can I make df1 keep the column name goods?
If this post is a duplicate, let me know and I will delete it.
You are assigning a column vector, not a data frame. To assign the whole data frame, simply do
df = dataset
If you want to preserve only some columns and not all, use column subsetting (documentation):
df = dataset[, "goods", drop = FALSE]
drop = FALSE is necessary here because the dataframe subset operator will otherwise return a vector instead of a data frame with a single column (this is arguably a bug, which is why tidyverse tibbles behave differently).
Using tidyverse operations (aka the “modern” R way), this would be written as
library(dplyr)
df = select(dataset, goods)
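To make the difference concrete, here is a short sketch comparing the subsetting forms on the dataset from the question. The variable names on the left are illustrative only.

```r
dataset <- data.frame(goods = factor(c("a", "b", "c", "d", "e", "f")))

as_vector  <- dataset$goods                     # bare factor, column name is lost
dropped    <- dataset[, "goods"]                # also a bare factor (drop = TRUE default)
kept       <- dataset[, "goods", drop = FALSE]  # one-column data frame
list_style <- dataset["goods"]                  # list-style indexing, also a data frame

str(as_vector)
str(kept)
```

The single-bracket form without a comma, dataset["goods"], is another way to keep the data frame structure without spelling out drop = FALSE.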
df1=data.frame(goods=dataset$goods, stringsAsFactors=F) works perfectly well, or you can use the longer but (somewhat?) more explicit:
ds <- dataset[,c("goods")]
df1=data.frame(goods=dataset$goods)
library(dplyr)
ds <- dataset[,c("goods")] %>% as.data.frame(stringsAsFactors=F)
colnames(ds) <- "goods"
edit: Added the stringsAsFactors option as it is useful to control where you'd like factor conversion or not. c("goods") is equivalent to "goods", but I left it as a template in case you need to add more columns.
My objective is to get a count of how many duplicates there are in a column. I have a column of 3516 obs. of 1 variable; they are all dates, with about 144 duplicates each, from 1/4/16 to 7/3/16. Example (I put 1 duplicate each for example's sake):
1/4/16
1/4/16
31/3/16
31/3/16
30/3/16
30/3/16
29/3/16
29/3/16
28/3/16
28/3/16
So I used the function date = count(date), where date is my df date. But once I execute it, my date sequence is not in order anymore. Hope someone can solve my problem.
If we need to count the total number of duplicates
sum(table(df1$date)-1)
#[1] 5
Suppose, we need the count of each date, one option would be to group by 'date' and get the number of rows. This can be done with data.table.
library(data.table)
setDT(df1)[, .N, date]
If you want the count of number of duplicates in your column , you can use duplicated
sum(duplicated(df$V1))
#[1] 5
Assuming V1 as your column name.
EDIT
As per the update if you want the count of each data, you can use the table function which will give you exactly that
table(df$V1)
#1/4/16 28/3/16 29/3/16 30/3/16 31/3/16
# 2 2 2 2 2
library(dplyr)
library(janitor)
df%>% get_dupes(Variable) %>% tally()
You can add group_by in the pipe too if you want.
One way is to create a data frame with unique values of your initial data which will preserve the order and then use left_join from dplyr package to join the two data frames. Note that the name of your column should be the same.
Initial_data <- structure(list(V1 = structure(c(1L, 1L, 5L, 5L, 4L, 4L, 3L, 3L,
2L, 2L, 2L), .Label = c("1/4/16", "28/3/16", "29/3/16", "30/3/16",
"31/3/16"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-11L))
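library(plyr)   # count() here is plyr::count, which returns a freq column
library(dplyr)  # left_join
df1 <- unique(Initial_data)
count1 <- plyr::count(Initial_data)  # count the original data, not the de-duplicated copy
left_join(df1, count1, by = 'V1')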
# V1 freq
#1 1/4/16 2
#2 31/3/16 2
#3 30/3/16 2
#4 29/3/16 2
#5 28/3/16 3
If you want to count the number of duplicated records, use:
sum(duplicated(df))
and when you want to calculate the proportion of duplicates (multiply by 100 for a percentage), use:
mean(duplicated(df))
mean(duplicated(df))
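The original complaint was that counting reorders the dates. One way to keep the order of first appearance in base R is to fix the factor levels before calling table. A minimal sketch on the example dates from the question; the names dates, counts and result are illustrative.

```r
dates <- c("1/4/16", "1/4/16", "31/3/16", "31/3/16", "30/3/16",
           "30/3/16", "29/3/16", "29/3/16", "28/3/16", "28/3/16")

# Levels in order of first appearance, so table() does not sort alphabetically.
counts <- table(factor(dates, levels = unique(dates)))

# Tidy two-column result: one row per date, in the original order.
result <- data.frame(date = names(counts),
                     freq = as.vector(counts),
                     stringsAsFactors = FALSE)

# Total number of repeated entries (every occurrence after the first).
n_dupes <- sum(duplicated(dates))

print(result)
```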