Modify data frame, remove duplicate rows with for each, R [duplicate]

This question already has answers here:
Select the first and last row by group in a data frame
(6 answers)
Closed 2 years ago.
I want to modify a data frame because it contains duplicate values.
Data Frame:
Id Name Account
1 X 1
1 Y 2
1 Z 3
2 J 1
2 T 4
3 O 2
So when there are multiple rows with the same Id, I just want to keep the last row.
The desired output would be
Id Name Account
1 Z 3
2 T 4
3 O 2
This is my current code:
for (i in 1:(nrow(mylist) - 1)) {
  if (mylist$Id[i] == mylist$Id[i + 1]) {
    mylist <- mylist[-i, ]
  }
}
The problem is that when a row is removed, all following rows shift to a lower index, so the loop skips rows in the next step.

You can do this easily with the dplyr package:
library(dplyr)
mylist %>%
  group_by(Id) %>%
  slice(n()) %>%
  ungroup()
First you group_by the Id column. Afterwards you select only the last entry (slice(n())) of each group.
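If you are on dplyr 1.0 or later, slice_tail() expresses the same idea a bit more directly; this is a minimal sketch assuming the same mylist data frame as above:
mylist %>%
  group_by(Id) %>%
  slice_tail(n = 1) %>%  # keep the last row of each Id group
  ungroup()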

One option in base R is:
mylist[cumsum(sapply(split(mylist, mylist$Id), nrow)), ]
Id Name Account
3 1 Z 3
5 2 T 4
6 3 O 2
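If "last row per Id" is literally what you want, a shorter base R idiom (not from the original answers, but standard) is duplicated() with fromLast = TRUE, which keeps the last occurrence of each Id regardless of how the groups are ordered:
# keep rows whose Id does not reappear later in the data frame
mylist[!duplicated(mylist$Id, fromLast = TRUE), ]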

Related

How to count frequency in one column based on unique values in another column in R?

I have a dataset that looks like this:
Product Patient_ID
1 A 1
2 A 1
3 A 1
4 A 3
5 D 3
6 D 4
7 D 5
8 E 5
9 E 6
10 F 7
11 G 8
Where I'd like to count the number of unique individuals who have used a product. In other words, I would like to get a frequency for the 'Product' column, based on unique 'Patient IDs'.
My desired dataset would look something like this:
Product Freq
1 A 2
2 D 3
3 E 2
4 F 1
5 G 1
How can I go about doing this?
Reproducible data:
test_data <- data.frame(
  Product = c("A", "A", "A", "A", "D", "D", "D", "E", "E", "F", "G"),
  Patient_ID = c("1", "1", "1", "3", "3", "4", "5", "5", "6", "7", "8"))
This should help.
You first load the tidyverse package.
Use distinct() to keep the distinct Product/Patient_ID combinations, then group by Product and use summarize() to count the rows in each group, which gives the unique-patient frequency:
library(tidyverse)
unique_count <- test_data %>%
  distinct(Product, Patient_ID) %>%
  group_by(Product) %>%
  summarize(Freq = n())
This should get you the result
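A slightly shorter dplyr variant (a sketch using the same test_data, not the answerer's exact code) skips the distinct() step and counts distinct patients per product directly with n_distinct():
library(dplyr)
test_data %>%
  group_by(Product) %>%
  summarize(Freq = n_distinct(Patient_ID))  # number of unique patients per product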
A base R solution:
You can first remove duplicate Product/Patient_ID combinations with !duplicated() on the two columns merged via paste0(), and then call table() on the Product column:
rez <- as.data.frame(table(test_data[!duplicated(paste0(test_data$Product, test_data$Patient_ID)), "Product"]))
colnames(rez)[1] <- "Product"
If you have only these two columns, you can skip the paste0() and do:
rez <- as.data.frame(table(test_data[!duplicated(test_data), "Product"]))
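Another base R route, offered as a sketch on the same test_data rather than part of the original answer, is aggregate(), which avoids the duplicated()/table() round trip; note the count column keeps the name Patient_ID, so rename it to Freq if you need that exact header:
aggregate(Patient_ID ~ Product, data = test_data, FUN = function(x) length(unique(x)))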

How to modify the variable names by combining current variable names and row 1 values? [duplicate]

This question already has answers here:
Concatenate column name and 1st row in R
(2 answers)
Pasting the first row to the column name within a list
(1 answer)
Closed 1 year ago.
How can I modify raw_dataframe to wished_dataframe?
raw_dataframe <- data.frame(
  category = c('a', '1', '2', '3', '4'),
  subcategory = c('b', '3', '2', '1', '0'),
  item = c('wd', '4', '5', '7', '0'))
wished_dataframe <- data.frame(
  category_a = c('1', '2', '3', '4'),
  subcategory_b = c('3', '2', '1', '0'),
  item_wd = c('4', '5', '7', '0'))
I actually have many CSV files with the same structure as 'raw_dataframe', and I want to combine the column names and the values of row 1 into the new variable names. Can anyone help?
# Paste colnames with values of row 1
colnames(raw_dataframe) <- paste0(colnames(raw_dataframe), "_", raw_dataframe[1, ])
# Remove row 1 and save in `wished_dataframe`
wished_dataframe <- raw_dataframe[-1, ]
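Since you mention having many CSV files with this structure, you can wrap the two steps in a small helper and lapply() over the file paths; this is only a sketch, and the folder name, file pattern, and read.csv() defaults are assumptions you may need to adjust:
combine_header <- function(path) {
  raw <- read.csv(path, stringsAsFactors = FALSE)
  # paste column names with the values of row 1, then drop row 1
  colnames(raw) <- paste0(colnames(raw), "_", raw[1, ])
  raw[-1, ]
}
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # hypothetical folder
all_dataframes <- lapply(files, combine_header)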
A dplyr way: We could use rename_with:
library(dplyr)
raw_dataframe %>%
  rename_with(~ paste0(., "_", raw_dataframe[1, ])) %>%
  slice(-1)
category_a subcategory_b item_wd
1 1 3 4
2 2 2 5
3 3 1 7
4 4 0 0
An option with janitor
library(janitor)
library(stringr)
library(dplyr)
row_to_names(raw_dataframe, 1) %>%
  rename_with(~ str_c(names(raw_dataframe), '_', .))
category_a subcategory_b item_wd
2 1 3 4
3 2 2 5
4 3 1 7
5 4 0 0

Reshaping dataframe to list values over unique id - back and forth [duplicate]

This question already has answers here:
Collapse text by group in data frame [duplicate]
(2 answers)
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 3 years ago.
I want to condense information in a dataframe to reduce the number of rows.
Consider the dataframe:
df <- data.frame(id=c("A","A","A","B","B","C","C","C"),b=c(4,5,6,1,2,7,8,9))
df
id b
1 A 4
2 A 5
3 A 6
4 B 1
5 B 2
6 C 7
7 C 8
8 C 9
I want to collapse the dataframe to all unique values of "id" and list the values in variable b. The result should look like
df.results <- data.frame(id=c("A","B","C"),b=c("4,5,6","1,2","7,8,9"))
df.results
id b
1 A 4,5,6
2 B 1,2
3 C 7,8,9
A solution for the first step is:
library(dplyr)
df.results <- df %>%
  group_by(id) %>%
  summarise(b = toString(b)) %>%
  ungroup()
How would you turn df.results back into df?
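One way to go back, offered as a sketch rather than an answer from the original thread, is tidyr::separate_rows(), which splits the comma-separated strings into one row per value; the as.numeric() step assumes b should end up numeric again as in the original df:
library(dplyr)
library(tidyr)
df.results %>%
  separate_rows(b, sep = ", ") %>%  # toString() separates values with ", "
  mutate(b = as.numeric(b))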

How to use R to return the max value of one column and the contents of the corresponding row in the following column [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I have a data frame like this but much longer:
A B
1 0
3 9
7 3
6 2
1 4
2 1
I want to get the maximum value of column A and the value in column B that corresponds with it, regardless of whether it is also the maximum value. So for this data set I would like to get 7 and 3. But if I use:
Max <- apply(df, 2, max)
I get 7 and 9.
Thanks for your help!
You want the row at which A has its maximum: df[which.max(df$A), ]
We can use dplyr
library(dplyr)
df %>%
  slice(which.max(A))
# A tibble: 1 x 2
# A B
# <int> <int>
#1 7 3
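If ties are possible and you want every row that attains the maximum rather than only the first one, a small base R addition (not part of the original answers) is:
df[df$A == max(df$A), ]  # all rows where A equals its maximum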

R - Delete rows based on duplicate and values in another column [duplicate]

This question already has answers here:
Add count of unique / distinct values by group to the original data
(3 answers)
Closed 6 years ago.
I have a data.frame in R that looks like the following:
> inputtable <- data.frame(TN = c("T","N","T","N","N","T","T","N"),
+ Value = c(1,1,2,2,2,3,3,5))
> inputtable
TN Value
1 T 1
2 N 1
3 T 2
4 N 2
5 N 2
6 T 3
7 T 3
8 N 5
I want to remove values that are duplicated in the Value column, but ONLY if one row has "T" and the other has "N" in the TN column.
I played around with duplicated, but this doesn't work the way I've coded it:
TNoverlaps.duprem <- TNoverlaps[ !(duplicated(TNoverlaps$Barcode) & ("T" %in% TNoverlaps$TN & "N" %in% TNoverlaps$TN)), ]
and
TNoverlaps.duprem <- TNoverlaps[ duplicated(TNoverlaps$Barcode) & !duplicated(TNoverlaps$Barcode, TNoverlaps$TN), ]
If there are more than two rows, as in rows 3-5 above, I want to remove all of those, because at least one is "T" and one is "N" in the TN column.
Here's the output I want
> outputtable
TN Value
6 T 3
7 T 3
8 N 5
I found plenty of questions about duplicated rows, and removing rows based on multiple columns. But I didn't see one that did something like this.
You could try:
library(dplyr)
inputtable %>% group_by(Value) %>% filter(!(n_distinct(TN) >= 2))
Source: local data frame [3 x 2]
Groups: Value [2]
TN Value
(fctr) (dbl)
1 T 3
2 T 3
3 N 5
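A base R alternative, offered as a sketch alongside the dplyr answer, counts the distinct TN values per Value with tapply() and keeps only the groups where that count is 1:
# number of distinct TN values for each Value
n_tn <- tapply(inputtable$TN, inputtable$Value, function(x) length(unique(x)))
inputtable[n_tn[as.character(inputtable$Value)] == 1, ]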
