Erase lines on a dataframe based on a specific calculation - r

I have a dataframe like this:
mydata
and I would like to remove a specific number of rows based on the IDNumber and #Paymts columns.
Basically, I want to keep only the last 2 rows for each IDNumber. For example, for IDNumber = 230 I have 5 rows (as indicated in the #Paymts column); I want to drop the earlier rows and keep just the last two.
Any ideas?
Thanks in advance!

We can use slice:
library(dplyr)
df1 %>%
  group_by(IDNumber) %>%
  slice(tail(row_number(), 2))
# or
#   slice((n() - 1):n())
data
set.seed(24)
df1 <- data.frame(IDNumber = rep(LETTERS[1:3], each = 5), Date = Sys.Date(),
                  `#Paymts` = sample(1:9, 15, replace = TRUE), check.names = FALSE)

You can do it this way in base R, where df is your data frame and IDNumber is the grouping column:
trimmed_df = df[unlist(tapply(rownames(df), df$IDNumber, tail, 2)), ]
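For reference, here is a minimal sketch of the same idea using slice_tail(), which is available in dplyr 1.0.0 and later. It reuses the sample data from the first answer; the name last_two is only an illustrative choice.

```r
library(dplyr)

# Sample data from the first answer
set.seed(24)
df1 <- data.frame(IDNumber = rep(LETTERS[1:3], each = 5), Date = Sys.Date(),
                  `#Paymts` = sample(1:9, 15, replace = TRUE), check.names = FALSE)

# Keep the last two rows of each IDNumber group
last_two <- df1 %>%
  group_by(IDNumber) %>%
  slice_tail(n = 2) %>%
  ungroup()

last_two
```

This assumes the rows are already in the order you care about; if not, arrange() by the relevant column before slicing.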

Related

How to Count the frequency of specific words for each column using R?

I am using this dataset https://archive.ics.uci.edu/ml/datasets/Eco-hotel
I am trying to figure out how to count the frequency of certain words like "room" or "vacation" within each column. I have attempted following tutorials online, but unfortunately, I have had no luck.
Using the iris dataset as an example, what you can do is:
library(tidyverse)
iris %>%
  summarize(across(everything(), ~ sum(str_detect(., 'setosa'))))
Of course, you'd need to change the search term to what you need. Note that str_detect counts the rows that contain the term, while str_count counts every occurrence.
If you want dedicated columns for each of your search patterns, you could alternatively do something like:
df <- data.frame(x = sample(letters, 10, replace = TRUE),
                 y = sample(letters, 10, replace = TRUE))
df |>
  summarize(across(c(x, y), ~ sum(str_count(., "u")), .names = "{.col}_u"),
            across(c(x, y), ~ sum(str_count(., "g")), .names = "{.col}_g"))
Here I'm searching for the letters "u" and "g", respectively.
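To count several whole words across every text column at once, something along these lines might work. This is only a sketch: the reviews data frame and its title and body columns are made up for illustration and are not the actual columns of the Eco-hotel dataset.

```r
library(stringr)

# Hypothetical text data standing in for the real dataset
reviews <- data.frame(
  title = c("Great room", "Nice vacation spot", "Room was small"),
  body  = c("The room had a view. Best room ever.",
            "Our vacation was relaxing.",
            "No complaints."),
  stringsAsFactors = FALSE
)
words <- c("room", "vacation")

# For each word, count its occurrences in each column (case-insensitive)
counts <- sapply(words, function(w) {
  sapply(reviews, function(col) sum(str_count(tolower(col), fixed(w))))
})

counts
```

The result is a matrix with one row per column of the data frame and one column per search word. Because fixed() matches plain substrings, "room" would also match inside "bedroom"; switch to a regex with word boundaries if that matters.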

Filter rows in dataset for distinct words in r

Goal: To filter rows in a dataset so that only distinct words remain. At the moment, I have used inner_join to retain the rows that appear in both datasets, which has left duplicate rows in this dataset.
Attempt 1: I have tried to use distinct to retain only the unique rows, but this has not worked. I may be using it incorrectly.
This is my code so far; output attached in png format:
# join warriner emotion lemmas by `word` column in collocations data frame to see how many word matches there are
warriner2 <- dplyr::inner_join(warriner, coll, by = "word") # join data; retain only rows in both sets (works both ways)
warriner2 <- distinct(warriner2)
warriner2
coll2 <- dplyr::semi_join(coll, warriner, by = "word") # join all rows in a that have a match in b
# There are 8166 lemma matches (including double-ups)
# There are XXX unique lemma matches
You can try:
library(dplyr)
warriner2 <- inner_join(warriner, coll, by = "word") %>%
  distinct(word, .keep_all = TRUE)
To further clarify Ronak's answer, here is an example with some mock data. Note that you can just use distinct() at the end of the pipe to keep distinct rows if that's what you want. Your error might very well have occurred because you performed two operations and assigned the result to the same name both times (warriner2).
library(dplyr)
# Here's a couple of sample tibbles
name <- c("cat", "dog", "parakeet")
df1 <- tibble(
  x = sample(5, 99, replace = TRUE),
  y = sample(5, 99, replace = TRUE),
  name = rep(name, times = 33))
df2 <- tibble(
  x = sample(5, 99, replace = TRUE),
  y = sample(5, 99, replace = TRUE),
  name = rep(name, times = 33))
# It's much less confusing if you do this in one pipe
p <- df1 %>%
  inner_join(df2, by = "name") %>%
  distinct()
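The difference between distinct() and distinct(word, .keep_all = TRUE) is easy to miss, so here is a small sketch. The warriner_demo and coll_demo tibbles are invented stand-ins, not the asker's data.

```r
library(dplyr)

# Invented stand-ins for the two data sets
warriner_demo <- tibble(word = c("happy", "sad"), valence = c(8.5, 2.1))
coll_demo <- tibble(word = c("happy", "happy", "sad"),
                    collocate = c("birthday", "ending", "news"))

joined <- inner_join(warriner_demo, coll_demo, by = "word")

# distinct() compares whole rows, so rows that differ in any column are all kept
all_cols <- distinct(joined)

# distinct(word, .keep_all = TRUE) keeps the first row for each word
by_word <- distinct(joined, word, .keep_all = TRUE)

c(joined = nrow(joined), all_cols = nrow(all_cols), by_word = nrow(by_word))
```

If the join brings in columns that vary within a word, plain distinct() will not reduce the row count, which may be why the first attempt appeared not to work.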

Making Calculations on Several Textfiles and making a Dataframe from it R

I am trying to create a table from calculations that I am doing on several text files. I think this might require a loop of some sort, but I am stuck on how to proceed. I have tried different loops but none seem to work. I have managed to do what I want with one file. Here is my working code:
flare <- read.table("C:/temp/HD3_Bld_CD8_TEM.txt",
header=T)
head(flare[,c(1,2)])
#sum of the freq column, check to see if close to 1
sum(flare$freq)
#Sum of top 10
ten <- sum(flare$freq[1:10])
#Sum of 11-100
to100 <- sum(flare$freq[11:100])
#Sum of 101-1000
to1000 <- sum(flare$freq[101:1000])
#sum of 1001+
rest <- sum(flare$freq[-c(1:1000)])
#place the values of the sum in a table
df <- data.frame(matrix(ncol = 1, nrow = 4))
x <- c("Sum")
colnames(df) <- x
y <- c("10", "11-100", "101-1000", "1000+")
row.names(df) <- y
df[,1] <- c(ten,to100,to1000,rest)
The dataframe ends up looking like this:
> View(df)
               Sum
10       0.1745092
11-100   0.2926735
101-1000 0.4211533
1000+    0.1116640
This is perfect for making a stacked barplot, which I did. However, this is only for one text file. I have several files of the same kind. All of them have the same column names, so I know that all of them will use the freq column for the calculations. How do I build a table after doing the calculations on each file? I want to keep the names of the text files as the sample names, so that when I make a joint stacked barplot all the names will be there. Also, what is the best way to orient the data when writing the new table/dataframe?
I am still new to R, so any help, any explanation would be most welcome. Thank you.
How about something like this? Your example is not reproducible, so I made a dummy example which you can adjust:
library(tidyverse)
###load ALL your dataframes
test_df_1 <- data.frame(var1 = matrix(c(1,2,3,4,5,6), nrow = 6, ncol = 1))
test_df_1
test_df_2 <- data.frame(var2 = matrix(c(7,8,9,10,11,12), nrow = 6, ncol = 1))
test_df_2
### Bind them into one big wide dataframe
df <- cbind(test_df_1, test_df_2)
### Add an id column which repeats (in your case adjust this to repeat for the grouping you want,
### i.e. replace the each = 2 with each = 10, and each = 4 with each = 100)
df <- df %>%
  mutate(id = paste0("id_", c(rep(1, each = 2), rep(2, each = 4))))
### Gather your dataframes into long format by the id
df_gathered <- df %>%
  gather(value = value, key = key, -id)
df_gathered
### use group_by to group data by id and summarise to get the sum of each group
df_gathered_sum <- df_gathered %>%
  group_by(id, key) %>%
  summarise(sigma = sum(value))
df_gathered_sum
You might have some issues with the id column if your data frames are not of equal length, so this is only a partial answer. I could do better with a shortened example of your dataset. Can anyone else weigh in on creating an id column? I may have sorted it with a couple of edits.
I think I solved it! It gives me the dataframe I want, and from it, I can make the stacked barplot to display the data.
sumfunction <- function(x) {
  wow <- read.table(x, header = TRUE)
  # Sum of top 10
  ten <- sum(wow$freq[1:10])
  # Sum of 11-100
  to100 <- sum(wow$freq[11:100])
  # Sum of 101-1000
  to1000 <- sum(wow$freq[101:1000])
  # Sum of 1001+
  rest <- sum(wow$freq[-c(1:1000)])
  # Return the four sums as a vector
  c(ten, to100, to1000, rest)
}
library(data.table)
library(tools)
dir <- "C:/temp/"
# full.names = TRUE so read.table() gets the full path even if the working directory is not dir;
# pattern is a regular expression, so match a literal ".txt" at the end of the name
filenames <- list.files(path = dir, pattern = "\\.txt$", full.names = TRUE)
alltogether <- lapply(filenames, sumfunction)
data <- as.data.frame(data.table::transpose(alltogether),
                      col.names = c("Top 10", "From 11 to 100", "From 101 to 1000", "From 1001 on"),
                      row.names = file_path_sans_ext(basename(filenames)))
This gives me the dataframe that I want. Instead of putting "top 10, 11-100, 101-1000, 1000+" as the row names, I changed them to column names and made the name of each text file a row name. The file_path_sans_ext(basename(filenames)) call keeps just the file name and removes the extension.
I hope this helps anyone that reads this! thank you again! I love this platform because just being part of this environment gets me thinking and always striving to better myself at R.
If anyone has any input, that would be great!!! <3
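One possible variation, sketched here under assumptions: the band boundaries are expressed with cut() so that files shorter than 1000 rows do not produce NA sums, and the result is a matrix with one row per file. The helper name band_sums, the temporary directory, and the randomly generated sample files are all made up so the example is self-contained.

```r
# Hypothetical helper: sum the freq column within rank bands for one file
band_sums <- function(path) {
  freq <- read.table(path, header = TRUE)$freq
  band <- cut(seq_along(freq), breaks = c(0, 10, 100, 1000, Inf),
              labels = c("Top 10", "11-100", "101-1000", "1001+"))
  # split() keeps empty bands, and sum() of an empty vector is 0
  vapply(split(freq, band), sum, numeric(1))
}

# Write two made-up sample files to a temporary directory
demo_dir <- file.path(tempdir(), "flare_demo")
dir.create(demo_dir, showWarnings = FALSE)
set.seed(1)
for (sample_name in c("sampleA", "sampleB")) {
  freq <- sort(runif(1500), decreasing = TRUE)
  write.table(data.frame(freq = freq / sum(freq)),
              file.path(demo_dir, paste0(sample_name, ".txt")),
              row.names = FALSE)
}

# One row per file, one column per band, file names (without extension) as row names
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
result <- t(vapply(files, band_sums, numeric(4)))
rownames(result) <- tools::file_path_sans_ext(basename(files))

result
```

A matrix laid out this way can be passed to barplot(t(result)) to get one stacked bar per sample.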

r aggregate and collapse several cells into one

I have a data frame:
x <- data.frame(id = 1:18,
super = c(rep("A", 12), rep("B", 6)),
category = c(rep("one", 6), rep("two", 6), rep("three", 6)),
root = sort(rep(letters[1:6], 3)),
coldefs = letters[1:18], stringsAsFactors = F)
x
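I am creating a new column by concatenating 3 columns:
myvars <- c("super", "category", "root")
library(tidyverse)
# all_of() tells tidyselect that myvars is an external character vector of column names
x <- x %>% unite(col = concat, all_of(myvars), sep = "_", remove = FALSE)
x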
Now, for each unique value of column 'concat' the values of column 'super' are the same, the values of column 'category' are the same, and the values of column "root" are the same. However, for each unique value of column 'concat' the values of column 'id' are different. The same is true for column 'coldefs'.
I would like to collapse (aggregate) x so that it has only as many rows as there are unique values in column 'concat' (i.e., 6 rows). In each row, I want one value from column 'super', one value from column 'category', one value from column 'root'; and then 3 values of column 'id' (concatenated like this: 1;2;3) and 3 values of column 'coldefs' (concatenated like this: a;b;c).
What's the best way of doing it?
I am trying the following, but it's not working:
x %>% group_by(concat) %>% summarize(id = paste(id, collapse = ";"),
super = unique(super), category = unique(category), root = unique(root),
coldefs = paste(coldefs, collapse = ";"))
I am clearly doing something wrong.
Thanks a lot for your help!
I must say this is a bit (or completely) crazy! I tried my code (the one at the bottom) piece by piece and it worked. I merged it all together, and it worked. I don't understand why I was getting an error before. Here is the code that works (at least now):
x %>%
  group_by(concat) %>%
  summarize(id = paste(id, collapse = ";"), super = unique(super),
            category = unique(category), root = unique(root),
            coldefs = paste(coldefs, collapse = ";"))
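As an alternative sketch, the columns that are constant within each group can go straight into group_by(), which avoids the unique() calls and the intermediate concat column. The name collapsed is just illustrative.

```r
library(dplyr)

x <- data.frame(id = 1:18,
                super = c(rep("A", 12), rep("B", 6)),
                category = c(rep("one", 6), rep("two", 6), rep("three", 6)),
                root = sort(rep(letters[1:6], 3)),
                coldefs = letters[1:18], stringsAsFactors = FALSE)

# Group by the columns that identify each combination, then collapse the rest
collapsed <- x %>%
  group_by(super, category, root) %>%
  summarise(id = paste(id, collapse = ";"),
            coldefs = paste(coldefs, collapse = ";"),
            .groups = "drop")

collapsed
```

This relies on super, category, and root being constant within each combination, as described in the question.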

Matching across two data frames with certain observations having multiple entries to match against

I am working with two data frames corresponding to the sample below:
# Data sets
set.seed(1)
dta_a <- data.frame(some_value = runif(n = 10),
identifier=c("A0001","A0002","A0003","A0004","A0005",
"A0006","B0001","B0002","B0003","B0004"),
other_val = runif(n = 10))
dta_b <- data.frame(variable_abc = runif(n = 6),
identifier=c("A0001","A0002","A0003,A0004,A0005,C0001",
"B0001,B0002","B0003","B0004"),
variable_df = runif(n = 6))
I would like to merge those two data frames and obtain a data frame similar to the one presented below:
The resulting data frame would have the following qualities:
For the observations where only one identifier is present the merge command performs with all.y = TRUE and all.x = FALSE assuming that y is dta_b.
For the observations where multiple identifiers are provided only the first matched value from the dta_a is taken with the remaining values ignored. If there is no match on the first identifier (A0003) I would like for the command to attempt to match the next one (A0004).
I made a reference to the merge command but, naturally, dplyr and other solutions are fine.
You can 'melt' dta_b so that it has one row per identifier, with a preference order, and then join on all the identifiers:
library(dplyr)
library(tidyr)
melt_dta_b = lapply(1:nrow(dta_b), function(i){
  split_identifier = strsplit(as.character(dta_b$identifier[i]), split = ",", fixed = TRUE)[[1]]
  # tibble() replaces the deprecated data_frame()
  tibble(identifier = split_identifier,
         original_identifier = dta_b$identifier[i], original_row = i,
         preference = seq_along(split_identifier),
         variable_abc = dta_b$variable_abc[i], variable_df = dta_b$variable_df[i])
})
# bind_rows() replaces rbind_all(), which is no longer part of dplyr
melt_dta_b = bind_rows(melt_dta_b)
At that point you can keep, for each original row, only the match with the best preference (the lowest preference number, i.e. the identifier listed first):
joined_df = left_join(melt_dta_b, dta_a, by = "identifier") %>%
  filter(!is.na(some_value)) %>%
  group_by(original_row) %>%
  filter(preference == min(preference)) %>%
  ungroup()
UPDATE
In order not to refer to the variables explicitly by name, you can use the following code, which binds all the 'unused' columns of the original df:
melt_dta_b = lapply(1:nrow(dta_b), function(i){
  tmp = dta_b[i,]
  split_identifier = strsplit(as.character(tmp$identifier), split = ",", fixed = TRUE)[[1]]
  colnames(tmp)[2] = "original_identifier"
  tibble(identifier = split_identifier, original_row = i, preference = seq_along(split_identifier)) %>%
    cbind(tmp)
})
melt_dta_b = bind_rows(melt_dta_b)
This is just one way of doing it, and probably not the best one.
Split the identifiers and merge on the first one.
dta_a$identifier = as.vector(dta_a$identifier)
dta_a1 = data.frame(dta_a, identifier_split = do.call(rbind, strsplit(dta_a$identifier, split = ",", fixed = TRUE)))
dta_b$identifier = as.vector(dta_b$identifier)
# rbind() recycles shorter splits to fill each row
dta_b1 = data.frame(dta_b, identifier_split = do.call(rbind, strsplit(dta_b$identifier, split = ",", fixed = TRUE)))
# dta_a has one identifier per row, so its split column is named identifier_split, not identifier_split.1
dta_join = merge(dta_a1, dta_b1, by.x = "identifier_split", by.y = "identifier_split.1", all.x = FALSE, all.y = TRUE)
In cases where you don't have a match for the first identifier, you'll see NAs; you can subset those rows and merge them on the second identifier ("identifier_split.2").
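The 'melt' approach from the first answer can also be sketched with tidyr::separate_rows(), which splits the comma-separated identifiers into one row each. This is an illustrative variant rather than either answerer's code: the names matched, original_row, original_identifier, and preference are assumptions, and unmatched rows of dta_b are kept with NAs to mimic all.y = TRUE.

```r
library(dplyr)
library(tidyr)

set.seed(1)
dta_a <- data.frame(some_value = runif(n = 10),
                    identifier = c("A0001","A0002","A0003","A0004","A0005",
                                   "A0006","B0001","B0002","B0003","B0004"),
                    other_val = runif(n = 10), stringsAsFactors = FALSE)
dta_b <- data.frame(variable_abc = runif(n = 6),
                    identifier = c("A0001","A0002","A0003,A0004,A0005,C0001",
                                   "B0001,B0002","B0003","B0004"),
                    variable_df = runif(n = 6), stringsAsFactors = FALSE)

matched <- dta_b %>%
  mutate(original_row = row_number(), original_identifier = identifier) %>%
  separate_rows(identifier, sep = ",") %>%   # one row per candidate identifier
  group_by(original_row) %>%
  mutate(preference = row_number()) %>%      # 1 = listed first
  ungroup() %>%
  left_join(dta_a, by = "identifier") %>%
  group_by(original_row) %>%
  # matched candidates first, then by preference; keep the best one per row
  arrange(is.na(some_value), preference, .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()

matched
```

Each row of dta_b appears exactly once in matched, paired with the first of its identifiers that exists in dta_a.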
