My sample data set looks like this:
Country Population Capital Area
A 210000210 Sydney/Landon 10000000
B 420000000 Landon 42100000
C 500000 Italy42/Rome1 9200000
D 520000100 Dubai/Vienna21A 720000
How can I delete every row that contains a / in the Capital column? I have looked at the question "R: Delete rows based on different values following a certain pattern", but it did not help.
You can try grepl
df[!grepl('[/]', df$Capital),]
# Country Population Capital Area
#2 B 420000000 Landon 42100000
library(stringr)
library(tidyverse)
df2 <- df %>%
filter(!str_detect(Capital, "\\/"))
# Country Population Capital Area
# 1 B 420000000 Landon 42100000
Data
df <- structure(list(Country = c("A", "B", "C", "D"), Population = c(210000210L,420000000L, 500000L, 520000100L),
Capital = c("Sydney/Landon", "Landon", "Italy42/Rome1", "Dubai/Vienna21A"),
Area = c(10000000L, 42100000L, 9200000L, 720000L)), class = "data.frame", row.names = c(NA,-4L))
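Because / is not a regex metacharacter, a literal (non-regex) match should work as well. This is a minimal base R sketch, assuming the same df as in the Data block above (rebuilt here so the snippet runs on its own); the name no_slash is only illustrative.

```r
# Rebuild the example data from the question
df <- data.frame(
  Country    = c("A", "B", "C", "D"),
  Population = c(210000210L, 420000000L, 500000L, 520000100L),
  Capital    = c("Sydney/Landon", "Landon", "Italy42/Rome1", "Dubai/Vienna21A"),
  Area       = c(10000000L, 42100000L, 9200000L, 720000L),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "/" as a plain string instead of a regular expression
has_slash <- grepl("/", df$Capital, fixed = TRUE)

# Keep only the rows whose Capital has no "/"
no_slash <- df[!has_slash, ]
print(no_slash)
```

Only Country B should remain, as in the answers above.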
Related
I have to create a synthetic dataset with multiple variables and >50 observations. I have chosen to create synthetic data for an oil field that has 10 wells and five producing reservoirs. So my dataframe would have 3 variables: "Well ID", "Reservoir Name" and "Reservoir Quality".
So, I want to create a dataframe in which for each well, I would have 5 reservoirs, and for each reservoir, I would have 3 rock qualities - "Sand","Shale", and "Cement".
I tried for 2 variables in a crude way -
well1 <- data.frame(Wells = rep(1, 5), Reservoirs = c("A", "B", "C", "D","E"))
well2 <- data.frame(Wells = rep(2, 5), Reservoirs = c("A", "B", "C", "D","E"))
.
.
static_data <- rbind(well1,well2,...)
Now I am struggling to add the third variable. Is there a smarter way of doing this?
I am looking for something like this -
Well Reservoir Rock Quality
1 A Sand
1 A Shale
1 A Cement
1 B Sand
1 B Shale
1 B Cement
The package data.table has a cross-join function that gives what I think you need.
library(data.table)
CJ(a=c(1,2,3), b=c('a', 'b'), c=c('Y', 'Z'))
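For the wells data in the question, base R's expand.grid builds the same kind of cross join without an extra package. This is a sketch that swaps expand.grid in for CJ; the column names Well, Reservoir and RockQuality and the object name static_data are assumptions taken from the question's wording.

```r
wells      <- 1:10
reservoirs <- c("A", "B", "C", "D", "E")
qualities  <- c("Sand", "Shale", "Cement")

# expand.grid varies its first argument fastest, so list the
# innermost variable (rock quality) first and the well last
grid <- expand.grid(
  RockQuality = qualities,
  Reservoir   = reservoirs,
  Well        = wells,
  stringsAsFactors = FALSE
)

# Reorder the columns to match the layout asked for in the question
static_data <- grid[, c("Well", "Reservoir", "RockQuality")]
head(static_data)
```

This should give 10 x 5 x 3 = 150 rows, starting with well 1, reservoir A and its three rock qualities.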
Can't seem to wrap my head around a seemingly simple task: how to filter a dataframe based on a pattern in one column, which, however, is to match only if a pattern in another column matches:
Data:
df <- data.frame(
Speaker = c("A", NA, "B", "C", "A", "B", "A", "B", "C"),
Utterance = c("uh-huh",
"(0.666)",
"WOW!",
"#yeah#",
"=right=",
"oka::y¿",
"okay",
"some stuff",
"!more! £TAlk£"),
Orthographic = c("uh-huh", "NA", "wow", "yeah", "right", "okay", "okay", "some stuff", "more talk")
)
I want to remove rows in df where the pattern ^(yeah|okay|right|mhm|mm|uh(-| )?huh)$ matches in column Orthographic but not if these rows contain any character from character class [A-Z:↑↓£#¿?!] in column Utterance.
Expected outcome:
df
Speaker Utterance Orthographic
3 B WOW! wow
4 C #yeah# yeah
6 B oka::y¿ okay
8 B some stuff some stuff
9 C !more! £TAlk£ more talk
Attempts so far: (filters too much!)
library(dplyr)
df %>%
filter(!is.na(Speaker)) %>%
filter(!grepl("^(yeah|okay|right|mhm|mm|uh(-| )?huh)$", Orthographic)
& grepl("[A-Z:↑↓£#¿?!]", Utterance))
Speaker Utterance Orthographic
1 B WOW! wow
2 C !more! £TAlk£ more talk
I think you need | :
library(dplyr)
df %>%
filter(!is.na(Speaker)) %>%
filter(!grepl("^(yeah|okay|right|mhm|mm|uh(-| )?huh)$", Orthographic)
| grepl("[A-Z:↑↓£#¿?!]", Utterance))
# Speaker Utterance Orthographic
#1 B WOW! wow
#2 C #yeah# yeah
#3 B oka::y¿ okay
#4 B some stuff some stuff
#5 C !more! £TAlk£ more talk
This keeps the rows where Orthographic does not match ^(yeah|okay|right|mhm|mm|uh(-| )?huh)$ or where Utterance contains a character from [A-Z:↑↓£#¿?!].
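The switch from & to | follows from negating the removal rule: "remove if filler AND NOT marked" is the same as "keep if NOT filler OR marked". Below is a base R sketch of that equivalence on the question's data; the names is_filler, is_marked, drop, keep and result are only illustrative.

```r
df <- data.frame(
  Speaker = c("A", NA, "B", "C", "A", "B", "A", "B", "C"),
  Utterance = c("uh-huh", "(0.666)", "WOW!", "#yeah#", "=right=",
                "oka::y¿", "okay", "some stuff", "!more! £TAlk£"),
  Orthographic = c("uh-huh", "NA", "wow", "yeah", "right", "okay",
                   "okay", "some stuff", "more talk"),
  stringsAsFactors = FALSE
)

is_filler <- grepl("^(yeah|okay|right|mhm|mm|uh(-| )?huh)$", df$Orthographic)
is_marked <- grepl("[A-Z:↑↓£#¿?!]", df$Utterance)

drop <- is_filler & !is_marked   # the rows the question wants removed
keep <- !is_filler | is_marked   # the condition used in the answer

# By De Morgan's law the two formulations agree row by row
stopifnot(identical(keep, !drop))

result <- df[!is.na(df$Speaker) & keep, ]
print(result)
```

The original attempt used !is_filler & is_marked, which also drops rows such as "some stuff" that are neither fillers nor marked.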
I'm having some trouble when I try to merge two data frames. Here is an example:
Number <- c("1", "2", "3")
Letter <- factor(c("a", "b", "c"))
map <- data.frame(Number, Letter, row.names = c("Belgium", "Italy", "Senegal"))
This is my first data frame called "map", it looks like this:
Number Letter
Belgium 1 a
Italy 2 b
Senegal 3 c
And if I try to select by row and column I don't have any problem:
map["Belgium", "Number"]
[1] "1"
Here I have my second data frame called "calendar":
Month <- c("January", "February", "March")
calendar <- data.frame(Month, row.names = c("Belgium", "Italy", "Senegal"))
It looks like this:
Month
Belgium January
Italy February
Senegal March
The problem comes when I try to merge both data frames:
map.amp = merge(map, calendar, by = 0)
Row.names Number Letter Month
1 Belgium 1 a January
2 Italy 2 b February
3 Senegal 3 c March
Now, when I try to select a cell using rows and columns, the outcome is always NA
map.amp["Italy", "Month"]
[1] NA
map.amp["Belgium", "Number"]
[1] NA
How can I merge both data frames so I can keep using that kind of select function?
You have to re-set the row names:
row.names(map.amp) <- map.amp$Row.names
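The NA appears because merge(..., by = 0) moves the row names into a Row.names column and renumbers the rows, so "Italy" no longer names a row. Here is a self-contained base R sketch of the fix on the question's data; dropping the Row.names column afterwards is an optional extra step, not something the question requires.

```r
map <- data.frame(
  Number = c("1", "2", "3"),
  Letter = factor(c("a", "b", "c")),
  row.names = c("Belgium", "Italy", "Senegal"),
  stringsAsFactors = FALSE
)
calendar <- data.frame(
  Month = c("January", "February", "March"),
  row.names = c("Belgium", "Italy", "Senegal"),
  stringsAsFactors = FALSE
)

# by = 0 merges on row names; the result gets a Row.names column
map.amp <- merge(map, calendar, by = 0)

# Restore the row names from that column, then drop it (optional)
rownames(map.amp) <- as.character(map.amp$Row.names)
map.amp$Row.names <- NULL

map.amp["Italy", "Month"]
map.amp["Belgium", "Number"]
```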
If you want to keep using those row names you have to set the Row.names column back to row names. tibble::column_to_rownames is a nice option for this:
library(magrittr) # provides the %>% pipe
map.amp <- merge(map, calendar, by = 0) %>% tibble::column_to_rownames(var = "Row.names")
map.amp[map.amp$Row.names =='Italy', 'Month']
This works because, after merge(), the row names are stored in the Row.names column.
You could use the answer in the comment by @thelatemail. Or use
subset(map.amp, Row.names =='Italy')[[ 'Month']] # first get the matching rows, then narrow to the named column
or
subset(map.amp, Row.names =='Italy', 'Month') # third argument is for column selection
I have a data frame (df) like:
database minrna genesymbol
A mir-1 abc
A mir-2 bcc
B mir-1 abc
B mir-3 xyb
c mir-1 abc
I want to extract the miRNAs that are predicted by at least two databases. For example, in the df above, mir-1 is predicted by databases A, B and c, so the result I want would be:
database minrna genesymbol
A mir-1 abc
B mir-1 abc
c mir-1 abc
I have searched for similar questions but could not find anything like this. Could you please help me solve this? Thank you.
We can count the number of unique databases for each minrna and filter based on that.
This can be done in base R :
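# ave() keeps the type of its first argument, so pass integer codes to get numeric counts
subset(df, ave(as.integer(factor(database)), minrna, FUN = function(x) length(unique(x))) >= 2)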
# database minrna genesymbol
#1 A mir-1 abc
#3 B mir-1 abc
#5 c mir-1 abc
In dplyr :
library(dplyr)
df %>% group_by(minrna) %>% filter(n_distinct(database) >= 2)
Or with data.table :
library(data.table)
setDT(df)[, .SD[uniqueN(database) >=2], minrna]
data
df <- structure(list(database = c("A", "A", "B", "B", "c"), minrna = c("mir-1",
"mir-2", "mir-1", "mir-3", "mir-1"), genesymbol = c("abc", "bcc",
"abc", "xyb", "abc")), row.names = c(NA, -5L), class = "data.frame")
Use the group_by function from the {dplyr} package; I will let you figure out the details as an exercise.
https://dplyr.tidyverse.org/
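For comparison, the same per-group count can be sketched in base R with tapply. This assumes the df from the data block above (rebuilt here so the snippet runs on its own); the names n_db, wanted and result are only illustrative.

```r
df <- data.frame(
  database   = c("A", "A", "B", "B", "c"),
  minrna     = c("mir-1", "mir-2", "mir-1", "mir-3", "mir-1"),
  genesymbol = c("abc", "bcc", "abc", "xyb", "abc"),
  stringsAsFactors = FALSE
)

# Number of distinct databases that predict each miRNA
n_db <- tapply(df$database, df$minrna, function(x) length(unique(x)))

# Names of the miRNAs predicted by at least two databases
wanted <- names(n_db)[n_db >= 2]

# Keep every row that belongs to one of those miRNAs
result <- df[df$minrna %in% wanted, ]
print(result)
```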
I have a dataframe with 2 columns GL and GLDESC and want to add a 3rd column called KIND based on some data that is inside of column GLDESC.
DF:
GL GLDESC
1 515100 Payroll-ISL
2 515900 Payroll-ICA
3 532300 Bulk Gas
4 551000 Supply AB
5 551000 Supply XPTO
6 551100 Supply AB
7 551300 Intern
For each row of the data table:
If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll.
If GLDESC contains the word Supply anywhere in the string then I want KIND to be Supply.
In all other cases I want KIND to be Other.
Then, I found this:
DF$KIND <- ifelse(grepl("supply", DF$GLDESC, ignore.case = T), "Supply",
ifelse(grepl("payroll", DF$GLDESC, ignore.case = T), "Payroll", "Other"))
But with that, every row that matches Supply, for example, gets classified. However, as in DF lines 4 and 5, the same GL can have two Supply rows, which I do not need. What I actually need is for only one row per GL to be classified when the same kind of GLDESC is repeated for that GL.
Edit: I cannot delete any rows. I want to have this as output:
GL GLDESC KIND
A Supply1 Supply
A Supply2 N/A
A Supply3 N/A
A Supply4 N/A
A Supply5 N/A
A Supply6 N/A
A Payroll1 Payroll
B Supply2 Supply
B Payroll Payroll
If the repeated elements should be NA, use duplicated on 'GLDESC' to get a logical vector, then set those elements of 'KIND' (created with ifelse) to NA:
DF$KIND[duplicated(DF$GLDESC)] <- NA_character_
If we need to change the values by a grouping variable
library(dplyr)
DF %>%
group_by(GL) %>%
mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))
# A tibble: 9 x 3
# Groups: GL [2]
# GL GLDESC KIND
# <chr> <chr> <chr>
#1 A Supply1 Supply
#2 A Supply2 <NA>
#3 A Supply3 <NA>
#4 A Supply4 <NA>
#5 A Supply5 <NA>
#6 A Supply6 <NA>
#7 A Payroll1 Payroll
#8 B Supply2 Supply
#9 B Payroll Payroll
Or with the full changes
library(stringr) # for str_remove
DF1 %>%
mutate(KIND = str_remove(GLDESC, "\\d+"),
KIND = replace(KIND, !KIND %in% c("Supply", "Payroll"), "Other")) %>%
group_by(GL) %>%
mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))
data
DF1 <- structure(list(GL = c("A", "A", "A", "A", "A", "A", "A", "B",
"B"), GLDESC = c("Supply1", "Supply2", "Supply3", "Supply4",
"Supply5", "Supply6", "Payroll1", "Supply2", "Payroll")), row.names = c(NA,
-9L), class = "data.frame")
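The same steps can be sketched in base R without dplyr or stringr, assuming the DF1 data above and the question's rule that only the first Supply row per GL keeps its label. The helper names kind and repeated_supply are only illustrative.

```r
DF1 <- data.frame(
  GL     = c("A", "A", "A", "A", "A", "A", "A", "B", "B"),
  GLDESC = c("Supply1", "Supply2", "Supply3", "Supply4", "Supply5",
             "Supply6", "Payroll1", "Supply2", "Payroll"),
  stringsAsFactors = FALSE
)

# Classify each row from its description, as in the question's ifelse
kind <- ifelse(grepl("supply", DF1$GLDESC, ignore.case = TRUE), "Supply",
        ifelse(grepl("payroll", DF1$GLDESC, ignore.case = TRUE), "Payroll",
               "Other"))

# A row is a repeat if the same GL/kind pair already appeared further up
repeated_supply <- duplicated(paste(DF1$GL, kind, sep = "|")) & kind == "Supply"

# Blank out the repeats without deleting any rows
kind[repeated_supply] <- NA_character_
DF1$KIND <- kind
print(DF1)
```

Dropping the & kind == "Supply" condition would blank out repeated Payroll and Other rows in the same way.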