How to split data frame with multiple delimiter using str_split_fixed? - r

How can I split a column that uses multiple delimiters into separate columns of a data frame?
df1 <- read.table(text = " Chr Nm1 Nm2 Nm3
chr10_100064111-100064134+Nfif 20 20 20
chr10_100064115-100064138-Kitl 30 19 40
chr10_100076865-100076888+Tert 60 440 18
chr10_100079974-100079997-Itg 50 11 23
chr10_100466221-100466244+Tmtc3 55 24 53", header = TRUE)
Desired output:
Chr gene Nm1 Nm2 Nm3
chr10_100064111-100064134 Nfif 20 20 20
chr10_100064115-100064138 Kitl 30 19 40
chr10_100076865-100076888 Tert 60 440 18
chr10_100079974-100079997 Itg 50 11 23
chr10_100466221-100466244 Tmtc3 55 24 53
I used
library(stringr)
df2 <- str_split_fixed(df1$Chr, "\\+", 2)
How can I include both + and - as delimiters?

If you're trying to split one column into multiple, tidyr::separate is handy:
library(tidyr)
dat %>% separate(Chr, into = paste0('Chr', 1:3), sep = '[+-]')
# Chr1 Chr2 Chr3 Nm1 Nm2 Nm3
# 1 chr10_100064111 100064134 Nfif 20 20 20
# 2 chr10_100064115 100064138 Kitl 30 19 40
# 3 chr10_100076865 100076888 Tert 60 440 18
# 4 chr10_100079974 100079997 Itg 50 11 23
# 5 chr10_100466221 100466244 Tmtc3 55 24 53

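This should work. Note n = 3: each string contains two delimiters, so n = 2 would stop at the first - and leave the gene attached to the second coordinate:
str_split_fixed(df1$Chr, "[-+]", 3)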

Here is a way to do this in base R with strsplit:
# split Chr into a list
tempList <- strsplit(as.character(df$Chr), split="[+-]")
# replace Chr with desired values
df$Chr <- sapply(tempList, function(i) paste(i[[1]], i[[2]], sep="-"))
# get Gene variable
df$gene <- sapply(tempList, "[[", 3)
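If the goal is the two-column layout from the question (coordinates kept together, gene split off), one option is to cut only at the last + or -. A minimal base R sketch, assuming gene names themselves never contain + or -; the vector x and the names chr and gene are made up for illustration:

```r
# Two sample values taken from the question's Chr column
x <- c("chr10_100064111-100064134+Nfif",
       "chr10_100064115-100064138-Kitl")

# Drop the last delimiter and everything after it
chr <- sub("[+-][^+-]*$", "", x)

# Greedy match up to the last delimiter; what remains is the gene
gene <- sub("^.*[+-]", "", x)

data.frame(Chr = chr, gene = gene, stringsAsFactors = FALSE)
```

Both patterns rely on the gene being the final +/- separated field; if a gene symbol can contain a hyphen, this approach would need a stricter pattern.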

Related

Using a function and mapply in R to create new columns that sums other columns

Suppose, I have a dataframe, df, and I want to create a new column called "c" based on the addition of two existing columns, "a" and "b". I would simply run the following code:
df$c <- df$a + df$b
But I also want to do this for many other columns. So why won't my code below work?
# Reproducible data:
martial_arts <- data.frame(gym_branch=c("downtown_a", "downtown_b", "uptown", "island"),
day_boxing=c(5,30,25,10),day_muaythai=c(34,18,20,30),
day_bjj=c(0,0,0,0),day_judo=c(10,0,5,0),
evening_boxing=c(50,45,32,40), evening_muaythai=c(50,50,45,50),
evening_bjj=c(60,60,55,40), evening_judo=c(25,15,30,0))
# Creating a list of the new column names of the columns that need to be added to the martial_arts dataframe:
pattern<-c("_boxing","_muaythai","_bjj","_judo")
d<- expand.grid(paste0("martial_arts$total",pattern))
# Creating lists of the columns that will be added to each other (select() and %>% need dplyr):
library(dplyr)
e<- names(martial_arts %>% select(day_boxing:day_judo))
f<- names(martial_arts %>% select(evening_boxing:evening_judo))
# Writing a function and using mapply:
kick_him <- function(d,e,f){d <- rowSums(martial_arts[ , c(e, f)], na.rm=T)}
mapply(kick_him,d,e,f)
Now, mapply produces the correct results in terms of the addition:
> mapply(kick_him,d,e,f)
Var1 <NA> <NA> <NA>
[1,] 55 84 60 35
[2,] 75 68 60 15
[3,] 57 65 55 35
[4,] 50 80 40 0
But it doesn't add the new columns to the martial_arts dataframe. The function in theory should do the following
martial_arts$total_boxing <- martial_arts$day_boxing + martial_arts$evening_boxing
...
...
martial_arts$total_judo <- martial_arts$day_judo + martial_arts$evening_judo
and add four new total columns to martial_arts.
So what am I doing wrong?
The assignment is the problem. Inside the function, d is only a local variable, so assigning to it never touches martial_arts, and a string such as "martial_arts$total_boxing" is not a column reference. The new column names ("total_boxing", ...) belong on the left-hand side of the Map/mapply call instead. Because the OP already put the 'martial_arts$' prefix into d, we strip that prefix with sub and then do the assignment:
kick_him <- function(e,f){rowSums(martial_arts[ , c(e, f)], na.rm=TRUE)}
martial_arts[sub(".*\\$", "", d$Var1)] <- Map(kick_him, e, f)
Check the dataset now:
> martial_arts
gym_branch day_boxing day_muaythai day_bjj day_judo evening_boxing evening_muaythai evening_bjj evening_judo total_boxing total_muaythai total_bjj total_judo
1 downtown_a 5 34 0 10 50 50 60 25 55 84 60 35
2 downtown_b 30 18 0 0 45 50 60 15 75 68 60 15
3 uptown 25 20 0 5 32 45 55 30 57 65 55 35
4 island 10 30 0 0 40 50 40 0 50 80 40 0
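The same idea can be written without the intermediate d object by building all three sets of column names from one vector of suffixes. This is only a sketch on a cut-down version of the data (two sports, two gyms); the names sports, day_cols and evening_cols are introduced here for illustration:

```r
# Cut-down version of the question's data frame
martial_arts <- data.frame(
  day_boxing = c(5, 30), day_judo = c(10, 0),
  evening_boxing = c(50, 45), evening_judo = c(25, 15)
)

# Build the source and target column names from the shared suffixes
sports <- c("boxing", "judo")
day_cols <- paste0("day_", sports)
evening_cols <- paste0("evening_", sports)

# Map returns one vector of row sums per pair; assigning the list to
# new column names on the left-hand side adds the columns
martial_arts[paste0("total_", sports)] <- Map(
  function(d, e) rowSums(martial_arts[, c(d, e)], na.rm = TRUE),
  day_cols, evening_cols
)

martial_arts
```

Deriving the names from a single vector keeps the day, evening and total columns aligned by construction, instead of relying on column order in the data frame.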

Apply loop for rollapply windows

I currently have a dataset with 50,000+ rows for which I need to find rolling sums. I have done this using rollapply, which has worked perfectly. I need to apply these rolling sums across a range of widths (600, 1200, 1800...6000), which I have done by copying and pasting each line of script and changing the width. While it works, I'd like to tidy my script by applying a loop, or similar, so that once rollapply has completed its first 'pass' at width 600, it then does the same with 1200 and so on. Example:
Var1 Var2 Var3
1 11 19
43 12 1
4 13 47
21 14 29
41 15 42
16 16 5
17 17 16
10 18 15
20 19 41
44 20 27
width_2 <- rollapply(x$Var1, FUN = sum, width = 2)
width_3 <- rollapply(x$Var1, FUN = sum, width = 3)
width_4 <- rollapply(x$Var1, FUN = sum, width = 4)
Is there a simpler way to run widths 2, 3, then 4 than copy and paste, particularly when I have up to 10 widths and then need to run this across other columns? Any help would be appreciated.
We can use lapply in base R
lst1 <- lapply(2:4, function(i) rollapply(x$Var1, FUN = sum, width = i))
names(lst1) <- paste0('width_', 2:4)
list2env(lst1, .GlobalEnv)
NOTE: It is not recommended to create multiple objects in the global environment. Keeping the results in the list is the better option.
Or with a for loop
for(v in 2:4) {
assign(paste0('width_', v), rollapply(x$Var1, FUN = sum, width = v))
}
Create a function to do this for multiple columns
f1 <- function(col1, i) {
rollapply(col1, FUN = sum, width = i)
}
lapply(x[c('Var1', 'Var2')], function(x) lapply(2:4, function(i)
f1(x, i)))
Instead of creating separate vectors in the global environment, you can probably add these as new columns to the existing data frame.
Note that rollapply(..., FUN = sum) is the same as rollsum.
library(dplyr)
library(zoo)
bind_cols(x, purrr::map_dfc(2:4,
~x %>% transmute(!!paste0('Var1_roll_', .x) := rollsumr(Var1, .x, fill = NA))))
# Var1 Var2 Var3 Var1_roll_2 Var1_roll_3 Var1_roll_4
#1 1 11 19 NA NA NA
#2 43 12 1 44 NA NA
#3 4 13 47 47 48 NA
#4 21 14 29 25 68 69
#5 41 15 42 62 66 109
#6 16 16 5 57 78 82
#7 17 17 16 33 74 95
#8 10 18 15 27 43 84
#9 20 19 41 30 47 63
#10 44 20 27 64 74 91
You can use seq to generate the variable window size.
seq(600, 6000, 600)
#[1] 600 1200 1800 2400 3000 3600 4200 4800 5400 6000
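For plain sums, a dependency-free alternative is to derive each window total from a cumulative sum, which avoids re-adding the window contents at every position. The helper roll_sum below is hypothetical (it is not part of zoo) and is only a sketch; it returns one value per complete window, like rollapply without fill:

```r
# Hypothetical helper: rolling sum of v over windows of the given width,
# computed as differences of the cumulative sum
roll_sum <- function(v, width) {
  n <- length(v)
  if (width > n) return(numeric(0))
  cs <- c(0, cumsum(v))
  cs[(width + 1):(n + 1)] - cs[1:(n - width + 1)]
}

# Var1 from the question's example
x <- data.frame(Var1 = c(1, 43, 4, 21, 41, 16, 17, 10, 20, 44))

# One named list entry per width; swap in seq(600, 6000, 600) for real data
widths <- 2:4
sums <- lapply(widths, function(w) roll_sum(x$Var1, w))
names(sums) <- paste0("width_", widths)

sums$width_4
```

On the example data, sums$width_4 holds the same seven values as the non-NA entries of Var1_roll_4 above.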

Difficulty converting wide format to tidy format in dataset

I am using Kaggle's gun violence dataset. My goal is to use Tableau for an interactive visualization of some of the regions and the specifics of gun crimes there. To do that, I want to turn this data frame into tidy format. Link:
https://www.kaggle.com/jameslko/gun-violence-data/version/1
A couple of columns are formatted in a way that I am having trouble wrangling in R. There are around 20 columns, and these 4 are formatted like this:
A little background: there can be more than one gun involved in a crime, and more than one participant. Due to this, these columns contain information for each gun/participant split by '||'. The 0:, 1: ... indicates details for that specific gun/participant.
My goal is to capture the unique instances in each column and disregard the 0:, 1:, 2:, ...
Here is my code so far:
df= read.csv("C:/Users/rmahesh/Desktop/gun-violence-data_01-2013_03-2018.csv")
df$incident_id = NULL
df$incident_url = NULL
df$source_url = NULL
df$participant_name = NULL
df$participant_relationship = NULL
df$sources = NULL
df$incident_url_fields_missing = NULL
df$participant_status = NULL
df$participant_age_group = NULL
df$participant_type = NULL
df$incident_characteristics = NULL
#Subset of columns with formatting issues:
df2 = df[, c('gun_stolen', 'gun_type', 'participant_age', 'participant_gender')]
I have yet to run into an issue like this, and would love any help figuring out how to solve my problem. Any help would be greatly appreciated!
Edit1: I have created the first 3 rows of the columns in question. The format is identical more or less with some columns missing at times:
gun_stolen,gun_type,participant_age,participant_gender
0::Unknown||1::Unknown, 0::Unknown||1::Unknown, 0::25||1::31||2::33||3::34||4::33, 0::Male||1::Male||2::Male||3::Male||4::Male
0::Unknown||1::Unknown,0::22 LR||1::223 Rem [AR-15],0::51||1::40||2::9||3::5||4::2||5::15,0::Male||1::Female||2::Male||3::Female||4::Female||5::Male
0::Unknown,0::Shotgun,3::78||4::48,0::Male||1::Male||2::Male||3::Male||4::Male
As Frank said in the comments, "tidy" can mean different things. Here we turn all specified columns into just two: one with the original column name ("key"), the other with the individual values after splitting the strings and removing the prefixes, one row for each ("value").
library(tidyr)
library(dplyr)
library(stringr)
myvars <- c('gun_stolen', 'gun_type', 'participant_age', 'participant_gender')
res <- as_tibble(df2) %>%
tibble::rowid_to_column() %>%
# Split strings in selected columns at "||". This turns those columns into
# list-columns of character vectors
mutate_at(myvars, str_split, pattern = fixed("||")) %>%
# Go from wide to long format: in the new 'key' column are the original column
# names, and 'value' is the one list-column of character vectors
gather(key, value, one_of(myvars)) %>%
# unnest turns the 'value' list-column into a regular character column, with
# duplication of rows that contain a 'value' of length greater than 1
unnest(value) %>%
filter(value != "") %>%
# Remove the "x::" prefixes
mutate(value = str_split_fixed(value, fixed("::"), n = 2)[, 2]) %>%
# Deduplicate
distinct() %>%
arrange(rowid, key, value)
# # A tibble: 732,017 x 3
# rowid key value
# <int> <chr> <chr>
# 1 1 participant_age 20
# 2 1 participant_gender Female
# 3 1 participant_gender Male
# 4 2 participant_age 20
# 5 2 participant_gender Male
# 6 3 gun_stolen Unknown
# 7 3 gun_type Unknown
# 8 3 participant_age 25
# 9 3 participant_age 31
# 10 3 participant_age 33
# # ... with 732,007 more rows
Also expanding on @Ben G's comment:
res %>%
count(key, value) %>%
arrange(key, desc(n))
# # A tibble: 141 x 3
# key value n
# <chr> <chr> <int>
# 1 gun_stolen Unknown 132099
# 2 gun_stolen Stolen 7350
# 3 gun_stolen Not-stolen 1560
# 4 gun_stolen "" 355
# 5 gun_type Unknown 98892
# 6 gun_type Handgun 17609
# 7 gun_type 9mm 6040
# 8 gun_type Shotgun 3560
# 9 gun_type Rifle 3196
# 10 gun_type 22 LR 3093
# 11 gun_type 40 SW 2624
# 12 gun_type 380 Auto 2323
# 13 gun_type 45 Auto 2234
# 14 gun_type 38 Spl 1758
# 15 gun_type 223 Rem [AR-15] 1248
# 16 gun_type 12 gauge 975
# 17 gun_type Other 892
# 18 gun_type 7.62 [AK-47] 854
# 19 gun_type 357 Mag 800
# 20 gun_type 25 Auto 601
# 21 gun_type 32 Auto 481
# 22 gun_type "" 356
# 23 gun_type 20 gauge 194
# 24 gun_type 44 Mag 192
# 25 gun_type 30-30 Win 105
# 26 gun_type 410 gauge 96
# 27 gun_type 308 Win 88
# 28 gun_type 30-06 Spr 71
# 29 gun_type 10mm 50
# 30 gun_type 16 gauge 30
# 31 gun_type 300 Win 23
# 32 gun_type 28 gauge 6
# 33 participant_age 19 10541
# 34 participant_age 20 9919
# 35 participant_age 18 9826
# 36 participant_age 21 9795
# 37 participant_age 22 9642
# 38 participant_age 23 9383
# 39 participant_age 24 9204
# 40 participant_age 25 8562
# 41 participant_age 26 7815
# 42 participant_age 17 7416
# 43 participant_age 27 7228
# 44 participant_age 28 6528
# 45 participant_age 29 6055
# 46 participant_age 30 5652
# 47 participant_age 31 5145
# 48 participant_age 32 5039
# 49 participant_age 16 4977
# 50 participant_age 33 4662
# # ... with 91 more rows
I think by tidying you mean splitting the contents of the delimited columns and separating them into rows. You can either take the first element or take each element as its own row.
df<-data.frame(instance=1:5,
gun_type=c("", "0::Unknown||1::Unknown", "",
"0::Handgun||1::Handgun", ""), stringsAsFactors=FALSE)
df$first<-sapply(strsplit(df$gun_type, "\\|\\|"), '[', 1)
splitType<-strsplit(df$gun_type, "\\|\\|")
df.2<-df[rep(1:nrow(df), sapply(splitType, length)),]
df.2$splitType<-unlist(splitType)
If you want just the unique values then use:
splitTypeUnique<-sapply(splitType, unique)
df.2<-df[rep(1:nrow(df), sapply(splitTypeUnique, length)),]
df.2$splitType<-unlist(splitTypeUnique)
but you will have to do a little wrangling to get the unique part to work
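The extra wrangling is needed because "0::Unknown" and "1::Unknown" differ only in their index prefix, so unique() keeps both. A minimal sketch that strips the prefix first, assuming every element has the form "<number>::<value>"; the sample vector and the names parts, values and long are made up for illustration:

```r
# Sample values in the dataset's "index::value||index::value" format
gun_type <- c("", "0::Unknown||1::Unknown", "0::Handgun||1::Shotgun||2::Handgun")

# Split on the literal "||" (fixed = TRUE avoids regex escaping)
parts <- strsplit(gun_type, "||", fixed = TRUE)

# Remove the "N::" prefix from each piece, then deduplicate within each row
values <- lapply(parts, function(p) unique(sub("^[0-9]+::", "", p)))

# One output row per distinct value, keyed by the original row number
long <- data.frame(
  instance = rep(seq_along(values), lengths(values)),
  gun_type = unlist(values),
  stringsAsFactors = FALSE
)

long
```

Rows whose original value was empty produce no output rows here; whether they should be kept as NA instead depends on what the visualization needs.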

r count cells with missing values across each row [duplicate]

I have a dataframe as shown below
Id Date Col1 Col2 Col3 Col4
30 2012-03-31 A42.2 20.46 NA
36 1996-11-15 NA V73 55
96 2010-02-07 X48 Z16 13
40 2010-03-18 AD14 20.12 36
69 2012-02-21 22.45
11 2013-07-03 81 V017 TCG11
22 2001-06-01 67
83 2005-03-16 80.45 V22.15 46.52 X29.11
92 2012-02-12
34 2014-03-10 82.12 N72.22 V45.44
I am trying to count the number of NA or Empty cells across each row and the final expected output is as follows
Id Date Col1 Col2 Col3 Col4 MissCount
30 2012-03-31 A42.2 20.46 NA 2
36 1996-11-15 NA V73 55 2
96 2010-02-07 X48 Z16 13 1
40 2010-03-18 AD14 20.12 36 1
69 2012-02-21 22.45 3
11 2013-07-03 81 V017 TCG11 1
22 2001-06-01 67 3
83 2005-03-16 80.45 V22.15 46.52 X29.11 0
92 2012-02-12 4
34 2014-03-10 82.12 N72.22 V45.44 1
The last column MissCount will store the number of NAs or empty cells for each row. Any help is much appreciated.
The one-liner
rowSums(is.na(df) | df == "")
given by @DavidArenburg in his comment is definitely the way to go, assuming that you don't mind checking every column in the data frame. If you really only want to check Col1 through Col4, then using an apply function might make more sense.
apply(df, 1, function(x) {
sum(is.na(x[c("Col1", "Col2", "Col3", "Col4")])) +
sum(x[c("Col1", "Col2", "Col3", "Col4")] == "", na.rm=TRUE)
})
Edit: Shortened code
apply(df[c("Col1", "Col2", "Col3", "Col4")], 1, function(x) {
sum(is.na(x)) +
sum(x == "", na.rm=TRUE)
})
or if data columns are exactly like the example data:
apply(df[3:6], 1, function(x) {
sum(is.na(x)) +
sum(x == "", na.rm=TRUE)
})
This should do it.
yourframe$MissCount = rowSums(is.na(yourframe) | yourframe == "" | yourframe == " ")
You can use by_row, which now lives in the purrrlyr package (it was originally part of purrr):
library(purrrlyr)
#sample data frame
x <- data.frame(A1=c(1,NA,3,NA),
A2=c("A","B"," ","C"),
A3=c(" "," ",NA,"t"))
Here you apply a function on each row, you can edit it according to your condition. And you can use whatever function you want.
In the following example, I counted empty or NA entries in each row by using sum(...):
by_row(x, function(y) sum(y==" "| (is.na(y))),
.to="MissCount",
.collate = "cols"
)
You will get:
# A tibble: 4 x 4
A1 A2 A3 MissCount
<dbl> <fctr> <fctr> <int>
1 1 A 1
2 NA B 2
3 3 NA 2
4 NA C t 1
We can use
Reduce(`+`, lapply(df, function(x) is.na(x)|!nzchar(as.character(x))))
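To restrict the count to chosen columns and treat whitespace-only cells as empty too, the per-column test can be combined with rowSums. This is a sketch on a small made-up data frame; the column selection cols and the use of trimws are assumptions about what should count as "empty":

```r
# Small made-up frame with NA, empty and whitespace-only cells
df <- data.frame(
  Id = c(30, 36, 92),
  Col1 = c("A42.2", NA, ""),
  Col2 = c("20.46", "V73", " "),
  Col3 = c(NA, "", NA),
  stringsAsFactors = FALSE
)

# Only these columns are checked; Id is ignored
cols <- c("Col1", "Col2", "Col3")

# Logical matrix (rows x cols): TRUE where a cell is NA or blank after trimming
is_missing <- sapply(df[cols], function(x) is.na(x) | trimws(as.character(x)) == "")

df$MissCount <- rowSums(is_missing)
df
```

Note that sapply only returns a matrix when the data frame has more than one row; for a single-row frame the result would need to be reshaped before rowSums.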

Outputting percentiles by filtering a data frame

Note that, as requested in the comments, this question has been revised.
Consider the following example:
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
I would like to, for each value of FILTER, create a data frame which contains the 1st, 2nd, ..., 99th percentiles of VALUE. The final product should be
PERCENTILE df_1 df_2 ... df_10
1 [first percentiles]
2 [second percentiles]
etc., where df_i is based on FILTER == i.
Note that FILTER, although it contains numbers, is actually categorical.
The way I have been doing this is by using dplyr:
nums <- 1:10
library(dplyr)
for (i in nums){
df_temp <- filter(df, FILTER == i)$VALUE
assign(paste0("df_", i), quantile(df_temp, probs = (1:99)/100))
}
and then I would have to cbind these (with 1:99 in the first column), but I would rather not type in every single df name. I have considered using a loop on the names of these data frames, but this would involve using eval(parse()).
Here's a basic outline of a possibly smoother approach. I have not included every single aspect of your desired output, but the modification should be fairly straightforward.
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
df_s <- lapply(split(df,df$FILTER),
FUN = function(x) quantile(x$VALUE,probs = c(0.25,0.5,0.75)))
out <- do.call(cbind,df_s)
colnames(out) <- paste0("df_",colnames(out))
> out
df_1 df_2 df_3 df_4 df_5 df_6 df_7 df_8 df_9 df_10
25% 3.25 13.25 23.25 33.25 43.25 53.25 63.25 73.25 83.25 93.25
50% 5.50 15.50 25.50 35.50 45.50 55.50 65.50 75.50 85.50 95.50
75% 7.75 17.75 27.75 37.75 47.75 57.75 67.75 77.75 87.75 97.75
I did this for just 3 quantiles to keep things simple, but it obviously extends. And you can add the 1:99 column afterwards as well.
I suggest that you use a list.
list_of_dfs <- list()
nums <- 1:10
for (i in nums){
list_of_dfs[[i]] <- nums*i
}
# bind the list elements column-wise, then convert the matrix to a data frame
df <- as.data.frame(do.call("cbind", list_of_dfs))
colnames(df) <- paste0("df_",1:10)
You'll get the result you want:
df_1 df_2 df_3 df_4 df_5 df_6 df_7 df_8 df_9 df_10
1 1 2 3 4 5 6 7 8 9 10
2 2 4 6 8 10 12 14 16 18 20
3 3 6 9 12 15 18 21 24 27 30
4 4 8 12 16 20 24 28 32 36 40
5 5 10 15 20 25 30 35 40 45 50
6 6 12 18 24 30 36 42 48 54 60
7 7 14 21 28 35 42 49 56 63 70
8 8 16 24 32 40 48 56 64 72 80
9 9 18 27 36 45 54 63 72 81 90
10 10 20 30 40 50 60 70 80 90 100
How about using get?
# placeholder column with one row per percentile (each df_i has 99 elements)
df <- data.frame(1:99)
for (i in nums) {
df <- cbind(df, get(paste0("df_", i)))
}
# get rid of first useless column
df <- df[, -1]
# get names
names(df) <- paste0("df_", nums)
df
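Putting the pieces together for the full set of 99 percentiles, one possible version avoids assign and get entirely by splitting VALUE by FILTER and letting sapply build the matrix. A sketch using the example data from the question; the names probs, q and out are introduced here:

```r
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)

# 1st through 99th percentiles
probs <- (1:99) / 100

# split() gives one VALUE vector per FILTER level; sapply returns a
# 99 x 10 matrix with one column per level
q <- sapply(split(df$VALUE, df$FILTER), quantile, probs = probs)
colnames(q) <- paste0("df_", colnames(q))

# Add the percentile index as the first column
out <- data.frame(PERCENTILE = 1:99, q, row.names = NULL)
head(out, 3)
```

Because split() treats FILTER as a grouping variable, this works the same way if FILTER is stored as a factor or character column.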
