How can I improve my code to convert a factor column into a list column within a data.frame? - r

I want to convert a factor column into a list column within a data.frame.
I got it working with the code below, but I feel this is not the right way.
How can I improve it?
The data I'm dealing with is the result of association rule mining with the arules package (the item labels are in Japanese).
Here are 3 rows of the column "rules":
rules
{道路構造=交差点_交差点付近,昼間12時間平均旅行速度=20~30km/h,歩道設置率=100%,バス優先.専用レーンの有無=なし} => {事故類型=車両相互_追突}
{道路構造=交差点_交差点付近,昼間12時間平均旅行速度=20~30km/h,バス優先.専用レーンの有無=なし} => {事故類型=車両相互_追突}
{道路構造=交差点_交差点付近,歩道設置率=100%,バス優先.専用レーンの有無=なし,代表沿道状況=人口集中地区(商業地域を除く)} => {事故類型=車両相互_追突}
And here is str(data):
'data.frame': 50 obs. of 5 variables:
$ rules : Factor w/ 50 levels "{道路構造=交差点_交差点付近,バス優先.専用レーンの有無=なし,指定最高速度=50} => {事故類型=車両相互_追突}",..: 9 8 35 38 10 31 11 25 3 7 ...
$ support : Factor w/ 48 levels "0.050295052",..: 5 14 5 10 24 1 30 13 15 18 ...
$ confidence: Factor w/ 50 levels "0.555131629",..: 50 49 48 47 46 45 44 43 42 41 ...
$ lift : Factor w/ 50 levels "1.894879112",..: 50 49 48 47 46 45 44 43 42 41 ...
$ count : Factor w/ 48 levels "1013","1250",..: 9 18 9 14 28 5 34 17 19 22 ...
# convert factor to character
data %>% mutate_if(is.factor, as.character) -> data
# delete the RHS in rules(the part after '=>' )
data$rules <- strsplit(data$rules, " =>")
i = 1
for (i in 1:length(data$rules)) {
data$rules[[i]] <- data$rules[[i]][[-2]]
}
# delete "{" and "}"
data$rules <- as.character(data$rules)
data$rules <- strsplit(data$rules, "[{]")
i = 1
for (i in 1:length(data$rules)) {
data$rules[[i]] <- data$rules[[i]][[-1]]
}
data$rules <- as.character(data$rules)
data$rules <- strsplit(data$rules, "[}]")
# split character to list (:length(data$rules[[1]] -> 4))
data$rules <- as.character(data$rules)
data$rules <- strsplit(data$rules, ",")
The output should be like this:
[[1]]
[1] "道路構造=交差点_交差点付近" "昼間12時間平均旅行速度=20~30km/h" "歩道設置率=100%" "バス優先.専用レーンの有無=なし"
[[2]]
[1] "道路構造=交差点_交差点付近" "昼間12時間平均旅行速度=20~30km/h" "バス優先.専用レーンの有無=なし"
[[3]]
[1] "道路構造=交差点_交差点付近" "歩道設置率=100%" "バス優先.専用レーンの有無=なし"
[4] "代表沿道状況=人口集中地区(商業地域を除く)"
My code works, but I feel it is neither elegant nor efficient.
Could you improve it, or show me the right way to do this?

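We can use str_extract from stringr to pull out the left-hand side, i.e. everything between the first "{" and the first "}":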
library(stringr)
library(dplyr)
out <- data %>%
mutate(rules = trimws(str_extract(rules, "(?<=\\{)[^}]+")))
out$rules
#[1] "道路構造=交差点_交差点付近,昼間12時間平均旅行速度=20~30km/h,歩道設置率=100%,バス優先.専用レーンの有無=なし"
#[2] "道路構造=交差点_交差点付近,昼間12時間平均旅行速度=20~30km/h,バス優先.専用レーンの有無=なし"
#[3] "道路構造=交差点_交差点付近,歩道設置率=100%,バス優先.専用レーンの有無=なし,代表沿道状況=人口集中地区(商業地域を除く)"
If we want to split 'rules' on "," and create a list column:
out$rules <- str_split(out$rules, ",")
Data:
data <- structure(list(rules = c("{道路構造=交差点_交差点付近,昼間12時間平均旅行速度=20~30km/h,歩道設置率=100%,バス優先.専用レーンの有無=なし} => {事故類型=車両相互_追突}",
"{道路構造=交差点_交差点付近,昼間12時間平均旅行速度=20~30km/h,バス優先.専用レーンの有無=なし} => {事故類型=車両相互_追突}",
"{道路構造=交差点_交差点付近,歩道設置率=100%,バス優先.専用レーンの有無=なし,代表沿道状況=人口集中地区(商業地域を除く)} => {事故類型=車両相互_追突}"
)), class = "data.frame", row.names = c(NA, -3L))
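The same two steps (keep the left-hand side, then split on commas) can also be sketched in base R without stringr. This is a minimal illustration, not the answerer's method; the short ASCII rule strings and the names rules, lhs and lhs_list are made up for the example, and it assumes each rule starts with a single "{...}" block that contains no "}" inside it.

```r
# Toy rules in the same "{lhs} => {rhs}" shape as the arules output (hypothetical values)
rules <- c("{a=1,b=2} => {c=3}", "{a=1} => {c=3}")

# Keep only the text between the first "{" and the first "}"
lhs <- sub("^\\{([^}]*)\\}.*$", "\\1", rules)

# Split each left-hand side on literal commas; the result is a list of character vectors
lhs_list <- strsplit(lhs, ",", fixed = TRUE)
```

If the column is still a factor, wrapping it in as.character() before the sub() call keeps the intent explicit.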

Related

Using a function and mapply in R to create new columns that sums other columns

Suppose, I have a dataframe, df, and I want to create a new column called "c" based on the addition of two existing columns, "a" and "b". I would simply run the following code:
df$c <- df$a + df$b
But I also want to do this for many other columns. So why won't my code below work?
# Reproducible data:
martial_arts <- data.frame(gym_branch=c("downtown_a", "downtown_b", "uptown", "island"),
day_boxing=c(5,30,25,10),day_muaythai=c(34,18,20,30),
day_bjj=c(0,0,0,0),day_judo=c(10,0,5,0),
evening_boxing=c(50,45,32,40), evening_muaythai=c(50,50,45,50),
evening_bjj=c(60,60,55,40), evening_judo=c(25,15,30,0))
# Creating a list of the new column names of the columns that need to be added to the martial_arts dataframe:
pattern<-c("_boxing","_muaythai","_bjj","_judo")
d<- expand.grid(paste0("martial_arts$total",pattern))
# Creating lists of the columns that will be added to each other:
e<- names(martial_arts %>% select(day_boxing:day_judo))
f<- names(martial_arts %>% select(evening_boxing:evening_judo))
# Writing a function and using mapply:
kick_him <- function(d,e,f){d <- rowSums(martial_arts[ , c(e, f)], na.rm=T)}
mapply(kick_him,d,e,f)
Now, mapply produces the correct results in terms of the addition:
> mapply(kick_him,d,e,f)
Var1 <NA> <NA> <NA>
[1,] 55 84 60 35
[2,] 75 68 60 15
[3,] 57 65 55 35
[4,] 50 80 40 0
But it doesn't add the new columns to the martial_arts dataframe. In theory, the function should do the following:
martial_arts$total_boxing <- martial_arts$day_boxing + martial_arts$evening_boxing
...
...
martial_arts$total_judo <- martial_arts$day_judo + martial_arts$evening_judo
and add four new total columns to martial_arts.
So what am I doing wrong?
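The assignment is the problem. A string such as "martial_arts$total_boxing" is never evaluated as an assignment target, and the d <- ... inside the function only rebinds a local variable. Instead, the bare column name ("total_boxing") should be used, and the assignment should happen on the left-hand side of the Map/mapply call. As the OP already created the names with the 'martial_arts$' prefix in the Var1 column of 'd', we remove the prefix and then do the assignment: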
kick_him <- function(e,f){rowSums(martial_arts[ , c(e, f)], na.rm=TRUE)}
martial_arts[sub(".*\\$", "", d$Var1)] <- Map(kick_him, e, f)
Check the dataset now:
> martial_arts
gym_branch day_boxing day_muaythai day_bjj day_judo evening_boxing evening_muaythai evening_bjj evening_judo total_boxing total_muaythai total_bjj total_judo
1 downtown_a 5 34 0 10 50 50 60 25 55 84 60 35
2 downtown_b 30 18 0 0 45 50 60 15 75 68 60 15
3 uptown 25 20 0 5 32 45 55 30 57 65 55 35
4 island 10 30 0 0 40 50 40 0 50 80 40 0
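The pattern of assigning a Map result to several new columns at once can be sketched on a smaller, self-contained data frame. The data frame df, its column names and the suffixes vector are hypothetical and chosen only to keep the example short.

```r
# Small stand-in for martial_arts (hypothetical values)
df <- data.frame(day_a = c(1, 2), day_b = c(3, 4),
                 evening_a = c(10, 20), evening_b = c(30, 40))

suffixes     <- c("a", "b")
day_cols     <- paste0("day_", suffixes)
evening_cols <- paste0("evening_", suffixes)
total_cols   <- paste0("total_", suffixes)

# Map returns one summed vector per pair of columns;
# assigning the list to df[total_cols] creates the new columns in one step
df[total_cols] <- Map(function(x, y) df[[x]] + df[[y]], day_cols, evening_cols)
```

Building the target names as plain strings (without a "df$" prefix) avoids having to strip the prefix afterwards.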

I was trying to add a new numeric column to a data frame with mutate, but R treats it as character and I cannot even access it by index

library(dslabs)
data(heights)
library(dplyr)
mutate(heights, ht_cm = height * 2.54, stringsAsFactor = FALSE )
str(heights) # not showing ht_cm as a variable in the data frame
mean(heights$ht_cm) # giving error that argument is not numeric
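You called mutate but never assigned its result. mutate returns a new data frame and leaves heights unchanged, so to add the new column to heights you need to assign the result back: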
Code
heights <-
heights %>%
mutate(ht_cm = height * 2.54)
Output
str(heights)
'data.frame': 1050 obs. of 3 variables:
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 1 1 1 2 ...
$ height: num 75 70 68 74 61 65 66 62 66 67 ...
$ ht_cm : num 190 178 173 188 155 ...
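The same behaviour can be seen without dplyr. The sketch below swaps in base R's transform, which, like mutate, returns a modified copy rather than changing its input; the tiny data frame df and the variable has_column_before are invented for the illustration.

```r
df <- data.frame(height = c(60, 70))

# The result is computed and printed, but df itself is not modified
transform(df, ht_cm = height * 2.54)
has_column_before <- "ht_cm" %in% names(df)   # FALSE: nothing was stored

# Assigning the result back is what actually adds the column
df <- transform(df, ht_cm = height * 2.54)
```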

Writing functions: Creating data processing functions with R software [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
Hello fellow "R" users!
I am a beginner with R and would appreciate help writing a data processing function. I have three .csv files named "x2013", "x2014" and "x2015", one per year, each with the same 6 columns, as shown in the linked image ("Problem"). I started by typing these commands:
filenames=list.files()
library(plyr)
install.packages("plyr")
import.list=adply(filenames,1,read.csv)
What I really want is to summarize all the calls from the three source csv files. Any kind of help would be appreciated. Thank you for assisting me!
If you want to combine the results of read.csv into one data.frame, you can use do.call with rbind, provided that all the csv files have the same columns. The code below reads every csv file in the current working directory and concatenates them into one data.frame:
# simulation of 3 data.frames with 6 columns and 10 rows
df1 <- as.data.frame(matrix(1:(10 * 6), ncol = 6))
df2 <- df1 * 2
df3 <- df1 * 3
write.csv(df1, "X2012.csv")
write.csv(df2, "X2013.csv")
write.csv(df3, "X2014.csv")
# Load all csv files from home directory
filenames <- list.files(".", pattern = "csv$")
import.list<- lapply(filenames, read.csv)
# concatenate list of data.frames into one data.frame
df_res <- do.call(rbind, import.list)
str(df_res)
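Output is a data.frame with 30 rows and 7 columns (the 6 data columns plus X, which holds the row names that write.csv saves by default; pass row.names = FALSE to write.csv to avoid it):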
'data.frame': 30 obs. of 7 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ V1: int 1 2 3 4 5 6 7 8 9 10 ...
$ V2: int 11 12 13 14 15 16 17 18 19 20 ...
$ V3: int 21 22 23 24 25 26 27 28 29 30 ...
$ V4: int 31 32 33 34 35 36 37 38 39 40 ...
$ V5: int 41 42 43 44 45 46 47 48 49 50 ...
$ V6: int 51 52 53 54 55 56 57 58 59 60 ...
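A variation on the same idea, sketched under a few assumptions: the files are written to a temporary directory so the example does not touch the working directory, row.names = FALSE keeps the extra X column out, and a column named source (a made-up name) records which file each row came from. The file names mirror those in the question, but the contents are simulated.

```r
# Write three simulated files into a scratch directory
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
base_df <- as.data.frame(matrix(1:60, ncol = 6))
for (year in c("x2013", "x2014", "x2015")) {
  write.csv(base_df, file.path(tmp_dir, paste0(year, ".csv")), row.names = FALSE)
}

# Read every csv file and tag each table with the file it came from
paths <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(paths, function(p) {
  d <- read.csv(p)
  d$source <- tools::file_path_sans_ext(basename(p))
  d
})

# Stack the tables into a single data.frame
combined <- do.call(rbind, tables)
```

With the source column in place, per-year summaries can be computed afterwards, for example with aggregate or tapply.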

How to split data frame with multiple delimiter using str_split_fixed?

How can I split a column whose values contain multiple delimiters into separate columns of a data frame?
read.table(text = " Chr Nm1 Nm2 Nm3
chr10_100064111-100064134+Nfif 20 20 20
chr10_100064115-100064138-Kitl 30 19 40
chr10_100076865-100076888+Tert 60 440 18
chr10_100079974-100079997-Itg 50 11 23
chr10_100466221-100466244+Tmtc3 55 24 53", header = TRUE)
The desired output is:
Chr gene Nm1 Nm2 Nm3
chr10_100064111-100064134 Nfif 20 20 20
chr10_100064115-100064138 Kitl 30 19 40
chr10_100076865-100076888 Tert 60 440 18
chr10_100079974-100079997 Itg 50 11 23
chr10_100466221-100466244 Tmtc3 55 24 53
I used:
library(stringr)
df2 <- str_split_fixed(df1$name, "\\+", 2)
I would like to know how to include both the + and - delimiters.
If you're trying to split one column into multiple, tidyr::separate is handy:
library(tidyr)
dat %>% separate(Chr, into = paste0('Chr', 1:3), sep = '[+-]')
# Chr1 Chr2 Chr3 Nm1 Nm2 Nm3
# 1 chr10_100064111 100064134 Nfif 20 20 20
# 2 chr10_100064115 100064138 Kitl 30 19 40
# 3 chr10_100076865 100076888 Tert 60 440 18
# 4 chr10_100079974 100079997 Itg 50 11 23
# 5 chr10_100466221 100466244 Tmtc3 55 24 53
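This should work, where a stands for the character vector to split. Note that n = 3 is needed to separate the gene name, because with n = 2 the string is only split at the first "-" between the two coordinates:
str_split_fixed(a, "[-+]", 3)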
Here is a way to do this in base R with strsplit:
# split Chr into a list
tempList <- strsplit(as.character(df$Chr), split="[+-]")
# replace Chr with desired values
df$Chr <- sapply(tempList, function(i) paste(i[[1]], i[[2]], sep="-"))
# get Gene variable
df$gene <- sapply(tempList, "[[", 3)
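Another base R option is to split only at the last delimiter with a regular expression, which yields the region and the gene directly. This is a sketch under the assumption that the gene name itself contains no "+" or "-"; the names chr, pattern, region and gene are illustrative.

```r
chr <- c("chr10_100064111-100064134+Nfif",
         "chr10_100064115-100064138-Kitl")

# Greedy (.*) runs up to the last "+" or "-"; the second group is the delimiter-free tail
pattern <- "^(.*)[+-]([^+-]*)$"

region <- sub(pattern, "\\1", chr)   # "chr10_100064111-100064134" "chr10_100064115-100064138"
gene   <- sub(pattern, "\\2", chr)   # "Nfif" "Kitl"
```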

Time series import from Excel and date manipulation in R

I have 2 columns of a time series in a .csv excel file, "date" and "widgets"
I import the file into R using:
things <- read.csv("C:things.csv")
str(things)
'data.frame': 280 obs. of 2 variables:
$ date: Factor w/ 280 levels "2012-09-12","2012-09-13",..: 1 2 3 4 5 6 7 8 9 10 ...
$ widgets : int 5 10 15 20 30 35 40 50 55 60 65 70 75 80 85 90 95 100 ...
How do I convert the factor things$date into either xts or Time Series format?
For instance, when I run:
hist(things)
Error in hist.default(things) : 'x' must be numeric
Try reading it in as a zoo object and then converting:
Lines <- "date,widgets
2012-09-12,5
2012-09-13,10
"
library(zoo)
library(xts) # provides as.xts
# replace first argument with: file="C:things.csv"
z <- read.zoo(text = Lines, header = TRUE, sep = ",")
x <- as.xts(z)
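If only the date column needs fixing, a plain base R conversion from factor to Date may be enough before handing the data to zoo or xts. This is a minimal sketch on a three-row stand-in for the things data frame from the question; the values and the variable day_gap are made up.

```r
things <- data.frame(date = factor(c("2012-09-12", "2012-09-13", "2012-09-14")),
                     widgets = c(5L, 10L, 15L))

# Go through character first: as.Date on the text labels, not on the factor codes
things$date <- as.Date(as.character(things$date), format = "%Y-%m-%d")

# Date arithmetic now works, e.g. the gap between consecutive observations in days
day_gap <- diff(things$date)
```

The hist error in the question arises because hist was given the whole data frame; calling it on a numeric column such as things$widgets avoids that.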
