converting multiple columns from wide to long using pivot_longer - r

I get an error message when I try to convert multiple columns from wide to long with pivot_longer.
I have code that converts from wide to long with gather, but I have to do this column by column. I want to use pivot_longer to gather multiple columns at once rather than column by column.
This is some input data:
structure(list(id = c("81", "83", "85", "88", "1", "2"), look_work = c("yes",
"yes", "yes", "yes", "yes", "yes"), current_work = c("no", "yes",
"no", "no", "no", "no"), before_work = c("no", "NULL", "yes",
"yes", "yes", "yes"), keen_move = c("yes", "yes", "no", "no",
"no", "no"), city_size = c("village", "more than 500k inhabitants",
"more than 500k inhabitants", "village", "city up to 20k inhabitants",
"100k - 199k inhabitants"), gender = c("male", "female", "female",
"male", "female", "male"), age = c("18 - 24 years", "18 - 24 years",
"more than 50 years", "18 - 24 years", "31 - 40 years", "more than 50 years"
), education = c("secondary", "vocational", "secondary", "secondary",
"secondary", "secondary"), hf1 = c("", "", "", "1", "1", "1"),
hf2 = c("", "1", "1", "", "", ""), hf3 = c("", "", "", "",
"", ""), hf4 = c("", "", "", "", "", ""), hf5 = c("", "",
"", "", "", ""), hf6 = c("", "", "", "", "", ""), ac1 = c("",
"", "", "", "", "1"), ac2 = c("", "1", "1", "", "1", ""),
ac3 = c("", "", "", "", "1", ""), ac4 = c("", "", "", "",
"", ""), ac5 = c("", "", "", "", "", ""), ac6 = c("", "",
"", "", "", ""), cs1 = c("", "", "", "", "", ""), cs2 = c("",
"1", "1", "", "1", ""), cs3 = c("", "", "", "", "", "1"),
cs4 = c("", "", "", "1", "", ""), cs5 = c("", "", "", "",
"", ""), cs6 = c("", "", "", "", "", ""), cs7 = c("", "",
"", "", "", ""), cs8 = c("", "", "", "", "", ""), se1 = c("",
"", "1", "1", "", ""), se2 = c("", "", "", "", "1", ""),
se3 = c("", "1", "", "", "1", "1"), se4 = c("", "", "", "",
"", ""), se5 = c("", "", "", "", "", ""), se6 = c("", "",
"", "", "", ""), se7 = c("", "", "", "", "", ""), se8 = c("",
"", "", "1", "", "")), row.names = c(NA, 6L), class = "data.frame")
The code using gather is:
df1 <- df %>%
gather(key = "hf_com", value = "hf_com_freq", hf1:hf6) %>%
gather(key = "ac_com", value = "ac_com_freq", ac1:ac6) %>%
filter(substring(hf_com, 3) == substring(ac_com, 3))
df1 <- df1 %>%
gather(key = "curr_sal", value = "curr_sal_freq", cs1:cs8) %>%
gather(key = "exp_sal", value = "exp_sal_freq", se1:se8) %>%
filter(substring(curr_sal, 3) == substring(exp_sal, 3))
The code using pivot_longer is:
df_longer <- df %>%
pivot_longer(
cols = starts_with("hf"),
names_to = "hf_com",
values_to = "hf_freq",
names_prefix = "hf",
na.rm = TRUE)
The expected results which I get with gather are:
structure(list(id = c("81", "83", "85", "88", "1", "2"), look_work = c("yes",
"yes", "yes", "yes", "yes", "yes"), current_work = c("no", "yes",
"no", "no", "no", "no"), before_work = c("no", "NULL", "yes",
"yes", "yes", "yes"), keen_move = c("yes", "yes", "no", "no",
"no", "no"), city_size = c("village", "more than 500k inhabitants",
"more than 500k inhabitants", "village", "city up to 20k inhabitants",
"100k - 199k inhabitants"), gender = c("male", "female", "female",
"male", "female", "male"), age = c("18 - 24 years", "18 - 24 years",
"more than 50 years", "18 - 24 years", "31 - 40 years", "more than 50 years"
), education = c("secondary", "vocational", "secondary", "secondary",
"secondary", "secondary"), hf_com = c("hf1", "hf1", "hf1", "hf1",
"hf1", "hf1"), hf_com_freq = c("", "", "", "1", "1", "1"), ac_com = c("ac1",
"ac1", "ac1", "ac1", "ac1", "ac1"), ac_com_freq = c("", "", "",
"", "", "1"), curr_sal = c("cs1", "cs1", "cs1", "cs1", "cs1",
"cs1"), curr_sal_freq = c("", "", "", "", "", ""), exp_sal = c("se1",
"se1", "se1", "se1", "se1", "se1"), exp_sal_freq = c("", "",
"1", "1", "", "")), row.names = c(NA, 6L), class = "data.frame")
With pivot_longer, I get the following error message:
Error in pivot_longer(., cols = starts_with("hf"), names_to = "hf_com", :
unused argument (na.rm = TRUE)
Also, if there is no solution with pivot_longer, then a solution with data.table would be appreciated.

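I have solved the problem: pivot_longer() has no na.rm argument (that name belongs to gather()); the equivalent argument is values_drop_na.
The call needs to be changed from: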
df_longer <- df %>%
pivot_longer(
cols = starts_with("hf"),
names_to = "hf_com",
values_to = "hf_freq",
names_prefix = "hf",
na.rm = TRUE)
to:
df_longer <- df %>%
pivot_longer(
cols = starts_with("hf"),
names_to = "hf_com",
values_to = "hf_freq",
names_prefix = "hf",
values_drop_na = TRUE)
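For the original goal of gathering several column families in one call, one possible sketch uses the special ".value" sentinel in names_to together with names_pattern. The toy data frame below is made up for illustration and only mimics the hf*/ac* layout of the question. Note also that values_drop_na only drops NA, so empty strings like those in the question's data would first have to be converted, here with dplyr::na_if.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical values) shaped like the question's hf*/ac* columns
toy <- data.frame(
  id  = c("81", "83"),
  hf1 = c("", "1"),
  hf2 = c("1", ""),
  ac1 = c("", ""),
  ac2 = c("1", "1")
)

# ".value" keeps the prefix (hf, ac) as separate value columns and moves
# the trailing number into a new "level" column, pairing hf1 with ac1, etc.
toy_long <- toy %>%
  pivot_longer(
    cols = matches("^(hf|ac)\\d+$"),
    names_to = c(".value", "level"),
    names_pattern = "^([a-z]+)(\\d+)$"
  )

# Turn empty strings into NA so that NA-aware tools can see them as missing
toy_na <- toy_long %>%
  mutate(across(c(hf, ac), ~ na_if(.x, "")))
```

This gives one row per id and level with hf and ac side by side, which is roughly what the gather-then-filter code produces for a pair of column families.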

Related

Find a string in one row, but then replace value in next row

I am trying to find a way to search for a string (in my example, "Prep") and then replace the cell in the row below with a specific value (in my example, "SINGLE").
This is my example input and output. I can grep in $V4 and find the values, but I can't seem to work out how to replace the row below with my desired text.
Can anyone give me a tip on what I'm doing wrong? I've tried a number of mutate functions and can't find one to work.
input = structure(list(V1 = c("Fred", "", "John", "", "Max", "", "Tim",
""), V2 = c("Chicago", "", "Boston", "", "London", "", "Paris",
""), V3 = c("", "Red", "", "Yellow", "", "Red", "", "Blue"),
V4 = c("Final", "TEAM", "Prep", "TEAM", "Prep", "TEAM", "Final",
"SINGLE")), row.names = c(NA, 8L), class = "data.frame")
output = structure(list(V1 = c("Fred", "", "John", "", "Max", "", "Tim",
""), V2 = c("Chicago", "", "Boston", "", "London", "", "Paris",
""), V3 = c("", "Red", "", "Yellow", "", "Red", "", "Blue"),
V4 = c("Final", "TEAM", "Prep", "SINGLE", "Prep", "SINGLE",
"Final", "SINGLE")), row.names = 9:16, class = "data.frame")
Here is a potential solution based on the lag() function from the dplyr package (https://dplyr.tidyverse.org/reference/lead-lag.html):
library(dplyr)
input <- structure(list(V1 = c("Fred", "", "John", "", "Max", "", "Tim",
""), V2 = c("Chicago", "", "Boston", "", "London", "", "Paris",
""), V3 = c("", "Red", "", "Yellow", "", "Red", "", "Blue"),
V4 = c("Final", "TEAM", "Prep", "TEAM", "Prep", "TEAM", "Final",
"SINGLE")), row.names = c(NA, 8L), class = "data.frame")
output <- structure(list(V1 = c("Fred", "", "John", "", "Max", "", "Tim",
""), V2 = c("Chicago", "", "Boston", "", "London", "", "Paris",
""), V3 = c("", "Red", "", "Yellow", "", "Red", "", "Blue"),
V4 = c("Final", "TEAM", "Prep", "SINGLE", "Prep", "SINGLE",
"Final", "SINGLE")), row.names = 9:16, class = "data.frame")
answer <- input %>%
mutate(V4 = ifelse(lag(V4, default = first(V4)) == "Prep", "SINGLE", V4))
all_equal(output, answer)
#> [1] TRUE
Created on 2022-11-10 by the reprex package (v2.0.1)
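To make the mechanics of the lag() approach easier to follow, here is a minimal sketch on a bare vector holding the V4 values from the question. The variable names v4, prev and replaced are only illustrative. Unlike the answer above, this version leaves the default NA in the first position and guards against it explicitly.

```r
library(dplyr)

v4 <- c("Final", "TEAM", "Prep", "TEAM", "Prep", "TEAM", "Final", "SINGLE")

# lag() shifts the vector down by one, so prev[i] is the value of the row above i
prev <- lag(v4)

# Replace an element only when the row above it is "Prep"
replaced <- ifelse(!is.na(prev) & prev == "Prep", "SINGLE", v4)
```

Because the comparison always uses the original, unshifted values, a replacement never cascades into the following row.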

How to wrangle the dataset in R: reshaping and creating new columns with given information

I have a dataset that looks like below,
structure(list(nonyeasted_19 = c("Force (N)", "0", "-0.0077",
"0.0023", "-0.0707", "-0.2155", "-0.2026", "-0.0628", "-0.0481",
"-0.0601", "0.0302", "0.0475", "-0.0176", "0.008", "0.0569",
"0.0242", "0.0003", "0.0295", "0.028", "-0.0221", "-0.0333",
"0.0034", "0.004", "-0.0219", "-0.0216", "-0.0261"), nonyeasted_19.1 = c("Distance (m)",
"0", "0", "0", "0", "0", "0", "0.000002", "0.000004", "0.000006",
"0.000008", "0.00001", "0.000012", "0.000014", "0.000016", "0.000018",
"0.00002", "0.000022", "0.000024", "0.000026", "0.000028", "0.00003",
"0.000032", "0.000034", "0.000036", "0.000038"), nonyeasted_19.2 = c("Time (sec)",
"0", "0.002", "0.004", "0.006", "0.008", "0.01", "0.012", "0.014",
"0.016", "0.018", "0.02", "0.022", "0.024", "0.026", "0.028",
"0.03", "0.032", "0.034", "0.036", "0.038", "0.04", "0.042",
"0.044", "0.046", "0.048"), nonyeasted_19.3 = c("Status", "101",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"), yeasted_01 = c("Force (N)",
"0", "0.0024", "0.0307", "-0.0487", "-0.2063", "-0.1928", "-0.0421",
"-0.0278", "-0.0586", "0.0251", "0.0373", "-0.0084", "0.0597",
"0.091", "0.0246", "0.0318", "", "", "", "", "", "", "", "",
""), yeasted_01.1 = c("Distance (m)", "0", "0", "0", "0", "0",
"0", "0", "0.000001", "0.000003", "0.000005", "0.000007", "0.000009",
"0.000011", "0.000013", "0.000015", "0.000017", "", "", "", "",
"", "", "", "", ""), yeasted_01.2 = c("Time (sec)", "0", "0.002",
"0.004", "0.006", "0.008", "0.01", "0.012", "0.014", "0.016",
"0.018", "0.02", "0.022", "0.024", "0.026", "0.028", "0.03",
"", "", "", "", "", "", "", "", ""), yeasted_01.3 = c("Status",
"101", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "", "", "", "", "", "", "", "", "")), class = "data.frame", row.names = c(NA,
-26L))
Every four columns are in one group, and the group names are in the first row, while the column names are in the second row. I wonder whether there are any ways to concatenate the groups vertically and create two new columns with the group name row, where column 1 contains the text before the underscore and column 2 contains the text after the underscore.
I tried to use tidyverse, but after read.csv(), the variable names could not be preserved.
one approach:
sample data (example_data.csv):
group_A,group_A,group_B,group_B
var_1,var_2,var_3,var_4
143,897,234,382
Code:
library(readr) ## for the read_lines function
library(tidyr) ## wrangling (pivoting etc.)
## read csv but skip first line (containing group names):
df <- read.csv('path/to/example_data.csv',skip = 1)
## read first line of csv and convert it to vector of group names:
group_names <- read_lines('path/to/example_data.csv', n_max = 1) %>%
strsplit(',') %>% unlist
## change names of dataframe df to: group_name;variable_name
names(df) <- paste(group_names, names(df), sep = ';')
## wrangle data (for documentation see https://tidyr.tidyverse.org/ )
df %>%
pivot_longer(everything(), names_to = 'group_var', values_to = 'value') %>%
separate(group_var, into = c('group', 'var'), sep = ';') %>%
separate(group, into = c('yeasted_status', 'index'), sep='_') %>%
pivot_wider(names_from = var, values_from = value)
Result:
## A tibble: 2 x 6
#   yeasted_status index var_1 var_2 var_3 var_4
#   <chr>          <chr> <int> <int> <int> <int>
# 1 group          A       143   897    NA    NA
# 2 group          B        NA    NA   234   382
edit
or, if df is the dataframe derived from your dput output:
df[-1,] %>%
pivot_longer(everything(),names_to = 'group_var', values_to = 'value') %>% head %>%
mutate(ID = paste(row_number(),group_var)) %>%
separate(group_var, into = c('group', 'var'), sep = ';') %>%
separate(group, into = c('yeasted_status', 'index'), sep='_') %>%
mutate(value = as.double(value)) %>%
pivot_wider(id_cols = c(ID, yeasted_status,index), names_from = var, values_from = value) %>%
select(-ID)
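As an alternative sketch for data shaped like the dput in the question (group names in the column names, variable names in the first row), the group and variable can be combined into one name and then split again with ".value". The data frame raw below is a cut-down copy of the question's data with only two variables and two measurements per group; the helper names groups, vars, body and long are assumptions made for this example.

```r
library(dplyr)
library(tidyr)

# Cut-down version of the question's data: the first row holds the variable names
raw <- data.frame(
  nonyeasted_19   = c("Force (N)", "0", "-0.0077"),
  nonyeasted_19.1 = c("Time (sec)", "0", "0.002"),
  yeasted_01      = c("Force (N)", "0", "0.0024"),
  yeasted_01.1    = c("Time (sec)", "0", "0.002"),
  stringsAsFactors = FALSE
)

# Group name = column name without the ".1", ".2", ... suffix added by R
groups <- sub("\\.\\d+$", "", names(raw))
# Variable name = content of the first row
vars <- as.character(unlist(raw[1, ], use.names = FALSE))

# Drop the header row and rename columns to "group;variable"
body <- raw[-1, ]
names(body) <- paste(groups, vars, sep = ";")

long <- body %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = c("group", ".value"), names_sep = ";") %>%
  separate(group, into = c("yeasted_status", "index"), sep = "_") %>%
  mutate(across(c(`Force (N)`, `Time (sec)`), as.numeric)) %>%
  arrange(yeasted_status, index, row) %>%
  select(-row)
```

With the full data, the trailing empty strings of the shorter group become NA after as.numeric and could be filtered out afterwards.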

Conditional str_remove based on data frame column

I have a dataframe (pasted below), in which I am trying to set to blank the value of one column based on the value of another column. The idea is that if X6 equals Nbre CV or if X6 equals Nbre BVD, then I want X6 for that row to be blank.
Unfortunately, using the following code, the entire X6 column turns to NA or missing.
extractstack <- extractstack %>%
mutate(across(everything(), as.character) %>%
mutate(X6 = if_else(X6 == `Nbre CV`, str_remove(X6, `Nbre CV`), X6)) %>%
mutate(X6 = if_else(X6 == `Nbre CV`, str_remove(X6, `Nbre BVD`), X6)))
structure(list(X1 = c("", "", "40", "", "", "41", "", "", "42",
"", "", "43", "", "", "44", ""), X2 = c("", "", "EP. KAPALA",
"", "", "INST. MOTULE", "", "", "CABANE BABOA", "", "", "CABANE BANANGI",
"", "", "E.P.BINZI", ""), X3 = c("", "", "MOBATI-BOYELE", "",
"", "MOBATI-BOYELE", "", "", "MOBATI-BOYELE", "", "", "AVURU-GATANGA",
"", "", "AVURU-GATANGA", ""), X4 = c("", "", "BOGBASA", "", "",
"BOSOBEA", "", "", "BOSOBEA", "", "", "BANANGI", "", "", "GURUZA",
""), X5 = c("", "", "", "", "", "MOBENGE", "", "", "BABOA", "",
"", "DIFONGO", "", "", "DULIA", ""), X6 = c("", "", "BOGBASA",
"", "", "", "1", "", "", "1", "", "", "1", "", "", "1"), X7 = c("1",
"", "", "1", "", "", "4", "", "", "1", "", "", "1", "", "", "5"
), X8 = c("2", "", "", "2", "", "", "510 110", "", "", "510 111",
"", "", "510 112", "", "", "510 113"), X9 = c("510 108", "",
"", "510 109", "", "", "A - D", "", "", "A", "", "", "A", "",
"", "A - E"), page = c("4", "4", "4", "4", "5", "5", "5", "5",
"5", "5", "5", "5", "5", "5", "5", "5"), Plage = c("A - B", NA,
NA, "A - B", NA, NA, "A - D", NA, NA, "A", NA, NA, "A", NA, NA,
"A - E"), `Code SV` = c("510 108", NA, NA, "510 109", NA, NA,
"510 110", NA, NA, "510 111", NA, NA, "510 112", NA, NA, "510 113"
), `Nbre BVD` = c("2", NA, NA, "2", NA, NA, "4", NA, NA, "1",
NA, NA, "1", NA, NA, "5"), `Nbre CV` = c("1", NA, NA, "1", NA,
NA, "1", NA, NA, "1", NA, NA, "1", NA, NA, "1")), class = "data.frame", row.names = c(NA,
-16L))
That's basically Chris Ruehlemann's answer (I don't know why he removed it; I would gladly delete this one in favor of the original):
library(dplyr)
extractstack %>%
mutate(across(everything(), as.character),
X6 = coalesce(ifelse(X6 == `Nbre BVD` | X6 == `Nbre CV`, "", X6), X6))
This compares X6 with the columns Nbre BVD and Nbre CV. If there is matching content, X6 is changed to an empty string ""; otherwise X6 stays unchanged. Where Nbre BVD and Nbre CV are NA, the comparison returns NA and coalesce() falls back to the original X6. For your given data, the only matches are the rows where X6 and Nbre CV are both "1" (rows 7, 10, 13 and 16), so only those rows are blanked.
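The role of coalesce() is easiest to see on a small example. The data frame toy below is hypothetical and only reuses the column names from the question; it is a sketch of how the NA comparisons behave, not a replacement for the code above.

```r
library(dplyr)

# Hypothetical rows: a match on Nbre CV, two rows with NA, a match on Nbre BVD
toy <- data.frame(
  X6         = c("1", "BOGBASA", "", "4"),
  `Nbre BVD` = c("2", NA, NA, "4"),
  `Nbre CV`  = c("1", NA, NA, "1"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Comparing with NA yields NA, so ifelse() returns NA for rows 2 and 3
step1 <- ifelse(toy$X6 == toy$`Nbre BVD` | toy$X6 == toy$`Nbre CV`, "", toy$X6)

# coalesce() fills those NA results with the original X6 values
result <- toy %>%
  mutate(X6 = coalesce(step1, X6))
```

Without the coalesce() step, every row with NA in both comparison columns would end up as NA in X6.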

loop through a r dataframe and pass rows as parameters to a function

I want to loop through a dataframe and pass the rows as arguments to a function to summarise the totals from a dataframe named df3.
I have tried code using a traditional for loop, but it produces no results.
I have looked at pmap in https://adv-r.hadley.nz/functionals.html#pmap
but I don't see how to apply this example to my code.
Here is some data from the original data:
dput(head(df3,n=3))
structure(list(id = c("81", "83", "85"), look_work = c("yes",
"yes", "yes"), current_work = c("no", "yes", "no"), hf_l5k = c("",
"", ""), ac_l5k = c("", "", ""), hf_5_10k = c("", "1", "1"),
ac_5_10k = c("", "1", "1"), hf_11_20k = c("", "", ""), ac_11_20k = c("",
"", ""), hf_21_50k = c("", "", ""), ac_21_50k = c("", "",
""), hf_51_100k = c("", "", ""), ac_51_100k = c("", "", ""
), hf_m100k = c("", "", ""), ac_m100k = c("", "", ""), s_l1000 = c("",
"", ""), se_l1000 = c("", "", "1"), s_1001_1500 = c("", "1",
"1"), se_1001_1500 = c("", "", ""), s_2001_3000 = c("", "",
""), se_2001_3000 = c("", "1", ""), s_3001_4000 = c("", "",
""), se_3001_4000 = c("", "", ""), s_4001_5000 = c("", "",
""), se_4001_5000 = c("", "", ""), s_5001_6000 = c("", "",
""), se_5001_6000 = c("", "", ""), s_m6000 = c("", "", ""
), se_m6000 = c("", "", ""), s_n_ans = c("", "", ""), se_n_ans = c("",
"", ""), before_work = c("no", "NULL", "yes"), keen_move = c("yes",
"yes", "no"), city_size = c("village", "more than 500k inhabitants",
"more than 500k inhabitants"), gender = c("male", "female",
"female"), age = c("18 - 24 years", "18 - 24 years", "more than 50 years"
), education = c("secondary", "vocational", "secondary")), row.names = c(NA,
3L), class = "data.frame")
Here is the dataframe hf_names for the parameters:
structure(list(hf_names = c("hf_l5k", "hf_5_10k", "hf_11_20k",
"hf_21_50k", "hf_51_100k", "hf_m100k"), job = c("hf_l5k_job",
"hf_5_10k_job", "hf_11_20k_job", "hf_21_50k_job", "hf_51_100k_job",
"hf_m100k_job"), tot = c("hf_l5k_tot", "hf_5_10k_tot", "hf_11_20k_tot",
"hf_21_50k_tot", "hf_51_100k_tot", "hf_m100k_tot")), class = "data.frame", row.names = c(NA,
-6L))
Here is the code I have tried with a traditional for loop:
library(dplyr)
tot_function <- function(df, filter_tot, col_name1, col_name2) {
# filter desired columns for all jobs
filter_tot <- df %>% filter(col_name1=="1") %>%
summarise(col_name2 = n())
}
for (i in seq_along(hf_names3)) {
tot_function(df3, hf_names3$tot[i], hf_names3$hf_names[i], hf_names3$job[i])
}
The expected results would be dataframes or vectors:
hf_l5k_jobs hf_l5_10k_jobs
10 193
but nothing is generated by this code as it looks at simple functions such as trim and runif.
I don't think you need to overcomplicate this. You can take names from hf_names, subset that column from df3 and count the number of 1's in that column.
sapply(hf_names$hf_names, function(x) sum(df3[[x]] == 1))
# hf_l5k hf_5_10k hf_11_20k hf_21_50k hf_51_100k hf_m100k
# 0 2 0 0 0 0
If you prefer the tidyverse, you can replace sapply with one of the map_* variants:
purrr::map_int(hf_names$hf_names, ~sum(df3[[.]] == 1))
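If a loop with a helper function is still preferred, the following is a minimal sketch of how the original attempt could be made to work. It assumes three changes: the column name is looked up with .data[[ ]] instead of being compared as a literal string, the loop runs over the rows of the parameter table rather than its columns, and each result is stored. The names df3_toy, hf_toy and count_ones are made up for this example, and the toy values come from the first two hf columns in the question.

```r
library(dplyr)

# Toy stand-ins for df3 and hf_names from the question
df3_toy <- data.frame(
  hf_l5k   = c("", "", ""),
  hf_5_10k = c("", "1", "1"),
  stringsAsFactors = FALSE
)
hf_toy <- data.frame(
  hf_names = c("hf_l5k", "hf_5_10k"),
  job      = c("hf_l5k_job", "hf_5_10k_job"),
  stringsAsFactors = FALSE
)

# Count the rows where the column named by col_name equals "1"
count_ones <- function(df, col_name) {
  df %>%
    filter(.data[[col_name]] == "1") %>%
    nrow()
}

# Loop over the rows of the parameter table and keep each result by name
totals <- integer(0)
for (i in seq_len(nrow(hf_toy))) {
  totals[hf_toy$job[i]] <- count_ones(df3_toy, hf_toy$hf_names[i])
}
```

The result is a named integer vector with one entry per row of the parameter table.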

How to read csv file for text mining

I will be using tm for text mining. However, my CSV file is weird. Below is the dput output after I used the read.table function in R. There are three columns: lie, sentiment and review. However, the fourth column contains review text with no column name. I am new to R and text mining. If I use read.csv, I get an error. Please suggest a better approach for reading the CSV file.
Update:
> dput(head(df))
structure(list(V1 = c("lie,sentiment,review", "f,n,'Mike\\'s",
"f,n,'i", "f,n,'After", "f,n,'Olive", "f,n,'I"), V2 = c("", "Pizza",
"really", "I", "Oil", "went"), V3 = c("", "High", "like", "went",
"Garden", "to"), V4 = c("", "Point,", "this", "shopping", "was",
"the"), V5 = c("", "NY", "buffet", "with", "very", "Chilis"),
V6 = c("", "Service", "restaurant", "some", "disappointing.",
"on"), V7 = c("", "was", "in", "of", "I", "Erie"), V8 = c("",
"very", "Marshall", "my", "expect", "Blvd"), V9 = c("", "slow",
"street.", "friend,", "good", "and"), V10 = c("", "and",
"they", "we", "food", "had"), V11 = c("", "the", "have",
"went", "and", "the"), V12 = c("", "quality", "a", "to",
"good", "worst"), V13 = c("", "was", "lot", "DODO", "service",
"meal"), V14 = c("", "low.", "of", "restaurant", "(at", "of"
), V15 = c("", "You", "selection", "for", "least!!)", "my"
), V16 = c("", "would", "of", "dinner.", "when", "life."),
V17 = c("", "think", "american,", "I", "I", "We"), V18 = c("",
"they", "japanese,", "found", "go", "arrived"), V19 = c("",
"would", "and", "worm", "out", "and"), V20 = c("", "know",
"chinese", "in", "to", "waited"), V21 = c("", "at", "dishes.",
"one", "eat.", "5"), V22 = c("", "least", "we", "of", "The",
"minutes"), V23 = c("", "how", "also", "the", "meal", "for"
), V24 = c("", "to", "got", "dishes", "was", "a"), V25 = c("",
"make", "a", ".'", "cold", "hostess,"), V26 = c("", "good",
"free", "", "when", "and"), V27 = c("", "pizza,", "drink",
"", "we", "then"), V28 = c("", "not.", "and", "", "got",
"were"), V29 = c("", "Stick", "free", "", "it,", "seated"
), V30 = c("", "to", "refill.", "", "and", "by"), V31 = c("",
"pre-made", "there", "", "the", "a"), V32 = c("", "dishes",
"are", "", "waitor", "waiter"), V33 = c("", "like", "also",
"", "had", "who"), V34 = c("", "stuffed", "different", "",
"no", "was"), V35 = c("", "pasta", "kinds", "", "manners",
"obviously"), V36 = c("", "or", "of", "", "whatsoever.",
"in"), V37 = c("", "a", "dessert.", "", "Don\\'t", "a"),
V38 = c("", "salad.", "the", "", "go", "terrible"), V39 = c("",
"You", "staff", "", "to", "mood."), V40 = c("", "should",
"is", "", "the", "We"), V41 = c("", "consider", "very", "",
"Olive", "order"), V42 = c("", "dining", "friendly.", "",
"Oil", "drinks"), V43 = c("", "else", "it", "", "Garden.",
"and"), V44 = c("", "where.'", "is", "", "\nf,n,", "it"),
V45 = c("", "", "also", "", "The", "took"), V46 = c("", "",
"quite", "", "Seven", "them"), V47 = c("", "", "cheap", "",
"Heaven", "15"), V48 = c("", "", "compared", "", "restaurant",
"minutes"), V49 = c("", "", "with", "", "was", "to"), V50 = c("",
"", "the", "", "never", "bring"), V51 = c("", "", "other",
"", "known", "us"), V52 = c("", "", "restaurant", "", "for",
"both"), V53 = c("", "", "in", "", "a", "the"), V54 = c("",
"", "syracuse", "", "superior", "wrong"), V55 = c("", "",
"area.", "", "service", "beers"), V56 = c("", "", "i", "",
"but", "which"), V57 = c("", "", "will", "", "what", "were"
), V58 = c("", "", "definitely", "", "we", "barely"), V59 = c("",
"", "coming", "", "experienced", "cold."), V60 = c("", "",
"back", "", "last", "Then"), V61 = c("", "", "here.'", "",
"week", "we"), V62 = c("", "", "", "", "was", "order"), V63 = c("",
"", "", "", "a", "an"), V64 = c("", "", "", "", "disaster.",
"appetizer"), V65 = c("", "", "", "", "The", "and"), V66 = c("",
"", "", "", "waiter", "wait"), V67 = c("", "", "", "", "would",
"25"), V68 = c("", "", "", "", "not", "minutes"), V69 = c("",
"", "", "", "notice", "for"), V70 = c("", "", "", "", "us",
"cold"), V71 = c("", "", "", "", "until", "southwest"), V72 = c("",
"", "", "", "we", "egg"), V73 = c("", "", "", "", "asked",
"rolls,"), V74 = c("", "", "", "", "him", "at"), V75 = c("",
"", "", "", "4", "which"), V76 = c("", "", "", "", "times",
"point"), V77 = c("", "", "", "", "to", "we"), V78 = c("",
"", "", "", "bring", "just"), V79 = c("", "", "", "", "us",
"paid"), V80 = c("", "", "", "", "the", "and"), V81 = c("",
"", "", "", "menu.", "left."), V82 = c("", "", "", "", "The",
"Don\\'t"), V83 = c("", "", "", "", "food", "go.'"), V84 = c("",
"", "", "", "was", ""), V85 = c("", "", "", "", "not", ""
), V86 = c("", "", "", "", "exceptional", ""), V87 = c("",
"", "", "", "either.", ""), V88 = c("", "", "", "", "It",
""), V89 = c("", "", "", "", "took", ""), V90 = c("", "",
"", "", "them", ""), V91 = c("", "", "", "", "though", ""
), V92 = c("", "", "", "", "2", ""), V93 = c("", "", "",
"", "minutes", ""), V94 = c("", "", "", "", "to", ""), V95 = c("",
"", "", "", "bring", ""), V96 = c("", "", "", "", "us", ""
), V97 = c("", "", "", "", "a", ""), V98 = c("", "", "",
"", "check", ""), V99 = c("", "", "", "", "after", ""), V100 = c("",
"", "", "", "they", ""), V101 = c("", "", "", "", "spotted",
""), V102 = c("", "", "", "", "we", ""), V103 = c("", "",
"", "", "finished", ""), V104 = c("", "", "", "", "eating",
""), V105 = c("", "", "", "", "and", ""), V106 = c("", "",
"", "", "are", ""), V107 = c("", "", "", "", "not", ""),
V108 = c("", "", "", "", "ordering", ""), V109 = c("", "",
"", "", "more.", ""), V110 = c("", "", "", "", "Well,", ""
), V111 = c("", "", "", "", "never", ""), V112 = c("", "",
"", "", "more.", ""), V113 = c("", "", "", "", "\nf,n,",
""), V114 = c("", "", "", "", "I", ""), V115 = c("", "",
"", "", "went", ""), V116 = c("", "", "", "", "to", ""),
V117 = c("", "", "", "", "XYZ", ""), V118 = c("", "", "",
"", "restaurant", ""), V119 = c("", "", "", "", "and", ""
), V120 = c("", "", "", "", "had", ""), V121 = c("", "",
"", "", "a", ""), V122 = c("", "", "", "", "terrible", ""
), V123 = c("", "", "", "", "experience.", ""), V124 = c("",
"", "", "", "I", ""), V125 = c("", "", "", "", "had", ""),
V126 = c("", "", "", "", "a", ""), V127 = c("", "", "", "",
"YELP", ""), V128 = c("", "", "", "", "Free", ""), V129 = c("",
"", "", "", "Appetizer", ""), V130 = c("", "", "", "", "coupon",
""), V131 = c("", "", "", "", "which", ""), V132 = c("",
"", "", "", "could", ""), V133 = c("", "", "", "", "be",
""), V134 = c("", "", "", "", "applied", ""), V135 = c("",
"", "", "", "upon", ""), V136 = c("", "", "", "", "checking",
""), V137 = c("", "", "", "", "in", ""), V138 = c("", "",
"", "", "to", ""), V139 = c("", "", "", "", "the", ""), V140 = c("",
"", "", "", "restaurant.", ""), V141 = c("", "", "", "",
"The", ""), V142 = c("", "", "", "", "person", ""), V143 = c("",
"", "", "", "serving", ""), V144 = c("", "", "", "", "us",
""), V145 = c("", "", "", "", "was", ""), V146 = c("", "",
"", "", "very", ""), V147 = c("", "", "", "", "rude", ""),
V148 = c("", "", "", "", "and", ""), V149 = c("", "", "",
"", "didn\\'t", ""), V150 = c("", "", "", "", "acknowledge",
""), V151 = c("", "", "", "", "the", ""), V152 = c("", "",
"", "", "coupon.", ""), V153 = c("", "", "", "", "When",
""), V154 = c("", "", "", "", "I", ""), V155 = c("", "",
"", "", "asked", ""), V156 = c("", "", "", "", "her", ""),
V157 = c("", "", "", "", "about", ""), V158 = c("", "", "",
"", "it,", ""), V159 = c("", "", "", "", "she", ""), V160 = c("",
"", "", "", "rudely", ""), V161 = c("", "", "", "", "replied",
""), V162 = c("", "", "", "", "back", ""), V163 = c("", "",
"", "", "saying", ""), V164 = c("", "", "", "", "she", ""
), V165 = c("", "", "", "", "had", ""), V166 = c("", "",
"", "", "already", ""), V167 = c("", "", "", "", "applied",
""), V168 = c("", "", "", "", "it.", ""), V169 = c("", "",
"", "", "Then", ""), V170 = c("", "", "", "", "I", ""), V171 = c("",
"", "", "", "inquired", ""), V172 = c("", "", "", "", "about",
""), V173 = c("", "", "", "", "the", ""), V174 = c("", "",
"", "", "free", ""), V175 = c("", "", "", "", "salad", ""
), V176 = c("", "", "", "", "that", ""), V177 = c("", "",
"", "", "they", ""), V178 = c("", "", "", "", "serve.", ""
), V179 = c("", "", "", "", "She", ""), V180 = c("", "",
"", "", "rudely", ""), V181 = c("", "", "", "", "said", ""
), V182 = c("", "", "", "", "that", ""), V183 = c("", "",
"", "", "you", ""), V184 = c("", "", "", "", "have", ""),
V185 = c("", "", "", "", "to", ""), V186 = c("", "", "",
"", "order", ""), V187 = c("", "", "", "", "the", ""), V188 = c("",
"", "", "", "main", ""), V189 = c("", "", "", "", "course",
""), V190 = c("", "", "", "", "to", ""), V191 = c("", "",
"", "", "get", ""), V192 = c("", "", "", "", "that.", ""),
V193 = c("", "", "", "", "Overall,", ""), V194 = c("", "",
"", "", "I", ""), V195 = c("", "", "", "", "had", ""), V196 = c("",
"", "", "", "a", ""), V197 = c("", "", "", "", "bad", ""),
V198 = c("", "", "", "", "experience", ""), V199 = c("",
"", "", "", "as", ""), V200 = c("", "", "", "", "I", ""),
V201 = c("", "", "", "", "had", ""), V202 = c("", "", "",
"", "taken", ""), V203 = c("", "", "", "", "my", ""), V204 = c("",
"", "", "", "family", ""), V205 = c("", "", "", "", "to",
""), V206 = c("", "", "", "", "that", ""), V207 = c("", "",
"", "", "restaurant", ""), V208 = c("", "", "", "", "for",
""), V209 = c("", "", "", "", "the", ""), V210 = c("", "",
"", "", "first", ""), V211 = c("", "", "", "", "time", ""
), V212 = c("", "", "", "", "and", ""), V213 = c("", "",
"", "", "I", ""), V214 = c("", "", "", "", "had", ""), V215 = c("",
"", "", "", "high", ""), V216 = c("", "", "", "", "hopes",
""), V217 = c("", "", "", "", "from", ""), V218 = c("", "",
"", "", "the", ""), V219 = c("", "", "", "", "restaurant",
""), V220 = c("", "", "", "", "which", ""), V221 = c("",
"", "", "", "is,", ""), V222 = c("", "", "", "", "otherwise,",
""), V223 = c("", "", "", "", "my", ""), V224 = c("", "",
"", "", "favorite", ""), V225 = c("", "", "", "", "place",
""), V226 = c("", "", "", "", "to", ""), V227 = c("", "",
"", "", "dine.", ""), V228 = c("", "", "", "", "\nf,n,",
""), V229 = c("", "", "", "", "I", ""), V230 = c("", "",
"", "", "went", ""), V231 = c("", "", "", "", "to", ""),
V232 = c("", "", "", "", "ABC", ""), V233 = c("", "", "",
"", "restaurant", ""), V234 = c("", "", "", "", "two", ""
), V235 = c("", "", "", "", "days", ""), V236 = c("", "",
"", "", "ago", ""), V237 = c("", "", "", "", "and", ""),
V238 = c("", "", "", "", "I", ""), V239 = c("", "", "", "",
"hated", ""), V240 = c("", "", "", "", "the", ""), V241 = c("",
"", "", "", "food", ""), V242 = c("", "", "", "", "and",
""), V243 = c("", "", "", "", "the", ""), V244 = c("", "",
"", "", "service.", ""), V245 = c("", "", "", "", "We", ""
), V246 = c("", "", "", "", "were", ""), V247 = c("", "",
"", "", "kept", ""), V248 = c("", "", "", "", "waiting",
""), V249 = c("", "", "", "", "for", ""), V250 = c("", "",
"", "", "over", ""), V251 = c("", "", "", "", "an", ""),
V252 = c("", "", "", "", "hour", ""), V253 = c("", "", "",
"", "just", ""), V254 = c("", "", "", "", "to", ""), V255 = c("",
"", "", "", "get", ""), V256 = c("", "", "", "", "seated",
""), V257 = c("", "", "", "", "and", ""), V258 = c("", "",
"", "", "once", ""), V259 = c("", "", "", "", "we", ""),
V260 = c("", "", "", "", "ordered,", ""), V261 = c("", "",
"", "", "our", ""), V262 = c("", "", "", "", "food", ""),
V263 = c("", "", "", "", "came", ""), V264 = c("", "", "",
"", "out", ""), V265 = c("", "", "", "", "cold.", ""), V266 = c("",
"", "", "", "I", ""), V267 = c("", "", "", "", "ordered",
""), V268 = c("", "", "", "", "the", ""), V269 = c("", "",
"", "", "pasta", ""), V270 = c("", "", "", "", "and", ""),
V271 = c("", "", "", "", "it", ""), V272 = c("", "", "",
"", "was", ""), V273 = c("", "", "", "", "terrible", ""),
V274 = c("", "", "", "", "-", ""), V275 = c("", "", "", "",
"completely", ""), V276 = c("", "", "", "", "bland", ""),
V277 = c("", "", "", "", "and", ""), V278 = c("", "", "",
"", "very", ""), V279 = c("", "", "", "", "unappatizing.",
""), V280 = c("", "", "", "", "I", ""), V281 = c("", "",
"", "", "definitely", ""), V282 = c("", "", "", "", "would",
""), V283 = c("", "", "", "", "not", ""), V284 = c("", "",
"", "", "recommend", ""), V285 = c("", "", "", "", "going",
""), V286 = c("", "", "", "", "there,", ""), V287 = c("",
"", "", "", "especially", ""), V288 = c("", "", "", "", "if",
""), V289 = c("", "", "", "", "you\\'re", ""), V290 = c("",
"", "", "", "in", ""), V291 = c("", "", "", "", "a", ""),
V292 = c("", "", "", "", "hurry!'", "")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11",
"V12", "V13", "V14", "V15", "V16", "V17", "V18", "V19", "V20",
"V21", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29",
"V30", "V31", "V32", "V33", "V34", "V35", "V36", "V37", "V38",
"V39", "V40", "V41", "V42", "V43", "V44", "V45", "V46", "V47",
"V48", "V49", "V50", "V51", "V52", "V53", "V54", "V55", "V56",
"V57", "V58", "V59", "V60", "V61", "V62", "V63", "V64", "V65",
"V66", "V67", "V68", "V69", "V70", "V71", "V72", "V73", "V74",
"V75", "V76", "V77", "V78", "V79", "V80", "V81", "V82", "V83",
"V84", "V85", "V86", "V87", "V88", "V89", "V90", "V91", "V92",
"V93", "V94", "V95", "V96", "V97", "V98", "V99", "V100", "V101",
"V102", "V103", "V104", "V105", "V106", "V107", "V108", "V109",
"V110", "V111", "V112", "V113", "V114", "V115", "V116", "V117",
"V118", "V119", "V120", "V121", "V122", "V123", "V124", "V125",
"V126", "V127", "V128", "V129", "V130", "V131", "V132", "V133",
"V134", "V135", "V136", "V137", "V138", "V139", "V140", "V141",
"V142", "V143", "V144", "V145", "V146", "V147", "V148", "V149",
"V150", "V151", "V152", "V153", "V154", "V155", "V156", "V157",
"V158", "V159", "V160", "V161", "V162", "V163", "V164", "V165",
"V166", "V167", "V168", "V169", "V170", "V171", "V172", "V173",
"V174", "V175", "V176", "V177", "V178", "V179", "V180", "V181",
"V182", "V183", "V184", "V185", "V186", "V187", "V188", "V189",
"V190", "V191", "V192", "V193", "V194", "V195", "V196", "V197",
"V198", "V199", "V200", "V201", "V202", "V203", "V204", "V205",
"V206", "V207", "V208", "V209", "V210", "V211", "V212", "V213",
"V214", "V215", "V216", "V217", "V218", "V219", "V220", "V221",
"V222", "V223", "V224", "V225", "V226", "V227", "V228", "V229",
"V230", "V231", "V232", "V233", "V234", "V235", "V236", "V237",
"V238", "V239", "V240", "V241", "V242", "V243", "V244", "V245",
"V246", "V247", "V248", "V249", "V250", "V251", "V252", "V253",
"V254", "V255", "V256", "V257", "V258", "V259", "V260", "V261",
"V262", "V263", "V264", "V265", "V266", "V267", "V268", "V269",
"V270", "V271", "V272", "V273", "V274", "V275", "V276", "V277",
"V278", "V279", "V280", "V281", "V282", "V283", "V284", "V285",
"V286", "V287", "V288", "V289", "V290", "V291", "V292"), row.names = c(NA,
6L), class = "data.frame")
Dataset:
lie sentiment review
f n 'Mike\'s Pizza High Point NY Service was very slow and the quality was low. You would think they would know at least how to make good pizza not. Stick to pre-made dishes like stuffed pasta or a salad. You should consider dining else where.'
f n 'i really like this buffet restaurant in Marshall street. they have a lot of selection of american japanese and chinese dishes. we also got a free drink and free refill. there are also different kinds of dessert. the staff is very friendly. it is also quite cheap compared with the other restaurant in syracuse area. i will definitely coming back here.'
f n 'After I went shopping with some of my friend we went to DODO restaurant for dinner. I found worm in one of the dishes .'
f n 'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
f n 'The Seven Heaven restaurant was never known for a superior service but what we experienced last week was a disaster. The waiter would not notice us until we asked him 4 times to bring us the menu. The food was not exceptional either. It took them though 2 minutes to bring us a check after they spotted we finished eating and are not ordering more. Well never more. '
f n 'I went to XYZ restaurant and had a terrible experience. I had a YELP Free Appetizer coupon which could be applied upon checking in to the restaurant. The person serving us was very rude and didn\'t acknowledge the coupon. When I asked her about it she rudely replied back saying she had already applied it. Then I inquired about the free salad that they serve. She rudely said that you have to order the main course to get that. Overall I had a bad experience as I had taken my family to that restaurant for the first time and I had high hopes from the restaurant which is otherwise my favorite place to dine. '
f n 'I went to ABC restaurant two days ago and I hated the food and the service. We were kept waiting for over an hour just to get seated and once we ordered our food came out cold. I ordered the pasta and it was terrible - completely bland and very unappatizing. I definitely would not recommend going there especially if you\'re in a hurry!'
f n 'I went to the Chilis on Erie Blvd and had the worst meal of my life. We arrived and waited 5 minutes for a hostess and then were seated by a waiter who was obviously in a terrible mood. We order drinks and it took them 15 minutes to bring us both the wrong beers which were barely cold. Then we order an appetizer and wait 25 minutes for cold southwest egg rolls at which point we just paid and left. Don\'t go.'
f n 'OMG. This restaurant is horrible. The receptionist did not greet us we just stood there and waited for five minutes. The food came late and served not warm. Me and my pet ordered a bowl of salad and a cheese pizza. The salad was not fresh the crust of a pizza was so hard like plastics. My dog didn\'t even eat that pizza. I hate this place!!!!!!!!!!'
Thanks in advance,
I don't know why you removed the file from the original post, #Yes Boss, but this answer is based on that file rather than on your dput output. The file had two problems that kept you from reading it in: 1. Your quote character was ' instead of the more common "; 2. ' is also used inside the review column, which is too much for base R (it tries to split the text into new columns in these instances). Luckily, the data.table package is a bit smarter and can take care of problem #2:
library(data.table)
df <- fread(file = "deception.csv", quote="\'")
The resulting object will be a data.table instead of a data.frame:
> str(df)
Classes ‘data.table’ and 'data.frame': 92 obs. of 3 variables:
$ lie : chr "f" "f" "f" "f" ...
$ sentiment: chr "n" "n" "n" "n" ...
$ review : chr "Mike\\'s Pizza High Point, NY Service was very slow and the quality was low. You would think they would know at"| __truncated__ "i really like this buffet restaurant in Marshall street. they have a lot of selection of american, japanese, an"| __truncated__ "After I went shopping with some of my friend, we went to DODO restaurant for dinner. I found worm in one of the dishes ." "Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat."| __truncated__ ...
- attr(*, ".internal.selfref")=<externalptr>
If you would rather get a plain data.frame back, set data.table = FALSE in fread() (although I recommend learning how to work with data.table).
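A minimal sketch of that option, under stated assumptions: the two rows and the temporary file below are made up for illustration and are not taken from deception.csv. The snippet writes a small single-quoted CSV, then reads it back with quote = "'" and data.table = FALSE.

```r
library(data.table)

# Hypothetical stand-in for deception.csv: the review text is wrapped in
# single quotes and contains commas inside the quotes.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "lie,sentiment,review",
  "f,n,'Service was very slow, and the quality was low.'",
  "f,n,'The meal was cold, and the waiter was rude.'"
), tmp)

# quote = "'" makes the single quote the quoting character;
# data.table = FALSE asks fread() for a plain data.frame.
df <- fread(file = tmp, sep = ",", quote = "'", data.table = FALSE)

class(df)
str(df)
```

With data.table = FALSE the object no longer carries the data.table class, so it can be passed to code that expects an ordinary data.frame.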
A personal opinionated note: If you want to get into text mining, look into the quanteda package instead of tm. It is a lot faster and has a more modern approach to many tasks.
For this particular text file, you need to look at the quote argument. In read.table(), the default is quote = "\"'", so both single and double quotes are treated as quoting characters. Here you need to restrict it to just the single quote:
df <- read.table("filename", header = TRUE, quote = "\'")
str(df)
# 'data.frame': 9 obs. of 3 variables:
# $ lie : Factor w/ 1 level "f": 1 1 1 1 1 1 1 1 1
# $ sentiment: Factor w/ 1 level "n": 1 1 1 1 1 1 1 1 1
# $ review : Factor w/ 9 levels "After I went shopping with some of my friend we went to DODO restaurant for dinner. I found worm in one of the dishes .",..: 6 2 1 7 9 5 3 4 8
That should do it for you.
I'd recommend reading the help file for read.table() (all the way through). There's a lot to consider.
