Problem with mutate when trying to create a line_id column - r

I need to create a line ID column within a dataframe for further pre-processing steps. The code worked fine up until yesterday. Today, however, I get this error message:
"Error in mutate():
ℹ In argument: line_id = (function (x, y) ....
Caused by error:
! Can't convert y to match type of x ."
Here is my code - the dataframe consists of two character columns:
split_text <- raw_text %>%
mutate(text = enframe(strsplit(text, split = "\n", ))) %>%
unnest(cols = c(text)) %>%
unnest(cols = c(value)) %>%
rename(text_raw = value) %>%
select(-name) %>%
mutate(doc_id = str_remove(doc_id, ".txt")) %>%
# removing empty rows + add line_id
mutate(line_id = row_number())
Besides row_number(), I also tried rowid_to_column, and even c(1:1000) - the length of the dataframe. The error message stays the same.

Try explicitly specifying the data type of the "line_id" column as an integer using the as.integer() function, like this:
mutate(line_id = as.integer(row_number()))

This code works but is not fully satisfying, since I have to break the pipe:
split_text$line_id <- as.integer(c(1:nrow(split_text)))
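
If breaking the pipe is the main annoyance, one option that may sidestep the nested enframe()/unnest() steps is tidyr::separate_rows(). This is a minimal sketch, assuming raw_text has the character columns doc_id and text; the toy data below is made up for illustration, and it has not been checked against the original data, so whether it avoids the error there is an assumption.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical stand-in for raw_text: two character columns
raw_text <- tibble(
  doc_id = c("a.txt", "b.txt"),
  text   = c("first line\nsecond line", "only line")
)

split_text <- raw_text %>%
  # one row per line of text
  separate_rows(text, sep = "\n") %>%
  rename(text_raw = text) %>%
  # fixed() treats ".txt" literally instead of as a regular expression
  mutate(doc_id = str_remove(doc_id, fixed(".txt"))) %>%
  # numbering happens inside the pipe, on a plain character column
  mutate(line_id = row_number())
```

Because the text column stays a plain character vector throughout, there is no nested data frame column for mutate() to reconcile when line_id is added.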

Related

Errors converting Character to Numeric R

I want to convert a character column to numeric so I can calculate the mean of BasePay. However, I keep getting different errors.
I use this code:
dataset <- read.csv("Wagegap.csv")
SFWage <- dataset %>%
as.numeric(dataset$BasePay)%>%
group_by(gender,JobTitle, Year) %>%
summarise(averageBasePay = mean(BasePay, na.rm=TRUE)) %>%
select(gender, JobTitle, averageBasePay, Year)
clean <- SFWage %>% filter(gender != "")
Without $, my BasePay column is not recognized; with $, I get:
Error in function_list[i] :
'list' object cannot be coerced to type 'double'
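The BasePay column uses "." rather than "," as the decimal separator, so I should not need gsub(), right?
The pipe passes the whole data frame as the first argument to as.numeric(), which is why the coercion fails. Try this before all the piping: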
dataset$BasePay <- as.numeric(dataset$BasePay)
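
The conversion can also stay inside the pipe by wrapping it in mutate(). This is a minimal sketch, assuming BasePay was read in as character; the small data frame below is invented for illustration and only mirrors the column names from the question.

```r
library(dplyr)

# Made-up stand-in for the Wagegap.csv data
dataset <- data.frame(
  gender   = c("F", "M", "F", ""),
  JobTitle = c("Clerk", "Clerk", "Clerk", "Clerk"),
  Year     = c(2014, 2014, 2014, 2014),
  BasePay  = c("1000.50", "1200.00", "1100.50", "900.00"),
  stringsAsFactors = FALSE
)

SFWage <- dataset %>%
  # convert the column in place; non-numeric strings become NA with a warning
  mutate(BasePay = as.numeric(BasePay)) %>%
  filter(gender != "") %>%
  group_by(gender, JobTitle, Year) %>%
  summarise(averageBasePay = mean(BasePay, na.rm = TRUE), .groups = "drop")
```

If the column turns out to be a factor rather than character, converting with as.numeric(as.character(BasePay)) avoids getting the internal level codes instead of the values.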

How to convert the JSON columns genres, production companies and cast into a tibble or data frame?

https://www.kaggle.com/rounakbanik/the-movies-dataset
This is the dataset I was using.
I tried the following code in R:
m_genres = movies %>%
  filter(nchar(genres) > 2) %>%
  mutate(js = lapply(genres, fromJSON)) %>%
  unnest(js) %>%
  select(id, title, genre = name) %>%
  group_by(title) %>%
  mutate(pos = 1:n()) %>%
  ungroup()
m_production_companies = movies %>%
  filter(nchar(production_companies) > 2) %>%
  mutate(js = lapply(production_companies, fromJSON)) %>%
  unnest(js) %>%
  select(id, title, production_company = name) %>%
  group_by(title) %>%
  mutate(pos = 1:n()) %>%
  ungroup()
kable(m_genres[1:10,])
# Convert JSON format into data frame
genredf = movies %>%
  filter(nchar(genres) > 2) %>%
  mutate(js = lapply(genres, fromJSON)) %>%
  unnest(js) %>%
  select(id, title, genre = name)
slice(genredf)
This is the error I got:
Error in FUN(X[[i]], ...) :
unexpected character "'"; expecting opening string quote (") for key value
I'm trying to run a logistic regression, so I would like to look at how these details affect a movie's budget or production companies.
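
The error message suggests the column uses single quotes where JSON requires double quotes. The following is only a rough sketch under that assumption: the example string is invented in that style rather than taken from the dataset, and the naive quote swap will break for any value that itself contains an apostrophe.

```r
library(jsonlite)

# Invented value in the style the error message suggests
# (single-quoted keys and strings instead of JSON double quotes)
genres_raw <- "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# Naive fix: swap single quotes for double quotes before parsing.
# This fails for values that contain an apostrophe of their own.
genres_json <- gsub("'", "\"", genres_raw, fixed = TRUE)

# fromJSON() turns an array of objects into a data frame
genres_df <- fromJSON(genres_json)
```

The same gsub() call could be applied inside the lapply() step before fromJSON(), but rows with apostrophes in names would need separate handling.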

R - mutate columns with different formats

I'm trying to analyse multiple csv files, and in order to create a key that can be used for left_join I think I need to merge two columns. At present I'm trying to use the tidyverse packages (including mutate), but I'm running into an issue because the two columns to merge have different formats: one is a double and the other is in date format. I'm using the following code:
qlik2 <- qlik %>%
separate('Admit DateTime', into = c('Admit Date', 'Admit Time'), sep = 10) %>%
mutate(key = MRN + `Admit Date`)
and getting this error:
Error in mutate_impl(.data, dots) :
Evaluation error: non-numeric argument to binary operator.
If there's another way around this (or if the error is actually related to something else), then I'd appreciate any thoughts on the matter. Equally, if people know of a way to left_join with multiple keys, then that would work as well.
Thanks,
Cal
It is hard to say without a reproducible example. But if I understand your question, you either want a numeric key or are trying to concatenate strings with +.
Numeric key
library(hablar)
qlik2 <- qlik %>%
separate('Admit DateTime',
into = c('Admit Date', 'Admit Time'),
sep = 10) %>%
convert(num(MRN, `Admit Date`)) %>%
mutate(key = MRN + `Admit Date`)
String key
qlik2 <- qlik %>%
separate('Admit DateTime',
into = c('Admit Date', 'Admit Time'),
sep = 10) %>%
mutate(key = paste(MRN, `Admit Date`))
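
On the question of joining with multiple keys: left_join() accepts a vector of column names in by, so a combined key column may not be needed at all. This is a minimal sketch with invented tables; only the column names MRN and Admit Date come from the question, everything else is hypothetical.

```r
library(dplyr)

# Invented example tables; key column names mirror the question
admissions <- tibble(
  MRN          = c(101, 101, 202),
  `Admit Date` = as.Date(c("2018-01-05", "2018-03-10", "2018-01-05")),
  ward         = c("A", "B", "C")
)
outcomes <- tibble(
  MRN          = c(101, 202),
  `Admit Date` = as.Date(c("2018-03-10", "2018-01-05")),
  outcome      = c("discharged", "transferred")
)

# Join on both columns at once; rows without a match get NA
joined <- admissions %>%
  left_join(outcomes, by = c("MRN", "Admit Date"))
```

Each key column needs a compatible type in both tables, so the Admit Date column produced by separate() would still have to be converted consistently on both sides.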

`gather` can't handle rownames

allcsvs = list.files(pattern = "*.csv$", recursive = TRUE)
library(tidyverse)
##LOOP to redact the snow data csvs##
for(x in 1:length(allcsvs)) {
df = read.csv(allcsvs[x], check.names = FALSE)
newdf = df %>%
gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
mutate(
DATE = as.Date(DATE,format = "%m/%d/%Y"),
COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
) %>%
filter(DATE == COL_DATE) %>%
select(-COL_DATE)
####TURN DATES UNAMBIGUOUS HERE####
df$DATE = lubridate::mdy(df$DATE)
finaldf = merge(newdf, df, all.y = TRUE)
write.csv(finaldf, allcsvs[x])
df = read.csv(allcsvs[x])
newdf = df[, -grep("X20", colnames(df))]
write.csv(newdf, allcsvs[x])
}
I am using the code above to populate a new column row by row with values from different existing columns, using the date as the selection criterion. If I manually open each .csv in Excel and delete the first column, this code works great. However, if I run it on the .csvs "as is", I get the following message:
Error: Column 1 must be named
So far I've tried putting -rownames inside the parentheses of gather, and I've tried putting remove_rownames %>% below newdf = df %>%, but nothing seems to work. I tried reading the csv without the first column [,-1] and deleting the first column in R with df[,1]<-NULL, but for some reason my code then returns an empty table instead of what I want. In other words, if I delete the rownames in Excel it works great, but if I delete them in R something funky happens.
Here is some sample data: https://drive.google.com/file/d/1RiMrx4wOpUdJkN4il6IopciSF6pKeNLr/view?usp=sharing
Consider importing them with readr::read_csv.
An easy solution with tidyverse:
allcsvs %>%
map(read_csv) %>%
reduce(bind_rows) %>%
gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
mutate(
DATE = as.Date(DATE,format = "%m/%d/%Y"),
COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
) %>%
filter(DATE == COL_DATE) %>%
select(-COL_DATE)
With utils::read.csv, you are importing strings as factors. as.Date(DATE, format = "%m/%d/%Y") then evaluates to NA.
Update
The solution above returns a single dataframe. To write each data file separately with the for loop:
for(x in 1:length(allcsvs)) {
read_csv(allcsvs[x]) %>%
gather(COL_DATE, SNOW_DEPTH, -PT_ID, -DATE) %>%
mutate(
COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
) %>%
filter(DATE == COL_DATE) %>%
select(-COL_DATE) %>%
write_csv(paste('tidy', allcsvs[x], sep = '_'))
}
Comparison
purrr::map and purrr::reduce can be used instead of a for loop in some cases. These functions take other functions as arguments.
readr::read_csv is typically about 10x faster than its base R equivalent (more info: http://r4ds.had.co.nz/data-import.html). It also keeps strings as character vectors instead of converting them to factors and leaves column names unchanged.
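
As a side note on where the unnamed first column most likely comes from: write.csv() writes row names by default, and reading such a file back with check.names = FALSE leaves that column without a name. This is a minimal base R sketch of the round trip, using a temporary file and made-up data rather than the real snow files.

```r
# Made-up data standing in for one snow-depth file
df <- data.frame(PT_ID = 1:2, DATE = c("01/05/2017", "01/06/2017"))

path <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an extra first column
write.csv(df, path)
with_rownames <- read.csv(path, check.names = FALSE)

# row.names = FALSE keeps the file free of that extra column
write.csv(df, path, row.names = FALSE)
without_rownames <- read.csv(path, check.names = FALSE)

ncol(with_rownames)      # one more column than the original data
names(without_rownames)  # only the original column names
```

Passing row.names = FALSE to the write.csv() calls in the original loop would be one way to keep the files readable on a second run, assuming the extra column is indeed what gather() trips over.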

How to append a Date filtering condition to my existing filtering codes in R?

I need to create a subset of my main data frame (mydata1) in R.
The Date column in mydata1 has already been formatted as a Date using the following code:
mydata1$Date = as.Date(mydata1$Date)
I currently run the following code to create the subset of my data:
mydata3 <- mydata1 %>%
filter(Total.Extras.Per.GN >= 100) %>%
filter(Original.Meal.Plan.Code %in% target) %>%
filter(Date, between ("2017-01-01"), ("2017-06-01")) %>%
select(PropertyCode, Date, Market, Original.Meal.Plan.Code, GADR, Total.Extras.Per.GN)
However, the line filter(Date, between ("2017-01-01"), ("2017-06-01")) %>% is giving me an error. How do I write it properly so that it filters my Date column with the dates specified therein?
Error message:
Error in filter_impl(.data, dots) :
argument "left" is missing, with no default
Simply pass Date as the first argument to between() and wrap the date strings in as.Date() for comparison:
mydata3 <- mydata1 %>%
filter(Total.Extras.Per.GN >= 100) %>%
filter(Original.Meal.Plan.Code %in% target) %>%
filter(between(Date, as.Date("2017-01-01"), as.Date("2017-06-01"))) %>%
select(PropertyCode, Date, Market, Original.Meal.Plan.Code, GADR, Total.Extras.Per.GN)
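
To see how the boundaries behave, here is a small self-contained sketch with invented data (the column names Date and value are only for illustration). between() is inclusive on both ends, so rows dated exactly on the start or end date are kept.

```r
library(dplyr)

# Invented data with a Date column
mydata <- tibble(
  Date  = as.Date(c("2016-12-31", "2017-01-01", "2017-03-15",
                    "2017-06-01", "2017-06-02")),
  value = 1:5
)

subset_2017 <- mydata %>%
  # between() is inclusive: left <= Date <= right
  filter(between(Date, as.Date("2017-01-01"), as.Date("2017-06-01")))
```

If the end date should be excluded, the condition can be written as Date >= as.Date("2017-01-01") & Date < as.Date("2017-06-01") instead.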
