Sparse.model.matrix error message - r

I'm trying to create a sparse-matrix and get this error message:
Error: fnames == names(mf) are not all TRUE
I think it has something to do with the column names of my data. Maybe you can help.
Here are the column names:
colnames(trainDataShrinkage) <- c("Bildungsgrad2_Lower_secondary_education"
,"Bildungsgrad3_Upper_secondary_education"
,"Bildungsgrad4_Post-secondary_non-tertiary_education"
,"Bildungsgrad5_Short-cycle_tertiary_education"
,"Bildungsgrad6_Bachelors_or_equivalent_level"
,"Bildungsgrad7_Masters_or_equivalent_level"
,"Bildungsgrad8_Doctoral_or_equivalent_level"
,"Familienstand2_Verheiratet,_getrenntlebend"
,"Familienstand3_Ledig"
,"Familienstand4_Geschieden,_eing._gleichg._Partn._aufgehoben"
,"Familienstand5_Verwitwet,_Lebenspartner/in_verstorben"
,"Familienstand6_Eing._gleichg._Partn.,_zusammenlebend"
,"Geschlecht2_Weiblich"
,"Migrationshintergrund2_direkter_Migrationshintergrund"
,"Migrationshintergrund3_indirekter_Migrationshintergrund"
,"Bundesland2_Hamburg"
,"Bundesland3_Niedersachsen"
,"Bundesland4_Bremen"
,"Bundesland5_Nordrhein-Westfalen"
,"Bundesland6_Hessen"
,"Bundesland7_Rheinland-Pfalz"
,"Bundesland8_Baden-Wuerttemberg"
,"Bundesland9_Bayern"
,"Bundesland10_Saarland"
,"Bundesland11_Berlin_(West_und_Ost)"
,"Bundesland12_Brandenburg"
,"Bundesland13_Mecklenburg-Vorpommern"
,"Bundesland14_Sachsen"
,"Bundesland15_Sachsen-Anhalt"
,"Bundesland16_Thueringen"
,"Unternehmengroesse2_5 bis_10"
,"Unternehmengroesse3_11_bis_unter_20"
,"Unternehmengroesse6_20_bis_unter_100"
,"Unternehmengroesse7_100_bis_unter_200"
,"Unternehmengroesse9_200_bis_unter_2000"
,"Unternehmengroesse10_2000_und_mehr"
,"Erwerbsstatus2_Teilzeitbeschaeftigung"
,"Erwerbsstatus4_Geringfuegig_beschaeftigt"
,"Stundenlohn"
,"AlterEins"
,"AlterZwei"
,"AlterDrei"
,"AlterFuenF"
,"AlterSechs"
,"BildungsjahreEins"
,"BildungsjahreZwei"
,"BildungsjahreDrei"
,"BildungsjahreVier"
,"BildungsjahreFuenf"
,"ArbeitsmarkterfahrungVollzeitEins"
,"ArbeitsmarkterfahrungVollzeitZwei"
,"ArbeitsmarkterfahrungVollzeitDrei"
,"ArbeitsmarkterfahrungVollzeitVier"
,"ArbeitsmarkterfahrungVollzeitFuenf"
,"ArbeitsmarkterfahrungTeilzeitEins"
,"ArbeitsmarkterfahrungTeilzeitZwei"
,"ArbeitsmarkterfahrungTeilzeitDrei"
,"ArbeitsmarkterfahrungTeilzeitVier"
,"ArbeitsmarkterfahrungTeilzeitFuenf"
,"BruttoverdienstLetztenMonatEins"
,"BruttoverdienstLetztenMonatZwei"
,"BruttoverdienstLetztenMonatDrei"
,"BruttoverdienstLetztenMonatVier"
,"BruttoverdienstLetztenMonatFuenf")

sparse.model.matrix does not like some special characters in column names. I have run into issues with column names starting with #, 1, . and /. Try replacing these occurrences with '_'.
The simplest way would be to strip all special characters from your column names. Let me know if you cannot rename them due to any limitation.
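A minimal sketch of that renaming step, using a small hypothetical data frame (df and its three columns are made up for illustration, with names borrowed from the question). It replaces every character that is not a letter, digit or underscore with '_'. Whether this removes the error for the real data is an assumption worth checking.

```r
# Hypothetical stand-in for trainDataShrinkage with a few of the problematic names
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
colnames(df) <- c("Familienstand2_Verheiratet,_getrenntlebend",
                  "Unternehmengroesse2_5 bis_10",
                  "Bundesland11_Berlin_(West_und_Ost)")

# Replace anything that is not a letter, digit or underscore with "_"
clean <- gsub("[^A-Za-z0-9_]", "_", colnames(df))

# Guard against two names collapsing into the same string after cleaning
clean <- make.unique(clean, sep = "_")

colnames(df) <- clean
print(colnames(df))
```

Base R's make.names() is another option, but it substitutes '.' rather than '_', so the gsub approach stays closer to the suggestion above.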

Related

How to remove parentheses but keep the text in them in R

I am trying to clean a dataset with the column: ltaCpInfoDF$weekdays_rate_1
For some of the rows, I would like to do this:
input: Daily(7am-11pm): $1.20 ; output: 7am-11pm: $1.20
The values within the bracket can be different timings for the rows.
Initially, I was thinking of removing by part such as removing "Daily(" with gsub first then removing ")". However, I seem to be facing issues with that.
ltaCpInfoDF$weekdays_rate_1 <- gsub("Daily(", "", ltaCpInfoDF$weekdays_rate_1)
Here is the error shown:
Error in gsub("Daily(", "", ltaCpInfoDF$weekdays_rate_1) :
invalid regular expression 'Daily(', reason 'Missing ')''
In addition: Warning message:
In gsub("Daily(", "", ltaCpInfoDF$weekdays_rate_1) :
TRE pattern compilation error 'Missing ')''
Could someone share with me a better way? Thank you in advance!
Use gsub with a capture group:
input <- "Daily(7am-11pm): $1.20"
output <- gsub("\\S+\\s*\\((.*?)\\)", "\\1", input)
output
[1] "7am-11pm: $1.20"
We can also do this without a capture group:
gsub("^[^(]+\\(|\\)", "", str1)
[1] "7am-11pm: $1.20"
data
str1 <- "Daily(7am-11pm): $1.20"
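The original error arises because "(" is a regex metacharacter, so the pattern 'Daily(' opens a group that is never closed. A short sketch of two ways to keep the asker's step-by-step idea working: treat the pattern as a literal string with fixed = TRUE, or escape the parentheses.

```r
input <- "Daily(7am-11pm): $1.20"

# Option 1: literal matching, no regex interpretation at all
step1 <- gsub("Daily(", "", input, fixed = TRUE)
out_fixed <- gsub(")", "", step1, fixed = TRUE)

# Option 2: one regex with the parentheses escaped
out_escaped <- gsub("Daily\\(|\\)", "", input)

print(out_fixed)
print(out_escaped)
```

Note that both variants remove every ")" in the string, which is fine for the example value but may be too aggressive if other parentheses can occur.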

paste specific text to strings that do not have it

I would like to paste "miR" onto strings that do not already have it, and skip those that do.
paste("miR", ....)
in
c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
out
c("miR-26b", "miR-26a", "miR-1297", "miR-4465", "miR-26b", "miR-26a")
One way is to remove "miR-" with sub if it is present at the beginning of the string, and then paste it onto every string regardless.
paste0("miR-", sub("^miR-","", x))
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
data
x <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
vec <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
sub("^(?!miR)(.*)$", "miR-\\1", vec, perl = TRUE)
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
If you want to learn more:
type ?sub into the R console
learn regex, and take a closer look at negative lookahead and capturing groups
I've used perl = TRUE because the default regex engine does not support lookahead, so I get an error without it.
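For comparison, here is a sketch that avoids regular expressions entirely by testing the prefix with startsWith() (base R 3.3.0 or later) and pasting only where it is missing. The variable names has_prefix and out are illustrative.

```r
x <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")

# TRUE where the string already begins with "miR"
has_prefix <- startsWith(x, "miR")

# Keep prefixed strings as they are, prepend "miR-" to the rest
out <- ifelse(has_prefix, x, paste0("miR-", x))
print(out)
```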

Change value column a if column b contains conditional string

This issue is giving me a lot of trouble, even though it should be easy to fix. I have a dataset with the columns id and poster. I want to change the poster's value if the id value contains a certain string. See data below:
test_df
id                   poster
143537222999_2054    Kevin
143115551234_2049    Dave
14334_5334           Eric
1456322_4334         Mandy
143115551234_445633  Patrick
143115551234_4321    Lars
143537222999_56743   Iris
I would like to get
test_df
id                   poster
143537222999_2054    User
143115551234_2049    User
14334_5334           Eric
1456322_4334         Mandy
143115551234_445633  User
143115551234_4321    User
143537222999_56743   User
Both columns are character vectors. I would like to change the poster's value to "User" if the id value contains "143537222999" or "143115551234". I have tried the following code:
Match within/which
test_df <- within(test_df, poster[match('143115551234', test_df$id) | match('143537222999', test_df$id)] <- 'User')
This code gave me no errors, but it didn't change any of the values in the poster column. When I replace within with which, I get the error:
test_df <- which(test_df, poster[match('143115551234', test_df$id) | match('143537222999', test_df$id)] <- 'User')
Error in which(test_df, poster[match("143115551234", test_df$id) | :
argument to 'which' is not logical
Match different variant
test_df <- test_df[match(id, test_df, "143115551234") | match(id, test_df, "143537222999"), test_df$poster] <- 'User'
This code gives me the error:
Error in `[<-.data.frame`(`*tmp*`, match(id, test_df, "143115551234") | :
missing values are not allowed in subscripted assignments of data frames
In addition: Warning messages:
1: In match(id, test_df, "143115551234") :
NAs introduced by coercion to integer range
2: In match(id, test_df, "143537222999") :
NAs introduced by coercion to integer range
After looking up this error I found out that integers in R are 32-bit and the maximum value of an integer is 2147483647. I'm not sure why I'm getting this error, because R states that my columns are character.
> lapply(test_df, class)
$poster
[1] "character"
$id
[1] "character"
Grepl
test_df[grepl("143115551234", id | "143537222999", id), poster := "User"]
This code raises the error:
Error in `:=`(poster, "User") : could not find function ":="
I'm not sure what the best way to fix this is. I have tried multiple variations and keep running into different errors.
I have also tried multiple answers from earlier questions on here, but I still can't fix these errors.
Use grepl with ifelse:
df$poster <- ifelse(grepl("143537222999|143115551234", df$id), "User", df$poster)
You may try this using grepl.
df[grepl('143115551234|143537222999', df$id),"poster"] <- "User"
So every row for which grepl returns TRUE gets its poster value replaced by "User":
> df[grepl('143115551234|143537222999', df$id),"poster"] <- "User"
> df
                   id poster
1   143537222999_2054   User
2   143115551234_2049   User
3          14334_5334   Eric
4        1456322_4334  Mandy
5 143115551234_445633   User
6   143115551234_4321   User
7  143537222999_56743   User
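A self-contained sketch that rebuilds the example data and applies the grepl approach. It also shows why the match attempts in the question change nothing: match() compares whole strings, so a partial id never matches, whereas grepl() tests for a substring. The helper variable hit is illustrative.

```r
test_df <- data.frame(
  id = c("143537222999_2054", "143115551234_2049", "14334_5334",
         "1456322_4334", "143115551234_445633", "143115551234_4321",
         "143537222999_56743"),
  poster = c("Kevin", "Dave", "Eric", "Mandy", "Patrick", "Lars", "Iris"),
  stringsAsFactors = FALSE
)

# match() looks for exact, whole-string equality, so this is NA
exact <- match("143115551234", test_df$id)

# grepl() returns TRUE wherever either substring occurs in id
hit <- grepl("143537222999|143115551234", test_df$id)
test_df$poster[hit] <- "User"

print(test_df)
```

The `:=` attempt in the question fails for a different reason: `:=` is data.table syntax and is not available on a plain data frame.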

Selecting Multiple Columns by Name without having to Type each name

How do I select multiple columns by name without having to type out each name?
For example I have the following code:
CTDB[, c(
"ENJOY_TV_RADIO_CHILD",
"ENJOY_FMLY_CLOSE_FRND_CHILD",
"ENJOY_HOBBIES_CHILD",
"ENJOY_FAV_MEAL_CHILD",
"ENJOY_SHOWER_CHILD",
"ENJOY_SCENT_CHILD",
"ENJOY_PPL_SMILE_CHILD",
"ENJOY_LOOK_SMART_CHILD",
"ENJOY_READ_CHILD",
"ENJOY_FAV_DRINK_CHILD",
"ENJOY_SMALL_THINGS_CHILD",
"ENJOY_LANDSCAPE_CHILD",
"ENJOY_HELP_OTHR_CHILD",
"ENJOY_PRAISE_CHILD")] <- revalue(as.matrix(CTDB[, c(
"ENJOY_TV_RADIO_CHILD",
"ENJOY_FMLY_CLOSE_FRND_CHILD",
"ENJOY_HOBBIES_CHILD",
"ENJOY_FAV_MEAL_CHILD",
"ENJOY_SHOWER_CHILD",
"ENJOY_SCENT_CHILD",
"ENJOY_PPL_SMILE_CHILD",
"ENJOY_LOOK_SMART_CHILD",
"ENJOY_READ_CHILD",
"ENJOY_FAV_DRINK_CHILD",
"ENJOY_SMALL_THINGS_CHILD",
"ENJOY_LANDSCAPE_CHILD",
"ENJOY_HELP_OTHR_CHILD",
"ENJOY_PRAISE_CHILD")]), c("0"=3, "1"=2, "2"=1, "3"=0))
All the columns are in order but instead of selecting by number like below
CTDB[,74:87] <-revalue(as.matrix(CTDB[,74:87]), c("0"=3, "1"=2, "2"=1, "3"=0))
I would like to select by the name of the column.
Thank you!
You should use grep or grepl to match the column names against a pattern:
CTDB[, grep("^ENJOY.*CHILD$", colnames(CTDB))]
or
CTDB[, grepl("^ENJOY.*CHILD$", colnames(CTDB))]
If you need to do this as part of a pipe, you can also use dplyr::select and its helper functions in two equivalent ways, including one that can avoid regular expressions:
CTDB %>% select(matches("^ENJOY.*CHILD$"))
CTDB %>% select(intersect(starts_with("ENJOY"), ends_with("CHILD")))
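To connect this back to the recoding in the question, here is a base R sketch on a small hypothetical CTDB (the column values and the ID and ENJOY_READ_PARENT columns are made up). It assumes the selected columns are numeric with values 0 to 3, in which case the mapping 0→3, 1→2, 2→1, 3→0 is just 3 - v. That stands in for the revalue call and is not the asker's exact code.

```r
# Hypothetical miniature of CTDB; only the column name pattern matters
CTDB <- data.frame(
  ID = 1:3,
  ENJOY_TV_RADIO_CHILD = c(0, 1, 3),
  ENJOY_READ_CHILD = c(2, 2, 0),
  ENJOY_READ_PARENT = c(1, 0, 3)
)

# Names of all columns that start with ENJOY and end with CHILD
cols <- grep("^ENJOY.*CHILD$", names(CTDB), value = TRUE)

# Reverse-score the selected columns in place (assumes numeric 0..3)
CTDB[cols] <- lapply(CTDB[cols], function(v) 3 - v)

print(CTDB)
```

Using value = TRUE returns the names rather than positions, so the same cols vector can be reused on both sides of the assignment.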

Incorrect number of dimensions: extracting elements from multiple rdata files

PROBLEM
I have many .RData files in one folder and I want to extract the coordinates contained in each .RData file. I'd also like to attach the corresponding file name (use_hab) and datetime (dt) to each row of their respective coordinates.
CODE
file.namez<-list.files("C:/fitting/fitdata/7 27 2015") #name of files
#file.namez.rev<-file.namez[grep(".RData",file.namez)]
datastor<-data.frame(matrix(NA,length(file.namez),4))
names(datastor)<-c("use_hab",paste("B",1:3,sep=""))
allresults<-NULL
for(i in 1:length(file.namez))
{
datastor<-NULL
print(file.namez[i])
load(paste("C:/fitting/fitdata/7 27 2015/",file.namez[i], sep=""))
use_hab <- as.character(as.data.frame(strsplit(file.namez[i],"_an"))[2,])# this line is used to remove unwanted parts of the file name
use_hab <- gsub(".RData","", use_hab)
datastor <- fitdata$coords
datastor$use_hab <- use_hab
datastor$dt <- fitdata$dt
allresults <- rbind(allresults, datastor[,c(3,4,1,2)])
}
This is the only output before the error message:
[1] "fitdata_anw514_yr2008.RData"
ERROR
Error in datastor[, c(3, 4, 1, 2)] : incorrect number of dimensions
In addition: Warning message:
In datastor$use_hab <- use_hab : Coercing LHS to a list
QUESTION
How am I getting the incorrect number of dimensions? Each file name should have 1098 coordinates and date time. In total, 63 files x 1098 rows with 4 columns(filename, datetime, x, y).
The desired result is to have the file name as the first column, the date time as the second column, and the x and y coordinates as the third and fourth columns.
Replace
datastor <- fitdata$coords
with
datastor$coords <- fitdata$coords
The warning "Coercing LHS to a list" is raised when you use $ to assign into an object that does not support it. datastor <- fitdata$coords changes datastor to the data type of fitdata$coords (probably a matrix rather than a data frame).
Also, I would change
allresults<-NULL
datastor<-NULL
to
allresults <- data.frame()
datastor <- data.frame()
but this may just be my personal preference.
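A different way to get the desired four-column result, sketched under the assumption that fitdata$coords is a two-column numeric matrix with columns x and y (the question does not show its structure, so the fitdata object below is a hypothetical stand-in for one loaded file): convert the matrix to a data frame first, so that $ assignment and column selection behave as expected.

```r
# Hypothetical stand-in for the object loaded from one .RData file
fitdata <- list(
  coords = matrix(c(1, 2, 3, 4), ncol = 2, dimnames = list(NULL, c("x", "y"))),
  dt = as.POSIXct(c("2008-01-01 00:00:00", "2008-01-01 01:00:00"), tz = "UTC")
)
use_hab <- "w514_yr2008"

allresults <- data.frame()

# A data frame supports $<- and two-dimensional indexing; a matrix does not
datastor <- as.data.frame(fitdata$coords)
datastor$use_hab <- use_hab
datastor$dt <- fitdata$dt

# Select by name so the column order does not depend on positions
allresults <- rbind(allresults, datastor[, c("use_hab", "dt", "x", "y")])
print(allresults)
```

Inside the loop from the question, only the last four statements would repeat for each file.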
