Cannot remove rows from data.frame [duplicate] - r

I have a series of Excel files and I have been using this basic code to import my data for a very long time now. I have not made any changes to the data or the code, but I cannot read the data properly anymore. I read the files as follows:
apply_fun <- function(filename) {
data <- readxl::read_excel(filename, sheet = "Section Cut Forces - Design",
skip = 1, col_names = TRUE)
data <- data[-1,]
data <- subset(data, StepType == "Max")
data <- data[,c(1,2,6,10)]
data$id <- filename
return(data)
}
filenames <- list.files(pattern = "\\.xlsx", full.names = TRUE)
first <- lapply(filenames,apply_fun)
out <- do.call(rbind,first)
The first few rows of out look like:
structure(list(SectionCut = c("1", "1", "1", "1", "1", "2", "2",
"2", "2", "2"), OutputCase = c("Service (after losses)", "LL-1",
"LL-2", "LL-3", "LL-4", "Service (after losses)", "LL-1", "LL-2",
"LL-3", "LL-4"), V2 = c("11.522", "28.587", "42.246000000000002",
"44.212000000000003", "36.183", "9.8469999999999995", "23.989000000000001",
"37.408999999999999", "43.401000000000003", "40.450000000000003"
), M3 = c("299728.66100000002", "42863.517999999996", "63147.332999999999",
"69628.464000000007", "59196.74", "0", "27.942", "44.863999999999997",
"46.31", "36.204999999999998"), id = c("./100-12-S00.xlsx", "./100-12-S00.xlsx",
"./100-12-S00.xlsx", "./100-12-S00.xlsx", "./100-12-S00.xlsx",
"./100-12-S00.xlsx", "./100-12-S00.xlsx", "./100-12-S00.xlsx",
"./100-12-S00.xlsx", "./100-12-S00.xlsx")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
I try to remove rows as:
out2 <- out[!grep("Service (after losses)", out$OutputCase),]
but the result is 0 observations.
I must say that this just started being an issue for me. I have been able to run this code successfully for months now and never had an issue.

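( and ) are special characters in regex. They delimit a group when used in functions like grep/grepl, so the pattern above looks for the text "Service after losses" without literal parentheses and matches nothing. You can use fixed = TRUE in grep to match them literally. Also, ! should be used with grepl (which returns a logical vector) and - should be used with grep (which returns integer positions) to remove rows.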
out[-grep("Service (after losses)", out$OutputCase, fixed = TRUE),]
Apart from that, this looks like an exact match, so why use pattern matching with grep? Try:
out[out$OutputCase != 'Service (after losses)', ]
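To make the difference concrete, here is a minimal sketch on a toy data frame (the three rows are made up for illustration; only the OutputCase values come from the question). It compares the pattern treated as a regex, as a fixed string, and with escaped parentheses.

```r
# Toy stand-in for `out`; the values are illustrative only
out <- data.frame(
  OutputCase = c("Service (after losses)", "LL-1", "LL-2"),
  V2 = c(11.522, 28.587, 42.246),
  stringsAsFactors = FALSE
)

# As a regex, the parentheses form a group, so no row matches
as_regex <- grepl("Service (after losses)", out$OutputCase)

# As a fixed string, the parentheses are matched literally
as_fixed <- grepl("Service (after losses)", out$OutputCase, fixed = TRUE)

# Escaping the parentheses gives the same result as fixed = TRUE
escaped <- grepl("Service \\(after losses\\)", out$OutputCase)

# Drop the matching rows with a logical index
out2 <- out[!as_fixed, ]
```

The logical index from grepl also behaves sensibly when nothing matches. By contrast, out[-grep(...), ] returns zero rows in that case, because negating an empty integer vector still gives an empty index.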

Related

Dealing with data from the same user twice when moving through a loop in R

Let's say I have a dataframe that looks a bit like this:
df <- tribble(
~person_id, ~timestamp,
"1", "02:26:10.000000",
"1", "03:45:37.000000",
"2", "22:03:39.000000",
"3", "11:46:24.000000",
"4", "18:26:55.000000",
"5", "17:01:20.000000",
"5", "03:10:17.000000",
"6", "23:16:05.000000",
)
df
Now let's say I import individual .csv files that each match the person_id like so:
user_files <- list.files(pattern = "\\.csv$", path = here("data"),
full.names = TRUE)
user_files <- user_files[sub("\\.csv$", "", basename(user_files)) %in% df$person_id]
There will naturally be fewer .csv files than the length of df$person_id because persons "1" and "5" appear twice in df$person_id.
I would now like to run a for loop that runs a program on each csv file. HOWEVER, where there are more than one of the same person_id, I would like to re-run the loop using the same csv file (since it's the same person again but a different timestamp so will yield different results).
This is what the loops look like
for(i in seq(1:length(user_files))) {
user_file <- read_csv(user_files[i])
#Run lots of analysis on the CSV file
}
Now I need something in the loop that says "if df$person_id occurs more than once, repeat the loop using the same CSV file". Thanks in advance for any assistance.
If the entries in user_files are unique, match df$person_id against the file names (without the .csv extension) and use the resulting index to subset user_files:
v1 <- sub("\\.csv$", "", basename(user_files))
user_files2 <- na.omit(user_files[match(df$person_id, v1)])
Now, loop over 'user_files2'
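Alternatively, a better approach is to join the file paths to the original dataset and loop over the user_files column of the joined data:
library(dplyr)
# Naming the join key explicitly avoids the "Joining with `by = ...`" message
df1 <- inner_join(df, tibble(user_files,
person_id = sub("\\.csv$", "", basename(user_files))),
by = "person_id")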
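As a rough sketch of how the loop could then look, the following base R example attaches a file path to every row of df and iterates over rows rather than over files. The paths and the shortened timestamps are hypothetical, and the actual read_csv call and analysis are left as a comment.

```r
# Toy stand-ins for the objects in the question (paths are hypothetical)
df <- data.frame(
  person_id = c("1", "1", "2", "3", "5", "5"),
  timestamp = c("02:26:10", "03:45:37", "22:03:39",
                "11:46:24", "17:01:20", "03:10:17"),
  stringsAsFactors = FALSE
)
user_files <- c("data/1.csv", "data/2.csv", "data/5.csv")

# Look up each person's file; persons without a file get NA
ids <- sub("\\.csv$", "", basename(user_files))
df$user_file <- user_files[match(df$person_id, ids)]

# Keep only rows that have a file; repeated persons keep repeated rows
todo <- df[!is.na(df$user_file), ]

processed <- character(0)
for (i in seq_len(nrow(todo))) {
  # read_csv(todo$user_file[i]) and the timestamp-specific analysis go here
  processed <- c(processed, paste(todo$user_file[i], todo$timestamp[i]))
}
```

Because the loop runs once per row, persons "1" and "5" are processed twice, each time with a different timestamp, while person "3" (no file) is skipped.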

R export to SPSS file, with variable names longer than 8 characters [duplicate]

I'm working in R, but I need to deliver some data in SPSS format with both 'variable labels' and 'value labels' and I'm kinda stuck.
I've added variable labels to my data using Hmisc's label function. This adds the variable labels as a label attribute, which is handy when using describe() from the Hmisc package. The problem is that I cannot get the write.foreign() function, from the foreign package, to recognize these labels as variable labels. I imagine I need to modify write.foreign() to use the label attribute as the variable label when writing the .sps file.
I looked at the R list and at stackoverflow, but I could only find a post from 2006 on the R list regarding exporting variable labels to SPSS from R and it doesn't seem to answer my question.
Here is my working example,
# First I create a dummy dataset
df <- data.frame(id = c(1:6), p.code = c(1, 5, 4, NA, 0, 5),
p.label = c('Optometrists', 'Nurses', 'Financial analysts',
'<NA>', '0', 'Nurses'), foo = LETTERS[1:6])
# Second, I add some variable labels using label from the Hmisc package
# install.packages('Hmisc', dependencies = TRUE)
library(Hmisc)
label(df) <- "Sweet sweet data"
label(df$id) <- "id !##$%^"
label(df$p.label) <- "Profession with human readable information"
label(df$p.code) <- "Profession code"
label(df$foo) <- "Variable label for variable x.var"
# modify the name of one variable, just to see what happens when exported.
names(df)[4] <- "New crazy name for 'foo'"
# Third I export the data with write.foreign from the foreign package
# install.packages('foreign', dependencies = TRUE)
setwd('C:\\temp')
library(foreign)
write.foreign(df,"df.wf.txt","df.wf.sps", package="SPSS")
list.files()
[1] "df.wf.sps" "df.wf.txt"
When I inspect the .sps file (see the content of 'df.wf.sps' below) my variable labels are identical to my variable names, except for foo, which I renamed to "New crazy name for 'foo'". This variable has a new and seemingly random name, but the correct variable label.
Does anyone know how to get the label attributes and the variable names exported as 'variable labels' and 'labels names' into a .sps file? Maybe there is a smarter way to store 'variable labels' than my current method?
Any help would be greatly appreciated.
Thanks, Eric
Content of 'df.wf.sps' export using write.foreign from the foreign package
DATA LIST FILE= "df.wf.txt" free (",")
/ id p.code p.label Nwcnf.f. .
VARIABLE LABELS
id "id"
p.code "p.code"
p.label "p.label"
Nwcnf.f. "New crazy name for 'foo'"
.
VALUE LABELS
/
p.label
1 "0"
2 "Financial analysts"
3 "Nurses"
4 "Optometrists"
/
Nwcnf.f.
1 "A"
2 "B"
3 "C"
4 "D"
5 "E"
6 "F"
.
EXECUTE.
Update April 16 2012 at 15:54:24 PDT;
What I am looking for is a way to tweak write.foreign to write a .sps file where this part,
[…]
VARIABLE LABELS
id "id"
p.code "p.code"
p.label "p.label"
Nwcnf.f. "New crazy name for 'foo'"
[…]
looks like this,
[…]
VARIABLE LABELS
id "id !##$%^"
p.code "Profession code"
p.label "Profession with human readable information"
"New crazy name for 'foo'" "New crazy name for 'foo'"
[…]
The last line is a bit ambitious. I don't really need to have variables with white spaces in the names, but I would like the label attributes to be transferred to the .sps file (that I produce with R).
Try this function and see if it works for you. If not, add a comment and I can see what I can do as far as troubleshooting goes.
# Step 1: Make a backup of your data, just in case
df.orig = df
# Step 2: Load the following function
get.var.labels = function(data) {
a = do.call(llist, data)
tempout = vector("list", length(a))
for (i in 1:length(a)) {
tempout[[i]] = label(a[[i]])
}
b = unlist(tempout)
structure(c(b), .Names = names(data))
}
# Step 3: Apply the variable.label attributes
attributes(df)$variable.labels = get.var.labels(df)
# Step 4: Load the write.SPSS function available from
# https://stat.ethz.ch/pipermail/r-help/2006-January/085941.html
# Step 5: Write your SPSS datafile and codefile
write.SPSS(df, "df.sav", "df.sps")
The above example is assuming that your data is named df, and you have used Hmisc to add labels, as you described in your question.
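For readers who want to see what the label-gathering step amounts to, here is a minimal base R sketch. It assumes, as the question describes, that the labels live in each column's "label" attribute, and it sets that attribute directly instead of calling Hmisc. The two-column data frame and the name var_labels are illustrative only.

```r
# Small illustrative data frame; labels are set as plain "label" attributes
df <- data.frame(id = 1:3, p.code = c(1, 5, 4))
attr(df$id, "label") <- "id !##$%^"
attr(df$p.code, "label") <- "Profession code"

# Collect one label per column, falling back to "" when none is set
var_labels <- vapply(df, function(col) {
  lab <- attr(col, "label", exact = TRUE)
  if (is.null(lab)) "" else as.character(lab)
}, character(1))

# Store the named vector where the write.SPSS helper is said to look for it
attr(df, "variable.labels") <- var_labels
```

The result is a character vector named by column, which is the same shape that get.var.labels builds above.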
Update: A Self-Contained Function
If you do not want to alter your original file, as in the example above, and if you are connected to the internet while you are using this function, you can try this self-contained function:
write.Hmisc.SPSS = function(data, datafile, codefile) {
a = do.call(llist, data)
tempout = vector("list", length(a))
for (i in 1:length(a)) {
tempout[[i]] = label(a[[i]])
}
b = unlist(tempout)
label.temp = structure(c(b), .Names = names(data))
attributes(data)$variable.labels = label.temp
source("http://dl.dropbox.com/u/2556524/R%20Functions/writeSPSS.R")
write.SPSS(data, datafile, codefile)
}
Usage is simple:
write.Hmisc.SPSS(df, "df.sav", "df.sps")
The function that you linked to (here) should work, but I think the problem is that your dataset doesn't actually have the variable.labels and label.table attributes that would be needed to write the SPSS script file.
I don't have access to SPSS, but try the following and see if it at least points you in the right direction. Unfortunately, I don't see an easy way to do this other than editing the output of dput manually.
df = structure(list(id = 1:6,
p.code = c(1, 5, 4, NA, 0, 5),
p.label = structure(c(5L, 4L, 2L, 3L, 1L, 4L),
.Label = c("0", "Financial analysts",
"<NA>", "Nurses",
"Optometrists"),
class = "factor"),
foo = structure(1:6,
.Label = c("A", "B", "C", "D", "E", "F"),
class = "factor")),
.Names = c("id", "p.code", "p.label", "foo"),
label.table = structure(list(id = NULL,
p.code = NULL,
p.label = structure(c("1", "2", "3", "4", "5"),
.Names = c("0", "Financial analysts",
"<NA>", "Nurses",
"Optometrists")),
foo = structure(1:6,
.Names = c("A", "B", "C", "D", "E", "F"))),
.Names = c("id", "p.code", "p.label", "foo")),
variable.labels = structure(c("id !##$%^", "Profession code",
"Profession with human readable information",
"New crazy name for 'foo'"),
.Names = c("id", "p.code", "p.label", "foo")),
codepage = 65001L)
Compare the above with the output of dput for your sample dataset. Notice that label.table and variable.labels have been added, and a line that said something like row.names = c(NA, -6L), class = "data.frame" was removed.
Update
NOTE: This will not work with the default write.foreign function in R. To test this you first need to load the write.SPSS function shared here, and (of course), make sure that you have the foreign package loaded. Then, you write your files as follows:
write.SPSS(df, datafile="df.sav", codefile="df.sps")

Invalid subscript type list, not sure why

I'm beginning to learn R, and I'm writing a script, but I'm getting a weird error. I have a data frame, and I'd like to take a subset of the columns. I created a variable called meansAndStdevs, which is a logical vector. I want to use this logical vector to subset the columns in my data frame. Here's the code I have:
features <- read.table("./features.txt")$V2;
meanAndStdevRegEx <- "(-mean\\(\\))|(-std\\(\\))";
meansAndStdevs <- as.logical(sapply(features, function(f) { grep(meanAndStdevRegEx, f); }));
fileData <- read.table(filePath);
fileDataSubset <- fileData[, meansAndStdevs]
However, I end up getting the error Error in .subset(x, j) : invalid subscript type 'list', and I'm not sure why! I think it might have something to do with my meansAndStdevs list having NAs in place of FALSEs. Hoping for some guidance.
Here are the first few items in the features list (its class is actually "factor"):
features <- c("tBodyAcc-mean()-X", "tBodyAcc-mean()-Y", "tBodyAcc-mean()-Z",
"tBodyAcc-std()-X", "tBodyAcc-std()-Y", "tBodyAcc-std()-Z", "tBodyAcc-mad()-X",
"tBodyAcc-mad()-Y", "tBodyAcc-mad()-Z", "tBodyAcc-max()-X", "tBodyAcc-max()-Y",
"tBodyAcc-max()-Z", "tBodyAcc-min()-X", "tBodyAcc-min()-Y")
Here is the data in fileData: https://raw.githubusercontent.com/MDSilber/CourseProject/master/Dataset/test/X_test.txt
It's pretty large though, so here's some more info on it:
dput(fileData[1:5, 1:3])
structure(list(V1 = c(0.25717778, 0.28602671, 0.27548482, 0.27029822,
0.27483295), V2 = c(-0.02328523, -0.013163359, -0.02605042, -0.032613869,
-0.027847788), V3 = c(-0.014653762, -0.11908252, -0.11815167,
-0.11752018, -0.12952716)), .Names = c("V1", "V2", "V3"), row.names = c("1",
"2", "3", "4", "5"), class = "data.frame")
This is a table of 561 columns. I'm trying to extract the columns that correspond to the TRUE values of the meansAndStdevs vector and create a new data frame in fileDataSubset from that.
Thanks in advance!
I figured out why it wasn't working. I was supposed to be using grepl, instead of grep, since grepl outputs a logical vector (which is what I wanted). Thanks for all your help!
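A minimal sketch of the grepl version, using a shortened features vector and a made-up four-column data frame in place of the real 561-column file:

```r
# Shortened feature list and a toy data frame with one column per feature
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-Y",
              "tBodyAcc-mad()-X", "tBodyAcc-max()-Z")
fileData <- data.frame(V1 = 1:2, V2 = 3:4, V3 = 5:6, V4 = 7:8)

meanAndStdevRegEx <- "(-mean\\(\\))|(-std\\(\\))"

# grepl is vectorised and returns TRUE/FALSE for every element, never NA
meansAndStdevs <- grepl(meanAndStdevRegEx, features)

# drop = FALSE keeps a data frame even if only one column is selected
fileDataSubset <- fileData[, meansAndStdevs, drop = FALSE]
```

Since grepl already works on the whole vector, the sapply and as.logical wrappers from the question are not needed.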
When I run fileDataSubset <- fileData[, meansAndStdevs] on the sample data, I get the invalid columns error. This is because the logical vector meansAndStdevs has more elements than fileData has columns. You can take the part of meansAndStdevs that corresponds to your data, then subset fileData on that basis:
datacols <- meansAndStdevs[1:ncol(fileData)]
fileDataSubset <- fileData[, datacols]
I am assuming the following setup (showing for clarity because your post has them out of order):
fileData <- structure(list(V1 = c(0.25717778, 0.28602671, 0.27548482, 0.27029822,
0.27483295), V2 = c(-0.02328523, -0.013163359, -0.02605042, -0.032613869,
-0.027847788), V3 = c(-0.014653762, -0.11908252, -0.11815167,
-0.11752018, -0.12952716)), .Names = c("V1", "V2", "V3"), row.names = c("1",
"2", "3", "4", "5"), class = "data.frame")
features <- c("tBodyAcc-mean()-X", "tBodyAcc-mean()-Y", "tBodyAcc-mean()-Z",
"tBodyAcc-std()-X", "tBodyAcc-std()-Y", "tBodyAcc-std()-Z", "tBodyAcc-mad()-X",
"tBodyAcc-mad()-Y", "tBodyAcc-mad()-Z", "tBodyAcc-max()-X", "tBodyAcc-max()-Y",
"tBodyAcc-max()-Z", "tBodyAcc-min()-X", "tBodyAcc-min()-Y")
meanAndStdevRegEx <- "(-mean\\(\\))|(-std\\(\\))";
meansAndStdevs <- as.logical(sapply(features, function(f) { grep(meanAndStdevRegEx, f); }));
You can then see that the sizes of meansAndStdevs and fileDataSubset are different:
> length(meansAndStdevs)
[1] 14
> ncol(fileDataSubset)
[1] 3
This is why you need to subset meansAndStdevs to use it as an array index.

R: Rename or copy dataframe and naming it as defined in a vector

I want to create a new dataframe from an existing one and name it as defined in a vector:
I have a dataset with many different questions, and to go through the dataset a bit quicker, I have developed a list of generic functions that can be called upon. For each question, I define the specific values, such as can be seen below. In the second part, I more or less create a clean dataset for the question, which is saved as a dataframe called 'questionid'. Because that variable is overwritten with each question, I want to create a duplicate of this dataframe and call it as specified under 'questionname' (in this case "A1"). I find it very difficult to find easy ways to do that. I hope someone can help me.
# Specify vectors and variables
question <- "Would you recommend edX to a friend of you?"
questionname <- "A1"
edXid <- "i4x-DelftX-ET3034TUx-problem-b3d30df864ca41ffa0170e790f01a783_2_1"
clevels <- c("0 - Not at all likely", "1", "2", "3", "4", "5 - Neutral", "6", "7", "8", "9", "10 - Extremely likely")
csvname <- paste(questionname, ".csv", sep="")
pngname <- paste(questionname, ".png", sep="")
# Run code
questionid <- subset(allDatasolar, allDatasolar[,3]==edXid, select = -c(X,question))
questionid <- questionid[-grep("dummy", questionid$answer), ]
questionid <- droplevels(questionid)
# as.name(questionname) <- as.data.frame(questionid) # does not work
questionid$answer <- factor(questionid$answer, ordered=TRUE, levels=clevels)
write.csv(data.frame(summary(questionid$answer)), file = csvname)
png(file = pngname, width = 640)
barchart(questionid$answer, main = question, xlab = "", col='lightblue')
dev.off()
You're looking for assign
>question = "What do you need?"
>questionname = "A1"
>
>questionid = data.frame(question, x="minimal working example")
>
>assign(questionname, questionid)
>
>A1
question x
1 What do you need? minimal working example
Assign takes a string (or a character variable, in this case) as the first argument and makes an object with that name that is a copy of whatever is in the second argument. In this case, you can feel free to keep over-writing the questionid data frame, but you will be making copies along the way based on your "questionname" variable value.
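As a small runnable sketch of this, the example below uses assign together with get, and also shows a named list as an alternative way to keep one data frame per question. The list name results is hypothetical and not part of the original code.

```r
questionname <- "A1"
questionid <- data.frame(question = "What do you need?",
                         x = "minimal working example")

# Create an object called A1 that is a copy of questionid
assign(questionname, questionid)

# get() retrieves an object by its name given as a string
copy_by_name <- get(questionname)

# Alternative: keep all per-question data frames in one named list
results <- list()
results[[questionname]] <- questionid
```

The list variant keeps the workspace tidy when there are many questions, since results[["A1"]] can be looked up without get and the whole collection can be passed to lapply.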
