R String split by spaces - r

I have a data frame with 2 columns, one of which is a string that contains spaces. I used strsplit to split this string into tokens on whitespace, and defined a function for that (split), which I want to apply to the entire data frame:
split <- function (str) strsplit(str, "\\s+")[[1]]
data.frame(raw_rt_data, apply(raw_rt_data$stimulusitem1,2, split) )
Here is more info about my dataframe :
str(raw_rt_data)
'data.frame': 5372 obs. of 2 variables:
$ stimulusitem1: Factor w/ 4313 levels "ABILITY TAX",..: 2483 3645 1339 2455 2769 3033 3998 2712 1313 250 ...
$ latency : int 4051 1266 2145 2959 1086 2956 3814 4924 4771 2654 ...
> head(raw_rt_data)
stimulusitem1 latency
1 MORNING BUBBLE 4051
2 SYSTEM MEN 1266
3 FRIEND PAIN 2145
4 MOMMY TINYURL 2959
5 PEACE INFORMATION 1086
6 PUBLIC SCRITS 2956
The problem is that executing the above code yields an error:
Error in apply(raw_rt_data$stimulusitem1, 2, split) :
dim(X) must have a positive length
What am I doing wrong? The desired result should be 2 new added columns: one containing the first token and the other one containing the 2nd token.
Any help appreciated
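The error arises because apply() expects a matrix or array (something with a dim attribute), whereas raw_rt_data$stimulusitem1 is a plain vector. A minimal sketch of one way to get the two token columns follows. It assumes every string holds exactly two space-separated tokens; the toy data frame and the column names token1 and token2 are made up for illustration.

```r
# Toy stand-in for raw_rt_data, using values from the head() output above
raw_rt_data <- data.frame(
  stimulusitem1 = factor(c("MORNING BUBBLE", "SYSTEM MEN", "FRIEND PAIN")),
  latency = c(4051L, 1266L, 2145L)
)

# strsplit() needs character input, so convert the factor first;
# it is already vectorised and returns one character vector per row
tokens <- strsplit(as.character(raw_rt_data$stimulusitem1), "\\s+")

# Stack the per-row vectors into a matrix (assumes exactly two tokens per row)
token_mat <- do.call(rbind, tokens)

# token1 and token2 are hypothetical column names
raw_rt_data$token1 <- token_mat[, 1]
raw_rt_data$token2 <- token_mat[, 2]
```

If some rows could contain more or fewer than two tokens, the rbind step would recycle values, so it is worth checking lengths(tokens) first.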

Related

Observations becoming NA when ordering levels of factors in R with ordered()

I have a longitudinal data frame p that contains 5 variables and looks like this:
> head(p)
date.1 County.x providers beds price
1 Jan/2011 essex 258 5545 251593.4
2 Jan/2011 greater manchester 108 3259 152987.7
3 Jan/2011 kent 301 7191 231985.7
4 Jan/2011 tyne and wear 103 2649 143196.6
5 Jan/2011 west midlands 262 6819 149323.9
6 Jan/2012 essex 2 27 231398.5
The structure of my variables is the following:
'data.frame': 259 obs. of 5 variables:
$ date.1 : Factor w/ 66 levels "Apr/2011","Apr/2012",..: 23 23 23 23 23 24 24 24 25 25 ...
$ County.x : Factor w/ 73 levels "avon","bedfordshire",..: 22 24 32 65 67 22 32 67 22 32 ...
$ providers: int 258 108 301 103 262 2 9 2 1 1 ...
$ beds : int 5545 3259 7191 2649 6819 27 185 24 70 13 ...
$ price : num 251593 152988 231986 143197 149324 ...
I want to order date.1 chronologically. Prior to applying ordered(), this variable does not contain any NA observations.
> summary(is.na(p$date.1))
Mode FALSE NA's
logical 259 0
However, once I apply my function for ordering the levels corresponding to date.1:
p$date.1 = with(p, ordered(date.1, levels = c("Jun/2010", "Jul/2010",
"Aug/2010", "Sep/2010", "Oct/2010", "Nov/2010", "Dec/2010", "Jan/2011", "Feb/2011",
"Mar/2011","Apr/2011", "May/2011", "Jun/2011", "Jul/2011", "Aug/2011", "Sep/2011",
"Oct/2011", "Nov/2011", "Dec/2011" ,"Jan/2012", "Feb/2012" ,"Mar/2012" ,"Apr/2012",
"May/2012", "Jun/2012", "Jul/2012", "Aug/2012", "Sep/2012", "Oct/2012", "Nov/2012",
"Dec/2012", "Jan/2013", "Feb/2013", "Mar/2013", "Apr/2013", "May/2013",
"Jun/2013", "Jul/2013", "Aug/2013", "Sep/2013", "Oct/2013", "Nov/2013",
"Dec/2013", "Jan/2014",
"Feb/2014", "Mar/2014", "Apr/2014", "May/2014", "Jun/2014", "Jul/2014" ,"Aug/2014",
"Sep/2014", "Oct/2014", "Nov/2014", "Dec/2014", "Jan/2015", "Feb/2015", "Mar/2015",
"Apr/2015","May/2015", "Jun/2015" ,"Jul/2015" ,"Aug/2015", "Sep/2015", "Oct/2015",
"Nov/2015")))
It seems I lose some observations.
> summary(is.na(p$date.1))
Mode FALSE TRUE NA's
logical 250 9 0
Has anyone come across this problem when using ordered()? Alternatively, is there any other way to arrange my observations chronologically?
It is possible that some values of p$date.1 do not match any of the levels; ordered() turns every value that is not found in levels into NA. Try using this ord.mon as the levels.
ord.mon <- do.call(paste, c(expand.grid(month.abb, 2010:2015), sep = "/"))
Then, you can try this to see if there's any mismatch between the two.
p$date.1 %in% ord.mon
Lastly, you can also sort the data frame after converting the date.1 column to Date (note that you have to add a day of the month first, since a Date needs a complete date):
p <- p[order(as.Date(paste0("01/", p$date.1), "%d/%b/%Y")), ]
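To see which values are being lost, it can help to list the ones that match no level before rebuilding the factor. The sketch below uses a made-up p in which one value carries a trailing space; that is only one possible cause, and the names bad and clean are illustrative.

```r
# Toy data: "Jan/2012 " has a trailing space, so it matches no level
p <- data.frame(date.1 = factor(c("Jan/2011", "Jun/2010", "Jan/2012 ", "Feb/2011")))

# All month/year labels from Jan/2010 to Dec/2015 in chronological order
ord.mon <- do.call(paste, c(expand.grid(month.abb, 2010:2015), sep = "/"))

# Values that would become NA if ord.mon were used as the levels
bad <- unique(as.character(p$date.1[!p$date.1 %in% ord.mon]))

# Assumed fix for this toy case: strip stray whitespace, then rebuild the
# ordered factor using only the levels that actually occur
clean <- trimws(as.character(p$date.1))
p$date.1 <- factor(clean, levels = ord.mon[ord.mon %in% clean], ordered = TRUE)
```

Inspecting bad on the real data should show whether the cause is whitespace, a typo in the hand-written level list, or something else.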

TM - Clustering data with special date variable

I've got the following data from TripAdvisor:
'data.frame': 682 obs. of 6 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ id : Factor w/ 674 levels "id","rn106322397",..: 672 671 670 669 668 667 666 665 664 663 ...
$ quote : Factor w/ 606 levels "\"Picturesque Lake Konigssee\"",..: 389 139 113 149 384 39 176 598 199 603 ...
$ rating : Factor w/ 6 levels "1","2","3","4",..: 3 5 5 5 4 5 5 5 4 5 ...
$ date : Factor w/ 505 levels "date","Reviewed 1 August 2014\n",..: 200 200 427 427 427 443 434 351 313 494 ...
$ reviewnospace: Factor w/ 674 levels "- Good car parking facilities- Organized boat trips- Ensure that you have enough time at hand for the boat trip",..: 624 573 144 211 507 26 351 672 451 249 ...
I am trying to cluster the data by date to get two groups: winter and summer vacationers. I then want to analyse the reviews within each group. I am using the tm package and tried the following code:
> x <- read.csv ("seeganz.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
> corp <- VCorpus(VectorSource(x$reviewnospace), readerControl = list(language = "eng"))
> meta(corp,tag = "date") <- x$date
> idx <- meta(corp, "date") == 'December'
But it is not working, as the content shows 0 documents:
> corp [idx]
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 1
Content: documents: 0
As the date has the structure "Reviewed 1 August 2014", how do I adapt this code to get, for example, just the reviews from Nov to Feb?
Do you have any idea how I can solve this problem?
Thank you.
Generic approach:
Use substr(date, 10, nchar(date)) to strip the leading "Reviewed " and get to "1 August 2014"; call this new vector dateNew.
Use a normal date function, e.g. as.Date(dateNew, ...), to change dateNew into a vector of class Date, on which you can do subsetting, subtraction and other operations.
Reference examples from http://www.statmethods.net/input/dates.html:
# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]
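Applied to the review dates, the approach could look like the sketch below. The sample strings are made up to mimic the "Reviewed 1 August 2014\n" pattern, and date_raw, month_num and is_winter are illustrative names. The month is matched against R's built-in month.name rather than parsed with as.Date(), so the result does not depend on the session locale.

```r
# Made-up examples in the same shape as x$date
date_raw <- c("Reviewed 1 August 2014\n",
              "Reviewed 23 December 2013\n",
              "Reviewed 5 February 2015\n")

# Drop the "Reviewed " prefix and the trailing newline
dateNew <- trimws(sub("^Reviewed ", "", date_raw))

# Extract the month name and map it to 1..12 via month.name
month_num <- match(sub("^[0-9]+ ([A-Za-z]+) [0-9]+$", "\\1", dateNew), month.name)

# TRUE for reviews written from November to February
is_winter <- month_num %in% c(11, 12, 1, 2)
```

A logical vector like is_winter could then take the place of idx in the question's corp[idx]. In an English locale, as.Date(dateNew, "%d %B %Y") should also yield proper Date values.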

Subsetting two corresponding variables if both of them are true

I'm trying to create a table with two columns containing the following strings, keeping only the rows where BOTH conditions are true. I tried, unsuccessfully, with:
CrimesAndLocation <- table(c(Crimes_Data$Primary.Type=="ARSON","ASSAULT","BATTERY","BURGLARY","HOMICIDE","HUMAN TRAFFICKING","KIDNAPPING","ROBBERY",Crimes_Data$Location.Description=="RESIDENCE")))
I'm trying to get an output where:
Primary.Type, is one of the 8 specific felonies listed above. Thus, it should not show all 32 possible felonies, just out of the 8 listed above
Location.Description, is RESIDENCE
This is the goal of what I'm trying to do:
COLUMN 1 COLUMN 2
"ARSON" "RESIDENCE"
"KIDNAPPING" "RESIDENCE"
"BATTERY" "RESIDENCE"
"HOMICIDE" "RESIDENCE"
"ASSAULT" "RESIDENCE"
...
UPDATE: > str(Crimes_Data) :
'data.frame': 293036 obs. of 22 variables:
$ ID : int 10248194 10251162 10248198 10248242 10248228 10248223 10248192 10248157 10249529 10252453 ...
$ Case.Number : Factor w/ 293015 levels "F218264","HA168845",..: 292354 292350 292363 292359 292368 292366 292351 292348 292364 292816 ...
$ Date : Factor w/ 124573 levels "01/01/2015 01:00:00 AM",..: 94544 94542 94539 94536 94535 94535 94535 94535 94529 94528 ...
$ Block : Factor w/ 27983 levels "0000X E 100TH PL",..: 13541 7650 22635 1317 13262 9623 12854 8232 24201 14279 ...
$ IUCR : Factor w/ 334 levels "0110","0130",..: 49 139 321 33 251 82 38 282 97 38 ...
$ Primary.Type : Factor w/ 32 levels "ARSON","ASSAULT",..: 3 7 24 3 18 31 3 13 17 3 ...
$ Description : Factor w/ 313 levels "$500 AND UNDER",..: 111 281 119 35 131 1 260 193 274 260 ...
$ Location.Description: Factor w/ 121 levels "","ABANDONED BUILDING",..: 95 19 110 48 97 110 106 110 110 99 ...
$ Arrest : Factor w/ 2 levels "false","true": 1 1 2 1 2 2 1 2 2 1 ...
$ Domestic : Factor w/ 2 levels "false","true": 2 1 1 1 1 1 1 1 1 1 ...
$ Beat : int 835 333 733 634 1121 1432 1024 735 414 2535 ...
$ District : int 8 3 7 6 11 14 10 7 4 25 ...
$ Ward : int 18 5 6 21 27 1 22 17 7 26 ...
$ Community.Area : int 70 43 68 49 23 22 30 67 46 23 ...
$ FBI.Code : Factor w/ 26 levels "01A","01B","02",..: 11 17 26 6 21 8 11 25 9 11 ...
$ X.Coordinate : int 1154209 1190610 1172166 1176493 1153156 1159961 1154332 1163770 1193570 NA ...
$ Y.Coordinate : int 1852321 1856955 1858813 1841948 1904451 1915955 1887190 1857568 1852889 NA ...
$ Year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
$ Updated.On : Factor w/ 442 levels "01/01/2015 12:39:07 PM",..: 288 288 288 288 288 288 288 288 288 288 ...
$ Latitude : num 41.8 41.8 41.8 41.7 41.9 ...
$ Longitude : num -87.7 -87.6 -87.6 -87.6 -87.7 ...
$ Location : Factor w/ 173646 levels "","(41.644604096, -87.610728247)",..: 31318 40835 45858 15601 116871 140063 84837 42961 32176 1 ...
This is a good job for the dplyr package. The filter function will filter a data frame according to any number of logical expressions that you feed it. The following should work for you:
library(dplyr)
filter(
Crimes_Data,
Primary.Type %in% c("ARSON", "ASSAULT", "BATTERY",
"BURGLARY", "HOMICIDE", "HUMAN TRAFFICKING",
"KIDNAPPING", "ROBBERY"),
Location.Description == "RESIDENCE"
)
If you'd rather not use dplyr, you can do it the old fashioned way with base R, like this:
type.bool <- Crimes_Data$Primary.Type %in% c("ARSON", "ASSAULT", "BATTERY",
"BURGLARY", "HOMICIDE",
"HUMAN TRAFFICKING", "KIDNAPPING",
"ROBBERY")
location.bool <- Crimes_Data$Location.Description == "RESIDENCE"
Crimes_Data[type.bool & location.bool, ]
Instead of an integer vector of indices, the [ subsetting operator can take a boolean vector. In that case, it will only return the rows of the data frame for which the corresponding elements of the boolean vector are TRUE.
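To make the logical indexing concrete, here is a small sketch on a made-up four-row data frame that reuses the column names from the question. The shortened felony list and the names keep and hits are for illustration only.

```r
# Toy data with the same column names as Crimes_Data
Crimes_Data <- data.frame(
  Primary.Type = c("ARSON", "THEFT", "BATTERY", "ROBBERY"),
  Location.Description = c("RESIDENCE", "RESIDENCE", "STREET", "RESIDENCE")
)

# Each comparison yields one TRUE/FALSE per row
type.bool <- Crimes_Data$Primary.Type %in% c("ARSON", "BATTERY", "ROBBERY")
location.bool <- Crimes_Data$Location.Description == "RESIDENCE"

# & combines them element-wise: TRUE only where both conditions hold
keep <- type.bool & location.bool

# Rows 1 and 4 survive; the empty column index keeps all columns
hits <- Crimes_Data[keep, ]
```

sum(keep) gives the number of matching rows without building the subset.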
Thanks for the str() (aka "structure") output update; it makes it much easier to help you.
To obtain a list of observations where
these eight felonies : "ARSON","ASSAULT","BATTERY","BURGLARY","HOMICIDE","HUMAN TRAFFICKING","KIDNAPPING","ROBBERY"
occurred at RESIDENCE
Try breaking up the task into slightly smaller parts:
Step 1:
ViolentCrimes = subset(Crimes_Data, Primary.Type == "ARSON" | Primary.Type == "ASSAULT" | Primary.Type == "BATTERY" | Primary.Type == "BURGLARY" | Primary.Type == "HOMICIDE" | Primary.Type == "HUMAN TRAFFICKING" | Primary.Type == "KIDNAPPING" | Primary.Type == "ROBBERY")
Step 2:
ViolentCrimesResidence = subset(ViolentCrimes, Location.Description == "RESIDENCE", select = c(Primary.Type, Location.Description))
Result:
ViolentCrimesResidence holds two columns with Column 1 being a list of Primary.Type and column 2 is Location.Description, where Column 1 only has values from the eight felonies of interest and column2 only "RESIDENCE"
Explanation
Step 1:
From the R website's examples about subset and the OR condition:
PineTreeGrade3Data<-subset(StudentData, SchoolName=="Pine Tree Elementary" | Grade==3)
Whereas we have:
ViolentCrimes = subset(Crimes_Data, Primary.Type == "ARSON" |
we use the subset() function
Crimes_Data is the existing data frame as input
next are the conditions, which simply take the pattern VectorName == "Some string", in this case Primary.Type == "ARSON"
But we want observations for the other types too, so use the "or" condition to include them
in R, "or" is written with | symbol. So we use this repeatedly to include each of the other felonies of interest
the equal sign = is synonymous with <- here; it assigns (saves) this subset result to a new data frame we call ViolentCrimes.
note I prefer using = because it takes fewer keystrokes than <-; either is correct
Step 2:
ViolentCrimesResidence = subset(ViolentCrimes, Location.Description == "RESIDENCE", select = c(Primary.Type, Location.Description))
the input is the ViolentCrimes data frame we made previously, which contains only the eight felonies "ARSON", "ASSAULT"...
now we are interested in, out of all these violent crimes, which ones occurred at home, so use condition Location.Description == "RESIDENCE"
but a further option of subset() we didn't use before, is the select = ... option
we do a select = c(Variable1, Variable2) to choose just the Primary.Type and Location.Description vectors
note that if you don't want to limit the result to these columns (aka variables), you simply omit the select = ... option
thus it saves this new subset into ViolentCrimesResidence
So, now in R when you:
ViolentCrimesResidence
You will see the two-column output you wanted: the eight felonies of interest that happened at RESIDENCE.
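The two steps can also be collapsed into a single subset() call by replacing the chain of | comparisons with %in%. This is a sketch on a made-up five-row data frame; the felonies vector is an illustrative name, and the droplevels() step is an optional extra that discards factor levels no longer present in the result.

```r
# Toy data with factor columns, as in the str() output above
Crimes_Data <- data.frame(
  Primary.Type = factor(c("ARSON", "THEFT", "BATTERY", "ROBBERY", "ASSAULT")),
  Location.Description = factor(c("RESIDENCE", "RESIDENCE", "STREET",
                                  "RESIDENCE", "RESIDENCE"))
)

felonies <- c("ARSON", "ASSAULT", "BATTERY", "BURGLARY", "HOMICIDE",
              "HUMAN TRAFFICKING", "KIDNAPPING", "ROBBERY")

# One call: both row conditions plus the column selection
ViolentCrimesResidence <- subset(
  Crimes_Data,
  Primary.Type %in% felonies & Location.Description == "RESIDENCE",
  select = c(Primary.Type, Location.Description)
)

# Optional: drop unused levels such as "THEFT" so that table() stays tidy
ViolentCrimesResidence <- droplevels(ViolentCrimesResidence)
```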

R. Handling dates and wide format from an imported Stata file

I have been given a Stata data file (counts.dta) that contains daily counts for the years 1975 to 2006 stored in wide-format. The columns are labelled month (full name of the month as a character string), day (numeric with values 1-31), and then the years from 1975 to 2006 with labels '_1975', '_1976' ... '_2006'. I assume that the underline is a consequence of something in Stata. There are dummy counts of zero (0) inserted for the date 29 February when the year-column is not a leap year.
I want to do several things. First, convert to long form with a sensible representation for year. Second, change the tri-partite representation of the date to something more sensible.
My approach has been to change the character string month to a factor and then to get it into the correct order:
require("foreign")
counts <- read.dta(file='counts.dta')
counts[['month']] <- as.factor( counts[['month']] )
counts[['month']] <-
factor(counts[['month']], levels( counts[['month']] )[c(5,4,8,1,9,7,6,2,12,11,10,3)])
I then have
str( counts )
'data.frame': 366 obs. of 34 variables:
$ month: Factor w/ 12 levels "January","February",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ _1975: int 515 649 745 599 445 667 725 749 646 740 ...
$ _1976: int 485 685 529 467 630 723 712 685 715 504 ...
$ _1977: int 505 437 489 588 634 734 682 537 453 673 ...
and so forth. Converting to long format
lcounts <- reshape(counts,
direction="long",
varying=list(names( counts )[3:34]),
v.names="n.counts",
idvar=c("month","day"),
timevar="Year",
times=1975:2006)
str( lcounts )
gives
'data.frame': 11712 obs. of 4 variables:
$ month : Factor w/ 12 levels "January","February",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ Year : int 1975 1975 1975 1975 1975 1975 1975 1975 1975 1975 ...
$ n.counts: int 515 649 745 599 445 667 725 749 646 740 ...
plus some further lines relating to the original Stata file.
My questions are: (1) what is now a good way to convert to factor-month, numeric-year and numeric-day to a useful date format, so that I can determine, for example, the day of the week, the interval between two dates and so on? (2) Was there a better way to have tackled the problem from the start?
This should be pretty easy because all you have to do is paste together the columns of your data.frame and use as.Date to create a Date class vector.
Let's start with some data similar to yours:
dat <- data.frame(month = c(rep("January",31), rep("February",29)),
day = c(1:31, 1:29),
Year = 1975,
n.counts = 515)
Then the creation of the date variable is simple:
# match() against month.name gives the calendar month number regardless of the factor's level order
dat$Date <- as.Date(with(dat, paste(match(as.character(month), month.name), day, Year)), "%m %d %Y")
str(dat)
# 'data.frame': 60 obs. of 5 variables:
# $ month : Factor w/ 2 levels "February","January": 2 2 2 2 2 2 2 2 2 2 ...
# $ day : int 1 2 3 4 5 6 7 8 9 10 ...
# $ Year : num 1975 1975 1975 1975 1975 ...
# $ n.counts: num 515 515 515 515 515 515 515 515 515 515 ...
# $ Date : Date, format: "1975-01-01" "1975-01-02" "1975-01-03" "1975-01-04" ...
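Once a Date column exists, the follow-up tasks from the question (day of the week, intervals, dropping the dummy 29 February rows) become short. The sketch below uses a made-up three-row slice shaped like lcounts, and valid is an illustrative name. It relies on as.Date() returning NA for impossible dates when an explicit format is given, and it uses format(..., "%u") for the weekday number because weekdays() returns locale-dependent names.

```r
# Toy slice of the long data; month levels are in calendar order
lcounts <- data.frame(
  month = factor(c("February", "February", "March"), levels = month.name),
  day = c(28L, 29L, 1L),
  Year = 1975L,
  n.counts = c(10L, 0L, 12L)
)

# With calendar-ordered levels, as.integer(month) is the month number
lcounts$Date <- as.Date(
  paste(lcounts$Year, as.integer(lcounts$month), lcounts$day, sep = "-"),
  format = "%Y-%m-%d"
)

# 29 February 1975 does not exist, so it is NA and can be dropped
valid <- lcounts[!is.na(lcounts$Date), ]

# ISO weekday number (1 = Monday, 7 = Sunday) and gap in days between rows
valid$weekday <- format(valid$Date, "%u")
gap_days <- as.numeric(diff(valid$Date))
```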
The main focus in this thread is naturally what to do in R after data import, but here I bundle together various details on the Stata side of this.
It is longstanding advice that data of this kind are much more easily handled in Stata in a long shape and reshape long is a standard command to do that conversion for data arriving with each year's data in a separate variable (R users: please read "column" as a translation). So, if possible, you should ask a provider of such Stata files to do that before export.
What the OP calls labels such as _1975 are legal variable names in Stata, and as the OP guesses the underscore is needed because variable names in Stata may not start with numeric characters.
On the information given, it would have been possible to export the data without loss from Stata in file formats other than .dta, notably as the usual kinds of text files (.csv, etc.).
Stata's preferred way of holding daily dates is as integers with origin 0 = 1 January 1960 (so 26 March 2015 would be 20173), which presumably is trivially easy to convert to any date representation in R.
In short, the particular and indeed peculiar form of the data as presented to the OP is in no sense either required by any Stata syntax or even recommended as part of good Stata practice.
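On the R side, the two Stata details above translate roughly as follows. This is a sketch in which stata_day and var_names are illustrative names.

```r
# A Stata daily date is a day count from 1 January 1960
stata_day <- 20173
r_date <- as.Date(stata_day, origin = "1960-01-01")  # 26 March 2015

# Strip the leading underscore from the year variable names
var_names <- c("_1975", "_1976", "_2006")
years <- as.integer(sub("^_", "", var_names))
```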

Combining Two Rows with Different Levels according to Some Conditions into One in R

This is a part of my data: (The actual data contains about 10,000 observations with about 500 levels of SalesItem)
s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
s2<-c(155,153,154,150,176,165,159,143,179,150)
S<-data.frame(SalesItem=factor(s1), Sales=s2)
> str(S)
'data.frame': 10 obs. of 2 variables:
$ SalesItem: Factor w/ 10 levels "1008","1009",..: 1 2 3 4 5 6 7 8 9 10
$ Sales : num 155 153 154 150 176 165 159 143 179 150
What I want to do is: if diff(SalesItem) = 1, combine the levels of SalesItem into one. For example, the difference between SalesItem 1008 and 1009 equals one, so I want to rename SalesItem 1009 to 1008. Later I can then compute the sum of Sales for this SalesItem as one. My actual data has about 10,000 observations, so it is quite hard to do this one by one.
Is there a simple way to do that?
Clearly, the fact that you have converted the first column to a factor indicates that you might need those factors somewhere. So I would suggest that, instead of changing any of the columns, you add a third column to your data frame, which will help you keep track of the SalesItem relevant to that value. Here are the steps:
> s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
> s2<-c(155,153,154,150,176,165,159,143,179,150)
> s1 = as.integer(s1)
> s3 = ifelse((s1-1) %in% s1, s1-1, s1)
> S <- data.frame(SalesItem=s1, Sales=s2, ItemId=s3)
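Then you can just aggregate on the basis of the ItemId column. Note that this maps each item only to its immediate predecessor, so in a longer run such as 1016 to 1019, 1018 becomes 1017 rather than 1016.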
This is not a terribly efficient solution, but since your data only contains 10000 records, it is not going to be a big problem.
Set up provided example data, but convert the SalesItem field to an integer so that the diff() operation makes sense.
> s1<-c('1008','1009','1012','1013','1016','1017','1018','1019','1054','1055')
> s2<-c(155,153,154,150,176,165,159,143,179,150)
> s1 = as.integer(s1)
> S<-data.frame(SalesItem=s1, Sales=s2)
Reorder data frame so that the SalesItem field is in ascending order (not necessary for current data set, but required for solution) then find the differences.
> S = S[order(S$SalesItem),]
> d = c(0, diff(S$SalesItem))
Duplicate the SalesItem data and then filter based on the values of the differences.
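> labels = S$SalesItem
> # walk down the sorted rows; a row that follows its predecessor by exactly 1 inherits that row's label
> for (n in seq_len(nrow(S))) {if (d[n] == 1) labels[n] = labels[n-1]}
> S$labels = labels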
The (temporary) labels field now has the required new values for the SalesItem field. Once you are happy that this is doing the right thing, you can modify the last line in the above code to simply overwrite the existing SalesItem field.
> S
SalesItem Sales labels
1 1008 155 1008
2 1009 153 1008
3 1012 154 1012
4 1013 150 1012
5 1016 176 1016
6 1017 165 1016
7 1018 159 1016
8 1019 143 1016
9 1054 179 1054
10 1055 150 1054
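The same labels can be produced without an explicit loop. This is a sketch of a vectorised alternative, not the method used above: cumsum() starts a new run wherever the gap to the previous item is not 1, and each row then takes the smallest SalesItem of its run. The names run_id and totals are illustrative.

```r
S <- data.frame(
  SalesItem = c(1008, 1009, 1012, 1013, 1016, 1017, 1018, 1019, 1054, 1055),
  Sales = c(155, 153, 154, 150, 176, 165, 159, 143, 179, 150)
)
S <- S[order(S$SalesItem), ]

# A new run starts at the first row and wherever the gap is not exactly 1
run_id <- cumsum(c(TRUE, diff(S$SalesItem) != 1))

# Label every row with the smallest SalesItem in its run
S$labels <- ave(S$SalesItem, run_id, FUN = min)

# Total Sales per combined item
totals <- aggregate(Sales ~ labels, data = S, FUN = sum)
```

For the example data, totals has one row per run (1008, 1012, 1016, 1054), which is the sum the question asks for.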
