I have a data frame with some positive numbers, some negative numbers, some words, and some hyphen "cells" in it, as such:
Revenue 73.88 74.76 78.02 78.19 68.74
Other Revenue - Total - - - - -
Total Revenue 73.88 74.76 78.02 78.19 68.74
Cost of Revenue - Total 21.09 21.61 23.01 22.76 19.99
Gross Profit 52.80 -53.15 -55.01 55.43 48.75
I want to replace the hyphens that are only found in the second to last columns with 0s, but only if the hyphens are not at the beginning of numbers. For example, I don't want to turn a negative number positive.
I've tried:
df[-1] <- lapply(df[-1], function(x) as.numeric(gsub("-", 0, x)))
but that returns the previous data frame as:
Revenue NA NA NA NA NA
Other Revenue - Total 0 0 0 0 0
Total Revenue NA NA NA NA NA
Cost of Revenue - Total NA NA NA NA NA
Gross Profit NA NA NA NA NA
which is something I definitely don't want. How can I fix this?
Thanks.
This is the output when I call str():
str(income)
'data.frame': 49 obs. of 6 variables:
$ Items : Factor w/ 49 levels "Accounting Change",..: 44 40 47 7 23 45 43 9 29 49 ...
$ Recent1: Factor w/ 14 levels "-","0.00","11,305.00",..: 4 1 4 11 14 6 5 1 1 1 ...
$ Recent2: Factor w/ 16 levels "-","-29.00","0.00",..: 5 1 5 15 16 9 6 1 1 2 ...
$ Recent3: Factor w/ 17 levels "-","0.00","11,449.00",..: 5 1 5 15 17 10 6 1 1 4 ...
$ Recent4: Factor w/ 18 levels "-","-31.00","0.00",..: 6 1 6 15 17 9 4 1 1 18 ...
$ Recent5: Factor w/ 14 levels "-","0.00","1,617.00",..: 4 1 4 10 13 5 3 1 1 1 ...
As @Joe hinted, all values in a data.frame column must share one type, so a column that mixes "-" with what look like numerics (52.80, 21.09, etc.) is stored as character (or, per your str() output, as a factor). Try gsubbing with "0" instead of 0 and then converting the columns to numeric. Any value that as.numeric cannot parse, such as one that still contains a comma, comes back as NA.
DF <- data.frame(
X1=c(12,45,67,"-",9),
X2=c(34,45,56,"-",12))
str(DF)
'data.frame': 5 obs. of 2 variables:
$ X1: chr "12" "45" "67" "-" ...
$ X2: chr "34" "45" "56" "-" ...
##
DF2 <- DF
DF2$X1 <- gsub("-","0",DF2$X1)
DF2$X1 <- as.numeric(DF2$X1)
str(DF2)
'data.frame': 5 obs. of 2 variables:
$ X1: num 12 45 67 0 9
$ X2: chr "34" "45" "56" "-" ...
EDIT: To remove the commas in your values,
DF <- data.frame(
X0=c("A","B","C","D"),
X1=c("12,300.04","45.5","-","9,046.78"),
X2=c("1,0001.12","33","-","12.6"))
for(j in 2:ncol(DF)){
DF[,j] <- gsub(",","",as.character(DF[,j]))
for(i in 1:nrow(DF)){
if(nchar(DF[i,j])==1){
DF[i,j] <- gsub("-","0",DF[i,j])
} else {
next
}
}
DF[,j] <- as.numeric(DF[,j])
DF[,j]
}
There are more efficient ways of doing this with *apply functions and regular expressions, but this should work. Some of your values are negative, so I assume that a cell containing only a - is exactly one character long; replacing just those cells fixes them without affecting the negative values in other cells.
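A vectorized take on the same idea, as a rough sketch: it assumes a cell counts as "empty" only when it is exactly a hyphen (after trimming whitespace). The helper name to_number and the sample values are made up for illustration.

```r
# Hypothetical sample data with commas, lone hyphens and negative numbers
DF <- data.frame(
  X0 = c("A", "B", "C", "D"),
  X1 = c("12,300.04", "45.5", "-", "-9,046.78"),
  X2 = c("1,001.12", "-33", "-", "12.6"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not from the original answer)
to_number <- function(col) {
  col <- trimws(as.character(col))          # as.character also handles factors
  col <- gsub(",", "", col, fixed = TRUE)   # drop thousands separators
  col[col == "-"] <- "0"                    # only cells that are exactly "-"
  as.numeric(col)
}

DF[-1] <- lapply(DF[-1], to_number)         # every column except the first
str(DF)
```

Because the comparison is against the whole cell, a leading minus sign on a negative number is never touched.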
Assume it is named dat:
dat[2:6] <- lapply( dat[2:6], function(col) as.numeric( gsub("-$|\\,", "", col) ) )
dat[is.na(dat)] <- 0
This replaces only minus signs at the end of a string and removes commas; gsub coerces factors to character, so you don't need to add as.character. When I imported your data using read.fwf and textConnection I got trailing spaces. You could use gdata::trim to remove those first, but this worked:
lapply(dat[2:6], function(col) as.numeric( gsub("-[ ]*$|\\,", "", col ) ) ) # on RHS
dat<-read.fwf(textConnection("Revenue 73.88 74.76 78.02 78.19 68.74
Other Revenue - Total - - - - -
Total Revenue 73.88 74.76 78.02 78.19 68.74
Cost of Revenue - Total 21.09 21.61 23.01 22.76 19.99
Gross Profit 52.80 -53.15 -55.01 55.43 48.75"), widths=c(24, rep(8,5)))
dat[2:6] <- lapply( dat[2:6], function(col) as.numeric( gsub("-$|\\,", "", col) ) )
dat[is.na(dat)] <- 0
dat
#----------
V1 V2 V3 V4 V5 V6
1 Revenue 73.88 74.76 78.02 78.19 68.74
2 Other Revenue - Total 0.00 0.00 0.00 0.00 0.00
3 Total Revenue 73.88 74.76 78.02 78.19 68.74
4 Cost of Revenue - Total 21.09 21.61 23.01 22.76 19.99
5 Gross Profit 52.80 -53.15 -55.01 55.43 48.75
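To see what the pattern does on its own, here is a small sketch on a made-up vector (the values are illustrative, not taken from the real data). The hyphen is removed only when nothing but spaces follows it, so the leading minus of a negative number survives:

```r
x <- c("73.88", "-   ", "-53.15", "11,305.00", "-")

cleaned <- gsub("-[ ]*$|,", "", x)   # delete trailing lone hyphens and commas
nums <- as.numeric(cleaned)          # the now-empty strings become NA
nums[is.na(nums)] <- 0               # map those NAs to 0
nums
```

One caveat with the NA-to-0 step: any other value that fails to parse would also silently become 0, so it may be worth checking for unexpected NAs before overwriting them.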
I have the following data frame:
Date <- c("04.06.2013","05.06.2013","06.06.2013","07.06.2013","08.06.2013","09.06.2013")
discharge <- c("1000","2000","1100","3000","1700","1600")
concentration_1 <- c("25","20","11","6.4","17","16")
concentration_2 <- c("1.4","1.7","2.7","3.2","4","4.7")
concentration_3 <- c("1.2","1.3","1.9","2.2","2.4","3")
concentration_4 <- c("1","0.92","2.5","3","3.4","4.8")
y <- data.frame(Date, discharge,concentration_1,concentration_2,concentration_3,concentration_4, stringsAsFactors=FALSE)
y$Date <- as.Date(y$Date, format ="%d.%m.%Y")
y[-1] <- sapply(y[-1], as.numeric)
In each row, I need to multiply each concentration with the discharge.
I was looking into the apply function but couldn't figure out how to solve it.
No apply needed, just multiply. But first let's get your data in decent shape.
The way you define your data (with quotes around the numbers), all the columns that should be numeric end up as factors (data.frame converted strings to factors by default before R 4.0). We use lapply to convert them safely to numeric:
y <- data.frame(Date, discharge,concentration_1,concentration_2,concentration_3,concentration_4)
y$Date <- as.Date(y$Date, format ="%d.%m.%Y")
str(y)
# 'data.frame': 6 obs. of 6 variables:
# $ Date : Date, format: "2013-06-04" "2013-06-05" "2013-06-06" "2013-06-07" ...
# $ discharge : Factor w/ 6 levels "1000","1100",..: 1 5 2 6 4 3
# $ concentration_1: Factor w/ 6 levels "11","16","17",..: 5 4 1 6 3 2
# $ concentration_2: Factor w/ 6 levels "1.4","1.7","2.7",..: 1 2 3 4 5 6
# $ concentration_3: Factor w/ 6 levels "1.2","1.3","1.9",..: 1 2 3 4 5 6
# $ concentration_4: Factor w/ 6 levels "0.92","1","2.5",..: 2 1 3 4 5 6
# convert all columns but the first safely to numeric
y[, -1] = lapply(y[, -1], function(x) as.numeric(as.character(x)))
str(y)
# 'data.frame': 6 obs. of 6 variables:
# $ Date : Date, format: "2013-06-04" "2013-06-05" "2013-06-06" "2013-06-07" ...
# $ discharge : num 1000 2000 1100 3000 1700 1600
# $ concentration_1: num 25 20 11 6.4 17 16
# $ concentration_2: num 1.4 1.7 2.7 3.2 4 4.7
# $ concentration_3: num 1.2 1.3 1.9 2.2 2.4 3
# $ concentration_4: num 1 0.92 2.5 3 3.4 4.8
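Why the as.character step matters, as a minimal sketch with a made-up factor: calling as.numeric directly on a factor returns its internal level codes, not the numbers you see printed.

```r
f <- factor(c("1000", "2000", "1100"))

codes  <- as.numeric(f)                 # level codes, not the values
values <- as.numeric(as.character(f))   # the actual numbers
alt    <- as.numeric(levels(f))[f]      # equivalent, converts each level once

codes    # 1 3 2
values   # 1000 2000 1100
```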
With that done, we can just multiply the concentration columns by the discharge column. R will "recycle" the discharge column to multiply each of the concentration columns appropriately.
concentration_columns = paste0("concentration_", 1:4)
y[, concentration_columns] = y[, concentration_columns] * y[, "discharge"]
y
# Date discharge concentration_1 concentration_2 concentration_3 concentration_4
# 1 2013-06-04 1000 25000 1400 1200 1000
# 2 2013-06-05 2000 40000 3400 2600 1840
# 3 2013-06-06 1100 12100 2970 2090 2750
# 4 2013-06-07 3000 19200 9600 6600 9000
# 5 2013-06-08 1700 28900 6800 4080 5780
# 6 2013-06-09 1600 25600 7520 4800 7680
The multiplication is vectorized, so just use the columns you want to multiply as operands.
y[, 2] * y[, -(1:2)]
Once your values are not character (not in ""), you can use apply like this:
new <- data.frame(y[,1:2],apply(y[,3:6],2,function(x) x*y$discharge))
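If you would rather keep the original concentrations and add the products as new columns, something along these lines should work. It is only a sketch on a cut-down, made-up version of the data, and the load_ column names are my own invention:

```r
y <- data.frame(discharge       = c(1000, 2000, 1100),
                concentration_1 = c(25, 20, 11),
                concentration_2 = c(1.4, 1.7, 2.7))

# pick the concentration columns by name rather than by position
conc_cols <- grep("^concentration_", names(y), value = TRUE)

# discharge has one value per row, so it lines up with each column in turn
loads <- y[conc_cols] * y$discharge
names(loads) <- sub("concentration", "load", conc_cols)

y <- cbind(y, loads)
y
```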
Edited to add more context and data 5/12/2017
Using R version 3 on Windows
I have a data frame data2:
'data.frame': 1504 obs. of 14 variables:
$ Member.Name : chr "A" "B" "C"...
$ MSTATUS : Factor w/ 14 levels "","ACTIVE","ACTIVE;CHANGEDROLES;NONQUALIF",..: 13 2 2 2 2 4 13 13 2 13 ...
$ MCAT : Factor w/ 9 levels "","EDNEWCLASS",..: 5 4 9 6 6 6 9 9 4 4 ...
$ SALUTATION : Factor w/ 822 levels "","Aaron","Abigail",..: 285 2 2 2 4 4 4 4 5 5 ...
$ MEM_SUBCATEGORY : Factor w/ 22 levels "","AGENCYCEO",..: 22 6 8 15 8 6 8 1 6 6 ...
$ MEM_SUBTYPE : Factor w/ 25 levels "","AGENCY","AGENCYCEO",..: 24 6 6 20 6 6 6 6 6 6 ...
$ COUNTRY : Factor w/ 33 levels "","AE","AT","AU",..: 33 33 33 33 7 33 33 33 33 33 ...
$ F500 : Factor w/ 243 levels "","#1406 on Forbes Global 2000 ($11B)",..: 1 1 96 1 242 1 147 1 1 76 ...
$ OPT_LINE : Factor w/ 1467 levels "","(Formerly) Condé Nast",..: 1 1170 609 1333 251 1427 444 258 814 1207 ...
$ FLAGS : chr "2014PAGEJAMPARTICIPANT, \nPHOTO" "" "PUFOUNDINGMEMBER" "2014FLESPEAKER" ...
$ FLAGS_DESCR : chr "2014 Page Jam Participant, \nPhoto on File" "" "Page Up Founding Member" "2014 Future Leaders Experience Speaker" ...
$ Enroll.Date : Date, format: "2012-12-04" "2010-08-24" "2013-09-20" "2013-05-06" ...
$ Expiration.Date : Date, format: "2014-12-31" "2017-12-31" "2017-12-31" "2017-12-31" ...
$ Sponsorship.Amount: num 0 0 0 0 0 0 0 0 0 0 ...
For the FLAGS variable, I'd like to remove all row elements that contain a year less than 2014.
head(data2$FLAGS, n=3)
[1] "2011PRESIDENTS, \n2012CHAIRMANSCOUNCIL, \n2016CHAIRCOUNCIL" ""
[3] "2012COI"
So that FLAGS will look like:
head(data2$FLAGS, n=3)
[1] "\n2016CHAIRCOUNCIL" ""
[3] ""
The rows with no values can either be blank or NA, BUT if a row contains both an event with a year >=2014 and an event with a year <2014, then just delete the event before 2014 and keep the other events in the row.
This regex works for your example. The idea is to match the first 3 characters of year for those elements that fail and drop them.
FLAGS[-grep("20(0|1[0123])", FLAGS)]
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" "\n2014PAGEJAMPARTICIPANT" "\n2014PUSPONSOR, \nPHOTO"
or, using invert, you'd have
FLAGS[grep("20(0|1[0123])", FLAGS, invert=TRUE)]
Note that it won't catch pre-2000s and you should be cautious if there are other "numeric" values in the vector.
To return a vector of the same length, with NAs replacing the earlier years, you could use is.na<- and grepl like this
is.na(FLAGS) <- grepl("20(0|1[0123])", FLAGS)
original data
FLAGS<-c("2014PAGEJAMPARTICIPANT, \nPHOTO", "2001ANNUALCONFERENCECOMM",
"\n2011GOVERNANCE", "\n2014PAGEJAMPARTICIPANT", "2013NEWMEMBERNOMINATOR",
"\n2014PUSPONSOR, \nPHOTO")
Given the OP's second question, the following more or less works:
sapply(strsplit(FLAGS, ","),
function(x) paste(gsub("(\\n)?20(0|1[0123]).*?(, |$)", "", trimws(x)), collapse=" "))
[1] " 2016CHAIRCOUNCIL" "" ""
Note that a "\n" is missing at the beginning and there is an additional (set of) space(s) at the beginning of the first element. The "\n" is removed by trimws, which makes the string a bit easier to work with. The additional spaces can be removed by wrapping the above expression in trimws, for example, trimws(sapply(strsplit(...))).
additional data
FLAGS <- c("2011PRESIDENTS, \n2012CHAIRMANSCOUNCIL, \n2016CHAIRCOUNCIL", "", "2012COI")
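An alternative that compares the year numerically instead of encoding it in the regex, as a rough sketch: it assumes events are comma-separated and that an event's year, if any, is its first four characters. Tokens without a leading year (such as "PHOTO") are kept. The function name drop_old_events and the min_year argument are made up for illustration.

```r
FLAGS <- c("2011PRESIDENTS, \n2012CHAIRMANSCOUNCIL, \n2016CHAIRCOUNCIL", "", "2012COI")

# Illustrative helper (not from the original answer)
drop_old_events <- function(flags, min_year = 2014) {
  vapply(strsplit(flags, ",", fixed = TRUE), function(events) {
    events <- trimws(events)                  # also strips the leading "\n"
    year <- suppressWarnings(as.integer(substr(events, 1, 4)))
    keep <- is.na(year) | year >= min_year    # no leading year: keep the token
    paste(events[keep], collapse = ", ")
  }, character(1))
}

drop_old_events(FLAGS)
```

Rows where every event is dropped come back as "", which can be turned into NA afterwards if that is preferred.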
Here is one solution using stringr package:
library(stringr)
FLAGS[sapply(str_extract_all(FLAGS, '[0-9]{4}'),
function(x) !any(as.integer(x) < 2014))]
This solution assumes you may have more than one year in each value. If that is not the case, you can do something simpler, like:
FLAGS[as.integer(str_extract(FLAGS, '[0-9]{4}')) >= 2014]
Assuming FLAGS is as follows:
FLAGS
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" "2001ANNUALCONFERENCECOMM"
[3] "\n2011GOVERNANCE" "\n2014PAGEJAMPARTICIPANT"
[5] "2013NEWMEMBERNOMINATOR" "\n2014PUSPONSOR, \nPHOTO"
You get result as:
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" "\n2014PAGEJAMPARTICIPANT"
[3] "\n2014PUSPONSOR, \nPHOTO"
EDITING ANSWER BASED ON QUESTION EDIT ABOVE
You can keep only values with 2014 or above and fill with NAs otherwise as follows:
data2$FLAGS <- ifelse(as.integer(str_extract(data2$FLAGS, '\\d+')) >= 2014,
data2$FLAGS, NA)
Result is as follows:
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" NA
[3] NA "\n2014PAGEJAMPARTICIPANT"
[5] NA "\n2014PUSPONSOR, \nPHOTO"
I am trying to read in the Baylor dataset but I can't use read.csv since the spaces are not consistent.
I do have the column numbers so I was thinking read.fwf would help fix my issue but that means I have to review more than 100 attributes and check the line widths.
Is there an easier way to read the data?
baylor <- read.csv('C:/Users/Documents/baylor-religion-survey-data-2007.txt', header=F)
Column Numbers
Baylor Religion 2007 Survey Data
I haven't tested carefully, but I think this does it:
Define URLs:
lnum_url <- "http://facweb.cdm.depaul.edu/sjost/csc433/projects/baylor-religion-survey-column-numbers.txt"
survey_url <- "http://facweb.cdm.depaul.edu/sjost/csc433/projects/baylor-religion-survey-data-2007.txt"
Read file with column info:
nums <- read.table(url(lnum_url),as.is=TRUE,header=TRUE)
Extract starting column for each field:
startcol <- as.numeric( ## convert to numeric
sapply(
strsplit(nums[,3],"-"), ## split strings on dashes
"[",1)) ## select first element of each result
## sapply(z,"[",1) == sapply(z,function(x) x[1])
Field widths are differences (assume last field is length 1):
w <- c(diff(startcol),1)
Read fixed width:
r <- read.fwf(url(survey_url),widths=w)
Assign field names:
names(r) <- gsub(":","",nums$COL)
Some quick checks:
str(r[,1:8])
## 'data.frame': 1648 obs. of 8 variables:
## $ ID : num 1.1e+09 1.1e+09 1.1e+09 1.1e+09 1.1e+09 ...
## $ WEIGHT : num 0.822 0.312 1.604 1.184 1.35 ...
## $ REGION : int 3 3 4 3 2 2 2 4 2 2 ...
## $ RELIG1 : int 12 12 46 45 14 31 16 33 16 16 ...
## $ RELIG2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ DENOM : Factor w/ 301 levels " ",..: 231 231 1 1 1 1 83 113 1 23 ...
## $ RELGIOUS: int 3 4 1 3 3 4 4 4 3 4 ...
## $ ATTEND : int 5 8 0 8 3 0 8 7 1 8 ...
tail(sort(levels(r$DENOM)))
## [1] " RIVER OF LIFE EVANGELICAL FREE OF ELK RIVER"
## [2] " ELCA - EVANGELICAL LUTHERAN CHURCH OF AMERICA"
## [3] " WASHBURN CHRISTIAN CHURCH DISCIPLES OF CHRIST"
## [4] " THE CHURCH OF JESUS CHRIST OF LATTER DAY SAINTS"
## [5] " GENERAL ASSOCIATION OF REGULAR BAPTISTS CHURCHES"
## [6] "CONGREGATIONAL/METHODIST UNITED CHURCHES OF DURHAM,"
Some more processing (e.g. stripping white space in the denominations) might be in order, and I would certainly further check these results, but this should get you most of the way there.
For future reference it might be worth downloading the data from the original download site and checking cross-tabulations against the code book ...
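On the width calculation: taking differences of the start columns assumes the fields are contiguous and that the last field has width 1. If the column spec gives both ends of each range, the widths can be computed directly. A small sketch with made-up ranges (not the actual Baylor column numbers):

```r
ranges <- c("1-10", "11-18", "19", "20-21")   # hypothetical column specs

parts    <- strsplit(ranges, "-", fixed = TRUE)
startcol <- as.numeric(sapply(parts, "[", 1))
endcol   <- as.numeric(sapply(parts, function(p) p[length(p)]))  # last element; equals start for single columns

w_diff  <- c(diff(startcol), 1)    # differences of starts, last width guessed as 1
w_exact <- endcol - startcol + 1   # widths from both ends of each range

w_diff
w_exact
```

Here the two agree except for the last field, where the guessed width of 1 is too short.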
I'm trying to join two datasets together. Call them x and y. I believe that the ID variables in y are a subset of the ID variables in x. But not in the pure sense because I know that x contains more IDs than y but I don't know the mapping. That is, some (but not all) of the IDs in x and y can be matched 1:1.
My ultimate goal is to figure out where this 1:1 mapping fails and flag these observations. I thought merge would be the way to go but maybe not. An example is below:
id <- c(1:10, 1:100)
X1 <- rnorm(110, mean = 0, sd = 1)
year <- c("2004","2005","2006","2001","2002")
year <- rep(year, 22)
month = c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr")
month <- rep(month, 11)
#dataset X
x <- cbind(id, X1, month, year)
#dataset Y
id2 <- c(1:10, 200)
Y1 <- rnorm(11, mean = 0 , sd = 1)
y <- cbind(id2,Y1)
#merge on the IDs; but we get an error because when id2 == 200 in y we don't
#have a match in x
result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE)
The merge threw an error because id2 == 200 had no match in the x dataset. Unfortunately, I lost the ID and all the information as well! (it should equal 200 in row 111):
tail(result)
id X1 month year Y1
106 95 -0.0748386054887876 Nov 2002 NA
107 96 0.196765325477989 Dec 2004 NA
108 97 0.527922135906927 Jan 2005 NA
109 98 0.197927230533413 Feb 2006 NA
110 99 -0.00720474886698309 Mar 2001 NA
111 <NA> <NA> <NA> <NA> -0.9664941
What's more, I get duplicate observations on the ID variable in the merged file. The id2 == 1 observation only existed once but it just copied it twice (e.g. Y1 takes on the value 1.55 twice).
head(result)
id X1 month year Y1
1 1 -0.67371266313441 Jul 2004 1.553220
2 1 -0.318666983469993 Jul 2004 1.553220
3 10 -0.608192898092431 Apr 2002 1.234325
4 10 -0.72299929212347 Apr 2002 1.234325
5 100 -0.842111221826554 Apr 2002 NA
6 11 -0.16316681842082 Jul 2004 NA
This merge has made things more complicated than I intended. I was hoping I could examine every observation in x and figure out where the id matched id2 in y and flag the ones that didn't. So I would get a new vector, call it flag, that takes on a value 1 if x$id had a match in y$id2 and zero otherwise. This way, I could know where the 1:1 mapping failed. I could potentially get some traction on this by re-coding the NAs, but what about the error that gets thrown when id2 == 200? It just discards the information.
I have tried appending by rows with no luck and it looks like I should give up on merge as well; perhaps it's better to write a loop or function to do something along these lines:
for every observation in x
id2 = which(id2) corresponds to id-month-year
flag = 1 if length of above is == 1, 0 otherwise
etc.
Hopefully this all makes sense. I'd be very grateful for any help or guidance.
If you are looking for which things in x$id are in y$id2, then you can use
x$id %in% y$id2
to get a logical vector returning matches. It does not guarantee a 1-to-1 correspondence, however; just a 1-to-many. You can then add this vector to your data frame
x$match.y <- x$id %in% y$id2
to see what rows of x have a corresponding ID in y.
To see which observations are 1-to-1, you could do something like
y$id2[duplicated(y$id2)] #vector of duplicate elements in y$id2
(x$id %in% y$id2) & !(x$id %in% y$id2[duplicated(y$id2)])
to filter out elements that appear more than once in y$id2. You can also add this to x:
x$match.y.unique <- (x$id %in% y$id2) & !(x$id %in% y$id2[duplicated(y$id2)])
The same procedure can be done for y to determine what rows of y match in x, and which ones match uniquely.
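Putting those pieces together, a minimal sketch. Note that it assumes x and y are data frames (the $ operator does not work on the matrices that cbind produces), and the small id vectors here are made up:

```r
x <- data.frame(id  = c(1:3, 1:5))
y <- data.frame(id2 = c(1, 2, 2, 200))

# 1 if the id appears anywhere in y, 0 otherwise
x$flag <- as.integer(x$id %in% y$id2)

# ids that occur more than once in y
dup_ids <- unique(y$id2[duplicated(y$id2)])
x$match.y.unique <- (x$id %in% y$id2) & !(x$id %in% dup_ids)

# ids in y that have no partner in x
unmatched_y <- y$id2[!(y$id2 %in% x$id)]

x
unmatched_y
```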
The reason your merge failed was that you gave it two different structures (one a numeric matrix and the other a character matrix) for x and y. Using cbind when data.frame should be chosen is a common strategy for failure.
> str(x)
chr [1:110, 1:4] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "1" "2" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:4] "id" "X1" "month" "year"
> str(y)
num [1:11, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "id2" "Y1"
If you used the data.frame function (since dataframes are what merge is supposed to be working with) it would have succeeded:
> x <- data.frame(id, X1, month, year); y <- data.frame(id2,Y1)
> str( result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE) )
'data.frame': 111 obs. of 5 variables:
$ id : num 1 1 2 2 3 3 4 4 5 5 ...
$ X1 : num 1.5063 2.5035 0.7889 -0.4907 -0.0446 ...
$ month: Factor w/ 10 levels "Apr","Aug","Dec",..: 6 6 2 2 10 10 9 9 8 8 ...
$ year : Factor w/ 5 levels "2001","2002",..: 3 3 4 4 5 5 1 1 2 2 ...
$ Y1 : num 1.449 1.449 -0.134 -0.134 -0.828 ...
> tail( result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE) )
id X1 month year Y1
106 96 -0.3869157 Dec 2004 NA
107 97 0.6373009 Jan 2005 NA
108 98 -0.7735626 Feb 2006 NA
109 99 -1.3537915 Mar 2001 NA
110 100 0.2626190 Apr 2002 NA
111 200 NA <NA> <NA> -1.509818
If you have duplicates in your 'x' argument, then you should get duplicates in the result. It's then your responsibility to use !duplicated in whatever manner you deem appropriate (either before or after the merge), but you cannot expect merge to be making decisions like that for you.
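If the goal is to flag where the mapping fails, one common trick is to add a marker column to each data frame before the full merge and then look at which markers are missing. A sketch with made-up data; the in_x and in_y column names are my own:

```r
x <- data.frame(id  = c(1, 2, 3),   X1 = c(0.1, 0.2, 0.3))
y <- data.frame(id2 = c(2, 3, 200), Y1 = c(1.5, 2.5, 3.5))

x$in_x <- TRUE   # marks rows that came from x
y$in_y <- TRUE   # marks rows that came from y

result <- merge(x, y, by.x = "id", by.y = "id2", all = TRUE)

# after a full merge, a missing marker means "no row on that side"
result$in_x[is.na(result$in_x)] <- FALSE
result$in_y[is.na(result$in_y)] <- FALSE
result
```

Rows with in_x TRUE and in_y FALSE are ids found only in x, and the reverse for ids found only in y (id 200 here).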
New to R.
I have a data.frame
'data.frame': 2070 obs. of 5 variables:
$ id : int 16625062 16711130 16625064 16668358 16625066 16711227 16711290 16668746 16711502 16625494 ...
$ subj : Factor w/ 3 levels "L","M","S": 1 1 1 1 1 1 1 1 1 1 ...
$ grade: int 4 6 4 5 4 6 6 5 6 4 ...
$ score: int 225 225 0 225 225 375 375 125 225 125 ...
$ level: logi NA NA NA NA NA NA ...
and a list of named numbers called lookup
Named num [1:12] 12 19 20 26 31 32 49 67 72 73 ...
- attr(*, "names")= chr [1:12] "0" "50" "100" "125" ...
I'd like to find a way to update the data frame "level" column by looking up values in the lookup list, matching the data frame "score" column with the name of the number in the lookup list. In other words, the score values in the data frame are used to look up the number (that will go in the level column) in the lookup list.
So... if anyone understands what I mean... please help.
Thanks Robn
You should be able to do this with (assuming your data frame is called d):
d$level = as.numeric(lookup[as.character(d$score)])
For example:
lookup = list(1, 2, 3, 4)
names(lookup) = c("0", "50", "100", "150")
d = data.frame(score=c(50, 150, 0, 0), level=NA)
d$level = as.numeric(lookup[as.character(d$score)])
print(d)
# score level
# 1 50 2
# 2 150 4
# 3 0 1
# 4 0 1
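Since the lookup in the question is a named numeric vector rather than a list, here is the same idea sketched for that case. The lookup values below are made up, and I am assuming a score with no entry in lookup should end up as NA:

```r
lookup <- c("0" = 12, "50" = 19, "100" = 20, "125" = 26)   # hypothetical values
d <- data.frame(score = c(125, 0, 225, 50), level = NA)

# index by name; unname() drops the names carried over from lookup
d$level <- unname(lookup[as.character(d$score)])
d
```

With a named vector, a score that has no matching name simply yields NA, so it is easy to spot afterwards with is.na(d$level).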