Specialized output with for loop - r

I have a CSV file and I want to get a specialized output for a letter just by typing in the ID (PIPAPNr),
for example
input = PIPAP1147
output Roger Nadal 11.07.1993
Pipapnr="PIPAP1147"
for (i in 1:nrow(Patienten)){
if (Patienten$PIPAP.Nr.==Pipapnr)
DOB<- (Patienten$Geburtsdatum[i])
Name<- (Patienten$Name[i])}
The error is
In if (Patienten$PIPAP.Nr. == Pipapnr) DOB <- (Patienten$Geburtsdatum[i]) :
the condition has length > 1 and only the first element will be used

In this code Pipapnr contains just one value, but Patienten$PIPAP.Nr. is a whole column, so the comparison yields one result per row and if() uses only the first of them.
That is the explanation of the message. You probably wanted the if clause to read if (Patienten$PIPAP.Nr.[i]==Pipapnr) ...
Still, John Garland is right in his comment that these things can be handled more elegantly in R, for instance with something like which(Patienten$PIPAP.Nr.==Pipapnr).
As your code is in German, you may be interested in the German R forum at http://forum.r-statistik.de
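A minimal sketch of both fixes, using a small made-up data frame in place of the real Patienten table. Only the column names come from the question; the rows, including the second patient, are invented for illustration.

```r
# Hypothetical stand-in for the Patienten data; only the column names are from the question
Patienten <- data.frame(
  PIPAP.Nr.    = c("PIPAP1146", "PIPAP1147"),
  Name         = c("Erika Muster", "Roger Nadal"),
  Geburtsdatum = c("01.02.1990", "11.07.1993"),
  stringsAsFactors = FALSE
)
Pipapnr <- "PIPAP1147"

# Fix 1: index the ID column with [i] so the if() condition has length 1
for (i in seq_len(nrow(Patienten))) {
  if (Patienten$PIPAP.Nr.[i] == Pipapnr) {
    DOB  <- Patienten$Geburtsdatum[i]
    Name <- Patienten$Name[i]
  }
}

# Fix 2: no loop, compare the whole column at once and keep the matching row
idx <- which(Patienten$PIPAP.Nr. == Pipapnr)
cat(Patienten$Name[idx], Patienten$Geburtsdatum[idx], "\n")
```

If the ID might be missing from the file, checking length(idx) before using it avoids printing an empty result.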

Related

Why does read.csv2 work just fine, yet read.csv2.sql shows an error/warning?

I am trying to read a csv file in R using read.csv2.sql, since I would like to use a SELECT query from SQL to help me filter my data, but before I can even get to my SELECT query, I discovered that simply reading my csv file using read.csv2.sql already generates a warning message.
This is my code:
investment2 <- read.csv2.sql("investmentdata.csv")
This is the warning message:
Warning message:
In result_fetch(res@ptr, n = n) :
Column 'Capital.Investment': mixed type, first seen values of type real, coercing other values of type string
However, when I use the normal read.csv2 function, there is no error. In particular, the following code works fine with no warning messages:
investment <- read.csv2("investmentdata.csv")
Next, I tried to resolve this issue by casting the Capital.Investment column to be real as follows:
investment3 <- read.csv2.sql("investmentdata.csv", "SELECT *, CAST(Capital.Investment AS real) FROM file")
However, R now generates the following error:
Error: no such column: Capital.Investment
Thus, I have two questions. Firstly, why does using read.csv2.sql generate that warning message when read.csv2 works just fine? Secondly, why does R (or SQL) not recognise my Capital.Investment column when I try to cast it as real?
Perhaps it is also worth noting that I cannot simply ignore this warning, because I discovered that, as a consequence of it, read.csv2.sql has automatically cast some of the NA rows in my Capital.Investment column to 0, which I cannot allow: the NA rows must stay as NA. I do not seem to have this problem with the other columns of my csv file, though.
As I am quite new to R, any help and explanations will be greatly appreciated :)
Edit
The coded version of what my truncated csv file looks like is as follows. In particular, the name of the column-in-question is indeed Capital.Investment.
id;targetC;year;comp_id;homeC;Industry.Activity;Capital.Investment;Estimated;Jobs.Created;Estimated.1;Project.Type;geographic distance;SIC;listed;sales;assets;cap_structure;rnd;profit;rndintensity;polcon;homeC_gdp;targetC_gdp;homeC_gdppc;targetC_gdppc
1302;AUS;2008;FR338966385;FRA;Design, Development & Testing;33.1;Yes;36;Yes;New;15.26414042;3669;Unlisted;4333088.972;4037211.732;0;NA;-1339221.733;NA;0.489032525;2.92347E+12;1.05456E+12;45413.06571;49628.11513
1311;AUS;2008;US*190521496652;USA;Research & Development;8.4;Yes;30;No;New;15.24712914;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
1313;AUS;2008;GB05817296;GBR;Business Services;9.7;Yes;10;Yes;New;15.31094496;7389;Unlisted;NA;87.64187374;NA;NA;NA;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
1318;AUS;2008;US129687150L;USA;Business Services;1.3;Yes;225;Yes;New;15.24712914;7373;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
1351;AUS;2008;GB*P0060071;GBR;Electricity;516;No;51;Yes;New;15.31094496;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
9925;AUS;2008;GB00034121;GBR;Business Services;34.8;Yes;37;Yes;New;15.31094496;4412;Unlisted;NA;2079288.611;0.355157008;NA;94320.15469;NA;0.489032525;2.87546E+12;1.05456E+12;46523.26545;49628.11513
9932;AUS;2008;CA30060NC;CAN;Sales, Marketing & Support;3.2;Yes;11;Yes;New;14.88812529;1094;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.54913E+12;1.05456E+12;46596.33599;49628.11513
9935;AUS;2008;US940890210;USA;Manufacturing;771;Yes;266;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9938;AUS;2008;US770059951;USA;Technical Support Centre;9.1;Yes;104;Yes;Co-Locati;15.24712914;3661;Listed;34922000;53340000;0.120134983;4598000;7333000;0.086201723;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9946;AUS;2008;US010562944;USA;Extraction;535.8;Yes;198;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9955;AUS;2008;DE5030147191;DEU;Logistics, Distribution & Transportation;21.2;Yes;134;Yes;New;14.6718338;4311;Listed;93495971.01;346629334.8;0.036629492;0;2044745.934;0;0.489032525;3.75237E+12;1.05456E+12;45699.19832;49628.11513
9958;AUS;2008;US126012192L;USA;Business Services;9.7;Yes;10;Yes;New;15.24712914;8111;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9969;AUS;2008;US135409005;USA;Extraction;NA;No;538;Yes;New;15.24712914;2911;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9977;AUS;2008;JP000000728JPN;JPN;ICT & Internet Infrastructure;128.6;Yes;77;Yes;New;7.0333688;3571;Listed;53255396.85;38181450.16;0.190244908;2584585.523;480589.4308;0.067692176;0.489032525;5.03791E+12;1.05456E+12;39339.29757;49628.11513
9984;AUS;2008;US841547578;USA;Sales, Marketing & Support;13.6;Yes;23;Yes;New;15.24712914;2095;Listed;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
9993;AUS;2008;US258715604L;USA;Customer Contact Centre;1.8;No;40;No;New;15.24712914;NA;Unlisted;NA;NA;NA;NA;NA;NA;0.489032525;1.47E+13;1.05456E+12;48401.42734;49628.11513
This was resolved in chat and turned out to be two separate issues:
1) The error, which my original answer below addresses. Once that is fixed, we see that ...
2) There is a warning saying that a column (it happens to be the same column) looks numeric but has a non-numeric cell somewhere deeper in the file.
The first is resolved below; the second is just a warning.
However, because the OP is asking to convert to numeric via SQL, the NA is converted to 0, which is not good. My recommendation is to either cast([Capital.Investment] as char) as [Capital.Investment] and use R's as.numeric to convert to numeric (preserving the NA-nature), or to just read.csv2(.) the file outright and use sqldf(.) to use its SQL querying on table-like data.
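To illustrate the first option, here is a sketch in base R. It assumes the column comes back from SQL as text in which missing cells are the literal string "NA"; the vector below is made up from a few values in the sample data.

```r
# Hypothetical text column, as it might look after cast(... as char)
ci_chr <- c("33.1", "8.4", "NA", "516")

# as.numeric turns the string "NA" into a real NA rather than 0
ci_num <- as.numeric(ci_chr)

# Only the third element is missing after the conversion
is.na(ci_num)
```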
Up front: add brackets or quotes around your column name.
Rationale: Capital.Investment is parsed as a dot-delimited table.column (or schema.table) reference, which is not what you intend. I believe that in SQL in general, field names with embedded dots need this escaping. If the header in your file has an embedded space, realize that R does not like spaces in its field names, so by default it uses make.names when reading the file in (which replaces spaces with dots).
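A quick way to see what make.names does to headers, using two names from the sample file and one with a space as an assumed variant:

```r
# make.names converts headers into syntactically valid R names
headers <- c("Capital Investment", "geographic distance", "cap_structure")
fixed <- make.names(headers)

# Spaces become dots; names that are already valid are left alone
print(fixed)
```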
Setup:
Save the following as "quux.csv". (I've named it csv, but since I'm changing it to be ;-delimited, it behaves the same.)
quux;Capital.Investment
1;100
2;200
(Or you can use Capital Investment, it's the same thing.)
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast(Capital.Investment as real) from file')
# Error: no such column: Capital.Investment
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast([Capital.Investment] as real) as CI from file')
# quux CI
# 1 1 100
# 2 2 200
sqldf::read.csv2.sql("quux.csv", sql='select quux, cast("Capital.Investment" as real) as CI from file')
# quux CI
# 1 1 100
# 2 2 200

String Matching in R - Problem with pattern

I have a small Problem. I want to extract a special pattern like this:
v-97bcer
or b-chyfvg or ghd6db
I tried this:
identifier_1 <- "([:alnum:]{6})" # for things like this ghd6db
identifier_2 <- "([:lower:]{1})[- ][:alnum:]{6})" # for things like this v-97bcer or b-chyfvg
The problem is that the first "identifier" works OK, but it also extracts names, for example. In an example like GHD6D8 the numbers have no fixed place and can occur anywhere. I just know that the length is 6.
The second problem is that, for example, v-97bcer can also occur as v97bcer, but I need the format v-97bcer. Here too the numbers are placed randomly.
Could somebody help or point me to a good source for understanding how to do this? I do not have much experience with string matching. Thank you
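This should work:
x <- c("v-97bcer", "b-chyfvg", "ghd6db", "v97bcer")
grep("^([a-z].)?[a-z0-9]{6}$", x)
Note that to fix the length of the match I anchor the pattern with ^ and $.
This pattern matches v-97bcer, b-chyfvg and ghd6db but not v97bcer, so grep returns the positions of the first three elements. The . after [a-z] matches any single character; write - instead if only a hyphen should be accepted.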
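A slightly stricter sketch of the same idea, plus one possible way to handle the second problem from the question (rewriting v97bcer as v-97bcer). The uppercase test value and the normalization step are my own additions, not part of the answer above.

```r
x <- c("v-97bcer", "b-chyfvg", "ghd6db", "v97bcer", "GHD6D8")

# Optional "letter + hyphen" prefix, then exactly six lowercase letters or digits
pattern <- "^([a-z]-)?[a-z0-9]{6}$"
matches <- grepl(pattern, x)

# Assumed normalization: a 7-character code without hyphen gets one after the first letter
normalized <- sub("^([a-z])([a-z0-9]{6})$", "\\1-\\2", x)

print(data.frame(x, matches, normalized))
```

Because the character classes are lowercase only, names written in capitals no longer match; whether that is enough to exclude ordinary words depends on the real data.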

How to use partial matches across multiple columns in R to set final value

I am new to R, moving over from Excel VBA. I would like to categorize a final value based on the text provided in multiple columns and 20k+ rows.
I've been semi-successful with "if" and "identical" but have struggled with partial matches using "grep".
I'll share pseudo-code of what I'm trying to achieve:
If d$Removal_Reason_Code contains "SCH" AND
If d$Shop_Action_Code is an exact match to "Test" AND
If d$Repair_Summary contains "No Fault Found"
Then
set d$Category to "NFF"
Else
go back to row 1 and check against other keywords
I can post the working VBA code if that is helpful. I'm just getting my head round how R works, and was hoping it may be a quick and easy answer for one of you gurus!
Much appreciated :)
We can use grepl for the partial matches and == for the exact one ("Test", as written in the question):
i1 <- with(d, grepl("SCH", Removal_Reason_Code) & Shop_Action_Code == "Test" &
    grepl("No Fault Found", Repair_Summary))
d$Category[i1] <- "NFF"
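A runnable sketch of this approach on invented data. The three rows, the second rule and its "REPAIRED" label are hypothetical; they only show how further keyword checks could be applied to rows that are still uncategorized.

```r
# Made-up sample rows; the column names follow the question
d <- data.frame(
  Removal_Reason_Code = c("SCH-01", "UNS-02", "SCH-03"),
  Shop_Action_Code    = c("Test", "Test", "Repair"),
  Repair_Summary      = c("No Fault Found on bench", "No Fault Found", "Replaced seal"),
  stringsAsFactors = FALSE
)
d$Category <- NA_character_

# Rule 1: all three conditions must hold in the same row
is_nff <- with(d, grepl("SCH", Removal_Reason_Code, fixed = TRUE) &
                  Shop_Action_Code == "Test" &
                  grepl("No Fault Found", Repair_Summary, fixed = TRUE))
d$Category[is_nff] <- "NFF"

# Rule 2 (hypothetical): only rows not yet categorized are considered
is_repair <- is.na(d$Category) & d$Shop_Action_Code == "Repair"
d$Category[is_repair] <- "REPAIRED"

print(d[, c("Removal_Reason_Code", "Category")])
```

Each rule is evaluated for all rows at once, so there is no need to loop over 20k+ rows as one would in VBA.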

read in csv file in R and make a list out of last column

Content of my.csv
project names,task names
Build Finances,Calculate Earnings
Build Roads,Calculate Equipment Costs
Buy Food, Calculate Grocery Costs
The code I'm using to read /tmp/my.csv into a variable/vector is:
taskNamesAndprojectNames <- read.csv("/tmp/my.csv", header=TRUE)
What I want to do is grab the last column of my.csv, which has been read into the taskNamesAndprojectNames variable, and then make a list out of it.
So, something like this:
#!/usr/bin/Rscript
taskNamesAndprojectNames <- read.csv("/tmp/my.csv", header=TRUE)
#str(tasklists)
#tasklists
#tasklists[,ncol(tasklists)]
taskNames <- list(taskNamesAndprojectNames[,-1])
typeof(taskNames)
length(taskNames)
The problem with the above code is that when I run length on the taskNames variable to confirm that it has the correct number of elements, I only get 1, which is not accurate.
[roywell@test01 data]$ ~/readincsv.r
[1] "list"
[1] 1
What am I doing wrong here? Can someone help me correct this code? I want to grab the last column of a csv sheet exported from Excel, put its values into a variable, and make a list out of it. Then I want to iterate through the list to confirm that a value provided by a user matches at least one of its elements.
taskNames <- list(taskNamesAndprojectNames[,-1]) makes a list with one element that is a character vector of length 3.
It sounds like you are looking for a vector in this case:
taskNames <- taskNamesAndprojectNames[,-1]
typeof(taskNames)
[1] "character"
length(taskNames)
[1] 3
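A self-contained sketch of the whole task, reading the sample content from a string instead of /tmp/my.csv. The user_input value is made up, and strip.white = TRUE is my addition to drop the leading space in the last row.

```r
# The sample file content from the question, inlined so the example runs anywhere
csv_text <- "project names,task names
Build Finances,Calculate Earnings
Build Roads,Calculate Equipment Costs
Buy Food, Calculate Grocery Costs"

tasks_df <- read.csv(text = csv_text, header = TRUE,
                     strip.white = TRUE, stringsAsFactors = FALSE)

# Last column as a plain character vector
taskNames <- tasks_df[, ncol(tasks_df)]

# Hypothetical user input; %in% checks it against every element without a loop
user_input <- "Calculate Earnings"
is_known_task <- user_input %in% taskNames

# If a real list with one element per task is needed, as.list does that
taskList <- as.list(taskNames)
print(c(length(taskNames), length(taskList)))
```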

Scientific notation issue in R

I have an ID variable with 20 digits. Once I read the data into R, it changes to scientific notation, and if I then write the same ID to a csv file, the value of the ID changes.
For example, running the code below should print the value of x as "12345678912345678912", but it prints "12345678912345679872":
Code:
options(scipen=999)
x <- 12345678912345678912
print(x)
Output:
[1] 12345678912345679872
My questions are:
1) Why is this happening?
2) How can I fix it?
I know it has to do with how R stores data types, but I still think there should be some way to deal with this problem. I hope my question is clear.
I don't know whether this question has already been asked on SO; if it is a duplicate, point me to the link and I will remove this post.
I have gone through this, so I can relate it to my issue, but I am unable to fix it.
Any help would be highly appreciated. Thanks
R does not by default handle integers larger than 2147483647L.
If you append an L to your number (to tell R it's an integer), you get:
x <- 12345678912345678912L
#Warning message:
#non-integer value 12345678912345678912L qualified with L; using numeric value
This also explains the change of the last digits as R stores the number as a double.
I think the gmp package should be able to handle large numbers in general. You should therefore either accept the loss of precision, store the IDs as character strings, or use a data type from the gmp package.
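A small base R sketch of why the digits change, and of the character-string alternative. The variable names are mine; the ID is the one from the question.

```r
# Largest value R's integer type can hold
int_max <- .Machine$integer.max

# Doubles carry 53 bits of precision, so beyond 2^53 neighbouring integers collapse
collapsed <- (2^53 == 2^53 + 1)

# The 20-digit ID as a double versus as a character string
x_num <- 12345678912345678912
x_chr <- "12345678912345678912"

print(int_max)
print(collapsed)
print(nchar(x_chr))  # the string keeps all 20 digits
```

An ID is a label rather than a quantity, so storing it as text loses nothing as long as no arithmetic is needed on it.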
To circumvent the problem caused by how numbers are stored, you can import your ID variable directly as character with the colClasses option. For example, using read.csv to import a data.frame with the ID column and another numeric column:
mydata<-read.csv("file.csv",colClasses=c("character","numeric"),...)
Using readr you can do
mydata <- readr::read_csv("file.csv", col_types = readr::cols(ID = readr::col_character()))
where "ID" is the name of your ID column.
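A self-contained check of the colClasses approach, reading from a string rather than file.csv. The column name value and the single data row are invented for the example.

```r
# Hypothetical two-column file: a 20-digit ID and an ordinary number
csv_text <- "ID,value\n12345678912345678912,1.5\n"

# Read the ID as text and the second column as numeric
mydata <- read.csv(text = csv_text, colClasses = c("character", "numeric"))

# Writing the data back out keeps the ID digits intact
csv_out <- capture.output(write.csv(mydata, row.names = FALSE))
print(mydata$ID)
```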

Resources