How to convert a text file and store it as CSV - R

13-JUL-17
Bank User Space Occupied(GB)
------------------------------ ------------------
CKYC_MNSB .004211426
CORE_AMARNATH_ASP 8.75262451
CORE_AMBUJA 6.80389404
CORE_AMBUJA_ASP 10.0085449
CORE_ANAND_MERC_ASP 18.9866333
CORE_BALOTRA 17.8280029
CORE_BASODA 4.55432129
CORE_CHHAPI_ASP 11.9767456
CORE_DHANGDHRA_ASP 13.1849976
CORE_IDAR_ASP 13.3209229
CORE_JANTA_HALOL_ASP 12.7955933
Bank User Space Occupied(GB)
------------------------------ ------------------
CORE_JHALOD_URBAN_ASP 9.19219971
CORE_MANINAGAR 5.36090088
CORE_MANINAGAR_ASP 6.31414795
CORE_SANKHEDA 20.4329834
CORE_SMCB_ANAND_ASP 11.3191528
CORE_TARAPUR_ASP 8.24627686
CORE_VUCB .000610352
TBA_TEMP 5.39910889
TEST_DUNIA 4.15698242
20 rows selected.
TABLESPACE NAME Free Space in GB
------------------------------ ----------------
TBAPROJ 33.2736816
I have the text file shown above.
How can I store it as a CSV file with the columns separated?
I have loaded the file, but it is very difficult to remove the blank space from it.

Each line you want matches the pattern of a word made from capital letters and underscores, then spaces, then a number that has a decimal point in it, so this grep keeps only those lines:
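> file_raw <- readLines('file.txt')
> # keep lines that start with a capital letter or underscore and contain a decimal point
> read.table(
+   text = paste(file_raw[grep("^[A-Z_].*\\s*\\.", file_raw)], collapse = "\n"),
+   sep = "", header = FALSE)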
V1 V2
1 CKYC_MNSB 0.004211426
2 CORE_AMARNATH_ASP 8.752624510
3 CORE_AMBUJA 6.803894040
4 CORE_AMBUJA_ASP 10.008544900
5 CORE_ANAND_MERC_ASP 18.986633300
6 CORE_BALOTRA 17.828002900
7 CORE_BASODA 4.554321290
8 CORE_CHHAPI_ASP 11.976745600
9 CORE_DHANGDHRA_ASP 13.184997600
10 CORE_IDAR_ASP 13.320922900
11 CORE_JANTA_HALOL_ASP 12.795593300
12 CORE_JHALOD_URBAN_ASP 9.192199710
13 CORE_MANINAGAR 5.360900880
14 CORE_MANINAGAR_ASP 6.314147950
15 CORE_SANKHEDA 20.432983400
16 CORE_SMCB_ANAND_ASP 11.319152800
17 CORE_TARAPUR_ASP 8.246276860
18 CORE_VUCB 0.000610352
19 TBA_TEMP 5.399108890
20 TEST_DUNIA 4.156982420
21 TBAPROJ 33.273681600
Note that if you are expecting any of the first tokens to not match the pattern, for example CORE_999 or lower_case then you need to adjust the pattern. But without a formal spec we can only go on what you supplied.
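To go from the filtered lines to an actual CSV file, something like the following sketch may help. It assumes the report looks like the sample above; the few lines in file_raw are copied from it (the real input would come from readLines('file.txt')), the column names name and gb are illustrative, and the pattern is a stricter variant that requires the number to be the last thing on the line.

```r
# A few lines copied from the report; in practice use readLines('file.txt').
file_raw <- c(
  "13-JUL-17",
  "Bank User                      Space Occupied(GB)",
  "------------------------------ ------------------",
  "CKYC_MNSB                              .004211426",
  "CORE_AMARNATH_ASP                      8.75262451",
  "20 rows selected.",
  "TABLESPACE NAME                Free Space in GB",
  "------------------------------ ----------------",
  "TBAPROJ                                33.2736816"
)

# Capital letters/underscores, whitespace, then a number with a decimal point.
pattern <- "^[A-Z_]+\\s+[0-9]*\\.[0-9]+\\s*$"
rows <- grep(pattern, file_raw, value = TRUE)

# Parse the kept lines into two columns (names are illustrative).
usage <- read.table(text = rows, header = FALSE,
                    col.names = c("name", "gb"), stringsAsFactors = FALSE)

# Write a proper two-column CSV; a temporary file is used here as a placeholder.
out_file <- tempfile(fileext = ".csv")
write.csv(usage, out_file, row.names = FALSE)
print(read.csv(out_file))
```

Setting row.names = FALSE keeps the row index out of the file, so reading it back gives just the two columns.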

There might be a more elegant way, but this does the trick:
# read raw file in lines
file_raw <- readLines('file.txt')
# remove whitespace
file_trim <- trimws(file_raw,which = 'both')
# remove empty lines
file_trim <- file_trim[file_trim != '']
# sub white space with separator ,
file_csv <- gsub('\\s{2,}',',',file_trim)
Some things will still be left over, such as the ---- separator lines and "20 rows selected.", but they can easily be filtered out, either before writing or after reading:
file_clean <- file_csv[!grepl('(-){3,}|rows selected',file_csv)]
write.csv(file_clean,'file_cleaned.csv')
> read.csv('file_cleaned.csv')
X x
1 1 13-JUL-17
2 2 Bank User,Space Occupied(GB)
3 3 CKYC_MNSB,.004211426
4 4 CORE_AMARNATH_ASP,8.75262451
5 5 CORE_AMBUJA,6.80389404
6 6 CORE_AMBUJA_ASP,10.0085449
7 7 CORE_ANAND_MERC_ASP,18.9866333
8 8 CORE_BALOTRA,17.8280029
9 9 CORE_BASODA,4.55432129
10 10 CORE_CHHAPI_ASP,11.9767456
11 11 CORE_DHANGDHRA_ASP,13.1849976
12 12 CORE_IDAR_ASP,13.3209229
13 13 CORE_JANTA_HALOL_ASP,12.7955933
14 14 Bank User,Space Occupied(GB)
15 15 CORE_JHALOD_URBAN_ASP,9.19219971
16 16 CORE_MANINAGAR,5.36090088
17 17 CORE_MANINAGAR_ASP,6.31414795
18 18 CORE_SANKHEDA,20.4329834
19 19 CORE_SMCB_ANAND_ASP,11.3191528
20 20 CORE_TARAPUR_ASP,8.24627686
21 21 CORE_VUCB,.000610352
22 22 TBA_TEMP,5.39910889
23 23 TEST_DUNIA,4.15698242
24 24 TABLESPACE NAME,Free Space in GB
25 25 TBAPROJ,33.2736816
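In the output above each line ends up in a single column x, because write.csv() receives a character vector and writes every element as one quoted field. A possible variation, sketched under the assumption that the columns in the raw file are separated by two or more spaces, is to write the already comma-separated lines with writeLines() and drop the date and repeated header lines first. The sample lines, the header name,gb and the temporary file are illustrative.

```r
# A few lines copied from the report; in practice use readLines('file.txt').
file_raw <- c(
  "13-JUL-17",
  "Bank User                      Space Occupied(GB)",
  "------------------------------ ------------------",
  "CKYC_MNSB                              .004211426",
  "CORE_AMARNATH_ASP                      8.75262451",
  "",
  "20 rows selected.",
  "TABLESPACE NAME                Free Space in GB",
  "------------------------------ ----------------",
  "TBAPROJ                                33.2736816"
)

file_trim <- trimws(file_raw)
file_trim <- file_trim[file_trim != ""]
file_csv <- gsub("\\s{2,}", ",", file_trim)

# Keep lines that became exactly two fields and are not repeated headers.
two_fields <- grepl("^[^,]+,[^,]+$", file_csv)
is_header <- grepl("^(Bank User|TABLESPACE NAME),", file_csv)
file_clean <- file_csv[two_fields & !is_header]

# Write the lines as they are, with one illustrative header line on top.
out_file <- tempfile(fileext = ".csv")
writeLines(c("name,gb", file_clean), out_file)

result <- read.csv(out_file)
print(result)
```

Because the lines are written verbatim, read.csv() then splits them on the commas and parses the second column as numeric.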

Related

Show only even numbers from a data set

I am trying to extract only the even numbers from the "cars" data set.
I know I need to create a new function.
I have come this far:
Is.even = function(x) x %% 2 == 0
When I enter in:
Is.even(cars[1])
It gives me back a logical response. I want to only display the actual even numbers in integer form and hide the odd numbers.
What am I doing wrong?
Apart from @neilfws's suggestion, if you pass your values as a vector you can also use Filter:
Filter(Is.even, cars[, 1])
#[1] 4 4 8 10 10 10 12 12 12 12 14 14 14 14 16 16 18 18 18 18 20 20 20 20 20 22 24 24 24 24
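The logical response appears because Is.even() only tests each value; using that logical result as an index returns the values themselves. A minimal sketch, assuming the column of interest is cars$speed (the first column of the built-in cars data set); even_speeds and even_rows are illustrative names:

```r
Is.even <- function(x) x %% 2 == 0

# Work on the column as a plain vector and index it with the logical result.
speed <- cars$speed
even_speeds <- speed[Is.even(speed)]
print(even_speeds)

# The same logical vector can also keep whole rows where speed is even.
even_rows <- cars[Is.even(cars$speed), ]
print(head(even_rows))
```

This gives the same values as Filter(Is.even, cars[, 1]); the row version is useful when the matching dist values are needed as well.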

How to use apply function instead of for loop if you have multiple if conditions to be executed

1st DF:
t.d
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
names(t.d) <- c("ID","A","B","C")
t.d$FinalTime <- c("7/30/2009 08:18:35","9/30/2009 19:18:35","11/30/2009 21:18:35","13/30/2009 20:18:35","15/30/2009 04:18:35")
t.d$InitTime <- c("6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35")
> t.d
ID A B C FinalTime InitTime
1 1 6 11 16 7/30/2009 08:18:35 6/30/2009 9:18:35
2 2 7 12 17 9/30/2009 19:18:35 6/30/2009 9:18:35
3 3 8 13 18 11/30/2009 21:18:35 6/30/2009 9:18:35
4 4 9 14 19 13/30/2009 20:18:35 6/30/2009 9:18:35
5 5 10 15 20 15/30/2009 04:18:35 6/30/2009 9:18:35
2nd DF:
> s.d
F D E Time
1 10 19 28 6/30/2009 08:18:35
2 11 20 29 8/30/2009 19:18:35
3 12 21 30 9/30/2009 21:18:35
4 13 22 31 01/30/2009 20:18:35
5 14 23 32 10/30/2009 04:18:35
6 15 24 33 11/30/2009 04:18:35
7 16 25 34 12/30/2009 04:18:35
8 17 26 35 13/30/2009 04:18:35
9 18 27 36 15/30/2009 04:18:35
Desired output:
From DF "t.d" I have to calculate the time interval for each row between "FinalTime" and "InitTime" (InitTime will always be less than FinalTime).
Another DF "temp" from "s.d" has to be formed having data only within the above time interval, and then the most recent values of "F","D","E" have to be taken and attached to the 'ith' row of "t.d" from which the time interval was calculated.
Also we have to see if the newly formed DF "temp" has the following conditions true:
here 'j' represents value for each row:
if ((temp$F[j] < 35.5) + (temp$D[j] >= 100) >= 1)
{
temp$Flag[j] <- 1
} else {
temp$Flag[j] <- 0
}
Originally I have 3 million rows in the dataframe and 20 columns in each DF.
I have solved the above problem using "for loop" but it obviously takes 2 to 3 days as there are a lot of rows.
(Also, what if I have to add new columns to the resulting DF when multiple conditions are satisfied on a row?)
Can anybody suggest a different technique? Like using apply functions?
My suggestion is:
1. Use lapply over the row indices.
2. Handle your if branches inside the function.
3. Return either your data frame or NULL.
4. Combine everything with rbind.
By replacing lapply with mclapply from the 'parallel' package, your code gets executed in parallel (mclapply relies on forking, so it runs serially on Windows).
resultList <- lapply(1:nrow(t.d), function(i){
  # do stuff for row i, building a data frame df
  if(condition){
    return(df)
  }else{
    return(NULL)
  }
})
resultDF <- do.call(rbind, resultList)
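To make the outline concrete, here is a small runnable sketch. It is not the asker's real data: t.d and s.d below are hypothetical miniature versions with valid timestamps (the question's sample contains months such as 13 and 15, which cannot be parsed), and the condition is written with | which is equivalent to the "sum of the two tests >= 1" form in the question.

```r
fmt <- "%m/%d/%Y %H:%M:%S"
to_time <- function(x) as.POSIXct(x, format = fmt, tz = "UTC")

# Hypothetical miniature data frames, for illustration only.
t.d <- data.frame(ID = 1:3)
t.d$InitTime <- to_time(rep("6/30/2009 09:18:35", 3))
t.d$FinalTime <- to_time(c("7/30/2009 08:18:35", "9/30/2009 19:18:35",
                           "6/30/2009 10:00:00"))

s.d <- data.frame(F = c(10, 40, 12), D = c(19, 20, 150), E = c(28, 29, 30))
s.d$Time <- to_time(c("7/15/2009 08:18:35", "8/30/2009 19:18:35",
                      "10/30/2009 04:18:35"))

resultList <- lapply(seq_len(nrow(t.d)), function(i) {
  # Rows of s.d that fall inside the interval of row i of t.d.
  in_window <- s.d$Time >= t.d$InitTime[i] & s.d$Time <= t.d$FinalTime[i]
  temp <- s.d[in_window, , drop = FALSE]
  if (nrow(temp) == 0) {
    return(NULL)  # nothing recorded in this interval
  }
  # Most recent F, D, E values within the interval.
  latest <- temp[which.max(as.numeric(temp$Time)), c("F", "D", "E")]
  latest$Flag <- as.integer(latest$F < 35.5 | latest$D >= 100)
  cbind(t.d[i, ], latest)
})

# rbind drops the NULL entries, so only rows with data remain.
resultDF <- do.call(rbind, resultList)
print(resultDF)
```

With millions of rows the per-row subsetting of s.d is still the expensive part, so this mainly shows the structure; sorting s.d by Time once beforehand is a natural next step.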

how to assign value to users in a data.frame based on user ID records from another data.frame

I have read an Excel file into R, where sheet 1 has 51500 rows and 5 columns and sheet 2 has the user IDs of buyers (only one column). Objective: identify the users in sheet 1 whose user IDs occur in sheet 2.
Here are two example inputs and the desired output:
df <- data.frame(User.ID=c(12: 17), Group="Test", Spend=c(15:20), Purchase=c(5:10))
df
User.ID Group Spend Purchase
1 12 Test 15 5
2 13 Test 16 6
3 14 Test 17 7
4 15 Test 18 8
5 16 Test 19 9
6 17 Test 20 10
hash.ID <- data.frame(User.ID= c(13:16))
User.ID
1 13
2 14
3 15
4 16
desired output :
User.ID Group Spend Purchase Redem_Status
1 12 Test 15 5 Test_NonRedeemer
2 13 Test 16 6 Test_Redeemer
3 14 Test 17 7 Test_Redeemer
4 15 Test 18 8 Test_Redeemer
5 16 Test 19 9 Test_Redeemer
6 17 Test 20 10 Test_NonRedeemer
Based on the above example, if a user ID from df exists in the hash.ID table, we add a new column and label the row Test_Redeemer; otherwise we label it Test_NonRedeemer. Is there any straightforward approach that can do this task? Thanks a lot!
The test case you presented helped, thanks. As mentioned in the comments, you need to subset the rows you're interested in and assign them a value. By placing ! in front of the statement (notice the parentheses!) you negate it and thus select all records not selected in the previous call.
df[df$User.ID %in% hash.ID$User.ID, "Redem_Status"] <- "Test_Redeemer"
df[!(df$User.ID %in% hash.ID$User.ID), "Redem_Status"] <- "Test_NonRedeemer"
df
User.ID Group Spend Purchase Redem_Status
1 12 Test 15 5 Test_NonRedeemer
2 13 Test 16 6 Test_Redeemer
3 14 Test 17 7 Test_Redeemer
4 15 Test 18 8 Test_Redeemer
5 16 Test 19 9 Test_Redeemer
6 17 Test 20 10 Test_NonRedeemer
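The two assignments can also be collapsed into a single vectorised step with ifelse(). A minimal sketch using the example data from the question:

```r
df <- data.frame(User.ID = 12:17, Group = "Test", Spend = 15:20, Purchase = 5:10)
hash.ID <- data.frame(User.ID = 13:16)

# %in% yields TRUE for IDs found in hash.ID; ifelse() picks the label per row.
df$Redem_Status <- ifelse(df$User.ID %in% hash.ID$User.ID,
                          "Test_Redeemer", "Test_NonRedeemer")
print(df)
```

Either form works; ifelse() just avoids repeating the membership test.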

ggplot2 is plotting a line strangely

I am trying to plot the time series x_t = A + (-1)^t B.
To do this I am using the following code. The problem is that the ggplot output is wrong.
library(ggplot2)
set.seed(42)
N<-2
A<-sample(1:20,N)
B<-rnorm(N)
X<-c(A+B,A-B)
dat<-sapply(1:N,function(n) X[rep(c(n,N+n),20)],simplify=FALSE)
dat<-data.frame(t=rep(1:20,N),w=rep(A,each=20),val=do.call(c,dat))
ggplot(data=dat,aes(x=t, y=val, color=factor(w)))+
geom_line()+facet_grid(w~.,scales = "free")
looking at the head of dat everything looks right:
> head(dat)
t w val
1 1 12 10.5533
2 2 12 13.4467
3 3 12 10.5533
4 4 12 13.4467
5 5 12 10.5533
6 6 12 13.4467
So the lower (blue) line should only have values 10.5533 and 13.4467. But it also takes different values. What is wrong in my code?
Thanks in advance for any help
You really should be more careful before asserting that something is "wrong". The way you are creating dat, the rows are not ordered by dat$t, so head(...) does not display the extra values:
head(dat[order(dat$w,dat$t),],10)
# t w val
# 21 1 18 18.43530
# 61 1 18 18.36313
# 22 2 18 19.56470
# 62 2 18 17.63687
# 23 3 18 18.43530
# 63 3 18 18.36313
# 24 4 18 19.56470
# 64 4 18 17.63687
# 25 5 18 18.43530
# 65 5 18 18.36313
Note the row numbers.
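The extra rows come from the lengths involved: rep(c(n,N+n),20) repeats a pair 20 times, so each series has 40 values and val has 80 entries, while t and w have only 40, and data.frame() recycles the shorter columns. A sketch of one possible fix, assuming 20 points per series is what was intended, is to repeat each pair 10 times:

```r
set.seed(42)
N <- 2
A <- sample(1:20, N)
B <- rnorm(N)
X <- c(A + B, A - B)

# Each series alternates between A[n] + B[n] and A[n] - B[n]; repeating the
# pair 10 times gives exactly 20 values per series.
vals <- lapply(1:N, function(n) X[rep(c(n, N + n), 10)])
dat <- data.frame(t = rep(1:20, N), w = rep(A, each = 20), val = unlist(vals))

# Each value of w should now be paired with only two distinct values of val.
distinct_per_w <- tapply(dat$val, dat$w, function(v) length(unique(v)))
print(distinct_per_w)

# Build the plot only if ggplot2 is installed; print(p) draws it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dat, aes(x = t, y = val, color = factor(w))) +
    geom_line() +
    facet_grid(w ~ ., scales = "free")
}
```

With all three columns the same length nothing is recycled, so each facet contains a single alternating series.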

Creating a numerical variable order

I have a set of data with 3 columns: index column (with no name), colour, colour of seed, and germination time.
How do I create a numerical variable called 'order' with values 1 to 22 (the number of data sets)?
I don't know if I understand you correctly, but the simplest way would be:
> order <- c(1:22)
> order
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Now, if you run:
class(order)
you will get:
[1] "integer"
but you can easily access every element of the object order (for example in a loop):
for(i in 1:length(order)){
print(order[i])
}
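If the goal is a column inside the data set itself, a sketch like the following may be closer. Here seeds is a hypothetical stand-in for the 22-row data, and its column names are assumptions. Note that order is also the name of a base R function, so keeping the variable inside the data frame avoids masking it.

```r
# Hypothetical stand-in for the 22-row seed data set.
seeds <- data.frame(
  colour = rep(c("red", "white"), length.out = 22),
  germination_time = seq(5, by = 1, length.out = 22)
)

# seq_len(nrow(...)) gives 1, 2, ..., 22 and adapts if the row count changes.
seeds$order <- seq_len(nrow(seeds))

print(class(seeds$order))       # integer vectors count as numeric in R
print(is.numeric(seeds$order))
print(head(seeds))
```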
