How to sum up CSV column without excel
Moderators: Dorian (MJT support), JRL
How to sum up CSV column without excel
Hi All,
Im stuck with this problem: I have some pretty large CSV files (1mb each), and have to automate suming one precise column (the third one, filled with float numbers).
How to do that? The .CSV has each time different number of rows.
I tried to make a function, which found 2nd and 3rd semicolon, and ripped the text between, but I think the Position> function can't handle the semicolon in Macro Scheduler 11... which I have.
Maybe some SQL trick should be done here? Has anyone encountered such problem? To make things harder, I want to use this script on a server without Microsoft Office installed.
Im stuck with this problem: I have some pretty large CSV files (1mb each), and have to automate suming one precise column (the third one, filled with float numbers).
How to do that? The .CSV has each time different number of rows.
I tried to make a function, which found 2nd and 3rd semicolon, and ripped the text between, but I think the Position> function can't handle the semicolon in Macro Scheduler 11... which I have.
Maybe some SQL trick should be done here? Has anyone encountered such problem? To make things harder, I want to use this script on a server without Microsoft Office installed.
Last edited by rblack on Sun Nov 03, 2013 1:33 pm, edited 1 time in total.
You could use CSVFileToArray> then add up the array results of the third column like this.
I created a fake csv file that is 30,000 rows and 10 columns with numbers in the third column. The following script gives me an answer in about 20 seconds. It might be faster if you use SQL but if the computer doesn't have Office installed you won't be able to query a text file. See Marcus' blog for a way to fix that.
I created a fake csv file that is 30,000 rows and 10 columns with numbers in the third column. The following script gives me an answer in about 20 seconds. It might be faster if you use SQL but if the computer doesn't have Office installed you won't be able to query a text file. See Marcus' blog for a way to fix that.
Code: Select all
Timer>start
CSVFileToArray>%temp_dir%fakeCSV.csv,var
Let>Total=0
Let>kk=0
Repeat>kk
Add>kk,1
Let>Value=var_%kk%_2
LEt>Total=%Total%+%value%
Until>kk=var_count
Timer>stop
Let>TotalTime=%stop%-%start%
MDl>%Total%%crlf%%TotalTime%
I think this will work for column 3 in version 11. The regex was provided by Marcus here
Code: Select all
Timer>start
ReadFile>%temp_dir%fakeCSV.csv,vData
Separate>vData,crlf,vLine
Let>Total=0
Let>LL=0
Repeat>LL
Add>LL,1
Let>text=vLine_%LL%
//RegEx to extract each field to an array
Let>pattern="[^"\r\n]*"|[^,\r\n]*
RegEx>pattern,text,0,field,num,0
//RegEx to remove starting and ending quotes if present
//RegEx>^"|"$,field_3,0,matches,n,1,,field_3
Let>Total=%Total%+%field_3%
Let>field_3=0
Until>LL=vLine_Count
Timer>stop
Let>TotalTime=%stop%-%start%
MDl>%Total%%crlf%%TotalTime%%crlf%%vLine_Count%
@armsys,
Here's a sample using SQL.
Here's a sample using SQL.
Code: Select all
Let>FileLocationPath=%temp_dir%
Let>CSVFileName=FakeCSV.csv
//You need a header row in the CSV file. In this case I just added F1,F2,F3,etc
Let>ColumnNameToSum=F3
Timer>Start
GoSub>GetSumFromCSV
Timer>stop
Let>TotalTime=%stop%-%start%
MDL>Time to Process = %TotalTime%%crlf%Sum = %SummedResult_1_1%
SRT>GetSumFromCSV
Let>ConStr=Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=%FileLocationPath%;Extensions=asc,csv,tab,txt;
DBConnect>ConStr,dbH
Let>sql=select SUM(%ColumnNameToSum%) from %CSVFileName%;
DBQuery>dbH,sql,SummedResult,numrecords,numfields
DBClose>dbH
END>GetSumFromCSV
HI JPL,
Thanks for your fast help.
I tried your SQL sample script as follows:
But at line 17, an error message pops up:
Error in: __debug.dbg
Line 17 - [Microsoft][ODBC TextDriver] Too few
parameters. Expected1
Thanks for your fast help.
I tried your SQL sample script as follows:
Code: Select all
Let>FileLocationPath=C:\Temp\
Let>CSVFileName=ur_lib_import.txt
//You need a header row in the CSV file. In this case I just added F1,F2,F3,etc
Let>ColumnNameToSum=FileSize
Timer>Start
GoSub>GetSumFromCSV
Timer>stop
Let>TotalTime=%stop%-%start%
MDL>Time to Process = %TotalTime%%crlf%Sum = %SummedResult_1_1%
SRT>GetSumFromCSV
Let>ConStr=Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=%FileLocationPath%;Extensions=asc,csv,tab,txt;
DBConnect>ConStr,dbH
Let>sql=select SUM(%ColumnNameToSum%) from %CSVFileName%;
DBQuery>dbH,sql,SummedResult,numrecords,numfields
DBClose>dbH
END>GetSumFromCSV
Error in: __debug.dbg
Line 17 - [Microsoft][ODBC TextDriver] Too few
parameters. Expected1
Sorry for my late reply.JRL wrote:Does your csv file have column headers? I got that same highly uninformative message before I put in headers.
The CSV text:
Code: Select all
URL,FileName,FilePath,FileExt,FileDateModified,FileDateCreated,FileSize,FileHash
"C:\Developing Large Web Applications.pdf","Developing Large Web Applications.pdf","C:","pdf","11/7/2013 12:15:14 PM","11/7/2013 12:14:34 PM","1941715","2bededce623e705a8b1f398f318a5c530e3432537a8254f8187b2cbd981ff735"
Armsys,
Took your data and created "armsys.csv" in my temp directory. The following script gives me the correct answer.
Took your data and created "armsys.csv" in my temp directory. The following script gives me the correct answer.
Code: Select all
Let>FileLocationPath=%temp_dir%
Let>CSVFileName=armsys.csv
//You need a header row in the CSV file. In this case I just added F1,F2,F3,etc
Let>ColumnNameToSum=FileSize
Timer>Start
GoSub>GetSumFromCSV
Timer>stop
Let>TotalTime=%stop%-%start%
MDL>Time to Process = %TotalTime%%crlf%Sum = %SummedResult_1_1%
SRT>GetSumFromCSV
Let>ConStr=Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=%FileLocationPath%;Extensions=asc,csv,tab,txt;
DBConnect>ConStr,dbH
Let>sql=select SUM(%ColumnNameToSum%) from %CSVFileName%;
DBQuery>dbH,sql,SummedResult,numrecords,numfields
DBClose>dbH
END>GetSumFromCSV
The same "Microsoft ODBC text driver too few parameters expected 1" error message pops up. I suspect it's likely caused by my ODBC driver installation. Once I find out the answer, I'll let you know.JRL wrote:Took your data and created "armsys.csv" in my temp directory. The following script gives me the correct answer.
JPL, thanks for taking time to validate the script.
JRL,
After 2 days of repeated experiments, I succeed.
My test CSV can be simplified as follows:
My new source:
For 10,000 records, it takes 76ms. Not too bad.
JRL, thanks for your help.
After 2 days of repeated experiments, I succeed.
My test CSV can be simplified as follows:
Code: Select all
A1,A2
1,2
3,4
Code: Select all
Timer>Start
GoSub>GetSumFromCSV
Timer>stop
Let>TotalTime=%stop%-%start%
MDL>Time to Process = %TotalTime%%crlf% Records = %num_recs% %num_fields% Sum = %Sum_1_1%
SRT>GetSumFromCSV
Let>ConStr=Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=C:\ArmSys\ODBC\;
DBConnect>ConStr,dbH
Let>sql=select SUM(%ColumnNameToSum%)from armsys.txt
DBQuery>dbH,select SUM(A1) from armsys.txt,Sum,num_recs,num_fields
DBClose>dbH
END>GetSumFromCSV
JRL, thanks for your help.
Re: How to sum up CSV column without excel
Hi there, I've got a similar problem. I do not have Office installed. I would like to sum up the seventh column in my csv file. This column doesn't have any caption. The solution that you provided breaks when there is a semicolon inside the double quoted in text any column before the seventh column. Here is an example.
Can someone help me to construct a RegEx that will work also when there is no semicolon inside the quoted text and will work also when the text is separeted by a comma not semicolon?
Code: Select all
2012-01-01;2012-01-01;"column3";"column4";"column5";"column6";"6,60";"USD";"140,60";USD
2012-01-01;2012-01-01;"column3";"column4";"column5";"column6";"50,00";"USD";"134,00";USD
2012-01-01;2012-01-01;"column3";"column4";"column5";"[b]column6 with ;semicolon inside[/b]";"43,13";"USD";"84,00";USD
2012-01-01;2012-01-01;"column3";"column4";"column5";"column6";"40,87";"USD";"40,87";USD